Atomicity & Write-Ahead Logging
Two-phase execution model and reliable async operations
The Challenge
IC canisters face unique constraints:
- Execution limits: A few billion instructions per message
- Async calls: Inter-canister calls can fail
- Upgrades: Subnets and canisters can be upgraded at any time
- Crashes: Unexpected failures can occur
Financial operations must remain consistent despite these challenges.
Two-Phase Execution Model
Every critical operation follows this pattern:
Phase 1: Synchronous (Atomic)
- All state changes are atomic (all-or-nothing)
- User receives immediate feedback
- No pending async operations if validation fails
- State persists before async work begins
Duration: Typically under 200 ms
Phase 2: Asynchronous (WAL-backed)
- Operations persist across canister upgrades
- Automatic retries with exponential backoff
- Idempotent execution (safe to retry)
- Error tracking for manual intervention
Duration: Variable (seconds to minutes)
Write-Ahead Log (WAL)
The WAL is a persistent queue of pending async operations stored in stable memory.
WAL Entry Structure
Each entry tracks:
- Kind: Operation type ("outflow", "liquidation", etc.)
- Status: Current execution state
- Retry info: Attempts, max retries, backoff, next attempt time
- Audit trail: First seen, last update, last error
- Payload: Operation-specific data
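The fields listed above could be laid out roughly as follows. This is a minimal sketch, not the actual schema: every type and field name here (`WalEntry`, `next_attempt_at`, the numeric `id`, timestamps in seconds) is an assumption made for illustration, and the defaults of 5 retries and a 2-second initial backoff are taken from the retry policy described in this document.

```rust
/// Execution state of a WAL entry (mirrors the status table in this document).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WalStatus {
    Enqueued,
    InFlight,
    Succeeded,
    FailedRetryable,
    FailedPermanent,
}

/// One pending async operation. All field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
pub struct WalEntry {
    pub id: u64,      // unique operation ID (assumed numeric)
    pub kind: String, // "outflow", "liquidation", ...
    pub status: WalStatus,
    // Retry info
    pub attempts: u32,
    pub max_retries: u32,
    pub backoff_secs: u64,
    pub next_attempt_at: u64, // seconds since epoch (assumed unit)
    // Audit trail
    pub first_seen: u64,
    pub last_update: u64,
    pub last_error: Option<String>,
    // Operation-specific data, opaque to the WAL itself
    pub payload: Vec<u8>,
}

impl WalEntry {
    /// Creates a fresh entry that is eligible for its first attempt immediately.
    pub fn new(id: u64, kind: &str, payload: Vec<u8>, now: u64) -> Self {
        WalEntry {
            id,
            kind: kind.to_string(),
            status: WalStatus::Enqueued,
            attempts: 0,
            max_retries: 5,  // assumed from the five attempts in the retry policy
            backoff_secs: 2, // assumed initial delay
            next_attempt_at: now,
            first_seen: now,
            last_update: now,
            last_error: None,
            payload,
        }
    }
}
```

Keeping the payload as opaque bytes is one possible design choice; it lets the queue stay generic while each handler decodes its own data.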
WAL Status States
| Status | Meaning |
|---|---|
| Enqueued | Queued for first execution |
| InFlight | Currently executing |
| Succeeded | Completed successfully |
| FailedRetryable | Failed, will retry |
| FailedPermanent | Failed, needs intervention |
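One way to keep these states consistent is to check every status change against an explicit transition table. The sketch below is inferred from the meanings in the table and from the failure scenarios in this document; the function name `can_transition` and the exact set of allowed transitions (in particular, a manual reset going from FailedPermanent back to Enqueued) are assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WalStatus {
    Enqueued,
    InFlight,
    Succeeded,
    FailedRetryable,
    FailedPermanent,
}

/// Returns true if moving from `from` to `to` is allowed in this sketch.
pub fn can_transition(from: WalStatus, to: WalStatus) -> bool {
    use WalStatus::*;
    matches!(
        (from, to),
        (Enqueued, InFlight)                  // first execution starts
            | (FailedRetryable, InFlight)     // a retry starts
            | (InFlight, Succeeded)           // attempt completed
            | (InFlight, FailedRetryable)     // attempt failed, retries remain
            | (InFlight, FailedPermanent)     // attempt failed, needs intervention
            | (FailedPermanent, Enqueued)     // assumed: manual reset by an admin
    )
}
```

Under these assumptions Succeeded is terminal, so a completed operation can never be picked up again.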
Retry Policy
Operations use exponential backoff:
- Attempt 1: 2 seconds
- Attempt 2: 4 seconds
- Attempt 3: 8 seconds
- Attempt 4: 16 seconds
- Attempt 5: 32 seconds
Total backoff delay before permanent failure: ~62 seconds (2 + 4 + 8 + 16 + 32), not counting the time spent executing each attempt
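The schedule above doubles the delay on each attempt, starting at 2 seconds. A minimal sketch of that calculation follows; the function names and the 1-based attempt numbering are assumptions for illustration.

```rust
/// Delay in seconds before attempt number `attempt` (1-based): 2, 4, 8, ...
/// Returns None when `attempt` is 0 or exceeds `max_attempts`.
pub fn backoff_delay_secs(attempt: u32, max_attempts: u32) -> Option<u64> {
    if attempt == 0 || attempt > max_attempts {
        return None;
    }
    // 2^attempt seconds; checked_shl returns None if the shift would overflow.
    1u64.checked_shl(attempt)
}

/// Sum of all backoff delays across `max_attempts` attempts.
pub fn total_backoff_secs(max_attempts: u32) -> u64 {
    (1..=max_attempts)
        .filter_map(|a| backoff_delay_secs(a, max_attempts))
        .sum()
}
```

With `max_attempts = 5`, `total_backoff_secs` returns 62, matching the sum of the delays listed above.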
Timer-Based Execution
A background timer processes the WAL every 30 seconds. Each run acquires pending operations in batches of up to 256 and executes those whose next attempt time has passed.
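The selection step of a timer run might look like the sketch below. It assumes that "pending" means Enqueued or FailedRetryable and that the 256 limit applies to entries that are actually due; the source does not spell out either point, and the names `acquire_due` and `BATCH_LIMIT` are hypothetical. Timer registration itself is omitted.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WalStatus {
    Enqueued,
    InFlight,
    Succeeded,
    FailedRetryable,
    FailedPermanent,
}

#[derive(Debug, Clone)]
pub struct WalEntry {
    pub id: u64,
    pub status: WalStatus,
    pub next_attempt_at: u64, // seconds since epoch (assumed unit)
}

/// Maximum number of entries acquired per timer run.
pub const BATCH_LIMIT: usize = 256;

/// Marks up to BATCH_LIMIT due entries as InFlight and returns their IDs.
pub fn acquire_due(entries: &mut [WalEntry], now: u64) -> Vec<u64> {
    let mut acquired = Vec::new();
    for e in entries.iter_mut() {
        if acquired.len() == BATCH_LIMIT {
            break;
        }
        let pending = matches!(
            e.status,
            WalStatus::Enqueued | WalStatus::FailedRetryable
        );
        if pending && e.next_attempt_at <= now {
            e.status = WalStatus::InFlight; // acquire before executing
            acquired.push(e.id);
        }
    }
    acquired
}
```

Entries beyond the batch limit stay untouched and are picked up on a later run.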
Idempotency Guarantees
Multiple layers prevent double-execution:
| Layer | Mechanism |
|---|---|
| Operation ID | Unique ID per operation, duplicates rejected |
| Status Check | Already-succeeded operations skipped |
| Per-Entry Locking | Exclusive lock prevents concurrent execution |
| Handler Deduplication | Pool tracks processed withdrawal IDs |
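The four layers in the table can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the real implementation: `Wal`, `Pool`, `Skip`, `try_begin`, and `finish` are hypothetical names, the per-entry lock is modeled simply as the InFlight status, and real storage would live in stable memory rather than in a `HashMap`.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WalStatus {
    Enqueued,
    InFlight,
    Succeeded,
}

/// Reasons an execution attempt is skipped.
#[derive(Debug, PartialEq, Eq)]
pub enum Skip {
    UnknownId,
    AlreadySucceeded,
    Locked,
}

#[derive(Default)]
pub struct Wal {
    entries: HashMap<u64, WalStatus>,
}

impl Wal {
    /// Layer 1: a second enqueue with the same operation ID is rejected.
    pub fn enqueue(&mut self, id: u64) -> bool {
        if self.entries.contains_key(&id) {
            return false;
        }
        self.entries.insert(id, WalStatus::Enqueued);
        true
    }

    /// Layers 2 and 3: skip finished entries and refuse a second lock.
    pub fn try_begin(&mut self, id: u64) -> Result<(), Skip> {
        match self.entries.get(&id).copied() {
            None => Err(Skip::UnknownId),
            Some(WalStatus::Succeeded) => Err(Skip::AlreadySucceeded),
            Some(WalStatus::InFlight) => Err(Skip::Locked),
            Some(WalStatus::Enqueued) => {
                self.entries.insert(id, WalStatus::InFlight); // take the lock
                Ok(())
            }
        }
    }

    /// Marks the entry as done, which also releases the lock.
    pub fn finish(&mut self, id: u64) {
        self.entries.insert(id, WalStatus::Succeeded);
    }
}

/// Layer 4: the handler side remembers which withdrawal IDs it has processed.
#[derive(Default)]
pub struct Pool {
    processed: HashSet<u64>,
    pub payouts: u32,
}

impl Pool {
    /// Pays out at most once per withdrawal ID; repeats are no-ops.
    pub fn withdraw(&mut self, withdrawal_id: u64) {
        if self.processed.insert(withdrawal_id) {
            self.payouts += 1;
        }
    }
}
```

The handler-side check matters most when a call succeeds remotely but its reply is lost: the WAL retries, and the Pool recognizes the ID and does nothing.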
Failure Scenarios
Scenario 1: Transient Network Failure
User withdraws 1 BTC → Lending Canister burns shares (committed) → Pool call times out
Recovery: WAL marks FailedRetryable → Waits 2 seconds → Retries → Eventually succeeds
Scenario 2: Subnet or Canister Upgrade
100 withdrawals pending in WAL → Operator upgrades the subnet or canister → Heap cleared, timers stopped
Recovery: post_upgrade() reinitializes timers → WAL entries preserved (stable storage) → Operations resume
Scenario 3: Health Factor Violation
User tries to withdraw → Lending Canister burns shares → Health factor check fails
Recovery: Lending Canister re-mints shares (rollback) → No WAL entry created → User receives immediate error
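The burn, check, and rollback sequence in this scenario can be sketched as a single synchronous function. Everything here is hypothetical: the `Account` and `State` types, the `withdraw` signature, and especially the health rule (collateral must cover at least 1.5x debt), which is a stand-in for whatever the real health factor calculation is.

```rust
#[derive(Debug, PartialEq)]
pub enum WithdrawError {
    InsufficientShares,
    HealthFactorTooLow,
}

#[derive(Debug, Default)]
pub struct Account {
    pub supply_shares: u64,
    pub debt: u64,
}

#[derive(Debug, Default)]
pub struct State {
    pub account: Account,
    pub wal: Vec<u64>, // queued withdrawal amounts, standing in for WAL entries
}

/// Placeholder rule: supply must be at least 1.5x debt (integer arithmetic).
fn is_healthy(a: &Account) -> bool {
    a.supply_shares * 2 >= a.debt * 3
}

/// Phase 1 of a withdrawal: all-or-nothing state change, then enqueue.
pub fn withdraw(state: &mut State, shares: u64) -> Result<(), WithdrawError> {
    if state.account.supply_shares < shares {
        return Err(WithdrawError::InsufficientShares);
    }
    state.account.supply_shares -= shares; // burn shares
    if !is_healthy(&state.account) {
        state.account.supply_shares += shares; // re-mint shares (rollback)
        return Err(WithdrawError::HealthFactorTooLow);
    }
    // Only a validated withdrawal produces async work.
    state.wal.push(shares);
    Ok(())
}
```

For an account with 100 supply shares and 40 debt, withdrawing 50 fails the check and leaves the state unchanged with an empty queue, while withdrawing 30 succeeds and enqueues one operation.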
Scenario 4: Permanent Failure
Withdrawal with invalid address → Pool rejects permanently → WAL marks FailedPermanent
Recovery: Admin investigates → Fixes root cause → Manually resets operation → Retry succeeds
Atomic Operations Table
| Operation | Atomic State Changes | Async Operations |
|---|---|---|
| Deposit | Mint supply shares | Pool-initiated |
| Withdraw | Burn supply shares, health check | Pool withdrawal, ckAsset burn |
| Borrow | Mint debt shares, health check | Pool withdrawal, ckAsset burn |
| Repay | Burn debt shares | Pool-initiated |
| Liquidate | Burn debt, burn collateral, mint treasury | Collateral transfer, change refund |
Persistence Properties
The WAL uses stable storage that survives:
- Canister upgrades
- Canister crashes
- Node restarts
The WAL ensures that once a user's state is updated, the corresponding async operation will eventually complete or be flagged for manual intervention, even across canister upgrades and network failures.