Postgres Overview

Storage · Transactions · MVCC · Locks · Indexes · WAL

Mental model
PostgreSQL is a collection of cooperating mechanisms. Heap pages store row versions, transactions group changes, MVCC decides which versions readers can see, locks coordinate conflicting work, indexes provide faster paths to rows, and WAL makes committed changes recoverable.

Follow One Ledger Transfer

Consider a transfer of SGD 100 from account A to account B. The application writes the transfer, the debit posting, the credit posting, and an outbox event in one transaction. PostgreSQL must make the whole operation atomic while other requests read balances and attempt transfers concurrently.

This flow connects the major internals. The heap stores the records. Indexes find an account's postings and reject a duplicate idempotency key. MVCC gives readers a consistent view. Locks serialize conflicting changes to the same account. WAL protects the commit even when the modified heap pages have not reached disk yet.


Storage: Heap Pages and Tuples

A table is primarily a heap: an unordered collection of 8 KB pages. Each page contains tuples, which are physical row versions. An index is a separate structure that maps a key to a tuple location in the heap.

graph LR Q["account_id = 42"] --> B["B-tree index"] B -->|"page + offset"| P["8 KB heap page"] P --> R["posting tuple"]

The separation matters. Adding an index does not reorder the table. It creates another structure that must be updated and stored. Reads may become much faster, but writes become more expensive.

Ledger example
A query for an account statement should not scan every posting. An index on (account_id, created_at) gives PostgreSQL an ordered path to one account's history.

For physical storage, page layout, table forks, and TOAST, read The Internals of PostgreSQL, Chapter 1.


MVCC: Row Versions Instead of Read Locks

MVCC means Multi-Version Concurrency Control. PostgreSQL usually lets readers and writers proceed concurrently by keeping multiple versions of a row. A snapshot decides which committed versions a statement can see.

An UPDATE does not simply overwrite the old tuple. PostgreSQL creates a new tuple version and marks the old one as superseded. A reader with an older snapshot may still see the old version while a newer reader sees the new one.

sequenceDiagram participant R as Balance reader participant W as Transfer writer participant H as Heap R->>H: Read using snapshot S1 W->>H: Create new account version Note over H: Old and new versions coexist R->>H: S1 still sees old version W->>H: Commit Note over H: A later snapshot sees new version

The trade-off is cleanup. Old versions eventually become dead tuples. Vacuum marks their space reusable once no active snapshot can still need them. Long transactions can retain old snapshots and prevent that cleanup.

Operational failure chain
Long transaction → old snapshot retained → vacuum cannot remove dead tuples → table and indexes grow → more pages are read → latency and maintenance cost increase.

For tuple visibility, transaction IDs, HOT updates, freezing, and wraparound, see the PostgreSQL guides on concurrency control and routine vacuuming.


Locks: Coordination Where Versions Are Not Enough

MVCC reduces read and write blocking, but it does not remove conflicts. Two transactions attempting to update the same row cannot both proceed independently. One must wait, fail, or be retried depending on the operation and isolation level.

Ledger example
Two transfers concurrently debit the same account. Both cannot validate the same starting balance and then subtract money independently. The design needs row locking, a guarded atomic update, or serializable transaction logic to preserve the no-overdraft invariant.

Locks also apply to tables and schema changes. Many ALTER TABLE operations need strong locks. A dangerous production pattern is a migration waiting behind a long query while newer queries queue behind the waiting migration.

sequenceDiagram participant Q1 as Long SELECT participant DDL as ALTER TABLE participant Q2 as New SELECT Q1->>DB: Holds a read lock DDL->>DB: Waits for strong table lock Q2->>DB: Queues behind waiting DDL Note over Q1,Q2: A short migration can create a long outage

Practical rule: treat migrations as concurrency events, not file changes. Understand the required lock and set a short lock_timeout so a migration fails instead of silently forming a queue.

Further reading: PostgreSQL explicit locking.


Indexes and the Query Planner

An index is a faster access path, not a guarantee that PostgreSQL will use it. The planner estimates the cost of available strategies and chooses between a sequential scan, an index scan, joins, sorts, and other operations.

flowchart TD SQL["Account statement query"] --> P["Planner estimates cost"] P --> S["Sequential scan"] P --> I["Index scan"] P --> B["Bitmap scan"] I --> C["Chosen when estimated cheapest"]

For the ledger, two indexes have different correctness and performance jobs:

IndexPurpose
(account_id, created_at)Read one account's posting history efficiently and in useful order.
UNIQUE (idempotency_key)Turn duplicate transfer submissions into a database-enforced invariant.

Indexes cost storage, cache space, and extra work on every insert or update. The right question is not “can this column be indexed?” but “which important query or invariant does this index support, and is that benefit worth its write cost?”

Use EXPLAIN (ANALYZE, BUFFERS) when the observed plan matters. PostgreSQL's index documentation covers index types, multicolumn indexes, partial indexes, and index-only scans.


WAL: Commit the Log Before the Pages

WAL means Write-Ahead Log. PostgreSQL can modify a heap page in memory before writing that page to its table file. The durability rule is that the WAL describing a change must reach durable storage before the corresponding dirty data page does.

sequenceDiagram participant App participant Memory as Shared buffers participant WAL participant Disk as Table files App->>Memory: Modify transfer and posting pages App->>WAL: Append change and commit records WAL-->>App: Flush completes, commit acknowledged Memory-->>Disk: Dirty pages written later

After a crash, PostgreSQL replays WAL to restore changes that were committed but not yet present in the table files. WAL is also the source stream for physical replication and logical decoding.

Each WAL position has an LSN, or Log Sequence Number. CDC systems should use source log positions to preserve order and resume safely. Wall-clock timestamps are not a reliable substitute for log order.

For WAL record structure, checkpoints, recovery, and replication, read the official WAL introduction.


Practical Rules


Continue Learning

Primary sources

Videos