PostgreSQL Internals & MVCC
The most important internal topic in the entire syllabus. Understanding how PostgreSQL physically stores data and manages concurrent versions explains why reads never block writes, where bloat comes from, and why VACUUM exists.
Architecture
Postmaster, Backend Processes, and Shared Memory
The Postmaster — The Master Process
When you start PostgreSQL, the first process that runs is the Postmaster. It is the parent of all other PostgreSQL processes. Its job is to listen for incoming client connections and spawn a new backend process for each one.
┌─────────────────────────────────────────────────────────────┐
│ POSTMASTER (PID 1) │
│ Listens on port 5432, spawns processes │
└──────┬───────────────────────────────────────┬─────────────┘
│ forks on each connection │ background workers
▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐
│ Backend │ │ Backend │ │ Background Workers │
│ Process │ │ Process │ │ ─────────────────── │
│ (Client A) │ │ (Client B) │ │ • Autovacuum │
│ │ │ │ │ • WAL Writer │
│ Runs SQL │ │ Runs SQL │ │ • Checkpointer │
│ for 1 conn │ │ for 1 conn │ │ • BGWriter │
└──────┬───────┘ └──────┬───────┘ │ • Stats Collector │
│ │ └──────────────────────┘
└────────┬────────┘
▼
┌─────────────────────────────────────────────────────────────┐
│ SHARED MEMORY │
│ ───────────────────────────────────────────────────────── │
│ Shared Buffers (cache) │ WAL Buffers │ Lock Tables │
│ CLOG (commit status) │ Proc Array │ Free Space Map │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ DISK STORAGE │
│ Data Files (base/) │ WAL Files (pg_wal/) │ Indexes │
└─────────────────────────────────────────────────────────────┘
Postmaster
The Postmaster is the supervisor. It does not run queries itself. When a client connects, Postmaster forks a new backend process dedicated to that connection. If the backend crashes, only that one connection is lost — not the whole database.
Backend Process
Each client connection gets its own backend process. This process is responsible for parsing SQL, planning the query, executing it, and sending results back. It has its own private memory (work_mem, sort buffers) and also accesses the shared memory pool.
Shared Memory — The Heart of PostgreSQL
Shared memory is the area all backend processes can read and write simultaneously. The most important component is shared_buffers — the in-memory page cache. When a backend needs a data page, it first checks shared_buffers. If found (cache hit), no disk I/O. If not found (cache miss), it reads from disk into shared_buffers.
| Component | Purpose | Default / Notes |
|---|---|---|
| shared_buffers | Page cache — holds hot data pages in memory | 128MB default; set to 25% of RAM in production |
| WAL Buffers | Holds WAL records before flush to disk | Auto-tuned; ~1/32 of shared_buffers |
| CLOG | Commit Log — tracks which transactions are committed/aborted | Tiny; critical for MVCC visibility |
| Lock Table | All currently held locks | Shared; checked by every backend |
| Proc Array | Active transactions and their XIDs | Used to compute snapshots for MVCC |
Storage — Pages, Blocks & Heap
How data is physically laid out on disk
Pages and Blocks — The Unit of Storage
PostgreSQL reads and writes data in fixed-size units called pages (also called blocks). The default page size is 8KB. Even if you update a single byte in a row, PostgreSQL reads and writes the entire 8KB page. This is the fundamental unit of I/O.
$PGDATA/base/<dboid>/<relfilenode>. Files are split at 1GB.What's Inside a Page?
Every 8KB page has the same structure. Understanding this explains why large rows are expensive and why VACUUM matters.
┌──────────────────────────────────────────────────────┐ ← Page Start (byte 0) │ PAGE HEADER (24 bytes) │ │ pd_lsn: last WAL record that modified this page │ │ pd_lower: end of item pointer array │ │ pd_upper: start of tuple data │ │ pd_special: start of special space (for indexes) │ ├──────────────────────────────────────────────────────┤ │ ITEM POINTER ARRAY (grows downward →) │ │ [ptr1][ptr2][ptr3]... each 4 bytes │ │ Points to each tuple's offset within this page │ ├──────────────────────────────────────────────────────┤ │ │ │ FREE SPACE │ │ (between item array and tuple data) │ │ │ ├──────────────────────────────────────────────────────┤ │ TUPLE DATA (grows upward ←) │ │ Each tuple (row) has: │ │ [HeapTupleHeader][actual column data] │ │ HeapTupleHeader contains: xmin, xmax, ctid, infomask│ ├──────────────────────────────────────────────────────┤ │ SPECIAL SPACE (for index pages only) │ └──────────────────────────────────────────────────────┘ ← Page End (byte 8191)
ctid which is its physical location: (page_number, tuple_number_within_page). For example (0,1) means page 0, first tuple. When a row is updated, its ctid changes because the new version is written to a different location.Heap Storage — What Tables Actually Are
In PostgreSQL, a table is stored as a heap — rows are written in no particular order, just wherever there's space. This is different from clustered tables (like InnoDB) where rows are stored in primary key order. The heap structure is why PostgreSQL has CLUSTER as an explicit operation.
Why Row Width Matters
Because a page is 8KB and rows cannot span pages (for heap tables), a single row must fit within one page. Wide rows (many large columns) mean fewer rows per page, which means more pages to scan for full table scans and more I/O overall.
-- 100 bytes per row -- 8192 / 100 ≈ 80 rows per page -- 1M rows → ~12,500 pages -- Sequential scan: fast
-- 4000 bytes per row -- 8192 / 4000 ≈ 2 rows per page -- 1M rows → ~500,000 pages -- Sequential scan: 40× more I/O!
MVCC — Multi-Version Concurrency Control
⭐ The most important PostgreSQL internal topic
The Core Idea — Writers Don't Block Readers
In traditional databases, if someone is writing to a row, readers must wait. PostgreSQL avoids this entirely using MVCC: instead of modifying a row in place, it keeps multiple versions of every row. Each transaction sees a consistent snapshot of the database as it existed when the transaction started.
Transaction IDs (XIDs) — The Clock of MVCC
Every transaction gets a unique Transaction ID (XID), a monotonically increasing 32-bit integer. XIDs are the "timestamps" MVCC uses to determine which row versions are visible to which transactions.
Row Versions — xmin and xmax
Every row (tuple) in PostgreSQL's heap carries two hidden metadata fields in its header:
When you UPDATE a row, PostgreSQL does NOT modify the existing tuple. Instead it:
- Sets
xmaxon the old tuple to the current XID (marks it as "dead to new readers") - Inserts a brand new tuple with the updated data, setting
xminto the current XID
The old tuple sticks around until VACUUM cleans it up. This is where table bloat comes from.
xmax=0
id=1 salary=50000
(LIVE)
xmax=200
id=1 salary=50000
(DEAD to txns ≥200)
xmax=0
id=1 salary=60000
(NEW LIVE version)
Visibility Rules — Which Version Do You See?
When your transaction reads a row, PostgreSQL applies these rules to decide which version to show you:
| Condition | Is the tuple visible? |
|---|---|
| xmin is committed AND xmax is 0 | ✅ YES — row is live and never deleted |
| xmin is committed AND xmax is committed AND xmax > my snapshot | ✅ YES — was deleted but by a txn after my snapshot |
| xmin is committed AND xmax is committed AND xmax ≤ my snapshot | ❌ NO — row was deleted before my snapshot started |
| xmin is NOT committed (still in progress or aborted) | ❌ NO — inserting txn didn't commit yet |
| xmin = my own XID (I inserted this row in this txn) | ✅ YES — I can see my own uncommitted changes |
Snapshot Isolation — What You See When You BEGIN
When your transaction starts, PostgreSQL takes a snapshot of the database state. A snapshot captures:
A transaction with XID=N is visible to me if: N < xmin (definitely committed before I started) OR (xmin ≤ N < xmax AND N not in xip_list AND N is committed).
Interactive: MVCC Visibility Explorer
Step through a real MVCC scenario: two concurrent transactions, one updating, the other reading. See exactly what each transaction sees at each moment.
🔬 MVCC Demo — Concurrent Read & Write
Scenario: salary row starts at 50,000. T1 updates it to 60,000. What does T2 see?
Dead Tuples, Bloat & Why VACUUM Exists
The inevitable consequence of MVCC — and the solution
Dead Tuples — The Price of MVCC
Because MVCC keeps old row versions instead of overwriting them, every UPDATE and DELETE leaves behind dead tuples — old row versions that no active transaction can see anymore. These dead tuples accumulate on disk and cause:
Visualizing Bloat After Heavy Updates
VACUUM FULL rewrites the entire table to a new file, actually returning space to the OS — but it requires an ACCESS EXCLUSIVE lock (blocks everything) and is slow. Use it sparingly.Checking Bloat in Production
Autovacuum — The Automatic Solution
PostgreSQL's autovacuum daemon runs in the background and automatically vacuums tables when the number of dead tuples exceeds a threshold. It's controlled by these key settings:
Snapshot Isolation in Practice
What each isolation level actually sees under MVCC
How Isolation Levels Map to Snapshots
Different isolation levels (from Topic 6) are implemented in PostgreSQL by controlling when a new snapshot is taken:
| Isolation Level | Snapshot Taken | What You See | PostgreSQL Behavior |
|---|---|---|---|
| READ COMMITTED | At each statement | Committed data as of each query | Default in PostgreSQL. Fresh snapshot per SQL statement. |
| REPEATABLE READ | At first statement in txn | Consistent view of the whole transaction | Same snapshot for all queries in the txn. No phantom reads. |
| SERIALIZABLE | At first statement + tracking | As if transactions ran one at a time | SSI (Serializable Snapshot Isolation) — detects conflicts, may abort. |
READ COMMITTED — The Default
In READ COMMITTED, each SQL statement takes a fresh snapshot. This means if a concurrent transaction commits between two of your queries, your second query sees the new data. This is the most common isolation level.
REPEATABLE READ — Stable Snapshot
In REPEATABLE READ, the snapshot is taken once at the start of the transaction and reused for all queries. You see a stable, consistent view of the database — even if other transactions commit in the meantime.
Long-Running Transactions — A Silent Killer
A long-running transaction holds its snapshot open. PostgreSQL cannot clean up (VACUUM) any dead tuples that might be visible to that snapshot. This means one idle long transaction can cause the entire table to bloat without any vacuum being able to help.
idle_in_transaction_session_timeout.Topic 8 Summary
The mental model that connects everything in PostgreSQL
The One Mental Model to Rule Them All
This one principle explains:
• Why reads don't block writes → (each txn sees its own snapshot)
• Why table bloat happens → (old versions accumulate)
• Why VACUUM exists → (to clean up dead versions)
• Why long transactions are dangerous → (they hold snapshots open, blocking VACUUM)
• Why XID wraparound is a risk → (32-bit counter of versions must be frozen periodically)
Key Concepts Summary
| Concept | What it is | Why it matters |
|---|---|---|
| Postmaster | Parent process; spawns one backend per connection | Fork model means crashes are isolated per connection |
| shared_buffers | In-memory page cache; shared by all backends | Cache hits = no disk I/O; set to 25% RAM in prod |
| Page (8KB) | Unit of storage; all reads/writes use whole pages | Row width affects rows-per-page → scan efficiency |
| Heap | Unordered row storage (unlike clustered tables) | Inserts go anywhere with free space; CLUSTER reorders |
| MVCC | Multiple row versions; readers see consistent snapshots | No read-write blocking; foundation of concurrency |
| xmin / xmax | Hidden fields on every row — born/died in which XID | How visibility is determined without locks |
| XID | Transaction ID — 32-bit counter per transaction | The "clock" of MVCC; must be periodically frozen |
| Snapshot | Point-in-time view of committed transactions | Taken at BEGIN or per-statement depending on isolation |
| Dead Tuples | Old row versions no longer visible to any transaction | Cause bloat; must be cleaned by VACUUM |
| Autovacuum | Background daemon that runs VACUUM automatically | Essential; never disable; tune scale_factor per table |
| CLOG | Commit log; records commit/abort status per XID | Checked during visibility evaluation; lives in shared memory |
| TOAST | Overflow storage for large column values (>~2KB) | Keeps main heap rows compact; transparent to users |