All insights
Databases· Jun 25, 2026· 13 min read

MVCC, dead tuples, and the vacuum you're scared of: how Postgres really stores your rows

An UPDATE in Postgres does not update a row. It writes a new one and leaves a corpse. Understand that single fact and bloat, the visibility map, HOT updates, long-transaction pain, and transaction-ID wraparound all stop being mysteries.

AD
Aman Dhyani
Co-founder, SERP Axis

MVCC: a row is not a row, it's a stack of versions

Postgres gives every transaction a consistent snapshot without readers blocking writers or vice versa. It does this with Multi-Version Concurrency Control: instead of overwriting data in place, it keeps multiple physical versions of each row (each version is a 'tuple') and shows each transaction the version that was valid as of its snapshot.

Every heap tuple carries two hidden system columns that decide its visibility: `xmin`, the id of the transaction that inserted it, and `xmax`, the id of the transaction that deleted or superseded it (0 if still live). A tuple is visible to your snapshot if its `xmin` committed before your snapshot and its `xmax` is null or belongs to a transaction not visible to you.

The hidden columns are real — you can select them.
sql
SELECT xmin, xmax, ctid, * FROM orders WHERE id = 42;
-- xmin: inserting txid   xmax: deleting/superseding txid (0 = live)
-- ctid: physical location (page, tuple index) — changes when the row moves

UPDATE is insert + tombstone, not overwrite

This is the fact that explains everything else: Postgres has no in-place update for a live row. An UPDATE writes a brand-new tuple with the new values and stamps the old tuple's `xmax` with the updating transaction. A DELETE just stamps `xmax`. The old version stays physically on the page — it has to, because concurrent transactions with older snapshots may still need to see it.

So a row updated ten times has, for a moment, up to ten physical versions on disk. Once no active snapshot can possibly need an old version, that version is 'dead' — invisible to everyone, but still occupying space and still pointed at by indexes until something cleans it up.

Why write-heavy tables balloon

A table where every row is updated frequently generates dead tuples at the rate of your UPDATEs. If cleanup can't keep pace, the heap and its indexes grow far beyond the live data — 'bloat' — and every scan reads more pages to find the same rows.

Dead tuples, bloat, and what VACUUM actually does

VACUUM is the garbage collector for dead tuples. It scans for tuples no snapshot can see, marks their space as reusable, and cleans up the corresponding index entries. Crucially, a plain VACUUM does not return space to the operating system — it makes the space available for reuse within the same table. The file stays the same size; new rows fill the reclaimed slots.

`VACUUM FULL` does return space to the OS, but it does so by rewriting the entire table into a new file under an ACCESS EXCLUSIVE lock — no reads or writes for the duration. On a live table that's usually unacceptable; reach for `pg_repack` (online, no long lock) instead. The right answer for 99% of tables is: don't let bloat happen — let autovacuum keep up.

autovacuum wakes on thresholds (by default, roughly when dead tuples exceed 20% of the table plus a small constant). On large, hot tables the defaults are too lazy: tune `autovacuum_vacuum_scale_factor` down and `autovacuum_vacuum_cost_limit` up so it runs sooner and faster.

OperationReclaims to OS?LockUse when
autovacuum / VACUUMNo (reuses in-table)Non-blocking (SHARE UPDATE EXCL)Always — this is the steady state
VACUUM FULLYes (rewrites table)ACCESS EXCLUSIVE (blocks all)Almost never; only offline maintenance
pg_repackYes (online rewrite)Brief lock onlyYou must reclaim bloat on a live table

HOT updates: the optimization that saves your indexes

Because UPDATE writes a new tuple, it normally must also insert that tuple into every index — even indexes on columns that didn't change — because the physical location (ctid) moved. That write amplification is brutal on heavily-indexed tables.

Heap-Only Tuple (HOT) updates are Postgres's escape hatch. If an UPDATE changes no indexed column and the new tuple fits on the same page as the old one, Postgres chains the new version to the old one within the page and does not touch any index. Index entries keep pointing at the old tuple; reads follow the intra-page HOT chain to the current version. No index bloat, far less I/O.

The requirement 'fits on the same page' is why `fillfactor` matters. Set a table's fillfactor below 100 (say 85–90) and Postgres leaves free space on every page at load time, dramatically raising the fraction of updates that qualify as HOT. For an update-heavy table, this one setting can be the difference between healthy and bloated.

Two levers keep an update-heavy table lean: (1) don't index columns you update constantly, so more updates qualify for HOT; (2) lower fillfactor so there's room on the page for the new version. Both increase the HOT-update rate — check it with pg_stat_user_tables.n_tup_hot_upd vs n_tup_upd.

Index-only scans need the visibility map (another reason to vacuum)

An index-only scan answers a query from the index alone, skipping the heap fetch — a big win. But an index entry doesn't record whether its tuple is visible to your snapshot; only the heap tuple's xmin/xmax do. So how can Postgres trust the index without visiting the heap?

It consults the visibility map (VM): a compact bitmap with a bit per heap page meaning 'every tuple on this page is visible to all transactions'. If the VM says a page is all-visible, the index-only scan can skip the heap fetch for tuples on it. The catch: the VM is maintained by VACUUM. A table that isn't vacuumed has a stale VM, its pages aren't marked all-visible, and your 'index-only' scans quietly degrade into regular index scans with heap fetches. Vacuuming is not just cleanup — it's what keeps your fastest scans fast.

Why one idle transaction can bloat your entire database

VACUUM may only remove a dead tuple if no snapshot could still need it. The oldest snapshot in the system defines the 'xmin horizon' — the boundary before which tuples are safe to reclaim. A single long-running or forgotten-open transaction (the classic `BEGIN;` in a psql session someone walked away from, or an ORM that leaks a transaction) pins that horizon in the past.

While it's held, VACUUM across the whole cluster cannot remove any dead tuple newer than that transaction's snapshot — not just in the tables it touched, in every table. Your write-heavy tables keep generating dead tuples that autovacuum dutifully scans but is forbidden to reclaim. Bloat climbs, and it won't stop until that transaction ends.

The number-one cause of surprise bloat

Monitor for it directly: SELECT * FROM pg_stat_activity WHERE state IN ('idle in transaction','active') ORDER BY xact_start; and set idle_in_transaction_session_timeout so leaked transactions are killed automatically. Long analytics queries and idle transactions are the same hazard.

Transaction-ID wraparound: the 2-billion cliff

Transaction ids are 32-bit. Postgres compares them modulo 2^32 as 'roughly 2 billion in the past' vs 'roughly 2 billion in the future', so every tuple's xmin must stay in the visible past relative to current transactions. If ids ever wrapped around, ancient committed rows would suddenly look as if they were written by future transactions and vanish from view — catastrophic silent data loss.

The defense is 'freezing': VACUUM marks tuples old enough as frozen — permanently visible, their age no longer relevant. Anti-wraparound autovacuums are triggered automatically as a table's oldest unfrozen xid approaches `autovacuum_freeze_max_age`, and they cannot be skipped. If freezing still can't keep up — often because autovacuum was disabled, starved, or blocked by that idle transaction from the previous section — Postgres protects your data by refusing new transactions until you VACUUM. A database that stops accepting writes to avoid corruption is the failure mode behind more than one famous outage.

Watch age(datfrozenxid) per database and age(relfrozenxid) per table. Rising steadily toward autovacuum_freeze_max_age (default 200M) means freezing is falling behind — investigate autovacuum health before it becomes an incident, never after.

The model to keep

Postgres never overwrites a live row; it versions it. UPDATE and DELETE leave dead tuples; VACUUM reclaims them for reuse and maintains the visibility map and freezing that keep scans fast and the database safe. Bloat is what happens when cleanup falls behind — usually because autovacuum is under-tuned or a long transaction pins the horizon. Get autovacuum aggressive enough, keep transactions short, use HOT-friendly schemas (fewer indexed hot columns, lower fillfactor), and MVCC gives you snapshot isolation almost for free. Ignore it, and the same machinery quietly turns a 2 GB table into a 40 GB one.

Tags
PostgreSQLMVCCDatabasesPerformanceVACUUM
Related services

Want this handled by senior operators instead of read about? Our Software Management practice turns the ideas above into shipped work — or explore everything we do below.

Free 48-hour audit · no lock-in

The cost of waiting
is your competitor.

Every 90 days you delay is 90 days of authority compounding for someone else. Get the audit. See the math. Then decide.

No lock-in
Weekly invoicing
Reply within
3 hours
Audit value
$2,400 yours, free