Understanding Postgres Storage System

I’ve been trying to figure out the internals of the Postgres storage system, just to understand what has kept it around for so long. In the process, I stumbled upon this paper, which outlines the architectural ideas involved in the early development of Postgres. Note that the paper was published in 1987, so the current version of Postgres does not adhere completely to what it suggests, but some of the ideas presented are remarkable regardless.

In a nutshell, the Postgres storage engine can be understood as a collection of 3 design principles:

Before diving into the above principles in detail, let’s see how the transaction system works in Postgres.


Relational storage

Per-row metadata:

| Field | Description |
|-------|-------------|
| OID   | a system-assigned unique record identifier |
| Xmin  | transaction identifier of the interaction inserting the record |
| Tmin  | commit time of Xmin (when the record became valid) |
| Cmin  | command identifier of the interaction inserting the record |
| Xmax  | transaction identifier of the interaction deleting the record |
| Tmax  | commit time of Xmax (when the record stopped being valid) |
| Cmax  | command identifier of the interaction deleting the record |
| PTR   | a forward pointer |

The current per-row metadata can be found here

  1. INSERT: Fields OID, Xmin and Cmin were set. The rest of the fields were left blank.
  2. UPDATE: two operations happened at this point.
    • Xmax and Cmax were set to the identity of the current interaction in the record being replaced, to indicate that it had become invalid.
    • A new record was inserted into the database with the proposed replacement values for the data fields. OID was set to the OID of the record being replaced, and Xmin and Cmin were set to the identity of the current interaction. PTR of the new record pointed to the record it replaced.
  3. DELETE: Xmax and Cmax were set. Tmax was filled in later, once the deleting transaction had committed.
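The three operations above can be sketched in a few lines of Python. This is a toy model, not Postgres source: the `Record` class, the transaction/command identifiers, and the function names are all my own.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Record:
    oid: int                        # system-assigned record identifier
    xmin: int                       # interaction that inserted the record
    cmin: int                       # command id within that interaction
    tmin: Optional[int] = None      # commit time of xmin, set at commit
    xmax: Optional[int] = None      # interaction that deleted/replaced it
    cmax: Optional[int] = None
    tmax: Optional[int] = None      # commit time of xmax, set at commit
    ptr: Optional["Record"] = None  # forward pointer to the replaced version

def insert(oid: int, txn: int, cmd: int) -> Record:
    # INSERT: only OID, Xmin and Cmin are filled in.
    return Record(oid=oid, xmin=txn, cmin=cmd)

def update(old: Record, txn: int, cmd: int) -> Record:
    # UPDATE, step 1: mark the old version as replaced.
    old.xmax, old.cmax = txn, cmd
    # UPDATE, step 2: insert a replacement that reuses the OID
    # and points back at the version it replaces.
    return Record(oid=old.oid, xmin=txn, cmin=cmd, ptr=old)

def delete(rec: Record, txn: int, cmd: int) -> None:
    # DELETE: only Xmax and Cmax are filled in.
    rec.xmax, rec.cmax = txn, cmd
```

For example, updating a freshly inserted record leaves the old version marked with Xmax set, and the new version sharing its OID and pointing back at it.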

In the case of an UPDATE, the new record differed from the old one in only a few fields, and values such as the OID were reused. Therefore, to save space, only the changed fields were stored, as the delta records described below.

NOTE: It looks like the above mechanism does not exist anymore. Postgres currently has Heap Only Tuples (HOT). From the README, long story short:

HOT eliminates redundant index entries and allows the re-use of space taken by DELETEd or obsoleted UPDATEd tuples without performing a table-wide vacuum. It does this by allowing single-page vacuuming, also called “defragmentation”.

Record access

Each page contained a line table pointing to the records on it, plus pointers to the next and previous pages of the relation.

Hence, Postgres could scan a relation by following this forward linked list of pages. Since it could also execute query plans backward, each page needed a pointer to the previous page as well.
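As a toy illustration of that page chain (the `Page` class and record layout here are made up, not Postgres internals), a doubly linked list of pages supports scans in either direction:

```python
class Page:
    """One page of a relation: some records plus next/prev pointers."""
    def __init__(self, records):
        self.records = records
        self.next = None
        self.prev = None

def link(pages):
    """Chain pages together; return (first, last)."""
    for left, right in zip(pages, pages[1:]):
        left.next, right.prev = right, left
    return pages[0], pages[-1]

def scan(page, forward=True):
    """Yield records page by page, forward or backward."""
    while page is not None:
        yield from (page.records if forward else list(reversed(page.records)))
        page = page.next if forward else page.prev
```

A forward scan starts from the first page and follows `next`; a backward scan starts from the last page and follows `prev`.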

Secondary indexes can be constructed for any base relation, and each index is maintained by an access method that provides procedures for access to the index, such as get-record-by-key, insert-record, and delete-record. The term secondary index is used rather loosely these days; I recommend getting an idea from here.

When a record is inserted - an anchor point is constructed for the record along with index entries for each secondary index. Each index record contains a key or a collection of keys along with a pointer to an entry in the line table on the page where the indexed record resides.

When an existing record is updated - a delta record is constructed and chained onto the appropriate anchor record. If an indexed field has been modified, an entry is added to the appropriate index containing the new key(s) and a pointer to the anchor record.
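A rough sketch of the anchor/delta idea (the dict layout is my own, not the paper's on-disk format): a delta record holds only the changed fields, so reading a record means replaying the deltas, oldest first, on top of the anchor's full image.

```python
def read_record(anchor):
    """Replay delta records (oldest first) on top of the anchor image."""
    image = dict(anchor["fields"])
    for delta in anchor["deltas"]:
        image.update(delta)            # a delta holds only changed fields
    return image

# An UPDATE appends a delta containing just the modified field(s).
anchor = {"fields": {"oid": 7, "name": "alice", "balance": 100},
          "deltas": []}
anchor["deltas"].append({"balance": 80})
anchor["deltas"].append({"name": "bob"})
```

This is the space optimization mentioned earlier: the unchanged fields are never copied, at the price of a chain walk on reads.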

An important thing to note about the modern Postgres system: it performs maintenance on all secondary indexes unless no indexed column has changed. So if there are 3 secondary indexes and an UPDATE operation changes a column used by only 1 of them, maintenance is still done for all 3 – unless HOT is used.

There’s a lot of other noticeable things around secondary indexes but I won’t go into depth. Feel free to read the paper.

Vacuuming the disk

An asynchronous process (called the vacuum cleaner) was responsible for sweeping records that were no longer valid into the archive. The syntax to invoke this cleaner was:

vacuum rel-name after T

where T is a time interval, e.g. ‘30 days’.

Conditions for the vacuum cleaner to find valid candidates for archival:

  1. the record was deleted or replaced by a committed transaction more than T ago,
  2. the record was deleted by an aborted transaction, or
  3. the record was inserted by an aborted transaction.

A status was assigned to each such record so that the cleaner could ensure no data was irrecoverably lost. In the first case, the record was copied to the archive unless no-archive status was set for the relation. In the second and third cases, the cleaner simply reclaimed the space.
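The candidacy test can be sketched as a small predicate. Field names come from the metadata table earlier; the record dict, the `aborted` set, and the `now`/`T` parameters are assumptions of mine:

```python
def archive_action(rec, aborted, now, T):
    """Return 'archive', 'reclaim', or None for one record.

    rec is a dict holding the xmin/xmax/tmax metadata fields;
    aborted is the set of aborted transaction ids.
    """
    xmax, tmax = rec.get("xmax"), rec.get("tmax")
    if xmax is not None and xmax in aborted:
        return "reclaim"                # deleted by an aborted transaction
    if rec.get("xmin") in aborted:
        return "reclaim"                # inserted by an aborted transaction
    if xmax is not None and tmax is not None and now - tmax > T:
        return "archive"                # expired more than T ago
    return None
```

A live record inserted by a committed transaction falls through all three tests and is left alone.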

Vacuum process

Vacuuming was done in three phases, namely:

  1. Write an archive record and its associated index records
  2. Write a new anchor point in the current database
  3. Reclaim the space occupied by the old anchor point and its delta records
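The three phases can be sketched as below; the in-memory `database`/`archive` structures are hypothetical stand-ins for the magnetic and archival disks. Note the ordering: the archive copy is written before any space is reclaimed, which is what makes a crash survivable.

```python
def vacuum_chain(database, archive, oid):
    """Vacuum one record's chain: its anchor point plus delta records."""
    chain = database[oid]
    # Phase 1: write the archive record (in the paper, its associated
    # archive index entries are written here too).
    archive.append(list(chain))
    # Phase 2: write a new anchor point holding the still-valid version.
    new_anchor = [chain[-1]]
    # Phase 3: reclaim the space of the old anchor and its deltas.
    # (Phases 2 and 3 collapse into one assignment here; on disk they
    # are separate writes, and a crash between them leaves a duplicate.)
    database[oid] = new_anchor
```

After vacuuming, the whole chain lives in the archive and only the current version remains in the database.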

What Postgres was doing at the time wasn’t inherently crash safe, but it still handled crashes correctly. Two scenarios:

  • A crash after phase 1 but before phase 2 could leave a duplicate copy of the record in the archive once vacuuming was retried.
  • A crash after phase 2 but before phase 3 left both the old and the new anchor points, i.e. a duplicate record in the database.

In either case, the duplicate record consumed system resources, but that wasn’t really concerning: Postgres is a relational system and removes duplicate records during query processing.

Vacuuming cost

The paper discusses two scenarios:

The above is accompanied by a few constant parameters; the two appearing in the table below are K, the length of the update chain, and Z, the number of secondary indexes.

|                | whole chain | K updates |
|----------------|-------------|-----------|
| archive writes | 1 + Z       | 1 + Z     |
| disk reads     | 1           | 1         |
| disk writes    | 1 + Z       | 1 + Z     |

The above table gives the I/O counts for vacuuming. The cost *per record* of the vacuum cleaner is inversely proportional to the length of the chain: as long as an anchor point and several delta records are vacuumed together, the claimed cost is marginal.

There are other things the paper discusses, like the archival medium and magnetic disk indexes, but I’m refraining from writing about those just yet because I don’t understand them completely.