PostgreSQL 17 & 18: The Advanced Open-Source Database
A deep technical guide to PostgreSQL in 2026. Covers schema design and rich data types (JSONB, arrays, ranges), indexing with B-tree, GIN, GiST, SP-GiST and BRIN, query tuning with EXPLAIN ANALYZE, MVCC and VACUUM internals, declarative partitioning, streaming and logical replication, connection pooling with PgBouncer, the pgvector extension for AI/RAG embeddings, PostGIS and TimescaleDB, and an honest Postgres vs MySQL comparison.
Table of Contents
- 1. Schema Design and Data Types
- 2. Indexing Strategies (B-tree, GIN, GiST, BRIN)
- 3. Query Optimization with EXPLAIN ANALYZE
- 4. Connection Pooling with PgBouncer
- 5. Streaming and Logical Replication
- 6. Backup and Recovery
- 7. Declarative Partitioning
- 8. MVCC, Transactions, and Isolation Levels
- 9. VACUUM, Autovacuum, and Bloat
- 10. JSONB, Window Functions, and CTEs
- 11. pgvector: Embeddings and RAG
- 12. Extensions: PostGIS, TimescaleDB, and More
- 13. Real-World Experience
- 14. PostgreSQL 17/18 and Postgres vs MySQL
1. Schema Design and Data Types
A good schema is the foundation of database performance. PostgreSQL's greatest strength is its rich, strict type system: on top of the usual normalized relational design, it gives you native JSONB, arrays, ranges, enums, and user-defined domains and composite types. Model your data properly first, then reach for the semi-structured types only where the shape genuinely varies per row.
- Normalize first (1NF-3NF): Atomic columns, a real primary key, no partial or transitive dependencies. Denormalize only with measured evidence. PostgreSQL enforces integrity with CHECK, UNIQUE, NOT NULL, EXCLUDE, and FOREIGN KEY constraints, including deferrable ones
- Identity columns over serial: Prefer
GENERATED ALWAYS AS IDENTITY(SQL-standard) over the legacyserialpseudo-type. For distributed inserts, PostgreSQL 18 adds a nativeuuidv7()function whose time-ordered values avoid the random B-tree page splits that UUIDv4 causes - Right-size types:
int(4 bytes) vsbigint(8 bytes);textandvarchar(n)are stored identically (there is no performance penalty fortextin Postgres). Always store instants astimestamptz(stored as UTC), never naivetimestamp. Usenumericfor money, neverfloat - JSONB for semi-structured data:
jsonbis a decomposed binary format that supports indexing and rich operators. Prefer it overjson(which stores raw text). Use it for metadata, feature flags, and payloads whose shape varies -- not as an excuse to avoid columns you query and constrain often - Arrays and ranges: Native array columns (
text[],int[]) avoid junction tables for simple lists. Range types (int4range,tstzrange) model intervals, and anEXCLUDE USING gistconstraint prevents overlapping bookings at the database level -- something MySQL cannot express - Enums and domains:
CREATE TYPE ... AS ENUMgives compact, ordered enumerations;CREATE DOMAINattaches reusable CHECK constraints to a base type. Soft deletes usedeleted_at timestamptzplus a partial index (Postgres supports these natively, unlike MySQL)
-- Well-designed PostgreSQL schema
CREATE TYPE user_status AS ENUM ('active','inactive','suspended');
CREATE TABLE users (
id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
public_id uuid NOT NULL DEFAULT uuidv7(), -- time-ordered UUID (PG 18)
email citext NOT NULL, -- case-insensitive (citext ext.)
name text NOT NULL,
status user_status NOT NULL DEFAULT 'active',
tags text[] NOT NULL DEFAULT '{}', -- native array
metadata jsonb NOT NULL DEFAULT '{}'::jsonb,
created_at timestamptz NOT NULL DEFAULT now(),
updated_at timestamptz NOT NULL DEFAULT now(),
deleted_at timestamptz,
CONSTRAINT uq_users_email UNIQUE (email)
);
CREATE TABLE bookings (
id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
user_id bigint NOT NULL REFERENCES users(id),
room_id bigint NOT NULL REFERENCES rooms(id),
during tstzrange NOT NULL, -- the reserved interval
status text NOT NULL DEFAULT 'confirmed'
CHECK (status IN ('confirmed','cancelled','completed')),
created_at timestamptz NOT NULL DEFAULT now(),
-- No two confirmed bookings for the same room may overlap in time:
EXCLUDE USING gist (room_id WITH =, during WITH &&)
WHERE (status = 'confirmed')
);
2. Indexing Strategies (B-tree, GIN, GiST, BRIN)
B-tree Indexes
B-tree is PostgreSQL's default and most-used access method. It keeps keys in sorted order and serves equality, range, ORDER BY, and MIN/MAX queries. Unlike MySQL's InnoDB, PostgreSQL tables are heap-organized (not clustered on the primary key), so every index -- including the primary key -- is a separate secondary structure pointing at heap tuples.
- Leftmost prefix rule: A composite index on (a, b, c) serves predicates on (a), (a, b), or (a, b, c) -- but not (b) or (c) alone. PostgreSQL 18 adds B-tree skip scan, which can use later columns when a leading column has few distinct values
- Column order matters: Put equality columns before range columns, and generally the most selective first. Match the index's sort direction to your ORDER BY (you can specify
ASC/DESCandNULLS FIRST/LAST) - Covering indexes with INCLUDE: Add non-key payload columns via
INCLUDE (...)so an index-only scan returns everything without a heap fetch. EXPLAIN shows Index Only Scan. Index-only scans also depend on the visibility map being fresh (keep VACUUM healthy) - Selectivity: ratio of distinct values to rows. A boolean alone is a poor B-tree candidate -- use a partial index instead. Run
ANALYZEso the planner has current statistics; extend them withCREATE STATISTICSfor correlated columns - Write overhead & HOT: Every index is maintained on write. PostgreSQL's Heap-Only Tuple (HOT) optimization avoids index churn when an UPDATE does not touch any indexed column and the page has free space -- another reason not to over-index and to leave
fillfactorheadroom on hot tables
-- Composite index matching a common access pattern
-- SELECT ... FROM bookings WHERE user_id = $1 AND status = 'confirmed'
-- ORDER BY created_at DESC LIMIT 20;
CREATE INDEX idx_bookings_user_status_created
ON bookings (user_id, status, created_at DESC);
-- Covering / index-only scan with INCLUDE (payload not part of the key)
-- SELECT id, email, name FROM users WHERE status = 'active' ORDER BY created_at DESC;
CREATE INDEX idx_users_status_created
ON users (status, created_at DESC) INCLUDE (email, name);
-- Build without long write locks in production:
CREATE INDEX CONCURRENTLY idx_users_status_created
ON users (status, created_at DESC) INCLUDE (email, name);
GIN and GiST Indexes
Beyond B-tree, PostgreSQL ships several specialized access methods. GIN (Generalized Inverted Index) and GiST (Generalized Search Tree) are the two you will reach for most: GIN indexes the elements inside composite values (JSONB keys, array elements, full-text lexemes, trigrams), while GiST indexes geometric, range, and nearest-neighbor data. This is where Postgres pulls decisively ahead of MySQL.
- GIN for JSONB / arrays / FTS: A GIN index accelerates containment (
@>), key existence (?), array overlap (&&), and full-text (@@). Use thejsonb_path_opsoperator class for a smaller, faster containment-only index. Combine withpg_trgmto makeLIKE '%term%'and fuzzy search index-backed - GiST for ranges & geometry: GiST powers range-overlap (
&&),EXCLUDEconstraints, nearest-neighbor (ORDER BY point <-> target), and PostGIS spatial queries. SP-GiST suits non-balanced structures like quadtrees and IP prefixes - Real hash indexes: PostgreSQL also has a genuine
USING hashindex -- crash-safe and WAL-logged since PG 10 -- for equality-only lookups on large, non-sortable keys. It is niche; a B-tree is usually the better default
-- GIN on a JSONB column for containment / key queries
CREATE INDEX idx_users_metadata ON users USING gin (metadata jsonb_path_ops);
-- Uses the index: WHERE metadata @> '{"plan":"pro"}'
-- GIN on an array column
CREATE INDEX idx_users_tags ON users USING gin (tags);
-- Uses the index: WHERE tags && ARRAY['vip','beta'] -- overlap
-- Trigram GIN makes substring / fuzzy search index-backed
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_users_name_trgm ON users USING gin (name gin_trgm_ops);
-- Uses the index: WHERE name ILIKE '%nobile%'
-- GiST for nearest-neighbor on geometry / vectors of points
CREATE INDEX idx_places_geom ON places USING gist (location);
-- ORDER BY location <-> ST_MakePoint(-74.07, 4.65) LIMIT 10;
Partial, Expression, and BRIN Indexes
PostgreSQL natively supports partial indexes (a WHERE clause on the index), expression indexes (index the result of a function), and BRIN (Block Range INdex) for huge, naturally-ordered tables. These are first-class features -- MySQL can only approximate them with generated columns.
- Partial indexes:
CREATE INDEX ... WHERE deleted_at IS NULLindexes only live rows, making the index smaller, faster, and cheaper to maintain. Perfect for soft deletes, or for a "pending jobs" queue where only a tiny fraction of rows matter - Expression indexes: Index a computed value like
lower(email)or(metadata->>'plan'). The query must use the exact same expression to match. This replaces MySQL's functional indexes and is more flexible - BRIN for append-only tables: BRIN stores only the min/max value per block range, so an index on a billion-row time-series table can be a few megabytes. It excels when the physical row order correlates with the column (e.g. a monotonically increasing
created_at). Tiny, but only useful on well-correlated data
-- Partial index: only index rows that are not soft-deleted
CREATE INDEX idx_users_active ON users (created_at DESC)
WHERE deleted_at IS NULL;
-- Partial index for a work queue (indexes ~0.1% of the table)
CREATE INDEX idx_jobs_pending ON jobs (run_after)
WHERE status = 'pending';
-- Expression index for case-insensitive lookups
CREATE INDEX idx_users_email_lower ON users (lower(email));
-- Uses the index: WHERE lower(email) = lower($1)
-- BRIN on a massive, time-ordered append-only table (tiny footprint)
CREATE INDEX idx_events_created_brin ON events
USING brin (created_at) WITH (pages_per_range = 32);
Common Indexing Pitfalls
- Functions on indexed columns:
WHERE date_trunc('year', created_at) = '2026-01-01'cannot use a plain B-tree. Rewrite as a range:WHERE created_at >= '2026-01-01' AND created_at < '2027-01-01', or build a matching expression index - Implicit casts: Comparing a
textcolumn to an integer literal, or atimestamptzto a bare string, can defeat an index. Match types and cast explicitly - Leading-wildcard LIKE:
WHERE name LIKE '%smith'cannot use a normal B-tree. Apg_trgmGIN index fixes this; plainLIKE 'smith%'works with a B-tree usingtext_pattern_ops - OR across columns:
WHERE a = 1 OR b = 2often can't use one composite index. PostgreSQL can combine two separate indexes with a BitmapOr, or rewrite the query as aUNION - Redundant & unused indexes: Every index adds write cost and bloats. Find dead weight with
pg_stat_user_indexes(idx_scan = 0) and check size withpg_relation_size(). Drop what the planner never chooses
EXPLAIN (ANALYZE, BUFFERS) before and after, and run ANALYZE so the planner has fresh statistics. An unused index wastes disk, slows every write, and lengthens VACUUM for zero benefit.3. Query Optimization with EXPLAIN ANALYZE
EXPLAIN ANALYZE and pg_stat_statements
EXPLAIN is the single most important tool for PostgreSQL performance. Plain EXPLAIN shows the planner's chosen plan and its cost estimates; EXPLAIN (ANALYZE, BUFFERS) actually runs the query and reports real timings, real row counts, and how many pages came from cache versus disk.
- Node types: read the plan tree from the innermost node out. Index Only Scan and Index Scan are good; a Seq Scan on a large table under a selective filter is suspect. Bitmap Heap Scan is normal for medium-selectivity predicates
- Estimated vs actual rows: compare
rows=(estimate) withactual rows=. A large gap means stale or missing statistics -- runANALYZE, raisedefault_statistics_target, or add extended statistics for correlated columns - Join strategy: Nested Loop (great for few rows), Hash Join (great for large unsorted sets), Merge Join (great when both inputs are already sorted). The planner picks based on cost; feed it good stats rather than fighting it with hints
- BUFFERS & timing:
BUFFERSexposes shared hits vs reads (cache effectiveness); watch for a Sort spilling to disk (external merge Disk) -- raisework_memfor that session. PostgreSQL 18 can also report per-node timing with better granularity - Find the slow ones: enable the
pg_stat_statementsextension to rank queries by total and mean time, and setlog_min_duration_statement = '250ms'to log slow statements.auto_explaincan log the plan of anything slow automatically
-- The workhorse: run it, count real rows, show cache behavior
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT b.id, b.created_at, u.name, u.email
FROM bookings b
JOIN users u ON u.id = b.user_id
WHERE b.room_id = 42
AND b.status = 'confirmed'
AND b.created_at >= '2026-06-01'
ORDER BY b.created_at DESC
LIMIT 20;
-- Ideal shape of the output:
-- Limit (actual time=0.05..0.09 rows=20 ...)
-- -> Index Scan using idx_bookings_room_status_created on bookings b
-- Index Cond: (room_id = 42 AND status = 'confirmed')
-- Buffers: shared hit=24
-- -> Index Scan using users_pkey on users u (rows=1)
-- The supporting index:
CREATE INDEX idx_bookings_room_status_created
ON bookings (room_id, status, created_at DESC);
-- Rank the worst offenders across the whole server:
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
Optimization Patterns
- Keyset pagination: Avoid deep
OFFSET 10000(Postgres still reads and discards those rows). Page by the last seen sort key:WHERE (created_at, id) < ($1, $2) ORDER BY created_at DESC, id DESC LIMIT 20 - COUNT is expensive: an exact
COUNT(*)must scan matching rows because of MVCC visibility. For dashboards usereltuplesfrompg_classfor a fast estimate, or maintain rollup counters - Batch writes: multi-row
INSERT, orCOPYfor bulk loads (often 10-50x faster). UseINSERT ... ON CONFLICT DO UPDATEfor upserts andMERGE(PG 15+, withRETURNINGin PG 17) for complex merges - Kill N+1 with lateral joins: fetch a parent plus its top-N children in one round trip with
LEFT JOIN LATERAL (... LIMIT n) ON true, or aggregate children into JSON withjsonb_agg - Select only what you need:
SELECT *defeats index-only scans and ships wide TOAST columns you may not use. List columns explicitly and let INCLUDE indexes cover hot reads - Tune the knobs that matter: set
work_memper-session for big sorts/hashes,effective_cache_sizeso the planner knows your RAM, andrandom_page_cost = 1.1on SSD/NVMe so index scans are costed correctly
4. Connection Pooling with PgBouncer
Every PostgreSQL connection is a separate OS process with its own memory (roughly 5-10 MB plus work_mem). A few thousand direct connections will exhaust the server long before they saturate the CPU. Because a Postgres backend is heavier than a MySQL thread, external pooling with PgBouncer is not optional at scale -- it is the standard architecture, and it is what managed pooling in RDS Proxy and Supabase run under the hood.
- Right-size the pool: a common starting point is
max_connectionsaround4 x CPU coresfor the write path -- more processes mostly add contention, not throughput. PgBouncer multiplexes thousands of client connections onto that small server pool - Pooling modes: transaction mode is the sweet spot for web apps -- a server connection is handed back to the pool at the end of each transaction. session mode pins a connection for the whole client session (needed for session-level state); statement mode is the most aggressive
- Transaction-mode caveats: you cannot rely on session state (SET, advisory locks, unnamed prepared statements,
LISTEN/NOTIFY) surviving between transactions. PgBouncer 1.21+ supports protocol-level prepared statements in transaction mode viamax_prepared_statements - Timeouts & limits: tune
default_pool_size,max_client_conn,server_idle_timeout, andquery_wait_timeoutso bursts queue briefly instead of overwhelming Postgres or failing outright - App-side pool too: your driver (node-postgres
pg.Pool, HikariCP, SQLAlchemy) keeps its own small pool pointing at PgBouncer, not directly at Postgres. Keepstatement_timeoutandidle_in_transaction_session_timeoutset as safety nets - Read/write splitting: point reads at a physical replica and writes at the primary. Each usually gets its own PgBouncer pool, effectively multiplying capacity
# pgbouncer.ini -- transaction pooling for a typical web workload
[databases]
myapp = host=10.0.0.10 port=5432 dbname=myapp
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
default_pool_size = 20 # server conns per user/db pair
max_client_conn = 5000 # client conns PgBouncer will accept
max_prepared_statements = 200 # protocol-level prepared stmts (1.21+)
server_idle_timeout = 300
query_wait_timeout = 30
// Node.js: app pool connects to PgBouncer (:6432), not Postgres directly
import { Pool } from 'pg';
export const db = new Pool({
host: process.env.PGBOUNCER_HOST, // PgBouncer, port 6432
port: 6432,
user: process.env.DB_USER,
password: process.env.DB_PASSWORD,
database: 'myapp',
max: 10, // small: PgBouncer does the real fan-out
idleTimeoutMillis: 30000,
connectionTimeoutMillis: 5000,
// Safety nets applied per backend session:
options: '-c statement_timeout=15000 -c idle_in_transaction_session_timeout=10000'
});
idle_in_transaction_session_timeout, keep transactions short, and never do network I/O inside one.5. Streaming and Logical Replication
PostgreSQL offers two complementary replication mechanisms built on the Write-Ahead Log (WAL). Physical (streaming) replication ships the raw WAL byte-for-byte to produce identical binary replicas -- the basis of high availability and read scaling. Logical replication decodes the WAL into row-level change events, letting you replicate selected tables across major versions or into other systems.
- Streaming replication: a standby continuously receives WAL from the primary and replays it, staying binary-identical. Standbys are read-only "hot standbys" that can serve read queries while they replay
- Async vs synchronous: async (the default) does not wait for the standby -- fast, but a crash can lose the last few transactions.
synchronous_commit = onwithsynchronous_standby_nameswaits for a standby to persist WAL, trading a little latency for zero data loss on failover - Replication slots: a slot guarantees the primary retains WAL until a replica has consumed it, preventing "requested WAL segment already removed" errors. Guard against unbounded WAL growth with
max_slot_wal_keep_size - Logical replication:
CREATE PUBLICATIONon the source andCREATE SUBSCRIPTIONon the target replicate chosen tables at the row level. It works across major versions (the basis of near-zero-downtime upgrades) and can filter rows and columns - Monitoring lag: query
pg_stat_replicationon the primary and comparepg_current_wal_lsn()to each replica'sreplay_lsn. For read-after-write consistency, route the critical read to the primary or usesynchronous_commit - Failover tooling: Postgres has no built-in cluster manager, so production HA uses Patroni (with etcd/Consul) or repmgr to promote a standby automatically. PostgreSQL 17 added
pg_createsubscriberto convert a physical standby into a logical subscriber quickly - Failover slots (PG 17): logical replication slots can now be synchronized to standbys, so a subscriber keeps working after the publisher fails over -- previously a hard operational gap
- CDC and the ecosystem: logical decoding is how Debezium, and tools like pglogical, stream Postgres changes into Kafka, data warehouses, and search indexes -- change-data-capture without dual writes
6. Backup and Recovery
- Logical backups (pg_dump): exports one database as SQL or a compressed custom/directory archive. Portable across versions and architectures; use the directory format with
-jfor parallel dump/restore.pg_dumpallalso captures globals (roles, tablespaces) - Physical base backups (pg_basebackup): a byte-level copy of the whole cluster, taken online without blocking writes. This is the foundation for both standbys and point-in-time recovery
- Point-in-time recovery (PITR): a base backup plus archived WAL lets you restore to any moment -- e.g. one second before a bad
DELETE. Setarchive_mode = onand anarchive_command/archive_libraryto ship WAL to durable storage - Incremental backups (PG 17):
pg_basebackup --incrementalcopies only blocks changed since a prior backup, andpg_combinebackupreconstructs a full backup from the chain -- dramatically smaller, faster backups for large clusters - Use pgBackRest in production: it adds parallel, compressed, encrypted full/differential/incremental backups, retention policies, and integrity checks over raw tooling.
WAL-GandBarmanare strong alternatives - Test restores: an untested backup is not a backup. Automate a weekly restore into a scratch instance and run
pg_verifybackupplus a few smoke queries. Store copies off-region for disaster recovery
# Logical: parallel directory-format dump + parallel restore
pg_dump -h db -U app -d myapp -F d -j 4 -f /backup/myapp_$(date +%F)
pg_restore -h db2 -U app -d myapp_restored -F d -j 4 /backup/myapp_2026-07-01
# Physical base backup (streams WAL so it is self-consistent)
pg_basebackup -h db -U repl -D /backup/base -F tar -z -X stream -c fast
# Incremental base backup (PG 17), then reconstruct a full copy
pg_basebackup -h db -U repl -D /backup/incr \
--incremental=/backup/base/backup_manifest
pg_combinebackup /backup/base /backup/incr -o /backup/full
# Recommended: pgBackRest full backup + verify
pgbackrest --stanza=myapp --type=full backup
pgbackrest --stanza=myapp verify
7. Declarative Partitioning
Declarative partitioning (native since PostgreSQL 10 and steadily improved) splits one logical table into physical child partitions. Queries hit the parent; the planner performs partition pruning to touch only relevant children. The biggest win is data lifecycle management: dropping or detaching an old partition is an instant metadata operation, not a slow, bloat-generating DELETE.
- RANGE partitioning: assign rows by a value falling in a range -- the classic choice for time-series. Partition by month or day, and retire old data with
DROP/DETACH PARTITIONinstead of mass deletes - LIST partitioning: assign rows by discrete values (region, tenant, status). Great for multi-tenant isolation or per-country data residency
- HASH partitioning: spread rows evenly across N partitions by a hash of the key. Use it to parallelize I/O when there is no natural range or list boundary
- Sub-partitioning: partitions can themselves be partitioned (e.g. RANGE by month, then HASH by tenant) for very large multi-tenant time-series data
- Pruning & partition-wise joins: include the partition key in the WHERE clause so the planner prunes at plan time and at execution time. Enable
enable_partitionwise_joinandenable_partitionwise_aggregatefor large analytical queries - Rules & automation: a UNIQUE/PRIMARY KEY must include the partition key. Indexes created on the parent cascade to all partitions. Foreign keys referencing a partitioned table are supported in modern versions. Automate partition creation with
pg_partman(and TimescaleDB for time-series)
-- RANGE partition by month, ideal for event/booking logs
CREATE TABLE booking_logs (
id bigint GENERATED ALWAYS AS IDENTITY,
user_id bigint NOT NULL,
action text NOT NULL,
created_at timestamptz NOT NULL,
PRIMARY KEY (id, created_at) -- must include the partition key
) PARTITION BY RANGE (created_at);
-- Index on the parent cascades to every partition
CREATE INDEX ON booking_logs (user_id, created_at);
-- Monthly partitions
CREATE TABLE booking_logs_2026_06 PARTITION OF booking_logs
FOR VALUES FROM ('2026-06-01') TO ('2026-07-01');
CREATE TABLE booking_logs_2026_07 PARTITION OF booking_logs
FOR VALUES FROM ('2026-07-01') TO ('2026-08-01');
-- Retire old data instantly (metadata-only, no row-by-row DELETE)
ALTER TABLE booking_logs DETACH PARTITION booking_logs_2026_06;
DROP TABLE booking_logs_2026_06;
-- Confirm pruning: only the July partition is scanned
EXPLAIN SELECT * FROM booking_logs
WHERE created_at >= '2026-07-01' AND created_at < '2026-08-01';
8. MVCC, Transactions, and Isolation Levels
PostgreSQL implements ACID transactions with Multi-Version Concurrency Control (MVCC). Instead of overwriting rows in place, an UPDATE writes a new row version (tuple) and marks the old one dead; each transaction sees a consistent snapshot based on transaction IDs (xmin/xmax). The upside: readers never block writers and writers never block readers. The trade-off: dead tuples accumulate and must be reclaimed by VACUUM (next section).
- How MVCC works: visibility is decided by comparing a tuple's xmin/xmax against the transaction's snapshot. There is no separate rollback/undo segment as in some engines -- old versions live in the table itself until VACUUM removes them
- READ COMMITTED (default): each statement sees a fresh snapshot of committed data. No dirty reads, but two reads in the same transaction can differ (non-repeatable reads). Correct for the vast majority of OLTP work
- REPEATABLE READ: the whole transaction sees one snapshot taken at its first statement. In PostgreSQL this also prevents phantom reads. Concurrent conflicting writes raise a serialization failure (SQLSTATE 40001) that you retry
- SERIALIZABLE: uses Serializable Snapshot Isolation (SSI) -- true serializability without heavyweight read locks. It tracks read/write dependencies and aborts one transaction of a dangerous cycle. You must be ready to retry on 40001
- Explicit locking:
SELECT ... FOR UPDATElocks rows for a read-modify-write; addSKIP LOCKEDto build concurrent job queues, orNOWAITto fail fast. Advisory locks give app-level mutexes decoupled from rows - Deadlocks & retries: PostgreSQL detects deadlocks and aborts one transaction (SQLSTATE 40P01). Both deadlocks and serialization failures are normal under contention -- wrap transactions in a retry loop (a few attempts with exponential backoff)
-- Inspect and set isolation
SHOW transaction_isolation;
BEGIN ISOLATION LEVEL REPEATABLE READ;
-- Prevent overbooking with a locking read
BEGIN;
SELECT capacity, current_bookings
FROM rooms WHERE id = 42
FOR UPDATE; -- row locked until COMMIT/ROLLBACK
UPDATE rooms SET current_bookings = current_bookings + 1 WHERE id = 42;
INSERT INTO bookings (user_id, room_id, status)
VALUES (7, 42, 'confirmed');
COMMIT;
-- Concurrent worker queue: each worker grabs distinct rows
SELECT id FROM jobs
WHERE status = 'pending'
ORDER BY run_after
FOR UPDATE SKIP LOCKED
LIMIT 10;
-- Retry pattern (pseudocode) for 40001 / 40P01:
-- for attempt in range(5):
-- try: run_txn(); break
-- except (SerializationFailure, Deadlock): sleep(0.05 * 2**attempt)
idle_in_transaction_session_timeout.9. VACUUM, Autovacuum, and Bloat
Because MVCC leaves behind dead tuples on every UPDATE and DELETE, PostgreSQL must reclaim that space and keep statistics fresh. VACUUM does the reclaiming; the autovacuum daemon runs it automatically. Understanding and tuning it is the single most important operational skill for a healthy Postgres database -- neglecting it is the number-one cause of runaway table bloat and performance cliffs.
- What VACUUM does: marks dead tuples as reusable, updates the visibility map (which enables index-only scans), and advances the "frozen" transaction-ID horizon. Plain
VACUUMruns online;VACUUM FULLrewrites the table and takes an exclusive lock -- avoid it on live tables - ANALYZE: refreshes the planner's statistics. Autovacuum runs it too, but after big data loads or migrations run
ANALYZEmanually so the planner does not make decisions on stale numbers - Tune autovacuum per table: the defaults are conservative for large tables. Lower
autovacuum_vacuum_scale_factor(e.g. 0.02) on hot tables so cleanup triggers sooner, and raiseautovacuum_vacuum_cost_limitso it keeps up. Set these per-table withALTER TABLE ... SET (...) - Prevent XID wraparound: a wraparound-prevention (anti-wraparound) autovacuum is mandatory and cannot be skipped. Watch
datfrozenxid/age(); a long-open transaction or a stuck replication slot that holds back the horizon can push you toward emergency read-only mode. PostgreSQL 17 rewrote VACUUM's memory (TID store) so it uses far less RAM and rarely needs multiple passes - Measure bloat: monitor
pg_stat_user_tables(n_dead_tup,last_autovacuum) and usepgstattuplefor precise figures. Reclaim bloat online withpg_repackorVACUUM (FULL, ...)during a maintenance window; keepfillfactorheadroom on update-heavy tables to enable HOT updates
-- Manual maintenance (parallel index vacuuming available on big tables)
VACUUM (ANALYZE, VERBOSE) bookings;
-- Tune autovacuum for a hot, high-churn table
ALTER TABLE bookings SET (
autovacuum_vacuum_scale_factor = 0.02, -- vacuum after 2% dead tuples
autovacuum_vacuum_cost_limit = 2000, -- let it do more work per round
fillfactor = 90 -- leave room for HOT updates
);
-- Find the tables that need attention
SELECT relname, n_live_tup, n_dead_tup,
round(n_dead_tup * 100.0 / NULLIF(n_live_tup, 0), 1) AS dead_pct,
last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
-- Watch transaction-ID age to stay clear of wraparound
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database ORDER BY xid_age DESC;
n_dead_tup and XID age eliminated the slow, mysterious latency creep that bloat causes. The rule of thumb: if a table is hot, make its autovacuum more aggressive, not less -- and never let a forgotten idle transaction or unused replication slot stall the cleanup horizon.10. JSONB, Window Functions, and CTEs
JSONB
JSONB is PostgreSQL's binary, indexable JSON type -- one of the features that lets a single Postgres instance serve both relational and document workloads. It validates on write, stores a decomposed binary form for fast key access, supports powerful operators, and can be indexed with GIN. Reach for it when a row's shape genuinely varies; keep frequently-queried, constrained fields as real columns.
- jsonb vs json:
jsonbis parsed once into binary (fast access, supports indexing, dedupes keys);jsonkeeps the exact input text and reparses on every read. Usejsonbfor almost everything - Accessing values:
->returns jsonb,->>returns text,#>/#>>take a path array, and the SQL/JSON path operator@?/jsonb_path_queryruns JSONPath expressions - Containment & indexing: the
@>containment operator plus a GIN index (jsonb_path_ops) makesmetadata @> '{"plan":"pro"}'index-backed. Index a single hot key with an expression index instead - Building & transforming:
jsonb_build_object,jsonb_agg,jsonb_set,jsonb_path_query, and (PG 17) SQL/JSONJSON_TABLEto expand JSON into relational rows -- and standard constructors likeJSON_OBJECT/JSON_QUERY
-- JSONB column with a GIN index for containment queries
CREATE TABLE users (
id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
email text NOT NULL UNIQUE,
metadata jsonb NOT NULL DEFAULT '{}'::jsonb
);
CREATE INDEX idx_users_metadata ON users USING gin (metadata jsonb_path_ops);
INSERT INTO users (email, metadata) VALUES
('[email protected]',
'{"plan":"pro","features":["analytics","exports"],"trial_ends":"2026-08-01"}');
-- Containment query (uses the GIN index)
SELECT email FROM users WHERE metadata @> '{"plan":"pro"}';
-- Extract a scalar as text
SELECT email, metadata->>'plan' AS plan FROM users;
-- Update a nested key immutably
UPDATE users SET metadata = jsonb_set(metadata, '{plan}', '"enterprise"')
WHERE id = 1;
-- Expand a JSON array into rows with JSON_TABLE (PG 17)
SELECT u.email, f.feature
FROM users u,
JSON_TABLE(u.metadata, '$.features[*]'
COLUMNS (feature text PATH '$')) AS f;
Window Functions
Window functions compute across a set of rows related to the current row without collapsing them like GROUP BY does. PostgreSQL has had a complete, standard-compliant implementation for years -- including full frame clauses (ROWS/RANGE/GROUPS) and named windows -- making it excellent for rankings, running totals, and comparative analytics.
- ROW_NUMBER(): a unique sequential number per partition. The clean way to do top-N-per-group, deduplication, and stable pagination
- RANK() / DENSE_RANK(): ranking with ties -- RANK leaves gaps after ties, DENSE_RANK does not.
PERCENT_RANKandNTILEbucket the distribution - LAG() / LEAD(): read values from previous/next rows for period-over-period deltas and gap detection
- Aggregates as windows:
SUM() OVER (...),AVG() OVER (...), and a frame clause give running totals and moving averages without self-joins
-- Top 3 most active users per room (by booking count)
SELECT name, room_name, booking_count
FROM (
SELECT u.name, r.name AS room_name,
COUNT(b.id) AS booking_count,
ROW_NUMBER() OVER (PARTITION BY r.id
ORDER BY COUNT(b.id) DESC) AS rn
FROM bookings b
JOIN users u ON u.id = b.user_id
JOIN rooms r ON r.id = b.room_id
WHERE b.created_at >= '2026-01-01'
GROUP BY u.id, u.name, r.id, r.name
) ranked
WHERE rn <= 3;
-- Month-over-month growth with LAG over a monthly aggregate
SELECT month,
bookings,
LAG(bookings) OVER w AS prev_month,
round((bookings - LAG(bookings) OVER w) * 100.0
/ NULLIF(LAG(bookings) OVER w, 0), 1) AS growth_pct
FROM (
SELECT date_trunc('month', created_at) AS month, COUNT(*) AS bookings
FROM bookings GROUP BY 1
) m
WINDOW w AS (ORDER BY month)
ORDER BY month;
Common Table Expressions (CTEs)
CTEs (the WITH clause) define named, temporary result sets for a single query. They make complex logic readable, enable recursion for hierarchical data, and -- crucially in Postgres -- support data-modifying CTEs that run INSERT/UPDATE/DELETE with RETURNING inside one atomic statement.
- Non-recursive CTEs: named building blocks that flatten nested subqueries. Since PostgreSQL 12 the planner inlines a CTE by default when it is referenced once and has no side effects -- add
MATERIALIZEDto force a one-time computation, orNOT MATERIALIZEDto force inlining - Recursive CTEs:
WITH RECURSIVEtraverses trees and graphs (org charts, category trees, bill-of-materials). Add a depth guard orUNION(notUNION ALL) to stop cycles; PG 14+ supportsCYCLEandSEARCHclauses - Data-modifying CTEs: chain writes in one statement -- e.g. move rows to an archive and delete them atomically -- using
RETURNINGto pipe results between steps
-- Readable multi-step reporting; force materialization once
WITH monthly AS MATERIALIZED (
SELECT date_trunc('month', created_at) AS month, room_id, COUNT(*) AS total
FROM bookings
WHERE status = 'completed'
GROUP BY 1, room_id
),
ranked AS (
SELECT month, room_id, total,
RANK() OVER (PARTITION BY month ORDER BY total DESC) AS rnk
FROM monthly
)
SELECT month, room_id, total FROM ranked WHERE rnk <= 5;
-- Recursive CTE: category tree with indentation
WITH RECURSIVE category_tree AS (
SELECT id, name, parent_id, 0 AS depth
FROM categories WHERE parent_id IS NULL
UNION ALL
SELECT c.id, c.name, c.parent_id, ct.depth + 1
FROM categories c
JOIN category_tree ct ON c.parent_id = ct.id
)
SELECT repeat(' ', depth) || name AS tree_view
FROM category_tree ORDER BY depth, name;
-- Data-modifying CTE: archive and delete atomically
WITH moved AS (
DELETE FROM bookings
WHERE status = 'completed' AND created_at < '2025-01-01'
RETURNING *
)
INSERT INTO bookings_archive SELECT * FROM moved;
11. pgvector: Embeddings and RAG
Storing and Querying Vectors
pgvector is the extension that turns PostgreSQL into a production vector database -- the backbone of semantic search and Retrieval-Augmented Generation (RAG). Instead of bolting on a separate system like Pinecone or Milvus, you store embeddings next to your relational data and combine similarity search with ordinary SQL filters, joins, and transactions in one query. As of 2026 the current release is pgvector 0.8.x.
- Types:
vector(n)stores single-precision floats;halfvec(n)uses 2-byte floats to halve storage and index size (and index up to 4000 dims);bit(n)supports binary/Hamming andsparsevechandles sparse vectors - Distance operators:
<=>cosine,<->L2/Euclidean,<#>negative inner product,<+>L1/taxicab. Match the operator to how your embedding model was trained (cosine for most text models) - Generate embeddings in your pipeline: compute vectors with your model of choice (e.g. an OpenAI, Cohere, or a local model), then insert them. Dimensions must match the column: 1536 or 3072 for common OpenAI models, 768/1024 for many open models
-- Enable the extension and create an embeddings table
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE documents (
id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
title text NOT NULL,
content text NOT NULL,
source text NOT NULL,
embedding vector(1536) NOT NULL, -- match your model's dimension
created_at timestamptz NOT NULL DEFAULT now()
);
-- Insert a chunk plus its precomputed embedding
INSERT INTO documents (title, content, source, embedding)
VALUES ('PgBouncer basics', 'Transaction pooling ...', 'guide',
'[0.0123, -0.0456, 0.0789, ...]'); -- 1536 floats from your model
-- k-NN semantic search by cosine distance
SELECT id, title, 1 - (embedding <=> $1) AS cosine_similarity
FROM documents
ORDER BY embedding <=> $1 -- $1 = the query embedding
LIMIT 10;
HNSW Indexing and RAG Patterns
- HNSW indexes: the default choice -- a graph index with excellent recall/latency and no training step. Tune build with
mandef_construction; tune query recall at runtime withSET hnsw.ef_search. IVFFlat is smaller and faster to build but must be built after data loads and needs a goodlistsvalue - Filtered & hybrid search: combine vector distance with plain
WHEREfilters (tenant, date, language). pgvector 0.8's iterative index scans keep recall high even with selective filters. For hybrid search, blend semantic distance with full-textts_rankor trigram scores -- all in one SQL statement - Scaling: use
halfvecto cut index memory, quantize or reduce dimensions when your model allows, and consider thepgvectorscaleextension (StreamingDiskANN) for very large corpora that exceed RAM
-- HNSW index for cosine distance (pick the op class that matches your operator)
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Raise recall at query time (per-session)
SET hnsw.ef_search = 100;
-- RAG retrieval: filter by tenant/recency, THEN rank by similarity
SELECT id, title, content
FROM documents
WHERE source = 'guide'
AND created_at >= now() - interval '365 days'
ORDER BY embedding <=> $1
LIMIT 8;
-- Hybrid search: blend semantic similarity with keyword relevance
SELECT id, title,
0.6 * (1 - (embedding <=> $1))
+ 0.4 * ts_rank(to_tsvector('english', content),
plainto_tsquery('english', $2)) AS score
FROM documents
WHERE to_tsvector('english', content) @@ plainto_tsquery('english', $2)
ORDER BY score DESC
LIMIT 10;
vector(n) insert fails, and mixing embeddings from different models in one column silently ruins relevance. Re-embed the whole corpus when you change models, and rebuild the HNSW index afterward.12. Extensions: PostGIS, TimescaleDB, and More
PostgreSQL's extension system is its superpower: a single CREATE EXTENSION adds new types, index methods, functions, and even background workers without forking the engine. This is why one Postgres instance can cover relational, geospatial, time-series, vector, and full-text workloads that would otherwise need several specialized databases.
- PostGIS: the gold-standard geospatial extension. Adds
geometry/geographytypes, thousands of spatial functions (ST_DWithin,ST_Intersects,ST_Distance), and GiST/SP-GiST spatial indexes. Powers "find everything within N km" queries natively -- current series is PostGIS 3.5 - TimescaleDB: turns Postgres into a time-series database via hypertables (transparent auto-partitioning by time), native columnar compression, continuous aggregates (incrementally-maintained rollups), and retention policies -- ideal for metrics, IoT, and events
- pgvector & friends: vector search (covered above), plus
pgvectorscalefor billion-scale ANN andpg_search/ParadeDB for BM25 full-text -- the building blocks of AI features inside your primary database - Bundled essentials:
pg_stat_statements(query telemetry),pg_trgm(fuzzy/substring search),citext(case-insensitive text),uuid-ossp,hstore,postgres_fdw(query remote databases), andpgcryptoship with core - Operational add-ons:
pg_partmanautomates partition lifecycle,pg_cronruns scheduled SQL jobs inside the database, andpg_repackremoves bloat online without long locks - Availability caveat: some extensions must be preloaded via
shared_preload_librariesand a restart, and managed platforms (RDS, Cloud SQL, Azure) allow only a curated allow-list -- check support before designing around one. Vendor-neutral extensions keep you portable
-- PostGIS: geospatial "what's nearby" query
CREATE EXTENSION IF NOT EXISTS postgis;
CREATE TABLE gyms (
id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
name text NOT NULL,
geom geography(Point, 4326) NOT NULL -- lon/lat on the WGS84 sphere
);
CREATE INDEX idx_gyms_geom ON gyms USING gist (geom);
-- All gyms within 5 km of a point, nearest first
SELECT name, ST_Distance(geom, ST_MakePoint(-74.07, 4.65)::geography) AS meters
FROM gyms
WHERE ST_DWithin(geom, ST_MakePoint(-74.07, 4.65)::geography, 5000)
ORDER BY meters
LIMIT 20;
-- TimescaleDB: hypertable + continuous aggregate for metrics
CREATE EXTENSION IF NOT EXISTS timescaledb;
CREATE TABLE metrics (
ts timestamptz NOT NULL,
device text NOT NULL,
value double precision NOT NULL
);
SELECT create_hypertable('metrics', 'ts'); -- auto time-partitioning
-- Incrementally-maintained hourly rollup
CREATE MATERIALIZED VIEW metrics_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', ts) AS bucket,
device,
avg(value) AS avg_value,
max(value) AS max_value
FROM metrics
GROUP BY bucket, device;
-- Compress and retain
ALTER TABLE metrics SET (timescaledb.compress, timescaledb.compress_segmentby = 'device');
SELECT add_retention_policy('metrics', INTERVAL '180 days');
13. Real-World Experience
I run PostgreSQL in production across personal and client systems -- from a managed primary with read replicas to self-hosted instances powering AI features. The recurring theme: Postgres scales further than people expect when you respect MVCC, keep autovacuum healthy, and pool connections properly.
- Postgres as the single source of truth: primary plus streaming read replicas, WAL archived for point-in-time recovery, and pgBackRest for compressed, verified backups. Logical replication handled a major-version upgrade with minimal downtime
- Connection pooling: put PgBouncer in transaction mode in front of the primary. Thousands of client connections collapsed onto a small server pool, and "too many connections" incidents went to zero
- Query optimization: used
pg_stat_statementsandEXPLAIN (ANALYZE, BUFFERS)to find the worst queries, added covering and partial indexes, and fixed N+1 patterns with lateral joins -- cutting P95 latency dramatically - VACUUM discipline: tuned per-table autovacuum on hot tables and alerted on dead-tuple ratio and transaction-ID age. This eliminated the slow bloat-driven latency creep that had been mistaken for "the database getting old"
- pgvector for RAG: built retrieval for AI features directly in Postgres with pgvector and HNSW indexes -- embeddings, metadata filters, and business joins in one query, no separate vector database to operate
14. PostgreSQL 17/18 and Postgres vs MySQL
Version and Release Model
PostgreSQL ships one major version per year (typically late September) and supports each for five years. As of mid-2026, PostgreSQL 18 (released September 2025) is the current stable release and PostgreSQL 17 (September 2024) is the widely-deployed previous major. There is no separate "LTS" -- every major is production-grade and gets quarterly minor patch releases with bug and security fixes.
- Pick a recent major: run PostgreSQL 17 or 18 for new deployments. Both receive regular minor releases; always apply the latest minor (e.g. 18.x) promptly, since minors are patch-only and safe to adopt
- Five-year support window: each major is supported for about five years from its release. Plan a major upgrade before your version reaches end-of-life to keep getting security fixes
- Upgrade paths: use
pg_upgradefor fast in-place major upgrades, or logical replication (pg_createsubscriberin PG 17 makes this easier) for near-zero-downtime upgrades. Always test extension compatibility first
PostgreSQL vs MySQL
Both are mature, ACID, open-source relational databases, and for simple CRUD either is fine. PostgreSQL pulls ahead when you need rich types and extensibility: native JSONB with GIN indexing, arrays, ranges with exclusion constraints, a deep extension ecosystem (pgvector, PostGIS, TimescaleDB), true partial/expression indexes, and multiple index methods (GIN/GiST/BRIN). MySQL's InnoDB clusters the table on the primary key (fast PK lookups, cheaper secondary-index-free reads) and has a huge managed-hosting footprint. The honest summary: choose PostgreSQL when correctness, complex queries, and AI/analytics features matter; MySQL is a reasonable default for straightforward, read-heavy web apps already invested in its ecosystem.
-- A one-statement capability that showcases Postgres's type system:
-- prevent overlapping room reservations at the DB level (MySQL cannot express this)
CREATE EXTENSION IF NOT EXISTS btree_gist;
ALTER TABLE bookings
ADD CONSTRAINT no_overlap
EXCLUDE USING gist (room_id WITH =, during WITH &&)
WHERE (status = 'confirmed');
-- Concurrency model differs too:
-- * PostgreSQL default isolation = READ COMMITTED; MVCC keeps old row versions
-- in the heap (reclaimed by VACUUM).
-- * MySQL/InnoDB default isolation = REPEATABLE READ; old versions live in the
-- undo log. Postgres SERIALIZABLE uses SSI (no read locks).
PostgreSQL 17 Highlights (September 2024)
PostgreSQL 17 focused on operational scale and SQL/JSON. It added native incremental base backups (pg_basebackup --incremental with pg_combinebackup), a rewritten VACUUM memory manager (a TID store that uses far less RAM), streaming I/O for faster sequential scans, and major logical-replication improvements including failover slots and pg_createsubscriber. On the SQL side it brought the standard JSON_TABLE function, more MERGE capabilities (including RETURNING and WHEN NOT MATCHED BY SOURCE), and COPY ... ON_ERROR ignore for resilient bulk loads.
-- PG 17: MERGE with RETURNING and match-source awareness
MERGE INTO inventory AS t
USING shipment AS s ON t.sku = s.sku
WHEN MATCHED THEN UPDATE SET qty = t.qty + s.qty
WHEN NOT MATCHED THEN INSERT (sku, qty) VALUES (s.sku, s.qty)
WHEN NOT MATCHED BY SOURCE THEN UPDATE SET qty = 0
RETURNING merge_action(), t.sku, t.qty;
-- PG 17: tolerate bad rows during a bulk COPY instead of aborting
COPY events FROM '/data/events.csv' WITH (FORMAT csv, HEADER, ON_ERROR ignore);
PostgreSQL 18 (September 2025)
PostgreSQL 18 is the current stable major in 2026. Its headline feature is a new asynchronous I/O (AIO) subsystem that can substantially speed up reads (sequential scans, bitmap heap scans, and VACUUM) on cloud storage. It also adds native UUIDv7 generation (uuidv7()) for time-ordered keys, virtual generated columns as the new default, B-tree skip scan for multicolumn indexes, temporal PRIMARY KEY/UNIQUE ... WITHOUT OVERLAPS constraints, RETURNING access to both OLD and NEW rows in DML, and OAuth-based authentication.
The practical upgrade story keeps improving: pg_upgrade now preserves planner statistics, so a major upgrade no longer starts with a slow, blind optimizer while you re-run ANALYZE. Combined with the AIO subsystem and skip scans, PostgreSQL 18 is a meaningful performance jump for I/O-bound and analytical workloads -- and, as always, the vast extension ecosystem (pgvector, PostGIS, TimescaleDB) tracks each new major closely.