DATABASE

PostgreSQL 17 & 18: The Advanced Open-Source Database

A deep technical guide to PostgreSQL in 2026. Covers schema design and rich data types (JSONB, arrays, ranges), indexing with B-tree, GIN, GiST, SP-GiST and BRIN, query tuning with EXPLAIN ANALYZE, MVCC and VACUUM internals, declarative partitioning, streaming and logical replication, connection pooling with PgBouncer, the pgvector extension for AI/RAG embeddings, PostGIS and TimescaleDB, and an honest Postgres vs MySQL comparison.

1. Schema Design and Data Types

A good schema is the foundation of database performance. PostgreSQL's greatest strength is its rich, strict type system: on top of the usual normalized relational design, it gives you native JSONB, arrays, ranges, enums, and user-defined domains and composite types. Model your data properly first, then reach for the semi-structured types only where the shape genuinely varies per row.

  • Normalize first (1NF-3NF): Atomic columns, a real primary key, no partial or transitive dependencies. Denormalize only with measured evidence. PostgreSQL enforces integrity with CHECK, UNIQUE, NOT NULL, EXCLUDE, and FOREIGN KEY constraints, including deferrable ones
  • Identity columns over serial: Prefer GENERATED ALWAYS AS IDENTITY (SQL-standard) over the legacy serial pseudo-type. For distributed inserts, PostgreSQL 18 adds a native uuidv7() function whose time-ordered values avoid the random B-tree page splits that UUIDv4 causes
  • Right-size types: int (4 bytes) vs bigint (8 bytes); text and varchar(n) are stored identically (there is no performance penalty for text in Postgres). Always store instants as timestamptz (stored as UTC), never naive timestamp. Use numeric for money, never float
  • JSONB for semi-structured data: jsonb is a decomposed binary format that supports indexing and rich operators. Prefer it over json (which stores raw text). Use it for metadata, feature flags, and payloads whose shape varies -- not as an excuse to avoid columns you query and constrain often
  • Arrays and ranges: Native array columns (text[], int[]) avoid junction tables for simple lists. Range types (int4range, tstzrange) model intervals, and an EXCLUDE USING gist constraint prevents overlapping bookings at the database level -- something MySQL cannot express
  • Enums and domains: CREATE TYPE ... AS ENUM gives compact, ordered enumerations; CREATE DOMAIN attaches reusable CHECK constraints to a base type. Soft deletes use deleted_at timestamptz plus a partial index (Postgres supports these natively, unlike MySQL)
-- Well-designed PostgreSQL schema
CREATE TYPE user_status AS ENUM ('active','inactive','suspended');

CREATE TABLE users (
  id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  public_id   uuid NOT NULL DEFAULT uuidv7(),   -- time-ordered UUID (PG 18)
  email       citext NOT NULL,                  -- case-insensitive (citext ext.)
  name        text NOT NULL,
  status      user_status NOT NULL DEFAULT 'active',
  tags        text[] NOT NULL DEFAULT '{}',     -- native array
  metadata    jsonb NOT NULL DEFAULT '{}'::jsonb,
  created_at  timestamptz NOT NULL DEFAULT now(),
  updated_at  timestamptz NOT NULL DEFAULT now(),
  deleted_at  timestamptz,
  CONSTRAINT uq_users_email UNIQUE (email)
);

CREATE TABLE bookings (
  id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  user_id     bigint NOT NULL REFERENCES users(id),
  room_id     bigint NOT NULL REFERENCES rooms(id),
  during      tstzrange NOT NULL,               -- the reserved interval
  status      text NOT NULL DEFAULT 'confirmed'
              CHECK (status IN ('confirmed','cancelled','completed')),
  created_at  timestamptz NOT NULL DEFAULT now(),
  -- No two confirmed bookings for the same room may overlap in time:
  EXCLUDE USING gist (room_id WITH =, during WITH &&)
    WHERE (status = 'confirmed')
);

2. Indexing Strategies (B-tree, GIN, GiST, BRIN)

B-tree Indexes

B-tree is PostgreSQL's default and most-used access method. It keeps keys in sorted order and serves equality, range, ORDER BY, and MIN/MAX queries. Unlike MySQL's InnoDB, PostgreSQL tables are heap-organized (not clustered on the primary key), so every index -- including the primary key -- is a separate secondary structure pointing at heap tuples.

  • Leftmost prefix rule: A composite index on (a, b, c) serves predicates on (a), (a, b), or (a, b, c) -- but not (b) or (c) alone. PostgreSQL 18 adds B-tree skip scan, which can use later columns when a leading column has few distinct values
  • Column order matters: Put equality columns before range columns, and generally the most selective first. Match the index's sort direction to your ORDER BY (you can specify ASC/DESC and NULLS FIRST/LAST)
  • Covering indexes with INCLUDE: Add non-key payload columns via INCLUDE (...) so an index-only scan returns everything without a heap fetch. EXPLAIN shows Index Only Scan. Index-only scans also depend on the visibility map being fresh (keep VACUUM healthy)
  • Selectivity: ratio of distinct values to rows. A boolean alone is a poor B-tree candidate -- use a partial index instead. Run ANALYZE so the planner has current statistics; extend them with CREATE STATISTICS for correlated columns
  • Write overhead & HOT: Every index is maintained on write. PostgreSQL's Heap-Only Tuple (HOT) optimization avoids index churn when an UPDATE does not touch any indexed column and the page has free space -- another reason not to over-index and to leave fillfactor headroom on hot tables
-- Composite index matching a common access pattern
-- SELECT ... FROM bookings WHERE user_id = $1 AND status = 'confirmed'
--   ORDER BY created_at DESC LIMIT 20;
CREATE INDEX idx_bookings_user_status_created
  ON bookings (user_id, status, created_at DESC);

-- Covering / index-only scan with INCLUDE (payload not part of the key)
-- SELECT id, email, name FROM users WHERE status = 'active' ORDER BY created_at DESC;
CREATE INDEX idx_users_status_created
  ON users (status, created_at DESC) INCLUDE (email, name);

-- Build without long write locks in production:
CREATE INDEX CONCURRENTLY idx_users_status_created
  ON users (status, created_at DESC) INCLUDE (email, name);

GIN and GiST Indexes

Beyond B-tree, PostgreSQL ships several specialized access methods. GIN (Generalized Inverted Index) and GiST (Generalized Search Tree) are the two you will reach for most: GIN indexes the elements inside composite values (JSONB keys, array elements, full-text lexemes, trigrams), while GiST indexes geometric, range, and nearest-neighbor data. This is where Postgres pulls decisively ahead of MySQL.

  • GIN for JSONB / arrays / FTS: A GIN index accelerates containment (@>), key existence (?), array overlap (&&), and full-text (@@). Use the jsonb_path_ops operator class for a smaller, faster containment-only index. Combine with pg_trgm to make LIKE '%term%' and fuzzy search index-backed
  • GiST for ranges & geometry: GiST powers range-overlap (&&), EXCLUDE constraints, nearest-neighbor (ORDER BY point <-> target), and PostGIS spatial queries. SP-GiST suits non-balanced structures like quadtrees and IP prefixes
  • Real hash indexes: PostgreSQL also has a genuine USING hash index -- crash-safe and WAL-logged since PG 10 -- for equality-only lookups on large, non-sortable keys. It is niche; a B-tree is usually the better default
-- GIN on a JSONB column for containment / key queries
CREATE INDEX idx_users_metadata ON users USING gin (metadata jsonb_path_ops);
-- Uses the index:  WHERE metadata @> '{"plan":"pro"}'

-- GIN on an array column
CREATE INDEX idx_users_tags ON users USING gin (tags);
-- Uses the index:  WHERE tags && ARRAY['vip','beta']   -- overlap

-- Trigram GIN makes substring / fuzzy search index-backed
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_users_name_trgm ON users USING gin (name gin_trgm_ops);
-- Uses the index:  WHERE name ILIKE '%nobile%'

-- GiST for nearest-neighbor on geometry / vectors of points
CREATE INDEX idx_places_geom ON places USING gist (location);
-- ORDER BY location <-> ST_MakePoint(-74.07, 4.65) LIMIT 10;

Partial, Expression, and BRIN Indexes

PostgreSQL natively supports partial indexes (a WHERE clause on the index), expression indexes (index the result of a function), and BRIN (Block Range INdex) for huge, naturally-ordered tables. These are first-class features -- MySQL can only approximate them with generated columns.

  • Partial indexes: CREATE INDEX ... WHERE deleted_at IS NULL indexes only live rows, making the index smaller, faster, and cheaper to maintain. Perfect for soft deletes, or for a "pending jobs" queue where only a tiny fraction of rows matter
  • Expression indexes: Index a computed value like lower(email) or (metadata->>'plan'). The query must use the exact same expression to match. This replaces MySQL's functional indexes and is more flexible
  • BRIN for append-only tables: BRIN stores only the min/max value per block range, so an index on a billion-row time-series table can be a few megabytes. It excels when the physical row order correlates with the column (e.g. a monotonically increasing created_at). Tiny, but only useful on well-correlated data
-- Partial index: only index rows that are not soft-deleted
CREATE INDEX idx_users_active ON users (created_at DESC)
  WHERE deleted_at IS NULL;

-- Partial index for a work queue (indexes ~0.1% of the table)
CREATE INDEX idx_jobs_pending ON jobs (run_after)
  WHERE status = 'pending';

-- Expression index for case-insensitive lookups
CREATE INDEX idx_users_email_lower ON users (lower(email));
-- Uses the index:  WHERE lower(email) = lower($1)

-- BRIN on a massive, time-ordered append-only table (tiny footprint)
CREATE INDEX idx_events_created_brin ON events
  USING brin (created_at) WITH (pages_per_range = 32);

Common Indexing Pitfalls

  • Functions on indexed columns: WHERE date_trunc('year', created_at) = '2026-01-01' cannot use a plain B-tree. Rewrite as a range: WHERE created_at >= '2026-01-01' AND created_at < '2027-01-01', or build a matching expression index
  • Implicit casts: Comparing a text column to an integer literal, or a timestamptz to a bare string, can defeat an index. Match types and cast explicitly
  • Leading-wildcard LIKE: WHERE name LIKE '%smith' cannot use a normal B-tree. A pg_trgm GIN index fixes this; plain LIKE 'smith%' works with a B-tree using text_pattern_ops
  • OR across columns: WHERE a = 1 OR b = 2 often can't use one composite index. PostgreSQL can combine two separate indexes with a BitmapOr, or rewrite the query as a UNION
  • Redundant & unused indexes: Every index adds write cost and bloats. Find dead weight with pg_stat_user_indexes (idx_scan = 0) and check size with pg_relation_size(). Drop what the planner never chooses
Never add indexes blindly. Measure with EXPLAIN (ANALYZE, BUFFERS) before and after, and run ANALYZE so the planner has fresh statistics. An unused index wastes disk, slows every write, and lengthens VACUUM for zero benefit.

3. Query Optimization with EXPLAIN ANALYZE

EXPLAIN ANALYZE and pg_stat_statements

EXPLAIN is the single most important tool for PostgreSQL performance. Plain EXPLAIN shows the planner's chosen plan and its cost estimates; EXPLAIN (ANALYZE, BUFFERS) actually runs the query and reports real timings, real row counts, and how many pages came from cache versus disk.

  • Node types: read the plan tree from the innermost node out. Index Only Scan and Index Scan are good; a Seq Scan on a large table under a selective filter is suspect. Bitmap Heap Scan is normal for medium-selectivity predicates
  • Estimated vs actual rows: compare rows= (estimate) with actual rows=. A large gap means stale or missing statistics -- run ANALYZE, raise default_statistics_target, or add extended statistics for correlated columns
  • Join strategy: Nested Loop (great for few rows), Hash Join (great for large unsorted sets), Merge Join (great when both inputs are already sorted). The planner picks based on cost; feed it good stats rather than fighting it with hints
  • BUFFERS & timing: BUFFERS exposes shared hits vs reads (cache effectiveness); watch for a Sort spilling to disk (external merge Disk) -- raise work_mem for that session. PostgreSQL 18 can also report per-node timing with better granularity
  • Find the slow ones: enable the pg_stat_statements extension to rank queries by total and mean time, and set log_min_duration_statement = '250ms' to log slow statements. auto_explain can log the plan of anything slow automatically
-- The workhorse: run it, count real rows, show cache behavior
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT b.id, b.created_at, u.name, u.email
FROM bookings b
JOIN users u ON u.id = b.user_id
WHERE b.room_id = 42
  AND b.status = 'confirmed'
  AND b.created_at >= '2026-06-01'
ORDER BY b.created_at DESC
LIMIT 20;

-- Ideal shape of the output:
--  Limit  (actual time=0.05..0.09 rows=20 ...)
--    ->  Index Scan using idx_bookings_room_status_created on bookings b
--          Index Cond: (room_id = 42 AND status = 'confirmed')
--          Buffers: shared hit=24
--    ->  Index Scan using users_pkey on users u  (rows=1)

-- The supporting index:
CREATE INDEX idx_bookings_room_status_created
  ON bookings (room_id, status, created_at DESC);

-- Rank the worst offenders across the whole server:
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

Optimization Patterns

  • Keyset pagination: Avoid deep OFFSET 10000 (Postgres still reads and discards those rows). Page by the last seen sort key: WHERE (created_at, id) < ($1, $2) ORDER BY created_at DESC, id DESC LIMIT 20
  • COUNT is expensive: an exact COUNT(*) must scan matching rows because of MVCC visibility. For dashboards use reltuples from pg_class for a fast estimate, or maintain rollup counters
  • Batch writes: multi-row INSERT, or COPY for bulk loads (often 10-50x faster). Use INSERT ... ON CONFLICT DO UPDATE for upserts and MERGE (PG 15+, with RETURNING in PG 17) for complex merges
  • Kill N+1 with lateral joins: fetch a parent plus its top-N children in one round trip with LEFT JOIN LATERAL (... LIMIT n) ON true, or aggregate children into JSON with jsonb_agg
  • Select only what you need: SELECT * defeats index-only scans and ships wide TOAST columns you may not use. List columns explicitly and let INCLUDE indexes cover hot reads
  • Tune the knobs that matter: set work_mem per-session for big sorts/hashes, effective_cache_size so the planner knows your RAM, and random_page_cost = 1.1 on SSD/NVMe so index scans are costed correctly

4. Connection Pooling with PgBouncer

Every PostgreSQL connection is a separate OS process with its own memory (roughly 5-10 MB plus work_mem). A few thousand direct connections will exhaust the server long before they saturate the CPU. Because a Postgres backend is heavier than a MySQL thread, external pooling with PgBouncer is not optional at scale -- it is the standard architecture, and it is what managed pooling in RDS Proxy and Supabase run under the hood.

  • Right-size the pool: a common starting point is max_connections around 4 x CPU cores for the write path -- more processes mostly add contention, not throughput. PgBouncer multiplexes thousands of client connections onto that small server pool
  • Pooling modes: transaction mode is the sweet spot for web apps -- a server connection is handed back to the pool at the end of each transaction. session mode pins a connection for the whole client session (needed for session-level state); statement mode is the most aggressive
  • Transaction-mode caveats: you cannot rely on session state (SET, advisory locks, unnamed prepared statements, LISTEN/NOTIFY) surviving between transactions. PgBouncer 1.21+ supports protocol-level prepared statements in transaction mode via max_prepared_statements
  • Timeouts & limits: tune default_pool_size, max_client_conn, server_idle_timeout, and query_wait_timeout so bursts queue briefly instead of overwhelming Postgres or failing outright
  • App-side pool too: your driver (node-postgres pg.Pool, HikariCP, SQLAlchemy) keeps its own small pool pointing at PgBouncer, not directly at Postgres. Keep statement_timeout and idle_in_transaction_session_timeout set as safety nets
  • Read/write splitting: point reads at a physical replica and writes at the primary. Each usually gets its own PgBouncer pool, effectively multiplying capacity
# pgbouncer.ini -- transaction pooling for a typical web workload
[databases]
myapp = host=10.0.0.10 port=5432 dbname=myapp

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
default_pool_size = 20          # server conns per user/db pair
max_client_conn = 5000          # client conns PgBouncer will accept
max_prepared_statements = 200   # protocol-level prepared stmts (1.21+)
server_idle_timeout = 300
query_wait_timeout = 30
// Node.js: app pool connects to PgBouncer (:6432), not Postgres directly
import { Pool } from 'pg';

export const db = new Pool({
  host: process.env.PGBOUNCER_HOST,   // PgBouncer, port 6432
  port: 6432,
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD,
  database: 'myapp',
  max: 10,                            // small: PgBouncer does the real fan-out
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 5000,
  // Safety nets applied per backend session:
  options: '-c statement_timeout=15000 -c idle_in_transaction_session_timeout=10000'
});
The most common outage: a transaction left open ("idle in transaction") while the app waits on an HTTP call or user input. It pins a backend, blocks VACUUM from cleaning dead tuples, and can freeze writes. Always set idle_in_transaction_session_timeout, keep transactions short, and never do network I/O inside one.

5. Streaming and Logical Replication

PostgreSQL offers two complementary replication mechanisms built on the Write-Ahead Log (WAL). Physical (streaming) replication ships the raw WAL byte-for-byte to produce identical binary replicas -- the basis of high availability and read scaling. Logical replication decodes the WAL into row-level change events, letting you replicate selected tables across major versions or into other systems.

  • Streaming replication: a standby continuously receives WAL from the primary and replays it, staying binary-identical. Standbys are read-only "hot standbys" that can serve read queries while they replay
  • Async vs synchronous: async (the default) does not wait for the standby -- fast, but a crash can lose the last few transactions. synchronous_commit = on with synchronous_standby_names waits for a standby to persist WAL, trading a little latency for zero data loss on failover
  • Replication slots: a slot guarantees the primary retains WAL until a replica has consumed it, preventing "requested WAL segment already removed" errors. Guard against unbounded WAL growth with max_slot_wal_keep_size
  • Logical replication: CREATE PUBLICATION on the source and CREATE SUBSCRIPTION on the target replicate chosen tables at the row level. It works across major versions (the basis of near-zero-downtime upgrades) and can filter rows and columns
  • Monitoring lag: query pg_stat_replication on the primary and compare pg_current_wal_lsn() to each replica's replay_lsn. For read-after-write consistency, route the critical read to the primary or use synchronous_commit
  • Failover tooling: Postgres has no built-in cluster manager, so production HA uses Patroni (with etcd/Consul) or repmgr to promote a standby automatically. PostgreSQL 17 added pg_createsubscriber to convert a physical standby into a logical subscriber quickly
  • Failover slots (PG 17): logical replication slots can now be synchronized to standbys, so a subscriber keeps working after the publisher fails over -- previously a hard operational gap
  • CDC and the ecosystem: logical decoding is how Debezium, and tools like pglogical, stream Postgres changes into Kafka, data warehouses, and search indexes -- change-data-capture without dual writes
Choose by intent: use physical/streaming replication for high availability and read replicas that must be byte-identical to the primary; use logical replication for selective table replication, cross-version major upgrades with minimal downtime, and change-data-capture into other systems. Many production setups run both.

6. Backup and Recovery

  • Logical backups (pg_dump): exports one database as SQL or a compressed custom/directory archive. Portable across versions and architectures; use the directory format with -j for parallel dump/restore. pg_dumpall also captures globals (roles, tablespaces)
  • Physical base backups (pg_basebackup): a byte-level copy of the whole cluster, taken online without blocking writes. This is the foundation for both standbys and point-in-time recovery
  • Point-in-time recovery (PITR): a base backup plus archived WAL lets you restore to any moment -- e.g. one second before a bad DELETE. Set archive_mode = on and an archive_command/archive_library to ship WAL to durable storage
  • Incremental backups (PG 17): pg_basebackup --incremental copies only blocks changed since a prior backup, and pg_combinebackup reconstructs a full backup from the chain -- dramatically smaller, faster backups for large clusters
  • Use pgBackRest in production: it adds parallel, compressed, encrypted full/differential/incremental backups, retention policies, and integrity checks over raw tooling. WAL-G and Barman are strong alternatives
  • Test restores: an untested backup is not a backup. Automate a weekly restore into a scratch instance and run pg_verifybackup plus a few smoke queries. Store copies off-region for disaster recovery
# Logical: parallel directory-format dump + parallel restore
pg_dump -h db -U app -d myapp -F d -j 4 -f /backup/myapp_$(date +%F)
pg_restore -h db2 -U app -d myapp_restored -F d -j 4 /backup/myapp_2026-07-01

# Physical base backup (streams WAL so it is self-consistent)
pg_basebackup -h db -U repl -D /backup/base -F tar -z -X stream -c fast

# Incremental base backup (PG 17), then reconstruct a full copy
pg_basebackup -h db -U repl -D /backup/incr \
  --incremental=/backup/base/backup_manifest
pg_combinebackup /backup/base /backup/incr -o /backup/full

# Recommended: pgBackRest full backup + verify
pgbackrest --stanza=myapp --type=full backup
pgbackrest --stanza=myapp verify

7. Declarative Partitioning

Declarative partitioning (native since PostgreSQL 10 and steadily improved) splits one logical table into physical child partitions. Queries hit the parent; the planner performs partition pruning to touch only relevant children. The biggest win is data lifecycle management: dropping or detaching an old partition is an instant metadata operation, not a slow, bloat-generating DELETE.

  • RANGE partitioning: assign rows by a value falling in a range -- the classic choice for time-series. Partition by month or day, and retire old data with DROP/DETACH PARTITION instead of mass deletes
  • LIST partitioning: assign rows by discrete values (region, tenant, status). Great for multi-tenant isolation or per-country data residency
  • HASH partitioning: spread rows evenly across N partitions by a hash of the key. Use it to parallelize I/O when there is no natural range or list boundary
  • Sub-partitioning: partitions can themselves be partitioned (e.g. RANGE by month, then HASH by tenant) for very large multi-tenant time-series data
  • Pruning & partition-wise joins: include the partition key in the WHERE clause so the planner prunes at plan time and at execution time. Enable enable_partitionwise_join and enable_partitionwise_aggregate for large analytical queries
  • Rules & automation: a UNIQUE/PRIMARY KEY must include the partition key. Indexes created on the parent cascade to all partitions. Foreign keys referencing a partitioned table are supported in modern versions. Automate partition creation with pg_partman (and TimescaleDB for time-series)
-- RANGE partition by month, ideal for event/booking logs
CREATE TABLE booking_logs (
  id          bigint GENERATED ALWAYS AS IDENTITY,
  user_id     bigint NOT NULL,
  action      text NOT NULL,
  created_at  timestamptz NOT NULL,
  PRIMARY KEY (id, created_at)          -- must include the partition key
) PARTITION BY RANGE (created_at);

-- Index on the parent cascades to every partition
CREATE INDEX ON booking_logs (user_id, created_at);

-- Monthly partitions
CREATE TABLE booking_logs_2026_06 PARTITION OF booking_logs
  FOR VALUES FROM ('2026-06-01') TO ('2026-07-01');
CREATE TABLE booking_logs_2026_07 PARTITION OF booking_logs
  FOR VALUES FROM ('2026-07-01') TO ('2026-08-01');

-- Retire old data instantly (metadata-only, no row-by-row DELETE)
ALTER TABLE booking_logs DETACH PARTITION booking_logs_2026_06;
DROP TABLE booking_logs_2026_06;

-- Confirm pruning: only the July partition is scanned
EXPLAIN SELECT * FROM booking_logs
WHERE created_at >= '2026-07-01' AND created_at < '2026-08-01';

8. MVCC, Transactions, and Isolation Levels

PostgreSQL implements ACID transactions with Multi-Version Concurrency Control (MVCC). Instead of overwriting rows in place, an UPDATE writes a new row version (tuple) and marks the old one dead; each transaction sees a consistent snapshot based on transaction IDs (xmin/xmax). The upside: readers never block writers and writers never block readers. The trade-off: dead tuples accumulate and must be reclaimed by VACUUM (next section).

  • How MVCC works: visibility is decided by comparing a tuple's xmin/xmax against the transaction's snapshot. There is no separate rollback/undo segment as in some engines -- old versions live in the table itself until VACUUM removes them
  • READ COMMITTED (default): each statement sees a fresh snapshot of committed data. No dirty reads, but two reads in the same transaction can differ (non-repeatable reads). Correct for the vast majority of OLTP work
  • REPEATABLE READ: the whole transaction sees one snapshot taken at its first statement. In PostgreSQL this also prevents phantom reads. Concurrent conflicting writes raise a serialization failure (SQLSTATE 40001) that you retry
  • SERIALIZABLE: uses Serializable Snapshot Isolation (SSI) -- true serializability without heavyweight read locks. It tracks read/write dependencies and aborts one transaction of a dangerous cycle. You must be ready to retry on 40001
  • Explicit locking: SELECT ... FOR UPDATE locks rows for a read-modify-write; add SKIP LOCKED to build concurrent job queues, or NOWAIT to fail fast. Advisory locks give app-level mutexes decoupled from rows
  • Deadlocks & retries: PostgreSQL detects deadlocks and aborts one transaction (SQLSTATE 40P01). Both deadlocks and serialization failures are normal under contention -- wrap transactions in a retry loop (a few attempts with exponential backoff)
-- Inspect and set isolation
SHOW transaction_isolation;
BEGIN ISOLATION LEVEL REPEATABLE READ;

-- Prevent overbooking with a locking read
BEGIN;
SELECT capacity, current_bookings
  FROM rooms WHERE id = 42
  FOR UPDATE;                         -- row locked until COMMIT/ROLLBACK

UPDATE rooms SET current_bookings = current_bookings + 1 WHERE id = 42;
INSERT INTO bookings (user_id, room_id, status)
  VALUES (7, 42, 'confirmed');
COMMIT;

-- Concurrent worker queue: each worker grabs distinct rows
SELECT id FROM jobs
  WHERE status = 'pending'
  ORDER BY run_after
  FOR UPDATE SKIP LOCKED
  LIMIT 10;

-- Retry pattern (pseudocode) for 40001 / 40P01:
-- for attempt in range(5):
--   try: run_txn(); break
--   except (SerializationFailure, Deadlock): sleep(0.05 * 2**attempt)
Keep transactions short. A long-running or "idle in transaction" transaction holds an old snapshot that blocks VACUUM from reclaiming dead tuples cluster-wide, driving bloat and, in the extreme, transaction-ID wraparound risk. Never do HTTP calls, file I/O, or wait on user input inside a transaction; set idle_in_transaction_session_timeout.

9. VACUUM, Autovacuum, and Bloat

Because MVCC leaves behind dead tuples on every UPDATE and DELETE, PostgreSQL must reclaim that space and keep statistics fresh. VACUUM does the reclaiming; the autovacuum daemon runs it automatically. Understanding and tuning it is the single most important operational skill for a healthy Postgres database -- neglecting it is the number-one cause of runaway table bloat and performance cliffs.

  • What VACUUM does: marks dead tuples as reusable, updates the visibility map (which enables index-only scans), and advances the "frozen" transaction-ID horizon. Plain VACUUM runs online; VACUUM FULL rewrites the table and takes an exclusive lock -- avoid it on live tables
  • ANALYZE: refreshes the planner's statistics. Autovacuum runs it too, but after big data loads or migrations run ANALYZE manually so the planner does not make decisions on stale numbers
  • Tune autovacuum per table: the defaults are conservative for large tables. Lower autovacuum_vacuum_scale_factor (e.g. 0.02) on hot tables so cleanup triggers sooner, and raise autovacuum_vacuum_cost_limit so it keeps up. Set these per-table with ALTER TABLE ... SET (...)
  • Prevent XID wraparound: a wraparound-prevention (anti-wraparound) autovacuum is mandatory and cannot be skipped. Watch datfrozenxid/age(); a long-open transaction or a stuck replication slot that holds back the horizon can push you toward emergency read-only mode. PostgreSQL 17 rewrote VACUUM's memory (TID store) so it uses far less RAM and rarely needs multiple passes
  • Measure bloat: monitor pg_stat_user_tables (n_dead_tup, last_autovacuum) and use pgstattuple for precise figures. Reclaim bloat online with pg_repack or VACUUM (FULL, ...) during a maintenance window; keep fillfactor headroom on update-heavy tables to enable HOT updates
-- Manual maintenance (parallel index vacuuming available on big tables)
VACUUM (ANALYZE, VERBOSE) bookings;

-- Tune autovacuum for a hot, high-churn table
ALTER TABLE bookings SET (
  autovacuum_vacuum_scale_factor = 0.02,   -- vacuum after 2% dead tuples
  autovacuum_vacuum_cost_limit   = 2000,   -- let it do more work per round
  fillfactor                     = 90      -- leave room for HOT updates
);

-- Find the tables that need attention
SELECT relname, n_live_tup, n_dead_tup,
       round(n_dead_tup * 100.0 / NULLIF(n_live_tup, 0), 1) AS dead_pct,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;

-- Watch transaction-ID age to stay clear of wraparound
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database ORDER BY xid_age DESC;
In production, tuning autovacuum per-table and alerting on n_dead_tup and XID age eliminated the slow, mysterious latency creep that bloat causes. The rule of thumb: if a table is hot, make its autovacuum more aggressive, not less -- and never let a forgotten idle transaction or unused replication slot stall the cleanup horizon.

10. JSONB, Window Functions, and CTEs

JSONB

JSONB is PostgreSQL's binary, indexable JSON type -- one of the features that lets a single Postgres instance serve both relational and document workloads. It validates on write, stores a decomposed binary form for fast key access, supports powerful operators, and can be indexed with GIN. Reach for it when a row's shape genuinely varies; keep frequently-queried, constrained fields as real columns.

  • jsonb vs json: jsonb is parsed once into binary (fast access, supports indexing, dedupes keys); json keeps the exact input text and reparses on every read. Use jsonb for almost everything
  • Accessing values: -> returns jsonb, ->> returns text, #>/#>> take a path array, and the SQL/JSON path operator @? / jsonb_path_query runs JSONPath expressions
  • Containment & indexing: the @> containment operator plus a GIN index (jsonb_path_ops) makes metadata @> '{"plan":"pro"}' index-backed. Index a single hot key with an expression index instead
  • Building & transforming: jsonb_build_object, jsonb_agg, jsonb_set, jsonb_path_query, and (PG 17) SQL/JSON JSON_TABLE to expand JSON into relational rows -- and standard constructors like JSON_OBJECT/JSON_QUERY
-- JSONB column with a GIN index for containment queries
CREATE TABLE users (
  id       bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  email    text NOT NULL UNIQUE,
  metadata jsonb NOT NULL DEFAULT '{}'::jsonb
);
CREATE INDEX idx_users_metadata ON users USING gin (metadata jsonb_path_ops);

INSERT INTO users (email, metadata) VALUES
('[email protected]',
 '{"plan":"pro","features":["analytics","exports"],"trial_ends":"2026-08-01"}');

-- Containment query (uses the GIN index)
SELECT email FROM users WHERE metadata @> '{"plan":"pro"}';

-- Extract a scalar as text
SELECT email, metadata->>'plan' AS plan FROM users;

-- Update a nested key immutably
UPDATE users SET metadata = jsonb_set(metadata, '{plan}', '"enterprise"')
WHERE id = 1;

-- Expand a JSON array into rows with JSON_TABLE (PG 17)
SELECT u.email, f.feature
FROM users u,
     JSON_TABLE(u.metadata, '$.features[*]'
       COLUMNS (feature text PATH '$')) AS f;

Window Functions

Window functions compute across a set of rows related to the current row without collapsing them like GROUP BY does. PostgreSQL has had a complete, standard-compliant implementation for years -- including full frame clauses (ROWS/RANGE/GROUPS) and named windows -- making it excellent for rankings, running totals, and comparative analytics.

  • ROW_NUMBER(): a unique sequential number per partition. The clean way to do top-N-per-group, deduplication, and stable pagination
  • RANK() / DENSE_RANK(): ranking with ties -- RANK leaves gaps after ties, DENSE_RANK does not. PERCENT_RANK and NTILE bucket the distribution
  • LAG() / LEAD(): read values from previous/next rows for period-over-period deltas and gap detection
  • Aggregates as windows: SUM() OVER (...), AVG() OVER (...), and a frame clause give running totals and moving averages without self-joins
-- Top 3 most active users per room (by booking count)
SELECT name, room_name, booking_count
FROM (
  SELECT u.name, r.name AS room_name,
         COUNT(b.id) AS booking_count,
         ROW_NUMBER() OVER (PARTITION BY r.id
                            ORDER BY COUNT(b.id) DESC) AS rn
  FROM bookings b
  JOIN users u ON u.id = b.user_id
  JOIN rooms r ON r.id = b.room_id
  WHERE b.created_at >= '2026-01-01'
  GROUP BY u.id, u.name, r.id, r.name
) ranked
WHERE rn <= 3;

-- Month-over-month growth with LAG over a monthly aggregate
SELECT month,
       bookings,
       LAG(bookings) OVER w AS prev_month,
       round((bookings - LAG(bookings) OVER w) * 100.0
             / NULLIF(LAG(bookings) OVER w, 0), 1) AS growth_pct
FROM (
  SELECT date_trunc('month', created_at) AS month, COUNT(*) AS bookings
  FROM bookings GROUP BY 1
) m
WINDOW w AS (ORDER BY month)
ORDER BY month;

Common Table Expressions (CTEs)

CTEs (the WITH clause) define named, temporary result sets for a single query. They make complex logic readable, enable recursion for hierarchical data, and -- crucially in Postgres -- support data-modifying CTEs that run INSERT/UPDATE/DELETE with RETURNING inside one atomic statement.

  • Non-recursive CTEs: named building blocks that flatten nested subqueries. Since PostgreSQL 12 the planner inlines a CTE by default when it is referenced once and has no side effects -- add MATERIALIZED to force a one-time computation, or NOT MATERIALIZED to force inlining
  • Recursive CTEs: WITH RECURSIVE traverses trees and graphs (org charts, category trees, bill-of-materials). Add a depth guard or UNION (not UNION ALL) to stop cycles; PG 14+ supports CYCLE and SEARCH clauses
  • Data-modifying CTEs: chain writes in one statement -- e.g. move rows to an archive and delete them atomically -- using RETURNING to pipe results between steps
-- Readable multi-step reporting; force materialization once
WITH monthly AS MATERIALIZED (
  SELECT date_trunc('month', created_at) AS month, room_id, COUNT(*) AS total
  FROM bookings
  WHERE status = 'completed'
  GROUP BY 1, room_id
),
ranked AS (
  SELECT month, room_id, total,
         RANK() OVER (PARTITION BY month ORDER BY total DESC) AS rnk
  FROM monthly
)
SELECT month, room_id, total FROM ranked WHERE rnk <= 5;

-- Recursive CTE: category tree with indentation
WITH RECURSIVE category_tree AS (
  SELECT id, name, parent_id, 0 AS depth
  FROM categories WHERE parent_id IS NULL
  UNION ALL
  SELECT c.id, c.name, c.parent_id, ct.depth + 1
  FROM categories c
  JOIN category_tree ct ON c.parent_id = ct.id
)
SELECT repeat('  ', depth) || name AS tree_view
FROM category_tree ORDER BY depth, name;

-- Data-modifying CTE: archive and delete atomically
WITH moved AS (
  DELETE FROM bookings
  WHERE status = 'completed' AND created_at < '2025-01-01'
  RETURNING *
)
INSERT INTO bookings_archive SELECT * FROM moved;

11. pgvector: Embeddings and RAG

Storing and Querying Vectors

pgvector is the extension that turns PostgreSQL into a production vector database -- the backbone of semantic search and Retrieval-Augmented Generation (RAG). Instead of bolting on a separate system like Pinecone or Milvus, you store embeddings next to your relational data and combine similarity search with ordinary SQL filters, joins, and transactions in one query. As of 2026 the current release is pgvector 0.8.x.

  • Types: vector(n) stores single-precision floats; halfvec(n) uses 2-byte floats to halve storage and index size (and index up to 4000 dims); bit(n) supports binary/Hamming and sparsevec handles sparse vectors
  • Distance operators: <=> cosine, <-> L2/Euclidean, <#> negative inner product, <+> L1/taxicab. Match the operator to how your embedding model was trained (cosine for most text models)
  • Generate embeddings in your pipeline: compute vectors with your model of choice (e.g. an OpenAI, Cohere, or a local model), then insert them. Dimensions must match the column: 1536 or 3072 for common OpenAI models, 768/1024 for many open models
-- Enable the extension and create an embeddings table
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
  id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  title      text NOT NULL,
  content    text NOT NULL,
  source     text NOT NULL,
  embedding  vector(1536) NOT NULL,      -- match your model's dimension
  created_at timestamptz NOT NULL DEFAULT now()
);

-- Insert a chunk plus its precomputed embedding
INSERT INTO documents (title, content, source, embedding)
VALUES ('PgBouncer basics', 'Transaction pooling ...', 'guide',
        '[0.0123, -0.0456, 0.0789, ...]');   -- 1536 floats from your model

-- k-NN semantic search by cosine distance
SELECT id, title, 1 - (embedding <=> $1) AS cosine_similarity
FROM documents
ORDER BY embedding <=> $1               -- $1 = the query embedding
LIMIT 10;

HNSW Indexing and RAG Patterns

  • HNSW indexes: the default choice -- a graph index with excellent recall/latency and no training step. Tune build with m and ef_construction; tune query recall at runtime with SET hnsw.ef_search. IVFFlat is smaller and faster to build but must be built after data loads and needs a good lists value
  • Filtered & hybrid search: combine vector distance with plain WHERE filters (tenant, date, language). pgvector 0.8's iterative index scans keep recall high even with selective filters. For hybrid search, blend semantic distance with full-text ts_rank or trigram scores -- all in one SQL statement
  • Scaling: use halfvec to cut index memory, quantize or reduce dimensions when your model allows, and consider the pgvectorscale extension (StreamingDiskANN) for very large corpora that exceed RAM
-- HNSW index for cosine distance (pick the op class that matches your operator)
CREATE INDEX ON documents
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Raise recall at query time (per-session)
SET hnsw.ef_search = 100;

-- RAG retrieval: filter by tenant/recency, THEN rank by similarity
SELECT id, title, content
FROM documents
WHERE source = 'guide'
  AND created_at >= now() - interval '365 days'
ORDER BY embedding <=> $1
LIMIT 8;

-- Hybrid search: blend semantic similarity with keyword relevance
SELECT id, title,
       0.6 * (1 - (embedding <=> $1))
     + 0.4 * ts_rank(to_tsvector('english', content),
                     plainto_tsquery('english', $2)) AS score
FROM documents
WHERE to_tsvector('english', content) @@ plainto_tsquery('english', $2)
ORDER BY score DESC
LIMIT 10;
Keep embedding dimensions consistent with the model that produced them -- a mismatched vector(n) insert fails, and mixing embeddings from different models in one column silently ruins relevance. Re-embed the whole corpus when you change models, and rebuild the HNSW index afterward.

12. Extensions: PostGIS, TimescaleDB, and More

PostgreSQL's extension system is its superpower: a single CREATE EXTENSION adds new types, index methods, functions, and even background workers without forking the engine. This is why one Postgres instance can cover relational, geospatial, time-series, vector, and full-text workloads that would otherwise need several specialized databases.

  • PostGIS: the gold-standard geospatial extension. Adds geometry/geography types, thousands of spatial functions (ST_DWithin, ST_Intersects, ST_Distance), and GiST/SP-GiST spatial indexes. Powers "find everything within N km" queries natively -- current series is PostGIS 3.5
  • TimescaleDB: turns Postgres into a time-series database via hypertables (transparent auto-partitioning by time), native columnar compression, continuous aggregates (incrementally-maintained rollups), and retention policies -- ideal for metrics, IoT, and events
  • pgvector & friends: vector search (covered above), plus pgvectorscale for billion-scale ANN and pg_search/ParadeDB for BM25 full-text -- the building blocks of AI features inside your primary database
  • Bundled essentials: pg_stat_statements (query telemetry), pg_trgm (fuzzy/substring search), citext (case-insensitive text), uuid-ossp, hstore, postgres_fdw (query remote databases), and pgcrypto ship with core
  • Operational add-ons: pg_partman automates partition lifecycle, pg_cron runs scheduled SQL jobs inside the database, and pg_repack removes bloat online without long locks
  • Availability caveat: some extensions must be preloaded via shared_preload_libraries and a restart, and managed platforms (RDS, Cloud SQL, Azure) allow only a curated allow-list -- check support before designing around one. Vendor-neutral extensions keep you portable
-- PostGIS: geospatial "what's nearby" query
CREATE EXTENSION IF NOT EXISTS postgis;

CREATE TABLE gyms (
  id       bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  name     text NOT NULL,
  geom     geography(Point, 4326) NOT NULL   -- lon/lat on the WGS84 sphere
);
CREATE INDEX idx_gyms_geom ON gyms USING gist (geom);

-- All gyms within 5 km of a point, nearest first
SELECT name, ST_Distance(geom, ST_MakePoint(-74.07, 4.65)::geography) AS meters
FROM gyms
WHERE ST_DWithin(geom, ST_MakePoint(-74.07, 4.65)::geography, 5000)
ORDER BY meters
LIMIT 20;
-- TimescaleDB: hypertable + continuous aggregate for metrics
CREATE EXTENSION IF NOT EXISTS timescaledb;

CREATE TABLE metrics (
  ts       timestamptz NOT NULL,
  device   text NOT NULL,
  value    double precision NOT NULL
);
SELECT create_hypertable('metrics', 'ts');   -- auto time-partitioning

-- Incrementally-maintained hourly rollup
CREATE MATERIALIZED VIEW metrics_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', ts) AS bucket,
       device,
       avg(value) AS avg_value,
       max(value) AS max_value
FROM metrics
GROUP BY bucket, device;

-- Compress and retain
ALTER TABLE metrics SET (timescaledb.compress, timescaledb.compress_segmentby = 'device');
SELECT add_retention_policy('metrics', INTERVAL '180 days');

13. Real-World Experience

I run PostgreSQL in production across personal and client systems -- from a managed primary with read replicas to self-hosted instances powering AI features. The recurring theme: Postgres scales further than people expect when you respect MVCC, keep autovacuum healthy, and pool connections properly.

  • Postgres as the single source of truth: primary plus streaming read replicas, WAL archived for point-in-time recovery, and pgBackRest for compressed, verified backups. Logical replication handled a major-version upgrade with minimal downtime
  • Connection pooling: put PgBouncer in transaction mode in front of the primary. Thousands of client connections collapsed onto a small server pool, and "too many connections" incidents went to zero
  • Query optimization: used pg_stat_statements and EXPLAIN (ANALYZE, BUFFERS) to find the worst queries, added covering and partial indexes, and fixed N+1 patterns with lateral joins -- cutting P95 latency dramatically
  • VACUUM discipline: tuned per-table autovacuum on hot tables and alerted on dead-tuple ratio and transaction-ID age. This eliminated the slow bloat-driven latency creep that had been mistaken for "the database getting old"
  • pgvector for RAG: built retrieval for AI features directly in Postgres with pgvector and HNSW indexes -- embeddings, metadata filters, and business joins in one query, no separate vector database to operate
Book free 1-hour consult All Guides Home

14. PostgreSQL 17/18 and Postgres vs MySQL

Version and Release Model

PostgreSQL ships one major version per year (typically late September) and supports each for five years. As of mid-2026, PostgreSQL 18 (released September 2025) is the current stable release and PostgreSQL 17 (September 2024) is the widely-deployed previous major. There is no separate "LTS" -- every major is production-grade and gets quarterly minor patch releases with bug and security fixes.

  • Pick a recent major: run PostgreSQL 17 or 18 for new deployments. Both receive regular minor releases; always apply the latest minor (e.g. 18.x) promptly, since minors are patch-only and safe to adopt
  • Five-year support window: each major is supported for about five years from its release. Plan a major upgrade before your version reaches end-of-life to keep getting security fixes
  • Upgrade paths: use pg_upgrade for fast in-place major upgrades, or logical replication (pg_createsubscriber in PG 17 makes this easier) for near-zero-downtime upgrades. Always test extension compatibility first
Check your version's end-of-life. Following the five-year policy, PostgreSQL 13 reached end-of-life in November 2025 and PostgreSQL 14 follows in late 2026. If you are on an unsupported major you get no more security patches -- schedule an upgrade to 17 or 18, validate your extensions (pgvector, PostGIS, TimescaleDB) against the target version, and rehearse the cutover on a copy first.

PostgreSQL vs MySQL

Both are mature, ACID, open-source relational databases, and for simple CRUD either is fine. PostgreSQL pulls ahead when you need rich types and extensibility: native JSONB with GIN indexing, arrays, ranges with exclusion constraints, a deep extension ecosystem (pgvector, PostGIS, TimescaleDB), true partial/expression indexes, and multiple index methods (GIN/GiST/BRIN). MySQL's InnoDB clusters the table on the primary key (fast PK lookups, cheaper secondary-index-free reads) and has a huge managed-hosting footprint. The honest summary: choose PostgreSQL when correctness, complex queries, and AI/analytics features matter; MySQL is a reasonable default for straightforward, read-heavy web apps already invested in its ecosystem.

-- A one-statement capability that showcases Postgres's type system:
-- prevent overlapping room reservations at the DB level (MySQL cannot express this)
CREATE EXTENSION IF NOT EXISTS btree_gist;

ALTER TABLE bookings
  ADD CONSTRAINT no_overlap
  EXCLUDE USING gist (room_id WITH =, during WITH &&)
  WHERE (status = 'confirmed');

-- Concurrency model differs too:
--  * PostgreSQL default isolation = READ COMMITTED; MVCC keeps old row versions
--    in the heap (reclaimed by VACUUM).
--  * MySQL/InnoDB default isolation = REPEATABLE READ; old versions live in the
--    undo log. Postgres SERIALIZABLE uses SSI (no read locks).

PostgreSQL 17 Highlights (September 2024)

PostgreSQL 17 focused on operational scale and SQL/JSON. It added native incremental base backups (pg_basebackup --incremental with pg_combinebackup), a rewritten VACUUM memory manager (a TID store that uses far less RAM), streaming I/O for faster sequential scans, and major logical-replication improvements including failover slots and pg_createsubscriber. On the SQL side it brought the standard JSON_TABLE function, more MERGE capabilities (including RETURNING and WHEN NOT MATCHED BY SOURCE), and COPY ... ON_ERROR ignore for resilient bulk loads.

-- PG 17: MERGE with RETURNING and match-source awareness
MERGE INTO inventory AS t
USING shipment AS s ON t.sku = s.sku
WHEN MATCHED THEN UPDATE SET qty = t.qty + s.qty
WHEN NOT MATCHED THEN INSERT (sku, qty) VALUES (s.sku, s.qty)
WHEN NOT MATCHED BY SOURCE THEN UPDATE SET qty = 0
RETURNING merge_action(), t.sku, t.qty;

-- PG 17: tolerate bad rows during a bulk COPY instead of aborting
COPY events FROM '/data/events.csv' WITH (FORMAT csv, HEADER, ON_ERROR ignore);

PostgreSQL 18 (September 2025)

PostgreSQL 18 is the current stable major in 2026. Its headline feature is a new asynchronous I/O (AIO) subsystem that can substantially speed up reads (sequential scans, bitmap heap scans, and VACUUM) on cloud storage. It also adds native UUIDv7 generation (uuidv7()) for time-ordered keys, virtual generated columns as the new default, B-tree skip scan for multicolumn indexes, temporal PRIMARY KEY/UNIQUE ... WITHOUT OVERLAPS constraints, RETURNING access to both OLD and NEW rows in DML, and OAuth-based authentication.

The practical upgrade story keeps improving: pg_upgrade now preserves planner statistics, so a major upgrade no longer starts with a slow, blind optimizer while you re-run ANALYZE. Combined with the AIO subsystem and skip scans, PostgreSQL 18 is a meaningful performance jump for I/O-bound and analytical workloads -- and, as always, the vast extension ecosystem (pgvector, PostGIS, TimescaleDB) tracks each new major closely.

More Guides