PostgreSQL — Topic 9: Indexing

9.1

BTree — The Default Index

Balanced tree · equality · range · sort

▾

What is a BTree?

BTree stands for Balanced Tree. It keeps all leaf nodes at the same depth, so every lookup takes the same number of steps regardless of which key you search for. PostgreSQL builds a BTree by default when you write CREATE INDEX.

The tree has three levels: a root node at the top, internal nodes in the middle that act as routers, and leaf nodes at the bottom that contain the actual (key → heap pointer) pairs pointing to rows in the table.

BTree Structure Visualized

Root

50

20 | 35

65 | 80

Leaf nodes (sorted, linked)

10

→ page 3

20

→ page 1

35

→ page 7

50

→ page 2

65

→ page 5

80

→ page 4

95

→ page 9

Leaf nodes are also doubly-linked — range scans just walk the leaves without going back to the root.

Equality Search

Looking for WHERE user_id = 42 — PostgreSQL traverses from root → internal → leaf in O(log n) steps. For a table with 1 million rows a BTree is typically only 3–4 levels deep.

-- Create a BTree index (default when you omit USING)
CREATE INDEX idx_users_email ON users (email);

-- This query will now use the index
SELECT * FROM users WHERE email = 'alice@example.com';

-- Explicit USING BTREE (same as default)
CREATE INDEX idx_orders_user ON orders USING BTREE (user_id);

Range Search

BTree excels at range queries because leaf nodes are sorted and linked. PostgreSQL finds the starting point with O(log n) then walks leaf-to-leaf.

-- Range queries — BTree handles all of these
SELECT * FROM orders WHERE created_at >= '2024-01-01' AND created_at < '2024-02-01';

SELECT * FROM products WHERE price BETWEEN 100 AND 500;

SELECT * FROM users WHERE name >= 'A' AND name < 'B';   -- names starting with A

-- Operators supported by BTree:
--   =   <   <=   >   >=   BETWEEN   IN   IS NULL   LIKE 'abc%' (prefix only)

⚠️

LIKE gotcha: BTree supports LIKE 'abc%' (prefix match) but NOT LIKE '%abc' (suffix) or LIKE '%abc%' (contains). For those use Full Text Search or GIN/pg_trgm.

Real-World Example — E-Commerce Orders

CREATE TABLE orders (
  order_id   BIGSERIAL PRIMARY KEY,
  user_id    BIGINT       NOT NULL,
  status     TEXT         NOT NULL,  -- 'pending','paid','shipped','cancelled'
  total      NUMERIC(10,2),
  created_at TIMESTAMPTZ  DEFAULT NOW()
);

-- Index on user_id to find orders for a specific user
CREATE INDEX idx_orders_user_id  ON orders (user_id);

-- Index on created_at for date-range dashboard queries
CREATE INDEX idx_orders_created  ON orders (created_at);

-- These now use the indexes:
SELECT * FROM orders WHERE user_id = 1001;
SELECT * FROM orders WHERE created_at >= '2025-01-01';

How BTree Helps Sorting (ORDER BY)

Because BTree leaves are already sorted, if you ORDER BY a column that has a BTree index, PostgreSQL can scan the index in order and skip the sort step entirely.

-- Without index: scan all rows, sort in memory (expensive)
SELECT * FROM orders ORDER BY created_at DESC LIMIT 20;

-- With index on created_at: index scan in reverse order, no sort needed ✓
CREATE INDEX idx_orders_created ON orders (created_at DESC);
SELECT * FROM orders ORDER BY created_at DESC LIMIT 20;

9.2

Hash Index

Hash table · equality only · no ranges

▾

How a Hash Index Works

A Hash index computes hash(value) and stores it in a hash bucket. To look up a value, PostgreSQL hashes it and goes straight to that bucket — O(1) lookup. It is faster than BTree for pure equality because it skips the O(log n) tree traversal.

Hash Index Structure

  Column value       hash()            Bucket
  ──────────────   ──────────────    ──────────────────────
  'alice@x.com'  →  hash = 0x3F7A  →  Bucket 12 → row ptr
  'bob@y.com'    →  hash = 0x1A2C  →  Bucket  7 → row ptr
  'carol@z.com'  →  hash = 0x3F7A  →  Bucket 12 → row ptr  ← collision, chained

When to Use Hash

✓ Use Hash

-- Pure equality lookups
SELECT * FROM sessions
WHERE session_token = 'abc123';

-- UUID lookups
SELECT * FROM users
WHERE id = 'uuid-here';

-- High cardinality, equality only

✗ Don't Use Hash

-- Range queries — FAILS to use hash
WHERE price > 100

-- Sorting — hash can't help
ORDER BY created_at

-- Pattern match — hash can't help
WHERE email LIKE 'alice%'

Example

-- Hash index: use USING HASH explicitly
CREATE INDEX idx_sessions_token ON sessions USING HASH (session_token);

-- This equality query will use the hash index
SELECT user_id, expires_at
FROM   sessions
WHERE  session_token = 'eyJhbGci...';

-- Practical note: In practice, BTree is almost always preferred
-- because it also handles ranges and sorts. Hash wins only when
-- equality is guaranteed to be the ONLY access pattern.

ℹ️

Pro tip: Hash indexes were not WAL-logged before PostgreSQL 10, so they were unreliable after crashes. Since Pg 10 they are safe. Still, most engineers default to BTree because a BTree on a high-cardinality column is nearly as fast for equality and more versatile.

9.3

GIN — Generalized Inverted Index

JSONB · Arrays · Full Text Search

▾

What is an Inverted Index?

In a normal (forward) index: row → values. In an inverted index: value → rows. Think of how a book index works — "PostgreSQL" → pages 3, 17, 42. GIN builds this kind of index for composite values (arrays, JSONB documents, text tokens).

GIN Inverted Index — tags column (array)

  Row data:                         GIN index (inverted):
  ──────────────────────────────    ──────────────────────────────
  row 1: tags = {postgres, sql}  →  "postgres" → {row 1, row 3}
  row 2: tags = {redis, nosql}   →  "sql"      → {row 1}
  row 3: tags = {postgres, aws}  →  "redis"    → {row 2}
                                    "nosql"    → {row 2}
                                    "aws"      → {row 3}

GIN on JSONB

JSONB is the most common use case. GIN indexes every key and value inside the JSON document so you can query nested fields without a sequential scan.

-- Example: Products with a JSONB metadata column
CREATE TABLE products (
  id       BIGSERIAL PRIMARY KEY,
  name     TEXT,
  metadata JSONB   -- {"brand":"Nike","size":"L","color":"red","tags":["sports","sale"]}
);

-- GIN index on the whole JSONB column
CREATE INDEX idx_products_meta ON products USING GIN (metadata);

-- Now these queries use the GIN index:

-- Does JSON contain a key-value?
SELECT * FROM products WHERE metadata @> '{"brand":"Nike"}';

-- Does JSON contain this top-level key?
SELECT * FROM products WHERE metadata ? 'brand';

-- Does array field contain this element?
SELECT * FROM products WHERE metadata -> 'tags' @> '"sale"';

GIN on Arrays

CREATE TABLE posts (
  id    BIGSERIAL PRIMARY KEY,
  title TEXT,
  tags  TEXT[]   -- array of tag strings
);

CREATE INDEX idx_posts_tags ON posts USING GIN (tags);

-- Contains operator @> — does the tags array contain 'postgres'?
SELECT * FROM posts WHERE tags @> ARRAY['postgres'];

-- Overlap operator && — does tags array share any element with this array?
SELECT * FROM posts WHERE tags && ARRAY['postgres', 'sql'];

GIN for Full Text Search

CREATE TABLE articles (
  id      BIGSERIAL PRIMARY KEY,
  title   TEXT,
  body    TEXT,
  tsv     TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', title || ' ' || body)) STORED
);

-- GIN on the pre-computed tsvector column (fastest)
CREATE INDEX idx_articles_tsv ON articles USING GIN (tsv);

-- Full text search now uses the GIN index
SELECT title, ts_rank(tsv, query) AS rank
FROM   articles, to_tsquery('english', 'postgresql & index') query
WHERE  tsv @@ query
ORDER BY rank DESC;

⚠️

GIN trade-off: GIN indexes are larger and slower to write than BTree (each insert decomposes the value into tokens and updates multiple posting lists). Use fastupdate=on (default) which batches writes. For read-heavy workloads this is a non-issue.

9.4

GiST — Generalized Search Tree

Geospatial · Ranges · Similarity Search

▾

What is GiST?

GiST is a pluggable index framework. Instead of hard-coding a single data structure, GiST lets data types define their own "split" and "consistency" functions. The tree shape is like a BTree but keys can be any data type — bounding boxes, ranges, text similarity scores, etc.

GiST is used heavily by the PostGIS extension for geospatial data, and by built-in range types.

GiST for Geospatial (PostGIS)

-- PostGIS extension for geospatial data
CREATE EXTENSION IF NOT EXISTS postgis;

CREATE TABLE locations (
  id      BIGSERIAL PRIMARY KEY,
  name    TEXT,
  coords  GEOMETRY(POINT, 4326)   -- latitude/longitude
);

-- GiST index on the geometry column
CREATE INDEX idx_locations_coords ON locations USING GIST (coords);

-- Find all restaurants within 5km of a point
SELECT name,
       ST_Distance(coords, ST_MakePoint(77.5946, 12.9716)::geography) AS dist_m
FROM   locations
WHERE  ST_DWithin(coords::geography,
                  ST_MakePoint(77.5946, 12.9716)::geography,
                  5000)   -- 5000 metres
ORDER BY dist_m;

-- The GiST index makes this fast using bounding-box pruning

GiST for Range Types

PostgreSQL has built-in range types (daterange, tsrange, int4range). GiST indexes them and supports overlap, contains, and adjacent queries.

CREATE TABLE hotel_bookings (
  id       BIGSERIAL PRIMARY KEY,
  room_id  INT,
  booked   DATERANGE   -- e.g. [2025-01-10, 2025-01-15)
);

CREATE INDEX idx_bookings_dates ON hotel_bookings USING GIST (booked);

-- Overlapping range query — find all bookings that overlap [Jan 12, Jan 14)
SELECT * FROM hotel_bookings
WHERE  booked && '[2025-01-12, 2025-01-14)'::daterange;

-- This powers the EXCLUDE constraint for no double-booking:
ALTER TABLE hotel_bookings
  ADD CONSTRAINT no_overlap
  EXCLUDE USING GIST (room_id WITH =, booked WITH &&);

GiST for Similarity Search (pg_trgm)

-- pg_trgm: trigram similarity for fuzzy text search
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE TABLE products (id SERIAL, name TEXT);

-- GiST trigram index allows LIKE '%abc%' and similarity() queries
CREATE INDEX idx_products_name_trgm ON products USING GIST (name gist_trgm_ops);

-- Fuzzy search: find names similar to 'postgress' (typo)
SELECT name, similarity(name, 'postgress') AS sim
FROM   products
WHERE  name % 'postgress'   -- % = similarity threshold (default 0.3)
ORDER BY sim DESC;

-- LIKE with wildcard (both sides) — works with trgm index
SELECT * FROM products WHERE name LIKE '%ostgre%';

💡

GiST vs GIN for pg_trgm: Both support trigrams. GIN is faster to read (better for search). GiST is faster to write and supports the <-> distance operator (nearest-neighbor). For autocomplete or typo search, GIN is usually the right choice.

9.5

BRIN — Block Range Index

Large tables · Time-series · Append-only data

▾

How BRIN Works

BRIN doesn't store a pointer for every row. Instead it divides the table into block ranges (groups of pages, default 128 pages) and stores the min and max value of the indexed column in that range. To answer a query, PostgreSQL checks which block ranges might contain the value and only reads those pages.

BRIN on an append-only events table (created_at)

  Pages 1–128    min=2024-01-01  max=2024-03-31  ← query Jan asks: maybe here?
  Pages 129–256  min=2024-04-01  max=2024-06-30  ← query Jan asks: skip!
  Pages 257–384  min=2024-07-01  max=2024-09-30  ← skip!
  Pages 385–512  min=2024-10-01  max=2024-12-31  ← skip!

  Index size: 4 rows   vs   BTree: millions of rows

BRIN is extremely small (a few KB even for billion-row tables) but it's a coarse filter — it can't pinpoint individual rows, only narrow down which pages to scan.

When to Use BRIN

✓ Great fit

Append-only tables (logs, events, metrics) where new rows have naturally increasing values

✓ Great fit

Time-series data where created_at correlates with physical storage order

✗ Bad fit

Random-order columns (UUIDs, hashed values) — min/max ranges would span the whole table

✗ Bad fit

Small tables — a sequential scan is fast enough; BRIN overhead isn't worth it

Example — Sensor Readings Table

-- IoT sensor data: billions of rows, append-only
CREATE TABLE sensor_readings (
  id         BIGSERIAL PRIMARY KEY,
  sensor_id  INT,
  value      FLOAT,
  recorded_at TIMESTAMPTZ DEFAULT NOW()
);

-- BRIN index: tiny size, fast to build, perfect for time-series range queries
CREATE INDEX idx_sensor_brin ON sensor_readings USING BRIN (recorded_at);

-- Pages_per_range controls granularity (smaller = more precise, bigger index)
CREATE INDEX idx_sensor_brin_fine ON sensor_readings
  USING BRIN (recorded_at) WITH (pages_per_range = 32);

-- This query uses the BRIN index to skip irrelevant page ranges
SELECT AVG(value) FROM sensor_readings
WHERE  recorded_at >= '2025-01-01' AND recorded_at < '2025-02-01'
AND    sensor_id = 42;

📏

Size comparison: For a 100M row table, a BTree on created_at might be ~2GB. A BRIN on the same column would be under 1MB. The trade-off is that BRIN requires a sequential scan of qualifying block ranges, not direct row access.

9.6

Composite Indexes

Multi-column indexes · Column ordering · Selectivity

▾

What is a Composite Index?

A composite (multi-column) index spans multiple columns. PostgreSQL sorts by the first column, then by the second within ties, then the third, and so on — just like sorting a phone book by last name, then first name.

-- Composite index on (user_id, created_at)
CREATE INDEX idx_orders_user_date ON orders (user_id, created_at);

-- ✓ Uses index: prefix match (user_id is the leading column)
SELECT * FROM orders WHERE user_id = 42;

-- ✓ Uses index: both columns
SELECT * FROM orders WHERE user_id = 42 AND created_at >= '2025-01-01';

-- ✗ CANNOT use index: created_at is not the leading column
SELECT * FROM orders WHERE created_at >= '2025-01-01';
-- (for this you'd need a separate index on created_at)

Column Ordering — The Most Important Rule

Put the most selective column first — the one that filters out the most rows. But also consider which columns appear in your WHERE clauses most often and together.

📐

The Left-Prefix Rule: A composite index on (A, B, C) can serve queries on (A), (A, B), and (A, B, C). It cannot serve queries on (B) or (C) alone, or on (B, C). Always put the column you filter on most in position 1.

-- Example: orders table queried like:
--   WHERE user_id = ?                        (frequent)
--   WHERE user_id = ? AND status = ?         (frequent)
--   WHERE user_id = ? AND created_at > ?     (frequent)
--   WHERE status = ?                         (rare, dashboard)

-- Good: user_id first (most selective in user context)
CREATE INDEX idx_orders_user_status ON orders (user_id, status);
CREATE INDEX idx_orders_user_date   ON orders (user_id, created_at);

-- Rare status-only query: either add separate index or accept seq scan
CREATE INDEX idx_orders_status ON orders (status) WHERE status != 'cancelled';
-- ^ partial index (explained in 9.8)

Selectivity

Selectivity is what fraction of rows a condition matches. High selectivity = filters many rows = good for indexes. Low selectivity = keeps most rows = index often not helpful.

Column	Distinct Values	Selectivity	Good for Index?
`user_id`	1,000,000	High	Yes
`email`	1,000,000	Very High	Yes
`status`	4 values	Low	Alone: No
`is_deleted`	2 values	Very Low	Alone: No
`country`	~200	Medium	Composite: Maybe

9.7

Partial Indexes

Filtered indexes · Index only a subset of rows

▾

What is a Partial Index?

A partial index includes only the rows that satisfy a WHERE condition. This makes the index smaller, faster to build, and faster to search because it only covers the rows you actually care about.

Think of it as an index that only knows about a filtered subset of the table.

Classic Use Cases

─────────────────────────────────────────────────────────────
 Use case 1: Active records only
 Orders are mostly 'cancelled' or 'delivered'. You only query
 'pending' or 'processing' orders on your dashboard.
─────────────────────────────────────────────────────────────
CREATE INDEX idx_orders_active
  ON orders (created_at)
  WHERE status IN ('pending', 'processing');

-- This query uses the partial index (condition matches)
SELECT * FROM orders
WHERE  status = 'pending'
AND    created_at < NOW() - '1 hour'::interval;

─────────────────────────────────────────────────────────────
 Use case 2: Soft deletes — never query deleted rows
─────────────────────────────────────────────────────────────
CREATE INDEX idx_users_email_active
  ON users (email)
  WHERE deleted_at IS NULL;

-- Only active users are in the index
SELECT * FROM users WHERE email = 'alice@x.com' AND deleted_at IS NULL;

─────────────────────────────────────────────────────────────
 Use case 3: Unique constraint on non-null values only
─────────────────────────────────────────────────────────────
CREATE UNIQUE INDEX idx_users_phone_unique
  ON users (phone)
  WHERE phone IS NOT NULL;
-- Multiple users can have NULL phone, but non-null phones must be unique

📉

Size win: If only 1% of orders are 'pending', a partial index on pending orders is 100× smaller than a full index on status. It also stays hot in cache because it's small.

9.8

Covering Indexes — INCLUDE Clause

Index-Only Scans · Eliminate heap fetches

▾

The Problem: Heap Fetches

Normally, an index scan finds the matching rows in the index, then fetches each row from the heap (the actual table) to get the columns you selected. This is two I/O steps per row.

Normal index scan (two hops)

  Query: SELECT email, name FROM users WHERE user_id = 42

  Step 1: Index on user_id → finds "user_id=42 is at heap page 7, offset 3"
  Step 2: Fetch heap page 7 → get email and name columns

  Two I/O operations per row.

Index-Only Scan with INCLUDE

If the index contains all the columns the query needs (both WHERE and SELECT columns), PostgreSQL can answer the query entirely from the index — never touching the heap. This is called an Index-Only Scan.

-- Without INCLUDE: index has user_id only
CREATE INDEX idx_users_id ON users (user_id);
-- Query must still fetch heap to get email and name

-- With INCLUDE: add the columns you SELECT into the index
CREATE INDEX idx_users_id_cover
  ON users (user_id)
  INCLUDE (email, name);

-- Now this query is an Index-Only Scan (heap never touched!)
SELECT email, name FROM users WHERE user_id = 42;

-- EXPLAIN will show: Index Only Scan (not "Index Scan")
EXPLAIN SELECT email, name FROM users WHERE user_id = 42;

INCLUDE vs Adding to Index Key

Why not just add all columns to the index key? Because INCLUDE columns are stored only in leaf nodes and don't affect sorting — they're payload, not key. This keeps the non-leaf index structure small and fast.

	Index Key Column	INCLUDE Column
Used for filtering (WHERE)	Yes	No
Used for sorting (ORDER BY)	Yes	No
Returned in query (SELECT)	Yes	Yes
Stored in internal nodes	Yes (bigger)	No (smaller)
Enables Index-Only Scan	Yes	Yes

Real-World Example — API Response Cache Pattern

-- orders table with many columns
-- API endpoint: GET /orders?user_id=X → returns [order_id, status, total, created_at]

-- Covering index for this exact API query
CREATE INDEX idx_orders_api_cover
  ON orders (user_id, created_at DESC)
  INCLUDE (order_id, status, total);

-- This query now does an Index-Only Scan — zero heap fetches
SELECT order_id, status, total, created_at
FROM   orders
WHERE  user_id = 1001
ORDER BY created_at DESC
LIMIT  20;

⚡

Interactive Index Advisor

Describe your query pattern → get the right index type

▾

What kind of query do you have?

Column data type

Query pattern

Table size

Fill in the options above and click "Get Recommendation" to see which index type fits your use case.

📋

Quick Reference — Index Types at a Glance

Summary table for all index types and when to use them

▾

Index	Equality	Range	Sort	JSONB/Array	Full-Text	Geo	Size
BTree	✓	✓	✓	✗	✗	✗	Medium
Hash	✓ Fast	✗	✗	✗	✗	✗	Small
GIN	✗	✗	✗	✓ Best	✓ Best	✗	Large
GiST	✓	✓	✗	✓	✓	✓ Best	Medium
BRIN	✗	✓ Coarse	✗	✗	✗	✗	Tiny

Index Speed Comparison (Read performance)

BTree equality~log(n) — excellent

Hash equality~O(1) — fastest for equality

GIN lookupInverted index — very fast

GiST geo lookupBounding box pruning — fast

BRIN range scanPrunes pages — moderate (coarse)

Sequential scan (no index)Reads every page — slow at scale

🎯

Golden Rules of Indexing:
1. Don't index everything — each index slows down writes and takes storage.
2. Index columns used in WHERE, JOIN ON, and ORDER BY clauses.
3. Put the most selective column first in composite indexes.
4. Use partial indexes to filter out rows you never query.
5. Use INCLUDE to enable Index-Only Scans on hot query paths.
6. Always verify with EXPLAIN ANALYZE — assumptions can be wrong.