Indexing
Indexes are the single biggest lever for query performance. Master when to use BTree, Hash, GIN, GiST, and BRIN — and advanced patterns like composite, partial, and covering indexes — and queries that used to take seconds drop to milliseconds.
BTree — The Default Index
Balanced tree · equality · range · sort
What is a BTree?
BTree stands for Balanced Tree. It keeps all leaf nodes at the same depth, so every lookup takes the same number of steps regardless of which key you search for. PostgreSQL builds a BTree by default when you write CREATE INDEX.
The tree has three levels: a root node at the top, internal nodes in the middle that act as routers, and leaf nodes at the bottom that contain the actual (key → heap pointer) pairs pointing to rows in the table.
BTree Structure Visualized
Leaf nodes are also doubly-linked — range scans just walk the leaves without going back to the root.
Equality Search
Looking for WHERE user_id = 42 — PostgreSQL traverses from root → internal → leaf in O(log n) steps. For a table with 1 million rows a BTree is typically only 3–4 levels deep.
-- Create a BTree index (default when you omit USING) CREATE INDEX idx_users_email ON users (email); -- This query will now use the index SELECT * FROM users WHERE email = 'alice@example.com'; -- Explicit USING BTREE (same as default) CREATE INDEX idx_orders_user ON orders USING BTREE (user_id);
Range Search
BTree excels at range queries because leaf nodes are sorted and linked. PostgreSQL finds the starting point with O(log n) then walks leaf-to-leaf.
-- Range queries — BTree handles all of these SELECT * FROM orders WHERE created_at >= '2024-01-01' AND created_at < '2024-02-01'; SELECT * FROM products WHERE price BETWEEN 100 AND 500; SELECT * FROM users WHERE name >= 'A' AND name < 'B'; -- names starting with A -- Operators supported by BTree: -- = < <= > >= BETWEEN IN IS NULL LIKE 'abc%' (prefix only)
LIKE 'abc%' (prefix match) but NOT LIKE '%abc' (suffix) or LIKE '%abc%' (contains). For those use Full Text Search or GIN/pg_trgm.Real-World Example — E-Commerce Orders
CREATE TABLE orders ( order_id BIGSERIAL PRIMARY KEY, user_id BIGINT NOT NULL, status TEXT NOT NULL, -- 'pending','paid','shipped','cancelled' total NUMERIC(10,2), created_at TIMESTAMPTZ DEFAULT NOW() ); -- Index on user_id to find orders for a specific user CREATE INDEX idx_orders_user_id ON orders (user_id); -- Index on created_at for date-range dashboard queries CREATE INDEX idx_orders_created ON orders (created_at); -- These now use the indexes: SELECT * FROM orders WHERE user_id = 1001; SELECT * FROM orders WHERE created_at >= '2025-01-01';
How BTree Helps Sorting (ORDER BY)
Because BTree leaves are already sorted, if you ORDER BY a column that has a BTree index, PostgreSQL can scan the index in order and skip the sort step entirely.
-- Without index: scan all rows, sort in memory (expensive) SELECT * FROM orders ORDER BY created_at DESC LIMIT 20; -- With index on created_at: index scan in reverse order, no sort needed ✓ CREATE INDEX idx_orders_created ON orders (created_at DESC); SELECT * FROM orders ORDER BY created_at DESC LIMIT 20;
Hash Index
Hash table · equality only · no ranges
How a Hash Index Works
A Hash index computes hash(value) and stores it in a hash bucket. To look up a value, PostgreSQL hashes it and goes straight to that bucket — O(1) lookup. It is faster than BTree for pure equality because it skips the O(log n) tree traversal.
Column value hash() Bucket ────────────── ────────────── ────────────────────── 'alice@x.com' → hash = 0x3F7A → Bucket 12 → row ptr 'bob@y.com' → hash = 0x1A2C → Bucket 7 → row ptr 'carol@z.com' → hash = 0x3F7A → Bucket 12 → row ptr ← collision, chained
When to Use Hash
-- Pure equality lookups SELECT * FROM sessions WHERE session_token = 'abc123'; -- UUID lookups SELECT * FROM users WHERE id = 'uuid-here'; -- High cardinality, equality only
-- Range queries — FAILS to use hash WHERE price > 100 -- Sorting — hash can't help ORDER BY created_at -- Pattern match — hash can't help WHERE email LIKE 'alice%'
Example
-- Hash index: use USING HASH explicitly CREATE INDEX idx_sessions_token ON sessions USING HASH (session_token); -- This equality query will use the hash index SELECT user_id, expires_at FROM sessions WHERE session_token = 'eyJhbGci...'; -- Practical note: In practice, BTree is almost always preferred -- because it also handles ranges and sorts. Hash wins only when -- equality is guaranteed to be the ONLY access pattern.
GIN — Generalized Inverted Index
JSONB · Arrays · Full Text Search
What is an Inverted Index?
In a normal (forward) index: row → values. In an inverted index: value → rows. Think of how a book index works — "PostgreSQL" → pages 3, 17, 42. GIN builds this kind of index for composite values (arrays, JSONB documents, text tokens).
Row data: GIN index (inverted):
────────────────────────────── ──────────────────────────────
row 1: tags = {postgres, sql} → "postgres" → {row 1, row 3}
row 2: tags = {redis, nosql} → "sql" → {row 1}
row 3: tags = {postgres, aws} → "redis" → {row 2}
"nosql" → {row 2}
"aws" → {row 3}
GIN on JSONB
JSONB is the most common use case. GIN indexes every key and value inside the JSON document so you can query nested fields without a sequential scan.
-- Example: Products with a JSONB metadata column CREATE TABLE products ( id BIGSERIAL PRIMARY KEY, name TEXT, metadata JSONB -- {"brand":"Nike","size":"L","color":"red","tags":["sports","sale"]} ); -- GIN index on the whole JSONB column CREATE INDEX idx_products_meta ON products USING GIN (metadata); -- Now these queries use the GIN index: -- Does JSON contain a key-value? SELECT * FROM products WHERE metadata @> '{"brand":"Nike"}'; -- Does JSON contain this top-level key? SELECT * FROM products WHERE metadata ? 'brand'; -- Does array field contain this element? SELECT * FROM products WHERE metadata -> 'tags' @> '"sale"';
GIN on Arrays
CREATE TABLE posts ( id BIGSERIAL PRIMARY KEY, title TEXT, tags TEXT[] -- array of tag strings ); CREATE INDEX idx_posts_tags ON posts USING GIN (tags); -- Contains operator @> — does the tags array contain 'postgres'? SELECT * FROM posts WHERE tags @> ARRAY['postgres']; -- Overlap operator && — does tags array share any element with this array? SELECT * FROM posts WHERE tags && ARRAY['postgres', 'sql'];
GIN for Full Text Search
CREATE TABLE articles ( id BIGSERIAL PRIMARY KEY, title TEXT, body TEXT, tsv TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', title || ' ' || body)) STORED ); -- GIN on the pre-computed tsvector column (fastest) CREATE INDEX idx_articles_tsv ON articles USING GIN (tsv); -- Full text search now uses the GIN index SELECT title, ts_rank(tsv, query) AS rank FROM articles, to_tsquery('english', 'postgresql & index') query WHERE tsv @@ query ORDER BY rank DESC;
fastupdate=on (default) which batches writes. For read-heavy workloads this is a non-issue.GiST — Generalized Search Tree
Geospatial · Ranges · Similarity Search
What is GiST?
GiST is a pluggable index framework. Instead of hard-coding a single data structure, GiST lets data types define their own "split" and "consistency" functions. The tree shape is like a BTree but keys can be any data type — bounding boxes, ranges, text similarity scores, etc.
GiST is used heavily by the PostGIS extension for geospatial data, and by built-in range types.
GiST for Geospatial (PostGIS)
-- PostGIS extension for geospatial data CREATE EXTENSION IF NOT EXISTS postgis; CREATE TABLE locations ( id BIGSERIAL PRIMARY KEY, name TEXT, coords GEOMETRY(POINT, 4326) -- latitude/longitude ); -- GiST index on the geometry column CREATE INDEX idx_locations_coords ON locations USING GIST (coords); -- Find all restaurants within 5km of a point SELECT name, ST_Distance(coords, ST_MakePoint(77.5946, 12.9716)::geography) AS dist_m FROM locations WHERE ST_DWithin(coords::geography, ST_MakePoint(77.5946, 12.9716)::geography, 5000) -- 5000 metres ORDER BY dist_m; -- The GiST index makes this fast using bounding-box pruning
GiST for Range Types
PostgreSQL has built-in range types (daterange, tsrange, int4range). GiST indexes them and supports overlap, contains, and adjacent queries.
CREATE TABLE hotel_bookings ( id BIGSERIAL PRIMARY KEY, room_id INT, booked DATERANGE -- e.g. [2025-01-10, 2025-01-15) ); CREATE INDEX idx_bookings_dates ON hotel_bookings USING GIST (booked); -- Overlapping range query — find all bookings that overlap [Jan 12, Jan 14) SELECT * FROM hotel_bookings WHERE booked && '[2025-01-12, 2025-01-14)'::daterange; -- This powers the EXCLUDE constraint for no double-booking: ALTER TABLE hotel_bookings ADD CONSTRAINT no_overlap EXCLUDE USING GIST (room_id WITH =, booked WITH &&);
GiST for Similarity Search (pg_trgm)
-- pg_trgm: trigram similarity for fuzzy text search CREATE EXTENSION IF NOT EXISTS pg_trgm; CREATE TABLE products (id SERIAL, name TEXT); -- GiST trigram index allows LIKE '%abc%' and similarity() queries CREATE INDEX idx_products_name_trgm ON products USING GIST (name gist_trgm_ops); -- Fuzzy search: find names similar to 'postgress' (typo) SELECT name, similarity(name, 'postgress') AS sim FROM products WHERE name % 'postgress' -- % = similarity threshold (default 0.3) ORDER BY sim DESC; -- LIKE with wildcard (both sides) — works with trgm index SELECT * FROM products WHERE name LIKE '%ostgre%';
<-> distance operator (nearest-neighbor). For autocomplete or typo search, GIN is usually the right choice.BRIN — Block Range Index
Large tables · Time-series · Append-only data
How BRIN Works
BRIN doesn't store a pointer for every row. Instead it divides the table into block ranges (groups of pages, default 128 pages) and stores the min and max value of the indexed column in that range. To answer a query, PostgreSQL checks which block ranges might contain the value and only reads those pages.
Pages 1–128 min=2024-01-01 max=2024-03-31 ← query Jan asks: maybe here? Pages 129–256 min=2024-04-01 max=2024-06-30 ← query Jan asks: skip! Pages 257–384 min=2024-07-01 max=2024-09-30 ← skip! Pages 385–512 min=2024-10-01 max=2024-12-31 ← skip! Index size: 4 rows vs BTree: millions of rows
BRIN is extremely small (a few KB even for billion-row tables) but it's a coarse filter — it can't pinpoint individual rows, only narrow down which pages to scan.
When to Use BRIN
created_at correlates with physical storage orderExample — Sensor Readings Table
-- IoT sensor data: billions of rows, append-only CREATE TABLE sensor_readings ( id BIGSERIAL PRIMARY KEY, sensor_id INT, value FLOAT, recorded_at TIMESTAMPTZ DEFAULT NOW() ); -- BRIN index: tiny size, fast to build, perfect for time-series range queries CREATE INDEX idx_sensor_brin ON sensor_readings USING BRIN (recorded_at); -- Pages_per_range controls granularity (smaller = more precise, bigger index) CREATE INDEX idx_sensor_brin_fine ON sensor_readings USING BRIN (recorded_at) WITH (pages_per_range = 32); -- This query uses the BRIN index to skip irrelevant page ranges SELECT AVG(value) FROM sensor_readings WHERE recorded_at >= '2025-01-01' AND recorded_at < '2025-02-01' AND sensor_id = 42;
created_at might be ~2GB. A BRIN on the same column would be under 1MB. The trade-off is that BRIN requires a sequential scan of qualifying block ranges, not direct row access.Composite Indexes
Multi-column indexes · Column ordering · Selectivity
What is a Composite Index?
A composite (multi-column) index spans multiple columns. PostgreSQL sorts by the first column, then by the second within ties, then the third, and so on — just like sorting a phone book by last name, then first name.
-- Composite index on (user_id, created_at) CREATE INDEX idx_orders_user_date ON orders (user_id, created_at); -- ✓ Uses index: prefix match (user_id is the leading column) SELECT * FROM orders WHERE user_id = 42; -- ✓ Uses index: both columns SELECT * FROM orders WHERE user_id = 42 AND created_at >= '2025-01-01'; -- ✗ CANNOT use index: created_at is not the leading column SELECT * FROM orders WHERE created_at >= '2025-01-01'; -- (for this you'd need a separate index on created_at)
Column Ordering — The Most Important Rule
Put the most selective column first — the one that filters out the most rows. But also consider which columns appear in your WHERE clauses most often and together.
(A, B, C) can serve queries on (A), (A, B), and (A, B, C). It cannot serve queries on (B) or (C) alone, or on (B, C). Always put the column you filter on most in position 1.
-- Example: orders table queried like: -- WHERE user_id = ? (frequent) -- WHERE user_id = ? AND status = ? (frequent) -- WHERE user_id = ? AND created_at > ? (frequent) -- WHERE status = ? (rare, dashboard) -- Good: user_id first (most selective in user context) CREATE INDEX idx_orders_user_status ON orders (user_id, status); CREATE INDEX idx_orders_user_date ON orders (user_id, created_at); -- Rare status-only query: either add separate index or accept seq scan CREATE INDEX idx_orders_status ON orders (status) WHERE status != 'cancelled'; -- ^ partial index (explained in 9.8)
Selectivity
Selectivity is what fraction of rows a condition matches. High selectivity = filters many rows = good for indexes. Low selectivity = keeps most rows = index often not helpful.
| Column | Distinct Values | Selectivity | Good for Index? |
|---|---|---|---|
user_id | 1,000,000 | High | Yes |
email | 1,000,000 | Very High | Yes |
status | 4 values | Low | Alone: No |
is_deleted | 2 values | Very Low | Alone: No |
country | ~200 | Medium | Composite: Maybe |
Partial Indexes
Filtered indexes · Index only a subset of rows
What is a Partial Index?
A partial index includes only the rows that satisfy a WHERE condition. This makes the index smaller, faster to build, and faster to search because it only covers the rows you actually care about.
Think of it as an index that only knows about a filtered subset of the table.
Classic Use Cases
───────────────────────────────────────────────────────────── Use case 1: Active records only Orders are mostly 'cancelled' or 'delivered'. You only query 'pending' or 'processing' orders on your dashboard. ───────────────────────────────────────────────────────────── CREATE INDEX idx_orders_active ON orders (created_at) WHERE status IN ('pending', 'processing'); -- This query uses the partial index (condition matches) SELECT * FROM orders WHERE status = 'pending' AND created_at < NOW() - '1 hour'::interval; ───────────────────────────────────────────────────────────── Use case 2: Soft deletes — never query deleted rows ───────────────────────────────────────────────────────────── CREATE INDEX idx_users_email_active ON users (email) WHERE deleted_at IS NULL; -- Only active users are in the index SELECT * FROM users WHERE email = 'alice@x.com' AND deleted_at IS NULL; ───────────────────────────────────────────────────────────── Use case 3: Unique constraint on non-null values only ───────────────────────────────────────────────────────────── CREATE UNIQUE INDEX idx_users_phone_unique ON users (phone) WHERE phone IS NOT NULL; -- Multiple users can have NULL phone, but non-null phones must be unique
Covering Indexes — INCLUDE Clause
Index-Only Scans · Eliminate heap fetches
The Problem: Heap Fetches
Normally, an index scan finds the matching rows in the index, then fetches each row from the heap (the actual table) to get the columns you selected. This is two I/O steps per row.
Query: SELECT email, name FROM users WHERE user_id = 42 Step 1: Index on user_id → finds "user_id=42 is at heap page 7, offset 3" Step 2: Fetch heap page 7 → get email and name columns Two I/O operations per row.
Index-Only Scan with INCLUDE
If the index contains all the columns the query needs (both WHERE and SELECT columns), PostgreSQL can answer the query entirely from the index — never touching the heap. This is called an Index-Only Scan.
-- Without INCLUDE: index has user_id only CREATE INDEX idx_users_id ON users (user_id); -- Query must still fetch heap to get email and name -- With INCLUDE: add the columns you SELECT into the index CREATE INDEX idx_users_id_cover ON users (user_id) INCLUDE (email, name); -- Now this query is an Index-Only Scan (heap never touched!) SELECT email, name FROM users WHERE user_id = 42; -- EXPLAIN will show: Index Only Scan (not "Index Scan") EXPLAIN SELECT email, name FROM users WHERE user_id = 42;
INCLUDE vs Adding to Index Key
Why not just add all columns to the index key? Because INCLUDE columns are stored only in leaf nodes and don't affect sorting — they're payload, not key. This keeps the non-leaf index structure small and fast.
| Index Key Column | INCLUDE Column | |
|---|---|---|
| Used for filtering (WHERE) | Yes | No |
| Used for sorting (ORDER BY) | Yes | No |
| Returned in query (SELECT) | Yes | Yes |
| Stored in internal nodes | Yes (bigger) | No (smaller) |
| Enables Index-Only Scan | Yes | Yes |
Real-World Example — API Response Cache Pattern
-- orders table with many columns -- API endpoint: GET /orders?user_id=X → returns [order_id, status, total, created_at] -- Covering index for this exact API query CREATE INDEX idx_orders_api_cover ON orders (user_id, created_at DESC) INCLUDE (order_id, status, total); -- This query now does an Index-Only Scan — zero heap fetches SELECT order_id, status, total, created_at FROM orders WHERE user_id = 1001 ORDER BY created_at DESC LIMIT 20;
Interactive Index Advisor
Describe your query pattern → get the right index type
What kind of query do you have?
Quick Reference — Index Types at a Glance
Summary table for all index types and when to use them
| Index | Equality | Range | Sort | JSONB/Array | Full-Text | Geo | Size |
|---|---|---|---|---|---|---|---|
| BTree | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | Medium |
| Hash | ✓ Fast | ✗ | ✗ | ✗ | ✗ | ✗ | Small |
| GIN | ✗ | ✗ | ✗ | ✓ Best | ✓ Best | ✗ | Large |
| GiST | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ Best | Medium |
| BRIN | ✗ | ✓ Coarse | ✗ | ✗ | ✗ | ✗ | Tiny |
Index Speed Comparison (Read performance)
1. Don't index everything — each index slows down writes and takes storage.
2. Index columns used in WHERE, JOIN ON, and ORDER BY clauses.
3. Put the most selective column first in composite indexes.
4. Use partial indexes to filter out rows you never query.
5. Use INCLUDE to enable Index-Only Scans on hot query paths.
6. Always verify with EXPLAIN ANALYZE — assumptions can be wrong.