This is a tutorial that explains indexes in relational databases with a real-world analogy. What are indexes, why are they useful and what should one be thoughtful about when using them.
Let's first quickly establish some terminology. A table in a database such as Oracle, MySQL or PostgreSQL is a data structure with a fixed number of columns, or attributes and a variable number of rows which are the data itself. A table ‘looks’ like a spreadsheet in Excel or Google sheets. You might have one that shows your Credit Card purchases with columns/attributes such as Date, Time, Amount, Merchant and Category. These are (mostly) fixed. Now as you begin to use the card you create rows such as Jan 1st, 9 am, $10, Starbucks, Food and Jan 1st, 12 pm, $15, Chipotle, Food and April 5th, 2pm, $300, American Airlines, Travel and so on.
Now if you wanted at the end of the year to find all the times you have spent money in Starbucks you would issue a SQL query such as:
and get back dozens or a few hundred rows of data for the year where you made Starbucks purchases. The way you get back this data is that every record or row in your table is queried to see if the Merchant is who you want it to be (Starbucks in this case) and if it is a match it gets selected and if not, it gets skipped over. This is called a Full Table Scan as every row is being looked at. Full Table Scans are time consuming.
Since your credit card doesn't get used much more than 5-10 times a day on average, your table at the end of the year will likely contain well under 5000 rows of data. With modern computing power, going through all 5000 rows of data as a Full Table Scan is still acceptably fast
Now what if it wasn’t you but the store owner that wanted to run a query such as
Now, several hundreds of customers will routinely make purchases at a Starbucks every single day. Perhaps even thousands. So now that table at the end of the year may have a million rows of data. Even that isn’t, with modern computing power, too much to go through one-by-one to answer the query above but it will be slower than your own query for the same Full Table Scan.
What if it wasn’t that particular store owner but Starbucks’ corporate HQ that wanted to run that or a similar query. There are some 15000 Starbucks in the US as of 2020. That table of all sales will be 15000 times bigger i.e. have 15000 times the number of rows a single store would. A Full Table Scan would now take 15000 times longer (not exactly but for the purposes of this discussion the approximation works). Say the one-store query ran in 200 ms which is near instantaneous. Now the All-US Starbucks query would take 3000 seconds. That is almost an hour! Clearly, that’s unacceptable.
Enter Indexes. With an index on the right columns, queries on tables with millions or billions of rows can still be answered in a very short time.
To understand this take the example of a book. Say you’re interested in Steve Jobs’ experience in Japan and you have the Walter Isaacson biography of Steve Jobs. Now that book is 656 pages long and will take you hours to read every single page. Most non-fiction books will have a section at the back (not coincidentally called an index) where keywords or names are listed along with all the pages they appear on. If you flip to the index of the Isaacson book you’ll quickly find the handful of pages where his time in Japan is written about. This index search will take you mere seconds compared to the multiple hours of reading the entire book page by page - a Full Table Scan.
There you have it. This problem is exactly what an index on a database table attempts to solve. An index on Date in the Sales table creates that ‘section at the back of the book’ where the rows for each date are stored. Rather than page number an Oracle table uses something called RowID which is the exact same idea. An index on Item in the Sales table creates another ‘section at the back of the book’ where RowIDs for each item i.e. ‘Pumpkin Spice Latte’, ‘Flat White’, ‘Cafe Latte’ etc. are stored. Unlike a single Index per physical book you can have multiple indexes on a table. You can also have a composite index which is one index on multiple columns (mostly queried together).
Now, the All US query for Pumpkin Spice Latte on Halloween Day no longer needs to look at every single one of the billions of sales but instead looks up the indexes to find the RowIDs which match the Date being 31-Oct-2019 and the RowIDs that match the Item being a Pumpkin Spice Latte and now has the common RowIDs which are like page numbers that can be navigated to near-instantly.
And that’s why we create indexes on tables. To make queries faster.
Now the next question might be if indexes make queries faster why not just create an index on every single column of a table so that all queries whether on Date or Item or Zipcode or Amount all get sped up. Is it truly free?
Alas, nothing is truly free. There is no such thing in the universe as a free lunch. There are two costs to be paid for creating indexes.
The first is the space required for the index. As in the book analogy, adding an index adds another 10-20 pages to the end of the book which is cost. Fortunately for us - and Starbucks - the cost of storage hardware is falling way faster than the cost of paper so the physical cost of storing an index or more isn’t material enough to stop us from creating the index on the table if we’re confident there are queries that run frequently enough that will benefit. In other words it is mostly well worth it.
The second more subtle cost is that of updating. Here we have an interesting deviation from the book analogy, the book itself is fixed. Once the book is published it doesn’t change. Your book isn't going to develop pages or chapters about Jobs’ time in Japan! But the sales table is constantly changing. Every single time a sale is made the appropriate indexes that track that column need to be updated. Someone buys a Pumpkin Spice Latte and that needs to be reflected in the index. If not, a subsequent query that looks at the index first will fail to find the new RowIDs and return an incomplete result! A query with wrong results is even more unacceptable than a slow query. This means that every single row that is inserted into the table will trigger a potential update of multiple indexes. That slows insertions down.
That delay shows up when the Barista swipes your card. You want that transaction to be as fast as possible and not take 5 or 10 extra seconds.
So that’s the reason why you can’t just add an index to every column of a table and get faster queries for free.