Speaker:Janet Riley
Session title: When you have to get out and push: optimizing SQL queries
Time: 12:00
Date: Sunday, June 4 2006
This is the director's cut of my presentation -- it includes a couple notecards I overlooked.
It would be great if folks added links to their favorite optimization tools in the resources section.
Key points:
- It's Tool Time - don't optimize blindly.
- Make molehills out of mountains - cut the candidate rows down as quickly as possible.
- Take Advantage Of Indexes.
- Don't Be All "Whatever" in Selects - enumerate the columns in select statements.
- Practice Big Love -- use UNION ALL when you can get away with it .
- Save Your Work - consider how to reuse calculated values.
- Zoom Out and Rethink - look at the situation from a different angle.
- Resources
Notes
You've run VACUUM ANALYZE, created a few indexes, and your database queries still run slowly.
What next?
Blind optimization will make you crazy. The database is never doing what you think it's doing; if it were, you wouldn't have a performance problem.
When you submit a query to the database, the database devises a strategy for how to best run your query. It bases this on statistics about the data and whatever cunning it was programmed with. The EXPLAIN function will show you its strategy, step by step. Consult your database's documentation for specifics -- look for "explain query" or "explain plan". EXPLAIN can be run from a command line sql prompt, but it's easier to understand the output with a visual interface. Tora (Oracle) and pgadmin (Postgres) are free tools that include explain.
Next to each step, EXPLAIN shows numbers about how many records it processed and compute time. It's hard to translate the numbers into actual performance, but EXPLAIN lets you compare apples to apples. Get a baseline before making any changes to your query.
Any time interval that EXPLAIN reports is suspect, as the database may cache data between queries. If the compute time magically drops on refresh without any change to the query, the database is caching.
Because the database bases its execution strategy on the data, make your test data as close to actual data as possible. Ideally, run against production or a copy of production data. The tiny QA database will implement queries much differently, and you'll be wasting your time.
STATSPACK is an Oracle tool that's great for finding bottlenecks. It will log every action the database takes while you run your application, and report in great detail what step it took and what resources were used.
The database is the ultimate authority. We can theorize, but the database's behavior is the final evidence.
- Make Molehills Out of Mountains.
When you're looking for a needle in a haystack, it's best to shrink the haystack as quickly as possible on each pass. Is there a way to narrow down the candidates quickly?
EXAMPLE: Show me all the recent messages on my friends list that I'm allowed to see
INSIGHT: my friends list and permissions groups are a tiny fraction of all users and groups.
THE LONG WAY: out of all the messages in the system, find the ones I'm allowed to read; from those, find the ones written by someone on my friendslist
THE FAST WAY: find the people on my friends list; then find messages written by those people; from those, find the ones I'm allowed to read.
Write all sql with the molehill attitude.
Enforce it with subqueries if necessary.
Instead of
SELECT val1, val2, val3 FROM messages, permission_group, permission_members, friends
WHERE messages.permissions_filter = permission_group.id and
permission_members.group_id = permission_group.id and
permission_members.person_id = :me and
messages.sender_id = friends.friend_id
try this:
SELECT val1, val2, val3 FROM messages, ( select friend_id from friends where person_id = :me) friendSubquery,
( select id from permission_group where permission_members.person_id = :me and permission_group.id = permission_members.group_id) permissionSubquery
WHERE
messages.senderId = friendSubquery.friend_id AND
messages.permissions_filter = permissionSubquery.id;
Enforce it with HINTS and special selects.
Oracle has hints; MySQL supports special selects like "SELECT STRAIGHT JOIN".
Some hints allow you to specify the order in which it searches tables (e.g. friends, then permissions, then messages). Note that a hint is only a hint in Oracle, and it may freely ignore your suggestions. In that case, use subqueries.
- Take Advantage Of Indexes.
You've already put indexes on the single columns and dutifully made primary keys.
You can make compound or covering indexes with two or more columns. This has two benefits:
- it speeds lookup when your WHERE includes those multiple columns
- you can retrieve values from the index quickly even if they aren't part of the criteria.
EX: Given an index on userid and nickname;
SELECT nickname WHERE userid=123
Even though nickname wasn't part of the WHERE clause, the database will retrive nickname more quickly if it uses this index.
Oracle and MySQL allow you to suggest an index to use. Oracle may disregard the hints. EXPLAIN will confirm which index the database chooses.
You must specify the columns in the WHERE clause in the same order that you specified them when you created the index.
Declare indexes unique when they're unique.
Remember that the database can't use an index when you apply a function to the column, e.g. LOWER(nickname), or when you use the LIKE or != operators.
- Don't Be All "Whatever" in Selects.
There is a small but measurable performance penalty for using "select *". How many places does that appear in your code?
Use "select col1, col2, col3..." instead.
As a bonus, listing column names helps document the app and provides some bulletproofing. There are no guarantees about which order the database will return columns in. In practice, it seems to be the order it's written to disk.
If your code looks like myValue = resultSet.getString(1), you may get a nasty surprise some day.
- Practice Big Love -- use UNION ALL when you can get away with it.
UNION returns a distinct set of results, and therefore each result set is sorted so it may be de-duplicated.
If each subquery returns a unique set of results, use UNION ALL instead.
If you have to calculate something once, consider storing it for future use.
EXAMPLE: a subquery when you repeat an elaborate join over and over again .
Instead of
SELECT a.productname, b.price
FROM a, b, c
WHERE
a.productid = c.productid and c.subproductid = :value AND
b.productid = c.productid and c.subproductid = :value
try using
SELECT a.productname, b.price
FROM a, b, ( select c.productid where c.subproductid = :value) mySubquery
WHERE
a.productid = mySubquery.productid AND
b.productid = mySubquery.productid
That's contrived -- you could have said where b.productid = a.productid -- but you get the point. UNIONs are a situation where you're likely to see this.
If you're going to perform a calculation on a column frequently, try storing the result of the calculation:
EX: Searching through tags, change the tag to lowercase before inserting
EX: Create an index on LOWER(tag) - you're allowed to apply functions in index definitions.
EX: Searching through last names; retain the lastName column as the user entered it so it prints nicely, but search on a second column that contains LOWER(lastName)
Even after optimizing, your query may still be expensive. Reconsider the problem.
The easiest thing to optimize is the user's experience. Does the user expect all of data that you're showing? Could you split the functionality up across pages instead?
Is the lag during the query going to damage their experience? Can you show a progress bar so the user knows the app is doing something?
Does the data have to be super-accurate, or could it be accurate enough? What is the user looking for?
EX: A "Random" button brings the user to a random page. Does it have to be mathematically random, or could you prepopulate a table with a lot of good pages?
EX: A "Most Recent" page shows the most recent people to join your site. This appears on your front page and gets quite a few hits. You have a few million users to look over. How about drawing from a prepopulated table and running a cron job every 20 minutes to update it? How about inserting users into the table as they join, and truncating it periodically?
Finally, performance problems are often the result of growing pains. Your site has gotten successful enough to HAVE a performance problem.
This can help you frame the situation as you manage upwards. You've still got some work ahead of you, but it's the result of growth rather than a sign
of incompetence. Now get busy!
The Art of SQL by Stephane Faroult and Peter Robson is a good non-database-specific book.
Tom Kyte is an Oracle expert who writes the Ask Tom column. Obviously the information there is skewed toward Oracle, but there's some good general information in the Popular section of the archives. He's also written books on tuning Oracle databases.
Jay Pipes gave a http://www.petefreitag.com/item/566.cfm">presentation on MySQL performance tuning at Google.