Explaining SQL

I’m currently reading High Performance MySQL as part of an effort to compare MySQL and PostgreSQL. The book has proven most informative so far. Rather than just giving the basics of how to use MySQL, which I’m very familiar with, it goes into enhancing query performance and tweaking the server configuration to maximize what MySQL can do for you. Having been introduced to MySQL the way I was — by just jumping in as sort of a n00b programmer with no clue what I was really doing — these are the sorts of things I never found time, gumption, or particular need to learn. Most of the programs I wrote scaled pretty well (or had no need to scale) without a lot of performance tweaking. But for Spread Firefox and for my new job, I have written and will be writing some much higher-end stuff with scalability to hundreds of thousands of users in mind. So I figured it was time to read up on some of the more advanced topics in MySQL. Chapter 5 of High Performance MySQL, it turns out, is the most useful piece of geek writing I can recall ever having read. This chapter alone is worth the $40 price tag on the book (it’s actually only $26.37 through Amazon).

I read about half the book yesterday but didn’t get a chance to apply any of the takeaways in a real life situation. Today, I was glancing at some comments on sfx and saw someone asking that we reinstate the “Top 5 blog posts” block. I didn’t know it had gone away. Sure enough, though, the header was there, but no content existed. So I checked the query, and when I ran it, I got a query error, which seemed odd, as it had been working just recently, and I hadn’t changed the query. Ah, but we changed database servers, and the newer one had a newer version of MySQL. My query stupidly used the “mod” keyword as a table alias, and when I fixed that, the query error went away. But the query, which has always taken awhile because it’s calculating ratings for potentially thousands of posts and ordering them to get the top 100 (which I later apply another little algorithm to to get the top 5), took long enough that I interrupted it and decided I’d see if I could find a way to optimize it. When I say it took long enough, I don’t mean that it took three seconds rather than .0002; I mean that I killed it after 10 or 20 seconds.

To get an idea of just how MySQL optimizes and runs a given query, you use the “explain” keyword followed by the query you want to investigate. For each table the query has to hit, explain displays several pieces of information. Probably the easiest to zoom in on and the one that’ll be the most impressive for this example is how many rows the optimizer estimates it’ll have to read in order to get the results. In the output below, you’ll see that the query will have to look at 5602 rows in the “node_mod” table (aliased to “mymod” and formerly to the problematic “mod”; this table contains a list of which of the site’s posts have been moderated by users). This is a full table scan (I can tell this by looking at the “type” value and seeing “ALL”), which isn’t very efficient. For the “node_mod_user” table (aliased to “u” and listing the ratings given by each user to site posts), the optimizer estimates that it’ll have to scan 27081 rows. Again, this is a full table scan, which is less efficient the more rows you have. And finally, the “node” table (which contains posts) has one reference per iteration of the query. Here’s the output from explain:

mysql> explain SELECT node.nid, node.title, FROM_UNIXTIME(created) as date,
mymod.value as val, count(*) as cnt, (mymod.value/count(*)) as rating
FROM node, node_mod as mymod, node_mod_users as u
WHERE node.nid=mymod.nid
AND node.nid=u.nid AND created >= (UNIX_TIMESTAMP(NOW()) - (86400 * 7)) AND promote != 1 AND type != 'image'
GROUP BY node.nid ORDER BY rating DESC, created DESC LIMIT 100 G

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: mymod
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 5602
        Extra: Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: u
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 27081
        Extra:
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: node
         type: eq_ref
possible_keys: PRIMARY,node_type,node_promote_status
          key: PRIMARY
      key_len: 4
          ref: spreadfirefox.mymod.nid
         rows: 1
        Extra: Using where
3 rows in set (0.01 sec)

The red flag this raised for me was that I shouldn’t have to do a full table scan on the nod_mod table. It turns out that I hadn’t added an index on this table for the nid field, and by doing so, I eliminate in my query the need to do a full table scan. Sure enough, I added the index, and the query returns almost instantly now (in .44 seconds rather than too many seconds to even bother waiting). Take a look at the new explain output below. Rather than doing a full table scan of 5602 rows, we’re using the new index, and it has a significant positive effect on performance.

mysql> explain SELECT node.nid, node.title, FROM_UNIXTIME(created) as date,
mymod.value as val, count(*) as cnt, (mymod.value/count(*)) as rating
FROM node, node_mod as mymod, node_mod_users as u
WHERE node.nid=mymod.nid
AND node.nid=u.nid AND created >= (UNIX_TIMESTAMP(NOW()) - (86400 * 7)) AND promote != 1 AND type != 'image'
GROUP BY node.nid ORDER BY rating DESC, created DESC LIMIT 100 G

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: u
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 27079
        Extra: Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: node
         type: eq_ref
possible_keys: PRIMARY,node_type,node_promote_status
          key: PRIMARY
      key_len: 4
          ref: spreadfirefox.u.nid
         rows: 1
        Extra: Using where
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: mymod
         type: ref
possible_keys: nid
          key: nid
      key_len: 5
          ref: spreadfirefox.node.nid
         rows: 1
        Extra: Using where
3 rows in set (0.10 sec)

After making this change and starting to write this triumphant blog entry, it occurred to me to check and see if I had indexed the node_mod_users table, and it turned out that I hadn’t. When I added an index on nid, it increased performance even more, cutting my query time down to .10 seconds and resulting in the following output from explain:

mysql> explain SELECT node.nid, node.title, FROM_UNIXTIME(created) as date,
mymod.value as val, count(*) as cnt, (mymod.value/count(*)) as rating
FROM node, node_mod as mymod, node_mod_users as u
WHERE node.nid=mymod.nid
AND node.nid=u.nid AND created >= (UNIX_TIMESTAMP(NOW()) - (86400 * 7)) AND promote != 1 AND type != 'image'
GROUP BY node.nid ORDER BY rating DESC, created DESC LIMIT 100 G

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: mymod
         type: ALL
possible_keys: nid
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 5604
        Extra: Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: node
         type: eq_ref
possible_keys: PRIMARY,node_type,node_promote_status
          key: PRIMARY
      key_len: 4
          ref: spreadfirefox.mymod.nid
         rows: 1
        Extra: Using where
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: u
         type: ref
possible_keys: nid
          key: nid
      key_len: 5
          ref: spreadfirefox.node.nid
         rows: 5
        Extra: Using where; Using index
3 rows in set (0.00 sec)

Note that the optimizer is once again doing a table scan of the roughly 5600 rows of node_mod. This is because MySQL can use only one index for a given query. In this case, with the nid column indexed in both node_mod and node_mod_users, it chooses the better of two options and does a table scan of the table with 5600 rows rather than the one with 27081. That’s 4 – 5 times fewer rows to scan, and our performance increased by a factor of about 4 (from .44 seconds to .10). Makes sense, huh?

The lesson here is pretty obvious. If you’ve got a slow query, explain it to see exactly how MySQL is processing it. Explain can expose, among other things, careless indexing problems that can make a very real difference in how your program executes.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s