Multi-group, multi-join, no-union, no-view mysql query

Chances are than unless you googled to find this post, both the title and the post will be pretty much nonsense. If so, don’t continue reading on my account. There are no hidden gems unless you’re looking for an answer to a mysql question.

So, my dilemma was that in a Drupal module I’m writing, I was asked to provide data sorting that requires groupings on multiple columns in different tables. Three tables are involved. Table “node” holds content; “votes” holds ratings users have given nodes; and “extension_dowloads” holds download counts for nodes of a given type. The challenge was to get average votes, vote counts, download counts, and node information for all nodes of a given type, sortable on each of these columns the aggregation of which requires grouping.

There are several possible approaches. The least favorite to my mind was post-processing. That is, I could have gotten all node info, all download info, and all voting info, tossed it into a multi-dimensional array and sorted on various keys in the array. Downsides to this approach include the fact that I have to retrieve all potentially relevant data from the database (rather than just the subset I need), and then there’s processing time to do the sorting. I can see this scaling terribly. It also keeps me from using the nice hooks Drupal has in place for paged queries, sorting, etc., and that just seems like flagrant waste.

Another approach is trying to do some complex multi-join table, but that turned out to have weird results. You have to use a left outer join in order to get nodes that have neither votes nor downloads, but grouping on the two satellite tables as you must in order to get counts and sums out of them throws the numbers off. If you group by multiple columns, you get redundant data (ie, nodes listed twice, once for the match in the downloads table and once for the match in the votes table). If you group by just the common key, strange things happen as well. In some cases, something like an unexpected multiplication (across columns!) seemed to be happening as a result of grouping. I never quite figured that one out. In any case, I couldn’t find a single multi-join query that would work.

Yet another approach is to try to use a union. A union lets you daisy-chain the results of multiple queries together. Column names and types are taken from the first query in the union, and as long as there are no major incompatibilities (e.g. different numbers of columns in subsequent queries), the results of the queries are dumped out as one result set. I was hoping I could use a union to combine output from the download count query and the vote calculation query, but a union returns distinct rows. If I selected NULL for the download count in the voting query and NULL for the voting calculation columns in the download count query, I did get back all the results I wanted, but there were two rows of data per content node because rows were not unique. This would require further post-processing.

And finally, I considered using either temporary tables (select both result sets into temporary tables and then join the two to the nodes table to select the results I wanted) or views. Temporary tables seem messy and inefficient, so I was reluctant to use those, and views require a more recent version of mysql than I’m using, so that was out.

Now on to the solution. I had heretofore used sub-selects only to get scalar lists of ids to select from: “SELECT foo from bar WHERE id IN (SELECT id from other_table)”. It turns out that you can get whole result sets from sub-selects. What you’re doing in this case is in essence to select some results and define them as a table to select from within your wrapping query. (Really, I guess it’s a temporary table scenario, though it seems less hacky and possibly somehow more efficient/optimized than issuing CREATE, INSERT, SELECT, and DROP statements per request to get some data.) Here’s what I arrived at to solve my problem:

SELECT node.*, votes.vote, votes.vote_count, votes.vote_ratio,
downloads.downloads
FROM node,

(SELECT node.nid, SUM(votes.vote) vote, COUNT(votes.vote) vote_count, (SUM(votes.vote) / COUNT(votes.vote)) vote_ratio
FROM node
LEFT OUTER JOIN votes
ON votes.content_id = node.nid
WHERE node.type = "extension"
GROUP BY node.nid) votes,

(SELECT node.nid, COUNT(extension_downloads.eid) downloads
FROM node
LEFT OUTER JOIN extension_downloads
ON extension_downloads.eid = node.nid
WHERE node.type = "extension"
GROUP BY node.nid) downloads

WHERE votes.nid = downloads.nid
AND node.type = "extension"
AND node.nid = votes.nid;

I get everything from the node table. Then I do my first sub-query, which left outer joins the votes table on the node table to get all ids and relevant vote stats. I alias that sub-select as “votes” so that in the wrapping query, I can refer to its columns using “votes” as a prefix. Next I do a similar query on the downloads table. Finally, I constrain my wrapping query by node type and id. Since I’m doing left outer joins in my sub-queries, the row count for all three “tables” I’m selecting from is the same, and a simple nid = nid correspondence makes my data line up.

I may look into improving this further. It’s possible that I can reduce my data transfer and query burden by avoiding the left outer joins in the sub-queries and doing a single left outer join onto the two virtual tables in the wrapping query. I’m not sure whether aliasing the sub-queries will allow this or not.

Update: I was able to tweak the query as I speculated I might be able to above. Here’s the new query:

SELECT node.*, votes.vote, votes.vote_count, votes.vote_ratio, downloads.downloads
FROM node

LEFT OUTER JOIN
(SELECT votes.content_id nid, SUM(votes.vote) vote, COUNT(votes.vote) vote_count, (SUM(votes.vote) / COUNT(votes.vote)) vote_ratio
FROM node, votes
WHERE node.nid = votes.content_id AND node.type = "extension"
GROUP BY votes.content_id)

votes ON votes.nid = node.nid

LEFT OUTER JOIN
(SELECT extension_downloads.eid nid, COUNT(extension_downloads.eid) downloads
FROM extension_downloads GROUP BY eid)

downloads on downloads.nid = node.nid

WHERE node.type = "extension"

What’s going on now is that I’m joining on the results of the sub-queries rather than within the sub-queries. If I join within, then each sub-query returns as many rows as there are relevant nodes. If I join outside the sub-queries, each sub-query returns a number of results equal to the subset of nodes for which there is relevant data. Say I’ve got 500 nodes and that 300 of them have download counts and 250 of them have been voted on. In the original query, each sub-query returned 500 rows of data. In the new query, the sub-queries return 300 and 250 rows, which are then merged back into the 500 rows selected from the node table. The change stands to provide a significantly more efficient query that will scale better as more content nodes are added over time. What was really at question at the end of my original posting was whether or not aliasing the sub-query and using that alias in the join would work (I’ve had issues with aliasing in joins before), and it did.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s