I forget how it came up, but M was telling me the other day that she was trying to explain to an inquisitive neighbor what it is I do for a living. She knows I do computer stuff and that it’s most often web-related or system-admin-related, but these are still pretty amorphous things to somebody who doesn’t actually perform the tasks they entail. While raking leaves today, I was thinking about how I might have answered the question, which is a hard one for me to answer in a way that would be very meaningful to non-developers.
In a nutshell, I call myself a web and analytics programmer, though I devote a lot of time to systems administration as well. The web part is fairly easy to explain. If you look at my company’s web site, you’re looking at my work. I don’t make the pretty pictures that compose the web site, but I take care of the parts that make it behave as it does, from sending emails to letting you post to the forums to displaying various types of content. I’m like the mechanic for the web site.
The analytics part is, I think, a little harder to capture. At a very high level, I help collect statistics about our product and our web sites. At a lower level, I try to help distill these bits of data into meaningful, actionable numbers. For example, if we know that we have X users and Y monetizable actions performed in the product daily, then we can track Y divided by X on a daily basis and watch the curve to see what kind of money we’re making per user per day on average. If a given monetizable action begins to trend flat or downward, we might consider making the feature easier to use so that we make more money from it.
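To make the Y divided by X idea concrete, here is a minimal Python sketch. The day labels, user counts, and action counts are invented for illustration, and the helper names `actions_per_user` and `is_flat_or_falling` are hypothetical, not anything from a real reporting system.

```python
# Hypothetical daily figures: (users X, monetizable actions Y).
# All numbers are made up purely for illustration.
daily = {
    "day1": (1000, 250),
    "day2": (1100, 264),
    "day3": (1200, 276),
}


def actions_per_user(users, actions):
    """Return Y / X, guarding against a day with no users."""
    return actions / users if users else 0.0


# One ratio per day: this is the curve you would watch over time.
ratios = {day: actions_per_user(x, y) for day, (x, y) in daily.items()}


def is_flat_or_falling(values):
    """True if no value in the series is higher than the one before it."""
    return all(later <= earlier for earlier, later in zip(values, values[1:]))


series = [ratios[day] for day in sorted(ratios)]

if is_flat_or_falling(series):
    print("Actions per user are trending flat or downward")
```

In this made-up data both users and actions grow every day, yet the per-user ratio slips, which is exactly the kind of thing the raw totals would hide.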
The thing I’ve learned over the last year or so is that as you get more and more data, it gets really hard to do anything useful with it on demand. Imagine that each day, 100,000 users’ products phone home to check for a product update (I’m just making that number up). You know then that you have 100,000 users per day. If you want to track this over time, it only takes 10 days before you’ve got a million pieces of data to try to extract something meaningful out of. If you’re tracking more than one piece of data per user, your data volume increases at an alarming rate as your user base grows. Typically, the more data you have, the longer it takes to comb through it. And yet you have executives trying to make decisions based on this data who don’t want to sit and wait a long time for reports to run. The trick is to aggregate the data as it comes in, and as I was raking leaves this morning, I came up with what I think is a useful way of explaining how scale affects the ability to report and how aggregation helps. It’s easy to accept propositions about scale and aggregation abstractly, but concrete examples are often useful.
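The arithmetic above, and the difference between scanning raw data and aggregating on arrival, can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the names `raw_rows`, `count_by_scanning`, and `RunningTally` are hypothetical, and the event list is a tiny stand-in for a real stream of check-ins.

```python
from collections import Counter

USERS_PER_DAY = 100_000  # the made-up figure from the text


def raw_rows(days, metrics_per_user=1):
    """Rows stored if every check-in is kept as its own record."""
    return USERS_PER_DAY * days * metrics_per_user


def count_by_scanning(events, day):
    """Answer a report by walking every stored event; cost grows with history."""
    return sum(1 for event_day in events if event_day == day)


class RunningTally:
    """Aggregate on arrival so that a report is a single lookup."""

    def __init__(self):
        self.per_day = Counter()

    def record(self, day):
        self.per_day[day] += 1

    def report(self, day):
        return self.per_day[day]


# A tiny stand-in for the event stream: one entry per check-in, labeled by day.
events = ["mon", "mon", "tue", "mon", "tue"]
tally = RunningTally()
for day in events:
    tally.record(day)

print(raw_rows(10))      # 10 days at 100,000 check-ins per day
print(raw_rows(10, 5))   # the same 10 days with five data points per user
```

Both approaches give the same answer for any day; the difference is that `count_by_scanning` has to touch every stored event each time somebody asks, while `RunningTally.report` does not care how much history has piled up.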
So imagine that you’re tasked with counting leaves. Further, imagine that on any given day, you might be tasked with reporting how many leaves there had been on some past day. Or more specifically, how many red leaves vs. yellow vs. orange. If you recount every time somebody asks you, it’ll take more time than is reasonable. The first step naturally would be to group your leaves by day (grant that this is physically possible). So on Monday, you count all the leaves and put them in a pile with a sign stuck in the ground that says “Monday: 45,031 leaves.” On Tuesday, you do the same for any other leaves that have fallen, and so on. On Friday, if somebody wants to know how many leaves you raked on Monday, you just look at the sign and tell them rather than re-counting. But what about leaf color? Well, you do the same thing, but you make a Monday pile for red leaves, a Monday pile for yellow, and a Monday pile for orange, each with a sign noting the count of that color for that day. Then you add the sums and post a sign with the total for all colors for the day. If you do this as you go, then you can very quickly get back to the counts for any given day and report without having to recount. The general idea is that it’s much easier to add sums than it is to recount. The tricky part is defining in advance what sorts of information you want to know about your leaves before you ever do the counting; otherwise you have to recount everything for all time, sorting into different piles to get counts per organizational criterion.
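The leaf piles translate fairly directly into code. Below is a minimal Python sketch, assuming each leaf arrives as a (day, color) pair; the class name `LeafPiles` and its methods are hypothetical names chosen for the analogy, not part of any real system.

```python
from collections import defaultdict


class LeafPiles:
    """Keep one 'sign' per (day, color) pile and one per day, updated as leaves arrive."""

    def __init__(self):
        self.by_day_color = defaultdict(int)  # sign on each colored pile
        self.by_day = defaultdict(int)        # sign with the day's total

    def rake(self, day, color):
        """Count a single leaf into its pile and bump both signs."""
        self.by_day_color[(day, color)] += 1
        self.by_day[day] += 1

    def count(self, day, color=None):
        """Read a sign instead of recounting; unknown piles report zero."""
        if color is None:
            return self.by_day.get(day, 0)
        return self.by_day_color.get((day, color), 0)


piles = LeafPiles()
for day, color in [
    ("Monday", "red"),
    ("Monday", "red"),
    ("Monday", "yellow"),
    ("Tuesday", "orange"),
]:
    piles.rake(day, color)

print(piles.count("Monday"))         # total for the day, read from its sign
print(piles.count("Monday", "red"))  # one colored pile for that day
```

Note what the sketch cannot do: if somebody later asks about a criterion that was never recorded at raking time (leaf size, say), none of these signs can answer it, and the only option is to go back to the raw leaves and recount.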