It’s a familiar story, at least in Software-as-a-Service circles. Inevitably, growing datacenter operations and business activity start to throw off a lot of data. Not the critical content and customer data that our customers pay us to manage, which is explicitly modeled and optimized, but a huge variety of incidental material relating to server health, usage trends, and external network factors.
Informal samplings and counts of these records and measurements turn out to be intriguing, even surprising, and then enthusiasm builds for richer and more frequent analysis. Even more so when it leads to real improvements in the business. Continue down this path and, like us, you will soon see the need for a robust toolset for big data.
Based on our experience, it’s still preferable to start small, even if you know in advance that you’re targeting lots of data. Start with a representative subset, and begin with exploratory analyses using low-level data filtering and transformation tools. These tools are extremely flexible and easy to understand, and they can do a lot, limited mainly by CPU time and disk space, both quite affordable these days, especially if you can wait an hour or two for the answer. We’ve found that medium-complexity analyses on 100GB datasets run on a commodity server are still in this range, using very accessible tools like awk and python, and the various GNU text utilities. Not only does this get you up to speed quickly on the characteristics of your datasets, it’s invaluable as a way to check the results of more scalable and advanced tools as you adopt them.
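To make that concrete, here is a minimal sketch of the kind of low-level exploratory pass we mean, in Python. The log format ("timestamp status_code path" records), the function name, and the sample invocation are all illustrative assumptions, not a real dataset or script of ours:

```python
import collections

def top_status_codes(lines, n=5):
    """Tally hypothetical 'timestamp status_code path' log records
    and return the n most common status codes with their counts."""
    counts = collections.Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:      # skip blank or malformed records
            counts[fields[1]] += 1
    return counts.most_common(n)

# Intended use in a shell pipeline, e.g.:
#   zcat access-*.gz | python top_codes.py
# where the script would simply call top_status_codes(sys.stdin).
```

The same pass could just as easily be an awk one-liner or a sort | uniq -c pipeline; the point is that a few lines of throwaway code answer a real question with no schema design up front.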
So should we consider data on that scale (100GB, a couple of hours) big or not? The point of throwing out these numbers with so little context is that it isn’t just size on disk that makes your data big; a number of other factors matter as well. And it’s those factors that have pushed us into exploring tools and techniques that are less established, and generally outside the mainstream of enterprise data management. We’re talking about data for which most of the analysis work and insight arises from the process of just getting it into the right form for a typical data mining, relational database sort of model, a model which then is often not very re-targetable as analysis needs drift in different directions. For many, the idea of storing such unstructured data and attacking it in such ad hoc ways seems unwieldy and suboptimal. For certain uses, though, such as correlating and analyzing assorted streams of production data, we’ve found it to be exactly the right thing.
In real applications, sheer size is a definite concern. As your dataset doubles, and doubles again, or you just need results more quickly, for a while you can scale up to servers with faster storage and more CPUs. Ultimately this approach reaches a functional limit (and probably gets too expensive long before that), and then you must scale out to a cluster of servers running in parallel, each with its own slice of the data. Fortunately, the cluster architecture supports the same kind of flexible processing of unstructured data that you might perform on a single server. Any IT team can now take this Google-style approach via marvelous open source tools like Hadoop and Perceus, with a modest expense in hardware and setup time. We’re very encouraged by our experiences with these technologies and look forward to sharing them in future blog posts.
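Hadoop’s MapReduce model maps naturally onto this style of work: a map step turns raw records into key/value pairs, and a reduce step aggregates them per key after the framework sorts by key. Here is a minimal in-process sketch of that contract, assuming a hypothetical "timestamp status_code path" log format; the field layout and function names are illustrative:

```python
import itertools

def mapper(records):
    """Map step: emit (status_code, 1) for each hypothetical
    'timestamp status_code path' log record."""
    for line in records:
        fields = line.split()
        if len(fields) >= 2:
            yield fields[1], 1

def reducer(sorted_pairs):
    """Reduce step: sum the counts for each key. The framework
    delivers mapper output to each reducer sorted by key, which is
    what itertools.groupby relies on here."""
    for key, group in itertools.groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)
```

In practice, scripts like these can run under Hadoop Streaming, which pipes input records to a mapper process on stdin and expects tab-separated key/value lines on stdout; the sketch above models that contract in a single process so the logic is easy to test before it ever touches a cluster.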
Keeping it Big
Beyond the sheer number of data elements, though, I’d like to emphasize again that it is the semantics and intended usage of the data that push it into this big data category. It’s big because at the outset you don’t yet know its entire shape in all its dimensions, or all of its possible relations. Keeping it intact and minimally processed is a great advantage in this case, and no longer prohibitive with today’s plummeting storage costs. It’s also big when its potential users span multiple departments and areas of interest, and it cannot be distilled into a smaller subset or summary that still meets all of those needs. Finally, big data tends to appear too big when it simply fails to fit the mold of traditional data processing tools, which are less arbitrarily programmable, or overly optimized for update performance and minimal storage requirements. The time has come for practical tools and techniques that specifically target these datasets, and we are fortunate that this work is being pursued in a collaborative, open source manner.
Continue the conversation by sharing your comments here on the blog and by following us on Twitter @CTCT_API.