It will not surprise you to learn that as I write this, #bigdata is being tweeted an average of 1400 times each hour, according to hashtags.org. Big data’s power to elucidate user behavior is, more or less, the holiest of economic grails, and has been for some time now. There is no industry that doesn’t want to better understand its consumers, and all the fragmented, unstructured information they leave in their digital wakes holds in its enormity the promise of this understanding.
The trouble is, big data is notoriously difficult to wrangle on account of its size and complexity. Setting aside for the moment that many enterprises have to purchase access to big data they don’t produce themselves, the process of grooming that data for reporting and analysis can be prohibitively expensive. So enterprises of all sizes and creeds end up asking themselves, do we need big data? If not, will we need it someday soon? If we don’t prepare ourselves for it, what do we stand to lose when it becomes an imperative down the line?
What Exactly is Big Data?
Our first step in answering these questions is to disambiguate big data, which has taken on a variety of meanings over the years. Let’s start with a concrete definition: big data is a mass of information characterized by its high volume, velocity, and variety. High volume means there’s a lot of data, high velocity means more of it all the time, and high variety means it comes in lots of different formats—not just your standard strings, integers, and dates but also geospatial, audio, video, three-dimensional, and more.
Recent hype surrounding big data has diluted this basic definition and stretched it to encompass a number of other data-related processes. For the purposes of this article, big data is not to be confused or conflated with the following:
- Big data is not business intelligence. Business intelligence tools can be used to analyze and report off of big data, but having big data is not the same thing as having a means of analyzing it.
- Big data is not only digital. While the internet is largely responsible for the proliferation of big data, it can come from traditional sources as well.
- Big data is not just data from outside your company. Enterprises can and do generate their own big data using applications, tracking systems, and devices of their own creation and/or implementation.
- Big data is not AI. But the two go hand-in-hand. Big data is complex enough to “teach” artificial intelligence algorithms how to look for patterns and predict outcomes based on existing information, but having big data isn’t the same thing as having an AI to analyze it.
So at its core, big data really is just a tremendous amount of information. But, because of its high volume, velocity, and variety, big data doesn’t fit neatly into the tables that make up relational databases. As a result, a lot of big data is collected in key-value pairs instead.
Compare this tidy example of a traditional data table such as you might find in an RDBMS like MySQL with the value pairs below it.
Structured Relational Data
Unstructured Big Data
<Google+User23456_Beverage, “Dry Martini with a Twist”>
<FacebookUser12345_Beverage, “White Wine”>
<Google+User23456_Color, “Totally Teal”>
Whereas the structured data values all exist on the same table and are stored on the same server, the value pairs exist in no particular order and bear no inherent relation to each other. They can even be stored on different machines! Messiness is the price we pay for unstructured data’s flexibility and potential.
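To make the contrast concrete, here is a minimal Python sketch that takes the three pairs above as a flat, unordered list and groups them into one record per user. The `pivot_pairs` helper and the assumption that every key has the form "user id, underscore, attribute" are illustrative only; real key-value stores use whatever key scheme their designers chose.

```python
from collections import defaultdict

# The three key-value pairs from the example above, in no particular order.
pairs = [
    ("Google+User23456_Beverage", "Dry Martini with a Twist"),
    ("FacebookUser12345_Beverage", "White Wine"),
    ("Google+User23456_Color", "Totally Teal"),
]


def pivot_pairs(pairs):
    """Group flat key-value pairs into one record per user.

    Assumes (hypothetically) that each key looks like
    '<user id>_<attribute>', so we split on the final underscore.
    """
    records = defaultdict(dict)
    for key, value in pairs:
        user, _, attribute = key.rpartition("_")
        records[user][attribute] = value
    return dict(records)


records = pivot_pairs(pairs)
for user, attributes in sorted(records.items()):
    print(user, attributes)
```

Notice that the Facebook user has no Color entry at all. A relational table would need an empty cell for that column, whereas the key-value form simply omits what it doesn't know. That is the flexibility, and the messiness, described above.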
Unstructured data is stored in non-relational systems like OLAP cubes, MongoDB, and Hadoop, but in order to report on it, BI solutions need some sort of organizing layer. Apache Hive is one such layer. It’s a data warehouse infrastructure with a SQL-like interface that sits on top of Hadoop and allows BI applications to access the unstructured data via a connector like ODBC. Because of all these layers, making sense of unstructured data can be a real challenge. Learning to organize and manage your “small data,” often referred to as operational data, can help inform future forays into big data.
The Intricacies of Small Data
What is now considered “small data” used to just be data. The term was coined to distinguish structured data from big data, and now it carries the stigma of being humdrum and outmoded, at least from a media standpoint.
Though it may be more structured than big data, small data is far from simple. Structured data begins its life cycle as what’s called transactional data, or normalized data, and it has to go through a denormalization process in order to become reportable. This is part of a process known as ETL, which stands for Extract, Transform, and Load. The transformation part of the process includes such steps as eliminating data redundancy, translating coded values, joining tables, and cleaning up user errors.
Transactional data might come in looking like this:
And need to be transformed into this for reporting purposes:
Note that the users’ first names have been separated from their last names, differences in input format have been corrected, and the numeric values corresponding to each user’s gender have been replaced with string values. This is a simplified example of a process that, for some enterprises, must be repeated for thousands of tables containing hundreds of rows and being accessed by hundreds of tenant groups with different needs and permissions. It’s also important to strike a balance between normalized and denormalized tables for reporting purposes because the more normalized a data set is, the more tables it has and the more unwieldy it becomes.
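The three transformations noted above (splitting names, reconciling input formats, and translating coded values) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows, the column names, and the numeric gender codes in `GENDER_LABELS` are all made up for the example, and a real ETL job would take its mappings from the source system's documentation.

```python
# Hypothetical mapping of numeric gender codes to reporting labels;
# the actual codes depend on the source system.
GENDER_LABELS = {1: "Female", 2: "Male"}


def transform_row(row):
    """Turn one transactional row into a reporting-friendly row."""
    # Collapse stray whitespace introduced by inconsistent data entry.
    name = " ".join(row["name"].split())
    if "," in name:
        # Input format "Last, First"
        last, first = [part.strip() for part in name.split(",", 1)]
    else:
        # Input format "First Last"
        first, _, last = name.partition(" ")
    return {
        "first_name": first.title(),
        "last_name": last.title(),
        # Unrecognized codes are flagged rather than silently dropped.
        "gender": GENDER_LABELS.get(row["gender"], "Unknown"),
    }


# Made-up sample rows showing two different input formats.
transactional = [
    {"name": "jane  doe", "gender": 1},
    {"name": "SMITH, John", "gender": 2},
]

reportable = [transform_row(row) for row in transactional]
for row in reportable:
    print(row)
```

Even this toy version has to make judgment calls, such as what to do with an unknown code or a name that doesn't fit either format, which is a small taste of why the transformation step tends to dominate ETL work.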
Priming the data so that it behaves the way it needs to can be a profound learning experience, and so can analyzing that data. Building reports and visualizations often reveals where your ETL process could use improvement.
The Right Data, Big or Small
The most important thing for you to know is what kinds of questions your company (and its competitors) are asking. Are they asking questions easily answered by their operational data, or are they in search of information they don’t yet have?
Maxwell Wessel, the general manager of SAP.io, observes that “most companies spend too much time at the altar of big data” when the small data they already have holds the answers to their questions. What enterprises need to do is stay in tune with their respective industries and practice separating the signal from the noise. When a critical mass of people start asking questions that are truly unanswerable without big data (that is, when big data is also the right data), it is time to invest in harnessing unstructured information.
In the meantime, unfettered access to structured, operational data is a great place to start, especially if it’s your first foray into business analytics. There’s a great deal to be learned in the process, which can help you prepare for the challenge of big data to come.
This article was originally published by TDWI Upside on June 6, 2017.