In the technology industry’s ongoing effort to create confusion and spark debate, we have another term that fits right in: Big Data. When the original definition is a whopping 23 words, it’s no wonder there is debate as to what specifically constitutes Big Data. Gartner sums it up as high-volume, high-velocity and/or high-variety data that needs to be processed for insight. Although quite short and elegant, this basically defines Big Data as just a subset of normal data. In other words, if everything that exists is data, then Big Data is the portion that is huge, comes at us very quickly, consists of many different kinds of information, and will be used primarily for gaining insight.

Although this seems to be a nice container for Big Data, I would suggest that the first three qualifiers (volume, velocity and variety) have always been part of the original definition of data. It’s nothing new; in fact, from the beginning of the digital age, keeping up with data generation has been the primary driver of the storage market. It drove the flip from analog to digital, and it continues to be the motivation for most innovation in storage to date. For me, size and variety are relative, and therefore have nothing to do with it. What seemed unmanageable five years ago is much easier to manage today, in terms of both size and variety. So for me, the only real distinguishing characteristic of Big Data is that it is collected for the sole purpose of gaining insight. Of course, this is also a little grey, because you could make the argument that ALL data can be analyzed to gain insight. But let’s say, for argument’s sake, that the distinction is intent. To summarize, Big Data is any data that was collected or analyzed for the sole purpose of gaining insight. Fair enough?

Great, now that the definition is out of the way, let’s focus on the actual challenges of Big Data. How much data do we collect, and what data should we collect? The answers to these questions actually go hand in hand.

A Note About Data Collection

The level and amount of data a business should collect is relative to the level of resources it has to analyze that data. I think the mistake most companies make is trying to collect too much data without the right resources. Data analysis is a science, one that people spend years of education learning to do accurately. Most businesses do not employ specialists in mathematics, statistics, scientific methodology, critical thinking, or the other specializations that together form the foundation of proper data analysis. So, the first step in determining the data you will collect is to identify the internal resources that will be tasked with analyzing that data. If you don’t have the expertise to build algorithms that will anticipate every move your customer will ever make, don’t try. Focus on datasets that are understandable, and ones that can help you make decisions. Avoid collecting too much data and then guessing as to what it says or means. Shifting strategic direction based on a mathematical error can be disastrous, and you should do your best to avoid making that mistake.

What Should You Collect?

Once you have a better understanding of your internal data analysis capabilities, you need to determine what data you are going to collect. An easier way to begin is to start with the questions you are trying to answer before you select the data to collect. Although this seems fairly generic, and it is, it is basically saying you need to understand your goal before you can ask the right questions. If you look at your quarterly or annual strategic plan, what issues are preventing you from accomplishing its goals? Is the issue an internal one, or an external one? Is there data that can be collected to better prepare you to make the necessary decisions? All of these questions need to be asked to clearly define the data to collect. In addition, when following this process, you will know specifically what you are looking for ahead of time, so there is less confusion when looking through the data later. Repeat this process for all of the identified issues. If you are having difficulty determining what data will help you solve a particular problem, consider bringing in an external resource to help. But remember, approach Big Data in digestible chunks. Collecting every bit of information available to analyze later is one thing, but actually using data to provide real answers is much more valuable.
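For readers who like to see a process written down, here is a minimal sketch of the question-first approach described above. It is only an illustration, not a prescribed tool: the Question class, the two helper functions, and every goal, question and field name in it are hypothetical. The idea is simply to list each question alongside the data it needs, and then see which data is justified by a question and which is not.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Question:
    """A business question tied to a goal in the strategic plan (hypothetical structure)."""
    goal: str
    text: str
    fields_needed: set[str] = field(default_factory=set)


def collection_plan(questions: list[Question]) -> dict[str, list[str]]:
    """Map each data field to the questions that justify collecting it."""
    plan: dict[str, list[str]] = {}
    for q in questions:
        for name in sorted(q.fields_needed):
            plan.setdefault(name, []).append(q.text)
    return plan


def unjustified(available: set[str], plan: dict[str, list[str]]) -> set[str]:
    """Return fields that could be collected but that no question asks for."""
    return available - set(plan)


# Example goals, questions and field names are made up for illustration.
questions = [
    Question("Reduce churn", "Which customers cancel within 90 days?",
             {"signup_date", "cancel_date"}),
    Question("Reduce churn", "Does contacting support predict cancellation?",
             {"support_tickets", "cancel_date"}),
]
plan = collection_plan(questions)
extra = unjustified(
    {"signup_date", "cancel_date", "support_tickets", "page_clicks"}, plan
)
print(sorted(plan))  # fields worth collecting, each backed by a question
print(extra)         # fields no question currently justifies
```

In this made-up example, page_clicks is available but no question asks for it, so under the approach above it would be a candidate to leave out until a specific issue calls for it.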

Overall, Big Data is a double-edged sword. Having access to mountains of information seems valuable, but it can sometimes do more harm than good. Avoid the hype and focus on solving specific business issues. If the need arises to collect and analyze more data in the future, so be it; you’ll be much better prepared with some experience under your belt.