Do You Hadoop?

One of the big areas of IT growth is in providing solutions for big data. Big data is defined as any data set that is too large to conveniently work with using traditional computing methods. The breakeven point of what constitutes big data gets larger each year as computers get more powerful, and today traditional computing can handle up to a few exabytes of data.

However, many computing problems have much larger data sets, and the field of big data was formalized as techniques and technologies to handle it were developed. Analyzing big data sets is now big business, and a report by the McKinsey Global Institute estimates that the worldwide market for big data will be $32 billion by 2017.

What are some of the places where big data has a practical application today? The most obvious are the social networking sites, where companies like Facebook and LinkedIn are faced with making sense of monstrously large amounts of data. Apple’s Siri is based upon big data. Mining big data for marketing purposes is one of the hottest uses of big data, and a whole new field called consumer genomics is trying to understand consumer behavior through the application of big data techniques.

Of course there are the more traditional sources of big data in such areas as weather analysis, astronomy, particle physics, oil exploration, medical diagnosis and genetics that routinely require looking at massive data sets. And just getting started is probably the biggest future use of big data – the Internet of Things – where ubiquitous sensors in the environment are going to spit out huge amounts of data constantly.

The business of big data has expanded so quickly that some large corporations now have a chief data officer. There are numerous industries that can benefit from manipulating and making sense of big data sets. The Obama administration announced the Big Data Research and Development Initiative in 2012 to explore how using big data could make the government more efficient. I suspect the NSA beat them to the punch a few years earlier.

There are a number of different techniques used to analyze big data. The most common one so far is to use an array of small servers to look at chunks of the data rather than trying to process it all on one large computer. Arrays of tens of thousands of individual blade servers have been used in this manner in a few applications.
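As a rough illustration of that divide-and-conquer idea, here is a minimal sketch (my own example, not tied to any particular big data product) that splits a data set into chunks, hands each chunk to a separate worker process, and then combines the partial results:

```python
from multiprocessing import Pool

def process_chunk(chunk):
    # Each worker only ever sees its own slice of the data,
    # e.g. summing the values in that slice.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(10_000_000))        # stand-in for a large data set
    chunk_size = 1_000_000
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    with Pool() as pool:                  # one worker per CPU core by default
        partial_results = pool.map(process_chunk, chunks)

    total = sum(partial_results)          # combine the per-chunk results
    print(total)
```

In a real cluster the workers would be separate servers rather than processes on one machine, but the shape of the problem is the same: no single computer ever has to hold or process the whole data set at once.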

In 2004 Google published a paper describing MapReduce, which discussed using a distributed architecture and parallel processing to handle large amounts of data faster. In this architecture the data queries are ‘mapped’ across multiple processors, and the results are then gathered back from each processor in a ‘reduce’ step. The Google effort was followed by an Apache open source project named Hadoop. Hadoop is still a fundamental piece of many of the big data techniques used today.
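To make the map and reduce steps concrete, here is a small word-count sketch in plain Python, the canonical example used to explain MapReduce. This is only an illustration of the idea, not actual Hadoop code: the map step emits a key/value pair for each word, the pairs are grouped by key, and the reduce step sums the counts for each word.

```python
from collections import defaultdict

def map_step(document):
    # Emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in document.split()]

def reduce_step(word, counts):
    # Combine all the counts emitted for a single word.
    return word, sum(counts)

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: run the map step over every document (in a real cluster,
# each document could be handled by a different machine).
mapped = [pair for doc in documents for pair in map_step(doc)]

# Shuffle: group the emitted values by key (word).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: sum the counts for each word.
results = dict(reduce_step(word, counts) for word, counts in grouped.items())
print(results)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

The power of the approach is that the map and reduce steps can each be spread across thousands of machines, with the framework handling the grouping and the movement of data between them.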

Probably the most visible result of big data for most people is going to be targeted advertising and the use of personal assistants. Who has not looked at a product on a web site and then seen advertising for similar products pop up all over their web and social sites? Advertising is getting very personal, and routine ads aimed at just you are right around the corner.

Some people are familiar with personal assistants from using Apple’s Siri. But future assistants are going to be far more sophisticated and will learn about you over time. People will become personally integrated with their own assistant, and having a constant computer companion will change the way that most people live.