Data is everywhere around us. Why? Because deleting data costs more than keeping it. What can we do with it? How can we harness this data to improve existing processes or drive new insights?
Is big data new?
Most of us associate big data with data centres: large rooms or buildings filled with servers and computers. One may even think of an image of a Google data centre.
However, large rooms filled with computers are nothing new. Back in the 1950s there was the LEO computer, the first commercial computer used for business applications. It was owned by J. Lyons and Co., who used it to calculate invoices and manage deliveries. They were also involved in an early problem that required solving 12.5 million pair points for trains with only 2 KB.
Here we are in 2012 with this term, big data. Where do we start? Do we need to start? The starting point is any time someone comes with the question:
What can we do with this information?
One way to approach this question, or data problems in general, is with the following steps:
- Capture – Get the data
- Analyze – Find patterns and trends
- Storytell – Create a narrative with the data
- Adapt – How are you going to react or change in response to this newfound information?
Where does this data come from? Data comes from various sources. Sensors such as GPS give us location. APIs from websites such as Foursquare can help us dig into those location data points with more context. Access to data is key to transparency, so governments have been moving to be more open with their information through Open Data initiatives. Data also comes from existing services, such as logs.
Getting the data is the start, so what is next?
Data comes from many sources and in different amounts. Because it comes from many sources, the format of one source will differ from the next. Welcome to the world of unstructured data. With terabytes, petabytes or even zettabytes of data, how can you analyze that information on your computer? You can’t fit all of that data on your computer.
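At a small scale, the format mismatch between sources looks something like the sketch below. The two sample sources, their field names and the common record shape are all made up for illustration; real feeds will differ.

```python
import csv
import io
import json

# Hypothetical check-in data from two sources: a CSV export and a JSON API response.
CSV_SOURCE = "user,lat,lon\nalice,43.65,-79.38\n"
JSON_SOURCE = '[{"name": "bob", "location": {"latitude": 45.5, "longitude": -73.57}}]'


def from_csv(text):
    """Yield records in a common shape from the CSV source."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"user": row["user"], "lat": float(row["lat"]), "lon": float(row["lon"])}


def from_json(text):
    """Yield records in the same common shape from the JSON source."""
    for item in json.loads(text):
        loc = item["location"]
        yield {"user": item["name"], "lat": loc["latitude"], "lon": loc["longitude"]}


# Once both sources share one shape, they can be analyzed together.
records = list(from_csv(CSV_SOURCE)) + list(from_json(JSON_SOURCE))
```

This works when everything fits in memory on one machine. The interesting problems begin when it does not.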
Enter Hadoop. What is Hadoop? Originally I thought it was another database, then I thought it was an affordable option for OLAP cubes. Essentially it is a framework based on the MapReduce pattern, together with a file system store called the Hadoop Distributed File System (HDFS). One would use Hadoop to transform unstructured data into a usable state. From there we can put the data into our existing relational database stores and/or add Mahout on top of Hadoop for analysis.
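To get a feel for the MapReduce pattern itself, here is a minimal single-machine sketch in plain Python that counts words. It is not Hadoop code and does not use Hadoop's API; the function names are my own. Hadoop's contribution is running the same three phases across many machines, with the data stored in HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data is big", "data is everywhere"])
```

Because each line is mapped independently and each key is reduced independently, the work can be split up without the pieces needing to talk to each other. That independence is what makes the pattern suit very large data sets.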
There are downsides to Hadoop, though, and one of them is that it is mostly used for batch processing. Big players in data processing such as Twitter are doing real-time analysis with their own Storm. There are also advancements in this area, such as the use of algorithms like K-means for clustering.
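K-means is simple enough to sketch in a few lines. The version below is a toy, assuming small two-dimensional points that fit in memory. It seeds the centroids with the first k points purely to keep the example deterministic, whereas real implementations (Mahout's included) choose starting points more carefully. The sample points are made up.

```python
import math


def kmeans(points, k, iterations=10):
    """Cluster points into k groups; returns (centroids, labels)."""
    # Simplification: seed the centroids with the first k points.
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(iterations):
        # Assignment step: label each point with its nearest centroid.
        labels = [
            min(range(k), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels


# Hypothetical sample: two visibly separate groups of points.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, labels = kmeans(points, k=2)
```

With this sample the first three points end up in one cluster and the last three in the other, even though both starting centroids sit in the same group.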
Once we have the data from our various sources in a usable state, what kind of trends can we find in the data?
Data and graphs are traditionally static. With the current state of technology we can create interactive graphs to help users discover and explore the information at hand. One great example of this is the New York Times Obama Budget Proposal infographic. Rather than using traditional static graphs, they showed transitions between the graphs to help illustrate how the money flows. For more great information on how to tell stories with data, check out Amanda Cox’s talk at Eyeo.
What can data do to change the way we work or live? How can we turn big data into smart data? Traditional industries such as education can benefit from the use of smart data. If you took any standardized tests during school, you know that you would get your test back and maybe you got 100%, or maybe you got 16/20. But that’s it. If you got some questions wrong, you wouldn’t know why. Narrative Science is a company working on providing insight into education from this data. Now students will be able to find out that they got the questions wrong because they struggled with a certain section of the material, and the feedback could point the student to the exact pages to read up on.
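As a rough illustration of that kind of feedback (and not a description of how Narrative Science actually does it), the sketch below groups wrong answers by section and points to pages to review. The question-to-section mapping and the page ranges are invented for the example.

```python
from collections import defaultdict

# Hypothetical test: each question number belongs to a section of the material.
QUESTIONS = {1: "fractions", 2: "fractions", 3: "decimals", 4: "decimals", 5: "percentages"}
# Hypothetical page ranges covering each section in the textbook.
PAGES = {"fractions": "12-18", "decimals": "19-25", "percentages": "26-31"}


def feedback(wrong_answers):
    """Turn a list of missed question numbers into per-section advice."""
    missed = defaultdict(int)
    for question in wrong_answers:
        missed[QUESTIONS[question]] += 1
    # Put the sections with the most mistakes first.
    ranked = sorted(missed.items(), key=lambda item: -item[1])
    return [
        f"Review {section} (pages {PAGES[section]}): {count} missed"
        for section, count in ranked
    ]


report = feedback([3, 4, 5])
```

Here a student who missed questions 3, 4 and 5 is told to start with decimals, where two of the mistakes were, rather than just seeing 2/5.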
What will you do with the future of data? If you need help getting started, feel free to contact me.