Several months ago a short video appeared on YouTube featuring an interview with LinkedIn's Chief Scientist, DJ Patil. In it he discusses how 'Big Data' impacts the practice of analytics. I've only just got around to posting about it, but I am doing so now because he has some insights that I agree with and that are still relevant.
Big data is today most often associated with internet superstars like Google, eBay and Amazon. There are three other, lower-profile areas where big data is important: intelligence (spooks, the military, etc.), scientific and academic research, and the financial markets.
Big data's future is much bigger than this because more and more areas of human activity are going to be faced with vast data sets. When you hear people talking about the growth of knowledge and statements like 'if this data were printed then the stack would grow faster than NASA’s fastest rocket', you have to remember that there is a good chance that each page of new data is adding to someone's analytic data set.
I'm not quoting the guy verbatim, but here's what I heard, along with my takeaways from his comments:
- Open source 'big data ready' technologies like Hadoop (see my earlier blog or here) have come into their own now. Look to people with these skills over those with only SQL skills if you are facing big data challenges.
- We have reached a tipping point in the use of open source for commercial solutions to big data problems.
- If you want good analysts then the best place to look is in occupations where people will already have practical skills in manipulating big data sets: scientific fields like meteorology, oceanography and the like. I agree, but this is not the only place to look: in my experience I also need analysts who relate well to business decision makers, i.e. the people who make commercial decisions based on the analytics. This is perhaps less important in pure tech plays like LinkedIn.
- Open source will transform the practice of analytics in the next 3-5 years. I think it will take longer than this to really impact the more traditional industries. I'm not happy about this, but I am realistic about the difficulty of convincing business leaders that open source is a superior alternative to proprietary solutions. The money behind the big vendors will keep them going for a number of years yet.
One potential qualifier to DJ Patil's perspective is that, although he has a very impressive big data background as a mathematician, US Department of Defense analyst ('Threat Anticipation'), and former eBay Director of Strategy and Analytics, his current employer is LinkedIn.
The core of LinkedIn's big data is structured and fairly static: profiles of people. So I'm not sure how similar their big data challenges would be to, say, those involved in processing, understanding and predicting large streams of real-time data from financial markets or very large sensor arrays. On the other hand, the growth of LinkedIn communities and their related activities must generate large amounts of semi-structured data.
I also have no idea what LinkedIn's own analytic goals are beyond what DJ mentions on his own profile, where he says his analytics drive product features like:
- "People You May Know"
- "Who Viewed My Profile"
- "Groups You Might Like"
Maybe somebody reading this blog knows more?
The video is on YouTube and I embed it here for convenience:
Or you can download 'DJ Patil on How Big Data Impacts Analytics' directly from this blog.
Re: Jeremy
I have spent the last two years building a spreadsheet-based tool just as you describe. Currently I run it through the handy Google Docs form utility, which makes the data collection much easier. I collect about 30 data points about myself a couple of times each day and have done for many months now, resulting in a massive amount of data.
Re: Alexandra
I have spent the last three months trying to teach myself advanced statistics and data mining practices in order to try and make sense of the data and find useful correlations. Unfortunately, this is not my full-time job and so progress is painfully slow. Any help that could be provided by your experts would be greatly appreciated.
I am currently running simple correlation coefficients and Pearson formulas, but have identified many gaps in these approaches. Would love to discuss this further with people who have a better grasp of the math.
I am so pleased to have stumbled upon this website, exactly the community I've been looking to connect with.
I would be willing to make public examples of how my personal metrics tool works if you're interested.
Kind regards,
Bard
Posted by: Asmirah | Thursday, October 04, 2012 at 10:30 PM