The finest example of social media mining that I have found to date is the superb site Please Rob Me.
In case you can't guess, it mines the Twitter stream in real time to republish posts from anyone telling the world that they are not at home. Who would want to know this? Take a guess.
Check out my new site I'm-bored-and-broke-so-who-can-I-rob?.com. We feature full integration with Google Maps and their latest real-time Street StealView technology™ - so you can watch your place being robbed as it happens via our iPhone app (a bargain at $599)!
Continuing the book theme established in my last blog, here is another book that looks very interesting. I say 'looks' because I haven't purchased it yet. The book is Head First Data Analysis: A Learner's Guide to Big Numbers, Statistics, and Good Decisions by Michael Milton. It is very different from the last book I recommended.
The main differences are that it is aimed at the beginner, is highly visual, and at under 500 pages is significantly shorter.
Extensive examples make up the majority of the book, using Excel, OpenOffice, and the statistical computing package R.
What is really interesting is that R features prominently in this book, yet it doesn't even rate a mention in the Handbook of Statistical Analysis and Data Mining Applications.
The publishers recommend that you know how to use basic spreadsheet formulas before tackling the book. As a follow-up, they recommend Head First Statistics in the same series. I haven't yet looked into that title.
Author Michael Milton's approach in the book is very refreshing and engaging as well as being very practical. It is a radical departure from the dry texts of most (all?) university courses on the subject. Michael has degrees in philosophy and religious ethics and this may help to explain his innovative approach.
Below are a couple of screen shots of the book to give you an idea of how topics are presented. Click on the images to open a larger version.
Using the Google book preview on publisher O'Reilly's website, you can get a real flavour of the book. I'm impressed by what I see and I have put it on my list of books to buy soon.
I don't keep many actual books next to my desk these days. I have found that my hard drive has become my main knowledge repository. For those interested, everything I receive online (email, documents, spreadsheets, video, research papers, etc.) is fed into my knowledge base using DEVONthink. It's Mac (Unix-based) software.
A rare exception to this is a new book that has really impressed me: Handbook of Statistical Analysis and Data Mining Applications by Robert Nisbet, John Elder IV, and Gary Miner. It's available on Amazon for about AU$80.
Why has this 800+ page book squeezed its way onto my crowded desk? It's useful to a part-time data miner whose post-graduate maths and stats courses are in the dim and distant 1990s. I have found it useful in a number of ways:
I haven't yet made use of Section IV of the book (Measuring True Complexity, the "right model for the right use", Top Mistakes, and the Future of Analytics) but I know it's something I should get to.
The book is a practical guide to using SAS Enterprise Miner and STATISTICA Data Miner. There is also a section on SPSS Clementine, and sprinkled throughout the book are STATISTICA's C&RT, CHAID, MARSplines, and other data mining and graphical analysis tools. It's a pity that R is not included, but you can't have everything.
Here's a link to the table of contents.
I don't need it every week, but when I do I'm really glad I have it to hand.
Back in May of this year I took a look at WolframAlpha in my blog post Is WolframAlpha The Next Big Thing In Analytics? Since Wolfram's high-profile (rock star) launch, things had died down to a muted whisper - not a bad thing, as anything as ambitious as Wolfram needs time to mature.
For those not familiar with Wolfram|Alpha, here is a summary of its features from the company itself:
That has changed in the last couple of weeks, as a number of interesting things have happened:
Tell us about your organization's public data
Does your organization produce statistical information or other public data that you would like to make more accessible and easier to use? If so, please tell us about your organization and the data you now make publicly available or would like to make available, including details about its format.
Your data will be useful to us as we continue to develop tools we would like to offer to organizations like yours. While we won't be able to individually reply to everyone who fills out this form, we may be in touch to learn more about your data.
For more information on how public data will be used or accessed through Google, read our information for data publishers.
Ready to tell us about your organization's data? First select your organization type:
Here's an example search result using World Bank data:
Both Google and Wolfram are trying to turn search into answers. This is a pretty exciting development, and I look forward to Microsoft (with Wolfram soon to be a subsidiary?) and Google battling it out to answer more of my analytic questions.
As of today, Wolfram has the edge in terms of its ability to answer a surprisingly wide range of questions. Examples include:
Google, however, still has the edge in its flexibility to mine a vastly wider range of textual sources. Google's data mining (and answering) ambitions seem more modest than Wolfram's, but I suspect that the Bing announcement has driven Google Labs into overdrive. Expect more announcements over the coming year.
I read with interest the announcement that Canadian BI startup Indicee has successfully completed a second round of venture capital funding. In brief, Indicee sells a cloud solution: you upload your data and it automatically works out the relationships, freeing you to concentrate on answering your business questions. Here's a video from Indicee summarising their service:
I don't know the terms of the financing agreement, but the US$6 million raised is another sign that BI in the cloud is an area we should all keep track of.
Why The Cloud?
Here are four reasons:
1. Indicee was founded by Mark Cunningham and Fred Tummonds both originally with Crystal Services - makers of Crystal Reports. Mark was a developer of the original Crystal Reports product - possibly the most ubiquitous BI tool outside of Excel in the market today.
2. Their early seed funding included all of the original founders of Crystal Services.
3. This second round adds Granite Ventures who have also previously invested in companies such as:
Indicee is certainly not alone; other startups are taking similar but distinct services to market. For example, take a look at CloudSwitch, RightNow Technologies, RightScale or Good Data. Good Data offers real OLAP capabilities and runs on Amazon Web Services. It is also backed by an impressive list of people including Marc Andreessen, Tim O'Reilly, Esther Dyson and John Landry.
As we approach the end of the noughties, the BI market is offering some interesting and fundamentally new choices:
... and we can now choose to buy each of these services from all the big vendors (at big cost) as well as a raft of new entrants at substantially lower cost.
With prices like those from Indicee and others, we could be seeing a massive increase in the number of organisations making active use of analytics.
My Predictions?
It is still early days for SaaS and full cloud solutions - especially for larger organisations with mission-critical BI. For the big players, SaaS and cloud BI will remain a niche option for a number of years. In 2010, of these three alternatives, open source will have the biggest impact by far.
SMEs will perhaps be the biggest beneficiaries of these market developments in 2010. As new adopters of BI, they are better placed to make use of the cloud - especially when the entry cost to launch your first analytic service is less than AU$100 a month.
In a past life I co-founded and led a company specialising in the analysis of publicly available textual information. The basic idea was that if you could read every newspaper published every day, and throw in other easily accessible information sources like company filings to stock exchanges and national regulators, then you would learn a lot about:
The hunch was right, and after 1 million lines of code and several million dollars, we had a system that could automatically extract meaning from news 24/7. We launched a subscription service that read thousands of articles each day and visualised the activities they described. We sold many subscriptions to leading corporations and government agencies around the globe.
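As a toy illustration of the flavour of this kind of extraction (nothing like the production system described above, and with entirely made-up entity names and articles), here is a minimal sketch in Python: treat capitalised phrases as entity names, and link entities that appear together in the same sentence.

```python
import re
from collections import defaultdict

# Words that look like entities but aren't (a toy stop list)
STOPWORDS = {"The", "A", "An", "In", "On", "It", "This", "That"}

# One or more consecutive capitalised words, e.g. "Virgin Group"
ENTITY_RE = re.compile(r"[A-Z][a-zA-Z]+(?:\s+[A-Z][a-zA-Z]+)*")

def extract_relationships(articles):
    """Link entity names that co-occur in the same sentence."""
    graph = defaultdict(set)
    for article in articles:
        for sentence in re.split(r"[.!?]", article):
            entities = {e for e in ENTITY_RE.findall(sentence)
                        if e not in STOPWORDS}
            for entity in entities:
                graph[entity] |= entities - {entity}
    return graph

# Made-up example article
news = ["Virgin Group took a stake in Acme Corp. "
        "The deal was announced by Richard Branson."]
graph = extract_relationships(news)
print(dict(graph))  # Virgin Group <-> Acme Corp
```

A real system needs vastly more than this (named-entity disambiguation, verb-sense analysis, deduplication across sources), but the co-occurrence graph is the kernel of the 'mind map' idea.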
On the right is a print-out from that application. Everything was interactive and you 'surfed' through the 'mind maps' using a simple point and click interface. Here is a larger file that you can zoom into to explore: The Virgin Group.
In the course of this journey from start-up to a real company with cash flow we were presented with an interesting dilemma.
Early on in the history of the company we produced a detailed view of the personal investment strategy of the founder of one of the world's largest software companies. We did it by analysing the web of legal entities he had established and then tracking share transactions, venture capital and M&A deals and news reports, etc.
The end result was very interesting, and a printed copy of one of the main mind maps was lent to a board member of Philips (the Dutch electronics group). By coincidence, the same founder whose investment strategy we had analysed turned up for a meeting and saw the chart in their office. Apparently he hit the roof and thought that he was the target of some sort of corporate espionage.
This got us thinking, and because I had lived in the States, I immediately contacted our trusty lawyer. Our legal counsel advised us that in the US we ran the risk of being sued for breaching privacy laws - despite the fact that all of our source information was freely available.
As a side note, we solved the problem by agreeing not to locate our technology on servers in the US. We remained offshore, which (presumably) made it much harder for US citizens to reach us legally. It meant that we paid a little more for the data centre and internet traffic - but I think it was worth it.
Why am I writing this today? Well ever since then, I have had a keen interest in what public information is available and what you are allowed to do with it. So I read with interest the following article in Ars Technica:
Lobbyists beware: judge rules metadata is public record
The Arizona Supreme Court has ruled that the metadata attached to public records is itself a public record. Given the frequency with which metadata outs lobbyists' and corporations' efforts to mask their own contributions to public debates, this is a good thing.
Ars Technica, By Jon Stokes | October 29, 2009
The Arizona state Supreme Court has ruled that the metadata attached to public records is itself public, and cannot be withheld in response to a public records request. Such a ruling on file metadata may not seem like a huge win for open government advocates, but it definitely is, given that metadata has unmasked more than one lobbyist's effort to influence Congress.
In the Arizona case, a police officer had been demoted in 2006 after reporting "serious police misconduct" to his superiors. He suspected that the demotion was done in retaliation for his blowing the whistle on his fellow officers, so he requested and obtained copies of his performance reports from the department. Thinking that perhaps the negative performance reports had been created after the fact and then backdated, he then demanded access to the file metadata for those reports, in order to find out who had written them and when.
The department refused to grant him access to the metadata, and the matter went to court. After working its way through the court system in a series of rulings and appeals, this past January an Arizona appeals court ruled that even though the reports themselves were public records, the metadata was not. It turned out that Arizona state law doesn't actually define "public record" anywhere, so the appeals court relied on various common law definitions to determine that the metadata, as a mere byproduct of the act of producing a public record on a computer, was not a public record itself.
The case was then appealed to the Arizona state Supreme Court, which has now ruled that the metadata is, in fact, a public record just like the document that it's attached to.
Metadata follies, and the case of Google
If you want to know how important metadata can be in public policy deliberations, Google's history with it can be instructive, since the search giant has been both hurt and helped by metadata snooping.
Last year, the Australian Competition and Consumer Commission (ACCC) received hundreds of electronically submitted feedback letters opposing eBay Australia's decision to go PayPal-only for accepting auction payments. One of the most impressive letters was a 38-page missive that had obviously been written by someone with extensive and intimate knowledge of payment systems. A look at the letter's PDF metadata revealed that the author of the letter was none other than Google, which was upset that Google Checkout was being excluded in favor of PayPal. The metadata also revealed, embarrassingly enough, that the PDF had been written not in Google Docs, but in Microsoft Word.
The very next month, the tables were turned when the American Corn Growers Association somewhat surprisingly threw its weight behind the idea that Congress should launch a hearing to look into the possible anti-trust implications of the Google-Yahoo advertising deal. CNET's Declan McCullagh took a look at the PDF letter that the group submitted to Congress, and found that it had been authored by a staffer at the LawMedia Group, a DC lobbying shop whose client list includes the anti-Google, anti-net neutrality National Cable and Telecommunications Association.
To leave Google's metadata mixups and go back even further in time, one of the most famous metadata lobbying goof-ups occurred in 2004, when Wired busted California Attorney General Bill Lockyer circulating an anti-P2P letter that, after a look at its Word metadata, appeared to have been either drafted or edited by the MPAA.
As open government projects that solicit feedback from the public gain traction at the federal and local level, these types of metadata-related discoveries will become more and more common. Guaranteeing that file metadata is available to the public will help to ensure that we know who is trying to influence public discussion.
People complain (endlessly) about America - but we in Australia can only dream of having the public right to a tenth of the information made available in the US.
Does this disadvantage the practice of analytics in Australia? You bet - and we are a poorer nation for it.
Now, if I can just modify my code to automatically analyse the metadata of the datasets that Australian federal and state governments are now releasing under FOI-like (Freedom of Information) licences, maybe there are interesting things to be learnt.
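As a quick sketch of how that kind of metadata snooping can work: a .docx file is just a zip archive, and its author metadata lives in a part called docProps/core.xml, so Python's standard library is enough to pull it out. The demo file and author name below are made up, and PDF metadata would need a third-party library.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by the OOXML core-properties part of a .docx
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def docx_metadata(path_or_file):
    """Extract author-related metadata from a .docx (a zip archive)."""
    meta = {}
    with zipfile.ZipFile(path_or_file) as z:
        root = ET.fromstring(z.read("docProps/core.xml"))
        for tag, key in (("dc:creator", "author"),
                         ("cp:lastModifiedBy", "last_modified_by"),
                         ("dc:title", "title")):
            el = root.find(tag, NS)
            if el is not None and el.text:
                meta[key] = el.text
    return meta

# Demo with an in-memory stand-in for a real .docx
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("docProps/core.xml",
               '<cp:coreProperties xmlns:cp="http://schemas.openxmlformats.org'
               '/package/2006/metadata/core-properties" '
               'xmlns:dc="http://purl.org/dc/elements/1.1/">'
               '<dc:creator>A. Lobbyist</dc:creator></cp:coreProperties>')
buf.seek(0)
print(docx_metadata(buf))  # {'author': 'A. Lobbyist'}
```

It's exactly this dc:creator field (and its Word equivalent) that unmasked the lobbyists in the stories above.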
Anyone interested? Malcolm??
Sometimes you just start your day as normal. You work through the sea of overnight emails (living at the opposite end of the world to just about everywhere else is fun!) and before you know it you've learnt something new and unexpected.
It's not necessarily good, but it's useful to know. Here's today's lesson.
Many of my readers may know that I have started an online community for Australian and New Zealand business intelligence, information management, data warehousing, performance management and analytics experts. It's called the Business Intelligence CORTEX. Data is my thing and I want to be in contact with other data enthusiasts - and so the CORTEX was born.
There are many existing international organisations (such as TDWI), but for me at the end of the world (did I say that already?) these groups didn't (and don't) focus on local issues and organisations the way the CORTEX does.
What I didn't expect was to find the complete membership lists of these international organisations offered for sale. If you go to NextMark and other sites like it, you can buy not only the TDWI membership list, but also those of:
and many others.
I was surprised. Naive, right? For some strange reason I thought that my little corner of the world would be too 'long tail' for the online marketers to worry about. I guess not.
Take a look at the TDWI list here to see what our details are worth.
I happily register with sites online if they have content that catches my eye. Usually I will leave the opt-out/in options at their defaults. For example, downloading a TDWI paper includes a form collecting your information with the statement: "Your e-mail address is used to communicate with you about the above requested information and related TDWI products and services."
The key words are "related TDWI products and services." To me, that means 'not open to anyone with a couple of cents who wants to buy my details'.
I was wrong.
Time for my first coffee of the day and to sit for 5 minutes to contemplate the blue of the Pacific Ocean before I face the rest of the day.
Have a good one yourself.
The people, processes and technologies used by the managers of an organisation to answer their questions. Management information systems differ from regular information systems in that they are used to analyse the other information systems that run the organisation's operations.
How an organisation uses modelling (often involving extensive computation) to arrive at an optimal or realistic decision based on existing data.
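A minimal sketch of what such a decision model can look like, using entirely hypothetical numbers: a brute-force search over a toy product-mix problem. Real models would use linear programming over far larger data, but the shape is the same - enumerate feasible decisions, score each against the data, pick the best.

```python
from itertools import product

# Hypothetical inputs: profit per unit and machine hours per unit
PROFIT = {"widget": 30, "gadget": 45}
HOURS = {"widget": 2, "gadget": 4}
HOURS_AVAILABLE = 40  # machine hours available this week

def best_mix(max_units=20):
    """Exhaustively search feasible production mixes for the best profit."""
    best = (0, (0, 0))  # (profit, (widgets, gadgets))
    for w, g in product(range(max_units + 1), repeat=2):
        if w * HOURS["widget"] + g * HOURS["gadget"] <= HOURS_AVAILABLE:
            profit = w * PROFIT["widget"] + g * PROFIT["gadget"]
            best = max(best, (profit, (w, g)))
    return best

print(best_mix())  # (600, (20, 0)): widgets earn more per machine hour
```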