Monday, November 11, 2013

Why Do Statisticians Reject Most of What the Media Calls Analytics?

Why do statisticians reject most of what the media calls analytics and reclassify it as reporting? I recently attended a Big Data analytics seminar with a statistician coworker. He thought it offered nothing new in analytics compared with what existed in the industry ten years ago. I agreed with him, and yet the 200-page slideshow on advanced mining and data visualization kept flashing in my mind. It is a fact that usage gives a word its meaning. Maybe it is time the meaning of the term analytics got revisited. But what are these statisticians talking about?

Analytics is a fascinating term, and its meaning seems to take on more flavors as more businesses embrace it. A couple of decades ago, financial services companies led the way in systematically capturing and storing large volumes of transaction-level data. They then employed quantitative analysts who identified patterns and predicted customer behavior. A well-known application of analytics from those days is the scorecard, for example the FICO Score. It predicted the likelihood of a customer walking away without paying his debt, and it made sense.
What really changed is the data landscape. A recent NPR story points out two related facts: the external data storage market is now worth $70 billion a year, and companies are allocating close to 15% of their information technology budgets to warehousing data. Data storage is now a function that many businesses subcontract to companies like Switch, Amazon and Dell; Switch’s server farm in Las Vegas, for example, is as large as seven football fields. It is easy to picture every click you make on a website, every swipe of your credit card and every ‘like’ on Facebook finally finding its way to one of these servers. This unprecedented growth of data results in the Big Data challenges we hear about every day. Analytics applications that consume this massive data are touted to provide a competitive edge to the businesses that make a conscious decision to store it.

However, analytics goes back to the days before we had computers, the internet and quants. It is now like the Wright brothers’ invention, which got really complicated once advances in physics, materials science and aerodynamics were wrapped around it. Who does not like a Dreamliner (not an endorsement)? In the same way, advances in information technology, hardware and software alike, transformed the way businesses capture, store and retrieve volumes of data, and made it a lot cheaper than in the past. Data mining and statistical techniques, previously used mostly in academic and research settings, found many applications in this newfound data world and flourished as analytics.
The benefits of analytics are often exaggerated by the tool and solution vendors who profit when a business decides to take the ‘analytics’ route. I find that these vendors and their sales literature add tremendously to the sloppy use of the term analytics. Let us look at some of their long words and sentences.

Application providers like SAP emphasize that their workflow solutions are configurable and ready for analytics integration and business intelligence. While database vendor Oracle extends R capabilities, Greenplum, another cool database vendor, is driving the future of big data analytics by integrating Base SAS libraries at the database server level. When we hear of partnerships with SAS and R, names long associated with reliable statistical modeling and analysis, the implication that these vendors provide analytics is quite convincing. A world of reporting tools, like MicroStrategy, SAP BusinessObjects or Hyperion, provides data mining, business intelligence and analytics that leverage multidimensional databases and bring insights to management in a drill-down, roll-up fashion.

Once you wade through the jargon, they, Google Analytics included, are still talking about reporting that just got extravagant with all the technological advancements. It is hard to define the term analytics without offending a lot of people, because a lot of people have already staked their claim in this ‘next big thing’. Not related, but an MIS division head I know recently changed his title to ‘Data Scientist’. When I asked what had changed, he said they had hired a consultant to do Hadoop and MicroStrategy for them.

Coming back to the main topic: what does core analytics mean? Yes, I added ‘core’ for emphasis. Probing deeper, it sounds like statisticians mean predictive modeling, or predictive analytics: a set of old-fashioned test, control and validation exercises using statistical techniques such as logistic regression, survival analysis, classification trees or even machine learning. Wikipedia does a better job of explaining this. People who do such work are often called modelers, and they want to differentiate themselves from the IT or near-IT people who primarily deal with reporting systems. The seemingly simple issues modelers deal with are not solved by the smartest visualization software. What is a statistically sound substitute for missing values in sample data? How do you derive performance for a customer you never had? Hundreds of journal articles have been published on these topics and hundreds of PhDs awarded, but there is still no agreement. Since there is no single rule and generalization is not possible, software cannot hide the problem under a layer. On the data side, such analytics is often supplemented with data that is not available in the corporate Hadoop Big Data mine, so modelers do not believe an off-the-shelf application sitting on that mine is going to get things done. There is a difference in deliverables too. The so-called generic analytics applications provide reports, tracking dashboards or warning systems. Core analytics deliverables are sets of rules that sit in an application and act like an expert, say a scoring model that replaces an underwriter. Whereas humans learn from new environments and their own mistakes, an expert system pretends that what it gleaned from the past is still sound. Moody’s AAA ratings of bonds that turned out to be junk in 2008 are an example. (I know proponents of genetic algorithms and artificial intelligence will argue otherwise, but these systems cannot detect a human lie.)

The paragraph above attests to how drawn-out and jargon-laden this topic is. Liberals have no place in such discussions.

Then there is an overlap between these two worlds. With Google Analytics, for example, it is possible to set up a test-control strategy to see what works best in real life (A/B testing). Many analytics applications can be configured to work dynamically: Amazon recommendations and fraud detection, for example, are trained on the fly, but the rules behind them still lie in the disputed land.
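For readers who want to see what the statistical half of an A/B test might look like, here is a minimal sketch in Python. It is not what Google Analytics or any other tool actually runs. It is a textbook two-proportion z-test, and the function name and the visitor and conversion counts are hypothetical, made up purely for illustration.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    conv_a, conv_b: number of conversions in control (A) and test (B).
    n_a, n_b: number of visitors shown each variant.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 5,000 visitors per variant (assumed, not real data).
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

The arithmetic is the easy part and any tool can do it. The questions a modeler would argue about (how the groups were split, how long the test ran, what else changed in the meantime) sit outside the function.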

I counted at least a dozen uses of the word analytics in our meetings last week. It almost always meant some numbers to support an idea or argument. A report. With that statement, this topic is open for discussion.