9 Sep 2013

The risks of large datasets for decision-makers

Last week Ronald Coase passed away; he was awarded the Nobel Prize in Economics in 1991 for his seminal contributions on how transaction costs define the boundaries of firms. He was strongly committed to independent policy evaluation for each time and each context. In this regard he stated: “I don't reject any policy without considering what its results are. If someone says there's going to be regulation, I don't say that regulation will be bad. Let's see. What we discover is that most regulation does produce, or has produced in recent times, a worse result. But I wouldn't like to say that all regulation would have this effect because one can think of circumstances in which it doesn't”. In-depth policy evaluation requires large amounts of data, which in recent decades have become available thanks to exponential improvements in computer processing and storage capacity and the rise of the Internet. Exploiting those sources has potentially great managerial and political value, but without a precise understanding of the meaning of the data, erroneous conclusions may be drawn that could adversely affect future decisions.

Our recent research in the music industry group has dealt with this issue. In the working paper “Using data in decision-making: analysis from the music industry”, under consideration for publication in Strategic Change, we explored the risks of large datasets. As an example of analytically weak exploitation of large datasets, the article uses the report published by the European Commission, "Digital music consumption on the internet: evidence from clickstream data", co-authored by Luis Aguiar and Bertin Martens, which draws potentially industry-damaging conclusions by presenting a positive relationship between file-sharing activity and purchasing behavior. Their study, based on clickstream data, is the only empirical analysis to find that piracy is positive for the industry. Previous research debated the economic relevance of the negative effect, with some studies arguing that it is effectively indistinguishable from zero (see link for more detail). This is a controversial debate, and the Aguiar and Martens caveat “we cannot draw policy implications” calls into question the authors’ understanding of the potential impact of their publication.

Clickstream data records page views, so it is useful for charting a user's journey across the Internet and for inferring the user's potential interests. However, it does not capture the context of use, such as the actual activity on a page, the user's interpretation of the content, the user's intent, or the value the user seeks from the experience. It is therefore hard to believe that Internet users effectively purchase merely by visiting certain pages. It is also difficult to believe that an Internet user who allows the installation of tracking software is a typical consumer, or at least behaves naturally while the software is active on his or her computer. Sample bias is therefore arguably stronger than would be found in research based on consumer surveys. The authors also ignore important econometric issues, reverse causality for example. Their results are equally consistent with consumers using purchase pages as an information resource when deciding which music to download from illegal platforms. Without knowing which click comes first, we can hardly determine the direction of the relationship. In addition, we looked at the numbers in detail and applied the checks recommended by statistical manuals before carrying out a regression analysis: the unique variance explained by the downloading variable is close to zero. The variable can only explain a marginal fraction of the purchasing behaviour observed, which simply means that, in statistical terms, clicks on downloading pages should not be included in a model explaining clicks on digital purchase pages.
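The "unique variance" point above can be illustrated with a small sketch. The code below uses entirely synthetic data, not the actual clickstream dataset: a hypothetical shared driver (labelled "interest" here purely for illustration) generates both downloads and purchases, so downloads correlate with purchases on their own, yet once the shared driver is in the model, adding downloads raises R² by almost nothing.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def r2_two_predictors(y, x1, x2):
    """R-squared of a two-predictor OLS model, from pairwise correlations."""
    r_y1, r_y2, r_12 = pearson(y, x1), pearson(y, x2), pearson(x1, x2)
    return (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)

# Synthetic data: a latent 'interest' drives both variables; the extra
# variation in 'downloads' is pure noise with no link to 'purchases'.
random.seed(42)
interest = [random.gauss(0, 1) for _ in range(5000)]
downloads = [x + random.gauss(0, 1) for x in interest]
purchases = [x + random.gauss(0, 0.5) for x in interest]

r2_base = pearson(purchases, interest) ** 2          # interest alone
r2_full = r2_two_predictors(purchases, interest, downloads)
unique_variance = r2_full - r2_base                  # what downloads add

print(f"R2, interest only:            {r2_base:.3f}")
print(f"R2, interest + downloads:     {r2_full:.3f}")
print(f"Unique variance of downloads: {unique_variance:.3f}")
```

Downloads correlate strongly with purchases in isolation, but their incremental contribution to the model is close to zero: exactly the kind of check a decision-maker can ask for before accepting a regression-based claim.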

Data crunching is a real risk for policy makers and managers. In our article we use a recent example, the report published by the European Commission by Aguiar and Martens, as an illustration. Ronald Coase was aware of those risks and synthesized them in a brilliant statement: “if you torture the data long enough, it will confess”. In the last section of our article we offer an easy-to-follow guide for managers and policy makers to judge the validity of empirical reports, so that they can assure themselves that the data analysis was rigorous enough.
