What is Big Data? What kind of “more” do you actually need?

More data is better, right? Well, maybe not. We need to ask whether we have a detail issue or a volume issue.

As Stephen Few put it in his most recent book, Big Data, Big Dupe:

“We seem to be suffering from a new delusional disorder – Petaphilia – an inordinate love for exceptionally large data sets.”

At a recent conference, Bill Schmarzo talked about how, for him, the value of data is in the granularity, not in the breadth of data we collect. There is an important distinction between big data meaning granular data and big data meaning the breadth of events covered.

As a practical example, knowing how many shots a player has taken and how many were scored can be useful. Adding detail such as the location of the shot and the body part used gives us models like Expected Goals (xG). For a period of time in any data collection process, volume is the issue: you need enough shots to create a decent sample. But eventually, adding more shots on top of the hundreds of thousands that companies like Opta must already have (keeping everything else the same) won’t improve the metric’s power by very much. If you want to improve the xG model, you need more granular data, not just more shots.
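To put rough numbers on the volume side of that argument, here is a minimal sketch. It assumes a simple binomial model and a made-up 10% conversion rate (both are my assumptions for illustration, not figures from Opta or anyone else), and it only looks at how the uncertainty around a plain conversion rate shrinks as you add shots. It is not an xG model.

```python
import math

# Hypothetical conversion rate, chosen only for illustration (not a real figure).
ASSUMED_CONVERSION_RATE = 0.10


def standard_error(p: float, n_shots: int) -> float:
    """Standard error of an observed proportion under a simple binomial model."""
    return math.sqrt(p * (1 - p) / n_shots)


def report(sample_sizes):
    """Return (n_shots, standard_error) pairs for each sample size."""
    return [(n, standard_error(ASSUMED_CONVERSION_RATE, n)) for n in sample_sizes]


if __name__ == "__main__":
    # Compare small samples with the very large ones data providers already hold.
    for n, se in report([100, 1_000, 10_000, 100_000, 200_000]):
        print(f"{n:>7} shots: +/- {se:.4f}")
```

Under these assumptions, going from 100 to 1,000 shots cuts the uncertainty from about 0.03 to under 0.01, while doubling from 100,000 to 200,000 shots barely moves it. More of the same data stops paying off long before more detail does.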

StatsBomb wrote an excellent piece on how they needed more information around a shot (granular detail) to overcome the shortcomings in existing data. The essence of the piece is that to improve the model, and to make it more useful to the teams that use it, they needed more granular data, not just more of the same data.

Most of us are working with limited resources. Sure, we might be able to buy, collect and store more data than ever before, but the point of all this is to improve our understanding of the sport. When you sit down to design your code window and pull out some videos of the upcoming opposition, are you making a conscious decision about whether it’s more important to collect a little from loads of games or a lot from just a few aspects of the game?

Of course, the answer is that it depends – anything without a minimum volume of stats is guessing at best – but ask yourself when enough is enough, and when more detail is what you need, not just more events.

I’ll finish with a great quote from Stephen Few from the same book. These are vital questions that you need to constantly ask yourself.

The value of information should be measured in terms of useful outcomes.

Did your understanding increase?

If so, did that understanding relate to things that matter?

If so, did that understanding produce better decisions and actions?

Only when you can answer Yes to each of these questions was the information – and the technology that helped you understand it – worthwhile.



Rob Carroll. Founder of The Video Analyst.com. Performance Analyst. Always learning.