Like anyone who works with a lot of data, I have a strained relationship with Microsoft Excel. Its ubiquity forces me to tolerate it, yet I believe that it is fundamentally a malicious force whose main goal is to incite chaos through the obfuscation and distortion of data.1 Kara Woo
Excel is a powerful tool and like it or loathe it it’s a tool that is used on a daily basis in almost every place of work, in almost every industry around the world. Analysts and non-analysts are familiar with it and it’s become the default tool to store and analyse data in.
This is not an Excel bashing – although it is a warning. I was a daily user of Excel. It’s practically free and can do a whole range of simple and complex tasks. Although these programs have their flaws a lot of the issues come down to how they are used.
Microsoft Excel, Sheets or GoogleSheets, you can pick your poison the issues are largely the same. Spreadsheets can be used for Data Entry, Storage, Analysis and Visualisation. To avoid many of the pitfalls you need to treat these 4 processes as completely separate tasks. Too often we don’t keep the data entry separate from the analysis, we mix raw data with analysed fields and usually we structure the data input to be the same format (shape) as we need for analysis. This gives us errors. Panke 2008 reported that in 13 audits of real-world spreadsheets an average of 88% contained errors.
For me you should be trying to keep your Raw Data, separate to the storage, analysis and visualisation. These are four very separate jobs and need to be managed independently.
This is without doubt the biggest issue I see on an everyday basis. There is a difference in how we would like to input data v’s how a computer program would like to read that info. On the left is an easy to read table but best practice would dictate that each row of your data should be a single data point. On the right the data is ‘pivoted’ to provide a single test for a each athlete on 1 row.
Data Intended for Human Consumption, Not Machine Consumption
I advise people to leave the data entry input in the format they are comfortable with but create a process (ideally automated) that would turn the data from the left table to the right. There are software programs that will do this but if you are sticking with excel setup formula or pivot tables to do this on a consistent basis.
Keep a Copy of Each Raw Data File
Ultimately we will want to combine all our data together at some stage – but it’s best practice to keep a copy of the underlying data in a separate file. This hangs on the principle of a single version of the truth. If something happens your analysis or you need to update the data – you should know exactly where that change needs to happen. What copy of your data is the TRUTH. Keeping the files in an individual sheet allows you update the source and then pass that through to the analysis and visualisation. If you update the data further along the chain (let’s say in the combined file), you now have different versions of the data.
How I do it:
As an example here is how I manage this. My current process is to Analyse the games in Dartfish, Export to CSV and then load that csv into a MySql database. This would be the same process even without a database. If I find an error in my analysis or dashboards the only place I can make a change to this data is back in the Dartfish file. I never make a change in the CSV or DB. Making changes in the DB or CSV would mean that my analysis files are different and if I ever re-export the data it’s not going to contain the correct info. I change the Dartfish file as the single version of the truth.
I’ve also made a video of this process here…
Merge The Data
Did you know Excel can do this for you – no need to Cut and Paste from hundreds of sheets. There are loads of free/cheap ways to do this. If you have a budget I would highly recommend you check out Alteryx.
Whatever you do, be consistent. A consistent error in your data collection is much easier to fix than a random one. If you use Player Names, always stick to the same format, whichever you choose (Joe Bloggs, J. Bloggs, Joseph Bloggs) try and always use the same format.
Things that seem like they should be simple, like consistently naming columns the same thing; Consistent Variable Names – (Male, m), (TSB_Weekly, Weekly TSB) etc… can cause you a real headache when you try to merge files.
Consistently name files in the same way, consistent format for dates, be careful of white spaces at the end of text (using the TRIM function to clean the data)
These seem very trivial but it’s rare I come across organisations who have robust systems around this. Ben Alamar’s book, Sports Analytics for Coaches, Managers and other Decision Makers is a very accessible intro this area and offers some very practical advice.
Automate Automate Automate
If a task can be automated it should be!!