Back in early 2011 I formed a data analysis business here in Ireland. Although the national sports here (Gaelic Football & Hurling) are hugely popular, very little academic research is ever conducted (or published), partly because of the amateur nature of the sports and the job prospects on offer after university. This is starting to change, but slowly compared to the evolution of the sports. As anybody who has tried to collect data will know, collecting, managing & analysing good clean data is a costly and time-consuming exercise. This is not a post about my business but rather a post about how easy some people think data collection is, and about the many pitfalls that occur.
Although these principles apply to team analysts, I have often found there is a tolerance for much more subjectivity within a team environment. Many coaches have their own definition or interpretation of what constitutes an assist, for example. For some it could be the key pass in a move, for some it is simply the last player to touch the ball before the scorer, and at times I have seen assists awarded to players because of the space they made for another player. While these are all perfectly legitimate definitions of an assist, there is a considerable level of subjectivity involved. I should also say that this flexibility can add greatly to the analysis process within a team – it’s just that for a systematic data collection process the rules have to be a lot more rigid.
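To make the point concrete, here is a minimal sketch of how the same passage of play gets credited differently under two of the assist definitions mentioned above. The player names, event labels and functions are all invented for illustration, not taken from any real coding scheme.

```python
# A simple move: each tuple is (player, action).
# "key_pass" marks the pass that unlocked the defence.
move = [
    ("O'Brien", "key_pass"),
    ("Murphy",  "pass"),     # last touch before the scorer
    ("Walsh",   "score"),
]

def assist_last_touch(move):
    """Credit the last player to touch the ball before the scorer."""
    scorer_idx = next(i for i, (_, a) in enumerate(move) if a == "score")
    return move[scorer_idx - 1][0] if scorer_idx > 0 else None

def assist_key_pass(move):
    """Credit the player who made the key pass in the move."""
    for player, action in move:
        if action == "key_pass":
            return player
    return assist_last_touch(move)  # fall back if no key pass was tagged

print(assist_last_touch(move))  # Murphy
print(assist_key_pass(move))    # O'Brien
```

Both answers are defensible, which is exactly why a coach can work happily with either, but a systematic collection process has to pick one and stick to it.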
The first step is deciding what to analyse. For me this involved a lot of testing and watching of games before I could get started with any data collection. Despite an in-depth knowledge of the games, it never ceases to surprise me how often ‘unusual’ events happen. Even today we come across scenarios that were unforeseen at the beginning and are even harder to define.
At first glance a successful pass seems like the most basic of variables to both collect and define, but I would like to use it as an example of how difficult it is to establish a clear definition of action variables. Although most people think they know what a pass is, have you ever asked a room? It still amazes me how many slightly different variations you can get. Let’s assume we go with ‘a deliberate attempt by one player to move the ball to another player on the same team’. So now we look at whether the pass is successful or not. Again this seems pretty straightforward, but take a look at these questions and tell me whether each is a successful pass or not.
Those are just 4 common scenarios that happen in almost every sport. As a primary data collector you need a very clear definition of what constitutes a successful pass, otherwise you end up with people interpreting the video for themselves – and then you almost certainly have dirty data. Now multiply the problems above by every possible action. And this is before we have looked at defining distance, pitch area, length and pass difficulty.
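One way to keep coders from interpreting the video for themselves is to write the rules down as code rather than prose. The sketch below is a hypothetical example of such a codified definition: the field names and the rulings (for instance, counting a deflected pass that still reaches a team-mate as successful) are illustrative assumptions, decided up front rather than left to each analyst.

```python
def pass_outcome(event):
    """Classify a pass event using explicit, rigid rules.

    `event` is a dict of boolean flags recorded by the coder:
      deliberate - was it a deliberate attempt to move the ball?
      same_team  - did a team-mate end up in possession?
      deflected  - did an opponent touch the ball in flight?
    """
    if not event["deliberate"]:
        return "not_a_pass"            # e.g. a blocked clearance
    if event["same_team"] and not event["deflected"]:
        return "successful"
    if event["same_team"] and event["deflected"]:
        return "successful_deflected"  # ruling fixed in advance, not per coder
    return "unsuccessful"

print(pass_outcome({"deliberate": True, "same_team": True, "deflected": False}))
# successful
```

The point is not the particular rulings but that every ambiguous case gets exactly one answer, the same answer, from every coder.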
Player names are single-handedly one of the hardest aspects to manage. You would be amazed, perhaps not in the Premiership but at lower levels, how often TV companies and newspapers get players’ names wrong. Even something as simple as a player being referred to by a shortened first name: James becomes Jim, Matthew becomes Mattie. If you are not careful and don’t have procedures in place to manage this, you can quite easily end up with 2 players when you should only have 1. Then you have to add in the misspellings (different versions between match programme, TV and newspapers) and even simple typos. In my business this is something we spend a lot of time checking and double checking, to make sure we have the correct player with the correct spelling. Don’t get me started on players with the same name in the same team!!
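A common way to handle this kind of problem is a hand-maintained alias table that maps every known variant back to one canonical spelling. The sketch below assumes such a table; all the names in it are invented, and a real system would need review procedures for names the table has never seen.

```python
import unicodedata

# Hypothetical alias table: lower-cased variant -> canonical name.
ALIASES = {
    "jim o'connor":  "James O'Connor",
    "mattie byrne":  "Matthew Byrne",
    "mathew byrne":  "Matthew Byrne",   # common misspelling
}

def canonical_name(raw):
    """Map a raw name from a programme, TV or newspaper to one canonical form."""
    key = unicodedata.normalize("NFKC", raw).strip().lower()
    key = " ".join(key.split())         # collapse stray internal whitespace
    return ALIASES.get(key, raw.strip())  # unknown names pass through for review

print(canonical_name("Jim O'Connor"))   # James O'Connor
print(canonical_name("Mattie  Byrne"))  # Matthew Byrne
```

The design choice that matters is that unknown names are passed through unchanged rather than silently guessed at, so they surface for a human to check before two records of the same player creep into the data.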
I’ve stolen that line from the book Sports Analytics (Ben Alamar), but it sums up a very common problem. When you are handling a small volume of games, and perhaps you are the only person coding them, this is less of an issue. Once you move to having multiple analysts and a much larger volume of games, maintaining a single version of the truth becomes a little trickier. In my case I have a network established and all games are tagged on a central server. The data is then exported from the analysis package and uploaded to a custom-built database. While this all seems fairly straightforward, the process leaves me with 3 copies of the data (I also have an offline offsite backup just to be sure): one still attached to the video, one local copy of the export, and now a copy of that data on the database. If an error is spotted in the data we must go back to the original game file and update that, which in turn updates the local backup, and then reload that file to the database. Any other process would cause chaos and any number of discrepancies in the data.
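The invariant behind that process can be sketched in a few lines: corrections are only ever applied to the master (video-attached) record, and every downstream copy is regenerated from it rather than edited in place. The functions and record layout below are illustrative assumptions, not the actual pipeline.

```python
def apply_correction(master, fix):
    """Fix the master record, then regenerate every downstream copy from it.

    Copies are never edited directly - that is what keeps all three
    versions of the data in agreement.
    """
    master.update(fix)                  # 1. correct the original game file
    local_export = dict(master)         # 2. re-export the local copy
    database_row = dict(local_export)   # 3. reload into the database
    return local_export, database_row

master = {"match_id": 101, "player": "Jim Byrne", "action": "pass"}
export, db = apply_correction(master, {"player": "James Byrne"})

# All three copies now agree on the corrected value.
assert master["player"] == export["player"] == db["player"] == "James Byrne"
```

Editing a copy directly would mean the next export silently overwrites the fix, which is exactly the chaos the paragraph above warns about.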
I would like to leave you with the following quote from Blake Wooster (@BlakeyGW), a man who knows something about data collection:
“Data quality equation… Fast+cheap=low quality, fast+good=expensive, good+cheap=time-delay (+ opportunity cost)… Agree?”