Long Data v Wide Data

Something I’ve mentioned quite a lot in my Tableau for Sports series of videos is how the data is laid out and why that is important for your analysis. This is true regardless of which visualization tool you use, so can equally be applied to Power BI.

It’s kind of a classic example when you see a tutorial and then you try to apply it to your own data and you just can’t get it to work the way it does in the video. More often than not this is because your data is a different ‘shape’ to the tutorial you just watched. Being able to reshape data, and knowing when to, is an invaluable stage in the process of preparing data for further analysis and modeling.

At it’s simplest a dataset can be written in two different formats: wide and long.

In the wide format, each Team creates a unique row with multiple variables running across. Most datasets that you encounter in the real world will be shown in a wide format because it’s easier for our brains to interpret. This format makes it easier to compare values across and down compared to the long format. Tableau, and I would suggest most visualisation tools, would prefer the data in a long format. Almost all my tutorials work with long format data. This is definitely not a one size fits all solution. Ultimately, the decision about the preferred shape of a data table should be driven by the logic of your analytical goals and as I mentioned at the top of the blog, knowing when to reshape your data is a key skill.

Wide Data is not about the number of Columns

In the example below I have a sample of Statsbomb event data. I’ve had to cut off most of the columns here because otherwise the image would be too small. It’s important to recognise that this doesn’t mean this is necessarily a wide data set.

Each row is a single event. Identified by a unique event_id, but we can see that the timestamp, team and event type and various descriptors makes each row unique. The key thing to recognise here is that each row is one unique event – it just happens to have a lot of additional descriptors. So despite it having 184 columns, I wouldn’t consider this a wide dataset.

Where if we look again at my wide example we see that in this made up example there are only 4 columns, each row contains 3 different variables, compared to the long format where each row is just one variable.

Some Exceptions

Like all guidelines there are some exceptions. When would I not reshape wide data to a longer format.?

In the example above I would be reluctant to just simply pivot all this data into a long-format. The main reason is I have variables on very different scales. Would I really want one columns with Weight (kg’s), Height (cm’s) and scale (1-10) measures grouped. I don’t think there is a right answer here, it really does come down to knowing what you are trying to do with your final analysis and figuring out if reshaping the data would make that process easier in the long run

Latest Analysis Jobs

Jobs in Analysis 2023 Online Conference

Jobs in Analysis 2023 – December 7th. Hosted online via Zoom, Jobs in Analysis 2023 will deliver a number of presentations and discussions centered around all things related to employment as an analyst in football. 

Read More »