Puck Possessed, the Book

At #SEAHAC19 I did a presentation on my Puck Possessed issues, and announced I was planning to write a book. And I did, or at least made a start with it, but I have decided to turn that content into several blog posts. They will appear here, on this sub page.

Even though they now will appear as blog posts I will still refer to them as “the book”.

Here’s a link to download the actual book: Hockey Analytics for Kids

Title: Puck Possessed

Subtitle: A visual introduction to hockey analytics for kids, adults (who have yet to grow up) and visual thinkers

Purpose of the book: Making hockey analytics available and accessible to kids, while explaining the basics and learning some basic math and statistics through reading about hockey.

Intended audience: Kids of early age (visually appealing) to junior high (math, stats), so K-9, adults, visual thinkers

Images: Images are produced by myself, unless otherwise noted

INDEX

  • Introduction and inspiration
  • How “nerdy” do you need to be?
  • Let’s start with the basics: data
  • Math & Statistics
    • Learning numbers and solving problems with patterns
    • Building the basics through addition and subtraction
    • Using mathematics to solve problems
    • Introduction to multiplication and division
    • Learning about fractions and decimals
    • Using mathematics to solve problems
    • Operations with numbers
    • Learning about statistics
    • Shot locations
  • Brief overview of Hockey Analytics & Visualization
  • Putting it all to use
    • Google Sheets
    • NHL data
    • Data Services
    • Storing Data (SQL, Google, others)

Introduction and inspiration

There are many books about hockey on the market, of which a small number are about hockey analytics. And even fewer are written for an audience who may not yet be familiar with, or into the technical aspect of analytics, making for very theoretical (and visually unappealing) books that assume an understanding of many of these technical aspects. Actually, other than the Hockey Abstract and Stat Shot series by Rob Vollman, there is really not much else on the market for the hockey analytics newbie. On the other hand, there are again many kids’ books on hockey with some being very visual and touching on analytics, but they typically just show without actually explaining.

This book is written for these specific audiences:

  • not yet familiar with hockey analytics;
  • likes learning through visuals that are explanatory and understandable;
  • wants to be introduced to and learn about math and statistical concepts through a topic of interest.

That being said, I still believe that the experienced hockey analytics audience can pick up a few concepts or ideas here as well, that will increase their understanding of (the visualization of) hockey analysis. And for any reader, I believe it will be a fun book to read.

Hi. I’m RJ. And I am puck possessed.

I grew up in The Netherlands, where football (or soccer) is the number one sport by a mile, and “hockey” is used to talk about field hockey rather than ice hockey. And although I played field hockey for many years, for some reason the variant on ice has always appealed to me more. It wasn’t until I was in my 20s that I finally started playing ice hockey, but more at a level and seriousness that would best compare to shinny, or pick-up hockey. In my last year in The Netherlands before emigrating to Canada, I also played inline hockey (roller hockey) in a Northern Germany amateur league, with a fun group of other Dutch ice hockey fanatics. In short, I was never really able to play my most favorite kind of hockey (it didn’t help that The Netherlands has fewer hockey rinks than there are in Calgary alone!) and I am trying to catch up, even when my body is telling me not to. But regardless of playing time, I devoted many hours to watching, analysing, and trying to better understand ice hockey through World Championships, Olympic tournaments and the occasional NHL playoff game available on a torrent site (we’re talking pre-live-streaming games here).

In addition to a love for sport in general and hockey (from here I’m referring to the kind on ice) specifically, I have always loved illustrations, comics, and visuals explaining concepts. I was an avid reader of anything called <insert pretty much anything> illustrated and comics like Tintin and Lieutenant Blueberry, and I absolutely loved travelling with the wonderful visual guides from DK books, described as a “visual feast” by a reviewer:

Not really believing in my own illustration capabilities, I studied to do spatial analysis and mapping instead, which in time turned into visualization of all kinds of data, not just spatial. And once I applied these skills to hockey data, one thing led to another and eventually I started creating the Puck Possessed Issues, which in turn led to the creation of a book like the one you are reading right now.

I’m going to assume it is clear to you now that I love hockey, and that I am crazy about visuals that tell stories. Like the following, from a book about golf that I got at a book fair when I was still very young. Isn’t it amazing?

What I also like a lot about the visual language is that it is universal. No matter what language you know (and don’t know), the concepts of the golf swing (above) or from “Johan Cruyff geeft voetballes” (Johan Cruyff teaches soccer, right) are still understandable without understanding the actual written language through the visuals.

These, and many other books of this type that I grew up with, inspired me to write this hockey book in similar fashion. Not just because they are fun, but because they actually taught me about countries and buildings, how to swing a golf club and concepts about soccer. And they made me understand far better and deeper than what I am convinced textbooks would have been able to achieve.

Over the years I have run into quite a few of these visual explanatory and exploratory books, most of them related to sports one way or another. Some of these are shown a little further below, with examples of visuals being used to explain different concepts in sports: strategy, X’s and O’s, technique, positioning, uniforms, line-ups, and more. They really gave me many “aha” moments, and all of them inspired me then to learn more, and now, to write this book.

The content of the book is both explanatory and instructional; it starts with looking at the different types of analysts in relation to hockey or other sports, followed by some concepts around the required elements of data analysis and visualization: data, tools, math & statistics, analytics, visualization. It then finishes with practical applications of all the learned skills. It will also include examples of math and statistical concepts based on the Albertan Kindergarten to grade nine level of math to allow younger readers to go along and pick up some skills they can apply in the classroom. And lastly, it will contain either full page or partial inserts of Puck Possessed issues to serve as examples of a discussed topic.

I hope readers of this book will have as much fun and enjoyment as I had while creating it, and better understand hockey analytics in general and how it is applied today. Enjoy!

How “nerdy” do you need to be?

This book is focussed on kids and their parents who like hockey first and foremost,  who like to play with statistics beyond points and +/- to better understand the quality and value of players, and adults new to hockey analytics. Some refer to those people as nerds and hockey nerds. While the reasoning for that stigma escapes me, fortunately we live in a time where that is actually not necessarily a bad thing. Actually it’s kind of cool!

Do you like to get lots of statistics from magazines or websites and play with the data? Or do you like picking out one topic, getting specific related data and analyzing it to find some hidden secrets? Perhaps you are not too much into data crunching but like to focus on visualizing the data so that it highlights your findings and communicates them clearly to people who read your blog? Maybe you are just a fantasy hockey player who wants to get the upper hand by understanding hockey data better than your opponents, or perhaps you just want to better understand your team in the players they trade for, and whether your team plays at their level or is (un)lucky? Or are you a bit of a know-it-all, who just wants to know as much as possible about hockey to be the one to answer everyone’s trivia questions?

Most of you may be a little bit of this and a little of that, and some may not relate to any of these descriptions at all, or not strongly. It doesn’t really matter. This book is for all of you. If you like numbers but understand that they don’t replace expert eyes when evaluating players or teams but are a different view in addition to the expert’s eye, then we’re going to have fun.

So let’s start with the basics: what are data or statistics, and how do we get them? According to Wikipedia, statistics is a branch of mathematics dealing with data collection, organization, analysis, interpretation and presentation of masses of numerical data (Merriam-Webster dictionary). While many scientific investigations make use of data, statistics is concerned with the use of data in the context of uncertainty and decision making in the face of uncertainty.

Wow, that’s a mouthful. Let’s simplify that a bit:

  • We’re going to be dealing with numbers a lot so that somehow some math gets involved is no big surprise;
  • We will need data, so we need to collect it in some way or another;
  • To work with the data properly we need to organize it, which includes cleaning and putting it in a format that works for our purpose and tool(s) of choice;
  • We want to know what this data means and tells about hockey players and teams, so we will do some analysis;
  • We need to know what we find actually means, and what it specifically means in our context of hockey;
  • Lastly, unless we want to keep it all to ourselves, we want to present our data in a way to communicate the meaning of our findings based on the results of our analysis.

I think in terms of hockey analytics, data are often referred to when people talk about statistics. So what is data then, one might ask. Let’s go back to Wikipedia: Data is a set of values of subjects with respect to qualitative or quantitative variables. Data and information or knowledge are often used interchangeably; however data becomes information when it is viewed in context or in post-analysis. Data is measured, collected and reported, and analyzed, whereupon it can be visualized using graphs, images or other analysis tools. Raw data (“unprocessed data”) is a collection of numbers or characters before it has been “cleaned” and corrected by researchers.

Ok. There appears to be quite the overlap between data and statistics, at least in the sense of how we talk about them. I’m going to argue here that we mostly need data, and often raw data, to which we may apply some statistical analysis.

Let’s start with the basics: data

There are a number of sources that can be used for analyzing hockey, of which NHL.com is the most obvious. (Unfortunately the site does not let you easily grab or download the data in bulk, but we’ll get to that later on. For now we’ll just assume you can copy & paste the data we will be talking about.) It has a lot of good data, under the tab called Stats, where they provide leaderboards, player and team level data, and a glossary explaining what everything means, something not available on every website with hockey data and very useful.

Hockey-Reference.com is another great site for getting data, and it does allow for downloading the files to your machine in a more user friendly manner. It also provides the Player Index, a tool to filter specific data rather than a (number of) season(s).

Both NHL and Hockey-Reference provide aggregated or summarized data, so rather than getting data shot-by-shot, you’ll see something like player X had 5 shots in this game, or 123 shots this season. Granularity, or level of detail of the data is important to think about when you define the project you want to work on. You can imagine for a historic trend in number of goals in the NHL a season by season total would be enough. But if you want to do a project for a specific team you’d have to go a bit deeper, getting shots per season per team. And for analyzing specific players you have to go into even more detail, getting shots per season, per team, and per player. When you also remember that players get traded from one team to another, sometimes mid-season, what started out as a simple project quickly becomes far more complicated.
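To make the idea of granularity a bit more concrete, here is a minimal Python sketch. The shot records, team codes and player names are all made up for illustration; the only point is that the same shot-by-shot data can be rolled up to different levels of detail, and that a mid-season trade shows up as one player appearing under two teams.

```python
from collections import Counter

# Hypothetical shot-by-shot records: (season, team, player).
# All names and values are invented for illustration.
shots = [
    ("2018-19", "AAA", "Player X"),
    ("2018-19", "AAA", "Player X"),
    ("2018-19", "AAA", "Player Y"),
    ("2018-19", "BBB", "Player X"),  # Player X was traded mid-season
    ("2018-19", "BBB", "Player Z"),
]

# Coarsest level: shots per season.
per_season = Counter(season for season, _team, _player in shots)

# One level deeper: shots per season, per team.
per_team = Counter((season, team) for season, team, _player in shots)

# Most detailed: shots per season, per team, per player.
per_player = Counter(shots)

print(per_season["2018-19"])                        # 5
print(per_team[("2018-19", "AAA")])                 # 3
print(per_player[("2018-19", "BBB", "Player X")])   # 1
```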

Besides NHL and Hockey-Reference there are a number of websites managed by hockey fanatics who collect, clean and analyse some of this data, and offer it for download. Some for free, others for a subscription fee. This varies from high level, seasonal and team-level advanced statistics to highly detailed where every event in a game is tracked (think, shots, hits, goals, saves, etc.). Here are a few that are currently very good: Evolving Wild, Natural Stat Trick and Corsica. But keep in mind these sites and the dedication for keeping them up-to-date change continuously, so always make sure they are still regularly updating, and also look out for others and new players in the market.

book8
book9
book10

So now we know where to find some data, but how do we get it into our (virtual) hands?

Based on your personal interest, your preferred way of collecting the data from websites will be one of the following: copy & pasting, using browser developer tools with data converter websites, writing code to automate scraping data, or paying money for a subscription or a one-time payment.

There are some alternatives somewhere in between, but before we dig deeper into these alternatives, let’s talk about live vs. static data. If you want to show the number of shots taken per team and per season, you can download one big data set for all previous completed seasons, and add the current season when it is concluded. Those datasets will not change over time, so one download will be all you need. But how about measuring a shooter’s shooting percentage as the season progresses? Or how the standings change from day to day? In that case we can still load data manually every day and add it to our existing data set, but that becomes quite a bit of (boring) work. This is where “live” data is more useful. With “live” data I mean data that gets updated on a daily or game-by-game basis. Let’s start with static, or one-time data.

We can go to the website of our choice and load the table with the data we want, select all the rows and columns of the table, copy the data and then paste it in Excel, Google Sheets or something similar. Unfortunately not every website works that simply anymore and we may have to learn some HTML (the language used to make websites) to specifically identify the table on a webpage that we want to download. Copy & paste works fine for small datasets, say 31 records (one per team in the NHL), but when we want data for all players, typically one table page is not sufficient. Then we need to copy & paste the data from the first table page, load page 2, copy & paste, load page 3, etc. Quite the effort if there are 10 pages or more!

Luckily there often is an easier way. Google Sheets has functions to load html tables either directly or through specific XML-paths, and there are some scraper tools available that claim they can find and download the table’s full dataset. People that like to (learn to) write code can use languages like Python and R to write a few lines to accomplish the same, while storing the collected data in one complete dataset.
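For those curious what “a few lines” of code might look like, here is a minimal sketch in Python using only the standard library. The `TableParser` class and the little HTML snippet are made up for illustration; a real page would first have to be downloaded, and real tables are usually messier than this one.

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row currently being read
        self._cell = None   # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# A made-up snippet standing in for a downloaded web page.
page = """
<table>
  <tr><th>Team</th><th>Shots</th></tr>
  <tr><td>AAA</td><td>123</td></tr>
  <tr><td>BBB</td><td>98</td></tr>
</table>
"""

parser = TableParser()
parser.feed(page)
header, *records = parser.rows
print(header)   # ['Team', 'Shots']
print(records)  # [['AAA', '123'], ['BBB', '98']]
```

Note that everything comes out as text, so numbers like 123 still need to be converted before doing any math with them.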

Another easy alternative for those who have some money to spare are websites that offer (often small) subscription fees, basically to keep the computers and servers running that make the data available. Subscribers can go to these sites, log in, and download the data whenever they need in an easy format like Excel or CSV.

Some websites offer what they call APIs, backdoor entrances to the site’s data, often raw, that can then be collected and used. To do so one often needs to pay some money, as well as have some knowledge of how to make this all work. These APIs (the NHL provides a very popular one that has pretty much all data they collect) require some light technical coding (or programming) skills to collect, clean and format the data, and then export the data to be used for analysis and/or visualization. Other than providing a lot of data, another benefit of these APIs is that they are typically up-to-date and continuously refreshed when new data becomes available, often even during a game! Some of the previously mentioned subscription services actually get their data from the NHL API, clean and format it, and then make it easily available to their subscribers.

But in the end we end up with a static dataset ready to be used for our analysis.

The second type of data is “live” data (written in quotes as the data is not up-to-the-second live, but typically a couple of minutes behind or made available right after a game finishes), which as mentioned is better when we need daily data that is automatically updated.

Some sites can be loaded in Google Sheets through the =IMPORTHTML() function, which checks the provided HTML (web) page for new data every time someone uses the Sheet. Of course you could also write your own code in R or Python, or any other language you may be familiar with, to automatically check every hour or day to see if any updates to the data were made.

The before-mentioned subscription services typically also update a few hours after a game is finished, so again, if you have some funds available that is probably the easiest way to go.

One thing to consider is that this type of “live” data only gives current state, and overwrites the previous day’s or game’s data automatically. Unless you write some code to store this data somewhere you will only have the most up-to-date data available to you.
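As a rough sketch of what “storing this data somewhere” could look like, the Python below adds each day’s standings to a growing table, tagged with the date. The function name `append_snapshot`, the team codes and the point totals are all invented; `io.StringIO` stands in for a real file that you would open in append mode.

```python
import csv
import io

def append_snapshot(store, snapshot_date, standings):
    """Append one day's standings to a CSV text store, tagged with the date."""
    writer = csv.writer(store)
    for team, points in standings:
        writer.writerow([snapshot_date, team, points])

# io.StringIO stands in for a file on disk opened in append mode.
store = io.StringIO()
append_snapshot(store, "2019-01-01", [("AAA", 50), ("BBB", 48)])
append_snapshot(store, "2019-01-02", [("AAA", 52), ("BBB", 48)])

# Read everything back: both days are still there.
history = list(csv.reader(io.StringIO(store.getvalue())))
print(len(history))  # 4
print(history[2])    # ['2019-01-02', 'AAA', '52']
```

Because every row carries its date, yesterday’s numbers are kept instead of being overwritten by today’s.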

Another thing to consider is what data to download. Most services and websites offer the option to for example download events for all strengths, only 5 on 5, powerplay and shorthanded, etc. For player data they often give the option to only use data from players that have played a certain number of games, or data from only the regular season, since playoff hockey is so different.

Once we have the data we want, typically we find out that there are some issues with it, so we’ll need to “clean it up”. This is a part of data analysis that should not be underestimated, both in the time it takes and in its effect on quality. Unfortunately different sites use different player names (Chris or Christopher for example) and team abbreviations, different column names like +/- and plmns, and different ways of storing data, like time on ice in minutes, seconds or hours:minutes:seconds. Sometimes there is more than one player with the same name, and there are different position descriptions, inconsistent player identifiers, and the list goes on and on. Despite these often being small issues, cleaning data can, and often does, take a significant amount of time, and much more than the eventual creation of the data visualization itself. It is highly recommended though to not skip or quickly go over this part, as it will eventually bite you in the you-know-what if you don’t look into this properly.
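Here is a small, hedged Python sketch of a few of these clean-up steps. The lookup tables, the player name and the column names are hypothetical, and the code assumes that a time with one colon means minutes:seconds; a real project would have to check what each source actually uses.

```python
# Hypothetical lookup tables; a real project would build these per data source.
NAME_ALIASES = {"Chris Example": "Christopher Example"}
COLUMN_ALIASES = {"plmns": "+/-", "toi": "TOI"}

def toi_to_seconds(value):
    """Convert time on ice given as 'MM:SS', 'HH:MM:SS' or plain minutes to seconds."""
    text = str(value).strip()
    if ":" in text:
        seconds = 0
        for part in text.split(":"):
            seconds = seconds * 60 + int(part)
        return seconds
    return round(float(text) * 60)

def clean_record(record):
    """Return a copy of one row with standard column names, player name and TOI."""
    cleaned = {COLUMN_ALIASES.get(key, key): value for key, value in record.items()}
    cleaned["Player"] = NAME_ALIASES.get(cleaned["Player"], cleaned["Player"])
    cleaned["TOI"] = toi_to_seconds(cleaned["TOI"])
    return cleaned

raw = {"Player": "Chris Example", "plmns": 5, "toi": "18:30"}
print(clean_record(raw))
# {'Player': 'Christopher Example', '+/-': 5, 'TOI': 1110}
```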

To avoid having a lot of data and only using part of it, which would not be great for performance and would make it harder to work with, we’ll want to remove the data that we don’t need. Ideally we do this by keeping the main dataset as it is and creating a subset that meets our specific needs. That way we can always go back one step rather than having to start all over.
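In code, keeping the main dataset and working from a subset could look like the sketch below. The events, team codes and strength labels are made up for illustration.

```python
# Made-up main dataset; it is never modified below.
all_events = [
    {"team": "AAA", "strength": "5v5", "event": "shot"},
    {"team": "AAA", "strength": "PP", "event": "goal"},
    {"team": "BBB", "strength": "5v5", "event": "shot"},
    {"team": "AAA", "strength": "5v5", "event": "goal"},
]

# The subset is a new list, so the main dataset stays intact.
aaa_5v5 = [
    row for row in all_events
    if row["team"] == "AAA" and row["strength"] == "5v5"
]

print(len(all_events))  # 4
print(len(aaa_5v5))     # 2
```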

Now that we have a better idea of what we need to do to collect and clean data, let’s talk about math.

Math & Statistics

This chapter introduces a number of mathematical concepts of increasing complexity, roughly based on the Alberta Curriculum for Kindergarten to grade 9 (http://www.learnalberta.ca/content/mychildslearning/index.html). It is my belief that the majority of kids will learn better if concepts are introduced in an environment or based on a topic that has their interest. For many children in Canada, and in many other countries, sports and hockey specifically are excellent “wrappers” for these concepts.

Learning numbers and solving problems with patterns

To learn about numbers, we’re going to start with the best team of the regular 2018-2019 NHL season, the Tampa Bay Lightning:

What jersey numbers do you recognize? And what is your favorite number? Who do you think the defensemen are, the smallest or the tallest players? Do you think some of them are wearing popular numbers on their jersey, and why do you think they wear the number they have on the back of their jersey?

Puck Possessed Issue 20 is fully dedicated to the jersey number, and on the next page I added an image of the issue for illustration purposes.

What jersey number do you think is used most often in the NHL?

Building the basics through addition and subtraction

After learning what numbers are and counting, the next thing we learn to do with numbers is recognizing patterns, adding them up and subtracting them from each other. Many statistics are counts of specific events; things like how many players are on the ice in total or per team, do both teams have the same number of players, how many teams are in each Division of the Eastern & Western conference, and do they have the same number of teams?

To get a quick idea of the answer to some of these questions we can subitize (recognize at a glance) and see familiar arrangements. All the questions above can be answered by adding and subtracting players and/or teams.

Using mathematics to solve problems

As we get more comfortable with counting, adding and subtracting, we count, describe and estimate quantities in a variety of ways. And we solve problems using numbers, patterns, measurement and data collection and use graphs and charts to communicate information.

If we put the players we looked at earlier in a chart based on their heights, they would look like this:

So far we ordered the players alphabetically by their last name. Now let’s order them based on their height:

The two goalies are amongst the tallest players on the team, and Adam Erne (#73) is the tallest forward but seven players are taller than him. Anton Stralman (#6) is the shortest defenceman but there are still 6 players shorter than he is.

Introduction to multiplication and division

From here we try to understand, apply and recall addition facts and related subtraction facts, and use mental mathematics strategies. Multiplication and division are the new topics we’ll add to our skill set.

So, the simplest form of math applied when collecting statistics, or data, is counting stats. How many shots did player X take in a game, how many power play opportunities did team Y have, how often did goalie Z make a save, etc. We can simply collect these by counting the events we are interested in, so all you need to be able to do is count. And perhaps some addition, multiplication and division to aggregate or summarize the data for a full season.

Look at the chart below, it shows Shots (pucks) and Goals (lamps) for the Lightning. Can you subtract the goals from the shots, or divide goals by all shots for goal success rates?

Learning about fractions and decimals

Now that we start multiplying and dividing, we will run into decimals and fractions. In Puck Possessed issue 16 I talk about ranking goalies by the number of saves per goal against. More specifically, how many shots does a goalie stop before he lets one go in? The following chart shows some of the best goalies from the 2018-2019 regular season when looking at Goals Against and Saves made:

The insert shows all goalies and where the highlighted goalies sit relative to the league.

Further looking at this subset of the first chart, when we divide Saves made by Goals Against here is how the goalies compare, from left to right:

So let’s look at Ben Bishop from the Dallas Stars: during the regular season he had 87 goals against while making 1,236 saves. When we divide 1,236 / 87 we see that he made a little over 14 saves per goal against. In the chart above we use three decimals, but you can ask if that really provides useful information. I’d say one decimal would be sufficient here, so 14.2 saves per goal against (14.1 or 14.9 is still worth mentioning, but 14.17 or 14.95 doesn’t provide a better understanding).
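The Bishop calculation is easy to check yourself. Here is a tiny Python sketch using the two numbers from the text; the variable names are just my own labels.

```python
saves = 1236
goals_against = 87

saves_per_goal = saves / goals_against

print(round(saves_per_goal, 3))  # 14.207 (three decimals, as in the chart)
print(round(saves_per_goal, 1))  # 14.2   (one decimal is plenty)
```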

Ben, please go to the rink now, we’ll need you for the section on Angles.

Using mathematics to solve problems

Ok, when we understand multiplication and related division we can use mental mathematics strategies. We can compare fractions with like and unlike denominators, and describe, compare, add and subtract decimal numbers and lastly, we learn about probability (although here we’ll leave that out until we get to part 7).

To apply all this to hockey, we can compare players and decide who is the best scorer; but just counting goals is not really fair. Some players will have more or fewer opportunities than others, based on the circumstances the player is in, of which many we cannot control or even measure easily; is the team he plays on very good or bad, offensive or defensive, how much ice time does he have, who does he play with, who is the opposing team’s goaltender, etc. What we can do is use rates or ratios, for which we will have to do some division.  Rather than only looking at the number of goals, let’s look at the number of goals per games played. So let’s divide the number of goals by games played:

G / GP

Goals per Games Played gives us a much better idea of who is the better scorer. Is it fair to not take into consideration if a player has been sick or injured and hasn’t played much, compared to a player who has played every single game? I would say it is not. We can even take this a bit further and look at goals per minute played, making the math a bit more complicated but the results even more useful.
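To see how the picture can change when we switch from counting goals to using rates, here is a minimal Python sketch. Both players and all of their numbers are made up; `scoring_rates` is just a helper name for this example.

```python
def scoring_rates(goals, games_played, minutes_played):
    """Return (goals per game played, goals per minute played)."""
    return goals / games_played, goals / minutes_played

# Made-up season totals for two hypothetical players.
player_a = scoring_rates(goals=30, games_played=82, minutes_played=1640)
player_b = scoring_rates(goals=24, games_played=48, minutes_played=720)

print(round(player_a[0], 2), round(player_a[1], 3))  # 0.37 0.018
print(round(player_b[0], 2), round(player_b[1], 3))  # 0.5 0.033
```

Player A scored more goals in total, but Player B scored more often per game and per minute played.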

Angles

Now let’s look at angles. Ah, Ben, there you are. Please show us the difference in angles between a shot from straight in front of the goal, and one from the side of the net. Have a look at the different angles:

Operations with numbers

So ratios are not much more than one counting stat per some other counting stat; goals per minute played, saves per goal allowed. When we standardize this to 100 minutes or goals allowed or look at counting stats of a player per counting stats of all players (percent of whole), we use percents (which literally means “per 100”). Let’s look at an example:

In Canada we tend to call hockey “our” game. Although possessing something as intangible as the game of hockey, or any sports game for that matter, is not really possible, Puck Possessed issue 8 looks at national pride in hockey, expressed by players in the NHL representing different “nation groups”.

As looking at all individual nations didn’t make much sense, I combined Sweden and Finland, all former Soviet states, and a rest group combining numerous smaller European countries with one-offs like Australia and Bahamas. I should also note that these are based on the Nationalities listed on the NHL web page; I do not know how the NHL deals with dual Nationalities for example, and obviously it also ignores those with a passport from their birth country who actually only lived there their first 3 years of their lives. Initially I also wanted to look at playing styles based on nationality (or continent) but that got kind of diluted by the prior comment, so I left it (for now).

Also we need to start thinking about the order of operations. The rules are as follows:

    1. Do operations inside parentheses.
    2. Do multiplication and division from left to right.
    3. Do addition and subtraction from left to right.
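Python follows the same rules, so we can use it to see the order of operations at work. The numbers below are made up.

```python
goals, assists = 10, 15

# Multiplication happens before addition: 10 + (15 * 2)
print(goals + assists * 2)    # 40

# Operations inside parentheses are done first: (10 + 15) * 2
print((goals + assists) * 2)  # 50

# Division and multiplication go from left to right: (10 / 160) * 60
print(10 / 160 * 60)          # 3.75
```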

Another way of using ratios is to look at goals per shot attempt, or goals per 60 or 20 minutes. Typically people use per 60, as a game lasts 60 minutes, but I believe a rate per 20 makes a lot of sense as a player’s time on ice is more in the 20 minute range, or perhaps even 15. In any case, one would look at the number of goals compared to the time the player was on the ice, and since no player has the exact same time on ice as any other player we convert it to 15/20/60 minutes:

G / TOI * 60
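As a small sketch, the formula above can be turned into a Python function where the “per 60” part is something you can change. The function name `rate_per` and the season totals are made up for illustration.

```python
def rate_per(goals, toi_minutes, per_minutes=60):
    """Goals scaled to a fixed amount of ice time: G / TOI * per_minutes."""
    return goals / toi_minutes * per_minutes

# Made-up season totals: 30 goals in 1,500 minutes of ice time.
print(round(rate_per(30, 1500), 2))      # 1.2 goals per 60 minutes
print(round(rate_per(30, 1500, 20), 2))  # 0.4 goals per 20 minutes
```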

So what if we want to know a rate like goals per game for a whole team? We could use the average of all players on that team, but to avoid using numbers from players that only played a few games (which is not a good representation of a player’s skills), we set a minimum. So let’s say on a full season of 82 games, a player needs to play at least 20 games to be included in the calculation of the team average:

For all players that have a GP of 20 or more:

SUM(G) / SUM(GP)
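Here is how that filter and the SUM(G) / SUM(GP) calculation could look in Python. The roster is made up, and the cut-off of 20 games is the one suggested above.

```python
# Made-up roster: (player, goals, games played).
roster = [
    ("Player A", 30, 82),
    ("Player B", 12, 60),
    ("Player C", 3, 5),    # fewer than 20 games, so left out
    ("Player D", 8, 40),
]

MIN_GP = 20
qualified = [(g, gp) for _name, g, gp in roster if gp >= MIN_GP]

# SUM(G) / SUM(GP) over the qualified players only.
team_rate = sum(g for g, _gp in qualified) / sum(gp for _g, gp in qualified)
print(round(team_rate, 1))  # 0.3
```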

The results of calculations like these often have a large number of decimals. Although we understand that the average number of goals by players on the same team will not always be a whole number, having 10 decimals also doesn’t add any value. So let’s say for now we use one decimal.

Learning about statistics

As mentioned above we are now looking at averages, the mean, the median and the mode for a set of data; percents, rates, ratios and proportions, and lastly correlation and probability. Let’s look at the Lightning team again, and the players’ ages:

The mean is calculated by adding up all of the values and dividing by the number of values:

Sum of ages: 633, and number of players: 24. The mean, or average age = 633 / 24 = 26.375

The median is the “middle” of a set of numbers in ascending or descending order: 26.5. Since there are 24 ages, when ordered from low to high the middle is where there are 12 ages to the left and 12 to the right. The two ages in the middle are 26 and 27, so the median is halfway between them: 26.5*

The mode is the most frequently occurring number: 28, as there are 4 players that are 28 years old, more than any other age.
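Python’s built-in statistics module can calculate all three for us. The short list of ages below is made up to keep the example small; it is not the actual Lightning roster.

```python
import statistics

# A small made-up list of ages (not the actual Lightning roster).
ages = [22, 24, 26, 27, 28, 28]

print(round(statistics.mean(ages), 1))  # 25.8 (sum of 155 divided by 6 ages)
print(statistics.median(ages))          # 26.5 (halfway between 26 and 27)
print(statistics.mode(ages))            # 28   (occurs twice, more than any other age)
```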

* Median example:

We already talked about percents, ratios and proportions above but let’s repeat to refresh our memory:

  • Percent means “out of one-hundred”, so if a goalie (remember Bishop?) saves 1,236 shots and lets 87 pucks go in the net, he faced 1,236 + 87 = 1,323 shots and we say he has a 0.934 save percentage (1,236 / 1,323), or in other words he saves about 93 of every 100 shots faced (93 per cent).
  • A ratio is a comparison of numbers or quantities, dividing one by the other. Bishop’s ratio of goals against to saves made is 87 / 1,236, or about 0.07.
  • A proportion is a statement of equality between two ratios.
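As a quick check in Python: save percentage is conventionally saves divided by shots faced, where shots faced are the saves plus the goals against. The variable names are my own; the two input numbers are Bishop’s from the text.

```python
saves = 1236
goals_against = 87
shots_faced = saves + goals_against  # 1323

save_pct = saves / shots_faced
print(round(save_pct, 3))               # 0.934
print(round(save_pct * 100))            # 93, so about 93 saves per 100 shots faced
print(round(goals_against / saves, 2))  # 0.07, the ratio of goals against to saves
```

Saying that 1,236 out of 1,323 is about the same as 93 out of 100 is an example of a proportion: two ratios that are (roughly) equal.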

Now let’s talk about correlation. As quoted in Puck Possessed issue 21, Wikipedia describes correlation as any statistical association, though it commonly refers to the degree to which a pair of variables are linearly related... Correlations are useful because they can indicate a predictive relationship that can be exploited in practice… However, in general, the presence of a correlation is not sufficient to infer the presence of a causal relationship (i.e., correlation does not imply causation)… There are several correlation coefficients, often denoted ρ or r, measuring the degree of correlation.
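One commonly used coefficient is the Pearson correlation coefficient, which measures how close two sets of numbers are to a straight-line relationship. The sketch below calculates it by hand in Python; the function name `pearson_r` and the shots and goals for five imaginary players are made up for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How much x and y move together, and how much each moves on its own.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up numbers: shots and goals for five hypothetical players.
shots = [100, 150, 200, 250, 300]
goals = [8, 14, 19, 22, 31]

print(round(pearson_r(shots, goals), 2))  # 0.99
```

A value close to 1 means that players with more shots also tend to have more goals. It does not prove that one causes the other.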

The mentioned issue 21 is included below to give some examples to better explain the concepts:

Shot locations

Other than lots of numbers and statistics, the NHL also keeps track of the locations of all events taking place on the ice: goals, shots, blocks, penalties, hits, etc. Let’s look at an example, and what it can tell about a team’s tendencies:

Now let’s look at all saved shot attempts, missed shot attempts and shot attempts resulting in goals for the NHL’s 2018-2019 season. Every dot represents one or more of these shot attempts at a specific location, recorded as an x/y location in feet from the centre ice circle.

We can show the number of shot attempts at a specific location by changing the size of the dots: the more shot attempts, the bigger the dot.

We can also use color to show the number of shot attempts per location, or combine it with size.

Now how do we show all shot attempts without showing them all on top of one another? We can think of the rink as a grid or matrix of squares that are 1×1 feet or 5×5 feet:


In these grids we can see where goals were scored from (darker is more goals scored):


Now we can count all the shot attempts per box, as well as all goals per box, from which we can calculate our goal success ratio: Goals / Total Shot Attempts and display that using color in 1×1 or 5×5 boxes:


The darker the color, the more goals are scored per total shot attempts. As one would expect, shot attempts from closer to the net and more in front of the net tend to go in for a goal more often than shots from long distance or from along the boards. Interestingly, the shot attempts from behind the goal line seem to have a pretty good success ratio as well. That could be because there aren’t that many shot attempts taken from behind the goal line, and when players do try one it is because the goalie is out of position and they bank it off him or a defender’s skate.
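The counting per box described above can be sketched in JavaScript as follows. The shot attempts are made up, and the field names (x, y, goal) are my own assumption about how such data could be stored; real NHL data will use different names.

```javascript
// Hypothetical shot attempts: x/y in feet from centre ice, plus whether it was a goal.
const attempts = [
  { x: 82, y: 3, goal: true },
  { x: 84, y: 1, goal: false },
  { x: 81, y: 4, goal: false },
  { x: 60, y: -22, goal: false },
  { x: 62, y: -21, goal: false },
  { x: 83, y: 2, goal: true },
];

// Group attempts into square boxes of `size` feet and compute goals / attempts per box.
function goalRatioByBox(shots, size) {
  const boxes = new Map();
  for (const s of shots) {
    // Math.floor rounds down, so each location lands in the box whose
    // lower-left corner is a multiple of `size` (this also works for negative y).
    const boxX = Math.floor(s.x / size) * size;
    const boxY = Math.floor(s.y / size) * size;
    const key = `${boxX},${boxY}`;
    const box = boxes.get(key) || { x: boxX, y: boxY, attempts: 0, goals: 0 };
    box.attempts += 1;
    if (s.goal) box.goals += 1;
    boxes.set(key, box);
  }
  // Add the goal success ratio to every box that had at least one attempt.
  return [...boxes.values()].map((b) => ({ ...b, ratio: b.goals / b.attempts }));
}

const grid = goalRatioByBox(attempts, 5);
console.log(grid);
```

With these six attempts we end up with two 5×5 boxes: one in front of the net with 4 attempts and 2 goals (ratio 0.5) and one further out with 2 attempts and no goals (ratio 0). Boxes with very few attempts can show a high ratio by chance, which is exactly the effect described for shots from behind the goal line.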

Puck Possessed issue 25 also describes this process on the following page.

Brief overview of Hockey Analytics & Visualization

There are a number of books and websites that provide a thorough overview of the history of hockey analytics, so it wouldn’t really make sense to do the same thing all over again. Again, Rob Vollman’s books are an excellent resource for this (and other) types of information. The more important questions in this publication are: What is hockey analytics, and why do we want to use it?

Wikipedia describes “analytics” as the discovery, interpretation, and communication of meaningful patterns in data, and the process of applying those patterns towards effective decision making. In other words, analytics can be understood as the connective tissue between data and effective decision making within an organization. When we search specifically for “hockey analytics”, it is described as the analysis of the characteristics of hockey players and teams through the use of statistics and other tools to gain a greater understanding of the effects of their performance. Wikipedia names three commonly used statistics in ice hockey analytics:

  • “Corsi” and “Fenwick”, both of which use shot attempts to approximate puck possession, and
  • “PDO”, which is often considered a measure of luck.

Depending on our audience, we need to make sure we communicate in a way that they understand, and in many cases that means using basic language and terminology. For example, “Corsi” is really just a way to estimate puck possession by counting all shot attempts, and “Fenwick” is the same but does not count blocked shots as shot attempts.
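In code, these three statistics are little more than adding and dividing. The sketch below uses made-up totals for one team and its opponent. The field names are my own, and I assume here that “blocked” means a team’s own attempts that were blocked by the other team, and that shots on goal already include goals.

```javascript
// Hypothetical totals for one team and its opponent (made-up numbers).
const team = { goals: 3, shotsOnGoal: 30, missed: 12, blocked: 14 };
const opponent = { goals: 2, shotsOnGoal: 25, missed: 10, blocked: 9 };

// Corsi: all shot attempts (shots on goal, missed shots and blocked shots).
const corsi = (t) => t.shotsOnGoal + t.missed + t.blocked;

// Fenwick: the same, but without blocked shots.
const fenwick = (t) => t.shotsOnGoal + t.missed;

// Share of all attempts in the game that belong to our team, as a percent.
const share = (forCount, againstCount) => (100 * forCount) / (forCount + againstCount);

const corsiForPct = share(corsi(team), corsi(opponent));
const fenwickForPct = share(fenwick(team), fenwick(opponent));

// PDO: the team's shooting percentage plus its save percentage.
const shootingPct = (100 * team.goals) / team.shotsOnGoal;
const savePct = (100 * (opponent.shotsOnGoal - opponent.goals)) / opponent.shotsOnGoal;
const pdo = shootingPct + savePct;

console.log(corsiForPct.toFixed(1), fenwickForPct.toFixed(1), pdo.toFixed(1));
```

A share above 50 means the team took more of the shot attempts than its opponent. PDO tends to hover around 100 over a long stretch of games, which is why a value far above or below 100 is often read as good or bad luck.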

If we go back to the Wikipedia definition of analytics, the visualization part is described as “communication of meaningful patterns in data”, as well as a modern equivalent of visual communication that involves the creation and study of the visual representation of data. So once we are done our discovery and interpretation of the data, and have drawn a conclusion with regards to the topic of choice, we need to communicate this visually to the intended audience at the appropriate levels of detail: league, team, player, events.

Knowing your audience is most important, as it determines how you communicate, at what level of detail, what terminology to use and in what format you communicate. Is the audience knowledgeable, experienced, young or old, able to spend time or in need of a quick visual, able to interact with the visual (computer) or not (print), etc.? These are all items to be considered before starting a visualization (and analysis) and unfortunately, in practice, the answers will not be clear cut. Take this book for example: my goal is to make something that is understandable and interesting for kids. But I want adults to be able to enjoy it as well. And when I say kids, what age(s) am I thinking of?

Then, a visualization needs a balance between effective communication and design to make it attractive, clear and effective, actionable and memorable.

Data presentation expert Edward Tufte has introduced the concept of Data-ink ratio to “above all else, show the data”. His principle is based on the amount of ink used for data versus the amount used for the total graphic. A large share of ink on a graphic should present data-information, the ink changing as the data change. Data-ink is the non-erasable core of a graphic, the non-redundant ink arranged in response to variation in the numbers represented.

Other specialists in the field of visualization, like David McCandless, focus more on the artistic side, to make information approachable and beautiful while focussing on the relationship between facts, the context, and the connections that make information meaningful.

Data visualization is both an art and a science.

Another term often heard is Infographics or information graphics: visual representations of information, data or knowledge intended to present information quickly and clearly, that can improve cognition by utilizing graphics to enhance the human visual system’s ability to see patterns and trends.

The “Guide to Information Graphics” by Dona M. Wong explains: “it is content that makes graphics interesting. When a chart is presented properly, information just flows to the viewer in the clearest and most efficient way…Font, colour and design and the depth of critical analysis displayed really make a chart effective.” I highly recommend you read this book for great information on making infographics.

To communicate information clearly and efficiently, data visualization uses statistical graphics, plots, charts, information graphics and other tools. Numerical data may be encoded using dots, lines, or bars, to visually communicate a quantitative message. Effective visualization helps users analyze and reason about data and evidence. It makes complex data more accessible, understandable and usable. Users may have particular analytical tasks, such as making comparisons or understanding causality, and the design principle of the graphic (i.e., showing comparisons or showing causality) follows the task. Tables are generally used where users will look up a specific measurement, while charts of various types are used to show patterns or relationships in the data for one or more variables.

Let’s look at some examples of data visualization both on the design or artistic side, and on the more scientific and efficient side (full page copies follow on the next three pages):




The first example, the one on the left, shows no data but describes the development of a goal with visuals. The third example on the right shows a lot of data and besides the small graphic in the top right it is not highly appealing visually. The example in the middle has a bit of both: there is data in the table and graphs at the top and the four squares showing data for a selected player. But the image and layout also make it visually attractive and appealing.

Where do you think the following example fits on the spectrum between artistic and scientific? Try to think of a few reasons for your answer.




Look at the following examples that show data using a number of different visualization types:


So, we looked at a lot of definitions of analytics and visualization, but what is the practice of hockey analytics and visualization?

Putting it all to use

This final chapter will give some examples on how to get some data, how to store and clean it up, and how to visualize it, all based on software freely accessible.

Google Sheets

Google Sheets can be especially useful if you want to use data from a web page that updates daily, for example a page with NHL standings. The biggest item to deal with is to “force” Google Sheets to refresh the data on a daily basis. The example below explains how to set up Google Sheets to make that happen.

Loading the data

For this example I’m creating a Google Sheet with the NHL Standings from Fox, by using the importhtml function:

=IMPORTHTML("https://www.foxsports.com/nhl/standings","table",0)

The problem with Google Sheets is that this sheet only gets refreshed when this document is opened.

Daily refresh

To force the refreshing you have to do the following:

  • In your sheet, create another tab, called UpdateDate
  • Go to the Tools menu > Script Editor (in newer versions of Google Sheets this is found under Extensions > Apps Script)
  • In the editor, copy the following text:

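function myFunction() {
  // Open the active spreadsheet and find the UpdateDate tab
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheet = ss.getSheetByName('UpdateDate');
  // Writing the current date and time to C1 counts as a change to the
  // spreadsheet, which makes the other sheets recalculate
  sheet.getRange('C1').setValue(new Date());
}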

This basically opens your new tab UpdateDate and “calculates” the current date in field C1. By doing that, all other sheets, including our importHTML function, get updated. In other words, it is similar to opening the Google Sheet.

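  • In the script editor, add a time-driven trigger for myFunction (Triggers > Add Trigger > Time-driven) so the script runs by itself, for example once every night
  • In the File menu > Spreadsheet Settings > Calculations > Set Recalculation = On change and every hour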

You are good to go! Now you can connect to this data with a data visualization tool like Tableau Public, with the certainty that your data is refreshed every night.

Unfortunately, not every page works with the ImportHTML code in Google Sheets.

NHL data

The NHL website is one of those pages that doesn’t work well with Google Sheets. There are at least three ways to get data from the NHL webpage: copy-paste, using the browser’s developer tools, or using a third party’s code:

  • Copy pasting should not need much explanation. Any table on the website can be copy-pasted, however many tables will have multiple pages which need to be copied one by one. So, if you are looking for one page or a relatively simple list of statistics (say team points) you can copy-paste without a problem
  • If you need more data than a one-pager, you can use the following. In your browser (I use Chrome for this example, but most – if not all – browsers offer something very similar) go to the settings menu; in chrome, you click the three vertically stacked dots in the top right corner, and there you select More Tools > Developer Tools:

  • In the main menu bar in the Developer tools, click Network. This window will now show all network activity while we load our page. All of the NHL’s stats are shown on the www.nhl.com/stats pages. Use the main headers PLAYERS or TEAMS to show player-specific or team-based data, and use the provided query options to select the data you want:

  • In the Network window you will see two files were loaded for this webpage, skatersummery.json and a file with a bit of a scrambled name:

Right click on the second file > Copy > Copy Link Address

Now open the website http://www.convertcsv.com/json-to-csv.htm

Under Step 1: Select your input > Select the “Enter URL” tab > paste the address you just copied > Click the “Load URL” button.

Scroll down a bit further to the Result Data area, which should fill up with the data you were looking for after a few seconds (depending on the size of the file this can actually take a while to load). Right after “Save your result:”, type the name of your file and click “Download Result”.

The file will now be saved including all pages!
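If you are curious what such a JSON-to-CSV conversion does behind the scenes, here is a minimal sketch in JavaScript. The JSON text, its “data” key and the field names are all made up for illustration; the real NHL file will look different, and the convertcsv.com website may work differently inside.

```javascript
// Convert a list of records (as found in a JSON file) to CSV text.
// The column names are taken from the first record.
function toCsv(records) {
  if (records.length === 0) return "";
  const columns = Object.keys(records[0]);

  // Wrap a value in quotes when it contains a comma, a quote or a line break,
  // and double any quotes inside it, so the CSV stays readable by other tools.
  const escape = (value) => {
    const text = value === null || value === undefined ? "" : String(value);
    return /[",\n\r]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
  };

  const header = columns.map(escape).join(",");
  const rows = records.map((r) => columns.map((c) => escape(r[c])).join(","));
  return [header, ...rows].join("\n");
}

// Hypothetical JSON with made-up field names, shaped like a small stats table.
const json =
  '{"data":[{"player":"Player A","team":"AAA","points":10},' +
  '{"player":"B, Player","team":"BBB","points":7}]}';
const csv = toCsv(JSON.parse(json).data);
console.log(csv);
```

The second player’s name contains a comma, so it ends up between quotes in the CSV. Without that step a spreadsheet would split the name over two columns.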

If you need many pages and don’t want to do this process over and over, you’ll need to write some code:

In the example below I use R in the RStudio environment, which obviously requires knowledge of coding. But the big benefit is that the scraping can be automated, so especially when you are looking for large data sets or a large number of data sets (like every individual game’s statistics at a player/event level), the code can run through all the required tables/pages.
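Whatever language you use, the core idea of “running through all the pages” is the same: work out which pages exist, request them one by one, and glue the results together. Here is a sketch of just that idea in JavaScript (not R). The getPage function is hypothetical: it stands for whatever code downloads one page of rows, and in this sketch it is replaced by a stand-in that reads from a made-up table.

```javascript
// Work out which pages to request to get `total` rows, `pageSize` rows at a time.
function pageRequests(total, pageSize) {
  const requests = [];
  for (let start = 0; start < total; start += pageSize) {
    // The last page may be shorter than the others.
    requests.push({ start, limit: Math.min(pageSize, total - start) });
  }
  return requests;
}

// Collect every page by calling `getPage` (a hypothetical function you supply
// that returns the rows for one page) and joining the results into one list.
function collectAll(total, pageSize, getPage) {
  return pageRequests(total, pageSize).flatMap((req) => getPage(req.start, req.limit));
}

// Stand-in for a real download: pretend the full table has 230 rows.
const fakeTable = Array.from({ length: 230 }, (_, i) => ({ rank: i + 1 }));
const fakeGetPage = (start, limit) => fakeTable.slice(start, start + limit);

const allRows = collectAll(fakeTable.length, 100, fakeGetPage);
console.log(allRows.length); // 230
```

With 230 rows and 100 rows per page this makes three requests: two full pages and a last one with 30 rows. A real download would also wait for each response and pause between requests to be kind to the website.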

Data Services

If coding is not your thing there are a number of services available that basically do the work described above based on the NHL’s data or other data sources, and let you download the data for a small fee or even for free! I have listed them based on my personal usage, and although this list is far from complete it should get anyone started and busy for quite a while:

Storing Data (SQL, Google, others)

In case you are just working with one or two small datasets the need to store the data in a database is small, but once you start adding and combining data you may want to consider putting this data in one spot. Of course, you can put all your files in one folder or on a USB stick, but if you want to start combining data sets I recommend putting all data in a database.

There are many options here, but I recommend the free-to-use SQLite database through https://sqlitebrowser.org/ (lightweight, free, stored as a single file that you can copy to any machine, and super easy to set up). Alternatively, you can set up a SQL environment on your computer or get something hosted in Google Cloud, Amazon’s AWS or many other options. The earlier mentioned sqlitebrowser allows files to be imported to a table from CSV, which makes setting up your personal database easy-peasy. Now, to combine files or make subsets of your data sources you will need to know or learn a query language called SQL, but it’s really not that complicated to learn some basic functions.

Tableau Public

Tableau Public is a free to use data analysis and visualization tool (https://public.tableau.com/en-us/s/) that can connect “live” to Google Sheets data and many data types like Excel, CSV, txt and anything else described in this publication.

Most visuals in this book are (partially) produced with Tableau Public, and since it is easy to use you can be up and running making your own hockey visualizations in no time! It would require a second book to go through all the ins and outs of Tableau, but I recommend giving it a try and perhaps look at some low-cost courses on Udemy.com.

Final words

I’ll keep it short, but I hope you had as much fun reading through these pages as I had creating them. If you, your parents, your kids, your family, or anyone else had a few moments of “aha” and “oooooooh, that’s how that works” I have achieved my goal with this book, making you, the reader, just a little bit more educated when it comes to hockey analytics and visualization.

Cheers,

RJ

Glossaries

For stat descriptions, see:

Natural Stat Trick: http://www.naturalstattrick.com/glossary.php

NHL: http://www.nhl.com/stats/glossary

Evolving Hockey: https://evolving-hockey.com/ Menu > More > Glossary