Final Project: Introducing Myself as a BAIS Graduate

Author

Thomas Bonnici

Project Introduction

About the Project Topic

Ever since I was a little kid, I have been around the sport of hockey. I have played it since I was six years old and have followed it closely at the pro level. I am also a huge math guy. Math has always been a favorite subject of mine, and it has always come to me more easily than other subjects. Combine these two things and you get hockey statistics! They are a major part of my final project. With this analysis, I want to learn more about the game in general, whether it’s about the players, the play style, or the trends of different eras.

About the Data

My first dataset is a large compilation of hockey player data. It spans 1991 to 2023 and includes a ton of players, ranging from legends of the game to guys who had a cup of coffee in the show. The stats are each player’s career totals over this time span. I provide a data dictionary for the columns below. It’s important to note that this is regular season data only and that it contains no goalies. A link to the dataset is below:

Hockey Player Data (1991-2023)

Data Dictionary

  • Player = Player Name

  • S/C = Skater Shoots (handedness; Left or Right)

  • Pos = Position on the Ice

    • C = Center

    • L = Left Wing

    • R = Right Wing

    • D = Defense

  • GP = Games Played

  • G = Goals Scored

  • A = Assists

  • P = Points (Goals + Assists)

  • plus_minus = Player’s Plus Minus

  • PIM = Penalty Minutes

  • P/GP = Points Per Game Played

  • EVG = Even Strength Goals

  • EVP = Even Strength Points

  • PPG = Power Play Goals

  • PPP = Power Play Points

  • SHG = Short Handed Goals

  • SHP = Short Handed Points

  • OTG = Overtime Goals

  • GWG = Game Winning Goals

  • S = Shots On Net

  • S% = Shooting Percentage (share of a player’s shots on net that were goals)

  • TOI/GP = Time On Ice divided by Games Played (average stat)

  • FOW% = Faceoff Win Percentage (only applies to Centers)

Descriptive Analysis: NHL Player Data

Question 1: What is the distribution of Points/Game across the NHL?

First, I wanted to get a feel for the data by plotting a histogram of points per game in the NHL.

Rows: 4000 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (3): Player, S/C, Pos
dbl  (12): G, plus_minus, P/GP, EVG, PPG, PPP, SHG, SHP, OTG, GWG, S%, FOW%
num   (6): GP, A, P, PIM, EVP, S
time  (1): TOI/GP

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Analysis (Graph 1)

From this graph we can gather that most players who play at least 50 games in their career average around 0.25 points per game. This is why a player who averages a point per game over a season, let alone over his whole career, is regarded as one of the best scorers in the league. If I were to raise the Games Played cutoff, I would hypothesize that this distribution would slide to the right.
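The filter-then-histogram step described above can be sketched as follows. This is a minimal sketch, not the code behind Graph 1: the `players` rows are made-up stand-ins, and the 0.05 bin width and the object names are assumptions. Only the `GP` and `P/GP` column names and the 50-game cutoff come from the text.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in rows; the real file has one row per skater.
players <- tibble(
  Player = c("Skater A", "Skater B", "Skater C", "Skater D", "Skater E"),
  GP     = c(820, 45, 310, 64, 1200),
  `P/GP` = c(0.62, 0.10, 0.25, 0.18, 1.05)
)

# Keep only skaters who met the games-played cutoff (50 in the analysis above).
min_gp <- 50
qualified <- players %>%
  filter(GP >= min_gp)

# Histogram of career points per game. Setting binwidth explicitly
# (0.05 is an assumed value) replaces ggplot2's default of 30 bins.
p <- ggplot(qualified, aes(x = `P/GP`)) +
  geom_histogram(binwidth = 0.05, fill = "steelblue", color = "white") +
  labs(x = "Points per game", y = "Number of players")

print(p)
```

Changing `min_gp` and re-running is a quick way to check the hypothesis that a stricter games-played cutoff shifts the distribution to the right.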

Question 2: Who are the top 50 goal scorers since 1991?

Now that we know what the distribution of points per game looks like for NHL players in this dataset, I wanted to look at who some of the all-time best goal scorers were.

# A tibble: 3,986 × 2
   Player             Total_Goals
   <chr>                    <dbl>
 1 Alex Ovechkin              830
 2 Jaromir Jagr               739
 3 Teemu Selanne              684
 4 Jarome Iginla              625
 5 Sidney Crosby              576
 6 Brendan Shanahan           568
 7 Patrick Marleau            566
 8 Mats Sundin                541
 9 Keith Tkachuk              538
10 Steven Stamkos             533
11 Marian Hossa               525
12 Joe Sakic                  515
13 Brett Hull                 509
14 Mark Recchi                506
15 Mike Modano                504
16 Peter Bondra               491
17 Evgeni Malkin              486
18 Joe Pavelski               467
19 Patrick Kane               458
20 Eric Staal                 455
21 Sergei Fedorov             452
22 Daniel Alfredsson          444
23 Ilya Kovalchuk             443
24 John Tavares               439
25 Jeremy Roenick             437
26 Pavel Bure                 437
27 Rick Nash                  437
28 Jeff Carter                436
29 Alex Kovalev               430
30 Joe Thornton               430
31 Bill Guerin                429
32 Zach Parise                429
33 Alexander Mogilny          428
34 Luc Robitaille             427
35 Patrice Bergeron           427
36 Corey Perry                421
37 Vincent Lecavalier         421
38 Owen Nolan                 419
39 Jason Arnott               417
40 Tony Amonte                416
41 Phil Kessel                413
42 Rod Brind'Amour            409
43 Patrik Elias               408
44 Anze Kopitar               407
45 Marian Gaborik             407
46 John LeClair               404
47 Paul Kariya                402
48 Shane Doan                 402
49 Markus Naslund             395
50 Pierre Turgeon             395
# ℹ 3,936 more rows

Analysis (Graph 2)

From this tibble, we can see that Alexander Ovechkin has the most total goals in the regular season over this span, with 830. He is getting super close to the regular season goal record of 894, held by Wayne Gretzky. As we go down the list, we see Jagr, Selanne, Iginla, and Crosby rounding out the top 5. All of these players are all-time greats, so no surprises there.
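A ranking like the one above can be produced by summing goals per player and keeping the largest totals. The sketch below uses made-up rows and assumes a plain `sum(G)` per `Player`; the tibble has 3,986 rows against 4,000 in the file, which suggests some names appear on more than one row. The name `top_scorers` is mine, not the author's.

```r
library(dplyr)

# Hypothetical stand-in rows; "Skater B" appears twice on purpose.
players <- tibble(
  Player = c("Skater A", "Skater B", "Skater B", "Skater C", "Skater D"),
  G      = c(410, 120, 95, 301, 12)
)

top_scorers <- players %>%
  group_by(Player) %>%
  summarise(Total_Goals = sum(G)) %>%   # one row per player name
  slice_max(Total_Goals, n = 3)         # use n = 50 for a top-50 list

print(top_scorers)
```

Note that grouping on the name alone would also merge two different players who happen to share a name, so a unique player ID would be safer if the data had one.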

Question 3: Who are the true enforcers in the NHL?

Although there’s much more skill in today’s game, I still wanted to show some love to the Enforcers of the NHL. So, I looked into who averaged the most penalty minutes per game in the graph below:

Analysis (Graph 3)

With this bar graph, I was able to pull the guys who averaged at least 4 penalty minutes per game and played at least 50 games in the NHL. Out of this whole list, I don’t recognize a single player, which suggests that being an enforcer is “a thing of the past”. Guys like Chris Nilan, aka Knuckles (4.64), Scott Daniels (4.48), and Brantt Myhres (4.46) all played in the 90s or early 2000s. Trevor Gillies is the exception, as his only full season was in 2011 with the Isles, where he racked up 165 PIM in 39 games. Wow! Being an enforcer was an art form, and this art is displayed here in this bar chart.
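The two cutoffs described above (at least 50 games, at least 4 penalty minutes per game) amount to a derived column plus two filters. Here is a minimal sketch with made-up rows; `PIM_per_GP` and `enforcers` are hypothetical names, and only `GP`, `PIM`, and the two thresholds come from the text.

```r
library(dplyr)

# Hypothetical stand-in rows.
players <- tibble(
  Player = c("Skater A", "Skater B", "Skater C", "Skater D"),
  GP     = c(30, 120, 600, 57),
  PIM    = c(150, 510, 420, 250)
)

enforcers <- players %>%
  filter(GP >= 50) %>%                   # games-played cutoff
  mutate(PIM_per_GP = PIM / GP) %>%      # penalty minutes per game
  filter(PIM_per_GP >= 4) %>%            # enforcer threshold
  arrange(desc(PIM_per_GP))

print(enforcers)
```

Skater A averages 5 penalty minutes per game here but is dropped by the games-played cutoff, which is the kind of small-sample case the 50-game minimum is meant to screen out.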

Question 4: Who are the top centers in the NHL by Faceoff Win Percentage?

In this next graph, I wanted to find out who were the best centermen in the league by faceoff win percentage.

Analysis (Graph 4)

In this scatter plot, we can see the best career faceoff men: centers who won at least 55% of their faceoffs and played at least 500 games. The player who stands out from the rest, with a faceoff win percentage above 65%, is Yanic Perreault. He played from 1993 to 2007, mainly for the Kings and Leafs. I was born in 2002, so I am not too familiar with Perreault’s game. But I do recognize some other all-time greats like Joe Nieuwendyk, Rod Brind’Amour, Patrice Bergeron, Steve Yzerman, Jonathan Toews, and more.

Question 5: What positions have the best plus minus?

For this last graph, I wanted to look into plus minus from a positional standpoint:

Analysis (Graph 5)

In this last graph, we can see that among players with at least 200 games played, defensemen are the only position in the positives for this stat, sitting at about +10 overall. Meanwhile, all the forward positions, Center (-4), Left Wing (-4.5), and Right Wing (-2), are below 0. This gives me two main takeaways. First, if you are a forward with a positive career plus minus, or a defenseman with a plus minus above 20, you are in elite company. Second, forwards in general don’t have as much emphasis placed on this statistic because their job is to score goals (unless the player is known as a defensive forward). A defenseman’s plus minus, though, is a good way to gauge how good he is.
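A positional comparison like this one comes down to a grouped summary. The sketch below assumes the per-position figure is a mean of career plus minus (it could equally be a median) and uses made-up rows. `by_position` and `avg_plus_minus` are hypothetical names.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in rows.
players <- tibble(
  Pos        = c("C", "C", "L", "R", "D", "D", "D"),
  GP         = c(450, 900, 300, 150, 700, 250, 80),
  plus_minus = c(-10, 2, -5, 30, 25, -5, -40)
)

by_position <- players %>%
  filter(GP >= 200) %>%                          # 200-game cutoff from the text
  group_by(Pos) %>%
  summarise(avg_plus_minus = mean(plus_minus),   # assumed: mean, not median
            n_players = n())

# Bar chart of the averages. guides(fill = "none") hides the redundant
# legend; ggplot2 3.3.4 and later expects "none" here rather than FALSE.
p <- ggplot(by_position, aes(x = Pos, y = avg_plus_minus, fill = Pos)) +
  geom_col() +
  guides(fill = "none") +
  labs(x = "Position", y = "Average career plus minus")

print(p)
```

Keeping `n_players` in the summary makes it easy to see how many skaters stand behind each bar before reading too much into the differences.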

Secondary Data Source Analysis

Continuing on, I thought incorporating a sentiment analysis would add another layer to this project. Since I’ve mostly been focusing on hockey statistics, I wanted to show some love to the fan experience as well. In this analysis, I looked at two arenas: Bridgestone Arena, which is home to the Nashville Predators, and FLA Live Arena, which is home to the Florida Panthers.

Here’s the link to the data source: Sentiment_Analysis_Bridgestone_FLA_Live

Part 1: Emotional Sentiment Analysis

First, let’s look at the general emotions that people felt going to events at these two stadiums:

Analysis (Graph 6)

From this graph, we can see that the most common emotions expressed are trust, joy, and anticipation. This is logical: whether you are in the building for a concert, a game, or another event, you should be feeling trust (in the facilities, concessions, and fan experience), joy (cheering for the team or artist), and anticipation (between periods, before the game, before the concert). Breaking it down between Bridgestone Arena and FLA Live Arena, the results went against what I expected going in. FLA Live scores higher on overall positive sentiment as well as on most of the positive emotions (trust, anticipation, surprise). The one where Bridgestone did slightly better was joy. Overall, these results were shocking to me, and they lead me to believe that other factors are driving this outcome (concerts, fan experience, etc.).
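Emotion counts of this kind are typically built by splitting each review into words, matching the words against an emotion lexicon, and counting matches per emotion and arena. The sketch below is an illustration under stated assumptions, not the pipeline behind Graph 6: the reviews and the five-word `toy_lexicon` are invented, and a real analysis would more likely use a full lexicon such as the one returned by `tidytext::get_sentiments("nrc")`.

```r
library(dplyr)
library(tidyr)

# Hypothetical reviews and a tiny made-up lexicon for illustration.
reviews <- tibble(
  arena = c("Bridgestone", "Bridgestone", "FLA Live"),
  text  = c("Great crowd and a fun night",
            "Long lines but friendly staff",
            "Safe parking and a fun surprise giveaway")
)

toy_lexicon <- tibble(
  word      = c("great", "fun", "friendly", "safe", "surprise"),
  sentiment = c("joy", "joy", "trust", "trust", "surprise")
)

emotion_counts <- reviews %>%
  mutate(word = strsplit(tolower(text), "[^a-z]+")) %>%  # crude tokenizer
  unnest(word) %>%                                       # one row per word
  inner_join(toy_lexicon, by = "word") %>%               # keep lexicon words
  group_by(sentiment, arena) %>%
  summarise(n = n(), .groups = "drop")                   # ungrouped result

print(emotion_counts)
```

Because the two arenas may have very different numbers of reviews, dividing each count by the arena's total word count (or review count) would make the comparison fairer than raw counts.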

Part 2: Positivity Scores Over Time

Next, I wanted to look at positivity scores over an 8-year span.

Analysis (Graph 7)

For this graph, I looked into average positivity scores from 2016 to 2024 for each stadium. The first thing I noticed was that there were no reviews from 2016 for Bridgestone. In general, Bridgestone shows a relatively stable score around 0.65, peaking in 2020 at almost 0.8. For FLA Live, there seems to be a general downward trend since 2016. Since the arena is known for not having the strongest facilities, this makes sense: the building may be deteriorating, and things like the bathrooms or concessions may need an upgrade. Additionally, the fan experience at these two arenas is vastly different in the regular season, with Bridgestone providing a fun and rowdy atmosphere and Florida struggling to fill the seats when the team is struggling to win games.

Combining My Analysis

Now that we have looked into a bit of the fan experience, I would like to combine the two datasets to make an observation: the fan experience is influenced a decent amount by the on-ice product. I believe this because the team is the main reason people come to the stadium. Although there are other special events and concerts, I think that team success has an influence on fan experience.

I added an additional dataset for this section of the document, so here’s a link to the data:

NHL Team Record Data

This data is from nhl.com and was downloaded off of their stats tab of the website. To learn more about each of the attributes of each column, click this link and hover over each of the attributes to see a description.

Visualization 1: Point Distributions for NHL Teams

In this visualization, I will look at each team’s record over this 8-year time period (2016-2024) to see whether having a good or bad on-ice product correlates with the fan experience.

Analysis (Graph 8)

From this we can see that most of the win percentages fall between roughly 50% and 68%, which makes sense. In sports in general, most teams end up in this range. The most common values, slightly above .500 and above .600, are logical as well: teams slightly above .500 are in the wildcard mix, while teams above .600 have a solid playoff guarantee. Since the league has been so competitive since 2016, this doesn’t come as too much of a surprise.

Visualization 2: Panthers and Predators Sentiment Analysis

Now we will take a look at the two teams we looked at in the sentiment analysis: the Nashville Predators and the Florida Panthers.

Analysis (Graph 9)

In this graph we see that the Predators’ record was better from 2016 to 2018, with 2019 being identical to the Panthers’. Since 2020, Florida has sported the better win percentage in every year except 2022. Going beyond this, the Preds have made the playoffs every year since 2016 except for 2022. The Panthers missed the playoffs from 2016 to 2018, but have made it every year since. Additionally, both teams have made the Stanley Cup Final once in this span. So, overall, both teams have had a good on-ice product, aside from Florida in the first 3 seasons.

Visualization 3: Bringing It All Together

Analysis (Graph 10)

Comparing these two graphs, we notice that for the Predators, win percentage has an overall positive correlation with fan positivity, with the two tracking each other fairly closely. However, in Florida it’s the opposite: point percentage and positivity score have a weak negative correlation. I think for the Panthers this is a case of correlation doesn’t mean causation. There are too many factors other than the on-ice product that feed into positivity, like concessions, things to do around the stadium, atmosphere, and more.
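One way to put a number on the relationships described above is a per-team Pearson correlation between season point percentage and average positivity score. The sketch below uses invented values for two placeholder teams; the column names `point_pct` and `positivity` are assumptions about how the joined data might be laid out.

```r
library(dplyr)

# Hypothetical season-level values for two placeholder teams.
seasons <- tibble(
  team       = rep(c("Team X", "Team Y"), each = 4),
  season     = rep(2016:2019, times = 2),
  point_pct  = c(0.50, 0.55, 0.60, 0.65,  0.45, 0.50, 0.58, 0.62),
  positivity = c(0.60, 0.62, 0.66, 0.70,  0.80, 0.74, 0.75, 0.66)
)

correlations <- seasons %>%
  group_by(team) %>%
  summarise(r = cor(point_pct, positivity),  # Pearson correlation by default
            n_seasons = n())

print(correlations)
```

With only about eight seasons per team, a coefficient like this is very noisy, so it is better read as a rough description of the two lines than as evidence of a real effect.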

Conclusion

Overall, it was a valuable experience learning about the NHL’s history as well as how the fan experience compares against a few variables. To improve the analysis, I would need to run a correlation analysis on factors like the ones I mentioned above. I think this would be the next logical step toward a true analysis of fan experience. Additionally, I would love to have the teams and years each player played for, so I could compare different players over time, since my players span the last 32 years of the NHL.