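## Data Cleaning
# Delete the second column (the nationality flag, which did not scrape cleanly)
quant <- quant[, -2]
# Drop the first row, a leftover header row; column names are set manually below
quant <- quant[-1, ]
# Update column names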
colnames(quant) <- c("rk", "name", "team", "age", "pos", "gp", "g", "a",
                     "p", "pim", "plus_minus", "ppg", "shg", "gwg", "g_gp",
                     "a_gp", "p_gp", "sog", "sh_percent")
# Numeric conversion
quant <- quant %>%
  mutate(
    # parse_number() strips the % sign and already returns a numeric value
    sh_percent = parse_number(sh_percent),
    # rk plus every column from gp through sog (the same columns converted before)
    across(c(rk, gp:sog), as.numeric)
  )
# Create season identifier by matching the current URL to its season id
quant$season <- season_ids[season_urls == url]
quanthockey_data <- bind_rows(quanthockey_data, quant)

Assignment 7: NCAA Hockey Shooting Percentages
Line of Inquiry
For my final project I will be using NHL shot data, so I am using this assignment to work with some supplemental data. The MoneyPuck data covers only the NHL, so I wanted to look at the collegiate level (for the record, the NCAA is only one of several paths to the NHL, so this won’t include every player in the NHL data). More specifically, I am interested in the shooting percentages of these players and how they interact with other variables. Ultimately, the goal will be to see whether players with good shooting percentages and related metrics in the NCAA maintain that level in the NHL. For this particular assignment, however, I’ll only be looking at the NCAA level.
Data Collection
To investigate this line of inquiry, I’ll pull data from QuantHockey’s NCAA statistics. This data is pretty simple; it contains fields such as goals, assists, points, shooting percentage, team, and game-winning goals (pretty much the basic “back of the hockey card” data). Using an HTML web scraper, I’ll collect the top 50 scorers (by points) in each season from 2014-15 through 2024-25. While there are obviously far more than 50 NCAA players each season, I’m using this sample because these players have the best chance of being drafted into the NHL and thus appearing in the MoneyPuck NHL data that I will use later. I’ll then combine the season tables so I can compare shooting percentages across fields such as team, position, and season.
Data Wrangling
After pulling the data from QuantHockey, some wrangling was needed to make it usable. One column displayed a flag corresponding to each player’s nationality. This did not transfer well through the scraping process, so the column had to be removed. Each season’s table was essentially stacked on top of the previous one, so I had to remove the duplicated header row for each season and give the whole data frame a new header. I also had to rename the fields that contained symbols such as “%” or “/” and then convert many of the fields to numeric variables. Lastly, I created a season variable based on the season ids that I used to build the URLs scraped for each season.
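The season lookup described above can be sketched as follows. This is only an illustration: the real `season_ids` and `season_urls` vectors are defined elsewhere in the script and may be built differently, and the `example.com` addresses are placeholders, not QuantHockey's actual URL pattern.

```r
# Hypothetical reconstruction of the two parallel vectors used by the scraper.
start_years <- 2014:2024
season_ids  <- sprintf("%d-%02d", start_years, (start_years + 1L) %% 100L)

# Placeholder URLs (assumption): one URL per season, in the same order as season_ids
season_urls <- paste0("https://example.com/ncaa/", season_ids)

# Inside the scraping loop, `url` is the page currently being processed.
url <- season_urls[3]

# Because the vectors are parallel, a logical match on the URL
# picks out the season label that belongs to it.
season <- season_ids[season_urls == url]
season
```

Keeping the ids and URLs in two parallel vectors means the season label never has to be parsed back out of the page itself.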
Analysis
With the data now collected and easily usable, I’ll move on to some analysis of NCAA shooting percentages. The cleaned data can be downloaded here:
First, I took a look at the position distribution and the corresponding shooting percentages. Unsurprisingly, there are far more forwards (F) than defensemen (D) in the data: about 92% are forwards. This is because forwards typically record more points than defensemen. The forwards’ shooting percentage is also 4.4 percentage points higher than the defensemen’s. This is likely due to where these players typically shoot from, specifically the distance from the net. Defensemen often shoot the puck from closer to the blue line, not necessarily to score, but rather to create a rebound for someone closer to the net. Forwards, on the other hand, usually take their shots from the faceoff dots or in the slot, which is closer to the net. Naturally, you have a better chance of scoring when the shot is taken closer to the net.
# A tibble: 2 × 3
  player_position avg_sh_percent total_players
  <chr>                    <dbl>         <int>
1 D                         11.0            43
2 F                         15.4           507
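A summary like the one above could be produced roughly as follows. This is a minimal sketch on made-up toy values, assuming the cleaned column is still called `pos` and is renamed to `player_position` while grouping; the `share` column is an extra added here for illustration.

```r
library(dplyr)

# Toy stand-in for the cleaned data (values are invented for illustration)
toy <- data.frame(
  pos        = c("F", "F", "F", "D"),
  sh_percent = c(16, 14, 18, 10)
)

position_summary <- toy %>%
  group_by(player_position = pos) %>%
  summarise(
    avg_sh_percent = mean(sh_percent, na.rm = TRUE),  # unweighted mean of player percentages
    total_players  = n(),                             # counts rows, i.e. player-seasons
    .groups = "drop"
  ) %>%
  mutate(share = total_players / sum(total_players))  # proportion of rows per position

position_summary
```

Note that `n()` counts rows, so a player who appears in several seasons is counted once per season; the 550 rows in the table above correspond to 50 players in each of 11 seasons.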
Next, I was curious which schools produced players with the highest shooting percentages. This doesn’t necessarily show the schools with the best shooting percentage overall, because I’m only using data from the top 50 point scorers each season, not whole teams. However, it does show which schools produce the most elite shooters. Below are the top 10 teams in shooting percentage (with a minimum of 10 players). The data does not have the full school names, so here are the schools, in order: University of North Dakota, Minnesota State University, Boston College, Northern Michigan University, Northeastern University, University of Minnesota, University of Michigan, University of Massachusetts Amherst, Harvard University, and Robert Morris University. Looking at this list, a few things jump out:
- 4 of the top 10 schools are in Massachusetts (Boston University is 13th)
- The University of Denver is not in the top 10, but it has the most players in this data (32)
- High shot volume does not necessarily hurt shooting percentage: 6 of the top 10 teams in total shots also appear here among the best shooting percentages
# A tibble: 10 × 4
   team  avg_sh_percent total_players total_shots
   <chr>          <dbl>         <int>       <dbl>
 1 UND             17.0            16        1619
 2 MNS             16.3            20        1979
 3 BSC             16.2            22        2506
 4 NMU             15.8            12        1161
 5 NEA             15.8            21        2580
 6 UMN             15.7            26        2795
 7 UMI             15.7            28        3144
 8 UMA             15.5            11        1216
 9 HAR             15.2            17        2079
10 RMR             14.8            16        1931
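The team ranking with its minimum-player cutoff could be computed along these lines. The helper name `top_teams` and the toy values are hypothetical; only the column names `team`, `sh_percent`, and `sog` come from the cleaning code.

```r
library(dplyr)

# Hypothetical helper: rank teams by average shooting percentage,
# keeping only teams with at least `min_players` rows in the data.
top_teams <- function(data, min_players = 10, n_teams = 10) {
  data %>%
    group_by(team) %>%
    summarise(
      avg_sh_percent = mean(sh_percent, na.rm = TRUE),
      total_players  = n(),
      total_shots    = sum(sog, na.rm = TRUE),
      .groups = "drop"
    ) %>%
    filter(total_players >= min_players) %>%
    arrange(desc(avg_sh_percent)) %>%
    head(n_teams)
}

# Toy data (invented values) with a lower cutoff so the filter is visible
toy <- data.frame(
  team       = c("AAA", "AAA", "BBB", "BBB", "CCC"),
  sh_percent = c(20, 10, 18, 16, 30),
  sog        = c(100, 120, 90, 110, 40)
)
result <- top_teams(toy, min_players = 2, n_teams = 2)
result
```

In the toy data, team CCC has the highest percentage but only one row, so the cutoff removes it. That is the same reason a minimum of 10 players is applied in the real table.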
Next, I’ll look at shooting percentage in relation to goals, specifically even-strength and power-play goals. Both even-strength (EV) and power-play (PPG) goals have a strong positive relationship with shooting percentage. The more goals a player scores, the higher their shooting percentage is likely to be, which makes sense because shooting percentage is goals divided by shots on goal. The EV and PPG trend lines are practically parallel, with the PPG line sitting roughly 4 percentage points higher in shooting percentage.
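The scraped columns do not include an even-strength goals field, and the write-up does not say how it was obtained. One plausible approach, shown here purely as an assumption, is to subtract power-play and short-handed goals from total goals and then fit a simple linear trend for each goal type. The column name `evg` and the toy values are invented.

```r
library(dplyr)

# Toy data (invented values) using the cleaned column names g, ppg, shg, sh_percent
toy <- data.frame(
  g          = c(10, 15, 20, 25, 30),
  ppg        = c(2, 4, 5, 8, 9),
  shg        = c(0, 1, 0, 1, 1),
  sh_percent = c(9, 12, 14, 17, 19)
) %>%
  # Assumption: even-strength goals are whatever is left after
  # removing power-play and short-handed goals from the total.
  mutate(evg = g - ppg - shg)

# Separate simple linear fits, one per goal type
ev_fit <- lm(sh_percent ~ evg, data = toy)
pp_fit <- lm(sh_percent ~ ppg, data = toy)

ev_slope <- coef(ev_fit)[["evg"]]
pp_slope <- coef(pp_fit)[["ppg"]]
c(ev = ev_slope, pp = pp_slope)
```

Two trend lines that are close to parallel would have similar slopes, with the gap between them showing up as a difference in intercepts. Because the two goal types sit on different scales, comparing the lines on a shared axis should be done with some care.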
Next, I wanted to investigate what volume of shots produces the best shooting percentage. If a player takes only a few shots, they could get lucky and post a higher-than-expected shooting percentage because of the small sample size. Conversely, a player could score a lot of goals but have an underwhelming shooting percentage because of the high volume of shots they take. Thus, I was curious what the relationship is between the number of shots a player takes and their shooting percentage. The boxplot below groups players into 25-shot intervals; shooting percentage begins to plateau around the mid-100s in shots. Unsurprisingly, shooting percentage is highest for the smallest shot bucket, but the optimal number of shots seems to be around 75-125.
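The 25-shot intervals could be built with base R's `cut()`. The sketch below uses invented values, and the choice of left-closed intervals is an assumption; the original bucket boundaries are not stated.

```r
library(dplyr)

# Toy data (invented values): shots on goal and shooting percentage per player
toy <- data.frame(
  sog        = c(40, 45, 60, 80, 110, 130, 160),
  sh_percent = c(22, 20, 18, 17, 16, 14, 13)
)

bucketed <- toy %>%
  mutate(
    # Left-closed 25-shot buckets: [0,25), [25,50), ...
    # The top break is pushed past the maximum so no player falls outside.
    shot_bucket = cut(sog, breaks = seq(0, max(sog) + 25, by = 25), right = FALSE)
  ) %>%
  group_by(shot_bucket) %>%
  summarise(
    median_sh_percent = median(sh_percent),
    players           = n(),
    .groups = "drop"
  )

bucketed
```

Reporting the number of players per bucket alongside the median helps show how thin the smallest buckets are, which is where the small-sample effect described above is strongest.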
Lastly, I wanted to look at how these metrics vary from season to season. Sticking with shots on goal and shooting percentage, you can see that they have an inverse relationship: high shots on goal correspond with lower shooting percentages, and vice versa. This is what the previous chart showed as well. The 2019-20 and 2020-21 seasons were shortened by Covid, so their shot totals are not comparable to the other seasons, but you can see that they corresponded with a spike in shooting percentage. Shot totals had begun to dip before Covid in 2018-19, but by the 2022-23 season they had returned to roughly the same level as 7-8 years earlier.
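A season-level summary could be sketched as below, on invented values. The shots-per-game column is an addition not used in the original analysis; it is included only as one possible way to put shortened seasons on the same footing as full ones.

```r
library(dplyr)

# Toy data (invented values) with the cleaned column names season, gp, sog, sh_percent
toy <- data.frame(
  season     = c("2017-18", "2017-18", "2018-19", "2018-19", "2019-20", "2019-20"),
  gp         = c(40, 38, 39, 40, 30, 28),
  sog        = c(150, 140, 130, 135, 95, 90),
  sh_percent = c(13, 14, 15, 14, 17, 18)
)

season_summary <- toy %>%
  group_by(season) %>%
  summarise(
    avg_sog        = mean(sog),
    avg_sog_per_gp = mean(sog / gp),   # assumption: per-game rate as a normalization
    avg_sh_percent = mean(sh_percent),
    .groups = "drop"
  )

# Direction of the season-level relationship between shot volume and percentage
season_cor <- cor(season_summary$avg_sog, season_summary$avg_sh_percent)
season_cor
```

A negative `season_cor` would match the inverse pattern described above, though with only 11 seasons in the real data any such correlation rests on very few points.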
Conclusion
Overall, the best shooting percentages typically belong to forwards with lots of goals (obviously), as there is a strong relationship between shooting percentage and goals scored both at even strength and on the power play. The more shots you take, however, the more your shooting percentage is likely to decrease. League-wide shooting percentages fluctuate from season to season and move inversely with league-wide shot totals. As with individual players, lower average shot totals lead to a higher shooting percentage. Lastly, the players with the best shooting percentages typically attend schools in Massachusetts or the upper Midwest.