Our purpose is to analyze factors that correlate with the number of points scored by NHL players. Our outcome variable is the number of points, and we explore two explanatory variables: the categorical variable of a player’s position (center, wing, defense) and the numerical variable of the number of shifts played. In NHL hockey, a player’s points are the sum of his goals and assists. For the position variable, players who play multiple positions are grouped into an additional category; all of these players are both centers and wings, so they are grouped into a general category, “Forward”.
Our data set consists of NHL player statistics from the 2015-16 season and includes all players on an NHL roster that year. Each row represents a player, and the columns hold different statistics about his season. Only the variables relevant to the analysis were kept, along with the players’ names and numbers, which may be useful for identifying outliers or for other clarifications during future analysis.
The statistics were compiled by several different people to make up the whole spreadsheet. While the numbers themselves are objective, compiling data from many different sources could have introduced errors. Also, while the shifts variable gives a good sense of how much time a player spends on the ice, the duration of individual shifts varies a lot. This variation would likely even out across players because each takes many shifts, but the variable may still be slightly unreliable. In addition, the shifts variable does not account for game situations that make scoring more or less likely, such as a power play or penalty kill.
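The grouping of multi-position players described above could be sketched as follows. The raw position codes (`C`, `LW`, `RW`, `D`) and the slash-separated format are assumptions made for illustration; the source spreadsheet may encode positions differently, and `recode_position` is a hypothetical helper name.

```python
def recode_position(raw: str) -> str:
    """Map a raw position string to Center, Wing, Defense, or Forward."""
    # Hypothetical raw codes: C = center, LW/RW = wing, D = defense.
    groups = {"C": "Center", "LW": "Wing", "RW": "Wing", "D": "Defense"}
    played = {groups[code.strip()] for code in raw.split("/")}
    if len(played) == 1:
        # Single-position players (including LW/RW, who are simply wings).
        return played.pop()
    # In this data, every multi-position player is both a center and a wing.
    return "Forward"


for raw in ["C", "LW/RW", "D", "C/LW"]:
    print(raw, "->", recode_position(raw))
```

Treating a player listed at both left and right wing as a plain wing is a design choice in this sketch; only players who mix center and wing end up in the “Forward” category.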
Here are the mean, maximum and minimum number of points scored by NHL players in the 2015-16 season.
mean(pts) | max(pts) | min(pts) |
---|---|---|
19.55011 | 106 | 0 |
Here are the mean, maximum, and minimum number of shifts played by NHL players.
mean(shifts) | max(shifts) | min(shifts) |
---|---|---|
1079.98 | 2730 | 2 |
Here are the total numbers of players in each position.
position | n() |
---|---|
Center | 153 |
Defense | 304 |
Forward | 283 |
Wing | 158 |
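As a quick check on the table above, the counts can be totaled and turned into shares of the league. The sketch below only uses the four counts reported in the table; the variable names are ours.

```python
# Player counts per position, copied from the table above.
counts = {"Center": 153, "Defense": 304, "Forward": 283, "Wing": 158}

total = sum(counts.values())
shares = {pos: n / total for pos, n in counts.items()}

print("total players:", total)
for pos, share in shares.items():
    print(f"{pos}: {share:.1%}")
```

Defensemen make up roughly a third of the players, and the three offensive categories together make up the remaining two thirds.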
From this box plot, we can see that the spread of the number of points scored for defensemen is smaller than the spread of other positions, but the median is not far off from any of the offensive positions (center, wing, forward). This may tell us that most defensemen generally score fewer points than offensive positions but not by a significant amount. It is also interesting to note that the median number of points scored for wings appears to be the lowest, but the wing position has a large spread. This issue will be addressed further later on.
This second figure shows a similar trend. Each point corresponds to one player with their number of shifts played on the x-axis and their number of points on the y-axis, and the graphs are split by position. The shape of the graphs suggests that as players play more shifts, they score more points. Between positions, it appears defense has the lowest number of points, and all the offensive positions (center, forward, wing) have a similar number of points.
This model shows the relationship between number of points scored and a player’s number of shifts played and position. The parallel slopes model was chosen because it is the simplest model that still represents the data well. An interaction model was tested, but the confidence intervals for the interaction terms were either very close to or included zero, meaning that the differences in slope they represented were not significant. This parallel slopes model has the same slope for all categories but includes the significant differences between the intercepts.
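Written out, the two models compared here might look like the following. The symbols $b_0, \dots, b_7$ and the indicator notation are ours, and the sketch assumes center is the baseline category:

```latex
% Parallel slopes: one shared slope b_1, one intercept offset per non-baseline position
\widehat{\text{pts}} = b_0 + b_1\,\text{shifts}
  + b_2\,\mathbb{1}_{\text{Defense}}
  + b_3\,\mathbb{1}_{\text{Forward}}
  + b_4\,\mathbb{1}_{\text{Wing}}

% Interaction model: the same terms plus one slope offset per non-baseline position
\widehat{\text{pts}} = b_0 + b_1\,\text{shifts}
  + b_2\,\mathbb{1}_{\text{Defense}}
  + b_3\,\mathbb{1}_{\text{Forward}}
  + b_4\,\mathbb{1}_{\text{Wing}}
  + b_5\,\text{shifts}\cdot\mathbb{1}_{\text{Defense}}
  + b_6\,\text{shifts}\cdot\mathbb{1}_{\text{Forward}}
  + b_7\,\text{shifts}\cdot\mathbb{1}_{\text{Wing}}
```

The interaction terms $b_5$, $b_6$, and $b_7$ are the differences in slope referred to above. When their confidence intervals include zero, the second model reduces in practice to the first.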
term | estimate | std_error | statistic | p_value | conf_low | conf_high |
---|---|---|---|---|---|---|
intercept | 0.635 | 1.074 | 0.591 | 0.555 | -1.473 | 2.742 |
shifts | 0.022 | 0.001 | 43.299 | 0.000 | 0.021 | 0.023 |
positionDefense | -13.019 | 1.099 | -11.845 | 0.000 | -15.176 | -10.862 |
positionForward | -2.370 | 1.115 | -2.125 | 0.034 | -4.559 | -0.181 |
positionWing | 1.252 | 1.265 | 0.989 | 0.323 | -1.232 | 3.736 |
Points vs Shifts Played with Color as Position
Note: This plot shows the Parallel Slopes regression model with the trend lines; it displays Points vs Shifts with color as position. The full code for this plot has not been learned yet, so the transparency of points and axes labels could not be changed.
This regression table comes from the parallel slopes model. The estimate in the intercept row is the intercept of the regression line for the center position, which comes first alphabetically and is therefore the baseline for comparison. The shifts estimate is the slope of all four regression lines, because this model gives every position the same slope. The positionDefense, positionForward, and positionWing estimates are the differences in intercept between the baseline position (center) and each of the other positions. This gives us four lines, one for each position.
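The four lines can be recovered from the regression table by adding each position's offset to the baseline intercept. The sketch below uses only the rounded estimates from the table; `line_intercept` and `predict_points` are hypothetical helper names.

```python
# Estimates copied from the regression table above (rounded to 3 decimals).
INTERCEPT = 0.635  # baseline intercept (Center)
SLOPE = 0.022      # shared slope for shifts
OFFSETS = {"Center": 0.0, "Defense": -13.019, "Forward": -2.370, "Wing": 1.252}


def line_intercept(position: str) -> float:
    """Intercept of the fitted line for one position."""
    return INTERCEPT + OFFSETS[position]


def predict_points(shifts: float, position: str) -> float:
    """Predicted points for a player with the given shifts and position."""
    return line_intercept(position) + SLOPE * shifts


for pos in OFFSETS:
    print(f"{pos}: points = {line_intercept(pos):.3f} + {SLOPE} * shifts")
```

For example, at 1000 shifts this gives about 22.6 predicted points for a center and about 9.6 for a defenseman, a gap equal to the positionDefense estimate.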
In this model, all positions are predicted to accumulate points at the same rate, as shown by the shared slope. Holding position constant, each additional shift a player plays is associated with an increase of 0.022 points on average. Since the intercepts differ, each position is predicted to have a different number of points for a given number of shifts. The intercepts themselves are the number of points predicted for a player with zero shifts. They have no practical interpretation here, because a player cannot score negative points or score any points without playing a shift. Despite this, the intercepts position the lines so that they fit better at higher numbers of shifts. The differences between the intercepts do tell us something. For example, the difference between the intercepts for defensemen and centers is -13.019, which means that for any given number of shifts, a defenseman is expected to score about 13 fewer points on average than a center. The lower intercept for the defense regression line shows that defensemen typically score fewer points, whereas the higher intercept for wings shows that they typically score more points. These interpretations from the regression table line up with the preliminary observations made in the exploratory data analysis section. The estimates suggest an associated increase in points as the number of shifts goes up, and for a given number of shifts, the offensive positions, particularly wing, are predicted to score more points than defensemen.
The points for offensive players being higher up on the initial graph than the defensive players lines up with the higher intercepts for offensive positions.
While it was noted above in the EDA section that the wing position appeared to have the lowest median of all the positions, this conflicts with the intercept being the highest for wings in this model. This is likely due to teams having a large number of wings on their team with few points. Since the center position takes some specialization, teams may put less experienced or less qualified forwards as wings. Because of this, there may be a larger population of players as wings who score less than other players in general, which would push the median for this group down. Despite this, there are also definitely very talented players at wing, so these players who likely score more points would account for the prediction of the model.
Because of the nature of this type of model, it predicts all positions to score points at the same rate. In this case, the model predicts that a player at any position scores 0.022 points per shift on average. Since it is not possible to score a fraction of a point, this can also be read as about one point for every 45 shifts played. Despite predicting the same scoring rate for each position, the model gives each position a different starting point, such that wings score the most points, followed by centers, forwards, and defensemen.
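The conversion from points per shift to shifts per point is just the reciprocal of the slope. A minimal sketch, using the rounded slope from the regression table and an arbitrary illustrative gap of 500 shifts:

```python
SLOPE = 0.022  # points per shift, from the regression table

# Reciprocal of the slope: how many shifts correspond to one predicted point.
shifts_per_point = 1 / SLOPE
print(f"about {shifts_per_point:.1f} shifts per point")

# Predicted difference between two players at the same position whose
# shift counts differ by 500 (an arbitrary gap chosen for illustration).
extra_shifts = 500
extra_points = SLOPE * extra_shifts
print(f"{extra_shifts} extra shifts -> {extra_points:.1f} more points")
```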
All of these are 95% confidence intervals, meaning that if the sampling were repeated many times, about 95% of the intervals built this way would be expected to contain the true value of the quantity being estimated.
For the shared slope, in the shifts row, the confidence interval is between 0.021 and 0.023. This is a very narrow interval, indicating that the slope is estimated precisely. This is important because the same slope is used for all of the lines. Practically, we are 95% confident that each additional shift is associated with an average increase of between 0.021 and 0.023 points, regardless of position.
For the intercepts, the actual values do not have as much meaning as the differences between them. A player of any position would have zero points when he has played zero shifts, so none of the intercepts have practical implications. The important part about the intercepts is that they shift the line up or down to indicate more or fewer points scored, which fits the data (and practically applies to the situation) better at higher numbers of shifts. This shifting creates the differences between the numbers of points predicted for each position. For the baseline intercept, centers, the confidence interval is between -1.473 and 2.742. This means that the average number of points for centers with zero shifts is estimated to be between -1.473 and 2.742. The interval includes zero, which is the only value that is actually possible.
For the defensive position, the confidence interval for the difference in intercept from centers is between -15.176 and -10.862. This interval lies entirely below zero, meaning that the intercept for defensemen is significantly lower than the intercept for centers. For a given number of shifts, defensemen would therefore be expected to score fewer points than centers, specifically between 10.862 and 15.176 fewer.
For the forward position, the confidence interval for the difference in intercept from centers is between -4.559 and -0.181. This interval comes close to zero but does not include it. This means that centers and forwards score similar numbers of points, with centers scoring slightly more, and both of these positions score far more than defensemen. Since the forward category includes players who play both center and wing, it makes sense that they would score a similar number of points to players who just play center.
For the wing position, the confidence interval for the difference in intercept from centers is between -1.232 and 3.736. This interval includes zero, meaning that wings and centers are not significantly different. Together with the small difference for forwards, this shows that the offensive positions all score a fairly similar number of points. This makes sense because they are all offensive positions with many opportunities to score. Importantly, all these positions still score far more than defensemen.
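The intervals discussed above can be approximately rebuilt from the estimates and standard errors in the regression table as estimate ± critical value × standard error. The sketch below assumes a t critical value of about 1.9626, which corresponds to 893 degrees of freedom (898 players minus 5 coefficients, assuming every player was used in the fit); that value and the helper name `conf_int` are our assumptions, not something stated in the analysis.

```python
T_CRIT = 1.9626  # assumed: approx. 97.5th percentile of t with 893 df

# (estimate, std_error) pairs copied from the regression table.
rows = {
    "intercept": (0.635, 1.074),
    "shifts": (0.022, 0.001),
    "positionDefense": (-13.019, 1.099),
    "positionForward": (-2.370, 1.115),
    "positionWing": (1.252, 1.265),
}


def conf_int(estimate, std_error, t_crit=T_CRIT):
    """Return (low, high) for estimate +/- t_crit * std_error."""
    margin = t_crit * std_error
    return estimate - margin, estimate + margin


intervals = {term: conf_int(*vals) for term, vals in rows.items()}
for term, (low, high) in intervals.items():
    contains_zero = low <= 0 <= high
    print(f"{term:16s} [{low:8.3f}, {high:8.3f}]  contains zero: {contains_zero}")
```

Because the table rounds estimates and standard errors to three decimals, the rebuilt intervals can differ from the reported ones in the last digit or two; the shifts interval in particular comes out slightly wider. The check that matters for each position row is whether its interval contains zero.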
For the intercept row, the p-value is 0.555. Since this value is large, the estimate for the baseline intercept (for centers) is not significantly different from zero.
For the positionDefense row, the p-value is reported as 0, which means that the value is so small that it rounds to zero at three decimal places. Since this value is so small, the difference in intercept between defensemen and centers is significant.
For the positionForward row, the p-value is 0.034. Since this value is below 0.05, the difference in intercept between centers, the baseline, and forwards is significant. This agrees with the confidence interval for this row, which comes close to zero but does not include it.
For the positionWing row, the p-value is 0.323. Since this value is large, the estimate for this row is not significantly different from the baseline estimate. This agrees with the confidence interval for this row, which includes zero.
These values line up well with the analyses of the confidence intervals above. The defensive position had a p-value indicating that its intercept was significantly different, and its confidence interval was far below zero. The forward position had a confidence interval that came close to zero without including it, and the p-value indicates that the difference is significant. Lastly, the wing position’s confidence interval included zero, and this was also seen in the p-value, which indicated that this difference was not significant.
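The p-values in the table can be approximately reproduced from the test statistics. The sketch below uses a standard normal approximation in place of the t distribution, which is an assumption on our part that is reasonable only because the sample is large; `two_sided_p` is a hypothetical helper name.

```python
import math


def two_sided_p(statistic: float) -> float:
    """Two-sided p-value under a standard normal approximation."""
    return math.erfc(abs(statistic) / math.sqrt(2))


# Test statistics copied from the regression table.
statistics_by_term = {
    "intercept": 0.591,
    "shifts": 43.299,
    "positionDefense": -11.845,
    "positionForward": -2.125,
    "positionWing": 0.989,
}

p_values = {term: two_sided_p(stat) for term, stat in statistics_by_term.items()}
for term, p in p_values.items():
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{term:16s} p = {p:.3f}  ({verdict} at the 0.05 level)")
```

The 0.05 cutoff matches the 95% confidence level used for the intervals, which is why a row's p-value falls below 0.05 exactly when its interval excludes zero.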
In order for inferences about this regression model to be made, certain conditions must hold. Two such conditions are that the residuals must be normally distributed and must have constant spread; the graphs below examine these two conditions.
This first graph is a histogram of all the residuals. Ideally, this would be a normal distribution centered at zero. The center of the distribution is slightly left of zero, but it is fairly close to zero. The distribution also appears to be fairly normal; the spread does extend farther on the right than the left, but overall, it follows a normal distribution.
This second graph shows the residuals plotted over the outcome variable, points. Ideally, the points would be evenly spread around the line at y=0 and would not vary with the number of points. Unfortunately this is not true, and similar issues were discussed above in the limitations section. The residuals start out evenly spread, become more negative, and then become more positive again. This issue can be seen in the graph with the fitted lines, because the data is more of a quadratic shape, while the trend lines stay linear. This means that the condition of constantly spread data is not met. Because of this, specific interpretations made from the regression model may not be exactly accurate.
From the regression table, it seems that all positions generally score at the same rate, but overall, some positions score more than others. Centers and wings score the most points, closely followed by forwards, while defensemen score significantly fewer points than these other positions. In general, the offensive positions score a similar number of points, which is larger than the number for defense, but all positions score at the same rate.
As discussed above, the length of shifts, game conditions, and linear nature of the model are all limitations to this analysis. The limitation of the linearity of the model was also seen in the residual analysis.
As for future research, the problem of the linearity of the model could be addressed with a log transformation of the data, which would require more sophisticated code, or a non-linear model could be tested. Analysis could also be done using predictor variables other than shifts. Since shift times vary so much, the actual amount of time spent on the ice could be used instead. Additionally, more variables could be added to the model to examine other factors that may affect points scored, such as strength, height, or draft pick. Finally, a more complex analysis could look at different combinations of players, specifically forwards. While this would take much more complex data and more time to analyze, this type of research could help teams putting together lines of forwards find the most effective scoring combinations.
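One wrinkle with a log transformation is that the minimum number of points in this data is 0, and the logarithm of 0 is undefined. A common workaround, shown here only as a sketch and assuming the transformation would be applied to the points variable, is to use log(1 + points). The helper name `log_points` is ours.

```python
import math


def log_points(pts: float) -> float:
    """Return log(1 + pts), which is defined at pts = 0 where log(pts) is not."""
    return math.log1p(pts)


# Minimum, mean, and maximum points from the summary table earlier.
for pts in (0, 19.55011, 106):
    print(f"pts = {pts:>9}: log(1 + pts) = {log_points(pts):.3f}")
```

Whether this transformation actually straightens the relationship would need to be checked against the residual plots after refitting.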
Rob Vollman. (2017). 2015-2016 NHL Statistics (7fe70665) [Data file]. Data.world: Ayodele Odubela.