Introduction

This assignment applies two different analytical methodologies to two separate sports analytics problems. While both sections rely on quantitative modeling, they answer fundamentally different types of questions and require different approaches to interpretation and validation.

The first section uses regression analysis to evaluate the factors associated with dynamic ticket pricing for Mets home games. The analysis focuses not only on identifying relationships between ticket prices and variables such as opponent quality, scheduling effects, weather conditions, and game timing, but also on evaluating whether those estimated relationships remain reliable after testing for multicollinearity, autocorrelation, and heteroskedasticity. Because regression results can become misleading when core assumptions are violated, diagnostic testing and robust inference corrections are an important part of the analysis.

The second section uses clustering analysis to identify NHL player archetypes based on salary, usage, possession metrics, expected goal share, and deployment context. Unlike regression, clustering is an unsupervised learning method that does not attempt to predict a single dependent variable. Instead, the goal is to identify meaningful structure within multidimensional player data by grouping players with similar statistical profiles. The analysis also evaluates how clustering results change when additional contextual variables are introduced and how different cluster evaluation methods influence the final model selection.

Across both sections, the broader goal is not simply to generate statistical output, but to evaluate how analytical methods can be used to support decision making in sports business strategy and player evaluation.



Part 1: Mets Dynamic Pricing Regression Analysis

Preview of Mets Dynamic Pricing Data
Average Price Minimum Price Maximum Price Price Std. Dev. Mets Win Percentage Opponent Win Percentage Wind Speed Temperature
149.32 42 312 89.09 0.475 0.549 10 53
80.08 12 315 69.93 0.475 0.549 16 55
82.44 12 315 67.98 0.475 0.549 19 59
66.08 12 315 61.79 0.475 0.497 19 61
66.08 12 315 61.79 0.475 0.497 17 56
66.56 14 316 61.76 0.667 0.667 8 53
66.08 12 315 61.79 0.538 0.538 11 64
85.20 20 315 67.49 0.571 0.500 6 67
75.64 12 315 64.76 0.533 0.533 14 50
66.08 12 315 61.79 0.500 0.562 14 49

Before building the regression models, it is important to understand the structure of the variables being analyzed. The data includes ticket pricing information, weather conditions, opponent strength, Mets performance, timing variables, and opponent category variables.

The ticket price distribution chart is omitted because it is an exploratory visual rather than a required econometric step. The main focus of this section is the regression results and whether the inference is reliable after checking multicollinearity, autocorrelation, and heteroskedasticity.


Correlation Structure

The correlation heatmap gives an initial view of how the continuous variables move together. A positive correlation means two variables tend to increase together, while a negative correlation means one variable tends to increase as the other decreases. A value close to 0 means there is not much of a linear relationship.

This visual is useful because it gives an early warning about potential overlap between variables before running the formal VIF test. It does not prove multicollinearity by itself, but it helps show whether average price, Mets win percentage, opponent win percentage, wind speed, and temperature are moving in similar patterns. That matters because highly related independent variables can make it harder for the regression to isolate the separate effect of each variable.
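As a small illustration of what the heatmap summarizes, the sketch below computes Pearson correlations with NumPy. It uses only the ten preview rows shown earlier, not the full 80 game data set, so the values will not match the heatmap. The column names are my own labels for illustration.

```python
import numpy as np

# Ten preview rows from the table above:
# average price, Mets win %, opponent win %, wind speed, temperature.
preview = np.array([
    [149.32, 0.475, 0.549, 10, 53],
    [80.08, 0.475, 0.549, 16, 55],
    [82.44, 0.475, 0.549, 19, 59],
    [66.08, 0.475, 0.497, 19, 61],
    [66.08, 0.475, 0.497, 17, 56],
    [66.56, 0.667, 0.667, 8, 53],
    [66.08, 0.538, 0.538, 11, 64],
    [85.20, 0.571, 0.500, 6, 67],
    [75.64, 0.533, 0.533, 14, 50],
    [66.08, 0.500, 0.562, 14, 49],
])
# Hypothetical column labels, used only for printing.
names = ["avg_price", "mets_win_pct", "opp_win_pct", "wind_speed", "temperature"]

# rowvar=False treats each column as one variable.
corr = np.corrcoef(preview, rowvar=False)

for i, name in enumerate(names):
    print(name.ljust(14), np.round(corr[i], 2))
```

Each row of the printed matrix shows how one variable moves with the others. Off-diagonal values near 1 or -1 between two predictors are the early warning that the VIF test later formalizes.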


Model 1: Original Regression Specification

Model 1 Regression Results
Variable Estimate Std. Error t Statistic p Value Significance
Intercept 33.175 20.919 1.586 0.1181
Sunday -5.643 6.158 -0.916 0.3631
Monday -5.622 5.760 -0.976 0.3330
Tuesday -9.937 5.902 -1.684 0.0975 .
Thursday 8.210 6.617 1.241 0.2196
Friday 25.381 5.868 4.326 <0.001 ***
Saturday 21.647 6.453 3.355 0.0014 **
April -7.973 7.791 -1.023 0.3103
May -12.172 6.499 -1.873 0.0661 .
July -1.254 7.274 -0.172 0.8637
August -0.711 7.063 -0.101 0.9201
September -7.483 6.801 -1.100 0.2757
Mets Win Percentage 19.864 29.667 0.670 0.5057
Opponent Win Percentage -19.219 20.171 -0.953 0.3446
Wind Speed 0.104 0.468 0.221 0.8255
Temperature 0.697 0.263 2.655 0.0102 *
Afternoon Game 0.556 4.580 0.121 0.9037
Early Evening Game 1.823 7.437 0.245 0.8072
Opening Day 77.326 14.765 5.237 <0.001 ***
Division Opponent 1.165 4.227 0.276 0.7838
Interleague Opponent 22.725 8.058 2.820 0.0065 **
Significance Codes: *** p < 0.001, ** p < 0.01, * p < 0.05, . p < 0.10

The regression table shows the estimated relationship between each independent variable and average ticket price while holding the other variables constant. The estimate column shows the expected change in average price associated with a one unit increase in that variable. For dummy variables, the estimate is the expected difference from the omitted reference group.

The standard error measures uncertainty around the estimate. The t statistic is the estimate divided by its standard error. The p value is the probability of seeing a t statistic at least that extreme if the true coefficient were zero, under the usual regression assumptions. The significance stars summarize that information quickly, with more stars meaning stronger statistical evidence.
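The relationship between the columns can be checked by hand. The short sketch below recomputes the t statistic for three rows of the Model 1 table. The function name is just an illustrative helper, and small differences from the reported values come from rounding in the printed table.

```python
# Estimate, standard error, and reported t statistic copied from the Model 1 table.
rows = {
    "Friday": (25.381, 5.868, 4.326),
    "Temperature": (0.697, 0.263, 2.655),
    "Opening Day": (77.326, 14.765, 5.237),
}

def t_statistic(estimate, std_error):
    """t statistic for the null hypothesis that the coefficient is zero."""
    return estimate / std_error

for name, (est, se, reported_t) in rows.items():
    t = t_statistic(est, se)
    print(f"{name}: computed t = {t:.3f}, reported t = {reported_t}")
```

With 80 observations and 21 estimated coefficients, an absolute t statistic above roughly 2 corresponds to significance at the 5 percent level, which is why Friday, Temperature, and Opening Day all earn stars.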


Model 1 Fit Statistics
R Squared Adjusted R Squared Residual Standard Error F Statistic Model p Value Observations
0.776 0.700 12.874 10.211 <0.001 80


Model 1 explains approximately 77.6 percent of the variation in average ticket prices, with an adjusted R squared of 0.700. The adjusted R squared is more useful here because this model includes many schedule, timing, weather, and opponent controls.

At the 5 percent level, the statistically significant predictors in Model 1 are Opening Day, Friday, Saturday, Interleague Opponent, and Temperature. These results should be interpreted as conditional relationships. Each coefficient estimates the expected change in average price while holding the other included variables constant.

From a pricing standpoint, Model 1 gives the baseline relationship between game characteristics and average ticket price. The key thing I am looking for is not just whether the full model is significant, but which specific variables actually appear to move ticket prices after controlling for day of week, month, team performance, opponent quality, weather, and game timing.

The model p value is shown as <0.001 instead of 0 because p values should not be reported as exactly zero. The standard reporting approach is to use a small threshold when the value is extremely close to zero.


Model 2: Adding Cubs, Yankees, and Phillies Dummies

Model 2 with Cubs, Yankees, and Phillies Dummies
Variable Estimate Std. Error t Statistic p Value Significance
Intercept 25.991 21.929 1.185 0.2409
Sunday -9.086 6.427 -1.414 0.1630
Monday -4.789 5.778 -0.829 0.4107
Tuesday -10.100 5.818 -1.736 0.0881 .
Thursday 7.049 6.601 1.068 0.2902
Friday 22.745 5.984 3.801 <0.001 ***
Saturday 17.734 6.649 2.667 0.0100 **
April -9.277 7.908 -1.173 0.2457
May -12.631 6.409 -1.971 0.0537 .
July -1.728 7.608 -0.227 0.8212
August -3.180 7.209 -0.441 0.6608
September -7.621 6.862 -1.111 0.2714
Mets Win Percentage 28.608 31.744 0.901 0.3713
Opponent Win Percentage -26.640 21.743 -1.225 0.2256
Wind Speed 0.227 0.468 0.484 0.6302
Temperature 0.803 0.271 2.963 0.0045 **
Afternoon Game 1.401 4.654 0.301 0.7644
Early Evening Game 2.383 7.441 0.320 0.7499
Opening Day 78.267 14.688 5.329 <0.001 ***
Division Opponent 2.123 4.611 0.460 0.6470
Interleague Opponent 10.770 9.867 1.091 0.2797
Cubs -3.090 10.208 -0.303 0.7633
Yankees 22.195 11.254 1.972 0.0535 .
Phillies -6.602 6.154 -1.073 0.2879
Significance Codes: *** p < 0.001, ** p < 0.01, * p < 0.05, . p < 0.10

Model 2 adds individual team dummy variables for the Cubs, Yankees, and Phillies. This tests whether those specific opponents create a pricing premium beyond the broader controls already included in Model 1.

This matters because opponent identity can carry demand value that is not fully captured by opponent win percentage, division opponent status, or interleague status. For example, the Yankees could raise demand because of rivalry and brand appeal, while the Phillies could matter because they are a division opponent with stronger local interest.

The columns in Model 2 have the same interpretation as Model 1. The estimate is the expected change in average ticket price, the standard error measures uncertainty, the t statistic measures the size of the estimate relative to the uncertainty, and the p value measures statistical significance.


Model 2 Fit Statistics
R Squared Adjusted R Squared Residual Standard Error F Statistic Model p Value Observations
0.793 0.709 12.685 9.352 <0.001 80


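Model 2 explains approximately 79.3 percent of the variation in average ticket prices, compared with 77.6 percent in Model 1. The adjusted R squared rises slightly, from 0.700 to 0.709.

At the 5 percent level, the statistically significant predictors in Model 2 are Opening Day, Friday, Saturday, and Temperature. None of the three team dummies is significant at that level. The Yankees coefficient (22.195, p = 0.0535) is significant only at the 10 percent level, and the Cubs and Phillies coefficients are not distinguishable from zero. Interleague Opponent, which was significant in Model 1, is no longer significant once the Yankees dummy is included, which suggests that part of the Model 1 interleague premium reflected Yankees games.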


Model 1 vs Model 2 Fit Comparison
Model R Squared Adjusted R Squared Residual Std. Error Observations
Model 1: Original Controls 0.776 0.700 12.874 80
Model 2: Added Team Dummies 0.793 0.709 12.685 80


This comparison shows whether adding the Cubs, Yankees, and Phillies dummy variables improves the model enough to justify the extra variables. R squared rises from 0.776 to 0.793 and adjusted R squared rises from 0.700 to 0.709, so the team dummies add a small amount of explanatory value even after accounting for the larger model size. The improvement is modest, which is consistent with only the Yankees dummy approaching statistical significance.

Comparing the two models is important because adding variables will almost always increase R squared mechanically. Adjusted R squared is more useful here because it penalizes unnecessary complexity. The goal is not just to build the largest possible model, but to determine whether the additional opponent-specific controls meaningfully improve the explanatory power of the regression.
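The penalty that adjusted R squared applies can be reproduced from the reported fit statistics. The sketch below uses the standard formula, with predictor counts taken from the coefficient tables (20 in Model 1 and 23 in Model 2, excluding the intercept). Results differ from the reported values only by rounding in the printed R squared.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R squared: 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# R squared and observation counts from the fit tables above.
model_1 = adjusted_r_squared(0.776, n_obs=80, n_predictors=20)
model_2 = adjusted_r_squared(0.793, n_obs=80, n_predictors=23)

print(f"Model 1 adjusted R squared: {model_1:.3f}")
print(f"Model 2 adjusted R squared: {model_2:.3f}")
```

Because the denominator shrinks as predictors are added, three extra variables must raise R squared by enough to offset the lost degrees of freedom before adjusted R squared will increase.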


Variance Inflation Factor Test

VIF stands for Variance Inflation Factor. It measures how much the variance of a regression coefficient is inflated because that independent variable overlaps with other independent variables in the model.

A VIF of 1 means the variable is uncorrelated with the other predictors, so there is no multicollinearity problem for that variable. Higher VIF values mean the variable is more strongly related to other predictors. In this assignment, the cutoff is VIF greater than 5, so variables above 5 are candidates for removal before estimating the cleaned model.

Variance Inflation Factor Results
Variable VIF
Opponent Win Percentage 15.650
Mets Win Percentage 14.172
April 4.231
Temperature 4.058
July 3.669
September 3.566
August 3.516
Interleague Opponent 3.358
Saturday 2.991
May 2.948
Sunday 2.618
Division Opponent 2.616
Afternoon Game 2.313
Yankees 2.272
Friday 2.270
Monday 2.259
Tuesday 1.996
Thursday 1.949
Phillies 1.880
Cubs 1.870
Wind Speed 1.726
Early Evening Game 1.613
Opening Day 1.324

Variables with VIF values greater than 5 are not automatically useless. The issue is that highly collinear variables can make the model unstable. When predictors overlap too much, the regression has trouble separating the individual effect of each one. Here, Opponent Win Percentage (15.650) and Mets Win Percentage (14.172) are the only two variables above the cutoff.

This can lead to inflated standard errors, weaker t statistics, and coefficient estimates that change too much depending on which related variables are included. Removing high VIF variables makes the model cleaner and easier to interpret.
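To make the definition concrete, the sketch below computes VIF directly from its formula, 1 / (1 - R squared), where R squared comes from regressing one predictor on all the others. It runs on synthetic data rather than the Mets data, and the function name is an illustrative helper, not the routine used to produce the table above.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (predictors only, without an intercept column)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress predictor j on an intercept plus all other predictors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic example: b is nearly a copy of a, c is unrelated to both.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.1 * rng.normal(size=200)
c = rng.normal(size=200)

vifs = vif(np.column_stack([a, b, c]))
print(np.round(vifs, 2))
```

The two overlapping columns receive very large VIF values while the unrelated column stays near 1, which mirrors how the two win percentage variables stand apart from the rest of the VIF table.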


Model 3 VIF Cleaned Regression
Variable Estimate Std. Error t Statistic p Value Significance
Intercept 31.115 21.621 1.439 0.1556
Sunday -9.840 6.426 -1.531 0.1312
Monday -5.713 5.753 -0.993 0.3249
Tuesday -9.891 5.841 -1.693 0.0959 .
Thursday 7.775 6.603 1.178 0.2439
Friday 22.552 6.009 3.753 <0.001 ***
Saturday 17.148 6.661 2.574 0.0127 *
April -7.964 7.870 -1.012 0.3158
May -11.234 6.334 -1.774 0.0815 .
July -3.046 7.564 -0.403 0.6887
August -3.210 7.241 -0.443 0.6592
September -9.601 6.698 -1.433 0.1572
Mets Win Percentage -8.627 9.214 -0.936 0.3530
Wind Speed 0.210 0.470 0.447 0.6563
Temperature 0.823 0.272 3.030 0.0037 **
Afternoon Game 2.310 4.614 0.501 0.6186
Early Evening Game 2.533 7.473 0.339 0.7359
Opening Day 74.267 14.383 5.163 <0.001 ***
Division Opponent 0.201 4.355 0.046 0.9633
Interleague Opponent 9.109 9.816 0.928 0.3574
Cubs 1.272 9.609 0.132 0.8951
Yankees 21.954 11.301 1.943 0.0570 .
Phillies -4.648 5.969 -0.779 0.4394
Significance Codes: *** p < 0.001, ** p < 0.01, * p < 0.05, . p < 0.10

Variables Removed for Multicollinearity
Removed Variables Due to VIF > 5
Opponent Win Percentage

The VIF-cleaned model removes variables only when their VIF values exceed the assignment cutoff of 5. This matters because the goal is not simply to keep every possible control variable. The goal is to keep a model where the remaining variables can be interpreted more reliably. If two variables are measuring nearly the same thing, the regression may struggle to separate their individual effects. Note that Mets Win Percentage also exceeded the cutoff in the initial VIF table but remains in Model 3. Only Opponent Win Percentage was removed, which is consistent with removing the highest VIF variable first and then rechecking the remaining values.

After removing Opponent Win Percentage, Model 3 becomes the cleaner specification. The goal is not to maximize R squared. The goal is to reduce multicollinearity so that the remaining coefficient estimates are more stable and easier to trust.

Model 3 explains approximately 78.8 percent of the variation in average ticket prices, with an adjusted R squared of 0.706. The statistically significant predictors at the 5 percent level are Opening Day, Friday, Saturday, and Temperature.

These are the same predictors that were significant in Model 2, which gives more confidence that the main results are not being driven by overlapping predictors. The Mets Win Percentage coefficient, by contrast, changes from 28.608 to -8.627 and its standard error falls from 31.744 to 9.214, which is the kind of instability that multicollinearity produces.


Regression Model Comparison Summary
Model R Squared Adjusted R Squared Residual Std. Error Observations
Model 1: Original Controls 0.776 0.700 12.874 80
Model 2: Added Team Dummies 0.793 0.709 12.685 80
Model 3: VIF Cleaned 0.788 0.706 12.741 80



This comparison helps evaluate whether removing the high VIF variable meaningfully changed the regression results. Model 3 keeps nearly all of Model 2's explanatory power: R squared falls only from 0.793 to 0.788 and adjusted R squared from 0.709 to 0.706. That suggests Opponent Win Percentage was creating overlap rather than adding substantial new information.

This is important because regression quality is not only about maximizing R squared. A model with slightly lower explanatory power but more stable coefficient estimates can often be more useful analytically than a larger unstable model with highly overlapping predictors.

Because adjusted R squared is essentially unchanged after the VIF cleaning process, the model retains most of its explanatory value while improving interpretability and reducing coefficient instability.


Residual Diagnostics

The residuals vs fitted values plot compares the model’s predicted ticket prices against the model’s errors. The fitted values on the x axis are the average ticket prices predicted by the model. The residuals on the y axis are the difference between the actual average ticket price and the predicted average ticket price.

A strong residual plot should look fairly random around the horizontal zero line. That would suggest the model is not systematically overpredicting or underpredicting prices at different levels of fitted price.

If the residuals widen as fitted values increase, that is evidence of possible heteroskedasticity. In this assignment, that would make sense because premium games may have more pricing variation than ordinary games. If the residuals form a pattern or curve, that could suggest the model is missing an important nonlinear relationship or omitted pricing factor.

This plot is included because regression assumptions are not just technical details. If the residuals are not well behaved, the coefficient estimates may still be useful, but the normal OLS standard errors and p values may be too confident.


Durbin Watson Test

The Durbin Watson test checks for autocorrelation in the regression residuals. Autocorrelation means the model errors are related across observations instead of being independent.

This is especially relevant for time ordered data. In the Mets pricing context, one game’s pricing error could be related to nearby games because of homestands, market momentum, opponent series, promotions, or pricing patterns that carry over from one game to the next.


Durbin Watson Test Results
Statistic p Value
1.8574 0.0408


The Durbin Watson statistic is 1.8574 with a p value of 0.0408.

A Durbin Watson statistic close to 2 suggests little evidence of autocorrelation. A value substantially below 2 suggests positive autocorrelation, meaning positive residuals tend to be followed by positive residuals and negative residuals tend to be followed by negative residuals. A value substantially above 2 suggests negative autocorrelation.

The p value tells us whether there is statistically significant evidence of autocorrelation in the residuals. Because the p value of 0.0408 is below 0.05, I reject the null hypothesis of no autocorrelation at the 5 percent level and treat the regular OLS standard errors with caution. The statistic itself is only modestly below 2, so the positive autocorrelation appears mild, but it is enough to justify Newey-West standard errors, which also address heteroskedasticity.
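The statistic itself is simple to compute: it is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below implements that formula with NumPy and applies it to two toy residual series (not the Mets residuals) to show the extremes. It also uses the common approximation DW ≈ 2(1 - r), where r is the lag 1 autocorrelation of the residuals, to translate the reported statistic into an implied correlation.

```python
import numpy as np

def durbin_watson(resid):
    """Sum of squared successive differences divided by sum of squared residuals."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Toy residuals: perfectly alternating signs versus identical values.
alternating = durbin_watson([1, -1, 1, -1])   # strong negative autocorrelation
constant = durbin_watson([1, 1, 1, 1])        # extreme positive autocorrelation

# Approximate lag 1 autocorrelation implied by the reported statistic.
implied_r = 1 - 1.8574 / 2

print(alternating, constant, round(implied_r, 4))
```

The implied lag 1 correlation for the reported statistic is small and positive, which fits the reading of mild positive autocorrelation.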


Newey West HAC Standard Errors

Newey West standard errors correct the inference problem created by heteroskedasticity and autocorrelation. HAC stands for heteroskedasticity and autocorrelation consistent.

The important point is that Newey West does not change the coefficient estimates. The estimated effects stay the same. What changes is the calculation of standard errors, t statistics, and p values. This gives a more reliable test of statistical significance when the error structure violates the regular OLS assumptions.
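The sketch below shows one way the Newey-West calculation can be written out, assuming Bartlett kernel weights and no small sample adjustment. It runs on synthetic data with autocorrelated errors, not the Mets data. The function names and the lag length of 4 are illustrative choices, not the settings used for the table in this report.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients and residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

def newey_west_cov(X, resid, max_lag):
    """HAC covariance of the OLS coefficients using Bartlett kernel weights."""
    Xe = X * resid[:, None]            # row t is e_t * x_t
    S = Xe.T @ Xe                      # lag 0 term (heteroskedasticity only)
    for lag in range(1, max_lag + 1):
        weight = 1.0 - lag / (max_lag + 1.0)
        G = Xe[lag:].T @ Xe[:-lag]     # sum over t of e_t e_{t-lag} x_t x_{t-lag}'
        S += weight * (G + G.T)
    XtX_inv = np.linalg.inv(X.T @ X)
    return XtX_inv @ S @ XtX_inv

# Synthetic data: one predictor with AR(1) errors.
rng = np.random.default_rng(1)
n = 80
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
shocks = rng.normal(size=n)
errors = np.zeros(n)
errors[0] = shocks[0]
for t in range(1, n):
    errors[t] = 0.6 * errors[t - 1] + shocks[t]
y = 30.0 + 5.0 * x + errors

beta, resid = ols(X, y)
sigma2 = resid @ resid / (n - X.shape[1])
ols_se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
hac_cov = newey_west_cov(X, resid, max_lag=4)
hac_se = np.sqrt(np.diag(hac_cov))

print("coefficients:", np.round(beta, 3))
print("OLS SE:", np.round(ols_se, 3))
print("HAC SE:", np.round(hac_se, 3))
```

The coefficients are computed once and shared by both sets of standard errors. Only the covariance matrix changes, and the HAC standard errors can come out larger or smaller than the OLS ones depending on the error structure.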


Model 3 with Newey West HAC Standard Errors
Variable Estimate Std. Error t Statistic p Value Significance
Intercept 31.115 15.050 2.067 0.0432 *
Sunday -9.840 5.120 -1.922 0.0596 .
Monday -5.713 3.549 -1.610 0.1130
Tuesday -9.891 4.755 -2.080 0.0420 *
Thursday 7.775 7.389 1.052 0.2971
Friday 22.552 6.641 3.396 0.0013 **
Saturday 17.148 6.694 2.562 0.0131 *
April -7.964 6.228 -1.279 0.2061
May -11.234 1.389 -8.086 <0.001 ***
July -3.046 2.677 -1.138 0.2599
August -3.210 3.660 -0.877 0.3842
September -9.601 1.736 -5.531 <0.001 ***
Mets Win Percentage -8.627 3.613 -2.387 0.0203 *
Wind Speed 0.210 0.378 0.557 0.5796
Temperature 0.823 0.299 2.756 0.0078 **
Afternoon Game 2.310 1.916 1.205 0.2331
Early Evening Game 2.533 5.693 0.445 0.6581
Opening Day 74.267 5.199 14.285 <0.001 ***
Division Opponent 0.201 3.191 0.063 0.9500
Interleague Opponent 9.109 5.557 1.639 0.1067
Cubs 1.272 4.017 0.317 0.7526
Yankees 21.954 7.041 3.118 0.0029 **
Phillies -4.648 2.701 -1.721 0.0907 .
Significance Codes: *** p < 0.001, ** p < 0.01, * p < 0.05, . p < 0.10

After applying HAC consistent standard errors, the statistically significant predictors at the 5 percent level are Opening Day, May, September, Friday, Yankees, Temperature, Saturday, Mets Win Percentage, and Tuesday.

This table should be compared with the regular Model 3 table. Newey-West standard errors can be larger or smaller than OLS standard errors. Here they are smaller for most variables, so no variable loses significance and several become significant: Tuesday, May, September, Mets Win Percentage, and Yankees. The variables that are significant under both sets of standard errors (Opening Day, Friday, Saturday, and Temperature) are the most robust results.


OLS vs Newey-West Significance Comparison
Variable OLS Significant at 5% Newey-West Significant at 5%
Intercept No Yes
Sunday No No
Monday No No
Tuesday No Yes
Thursday No No
Friday Yes Yes
Saturday Yes Yes
April No No
May No Yes
July No No
August No No
September No Yes
Mets Win Percentage No Yes
Wind Speed No No
Temperature Yes Yes
Afternoon Game No No
Early Evening Game No No
Opening Day Yes Yes
Division Opponent No No
Interleague Opponent No No
Cubs No No
Yankees No Yes
Phillies No No

This comparison is important because it shows how much the inference depends on the assumed error structure. In this case the correction does not remove any significant result. Instead, five predictors that were not significant under OLS (Tuesday, May, September, Mets Win Percentage, and Yankees) become significant under Newey-West.

The relationships that are significant under both approaches are the most convincing. The newly significant variables deserve more caution. With 80 observations and 22 predictors, HAC standard errors can be biased downward in small samples, so very small standard errors such as those for May and September may overstate precision.

This is one of the main reasons econometric diagnostics matter. A regression's apparent statistical strength can change in either direction once the error structure is corrected.

From a business analytics perspective, these diagnostic corrections matter because pricing decisions based on unstable or misleading statistical relationships can create real financial consequences. If the model overstates confidence in certain variables because of multicollinearity, autocorrelation, or heteroskedasticity, decision makers may incorrectly adjust pricing strategies based on relationships that are weaker than they initially appear. Applying these diagnostic tests and robust inference corrections helps make the final pricing analysis more reliable for practical decision-making.


Strategic Pricing Implications

The regression results should be read as evidence of conditional pricing relationships, not proof of perfect causal effects. Even with controls, the model likely leaves out important demand factors such as starting pitchers, promotions, injuries, playoff race context, secondary market prices, and opponent star power.

The main takeaway is that dynamic pricing looks more like a demand optimization problem than a simple reflection of team quality. Average ticket price can be affected by timing, opponent identity, weather, and market context. From a business standpoint, the most valuable pricing information may come from identifying which games shift from normal inventory into higher demand entertainment inventory.

For example, if certain opponent variables or timing variables consistently remain significant even after applying multicollinearity and HAC corrections, that suggests those demand drivers may represent stable pricing signals rather than short term statistical noise. From a business perspective, those types of variables could become especially valuable when designing dynamic pricing models or forecasting higher demand inventory windows across a season.



Part 2: NHL Player Clustering Analysis

Preview of NHL Player Data
Player Team Salary Time on Ice CF% FF% xGF% Offensive Zone Start %
JeffSkinner BUF 10,000,000.00 126.37 50.00 51.53 49.37 67.39
BlakeWheeler WPG 10,000,000.00 57.87 53.76 44.62 42.41 59.09
MikkoRantanen COL 10,000,000.00 96.65 56.18 52.76 63.36 67.50
VladimirTarasenko STL 9,500,000.00 101.62 49.29 49.09 49.39 66.67
OliverEkman-Larsson VAN 9,240,000.00 167.22 55.80 55.45 53.07 50.88
BraydenPoint T.B 9,000,000.00 146.95 53.08 53.28 59.57 60.00
AndersLee NYI 9,000,000.00 94.80 44.86 45.34 48.37 73.33
MatthewTkachuk CGY 9,000,000.00 139.88 61.01 60.00 62.59 50.00
SidneyCrosby PIT 9,000,000.00 13.05 33.33 36.36 49.96 87.50
JustinFaulk STL 9,000,000.00 144.23 51.65 49.78 48.34 58.97

This table previews the NHL skater data used in the clustering section. It includes player salary, time on ice, possession statistics, expected goal percentage, and offensive zone start percentage.

The purpose of this section is different from the regression analysis. Regression explains variation in one outcome variable. Clustering looks across several variables at the same time and tries to group similar players together. In this case, the goal is to identify player archetypes based on cost, usage, possession impact, shot quality impact, and deployment.


Initial Required Cluster Model

The initial model follows the required variables from the assignment: salary, time on ice, CF%, and FF%. These variables give a basic profile of each player based on compensation, usage, and shot attempt share.

Before running k means clustering, the data must be normalized and scaled. This is necessary because the variables are measured on different scales. Salary is measured in dollars, time on ice is measured in minutes, and CF% and FF% are percentages. If the data were not scaled, salary would dominate the model simply because it has much larger raw values.
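The scaling step can be illustrated with the ten preview rows shown above. The sketch below converts each column to z scores by subtracting the column mean and dividing by the column standard deviation. Using the sample standard deviation (ddof=1) is an assumption on my part about the scaling convention.

```python
import numpy as np

# Salary, time on ice, CF%, FF% for the ten preview players above.
players = np.array([
    [10_000_000, 126.37, 50.00, 51.53],
    [10_000_000, 57.87, 53.76, 44.62],
    [10_000_000, 96.65, 56.18, 52.76],
    [9_500_000, 101.62, 49.29, 49.09],
    [9_240_000, 167.22, 55.80, 55.45],
    [9_000_000, 146.95, 53.08, 53.28],
    [9_000_000, 94.80, 44.86, 45.34],
    [9_000_000, 139.88, 61.01, 60.00],
    [9_000_000, 13.05, 33.33, 36.36],
    [9_000_000, 144.23, 51.65, 49.78],
])

raw_spread = players.std(axis=0, ddof=1)
scaled = (players - players.mean(axis=0)) / raw_spread

print("raw standard deviations:", np.round(raw_spread, 2))
print("scaled standard deviations:", np.round(scaled.std(axis=0, ddof=1), 2))
```

Before scaling, the salary column's spread is several orders of magnitude larger than the others. After scaling, every column has mean 0 and standard deviation 1, so each variable contributes comparably to distance calculations.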

Clustering models are useful because they let the data identify groups instead of forcing players into categories manually. However, the clusters are only as good as the variables included. In this first version, the model is intentionally simple because it follows the assignment’s starting point.

K means clustering is appropriate for this analysis because the goal is to group players based on similarity across several continuous numerical variables. The algorithm works by minimizing the distance between players and their assigned cluster centers. Since k means relies heavily on distance calculations, scaling the variables beforehand is especially important so that larger scale variables such as salary do not dominate the clustering process.
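The assign and update loop behind k means can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic two dimensional data, not the routine used for the results in this report. To keep the example reproducible it starts from a deterministic farthest point initialization, which is an assumption for illustration, since standard implementations typically use random or k-means++ starts.

```python
import numpy as np

def farthest_first_centers(X, k):
    """Pick the first point, then repeatedly add the point farthest from chosen centers."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    return np.array(centers)

def kmeans(X, k, n_iter=100):
    """Alternate between assigning points to the nearest center and updating centers."""
    centers = farthest_first_centers(X, k)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Synthetic data: three tight, well separated groups of 30 points each.
rng = np.random.default_rng(0)
blobs = [rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in [(0, 0), (10, 0), (0, 10)]]
X = np.vstack(blobs)

labels, centers = kmeans(X, k=3)
print(np.bincount(labels))
print(np.round(centers, 2))
```

Because every step relies on Euclidean distance, an unscaled variable with a large range would dominate the assignment step, which is the reason the player data are scaled first.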


The scatterplot visualizes the three initial clusters using the first two principal components. Principal component analysis is used here as a dimensionality reduction tool because the clustering model exists in a higher dimensional space that cannot be visualized directly on a two dimensional chart.

The first principal component captures the largest amount of variation across the player profiles, while the second principal component captures the next largest amount of variation. Plotting the clusters across these two dimensions allows the overall separation between player groups to be visualized more clearly.
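The projection used for this kind of plot can be sketched with a singular value decomposition. The example below runs on synthetic standardized data with four columns, standing in for the four clustering variables. The function name is an illustrative helper rather than the code behind the chart.

```python
import numpy as np

def pca_2d(X):
    """Project X onto its first two principal components and report variance shares."""
    Xc = X - X.mean(axis=0)                        # center each column
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)            # share of variance per component
    scores = Xc @ Vt[:2].T                         # coordinates on the first two components
    return scores, explained

# Synthetic data: four columns, the first two strongly related.
rng = np.random.default_rng(2)
base = rng.normal(size=(100, 1))
X = np.hstack([
    base,
    base + 0.3 * rng.normal(size=(100, 1)),
    rng.normal(size=(100, 2)),
])
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

scores, explained = pca_2d(X)
print(scores.shape)
print(np.round(explained, 3))
```

The first two entries of the explained variance vector indicate how much of the full multidimensional structure the two dimensional scatterplot actually shows.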

The colors represent the three clusters assigned by the k means algorithm.


Initial Three Cluster Summary
Cluster Players Average Salary Average Time on Ice Average CF% Average FF%
1 189 1,436,645 86.32 43.23 42.94
2 174 5,198,578 139.59 50.76 50.35
3 230 1,391,092 102.69 54.41 54.71


The cluster summary table shows the average values for each cluster. The Players column tells how many players were placed into each group. The salary column shows the average salary for players in that cluster. Time on Ice shows the average usage level. CF% and FF% show the average shot attempt share for the players in that cluster.

The numbers should be interpreted as cluster averages, not individual player rankings. Cluster 2 has the highest average salary (about $5.2 million) and the highest time on ice, so it likely represents players in larger roles. Cluster 1 has a low average salary, the lowest time on ice, and the weakest CF% and FF%, so it likely represents depth players. Cluster 3 has a similarly low average salary but the strongest CF% and FF%, which points to players who may provide value relative to their compensation.


Initial Model Cluster Count Diagnostics

The assignment starts with three clusters, but it also asks us to test whether that is actually the appropriate number of clusters. I use three methods: elbow method, silhouette method, and gap statistic.

These methods do not always agree because they measure different things. The goal is not to blindly follow one number. The goal is to compare the evidence and then choose a cluster count that is both statistically reasonable and easy to interpret.


The elbow method looks at within cluster sum of squares. This measures how tightly grouped the players are inside their assigned clusters. Lower values mean players are closer to their cluster centers.

As the number of clusters increases, the within cluster sum of squares will almost always decrease. The key is to look for the elbow, which is the point where adding more clusters provides less additional improvement. If the curve drops sharply at first and then flattens, the flattening point is usually a reasonable cluster count.
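The elbow curve is just within cluster sum of squares computed for a range of cluster counts. The sketch below does this on synthetic data with three obvious groups, using a small k means helper with a deterministic farthest point start. Both the data and the helper are illustrative assumptions, not the player data or the routine behind the chart.

```python
import numpy as np

def kmeans_wcss(X, k, n_iter=100):
    """Run a simple k means and return the within cluster sum of squares."""
    centers = [X[0]]
    for _ in range(1, k):                          # farthest point initialization
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return float(np.sum(dists.min(axis=1) ** 2))

# Synthetic data: three tight, well separated groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2))
               for c in [(0, 0), (10, 0), (0, 10)]])

wcss = [kmeans_wcss(X, k) for k in range(1, 7)]
for k, value in enumerate(wcss, start=1):
    print(k, round(value, 1))
```

In this synthetic case the values fall steeply up to three clusters and barely change afterward, which is the flattening pattern the elbow method looks for. Real player data rarely produce an elbow this sharp.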


The silhouette method measures how well each player fits within its assigned cluster compared with the next closest cluster. Higher values mean the clusters are more clearly separated and internally consistent.

For the initial model, the silhouette method selects 3 clusters. This gives a statistical recommendation based on separation quality. A higher silhouette score generally means cleaner clusters.
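The silhouette value for one observation compares a, its average distance to the other members of its own cluster, with b, its average distance to the members of the nearest other cluster, as (b - a) / max(a, b). The sketch below implements that definition on a four point toy example. The function name is an illustrative helper.

```python
import numpy as np

def mean_silhouette(X, labels):
    """Average silhouette value across all observations."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # singleton clusters are scored 0 by convention
        a = D[i, own].sum() / (n_own - 1)          # excludes the zero distance to itself
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

# Toy example: two pairs of points that are far apart.
points = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = mean_silhouette(points, [0, 0, 1, 1])   # pairs grouped correctly
bad = mean_silhouette(points, [0, 1, 0, 1])    # each cluster spans both pairs

print(round(good, 3), round(bad, 3))
```

A sensible grouping gives a value close to 1, while a grouping that mixes the two pairs gives a negative value. The silhouette method applies this comparison across candidate cluster counts and picks the count with the highest average.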


The gap statistic compares the clustering structure in the real player data to what would be expected from random data. A larger gap means the observed clusters are more meaningful compared with random noise.

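For the initial model, the gap statistic selects 1 cluster. This method asks whether the clustering pattern is stronger than what we would expect by chance, so a result of 1 means the data do not show clearly stronger group structure than random reference data. The player groups are therefore better read as useful segments of a continuum than as sharply separated types.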
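The gap statistic can be sketched as follows, assuming uniform reference data drawn over the bounding box of the observed data and the selection rule that picks the smallest k whose gap is at least the next gap minus its standard error. The helper functions, the number of reference sets, and the synthetic data are all illustrative assumptions and may differ from the settings behind the result above.

```python
import numpy as np

def kmeans_wcss(X, k, n_iter=100):
    """Simple k means with a farthest point start; returns within cluster sum of squares."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return float(np.sum(dists.min(axis=1) ** 2))

def gap_statistic(X, k_max, n_refs=10, seed=0):
    """Gap(k) = mean log WCSS of uniform reference data minus log WCSS of the data."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, errs = np.empty(k_max), np.empty(k_max)
    for k in range(1, k_max + 1):
        ref_logs = np.array([
            np.log(kmeans_wcss(rng.uniform(lo, hi, size=X.shape), k))
            for _ in range(n_refs)
        ])
        gaps[k - 1] = ref_logs.mean() - np.log(kmeans_wcss(X, k))
        errs[k - 1] = ref_logs.std() * np.sqrt(1 + 1 / n_refs)
    chosen = k_max
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - errs[k]:   # smallest k within one SE of the next gap
            chosen = k
            break
    return gaps, errs, chosen

# Synthetic data: three tight, well separated groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2))
               for c in [(0, 0), (10, 0), (0, 10)]])

gaps, errs, chosen = gap_statistic(X, k_max=5)
print(np.round(gaps, 2), chosen)
```

Under this rule, a selection of 1 occurs when the gap at one cluster is already within one standard error of the gap at two clusters. In other words, splitting the data does not tighten it noticeably more than splitting structureless reference data would.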

Overall, the initial model is useful as a required starting point, but it is limited. It includes salary, usage, and shot attempt share, but it does not include shot quality or deployment context. That is why the enhanced model adds more hockey specific information.


Enhanced Cluster Model

The enhanced clustering model expands the original feature set by adding expected goal percentage and offensive zone start percentage. The goal of this adjustment is to create a more complete representation of player impact and deployment context.

The original clustering model captured salary, usage, and shot attempt share, but it did not fully account for shot quality or the difficulty of a player’s role. CF% and FF% measure puck possession and shot attempt share, but they do not distinguish between dangerous and low quality opportunities. Adding xGF% helps incorporate expected scoring chance quality into the clustering structure.

Offensive zone start percentage adds deployment context. Players who begin more shifts in the offensive zone often operate under easier conditions than players consistently deployed in defensive situations. Including deployment context helps separate players who drive play independently from players whose performance may partially reflect favorable usage conditions.

This adjustment improves the clustering model because it moves beyond simple possession and salary profiling into a more complete player evaluation framework that incorporates role difficulty, usage environment, and expected scoring impact.


The PCA variance chart shows how much of the total information in the enhanced model is captured by each principal component. The first principal component captures the largest amount of variation, the second captures the next largest amount, and so on.

This matters because the cluster visualizations use the first two principal components to show the results in a two dimensional plot. The chart helps show how much of the full multidimensional model is being represented in that visual.


Enhanced Model Cluster Count Diagnostics

For the enhanced model, the elbow method again shows how much clustering tightness improves as more clusters are added. Because the enhanced model uses more variables, it can create more detailed player groupings than the initial model.

The elbow method is useful here because it shows whether the added variables create enough structure to justify more clusters. If the improvement slows down around a certain value of k, that value becomes a reasonable candidate.


For the enhanced model, the silhouette method selects 2 clusters, suggesting that this cluster count produces the strongest separation quality under the silhouette framework.


For the enhanced model, the gap statistic selects 1 cluster. A one cluster result means the observed data are not clearly more clustered than the random reference data used for comparison, so the gap statistic does not provide strong evidence of well separated groups.

The clustering diagnostics do not necessarily need to agree on the exact same number of clusters because each method evaluates cluster quality differently.

The elbow method focuses on how much within cluster variation decreases as additional clusters are added. The silhouette method focuses more on separation quality between clusters and how well observations fit within their assigned groups. The gap statistic compares the observed clustering structure against what would be expected from randomly generated data.

Because these methods evaluate different aspects of clustering performance, disagreement between them is common in applied analytics problems. The final decision should therefore combine statistical evidence with interpretability and practical usefulness rather than relying entirely on one mechanical metric.
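
To make the silhouette criterion more concrete, the sketch below computes the average silhouette width for two different groupings of the same four made-up points. The function name silhouette_score and the data are assumptions for illustration only. Note that the silhouette needs at least two clusters to be computed, so it can never select a one cluster solution, which is one reason it can disagree with the gap statistic.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette width: (b - a) / max(a, b), averaged over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for single member clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: lowest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Four made-up one dimensional observations.
points = [(0,), (1,), (10,), (11,)]
good = silhouette_score(points, [0, 0, 1, 1])  # neighbors grouped together
bad = silhouette_score(points, [0, 1, 0, 1])   # neighbors split apart
print(round(good, 3), round(bad, 3))
```

Values close to 1 indicate that observations sit much closer to their own cluster than to the next nearest one, while negative values indicate that observations would fit better in a different cluster.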


Final Enhanced Cluster Selection

A four cluster solution was selected for the final enhanced model, even though the silhouette method favored 2 clusters and the gap statistic favored 1. This choice is partly statistical and partly interpretive. The diagnostics provide guidance, but the most useful sports analytics model is not always the one that blindly maximizes one metric.

Four clusters create a better balance between statistical separation and hockey meaning. With too few clusters, different player types can get grouped together. With too many clusters, the model can become harder to explain and less useful. Four clusters allow the model to separate higher usage players, depth players, possession value players, and deployment sensitive players more clearly.


The final cluster scatterplot shows the four player groups using the first two principal components. Each point represents one player. The color shows which cluster the player belongs to.

This plot is mainly a visualization of similarity. Players in the same color group have similar overall profiles across the enhanced model variables. If the colored groups are visibly separated, that supports the idea that the model is identifying meaningful differences between player types.


Final Enhanced Cluster Summary
Cluster  Players  Avg. Salary ($)  Avg. TOI  Avg. CF%  Avg. FF%  Avg. xGF%  Avg. OZ Start %
1        105      1,507,199        67.62     41.23     40.28     37.19      49.93
2        218      1,747,133        122.51    48.43     48.31     48.52      46.04
3        139      1,410,386        90.73     56.50     57.28     58.85      59.69
4        131      5,808,054        135.89    51.72     51.41     51.57      57.47

The final enhanced cluster summary gives the average profile for each of the four clusters. The salary column shows the average cost of players in the cluster. Time on Ice shows usage. CF% and FF% measure shot attempt share. xGF% measures expected goal share. Offensive Zone Start % measures deployment.

This table is important because it translates the cluster assignment into actual hockey meaning. The scatterplot shows separation, but the summary table explains why the clusters are different.


Cluster Centroid Heatmap

The centroid heatmap is the most important interpretation visual in the clustering section. A centroid is the average profile of a cluster after the variables have been standardized. Since the variables were scaled before clustering, the heatmap values are standardized values, not raw salary dollars, minutes, or percentages.

A positive value means that cluster is above the overall average for that variable. A negative value means that cluster is below the overall average for that variable. A value near zero means the cluster is close to average.
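
The direction of each heatmap value can be checked against the Final Enhanced Cluster Summary table. The sketch below is a minimal illustration that assumes the standardization used the same pool of players shown in that table. It rebuilds the overall average of each variable as a player weighted average of the cluster averages and then marks each cluster as above (+) or below (-) that average. It recovers only the sign, not the size, of a standardized centroid value, because the table does not report standard deviations. The variable names are hypothetical.

```python
# Cluster sizes and averages copied from the Final Enhanced Cluster Summary table.
players = [105, 218, 139, 131]
cluster_means = {
    "salary": [1_507_199, 1_747_133, 1_410_386, 5_808_054],
    "toi": [67.62, 122.51, 90.73, 135.89],
    "cf_pct": [41.23, 48.43, 56.50, 51.72],
    "ff_pct": [40.28, 48.31, 57.28, 51.41],
    "xgf_pct": [37.19, 48.52, 58.85, 51.57],
    "oz_start_pct": [49.93, 46.04, 59.69, 57.47],
}

def overall_mean(values, weights):
    """Player weighted average of the cluster averages."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

signs = {}
for name, means in cluster_means.items():
    grand = overall_mean(means, players)
    # "+" if the cluster average is above the overall average, "-" otherwise
    signs[name] = ["+" if m > grand else "-" for m in means]
    print(f"{name:13s} overall={grand:14.2f}  clusters 1-4: {' '.join(signs[name])}")
```

Read this way, only Cluster 4 sits above the overall salary average, while Clusters 2 and 4 are both above the overall time on ice average.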

Standardizing the variables before clustering and centroid analysis is important because the original variables exist on very different scales. Salary is measured in dollars, time on ice is measured in minutes, and possession statistics are percentages. Without standardization, larger scale variables would dominate both the clustering process and the centroid interpretation.
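
The effect of scale can be shown with a small example. The four players below are hypothetical, and the helper name zscores is an assumption for illustration. Player 1 has almost the same salary as player 0 but a very different CF%, while player 2 has almost the same CF% as player 0 but a very different salary.

```python
import math
import statistics

def zscores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical players: salary in dollars and CF% in percentage points.
salary = [900_000, 950_000, 6_000_000, 6_050_000]
cf_pct = [42.0, 58.0, 42.5, 57.5]

raw = list(zip(salary, cf_pct))
scaled = list(zip(zscores(salary), zscores(cf_pct)))

# Compare the distance from player 0 to player 2 with the distance to player 1.
raw_ratio = math.dist(raw[0], raw[2]) / math.dist(raw[0], raw[1])
scaled_ratio = math.dist(scaled[0], scaled[2]) / math.dist(scaled[0], scaled[1])
print(round(raw_ratio, 1), round(scaled_ratio, 2))
```

On the raw scale the salary gap makes player 2 look roughly 100 times farther away than player 1, and the 16 point CF% gap is effectively ignored. After standardization the two distances are about equal, so both variables contribute to the similarity calculation.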

Using standardized values allows the heatmap to compare relative player traits across all variables equally.

The heatmap shows which traits are most above or below average for each cluster. Darker red values represent traits that are above average for that cluster, while darker blue values represent traits that are below average.

For example, a cluster with high salary and high time on ice is likely made up of higher role players who are paid more and used more often. A cluster with low salary but strong possession or expected goal numbers could be interesting because it may represent undervalued players. A cluster with high offensive zone start percentage but weaker possession numbers may represent players whose results are influenced by easier deployment.

The main value of this heatmap is that it helps move the analysis from simply saying there are four clusters to explaining what those four clusters actually mean.


Final Cluster Interpretation

The final clustering model creates a player evaluation framework based on role, cost, possession impact, shot quality, and deployment. These clusters should not be interpreted as a perfect ranking of player value. Instead, they represent player archetypes built from the statistical relationships between salary, usage, possession metrics, expected goal share, and deployment context.

Cluster 1 appears to represent low usage replacement-level or struggling players. This cluster has the lowest time on ice, the weakest possession metrics, and the weakest expected goal share in the model. These players also carry relatively low salaries, suggesting lower role skaters who struggle to drive positive possession or scoring impact.

Cluster 2 appears to represent middle usage transitional players. This group has the second highest time on ice in the model and possession metrics slightly below 50 percent, along with the lowest offensive zone start percentage of the four clusters. These players may represent stable roster contributors who are neither major liabilities nor strong play drivers.

Cluster 3 appears to represent strong possession and play-driving value players. This cluster produces the strongest CF%, FF%, and xGF% values in the model despite having the lowest average salary and lower usage than the highest role players. The offensive zone start percentage is also the highest in the model, which suggests these players are often deployed in offensive situations, so some of their strong results may reflect favorable usage as well as play-driving ability.

Cluster 4 appears to represent expensive high usage core players. This cluster has by far the highest salaries and highest time on ice in the model. Possession and expected goal metrics remain positive, although not as strong as Cluster 3. This suggests these players are heavily relied upon by their teams and likely include star players or top lineup skaters. Their offensive zone start percentage is the second highest in the model, so their results should not be read as coming from unusually difficult zone deployment.

This is one of the main strengths of clustering in sports analytics. Instead of evaluating players only through one statistic at a time, clustering allows multiple dimensions of player profile, deployment, usage, and value to be evaluated together.


Limitations of the Clustering Model

While the clustering model identifies meaningful player archetypes, the results are still limited by the variables included in the analysis. The model focuses heavily on possession, expected goals, salary, usage, and deployment, but it does not directly account for factors such as teammate quality, competition level, injuries, special teams usage, age, contract length, or playoff performance.

Another limitation is that clustering methods force players into discrete groups even though real player performance exists on a spectrum. Some players may naturally fit between clusters rather than belonging cleanly to a single category.
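
One simple way to flag players who fall between clusters is to compare the distance to the nearest centroid with the distance to the second nearest centroid. The sketch below is an illustrative assumption and was not part of the original analysis. The function name assignment_margin and the centroid coordinates are hypothetical.

```python
import math

def assignment_margin(point, centroids):
    """Ratio of nearest to second nearest centroid distance (close to 1 = borderline)."""
    distances = sorted(math.dist(point, c) for c in centroids)
    return distances[0] / distances[1]

# Hypothetical standardized centroids and two hypothetical players.
centroids = [(0.0, 0.0), (4.0, 0.0)]
clear_fit = assignment_margin((1.0, 0.0), centroids)   # much closer to one centroid
borderline = assignment_margin((2.0, 0.0), centroids)  # exactly between the two
print(round(clear_fit, 3), round(borderline, 3))
```

A ratio near 0 indicates a clean assignment, while a ratio near 1 indicates a player whose cluster label should be treated with more caution.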

K-means clustering also assumes that clusters are relatively compact, roughly spherical groups centered on their centroids. If the true player relationships are more irregular or overlapping, k-means may oversimplify the underlying structure of the data.

The results are also sensitive to variable selection and scaling choices. Adding or removing variables could change the cluster structure and shift which players are grouped together. Because of this, the clusters should be interpreted as analytical frameworks rather than definitive labels.

Despite these limitations, the model still provides a useful way to organize complex player information into interpretable player archetypes that can support front office evaluation and roster construction analysis.


Front Office Implications

From a front office perspective, this type of clustering model is useful because it does not evaluate players from only one angle. Salary matters, but salary is not the same thing as value. Time on ice matters, but usage is not the same thing as impact. Possession and expected goals matter, but they also need to be interpreted with deployment context.

The most useful part of the model is finding where the variables do not line up cleanly. A lower salary player with strong CF%, FF%, or xGF% could be a possible surplus value target. A higher salary player with weaker possession and expected goal numbers could be a possible inefficient contract. A player with strong results but very favorable offensive zone deployment might require more caution before assuming the performance would hold in a harder role.
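
These three screening questions can be sketched as simple rules. Everything in the example below is hypothetical: the players, the field names, and the cutoff values are assumptions chosen only to illustrate the logic, not thresholds taken from the analysis.

```python
from dataclasses import dataclass

@dataclass
class Player:
    name: str
    salary: float        # dollars
    xgf_pct: float       # expected goal share
    oz_start_pct: float  # offensive zone start percentage

# Hypothetical cutoffs chosen only for illustration.
LOW_SALARY = 1_500_000
HIGH_SALARY = 5_000_000
STRONG_XGF = 55.0
WEAK_XGF = 47.0
FAVORABLE_OZ = 58.0

def flags(p):
    """Return the follow-up questions a player profile raises."""
    out = []
    if p.salary <= LOW_SALARY and p.xgf_pct >= STRONG_XGF:
        out.append("possible surplus value")
    if p.salary >= HIGH_SALARY and p.xgf_pct <= WEAK_XGF:
        out.append("possible inefficient contract")
    if p.xgf_pct >= STRONG_XGF and p.oz_start_pct >= FAVORABLE_OZ:
        out.append("check deployment before trusting results")
    return out

roster = [
    Player("Player A", 1_000_000, 57.0, 50.0),
    Player("Player B", 6_500_000, 45.0, 52.0),
    Player("Player C", 1_200_000, 58.0, 62.0),
]
for p in roster:
    print(p.name, flags(p))
```

The output is a list of questions rather than a verdict. A player can carry more than one flag, such as a low salary player with strong results that came under very favorable deployment.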

This is where clustering becomes useful for sports analytics. It is not supposed to give a final answer by itself. It gives a front office a more organized way to ask better questions about role, cost, and impact.



Conclusion

This assignment demonstrates how different analytical methods answer different types of sports analytics questions and why the choice of method matters as much as the results themselves.

The regression analysis focused on explaining variation in Mets ticket prices by identifying which game, scheduling, opponent, and environmental factors appeared to influence pricing decisions. More importantly, the analysis showed why econometric diagnostics matter. Testing for multicollinearity, autocorrelation, and heteroskedasticity helped determine whether the statistical relationships were actually reliable or whether the original regression assumptions may have overstated confidence in the results.

The clustering analysis approached a different problem. Instead of predicting one outcome variable, the goal was to identify meaningful player archetypes across salary, usage, possession metrics, expected goal share, and deployment context. The clustering results showed how multidimensional player profiles can be grouped into interpretable categories that may help support roster construction and player valuation decisions.

Across both sections, the most important takeaway is that analytics models should not be treated as automatic answers. Regression models can become unstable when assumptions are violated, and clustering models depend heavily on the variables selected and how similarity is defined. Strong sports analytics work requires not only building models, but also understanding their assumptions, limitations, and practical interpretation.

Ultimately, the value of sports analytics comes from turning complex data into better decision-making frameworks. Whether analyzing ticket pricing or player evaluation, the goal is not simply to generate output, but to produce insights that can support smarter strategic decisions.