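| | vars | n | mean | sd | min | max | range | se |
|---|---|---|---|---|---|---|---|---|
| PLAYER | 1 | 319 | NaN | NA | Inf | -Inf | -Inf | NA |
| FORWARD | 2 | 319 | NaN | NA | Inf | -Inf | -Inf | NA |
| CENTER | 3 | 319 | NaN | NA | Inf | -Inf | -Inf | NA |
| GUARD | 4 | 319 | NaN | NA | Inf | -Inf | -Inf | NA |
| ROOKIE | 5 | 319 | NaN | NA | Inf | -Inf | -Inf | NA |
| TEAM | 6 | 319 | NaN | NA | Inf | -Inf | -Inf | NA |
| AGE | 7 | 319 | 26.16 | 4.27 | 19 | 39.0 | 20.0 | 0.24 |
| GP | 8 | 319 | 49.33 | 25.76 | 1 | 82.0 | 81.0 | 1.44 |
| W | 9 | 319 | 25.46 | 16.08 | 0 | 60.0 | 60.0 | 0.90 |
| MIN | 10 | 319 | 1098.37 | 814.22 | 1 | 2848.0 | 2847.0 | 45.59 |
| PTS | 11 | 319 | 19.39 | 6.42 | 0 | 42.8 | 42.8 | 0.36 |
| FGM | 12 | 319 | 7.27 | 2.48 | 0 | 17.1 | 17.1 | 0.14 |
| FGA | 13 | 319 | 16.54 | 4.93 | 0 | 52.7 | 52.7 | 0.28 |
| FTM | 14 | 319 | 2.82 | 1.74 | 0 | 9.1 | 9.1 | 0.10 |
| FTA | 15 | 319 | 3.85 | 2.36 | 0 | 18.2 | 18.2 | 0.13 |
| REB | 16 | 319 | 8.81 | 4.94 | 0 | 52.7 | 52.7 | 0.28 |
| AST | 17 | 319 | 4.24 | 2.84 | 0 | 17.1 | 17.1 | 0.16 |
| TOV | 18 | 319 | 2.43 | 1.19 | 0 | 7.2 | 7.2 | 0.07 |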
DAT-4313 - Data Viz in Model Development
NBA
DATA
Partition 60% Train / 40% Test
Training set number of observations: 319
Test dataset number of observations: 211
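The two counts above imply 530 observations in total. The sketch below checks the realized split and draws a reproducible 60/40 random partition. The helper `split_indices` and the seed are hypothetical and are not the report's own partitioning code. A plain random cut gives 318 training rows rather than 319, so the report may have used a different method, such as a stratified partition.

```python
import random

n_train, n_test = 319, 211          # counts reported above
n_total = n_train + n_test
train_share = n_train / n_total     # realized training share

def split_indices(n, train_frac=0.6, seed=42):
    """Hypothetical helper: shuffle row indices and cut at train_frac."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = round(n * train_frac)
    return sorted(idx[:cut]), sorted(idx[cut:])

train_idx, test_idx = split_indices(n_total)
print(f"total={n_total}, realized train share={train_share:.4f}")
print(f"simple random split: {len(train_idx)} train / {len(test_idx)} test")
```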
EDA
Descriptive Statistics
The table shows the descriptive statistics for all of the variables in the data set. However, for the variables PLAYER through TEAM there are no meaningful statistics (NaN, NA, and Inf appear instead) because they are categorical.
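The se column in the descriptive statistics is the standard error of the mean, sd divided by the square root of n. A minimal sketch that recomputes it for a few rows follows. The values are copied from the table, and the function name `standard_error` is only illustrative.

```python
import math

# (n, sd, reported se) copied from the descriptive statistics table
rows = {
    "AGE": (319, 4.27, 0.24),
    "GP": (319, 25.76, 1.44),
    "MIN": (319, 814.22, 45.59),
    "PTS": (319, 6.42, 0.36),
}

def standard_error(sd, n):
    """Standard error of the mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)

for name, (n, sd, reported) in rows.items():
    se = standard_error(sd, n)
    print(f"{name}: computed se={se:.2f}, reported se={reported}")
```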
Boxplots – All Numeric
It looks like there are some outliers in turnovers (TOV), free throws attempted (FTA), and free throws made (FTM). For games played (GP), it looks like there are more observations on the lower end, but the median is higher, so there may be a cluster of observations with a high number of games played that pulls the median up. Age shows a wide range and a lot of variation. The median is low, so there must be a decent number of young players.
Looking at points by whether the player is a center, there seems to be more variation and some outliers among players who are not centers. The median is higher for centers. For minutes played, the medians are very similar, but centers may play somewhat more minutes.
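Boxplots usually flag a point as an outlier when it lies more than 1.5 times the interquartile range beyond the quartiles. A minimal sketch of that rule follows. The input values are hypothetical and not taken from the NBA data set, and plotting libraries may compute the quartiles slightly differently.

```python
from statistics import quantiles

def iqr_outliers(values):
    """Return the values outside the 1.5 * IQR boxplot fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]

# Hypothetical free-throw-attempt values, not taken from the NBA data set.
toy_fta = [0, 1, 2, 2, 3, 3, 4, 18]
print(iqr_outliers(toy_fta))
```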
Histograms
The field goals attempted and points look fairly symmetrical. It looks like it is most common for NBA players to score around 10 to 20 points. Minutes played and wins look a little more bumpy. However, it does look like there are a decent number of players who have not played as many minutes, maybe because they are younger. The rest of the charts look skewed to the right.
Scatterplots (depvar ~ all x)
Look at the shape of the relationship between the dependent variable and all of the continuous potential independent variables.
It looks like there might be a relationship between the number of wins and the number of games played. The more games that are played, the more wins a player has. It is interesting to note the four rough clusters at the bottom. The cluster at the far right shows a lot of games played and a medium to low number of games played. This cluster is probably the very good basketball players. Wins might be a good variable to put into the model. Minutes played (MIN) also looks like it has a linear relationship: as the minutes played go up, the games played go up. From looking at the other graphs, there doesn't seem to be any relationship between the other independent variables and the dependent variable.
Correlation Matrix
This heat map visually shows whether there is a relationship between the dependent and independent variables. Looking at the chart, it does not look like there is a very good correlation with the other variables. The other variables I would choose to use in the model would be field goals made (FGM), free throws made (FTM), and points (PTS). I chose these variables because their correlations with games played were the next closest to 1. Also, all three of these variables could help explain the number of games a player plays. The better a player is at making shots and earning points, the more games they will probably play, because they are good at basketball and their coaches will want them to play.
MODEL
Linear Regression Model
From the above EDA, we chose the following variables to start.
Estimate the following model:
\(GP = CENTER + FTM + W + MIN + PTS\)
Estimate Coefficients and show coefficients table
| | Estimate | Standard Error | t value | Pr(>\|t\|) | |
|---|---|---|---|---|---|
| (Intercept) | 14.817 | 1.696 | 8.737 | 0.0000 | *** |
| CENTERYes | 2.963 | 1.315 | 2.254 | 0.0249 | * |
| FTM | 0.609 | 0.373 | 1.631 | 0.1039 | |
| W | 0.749 | 0.049 | 15.256 | 0.0000 | *** |
| MIN | 0.017 | 0.001 | 16.533 | 0.0000 | *** |
| PTS | -0.291 | 0.103 | -2.827 | 0.0050 | ** |

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05
Residual standard error: 8.859 on 313 degrees of freedom
Multiple R-squared: 0.8835, Adjusted R-squared: 0.8817
F-statistic: 475 on 5 and 313 DF, p-value: 0.0000
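The summary lines under the coefficients table are tied together by the sample size and the number of predictors. Assuming the usual ordinary least squares formulas apply, the sketch below recomputes the residual degrees of freedom, the adjusted R-squared, and the F-statistic from the reported R-squared. The numerator degrees of freedom are the 5 predictors and the denominator is the 313 residual degrees of freedom. Small differences from the printed values come from rounding R-squared to four decimals.

```python
n_obs = 319          # training observations
k = 5                # predictors: CENTER, FTM, W, MIN, PTS
r2 = 0.8835          # Multiple R-squared as reported

df_model = k
df_resid = n_obs - k - 1

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 = 1 - (1 - r2) * (n_obs - 1) / df_resid
# Overall F-statistic: explained variance per model df over residual variance per residual df.
f_stat = (r2 / df_model) / ((1 - r2) / df_resid)

print(f"residual df={df_resid}")
print(f"adjusted R-squared={adj_r2:.4f}")
print(f"F={f_stat:.1f} on {df_model} and {df_resid} DF")
```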
I added a binary variable (CENTER) to see what would happen. At the .05 significance level, PTS, MIN, W, and CENTER are statistically significant; FTM is not (p = 0.1039). The statistically significant variables have a positive relationship with games played, except for points. This is interesting because I would have thought that as a player gets more points they would get to play more games. The more wins and minutes a player has, the more games they play. This makes sense: if a player is winning more games, they will be selected to keep playing. Also, the more minutes a basketball player plays, the more games they play. Multiple R-squared: the model explains 88.35% of the variability in games played.
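The estimates above are in the units of each predictor, so their raw sizes are not directly comparable. One rough way to put them on a common footing is to multiply each estimate by the predictor's training-set standard deviation from the descriptive statistics. The sketch below does that under the assumption that the rounded table values are accurate enough. MIN in particular is rounded to 0.017, so its result is approximate.

```python
# Estimates from the coefficients table and training-set standard deviations
# from the descriptive statistics table.
coef = {"FTM": 0.609, "W": 0.749, "MIN": 0.017, "PTS": -0.291}
sd = {"FTM": 1.74, "W": 16.08, "MIN": 814.22, "PTS": 6.42}

# Change in predicted GP for a one-standard-deviation increase in each predictor.
per_sd = {name: coef[name] * sd[name] for name in coef}

for name, effect in sorted(per_sd.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {effect:+.2f} games per 1 sd")

# CENTER is a 0/1 indicator, so its raw coefficient is already the full shift.
print("CENTERYes: +2.96 games (center vs. not center)")
```

On this scale, minutes played and wins move the prediction the most, even though the raw MIN coefficient is the smallest in the table.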
Coefficient Magnitude Plot
Looking at the coefficient plot, being a center has a big impact compared to the other variables. The points variable has a negative effect on the dependent variable. Furthermore, minutes played appears to have very little impact on the predicted outcome, although its coefficient is small partly because MIN is measured in single minutes and ranges from 1 to 2,848.
Check for predictor independence
Using Variance Inflation Factors (VIF)
CENTER FTM W MIN PTS
1.059011 1.702216 2.527366 2.882990 1.770248
All of the VIF values are under 10, so there shouldn't be a problem with multicollinearity among the variables in the regression model.
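A VIF can be read as 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. A minimal sketch that converts the reported VIFs into the implied R² and the corresponding inflation of the coefficient's standard error follows. The function name `implied_r2` is only illustrative.

```python
vif = {
    "CENTER": 1.059011, "FTM": 1.702216, "W": 2.527366,
    "MIN": 2.882990, "PTS": 1.770248,
}

def implied_r2(v):
    """R-squared from regressing one predictor on the others: VIF = 1 / (1 - R^2)."""
    return 1 - 1 / v

for name, v in vif.items():
    print(f"{name}: VIF={v:.2f}, implied R^2={implied_r2(v):.3f}, "
          f"SE inflation={v ** 0.5:.2f}x")
```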
Residual Analysis
Residual Range
0% 25% 50% 75% 100%
-19.210 -6.675 -0.520 6.010 23.540
The residuals range from -19.21 to 23.54 games, and the middle half fall between about -6.7 and 6.0. Because the residuals vary this widely, there may be something wrong with the model: its predictions of games played can miss by a large margin, so the model may need to be revised.
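One way to judge whether this spread is unusual is to compare the residual quantiles with what normally distributed errors would give for the reported residual standard error of 8.859. The sketch below is a rough check under that normality assumption and is not a formal test.

```python
from statistics import NormalDist

q0, q25, q50, q75, q100 = -19.210, -6.675, -0.520, 6.010, 23.540
resid_se = 8.859     # residual standard error from the model summary

iqr = q75 - q25
# For normally distributed residuals the IQR is about 1.349 standard deviations.
normal_iqr = (NormalDist().inv_cdf(0.75) - NormalDist().inv_cdf(0.25)) * resid_se

print(f"observed IQR={iqr:.3f}, normal-theory IQR={normal_iqr:.3f}")
print(f"largest residuals in SE units: {q0 / resid_se:.2f}, {q100 / resid_se:.2f}")
```

The most extreme residuals sit between two and three standard errors from zero, so the residual plots give more information about model problems than the range does by itself.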
Residual Plots
We are looking for:
- Random distribution of residuals vs fitted values
- Normally distributed residuals: Normal Q-Q plot with values along the line
- Homoskedasticity, with a Scale-Location line that is horizontal and no residual pattern
- Minimal influential observations, that is, those outside the borders of Cook’s distance
Residual vs. Fitted: There is a quadratic-looking pattern among the residuals. Therefore, they are not randomly distributed around zero, which suggests the model is missing a nonlinear relationship and needs to be revised.
Q-Q plot: The points appear to follow the line, so the residuals look approximately normal.
Scale-Location: The residuals do not appear to be homoskedastic.
Residuals vs. Leverage: There appear to be minimal influential observations; I think all of them are within Cook’s distance. This is good because it means there are no influential observations affecting the model and potentially distorting the results.
Plot Fitted Value by Actual Value
It looks like the fitted values follow the actual values closely. This means that the model is pretty good at capturing the overall relationship.
Plot Residuals by Fitted Values
There does appear to be a quadratic-shaped pattern, which is not good. The model needs to be revised because it is not fully explaining the dependent variable.
Performance Evaluation
Use Model to Score test dataset (Display First 10 values - depvar and fitted values only)
| PLAYER | GP | fit | lwr | upr |
|---|---|---|---|---|
| Alan Williams | 5 | 11.4 | -6.4 | 29.3 |
| Alec Burks | 64 | 49.3 | 31.8 | 66.8 |
| Alex Poythress | 21 | 22.1 | 4.6 | 39.6 |
| Alize Johnson | 14 | 21.7 | 4.1 | 39.3 |
| Allen Crabbe | 43 | 44.4 | 26.9 | 61.9 |
| Allonzo Trier | 64 | 46.5 | 28.8 | 64.1 |
| Amile Jefferson | 12 | 20.2 | 2.6 | 37.8 |
| Amir Johnson | 51 | 47.2 | 29.6 | 64.9 |
| Andrew Harrison | 17 | 21.1 | 3.5 | 38.7 |
| Andrew Wiggins | 73 | 76.7 | 59.1 | 94.3 |

n: 10
Looking at this table, it seems that the model did a pretty good job of predicting games played from the independent variables. This is especially true for Alex Poythress (21 actual vs. 22.1 fitted) and Allen Crabbe (43 actual vs. 44.4 fitted).
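The lwr and upr columns give an interval around each fitted value, so another quick check is whether each player's actual games played falls inside it. The sketch below uses only the ten rows displayed in the table. The variable names are illustrative, and the interval type and level are not stated in the report.

```python
# (player, actual GP, lower bound, upper bound) from the table above
rows = [
    ("Alan Williams", 5, -6.4, 29.3),
    ("Alec Burks", 64, 31.8, 66.8),
    ("Alex Poythress", 21, 4.6, 39.6),
    ("Alize Johnson", 14, 4.1, 39.3),
    ("Allen Crabbe", 43, 26.9, 61.9),
    ("Allonzo Trier", 64, 28.8, 64.1),
    ("Amile Jefferson", 12, 2.6, 37.8),
    ("Amir Johnson", 51, 29.6, 64.9),
    ("Andrew Harrison", 17, 3.5, 38.7),
    ("Andrew Wiggins", 73, 59.1, 94.3),
]

covered = [name for name, gp, lwr, upr in rows if lwr <= gp <= upr]
widths = [upr - lwr for _, _, lwr, upr in rows]
negative_lower = [name for name, _, lwr, _ in rows if lwr < 0]

print(f"{len(covered)} of {len(rows)} actual values fall inside their interval")
print(f"average interval width: {sum(widths) / len(widths):.1f} games")
print(f"intervals with a negative lower bound: {negative_lower}")
```

All ten intervals contain the actual value, but each is about 35 games wide, and one lower bound is negative even though games played cannot be.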
Plot Actual vs Fitted (test)
It looks like there is a positive linear relationship between the fitted vs the actual plotted observations.
Performance Metrics
| Metric | Value |
|---|---|
| MAE | 7.9245 |
| RMSE | 9.6824 |
| MAPE | 0.8488 |
MAE: On average, the model’s predictions are off by about 7.92 games.
RMSE: Unlike MAE, RMSE squares the errors before averaging, so it reflects the spread of the prediction errors and weights large errors more heavily. By this measure the predictions are off by about 9.68 games.
MAPE: This is the average absolute error expressed as a share of the actual value. In this case, the predictions are on average 84.88% off from the actual observations. This is very high and tells us that the model does not do a good job of predicting the actual observations in relative terms. Percentage errors are largest for players with few games played: Alan Williams played 5 games against a prediction of 11.4, a 128% error from a miss of only 6.4 games.
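A minimal sketch of how the three metrics are computed follows. It uses only the ten actual and fitted values displayed in the scored table, so the results are illustrative and will not match the full test-set metrics. The function names are assumptions, not the report's own code.

```python
import math

# Actual GP and fitted values for the ten displayed test-set players.
actual = [5, 64, 21, 14, 43, 64, 12, 51, 17, 73]
fitted = [11.4, 49.3, 22.1, 21.7, 44.4, 46.5, 20.2, 47.2, 21.1, 76.7]

def mae(y, yhat):
    """Mean absolute error, in the units of y."""
    return sum(abs(a - f) for a, f in zip(y, yhat)) / len(y)

def rmse(y, yhat):
    """Root mean squared error; squaring weights large misses more heavily."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(y, yhat)) / len(y))

def mape(y, yhat):
    """Mean absolute percentage error as a fraction; undefined when an actual value is 0."""
    return sum(abs(a - f) / abs(a) for a, f in zip(y, yhat)) / len(y)

print(f"MAE={mae(actual, fitted):.2f}")
print(f"RMSE={rmse(actual, fitted):.2f}")
print(f"MAPE={mape(actual, fitted):.2%}")
```

Because MAPE divides by the actual value, a player with a single game played can contribute a very large percentage error even when the miss in games is modest.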
Model Fit by Players
It looks like the Boston team has a higher confidence level compared to the Atlanta team.
SUMMARY ASSESSMENT AND EVALUATION OF THE MODEL
Overall, the model is alright at predicting games played. The variables that were statistically significant made sense, except that, interestingly, the points a player scores decrease the predicted games played. However, there were some issues with the residuals and the performance of the model. It does not fully predict the dependent variable, and there are things that need to be addressed to make this a more usable model. There is great value in using visuals when creating a model. Visuals helped me understand the variables and determine which ones to put in the model. I am more of a visual learner, and it helps to see the graphs and charts to determine what to do. Furthermore, it is helpful to have visuals of the residuals and the actual versus predicted plots. It helps to see them in a chart rather than see a bunch of numbers and feel overwhelmed.