DAT-4313 - Data Viz in Model Development

NBA

Author

Misha Cox

DATA

Partition 60% Train / 40% Test

Training set number of observations: 319
Test dataset number of observations: 211
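
A minimal sketch of how a 60/40 partition with these counts could be drawn. It is written in Python with NumPy as an illustration only: the report does not show how the split was made, and the seed and the plain random shuffle are assumptions.

```python
import numpy as np

# Hypothetical setup: 530 row indices stand in for the full player table
# (319 training + 211 test observations, as reported above).
n_total = 319 + 211
rng = np.random.default_rng(seed=4313)  # arbitrary seed, only for reproducibility

shuffled = rng.permutation(n_total)
n_train = 319                            # taken from the report, about 60% of 530
train_idx = np.sort(shuffled[:n_train])
test_idx = np.sort(shuffled[n_train:])

print(len(train_idx), len(test_idx))         # sizes of the two partitions
print(round(len(train_idx) / n_total, 3))    # share of rows used for training
```

The two index sets are disjoint by construction, so no player appears in both the training and the test data.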

EDA

Descriptive Statistics

        vars   n    mean     sd min    max  range    se
PLAYER     1 319     NaN     NA Inf   -Inf   -Inf    NA
FORWARD    2 319     NaN     NA Inf   -Inf   -Inf    NA
CENTER     3 319     NaN     NA Inf   -Inf   -Inf    NA
GUARD      4 319     NaN     NA Inf   -Inf   -Inf    NA
ROOKIE     5 319     NaN     NA Inf   -Inf   -Inf    NA
TEAM       6 319     NaN     NA Inf   -Inf   -Inf    NA
AGE        7 319   26.16   4.27  19   39.0   20.0  0.24
GP         8 319   49.33  25.76   1   82.0   81.0  1.44
W          9 319   25.46  16.08   0   60.0   60.0  0.90
MIN       10 319 1098.37 814.22   1 2848.0 2847.0 45.59
PTS       11 319   19.39   6.42   0   42.8   42.8  0.36
FGM       12 319    7.27   2.48   0   17.1   17.1  0.14
FGA       13 319   16.54   4.93   0   52.7   52.7  0.28
FTM       14 319    2.82   1.74   0    9.1    9.1  0.10
FTA       15 319    3.85   2.36   0   18.2   18.2  0.13
REB       16 319    8.81   4.94   0   52.7   52.7  0.28
AST       17 319    4.24   2.84   0   17.1   17.1  0.16
TOV       18 319    2.43   1.19   0    7.2    7.2  0.07

The table shows descriptive statistics for every variable in the training set. The variables from PLAYER through TEAM show NaN, NA, or Inf instead of statistics because they are categorical, not numeric.
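
The `se` column can be checked against the `sd` and `n` columns, since the standard error of the mean is sd / sqrt(n). The sketch below (Python, illustration only) copies a few rows from the table and recomputes the value. Small differences are expected because the table's `sd` values are already rounded.

```python
import math

n = 319  # training observations
# (variable, sd, reported se) copied from the descriptive statistics table
rows = [("AGE", 4.27, 0.24), ("GP", 25.76, 1.44), ("W", 16.08, 0.90),
        ("MIN", 814.22, 45.59), ("PTS", 6.42, 0.36)]

def standard_error(sd, n):
    """Standard error of the mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)

for name, sd, reported in rows:
    se = standard_error(sd, n)
    print(f"{name}: computed se = {se:.2f}, reported se = {reported:.2f}")
```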

Boxplots – All Numeric

There appear to be some outliers in turnovers (TOV), free throws attempted (FTA), and free throws made (FTM). For games played (GP), there are more observations at the lower end, yet the median is relatively high, so there may be a cluster of players with a large number of games played that pulls the median up. Age shows a wide range and a lot of variation. Its median is low, so there must be a fair number of young players.

Comparing centers with non-centers on points, the non-centers show more variation and some outliers, while the median is higher for centers. For minutes played the medians are very similar, although centers may play somewhat more minutes.
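
Points drawn beyond the whiskers of a boxplot are usually those more than 1.5 times the interquartile range away from the quartiles. The sketch below applies that rule in Python. The input values are made up for illustration and are not taken from the NBA data, and quartile conventions differ slightly between tools, so the flagged points may not match a given plot exactly.

```python
import numpy as np

def tukey_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return x[(x < lower) | (x > upper)]

# Made-up free-throw-attempt values, NOT the real data
fta_example = [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 18.2]
print(tukey_outliers(fta_example))
```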

Histograms

Field goals attempted and points look fairly symmetrical. It looks most common for NBA players to score around 10-20 points. Minutes played and wins look a little more bumpy. There also appear to be a fair number of players who have not played many minutes, maybe because they are younger. The rest of the charts look skewed to the right.

Scatterplots (depvar ~ all x)

Look at the shape of the relationship between the dependent variable and all of the continuous potential independent variables.

There appears to be a relationship between the number of wins and the number of games played: the more games a player plays, the more wins the player has. It is interesting to note the four clusters at the bottom. The cluster at the far right shows a lot of wins and a medium to low number of games played. This cluster is probably made up of very good basketball players. Wins might be a good variable to put into the model. Minutes played (MIN) also appears to have a linear relationship: as minutes played go up, games played go up. The other graphs do not show any clear relationship between the independent variables and the dependent variable.

Correlation Matrix

This heat map shows visually whether there is a relationship between the dependent and independent variables. Looking at the chart, the remaining variables do not appear to be strongly correlated with the dependent variable. The other variables I would choose to use in the model are field goals made (FGM), free throws made (FTM), and points (PTS). I chose these variables because their correlations were the next closest to 1. All three could also help explain the number of games a player plays. The better a player is at making shots and earning points, the more games they will probably play, because they are good at basketball and their coaches will want them to play.
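
Choosing predictors by their correlation with the dependent variable can be sketched as follows. This is a Python illustration on synthetic data: the three columns are generated stand-ins, not the real NBA values, so the numbers it prints say nothing about the actual data set.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 319
# Synthetic stand-ins for three columns; NOT the real NBA values
minutes = rng.uniform(1, 2848, n)
gp = 5 + 0.025 * minutes + rng.normal(0, 8, n)   # built to depend on minutes
tov = rng.normal(2.4, 1.2, n)                    # unrelated noise

data = {"GP": gp, "MIN": minutes, "TOV": tov}
names = list(data)
corr = np.corrcoef(np.vstack([data[k] for k in names]))  # rows are variables

# Rank candidate predictors by absolute correlation with GP
gp_row = corr[names.index("GP")]
ranking = sorted(((abs(gp_row[i]), names[i]) for i in range(len(names))
                  if names[i] != "GP"), reverse=True)
for r, name in ranking:
    print(f"{name}: |r| = {r:.2f}")
```

Ranking by absolute value matters because a strong negative correlation is as informative as a strong positive one.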

MODEL

Linear Regression Model

From the above EDA, we chose the following variables to start.

Estimate the following model:
\(GP = CENTER + FTM + W + MIN + PTS\)

Estimate Coefficients and show coefficients table

	Estimate	Standard Error	t value	Pr(>|t|)
(Intercept)	14.817	1.696	8.737	0.0000	***
CENTERYes	2.963	1.315	2.254	0.0249	*
FTM	0.609	0.373	1.631	0.1039
W	0.749	0.049	15.256	0.0000	***
MIN	0.017	0.001	16.533	0.0000	***
PTS	-0.291	0.103	-2.827	0.0050	**
Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05

Residual standard error: 8.859 on 313 degrees of freedom
Multiple R-squared: 0.8835, Adjusted R-squared: 0.8817
F-statistic: 475 on 5 and 313 DF, p-value: 0.0000

I added a binary variable (CENTER) to see what would happen. At the .05 significance level, PTS, MIN, W and CENTER are statistically significant, while FTM is not (p = 0.1039). The statistically significant variables all have a positive relationship with games played except for points. This is interesting because I would have thought that as a player scores more points they would get to play more games. The more wins and minutes a player has, the more games they play. This makes sense: if a player is winning more games, they will keep being selected to play. Multiple R-squared: the model explains 88.35% of the variability in games played.
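
The summary statistics are tied together by standard formulas, which gives a quick consistency check. With n = 319 observations and p = 5 predictors, the residual degrees of freedom are n - p - 1, and both the adjusted R-squared and the F-statistic follow from the multiple R-squared. The Python sketch below recomputes them from the reported values. Results can differ in the last digit because the reported R-squared is itself rounded.

```python
n = 319          # training observations
p = 5            # predictors: CENTER, FTM, W, MIN, PTS
r2 = 0.8835      # multiple R-squared from the summary above

df_resid = n - p - 1                                 # residual degrees of freedom
adj_r2 = 1 - (1 - r2) * (n - 1) / df_resid           # penalizes extra predictors
f_stat = (r2 / p) / ((1 - r2) / df_resid)            # overall F-statistic

print(df_resid)             # 313
print(round(adj_r2, 4))
print(round(f_stat, 1))
```

The F-statistic therefore has 5 numerator and 313 denominator degrees of freedom.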

Coefficient Magnitude Plot

Looking at the coefficient plot, playing center has a large effect compared to the other variables. The points variable has a negative effect on the dependent variable. Minutes played appears to have very little impact on the predicted outcome, but its coefficient is small largely because MIN is measured in total minutes, a much larger scale than the other predictors.

Check for predictor independence

Using Variance Inflation Factors (VIF)

  CENTER      FTM        W      MIN      PTS 
1.059011 1.702216 2.527366 2.882990 1.770248

All of the VIF values are under 10, so multicollinearity among the variables in the regression model should not be a problem.
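
For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. By that formula, the largest value above (2.88 for MIN) corresponds to an R² of about 0.65. The sketch below is a plain NumPy implementation on synthetic columns. It is an illustration of the formula, not the routine that produced the values above.

```python
import numpy as np

def vif(X):
    """VIF per column of X: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])   # add an intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1 / (1 - r2))
    return np.array(out)

# Synthetic columns, NOT the NBA data
rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = 0.8 * a + rng.normal(scale=0.6, size=300)   # correlated with a
c = rng.normal(size=300)                        # independent of both
v = vif(np.column_stack([a, b, c]))
print(v.round(2))
```

The two correlated columns get clearly inflated values, while the independent column stays close to 1.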

Residual Analysis

Residual Range

     0%     25%     50%     75%    100% 
-19.210  -6.675  -0.520   6.010  23.540

The residuals range from about -19 to 24 games, with the middle half falling between roughly -7 and 6. That is a wide spread for a dependent variable that runs from 1 to 82 games, so individual predictions can be off by a large margin. There may be something wrong with the model specification, and the model may need to be fixed.

Residual Plots

We are looking for:
- Random distribution of residuals vs fitted values
- Normally distributed residuals: Normal Q-Q plot with values along the line
- Homoskedasticity, with a Scale-Location line that is horizontal and no residual pattern
- Minimal influential obs, that is, those outside the borders of Cook’s distance

Residuals vs. Fitted: There is a quadratic-looking pattern among the residuals, so they are not randomly distributed around zero. This suggests the linear specification is missing some curvature and the model needs to be fixed.

Q-Q plot: The points appear to follow the line, so the residuals look approximately normal.

Scale-Location: The residuals do not appear to be homoskedastic.

Residuals vs. Leverage: All of the observations appear to be within Cook’s distance. This is good because it means there are no influential observations that are affecting the model and potentially distorting the results.
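
Cook's distance combines the size of each residual with the leverage of the observation. The sketch below computes it from the hat matrix with NumPy. The data are synthetic, with one deliberately planted influential point, and the cutoff of 1 is only a common rule of thumb, not a value taken from this report.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for an OLS fit of y on X (an intercept is added here)."""
    A = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    n, k = A.shape                               # k counts the intercept too
    H = A @ np.linalg.pinv(A.T @ A) @ A.T        # hat matrix
    h = np.diag(H)                               # leverage of each observation
    resid = y - H @ y
    s2 = resid @ resid / (n - k)                 # residual variance estimate
    return resid**2 / (k * s2) * h / (1 - h) ** 2

# Synthetic data, NOT the NBA data
rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 2 + 3 * x + rng.normal(scale=0.5, size=50)
x[0], y[0] = 6.0, -10.0      # plant one high-leverage, badly fitted point
d = cooks_distance(x.reshape(-1, 1), y)
print(int(np.argmax(d)), bool(d.max() > 1))
```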

Plot Fitted Value by Actual Value

The fitted values appear to track the actual values fairly closely. This means that the model is pretty good at capturing the overall relationship.

Plot Residuals by Fitted Values

There does appear to be a quadratic-shaped pattern, which is not good. The model needs to be fixed because it is not fully explaining the dependent variable.

Performance Evaluation

Use Model to Score `test` dataset (Display First 10 values - depvar and fitted values only)

PLAYER	GP	fit	lwr	upr
Alan Williams	5	11.4	-6.4	29.3
Alec Burks	64	49.3	31.8	66.8
Alex Poythress	21	22.1	4.6	39.6
Alize Johnson	14	21.7	4.1	39.3
Allen Crabbe	43	44.4	26.9	61.9
Allonzo Trier	64	46.5	28.8	64.1
Amile Jefferson	12	20.2	2.6	37.8
Amir Johnson	51	47.2	29.6	64.9
Andrew Harrison	17	21.1	3.5	38.7
Andrew Wiggins	73	76.7	59.1	94.3
n: 10

Looking at this table, the model seems to do a pretty good job of predicting games played from the independent variables. This is especially true for Alex Poythress (21 actual vs 22.1 fitted) and Allen Crabbe (43 actual vs 44.4 fitted).
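
The `lwr` and `upr` columns also show how much uncertainty surrounds each fitted value. The Python sketch below copies the ten rows from the table, computes the half-width of each interval, and checks whether the actual GP falls inside. The comparison with 1.96 times the residual standard error is only a rough benchmark for a 95% prediction interval. The interval type and level are assumptions, since the report does not state them.

```python
# (player, actual GP, fit, lwr, upr) copied from the table above
rows = [
    ("Alan Williams", 5, 11.4, -6.4, 29.3),
    ("Alec Burks", 64, 49.3, 31.8, 66.8),
    ("Alex Poythress", 21, 22.1, 4.6, 39.6),
    ("Alize Johnson", 14, 21.7, 4.1, 39.3),
    ("Allen Crabbe", 43, 44.4, 26.9, 61.9),
    ("Allonzo Trier", 64, 46.5, 28.8, 64.1),
    ("Amile Jefferson", 12, 20.2, 2.6, 37.8),
    ("Amir Johnson", 51, 47.2, 29.6, 64.9),
    ("Andrew Harrison", 17, 21.1, 3.5, 38.7),
    ("Andrew Wiggins", 73, 76.7, 59.1, 94.3),
]
resid_se = 8.859   # residual standard error reported for the training fit

half_widths = []
covered = []
for player, gp, fit, lwr, upr in rows:
    half_widths.append((upr - lwr) / 2)
    covered.append(lwr <= gp <= upr)
    print(f"{player}: half-width = {half_widths[-1]:.2f}, actual inside: {covered[-1]}")

print(f"rough benchmark 1.96 * s = {1.96 * resid_se:.2f}")
```

Every interval spans roughly 35 games, so a fitted value that looks close to the actual one still carries considerable uncertainty.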

Plot Actual vs Fitted (`test`)

There appears to be a positive linear relationship between the fitted and the actual values in the test set.

Performance Metrics

Metric	Value
MAE	7.9245
RMSE	9.6824
MAPE	0.8488

MAE: On average, the model’s predictions are off from the actual values by 7.92 games. RMSE: The difference between MAE and RMSE is that RMSE squares the errors before averaging, so it reflects the spread of the prediction errors and weights large errors more heavily. By this measure the predictions are off by 9.68 games. MAPE: This gives the average absolute error as a percentage of the actual value. In this case, the predictions are on average 84.88% off from the actual observations. This is very high and suggests that the model does not do a good job of predicting the actual observations, although MAPE is inflated by players with very few games played, where even a small miss is a large percentage of the actual value.
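
The three metrics can be written out directly. The Python sketch below applies them to the ten test rows displayed earlier in this section. It will not reproduce the values in the table, which were computed on all 211 test observations, but it shows how each metric is built and how a single low-GP player can dominate MAPE.

```python
import math

# (actual GP, fitted GP) for the ten test rows displayed earlier in this section
pairs = [(5, 11.4), (64, 49.3), (21, 22.1), (14, 21.7), (43, 44.4),
         (64, 46.5), (12, 20.2), (51, 47.2), (17, 21.1), (73, 76.7)]

def mae(pairs):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in pairs) / len(pairs)

def rmse(pairs):
    """Root mean squared error; large misses count more than in MAE."""
    return math.sqrt(sum((a - f) ** 2 for a, f in pairs) / len(pairs))

def mape(pairs):
    """Mean absolute percentage error (undefined if any actual value is 0)."""
    return sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)

print(f"MAE  = {mae(pairs):.2f}")
print(f"RMSE = {rmse(pairs):.2f}")
print(f"MAPE = {mape(pairs):.2%}")

# Largest single percentage error: the player with only 5 games played
worst = max(pairs, key=lambda p: abs(p[0] - p[1]) / p[0])
print(worst)
```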

Model Fit by Players

It looks like the Boston team has a higher confidence level compared to the Atlanta team.

SUMMARY ASSESSMENT AND EVALUATION OF THE MODEL

Overall, the model is alright at predicting games played. The variables that were statistically significant made sense, except that, interestingly, the points a player scores decrease the predicted games played. However, there were some issues with the residuals and the performance of the model. It does not fully predict the dependent variable, and there are things that need to be addressed to make this a more usable model.

There is great value in using visuals when creating a model. Visuals helped me understand the variables and determine which ones to put in the model. I am more of a visual learner, and it helps to see the graphs and charts to determine what to do. It is also helpful to have visuals of the residuals and the actual versus predicted plots. It helps to see them in a chart rather than look at a bunch of numbers and feel overwhelmed.