options(scipen=999)
library(socviz) # use for the round_df() function
library(psych)
load("NBA_viz.RData")Learning Module 6
NBA Dataset
FURTHER INSTRUCTIONS
Use the dataset assigned to you in the LAB Instructions.
Use the depvar indicated for your dataset.
Select 5-8 independent variables from your dataset as predictors, and estimate a linear regression model.
Note that not all variables in your dataset make sense as predictors. For example, categorical variables with many levels or all components of a derived variable are unreasonable.
Work through the EDA to identify good candidates. Your model should be linear in the parameters. Consider alternative functional forms for your depvar or continuous variables.
Include interpretations and comments after each chunk.
Conclude with a final assessment of the model and a reflection of the value of data visualization to aid model development.
Example Code for NBA with depvar=PTS
DATA
Selecting the variables
describe(NBA_viz)
         vars   n    mean     sd  median trimmed     mad min    max  range
PLAYER* 1 530 265.50 153.14 265.50 265.50 196.44 1 530.0 529.0
FORWARD* 2 530 1.49 0.50 1.00 1.49 0.00 1 2.0 1.0
CENTER* 3 530 1.19 0.39 1.00 1.11 0.00 1 2.0 1.0
GUARD* 4 530 1.49 0.50 1.00 1.49 0.00 1 2.0 1.0
ROOKIE* 5 530 1.20 0.40 1.00 1.12 0.00 1 2.0 1.0
TEAM* 6 530 15.41 8.60 15.00 15.38 10.38 1 30.0 29.0
AGE 7 530 26.14 4.18 25.00 25.85 4.45 19 42.0 23.0
GP 8 530 49.25 26.05 56.00 51.00 28.17 1 82.0 81.0
W 9 530 24.67 15.89 24.00 24.30 20.76 0 60.0 60.0
MIN 10 530 1121.63 838.67 1069.00 1078.70 1115.66 1 3028.0 3027.0
PTS 11 530 19.84 7.85 19.05 19.46 5.86 0 100.5 100.5
FGM 12 530 7.42 3.26 7.10 7.30 2.37 0 50.3 50.3
FGA 13 530 16.78 5.37 16.10 16.47 4.30 0 52.7 52.7
FTM 14 530 3.00 2.05 2.70 2.79 1.63 0 17.2 17.2
FTA 15 530 4.08 2.62 3.65 3.85 2.15 0 18.2 18.2
REB 16 530 9.06 5.14 7.90 8.52 3.85 0 52.7 52.7
AST 17 530 4.38 2.85 3.60 4.06 2.22 0 17.1 17.1
TOV 18 530 2.53 1.40 2.30 2.43 1.04 0 12.8 12.8
skew kurtosis se
PLAYER* 0.00 -1.21 6.65
FORWARD* 0.05 -2.00 0.02
CENTER* 1.59 0.52 0.02
GUARD* 0.02 -2.00 0.02
ROOKIE* 1.51 0.28 0.02
TEAM* 0.03 -1.18 0.37
AGE 0.65 0.08 0.18
GP -0.48 -1.13 1.13
W 0.16 -1.12 0.69
MIN 0.28 -1.16 36.43
PTS 2.20 20.63 0.34
FGM 4.31 55.11 0.14
FGA 1.29 6.82 0.23
FTM 1.57 5.55 0.09
FTA 1.26 3.18 0.11
REB 2.20 11.90 0.22
AST 1.09 1.16 0.12
TOV 1.82 9.15 0.06
colnames(NBA_viz)
 [1] "PLAYER" "FORWARD" "CENTER" "GUARD" "ROOKIE" "TEAM" "AGE"
[8] "GP" "W" "MIN" "PTS" "FGM" "FGA" "FTM"
[15] "FTA" "REB" "AST" "TOV"
str(NBA_viz)
'data.frame': 530 obs. of 18 variables:
$ PLAYER : Factor w/ 530 levels "Aaron Gordon",..: 1 2 3 5 4 6 7 8 9 10 ...
$ FORWARD: Factor w/ 2 levels "No","Yes": 2 1 2 2 2 2 1 1 1 1 ...
$ CENTER : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 2 1 1 1 2 ...
$ GUARD : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 2 2 2 1 ...
$ ROOKIE : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 1 ...
$ TEAM : Factor w/ 30 levels "ATL","BKN","BOS",..: 22 12 21 3 25 2 26 21 14 1 ...
$ AGE : int 23 22 25 32 28 26 27 25 25 25 ...
$ GP : int 78 50 61 68 81 5 64 31 25 77 ...
$ W : int 40 31 38 41 52 1 19 21 8 28 ...
$ MIN : int 2633 646 694 1973 2292 26 1375 588 531 1544 ...
$ PTS : num 22.7 21.9 16.7 22.5 15.9 33.8 19.6 13.5 20.7 26.5 ...
$ FGM : num 8.6 7.8 6.3 9.4 5.4 15 6.7 4.6 7 9.9 ...
$ FGA : num 19.1 19.5 14.9 17.6 12.4 24.4 16.5 12.8 15.6 20.1 ...
$ FTM : num 3.4 3 1.9 1.9 3.1 3.8 4 1 4.6 4.4 ...
$ FTA : num 4.6 3.7 2.5 2.3 3.6 7.5 4.9 1.1 5.8 6.7 ...
$ REB : num 10.5 5 8 11.1 12.8 35.7 8.2 3.9 6.1 13.2 ...
$ AST : num 5.3 6.5 1.4 6.9 2.2 5.6 4.5 1.6 7 2.7 ...
$ TOV : num 3 3 1.8 2.5 1.5 1.9 2.3 1.1 3.8 3 ...
cat ("Data Identified for linear Regression model: ")Data Identified for linear Regression model:
model <- lm(PTS ~ AGE + GP + W + MIN + REB + AST + TOV, data = NBA_viz)
summary(model)
Call:
lm(formula = PTS ~ AGE + GP + W + MIN + REB + AST + TOV, data = NBA_viz)
Residuals:
Min 1Q Median 3Q Max
-23.653 -3.750 -0.206 3.275 86.083
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.5128520 2.1371849 7.259 0.000000000001427 ***
AGE -0.0425469 0.0744910 -0.571 0.568131
GP -0.1230499 0.0332959 -3.696 0.000242 ***
W 0.0353093 0.0388266 0.909 0.363553
MIN 0.0061347 0.0008352 7.345 0.000000000000796 ***
REB 0.1094741 0.0617765 1.772 0.076961 .
AST 0.3179386 0.1244043 2.556 0.010880 *
TOV 0.5398697 0.2453718 2.200 0.028229 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.99 on 522 degrees of freedom
Multiple R-squared: 0.2173, Adjusted R-squared: 0.2068
F-statistic: 20.7 on 7 and 522 DF, p-value: < 0.00000000000000022
This chunk summarizes the structure and key variables of the dataset and fits a baseline linear model of PTS on the candidate predictors.
Partition 60% Train / 40% Test
library(caret)
set.seed(1)
depvar <- NBA_viz$PTS
index <- createDataPartition(depvar, p = 0.6, list = FALSE)
train <- NBA_viz[index, ]
test <- NBA_viz[-index, ]

Training set number of observations: 320
Test dataset number of observations: 210
The partitioning of the dataset into training (60%) and test (40%) sets is a crucial step in developing and evaluating the predictive model. In this case, the NBA dataset was split, with 320 observations allocated to the training set and 210 observations to the test set. This strategy allows the model to learn patterns and relationships from the training data, and the test set serves as an independent sample to assess the model’s generalization performance. The choice of a 60-40 split strikes a balance between providing sufficient data for training and maintaining an adequately sized test set for robust evaluation. It’s important to note that the random seed is set to 1, ensuring reproducibility in subsequent analyses and model comparisons.
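Because createDataPartition() stratifies the split on the supplied outcome, a quick sanity check is to confirm that the PTS distributions in the two sets look alike. A minimal sketch (not part of the original output):

# sketch: compare the depvar distribution across the partition
summary(train$PTS)
summary(test$PTS)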
EDA
Descriptive Statistics
# Load necessary libraries
library(psych)
library(tidyverse)
library(flextable)
psych::describe(train, fast = TRUE)
        vars   n    mean     sd min    max  range    se
PLAYER 1 320 NaN NA Inf -Inf -Inf NA
FORWARD 2 320 NaN NA Inf -Inf -Inf NA
CENTER 3 320 NaN NA Inf -Inf -Inf NA
GUARD 4 320 NaN NA Inf -Inf -Inf NA
ROOKIE 5 320 NaN NA Inf -Inf -Inf NA
TEAM 6 320 NaN NA Inf -Inf -Inf NA
AGE 7 320 26.33 4.36 19 42.0 23.0 0.24
GP 8 320 50.27 26.45 1 82.0 81.0 1.48
W 9 320 24.58 16.06 0 60.0 60.0 0.90
MIN 10 320 1154.53 842.56 1 3028.0 3027.0 47.10
PTS 11 320 19.70 6.98 0 42.8 42.8 0.39
FGM 12 320 7.36 2.70 0 17.1 17.1 0.15
FGA 13 320 16.63 4.88 0 39.5 39.5 0.27
FTM 14 320 2.94 1.84 0 10.2 10.2 0.10
FTA 15 320 4.01 2.38 0 12.3 12.3 0.13
REB 16 320 8.97 4.73 0 35.7 35.7 0.26
AST 17 320 4.30 2.78 0 17.1 17.1 0.16
TOV 18 320 2.47 1.47 0 12.8 12.8 0.08
The descriptive statistics offer useful insights into the key variables in the training set by summarizing the central tendency, variability, and distribution of each feature.
Specifically, the average age of NBA players is 26.33 years with a standard deviation of 4.36 years, indicating a wide range from 19 to 42 years. The average number of games played is 50.27 with substantial variability (standard deviation of 26.45 games). There is also considerable variation across players in terms of wins, minutes played, points scored, field goals, free throws, rebounds, and assists. The metrics provide a quantitative overview of performance in fundamental aspects of basketball.
The wide-ranging distributions and standard deviations showcase the diversity among NBA players for most variables. Visualizing these summary statistics and distributions would further contextualize the spread, skew, outliers, and other characteristics essential for thorough exploratory data analysis. Overall, these measures offer a preliminary numerical synopsis of key input features that influence the output target - points scored. Analyzing the descriptive statistics facilitates initial data comprehension while also informing subsequent modeling decisions regarding transformations, assumptions, and interpretability.
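To quantify the skew mentioned above rather than judge it by eye, a minimal sketch ranking the numeric columns by skewness with psych::skew() (the helper column names here are illustrative):

# sketch: rank numeric variables by skewness to guide transformation choices
train %>%
  select(where(is.numeric)) %>%
  summarise(across(everything(), psych::skew)) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Skew") %>%
  arrange(desc(Skew))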
Boxplots – All Numeric
data_long <- train %>%
select(where(is.numeric)) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
ggplot(data_long, aes(x = Variable, y = Value)) +
geom_boxplot() +
coord_flip() +
facet_wrap(~ Variable, scales = "free", ncol=3) +
theme_minimal() +
theme(axis.text.x = element_blank()) +
labs(title = "Horizontal Boxplot for Each Numeric Variable")Each of the boxplots shows a single variable. It is easy to see that each of the variables presents outliers. AGE: The distribution of player ages is slightly skewed to the right, with a few outliers on the higher end.
AST: The distribution of assists is also slightly skewed to the right, with a few outliers on the higher end.
FGA: The distribution of field goal attempts is fairly symmetrical, with a few outliers on both the lower and higher ends.
FGM: The distribution of field goals made is similar to the distribution of field goal attempts, with a few outliers on both ends.
FTA: The distribution of free throw attempts is slightly skewed to the right, with a few outliers on the higher end.
FTM: The distribution of free throws made is similar to the distribution of free throw attempts, with a few outliers on the higher end.
GP: The distribution of games played is skewed to the right, with a few outliers on the higher end.
MIN: The distribution of minutes played is skewed to the right, with a few outliers on the higher end.
PTS: The distribution of points scored is skewed to the right, with a few outliers on the higher end.
REB: The distribution of rebounds is skewed to the right, with a few outliers on the higher end.
TOV: The distribution of turnovers is skewed to the right, with a few outliers on the higher end.
W: The distribution of wins is skewed to the right, with a few outliers on the higher end.
library(DataExplorer)
plot_boxplot(train, by="AGE") These are some observations that can be seen from the boxplots when usingg age as the group variable.
The distribution of points per game is skewed to the right for all age groups, with more players scoring fewer points than higher points.
There is a general trend of increasing median points per game with increasing age, up to around 28 years old. After that, the median points per game appears to level off or even decrease slightly.
There is a fairly wide range of points scored within each age group, as indicated by the length of the boxes. This suggests that there is a lot of variability in scoring ability among players of all ages.
There are a few outliers, represented by points beyond the whiskers of the boxes. These outliers represent players who scored significantly more or fewer points than the majority of players in their age group.
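The rise-then-plateau age trend described above can also be checked numerically. A hedged sketch using dplyr, with age bands chosen arbitrarily for illustration:

# sketch: median PTS by age band to verify the rise-then-plateau pattern
train %>%
  mutate(AGE_BAND = cut(AGE, breaks = c(18, 22, 25, 28, 31, 43))) %>%
  group_by(AGE_BAND) %>%
  summarise(Players = n(), Median_PTS = median(PTS), .groups = "drop")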
Histograms
plot_histogram(train)

Observations:
Player Age: The majority of players fall within the young-to-mid twenties age range, with a slight skew towards younger players.
Right-skewed Distributions: Several key statistics, including points scored, rebounds, assists, field goal attempts/makes, free throw attempts/makes, minutes played, and games played, exhibit right-skewed distributions. This indicates that most players fall towards the lower end of the spectrum, with a smaller number of players exceeding the average in these categories.
Wins Distribution: The distribution of team wins is also right-skewed, suggesting that most teams win a moderate number of games, while a few teams stand out with significantly more wins.
Scatterplots (depvar ~ all x)
Look at the shape of the relationship between the dependent variable and all of the continuous potential independent variables.
plot_scatterplot(test, by = "PTS") Observations:
There is a weak positive correlation between age and PTS. This means that as players get older, they tend to score slightly more points on average. However, there is a lot of variability in the data, and many young players score more points than some older players.
Players who are forwards tend to score more points than players who are centers or guards. This is likely because forwards typically play a more offensive role than centers or guards.
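The position claim can be checked directly on the training set. A sketch that reshapes the Yes/No position flags (FORWARD, CENTER, GUARD, as shown in str() above) into one grouping column:

# sketch: PTS by position, keeping each player's "Yes" flags
train %>%
  pivot_longer(c(FORWARD, CENTER, GUARD),
               names_to = "POSITION", values_to = "FLAG") %>%
  filter(FLAG == "Yes") %>%
  ggplot(aes(x = POSITION, y = PTS)) +
  geom_boxplot()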
Correlation Matrix
plot_correlation(train, type="continuous")

Observations:
Strong Positive Correlations:
PTS (Points Scored) and FGM (Field Goals Made): 0.96. This indicates a very strong positive correlation, meaning players with higher point totals tend to have correspondingly higher numbers of field goals made.
FGA (Field Goal Attempts) and FTM (Free Throws Made): 0.66. This suggests a moderately strong positive correlation, implying players who attempt more field goals generally make more free throws.
MIN (Minutes Played) and GP (Games Played): 0.89. This indicates a strong positive correlation, signifying that players who play more minutes tend to participate in more games.
W (Wins) and GP (Games Played): 0.76. This suggests a moderately strong positive correlation; since W counts the games a player's team won while he appeared, players who play more games naturally accumulate more wins.
Strong Negative Correlations:
AGE and MIN (Minutes Played): -0.89. This indicates a very strong negative correlation, meaning younger players tend to play more minutes on average compared to older players.
Moderate Positive Correlations:
PTS (Points Scored) and FGA (Field Goal Attempts): 0.74. This suggests a moderate positive correlation, implying players with higher scores generally attempt more field goals.
REB (Rebounds) and MIN (Minutes Played): 0.37. This indicates a moderate positive correlation, meaning players who play more minutes tend to grab more rebounds on average.
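To back the heatmap with exact numbers, a minimal sketch computing the pairwise correlations of the numeric columns and extracting the row for the depvar:

# sketch: exact correlations behind the heatmap
cors <- train %>%
  select(where(is.numeric)) %>%
  cor(use = "pairwise.complete.obs") %>%
  round(2)
cors["PTS", ] # each candidate predictor's correlation with the depvar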
MODEL
Linear Regression Model
From the above EDA, we chose the following variables to start. Note that, given the shape of the relationship between PTS and AGE, we entered AGE as a quadratic.
From experience and prior research, it is common to specify a right-skewed dependent variable, such as a currency variable (e.g., sales, revenue, wages) or, here, points scored, in log form.
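A quick visual check of whether the log form tames the right skew of PTS; a minimal sketch (zero-point rows are excluded here because log(0) is -Inf, anticipating the filtering step below):

# sketch: raw vs log-transformed depvar
ggplot(train, aes(x = PTS)) + geom_histogram(bins = 30)
ggplot(subset(train, PTS > 0), aes(x = log(PTS))) + geom_histogram(bins = 30)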
Estimate the following model:
model <- lm(log(PTS) ~ AGE + I(AGE^2) + GP + W + MIN + FGM + FGA + FTM + FTA + REB + AST + TOV, data = train)
Estimate Coefficients and show coefficients table
# Check for zero or negative values in the dependent variable
sum(train$PTS <= 0)
[1] 7
train <- train[train$PTS > 0, ]
sum(is.na(train$PTS))
[1] 0
train <- train[complete.cases(train$PTS), ]
# Fit the linear regression model
model <- lm(log(PTS) ~ AGE + I(AGE^2) + GP + W + MIN + FGM + FGA + FTM + FTA + REB + AST + TOV, data = train)
summary(model)
Call:
lm(formula = log(PTS) ~ AGE + I(AGE^2) + GP + W + MIN + FGM +
FGA + FTM + FTA + REB + AST + TOV, data = train)
Residuals:
Min 1Q Median 3Q Max
-0.68816 -0.03753 0.00980 0.05225 0.24029
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.06422801 0.17994391 11.472 < 0.0000000000000002 ***
AGE -0.00761248 0.01267464 -0.601 0.54856
I(AGE^2) 0.00016352 0.00022549 0.725 0.46891
GP 0.00192779 0.00059887 3.219 0.00143 **
W 0.00003584 0.00064962 0.055 0.95604
MIN -0.00003855 0.00001576 -2.446 0.01502 *
FGM 0.10867147 0.00440791 24.654 < 0.0000000000000002 ***
FGA 0.00702982 0.00249933 2.813 0.00524 **
FTM 0.05660245 0.00933876 6.061 0.00000000405 ***
FTA -0.01424028 0.00759438 -1.875 0.06175 .
REB -0.00874151 0.00140013 -6.243 0.00000000146 ***
AST -0.00773600 0.00236237 -3.275 0.00118 **
TOV -0.00774367 0.00513955 -1.507 0.13294
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.09309 on 300 degrees of freedom
Multiple R-squared: 0.9248, Adjusted R-squared: 0.9218
F-statistic: 307.5 on 12 and 300 DF, p-value: < 0.00000000000000022
model %>% as_flextable()

             Estimate  Standard Error  t value  Pr(>|t|)
(Intercept)     2.064           0.180   11.472    0.0000  ***
AGE            -0.008           0.013   -0.601    0.5486
I(AGE^2)        0.000           0.000    0.725    0.4689
GP              0.002           0.001    3.219    0.0014  **
W               0.000           0.001    0.055    0.9560
MIN            -0.000           0.000   -2.446    0.0150  *
FGM             0.109           0.004   24.654    0.0000  ***
FGA             0.007           0.002    2.813    0.0052  **
FTM             0.057           0.009    6.061    0.0000  ***
FTA            -0.014           0.008   -1.875    0.0617  .
REB            -0.009           0.001   -6.243    0.0000  ***
AST            -0.008           0.002   -3.275    0.0012  **
TOV            -0.008           0.005   -1.507    0.1329

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05
Residual standard error: 0.09309 on 300 degrees of freedom
Multiple R-squared: 0.9248, Adjusted R-squared: 0.9218
F-statistic: 307.5 on 12 and 300 DF, p-value: < 0.0000000000000002
Observations:
Multiple R-squared: It indicates the proportion of variance in the dependent variable (log-transformed PTS) that is explained by the independent variables. In this case, approximately 92.48% of the variance is explained.
Adjusted R-squared: It adjusts the R-squared value based on the number of predictors in the model. It penalizes the addition of predictors that do not improve the model significantly.
F-statistic: It tests the overall significance of the model. A low p-value (< 0.05) suggests that at least one predictor variable is significantly related to the response variable.
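The same fit statistics can be pulled into a single tidy row with broom (which is also loaded later for augment()); a minimal sketch:

# sketch: one-row summary of overall model fit
library(broom)
glance(model) %>%
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df, df.residual)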
Coefficient Magnitude Plot
# note this simple code using the coefplot package produces a chart like Figure 6.6 in one short line.
library(coefplot)
coefplot(model, sort = "magnitude", intercept = FALSE)

Observations:
The visualization confirms the findings from the model summary.
Variables like FGM, FTM, REB, AST, GP, and MIN have coefficients with the same direction and relative strength as indicated by their p-values in the summary.
The error bars provide a visual sense of the uncertainty around the coefficient estimates. For instance, the error bar for the AGE^2 coefficient is relatively large, suggesting that this estimate might be less precise compared to others with shorter bars.
Check for predictor independence
Using Variance Inflation Factors (VIF)
library(car)
vif(model)
       AGE   I(AGE^2)        GP         W       MIN       FGM       FGA
111.154997 110.494789 8.539649 3.810795 6.219076 4.335330 4.444139
FTM FTA REB AST TOV
10.288880 11.275894 1.483570 1.501982 1.600125
Observations:
The VIF values reveal potential multicollinearity within the model. AGE and its squared term have extremely high VIFs (about 111 and 110), which is expected when a raw polynomial term sits alongside the original variable. FTM (10.3) and FTA (11.3) also exceed the common threshold of 10, and GP (8.5) and MIN (6.2) are moderately elevated. W, FGM, FGA, REB, AST, and TOV all show acceptable VIFs. Managing multicollinearity is vital for stable coefficients and model robustness, prompting consideration of variable transformations or selection, for example centering or orthogonalizing the AGE terms and dropping one of the FTM/FTA pair.
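One hedged remedy sketch: the AGE / AGE^2 collinearity is an artifact of raw polynomial terms, and refitting with poly(AGE, 2), which generates orthogonal (uncorrelated) columns for the same quadratic fit, should collapse those two VIFs:

# sketch: orthogonal polynomial for AGE to defuse the AGE / AGE^2 collinearity
model_poly <- lm(log(PTS) ~ poly(AGE, 2) + GP + W + MIN + FGM + FGA +
                   FTM + FTA + REB + AST + TOV, data = train)
car::vif(model_poly) # the AGE term's (G)VIF should now be far lower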
Residual Analysis
Residual Range
quantile(round(residuals(model, type = "deviance"), 2))
   0%   25%   50%   75%  100%
-0.69 -0.04 0.01 0.05 0.24
Observations:
The residual range, as shown by quantiles, indicates that the majority of residuals are concentrated within a narrow interval, with the central 50% falling between -0.04 and 0.05. This suggests a relatively balanced distribution around the fitted values, supported by a median close to zero. The interquartile range (IQR) is compact, implying consistent variability in the residuals.
Residual Plots
We are looking for:
- Random distribution of residuals vs fitted values
- Normally distributed residuals: Normal Q-Q plot with values along the line
- Homoskedasticity, with a horizontal Scale-Location line and no residual pattern
- Minimal influential observations, that is, none outside the borders of Cook's distance
plot(model)

Observations:
The plot of residuals versus fitted values is mostly random, but a discernible trend at the higher fitted values suggests the model could benefit from additional complexity or adjustments. The Normal Q-Q plot deviates from the reference line in the tails, which raises questions about the reliability of tests that assume normal errors. The Scale-Location plot shows some variation in residual spread across fitted values, so the constant-variance assumption deserves attention. Together, these diagnostics point to opportunities for refining the model and improving its predictive performance.
Plot Fitted Value by Actual Value
library(broom)
train2 <- augment(model, data=train) # this appends predicted to original dataset to build plots on your own
p <- ggplot(train2, mapping = aes(y=.fitted, x=PTS))
p + geom_point()

Observations:
The fitted-versus-actual plot shows that the model captures higher predicted scores for prolific scorers. Despite some dispersion around the diagonal, implying imprecise predictions for individual players, the model generally performs well. Outliers, players with substantial deviations from their predicted points, warrant further investigation to uncover the factors driving these discrepancies.
Plot Residuals by Fitted Values
p <- ggplot(train2, mapping = aes(y=.resid, x=.fitted))
p + geom_point()

Observations:
The scatter plot of residuals versus fitted values reveals a scattered distribution around the zero line, indicating that the residuals are generally centered. However, the presence of curvature suggests a potential non-linearity in the model’s relationship with higher fitted values.
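To make the suspected curvature explicit rather than judged by eye, a minimal sketch overlaying a loess smooth on the same residual plot:

# sketch: a loess smooth highlights any systematic bend in the residuals
p + geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = FALSE)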
Performance Evaluation
Use Model to Score test dataset (Display First 10 values - depvar and fitted values only)
pred <- predict(model, newdata = test, interval = "prediction") # score test dataset with model
test_w_predict <- cbind(test, pred) # append score to original test dataset
test_w_predict %>%
head(10) %>%
select(PLAYER, AGE, fit, lwr,upr) %>%
  as_flextable(show_coltype = FALSE)

PLAYER             AGE   fit   lwr   upr
Al Horford          32   3.1   2.9   3.3
Alex Abrines        25   2.6   2.4   2.8
Alex Poythress      25   2.7   2.5   2.9
Alfonzo McKinnie    26   2.8   2.6   3.0
Allen Crabbe        27   2.7   2.6   2.9
Amile Jefferson     25   2.9   2.7   3.1
Andre Iguodala      35   2.5   2.4   2.7
Andre Ingram        33   2.1   1.9   2.3
Andrew Bogut        34   2.6   2.4   2.8
Anfernee Simons     19   3.2   3.0   3.4
n: 10
Observations:
The model was applied to the test dataset, generating predicted values and prediction intervals for the dependent variable. The results for the first 10 observations include player names (PLAYER), ages (AGE), fitted values (fit), and the lower (lwr) and upper (upr) interval bounds. Because the model was fit to log(PTS), these values are on the log scale: Al Horford, aged 32, has a fitted value of 3.1 (roughly exp(3.1) ≈ 22 points) with a prediction interval from 2.9 to 3.3. Similar information is provided for the other players in the displayed subset.
Plot Actual vs Fitted (test)
plot(test_w_predict$PTS, test_w_predict$fit, pch = 16, cex = 1)

Observations:
In the plot of actual versus fitted values for the test set, a positive correlation is observed, indicating that the model tends to predict higher scores for players who actually score more points and vice versa.
Performance Metrics
library(Metrics)
metric_label <- c("MAE", "RMSE", "MAPE")
metrics <- c(
round(mae(test_w_predict$PTS, test_w_predict$fit), 4),
round(rmse(test_w_predict$PTS, test_w_predict$fit), 4),
round(mape(test_w_predict$PTS, test_w_predict$fit), 4)
)
pmtable <- data.frame(Metric = metric_label, Value = metrics)
flextable(pmtable)

Metric   Value
MAE      17.1601
RMSE     19.1095
MAPE     Inf
Observations:
These metrics describe the accuracy of the predictions, but with an important caveat: the fitted values are on the log(PTS) scale while the actual PTS are raw points, so the MAE and RMSE here largely reflect that scale mismatch rather than true prediction error. The MAPE is reported as Inf because the test set contains players with zero actual PTS, which makes the percentage error undefined. Back-transforming the predictions with exp() before scoring gives metrics on the points scale, as sketched below.
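A hedged sketch of that back-transformation (a naive exp() with no smearing correction, and zero-point players excluded so MAPE is finite; this is an illustrative assumption, not part of the original analysis):

# sketch: score on the points scale instead of the log scale
test_nonzero <- subset(test_w_predict, PTS > 0) # zero actuals make MAPE infinite
fit_pts <- exp(test_nonzero$fit) # naive back-transform of log-scale predictions
c(MAE  = mae(test_nonzero$PTS, fit_pts),
  RMSE = rmse(test_nonzero$PTS, fit_pts),
  MAPE = mape(test_nonzero$PTS, fit_pts))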
Model Fit by Team
unique(NBA_viz$TEAM)
 [1] ORL IND OKC BOS POR BKN SAC LAL ATL GSW NYK PHI DET NOP MIN LAC CLE CHI HOU
[20] MEM MIA CHA WAS MIL DEN SAS TOR DAL UTA PHX
30 Levels: ATL BKN BOS CHA CHI CLE DAL DEN DET GSW HOU IND LAC LAL MEM ... WAS
# more in depth plots of performance
subset_data <- subset(test_w_predict, TEAM %in% c("LAL", "BOS", "NOP"))
# Create the plot
p <- ggplot(data = subset_data,
aes(x = PTS,
y = fit, ymin = lwr, ymax = upr,
color = TEAM, fill = TEAM, group = TEAM))
p + geom_point(alpha = 0.5) +
geom_line() + geom_ribbon(alpha = 0.2, color = FALSE) +
labs(title = "Actual vs Fitted with Upper and Lower CI",
subtitle = "Teams: LAL and BOS",
caption = "NBA_viz") +
xlab("Actual PTS") + ylab("Fitted PTS") +
  theme(legend.position = "bottom")

Observations:
The plot displays the actual versus fitted points, including upper and lower confidence intervals. This visualization helps assess potential team effects, identifying any systematic differences in the model’s prediction accuracy for players from different teams.
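The same comparison can be extended from three teams to all thirty with facets; a minimal sketch:

# sketch: actual vs fitted for every team
ggplot(test_w_predict, aes(x = PTS, y = fit)) +
  geom_point(alpha = 0.4) +
  facet_wrap(~ TEAM) +
  labs(x = "Actual PTS", y = "Fitted log(PTS)")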
ggplot2 explore (Intro Section)
This replicates the first section of Chapter 6 (FYI).
pip <- lm(log(PTS) ~ AGE, data = train)
summary(pip)
Call:
lm(formula = log(PTS) ~ AGE, data = train)
Residuals:
Min 1Q Median 3Q Max
-1.46794 -0.19955 0.01565 0.22861 0.82338
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.878515 0.114986 25.034 <0.0000000000000002 ***
AGE 0.002732 0.004303 0.635 0.526
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3332 on 311 degrees of freedom
Multiple R-squared: 0.001295, Adjusted R-squared: -0.001917
F-statistic: 0.4031 on 1 and 311 DF, p-value: 0.526
library(ggplot2)
p <- ggplot(data = train, mapping = aes(x = AGE, y = log(PTS))) # redefine p for this exploration
p + geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", aes(color = "OLS", fill = "OLS"))
p + geom_point(alpha = 0.1) +
  geom_smooth(color = "tomato", fill = "tomato", method = MASS::rlm) +
  geom_smooth(color = "steelblue", fill = "steelblue", method = "lm")
p + geom_point(alpha = 0.1) +
  geom_smooth(color = "tomato", method = "lm", linewidth = 1.2,
              formula = y ~ splines::bs(x, 3), se = FALSE)
p + geom_point(alpha = 0.1) +
  geom_quantile(color = "tomato", size = 1.2, method = "rqss",
                lambda = 1, quantiles = c(0.20, 0.5, 0.85))

Observations:
The visualizations explore modeling the relationship between player age (AGE) and log-transformed points scored (log(PTS)). The initial scatterplot displays the linear association, while the robust, spline, and quantile regression fits account for non-linear patterns and outliers. Together, these graphics reveal subtleties within the data that enable selecting the most appropriate model specification based on visible patterns rather than assumptions alone. They highlight the need for both linear and non-linear considerations when relating age to scoring.
SUMMARY ASSESSMENT AND EVALUATION OF THE MODEL
The linear regression model was constructed to predict the log of points (log(PTS)) in the NBA dataset using age (AGE, entered as a quadratic), games played (GP), wins (W), minutes played (MIN), field goals made and attempted (FGM, FGA), free throws made and attempted (FTM, FTA), rebounds (REB), assists (AST), and turnovers (TOV). The model assumptions and performance were assessed through exploratory data analysis (EDA), visualization, and statistical metrics.
The model’s R-squared value is 0.9248, indicating that approximately 92.48% of the variability in the log-transformed points (log(PTS)) can be explained by the chosen independent variables. This high R-squared value suggests a strong relationship between the predictors and the log-transformed points, supporting the model’s overall explanatory power.
Model Performance
The model’s performance was evaluated using various metrics on the test dataset. Important key performance metrics include:
- Mean Absolute Error (MAE): 17.1601
- Root Mean Squared Error (RMSE): 19.1095
- Mean Absolute Percentage Error (MAPE): Inf
The performance metrics reveal the model's predictive limitations as computed. The MAE of 17.1601 and RMSE of 19.1095 are large chiefly because the fitted values are on the log(PTS) scale while the actual points are raw, so the metrics compare mismatched scales; back-transforming predictions with exp() is needed before interpreting them in points. The infinite MAPE arises from test observations with zero actual PTS.
Visualization Impact
The visualizations, including boxplots, histograms, scatterplots, and performance plots, played an important role in model development and evaluation. They helped in:
Identifying Relationships: Visualizations allowed for the exploration of relationships between variables, helping in the selection of meaningful predictors.
Assessing Assumptions: Residual plots and diagnostic plots were used to check the model assumptions, ensuring the validity of linear regression.
Performance Evaluation: Plots depicting actual vs. fitted values provided a clear view of the model’s predictive accuracy.
Team Analysis: Visualizing model performance by team helped in understanding variations and potential biases in the dataset.