Objective

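We load a dataset of basketball players that records each player's height, weight, field goal percentage, free throw percentage, and average points scored per game. The goal is to model average points per game from the other four variables using backward elimination.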

X1 = height in feet

X2 = weight in pounds

X3 = percent of successful field goals (out of 100 attempted), stored as a proportion (e.g. 0.442)

X4 = percent of successful free throws (out of 100 attempted), stored as a proportion (e.g. 0.672)

X5 = average points scored per game

#load our dataset
basketball = read.csv('basketball.csv')

#Rename the columns
names(basketball)[names(basketball) == "X1"] <- "Height"
names(basketball)[names(basketball) == "X2"] <- "Pounds"
names(basketball)[names(basketball) == "X3"] <- "Percent_of_Field_Goals"
names(basketball)[names(basketball) == "X4"] <- "Percent_of_Free_Thows"
names(basketball)[names(basketball) == "X5"] <- "Avg_Pts_per_Game"

head(basketball)
##   Height Pounds Percent_of_Field_Goals Percent_of_Free_Thows Avg_Pts_per_Game
## 1    6.8    225                  0.442                 0.672              9.2
## 2    6.3    180                  0.435                 0.797             11.7
## 3    6.4    190                  0.456                 0.761             15.8
## 4    6.2    180                  0.416                 0.651              8.6
## 5    6.9    205                  0.449                 0.900             23.2
## 6    6.4    225                  0.431                 0.780             27.4
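
The five renaming statements above can also be collapsed into a single lookup. The following is a minimal sketch, not the original author's code: the data frame `bb` and the vector `new_names` are hypothetical names, and `bb` holds only the six rows printed by `head(basketball)`, typed in by hand so the example runs without `basketball.csv`.

```r
# The six rows shown by head(basketball), entered by hand so this sketch
# is self-contained (the real data come from basketball.csv).
bb <- data.frame(
  X1 = c(6.8, 6.3, 6.4, 6.2, 6.9, 6.4),
  X2 = c(225, 180, 190, 180, 205, 225),
  X3 = c(0.442, 0.435, 0.456, 0.416, 0.449, 0.431),
  X4 = c(0.672, 0.797, 0.761, 0.651, 0.900, 0.780),
  X5 = c(9.2, 11.7, 15.8, 8.6, 23.2, 27.4)
)

# Named lookup: old column name -> new column name
new_names <- c(
  X1 = "Height",
  X2 = "Pounds",
  X3 = "Percent_of_Field_Goals",
  X4 = "Percent_of_Free_Thows",
  X5 = "Avg_Pts_per_Game"
)

# Rename by matching on the old names, so column order does not matter
names(bb) <- unname(new_names[names(bb)])
head(bb)
```

This assumes every column of the data frame appears in the lookup; a column missing from `new_names` would end up with an `NA` name.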

Data Visualization

pairs(basketball, gap=0.5)

Backward elimination process

basketball_lm = lm(Avg_Pts_per_Game ~ Height + Pounds + Percent_of_Field_Goals + Percent_of_Free_Thows, data = basketball)

summary(basketball_lm)
## 
## Call:
## lm(formula = Avg_Pts_per_Game ~ Height + Pounds + Percent_of_Field_Goals + 
##     Percent_of_Free_Thows, data = basketball)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.966 -3.545 -1.187  2.613 15.211 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)   
## (Intercept)             4.148707  14.855006   0.279  0.78121   
## Height                 -3.690499   2.970780  -1.242  0.22005   
## Pounds                  0.009458   0.046297   0.204  0.83897   
## Percent_of_Field_Goals 47.940199  15.709131   3.052  0.00367 **
## Percent_of_Free_Thows  11.371019   7.868536   1.445  0.15479   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.411 on 49 degrees of freedom
## Multiple R-squared:  0.2223, Adjusted R-squared:  0.1588 
## F-statistic: 3.501 on 4 and 49 DF,  p-value: 0.01364

Remove Pounds because it has the highest p-value (0.839).

basketball_lm = update(basketball_lm, .~. - Pounds, data=basketball)
summary(basketball_lm)
## 
## Call:
## lm(formula = Avg_Pts_per_Game ~ Height + Percent_of_Field_Goals + 
##     Percent_of_Free_Thows, data = basketball)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.889 -3.596 -1.077  2.561 15.463 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)   
## (Intercept)               2.979     13.575   0.219  0.82721   
## Height                   -3.232      1.928  -1.676  0.09996 . 
## Percent_of_Field_Goals   48.700     15.116   3.222  0.00224 **
## Percent_of_Free_Thows    11.094      7.676   1.445  0.15463   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.359 on 50 degrees of freedom
## Multiple R-squared:  0.2216, Adjusted R-squared:  0.1749 
## F-statistic: 4.744 on 3 and 50 DF,  p-value: 0.005467

Next, remove Percent_of_Free_Thows, which now has the largest p-value (0.155).

basketball_lm = update(basketball_lm, .~. -Percent_of_Free_Thows, data=basketball)
summary(basketball_lm)
## 
## Call:
## lm(formula = Avg_Pts_per_Game ~ Height + Percent_of_Field_Goals, 
##     data = basketball)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.527 -3.621 -1.002  2.222 15.789 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)   
## (Intercept)              15.210     10.727   1.418   0.1623   
## Height                   -4.035      1.866  -2.162   0.0353 * 
## Percent_of_Field_Goals   51.562     15.144   3.405   0.0013 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.416 on 51 degrees of freedom
## Multiple R-squared:  0.1891, Adjusted R-squared:  0.1573 
## F-statistic: 5.945 on 2 and 51 DF,  p-value: 0.004776

At this point, the p-values of both remaining predictors are below 0.05, so we stop the backward elimination process.
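
The manual steps above (fit, drop the predictor with the largest p-value, refit, stop once every p-value is below 0.05) can be written as a loop. The following is a rough sketch under stated assumptions, not the code used for this analysis: `backward_eliminate` is a hypothetical helper, it assumes all predictors are numeric main effects (so coefficient names match term names), and it is demonstrated on simulated data rather than on `basketball.csv`.

```r
# Hypothetical helper (not part of the original analysis): refit the model
# repeatedly, dropping the predictor with the largest p-value, until every
# remaining predictor has p < alpha.
# Assumes numeric main effects only, so coefficient names equal term names.
backward_eliminate <- function(fml, data, alpha = 0.05) {
  model <- lm(fml, data = data)
  repeat {
    coefs <- summary(model)$coefficients
    pvals <- setNames(coefs[, "Pr(>|t|)"], rownames(coefs))
    pvals <- pvals[names(pvals) != "(Intercept)"]
    # Stop when nothing is left to drop or all predictors are significant
    if (length(pvals) == 0 || max(pvals) < alpha) {
      return(model)
    }
    worst <- names(pvals)[which.max(pvals)]
    fml <- update(fml, paste(". ~ . -", worst))
    model <- lm(fml, data = data)
  }
}

# Simulated data for illustration only: x1 drives y, x2 and x3 are noise
set.seed(42)
n <- 54
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 2 * sim$x1 + rnorm(n)

final <- backward_eliminate(y ~ x1 + x2 + x3, data = sim)
summary(final)$coefficients
```

Called on the real data with the full formula and `data = basketball`, this should retrace the two removals shown above, given the p-values in the printed summaries. Base R's `step()` performs a similar search but ranks models by AIC rather than by p-values.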

plot(fitted(basketball_lm), resid(basketball_lm), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # dashed reference line at zero

Most of the residuals are spread fairly evenly around zero.

qqnorm(resid(basketball_lm))
qqline(resid(basketball_lm))

Conclusion

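In the Q-Q plot, most of the points follow the reference line, although there are some outliers. The residuals appear somewhat right-skewed, which is consistent with the residual summary of the final model (maximum 15.789 versus minimum -8.527).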
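
A visual reading of the Q-Q plot can be backed up with numbers. The following is a minimal sketch: `residual_checks` is a hypothetical helper that returns the moment-based sample skewness of the residuals and the p-value of a Shapiro-Wilk normality test (`shapiro.test` from base R). It is demonstrated on simulated data with deliberately right-skewed errors, not on the basketball model.

```r
# Hypothetical helper: numeric companions to the Q-Q plot
residual_checks <- function(model) {
  r <- resid(model)
  centered <- r - mean(r)
  # Moment-based sample skewness; positive values mean a longer right tail
  skewness <- mean(centered^3) / mean(centered^2)^(3 / 2)
  # Shapiro-Wilk test; small p-values are evidence against normality
  sw <- shapiro.test(r)
  c(skewness = skewness, shapiro_p = sw$p.value)
}

# Simulated data for illustration only: errors are exponential minus one,
# so they have mean zero and a long right tail
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + (rexp(n) - 1)
checks <- residual_checks(lm(y ~ x))
checks
```

For the analysis here the call would be `residual_checks(basketball_lm)`; a clearly positive skewness would support the right-skew seen in the plot.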

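Height and Percent_of_Field_Goals are statistically significant predictors of the average points scored per game: the Height coefficient is negative and the Percent_of_Field_Goals coefficient is positive. However, the adjusted \(R^2\) of the final model is low (0.1573), so the model explains only a small share of the variation in scoring and leaves most of it, including noise in the measurements, unexplained.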
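
The adjusted \(R^2\) values reported by `summary()` can be reproduced from the multiple \(R^2\), the sample size \(n\), and the number of predictors \(p\) using \(1 - (1 - R^2)\frac{n - 1}{n - p - 1}\). The sketch below uses a hypothetical helper `adjusted_r2` and takes \(n = 54\), inferred from the residual degrees of freedom plus the number of coefficients (for example 49 + 5 for the full model).

```r
# Adjusted R^2 from R^2, sample size n, and number of predictors p
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 54  # inferred: residual df + number of coefficients (49 + 5)

# Multiple R-squared values copied from the three summaries above
fits <- data.frame(
  predictors = c(4, 3, 2),
  r2 = c(0.2223, 0.2216, 0.1891)
)
fits$adj_r2 <- round(adjusted_r2(fits$r2, n, fits$predictors), 4)
fits
```

The results match the printed values of 0.1588, 0.1749, and 0.1573. By this criterion the three-predictor model scores slightly higher than the final two-predictor model, so removing Percent_of_Free_Thows trades a little adjusted \(R^2\) for a model in which every predictor is significant at the 0.05 level.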