We load a dataset of basketball players that records each player's height, weight, field goal percentage, free throw percentage, and average points scored per game. The columns are:
X1 = height in feet
X2 = weight in pounds
X3 = percent of successful field goals (out of 100 attempted)
X4 = percent of successful free throws (out of 100 attempted)
X5 = average points scored per game
#load our dataset
basketball = read.csv('basketball.csv')
#Rename the columns
names(basketball)[names(basketball) == "X1"] <- "Height"
names(basketball)[names(basketball) == "X2"] <- "Pounds"
names(basketball)[names(basketball) == "X3"] <- "Percent_of_Field_Goals"
names(basketball)[names(basketball) == "X4"] <- "Percent_of_Free_Thows"
names(basketball)[names(basketball) == "X5"] <- "Avg_Pts_per_Game"
head(basketball)
##   Height Pounds Percent_of_Field_Goals Percent_of_Free_Thows Avg_Pts_per_Game
## 1    6.8    225                  0.442                 0.672              9.2
## 2    6.3    180                  0.435                 0.797             11.7
## 3    6.4    190                  0.456                 0.761             15.8
## 4    6.2    180                  0.416                 0.651              8.6
## 5    6.9    205                  0.449                 0.900             23.2
## 6    6.4    225                  0.431                 0.780             27.4
pairs(basketball, gap=0.5)
basketball_lm = lm(Avg_Pts_per_Game ~ Height + Pounds + Percent_of_Field_Goals + Percent_of_Free_Thows, data = basketball)
summary(basketball_lm)
##
## Call:
## lm(formula = Avg_Pts_per_Game ~ Height + Pounds + Percent_of_Field_Goals +
## Percent_of_Free_Thows, data = basketball)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -8.966 -3.545 -1.187  2.613 15.211
##
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)             4.148707  14.855006   0.279  0.78121
## Height                 -3.690499   2.970780  -1.242  0.22005
## Pounds                  0.009458   0.046297   0.204  0.83897
## Percent_of_Field_Goals 47.940199  15.709131   3.052  0.00367 **
## Percent_of_Free_Thows  11.371019   7.868536   1.445  0.15479
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.411 on 49 degrees of freedom
## Multiple R-squared: 0.2223, Adjusted R-squared: 0.1588
## F-statistic: 3.501 on 4 and 49 DF, p-value: 0.01364
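A note on reading the Percent_of_Field_Goals coefficient: the `head()` output shows values such as 0.442, so the column appears to be stored as a proportion between 0 and 1 rather than on a 0 to 100 scale. Under that assumption, a one-unit change would span the whole range, so it may be easier to rescale the estimate to a one-percentage-point change. A minimal sketch of that arithmetic (the variable names are illustrative, not part of the original analysis):

```r
# Estimate for Percent_of_Field_Goals copied from the summary above.
# Assumption: the predictor is stored as a proportion (0 to 1).
beta_fg <- 47.940199

# One percentage point is 0.01 on the proportion scale.
beta_fg_per_point <- beta_fg * 0.01

# Expected change in average points per game for a one-point rise in
# field goal percentage, holding the other predictors fixed.
print(round(beta_fg_per_point, 3))
```

Read this way, the estimate corresponds to roughly half a point per game for each additional percentage point of field goal accuracy.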
Remove Pounds first because it has the highest p-value (0.839).
basketball_lm = update(basketball_lm, .~. - Pounds, data=basketball)
summary(basketball_lm)
##
## Call:
## lm(formula = Avg_Pts_per_Game ~ Height + Percent_of_Field_Goals +
## Percent_of_Free_Thows, data = basketball)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -8.889 -3.596 -1.077  2.561 15.463
##
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)               2.979     13.575   0.219  0.82721
## Height                   -3.232      1.928  -1.676  0.09996 .
## Percent_of_Field_Goals   48.700     15.116   3.222  0.00224 **
## Percent_of_Free_Thows    11.094      7.676   1.445  0.15463
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.359 on 50 degrees of freedom
## Multiple R-squared: 0.2216, Adjusted R-squared: 0.1749
## F-statistic: 4.744 on 3 and 50 DF, p-value: 0.005467
Next, remove Percent_of_Free_Thows, which now has the largest p-value (0.155).
basketball_lm = update(basketball_lm, .~. -Percent_of_Free_Thows, data=basketball)
summary(basketball_lm)
##
## Call:
## lm(formula = Avg_Pts_per_Game ~ Height + Percent_of_Field_Goals,
## data = basketball)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -8.527 -3.621 -1.002  2.222 15.789
##
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)              15.210     10.727   1.418   0.1623
## Height                   -4.035      1.866  -2.162   0.0353 *
## Percent_of_Field_Goals   51.562     15.144   3.405   0.0013 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.416 on 51 degrees of freedom
## Multiple R-squared: 0.1891, Adjusted R-squared: 0.1573
## F-statistic: 5.945 on 2 and 51 DF, p-value: 0.004776
At this point, the p-values for all remaining variables are below 0.05, so we stop the backward elimination process.
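The manual steps above (drop the predictor with the largest p-value, refit, repeat until every p-value is below 0.05) can be sketched as a small loop. This is only an illustration: `backward_eliminate` and its `alpha` argument are hypothetical names, the sketch assumes all predictors are numeric so that coefficient names match term names, and it is demonstrated on R's built-in `mtcars` data rather than the basketball file.

```r
# Backward elimination by p-value (sketch, numeric predictors only).
backward_eliminate <- function(model, alpha = 0.05) {
  repeat {
    coefs <- summary(model)$coefficients
    # Ignore the intercept when choosing a term to drop.
    keep <- rownames(coefs) != "(Intercept)"
    p_values <- setNames(coefs[keep, "Pr(>|t|)"], rownames(coefs)[keep])

    # Stop when no predictors remain or all are significant at alpha.
    if (length(p_values) == 0 || max(p_values) < alpha) break

    # Drop the least significant predictor and refit.
    worst <- names(which.max(p_values))
    model <- update(model, as.formula(paste(". ~ . -", worst)))
  }
  model
}

# Demonstration on a built-in dataset.
fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
final <- backward_eliminate(fit)
print(formula(final))
```

Applied to the four-predictor basketball model, a loop like this should follow the same sequence of removals as the manual steps, since it uses the same rule.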
plot(fitted(basketball_lm), resid(basketball_lm))
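Most of the residuals are scattered around zero with no obvious pattern, although the largest positive residuals (up to about 15.8) lie farther from zero than the largest negative ones (about -8.5).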
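A horizontal reference line and a smoothed trend can make this plot easier to judge. Residuals from a least-squares fit with an intercept always average to zero, so the things to look for are a trend in the smoother or a change in spread across the fitted values. The helper below is a sketch, `residual_plot` is a hypothetical name, and the data are simulated for illustration only.

```r
# Residuals-vs-fitted plot with a zero line and a lowess smoother.
residual_plot <- function(model) {
  f <- fitted(model)
  r <- resid(model)
  plot(f, r, xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)            # reference line at zero
  lines(lowess(f, r), col = "red")  # smoothed trend of the residuals
  invisible(data.frame(fitted = f, resid = r))
}

# Simulated example data (not the basketball dataset).
set.seed(1)
x <- runif(54)
y <- 2 + 3 * x + rnorm(54)
fit <- lm(y ~ x)
residual_plot(fit)
```

Base R's `plot(basketball_lm, which = 1)` produces a similar display for the fitted model in this analysis.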
qqnorm(resid(basketball_lm))
qqline(resid(basketball_lm))
Most of the points follow the line, although a few outliers depart from it. The residual distribution appears to be right-skewed.
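The visual impression from the Q-Q plot can be complemented with numbers. The sketch below computes the sample skewness of the residuals (positive values indicate a longer right tail) and runs a Shapiro-Wilk test with `shapiro.test()`. The function name `check_normality` is hypothetical, and the example uses simulated data with deliberately right-skewed errors rather than the basketball data.

```r
# Numeric checks on residual normality (sketch).
check_normality <- function(model) {
  r <- resid(model)
  centered <- r - mean(r)
  # Moment-based sample skewness.
  skew <- mean(centered^3) / mean(centered^2)^1.5
  sw <- shapiro.test(r)
  c(skewness  = skew,
    shapiro_W = unname(sw$statistic),
    shapiro_p = sw$p.value)
}

# Simulated example: exponential errors give right-skewed residuals.
set.seed(1)
x <- runif(54)
y <- 1 + 2 * x + rexp(54)
fit <- lm(y ~ x)
res <- check_normality(fit)
print(res)
```

A small Shapiro-Wilk p-value would suggest the residuals depart from normality, which matters mainly for the reliability of the t-tests in a sample this small.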
Height and Percent_of_Field_Goals are the statistically significant predictors of average points scored per game. However, the adjusted \(R^2\) is low (0.157), so the model explains only a small share of the variation in scoring and leaves most of it unexplained.
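For reference, the adjusted \(R^2\) values in the summaries can be reproduced from the multiple \(R^2\), the sample size, and the number of predictors. The sketch below uses the standard formula \(1 - (1 - R^2)(n - 1)/(n - p - 1)\), with \(n = 54\) inferred from the 51 residual degrees of freedom plus 3 estimated coefficients in the final model. The function name `adjusted_r2` is illustrative.

```r
# Adjusted R-squared from R-squared, sample size n, and p predictors.
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 54  # 51 residual df + 3 coefficients in the final model

full  <- adjusted_r2(0.2223, n, p = 4)  # all four predictors
final <- adjusted_r2(0.1891, n, p = 2)  # Height + Percent_of_Field_Goals

print(round(c(full = full, final = final), 4))
```

The two values (about 0.1588 and 0.1573) match the summaries and are nearly identical, so dropping the two predictors cost very little explanatory power. The three-predictor model reported the highest adjusted \(R^2\) of the three (0.1749).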