1a.
Players <- read.csv('basketball.csv')
str(Players)
## 'data.frame': 559 obs. of 28 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Player : Factor w/ 455 levels "Aaron Brooks",..: 244 438 231 196 154 32 208 169 448 393 ...
## $ Country : Factor w/ 43 levels "Argentina","Australia",..: 43 43 43 43 43 33 41 43 36 28 ...
## $ Salary : num 1312611 50000 5400000 104059 950000 ...
## $ Guaranteed : num 456529 50000 16200000 50000 1640000 ...
## $ Position : Factor w/ 5 levels "Center","Point Guard",..: 2 3 3 5 1 1 3 1 1 1 ...
## $ Age : int 22 23 32 30 19 29 24 22 23 24 ...
## $ Team : Factor w/ 31 levels "Atlanta Hawks",..: 9 15 19 1 12 13 17 23 4 21 ...
## $ Player_Efficiency_Rating : num -31.6 35.9 3.5 10.4 23.2 30.3 11.7 8.8 25 20.8 ...
## $ True_Shooting_Percentage : num 0 0.6 0.25 1 0.558 0.635 0.521 0.465 0.589 0.628 ...
## $ Three_Point_Field_Goal_Percentage: num 50 0 0 0 0 0 16.7 0 9.2 0.3 ...
## $ Free_Throw_Percentage : num 0 0 0 0 66.7 71.1 100 44.4 81.5 39.7 ...
## $ Offensive_Rebound_Percentage : num 35.9 32.3 28.1 22.4 20.4 20.4 18.7 17.9 17.2 16.9 ...
## $ Defensive_Rebound_Percentage : num 0 17 8.8 0 16.1 32.9 12.4 20.8 29.8 13.9 ...
## $ Total_Rebound_Percentage : num 18.4 24.8 18.1 11.2 18.2 26.7 15.5 19.4 23.4 15.5 ...
## $ Assist_Percentage : num 0 0 0 0 0 9.7 0 5.9 7.1 5.5 ...
## $ Steal_Percentage : num 0 0 0 0 1.8 2.1 2.6 0 2 1.9 ...
## $ Block_Percentage : num 0 13.4 0 0 9.5 3.6 0 6.7 3 2.9 ...
## $ Turnover_Percentage : num 33.3 16.7 0 50 14.7 11.9 18.8 21.8 7.3 13.3 ...
## $ Usage_Percentage : num 44.2 38.5 14.4 17.5 21.8 25.2 12.7 23.5 21 16.7 ...
## $ Offensive_Win_Shares : num -0.1 0 0 0 0.1 0.5 0 0 0.7 6.4 ...
## $ Defensive_Win_Shares : num 0 0 0 0 0 0.2 0 0 0.3 2.9 ...
## $ Win_Shares : num -0.1 0 0 0 0.1 0.7 0.1 0 1.1 9.3 ...
## $ Win_Shares_Per_48_Minutes : num -1.005 0.099 -0.015 -0.03 0.177 ...
## $ Offense_Box_Plus_Minus : num -29.5 -3.1 -6 -2.3 -4.1 2.5 -3.1 -7.5 1.4 2.2 ...
## $ Defense_Box_Plus_Minus : num -11.9 -5.8 -7.4 -6.1 -1.4 0.8 -4.6 0.6 0.4 1.2 ...
## $ Box_Plus_Minus : num -41.4 -9 -13.4 -8.3 -5.4 3.2 -7.7 -6.9 1.8 3.4 ...
## $ Value_Over_Replacement_Player : num 0 0 0 0 0 0.2 -0.1 0 0.2 3.3 ...
I am choosing Salary as the response, Position as the categorical predictor, and Player Efficiency Rating (PER) as the numeric predictor. Salary is measured in USD; PER is unitless. The levels of Position, in R's alphabetical factor order, are ‘Center’, ‘Point Guard’, ‘Power Forward’, ‘Shooting Guard’, and ‘Small Forward’.
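As a quick sanity check of these choices, something like the following sketch (using only the column names from str() above) summarizes each variable and confirms the factor levels:
summary(Players$Salary)  # response, in USD
levels(Players$Position)  # the five position levels, in factor order
summary(Players$Player_Efficiency_Rating)  # unitless numeric predictor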
1b.
nba_model <- lm(Salary~Player_Efficiency_Rating, data = Players)
plot(Players$Player_Efficiency_Rating,Players$Salary)
abline(nba_model)
The relationship does not appear strong in this plot, although that impression may be driven largely by the very obvious outliers at both ends of the PER range.
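One way to see which points drive that impression is to list the players with the most extreme PER values (a sketch using only columns already in the data):
ord <- order(Players$Player_Efficiency_Rating)
Players[head(ord, 3), c("Player", "Salary", "Player_Efficiency_Rating")]  # lowest PER
Players[tail(ord, 3), c("Player", "Salary", "Player_Efficiency_Rating")]  # highest PER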
1c.
contrasts(Players$Position) <- contr.treatment(5)
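Note that contr.treatment(5) is also R's default coding for unordered factors: dummy (treatment) coding with the first level, ‘Center’, as the baseline. Printing the contrast matrix confirms the coding:
contrasts(Players$Position)  # rows are the levels; columns are the dummies for levels 2-5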
1d.
pos_model <- lm(Salary~Position, data = Players)
summary(pos_model)
##
## Call:
## lm(formula = Salary ~ Position, data = Players)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7594883 -5037956 -2892926 2945702 28577538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7694883 667010 11.536 <2e-16 ***
## Position2 -1589871 947387 -1.678 0.0939 .
## Position3 -51021 965082 -0.053 0.9579
## Position4 -1152367 933579 -1.234 0.2176
## Position5 -1773547 987958 -1.795 0.0732 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7215000 on 554 degrees of freedom
## Multiple R-squared: 0.01061, Adjusted R-squared: 0.003464
## F-statistic: 1.485 on 4 and 554 DF, p-value: 0.2053
#anova(pos_model)
boxplot(Salary~Position, data=Players)
We cannot conclude that there are differences among the mean salaries of the position levels, since the overall F test (F = 1.485 on 4 and 554 df) has a p-value of 0.2053, well above 0.05.
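To keep the F statistic and its p-value distinct, the reported p-value can be recomputed directly from the F distribution:
pf(1.485, df1 = 4, df2 = 554, lower.tail = FALSE)  # about 0.205, the p-value of the overall F test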
1e.
reg <- lm(Salary~Position+Player_Efficiency_Rating, data=Players)
summary(reg)
##
## Call:
## lm(formula = Salary ~ Position + Player_Efficiency_Rating, data = Players)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34631684 -4587239 -2664996 3666922 25031946
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3802416 893601 4.255 2.45e-05 ***
## Position2 -684598 927239 -0.738 0.461
## Position3 549289 937985 0.586 0.558
## Position4 -144056 916772 -0.157 0.875
## Position5 -155595 989235 -0.157 0.875
## Player_Efficiency_Rating 231659 36813 6.293 6.34e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6976000 on 553 degrees of freedom
## Multiple R-squared: 0.07672, Adjusted R-squared: 0.06838
## F-statistic: 9.191 on 5 and 553 DF, p-value: 2.048e-08
plot(Players$Player_Efficiency_Rating, Players$Salary)
# One parallel line per position: a common PER slope (coefficient 6), with the
# intercept shifted by each position dummy. Coefficient order follows the
# factor levels: Center (baseline), then Position2-Position5.
abline(reg$coefficients[1], reg$coefficients[6], col = "red")  # Center (baseline)
abline(reg$coefficients[1] + reg$coefficients[2], reg$coefficients[6], col = "blue")  # Point Guard
abline(reg$coefficients[1] + reg$coefficients[3], reg$coefficients[6], col = "green")  # Power Forward
abline(reg$coefficients[1] + reg$coefficients[4], reg$coefficients[6], col = "purple")  # Shooting Guard
abline(reg$coefficients[1] + reg$coefficients[5], reg$coefficients[6], col = "black")  # Small Forward
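To label the five lines, a legend can be added after the abline() calls (the same call works for the interaction plot in 1f, since the colors follow the same level order):
legend("topleft", legend = levels(Players$Position),
       col = c("red", "blue", "green", "purple", "black"), lty = 1)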
1f.
reg2 <- lm(Salary~Position*Player_Efficiency_Rating, data=Players)
summary(reg2)
##
## Call:
## lm(formula = Salary ~ Position * Player_Efficiency_Rating, data = Players)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19655241 -4676775 -2568509 3567951 24367818
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2812217 2190752 1.284 0.1998
## Position2 -3256822 2589484 -1.258 0.2090
## Position3 -3071372 2818569 -1.090 0.2763
## Position4 2650532 2359318 1.123 0.2617
## Position5 830478 2478741 0.335 0.7377
## Player_Efficiency_Rating 290591 124821 2.328 0.0203 *
## Position2:Player_Efficiency_Rating 217337 156818 1.386 0.1663
## Position3:Player_Efficiency_Rating 265521 170244 1.560 0.1194
## Position4:Player_Efficiency_Rating -203862 134349 -1.517 0.1297
## Position5:Player_Efficiency_Rating -58511 156742 -0.373 0.7091
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6847000 on 549 degrees of freedom
## Multiple R-squared: 0.117, Adjusted R-squared: 0.1025
## F-statistic: 8.084 on 9 and 549 DF, p-value: 2.671e-11
plot(Players$Player_Efficiency_Rating, Players$Salary)
# Now each position gets its own intercept and its own slope
# (the interaction coefficients 7-10 adjust the baseline PER slope).
abline(reg2$coefficients[1], reg2$coefficients[6], col = "red")  # Center (baseline)
abline(reg2$coefficients[1] + reg2$coefficients[2], reg2$coefficients[6] + reg2$coefficients[7], col = "blue")  # Point Guard
abline(reg2$coefficients[1] + reg2$coefficients[3], reg2$coefficients[6] + reg2$coefficients[8], col = "green")  # Power Forward
abline(reg2$coefficients[1] + reg2$coefficients[4], reg2$coefficients[6] + reg2$coefficients[9], col = "purple")  # Shooting Guard
abline(reg2$coefficients[1] + reg2$coefficients[5], reg2$coefficients[6] + reg2$coefficients[10], col = "black")  # Small Forward
1g.
In the interaction model, the only statistically significant relationship was PER vs. Salary for the baseline position, Center (p = 0.02). I learned just how influential outliers are: a single outlier substantially distorted each of my original models. For example, in the visual above it is easy to see how strongly the purple (Shooting Guard) line is pulled by the single point in the bottom right. This also shows why a minimum sample size matters: such an extreme PER is most likely an artifact of that player having logged very few minutes.
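To illustrate that sensitivity, the interaction model could be refit with the most extreme PER values excluded. The cutoff of 35 below is an arbitrary choice for this sketch, not a principled rule:
# Refit without the most extreme PER values (cutoff chosen arbitrarily)
Players_trim <- subset(Players, abs(Player_Efficiency_Rating) < 35)
reg2_trim <- lm(Salary ~ Position * Player_Efficiency_Rating, data = Players_trim)
summary(reg2_trim)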
2a.
library(ISLR)
## Warning: package 'ISLR' was built under R version 3.6.3
data(Carseats)
names(Carseats)
## [1] "Sales" "CompPrice" "Income" "Advertising" "Population"
## [6] "Price" "ShelveLoc" "Age" "Education" "Urban"
## [11] "US"
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
Sales and Price are numeric. Urban and US are both categorical, each with two levels, ‘Yes’ and ‘No’.
2b.
car_model <- lm(Sales~Price+Urban+US, data = Carseats)
summary(car_model)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
2c.
The y-intercept of the model is the Estimate in the (Intercept) row. The Price estimate means that each one-unit increase in price is associated with a decrease in Sales of about 0.054, holding the other predictors fixed. The UrbanYes estimate means that if Urban is ‘Yes’ instead of ‘No’, predicted sales decrease by about 0.022. The USYes estimate means that if US is ‘Yes’ instead of ‘No’, predicted sales increase by about 1.20 in our model. The Std. Error column gives the standard error of each estimate, and the ‘t value’ column gives each estimate’s t statistic. The ‘Pr(>|t|)’ column gives the probability, under the null hypothesis that the coefficient is zero, of observing a t statistic as extreme as or more extreme than the one computed.
2d.
\(y = 13.04 - 0.054x_1 - 0.022x_2 + 1.20x_3\), where \(y\) is sales, \(x_1\) is price, \(x_2 = \begin{cases} 0 & \text{when not urban} \\ 1 & \text{when urban} \end{cases}\), and \(x_3 = \begin{cases} 0 & \text{when not US} \\ 1 & \text{when US} \end{cases}\).
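As a check on this equation, plugging a point into it by hand should match predict(); the example store here (Price = 100, urban, in the US) is arbitrary:
new_store <- data.frame(Price = 100, Urban = "Yes", US = "Yes")  # hypothetical store
predict(car_model, newdata = new_store)
13.043469 - 0.054459 * 100 - 0.021916 * 1 + 1.200573 * 1  # same value by hand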
2e.
Looking at the summary of our model, we can reject the null hypothesis (that the coefficient is zero) for Price and US, as evidenced by the ‘***’ significance codes, but not for Urban.
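The same conclusion can be read off numerically by extracting the p-value column from the coefficient table:
coef(summary(car_model))[, "Pr(>|t|)"]  # p-values for each coefficient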
2f.
car_model_2 <- lm(Sales~Price+US, data = Carseats)
summary(car_model_2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
2g.
anova(car_model)['Residuals','Mean Sq']
## [1] 6.113219
anova(car_model_2)['Residuals','Mean Sq']
## [1] 6.097921
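Because the smaller model is nested in the larger one, anova() also provides a formal partial F test for dropping Urban; it is equivalent to the t test on UrbanYes in the full model (p = 0.936):
anova(car_model_2, car_model)  # partial F test comparing the nested models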
While the two models are very similar in fit, the second model fits the data slightly better, as evidenced by its lower residual mean square (6.098 vs. 6.113).
2h.
confint(car_model)
## 2.5 % 97.5 %
## (Intercept) 11.76359670 14.32334118
## Price -0.06476419 -0.04415351
## UrbanYes -0.55597316 0.51214085
## USYes 0.69130419 1.70984121
These confidence intervals tell us that we can be 95% confident that the true coefficient value (as opposed to the estimate in our model) lies between the two listed values; that is, intervals constructed this way capture the true coefficient in 95% of repeated samples.
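Each interval is simply the estimate plus or minus the 97.5% t quantile times the standard error; for example, the Price row can be reconstructed by hand:
-0.054459 + c(-1, 1) * qt(0.975, df = 396) * 0.005242  # matches the confint row for Price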