1a.
Players <- read.csv('basketball.csv')
str(Players)
## 'data.frame': 559 obs. of 28 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Player : Factor w/ 455 levels "Aaron Brooks",..: 244 438 231 196 154 32 208 169 448 393 ...
## $ Country : Factor w/ 43 levels "Argentina","Australia",..: 43 43 43 43 43 33 41 43 36 28 ...
## $ Salary : num 1312611 50000 5400000 104059 950000 ...
## $ Guaranteed : num 456529 50000 16200000 50000 1640000 ...
## $ Position : Factor w/ 5 levels "Center","Point Guard",..: 2 3 3 5 1 1 3 1 1 1 ...
## $ Age : int 22 23 32 30 19 29 24 22 23 24 ...
## $ Team : Factor w/ 31 levels "Atlanta Hawks",..: 9 15 19 1 12 13 17 23 4 21 ...
## $ Player_Efficiency_Rating : num -31.6 35.9 3.5 10.4 23.2 30.3 11.7 8.8 25 20.8 ...
## $ True_Shooting_Percentage : num 0 0.6 0.25 1 0.558 0.635 0.521 0.465 0.589 0.628 ...
## $ Three_Point_Field_Goal_Percentage: num 50 0 0 0 0 0 16.7 0 9.2 0.3 ...
## $ Free_Throw_Percentage : num 0 0 0 0 66.7 71.1 100 44.4 81.5 39.7 ...
## $ Offensive_Rebound_Percentage : num 35.9 32.3 28.1 22.4 20.4 20.4 18.7 17.9 17.2 16.9 ...
## $ Defensive_Rebound_Percentage : num 0 17 8.8 0 16.1 32.9 12.4 20.8 29.8 13.9 ...
## $ Total_Rebound_Percentage : num 18.4 24.8 18.1 11.2 18.2 26.7 15.5 19.4 23.4 15.5 ...
## $ Assist_Percentage : num 0 0 0 0 0 9.7 0 5.9 7.1 5.5 ...
## $ Steal_Percentage : num 0 0 0 0 1.8 2.1 2.6 0 2 1.9 ...
## $ Block_Percentage : num 0 13.4 0 0 9.5 3.6 0 6.7 3 2.9 ...
## $ Turnover_Percentage : num 33.3 16.7 0 50 14.7 11.9 18.8 21.8 7.3 13.3 ...
## $ Usage_Percentage : num 44.2 38.5 14.4 17.5 21.8 25.2 12.7 23.5 21 16.7 ...
## $ Offensive_Win_Shares : num -0.1 0 0 0 0.1 0.5 0 0 0.7 6.4 ...
## $ Defensive_Win_Shares : num 0 0 0 0 0 0.2 0 0 0.3 2.9 ...
## $ Win_Shares : num -0.1 0 0 0 0.1 0.7 0.1 0 1.1 9.3 ...
## $ Win_Shares_Per_48_Minutes : num -1.005 0.099 -0.015 -0.03 0.177 ...
## $ Offense_Box_Plus_Minus : num -29.5 -3.1 -6 -2.3 -4.1 2.5 -3.1 -7.5 1.4 2.2 ...
## $ Defense_Box_Plus_Minus : num -11.9 -5.8 -7.4 -6.1 -1.4 0.8 -4.6 0.6 0.4 1.2 ...
## $ Box_Plus_Minus : num -41.4 -9 -13.4 -8.3 -5.4 3.2 -7.7 -6.9 1.8 3.4 ...
## $ Value_Over_Replacement_Player : num 0 0 0 0 0 0.2 -0.1 0 0.2 3.3 ...
I am choosing Salary as the response, Position as the categorical predictor, and Player Efficiency Rating (PER) as the numeric predictor. Salary is measured in USD; PER is unitless. The levels of Position, in R's alphabetical factor order, are ‘Center’, ‘Point Guard’, ‘Power Forward’, ‘Shooting Guard’, and ‘Small Forward’.
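As a quick sanity check of these choices, something like the following sketch (using only the column names from str() above) summarizes each variable and confirms the factor levels:
summary(Players$Salary)  # response, in USD
levels(Players$Position)  # the five position levels, in factor order
summary(Players$Player_Efficiency_Rating)  # unitless numeric predictor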
1b.
nba_model <- lm(Salary~Player_Efficiency_Rating, data = Players)
plot(Players$Player_Efficiency_Rating,Players$Salary)
abline(nba_model)
The relationship does not appear strong in this plot, although that impression may be driven largely by the very obvious outliers at both ends of the PER range.
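One way to see which points drive that impression is to list the players with the most extreme PER values (a sketch using only columns already in the data):
ord <- order(Players$Player_Efficiency_Rating)
Players[head(ord, 3), c("Player", "Salary", "Player_Efficiency_Rating")]  # lowest PER
Players[tail(ord, 3), c("Player", "Salary", "Player_Efficiency_Rating")]  # highest PER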
1c.
contrasts(Players$Position) <- contr.treatment(5)
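Note that contr.treatment(5) is also R's default coding for unordered factors: dummy (treatment) coding with the first level, ‘Center’, as the baseline. Printing the contrast matrix confirms the coding:
contrasts(Players$Position)  # rows are the levels; columns are the dummies for levels 2-5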
1d.
pos_model <- lm(Salary~Position, data = Players)
summary(pos_model)
##
## Call:
## lm(formula = Salary ~ Position, data = Players)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7594883 -5037956 -2892926 2945702 28577538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7694883 667010 11.536 <2e-16 ***
## Position2 -1589871 947387 -1.678 0.0939 .
## Position3 -51021 965082 -0.053 0.9579
## Position4 -1152367 933579 -1.234 0.2176
## Position5 -1773547 987958 -1.795 0.0732 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7215000 on 554 degrees of freedom
## Multiple R-squared: 0.01061, Adjusted R-squared: 0.003464
## F-statistic: 1.485 on 4 and 554 DF, p-value: 0.2053
#anova(pos_model)
boxplot(Salary~Position, data=Players)
We cannot conclude that there are differences among the mean salaries of the position levels, since the overall F test (F = 1.485 on 4 and 554 df) has a p-value of 0.2053, well above 0.05.
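To keep the F statistic and its p-value distinct, the reported p-value can be recomputed directly from the F distribution:
pf(1.485, df1 = 4, df2 = 554, lower.tail = FALSE)  # about 0.205, the p-value of the overall F test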
1e.
reg <- lm(Salary~Position+Player_Efficiency_Rating, data=Players)
summary(reg)
##
## Call:
## lm(formula = Salary ~ Position + Player_Efficiency_Rating, data = Players)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34631684 -4587239 -2664996 3666922 25031946
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3802416 893601 4.255 2.45e-05 ***
## Position2 -684598 927239 -0.738 0.461
## Position3 549289 937985 0.586 0.558
## Position4 -144056 916772 -0.157 0.875
## Position5 -155595 989235 -0.157 0.875
## Player_Efficiency_Rating 231659 36813 6.293 6.34e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6976000 on 553 degrees of freedom
## Multiple R-squared: 0.07672, Adjusted R-squared: 0.06838
## F-statistic: 9.191 on 5 and 553 DF, p-value: 2.048e-08
plot(Players$Player_Efficiency_Rating, Players$Salary)
# One parallel line per position: a common PER slope (coefficient 6), with the
# intercept shifted by each position dummy. Coefficient order follows the
# factor levels: Center (baseline), then Position2-Position5.
abline(reg$coefficients[1], reg$coefficients[6], col = "red")  # Center (baseline)
abline(reg$coefficients[1] + reg$coefficients[2], reg$coefficients[6], col = "blue")  # Point Guard
abline(reg$coefficients[1] + reg$coefficients[3], reg$coefficients[6], col = "green")  # Power Forward
abline(reg$coefficients[1] + reg$coefficients[4], reg$coefficients[6], col = "purple")  # Shooting Guard
abline(reg$coefficients[1] + reg$coefficients[5], reg$coefficients[6], col = "black")  # Small Forward
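To label the five lines, a legend can be added after the abline() calls (the same call works for the interaction plot in 1f, since the colors follow the same level order):
legend("topleft", legend = levels(Players$Position),
       col = c("red", "blue", "green", "purple", "black"), lty = 1)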
1f.
reg2 <- lm(Salary~Position*Player_Efficiency_Rating, data=Players)
summary(reg2)
##
## Call:
## lm(formula = Salary ~ Position * Player_Efficiency_Rating, data = Players)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19655241 -4676775 -2568509 3567951 24367818
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2812217 2190752 1.284 0.1998
## Position2 -3256822 2589484 -1.258 0.2090
## Position3 -3071372 2818569 -1.090 0.2763
## Position4 2650532 2359318 1.123 0.2617
## Position5 830478 2478741 0.335 0.7377
## Player_Efficiency_Rating 290591 124821 2.328 0.0203 *
## Position2:Player_Efficiency_Rating 217337 156818 1.386 0.1663
## Position3:Player_Efficiency_Rating 265521 170244 1.560 0.1194
## Position4:Player_Efficiency_Rating -203862 134349 -1.517 0.1297
## Position5:Player_Efficiency_Rating -58511 156742 -0.373 0.7091
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6847000 on 549 degrees of freedom
## Multiple R-squared: 0.117, Adjusted R-squared: 0.1025
## F-statistic: 8.084 on 9 and 549 DF, p-value: 2.671e-11
plot(Players$Player_Efficiency_Rating, Players$Salary)
# Now each position gets its own intercept and its own slope
# (the interaction coefficients 7-10 adjust the baseline PER slope).
abline(reg2$coefficients[1], reg2$coefficients[6], col = "red")  # Center (baseline)
abline(reg2$coefficients[1] + reg2$coefficients[2], reg2$coefficients[6] + reg2$coefficients[7], col = "blue")  # Point Guard
abline(reg2$coefficients[1] + reg2$coefficients[3], reg2$coefficients[6] + reg2$coefficients[8], col = "green")  # Power Forward
abline(reg2$coefficients[1] + reg2$coefficients[4], reg2$coefficients[6] + reg2$coefficients[9], col = "purple")  # Shooting Guard
abline(reg2$coefficients[1] + reg2$coefficients[5], reg2$coefficients[6] + reg2$coefficients[10], col = "black")  # Small Forward
1g.
In the interaction model, the only statistically significant relationship was PER vs. Salary for the baseline position, Center (p = 0.02). I learned just how influential outliers are: a single outlier substantially distorted each of my original models. For example, in the visual above it is easy to see how strongly the purple (Shooting Guard) line is pulled by the single point in the bottom right. This also shows why a minimum sample size matters: such an extreme PER is most likely an artifact of that player having logged very few minutes.
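To illustrate that sensitivity, the interaction model could be refit with the most extreme PER values excluded. The cutoff of 35 below is an arbitrary choice for this sketch, not a principled rule:
# Refit without the most extreme PER values (cutoff chosen arbitrarily)
Players_trim <- subset(Players, abs(Player_Efficiency_Rating) < 35)
reg2_trim <- lm(Salary ~ Position * Player_Efficiency_Rating, data = Players_trim)
summary(reg2_trim)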
2a.
library(ISLR)
## Warning: package 'ISLR' was built under R version 3.6.3
data(Carseats)
names(Carseats)
## [1] "Sales" "CompPrice" "Income" "Advertising" "Population"
## [6] "Price" "ShelveLoc" "Age" "Education" "Urban"
## [11] "US"
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
Sales and Price are numeric. Urban and US are both categorical, each with two levels, ‘Yes’ and ‘No’.
2b.
car_model <- lm(Sales~Price+Urban+US, data = Carseats)
summary(car_model)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
2c.
The y-intercept of the model is the Estimate in the (Intercept) row. The Price estimate means that each one-unit increase in price is associated with a decrease in Sales of about 0.054, holding the other predictors fixed. The UrbanYes estimate means that if Urban is ‘Yes’ instead of ‘No’, predicted sales decrease by about 0.022. The USYes estimate means that if US is ‘Yes’ instead of ‘No’, predicted sales increase by about 1.20 in our model. The Std. Error column gives the standard error of each estimate, and the ‘t value’ column gives each estimate’s t statistic. The ‘Pr(>|t|)’ column gives the probability, under the null hypothesis that the coefficient is zero, of observing a t statistic as extreme as or more extreme than the one computed.
2d.
\(y = 13.04 - 0.054x_1 - 0.022x_2 + 1.20x_3\), where \(y\) is sales, \(x_1\) is price, \(x_2 = \begin{cases} 0 & \text{when not urban} \\ 1 & \text{when urban} \end{cases}\), and \(x_3 = \begin{cases} 0 & \text{when not US} \\ 1 & \text{when US} \end{cases}\).
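As a check on this equation, plugging a point into it by hand should match predict(); the example store here (Price = 100, urban, in the US) is arbitrary:
new_store <- data.frame(Price = 100, Urban = "Yes", US = "Yes")  # hypothetical store
predict(car_model, newdata = new_store)
13.043469 - 0.054459 * 100 - 0.021916 * 1 + 1.200573 * 1  # same value by hand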
2e.
Looking at the summary of our model, we can reject the null hypothesis (that the coefficient is zero) for Price and US, as evidenced by the ‘***’ significance codes, but not for Urban.
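The same conclusion can be read off numerically by extracting the p-value column from the coefficient table:
coef(summary(car_model))[, "Pr(>|t|)"]  # p-values for each coefficient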
2f.
car_model_2 <- lm(Sales~Price+US, data = Carseats)
summary(car_model_2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
2g.
anova(car_model)['Residuals','Mean Sq']
## [1] 6.113219
anova(car_model_2)['Residuals','Mean Sq']
## [1] 6.097921
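Because the smaller model is nested in the larger one, anova() also provides a formal partial F test for dropping Urban; it is equivalent to the t test on UrbanYes in the full model (p = 0.936):
anova(car_model_2, car_model)  # partial F test comparing the nested models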
While the two models are very similar in fit, the second model fits the data slightly better, as evidenced by its lower residual mean square (6.098 vs. 6.113).
2h.
confint(car_model)
## 2.5 % 97.5 %
## (Intercept) 11.76359670 14.32334118
## Price -0.06476419 -0.04415351
## UrbanYes -0.55597316 0.51214085
## USYes 0.69130419 1.70984121
These confidence intervals tell us that we can be 95% confident that the true coefficient value (as opposed to the estimate in our model) lies between the two listed values; that is, intervals constructed this way capture the true coefficient in 95% of repeated samples.
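Each interval is simply the estimate plus or minus the 97.5% t quantile times the standard error; for example, the Price row can be reconstructed by hand:
-0.054459 + c(-1, 1) * qt(0.975, df = 396) * 0.005242  # matches the confint row for Price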