Homework #7, Alex Matteson, Stats 239
Part 1:
A. The response variable is player salary. Player position is a categorical predictor; its levels are the basketball positions Center, Point Guard, Power Forward, Shooting Guard, and Small Forward. A numeric predictor is player efficiency rating. I'm not sure what its units are; it is an advanced metric for how efficient a player is.
B.
m1 <- lm(Salary ~ Player_Efficiency_Rating, data = basketball)
summary(m1)
Call:
lm(formula = Salary ~ Player_Efficiency_Rating, data = basketball)
Residuals:
      Min        1Q    Median        3Q       Max
-35164618  -4652842  -2787110   3662355  24382250
Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)               3654665     555894   6.574 1.13e-10 ***
Player_Efficiency_Rating   235661      35376   6.662 6.52e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6962000 on 557 degrees of freedom
Multiple R-squared: 0.07379, Adjusted R-squared: 0.07213
F-statistic: 44.38 on 1 and 557 DF, p-value: 6.516e-11
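A small arithmetic check, using only numbers copied from the summary output above, may help separate "significant" from "strong". In a simple linear regression the overall F-statistic equals the squared t-statistic of the slope, and R-squared can be recovered from F and the residual degrees of freedom. The variable names below are illustrative only.

```r
# Values copied from the summary(m1) output
t_slope  <- 6.662   # t value for Player_Efficiency_Rating
f_stat   <- 44.38   # overall F-statistic
df_resid <- 557     # residual degrees of freedom

# With one predictor, F is the square of the slope's t-statistic
t_squared <- t_slope^2

# R-squared recovered from F (one numerator degree of freedom)
r_squared <- f_stat / (f_stat + df_resid)

print(round(t_squared, 2))
print(round(r_squared, 4))
```

The slope is clearly different from zero, yet the model accounts for only about 7% of the variation in salary, which is consistent with a scatterplot that does not look strongly linear.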
plot(basketball$Player_Efficiency_Rating, basketball$Salary)
abline(m1)
The slope appears to be significant: the p-value for the Player_Efficiency_Rating coefficient is very low (6.52e-11). However, R-squared is only 0.074, and in the scatterplot the points do not show a clear linear relationship. The line may only make sense for ratings between 0 and 50 on the x-axis.
C. Indicator (dummy) variables for Position, with Center as the reference level:
                Point Guard  Power Forward  Shooting Guard  Small Forward
Center                    0              0               0              0
Point Guard               1              0               0              0
Power Forward             0              1               0              0
Shooting Guard            0              0               1              0
Small Forward             0              0               0              1
D.
m2 <- lm(Salary ~ Position, data = basketball)
summary(m2)
Call:
lm(formula = Salary ~ Position, data = basketball)
Residuals:
     Min       1Q   Median       3Q      Max
-7594883 -5037956 -2892926  2945702 28577538
Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)             7694883     667010  11.536   <2e-16 ***
PositionPoint Guard    -1589871     947387  -1.678   0.0939 .
PositionPower Forward    -51021     965082  -0.053   0.9579
PositionShooting Guard -1152367     933579  -1.234   0.2176
PositionSmall Forward  -1773547     987958  -1.795   0.0732 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7215000 on 554 degrees of freedom
Multiple R-squared: 0.01061, Adjusted R-squared: 0.003464
F-statistic: 1.485 on 4 and 554 DF, p-value: 0.2053
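Each Position coefficient in the summary above is a difference from the reference level, Center. As a rough sketch, R's default treatment coding for a factor with these five levels can be inspected directly. The `position` vector here is a stand-in built from the level names in the output, not the actual `basketball` data.

```r
# Stand-in factor using the position names from the model output;
# levels sort alphabetically, so Center comes first
position <- factor(c("Center", "Point Guard", "Power Forward",
                     "Shooting Guard", "Small Forward"))

# Default treatment contrasts: one 0/1 column per non-reference level
coding <- contrasts(position)
print(coding)

# The reference level is the first level; its row is all zeros
reference <- levels(position)[1]
print(reference)
```

Because Center's row is all zeros, the model intercept is the mean salary for Centers, and the other four coefficients are offsets from that mean.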
anova(m2)
Analysis of Variance Table
Response: Salary
           Df     Sum Sq    Mean Sq F value Pr(>F)
Position    4 3.0918e+14 7.7294e+13  1.4849 0.2053
Residuals 554 2.8838e+16 5.2054e+13
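The F value in the table can be reproduced by hand from the sums of squares and degrees of freedom. The sketch below uses only the numbers printed in the ANOVA table; the variable names are illustrative.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table
ss_position <- 3.0918e14
df_position <- 4
ss_resid    <- 2.8838e16
df_resid    <- 554

# Mean squares are sums of squares divided by their degrees of freedom
ms_position <- ss_position / df_position
ms_resid    <- ss_resid / df_resid

# F is the ratio of between-group to within-group mean squares
f_stat <- ms_position / ms_resid

# Upper-tail probability from the F distribution
p_value <- pf(f_stat, df_position, df_resid, lower.tail = FALSE)

print(round(f_stat, 4))
print(round(p_value, 4))
```

An F close to 1 means the variation between positions is about the same size as the variation within positions, which is why the p-value is large.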
ggplot(basketball, aes(y=Salary, x=Position, fill=Position))+
geom_boxplot()
The ANOVA F-value is 1.485 and its p-value is 0.2053, so we cannot conclude that mean salary differs significantly across the levels of the categorical variable. The box plot agrees: all five positions have very similar medians.
E.
# Parallel-lines model. This call is not shown in the original; it is inferred
# from the indexing below (common slope in position 2, position offsets in 3-6).
m3 <- lm(Salary ~ Player_Efficiency_Rating + Position, data = basketball)
ggplot(basketball, aes(x=Player_Efficiency_Rating, y=Salary, color=Position))+
  geom_point()+
  geom_abline(intercept = m3$coefficients[1], slope=m3$coefficients[2],
              color="red", lwd=1)+
  geom_abline(intercept = m3$coefficients[1]+m3$coefficients[3], slope=m3$coefficients[2],
              color="forestgreen", lwd=1)+
  geom_abline(intercept = m3$coefficients[1]+m3$coefficients[4], slope=m3$coefficients[2],
              color="blue", lwd=1)+
  geom_abline(intercept = m3$coefficients[1]+m3$coefficients[5], slope=m3$coefficients[2],
              color="yellow", lwd=1)+
  geom_abline(intercept = m3$coefficients[1]+m3$coefficients[6], slope=m3$coefficients[2],
              color="green", lwd=1)
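The plot shows the fitted line for each position. All five lines share the slope 231659 and differ only in intercept:
Center (reference): y = 3802416 + 231659 x_i
Point Guard: y = (3802416 - 684598) + 231659 x_i = 3117818 + 231659 x_i
Power Forward: y = (3802416 + 549289) + 231659 x_i = 4351705 + 231659 x_i
Shooting Guard: y = (3802416 - 144056) + 231659 x_i = 3658360 + 231659 x_i
Small Forward: y = (3802416 - 155594) + 231659 x_i = 3646822 + 231659 x_i
F.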
ggplot(basketball, aes(x=Player_Efficiency_Rating, y=Salary, color=Position))+
geom_point()+
geom_abline(intercept = m3$coefficients[1], slope=m3$coefficients[6],
color="red", lwd=1)+
geom_abline(intercept = m3$coefficients[1]+m3$coefficients[2], slope=798511,
color="forestgreen", lwd=1)+
geom_abline(intercept = m3$coefficients[1]+m3$coefficients[3], slope=556111,
color="blue", lwd=1)+
geom_abline(intercept = m3$coefficients[1]+m3$coefficients[4], slope=86728,
color="yellow", lwd=1)+
geom_abline(intercept = m3$coefficients[1]+m3$coefficients[5], slope=232079,
color="green", lwd=1)
*For the slopes I added the coefficients by hand and entered the numbers, because the result came out wrong when I referenced the coefficients directly. It was probably a typo on my part, but I got frustrated and did it by hand.
Point Guard: y = (3802416 - 684598) + 798511 x_i
Power Forward: y = (3802416 + 549289) + 556111 x_i
Shooting Guard: y = (3802416 - 144056) + 86728 x_i
Small Forward: y = (3802416 - 155594) + 232079 x_i
G. The relationship between player efficiency rating and salary appears to be significant and positive. The relationship between salary and position does not appear to be significant, and neither does adding an interaction between player efficiency rating and position. I just realized I may have made a mistake with the Center position: it is the only line that slopes downward when I include the interaction between position and player efficiency rating. I am not sure what went wrong.
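Positional indices such as `m3$coefficients[6]` refer to different terms depending on how the model formula is written, so one possible cause of the downward-sloping Center line is that an index picked up an intercept offset instead of a slope. This is a guess, not a confirmed diagnosis. A minimal sketch of a safer approach is to build each line from coefficient names. The `toy` data frame below is simulated and hypothetical; it only mimics the column names of `basketball`.

```r
set.seed(239)

# Hypothetical stand-in data; the real basketball data is not reproduced here
positions <- c("Center", "Point Guard", "Power Forward",
               "Shooting Guard", "Small Forward")
toy <- data.frame(
  Position = factor(rep(positions, each = 40)),
  Player_Efficiency_Rating = runif(200, 5, 30)
)
toy$Salary <- 3e6 + 2e5 * toy$Player_Efficiency_Rating + rnorm(200, sd = 1e6)

# Interaction model: each position gets its own intercept and slope
fit <- lm(Salary ~ Player_Efficiency_Rating * Position, data = toy)
b <- coef(fit)

# Build each line from coefficient names rather than positions in the vector
group_lines <- data.frame(Position = positions,
                          intercept = NA_real_, slope = NA_real_)
for (i in seq_along(positions)) {
  p <- positions[i]
  if (p == "Center") {
    # Reference level: baseline intercept and baseline slope only
    group_lines$intercept[i] <- b[["(Intercept)"]]
    group_lines$slope[i]     <- b[["Player_Efficiency_Rating"]]
  } else {
    # Other levels: baseline plus that level's offset and interaction term
    group_lines$intercept[i] <- b[["(Intercept)"]] + b[[paste0("Position", p)]]
    group_lines$slope[i]     <- b[["Player_Efficiency_Rating"]] +
      b[[paste0("Player_Efficiency_Rating:Position", p)]]
  }
}
print(group_lines)
```

In an interaction model both the intercept and the slope change by position, so the intercepts from the parallel-lines model cannot be reused. A table like `group_lines` can also be passed to `geom_abline()` through `aes(intercept = intercept, slope = slope)` so that no numbers need to be typed by hand.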
Part 2:
A. Sales is numeric, Price is numeric, Urban is categorical with levels "No" and "Yes", and US is categorical with levels "No" and "Yes".
# Install ISLR only if it is missing; reinstalling a package that is already loaded raises an error
if (!requireNamespace("ISLR", quietly = TRUE)) install.packages("ISLR")
library(ISLR)
data("Carseats")
head(Carseats)
names(Carseats)
[1] "Sales" "CompPrice" "Income" "Advertising" "Population" "Price"
[7] "ShelveLoc" "Age" "Education" "Urban" "US"
summary(Carseats)
     Sales          CompPrice       Income        Advertising      Population
 Min.   : 0.000   Min.   : 77   Min.   : 21.00   Min.   : 0.000   Min.   : 10.0
 1st Qu.: 5.390   1st Qu.:115   1st Qu.: 42.75   1st Qu.: 0.000   1st Qu.:139.0
 Median : 7.490   Median :125   Median : 69.00   Median : 5.000   Median :272.0
 Mean   : 7.496   Mean   :125   Mean   : 68.66   Mean   : 6.635   Mean   :264.8
 3rd Qu.: 9.320   3rd Qu.:135   3rd Qu.: 91.00   3rd Qu.:12.000   3rd Qu.:398.5
 Max.   :16.270   Max.   :175   Max.   :120.00   Max.   :29.000   Max.   :509.0
     Price        ShelveLoc        Age          Education    Urban     US
 Min.   : 24.0   Bad   : 96   Min.   :25.00   Min.   :10.0   No :118   No :142
 1st Qu.:100.0   Good  : 85   1st Qu.:39.75   1st Qu.:12.0   Yes:282   Yes:258
 Median :117.0   Medium:219   Median :54.50   Median :14.0
 Mean   :115.8                Mean   :53.32   Mean   :13.9
 3rd Qu.:131.0                3rd Qu.:66.00   3rd Qu.:16.0
 Max.   :191.0                Max.   :80.00   Max.   :18.0
levels(Carseats$Urban)
[1] "No" "Yes"
levels(Carseats$US)
[1] "No" "Yes"
m1 <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(m1)
Call:
lm(formula = Sales ~ Price + Urban + US, data = Carseats)
Residuals:
    Min      1Q  Median      3Q     Max
-6.9206 -1.6220 -0.0564  1.5786  7.0581
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
Price       -0.054459   0.005242 -10.389  < 2e-16 ***
UrbanYes    -0.021916   0.271650  -0.081    0.936
USYes        1.200573   0.259042   4.635 4.86e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.472 on 396 degrees of freedom
Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
m2<- lm(Sales ~ Price +US, data = Carseats)
summary(m2)
Call:
lm(formula = Sales ~ Price + US, data = Carseats)
Residuals:
    Min      1Q  Median      3Q     Max
-6.9269 -1.6286 -0.0574  1.5766  7.0515
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
Price       -0.05448    0.00523 -10.416  < 2e-16 ***
USYes        1.19964    0.25846   4.641 4.71e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.469 on 397 degrees of freedom
Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
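To make the fitted m2 equation concrete, the coefficients printed above can be plugged into a small function. This is only a sketch using the rounded estimates from the output; `predict_sales` is a hypothetical helper, not part of ISLR, and in practice `predict(m2, newdata = ...)` would do the same job with full precision.

```r
# Rounded estimates copied from summary(m2)
b_intercept <- 13.03079
b_price     <- -0.05448
b_us        <- 1.19964

# Hypothetical helper: fitted Sales for a given Price and US indicator
predict_sales <- function(price, us) {
  b_intercept + b_price * price + b_us * as.numeric(us)
}

# Same price, store inside vs outside the US
sales_us     <- predict_sales(100, TRUE)
sales_non_us <- predict_sales(100, FALSE)

print(sales_us)
print(sales_non_us)
print(sales_us - sales_non_us)
```

At any fixed price, the gap between the two predictions equals the USYes coefficient, and each one-unit increase in Price lowers predicted Sales by about 0.054.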
confint(m2)
                  2.5 %      97.5 %
(Intercept) 11.79032020 14.27126531
Price       -0.06475984 -0.04419543
USYes        0.69151957  1.70776632
We are 95% confident that the true value of each coefficient lies within its confidence interval. None of the intervals includes 0, so we can reject the null hypothesis that each coefficient equals 0 at the 5% significance level.
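The intervals reported by `confint(m2)` can be reproduced from the estimate, its standard error, and a t critical value. The sketch below does this for USYes using the rounded numbers from the summary output, so the result should agree with the table to about three decimal places.

```r
# Estimate, standard error, and residual df copied from summary(m2)
est <- 1.19964
se  <- 0.25846
df  <- 397

# Two-sided 95% critical value from the t distribution
t_crit <- qt(0.975, df)

# Interval: estimate plus or minus critical value times standard error
ci <- est + c(-1, 1) * t_crit * se
print(round(ci, 3))

# The interval excludes 0 exactly when |t| exceeds the critical value
print(abs(est / se) > t_crit)
```

This also shows why the interval check and the t-test always agree: both compare the same estimate and standard error against the same critical value.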