Assignment 9

library(tidyverse)
## ── Attaching packages ───────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.1     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
data <- read.csv('Thesis/MASTER_UPDATED.csv')
totalPoints <- data$PTS_home + data$PTS_away
data = subset(data, select = -c(PTS_home,PTS_away))
data$total_points = totalPoints

A)

My categorical variable is the season the data point came from. My response variable will be total points. My numeric predictor will be home team field goal percentage.

B)

slm <- lm(total_points ~ FG_PCT_home, data = data)
summary(slm)
## 
## Call:
## lm(formula = total_points ~ FG_PCT_home, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -63.487 -12.393  -0.758  11.529 107.174 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  135.094      2.216   60.98   <2e-16 ***
## FG_PCT_home  174.863      4.739   36.90   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.49 on 5128 degrees of freedom
## Multiple R-squared:  0.2098, Adjusted R-squared:  0.2097 
## F-statistic:  1362 on 1 and 5128 DF,  p-value: < 2.2e-16

As can be seen in the model summary graphic above, home field goal percentage seems to be significant to the response variable.

C)

season <- cut(data$SEASON, br = c(2015,2016,2017,2018,2019))
data$SEASON <- season

D)

model <- lm(total_points~SEASON, data = data)
summary(model)
## 
## Call:
## lm(formula = total_points ~ SEASON, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -68.145 -13.867  -0.867  13.133 107.133 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       210.7217     0.5366 392.718   <2e-16 ***
## SEASON(2016,2017]   1.4237     0.7620   1.868   0.0618 .  
## SEASON(2017,2018]  11.1455     0.7625  14.616   <2e-16 ***
## SEASON(2018,2019]  11.6399     0.8409  13.842   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.11 on 5126 degrees of freedom
## Multiple R-squared:  0.06567,    Adjusted R-squared:  0.06512 
## F-statistic: 120.1 on 3 and 5126 DF,  p-value: < 2.2e-16
anova(model)
## Analysis of Variance Table
## 
## Response: total_points
##             Df  Sum Sq Mean Sq F value    Pr(>F)    
## SEASON       3  145740   48580  120.09 < 2.2e-16 ***
## Residuals 5126 2073539     405                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(data, aes(y=total_points, x=SEASON, fill=SEASON))+
  geom_boxplot()

According to the ANOVA table F-test, season seems to be a significant variable. However, this is partially visible in the box-plot. It can be seen that the scores seem to increase very slightly each successive season and the low end and high end scores box increase as well.

E)

mlr <- lm(total_points ~ FG_PCT_home + SEASON, data = data)
summary(mlr)
## 
## Call:
## lm(formula = total_points ~ FG_PCT_home + SEASON, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.193 -11.842  -0.806  11.223 101.661 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       129.7739     2.1564  60.180   <2e-16 ***
## FG_PCT_home       174.7002     4.5408  38.473   <2e-16 ***
## SEASON(2016,2017]   1.0934     0.6713   1.629    0.103    
## SEASON(2017,2018]  10.9137     0.6718  16.246   <2e-16 ***
## SEASON(2018,2019]  11.5327     0.7408  15.568   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.72 on 5125 degrees of freedom
## Multiple R-squared:  0.2751, Adjusted R-squared:  0.2745 
## F-statistic: 486.1 on 4 and 5125 DF,  p-value: < 2.2e-16
ggplot(data, aes(x=FG_PCT_home, y=total_points))+geom_point()+
  geom_abline(intercept = mlr$coefficients[1] + mlr$coefficients[3], slope = mlr$coefficients[2], color="red", lwd=1) + geom_abline(intercept = mlr$coefficients[1] + mlr$coefficients[4], slope = mlr$coefficients[2], color="blue", lwd=1) + geom_abline(intercept = mlr$coefficients[1] + mlr$coefficients[5], slope = mlr$coefficients[2], color="green", lwd=1)

2016-17: \(y = 130.86 + 174.70x\)
2017-18: \(y = 140.68 + 174.70x\)
2018-19: \(y = 141.30 + 174.70x\)

F)

model2 <- lm(total_points ~ FG_PCT_home * SEASON, data = data)
summary(model2)
## 
## Call:
## lm(formula = total_points ~ FG_PCT_home * SEASON, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.914 -11.826  -0.793  11.287 101.677 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    133.436      4.103  32.520   <2e-16 ***
## FG_PCT_home                    166.797      8.796  18.962   <2e-16 ***
## SEASON(2016,2017]               -5.099      5.792  -0.880    0.379    
## SEASON(2017,2018]                7.485      5.851   1.279    0.201    
## SEASON(2018,2019]                6.127      6.247   0.981    0.327    
## FG_PCT_home:SEASON(2016,2017]   13.342     12.390   1.077    0.282    
## FG_PCT_home:SEASON(2017,2018]    7.401     12.525   0.591    0.555    
## FG_PCT_home:SEASON(2018,2019]   11.661     13.377   0.872    0.383    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.72 on 5122 degrees of freedom
## Multiple R-squared:  0.2752, Adjusted R-squared:  0.2742 
## F-statistic: 277.9 on 7 and 5122 DF,  p-value: < 2.2e-16
ggplot(data, aes(x = FG_PCT_home, y = total_points)) + geom_point() +
  geom_abline(intercept = model2$coefficients[1] + model2$coefficients[3], slope = model2$coefficients[2] +model2$coefficients[6], color="red", lwd=1) + geom_abline(intercept = model2$coefficients[1] + model2$coefficients[4], slope = model2$coefficients[2] +model2$coefficients[7], color="blue", lwd=1) + geom_abline(intercept = model2$coefficients[1] + model2$coefficients[5], slope = model2$coefficients[2] +model2$coefficients[8], color="green", lwd=1)

G)

First model MSE:

mean(slm$residuals^2)
## [1] 341.8425

Second model MSE:

mean(model$residuals^2)
## [1] 404.1987

Third model MSE:

mean(mlr$residuals^2)
## [1] 313.6189

Fourth model MSE:

mean(model2$residuals^2)
## [1] 313.537

H)

This was an interesting exercise. My data set has many continuous variables that could be significant. It was interesting to see that season was significant. I believe this could be due to the increasing popularity of three point shots in NBA games.