Discussion Board Week 11

Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Introduction: I will examine an NBA dataset that contains league-wide per-game averages from 1979 to 2022. I am interested in the relationship between Pace (an estimate of possessions per 48 minutes, i.e. how fast teams play) and FGA (field goal attempts per game), and in whether a linear model is appropriate for describing it.


Load Data

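library(readr)  # provides read_csv()

link <- 'https://raw.githubusercontent.com/curiostegui/CUNY-SPS/main/Data%20605/sportsref_download.csv'

# skip = 1 drops the first line of the file, so the next line supplies the column names
data <- read_csv(link, skip = 1, col_names = TRUE)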
## Rows: 77 Columns: 32
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): Season, Lg, Ht
## dbl (29): Rk, Age, Wt, G, MP, FG, FGA, 3P, 3PA, FT, FTA, ORB, DRB, TRB, AST,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Cleaning

# drop every row that contains at least one missing value (NA)
data <- na.omit(data)
# confirm that no missing values remain
which(is.na(data))
## integer(0)
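One thing to keep in mind is that na.omit() drops a row if any column has a missing value, not just the two columns used in the model. The sketch below shows the difference on a small made-up data frame (`toy` and its values are hypothetical, not taken from the NBA file). Whether this matters for the real dataset depends on which columns contain the missing values, which I have not checked here.

```r
# Hypothetical toy data, NOT the NBA dataset: row A is missing Ht, row B is missing FGA
toy <- data.frame(
  Season = c("A", "B", "C"),
  FGA    = c(85.1, NA, 88.3),
  Pace   = c(96.0, 97.2, 99.8),
  Ht     = c(NA, "6-6", "6-7")
)

# na.omit() removes any row with an NA in ANY column
all_cols <- na.omit(toy)

# complete.cases() on just the model columns keeps rows where FGA and Pace are both present
model_cols <- toy[complete.cases(toy[, c("FGA", "Pace")]), ]

nrow(all_cols)    # rows that are complete in every column
nrow(model_cols)  # rows that are complete in FGA and Pace
```

With this toy data, the second approach keeps row A even though its Ht value is missing, because Ht is not used in the model.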

Creation of Linear Model

# scatterplot of Pace (y-axis) against FGA (x-axis)
plot(data$FGA, data$Pace)

# Pearson correlation between the two variables
cor(data$FGA, data$Pace)
## [1] 0.9789283
# simple linear regression of Pace on FGA
model <- lm(Pace ~ FGA, data = data)
summary(model)
## 
## Call:
## lm(formula = Pace ~ FGA, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.26336 -0.53816 -0.00309  0.36571  1.76784 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.22780    3.07548   0.074    0.941    
## FGA          1.12985    0.03637  31.068   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9065 on 42 degrees of freedom
## Multiple R-squared:  0.9583, Adjusted R-squared:  0.9573 
## F-statistic: 965.2 on 1 and 42 DF,  p-value: < 2.2e-16
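As a sanity check, several numbers in this summary can be reproduced from one another. In a simple linear regression with one predictor, R-squared equals the squared correlation, the t value is the estimate divided by its standard error, the F-statistic equals the squared t value of the slope, and the sample size is the residual degrees of freedom plus the two estimated coefficients. The sketch below uses only values copied from the output above; the variable names are my own.

```r
# Values copied from the cor() and summary(model) output above
r           <- 0.9789283   # correlation between FGA and Pace
slope       <- 1.12985     # estimated FGA coefficient
slope_se    <- 0.03637     # its standard error
slope_t     <- 31.068      # reported t value for FGA
residual_df <- 42          # residual degrees of freedom

r_squared <- r^2               # compare with Multiple R-squared (0.9583)
t_check   <- slope / slope_se  # compare with the reported t value (differs slightly due to rounding)
f_check   <- slope_t^2         # compare with the F-statistic (965.2)
n_obs     <- residual_df + 2   # intercept + slope = 2 estimated coefficients

round(r_squared, 4)
round(t_check, 2)
round(f_check, 1)
n_obs
```

The implied sample size of 44 is consistent with one observation per season from 1979 to 2022.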

Residual Analysis

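# extract the residuals and look at the shape of their distribution
residuals <- resid(model)
hist(residuals)

# normal Q-Q plot with a reference line; points near the line suggest approximate normality
qqnorm(residuals)
qqline(residuals)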
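The histogram and Q-Q plot only speak to normality. Two further checks that are often added to a residual analysis are a residuals-versus-fitted plot (for non-linearity or non-constant variance) and a Shapiro-Wilk test. The sketch below runs both on simulated stand-in data so that it is self-contained. The data frame `sim`, its coefficients, and the seed are assumptions chosen to loosely resemble the model above, not the real NBA values. To apply the checks to the actual fit, replace `sim_model` with `model`.

```r
# Simulated stand-in data (hypothetical values, NOT the real NBA dataset)
set.seed(605)
sim <- data.frame(FGA = runif(44, min = 80, max = 92))
sim$Pace <- 0.2 + 1.13 * sim$FGA + rnorm(44, sd = 0.9)

sim_model <- lm(Pace ~ FGA, data = sim)
sim_resid <- resid(sim_model)

# Residuals vs. fitted values: a patternless band around zero supports
# both linearity and constant variance
plot(fitted(sim_model), sim_resid,
     xlab = "Fitted Pace", ylab = "Residual",
     main = "Residuals vs. fitted")
abline(h = 0, lty = 2)

# Shapiro-Wilk test: a small p-value would be evidence against normal residuals
sw <- shapiro.test(sim_resid)
sw$p.value
```

A curved or funnel-shaped pattern in the first plot would argue against a linear model even when the Q-Q plot looks fine.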

Conclusion

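A linear model appears appropriate for describing the relationship between Pace and FGA. In the scatterplot, the points fall close to a straight line, which agrees with the strong positive correlation (r = 0.979) and with the model explaining about 96% of the variance in Pace (R-squared = 0.9583). The slope is highly significant (p < 2e-16) and indicates that each additional field goal attempt per game is associated with roughly 1.13 more possessions per 48 minutes, while the intercept is not distinguishable from zero (p = 0.941). The residuals are centered near zero (median of -0.003), and in the normal probability plot they fall close to the reference line, which suggests they are approximately normally distributed.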