Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
Introduction: I will examine an NBA dataset of league-wide per-game averages for the seasons from 1979 to 2022. I am interested in the relationship between Pace (how fast a team plays, estimated as possessions per 48 minutes) and Field Goals Attempted (FGA, field goal attempts per game), and in whether a linear model is appropriate for it.
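library(readr)  # provides read_csv()
link <- 'https://raw.githubusercontent.com/curiostegui/CUNY-SPS/main/Data%20605/sportsref_download.csv'
# skip the first line of the file; the next line supplies the column names
data <- read_csv(link, skip = 1, col_names = TRUE)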
## Rows: 77 Columns: 32
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Season, Lg, Ht
## dbl (29): Rk, Age, Wt, G, MP, FG, FGA, 3P, 3PA, FT, FTA, ORB, DRB, TRB, AST,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# drop rows that contain missing values (NA)
data <- na.omit(data)
# confirm that no missing values remain
which(is.na(data))
## integer(0)
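Note that `na.omit()` drops every row with a missing value in any column. Here that takes the 77 rows read in down to the 44 observations implied by the 42 residual degrees of freedom reported in the model summary. As a minimal sketch of how to see where the missing values sit before dropping them, the toy data frame below (`toy`, with made-up values, not the real dataset) counts NAs per column and rows removed:

```r
# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  Season = c("A", "B", NA, "C"),
  FGA    = c(88.8, 88.4, NA, 88.1),
  Pace   = c(100.3, 99.2, NA, 98.2)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Drop incomplete rows and report how many were removed
clean <- na.omit(toy)
rows_dropped <- nrow(toy) - nrow(clean)
rows_dropped

# TRUE would mean missing values remain
anyNA(clean)
```

On the real data, the same calls would apply to `data` before and after `na.omit()`.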
# scatterplot of Pace against FGA
plot(data$FGA,data$Pace)
# Pearson correlation between the two variables
cor(data$FGA,data$Pace)
## [1] 0.9789283
# fit a simple linear regression of Pace on FGA
model <- lm(Pace ~ FGA, data = data)
summary(model)
##
## Call:
## lm(formula = Pace ~ FGA, data = data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.26336 -0.53816 -0.00309  0.36571  1.76784
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.22780    3.07548   0.074    0.941
## FGA          1.12985    0.03637  31.068   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9065 on 42 degrees of freedom
## Multiple R-squared: 0.9583, Adjusted R-squared: 0.9573
## F-statistic: 965.2 on 1 and 42 DF, p-value: < 2.2e-16
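The Multiple R-squared in this summary can be cross-checked against the correlation printed earlier: in a simple linear regression with a single predictor, R-squared equals the squared Pearson correlation. A small sketch using only the two numbers copied from the output above (the variable names are illustrative):

```r
# Values copied from the printed output
r  <- 0.9789283   # cor(data$FGA, data$Pace)
r2 <- 0.9583      # Multiple R-squared from summary(model)

# With one predictor, R-squared is the square of the correlation
r_squared_from_cor <- round(r^2, 4)
r_squared_from_cor

# Compare with the value reported by summary(model)
matches <- isTRUE(all.equal(r_squared_from_cor, r2))
matches
```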
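# extract the residuals from the fitted model
residuals <- resid(model)
# histogram of the residuals
hist(residuals)
# normal Q-Q plot with a reference line
qqnorm(residuals)
qqline(residuals)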
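A histogram and Q-Q plot address normality only. Two common additions to a residual analysis are a residuals-versus-fitted plot (to look for curvature or changing spread) and a formal normality test. The sketch below is one way to do this, not part of the original analysis. It runs on simulated stand-in data (`sim`, `sim_model`, and the coefficients used to generate them are assumptions) so that it is self-contained.

```r
# Hypothetical stand-in data so the snippet runs on its own
set.seed(605)
sim <- data.frame(FGA = runif(44, min = 80, max = 92))
sim$Pace <- 0.2 + 1.13 * sim$FGA + rnorm(44, sd = 0.9)
sim_model <- lm(Pace ~ FGA, data = sim)

# Residuals vs fitted values: a patternless band around zero supports
# linearity and constant variance
plot(fitted(sim_model), resid(sim_model),
     xlab = "Fitted Pace", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# Shapiro-Wilk test: a small p-value would suggest non-normal residuals
sw <- shapiro.test(resid(sim_model))
sw$p.value
```

With the real data, the same calls would take `model` in place of `sim_model`.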
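A linear model appears appropriate for these two variables. In the scatterplot the points fall close to a straight line, which is consistent with the correlation of 0.979 and an R-squared of 0.958. The slope for FGA is highly significant (p < 2e-16), while the intercept is not distinguishable from zero (p = 0.941). The residuals are centered near zero (median -0.003), and the normal probability plot indicates that they are approximately normally distributed.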