Assignment instructions

Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?



Build a Regression Model

# Here we select the built-in-to-R data, cars
data(cars)
summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00
# Here we build the regression model where stopping distance is dependent on speed.
lin_reg <- lm(dist ~ speed, data=cars)
summary(lin_reg)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12


Residual Analysis

# Here we look at a plot of the residuals against the regression line
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
ggplot(data = cars, aes(x = speed, y = dist)) +
  geom_jitter() + 
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# If regression were appropriate than the distribution of residuals would look normal.  This looks skewed with a long tail to the right.
ggplot(lin_reg, aes(x = .resid)) +
  geom_histogram(binwidth = 10) 

# This is another test for evaluating normality using a plot of the residuals.  If our distribution were normal we would expect a band of dots evenly spread out from the line across the range of the line.  In this case we don't see that; There is a lot of white space in the bottom and top left quadrants and some white.  Also the dots aren't roughly symmetric across the axis.
ggplot(lin_reg, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0)



Advanced Residual Analysis

I got the following three tests by reviwing the work of classmate, William Jasmine. I wanted to include them here to add them to my repertory.

Here is the link to the beneficial work by William Jasmine: https://rpubs.com/williamzjasmine/1024987

# In the Shapiro-Wilk test, a p-value greater than 0.05 indicates the test is non-significant and that the distribution of the sample is not significantly different from a normal distribution.
# In our case the p-value is less than 0.05 (0.02152) and so our distribution is significantly different from a normal distribution.
shapiro.test(lin_reg$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  lin_reg$residuals
## W = 0.94509, p-value = 0.02152
# In the Breusch-Pagan test, a p-value greater than 0.05 indicates the test is non-significant and that the distribution of the sample linear regression does not have heteroskedasticity.
# Homoskedasticity is when all of the random variables have the same finite variance.  If the variance of your dependent variable changes for higher levels of the independent variable than your linear regression is heteroskedastic.
# Since our p-value is 0.07297 we can conclude that while our sample isn't normally distributed it is homoskedastic.

#install.packages("lmtest")
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(lin_reg)
## 
##  studentized Breusch-Pagan test
## 
## data:  lin_reg
## BP = 3.2149, df = 1, p-value = 0.07297
# Residuals are supposed to be the parts of the outcome not explained by our linear regression model.  If in fact there is correlation between the residuals and our independent variable then we could improve our model to take out that correlation.
# In our case there is no correlation between the residuals and our independent variable so we pass this test.
cor(lin_reg$residuals, lin_reg$fitted.values)
## [1] -7.330371e-18


Was the linear model appropriate?

The linear model wasn’t appropriate because the residuals weren’t normal as determined by either of the first two residual analysis tests we did.

When we did the three advanced tests, first learned about from William Jasmine’s work, we saw that our sample was homoskedastic and there was no more correlation that could be taken out from the residuals to the data by improving the model, however we did confirm with the first of the advanced tests that the residuals weren’t normal.

We could try to take the square of the speed and redo the linear model or try polynomial regression.