Homework 2

Simple Linear Regression Model Assumptions

Author

Spencer Hamilton

Data and Description

One key component of determining appropriate speed limits is the amount of distance that is required to stop at a given speed. For example, in residential neighborhoods, when pedestrians are commonly in the roadways, it is important to be able to stop in a very short distance to ensure pedestrian safety. The speed of vehicles may be useful for determining the distance required to stop at that given speed, which can aid public officials in determining speed limits.

The Stopping Distance data set compares the distance (column 2) (in feet) required for a car to stop on a certain rural road against the speed (column 1) (MPH) of the car. Download the StoppingDistance.txt file from Canvas, and put it in the same folder as this quarto file.

0. Replace the text “” (above next to “author:”) with your full name.

1. Read in the data set, and call the data frame “stop_dat”.

stopping <- read.table('StoppingDistance.txt', header = TRUE)
summary(stopping)
     Speed          Distance     
 Min.   : 4.00   Min.   :  2.00  
 1st Qu.:10.00   1st Qu.: 13.25  
 Median :17.50   Median : 29.50  
 Mean   :18.92   Mean   : 39.31  
 3rd Qu.:26.75   3rd Qu.: 56.75  
 Max.   :40.00   Max.   :138.00  

2. Create a scatterplot of the data with variables on the appropriate axes (think about which variable makes the most sense to be the response). Make your plot look professional (make sure the axis labels are descriptive).

ggplot(data = stopping) + 
  geom_point(mapping = aes(x = Speed, y = Distance)) + labs(title = "How fast can a car stop?",
  x = "Speed (in MPH)",
  Y =  "Distance required to stop (in feet")

3. In a full sentence, describe the relationship between Speed and Distance. (Hint: your answer should include 3 key words.)

There is a noticeable upward trend, indicating a positive correlation between Speed and Distance required to stop in this scatterplot.

4. Add the OLS regression line to the scatterplot you created in question 2.

ggplot(data = stopping) + 
  geom_smooth(mapping = aes(x = Speed, y = Distance),
              method = "lm",
              se = FALSE) +
  geom_point(mapping = aes(x = Speed, y = Distance)) + labs(title = "How fast can a car stop?",
  x = "Speed (in MPH)",
  Y =  "Distance required to stop (in feet")
`geom_smooth()` using formula = 'y ~ x'

5. (a) Apply linear regression to the data (no transformations yet). (b) Print out a summary of the results from the lm function. (c) Save the residuals and fitted values to the stop_dat dataframe.

#A
lm_speed <- lm(Distance ~ Speed, data = stopping)
#B
summary(lm_speed)

Call:
lm(formula = Distance ~ Speed, data = stopping)

Residuals:
    Min      1Q  Median      3Q     Max 
-25.410  -7.343  -1.334   5.927  35.608 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -20.1309     3.2308  -6.231 5.04e-08 ***
Speed         3.1416     0.1514  20.751  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.77 on 60 degrees of freedom
Multiple R-squared:  0.8777,    Adjusted R-squared:  0.8757 
F-statistic: 430.6 on 1 and 60 DF,  p-value: < 2.2e-16
#C
stopping$residuals <- residuals(stopping)
stopping$fitted_values <- fitted(stopping)

print(head(stopping))
  Speed Distance
1     4        4
2     5        2
3     5        4
4     5        8
5     5        8
6     7        7

6. Mathematically write out the fitted simple linear regression model for this data set using the coefficients you found above. Do not use “x” and “y” in your model - use descriptive variable names.

\(\beta_0\) (intercept - beta with zero subscript)

\(\times\) (multiplication symbol)

\(\text{Weight}_i\) (i subscript on variable name not italicized)

\(\epsilon_i\) (error term with i subscript) >

\(\text{Distance} = \beta_0 + beta_1\) \(\times\) \(\text{Speed} + \epsilon_i\)

Questions 7-11 involve using diagnostics to determine if the linear regression assumptions are met. For each assumption, (1) perform/display appropriate diagnostics to determine if the assumption is violated, and (2) explain whether or not you think the assumption is violated and why you think that.

7. (L) X vs Y is linear

ggplot(data = stopping) + 
  geom_smooth(mapping = aes(x = Speed, y = Distance),
              method = "lm",
              se = FALSE) +
  geom_point(mapping = aes(x = Speed, y = Distance)) + labs(title = "How fast can a car stop?",
  x = "Speed (in MPH)",
  Y =  "Distance required to stop (in feet")
`geom_smooth()` using formula = 'y ~ x'

This is the same code as before to check visually that linearity is shown between our two variables. We can see that most of the data follows the general trend of the line.

8. (I) The errors are independent (no diagnostic tools are required in this particular instance - just think about how the data was collected and briefly write your thoughts)

Thinking about how the data was collected, I think that the errors are independent because these are all separate tests. I assume they are all measured in the same way, so there is no reason to believe they are not independent, as one measurement of speed or distance shouldn’t ever effect the others.

9. (N) The errors are normally distributed. Use at least two diagnostic tools.

stopping$residuals
NULL
# The code below produces a basic histogram (from the in-class analysis #2)
hist(lm_speed$residuals, main = "Histogram of Residuals", xlab = "Residuals")

mean_residual <- mean(lm_speed$residuals)
sd_residual <- sd(lm_speed$residuals)


#Shapiro Test (also in-class #2)
shapiro.test(lm_speed$residuals)

    Shapiro-Wilk normality test

data:  lm_speed$residuals
W = 0.97533, p-value = 0.2449

This histogram of residuals looks approximately normal. This means that the residuals are most likely normal, but to check it I also used a Shapiro-Wilk test, which had a very high p-value. This leads us to conclude that normality is NOT violated.

10. (E) The errors have equal/constant variance across all values of X.

#Code from in-class #2
autoplot(lm_speed, which = 3, ncol = 1, nrow = 1) +
  theme(aspect.ratio = 1)

autoplot(lm_speed, which = 2, ncol = 1, nrow = 1)  +
  theme_bw() + 
  coord_equal()

It seems like this graph may not be as straight as we would dream of for our perfect fitted values. However, it is honestly quite good when we recognize the 3 values which are labeled (helping identify them as possible outliars, or at least points with great influence). So, I conclude that this graph shows that the errors are quite equal across all values of x, given that we live in the real world, and this graph will really never be perfectly straight. This is backed up by the normal QQ plot. It is visually very clear to me that the rest of the points are much closer to their theoretical quantities.

11. Check if there are any influential points deserving of attention.

#plotting the influential points
autoplot(lm_speed, which = 4, ncol = 1, nrow = 1)  +
 theme(aspect.ratio = 1)

As just mentioned, these three points (55, 60, and 62), are influential points. They seem to pull the Scale-Location graph up on the right side, keeping it from being perfectly flat. But that’s okay, we are grateful for these points.

12. Based on your analysis of the diagnostic measures, briefly discuss why this simple linear regression model on the raw data (not transformed) is not appropriate.

Honestly, this data looks quite good to me. There seems to be clear visual linearity in the original scatter plot. The Errors are assumed to be independent. The errors are very normally distributed. The errors are not reasonably equal variance, but that is mostly due to the three influential data points, as seen above.

13. Fix the model by making any necessary transformations. Justify the transformation(s) you choose in words. (Note: if boxCox(mod) throws an error, replace mod with the formula for the linear model, y ~ x.) (Note: you will most likely need to repeat questions 13 and 14 until you are satisfied with the transformation(s) you chose. Only then should you fill out this section - I only want to see the model you end up choosing, not all of your attempted models.)

I will follow the instructions to hopefully get errors that are more equally distributed, and that value the input of the three outliars. I will use invTranPlot to figure out what I cant see visually, and to hopefully get a better grade, and prove that I know what I’m doing.

invTranPlot(Distance ~ Speed, data = stopping, lambda = seq(0, 1, by = 0.1), optimal = TRUE)

     lambda       RSS
1  1.868443  5823.372
2  0.000000 18844.172
3  0.100000 17423.839
4  0.200000 16074.342
5  0.300000 14801.772
6  0.400000 13610.999
7  0.500000 12505.672
8  0.600000 11488.254
9  0.700000 10560.101
10 0.800000  9721.548
11 0.900000  8972.024
12 1.000000  8310.166
#according to invTranPlot, the lambda with the smallest RSS is 1.868443  (5823.372  ), so we will use this in our transformation of the data

lm_speed_opt <- lm(Distance ~ I(Speed^(1.868443)), data = stopping)
lm_speed_opt

Call:
lm(formula = Distance ~ I(Speed^(1.868443)), data = stopping)

Coefficients:
        (Intercept)  I(Speed^(1.868443))  
             3.2257               0.1216  

14. Now, re-check your transformed model and verify that the assumptions (the assumptions that were addressed in questions 7 to 11 above) are met. Provide a brief discussion about how each of the previously violated assumptions are now satisfied. Also, provide the code you used to assess adherence to the assumptions. Note that transforming will not change your responses about (I) the errors being independent, so you can skip that assumption here.

#L (Linear Data)
ggplot(data = stopping) + 
  geom_smooth(mapping = aes(x = I(Speed^(1.868443)), y = Distance),
              method = "lm",
              se = FALSE) +
  geom_point(mapping = aes(x = I(Speed^(1.868443)), y = Distance)) + labs(title = "How fast can a car stop? (Transformed)**",
  x = "Transformed Speed ",
  Y =  "Distance required to stop (in feet")
`geom_smooth()` using formula = 'y ~ x'

#I (independent errors) -- this has the same argument as before

#N (normally distributed errors)
# The code below produces a basic histogram (from the in-class analysis #2)
hist(lm_speed_opt$residuals, main = "Histogram of Residuals", xlab = "Residuals")

mean_residual <- mean(lm_speed_opt$residuals)
sd_residual <- sd(lm_speed_opt$residuals)


#Shapiro Test (also in-class #2)
shapiro.test(lm_speed_opt$residuals)

    Shapiro-Wilk normality test

data:  lm_speed_opt$residuals
W = 0.97343, p-value = 0.1975
#E (Errors are equal) -- lol this graph is actually worse than before
autoplot(lm_speed_opt, which = 3, ncol = 1, nrow = 1) +
  theme(aspect.ratio = 1)

autoplot(lm_speed_opt, which = 2, ncol = 1, nrow = 1)  +
  theme_bw() + 
  coord_equal()

Okay, so I decided to transform the data according to the function invTranPlot. It did a great job, but interestingly… it is easy to see that even in the simple scatterplot (of x and y), that most of the points are on the left side of the scatterplot, and that they spread out at the end. This seemed to have the opposite of the intended effect, making the errors even less similar to the fitted values, as seen in the final graph. This is sad, as it means that we aren’t quite there yet. Something is wrong, but I argued that the data looked fine without a tranformation, and now it looks worse. Sometimes it is best to leave the data, and not wrangle too hard.

15. Mathematically write out the fitted simple linear regression model for this data set using the coefficients you found above from your transformed model. Do not use “x” and “y” in your model - use descriptive variable names.

\(\text{Distance} = \beta_0 + beta_1\) \(\times\) \(\text{TransformedSpeed}^1^.^8 + \epsilon_i\)

16. Plot your new fitted curve on the scatterplot of the original data (on the original scale - not the transformed scale). Do you think this model fits the data better than the original model?

# Tip: To get the fitted curve for your transformed model on the original scale, you have to invert the transformation on the response.  Consider the following example.  Suppose the transformed response is Dlog = log(Distance) (this is not necessarily the transformation you should use, it is just an example). If 'fit' is a variable containing the fitted values on the transformed scale (i.e., they are fitted values for Dlog), then the fitted values for Distance would be efit = exp(fit), since exp() is the inverse of log().
# <your code here>

< your response here >

17. Briefly summarize (1) the purpose of this data set and analysis and (2) what you learned about this data set from your analysis. Write your response as if you were addressing a non-statistician (do not include any numbers or software output).

< your response here >