DATA 605 Assignment Week 11

Assignment

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).

Part 1: Hypothesis

Null Hypothesis (H₀):
There is no relationship between a car’s speed and its stopping distance.

Alternative Hypothesis (H₁):
A linear relationship exists between a car’s speed and its stopping distance.
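
Stated in terms of the regression model, these hypotheses amount to a test on the slope. The following is a sketch that assumes the usual simple linear model; the symbols a_0 (intercept), a_1 (slope), and ε (error term) are labels chosen here for illustration:

\[dist = a_0 + a_1 \cdot speed + \varepsilon, \qquad H_0: a_1 = 0, \qquad H_1: a_1 \neq 0\]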

Part 2: Data Inspection

After loading the cars dataset, initial inspection shows two variables, ‘speed’ and ‘dist’ (stopping distance, referred to below as ‘distance’), and 50 rows of observations.

Variable Definition
Speed: The speed of the car (in miles per hour).
Distance (dist): The distance required to stop the car (in feet).

Speed Summary Statistics
The ‘speed’ variable is approximately symmetric, as supported by its similar mean (15.4) and median (15.0). Values range from 4 to 25 mph, and the 1st and 3rd quartiles (12.0 and 19.0) lie at roughly equal distances from the median. A histogram of the values is also consistent with an approximately normal distribution.

Distance Summary Statistics
The ‘distance’ variable appears skewed, as indicated by the difference between its mean (42.98) and median (36.00). Values range from 2 to 120 feet, and the median lies closer to the 1st quartile (26.00) than to the 3rd quartile (56.00), so the upper half of the observations is more spread out than the lower half. Together with a mean that exceeds the median, this indicates that the ‘distance’ observations are skewed to the right. A histogram of the values also shows right skew.

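data(cars)   # loads the built-in 'cars' data frame; data() returns only the dataset name, so its result is not assigned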
#head(cars)
#nrow(cars)
#names(cars)
summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00
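
The skew suggested by the summary statistics can also be quantified directly. The following is a minimal sketch, assuming a moment-based definition of sample skewness; the helper name `skew` is hypothetical and not part of base R.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 1.5.
# `skew` is a hypothetical helper name, not a base R function.
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Positive values indicate right skew, negative values left skew,
# and values near zero indicate approximate symmetry.
sapply(cars, skew)

# Mean minus median for each variable: a quick companion check.
sapply(cars, function(x) mean(x) - median(x))
```

A value near zero for ‘speed’ and a clearly positive value for ‘dist’ would agree with the reading of the summary table above.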

hist(cars$speed,
     main = "Histogram of Car 'Speed' Variable",
     xlab = "Speed (mph)",
     col = "lightblue",
     border = "black")

     #breaks = seq(from = min(cars$speed), to = max(cars$speed), by = 5))

hist(cars$dist,
     main = "Histogram of 'Dist' Variable",
     xlab = "Distance",
     col = "lightblue",
     border = "black")

Part 3: Initial Visualization

Speed and distance values were plotted against each other to visually assess for linearity. The below ‘Stopping Distance vs. Speed’ plot supports the notion that a linear relationship exists between the two variables.

plot(dist ~ speed, data=cars, main="Stopping Distance vs. Speed",
     xlab="Speed (mph)", ylab="Stopping Distance (feet)")
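
The visual impression of linearity can be backed by a number. As a rough sketch, the Pearson correlation coefficient measures the strength of the linear association between the two variables; the variable name `r` is chosen here for illustration.

```r
# Pearson correlation between speed and stopping distance.
r <- cor(cars$speed, cars$dist)
r

# Test of the null hypothesis that the true correlation is zero.
cor.test(cars$speed, cars$dist)
```

In simple linear regression the square of this correlation equals the Multiple R-squared reported by summary(), so `r^2` gives a quick consistency check against the model output.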

Part 4: Linear Function Model

A linear model was fit in R below, with distance as the dependent variable and speed as the independent variable. The estimated y-intercept is a0 = -17.579 and the estimated slope is a1 = 3.932.

The regression model can therefore be represented as: \[Distance = -17.579 + 3.932 \times Speed\]

lm<-lm(cars$dist ~ cars$speed, data=cars)   
lm

## 
## Call:
## lm(formula = cars$dist ~ cars$speed, data = cars)
## 
## Coefficients:
## (Intercept)   cars$speed  
##     -17.579        3.932

plot(dist ~ speed, data=cars, main="Stopping Distance vs. Speed",
     xlab="Speed (mph)", ylab="Stopping Distance (feet)")

abline(lm, col="red")
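
The fitted equation can be used to estimate stopping distance at a given speed. For example, at 20 mph the rounded coefficients give -17.579 + 3.932 × 20 = 61.061, or about 61 feet. The sketch below does the same with predict(). It refits the model with the formula written as `dist ~ speed` so that predict() can match the `speed` column in `newdata`; the object names `fit` and `new_speed` and the 20 mph value are illustrative choices.

```r
# Refit with bare column names so predict() can find `speed` in newdata.
fit <- lm(dist ~ speed, data = cars)

# Hypothetical new observation: a car travelling at 20 mph.
new_speed <- data.frame(speed = 20)

# Predicted stopping distance (feet) from the fitted model.
pred <- predict(fit, newdata = new_speed)
pred

# The same value computed by hand from the unrounded coefficients.
manual <- unname(coef(fit)[1] + coef(fit)[2] * 20)
manual
```

Because 20 mph lies inside the observed speed range (4 to 25 mph), this is an interpolation; predictions far outside that range would be less trustworthy.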

Part 5: Linear Function Model Interpretation

Below, the results of the linear model are interpreted.

summary(lm)

## 
## Call:
## lm(formula = cars$dist ~ cars$speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## cars$speed    3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Residuals
The median residual (-2.272) is relatively near zero, and the first and third quartiles (-9.525 and 9.215) are of similar magnitude, which hints that the residuals are roughly symmetric and approximately normally distributed. However, the minimum (-29.069) and maximum (43.201) are not of similar magnitude, which provides some evidence that the model may not be the best fit for the data.

Coefficients
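As a rule of thumb, a standard error at least five to ten times smaller than the coefficient estimate indicates that the coefficient is estimated precisely. In the summary above, the standard error for the ‘speed’ coefficient (0.4155) is about 9.5 times smaller than the estimate (3.9324); this ratio is the t-value (9.464), and it has a statistically significant p-value of 1.49 × 10^-12. A test statistic of this magnitude means there is little variability in the slope estimate and provides further evidence that the simple linear model is a reasonable fit for this data. The intercept is estimated less precisely: its standard error (6.7584) is only about 2.6 times smaller in magnitude than the estimate (-17.5791), although it remains significant at the 0.05 level (p = 0.0123).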
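
The t-value and p-value in the coefficient table can be reproduced by hand, which makes the link between the estimate, its standard error, and the test explicit. This is a minimal sketch; the object names (`fit`, `ct`, `t_speed`, `p_speed`) are illustrative, and the model is refit with `dist ~ speed` so the coefficient row is simply named `speed`.

```r
fit <- lm(dist ~ speed, data = cars)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|).
ct <- coef(summary(fit))

# t-value = estimate divided by its standard error.
t_speed <- ct["speed", "Estimate"] / ct["speed", "Std. Error"]

# Two-sided p-value from the t distribution with n - 2 degrees of freedom.
p_speed <- 2 * pt(abs(t_speed), df = df.residual(fit), lower.tail = FALSE)

c(t = t_speed, p = p_speed)

# 95% confidence interval for the slope.
confint(fit, "speed", level = 0.95)
```

A confidence interval for the slope that excludes zero leads to the same conclusion as the small p-value.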

Residual Standard Error and Degrees of Freedom
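The residual standard error (15.38) is roughly 1.6 times the magnitude of the 1st and 3rd quartile residuals (9.525 and 9.215). For normally distributed residuals this ratio is expected to be about 1.5, so the residuals appear approximately normally distributed.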

Degrees of freedom equal the total number of observations in the dataset minus the number of estimated coefficients in the simple linear model (here, the intercept and the slope). With 50 observations, this model has 50 - 2 = 48 degrees of freedom.
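
Both quantities can be recomputed from the residuals as a check on the summary output. The following is a sketch with illustrative object names (`fit`, `n`, `k`, `df`, `rse`).

```r
fit <- lm(dist ~ speed, data = cars)

n  <- nrow(cars)          # number of observations
k  <- length(coef(fit))   # number of estimated coefficients (intercept, slope)
df <- n - k               # residual degrees of freedom

# Residual standard error: square root of the residual sum of squares
# divided by the residual degrees of freedom.
rse <- sqrt(sum(resid(fit)^2) / df)
c(df = df, rse = rse)

# Ratio of the residual standard error to the quartile residuals.
rse / abs(quantile(resid(fit), c(0.25, 0.75)))
```

The hand-computed `rse` should match the value returned by `sigma(fit)`.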

The Multiple R-squared Value
The reported R² of 0.6511 for this model means that 65.11% of the variability in stopping distance is explained by speed. The adjusted R² (0.6438) is nearly identical, as expected for a model with a single predictor.
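
The R² value follows directly from its definition as one minus the ratio of the residual sum of squares to the total sum of squares. A short sketch, with illustrative names `fit`, `sse`, `sst`, and `r2`:

```r
fit <- lm(dist ~ speed, data = cars)

sse <- sum(resid(fit)^2)                      # unexplained (residual) variation
sst <- sum((cars$dist - mean(cars$dist))^2)   # total variation in dist

r2 <- 1 - sse / sst
r2
```

The result should agree with `summary(fit)$r.squared`.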

Part 6: Residual Analysis

Below, residual analysis is conducted using a Residual versus Fitted Value Plot and a Q-Q Plot.

Residual versus Fitted Value Plot
For a Residual versus Fitted Value Plot to support the linear model, the residuals should be scattered randomly around the horizontal line where the residual equals zero, with no visible trend and with roughly constant spread. In the Residual versus Fitted Value Plot for our linear model below, the residuals look to be scattered randomly around that line. The assumptions of linearity and constant variance appear to be reasonably satisfied. The plot also hints at outliers, with a few residuals of approximately 40.

plot(fitted(lm),resid(lm))
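
A zero reference line and a smoothed trend can make this plot easier to read. The sketch below is one way to add them, using illustrative axis labels and the object name `fit`; the lowess smoother is an added visual aid, not part of the original analysis.

```r
fit <- lm(dist ~ speed, data = cars)

plot(fitted(fit), resid(fit),
     main = "Residuals vs. Fitted Values",
     xlab = "Fitted values (feet)", ylab = "Residuals (feet)")

# Dashed reference line at residual = 0.
abline(h = 0, lty = 2)

# Smoothed trend of the residuals; a roughly flat curve near zero
# supports the linearity assumption.
lines(lowess(fitted(fit), resid(fit)), col = "red")
```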

Q-Q Plot
For the Q-Q Plot to support our linear model, we would expect the plotted points to follow a straight line, indicating that the residuals are normally distributed. Our model’s Q-Q Plot below suggests that the distribution of the residuals is approximately normal. However, both the right and left tails deviate slightly from the expected straight line, suggesting that the model could be improved.

 qqnorm(resid(lm))
 qqline(resid(lm))
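
A formal normality test can complement the visual reading of the Q-Q Plot. The following sketch applies the Shapiro-Wilk test to the residuals; the choice of this particular test and the object names `fit` and `sw` are assumptions made here for illustration.

```r
fit <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test: the null hypothesis is that the residuals
# come from a normal distribution.
sw <- shapiro.test(resid(fit))
sw
```

A small p-value would indicate a departure from normality, which would be consistent with the tail deviations seen in the plot.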

More Plots
The plots below are R’s standard diagnostic plots for a linear model; they help identify potential outliers and influential observations.

par(mfrow=c(2,2))
plot(lm)
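
The observations highlighted in these plots can also be listed numerically. This is a rough sketch using standardized residuals and Cook's distance; the cutoff of 2 for standardized residuals is a common rule of thumb assumed here, and the object names are illustrative.

```r
fit <- lm(dist ~ speed, data = cars)

# Standardized residuals; |value| > 2 is a rule-of-thumb flag for
# potential outliers (an assumption, not a strict criterion).
std_res <- rstandard(fit)
flagged <- which(abs(std_res) > 2)
cars[flagged, ]

# Cook's distance measures each observation's influence on the fit;
# show the three most influential observations.
cd <- cooks.distance(fit)
head(sort(cd, decreasing = TRUE), 3)
```

Rows flagged here are candidates for closer inspection rather than automatic removal.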

Part 7: Conclusion

Overall, the analysis supports a relationship between speed and stopping distance. The null hypothesis, that there is no relationship between a car’s speed and its stopping distance, can be rejected (p = 1.49 × 10^-12 for the slope). However, there is also evidence that the simple linear model is incomplete and that other variables might influence this relationship, as indicated by the tail deviations in the Q-Q Plot and the presence of outliers. Other factors that might influence the relationship include car type, brake type, car manufacturer, and/or brake manufacturer.