Homework 11

Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis).

rm(list=ls())
library(ggplot2)

The cars dataset has 50 rows and 2 columns. Each row is one observation pairing a car’s speed with the distance it took that car to stop. The columns in the dataset are “speed” and “dist”.

head(cars)

##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

plot(x = cars$speed, y = cars$dist, main="Cars Data - R package", xlab = "Speed(mph)", ylab = "Distance(feet)")

Now, let’s look at the correlation between speed and distance and create the linear regression model.

cars.lm <- lm(cars$dist ~ cars$speed)
cars.lm

## 
## Call:
## lm(formula = cars$dist ~ cars$speed)
## 
## Coefficients:
## (Intercept)   cars$speed  
##     -17.579        3.932

corr<-cor(cars$dist,cars$speed)
(round(corr,4))

## [1] 0.8069

plot(x = cars$speed, y = cars$dist, main="Cars Data - R package", xlab = "Speed(mph)", ylab = "Distance(feet)")
abline(h=mean(cars$dist))
abline(cars.lm, col="red")

The black horizontal line marks the mean stopping distance, and the red line is the fitted regression model. The fit shows that as speed increases, the distance the car travels after the brakes are applied also increases. Now we can look at the quality of the linear model.
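To make the slope concrete, here is a minimal sketch that predicts stopping distance at two speeds. It assumes the model is refit as `fit` with a `data` argument, because a formula written as `cars$dist ~ cars$speed` does not work well with `predict()` on new data. The name `fit` and the speeds 10 and 20 mph are illustrative choices, not part of the original analysis.

```r
# Refit with a data argument so that predict() can accept new speeds.
fit <- lm(dist ~ speed, data = cars)

# Hypothetical speeds (mph), chosen only for illustration.
new_speeds <- data.frame(speed = c(10, 20))
predicted <- predict(fit, newdata = new_speeds)
print(predicted)

# The gap between the two predictions equals 10 times the slope,
# because the model is a straight line.
gap <- unname(predicted[2] - predicted[1])
slope <- unname(coef(fit)["speed"])
print(c(gap = gap, ten_times_slope = 10 * slope))
```

Each additional mph of speed therefore adds about 3.93 feet to the predicted stopping distance, which is the slope reported in the model output.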

summary(cars.lm)

## 
## Call:
## lm(formula = cars$dist ~ cars$speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## cars$speed    3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Linear regression equation: dist = -17.5791 + (3.9324 * speed)
Correlation coefficient: 0.8069
Multiple R-squared: 0.6511
Adjusted R-squared: 0.6438

The reported R-squared of 0.6511 means that the model explains 65.11 percent of the variation in stopping distance.
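As a check on where these numbers come from, the sketch below recomputes R-squared in two ways (the squared correlation, and one minus the ratio of residual to total sum of squares) and then derives the adjusted value. The variable names are illustrative, and the model is refit as `fit` so that the block runs on its own.

```r
fit <- lm(dist ~ speed, data = cars)

# With a single predictor, R-squared is the squared correlation.
r2_from_cor <- cor(cars$dist, cars$speed)^2

# Equivalent definition: share of total variation not left in the residuals.
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((cars$dist - mean(cars$dist))^2)
r2_from_ss <- 1 - ss_res / ss_tot

# Adjusted R-squared penalizes for the number of predictors p.
n <- nrow(cars)
p <- 1
adj_r2 <- 1 - (1 - r2_from_ss) * (n - 1) / (n - p - 1)

print(round(c(r2_from_cor = r2_from_cor,
              r2_from_ss = r2_from_ss,
              adj_r2 = adj_r2), 4))
```

The first two values should agree with the Multiple R-squared in the summary output, and the third with the Adjusted R-squared.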

residuals(cars.lm)

##          1          2          3          4          5          6 
##   3.849460  11.849460  -5.947766  12.052234   2.119825  -7.812584 
##          7          8          9         10         11         12 
##  -3.744993   4.255007  12.255007  -8.677401   2.322599 -15.609810 
##         13         14         15         16         17         18 
##  -9.609810  -5.609810  -1.609810  -7.542219   0.457781   0.457781 
##         19         20         21         22         23         24 
##  12.457781 -11.474628  -1.474628  22.525372  42.525372 -21.407036 
##         25         26         27         28         29         30 
## -15.407036  12.592964 -13.339445  -5.339445 -17.271854  -9.271854 
##         31         32         33         34         35         36 
##   0.728146 -11.204263   2.795737  22.795737  30.795737 -21.136672 
##         37         38         39         40         41         42 
## -11.136672  10.863328 -29.069080 -13.069080  -9.069080  -5.069080 
##         43         44         45         46         47         48 
##   2.930920  -2.933898 -18.866307  -6.798715  15.201285  16.201285 
##         49         50 
##  43.201285   4.268876

summary(residuals(cars.lm))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -29.069  -9.525  -2.272   0.000   9.215  43.201

ggplot(cars.lm, aes(.fitted, .resid)) + 
  geom_point(color = "red", size=2) +
  labs(title = "Fitted Values vs Residuals") +
  labs(x = "Fitted Values") +
  labs(y = "Residuals")

qqnorm(resid(cars.lm))
qqline(resid(cars.lm))

Conclusion

A residual is the difference between an observed value and the corresponding value on the fitted regression line. A positive residual means the observed value lies above the fitted line, and a negative residual means it lies below the fitted line.

For a least-squares fit with an intercept, the mean of the residuals is zero by construction, which the residual summary above confirms (mean 0.000). If the residuals are also approximately normally distributed, the observed values should scatter roughly evenly above and below the fitted line.
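The definition of a residual can be checked directly. This is a small sketch, assuming the model is refit as `fit` (an illustrative name): it subtracts the fitted values from the observed distances, compares the result with `residuals()`, and counts how many observations fall on each side of the line.

```r
fit <- lm(dist ~ speed, data = cars)

# Residual = observed value minus fitted value.
manual_resid <- cars$dist - fitted(fit)

# Should match what residuals() returns, up to floating-point error.
same <- isTRUE(all.equal(unname(manual_resid), unname(residuals(fit))))
print(same)

# The mean is zero up to rounding, whatever the shape of the distribution.
print(mean(manual_resid))

# Number of observations above and below the fitted line.
n_above <- sum(manual_resid > 0)
n_below <- sum(manual_resid < 0)
print(c(above = n_above, below = n_below))
```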

We see that the two ends diverge from the Q-Q plot line. This behavior indicates that the residuals are not normally distributed. The plot suggests that the distribution’s tails are “heavier” than what we would expect from a normal distribution. Speed alone may therefore not be a sufficient predictor of stopping distance in this case.
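A Q-Q plot is a visual judgment, so one possible complement is a formal normality test. The sketch below applies the Shapiro-Wilk test from base R (`shapiro.test`) to the residuals. This test is not part of the original analysis, and `fit` and `sw` are illustrative names.

```r
fit <- lm(dist ~ speed, data = cars)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal.
sw <- shapiro.test(residuals(fit))
print(sw)

# A small p-value (for example, below 0.05) would count as evidence
# against normality and would be consistent with the Q-Q plot reading.
print(sw$p.value < 0.05)
```

With only 50 observations the test has limited power, so it is best read alongside the plot rather than as a replacement for it.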

VStoyanova_Assign11

Violeta Stoyanova

November 7, 2018
