Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis in chapter 3 of your textbook (visualization, quality evaluation of the model, and residual analysis).

suppressWarnings(suppressMessages(library(ggplot2)))

Step 1: Overview of the cars dataset

attach(cars)
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
typeof(cars)
## [1] "list"
dim(cars)
## [1] 50  2
str(cars)
## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00
#check data quality of cars
#list rows of data that have missing values 
cars[!complete.cases(cars),]
## [1] speed dist 
## <0 rows> (or 0-length row.names)
#boxplot of the attributes in the cars data frame
lmts<- range(cars)
boxplot(cars,ylim=lmts)
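
The same data-quality check can be summarized numerically. The sketch below, which uses only base R, counts missing values per column and lists the values that a boxplot would draw as individual outlier points (beyond 1.5 times the interquartile range). The names `na_per_column` and `outliers` are my own choices for illustration.

```r
# Count missing values in each column (0 means the column is complete)
na_per_column <- colSums(is.na(cars))
print(na_per_column)

# Values that boxplot() would flag as outliers, computed per column
# with boxplot.stats() (default rule: beyond 1.5 * IQR from the hinges)
outliers <- lapply(cars, function(x) boxplot.stats(x)$out)
print(outliers)
```

An empty result for a column means that the boxplot shows no isolated points for it.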

Step 2: Distribution of the attributes

#histogram 
par(mfrow=c(1,2))
hist(cars$speed,freq=F, col='blue', breaks = 5)
lines(density(cars$speed), col='red', lwd=2)
hist(cars$dist,freq=F, col='lightyellow', breaks = 10)
lines(density(cars$dist), col='red', lwd=2)

#check normality with Q-Q plots
par(mfrow=c(1,2))
qqnorm(cars$speed, main='Speed')
qqline(cars$speed)
qqnorm(cars$dist, main='Distance')
qqline(cars$dist)
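
A Q-Q plot is a visual check, so a formal test can complement it. The following is a minimal sketch using the Shapiro-Wilk test from base R (`shapiro.test`); the null hypothesis is that the sample comes from a normal distribution, so a small p-value is evidence against normality. The variable names are illustrative only.

```r
# Shapiro-Wilk normality test for each attribute
sw_speed <- shapiro.test(cars$speed)
sw_dist  <- shapiro.test(cars$dist)

# A p-value below the chosen significance level (commonly 0.05)
# suggests the attribute departs from a normal distribution
print(sw_speed$p.value)
print(sw_dist$p.value)
```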

Step 3: Relationship between speed and distance

#scatter plots with fitted regression lines
par(mfrow=c(1,2))
plot(cars$speed, cars$dist, xlab="Speed", ylab="Distance")
abline(lm(cars$dist ~ cars$speed))
plot(cars$dist, cars$speed, xlab="Distance", ylab="Speed")
abline(lm(cars$speed ~ cars$dist))
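
The report loads ggplot2 but draws its plots with base graphics. For comparison, here is a sketch of the speed-versus-distance scatter plot in ggplot2, where layers are combined with `+`. The axis units (mph and ft) come from the documentation of the cars dataset; the object name `p` is my own.

```r
library(ggplot2)

# Scatter plot of stopping distance against speed with a fitted
# least-squares line (se = FALSE hides the confidence band)
p <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Speed (mph)", y = "Stopping distance (ft)")
print(p)
```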

Based on the histograms and Q-Q plots, speed and distance are each approximately normally distributed, and the scatter plots suggest that the two variables are linearly related. Next, I want to see whether a regression model can predict speed from distance, or distance from speed.

Before fitting the linear regression models, I check the assumptions of simple (single-predictor) linear regression. (https://www.statisticssolutions.com/assumptions-of-linear-regression/)

  1. Linear relationship: yes
  2. Multivariate normality: yes
  3. No or little multicollinearity: not applicable (there is only one predictor)
  4. No auto-correlation (the residuals are independent of one another): yes, the observations are treated as independent
  5. Homoscedasticity (the residuals have equal variance along the regression line): yes
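
Some of these assumptions can also be checked numerically. The following is a rough sketch for the dist ~ speed model, using only base R. The variable names and the choice to split the fitted values at their median are my own assumptions for illustration, not part of the cited page.

```r
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# 1. Linear relationship: correlation between the two variables
r <- cor(cars$speed, cars$dist)
print(r)

# 2. Normality: Shapiro-Wilk test on the residuals
sw_res <- shapiro.test(res)
print(sw_res$p.value)

# 5. Homoscedasticity (rough check): compare the residual spread
#    for the lower and upper halves of the fitted values
lower_half <- fitted(fit) <= median(fitted(fit))
sd_low  <- sd(res[lower_half])
sd_high <- sd(res[!lower_half])
print(c(sd_low = sd_low, sd_high = sd_high))
```

Similar spreads in the two halves are consistent with equal variance; a much larger spread in one half would suggest heteroscedasticity.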

Linear Regression Model 1: estimate speed (y) from distance (x)

\(\widehat{Speed} = 8.2839 + 0.1656 \times Distance\)

lm1<-lm(cars$speed ~ cars$dist)
lm1
## 
## Call:
## lm(formula = cars$speed ~ cars$dist)
## 
## Coefficients:
## (Intercept)    cars$dist  
##      8.2839       0.1656

Linear Regression Model 2: estimate distance (y) from speed (x)

\(\widehat{Distance} = -17.579 + 3.932 \times Speed\)

lm2<-lm(cars$dist ~ cars$speed)
lm2
## 
## Call:
## lm(formula = cars$dist ~ cars$speed)
## 
## Coefficients:
## (Intercept)   cars$speed  
##     -17.579        3.932
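
To use the fitted equation for prediction, it helps to write the formula with column names and a `data` argument, because `predict()` matches the columns of `newdata` against the names in the formula. A minimal sketch follows; the name `fit_dist` and the two speeds (10 and 20) are hypothetical values chosen for illustration.

```r
# Same model as dist ~ speed above, written with data = so that
# predict() can look up "speed" in newdata
fit_dist <- lm(dist ~ speed, data = cars)

# Hypothetical speeds at which to estimate stopping distance
new_speeds <- data.frame(speed = c(10, 20))
pred <- predict(fit_dist, newdata = new_speeds)
print(pred)

# The same numbers follow directly from intercept + slope * speed
manual <- coef(fit_dist)[1] + coef(fit_dist)[2] * new_speeds$speed
print(unname(manual))
```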

Evaluating the quality of the models:

#model 1: Speed ~ Distance
summary(lm1)
## 
## Call:
## lm(formula = cars$speed ~ cars$dist)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5293 -2.1550  0.3615  2.4377  6.4179 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.28391    0.87438   9.474 1.44e-12 ***
## cars$dist    0.16557    0.01749   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.156 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
#model 2: Distance ~ Speed
summary(lm2)
## 
## Call:
## lm(formula = cars$dist ~ cars$speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## cars$speed    3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
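
Both summaries report the same Multiple R-squared (0.6511). In a regression with a single predictor, R-squared equals the squared correlation between the two variables, so it does not depend on which one is the response. The sketch below checks this, and also that the product of the two slopes gives the same value. The names `fit_speed` and `fit_dist` are illustrative stand-ins for the two models.

```r
fit_speed <- lm(speed ~ dist, data = cars)   # model 1
fit_dist  <- lm(dist ~ speed, data = cars)   # model 2

r2_speed <- summary(fit_speed)$r.squared
r2_dist  <- summary(fit_dist)$r.squared
r2_cor   <- cor(cars$speed, cars$dist)^2

# Product of the two slopes (0.16557 * 3.9324 in the summaries)
slope_product <- unname(coef(fit_speed)["dist"] * coef(fit_dist)["speed"])

print(c(r2_speed, r2_dist, r2_cor, slope_product))
```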

Residual Analysis:

par(mfrow=c(1,2))
#model 1: Speed ~ Distance
plot(lm1)

#plot(fitted(lm1),resid(lm1))
#qqnorm(resid(lm1))
#qqline(resid(lm1))

#model 2: Distance ~ Speed
plot(lm2)

#plot(fitted(lm2),resid(lm2))
#qqnorm(resid(lm2))
#qqline(resid(lm2))

Rules for a good model:

(from the book “LinearRegression” https://conservancy.umn.edu/bitstream/handle/11299/189222/LinearRegression_fulltext.pdf?sequence=5&isAllowed=y )

1. If the line is a good fit with the data, we would expect residual values that are normally distributed around a mean of zero.

2. A good model would tend to have a median residual near zero, minimum and maximum values of roughly the same magnitude, and first and third quartile values of roughly the same magnitude.

3. For a good model, we typically would like to see a standard error that is at least five to ten times smaller than the corresponding coefficient.

4. The column labeled Pr(>|t|) gives the significance, or p-value, of the coefficient; it shows the probability that the corresponding coefficient is not relevant in the model.

5. If the residuals are distributed normally, the first and third quartiles of the residuals should be about 1.5 times the residual standard error.

6. The number of degrees of freedom is the total number of measurements or observations used to generate the model, minus the number of coefficients in the model.

7. The Multiple R-squared value is a number between 0 and 1. In general, values of R² that are closer to one indicate a better-fitting model. However, a good model does not necessarily require a large R² value.

8. The Adjusted R-squared value is the R² value modified to take into account the number of predictors used in the model. The adjusted R² is always smaller than the R² value.
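
Several of these rules can be checked with a few lines of R. The sketch below does so for the dist ~ speed model: the ratio of each coefficient to its standard error (rule 3), the degrees of freedom (rule 6), and the adjusted R² recomputed from R² (rule 8). The variable names are my own, and the adjusted R² expression is the standard formula for a model with an intercept.

```r
fit_dist <- lm(dist ~ speed, data = cars)
s <- summary(fit_dist)

# Rule 3: how many times larger each coefficient is than its
# standard error (the absolute value of the t value column)
coefs <- coef(s)
ratio <- abs(coefs[, "Estimate"]) / coefs[, "Std. Error"]
print(ratio)

# Rule 6: observations minus number of coefficients
n <- nrow(cars)
k <- length(coef(fit_dist))
df_resid <- n - k
print(df_resid)

# Rule 8: adjusted R-squared from R-squared, n, and df_resid
adj_r2 <- 1 - (1 - s$r.squared) * (n - 1) / df_resid
print(adj_r2)
```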

Conclusion:

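Based on the summaries of the two models, model 1 (speed ~ dist) and model 2 (dist ~ speed), both models are good: the slope is highly significant in each (p-value 1.49e-12), and the two share the same R-squared of 0.6511. However, model 1, which uses distance to estimate speed, meets the rules above more closely than model 2, which predicts in the opposite direction. Its intercept is about 9.5 times its standard error (8.28391 / 0.87438), versus about 2.6 times for model 2 (17.5791 / 6.7584), and its minimum and maximum residuals (-7.5293 and 6.4179) are closer in magnitude than those of model 2 (-29.069 and 43.201).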