** DATA_605_Assignment_11_Thonn - Linear Regression-1 **
# install libraries if needed
#install.packages("permutations")
library(permutations)
## Warning: package 'permutations' was built under R version 3.3.3
##
## Attaching package: 'permutations'
## The following object is masked from 'package:stats':
##
## cycle
#install.packages('gtools')
library(gtools)
#install.packages('gvlma')
library(gvlma)
#install.packages('lmtest')
library(lmtest)
## Warning: package 'lmtest' was built under R version 3.3.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.3.3
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
** Assignment HW 11 **
Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)
# load R dataset cars into a dataframe
cars1 <- cars
head(cars1)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
#print(cars1)
# examine the dataframe cars1 structure
str(cars1)
## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
# create a linear regression model for cars1
cars1.lm <- lm(cars1$dist ~ cars1$speed)
# plot the cars data
plot(cars1$speed,cars1$dist)
abline(cars1.lm)
# Note: the data looks fairly linear and the model fits reasonably well by appearance
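The same fit can also be written with a `data` argument, which makes predictions for new speeds straightforward. The following is a minimal sketch, not part of the original analysis; the names `fit`, `new_speeds`, and `pred`, and the three example speeds, are illustrative assumptions.

```r
# Sketch: refit the model with bare column names and a data argument.
# `fit`, `new_speeds`, and `pred` are illustrative names, not from the original.
cars1 <- cars                          # built-in dataset from the datasets package
fit <- lm(dist ~ speed, data = cars1)

# Coefficients match the cars1$dist ~ cars1$speed fit; only the labels differ
print(coef(fit))

# Predicted stopping distances for a few hypothetical speeds
new_speeds <- data.frame(speed = c(10, 15, 20))
pred <- predict(fit, newdata = new_speeds)
print(pred)
```

With the `cars1$dist ~ cars1$speed` form, `predict()` cannot match a `newdata` column to the predictor, so the `data =` form is usually the more convenient choice when predictions are needed.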
Check a summary of the cars1.lm model
summary(cars1.lm)
##
## Call:
## lm(formula = cars1$dist ~ cars1$speed)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## cars1$speed   3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
# Findings:
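# The p-value of the model (F-test) and of the speed coefficient is 1.49e-12, far below .05, so the model is significant
# The intercept is also significant at the .05 level (p = .0123)
# Multiple R-squared is 0.6511, so speed explains about 65% of the variance in stopping distance
# The equation for the line is y = -17.5791 + 3.9324 * x (y = stopping distance, x = speed)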
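`summary()` reports one p-value per coefficient and a separate one for the overall F-test, and it is easy to mix them up when reading the printout. The sketch below pulls each value out by name; it refits the model with `dist ~ speed` on the built-in `cars` data, and the variable names (`s`, `coef_table`, `p_model`, and so on) are illustrative assumptions.

```r
# Sketch: extract the individual p-values from the model summary.
# All variable names here are illustrative, not from the original.
fit <- lm(dist ~ speed, data = cars)
s <- summary(fit)

# Coefficient table: columns Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(s)
p_intercept <- coef_table["(Intercept)", "Pr(>|t|)"]
p_slope     <- coef_table["speed", "Pr(>|t|)"]

# Overall F-test p-value, computed from the stored F statistic and its df
f <- s$fstatistic                      # named vector: value, numdf, dendf
p_model <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

print(c(intercept = p_intercept, slope = p_slope, model = p_model))
```

In a regression with a single predictor, the slope's t-test and the overall F-test are equivalent, which is why the summary shows the same p-value for both.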
Check the residuals
hist(cars1.lm$residuals)
# the histogram of residuals is roughly bell-shaped but skewed to the right,
# so the residuals are only approximately normal
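A visual read of the histogram can be complemented with a numeric check. The sketch below is one possible approach, not the original author's method: it runs the Shapiro-Wilk test from base R on the residuals and computes a simple moment-based skewness. The names `res`, `sw`, and `skew` are illustrative assumptions.

```r
# Sketch: numeric checks on residual normality (illustrative names throughout).
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Shapiro-Wilk test; the null hypothesis is that the residuals are normal
sw <- shapiro.test(res)
print(sw)

# Moment-based sample skewness: positive values indicate a longer right tail
centered <- res - mean(res)
skew <- mean(centered^3) / mean(centered^2)^(3 / 2)
print(skew)
```

A small Shapiro-Wilk p-value or a skewness well away from zero would agree with the right skew seen in the histogram.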
plot(fitted(cars1.lm),resid(cars1.lm))
# the residuals are scattered on both sides of zero with no clear curve in the residual plot, which supports linearity
# in the Q-Q plot, the sample quantiles follow the theoretical line between about -2 and +1.5, though they diverge above +1.5
# therefore, the residuals are only approximately normal, with some divergence in the upper tail
qqnorm(resid(cars1.lm))
qqline(resid(cars1.lm))
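To see which observations sit in the diverging upper tail of the Q-Q plot, the standardized residuals can be inspected directly. This is a minimal sketch under the assumption that an absolute standardized residual above 2 is worth a look; that cutoff is a common rule of thumb, not something taken from the original analysis, and the names `std_res` and `flagged` are illustrative.

```r
# Sketch: list observations with large standardized residuals.
# The cutoff of 2 is an assumed rule of thumb; names are illustrative.
fit <- lm(dist ~ speed, data = cars)
std_res <- rstandard(fit)              # residuals scaled by their estimated standard deviation

flagged <- which(abs(std_res) > 2)

# Show the flagged rows alongside their standardized residuals
print(cbind(cars[flagged, ], std_res = std_res[flagged]))
```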
Check the gvlma function for overall criteria
gvlma(cars1.lm)
##
## Call:
## lm(formula = cars1$dist ~ cars1$speed)
##
## Coefficients:
## (Intercept)  cars1$speed
##     -17.579        3.932
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = cars1.lm)
##
##                     Value  p-value                   Decision
## Global Stat        15.801 0.003298 Assumptions NOT satisfied!
## Skewness            6.528 0.010621 Assumptions NOT satisfied!
## Kurtosis            1.661 0.197449    Assumptions acceptable.
## Link Function       2.329 0.126998    Assumptions acceptable.
## Heteroscedasticity  5.283 0.021530 Assumptions NOT satisfied!
# Note: the gvlma output flags several assumption violations.
# Findings:
# Three of the five gvlma checks have p-values below .05; the rows are repeated below.
# Global Stat, Skewness, and Heteroscedasticity fail; Kurtosis and Link Function are acceptable.
# Conclusion: the cars1.lm model is less than ideal for the cars data.
# Global Stat 15.801 0.003298 Assumptions NOT satisfied!
# Skewness 6.528 0.010621 Assumptions NOT satisfied!
# Kurtosis 1.661 0.197449 Assumptions acceptable.
# Link Function 2.329 0.126998 Assumptions acceptable.
# Heteroscedasticity 5.283 0.021530 Assumptions NOT satisfied!
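The `lmtest` package is loaded at the top of this assignment, and its `bptest()` function offers a second opinion on the heteroscedasticity finding. The sketch below is an optional follow-up, not part of the original analysis. The Breusch-Pagan test is a different procedure from the one gvlma uses, so the two need not agree. The `requireNamespace()` guard and the names `fit` and `bp` are assumptions added so the block runs even where `lmtest` is not installed.

```r
# Sketch: Breusch-Pagan test for non-constant residual variance.
# Guarded so the block still runs if lmtest is unavailable (an added assumption).
fit <- lm(dist ~ speed, data = cars)

if (requireNamespace("lmtest", quietly = TRUE)) {
  # Null hypothesis: the residual variance is constant (homoscedasticity)
  bp <- lmtest::bptest(fit)
  print(bp)
} else {
  message("lmtest is not installed; skipping the Breusch-Pagan test")
}
```

A p-value below the chosen significance level would point to heteroscedasticity; a larger one would mean this particular test does not detect it.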
** END **