suppressWarnings(suppressMessages(library(ggplot2)))
Step 1: Overview of the cars dataset
attach(cars)
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
typeof(cars)
## [1] "list"
dim(cars)
## [1] 50 2
str(cars)
## 'data.frame': 50 obs. of 2 variables:
## $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
## $ dist : num 2 10 4 22 16 10 18 26 34 17 ...
summary(cars)
##      speed           dist
##  Min.   : 4.0   Min.   :  2.00
##  1st Qu.:12.0   1st Qu.: 26.00
##  Median :15.0   Median : 36.00
##  Mean   :15.4   Mean   : 42.98
##  3rd Qu.:19.0   3rd Qu.: 56.00
##  Max.   :25.0   Max.   :120.00
#check data quality of cars
#list rows of data that have missing values
cars[!complete.cases(cars),]
## [1] speed dist
## <0 rows> (or 0-length row.names)
#boxplots of the variables in the cars data frame
lmts<- range(cars)
boxplot(cars,ylim=lmts)
Step 2: Distribution of the attributes
#histogram
par(mfrow=c(1,2))
hist(cars$speed,freq=F, col='blue', breaks = 5)
lines(density(cars$speed), col='red', lwd=2)
hist(cars$dist,freq=F, col='lightyellow', breaks = 10)
lines(density(cars$dist), col='red', lwd=2)
#check normality with Q-Q plots
par(mfrow=c(1,2))
qqnorm(cars$speed, main='Speed')
qqline(cars$speed)
qqnorm(cars$dist, main='Distance')
qqline(cars$dist)
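The Q-Q plots give a visual check only. As a possible numeric complement, the sketch below runs a Shapiro-Wilk test on each variable with base R's `shapiro.test()`. The choice of this particular test and the `normality` table name are assumptions for illustration, not part of the original analysis.

```r
# Shapiro-Wilk test for each variable (null hypothesis: the data are normal)
sw_speed <- shapiro.test(cars$speed)
sw_dist  <- shapiro.test(cars$dist)

# Collect the test statistic W and the p-value in one small table
normality <- data.frame(
  variable = c("speed", "dist"),
  W        = c(unname(sw_speed$statistic), unname(sw_dist$statistic)),
  p_value  = c(sw_speed$p.value, sw_dist$p.value)
)
normality
```

A small p-value would suggest a departure from normality; with only 50 observations the test should be read together with the plots rather than on its own.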
Step 3: Relationship between speed and distance
#scatter plot
par(mfrow=c(1,2))
plot(cars$speed, cars$dist, xlab="Speed", ylab="Distance")
abline(lm(cars$dist ~ cars$speed))
plot(cars$dist, cars$speed, xlab="Distance", ylab="Speed")
abline(lm(cars$speed ~ cars$dist))
Based on the histograms and Q-Q plots, speed and distance are both approximately normally distributed, and the scatter plots suggest that the two variables are linearly related. Now I want to see whether I can use a regression model to predict speed from distance, or distance from speed.
Before fitting the linear regression models, I would like to check the assumptions of simple (single-predictor) linear regression. (https://www.statisticssolutions.com/assumptions-of-linear-regression/)
\(\widehat{Speed} = 8.2839 + 0.1656 \times Distance\)
lm1<-lm(cars$speed ~ cars$dist)
lm1
##
## Call:
## lm(formula = cars$speed ~ cars$dist)
##
## Coefficients:
## (Intercept)    cars$dist
##      8.2839       0.1656
\(\widehat{Distance} = -17.579 + 3.932 \times Speed\)
lm2<-lm(cars$dist ~ cars$speed)
lm2
##
## Call:
## lm(formula = cars$dist ~ cars$speed)
##
## Coefficients:
## (Intercept)   cars$speed
##     -17.579        3.932
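The fitted equations can also be used to predict values for new observations. A minimal sketch follows, assuming the model is refit with the `data =` argument so that `predict()` can match the column name `speed` in `newdata`. The names `fit_dist` and `new_speeds` and the speeds 10, 15 and 20 are hypothetical values chosen for illustration.

```r
# Refit distance on speed using the data= argument
# (same coefficients as lm(cars$dist ~ cars$speed))
fit_dist <- lm(dist ~ speed, data = cars)

# Hypothetical new speeds to predict stopping distance for
new_speeds <- data.frame(speed = c(10, 15, 20))

# Point predictions from the fitted line: intercept + slope * speed
pred <- predict(fit_dist, newdata = new_speeds)
pred

# The same predictions with 95% prediction intervals
predict(fit_dist, newdata = new_speeds, interval = "prediction")
```

With the formula written as `cars$dist ~ cars$speed`, `predict()` cannot find a matching column in `newdata`, which is the main reason to prefer the `data =` form when predictions are needed.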
#model 1: speed ~ dist
summary(lm1)
##
## Call:
## lm(formula = cars$speed ~ cars$dist)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -7.5293 -2.1550  0.3615  2.4377  6.4179
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  8.28391    0.87438   9.474 1.44e-12 ***
## cars$dist    0.16557    0.01749   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.156 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
#model 2: dist ~ speed
summary(lm2)
##
## Call:
## lm(formula = cars$dist ~ cars$speed)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## cars$speed    3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
par(mfrow=c(1,2))
#model 1: speed ~ dist
plot(lm1)
#plot(fitted(lm1),resid(lm1))
#qqnorm(resid(lm1))
#qqline(resid(lm1))
#model 2: dist ~ speed
plot(lm2)
#plot(fitted(lm2),resid(lm2))
#qqnorm(resid(lm2))
#qqline(resid(lm2))
(From the book “LinearRegression”: https://conservancy.umn.edu/bitstream/handle/11299/189222/LinearRegression_fulltext.pdf?sequence=5&isAllowed=y )
1. If the line is a good fit with the data, we would expect residual values that are normally distributed around a mean of zero.
2. A good model would tend to have a median residual near zero, minimum and maximum values of roughly the same magnitude, and first and third quartile values of roughly the same magnitude.
3. For a good model, we typically would like to see a standard error that is at least five to ten times smaller than the corresponding coefficient.
4. The column labeled Pr(>|t|) gives the significance, or p-value, of the coefficient; the book describes it as the probability that the corresponding coefficient is not relevant in the model.
5. If the residuals are distributed normally, the first and third quartiles of the residuals should be about 1.5 times the residual standard error.
6. The number of degrees of freedom is the total number of measurements or observations used to generate the model, minus the number of coefficients in the model.
7. The Multiple R-squared value is a number between 0 and 1. In general, values of R² that are closer to one indicate a better-fitting model. However, a good model does not necessarily require a large R² value.
8. The Adjusted R-squared value is the R² value modified to take into account the number of predictors used in the model. The adjusted R² is always smaller than the R² value.
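Most of the quantities in the checklist above can be read directly from the `summary()` object rather than from the printed output. The sketch below does this for the distance-on-speed model; the object names (`fit`, `s`, `coef_to_se` and so on) are assumptions introduced here for illustration.

```r
# Fit distance on speed and keep the summary object
fit <- lm(dist ~ speed, data = cars)
s   <- summary(fit)

# Items 1-2: residual quartiles (the median should be near zero)
resid_quartiles <- quantile(resid(fit), c(0.25, 0.5, 0.75))

# Item 3: ratio of each coefficient to its standard error
coefs      <- coef(s)
coef_to_se <- abs(coefs[, "Estimate"]) / coefs[, "Std. Error"]

# Item 4: p-values of the coefficients
p_values <- coefs[, "Pr(>|t|)"]

# Item 6: residual degrees of freedom (observations minus coefficients)
df_resid <- df.residual(fit)

# Items 7-8: multiple and adjusted R-squared
r2     <- s$r.squared
adj_r2 <- s$adj.r.squared

resid_quartiles
coef_to_se
p_values
c(df = df_resid, r.squared = r2, adj.r.squared = adj_r2)
```

The same code applied to `lm(speed ~ dist, data = cars)` gives the corresponding values for the other model.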
Based on the summaries of the two models (model 1: speed ~ dist and model 2: dist ~ speed), both models fit well. However, model 1, which uses distance to estimate speed, is more accurate than model 2, which predicts in the opposite direction.
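Both summaries report the same R-squared, and the two residual standard errors are in different units (mph for speed, ft for distance), so they cannot be compared directly. One possible way to put them on a common scale, sketched below, is to divide each residual standard error by the mean of its response. This relative measure and the names `rel_speed` and `rel_dist` are assumptions for illustration, not something taken from the sources cited above.

```r
# Refit both models with the data= argument
fit_speed <- lm(speed ~ dist, data = cars)   # model 1
fit_dist  <- lm(dist ~ speed, data = cars)   # model 2

# R-squared is the squared correlation, so it is identical in both directions
r2_speed <- summary(fit_speed)$r.squared
r2_dist  <- summary(fit_dist)$r.squared
all.equal(r2_speed, cor(cars$speed, cars$dist)^2)
all.equal(r2_speed, r2_dist)

# Residual standard error relative to the mean of each response
rel_speed <- summary(fit_speed)$sigma / mean(cars$speed)
rel_dist  <- summary(fit_dist)$sigma / mean(cars$dist)
c(model1 = rel_speed, model2 = rel_dist)
```

From the values printed in the summaries, this gives roughly 3.156 / 15.4 ≈ 0.20 for model 1 and 15.38 / 42.98 ≈ 0.36 for model 2, which is consistent with model 1 having the smaller relative error.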