Using the “cars” dataset in R, build a linear model for stopping distance as a function of speed and replicate the analysis of your textbook chapter 3 (visualization, quality evaluation of the model, and residual analysis.)
# Load the built-in cars dataset; it has 50 observations (rows) and 2 variables (columns)
data(cars)
str(cars)
## 'data.frame': 50 obs. of 2 variables:
## $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
## $ dist : num 2 10 4 22 16 10 18 26 34 17 ...
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
I observed that there are 50 observations (rows) and 2 variables (columns) in the cars dataset, and there is no missing data. Both variables are numeric. Next, I use the lm() function to construct a regression model.
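As a quick check of the missing-data claim (a small addition to the original analysis), the number of NA values in each column can be counted directly:

# Count missing values per column; both counts should be zero
colSums(is.na(cars))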
First, I split the data into a training set and a test set, containing 70% and 30% of the original observations, respectively.
set.seed(1)
# Shuffle the dataset, call the result shuffled
n <- nrow(cars)
shuffled <- cars[sample(n),]
# Split the data in train and test
train <- shuffled[1:round(0.7 * n),]
test <- shuffled[(round(0.7 * n) + 1):n,]
# Print the structure of train and test
str(train)
## 'data.frame': 35 obs. of 2 variables:
## $ speed: num 12 13 16 20 11 20 20 17 16 7 ...
## $ dist : num 24 46 40 64 17 52 56 32 32 4 ...
str(test)
## 'data.frame': 15 obs. of 2 variables:
## $ speed: num 11 19 4 18 8 13 9 15 13 14 ...
## $ dist : num 28 36 10 84 16 26 10 54 34 36 ...
# Use linear regression to build a model of stopping distance as a function of speed
lm_dist <- lm(dist ~ ., data = train)
lm_dist
##
## Call:
## lm(formula = dist ~ ., data = train)
##
## Coefficients:
## (Intercept) speed
## -22.135 4.251
From the result, the fitted model predicts stopping distance as follows:
\(D_{stop} = -22.135 + 4.251 \cdot S_{speed}\)
That is, each one-unit increase in speed is associated with an increase of about 4.251 units in the predicted stopping distance, with the intercept fixed at -22.135.
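For example, the fitted equation predicts a stopping distance of -22.135 + 4.251 * 10 = 20.375 at a speed of 10. The same value can be obtained from the model directly (a small illustrative check):

# Predicted stopping distance at speed = 10 (about 20.4)
predict(lm_dist, data.frame(speed = 10))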
plot(lm_dist)
The residual standard error measures the total variance remaining in the residuals of the model. The first plot (Residuals vs Fitted) shows the residuals against the fitted values; ideally they scatter randomly around zero. The second plot, the Normal Q-Q plot of the standardized residuals, shows that most of the residuals are approximately normally distributed. In the last plot (Residuals vs Leverage), no point lies beyond Cook's distance, so no observation needs to be removed from the analysis. Next, I look at the statistical details of the model using summary().
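Before turning to the summary, the normality of the residuals can also be checked numerically (my addition, beyond the diagnostic plots) with a Shapiro-Wilk test:

# Shapiro-Wilk normality test on the model residuals;
# a large p-value is consistent with normally distributed residuals
shapiro.test(residuals(lm_dist))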
summary(lm_dist)
##
## Call:
## lm(formula = dist ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.882 -10.008 -2.373 8.250 42.623
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -22.136 8.176 -2.707 0.0107 *
## speed 4.251 0.487 8.729 4.34e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.42 on 33 degrees of freedom
## Multiple R-squared: 0.6978, Adjusted R-squared: 0.6886
## F-statistic: 76.2 on 1 and 33 DF, p-value: 4.337e-10
Because there is only one predictor in this linear model, I focus on the t-test results. The Std. Error column shows the standard error of each coefficient estimate. Normally, for a good model, these values are 5-10 times smaller than their corresponding coefficients. Here the standard error of speed is 8.729 times smaller than its coefficient, which means there is little variability in the slope estimate. The standard error of the intercept, however, is only about 2.7 times smaller than its estimate, indicating more variability. The last column, the p-value, shows the probability that the corresponding coefficient is not relevant to the model. The p-value for the intercept is less than 0.05, which means the intercept is significant to the model; the p-value for speed is tiny, so speed is also significant. The multiple R-squared, which lies in the interval (0, 1), shows that the model explains only 69.78% of the variance in stopping distance, leaving 30.22% of the variance unexplained.
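The coefficient-to-standard-error ratios described above can be computed directly from the model summary (a small sketch; these ratios are exactly the t values in the table):

# Ratio of each estimate to its standard error (equals the t value column)
coefs <- summary(lm_dist)$coefficients
coefs[, "Estimate"] / coefs[, "Std. Error"]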
Now I use the regression model to predict stopping distances for the test data.
dist_new <- predict(lm_dist, test)
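Before comparing the distributions, the prediction error can also be quantified (an addition beyond the original analysis) with the root-mean-square error on the test set:

# Root-mean-square error of the predictions on the test set
sqrt(mean((dist_new - test$dist)^2))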
Comparing the predicted and observed distances in a boxplot, the two means look very close.
boxplot(dist_new, test$dist, names = c("dist_new", "test$dist"), col = c("pink", "grey"))
I use the following hypotheses to test whether the means of dist_new and test$dist differ:
H0: There is no difference between the two means.
H1: There is a difference between the two means.
t.test(dist_new,test$dist)
##
## Welch Two Sample t-test
##
## data: dist_new and test$dist
## t = 0.15414, df = 27.987, p-value = 0.8786
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -14.26087 16.58165
## sample estimates:
## mean of x mean of y
## 38.22706 37.06667
From the t-test result, the p-value (0.8786) is larger than 0.05, so at the 95% confidence level I fail to reject the null hypothesis: there is no evidence of a difference between the two means.
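As a final visual check (an addition beyond the analysis above), the predictions can be plotted against the observed test values; points near the dashed 45-degree line indicate good agreement:

# Predicted vs. observed stopping distances on the test set
plot(test$dist, dist_new, xlab = "observed dist (test)", ylab = "predicted dist")
abline(0, 1, lty = 2)  # reference line y = x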
Therefore, I conclude that the predictions of the linear regression model can be trusted.