#First 5 observations
head(cars);
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
#Dimension of dataset
dim(cars)
## [1] 50 2
create scatter plot with line of best fit
ggplot(data=cars,mapping=aes(speed,dist)) + geom_point() +
geom_smooth(method=lm, se=FALSE)
The scatter plot of stopping distance and speed depicts a linear relationship. The relationship is positive because the trend shows that stopping distance increases as speed of the car increases. However, there seem to be outliers in the distribution, but that would be detected and confirmed by the model.
Correlation
#Correlation
cor(cars$dist,cars$speed)
## [1] 0.8068949
The correlation coefficient of 0.8 shows that there is a strong positive relationship between stopping distance and the speed of the car.
Building a linear regression model
lreg <- lm(dist ~ speed, data=cars)
summary(lreg)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
The model has an intercept of -17.57 and a slope of 3.9 with a p-value of 1.49e-12. Having a Null hypothesis \(H_0\) that slop is equal to zero and a miniature p-value of 1.49e-12, the Null hypothesis \(H_0\) is rejected as the p-values is statistically significant. The R-squared \(R^2\) value of 0.6511 means that the model is reasonable to explain about 65% variation in stopping distance as a result of speed of car. Thus, it is correct to conclude that the model seam reliable and statistically significant.
Histogram
ggplot(data=cars,mapping=aes(lreg$residuals)) +
geom_histogram(aes(y=..density..),binwidth = 5,color='black',fill='grey') +
geom_density(alpha=0.2, fill='lightpink') +
theme(panel.background = element_rect(fill='white'),
axis.line.x=element_line(),
axis.line.y=element_line())+ggtitle('Histogram for model Residuals')
The distribution of the residual depicted by the histogram shows that it is unimodal, centered at approximately zero (i.e zero mean) with presence of outliers in the far right. The histogram is right-skewed indicating that most of the data falls to the right of the histogram. However, the shape of the histogram confirms that the residual is normally distributed. But it is not a perfect normal distribution as the histogram did not maintain a perfect bell shape.
Box Plot
ggplot(data=cars,mapping=aes(lreg$residuals)) +
geom_boxplot(color='black',fill='grey') +
theme(panel.background = element_rect(fill='white'),
axis.line.x=element_line(),
axis.line.y=element_line())+ggtitle('Box plot for model Residuals')
The box plot shows that the median of the residual is less than zero and there are outliers above the upper limit of the data distribution.
Evaluating Linear Regression Assumptions
plot(lreg)
Residual Analysis:
The Residual vs Fitted value plot shows that the error maintains a constant variability. However, there are few data points at the far left of the distribution where fitted values are less than zero.
The Q-Q plot shows that the distribution of the residual is fairly normal with outliers in the far end identifiable with index 23, 35 and 49 of the dataset
Detect and Removing Outliers from Dataset Using Box plot
#Outliers
boxplot.stats(lreg$residuals)$out
## 23 49
## 42.52537 43.20128
#Row numbers
out <-boxplot.stats(lreg$residuals)$out
out_ind <-which(lreg$residuals %in%c(out))
out_ind
## [1] 23 49
#Observations
cars[out_ind,]
## speed dist
## 23 14 80
## 49 24 120
#Displaying outliers on Box plot
boxplot(lreg$residuals,ylab='distance',
main='Box plot of distance per mile')
mtext(paste("Outliers: ",paste(out,collapse = ". ")))
#Lower bound
lower_bound = quantile(lreg$residuals,0.025)
lower_bound
## 2.5%
## -21.3462
#Upper bound
upper_bound = quantile(lreg$residuals,0.975)
upper_bound
## 97.5%
## 39.8862
#Lower $ Upper bound index
outlier_ind <- which(lreg$residuals < lower_bound | lreg$residuals > upper_bound)
outlier_ind
## 23 24 39 49
## 23 24 39 49
#Outliers in the cars dataset
cars[outlier_ind,]
## speed dist
## 23 14 80
## 24 15 20
## 39 20 32
## 49 24 120
Remove outlier from original cars dataset
#create id column
cars$id <- as.integer(rownames(cars))
#cars
#remove outliers
cars2 <- cars[cars$id[-c(23,35,39,49)], ]
#cars2
The dataset consists of two features namely, speed and stopping distance. However, a third variable id was created as a unique identifier to make the data manipulation as easy as possible.
ggplot(data=cars2,mapping=aes(speed,dist)) + geom_point() +
geom_smooth(method=lm, se=FALSE)
Building a linear regression model after removing initial outliers
lreg2 <- lm(dist ~ speed, data=cars2)
summary(lreg2)
##
## Call:
## lm(formula = dist ~ speed, data = cars2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.526 -8.188 -1.232 5.798 25.357
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -16.0635 5.0969 -3.152 0.00292 **
## speed 3.7059 0.3192 11.612 5.47e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.34 on 44 degrees of freedom
## Multiple R-squared: 0.754, Adjusted R-squared: 0.7484
## F-statistic: 134.8 on 1 and 44 DF, p-value: 5.471e-15
Upon removing the initial detected outliers, the intercept of the model shifted to -16.06 and the slope changed to 3.7. However, the model p-value remains statistically significant at 5.471e-15. The R-squared \(R^2\) increased to 0.754 (~75%) and these changes show that the model would do a better prediction when the initial outliers are eliminated from the training dataset.
Histogram after removing initial outliers
ggplot(data=cars2,mapping=aes(lreg2$residuals)) +
geom_histogram(aes(y=..density..),binwidth = 5,color='black',fill='grey') +
geom_density(alpha=0.2, fill='lightpink') +
theme(panel.background = element_rect(fill='white'),
axis.line.x=element_line(),
axis.line.y=element_line())+ggtitle('Histogram for model Residuals')
Box Plot after removing initial outliers
ggplot(data=cars2,mapping=aes(lreg2$residuals)) +
geom_boxplot(color='black',fill='grey') +
theme(panel.background = element_rect(fill='white'),
axis.line.x=element_line(),
axis.line.y=element_line())+ggtitle('Box plot for model Residuals')
Assumptions check after removing initial outliers
plot(lreg2)
Residual Analysis after removing initial outliers:
The Residual vs Fitted value plot shows that the error maintains a constant variability. However, there are few data points at the far left of the distribution where fitted values are less than zero.
The removal of the initial outliers impacted the distribution of the residuals. The new Q-Q plot shows that the distribution of the residual is fairly normal. Though, better without the initial outliers. But there are still new outliers in the far end identifiable with index 22, 34 and 48 of the dataset