In this exercise we fit a simple linear regression model of car stopping distance as a function of speed. We use the built-in cars data set in R.
Let’s preview the cars data set
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
As we can see, there are two columns in this data set: speed is the speed of the car (in mph) and dist is its stopping distance (in ft)
First, look at the distribution of the speed and dist variables
hist(cars$speed)
hist(cars$dist)
From the above histograms we can see that speed is roughly symmetric and bell-shaped, while dist is skewed to the right, so neither variable is exactly normally distributed
Next, look for outliers in the speed and dist variables
boxplot(cars$speed)
boxplot(cars$dist)
From the above box plots we can see that speed is roughly symmetric, without much skew, and that no outliers are detected in it.
The dist variable is right-skewed, with one potential outlier
The outlier is the point with a stopping distance of 120 at speed 24. Let’s replace it with the average stopping distance of the other observations at speed 24
cars[cars$dist == max(cars$dist), 'dist'] = mean(cars[cars$speed == 24 & cars$dist != max(cars$dist),'dist'])
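As a cross-check on the box plot, the 1.5 x IQR rule that boxplot() uses for its whiskers can be applied directly with boxplot.stats(). This is only a minimal sketch, assuming it runs on a fresh copy of the built-in data; the names cars_raw, flagged and replacement are illustrative and not part of the original exercise.

```r
# Work on a fresh copy of the built-in data so the check is repeatable
cars_raw <- datasets::cars

# boxplot.stats() returns the points lying beyond the whiskers in $out
flagged <- boxplot.stats(cars_raw$dist)$out
print(flagged)

# Rows that hold the flagged value(s)
print(cars_raw[cars_raw$dist %in% flagged, ])

# Mean stopping distance of the other observations at speed 24
keep <- cars_raw$speed == 24 & !(cars_raw$dist %in% flagged)
replacement <- mean(cars_raw$dist[keep])
print(replacement)
```

Replacing a value by a group mean is one of several possible treatments; dropping the row or leaving it in and checking its influence on the fit are common alternatives.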
Let’s look at the correlation between speed and distance with a scatter plot
plot(cars$dist ~ cars$speed)
abline(lm(cars$dist ~ cars$speed))
From the above scatter plot we can see that distance and speed are positively correlated. The relationship appears roughly linear, though not perfectly so
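The strength of the relationship seen in the scatter plot can also be put into a number. A small sketch using the Pearson correlation coefficient; the variable names r and ct are illustrative, and the exact value depends on whether the outlier has already been replaced in cars.

```r
# Pearson correlation between speed and stopping distance
r <- cor(cars$speed, cars$dist)
print(r)

# cor.test() adds a confidence interval and a p-value for the correlation
ct <- cor.test(cars$speed, cars$dist)
print(ct)
```

A value close to 1 supports the positive, roughly linear relationship described above, but it says nothing about whether the spread around the line is constant.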
Let’s fit the linear model between speed and distance
lm.cars.fit = lm(dist~speed, data=cars)
Let’s analyze the model summary
summary(lm.cars.fit)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.358 -9.470 -1.370 9.303 42.918
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14.8956 6.1704 -2.414 0.0196 *
## speed 3.7127 0.3794 9.787 5.11e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.04 on 48 degrees of freedom
## Multiple R-squared: 0.6662, Adjusted R-squared: 0.6592
## F-statistic: 95.78 on 1 and 48 DF, p-value: 5.105e-13
Let’s analyze each component one by one
1] Residuals - For a linear regression model with a good fit, we expect the residuals to be roughly normally distributed around 0. In this model the median residual (-1.37) is close to 0, which is good. The first and third quartiles (-9.47 and 9.30) are almost equal in magnitude, so the middle half of the residuals is evenly spread. The minimum and maximum (-27.36 and 42.92) are less balanced, which points to a longer tail of positive residuals.
2] Coefficients - The model above has an intercept of -14.90 and a slope of 3.71. With this we can write the fitted line as distance = -14.90 + 3.71 × speed. For each one-unit increase in speed, the predicted stopping distance increases by about 3.71 units
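To see the fitted line in use, the coefficients reported in the summary output can be plugged in by hand and compared with predict(). This is a sketch only: the snippet rebuilds the cleaned data and the model so that it runs on its own, and the names cars_fixed, fit, by_hand and by_predict, as well as the speed of 20, are chosen purely for illustration.

```r
# Rebuild the cleaned data and the model so the snippet runs on its own
cars_fixed <- datasets::cars
is_max <- cars_fixed$dist == max(cars_fixed$dist)
cars_fixed$dist[is_max] <- mean(cars_fixed$dist[cars_fixed$speed == 24 & !is_max])
fit <- lm(dist ~ speed, data = cars_fixed)

# Prediction by hand: intercept + slope * speed
b <- coef(fit)
by_hand <- unname(b["(Intercept)"] + b["speed"] * 20)

# Same prediction through predict()
by_predict <- unname(predict(fit, newdata = data.frame(speed = 20)))

print(c(by_hand = by_hand, by_predict = by_predict))
```

Both numbers should agree. Predictions for speeds far outside the observed range of 4 to 25 are extrapolations, and the negative intercept shows that the line should not be trusted near speed 0.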
The p-values of the intercept and of speed are both below 0.05, so both coefficients are statistically significant. The very small p-value for speed (5.11e-13) is strong evidence that speed is useful in the model.
The standard errors of the intercept (6.17) and of speed (0.38) are small relative to the estimates. This indicates that there is comparatively little uncertainty in the estimated coefficients
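3] Residual Standard Error - This estimates the standard deviation of the residuals, that is, the typical size of a prediction error. The model above has a residual standard error of 14.04 on 48 degrees of freedom
4] Multiple R Squared - Multiple R squared is the proportion of the variance in distance that the model explains using all its variables. The model above has a multiple R squared of 0.6662, so speed explains about 67% of the variance
5] Adjusted R Square - Adjusted R squared is R squared with a penalty for the number of predictors, which makes it better suited to comparing models of different sizes. The model above has an adjusted R squared of 0.6592. Multiple R squared is never lower than adjusted R squared, so the gap between the two says nothing by itself about the usefulness of speed; with one predictor and 50 observations the two values are nearly equal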
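6] F Statistic - The F statistic tests whether the model with all its features fits better than a model with only an intercept. Since in our case we have just one feature (speed), it carries the same information as the t test for speed: 95.78 is the square of 9.787, and the p-value is the same.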
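The quantities in the list above can be recomputed from the residuals, which helps to see what each one measures. A minimal sketch, assuming the same cleaned data as in the fit above; cars_fixed, fit, rss, tss and the other names are illustrative.

```r
# Rebuild the cleaned data and the model so the snippet runs on its own
cars_fixed <- datasets::cars
is_max <- cars_fixed$dist == max(cars_fixed$dist)
cars_fixed$dist[is_max] <- mean(cars_fixed$dist[cars_fixed$speed == 24 & !is_max])
fit <- lm(dist ~ speed, data = cars_fixed)
s <- summary(fit)

n <- nrow(cars_fixed)  # number of observations
p <- 1                 # number of predictors

rss <- sum(resid(fit)^2)                                 # residual sum of squares
tss <- sum((cars_fixed$dist - mean(cars_fixed$dist))^2)  # total sum of squares

# Residual standard error: sqrt(RSS / residual degrees of freedom)
rse <- sqrt(rss / (n - p - 1))

# R squared: share of the total variation explained by the model
r2 <- 1 - rss / tss

# Adjusted R squared: R squared penalised for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# With a single predictor, F equals the squared t value of the slope
t_speed <- coef(s)["speed", "t value"]
f_value <- unname(s$fstatistic["value"])

print(c(rse = rse, r2 = r2, adj_r2 = adj_r2, t_squared = t_speed^2, f = f_value))
```

The printed values can be compared against the corresponding lines of summary(); confint(fit) gives confidence intervals for the coefficients if a range is preferred over a standard error.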
plot(lm.cars.fit$residuals)
qqnorm(resid(lm.cars.fit))
qqline(resid(lm.cars.fit))
**Let’s focus more on Residual Analysis**
1] Homoscedasticity - The rows of cars are ordered by speed, so the index on the x axis of the residual plot above follows speed. We can see that the spread of the residuals increases as speed increases, whereas we expect a shapeless cloud without any pattern. The residuals are therefore not homoscedastic
2] Y axis Imbalance - Residuals should be balanced on the Y axis, centered around 0. In the above plot they are not: the positive residuals reach further from 0 than the negative ones
3] Nonlinear Pattern - We can see that the residuals show a slight nonlinear pattern. This indicates that the relationship between speed and distance is not strictly linear
4] Normal Distribution - From the normal Q-Q plot above we can see that the residuals depart from the reference line, so they are not exactly normally distributed
All of the above factors indicate that speed alone is not a very good predictor of stopping distance. We will have to add more features to improve the goodness of fit of our model
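The visual residual checks can be backed up with a couple of numbers. This is a rough sketch rather than a full diagnostic workflow: it assumes the same cleaned data as above, uses the Shapiro-Wilk test as one possible normality check, and compares the residual spread in the slower and faster half of the data; all variable names are illustrative.

```r
# Rebuild the cleaned data and the model so the snippet runs on its own
cars_fixed <- datasets::cars
is_max <- cars_fixed$dist == max(cars_fixed$dist)
cars_fixed$dist[is_max] <- mean(cars_fixed$dist[cars_fixed$speed == 24 & !is_max])
fit <- lm(dist ~ speed, data = cars_fixed)
res <- resid(fit)

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal
sw <- shapiro.test(res)
print(sw)

# Rough check of constant variance: residual spread at low vs high speeds
low <- cars_fixed$speed <= median(cars_fixed$speed)
spread <- c(sd_low = sd(res[low]), sd_high = sd(res[!low]))
print(spread)
```

For the graphical versions, plot(lm.cars.fit, which = 1) draws residuals against fitted values and plot(lm.cars.fit, which = 2) draws the normal Q-Q plot.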