Read the dataset

library(datasets)
dataset = cars
head(dataset)

Visualize the data

summary(dataset)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00
str(dataset)
## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
library(ggplot2)
ggplot(dataset, aes(x = speed, y = dist)) +
  geom_point() + 
  labs(title = "Stopping distance vs. speed")

The scatter plot shows an approximately linear relationship between stopping distance and speed, so we can fit a simple linear regression model.
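As a quick numerical check of this visual impression, the Pearson correlation between the two variables can be computed as well (a minimal sketch using base R's cor(); a value close to 1 supports the linear trend seen in the plot):

cor(dataset$speed, dataset$dist)  # Pearson correlation between speed and stopping distance

Next, the data are split into a training set (2/3) and a test set (1/3) using caTools.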

library(caTools)
set.seed(123)
split = sample.split(dataset$dist, SplitRatio = 2/3)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

Fitting Simple Linear Regression to the Training set

regressor = lm(formula = dist ~ speed,
               data = training_set)
summary(regressor)
## 
## Call:
## lm(formula = dist ~ speed, data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.160  -9.160  -3.595   8.321  44.145 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -23.1912     9.4393  -2.457   0.0198 *  
## speed         4.2176     0.5586   7.551 1.64e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.87 on 31 degrees of freedom
## Multiple R-squared:  0.6478, Adjusted R-squared:  0.6364 
## F-statistic: 57.01 on 1 and 31 DF,  p-value: 1.644e-08

The fitted slope means that each one-unit increase in speed is associated with an increase of about 4.2 units in stopping distance. The adjusted R-squared of 0.636 means that the model explains about 63.6% of the variance in stopping distance, which is not bad. The p-value for speed is well below the 0.05 significance level, so the relationship is statistically significant.
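Since a test set was held out, a natural next step is to predict on it and measure the error. The sketch below reuses the regressor and test_set objects created above and reports the root mean squared error; the names y_pred and rmse are just illustrative:

y_pred = predict(regressor, newdata = test_set)  # predicted stopping distances
rmse = sqrt(mean((test_set$dist - y_pred)^2))    # root mean squared error on the test set
rmse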

library(ggplot2)
ggplot(dataset, aes(x = speed, y = dist)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  labs(title = "Distance vs. speed")

The blue line is the regression line fitted by geom_smooth(), i.e. the stopping distance predicted from speed, and the shaded band around it is the default 95% confidence interval for the fit.
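Note that geom_smooth() refits the model on the full dataset. To visualize the regressor that was actually trained on the training set against the held-out observations, the predictions can be plotted explicitly (a sketch reusing the regressor, training_set and test_set objects from above):

ggplot() +
  geom_point(data = test_set, aes(x = speed, y = dist)) +        # held-out observations
  geom_line(data = training_set,
            aes(x = speed, y = predict(regressor, newdata = training_set)),
            colour = "blue") +                                    # line fitted on the training set
  labs(title = "Test set observations vs. training-set fit")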

Residual diagnostics

par(mfrow=c(2,2)) # Split the plotting panel into a 2 x 2 grid
plot(regressor) # Plot the model information

The first plot (Residuals vs. Fitted) is a scatterplot of the residuals against the fitted values. The residuals are more densely concentrated below the zero line.

The second plot (Normal Q-Q) is a normal probability plot of the residuals. The points fall roughly along a straight line, which suggests that the errors are approximately normally distributed. However, observations 35 and 42 deviate from the line and may affect the model.
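Normality of the residuals can also be checked formally, for example with a Shapiro-Wilk test (a sketch using base R's shapiro.test() on the training-set residuals):

shapiro.test(residuals(regressor))  # H0: the residuals come from a normal distribution

A p-value above 0.05 would be consistent with the impression from the Q-Q plot.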

The third plot (Scale-Location), like the first, looks random, with no obvious pattern, which suggests the residual variance is roughly constant.
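For a formal check of constant residual variance, one option is the Breusch-Pagan test (a sketch; it assumes the lmtest package is installed, which is not used elsewhere in this analysis):

library(lmtest)    # install.packages("lmtest") if needed
bptest(regressor)  # H0: the residual variance is constant (homoscedasticity)

A small p-value would indicate heteroscedasticity.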

The last plot (Residuals vs. Leverage) highlights the points with the greatest influence on the regression, using Cook's distance. Points 23, 39 and 49 stand out as having a large influence on the model.
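To identify the influential observations programmatically rather than reading them off the plot, the Cook's distances can be extracted directly (a sketch using base R's cooks.distance(); the 4/n cutoff is just a common rule of thumb):

cooksd = cooks.distance(regressor)        # one value per training observation
head(sort(cooksd, decreasing = TRUE), 3)  # the three most influential points
which(cooksd > 4 / length(cooksd))        # observations above the 4/n rule-of-thumb cutoff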