library(datasets)
dataset = cars
head(dataset)
summary(dataset)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00  
str(dataset)
## 'data.frame': 50 obs. of 2 variables:
## $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
## $ dist : num 2 10 4 22 16 10 18 26 34 17 ...
library(ggplot2)
ggplot(dataset, aes(x = speed, y = dist)) +
geom_point() +
labs(title = "Stopping distance vs. speed")
The scatterplot suggests a roughly linear relationship between stopping distance (dist, in feet) and speed (in mph), so simple linear regression is a reasonable model to fit.
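The visual impression of linearity can also be quantified. The snippet below is a minimal sketch using only base R on the built-in cars data; the variable names r and ct are introduced here for illustration.

```r
library(datasets)

# Pearson correlation between speed (mph) and stopping distance (ft).
# Values close to 1 indicate a strong positive linear association.
r = cor(cars$speed, cars$dist)
print(round(r, 3))

# cor.test() adds a confidence interval and a test of H0: correlation = 0
ct = cor.test(cars$speed, cars$dist)
print(ct$conf.int)
```

A correlation of roughly 0.8 supports a linear model, although it says nothing about whether the residuals behave well; that is what the diagnostic plots are for.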
library(caTools)
set.seed(123)
split = sample.split(dataset$dist, SplitRatio = 2/3)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
regressor = lm(formula = dist ~ speed,
data = training_set)
summary(regressor)
##
## Call:
## lm(formula = dist ~ speed, data = training_set)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.160  -9.160  -3.595   8.321  44.145 
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -23.1912     9.4393  -2.457   0.0198 *  
## speed         4.2176     0.5586   7.551 1.64e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.87 on 31 degrees of freedom
## Multiple R-squared: 0.6478, Adjusted R-squared: 0.6364
## F-statistic: 57.01 on 1 and 31 DF, p-value: 1.644e-08
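The fitted slope indicates that each additional 1 mph of speed is associated with an increase of about 4.2 ft in stopping distance. The adjusted R-squared of 0.636 means that the model explains about 63.6% of the variance in stopping distance, which is not bad for a single predictor. The p-value for the slope (1.64e-08) is far below the 0.05 significance level, so the relationship between speed and distance is statistically significant.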
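One way to read the slope, and to make use of the test_set that was split off earlier, is sketched below. It rebuilds the split and the model as in the steps above so that it runs on its own; pred, slope_check, y_pred and rmse are names introduced here for illustration, and the exact numbers depend on the random split.

```r
library(datasets)
library(caTools)

# Rebuild the split and the model as in the steps above
dataset = cars
set.seed(123)
split = sample.split(dataset$dist, SplitRatio = 2/3)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
regressor = lm(dist ~ speed, data = training_set)

# Slope: predictions one mph apart differ by exactly the "speed" coefficient
pred = predict(regressor, newdata = data.frame(speed = c(10, 11)))
slope_check = unname(pred[2] - pred[1])
print(c(slope = unname(coef(regressor)["speed"]), difference = slope_check))

# Held-out check: root mean squared error (in feet) on test_set
y_pred = predict(regressor, newdata = test_set)
rmse = sqrt(mean((test_set$dist - y_pred)^2))
print(rmse)
```

Comparing rmse with the residual standard error reported by summary(regressor) gives a rough sense of whether the model performs similarly on data it has not seen.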
library(ggplot2)
ggplot(dataset, aes(x = speed, y = dist)) +
geom_point() +
geom_smooth(method = lm) +
labs(title = "Distance vs. speed")
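The blue line is the least-squares regression line of dist on speed, and the shaded band around it is its 95% confidence interval. Note that geom_smooth() fits this line to the full dataset passed to ggplot(), not to training_set.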
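The fitted line and its band can also be inspected numerically. The sketch below assumes the same full-data fit that geom_smooth(method = lm) draws; fit_all, new_speeds and band are names introduced here for illustration.

```r
library(datasets)

# Fit the same model that geom_smooth(method = lm) fits: all 50 rows of cars
fit_all = lm(dist ~ speed, data = cars)

# Fitted values and 95% confidence limits for the mean stopping distance
new_speeds = data.frame(speed = c(10, 15, 20))
band = predict(fit_all, newdata = new_speeds,
               interval = "confidence", level = 0.95)
print(cbind(new_speeds, band))
```

The fit column gives the height of the line at each speed, while lwr and upr should correspond to the edges of the shaded band at that speed.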
par(mfrow=c(2,2)) # Split the plotting panel into a 2 x 2 grid
plot(regressor) # Plot the model information
The first plot (residuals vs. fitted values) is a simple scatterplot of residuals against predicted values. The residuals are more concentrated below zero than above it, which is consistent with the negative median residual in the model summary.
The second plot (normal Q-Q) is a normal probability plot. The points fall close to a straight line, which suggests that the residuals are approximately normally distributed. However, observations 35 and 42 deviate from the line and may affect the model.
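The reading of the Q-Q plot can be cross-checked numerically. The sketch below rebuilds the model as above, lists the largest standardized residuals (the observations the plot labels), and runs a Shapiro-Wilk test, which is an addition to the original analysis rather than part of it; std_res and sw are names introduced here for illustration.

```r
library(datasets)
library(caTools)

# Rebuild the split and the model as in the steps above
dataset = cars
set.seed(123)
split = sample.split(dataset$dist, SplitRatio = 2/3)
training_set = subset(dataset, split == TRUE)
regressor = lm(dist ~ speed, data = training_set)

# Standardized residuals, named by the original row numbers of cars;
# the largest absolute values are the points furthest from the Q-Q line
std_res = rstandard(regressor)
print(head(sort(abs(std_res), decreasing = TRUE), 3))

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
sw = shapiro.test(residuals(regressor))
print(sw)
```

With only a few dozen training observations the test has limited power, so it is best read together with the plot rather than instead of it.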
The third plot (Scale-Location), like the first, looks random, with no clear pattern.
The last plot (residuals vs. leverage, with Cook’s distance contours) tells us which points have the greatest influence on the regression. We see that points 23, 49 and 39 have the greatest influence on the model.
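Influence can also be ranked directly instead of being read off the plot. The sketch below rebuilds the model as above and uses the base R functions cooks.distance() and hatvalues(); the 4/n cutoff is only a common rule of thumb, and cook, lev and threshold are names introduced here for illustration.

```r
library(datasets)
library(caTools)

# Rebuild the split and the model as in the steps above
dataset = cars
set.seed(123)
split = sample.split(dataset$dist, SplitRatio = 2/3)
training_set = subset(dataset, split == TRUE)
regressor = lm(dist ~ speed, data = training_set)

# Cook's distance: how much the fitted values change if a point is dropped
cook = cooks.distance(regressor)
print(head(sort(cook, decreasing = TRUE), 3))

# Leverage: how unusual a point's speed is relative to the other points
lev = hatvalues(regressor)
print(head(sort(lev, decreasing = TRUE), 3))

# Rule-of-thumb flag (an assumption, not a formal test): distance above 4/n
threshold = 4 / nrow(training_set)
print(names(cook)[cook > threshold])
```

The names printed are the original row numbers of cars, so they can be compared with the labels shown in the diagnostic plots.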