data(cars)
cars_df <- cars
head(cars_df, 10)
##    speed dist
## 1      4    2
## 2      4   10
## 3      7    4
## 4      7   22
## 5      8   16
## 6      9   10
## 7     10   18
## 8     10   26
## 9     10   34
## 10    11   17
str(cars_df)
## 'data.frame': 50 obs. of 2 variables:
## $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
## $ dist : num 2 10 4 22 16 10 18 26 34 17 ...
These data were recorded in the 1920s and give the speed of cars and the distance each took to stop. The data set contains 50 observations of two numeric variables: speed, measured in miles per hour (mph), and stopping distance (dist), measured in feet (ft).
library(ggplot2)  # provides ggplot(), geom_point(), geom_smooth(), labs()

ggplot(cars_df, aes(speed, dist)) +
  geom_point(size = 2, alpha = .4) +
  geom_smooth(method = "lm", se = FALSE, alpha = .2) +
  labs(title = "Speed vs Stopping Distance",
       x = "Speed (mph)",
       y = "Stopping distance (ft)")
# Fit a simple linear regression with speed as the response and dist as the predictor
lm_cars <- lm(speed ~ dist, data = cars_df)
summary(lm_cars)
##
## Call:
## lm(formula = speed ~ dist, data = cars_df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -7.5293 -2.1550  0.3615  2.4377  6.4179
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  8.28391    0.87438   9.474 1.44e-12 ***
## dist         0.16557    0.01749   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.156 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
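Because the model was fitted as speed ~ dist, speed is the response, and the fitted line can be expressed as:
\(speed = 8.28391 + 0.16557 * distance\)
This implies that each additional foot of stopping distance is associated with an increase of about 0.166 mph in speed, and that a stopping distance of 0 ft corresponds to a predicted speed of about 8.28 mph.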
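As a rough illustration of how the coefficients of the fitted model `lm(speed ~ dist)` turn into a prediction, the sketch below computes a predicted speed by hand and again with `predict()`. It refits the model on the built-in `cars` data so it runs on its own. The names `fit`, `new_dist`, `by_hand`, and `by_predict`, and the 50 ft value, are assumptions chosen for the example and are not part of the original analysis.

```r
# Refit the model on the built-in cars data so the sketch runs on its own.
fit <- lm(speed ~ dist, data = cars)
b <- coef(fit)

# Hypothetical stopping distance, in feet, chosen only for illustration.
new_dist <- 50

# Prediction by hand: intercept + slope * distance
by_hand <- b[["(Intercept)"]] + b[["dist"]] * new_dist

# The same prediction using predict()
by_predict <- predict(fit, newdata = data.frame(dist = new_dist))

print(by_hand)
print(unname(by_predict))
```

Both values should agree, at roughly 16.6 mph, which is 8.28391 + 0.16557 * 50 using the rounded coefficients from the summary.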
ggplot(data = cars_df, aes(x = speed, y = lm_cars$residuals)) +
  geom_point(size = 2, alpha = .3) +
  geom_abline(intercept = 0, slope = 0, color = "blue") +
  theme(panel.grid.major = element_line(color = "green")) +
  labs(title = "Car speed vs Model Residuals",
       x = "Car Speed (mph)",
       y = "Model Residuals")
Using qqnorm() and qqline() to check whether the residuals are nearly normal (that is, approximately follow a normal distribution).
qqnorm(lm_cars$residuals)
qqline(lm_cars$residuals)
Reviewing further using a histogram:
hist(lm_cars$residuals, main="Histogram of Linear model Residuals", xlab="Residuals")
The histogram above suggests that the residuals are approximately normally distributed.
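The visual checks can be complemented with a numeric one. The sketch below is not part of the original analysis: it applies the Shapiro-Wilk test (`shapiro.test()` from base R's stats package) to the residuals, on the assumption that a formal test is wanted alongside the plots. The names `fit`, `res`, and `sw` are illustrative.

```r
# Refit the model on the built-in cars data so the sketch runs on its own.
fit <- lm(speed ~ dist, data = cars)
res <- resid(fit)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normally distributed.
sw <- shapiro.test(res)
print(sw)

# Least-squares residuals from a model with an intercept average to (numerically) zero.
print(mean(res))
```

A large p-value here means the test finds no evidence against normality. It does not prove the residuals are normal, so it is best read together with the Q-Q plot and histogram.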
The model is examined further using inference.
Inference:
To wrap up, here is another look at the model using the summary() function.
summary(lm_cars)
##
## Call:
## lm(formula = speed ~ dist, data = cars_df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -7.5293 -2.1550  0.3615  2.4377  6.4179
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  8.28391    0.87438   9.474 1.44e-12 ***
## dist         0.16557    0.01749   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.156 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
As already noted, the fitted line for the model speed ~ dist is:
\(speed = 8.28391 + 0.16557 * distance\)
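To make the inference figures in the summary less of a black box, the sketch below recomputes a few of them from first principles: the t value of the slope, its two-sided p-value, and R-squared, plus 95% confidence intervals from `confint()`. It refits the model on the built-in `cars` data so it is self-contained. The object names (`fit`, `coefs`, `t_dist`, `p_dist`, `r2`, `ci`) are assumptions made for the example.

```r
# Refit the same model on the built-in cars data so the sketch is self-contained.
fit <- lm(speed ~ dist, data = cars)
coefs <- summary(fit)$coefficients

# t value = estimate / standard error
t_dist <- coefs["dist", "Estimate"] / coefs["dist", "Std. Error"]

# Two-sided p-value from the t distribution with the residual degrees of freedom
p_dist <- 2 * pt(abs(t_dist), df = df.residual(fit), lower.tail = FALSE)

# In simple linear regression, R-squared equals the squared correlation
r2 <- cor(cars$speed, cars$dist)^2

# 95% confidence intervals for the intercept and slope
ci <- confint(fit, level = 0.95)

print(c(t = t_dist, p = p_dist, r2 = r2))
print(ci)
```

The t value and R-squared printed here should agree with the 9.464 and 0.6511 shown in the summary. The very small p-value indicates that the slope differs significantly from zero, and an R-squared of 0.6511 means that stopping distance accounts for about 65% of the variation in speed.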