load data
data(cars)
explore data
The dataset has 50 observations and no missing values. The histogram of speed looks roughly normal, while the histogram of distance is right-skewed. The scatter plot shows a positive, roughly linear relationship between speed and distance.
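The visual impression of a positive linear relationship can also be put into a single number. The following is a minimal sketch using the Pearson correlation coefficient; the variable name `r` is illustrative and not part of the original analysis.

```r
# Load the built-in cars dataset (speed and stopping distance)
data(cars)

# Pearson correlation between speed and distance;
# values near +1 indicate a strong positive linear relationship
r <- cor(cars$speed, cars$dist)
print(round(r, 3))
```

A value close to 1 supports the reading of the scatter plot, although correlation alone says nothing about whether a straight line is the right functional form.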
# first few observations
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
# number of observations?
nrow(cars)
## [1] 50
# any missing value?
sum(is.na(cars))
## [1] 0
# distributions of speed (independent variable) and distance (dependent variable)
hist(cars$speed)
hist(cars$dist)
# relationship of speed and distance (predictor on the x-axis, response on the y-axis)
plot(x = cars$speed, y = cars$dist, main = 'distance vs speed')
use speed to predict distance
From the summary, the intercept is the point where the fitted line crosses the y-axis, and the coefficient on speed is the slope of the line, which is positive. The fitted equation is
\[distance = -17.5791 + 3.9324 \times speed\]
The p-value for speed (1.49e-12) shows that speed is a statistically significant predictor of distance. \(R^2\) indicates that about 65% of the variance in distance is explained by the model, which is pretty good.
# create linear model
cars.lm <- lm(cars$dist ~ cars$speed)
# summary of this model
summary(cars.lm)
##
## Call:
## lm(formula = cars$dist ~ cars$speed)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## cars$speed    3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
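To make the Multiple R-squared figure in the summary concrete, here is a minimal sketch that recomputes it by hand as one minus the ratio of residual to total sum of squares. The names `fit`, `ss_res`, `ss_tot` and `r_squared` are illustrative assumptions, and the model is refit with the `data = cars` formula style, which should give the same coefficients as the call above.

```r
# Refit the same simple linear regression (formula + data style; an assumption,
# the original analysis uses cars$dist ~ cars$speed)
data(cars)
fit <- lm(dist ~ speed, data = cars)

# Residual sum of squares: variation the line leaves unexplained
ss_res <- sum(resid(fit)^2)

# Total sum of squares: variation of distance around its mean
ss_tot <- sum((cars$dist - mean(cars$dist))^2)

# R-squared is the share of total variation explained by the model
r_squared <- 1 - ss_res / ss_tot
print(round(r_squared, 4))
```

The result should agree with `summary(fit)$r.squared`, which is where the roughly 65% figure comes from.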
is the model good?
We need to look at the residual plot first. If the residuals scatter randomly around 0 with no pattern and follow a Gaussian distribution, we can say that the line is a pretty good fit. In the residual plot there are no obvious patterns, and the points look random around 0. This is consistent with normally distributed residuals, but a residual plot alone does not establish normality, so we double-check with a QQ plot.
# residual plot
plot(fitted(cars.lm), resid(cars.lm), main = 'residual plot')
abline(h = 0)
In the QQ plot, the majority of points lie on the straight line, with only a few exceptions at the two ends. We can still conclude that the residuals approximately follow a Gaussian distribution.
# qq-plot
qqnorm(resid(cars.lm))
qqline(resid(cars.lm))
conclusion
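Speed is a statistically significant predictor, the model explains about 65% of the variance in distance, and the residuals look approximately Gaussian, so this is a good model for predicting distance from speed.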
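As an illustration of actually using the model for prediction, the sketch below predicts stopping distance for a few speeds. It assumes the model is refit as `lm(dist ~ speed, data = cars)`, because `predict()` matches `newdata` columns by variable name and that does not work with the `cars$dist ~ cars$speed` formula. The names `fit`, `new_speeds` and `predicted` and the chosen speeds are hypothetical.

```r
# Refit with a data argument so predict() can look up 'speed' in newdata
data(cars)
fit <- lm(dist ~ speed, data = cars)

# Hypothetical speeds (mph) inside the observed range of the data
new_speeds <- data.frame(speed = c(10, 15, 20))

# Predicted stopping distances (ft) from the fitted line
predicted <- predict(fit, newdata = new_speeds)
print(predicted)
```

For a speed of 15, the fitted equation gives -17.5791 + 3.9324 × 15, or about 41.4. Predictions for speeds well outside the observed range (4 to 25) are extrapolations and should be treated with caution.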