For this exercise, I chose the faithful dataset, which ships with R and
records eruption durations and waiting times for the Old Faithful geyser.
First, let’s load the built-in faithful dataset into a native
R data frame.
df <- as.data.frame(faithful)
head(df)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55
Next, let’s fit a simple linear regression with R’s built-in lm function.
# creating our model of eruption duration as a function of wait time
model <- lm(faithful$eruptions ~ faithful$waiting)
summary(model)
##
## Call:
## lm(formula = faithful$eruptions ~ faithful$waiting)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.29917 -0.37689  0.03508  0.34909  1.19329
##
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)      -1.874016   0.160143  -11.70   <2e-16 ***
## faithful$waiting  0.075628   0.002219   34.09   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4965 on 270 degrees of freedom
## Multiple R-squared: 0.8115, Adjusted R-squared: 0.8108
## F-statistic: 1162 on 1 and 270 DF, p-value: < 2.2e-16
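To make the summary above more concrete, here is a minimal sketch of how the fitted coefficients can be read and used. It refits the same regression with a `data` argument, since `predict()` needs named columns to accept new data. The names `fit`, `new_wait`, and `pred`, and the 80-minute waiting time, are illustrative assumptions and not part of the original analysis.

```r
# Refit the same model using a formula plus `data`, so predict() can take new data.
# `fit`, `new_wait`, and `pred` are illustrative names.
fit <- lm(eruptions ~ waiting, data = faithful)

# 95% confidence intervals for the intercept and the slope
confint(fit)

# Predicted eruption duration (in minutes) after a hypothetical 80-minute wait,
# with a 95% prediction interval
new_wait <- data.frame(waiting = 80)
pred <- predict(fit, newdata = new_wait, interval = "prediction")
pred
```

Using the estimates printed above, the point prediction is -1.874016 + 0.075628 * 80, or roughly 4.18 minutes. The slope says that each extra minute of waiting is associated with about 0.076 additional minutes of eruption.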
Let’s plot our data along with the fitted regression line.
plot(faithful$waiting, faithful$eruptions)
abline(model)
The plot shows a roughly linear, positive relationship between waiting
time and eruption duration, so we can go on to check the remaining
assumptions of regression.
Next, let’s plot the distribution of our residuals as a histogram.
res <- resid(model) #alternatively, model$residuals
hist(res)
This distribution is slightly skewed, but it looks approximately normal
by eye.
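The visual check above could be complemented with a formal test. The sketch below uses the Shapiro-Wilk test from base R, which is one possible choice and not something the original analysis ran. The names `fit` and `sw` are illustrative.

```r
# Self-contained refit; `fit` and `sw` are illustrative names
fit <- lm(eruptions ~ waiting, data = faithful)
res <- resid(fit)

# Shapiro-Wilk test: the null hypothesis is that the residuals
# come from a normal distribution
sw <- shapiro.test(res)
sw$statistic
sw$p.value
```

A small p-value would count as evidence against normality. With a few hundred observations the test can flag even mild departures, so it is best read alongside the histogram and not in place of it.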
Next, let’s plot our residuals against our fitted values.
plot(fitted(model), res)
There appears to be a slight pattern in these residuals: the points fall
into two clusters within the residual plot.
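Because the fitted values are a linear function of waiting time, any horizontal clustering in this plot mirrors clustering in the waiting times themselves. The sketch below makes the two groups easier to see by colouring the points. The 68-minute cutoff is a hypothetical value picked by eye, and `fit`, `cutoff`, and `group` are illustrative names.

```r
# Self-contained refit; `fit`, `cutoff`, and `group` are illustrative names
fit <- lm(eruptions ~ waiting, data = faithful)

# Hypothetical cutoff chosen by eye, for illustration only
cutoff <- 68
group <- ifelse(faithful$waiting < cutoff, "short wait", "long wait")

# Number of observations on each side of the cutoff
table(group)

# Residuals against fitted values, coloured by group,
# with a dashed reference line at zero
plot(fitted(fit), resid(fit),
     col = ifelse(group == "short wait", "steelblue", "tomato"),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

If the residuals within each colour scatter evenly around zero, the clustering comes mostly from the predictor and is not necessarily a failure of the linear fit.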
Lastly, we can create a Q-Q plot to evaluate the normality of our residuals.
# Create Q-Q plot of our residuals
qqnorm(res)
qqline(res)
While the Q-Q plot is roughly linear, the tails of the distribution drift away from the reference line. Overall, a linear model fits the data reasonably well and broadly meets the assumptions of regression, but the clustering seen in the residual plot leads me to believe a different model may be more appropriate for this dataset.
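As one hedged illustration of what "a different model" might look like, the sketch below adds a group indicator to the straight-line fit and compares the two models. This is only one candidate among many, not a recommendation from the analysis above. The 68-minute cutoff and the names `d`, `long_wait`, `fit_linear`, `fit_group`, and `cmp` are assumptions made for the example.

```r
# Illustrative only: the cutoff and the `long_wait` column are assumptions
d <- faithful
d$long_wait <- d$waiting >= 68

# The original straight-line model and a variant with a group indicator
fit_linear <- lm(eruptions ~ waiting, data = d)
fit_group  <- lm(eruptions ~ waiting + long_wait, data = d)

# F-test for whether the indicator improves on the straight-line fit
cmp <- anova(fit_linear, fit_group)
cmp

# AIC comparison (lower values indicate a better trade-off of fit and complexity)
AIC(fit_linear, fit_group)
```

Because the cutoff was picked after looking at the data, any improvement shown here should be treated as exploratory.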