Dataset

For this exercise, I chose the faithful dataset, which ships with R and records eruption durations and waiting times for the Old Faithful geyser.

First, let’s load the built-in faithful dataset into a native R dataframe

df <- as.data.frame(faithful)

head(df)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55

Model Creation

Let’s use the built-in lm function from R

# creating our model of eruption duration as a function of wait time
model <- lm(faithful$eruptions ~ faithful$waiting)

summary(model)
## 
## Call:
## lm(formula = faithful$eruptions ~ faithful$waiting)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.29917 -0.37689  0.03508  0.34909  1.19329 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.874016   0.160143  -11.70   <2e-16 ***
## faithful$waiting  0.075628   0.002219   34.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4965 on 270 degrees of freedom
## Multiple R-squared:  0.8115, Adjusted R-squared:  0.8108 
## F-statistic:  1162 on 1 and 270 DF,  p-value: < 2.2e-16

Let’s plot our data as well as our model line

plot(faithful$waiting, faithful$eruptions)
abline(model)

The plot shows a roughly linear, positive relationship between waiting time and eruption duration. Next, we should check the remaining assumptions for regression.
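As a quick numeric check on that visual impression, the Pearson correlation can be computed with base R's cor(). With a single predictor and an intercept, its square should match the Multiple R-squared reported by summary(model), so here it should come out around 0.90. This is only a minimal sketch; the names r and r_squared are my own.

```r
# Pearson correlation between waiting time and eruption duration
r <- cor(faithful$waiting, faithful$eruptions)

# Refit the same model so this snippet runs on its own
model <- lm(faithful$eruptions ~ faithful$waiting)
r_squared <- summary(model)$r.squared

# With one predictor, r^2 equals the model's R-squared
all.equal(r^2, r_squared)
```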

Next, let’s plot the distribution of our residuals via a histogram

res <- resid(model) #alternatively, model$residuals
hist(res)

This distribution is slightly skewed, but looks roughly normal by eye
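To back up the eye test with something more formal, one option is a Shapiro-Wilk test via base R's shapiro.test(). This is a sketch rather than a definitive check: the test can flag small departures in samples of this size, so it is best read alongside the plots. The name sw is my own.

```r
# Refit the model and extract residuals so this snippet runs on its own
model <- lm(faithful$eruptions ~ faithful$waiting)
res <- resid(model)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(res)
sw$statistic  # W statistic; values near 1 are consistent with normality
sw$p.value    # a small p-value suggests a departure from normality
```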

Next let’s plot our residuals as a function of our predicted values

plot(fitted(model), res)

There appears to be a slight pattern in these residuals: the points fall into two clusters within our residuals plot
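One way to look at these clusters more closely is to split the observations into two groups by waiting time and color the residual plot accordingly. The sketch below uses kmeans() from the stats package with two centers. Both the choice of two groups and the use of k-means are my assumptions for illustration, and the cluster labels (1 or 2) are arbitrary.

```r
# Refit the model and extract residuals so this snippet runs on its own
model <- lm(faithful$eruptions ~ faithful$waiting)
res <- resid(model)

set.seed(1)  # kmeans uses random starts; fix the seed for reproducibility
# Assumption: two groups, chosen to match the two clusters seen in the plot
groups <- kmeans(faithful$waiting, centers = 2)$cluster

# Color each residual by its group to see whether the clusters line up
plot(fitted(model), res, col = groups,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # reference line at zero residual

# Residual summary within each group
tapply(res, groups, summary)
```

If the residuals in each group are centered near zero with similar spread, the clustering mostly reflects gaps in the predictor rather than a failure of the linear fit.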

Lastly, we can create a Q-Q plot to evaluate the normality of our residuals.

# Create Q-Q plot of our residuals
qqnorm(res)
qqline(res)

While the Q-Q plot is roughly linear, the tails of the distribution drift away from the reference line. Overall, a linear model fits the data reasonably well and broadly meets the assumptions of regression, but the clustering seen in the residual plot leads me to believe a different model may be more appropriate for this dataset.