Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
# Summary statistics of the dataset
library(datasets)
summary(faithful)
##    eruptions        waiting
##  Min.   :1.600   Min.   :43.0
##  1st Qu.:2.163   1st Qu.:58.0
##  Median :4.000   Median :76.0
##  Mean   :3.488   Mean   :70.9
##  3rd Qu.:4.454   3rd Qu.:82.0
##  Max.   :5.100   Max.   :96.0
# Check visually for a linear relationship between the independent variable (waiting) and the response variable (eruptions)
plot(faithful$waiting, faithful$eruptions, main = "Eruption Duration vs Waiting Time", xlab = "Waiting time (min)", ylab = "Eruption duration (min)")
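# A numeric check can back up the visual impression from the scatter plot.
# A minimal sketch, assuming the Pearson correlation is a suitable summary here;
# the name `r_wait_erupt` is illustrative and not part of the original analysis.
r_wait_erupt <- cor(faithful$waiting, faithful$eruptions)
r_wait_erupt
# A value close to 1 would point to a strong positive linear association.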
# Creating the linear regression model
faithful_lm <- lm(faithful$eruptions ~ faithful$waiting)
faithful_lm
##
## Call:
## lm(formula = faithful$eruptions ~ faithful$waiting)
##
## Coefficients:
##      (Intercept)  faithful$waiting
##         -1.87402           0.07563
plot(faithful$waiting, faithful$eruptions, main = "Eruption Duration vs Waiting Time (with Fitted Regression Line)", xlab = "Waiting time (min)", ylab = "Eruption duration (min)")
abline(faithful_lm, col="blue")
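# To see what the fitted line implies, the coefficients can be used to estimate
# the eruption duration for a given wait. A minimal sketch, assuming the model is
# refit with the `data =` argument so that predict() accepts new data;
# `fit_sketch` and the 80-minute wait are illustrative choices.
fit_sketch <- lm(eruptions ~ waiting, data = faithful)
b <- coef(fit_sketch)
# Intercept + slope * waiting, computed by hand
by_hand <- unname(b["(Intercept)"] + b["waiting"] * 80)
# The same estimate obtained through predict()
by_predict <- unname(predict(fit_sketch, newdata = data.frame(waiting = 80)))
c(by_hand = by_hand, by_predict = by_predict)
# Both values should agree, at roughly 4.18 minutes given the coefficients printed above.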
summary(faithful_lm)
##
## Call:
## lm(formula = faithful$eruptions ~ faithful$waiting)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.29917 -0.37689  0.03508  0.34909  1.19329
##
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)      -1.874016   0.160143  -11.70   <2e-16 ***
## faithful$waiting  0.075628   0.002219   34.09   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4965 on 270 degrees of freedom
## Multiple R-squared: 0.8115, Adjusted R-squared: 0.8108
## F-statistic: 1162 on 1 and 270 DF, p-value: < 2.2e-16
# The model looks reasonable: the residual median (0.035) is close to zero, and the min/max and 1Q/3Q values are roughly equal in magnitude, so the residuals are approximately symmetric about zero.
# The standard error of the waiting coefficient is about 34 times smaller than the estimate (this ratio is the t value, 34.09), which is good (the expectation is at least 5 to 10 times).
# R-squared is 0.8115, so the model explains about 81% of the variability in eruption duration.
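# The figures quoted above can be pulled out of the summary object rather than
# read off the printout. A minimal sketch; the names `fit_check`, `coefs`, and
# `ratio` are illustrative, and the model is refit here only so the snippet runs on its own.
fit_check <- lm(eruptions ~ waiting, data = faithful)
coefs <- coef(summary(fit_check))   # matrix with Estimate, Std. Error, t value, Pr(>|t|)
ratio <- coefs["waiting", "Estimate"] / coefs["waiting", "Std. Error"]
ratio                               # same quantity as the t value for waiting
summary(fit_check)$r.squared        # proportion of variance in eruptions explained
confint(fit_check, level = 0.95)    # 95% confidence intervals for both coefficients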
plot(fitted(faithful_lm), resid(faithful_lm), xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red")
qqnorm(resid(faithful_lm))
qqline(resid(faithful_lm))
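# The residual plot shows relatively constant variability around zero, with no curvature or funnel shape. The points do fall into two clusters, which reflects the short and long eruptions visible in the scatter plot.
# The Q-Q plot shows the residuals following the theoretical straight line closely, except at the ends, which suggests they are approximately normally distributed.
# Taken together, these checks suggest the linear model was appropriate for this data.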
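# The visual checks can be supplemented with a formal test and one of R's built-in
# diagnostic plots. A minimal sketch, assuming a Shapiro-Wilk test is an acceptable
# normality check here; `fit_diag`, `res`, and `sw` are illustrative names.
fit_diag <- lm(eruptions ~ waiting, data = faithful)
res <- resid(fit_diag)
mean(res)                  # essentially zero by construction for least squares with an intercept
sw <- shapiro.test(res)    # null hypothesis: the residuals are normally distributed
sw$p.value                 # a small p-value would argue against normality
plot(fit_diag, which = 3)  # scale-location plot, a further check for constant variance
# With 272 observations the test can flag even mild departures from normality,
# so its result is best read alongside the Q-Q plot rather than instead of it.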