Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Old Faithful Geyser Data - Faithful Dataset
Data Visualization
# Summary statistics of the dataset
library(datasets)
summary(faithful)
##    eruptions        waiting    
##  Min.   :1.600   Min.   :43.0  
##  1st Qu.:2.163   1st Qu.:58.0  
##  Median :4.000   Median :76.0  
##  Mean   :3.488   Mean   :70.9  
##  3rd Qu.:4.454   3rd Qu.:82.0  
##  Max.   :5.100   Max.   :96.0
# Check visually for a linear relationship between the predictor (waiting) and the response (eruptions)
plot(faithful$waiting, faithful$eruptions, main = "Eruption Duration vs Waiting time", xlab = "Waiting", ylab = "Eruption Duration")
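The strength of the linear association seen in the scatter plot can also be put into a number. A minimal sketch using the Pearson correlation; the variable name `r` is only illustrative:

```r
library(datasets)

# Pearson correlation between waiting time and eruption duration
r <- cor(faithful$waiting, faithful$eruptions)

r      # correlation coefficient
r^2    # for a single predictor, this equals the R-squared of the linear model
```

A value of `r` close to 1 would support fitting a straight line, although it says nothing about whether the residuals behave well.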

Modeling
# Creating the linear regression model
faithful_lm <- lm(faithful$eruptions ~ faithful$waiting)

faithful_lm
## 
## Call:
## lm(formula = faithful$eruptions ~ faithful$waiting)
## 
## Coefficients:
##      (Intercept)  faithful$waiting  
##         -1.87402           0.07563
plot(faithful$waiting, faithful$eruptions, main = "Eruption Duration vs Waiting time (with Fitted Regression Line)", xlab = "Waiting", ylab = "Eruption Duration")
abline(faithful_lm, col="blue")
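To see what the fitted coefficients mean in practice, the model can be used to predict an eruption duration. The sketch below assumes a hypothetical waiting time of 80 minutes and refits the same model with the `data` argument (an alternative spelling of the formula, needed so that `predict()` can match the new column name). The names `fit`, `new_obs`, `pred` and `manual` are illustrative.

```r
library(datasets)

# Same model, written with the data argument instead of faithful$...
fit <- lm(eruptions ~ waiting, data = faithful)

# Hypothetical new observation: 80 minutes of waiting
new_obs <- data.frame(waiting = 80)
pred <- predict(fit, newdata = new_obs)

# Manual check: intercept + slope * waiting
manual <- coef(fit)[["(Intercept)"]] + coef(fit)[["waiting"]] * 80

pred
manual
```

With the rounded coefficients shown above, the manual calculation is -1.874 + 0.0756 × 80, or roughly 4.18 minutes.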

Model Evaluation
summary(faithful_lm)
## 
## Call:
## lm(formula = faithful$eruptions ~ faithful$waiting)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.29917 -0.37689  0.03508  0.34909  1.19329 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.874016   0.160143  -11.70   <2e-16 ***
## faithful$waiting  0.075628   0.002219   34.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4965 on 270 degrees of freedom
## Multiple R-squared:  0.8115, Adjusted R-squared:  0.8108 
## F-statistic:  1162 on 1 and 270 DF,  p-value: < 2.2e-16
# The model looks reasonable: the residual median (0.035) is close to zero, and the min/max and 1Q/3Q pairs are of roughly equal magnitude, so the residuals are roughly symmetric about zero.
# The standard error of the waiting coefficient is about 34 times smaller than the coefficient itself (this ratio is the t value, 34.09), well above the 5 to 10 times often used as a rule of thumb.
# Multiple R-squared is 0.8115, so the model explains about 81% of the variance in eruption duration.
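The figures quoted in the comments above can be recomputed directly from the fitted model. A minimal sketch, refitting the same model under the illustrative name `fit`:

```r
library(datasets)

fit <- lm(faithful$eruptions ~ faithful$waiting)
coefs <- summary(fit)$coefficients

# Ratio of the slope estimate to its standard error (this is the t value)
slope_ratio <- coefs[2, "Estimate"] / coefs[2, "Std. Error"]

# R-squared from its definition: 1 - SS_residual / SS_total
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((faithful$eruptions - mean(faithful$eruptions))^2)
r_squared <- 1 - ss_res / ss_tot

# Mean of the residuals: zero up to rounding for any least-squares fit with an intercept
resid_mean <- mean(residuals(fit))

c(slope_ratio = slope_ratio, r_squared = r_squared, resid_mean = resid_mean)
```

Because the residual mean is zero by construction, it is the median and the quartiles in the summary that carry information about symmetry.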
Residual Analysis
plot(faithful_lm$fitted.values, faithful_lm$residuals, xlab='Fitted Values', ylab='Residuals')
abline(h = 0, col = "red")

qqnorm(faithful_lm$residuals)
qqline(faithful_lm$residuals)

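# The residual plot shows roughly constant variability around zero with no overall trend, although the points fall into two clusters (short and long eruptions) rather than one even band.
# The Q-Q plot shows the residuals following the theoretical line closely except in the tails, which suggests they are approximately normally distributed.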
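The visual reading of the Q-Q plot can be complemented with a histogram and a formal normality test. The sketch below uses the Shapiro-Wilk test from base R as one possible choice; it is an addition for illustration, not part of the original analysis, and the names `fit`, `res` and `sw` are illustrative.

```r
library(datasets)

fit <- lm(faithful$eruptions ~ faithful$waiting)
res <- residuals(fit)

# Histogram of the residuals: should look roughly bell-shaped if they are normal
hist(res, breaks = 20, main = "Histogram of Residuals", xlab = "Residuals")

# Shapiro-Wilk test: a small p-value indicates departure from normality
sw <- shapiro.test(res)
sw
```

With a sample of this size the test can flag small departures that matter little in practice, so it is best read together with the plots.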
Conclusion
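The linear regression model fits reasonably well and explains about 81% of the variability in eruption duration. Overall there is a positive linear relationship between eruption duration and waiting time, and the residuals show roughly constant variance and approximate normality, so a linear model appears appropriate as a first approximation. The relationship is clearest for waiting times of about 70 minutes and above; below 70 minutes it is less clear, as eruption durations stay roughly constant.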
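The observation about the 70-minute mark can be checked by looking at the correlation within each group separately. A rough sketch, taking the 70-minute threshold from the text above; the names `short_wait`, `long_wait`, `cor_short` and `cor_long` are illustrative:

```r
library(datasets)

# Split at 70 minutes, the threshold mentioned in the conclusion
short_wait <- faithful[faithful$waiting < 70, ]
long_wait  <- faithful[faithful$waiting >= 70, ]

# Correlation between waiting time and eruption duration within each group
cor_short <- cor(short_wait$waiting, short_wait$eruptions)
cor_long  <- cor(long_wait$waiting, long_wait$eruptions)

c(below_70 = cor_short, from_70 = cor_long)
```

If both within-group correlations turn out much weaker than the overall one, the overall fit is driven largely by the difference between the two groups rather than by a linear trend inside them.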