### load some libraries
library(datasets)
data("faithful")
attach(faithful)

Old Faithful

Old Faithful is a cone geyser located in Yellowstone National Park in Wyoming. It has erupted every 44 to 125 minutes since 2000, spewing 3,700 to 8,400 US gallons of boiling water to a height of 106 to 185 feet. The average height of an eruption is 145 feet.

The R dataset “faithful” contains a list of 272 observations of geyser eruptions during October 1980. The duration of each eruption and the waiting time between eruptions are given, measured in minutes.

Visualize the Data

plot(waiting~eruptions,
     xlab="length of eruption, in minutes", 
     ylab="Waiting time between eruptions, in minutes",
     main="Old Faithful Geyser: Eruption length vs. waiting time")

Duration of each eruption

Each eruption lasts from 1.6 and 5.1 minutes, with the mean eruption lasting 3.5 minutes.

hist(faithful$eruptions,
     freq=T, 
     xlab = "Eruption Duration, in minutes", 
     main="Histogram of Length of eruptions, in minutes",
     col="lightgreen",
     breaks=30)

Split the dataset based on length of eruption

# Short duration eruptions - less than 3.2 minutes
shortfaith = faithful[faithful$eruptions<3.2,]
numshort = dim(shortfaith)[1]
shortduration = mean(shortfaith$eruptions)
shortwait = mean(shortfaith$waiting)

# Long duration eruptions - more than 3.2 minutes
longfaith = faithful[faithful$eruptions>=3.2,]
numlong = dim(longfaith)[1]
longduration = mean(longfaith$eruptions)
longwait = mean(longfaith$waiting)

However, the data is bimodal:

About one-third (98/272) of the eruptions last less than 3.2 minutes in duration.
The average duration of these eruptions is 2.05 minutes.
The other two-thirds (174/272) of the eruptions are longer than 3.2 minutes.
The average duration of these longer eruptions is 4.3 minutes.

Time between eruptions

Intervals between eruptions can range from 43 to 96 minutes, averaging 70.9 minutes in the dataset.

hist(faithful$waiting,
     freq=FALSE, 
     xlab = "Waiting time, in minutes", 
     main="Histogram of Waiting Time between eruptions, in minutes",
     col="lightblue",
     breaks=20)

The time between eruptions has a bimodal distribution, with the mean interval being either 54.64 or 80.05 minutes, and is dependent on the length of the prior eruption. On average, Old Faithful will erupt either 54.64 minutes after an eruption lasting less than 3.2 minutes, or 80.05 minutes after an eruption lasting more than 3.2 minutes.

The Linear Model Function

Linear model: waiting ~ eruptions

LinearModel <- lm(waiting ~ eruptions, data = faithful)
intercept = round(LinearModel$coefficients[1], 3)
slope =     round(LinearModel$coefficients[2], 3)
formula = paste ("waiting = ", intercept, " + ", slope, "*", "eruptions + error")
plot(x = eruptions, y = waiting, main=paste("Old Faithful dataset: ", formula))
abline(reg = LinearModel,col="blue")

Evaluating the Quality of the Model

summary(LinearModel)

## 
## Call:
## lm(formula = waiting ~ eruptions, data = faithful)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -12.079608  -4.483103   0.212248   3.924627  15.971858 
## 
## Coefficients:
##              Estimate Std. Error t value               Pr(>|t|)    
## (Intercept) 33.474397   1.154874 28.9853 < 0.000000000000000222 ***
## eruptions   10.729641   0.314753 34.0890 < 0.000000000000000222 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.91401 on 270 degrees of freedom
## Multiple R-squared:  0.811461,   Adjusted R-squared:  0.810762 
## F-statistic: 1162.06 on 1 and 270 DF,  p-value: < 0.000000000000000222

The regression indicates that the results are highly significant, with an $R^2$ of 0.81 .

Residual Analysis

Plot residuals

Residual = resid(LinearModel) 
Fitted = fitted(LinearModel)
plot(Fitted,Residual, 
     main="Old Faithful geyser dataset: Fitted vs. Residuals", 
     xlab="Fitted Waiting Time (minutes)")
abline(h=0, col="blue")

QQ plot

qqnorm(Residual)
qqline(Residual, col="red")

There are clearly a number of outliers at the negative end of the residuals, which suggests non-normality.

Plot histograms of Residuals

Residual = resid(LinearModel) 
hist(Residual, main = "Histogram of Residuals - 7 breaks", ylab = "Density", 
     ylim = c(0, 0.10),
     prob = TRUE,breaks=7, col="lightblue")
curve(dnorm(x, mean = mean(Residual), sd = sd(Residual)), col="red", add=TRUE)

hist(Residual, main = "Histogram of Residuals - 30 breaks", ylab = "Density", 
     ylim = c(0, 0.11),prob = TRUE,breaks=30, col="lightblue")
curve(dnorm(x, mean = mean(Residual), sd = sd(Residual)), col="red", add=TRUE)

The residuals somewhat resemble a normal curve, but there are some regions where the distribution is rather different.

Shapiro-Wilks test

$H_0$ : The residuals are normal

$H_A$ : The residuals are not normal

shapiro.test(Residual)

## 
##  Shapiro-Wilk normality test
## 
## data:  Residual
## W = 0.9885117, p-value = 0.0294699

Because the p-value (.0295) is low, the null hypothesis is rejected at 95% confidence.

Let’s model the short and long groups separately

Regress the short duration group

ShortModel <- lm(waiting ~ eruptions, data = shortfaith)
summary(ShortModel)

## 
## Call:
## lm(formula = waiting ~ eruptions, data = shortfaith)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -11.15848  -4.61328   0.38580   4.21790  13.88948 
## 
## Coefficients:
##             Estimate Std. Error t value              Pr(>|t|)    
## (Intercept) 39.52377    4.15364 9.51544 0.0000000000000016353 ***
## eruptions    7.38009    2.00836 3.67468            0.00039213 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.63939 on 96 degrees of freedom
## Multiple R-squared:  0.123314,   Adjusted R-squared:  0.114182 
## F-statistic: 13.5033 on 1 and 96 DF,  p-value: 0.000392134

Regress the long duration group

LongModel <- lm(waiting ~ eruptions, data = longfaith)
summary(LongModel)

## 
## Call:
## lm(formula = waiting ~ eruptions, data = longfaith)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -13.609870  -3.973889  -0.114825   3.359148  14.078262 
## 
## Coefficients:
##             Estimate Std. Error t value               Pr(>|t|)    
## (Intercept) 57.49631    4.56676 12.5902 < 0.000000000000000222 ***
## eruptions    5.24747    1.05787  4.9604            0.000001679 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.58408 on 172 degrees of freedom
## Multiple R-squared:  0.125152,   Adjusted R-squared:  0.120066 
## F-statistic: 24.6056 on 1 and 172 DF,  p-value: 0.00000167899

Plot both short and long

plot(waiting~eruptions,
     xlab="length of eruption, in minutes", 
     ylab="Waiting time between eruptions, in minutes",
     main="Old Faithful Geyser: Eruption length vs. waiting time",
     type = "n"
     )
points(shortfaith$eruptions,shortfaith$waiting,col="blue")
abline(ShortModel,col="blue")
points(longfaith$eruptions,longfaith$waiting,col="red")
abline(LongModel,col="red")

Analyze Residuals

shortResidual = resid(ShortModel) 
shortFitted = fitted(ShortModel)
longResidual = resid(LongModel) 
longFitted = fitted(LongModel)
plot(Fitted,Residual,type="n", 
     main="Old Faithful geyser dataset: Fitted vs. Residuals", 
     xlab="Fitted Waiting Time (minutes)")
points(shortFitted,shortResidual,col="blue")
points(longFitted,longResidual,col="red")
abline(h=0, col="black")

short residuals

QQ-plots - short

qqnorm(shortResidual, col="blue")
qqline(shortResidual, col="blue")

histogram of short residuals

hist(shortResidual, main = "Histogram of short Residuals - 15 breaks", ylab = "Density", 
     ylim = c(0, 0.11),prob = TRUE,breaks=15, col="lightblue")
curve(dnorm(x, mean = mean(shortResidual), sd = sd(shortResidual)), col="red", add=TRUE)

Shapiro-Wilks test - short

shapiro.test(shortResidual)

## 
##  Shapiro-Wilk normality test
## 
## data:  shortResidual
## W = 0.9842669, p-value = 0.2938

long residuals

QQ-plots - long

qqnorm(longResidual, col="red")
qqline(longResidual, col="red")

histogram of long residuals

hist(longResidual, main = "Histogram of long Residuals - 15 breaks", ylab = "Density", 
     ylim = c(0, 0.11),prob = TRUE,breaks=15, col="red")
curve(dnorm(x, mean = mean(longResidual), sd = sd(longResidual)), col="black", add=TRUE)

Shapiro-Wilks test - long

shapiro.test(longResidual)

## 
##  Shapiro-Wilk normality test
## 
## data:  longResidual
## W = 0.9899504, p-value = 0.258103

Interestingly, the split models each give a much lower $R^2$, closer to 0.12 while the combined model had a much higher $R^2 .

However, the resiiduals on the separate models fare much better under the Shapiro-Wilks Normality test.

Conclusion

Estimating the bimodal distributions using a single linear model reuslts in a high $R^2$ but the residuals fail the normality test.

Partitioning the data into “short” vs. “long” clusters results in models which each have much lower $R^2$ but the respective residuals do pass the normality test.

The explanation is that modeling all the data together does provide a strong relationship because of substantial between-group variance; thus, the high $R^2 = 0.81$ .

Splitting the data into two groups and modeling each with a separate linear model explains much less of the within-group variance.

Thus, it is difficult to model data of this type with a linear model.

Old Faithful Regression

Michael Y.

November 6, 2019