Old Faithful

Old Faithful is a cone geyser located in Yellowstone National Park in Wyoming. It has erupted every 44 to 125 minutes since 2000, spewing 3,700 to 8,400 US gallons of boiling water to a height of 106 to 185 feet. The average height of an eruption is 145 feet.

The R dataset “faithful” contains a list of 272 observations of geyser eruptions during October 1980. The duration of each eruption and the waiting time between eruptions are given, measured in minutes.

Duration of each eruption

Each eruption lasts from 1.6 and 5.1 minutes, with the mean eruption lasting 3.5 minutes.

Split the dataset based on length of eruption

However, the data is bimodal:

  • About one-third (98/272) of the eruptions last less than 3.2 minutes in duration.
  • The average duration of these eruptions is 2.05 minutes.
  • The other two-thirds (174/272) of the eruptions are longer than 3.2 minutes.
  • The average duration of these longer eruptions is 4.3 minutes.

Time between eruptions

Intervals between eruptions can range from 43 to 96 minutes, averaging 70.9 minutes in the dataset.

The time between eruptions has a bimodal distribution, with the mean interval being either 54.64 or 80.05 minutes, and is dependent on the length of the prior eruption. On average, Old Faithful will erupt either 54.64 minutes after an eruption lasting less than 3.2 minutes, or 80.05 minutes after an eruption lasting more than 3.2 minutes.

Evaluating the Quality of the Model

## 
## Call:
## lm(formula = waiting ~ eruptions, data = faithful)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -12.079608  -4.483103   0.212248   3.924627  15.971858 
## 
## Coefficients:
##              Estimate Std. Error t value               Pr(>|t|)    
## (Intercept) 33.474397   1.154874 28.9853 < 0.000000000000000222 ***
## eruptions   10.729641   0.314753 34.0890 < 0.000000000000000222 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.91401 on 270 degrees of freedom
## Multiple R-squared:  0.811461,   Adjusted R-squared:  0.810762 
## F-statistic: 1162.06 on 1 and 270 DF,  p-value: < 0.000000000000000222

The regression indicates that the results are highly significant, with an \(R^2\) of 0.81 .

Residual Analysis

QQ plot

There are clearly a number of outliers at the negative end of the residuals, which suggests non-normality.

Shapiro-Wilks test

\(H_0\) : The residuals are normal

\(H_A\) : The residuals are not normal

## 
##  Shapiro-Wilk normality test
## 
## data:  Residual
## W = 0.9885117, p-value = 0.0294699

Because the p-value (.0295) is low, the null hypothesis is rejected at 95% confidence.

Let’s model the short and long groups separately

Regress the short duration group

## 
## Call:
## lm(formula = waiting ~ eruptions, data = shortfaith)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -11.15848  -4.61328   0.38580   4.21790  13.88948 
## 
## Coefficients:
##             Estimate Std. Error t value              Pr(>|t|)    
## (Intercept) 39.52377    4.15364 9.51544 0.0000000000000016353 ***
## eruptions    7.38009    2.00836 3.67468            0.00039213 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.63939 on 96 degrees of freedom
## Multiple R-squared:  0.123314,   Adjusted R-squared:  0.114182 
## F-statistic: 13.5033 on 1 and 96 DF,  p-value: 0.000392134

Regress the long duration group

## 
## Call:
## lm(formula = waiting ~ eruptions, data = longfaith)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -13.609870  -3.973889  -0.114825   3.359148  14.078262 
## 
## Coefficients:
##             Estimate Std. Error t value               Pr(>|t|)    
## (Intercept) 57.49631    4.56676 12.5902 < 0.000000000000000222 ***
## eruptions    5.24747    1.05787  4.9604            0.000001679 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.58408 on 172 degrees of freedom
## Multiple R-squared:  0.125152,   Adjusted R-squared:  0.120066 
## F-statistic: 24.6056 on 1 and 172 DF,  p-value: 0.00000167899

long residuals

Shapiro-Wilks test - long

## 
##  Shapiro-Wilk normality test
## 
## data:  longResidual
## W = 0.9899504, p-value = 0.258103

Interestingly, the split models each give a much lower \(R^2\), closer to 0.12 while the combined model had a much higher $R^2 .

However, the resiiduals on the separate models fare much better under the Shapiro-Wilks Normality test.

Conclusion

Estimating the bimodal distributions using a single linear model reuslts in a high \(R^2\) but the residuals fail the normality test.

Partitioning the data into “short” vs. “long” clusters results in models which each have much lower \(R^2\) but the respective residuals do pass the normality test.

The explanation is that modeling all the data together does provide a strong relationship because of substantial between-group variance; thus, the high \(R^2 = 0.81\) .

Splitting the data into two groups and modeling each with a separate linear model explains much less of the within-group variance.

Thus, it is difficult to model data of this type with a linear model.