Old Faithful is a cone geyser located in Yellowstone National Park in Wyoming. It has erupted every 44 to 125 minutes since 2000, spewing 3,700 to 8,400 US gallons of boiling water to a height of 106 to 185 feet. The average height of an eruption is 145 feet.
The R dataset “faithful” contains a list of 272 observations of geyser eruptions during October 1980. The duration of each eruption and the waiting time between eruptions are given, measured in minutes.
plot(waiting~eruptions,
xlab="length of eruption, in minutes",
ylab="Waiting time between eruptions, in minutes",
main="Old Faithful Geyser: Eruption length vs. waiting time")Each eruption lasts from 1.6 and 5.1 minutes, with the mean eruption lasting 3.5 minutes.
hist(faithful$eruptions,
freq=T,
xlab = "Eruption Duration, in minutes",
main="Histogram of Length of eruptions, in minutes",
col="lightgreen",
breaks=30)# Short duration eruptions - less than 3.2 minutes
shortfaith = faithful[faithful$eruptions<3.2,]
numshort = dim(shortfaith)[1]
shortduration = mean(shortfaith$eruptions)
shortwait = mean(shortfaith$waiting)
# Long duration eruptions - more than 3.2 minutes
longfaith = faithful[faithful$eruptions>=3.2,]
numlong = dim(longfaith)[1]
longduration = mean(longfaith$eruptions)
longwait = mean(longfaith$waiting)However, the data is bimodal:
Intervals between eruptions can range from 43 to 96 minutes, averaging 70.9 minutes in the dataset.
hist(faithful$waiting,
freq=FALSE,
xlab = "Waiting time, in minutes",
main="Histogram of Waiting Time between eruptions, in minutes",
col="lightblue",
breaks=20)The time between eruptions has a bimodal distribution, with the mean interval being either 54.64 or 80.05 minutes, and is dependent on the length of the prior eruption. On average, Old Faithful will erupt either 54.64 minutes after an eruption lasting less than 3.2 minutes, or 80.05 minutes after an eruption lasting more than 3.2 minutes.
LinearModel <- lm(waiting ~ eruptions, data = faithful)
intercept = round(LinearModel$coefficients[1], 3)
slope = round(LinearModel$coefficients[2], 3)
formula = paste ("waiting = ", intercept, " + ", slope, "*", "eruptions + error")
plot(x = eruptions, y = waiting, main=paste("Old Faithful dataset: ", formula))
abline(reg = LinearModel,col="blue")##
## Call:
## lm(formula = waiting ~ eruptions, data = faithful)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.079608 -4.483103 0.212248 3.924627 15.971858
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.474397 1.154874 28.9853 < 0.000000000000000222 ***
## eruptions 10.729641 0.314753 34.0890 < 0.000000000000000222 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.91401 on 270 degrees of freedom
## Multiple R-squared: 0.811461, Adjusted R-squared: 0.810762
## F-statistic: 1162.06 on 1 and 270 DF, p-value: < 0.000000000000000222
The regression indicates that the results are highly significant, with an \(R^2\) of 0.81 .
Residual = resid(LinearModel)
Fitted = fitted(LinearModel)
plot(Fitted,Residual,
main="Old Faithful geyser dataset: Fitted vs. Residuals",
xlab="Fitted Waiting Time (minutes)")
abline(h=0, col="blue")There are clearly a number of outliers at the negative end of the residuals, which suggests non-normality.
Residual = resid(LinearModel)
hist(Residual, main = "Histogram of Residuals - 7 breaks", ylab = "Density",
ylim = c(0, 0.10),
prob = TRUE,breaks=7, col="lightblue")
curve(dnorm(x, mean = mean(Residual), sd = sd(Residual)), col="red", add=TRUE)hist(Residual, main = "Histogram of Residuals - 30 breaks", ylab = "Density",
ylim = c(0, 0.11),prob = TRUE,breaks=30, col="lightblue")
curve(dnorm(x, mean = mean(Residual), sd = sd(Residual)), col="red", add=TRUE)The residuals somewhat resemble a normal curve, but there are some regions where the distribution is rather different.
\(H_0\) : The residuals are normal
\(H_A\) : The residuals are not normal
##
## Shapiro-Wilk normality test
##
## data: Residual
## W = 0.9885117, p-value = 0.0294699
Because the p-value (.0295) is low, the null hypothesis is rejected at 95% confidence.
##
## Call:
## lm(formula = waiting ~ eruptions, data = shortfaith)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.15848 -4.61328 0.38580 4.21790 13.88948
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.52377 4.15364 9.51544 0.0000000000000016353 ***
## eruptions 7.38009 2.00836 3.67468 0.00039213 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.63939 on 96 degrees of freedom
## Multiple R-squared: 0.123314, Adjusted R-squared: 0.114182
## F-statistic: 13.5033 on 1 and 96 DF, p-value: 0.000392134
##
## Call:
## lm(formula = waiting ~ eruptions, data = longfaith)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.609870 -3.973889 -0.114825 3.359148 14.078262
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.49631 4.56676 12.5902 < 0.000000000000000222 ***
## eruptions 5.24747 1.05787 4.9604 0.000001679 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.58408 on 172 degrees of freedom
## Multiple R-squared: 0.125152, Adjusted R-squared: 0.120066
## F-statistic: 24.6056 on 1 and 172 DF, p-value: 0.00000167899
plot(waiting~eruptions,
xlab="length of eruption, in minutes",
ylab="Waiting time between eruptions, in minutes",
main="Old Faithful Geyser: Eruption length vs. waiting time",
type = "n"
)
points(shortfaith$eruptions,shortfaith$waiting,col="blue")
abline(ShortModel,col="blue")
points(longfaith$eruptions,longfaith$waiting,col="red")
abline(LongModel,col="red")shortResidual = resid(ShortModel)
shortFitted = fitted(ShortModel)
longResidual = resid(LongModel)
longFitted = fitted(LongModel)
plot(Fitted,Residual,type="n",
main="Old Faithful geyser dataset: Fitted vs. Residuals",
xlab="Fitted Waiting Time (minutes)")
points(shortFitted,shortResidual,col="blue")
points(longFitted,longResidual,col="red")
abline(h=0, col="black")hist(shortResidual, main = "Histogram of short Residuals - 15 breaks", ylab = "Density",
ylim = c(0, 0.11),prob = TRUE,breaks=15, col="lightblue")
curve(dnorm(x, mean = mean(shortResidual), sd = sd(shortResidual)), col="red", add=TRUE)##
## Shapiro-Wilk normality test
##
## data: shortResidual
## W = 0.9842669, p-value = 0.2938
hist(longResidual, main = "Histogram of long Residuals - 15 breaks", ylab = "Density",
ylim = c(0, 0.11),prob = TRUE,breaks=15, col="red")
curve(dnorm(x, mean = mean(longResidual), sd = sd(longResidual)), col="black", add=TRUE)##
## Shapiro-Wilk normality test
##
## data: longResidual
## W = 0.9899504, p-value = 0.258103
Interestingly, the split models each give a much lower \(R^2\), closer to 0.12 while the combined model had a much higher $R^2 .
However, the resiiduals on the separate models fare much better under the Shapiro-Wilks Normality test.
Estimating the bimodal distributions using a single linear model reuslts in a high \(R^2\) but the residuals fail the normality test.
Partitioning the data into “short” vs. “long” clusters results in models which each have much lower \(R^2\) but the respective residuals do pass the normality test.
The explanation is that modeling all the data together does provide a strong relationship because of substantial between-group variance; thus, the high \(R^2 = 0.81\) .
Splitting the data into two groups and modeling each with a separate linear model explains much less of the within-group variance.
Thus, it is difficult to model data of this type with a linear model.