I found a neat dataset on Old Faithful: the duration of each eruption and the waiting time until the next one, both in minutes (see here).
library(datasets)
library(knitr)
data(faithful)
d <- data.frame(faithful)
# examine the first five rows
kable(head(d, 5))

| eruptions | waiting |
|---|---|
| 3.600 | 79 |
| 1.800 | 54 |
| 3.333 | 74 |
| 2.283 | 62 |
| 4.533 | 85 |
##
## Call:
## lm(formula = eruptions ~ waiting, data = d)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.29917 -0.37689  0.03508  0.34909  1.19329
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.874016   0.160143  -11.70   <2e-16 ***
## waiting      0.075628   0.002219   34.09   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4965 on 270 degrees of freedom
## Multiple R-squared: 0.8115, Adjusted R-squared: 0.8108
## F-statistic: 1162 on 1 and 270 DF, p-value: < 2.2e-16
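Based on the plot, it looks like we have a strong linear relationship here! The summary table confirms this: waiting time explains about 81% of the variance in eruption duration (R-squared = 0.8115), and the slope of roughly 0.076 minutes of eruption per minute of waiting is highly significant (p < 2e-16).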
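The code behind the plot and the summary isn't shown above, so here is a minimal sketch of how they could be reproduced. The model formula and the name `m` are taken from the output itself (`lm(formula = eruptions ~ waiting, data = d)` and `m$residuals`). The plotting choices, the name `pred_80`, and the 80-minute example are assumptions added for illustration.

```r
library(datasets)

d <- data.frame(faithful)

# Scatter plot of eruption duration against waiting time (both in minutes)
plot(d$waiting, d$eruptions,
     xlab = "Waiting time (min)", ylab = "Eruption duration (min)")

# Simple linear regression, matching the Call line in the summary output
m <- lm(eruptions ~ waiting, data = d)
abline(m, col = "red")  # overlay the fitted line on the scatter plot
summary(m)

# Illustrative only: expected duration after a hypothetical 80-minute wait
pred_80 <- predict(m, newdata = data.frame(waiting = 80))
pred_80
```

Using the coefficients reported in the summary, that prediction works out to about -1.874 + 0.0756 × 80 ≈ 4.18 minutes.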
##
## Shapiro-Wilk normality test
##
## data: m$residuals
## W = 0.99278, p-value = 0.2106
Here we see that the residuals are reasonably normally distributed. Based on the Shapiro-Wilk test (W = 0.993, p = 0.21), we cannot reject the null hypothesis that the sample comes from a normally distributed population (or process, in this case, I suppose).
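For reference, a residual check like this one might look roughly as follows. The `shapiro.test(m$residuals)` call is implied by the `data: m$residuals` line in the output. The Q-Q plot, the name `sw`, and the 5% cutoff are assumptions, since the original code isn't shown.

```r
library(datasets)

# Refit the model from the summary output so this snippet runs on its own
m <- lm(eruptions ~ waiting, data = faithful)

# Normal Q-Q plot of the residuals; points close to the line suggest normality
qqnorm(m$residuals)
qqline(m$residuals)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(m$residuals)
sw

# TRUE means we fail to reject normality at the (assumed) 5% level
sw$p.value > 0.05
```

A large p-value here only means the test found no evidence against normality. It does not prove the residuals are normal.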
Just for visual comparison, I ran qqnorm on a truly random normal sample, and the two plots are hard to tell apart. I'd say the assumptions for regression are likely valid here, at least as far as normality of the residuals goes, but a larger sample size would be ideal.
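The visual comparison could be sketched along these lines. How the random data was actually generated isn't shown, so the use of `rnorm`, the seed, the matching standard deviation, and the names `res` and `sim` are all assumptions.

```r
library(datasets)

m <- lm(eruptions ~ waiting, data = faithful)
res <- m$residuals

set.seed(1)  # arbitrary seed, only so the simulated sample is reproducible
# Simulated normal sample with the same size and spread as the residuals
sim <- rnorm(length(res), mean = 0, sd = sd(res))

# Side-by-side normal Q-Q plots for visual comparison
op <- par(mfrow = c(1, 2))
qqnorm(res, main = "Model residuals"); qqline(res)
qqnorm(sim, main = "Simulated normal data"); qqline(sim)
par(op)  # restore the previous plotting layout
```

Rerunning with a few different seeds gives a feel for how much a genuinely normal sample of this size wanders off the reference line.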