The data set records, for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA, the duration of each eruption and the waiting time between eruptions, both in minutes. The first column, eruptions, contains the duration of the eruption and the second column, waiting, contains the waiting time between eruptions.
We load the data set, summarize it and then make a labeled scatter plot with waiting time between eruptions on the x axis and duration of the eruption on the y axis.
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55
##    eruptions        waiting
##  Min.   :1.600   Min.   :43.0
##  1st Qu.:2.163   1st Qu.:58.0
##  Median :4.000   Median :76.0
##  Mean   :3.488   Mean   :70.9
##  3rd Qu.:4.454   3rd Qu.:82.0
##  Max.   :5.100   Max.   :96.0
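The loading, summarizing and plotting steps can be sketched in R as follows. This is a minimal sketch that assumes the data set is the same as R's built-in `faithful` data frame (columns `eruptions` and `waiting`); the name `old_faithful_geyser` is borrowed from the model call shown in the regression output, and the axis labels and title are illustrative choices.

```r
# Assumption: the data match R's built-in `faithful` data frame.
old_faithful_geyser <- datasets::faithful

# First six rows and a per-column summary.
print(head(old_faithful_geyser))
print(summary(old_faithful_geyser))

# Labeled scatter plot: waiting time on the x axis, duration on the y axis.
plot(old_faithful_geyser$waiting, old_faithful_geyser$eruptions,
     xlab = "Waiting time between eruptions (minutes)",
     ylab = "Duration of the eruption (minutes)",
     main = "Old Faithful: duration versus waiting time")
```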
We observe that there appears to be a linear relationship between waiting time between eruptions and duration of the eruption.
We compute Pearson’s correlation coefficient between waiting time between eruptions and duration of the eruption, and test whether it differs from zero.
##
## Pearson's product-moment correlation
##
## data: x and y
## t = 34.089, df = 270, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8756964 0.9210652
## sample estimates:
## cor
## 0.9008112
We observe that the t-value is 34.089, the degrees of freedom is 270, the p-value is less than 2.2e-16, the 95% confidence interval is (0.8756964, 0.9210652) and the correlation coefficient is 0.9008112.
We fail to reject the null hypothesis if the observed p-value of the test is more than the significance level of 0.05. We observe that the p-value of the test is less than 2.2e-16, which is far below 0.05. Thus, at the 5% significance level, we reject the null hypothesis that the true correlation is zero.
A positive Pearson’s correlation coefficient indicates positive correlation, that is, as waiting time between eruptions increases, duration of the eruption tends to increase as well. A correlation coefficient close to 1 (here about 0.90) indicates a strong linear association. Both features are clearly visible in the scatter plot.
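The correlation test and the decision rule can be sketched as follows. The sketch assumes the data match R's built-in `faithful` data frame, with `x` as waiting time and `y` as duration, as in the test output above; the names `t_manual`, `alpha` and `reject_null` are illustrative.

```r
# Assumption: the data match R's built-in `faithful` data frame.
x <- faithful$waiting    # waiting time between eruptions (minutes)
y <- faithful$eruptions  # duration of the eruption (minutes)

# Pearson's product-moment correlation test (two-sided by default).
ct <- cor.test(x, y, method = "pearson")
print(ct)

# The t statistic can be recovered from r and the degrees of freedom n - 2.
r  <- unname(ct$estimate)
df <- unname(ct$parameter)
t_manual <- r * sqrt(df) / sqrt(1 - r^2)

# Decision at the 5% significance level.
alpha <- 0.05
reject_null <- ct$p.value < alpha
```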
We fit a linear regression line with waiting time between eruptions as the independent (explanatory) variable and duration of the eruption as the dependent (explained) variable, summarize the results of the regression, and extract the intercept and slope coefficients. We note that both the explanatory variable and the explained variable are quantitative variables.
##
## Call:
## lm(formula = y ~ x, data = old_faithful_geyser)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.29917 -0.37689  0.03508  0.34909  1.19329
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.874016   0.160143  -11.70   <2e-16 ***
## x            0.075628   0.002219   34.09   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4965 on 270 degrees of freedom
## Multiple R-squared: 0.8115, Adjusted R-squared: 0.8108
## F-statistic: 1162 on 1 and 270 DF, p-value: < 2.2e-16
##                Estimate  Std. Error   t value      Pr(>|t|)
## (Intercept) -1.87401599 0.160143302 -11.70212  7.359171e-26
## x            0.07562795 0.002218541  34.08904 8.129959e-100
We observe that the estimated intercept is -1.874016, the estimated slope is 0.075628, the standard error of the intercept is 0.160143, the standard error of the slope is 0.002219, the t-value for the intercept is -11.70, the t-value for the slope is 34.09 and the p-values for both the intercept and the slope are less than 2e-16.
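The fit and its coefficient table can be reproduced along these lines. The sketch assumes the data match R's built-in `faithful` data frame and rebuilds a data frame with columns `x` (waiting) and `y` (duration) so that the formula matches the `lm(formula = y ~ x, data = old_faithful_geyser)` call in the output; how the original data frame was constructed is not shown in the source.

```r
# Assumption: the data match R's built-in `faithful` data frame.
old_faithful_geyser <- data.frame(x = faithful$waiting,    # explanatory
                                  y = faithful$eruptions)  # explained

# Simple linear regression of duration on waiting time.
fit <- lm(y ~ x, data = old_faithful_geyser)
print(summary(fit))

# Coefficient table: estimate, standard error, t value and p-value.
coef_table <- coef(summary(fit))
print(coef_table)

intercept <- coef(fit)[["(Intercept)"]]
slope     <- coef(fit)[["x"]]
```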
We find the confidence intervals for both the intercept and the slope.
##                   2.5 %      97.5 %
## (Intercept) -2.18930436 -1.55872761
## x            0.07126011  0.07999579
We observe that the 95% confidence interval for the intercept is (-2.18930436, -1.55872761) and the 95% confidence interval for the slope is (0.07126011, 0.07999579). Each interval is centered on the corresponding estimate, and neither interval contains zero.
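A short sketch of how these intervals can be obtained, assuming the data match R's built-in `faithful` data frame. It uses `confint()` and, as a cross-check, rebuilds the intervals by hand as estimate ± critical t value × standard error; the name `ci_manual` is illustrative.

```r
# Assumption: the data match R's built-in `faithful` data frame.
old_faithful_geyser <- data.frame(x = faithful$waiting, y = faithful$eruptions)
fit <- lm(y ~ x, data = old_faithful_geyser)

# 95% confidence intervals for the intercept and the slope.
ci <- confint(fit, level = 0.95)
print(ci)

# The same intervals by hand: estimate +/- t_crit * standard error.
est    <- coef(summary(fit))[, "Estimate"]
se     <- coef(summary(fit))[, "Std. Error"]
t_crit <- qt(0.975, df = df.residual(fit))
ci_manual <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(ci_manual)
```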
We find the two-sided critical value of the t distribution at the 5% significance level with 270 degrees of freedom, and the t-values for both the intercept and the slope.
## [1] 1.968789
## [1] -11.70212
## [1] 34.08904
We test the null hypothesis that the intercept is zero. We reject the null hypothesis if the absolute value of the observed t-value for the intercept is more than the critical value of the t distribution. The t-value for the intercept is -11.70, and 11.70 is much greater than the critical value of 1.97. Thus, at the 5% significance level, we reject the null hypothesis that the intercept is zero.
We test the null hypothesis that the slope is zero. We reject the null hypothesis if the absolute value of the observed t-value for the slope is more than the critical value of the t distribution. The t-value for the slope is 34.09, and 34.09 is much greater than the critical value of 1.97. Thus, at the 5% significance level, we reject the null hypothesis that the slope is zero. This also indicates a significant linear relationship between waiting time between eruptions and duration of the eruption.
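The critical value and the two decisions can be sketched as below, assuming the data match R's built-in `faithful` data frame. The names `alpha`, `t_crit` and `reject` are illustrative.

```r
# Assumption: the data match R's built-in `faithful` data frame.
old_faithful_geyser <- data.frame(x = faithful$waiting, y = faithful$eruptions)
fit <- lm(y ~ x, data = old_faithful_geyser)

# Two-sided critical value of the t distribution at the 5% level.
alpha  <- 0.05
t_crit <- qt(1 - alpha / 2, df = df.residual(fit))
print(t_crit)

# Observed t-values: estimate divided by its standard error.
t_values <- coef(summary(fit))[, "t value"]
print(t_values)

# Reject H0 (coefficient = 0) when |t| exceeds the critical value.
reject <- abs(t_values) > t_crit
print(reject)
```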
The assumptions on errors are that errors follow a normal distribution with zero mean and constant variance (homoscedasticity), and that errors are independent of each other. We want to verify whether these assumptions hold for the given data set. To do so, we:
- make residual plots (with lowess fits) to verify the zero mean, constant variance (homoscedasticity) and independence assumptions;
- make a QQ plot and perform a Shapiro-Wilk’s test to verify the normal distribution assumption;
- make a scale-location plot (with a lowess fit) to verify the constant variance (homoscedasticity) assumption;
- make a leverages plot (with a lowess fit) to check for influential outliers.
We note that the scatter plot, the residual plots and the QQ plot are also indicators of influential outliers.
We calculate the residuals and standardized residuals for the given data set, find a summary of the residuals and standardized residuals for the given data set and plot the residuals and standardized residuals against waiting time between eruptions. We expect the results we obtain for standardized residuals to be consistent with the results we obtain for residuals.
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
## -1.29917 -0.37689  0.03508  0.00000  0.34909  1.19329
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
## -2.6226674 -0.7626644  0.0709675 -0.0002581  0.7049573  2.4084037
We observe from the summary that the mean of the residuals is zero and the residuals are not too far from the x axis, indicating no significant departure from the zero mean assumption. However, the residuals are not randomly and symmetrically distributed about the x axis; they form a strong pattern that cannot be ignored, indicating significant departure from the constant variance (homoscedasticity) and independence assumptions. The residual plot and the standardized residual plot differ only in the scaling of the y axis.
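The residual summaries and plots described here can be sketched as follows. The sketch assumes the data match R's built-in `faithful` data frame and that the standardized residuals are those returned by `rstandard()`; the source does not show which function was used. The dashed curves are lowess fits.

```r
# Assumption: the data match R's built-in `faithful` data frame.
x <- faithful$waiting
y <- faithful$eruptions
fit <- lm(y ~ x)

residual     <- resid(fit)      # raw residuals
std_residual <- rstandard(fit)  # internally standardized residuals (assumed)

print(summary(residual))
print(summary(std_residual))

# Residuals and standardized residuals against waiting time, side by side.
op <- par(mfrow = c(1, 2))
plot(x, residual, xlab = "Waiting time (minutes)", ylab = "Residuals",
     main = "Residuals")
abline(h = 0)
lines(lowess(x, residual), lty = 2)
plot(x, std_residual, xlab = "Waiting time (minutes)",
     ylab = "Standardized residuals", main = "Standardized residuals")
abline(h = 0)
lines(lowess(x, std_residual), lty = 2)
par(op)
```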
We make a QQ plot of sample quantiles versus theoretical quantiles.
We observe that the points lie on or close to the QQ line, with deviations only near the tails; that is, the sample quantiles and the theoretical quantiles match closely except near the tails. The QQ plot shows no signs of significant skewness or a heavy tail. Thus, there is no significant departure from the normal distribution assumption.
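A QQ plot of this kind can be drawn with `qqnorm()` and `qqline()`. The sketch below assumes the data match R's built-in `faithful` data frame and that the raw residuals are plotted; the plot title is illustrative.

```r
# Assumption: the data match R's built-in `faithful` data frame.
fit <- lm(eruptions ~ waiting, data = faithful)
residual <- resid(fit)

# Sample quantiles of the residuals against theoretical normal quantiles.
qq <- qqnorm(residual, main = "Normal QQ plot of residuals")

# Reference line through the first and third quartiles.
qqline(residual)
```

`qqnorm()` invisibly returns the plotted coordinates, so `qq$x` and `qq$y` can be inspected if the tails need a closer look.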
We perform the Shapiro-Wilk’s test on the residuals to either accept or reject the null hypothesis that the residuals are normal.
##
## Shapiro-Wilk normality test
##
## data: residual
## W = 0.99278, p-value = 0.2106
We fail to reject the null hypothesis if the observed p-value of the Shapiro-Wilk’s test is more than the significance level of 0.05. We observe that the p-value of the Shapiro-Wilk’s test is 0.2106, which is greater than 0.05. Thus, at the 5% significance level, we fail to reject the null hypothesis that the residuals are normal.
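The normality test and its decision rule can be sketched as follows, assuming the data match R's built-in `faithful` data frame and that the test is applied to the raw residuals, as the `data: residual` line in the output suggests. The name `fail_to_reject` is illustrative.

```r
# Assumption: the data match R's built-in `faithful` data frame.
fit <- lm(eruptions ~ waiting, data = faithful)
residual <- resid(fit)

# Shapiro-Wilk test; H0: the residuals come from a normal distribution.
sw <- shapiro.test(residual)
print(sw)

# Decision at the 5% significance level.
fail_to_reject <- sw$p.value > 0.05
print(fail_to_reject)
```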
We calculate the square root of the absolute values of standardized residuals for the given data set and plot the square root of the absolute values of standardized residuals against waiting time between eruptions. We expect the residuals to be spread equally.
We observe that the residuals are not spread equally along the range of x axis and that the dashed line is clearly not horizontal. Thus, there is significant departure from the constant variance (homoscedasticity) assumption.
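A scale-location plot of this kind can be sketched as below. It assumes the data match R's built-in `faithful` data frame and that the standardized residuals come from `rstandard()`; the dashed curve is a lowess fit.

```r
# Assumption: the data match R's built-in `faithful` data frame.
fit <- lm(eruptions ~ waiting, data = faithful)

# Square root of the absolute standardized residuals.
sqrt_abs_std <- sqrt(abs(rstandard(fit)))

# Under constant variance the lowess curve should be roughly horizontal.
plot(faithful$waiting, sqrt_abs_std,
     xlab = "Waiting time (minutes)",
     ylab = "sqrt(|standardized residuals|)",
     main = "Scale-location plot")
lines(lowess(faithful$waiting, sqrt_abs_std), lty = 2)
```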
We calculate leverages for the given data set and plot the leverages against standardized residuals.
We observe that there are no influential outliers. We note that the scatter plot, the residual plots and the QQ plot also indicate the same.
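The leverage check can be sketched as follows, assuming the data match R's built-in `faithful` data frame. The axis orientation (leverage on the x axis, standardized residuals on the y axis) follows the usual convention and is an assumption, as is the use of Cook's distance as an additional numeric summary; neither is confirmed by the source.

```r
# Assumption: the data match R's built-in `faithful` data frame.
fit <- lm(eruptions ~ waiting, data = faithful)

leverage     <- hatvalues(fit)  # diagonal of the hat matrix
std_residual <- rstandard(fit)

# Influential points would combine high leverage with a large residual.
plot(leverage, std_residual,
     xlab = "Leverage", ylab = "Standardized residuals",
     main = "Standardized residuals versus leverage")
abline(h = 0)
lines(lowess(leverage, std_residual), lty = 2)

# Additional check (not in the source): Cook's distance per observation.
cooks <- cooks.distance(fit)
print(max(cooks))
```

The leverages of a model with an intercept and one slope sum to 2, the number of estimated coefficients, which gives a quick sanity check on `hatvalues()`.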
Based on the residual analysis, we conclude that there are no significant departures from the zero mean and the normal distribution assumption of errors but there are significant departures from the constant variance (homoscedasticity) and the independence assumption of errors.
To address the departures from the constant variance (homoscedasticity) and independence assumptions while keeping the zero mean and normal distribution assumptions, we apply a transformation to both variables. We use u and v instead of duration of the eruption and waiting time between eruptions respectively. We then repeat the residual analysis and compare it with the previous one.
u <- sign(tan(x)) * log(abs(tan(x)) + 1) # a function of x = duration of the eruption
v <- sign(tan(y)) * log(abs(tan(y)) + 1) # a function of y = waiting time between eruptions
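The transformation is a signed logarithm of the tangent: it keeps the sign of tan(z) and compresses its magnitude with log(|tan(z)| + 1). A minimal sketch of applying it and refitting is given below. It assumes the data match R's built-in `faithful` data frame, that u is built from duration and v from waiting time as stated in the text, and that the transformed duration u is regressed on the transformed waiting time v; the source does not show the refit call. The helper name `signed_log_tan` is hypothetical.

```r
# Hypothetical helper wrapping the transformation used in the text.
signed_log_tan <- function(z) sign(tan(z)) * log(abs(tan(z)) + 1)

# Assumption: the data match R's built-in `faithful` data frame.
u <- signed_log_tan(faithful$eruptions)  # transformed duration
v <- signed_log_tan(faithful$waiting)    # transformed waiting time

# Assumed refit: transformed duration on transformed waiting time.
fit_uv <- lm(u ~ v)

residual_uv     <- resid(fit_uv)
std_residual_uv <- rstandard(fit_uv)
print(summary(residual_uv))
print(summary(std_residual_uv))
```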
We make residual and standardized residual plots for the transformed data.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -5.6462 -1.3006  0.3197  0.0000  1.1342  4.3678
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## -3.490012 -0.813217  0.197524  0.000154  0.700751  2.698332
We observe from the summary that the mean of the residuals is still zero and the residuals are not too far from the x axis. This time, however, the residuals are also randomly and symmetrically distributed about the x axis, and the dashed line is nearly horizontal and almost overlaps the x axis. This indicates no significant departures from the zero mean, constant variance (homoscedasticity) and independence assumptions. We note that even when the constant variance (homoscedasticity) and independence assumptions hold for the errors, the residuals can show slightly unequal variances and slight correlation because they are only estimates of the errors, which may explain why the dashed line is not exactly horizontal. Again, the residual plot and the standardized residual plot differ only in the scaling of the y axis.
We make a QQ plot of sample quantiles versus theoretical quantiles.
Again, we observe that the plot of sample quantiles versus theoretical quantiles lies on/ around the QQ line but this time there are more visible deviations near tails as well as a few deviations in the center region. Although we observe no signs of significant skewness or a heavy tail, this plot has deteriorated a bit. Nonetheless, it is still safe to say that there is no significant departure from the normal distribution assumption.
We perform the Shapiro-Wilk’s test on the residuals to either accept or reject the null hypothesis that the residuals are normal.
##
## Shapiro-Wilk normality test
##
## data: residual_uv
## W = 0.97201, p-value = 3.611e-05
We observe that the p-value of the Shapiro-Wilk’s test is 3.611e-05, about three orders of magnitude below the threshold of 0.05. Thus, this time the test rejects the null hypothesis that the residuals are normal. However, normality tests become very sensitive as the sample grows: with a sample of this size (272 observations) they can flag departures that are too small to matter in practice, so the normality assumption may be rejected too easily.
Thus, giving more weight to the QQ plot than to the Shapiro-Wilk’s test for a sample of this size, we conclude that there is no significant departure from the normal distribution assumption.
We make the scale-location plot for the transformed data set.
We observe that this time the residuals are spread equally along the range of the x axis and that the dashed line is almost horizontal. This indicates no significant departure from the constant variance (homoscedasticity) assumption. We note that even when the constant variance (homoscedasticity) assumption holds for the errors, the residuals can show slightly unequal variances because they are only estimates of the errors, which may explain why the dashed line is not exactly horizontal.
We make the leverages plot for the transformed data set.
Again, we observe that there are no influential outliers.
Based on the residual analysis after the transformation, we conclude that there are no significant departures from the zero mean, normal distribution, constant variance (homoscedasticity) and independence assumptions on errors. Thus, it is safe to say that after the transformation the assumptions on errors are valid.
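For a quick side-by-side comparison of the two fits, R's built-in diagnostic panels from `plot()` on an `lm` object can be used. This is a sketch under stated assumptions, not the procedure used in this report: it assumes the data match R's built-in `faithful` data frame, that the transformed model regresses u (from duration) on v (from waiting time), and it plots residuals against fitted values rather than against waiting time. The helper name `signed_log_tan` is hypothetical.

```r
# Hypothetical helper wrapping the transformation used in the text.
signed_log_tan <- function(z) sign(tan(z)) * log(abs(tan(z)) + 1)

# Assumption: the data match R's built-in `faithful` data frame.
fit_xy <- lm(eruptions ~ waiting, data = faithful)
transformed <- data.frame(u = signed_log_tan(faithful$eruptions),
                          v = signed_log_tan(faithful$waiting))
fit_uv <- lm(u ~ v, data = transformed)  # assumed form of the refit

# Top row: original fit. Bottom row: transformed fit.
# Panels: residuals vs fitted, normal QQ, scale-location,
# residuals vs leverage.
op <- par(mfrow = c(2, 4))
plot(fit_xy, which = c(1, 2, 3, 5))
plot(fit_uv, which = c(1, 2, 3, 5))
par(op)
```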