The given data set contains the duration of the eruption (in minutes) and the waiting time between eruptions (in minutes) for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. The first column, eruptions, contains the duration of the eruption and the second column, waiting, contains the waiting time between eruptions.

Scatter Plot

We load the given data set, summarize the given data set and then make a labeled scatter plot of the given data set with waiting time between eruptions on the x axis and duration of the eruption on the y axis.

##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55
##    eruptions        waiting    
##  Min.   :1.600   Min.   :43.0  
##  1st Qu.:2.163   1st Qu.:58.0  
##  Median :4.000   Median :76.0  
##  Mean   :3.488   Mean   :70.9  
##  3rd Qu.:4.454   3rd Qu.:82.0  
##  Max.   :5.100   Max.   :96.0

We observe that there appears to be a linear relationship between waiting time between eruptions and duration of the eruption.
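
The code behind this step is not shown, so the following is only a minimal sketch of how it might look. It assumes that the given data set is the same as R's built-in `faithful` data set (whose summary matches the one printed above); the names `old_faithful_geyser`, `x` and `y` are taken from the output in this document, while the plot title and axis labels are our own choice.

```r
# Assumption: the given data set is R's built-in `faithful` data set.
old_faithful_geyser <- datasets::faithful

x <- old_faithful_geyser$waiting    # waiting time between eruptions (minutes)
y <- old_faithful_geyser$eruptions  # duration of the eruption (minutes)

# First rows and a five-number summary of both columns.
print(head(old_faithful_geyser))
print(summary(old_faithful_geyser))

# Labeled scatter plot: waiting time on the x axis, duration on the y axis.
plot(x, y,
     main = "Old Faithful geyser",
     xlab = "Waiting time between eruptions (minutes)",
     ylab = "Duration of the eruption (minutes)")
```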

Correlation

We find the Pearson’s correlation coefficient between waiting time between eruptions and duration of the eruption.

## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = 34.089, df = 270, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8756964 0.9210652
## sample estimates:
##       cor 
## 0.9008112

We observe that the t-value is 34.089, the degrees of freedom is 270, the p-value is less than 2.2e-16, the 95% confidence interval is (0.8756964, 0.9210652) and the correlation coefficient is 0.9008112.

We fail to reject the null hypothesis if the observed p-value of the test is more than the threshold of 0.05. We observe that the p-value of the test is less than 2.2e-16, which is far below the threshold of 0.05. Thus, at the 5% significance level, we reject the null hypothesis that the true correlation is zero.

A positive Pearson’s correlation coefficient indicates a positive correlation, that is, as the waiting time between eruptions increases, the duration of the eruption also tends to increase. A correlation coefficient close to 1 indicates a strong linear correlation, and here it is about 0.90. Both features are clearly visible in the scatter plot.
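
As a sketch of how this test might be reproduced, we can call `cor.test` and also recompute the t-value by hand from the correlation coefficient r and the sample size n, using t = r * sqrt(n - 2) / sqrt(1 - r^2). We assume again that the data are R's built-in `faithful` data set; the name `t_manual` is ours.

```r
# Assumption: the given data set is R's built-in `faithful` data set.
old_faithful_geyser <- datasets::faithful
x <- old_faithful_geyser$waiting    # waiting time between eruptions
y <- old_faithful_geyser$eruptions  # duration of the eruption

# Pearson's product-moment correlation test (two-sided by default).
correlation_test <- cor.test(x, y, method = "pearson")
print(correlation_test)

# The same t-value computed by hand from r and n.
r <- unname(correlation_test$estimate)
n <- length(x)
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)
print(t_manual)
```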

Linear Regression

We fit a linear regression line for the given data set with waiting time between eruptions as the independent (explanatory) variable and duration of the eruption as the dependent (explained) variable, summarize the results of the linear regression and find the intercept and slope coefficients. We note that both the explanatory variable and the explained variable are quantitative variables.

## 
## Call:
## lm(formula = y ~ x, data = old_faithful_geyser)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.29917 -0.37689  0.03508  0.34909  1.19329 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.874016   0.160143  -11.70   <2e-16 ***
## x            0.075628   0.002219   34.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4965 on 270 degrees of freedom
## Multiple R-squared:  0.8115, Adjusted R-squared:  0.8108 
## F-statistic:  1162 on 1 and 270 DF,  p-value: < 2.2e-16
##                Estimate  Std. Error   t value      Pr(>|t|)
## (Intercept) -1.87401599 0.160143302 -11.70212  7.359171e-26
## x            0.07562795 0.002218541  34.08904 8.129959e-100

We observe that the estimated intercept is -1.874016, the estimated slope is 0.075628, the standard error of the intercept is 0.160143, the standard error of the slope is 0.002219, the t-value for the intercept is -11.70, the t-value for the slope is 34.09 and the p-values for both the intercept and the slope are less than 2e-16.
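
A minimal sketch of the fit, assuming the data are R's built-in `faithful` data set and using the names `old_faithful_geyser`, `x` and `y` that appear in the printed call. The name `linear_model` is ours.

```r
# Assumption: the given data set is R's built-in `faithful` data set.
old_faithful_geyser <- datasets::faithful
x <- old_faithful_geyser$waiting    # explanatory variable
y <- old_faithful_geyser$eruptions  # explained variable

# Ordinary least squares fit of duration on waiting time.
linear_model <- lm(y ~ x, data = old_faithful_geyser)

# Full summary, then the coefficient table on its own.
print(summary(linear_model))
print(coef(summary(linear_model)))
```

With these coefficients, the fitted line is approximately duration = -1.874 + 0.0756 * waiting, so each additional minute of waiting corresponds to about 0.076 more minutes of eruption.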

Confidence Intervals

We find the confidence intervals for both the intercept and the slope.

##                   2.5 %      97.5 %
## (Intercept) -2.18930436 -1.55872761
## x            0.07126011  0.07999579

We observe that the 95% confidence interval for the intercept is (-2.18930436, -1.55872761) and the 95% confidence interval for the slope is (0.07126011, 0.07999579). Each interval contains the corresponding estimate and neither interval contains zero.
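
As a sketch, these intervals can be obtained with `confint` and checked by hand as estimate ± critical value × standard error, where the critical value is the 97.5% quantile of the t-distribution with 270 degrees of freedom. We assume the built-in `faithful` data set; `manual_intervals` is our own name.

```r
# Assumption: the given data set is R's built-in `faithful` data set.
old_faithful_geyser <- datasets::faithful
x <- old_faithful_geyser$waiting
y <- old_faithful_geyser$eruptions
linear_model <- lm(y ~ x, data = old_faithful_geyser)

# 95% confidence intervals for the intercept and the slope.
confidence_intervals <- confint(linear_model, level = 0.95)
print(confidence_intervals)

# The same intervals by hand: estimate -/+ t-quantile * standard error.
coefficient_table <- coef(summary(linear_model))
critical_value <- qt(0.975, df = df.residual(linear_model))
manual_intervals <- cbind(
  lower = coefficient_table[, "Estimate"] -
    critical_value * coefficient_table[, "Std. Error"],
  upper = coefficient_table[, "Estimate"] +
    critical_value * coefficient_table[, "Std. Error"]
)
print(manual_intervals)
```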

Significance Tests

We find the two-sided critical value of the t-distribution with 270 degrees of freedom at the 5% significance level, and the t-values for both the intercept and the slope.

## [1] 1.968789
## [1] -11.70212
## [1] 34.08904

We test the null hypothesis that the intercept is zero. We reject the null hypothesis if the absolute value of the observed t-value for the intercept is more than the critical value of the t-distribution. The t-value for the intercept is -11.70, and 11.70 is much greater than the obtained critical value of 1.97. Thus, at the 5% significance level, we reject the null hypothesis that the intercept is zero.

We test the null hypothesis that the slope is zero. We reject the null hypothesis if the absolute value of the observed t-value for the slope is more than the critical value of the t-distribution. The t-value for the slope is 34.09, and 34.09 is much greater than the obtained critical value of 1.97. Thus, at the 5% significance level, we reject the null hypothesis that the slope is zero. This indicates a significant linear relationship between waiting time between eruptions and duration of the eruption.
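
The decision rule above might be sketched as follows, assuming the built-in `faithful` data set. The name `reject_null` is ours; it simply records whether each |t| exceeds the critical value.

```r
# Assumption: the given data set is R's built-in `faithful` data set.
old_faithful_geyser <- datasets::faithful
x <- old_faithful_geyser$waiting
y <- old_faithful_geyser$eruptions
linear_model <- lm(y ~ x, data = old_faithful_geyser)

# Two-sided critical value at the 5% level with n - 2 degrees of freedom.
critical_value <- qt(0.975, df = df.residual(linear_model))
print(critical_value)

# Observed t-values for the intercept and the slope.
t_values <- coef(summary(linear_model))[, "t value"]
print(t_values[["(Intercept)"]])
print(t_values[["x"]])

# Reject the null hypothesis (coefficient is zero) when |t| > critical value.
reject_null <- abs(t_values) > critical_value
print(reject_null)
```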

Residual Analysis

The assumptions on errors are that errors follow a normal distribution with zero mean and constant variance (homoscedasticity), and that errors are independent of each other. We want to verify whether these assumptions hold for the given data set. We make residual plots (with lowess fits) to check the zero mean, the constant variance (homoscedasticity) and the independence assumptions. We make a QQ plot and perform a Shapiro-Wilk’s test to check the normal distribution assumption. We make a scale-location plot (with a lowess fit) to check the constant variance (homoscedasticity) assumption, and a leverages plot (with a lowess fit) to check for influential outliers. We note that the scatter plot, the residual plots and the QQ plot are also indicators of influential outliers.

Residual Plots

We calculate the residuals and standardized residuals for the given data set, find a summary of the residuals and standardized residuals for the given data set and plot the residuals and standardized residuals against waiting time between eruptions. We expect the results we obtain for standardized residuals to be consistent with the results we obtain for residuals.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.29917 -0.37689  0.03508  0.00000  0.34909  1.19329
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -2.6226674 -0.7626644  0.0709675 -0.0002581  0.7049573  2.4084037

We observe from the summary that the mean of the residuals is zero and the residuals are not too far from the x axis, indicating no significant departure from the zero mean assumption. We observe that the residuals are not randomly and symmetrically distributed about the x axis and form a strong pattern which can’t be ignored, indicating significant departure from the constant variance (homoscedasticity) and the independence assumption. We observe that there are no significant differences between the residual plot and the standardized residual plot other than the scaling on the y axis.
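
A possible sketch of the residual plots, assuming the built-in `faithful` data set. We use `rstandard` for the standardized residuals, which is an assumption on our part since the original standardization is not shown; the dashed line is a lowess fit.

```r
# Assumption: the given data set is R's built-in `faithful` data set.
old_faithful_geyser <- datasets::faithful
x <- old_faithful_geyser$waiting
y <- old_faithful_geyser$eruptions
linear_model <- lm(y ~ x, data = old_faithful_geyser)

# Raw residuals and (internally studentized) standardized residuals.
residual <- resid(linear_model)
standardized_residual <- rstandard(linear_model)
print(summary(residual))
print(summary(standardized_residual))

# Residuals against waiting time, with a zero line and a dashed lowess fit.
par(mfrow = c(1, 2))
plot(x, residual,
     xlab = "Waiting time between eruptions (minutes)",
     ylab = "Residuals")
abline(h = 0)
lines(lowess(x, residual), lty = 2)

plot(x, standardized_residual,
     xlab = "Waiting time between eruptions (minutes)",
     ylab = "Standardized residuals")
abline(h = 0)
lines(lowess(x, standardized_residual), lty = 2)
```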

Quantile-Quantile Plot

We make a QQ plot of sample quantiles versus theoretical quantiles.

We observe that the plot of sample quantiles versus theoretical quantiles lies on or around the QQ line, with deviations only near the tails, that is, the sample quantiles and the theoretical quantiles match closely except near the tails. We observe that the QQ plot shows no signs of significant skewness or a heavy tail. Thus, there is no significant departure from the normal distribution assumption.
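
A minimal sketch of such a QQ plot, assuming the built-in `faithful` data set and that the plot is made from the raw residuals of the fitted model:

```r
# Assumption: the given data set is R's built-in `faithful` data set.
old_faithful_geyser <- datasets::faithful
x <- old_faithful_geyser$waiting
y <- old_faithful_geyser$eruptions
linear_model <- lm(y ~ x, data = old_faithful_geyser)
residual <- resid(linear_model)

# Sample quantiles of the residuals against normal theoretical quantiles.
qq_points <- qqnorm(residual,
                    xlab = "Theoretical quantiles",
                    ylab = "Sample quantiles")
# Reference line through the first and third quartiles.
qqline(residual)
```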

Shapiro-Wilk’s Test

We perform the Shapiro-Wilk’s test on the residuals to either accept or reject the null hypothesis that the residuals are normal.

## 
##  Shapiro-Wilk normality test
## 
## data:  residual
## W = 0.99278, p-value = 0.2106

We fail to reject the null hypothesis if the observed p-value of the Shapiro-Wilk’s test is more than the threshold of 0.05. We observe that the p-value of the Shapiro-Wilk’s test is 0.2106, which is greater than the threshold of 0.05. Thus, at the 5% significance level, we fail to reject the null hypothesis that the residuals are normal, that is, the test gives no evidence against normality.
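
The test might be run as in the sketch below, assuming the built-in `faithful` data set and the raw residuals of the fitted model (named `residual`, as in the printed output).

```r
# Assumption: the given data set is R's built-in `faithful` data set.
old_faithful_geyser <- datasets::faithful
x <- old_faithful_geyser$waiting
y <- old_faithful_geyser$eruptions
linear_model <- lm(y ~ x, data = old_faithful_geyser)
residual <- resid(linear_model)

# Shapiro-Wilk test; the null hypothesis is that the residuals are normal.
normality_test <- shapiro.test(residual)
print(normality_test)

# TRUE means we fail to reject normality at the 5% level.
print(normality_test$p.value > 0.05)
```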

Scale-Location Plot

We calculate the square root of the absolute values of standardized residuals for the given data set and plot the square root of the absolute values of standardized residuals against waiting time between eruptions. We expect the residuals to be spread equally.

We observe that the residuals are not spread equally along the range of x axis and that the dashed line is clearly not horizontal. Thus, there is significant departure from the constant variance (homoscedasticity) assumption.
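
A sketch of the scale-location plot, assuming the built-in `faithful` data set and `rstandard` for the standardized residuals (our assumption); the dashed line is a lowess fit. The name `sqrt_abs_standardized_residual` is ours.

```r
# Assumption: the given data set is R's built-in `faithful` data set.
old_faithful_geyser <- datasets::faithful
x <- old_faithful_geyser$waiting
y <- old_faithful_geyser$eruptions
linear_model <- lm(y ~ x, data = old_faithful_geyser)

# Square root of the absolute standardized residuals.
sqrt_abs_standardized_residual <- sqrt(abs(rstandard(linear_model)))

# Under constant variance, the dashed lowess fit should be roughly horizontal.
plot(x, sqrt_abs_standardized_residual,
     xlab = "Waiting time between eruptions (minutes)",
     ylab = "sqrt(|standardized residuals|)")
lines(lowess(x, sqrt_abs_standardized_residual), lty = 2)
```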

Leverages Plot

We calculate leverages for the given data set and plot the leverages against standardized residuals.

We observe that there are no influential outliers. We note that the scatter plot, the residual plots and the QQ plot also indicate the same.
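
A sketch of the leverages plot, assuming the built-in `faithful` data set. Leverages are the diagonal entries of the hat matrix, obtained here with `hatvalues`. We follow the layout of R's own diagnostic plot and put leverage on the x axis, which is an assumption about the original figure.

```r
# Assumption: the given data set is R's built-in `faithful` data set.
old_faithful_geyser <- datasets::faithful
x <- old_faithful_geyser$waiting
y <- old_faithful_geyser$eruptions
linear_model <- lm(y ~ x, data = old_faithful_geyser)

# Leverage of each observation and the standardized residuals.
leverage <- hatvalues(linear_model)
standardized_residual <- rstandard(linear_model)
print(summary(leverage))

# Influential outliers would combine high leverage with a large residual.
plot(leverage, standardized_residual,
     xlab = "Leverage",
     ylab = "Standardized residuals")
abline(h = 0)
lines(lowess(leverage, standardized_residual), lty = 2)
```

The leverages of a simple linear regression sum to 2, the number of fitted coefficients, so the average leverage here is 2/272.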

Based on the residual analysis, we conclude that there are no significant departures from the zero mean and the normal distribution assumption of errors but there are significant departures from the constant variance (homoscedasticity) and the independence assumption of errors.

Transformation

To address the departures from the constant variance (homoscedasticity) and the independence assumptions of errors, while keeping the zero mean and the normal distribution assumptions, we transform both variables. We use u and v in place of duration of the eruption and waiting time between eruptions respectively. We then repeat the residual analysis and compare it with the previous one.

u <- sign(tan(x)) * log(abs(tan(x)) + 1)  # a function of x = duration of the eruption
v <- sign(tan(y)) * log(abs(tan(y)) + 1)  # a function of y = waiting time between eruptions
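
The refit on the transformed variables is not shown, so the following is only a sketch under stated assumptions: the data are R's built-in `faithful` data set, the transformed duration u is regressed on the transformed waiting time v, and the helper name `signed_log_tan` is hypothetical (only the formula comes from the two lines above). The name `residual_uv` follows the printed Shapiro-Wilk output further down.

```r
# Hypothetical helper name; the formula is the one given in the text.
signed_log_tan <- function(z) sign(tan(z)) * log(abs(tan(z)) + 1)

# Assumption: the given data set is R's built-in `faithful` data set.
old_faithful_geyser <- datasets::faithful
u <- signed_log_tan(old_faithful_geyser$eruptions)  # transformed duration
v <- signed_log_tan(old_faithful_geyser$waiting)    # transformed waiting time

# Assumption: transformed duration is regressed on transformed waiting time.
linear_model_uv <- lm(u ~ v)
residual_uv <- resid(linear_model_uv)
standardized_residual_uv <- rstandard(linear_model_uv)

print(summary(residual_uv))
print(summary(standardized_residual_uv))
```

The same diagnostic plots and tests as before can then be applied to `residual_uv` and `standardized_residual_uv`.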

Residual Plots

We make residual and standardized residual plots for the transformed data.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -5.6462 -1.3006  0.3197  0.0000  1.1342  4.3678
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -3.490012 -0.813217  0.197524  0.000154  0.700751  2.698332

We observe from the summary that the mean of the residuals is still zero and the residuals are not too far from the x axis. This time we also observe that the residuals are randomly and symmetrically distributed about the x axis, and that the dashed line is nearly horizontal and almost overlaps the x axis. This indicates no significant departures from the zero mean, the constant variance (homoscedasticity) and the independence assumptions. We note that even when the constant variance (homoscedasticity) and the independence assumptions of errors are true, the residuals can show a little unequal variance and a little correlation, as the residuals are only estimates of the errors. This might be the reason for not having an exactly horizontal dashed line. Again, we observe that there are no significant differences between the residual plot and the standardized residual plot except for the scaling on the y axis.

Quantile-Quantile Plot

We make a QQ plot of sample quantiles versus theoretical quantiles.

Again, we observe that the plot of sample quantiles versus theoretical quantiles lies on or around the QQ line, but this time there are more visible deviations near the tails as well as a few deviations in the center region. Although we observe no signs of significant skewness or a heavy tail, this plot has deteriorated a bit. Nonetheless, judging by the plot alone, there is no serious departure from the normal distribution assumption.

Shapiro-Wilk’s Test

We perform the Shapiro-Wilk’s test on the residuals to either accept or reject the null hypothesis that the residuals are normal.

## 
##  Shapiro-Wilk normality test
## 
## data:  residual_uv
## W = 0.97201, p-value = 3.611e-05

We observe that the p-value of the Shapiro-Wilk’s test is 3.611e-05, which is about three orders of magnitude smaller than the threshold of 0.05. Thus, this time we reject the null hypothesis that the residuals are normal. However, with a sample as large as this one (272 observations), normality tests become very sensitive and can reject the null hypothesis for small deviations that have little practical effect, so the assumption of normality may be rejected too easily.

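Thus, giving more weight to the QQ plot than to the Shapiro-Wilk’s test because of the sample size, we conclude that there is no serious departure from the normal distribution assumption, although the evidence for normality is weaker than before the transformation.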

Scale-Location Plot

We make the scale-location plot for the transformed data set.

We observe that this time the residuals are spread equally along the range of the x axis and that the dashed line is almost horizontal. This indicates no significant departure from the constant variance (homoscedasticity) assumption. We note that even when the constant variance (homoscedasticity) assumption of errors is true, the residuals can show some unequal variance, as the residuals are only estimates of the errors. This might be the reason for not having an exactly horizontal dashed line.

Leverages Plot

We make the leverages plot for the transformed data set.

Again, we observe that there are no influential outliers.

Based on the residual analysis after the transformation, we conclude that there are no significant departures from the normal distribution with zero mean and constant variance (homoscedasticity), and the independence assumption of errors. Thus, it is safe to say that after the transformation the assumptions on errors are valid.