The dataset that is took for this study is the old faithful data. In Yellowstone park of USA, there is a geyser called ‘Old Faithful’ which erupts every 1 hour to 1.5 hour . Since it is one of the popular attractions people would like to know how long they should wait in order to watch another episode of the eruption. For this reason, data is collected for about 272 eruptions and the how much interval is observed for each eruption along with the duration of each eruption on how much time it lasted.
The dataset contains 2 features interval and duration the dimensions are 272 \(\times\) 2. The data is used to build a simple linear regression model to predict interval based on the previous eruption duration. This mainly illustrates the statistical inferences of the GLM model and analytical methods for deriving the coefficeints. Also, optimisation is done to find the best value of slope if the intercept and standard deviation were provided.
dim(faithful)
## [1] 272 2
Derive the MLE estimators for the three regression parameters in the one-dimensional linear model
Derivations can done with respect to the simple linear model \[ Y = \beta_0 + \beta_1 X + \epsilon, \ \epsilon \sim N((0, \sigma^2)\] The estimates of the three parameters are :
Assuming that the data is of the form \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\). The values of the MLE estimates of parameters \(\beta_0, \beta_1, \sigma^2\) are:
\[\hat{\beta_1} = \frac{\bar{yx}
-\bar{y}\bar{x}}{\bar{x^2} - \bar{x}^2} =
\frac{\textrm{{cov}}(y,x)}{\textrm{{var}}(x)}\]
\[\hat{\beta_0} = \bar{y} - \hat{\beta_1}
\bar{x}\]
\[\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n
(y_i - \hat{\beta_0} - \hat{\beta_1} x_i)^2\]
Where \(\bar{y}, \bar{x}\) are the
sample means of \(y\) and \(x\),
\(\bar{yx}\) is the mean of \(yx\),
\(\bar{x^2}\) is the mean value of
\(x^2\).
Using the analytic expressions for linear regression, calculate the maximum-likelihood intercept and gradient for a line of best-fit using the linear model: interval ∼ duration
Calculating the mean, variance and covariances and plugging into the above analytical expressions can resul in the model coefficients.
y <- faithful$waiting # interval
x <- faithful$eruptions # duration
b1 <- cov(y,x) / var(x) # slope
b0 <- mean(y) - b1*mean(x) # intercept
b0
## [1] 33.4744
b1
## [1] 10.72964
Therefore, Value of b1 is 10.729 and b0 is 33.474
Using the glm function in R, perform a regression of the linear model: interval ∼ duration. Check that the the coefficients provided by the glm function match those from the analytic expressions. Record the estimated coefficients and their standard errors, and calculate a 95% confidence interval for β1
Now, the coefficients calculated analytically can be compared to the ones that are resulted from fitting the linear regression model using the Generalized linear models command in R. We can compare the values of \(\beta_0\) and \(\beta_1\).
df <- data.frame(duration = x,interval = y)
dim(df)[1]
## [1] 272
model_t3 <- glm(formula = interval ~ duration , data = df)
summary(model_t3)
##
## Call:
## glm(formula = interval ~ duration, data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -12.0796 -4.4831 0.2122 3.9246 15.9719
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.4744 1.1549 28.98 <2e-16 ***
## duration 10.7296 0.3148 34.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 34.97551)
##
## Null deviance: 50087.1 on 271 degrees of freedom
## Residual deviance: 9443.4 on 270 degrees of freedom
## AIC: 1742.8
##
## Number of Fisher Scoring iterations: 2
The intercept value is 33.474 and gradient is 10.729 . From the above we can conclude that the values derived analytically are equal to those that are returned by the model. The summary command gives a lot of information. It not only displays the estimates and the standard error of these coefficients, but also tells us about the critical t value for the distribution. The last column of the coefficients information signifies the probability of the t-statistic being greater than the critical t-value. In general, we reject a null hypothesis if the t statistic value is larger than critical value. But if this probability is high, then the null hypothesis that a coefficient is significant can be rejected. This can also be easily understood by the confidence intervals.
The two hypothesis can be : \[{H_0} : \beta_0 = 0\] \[{H_0} : \beta_1 = 0\] stating that the coefficients are zero.
We can now calculate the confidence intervals for the \(\beta_0\) and \(\beta_1\) to understand how important they are for the model by testing a null hypothesis on whether the coefficient can be zero. If the confidence interval contains zero, then the coefficient may not be significant resulting in rejecting null hypothesis.
# Confidence interval for the slope(beta1)
summary <- summary(model_t3)$coefficients
estimate <- summary[2,1]
stnderr <- summary[2,2]
critical_val <- qt(0.975,df = dim(df)[1] - 12 )
C.I_min <- estimate - stnderr*critical_val
C.I_max <- estimate + stnderr*critical_val
C.I_min
## [1] 10.10985
C.I_max
## [1] 11.34943
The confidence intervals of \(\beta_1\) are [10.10,11.35]. This implies that the coefficient value of \(10.72\) lies within this interval and the coefficient is significant for the model.
# Confidence interval for the intercept(beta0)
estimate <- summary[1,1]
stnderr <- summary[1,2]
critical_val <- qt(0.975,df = dim(df)[1] - 12 )
C.I_min <- estimate - stnderr*critical_val
C.I_max <- estimate + stnderr*critical_val
C.I_min
## [1] 31.2003
C.I_max
## [1] 35.74849
The confidence intervals of \(\beta_1\) are [31.2003,35.74849]. Clearly, both the confidence intervals does not contain zero and they both are positive coefficients.
Reproduce the plot of duration vs waiting interval, including the line of best fit (use the plot and abline commands in R)
The data can be plotted duration versus interval. Then the line can be plotted by using the slope and intercept that are outputted by the model.The red line indicates the increasing trend for the observations and used to predict the eruptions based on the duration .
plot(x,y,xlab = "duration", ylab = "interval",main = "Plot of erruption interval versus duration" )
abline(a = b0 , b = b1, col ="red")
Assuming an intercept of 33min and a noise parameter of σ = 30min, calculate the likelihood for a range of slopes between 0 and 25. Plot the likelihood as a function of the slope and find the slope that maximises the likelihood
Since both the input features follow a normal distribution, we can construct the function for log likelihood and can estimate the value of \(\beta_1\) and find for which value of \(\beta_1\) has the maximum loglikelihood and tat will be the best vlaue for \(\beta_1\) for the given intercept = 33 and sigma = 30 .
so first calculating likelihood,
loglikelihood <- function(b1){
x <- faithful$eruptions
y <- faithful$waiting
sigma = 30
n = 100
y_pred <- 33 + b1*x
LL = - sum((y-y_pred)^2)/(2*sigma^2) - n*log(sigma) - 0.5*n*log(2*pi)
return (LL)
}
considering 100 points between 0 and 25, we can try to calculate the maximum likelihood by plotting the curve for all \(\beta_1\)s against the log likelihoods and find the \(\beta_1\) for the maximum value of loglikelihood.
b1s <- seq(from=0,to =25,length.out = 100)
LLs <- rep(NA,100)
for (i in 1:100){
LLs[i] <- loglikelihood(b1s[i])
}
plot(b1s,LLs,xlab = "slope", ylab = "loglikelihood",main = "Plot of loglikelihood versus slope")
So, the value of b1 can be around 10, now we try to optimise to find out the best value of b1 that maximises the Likelihood. So, we take negative loglikelihood function and try to optimise it with the nearest value such as 10.
neg_LLs <- function(b1){-loglikelihood(b1)}
best_b1 <- optim(par = 10,fn = neg_LLs)
best_b1
## $par
## [1] 10.85352
##
## $value
## [1] 437.2632
##
## $counts
## function gradient
## 20 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
This optimisation provided the result that the best value of slope can be 10.85 where the loglikelihood is maximum at -437.26 For the better visualisation, the lines are plotted at the best \(\beta_1\) that has highest loglikelihood.
plot(b1s,LLs)
abline(h = -best_b1$value ,col = "green",lwd = 2)
abline(v = best_b1$par,col = "purple")
Using the predict function, predict the waiting interval after durations of eruptions of 1, 3 and 10 minutes. Which of these predictions are you most confident about? Why? (Consider how your predictions change if you add or subtract one standard error from the regression coefficients)
Let us assume the eruptions which are to be predicted as a vector and convert to a dataframe to use predict function to get the corresponding waiting intervals.
Pred_x <- c(1,3,10)
Pred_x_df <- data.frame(duration = Pred_x)
predict(model_t3,newdata = Pred_x_df)
## 1 2 3
## 44.20404 65.66332 140.77081
#model_t3
So, the above is the estimated waiting time interval for the first, third and tenth eruptions. We can be certain of the predictions because the values are within the range of data.
What would have happened if we tried to predict eruption duration from interval instead?
Predicting eruption duration from the waiting time interval is just reversing the model
new_model <- glm(formula = eruptions ~ waiting, data = faithful)
summary(new_model)
##
## Call:
## glm(formula = eruptions ~ waiting, data = faithful)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.29917 -0.37689 0.03508 0.34909 1.19329
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.874016 0.160143 -11.70 <2e-16 ***
## waiting 0.075628 0.002219 34.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.2465251)
##
## Null deviance: 353.039 on 271 degrees of freedom
## Residual deviance: 66.562 on 270 degrees of freedom
## AIC: 395.02
##
## Number of Fisher Scoring iterations: 2
The intercept and the gradient for this model are also very much reliant and the strongly significant.But, we cannot very certain about the direction causality of the data. This is more common problem and some scientific methods to be employed to understand the correlations and causalities.