Summary

I will be using Linear Models with R (Faraway), to practice solving a problem with the predictor using the SIMEX (simulation extrapolation) method.

Declaratons

We will be using a custom function in order to display the final equations for each of our models - namely a simple linear regression model and a SIMEX derived model.

print_formula <- function(model){
                    as.formula(
                      paste0("y ~ ", round(coefficients(model)[1],4), "", 
                        paste(sprintf(" %+.4f*%s ", 
                                      coefficients(model)[-1],  
                                      names(coefficients(model)[-1])), 
                              collapse="")
                       )
                    ) 
                  }

Problem Statement

Using the faithful data, fit a regression of duration on waiting. Assuming that there was a measurement error in waiting of 30 seconds, use the SIMEX method to obtain a better estimate of the slope.

Solution

The SIMEX method reduces error by adding additional measurement error via re-sampling to calculate the trend of measurement error. These calculations are then used to provide a more accurate model.

The data we will be using depicts the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.

library(faraway)
str(faithful)
'data.frame':   272 obs. of  2 variables:
 $ eruptions: num  3.6 1.8 3.33 2.28 4.53 ...
 $ waiting  : num  79 54 74 62 85 55 88 85 51 85 ...

Let’s go ahead and perform a simple linear regression for this data.

model <- stats::lm(eruptions ~ waiting, data = faithful, x=T)

It is important to note that we must set the argument for “x” in the “lm” function to the value TRUE in order to preserve the x component for use in our SIMEX model.

stats::coefficients(model)
(Intercept)     waiting 
-1.87401599  0.07562795 
print_formula(model)
y ~ -1.874 + 0.0756 * waiting
<environment: 0x0000000023a934f0>

We can see our initial slope is 0.0756279, we will now use the SIMEX method assuming that there was a measurement error in waiting of 30 seconds

Recall as well that since we preserved the “x” component in our model, the SIMEX function will be able to make these calculations assuming that the “waiting” predictor had an error of 30 seconds.

simex_model <- simex::simex(model, SIMEXvariable = "waiting", 
                            measurement.error = 30)

coefficients(simex_model)
(Intercept)     waiting 
-10.2484648   0.1937354 
print_formula(simex_model)
y ~ -10.2485 + 0.1937 * waiting
<environment: 0x000000001a450a60>

We can now see our slope is 0.1937354, in the case we discovered a measurement error in waiting of 30 seconds was revealed.