very basic model comparison example

create data/dataframe

set.seed(5)

# createa a vector of the means that we want our variables to have
mu<-c(110, 36, 212) 

# create a vector of the standard deviations that we want our variables to have
stddev <- c(15, 2, 40)

# create a matrix that has the correlations we want our variables to have
corMat <- matrix(c(1, 0.81, 0.78,
                   0.81, 1, 0.27,
                   0.78, 0.27, 1),
                 ncol = 3)

# put it all into a dataframe using the MASS library and creating normally distributed variables (mvnorm) of 500 observations (n=500)
df <- as.data.frame(mvrnorm(n = 500, mu = mu, Sigma = corMat, empirical = FALSE))

# create an ID variable
df$ID<- seq(1,nrow(df),1)

# add column names
colnames(df)[1:3]<- c('swingspeed', 'clublength', 'distance')
head(df)
##   swingspeed clublength distance ID
## 1   109.1085   35.40583 211.3177  1
## 2   111.4261   36.96961 213.1849  2
## 3   108.7192   34.27075 211.7804  3
## 4   110.0703   36.05370 212.0578  4
## 5   111.7261   38.14224 212.5450  5
## 6   109.3863   35.75821 211.2895  6

question:

we are trying t get on a professional golf tour and want to analyze the effects of swing speed and club length on the distance that the golf ball goes to hone in our range. typically, golf balls hit with a longer club (i.e. a driver) goes further than a shorter club (i.e. putter). similarly, you usually swing faster with a longer club than a shorter club. finally and relatedly, because you are usually swinging faster with a longer club, faster club speeds typically also allow you to hit the ball further.

we had 1000 people at the golf course hit the club of their choice, and we recorded their swing speed (mph), club length (inches) and distance (yds).

however, we first want to know if our golfers are average in order to assess the data. let’s test whether our golfers hit balls further than the average distance of 219 yards.

what is our null hypothesis?

our null hypothesis is that our group hits the golf ball 219 yards on average – or no different from the average golfer.

set up our models

model C: \(distance_i= 219 + \epsilon_i\)

model A: \(distance_i= \beta_0 + \epsilon_i\)

conceptually what is model C doing? taking every observation and subtracting 219 (the avg golf ball distance we know from googling it), because we don’t want to predict anything. then, we square all values and sum them to quantify ERROR.

this graph shows each person’s golf ball distance from the hypothesized distance of 219 yards.

ggplot(df, aes(x=ID,y=distance)) + geom_point(color="blue")+
  geom_hline(yintercept=219, linetype="dashed", 
                color = "black", size=2)

model A is saying let’s take every observation and instead substract the mean (our parameter, or predictor), then square all values and sum them to quantify ERROR.

this graph shows the distance from each person’s golf ball distance to the mean golf ball distance.

ggplot(df, aes(x=ID,y=distance)) + geom_point(color="red")+
  geom_hline(yintercept=mean(df$distance), linetype="dashed", 
                color = "black", size=2)

what does the model comparison accomplish?

allows us to see if the parameter we are estimating is helpful. do we want to use a parameter such as mean to predict whether our distances are around 219 yards or is it not useful? if model A significantly reduces error we will use model A– and make our conclusions from that.

write code to test these hypotheses

df$distance_219<- df$distance-219

remember– for now, we are using the ‘logical’ argument of 0 or 1 in the lm function. therefore, we are very limited in the way we construct these simple models. when we want to predict nothing, we tell lm ‘~0’. when we want to predict the mean, we tell lm ‘~1’. think of it as 0=nothing, 1=mean. that’s why we have to also tell R we are testing our data vs the hypothesized value of 219 (by creating the distance_219 variable).

run models

model_c<- lm(df$distance_219~0)
model_a<- lm(df$distance_219~1)

look at model output

mcSummary(model_c)
## lm(formula = df$distance_219 ~ 0)
## 
## Omnibus ANOVA
##                  SS  df     MS EtaSq F p
## Model                 0                 
## Error      24877.72 500 49.755          
## Corr Total 24877.72 500 49.755          
## 
##   RMSE AdjEtaSq
##  7.054       NA
## 
## Coefficients: none
mcSummary(model_a)
## lm(formula = df$distance_219 ~ 1)
## 
## Omnibus ANOVA
##                 SS  df    MS EtaSq F p
## Model        0.000   0   Inf     0    
## Error      514.712 499 1.031          
## Corr Total 514.712 499 1.031          
## 
##   RMSE AdjEtaSq
##  1.016        0
## 
## Coefficients
##               Est StErr        t SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) -6.98 0.045 -153.686  24363 0.979  NA  -7.07  -6.891 0

compare models

modelCompare(model_c,model_a)
## SSE (Compact) =  24877.72 
## SSE (Augmented) =  514.7122 
## Delta R-Squared =  0 
## Partial Eta-Squared (PRE) =  0.9793103 
## F(1,499) = 23619.3, p = 0

conclude?

journal summary

on average, across individuals in our sample at the golf course, the average distance of the a golf ball hit (M=212.02 yards) was significantly lower than the national average of 219 yards (PRE=.98, F(1,499)= 23,619.3, p<.05).

setting up future examples

no need to fully understand this yet, but sometimes I think the simplest examples can be some of the most complicated. the reason we are teaching all of this machinery now is to give you the tools to ask and answer more intuitive questions that actually make more sense later.

the previous example is basically just a T test testing our group average vs some a priori average.

but realistically, in this case we might want to see if swing speed or club length (or both!) do indeed predict golf ball distance.

asking whether swing speed significantly predicts distance we could set up something like this:

model C: \(distance_i= \beta_0 + \epsilon_i\)

model A: \(distance_i= \beta_0 + swingspeed*\beta_1 + \epsilon_i\)

what model C is saying now is does the mean value (our parameter) do a good job of reducing error in each person’s observation of distance?

and model A is saying what if we use our group mean to predict distance but also throw in swing speed and see if we can get a more accurate prediction of distance (i.e reduce error more)

model_c<- lm(df$distance~1)
model_a<- lm(df$distance~df$swingspeed)
mcSummary(model_c)
## lm(formula = df$distance ~ 1)
## 
## Omnibus ANOVA
##                 SS  df    MS EtaSq F p
## Model        0.000   0   Inf     0    
## Error      514.712 499 1.031          
## Corr Total 514.712 499 1.031          
## 
##   RMSE AdjEtaSq
##  1.016        0
## 
## Coefficients
##                Est StErr        t   SSR(3) EtaSq tol CI_2.5 CI_97.5 p
## (Intercept) 212.02 0.045 4667.981 22476155     1  NA 211.93 212.109 0
mcSummary(model_a)
## lm(formula = df$distance ~ df$swingspeed)
## 
## Omnibus ANOVA
##                 SS  df      MS EtaSq      F p
## Model      310.677   1 310.677 0.604 758.29 0
## Error      204.035 498   0.410               
## Corr Total 514.712 499   1.031               
## 
##  RMSE AdjEtaSq
##  0.64    0.603
## 
## Coefficients
##                   Est StErr      t  SSR(3) EtaSq tol  CI_2.5 CI_97.5 p
## (Intercept)   125.600 3.138 40.020 656.178 0.763  NA 119.433 131.766 0
## df$swingspeed   0.786 0.029 27.537 310.677 0.604  NA   0.730   0.842 0

compare models

modelCompare(model_c,model_a)
## SSE (Compact) =  514.7122 
## SSE (Augmented) =  204.0347 
## Delta R-Squared =  0.6035946 
## Partial Eta-Squared (PRE) =  0.6035946 
## F(1,498) = 758.2895, p = 3.971254e-102

it appears that adding swing speed is a useful predictor of distance above and beyond testing whether the mean distance predicts each individuals distance.

we could now take the estimate (parameter) for the intercept and swingspeed and write: \(distance= 125.6 + 0.786x\)

does this look familiar? if not, don’t worry (but think y=mx+b), and we could interpret this as: in our sample, if swing speed = 0mph, the distance a ball will be hit is 125.6 yards (which obviously isn’t true but it’s how I simulated the data) as swingspeed increases by 1mph, distance increases by 0.786 yards.

ggplot(df, aes(swingspeed, distance)) +
  geom_point(color="yellow") +
  geom_smooth(method = lm, se = T)
## `geom_smooth()` using formula 'y ~ x'

again, don’t worry about this test right now!! I just want to show that IMO conceptually it gets easier, at least for a little bit.