In this section I build a mixed linear model to examine consumer carrot preferences and subsequently interpret the standard deviations of the random effect estimates.
The R code below specifies and runs models testing the two questions of interest:
• Do people like late-harvest carrots better than early-harvest carrots?
• Do people who earn higher incomes generally like carrots less?
(Note that the first model specifies only random intercepts while the second additionally includes random slopes)
#early_late variable: -1 if early, +1 if late
carrots$early_late <- -1*(carrots$Product=="Bolero_E")+1*(carrots$Product=="Bolero_L")-1*(carrots$Product=="Major_E")+1*(carrots$Product=="Major_L")-1*(carrots$Product=="Navar_E")+1*(carrots$Product=="Navar_L")-1*(carrots$Product=="Nelson_E")+1*(carrots$Product=="Nelson_L")-1*(carrots$Product=="Turbo_E")+1*(carrots$Product=="Turbo_L")-1*(carrots$Product=="Yukon_E")+1*(carrots$Product=="Yukon_L")
carrots$type <- substr(carrots$Product,1,3)
#construct linear income variable
carrots$linInc = -3*(carrots$Income==1)-1*(carrots$Income==2)+1*(carrots$Income==3)+3*(carrots$Income==4)
#run models
fit1 = lmer(Preference ~ early_late + linInc + (1 | Consumer), data = carrots) #estimating only random intercepts
summary(fit1)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Preference ~ early_late + linInc + (1 | Consumer)
## Data: carrots
##
## REML criterion at convergence: 3556.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.3702 -0.5776 0.0245 0.6372 2.6536
##
## Random effects:
## Groups Name Variance Std.Dev.
## Consumer (Intercept) 0.159 0.3988
## Residual 1.144 1.0694
## Number of obs: 1161, groups: Consumer, 97
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 4.76932 0.05169 95.02067 92.273 < 2e-16 ***
## early_late 0.21095 0.03139 1063.22419 6.721 2.94e-11 ***
## linInc -0.05134 0.03174 94.90388 -1.617 0.109
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) erly_l
## early_late -0.001
## linInc 0.133 0.000
fit2 = lmer(Preference ~ early_late + linInc + (1 + early_late | Consumer), data = carrots) #including random slopes
summary(fit2)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Preference ~ early_late + linInc + (1 + early_late | Consumer)
## Data: carrots
##
## REML criterion at convergence: 3553.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.4707 -0.5851 0.0440 0.6162 2.4350
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## Consumer (Intercept) 0.16127 0.4016
## early_late 0.02642 0.1625 0.17
## Residual 1.11506 1.0560
## Number of obs: 1161, groups: Consumer, 97
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 4.76903 0.05167 95.03785 92.292 < 2e-16 ***
## early_late 0.21091 0.03511 96.20193 6.006 3.37e-08 ***
## linInc -0.05332 0.03167 94.89533 -1.683 0.0956 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) erly_l
## early_late 0.064
## linInc 0.133 0.000
Clearly, the random effect standard deviation in the first model (\(SD = .3988\)) indicates that some variability exists around the predicted mean preference level for consumers judging carrots with no seasonality (e.g. where early_late = 0) even after accounting for level of income. This result therefore suggests that there may be additionals between-participant variable useful for predicting mean preference levels.
This variability in the intercept estimate remains unchanged in the second model, which additionally estimates random slopes. Interestingly, these random slopes exhibit less variability (\(SD = .1625\)) than the intercept, suggesting that, after accounting for income, there are fewer differences between consumers in the effect of carrot seasonality on reported preferences. Importantly, one cannot include a random slope for the linear effect of income, which only varies between consumers.
The fixed effects can be interpreted as follows:
The intercept effect (or parameter \(\alpha_{00}\) in the fully-specified linear equation for this model) equals the predicted preference score for a consumer making no income while judging a hypothetical, non-seasonal carrot. Here that value is significant different from 0.
The early_late fixed effect (or parameter \(\alpha_{10}\) in the fully-specified linear equation for this model) equals one-half increase in predicted carrot preference due to early vs. late season carrot type.
Finally, the lin_Inc fixed effect (or parameter \(\alpha_{01}\) in the fully-specified linear equation for this model) is the linear effect of between-group income on predicted carrot preference.
In this section I provide an overview of the ‘Insurance’ dataset, discuss my hypothesis of interest, formalize this hypothesis mathematically, test it with the apropriate mixed linear model, and interpret the estimated fixed effects.
The ‘Insurance’ comes from a 2018 study of annual medical insurance costs in a sample of 1338 Americans. Participants in this study were randomly contacted by email after having enrolled in the University of Colorado at Boulder’s 2017 “Medical Insurance for Dummies” seminar. Participants were required to submit a host of demographic and health variables as well as their annual medical insurance charges in order to enter a raffle for a $50 Amazon gift-card. Note that since this convenience sample was highly demographically and geographically limited, it is not representative of the American population at large. For each participant the variables included in the final data-set included:
• Age at the time of response
• Biological sex (recorded as either male or female)
• Body Mass Index (BMI) at time of enrollment
• Number of children at the time of response
• Smoking status (yes if had smoked to any degree in the past month, no otherwise)
• Region of residence at time of enrollment (options included “Northeast,” “Northwest,” “Southeast,” and “Southwest”)
• Annual medical insurance costs (in USD)
Only those participants that gave valid responses to all questions were included in the final data-set.
Since the ‘Insurance’ dataset contains information on individuals’ age, BMI, region of residence and annual medical insurance charges, I plan on testing the following two questions:
• Are age and BMI related to medical insurance charges?
• Do the effects of age and BMI on medical insurance charges themselves depend on a regions mean BMI level?
Importantly, the second question addresses the possible ‘contagion’ effects of obesity; that is, one might expect that a positive effect of age and BMI on charges would increase in already-high BMI regions.
A mixed linear model approach is the proper way to test this hypothesis. In this case the grouping unit of interest is the ‘Region’ variable since one would expect individuals living in the same geographic region to be more similar to one another within regions. BMI and age are individual-level variables and as such vary within the ‘Region’ grouping unit. Finally, mean BMI is necessarily constant within regions and therefore only varies between.
The mixed linear model I use to test the hypothesis in Section 2b is best understood with the corresponding Level 1 and Level 2 linear equations, which are reproduced below:
Level 1: \[ charges_{ij} = \beta_{0j} + \beta_{1j}*age_{ij} + \beta_{2j}*BMI_{ij} + \epsilon_{ij} \] Level 2: \[ \beta_{0j} = \alpha_{00} + \alpha_{01}*meanBMI_j + \mu_{0j} \] \[ \beta_{1j} = \alpha_{10} + \alpha_{11}*meanBMI_j + \mu_{1j} \] \[ \beta_{2j} = \alpha_{20} + \alpha_{21}*meanBMI_j + \mu_{2j} \] Substituted: \[ charges_{ij} = \alpha_{00} + \alpha_{01}*meanBMI_j + \mu_{0j} + (\alpha_{10} + \alpha_{11}*meanBMI_j + \mu_{1j})*age_{ij} + (\alpha_{20} + \alpha_{21}*meanBMI_j + \mu_{2j})*BMI_{ij} + \epsilon_{ij} \]
The R code below estimates the MLM specified by the equations above:
insurance = read.csv("insurance.csv", header = TRUE)
#uncentered meanBMI variable
insurance$meanBMI = mean(insurance$bmi[insurance$region=="northwest"])*(insurance$region=="northwest")+mean(insurance$bmi[insurance$region=="northeast"])*(insurance$region=="northeast")+mean(insurance$bmi[insurance$region=="southwest"])*(insurance$region=="southwest")+mean(insurance$bmi[insurance$region=="southeast"])*(insurance$region=="southeast")
############################
#centered meanBMI variable
centeredBMI_NW = mean(insurance$bmi[insurance$region=="northwest"])-mean(insurance$bmi) #-1.463
centeredBMI_NE = mean(insurance$bmi[insurance$region=="northeast"])-mean(insurance$bmi) #-1.48
centeredBMI_SW = mean(insurance$bmi[insurance$region=="southwest"])-mean(insurance$bmi) #-.066
centeredBMI_SE = mean(insurance$bmi[insurance$region=="southeast"])-mean(insurance$bmi) #2.69
insurance$meanBMI_gc = centeredBMI_NW*(insurance$region=="northwest")+centeredBMI_NE*(insurance$region=="northeast")+centeredBMI_SW*(insurance$region=="southwest")+centeredBMI_SE*(insurance$region=="southeast")
##########################
#region-centered individual age
insurance$age_rc = (insurance$age-mean(insurance$age[insurance$region=="northwest"]))*(insurance$region=="northwest")+(insurance$age-mean(insurance$age[insurance$region=="northeast"]))*(insurance$region=="northeast")+(insurance$age-mean(insurance$age[insurance$region=="southwest"]))*(insurance$region=="southwest")+(insurance$age-mean(insurance$age[insurance$region=="southeast"]))*(insurance$region=="southeast")
#region-centered individual BMI
insurance$BMI_rc = (insurance$bmi-mean(insurance$bmi[insurance$region=="northwest"]))*(insurance$region=="northwest")+(insurance$bmi-mean(insurance$bmi[insurance$region=="northeast"]))*(insurance$region=="northeast")+(insurance$bmi-mean(insurance$bmi[insurance$region=="southwest"]))*(insurance$region=="southwest")+(insurance$bmi-mean(insurance$bmi[insurance$region=="southeast"]))*(insurance$region=="southeast")
fit_insurance = lmer(charges ~ age_rc*meanBMI_gc + BMI_rc*meanBMI_gc + (1 + age_rc + BMI_rc | region), data = insurance)
## boundary (singular) fit: see ?isSingular
summary(fit_insurance) #singular fit due to -1.0 correlation between random intercepts and slopes
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula:
## charges ~ age_rc * meanBMI_gc + BMI_rc * meanBMI_gc + (1 + age_rc +
## BMI_rc | region)
## Data: insurance
##
## REML criterion at convergence: 28725.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.1935 -0.6170 -0.4400 0.5638 4.1340
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## region (Intercept) 6.153e+05 784.40
## age_rc 1.052e+03 32.43 1.00
## BMI_rc 1.208e+02 10.99 -0.98 -0.98
## Residual 1.294e+08 11373.73
## Number of obs: 1338, groups: region, 4
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 13264.024 500.874 1.548 26.482 0.00484 **
## age_rc 241.535 27.612 2.282 8.747 0.00841 **
## meanBMI_gc 436.699 291.541 1.498 1.498 0.31064
## BMI_rc 329.470 54.457 73.127 6.050 5.64e-08 ***
## age_rc:meanBMI_gc 16.556 15.874 2.103 1.043 0.40187
## meanBMI_gc:BMI_rc -11.191 29.507 53.962 -0.379 0.70598
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) age_rc mnBMI_ BMI_rc a_:BMI
## age_rc 0.461
## meanBMI_gc 0.030 0.022
## BMI_rc -0.078 -0.158 -0.004
## ag_rc:mBMI_ 0.023 0.007 0.474 0.039
## mnBMI_:BMI_ -0.004 0.044 -0.085 -0.136 -0.114
## convergence code: 0
## boundary (singular) fit: see ?isSingular
Interestingly, this model runs into a ‘singular fit’ issue caused by the very high random effects correlations. However, constraining these parameters does not prevent this issue, which does not affect the interpretation of fixed effect estimates in any case.
In the model above individual age and BMI were centered at the regional group mean and mean BMI at the grand mean. Below I interpret each of the fixed effects with this centering scheme in mind:
• The fixed effect intercept (or parameter \(\alpha_{00}\) in the formal model) equals the predicted annual medical insurance charges for someone who 1) lives in a region with a regional BMI at the national mean level, 2) has a BMI exactly equal to their region’s mean BMI, and 3) is at their region’s mean age. As expected this parameter is large (around $13300) and highly significant (t = 26.5, p < .01).
• The “age_rc” fixed effect (or parameter \(\alpha_{10}\) in the formal model) equals the predicted increase in annual medical insurance charges due to a one-year increase in the age of someone who 1) lives in a region with a regional BMI at the national mean level, 2) has a BMI exactly equal to their region’s mean BMI.
• The “meanBMI_gc” fixed effect (or parameter \(\alpha_{01}\) in the formal model) equals the predicted increase in annual medical insurance charges due to a one-unit increase in regional BMI for someone who 1) has a personal BMI exactly equal to their region’s mean BMI and 2) is at their region’s mean age.
• The “BMI_rc” fixed effect (or parameter \(\alpha_{20}\) in the formal model) equals the predicted increase in annual medical insurance charges due to a one-unit increase in individual BMI for someone who 1) lives in a region with a regional BMI at the national mean level and 2) is at their region’s mean age.
• The “age_rc:meanBMI_gc” fixed effect (or parameter \(\alpha_{11}\) in the formal model) equals the predicted increase in the effect of age on annual medical insurance charges due to a one-unit increase in regional BMI for someone who has a personal BMI exactly equal to their region’s mean BMI. This is an interaction effect.
• The “meanBMI_gc:BMI_rc” fixed effect (or parameter \(\alpha_{21}\) in the formal model) equals the predicted increase in the effect of individual BMI on annual medical insurance charges due to a one-unit increase in regionl BMI for someone at their region’s mean age. This is also an interaction effect.