Using Multi-Level (Hierarchical) Modeling.

Introduction

In the recent years, it has been found that radon (Rn) the fifth natural noble gas has been strongly associated with lung cancer. Radon is a carcinogen-a naturally occurring radioactive gas whose decay products are also radioactive-known to cause lung cancer in high concentrations and estimated to cause several thousand lung cancer deaths per year in the United States (Gelman 2007). In the US, prevalence of radon varies greatly, with some areas and more specifically houses having hazardous concentrations. In this brief analysis, our goal is to analyze the prevalence of radon levels in the state of New York using the multilevel modeling technique.

Origin of Radon

Radon is a colorless, odorless radioactive gas that seeps up from the earth. When inhaled, it gives off radioactive particles that can damage the cells that line the lung.Long term exposure to radon can lead to lung cancer. The production of Radon occurs, from the natural decay of uranium that is found in nearly all soils. Uranium breaks down to radium. As radium disintegrates it turns into radioactive gas = “radon”. As a gas, radon moves up through the soil and into the air you breathe. (Minnesota Department of Health).

Impact of Radon

According to CDC, radon is an cancer causing radioactive gas. It is colorless and odorless making it difficult to be detected. However, it is important to mention, the lung cancer risk from radon is much lower than that from smoking. However, exposure to the combination of radon gas and cigarette smoke creates a greater risk for lung cancer than either factor alone (American Cancer Society).

Source of Data

This analysis will use data from New York State’s Department of Health. According the NY department of health: The Radon Program contracts with a radon testing laboratory to provide short-term charcoal radon test kits, radon test kit analysis and results to residents. The contract laboratory provides the radon test results to the individual home owner and the DOH Radon Program. All testing data is collected and stored in the “Radon Program’s” database. The data set has 116 observations, and 15 variables. The data is specifically collected from 54 counties in NY state, making it eligible for our current analysis.

The data can be downloaded from: (https://www.health.data.ny.gov/)

Limitations

The analysis has several limitations. First the data is exclusively avaiable for illustration purpose, which makes it very limited in terms of various indicators. The values for the test conducted is grouped. For example for Albany county have in total 2100 test, there now specific value for each 2100 test. There is an averaga value to all the test. Another limitation is there no other indicators such the type of house, the time of build which can be further used to explain variation in our analysis.

Key Dependent Variable

The variable “Average.Radon.pCi.L” is our key independent variable. The variable carries the information regarding the prevalence of radon gas. The concentration of radon in the air is measured in units of picocuries per litre (pCi/L) or becquerels per cubic meter (Bq/m3). One Bq corresponds to one disintegration per second. For this study the unit of measurement associated with the variable found our data is picocuries per litre.

Distribution of Radon Levels

Independent Variable(s)

For this study, we will only use three predictors variables: Air Quality Index, 1st floor, and Basement. The air quality variable is a continuous variable with reported air quality score. The 1st floor variable and basement variables simply represents, weather the radon test data was collected in basement or 1st floor. These variables were recoded and added to the original data.

Distribution of Air Quality Score

The histogram below shows, the distribution of data regarding air quality index score. The distribution pretty normal with few extreme values below and above the mean.

Prevalance of Radon in Households

The figure below, shows the prevalence of radon using in two main potential locations. According to the current data, the prevalence of radon is higher, in the basement as compared to the 1st floor. These two locations are most potentially tested for radon.
## Statistical Analysis In performing the analysis, we will first start a no-pooling model, and then a complete pooling method. For these two models, the variable in test.location.in.home will be used as independent variable. In the second step, we will use random intercept model and further on a random slope model. Our grouping variable is “County” which represents each county NY state.

A total of 54 counties are in the data set. Another important predictor for this analysis is location of the test variable which has two values 1=Basement, 0=1st Floor. We created two new variables which are binary and represent these two specific conditions separately. For instance: the basement variable is binary and repents 1= if the test was conducted in basement and 0 if not. Same binary recoding was used for 1st floor variable. This was done to find a better model.

Complete Pooling Model

The complete pooling modeling in simpler terms is a simple linear model which does not takes account of any group level structure of the data. The model below tells us that 31% of the variance of the response variable Test Location is explained by our regression model. On average the prevalence of radon in household basements is about 3.12 pCi/L. However, complete pooling model fails to account for the variation between different counties.

## 
## Call:
## lm(formula = Average.Radon.pCi.L ~ Test.Location.in.Home, data = radon)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1840 -1.3884 -0.2124  0.8741  8.9560 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     2.5509     0.3013   8.467 9.93e-14 ***
## Test.Location.in.HomeBasement   3.1231     0.4261   7.330 3.58e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.295 on 114 degrees of freedom
## Multiple R-squared:  0.3203, Adjusted R-squared:  0.3144 
## F-statistic: 53.73 on 1 and 114 DF,  p-value: 3.576e-11

No Pooling Model

The no pooling model ignores the individual scores of each basement and first floor test and gives us a intercept and a slope of each county’s radon test level. This practice is still not adequate to find the variation between counties.

Multilevel Model#1

m1_lme <- lme(Average.Radon.pCi.L ~ Test.Location.in.Home, 
              data = radon, random = ~1|County, method = "ML")
summary(m1_lme)
## Linear mixed-effects model fit by maximum likelihood
##  Data: radon 
##        AIC      BIC    logLik
##   491.7692 502.7835 -241.8846
## 
## Random effects:
##  Formula: ~1 | County
##         (Intercept) Residual
## StdDev:    1.876574 1.285491
## 
## Fixed effects: Average.Radon.pCi.L ~ Test.Location.in.Home 
##                                  Value Std.Error DF   t-value p-value
## (Intercept)                   2.550862 0.3012844 57  8.466626       0
## Test.Location.in.HomeBasement 3.123103 0.2407946 57 12.969989       0
##  Correlation: 
##                               (Intr)
## Test.Location.in.HomeBasement -0.4  
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -1.41573168 -0.49071967  0.02357069  0.38104008  2.86986676 
## 
## Number of Observations: 116
## Number of Groups: 58

Multilevel Model#2

m2_lme <- lme(Average.Radon.pCi.L ~  X1st.Floor, 
              data = radon, random = ~ Test.Location.in.Home|County, method = "ML")
summary(m2_lme)
## Linear mixed-effects model fit by maximum likelihood
##  Data: radon 
##        AIC      BIC    logLik
##   430.3038 446.8253 -209.1519
## 
## Random effects:
##  Formula: ~Test.Location.in.Home | County
##  Structure: General positive-definite, Log-Cholesky parametrization
##                               StdDev    Corr  
## (Intercept)                   1.3252125 (Intr)
## Test.Location.in.HomeBasement 1.7363087 0.796 
## Residual                      0.5089339       
## 
## Fixed effects: Average.Radon.pCi.L ~ X1st.Floor 
##                 Value Std.Error DF   t-value p-value
## (Intercept)  5.217677 0.3728977 57  13.99225       0
## X1st.Floor  -2.856586 0.2398058 57 -11.91208       0
##  Correlation: 
##            (Intr)
## X1st.Floor -0.913
## 
## Standardized Within-Group Residuals:
##          Min           Q1          Med           Q3          Max 
## -2.638396971 -0.186813185 -0.009497394  0.270177067  1.517483478 
## 
## Number of Observations: 116
## Number of Groups: 58

Multilevel Model#3

m3_lme <- lme(Average.Radon.pCi.L ~ Test.Location.in.Home*Air.quality.index , 
              data = radon, random = ~ Test.Location.in.Home|County, method = "ML")
summary(m3_lme)
## Linear mixed-effects model fit by maximum likelihood
##  Data: radon 
##        AIC      BIC    logLik
##   415.1597 437.1884 -199.5798
## 
## Random effects:
##  Formula: ~Test.Location.in.Home | County
##  Structure: General positive-definite, Log-Cholesky parametrization
##                               StdDev    Corr  
## (Intercept)                   1.2883349 (Intr)
## Test.Location.in.HomeBasement 1.6825495 0.855 
## Residual                      0.4544442       
## 
## Fixed effects: Average.Radon.pCi.L ~ Test.Location.in.Home * Air.quality.index 
##                                                     Value Std.Error DF
## (Intercept)                                      2.055440 0.7351253 57
## Test.Location.in.HomeBasement                    4.524697 1.4049297 55
## Air.quality.index                                0.075280 0.1082044 55
## Test.Location.in.HomeBasement:Air.quality.index -0.213882 0.2115138 55
##                                                   t-value p-value
## (Intercept)                                      2.796040  0.0070
## Test.Location.in.HomeBasement                    3.220586  0.0021
## Air.quality.index                                0.695723  0.4895
## Test.Location.in.HomeBasement:Air.quality.index -1.011198  0.3163
##  Correlation: 
##                                                 (Intr) Ts.L..HB Ar.ql.
## Test.Location.in.HomeBasement                   -0.238                
## Air.quality.index                               -0.969  0.276         
## Test.Location.in.HomeBasement:Air.quality.index  0.268 -0.985   -0.277
## 
## Standardized Within-Group Residuals:
##          Min           Q1          Med           Q3          Max 
## -1.471289050 -0.226877667  0.001499596  0.286049313  1.718769661 
## 
## Number of Observations: 116
## Number of Groups: 58
radon %<>% mutate(cairindex = Air.quality.index  - mean(Air.quality.index))

Multilevel Model#3(a)

m3a_lme <- lme(Average.Radon.pCi.L ~ Test.Location.in.Home*Air.quality.index, 
               data = radon, random = ~ Test.Location.in.Home|County, method = "ML")
summary(m3a_lme)
## Linear mixed-effects model fit by maximum likelihood
##  Data: radon 
##        AIC      BIC    logLik
##   415.1597 437.1884 -199.5798
## 
## Random effects:
##  Formula: ~Test.Location.in.Home | County
##  Structure: General positive-definite, Log-Cholesky parametrization
##                               StdDev    Corr  
## (Intercept)                   1.2883349 (Intr)
## Test.Location.in.HomeBasement 1.6825495 0.855 
## Residual                      0.4544442       
## 
## Fixed effects: Average.Radon.pCi.L ~ Test.Location.in.Home * Air.quality.index 
##                                                     Value Std.Error DF
## (Intercept)                                      2.055440 0.7351253 57
## Test.Location.in.HomeBasement                    4.524697 1.4049297 55
## Air.quality.index                                0.075280 0.1082044 55
## Test.Location.in.HomeBasement:Air.quality.index -0.213882 0.2115138 55
##                                                   t-value p-value
## (Intercept)                                      2.796040  0.0070
## Test.Location.in.HomeBasement                    3.220586  0.0021
## Air.quality.index                                0.695723  0.4895
## Test.Location.in.HomeBasement:Air.quality.index -1.011198  0.3163
##  Correlation: 
##                                                 (Intr) Ts.L..HB Ar.ql.
## Test.Location.in.HomeBasement                   -0.238                
## Air.quality.index                               -0.969  0.276         
## Test.Location.in.HomeBasement:Air.quality.index  0.268 -0.985   -0.277
## 
## Standardized Within-Group Residuals:
##          Min           Q1          Med           Q3          Max 
## -1.471289050 -0.226877667  0.001499596  0.286049313  1.718769661 
## 
## Number of Observations: 116
## Number of Groups: 58

Multilevel Model#4

m4_lme <- lme(Average.Radon.pCi.L ~ Test.Location.in.Home, 
               data = radon, random = ~ Test.Location.in.Home|County, method = "ML")
summary(m4_lme)
## Linear mixed-effects model fit by maximum likelihood
##  Data: radon 
##        AIC      BIC    logLik
##   412.3953 428.9168 -200.1977
## 
## Random effects:
##  Formula: ~Test.Location.in.Home | County
##  Structure: General positive-definite, Log-Cholesky parametrization
##                               StdDev    Corr  
## (Intercept)                   1.2863114 (Intr)
## Test.Location.in.HomeBasement 1.6988730 0.854 
## Residual                      0.4576067       
## 
## Fixed effects: Average.Radon.pCi.L ~ Test.Location.in.Home 
##                                  Value Std.Error DF  t-value p-value
## (Intercept)                   2.550862 0.1808363 57 14.10592       0
## Test.Location.in.HomeBasement 3.123103 0.2407946 57 12.96999       0
##  Correlation: 
##                               (Intr)
## Test.Location.in.HomeBasement 0.668 
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -1.36636427 -0.21258786 -0.03321914  0.27473425  1.81783505 
## 
## Number of Observations: 116
## Number of Groups: 58

Confidence Intervals

m5_lme <- lme(Average.Radon.pCi.L ~ 1, random = ~ 1|County, data = radon, method = "ML")
summary(m5_lme)
## Linear mixed-effects model fit by maximum likelihood
##  Data: radon 
##        AIC      BIC    logLik
##   569.4629 577.7236 -281.7314
## 
## Random effects:
##  Formula: ~1 | County
##         (Intercept) Residual
## StdDev:    1.040715 2.555264
## 
## Fixed effects: Average.Radon.pCi.L ~ 1 
##                Value Std.Error DF  t-value p-value
## (Intercept) 4.112414 0.2749791 58 14.95537       0
## 
## Standardized Within-Group Residuals:
##        Min         Q1        Med         Q3        Max 
## -1.2026904 -0.6665864 -0.3004453  0.5078635  3.4821081 
## 
## Number of Observations: 116
## Number of Groups: 58
intervals(m5_lme)
## Approximate 95% confidence intervals
## 
##  Fixed effects:
##                lower     est.    upper
## (Intercept) 3.564361 4.112414 4.660467
## attr(,"label")
## [1] "Fixed effects:"
## 
##  Random Effects:
##   Level: County 
##                     lower     est.    upper
## sd((Intercept)) 0.4394442 1.040715 2.464678
## 
##  Within-group standard error:
##    lower     est.    upper 
## 2.138157 2.555264 3.053739

Results

Compared with the two classical estimates (no pooling and complete pooling), the inferences from the multilevel models are more reasonable. At one extreme, the complete-pooling method gives identical estimates for all counties, which is particularly inappropriate for this application, whose goal is to identify the locations in which residents are at high risk of radon. At the other extreme, the no-pooling model over fits the data; for example, it gave an implausibly high estimate of the average radon levels in counties. The Multilevel Model#1 simply takes account of radon intercept and gives us again the identical fixed effects intercept as complete pooling model. However, the random effects tells us that on average the prevalence of radon is around 1.87 up and 1.87 down, (because its SD) in different counties.

The Multilevel Model#2 uses the random slope and takes account of testing location differences as well. The overall model has AIC of “430” and model is significant. The “random effects”portion in model#2 is basically a measure of variance attributable at different levels in the counties, expressed as “Standard Deviation”. In this model instead of using the “test.location” variable, we used binary variable X1st Floor which = 1, if test was taken on 1st floor and 0 if not. The random effects of the model#2 tells gives us more textured results. The average prevalence of radon in “basements” is around 1.73 (SD+/- the intercept 5.27). The Fixed effects of model#2 shows the intercept of 5.23 with a slope of -2.85 which basically means that for every 1st Floor level the average prevalence of radon decreases.

The Multilevel Model#3 has two parts. The part 1 model simply uses an interaction or co variate variable Air Quality Index. Both of parts 3 and 3a has same AIC of 415. The fixed effects of the model tells that on average the prevalence of radon in 4.524 higher in the basements. However, the association between air quality index and radon levels is not significant (0.4895). The interaction between test.location basement and air quality index shows on average reduction in radon prevalence -.21. The model#3a is a similar model, however, for this model we scaled the variable air quality index to refit in the model but there is no significant different in two models.

The Multilevel Model#4 is best model. The model has AIC of 412.39 and is significant (p>0.05). The model#4 uses the original variable “test.location.in.home”. The random effects of the model tells us that on average prevalence of radon in basement varies by 1.69pCi.L (SD +/- Intercept) - In addition to that, the fixed effects of the model shows that overall average prevalence of radon across basements is 3.12/pCi.L. Also the correlation between the slope and intercept is = 0.668 which indicates that there is a positive association between radon prevalence and basement testing.

Statistical models
Model 1 Model 2 Model 3 Model 4 Model 5 Model 6
(Intercept) 2.55*** 2.55*** 5.22*** 2.06** 2.06** 2.55***
(0.30) (0.30) (0.37) (0.74) (0.74) (0.18)
Test.Location.in.HomeBasement 3.12*** 3.12*** 4.52** 4.52** 3.12***
(0.43) (0.24) (1.40) (1.40) (0.24)
X1st.Floor -2.86***
(0.24)
Air.quality.index 0.08 0.08
(0.11) (0.11)
Test.Location.in.HomeBasement:Air.quality.index -0.21 -0.21
(0.21) (0.21)
R2 0.32
Adj. R2 0.31
Num. obs. 116 116 116 116 116 116
RMSE 2.29
AIC 491.77 430.30 415.16 415.16 412.40
BIC 502.78 446.83 437.19 437.19 428.92
Log Likelihood -241.88 -209.15 -199.58 -199.58 -200.20
Num. groups 58 58 58 58 58
p < 0.001, p < 0.01, p < 0.05

Using MerTools Package for Predictive Intervals

We will use MerTools package to enhance our analysis by incorporating predictive intervals and using arm::sim() syntax.

Model#1 for MerTools

Just starting with a simple model which can be used for further analysis. This model can be further used to quickly display summary of the model with calculation for the sigma.

The model has same coefficients as previous models for 1.69 and 1.71 (SE) and SD of 1.89 for Counties. It is important to mention here that our data has limited nested function for each observation. For the sake simplicity, we will only use random intercepts of each county.

m1 <- lmer(Average.Radon.pCi.L ~  X1st.Floor  + Basement  + (1|County), data=radon)
fastdisp(m1)
## lmer(formula = Average.Radon.pCi.L ~ X1st.Floor + Basement + 
##     (1 | County), data = radon)
##             coef.est coef.se
## (Intercept)  2.64     1.70  
## X1st.Floor  -0.09     1.71  
## Basement     3.03     1.69  
## 
## Error terms:
##  Groups   Name        Std.Dev.
##  County   (Intercept) 1.89    
##  Residual             1.31    
## ---
## number of obs: 116, groups: County, 58
## AIC = 492.6

Simulating Values for Fixed Effects

As there is substantial variance between counties and testing locations, we can plot the difference between coefficients together in a line to see their direction and how much they deviate from 0. We use FeSim syntax.

NOTE: Value 1000 is used because of limited system resources.

Random Effects Using MerTools

To check the random effects, we will first create a data frame using our previous model m1.The data frame contains random intercepts for each county in our data.

Effect Ranges using the Catepillar Dot Plot

The plot shows that there are considerable number of counties, where radon levels are higher up to 5.0.

Predictive Intervals using MerTools

Below are upper and lower limits of 95% predictive intervals for each county in relation to the prevalence of radon. Instead of using the average, for the fit we used median. Prediction intervals and confidence intervals are not the same thing. Unfortunately the terms are often confused. A prediction interval is an interval associated with a random variable yet to be observed, with a specified probability of the random variable lying within the interval (Hynman 2013). Prediciive intertvals are usally more broad and they likely contain the value of dependent variable for a new observation given specfic value of IV variables .In this analysis, we are predicting for the median radon levels.

Ploting First 50 Obervations Obataned from the merMod function

(Based on Random Intercept “NOT” Random Slope)

Comparing Simulated Values with the Observations Obatained from merMod fuctions using arm::sim.

The prediction intervals from arm::sim and random intercepts using arm:sim function.

Comparing Simulated Values with the Observations Obatained from merMod fuctions using merBOOT function.

Using BootMer functions we have more refined results. the intervals produced in orange and point estimates for counties 116 observations. The cases (intercepts) clearly encompasses the entire intervals. However, the arm::sim seems to give more textured resutls as compared to bootMer function.

Conclusion

Multilevel modeling is an reliable approach to modeling hierarchically structured data, outperforming classical regression in predictive accuracy. One intriguing feature of multilevel models is their ability to separately estimate the predictive effects of an individual predictor and its group-level mean, which are sometimes interpreted as “direct” and “contextual” effects of the predictor (Gellman 2006).

In the end, it is important to mention that prevalence of radon is strongly associated with close areas such as basements. Our goal in this assignment was to predict the potential counties where the radon levels can be dangerously high and for that our step is to use our simulated values to estimate the effect of radon levels for each county. Using the lattice function and our model from merTools, we found that Cortland, Che mung, and Steuben counties are the most vulnerable counties to radon explore. The prevalence of radon in counties such as Cortland can be up-to 6pi-Cu/l. We also reverified our analysis by looking at the current statistics regarding the radon levels in New York state and found that Cortland county is among the most potential counties for radon exposure. (http://www.cortland-co.org/503/Radon)

Further more, according to our analysis the likelihood of radon exposure in basement is higher than on first floor or more upper floors. This makes sense as from our analysis we can see that the prevalence of radon is lowest in counties or cities like New York due to many buildings and apartments build above ground.

Comparison of last two Models (1-2) on Lecture Slides:

The models in the slids (27 - 30) are not different in terms overall model difference. However, the fixed effect correlation between sex:texp and student popularity in the cross level interaction is positive and moderate which means simply means the for male:techer average increase in experience the popularity index score get higer. This also indicates that there is some relationship between the starting and slope of sex:ctexp and the intercept of the model. However, after centering the mean of the variable techerxp becomes zero, therefore the correlation between sex:ctexp varible is = 0. From my experience fixed effect correlations should NOT be higher, but they dont usually occur in a balance model.

## $County