0. Set Up

library(readr)
NHANES<-read.csv("/Volumes/FLASHDRIVE/Data 306/NHANES_v1.csv")

1. Categorical independent predictor (IV) & (DV)

  • DMDMARTL - Marital status

  • bmxbmi - Body mass index

2. Simple linear regression model 1

  • I will model the relationship between body mass index and marital status with marital status being the independent predictor and body mass index being the dependent predictor. I theorize that married individuals weigh more than unmarried or separated indivudals because they do not try to stay fit once they stop dating.
#Recode data
NHANES$bmxbmi[NHANES$bmxbmi=="."]=NA
NHANES$bmxbmi[NHANES$bmxbmi==77]=NA
NHANES$bmxbmi[NHANES$bmxbmi==99]=NA

NHANES$dmdmartl[NHANES$dmdmartl=="."]=NA
NHANES$dmdmartl[NHANES$dmdmartl==77]=NA
NHANES$dmdmartl[NHANES$dmdmartl==99]=NA

NHANES$dmdfmsiz[NHANES$dmdfmsiz=="."]=NA

NHANES$dmdhhsza[NHANES$dmdhhsza=="."]=NA
# Extract, rename, remove missing data

nhanes=data.frame(NHANES$bmxbmi,NHANES$dmdmartl,NHANES$dmdfmsiz,NHANES$dmdhhsza)
names(nhanes)
## [1] "NHANES.bmxbmi"   "NHANES.dmdmartl" "NHANES.dmdfmsiz" "NHANES.dmdhhsza"
colnames(nhanes)=c("bmi","mstatus","famsize","famkids")
nhanes=na.omit(nhanes)
names(nhanes)
## [1] "bmi"     "mstatus" "famsize" "famkids"
# simple linear regression model predicting bmi using marital status. Y = bmi, x=mstatus

model1<-lm(bmi~mstatus,data=nhanes)
summary(model1)
## 
## Call:
## lm(formula = bmi ~ mstatus, data = nhanes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.375  -4.871  -1.173   3.425  53.329 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.776111   0.165301  174.08   <2e-16 ***
## mstatus     -0.001012   0.050753   -0.02    0.984    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.891 on 5231 degrees of freedom
## Multiple R-squared:  7.595e-08,  Adjusted R-squared:  -0.0001911 
## F-statistic: 0.0003973 on 1 and 5231 DF,  p-value: 0.9841

Interpretation of Intercept & Coefficient & Goodness of Fit

  • Intercept is 28.7. This means that when marital status is 0, the bmi is around 29.

  • coefficent is -0.001012 and negative. This means that as the marital status increases, the bmi also decreases by a mean value of -0.001012 per unit.

  • The F statistic provides in a certain sense the coefficient of determination, this is a common method used to assess goodness of fit.

  • The coefficient of determination ranges from 0 to 1. A value closer to 0 means a lesser liklihood of X predicting Y and vice versa.

  • The F-statistic, 0.151, is closer to 0 than 1, meaning that marital status is not a perfect predictor of bmi.

3. Multiple Linear Regression Model 2

  • DMDFMSIZ - Total number of people in the Family

  • bmxbmi - Body mass index

  • I will model the relationship between body mass index and total number of people in the family with total number of people in the family being the independent predictor and body mass index being the dependent predictor. I theorize that individuals with larger families weigh more than indivudals with smaller families because bigger families are more expensive to maintain and therefore more stressful.

model2<-lm(bmi~mstatus + famsize,data=nhanes)
summary(model2)
## 
## Call:
## lm(formula = bmi ~ mstatus + famsize, data = nhanes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.191  -4.818  -1.165   3.393  53.575 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.25147    0.25196 112.126  < 2e-16 ***
## mstatus      0.02308    0.05147   0.448  0.65382    
## famsize      0.15804    0.05731   2.758  0.00584 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.887 on 5230 degrees of freedom
## Multiple R-squared:  0.001452,   Adjusted R-squared:  0.00107 
## F-statistic: 3.803 on 2 and 5230 DF,  p-value: 0.02238
myanova<-anova(model1,model2)
myanova
## Analysis of Variance Table
## 
## Model 1: bmi ~ mstatus
## Model 2: bmi ~ mstatus + famsize
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1   5231 248403                                
## 2   5230 248042  1    360.66 7.6046 0.005842 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cbind(myanova, 'CriticalValue' = qf(1-.05, myanova[1,1], myanova[2,1]))
##   Res.Df      RSS Df Sum of Sq        F      Pr(>F) CriticalValue
## 1   5231 248402.7 NA        NA       NA          NA      1.046542
## 2   5230 248042.0  1  360.6623 7.604613 0.005842129      1.046542

Interpretation of Coefficient

  • The coefficient for marital status, is 0.02308. This means that for every increase in 1 unit of marital status, the bmi increases 0.02308 units, with influence from family size. Additionally, bmi increases 0.15804 units for every one unit increase of family size, with influence from the marital status.

Comparison

  • Coefficents:

    • The coefficient for model 1 is -0.001012. For every one unit increase of marital status, there is an decrease in bmi by 0.001012.The addition of the family size predictor increased the estimate to 0.02308. Seeing that the p-value for famsize is greater than 0.05, the overall addition of the famsize has a significant impact on marital status and bmi.
  • R-squared:

    • The adjusted Rsquared for model 1 is -0.0001911. The adjusted Rsquared for model 2 is 0.00107. The adjusted Rsquared for model 1 is lower than that of model 2. Higher R squared values represent smaller differences between the observed data and the fitted values. Since adjusted R squared values increase only if new predictors improve the model, I believe that model 2 fits the observations better than model 1.
  • ANOVA F-test:

    • The F-test value for models 1 & 2 is 7.6046 . The F-test ANOVA is conducted to figure out if the additional model gives a significantly better fit than the original model. By definition, the null hypothesis (model 2 does not provide a significantly better fit than model 1) is rejected if the calculated F-statistic is greater than the crtitical value of the F distribution. I created a column to display the critical value which is 1.046542. In addition, the p-value is less than 0.05. The null hypothesis is rejected, and I accept that model 2 gives a significantly better fit than model 1.

4. Multiple Linear Regression Model 3

  • DMDHHSZA - # of children 5 years or younger in HH - “famkids”
model3<-lm(bmi~mstatus + famsize + famkids,data=nhanes)
summary(model3)
## 
## Call:
## lm(formula = bmi ~ mstatus + famsize + famkids, data = nhanes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.204  -4.816  -1.159   3.396  53.588 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.15289    0.26137 107.714  < 2e-16 ***
## mstatus      0.02954    0.05166   0.572  0.56752    
## famsize      0.21103    0.06843   3.084  0.00205 ** 
## famkids     -0.26793    0.18911  -1.417  0.15659    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.886 on 5229 degrees of freedom
## Multiple R-squared:  0.001835,   Adjusted R-squared:  0.001263 
## F-statistic: 3.205 on 3 and 5229 DF,  p-value: 0.02223
myanova2<-anova(model2,model3)
myanova2
## Analysis of Variance Table
## 
## Model 1: bmi ~ mstatus + famsize
## Model 2: bmi ~ mstatus + famsize + famkids
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1   5230 248042                           
## 2   5229 247947  1    95.185 2.0074 0.1566
cbind(myanova2, 'CriticalValue' = qf(1-.05, myanova2[1,1], myanova2[2,1]))
##   Res.Df      RSS Df Sum of Sq        F    Pr(>F) CriticalValue
## 1   5230 248042.0 NA        NA       NA        NA      1.046546
## 2   5229 247946.8  1  95.18531 2.007382 0.1565948      1.046546

Interpretion of Coeficients

  • With the addition of the new predictor the, famkids, the coefficient of mtstatus increased from 0.023 to 0.029. Additionally, famsize increased from 0.0158 to 0.211. famkids itself has a coefficient estimate of -0.26793. Looking at the p values, it seems like with the addition of famkids, only famsize has a significant effect on bmi.

Comparision

  • Coefficients:

    • As stated previously, with the addition of famsize, the coefficients of the previous model increased but perhaps insignificantly.
  • R squared:

    • The adjusted R squared of model 2 is 0.00107 annd the adjusted R squared of model 3 is 0.001263. The adjusted R squared for model 3 is larger than that of model 2. Since adjusted R squared values increase only if new predictors improve the model, I believe that model 3 fits the observations better than model 2.
  • ANOVA F-test:

    • The ANOVA F-test value for model 2 is 7.6046. The ANOVA F-test value for model 3 is 2.0074. The value for model 3 is lower than that of model 2. While the F test value is higher than the critical value, the p value is not lower than 0.05. So I fail to reject the null that model 3 is a significantly better fit than model 2.

5. Continous Interaction

model4<-lm(bmi~mstatus + famsize + famkids + famsize:famkids,data=nhanes)
summary(model4)
## 
## Call:
## lm(formula = bmi ~ mstatus + famsize + famkids + famsize:famkids, 
##     data = nhanes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.221  -4.821  -1.150   3.386  53.567 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     28.20168    0.27330 103.190   <2e-16 ***
## mstatus          0.02692    0.05184   0.519   0.6035    
## famsize          0.19636    0.07252   2.708   0.0068 ** 
## famkids         -0.57733    0.54039  -1.068   0.2854    
## famsize:famkids  0.06346    0.10383   0.611   0.5411    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.886 on 5228 degrees of freedom
## Multiple R-squared:  0.001907,   Adjusted R-squared:  0.001143 
## F-statistic: 2.497 on 4 and 5228 DF,  p-value: 0.04079
myanova3<-anova(model2,model3)
myanova3
## Analysis of Variance Table
## 
## Model 1: bmi ~ mstatus + famsize
## Model 2: bmi ~ mstatus + famsize + famkids
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1   5230 248042                           
## 2   5229 247947  1    95.185 2.0074 0.1566
cbind(myanova3, 'CriticalValue' = qf(1-.05, myanova3[1,1], myanova3[2,1]))
##   Res.Df      RSS Df Sum of Sq        F    Pr(>F) CriticalValue
## 1   5230 248042.0 NA        NA       NA        NA      1.046546
## 2   5229 247946.8  1  95.18531 2.007382 0.1565948      1.046546

Visualization

M4<-lm(bmi~famsize*famkids,data=nhanes)
famsize.mean = round(mean(nhanes$famsize))
print(famsize.mean)
## [1] 3
famkids.mean=round(mean(nhanes$famkids))
print(famkids.mean)
## [1] 0
predict(M4,newdata=data.frame(famsize=famsize.mean,famkids=famkids.mean), interval = "c")
##        fit      lwr      upr
## 1 28.85935 28.64405 29.07465
famsize.a = round(mean(nhanes$famsize)+sd(nhanes$famsize))
famsize.b = round(mean(nhanes$famsize)-sd(nhanes$famsize))
M4.mean = data.frame(famsize=famsize.mean,famkids = 1:4)
M4.mean
##   famsize famkids
## 1       3       1
## 2       3       2
## 3       3       3
## 4       3       4
M4.mean.hat = predict(M4,newdata=M4.mean,interval="c")
M4.mean.hat
##        fit      lwr      upr
## 1 28.47269 27.96744 28.97794
## 2 28.08604 27.07090 29.10117
## 3 27.69938 26.15925 29.23951
## 4 27.31272 25.24395 29.38148
M4.a = data.frame(famsize=famsize.a,famkids=1:4)
M4.a.hat = predict(M4,newdata=M4.a, interval="c")
M4.b = data.frame(famsize=famsize.b,famkids=1:4)
M4.b.hat = predict(M4,newdata=M4.b, interval="c")
plot(1:4,M4.mean.hat[,1],ylim=range(M4.mean.hat,M4.a.hat,M4.b.hat),
     xlab="Kids in Household", ylab= "BMI", type ="n")
points(1:4, M4.mean.hat[,1],lty=3,col="pink",type="b")
points(1:4, M4.a.hat[,1],lty=3,col="blue",type="b")
points(1:4, M4.b.hat[,1],lty=3,col="gray",type="b")

Note on visualization

  • The original model: model4<-lm(bmi~mstatus + famsize + famkids + famsize:famkids,data=nhanes)

  • The model used in the visualization: M4<-lm(bmi~famsize*famkids,data=nhanes)

  • I had great difficulty trying to visualize the original nested model, so I removed the extra added predictors, and did a basic visualization with no additional predictors and just a continous interaction.

Interpertation of Coefficients

  • With the addition of an interaction between famsize and famkids, the coefficients of mstatus,famsize, and famkids all decreased.

Comparision

  • Coefficients:

    • As previously stated, all coefficient estimations decreased with the addition of a continous interaction. Looking at the p-values, it seems that only famkids is having a significant effect on bmi in this case.
  • R squared:

    • The adjusted R squared for model 3 is 0.001263. The adjusted R squared for model 4 is 0.001143. The value for model 4 is lower than the value for model 3. I believe that model 4 fits the observations worse than model 3.
  • ANOVA F-test:

    • The F-test statistic of model 3 is 2.0074. The F-test statistic of model 4 is 2.497. The F-test statistic of model 4 is larger than that of model 3. The F-test value of model 4 as well as it’s p value lower than 0.05. So, I reject the null hypothesis and say that model 4 has a significantly better fit than model 3.

#6. Categorical Interaction

#model5<-lm(bmi~famsize + famkids + famsize:mstatus ,data=nhanes)

model5<-lm(bmi~mstatus + famsize + famkids + mstatus:famkids,data=nhanes)
summary(model5)
## 
## Call:
## lm(formula = bmi ~ mstatus + famsize + famkids + mstatus:famkids, 
##     data = nhanes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.269  -4.839  -1.169   3.429  53.656 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     28.270716   0.275778 102.513   <2e-16 ***
## mstatus         -0.005718   0.057990  -0.099   0.9215    
## famsize          0.202104   0.068750   2.940   0.0033 ** 
## famkids         -0.532462   0.273542  -1.947   0.0516 .  
## mstatus:famkids  0.105292   0.078674   1.338   0.1808    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.886 on 5228 degrees of freedom
## Multiple R-squared:  0.002177,   Adjusted R-squared:  0.001414 
## F-statistic: 2.852 on 4 and 5228 DF,  p-value: 0.02245
myanova4<-anova(model2,model3)
myanova4
## Analysis of Variance Table
## 
## Model 1: bmi ~ mstatus + famsize
## Model 2: bmi ~ mstatus + famsize + famkids
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1   5230 248042                           
## 2   5229 247947  1    95.185 2.0074 0.1566
cbind(myanova4, 'CriticalValue' = qf(1-.05, myanova4[1,1], myanova4[2,1]))
##   Res.Df      RSS Df Sum of Sq        F    Pr(>F) CriticalValue
## 1   5230 248042.0 NA        NA       NA        NA      1.046546
## 2   5229 247946.8  1  95.18531 2.007382 0.1565948      1.046546

Visualization

M5<-lm(bmi~famsize*mstatus,data=nhanes)
famsize.mean = round(mean(nhanes$famsize))
print(famsize.mean)
## [1] 3
mstatus.mean=round(mean(nhanes$mstatus))
print(mstatus.mean)
## [1] 3
predict(M5,newdata=data.frame(famsize=famsize.mean,mstatus=mstatus.mean), interval = "c")
##        fit      lwr      upr
## 1 28.81986 28.62732 29.01241
famsize.a = round(mean(nhanes$famsize)+sd(nhanes$famsize))
famsize.b = round(mean(nhanes$famsize)-sd(nhanes$famsize))
M5.mean = data.frame(famsize=famsize.mean,mstatus = 1:8)
M5.mean
##   famsize mstatus
## 1       3       1
## 2       3       2
## 3       3       3
## 4       3       4
## 5       3       5
## 6       3       6
## 7       3       7
## 8       3       8
M5.mean.hat = predict(M5,newdata=M5.mean,interval="c")
M5.mean.hat
##        fit      lwr      upr
## 1 28.77460 28.52278 29.02643
## 2 28.79723 28.59706 28.99741
## 3 28.81986 28.62732 29.01241
## 4 28.84250 28.60918 29.07581
## 5 28.86513 28.56156 29.16869
## 6 28.88776 28.50017 29.27535
## 7 28.91039 28.43221 29.38856
## 8 28.93302 28.36080 29.50523
M5.a = data.frame(famsize=famsize.a,mstatus=1:8)
M5.a.hat = predict(M5,newdata=M5.a, interval="c")
M5.b = data.frame(famsize=famsize.b,mstatus=1:8)
M5.b.hat = predict(M5,newdata=M5.b, interval="c")
plot(1:8,M5.mean.hat[,1],ylim=range(M5.mean.hat,M5.a.hat,M5.b.hat),
     xlab="Marriage Status", ylab= "BMI", type ="n")
points(1:8, M5.mean.hat[,1],lty=3,col="pink",type="b")
points(1:8, M5.a.hat[,1],lty=3,col="blue",type="b")
points(1:8, M5.b.hat[,1],lty=3,col="gray",type="b")

Note on visualization

  • I had great difficultly visualizing a more nested/complex model so I did a simpler version.

Interpertation of Coefficients

  • With the addition of a categorical interaction, most values, except for famkids decreased when compared to the previous models.

Comparision

  • Coefficients:

    • As previously stated, for the most part they decreased. Again, the only significant value is famsize.
  • R squared:

    • The adjusted R squared increased slightly, so far model 5 fits the observations better than model 4.
  • ANOVA F-test:

    • The F values are exactly the same. The p-values also indicate significance. Perhaps both models fit exactly the same.

7. Conclusions

Best-Fit Model?

  • Based on the F-test values, the model with the highest value is model 2 with 7.6. Any additional model decreased to 2.07. So I believe the model with just famsize added, is the best fit.

What Factors Matter for BMI?

  • In terms of coefficient estimates + p-values, as I added additional predictors, the famsize predictor consistently showed that it had effect on the BMI.

I guess this shows that the size of a family has an effect on one’s bmi. The larger the family, the higher the BMI. Referring to the best fit model, model 2, as the family size increases by one unit, an individual’s BMI increases by 0.15804 units.