library(readr)
NHANES<-read.csv("/Volumes/FLASHDRIVE/Data 306/NHANES_v1.csv")
DMDMARTL - Marital status
bmxbmi - Body mass index
#Recode data
NHANES$bmxbmi[NHANES$bmxbmi=="."]=NA
NHANES$bmxbmi[NHANES$bmxbmi==77]=NA
NHANES$bmxbmi[NHANES$bmxbmi==99]=NA
NHANES$dmdmartl[NHANES$dmdmartl=="."]=NA
NHANES$dmdmartl[NHANES$dmdmartl==77]=NA
NHANES$dmdmartl[NHANES$dmdmartl==99]=NA
NHANES$dmdfmsiz[NHANES$dmdfmsiz=="."]=NA
NHANES$dmdhhsza[NHANES$dmdhhsza=="."]=NA
# Extract, rename, remove missing data
nhanes=data.frame(NHANES$bmxbmi,NHANES$dmdmartl,NHANES$dmdfmsiz,NHANES$dmdhhsza)
names(nhanes)
## [1] "NHANES.bmxbmi" "NHANES.dmdmartl" "NHANES.dmdfmsiz" "NHANES.dmdhhsza"
colnames(nhanes)=c("bmi","mstatus","famsize","famkids")
nhanes=na.omit(nhanes)
names(nhanes)
## [1] "bmi" "mstatus" "famsize" "famkids"
# simple linear regression model predicting bmi using marital status. Y = bmi, x=mstatus
model1<-lm(bmi~mstatus,data=nhanes)
summary(model1)
##
## Call:
## lm(formula = bmi ~ mstatus, data = nhanes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.375 -4.871 -1.173 3.425 53.329
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.776111 0.165301 174.08 <2e-16 ***
## mstatus -0.001012 0.050753 -0.02 0.984
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.891 on 5231 degrees of freedom
## Multiple R-squared: 7.595e-08, Adjusted R-squared: -0.0001911
## F-statistic: 0.0003973 on 1 and 5231 DF, p-value: 0.9841
Intercept is 28.7. This means that when marital status is 0, the bmi is around 29.
coefficent is -0.001012 and negative. This means that as the marital status increases, the bmi also decreases by a mean value of -0.001012 per unit.
The F statistic provides in a certain sense the coefficient of determination, this is a common method used to assess goodness of fit.
The coefficient of determination ranges from 0 to 1. A value closer to 0 means a lesser liklihood of X predicting Y and vice versa.
The F-statistic, 0.151, is closer to 0 than 1, meaning that marital status is not a perfect predictor of bmi.
DMDFMSIZ - Total number of people in the Family
bmxbmi - Body mass index
I will model the relationship between body mass index and total number of people in the family with total number of people in the family being the independent predictor and body mass index being the dependent predictor. I theorize that individuals with larger families weigh more than indivudals with smaller families because bigger families are more expensive to maintain and therefore more stressful.
model2<-lm(bmi~mstatus + famsize,data=nhanes)
summary(model2)
##
## Call:
## lm(formula = bmi ~ mstatus + famsize, data = nhanes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.191 -4.818 -1.165 3.393 53.575
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.25147 0.25196 112.126 < 2e-16 ***
## mstatus 0.02308 0.05147 0.448 0.65382
## famsize 0.15804 0.05731 2.758 0.00584 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.887 on 5230 degrees of freedom
## Multiple R-squared: 0.001452, Adjusted R-squared: 0.00107
## F-statistic: 3.803 on 2 and 5230 DF, p-value: 0.02238
myanova<-anova(model1,model2)
myanova
## Analysis of Variance Table
##
## Model 1: bmi ~ mstatus
## Model 2: bmi ~ mstatus + famsize
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5231 248403
## 2 5230 248042 1 360.66 7.6046 0.005842 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cbind(myanova, 'CriticalValue' = qf(1-.05, myanova[1,1], myanova[2,1]))
## Res.Df RSS Df Sum of Sq F Pr(>F) CriticalValue
## 1 5231 248402.7 NA NA NA NA 1.046542
## 2 5230 248042.0 1 360.6623 7.604613 0.005842129 1.046542
Coefficents:
R-squared:
ANOVA F-test:
model3<-lm(bmi~mstatus + famsize + famkids,data=nhanes)
summary(model3)
##
## Call:
## lm(formula = bmi ~ mstatus + famsize + famkids, data = nhanes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.204 -4.816 -1.159 3.396 53.588
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.15289 0.26137 107.714 < 2e-16 ***
## mstatus 0.02954 0.05166 0.572 0.56752
## famsize 0.21103 0.06843 3.084 0.00205 **
## famkids -0.26793 0.18911 -1.417 0.15659
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.886 on 5229 degrees of freedom
## Multiple R-squared: 0.001835, Adjusted R-squared: 0.001263
## F-statistic: 3.205 on 3 and 5229 DF, p-value: 0.02223
myanova2<-anova(model2,model3)
myanova2
## Analysis of Variance Table
##
## Model 1: bmi ~ mstatus + famsize
## Model 2: bmi ~ mstatus + famsize + famkids
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5230 248042
## 2 5229 247947 1 95.185 2.0074 0.1566
cbind(myanova2, 'CriticalValue' = qf(1-.05, myanova2[1,1], myanova2[2,1]))
## Res.Df RSS Df Sum of Sq F Pr(>F) CriticalValue
## 1 5230 248042.0 NA NA NA NA 1.046546
## 2 5229 247946.8 1 95.18531 2.007382 0.1565948 1.046546
Coefficients:
R squared:
ANOVA F-test:
model4<-lm(bmi~mstatus + famsize + famkids + famsize:famkids,data=nhanes)
summary(model4)
##
## Call:
## lm(formula = bmi ~ mstatus + famsize + famkids + famsize:famkids,
## data = nhanes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.221 -4.821 -1.150 3.386 53.567
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.20168 0.27330 103.190 <2e-16 ***
## mstatus 0.02692 0.05184 0.519 0.6035
## famsize 0.19636 0.07252 2.708 0.0068 **
## famkids -0.57733 0.54039 -1.068 0.2854
## famsize:famkids 0.06346 0.10383 0.611 0.5411
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.886 on 5228 degrees of freedom
## Multiple R-squared: 0.001907, Adjusted R-squared: 0.001143
## F-statistic: 2.497 on 4 and 5228 DF, p-value: 0.04079
myanova3<-anova(model2,model3)
myanova3
## Analysis of Variance Table
##
## Model 1: bmi ~ mstatus + famsize
## Model 2: bmi ~ mstatus + famsize + famkids
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5230 248042
## 2 5229 247947 1 95.185 2.0074 0.1566
cbind(myanova3, 'CriticalValue' = qf(1-.05, myanova3[1,1], myanova3[2,1]))
## Res.Df RSS Df Sum of Sq F Pr(>F) CriticalValue
## 1 5230 248042.0 NA NA NA NA 1.046546
## 2 5229 247946.8 1 95.18531 2.007382 0.1565948 1.046546
M4<-lm(bmi~famsize*famkids,data=nhanes)
famsize.mean = round(mean(nhanes$famsize))
print(famsize.mean)
## [1] 3
famkids.mean=round(mean(nhanes$famkids))
print(famkids.mean)
## [1] 0
predict(M4,newdata=data.frame(famsize=famsize.mean,famkids=famkids.mean), interval = "c")
## fit lwr upr
## 1 28.85935 28.64405 29.07465
famsize.a = round(mean(nhanes$famsize)+sd(nhanes$famsize))
famsize.b = round(mean(nhanes$famsize)-sd(nhanes$famsize))
M4.mean = data.frame(famsize=famsize.mean,famkids = 1:4)
M4.mean
## famsize famkids
## 1 3 1
## 2 3 2
## 3 3 3
## 4 3 4
M4.mean.hat = predict(M4,newdata=M4.mean,interval="c")
M4.mean.hat
## fit lwr upr
## 1 28.47269 27.96744 28.97794
## 2 28.08604 27.07090 29.10117
## 3 27.69938 26.15925 29.23951
## 4 27.31272 25.24395 29.38148
M4.a = data.frame(famsize=famsize.a,famkids=1:4)
M4.a.hat = predict(M4,newdata=M4.a, interval="c")
M4.b = data.frame(famsize=famsize.b,famkids=1:4)
M4.b.hat = predict(M4,newdata=M4.b, interval="c")
plot(1:4,M4.mean.hat[,1],ylim=range(M4.mean.hat,M4.a.hat,M4.b.hat),
xlab="Kids in Household", ylab= "BMI", type ="n")
points(1:4, M4.mean.hat[,1],lty=3,col="pink",type="b")
points(1:4, M4.a.hat[,1],lty=3,col="blue",type="b")
points(1:4, M4.b.hat[,1],lty=3,col="gray",type="b")
The original model: model4<-lm(bmi~mstatus + famsize + famkids + famsize:famkids,data=nhanes)
The model used in the visualization: M4<-lm(bmi~famsize*famkids,data=nhanes)
I had great difficulty trying to visualize the original nested model, so I removed the extra added predictors, and did a basic visualization with no additional predictors and just a continous interaction.
Coefficients:
R squared:
ANOVA F-test:
#6. Categorical Interaction
#model5<-lm(bmi~famsize + famkids + famsize:mstatus ,data=nhanes)
model5<-lm(bmi~mstatus + famsize + famkids + mstatus:famkids,data=nhanes)
summary(model5)
##
## Call:
## lm(formula = bmi ~ mstatus + famsize + famkids + mstatus:famkids,
## data = nhanes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.269 -4.839 -1.169 3.429 53.656
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.270716 0.275778 102.513 <2e-16 ***
## mstatus -0.005718 0.057990 -0.099 0.9215
## famsize 0.202104 0.068750 2.940 0.0033 **
## famkids -0.532462 0.273542 -1.947 0.0516 .
## mstatus:famkids 0.105292 0.078674 1.338 0.1808
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.886 on 5228 degrees of freedom
## Multiple R-squared: 0.002177, Adjusted R-squared: 0.001414
## F-statistic: 2.852 on 4 and 5228 DF, p-value: 0.02245
myanova4<-anova(model2,model3)
myanova4
## Analysis of Variance Table
##
## Model 1: bmi ~ mstatus + famsize
## Model 2: bmi ~ mstatus + famsize + famkids
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5230 248042
## 2 5229 247947 1 95.185 2.0074 0.1566
cbind(myanova4, 'CriticalValue' = qf(1-.05, myanova4[1,1], myanova4[2,1]))
## Res.Df RSS Df Sum of Sq F Pr(>F) CriticalValue
## 1 5230 248042.0 NA NA NA NA 1.046546
## 2 5229 247946.8 1 95.18531 2.007382 0.1565948 1.046546
M5<-lm(bmi~famsize*mstatus,data=nhanes)
famsize.mean = round(mean(nhanes$famsize))
print(famsize.mean)
## [1] 3
mstatus.mean=round(mean(nhanes$mstatus))
print(mstatus.mean)
## [1] 3
predict(M5,newdata=data.frame(famsize=famsize.mean,mstatus=mstatus.mean), interval = "c")
## fit lwr upr
## 1 28.81986 28.62732 29.01241
famsize.a = round(mean(nhanes$famsize)+sd(nhanes$famsize))
famsize.b = round(mean(nhanes$famsize)-sd(nhanes$famsize))
M5.mean = data.frame(famsize=famsize.mean,mstatus = 1:8)
M5.mean
## famsize mstatus
## 1 3 1
## 2 3 2
## 3 3 3
## 4 3 4
## 5 3 5
## 6 3 6
## 7 3 7
## 8 3 8
M5.mean.hat = predict(M5,newdata=M5.mean,interval="c")
M5.mean.hat
## fit lwr upr
## 1 28.77460 28.52278 29.02643
## 2 28.79723 28.59706 28.99741
## 3 28.81986 28.62732 29.01241
## 4 28.84250 28.60918 29.07581
## 5 28.86513 28.56156 29.16869
## 6 28.88776 28.50017 29.27535
## 7 28.91039 28.43221 29.38856
## 8 28.93302 28.36080 29.50523
M5.a = data.frame(famsize=famsize.a,mstatus=1:8)
M5.a.hat = predict(M5,newdata=M5.a, interval="c")
M5.b = data.frame(famsize=famsize.b,mstatus=1:8)
M5.b.hat = predict(M5,newdata=M5.b, interval="c")
plot(1:8,M5.mean.hat[,1],ylim=range(M5.mean.hat,M5.a.hat,M5.b.hat),
xlab="Marriage Status", ylab= "BMI", type ="n")
points(1:8, M5.mean.hat[,1],lty=3,col="pink",type="b")
points(1:8, M5.a.hat[,1],lty=3,col="blue",type="b")
points(1:8, M5.b.hat[,1],lty=3,col="gray",type="b")
Coefficients:
R squared:
ANOVA F-test:
I guess this shows that the size of a family has an effect on one’s bmi. The larger the family, the higher the BMI. Referring to the best fit model, model 2, as the family size increases by one unit, an individual’s BMI increases by 0.15804 units.