I am interested in how gender may impact BMI.
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
NHANES <- read.csv("/Users/Nazija/Desktop/NHANES.csv")
data1<- NHANES%>%
select(bmxbmi, riagendr)
data1$riagendr = factor(data1$riagendr)
data1 = na.omit(data1)
colnames(data1) = c("BMI", "Gender")
levels(data1$Gender) = c("Male", "Female")
head(data1)
## BMI Gender
## 1 23.3 Male
## 2 14.2 Female
## 3 17.3 Male
## 4 23.2 Female
## 5 27.2 Female
## 6 16.2 Male
M1 = lm(BMI~ I(Gender == "Female"), data = data1)
summary(M1)
##
## Call:
## lm(formula = BMI ~ I(Gender == "Female"), data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.368 -5.968 -0.768 4.488 56.332
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.9119 0.1175 212.10 < 2e-16 ***
## I(Gender == "Female")TRUE 0.8558 0.1662 5.15 2.66e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.706 on 8600 degrees of freedom
## Multiple R-squared: 0.003075, Adjusted R-squared: 0.002959
## F-statistic: 26.53 on 1 and 8600 DF, p-value: 2.656e-07
anova(M1)
## Analysis of Variance Table
##
## Response: BMI
## Df Sum Sq Mean Sq F value Pr(>F)
## I(Gender == "Female") 1 1575 1575.13 26.527 2.656e-07 ***
## Residuals 8600 510646 59.38
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
bmi.hat= fitted(M1)
bmi.resid= resid(M1)
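On average, a female's BMI is about 0.86 points higher than a male's, and the difference is statistically significant (p < 0.001). The R-squared statistic is 0.003, so gender alone explains only about 0.3% of the variance in BMI. Because this is far from 1, the model is a poor fit even though the gender difference is significant.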
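To make the interpretation of the dummy coefficient and of R-squared concrete, here is a minimal sketch on simulated data. The data frame `sim` and its values are assumptions made up for illustration; they are not NHANES values.

```r
# Simulated stand-in data (illustrative only, not NHANES).
set.seed(1)
sim <- data.frame(
  Gender = factor(rep(c("Male", "Female"), each = 200),
                  levels = c("Male", "Female"))
)
sim$BMI <- 25 + 0.9 * (sim$Gender == "Female") + rnorm(400, sd = 7)

fit <- lm(BMI ~ Gender, data = sim)

# With a single binary predictor, the slope is the difference in group means.
mean_diff <- mean(sim$BMI[sim$Gender == "Female"]) -
  mean(sim$BMI[sim$Gender == "Male"])
slope <- unname(coef(fit)["GenderFemale"])

# R-squared is the share of the variance in BMI that the model explains.
r2_manual <- 1 - sum(resid(fit)^2) / sum((sim$BMI - mean(sim$BMI))^2)
r2_lm <- summary(fit)$r.squared

print(c(slope = slope, mean_diff = mean_diff))
print(c(r2_manual = r2_manual, r2_lm = r2_lm))
```

A small R-squared alongside a significant slope is a common pattern in large samples: the group means differ reliably, but most of the spread in BMI lies within each group.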
mean(bmi.resid)
## [1] -2.014524e-16
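The mean of the residuals is about -2.01e-16, which is 0 up to floating-point rounding error, as expected for a least-squares fit with an intercept.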
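A value printed as -2.014524e-16 is in scientific notation and means -2.014524 × 10^-16. The sketch below, on simulated data (the variables `x` and `y` are hypothetical), shows one way to check that a residual mean is zero within numerical tolerance.

```r
# Simulated data for illustration only.
set.seed(2)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)

fit <- lm(y ~ x)
resid_mean <- mean(resid(fit))

# Compare against zero with a tolerance rather than with ==,
# because floating-point arithmetic leaves tiny rounding errors.
is_zero <- abs(resid_mean) < 1e-10

print(format(resid_mean, scientific = TRUE))
print(is_zero)
```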
plot(bmi.hat, bmi.resid, xlab = "fitted values", ylab = "residuals")
abline(h = 0, col = "red")
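The residual plot does not show homoscedasticity. Because gender is the only predictor, the fitted values take just two values, so the plot compares the spread of the residuals in the two groups.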
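With a single binary predictor, the visual check can be backed up by comparing the residual variance in each group. This is a rough sketch on simulated data in which the two groups are deliberately given different spreads; `sim` is hypothetical, and `var.test()` assumes roughly normal data, so its p-value is only a guide.

```r
# Simulated data with unequal spread by group (illustrative only).
set.seed(4)
sim <- data.frame(
  Gender = factor(rep(c("Male", "Female"), each = 300),
                  levels = c("Male", "Female"))
)
sim$BMI <- ifelse(sim$Gender == "Female",
                  rnorm(600, mean = 26, sd = 9),
                  rnorm(600, mean = 25, sd = 5))

fit <- lm(BMI ~ Gender, data = sim)

# Residual variance within each group; similar values suggest homoscedasticity.
resid_var <- sapply(split(resid(fit), sim$Gender), var)

# F test for equality of the two group variances (sensitive to non-normality).
vt <- var.test(BMI ~ Gender, data = sim)

print(resid_var)
print(vt$p.value)
```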
qqnorm(bmi.resid)
qqline(bmi.resid, col = "red")
My independent variable will be annual family income. I think that annual family income will have a negative relationship with BMI, the dependent variable, as individuals with higher family incomes will have more access to nutritious foods.
data2 <-NHANES%>%
select(bmxbmi, indfmin2, riagendr)%>%
filter(indfmin2 <= 15)
data2$riagendr = factor(data2$riagendr)
data2 = na.omit(data2)
colnames(data2) = c("BMI", "Family_Income", "Gender")
levels(data2$Gender) = c("Male", "Female")
head(data2)
## BMI Family_Income Gender
## 1 23.3 14 Male
## 2 14.2 4 Female
## 3 17.3 15 Male
## 4 23.2 8 Female
## 5 27.2 4 Female
## 6 15.4 14 Male
M2 = lm(BMI ~ Family_Income + Gender, data = data2)
summary(M2)
##
## Call:
## lm(formula = BMI ~ Family_Income + Gender, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.836 -6.054 -0.798 4.503 55.764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.58828 0.19293 132.632 < 2e-16 ***
## Family_Income -0.08560 0.01898 -4.511 6.55e-06 ***
## GenderFemale 0.91870 0.17088 5.376 7.81e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.743 on 8211 degrees of freedom
## Multiple R-squared: 0.006055, Adjusted R-squared: 0.005813
## F-statistic: 25.01 on 2 and 8211 DF, p-value: 1.483e-11
anova(M2)
## Analysis of Variance Table
##
## Response: BMI
## Df Sum Sq Mean Sq F value Pr(>F)
## Family_Income 1 1266 1265.92 21.117 4.386e-06 ***
## Gender 1 1733 1732.75 28.904 7.815e-08 ***
## Residuals 8211 492238 59.95
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
bmi2.hat = fitted(M2)
bmi2.resid = resid(M2)
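For every 1-point increase in the family income scale, BMI decreases by about 0.09 points, holding gender constant. Females on average have BMIs that are about 0.92 higher than males with the same family income. The coefficient for female gender has increased by about 0.06 (from 0.86 to 0.92) after including family income in the model, although the two models were fit to slightly different samples.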
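The coefficients are in BMI units, not percentages. The arithmetic can be sketched directly from the estimates printed by `summary(M2)` and `summary(M1)`; the helper `predict_bmi()` is a hypothetical name introduced only for this illustration.

```r
# Estimates copied from the summary(M2) output above.
b0 <- 25.58828
b_income <- -0.08560
b_female <- 0.91870

# Hypothetical helper: predicted BMI for an income code and a 0/1 female flag.
predict_bmi <- function(income, female) {
  b0 + b_income * income + b_female * female
}

# Moving one step up the income scale changes predicted BMI by b_income units.
income_step <- predict_bmi(5, 0) - predict_bmi(4, 0)

# Change in the gender coefficient relative to M1 (0.8558 in summary(M1)).
gender_change <- b_female - 0.8558

print(income_step)
print(gender_change)
```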
mean(bmi2.resid)
## [1] -1.996288e-16
The mean of the residuals is a very small negative number, near 0.
plot(bmi2.hat, bmi2.resid, xlab = "fitted values", ylab = "residuals")
abline(h = 0, col = "red")
qqnorm(bmi2.resid)
qqline(bmi2.resid, col = "red")
M1 = lm(BMI~ I(Gender == "Female"), data = data2) #make sure from the same dataset
anova(M1, M2)
## Analysis of Variance Table
##
## Model 1: BMI ~ I(Gender == "Female")
## Model 2: BMI ~ Family_Income + Gender
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 8212 493457
## 2 8211 492238 1 1219.8 20.347 6.55e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
M2 provides a better fit: adding family income significantly reduces the residual sum of squares (F = 20.35, p = 6.55e-06).
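The F statistic that `anova()` reports for two nested models can be reproduced by hand from the residual sums of squares. The sketch below uses simulated data (the variables in `sim` are hypothetical) and checks the manual value against `anova()`.

```r
# Simulated data for illustration only.
set.seed(3)
n <- 500
sim <- data.frame(
  income = sample(1:15, n, replace = TRUE),
  female = rbinom(n, 1, 0.5)
)
sim$bmi <- 25 - 0.1 * sim$income + 0.9 * sim$female + rnorm(n, sd = 7)

small <- lm(bmi ~ female, data = sim)
big <- lm(bmi ~ income + female, data = sim)

# Partial F test: drop in RSS per added parameter,
# divided by the residual mean square of the larger model.
rss_small <- sum(resid(small)^2)
rss_big <- sum(resid(big)^2)
df_extra <- df.residual(small) - df.residual(big)
f_manual <- ((rss_small - rss_big) / df_extra) / (rss_big / df.residual(big))
p_manual <- pf(f_manual, df_extra, df.residual(big), lower.tail = FALSE)

comparison <- anova(small, big)
f_anova <- comparison$F[2]
p_anova <- comparison$`Pr(>F)`[2]

print(c(f_manual = f_manual, f_anova = f_anova))
```

The comparison is only valid when both models are fit to the same rows, which is why M1 is refit on `data2` before calling `anova(M1, M2)`.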
#Task 8 Including age
data2 <-NHANES%>%
select(bmxbmi, indfmin2, riagendr, ridageyr)%>%
filter(indfmin2 <= 15)
data2$riagendr = factor(data2$riagendr)
data2 = na.omit(data2)
colnames(data2) = c("BMI", "Family_Income", "Gender", "Age")
levels(data2$Gender) = c("Male", "Female")
head(data2)
## BMI Family_Income Gender Age
## 1 23.3 14 Male 22
## 2 14.2 4 Female 3
## 3 17.3 15 Male 14
## 4 23.2 8 Female 44
## 5 27.2 4 Female 14
## 6 15.4 14 Male 6
M3 = lm(BMI ~ Family_Income + Gender+ Age, data = data2)
summary(M3)
##
## Call:
## lm(formula = BMI ~ Family_Income + Gender + Age, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.695 -4.571 -1.497 3.217 55.593
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.250000 0.195539 103.560 < 2e-16 ***
## Family_Income -0.089041 0.016425 -5.421 6.09e-08 ***
## GenderFemale 0.750324 0.147929 5.072 4.02e-07 ***
## Age 0.163840 0.003123 52.459 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.701 on 8210 degrees of freedom
## Multiple R-squared: 0.2556, Adjusted R-squared: 0.2553
## F-statistic: 939.6 on 3 and 8210 DF, p-value: < 2.2e-16
anova(M3)
## Analysis of Variance Table
##
## Response: BMI
## Df Sum Sq Mean Sq F value Pr(>F)
## Family_Income 1 1266 1266 28.192 1.128e-07 ***
## Gender 1 1733 1733 38.588 5.489e-10 ***
## Age 1 123575 123575 2751.970 < 2.2e-16 ***
## Residuals 8210 368663 45
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
bmi3.hat = fitted(M3)
bmi3.resid = resid(M3)
After adjusting for age in M3, the coefficient for family income has changed slightly (from -0.086 to -0.089), and the coefficient for gender has decreased by about 0.17 (from 0.92 to 0.75). For every 1-point increase in the income scale, BMI decreases by about 0.09. Females tend to have a BMI that is 0.75 higher on average than males. For every 1-year increase in age, BMI increases by about 0.16. The R-squared statistic is much higher in this model, at about 0.26, than in M2 (about 0.006). Because it is closer to 1, this model is a better fit, although it still leaves about 74% of the variance in BMI unexplained.
anova(M2, M3)
## Analysis of Variance Table
##
## Model 1: BMI ~ Family_Income + Gender
## Model 2: BMI ~ Family_Income + Gender + Age
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 8211 492238
## 2 8210 368663 1 123575 2752 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
M3 provides a better fit: adding age significantly reduces the residual sum of squares (F = 2752, p < 2.2e-16).
#library(ggplot2)
M4 = lm(BMI~Gender+ Family_Income + Age + Family_Income:Age, data = data2)
summary(M4)
##
## Call:
## lm(formula = BMI ~ Gender + Family_Income + Age + Family_Income:Age,
## data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.721 -4.564 -1.503 3.187 55.525
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.5664934 0.2627596 78.271 < 2e-16 ***
## GenderFemale 0.7556628 0.1479385 5.108 3.33e-07 ***
## Family_Income -0.1311193 0.0285387 -4.594 4.40e-06 ***
## Age 0.1540737 0.0062526 24.641 < 2e-16 ***
## Family_Income:Age 0.0012849 0.0007127 1.803 0.0714 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.7 on 8209 degrees of freedom
## Multiple R-squared: 0.2559, Adjusted R-squared: 0.2555
## F-statistic: 705.7 on 4 and 8209 DF, p-value: < 2.2e-16
bmi4.hat = fitted(M4)
bmi4.resid = resid(M4)
summary(M3)
##
## Call:
## lm(formula = BMI ~ Family_Income + Gender + Age, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.695 -4.571 -1.497 3.217 55.593
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.250000 0.195539 103.560 < 2e-16 ***
## Family_Income -0.089041 0.016425 -5.421 6.09e-08 ***
## GenderFemale 0.750324 0.147929 5.072 4.02e-07 ***
## Age 0.163840 0.003123 52.459 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.701 on 8210 degrees of freedom
## Multiple R-squared: 0.2556, Adjusted R-squared: 0.2553
## F-statistic: 939.6 on 3 and 8210 DF, p-value: < 2.2e-16
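The coefficients for gender and age did not change much after including the interaction between age and family income in the model. The family income coefficient changed from -0.089 to -0.131, but in M4 it describes the income slope at age 0 rather than an average slope. The interaction term itself is small (about 0.0013) and is not significant at the 0.05 level (p = 0.071).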
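In a model with a Family_Income:Age term, the income slope depends on age: it is the Family_Income coefficient plus the interaction coefficient times age. A minimal sketch using the estimates printed by `summary(M4)`; the function name `income_slope_at()` is hypothetical.

```r
# Estimates copied from the summary(M4) output above.
b_income <- -0.1311193
b_interaction <- 0.0012849

# Hypothetical helper: change in predicted BMI per 1-point income step at a given age.
income_slope_at <- function(age) {
  b_income + b_interaction * age
}

slopes <- income_slope_at(c(0, 20, 40, 60, 80))
names(slopes) <- c("age0", "age20", "age40", "age60", "age80")

# Age at which the estimated income slope would reach zero.
zero_age <- -b_income / b_interaction

print(round(slopes, 4))
print(zero_age)
```

Under these estimates the income slope shrinks toward zero with age but remains negative across the sketched age range.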
mean(abs(data2$BMI - bmi3.hat)/data2$BMI)
## [1] 0.2034949
mean(abs(data2$BMI - bmi4.hat)/data2$BMI)
## [1] 0.203372
mean((data2$BMI - bmi3.hat)^2)
## [1] 44.88226
mean((data2$BMI - bmi4.hat)^2)
## [1] 44.8645
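The mean absolute percentage errors (0.2035 for M3 and 0.2034 for M4) and mean squared errors (44.88 and 44.86) are very close, as are the R-squared statistics (0.2556 and 0.2559). M4's errors are slightly lower and its R-squared is slightly higher. However, because M3 is nested in M4, M4's in-sample fit can never be worse, so these small differences alone do not show that M4 is the better model.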
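The two error measures can be wrapped in small helpers so the same calculation is applied to every model. The sketch below uses simulated data with no true interaction; `mape()`, `mse()`, and `sim` are hypothetical names, and the AIC comparison is one possible way to penalize the extra parameter, not something taken from the analysis above.

```r
# Hypothetical helpers matching the formulas used above.
mape <- function(actual, predicted) mean(abs(actual - predicted) / actual)
mse <- function(actual, predicted) mean((actual - predicted)^2)

# Simulated data with no true interaction (illustrative only).
set.seed(6)
n <- 400
sim <- data.frame(
  age = runif(n, 2, 80),
  income = sample(1:15, n, replace = TRUE)
)
sim$bmi <- 20 + 0.16 * sim$age - 0.09 * sim$income + rnorm(n, sd = 3)

main_only <- lm(bmi ~ age + income, data = sim)
with_int <- lm(bmi ~ age * income, data = sim)

mse_main <- mse(sim$bmi, fitted(main_only))
mse_int <- mse(sim$bmi, fitted(with_int))
mape_main <- mape(sim$bmi, fitted(main_only))
mape_int <- mape(sim$bmi, fitted(with_int))

# In-sample MSE of the larger nested model is never higher;
# AIC adds a penalty for the extra parameter.
aic_table <- AIC(main_only, with_int)

print(c(mse_main = mse_main, mse_int = mse_int))
print(aic_table)
```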
anova(M3,M4)
## Analysis of Variance Table
##
## Model 1: BMI ~ Family_Income + Gender + Age
## Model 2: BMI ~ Gender + Family_Income + Age + Family_Income:Age
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 8210 368663
## 2 8209 368517 1 145.91 3.2503 0.07144 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
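M4 has the best in-sample fit, followed closely by M3, but the interaction term does not significantly improve on M3 at the 0.05 level (F = 3.25, p = 0.071), so M3 is the more parsimonious choice. Gender had the largest coefficient, but the predictors are measured on different scales, so the raw coefficients do not rank their impact. In the sequential ANOVA for M3, age accounted for by far the most variation in BMI (sum of squares 123575, compared with 1733 for gender and 1266 for family income). Age had a positive relationship with BMI, while family income had a negative relationship. The interaction of age and family income had a very small impact on BMI.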
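One way to put predictors measured in different units on a common footing is to standardize the coefficients, multiplying each by the standard deviation of its predictor and dividing by the standard deviation of the response. This is a sketch on simulated data whose coefficients are loosely modeled on M3; `sim` is hypothetical, and standardized coefficients are only one of several possible measures of importance.

```r
# Simulated data loosely modeled on M3 (illustrative only, not NHANES).
set.seed(5)
n <- 5000
sim <- data.frame(
  age = runif(n, 2, 80),
  income = sample(1:15, n, replace = TRUE),
  female = rbinom(n, 1, 0.5)
)
sim$bmi <- 20 + 0.16 * sim$age - 0.09 * sim$income +
  0.75 * sim$female + rnorm(n, sd = 6.7)

fit <- lm(bmi ~ age + income + female, data = sim)

# Raw coefficients are per unit of each predictor (year, income code, 0/1).
raw <- coef(fit)[-1]

# Standardized coefficients: change in BMI, in standard deviations,
# per one standard deviation change in the predictor.
pred_sd <- sapply(sim[names(raw)], sd)
std <- raw * pred_sd / sd(sim$bmi)

print(round(raw, 3))
print(round(std, 3))
```

In this simulation the gender coefficient is the largest in raw units, while age has the largest standardized coefficient because it varies over a much wider range.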