Task 1

I am interested in how gender may impact BMI.

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

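# Load the NHANES extract and keep only BMI and gender
NHANES <- read.csv("/Users/Nazija/Desktop/NHANES.csv")
data1 <- NHANES %>%
  select(bmxbmi, riagendr)
data1$riagendr <- factor(data1$riagendr)
data1 <- na.omit(data1)  # drop rows with missing BMI or gender
colnames(data1) <- c("BMI", "Gender")
levels(data1$Gender) <- c("Male", "Female")  # riagendr codes: 1 = Male, 2 = Female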
head(data1)

##    BMI Gender
## 1 23.3   Male
## 2 14.2 Female
## 3 17.3   Male
## 4 23.2 Female
## 5 27.2 Female
## 6 16.2   Male

Task 2

M1 = lm(BMI~ I(Gender == "Female"), data = data1)
summary(M1)

## 
## Call:
## lm(formula = BMI ~ I(Gender == "Female"), data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.368  -5.968  -0.768   4.488  56.332 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                24.9119     0.1175  212.10  < 2e-16 ***
## I(Gender == "Female")TRUE   0.8558     0.1662    5.15 2.66e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.706 on 8600 degrees of freedom
## Multiple R-squared:  0.003075,   Adjusted R-squared:  0.002959 
## F-statistic: 26.53 on 1 and 8600 DF,  p-value: 2.656e-07

anova(M1)

## Analysis of Variance Table
## 
## Response: BMI
##                         Df Sum Sq Mean Sq F value    Pr(>F)    
## I(Gender == "Female")    1   1575 1575.13  26.527 2.656e-07 ***
## Residuals             8600 510646   59.38                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

bmi.hat= fitted(M1)
bmi.resid= resid(M1)

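On average, females have a BMI about 0.86 units higher than males (p < 0.001). The R-squared statistic is 0.003, so gender alone explains only about 0.3% of the variation in BMI. The difference is statistically significant, but the model has very little explanatory power.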
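
As a rough illustration of what the gender coefficient means, the sketch below fits the same kind of model to simulated data (the `toy` data frame and its values are hypothetical, not the NHANES file). With a single two-level factor as the predictor, the intercept equals the mean of the reference group and the slope equals the difference in group means.

```r
# Simulated stand-in data (hypothetical; not the NHANES extract)
set.seed(1)
toy <- data.frame(
  Gender = factor(rep(c("Male", "Female"), each = 50),
                  levels = c("Male", "Female")),
  BMI = c(rnorm(50, mean = 25, sd = 7), rnorm(50, mean = 26, sd = 7))
)

fit <- lm(BMI ~ Gender, data = toy)
group_means <- tapply(toy$BMI, toy$Gender, mean)

# Intercept = mean of the reference group (Male);
# slope = Female mean minus Male mean
intercept <- unname(coef(fit)["(Intercept)"])
slope <- unname(coef(fit)["GenderFemale"])
mean_diff <- unname(group_means["Female"] - group_means["Male"])

# R-squared = share of BMI variance explained by group membership
r_squared <- summary(fit)$r.squared
print(c(intercept = intercept, slope = slope,
        mean_diff = mean_diff, r_squared = r_squared))
```

A small R-squared with a significant slope, as in M1, is what one would expect when the groups differ slightly on average but overlap heavily.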

Task 3

Is the mean of the residuals 0?

mean(bmi.resid)

## [1] -2.014524e-16

The mean of the residuals is about -2.0e-16, which is 0 up to floating-point rounding error.
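
A value printed in scientific notation such as `-2.014524e-16` is easy to misread. The sketch below, which uses the built-in `mtcars` data purely as a stand-in, shows one way to test whether a residual mean is zero within numerical tolerance instead of reading the printout by eye.

```r
# Any least-squares fit with an intercept has residuals that average to zero;
# mtcars is used here only as a convenient built-in example
fit <- lm(mpg ~ wt, data = mtcars)
res_mean <- mean(resid(fit))

# all.equal() compares within a numerical tolerance (about 1.5e-8 by default)
is_zero <- isTRUE(all.equal(res_mean, 0))

print(res_mean)
print(is_zero)
```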

Homoscedasticity of Residuals

plot(bmi.hat, bmi.resid, xlab = "fitted values", ylab = "residuals")
abline(h = 0, col = "red")

This does not show homoscedasticity.
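
Because M1 has a single binary predictor, its fitted values take only two values, so the residual plot consists of two vertical strips. A possible numerical check, sketched here on simulated data (the `toy` data frame is hypothetical), is to compare the residual spread within each group and run Bartlett's test for equal variances.

```r
# Simulated stand-in data with equal error variance in both groups
set.seed(2)
toy <- data.frame(
  Gender = factor(rep(c("Male", "Female"), each = 100),
                  levels = c("Male", "Female"))
)
toy$BMI <- 25 + 1 * (toy$Gender == "Female") + rnorm(200, sd = 7)

fit <- lm(BMI ~ Gender, data = toy)

# Only two distinct fitted values: one per gender
n_fitted <- length(unique(round(fitted(fit), 8)))

# Residual standard deviation within each group
spread_by_group <- tapply(resid(fit), toy$Gender, sd)

# Bartlett's test of equal variances across the two groups
bt <- bartlett.test(x = resid(fit), g = toy$Gender)

print(n_fitted)
print(spread_by_group)
print(bt$p.value)
```

Bartlett's test is sensitive to non-normal data, so with skewed residuals it is better treated as a rough guide than as a verdict.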

Normality of residuals

qqnorm(bmi.resid)
qqline(bmi.resid, col = "red")

Task 4

My independent variable will be annual family income. I think that annual family income will have a negative relationship with BMI, the dependent variable, as individuals with higher family incomes will have more access to nutritious foods.

# Keep BMI, family income, and gender; retain income codes up to 15
data2 <- NHANES %>%
  select(bmxbmi, indfmin2, riagendr) %>%
  filter(indfmin2 <= 15)
data2$riagendr <- factor(data2$riagendr)
data2 <- na.omit(data2)  # drop rows with any missing value
colnames(data2) <- c("BMI", "Family_Income", "Gender")
levels(data2$Gender) <- c("Male", "Female")
head(data2)

##    BMI Family_Income Gender
## 1 23.3            14   Male
## 2 14.2             4 Female
## 3 17.3            15   Male
## 4 23.2             8 Female
## 5 27.2             4 Female
## 6 15.4            14   Male

Task 5

M2 = lm(BMI ~ Family_Income + Gender, data = data2)
summary(M2)

## 
## Call:
## lm(formula = BMI ~ Family_Income + Gender, data = data2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.836  -6.054  -0.798   4.503  55.764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   25.58828    0.19293 132.632  < 2e-16 ***
## Family_Income -0.08560    0.01898  -4.511 6.55e-06 ***
## GenderFemale   0.91870    0.17088   5.376 7.81e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.743 on 8211 degrees of freedom
## Multiple R-squared:  0.006055,   Adjusted R-squared:  0.005813 
## F-statistic: 25.01 on 2 and 8211 DF,  p-value: 1.483e-11

anova(M2)

## Analysis of Variance Table
## 
## Response: BMI
##                 Df Sum Sq Mean Sq F value    Pr(>F)    
## Family_Income    1   1266 1265.92  21.117 4.386e-06 ***
## Gender           1   1733 1732.75  28.904 7.815e-08 ***
## Residuals     8211 492238   59.95                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

bmi2.hat = fitted(M2)
bmi2.resid = resid(M2)

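For every 1-point increase in the family income scale, BMI decreases by about 0.09 units, holding gender constant. Females on average have BMIs about 0.92 units higher than males at the same income level. The coefficient for female gender increased from 0.86 in M1 to 0.92, a change of about 0.06, after including family income in the model. Note that M1 was fitted to the slightly larger sample in data1.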

Task 6

Is the mean of the residuals 0?

mean(bmi2.resid)

## [1] -1.996288e-16

The mean of the residuals is a very small negative number, near 0.

Homoscedasticity of Residuals

plot(bmi2.hat, bmi2.resid, xlab = "fitted values", ylab = "residuals")
abline(h = 0, col = "red")

Normality of residuals

qqnorm(bmi2.resid)
qqline(bmi2.resid, col = "red")

Task 7

M1 = lm(BMI~ I(Gender == "Female"), data = data2) #make sure from the same dataset
anova(M1, M2)

## Analysis of Variance Table
## 
## Model 1: BMI ~ I(Gender == "Female")
## Model 2: BMI ~ Family_Income + Gender
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
## 1   8212 493457                                 
## 2   8211 492238  1    1219.8 20.347 6.55e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

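M2 provides a significantly better fit than M1 (F = 20.35, p = 6.55e-06), although the reduction in the residual sum of squares is small.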
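
The F statistic in this table can be reproduced by hand from the printed values, which may help show what `anova(M1, M2)` is testing. The sketch below uses the rounded numbers from the output above, so it matches the table only to about three decimal places.

```r
# Values copied from the anova(M1, M2) table above (rounded as printed)
rss_full <- 492238   # residual sum of squares of M2
df_full  <- 8211     # residual degrees of freedom of M2
ss_added <- 1219.8   # drop in RSS from adding Family_Income ("Sum of Sq")
df_added <- 1        # one extra parameter in M2

# Partial F = (drop in RSS per added parameter) / (residual mean square of M2)
f_stat <- (ss_added / df_added) / (rss_full / df_full)

# Upper-tail probability of the F distribution
p_value <- pf(f_stat, df_added, df_full, lower.tail = FALSE)

print(f_stat)
print(p_value)
```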

Task 8: Including age

# Rebuild data2 with age added as a fourth column
data2 <- NHANES %>%
  select(bmxbmi, indfmin2, riagendr, ridageyr) %>%
  filter(indfmin2 <= 15)
data2$riagendr <- factor(data2$riagendr)
data2 <- na.omit(data2)  # drop rows with any missing value
colnames(data2) <- c("BMI", "Family_Income", "Gender", "Age")
levels(data2$Gender) <- c("Male", "Female")
head(data2)

##    BMI Family_Income Gender Age
## 1 23.3            14   Male  22
## 2 14.2             4 Female   3
## 3 17.3            15   Male  14
## 4 23.2             8 Female  44
## 5 27.2             4 Female  14
## 6 15.4            14   Male   6

M3 = lm(BMI ~ Family_Income + Gender+ Age, data = data2)
summary(M3)

## 
## Call:
## lm(formula = BMI ~ Family_Income + Gender + Age, data = data2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.695  -4.571  -1.497   3.217  55.593 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   20.250000   0.195539 103.560  < 2e-16 ***
## Family_Income -0.089041   0.016425  -5.421 6.09e-08 ***
## GenderFemale   0.750324   0.147929   5.072 4.02e-07 ***
## Age            0.163840   0.003123  52.459  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.701 on 8210 degrees of freedom
## Multiple R-squared:  0.2556, Adjusted R-squared:  0.2553 
## F-statistic: 939.6 on 3 and 8210 DF,  p-value: < 2.2e-16

anova(M3)

## Analysis of Variance Table
## 
## Response: BMI
##                 Df Sum Sq Mean Sq  F value    Pr(>F)    
## Family_Income    1   1266    1266   28.192 1.128e-07 ***
## Gender           1   1733    1733   38.588 5.489e-10 ***
## Age              1 123575  123575 2751.970 < 2.2e-16 ***
## Residuals     8210 368663      45                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

bmi3.hat = fitted(M3)
bmi3.resid = resid(M3)

Task 9

After adjusting for age in M3, the coefficient for family income changed only slightly (from -0.086 to -0.089), and the coefficient for gender decreased by about 0.17 (from 0.92 to 0.75). For every 1-point increase in the income scale, BMI decreases by about 0.09. Females tend to have a BMI that is 0.75 higher on average than males. For every 1-year increase in age, BMI increases by about 0.16. The R-squared statistic is much higher in this model, at about 0.26, than in M2 (0.006). Because it is closer to 1, it shows that this model is a better fit.
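
To make the M3 coefficients concrete, the sketch below plugs one row from the `head(data2)` output above (row 4: BMI 23.2, income code 8, female, age 44) into the fitted equation. The helper `predict_bmi` is a hypothetical name introduced only for this illustration, and the coefficients are copied from the printed summary.

```r
# Coefficients copied from summary(M3)
b <- c(intercept = 20.250000, income = -0.089041,
       female = 0.750324, age = 0.163840)

# Hypothetical helper: predicted BMI for a given profile (female is 0 or 1)
predict_bmi <- function(income, female, age) {
  unname(b["intercept"] + b["income"] * income +
           b["female"] * female + b["age"] * age)
}

# Row 4 of head(data2): income code 8, female, age 44, observed BMI 23.2
predicted <- predict_bmi(income = 8, female = 1, age = 44)
residual  <- 23.2 - predicted

print(predicted)
print(residual)
```

With the actual model object available, `predict(M3, newdata = ...)` would give the same kind of result without copying coefficients by hand.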

anova(M2, M3)

## Analysis of Variance Table
## 
## Model 1: BMI ~ Family_Income + Gender
## Model 2: BMI ~ Family_Income + Gender + Age
##   Res.Df    RSS Df Sum of Sq    F    Pr(>F)    
## 1   8211 492238                                
## 2   8210 368663  1    123575 2752 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

M3 provides a significantly better fit than M2 (F = 2752, p < 2.2e-16).

Task 10

#library(ggplot2)
M4  = lm(BMI~Gender+ Family_Income + Age + Family_Income:Age, data = data2)
summary(M4)

## 
## Call:
## lm(formula = BMI ~ Gender + Family_Income + Age + Family_Income:Age, 
##     data = data2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.721  -4.564  -1.503   3.187  55.525 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       20.5664934  0.2627596  78.271  < 2e-16 ***
## GenderFemale       0.7556628  0.1479385   5.108 3.33e-07 ***
## Family_Income     -0.1311193  0.0285387  -4.594 4.40e-06 ***
## Age                0.1540737  0.0062526  24.641  < 2e-16 ***
## Family_Income:Age  0.0012849  0.0007127   1.803   0.0714 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.7 on 8209 degrees of freedom
## Multiple R-squared:  0.2559, Adjusted R-squared:  0.2555 
## F-statistic: 705.7 on 4 and 8209 DF,  p-value: < 2.2e-16

bmi4.hat = fitted(M4)
bmi4.resid = resid(M4)

Task 11

summary(M3)

## 
## Call:
## lm(formula = BMI ~ Family_Income + Gender + Age, data = data2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.695  -4.571  -1.497   3.217  55.593 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   20.250000   0.195539 103.560  < 2e-16 ***
## Family_Income -0.089041   0.016425  -5.421 6.09e-08 ***
## GenderFemale   0.750324   0.147929   5.072 4.02e-07 ***
## Age            0.163840   0.003123  52.459  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.701 on 8210 degrees of freedom
## Multiple R-squared:  0.2556, Adjusted R-squared:  0.2553 
## F-statistic: 939.6 on 3 and 8210 DF,  p-value: < 2.2e-16

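The coefficients for gender and age did not change by much after including the interaction between age and family income in the model. The family income coefficient moved from -0.089 in M3 to -0.131 in M4, but in M4 it represents the income effect at age 0, so the two values are not directly comparable.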
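
In a model with a `Family_Income:Age` term, the effect of income on BMI is no longer a single number: it depends on age. The sketch below uses the M4 coefficients printed above to compute the income slope at a few ages. The function name `income_slope` and the chosen ages are assumptions made for this illustration.

```r
# Coefficients copied from summary(M4)
b_income      <- -0.1311193   # income slope when Age = 0
b_interaction <-  0.0012849   # change in the income slope per year of age

# Hypothetical helper: income slope implied by M4 at a given age
income_slope <- function(age) b_income + b_interaction * age

ages   <- c(0, 20, 40, 60)
slopes <- income_slope(ages)
print(data.frame(age = ages, income_slope = slopes))

# Age at which the M4 income slope equals the M3 estimate of -0.089041
age_match <- (-0.089041 - b_income) / b_interaction
print(age_match)
```

The income slope stays negative across these ages but shrinks toward zero as age increases, which is one way to read the small positive interaction estimate.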

mean(abs(data2$BMI - bmi3.hat)/data2$BMI)

## [1] 0.2034949

mean(abs(data2$BMI - bmi4.hat)/data2$BMI)

## [1] 0.203372

mean((data2$BMI - bmi3.hat)^2)

## [1] 44.88226

mean((data2$BMI - bmi4.hat)^2)

## [1] 44.8645

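The mean absolute percentage errors and mean squared errors are very close, as are the R-squared statistics (0.2556 for M3 and 0.2559 for M4). The errors for M4 are slightly lower and its R-squared is slightly higher, so M4 fits marginally better in sample, but the difference is very small.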
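
The two error measures used above can be wrapped in small helper functions. The sketch below defines them (the names `mape` and `mse` are chosen for this illustration) and uses the built-in `mtcars` data as a stand-in to show that adding a term to a nested least-squares model cannot increase the in-sample MSE, which is why a slightly lower error alone is weak evidence for the larger model.

```r
# Hypothetical helpers mirroring the calculations above
mape <- function(actual, predicted) mean(abs(actual - predicted) / actual)
mse  <- function(actual, predicted) mean((actual - predicted)^2)

# Nested models on a built-in dataset (stand-in for M3 and M4)
small <- lm(mpg ~ wt + hp, data = mtcars)
large <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)

mse_small <- mse(mtcars$mpg, fitted(small))
mse_large <- mse(mtcars$mpg, fitted(large))
print(c(small = mse_small, large = mse_large))

# AIC penalizes the extra parameter, unlike in-sample MSE
aic_values <- AIC(small, large)
print(aic_values)

# In-sample MSE is RSS / n; for M3, n = 8210 residual df + 4 coefficients
mse_m3 <- 368663 / 8214
print(mse_m3)
```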

anova(M3,M4)

## Analysis of Variance Table
## 
## Model 1: BMI ~ Family_Income + Gender + Age
## Model 2: BMI ~ Gender + Family_Income + Age + Family_Income:Age
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1   8210 368663                              
## 2   8209 368517  1    145.91 3.2503 0.07144 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interaction term is not significant at the 0.05 level (F = 3.25, p = 0.071), so M4 does not fit significantly better than M3.

Task 12

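M4 has the lowest in-sample error, followed closely by M3, but the interaction term is not significant at the 0.05 level (p = 0.071), so M3 is the more parsimonious choice. Gender had the largest coefficient (about 0.75), but the coefficients are on different scales: gender is a 0/1 indicator, family income is a coded scale with values up to 15, and age is measured in years. Age had by far the most impact on BMI, raising R-squared from 0.006 in M2 to about 0.26 in M3. Age had a positive relationship with BMI, while family income had a negative relationship with BMI. The interaction of age and family income had a very small impact on BMI.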
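
One rough way to compare how much each predictor contributes is to look at its share of the total sum of squares in the `anova(M3)` table above. The sketch below uses the rounded values printed there. These are sequential sums of squares, so the shares depend on the order in which terms enter the model and should be read as a guide only.

```r
# Sums of squares copied from the anova(M3) table (rounded as printed)
ss <- c(Family_Income = 1266, Gender = 1733, Age = 123575, Residuals = 368663)

# Share of the total sum of squares attributed to each row
share <- ss / sum(ss)
print(round(share, 4))

# The predictor shares add up to the model's R-squared (about 0.2556)
r_squared <- sum(share[c("Family_Income", "Gender", "Age")])
print(r_squared)
```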

DATA 306 Final

Nazija Akter

12/18/2020
