Overview

The National Health and Nutrition Examination Survey (NHANES) is a program of studies designed to assess the health and nutritional status of adults and children in the United States. The program began in the early 1960s and has been conducted as a series of surveys focusing on different population groups or health topics. In 1999, it became a continuous program whose focus shifts among a variety of health and nutrition measurements to meet emerging needs.

The NHANES interview component covers demographic, socioeconomic, dietary, and health-related questions. The examination component consists of medical, dental, and physiological measurements, as well as laboratory tests administered by highly trained medical personnel.

In this notebook we analyze the NHANES survey data for the 2017-2018 cycle. More information about the survey is available at this link. The dataset contains responses from 9,254 persons across 46 variables.

In order to keep things simple, we will use 3 of these variables: one dependent variable and two independent variables. The variables are:

- INDFMPIR: ratio of family income to poverty
- DMDFMSIZ: total number of people in the family
- DMDEDUC2: education level of adults aged 20 and over

INDFMPIR is the dependent variable, while DMDFMSIZ (family size) and DMDEDUC2 (education) are the independent variables.

# uncomment to install the tidyverse and caret libraries
# install.packages(c('tidyverse', 'caret'))
# Importing library
library(tidyverse)
# importing survey data
survey <- read.csv('NHANES-2017-2018 Demographics Data.csv')
# converting variable names to uppercase
names(survey) <- str_to_upper(names(survey))

# number of rows and columns
dim(survey)
## [1] 9254   46
# selecting variables
data <- survey %>%
  select(INDFMPIR, DMDFMSIZ, DMDEDUC2)


# independent variables
predictors  <- names(data)[!names(data) %in% 'INDFMPIR']
# renaming variable names
names(data) <- c('Y', 'familysize', 'education')

# replacing 7 and 9 in education to NA to indicate missing values
# 7 = Refused to answer, 9 = Don't know
data[data$education %in% c(7,9), 'education'] <- NA

# converting education to a categorical (factor) data type
data$education <- factor(data$education)
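
As a quick sanity check (an optional step, not part of the original analysis), we can confirm that the 7 and 9 codes are now missing values and inspect the category counts:

# tabulating education levels, with NAs counted as their own category
table(data$education, useNA = 'ifany')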

Problem 1

# fitting a linear regression model
Model1 <- lm(Y ~ familysize, data=data)

# summary of model
summary(Model1)
## 
## Call:
## lm(formula = Y ~ familysize, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7036 -1.2966 -0.4266  1.3005  3.0705 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.83266    0.04031   70.27   <2e-16 ***
## familysize  -0.12902    0.01022  -12.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.585 on 8021 degrees of freedom
##   (1231 observations deleted due to missingness)
## Multiple R-squared:  0.01948,    Adjusted R-squared:  0.01935 
## F-statistic: 159.3 on 1 and 8021 DF,  p-value: < 2.2e-16

\(Y = 2.833 - 0.129 \times \text{familysize}\)

Here there is a negative relationship between family size and the family income-to-poverty ratio: for each one-unit increase in family size, the income-to-poverty ratio decreases by about 0.13 units. This effect of family size on the dependent variable is statistically significant even at the 0.1% level (p < 2e-16). This makes intuitive sense: as the number of people in a family increases, there are more mouths to feed, and families with lower incomes may find it harder to cope.
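
To make the fitted equation concrete, we can compute the model's predicted income-to-poverty ratio over a range of family sizes (an illustrative sketch, not part of the original output):

# predicted income-to-poverty ratio for family sizes 1 through 7
predict(Model1, newdata = data.frame(familysize = 1:7))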

# linear regression with education as a categorical predictor
Model2 <- lm(Y ~ education, data=data)

summary(Model2)
## 
## Call:
## lm(formula = Y ~ education, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7600 -1.0473 -0.1728  1.2400  3.5104 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.48961    0.07396  20.142  < 2e-16 ***
## education2   0.17764    0.09593   1.852   0.0641 .  
## education3   0.58641    0.08478   6.917 5.22e-12 ***
## education4   1.07596    0.08211  13.104  < 2e-16 ***
## education5   2.27037    0.08472  26.800  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.409 on 4769 degrees of freedom
##   (4480 observations deleted due to missingness)
## Multiple R-squared:  0.2273, Adjusted R-squared:  0.2267 
## F-statistic: 350.7 on 4 and 4769 DF,  p-value: < 2.2e-16

\(Y = 1.490 + 0.178 \times \text{education2} + 0.586 \times \text{education3} + 1.076 \times \text{education4} + 2.270 \times \text{education5}\)

Here, the higher the educational level a participant attains, the higher the expected family income-to-poverty ratio. For this model, education was converted to a categorical variable and dummy-coded, where 0 indicates the absence of a category/group and 1 its presence. One category (education below the 9th grade) was dropped to serve as the reference group against which the other categories are compared.
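
To see the dummy coding concretely, we can inspect the design matrix that lm() builds internally (a quick illustration, not part of the original output):

# the five distinct rows of the design matrix for the education factor
unique(model.matrix(~ education, data = subset(data, !is.na(education))))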

From the results of Model 2, we see that for participants educated up to the 12th grade but with no diploma (education 2), the income-to-poverty ratio is expected to be 0.18 units higher than for participants with less than a 9th-grade education. This effect is not statistically significant at the 5% level (p = 0.064). For participants who obtained a high school diploma or its equivalent (education 3), the income-to-poverty ratio is expected to be 0.59 units higher than that of the reference group.

Furthermore, for participants with some college or an AA degree (education 4), the dependent variable is expected to be about 1.08 units higher than the reference group, and for those with a college degree or above (education 5), about 2.27 units higher.

When we compare the performance of Model 1 and Model 2 based on their R-squared values, Model 2 performs better: it accounts for about 22.7% of the variability in the dependent variable, while Model 1 accounts for only about 1.9%. The adjusted R-squared values lead to the same conclusion. By these measures, Model 2 fits the data better than Model 1.
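
The R-squared values quoted above can also be extracted programmatically; a small convenience sketch, not part of the original analysis:

# multiple and adjusted R-squared for both models
c(Model1 = summary(Model1)$r.squared, Model2 = summary(Model2)$r.squared)
c(Model1 = summary(Model1)$adj.r.squared, Model2 = summary(Model2)$adj.r.squared)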

iii

# get sampled data for prediction

set.seed(0) # for reproducibility

size <- 20 # sample size

# new sampled data
new_data <- data %>%
  # drop rows with missing values
  drop_na() %>%
  sample_n(size)
# predictions for model1

predictions1 <- predict(Model1, new_data[, c('Y', 'familysize')]) %>% as_tibble()

# predicted values 
prediction.vals1 <- predict(object = Model1, 
                            newdata = new_data[, c('Y', 'familysize')], 
                            interval = 'prediction', 
                            level = .95) %>% as_tibble()

# confidence values
confidence.vals1 <- predict(object = Model1, 
                            newdata = new_data[, c('Y', 'familysize')], 
                            interval = 'confidence', 
                            level = .95) %>% as_tibble()

# merging to predictions
predictions1 <- predictions1 %>% 
  mutate(pred.lower = prediction.vals1$lwr, pred.upper = prediction.vals1$upr, 
         conf.lower = confidence.vals1$lwr, conf.upper = confidence.vals1$upr)
# Confidence Intervals for Model 1
predictions1 %>% 
  ggplot() + 
  geom_point(aes(x=1:size, y=value), size=2, color='steelblue') +
  geom_errorbar(aes(x=1:size, ymin=conf.lower, ymax=conf.upper), 
                alpha=0.7, width=0.4) +
  theme_light() + 
  theme(plot.title = element_text(face='bold'),
        panel.grid = element_blank(), 
        legend.position = 'top', 
        legend.title  = element_blank()) + 
  labs(x='Number of Samples', y='Predictions', 
       title = 'Model1 Predictions with 95% Confidence Intervals')

# Prediction Intervals for Model 1
predictions1 %>% 
  ggplot() + 
  geom_point(aes(x=1:size, y=value), size=2, color='steelblue') +
  geom_errorbar(aes(x=1:size, ymin=pred.lower, ymax=pred.upper), alpha=0.4) +
  theme_light() + 
  theme(plot.title = element_text(face='bold'),
        panel.grid = element_blank(), 
        legend.position = 'top', 
        legend.title  = element_blank()) + 
  labs(x='Number of Samples', y='Predictions', 
       title = 'Model1 Predictions with 95% Prediction Intervals')

# predictions for model2

predictions2 <- predict(Model2, new_data[, c('Y', 'education')]) %>% as_tibble()

# predicted values 
prediction.vals2 <- predict(object = Model2, 
                            newdata = new_data[, c('Y', 'education')], 
                            interval = 'prediction', 
                            level = .95) %>% as_tibble()

# confidence values
confidence.vals2 <- predict(object = Model2, 
                            newdata = new_data[, c('Y', 'education')], 
                            interval = 'confidence', 
                            level = .95) %>% as_tibble()

# merging to predictions
predictions2 <- predictions2 %>% 
  mutate(pred.lower = prediction.vals2$lwr, pred.upper = prediction.vals2$upr, 
         conf.lower = confidence.vals2$lwr, conf.upper = confidence.vals2$upr)
# printing predictions for Model 1 and Model 2

# model1 predictions
predictions1
## # A tibble: 20 × 5
##    value pred.lower pred.upper conf.lower conf.upper
##    <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
##  1  2.57     -0.532       5.68       2.53       2.62
##  2  2.57     -0.532       5.68       2.53       2.62
##  3  2.57     -0.532       5.68       2.53       2.62
##  4  2.70     -0.403       5.81       2.64       2.77
##  5  2.57     -0.532       5.68       2.53       2.62
##  6  2.57     -0.532       5.68       2.53       2.62
##  7  2.32     -0.790       5.42       2.28       2.35
##  8  2.45     -0.661       5.55       2.41       2.48
##  9  2.45     -0.661       5.55       2.41       2.48
## 10  2.45     -0.661       5.55       2.41       2.48
## 11  2.19     -0.919       5.29       2.14       2.23
## 12  2.57     -0.532       5.68       2.53       2.62
## 13  2.19     -0.919       5.29       2.14       2.23
## 14  2.06     -1.05        5.17       2.00       2.12
## 15  2.57     -0.532       5.68       2.53       2.62
## 16  2.32     -0.790       5.42       2.28       2.35
## 17  2.57     -0.532       5.68       2.53       2.62
## 18  2.57     -0.532       5.68       2.53       2.62
## 19  2.57     -0.532       5.68       2.53       2.62
## 20  2.45     -0.661       5.55       2.41       2.48
# model 2 predictions
predictions2
## # A tibble: 20 × 5
##    value pred.lower pred.upper conf.lower conf.upper
##    <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
##  1  3.76      0.996       6.52       3.68       3.84
##  2  3.76      0.996       6.52       3.68       3.84
##  3  2.08     -0.688       4.84       1.99       2.16
##  4  1.67     -1.10        4.43       1.55       1.79
##  5  2.57     -0.198       5.33       2.50       2.64
##  6  3.76      0.996       6.52       3.68       3.84
##  7  2.57     -0.198       5.33       2.50       2.64
##  8  3.76      0.996       6.52       3.68       3.84
##  9  2.08     -0.688       4.84       1.99       2.16
## 10  2.57     -0.198       5.33       2.50       2.64
## 11  2.57     -0.198       5.33       2.50       2.64
## 12  2.08     -0.688       4.84       1.99       2.16
## 13  1.67     -1.10        4.43       1.55       1.79
## 14  2.08     -0.688       4.84       1.99       2.16
## 15  2.08     -0.688       4.84       1.99       2.16
## 16  2.57     -0.198       5.33       2.50       2.64
## 17  2.08     -0.688       4.84       1.99       2.16
## 18  2.57     -0.198       5.33       2.50       2.64
## 19  1.49     -1.28        4.26       1.34       1.63
## 20  1.49     -1.28        4.26       1.34       1.63
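
Note that in both tables the prediction intervals are much wider than the confidence intervals: a confidence interval reflects uncertainty about the mean response only, while a prediction interval also includes the residual variance of a single new observation. A quick check (not part of the original output) confirms this:

# every prediction interval should be wider than its confidence interval
all((predictions2$pred.upper - predictions2$pred.lower) >
      (predictions2$conf.upper - predictions2$conf.lower))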
# 95% confidence intervals for Model2 predictions
predictions2 %>% 
  ggplot() + 
  geom_col(aes(x=1:size, value), fill='darkred', alpha=0.7, width=0.8, color='black') + 
  geom_errorbar(aes(x=1:size, y=value, ymin=conf.lower, ymax=conf.upper), 
                alpha=0.7,width=0.4) + 
  theme_light() + 
  theme(panel.grid = element_blank(), 
        plot.title = element_text(face='bold')) + 
  labs(title='Model2 Predictions with 95% Confidence Intervals', 
       y='Predictions', x='Number of samples')

# 95% Prediction intervals for Model2 predictions
predictions2 %>% 
  ggplot() + 
  geom_col(aes(x=1:size, value), fill='darkred', alpha=0.7, width=0.8) + 
  geom_errorbar(aes(x=1:size, y=value, ymin=pred.lower, ymax=pred.upper), 
                alpha=0.6,width=0.4) + 
  theme_light() + 
  theme(panel.grid = element_blank(), 
        plot.title = element_text(face='bold')) + 
  labs(title='Model2 Predictions with 95% Prediction Intervals', 
       y='Predictions', x='Number of samples')

Problem 2

# using the caret library
library(caret)
# 2-fold cross validation

set.seed(0)
Model1 <- train(Y ~ familysize, 
                data=data %>% filter(!is.na(familysize), !is.na(Y)), 
                method='lm', 
                trControl = trainControl(number = 2, method='cv')
)

summary(Model1)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7036 -1.2966 -0.4266  1.3005  3.0705 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.83266    0.04031   70.27   <2e-16 ***
## familysize  -0.12902    0.01022  -12.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.585 on 8021 degrees of freedom
## Multiple R-squared:  0.01948,    Adjusted R-squared:  0.01935 
## F-statistic: 159.3 on 1 and 8021 DF,  p-value: < 2.2e-16
# 2-fold cross validation

set.seed(0)
Model2 <- train(Y ~ education, 
                data=data %>% filter(!is.na(education), !is.na(Y)), 
                method='lm', 
                trControl = trainControl(number = 2, method='cv')
)


summary(Model2)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7600 -1.0473 -0.1728  1.2400  3.5104 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.48961    0.07396  20.142  < 2e-16 ***
## education2   0.17764    0.09593   1.852   0.0641 .  
## education3   0.58641    0.08478   6.917 5.22e-12 ***
## education4   1.07596    0.08211  13.104  < 2e-16 ***
## education5   2.27037    0.08472  26.800  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.409 on 4769 degrees of freedom
## Multiple R-squared:  0.2273, Adjusted R-squared:  0.2267 
## F-statistic: 350.7 on 4 and 4769 DF,  p-value: < 2.2e-16
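
Before averaging, caret also keeps the per-fold metrics; these can be inspected through the resample element of each train object (shown here as an optional check):

# per-fold RMSE, R-squared and MAE for each model
Model1$resample
Model2$resample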
# model 1 and 2 performance results
as.data.frame(
  rbind('Model1' = Model1$results[, c('RMSE', 'Rsquared', 'MAE')], 
        'Model2' = Model2$results[, c('RMSE', 'Rsquared', 'MAE')])
  )
##            RMSE   Rsquared      MAE
## Model1 1.584587 0.01949552 1.364497
## Model2 1.410190 0.22544892 1.173220

The table above shows the average performance scores for Models 1 and 2 after 2-fold cross-validation, carried out with the caret library.

From the table, Model 2 performs better on all three metrics. Based on the root mean squared error (RMSE), predictions deviate from the actual values by about 1.58 units on average for Model 1, compared with about 1.41 units for Model 2. By R-squared, Model 2 is again the better model, explaining about 22.5% of the variance versus roughly 2% for Model 1. Finally, by the mean absolute error (MAE), the average of the absolute differences between the actual and predicted values of Y, Model 2's predictions differ by about 1.17 units on average, compared with 1.36 for Model 1.
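
For reference, the two error metrics can be computed directly from actual and predicted values; a minimal sketch, assuming actual and predicted are numeric vectors of equal length:

# root mean squared error: penalizes large errors more heavily
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
# mean absolute error: the average magnitude of the errors
mae <- function(actual, predicted) mean(abs(actual - predicted))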