Overview

The National Health and Nutritional Examination survey (NHANES) is a program of studies designed to assess the health and nutritional status of adults and children in the United States. This program began in the early 1960s and has been conducted as a series of surveys focusing on different population groups of health topics. In 1999, this study become a continuous program with focus changing based on a variety of health and nutritional measurements in order to meet emerging needs.

Interviews of the NHANES program consist of questions that are demographics, socioeconomic, dietary, and health-related. The examination component consists of medical, dental, and physiological measurements, as well as laboratory tests that are administered by highly trained medical personnel.

In this notebook we will be analyzing the survey data conducted the NHANES for 2017-2018 season. More information about the survey can be gotten from this link. Dataset contains survey data collected from 9,254 persons with questions up to 46 in number.

In order to keep it simple, we will be using 6 variables consisting of 5 independent variables and one dependent variable. The variables include:

INDFMPIR: Family Income to poverty ratio
RIDAGEYR: Age of participant in years.
RIAGENDR: Gender of participant
RIDRETH3: Race of participant
DMDFMSIZ: Family size
DMDEDUC2: Highest education level completed by participant

INDFMPIR is the dependent variable while the other 5 are the independent variables

# uncomment to install the tidyverse package
# install.packages('tidyverse')

# Importing library
library(tidyverse)

# importing survey data
survey <- read.csv('NHANES-2017-2018 Demographics Data.csv')

# converting the variable names to upper cases
names(survey) <- str_to_upper(names(survey))

# Number of rows and Columns
dim(survey)

## [1] 9254   46

Problem 1

Data Types of Independent variables

Selecting variables

# selecting variables
data <- survey %>%
  select(INDFMPIR, RIDAGEYR, RIAGENDR, 
         RIDRETH3, DMDFMSIZ, DMDEDUC2)


# independent variables
predictors  <- names(data)[!names(data) %in% 'INDFMPIR']

Independent Variables

RIDAGEYR represents the age of the participant. It is a numerical continuous variable.
RIAGENDR represents the gender of the participant which usually assumes two categories: Male and Female. This variable is a categorical nominal variable.
RIDRETH3 represents the race of the participant. It is a categorical nominal type
DMDFMSIZ is a numerical, discrete variable and represents the family size
DMDEDUC2 is a categorical variable type and represents the highest educational level attained by the participant

Problem 2

Relationship between dependent variable and independent variables

Data Preparation and cleaning

# renaming variable names
names(data) <- c('Y', 'age', 'gender', 'race', 'familysize', 'education')

# replacing 7 and 9 in education to NA to indicate missing values
# 7 = Refused to answer, 9 = Don't know
data$education[data$education %in% c(7,9)] <- NA

# Conversion of categorical variables to categorical data type
categorical_vars <- c('gender', 'race', 'education')
data[, categorical_vars] <- lapply(data[categorical_vars], as.factor)

# fitting a linear regression model
Model1 <- lm(Y ~ ., data=data, )

# summary of model
summary(Model1)

## 
## Call:
## lm(formula = Y ~ ., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1284 -1.0242 -0.1691  1.0534  3.7226 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.216223   0.130171   9.343  < 2e-16 ***
## age          0.007429   0.001238   5.999 2.13e-09 ***
## gender2     -0.172061   0.040447  -4.254 2.14e-05 ***
## race2       -0.149272   0.089537  -1.667 0.095549 .  
## race3       -0.037206   0.070686  -0.526 0.598664    
## race4       -0.267549   0.074498  -3.591 0.000332 ***
## race6        0.224105   0.082830   2.706 0.006842 ** 
## race7       -0.318729   0.108030  -2.950 0.003189 ** 
## familysize  -0.010763   0.013661  -0.788 0.430817    
## education2   0.253833   0.097326   2.608 0.009134 ** 
## education3   0.686155   0.088631   7.742 1.19e-14 ***
## education4   1.207475   0.087220  13.844  < 2e-16 ***
## education5   2.298127   0.090588  25.369  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.393 on 4761 degrees of freedom
##   (4480 observations deleted due to missingness)
## Multiple R-squared:  0.2466, Adjusted R-squared:  0.2447 
## F-statistic: 129.9 on 12 and 4761 DF,  p-value: < 2.2e-16

Multiple Linear Regression:

\(Y\) = \(1.216\) + \(0.007 * age\) - \(0.172 * gender2\) - \(0.149 * race2\) - \(0.037 * race3\) - \(0.268 * race4\) + \(0.224 * race6\) - \(0.319 * race7\) - \(0.019 * familysize\) + \(0.254 * education2\) + \(0.686 * education3\) + \(1.207 * education4\) + \(2.298 * education5\)

2a:

Before a linear regression was fit on the dataset, categorical variables were converted to numerical variables by dummy encoding, where 1 represents the presence of a category and 0 its absence.

From Model 1, age, race6 (non-Hispanic Asian), all education types have positive relationships with the dependent variable. All other variables, had negative relationships with the dependent variable. For age, there’s a positive relationship between it and family-income-to-poverty ratio (dependent variable). This relationship is rather weak, where a one-unit change in the age of a participant, the family income to poverty ratio increases by about 0.007 units on average. This is statistically significant at 0.1-5% level of significance. Similarly, for family size, the effect it has on the dependent variable is weak but negative, where a one-unit increase in family size decreases the family-income-to-poverty ratio by about 0.01 units but this relationship is rather not statistically significant.

For all races except that of the non-Hispanic Asian (race6), there’s a negative relationship with the dependent variable. Because it is categorical variable, one category (Mexican American) was dropped and used as a reference. By comparing the coefficients of each race to the reference, we see that for other Hispanics (race2), it is expected that the family income to poverty ratio will, on average, decrease by 0.15 units compared to Mexican Americans. This is only statistically significant at 5% significance level. However, for Non-Hispanic White (race3), we expect that the dependent variable will decrease by 0.04 units on average. This relationship is not statistically significant. For Non-Hispanic Black (race4), this is expected to decrease by 0.27 units on average. For other races, which includes the multi-racial, this is expected to decrease further by 0.32 units on average. On the other hand, for the Non-Hispanic Asian, the family income to poverty ratio is thought to increase by 0.22 units. For the Non-Hispanic Black, the effect it has on the dependent variable is statistically significant at 0.1-5% level of significance while for both the Non-Hispanic Asian and other races, including the multi-Racial, their effect on the family income to poverty ratio is statistically significant at 1-5% level of significance.

The education of a participant has a positive effect on the family income to poverty ratio. However, this effect becomes stronger with higher education. Their effects are statistically significant from 0.1-5% significance level. As mentioned in problem1, this variable is categorical, therefore, we will compare their effects on the dependent variable by using participants with less than 9th grade (education1) as reference. Participants who moved a step further in their level of education to finish up to the 12th grade (education2), the family income to poverty ratio is expected to increase by 0.25 units. However, if they go a step higher to getting a high school degree (education3), the average family income to poverty ratio is expected to increase by about 0.69 units. Whereas for participants with some college or AA degree (education4), their family income to poverty ratio is expected to move up further to 1.21 units on average, while if the participant had at least a college degree or higher (education 5), the dependent variable increase by about 2.3 units on average.

Finally, for female (gender2) participants, the family income-to-poverty ratio is expected to decrease by 0.17 units on average compared to their male counterparts

2b:

From the model results, the performance of the model to accurately account for the variability in the dependent variable is rather poor. From the R-squared value, we see that the independent variables are only able to account for about 24.7% of the dependent variable. But from the adjusted R-squared we see that this decreases slightly to 24.5%.

Problem 3

# linear regression with interaction terms
Model2 <- lm(Y ~ . + age*familysize, data=data)

summary(Model2)

## 
## Call:
## lm(formula = Y ~ . + age * familysize, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1949 -1.0183 -0.1789  1.0493  3.7949 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.716886   0.160308  10.710  < 2e-16 ***
## age            -0.003421   0.002384  -1.435 0.151282    
## gender2        -0.164542   0.040356  -4.077 4.63e-05 ***
## race2          -0.153364   0.089285  -1.718 0.085916 .  
## race3          -0.024339   0.070525  -0.345 0.730026    
## race4          -0.270746   0.074288  -3.645 0.000271 ***
## race6           0.210405   0.082633   2.546 0.010920 *  
## race7          -0.321319   0.107722  -2.983 0.002870 ** 
## familysize     -0.204506   0.038871  -5.261 1.49e-07 ***
## education2      0.305941   0.097541   3.137 0.001720 ** 
## education3      0.738046   0.088914   8.301  < 2e-16 ***
## education4      1.243412   0.087233  14.254  < 2e-16 ***
## education5      2.327033   0.090492  25.715  < 2e-16 ***
## age:familysize  0.004135   0.000777   5.322 1.08e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.389 on 4760 degrees of freedom
##   (4480 observations deleted due to missingness)
## Multiple R-squared:  0.2511, Adjusted R-squared:  0.249 
## F-statistic: 122.8 on 13 and 4760 DF,  p-value: < 2.2e-16

Multiple Linear Regression:

\(Y\) = \(1.717\) - \(0.003 * age\) - \(0.165 * gender2\) - \(0.153 * race2\) - \(0.024 * race3\) - \(0.271 * race4\) + \(0.210 * race6\) - \(0.321 * race7\) - \(0.110 * familysize\) + \(0.306 * education2\) + \(0.738 * education3\) + \(1.243 * education4\) + \(2.327 * education5\) + \(0.004 * age * familysize\)

After adding the interaction term of age and family size, the coefficients of almost all the variables had a slight increase, while some decreased slightly. The age variable changed from positive in model 1 to negative in model2. Also, the familysize became statistically significant after adding their interaction with age.

The Adjusted R-squared of model 2 is slightly better than that for model 1. It increased from about 24.5% in model 1 to 24.9% in model 2. This is the same for the R-squared, which moved up from 24.7% to 25.1%. By this comparison, Model 2 is better in performance than Model 1.

# ANOVA between Models 1 and 2
anova(Model2, Model1)

## Analysis of Variance Table
## 
## Model 1: Y ~ age + gender + race + familysize + education + age * familysize
## Model 2: Y ~ age + gender + race + familysize + education
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   4760 9177.3                                  
## 2   4761 9231.9 -1   -54.602 28.321 1.075e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

By using the ANOVA F-test test to compare Model1 and Model2, we see that there’s a statistical significance (pvalue = \(1.075e^{-7})\)) at 0.1%-5% significance level that Model 2 is superior (better in performance) to Model 1.

# prediction
set.seed(0)

size = 20 # sample size
age.mean <- mean(data$age)
age.lwr <- mean(data$age) - sd(data$age) *1
age.upr <- mean(data$age) + sd(data$age) *1


new_data <- data.frame(
  age = sample(c(age.mean, age.lwr, age.upr), replace = T, size=size),
  gender = factor(rep(2,size)),
  race = factor(rep(4, size)),
  familysize = sample(unique(data$familysize), replace=T, size = size),
  education = factor(rep(3, size))
  )


predictions1 <- predict(object = Model2, 
                       newdata = new_data, 
                       interval = 'confidence', 
                       level = .95
                       ) %>% as_tibble()

predictions1 %>% 
  ggplot() + 
  geom_line(aes(x=1:size, y=fit), lty=1, color='steelblue') +
  geom_ribbon(aes(x=1:size, ymin=lwr, ymax=upr), alpha=0.4) +
  theme_light() + 
  theme(plot.title = element_text(face='bold'),
        panel.grid = element_blank(), 
        legend.position = 'top', 
        legend.title  = element_blank()) + 
  labs(x='Number of Samples', y='Predictions', 
       title = 'Model2 Predictions with 95% Confidence Intervals')

Problem 4

# Model 3
Model3 <- lm(Y ~ . + gender*race, data=data)

summary(Model3)

## 
## Call:
## lm(formula = Y ~ . + gender * race, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2164 -1.0220 -0.1766  1.0571  3.8260 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.227797   0.139393   8.808  < 2e-16 ***
## age            0.007430   0.001238   6.002 2.10e-09 ***
## gender2       -0.209942   0.114653  -1.831  0.06715 .  
## race2         -0.185423   0.129718  -1.429  0.15294    
## race3         -0.088081   0.098893  -0.891  0.37315    
## race4         -0.179685   0.105991  -1.695  0.09009 .  
## race6          0.099187   0.118732   0.835  0.40354    
## race7         -0.335456   0.148248  -2.263  0.02369 *  
## familysize    -0.009647   0.013662  -0.706  0.48013    
## education2     0.257259   0.097440   2.640  0.00831 ** 
## education3     0.688911   0.088878   7.751 1.11e-14 ***
## education4     1.215547   0.087472  13.896  < 2e-16 ***
## education5     2.307029   0.090892  25.382  < 2e-16 ***
## gender2:race2  0.068280   0.178509   0.382  0.70211    
## gender2:race3  0.098647   0.132417   0.745  0.45632    
## gender2:race4 -0.169518   0.142606  -1.189  0.23461    
## gender2:race6  0.226825   0.157287   1.442  0.14934    
## gender2:race7  0.028425   0.212395   0.134  0.89354    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.392 on 4756 degrees of freedom
##   (4480 observations deleted due to missingness)
## Multiple R-squared:  0.2482, Adjusted R-squared:  0.2455 
## F-statistic: 92.37 on 17 and 4756 DF,  p-value: < 2.2e-16

Multiple Linear Regression:

\(Y\) = \(1.228\) + \(0.007 * age\) - \(0.210 * gender2\) - \(0.185 * race2\) - \(0.088 * race3\) - \(0.180 * race4\) + \(0.01 * race6\) - \(0.335 * race7\) - \(0.010 * familysize\) + \(0.257 * education2\) + \(0.689 * education3\) + \(1.216 * education4\) + \(2.307 * education5\) + \(0.068 * gender2 * race2\) + \(0.099 * gender2 * race3\) - \(0.170 * gender2 * race4\) + \(0.227 * gender2 * race6\) + \(0.028 * gender2 * race7\)

Just like in Model2, there is a slight change in the variable coefficients in Model3. In Model 1, gender2 was statistically significant at 0.1-5% significance level but in Model 3, it becomes significant at 5% level of significance only. familysize remains the same in terms of statistical significance. Similarly, the coefficients of education increases as more education is acquired.

The adjusted R-squares for model 3 and model 1 slightly differ with Model 3 having a 0.1% increase in performance (24.6%), compared to 24.5% in Model 1. Same goes for the Multiple R-squared value, where in Model 1 its value was about 24.7%, while in Model 3 it is about 24.82%.

# comparing Model 3 and Model1
anova(Model3, Model1)

## Analysis of Variance Table
## 
## Model 1: Y ~ age + gender + race + familysize + education + gender * race
## Model 2: Y ~ age + gender + race + familysize + education
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1   4756 9212.4                              
## 2   4761 9231.9 -5   -19.519 2.0154 0.07329 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA F-test, we see that there’s no statistical difference in the predictions of model3 and Model1 at 0.1-5% level of significance. Hence, we will conclude that Model 3 isn’t more superior to Model1

# prediction
set.seed(0)

# get rows with education =3 and familysize = 4 (with scaled value)
new_data <- data.frame(
  age = rep(age.mean, size), 
  gender = factor(sample(unique(data$gender), replace=T, size=size)),
  race = factor(sample(unique(data$race), replace=T, size=size)),
  familysize = rep(4, size),
  education = factor(rep(3, size))
)
  

predictions2 <- predict(object = Model3, 
                       newdata = new_data, 
                       interval = 'confidence', 
                       level = .95) %>% as_tibble()

predictions2 %>% 
  ggplot() + 
  geom_col(aes(x=1:size, fit), fill='steelblue', width=0.7) + 
  geom_errorbar(aes(x=1:size, ymin=lwr, ymax=upr), alpha=0.6,width=0.5) + 
  theme_light() + 
  theme(panel.grid = element_blank(), 
        plot.title = element_text(face='bold')) + 
  labs(title='Model3 Predictions with 95% Confidence Intervals', 
       y='Predictions', x='Number of samples')

DATA306_FinalExam

Rachel Ciszewski

05/20/22

Overview

Problem 1

Problem 2

Problem 3

Problem 4