Introduction

Racial and socioeconomic inequities play a critical role in stark health disparities. Prior studies suggest that secondhand smoke (SHS) is one of the largest contributors to disease and harm to human health.

The figure below shows the many health consequences that are causally linked to exposure to secondhand smoke.


This project investigates how well different statistical and machine learning classification models can predict whether people are exposed to SHS, using demographic predictors. In this project, SHS exposure is measured by serum cotinine concentration obtained from a blood examination.

The following sections describe the pilot data and the statistical and ML models that are used in this project.

Objective

The purpose of this project is to use different statistical and machine learning classification models to predict whether non-smokers are heavily exposed to SHS, and to select the best model based on the lowest RMSE and the highest AUROC score.

Literature Review

We include several scholarly articles that describe the potential impacts of the dependent and independent variables on secondhand smoke exposure. Below is a brief annotated bibliography.

Paper One

Tobacco Smoke Exposure and Levels of Urinary Metals in the U.S. Youth and Adult Population: The National Health and Nutrition Examination Survey (NHANES) 1999–2004 (Richter PA, 2009)

  • This paper examines smoking and secondhand smoke exposure and the levels of 12 urinary metals among participants in the NHANES dataset. The authors argue that urine measurements are a useful noninvasive approach in biomonitoring research on exposure to metals and other environmental pollutants. The dataset covers the years 1999 to 2004. Many influential predictors were included in the examination, such as age, race, gender, poverty level, and cadmium and lead concentrations. The paper first uses statistical approaches to examine whether the levels of different urinary metals differ significantly between smokers and nonsmokers. It finds that some metals, such as cadmium, lead, antimony, and barium, are elevated in smokers, while others, such as mercury, beryllium, platinum, and thallium, are lower or unchanged (tungsten). An inverse relationship was also found between smoking and levels of important nutrients among the participants. The paper further finds that children, defined as those under age 12, are particularly vulnerable to secondhand smoke exposure: children with even low levels of SHS exposure have higher urine lead levels than children without SHS exposure. The results also show that older smokers have dangerously high cadmium concentrations, which puts their health at risk.

  • An important measurement of serum cotinine concentration is used when comparing smokers and nonsmokers. This measurement is very important to my capstone project since it is the key variable that I will be investigating. Knowing how to handle and interpret this variable will help me make predictions about it in my project. The paper's approaches to handling null values and classifying nonsmokers will also be helpful, since I will conduct a similar statistical analysis on the NHANES dataset.

Paper Two

Socioeconomic Inequalities in Secondhand Smoke Exposure at Home and at Work in 15 Low- and Middle-Income Countries (Nazar, 2016)

  • This paper examines socioeconomic inequalities in secondhand smoke exposure at home and at work in low- and middle-income (LMI) countries. The relative index of inequality and the slope index of inequality across the 15 participating countries are the two main measures, used within three regression-based methods, to estimate the magnitude of socioeconomic inequalities in SHS exposure. Other methods include adding interaction terms to country-specific generalized linear models to determine whether a specific demographic variable plays a critical role in inequalities in SHS exposure at home and at work. Country-specific multiple logistic regression models were also developed to estimate the relationship between socioeconomic status and SHS exposure. The paper concludes that socioeconomically disadvantaged groups suffer the most from SHS exposure at home in the majority of LMI countries: SHS exposure at home is higher among the poor and the less educated. Similarly, SHS exposure at the workplace decreases as education or wealth increases. The authors argue that implementing pro-equity tobacco control interventions and assisting socioeconomically disadvantaged groups can help reduce inequalities in SHS exposure in LMI countries.

  • This paper helps me better understand how to measure and interpret socioeconomic status. This is important because my capstone topic investigates whether low-income households are more likely to be exposed to secondhand smoke, and what their characteristics are. From the statistical methods in the paper, I learned that building interaction terms will be an important step prior to developing prediction models (a brief sketch follows this paragraph). The paper also shows that home and work are the two key settings for socioeconomic inequalities in SHS exposure, so I will pay particular attention to these variables when constructing my prediction models. Since other variables such as age, gender, education level, and poverty level also play an important role in socioeconomic inequalities in SHS exposure, the exploratory data analysis of my project should include all of these elements.
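The following is a minimal, illustrative sketch (not code from the paper) of adding an interaction term to a logistic regression in R; survey_df, shs_exposed, income, and education are hypothetical names:

# Illustrative sketch: let the effect of income on SHS exposure vary by education
fit_main <- glm(shs_exposed ~ income + education, family = binomial, data = survey_df)
fit_int  <- glm(shs_exposed ~ income * education, family = binomial, data = survey_df)

# A significant interaction suggests the income-SHS association differs by education level
anova(fit_main, fit_int, test = "Chisq")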

Paper Three

Exposure to secondhand smoke and cognitive impairment in non-smokers: national cross-sectional study with cotinine measurement (Llewellyn DJ, 2009)

  • This paper investigates the relationship between salivary cotinine concentration, a biomarker of exposure to secondhand smoke, and cognitive impairment. It is a cross-sectional analysis of non-smoking adults aged 50 or older: of the 5265 participants whose salivary cotinine levels were measured, 4809 were included in the analysis. Multivariable logistic regression models were developed to determine the cross-sectional relation between SHS exposure and cognitive impairment, adjusting for several key personal characteristics and known risk factors for cognitive impairment. Participants with cotinine concentrations in the non-smoking range (at or below 14.1 ng/ml) were then divided into four equal-sized groups. The study finds an association between SHS exposure and cognitive impairment: in this large, diverse sample of non-smoking adults, higher cotinine levels were associated with increased odds of cognitive impairment.

  • This paper is related to my capstone project because serum cotinine is my target variable for prediction. By studying the detailed steps and methodologies in this paper, I gained a better understanding of how to analyze and interpret serum cotinine values and their association with demographic variables. I also learned the importance of controlling or adjusting for other influential variables when building prediction models. Finally, it is critical to identify the correct study population, which in my case is the non-smoking participants in the dataset.

The Data Source

The data was collected from the National Health and Nutrition Examination Survey (NHANES), covering the period of 1999-2016.

  • The pilot data source of this project is the National Health and Nutrition Examination Survey (NHANES). NHANES is a national survey that combines interviews and physical examinations. The interview questions cover demographic, socioeconomic, dietary, and health-related topics.

  • The examination component consists of medical, dental, and physiological measurements, as well as laboratory tests administered by highly trained medical personnel. We chose NHANES as the pilot data source for this project because it is a national program designed to assess the health and nutritional status of children and adults in the United States.

  • NHANES is conducted by the National Center for Health Statistics, which aims to produce vital and health statistics for the nation. The sampling strategy used in NHANES is a complex, multistage probability design that selects survey participants representative of the US population. The NHANES sampling procedure consists of four stages:

  1. Selection of primary sampling units (PSUs), which are counties or small groups of contiguous counties.

  2. Selection of segments within the PSUs (city blocks or groups of households).

  3. Selection of households within the segments.

  4. Selection of individual participants within the households.


Smokers vs Nonsmokers

  • Nonsmokers in this study are defined as people who have not smoked at least 100 cigarettes in their lives. This is captured by an NHANES question (SMQ020) to which survey participants respond yes or no: participants who answer yes are considered smokers, and those who answer no are considered nonsmokers.

  • The study population in this project will be non-smokers.

The outcome

  • The predicted variable (outcome) is binary: people with a serum cotinine concentration of 1 ng/mL or less are classified as under minor secondhand smoke exposure (coded 0), and people with a serum cotinine concentration greater than 1 ng/mL are classified as under medium or heavy secondhand smoke exposure (coded 1).

The following chart shows the cutoff line separating minor from medium/heavy secondhand smoke exposure.


Load and Clean Data

# Include some R Markdown formatting packages
library(formattable)
library(knitr)
library(kableExtra)

# General R packages
library(dplyr)
library(data.table)
library(tidyverse)
library(ggplot2)
library(readxl)
library(ISLR)
library(Amelia)
library(mlbench)
library(pscl)
library(Matrix)
library(xgboost)
library(mltools)
library(caret)
library(Hmisc)
library(smbinning)
library(InformationValue)
library(car)
library(broom)
library(modelr)
library(randomForest)
library(rsample)
library(e1071)
library(glue)
library(DataExplorer)
library(reshape2)
library(dummies)
library(table1)
library(AICcmodavg)
library(readr)
library(fitdistrplus)
library(EnvStats)
library(olsrr)
library(nortest)
library(grid)
library(gridExtra)
library(plotly)
library(DT)
# DecisionTree
library(mlr)
library(parallel)
library(parallelMap)
library(rpart.plot)
library(MASS)
source("http://www.sthda.com/upload/rquery_cormat.r")
  • First, import the required R packages (above).

  • Then, filter the NHANES data set to keep only observations with a valid serum cotinine value (LBXCOT) and restrict the sample to nonsmokers (SMQ020 == 2).

# Load and read data from NHANES
data <- read.csv('nhanesallrev_subset_rev.csv')

# Load the description excel
code_book <- read_excel('NHANES public data 1999-2016 codebook.xlsx', skip = 5)

# Only select the rows with valid value of Serum Cotinine
NHANES <- data %>% 
  dplyr::filter(!(is.na(LBXCOT)|LBXCOT==9999)) %>% 
  dplyr::filter(SMQ020==2)

The total number of observations for the filtered data set is: 24248

The total number of columns of the filtered data set is: 442

Data Cleaning

  • Find and format the percentage of valid data from the dataframe

  • Keep only the independent predictors that have 70% or more valid (non-missing) values.

# Find and format the percentage of valid data from the dataframe
df_keep <- data.frame(lapply(NHANES,function(x) {length(which(!(is.na(x)|x=="")))}))
df_keep <- df_keep %>%
  bind_rows(summarise_if(., is.numeric, ~ .[1] / NROW(NHANES))) %>%
  mutate(across(is.numeric, ~ round(., 3)))

df_keep <- data.frame(t(df_keep))
colnames(df_keep)<-c('Number of Valid Value','Percentage')

# Filtering variables process
df_keep<-df_keep %>%
  filter(Percentage>0.7)
df_keep$Percentage<-percent(df_keep$Percentage)

df_keep<-merge(df_keep, code_book, by.x = 0, by.y = "Variable Name", all.x = TRUE)
df_keep$`Number of Valid Value`<-NULL
colnames(df_keep)<-c('Variable','Percentage of Valid Data', 'Definition', 'Repetition','File Label','S')
datatable(df_keep[,1:5],
          rownames = FALSE,
          options=list(pageLength = 5))

Variables Filtering

  • Further clean the data by removing identifiers and variables that will not be used as predictors
df_keep <- df_keep %>% filter(!(Variable %in% c("SEQN", "RIDSTATR", "DMDHRMAR", "LBDBCDSI", "DMDHRGND","SMD410", 
                                      "DMDHRAGE", "DMDEDUC2", "LBDBPBSI", "LBDCOTLC", "WTINT2YR", "HIQ011",
                                      "LBXHCT", "URXUCR", "LBXHGB", "LBXBPB", "LBXBCD", "SMQ020","SDDSRVYR","RIDEXMON",
                                      "RIDAGEMN","SIALANG","HIQ210","SDMVPSU","SDMVSTRA","WTMEC2YR","FSDHH","BMXBMI",
                                      "BMXHT","BMXWAIST","BMXWT","HSD010"
                                      )))
col_lst<-df_keep$Variable

df_NHANES <- NHANES %>% dplyr::select(col_lst)
datatable(df_keep[,1:5],
          rownames = FALSE,
          options=list(pageLength = 5))

Exploratory Data Analysis

  • In this section, we will perform exploratory data analysis to understand the variables and dataset better.

Missing Values

  • We plot the number of missing values for each variable in the dataset.

  • Since several of the selected variables contain missing values, we will replace each numeric variable's missing data points with that variable's mean (a sketch of this mean imputation follows the plot below).

plot_missing(df_NHANES)
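The following is a minimal sketch of the mean-imputation step described above, applied to every numeric column of df_NHANES (the variable-specific imputations actually used appear later in the cleaning code); df_imputed and num_cols are hypothetical names:

# Sketch: replace NA in every numeric column with that column's mean
num_cols <- sapply(df_NHANES, is.numeric)
df_imputed <- df_NHANES
df_imputed[num_cols] <- lapply(df_imputed[num_cols],
                               function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))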

Unique Values

  • We explore the number of unique values for each variable in the dataset; variables with only one unique value carry no information and should not be kept.
sapply(df_NHANES, function(x) length(unique(x)))
## DMDCITZN DMDHHSIZ DMDHREDU DMDMARTL   HOD050   HOQ065 INDFMPIR   LBXCOT 
##        5        7        8        9       16        6      501     2091 
## RIAGENDR RIDAGEYR RIDRETH1 
##        2       68        5

Data Visualization

  • Plot the distribution of the variables

  • Before plotting the distribution, we will first convert certain variables into factor variables.

# Convert coded columns to factors and recode NHANES special codes
# (7 = Refused, 9 = Don't know are treated as missing)
df_NHANES$DMDCITZN <- as.factor(ifelse(df_NHANES$DMDCITZN %in% c(7,9), NA, 
                                       ifelse(df_NHANES$DMDCITZN==1,"YES","No")))
df_NHANES$DMDHREDU <- ifelse(df_NHANES$DMDHREDU %in% c(7,9), NA, df_NHANES$DMDHREDU)
# Impute missing poverty income ratio (PIR) with the mean
df_NHANES$INDFMPIR <- ifelse(is.na(df_NHANES$INDFMPIR), mean(df_NHANES$INDFMPIR, na.rm = TRUE), df_NHANES$INDFMPIR)
df_NHANES$MaritalStatus <- as.factor(ifelse(df_NHANES$DMDMARTL==1, "Married",
                                           ifelse(df_NHANES$DMDMARTL==2,"Widowed",
                                                  ifelse(df_NHANES$DMDMARTL==3,"Divorced",
                                                         ifelse(df_NHANES$DMDMARTL==4,"Separated",
                                                                ifelse(df_NHANES$DMDMARTL==5,"Never_married",
                                                                       ifelse(df_NHANES$DMDMARTL==6,"Living_with_partner",
                                                                              NA)))))))
# HOD050 (number of rooms): treat 77/99 (Refused/Don't know) as missing for now
df_NHANES$HOD050 <- ifelse(df_NHANES$HOD050 %in% c(77,99), NA, df_NHANES$HOD050)
                           # mean(df_NHANES$HOD050,na.rm = TRUE),
                           # ifelse(is.na(df_NHANES$HOD050), mean(df_NHANES$HOD050,na.rm = TRUE),
                           #        df_NHANES$HOD050))
df_NHANES$HomeType <- as.factor(ifelse(df_NHANES$HOQ065==1, "Owned_or_being_bought",
                                           ifelse(df_NHANES$HOQ065==2,"Rented",
                                                  ifelse(df_NHANES$HOQ065==3,"Other_arrangement",
                                                         NA))))
                                                         
df_NHANES$RIAGENDR <- as.factor(ifelse(df_NHANES$RIAGENDR==1, "Male","Female"))
df_NHANES$Race <- as.factor(ifelse(df_NHANES$RIDRETH1==1, "Mexican_American",
                                           ifelse(df_NHANES$RIDRETH1==2,"Other_Hispanic",
                                                  ifelse(df_NHANES$RIDRETH1==3,"Non-Hispanic_White",
                                                         ifelse(df_NHANES$RIDRETH1==4,"Non-Hispanic_Black",
                                                                ifelse(df_NHANES$RIDRETH1==5,"Other_Race",
                                                                       NA))))))
# Binarize the outcome: serum cotinine <= 1 (minor exposure) -> 0, > 1 (medium/heavy) -> 1
df_NHANES$LBXCOT <- ifelse(df_NHANES$LBXCOT<=1,0,1)
df_NHANES<-df_NHANES %>% dplyr::select(-RIDRETH1,-HOQ065,-DMDMARTL)

Continuous Variables

Age

# Age
summary(df_NHANES$RIDAGEYR)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   45.00   47.06   62.00   85.00
df_NHANES$RIAGENDR_NEW<-ifelse(df_NHANES$RIDAGEYR<=20, "0-20",
                            ifelse(df_NHANES$RIDAGEYR>20 & df_NHANES$RIDAGEYR<=40, "21-40",
                                   ifelse(df_NHANES$RIDAGEYR>40 & df_NHANES$RIDAGEYR<=60, "41-60",
                                          ifelse(df_NHANES$RIDAGEYR>60 & df_NHANES$RIDAGEYR<85,"61-84",
                                                 "85+"))))
age<-df_NHANES %>%
  group_by(RIAGENDR_NEW) %>%
  dplyr::summarise(Count = n())
ggplot(age, aes(x=RIAGENDR_NEW,y=Count)) + 
  geom_col(fill="#69b3a2") +
  xlab("Age Range") +
  ggtitle("Count for Number of Observations by Age Range") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(aes(label=round(Count/NROW(df_NHANES),3)), position = position_dodge(0.9), vjust=-0.25)

Poverty Income Ratio (PIR)

# Poverty Income Ratio (PIR)
summary(df_NHANES$INDFMPIR)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.26    2.60    2.64    4.09    5.00
df_NHANES$INDFMPIR_NEW<-ifelse(df_NHANES$INDFMPIR<=1, "0-1",
                            ifelse(df_NHANES$INDFMPIR>1 & df_NHANES$INDFMPIR<=2, "1-2",
                                   ifelse(df_NHANES$INDFMPIR>2 & df_NHANES$INDFMPIR<=3, "2-3",
                                          ifelse(df_NHANES$INDFMPIR>3 & df_NHANES$INDFMPIR<=4,"3-4",
                                                 ifelse(df_NHANES$INDFMPIR>4 & df_NHANES$INDFMPIR<5,"4-5",
                                                        "5+")))))
pir<-df_NHANES %>%
  group_by(INDFMPIR_NEW) %>%
  dplyr::summarise(Count = n())
ggplot(pir, aes(x=INDFMPIR_NEW,y=Count)) + 
  geom_col(fill="#69b3a2") +
  xlab("Poverty income ratio (PIR)") +
  ggtitle("Ratio of Family Income to Poverty Threshold") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(aes(label=round(Count/NROW(df_NHANES),3)), position = position_dodge(0.9), vjust=-0.25)

Number of Rooms

summary(df_NHANES$HOD050)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   4.000   6.000   5.927   7.000  13.000     322
NOR<-df_NHANES %>% 
  group_by(HOD050) %>%
  dplyr::summarise(Count = n())
ggplot(NOR, aes(x=HOD050,y=Count)) + 
  geom_col(fill="#69b3a2") +
  xlab("Number of Rooms") +
  ggtitle("Distribution of Number of Rooms") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(aes(label=round(Count/NROW(df_NHANES),3)), position = position_dodge(0.9), vjust=-0.25)

Serum Cotinine (Y)

# Serum Cotinine
summary(NHANES$LBXCOT)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    0.011    0.011    0.032    9.789    0.095 1700.000
NHANES$LBXCOT_NEW<-ifelse(NHANES$LBXCOT<=1, "0-1",
                            ifelse(NHANES$LBXCOT>1 & NHANES$LBXCOT<=10, "1-10",
                                   ifelse(NHANES$LBXCOT>10 & NHANES$LBXCOT<=100,"10-100",
                                                 ifelse(NHANES$LBXCOT>100 & NHANES$LBXCOT<500,"100-500",
                                                        "500+"))))
cotinine<-NHANES %>%
  group_by(LBXCOT_NEW) %>%
  dplyr::summarise(Count = n())

ggplot(cotinine, aes(x=LBXCOT_NEW,y=Count)) + 
  geom_col(fill="#69b3a2") +
  xlab("Serum Cotinine Level") +
  ggtitle("Number of Observations by Range of Serum Cotinine Concentration Level") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(aes(label=round(Count/NROW(NHANES),3)), position = position_dodge(0.9), vjust=-0.25)

Household Size

summary(df_NHANES$DMDHHSIZ)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   3.327   4.000   7.000
NOP<-df_NHANES %>% 
  group_by(DMDHHSIZ) %>%
  dplyr::summarise(Count = n())
ggplot(NOP, aes(x=DMDHHSIZ,y=Count)) + 
  geom_col(fill="#69b3a2") +
  xlab("Number of People to Live with") +
  ggtitle("Distribution of Number of People to Live with") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(aes(label=round(Count/NROW(df_NHANES),3)), position = position_dodge(0.9), vjust=-0.25)

Categorical Variables

Gender

# Gender
summary(df_NHANES$RIAGENDR)
## Female   Male 
##  14694   9554
set_a<-df_NHANES %>%
  group_by(RIAGENDR) %>%
  dplyr::summarise(Count = n())
ggplot(set_a, aes(x=RIAGENDR,y=Count)) + 
  geom_col(fill="#69b3a2") +
  xlab("Gender") +
  ggtitle("Count for Number of Observations by Gender") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(aes(label=round(Count/NROW(df_NHANES),3)), position = position_dodge(0.9), vjust=-0.25)

Race / Ethnicity

# Race
summary(df_NHANES$Race)
##   Mexican_American Non-Hispanic_Black Non-Hispanic_White     Other_Hispanic 
##               4985               5015               9560               2214 
##         Other_Race 
##               2474
race<-df_NHANES %>%
  group_by(Race) %>%
  dplyr::summarise(Count = n())

ggplot(race, aes(x=Race,y=Count)) + 
  geom_col(fill="#69b3a2") +
  xlab("Race") +
  ggtitle("Count for Number of Observations by Race") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(aes(label=round(Count/NROW(df_NHANES),3)), position = position_dodge(0.9), vjust=-0.25) +
  theme(axis.text.x = element_text(angle = 25, vjust = 0.8))

Education Level

# Education Level
NHANES$DMDEDUC2_NEW<-ifelse(NHANES$DMDEDUC2==1, 111, 
                            ifelse(NHANES$DMDEDUC2==2, 222,
                                   ifelse(NHANES$DMDEDUC2==3, 333,
                                          ifelse(NHANES$DMDEDUC2==4, 444,
                                                 ifelse(NHANES$DMDEDUC2==5, 555, NA)))))
NHANES$EDUCATION<-ifelse(NHANES$DMDEDUC2_NEW==""|is.na(NHANES$DMDEDUC2_NEW), NHANES$DMDEDUC3, NHANES$DMDEDUC2_NEW)
less9grade<-c(0,1,2,3,4,5,6,7,8,111,55,66)
between9_11<-c(9,10,11,222)
highschool<-c(12,13,14,333)
college<-c(15,444)
NHANES$EDUCATION_NEW<-ifelse(NHANES$EDUCATION %in% less9grade, "Less Than 9th Grade", 
                            ifelse(NHANES$EDUCATION %in% between9_11, "9-11th Grade",
                                   ifelse(NHANES$EDUCATION %in% highschool, "High School Grad/GED or Equivalent",
                                          ifelse(NHANES$EDUCATION %in% college, "Some College or AA degree",
                                                 ifelse(NHANES$EDUCATION==555, "College Graduate or above", NA)))))

NHANES$EDUCATION_NEW<-factor(NHANES$EDUCATION_NEW, levels = c("Less Than 9th Grade", "9-11th Grade",
                                                            "High School Grad/GED or Equivalent",
                                                            "Some College or AA degree",
                                                            "College Graduate or above", NA))
edu<-NHANES %>%
  group_by(EDUCATION_NEW) %>%
  dplyr::summarise(Count = n())
summary(NHANES$EDUCATION_NEW)
##                Less Than 9th Grade                       9-11th Grade 
##                               3058                               3062 
## High School Grad/GED or Equivalent          Some College or AA degree 
##                               5131                               6693 
##          College Graduate or above                               NA's 
##                               6279                                 25
ggplot(edu, aes(x=EDUCATION_NEW,y=Count)) + 
  geom_col(fill="#69b3a2") +
  xlab("Education Level") +
  ggtitle("Number of Observations by Education Level") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(aes(label=round(Count/NROW(NHANES),3)), position = position_dodge(0.9), vjust=-0.25) +
  theme(axis.text.x = element_text(angle = 25, vjust = 0.8))

Marital Status

# Distribution of Marital Status
summary(df_NHANES$MaritalStatus)
##            Divorced Living_with_partner             Married       Never_married 
##                1887                1449               13026                4426 
##           Separated             Widowed                NA's 
##                 699                2048                 713
MS<-df_NHANES %>%
  group_by(MaritalStatus) %>%
  dplyr::summarise(Count = n())

ggplot(MS, aes(x=MaritalStatus,y=Count)) + 
  geom_col(fill="#69b3a2") +
  xlab("Marital Status") +
  ggtitle("Distribution for Marital Status") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(aes(label=round(Count/NROW(df_NHANES),3)), position = position_dodge(0.9), vjust=-0.25) +
  theme(axis.text.x = element_text(angle = 25, vjust = 0.8))

Home Type

# Distribution of types of home
summary(df_NHANES$HomeType)
##     Other_arrangement Owned_or_being_bought                Rented 
##                   537                 15299                  8123 
##                  NA's 
##                   289
HT<-df_NHANES %>%
  group_by(HomeType) %>%
  dplyr::summarise(Count = n())

ggplot(HT, aes(x=HomeType,y=Count)) + 
  geom_col(fill="#69b3a2") +
  xlab("Home Type") +
  ggtitle("Distribution for Types of Home") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(aes(label=round(Count/NROW(df_NHANES),3)), position = position_dodge(0.9), vjust=-0.25) +
  theme(axis.text.x = element_text(angle = 25, vjust = 0.8))

Citizenship

# Distribution of citizenship status
summary(df_NHANES$DMDCITZN)
##    No   YES  NA's 
##  4289 19907    52
citi<-df_NHANES %>%
  group_by(DMDCITZN) %>%
  dplyr::summarise(Count = n())

ggplot(citi, aes(x=DMDCITZN,y=Count)) + 
  geom_col(fill="#69b3a2") +
  xlab("Citizenship") +
  ggtitle("Distribution for Citizenship") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(aes(label=round(Count/NROW(df_NHANES),3)), position = position_dodge(0.9), vjust=-0.25)

Building Models

The models included in this section are:

  • XGBoost

  • Generalized Linear Model (GLM)

  • Random Forest

One way to address the problem of class imbalance is to draw the 0's and 1's for the training data (development sample) in equal proportions and to place the rest of df_NHANES, not used for training, into the test data (validation sample). As a result, the development sample would be smaller than the validation sample, which is acceptable because there is a large number of observations (>10K). A minimal sketch of this balanced-sampling idea is shown below.
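The following is a minimal sketch of that balanced sampling (an illustration of the idea only, not the split actually used in this project, which is the random 60/20/20 split in the next section); trainingData_bal and testData_bal are hypothetical names and the binarized outcome LBXCOT in df_NHANES is assumed:

# Sketch: build a development sample with equal numbers of 0's and 1's,
# then place every remaining observation into the test/validation sample
set.seed(314)
ones  <- df_NHANES[df_NHANES$LBXCOT == 1, ]
zeros <- df_NHANES[df_NHANES$LBXCOT == 0, ]
n_dev <- floor(0.7 * nrow(ones))   # take 70% of the minority (exposed) class
trainingData_bal <- rbind(ones[sample(nrow(ones), n_dev), ],
                          zeros[sample(nrow(zeros), n_dev), ])
testData_bal <- df_NHANES[!(rownames(df_NHANES) %in% rownames(trainingData_bal)), ]

The caret package loaded above also provides downSample() as a convenience function for the same idea.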

Train, Test, and Validation

  • Perform missing value imputation and drop the remaining NA values.
df_NHANES$HOD050 <- ifelse(df_NHANES$HOD050 %in% c(77,99), mean(df_NHANES$HOD050,na.rm = TRUE),
                           ifelse(is.na(df_NHANES$HOD050), mean(df_NHANES$HOD050,na.rm = TRUE),
                                  df_NHANES$HOD050))
df_NHANES<-df_NHANES %>%
  dplyr::select(-RIAGENDR_NEW,-INDFMPIR_NEW) %>%
  drop_na()

XGBoost

  • Split the dataset into training, validation, and test sets for the XGBoost machine learning model.

  • The input needs to be one-hot encoded and converted into matrix (xgb.DMatrix) form.

set.seed(314)
# Create random training, validation, and test sets

# Set the fractions of the dataframe to split into training,
# validation, and test.
fractionTraining   <- 0.60
fractionValidation <- 0.20
fractionTest       <- 0.20

# Compute sample sizes.
sampleSizeTraining   <- floor(fractionTraining   * nrow(df_NHANES))
sampleSizeValidation <- floor(fractionValidation * nrow(df_NHANES))
sampleSizeTest       <- floor(fractionTest       * nrow(df_NHANES))

# Create the randomly-sampled indices for the dataframe. Use setdiff() to
# avoid overlapping subsets of indices.
indicesTraining    <- sort(sample(seq_len(nrow(df_NHANES)), size=sampleSizeTraining))
indicesNotTraining <- setdiff(seq_len(nrow(df_NHANES)), indicesTraining)
indicesValidation  <- sort(sample(indicesNotTraining, size=sampleSizeValidation))
indicesTest        <- setdiff(indicesNotTraining, indicesValidation)

# Finally, output the three dataframes for training, validation and test.
df_Training   <- df_NHANES[indicesTraining, ]
df_Validation <- df_NHANES[indicesValidation, ]
df_Test       <- df_NHANES[indicesTest, ]

# One-hot encode the factor variables in each split for XGBoost
dfTraining   <- one_hot(as.data.table(df_NHANES[indicesTraining, ]))
dfValidation <- one_hot(as.data.table(df_NHANES[indicesValidation, ]))
dfTest       <- one_hot(as.data.table(df_NHANES[indicesTest, ]))

# Convert train, test, and validate data set into Matrix
xg.train <- xgb.DMatrix(data = as.matrix(dfTraining %>% dplyr::select(-LBXCOT)),
                        label = dfTraining$LBXCOT) 
xg.test <- xgb.DMatrix(data = as.matrix(dfTest %>% dplyr::select(-LBXCOT)),
                       label = dfTest$LBXCOT) 
xg.validate <- xgb.DMatrix(data = as.matrix(dfValidation %>% dplyr::select(-LBXCOT)),
                           label = dfValidation$LBXCOT) 

watchlist <- list(train = xg.train, test = xg.test)

Hyperparameter Tuning

# Build the hyperparameter tuning grid
grid <- expand.grid(
  nrounds = 1000,
  eta = c(0.01, 0.02, 0.05),
  max_depth = c(2,4),
  best_iteration = 0,
  best_auc = 0,
  time = 0
)
grid

# Base score: the overall proportion of positive (exposed) cases
base_score <- mean(df_NHANES$LBXCOT)
base_score

for (i in 1:nrow(grid)) {
  
  start <- Sys.time()
  
  xg.fit.grid <- xgb.train(data = xg.train,
                           objective = "binary:logistic",
                           nrounds = 1000,
                           eta = grid$eta[i],
                           max_depth = grid$max_depth[i],
                           base_score = base_score,
                           eval_metric = "auc",
                           watchlist = watchlist,
                           early_stopping_rounds = 10)
  end <- Sys.time()
  
  grid$best_iteration[i] <- xg.fit.grid$best_iteration
  grid$best_auc[i] <- round(xg.fit.grid$best_score, 4)   # best test-set AUC before early stopping
  grid$time[i] <- round(end - start, 3)
}

Results of the hyperparameter tuning, displayed in a datatable and sorted by best AUC:

datatable(grid %>% arrange(-min_nloglik))

Train Model

  • Use the best combination of parameters from the hyperparameter tuning results to train the XGBoost model.
bst <- xgb.train(data = xg.train,
                 objective = "reg:squarederror",
                 nrounds = 98,
                 eta = 0.05,
                 max_depth = 4,
                 eval_metric = "auc",
                 watchlist = watchlist)

Model Evaluation

  • Next, we evaluate the performance of the XGBoost model.

  • First, we calculate the RMSE on the validation set.

# Predict on the validation set
pred <- predict(bst, xg.validate)

# Calculate RMSE on the validation set
RMSE = RMSE(pred,dfValidation$LBXCOT)

# Calculate residuals
residuals = dfValidation$LBXCOT - pred

# Calculate the CutOff rate
optCutOff <- optimalCutoff(df_Validation$LBXCOT, pred, 
                           #optimiseFor = "both", 
                           returnDiagnostics = FALSE)
predicted.classes <- ifelse(pred > optCutOff, 1, 0)

# Present the confusion matrix
#table(predicted.classes,df_Validation$LBXCOT)
  • The RMSE for the XGBoost model is: 0.2683.

Plotting the Tree

Tree = 0
# Plot the trees
# Tree index 0
xgb.plot.tree(model = bst, tree=0)
Tree = 1
# Tree index 1
xgb.plot.tree(model = bst, tree=1)
Tree = 2
# Tree index 2
xgb.plot.tree(model = bst, tree=2)

Feature Importance Matrix

# Importance Matrix
importance_matrix <- xgb.importance(model = bst)
print(importance_matrix)
##                              Feature         Gain        Cover   Frequency
## 1                           RIDAGEYR 0.2027632647 0.1545093096 0.185743660
## 2            Race_Non-Hispanic_Black 0.1692098295 0.0897401012 0.058259082
## 3                           DMDHREDU 0.1302880240 0.1112649239 0.124057574
## 4                           INDFMPIR 0.1283784721 0.1711133588 0.196710075
## 5                    RIAGENDR_Female 0.1150451911 0.1207305774 0.063056888
## 6                        DMDCITZN_No 0.0354241554 0.0503786299 0.037697053
## 7  MaritalStatus_Living_with_partner 0.0320056130 0.0141611173 0.030157642
## 8              MaritalStatus_Married 0.0308879610 0.0348467837 0.021932831
## 9            Race_Non-Hispanic_White 0.0280241836 0.0361548637 0.033584647
## 10                            HOD050 0.0240909192 0.0673940487 0.075394106
## 11                          DMDHHSIZ 0.0231640076 0.0390350174 0.053461275
## 12    HomeType_Owned_or_being_bought 0.0160963021 0.0030197929 0.010966415
## 13                   HomeType_Rented 0.0158172231 0.0388606570 0.023303633
## 14             Race_Mexican_American 0.0157676889 0.0201544712 0.022618232
## 15       MaritalStatus_Never_married 0.0147660375 0.0113207796 0.018505826
## 16           MaritalStatus_Separated 0.0065375054 0.0131896810 0.013708019
## 17        HomeType_Other_arrangement 0.0046872902 0.0029329901 0.012337217
## 18            MaritalStatus_Divorced 0.0037314998 0.0045980937 0.008224812
## 19               Race_Other_Hispanic 0.0017079193 0.0136325639 0.004797807
## 20             MaritalStatus_Widowed 0.0011611941 0.0027440998 0.003427005
## 21                   Race_Other_Race 0.0004457183 0.0002181391 0.002056203
xgb.plot.importance(importance_matrix = importance_matrix)

The ROC Curve

  • The ROC curve plots the false positive rate on the X-axis against the true positive rate on the Y-axis.

  • It considers all possible threshold values and the corresponding sensitivity and specificity, summarizing the discriminative ability of the final model; the corresponding AUROC value is computed after the plot below.

plotROC(dfValidation$LBXCOT, pred)
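To report the AUROC as a single number for model comparison (one of the selection criteria stated in the Objective), a minimal sketch using the InformationValue package loaded above (auroc_xgb is a hypothetical name):

# AUROC for the XGBoost model on the validation set
auroc_xgb <- AUROC(dfValidation$LBXCOT, pred)
auroc_xgb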

Generalized Linear Model (GLM)

Train the Logistic Regression Model

  • We first create the full model with all the explanatory variables; call it NHANES.lm.

  • Looking at the Pr(>|z|) values (p-values) of the regression coefficients, some categories of Marital Status and Home Type do not contribute significantly to the response variable. Therefore, we also try fitting a second model using stepwise selection.

  • The summary of the first model is below:

trainingData <- rbind(df_Training, df_Test)
trainingData_glm<- trainingData
trainingData_glm$LBXCOT<-as.factor(trainingData_glm$LBXCOT)
NHANES.lm<-glm(LBXCOT ~.,data=trainingData_glm, family=binomial)

summary(NHANES.lm)
## 
## Call:
## glm(formula = LBXCOT ~ ., family = binomial, data = trainingData_glm)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5624  -0.4449  -0.3078  -0.2106   3.2705  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                      -2.279246   0.286659  -7.951 1.85e-15 ***
## DMDCITZNYES                       0.919898   0.104430   8.809  < 2e-16 ***
## DMDHHSIZ                          0.044751   0.020277   2.207 0.027316 *  
## DMDHREDU                         -0.320133   0.025733 -12.440  < 2e-16 ***
## HOD050                           -0.042673   0.017211  -2.479 0.013161 *  
## INDFMPIR                         -0.145579   0.022132  -6.578 4.77e-11 ***
## RIAGENDRMale                      1.060785   0.057382  18.487  < 2e-16 ***
## RIDAGEYR                         -0.022144   0.002153 -10.286  < 2e-16 ***
## MaritalStatusLiving_with_partner  0.290618   0.132506   2.193 0.028290 *  
## MaritalStatusMarried             -0.351806   0.106581  -3.301 0.000964 ***
## MaritalStatusNever_married        0.054561   0.113935   0.479 0.632026    
## MaritalStatusSeparated            0.226409   0.168534   1.343 0.179142    
## MaritalStatusWidowed              0.149118   0.146937   1.015 0.310182    
## HomeTypeOwned_or_being_bought     0.285805   0.202250   1.413 0.157618    
## HomeTypeRented                    0.474038   0.200048   2.370 0.017806 *  
## RaceNon-Hispanic_Black            1.517852   0.102482  14.811  < 2e-16 ***
## RaceNon-Hispanic_White            1.025512   0.105932   9.681  < 2e-16 ***
## RaceOther_Hispanic                0.312410   0.136040   2.296 0.021650 *  
## RaceOther_Race                    0.561410   0.138547   4.052 5.08e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10975.0  on 18026  degrees of freedom
## Residual deviance:  9417.5  on 18008  degrees of freedom
## AIC: 9455.5
## 
## Number of Fisher Scoring iterations: 6

Apply the Stepwise Function to the Trained Logistic Regression Model

  • As the results below show, applying the stepwise function to the GLM does not improve the fit much, because the AIC of step.model is equal to that of the original model.

  • We will therefore use the original GLM model for evaluation.

step.model <- NHANES.lm %>% stepAIC(trace = FALSE)
summary(step.model)
## 
## Call:
## glm(formula = LBXCOT ~ DMDCITZN + DMDHHSIZ + DMDHREDU + HOD050 + 
##     INDFMPIR + RIAGENDR + RIDAGEYR + MaritalStatus + HomeType + 
##     Race, family = binomial, data = trainingData_glm)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5624  -0.4449  -0.3078  -0.2106   3.2705  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                      -0.298563   0.282814  -1.056 0.291110    
## DMDCITZN2                        -0.919898   0.104430  -8.809  < 2e-16 ***
## DMDHHSIZ                          0.044751   0.020277   2.207 0.027316 *  
## DMDHREDU                         -0.320133   0.025733 -12.440  < 2e-16 ***
## HOD050                           -0.042673   0.017211  -2.479 0.013161 *  
## INDFMPIR                         -0.145579   0.022132  -6.578 4.77e-11 ***
## RIAGENDR2                        -1.060785   0.057382 -18.487  < 2e-16 ***
## RIDAGEYR                         -0.022144   0.002153 -10.286  < 2e-16 ***
## MaritalStatusLiving_with_partner  0.290618   0.132506   2.193 0.028290 *  
## MaritalStatusMarried             -0.351806   0.106581  -3.301 0.000964 ***
## MaritalStatusNever_married        0.054561   0.113935   0.479 0.632026    
## MaritalStatusSeparated            0.226409   0.168534   1.343 0.179142    
## MaritalStatusWidowed              0.149118   0.146937   1.015 0.310182    
## HomeTypeOwned_or_being_bought     0.285805   0.202250   1.413 0.157618    
## HomeTypeRented                    0.474038   0.200048   2.370 0.017806 *  
## RaceNon-Hispanic_Black            1.517852   0.102482  14.811  < 2e-16 ***
## RaceNon-Hispanic_White            1.025512   0.105932   9.681  < 2e-16 ***
## RaceOther_Hispanic                0.312410   0.136040   2.296 0.021650 *  
## RaceOther_Race                    0.561410   0.138547   4.052 5.08e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10975.0  on 18026  degrees of freedom
## Residual deviance:  9417.5  on 18008  degrees of freedom
## AIC: 9455.5
## 
## Number of Fisher Scoring iterations: 6

Model Evaluation

# predicted scores
predicted <- NHANES.lm %>% predict(df_Validation,type = "response")
RMSE = RMSE(predicted,dfValidation$LBXCOT)

optCutOff <- optimalCutoff(df_Validation$LBXCOT, predicted, 
                           #optimiseFor = "both", 
                           returnDiagnostics = FALSE)
model = pscl::pR2(NHANES.lm)["McFadden"]
miss<-misClassError(df_Validation$LBXCOT, predicted, threshold = optCutOff)

Misclassification error is the percentage of mismatches between predicted and actual classes, irrespective of 1's or 0's. The lower the misclassification error, the better the model. For comparability with the other models, the GLM's ROC curve and AUROC are sketched after the metrics below.

  • The optimal CutOff value is: 0.0965.

  • The McFadden is: 0.1419.

  • The Misclassification Error is: 0.0863.

  • The RMSE for the GLM is: 3.0149.
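The GLM's ROC curve and AUROC can be obtained the same way as for the other models; a minimal sketch using the predicted scores computed above (auroc_glm is a hypothetical name):

# ROC curve and AUROC for the GLM on the validation set
plotROC(df_Validation$LBXCOT, predicted)
auroc_glm <- AUROC(df_Validation$LBXCOT, predicted)
auroc_glm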

Concordance & Residual Assessment
  • In simpler words, of all possible pairs of an actual 1 and an actual 0, concordance is the percentage of pairs in which the predicted score of the actual positive is greater than that of the actual negative. For a perfect model this is 100%, so the higher the concordance, the better the quality of the model. The statistics shown below can be produced with the call sketched next.
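A call like the following (a sketch, assuming the InformationValue package and the GLM predicted scores from above) produces these concordance statistics:

# Concordance / discordance of the GLM predictions on the validation set
Concordance(df_Validation$LBXCOT, predicted)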
## $Concordance
## [1] 0.7732419
## 
## $Discordance
## [1] 0.2267581
## 
## $Tied
## [1] -2.775558e-17
## 
## $Pairs
## [1] 1612688

Random Forest

  • Finally, we train the Random Forest classification model; the model results are as follows:
trainingData$LBXCOT<-as.factor(trainingData$LBXCOT)
#table(trainingData$LBXCOT)

rf <- randomForest(LBXCOT~., data=trainingData, proximity=TRUE)
print(rf)
## 
## Call:
##  randomForest(formula = LBXCOT ~ ., data = trainingData, proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 9.13%
## Confusion matrix:
##       0   1 class.error
## 0 16288 102 0.006223307
## 1  1544  93 0.943188760

Model Evaluation

# Validation set assessment: predicted probability of class 1
# (column 6 of df_Validation is the outcome LBXCOT, so it is excluded from the predictors)
prediction_for_table <- predict(rf,df_Validation[,-6],type="prob")[,2]
RMSE = RMSE(prediction_for_table,dfValidation$LBXCOT)

#table(observed=df_Validation[,6],predicted=prediction_for_table)

The RMSE for the Random Forest model is: 0.2703.

plotROC(df_Validation$LBXCOT, prediction_for_table)

Conclusion

  • The final model choice is between the XGBoost machine learning model and the GLM, as both have a low RMSE and a high AUROC score. The GLM has the added advantage of interpretable coefficients for each explanatory variable. Future work will focus on discovering other methods to improve model accuracy and on exploring additional models (such as SVM, AdaBoost, and k-nearest-neighbors classifiers) for comparison.

Acknowledgment

  • I would like to thank Dr Tim Wood, Dr MyDzung T. Chu, and Dr Ami Zota for providing great support and the NHANES data set used in this project.

  • This project came from a grant funded by the US HUD grant (DCHHU0054-19). The content of this manuscript is solely the responsibility of the grantee and does not represent the official views of any funding entity. Further, no parties involved endorse the purchase of any commercial products or services discussed in this manuscript.

    • RDC Research Project Title: The Impact of Housing Assistance on Residential Environmental Exposures

    • Funded by: US Department of Housing and Urban Development (HUD) Healthy Homes Program Homes Technical Research (DCHHU0054-19)

    • Organization: Environmental Occupational Health at The George Washington University, The Milken Institute School of Public Health

    • Supervisors: Dr MyDzung T. Chu and Dr Ami Zota

References

  1. Richter PA, Bishop EE, Wang J, Swahn MH. Tobacco Smoke Exposure and Levels of Urinary Metals in the U.S. Youth and Adult Population: The National Health and Nutrition Examination Survey (NHANES) 1999–2004. International Journal of Environmental Research and Public Health. 2009; 6(7):1930-1946. https://doi.org/10.3390/ijerph6071930

  2. Gaurang P. Nazar, MSc, John Tayu Lee, PhD, Monika Arora, PhD, Christopher Millett, PhD, Socioeconomic Inequalities in Secondhand Smoke Exposure at Home and at Work in 15 Low- and Middle-Income Countries, Nicotine & Tobacco Research, Volume 18, Issue 5, May 2016, Pages 1230–1239, https://doi.org/10.1093/ntr/ntv261

  3. Llewellyn DJ, Lang IA, Langa KM, Naughton F, Matthews FE. 2009. Exposure to secondhand smoke and cognitive impairment in non-smokers: national cross-sectional study with cotinine measurement. BMJ 338:b462, PMID: 19213767, 10.1136/bmj.b462.

  4. “Health Effects of Secondhand Smoke,” Feb. 27, 2020. Accessed on: Dec. 13, 2021. [Online]. Available: https://www.cdc.gov/tobacco/data_statistics/fact_sheets/secondhand_smoke/health_effects/index.htm

  5. “About the National Health and Nutrition Examination Survey” Sep. 15, 2017. Accessed on: Dec. 13, 2021. [Online]. Available: https://www.cdc.gov/nchs/nhanes/about_nhanes.htm

  6. “Module 2: Sample Design,” Sep. 15, 2017. Accessed on: Dec. 13, 2021. [Online]. Available: https://wwwn.cdc.gov/nchs/nhanes/tutorials/module2.aspx