For this project we have chosen a dataset about the child birth in United States in the year of 2018 available at Kaggle obtained from raw data from CDC website.

Dataset contains over 3.8 milion observations of 55 variables of different types. There are several variables describing the birth circumstances, mother’s features, as well as origins and ethnicity of both parents.

This dataset gives us a lot of possibilities in terms of choosing the variable of interest – we can investigate many features and the influences of the different measures. We would like to check which features may affect if the baby is born healthy, so we decided to focus on variable DBWT that indicates the weight of the newborn baby. We will create a new variable - HEALTHY - that will be a categorical variable that classifies if the baby’s weight falls into correct interval.

Detailed dataset and variables description available here.

1. Data Inspection

Dataset is available in CSV file with just one sheet. Variables are of different types, therefore they need to be closely studied before further analysis. In most of the columns missing values are treated as a separate category and have previously assigned value. This may affect the way we qualify some variables as categorical or numeric, e.g. our variable of interest is expressed in grams but missing values are denoted as 9999 which may disturb future modelling.

setwd("C:\\Users\\lewsz\\OneDrive\\Desktop\\UW_materials_semester_2\\machine_learning\\project_large")
data <- read_csv("US_births(2018).csv")

This dataset has nearly 4 million rows, and it proved to be too heavy to handle for our CPUs and GPUs. Therefore, and we hop the logic holds up, we will use data partitioning to create a subset of the data. The goal is to have a subset that will still have a strong representation of our original data.

set.seed(1922)
data_which_rows <- createDataPartition(data$DBWT, p = 0.9, list = FALSE)

data <- data[-data_which_rows,]

After partitioning the dataset has 380 152 observations of 55 variables.

1.1. Initial Selection

The table below presents all variables with short description and their type. The last column shows how missing values are represented in a given variable.

VARIABLE	DESCRIPTION	TYPE	STATUS	MISSING VALUE
ATTEND	type of attendant at birth	NOMINAL	TO BE DELETED	9
BFACIL	birth place	NOMINAL	TO BE DELETED	9
BMI	mother’s BMI before pregnancy	CONTINUOUS	TO BE KEPT	99.9
CIG_0	cigarettes before pregnancy	DISCRETE	TO BE KEPT	99
DBWT	birth weight in grams	DISCRETE	TO BE KEPT	9999
DLMP_MM	last normal period month	NOMINAL	TO BE DELETED	99
DLMP_YY	last normal period year	NOMINAL	TO BE DELETED	9999
DMAR	marital status	NOMINAL	TO BE KEPT	BLANK
DOB_MM	birth month	NOMINAL	TO BE KEPT	NONE
DOB_TT	birth time	NOMINAL	TO BE DELETED	9999
DOB_WK	birth weekday	NOMINAL	TO BE DELETED	NONE
DOB_YY	birth year (2018 for all)	NOMINAL	TO BE DELETED	NONE
DWgt_R	delivery weight recode (weight in pounds)	DISCRETE	TO BE DELETED	999
FAGECOMB	father’s combined age	DISCRETE	TO BE KEPT	99
FEDUC	father’s education level	ORDINAL	TO BE KEPT	9
FHISPX	father’s hispanic origin if present	NOMINAL	TO BE DELETED	9
FRACE15	father’s race recode	NOMINAL	TO BE DELETED	99
FRACE31	father’s race recode	NOMINAL	TO BE DELETED	99
FRACE6	father’s race recode	NOMINAL	TO BE KEPT	9
ILLB_R	interval since last birth	NOMINAL?	TO BE DELETED	888/999
ILOP_R	interval since last pregnancy	NOMINAL?	TO BE DELETED	888/999
ILP_R	interval since last pregnancy	NOMINAL?	TO BE DELETED	888/999
IMP_SEX	imputed sex	1/BLANK	TO BE KEPT	BLANK
IP_GON	gonorrhea presence	Y/N/U	TO BE KEPT	U
LD_INDL	induction of labor	Y/N/U	TO BE KEPT	U
MAGER	mother’s age	ORDINAL	TO BE KEPT	NONE
MAGE_IMPFLG	mother’s age imputed	1/BLANK	TO BE KEPT	BLANK
MAR_IMP	mother’s imputed marital status	1/BLANK	TO BE KEPT	BLANK
MBSTATE_REC	mother’s nativity (born in/outside of US)	NOMINAL	TO BE KEPT	3
MEDUC	mother’s edcuation level	ORDINAL	TO BE KEPT	9
MHISPX	mother’s hispanic origin if present	NOMINAL	TO BE DELETED	9
MM_AICU	admit to intensive care	Y/N/U	TO BE KEPT	U
MRACE15	mother’s race recode	NOMINAL	TO BE DELETED	NONE
MRACE31	mother’s race recode	NOMINAL	TO BE DELETED	NONE
MRACEIMP	mother’s iputed race	1/2/BLANK	TO BE KEPT	BLANK
MRAVE6	mother’s race recode (MRACE6)	NOMINAL	TO BE KEPT	NONE
MTRAN	mother transferred	Y/N/U	TO BE KEPT	U
M_Ht_In	mother’s height in inches	CONTINUOUS	TO BE KEPT	99
NO_INFEC	no infections reported	1/0/9	TO BE KEPT	9
NO_MMORB	no maternal morbidity reported	1/0/9	TO BE KEPT	9
NO_RISKS	no risk factors reported	1/0/9	TO BE KEPT	9
PAY	payment source for delivery	NOMINAL	TO BE DELETED	9
PAY_REC	payment recode	NOMINAL	TO BE KEPT	9
PRECARE	month prenatal care began	NOMINAL	TO BE KEPT	99
PREVIS	number of prenatal visits	DISCRETE	TO BE KEPT	99
PRIORDEAD	prior births now dead	DISCRETE	TO BE KEPT	99
PRIORLIVE	prior births now living	DISCRETE	TO BE KEPT	99
PRIORTERM	prior other terminations	DISCRETE	TO BE KEPT	99
PWgt_R	pre-pregnancy weight recode	DISCRETE	TO BE KEPT	999
RDMETH_REC	delivery method recode	NOMINAL	TO BE DELETED	9
RESTATUS	residence status	NOMINAL	TO BE KEPT	NONE
RF_CESAR	previous cesarean	Y/N/U	TO BE KEPT	U
RF_CESARN	number of previous cesareans	DISCRETE	TO BE KEPT	99
SEX	sex of infant	M/F	TO BE KEPT	NONE
WTGAIN	weight gain in pounds	DISCRETE	TO BE KEPT	99

After the inspection of every variable, we qualified several of them for instant deletion. Variables ATTEND and BFACIL apply to the attendance and facility during birth which does not influence the weight of the baby. DOB_TT and DOB_WK describe the time and weekday of the birth which also does not affect the dependent variable. The year of birth in variable DOB_YY is the same for all observations therefore will be deleted as well. DWgt_R is just a convertion from weight in grams to weight in pounds. From 4 variables describing fathers origins or race (FHISPX, FRACE15, FRACE31, FRACE6) we decided to keep just one - FRACE6. Variables ILLB_R, and ILOP_R containing the information about the interval since last birth/pregnancy are of a mixed type - they take values 000-003 if the delivery was plural, and if it was not, they imply the months passed since the last birth/pregnancy. This kind of variable is difficult to transform, so we decided to drop all of them. Same as in the father’s case, we decided to keep just one variable describing mother’s race/origin - MRAVE6 - we deleted MRACE15, MRACE31 and MHISPX. Variables PAY and PAY_REC describe the type of payment so there is no need to keep both of them. Finally, RDMETH_REC shows the delivery method, which does not affect the dependent variable.

columns_to_drop <- c("ATTEND", "BFACIL", "DOB_TT", "DOB_WK", "DOB_YY", "DWgt_R", "FHISPX", "FRACE15", "FRACE31", "ILLB_R", "ILOP_R", "ILP_R", "MHISPX", "MRACE15", "MRACE31", "PAY", "RDMETH_REC", "DLMP_MM", "DLMP_YY")

for (name in columns_to_drop){
  data[name] <- NULL
}

1.2. Missing Values

Even though in most of the columns missing values had an assigned value, we have to check if none of them contain some blank cells.

sapply(data, function(x) sum(is.na(x)))

##         BMI       CIG_0        DBWT        DMAR      DOB_MM    FAGECOMB 
##           0           0           0       45696           0           0 
##       FEDUC      FRACE6     IMP_SEX      IP_GON     LD_INDL       MAGER 
##           0           0      380144           0           0           0 
## MAGE_IMPFLG     MAR_IMP MBSTATE_REC       MEDUC     MM_AICU    MRACEIMP 
##      380116      379968           0           0           0      356159 
##      MRAVE6       MTRAN     M_Ht_In    NO_INFEC    NO_MMORB    NO_RISKS 
##           0           0           0           0           0           0 
##     PAY_REC     PRECARE      PREVIS   PRIORDEAD   PRIORLIVE   PRIORTERM 
##           0           0           0           0           0           0 
##      PWgt_R    RESTATUS    RF_CESAR   RF_CESARN         SEX      WTGAIN 
##           0           0           0           0           0           0

We can see that 5 columns contain some missing values. Let’s take a closer look at them.

empty_columns <- names(which(colSums(is.na(data)) > 0))
empty_columns

## [1] "DMAR"        "IMP_SEX"     "MAGE_IMPFLG" "MAR_IMP"     "MRACEIMP"

for (name in empty_columns){
  missing_cells <- sum(is.na(data[name]))
  percentage <- round((missing_cells * 100)/(nrow(data)), 3)
  print(paste("Missing values in", name, ":", percentage, "%"))
}

## [1] "Missing values in DMAR : 12.02 %"
## [1] "Missing values in IMP_SEX : 99.998 %"
## [1] "Missing values in MAGE_IMPFLG : 99.991 %"
## [1] "Missing values in MAR_IMP : 99.952 %"
## [1] "Missing values in MRACEIMP : 93.689 %"

As we can see 4 of them have more than 90% missing values which qualifies them for the deletion.

columns_to_drop <- c("IMP_SEX", "MAGE_IMPFLG", "MAR_IMP", "MRACEIMP")

for (name in columns_to_drop){
  data[name] <- NULL
}

Let’s take a closer look at variable DMAR where the missing values make almost 12% of the total observations. We will keep this variable for now since around 450k observations is a lot to drop. We will treat missing values as a third category to keep those rows in a dataset.

table(data$DMAR, useNA = "ifany")

## 
##      1      2   <NA> 
## 201172 133284  45696

data$DMAR[is.na(data$DMAR)] <- 3
table(data$DMAR, useNA = "ifany")

## 
##      1      2      3 
## 201172 133284  45696

After deleting selected variables we are left with 32 features. Now, we will group continuous, discrete and ordinal variables according to the missing value representation. We deleted or transformed columns with blank cells which leaves us with the following groups:

group_9 <- c("FEDUC", "FRACE6", "MEDUC", "NO_INFEC", "NO_MMORB", "NO_RISKS", "PAY_REC")
group_99 <- c("CIG_0", "FAGECOMB", "M_Ht_In", "PREVIS", "PRIORDEAD", "PRIORLIVE", "PRIORTERM", "RF_CESARN", "WTGAIN")

Variable BMI has missing values represented by 99.9, variable PWgt_R by 999 and variable DBWT by 9999.

data <- filter(data, BMI != 99.9)
data <- filter(data, PWgt_R != 999)
data <- filter(data, DBWT != 9999)

for (name in group_9){
  data <- data[data[[name]] != 9, ]
}

for (name in group_99){
  data <- data[data[[name]] != 99, ]
}

After cleaning the data we are left with 289 721 observations of 32 variables.

1.3. Creating HEALTHY

Now is the time to create our binary dependent variable - HEALTHY. The average birth weight for babies is around 7.5 lb (3.5 kg), although between 5.5 lb (2.5 kg) and 8.5 lb (4 kg) is considered normal. Therefore we set the healthy interval between 2500 and 4000 grams. We decided also to create a separate dataset where variable HEALTHY will not be binary but nominal with classes “underweight” for babies under 2500 grams, “healthy” for babies that weigh between 2500 and 4000 grams and “overweight” for those weighing more than 4000 grams.

Creating copy of the data:

data_b <- data

data_b$HEALTHY <- ifelse((data$DBWT >= 2500 & data$DBWT <= 4000), 1, 0) #binary
data$HEALTHY <- ifelse((data$DBWT <= 2500), 0, ifelse((data$DBWT <= 4000), 1, 2)) #nominal

table(data$HEALTHY)

## 
##      0      1      2 
##  21622 244322  23777

table(data_b$HEALTHY)

## 
##      0      1 
##  45186 244535

In our dataset we have around 244k healthy babies and around 45k babies that fall out of our healthy interval (underweight and overweight together). This means that our variable is unbalanced and will need some resampling on train data.

1.4. Data Converting

Before further inspection of the variables, we need to convert the data to appropriate types. There are 12 numeric variables and 21 categorical variables. Some of the categorical variables have already assigned numbers as levels, however by default R will treat them as quantitative variables, so we have to transform them as factors as well.

data_categorical_vars <- c("DMAR", "DOB_MM", "FRACE6", "MBSTATE_REC", "MRAVE6", "PAY_REC", "PRECARE", "RESTATUS", "IP_GON", "LD_INDL", "MM_AICU", "MTRAN", "NO_INFEC", "NO_MMORB", "NO_RISKS", "RF_CESAR", "SEX", "MEDUC", "FEDUC", "MAGER", "HEALTHY")

data_numeric_vars <- c("BMI", "CIG_0", "DBWT", "FAGECOMB", "M_Ht_In", "PREVIS", "PRIORDEAD", "PRIORLIVE", "PRIORTERM", "PWgt_R", "RF_CESARN", "WTGAIN")

for (variable in data_categorical_vars) {
  data[[variable]] <- as.factor(data[[variable]])
}

for (variable in data_categorical_vars) {
  data_b[[variable]] <- as.factor(data_b[[variable]])
}

Among the categorical variables there are 4 that take values 0 or 1, let’s add labels to them in both datasets.

data$NO_INFEC <- factor(data$NO_INFEC,
                        levels = c(0, 1),
                        labels = c("No", "Yes"))

data$NO_MMORB <- factor(data$NO_MMORB,
                        levels = c(0, 1),
                        labels = c("No", "Yes"))

data$NO_RISKS <- factor(data$NO_RISKS,
                        levels = c(0, 1),
                        labels = c("No", "Yes"))

data$HEALTHY <- factor(data$HEALTHY,
                       levels = c(0, 1, 2),
                       labels = c("underweight", "healthy", "overweight"))

data_b$NO_INFEC <- factor(data_b$NO_INFEC,
                          levels = c(0, 1),
                          labels = c("No", "Yes"))

data_b$NO_MMORB <- factor(data_b$NO_MMORB,
                          levels = c(0, 1),
                          labels = c("No", "Yes"))

data_b$NO_RISKS <- factor(data_b$NO_RISKS,
                          levels = c(0, 1),
                          labels = c("No", "Yes"))

data_b$HEALTHY <- factor(data_b$HEALTHY,
                         levels = c(0, 1),
                         labels = c("No", "Yes"))

Moreover, 3 of those 21 categorical variables are ordinal. The levels are expressed as numbers so by default R should order them in an increasing way but let’s check the levels.

levels(data$MEDUC)

## [1] "1" "2" "3" "4" "5" "6" "7" "8"

levels(data$FEDUC)

## [1] "1" "2" "3" "4" "5" "6" "7" "8"

levels(data$MAGER)

##  [1] "13" "14" "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27"
## [16] "28" "29" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42"
## [31] "43" "44" "45" "46" "47" "48" "49" "50"

As we can see all of the levels are ordered properly so we do not need to change that. The next step is to check our numeric variables and transform them if necessary.

for (variable in data_numeric_vars) {
  if (is.numeric(data[[variable]]) == FALSE) print(variable)
}

## [1] "RF_CESARN"

The variable RF_CESARN needs to be transformed into numeric in both sets.

data$RF_CESARN <- as.numeric(data$RF_CESARN)
data_b$RF_CESARN <- as.numeric(data_b$RF_CESARN)

Prior to applying transformations to our data, we first must split the data into training and testing set. Transformations on data should be applied to the training set only.

Before we move on, let’s save our data.

save(list = c("data", "data_b"),
     file = "data_prepped.RData")

2. Splitting the data

Before applying transformations to chosen variables we divide the observations into train and test sets. Using the createDataPartition method, we receive a vector of indices. With this vector, we can conveniently split the data into training and testing sample.

set.seed(1922)
data_which_train <- createDataPartition(data$DBWT, p = 0.7, list = FALSE) 

data_train <- data[data_which_train,]
data_test <- data[-data_which_train,]

data_train_b <- data_b[data_which_train,]
data_test_b <- data_b[-data_which_train,]

We splitted both data sets into training and testing sets.

summary(data_train$DBWT)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     227    2990    3325    3291    3654    8165

summary(data_test$DBWT)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     227    2990    3324    3292    3654    6603

summary(data_train_b$DBWT)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     227    2990    3325    3291    3654    8165

summary(data_test_b$DBWT)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     227    2990    3324    3292    3654    6603

Symmetry measures are very similar between train and test sets. Even though the maximum values differ, we will keep this partition. Now that we have our data split, we can begin to perform transformations on the training sample.

Before proceeding we have to delete variable DBWT from all datasets and from the list of numeric variables.

data_train$DBWT <- NULL
data_test$DBWT <- NULL

data_train_b$DBWT <- NULL
data_test_b$DBWT <- NULL

data_numeric_vars <- data_numeric_vars[data_numeric_vars != "DBWT"]

3. Up-sampling

Up-sampling the data means that we create new observations for the minority class in an unbalanced dataset. It is accomplished by making copies of observations and adding them to the dataset, so we can increase the size of the smaller classes.

set.seed(1922)
up_train <- upSample(x = data_train,
                     y = data_train$HEALTHY)                         
table(up_train$HEALTHY)

## 
## underweight     healthy  overweight 
##      171129      171129      171129

set.seed(1922)
up_train_b <- upSample(x = data_train_b,
                     y = data_train_b$HEALTHY)                         
table(up_train_b$HEALTHY)

## 
##     No    Yes 
## 171275 171275

For the modeling phase we decided to keep all 4 datasets - unbalanced with binary variable (data_train_b), balanced with binary variable (up_train_b), unbalanced with nominal variable (data_train) and balanced with nominal variable (up_train). We will compare the results between all of them.

4. Data Transformation

4.1. Variables Distribution

To decide how to treat particular variables we generated plots for their distributions and distributions of their transormed forms. The graphs for all of the variables are accessible within the attached R code. Below we presented graphs of the variables chosen to be transformed.

Selected for Transformation

These variables show enough normality to be worth keeping after necessery transformations.

BMI

PWgt_R

4.2. Transformation

Let’s apply the transformations to some of our variables that we selected earlier.

data_train[c("BMI", "PWgt_R")] <- log(data_train[c("BMI", "PWgt_R")])
data_test[c("BMI", "PWgt_R")] <- log(data_test[c("BMI", "PWgt_R")])

data_train_b[c("BMI", "PWgt_R")] <- log(data_train_b[c("BMI", "PWgt_R")])
data_test_b[c("BMI", "PWgt_R")] <- log(data_test_b[c("BMI", "PWgt_R")])

5. Feature Selection

The next step in development of our predictive model will be reducing the number of input variables by checking which of them influence the dependent variable the most.

5.1. Numeric

5.1.1. Colinearity

First, the correlation of the numeric variables will be explored. Let’s perform a check for multicolinearity - if any of our variables are correlated with each other and due to that can be dropped.

data_linear_combinations <- findLinearCombos(data_train[, data_numeric_vars])
data_linear_combinations

## $linearCombos
## list()
## 
## $remove
## NULL

Turns out no sets of variables are correlated significantly enough.

5.1.2. Correalations

Correlation explains how one or more variables are related to each other. It determines how attributes change in relation with each other which is extremely helpful in further feature selection. We will check if the numeric variables impact our dependent variable. Due to the fact that our variable is categorical, we will use ANOVA to check that. If we use just one continuous predictor in the test we can just “flip” the model around so that outcome variable and predictor variable switch places.

for (variable in data_numeric_vars) {
  anova1 <- aov(data_train[[variable]] ~ data_train$HEALTHY)
  print(summary(anova1))
}

##                        Df Sum Sq Mean Sq F value Pr(>F)    
## data_train$HEALTHY      2     68   33.94   653.9 <2e-16 ***
## Residuals          202803  10525    0.05                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                        Df  Sum Sq Mean Sq F value Pr(>F)    
## data_train$HEALTHY      2    5345  2672.3   146.3 <2e-16 ***
## Residuals          202803 3705307    18.3                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                        Df  Sum Sq Mean Sq F value Pr(>F)    
## data_train$HEALTHY      2    9303    4651   102.6 <2e-16 ***
## Residuals          202803 9196050      45                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                        Df  Sum Sq Mean Sq F value Pr(>F)    
## data_train$HEALTHY      2   21879   10939    1414 <2e-16 ***
## Residuals          202803 1568635       8                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                        Df  Sum Sq Mean Sq F value Pr(>F)    
## data_train$HEALTHY      2   34953   17477    1116 <2e-16 ***
## Residuals          202803 3174933      16                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                        Df Sum Sq Mean Sq F value   Pr(>F)    
## data_train$HEALTHY      2      1  0.7126   28.71 3.42e-13 ***
## Residuals          202803   5034  0.0248                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                        Df Sum Sq Mean Sq F value Pr(>F)    
## data_train$HEALTHY      2    592  295.96   188.5 <2e-16 ***
## Residuals          202803 318385    1.57                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                        Df Sum Sq Mean Sq F value Pr(>F)    
## data_train$HEALTHY      2    125   62.63   80.93 <2e-16 ***
## Residuals          202803 156953    0.77                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                        Df Sum Sq Mean Sq F value Pr(>F)    
## data_train$HEALTHY      2    157   78.27    1369 <2e-16 ***
## Residuals          202803  11596    0.06                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                        Df Sum Sq Mean Sq F value Pr(>F)    
## data_train$HEALTHY      2     25  12.422   37.07 <2e-16 ***
## Residuals          202803  67961   0.335                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                        Df   Sum Sq Mean Sq F value Pr(>F)    
## data_train$HEALTHY      2   635681  317840    1464 <2e-16 ***
## Residuals          202803 44026038     217                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

for (variable in data_numeric_vars) {
  anova1 <- aov(data_train_b[[variable]] ~ data_train_b$HEALTHY)
  print(summary(anova1))
}

##                          Df Sum Sq Mean Sq F value Pr(>F)    
## data_train_b$HEALTHY      1     40   40.05   769.7 <2e-16 ***
## Residuals            202804  10552    0.05                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                          Df  Sum Sq Mean Sq F value Pr(>F)  
## data_train_b$HEALTHY      1      84   84.11   4.597  0.032 *
## Residuals            202804 3710568   18.30                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                          Df  Sum Sq Mean Sq F value Pr(>F)    
## data_train_b$HEALTHY      1    7372    7372   162.5 <2e-16 ***
## Residuals            202804 9197981      45                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                          Df  Sum Sq Mean Sq F value Pr(>F)    
## data_train_b$HEALTHY      1    3587    3587   458.4 <2e-16 ***
## Residuals            202804 1586927       8                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                          Df  Sum Sq Mean Sq F value Pr(>F)    
## data_train_b$HEALTHY      1    3732    3732   236.1 <2e-16 ***
## Residuals            202804 3206154      16                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                          Df Sum Sq Mean Sq F value Pr(>F)    
## data_train_b$HEALTHY      1      0 0.26886   10.83  0.001 ***
## Residuals            202804   5035 0.02483                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                          Df Sum Sq Mean Sq F value Pr(>F)    
## data_train_b$HEALTHY      1    343   342.5     218 <2e-16 ***
## Residuals            202804 318634     1.6                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                          Df Sum Sq Mean Sq F value Pr(>F)    
## data_train_b$HEALTHY      1     86   85.93     111 <2e-16 ***
## Residuals            202804 156992    0.77                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                          Df Sum Sq Mean Sq F value Pr(>F)    
## data_train_b$HEALTHY      1     67   66.71    1158 <2e-16 ***
## Residuals            202804  11686    0.06                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                          Df Sum Sq Mean Sq F value   Pr(>F)    
## data_train_b$HEALTHY      1     20  19.572    58.4 2.15e-14 ***
## Residuals            202804  67966   0.335                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                          Df   Sum Sq Mean Sq F value Pr(>F)    
## data_train_b$HEALTHY      1    34209   34209   155.5 <2e-16 ***
## Residuals            202804 44627510     220                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All of the numeric variables returned very low p-value, which means that in all cases we can reject the null hypothesis. However, we cannot 100% trust the p-value, because when it comes to the big datasets even the small fluctuations may have a significant influence on the p-value. We decided to keep all the numeric variables.

5.2. Categorical

The Cramer’s V coefficient is used to test the strength of relationship between two categorical variables. It takes values from 0 to 1 unless both variables have only two levels - then it takes values from -1 to 1. Let’s test how all of our categorical variables are related to HEALTHY.

for (variable in data_categorical_vars) {
  a <- CramerV(data_train[["HEALTHY"]], data_train[[variable]])
  print(paste(variable, ":", a))
}

## [1] "DMAR : 0.0440713667140329"
## [1] "DOB_MM : 0.00814488682307605"
## [1] "FRACE6 : 0.0681983233639833"
## [1] "MBSTATE_REC : 0.0215136072186378"
## [1] "MRAVE6 : 0.0726345542285085"
## [1] "PAY_REC : 0.0352384050162547"
## [1] "PRECARE : 0.0408195025868523"
## [1] "RESTATUS : 0.0209088651591672"
## [1] "IP_GON : 0.0127828941039331"
## [1] "LD_INDL : 0.0284221751263552"
## [1] "MM_AICU : 0.0325762751352447"
## [1] "MTRAN : 0.0816504524801277"
## [1] "NO_INFEC : 0.0190780271157186"
## [1] "NO_MMORB : 0.0205959168245148"
## [1] "NO_RISKS : 0.0886319642005398"
## [1] "RF_CESAR : 0.0209533042371427"
## [1] "SEX : 0.0788267421505673"
## [1] "MEDUC : 0.0350560668217398"
## [1] "FEDUC : 0.0321884239034381"
## [1] "MAGER : 0.0435614429089168"
## [1] "HEALTHY : 1"

for (variable in data_categorical_vars) {
  a <- CramerV(data_train_b[["HEALTHY"]], data_train_b[[variable]])
  print(paste(variable, ":", a))
}

## [1] "DMAR : 0.00847955422154093"
## [1] "DOB_MM : 0.0086062712598186"
## [1] "FRACE6 : 0.0310867652029407"
## [1] "MBSTATE_REC : 0.024888068372129"
## [1] "MRAVE6 : 0.0306120791815099"
## [1] "PAY_REC : 0.00455057987907299"
## [1] "PRECARE : 0.0272564788442899"
## [1] "RESTATUS : 0.0275891874986047"
## [1] "IP_GON : 0.000764292889752201"
## [1] "LD_INDL : 0.0137224491034639"
## [1] "MM_AICU : 0.0201904723879968"
## [1] "MTRAN : 0.0767307914499367"
## [1] "NO_INFEC : 0.00359771611011677"
## [1] "NO_MMORB : 0.0200436961122356"
## [1] "NO_RISKS : 0.0620005279572291"
## [1] "RF_CESAR : 0.0156487866424666"
## [1] "SEX : 0.0423738189231064"
## [1] "MEDUC : 0.0104070594139674"
## [1] "FEDUC : 0.00902602838486403"
## [1] "MAGER : 0.039232606347136"
## [1] "HEALTHY : 1"

As we can see the relations are not strong - the highest score is equal to 0.1126 for the variable MTRAN. We decided to keep 5 categorical variables with the highest score and later check if they will be jointly significant for the model.

5.3. Final Selection

We will keep all 12 numeric variables and 5 categorical ones.

selected_variables <- c("HEALTHY","BMI", "CIG_0", "FAGECOMB", "M_Ht_In", "PREVIS", "PRIORDEAD", "PRIORLIVE", "PRIORTERM", "PWgt_R", "RF_CESARN", "WTGAIN", "MTRAN", "NO_RISKS", "MRAVE6", "FRACE6", "PRECARE")

6. Basic Models

The glm() function automatically recodes categorical variables into dummies, assuming by default the first level of the variable as the reference. The next step is to perfom likelihood ratio test in which we check the hypothesis that says that model with selected variables is equally good as a model with just a constant. Low p-value means that our explanatory variables are jointly significant.

For binary dependent variable we tested models:

Logit
Probit

For nominal dependent variables we tested models:

Ordered Logit
Multinomial Logit
Naive Bayes Classifier
Regression Tree

6.1. Logit

Logit model is used to model two level outcome variables. In this model the log odds of the outcome are modeled as a linear combination of the predictor variables.

Logit 1

logit1 <- glm(HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R,
                     family =  binomial(link = "logit"),
                     data = data_train_b)

summary(logit1)

## 
## Call:
## glm(formula = HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R, family = binomial(link = "logit"), 
##     data = data_train_b)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2998   0.5017   0.5526   0.6013   1.8775  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  7.4856473  0.1677609  44.621  < 2e-16 ***
## MTRANU       0.0815166  0.4075253   0.200    0.841    
## MTRANY      -2.1307100  0.0744504 -28.619  < 2e-16 ***
## WTGAIN      -0.0076249  0.0004175 -18.262  < 2e-16 ***
## M_Ht_In     -0.0180049  0.0023583  -7.635 2.26e-14 ***
## PWgt_R      -0.8688742  0.0273250 -31.798  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 175260  on 202805  degrees of freedom
## Residual deviance: 172887  on 202800  degrees of freedom
## AIC: 172899
## 
## Number of Fisher Scoring iterations: 4

lrtest(logit1)

## Likelihood ratio test
## 
## Model 1: HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R
## Model 2: HEALTHY ~ 1
##   #Df LogLik Df  Chisq Pr(>Chisq)    
## 1   6 -86443                         
## 2   1 -87630 -5 2373.1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Logit 1 (upsampling)

logit1_up <- glm(HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R,
                     family =  binomial(link = "logit"),
                     data = up_train_b)

summary(logit1_up)

## 
## Call:
## glm(formula = HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R, family = binomial(link = "logit"), 
##     data = up_train_b)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4898  -1.1785   0.3009   1.1474   2.5826  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.109e+00  7.960e-02  26.502   <2e-16 ***
## MTRANU       2.387e-01  2.301e-01   1.038    0.299    
## MTRANY      -2.153e+00  6.080e-02 -35.410   <2e-16 ***
## WTGAIN      -7.189e-03  2.344e-04 -30.666   <2e-16 ***
## M_Ht_In     -1.656e-02  1.309e-03 -12.654   <2e-16 ***
## PWgt_R      -5.033e-03  8.948e-05 -56.254   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 474875  on 342549  degrees of freedom
## Residual deviance: 467932  on 342544  degrees of freedom
## AIC: 467944
## 
## Number of Fisher Scoring iterations: 4

lrtest(logit1_up)

## Likelihood ratio test
## 
## Model 1: HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R
## Model 2: HEALTHY ~ 1
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   6 -233966                         
## 2   1 -237438 -5 6942.8  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Logit 2

logit2 <- glm(HEALTHY ~ .,
                     family =  binomial(link = "logit"),
                     data = data_train_b %>%
                       dplyr::select(all_of(selected_variables)))

summary(logit2)

## 
## Call:
## glm(formula = HEALTHY ~ ., family = binomial(link = "logit"), 
##     data = data_train_b %>% dplyr::select(all_of(selected_variables)))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7004   0.4723   0.5393   0.6056   1.9026  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  8.6509434  0.4285397  20.187  < 2e-16 ***
## BMI         -9.0306964  1.7430803  -5.181 2.21e-07 ***
## CIG_0        0.0001320  0.0014384   0.092  0.92687    
## FAGECOMB    -0.0071469  0.0009724  -7.349 1.99e-13 ***
## M_Ht_In     -0.3011317  0.0540820  -5.568 2.58e-08 ***
## PREVIS       0.0384406  0.0019032  20.198  < 2e-16 ***
## PRIORDEAD   -0.0481565  0.0358727  -1.342  0.17946    
## PRIORLIVE   -0.0486929  0.0051926  -9.377  < 2e-16 ***
## PRIORTERM   -0.0292671  0.0068325  -4.284 1.84e-05 ***
## PWgt_R       8.2859253  1.7436736   4.752 2.01e-06 ***
## RF_CESARN    0.1439094  0.0127177  11.316  < 2e-16 ***
## WTGAIN      -0.0083305  0.0004219 -19.743  < 2e-16 ***
## MTRANU       0.1474479  0.4088778   0.361  0.71839    
## MTRANY      -2.0056925  0.0754718 -26.575  < 2e-16 ***
## NO_RISKSYes  0.3618065  0.0155439  23.276  < 2e-16 ***
## MRAVE62     -0.0795840  0.0323069  -2.463  0.01376 *  
## MRAVE63      0.1380555  0.0858896   1.607  0.10798    
## MRAVE64     -0.0596548  0.0434249  -1.374  0.16952    
## MRAVE65     -0.0464300  0.1528329  -0.304  0.76128    
## MRAVE66      0.0604788  0.0412164   1.467  0.14228    
## FRACE62      0.0176081  0.0305223   0.577  0.56401    
## FRACE63      0.1416488  0.0869453   1.629  0.10328    
## FRACE64      0.2234106  0.0457085   4.888 1.02e-06 ***
## FRACE65      0.1089339  0.1469721   0.741  0.45858    
## FRACE66     -0.0191680  0.0415776  -0.461  0.64479    
## PRECARE1    -0.1499536  0.0647513  -2.316  0.02057 *  
## PRECARE2    -0.0122761  0.0603144  -0.204  0.83872    
## PRECARE3     0.0459716  0.0598966   0.768  0.44278    
## PRECARE4     0.0814205  0.0617612   1.318  0.18740    
## PRECARE5     0.1021823  0.0652784   1.565  0.11751    
## PRECARE6     0.2180505  0.0703789   3.098  0.00195 ** 
## PRECARE7     0.3472252  0.0741964   4.680 2.87e-06 ***
## PRECARE8     0.4529481  0.0799813   5.663 1.49e-08 ***
## PRECARE9     0.7286252  0.1170714   6.224 4.85e-10 ***
## PRECARE10    1.1782890  1.0389572   1.134  0.25675    
## PRECARE99   -0.2240515  0.0962316  -2.328  0.01990 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 175260  on 202805  degrees of freedom
## Residual deviance: 171428  on 202770  degrees of freedom
## AIC: 171500
## 
## Number of Fisher Scoring iterations: 4

lrtest(logit2)

## Likelihood ratio test
## 
## Model 1: HEALTHY ~ BMI + CIG_0 + FAGECOMB + M_Ht_In + PREVIS + PRIORDEAD + 
##     PRIORLIVE + PRIORTERM + PWgt_R + RF_CESARN + WTGAIN + MTRAN + 
##     NO_RISKS + MRAVE6 + FRACE6 + PRECARE
## Model 2: HEALTHY ~ 1
##   #Df LogLik  Df  Chisq Pr(>Chisq)    
## 1  36 -85714                          
## 2   1 -87630 -35 3832.1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Logit 2 (upsampling)

logit2_up <- glm(HEALTHY ~ .,
                     family =  binomial(link = "logit"),
                     data = up_train_b %>%
                       dplyr::select(all_of(selected_variables)))

summary(logit2_up)

## 
## Call:
## glm(formula = HEALTHY ~ ., family = binomial(link = "logit"), 
##     data = up_train_b %>% dplyr::select(all_of(selected_variables)))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9829  -1.1658   0.1405   1.1354   2.6187  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.1440071  0.3340062   0.431 0.666359    
## BMI          0.0309085  0.0059068   5.233 1.67e-07 ***
## CIG_0       -0.0013929  0.0008103  -1.719 0.085627 .  
## FAGECOMB    -0.0077776  0.0005480 -14.192  < 2e-16 ***
## M_Ht_In      0.0059468  0.0051724   1.150 0.250261    
## PREVIS       0.0338343  0.0009738  34.746  < 2e-16 ***
## PRIORDEAD   -0.0433466  0.0227645  -1.904 0.056893 .  
## PRIORLIVE   -0.0508495  0.0029977 -16.963  < 2e-16 ***
## PRIORTERM   -0.0299650  0.0039657  -7.556 4.16e-14 ***
## PWgt_R      -0.0096073  0.0009969  -9.637  < 2e-16 ***
## RF_CESARN    0.1528317  0.0072067  21.207  < 2e-16 ***
## WTGAIN      -0.0083342  0.0002395 -34.804  < 2e-16 ***
## MTRANU       0.3164232  0.2317999   1.365 0.172231    
## MTRANY      -2.0117197  0.0610869 -32.932  < 2e-16 ***
## NO_RISKSYes  0.3638752  0.0089046  40.864  < 2e-16 ***
## MRAVE62     -0.0873617  0.0183463  -4.762 1.92e-06 ***
## MRAVE63      0.1422239  0.0473295   3.005 0.002656 ** 
## MRAVE64     -0.0432218  0.0241140  -1.792 0.073070 .  
## MRAVE65      0.0604248  0.0879416   0.687 0.492019    
## MRAVE66      0.0533033  0.0229954   2.318 0.020449 *  
## FRACE62      0.0043319  0.0172620   0.251 0.801852    
## FRACE63      0.0774152  0.0474486   1.632 0.102772    
## FRACE64      0.2260829  0.0251931   8.974  < 2e-16 ***
## FRACE65      0.0983333  0.0825727   1.191 0.233705    
## FRACE66     -0.0069185  0.0234562  -0.295 0.768030    
## PRECARE1    -0.0893057  0.0380664  -2.346 0.018974 *  
## PRECARE2     0.0611694  0.0356039   1.718 0.085787 .  
## PRECARE3     0.1214347  0.0354373   3.427 0.000611 ***
## PRECARE4     0.1412539  0.0364780   3.872 0.000108 ***
## PRECARE5     0.1496845  0.0384280   3.895 9.81e-05 ***
## PRECARE6     0.2619382  0.0410276   6.384 1.72e-10 ***
## PRECARE7     0.3721227  0.0428378   8.687  < 2e-16 ***
## PRECARE8     0.4870379  0.0458022  10.633  < 2e-16 ***
## PRECARE9     0.7860846  0.0639819  12.286  < 2e-16 ***
## PRECARE10    0.8067073  0.4517791   1.786 0.074160 .  
## PRECARE99   -0.1731520  0.0572197  -3.026 0.002477 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 474875  on 342549  degrees of freedom
## Residual deviance: 463312  on 342514  degrees of freedom
## AIC: 463384
## 
## Number of Fisher Scoring iterations: 4

lrtest(logit2_up)

## Likelihood ratio test
## 
## Model 1: HEALTHY ~ BMI + CIG_0 + FAGECOMB + M_Ht_In + PREVIS + PRIORDEAD + 
##     PRIORLIVE + PRIORTERM + PWgt_R + RF_CESARN + WTGAIN + MTRAN + 
##     NO_RISKS + MRAVE6 + FRACE6 + PRECARE
## Model 2: HEALTHY ~ 1
##   #Df  LogLik  Df Chisq Pr(>Chisq)    
## 1  36 -231656                         
## 2   1 -237438 -35 11563  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

6.2. Probit

Probit model is used to model binary outcome variables. In this model, the inverse standard normal distribution of the probability is modeled as a linear combination of the predictors.

Probit 1

probit1 <- glm(HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R,
                     family =  binomial(link = "probit"),
                     data = data_train_b)

summary(probit1)

## 
## Call:
## glm(formula = HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R, family = binomial(link = "probit"), 
##     data = data_train_b)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3155   0.5022   0.5538   0.6019   1.8239  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.1491454  0.0924282  44.890  < 2e-16 ***
## MTRANU       0.0462331  0.2194236   0.211    0.833    
## MTRANY      -1.2795898  0.0459468 -27.849  < 2e-16 ***
## WTGAIN      -0.0040800  0.0002331 -17.500  < 2e-16 ***
## M_Ht_In     -0.0095616  0.0012996  -7.357 1.88e-13 ***
## PWgt_R      -0.4733688  0.0151723 -31.199  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 175260  on 202805  degrees of freedom
## Residual deviance: 172914  on 202800  degrees of freedom
## AIC: 172926
## 
## Number of Fisher Scoring iterations: 4

lrtest(probit1)

## Likelihood ratio test
## 
## Model 1: HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R
## Model 2: HEALTHY ~ 1
##   #Df LogLik Df  Chisq Pr(>Chisq)    
## 1   6 -86457                         
## 2   1 -87630 -5 2345.6  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Probit 1 (upsampling)

probit1_up <- glm(HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R,
                     family =  binomial(link = "probit"),
                     data = up_train_b)

summary(probit1_up)

## 
## Call:
## glm(formula = HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R, family = binomial(link = "probit"), 
##     data = up_train_b)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4901  -1.1785   0.3211   1.1477   2.7191  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.312e+00  4.964e-02  26.438   <2e-16 ***
## MTRANU       1.512e-01  1.432e-01   1.056    0.291    
## MTRANY      -1.252e+00  3.153e-02 -39.700   <2e-16 ***
## WTGAIN      -4.458e-03  1.460e-04 -30.538   <2e-16 ***
## M_Ht_In     -1.032e-02  8.166e-04 -12.640   <2e-16 ***
## PWgt_R      -3.126e-03  5.547e-05 -56.361   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 474875  on 342549  degrees of freedom
## Residual deviance: 467954  on 342544  degrees of freedom
## AIC: 467966
## 
## Number of Fisher Scoring iterations: 4

lrtest(probit1_up)

## Likelihood ratio test
## 
## Model 1: HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R
## Model 2: HEALTHY ~ 1
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   6 -233977                         
## 2   1 -237438 -5 6921.4  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Probit 2

probit2 <- glm(HEALTHY ~ .,
                     family =  binomial(link = "probit"),
                     data = data_train_b %>%
                       dplyr::select(all_of(selected_variables)))

summary(probit2)

## 
## Call:
## glm(formula = HEALTHY ~ ., family = binomial(link = "probit"), 
##     data = data_train_b %>% dplyr::select(all_of(selected_variables)))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7627   0.4714   0.5404   0.6071   1.8712  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.9471602  0.2433939  20.326  < 2e-16 ***
## BMI         -5.5190123  0.9818220  -5.621 1.90e-08 ***
## CIG_0       -0.0001668  0.0007935  -0.210 0.833525    
## FAGECOMB    -0.0038645  0.0005372  -7.193 6.33e-13 ***
## M_Ht_In     -0.1829121  0.0305029  -5.997 2.02e-09 ***
## PREVIS       0.0198426  0.0010316  19.235  < 2e-16 ***
## PRIORDEAD   -0.0272798  0.0204408  -1.335 0.182015    
## PRIORLIVE   -0.0273289  0.0029126  -9.383  < 2e-16 ***
## PRIORTERM   -0.0166323  0.0038361  -4.336 1.45e-05 ***
## PWgt_R       5.1120224  0.9820949   5.205 1.94e-07 ***
## RF_CESARN    0.0811677  0.0071155  11.407  < 2e-16 ***
## WTGAIN      -0.0045718  0.0002359 -19.383  < 2e-16 ***
## MTRANU       0.0767861  0.2200644   0.349 0.727145    
## MTRANY      -1.2101311  0.0463416 -26.113  < 2e-16 ***
## NO_RISKSYes  0.2027026  0.0087195  23.247  < 2e-16 ***
## MRAVE62     -0.0457504  0.0179277  -2.552 0.010713 *  
## MRAVE63      0.0728903  0.0466648   1.562 0.118288    
## MRAVE64     -0.0330758  0.0234941  -1.408 0.159181    
## MRAVE65     -0.0255899  0.0846242  -0.302 0.762352    
## MRAVE66      0.0313114  0.0225413   1.389 0.164813    
## FRACE62      0.0076409  0.0168967   0.452 0.651114    
## FRACE63      0.0735208  0.0471812   1.558 0.119170    
## FRACE64      0.1198286  0.0245863   4.874 1.09e-06 ***
## FRACE65      0.0545208  0.0805867   0.677 0.498692    
## FRACE66     -0.0103671  0.0228833  -0.453 0.650519    
## PRECARE1    -0.0601364  0.0368897  -1.630 0.103066    
## PRECARE2     0.0172995  0.0344621   0.502 0.615677    
## PRECARE3     0.0477498  0.0342424   1.394 0.163177    
## PRECARE4     0.0661888  0.0352354   1.878 0.060316 .  
## PRECARE5     0.0748010  0.0371161   2.015 0.043871 *  
## PRECARE6     0.1364267  0.0397110   3.435 0.000591 ***
## PRECARE7     0.2052904  0.0415659   4.939 7.86e-07 ***
## PRECARE8     0.2598345  0.0444195   5.850 4.93e-09 ***
## PRECARE9     0.4060971  0.0625101   6.497 8.22e-11 ***
## PRECARE10    0.6295907  0.5058268   1.245 0.213251    
## PRECARE99   -0.1085057  0.0549586  -1.974 0.048345 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 175260  on 202805  degrees of freedom
## Residual deviance: 171450  on 202770  degrees of freedom
## AIC: 171522
## 
## Number of Fisher Scoring iterations: 4

lrtest(probit2)

## Likelihood ratio test
## 
## Model 1: HEALTHY ~ BMI + CIG_0 + FAGECOMB + M_Ht_In + PREVIS + PRIORDEAD + 
##     PRIORLIVE + PRIORTERM + PWgt_R + RF_CESARN + WTGAIN + MTRAN + 
##     NO_RISKS + MRAVE6 + FRACE6 + PRECARE
## Model 2: HEALTHY ~ 1
##   #Df LogLik  Df  Chisq Pr(>Chisq)    
## 1  36 -85725                          
## 2   1 -87630 -35 3809.8  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Probit 2 (upsampling)

probit2_up <- glm(HEALTHY ~ .,
                     family =  binomial(link = "probit"),
                     data = up_train_b %>%
                       dplyr::select(all_of(selected_variables)))

summary(probit2_up)

## 
## Call:
## glm(formula = HEALTHY ~ ., family = binomial(link = "probit"), 
##     data = up_train_b %>% dplyr::select(all_of(selected_variables)))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0157  -1.1662   0.1561   1.1359   2.7698  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.1334981  0.2065348   0.646 0.518039    
## BMI          0.0183137  0.0036468   5.022 5.12e-07 ***
## CIG_0       -0.0008612  0.0005039  -1.709 0.087454 .  
## FAGECOMB    -0.0048288  0.0003410 -14.161  < 2e-16 ***
## M_Ht_In      0.0029844  0.0031980   0.933 0.350701    
## PREVIS       0.0208686  0.0006024  34.642  < 2e-16 ***
## PRIORDEAD   -0.0263235  0.0141122  -1.865 0.062139 .  
## PRIORLIVE   -0.0314275  0.0018611 -16.887  < 2e-16 ***
## PRIORTERM   -0.0186044  0.0024619  -7.557 4.13e-14 ***
## PWgt_R      -0.0058005  0.0006151  -9.431  < 2e-16 ***
## RF_CESARN    0.0945636  0.0044804  21.106  < 2e-16 ***
## WTGAIN      -0.0051383  0.0001486 -34.585  < 2e-16 ***
## MTRANU       0.1982839  0.1438730   1.378 0.168146    
## MTRANY      -1.1692965  0.0319060 -36.648  < 2e-16 ***
## NO_RISKSYes  0.2261276  0.0055346  40.857  < 2e-16 ***
## MRAVE62     -0.0542894  0.0114100  -4.758 1.95e-06 ***
## MRAVE63      0.0926131  0.0294341   3.146 0.001653 ** 
## MRAVE64     -0.0260235  0.0150222  -1.732 0.083214 .  
## MRAVE65      0.0371983  0.0546750   0.680 0.496281    
## MRAVE66      0.0330876  0.0143040   2.313 0.020713 *  
## FRACE62      0.0025009  0.0107386   0.233 0.815848    
## FRACE63      0.0508883  0.0295082   1.725 0.084609 .  
## FRACE64      0.1401428  0.0156800   8.938  < 2e-16 ***
## FRACE65      0.0619734  0.0513901   1.206 0.227841    
## FRACE66     -0.0048083  0.0145973  -0.329 0.741858    
## PRECARE1    -0.0562694  0.0235201  -2.392 0.016739 *  
## PRECARE2     0.0371940  0.0219783   1.692 0.090587 .  
## PRECARE3     0.0748041  0.0218752   3.420 0.000627 ***
## PRECARE4     0.0870006  0.0225319   3.861 0.000113 ***
## PRECARE5     0.0917247  0.0237531   3.862 0.000113 ***
## PRECARE6     0.1622666  0.0253840   6.392 1.63e-10 ***
## PRECARE7     0.2304884  0.0265087   8.695  < 2e-16 ***
## PRECARE8     0.3011189  0.0283411  10.625  < 2e-16 ***
## PRECARE9     0.4855915  0.0394482  12.310  < 2e-16 ***
## PRECARE10    0.5008515  0.2793252   1.793 0.072961 .  
## PRECARE99   -0.1070841  0.0353945  -3.025 0.002483 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 474875  on 342549  degrees of freedom
## Residual deviance: 463347  on 342514  degrees of freedom
## AIC: 463419
## 
## Number of Fisher Scoring iterations: 4

lrtest(probit2_up)

## Likelihood ratio test
## 
## Model 1: HEALTHY ~ BMI + CIG_0 + FAGECOMB + M_Ht_In + PREVIS + PRIORDEAD + 
##     PRIORLIVE + PRIORTERM + PWgt_R + RF_CESARN + WTGAIN + MTRAN + 
##     NO_RISKS + MRAVE6 + FRACE6 + PRECARE
## Model 2: HEALTHY ~ 1
##   #Df  LogLik  Df Chisq Pr(>Chisq)    
## 1  36 -231673                         
## 2   1 -237438 -35 11529  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

6.3. Ordered Logit

The ordered logit model is a regression model for an ordinal outcome variable. The model is based on the cumulative probabilities of the response variable. Our variable is nominal (we can say that being “healthy” is the best but we cannot order if “underweight” or “overweight” is worse) but we decided to test the performance of this model as well and compare with others.

Ordered Logit 1

ordered_logit1 <- polr(HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R, data = data_train, Hess=TRUE)
summary(ordered_logit1)

## Call:
## polr(formula = HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R, data = data_train, 
##     Hess = TRUE)
## 
## Coefficients:
##            Value Std. Error  t value
## MTRANU   0.15564  0.3829793   0.4064
## MTRANY  -2.83394  0.0764749 -37.0572
## WTGAIN   0.02407  0.0004258  56.5279
## M_Ht_In  0.06269  0.0023503  26.6757
## PWgt_R   1.14950  0.0278901  41.2152
## 
## Intercepts:
##                     Value    Std. Error t value 
## underweight|healthy   7.8862   0.1665    47.3697
## healthy|overweight   13.0458   0.1701    76.6993
## 
## Residual Deviance: 212006.11 
## AIC: 212020.11

table <- coef(summary(ordered_logit1))
pvalue <- pnorm(abs(table[, "t value"]), lower.tail = FALSE) * 2
table <- cbind(table, "p value" = pvalue)
table

##                           Value Std. Error     t value       p value
## MTRANU               0.15563960 0.38297935   0.4063916  6.844549e-01
## MTRANY              -2.83394258 0.07647486 -37.0571803 1.376082e-300
## WTGAIN               0.02406675 0.00042575  56.5278864  0.000000e+00
## M_Ht_In              0.06269460 0.00235025  26.6757137 9.007545e-157
## PWgt_R               1.14949714 0.02789011  41.2152301  0.000000e+00
## underweight|healthy  7.88623137 0.16648260  47.3697047  0.000000e+00
## healthy|overweight  13.04577311 0.17008993  76.6992681  0.000000e+00

Ordered Logit 1 (upsampling)

ordered_logit1_up <- polr(HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R, data = up_train, Hess=TRUE)
summary(ordered_logit1_up)

## Call:
## polr(formula = HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R, data = up_train, 
##     Hess = TRUE)
## 
## Coefficients:
##             Value Std. Error t value
## MTRANU   0.185832  5.830e-05 3187.32
## MTRANY  -2.821313  4.304e-02  -65.56
## WTGAIN   0.030540  1.867e-04  163.58
## M_Ht_In  0.089727  1.010e-03   88.85
## PWgt_R   0.007691  6.906e-05  111.37
## 
## Intercepts:
##                     Value     Std. Error t value  
## underweight|healthy    7.1404    0.0616   115.9909
## healthy|overweight     8.6730    0.0620   139.9549
## 
## Residual Deviance: 1062356.53 
## AIC: 1062370.53

table <- coef(summary(ordered_logit1_up))
pvalue <- pnorm(abs(table[, "t value"]), lower.tail = FALSE) * 2
table <- cbind(table, "p value" = pvalue)
table

##                            Value   Std. Error    t value p value
## MTRANU               0.185831739 5.830341e-05 3187.32176       0
## MTRANY              -2.821312819 4.303678e-02  -65.55585       0
## WTGAIN               0.030539817 1.866985e-04  163.57826       0
## M_Ht_In              0.089726723 1.009869e-03   88.84984       0
## PWgt_R               0.007690566 6.905526e-05  111.36829       0
## underweight|healthy  7.140412132 6.156013e-02  115.99085       0
## healthy|overweight   8.672994037 6.196991e-02  139.95492       0

Ordered Logit 2

ordered_logit2 <- polr(HEALTHY ~ ., data = data_train %>%
                       dplyr::select(all_of(selected_variables)), Hess=TRUE)
summary(ordered_logit2)

## Call:
## polr(formula = HEALTHY ~ ., data = data_train %>% dplyr::select(all_of(selected_variables)), 
##     Hess = TRUE)
## 
## Coefficients:
##                 Value Std. Error  t value
## BMI          3.216841  0.0176363 182.3987
## CIG_0       -0.027977  0.0013557 -20.6366
## FAGECOMB     0.001082  0.0009769   1.1079
## M_Ht_In      0.156993  0.0019280  81.4268
## PREVIS       0.071178  0.0017688  40.2411
## PRIORDEAD   -0.199081  0.0353148  -5.6373
## PRIORLIVE    0.110843  0.0053851  20.5831
## PRIORTERM   -0.054770  0.0071485  -7.6618
## PWgt_R      -1.833709  0.0313560 -58.4804
## RF_CESARN    0.296939  0.0130434  22.7655
## WTGAIN       0.024450  0.0004338  56.3661
## MTRANU       0.092236  0.3818212   0.2416
## MTRANY      -2.631615  0.0795610 -33.0767
## NO_RISKSYes  0.693896  0.0165055  42.0403
## MRAVE62     -0.551976  0.0329198 -16.7673
## MRAVE63      0.176262  0.0813627   2.1664
## MRAVE64      0.089000  0.0418414   2.1271
## MRAVE65     -0.044632  0.1519001  -0.2938
## MRAVE66     -0.066339  0.0404528  -1.6399
## FRACE62     -0.294842  0.0309789  -9.5175
## FRACE63      0.080730  0.0824449   0.9792
## FRACE64     -0.272273  0.0433080  -6.2869
## FRACE65     -0.362235  0.1438379  -2.5184
## FRACE66     -0.175841  0.0416263  -4.2243
## PRECARE1    -0.514185  0.0670302  -7.6709
## PRECARE2    -0.173062  0.0625298  -2.7677
## PRECARE3    -0.074238  0.0621773  -1.1940
## PRECARE4    -0.042645  0.0640129  -0.6662
## PRECARE5    -0.028957  0.0674073  -0.4296
## PRECARE6     0.139886  0.0717880   1.9486
## PRECARE7     0.219416  0.0744073   2.9488
## PRECARE8     0.514057  0.0788805   6.5169
## PRECARE9     0.894303  0.1038256   8.6135
## PRECARE10    0.895591  0.0051332 174.4701
## PRECARE99   -0.159738  0.1032545  -1.5470
## 
## Intercepts:
##                     Value    Std. Error t value 
## underweight|healthy  10.5653   0.1806    58.4959
## healthy|overweight   15.9294   0.1846    86.2834
## 
## Residual Deviance: 205516.62 
## AIC: 205590.62

table <- coef(summary(ordered_logit2))
pvalue <- pnorm(abs(table[, "t value"]), lower.tail = FALSE) * 2
table <- cbind(table, "p value" = pvalue)
table

##                            Value   Std. Error     t value       p value
## BMI                  3.216841284 0.0176363210 182.3986586  0.000000e+00
## CIG_0               -0.027977359 0.0013557173 -20.6365736  1.288816e-94
## FAGECOMB             0.001082346 0.0009769337   1.1079015  2.679044e-01
## M_Ht_In              0.156993428 0.0019280320  81.4267757  0.000000e+00
## PREVIS               0.071178377 0.0017687998  40.2410581  0.000000e+00
## PRIORDEAD           -0.199081439 0.0353148433  -5.6373304  1.727066e-08
## PRIORLIVE            0.110842665 0.0053851309  20.5830956  3.890266e-94
## PRIORTERM           -0.054770157 0.0071485093  -7.6617593  1.834032e-14
## PWgt_R              -1.833709491 0.0313559923 -58.4803528  0.000000e+00
## RF_CESARN            0.296939396 0.0130433698  22.7655430 1.006790e-114
## WTGAIN               0.024450155 0.0004337739  56.3661323  0.000000e+00
## MTRANU               0.092235864 0.3818212414   0.2415682  8.091148e-01
## MTRANY              -2.631615483 0.0795610140 -33.0766961 6.430013e-240
## NO_RISKSYes          0.693895722 0.0165055055  42.0402589  0.000000e+00
## MRAVE62             -0.551976259 0.0329198010 -16.7673025  4.232942e-63
## MRAVE63              0.176261974 0.0813626801   2.1663737  3.028264e-02
## MRAVE64              0.089000113 0.0418414232   2.1270814  3.341332e-02
## MRAVE65             -0.044632288 0.1519000626  -0.2938267  7.688904e-01
## MRAVE66             -0.066339441 0.0404527901  -1.6399225  1.010213e-01
## FRACE62             -0.294842034 0.0309789157  -9.5175066  1.773840e-21
## FRACE63              0.080730122 0.0824448725   0.9792012  3.274806e-01
## FRACE64             -0.272272794 0.0433079883  -6.2868954  3.238774e-10
## FRACE65             -0.362235126 0.1438378519  -2.5183575  1.179036e-02
## FRACE66             -0.175841362 0.0416263007  -4.2242851  2.397005e-05
## PRECARE1            -0.514184960 0.0670302469  -7.6709394  1.707412e-14
## PRECARE2            -0.173062330 0.0625297948  -2.7676779  5.645723e-03
## PRECARE3            -0.074238480 0.0621772765  -1.1939809  2.324854e-01
## PRECARE4            -0.042644883 0.0640128957  -0.6661921  5.052884e-01
## PRECARE5            -0.028957299 0.0674073033  -0.4295870  6.674961e-01
## PRECARE6             0.139885667 0.0717880268   1.9485933  5.134401e-02
## PRECARE7             0.219415785 0.0744073319   2.9488463  3.189626e-03
## PRECARE8             0.514056701 0.0788804994   6.5169047  7.177297e-11
## PRECARE9             0.894302869 0.1038256123   8.6135092  7.085753e-18
## PRECARE10            0.895590681 0.0051332046 174.4700939  0.000000e+00
## PRECARE99           -0.159738091 0.1032544608  -1.5470333  1.218552e-01
## underweight|healthy 10.565262329 0.1806153126  58.4959391  0.000000e+00
## healthy|overweight  15.929353927 0.1846167551  86.2833599  0.000000e+00

Ordered Logit 2 (upsampling)

ordered_logit2_up <- polr(HEALTHY ~ ., data = up_train %>%
                       dplyr::select(all_of(selected_variables)), Hess=TRUE)
summary(ordered_logit2_up)

## Call:
## polr(formula = HEALTHY ~ ., data = up_train %>% dplyr::select(all_of(selected_variables)), 
##     Hess = TRUE)
## 
## Coefficients:
##                 Value Std. Error  t value
## BMI          0.045555  1.241e-03   36.703
## CIG_0       -0.041130  6.996e-04  -58.791
## FAGECOMB     0.001368  4.272e-04    3.201
## M_Ht_In      0.116964  5.555e-04  210.575
## PREVIS       0.088718  7.939e-04  111.744
## PRIORDEAD   -0.341795  1.909e-02  -17.901
## PRIORLIVE    0.135021  2.340e-03   57.711
## PRIORTERM   -0.070324  3.087e-03  -22.778
## PWgt_R       0.001931  2.144e-04    9.004
## RF_CESARN    0.373115  5.730e-03   65.118
## WTGAIN       0.030131  1.928e-04  156.269
## MTRANU       0.059441  5.837e-05 1018.391
## MTRANY      -2.560628  4.408e-02  -58.091
## NO_RISKSYes  0.887450  7.023e-03  126.358
## MRAVE62     -0.723349  1.436e-02  -50.372
## MRAVE63      0.215550  3.666e-02    5.880
## MRAVE64      0.097958  1.857e-02    5.276
## MRAVE65      0.070263  6.667e-02    1.054
## MRAVE66     -0.100390  1.774e-02   -5.658
## FRACE62     -0.380369  1.347e-02  -28.246
## FRACE63      0.070093  3.691e-02    1.899
## FRACE64     -0.420195  1.946e-02  -21.591
## FRACE65     -0.475187  6.305e-02   -7.537
## FRACE66     -0.197115  1.817e-02  -10.848
## PRECARE1    -0.709847  3.007e-02  -23.605
## PRECARE2    -0.253104  2.813e-02   -8.999
## PRECARE3    -0.137133  2.798e-02   -4.901
## PRECARE4    -0.103124  2.876e-02   -3.585
## PRECARE5    -0.128586  3.026e-02   -4.249
## PRECARE6     0.126580  3.223e-02    3.927
## PRECARE7     0.194718  3.366e-02    5.786
## PRECARE8     0.593472  3.584e-02   16.561
## PRECARE9     1.160482  5.020e-02   23.119
## PRECARE10    1.196966  1.443e-03  829.710
## PRECARE99   -0.253025  4.443e-02   -5.695
## 
## Intercepts:
##                     Value     Std. Error t value  
## underweight|healthy   10.5590    0.0030  3539.3998
## healthy|overweight    12.2419    0.0047  2584.0796
## 
## Residual Deviance: 1002139.28 
## AIC: 1002213.28

table <- coef(summary(ordered_logit2_up))
pvalue <- pnorm(abs(table[, "t value"]), lower.tail = FALSE) * 2
table <- cbind(table, "p value" = pvalue)
table

##                            Value   Std. Error     t value       p value
## BMI                  0.045554567 0.0012411741   36.702803 6.588587e-295
## CIG_0               -0.041130045 0.0006995941  -58.791299  0.000000e+00
## FAGECOMB             0.001367555 0.0004271872    3.201302  1.368081e-03
## M_Ht_In              0.116964352 0.0005554513  210.575358  0.000000e+00
## PREVIS               0.088718447 0.0007939442  111.743932  0.000000e+00
## PRIORDEAD           -0.341794795 0.0190937953  -17.900831  1.161763e-71
## PRIORLIVE            0.135020820 0.0023396022   57.711015  0.000000e+00
## PRIORTERM           -0.070324389 0.0030874015  -22.777857 7.601962e-115
## PWgt_R               0.001930807 0.0002144394    9.003976  2.176882e-19
## RF_CESARN            0.373114613 0.0057297944   65.118325  0.000000e+00
## WTGAIN               0.030131149 0.0001928159  156.269038  0.000000e+00
## MTRANU               0.059440944 0.0000583675 1018.391142  0.000000e+00
## MTRANY              -2.560628363 0.0440795348  -58.091093  0.000000e+00
## NO_RISKSYes          0.887450302 0.0070233267  126.357542  0.000000e+00
## MRAVE62             -0.723349233 0.0143601412  -50.372014  0.000000e+00
## MRAVE63              0.215549789 0.0366560940    5.880326  4.094588e-09
## MRAVE64              0.097958261 0.0185670679    5.275914  1.320957e-07
## MRAVE65              0.070262654 0.0666736041    1.053830  2.919607e-01
## MRAVE66             -0.100390167 0.0177429751   -5.658023  1.531263e-08
## FRACE62             -0.380369152 0.0134664172  -28.245757 1.604810e-175
## FRACE63              0.070092721 0.0369057151    1.899238  5.753325e-02
## FRACE64             -0.420194610 0.0194616856  -21.590864 2.188819e-103
## FRACE65             -0.475187063 0.0630454993   -7.537208  4.801389e-14
## FRACE66             -0.197114870 0.0181698150  -10.848480  2.027697e-27
## PRECARE1            -0.709847074 0.0300713143  -23.605456 3.387674e-123
## PRECARE2            -0.253103834 0.0281265631   -8.998747  2.283077e-19
## PRECARE3            -0.137133168 0.0279789569   -4.901297  9.520616e-07
## PRECARE4            -0.103123813 0.0287617415   -3.585451  3.364963e-04
## PRECARE5            -0.128585688 0.0302645107   -4.248728  2.149873e-05
## PRECARE6             0.126579990 0.0322299895    3.927398  8.586989e-05
## PRECARE7             0.194717948 0.0336556906    5.785588  7.225926e-09
## PRECARE8             0.593471610 0.0358358855   16.560819  1.337611e-61
## PRECARE9             1.160482439 0.0501951147   23.119430 2.952621e-118
## PRECARE10            1.196966159 0.0014426322  829.709854  0.000000e+00
## PRECARE99           -0.253024919 0.0444299216   -5.694922  1.234286e-08
## underweight|healthy 10.558951565 0.0029832604 3539.399850  0.000000e+00
## healthy|overweight  12.241915421 0.0047374375 2584.079583  0.000000e+00

6.4. Multinomial Logit

Multinomial logistic regression is used to model nominal outcome variables, in which the log odds of the outcomes are modeled as a linear combination of the predictor variables.

Multinomial Logit 1

mlogit1 <- multinom(HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R, data = data_train)

## # weights:  21 (12 variable)
## initial  value 222805.163817 
## iter  10 value 108712.888396
## iter  20 value 105476.802171
## final  value 105476.061731 
## converged

summary(mlogit1)

## Call:
## multinom(formula = HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R, 
##     data = data_train)
## 
## Coefficients:
##            (Intercept)    MTRANU    MTRANY     WTGAIN    M_Ht_In    PWgt_R
## healthy      -1.920928 0.2567871 -2.820386 0.01717018 0.04212227 0.2395064
## overweight  -18.805410 0.3354321 -2.944548 0.04213521 0.11887564 1.9686275
## 
## Std. Errors:
##            (Intercept)    MTRANU     MTRANY       WTGAIN     M_Ht_In     PWgt_R
## healthy      0.2362039 0.2818162 0.07600255 0.0006593361 0.003328712 0.03944682
## overweight   0.3171847 0.1989627 0.18930194 0.0008087449 0.004394770 0.05128319
## 
## Residual Deviance: 210952.1 
## AIC: 210976.1

Multinomial Logit 1 (upsampling)

mlogit1_up <- multinom(HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R, data = up_train)

## # weights:  21 (12 variable)
## initial  value 564013.267042 
## iter  10 value 539125.069731
## iter  20 value 529295.453728
## final  value 529173.568246 
## converged

summary(mlogit1_up)

## Call:
## multinom(formula = HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R, 
##     data = up_train)
## 
## Coefficients:
##            (Intercept)    MTRANU    MTRANY     WTGAIN    M_Ht_In       PWgt_R
## healthy      -3.442004 0.3816803 -2.801082 0.01541438 0.04614605 0.0005673912
## overweight  -11.105672 0.3026172 -2.827032 0.04057141 0.12814230 0.0099425461
## 
## Std. Errors:
##            (Intercept)       MTRANU     MTRANY       WTGAIN     M_Ht_In
## healthy     0.08100244 7.664477e-05 0.05925149 0.0002531117 0.001335214
## overweight  0.08648503 1.030254e-04 0.05769547 0.0002584301 0.001399310
##                  PWgt_R
## healthy    9.405926e-05
## overweight 9.294986e-05
## 
## Residual Deviance: 1058347 
## AIC: 1058371

Multinomial Logit 2

mlogit2 <- multinom(HEALTHY ~ ., data = data_train %>%
                       dplyr::select(all_of(selected_variables)))

## # weights:  111 (72 variable)
## initial  value 222805.163817 
## iter  10 value 118504.274965
## iter  20 value 114584.343066
## iter  30 value 112255.956964
## iter  40 value 108389.154977
## iter  50 value 104387.335215
## iter  60 value 102384.963160
## iter  70 value 101540.490620
## iter  80 value 101493.544512
## final  value 101492.726191 
## converged

summary(mlogit2)

## Call:
## multinom(formula = HEALTHY ~ ., data = data_train %>% dplyr::select(all_of(selected_variables)))
## 
## Coefficients:
##            (Intercept)        BMI       CIG_0     FAGECOMB    M_Ht_In    PREVIS
## healthy      -3.604689  -2.676985 -0.02133371 -0.005964705 -0.0493575 0.1217507
## overweight  -19.805450 -12.813361 -0.05951769  0.003506744 -0.2884975 0.1494623
##             PRIORDEAD PRIORLIVE   PRIORTERM    PWgt_R RF_CESARN     WTGAIN
## healthy    -0.1703941 0.0575048 -0.07476282  3.248499 0.3931615 0.01589951
## overweight -0.3649704 0.1904451 -0.09196093 15.276666 0.5070517 0.04257545
##               MTRANU    MTRANY NO_RISKSYes    MRAVE62   MRAVE63    MRAVE64
## healthy    0.2573399 -2.587043   0.9411365 -0.4715966 0.4038744 0.02901287
## overweight 0.1712757 -2.682084   1.2044148 -1.0138549 0.4336734 0.13079753
##                 MRAVE65     MRAVE66    FRACE62   FRACE63     FRACE64    FRACE65
## healthy    -0.001800903  0.01114594 -0.2477312 0.2721514 -0.06496821 -0.2476280
## overweight -0.057636288 -0.11075327 -0.5561099 0.2158364 -0.67315808 -0.6970897
##               FRACE66  PRECARE1   PRECARE2   PRECARE3   PRECARE4   PRECARE5
## healthy    -0.1962585 -1.034434 -0.6201651 -0.4564212 -0.3763421 -0.3081936
## overweight -0.3296519 -1.271875 -0.6656170 -0.4685877 -0.4000222 -0.3671841
##               PRECARE6  PRECARE7  PRECARE8 PRECARE9 PRECARE10  PRECARE99
## healthy    -0.04132263 0.1912822 0.6210939 1.670165  4.773688 -0.7209531
## overweight -0.04142750 0.1166186 0.7499877 1.939176  4.761574 -0.5589703
## 
## Std. Errors:
##            (Intercept)        BMI       CIG_0    FAGECOMB     M_Ht_In
## healthy      0.2487441 0.03034113 0.001598626 0.001331603 0.002639070
## overweight   0.3525165 0.03071874 0.003135739 0.001826788 0.003719195
##                 PREVIS  PRIORDEAD   PRIORLIVE   PRIORTERM     PWgt_R  RF_CESARN
## healthy    0.002889024 0.04008598 0.007777708 0.009125524 0.05100012 0.01874693
## overweight 0.003593018 0.07892820 0.009852410 0.012799373 0.05556684 0.02408881
##                  WTGAIN    MTRANU     MTRANY NO_RISKSYes    MRAVE62   MRAVE63
## healthy    0.0006611956 0.2832313 0.08129894  0.02062497 0.04253047 0.1391142
## overweight 0.0008231241 0.1998819 0.19360186  0.02957927 0.06282863 0.1691612
##               MRAVE64   MRAVE65    MRAVE66    FRACE62   FRACE63    FRACE64
## healthy    0.05697526 0.2129529 0.05816113 0.04121151 0.1399654 0.05914417
## overweight 0.08353703 0.2893368 0.07783189 0.05810496 0.1711920 0.08927957
##              FRACE65    FRACE66   PRECARE1  PRECARE2   PRECARE3   PRECARE4
## healthy    0.1969180 0.05628444 0.07709273 0.0713563 0.07073577 0.07336531
## overweight 0.2771667 0.07825624 0.13416572 0.1262939 0.12562441 0.12883031
##              PRECARE5   PRECARE6   PRECARE7  PRECARE8  PRECARE9  PRECARE10
## healthy    0.07763364 0.08597836 0.09130594 0.1044832 0.2148028 0.02053736
## overweight 0.13518457 0.14391749 0.15104719 0.1617907 0.2615615 0.02058515
##            PRECARE99
## healthy    0.1211254
## overweight 0.1888979
## 
## Residual Deviance: 202985.5 
## AIC: 203129.5

Multinomial Logit 2 (upsampling)

mlogit2_up <- multinom(HEALTHY ~ ., data = up_train %>%
                       dplyr::select(all_of(selected_variables)))

## # weights:  111 (72 variable)
## initial  value 564013.267042 
## iter  10 value 539936.167145
## iter  20 value 525064.721705
## iter  30 value 521571.510697
## iter  40 value 518784.467827
## iter  50 value 510164.218567
## iter  60 value 503303.147920
## iter  70 value 498212.042972
## iter  80 value 496300.980538
## final  value 496300.815743 
## converged

summary(mlogit2_up)

## Call:
## multinom(formula = HEALTHY ~ ., data = up_train %>% dplyr::select(all_of(selected_variables)))
## 
## Coefficients:
##            (Intercept)        BMI       CIG_0     FAGECOMB    M_Ht_In
## healthy      -7.909551 0.06435719 -0.02429116 -0.006609981 0.09360385
## overweight  -20.523026 0.14353015 -0.06202987  0.003281499 0.23691071
##                PREVIS  PRIORDEAD  PRIORLIVE   PRIORTERM       PWgt_R RF_CESARN
## healthy    0.09736997 -0.2410562 0.04659381 -0.07721144 -0.008275869 0.3676203
## overweight 0.12209152 -0.4690963 0.17897109 -0.09436269 -0.011044210 0.4825466
##                WTGAIN    MTRANU    MTRANY NO_RISKSYes    MRAVE62   MRAVE63
## healthy    0.01404663 0.2909451 -2.591708   0.9304431 -0.4856432 0.3347675
## overweight 0.04110785 0.1027567 -2.572920   1.1926029 -1.0281846 0.3439250
##               MRAVE64    MRAVE65     MRAVE66    FRACE62   FRACE63     FRACE64
## healthy    0.02962716 0.09816926 -0.04904639 -0.2316751 0.1382986 -0.08297968
## overweight 0.13607314 0.11511796 -0.14898937 -0.5423345 0.1201940 -0.72048798
##               FRACE65    FRACE66   PRECARE1   PRECARE2  PRECARE3   PRECARE4
## healthy    -0.3262072 -0.1599763 -0.7650321 -0.3305209 -0.200307 -0.1373757
## overweight -0.6486167 -0.2805514 -1.0593797 -0.4258586 -0.264184 -0.2256212
##              PRECARE5   PRECARE6  PRECARE7  PRECARE8 PRECARE9 PRECARE10
## healthy    -0.1437398 0.13441013 0.3034858 0.6896253 1.680577  8.039405
## overweight -0.2778856 0.08085994 0.1642038 0.7656501 1.844563  7.902292
##             PRECARE99
## healthy    -0.5635237
## overweight -0.4225226
## 
## Std. Errors:
##            (Intercept)         BMI        CIG_0     FAGECOMB      M_Ht_In
## healthy    0.003045013 0.001609676 0.0007996155 0.0005575522 0.0006079060
## overweight 0.002593407 0.001669361 0.0010814747 0.0006071785 0.0005987777
##                 PREVIS  PRIORDEAD   PRIORLIVE   PRIORTERM       PWgt_R
## healthy    0.001023771 0.02239746 0.003194259 0.004003548 0.0002789411
## overweight 0.001087693 0.02795829 0.003288692 0.004285469 0.0002906121
##              RF_CESARN       WTGAIN       MTRANU     MTRANY NO_RISKSYes
## healthy    0.007560774 0.0002618767 9.147067e-05 0.06015949 0.008922890
## overweight 0.007892150 0.0002752565 1.029778e-04 0.05941728 0.009814584
##               MRAVE62    MRAVE63    MRAVE64    MRAVE65    MRAVE66    FRACE62
## healthy    0.01800707 0.05210396 0.02373511 0.05519714 0.02344048 0.01719111
## overweight 0.02053053 0.05378510 0.02739273 0.05173535 0.02532852 0.01906421
##               FRACE63    FRACE64    FRACE65    FRACE66   PRECARE1   PRECARE2
## healthy    0.05158844 0.02463047 0.05485284 0.02362520 0.02968984 0.02689723
## overweight 0.05368095 0.02928041 0.05466426 0.02575004 0.02583286 0.02186286
##              PRECARE3   PRECARE4   PRECARE5   PRECARE6   PRECARE7   PRECARE8
## healthy    0.02679082 0.02820006 0.03041236 0.03364370 0.03573217 0.03975014
## overweight 0.02183152 0.02402184 0.02755885 0.03161935 0.03486198 0.03901790
##              PRECARE9   PRECARE10  PRECARE99
## healthy    0.06608533 0.001990894 0.05031361
## overweight 0.06280260 0.001990468 0.04775122
## 
## Residual Deviance: 992601.6 
## AIC: 992745.6

6.5. Naive Bayes

Naive Bayes algorithm is based on Bayes’ Theorem and assumes an independence among predictors. More precisely, it assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

Bayes 1

bayes1 <- naive_bayes(HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R, data = data_train, usekernel = T)
summary(bayes1)

## 
## ================================== Naive Bayes ================================== 
##  
## - Call: naive_bayes.formula(formula = HEALTHY ~ MTRAN + WTGAIN + M_Ht_In +      PWgt_R, data = data_train, usekernel = T) 
## - Laplace: 0 
## - Classes: 3 
## - Samples: 202806 
## - Features: 4 
## - Conditional distributions: 
##     - Categorical: 1
##     - KDE: 3
## - Prior probabilities: 
##     - underweight: 0.0746
##     - healthy: 0.8438
##     - overweight: 0.0816
## 
## ---------------------------------------------------------------------------------

Naive Bayes 1 (upsampling)

bayes1_up <- naive_bayes(HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R, data = up_train, usekernel = T)
summary(bayes1_up)

## 
## ================================== Naive Bayes ================================== 
##  
## - Call: naive_bayes.formula(formula = HEALTHY ~ MTRAN + WTGAIN + M_Ht_In +      PWgt_R, data = up_train, usekernel = T) 
## - Laplace: 0 
## - Classes: 3 
## - Samples: 513387 
## - Features: 4 
## - Conditional distributions: 
##     - Categorical: 1
##     - KDE: 3
## - Prior probabilities: 
##     - underweight: 0.3333
##     - healthy: 0.3333
##     - overweight: 0.3333
## 
## ---------------------------------------------------------------------------------

Naive Bayes 2

bayes2 <- naive_bayes(HEALTHY ~ ., data = data_train, usekernel = T)
summary(bayes2)

## 
## ================================== Naive Bayes ================================== 
##  
## - Call: naive_bayes.formula(formula = HEALTHY ~ ., data = data_train,      usekernel = T) 
## - Laplace: 0 
## - Classes: 3 
## - Samples: 202806 
## - Features: 31 
## - Conditional distributions: 
##     - Bernoulli: 7
##     - Categorical: 13
##     - KDE: 11
## - Prior probabilities: 
##     - underweight: 0.0746
##     - healthy: 0.8438
##     - overweight: 0.0816
## 
## ---------------------------------------------------------------------------------

Naive Bayes 2 (upsampling)

bayes2_up <- naive_bayes(HEALTHY ~ ., data = up_train, usekernel = T)
summary(bayes2_up)

## 
## ================================== Naive Bayes ================================== 
##  
## - Call: naive_bayes.formula(formula = HEALTHY ~ ., data = up_train, usekernel = T) 
## - Laplace: 0 
## - Classes: 3 
## - Samples: 513387 
## - Features: 32 
## - Conditional distributions: 
##     - Bernoulli: 7
##     - Categorical: 14
##     - KDE: 11
## - Prior probabilities: 
##     - underweight: 0.3333
##     - healthy: 0.3333
##     - overweight: 0.3333
## 
## ---------------------------------------------------------------------------------

6.6. Regression Tree

tree <- rpart(HEALTHY~ MTRAN + WTGAIN + PWgt_R + PREVIS +
                PRIORDEAD + M_Ht_In + CIG_0 + BMI + FAGECOMB  +
                PRIORLIVE + PRIORTERM + NO_INFEC, data = up_train, method = 'class')
rpart.plot(tree)

From the variables chosen, we can see that the tree model saw PREVIS variable as the bigget factor in categorizing the data. Next was the WTGAIN variable, PWGT_R and M_HT_IN. The decisions upon the the weight variables make sense, but it’s peculiar why the PREVIS variable was chosen as the most important one. A soon as a woman had less than 9 prenatal visits, the model automatically classifies the baby as underweight.

7. Predictions

We created two functions that will allow us to display the accuracy of the predicitions of the model compared with the real data in the training sets.

summary_ordinal <- function(predictions, real_data){
  
  accurate <- c(predictions == real_data)
  accurate <- sum(accurate)
  accurate <- accurate / length(real_data)
  print(paste('Total percentage accuracy --- ', accurate))
  
  ### underweight
  under_indices <- real_data == "underweight"
  under_results <- results[under_indices]
  under_real_data <- real_data[under_indices]
  
  under_accuracy <- sum(c(under_results == under_real_data))
  under_accuracy <- under_accuracy / length(under_real_data)
  print(paste('Total underweight percentage accuracy --- ', under_accuracy))
  
  ### healthy
  healthy_indices <- real_data == "healthy"
  healthy_results <- results[healthy_indices]
  healthy_real_data <- real_data[healthy_indices]
  
  healthy_accuracy <- sum(c(healthy_results == healthy_real_data))
  healthy_accuracy <- healthy_accuracy / length(healthy_real_data)
  print(paste('Total healthy percentage accuracy --- ', healthy_accuracy))
  
  ### overweight
  over_indices <- real_data == "overweight"
  over_results <- results[over_indices]
  over_real_data <- real_data[over_indices]
  
  over_accuracy <- sum(c(over_results == over_real_data))
  over_accuracy <- over_accuracy / length(over_real_data)
  print(paste('Total overweight percentage accuracy --- ', over_accuracy))
  
}

summary_binary <- function(predictions, real_data){
  accurate <- c(predictions == real_data)
  accurate <- sum(accurate)
  accurate <- accurate / length(real_data)
  print(paste('Total percentage accuracy --- ',accurate))
  
  ### only positives
  positive_indices <- real_data == "Yes"
  positive_results <- results[positive_indices]
  positive_real_data <- real_data[positive_indices]
  
  positive_accuracy <- sum(c(positive_results == positive_real_data))
  positive_accuracy <- positive_accuracy / length(positive_real_data)
  print(paste('Total positive percentage accuracy --- ',positive_accuracy))
  
  ### only negatives
  negative_indices <- real_data == "No"
  negative_results <- results[negative_indices]
  negative_real_data <- real_data[negative_indices]
  
  negative_accuracy <- sum(c(negative_results == negative_real_data))
  negative_accuracy <- negative_accuracy / length(negative_real_data)
  print(paste('Total negative percentage accuracy --- ',negative_accuracy))
  
}

7.1. Logit

Logit 1

results <- predict(logit1, newdata = data_test_b)
results <- factor(ifelse(results > 0.5,'Yes','No'),
                     levels = c('Yes', 'No'))


confusionMatrix(data = results, reference = data_test_b$HEALTHY)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No    202   147
##        Yes 13453 73113
##                                           
##                Accuracy : 0.8435          
##                  95% CI : (0.8411, 0.8459)
##     No Information Rate : 0.8429          
##     P-Value [Acc > NIR] : 0.306           
##                                           
##                   Kappa : 0.0212          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.014793        
##             Specificity : 0.997993        
##          Pos Pred Value : 0.578797        
##          Neg Pred Value : 0.844593        
##              Prevalence : 0.157108        
##          Detection Rate : 0.002324        
##    Detection Prevalence : 0.004015        
##       Balanced Accuracy : 0.506393        
##                                           
##        'Positive' Class : No              
##

summary_binary(results, data_test_b$HEALTHY)

## [1] "Total percentage accuracy ---  0.843525283322787"
## [1] "Total positive percentage accuracy ---  0.997993447993448"
## [1] "Total negative percentage accuracy ---  0.0147931160746979"

Logit 1 (upsampling)

results <- predict(logit1_up, newdata = data_test_b)
results <- factor(ifelse(results > 0.5,'Yes','No'),
                     levels = c('Yes', 'No'))


confusionMatrix(data = results, reference = data_test_b$HEALTHY)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No    511  1003
##        Yes 13144 72257
##                                           
##                Accuracy : 0.8372          
##                  95% CI : (0.8348, 0.8397)
##     No Information Rate : 0.8429          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0372          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.037422        
##             Specificity : 0.986309        
##          Pos Pred Value : 0.337517        
##          Neg Pred Value : 0.846091        
##              Prevalence : 0.157108        
##          Detection Rate : 0.005879        
##    Detection Prevalence : 0.017419        
##       Balanced Accuracy : 0.511866        
##                                           
##        'Positive' Class : No              
##

summary_binary(results, data_test_b$HEALTHY)

## [1] "Total percentage accuracy ---  0.837231778174078"
## [1] "Total positive percentage accuracy ---  0.986309036309036"
## [1] "Total negative percentage accuracy ---  0.037422189674112"

Logit 2

results <- predict(logit2, newdata = data_test_b)
results <- factor(ifelse(results > 0.5,'Yes','No'),
                     levels = c('Yes', 'No'))


confusionMatrix(data = results, reference = data_test_b$HEALTHY)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No    225   189
##        Yes 13430 73071
##                                           
##                Accuracy : 0.8433          
##                  95% CI : (0.8409, 0.8457)
##     No Information Rate : 0.8429          
##     P-Value [Acc > NIR] : 0.3707          
##                                           
##                   Kappa : 0.023           
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.016477        
##             Specificity : 0.997420        
##          Pos Pred Value : 0.543478        
##          Neg Pred Value : 0.844742        
##              Prevalence : 0.157108        
##          Detection Rate : 0.002589        
##    Detection Prevalence : 0.004763        
##       Balanced Accuracy : 0.506949        
##                                           
##        'Positive' Class : No              
##

summary_binary(results, data_test_b$HEALTHY)

## [1] "Total percentage accuracy ---  0.843306678939193"
## [1] "Total positive percentage accuracy ---  0.997420147420147"
## [1] "Total negative percentage accuracy ---  0.0164774807762724"

Logit 2 (upsampling)

results <- predict(logit2_up, newdata = data_test_b)
results <- factor(ifelse(results > 0.5,'Yes','No'),
                     levels = c('Yes', 'No'))


confusionMatrix(data = results, reference = data_test_b$HEALTHY)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No   2698  8356
##        Yes 10957 64904
##                                          
##                Accuracy : 0.7778         
##                  95% CI : (0.775, 0.7806)
##     No Information Rate : 0.8429         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.0905         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.19758        
##             Specificity : 0.88594        
##          Pos Pred Value : 0.24407        
##          Neg Pred Value : 0.85556        
##              Prevalence : 0.15711        
##          Detection Rate : 0.03104        
##    Detection Prevalence : 0.12718        
##       Balanced Accuracy : 0.54176        
##                                          
##        'Positive' Class : No             
##

summary_binary(results, data_test_b$HEALTHY)

## [1] "Total percentage accuracy ---  0.777794396824484"
## [1] "Total positive percentage accuracy ---  0.885940485940486"
## [1] "Total negative percentage accuracy ---  0.19758330281948"

7.2. Probit

Probit 1

results <- predict(probit1, newdata = data_test_b)
results <- factor(ifelse(results > 0.5,'Yes','No'),
                     levels = c('Yes', 'No'))


confusionMatrix(data = results, reference = data_test_b$HEALTHY)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No    212   162
##        Yes 13443 73098
##                                          
##                Accuracy : 0.8435         
##                  95% CI : (0.841, 0.8459)
##     No Information Rate : 0.8429         
##     P-Value [Acc > NIR] : 0.3226         
##                                          
##                   Kappa : 0.022          
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.015525       
##             Specificity : 0.997789       
##          Pos Pred Value : 0.566845       
##          Neg Pred Value : 0.844663       
##              Prevalence : 0.157108       
##          Detection Rate : 0.002439       
##    Detection Prevalence : 0.004303       
##       Balanced Accuracy : 0.506657       
##                                          
##        'Positive' Class : No             
##

summary_binary(results, data_test_b$HEALTHY)

## [1] "Total percentage accuracy ---  0.84346775585342"
## [1] "Total positive percentage accuracy ---  0.997788697788698"
## [1] "Total negative percentage accuracy ---  0.0155254485536434"

Probit 1 (upsampling)

results <- predict(probit1_up, newdata = data_test_b)
results <- factor(ifelse(results > 0.5,'Yes','No'),
                     levels = c('Yes', 'No'))


confusionMatrix(data = results, reference = data_test_b$HEALTHY)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No   7174 34323
##        Yes  6481 38937
##                                           
##                Accuracy : 0.5305          
##                  95% CI : (0.5272, 0.5339)
##     No Information Rate : 0.8429          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0311          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.52538         
##             Specificity : 0.53149         
##          Pos Pred Value : 0.17288         
##          Neg Pred Value : 0.85730         
##              Prevalence : 0.15711         
##          Detection Rate : 0.08254         
##    Detection Prevalence : 0.47744         
##       Balanced Accuracy : 0.52843         
##                                           
##        'Positive' Class : No              
##

summary_binary(results, data_test_b$HEALTHY)

## [1] "Total percentage accuracy ---  0.530529827992867"
## [1] "Total positive percentage accuracy ---  0.531490581490581"
## [1] "Total negative percentage accuracy ---  0.52537532039546"

Probit 2

results <- predict(probit2, newdata = data_test_b)
results <- factor(ifelse(results > 0.5,'Yes','No'),
                     levels = c('Yes', 'No'))


confusionMatrix(data = results, reference = data_test_b$HEALTHY)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No    328   427
##        Yes 13327 72833
##                                           
##                Accuracy : 0.8418          
##                  95% CI : (0.8393, 0.8442)
##     No Information Rate : 0.8429          
##     P-Value [Acc > NIR] : 0.8232          
##                                           
##                   Kappa : 0.0295          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.024021        
##             Specificity : 0.994171        
##          Pos Pred Value : 0.434437        
##          Neg Pred Value : 0.845323        
##              Prevalence : 0.157108        
##          Detection Rate : 0.003774        
##    Detection Prevalence : 0.008687        
##       Balanced Accuracy : 0.509096        
##                                           
##        'Positive' Class : No              
##

summary_binary(results, data_test_b$HEALTHY)

## [1] "Total percentage accuracy ---  0.841753437266295"
## [1] "Total positive percentage accuracy ---  0.994171444171444"
## [1] "Total negative percentage accuracy ---  0.0240205053094105"

Probit 2 (upsampling)

results <- predict(probit2_up, newdata = data_test_b)
results <- factor(ifelse(results > 0.5,'Yes','No'),
                     levels = c('Yes', 'No'))


confusionMatrix(data = results, reference = data_test_b$HEALTHY)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No   8111 35166
##        Yes  5544 38094
##                                           
##                Accuracy : 0.5316          
##                  95% CI : (0.5283, 0.5349)
##     No Information Rate : 0.8429          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0605          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.59399         
##             Specificity : 0.51998         
##          Pos Pred Value : 0.18742         
##          Neg Pred Value : 0.87295         
##              Prevalence : 0.15711         
##          Detection Rate : 0.09332         
##    Detection Prevalence : 0.49792         
##       Balanced Accuracy : 0.55699         
##                                           
##        'Positive' Class : No              
##

summary_binary(results, data_test_b$HEALTHY)

## [1] "Total percentage accuracy ---  0.531611344416959"
## [1] "Total positive percentage accuracy ---  0.51998361998362"
## [1] "Total negative percentage accuracy ---  0.593994873672647"

7.3. Ordered Logit

Ordered Logit 1

results <- predict(ordered_logit1, newdata = data_test)

confusionMatrix(data = results, reference = data_test$HEALTHY)

## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    underweight healthy overweight
##   underweight         128      99          5
##   healthy            6370   73092       7215
##   overweight            2       2          2
## 
## Overall Statistics
##                                         
##                Accuracy : 0.8425        
##                  95% CI : (0.84, 0.8449)
##     No Information Rate : 0.8421        
##     P-Value [Acc > NIR] : 0.3958        
##                                         
##                   Kappa : 0.0152        
##                                         
##  Mcnemar's Test P-Value : <2e-16        
## 
## Statistics by Class:
## 
##                      Class: underweight Class: healthy Class: overweight
## Sensitivity                    0.019692       0.998620         2.769e-04
## Specificity                    0.998707       0.009984         9.999e-01
## Pos Pred Value                 0.551724       0.843269         3.333e-01
## Neg Pred Value                 0.926491       0.575630         9.169e-01
## Prevalence                     0.074786       0.842122         8.309e-02
## Detection Rate                 0.001473       0.840960         2.301e-05
## Detection Prevalence           0.002669       0.997262         6.903e-05
## Balanced Accuracy              0.509200       0.504302         5.001e-01

summary_ordinal(results, data_test$HEALTHY)

## [1] "Total percentage accuracy ---  0.842455272392567"
## [1] "Total underweight percentage accuracy ---  0.0196923076923077"
## [1] "Total healthy percentage accuracy ---  0.998620086620305"
## [1] "Total overweight percentage accuracy ---  0.00027693159789532"

Ordered Logit 1 (upsampling)

results <- predict(ordered_logit1_up, newdata = data_test)

confusionMatrix(data = results, reference = data_test$HEALTHY)

## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    underweight healthy overweight
##   underweight        6285   70527       6563
##   healthy             140    1903        456
##   overweight           75     763        203
## 
## Overall Statistics
##                                           
##                Accuracy : 0.0965          
##                  95% CI : (0.0946, 0.0985)
##     No Information Rate : 0.8421          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : -4e-04          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
## 
## Statistics by Class:
## 
##                      Class: underweight Class: healthy Class: overweight
## Sensitivity                     0.96692        0.02600          0.028109
## Specificity                     0.04135        0.95657          0.989485
## Pos Pred Value                  0.07538        0.76150          0.195005
## Neg Pred Value                  0.93927        0.15549          0.918264
## Prevalence                      0.07479        0.84212          0.083093
## Detection Rate                  0.07231        0.02189          0.002336
## Detection Prevalence            0.95927        0.02875          0.011977
## Balanced Accuracy               0.50414        0.49128          0.508797

summary_ordinal(results, data_test$HEALTHY)

## [1] "Total percentage accuracy ---  0.096542599091066"
## [1] "Total underweight percentage accuracy ---  0.966923076923077"
## [1] "Total healthy percentage accuracy ---  0.0259997540748432"
## [1] "Total overweight percentage accuracy ---  0.028108557186375"

Ordered Logit 2

results <- predict(ordered_logit2, newdata = data_test)

confusionMatrix(data = results, reference = data_test$HEALTHY)

## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    underweight healthy overweight
##   underweight         146      84          3
##   healthy            6348   73082       7207
##   overweight            6      27         12
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8427          
##                  95% CI : (0.8402, 0.8451)
##     No Information Rate : 0.8421          
##     P-Value [Acc > NIR] : 0.333           
##                                           
##                   Kappa : 0.0187          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
## 
## Statistics by Class:
## 
##                      Class: underweight Class: healthy Class: overweight
## Sensitivity                    0.022462        0.99848         0.0016616
## Specificity                    0.998918        0.01217         0.9995859
## Pos Pred Value                 0.626609        0.84354         0.2666667
## Neg Pred Value                 0.926698        0.60072         0.9170024
## Prevalence                     0.074786        0.84212         0.0830927
## Detection Rate                 0.001680        0.84084         0.0001381
## Detection Prevalence           0.002681        0.99680         0.0005177
## Balanced Accuracy              0.510690        0.50533         0.5006238

summary_ordinal(results, data_test$HEALTHY)

## [1] "Total percentage accuracy ---  0.842662371282287"
## [1] "Total underweight percentage accuracy ---  0.0224615384615385"
## [1] "Total healthy percentage accuracy ---  0.998483461533207"
## [1] "Total overweight percentage accuracy ---  0.00166158958737192"

Ordered Logit 2 (upsampling)

results <- predict(ordered_logit2_up, newdata = data_test)

confusionMatrix(data = results, reference = data_test$HEALTHY)

## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    underweight healthy overweight
##   underweight        6144   64913       5284
##   healthy             256    7058       1565
##   overweight          100    1222        373
## 
## Overall Statistics
##                                           
##                Accuracy : 0.1562          
##                  95% CI : (0.1538, 0.1586)
##     No Information Rate : 0.8421          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0034          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
## 
## Statistics by Class:
## 
##                      Class: underweight Class: healthy Class: overweight
## Sensitivity                     0.94523        0.09643          0.051648
## Specificity                     0.12707        0.86729          0.983411
## Pos Pred Value                  0.08048        0.79491          0.220059
## Neg Pred Value                  0.96633        0.15251          0.919632
## Prevalence                      0.07479        0.84212          0.083093
## Detection Rate                  0.07069        0.08121          0.004292
## Detection Prevalence            0.87834        0.10216          0.019502
## Balanced Accuracy               0.53615        0.48186          0.517530

summary_ordinal(results, data_test$HEALTHY)

## [1] "Total percentage accuracy ---  0.15618707933038"
## [1] "Total underweight percentage accuracy ---  0.945230769230769"
## [1] "Total healthy percentage accuracy ---  0.0964299864741164"
## [1] "Total overweight percentage accuracy ---  0.0516477430074772"

7.4. Multinomial Logit

Multinomial Logit 1

results <- predict(mlogit1, newdata = data_test)

summary_ordinal(results, data_test$HEALTHY)

## [1] "Total percentage accuracy ---  0.842627854800667"
## [1] "Total underweight percentage accuracy ---  0.0261538461538462"
## [1] "Total healthy percentage accuracy ---  0.99815556132417"
## [1] "Total overweight percentage accuracy ---  0.00124619219052894"

Multinomial Logit 1 (upsampling)

results <- predict(mlogit1_up, newdata = data_test)

summary_ordinal(results, data_test$HEALTHY)

## [1] "Total percentage accuracy ---  0.426796295230973"
## [1] "Total underweight percentage accuracy ---  0.675692307692308"
## [1] "Total healthy percentage accuracy ---  0.446231196972388"
## [1] "Total overweight percentage accuracy ---  0.00581556355580172"

Multinomial Logit 2

results <- predict(mlogit2, newdata = data_test)

summary_ordinal(results, data_test$HEALTHY)

## [1] "Total percentage accuracy ---  0.842869470172007"
## [1] "Total underweight percentage accuracy ---  0.0272307692307692"
## [1] "Total healthy percentage accuracy ---  0.998237536376429"
## [1] "Total overweight percentage accuracy ---  0.00235391858211022"

Multinomial Logit 2 (upsampling)

results <- predict(mlogit2_up, newdata = data_test)

summary_ordinal(results, data_test$HEALTHY)

## [1] "Total percentage accuracy ---  0.489938445607778"
## [1] "Total underweight percentage accuracy ---  0.749384615384615"
## [1] "Total healthy percentage accuracy ---  0.512111813971281"
## [1] "Total overweight percentage accuracy ---  0.0317086679590141"

7.5. Naive Bayes

Naive Bayes 1

predictions <- predict(bayes1, data_test)
summary_ordinal(predictions, data_test$HEALTHY)

## [1] "Total percentage accuracy ---  0.842478283380314"
## [1] "Total underweight percentage accuracy ---  0.749384615384615"
## [1] "Total healthy percentage accuracy ---  0.512111813971281"
## [1] "Total overweight percentage accuracy ---  0.0317086679590141"

Naive Bayes 1 (upsampling)

predictions <- predict(bayes1_up, data_test)
summary_ordinal(predictions, data_test$HEALTHY)

## [1] "Total percentage accuracy ---  0.0749697980785825"
## [1] "Total underweight percentage accuracy ---  0.749384615384615"
## [1] "Total healthy percentage accuracy ---  0.512111813971281"
## [1] "Total overweight percentage accuracy ---  0.0317086679590141"

Naive Bayes 2

predictions <- predict(bayes2, data_test)
summary_ordinal(predictions, data_test$HEALTHY)

## [1] "Total percentage accuracy ---  0.842328711959961"
## [1] "Total underweight percentage accuracy ---  0.749384615384615"
## [1] "Total healthy percentage accuracy ---  0.512111813971281"
## [1] "Total overweight percentage accuracy ---  0.0317086679590141"

Naive Bayes 2 (upsampling)

predictions <- predict(bayes2_up, data_test)
summary_ordinal(predictions, data_test$HEALTHY)

## [1] "Total percentage accuracy ---  0.0996375769429903"
## [1] "Total underweight percentage accuracy ---  0.749384615384615"
## [1] "Total healthy percentage accuracy ---  0.512111813971281"
## [1] "Total overweight percentage accuracy ---  0.0317086679590141"

8. Results

Results of accuracy of predictions from models on binary outcome variable:

MODEL	DATA	TOTAL ACCURACY	HEALTHY	UNHEALTHY
logit 1	unbalanced	0.844	0.997	0.015
logit 1	balanced	0.837	0.986	0.037
logit 2	unbalanced	0.843	0.997	0.016
logit 2	balanced	0.778	0.886	0.198
probit 1	unbalanced	0.843	0.998	0.016
probit 1	balanced	0.531	0.531	0.525
probit 2	unbalanced	0.842	0.994	0.024
probit 2	balanced	0.532	0.520	0.594

Results of accuracy of predictions from models on nominal/ordinal outcome variable:

MODEL	DATA	TOTAL ACCURACY	UNDERWEIGHT	HEALTHY	OVERWEIGHT
ordered logit 1	unbalanced	0.842	0.020	0.999	0.000
ordered logit 1	balanced	0.097	0. 967	0.026	0.028
ordered logit 2	unbalanced	0.843	0.022	0.998	0.002
ordered logit 2	balanced	0.156	0.945	0.096	0.052
mlogit 1	unbalanced	0.843	0.026	0.998	0.001
mlogit 1	balanced	0.427	0.676	0.446	0.006
mlogit 2	unbalanced	0.843	0.027	0.998	0.002
mlogit 2	balanced	0.490	0.749	0.512	0.032
bayes 1	unbalanced	0.842	0.749	0.512	0.032
bayes 1	balanced	0.075	0.749	0.512	0.032
bayes 2	unbalanced	0.842	0.749	0.512	0.032
bayes 2	balanced	0.100	0.749	0.512	0.032

9. Conclusions

None of the models yielded satisfying results. We can see a huge differece in predictions between balanced and unbalanced samples. Even though the scores for total accuracy are quite good, a lot of them over 80%, when we take a closer look at the accuracy for predicting particular classes we see significant differences between them. For some models predicting binary outcome variable the accuracy is around 50% which is equal to the probability of randomly assigning two classes to the observations. In the models predicting nominal variables the accuracy is very uneven between classes, some models achieved less than 10% total accuracy which is also less than the probability of randomly assigning classes. Our final conclusion would be that in available dataset the independent variables do not predict the dependent variable and therefore the model cannot be constructed based on them.

US Births - Classification

Aleksandra Tomczak, Konrad Lewszyk

5/25/2021

1. Data Inspection

1.1. Initial Selection

1.2. Missing Values

1.3. Creating HEALTHY

1.4. Data Converting

2. Splitting the data

3. Up-sampling

4. Data Transformation

4.1. Variables Distribution

Selected for Transformation

BMI

PWgt_R

4.2. Transformation

5. Feature Selection

5.1. Numeric

5.1.1. Colinearity

5.1.2. Correalations

5.2. Categorical

5.3. Final Selection

6. Basic Models

6.1. Logit

Logit 1

Logit 1 (upsampling)

Logit 2

Logit 2 (upsampling)

6.2. Probit

Probit 1

Probit 1 (upsampling)

Probit 2

Probit 2 (upsampling)

6.3. Ordered Logit

Ordered Logit 1

Ordered Logit 1 (upsampling)

Ordered Logit 2

Ordered Logit 2 (upsampling)

6.4. Multinomial Logit

Multinomial Logit 1

Multinomial Logit 1 (upsampling)

Multinomial Logit 2

Multinomial Logit 2 (upsampling)

6.5. Naive Bayes

Bayes 1

Naive Bayes 1 (upsampling)

Naive Bayes 2

Naive Bayes 2 (upsampling)

6.6. Regression Tree

7. Predictions

7.1. Logit

Logit 1

Logit 1 (upsampling)

Logit 2

Logit 2 (upsampling)

7.2. Probit

Probit 1

Probit 1 (upsampling)

Probit 2

Probit 2 (upsampling)

7.3. Ordered Logit

Ordered Logit 1

Ordered Logit 1 (upsampling)

Ordered Logit 2

Ordered Logit 2 (upsampling)

7.4. Multinomial Logit

Multinomial Logit 1

Multinomial Logit 1 (upsampling)

Multinomial Logit 2

Multinomial Logit 2 (upsampling)

7.5. Naive Bayes

Naive Bayes 1

Naive Bayes 1 (upsampling)

Naive Bayes 2

Naive Bayes 2 (upsampling)

8. Results

9. Conclusions

10. Sources