For this project we chose a dataset on childbirth in the United States in 2018, available on Kaggle and built from raw data from the CDC website.
The dataset contains over 3.8 million observations of 55 variables of different types. Several variables describe the birth circumstances and the mother's characteristics, as well as the origins and ethnicity of both parents.
This dataset gives us many possibilities when choosing the variable of interest: we can investigate many features and the influence of different measures. We would like to check which features may affect whether a baby is born healthy, so we decided to focus on the variable DBWT, which indicates the weight of the newborn. We will create a new categorical variable, HEALTHY, that indicates whether the baby's weight falls into the healthy interval.
A detailed description of the dataset and its variables is available here.
The dataset is a single CSV file. The variables are of different types and need to be studied closely before further analysis. In most columns, missing values are treated as a separate category with a pre-assigned code. This affects how we classify some variables as categorical or numeric; for example, our variable of interest is expressed in grams, but its missing values are coded as 9999, which would distort any modelling done on the raw column.
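A toy example of why such sentinel codes matter: left in place, a code like 9999 masquerades as a real measurement and distorts every statistic computed from the column.
mean(c(3300, 3500, 9999))              # 5599.667, inflated by the missing-value code
mean(c(3300, 3500, NA), na.rm = TRUE)  # 3400, after recoding 9999 to NA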
setwd("C:\\Users\\lewsz\\OneDrive\\Desktop\\UW_materials_semester_2\\machine_learning\\project_large")
data <- read_csv("US_births(2018).csv")
This dataset has nearly 4 million rows, which proved too heavy for our hardware to handle. Therefore, hoping the logic holds up, we will use data partitioning to create a subset of the data. The goal is a subset that still represents the original data well.
set.seed(1922)
# createDataPartition() (from caret) returns indices of a stratified 90% sample;
# dropping those rows keeps a representative 10% subset
data_which_rows <- createDataPartition(data$DBWT, p = 0.9, list = FALSE)
data <- data[-data_which_rows,]
After partitioning, the dataset has 380 152 observations of 55 variables.
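Since createDataPartition() samples within quantile groups of the supplied outcome (here DBWT), the retained 10% should preserve its distribution; a quick sanity check is to look at the outcome's summary in the subset (the full file is no longer in memory to compare against directly):
# distribution of the outcome in the retained subset (sketch); note that the
# 9999 missing-value code still inflates the maximum at this stage
summary(data$DBWT)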
The table below presents all variables with short description and their type. The last column shows how missing values are represented in a given variable.
| VARIABLE | DESCRIPTION | TYPE | STATUS | MISSING VALUE |
|---|---|---|---|---|
| ATTEND | type of attendant at birth | NOMINAL | TO BE DELETED | 9 |
| BFACIL | birth place | NOMINAL | TO BE DELETED | 9 |
| BMI | mother’s BMI before pregnancy | CONTINUOUS | TO BE KEPT | 99.9 |
| CIG_0 | cigarettes before pregnancy | DISCRETE | TO BE KEPT | 99 |
| DBWT | birth weight in grams | DISCRETE | TO BE KEPT | 9999 |
| DLMP_MM | last normal period month | NOMINAL | TO BE DELETED | 99 |
| DLMP_YY | last normal period year | NOMINAL | TO BE DELETED | 9999 |
| DMAR | marital status | NOMINAL | TO BE KEPT | BLANK |
| DOB_MM | birth month | NOMINAL | TO BE KEPT | NONE |
| DOB_TT | birth time | NOMINAL | TO BE DELETED | 9999 |
| DOB_WK | birth weekday | NOMINAL | TO BE DELETED | NONE |
| DOB_YY | birth year (2018 for all) | NOMINAL | TO BE DELETED | NONE |
| DWgt_R | delivery weight recode (weight in pounds) | DISCRETE | TO BE DELETED | 999 |
| FAGECOMB | father’s combined age | DISCRETE | TO BE KEPT | 99 |
| FEDUC | father’s education level | ORDINAL | TO BE KEPT | 9 |
| FHISPX | father’s hispanic origin if present | NOMINAL | TO BE DELETED | 9 |
| FRACE15 | father’s race recode | NOMINAL | TO BE DELETED | 99 |
| FRACE31 | father’s race recode | NOMINAL | TO BE DELETED | 99 |
| FRACE6 | father’s race recode | NOMINAL | TO BE KEPT | 9 |
| ILLB_R | interval since last birth | MIXED | TO BE DELETED | 888/999 |
| ILOP_R | interval since last other pregnancy | MIXED | TO BE DELETED | 888/999 |
| ILP_R | interval since last pregnancy | MIXED | TO BE DELETED | 888/999 |
| IMP_SEX | imputed sex | 1/BLANK | TO BE KEPT | BLANK |
| IP_GON | gonorrhea presence | Y/N/U | TO BE KEPT | U |
| LD_INDL | induction of labor | Y/N/U | TO BE KEPT | U |
| MAGER | mother’s age | ORDINAL | TO BE KEPT | NONE |
| MAGE_IMPFLG | mother’s age imputed | 1/BLANK | TO BE KEPT | BLANK |
| MAR_IMP | mother’s imputed marital status | 1/BLANK | TO BE KEPT | BLANK |
| MBSTATE_REC | mother’s nativity (born in/outside of US) | NOMINAL | TO BE KEPT | 3 |
| MEDUC | mother’s education level | ORDINAL | TO BE KEPT | 9 |
| MHISPX | mother’s hispanic origin if present | NOMINAL | TO BE DELETED | 9 |
| MM_AICU | admit to intensive care | Y/N/U | TO BE KEPT | U |
| MRACE15 | mother’s race recode | NOMINAL | TO BE DELETED | NONE |
| MRACE31 | mother’s race recode | NOMINAL | TO BE DELETED | NONE |
| MRACEIMP | mother’s imputed race | 1/2/BLANK | TO BE KEPT | BLANK |
| MRAVE6 | mother’s race recode (MRACE6) | NOMINAL | TO BE KEPT | NONE |
| MTRAN | mother transferred | Y/N/U | TO BE KEPT | U |
| M_Ht_In | mother’s height in inches | CONTINUOUS | TO BE KEPT | 99 |
| NO_INFEC | no infections reported | 1/0/9 | TO BE KEPT | 9 |
| NO_MMORB | no maternal morbidity reported | 1/0/9 | TO BE KEPT | 9 |
| NO_RISKS | no risk factors reported | 1/0/9 | TO BE KEPT | 9 |
| PAY | payment source for delivery | NOMINAL | TO BE DELETED | 9 |
| PAY_REC | payment recode | NOMINAL | TO BE KEPT | 9 |
| PRECARE | month prenatal care began | NOMINAL | TO BE KEPT | 99 |
| PREVIS | number of prenatal visits | DISCRETE | TO BE KEPT | 99 |
| PRIORDEAD | prior births now dead | DISCRETE | TO BE KEPT | 99 |
| PRIORLIVE | prior births now living | DISCRETE | TO BE KEPT | 99 |
| PRIORTERM | prior other terminations | DISCRETE | TO BE KEPT | 99 |
| PWgt_R | pre-pregnancy weight recode | DISCRETE | TO BE KEPT | 999 |
| RDMETH_REC | delivery method recode | NOMINAL | TO BE DELETED | 9 |
| RESTATUS | residence status | NOMINAL | TO BE KEPT | NONE |
| RF_CESAR | previous cesarean | Y/N/U | TO BE KEPT | U |
| RF_CESARN | number of previous cesareans | DISCRETE | TO BE KEPT | 99 |
| SEX | sex of infant | M/F | TO BE KEPT | NONE |
| WTGAIN | weight gain in pounds | DISCRETE | TO BE KEPT | 99 |
After inspecting every variable, we qualified several of them for immediate deletion. ATTEND and BFACIL describe the attendant and facility at birth, which should not influence the weight of the baby. DOB_TT and DOB_WK give the time and weekday of the birth, which should not affect the dependent variable either. The year of birth in DOB_YY is the same for all observations and will therefore be deleted as well. DWgt_R, the mother's delivery weight in pounds, is approximately the pre-pregnancy weight (PWgt_R) plus the weight gain (WTGAIN), so it is redundant. Of the four variables describing the father's origins or race (FHISPX, FRACE15, FRACE31, FRACE6) we decided to keep just one, FRACE6. The variables ILLB_R, ILOP_R and ILP_R, containing the interval since the last birth/pregnancy, are of a mixed type: they take values 000-003 if the delivery was plural, and otherwise give the months passed since the last birth/pregnancy. This kind of variable is difficult to transform, so we decided to drop all of them. As in the father's case, we kept just one variable describing the mother's race/origin, MRAVE6, and deleted MRACE15, MRACE31 and MHISPX. We also dropped DLMP_MM and DLMP_YY, the month and year of the last normal period. PAY and PAY_REC both describe the payment source, so there is no need to keep both; we kept the recoded PAY_REC. Finally, RDMETH_REC shows the delivery method, which should not affect the dependent variable.
columns_to_drop <- c("ATTEND", "BFACIL", "DOB_TT", "DOB_WK", "DOB_YY", "DWgt_R", "FHISPX", "FRACE15", "FRACE31", "ILLB_R", "ILOP_R", "ILP_R", "MHISPX", "MRACE15", "MRACE31", "PAY", "RDMETH_REC", "DLMP_MM", "DLMP_YY")
for (name in columns_to_drop){
  data[name] <- NULL
}
Even though most columns encode missing values with a dedicated code, we still have to check whether any of them contain blank (NA) cells.
sapply(data, function(x) sum(is.na(x)))
## BMI CIG_0 DBWT DMAR DOB_MM FAGECOMB
## 0 0 0 45696 0 0
## FEDUC FRACE6 IMP_SEX IP_GON LD_INDL MAGER
## 0 0 380144 0 0 0
## MAGE_IMPFLG MAR_IMP MBSTATE_REC MEDUC MM_AICU MRACEIMP
## 380116 379968 0 0 0 356159
## MRAVE6 MTRAN M_Ht_In NO_INFEC NO_MMORB NO_RISKS
## 0 0 0 0 0 0
## PAY_REC PRECARE PREVIS PRIORDEAD PRIORLIVE PRIORTERM
## 0 0 0 0 0 0
## PWgt_R RESTATUS RF_CESAR RF_CESARN SEX WTGAIN
## 0 0 0 0 0 0
We can see that 5 columns contain some missing values. Let’s take a closer look at them.
empty_columns <- names(which(colSums(is.na(data)) > 0))
empty_columns
## [1] "DMAR" "IMP_SEX" "MAGE_IMPFLG" "MAR_IMP" "MRACEIMP"
for (name in empty_columns){
  missing_cells <- sum(is.na(data[name]))
  percentage <- round((missing_cells * 100)/(nrow(data)), 3)
  print(paste("Missing values in", name, ":", percentage, "%"))
}
## [1] "Missing values in DMAR : 12.02 %"
## [1] "Missing values in IMP_SEX : 99.998 %"
## [1] "Missing values in MAGE_IMPFLG : 99.991 %"
## [1] "Missing values in MAR_IMP : 99.952 %"
## [1] "Missing values in MRACEIMP : 93.689 %"
As we can see, 4 of them have more than 90% missing values, which qualifies them for deletion.
columns_to_drop <- c("IMP_SEX", "MAGE_IMPFLG", "MAR_IMP", "MRACEIMP")
for (name in columns_to_drop){
  data[name] <- NULL
}
Let's take a closer look at the variable DMAR, where missing values make up almost 12% of the observations. We will keep this variable for now, since dropping around 46 thousand observations would waste a lot of data. We will treat missing values as a third category to keep those rows in the dataset.
table(data$DMAR, useNA = "ifany")
##
## 1 2 <NA>
## 201172 133284 45696
data$DMAR[is.na(data$DMAR)] <- 3
table(data$DMAR, useNA = "ifany")
##
## 1 2 3
## 201172 133284 45696
After deleting the selected variables we are left with 32 features. Now we will group the continuous, discrete and ordinal variables according to their missing-value representation. We have already deleted or transformed the columns with blank cells, which leaves us with the following groups:
group_9 <- c("FEDUC", "FRACE6", "MEDUC", "NO_INFEC", "NO_MMORB", "NO_RISKS", "PAY_REC")
group_99 <- c("CIG_0", "FAGECOMB", "M_Ht_In", "PREVIS", "PRIORDEAD", "PRIORLIVE", "PRIORTERM", "RF_CESARN", "WTGAIN")
Variable BMI has missing values represented by 99.9, variable PWgt_R by 999 and variable DBWT by 9999.
data <- filter(data, BMI != 99.9)
data <- filter(data, PWgt_R != 999)
data <- filter(data, DBWT != 9999)
for (name in group_9){
  data <- data[data[[name]] != 9, ]
}
for (name in group_99){
  data <- data[data[[name]] != 99, ]
}
After cleaning the data we are left with 289 721 observations of 32 variables.
Now it is time to create our binary dependent variable, HEALTHY. The average birth weight is around 7.5 lb (3.5 kg), and anything between 5.5 lb (2.5 kg) and 8.5 lb (4 kg) is considered normal. We therefore set the healthy interval to 2500-4000 grams. We also decided to create a separate dataset where HEALTHY is not binary but nominal, with classes "underweight" for babies under 2500 grams, "healthy" for babies between 2500 and 4000 grams, and "overweight" for those above 4000 grams.
Creating a copy of the data:
data_b <- data
data_b$HEALTHY <- ifelse((data$DBWT >= 2500 & data$DBWT <= 4000), 1, 0) #binary
# note: exactly 2500 g counts as healthy above (>=) but as underweight below (<=)
data$HEALTHY <- ifelse((data$DBWT <= 2500), 0, ifelse((data$DBWT <= 4000), 1, 2)) #nominal
table(data$HEALTHY)
##
## 0 1 2
## 21622 244322 23777
table(data_b$HEALTHY)
##
## 0 1
## 45186 244535
In our dataset we have around 244k healthy babies and around 45k babies that fall outside the healthy interval (underweight and overweight together). Our dependent variable is therefore imbalanced and will need resampling on the training data.
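A quick look at the class shares, computed from the table above:
# ~15.6% of babies (class 0) fall outside the healthy interval,
# ~84.4% (class 1) fall inside it
round(prop.table(table(data_b$HEALTHY)), 3)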
Before further inspection of the variables, we need to convert the data to appropriate types. There are 12 numeric variables and 21 categorical ones (including HEALTHY). Some categorical variables already have numbers as their levels, but by default R treats numbers as quantitative, so we have to convert these variables to factors as well.
data_categorical_vars <- c("DMAR", "DOB_MM", "FRACE6", "MBSTATE_REC", "MRAVE6", "PAY_REC", "PRECARE", "RESTATUS", "IP_GON", "LD_INDL", "MM_AICU", "MTRAN", "NO_INFEC", "NO_MMORB", "NO_RISKS", "RF_CESAR", "SEX", "MEDUC", "FEDUC", "MAGER", "HEALTHY")
data_numeric_vars <- c("BMI", "CIG_0", "DBWT", "FAGECOMB", "M_Ht_In", "PREVIS", "PRIORDEAD", "PRIORLIVE", "PRIORTERM", "PWgt_R", "RF_CESARN", "WTGAIN")
for (variable in data_categorical_vars) {
  data[[variable]] <- as.factor(data[[variable]])
  data_b[[variable]] <- as.factor(data_b[[variable]])
}
Among the categorical variables, NO_INFEC, NO_MMORB and NO_RISKS take values 0 or 1, and HEALTHY takes 0/1 in the binary set and 0/1/2 in the nominal one. Let's add labels to all of them in both datasets.
yn_vars <- c("NO_INFEC", "NO_MMORB", "NO_RISKS")
for (variable in yn_vars) {
  data[[variable]] <- factor(data[[variable]], levels = c(0, 1), labels = c("No", "Yes"))
  data_b[[variable]] <- factor(data_b[[variable]], levels = c(0, 1), labels = c("No", "Yes"))
}
data$HEALTHY <- factor(data$HEALTHY, levels = c(0, 1, 2),
                       labels = c("underweight", "healthy", "overweight"))
data_b$HEALTHY <- factor(data_b$HEALTHY, levels = c(0, 1), labels = c("No", "Yes"))
Moreover, 3 of those 21 categorical variables are ordinal. Their levels are expressed as numbers, and R sorts factor levels automatically, which for these codes should produce increasing order, but let's check the levels.
levels(data$MEDUC)
## [1] "1" "2" "3" "4" "5" "6" "7" "8"
levels(data$FEDUC)
## [1] "1" "2" "3" "4" "5" "6" "7" "8"
levels(data$MAGER)
## [1] "13" "14" "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27"
## [16] "28" "29" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42"
## [31] "43" "44" "45" "46" "47" "48" "49" "50"
As we can see, all of the levels are ordered properly, so we do not need to change anything. The next step is to check our numeric variables and transform them if necessary.
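This check matters because factor levels built from character data are sorted as strings, which can misorder numeric codes; a tiny illustration:
levels(factor(c("2", "10", "9")))   # returns "10" "2" "9" (string sort, not numeric)
# MEDUC/FEDUC use codes 1-8 and MAGER 13-50 (all the same width),
# so their level order comes out numerically correct anyway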
for (variable in data_numeric_vars) {
  if (!is.numeric(data[[variable]])) print(variable)
}
## [1] "RF_CESARN"
The variable RF_CESARN needs to be converted to numeric in both sets.
# RF_CESARN is presumably stored as text, so as.numeric() parses it directly;
# had it been a factor, as.numeric(as.character(...)) would be needed instead
data$RF_CESARN <- as.numeric(data$RF_CESARN)
data_b$RF_CESARN <- as.numeric(data_b$RF_CESARN)
Prior to applying transformations to our data, we must first split it into training and testing sets. Any transformation parameters should be estimated on the training set only and then applied unchanged to the test set.
Before we move on, let’s save our data.
save(list = c("data", "data_b"),
file = "data_prepped.RData")
Before applying transformations to the chosen variables, we divide the observations into training and test sets. The createDataPartition function returns a vector of indices, with which we can conveniently split both datasets into training and testing samples.
set.seed(1922)
data_which_train <- createDataPartition(data$DBWT, p = 0.7, list = FALSE)
data_train <- data[data_which_train,]
data_test <- data[-data_which_train,]
data_train_b <- data_b[data_which_train,]
data_test_b <- data_b[-data_which_train,]
We split both datasets into training and testing sets.
summary(data_train$DBWT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 227 2990 3325 3291 3654 8165
summary(data_test$DBWT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 227 2990 3324 3292 3654 6603
summary(data_train_b$DBWT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 227 2990 3325 3291 3654 8165
summary(data_test_b$DBWT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 227 2990 3324 3292 3654 6603
The summary statistics are very similar between the training and test sets. Even though the maximum values differ, we will keep this partition. Now that the data are split, we can perform transformations on the training sample.
Before proceeding we have to delete the variable DBWT from all datasets and from the list of numeric variables: HEALTHY is derived directly from DBWT, so keeping it would leak the target into the predictors.
data_train$DBWT <- NULL
data_test$DBWT <- NULL
data_train_b$DBWT <- NULL
data_test_b$DBWT <- NULL
data_numeric_vars <- data_numeric_vars[data_numeric_vars != "DBWT"]
Up-sampling means creating additional observations for the minority classes in an imbalanced dataset. It is accomplished by duplicating existing observations, so the smaller classes grow to the size of the largest one.
set.seed(1922)
# note: x still contains HEALTHY, so upSample() also appends a duplicate
# outcome column (named "Class" by default)
up_train <- upSample(x = data_train,
                     y = data_train$HEALTHY)
table(up_train$HEALTHY)
##
## underweight healthy overweight
## 171129 171129 171129
set.seed(1922)
up_train_b <- upSample(x = data_train_b,
y = data_train_b$HEALTHY)
table(up_train_b$HEALTHY)
##
## No Yes
## 171275 171275
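An alternative we did not pursue is down-sampling with caret::downSample(), which discards majority-class rows instead of duplicating minority ones; a sketch:
set.seed(1922)
# keeps all minority-class rows plus an equally sized random draw from the
# majority class (roughly 31.5k per class here instead of ~171k)
down_train_b <- downSample(x = data_train_b,
                           y = data_train_b$HEALTHY)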
For the modeling phase we decided to keep all 4 datasets - unbalanced with binary variable (data_train_b), balanced with binary variable (up_train_b), unbalanced with nominal variable (data_train) and balanced with nominal variable (up_train). We will compare the results between all of them.
To decide how to treat particular variables, we generated plots of their distributions and of their transformed forms. The graphs for all variables are available in the attached R code. Below we present the graphs of the variables chosen for transformation.
These variables show enough normality to be worth keeping after the necessary transformations.
Let’s apply the transformations to some of our variables that we selected earlier.
data_train[c("BMI", "PWgt_R")] <- log(data_train[c("BMI", "PWgt_R")])
data_test[c("BMI", "PWgt_R")] <- log(data_test[c("BMI", "PWgt_R")])
data_train_b[c("BMI", "PWgt_R")] <- log(data_train_b[c("BMI", "PWgt_R")])
data_test_b[c("BMI", "PWgt_R")] <- log(data_test_b[c("BMI", "PWgt_R")])
The next step in developing our predictive model is reducing the number of input variables by checking which of them influence the dependent variable the most.
First, we will explore the numeric variables. Let's check for multicollinearity: whether any variables are exact linear combinations of others and can therefore be dropped.
data_linear_combinations <- findLinearCombos(data_train[, data_numeric_vars])
data_linear_combinations
## $linearCombos
## list()
##
## $remove
## NULL
It turns out there are no exact linear dependencies among the numeric variables.
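findLinearCombos() only flags exact linear dependencies; as a complementary sketch, the pairwise correlations can be inspected directly:
# Pearson correlation matrix of the numeric predictors (BMI and PWgt_R are
# already log-transformed); strong but imperfect correlations would not be
# caught by findLinearCombos()
round(cor(data_train[, data_numeric_vars]), 2)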
Correlation describes how variables change in relation to one another, which is extremely helpful for feature selection. We will check whether the numeric variables affect our dependent variable. Because the dependent variable is categorical, we will use ANOVA. With a single continuous predictor we can simply "flip" the model around, so that the outcome variable and the predictor switch places.
for (variable in data_numeric_vars) {
  anova1 <- aov(data_train[[variable]] ~ data_train$HEALTHY)
  print(summary(anova1))
}
## Df Sum Sq Mean Sq F value Pr(>F)
## data_train$HEALTHY 2 68 33.94 653.9 <2e-16 ***
## Residuals 202803 10525 0.05
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## data_train$HEALTHY 2 5345 2672.3 146.3 <2e-16 ***
## Residuals 202803 3705307 18.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## data_train$HEALTHY 2 9303 4651 102.6 <2e-16 ***
## Residuals 202803 9196050 45
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## data_train$HEALTHY 2 21879 10939 1414 <2e-16 ***
## Residuals 202803 1568635 8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## data_train$HEALTHY 2 34953 17477 1116 <2e-16 ***
## Residuals 202803 3174933 16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## data_train$HEALTHY 2 1 0.7126 28.71 3.42e-13 ***
## Residuals 202803 5034 0.0248
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## data_train$HEALTHY 2 592 295.96 188.5 <2e-16 ***
## Residuals 202803 318385 1.57
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## data_train$HEALTHY 2 125 62.63 80.93 <2e-16 ***
## Residuals 202803 156953 0.77
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## data_train$HEALTHY 2 157 78.27 1369 <2e-16 ***
## Residuals 202803 11596 0.06
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## data_train$HEALTHY 2 25 12.422 37.07 <2e-16 ***
## Residuals 202803 67961 0.335
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## data_train$HEALTHY 2 635681 317840 1464 <2e-16 ***
## Residuals 202803 44026038 217
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
for (variable in data_numeric_vars) {
  anova1 <- aov(data_train_b[[variable]] ~ data_train_b$HEALTHY)
  print(summary(anova1))
}
## Df Sum Sq Mean Sq F value Pr(>F)
## data_train_b$HEALTHY 1 40 40.05 769.7 <2e-16 ***
## Residuals 202804 10552 0.05
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## data_train_b$HEALTHY 1 84 84.11 4.597 0.032 *
## Residuals 202804 3710568 18.30
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## data_train_b$HEALTHY 1 7372 7372 162.5 <2e-16 ***
## Residuals 202804 9197981 45
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## data_train_b$HEALTHY 1 3587 3587 458.4 <2e-16 ***
## Residuals 202804 1586927 8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## data_train_b$HEALTHY 1 3732 3732 236.1 <2e-16 ***
## Residuals 202804 3206154 16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## data_train_b$HEALTHY 1 0 0.26886 10.83 0.001 ***
## Residuals 202804 5035 0.02483
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## data_train_b$HEALTHY 1 343 342.5 218 <2e-16 ***
## Residuals 202804 318634 1.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## data_train_b$HEALTHY 1 86 85.93 111 <2e-16 ***
## Residuals 202804 156992 0.77
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## data_train_b$HEALTHY 1 67 66.71 1158 <2e-16 ***
## Residuals 202804 11686 0.06
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## data_train_b$HEALTHY 1 20 19.572 58.4 2.15e-14 ***
## Residuals 202804 67966 0.335
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## data_train_b$HEALTHY 1 34209 34209 155.5 <2e-16 ***
## Residuals 202804 44627510 220
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
All of the numeric variables returned very low p-values, which means that in all cases we can reject the null hypothesis of equal group means. However, the p-values should not be trusted blindly: with a dataset this large, even tiny differences become statistically significant. We decided to keep all the numeric variables.
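To supplement the p-values, an effect size is more informative at this sample size; a minimal sketch using eta-squared for the first ANOVA above (BMI):
# eta-squared = SS_between / SS_total; from the first table above this is
# 68 / (68 + 10525) ~= 0.006, i.e. HEALTHY explains ~0.6% of BMI's variance
anova_bmi <- aov(data_train[["BMI"]] ~ data_train$HEALTHY)
ss <- summary(anova_bmi)[[1]][["Sum Sq"]]
ss[1] / sum(ss)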
Cramér's V measures the strength of the relationship between two categorical variables. It takes values from 0 to 1; when both variables have only two levels it reduces to the absolute value of the phi coefficient (which itself ranges from -1 to 1). Let's test how all of our categorical variables are related to HEALTHY.
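For reference, the statistic can be computed directly from a contingency table's chi-squared value; a minimal sketch (the analysis below uses CramerV() from DescTools):
# Cramer's V: sqrt(chi2 / (n * (min(rows, cols) - 1)))
cramers_v <- function(x, y) {
  tab <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  unname(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}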
# CramerV() comes from the DescTools package
for (variable in data_categorical_vars) {
  a <- CramerV(data_train[["HEALTHY"]], data_train[[variable]])
  print(paste(variable, ":", a))
}
## [1] "DMAR : 0.0440713667140329"
## [1] "DOB_MM : 0.00814488682307605"
## [1] "FRACE6 : 0.0681983233639833"
## [1] "MBSTATE_REC : 0.0215136072186378"
## [1] "MRAVE6 : 0.0726345542285085"
## [1] "PAY_REC : 0.0352384050162547"
## [1] "PRECARE : 0.0408195025868523"
## [1] "RESTATUS : 0.0209088651591672"
## [1] "IP_GON : 0.0127828941039331"
## [1] "LD_INDL : 0.0284221751263552"
## [1] "MM_AICU : 0.0325762751352447"
## [1] "MTRAN : 0.0816504524801277"
## [1] "NO_INFEC : 0.0190780271157186"
## [1] "NO_MMORB : 0.0205959168245148"
## [1] "NO_RISKS : 0.0886319642005398"
## [1] "RF_CESAR : 0.0209533042371427"
## [1] "SEX : 0.0788267421505673"
## [1] "MEDUC : 0.0350560668217398"
## [1] "FEDUC : 0.0321884239034381"
## [1] "MAGER : 0.0435614429089168"
## [1] "HEALTHY : 1"
for (variable in data_categorical_vars) {
  a <- CramerV(data_train_b[["HEALTHY"]], data_train_b[[variable]])
  print(paste(variable, ":", a))
}
## [1] "DMAR : 0.00847955422154093"
## [1] "DOB_MM : 0.0086062712598186"
## [1] "FRACE6 : 0.0310867652029407"
## [1] "MBSTATE_REC : 0.024888068372129"
## [1] "MRAVE6 : 0.0306120791815099"
## [1] "PAY_REC : 0.00455057987907299"
## [1] "PRECARE : 0.0272564788442899"
## [1] "RESTATUS : 0.0275891874986047"
## [1] "IP_GON : 0.000764292889752201"
## [1] "LD_INDL : 0.0137224491034639"
## [1] "MM_AICU : 0.0201904723879968"
## [1] "MTRAN : 0.0767307914499367"
## [1] "NO_INFEC : 0.00359771611011677"
## [1] "NO_MMORB : 0.0200436961122356"
## [1] "NO_RISKS : 0.0620005279572291"
## [1] "RF_CESAR : 0.0156487866424666"
## [1] "SEX : 0.0423738189231064"
## [1] "MEDUC : 0.0104070594139674"
## [1] "FEDUC : 0.00902602838486403"
## [1] "MAGER : 0.039232606347136"
## [1] "HEALTHY : 1"
As we can see, the relationships are weak: leaving aside HEALTHY with itself, the highest score is about 0.089 (NO_RISKS) in the nominal set and about 0.077 (MTRAN) in the binary set. We decided to keep 5 of the strongest categorical variables and later check whether they are jointly significant in the model.
We will keep all 11 remaining numeric variables and 5 categorical ones.
selected_variables <- c("HEALTHY","BMI", "CIG_0", "FAGECOMB", "M_Ht_In", "PREVIS", "PRIORDEAD", "PRIORLIVE", "PRIORTERM", "PWgt_R", "RF_CESARN", "WTGAIN", "MTRAN", "NO_RISKS", "MRAVE6", "FRACE6", "PRECARE")
The glm() function automatically recodes categorical variables into dummies, taking by default the first level of each factor as the reference. The next step is to perform a likelihood ratio test of the hypothesis that the model with the selected variables fits no better than a model with just a constant. A low p-value means that our explanatory variables are jointly significant.
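To see this dummy coding explicitly, one can inspect the design matrix that glm() builds internally; a small sketch:
# the first level of MTRAN ("N") becomes the reference; the remaining
# levels get 0/1 indicator columns MTRANU and MTRANY
head(model.matrix(~ MTRAN, data = data_train_b))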
For the binary dependent variable we tested logit and probit models.
For the nominal dependent variable we tested ordered logit and multinomial logit models.
The logit model is used to model two-level outcome variables. In this model the log-odds of the outcome are modeled as a linear combination of the predictor variables.
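In our notation, writing $p = P(\text{HEALTHY} = 1)$ and $x_1, \dots, x_k$ for the predictors:

$$\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$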
logit1 <- glm(HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R,
family = binomial(link = "logit"),
data = data_train_b)
summary(logit1)
##
## Call:
## glm(formula = HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R, family = binomial(link = "logit"),
## data = data_train_b)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2998 0.5017 0.5526 0.6013 1.8775
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.4856473 0.1677609 44.621 < 2e-16 ***
## MTRANU 0.0815166 0.4075253 0.200 0.841
## MTRANY -2.1307100 0.0744504 -28.619 < 2e-16 ***
## WTGAIN -0.0076249 0.0004175 -18.262 < 2e-16 ***
## M_Ht_In -0.0180049 0.0023583 -7.635 2.26e-14 ***
## PWgt_R -0.8688742 0.0273250 -31.798 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 175260 on 202805 degrees of freedom
## Residual deviance: 172887 on 202800 degrees of freedom
## AIC: 172899
##
## Number of Fisher Scoring iterations: 4
lrtest(logit1)
## Likelihood ratio test
##
## Model 1: HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R
## Model 2: HEALTHY ~ 1
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 6 -86443
## 2 1 -87630 -5 2373.1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
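Since logit coefficients are on the log-odds scale, exponentiating them gives odds ratios; a quick sketch, not part of the original pipeline:
# e.g. exp(-2.1307) ~= 0.119 for MTRANY: transferred mothers have roughly
# 88% lower odds of a healthy-weight baby, other predictors held fixed
round(exp(coef(logit1)), 3)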
logit1_up <- glm(HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R,
family = binomial(link = "logit"),
data = up_train_b)
summary(logit1_up)
##
## Call:
## glm(formula = HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R, family = binomial(link = "logit"),
## data = up_train_b)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4898 -1.1785 0.3009 1.1474 2.5826
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.109e+00 7.960e-02 26.502 <2e-16 ***
## MTRANU 2.387e-01 2.301e-01 1.038 0.299
## MTRANY -2.153e+00 6.080e-02 -35.410 <2e-16 ***
## WTGAIN -7.189e-03 2.344e-04 -30.666 <2e-16 ***
## M_Ht_In -1.656e-02 1.309e-03 -12.654 <2e-16 ***
## PWgt_R -5.033e-03 8.948e-05 -56.254 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 474875 on 342549 degrees of freedom
## Residual deviance: 467932 on 342544 degrees of freedom
## AIC: 467944
##
## Number of Fisher Scoring iterations: 4
lrtest(logit1_up)
## Likelihood ratio test
##
## Model 1: HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R
## Model 2: HEALTHY ~ 1
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 6 -233966
## 2 1 -237438 -5 6942.8 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
logit2 <- glm(HEALTHY ~ .,
family = binomial(link = "logit"),
data = data_train_b %>%
dplyr::select(all_of(selected_variables)))
summary(logit2)
##
## Call:
## glm(formula = HEALTHY ~ ., family = binomial(link = "logit"),
## data = data_train_b %>% dplyr::select(all_of(selected_variables)))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7004 0.4723 0.5393 0.6056 1.9026
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.6509434 0.4285397 20.187 < 2e-16 ***
## BMI -9.0306964 1.7430803 -5.181 2.21e-07 ***
## CIG_0 0.0001320 0.0014384 0.092 0.92687
## FAGECOMB -0.0071469 0.0009724 -7.349 1.99e-13 ***
## M_Ht_In -0.3011317 0.0540820 -5.568 2.58e-08 ***
## PREVIS 0.0384406 0.0019032 20.198 < 2e-16 ***
## PRIORDEAD -0.0481565 0.0358727 -1.342 0.17946
## PRIORLIVE -0.0486929 0.0051926 -9.377 < 2e-16 ***
## PRIORTERM -0.0292671 0.0068325 -4.284 1.84e-05 ***
## PWgt_R 8.2859253 1.7436736 4.752 2.01e-06 ***
## RF_CESARN 0.1439094 0.0127177 11.316 < 2e-16 ***
## WTGAIN -0.0083305 0.0004219 -19.743 < 2e-16 ***
## MTRANU 0.1474479 0.4088778 0.361 0.71839
## MTRANY -2.0056925 0.0754718 -26.575 < 2e-16 ***
## NO_RISKSYes 0.3618065 0.0155439 23.276 < 2e-16 ***
## MRAVE62 -0.0795840 0.0323069 -2.463 0.01376 *
## MRAVE63 0.1380555 0.0858896 1.607 0.10798
## MRAVE64 -0.0596548 0.0434249 -1.374 0.16952
## MRAVE65 -0.0464300 0.1528329 -0.304 0.76128
## MRAVE66 0.0604788 0.0412164 1.467 0.14228
## FRACE62 0.0176081 0.0305223 0.577 0.56401
## FRACE63 0.1416488 0.0869453 1.629 0.10328
## FRACE64 0.2234106 0.0457085 4.888 1.02e-06 ***
## FRACE65 0.1089339 0.1469721 0.741 0.45858
## FRACE66 -0.0191680 0.0415776 -0.461 0.64479
## PRECARE1 -0.1499536 0.0647513 -2.316 0.02057 *
## PRECARE2 -0.0122761 0.0603144 -0.204 0.83872
## PRECARE3 0.0459716 0.0598966 0.768 0.44278
## PRECARE4 0.0814205 0.0617612 1.318 0.18740
## PRECARE5 0.1021823 0.0652784 1.565 0.11751
## PRECARE6 0.2180505 0.0703789 3.098 0.00195 **
## PRECARE7 0.3472252 0.0741964 4.680 2.87e-06 ***
## PRECARE8 0.4529481 0.0799813 5.663 1.49e-08 ***
## PRECARE9 0.7286252 0.1170714 6.224 4.85e-10 ***
## PRECARE10 1.1782890 1.0389572 1.134 0.25675
## PRECARE99 -0.2240515 0.0962316 -2.328 0.01990 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 175260 on 202805 degrees of freedom
## Residual deviance: 171428 on 202770 degrees of freedom
## AIC: 171500
##
## Number of Fisher Scoring iterations: 4
lrtest(logit2)
## Likelihood ratio test
##
## Model 1: HEALTHY ~ BMI + CIG_0 + FAGECOMB + M_Ht_In + PREVIS + PRIORDEAD +
## PRIORLIVE + PRIORTERM + PWgt_R + RF_CESARN + WTGAIN + MTRAN +
## NO_RISKS + MRAVE6 + FRACE6 + PRECARE
## Model 2: HEALTHY ~ 1
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 36 -85714
## 2 1 -87630 -35 3832.1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
logit2_up <- glm(HEALTHY ~ .,
family = binomial(link = "logit"),
data = up_train_b %>%
dplyr::select(all_of(selected_variables)))
summary(logit2_up)
##
## Call:
## glm(formula = HEALTHY ~ ., family = binomial(link = "logit"),
## data = up_train_b %>% dplyr::select(all_of(selected_variables)))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9829 -1.1658 0.1405 1.1354 2.6187
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.1440071 0.3340062 0.431 0.666359
## BMI 0.0309085 0.0059068 5.233 1.67e-07 ***
## CIG_0 -0.0013929 0.0008103 -1.719 0.085627 .
## FAGECOMB -0.0077776 0.0005480 -14.192 < 2e-16 ***
## M_Ht_In 0.0059468 0.0051724 1.150 0.250261
## PREVIS 0.0338343 0.0009738 34.746 < 2e-16 ***
## PRIORDEAD -0.0433466 0.0227645 -1.904 0.056893 .
## PRIORLIVE -0.0508495 0.0029977 -16.963 < 2e-16 ***
## PRIORTERM -0.0299650 0.0039657 -7.556 4.16e-14 ***
## PWgt_R -0.0096073 0.0009969 -9.637 < 2e-16 ***
## RF_CESARN 0.1528317 0.0072067 21.207 < 2e-16 ***
## WTGAIN -0.0083342 0.0002395 -34.804 < 2e-16 ***
## MTRANU 0.3164232 0.2317999 1.365 0.172231
## MTRANY -2.0117197 0.0610869 -32.932 < 2e-16 ***
## NO_RISKSYes 0.3638752 0.0089046 40.864 < 2e-16 ***
## MRAVE62 -0.0873617 0.0183463 -4.762 1.92e-06 ***
## MRAVE63 0.1422239 0.0473295 3.005 0.002656 **
## MRAVE64 -0.0432218 0.0241140 -1.792 0.073070 .
## MRAVE65 0.0604248 0.0879416 0.687 0.492019
## MRAVE66 0.0533033 0.0229954 2.318 0.020449 *
## FRACE62 0.0043319 0.0172620 0.251 0.801852
## FRACE63 0.0774152 0.0474486 1.632 0.102772
## FRACE64 0.2260829 0.0251931 8.974 < 2e-16 ***
## FRACE65 0.0983333 0.0825727 1.191 0.233705
## FRACE66 -0.0069185 0.0234562 -0.295 0.768030
## PRECARE1 -0.0893057 0.0380664 -2.346 0.018974 *
## PRECARE2 0.0611694 0.0356039 1.718 0.085787 .
## PRECARE3 0.1214347 0.0354373 3.427 0.000611 ***
## PRECARE4 0.1412539 0.0364780 3.872 0.000108 ***
## PRECARE5 0.1496845 0.0384280 3.895 9.81e-05 ***
## PRECARE6 0.2619382 0.0410276 6.384 1.72e-10 ***
## PRECARE7 0.3721227 0.0428378 8.687 < 2e-16 ***
## PRECARE8 0.4870379 0.0458022 10.633 < 2e-16 ***
## PRECARE9 0.7860846 0.0639819 12.286 < 2e-16 ***
## PRECARE10 0.8067073 0.4517791 1.786 0.074160 .
## PRECARE99 -0.1731520 0.0572197 -3.026 0.002477 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 474875 on 342549 degrees of freedom
## Residual deviance: 463312 on 342514 degrees of freedom
## AIC: 463384
##
## Number of Fisher Scoring iterations: 4
lrtest(logit2_up)
## Likelihood ratio test
##
## Model 1: HEALTHY ~ BMI + CIG_0 + FAGECOMB + M_Ht_In + PREVIS + PRIORDEAD +
## PRIORLIVE + PRIORTERM + PWgt_R + RF_CESARN + WTGAIN + MTRAN +
## NO_RISKS + MRAVE6 + FRACE6 + PRECARE
## Model 2: HEALTHY ~ 1
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 36 -231656
## 2 1 -237438 -35 11563 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The probit model is also used to model binary outcome variables. In this model the inverse standard normal CDF of the success probability is modeled as a linear combination of the predictors.
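Formally, with $\Phi$ denoting the standard normal CDF:

$$\Phi^{-1}(p) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$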
probit1 <- glm(HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R,
family = binomial(link = "probit"),
data = data_train_b)
summary(probit1)
##
## Call:
## glm(formula = HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R, family = binomial(link = "probit"),
## data = data_train_b)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3155 0.5022 0.5538 0.6019 1.8239
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.1491454 0.0924282 44.890 < 2e-16 ***
## MTRANU 0.0462331 0.2194236 0.211 0.833
## MTRANY -1.2795898 0.0459468 -27.849 < 2e-16 ***
## WTGAIN -0.0040800 0.0002331 -17.500 < 2e-16 ***
## M_Ht_In -0.0095616 0.0012996 -7.357 1.88e-13 ***
## PWgt_R -0.4733688 0.0151723 -31.199 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 175260 on 202805 degrees of freedom
## Residual deviance: 172914 on 202800 degrees of freedom
## AIC: 172926
##
## Number of Fisher Scoring iterations: 4
lrtest(probit1)
## Likelihood ratio test
##
## Model 1: HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R
## Model 2: HEALTHY ~ 1
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 6 -86457
## 2 1 -87630 -5 2345.6 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
probit1_up <- glm(HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R,
family = binomial(link = "probit"),
data = up_train_b)
summary(probit1_up)
##
## Call:
## glm(formula = HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R, family = binomial(link = "probit"),
## data = up_train_b)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4901 -1.1785 0.3211 1.1477 2.7191
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.312e+00 4.964e-02 26.438 <2e-16 ***
## MTRANU 1.512e-01 1.432e-01 1.056 0.291
## MTRANY -1.252e+00 3.153e-02 -39.700 <2e-16 ***
## WTGAIN -4.458e-03 1.460e-04 -30.538 <2e-16 ***
## M_Ht_In -1.032e-02 8.166e-04 -12.640 <2e-16 ***
## PWgt_R -3.126e-03 5.547e-05 -56.361 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 474875 on 342549 degrees of freedom
## Residual deviance: 467954 on 342544 degrees of freedom
## AIC: 467966
##
## Number of Fisher Scoring iterations: 4
lrtest(probit1_up)
## Likelihood ratio test
##
## Model 1: HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R
## Model 2: HEALTHY ~ 1
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 6 -233977
## 2 1 -237438 -5 6921.4 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
probit2 <- glm(HEALTHY ~ .,
family = binomial(link = "probit"),
data = data_train_b %>%
dplyr::select(all_of(selected_variables)))
summary(probit2)
##
## Call:
## glm(formula = HEALTHY ~ ., family = binomial(link = "probit"),
## data = data_train_b %>% dplyr::select(all_of(selected_variables)))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7627 0.4714 0.5404 0.6071 1.8712
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.9471602 0.2433939 20.326 < 2e-16 ***
## BMI -5.5190123 0.9818220 -5.621 1.90e-08 ***
## CIG_0 -0.0001668 0.0007935 -0.210 0.833525
## FAGECOMB -0.0038645 0.0005372 -7.193 6.33e-13 ***
## M_Ht_In -0.1829121 0.0305029 -5.997 2.02e-09 ***
## PREVIS 0.0198426 0.0010316 19.235 < 2e-16 ***
## PRIORDEAD -0.0272798 0.0204408 -1.335 0.182015
## PRIORLIVE -0.0273289 0.0029126 -9.383 < 2e-16 ***
## PRIORTERM -0.0166323 0.0038361 -4.336 1.45e-05 ***
## PWgt_R 5.1120224 0.9820949 5.205 1.94e-07 ***
## RF_CESARN 0.0811677 0.0071155 11.407 < 2e-16 ***
## WTGAIN -0.0045718 0.0002359 -19.383 < 2e-16 ***
## MTRANU 0.0767861 0.2200644 0.349 0.727145
## MTRANY -1.2101311 0.0463416 -26.113 < 2e-16 ***
## NO_RISKSYes 0.2027026 0.0087195 23.247 < 2e-16 ***
## MRAVE62 -0.0457504 0.0179277 -2.552 0.010713 *
## MRAVE63 0.0728903 0.0466648 1.562 0.118288
## MRAVE64 -0.0330758 0.0234941 -1.408 0.159181
## MRAVE65 -0.0255899 0.0846242 -0.302 0.762352
## MRAVE66 0.0313114 0.0225413 1.389 0.164813
## FRACE62 0.0076409 0.0168967 0.452 0.651114
## FRACE63 0.0735208 0.0471812 1.558 0.119170
## FRACE64 0.1198286 0.0245863 4.874 1.09e-06 ***
## FRACE65 0.0545208 0.0805867 0.677 0.498692
## FRACE66 -0.0103671 0.0228833 -0.453 0.650519
## PRECARE1 -0.0601364 0.0368897 -1.630 0.103066
## PRECARE2 0.0172995 0.0344621 0.502 0.615677
## PRECARE3 0.0477498 0.0342424 1.394 0.163177
## PRECARE4 0.0661888 0.0352354 1.878 0.060316 .
## PRECARE5 0.0748010 0.0371161 2.015 0.043871 *
## PRECARE6 0.1364267 0.0397110 3.435 0.000591 ***
## PRECARE7 0.2052904 0.0415659 4.939 7.86e-07 ***
## PRECARE8 0.2598345 0.0444195 5.850 4.93e-09 ***
## PRECARE9 0.4060971 0.0625101 6.497 8.22e-11 ***
## PRECARE10 0.6295907 0.5058268 1.245 0.213251
## PRECARE99 -0.1085057 0.0549586 -1.974 0.048345 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 175260 on 202805 degrees of freedom
## Residual deviance: 171450 on 202770 degrees of freedom
## AIC: 171522
##
## Number of Fisher Scoring iterations: 4
lrtest(probit2)
## Likelihood ratio test
##
## Model 1: HEALTHY ~ BMI + CIG_0 + FAGECOMB + M_Ht_In + PREVIS + PRIORDEAD +
## PRIORLIVE + PRIORTERM + PWgt_R + RF_CESARN + WTGAIN + MTRAN +
## NO_RISKS + MRAVE6 + FRACE6 + PRECARE
## Model 2: HEALTHY ~ 1
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 36 -85725
## 2 1 -87630 -35 3809.8 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
probit2_up <- glm(HEALTHY ~ .,
family = binomial(link = "probit"),
data = up_train_b %>%
dplyr::select(all_of(selected_variables)))
summary(probit2_up)
##
## Call:
## glm(formula = HEALTHY ~ ., family = binomial(link = "probit"),
## data = up_train_b %>% dplyr::select(all_of(selected_variables)))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0157 -1.1662 0.1561 1.1359 2.7698
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.1334981 0.2065348 0.646 0.518039
## BMI 0.0183137 0.0036468 5.022 5.12e-07 ***
## CIG_0 -0.0008612 0.0005039 -1.709 0.087454 .
## FAGECOMB -0.0048288 0.0003410 -14.161 < 2e-16 ***
## M_Ht_In 0.0029844 0.0031980 0.933 0.350701
## PREVIS 0.0208686 0.0006024 34.642 < 2e-16 ***
## PRIORDEAD -0.0263235 0.0141122 -1.865 0.062139 .
## PRIORLIVE -0.0314275 0.0018611 -16.887 < 2e-16 ***
## PRIORTERM -0.0186044 0.0024619 -7.557 4.13e-14 ***
## PWgt_R -0.0058005 0.0006151 -9.431 < 2e-16 ***
## RF_CESARN 0.0945636 0.0044804 21.106 < 2e-16 ***
## WTGAIN -0.0051383 0.0001486 -34.585 < 2e-16 ***
## MTRANU 0.1982839 0.1438730 1.378 0.168146
## MTRANY -1.1692965 0.0319060 -36.648 < 2e-16 ***
## NO_RISKSYes 0.2261276 0.0055346 40.857 < 2e-16 ***
## MRAVE62 -0.0542894 0.0114100 -4.758 1.95e-06 ***
## MRAVE63 0.0926131 0.0294341 3.146 0.001653 **
## MRAVE64 -0.0260235 0.0150222 -1.732 0.083214 .
## MRAVE65 0.0371983 0.0546750 0.680 0.496281
## MRAVE66 0.0330876 0.0143040 2.313 0.020713 *
## FRACE62 0.0025009 0.0107386 0.233 0.815848
## FRACE63 0.0508883 0.0295082 1.725 0.084609 .
## FRACE64 0.1401428 0.0156800 8.938 < 2e-16 ***
## FRACE65 0.0619734 0.0513901 1.206 0.227841
## FRACE66 -0.0048083 0.0145973 -0.329 0.741858
## PRECARE1 -0.0562694 0.0235201 -2.392 0.016739 *
## PRECARE2 0.0371940 0.0219783 1.692 0.090587 .
## PRECARE3 0.0748041 0.0218752 3.420 0.000627 ***
## PRECARE4 0.0870006 0.0225319 3.861 0.000113 ***
## PRECARE5 0.0917247 0.0237531 3.862 0.000113 ***
## PRECARE6 0.1622666 0.0253840 6.392 1.63e-10 ***
## PRECARE7 0.2304884 0.0265087 8.695 < 2e-16 ***
## PRECARE8 0.3011189 0.0283411 10.625 < 2e-16 ***
## PRECARE9 0.4855915 0.0394482 12.310 < 2e-16 ***
## PRECARE10 0.5008515 0.2793252 1.793 0.072961 .
## PRECARE99 -0.1070841 0.0353945 -3.025 0.002483 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 474875 on 342549 degrees of freedom
## Residual deviance: 463347 on 342514 degrees of freedom
## AIC: 463419
##
## Number of Fisher Scoring iterations: 4
lrtest(probit2_up)
## Likelihood ratio test
##
## Model 1: HEALTHY ~ BMI + CIG_0 + FAGECOMB + M_Ht_In + PREVIS + PRIORDEAD +
## PRIORLIVE + PRIORTERM + PWgt_R + RF_CESARN + WTGAIN + MTRAN +
## NO_RISKS + MRAVE6 + FRACE6 + PRECARE
## Model 2: HEALTHY ~ 1
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 36 -231673
## 2 1 -237438 -35 11529 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ordered logit model is a regression model for an ordinal outcome variable, based on the cumulative probabilities of the response. Our variable is nominal (we can say that being "healthy" is best, but we cannot say whether "underweight" or "overweight" is worse), yet we decided to test this model's performance as well and compare it with the others.
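Formally, polr() parameterizes the cumulative logits with category thresholds $\zeta_j$ (the "Intercepts" in the output below) and a single set of slopes:

$$\log\frac{P(Y \le j)}{P(Y > j)} = \zeta_j - (\beta_1 x_1 + \dots + \beta_k x_k)$$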
ordered_logit1 <- polr(HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R, data = data_train, Hess=TRUE)
summary(ordered_logit1)
## Call:
## polr(formula = HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R, data = data_train,
## Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## MTRANU 0.15564 0.3829793 0.4064
## MTRANY -2.83394 0.0764749 -37.0572
## WTGAIN 0.02407 0.0004258 56.5279
## M_Ht_In 0.06269 0.0023503 26.6757
## PWgt_R 1.14950 0.0278901 41.2152
##
## Intercepts:
## Value Std. Error t value
## underweight|healthy 7.8862 0.1665 47.3697
## healthy|overweight 13.0458 0.1701 76.6993
##
## Residual Deviance: 212006.11
## AIC: 212020.11
# two-sided p-values from the t values (object renamed to coef_table
# to avoid shadowing base::table())
coef_table <- coef(summary(ordered_logit1))
pvalue <- pnorm(abs(coef_table[, "t value"]), lower.tail = FALSE) * 2
coef_table <- cbind(coef_table, "p value" = pvalue)
coef_table
## Value Std. Error t value p value
## MTRANU 0.15563960 0.38297935 0.4063916 6.844549e-01
## MTRANY -2.83394258 0.07647486 -37.0571803 1.376082e-300
## WTGAIN 0.02406675 0.00042575 56.5278864 0.000000e+00
## M_Ht_In 0.06269460 0.00235025 26.6757137 9.007545e-157
## PWgt_R 1.14949714 0.02789011 41.2152301 0.000000e+00
## underweight|healthy 7.88623137 0.16648260 47.3697047 0.000000e+00
## healthy|overweight 13.04577311 0.17008993 76.6992681 0.000000e+00
ordered_logit1_up <- polr(HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R, data = up_train, Hess=TRUE)
summary(ordered_logit1_up)
## Call:
## polr(formula = HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R, data = up_train,
## Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## MTRANU 0.185832 5.830e-05 3187.32
## MTRANY -2.821313 4.304e-02 -65.56
## WTGAIN 0.030540 1.867e-04 163.58
## M_Ht_In 0.089727 1.010e-03 88.85
## PWgt_R 0.007691 6.906e-05 111.37
##
## Intercepts:
## Value Std. Error t value
## underweight|healthy 7.1404 0.0616 115.9909
## healthy|overweight 8.6730 0.0620 139.9549
##
## Residual Deviance: 1062356.53
## AIC: 1062370.53
coef_table <- coef(summary(ordered_logit1_up))
pvalue <- pnorm(abs(coef_table[, "t value"]), lower.tail = FALSE) * 2
coef_table <- cbind(coef_table, "p value" = pvalue)
coef_table
## Value Std. Error t value p value
## MTRANU 0.185831739 5.830341e-05 3187.32176 0
## MTRANY -2.821312819 4.303678e-02 -65.55585 0
## WTGAIN 0.030539817 1.866985e-04 163.57826 0
## M_Ht_In 0.089726723 1.009869e-03 88.84984 0
## PWgt_R 0.007690566 6.905526e-05 111.36829 0
## underweight|healthy 7.140412132 6.156013e-02 115.99085 0
## healthy|overweight 8.672994037 6.196991e-02 139.95492 0
ordered_logit2 <- polr(HEALTHY ~ ., data = data_train %>%
dplyr::select(all_of(selected_variables)), Hess=TRUE)
summary(ordered_logit2)
## Call:
## polr(formula = HEALTHY ~ ., data = data_train %>% dplyr::select(all_of(selected_variables)),
## Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## BMI 3.216841 0.0176363 182.3987
## CIG_0 -0.027977 0.0013557 -20.6366
## FAGECOMB 0.001082 0.0009769 1.1079
## M_Ht_In 0.156993 0.0019280 81.4268
## PREVIS 0.071178 0.0017688 40.2411
## PRIORDEAD -0.199081 0.0353148 -5.6373
## PRIORLIVE 0.110843 0.0053851 20.5831
## PRIORTERM -0.054770 0.0071485 -7.6618
## PWgt_R -1.833709 0.0313560 -58.4804
## RF_CESARN 0.296939 0.0130434 22.7655
## WTGAIN 0.024450 0.0004338 56.3661
## MTRANU 0.092236 0.3818212 0.2416
## MTRANY -2.631615 0.0795610 -33.0767
## NO_RISKSYes 0.693896 0.0165055 42.0403
## MRAVE62 -0.551976 0.0329198 -16.7673
## MRAVE63 0.176262 0.0813627 2.1664
## MRAVE64 0.089000 0.0418414 2.1271
## MRAVE65 -0.044632 0.1519001 -0.2938
## MRAVE66 -0.066339 0.0404528 -1.6399
## FRACE62 -0.294842 0.0309789 -9.5175
## FRACE63 0.080730 0.0824449 0.9792
## FRACE64 -0.272273 0.0433080 -6.2869
## FRACE65 -0.362235 0.1438379 -2.5184
## FRACE66 -0.175841 0.0416263 -4.2243
## PRECARE1 -0.514185 0.0670302 -7.6709
## PRECARE2 -0.173062 0.0625298 -2.7677
## PRECARE3 -0.074238 0.0621773 -1.1940
## PRECARE4 -0.042645 0.0640129 -0.6662
## PRECARE5 -0.028957 0.0674073 -0.4296
## PRECARE6 0.139886 0.0717880 1.9486
## PRECARE7 0.219416 0.0744073 2.9488
## PRECARE8 0.514057 0.0788805 6.5169
## PRECARE9 0.894303 0.1038256 8.6135
## PRECARE10 0.895591 0.0051332 174.4701
## PRECARE99 -0.159738 0.1032545 -1.5470
##
## Intercepts:
## Value Std. Error t value
## underweight|healthy 10.5653 0.1806 58.4959
## healthy|overweight 15.9294 0.1846 86.2834
##
## Residual Deviance: 205516.62
## AIC: 205590.62
coef_table <- coef(summary(ordered_logit2))
pvalue <- pnorm(abs(coef_table[, "t value"]), lower.tail = FALSE) * 2
coef_table <- cbind(coef_table, "p value" = pvalue)
coef_table
## Value Std. Error t value p value
## BMI 3.216841284 0.0176363210 182.3986586 0.000000e+00
## CIG_0 -0.027977359 0.0013557173 -20.6365736 1.288816e-94
## FAGECOMB 0.001082346 0.0009769337 1.1079015 2.679044e-01
## M_Ht_In 0.156993428 0.0019280320 81.4267757 0.000000e+00
## PREVIS 0.071178377 0.0017687998 40.2410581 0.000000e+00
## PRIORDEAD -0.199081439 0.0353148433 -5.6373304 1.727066e-08
## PRIORLIVE 0.110842665 0.0053851309 20.5830956 3.890266e-94
## PRIORTERM -0.054770157 0.0071485093 -7.6617593 1.834032e-14
## PWgt_R -1.833709491 0.0313559923 -58.4803528 0.000000e+00
## RF_CESARN 0.296939396 0.0130433698 22.7655430 1.006790e-114
## WTGAIN 0.024450155 0.0004337739 56.3661323 0.000000e+00
## MTRANU 0.092235864 0.3818212414 0.2415682 8.091148e-01
## MTRANY -2.631615483 0.0795610140 -33.0766961 6.430013e-240
## NO_RISKSYes 0.693895722 0.0165055055 42.0402589 0.000000e+00
## MRAVE62 -0.551976259 0.0329198010 -16.7673025 4.232942e-63
## MRAVE63 0.176261974 0.0813626801 2.1663737 3.028264e-02
## MRAVE64 0.089000113 0.0418414232 2.1270814 3.341332e-02
## MRAVE65 -0.044632288 0.1519000626 -0.2938267 7.688904e-01
## MRAVE66 -0.066339441 0.0404527901 -1.6399225 1.010213e-01
## FRACE62 -0.294842034 0.0309789157 -9.5175066 1.773840e-21
## FRACE63 0.080730122 0.0824448725 0.9792012 3.274806e-01
## FRACE64 -0.272272794 0.0433079883 -6.2868954 3.238774e-10
## FRACE65 -0.362235126 0.1438378519 -2.5183575 1.179036e-02
## FRACE66 -0.175841362 0.0416263007 -4.2242851 2.397005e-05
## PRECARE1 -0.514184960 0.0670302469 -7.6709394 1.707412e-14
## PRECARE2 -0.173062330 0.0625297948 -2.7676779 5.645723e-03
## PRECARE3 -0.074238480 0.0621772765 -1.1939809 2.324854e-01
## PRECARE4 -0.042644883 0.0640128957 -0.6661921 5.052884e-01
## PRECARE5 -0.028957299 0.0674073033 -0.4295870 6.674961e-01
## PRECARE6 0.139885667 0.0717880268 1.9485933 5.134401e-02
## PRECARE7 0.219415785 0.0744073319 2.9488463 3.189626e-03
## PRECARE8 0.514056701 0.0788804994 6.5169047 7.177297e-11
## PRECARE9 0.894302869 0.1038256123 8.6135092 7.085753e-18
## PRECARE10 0.895590681 0.0051332046 174.4700939 0.000000e+00
## PRECARE99 -0.159738091 0.1032544608 -1.5470333 1.218552e-01
## underweight|healthy 10.565262329 0.1806153126 58.4959391 0.000000e+00
## healthy|overweight 15.929353927 0.1846167551 86.2833599 0.000000e+00
ordered_logit2_up <- polr(HEALTHY ~ ., data = up_train %>%
dplyr::select(all_of(selected_variables)), Hess=TRUE)
summary(ordered_logit2_up)
## Call:
## polr(formula = HEALTHY ~ ., data = up_train %>% dplyr::select(all_of(selected_variables)),
## Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## BMI 0.045555 1.241e-03 36.703
## CIG_0 -0.041130 6.996e-04 -58.791
## FAGECOMB 0.001368 4.272e-04 3.201
## M_Ht_In 0.116964 5.555e-04 210.575
## PREVIS 0.088718 7.939e-04 111.744
## PRIORDEAD -0.341795 1.909e-02 -17.901
## PRIORLIVE 0.135021 2.340e-03 57.711
## PRIORTERM -0.070324 3.087e-03 -22.778
## PWgt_R 0.001931 2.144e-04 9.004
## RF_CESARN 0.373115 5.730e-03 65.118
## WTGAIN 0.030131 1.928e-04 156.269
## MTRANU 0.059441 5.837e-05 1018.391
## MTRANY -2.560628 4.408e-02 -58.091
## NO_RISKSYes 0.887450 7.023e-03 126.358
## MRAVE62 -0.723349 1.436e-02 -50.372
## MRAVE63 0.215550 3.666e-02 5.880
## MRAVE64 0.097958 1.857e-02 5.276
## MRAVE65 0.070263 6.667e-02 1.054
## MRAVE66 -0.100390 1.774e-02 -5.658
## FRACE62 -0.380369 1.347e-02 -28.246
## FRACE63 0.070093 3.691e-02 1.899
## FRACE64 -0.420195 1.946e-02 -21.591
## FRACE65 -0.475187 6.305e-02 -7.537
## FRACE66 -0.197115 1.817e-02 -10.848
## PRECARE1 -0.709847 3.007e-02 -23.605
## PRECARE2 -0.253104 2.813e-02 -8.999
## PRECARE3 -0.137133 2.798e-02 -4.901
## PRECARE4 -0.103124 2.876e-02 -3.585
## PRECARE5 -0.128586 3.026e-02 -4.249
## PRECARE6 0.126580 3.223e-02 3.927
## PRECARE7 0.194718 3.366e-02 5.786
## PRECARE8 0.593472 3.584e-02 16.561
## PRECARE9 1.160482 5.020e-02 23.119
## PRECARE10 1.196966 1.443e-03 829.710
## PRECARE99 -0.253025 4.443e-02 -5.695
##
## Intercepts:
## Value Std. Error t value
## underweight|healthy 10.5590 0.0030 3539.3998
## healthy|overweight 12.2419 0.0047 2584.0796
##
## Residual Deviance: 1002139.28
## AIC: 1002213.28
coef_table <- coef(summary(ordered_logit2_up))
pvalue <- pnorm(abs(coef_table[, "t value"]), lower.tail = FALSE) * 2
coef_table <- cbind(coef_table, "p value" = pvalue)
coef_table
## Value Std. Error t value p value
## BMI 0.045554567 0.0012411741 36.702803 6.588587e-295
## CIG_0 -0.041130045 0.0006995941 -58.791299 0.000000e+00
## FAGECOMB 0.001367555 0.0004271872 3.201302 1.368081e-03
## M_Ht_In 0.116964352 0.0005554513 210.575358 0.000000e+00
## PREVIS 0.088718447 0.0007939442 111.743932 0.000000e+00
## PRIORDEAD -0.341794795 0.0190937953 -17.900831 1.161763e-71
## PRIORLIVE 0.135020820 0.0023396022 57.711015 0.000000e+00
## PRIORTERM -0.070324389 0.0030874015 -22.777857 7.601962e-115
## PWgt_R 0.001930807 0.0002144394 9.003976 2.176882e-19
## RF_CESARN 0.373114613 0.0057297944 65.118325 0.000000e+00
## WTGAIN 0.030131149 0.0001928159 156.269038 0.000000e+00
## MTRANU 0.059440944 0.0000583675 1018.391142 0.000000e+00
## MTRANY -2.560628363 0.0440795348 -58.091093 0.000000e+00
## NO_RISKSYes 0.887450302 0.0070233267 126.357542 0.000000e+00
## MRAVE62 -0.723349233 0.0143601412 -50.372014 0.000000e+00
## MRAVE63 0.215549789 0.0366560940 5.880326 4.094588e-09
## MRAVE64 0.097958261 0.0185670679 5.275914 1.320957e-07
## MRAVE65 0.070262654 0.0666736041 1.053830 2.919607e-01
## MRAVE66 -0.100390167 0.0177429751 -5.658023 1.531263e-08
## FRACE62 -0.380369152 0.0134664172 -28.245757 1.604810e-175
## FRACE63 0.070092721 0.0369057151 1.899238 5.753325e-02
## FRACE64 -0.420194610 0.0194616856 -21.590864 2.188819e-103
## FRACE65 -0.475187063 0.0630454993 -7.537208 4.801389e-14
## FRACE66 -0.197114870 0.0181698150 -10.848480 2.027697e-27
## PRECARE1 -0.709847074 0.0300713143 -23.605456 3.387674e-123
## PRECARE2 -0.253103834 0.0281265631 -8.998747 2.283077e-19
## PRECARE3 -0.137133168 0.0279789569 -4.901297 9.520616e-07
## PRECARE4 -0.103123813 0.0287617415 -3.585451 3.364963e-04
## PRECARE5 -0.128585688 0.0302645107 -4.248728 2.149873e-05
## PRECARE6 0.126579990 0.0322299895 3.927398 8.586989e-05
## PRECARE7 0.194717948 0.0336556906 5.785588 7.225926e-09
## PRECARE8 0.593471610 0.0358358855 16.560819 1.337611e-61
## PRECARE9 1.160482439 0.0501951147 23.119430 2.952621e-118
## PRECARE10 1.196966159 0.0014426322 829.709854 0.000000e+00
## PRECARE99 -0.253024919 0.0444299216 -5.694922 1.234286e-08
## underweight|healthy 10.558951565 0.0029832604 3539.399850 0.000000e+00
## healthy|overweight 12.241915421 0.0047374375 2584.079583 0.000000e+00
Multinomial logistic regression is used to model nominal outcome variables: the log-odds of each outcome relative to a reference category are modeled as a linear combination of the predictor variables.
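Formally, with the first level ("underweight") as the reference category, multinom() fits one equation per remaining class $j \in \{\text{healthy}, \text{overweight}\}$:

$$\log\frac{P(Y = j)}{P(Y = \text{underweight})} = \beta_{0j} + \beta_{1j} x_1 + \dots + \beta_{kj} x_k$$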
mlogit1 <- multinom(HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R, data = data_train)
## # weights: 21 (12 variable)
## initial value 222805.163817
## iter 10 value 108712.888396
## iter 20 value 105476.802171
## final value 105476.061731
## converged
summary(mlogit1)
## Call:
## multinom(formula = HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R,
## data = data_train)
##
## Coefficients:
## (Intercept) MTRANU MTRANY WTGAIN M_Ht_In PWgt_R
## healthy -1.920928 0.2567871 -2.820386 0.01717018 0.04212227 0.2395064
## overweight -18.805410 0.3354321 -2.944548 0.04213521 0.11887564 1.9686275
##
## Std. Errors:
## (Intercept) MTRANU MTRANY WTGAIN M_Ht_In PWgt_R
## healthy 0.2362039 0.2818162 0.07600255 0.0006593361 0.003328712 0.03944682
## overweight 0.3171847 0.1989627 0.18930194 0.0008087449 0.004394770 0.05128319
##
## Residual Deviance: 210952.1
## AIC: 210976.1
mlogit1_up <- multinom(HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R, data = up_train)
## # weights: 21 (12 variable)
## initial value 564013.267042
## iter 10 value 539125.069731
## iter 20 value 529295.453728
## final value 529173.568246
## converged
summary(mlogit1_up)
## Call:
## multinom(formula = HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R,
## data = up_train)
##
## Coefficients:
## (Intercept) MTRANU MTRANY WTGAIN M_Ht_In PWgt_R
## healthy -3.442004 0.3816803 -2.801082 0.01541438 0.04614605 0.0005673912
## overweight -11.105672 0.3026172 -2.827032 0.04057141 0.12814230 0.0099425461
##
## Std. Errors:
## (Intercept) MTRANU MTRANY WTGAIN M_Ht_In
## healthy 0.08100244 7.664477e-05 0.05925149 0.0002531117 0.001335214
## overweight 0.08648503 1.030254e-04 0.05769547 0.0002584301 0.001399310
## PWgt_R
## healthy 9.405926e-05
## overweight 9.294986e-05
##
## Residual Deviance: 1058347
## AIC: 1058371
mlogit2 <- multinom(HEALTHY ~ ., data = data_train %>%
dplyr::select(all_of(selected_variables)))
## # weights: 111 (72 variable)
## initial value 222805.163817
## iter 10 value 118504.274965
## iter 20 value 114584.343066
## iter 30 value 112255.956964
## iter 40 value 108389.154977
## iter 50 value 104387.335215
## iter 60 value 102384.963160
## iter 70 value 101540.490620
## iter 80 value 101493.544512
## final value 101492.726191
## converged
summary(mlogit2)
## Call:
## multinom(formula = HEALTHY ~ ., data = data_train %>% dplyr::select(all_of(selected_variables)))
##
## Coefficients:
## (Intercept) BMI CIG_0 FAGECOMB M_Ht_In PREVIS
## healthy -3.604689 -2.676985 -0.02133371 -0.005964705 -0.0493575 0.1217507
## overweight -19.805450 -12.813361 -0.05951769 0.003506744 -0.2884975 0.1494623
## PRIORDEAD PRIORLIVE PRIORTERM PWgt_R RF_CESARN WTGAIN
## healthy -0.1703941 0.0575048 -0.07476282 3.248499 0.3931615 0.01589951
## overweight -0.3649704 0.1904451 -0.09196093 15.276666 0.5070517 0.04257545
## MTRANU MTRANY NO_RISKSYes MRAVE62 MRAVE63 MRAVE64
## healthy 0.2573399 -2.587043 0.9411365 -0.4715966 0.4038744 0.02901287
## overweight 0.1712757 -2.682084 1.2044148 -1.0138549 0.4336734 0.13079753
## MRAVE65 MRAVE66 FRACE62 FRACE63 FRACE64 FRACE65
## healthy -0.001800903 0.01114594 -0.2477312 0.2721514 -0.06496821 -0.2476280
## overweight -0.057636288 -0.11075327 -0.5561099 0.2158364 -0.67315808 -0.6970897
## FRACE66 PRECARE1 PRECARE2 PRECARE3 PRECARE4 PRECARE5
## healthy -0.1962585 -1.034434 -0.6201651 -0.4564212 -0.3763421 -0.3081936
## overweight -0.3296519 -1.271875 -0.6656170 -0.4685877 -0.4000222 -0.3671841
## PRECARE6 PRECARE7 PRECARE8 PRECARE9 PRECARE10 PRECARE99
## healthy -0.04132263 0.1912822 0.6210939 1.670165 4.773688 -0.7209531
## overweight -0.04142750 0.1166186 0.7499877 1.939176 4.761574 -0.5589703
##
## Std. Errors:
## (Intercept) BMI CIG_0 FAGECOMB M_Ht_In
## healthy 0.2487441 0.03034113 0.001598626 0.001331603 0.002639070
## overweight 0.3525165 0.03071874 0.003135739 0.001826788 0.003719195
## PREVIS PRIORDEAD PRIORLIVE PRIORTERM PWgt_R RF_CESARN
## healthy 0.002889024 0.04008598 0.007777708 0.009125524 0.05100012 0.01874693
## overweight 0.003593018 0.07892820 0.009852410 0.012799373 0.05556684 0.02408881
## WTGAIN MTRANU MTRANY NO_RISKSYes MRAVE62 MRAVE63
## healthy 0.0006611956 0.2832313 0.08129894 0.02062497 0.04253047 0.1391142
## overweight 0.0008231241 0.1998819 0.19360186 0.02957927 0.06282863 0.1691612
## MRAVE64 MRAVE65 MRAVE66 FRACE62 FRACE63 FRACE64
## healthy 0.05697526 0.2129529 0.05816113 0.04121151 0.1399654 0.05914417
## overweight 0.08353703 0.2893368 0.07783189 0.05810496 0.1711920 0.08927957
## FRACE65 FRACE66 PRECARE1 PRECARE2 PRECARE3 PRECARE4
## healthy 0.1969180 0.05628444 0.07709273 0.0713563 0.07073577 0.07336531
## overweight 0.2771667 0.07825624 0.13416572 0.1262939 0.12562441 0.12883031
## PRECARE5 PRECARE6 PRECARE7 PRECARE8 PRECARE9 PRECARE10
## healthy 0.07763364 0.08597836 0.09130594 0.1044832 0.2148028 0.02053736
## overweight 0.13518457 0.14391749 0.15104719 0.1617907 0.2615615 0.02058515
## PRECARE99
## healthy 0.1211254
## overweight 0.1888979
##
## Residual Deviance: 202985.5
## AIC: 203129.5
mlogit2_up <- multinom(HEALTHY ~ ., data = up_train %>%
dplyr::select(all_of(selected_variables)))
## # weights: 111 (72 variable)
## initial value 564013.267042
## iter 10 value 539936.167145
## iter 20 value 525064.721705
## iter 30 value 521571.510697
## iter 40 value 518784.467827
## iter 50 value 510164.218567
## iter 60 value 503303.147920
## iter 70 value 498212.042972
## iter 80 value 496300.980538
## final value 496300.815743
## converged
summary(mlogit2_up)
## Call:
## multinom(formula = HEALTHY ~ ., data = up_train %>% dplyr::select(all_of(selected_variables)))
##
## Coefficients:
## (Intercept) BMI CIG_0 FAGECOMB M_Ht_In
## healthy -7.909551 0.06435719 -0.02429116 -0.006609981 0.09360385
## overweight -20.523026 0.14353015 -0.06202987 0.003281499 0.23691071
## PREVIS PRIORDEAD PRIORLIVE PRIORTERM PWgt_R RF_CESARN
## healthy 0.09736997 -0.2410562 0.04659381 -0.07721144 -0.008275869 0.3676203
## overweight 0.12209152 -0.4690963 0.17897109 -0.09436269 -0.011044210 0.4825466
## WTGAIN MTRANU MTRANY NO_RISKSYes MRAVE62 MRAVE63
## healthy 0.01404663 0.2909451 -2.591708 0.9304431 -0.4856432 0.3347675
## overweight 0.04110785 0.1027567 -2.572920 1.1926029 -1.0281846 0.3439250
## MRAVE64 MRAVE65 MRAVE66 FRACE62 FRACE63 FRACE64
## healthy 0.02962716 0.09816926 -0.04904639 -0.2316751 0.1382986 -0.08297968
## overweight 0.13607314 0.11511796 -0.14898937 -0.5423345 0.1201940 -0.72048798
## FRACE65 FRACE66 PRECARE1 PRECARE2 PRECARE3 PRECARE4
## healthy -0.3262072 -0.1599763 -0.7650321 -0.3305209 -0.200307 -0.1373757
## overweight -0.6486167 -0.2805514 -1.0593797 -0.4258586 -0.264184 -0.2256212
## PRECARE5 PRECARE6 PRECARE7 PRECARE8 PRECARE9 PRECARE10
## healthy -0.1437398 0.13441013 0.3034858 0.6896253 1.680577 8.039405
## overweight -0.2778856 0.08085994 0.1642038 0.7656501 1.844563 7.902292
## PRECARE99
## healthy -0.5635237
## overweight -0.4225226
##
## Std. Errors:
## (Intercept) BMI CIG_0 FAGECOMB M_Ht_In
## healthy 0.003045013 0.001609676 0.0007996155 0.0005575522 0.0006079060
## overweight 0.002593407 0.001669361 0.0010814747 0.0006071785 0.0005987777
## PREVIS PRIORDEAD PRIORLIVE PRIORTERM PWgt_R
## healthy 0.001023771 0.02239746 0.003194259 0.004003548 0.0002789411
## overweight 0.001087693 0.02795829 0.003288692 0.004285469 0.0002906121
## RF_CESARN WTGAIN MTRANU MTRANY NO_RISKSYes
## healthy 0.007560774 0.0002618767 9.147067e-05 0.06015949 0.008922890
## overweight 0.007892150 0.0002752565 1.029778e-04 0.05941728 0.009814584
## MRAVE62 MRAVE63 MRAVE64 MRAVE65 MRAVE66 FRACE62
## healthy 0.01800707 0.05210396 0.02373511 0.05519714 0.02344048 0.01719111
## overweight 0.02053053 0.05378510 0.02739273 0.05173535 0.02532852 0.01906421
## FRACE63 FRACE64 FRACE65 FRACE66 PRECARE1 PRECARE2
## healthy 0.05158844 0.02463047 0.05485284 0.02362520 0.02968984 0.02689723
## overweight 0.05368095 0.02928041 0.05466426 0.02575004 0.02583286 0.02186286
## PRECARE3 PRECARE4 PRECARE5 PRECARE6 PRECARE7 PRECARE8
## healthy 0.02679082 0.02820006 0.03041236 0.03364370 0.03573217 0.03975014
## overweight 0.02183152 0.02402184 0.02755885 0.03161935 0.03486198 0.03901790
## PRECARE9 PRECARE10 PRECARE99
## healthy 0.06608533 0.001990894 0.05031361
## overweight 0.06280260 0.001990468 0.04775122
##
## Residual Deviance: 992601.6
## AIC: 992745.6
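The coefficients above are on the log-odds scale, and multinom does not report p-values directly. A minimal follow-up sketch, assuming the mlogit1 object fitted above: exponentiating the coefficients yields relative risk ratios, and Wald z-statistics can be derived from the reported standard errors.
exp(coef(mlogit1)) # relative risk ratios vs. the reference class
z <- summary(mlogit1)$coefficients / summary(mlogit1)$standard.errors
p <- 2 * pnorm(abs(z), lower.tail = FALSE) # approximate two-sided p-values
p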
The Naive Bayes algorithm is based on Bayes' Theorem and assumes conditional independence among predictors. More precisely, it assumes that, given the class, the value of a particular feature is unrelated to the value of any other feature.
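Formally, the class posterior is proportional to the prior times the product of the per-feature class-conditional densities:

$$P(C \mid x_1, \dots, x_p) \propto P(C) \prod_{i=1}^{p} P(x_i \mid C)$$

The usekernel = T argument below swaps the default Gaussian class-conditional densities for kernel density estimates on the numeric predictors, which is where the KDE entries in the summaries come from.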
bayes1 <- naive_bayes(HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R, data = data_train, usekernel = T)
summary(bayes1)
##
## ================================== Naive Bayes ==================================
##
## - Call: naive_bayes.formula(formula = HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R, data = data_train, usekernel = T)
## - Laplace: 0
## - Classes: 3
## - Samples: 202806
## - Features: 4
## - Conditional distributions:
## - Categorical: 1
## - KDE: 3
## - Prior probabilities:
## - underweight: 0.0746
## - healthy: 0.8438
## - overweight: 0.0816
##
## ---------------------------------------------------------------------------------
bayes1_up <- naive_bayes(HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R, data = up_train, usekernel = T)
summary(bayes1_up)
##
## ================================== Naive Bayes ==================================
##
## - Call: naive_bayes.formula(formula = HEALTHY ~ MTRAN + WTGAIN + M_Ht_In + PWgt_R, data = up_train, usekernel = T)
## - Laplace: 0
## - Classes: 3
## - Samples: 513387
## - Features: 4
## - Conditional distributions:
## - Categorical: 1
## - KDE: 3
## - Prior probabilities:
## - underweight: 0.3333
## - healthy: 0.3333
## - overweight: 0.3333
##
## ---------------------------------------------------------------------------------
bayes2 <- naive_bayes(HEALTHY ~ ., data = data_train, usekernel = T)
summary(bayes2)
##
## ================================== Naive Bayes ==================================
##
## - Call: naive_bayes.formula(formula = HEALTHY ~ ., data = data_train, usekernel = T)
## - Laplace: 0
## - Classes: 3
## - Samples: 202806
## - Features: 31
## - Conditional distributions:
## - Bernoulli: 7
## - Categorical: 13
## - KDE: 11
## - Prior probabilities:
## - underweight: 0.0746
## - healthy: 0.8438
## - overweight: 0.0816
##
## ---------------------------------------------------------------------------------
bayes2_up <- naive_bayes(HEALTHY ~ ., data = up_train, usekernel = T)
summary(bayes2_up)
##
## ================================== Naive Bayes ==================================
##
## - Call: naive_bayes.formula(formula = HEALTHY ~ ., data = up_train, usekernel = T)
## - Laplace: 0
## - Classes: 3
## - Samples: 513387
## - Features: 32
## - Conditional distributions:
## - Bernoulli: 7
## - Categorical: 14
## - KDE: 11
## - Prior probabilities:
## - underweight: 0.3333
## - healthy: 0.3333
## - overweight: 0.3333
##
## ---------------------------------------------------------------------------------
tree <- rpart(HEALTHY ~ MTRAN + WTGAIN + PWgt_R + PREVIS +
                PRIORDEAD + M_Ht_In + CIG_0 + BMI + FAGECOMB +
                PRIORLIVE + PRIORTERM + NO_INFEC, data = up_train, method = 'class')
rpart.plot(tree)
Among the variables supplied, the tree treats PREVIS as the strongest factor in splitting the data, followed by WTGAIN, PWgt_R and M_Ht_In. The splits on the weight variables make sense, but it is peculiar that PREVIS was chosen as the most important one: as soon as a mother had fewer than 9 prenatal visits, the model automatically classifies the baby as underweight.
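To corroborate this reading of the plot, we can inspect the importance scores that rpart stores on the fitted object (a quick check using the tree object from above):
tree$variable.importance # named vector of importance scores, highest first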
We created two helper functions that display the overall and per-class accuracy of a model's predictions compared with the true labels in the test sets.
summary_ordinal <- function(predictions, real_data){
  # overall accuracy: share of predictions that match the true labels
  accurate <- sum(predictions == real_data) / length(real_data)
  print(paste('Total percentage accuracy --- ', accurate))
  ### underweight
  under_indices <- real_data == "underweight"
  under_accuracy <- sum(predictions[under_indices] == real_data[under_indices]) /
    sum(under_indices)
  print(paste('Total underweight percentage accuracy --- ', under_accuracy))
  ### healthy
  healthy_indices <- real_data == "healthy"
  healthy_accuracy <- sum(predictions[healthy_indices] == real_data[healthy_indices]) /
    sum(healthy_indices)
  print(paste('Total healthy percentage accuracy --- ', healthy_accuracy))
  ### overweight
  over_indices <- real_data == "overweight"
  over_accuracy <- sum(predictions[over_indices] == real_data[over_indices]) /
    sum(over_indices)
  print(paste('Total overweight percentage accuracy --- ', over_accuracy))
}
summary_binary <- function(predictions, real_data){
  # overall accuracy: share of predictions that match the true labels
  accurate <- sum(predictions == real_data) / length(real_data)
  print(paste('Total percentage accuracy --- ', accurate))
  ### only positives ("Yes" = healthy)
  positive_indices <- real_data == "Yes"
  positive_accuracy <- sum(predictions[positive_indices] == real_data[positive_indices]) /
    sum(positive_indices)
  print(paste('Total positive percentage accuracy --- ', positive_accuracy))
  ### only negatives ("No" = unhealthy)
  negative_indices <- real_data == "No"
  negative_accuracy <- sum(predictions[negative_indices] == real_data[negative_indices]) /
    sum(negative_indices)
  print(paste('Total negative percentage accuracy --- ', negative_accuracy))
}
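The per-class accuracies printed by these functions are simply the class-wise sensitivities, so they can be cross-checked against the confusionMatrix output. A minimal sketch for the three-class case, assuming the mlogit1 and data_test objects from above:
preds <- predict(mlogit1, newdata = data_test) # three-class predictions
cm <- confusionMatrix(data = preds, reference = data_test$HEALTHY)
cm$byClass[, "Sensitivity"] # one sensitivity per class, matching summary_ordinal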
results <- predict(logit1, newdata = data_test_b)
results <- factor(ifelse(results > 0.5,'Yes','No'),
levels = c('Yes', 'No'))
confusionMatrix(data = results, reference = data_test_b$HEALTHY)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 202 147
## Yes 13453 73113
##
## Accuracy : 0.8435
## 95% CI : (0.8411, 0.8459)
## No Information Rate : 0.8429
## P-Value [Acc > NIR] : 0.306
##
## Kappa : 0.0212
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.014793
## Specificity : 0.997993
## Pos Pred Value : 0.578797
## Neg Pred Value : 0.844593
## Prevalence : 0.157108
## Detection Rate : 0.002324
## Detection Prevalence : 0.004015
## Balanced Accuracy : 0.506393
##
## 'Positive' Class : No
##
summary_binary(results, data_test_b$HEALTHY)
## [1] "Total percentage accuracy --- 0.843525283322787"
## [1] "Total positive percentage accuracy --- 0.997993447993448"
## [1] "Total negative percentage accuracy --- 0.0147931160746979"
results <- predict(logit1_up, newdata = data_test_b)
results <- factor(ifelse(results > 0.5,'Yes','No'),
levels = c('Yes', 'No'))
confusionMatrix(data = results, reference = data_test_b$HEALTHY)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 511 1003
## Yes 13144 72257
##
## Accuracy : 0.8372
## 95% CI : (0.8348, 0.8397)
## No Information Rate : 0.8429
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0372
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.037422
## Specificity : 0.986309
## Pos Pred Value : 0.337517
## Neg Pred Value : 0.846091
## Prevalence : 0.157108
## Detection Rate : 0.005879
## Detection Prevalence : 0.017419
## Balanced Accuracy : 0.511866
##
## 'Positive' Class : No
##
summary_binary(results, data_test_b$HEALTHY)
## [1] "Total percentage accuracy --- 0.837231778174078"
## [1] "Total positive percentage accuracy --- 0.986309036309036"
## [1] "Total negative percentage accuracy --- 0.037422189674112"
results <- predict(logit2, newdata = data_test_b)
results <- factor(ifelse(results > 0.5,'Yes','No'),
levels = c('Yes', 'No'))
confusionMatrix(data = results, reference = data_test_b$HEALTHY)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 225 189
## Yes 13430 73071
##
## Accuracy : 0.8433
## 95% CI : (0.8409, 0.8457)
## No Information Rate : 0.8429
## P-Value [Acc > NIR] : 0.3707
##
## Kappa : 0.023
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.016477
## Specificity : 0.997420
## Pos Pred Value : 0.543478
## Neg Pred Value : 0.844742
## Prevalence : 0.157108
## Detection Rate : 0.002589
## Detection Prevalence : 0.004763
## Balanced Accuracy : 0.506949
##
## 'Positive' Class : No
##
summary_binary(results, data_test_b$HEALTHY)
## [1] "Total percentage accuracy --- 0.843306678939193"
## [1] "Total positive percentage accuracy --- 0.997420147420147"
## [1] "Total negative percentage accuracy --- 0.0164774807762724"
results <- predict(logit2_up, newdata = data_test_b)
results <- factor(ifelse(results > 0.5,'Yes','No'),
levels = c('Yes', 'No'))
confusionMatrix(data = results, reference = data_test_b$HEALTHY)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 2698 8356
## Yes 10957 64904
##
## Accuracy : 0.7778
## 95% CI : (0.775, 0.7806)
## No Information Rate : 0.8429
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0905
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.19758
## Specificity : 0.88594
## Pos Pred Value : 0.24407
## Neg Pred Value : 0.85556
## Prevalence : 0.15711
## Detection Rate : 0.03104
## Detection Prevalence : 0.12718
## Balanced Accuracy : 0.54176
##
## 'Positive' Class : No
##
summary_binary(results, data_test_b$HEALTHY)
## [1] "Total percentage accuracy --- 0.777794396824484"
## [1] "Total positive percentage accuracy --- 0.885940485940486"
## [1] "Total negative percentage accuracy --- 0.19758330281948"
results <- predict(probit1, newdata = data_test_b)
results <- factor(ifelse(results > 0.5,'Yes','No'),
levels = c('Yes', 'No'))
confusionMatrix(data = results, reference = data_test_b$HEALTHY)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 212 162
## Yes 13443 73098
##
## Accuracy : 0.8435
## 95% CI : (0.841, 0.8459)
## No Information Rate : 0.8429
## P-Value [Acc > NIR] : 0.3226
##
## Kappa : 0.022
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.015525
## Specificity : 0.997789
## Pos Pred Value : 0.566845
## Neg Pred Value : 0.844663
## Prevalence : 0.157108
## Detection Rate : 0.002439
## Detection Prevalence : 0.004303
## Balanced Accuracy : 0.506657
##
## 'Positive' Class : No
##
summary_binary(results, data_test_b$HEALTHY)
## [1] "Total percentage accuracy --- 0.84346775585342"
## [1] "Total positive percentage accuracy --- 0.997788697788698"
## [1] "Total negative percentage accuracy --- 0.0155254485536434"
results <- predict(probit1_up, newdata = data_test_b)
results <- factor(ifelse(results > 0.5,'Yes','No'),
levels = c('Yes', 'No'))
confusionMatrix(data = results, reference = data_test_b$HEALTHY)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 7174 34323
## Yes 6481 38937
##
## Accuracy : 0.5305
## 95% CI : (0.5272, 0.5339)
## No Information Rate : 0.8429
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0311
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.52538
## Specificity : 0.53149
## Pos Pred Value : 0.17288
## Neg Pred Value : 0.85730
## Prevalence : 0.15711
## Detection Rate : 0.08254
## Detection Prevalence : 0.47744
## Balanced Accuracy : 0.52843
##
## 'Positive' Class : No
##
summary_binary(results, data_test_b$HEALTHY)
## [1] "Total percentage accuracy --- 0.530529827992867"
## [1] "Total positive percentage accuracy --- 0.531490581490581"
## [1] "Total negative percentage accuracy --- 0.52537532039546"
results <- predict(probit2, newdata = data_test_b)
results <- factor(ifelse(results > 0.5,'Yes','No'),
levels = c('Yes', 'No'))
confusionMatrix(data = results, reference = data_test_b$HEALTHY)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 328 427
## Yes 13327 72833
##
## Accuracy : 0.8418
## 95% CI : (0.8393, 0.8442)
## No Information Rate : 0.8429
## P-Value [Acc > NIR] : 0.8232
##
## Kappa : 0.0295
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.024021
## Specificity : 0.994171
## Pos Pred Value : 0.434437
## Neg Pred Value : 0.845323
## Prevalence : 0.157108
## Detection Rate : 0.003774
## Detection Prevalence : 0.008687
## Balanced Accuracy : 0.509096
##
## 'Positive' Class : No
##
summary_binary(results, data_test_b$HEALTHY)
## [1] "Total percentage accuracy --- 0.841753437266295"
## [1] "Total positive percentage accuracy --- 0.994171444171444"
## [1] "Total negative percentage accuracy --- 0.0240205053094105"
results <- predict(probit2_up, newdata = data_test_b)
results <- factor(ifelse(results > 0.5,'Yes','No'),
levels = c('Yes', 'No'))
confusionMatrix(data = results, reference = data_test_b$HEALTHY)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 8111 35166
## Yes 5544 38094
##
## Accuracy : 0.5316
## 95% CI : (0.5283, 0.5349)
## No Information Rate : 0.8429
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0605
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.59399
## Specificity : 0.51998
## Pos Pred Value : 0.18742
## Neg Pred Value : 0.87295
## Prevalence : 0.15711
## Detection Rate : 0.09332
## Detection Prevalence : 0.49792
## Balanced Accuracy : 0.55699
##
## 'Positive' Class : No
##
summary_binary(results, data_test_b$HEALTHY)
## [1] "Total percentage accuracy --- 0.531611344416959"
## [1] "Total positive percentage accuracy --- 0.51998361998362"
## [1] "Total negative percentage accuracy --- 0.593994873672647"
results <- predict(ordered_logit1, newdata = data_test)
confusionMatrix(data = results, reference = data_test$HEALTHY)
## Confusion Matrix and Statistics
##
## Reference
## Prediction underweight healthy overweight
## underweight 128 99 5
## healthy 6370 73092 7215
## overweight 2 2 2
##
## Overall Statistics
##
## Accuracy : 0.8425
## 95% CI : (0.84, 0.8449)
## No Information Rate : 0.8421
## P-Value [Acc > NIR] : 0.3958
##
## Kappa : 0.0152
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: underweight Class: healthy Class: overweight
## Sensitivity 0.019692 0.998620 2.769e-04
## Specificity 0.998707 0.009984 9.999e-01
## Pos Pred Value 0.551724 0.843269 3.333e-01
## Neg Pred Value 0.926491 0.575630 9.169e-01
## Prevalence 0.074786 0.842122 8.309e-02
## Detection Rate 0.001473 0.840960 2.301e-05
## Detection Prevalence 0.002669 0.997262 6.903e-05
## Balanced Accuracy 0.509200 0.504302 5.001e-01
summary_ordinal(results, data_test$HEALTHY)
## [1] "Total percentage accuracy --- 0.842455272392567"
## [1] "Total underweight percentage accuracy --- 0.0196923076923077"
## [1] "Total healthy percentage accuracy --- 0.998620086620305"
## [1] "Total overweight percentage accuracy --- 0.00027693159789532"
results <- predict(ordered_logit1_up, newdata = data_test)
confusionMatrix(data = results, reference = data_test$HEALTHY)
## Confusion Matrix and Statistics
##
## Reference
## Prediction underweight healthy overweight
## underweight 6285 70527 6563
## healthy 140 1903 456
## overweight 75 763 203
##
## Overall Statistics
##
## Accuracy : 0.0965
## 95% CI : (0.0946, 0.0985)
## No Information Rate : 0.8421
## P-Value [Acc > NIR] : 1
##
## Kappa : -4e-04
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: underweight Class: healthy Class: overweight
## Sensitivity 0.96692 0.02600 0.028109
## Specificity 0.04135 0.95657 0.989485
## Pos Pred Value 0.07538 0.76150 0.195005
## Neg Pred Value 0.93927 0.15549 0.918264
## Prevalence 0.07479 0.84212 0.083093
## Detection Rate 0.07231 0.02189 0.002336
## Detection Prevalence 0.95927 0.02875 0.011977
## Balanced Accuracy 0.50414 0.49128 0.508797
summary_ordinal(results, data_test$HEALTHY)
## [1] "Total percentage accuracy --- 0.096542599091066"
## [1] "Total underweight percentage accuracy --- 0.966923076923077"
## [1] "Total healthy percentage accuracy --- 0.0259997540748432"
## [1] "Total overweight percentage accuracy --- 0.028108557186375"
results <- predict(ordered_logit2, newdata = data_test)
confusionMatrix(data = results, reference = data_test$HEALTHY)
## Confusion Matrix and Statistics
##
## Reference
## Prediction underweight healthy overweight
## underweight 146 84 3
## healthy 6348 73082 7207
## overweight 6 27 12
##
## Overall Statistics
##
## Accuracy : 0.8427
## 95% CI : (0.8402, 0.8451)
## No Information Rate : 0.8421
## P-Value [Acc > NIR] : 0.333
##
## Kappa : 0.0187
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: underweight Class: healthy Class: overweight
## Sensitivity 0.022462 0.99848 0.0016616
## Specificity 0.998918 0.01217 0.9995859
## Pos Pred Value 0.626609 0.84354 0.2666667
## Neg Pred Value 0.926698 0.60072 0.9170024
## Prevalence 0.074786 0.84212 0.0830927
## Detection Rate 0.001680 0.84084 0.0001381
## Detection Prevalence 0.002681 0.99680 0.0005177
## Balanced Accuracy 0.510690 0.50533 0.5006238
summary_ordinal(results, data_test$HEALTHY)
## [1] "Total percentage accuracy --- 0.842662371282287"
## [1] "Total underweight percentage accuracy --- 0.0224615384615385"
## [1] "Total healthy percentage accuracy --- 0.998483461533207"
## [1] "Total overweight percentage accuracy --- 0.00166158958737192"
results <- predict(ordered_logit2_up, newdata = data_test)
confusionMatrix(data = results, reference = data_test$HEALTHY)
## Confusion Matrix and Statistics
##
## Reference
## Prediction underweight healthy overweight
## underweight 6144 64913 5284
## healthy 256 7058 1565
## overweight 100 1222 373
##
## Overall Statistics
##
## Accuracy : 0.1562
## 95% CI : (0.1538, 0.1586)
## No Information Rate : 0.8421
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0034
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: underweight Class: healthy Class: overweight
## Sensitivity 0.94523 0.09643 0.051648
## Specificity 0.12707 0.86729 0.983411
## Pos Pred Value 0.08048 0.79491 0.220059
## Neg Pred Value 0.96633 0.15251 0.919632
## Prevalence 0.07479 0.84212 0.083093
## Detection Rate 0.07069 0.08121 0.004292
## Detection Prevalence 0.87834 0.10216 0.019502
## Balanced Accuracy 0.53615 0.48186 0.517530
summary_ordinal(results, data_test$HEALTHY)
## [1] "Total percentage accuracy --- 0.15618707933038"
## [1] "Total underweight percentage accuracy --- 0.945230769230769"
## [1] "Total healthy percentage accuracy --- 0.0964299864741164"
## [1] "Total overweight percentage accuracy --- 0.0516477430074772"
results <- predict(mlogit1, newdata = data_test)
summary_ordinal(results, data_test$HEALTHY)
## [1] "Total percentage accuracy --- 0.842627854800667"
## [1] "Total underweight percentage accuracy --- 0.0261538461538462"
## [1] "Total healthy percentage accuracy --- 0.99815556132417"
## [1] "Total overweight percentage accuracy --- 0.00124619219052894"
results <- predict(mlogit1_up, newdata = data_test)
summary_ordinal(results, data_test$HEALTHY)
## [1] "Total percentage accuracy --- 0.426796295230973"
## [1] "Total underweight percentage accuracy --- 0.675692307692308"
## [1] "Total healthy percentage accuracy --- 0.446231196972388"
## [1] "Total overweight percentage accuracy --- 0.00581556355580172"
results <- predict(mlogit2, newdata = data_test)
summary_ordinal(results, data_test$HEALTHY)
## [1] "Total percentage accuracy --- 0.842869470172007"
## [1] "Total underweight percentage accuracy --- 0.0272307692307692"
## [1] "Total healthy percentage accuracy --- 0.998237536376429"
## [1] "Total overweight percentage accuracy --- 0.00235391858211022"
results <- predict(mlogit2_up, newdata = data_test)
summary_ordinal(results, data_test$HEALTHY)
## [1] "Total percentage accuracy --- 0.489938445607778"
## [1] "Total underweight percentage accuracy --- 0.749384615384615"
## [1] "Total healthy percentage accuracy --- 0.512111813971281"
## [1] "Total overweight percentage accuracy --- 0.0317086679590141"
predictions <- predict(bayes1, data_test)
summary_ordinal(predictions, data_test$HEALTHY)
## [1] "Total percentage accuracy --- 0.842478283380314"
## [1] "Total underweight percentage accuracy --- 0.749384615384615"
## [1] "Total healthy percentage accuracy --- 0.512111813971281"
## [1] "Total overweight percentage accuracy --- 0.0317086679590141"
predictions <- predict(bayes1_up, data_test)
summary_ordinal(predictions, data_test$HEALTHY)
## [1] "Total percentage accuracy --- 0.0749697980785825"
## [1] "Total underweight percentage accuracy --- 0.749384615384615"
## [1] "Total healthy percentage accuracy --- 0.512111813971281"
## [1] "Total overweight percentage accuracy --- 0.0317086679590141"
predictions <- predict(bayes2, data_test)
summary_ordinal(predictions, data_test$HEALTHY)
## [1] "Total percentage accuracy --- 0.842328711959961"
## [1] "Total underweight percentage accuracy --- 0.749384615384615"
## [1] "Total healthy percentage accuracy --- 0.512111813971281"
## [1] "Total overweight percentage accuracy --- 0.0317086679590141"
predictions <- predict(bayes2_up, data_test)
summary_ordinal(predictions, data_test$HEALTHY)
## [1] "Total percentage accuracy --- 0.0996375769429903"
## [1] "Total underweight percentage accuracy --- 0.749384615384615"
## [1] "Total healthy percentage accuracy --- 0.512111813971281"
## [1] "Total overweight percentage accuracy --- 0.0317086679590141"
Accuracy of predictions from the models of the binary outcome variable:
| MODEL | DATA | TOTAL ACCURACY | HEALTHY | UNHEALTHY |
|---|---|---|---|---|
| logit 1 | unbalanced | 0.844 | 0.997 | 0.015 |
| logit 1 | balanced | 0.837 | 0.986 | 0.037 |
| logit 2 | unbalanced | 0.843 | 0.997 | 0.016 |
| logit 2 | balanced | 0.778 | 0.886 | 0.198 |
| probit 1 | unbalanced | 0.843 | 0.998 | 0.016 |
| probit 1 | balanced | 0.531 | 0.531 | 0.525 |
| probit 2 | unbalanced | 0.842 | 0.994 | 0.024 |
| probit 2 | balanced | 0.532 | 0.520 | 0.594 |
Accuracy of predictions from the models of the nominal/ordinal outcome variable:
| MODEL | DATA | TOTAL ACCURACY | UNDERWEIGHT | HEALTHY | OVERWEIGHT |
|---|---|---|---|---|---|
| ordered logit 1 | unbalanced | 0.842 | 0.020 | 0.999 | 0.000 |
| ordered logit 1 | balanced | 0.097 | 0.967 | 0.026 | 0.028 |
| ordered logit 2 | unbalanced | 0.843 | 0.022 | 0.998 | 0.002 |
| ordered logit 2 | balanced | 0.156 | 0.945 | 0.096 | 0.052 |
| mlogit 1 | unbalanced | 0.843 | 0.026 | 0.998 | 0.001 |
| mlogit 1 | balanced | 0.427 | 0.676 | 0.446 | 0.006 |
| mlogit 2 | unbalanced | 0.843 | 0.027 | 0.998 | 0.002 |
| mlogit 2 | balanced | 0.490 | 0.749 | 0.512 | 0.032 |
| bayes 1 | unbalanced | 0.842 | 0.749 | 0.512 | 0.032 |
| bayes 1 | balanced | 0.075 | 0.749 | 0.512 | 0.032 |
| bayes 2 | unbalanced | 0.842 | 0.749 | 0.512 | 0.032 |
| bayes 2 | balanced | 0.100 | 0.749 | 0.512 | 0.032 |
None of the models yielded satisfying results, and we can see a huge difference in predictions between the balanced and unbalanced samples. Even though the total accuracy scores look quite good, many of them above 80%, a closer look at the accuracy for particular classes reveals significant differences between them. For some of the models predicting the binary outcome variable the accuracy is around 50%, which is no better than randomly assigning one of the two classes to each observation. In the models predicting the nominal variable the accuracy is very uneven across classes, and some models achieved less than 10% total accuracy, which is even worse than random class assignment. Our final conclusion is that, in the available dataset, the independent variables do not predict the dependent variable, and therefore a useful model cannot be constructed from them.
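To make these baselines explicit, they can be computed directly from the class shares in the test set (a quick sketch, assuming the data_test object used above):
p <- prop.table(table(data_test$HEALTHY)) # class proportions in the test set
max(p) # majority-class accuracy, i.e. the 'no information rate' of about 0.84
sum(p^2) # expected accuracy of class-proportional random guessing
1 / length(p) # expected accuracy of uniform random guessing over the 3 classes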