Predicting Birth Weight of Baby

Introduction

Birth weight is the weight of baby taken just after he or she is born, and a baby birth weight is a strong indicator of maternal and new-born health and nutrition. According to World Health Organization (WHO), low birth weight is defined as the birthweight less than 2,500 grams or 5.5 pounds regardless of gestational age. The normal weight range for new-born baby should exceed 2,500 grams (5.5 pounds) but less than 4,000 grams (8.8 pounds).

Low birth weight is an outcome that has been of concern because infant mortality rates and birth defect rates are very high for low birth weight babies. Low birth weight can occur in premature babies who are delivered at a gestational age less than 37 weeks of pregnancy (the normal length of a pregnancy is 40 weeks) or in babies who are born at the regular time but are under weight. Baby with greater than the normal weight range is considered large and would increase risk and difficulties during delivery.

Motivation of study

It would be so important for the doctor to be able to predict the birth weight instead of solely relying on ultrasound results, so they can take different measures in advance and minimize the risk during delivery. Though ultrasound can help for such prediction, but data scientist could also detect this in advance with given data.

Problem statement

Here’re some interesting questions arise from this scenario that we are considering for the study:

  1. What is the weight of new-born baby given with some information during gestation period?
  2. Is the baby too small (underweight), normal or large (overweight) in size?
  3. What are the factors associated with giving birth to a low birth weight and high birth weight baby?

Objective of study

The main goal of this project is to obtain prediction for the birthweight of the baby and whether a baby is underweight, overweight or has normal weight given information about the mother’s historical health, habits, gestational period and age. Thus, several models are built to predict the birthweight of the baby and identify if the baby have low, normal or high birth weight using classification method.

Overview of data

The Dataset used is collected by North Carolina State Center for Health Statistics and we acknowledge The State Center for Health Statistics (SCHS) and the Howard W. Odum Institute for Research in Social Science at UNC at Chapel Hill as the source of data.

Import data

###################
# Load R packages #
###################
library(readr)
library(tidyverse)
###################################
# Import data: Baby's birthweight #
###################################

birthweight = readr::read_csv(file="2008_births.csv")
## 
## -- Column specification ---------------------------------------------------------------------------------------------
## cols(
##   .default = col_double(),
##   BDATE = col_date(format = ""),
##   HISPMOM = col_character(),
##   HISPDAD = col_character()
## )
## i Use `spec()` for the full column specifications.

The data was collected in 2008 with 133422 rows with 125 columns. The target variables for our study is the birth weight of infant which can be represented from columns namely BPOUND and BOUNCE which denote the birthweight in pounds and ounce separately.

str(birthweight)
## tibble [133,422 x 125] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ INST    : num [1:133422] 1 1 1 1 1 1 1 1 1 1 ...
##  $ RPLACE  : num [1:133422] 6800 160 190 4100 160 100 100 100 100 162 ...
##  $ RCOUNTY : num [1:133422] 68 1 1 41 1 1 1 1 1 1 ...
##  $ PLURAL  : num [1:133422] 1 1 1 1 1 1 1 2 2 1 ...
##  $ BDATE   : Date[1:133422], format: "2008-01-01" ...
##  $ BMONTH  : num [1:133422] 1 1 1 1 1 1 1 1 1 1 ...
##  $ BDAY    : num [1:133422] 1 2 2 3 3 3 3 3 3 3 ...
##  $ BYEAR   : num [1:133422] 2008 2008 2008 2008 2008 ...
##  $ SEX     : num [1:133422] 2 2 1 2 2 2 1 1 2 2 ...
##  $ RACE    : num [1:133422] 1 2 1 1 1 1 1 1 1 2 ...
##  $ FAGE    : num [1:133422] 23 23 31 21 26 19 37 26 26 29 ...
##  $ MAGE    : num [1:133422] 24 21 28 19 21 20 26 31 31 25 ...
##  $ FEDUC   : num [1:133422] 12 14 12 12 14 12 14 12 12 13 ...
##  $ MEDUC   : num [1:133422] 13 13 2 13 13 12 14 14 14 12 ...
##  $ TOTALP  : num [1:133422] 1 1 3 1 1 2 2 3 4 3 ...
##  $ BDEAD   : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ TERMS   : num [1:133422] 0 0 0 0 0 1 0 0 0 0 ...
##  $ LBDATE  : num [1:133422] 0 0 52003 0 0 ...
##  $ LBMONTH : num [1:133422] 0 0 5 0 0 0 5 3 1 6 ...
##  $ LBYEAR  : num [1:133422] 0 0 2003 0 0 ...
##  $ LOUTCOME: num [1:133422] 9 9 1 9 9 2 1 1 1 1 ...
##  $ WEEKS   : num [1:133422] 35 40 36 38 42 37 39 37 37 39 ...
##  $ PRENATAL: num [1:133422] 2 2 4 2 2 3 1 2 2 3 ...
##  $ VISITS  : num [1:133422] 12 13 8 10 12 10 14 20 20 10 ...
##  $ WEIGHT  : num [1:133422] 3 7 8 6 8 5 4 4 4 6 ...
##  $ MARITAL : num [1:133422] 2 2 1 2 1 1 2 1 1 2 ...
##  $ ATTEND  : num [1:133422] 3 1 3 1 1 1 1 1 1 3 ...
##  $ RECORD  : num [1:133422] 1 1 1 1 1 1 1 1 1 1 ...
##  $ CHILDREN: num [1:133422] 0 0 2 0 0 0 1 2 3 2 ...
##  $ FDATE   : num [1:133422] 0 0 0 0 0 ...
##  $ FMONTH  : num [1:133422] 0 0 0 0 0 9 0 0 0 0 ...
##  $ FYEAR   : num [1:133422] 0 0 0 0 0 ...
##  $ BPOUND  : num [1:133422] 4 8 9 7 9 6 5 5 4 7 ...
##  $ BOUNCE  : num [1:133422] 1 3 0 6 7 8 8 2 15 4 ...
##  $ RACEMOM : num [1:133422] 1 2 1 1 1 1 1 1 1 2 ...
##  $ RACEDAD : num [1:133422] 1 2 1 1 1 1 1 1 1 2 ...
##  $ HISPMOM : chr [1:133422] "N" "N" "M" "N" ...
##  $ HISPDAD : chr [1:133422] "N" "N" "S" "N" ...
##  $ CIGNUM  : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ DRINKNUM: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ GAINED  : num [1:133422] 20 23 36 40 35 40 20 38 38 43 ...
##  $ ANEMIA  : num [1:133422] 0 1 0 1 0 1 0 0 0 0 ...
##  $ CARDIAC : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ ACLUNG  : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ DIABETES: num [1:133422] 0 0 0 0 0 0 0 1 1 0 ...
##  $ HERPES  : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ HYDRAM  : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ HEMOGLOB: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ HYPERCH : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ HYPERPR : num [1:133422] 1 0 0 0 0 0 0 1 1 0 ...
##  $ ECLAMP  : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ CERVIX  : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ PINFANT : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ PRETERM : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ RENAL   : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ RHSEN   : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ UTERINE : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ OTHMED  : num [1:133422] 0 1 0 0 0 1 1 1 1 1 ...
##  $ AMNIO   : num [1:133422] 0 0 0 0 0 0 1 0 0 0 ...
##  $ MONITOR : num [1:133422] 1 1 1 1 1 1 1 1 1 1 ...
##  $ INDUCT  : num [1:133422] 0 1 0 0 0 0 0 0 0 0 ...
##  $ STIMULA : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ TOCOLY  : num [1:133422] 1 0 0 0 0 0 0 0 0 0 ...
##  $ ULTRA   : num [1:133422] 1 1 1 1 1 1 1 1 1 1 ...
##  $ OTHPROC : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ FEBRILE : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ MECONIUM: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ RUPTURE : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ ABRUPTIO: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ PREVIA  : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ BLEEDING: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ SEIZURES: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ PRECLAB : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ PROLAB  : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ DYSLAB  : num [1:133422] 0 1 0 0 0 0 0 0 0 0 ...
##  $ BREECH  : num [1:133422] 0 0 0 1 0 0 0 1 0 0 ...
##  $ CEPHALO : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ PROLAPSE: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ ANESTH  : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ DISTRESS: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ OTHLABOR: num [1:133422] 1 1 1 0 0 1 1 0 0 0 ...
##  $ VAGINAL : num [1:133422] 1 0 1 0 1 1 1 0 0 1 ...
##  $ VAGCSECT: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ PRCSECT : num [1:133422] 0 1 0 1 0 0 0 1 1 0 ...
##  $ RECSECT : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ FORCEPS : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VACUUM  : num [1:133422] 0 0 0 0 0 1 0 0 0 0 ...
##  $ KOTEL   : num [1:133422] 4 3 3 3 3 4 3 4 4 3 ...
##  $ ANENCEPH: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ SPINABIF: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ HYDROCEP: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ MICROCEP: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ OTHCNS  : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ HEARTMAL: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ OTHCIRC : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ RECTALAT: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ FISTULA : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ OMPHALOC: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ OTHGAST : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##   [list output truncated]
##  - attr(*, "spec")=
##   .. cols(
##   ..   INST = col_double(),
##   ..   RPLACE = col_double(),
##   ..   RCOUNTY = col_double(),
##   ..   PLURAL = col_double(),
##   ..   BDATE = col_date(format = ""),
##   ..   BMONTH = col_double(),
##   ..   BDAY = col_double(),
##   ..   BYEAR = col_double(),
##   ..   SEX = col_double(),
##   ..   RACE = col_double(),
##   ..   FAGE = col_double(),
##   ..   MAGE = col_double(),
##   ..   FEDUC = col_double(),
##   ..   MEDUC = col_double(),
##   ..   TOTALP = col_double(),
##   ..   BDEAD = col_double(),
##   ..   TERMS = col_double(),
##   ..   LBDATE = col_double(),
##   ..   LBMONTH = col_double(),
##   ..   LBYEAR = col_double(),
##   ..   LOUTCOME = col_double(),
##   ..   WEEKS = col_double(),
##   ..   PRENATAL = col_double(),
##   ..   VISITS = col_double(),
##   ..   WEIGHT = col_double(),
##   ..   MARITAL = col_double(),
##   ..   ATTEND = col_double(),
##   ..   RECORD = col_double(),
##   ..   CHILDREN = col_double(),
##   ..   FDATE = col_double(),
##   ..   FMONTH = col_double(),
##   ..   FYEAR = col_double(),
##   ..   BPOUND = col_double(),
##   ..   BOUNCE = col_double(),
##   ..   RACEMOM = col_double(),
##   ..   RACEDAD = col_double(),
##   ..   HISPMOM = col_character(),
##   ..   HISPDAD = col_character(),
##   ..   CIGNUM = col_double(),
##   ..   DRINKNUM = col_double(),
##   ..   GAINED = col_double(),
##   ..   ANEMIA = col_double(),
##   ..   CARDIAC = col_double(),
##   ..   ACLUNG = col_double(),
##   ..   DIABETES = col_double(),
##   ..   HERPES = col_double(),
##   ..   HYDRAM = col_double(),
##   ..   HEMOGLOB = col_double(),
##   ..   HYPERCH = col_double(),
##   ..   HYPERPR = col_double(),
##   ..   ECLAMP = col_double(),
##   ..   CERVIX = col_double(),
##   ..   PINFANT = col_double(),
##   ..   PRETERM = col_double(),
##   ..   RENAL = col_double(),
##   ..   RHSEN = col_double(),
##   ..   UTERINE = col_double(),
##   ..   OTHMED = col_double(),
##   ..   AMNIO = col_double(),
##   ..   MONITOR = col_double(),
##   ..   INDUCT = col_double(),
##   ..   STIMULA = col_double(),
##   ..   TOCOLY = col_double(),
##   ..   ULTRA = col_double(),
##   ..   OTHPROC = col_double(),
##   ..   FEBRILE = col_double(),
##   ..   MECONIUM = col_double(),
##   ..   RUPTURE = col_double(),
##   ..   ABRUPTIO = col_double(),
##   ..   PREVIA = col_double(),
##   ..   BLEEDING = col_double(),
##   ..   SEIZURES = col_double(),
##   ..   PRECLAB = col_double(),
##   ..   PROLAB = col_double(),
##   ..   DYSLAB = col_double(),
##   ..   BREECH = col_double(),
##   ..   CEPHALO = col_double(),
##   ..   PROLAPSE = col_double(),
##   ..   ANESTH = col_double(),
##   ..   DISTRESS = col_double(),
##   ..   OTHLABOR = col_double(),
##   ..   VAGINAL = col_double(),
##   ..   VAGCSECT = col_double(),
##   ..   PRCSECT = col_double(),
##   ..   RECSECT = col_double(),
##   ..   FORCEPS = col_double(),
##   ..   VACUUM = col_double(),
##   ..   KOTEL = col_double(),
##   ..   ANENCEPH = col_double(),
##   ..   SPINABIF = col_double(),
##   ..   HYDROCEP = col_double(),
##   ..   MICROCEP = col_double(),
##   ..   OTHCNS = col_double(),
##   ..   HEARTMAL = col_double(),
##   ..   OTHCIRC = col_double(),
##   ..   RECTALAT = col_double(),
##   ..   FISTULA = col_double(),
##   ..   OMPHALOC = col_double(),
##   ..   OTHGAST = col_double(),
##   ..   MALGEN = col_double(),
##   ..   RANGENE = col_double(),
##   ..   OTHURO = col_double(),
##   ..   CLEFT = col_double(),
##   ..   POLYDACT = col_double(),
##   ..   CLUBFOOT = col_double(),
##   ..   DHERNIA = col_double(),
##   ..   OTHMS = col_double(),
##   ..   DOWNS = col_double(),
##   ..   OTHCHROM = col_double(),
##   ..   OTHER = col_double(),
##   ..   BCWEEKS = col_double(),
##   ..   APGAR1 = col_double(),
##   ..   APGAR5 = col_double(),
##   ..   KESSNER = col_double(),
##   ..   INFANTTR = col_double(),
##   ..   MOTHERTR = col_double(),
##   ..   IANEMIA = col_double(),
##   ..   BINJURY = col_double(),
##   ..   FAS = col_double(),
##   ..   HYALINE = col_double(),
##   ..   ASPIRATE = col_double(),
##   ..   VENTLESS = col_double(),
##   ..   VENTMORE = col_double(),
##   ..   ISEIZURE = col_double(),
##   ..   OTHINF = col_double()
##   .. )
dim(birthweight)
## [1] 133422    125
nrow(birthweight)
## [1] 133422
ncol(birthweight)
## [1] 125

There are many features and columns which are not relevant to our scope of objectives. Features about post birth, birth details such as date are all removed to keep simplicity for pre-processing of data later.

keep_col = c('SEX', 'MARITAL','FAGE', 'GAINED', 'VISITS', 'MAGE', 
            'FEDUC', 'MEDUC', 'TOTALP', 'BDEAD', 'TERMS',  
            #'RACEMOM', 'RACEDAD', 'HISPMOM', 'HISPDAD', 'LOUTCOME',
            'PLURAL','WEEKS','CIGNUM', 'DRINKNUM', 'ANEMIA', 'BPOUND','BOUNCE',
            'CARDIAC', 'ACLUNG', 'DIABETES', 'HERPES', 'HYDRAM', 
            'HEMOGLOB', 'HYPERCH', 'HYPERPR', 'ECLAMP', 'CERVIX',
            'PINFANT','PRETERM','RENAL','RHSEN','UTERINE')
birthweight_clean = birthweight %>%  
                    select(all_of(keep_col))

After removal of the unrelevant features, the dataset is trimmed to have only 33 columns and 133422 rows of observations.

dim(birthweight_clean)
## [1] 133422     33
nrow(birthweight_clean)
## [1] 133422
ncol(birthweight_clean)
## [1] 33

Let’s dive into the dataset to get an overview. The description of each attributes in the dataset is attached under Appendix section.

str(birthweight_clean)
## tibble [133,422 x 33] (S3: tbl_df/tbl/data.frame)
##  $ SEX     : num [1:133422] 2 2 1 2 2 2 1 1 2 2 ...
##  $ MARITAL : num [1:133422] 2 2 1 2 1 1 2 1 1 2 ...
##  $ FAGE    : num [1:133422] 23 23 31 21 26 19 37 26 26 29 ...
##  $ GAINED  : num [1:133422] 20 23 36 40 35 40 20 38 38 43 ...
##  $ VISITS  : num [1:133422] 12 13 8 10 12 10 14 20 20 10 ...
##  $ MAGE    : num [1:133422] 24 21 28 19 21 20 26 31 31 25 ...
##  $ FEDUC   : num [1:133422] 12 14 12 12 14 12 14 12 12 13 ...
##  $ MEDUC   : num [1:133422] 13 13 2 13 13 12 14 14 14 12 ...
##  $ TOTALP  : num [1:133422] 1 1 3 1 1 2 2 3 4 3 ...
##  $ BDEAD   : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ TERMS   : num [1:133422] 0 0 0 0 0 1 0 0 0 0 ...
##  $ PLURAL  : num [1:133422] 1 1 1 1 1 1 1 2 2 1 ...
##  $ WEEKS   : num [1:133422] 35 40 36 38 42 37 39 37 37 39 ...
##  $ CIGNUM  : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ DRINKNUM: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ ANEMIA  : num [1:133422] 0 1 0 1 0 1 0 0 0 0 ...
##  $ BPOUND  : num [1:133422] 4 8 9 7 9 6 5 5 4 7 ...
##  $ BOUNCE  : num [1:133422] 1 3 0 6 7 8 8 2 15 4 ...
##  $ CARDIAC : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ ACLUNG  : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ DIABETES: num [1:133422] 0 0 0 0 0 0 0 1 1 0 ...
##  $ HERPES  : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ HYDRAM  : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ HEMOGLOB: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ HYPERCH : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ HYPERPR : num [1:133422] 1 0 0 0 0 0 0 1 1 0 ...
##  $ ECLAMP  : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ CERVIX  : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ PINFANT : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ PRETERM : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ RENAL   : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ RHSEN   : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
##  $ UTERINE : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...

We have both qualitative and quantitative variables in the dataset. Determined factors for baby birthweight such as mother’s weight gained, parents age during gestational period, number of gestational weeks etc will be used to build the models.

Most qualitative variables are in binary form which describe mother’s health condition on whether she suffer from anaemia, cardiac disease,lung disease, diabetes, genital herpes, Oligohydramnios, hemoglobinopathy, hypertension, pregnancy hypertension, Eclampsia, incompetent cervix and renal disease. Other information such as whether the mother had preterm infant or infant exceed 4,000g and experience uterine bleeding during gestation period are also included in the dataset.

summary(birthweight_clean)
##       SEX           MARITAL           FAGE      
##  Min.   :1.000   Min.   :1.000   Min.   :14.00  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:26.00  
##  Median :1.000   Median :1.000   Median :32.00  
##  Mean   :1.487   Mean   :1.418   Mean   :42.65  
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:40.00  
##  Max.   :9.000   Max.   :9.000   Max.   :99.00  
##      GAINED          VISITS          MAGE      
##  Min.   : 0.00   Min.   : 0.0   Min.   :11.00  
##  1st Qu.:21.00   1st Qu.:10.0   1st Qu.:22.00  
##  Median :30.00   Median :12.0   Median :27.00  
##  Mean   :32.68   Mean   :13.3   Mean   :27.01  
##  3rd Qu.:40.00   3rd Qu.:15.0   3rd Qu.:31.00  
##  Max.   :99.00   Max.   :99.0   Max.   :99.00  
##      FEDUC           MEDUC           TOTALP      
##  Min.   : 0.00   Min.   : 0.00   Min.   : 1.000  
##  1st Qu.:12.00   1st Qu.:12.00   1st Qu.: 1.000  
##  Median :14.00   Median :12.00   Median : 2.000  
##  Mean   :28.76   Mean   :13.22   Mean   : 2.502  
##  3rd Qu.:17.00   3rd Qu.:16.00   3rd Qu.: 3.000  
##  Max.   :99.00   Max.   :99.00   Max.   :99.000  
##      BDEAD              TERMS             PLURAL     
##  Min.   : 0.00000   Min.   : 0.0000   Min.   :1.000  
##  1st Qu.: 0.00000   1st Qu.: 0.0000   1st Qu.:1.000  
##  Median : 0.00000   Median : 0.0000   Median :1.000  
##  Mean   : 0.06431   Mean   : 0.4651   Mean   :1.035  
##  3rd Qu.: 0.00000   3rd Qu.: 0.0000   3rd Qu.:1.000  
##  Max.   :99.00000   Max.   :99.0000   Max.   :9.000  
##      WEEKS           CIGNUM          DRINKNUM      
##  Min.   :18.00   Min.   : 0.000   Min.   : 0.0000  
##  1st Qu.:38.00   1st Qu.: 0.000   1st Qu.: 0.0000  
##  Median :39.00   Median : 0.000   Median : 0.0000  
##  Mean   :38.68   Mean   : 1.381   Mean   : 0.7443  
##  3rd Qu.:40.00   3rd Qu.: 0.000   3rd Qu.: 0.0000  
##  Max.   :99.00   Max.   :99.000   Max.   :99.0000  
##      ANEMIA            BPOUND           BOUNCE     
##  Min.   :0.00000   Min.   : 0.000   Min.   : 0.00  
##  1st Qu.:0.00000   1st Qu.: 6.000   1st Qu.: 3.00  
##  Median :0.00000   Median : 7.000   Median : 7.00  
##  Mean   :0.05962   Mean   : 6.729   Mean   : 7.51  
##  3rd Qu.:0.00000   3rd Qu.: 8.000   3rd Qu.:11.00  
##  Max.   :9.00000   Max.   :99.000   Max.   :99.00  
##     CARDIAC            ACLUNG           DIABETES      
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.02443   Mean   :0.03296   Mean   :0.05155  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :9.00000   Max.   :9.00000   Max.   :9.00000  
##      HERPES            HYDRAM           HEMOGLOB      
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.03378   Mean   :0.03518   Mean   :0.02261  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :9.00000   Max.   :9.00000   Max.   :9.00000  
##     HYPERCH          HYPERPR            ECLAMP       
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.00000   Median :0.00000  
##  Mean   :0.0346   Mean   :0.07054   Mean   :0.02422  
##  3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :9.0000   Max.   :9.00000   Max.   :9.00000  
##      CERVIX           PINFANT           PRETERM       
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.02416   Mean   :0.02552   Mean   :0.02987  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :9.00000   Max.   :9.00000   Max.   :9.00000  
##      RENAL             RHSEN            UTERINE       
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.02256   Mean   :0.02389   Mean   :0.02396  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :9.00000   Max.   :9.00000   Max.   :9.00000

Surprisingly, there is no missing values found in the dataset. However, when we notice carefully from the summary statistics for each column. Qualitative variable denoted in numeric have maximum values of 9 which seems odd. Similar patterns are found in quantitative variables with maximum values recorded as 98 or 99 which are not describing the attributes accurately. These values are actually recoded from the missing values during recording of the data.

sum(!complete.cases(birthweight_clean))
## [1] 0

Data pre-processing

Dealing Missing data

Observations with missing data are omitted from the dataset as there are just less than 10% of them are missing.

qualitative_col =c('SEX', 'MARITAL','ANEMIA',
                  'CARDIAC', 'ACLUNG', 'DIABETES', 'HERPES', 'HYDRAM', 
                  'HEMOGLOB', 'HYPERCH', 'HYPERPR', 'ECLAMP', 'CERVIX',
                  'PINFANT','PRETERM','RENAL','RHSEN','UTERINE')

birthweight_clean = birthweight_clean %>%
      filter_at(all_of(qualitative_col), all_vars(. != 9)) %>%
      filter(!rowSums(.==99)) %>%
      filter(!rowSums(.==98))

Transformation of data

Both BPOUND and BOUNCE columns that record the pounds and ounces of the weight are merged to create new column BWEIGHT which give the birth weight in decimal. Conversion of ounce to pound is computed by multiplying BOUNCE column with 0.0625.

We also create new column BWEIGHT_COND to record the groupings of baby birthweight. Babies with low birthweight are those with less than 5.5 pounds. Large baby with birthweight exceeds 8.8 pounds are grouped as overweight category. Other than that, the baby are in the normal range of birthweight. This will be used for building the classification model which will be discussed shortly.

birthweight_clean = 
  birthweight_clean %>% 
  mutate(BWEIGHT=BPOUND+(0.0625*BOUNCE)) %>%
  mutate(BWEIGHT_COND=case_when(
                      BWEIGHT <5.5 ~ 0,
                      BWEIGHT >=5.5 & BWEIGHT <8.8 ~ 1,
                      BWEIGHT >=8.8 ~ 2)
         )%>%
  select(-c("BPOUND","BOUNCE")) 

Dummy coding categorical features

We have some nominal variables such as SEX, MARITAL and the newly created BWEIGHT_COND. We are performing dummy coding which creates a set of binary (one-zero) variables that represent each category before we can use to fit and evaluate a model.

BWEIGHT_COND is currently represented with numbers of 0, 1, 2 for ‘Underweight’, ‘Normal’ and ‘Overwieght’. It may be advisable to convert it to a factor to allow the category to have unique impact on the outcome.

birthweight_clean = 
  birthweight_clean %>% 
  mutate(SEX_dummy_Boy = ifelse(SEX==1,1,0),
         SEX_dummy_Girl = ifelse(SEX==2,1,0),
         STATUS_dummy_Married=ifelse(MARITAL==1,1,0),
         STATUS_dummy_Unmarried=ifelse(MARITAL==2,1,0),
         BWEIGHT_dummy_under=ifelse(BWEIGHT_COND==0,1,0),
         BWEIGHT_dummy_normal=ifelse(BWEIGHT_COND==1,1,0),
         BWEIGHT_dummy_over=ifelse(BWEIGHT_COND==2,1,0)
      ) %>% 
  mutate_at(vars(BWEIGHT_COND ),factor,levels = c(0,1,2),labels=c("Underweight","Normal","Overweight")) %>%
  mutate_at(vars(BWEIGHT_COND ),relevel,ref="Normal")

Exploratory Data Analysis

Before modelling, let’s explore each attribute to uncover the structure, patterns, and relationships existing in our data.

Targeted variables

Most babies are reported to have weights of range 7 pounds to 8 pounds as we can see the peak of histogram lies in between this range. The dataset has the highest frequency for babies with normal weight which is a good sign as the baby are in healthy condition.

library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.0.3
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine

plot1=birthweight_clean %>% 
        ggplot(aes(x=BWEIGHT)) + 
        geom_histogram(fill="#9f2042")  +
        geom_density(alpha=.2, fill="#FF6666") +
        xlab("Birthweight (pounds)") +
        ylab("Frequency") +
        scale_x_continuous(breaks=seq(0,10,1))
        ggtitle("Histogram of Birthweight")
## $title
## [1] "Histogram of Birthweight"
## 
## attr(,"class")
## [1] "labels"

plot2=birthweight_clean %>% 
        ggplot(aes(x=BWEIGHT_COND)) + 
        geom_bar(fill=c("#9f2042","#211103","#1F1D1D")) +
        xlab("Classes") +
        ylab("Frequency") +
        ggtitle("Birthweight classes")
grid.arrange(plot1, plot2,ncol=2, nrow = 1)
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.

Bivariate analysis

The relationship of the birthweight and other features will be discussed shortly.

plot_density = function(cond,true_lab,false_lab,title){
  birthweight_clean %>% 
    mutate(Status=ifelse(cond,true_lab,false_lab)) %>%
    ggplot(aes(x=BWEIGHT,color=Status)) +
    geom_density(size=1) +
    ylab("Density") +
    xlab("Birthweight") +
    ggtitle(title) +
    scale_color_manual(values=c("#211103","#9f2042")) 
}

# Smoker vs Non-Smoker
plot3 = plot_density(birthweight_clean$CIGNUM>0,
                     "Smoker","Non-smoker",
                     "Status of smoking")

plot4 = plot_density(birthweight_clean$DRINKNUM>1,
                     "Drink","Not drinking",
                     "Status of taking/ not taking alcohol")  

# has/had previus preterm/small infant
plot5 = plot_density(birthweight_clean$PRETERM==1,
                     "Yes","No",
                     "Has/had previous preterm/small infant") 

# had/had previous infant 4000+ grams
plot6 = plot_density(birthweight_clean$PINFANT==1,
                     "Yes","No",
                     "Has/had previous infant with >5.5 pounds")  

# Mother has/had incompetent cervix
plot7 = plot_density(birthweight_clean$CERVIX==1,
                     "Yes","No",
                     "Has/had incompetent Cervix")  


# Mother has/had diabetes
plot8 = plot_density(birthweight_clean$DIABETES==1,
                     "Yes","No",
                     "Has/had diabetes")

# Mother has/had uterine bleeding
plot9 = plot_density(birthweight_clean$UTERINE==1,
                     "Yes","No",
                     "Has/had uterine bleeding")

# Mother has/had Oligohydramnios
plot10 = plot_density(birthweight_clean$HYDRAM==1,
                     "Yes","No",
                     "Has/had Oligohydramnios")

# Mother has/had chronic hypertension
plot11 = plot_density(birthweight_clean$HYPERCH==1,
                     "Yes","No",
                     "Has/had chronic hypertension")

# Mother has/had pregnancy hypertension
plot12 = plot_density(birthweight_clean$HYPERPR==1,
                      "Yes","No",
                      "Has/had pregnancy hypertension")

# Early delivery vs Normal delivery
plot13 = plot_density(birthweight_clean$WEEKS<37,
                     "Early delivery","Normal delivery",
                     "Early delivery vs Normal delivery")


# Early delivery vs Normal delivery
plot14 = plot_density(birthweight_clean$BDEAD>0,
                     "Yes","Never/No",
                     "Has/had children born alive now dead")

Based on the density plots, we observed that:

  • Mother who smokes during pregnancy tends to have low birth weight babies.
  • Alcoholic seems does not bring significant effect to the birthweight of baby but this is just based on the observation from this data as we require more scientific justification for such claims.
  • Mother who experienced preterm delivery are likely to give birth of low weight baby. In contrast, mothers who experienced in giving birth of larger baby have higher possibility to have infants with greater birthweight.

Condition of mother’s health can be determined factors of baby birthweight. We observed that mother who suffer from

  • uterine bleeding
  • ologohydramnios
  • hypertensions

has higher probability with low birthweight baby.

grid.arrange(plot3,plot4,plot5,plot6,plot7,plot8,ncol=2)

grid.arrange(plot9,plot10,plot11,plot12,plot13,plot14,ncol=2)

Correlation

library(reshape2)
## Warning: package 'reshape2' was built under R version 4.0.3
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
quantitative_col =c('FAGE', 'GAINED', 'VISITS', 'MAGE', 'BWEIGHT','PLURAL',
                    'FEDUC', 'MEDUC', 'TOTALP', 'BDEAD', 'TERMS', 
                    'WEEKS','CIGNUM', 'DRINKNUM')

# Correlation matrix for all quantitative variables
corr_matrix <- round(cor(birthweight_clean[,quantitative_col]),2)

# Get upper triangle of the correlation matrix
get_upper_tri <- function(corr_matrix){
  corr_matrix[lower.tri(corr_matrix)]<- NA
  return(corr_matrix)
}

upper_tri <- get_upper_tri(corr_matrix)

Matrix of correlation with upper triangle is displayed.

upper_tri
##          FAGE GAINED VISITS  MAGE BWEIGHT PLURAL FEDUC
## FAGE        1  -0.06   0.10  0.75    0.05   0.05  0.28
## GAINED     NA   1.00   0.09 -0.06    0.17   0.10  0.11
## VISITS     NA     NA   1.00  0.14    0.13   0.07  0.21
## MAGE       NA     NA     NA  1.00    0.07   0.07  0.35
## BWEIGHT    NA     NA     NA    NA    1.00  -0.31  0.05
## PLURAL     NA     NA     NA    NA      NA   1.00  0.05
## FEDUC      NA     NA     NA    NA      NA     NA  1.00
## MEDUC      NA     NA     NA    NA      NA     NA    NA
## TOTALP     NA     NA     NA    NA      NA     NA    NA
## BDEAD      NA     NA     NA    NA      NA     NA    NA
## TERMS      NA     NA     NA    NA      NA     NA    NA
## WEEKS      NA     NA     NA    NA      NA     NA    NA
## CIGNUM     NA     NA     NA    NA      NA     NA    NA
## DRINKNUM   NA     NA     NA    NA      NA     NA    NA
##          MEDUC TOTALP BDEAD TERMS WEEKS CIGNUM DRINKNUM
## FAGE      0.27   0.28  0.04  0.13 -0.04  -0.04     0.01
## GAINED    0.11  -0.12 -0.02 -0.01  0.10  -0.01     0.00
## VISITS    0.23  -0.06  0.00  0.03  0.14  -0.06    -0.01
## MAGE      0.37   0.35  0.04  0.17 -0.05  -0.08     0.01
## BWEIGHT   0.06   0.00 -0.04 -0.03  0.57  -0.09    -0.01
## PLURAL    0.05   0.07  0.03  0.02 -0.27  -0.02     0.00
## FEDUC     0.74  -0.10 -0.02  0.02  0.01  -0.11     0.00
## MEDUC     1.00  -0.13 -0.02  0.03  0.00  -0.12     0.00
## TOTALP      NA   1.00  0.17  0.65 -0.08   0.08     0.02
## BDEAD       NA     NA  1.00  0.04 -0.05   0.01     0.01
## TERMS       NA     NA    NA  1.00 -0.05   0.06     0.02
## WEEKS       NA     NA    NA    NA  1.00  -0.01    -0.01
## CIGNUM      NA     NA    NA    NA    NA   1.00     0.03
## DRINKNUM    NA     NA    NA    NA    NA     NA     1.00

From the heatmap, we observed that weeks of gestation is positively correlated with birthweight. The longer the week of gestation, the higher the weight of the baby which potentially lead to baby to grow bigger in size during delivery if the gestation weeks are longer than usual.

Other factors such as number of cigarettes per day has negative association with baby birthweight. Mother who smokes oftenly tends to have low birth weight baby.

Other interesting facts observed: - Mother who has twins, triplets or even quadruplets has shorter gestational weeks than single baby and babies are ligthter. - The more weight gained during pregnancy, the baby birthweight is heavier - Mother who vists hospital more often has baby with higher birthweight

# Heatmap
melted_cormat <- melt(upper_tri, na.rm = TRUE)

ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
  geom_tile(color = "white")+
  scale_fill_gradient2(low = "#211103", high = "#9f2042", mid = "white", 
                       midpoint = 0, limit =c(min(melt(corr_matrix)$value),max(melt(corr_matrix)$value)), space = "Lab", 
                       name="Pearson\nCorrelation") +

  theme(axis.text.x = element_text(angle = 45, vjust = 1, 
                                   size = 10, hjust = 1))+
  xlab("")+ylab("")+ggtitle("Correlation Heat Map")+
  coord_fixed()

Modelling

Modelling: Classification

k Nearest Neighbors (kNN)

k-Nearest Neighbors (kNN) uses the principle of nearest neighbors to classify unlabeled examples by searching dataset for the historic observation with most similar to the newly-observed one. As the kNN algorithm literally “learns by example” it is suitably used for classification to indentify if weights grouping for baby when birth.

Data processing for kNN

Since kNN assumes numeric data, categorical columns should be modified in order for kNN to measure it in numeric way. Alternative solution here is by creating dummy variable which has been completed in previous steps.

Before applying kNN to a classification task, it is common practice to rescale the data using a technique like min-max normalization to ensure all data elements may contribute equal shares to distance.

# Data preparation for kNN
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

# Columns require normalization
birthweight_class= birthweight_clean %>%
  mutate_at(quantitative_col,normalize) %>%
  select(-contains("BWEIGHT_dummy"),-c("SEX","MARITAL","BWEIGHT"))

Train/Test split for Classification

We are spliting birthweight_class into a training set birthweight_class_train (75% of the data) and a test set birthweight_class_test (25% of the data). A column of uniform random numbers between 0 and 1, using the function runif() is generated to get random subset of data in 75% and 25% split.

# Use nrow to get the number of rows
(N <- nrow(birthweight_class))
## [1] 103381

# Calculate how many rows 75% of N should be and print it
(round(0.75*N))
## [1] 77536

# Create the vector of N uniform random variables: split
set.seed(123)
split <- runif(N)

# Use gp to create the training set: mpg_train (75% of data) and mpg_test (25% of data)
birthweight_class_train <- birthweight_class[split<0.75,]
birthweight_class_test <- birthweight_class[split>=0.75,]

# Use nrow() to birthweight_class_train and birthweight_class_test
nrow(birthweight_class_train)
## [1] 77612
nrow(birthweight_class_test)
## [1] 25769
# Target variable indexes
target_index = grep("BWEIGHT_COND", colnames(birthweight_class))

# Class of train data
COND = birthweight_class_train$BWEIGHT_COND

There is no universal rule for selection of k. Some suggest a rule of thumb starting with k equal to the square root of the number of observations in the training data which is 279.

# When k=65 
KNN_pred_65 <- knn(train = birthweight_class_train[-target_index], 
                    test = birthweight_class_test[-target_index], 
                    cl = COND,
                    k=65)
# When k=100
KNN_pred_100 <- knn(train = birthweight_class_train[-target_index], 
                    test = birthweight_class_test[-target_index], 
                    cl = COND,
                    k=100)
# When k=279 (square root of total observations)
KNN_pred_279 <- knn(train = birthweight_class_train[-target_index], 
                test = birthweight_class_test[-target_index], 
                cl = COND,
                k=279)
# Create a confusion matrix of the predicted versus actual values
COND_actual <- birthweight_class_test$BWEIGHT_COND

# Accuracy rate
k_65=mean(COND_actual==KNN_pred_65)
k_100=mean(COND_actual==KNN_pred_100)
k_279=mean(COND_actual==KNN_pred_279)

k_result = rbind(k_65,k_100,k_279)
k_result
##            [,1]
## k_65  0.8462882
## k_100 0.8441150
## k_279 0.8389150

When k=65, we are considering 65 neighbors to consider when making the classification.

confusionMatrix(COND_actual,KNN_pred_65)
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Normal Underweight Overweight
##   Normal       21353          66         14
##   Underweight   1534         438          0
##   Overweight    2347           0         17
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8463          
##                  95% CI : (0.8418, 0.8507)
##     No Information Rate : 0.9792          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1643          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Normal Class: Underweight
## Sensitivity                 0.8462            0.86905
## Specificity                 0.8505            0.93928
## Pos Pred Value              0.9963            0.22211
## Neg Pred Value              0.1049            0.99723
## Prevalence                  0.9792            0.01956
## Detection Rate              0.8286            0.01700
## Detection Prevalence        0.8317            0.07653
## Balanced Accuracy           0.8483            0.90417
##                      Class: Overweight
## Sensitivity                  0.5483871
## Specificity                  0.9088119
## Pos Pred Value               0.0071912
## Neg Pred Value               0.9994018
## Prevalence                   0.0012030
## Detection Rate               0.0006597
## Detection Prevalence         0.0917381
## Balanced Accuracy            0.7285995

When k=100, we are considering 100 neighbors to consider when making the classification.

confusionMatrix(COND_actual,KNN_pred_100)
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Normal Underweight Overweight
##   Normal       21370          47         16
##   Underweight   1608         364          0
##   Overweight    2346           0         18
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8441          
##                  95% CI : (0.8396, 0.8485)
##     No Information Rate : 0.9827          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1401          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Normal Class: Underweight
## Sensitivity                 0.8439            0.88564
## Specificity                 0.8584            0.93659
## Pos Pred Value              0.9971            0.18458
## Neg Pred Value              0.0881            0.99802
## Prevalence                  0.9827            0.01595
## Detection Rate              0.8293            0.01413
## Detection Prevalence        0.8317            0.07653
## Balanced Accuracy           0.8511            0.91112
##                      Class: Overweight
## Sensitivity                  0.5294118
## Specificity                  0.9088401
## Pos Pred Value               0.0076142
## Neg Pred Value               0.9993164
## Prevalence                   0.0013194
## Detection Rate               0.0006985
## Detection Prevalence         0.0917381
## Balanced Accuracy            0.7191259

When k=279, we are considering 279 neighbors to consider when making the classification.

confusionMatrix(COND_actual,KNN_pred_279)
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Normal Underweight Overweight
##   Normal       21423          10          0
##   Underweight   1777         195          0
##   Overweight    2364           0          0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8389          
##                  95% CI : (0.8344, 0.8434)
##     No Information Rate : 0.992           
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0757          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Normal Class: Underweight
## Sensitivity                0.83801           0.951220
## Specificity                0.95122           0.930488
## Pos Pred Value             0.99953           0.098884
## Neg Pred Value             0.04497           0.999580
## Prevalence                 0.99204           0.007955
## Detection Rate             0.83135           0.007567
## Detection Prevalence       0.83174           0.076526
## Balanced Accuracy          0.89462           0.940854
##                      Class: Overweight
## Sensitivity                         NA
## Specificity                    0.90826
## Pos Pred Value                      NA
## Neg Pred Value                      NA
## Prevalence                     0.00000
## Detection Rate                 0.00000
## Detection Prevalence           0.09174
## Balanced Accuracy                   NA

With larger k values, we could get higher rate of accuracy in detecting normal baby weight. However, the model is not performing well in detecting the underweight and overweight babies. To get the balance, we chose smaller k values of 65, with greater accuracy of 0.8462882 in overall in detecting the labelled classes.

Multinomial Logistic regression

Another popular classification model is Multinomial Logistic regression which is an extension of binomial logistic regression which allows us to predict a categorical dependent variable which has more than two levels.

# Fit the model
fmla=as.formula(BWEIGHT_COND ~ .)

log_reg_model = nnet::multinom(fmla, data = birthweight_class_train)
## # weights:  105 (68 variable)
## initial  value 85265.496948 
## iter  10 value 43948.772859
## iter  20 value 39315.260691
## iter  30 value 35195.621475
## iter  40 value 33984.166963
## iter  50 value 33714.848964
## iter  60 value 33670.620023
## iter  70 value 33551.043061
## iter  80 value 33546.896059
## final  value 33546.880388 
## converged
# Summarize the model
summary(log_reg_model )
## Call:
## nnet::multinom(formula = fmla, data = birthweight_class_train)
## 
## Coefficients:
##             (Intercept)        FAGE    GAINED    VISITS
## Underweight    5.287194 -0.07920969 -2.725423 -1.609727
## Overweight    -4.355592 -0.04228494  2.865142  1.008134
##                   MAGE       FEDUC     MEDUC    TOTALP
## Underweight -0.2156991  0.08859638 0.2672378 -4.286894
## Overweight   1.2076822 -0.08324540 0.1304299  2.640975
##                 BDEAD     TERMS   PLURAL      WEEKS
## Underweight  2.745155  4.657910  7.85548 -17.006464
## Overweight  -1.054142 -2.480178 -9.72476   5.796709
##                CIGNUM  DRINKNUM      ANEMIA    CARDIAC
## Underweight  3.466275 -2.886727 -0.05464457 -0.2033561
## Overweight  -4.367276 -7.092164 -0.01250162 -0.1884203
##                  ACLUNG   DIABETES       HERPES     HYDRAM
## Underweight 0.363578687 -0.2292302 -0.193063974  1.0172823
## Overweight  0.001108051  0.5954211  0.003078272 -0.1273572
##               HEMOGLOB      HYPERCH    HYPERPR      ECLAMP
## Underweight  0.3504352  0.662480945  0.9060436  1.30701904
## Overweight  -0.3487282 -0.003570999 -0.2538979 -0.08000906
##                 CERVIX   PINFANT    PRETERM      RENAL
## Underweight  0.3548868 -1.048648  1.0299743  0.6279605
## Overweight  -0.5456573  1.728803 -0.7643823 -0.3779315
##                  RHSEN    UTERINE SEX_dummy_Boy
## Underweight -0.5793131  0.9426642      2.507795
## Overweight  -0.2759953 -0.1898026     -1.894247
##             SEX_dummy_Girl STATUS_dummy_Married
## Underweight       2.779400             2.476237
## Overweight       -2.461344            -2.007350
##             STATUS_dummy_Unmarried
## Underweight               2.810957
## Overweight               -2.348241
## 
## Std. Errors:
##             (Intercept)      FAGE     GAINED    VISITS
## Underweight  0.09505475 0.2383985 0.13825537 0.2370674
## Overweight   0.08975596 0.1716012 0.09145519 0.1783965
##                  MAGE     FEDUC     MEDUC    TOTALP
## Underweight 0.2109616 0.1627519 0.1678846 0.3729532
## Overweight  0.1495039 0.1111006 0.1161061 0.2326635
##                 BDEAD     TERMS    PLURAL     WEEKS
## Underweight 0.6595288 0.4852758 0.1833876 0.2255235
## Overweight  0.6040611 0.3376846 1.1436810 0.1921526
##                CIGNUM DRINKNUM     ANEMIA   CARDIAC
## Underweight 0.2768779 1.913990 0.09538002 0.2739302
## Overweight  0.4187796 6.204468 0.07091892 0.2041958
##                ACLUNG   DIABETES    HERPES    HYDRAM
## Underweight 0.1471751 0.09477448 0.1693388 0.1038276
## Overweight  0.1234820 0.06192606 0.1102396 0.1182136
##              HEMOGLOB   HYPERCH    HYPERPR    ECLAMP
## Underweight 0.3365748 0.1165182 0.05947054 0.1694037
## Overweight  0.3996491 0.1161619 0.06834053 0.2799525
##                CERVIX   PINFANT   PRETERM     RENAL
## Underweight 0.2166189 0.4467634 0.1230564 0.2864402
## Overweight  0.3157495 0.1040772 0.2056165 0.3201424
##                 RHSEN   UTERINE SEX_dummy_Boy
## Underweight 0.3542218 0.2198129    0.04972244
## Overweight  0.2303831 0.2605112    0.04597924
##             SEX_dummy_Girl STATUS_dummy_Married
## Underweight     0.05211682           0.05636510
## Overweight      0.04743306           0.04958922
##             STATUS_dummy_Unmarried
## Underweight             0.04799162
## Overweight              0.04630135
## 
## Residual Deviance: 67093.76 
## AIC: 67221.76

After fitting the model, we predict the labelled category of birthweight_class_train dataset and check for its accuracy from the confusion matrix.

# Make predictions
predict_log_reg <- log_reg_model  %>% predict(birthweight_class_test[-target_index])

# Model accuracy
COND_actual <- birthweight_class_test$BWEIGHT_COND

logreg = mean(COND_actual==predict_log_reg)
confusionMatrix(COND_actual,predict_log_reg)
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Normal Underweight Overweight
##   Normal       21129         260         44
##   Underweight   1059         913          0
##   Overweight    2323           0         41
## 
## Overall Statistics
##                                           
##                Accuracy : 0.857           
##                  95% CI : (0.8526, 0.8612)
##     No Information Rate : 0.9512          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3025          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Normal Class: Underweight
## Sensitivity                 0.8620            0.77835
## Specificity                 0.7583            0.95694
## Pos Pred Value              0.9858            0.46298
## Neg Pred Value              0.2200            0.98907
## Prevalence                  0.9512            0.04552
## Detection Rate              0.8199            0.03543
## Detection Prevalence        0.8317            0.07653
## Balanced Accuracy           0.8102            0.86765
##                      Class: Overweight
## Sensitivity                   0.482353
## Specificity                   0.909555
## Pos Pred Value                0.017343
## Neg Pred Value                0.998120
## Prevalence                    0.003299
## Detection Rate                0.001591
## Detection Prevalence          0.091738
## Balanced Accuracy             0.695954

This model performs quite well too as it get an accuracy of 85.6959913%.

Evaluation of classification models

By comparing the accuracy rate from both KNN and multinomial regression models, the later results with higer accuracy. Thus, in terms of classification, multinomial regression models will be selected.

acc = cbind(k_65,k_100,k_279,logreg)
colnames(acc) = c("KNN (k=65)","KNN (k=100)","KNN (k=279)","Multinomial reg")
acc
##      KNN (k=65) KNN (k=100) KNN (k=279) Multinomial reg
## [1,]  0.8462882    0.844115    0.838915       0.8569599

Modelling: Regression

birthweight_reg= birthweight_clean %>%
  select(-contains("BWEIGHT_dummy"),-c("SEX","MARITAL","BWEIGHT_COND"))

Latent Variable Analysis

Principal Component Analysis (PCA) is a tool for data reduction. There are several ways PCA contributes to the model development process. Other than using the output of the PCA directly into machine learning models, it can be used to identify latent variables in a dataset, by looking for how individual observed variables hang together.

bw_reg_pca<-principal(select(birthweight_reg[split<0.75,],-BWEIGHT),nfactors = 15)



plot15=ggplot(data = data.frame(y=bw_reg_pca$Vaccounted[2,],x=1:15),aes(x=x,weight=y))+
  geom_bar(fill="#9f2042")+
  xlab("Component")+
  ylab("Prop. Variation")+
  ggtitle("Skree Plot")+
  scale_x_continuous(breaks=seq(0,15,1))

plot15

Based on the Skree Plot, 15 principle components manage to explains 60% of the variance in the data.

bw_reg_pca$loadings
## 
## Loadings:
##                        RC1    RC11   RC3    RC2    RC4   
## FAGE                    0.633  0.192         0.343       
## GAINED                                                   
## VISITS                  0.243  0.101               -0.121
## MAGE                    0.679  0.258         0.379       
## FEDUC                   0.785  0.151        -0.130       
## MEDUC                   0.793  0.165        -0.147       
## TOTALP                                       0.895       
## BDEAD                                        0.123       
## TERMS                                        0.839       
## PLURAL                                              0.804
## WEEKS                                              -0.722
## CIGNUM                 -0.191                0.193       
## DRINKNUM                0.109                            
## ANEMIA                                                   
## CARDIAC                                                  
## ACLUNG                                                   
## DIABETES                                                 
## HERPES                  0.139 -0.113               -0.112
## HYDRAM                                                   
## HEMOGLOB                                                 
## HYPERCH                                             0.139
## HYPERPR                                                  
## ECLAMP                                                   
## CERVIX                                                   
## PINFANT                                                  
## PRETERM                                                  
## RENAL                                                    
## RHSEN                  -0.132  0.176                     
## UTERINE                                                  
## SEX_dummy_Boy                        -1.000              
## SEX_dummy_Girl                        1.000              
## STATUS_dummy_Married    0.277  0.944                     
## STATUS_dummy_Unmarried -0.277 -0.944                     
##                        RC5    RC6    RC9    RC7    RC8   
## FAGE                   -0.260  0.139                     
## GAINED                  0.689 -0.176                     
## VISITS                  0.549  0.267                     
## MAGE                   -0.209  0.147                     
## FEDUC                   0.251                            
## MEDUC                   0.257                            
## TOTALP                 -0.135                            
## BDEAD                                 0.538              
## TERMS                   0.128                            
## PLURAL                  0.234                            
## WEEKS                   0.279        -0.144 -0.176       
## CIGNUM                                                   
## DRINKNUM                                                 
## ANEMIA                                              0.564
## CARDIAC                                            -0.144
## ACLUNG                                              0.156
## DIABETES                       0.663         0.208       
## HERPES                        -0.111         0.130  0.324
## HYDRAM                         0.240         0.114       
## HEMOGLOB                                            0.628
## HYPERCH                        0.641        -0.179       
## HYPERPR                 0.107                0.707       
## ECLAMP                                       0.662       
## CERVIX                                0.647              
## PINFANT                                                  
## PRETERM                               0.604              
## RENAL                                               0.104
## RHSEN                   0.177  0.247                0.384
## UTERINE                                                  
## SEX_dummy_Boy                                            
## SEX_dummy_Girl                                           
## STATUS_dummy_Married                                     
## STATUS_dummy_Unmarried                                   
##                        RC14   RC10   RC13   RC12   RC15  
## FAGE                                         0.113       
## GAINED                                                   
## VISITS                        -0.132                     
## MAGE                                         0.105       
## FEDUC                                                    
## MEDUC                                                    
## TOTALP                                                   
## BDEAD                          0.164         0.182 -0.248
## TERMS                                                    
## PLURAL                                                   
## WEEKS                                              -0.103
## CIGNUM                  0.195  0.357        -0.254       
## DRINKNUM                       0.787  0.104              
## ANEMIA                                0.267  0.161  0.124
## CARDIAC                 0.758        -0.126        -0.147
## ACLUNG                  0.539         0.182         0.171
## DIABETES                                                 
## HERPES                  0.255                            
## HYDRAM                         0.172  0.414         0.180
## HEMOGLOB                      -0.153               -0.165
## HYPERCH                 0.176                            
## HYPERPR                                                  
## ECLAMP                                             -0.108
## CERVIX                                      -0.127       
## PINFANT                                      0.882       
## PRETERM                                             0.132
## RENAL                                 0.687 -0.103 -0.109
## RHSEN                          0.400 -0.444         0.163
## UTERINE                                             0.835
## SEX_dummy_Boy                                            
## SEX_dummy_Girl                                           
## STATUS_dummy_Married                                     
## STATUS_dummy_Unmarried                                   
## 
##                  RC1  RC11   RC3   RC2   RC4   RC5   RC6
## SS loadings    2.426 2.017 2.001 1.877 1.281 1.271 1.149
## Proportion Var 0.074 0.061 0.061 0.057 0.039 0.039 0.035
## Cumulative Var 0.074 0.135 0.195 0.252 0.291 0.330 0.364
##                  RC9   RC7   RC8  RC14  RC10  RC13  RC12
## SS loadings    1.132 1.116 1.066 1.042 1.031 1.025 1.025
## Proportion Var 0.034 0.034 0.032 0.032 0.031 0.031 0.031
## Cumulative Var 0.399 0.432 0.465 0.496 0.528 0.559 0.590
##                 RC15
## SS loadings    1.017
## Proportion Var 0.031
## Cumulative Var 0.621

Several observations from the PCA:

  • Either one of the dummy variables for genders and married status can be removed.
  • The father’s age and mother’s age are often falls under the same component. Father’s age will be removed.
  • The father’s years of education and mother’s years of education are often falls under the same component. Father’s years of education will be removed.
  • The “Number of other terminations” and “Total pregnancies” often falls under the same component. A new feature will be created by dividing “Number of other terminations” with “Total pregnancies”.
  • The “Number of prenatal visits” and “Completed weeks of gestation” often falls under the same component. A new feature will be created by dividing “Number of prenatal visits” with “Completed weeks of gestation”.
func_col_clean<-function(dat){
  
  dat<-mutate(dat
              #,parent_edu=FEDUC*MEDUC
              ,TERMS_perc=TERMS/TOTALP
              ,VISITS_per_WEEKS=VISITS/WEEKS
              )
  
  dat<-select(dat,-FAGE,-STATUS_dummy_Unmarried
              ,-STATUS_dummy_Married
              ,-FEDUC
              #,-MEDUC
              ,-SEX_dummy_Girl
              ,-TOTALP)
  return(dat)
}

birthweight_reg2<-func_col_clean(birthweight_reg)

Feature Selection

The Variable Importance is extracted using the VarImp function from caret package. Factor with high variance of importance usually have high predictive power on the target variable.

control <- trainControl(method="repeatedcv", number=10, repeats=3)
model <- train(BWEIGHT~., data=birthweight_reg2[split<0.75,], method="lm", preProcess="scale"
               , trControl=control)
importance <- varImp(model, scale=T)
print(importance)
plot(importance)

## lm variable importance
## 
##   only 20 most important variables shown (out of 29)
## 
##                  Overall
## PLURAL           100.000
## WEEKS             92.285
## GAINED            77.651
## SEX_dummy_Boy     51.091
## MAGE              50.550
## CIGNUM            38.790
## HYPERPR           31.841
## PINFANT           30.287
## DIABETES          19.021
## HYDRAM            18.852
## ECLAMP            16.089
## PRETERM           15.830
## HYPERCH           10.915
## VISITS             9.185
## CERVIX             8.699
## UTERINE            8.179
## TERMS_perc         7.309
## VISITS_per_WEEKS   7.086
## MEDUC              4.570
## BDEAD              4.005

Based on the Variable Importance diagram, PLURAL, WEEKS, GAINED, MAGE, CIGNUM, HYPERPR, PINFANT, DIABETES and HYDRAM are selected into the machine learning model.

Train/Test split for Regression

quantitative_col_reg<-c("PLURAL","WEEKS","GAINED","MAGE"                         
                        ,"CIGNUM","HYPERPR","PINFANT","DIABETES","HYDRAM","BWEIGHT")

y_max=max(birthweight_reg2$BWEIGHT)
y_min=min(birthweight_reg2$BWEIGHT)

birthweight_reg3<-mutate_at(birthweight_reg2,quantitative_col_reg,normalize)
birthweight_reg_train<-birthweight_reg3[split<0.75,]
birthweight_reg_test<-birthweight_reg3[split>0.75,]

Linear Regression

Linear regression is one of the basic regression model for continuous value prediction. It formulate a linear equations by assigning weights on the independent variable via maximum likelihood estimation.

f<-as.formula(paste0("BWEIGHT~",paste0(c(quantitative_col_reg,"SEX_dummy_Boy"),collapse = "+")))

bw_lm<-lm(f,data=birthweight_reg_train)

birthweight_reg_train2<-cbind.data.frame(birthweight_reg_train,BWEIGHT_hat=predict(bw_lm))
birthweight_reg_test2<-cbind.data.frame(birthweight_reg_test
                                        ,BWEIGHT_hat=predict(bw_lm
                                                             ,newdata = birthweight_reg_test))

RMSE_lm_train<-sqrt(mean((birthweight_reg_train2$BWEIGHT_hat*(y_max-y_min)+y_min-birthweight_reg_train2$BWEIGHT*(y_max-y_min)+y_min)^2))

RMSE_lm_test<-sqrt(mean((birthweight_reg_test2$BWEIGHT_hat*(y_max-y_min)+y_min-birthweight_reg_test2$BWEIGHT*(y_max-y_min)+y_min)^2))
summary(bw_lm)
## 
## Call:
## lm(formula = f, data = birthweight_reg_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.53045 -0.05108 -0.00036  0.05047  0.44245 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.0541376  0.0027165   19.93   <2e-16 ***
## PLURAL        -0.3237948  0.0047270  -68.50   <2e-16 ***
## WEEKS          0.5531824  0.0032467  170.38   <2e-16 ***
## GAINED         0.1132518  0.0020846   54.33   <2e-16 ***
## MAGE           0.0776288  0.0020460   37.94   <2e-16 ***
## CIGNUM        -0.1604179  0.0057956  -27.68   <2e-16 ***
## HYPERPR       -0.0294444  0.0013241  -22.24   <2e-16 ***
## PINFANT        0.0773849  0.0037786   20.48   <2e-16 ***
## DIABETES       0.0206970  0.0016013   12.93   <2e-16 ***
## HYDRAM        -0.0320686  0.0023892  -13.42   <2e-16 ***
## SEX_dummy_Boy  0.0200160  0.0005732   34.92   <2e-16 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07974 on 77601 degrees of freedom
## Multiple R-squared:  0.4055, Adjusted R-squared:  0.4054 
## F-statistic:  5293 on 10 and 77601 DF,  p-value: < 2.2e-16

Although the p-value of all the variables and final model are very small (i.e. < 2.2e-16), the R-squared looks quite low, i.e. only 40% of the output variable can be explained by the input variables.

To investigate further, let’s look at the Residuals vs Fitted graph and the QQ plot of the linear regression.

plot(bw_lm)
## Warning in model.matrix.default(object, data =
## structure(list(BWEIGHT = c(0.300970873786408, : the response
## appeared on the right-hand side and was dropped
## Warning in model.matrix.default(object, data =
## structure(list(BWEIGHT = c(0.300970873786408, : problem with
## term 10 in model.matrix: no columns are assigned

Linear regression assumes the data is Homoscedastic (i.e. the residual value is consistent for all observations) and normally distributed. However, the Residual vs Fitted values shows that the residual of the dataset is heteroskedastic and the QQ plot reveals that the data has more extreme values than would be expected for a normal distribution.

Usually, the problem might be solved by transforming the target variables or independent variables. Instead of trying out numerous transformation method, we could attempt to develop another machine learning model.

XGBoost

XGBoost is one of the machine learning model that could handle non-linear and non-normal distributed data.

quantitative_col_xbg<-c(quantitative_col_reg[quantitative_col_reg!="BWEIGHT"],"SEX_dummy_Boy")

bw_xgb<-xgboost(data=as.matrix(birthweight_reg_train[,quantitative_col_xbg])
                        ,label = birthweight_reg_train$BWEIGHT
                        , max.depth = 2, nrounds = 50)
## [1]  train-rmse:0.099927 
## [2]  train-rmse:0.091502 
## [3]  train-rmse:0.086588 
## [4]  train-rmse:0.083779 
## [5]  train-rmse:0.081953 
## [6]  train-rmse:0.080976 
## [7]  train-rmse:0.080142 
## [8]  train-rmse:0.079532 
## [9]  train-rmse:0.079074 
## [10] train-rmse:0.078710 
## [11] train-rmse:0.078399 
## [12] train-rmse:0.078134 
## [13] train-rmse:0.077900 
## [14] train-rmse:0.077729 
## [15] train-rmse:0.077548 
## [16] train-rmse:0.077401 
## [17] train-rmse:0.077271 
## [18] train-rmse:0.077144 
## [19] train-rmse:0.077042 
## [20] train-rmse:0.076956 
## [21] train-rmse:0.076882 
## [22] train-rmse:0.076788 
## [23] train-rmse:0.076725 
## [24] train-rmse:0.076660 
## [25] train-rmse:0.076610 
## [26] train-rmse:0.076552 
## [27] train-rmse:0.076518 
## [28] train-rmse:0.076482 
## [29] train-rmse:0.076447 
## [30] train-rmse:0.076418 
## [31] train-rmse:0.076388 
## [32] train-rmse:0.076357 
## [33] train-rmse:0.076335 
## [34] train-rmse:0.076315 
## [35] train-rmse:0.076293 
## [36] train-rmse:0.076275 
## [37] train-rmse:0.076256 
## [38] train-rmse:0.076241 
## [39] train-rmse:0.076226 
## [40] train-rmse:0.076215 
## [41] train-rmse:0.076204 
## [42] train-rmse:0.076182 
## [43] train-rmse:0.076173 
## [44] train-rmse:0.076154 
## [45] train-rmse:0.076145 
## [46] train-rmse:0.076133 
## [47] train-rmse:0.076122 
## [48] train-rmse:0.076118 
## [49] train-rmse:0.076098 
## [50] train-rmse:0.076092
birthweight_reg_train2<-cbind.data.frame(birthweight_reg_train2
                                         ,BWEIGHT_hat_xgb=predict(bw_xgb,newdata = as.matrix(birthweight_reg_train[,quantitative_col_xbg]))
)

birthweight_reg_test2<-cbind.data.frame(birthweight_reg_test2
                                         ,BWEIGHT_hat_xgb=predict(bw_xgb,newdata = as.matrix(birthweight_reg_test2[,quantitative_col_xbg]))
)

RMSE_xgb_train<-sqrt(mean((birthweight_reg_train2$BWEIGHT_hat_xgb*(y_max-y_min)+y_min-birthweight_reg_train2$BWEIGHT*(y_max-y_min)+y_min)^2))

RMSE_xgb_test<-sqrt(mean((birthweight_reg_test2$BWEIGHT_hat_xgb*(y_max-y_min)+y_min-birthweight_reg_test2$BWEIGHT*(y_max-y_min)+y_min)^2))

The RMSE of the normalised BWEIGHT is 0.076092, but the resulting non-normalised BWEIGHT is 1.0436.

Since XGBoost can handle non-normal and non-linear dataset, normalisation is not necessary for the machine learning model. Would the model performance be any different if we ask XGBoost to fit the non-normalised BWEIGHT?

birthweight_reg_train2$BWEIGHT2<-birthweight_reg_train$BWEIGHT*(y_max-y_min)+y_min
birthweight_reg_test2$BWEIGHT2<-birthweight_reg_test$BWEIGHT*(y_max-y_min)+y_min

quantitative_col_xbg<-c(quantitative_col_reg[quantitative_col_reg!="BWEIGHT"],"SEX_dummy_Boy")

bw_xgb<-xgboost(data=as.matrix(birthweight_reg_train2[,quantitative_col_xbg])
                        ,label = birthweight_reg_train2$BWEIGHT2
                        , max.depth = 2, nrounds = 50)
## [1]  train-rmse:4.882723 
## [2]  train-rmse:3.501301 
## [3]  train-rmse:2.563161 
## [4]  train-rmse:1.942600 
## [5]  train-rmse:1.546722 
## [6]  train-rmse:1.309054 
## [7]  train-rmse:1.171215 
## [8]  train-rmse:1.094978 
## [9]  train-rmse:1.053656 
## [10] train-rmse:1.031047 
## [11] train-rmse:1.018097 
## [12] train-rmse:1.010258 
## [13] train-rmse:1.005064 
## [14] train-rmse:1.001786 
## [15] train-rmse:0.998937 
## [16] train-rmse:0.996785 
## [17] train-rmse:0.994984 
## [18] train-rmse:0.993283 
## [19] train-rmse:0.991939 
## [20] train-rmse:0.990822 
## [21] train-rmse:0.989852 
## [22] train-rmse:0.988648 
## [23] train-rmse:0.987827 
## [24] train-rmse:0.987011 
## [25] train-rmse:0.986138 
## [26] train-rmse:0.985391 
## [27] train-rmse:0.984914 
## [28] train-rmse:0.984401 
## [29] train-rmse:0.983962 
## [30] train-rmse:0.983582 
## [31] train-rmse:0.983246 
## [32] train-rmse:0.982886 
## [33] train-rmse:0.982544 
## [34] train-rmse:0.982133 
## [35] train-rmse:0.981841 
## [36] train-rmse:0.981530 
## [37] train-rmse:0.981303 
## [38] train-rmse:0.981013 
## [39] train-rmse:0.980840 
## [40] train-rmse:0.980679 
## [41] train-rmse:0.980526 
## [42] train-rmse:0.980363 
## [43] train-rmse:0.980232 
## [44] train-rmse:0.980088 
## [45] train-rmse:0.979870 
## [46] train-rmse:0.979729 
## [47] train-rmse:0.979559 
## [48] train-rmse:0.979466 
## [49] train-rmse:0.979390 
## [50] train-rmse:0.979330
birthweight_reg_train2<-cbind.data.frame(birthweight_reg_train2
                                         ,BWEIGHT_hat_xgb2=predict(bw_xgb,newdata = as.matrix(birthweight_reg_train2[,quantitative_col_xbg]))
)

birthweight_reg_test2<-cbind.data.frame(birthweight_reg_test2
                                         ,BWEIGHT_hat_xgb2=predict(bw_xgb,newdata = as.matrix(birthweight_reg_test2[,quantitative_col_xbg]))
)

RMSE_xgb_train2<-sqrt(mean((birthweight_reg_train2$BWEIGHT_hat_xgb2-birthweight_reg_train2$BWEIGHT2)^2))

RMSE_xgb_test2<-sqrt(mean((birthweight_reg_test2$BWEIGHT_hat_xgb2-birthweight_reg_test2$BWEIGHT2)^2))

Evaluation of Regression Models

RMSE<-cbind.data.frame(`RMSE: Linear Regression`=c(RMSE_lm_train,RMSE_lm_test)
                       ,`RMSE: XGBoost`=c(RMSE_xgb_train,RMSE_xgb_test)
                       ,`RMSE: XGBoost (non-normalised)`=c(RMSE_xgb_train2,RMSE_xgb_test2)
  
)

rownames(RMSE)<-c("Train","Test")

RMSE
##       RMSE: Linear Regression RMSE: XGBoost
## Train                1.092919      1.049005
## Test                 1.081757      1.043571
##       RMSE: XGBoost (non-normalised)
## Train                      0.9793309
## Test                       0.9768859

The improvement of RMSE between the two models are not significant. XGBoost performed much better on the non-normalised BWEIGHT. Usually normalisation would improve the model performance. However, normalisation leads to a worse model performance in this case study. This might be due to the normalisation approach used.

Conclusion and Future works

We developed models with adequate performance for the classification of baby’s birth weight type (i.e. underweight, normal, overweight) and the regression of baby’s birth weight. Multinomial Logistic and kNN are used for predicting the birth weight type of babies. Multinomial Logistic outperform kNN by having a higher test sample accuracy of 0.857. Linear Regression and XGBoost are used to predict the birth weight of the babies. XGBoost on non-normalised birth weight outperforms other models by having a lower test sample RMSE of 0.9769.

Several improvements can be made in the following areas:-

  • Pre-processing: Explore other normalisation methods
  • Feature selection: Select features via other methods, e.g. Recursive Feature Elimination (RFE) or stepwise regression
  • Model: Experiment with other models such as neural network might perform better on this dataset

Appendix: Metadata (trimmed datasets)

  • SEX : Sex of the baby
  • MARITAL : Marital status of its parents
  • FAGE : Age of father
  • GAINED : Weight gained during pregnancy
  • VISITS : Number of prenatal visits
  • MAGE : Age of mother
  • FEDUC : Father’s years of education
  • MEDUC : Mother’s years of education
  • TOTALP : Total pregnancies
  • BDEAD : number of children born alive now dead
  • TERMS : Number of other terminations
  • PLURAL : Whether the Mother gave birth to twins, triplets or quadruplets
  • WEEKS : Completed weeks of gestation
  • CIGNUM : Average number of cigarettes used daily (Mother)
  • DRINKNUM: Average number of drinks used daily (mother)
  • ANEMIA : Mother has/had anemia
  • CARDIAC : Mother has/had cardiac disease
  • ACLUNG : Mother has/had acute or chronic lung disease
  • DIABETES : Mother has/had diabetes
  • HERPES : Mother has/had genital herpes
  • HYDRAM : Mother has/had hydramnios/Oligohydramnios
  • HEMOGLOB : Mother has/had hemoglobinopathy
  • HYPERCH : Mother has/had chronic hypertension
  • HYPERPR : mother has/had pregnancy hypertension
  • ECLAMP : Mother has/had Eclampsia
  • CERVIX : Mother has/had incompetent cervix
  • PINFANT : Mother had/had previous infant 4000+ grams
  • PRETERM : Mother has/had previus preterm/small infant
  • RENAL : Mother has/had renal disease
  • RHSEN : Mother has/had Rh sensitization
  • UTERINE : Mother has/had uterine bleeding
  • BWEIGHT : Baby’s weight at birth