Predicting Birth Weight of Baby
Introduction
Birth weight is the weight of a baby measured just after birth, and it is a strong indicator of maternal and newborn health and nutrition. According to the World Health Organization (WHO), low birth weight is defined as a birth weight below 2,500 grams (5.5 pounds), regardless of gestational age. The normal weight range for a newborn is above 2,500 grams (5.5 pounds) but below 4,000 grams (8.8 pounds).
Low birth weight is of concern because infant mortality and birth defect rates are much higher for low birth weight babies. Low birth weight can occur in premature babies delivered at a gestational age of less than 37 weeks (the normal length of a pregnancy is 40 weeks) or in babies who are born at term but are underweight. A baby above the normal weight range is considered large, which increases risk and difficulty during delivery.
Motivation of study
It would be very useful for doctors to be able to predict birth weight instead of relying solely on ultrasound results, so that they can take measures in advance and minimise risk during delivery. Ultrasound can help with such predictions, but data scientists can also make them in advance from available data.
Problem statement
Here are some interesting questions arising from this scenario that we consider in this study:
- What is the weight of new-born baby given with some information during gestation period?
- Is the baby too small (underweight), normal or large (overweight) in size?
- What are the factors associated with giving birth to a low birth weight and high birth weight baby?
Objective of study
The main goal of this project is to predict the birth weight of a baby, and whether the baby is underweight, overweight or of normal weight, given information about the mother's health history, habits, gestational period and age. Thus, several models are built: regression models to predict the birth weight itself, and classification models to identify whether the baby has a low, normal or high birth weight.
Overview of data
The dataset was collected by the North Carolina State Center for Health Statistics. We acknowledge the State Center for Health Statistics (SCHS) and the Howard W. Odum Institute for Research in Social Science at UNC at Chapel Hill as the source of the data.
Import data
###################################
# Import data: Baby's birthweight #
###################################
birthweight = readr::read_csv(file="2008_births.csv")
##
## -- Column specification ---------------------------------------------------------------------------------------------
## cols(
## .default = col_double(),
## BDATE = col_date(format = ""),
## HISPMOM = col_character(),
## HISPDAD = col_character()
## )
## i Use `spec()` for the full column specifications.
The data was collected in 2008 and contains 133,422 rows and 125 columns. The target variable for our study is the birth weight of the infant, represented by the columns BPOUND and BOUNCE, which denote the pounds and ounces components of the birth weight respectively.
## tibble [133,422 x 125] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ INST : num [1:133422] 1 1 1 1 1 1 1 1 1 1 ...
## $ RPLACE : num [1:133422] 6800 160 190 4100 160 100 100 100 100 162 ...
## $ RCOUNTY : num [1:133422] 68 1 1 41 1 1 1 1 1 1 ...
## $ PLURAL : num [1:133422] 1 1 1 1 1 1 1 2 2 1 ...
## $ BDATE : Date[1:133422], format: "2008-01-01" ...
## $ BMONTH : num [1:133422] 1 1 1 1 1 1 1 1 1 1 ...
## $ BDAY : num [1:133422] 1 2 2 3 3 3 3 3 3 3 ...
## $ BYEAR : num [1:133422] 2008 2008 2008 2008 2008 ...
## $ SEX : num [1:133422] 2 2 1 2 2 2 1 1 2 2 ...
## $ RACE : num [1:133422] 1 2 1 1 1 1 1 1 1 2 ...
## $ FAGE : num [1:133422] 23 23 31 21 26 19 37 26 26 29 ...
## $ MAGE : num [1:133422] 24 21 28 19 21 20 26 31 31 25 ...
## $ FEDUC : num [1:133422] 12 14 12 12 14 12 14 12 12 13 ...
## $ MEDUC : num [1:133422] 13 13 2 13 13 12 14 14 14 12 ...
## $ TOTALP : num [1:133422] 1 1 3 1 1 2 2 3 4 3 ...
## $ BDEAD : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ TERMS : num [1:133422] 0 0 0 0 0 1 0 0 0 0 ...
## $ LBDATE : num [1:133422] 0 0 52003 0 0 ...
## $ LBMONTH : num [1:133422] 0 0 5 0 0 0 5 3 1 6 ...
## $ LBYEAR : num [1:133422] 0 0 2003 0 0 ...
## $ LOUTCOME: num [1:133422] 9 9 1 9 9 2 1 1 1 1 ...
## $ WEEKS : num [1:133422] 35 40 36 38 42 37 39 37 37 39 ...
## $ PRENATAL: num [1:133422] 2 2 4 2 2 3 1 2 2 3 ...
## $ VISITS : num [1:133422] 12 13 8 10 12 10 14 20 20 10 ...
## $ WEIGHT : num [1:133422] 3 7 8 6 8 5 4 4 4 6 ...
## $ MARITAL : num [1:133422] 2 2 1 2 1 1 2 1 1 2 ...
## $ ATTEND : num [1:133422] 3 1 3 1 1 1 1 1 1 3 ...
## $ RECORD : num [1:133422] 1 1 1 1 1 1 1 1 1 1 ...
## $ CHILDREN: num [1:133422] 0 0 2 0 0 0 1 2 3 2 ...
## $ FDATE : num [1:133422] 0 0 0 0 0 ...
## $ FMONTH : num [1:133422] 0 0 0 0 0 9 0 0 0 0 ...
## $ FYEAR : num [1:133422] 0 0 0 0 0 ...
## $ BPOUND : num [1:133422] 4 8 9 7 9 6 5 5 4 7 ...
## $ BOUNCE : num [1:133422] 1 3 0 6 7 8 8 2 15 4 ...
## $ RACEMOM : num [1:133422] 1 2 1 1 1 1 1 1 1 2 ...
## $ RACEDAD : num [1:133422] 1 2 1 1 1 1 1 1 1 2 ...
## $ HISPMOM : chr [1:133422] "N" "N" "M" "N" ...
## $ HISPDAD : chr [1:133422] "N" "N" "S" "N" ...
## $ CIGNUM : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ DRINKNUM: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ GAINED : num [1:133422] 20 23 36 40 35 40 20 38 38 43 ...
## $ ANEMIA : num [1:133422] 0 1 0 1 0 1 0 0 0 0 ...
## $ CARDIAC : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ ACLUNG : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ DIABETES: num [1:133422] 0 0 0 0 0 0 0 1 1 0 ...
## $ HERPES : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ HYDRAM : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ HEMOGLOB: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ HYPERCH : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ HYPERPR : num [1:133422] 1 0 0 0 0 0 0 1 1 0 ...
## $ ECLAMP : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ CERVIX : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ PINFANT : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ PRETERM : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ RENAL : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ RHSEN : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ UTERINE : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ OTHMED : num [1:133422] 0 1 0 0 0 1 1 1 1 1 ...
## $ AMNIO : num [1:133422] 0 0 0 0 0 0 1 0 0 0 ...
## $ MONITOR : num [1:133422] 1 1 1 1 1 1 1 1 1 1 ...
## $ INDUCT : num [1:133422] 0 1 0 0 0 0 0 0 0 0 ...
## $ STIMULA : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ TOCOLY : num [1:133422] 1 0 0 0 0 0 0 0 0 0 ...
## $ ULTRA : num [1:133422] 1 1 1 1 1 1 1 1 1 1 ...
## $ OTHPROC : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ FEBRILE : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ MECONIUM: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ RUPTURE : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ ABRUPTIO: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ PREVIA : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ BLEEDING: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ SEIZURES: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ PRECLAB : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ PROLAB : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ DYSLAB : num [1:133422] 0 1 0 0 0 0 0 0 0 0 ...
## $ BREECH : num [1:133422] 0 0 0 1 0 0 0 1 0 0 ...
## $ CEPHALO : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ PROLAPSE: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ ANESTH : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ DISTRESS: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ OTHLABOR: num [1:133422] 1 1 1 0 0 1 1 0 0 0 ...
## $ VAGINAL : num [1:133422] 1 0 1 0 1 1 1 0 0 1 ...
## $ VAGCSECT: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ PRCSECT : num [1:133422] 0 1 0 1 0 0 0 1 1 0 ...
## $ RECSECT : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ FORCEPS : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ VACUUM : num [1:133422] 0 0 0 0 0 1 0 0 0 0 ...
## $ KOTEL : num [1:133422] 4 3 3 3 3 4 3 4 4 3 ...
## $ ANENCEPH: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ SPINABIF: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ HYDROCEP: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ MICROCEP: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ OTHCNS : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ HEARTMAL: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ OTHCIRC : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ RECTALAT: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ FISTULA : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ OMPHALOC: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ OTHGAST : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## [list output truncated]
## - attr(*, "spec")=
## .. cols(
## .. INST = col_double(),
## .. RPLACE = col_double(),
## .. RCOUNTY = col_double(),
## .. PLURAL = col_double(),
## .. BDATE = col_date(format = ""),
## .. BMONTH = col_double(),
## .. BDAY = col_double(),
## .. BYEAR = col_double(),
## .. SEX = col_double(),
## .. RACE = col_double(),
## .. FAGE = col_double(),
## .. MAGE = col_double(),
## .. FEDUC = col_double(),
## .. MEDUC = col_double(),
## .. TOTALP = col_double(),
## .. BDEAD = col_double(),
## .. TERMS = col_double(),
## .. LBDATE = col_double(),
## .. LBMONTH = col_double(),
## .. LBYEAR = col_double(),
## .. LOUTCOME = col_double(),
## .. WEEKS = col_double(),
## .. PRENATAL = col_double(),
## .. VISITS = col_double(),
## .. WEIGHT = col_double(),
## .. MARITAL = col_double(),
## .. ATTEND = col_double(),
## .. RECORD = col_double(),
## .. CHILDREN = col_double(),
## .. FDATE = col_double(),
## .. FMONTH = col_double(),
## .. FYEAR = col_double(),
## .. BPOUND = col_double(),
## .. BOUNCE = col_double(),
## .. RACEMOM = col_double(),
## .. RACEDAD = col_double(),
## .. HISPMOM = col_character(),
## .. HISPDAD = col_character(),
## .. CIGNUM = col_double(),
## .. DRINKNUM = col_double(),
## .. GAINED = col_double(),
## .. ANEMIA = col_double(),
## .. CARDIAC = col_double(),
## .. ACLUNG = col_double(),
## .. DIABETES = col_double(),
## .. HERPES = col_double(),
## .. HYDRAM = col_double(),
## .. HEMOGLOB = col_double(),
## .. HYPERCH = col_double(),
## .. HYPERPR = col_double(),
## .. ECLAMP = col_double(),
## .. CERVIX = col_double(),
## .. PINFANT = col_double(),
## .. PRETERM = col_double(),
## .. RENAL = col_double(),
## .. RHSEN = col_double(),
## .. UTERINE = col_double(),
## .. OTHMED = col_double(),
## .. AMNIO = col_double(),
## .. MONITOR = col_double(),
## .. INDUCT = col_double(),
## .. STIMULA = col_double(),
## .. TOCOLY = col_double(),
## .. ULTRA = col_double(),
## .. OTHPROC = col_double(),
## .. FEBRILE = col_double(),
## .. MECONIUM = col_double(),
## .. RUPTURE = col_double(),
## .. ABRUPTIO = col_double(),
## .. PREVIA = col_double(),
## .. BLEEDING = col_double(),
## .. SEIZURES = col_double(),
## .. PRECLAB = col_double(),
## .. PROLAB = col_double(),
## .. DYSLAB = col_double(),
## .. BREECH = col_double(),
## .. CEPHALO = col_double(),
## .. PROLAPSE = col_double(),
## .. ANESTH = col_double(),
## .. DISTRESS = col_double(),
## .. OTHLABOR = col_double(),
## .. VAGINAL = col_double(),
## .. VAGCSECT = col_double(),
## .. PRCSECT = col_double(),
## .. RECSECT = col_double(),
## .. FORCEPS = col_double(),
## .. VACUUM = col_double(),
## .. KOTEL = col_double(),
## .. ANENCEPH = col_double(),
## .. SPINABIF = col_double(),
## .. HYDROCEP = col_double(),
## .. MICROCEP = col_double(),
## .. OTHCNS = col_double(),
## .. HEARTMAL = col_double(),
## .. OTHCIRC = col_double(),
## .. RECTALAT = col_double(),
## .. FISTULA = col_double(),
## .. OMPHALOC = col_double(),
## .. OTHGAST = col_double(),
## .. MALGEN = col_double(),
## .. RANGENE = col_double(),
## .. OTHURO = col_double(),
## .. CLEFT = col_double(),
## .. POLYDACT = col_double(),
## .. CLUBFOOT = col_double(),
## .. DHERNIA = col_double(),
## .. OTHMS = col_double(),
## .. DOWNS = col_double(),
## .. OTHCHROM = col_double(),
## .. OTHER = col_double(),
## .. BCWEEKS = col_double(),
## .. APGAR1 = col_double(),
## .. APGAR5 = col_double(),
## .. KESSNER = col_double(),
## .. INFANTTR = col_double(),
## .. MOTHERTR = col_double(),
## .. IANEMIA = col_double(),
## .. BINJURY = col_double(),
## .. FAS = col_double(),
## .. HYALINE = col_double(),
## .. ASPIRATE = col_double(),
## .. VENTLESS = col_double(),
## .. VENTMORE = col_double(),
## .. ISEIZURE = col_double(),
## .. OTHINF = col_double()
## .. )
Many features and columns are not relevant to the scope of our objectives. Features about post-birth outcomes and birth details such as the date are removed to keep the data pre-processing simple.
keep_col = c('SEX', 'MARITAL','FAGE', 'GAINED', 'VISITS', 'MAGE',
'FEDUC', 'MEDUC', 'TOTALP', 'BDEAD', 'TERMS',
#'RACEMOM', 'RACEDAD', 'HISPMOM', 'HISPDAD', 'LOUTCOME',
'PLURAL','WEEKS','CIGNUM', 'DRINKNUM', 'ANEMIA', 'BPOUND','BOUNCE',
'CARDIAC', 'ACLUNG', 'DIABETES', 'HERPES', 'HYDRAM',
'HEMOGLOB', 'HYPERCH', 'HYPERPR', 'ECLAMP', 'CERVIX',
'PINFANT','PRETERM','RENAL','RHSEN','UTERINE')
# Keep only the relevant columns (selection step implied by the dimensions below)
birthweight_clean = birthweight %>% dplyr::select(dplyr::all_of(keep_col))
After removing the irrelevant features, the dataset is trimmed to 33 columns and 133,422 rows of observations.
dim(birthweight_clean)
## [1] 133422 33
nrow(birthweight_clean)
## [1] 133422
ncol(birthweight_clean)
## [1] 33
Let's dive into the dataset to get an overview. The description of each attribute in the dataset is provided in the Appendix.
## tibble [133,422 x 33] (S3: tbl_df/tbl/data.frame)
## $ SEX : num [1:133422] 2 2 1 2 2 2 1 1 2 2 ...
## $ MARITAL : num [1:133422] 2 2 1 2 1 1 2 1 1 2 ...
## $ FAGE : num [1:133422] 23 23 31 21 26 19 37 26 26 29 ...
## $ GAINED : num [1:133422] 20 23 36 40 35 40 20 38 38 43 ...
## $ VISITS : num [1:133422] 12 13 8 10 12 10 14 20 20 10 ...
## $ MAGE : num [1:133422] 24 21 28 19 21 20 26 31 31 25 ...
## $ FEDUC : num [1:133422] 12 14 12 12 14 12 14 12 12 13 ...
## $ MEDUC : num [1:133422] 13 13 2 13 13 12 14 14 14 12 ...
## $ TOTALP : num [1:133422] 1 1 3 1 1 2 2 3 4 3 ...
## $ BDEAD : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ TERMS : num [1:133422] 0 0 0 0 0 1 0 0 0 0 ...
## $ PLURAL : num [1:133422] 1 1 1 1 1 1 1 2 2 1 ...
## $ WEEKS : num [1:133422] 35 40 36 38 42 37 39 37 37 39 ...
## $ CIGNUM : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ DRINKNUM: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ ANEMIA : num [1:133422] 0 1 0 1 0 1 0 0 0 0 ...
## $ BPOUND : num [1:133422] 4 8 9 7 9 6 5 5 4 7 ...
## $ BOUNCE : num [1:133422] 1 3 0 6 7 8 8 2 15 4 ...
## $ CARDIAC : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ ACLUNG : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ DIABETES: num [1:133422] 0 0 0 0 0 0 0 1 1 0 ...
## $ HERPES : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ HYDRAM : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ HEMOGLOB: num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ HYPERCH : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ HYPERPR : num [1:133422] 1 0 0 0 0 0 0 1 1 0 ...
## $ ECLAMP : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ CERVIX : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ PINFANT : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ PRETERM : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ RENAL : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ RHSEN : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
## $ UTERINE : num [1:133422] 0 0 0 0 0 0 0 0 0 0 ...
We have both qualitative and quantitative variables in the dataset. Determinant factors for birth weight, such as the mother's weight gain, the parents' ages, and the number of completed weeks of gestation, will be used to build the models.
Most qualitative variables are binary indicators of the mother's health conditions: anaemia, cardiac disease, lung disease, diabetes, genital herpes, oligohydramnios, hemoglobinopathy, chronic hypertension, pregnancy hypertension, eclampsia, incompetent cervix and renal disease. Other information, such as whether the mother previously had a preterm or small infant, an infant exceeding 4,000 grams, or uterine bleeding during the gestation period, is also included in the dataset.
## SEX MARITAL FAGE
## Min. :1.000 Min. :1.000 Min. :14.00
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:26.00
## Median :1.000 Median :1.000 Median :32.00
## Mean :1.487 Mean :1.418 Mean :42.65
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:40.00
## Max. :9.000 Max. :9.000 Max. :99.00
## GAINED VISITS MAGE
## Min. : 0.00 Min. : 0.0 Min. :11.00
## 1st Qu.:21.00 1st Qu.:10.0 1st Qu.:22.00
## Median :30.00 Median :12.0 Median :27.00
## Mean :32.68 Mean :13.3 Mean :27.01
## 3rd Qu.:40.00 3rd Qu.:15.0 3rd Qu.:31.00
## Max. :99.00 Max. :99.0 Max. :99.00
## FEDUC MEDUC TOTALP
## Min. : 0.00 Min. : 0.00 Min. : 1.000
## 1st Qu.:12.00 1st Qu.:12.00 1st Qu.: 1.000
## Median :14.00 Median :12.00 Median : 2.000
## Mean :28.76 Mean :13.22 Mean : 2.502
## 3rd Qu.:17.00 3rd Qu.:16.00 3rd Qu.: 3.000
## Max. :99.00 Max. :99.00 Max. :99.000
## BDEAD TERMS PLURAL
## Min. : 0.00000 Min. : 0.0000 Min. :1.000
## 1st Qu.: 0.00000 1st Qu.: 0.0000 1st Qu.:1.000
## Median : 0.00000 Median : 0.0000 Median :1.000
## Mean : 0.06431 Mean : 0.4651 Mean :1.035
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000 3rd Qu.:1.000
## Max. :99.00000 Max. :99.0000 Max. :9.000
## WEEKS CIGNUM DRINKNUM
## Min. :18.00 Min. : 0.000 Min. : 0.0000
## 1st Qu.:38.00 1st Qu.: 0.000 1st Qu.: 0.0000
## Median :39.00 Median : 0.000 Median : 0.0000
## Mean :38.68 Mean : 1.381 Mean : 0.7443
## 3rd Qu.:40.00 3rd Qu.: 0.000 3rd Qu.: 0.0000
## Max. :99.00 Max. :99.000 Max. :99.0000
## ANEMIA BPOUND BOUNCE
## Min. :0.00000 Min. : 0.000 Min. : 0.00
## 1st Qu.:0.00000 1st Qu.: 6.000 1st Qu.: 3.00
## Median :0.00000 Median : 7.000 Median : 7.00
## Mean :0.05962 Mean : 6.729 Mean : 7.51
## 3rd Qu.:0.00000 3rd Qu.: 8.000 3rd Qu.:11.00
## Max. :9.00000 Max. :99.000 Max. :99.00
## CARDIAC ACLUNG DIABETES
## Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.02443 Mean :0.03296 Mean :0.05155
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :9.00000 Max. :9.00000 Max. :9.00000
## HERPES HYDRAM HEMOGLOB
## Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.03378 Mean :0.03518 Mean :0.02261
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :9.00000 Max. :9.00000 Max. :9.00000
## HYPERCH HYPERPR ECLAMP
## Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000 Median :0.00000
## Mean :0.0346 Mean :0.07054 Mean :0.02422
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :9.0000 Max. :9.00000 Max. :9.00000
## CERVIX PINFANT PRETERM
## Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.02416 Mean :0.02552 Mean :0.02987
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :9.00000 Max. :9.00000 Max. :9.00000
## RENAL RHSEN UTERINE
## Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.02256 Mean :0.02389 Mean :0.02396
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :9.00000 Max. :9.00000 Max. :9.00000
Surprisingly, no missing values are found in the dataset. However, a careful look at the summary statistics for each column shows that qualitative variables coded as numbers have maximum values of 9, which seems odd. Similar patterns appear in quantitative variables with maximum values of 98 or 99, which do not describe the attributes accurately. These values are in fact recodings of missing values made when the data was recorded.
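The zero printed below presumably comes from a simple check of explicit missing values (the chunk itself is not echoed in the report); a minimal sketch:
# Count explicit NA values across the trimmed dataset
sum(is.na(birthweight_clean))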
## [1] 0
Data pre-processing
Dealing with missing data
Observations with missing data are omitted from the dataset, as fewer than 10% of the observations are affected.
qualitative_col =c('SEX', 'MARITAL','ANEMIA',
'CARDIAC', 'ACLUNG', 'DIABETES', 'HERPES', 'HYDRAM',
'HEMOGLOB', 'HYPERCH', 'HYPERPR', 'ECLAMP', 'CERVIX',
'PINFANT','PRETERM','RENAL','RHSEN','UTERINE')
birthweight_clean = birthweight_clean %>%
filter_at(all_of(qualitative_col), all_vars(. != 9)) %>%
filter(!rowSums(.==99)) %>%
filter(!rowSums(.==98))
Transformation of data
The BPOUND and BOUNCE columns, which record the pounds and ounces of the weight, are merged to create a new column BWEIGHT giving the birth weight as a decimal number of pounds. Ounces are converted to pounds by multiplying the BOUNCE column by 0.0625.
We also create a new column BWEIGHT_COND to record the birth weight grouping (see the sketch below). Babies with a birth weight below 5.5 pounds are labelled underweight, and large babies with a birth weight above 8.8 pounds are grouped into the overweight category; all other babies fall in the normal birth weight range. This grouping will be used to build the classification models discussed shortly.
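The transformation chunk itself is not echoed in this report; a minimal sketch of what it presumably looks like, using the thresholds described above (dropping the original BPOUND and BOUNCE columns is an assumption):
birthweight_clean = birthweight_clean %>%
  mutate(BWEIGHT = BPOUND + 0.0625 * BOUNCE,      # birth weight in decimal pounds
         BWEIGHT_COND = dplyr::case_when(
           BWEIGHT < 5.5 ~ 0,                     # Underweight
           BWEIGHT > 8.8 ~ 2,                     # Overweight
           TRUE          ~ 1)) %>%                # Normal
  select(-BPOUND, -BOUNCE)                        # drop the merged components (assumed)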
Dummy coding categorical features
We have some nominal variables such as SEX, MARITAL and the newly created BWEIGHT_COND. We perform dummy coding, which creates a set of binary (one-zero) variables representing each category, before using them to fit and evaluate a model.
BWEIGHT_COND is currently represented by the numbers 0, 1 and 2 for 'Underweight', 'Normal' and 'Overweight'. It is advisable to convert it to a factor so that each category can have its own effect on the outcome.
birthweight_clean =
birthweight_clean %>%
mutate(SEX_dummy_Boy = ifelse(SEX==1,1,0),
SEX_dummy_Girl = ifelse(SEX==2,1,0),
STATUS_dummy_Married=ifelse(MARITAL==1,1,0),
STATUS_dummy_Unmarried=ifelse(MARITAL==2,1,0),
BWEIGHT_dummy_under=ifelse(BWEIGHT_COND==0,1,0),
BWEIGHT_dummy_normal=ifelse(BWEIGHT_COND==1,1,0),
BWEIGHT_dummy_over=ifelse(BWEIGHT_COND==2,1,0)
) %>%
mutate_at(vars(BWEIGHT_COND ),factor,levels = c(0,1,2),labels=c("Underweight","Normal","Overweight")) %>%
mutate_at(vars(BWEIGHT_COND ),relevel,ref="Normal")
Exploratory Data Analysis
Before modelling, let’s explore each attribute to uncover the structure, patterns, and relationships existing in our data.
Target variables
Most babies weigh between 7 and 8 pounds, where the peak of the histogram lies. The normal weight class has the highest frequency, which is a good sign that most babies are in healthy condition.
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.0.3
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
plot1=birthweight_clean %>%
ggplot(aes(x=BWEIGHT)) +
geom_histogram(fill="#9f2042") +
geom_density(alpha=.2, fill="#FF6666") +
xlab("Birthweight (pounds)") +
ylab("Frequency") +
scale_x_continuous(breaks=seq(0,10,1)) +
ggtitle("Histogram of Birthweight")
plot2=birthweight_clean %>%
ggplot(aes(x=BWEIGHT_COND)) +
geom_bar(fill=c("#9f2042","#211103","#1F1D1D")) +
xlab("Classes") +
ylab("Frequency") +
ggtitle("Birthweight classes")## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
Bivariate analysis
The relationship of the birthweight and other features will be discussed shortly.
plot_density = function(cond,true_lab,false_lab,title){
birthweight_clean %>%
mutate(Status=ifelse(cond,true_lab,false_lab)) %>%
ggplot(aes(x=BWEIGHT,color=Status)) +
geom_density(size=1) +
ylab("Density") +
xlab("Birthweight") +
ggtitle(title) +
scale_color_manual(values=c("#211103","#9f2042"))
}
# Smoker vs Non-Smoker
plot3 = plot_density(birthweight_clean$CIGNUM>0,
"Smoker","Non-smoker",
"Status of smoking")
plot4 = plot_density(birthweight_clean$DRINKNUM>1,
"Drink","Not drinking",
"Status of taking/ not taking alcohol")
# has/had previus preterm/small infant
plot5 = plot_density(birthweight_clean$PRETERM==1,
"Yes","No",
"Has/had previous preterm/small infant")
# had/had previous infant 4000+ grams
plot6 = plot_density(birthweight_clean$PINFANT==1,
"Yes","No",
"Has/had previous infant with >5.5 pounds")
# Mother has/had incompetent cervix
plot7 = plot_density(birthweight_clean$CERVIX==1,
"Yes","No",
"Has/had incompetent Cervix")
# Mother has/had diabetes
plot8 = plot_density(birthweight_clean$DIABETES==1,
"Yes","No",
"Has/had diabetes")
# Mother has/had uterine bleeding
plot9 = plot_density(birthweight_clean$UTERINE==1,
"Yes","No",
"Has/had uterine bleeding")
# Mother has/had Oligohydramnios
plot10 = plot_density(birthweight_clean$HYDRAM==1,
"Yes","No",
"Has/had Oligohydramnios")
# Mother has/had chronic hypertension
plot11 = plot_density(birthweight_clean$HYPERCH==1,
"Yes","No",
"Has/had chronic hypertension")
# Mother has/had pregnancy hypertension
plot12 = plot_density(birthweight_clean$HYPERPR==1,
"Yes","No",
"Has/had pregnancy hypertension")
# Early delivery vs Normal delivery
plot13 = plot_density(birthweight_clean$WEEKS<37,
"Early delivery","Normal delivery",
"Early delivery vs Normal delivery")
# Has/had children born alive now dead
plot14 = plot_density(birthweight_clean$BDEAD>0,
"Yes","Never/No",
"Has/had children born alive now dead")
Based on the density plots, we observe that:
- Mothers who smoke during pregnancy tend to have low birth weight babies.
- Alcohol consumption does not appear to have a significant effect on birth weight, but this is based only on observation of this dataset; more scientific justification would be needed for such a claim.
- Mothers who previously had a preterm delivery are more likely to give birth to a low weight baby. In contrast, mothers who previously gave birth to a larger baby are more likely to have infants with a greater birth weight.
The mother's health conditions can also be determinant factors of birth weight. We observe that mothers who suffer from
- uterine bleeding
- oligohydramnios
- hypertension
have a higher probability of having a low birth weight baby.
Correlation
## Warning: package 'reshape2' was built under R version 4.0.3
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
quantitative_col =c('FAGE', 'GAINED', 'VISITS', 'MAGE', 'BWEIGHT','PLURAL',
'FEDUC', 'MEDUC', 'TOTALP', 'BDEAD', 'TERMS',
'WEEKS','CIGNUM', 'DRINKNUM')
# Correlation matrix for all quantitative variables
corr_matrix <- round(cor(birthweight_clean[,quantitative_col]),2)
# Get upper triangle of the correlation matrix
get_upper_tri <- function(corr_matrix){
corr_matrix[lower.tri(corr_matrix)]<- NA
return(corr_matrix)
}
upper_tri <- get_upper_tri(corr_matrix)
The upper triangle of the correlation matrix is displayed below.
## FAGE GAINED VISITS MAGE BWEIGHT PLURAL FEDUC
## FAGE 1 -0.06 0.10 0.75 0.05 0.05 0.28
## GAINED NA 1.00 0.09 -0.06 0.17 0.10 0.11
## VISITS NA NA 1.00 0.14 0.13 0.07 0.21
## MAGE NA NA NA 1.00 0.07 0.07 0.35
## BWEIGHT NA NA NA NA 1.00 -0.31 0.05
## PLURAL NA NA NA NA NA 1.00 0.05
## FEDUC NA NA NA NA NA NA 1.00
## MEDUC NA NA NA NA NA NA NA
## TOTALP NA NA NA NA NA NA NA
## BDEAD NA NA NA NA NA NA NA
## TERMS NA NA NA NA NA NA NA
## WEEKS NA NA NA NA NA NA NA
## CIGNUM NA NA NA NA NA NA NA
## DRINKNUM NA NA NA NA NA NA NA
## MEDUC TOTALP BDEAD TERMS WEEKS CIGNUM DRINKNUM
## FAGE 0.27 0.28 0.04 0.13 -0.04 -0.04 0.01
## GAINED 0.11 -0.12 -0.02 -0.01 0.10 -0.01 0.00
## VISITS 0.23 -0.06 0.00 0.03 0.14 -0.06 -0.01
## MAGE 0.37 0.35 0.04 0.17 -0.05 -0.08 0.01
## BWEIGHT 0.06 0.00 -0.04 -0.03 0.57 -0.09 -0.01
## PLURAL 0.05 0.07 0.03 0.02 -0.27 -0.02 0.00
## FEDUC 0.74 -0.10 -0.02 0.02 0.01 -0.11 0.00
## MEDUC 1.00 -0.13 -0.02 0.03 0.00 -0.12 0.00
## TOTALP NA 1.00 0.17 0.65 -0.08 0.08 0.02
## BDEAD NA NA 1.00 0.04 -0.05 0.01 0.01
## TERMS NA NA NA 1.00 -0.05 0.06 0.02
## WEEKS NA NA NA NA 1.00 -0.01 -0.01
## CIGNUM NA NA NA NA NA 1.00 0.03
## DRINKNUM NA NA NA NA NA NA 1.00
From the heatmap, we observe that the number of weeks of gestation is positively correlated with birth weight: the longer the gestation, the heavier the baby, which can lead to a larger baby at delivery if the gestation lasts longer than usual.
Other factors such as the number of cigarettes per day have a negative association with birth weight; mothers who smoke frequently tend to have low birth weight babies.
Other interesting observations:
- Mothers carrying twins, triplets or quadruplets have shorter gestation than those carrying a single baby, and the babies are lighter.
- The more weight gained during pregnancy, the heavier the baby at birth.
- Mothers who attend more prenatal visits tend to have babies with higher birth weight.
# Heatmap
melted_cormat <- melt(upper_tri, na.rm = TRUE)
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "#211103", high = "#9f2042", mid = "white",
midpoint = 0, limit =c(min(melt(corr_matrix)$value),max(melt(corr_matrix)$value)), space = "Lab",
name="Pearson\nCorrelation") +
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 10, hjust = 1))+
xlab("")+ylab("")+ggtitle("Correlation Heat Map")+
coord_fixed()
Modelling
Modelling: Classification
k Nearest Neighbors (kNN)
k-Nearest Neighbors (kNN) classifies unlabeled examples by searching the dataset for the historical observations most similar to the newly observed one. As the kNN algorithm literally "learns by example", it is well suited to classifying the weight grouping of a baby at birth.
Data processing for kNN
Since kNN assumes numeric data, categorical columns must be modified so that kNN can measure them numerically. The solution here is to create dummy variables, which was completed in the previous steps.
Before applying kNN to a classification task, it is common practice to rescale the data using a technique like min-max normalization so that all features contribute equally to the distance calculation (see the sketch below).
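The rescaling helper itself is not shown in the report; a minimal min-max sketch (assumed; the normalize() call used later in the regression section presumably looks like this):
# Min-max normalization: rescale a numeric vector to the [0, 1] range
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}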
Train/Test split for Classification
We split birthweight_class into a training set birthweight_class_train (75% of the data) and a test set birthweight_class_test (25% of the data). A column of uniform random numbers between 0 and 1, generated with runif(), is used to assign each observation randomly to the 75%/25% split.
# Use nrow to get the number of rows
(N <- nrow(birthweight_class))
## [1] 103381
# Calculate how many rows 75% of N should be and print it
(round(0.75*N))
## [1] 77536
# Create the vector of N uniform random variables: split
set.seed(123)
split <- runif(N)
# Use split to create the training set birthweight_class_train (75% of data) and the test set birthweight_class_test (25% of data)
birthweight_class_train <- birthweight_class[split<0.75,]
birthweight_class_test <- birthweight_class[split>=0.75,]
# Use nrow() to birthweight_class_train and birthweight_class_test
nrow(birthweight_class_train)
## [1] 77612
nrow(birthweight_class_test)
## [1] 25769
# Target variable indexes
target_index = grep("BWEIGHT_COND", colnames(birthweight_class))
# Class of train data
COND = birthweight_class_train$BWEIGHT_COND
There is no universal rule for selecting k. A common rule of thumb is to start with k equal to the square root of the number of observations in the training data, which here is 279.
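As a quick check, this rule-of-thumb value follows directly from the size of the training set defined above:
# Square root of the number of training observations, rounded
round(sqrt(nrow(birthweight_class_train)))   # approximately 279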
# kNN from the class package
library(class)
# When k=65
KNN_pred_65 <- knn(train = birthweight_class_train[-target_index],
test = birthweight_class_test[-target_index],
cl = COND,
k=65)
# When k=100
KNN_pred_100 <- knn(train = birthweight_class_train[-target_index],
test = birthweight_class_test[-target_index],
cl = COND,
k=100)
# When k=279 (square root of total observations)
KNN_pred_279 <- knn(train = birthweight_class_train[-target_index],
test = birthweight_class_test[-target_index],
cl = COND,
k=279)
# Create a confusion matrix of the predicted versus actual values
COND_actual <- birthweight_class_test$BWEIGHT_COND
# Accuracy rate
k_65=mean(COND_actual==KNN_pred_65)
k_100=mean(COND_actual==KNN_pred_100)
k_279=mean(COND_actual==KNN_pred_279)
k_result = rbind(k_65,k_100,k_279)
k_result
## [,1]
## k_65 0.8462882
## k_100 0.8441150
## k_279 0.8389150
When k=65, the classification is made by considering the 65 nearest neighbors.
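The confusion matrices below were presumably produced with caret::confusionMatrix (the chunk is not echoed); a minimal sketch for k = 65:
library(caret)
# Cross-tabulate predicted vs actual classes and compute per-class statistics
confusionMatrix(data = KNN_pred_65, reference = COND_actual)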
## Confusion Matrix and Statistics
##
## Reference
## Prediction Normal Underweight Overweight
## Normal 21353 66 14
## Underweight 1534 438 0
## Overweight 2347 0 17
##
## Overall Statistics
##
## Accuracy : 0.8463
## 95% CI : (0.8418, 0.8507)
## No Information Rate : 0.9792
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1643
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Normal Class: Underweight
## Sensitivity 0.8462 0.86905
## Specificity 0.8505 0.93928
## Pos Pred Value 0.9963 0.22211
## Neg Pred Value 0.1049 0.99723
## Prevalence 0.9792 0.01956
## Detection Rate 0.8286 0.01700
## Detection Prevalence 0.8317 0.07653
## Balanced Accuracy 0.8483 0.90417
## Class: Overweight
## Sensitivity 0.5483871
## Specificity 0.9088119
## Pos Pred Value 0.0071912
## Neg Pred Value 0.9994018
## Prevalence 0.0012030
## Detection Rate 0.0006597
## Detection Prevalence 0.0917381
## Balanced Accuracy 0.7285995
When k=100, the classification is made by considering the 100 nearest neighbors.
## Confusion Matrix and Statistics
##
## Reference
## Prediction Normal Underweight Overweight
## Normal 21370 47 16
## Underweight 1608 364 0
## Overweight 2346 0 18
##
## Overall Statistics
##
## Accuracy : 0.8441
## 95% CI : (0.8396, 0.8485)
## No Information Rate : 0.9827
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1401
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Normal Class: Underweight
## Sensitivity 0.8439 0.88564
## Specificity 0.8584 0.93659
## Pos Pred Value 0.9971 0.18458
## Neg Pred Value 0.0881 0.99802
## Prevalence 0.9827 0.01595
## Detection Rate 0.8293 0.01413
## Detection Prevalence 0.8317 0.07653
## Balanced Accuracy 0.8511 0.91112
## Class: Overweight
## Sensitivity 0.5294118
## Specificity 0.9088401
## Pos Pred Value 0.0076142
## Neg Pred Value 0.9993164
## Prevalence 0.0013194
## Detection Rate 0.0006985
## Detection Prevalence 0.0917381
## Balanced Accuracy 0.7191259
When k=279, the classification is made by considering the 279 nearest neighbors.
## Confusion Matrix and Statistics
##
## Reference
## Prediction Normal Underweight Overweight
## Normal 21423 10 0
## Underweight 1777 195 0
## Overweight 2364 0 0
##
## Overall Statistics
##
## Accuracy : 0.8389
## 95% CI : (0.8344, 0.8434)
## No Information Rate : 0.992
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0757
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Normal Class: Underweight
## Sensitivity 0.83801 0.951220
## Specificity 0.95122 0.930488
## Pos Pred Value 0.99953 0.098884
## Neg Pred Value 0.04497 0.999580
## Prevalence 0.99204 0.007955
## Detection Rate 0.83135 0.007567
## Detection Prevalence 0.83174 0.076526
## Balanced Accuracy 0.89462 0.940854
## Class: Overweight
## Sensitivity NA
## Specificity 0.90826
## Pos Pred Value NA
## Neg Pred Value NA
## Prevalence 0.00000
## Detection Rate 0.00000
## Detection Prevalence 0.09174
## Balanced Accuracy NA
With larger values of k, the model increasingly defaults to the majority (normal) class and performs poorly in detecting underweight and overweight babies, while the overall accuracy also drops slightly. To strike a balance, we choose the smaller value k=65, which gives the highest overall accuracy of 0.8463 across the labelled classes.
Multinomial Logistic regression
Another popular classification model is multinomial logistic regression, an extension of binomial logistic regression that allows us to predict a categorical dependent variable with more than two levels.
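With "Normal" set as the reference level (as done during pre-processing), the model estimates, for each non-reference class $k$ (Underweight or Overweight) and feature vector $x$,

$$\Pr(Y = k \mid x) = \frac{\exp\!\left(\beta_{0k} + x^\top \beta_k\right)}{1 + \sum_{j \neq \text{Normal}} \exp\!\left(\beta_{0j} + x^\top \beta_j\right)},$$

with the remaining probability assigned to the Normal class. This is the standard multinomial logit formulation, shown here for reference rather than taken from the original report.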
# Fit the model
fmla=as.formula(BWEIGHT_COND ~ .)
log_reg_model = nnet::multinom(fmla, data = birthweight_class_train)
## # weights: 105 (68 variable)
## initial value 85265.496948
## iter 10 value 43948.772859
## iter 20 value 39315.260691
## iter 30 value 35195.621475
## iter 40 value 33984.166963
## iter 50 value 33714.848964
## iter 60 value 33670.620023
## iter 70 value 33551.043061
## iter 80 value 33546.896059
## final value 33546.880388
## converged
## Call:
## nnet::multinom(formula = fmla, data = birthweight_class_train)
##
## Coefficients:
## (Intercept) FAGE GAINED VISITS
## Underweight 5.287194 -0.07920969 -2.725423 -1.609727
## Overweight -4.355592 -0.04228494 2.865142 1.008134
## MAGE FEDUC MEDUC TOTALP
## Underweight -0.2156991 0.08859638 0.2672378 -4.286894
## Overweight 1.2076822 -0.08324540 0.1304299 2.640975
## BDEAD TERMS PLURAL WEEKS
## Underweight 2.745155 4.657910 7.85548 -17.006464
## Overweight -1.054142 -2.480178 -9.72476 5.796709
## CIGNUM DRINKNUM ANEMIA CARDIAC
## Underweight 3.466275 -2.886727 -0.05464457 -0.2033561
## Overweight -4.367276 -7.092164 -0.01250162 -0.1884203
## ACLUNG DIABETES HERPES HYDRAM
## Underweight 0.363578687 -0.2292302 -0.193063974 1.0172823
## Overweight 0.001108051 0.5954211 0.003078272 -0.1273572
## HEMOGLOB HYPERCH HYPERPR ECLAMP
## Underweight 0.3504352 0.662480945 0.9060436 1.30701904
## Overweight -0.3487282 -0.003570999 -0.2538979 -0.08000906
## CERVIX PINFANT PRETERM RENAL
## Underweight 0.3548868 -1.048648 1.0299743 0.6279605
## Overweight -0.5456573 1.728803 -0.7643823 -0.3779315
## RHSEN UTERINE SEX_dummy_Boy
## Underweight -0.5793131 0.9426642 2.507795
## Overweight -0.2759953 -0.1898026 -1.894247
## SEX_dummy_Girl STATUS_dummy_Married
## Underweight 2.779400 2.476237
## Overweight -2.461344 -2.007350
## STATUS_dummy_Unmarried
## Underweight 2.810957
## Overweight -2.348241
##
## Std. Errors:
## (Intercept) FAGE GAINED VISITS
## Underweight 0.09505475 0.2383985 0.13825537 0.2370674
## Overweight 0.08975596 0.1716012 0.09145519 0.1783965
## MAGE FEDUC MEDUC TOTALP
## Underweight 0.2109616 0.1627519 0.1678846 0.3729532
## Overweight 0.1495039 0.1111006 0.1161061 0.2326635
## BDEAD TERMS PLURAL WEEKS
## Underweight 0.6595288 0.4852758 0.1833876 0.2255235
## Overweight 0.6040611 0.3376846 1.1436810 0.1921526
## CIGNUM DRINKNUM ANEMIA CARDIAC
## Underweight 0.2768779 1.913990 0.09538002 0.2739302
## Overweight 0.4187796 6.204468 0.07091892 0.2041958
## ACLUNG DIABETES HERPES HYDRAM
## Underweight 0.1471751 0.09477448 0.1693388 0.1038276
## Overweight 0.1234820 0.06192606 0.1102396 0.1182136
## HEMOGLOB HYPERCH HYPERPR ECLAMP
## Underweight 0.3365748 0.1165182 0.05947054 0.1694037
## Overweight 0.3996491 0.1161619 0.06834053 0.2799525
## CERVIX PINFANT PRETERM RENAL
## Underweight 0.2166189 0.4467634 0.1230564 0.2864402
## Overweight 0.3157495 0.1040772 0.2056165 0.3201424
## RHSEN UTERINE SEX_dummy_Boy
## Underweight 0.3542218 0.2198129 0.04972244
## Overweight 0.2303831 0.2605112 0.04597924
## SEX_dummy_Girl STATUS_dummy_Married
## Underweight 0.05211682 0.05636510
## Overweight 0.04743306 0.04958922
## STATUS_dummy_Unmarried
## Underweight 0.04799162
## Overweight 0.04630135
##
## Residual Deviance: 67093.76
## AIC: 67221.76
After fitting the model, we predict the labelled category for the birthweight_class_test dataset and check the accuracy from the confusion matrix.
# Make predictions
predict_log_reg <- log_reg_model %>% predict(birthweight_class_test[-target_index])
# Model accuracy
COND_actual <- birthweight_class_test$BWEIGHT_COND
logreg = mean(COND_actual==predict_log_reg)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Normal Underweight Overweight
## Normal 21129 260 44
## Underweight 1059 913 0
## Overweight 2323 0 41
##
## Overall Statistics
##
## Accuracy : 0.857
## 95% CI : (0.8526, 0.8612)
## No Information Rate : 0.9512
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3025
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Normal Class: Underweight
## Sensitivity 0.8620 0.77835
## Specificity 0.7583 0.95694
## Pos Pred Value 0.9858 0.46298
## Neg Pred Value 0.2200 0.98907
## Prevalence 0.9512 0.04552
## Detection Rate 0.8199 0.03543
## Detection Prevalence 0.8317 0.07653
## Balanced Accuracy 0.8102 0.86765
## Class: Overweight
## Sensitivity 0.482353
## Specificity 0.909555
## Pos Pred Value 0.017343
## Neg Pred Value 0.998120
## Prevalence 0.003299
## Detection Rate 0.001591
## Detection Prevalence 0.091738
## Balanced Accuracy 0.695954
This model also performs quite well, achieving an overall accuracy of approximately 85.7%.
Evaluation of classification models
Comparing the accuracy rates of the kNN and multinomial regression models, the latter achieves the higher accuracy. Thus, for the classification task, the multinomial regression model is selected. A side-by-side comparison is sketched below.
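For reference, a compact comparison of the test-set accuracies computed above (a small sketch using the objects already defined in the earlier chunks):
# Test-set accuracy of each classifier, side by side
rbind(`kNN (k = 65)` = k_65, `Multinomial logistic regression` = logreg)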
Modelling: Regression
birthweight_reg= birthweight_clean %>%
select(-contains("BWEIGHT_dummy"),-c("SEX","MARITAL","BWEIGHT_COND"))
Latent Variable Analysis
Principal Component Analysis (PCA) is a tool for data reduction. There are several ways PCA can contribute to the model development process: other than feeding the PCA output directly into machine learning models, it can be used to identify latent variables in a dataset by looking at how individual observed variables hang together.
# principal() comes from the psych package
bw_reg_pca<-psych::principal(select(birthweight_reg[split<0.75,],-BWEIGHT),nfactors = 15)
plot15=ggplot(data = data.frame(y=bw_reg_pca$Vaccounted[2,],x=1:15),aes(x=x,weight=y))+
geom_bar(fill="#9f2042")+
xlab("Component")+
ylab("Prop. Variation")+
ggtitle("Skree Plot")+
scale_x_continuous(breaks=seq(0,15,1))
plot15
Based on the scree plot, 15 principal components explain about 62% of the variance in the data.
##
## Loadings:
## RC1 RC11 RC3 RC2 RC4
## FAGE 0.633 0.192 0.343
## GAINED
## VISITS 0.243 0.101 -0.121
## MAGE 0.679 0.258 0.379
## FEDUC 0.785 0.151 -0.130
## MEDUC 0.793 0.165 -0.147
## TOTALP 0.895
## BDEAD 0.123
## TERMS 0.839
## PLURAL 0.804
## WEEKS -0.722
## CIGNUM -0.191 0.193
## DRINKNUM 0.109
## ANEMIA
## CARDIAC
## ACLUNG
## DIABETES
## HERPES 0.139 -0.113 -0.112
## HYDRAM
## HEMOGLOB
## HYPERCH 0.139
## HYPERPR
## ECLAMP
## CERVIX
## PINFANT
## PRETERM
## RENAL
## RHSEN -0.132 0.176
## UTERINE
## SEX_dummy_Boy -1.000
## SEX_dummy_Girl 1.000
## STATUS_dummy_Married 0.277 0.944
## STATUS_dummy_Unmarried -0.277 -0.944
## RC5 RC6 RC9 RC7 RC8
## FAGE -0.260 0.139
## GAINED 0.689 -0.176
## VISITS 0.549 0.267
## MAGE -0.209 0.147
## FEDUC 0.251
## MEDUC 0.257
## TOTALP -0.135
## BDEAD 0.538
## TERMS 0.128
## PLURAL 0.234
## WEEKS 0.279 -0.144 -0.176
## CIGNUM
## DRINKNUM
## ANEMIA 0.564
## CARDIAC -0.144
## ACLUNG 0.156
## DIABETES 0.663 0.208
## HERPES -0.111 0.130 0.324
## HYDRAM 0.240 0.114
## HEMOGLOB 0.628
## HYPERCH 0.641 -0.179
## HYPERPR 0.107 0.707
## ECLAMP 0.662
## CERVIX 0.647
## PINFANT
## PRETERM 0.604
## RENAL 0.104
## RHSEN 0.177 0.247 0.384
## UTERINE
## SEX_dummy_Boy
## SEX_dummy_Girl
## STATUS_dummy_Married
## STATUS_dummy_Unmarried
## RC14 RC10 RC13 RC12 RC15
## FAGE 0.113
## GAINED
## VISITS -0.132
## MAGE 0.105
## FEDUC
## MEDUC
## TOTALP
## BDEAD 0.164 0.182 -0.248
## TERMS
## PLURAL
## WEEKS -0.103
## CIGNUM 0.195 0.357 -0.254
## DRINKNUM 0.787 0.104
## ANEMIA 0.267 0.161 0.124
## CARDIAC 0.758 -0.126 -0.147
## ACLUNG 0.539 0.182 0.171
## DIABETES
## HERPES 0.255
## HYDRAM 0.172 0.414 0.180
## HEMOGLOB -0.153 -0.165
## HYPERCH 0.176
## HYPERPR
## ECLAMP -0.108
## CERVIX -0.127
## PINFANT 0.882
## PRETERM 0.132
## RENAL 0.687 -0.103 -0.109
## RHSEN 0.400 -0.444 0.163
## UTERINE 0.835
## SEX_dummy_Boy
## SEX_dummy_Girl
## STATUS_dummy_Married
## STATUS_dummy_Unmarried
##
## RC1 RC11 RC3 RC2 RC4 RC5 RC6
## SS loadings 2.426 2.017 2.001 1.877 1.281 1.271 1.149
## Proportion Var 0.074 0.061 0.061 0.057 0.039 0.039 0.035
## Cumulative Var 0.074 0.135 0.195 0.252 0.291 0.330 0.364
## RC9 RC7 RC8 RC14 RC10 RC13 RC12
## SS loadings 1.132 1.116 1.066 1.042 1.031 1.025 1.025
## Proportion Var 0.034 0.034 0.032 0.032 0.031 0.031 0.031
## Cumulative Var 0.399 0.432 0.465 0.496 0.528 0.559 0.590
## RC15
## SS loadings 1.017
## Proportion Var 0.031
## Cumulative Var 0.621
Several observations from the PCA:
- One of each pair of dummy variables for gender and marital status can be removed.
- The father's age and mother's age often fall under the same component. The father's age will be removed.
- The father's years of education and mother's years of education often fall under the same component. The father's years of education will be removed.
- The "Number of other terminations" and "Total pregnancies" often fall under the same component. A new feature will be created by dividing "Number of other terminations" by "Total pregnancies".
- The "Number of prenatal visits" and "Completed weeks of gestation" often fall under the same component. A new feature will be created by dividing "Number of prenatal visits" by "Completed weeks of gestation".
func_col_clean<-function(dat){
dat<-mutate(dat
#,parent_edu=FEDUC*MEDUC
,TERMS_perc=TERMS/TOTALP
,VISITS_per_WEEKS=VISITS/WEEKS
)
dat<-select(dat,-FAGE,-STATUS_dummy_Unmarried
,-STATUS_dummy_Married
,-FEDUC
#,-MEDUC
,-SEX_dummy_Girl
,-TOTALP)
return(dat)
}
birthweight_reg2<-func_col_clean(birthweight_reg)
Feature Selection
Variable importance is extracted using the varImp function from the caret package. Factors with high variable importance usually have high predictive power for the target variable.
library(caret)
control <- trainControl(method="repeatedcv", number=10, repeats=3)
model <- train(BWEIGHT~., data=birthweight_reg2[split<0.75,], method="lm", preProcess="scale"
, trControl=control)
importance <- varImp(model, scale=T)
print(importance)
plot(importance)
## lm variable importance
##
## only 20 most important variables shown (out of 29)
##
## Overall
## PLURAL 100.000
## WEEKS 92.285
## GAINED 77.651
## SEX_dummy_Boy 51.091
## MAGE 50.550
## CIGNUM 38.790
## HYPERPR 31.841
## PINFANT 30.287
## DIABETES 19.021
## HYDRAM 18.852
## ECLAMP 16.089
## PRETERM 15.830
## HYPERCH 10.915
## VISITS 9.185
## CERVIX 8.699
## UTERINE 8.179
## TERMS_perc 7.309
## VISITS_per_WEEKS 7.086
## MEDUC 4.570
## BDEAD 4.005
Based on the variable importance plot, PLURAL, WEEKS, GAINED, MAGE, CIGNUM, HYPERPR, PINFANT, DIABETES and HYDRAM are selected for the machine learning models.
Train/Test split for Regression
quantitative_col_reg<-c("PLURAL","WEEKS","GAINED","MAGE"
,"CIGNUM","HYPERPR","PINFANT","DIABETES","HYDRAM","BWEIGHT")
y_max=max(birthweight_reg2$BWEIGHT)
y_min=min(birthweight_reg2$BWEIGHT)
birthweight_reg3<-mutate_at(birthweight_reg2,quantitative_col_reg,normalize)
birthweight_reg_train<-birthweight_reg3[split<0.75,]
birthweight_reg_test<-birthweight_reg3[split>0.75,]
Linear Regression
Linear regression is one of the basic regression models for predicting a continuous value. It formulates a linear equation by assigning weights (coefficients) to the independent variables, estimated by least squares, which is equivalent to maximum likelihood estimation under normally distributed errors.
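In its general form, the model fitted below can be written as

$$\text{BWEIGHT} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon,$$

where the $x_i$ are the selected predictors (PLURAL, WEEKS, GAINED and so on) and $\varepsilon$ is the error term.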
f<-as.formula(paste0("BWEIGHT~",paste0(c(quantitative_col_reg,"SEX_dummy_Boy"),collapse = "+")))
bw_lm<-lm(f,data=birthweight_reg_train)
birthweight_reg_train2<-cbind.data.frame(birthweight_reg_train,BWEIGHT_hat=predict(bw_lm))
birthweight_reg_test2<-cbind.data.frame(birthweight_reg_test
,BWEIGHT_hat=predict(bw_lm
,newdata = birthweight_reg_test))
# De-normalised RMSE: note the parentheses around the de-normalised actual value
RMSE_lm_train<-sqrt(mean((birthweight_reg_train2$BWEIGHT_hat*(y_max-y_min)+y_min-(birthweight_reg_train2$BWEIGHT*(y_max-y_min)+y_min))^2))
RMSE_lm_test<-sqrt(mean((birthweight_reg_test2$BWEIGHT_hat*(y_max-y_min)+y_min-(birthweight_reg_test2$BWEIGHT*(y_max-y_min)+y_min))^2))
##
## Call:
## lm(formula = f, data = birthweight_reg_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.53045 -0.05108 -0.00036 0.05047 0.44245
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0541376 0.0027165 19.93 <2e-16 ***
## PLURAL -0.3237948 0.0047270 -68.50 <2e-16 ***
## WEEKS 0.5531824 0.0032467 170.38 <2e-16 ***
## GAINED 0.1132518 0.0020846 54.33 <2e-16 ***
## MAGE 0.0776288 0.0020460 37.94 <2e-16 ***
## CIGNUM -0.1604179 0.0057956 -27.68 <2e-16 ***
## HYPERPR -0.0294444 0.0013241 -22.24 <2e-16 ***
## PINFANT 0.0773849 0.0037786 20.48 <2e-16 ***
## DIABETES 0.0206970 0.0016013 12.93 <2e-16 ***
## HYDRAM -0.0320686 0.0023892 -13.42 <2e-16 ***
## SEX_dummy_Boy 0.0200160 0.0005732 34.92 <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07974 on 77601 degrees of freedom
## Multiple R-squared: 0.4055, Adjusted R-squared: 0.4054
## F-statistic: 5293 on 10 and 77601 DF, p-value: < 2.2e-16
Although the p-values of all the variables and of the final model are very small (i.e. < 2.2e-16), the R-squared is quite low: only about 41% of the variation in the output variable is explained by the input variables.
To investigate further, let's look at the Residuals vs Fitted plot and the QQ plot of the linear regression.
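The diagnostic plots were likely produced with a ggplot-based helper (the source of the warnings below); a minimal base-R equivalent, shown only as a sketch:
# Residuals vs Fitted (which = 1) and Normal Q-Q (which = 2) for the fitted linear model
par(mfrow = c(1, 2))
plot(bw_lm, which = 1)
plot(bw_lm, which = 2)
par(mfrow = c(1, 1))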
## Warning in model.matrix.default(object, data =
## structure(list(BWEIGHT = c(0.300970873786408, : the response
## appeared on the right-hand side and was dropped
## Warning in model.matrix.default(object, data =
## structure(list(BWEIGHT = c(0.300970873786408, : problem with
## term 10 in model.matrix: no columns are assigned
Linear regression assumes the data are homoscedastic (i.e. the residual variance is constant across observations) and that the residuals are normally distributed. However, the Residuals vs Fitted plot shows that the residuals are heteroscedastic, and the QQ plot reveals that the data have more extreme values than would be expected under a normal distribution.
Usually this problem can be addressed by transforming the target variable or the independent variables. Instead of trying numerous transformation methods, we attempt another machine learning model.
XGBoost
XGBoost is a machine learning model that can handle non-linear relationships and non-normally distributed data.
library(xgboost)
quantitative_col_xbg<-c(quantitative_col_reg[quantitative_col_reg!="BWEIGHT"],"SEX_dummy_Boy")
bw_xgb<-xgboost(data=as.matrix(birthweight_reg_train[,quantitative_col_xbg])
,label = birthweight_reg_train$BWEIGHT
, max.depth = 2, nrounds = 50)
## [1] train-rmse:0.099927
## [2] train-rmse:0.091502
## [3] train-rmse:0.086588
## [4] train-rmse:0.083779
## [5] train-rmse:0.081953
## [6] train-rmse:0.080976
## [7] train-rmse:0.080142
## [8] train-rmse:0.079532
## [9] train-rmse:0.079074
## [10] train-rmse:0.078710
## [11] train-rmse:0.078399
## [12] train-rmse:0.078134
## [13] train-rmse:0.077900
## [14] train-rmse:0.077729
## [15] train-rmse:0.077548
## [16] train-rmse:0.077401
## [17] train-rmse:0.077271
## [18] train-rmse:0.077144
## [19] train-rmse:0.077042
## [20] train-rmse:0.076956
## [21] train-rmse:0.076882
## [22] train-rmse:0.076788
## [23] train-rmse:0.076725
## [24] train-rmse:0.076660
## [25] train-rmse:0.076610
## [26] train-rmse:0.076552
## [27] train-rmse:0.076518
## [28] train-rmse:0.076482
## [29] train-rmse:0.076447
## [30] train-rmse:0.076418
## [31] train-rmse:0.076388
## [32] train-rmse:0.076357
## [33] train-rmse:0.076335
## [34] train-rmse:0.076315
## [35] train-rmse:0.076293
## [36] train-rmse:0.076275
## [37] train-rmse:0.076256
## [38] train-rmse:0.076241
## [39] train-rmse:0.076226
## [40] train-rmse:0.076215
## [41] train-rmse:0.076204
## [42] train-rmse:0.076182
## [43] train-rmse:0.076173
## [44] train-rmse:0.076154
## [45] train-rmse:0.076145
## [46] train-rmse:0.076133
## [47] train-rmse:0.076122
## [48] train-rmse:0.076118
## [49] train-rmse:0.076098
## [50] train-rmse:0.076092
birthweight_reg_train2<-cbind.data.frame(birthweight_reg_train2
,BWEIGHT_hat_xgb=predict(bw_xgb,newdata = as.matrix(birthweight_reg_train[,quantitative_col_xbg]))
)
birthweight_reg_test2<-cbind.data.frame(birthweight_reg_test2
,BWEIGHT_hat_xgb=predict(bw_xgb,newdata = as.matrix(birthweight_reg_test2[,quantitative_col_xbg]))
)
# De-normalised RMSE (parenthesised as for the linear model above)
RMSE_xgb_train<-sqrt(mean((birthweight_reg_train2$BWEIGHT_hat_xgb*(y_max-y_min)+y_min-(birthweight_reg_train2$BWEIGHT*(y_max-y_min)+y_min))^2))
RMSE_xgb_test<-sqrt(mean((birthweight_reg_test2$BWEIGHT_hat_xgb*(y_max-y_min)+y_min-(birthweight_reg_test2$BWEIGHT*(y_max-y_min)+y_min))^2))
The training RMSE on the normalised BWEIGHT is 0.076092, which corresponds to roughly 1.04 on the original (non-normalised) scale.
Since XGBoost can handle non-normal and non-linear data, normalisation is not strictly necessary for this model. Would the model performance differ if we fit XGBoost on the non-normalised BWEIGHT?
birthweight_reg_train2$BWEIGHT2<-birthweight_reg_train$BWEIGHT*(y_max-y_min)+y_min
birthweight_reg_test2$BWEIGHT2<-birthweight_reg_test$BWEIGHT*(y_max-y_min)+y_min
quantitative_col_xbg<-c(quantitative_col_reg[quantitative_col_reg!="BWEIGHT"],"SEX_dummy_Boy")
bw_xgb<-xgboost(data=as.matrix(birthweight_reg_train2[,quantitative_col_xbg])
,label = birthweight_reg_train2$BWEIGHT2
, max.depth = 2, nrounds = 50)
## [1] train-rmse:4.882723
## [2] train-rmse:3.501301
## [3] train-rmse:2.563161
## [4] train-rmse:1.942600
## [5] train-rmse:1.546722
## [6] train-rmse:1.309054
## [7] train-rmse:1.171215
## [8] train-rmse:1.094978
## [9] train-rmse:1.053656
## [10] train-rmse:1.031047
## [11] train-rmse:1.018097
## [12] train-rmse:1.010258
## [13] train-rmse:1.005064
## [14] train-rmse:1.001786
## [15] train-rmse:0.998937
## [16] train-rmse:0.996785
## [17] train-rmse:0.994984
## [18] train-rmse:0.993283
## [19] train-rmse:0.991939
## [20] train-rmse:0.990822
## [21] train-rmse:0.989852
## [22] train-rmse:0.988648
## [23] train-rmse:0.987827
## [24] train-rmse:0.987011
## [25] train-rmse:0.986138
## [26] train-rmse:0.985391
## [27] train-rmse:0.984914
## [28] train-rmse:0.984401
## [29] train-rmse:0.983962
## [30] train-rmse:0.983582
## [31] train-rmse:0.983246
## [32] train-rmse:0.982886
## [33] train-rmse:0.982544
## [34] train-rmse:0.982133
## [35] train-rmse:0.981841
## [36] train-rmse:0.981530
## [37] train-rmse:0.981303
## [38] train-rmse:0.981013
## [39] train-rmse:0.980840
## [40] train-rmse:0.980679
## [41] train-rmse:0.980526
## [42] train-rmse:0.980363
## [43] train-rmse:0.980232
## [44] train-rmse:0.980088
## [45] train-rmse:0.979870
## [46] train-rmse:0.979729
## [47] train-rmse:0.979559
## [48] train-rmse:0.979466
## [49] train-rmse:0.979390
## [50] train-rmse:0.979330
birthweight_reg_train2<-cbind.data.frame(birthweight_reg_train2
,BWEIGHT_hat_xgb2=predict(bw_xgb,newdata = as.matrix(birthweight_reg_train2[,quantitative_col_xbg]))
)
birthweight_reg_test2<-cbind.data.frame(birthweight_reg_test2
,BWEIGHT_hat_xgb2=predict(bw_xgb,newdata = as.matrix(birthweight_reg_test2[,quantitative_col_xbg]))
)
RMSE_xgb_train2<-sqrt(mean((birthweight_reg_train2$BWEIGHT_hat_xgb2-birthweight_reg_train2$BWEIGHT2)^2))
RMSE_xgb_test2<-sqrt(mean((birthweight_reg_test2$BWEIGHT_hat_xgb2-birthweight_reg_test2$BWEIGHT2)^2))
Evaluation of Regression Models
RMSE<-cbind.data.frame(`RMSE: Linear Regression`=c(RMSE_lm_train,RMSE_lm_test)
,`RMSE: XGBoost`=c(RMSE_xgb_train,RMSE_xgb_test)
,`RMSE: XGBoost (non-normalised)`=c(RMSE_xgb_train2,RMSE_xgb_test2)
)
rownames(RMSE)<-c("Train","Test")
RMSE
## RMSE: Linear Regression RMSE: XGBoost
## Train 1.092919 1.049005
## Test 1.081757 1.043571
## RMSE: XGBoost (non-normalised)
## Train 0.9793309
## Test 0.9768859
The difference in RMSE between linear regression and XGBoost on the normalised target is small, but XGBoost performs noticeably better on the non-normalised BWEIGHT. Usually normalisation improves, or at least does not harm, model performance; here, however, it leads to a worse result, which might be due to the normalisation approach used.
Conclusion and Future works
We developed models with adequate performance for classifying the baby's birth weight category (i.e. underweight, normal, overweight) and for predicting the birth weight itself. Multinomial logistic regression and kNN were used to predict the birth weight category; multinomial logistic regression outperforms kNN with a higher test-set accuracy of 0.857. Linear regression and XGBoost were used to predict the birth weight; XGBoost on the non-normalised birth weight outperforms the other models with a lower test-set RMSE of 0.9769.
Several improvements can be made in the following areas:
- Pre-processing: explore other normalisation methods (see the sketch after this list)
- Feature selection: select features via other methods, e.g. Recursive Feature Elimination (RFE) or stepwise regression
- Model: experiment with other models, such as neural networks, which might perform better on this dataset
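As an example of the first point, z-score standardisation is one alternative to the min-max scaling used above (a hypothetical sketch, not part of the original pipeline):
# Z-score standardisation: centre each feature at 0 with unit standard deviation
standardize <- function(x) {
  (x - mean(x)) / sd(x)
}
# e.g. (hypothetical) birthweight_reg_std <- mutate_at(birthweight_reg2, quantitative_col_reg, standardize)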
Appendix: Metadata (trimmed datasets)
- SEX : Sex of the baby
- MARITAL : Marital status of its parents
- FAGE : Age of father
- GAINED : Weight gained during pregnancy
- VISITS : Number of prenatal visits
- MAGE : Age of mother
- FEDUC : Father’s years of education
- MEDUC : Mother’s years of education
- TOTALP : Total pregnancies
- BDEAD : Number of children born alive now dead
- TERMS : Number of other terminations
- PLURAL : Whether the Mother gave birth to twins, triplets or quadruplets
- WEEKS : Completed weeks of gestation
- CIGNUM : Average number of cigarettes used daily (Mother)
- DRINKNUM: Average number of drinks used daily (mother)
- ANEMIA : Mother has/had anemia
- CARDIAC : Mother has/had cardiac disease
- ACLUNG : Mother has/had acute or chronic lung disease
- DIABETES : Mother has/had diabetes
- HERPES : Mother has/had genital herpes
- HYDRAM : Mother has/had hydramnios/Oligohydramnios
- HEMOGLOB : Mother has/had hemoglobinopathy
- HYPERCH : Mother has/had chronic hypertension
- HYPERPR : Mother has/had pregnancy hypertension
- ECLAMP : Mother has/had Eclampsia
- CERVIX : Mother has/had incompetent cervix
- PINFANT : Mother has/had previous infant 4000+ grams
- PRETERM : Mother has/had previous preterm/small infant
- RENAL : Mother has/had renal disease
- RHSEN : Mother has/had Rh sensitization
- UTERINE : Mother has/had uterine bleeding
- BWEIGHT : Baby’s weight at birth