Exploring Maternal, Infant, and Child Health Indicators by Ainsley and Annabel Merritt

1. Introduction

For our project we are taking a look at the topics of maternal and infant health throughout different countries in the world. Our interest in this topic sparked in our Infant Nutrition seminar class that we are taking currently. We wanted to examine the data behind some of the things we have been learning in class. All of our data sets have countries so we plan to merge them using inner-join with country. Thus, a single row in our data set will correspond to a single country’s data. We expect to find that countries with high income have a low infant mortality rate because they likely have better infrastructure and health services. Additionally we expect countries with high infant mortality rate to also have high neonatal mortality rate, under 5 mortality rate, and low female life expectancy as these health statistics are all signs of poor health care facilities in a country.

a. Set up

Installed the tidyverse library, tidytext library, textdata library, openintro library, dplyr library, ggplot2 library, carData library, plotROC library, caret library, rpart library, rpart.plot library, ggcorrplot library, and factoextra library.

Dataset 1

data("sowc_maternal_newborn")

# Take a look
glimpse(sowc_maternal_newborn)

## Rows: 202
## Columns: 18
## $ countries_and_areas           <chr> "Afghanistan", "Albania", "Algeria", "An…
## $ life_expectancy_female        <int> 66, 80, 78, NA, 64, NA, 78, 80, 78, 85, …
## $ family_planning_1549          <int> 42, 5, 77, NA, 30, NA, NA, NA, 37, NA, N…
## $ family_planning_1519          <int> 21, 5, NA, NA, 15, NA, NA, NA, NA, NA, N…
## $ adolescent_birth_rate         <int> 77, 17, 10, 3, 163, 40, 67, 65, 24, 10, …
## $ births_age_18                 <int> 20, 3, 1, NA, 38, NA, NA, 12, 1, NA, NA,…
## $ antenatal_care_1              <int> 59, 88, 93, NA, 82, NA, 100, 98, 100, 98…
## $ antenatal_care_4_1549         <int> 18, 78, 67, NA, 61, NA, 100, 90, 96, 92,…
## $ antenatal_care_4_1519         <int> 16, 72, NA, NA, 56, NA, NA, 85, 93, NA, …
## $ delivery_care_attendant_1549  <int> 51, 100, 97, NA, 50, NA, 100, 100, 100, …
## $ delivery_care_attendant_1519  <int> 54, 100, NA, NA, 50, NA, NA, NA, 100, NA…
## $ delivery_care_institutional   <int> 48, 99, 97, NA, 46, NA, NA, 99, 99, 99, …
## $ c_section                     <int> 3, 31, 16, NA, 4, NA, NA, 29, 18, 31, 24…
## $ postnatal_health_newborns     <int> 9, 86, NA, NA, 21, NA, NA, NA, 98, NA, N…
## $ postnatal_health_mothers      <int> 40, 88, NA, NA, 23, NA, NA, NA, 97, NA, …
## $ maternal_deaths_2017          <int> 7700, 5, 1200, NA, 3000, NA, 1, 290, 11,…
## $ maternal_mortality_ratio_2017 <int> 638, 15, 112, NA, 241, NA, 42, 39, 26, 6…
## $ risk_maternal_death_2017      <int> 33, 3800, 270, NA, 69, NA, 1200, 1100, 2…

sowc_maternal_newborn %>%
  summarize(n_countries = n_distinct(countries_and_areas))

##   n_countries
## 1         202

#found distinct number of countries

Our first dataset is “sowc_maternal_newborn”. It contains data from UNICEF’s State of the World’s Children 2019 Statistical Table. The dataframe contains 202 observations/rows which are countries and 18 variables. Of these 18 variables, 1 of these is a character variable and 17 of these is a numeric variable. Each row corresponds to a unique country. This dataset is built into the R library “openintro” and the data comes from the United Nations Children’s Emergency Fund (UNICEF).

Source: The State of the World’s Children 2019: Statistical tables. United Nations Children’s Emergency Fund (UNICEF). (2019, October). Retrieved March 11, 2023, from https://data.unicef.org/resources/dataset/sowc-2019-statistical-tables/.

Dataset 2

data("sowc_child_mortality")

## Take a look
glimpse(sowc_child_mortality)

## Rows: 195
## Columns: 18
## $ countries_and_areas            <chr> "Afghanistan", "Albania", "Algeria", "A…
## $ under5_mortality_1990          <int> 179, 41, 50, 11, 223, 28, 29, 49, 9, 10…
## $ under5_mortality_2000          <int> 129, 26, 40, 6, 206, 16, 20, 31, 6, 6, …
## $ under5_mortality_2018          <int> 62, 9, 23, 3, 77, 6, 10, 12, 4, 4, 22, …
## $ under5_reduction               <dbl> 4.1, 6.0, 2.9, 4.4, 5.4, 5.0, 3.8, 5.1,…
## $ under5_mortality_2018_male     <int> 66, 9, 25, 3, 83, 7, 11, 14, 4, 4, 24, …
## $ under5_mortality_2018_female   <int> 59, 8, 22, 3, 71, 6, 9, 11, 3, 3, 19, 9…
## $ infant_mortality_1990          <int> 121, 35, 42, 9, 132, 24, 25, 42, 8, 8, …
## $ infant_mortality_2018          <int> 48, 8, 20, 3, 52, 5, 9, 11, 3, 3, 19, 8…
## $ neonatal_mortality_1990        <int> 75, 13, 23, 6, 54, 15, 15, 23, 5, 5, 33…
## $ neonatal_mortality_2000        <int> 61, 12, 21, 3, 51, 9, 11, 16, 4, 3, 34,…
## $ neonatal_mortality_2018        <int> 37, 7, 15, 1, 28, 3, 6, 6, 2, 2, 11, 5,…
## $ prob_dying_age5to14_1990       <int> 16, 7, 9, 7, 46, 5, 3, 3, 2, 2, 5, 4, 4…
## $ prob_dying_age5to14_2018       <int> 5, 2, 4, 1, 16, 1, 2, 2, 1, 1, 3, 2, 2,…
## $ under5_deaths_2018             <int> 74, 0, 24, 0, 94, 0, 8, 1, 1, 0, 4, 0, …
## $ neonatal_deaths_2018           <int> 45, 0, 15, 0, 36, 0, 5, 0, 1, 0, 2, 0, …
## $ neonatal_deaths_percent_under5 <chr> "60", "74", "62", "50", "38", "50", "64…
## $ age5to14_deaths_2018           <int> 5, 0, 3, 0, 15, 0, 2, 0, 0, 0, 0, 0, 0,…

sowc_child_mortality %>%
  summarize(n_countries = n_distinct(countries_and_areas))

##   n_countries
## 1         195

#found distinct number of countries

Our second dataset we will be using is “sowc_child_mortality”. It contains child mortality data from UNICEF’s State of the World’s Children 2019 Statistical Tables. This data frame contains 195 observations/rows which are countries and 18 variables. Of these 18 variables, 2 are character variables and 16 are numeric variables. Each row corresponds to a unique country. This dataset is built into the R library “openintro” and the data comes from the United Nations Children’s Emergency Fund (UNICEF).

Source: The State of the World’s Children 2019: Statistical tables. United Nations Children’s Emergency Fund (UNICEF). (2019, October). Retrieved March 11, 2023, from https://data.unicef.org/resources/dataset/sowc-2019-statistical-tables/.

Dataset 3

data("Leinhardt")

## Take a look
glimpse(Leinhardt)

## Rows: 105
## Columns: 4
## $ income <int> 3426, 3350, 3346, 4751, 5029, 3312, 3403, 5040, 2009, 2298, 329…
## $ infant <dbl> 26.7, 23.7, 17.0, 16.8, 13.5, 10.1, 12.9, 20.4, 17.8, 25.7, 11.…
## $ region <fct> Asia, Europe, Europe, Americas, Europe, Europe, Europe, Europe,…
## $ oil    <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no,…

Our third dataset we are using is “Leinhardt” which givens data on infant mortality for nations around the world in 1970. There are 105 observations and 4 variables (there are really 5 variables however right now one is the row type. We will fix this later with Tidying). Of these 5 variables, 3 are character variables and 2 are numeric variables. This dataset is built into the R library “carData” and the data comes from Leinhardt’s article “Exploratory data analysis: An introduction to selected methods”.

Source: Leinhardt, S. and Wasserman, S. S. (1979). Exploratory data analysis: An introduction to Selected Methods. In Schuessler, K. (Ed.) Sociological Methodology 1979 Jossey-Bass

Final Combined Dataset

## [1] 86

## [1] 19

## Rows: 86
## Columns: 39
## $ income                         <int> 3426, 3350, 3346, 4751, 5029, 3312, 340…
## $ infant                         <dbl> 26.7, 23.7, 17.0, 16.8, 13.5, 10.1, 12.…
## $ region                         <fct> Asia, Europe, Europe, Americas, Europe,…
## $ oil                            <fct> no, no, no, no, no, no, no, no, no, no,…
## $ countries_and_areas            <chr> "Australia", "Austria", "Belgium", "Can…
## $ under5_mortality_1990          <int> 9, 10, 10, 8, 9, 7, 9, 9, 10, 6, 8, 9, …
## $ under5_mortality_2000          <int> 6, 6, 6, 6, 5, 4, 5, 7, 6, 5, 6, 5, 7, …
## $ under5_mortality_2018          <int> 4, 4, 4, 5, 4, 2, 4, 4, 3, 2, 4, 3, 4, …
## $ under5_reduction               <dbl> 2.9, 2.5, 2.6, 1.2, 1.5, 5.1, 1.6, 3.7,…
## $ under5_mortality_2018_male     <int> 4, 4, 4, 5, 5, 2, 4, 4, 3, 3, 4, 3, 4, …
## $ under5_mortality_2018_female   <int> 3, 3, 3, 5, 4, 2, 4, 3, 3, 2, 3, 2, 3, …
## $ infant_mortality_1990          <int> 8, 8, 8, 7, 7, 6, 7, 8, 8, 5, 7, 7, 12,…
## $ infant_mortality_2018          <int> 3, 3, 3, 4, 4, 1, 3, 3, 3, 2, 3, 2, 3, …
## $ neonatal_mortality_1990        <int> 5, 5, 5, 4, 4, 4, 4, 5, 6, 3, 5, 4, 7, …
## $ neonatal_mortality_2000        <int> 4, 3, 3, 4, 3, 2, 3, 4, 3, 2, 4, 3, 3, …
## $ neonatal_mortality_2018        <int> 2, 2, 2, 3, 3, 1, 3, 2, 2, 1, 2, 1, 2, …
## $ prob_dying_age5to14_1990       <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, …
## $ prob_dying_age5to14_2018       <int> 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ under5_deaths_2018             <int> 1, 0, 0, 2, 0, 0, 3, 0, 1, 2, 1, 0, 0, …
## $ neonatal_deaths_2018           <int> 1, 0, 0, 1, 0, 0, 2, 0, 1, 1, 0, 0, 0, …
## $ neonatal_deaths_percent_under5 <chr> "62", "60", "56", "68", "74", "55", "62…
## $ age5to14_deaths_2018           <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, …
## $ life_expectancy_female         <int> 85, 84, 84, 84, 83, 85, 85, 84, 85, 88,…
## $ family_planning_1549           <int> NA, NA, NA, NA, NA, NA, 96, NA, NA, NA,…
## $ family_planning_1519           <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ adolescent_birth_rate          <int> 10, 7, 6, 8, 3, 6, 5, 7, 5, 4, 3, 4, 8,…
## $ births_age_18                  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ antenatal_care_1               <int> 98, NA, NA, 100, NA, 100, 100, 100, 99,…
## $ antenatal_care_4_1549          <int> 92, NA, NA, 99, NA, NA, 99, NA, 68, NA,…
## $ antenatal_care_4_1519          <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ delivery_care_attendant_1549   <int> NA, 99, NA, 100, NA, NA, NA, 100, NA, N…
## $ delivery_care_attendant_1519   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ delivery_care_institutional    <int> 99, 99, NA, 98, NA, 100, 98, 100, 100, …
## $ c_section                      <int> 31, 24, 18, 26, 21, 16, 21, 25, 40, NA,…
## $ postnatal_health_newborns      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ postnatal_health_mothers       <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ maternal_deaths_2017           <int> 20, 4, 6, 40, 2, 2, 56, 3, 7, 44, 9, 1,…
## $ maternal_mortality_ratio_2017  <int> 6, 5, 5, 10, 4, 3, 8, 5, 2, 5, 5, 2, 8,…
## $ risk_maternal_death_2017       <int> 8200, 13500, 11200, 6100, 16200, 20900,…

As is, ‘Leinhardt’ has the country names as the row type and not as a variable on its own. We will fix this by making it a variable and then renaming it to be country. We use inner_join in order to join the three datsets by the common variable of country which is called “countries_and_areas”. We will then see how many rows our new dataset has and investigate which rows were left out of the join. Our joined dataset “allthree” has 73 observations and 39 variables. As it is joined now, we have 32 variables that are not in common between our datasets. We took a look at our antijoin function through the dataset ‘allthreecomb_miss’ and saw that the values that were not getting joined had a period rather than a space between the double names or were written slightly different so we must go back and change these names. We will then rejoin. We will go in and rename the countries with double names in “Leinhardt” to include a space instead of a period so that they can properly be matched when we join our final dataset. Now we will re-join our datasets to make sure the recoded country names get matched correctly. Now that we have renamed our country variables in “Leinhardt” to match the other datasets, we can see that our final dataset “newallthreecomb” actually has 86 countries that are found in common between the datasets and it still has 39 variables. Our final large dataset has 4 categorical variables and 35 numeric variables. We can also see that we have 19 countries that are not in common between the datasets which makes more sense and is significantly less after we renamed them.

Take a look at the dataset and create a new dataset for only numeric variables:

# Take a look
head(newallthreecomb %>% select(countries_and_areas, infant_mortality_2018, life_expectancy_female, under5_mortality_2018, neonatal_mortality_2018, income))

##   countries_and_areas infant_mortality_2018 life_expectancy_female
## 1           Australia                     3                     85
##   under5_mortality_2018 neonatal_mortality_2018 income
## 1                     4                       2   3426
##  [ reached 'max' / getOption("max.print") -- omitted 5 rows ]

#only select numeric variables we are interested in
allthreenumeric <- newallthreecomb %>% 
  select(infant_mortality_2018, life_expectancy_female, under5_mortality_2018, neonatal_mortality_2018, income) %>%
 mutate_if(is.integer, as.numeric)

allthreenumeric <- allthreenumeric %>%
  mutate(relative_infant_mortality = ifelse(infant_mortality_2018 > mean(infant_mortality_2018), 1, 0)) #coded 0 and 1 values
#0 is below average and 1 is above average

c. Define a research question

Research Question 1: Can the average female life expectancy in 2018 and under 5 mortality rate in 2018 of a country predict its infant mortality rate in 2018?

Research Question 2: Can the neonatal mortality rate in 2018 and the income of a country in 1970 predict its infant mortality rate in 2018?

2. Exploratory Data Analysis

###Create a correlation matrix:

## Use the ggcorrplot to visualize the correlation matrix
ggcorrplot(cor(allthreenumeric),
                      type = "upper", # upper diagonal
           lab = TRUE, # print values
           method = "circle") # use circles with different sizes)

From our correlation matrix we can see that infant mortality rate in 2018, female life expectancy in 2018, under 5 mortality in 2018 and neonatal mortality in 2018 are all highly correlated with each other. However, our fourth numeric explanatory variable of the country’s income does not appear to have strong correlations with our other explanatory variable or our dependent variable.

###Create visualizations to investigate relationships:

####Under 5 Mortality Rate in 2018 vs. Infant Mortality Rate in 2018

ggplot(newallthreecomb, aes(x = under5_mortality_2018, y = infant_mortality_2018)) +
  geom_point() + geom_smooth(method = "lm") +
  labs(x = "Under 5 Mortality Rate in 2018 (deaths per 1,000 live births)",
       y = "Infant Mortality Rate in 2018 (deaths per 1,000 live births)")

It is apparent from this plot that the under 5 mortality rate in 2018 has a strong positive correlation with the infant mortality rate in 2018. Thus, as the under 5 mortality rate increases we expect an increase in the infant mortality rate for a country. This makes sense because we would expect countries that have high under 5 mortality rate to have high infant mortality rates because there are most likely structural causes of these deaths that also impacts infants.

####Female Life Expectancy in 2018 vs. Infant Mortality Rate in 2018

ggplot(newallthreecomb, aes(x = life_expectancy_female, y = infant_mortality_2018)) +
  geom_point() + geom_smooth(method = "lm") +
  labs(x = "Female Life Expectancy in 2018 (years)",
       y = "Infant Mortality Rate in 2018 (deaths per 1,000 live births)")

From this graph, we can see that there is a strong negative correlation between female life expectancy in 2018 and infant mortality rate in 2018. Thus, as the infant mortality rate of a country increases, we expect its female life expectancy to decrease. This makes sense because if a country has poor health for infants it is likely that their populations don’t get as good health growing up and thus might not live as long.

####Female Life Expectancy in 2018 vs. Country’s Per-Capita Income in 1970

ggplot(newallthreecomb, aes(x = life_expectancy_female, y = income)) +
  geom_point() + geom_smooth(method = "lm") +
  labs(x = "Female Life Expectancy in 2018 (years)",
       y = "Country's Per-Capita Income in 1970 (U.S. Dollars)")

It is apparent from this plot that a country’s female life expectancy in 2018 has a moderate positive correlation with the country’s per-capita income in 1970. Thus, as the country’ female life expectancy increases we expect an increase in the country’s per capita income in 1970. This makes sense because most likely the wealthier countries can afford better health care for its people so they will live longer. It also makes sense that the correlations is only moderate as the per capita income is from 1970 and the country could have made advancements or fallen back in development since then.

####Infant Mortality Rate in 2018 vs. Neonatal Mortaltiy Rate in 2018

ggplot(newallthreecomb, aes(x = infant_mortality_2018, y = neonatal_mortality_2018)) +
  geom_point() + geom_smooth(method = "lm") +
  labs(x = "Infant Mortality Rate in 2018 (deaths per 1,000 live births)",
       y = "Neonatal Mortality Rate in 2018 (deaths per 1,000 live births)")

Lastly, from this graph we can see that there is a strong positive correlation between neonatal mortality rate in 2018 and infant mortality rate in 2018. Thus, as the neonatal mortality rate of a country increases, we expect its infant mortality rate to also increase. This makes sense because,as before these mortality rates are likely due to poor maternal and infant health that is likely due to structural causes.

3. Prediction and Cross-Validation

We will investigate the kNN and linear models, fit it on the entire dataset, and compare its performance with the results of cross-validation.

###Fit the linear model to answer our first research question: Can the average female life expectancy and under 5 mortality rate of a country predict its infant mortality rate in 2018?

a. Train the model

Fit the model on the entire dataset and find the ROC:

project_log <- glm(relative_infant_mortality ~ life_expectancy_female + under5_mortality_2018, data = allthreenumeric, family = "binomial")

# Take a look at the model summary
summary(project_log)

## 
## Call:
## glm(formula = relative_infant_mortality ~ life_expectancy_female + 
##     under5_mortality_2018, family = "binomial", data = allthreenumeric)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -5.473e-04  -2.000e-08  -2.000e-08   2.000e-08   4.257e-04  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)
## (Intercept)             -703.46   68973.30  -0.010    0.992
## life_expectancy_female   -10.86    1065.00  -0.010    0.992
##  [ reached getOption("max.print") -- omitted 1 row ]
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1.1543e+02  on 85  degrees of freedom
## Residual deviance: 5.7566e-07  on 83  degrees of freedom
## AIC: 6
## 
## Number of Fisher Scoring iterations: 25

Calculate ROC of logarithmic regression model:

ROC1 <- allthreenumeric %>%
  # Make predictions
  mutate(predictions = predict(project_log, type = "response")) %>%
  ggplot() + 
  geom_roc(aes(d = relative_infant_mortality, m = predictions), n.cuts = 10)
ROC1

calc_auc(ROC1)

##   PANEL group AUC
## 1     1    -1   1

The AUC score is 1 so this logarithmic classifier is doing really well.

b. Perform cross-validation

How would the model perform on new data? Split your data into train/test sets then use a for-loop (like we did in Worksheet 17). Or you can use the function train() and specify to calculate the AUC:

# Choose number of folds: 
k = 10 

# Randomly order rows in the dataset
data <- allthreenumeric[sample(nrow(allthreenumeric)), ] 

# Create k folds from the dataset
folds <- cut(seq(1:nrow(data)), breaks = k, labels = FALSE)

perf_k <- NULL

for(i in 1:k){
  # Create train and test sets
  train_not_i <- data[folds != i, ] # all observations except in fold i
  test_i <- data[folds == i, ]  # observations in fold i
  
  # Train model on train set (all but fold i)
  project_log2 <- glm(relative_infant_mortality ~ life_expectancy_female + under5_mortality_2018, data = train_not_i, family = "binomial")
  
  # Test model on test set (fold i)
  predict_i <- data.frame(
    predictions = predict(project_log2, newdata = test_i, type = "response"),
    relative_infant_mortality = test_i$relative_infant_mortality)
  
  # Consider the ROC curve for the test dataset
  ROC3 <- ggplot(predict_i) + 
    geom_roc(aes(d = relative_infant_mortality, m = predictions))

  # Get diagnostics for fold i (AUC)
  perf_k[i] <- calc_auc(ROC3)$AUC
}

#find average performance
mean(perf_k)

## [1] 0.975

Our classifier is very good at fitting our new observations as our average AUC value is 0.95 which is very close to 1. Our original AUC value for the logarithmic model was 1.00 so there appears to be potential signs of overfitting, however the overfitting is very small. Our model only fits our dataset 5% better than new data which is not a lot.

Fit the KNN Model to Answer our Second Research Question: Can the neonatal mortality rate and the income of a country predict its infant mortality rate in 2018?

fit_kNN_project <- knn3(relative_infant_mortality ~ neonatal_mortality_2018 + income, data = allthreenumeric, k = 5)

predict(fit_kNN_project, allthreenumeric) %>% as.data.frame %>% head #take a look

##   0 1
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
##  [ reached 'max' / getOption("max.print") -- omitted 1 rows ]

We found AUC of our KNN model:

ROC2 <- ggplot(allthreenumeric) + 
  geom_roc(aes(d = relative_infant_mortality, m = predict(fit_kNN_project, allthreenumeric)[,2]), n.cuts = 10)
ROC2

calc_auc(ROC2)

##   PANEL group       AUC
## 1     1    -1 0.9533371

The average AUC value for the k-nearest neighbors classifier is 0.9533 which means that the model fits well because this value is close to 1.

perf_k2 <- NULL

# Use a for loop to get diagnostics for each test set
for(i in 1:k){
  
  # Train model on train set (all but fold i)
  fit_kNN_project2 <- knn3(relative_infant_mortality ~ neonatal_mortality_2018 + income,
                data = train_not_i, 
                k = 5)

  # Test model on test set (fold i)
  df <- data.frame(
    predictions = predict(fit_kNN_project2, test_i)[,2],
    relative_infant_mortality = test_i$relative_infant_mortality)
  
  # Consider the ROC curve for the test dataset
  ROC4 <- ggplot(df) + 
    geom_roc(aes(d = relative_infant_mortality, m = predictions))

  # Get diagnostics for fold i (AUC)
  perf_k2[i] <- calc_auc(ROC4)$AUC
}

# Average performance 
mean(perf_k2)

## [1] 1

Our classifier is really good at predicting new observations as it has an average AUC score of 1. This does not show signs of overfitting as this value is actually higher than our model trained on our dataset showing it does a great job predicting new values.

When comparing our logistical model and our k-nearest neighbors model, we can see that they are both really good fits for both our data and predicting new observations based on the selected variables. It appears that the logistical model might be overfitting whereas the k-nearest neighbors model did a better job at predicting new values so we think we would want to use this one when evaluating new data.

4 & 5. Dimensionality reduction and Clustering

Let’s investigate some patterns in your dataset: are the observations “naturally” forming different groups?

#Perform PCA with prcomp()
# Keep all numeric variables and scale them
allthreenumeric_scaled <- allthreenumeric %>% 
 scale %>%
  as.data.frame

pca <- allthreenumeric_scaled %>%
  prcomp

pca

## Standard deviations (1, .., p=6):
## [1] 2.2409316 0.7971591 0.4769262 0.2703450 0.1920490 0.0730427
## 
## Rotation (n x k) = (6 x 6):
##                                  PC1         PC2         PC3          PC4
## infant_mortality_2018      0.4373814  0.14820870 -0.29323863  0.075951224
##                                  PC5          PC6
## infant_mortality_2018      0.2304282  0.801172288
##  [ reached getOption("max.print") -- omitted 5 rows ]

fviz_eig(pca, addlabels = TRUE, ylim = c(0,100))

# Visualize the individuals according to PC1 and PC2
fviz_pca_ind(pca, 
             repel = TRUE) # Avoid text overlapping for the row number

# Visualize the contributions of the variables to the PCs in a table
get_pca_var(pca)$coord %>% as.data.frame %>%
  arrange(Dim.1) %>% select(Dim.1) # for PC1

##                                Dim.1
## life_expectancy_female    -0.9740282
## income                    -0.6539004
## relative_infant_mortality  0.8970962
## under5_mortality_2018      0.9648200
## neonatal_mortality_2018    0.9742280
## infant_mortality_2018      0.9801419

get_pca_var(pca)$coord %>% as.data.frame %>%
  arrange(Dim.2) %>% select(Dim.2) # for PC2

##                                 Dim.2
## life_expectancy_female    -0.04652322
## neonatal_mortality_2018    0.05751310
## infant_mortality_2018      0.11814592
## relative_infant_mortality  0.13941594
## under5_mortality_2018      0.15775779
## income                     0.75611356

get_pca_var(pca)$coord %>% as.data.frame

##                           Dim.1     Dim.2      Dim.3      Dim.4     Dim.5
## infant_mortality_2018 0.9801419 0.1181459 -0.1398532 0.02053304 0.0442535
##                            Dim.6
## infant_mortality_2018 0.05851979
##  [ reached 'max' / getOption("max.print") -- omitted 5 rows ]

Based on our scree plot, we need to keep only one component in order to reach at least 80% of the variance explained because PC 1 explains 83.7% of the variance.

Scoring high on the first PC means that a country has a high infant mortality rate, high under 5 mortality rate, high neonatal mortality rate, high relative infant mortality rate, low female life expectancy, and low income.

Scoring high on the second PC means that a country has a high income and pretty low values for all of the other variables.

The total variation in our dataset explained by the first two PCs is 93.1% which is a lot.

Clustering:

fviz_nbclust(allthreenumeric_scaled, kmeans, method = "silhouette") #find number of clusters

#we will do 2 clusters

kmeans_results <- allthreenumeric_scaled %>%
  kmeans(centers = 2) #centered at 2 because 2 clusters

kmeans_results #take a look

## K-means clustering with 2 clusters of sizes 52, 34
## 
## Cluster means:
##   infant_mortality_2018 life_expectancy_female under5_mortality_2018
## 1            -0.6745778              0.6906467            -0.6585585
##   neonatal_mortality_2018     income relative_infant_mortality
## 1              -0.6926046  0.3920204                -0.8038926
##  [ reached getOption("max.print") -- omitted 1 row ]
## 
## Clustering vector:
##  [1] 1 1 1 1 1 1 1 1 1 1
##  [ reached getOption("max.print") -- omitted 76 entries ]
## 
## Within cluster sum of squares by cluster:
## [1] 87.99441 74.06138
##  (between_SS / total_SS =  68.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

allthreenumeric %>%
  mutate(cluster = as.factor(kmeans_results$cluster)) %>%
  group_by(cluster) %>%
  summarise_all(mean) #sumarize mean of all variables by cluster

## # A tibble: 2 × 7
##   cluster infant_mortality_2018 life_expectancy_female under5_mortality_2018
##   <fct>                   <dbl>                  <dbl>                 <dbl>
## 1 1                        9.13                   80.6                  10.8
## 2 2                       45.7                    65.3                  64.8
## # ℹ 3 more variables: neonatal_mortality_2018 <dbl>, income <dbl>,
## #   relative_infant_mortality <dbl>

allthreenumeric %>%
  mutate(cluster = as.factor(kmeans_results$cluster)) %>% 
  ggplot(aes(x = income, y = infant_mortality_2018, color = cluster)) + geom_point() #made a scatterplot to visualize the relationship between card debt and income by cluster

#summary about the centers
kmeans_results$centers

##   infant_mortality_2018 life_expectancy_female under5_mortality_2018
## 1            -0.6745778              0.6906467            -0.6585585
##   neonatal_mortality_2018     income relative_infant_mortality
## 1              -0.6926046  0.3920204                -0.8038926
##  [ reached getOption("max.print") -- omitted 1 row ]

#a vector attributing a cluster number to each observation
kmeans_results$cluster

##  [1] 1 1 1 1 1 1 1 1 1 1
##  [ reached getOption("max.print") -- omitted 76 entries ]

The average silhouette width is largest (over 0.6) when there are two clusters, thus we need to create two different clusters and we did this using k-means.

From the two dimensional plot visualizing our clusters, we can see that cluster 1 overall has higher income and lower infant mortality rate compared to cluster 2. In general, cluster 2 has low income and high mortality rates.

Some statistics about the clusters are that for cluster 1, the center for infant mortality is -0.6746, the center for income 0.3920, and the center for under 5 mortality rate is -0.6586. For cluster 2, the center for infant mortality is 1.0317, the center for income is -0.5996, and the center for under 5 mortality rate is 1.0072.

6. Discussion

First we will discuss our first research question: Can the average female life expectancy in 2018 and under 5 mortality rate in 2018 of a country predict its infant mortality rate in 2018? Based on our logistic regression model, we can see that life expectancy in 2018 and under 5 mortality rate in 2018 of a country were really good predictors of that country’s infant mortality rate. The AUC value for this model was 1.0000 when looking at our data which is the best it can get and it was also 0.9508 on average on new data which shows these two variables are really good predictors. We can also see this through the correlation matrix where average female life expectancy and under 5 mortality rate were both highly positively correlated with infant mortality rate.

Next, we will discuss our second research question: Can the neonatal mortality rate in 2018 and the income of a country in 1970 predict its infant mortality rate in 2018? Based on our k-nearest neighbors model, we can see that neonatal mortality rate in 2018 and a country’s income were really good predictors of that country’s infant mortality rate. The AUC value for this model was 0.9533 when looking at our data and it was also 1.0000 on average on new data which shows these two variables are really good predictors. We can also see this through the correlation matrix where infant mortality rate is highly positively correlated with neonatal mortality rate and moderately negatively correlated with a country’s income.

Overall, the challenging part of this project was doing the PCA analysis as we were less familiar with this and it was one of the newer things that we have learned. Additionally, we had a hard time in the beginning fitting the logistic model until we realized that we had to mutate our response variable into a binary variable.

Acknowledgements: We both worked on the whole project together side-by-side. Special thank you to Professor Guyot and our TAs Aubrie and Huy for y’all’s constant help throughout this course!