For our project we are taking a look at the topics of maternal and infant health throughout different countries in the world. Our interest in this topic sparked in our Infant Nutrition seminar class that we are taking currently. We wanted to examine the data behind some of the things we have been learning in class. All of our data sets have countries so we plan to merge them using inner-join with country. Thus, a single row in our data set will correspond to a single country’s data. We expect to find that countries with high income have a low infant mortality rate because they likely have better infrastructure and health services. Additionally we expect countries with high infant mortality rate to also have high neonatal mortality rate, under 5 mortality rate, and low female life expectancy as these health statistics are all signs of poor health care facilities in a country.
Installed the tidyverse library, tidytext library, textdata library, openintro library, dplyr library, ggplot2 library, carData library, plotROC library, caret library, rpart library, rpart.plot library, ggcorrplot library, and factoextra library.
data("sowc_maternal_newborn")
# Take a look
glimpse(sowc_maternal_newborn)
## Rows: 202
## Columns: 18
## $ countries_and_areas <chr> "Afghanistan", "Albania", "Algeria", "An…
## $ life_expectancy_female <int> 66, 80, 78, NA, 64, NA, 78, 80, 78, 85, …
## $ family_planning_1549 <int> 42, 5, 77, NA, 30, NA, NA, NA, 37, NA, N…
## $ family_planning_1519 <int> 21, 5, NA, NA, 15, NA, NA, NA, NA, NA, N…
## $ adolescent_birth_rate <int> 77, 17, 10, 3, 163, 40, 67, 65, 24, 10, …
## $ births_age_18 <int> 20, 3, 1, NA, 38, NA, NA, 12, 1, NA, NA,…
## $ antenatal_care_1 <int> 59, 88, 93, NA, 82, NA, 100, 98, 100, 98…
## $ antenatal_care_4_1549 <int> 18, 78, 67, NA, 61, NA, 100, 90, 96, 92,…
## $ antenatal_care_4_1519 <int> 16, 72, NA, NA, 56, NA, NA, 85, 93, NA, …
## $ delivery_care_attendant_1549 <int> 51, 100, 97, NA, 50, NA, 100, 100, 100, …
## $ delivery_care_attendant_1519 <int> 54, 100, NA, NA, 50, NA, NA, NA, 100, NA…
## $ delivery_care_institutional <int> 48, 99, 97, NA, 46, NA, NA, 99, 99, 99, …
## $ c_section <int> 3, 31, 16, NA, 4, NA, NA, 29, 18, 31, 24…
## $ postnatal_health_newborns <int> 9, 86, NA, NA, 21, NA, NA, NA, 98, NA, N…
## $ postnatal_health_mothers <int> 40, 88, NA, NA, 23, NA, NA, NA, 97, NA, …
## $ maternal_deaths_2017 <int> 7700, 5, 1200, NA, 3000, NA, 1, 290, 11,…
## $ maternal_mortality_ratio_2017 <int> 638, 15, 112, NA, 241, NA, 42, 39, 26, 6…
## $ risk_maternal_death_2017 <int> 33, 3800, 270, NA, 69, NA, 1200, 1100, 2…
sowc_maternal_newborn %>%
summarize(n_countries = n_distinct(countries_and_areas))
## n_countries
## 1 202
#found distinct number of countries
Our first dataset is “sowc_maternal_newborn”. It contains data from UNICEF’s State of the World’s Children 2019 Statistical Table. The dataframe contains 202 observations/rows which are countries and 18 variables. Of these 18 variables, 1 of these is a character variable and 17 of these is a numeric variable. Each row corresponds to a unique country. This dataset is built into the R library “openintro” and the data comes from the United Nations Children’s Emergency Fund (UNICEF).
Source: The State of the World’s Children 2019: Statistical tables. United Nations Children’s Emergency Fund (UNICEF). (2019, October). Retrieved March 11, 2023, from https://data.unicef.org/resources/dataset/sowc-2019-statistical-tables/.
data("sowc_child_mortality")
## Take a look
glimpse(sowc_child_mortality)
## Rows: 195
## Columns: 18
## $ countries_and_areas <chr> "Afghanistan", "Albania", "Algeria", "A…
## $ under5_mortality_1990 <int> 179, 41, 50, 11, 223, 28, 29, 49, 9, 10…
## $ under5_mortality_2000 <int> 129, 26, 40, 6, 206, 16, 20, 31, 6, 6, …
## $ under5_mortality_2018 <int> 62, 9, 23, 3, 77, 6, 10, 12, 4, 4, 22, …
## $ under5_reduction <dbl> 4.1, 6.0, 2.9, 4.4, 5.4, 5.0, 3.8, 5.1,…
## $ under5_mortality_2018_male <int> 66, 9, 25, 3, 83, 7, 11, 14, 4, 4, 24, …
## $ under5_mortality_2018_female <int> 59, 8, 22, 3, 71, 6, 9, 11, 3, 3, 19, 9…
## $ infant_mortality_1990 <int> 121, 35, 42, 9, 132, 24, 25, 42, 8, 8, …
## $ infant_mortality_2018 <int> 48, 8, 20, 3, 52, 5, 9, 11, 3, 3, 19, 8…
## $ neonatal_mortality_1990 <int> 75, 13, 23, 6, 54, 15, 15, 23, 5, 5, 33…
## $ neonatal_mortality_2000 <int> 61, 12, 21, 3, 51, 9, 11, 16, 4, 3, 34,…
## $ neonatal_mortality_2018 <int> 37, 7, 15, 1, 28, 3, 6, 6, 2, 2, 11, 5,…
## $ prob_dying_age5to14_1990 <int> 16, 7, 9, 7, 46, 5, 3, 3, 2, 2, 5, 4, 4…
## $ prob_dying_age5to14_2018 <int> 5, 2, 4, 1, 16, 1, 2, 2, 1, 1, 3, 2, 2,…
## $ under5_deaths_2018 <int> 74, 0, 24, 0, 94, 0, 8, 1, 1, 0, 4, 0, …
## $ neonatal_deaths_2018 <int> 45, 0, 15, 0, 36, 0, 5, 0, 1, 0, 2, 0, …
## $ neonatal_deaths_percent_under5 <chr> "60", "74", "62", "50", "38", "50", "64…
## $ age5to14_deaths_2018 <int> 5, 0, 3, 0, 15, 0, 2, 0, 0, 0, 0, 0, 0,…
sowc_child_mortality %>%
summarize(n_countries = n_distinct(countries_and_areas))
## n_countries
## 1 195
#found distinct number of countries
Our second dataset we will be using is “sowc_child_mortality”. It contains child mortality data from UNICEF’s State of the World’s Children 2019 Statistical Tables. This data frame contains 195 observations/rows which are countries and 18 variables. Of these 18 variables, 2 are character variables and 16 are numeric variables. Each row corresponds to a unique country. This dataset is built into the R library “openintro” and the data comes from the United Nations Children’s Emergency Fund (UNICEF).
Source: The State of the World’s Children 2019: Statistical tables. United Nations Children’s Emergency Fund (UNICEF). (2019, October). Retrieved March 11, 2023, from https://data.unicef.org/resources/dataset/sowc-2019-statistical-tables/.
data("Leinhardt")
## Take a look
glimpse(Leinhardt)
## Rows: 105
## Columns: 4
## $ income <int> 3426, 3350, 3346, 4751, 5029, 3312, 3403, 5040, 2009, 2298, 329…
## $ infant <dbl> 26.7, 23.7, 17.0, 16.8, 13.5, 10.1, 12.9, 20.4, 17.8, 25.7, 11.…
## $ region <fct> Asia, Europe, Europe, Americas, Europe, Europe, Europe, Europe,…
## $ oil <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no,…
Our third dataset we are using is “Leinhardt” which givens data on infant mortality for nations around the world in 1970. There are 105 observations and 4 variables (there are really 5 variables however right now one is the row type. We will fix this later with Tidying). Of these 5 variables, 3 are character variables and 2 are numeric variables. This dataset is built into the R library “carData” and the data comes from Leinhardt’s article “Exploratory data analysis: An introduction to selected methods”.
Source: Leinhardt, S. and Wasserman, S. S. (1979). Exploratory data analysis: An introduction to Selected Methods. In Schuessler, K. (Ed.) Sociological Methodology 1979 Jossey-Bass
## [1] 86
## [1] 19
## Rows: 86
## Columns: 39
## $ income <int> 3426, 3350, 3346, 4751, 5029, 3312, 340…
## $ infant <dbl> 26.7, 23.7, 17.0, 16.8, 13.5, 10.1, 12.…
## $ region <fct> Asia, Europe, Europe, Americas, Europe,…
## $ oil <fct> no, no, no, no, no, no, no, no, no, no,…
## $ countries_and_areas <chr> "Australia", "Austria", "Belgium", "Can…
## $ under5_mortality_1990 <int> 9, 10, 10, 8, 9, 7, 9, 9, 10, 6, 8, 9, …
## $ under5_mortality_2000 <int> 6, 6, 6, 6, 5, 4, 5, 7, 6, 5, 6, 5, 7, …
## $ under5_mortality_2018 <int> 4, 4, 4, 5, 4, 2, 4, 4, 3, 2, 4, 3, 4, …
## $ under5_reduction <dbl> 2.9, 2.5, 2.6, 1.2, 1.5, 5.1, 1.6, 3.7,…
## $ under5_mortality_2018_male <int> 4, 4, 4, 5, 5, 2, 4, 4, 3, 3, 4, 3, 4, …
## $ under5_mortality_2018_female <int> 3, 3, 3, 5, 4, 2, 4, 3, 3, 2, 3, 2, 3, …
## $ infant_mortality_1990 <int> 8, 8, 8, 7, 7, 6, 7, 8, 8, 5, 7, 7, 12,…
## $ infant_mortality_2018 <int> 3, 3, 3, 4, 4, 1, 3, 3, 3, 2, 3, 2, 3, …
## $ neonatal_mortality_1990 <int> 5, 5, 5, 4, 4, 4, 4, 5, 6, 3, 5, 4, 7, …
## $ neonatal_mortality_2000 <int> 4, 3, 3, 4, 3, 2, 3, 4, 3, 2, 4, 3, 3, …
## $ neonatal_mortality_2018 <int> 2, 2, 2, 3, 3, 1, 3, 2, 2, 1, 2, 1, 2, …
## $ prob_dying_age5to14_1990 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, …
## $ prob_dying_age5to14_2018 <int> 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ under5_deaths_2018 <int> 1, 0, 0, 2, 0, 0, 3, 0, 1, 2, 1, 0, 0, …
## $ neonatal_deaths_2018 <int> 1, 0, 0, 1, 0, 0, 2, 0, 1, 1, 0, 0, 0, …
## $ neonatal_deaths_percent_under5 <chr> "62", "60", "56", "68", "74", "55", "62…
## $ age5to14_deaths_2018 <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, …
## $ life_expectancy_female <int> 85, 84, 84, 84, 83, 85, 85, 84, 85, 88,…
## $ family_planning_1549 <int> NA, NA, NA, NA, NA, NA, 96, NA, NA, NA,…
## $ family_planning_1519 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ adolescent_birth_rate <int> 10, 7, 6, 8, 3, 6, 5, 7, 5, 4, 3, 4, 8,…
## $ births_age_18 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ antenatal_care_1 <int> 98, NA, NA, 100, NA, 100, 100, 100, 99,…
## $ antenatal_care_4_1549 <int> 92, NA, NA, 99, NA, NA, 99, NA, 68, NA,…
## $ antenatal_care_4_1519 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ delivery_care_attendant_1549 <int> NA, 99, NA, 100, NA, NA, NA, 100, NA, N…
## $ delivery_care_attendant_1519 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ delivery_care_institutional <int> 99, 99, NA, 98, NA, 100, 98, 100, 100, …
## $ c_section <int> 31, 24, 18, 26, 21, 16, 21, 25, 40, NA,…
## $ postnatal_health_newborns <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ postnatal_health_mothers <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ maternal_deaths_2017 <int> 20, 4, 6, 40, 2, 2, 56, 3, 7, 44, 9, 1,…
## $ maternal_mortality_ratio_2017 <int> 6, 5, 5, 10, 4, 3, 8, 5, 2, 5, 5, 2, 8,…
## $ risk_maternal_death_2017 <int> 8200, 13500, 11200, 6100, 16200, 20900,…
As is, ‘Leinhardt’ has the country names as the row type and not as a variable on its own. We will fix this by making it a variable and then renaming it to be country. We use inner_join in order to join the three datsets by the common variable of country which is called “countries_and_areas”. We will then see how many rows our new dataset has and investigate which rows were left out of the join. Our joined dataset “allthree” has 73 observations and 39 variables. As it is joined now, we have 32 variables that are not in common between our datasets. We took a look at our antijoin function through the dataset ‘allthreecomb_miss’ and saw that the values that were not getting joined had a period rather than a space between the double names or were written slightly different so we must go back and change these names. We will then rejoin. We will go in and rename the countries with double names in “Leinhardt” to include a space instead of a period so that they can properly be matched when we join our final dataset. Now we will re-join our datasets to make sure the recoded country names get matched correctly. Now that we have renamed our country variables in “Leinhardt” to match the other datasets, we can see that our final dataset “newallthreecomb” actually has 86 countries that are found in common between the datasets and it still has 39 variables. Our final large dataset has 4 categorical variables and 35 numeric variables. We can also see that we have 19 countries that are not in common between the datasets which makes more sense and is significantly less after we renamed them.
Take a look at the dataset and create a new dataset for only numeric variables:
# Take a look
head(newallthreecomb %>% select(countries_and_areas, infant_mortality_2018, life_expectancy_female, under5_mortality_2018, neonatal_mortality_2018, income))
## countries_and_areas infant_mortality_2018 life_expectancy_female
## 1 Australia 3 85
## under5_mortality_2018 neonatal_mortality_2018 income
## 1 4 2 3426
## [ reached 'max' / getOption("max.print") -- omitted 5 rows ]
#only select numeric variables we are interested in
allthreenumeric <- newallthreecomb %>%
select(infant_mortality_2018, life_expectancy_female, under5_mortality_2018, neonatal_mortality_2018, income) %>%
mutate_if(is.integer, as.numeric)
allthreenumeric <- allthreenumeric %>%
mutate(relative_infant_mortality = ifelse(infant_mortality_2018 > mean(infant_mortality_2018), 1, 0)) #coded 0 and 1 values
#0 is below average and 1 is above average
Research Question 1: Can the average female life expectancy in 2018 and under 5 mortality rate in 2018 of a country predict its infant mortality rate in 2018?
Research Question 2: Can the neonatal mortality rate in 2018 and the income of a country in 1970 predict its infant mortality rate in 2018?
###Create a correlation matrix:
## Use the ggcorrplot to visualize the correlation matrix
ggcorrplot(cor(allthreenumeric),
type = "upper", # upper diagonal
lab = TRUE, # print values
method = "circle") # use circles with different sizes)
From our correlation matrix we can see that infant mortality rate in 2018, female life expectancy in 2018, under 5 mortality in 2018 and neonatal mortality in 2018 are all highly correlated with each other. However, our fourth numeric explanatory variable of the country’s income does not appear to have strong correlations with our other explanatory variable or our dependent variable.
###Create visualizations to investigate relationships:
####Under 5 Mortality Rate in 2018 vs. Infant Mortality Rate in 2018
ggplot(newallthreecomb, aes(x = under5_mortality_2018, y = infant_mortality_2018)) +
geom_point() + geom_smooth(method = "lm") +
labs(x = "Under 5 Mortality Rate in 2018 (deaths per 1,000 live births)",
y = "Infant Mortality Rate in 2018 (deaths per 1,000 live births)")
It is apparent from this plot that the under 5 mortality rate in 2018 has a strong positive correlation with the infant mortality rate in 2018. Thus, as the under 5 mortality rate increases we expect an increase in the infant mortality rate for a country. This makes sense because we would expect countries that have high under 5 mortality rate to have high infant mortality rates because there are most likely structural causes of these deaths that also impacts infants.
####Female Life Expectancy in 2018 vs. Infant Mortality Rate in 2018
ggplot(newallthreecomb, aes(x = life_expectancy_female, y = infant_mortality_2018)) +
geom_point() + geom_smooth(method = "lm") +
labs(x = "Female Life Expectancy in 2018 (years)",
y = "Infant Mortality Rate in 2018 (deaths per 1,000 live births)")
From this graph, we can see that there is a strong negative correlation between female life expectancy in 2018 and infant mortality rate in 2018. Thus, as the infant mortality rate of a country increases, we expect its female life expectancy to decrease. This makes sense because if a country has poor health for infants it is likely that their populations don’t get as good health growing up and thus might not live as long.
####Female Life Expectancy in 2018 vs. Country’s Per-Capita Income in 1970
ggplot(newallthreecomb, aes(x = life_expectancy_female, y = income)) +
geom_point() + geom_smooth(method = "lm") +
labs(x = "Female Life Expectancy in 2018 (years)",
y = "Country's Per-Capita Income in 1970 (U.S. Dollars)")
It is apparent from this plot that a country’s female life expectancy in
2018 has a moderate positive correlation with the country’s per-capita
income in 1970. Thus, as the country’ female life expectancy increases
we expect an increase in the country’s per capita income in 1970. This
makes sense because most likely the wealthier countries can afford
better health care for its people so they will live longer. It also
makes sense that the correlations is only moderate as the per capita
income is from 1970 and the country could have made advancements or
fallen back in development since then.
####Infant Mortality Rate in 2018 vs. Neonatal Mortaltiy Rate in 2018
ggplot(newallthreecomb, aes(x = infant_mortality_2018, y = neonatal_mortality_2018)) +
geom_point() + geom_smooth(method = "lm") +
labs(x = "Infant Mortality Rate in 2018 (deaths per 1,000 live births)",
y = "Neonatal Mortality Rate in 2018 (deaths per 1,000 live births)")
Lastly, from this graph we can see that there is a strong positive correlation between neonatal mortality rate in 2018 and infant mortality rate in 2018. Thus, as the neonatal mortality rate of a country increases, we expect its infant mortality rate to also increase. This makes sense because,as before these mortality rates are likely due to poor maternal and infant health that is likely due to structural causes.
We will investigate the kNN and linear models, fit it on the entire dataset, and compare its performance with the results of cross-validation.
###Fit the linear model to answer our first research question: Can the average female life expectancy and under 5 mortality rate of a country predict its infant mortality rate in 2018?
Fit the model on the entire dataset and find the ROC:
project_log <- glm(relative_infant_mortality ~ life_expectancy_female + under5_mortality_2018, data = allthreenumeric, family = "binomial")
# Take a look at the model summary
summary(project_log)
##
## Call:
## glm(formula = relative_infant_mortality ~ life_expectancy_female +
## under5_mortality_2018, family = "binomial", data = allthreenumeric)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.473e-04 -2.000e-08 -2.000e-08 2.000e-08 4.257e-04
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -703.46 68973.30 -0.010 0.992
## life_expectancy_female -10.86 1065.00 -0.010 0.992
## [ reached getOption("max.print") -- omitted 1 row ]
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1.1543e+02 on 85 degrees of freedom
## Residual deviance: 5.7566e-07 on 83 degrees of freedom
## AIC: 6
##
## Number of Fisher Scoring iterations: 25
Calculate ROC of logarithmic regression model:
ROC1 <- allthreenumeric %>%
# Make predictions
mutate(predictions = predict(project_log, type = "response")) %>%
ggplot() +
geom_roc(aes(d = relative_infant_mortality, m = predictions), n.cuts = 10)
ROC1
calc_auc(ROC1)
## PANEL group AUC
## 1 1 -1 1
The AUC score is 1 so this logarithmic classifier is doing really well.
How would the model perform on new data? Split your data into
train/test sets then use a for-loop (like we did in Worksheet 17). Or
you can use the function train() and specify to calculate
the AUC:
# Choose number of folds:
k = 10
# Randomly order rows in the dataset
data <- allthreenumeric[sample(nrow(allthreenumeric)), ]
# Create k folds from the dataset
folds <- cut(seq(1:nrow(data)), breaks = k, labels = FALSE)
perf_k <- NULL
for(i in 1:k){
# Create train and test sets
train_not_i <- data[folds != i, ] # all observations except in fold i
test_i <- data[folds == i, ] # observations in fold i
# Train model on train set (all but fold i)
project_log2 <- glm(relative_infant_mortality ~ life_expectancy_female + under5_mortality_2018, data = train_not_i, family = "binomial")
# Test model on test set (fold i)
predict_i <- data.frame(
predictions = predict(project_log2, newdata = test_i, type = "response"),
relative_infant_mortality = test_i$relative_infant_mortality)
# Consider the ROC curve for the test dataset
ROC3 <- ggplot(predict_i) +
geom_roc(aes(d = relative_infant_mortality, m = predictions))
# Get diagnostics for fold i (AUC)
perf_k[i] <- calc_auc(ROC3)$AUC
}
#find average performance
mean(perf_k)
## [1] 0.975
Our classifier is very good at fitting our new observations as our average AUC value is 0.95 which is very close to 1. Our original AUC value for the logarithmic model was 1.00 so there appears to be potential signs of overfitting, however the overfitting is very small. Our model only fits our dataset 5% better than new data which is not a lot.
fit_kNN_project <- knn3(relative_infant_mortality ~ neonatal_mortality_2018 + income, data = allthreenumeric, k = 5)
predict(fit_kNN_project, allthreenumeric) %>% as.data.frame %>% head #take a look
## 0 1
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
## [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
We found AUC of our KNN model:
ROC2 <- ggplot(allthreenumeric) +
geom_roc(aes(d = relative_infant_mortality, m = predict(fit_kNN_project, allthreenumeric)[,2]), n.cuts = 10)
ROC2
calc_auc(ROC2)
## PANEL group AUC
## 1 1 -1 0.9533371
The average AUC value for the k-nearest neighbors classifier is 0.9533 which means that the model fits well because this value is close to 1.
perf_k2 <- NULL
# Use a for loop to get diagnostics for each test set
for(i in 1:k){
# Train model on train set (all but fold i)
fit_kNN_project2 <- knn3(relative_infant_mortality ~ neonatal_mortality_2018 + income,
data = train_not_i,
k = 5)
# Test model on test set (fold i)
df <- data.frame(
predictions = predict(fit_kNN_project2, test_i)[,2],
relative_infant_mortality = test_i$relative_infant_mortality)
# Consider the ROC curve for the test dataset
ROC4 <- ggplot(df) +
geom_roc(aes(d = relative_infant_mortality, m = predictions))
# Get diagnostics for fold i (AUC)
perf_k2[i] <- calc_auc(ROC4)$AUC
}
# Average performance
mean(perf_k2)
## [1] 1
Our classifier is really good at predicting new observations as it has an average AUC score of 1. This does not show signs of overfitting as this value is actually higher than our model trained on our dataset showing it does a great job predicting new values.
When comparing our logistical model and our k-nearest neighbors model, we can see that they are both really good fits for both our data and predicting new observations based on the selected variables. It appears that the logistical model might be overfitting whereas the k-nearest neighbors model did a better job at predicting new values so we think we would want to use this one when evaluating new data.
Let’s investigate some patterns in your dataset: are the observations “naturally” forming different groups?
#Perform PCA with prcomp()
# Keep all numeric variables and scale them
allthreenumeric_scaled <- allthreenumeric %>%
scale %>%
as.data.frame
pca <- allthreenumeric_scaled %>%
prcomp
pca
## Standard deviations (1, .., p=6):
## [1] 2.2409316 0.7971591 0.4769262 0.2703450 0.1920490 0.0730427
##
## Rotation (n x k) = (6 x 6):
## PC1 PC2 PC3 PC4
## infant_mortality_2018 0.4373814 0.14820870 -0.29323863 0.075951224
## PC5 PC6
## infant_mortality_2018 0.2304282 0.801172288
## [ reached getOption("max.print") -- omitted 5 rows ]
fviz_eig(pca, addlabels = TRUE, ylim = c(0,100))
# Visualize the individuals according to PC1 and PC2
fviz_pca_ind(pca,
repel = TRUE) # Avoid text overlapping for the row number
# Visualize the contributions of the variables to the PCs in a table
get_pca_var(pca)$coord %>% as.data.frame %>%
arrange(Dim.1) %>% select(Dim.1) # for PC1
## Dim.1
## life_expectancy_female -0.9740282
## income -0.6539004
## relative_infant_mortality 0.8970962
## under5_mortality_2018 0.9648200
## neonatal_mortality_2018 0.9742280
## infant_mortality_2018 0.9801419
get_pca_var(pca)$coord %>% as.data.frame %>%
arrange(Dim.2) %>% select(Dim.2) # for PC2
## Dim.2
## life_expectancy_female -0.04652322
## neonatal_mortality_2018 0.05751310
## infant_mortality_2018 0.11814592
## relative_infant_mortality 0.13941594
## under5_mortality_2018 0.15775779
## income 0.75611356
get_pca_var(pca)$coord %>% as.data.frame
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## infant_mortality_2018 0.9801419 0.1181459 -0.1398532 0.02053304 0.0442535
## Dim.6
## infant_mortality_2018 0.05851979
## [ reached 'max' / getOption("max.print") -- omitted 5 rows ]
Based on our scree plot, we need to keep only one component in order to reach at least 80% of the variance explained because PC 1 explains 83.7% of the variance.
Scoring high on the first PC means that a country has a high infant mortality rate, high under 5 mortality rate, high neonatal mortality rate, high relative infant mortality rate, low female life expectancy, and low income.
Scoring high on the second PC means that a country has a high income and pretty low values for all of the other variables.
The total variation in our dataset explained by the first two PCs is 93.1% which is a lot.
Clustering:
fviz_nbclust(allthreenumeric_scaled, kmeans, method = "silhouette") #find number of clusters
#we will do 2 clusters
kmeans_results <- allthreenumeric_scaled %>%
kmeans(centers = 2) #centered at 2 because 2 clusters
kmeans_results #take a look
## K-means clustering with 2 clusters of sizes 52, 34
##
## Cluster means:
## infant_mortality_2018 life_expectancy_female under5_mortality_2018
## 1 -0.6745778 0.6906467 -0.6585585
## neonatal_mortality_2018 income relative_infant_mortality
## 1 -0.6926046 0.3920204 -0.8038926
## [ reached getOption("max.print") -- omitted 1 row ]
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1
## [ reached getOption("max.print") -- omitted 76 entries ]
##
## Within cluster sum of squares by cluster:
## [1] 87.99441 74.06138
## (between_SS / total_SS = 68.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
allthreenumeric %>%
mutate(cluster = as.factor(kmeans_results$cluster)) %>%
group_by(cluster) %>%
summarise_all(mean) #sumarize mean of all variables by cluster
## # A tibble: 2 × 7
## cluster infant_mortality_2018 life_expectancy_female under5_mortality_2018
## <fct> <dbl> <dbl> <dbl>
## 1 1 9.13 80.6 10.8
## 2 2 45.7 65.3 64.8
## # ℹ 3 more variables: neonatal_mortality_2018 <dbl>, income <dbl>,
## # relative_infant_mortality <dbl>
allthreenumeric %>%
mutate(cluster = as.factor(kmeans_results$cluster)) %>%
ggplot(aes(x = income, y = infant_mortality_2018, color = cluster)) + geom_point() #made a scatterplot to visualize the relationship between card debt and income by cluster
#summary about the centers
kmeans_results$centers
## infant_mortality_2018 life_expectancy_female under5_mortality_2018
## 1 -0.6745778 0.6906467 -0.6585585
## neonatal_mortality_2018 income relative_infant_mortality
## 1 -0.6926046 0.3920204 -0.8038926
## [ reached getOption("max.print") -- omitted 1 row ]
#a vector attributing a cluster number to each observation
kmeans_results$cluster
## [1] 1 1 1 1 1 1 1 1 1 1
## [ reached getOption("max.print") -- omitted 76 entries ]
The average silhouette width is largest (over 0.6) when there are two clusters, thus we need to create two different clusters and we did this using k-means.
From the two dimensional plot visualizing our clusters, we can see that cluster 1 overall has higher income and lower infant mortality rate compared to cluster 2. In general, cluster 2 has low income and high mortality rates.
Some statistics about the clusters are that for cluster 1, the center for infant mortality is -0.6746, the center for income 0.3920, and the center for under 5 mortality rate is -0.6586. For cluster 2, the center for infant mortality is 1.0317, the center for income is -0.5996, and the center for under 5 mortality rate is 1.0072.
First we will discuss our first research question: Can the average female life expectancy in 2018 and under 5 mortality rate in 2018 of a country predict its infant mortality rate in 2018? Based on our logistic regression model, we can see that life expectancy in 2018 and under 5 mortality rate in 2018 of a country were really good predictors of that country’s infant mortality rate. The AUC value for this model was 1.0000 when looking at our data which is the best it can get and it was also 0.9508 on average on new data which shows these two variables are really good predictors. We can also see this through the correlation matrix where average female life expectancy and under 5 mortality rate were both highly positively correlated with infant mortality rate.
Next, we will discuss our second research question: Can the neonatal mortality rate in 2018 and the income of a country in 1970 predict its infant mortality rate in 2018? Based on our k-nearest neighbors model, we can see that neonatal mortality rate in 2018 and a country’s income were really good predictors of that country’s infant mortality rate. The AUC value for this model was 0.9533 when looking at our data and it was also 1.0000 on average on new data which shows these two variables are really good predictors. We can also see this through the correlation matrix where infant mortality rate is highly positively correlated with neonatal mortality rate and moderately negatively correlated with a country’s income.
Overall, the challenging part of this project was doing the PCA analysis as we were less familiar with this and it was one of the newer things that we have learned. Additionally, we had a hard time in the beginning fitting the logistic model until we realized that we had to mutate our response variable into a binary variable.
Acknowledgements: We both worked on the whole project together side-by-side. Special thank you to Professor Guyot and our TAs Aubrie and Huy for y’all’s constant help throughout this course!