1. Introduction

For our project, we examine maternal, infant, and child health across countries and over time. Our interest in this topic was sparked by the Infant Nutrition seminar we are currently taking, and we wanted to examine the data behind some of what we have been learning in class. All of our datasets include a country variable, so we plan to merge them with an inner join on country. Thus, a single row in our merged dataset will correspond to a single country’s data. We expect the variables we investigate to vary a lot between regions but less between countries within a region, and we expect Africa to have a higher neonatal mortality rate and lower female life expectancy than regions such as Europe.

a. Set up

We installed and loaded the tidyverse, tidytext, textdata, openintro, dplyr, ggplot2, and carData libraries. We had to take the installation code out in order to knit.

b. Quick description of the dataset(s)

Dataset 1

data("sowc_maternal_newborn")

# Take a look
glimpse(sowc_maternal_newborn)
## Rows: 202
## Columns: 18
## $ countries_and_areas           <chr> "Afghanistan", "Albania", "Algeria", "An…
## $ life_expectancy_female        <int> 66, 80, 78, NA, 64, NA, 78, 80, 78, 85, …
## $ family_planning_1549          <int> 42, 5, 77, NA, 30, NA, NA, NA, 37, NA, N…
## $ family_planning_1519          <int> 21, 5, NA, NA, 15, NA, NA, NA, NA, NA, N…
## $ adolescent_birth_rate         <int> 77, 17, 10, 3, 163, 40, 67, 65, 24, 10, …
## $ births_age_18                 <int> 20, 3, 1, NA, 38, NA, NA, 12, 1, NA, NA,…
## $ antenatal_care_1              <int> 59, 88, 93, NA, 82, NA, 100, 98, 100, 98…
## $ antenatal_care_4_1549         <int> 18, 78, 67, NA, 61, NA, 100, 90, 96, 92,…
## $ antenatal_care_4_1519         <int> 16, 72, NA, NA, 56, NA, NA, 85, 93, NA, …
## $ delivery_care_attendant_1549  <int> 51, 100, 97, NA, 50, NA, 100, 100, 100, …
## $ delivery_care_attendant_1519  <int> 54, 100, NA, NA, 50, NA, NA, NA, 100, NA…
## $ delivery_care_institutional   <int> 48, 99, 97, NA, 46, NA, NA, 99, 99, 99, …
## $ c_section                     <int> 3, 31, 16, NA, 4, NA, NA, 29, 18, 31, 24…
## $ postnatal_health_newborns     <int> 9, 86, NA, NA, 21, NA, NA, NA, 98, NA, N…
## $ postnatal_health_mothers      <int> 40, 88, NA, NA, 23, NA, NA, NA, 97, NA, …
## $ maternal_deaths_2017          <int> 7700, 5, 1200, NA, 3000, NA, 1, 290, 11,…
## $ maternal_mortality_ratio_2017 <int> 638, 15, 112, NA, 241, NA, 42, 39, 26, 6…
## $ risk_maternal_death_2017      <int> 33, 3800, 270, NA, 69, NA, 1200, 1100, 2…
sowc_maternal_newborn %>%
  summarize(n_countries = n_distinct(countries_and_areas))
##   n_countries
## 1         202
#found distinct number of countries

Our first dataset is “sowc_maternal_newborn”. It contains maternal and newborn health data from UNICEF’s State of the World’s Children 2019 Statistical Tables. The data frame contains 202 observations/rows, which are countries, and 18 variables. Of these 18 variables, 1 is a character variable and 17 are numeric variables. Each row corresponds to a unique country. This dataset is built into the R library “openintro” and the data comes from the United Nations Children’s Fund (UNICEF).

Source: The State of the World’s Children 2019: Statistical tables. United Nations Children’s Fund (UNICEF). (2019, October). Retrieved March 11, 2023, from https://data.unicef.org/resources/dataset/sowc-2019-statistical-tables/.

Dataset 2

data("sowc_child_mortality")

## Take a look
glimpse(sowc_child_mortality)
## Rows: 195
## Columns: 18
## $ countries_and_areas            <chr> "Afghanistan", "Albania", "Algeria", "A…
## $ under5_mortality_1990          <int> 179, 41, 50, 11, 223, 28, 29, 49, 9, 10…
## $ under5_mortality_2000          <int> 129, 26, 40, 6, 206, 16, 20, 31, 6, 6, …
## $ under5_mortality_2018          <int> 62, 9, 23, 3, 77, 6, 10, 12, 4, 4, 22, …
## $ under5_reduction               <dbl> 4.1, 6.0, 2.9, 4.4, 5.4, 5.0, 3.8, 5.1,…
## $ under5_mortality_2018_male     <int> 66, 9, 25, 3, 83, 7, 11, 14, 4, 4, 24, …
## $ under5_mortality_2018_female   <int> 59, 8, 22, 3, 71, 6, 9, 11, 3, 3, 19, 9…
## $ infant_mortality_1990          <int> 121, 35, 42, 9, 132, 24, 25, 42, 8, 8, …
## $ infant_mortality_2018          <int> 48, 8, 20, 3, 52, 5, 9, 11, 3, 3, 19, 8…
## $ neonatal_mortality_1990        <int> 75, 13, 23, 6, 54, 15, 15, 23, 5, 5, 33…
## $ neonatal_mortality_2000        <int> 61, 12, 21, 3, 51, 9, 11, 16, 4, 3, 34,…
## $ neonatal_mortality_2018        <int> 37, 7, 15, 1, 28, 3, 6, 6, 2, 2, 11, 5,…
## $ prob_dying_age5to14_1990       <int> 16, 7, 9, 7, 46, 5, 3, 3, 2, 2, 5, 4, 4…
## $ prob_dying_age5to14_2018       <int> 5, 2, 4, 1, 16, 1, 2, 2, 1, 1, 3, 2, 2,…
## $ under5_deaths_2018             <int> 74, 0, 24, 0, 94, 0, 8, 1, 1, 0, 4, 0, …
## $ neonatal_deaths_2018           <int> 45, 0, 15, 0, 36, 0, 5, 0, 1, 0, 2, 0, …
## $ neonatal_deaths_percent_under5 <chr> "60", "74", "62", "50", "38", "50", "64…
## $ age5to14_deaths_2018           <int> 5, 0, 3, 0, 15, 0, 2, 0, 0, 0, 0, 0, 0,…
sowc_child_mortality %>%
  summarize(n_countries = n_distinct(countries_and_areas))
##   n_countries
## 1         195
#found distinct number of countries

Our second dataset is “sowc_child_mortality”. It contains child mortality data from UNICEF’s State of the World’s Children 2019 Statistical Tables. This data frame contains 195 observations/rows, which are countries, and 18 variables. Of these 18 variables, 2 are character variables and 16 are numeric variables. Each row corresponds to a unique country. This dataset is built into the R library “openintro” and the data comes from the United Nations Children’s Fund (UNICEF).

Source: The State of the World’s Children 2019: Statistical tables. United Nations Children’s Fund (UNICEF). (2019, October). Retrieved March 11, 2023, from https://data.unicef.org/resources/dataset/sowc-2019-statistical-tables/.

Dataset 3

data("Leinhardt")

## Take a look
glimpse(Leinhardt)
## Rows: 105
## Columns: 4
## $ income <int> 3426, 3350, 3346, 4751, 5029, 3312, 3403, 5040, 2009, 2298, 329…
## $ infant <dbl> 26.7, 23.7, 17.0, 16.8, 13.5, 10.1, 12.9, 20.4, 17.8, 25.7, 11.…
## $ region <fct> Asia, Europe, Europe, Americas, Europe, Europe, Europe, Europe,…
## $ oil    <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no,…

Our third dataset is “Leinhardt”, which gives data on infant mortality for nations around the world in 1970. There are 105 observations and 4 variables (there are really 5 variables, but right now the country names are stored as row names rather than as a column; we will fix this later when tidying). Of these 5 variables, 3 are categorical (country, region, and oil) and 2 are numeric (income and infant). This dataset is built into the R library “carData” and the data comes from Leinhardt and Wasserman’s chapter “Exploratory data analysis: An introduction to selected methods”.

Source: Leinhardt, S. and Wasserman, S. S. (1979). Exploratory data analysis: An introduction to selected methods. In Schuessler, K. (Ed.), Sociological Methodology 1979. Jossey-Bass.

c. Define a research question

Research Question #1: How has the infant mortality rate changed across different countries and regions over time?

Research Question #2: How does the correlation between female life expectancy and neonatal mortality rates differ among different regions?

2,3,4. Make our data analyzable

Step 1: Make ‘Leinhardt’ have country as a variable

As is, ‘Leinhardt’ stores the country names as row names and not as a variable of its own. We will fix this by turning the row names into a variable and then renaming it to match the country variable in our other datasets.

Leinhardt <- cbind(rownames(Leinhardt), data.frame(Leinhardt, row.names=NULL)) #make the row name a variable

Leinhardt <- Leinhardt %>%
  rename("countries_and_areas" = `rownames(Leinhardt)`) #rename to match our other datasets
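The same row-names-to-column step can be sketched in base R on a tiny data frame. The object `toy` and its two rows are illustrative only; this is one possible alternative under those assumptions, not the code used for the project.

```r
# Illustrative data frame with country names stored as row names
toy <- data.frame(infant = c(26.7, 23.7),
                  row.names = c("Australia", "Austria"))

# Copy the row names into a regular column, then drop the row names
toy$countries_and_areas <- rownames(toy)
rownames(toy) <- NULL

toy
```

After this, `countries_and_areas` behaves like any other column and can be used as a join key.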

Step 2: Join the datasets based on their common variable of country

We will make a list called “allthree” containing the three datasets and then use inner_join to join them by the common country variable, “countries_and_areas”, which gives a new dataset called “allthreecomb”. We will then see how many rows our new dataset has and investigate which rows were left out of the join.
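To make the idea of joining a list of tables on one key concrete, here is a minimal base R sketch. The three toy tables (`a`, `b`, `c3`) and their values are made up, and `merge()` with `Reduce()` stands in for the `inner_join` and `reduce` calls used in the project.

```r
# Three toy tables (made-up values) that share the key "countries_and_areas"
a  <- data.frame(countries_and_areas = c("Albania", "Chad", "Peru"), x = 1:3)
b  <- data.frame(countries_and_areas = c("Albania", "Peru", "Fiji"), y = 4:6)
c3 <- data.frame(countries_and_areas = c("Peru", "Albania"), z = 7:8)

# merge() performs an inner join by default; Reduce() applies it pairwise
# across the list, so only keys present in every table survive
joined <- Reduce(
  function(left, right) merge(left, right, by = "countries_and_areas"),
  list(a, b, c3)
)

joined
```

Only Albania and Peru appear in all three toy tables, so the result has two rows and one column per variable plus the key.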

allthree = list(Leinhardt,sowc_child_mortality, sowc_maternal_newborn)

allthreecomb <- allthree %>% reduce(inner_join, by = "countries_and_areas" )

allthreecomb %>%
  nrow
## [1] 73
allthreecomb_miss <- allthree %>% reduce(anti_join, by = "countries_and_areas" )

allthreecomb_miss %>%
  nrow
## [1] 32

Our joined dataset “allthreecomb” has 73 observations and 39 variables. As it is joined now, 32 countries from “Leinhardt” are not matched in the other datasets. We inspected these rows with an anti-join, stored in ‘allthreecomb_miss’, and saw that the names that were not getting joined had a period rather than a space between the words of multi-word names or were written slightly differently, so we must go back and change these names. We will then rejoin.
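A quick way to see which keys fail to match is to compare the key columns directly. The sketch below uses made-up key vectors (`left_keys`, `right_keys`) rather than the real datasets. One caveat worth keeping in mind: chaining `anti_join` over a list returns the rows that are missing from every later table, which equals the set of rows dropped by the chained inner join only when the later tables are missing the same keys, as the counts above suggest is the case here (73 + 32 = 105).

```r
# Toy country keys (made up for illustration)
left_keys  <- c("Costa.Rica", "Chad", "Moroco", "Peru")
right_keys <- c("Costa Rica", "Chad", "Morocco", "Peru")

# Keys on the left that have no exact match on the right
unmatched <- setdiff(left_keys, right_keys)

unmatched
```

Listing the unmatched keys this way makes formatting differences (periods, spelling) easy to spot before recoding.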

Step 3: Go back and fix country names and rejoin to get final dataset

We will go in and rename the countries with double names in “Leinhardt” to include a space instead of a period so that they can properly be matched when we join our final dataset.
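When the only difference is a period in place of a space, the replacement can also be done for all names at once. This is a base R sketch on a made-up vector (`old_names`), offered as an alternative to recoding each name by hand; misspellings such as 'Moroco' would still need an explicit recode.

```r
# Toy vector of names that use periods instead of spaces
old_names <- c("Costa.Rica", "Papua.New.Guinea", "Chad")

# fixed = TRUE treats "." as a literal period, not as a regex wildcard
new_names <- gsub(".", " ", old_names, fixed = TRUE)

new_names
```

Without `fixed = TRUE`, the pattern "." would match every character and blank out the whole name.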

new_Leinhardt <- Leinhardt %>%
  #renamed the countries
  #note: 'New Zeleand' is misspelled (it should be 'New Zealand'), so New Zealand is still not matched in the join below
  mutate(countryrename = recode(countries_and_areas,
    'West.Germany' = 'West Germany', 'New.Zealand' = 'New Zeleand', 'South.Africa' = 'South Africa',
    'United.States' = 'United States', 'Saudi.Arabia' = 'Saudi Arabia', 'Costa.Rica' = 'Costa Rica',
    'Dominican.Republic' = 'Dominican Republic', 'Trinidad.and.Tobago' = 'Trinidad and Tobago',
    'El.Salvador' = 'El Salvador', 'Ivory.Coast' = 'Ivory Coast', 'South.Korea' = 'South Korea',
    'Moroco' = 'Morocco', 'Papua.New.Guinea' = 'Papua New Guinea', 'South.Vietnam' = 'South Vietnam',
    'Afganistan' = 'Afghanistan', 'Central.African.Republic' = 'Central African Republic',
    'Sierra.Leone' = 'Sierra Leone', 'Sri.Lanka' = 'Sri Lanka', 'Upper.Volta' = 'Upper Volta',
    'Southern.Yemen' = 'Southern Yemen'))


new_Leinhardt <- new_Leinhardt %>%
  rename("old_country_names" = `countries_and_areas`) #went back and renamed the column with old names

new_Leinhardt <- new_Leinhardt %>%
  rename("countries_and_areas" = `countryrename`) #renamed the new column to be "countries_and_areas" because this is the variable we joined our three datasets by

Now we will re-join our datasets to make sure the recoded country names get matched correctly.

newallthree = list(new_Leinhardt,sowc_child_mortality, sowc_maternal_newborn)

newallthreecomb <- newallthree %>% reduce(inner_join, by = "countries_and_areas") %>% #inner joined by country
  select(-"old_country_names") #dropped the column of old country names

newallthreecomb %>%
  nrow #number of rows included in our join
## [1] 86
newallthreecomb_miss <- newallthree %>% reduce(anti_join, by = "countries_and_areas" ) #making dataset of rows not included in join

newallthreecomb_miss %>%
  nrow #number of rows not included in our join
## [1] 19
glimpse(newallthreecomb)
## Rows: 86
## Columns: 39
## $ income                         <int> 3426, 3350, 3346, 4751, 5029, 3312, 340…
## $ infant                         <dbl> 26.7, 23.7, 17.0, 16.8, 13.5, 10.1, 12.…
## $ region                         <fct> Asia, Europe, Europe, Americas, Europe,…
## $ oil                            <fct> no, no, no, no, no, no, no, no, no, no,…
## $ countries_and_areas            <chr> "Australia", "Austria", "Belgium", "Can…
## $ under5_mortality_1990          <int> 9, 10, 10, 8, 9, 7, 9, 9, 10, 6, 8, 9, …
## $ under5_mortality_2000          <int> 6, 6, 6, 6, 5, 4, 5, 7, 6, 5, 6, 5, 7, …
## $ under5_mortality_2018          <int> 4, 4, 4, 5, 4, 2, 4, 4, 3, 2, 4, 3, 4, …
## $ under5_reduction               <dbl> 2.9, 2.5, 2.6, 1.2, 1.5, 5.1, 1.6, 3.7,…
## $ under5_mortality_2018_male     <int> 4, 4, 4, 5, 5, 2, 4, 4, 3, 3, 4, 3, 4, …
## $ under5_mortality_2018_female   <int> 3, 3, 3, 5, 4, 2, 4, 3, 3, 2, 3, 2, 3, …
## $ infant_mortality_1990          <int> 8, 8, 8, 7, 7, 6, 7, 8, 8, 5, 7, 7, 12,…
## $ infant_mortality_2018          <int> 3, 3, 3, 4, 4, 1, 3, 3, 3, 2, 3, 2, 3, …
## $ neonatal_mortality_1990        <int> 5, 5, 5, 4, 4, 4, 4, 5, 6, 3, 5, 4, 7, …
## $ neonatal_mortality_2000        <int> 4, 3, 3, 4, 3, 2, 3, 4, 3, 2, 4, 3, 3, …
## $ neonatal_mortality_2018        <int> 2, 2, 2, 3, 3, 1, 3, 2, 2, 1, 2, 1, 2, …
## $ prob_dying_age5to14_1990       <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, …
## $ prob_dying_age5to14_2018       <int> 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ under5_deaths_2018             <int> 1, 0, 0, 2, 0, 0, 3, 0, 1, 2, 1, 0, 0, …
## $ neonatal_deaths_2018           <int> 1, 0, 0, 1, 0, 0, 2, 0, 1, 1, 0, 0, 0, …
## $ neonatal_deaths_percent_under5 <chr> "62", "60", "56", "68", "74", "55", "62…
## $ age5to14_deaths_2018           <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, …
## $ life_expectancy_female         <int> 85, 84, 84, 84, 83, 85, 85, 84, 85, 88,…
## $ family_planning_1549           <int> NA, NA, NA, NA, NA, NA, 96, NA, NA, NA,…
## $ family_planning_1519           <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ adolescent_birth_rate          <int> 10, 7, 6, 8, 3, 6, 5, 7, 5, 4, 3, 4, 8,…
## $ births_age_18                  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ antenatal_care_1               <int> 98, NA, NA, 100, NA, 100, 100, 100, 99,…
## $ antenatal_care_4_1549          <int> 92, NA, NA, 99, NA, NA, 99, NA, 68, NA,…
## $ antenatal_care_4_1519          <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ delivery_care_attendant_1549   <int> NA, 99, NA, 100, NA, NA, NA, 100, NA, N…
## $ delivery_care_attendant_1519   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ delivery_care_institutional    <int> 99, 99, NA, 98, NA, 100, 98, 100, 100, …
## $ c_section                      <int> 31, 24, 18, 26, 21, 16, 21, 25, 40, NA,…
## $ postnatal_health_newborns      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ postnatal_health_mothers       <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ maternal_deaths_2017           <int> 20, 4, 6, 40, 2, 2, 56, 3, 7, 44, 9, 1,…
## $ maternal_mortality_ratio_2017  <int> 6, 5, 5, 10, 4, 3, 8, 5, 2, 5, 5, 2, 8,…
## $ risk_maternal_death_2017       <int> 8200, 13500, 11200, 6100, 16200, 20900,…

Now that we have renamed the countries in “Leinhardt” to match the other datasets, we can see that our final dataset “newallthreecomb” has 86 countries in common between the datasets and still has 39 variables. Our final large dataset has 4 categorical variables and 35 numeric variables. We can also see that 19 countries from “Leinhardt” are still not matched, which is considerably fewer than before the renaming and makes more sense.

Step 4: Get our data analyzable for our specific research questions

Question 1 Data

We went through our combined larger dataset and created a smaller dataset to answer our first research question. In order to answer this question, we selected the variables of country, region, infant mortality rate in 1970, and infant mortality rate in 2018. We also created a variable called “diff1” that calculates the difference between the infant mortality rate in 2018 and the infant mortality rate in 1970. We chose to filter to focus on the regions of Africa, the Americas, and Europe.

question1 <- newallthreecomb %>%
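  filter(region == c("Africa", 'Americas', 'Europe')) %>% #desired regions; note that == recycles the three names down the rows, so only rows whose region matches the name at its position are kept (region %in% c(...) would keep every row from these regions)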
  mutate (diff1 = infant_mortality_2018 - infant) %>% #new variable
  group_by(countries_and_areas) %>% #grouped by country
  select(region, countries_and_areas, infant, infant_mortality_2018, diff1) #selected our variables

table(question1$region)
## 
##   Africa Americas     Asia   Europe 
##        8        6        0        6
barplot(table(question1$region),
        ylab = 'Country Count',
        main = 'Country Count per Region')

#summary stats of region(character variable)

hist(question1$diff1,
     xlab = 'Difference between Infant Mortality Rate in 2018 and 1970 (deaths per 1,000 live births)',
     main = 'Distribution of Difference Between Infant Mortality Rate (deaths per 1,000 live births)')

IQR(question1$diff1) #need because skewed
## [1] 102.95
#summary stats of diff1 (numeric variable)


summary(question1$infant)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.60   17.45   68.65   93.08  148.47  300.00
hist(question1$infant,
     xlab = 'Infant Mortality Rate in 1970 (deaths per 1,000 live births)',
     main = 'Distribution of Infant Mortality Rate in 1970 (deaths per 1,000 live births)')

IQR(question1$infant) #need because skewed
## [1] 131.025
#summary stats of infant mortality rate in 1970 (numeric variable)

We calculated some summary statistics for the region, our new variable “diff1”, and infant mortality rate in 1970. For the region, we found that 8 of our countries are from Africa, 6 of our countries are from the Americas, and 6 of our countries are from Europe.
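The region counts depend on how the region filter is written. Comparing a column to a vector of several values with `==` is evaluated element by element with recycling, while `%in%` tests set membership. The sketch below uses a made-up `region` vector to show how the two differ; it is an illustration, not a rerun of the project data.

```r
# Toy region column (made up) and the regions we want to keep
region <- c("Africa", "Africa", "Americas", "Europe", "Asia", "Europe")
wanted <- c("Africa", "Americas", "Europe")

# == recycles `wanted` along `region`: row 1 is compared with "Africa",
# row 2 with "Americas", row 3 with "Europe", row 4 with "Africa", and so on
keep_eq <- region == wanted

# %in% asks, for every row, whether its region is any of the wanted values
keep_in <- region %in% wanted

sum(keep_eq)  # rows kept by ==
sum(keep_in)  # rows kept by %in%
```

In this toy example `==` keeps 2 rows while `%in%` keeps 5, so a filter written with `==` can silently drop rows that belong to the desired regions.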

For “diff1” which was the difference between the infant mortality in 2018 and the infant mortality in 1970 we can see from our histogram that the data is left skewed so we will use the median of -46.30 deaths per 1,000 live births as our measure of center and the IQR of 102.95 deaths per 1,000 live births as our measure of spread.

For the infant mortality rate in 1970 we can see from our histogram that the data is right skewed so we will use the median of 68.65 deaths per 1,000 live births as our measure of center and the IQR of 131.025 deaths per 1,000 live births as our measure of spread.
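The reason for preferring the median and IQR on skewed data can be seen on a small made-up vector with one extreme value. The vector `x` is purely illustrative.

```r
# Toy right-skewed data: four small values and one extreme value
x <- c(1, 2, 3, 4, 100)

mean(x)    # pulled toward the extreme value
median(x)  # stays at the middle observation
sd(x)      # inflated by the extreme value
IQR(x)     # spread of the middle 50% only
```

Here the mean is 22 while the median is 3, and the IQR is 2, so the median and IQR describe the bulk of the data better than the mean and standard deviation do.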

Question 2 Data

We went through our combined larger dataset and created a smaller dataset to answer our second research question. In order to answer this question, we first had to tidy our data so that neonatal mortality and year had separate columns, using the pivot_longer() function. Next, we selected the variables of region, year, female life expectancy, and neonatal mortality rate. We also created a variable called “relativefemlifeexp” that says whether each country has a high or low female life expectancy compared to the mean female life expectancy of its region (the mean is computed after grouping by region). We chose to filter to focus on the year 2018.

question2 <- newallthreecomb %>%
  rename("1990" = neonatal_mortality_1990, '2000' = neonatal_mortality_2000, '2018' = neonatal_mortality_2018) %>%
  pivot_longer(cols = c('1990', '2000', '2018'),
                          names_to = 'year',
                          values_to = 'neonatalmortality')
#renamed columns so that they would only include years
#set names to year to get a column for year and values to neonatal mortality because those are the values that will go in the column
                          
question2 <- question2 %>%
  filter(year == "2018") %>% #filtered to year 2018 (year is a character column after pivoting)
  group_by(region) %>% #grouped by region, so the mean below is computed within each region
  mutate(relativefemlifeexp = ifelse(life_expectancy_female > mean(life_expectancy_female), "high", "low")) %>% #new variable that says if female life expectancy in a country is "high" or "low" relative to its region's mean
  arrange(relativefemlifeexp) %>% #arranged alphabetically ("high" before "low")
  select(region, year, relativefemlifeexp, life_expectancy_female, neonatalmortality) #selected the variables to look at

table(question2$relativefemlifeexp)
## 
## high  low 
##   44   42
barplot(table(question2$relativefemlifeexp),
        ylab = 'Country Count', main = 'Country Count of Relative Female Life Expectancy in 2018')

#summary stats of relative female life expectancy (character variable)

summary(question2$life_expectancy_female)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   55.00   67.00   77.00   74.55   81.75   88.00
hist(question2$life_expectancy_female, 
     xlab = 'Average Female Life Expectancy (years)',
     main = 'Distribution of Average Female Life Expectancy')

IQR(question2$life_expectancy_female)
## [1] 14.75
#summary stats of female life expectancy in 2018 (numeric variable)

summary(question2$neonatalmortality)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    4.00   11.00   13.94   22.00   42.00
hist(question2$neonatalmortality,
     xlab = 'Neonatal Mortality Rate (deaths per 1,000 live births)',
     main = 'Distribution of Neonatal Mortality Rate')

IQR(question2$neonatalmortality)
## [1] 18
#summary stats of neonatal mortality in 2018 (numeric variable)

We calculated some summary statistics for our new variable “relativefemlifeexp”, the female life expectancy in 2018, and neonatal mortality rate in 2018.

For the relative female life expectancy, which says whether each country had a high or low female life expectancy compared to the mean female life expectancy of its region in 2018, we found that 44 countries had high female life expectancy and 42 had low female life expectancy.
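Because the question2 code groups by region before calling mutate(), the mean inside ifelse() is computed within each region rather than over all countries. The base R sketch below shows the difference on a made-up data frame; `toy`, its values, and the column names `vs_overall` and `vs_region` are assumptions for illustration.

```r
# Toy data (made up): two regions with different typical life expectancies
toy <- data.frame(
  region   = c("Africa", "Africa", "Europe", "Europe"),
  life_exp = c(60, 70, 80, 84)
)

# Mean over all rows versus mean within each region
overall_mean <- mean(toy$life_exp)
region_mean  <- ave(toy$life_exp, toy$region)  # ave() defaults to the group mean

# Label each country relative to each kind of mean
toy$vs_overall <- ifelse(toy$life_exp > overall_mean, "high", "low")
toy$vs_region  <- ifelse(toy$life_exp > region_mean, "high", "low")

toy
```

In the toy data, both African countries are "low" against the overall mean, but one of them is "high" against its regional mean, so the two definitions answer different questions.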

For the female life expectancy in 2018 we can see from our histogram that the data is left skewed so we will use the median of 77.00 years as our measure of center and the IQR of 14.75 years as our measure of spread.

For the neonatal mortality rate in 2018 we can see from our histogram that the data is right skewed so we will use the median of 11.00 deaths per 1,000 live births as our measure of center and the IQR of 18.00 deaths per 1,000 live births as our measure of spread.

Step 5: Making Visualizations of Our Data

Question 1

Visualization 1: 1 variable plot

We chose to create a box plot of the “diff1” variable that we created.

ggplot(data = question1, aes (y = diff1)) +
  geom_boxplot(fill = 'light blue') + #light blue box
  labs(y='Difference between Infant Mortality Rate in 2018 and 1970 (deaths per 1,000 live births)',
       title = 'Distribution of Difference between Infant Mortality Rates')+ #labels
  scale_x_discrete() + #take out x axis values
  ylim(-250,0)+ #wanted to take out outlier
  theme_minimal() #make easier to visualize

This boxplot shows us the distribution of the difference between the infant mortality rate in 2018 and the infant mortality rate in 1970. From our graph, we can see that the median difference is around -50 deaths per 1,000 live births. We can tell from the boxplot that this variable is left skewed, as it has a long tail of large negative values. From this graph, we can also see that the middle 50% of the differences lie between around -15 and -110 deaths per 1,000 live births. The fact that all of these numbers are negative means that the infant mortality rate has decreased from 1970 to 2018, which is a good thing.

Visualization 2: 2 Variable Plot

We chose to create a grouped barplot with error bars to show how the difference between infant mortality rate in 2018 and 1970 differs by region.

ggplot(question1, aes(x = region, y = diff1, fill = region)) +
  geom_bar(stat = "summary", fun = "mean") + #used the mean of "diff1" by region
  geom_errorbar(stat = "summary", fun.data = "mean_se")+ #made an error bar for the mean
  labs(x='Region', y = 'Difference Between Infant Mortality Rate in 2018 and 1970 (deaths per 1,000 births)', title = 'Distribution of Difference Between Infant Mortality Rates by Region')+ #graph titles
  theme_minimal()+ #made it easier to read
  scale_y_continuous(breaks = seq(-175,0,25))+ #created more breaks so it was easier to visualize
  theme(legend.position = "none") #get rid of legend

This barplot with error bars shows us the mean difference between the infant mortality rate in 2018 and the infant mortality rate in 1970 by region. We can see that there does appear to be a great difference in our “diff1” variable by region. Africa has greatly decreased its infant mortality rate from 1970 to 2018. After Africa, the Americas have also greatly decreased their infant mortality rate from 1970 to 2018. Europe has still reduced its infant mortality rate, but by very little, which may imply that Europe had a low infant mortality rate to begin with. Our error bars, which show the standard error of each regional mean, are widest for Africa, followed by the Americas, and narrowest for Europe, which suggests that the differences vary the most among African countries and the least among European countries. Additionally, the fact that all of these numbers are negative means that the infant mortality rate has decreased from 1970 to 2018, which is a good thing.
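The error bars produced with `fun.data = "mean_se"` span the group mean plus and minus one standard error, where the standard error is the standard deviation divided by the square root of the group size. The base R sketch below computes both quantities per region on a made-up data frame (`toy`); the values are illustrative only.

```r
# Toy data (made up): differences in infant mortality for two regions
toy <- data.frame(
  region = c("Africa", "Africa", "Africa", "Europe", "Europe", "Europe"),
  diff1  = c(-150, -100, -50, -12, -10, -8)
)

# Mean of diff1 within each region (the bar height)
group_mean <- tapply(toy$diff1, toy$region, mean)

# Standard error of the mean within each region (half the error bar length)
group_se <- tapply(toy$diff1, toy$region,
                   function(v) sd(v) / sqrt(length(v)))

group_mean
group_se
```

With equal group sizes, a larger standard error reflects a larger spread within the group, which is why the toy Africa bar would get a much longer error bar than the toy Europe bar.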

Visualization 3: 3 Variable Plot

We chose to create a scatter plot to show how the difference between infant mortality rate in 2018 and 1970 differs by country and by region.

ggplot(question1,aes(x=reorder(countries_and_areas,diff1), y= diff1, color=region))+ #reordered to go in order of increasing y-value from left to right
  geom_point()+ #made a scatter plot
  theme_minimal()+ #made it easier to read
  theme(axis.text.x = element_text(angle=90))+ #tilted text so we could see each country name
  labs(x= 'Country', y = 'Difference between Infant Mortality Rate in 2018 and 1970 (deaths per 1,000 live births)', title='Distribution of Difference between Infant Mortality Rates by Country and Region')+ #graph titles
    scale_y_continuous(breaks = seq(-300,0,50)) #created more breaks to better visualize the y-variable

This scatter plot shows us the difference between the infant mortality rate in 2018 and the infant mortality rate in 1970 by country and region. We can see that there does appear to be a great difference in our “diff1” variable by region. When looking at countries, our “diff1” variable is very different for some pairs of countries, such as “Libya” and “Sweden”, but very similar for others, such as “Finland” and “Sweden”. This pattern can be explained by region: dots of the same color, which correspond to the same region, sit close to each other. From this plot we can see that Libya has had the largest change in the infant mortality rate from 1970 to 2018 while Sweden has had the smallest. We ordered our scatter plot so that the x-axis goes from the largest decrease on the left to the smallest decrease on the right. Similar to our previous graph, we can see that on average Africa has the greatest decrease in infant mortality rate, followed by the Americas, with Europe showing the smallest decrease. Additionally, the fact that all of these numbers are negative means that the infant mortality rate has decreased from 1970 to 2018, which is a good thing.

Question 2

Visualization 4: 1 Variable Plot

We chose to create a bar plot of the “relativefemlifeexp” variable that we created.

ggplot(data = question2, aes (x = relativefemlifeexp)) +
geom_bar(aes(fill = relativefemlifeexp)) + #made a bar plot of our categorical variables
  labs(x='Relative Female Life Expectancy Compared to the Average', y = 'Country Count', title = 'Distribution of Relative Female Life Expectancy')+ #made graph titles
  theme_minimal()+ #made it easier to visualize
    scale_y_continuous(breaks = seq(0,50,5))+ #created more breaks to make it easier to visualize the country count
  theme(legend.position = "none") #get rid of legend

This barplot shows us the distribution of the relative female life expectancy of countries compared to the mean female life expectancy of their region in 2018. From our graph, we can see that 2 more countries had female life expectancies above their regional mean than below it: 44 countries are above and 42 countries are below. This tells us that the mean may not lie at the exact center of the data. The data appear somewhat left-skewed, as a few countries with low female life expectancy pull the mean below the median, leaving more countries above the mean than below it.

Visualization 5: 2 Variable Plot

We chose to create a grouped barplot with error bars to show how the neonatal mortality rate in 2018 changes depending on the relative female life expectancy of countries compared to the average in 2018.

question2 %>%
  ggplot(aes(x=relativefemlifeexp, y= neonatalmortality, fill = relativefemlifeexp)) +
  geom_bar(stat = 'summary', fun = 'mean') + #calculated the mean neonatal mortality rate by either "high" or "low" relative female life expectancy
  geom_errorbar(stat = 'summary', fun.data = 'mean_se')+ #made error bars of the mean
  labs(x='Relative Female Life Expectancy', y = 'Neonatal Mortality Rate (deaths per 1,000 live births)',
       title = 'Average Female Life Expectancy vs. Neonatal Mortality Rate in 2018')+ #made graph titles
  theme_minimal()+ #made it easier to visualize
    scale_y_continuous(breaks = seq(0,25,2))+ #increased the breaks to make it easier to visualize
  theme(legend.position = "none") #get rid of legend

This barplot with error bars shows us the mean neonatal mortality rate in 2018 for countries with high and low relative female life expectancy. We can see that there does appear to be a great difference in the neonatal mortality rate in 2018 by relative female life expectancy. Countries with high relative female life expectancy have noticeably lower neonatal mortality rates in 2018, while countries with low relative female life expectancy have noticeably higher neonatal mortality rates. This suggests an inverse relationship between female life expectancy and neonatal mortality rate, although we did not run a formal significance test.

Visualization 6: 3 Variable Plot

We chose to create a scatter plot to show how the neonatal mortality rate in 2018 and the average female life expectancy differ by region.

question2 %>%
  ggplot(aes(x = life_expectancy_female,
  y = neonatalmortality, color = region)) +
  geom_point() + #made a scatterplot
  labs(x = 'Average Female Life Expectancy in 2018 (years)', y= 'Neonatal Mortality Rate in 2018 (deaths per 1,000 live births)', title = 'Average Female Life Expectancy vs. Neonatal Mortality Rate in 2018 by Region')+ #made graph titles
  theme_minimal()+ #made it easier to visualize
    scale_y_continuous(breaks = seq(0,45,5))+ #increased the breaks on the y-axis and made the data best fit the graph
    scale_x_continuous(breaks = seq(55,90,5)) #increased the breaks on the x-axis and made the data best fit the graph

This scatter plot shows us the neonatal mortality rate in 2018 against the average female life expectancy in 2018, colored by region. We can see that there does appear to be a great difference in the neonatal mortality rate in 2018 and average female life expectancy by region, as shown by dots of the same color, corresponding to the same region, sitting next to each other. Overall, Europe has the highest average female life expectancy in years and the lowest neonatal mortality rate. After Europe, the Americas seem to have the second highest average female life expectancy and second lowest neonatal mortality rate, followed closely by Asia. On the other end of the spectrum, Africa appears to have the lowest average female life expectancy and the highest neonatal mortality rate. From our scatter plot, we can see that the data from Asia and Africa is more spread out than the data from Europe and the Americas. Overall, the graph shows us that there is a difference by region and that neonatal mortality rates and female life expectancy are inversely related.

6. Discussion

First, we will discuss our first research question: How has the infant mortality rate changed across different countries and regions over time? We can see through our visualization 1 that the infant mortality rate overall has decreased from 1970 to 2018. Additionally, from our visualization 2, we can see that the size of this decrease depends heavily on the region. Africa has had the greatest decrease in the infant mortality rate from 1970 to 2018, followed by the Americas, and lastly by Europe, which has had a small decrease. From our visualization 3, we can see that the change in infant mortality rate does depend on the country, as seen by the differences between countries, but most of the difference appears to be due to regional differences. To investigate this question better, we could have looked for datasets with more countries and more years so that we could see the yearly change in infant mortality rate by country and region. Additionally, we could have created a function to scale diff1 by the starting infant mortality rate in 1970. This could help us better quantify the differences between regions because it would take into account that some countries started off with lower infant mortality rates than others.
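One possible way to scale diff1 by the starting rate is a percent change relative to 1970. The sketch below is an assumption about how such a scaling could look, using a made-up data frame (`toy`) and a hypothetical column name `pct_change`; it was not part of our analysis.

```r
# Toy data (made up): one country starting high, one starting low
toy <- data.frame(
  country     = c("A", "B"),
  infant_1970 = c(200, 20),
  infant_2018 = c(50, 5)
)

# Absolute difference, as in diff1
toy$diff1 <- toy$infant_2018 - toy$infant_1970

# Percent change relative to the 1970 rate
toy$pct_change <- 100 * toy$diff1 / toy$infant_1970

toy
```

In the toy data the absolute differences are -150 and -15, yet both countries reduced their rate by 75%, which is the kind of comparison an unscaled difference hides.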

Next, we will discuss our second research question: How does the correlation between female life expectancy and neonatal mortality rates differ among different regions? From our visualization 5, we can see that countries with a high relative female life expectancy have noticeably lower neonatal mortality rates in 2018, while countries with low relative female life expectancy have noticeably higher neonatal mortality rates. Thus, there appears to be an inverse relationship between female life expectancy and neonatal mortality rate. From our visualization 6, we can see that the inverse relationship holds across regions, although each region sits at a different place on the spectrum. Overall, Europe has the highest average female life expectancy in years and the lowest neonatal mortality rate. After Europe, the Americas seem to have the second highest average female life expectancy and second lowest neonatal mortality rate, followed closely by Asia. On the other end of the spectrum, Africa appears to have the lowest average female life expectancy and the highest neonatal mortality rate. To investigate this question better, we could have looked at other years to see if this relationship holds outside of 2018 and investigated more regions to further support our findings.
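Since the research question is phrased in terms of correlation, one way to quantify it would be to compute a correlation coefficient within each region. The base R sketch below does this on a made-up data frame (`toy`) that reuses our column names; the values are illustrative, and we did not run this on our data.

```r
# Toy data (made up): life expectancy and neonatal mortality in two regions
toy <- data.frame(
  region                 = rep(c("Africa", "Europe"), each = 4),
  life_expectancy_female = c(55, 60, 65, 70, 80, 82, 84, 86),
  neonatalmortality      = c(40, 32, 30, 20, 3, 3, 2, 1)
)

# Split the rows by region and compute Pearson's correlation in each piece
cor_by_region <- sapply(
  split(toy, toy$region),
  function(d) cor(d$life_expectancy_female, d$neonatalmortality)
)

cor_by_region
```

A negative coefficient in every region would support the inverse relationship, and comparing the coefficients would show whether the relationship is stronger in some regions than in others.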

Overall, the challenging part of this project was getting our three datasets to join and also be tidy, because it was hard at first to see the order in which to do this. Upon experimenting, however, we figured it out, which is what a lot of coding is about! From the overall process, we have learned that troubleshooting code can take some time and can be very tedious, so we have to pay attention to each character we type and knit often.

Acknowledgements: We both worked on the whole project together side-by-side. Special thank you to Professor Guyot and our TAs Aubrie and Huy for y’all’s constant help throughout this course!