Project 2 - USC Data

Introduction

The objective of Project 2 is to get more practice tidying and transforming data. The U.S. Census International Data collects multiple types of data on 227 countries/areas.

The data is available at: https://www.census.gov/data-tools/demo/idb/#/country?COUNTRY_YEAR=2023&COUNTRY_YR_ANIM=2023

The explanation for the table collected is in: https://api.census.gov/data/timeseries/idb/5year/variables.html

This analysis will compare the mortality rates for males and females between countries. The mortality rates are deaths per 1,000 population. Additionally, it will compare mortality rates by general geographic area (Asia, Africa, North America, South America, Europe, and Australia) over the years.

Processing

First, load required packages.

library(tidyverse)
library(kableExtra)

Load the data.

path <- "https://raw.githubusercontent.com/klgriffen96/spring23_data607_proj2/main/usc_international_mortality_2010_23.csv"
df <- read.csv(path)

# https://www.kaggle.com/datasets/statchaitya/country-to-continent
path <- "https://raw.githubusercontent.com/klgriffen96/spring23_data607_proj2/main/countryContinent.csv"
cc <- read.csv(path)

Ultimately, the component names in the data frame should be the following:

country
continent
year
age_group
sex

Take a look at the dataframe, notice that what the column names currently are do not have all the information. What the column names should be is stored both in the current column name as well as the first row.

head(df)

##     X  X.1  X.2               X.3  X.4 Infant.Mortality.Rate    X.5    X.6
## 1 Row FIPS GENC Country/Area Name Year            Both Sexes   Male Female
## 2   1   AF   AF       Afghanistan 2010                127.79 136.47 118.68
## 3   2   AF   AF       Afghanistan 2011                125.79 134.45  116.7
## 4   3   AF   AF       Afghanistan 2012                123.82 132.47 114.74
## 5   4   AF   AF       Afghanistan 2013                121.88  130.5 112.81
## 6   5   AF   AF       Afghanistan 2014                119.96 128.57 110.92
##   Child.Mortality.Rate..Ages.1.4.   X.7    X.8 Under.5.Mortality.Rate    X.9
## 1                      Both Sexes  Male Female             Both Sexes   Male
## 2                           71.37 70.34  72.44                 190.04 197.21
## 3                            69.9 68.93  70.89                 186.89 194.11
## 4                           68.45 67.54  69.38                 183.79 191.06
## 5                           67.04 66.19  67.91                 180.75 188.05
## 6                           65.64 64.85  66.46                 177.73 185.08
##     X.10 Life.Expectancy.at.Birth  X.11   X.12
## 1 Female               Both Sexes  Male Female
## 2 182.52                    48.93 47.72   50.2
## 3 179.32                    49.32 48.08  50.62
## 4 176.16                    49.71 48.45  51.04
## 5 173.06                     50.1 48.81  51.46
## 6 170.01                    50.49 49.17  51.88

First rename the columns based on the information from the current column name and the first row.

for (i in 6:length(df)){
  if (str_starts(colnames(df)[i], "[A-WY-Z]")){
    temp <- colnames(df)[i]
  }
  colnames(df)[i] <- paste(temp, df[[i]][1] ,sep = ',')
}
colnames(df)[4] <- "country"
colnames(df)[5] <- "year"

Cut out the first row which no longer has information needed as that information got appended to the column name. Also remove columns before country which are not needed.

df <- df[-1,]
df <- df[ -c(1:3)]

Perform the necessary pivots to get the desired dataframe, where rather than having a column for each age range and sex combination that information is stored in rows and the new columns are age range and sex.

df_l <- df |> 
  pivot_longer(
    cols = !(country:year),
    names_to = c("age_range", "sex"), 
    names_sep = ",",
    values_to = "count"
  )

Make the year, and mortality rates be doubles rather than characters.

for (i in 1:length(df_l)){
  if (str_detect(df_l[[i]][1], "[0-9]+")){
    df_l[[i]] <- as.double(df_l[[i]])
  }
}

Only save the relevant information, country and continent from the cc dataframe. Then, join in the continent data from the country to continent dataframe.

cc <- cc[c("country", "continent")]
df_c <- left_join(df_l, cc, by = c("country" = "country"))

Take a look at the new dataframe.

head(df_c)

## # A tibble: 6 × 6
##   country      year age_range                       sex        count continent
##   <chr>       <dbl> <chr>                           <chr>      <dbl> <chr>    
## 1 Afghanistan  2010 Infant.Mortality.Rate           Both Sexes 128.  Asia     
## 2 Afghanistan  2010 Infant.Mortality.Rate           Male       136.  Asia     
## 3 Afghanistan  2010 Infant.Mortality.Rate           Female     119.  Asia     
## 4 Afghanistan  2010 Child.Mortality.Rate..Ages.1.4. Both Sexes  71.4 Asia     
## 5 Afghanistan  2010 Child.Mortality.Rate..Ages.1.4. Male        70.3 Asia     
## 6 Afghanistan  2010 Child.Mortality.Rate..Ages.1.4. Female      72.4 Asia

I noticed not all of the countries were designated a continent. I checked out the nas and went through the csv from kaggle, updating the names that had come up as na to match the USC names. There were still some nas left over.

df_na <- df_c |> filter(is.na(continent))
dim(df_na)

## [1] 1176    6

# write.csv(df_na, "C:/Users/kgriffen/OneDrive - Globalfoundries/Documents/Data_science/na.csv")
# I went through the NAs and matched up as many country names as I could. Got NAs down to 1000ish from 6000

Widen the dataframe so that the male mortality rate can be extracted more easily.

df_w <- pivot_wider(df_c, names_from = sex, values_from = count)
df_w <- df_w |> mutate(male_mortality_rate = 
                         (Male)/(Female+Male))
head(df_w)

## # A tibble: 6 × 8
##   country      year age_range               conti…¹ Both …²  Male Female male_…³
##   <chr>       <dbl> <chr>                   <chr>     <dbl> <dbl>  <dbl>   <dbl>
## 1 Afghanistan  2010 Infant.Mortality.Rate   Asia      128.  136.   119.    0.535
## 2 Afghanistan  2010 Child.Mortality.Rate..… Asia       71.4  70.3   72.4   0.493
## 3 Afghanistan  2010 Under.5.Mortality.Rate  Asia      190.  197.   183.    0.519
## 4 Afghanistan  2010 Life.Expectancy.at.Bir… Asia       48.9  47.7   50.2   0.487
## 5 Afghanistan  2011 Infant.Mortality.Rate   Asia      126.  134.   117.    0.535
## 6 Afghanistan  2011 Child.Mortality.Rate..… Asia       69.9  68.9   70.9   0.493
## # … with abbreviated variable names ¹continent, ²`Both Sexes`,
## #   ³male_mortality_rate

Analysis

As previously mentioned this analysis will compare the mortality rates for males and females between countries over the years. Additionally, it will compare mortality rates by general geographic area (Asia, Africa, North America, South America, Europe, and Australia).

Compare mortality rate for males and females by country.

df_w|>
  filter(is.na(continent) == FALSE &
         age_range != "Life.Expectancy.at.Birth") |>
  group_by(country, year, age_range) |> 
  ggplot(aes(x = year, y = male_mortality_rate, color=country)) +
  geom_point() +
  facet_grid(~age_range) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1), 
                                   legend.position="none")

What is noticeable from this graphic is that in most countries, across all the years - the male death rate is higher than that of females.

Compare mortality rates by continent, just looking at mortality rate under 5.

df_cont <- df_c|>
  filter(str_starts(age_range, "Under") & 
           (sex == "Male" | sex == "Female") &
           is.na(continent) == FALSE) |>
  group_by(continent, sex) |> 
  summarize(
  mean_under_5yrs = round(mean((count), na.rm = TRUE)), 
  n = n() , .groups = 'drop'
  ) |>
  arrange(desc(mean_under_5yrs))

kable(df_cont) |>
  kable_styling("striped")

continent	sex	mean_under_5yrs	n
Africa	Male	78	756
Africa	Female	66	756
Asia	Male	29	672
Asia	Female	25	672
Oceania	Male	24	294
Americas	Male	20	686
Oceania	Female	19	294
Americas	Female	15	686
Europe	Male	6	672
Europe	Female	5	672

df_c|>
  filter(str_starts(age_range, "Under") & 
           (sex == "Male" | sex == "Female") &
           is.na(continent) == FALSE) |>
  group_by(continent, sex) |> 
  summarize(
  mean_under_5yrs = round(mean((count), na.rm = TRUE)), 
  n = n() , .groups = 'drop'
  ) |>
  arrange(desc(mean_under_5yrs)) |>
  ggplot(aes(x = continent, y = mean_under_5yrs, fill=sex)) +
  geom_bar(stat="identity", position = "dodge")

It looks like Africa has the highest death rate and Europe the least.

Now, look at all 3 age groups.

df_c|>
  filter((sex == "Male" | sex == "Female") &
           is.na(continent) == FALSE &
         age_range != "Life.Expectancy.at.Birth") |>
  group_by(continent, age_range, sex) |> 
  summarize(
  mean_mortality_rate = round(mean((count), na.rm = TRUE)), 
  n = n() , .groups = 'drop'
  ) |>
  arrange(desc(mean_mortality_rate)) |>
  ggplot(aes(x = continent, y = mean_mortality_rate, fill=sex)) +
  geom_bar(stat="identity", position = "dodge")  +
  facet_grid(~age_range) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Inspecting all three age ranges available, Africa still has the highest child mortality rates and Europe the least.

Conclusion

The objective was to study the USC International data, paying specific attention to the mortality rates between male and females from 2010-2023, looking at country level data and at continent level. This goal was achieved and interesting insights were gained, such as the mortality rate being greater in males than in females in general and Africa having the highest mortality rates and Europe the least.

Project 2 - USC Data

Kayleah Griffen

2/26/2023

Introduction

Processing

Analysis

Conclusion