The objective of Project 2 is to get more practice tidying and transforming data. The U.S. Census International Data collects multiple types of data on 227 countries/areas.
The data is available at: https://www.census.gov/data-tools/demo/idb/#/country?COUNTRY_YEAR=2023&COUNTRY_YR_ANIM=2023
The explanation for the table collected is in: https://api.census.gov/data/timeseries/idb/5year/variables.html
This analysis will compare the mortality rates for males and females between countries. The mortality rates are deaths per 1,000 population. Additionally, it will compare mortality rates by general geographic area (Asia, Africa, North America, South America, Europe, and Australia) over the years.
First, load required packages.
library(tidyverse)
library(kableExtra)
Load the data.
path <- "https://raw.githubusercontent.com/klgriffen96/spring23_data607_proj2/main/usc_international_mortality_2010_23.csv"
df <- read.csv(path)
# https://www.kaggle.com/datasets/statchaitya/country-to-continent
path <- "https://raw.githubusercontent.com/klgriffen96/spring23_data607_proj2/main/countryContinent.csv"
cc <- read.csv(path)
Ultimately, the component names in the data frame should be the following:
Take a look at the dataframe, notice that what the column names currently are do not have all the information. What the column names should be is stored both in the current column name as well as the first row.
head(df)
## X X.1 X.2 X.3 X.4 Infant.Mortality.Rate X.5 X.6
## 1 Row FIPS GENC Country/Area Name Year Both Sexes Male Female
## 2 1 AF AF Afghanistan 2010 127.79 136.47 118.68
## 3 2 AF AF Afghanistan 2011 125.79 134.45 116.7
## 4 3 AF AF Afghanistan 2012 123.82 132.47 114.74
## 5 4 AF AF Afghanistan 2013 121.88 130.5 112.81
## 6 5 AF AF Afghanistan 2014 119.96 128.57 110.92
## Child.Mortality.Rate..Ages.1.4. X.7 X.8 Under.5.Mortality.Rate X.9
## 1 Both Sexes Male Female Both Sexes Male
## 2 71.37 70.34 72.44 190.04 197.21
## 3 69.9 68.93 70.89 186.89 194.11
## 4 68.45 67.54 69.38 183.79 191.06
## 5 67.04 66.19 67.91 180.75 188.05
## 6 65.64 64.85 66.46 177.73 185.08
## X.10 Life.Expectancy.at.Birth X.11 X.12
## 1 Female Both Sexes Male Female
## 2 182.52 48.93 47.72 50.2
## 3 179.32 49.32 48.08 50.62
## 4 176.16 49.71 48.45 51.04
## 5 173.06 50.1 48.81 51.46
## 6 170.01 50.49 49.17 51.88
First rename the columns based on the information from the current column name and the first row.
for (i in 6:length(df)){
if (str_starts(colnames(df)[i], "[A-WY-Z]")){
temp <- colnames(df)[i]
}
colnames(df)[i] <- paste(temp, df[[i]][1] ,sep = ',')
}
colnames(df)[4] <- "country"
colnames(df)[5] <- "year"
Cut out the first row which no longer has information needed as that information got appended to the column name. Also remove columns before country which are not needed.
df <- df[-1,]
df <- df[ -c(1:3)]
Perform the necessary pivots to get the desired dataframe, where rather than having a column for each age range and sex combination that information is stored in rows and the new columns are age range and sex.
df_l <- df |>
pivot_longer(
cols = !(country:year),
names_to = c("age_range", "sex"),
names_sep = ",",
values_to = "count"
)
Make the year, and mortality rates be doubles rather than characters.
for (i in 1:length(df_l)){
if (str_detect(df_l[[i]][1], "[0-9]+")){
df_l[[i]] <- as.double(df_l[[i]])
}
}
Only save the relevant information, country and continent from the
cc
dataframe. Then, join in the continent data from the
country to continent dataframe.
cc <- cc[c("country", "continent")]
df_c <- left_join(df_l, cc, by = c("country" = "country"))
Take a look at the new dataframe.
head(df_c)
## # A tibble: 6 × 6
## country year age_range sex count continent
## <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 Afghanistan 2010 Infant.Mortality.Rate Both Sexes 128. Asia
## 2 Afghanistan 2010 Infant.Mortality.Rate Male 136. Asia
## 3 Afghanistan 2010 Infant.Mortality.Rate Female 119. Asia
## 4 Afghanistan 2010 Child.Mortality.Rate..Ages.1.4. Both Sexes 71.4 Asia
## 5 Afghanistan 2010 Child.Mortality.Rate..Ages.1.4. Male 70.3 Asia
## 6 Afghanistan 2010 Child.Mortality.Rate..Ages.1.4. Female 72.4 Asia
I noticed not all of the countries were designated a continent. I checked out the nas and went through the csv from kaggle, updating the names that had come up as na to match the USC names. There were still some nas left over.
df_na <- df_c |> filter(is.na(continent))
dim(df_na)
## [1] 1176 6
# write.csv(df_na, "C:/Users/kgriffen/OneDrive - Globalfoundries/Documents/Data_science/na.csv")
# I went through the NAs and matched up as many country names as I could. Got NAs down to 1000ish from 6000
Widen the dataframe so that the male mortality rate can be extracted more easily.
df_w <- pivot_wider(df_c, names_from = sex, values_from = count)
df_w <- df_w |> mutate(male_mortality_rate =
(Male)/(Female+Male))
head(df_w)
## # A tibble: 6 × 8
## country year age_range conti…¹ Both …² Male Female male_…³
## <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 2010 Infant.Mortality.Rate Asia 128. 136. 119. 0.535
## 2 Afghanistan 2010 Child.Mortality.Rate..… Asia 71.4 70.3 72.4 0.493
## 3 Afghanistan 2010 Under.5.Mortality.Rate Asia 190. 197. 183. 0.519
## 4 Afghanistan 2010 Life.Expectancy.at.Bir… Asia 48.9 47.7 50.2 0.487
## 5 Afghanistan 2011 Infant.Mortality.Rate Asia 126. 134. 117. 0.535
## 6 Afghanistan 2011 Child.Mortality.Rate..… Asia 69.9 68.9 70.9 0.493
## # … with abbreviated variable names ¹​continent, ²​`Both Sexes`,
## # ³​male_mortality_rate
As previously mentioned this analysis will compare the mortality rates for males and females between countries over the years. Additionally, it will compare mortality rates by general geographic area (Asia, Africa, North America, South America, Europe, and Australia).
Compare mortality rate for males and females by country.
df_w|>
filter(is.na(continent) == FALSE &
age_range != "Life.Expectancy.at.Birth") |>
group_by(country, year, age_range) |>
ggplot(aes(x = year, y = male_mortality_rate, color=country)) +
geom_point() +
facet_grid(~age_range) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
legend.position="none")
What is noticeable from this graphic is that in most countries, across all the years - the male death rate is higher than that of females.
Compare mortality rates by continent, just looking at mortality rate under 5.
df_cont <- df_c|>
filter(str_starts(age_range, "Under") &
(sex == "Male" | sex == "Female") &
is.na(continent) == FALSE) |>
group_by(continent, sex) |>
summarize(
mean_under_5yrs = round(mean((count), na.rm = TRUE)),
n = n() , .groups = 'drop'
) |>
arrange(desc(mean_under_5yrs))
kable(df_cont) |>
kable_styling("striped")
continent | sex | mean_under_5yrs | n |
---|---|---|---|
Africa | Male | 78 | 756 |
Africa | Female | 66 | 756 |
Asia | Male | 29 | 672 |
Asia | Female | 25 | 672 |
Oceania | Male | 24 | 294 |
Americas | Male | 20 | 686 |
Oceania | Female | 19 | 294 |
Americas | Female | 15 | 686 |
Europe | Male | 6 | 672 |
Europe | Female | 5 | 672 |
df_c|>
filter(str_starts(age_range, "Under") &
(sex == "Male" | sex == "Female") &
is.na(continent) == FALSE) |>
group_by(continent, sex) |>
summarize(
mean_under_5yrs = round(mean((count), na.rm = TRUE)),
n = n() , .groups = 'drop'
) |>
arrange(desc(mean_under_5yrs)) |>
ggplot(aes(x = continent, y = mean_under_5yrs, fill=sex)) +
geom_bar(stat="identity", position = "dodge")
It looks like Africa has the highest death rate and Europe the least.
Now, look at all 3 age groups.
df_c|>
filter((sex == "Male" | sex == "Female") &
is.na(continent) == FALSE &
age_range != "Life.Expectancy.at.Birth") |>
group_by(continent, age_range, sex) |>
summarize(
mean_mortality_rate = round(mean((count), na.rm = TRUE)),
n = n() , .groups = 'drop'
) |>
arrange(desc(mean_mortality_rate)) |>
ggplot(aes(x = continent, y = mean_mortality_rate, fill=sex)) +
geom_bar(stat="identity", position = "dodge") +
facet_grid(~age_range) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Inspecting all three age ranges available, Africa still has the highest child mortality rates and Europe the least.
The objective was to study the USC International data, paying specific attention to the mortality rates between male and females from 2010-2023, looking at country level data and at continent level. This goal was achieved and interesting insights were gained, such as the mortality rate being greater in males than in females in general and Africa having the highest mortality rates and Europe the least.