My project is about UNICEF child health and focuses on the neonatal mortality rate. The source of my project is the UNICEF child health dataset. This dataset includes the child mortality rate, which represents the number of deaths, along with country, deaths per unit, age group, year, progress, income, and region name. I am planning to explore how child mortality rates vary across age groups or income levels, and how these values may change over time.
Load the library
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Read data
setwd("~/Downloads/First data 110 assignment_files")
child_health <- read_csv("unicef-child-health.csv")
Rows: 5516 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (20): iso3, country, indicator, age, domain, disaggregator, total, units...
dbl (3): indicator_id, value, year
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View data
head(child_health)
# A tibble: 6 × 23
iso3 country indicator indicator_id age domain disaggregator value total
<chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 AFG Afghanist… Neonatal… 1 0-27… Survi… Total 35.2 Not …
2 ALB Albania Neonatal… 1 0-27… Survi… Total 7.78 Not …
3 DZA Algeria Neonatal… 1 0-27… Survi… Total 16.3 Not …
4 AND Andorra Neonatal… 1 0-27… Survi… Total 1.30 Not …
5 AGO Angola Neonatal… 1 0-27… Survi… Total 27.3 Not …
6 ATG Antigua a… Neonatal… 1 0-27… Survi… Total 3.48 Not …
# ℹ 14 more variables: units <chr>, year <dbl>, source <chr>, definition <chr>,
# indicator_cat <chr>, target <chr>, progress <chr>, region_sdg_name <chr>,
# region_who_code <chr>, region_who_name <chr>, region_unicef_code <chr>,
# region_unicef_name <chr>, incomecat <chr>, income <chr>
I used filter to remove missing values for value, year, age, income, and region, so the data is clean.
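The filtering code itself is not shown here, so the following is only a minimal sketch of what this step could look like. It assumes the cleaned data is stored as data_cleaning (the name used in the regression that follows) and that "region" means region_unicef_name. The small child_health_toy data frame is made up for illustration and stands in for the real data.

```r
library(dplyr)

# Hypothetical toy data standing in for child_health (values are made up)
child_health_toy <- data.frame(
  value = c(35.2, NA, 16.3, 1.3),
  year = c(2020, 2020, NA, 2019),
  age = c("1-4 years", "1-4 years", "5-9 years", "5-9 years"),
  income = c("Low income", "High income", "Low income", NA),
  region_unicef_name = c("South Asia", "South Asia",
                         "North America", "North America")
)

# Keep only rows where every variable used in the model is present
data_cleaning <- child_health_toy %>%
  filter(
    !is.na(value),
    !is.na(year),
    !is.na(age),
    !is.na(income),
    !is.na(region_unicef_name)
  )

nrow(data_cleaning)
```

Comparing nrow() before and after the filter is a quick way to see how many rows were dropped.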
Regression
fit1 <- lm(value ~ year + age + income + region_unicef_name, data = data_cleaning)
summary(fit1)
Call:
lm(formula = value ~ year + age + income + region_unicef_name,
data = data_cleaning)
Residuals:
Min 1Q Median 3Q Max
-71.142 -28.330 -9.581 28.194 100.076
Coefficients:
Estimate Std. Error t value
(Intercept) 14184.3410 783.5853 18.102
year -7.0161 0.3880 -18.082
age1-11 months -1.4493 2.0887 -0.694
age1-4 years 21.0222 2.0667 10.172
age10-14 years 19.7084 1.9474 10.121
age15-19 years 0.2123 1.8763 0.113
age5-9 years 9.0687 2.1470 4.224
ageContext 16.9427 1.6571 10.225
agePolicy -23.5341 3.1775 -7.406
incomeLow income 3.4251 2.0030 1.710
incomeLower middle income 5.0028 1.5042 3.326
incomeUpper middle income 6.8999 1.3443 5.133
region_unicef_nameEurope and Central Asia 3.6459 1.6002 2.278
region_unicef_nameLatin America and Caribbean 0.5786 1.7098 0.338
region_unicef_nameMiddle East and North Africa 4.8024 1.9222 2.498
region_unicef_nameNorth America 5.8108 4.8078 1.209
region_unicef_nameSouth Asia 3.5117 2.3492 1.495
region_unicef_nameSub-Saharan Africa -0.9203 1.6336 -0.563
Pr(>|t|)
(Intercept) < 2e-16 ***
year < 2e-16 ***
age1-11 months 0.487797
age1-4 years < 2e-16 ***
age10-14 years < 2e-16 ***
age15-19 years 0.909935
age5-9 years 2.44e-05 ***
ageContext < 2e-16 ***
agePolicy 1.50e-13 ***
incomeLow income 0.087315 .
incomeLower middle income 0.000887 ***
incomeUpper middle income 2.96e-07 ***
region_unicef_nameEurope and Central Asia 0.022738 *
region_unicef_nameLatin America and Caribbean 0.735062
region_unicef_nameMiddle East and North Africa 0.012505 *
region_unicef_nameNorth America 0.226860
region_unicef_nameSouth Asia 0.135019
region_unicef_nameSub-Saharan Africa 0.573185
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 33.37 on 5444 degrees of freedom
Multiple R-squared: 0.1502, Adjusted R-squared: 0.1475
F-statistic: 56.59 on 17 and 5444 DF, p-value: < 2.2e-16
I used lm() to build a linear regression model that predicts child mortality rate using year, age, income, and region_unicef_name. Then, I used summary() to show the results.
fit2 <- lm(value ~ year + age + income, data = data_cleaning)
summary(fit2)
Call:
lm(formula = value ~ year + age + income, data = data_cleaning)
Residuals:
Min 1Q Median 3Q Max
-68.21 -28.63 -9.88 28.19 98.19
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.420e+04 7.817e+02 18.169 < 2e-16 ***
year -7.023e+00 3.871e-01 -18.144 < 2e-16 ***
age1-11 months -1.503e+00 2.090e+00 -0.719 0.4722
age1-4 years 2.116e+01 2.067e+00 10.235 < 2e-16 ***
age10-14 years 1.975e+01 1.949e+00 10.134 < 2e-16 ***
age15-19 years 1.533e-01 1.878e+00 0.082 0.9349
age5-9 years 8.999e+00 2.149e+00 4.188 2.86e-05 ***
ageContext 1.697e+01 1.658e+00 10.235 < 2e-16 ***
agePolicy -2.347e+01 3.180e+00 -7.380 1.82e-13 ***
incomeLow income -7.316e-02 1.472e+00 -0.050 0.9604
incomeLower middle income 2.910e+00 1.225e+00 2.376 0.0176 *
incomeUpper middle income 5.594e+00 1.257e+00 4.451 8.73e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 33.4 on 5450 degrees of freedom
Multiple R-squared: 0.1476, Adjusted R-squared: 0.1458
F-statistic: 85.77 on 11 and 5450 DF, p-value: < 2.2e-16
I created a second model by removing region because it has many categories and most of them are not significant. This makes the model simpler and easier to focus on year, age, and income.
The overall p-value did not change (it is below 2.2e-16 for both models), which means both models are significant overall. However, the R-squared decreased slightly, from 0.1502 to 0.1476, because removing a variable can never increase the amount of variation explained by the model. The adjusted R-squared also dropped only slightly, from 0.1475 to 0.1458. I think this model is still a good choice because it is easier to understand and interpret.
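One way to check whether dropping a variable loses real explanatory power is a nested-model F-test with anova(). The sketch below runs on simulated data, not on the UNICEF data: the data frame toy and its columns are made up, and fit_small and fit_full are hypothetical stand-ins for the smaller and larger models.

```r
set.seed(1)

# Simulated stand-in data (not the UNICEF data)
n <- 200
toy <- data.frame(
  year = sample(2000:2020, n, replace = TRUE),
  income = factor(sample(c("High", "Low", "Middle"), n, replace = TRUE)),
  region = factor(sample(c("R1", "R2", "R3", "R4"), n, replace = TRUE))
)
toy$value <- 500 - 0.2 * toy$year + 5 * (toy$income == "Low") +
  rnorm(n, sd = 10)

fit_full <- lm(value ~ year + income + region, data = toy)
fit_small <- lm(value ~ year + income, data = toy)

# F-test of the smaller model against the larger one;
# a large p-value suggests the extra variable adds little
comparison <- anova(fit_small, fit_full)
print(comparison)

# R-squared of the smaller model can never exceed that of the larger one
summary(fit_small)$r.squared <= summary(fit_full)$r.squared
```

With the real models, the equivalent call would be anova(fit2, fit1), since both were fit on the same data_cleaning rows.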
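For the reference age group, the model equation is value = 14200 - 7.023(year) - 0.073(income low) + 2.910(income lower middle) + 5.594(income upper middle). Each income term equals 1 for countries in that category and 0 otherwise, so the reference income category has no term of its own. The other age groups add their own coefficients from the table above.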
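To illustrate how an equation like this turns into a prediction, here is a small sketch on made-up data showing that plugging values into the coefficients by hand matches predict(). The data frame toy, the model fit_toy, and the chosen year are all hypothetical. With the real model, calling predict(fit2, newdata = ...) would avoid the rounding error that comes from using the printed coefficients.

```r
set.seed(2)

# Made-up data with a numeric year and a categorical income variable
toy <- data.frame(
  year = rep(2000:2019, times = 3),
  income = factor(rep(
    c("High income", "Lower middle income", "Upper middle income"),
    each = 20
  ))
)
toy$value <- 1000 - 0.48 * toy$year +
  4 * (toy$income != "High income") + rnorm(60, sd = 2)

fit_toy <- lm(value ~ year + income, data = toy)
b <- coef(fit_toy)

# By hand: intercept + slope * year + dummy coefficient for the category
by_hand <- b[["(Intercept)"]] + b[["year"]] * 2015 +
  b[["incomeUpper middle income"]]

# Same prediction from predict()
new_row <- data.frame(year = 2015, income = "Upper middle income")
from_predict <- unname(predict(fit_toy, newdata = new_row))

all.equal(by_hand, from_predict)
```

Because the intercept is large and the year coefficient is multiplied by a value near 2000, even small rounding in the printed coefficients can shift a hand-calculated prediction by several units.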
Visualization
Year is a significant variable, while age and income are partially significant since some categories are significant and others are not. However, I chose to focus on year and income because they make the visualization clearer and easier to interpret. Age has many categories, so I think a graph using it would be a little messy.
I used group_by() to group the data by year and income. Then, I used summarize() to calculate the average child mortality rate for each group.
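The summarizing code is not shown here, so this is a minimal sketch of the step. It assumes the result is stored as data_sumary with an avg_value column (the names used in the plotting code that follows). The toy_clean data frame is made up and stands in for data_cleaning.

```r
library(dplyr)

# Hypothetical toy data standing in for data_cleaning (values are made up)
toy_clean <- data.frame(
  year = c(2019, 2019, 2019, 2020, 2020, 2020),
  income = c("Low income", "Low income", "High income",
             "Low income", "High income", "High income"),
  value = c(40, 30, 4, 28, 3, 5)
)

# One row per year and income level, holding the mean of value
data_sumary <- toy_clean %>%
  group_by(year, income) %>%
  summarize(avg_value = mean(value), .groups = "drop")

data_sumary
```

Setting .groups = "drop" returns an ungrouped result, which keeps later steps such as plotting from inheriting the grouping.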
Bar Graph of child mortality rate by year and income
ggplot(data_sumary, aes(x = year, y = avg_value, fill = income)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.9) +
  scale_fill_brewer(palette = "Set1") +
  labs(
    title = "Child Mortality rate by Year and Income",
    caption = "UNICEF Child Health Dataset",
    x = "Year",
    y = "Deaths per 1,000 live births",
    fill = "Income level"
  ) +
  theme_minimal(base_size = 12)
Essay
Before starting, I cleaned the environment. After loading the data, I used filter() to remove missing values from the variables that were important for my analysis. Then, I used linear regression to identify the most significant variables. After that, I used group_by() and summarize() to organize the data and calculate the average values for my visualization.
The visualization represents child mortality rate by year and income level. It shows that low-income countries have higher mortality rates, while high-income countries have lower rates. It also shows that mortality decreases over time.
My first choice was to use the age variable, but it has many categories and only some of them are significant. This made the graph harder to read and more confusing. I wasn’t sure if the graph should include only the significant categories or all of them. I also noticed that the age variable includes values like "Context" and "Policy", but I don’t think they are easy to understand or necessary. Because of this, I decided to use income instead, since it is simpler and makes the graph clearer. I wish I could have created a good bar graph with age to show the different age groups affected, but it was difficult to present clearly.