Project 1

Author

Myriam O.

Project 1 Assignment

Introduction

My project uses the UNICEF child health dataset and focuses on child mortality, including the neonatal mortality rate. The dataset records a mortality value for each observation, along with the country, units, age group, year, progress, income level, and region name. I plan to explore how child mortality rates vary across age groups and income levels, and how these values change over time.

Load the library

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Read data

setwd("~/Downloads/First data 110 assignment_files")
child_health <- read_csv("unicef-child-health.csv")
Rows: 5516 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (20): iso3, country, indicator, age, domain, disaggregator, total, units...
dbl  (3): indicator_id, value, year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

View data

head(child_health)
# A tibble: 6 × 23
  iso3  country    indicator indicator_id age   domain disaggregator value total
  <chr> <chr>      <chr>            <dbl> <chr> <chr>  <chr>         <dbl> <chr>
1 AFG   Afghanist… Neonatal…            1 0-27… Survi… Total         35.2  Not …
2 ALB   Albania    Neonatal…            1 0-27… Survi… Total          7.78 Not …
3 DZA   Algeria    Neonatal…            1 0-27… Survi… Total         16.3  Not …
4 AND   Andorra    Neonatal…            1 0-27… Survi… Total          1.30 Not …
5 AGO   Angola     Neonatal…            1 0-27… Survi… Total         27.3  Not …
6 ATG   Antigua a… Neonatal…            1 0-27… Survi… Total          3.48 Not …
# ℹ 14 more variables: units <chr>, year <dbl>, source <chr>, definition <chr>,
#   indicator_cat <chr>, target <chr>, progress <chr>, region_sdg_name <chr>,
#   region_who_code <chr>, region_who_name <chr>, region_unicef_code <chr>,
#   region_unicef_name <chr>, incomecat <chr>, income <chr>

Cleaning

data_cleaning <- child_health |>
  filter(!is.na(value), !is.na(year), !is.na(age),
         !is.na(income), !is.na(region_unicef_name))

I used filter() to remove rows with missing values in value, year, age, income, and region_unicef_name, so the data used in the analysis has no missing values in these variables.
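As a quick check on this step, the missing values in each column can be counted before anything is removed. The sketch below uses a small made-up tibble (`toy`, which is not the UNICEF data) and also shows `tidyr::drop_na()`, which should keep the same rows as the `filter(!is.na(...))` call for the listed columns.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for child_health (values are made up)
toy <- tibble(
  value  = c(35.2, NA, 16.3, 1.3),
  year   = c(2021, 2021, NA, 2021),
  income = c("Low income", "High income", "Upper middle income", NA)
)

# Count missing values per column before removing anything
na_counts <- toy |>
  summarize(across(everything(), ~ sum(is.na(.x))))

# Same idea as the filter() call above: keep rows complete in these columns
cleaned_filter <- toy |>
  filter(!is.na(value), !is.na(year), !is.na(income))

# Shortcut with tidyr that targets the same columns
cleaned_drop <- toy |>
  drop_na(value, year, income)

print(na_counts)
print(nrow(cleaned_filter))
```

Comparing `nrow()` before and after cleaning shows how many rows were dropped, which is worth reporting when many rows are lost.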

Regression

fit1 <- lm(value ~ year + age + income + region_unicef_name, data = data_cleaning)
summary(fit1)

Call:
lm(formula = value ~ year + age + income + region_unicef_name, 
    data = data_cleaning)

Residuals:
    Min      1Q  Median      3Q     Max 
-71.142 -28.330  -9.581  28.194 100.076 

Coefficients:
                                                 Estimate Std. Error t value
(Intercept)                                    14184.3410   783.5853  18.102
year                                              -7.0161     0.3880 -18.082
age1-11 months                                    -1.4493     2.0887  -0.694
age1-4 years                                      21.0222     2.0667  10.172
age10-14 years                                    19.7084     1.9474  10.121
age15-19 years                                     0.2123     1.8763   0.113
age5-9 years                                       9.0687     2.1470   4.224
ageContext                                        16.9427     1.6571  10.225
agePolicy                                        -23.5341     3.1775  -7.406
incomeLow income                                   3.4251     2.0030   1.710
incomeLower middle income                          5.0028     1.5042   3.326
incomeUpper middle income                          6.8999     1.3443   5.133
region_unicef_nameEurope and Central Asia          3.6459     1.6002   2.278
region_unicef_nameLatin America and Caribbean      0.5786     1.7098   0.338
region_unicef_nameMiddle East and North Africa     4.8024     1.9222   2.498
region_unicef_nameNorth America                    5.8108     4.8078   1.209
region_unicef_nameSouth Asia                       3.5117     2.3492   1.495
region_unicef_nameSub-Saharan Africa              -0.9203     1.6336  -0.563
                                               Pr(>|t|)    
(Intercept)                                     < 2e-16 ***
year                                            < 2e-16 ***
age1-11 months                                 0.487797    
age1-4 years                                    < 2e-16 ***
age10-14 years                                  < 2e-16 ***
age15-19 years                                 0.909935    
age5-9 years                                   2.44e-05 ***
ageContext                                      < 2e-16 ***
agePolicy                                      1.50e-13 ***
incomeLow income                               0.087315 .  
incomeLower middle income                      0.000887 ***
incomeUpper middle income                      2.96e-07 ***
region_unicef_nameEurope and Central Asia      0.022738 *  
region_unicef_nameLatin America and Caribbean  0.735062    
region_unicef_nameMiddle East and North Africa 0.012505 *  
region_unicef_nameNorth America                0.226860    
region_unicef_nameSouth Asia                   0.135019    
region_unicef_nameSub-Saharan Africa           0.573185    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 33.37 on 5444 degrees of freedom
Multiple R-squared:  0.1502,    Adjusted R-squared:  0.1475 
F-statistic: 56.59 on 17 and 5444 DF,  p-value: < 2.2e-16

I used lm() to fit a linear regression model that predicts the child mortality rate from year, age, income, and region_unicef_name. Then, I used summary() to show the results.
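To read the coefficients for age, income, and region, it helps to remember that lm() turns each categorical predictor into indicator variables measured against a baseline category. The summary has no row for high income, so it is presumably the baseline for income. The sketch below shows the mechanism on made-up data (`toy` is hypothetical, not the UNICEF data).

```r
# Hypothetical toy data (made up), only to show how lm() codes a categorical predictor
toy <- data.frame(
  value  = c(10, 12, 20, 22, 30, 32),
  income = c("High income", "High income",
             "Low income", "Low income",
             "Upper middle income", "Upper middle income")
)

fit_toy <- lm(value ~ income, data = toy)

# The first level alphabetically ("High income") becomes the baseline,
# so it has no coefficient of its own; its mean is the intercept
print(coef(fit_toy))

# Choosing a different baseline changes the coefficients but not the fitted values
toy$income2 <- relevel(factor(toy$income), ref = "Low income")
fit_toy2 <- lm(value ~ income2, data = toy)
print(coef(fit_toy2))
```

Each income coefficient is therefore a difference from the baseline group, holding the other predictors fixed, rather than the average value for that group.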

fit2 <- lm(value ~ year + age + income , data = data_cleaning)
summary(fit2)

Call:
lm(formula = value ~ year + age + income, data = data_cleaning)

Residuals:
   Min     1Q Median     3Q    Max 
-68.21 -28.63  -9.88  28.19  98.19 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                1.420e+04  7.817e+02  18.169  < 2e-16 ***
year                      -7.023e+00  3.871e-01 -18.144  < 2e-16 ***
age1-11 months            -1.503e+00  2.090e+00  -0.719   0.4722    
age1-4 years               2.116e+01  2.067e+00  10.235  < 2e-16 ***
age10-14 years             1.975e+01  1.949e+00  10.134  < 2e-16 ***
age15-19 years             1.533e-01  1.878e+00   0.082   0.9349    
age5-9 years               8.999e+00  2.149e+00   4.188 2.86e-05 ***
ageContext                 1.697e+01  1.658e+00  10.235  < 2e-16 ***
agePolicy                 -2.347e+01  3.180e+00  -7.380 1.82e-13 ***
incomeLow income          -7.316e-02  1.472e+00  -0.050   0.9604    
incomeLower middle income  2.910e+00  1.225e+00   2.376   0.0176 *  
incomeUpper middle income  5.594e+00  1.257e+00   4.451 8.73e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 33.4 on 5450 degrees of freedom
Multiple R-squared:  0.1476,    Adjusted R-squared:  0.1458 
F-statistic: 85.77 on 11 and 5450 DF,  p-value: < 2.2e-16

I created a second model by removing region because it has many categories and most of them are not significant. This makes the model simpler and easier to focus on year, age, and income.

The overall p-value of the F-test stayed below 2.2e-16 in both models, so both models are statistically significant overall. The R-squared decreased slightly, from 0.1502 to 0.1476 (adjusted R-squared from 0.1475 to 0.1458), because removing a predictor can never increase the amount of variation explained by the model. Both models explain only about 15% of the variation in value. I think the second model is still a good choice because it is easier to understand and interpret.
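One way to compare two nested models more formally is a partial F-test with anova(), alongside the R-squared values. The sketch below is only an illustration on simulated data. The variable names, group labels, and effect sizes are assumptions, and the result says nothing about the UNICEF models themselves.

```r
set.seed(1)

# Hypothetical simulated data: region has no real effect on value here
n <- 200
toy <- data.frame(
  year   = sample(2000:2020, n, replace = TRUE),
  income = sample(c("High", "Low"), n, replace = TRUE),
  region = sample(c("A", "B", "C"), n, replace = TRUE)
)
toy$value <- 500 - 0.2 * toy$year +
  ifelse(toy$income == "Low", 5, 0) + rnorm(n, sd = 3)

full    <- lm(value ~ year + income + region, data = toy)
reduced <- lm(value ~ year + income, data = toy)

# R-squared cannot go up when a predictor is dropped from a nested model
r2_full    <- summary(full)$r.squared
r2_reduced <- summary(reduced)$r.squared

# Adjusted R-squared penalizes extra terms, so it is the fairer comparison
adj_full    <- summary(full)$adj.r.squared
adj_reduced <- summary(reduced)$adj.r.squared

# Partial F-test: does adding region improve the fit beyond chance?
cmp <- anova(reduced, full)
print(cmp)
```

With the models from this report, the same call would be anova(fit2, fit1). A small p-value in that table would suggest that region still adds explanatory power as a block, even if most individual region categories are not significant.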

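The model equation, showing only the year term and the significant income terms (the age terms and the non-significant low-income term are left out), is value = 14200 - 7.023(year) + 2.910(income lower middle) + 5.594(income upper middle)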
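As a worked example of this equation, the sketch below plugs in one case: year 2020 in a lower-middle-income country, for the baseline age group. The year and the helper function name are chosen only for illustration. The intercept is printed to just four significant digits (1.420e+04), so the result is approximate; predict(fit2, newdata = ...) on the fitted model would be more accurate.

```r
# Coefficients copied from the fit2 summary above (as rounded in the printed output)
intercept      <- 14200    # printed as 1.420e+04
b_year         <- -7.023
b_lower_middle <- 2.910
b_upper_middle <- 5.594

# Hypothetical helper: prediction for the baseline age group only,
# with the income indicators set to 0 or 1
predict_baseline_age <- function(year, lower_middle = 0, upper_middle = 0) {
  intercept + b_year * year +
    b_lower_middle * lower_middle +
    b_upper_middle * upper_middle
}

# 14200 - 7.023 * 2020 + 2.910 = 16.45
pred_2020_lower_middle <- predict_baseline_age(2020, lower_middle = 1)
print(round(pred_2020_lower_middle, 2))
```

The year coefficient means that, with age and income held fixed, the predicted value drops by about 7 units for each additional year.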

Visualization

Year is a significant variable, while age and income are partially significant, since some categories are significant and others are not. I chose to focus on year and income because they make the visualization clearer and easier to interpret. Age has many categories, so I think a graph with age would be a little messy.

More cleaning

data_sumary <- data_cleaning |>
  group_by(year, income) |>
  summarize(avg_value = mean(value), .groups = "drop")

I used group_by() to group the data by year and income. Then, I used summarize() to calculate the average child mortality rate for each group.
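The grouping step can be checked by hand on a small example. The sketch below uses made-up values (`toy` is hypothetical) and adds a count column `n`, which is not in the original code but shows how many observations go into each average.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- tibble(
  year   = c(2020, 2020, 2020, 2021, 2021, 2021),
  income = c("Low income", "Low income", "High income",
             "Low income", "High income", "High income"),
  value  = c(30, 40, 4, 28, 3, 5)
)

# One row per year-income combination, with the mean and the group size
toy_summary <- toy |>
  group_by(year, income) |>
  summarize(avg_value = mean(value), n = n(), .groups = "drop")

print(toy_summary)
```

Here the 2020 low-income group averages 30 and 40, so its avg_value is 35. Groups with very few observations give less reliable averages.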

Bar Graph of child mortality rate by year and income

ggplot(data_sumary, aes(x = year, y = avg_value, fill = income)) +
  geom_col(position = "dodge", width = 0.9) +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Child Mortality Rate by Year and Income",
       caption = "UNICEF Child Health Dataset",
       x = "Year",
       y = "Deaths per 1,000 live births",
       fill = "Income level") +
  theme_minimal(base_size = 12)

Essay

Before starting, I cleaned the environment. After loading the data, I used filter() to remove missing values from the variables that were important for my analysis. Then, I used linear regression to identify the most significant variables. After that, I used group_by() and summarize() to organize the data and calculate the average values for my visualization.

The visualization represents child mortality rate by year and income level. It shows that low-income countries have higher mortality rates, while high-income countries have lower rates. It also shows that mortality decreases over time.

My first choice was to use the age variable, but it has many categories and only some of them are significant. This made the graph harder to read and more confusing. I wasn’t sure if the graph should include only the significant values or all of them. I also noticed categories like Context and Policy in the age variable, but I don’t think they are easy to understand or necessary. Because of this, I decided to use income instead, since it is simpler and makes the graph clearer. I wish I could have created a good bar graph with age to show the different age groups affected, but it was difficult to present clearly.
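One possible way to bring age into a bar graph without crowding it is to draw one small panel per age group with facet_wrap(). The sketch below is only an illustration on made-up data (`toy` and its values are hypothetical), not a result from the UNICEF dataset.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: 3 years x 2 age groups x 2 income levels (values are made up)
toy <- tibble(
  year   = rep(c(2019, 2020, 2021), times = 4),
  age    = rep(c("1-4 years", "5-9 years"), each = 6),
  income = rep(rep(c("Low income", "High income"), each = 3), times = 2),
  value  = c(30, 28, 26, 5, 4, 4, 12, 11, 10, 2, 2, 1)
)

# Average value for each year, age group, and income level
age_summary <- toy |>
  group_by(year, age, income) |>
  summarize(avg_value = mean(value), .groups = "drop")

# One panel per age group keeps each set of bars readable
p <- ggplot(age_summary, aes(x = factor(year), y = avg_value, fill = income)) +
  geom_col(position = "dodge") +
  facet_wrap(~ age) +
  labs(x = "Year", y = "Average value", fill = "Income level") +
  theme_minimal(base_size = 12)

# print(p) draws the plot
```

With the real data, categories such as Context and Policy could be filtered out first so that only actual age groups get a panel.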