The “cap2030-unicef-child-health.csv” data set contains child health indicators collected from many countries and multiple sources as part of UNICEF’s efforts to achieve the Sustainable Development Goals (SDGs) by 2030. It includes the variables iso3 (country code), country (name of the country), indicator (health indicator), indicator_id (ID of the indicator), age (age group), domain (health domain), disaggregator (sex, or Total when the indicator is not disaggregated by sex), value (indicator value), total (total value), units (measurement units), year (data year), source (where the data was extracted from), and definition (indicator definition). It also provides information on progress towards the SDG targets (indicator_cat, target, progress), regional classifications (region_sdg_name, region_who_code, region_who_name, region_unicef_code, region_unicef_name), and income categories (incomecat, income). This data set is a valuable resource for monitoring and evaluating how countries are progressing in improving child health outcomes. I did not find a README that describes how this particular data set was collected.
Nevertheless, UNICEF states on its website that it collects data through a variety of methods to inform decisions and protect children’s rights globally. One key approach is the Multiple Indicator Cluster Surveys (MICS), which are face-to-face household surveys providing statistically sound, internationally comparable data on the situation of children and women. These surveys help monitor progress toward the Sustainable Development Goals (SDGs) and inform evidence-based strategies. Another approach involves leveraging Big Data and Frontier Data Nodes, including anonymized metadata from sources like mobile phone usage, to gain real-time insights into human activity. The establishment of regional and national Frontier Data Nodes ensures equitable access to data science expertise. Additionally, UNICEF is developing Open-Source Platforms that combine, analyze, and display real-time information from academic, private sector, and open-source data, enhancing data-driven decision-making. Furthermore, UNICEF leads in Global Standard Setting, measuring and monitoring data, including 19 SDG indicators related to children. The team supports countries in generating, analyzing, and using data for these indicators. As the world’s leading source of data on children, UNICEF’s efforts benefit over 3 million people globally.
By 2030, UNICEF aims to ensure that every child survives and thrives through its “Strategy for Health 2016-2030.” This strategy focuses on ending preventable maternal, newborn, and child deaths and promoting the overall health and development of children. To achieve these goals, UNICEF emphasizes integrated, multisectoral approaches to address both communicable and non-communicable diseases, while adapting to global changes such as urbanization and environmental challenges. The “Ending Preventable Newborn Deaths and Stillbirths by 2030” initiative targets reducing newborn mortality and stillbirth rates by improving access to quality healthcare, strengthening health systems, and implementing critical interventions like skilled birth attendance and neonatal care. Additionally, the “UNICEF Immunization Roadmap to 2030” outlines the organization’s commitment to ensuring every child receives essential vaccines, focusing on robust health systems, community engagement, and overcoming barriers to immunization to meet global health targets. More details about those strategies can be found on unicef.org.
Ensuring the well-being and healthy development of children is crucial, as they are the future. It is fascinating to study the efforts of governments and organizations worldwide in this regard. I am deeply passionate about children’s welfare and would love to see them grow up with all their needs fulfilled, as they are innocent and dear to me.
Load the libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggfortify)
Warning: package 'ggfortify' was built under R version 4.4.1
library(GGally)
Warning: package 'GGally' was built under R version 4.4.1
Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
library(DataExplorer)
Warning: package 'DataExplorer' was built under R version 4.4.1
library(psych)
Warning: package 'psych' was built under R version 4.4.1
Attaching package: 'psych'
The following objects are masked from 'package:ggplot2':
%+%, alpha
Load the dataset
setwd("C:/Users/User/Downloads/Data 110 Projects and Assignments")
child_health <- read_csv("cap2030-unicef-child-health.csv")
Rows: 5516 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (20): iso3, country, indicator, age, domain, disaggregator, total, units...
dbl (3): indicator_id, value, year
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(child_health)
# A tibble: 6 × 23
iso3 country indicator indicator_id age domain disaggregator value total
<chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 AFG Afghanist… Neonatal… 1 0-27… Survi… Total 35.2 Not …
2 ALB Albania Neonatal… 1 0-27… Survi… Total 7.78 Not …
3 DZA Algeria Neonatal… 1 0-27… Survi… Total 16.3 Not …
4 AND Andorra Neonatal… 1 0-27… Survi… Total 1.30 Not …
5 AGO Angola Neonatal… 1 0-27… Survi… Total 27.3 Not …
6 ATG Antigua a… Neonatal… 1 0-27… Survi… Total 3.48 Not …
# ℹ 14 more variables: units <chr>, year <dbl>, source <chr>, definition <chr>,
# indicator_cat <chr>, target <chr>, progress <chr>, region_sdg_name <chr>,
# region_who_code <chr>, region_who_name <chr>, region_unicef_code <chr>,
# region_unicef_name <chr>, incomecat <chr>, income <chr>
# Check missing values
ifelse(mean(complete.cases(child_health)) == 1, "No NA Found", "Found NA")
[1] "Found NA"
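The check above only says whether any row has a missing value. As a minimal sketch of how the missing values could be located per column, the example below uses a small made-up data frame (`toy` and its values are assumptions for illustration, not rows from the UNICEF file):

```r
# Toy data standing in for child_health (values are made up for illustration)
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  value   = c(35.2, NA, 16.3, 1.3),
  income  = c("Low income", "High income", NA, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Share of rows that have no missing value at all
complete_share <- mean(complete.cases(toy))
print(complete_share)
```

Running the same two calls on child_health would show which columns drive the "Found NA" result before any rows are filtered out.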
# Clean dataset with age conversion
child_health_cleaned <- child_health %>%
  group_by(target, region_unicef_name) %>%
  filter(!is.na(age) & !is.na(value) & !is.na(total) &
           !is.na(progress) & !is.na(target) & !is.na(income)) %>%
  rename("age_group" = age) %>%
  separate(age_group, into = c("age", "age_unit"), sep = " ", remove = FALSE) %>%
  separate(age, into = c("age_min", "age_max"), sep = "-", fill = "right") %>%
  mutate(
    age_min = as.numeric(age_min),
    age_max = as.numeric(age_max),
    age_in_years = case_when(
      age_unit == "days" ~ 0,
      age_unit == "months" ~ 0,
      age_unit == "years" ~ age_min
    ),
    .after = age_unit
  )
# View the first rows of the cleaned dataset
head(child_health_cleaned)
# A tibble: 6 × 27
# Groups: target, region_unicef_name [5]
iso3 country indicator indicator_id age_group age_min age_max age_unit
<chr> <chr> <chr> <dbl> <chr> <dbl> <dbl> <chr>
1 AFG Afghanistan Neonatal… 1 0-27 days 0 27 days
2 ALB Albania Neonatal… 1 0-27 days 0 27 days
3 DZA Algeria Neonatal… 1 0-27 days 0 27 days
4 AND Andorra Neonatal… 1 0-27 days 0 27 days
5 AGO Angola Neonatal… 1 0-27 days 0 27 days
6 ATG Antigua and B… Neonatal… 1 0-27 days 0 27 days
# ℹ 19 more variables: age_in_years <dbl>, domain <chr>, disaggregator <chr>,
# value <dbl>, total <chr>, units <chr>, year <dbl>, source <chr>,
# definition <chr>, indicator_cat <chr>, target <chr>, progress <chr>,
# region_sdg_name <chr>, region_who_code <chr>, region_who_name <chr>,
# region_unicef_code <chr>, region_unicef_name <chr>, incomecat <chr>,
# income <chr>
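To make the age conversion easier to follow, here is a minimal sketch of the same separate() and case_when() steps applied to three made-up age labels. The labels "0-59 months" and "5-17 years" are assumptions chosen for illustration; only "0-27 days" is visible in the output above.

```r
library(dplyr)
library(tidyr)

# Three illustrative age labels (only "0-27 days" appears in the printed output)
toy <- tibble(age = c("0-27 days", "0-59 months", "5-17 years"))

parsed <- toy %>%
  rename(age_group = age) %>%
  # "0-27 days" -> age = "0-27", age_unit = "days"
  separate(age_group, into = c("age", "age_unit"), sep = " ", remove = FALSE) %>%
  # "0-27" -> age_min = "0", age_max = "27"
  separate(age, into = c("age_min", "age_max"), sep = "-", fill = "right") %>%
  mutate(
    age_min = as.numeric(age_min),
    age_max = as.numeric(age_max),
    # Ranges given in days or months start at age 0; ranges in years start at age_min
    age_in_years = case_when(
      age_unit == "days" ~ 0,
      age_unit == "months" ~ 0,
      age_unit == "years" ~ age_min
    ),
    .after = age_unit
  )

print(parsed)
```

Note that age_in_years records the lower bound of each range in years, so every range expressed in days or months collapses to 0.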
Check the structure of the cleaned dataset
str(child_health_cleaned)
gropd_df [2,791 × 27] (S3: grouped_df/tbl_df/tbl/data.frame)
$ iso3 : chr [1:2791] "AFG" "ALB" "DZA" "AND" ...
$ country : chr [1:2791] "Afghanistan" "Albania" "Algeria" "Andorra" ...
$ indicator : chr [1:2791] "Neonatal mortality rate" "Neonatal mortality rate" "Neonatal mortality rate" "Neonatal mortality rate" ...
$ indicator_id : num [1:2791] 1 1 1 1 1 1 1 1 1 1 ...
$ age_group : chr [1:2791] "0-27 days" "0-27 days" "0-27 days" "0-27 days" ...
$ age_min : num [1:2791] 0 0 0 0 0 0 0 0 0 0 ...
$ age_max : num [1:2791] 27 27 27 27 27 27 27 27 27 27 ...
$ age_unit : chr [1:2791] "days" "days" "days" "days" ...
$ age_in_years : num [1:2791] 0 0 0 0 0 0 0 0 0 0 ...
$ domain : chr [1:2791] "Survival" "Survival" "Survival" "Survival" ...
$ disaggregator : chr [1:2791] "Total" "Total" "Total" "Total" ...
$ value : num [1:2791] 35.19 7.78 16.29 1.3 27.28 ...
$ total : chr [1:2791] "Not applicable" "Not applicable" "Not applicable" "Not applicable" ...
$ units : chr [1:2791] "Deaths per 1,000 live births" "Deaths per 1,000 live births" "Deaths per 1,000 live births" "Deaths per 1,000 live births" ...
$ year : num [1:2791] 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
$ source : chr [1:2791] "UN_IGME" "UN_IGME" "UN_IGME" "UN_IGME" ...
$ definition : chr [1:2791] "Neonatal (0-27 days) mortality rate (deaths per 1,000 live births)" "Neonatal (0-27 days) mortality rate (deaths per 1,000 live births)" "Neonatal (0-27 days) mortality rate (deaths per 1,000 live births)" "Neonatal (0-27 days) mortality rate (deaths per 1,000 live births)" ...
$ indicator_cat : chr [1:2791] "Met target (green/blue): country has achieved the SDG target of a NMR of at least as low as 12 per 1000 live bi"| __truncated__ "Met target (green/blue): country has achieved the SDG target of a NMR of at least as low as 12 per 1000 live bi"| __truncated__ "Met target (green/blue): country has achieved the SDG target of a NMR of at least as low as 12 per 1000 live bi"| __truncated__ "Met target (green/blue): country has achieved the SDG target of a NMR of at least as low as 12 per 1000 live bi"| __truncated__ ...
$ target : chr [1:2791] "SDG Target 3.2: By 2030, end preventable deaths of newborns and children under 5 years of age, with all\ncountr"| __truncated__ "SDG Target 3.2: By 2030, end preventable deaths of newborns and children under 5 years of age, with all\ncountr"| __truncated__ "SDG Target 3.2: By 2030, end preventable deaths of newborns and children under 5 years of age, with all\ncountr"| __truncated__ "SDG Target 3.2: By 2030, end preventable deaths of newborns and children under 5 years of age, with all\ncountr"| __truncated__ ...
$ progress : chr [1:2791] "Acceleration needed" "Target met" "Acceleration needed" "Target met" ...
$ region_sdg_name : chr [1:2791] "Central and Southern Asia" "Europe and Northern America" "Northern Africa and Western Asia" "Europe and Northern America" ...
$ region_who_code : chr [1:2791] "EMR" "EUR" "AFR" "EUR" ...
$ region_who_name : chr [1:2791] "Eastern Mediterranean Region" "European Region" "African Region" "European Region" ...
$ region_unicef_code: chr [1:2791] "SA" "ECA" "MENA" "ECA" ...
$ region_unicef_name: chr [1:2791] "South Asia" "Europe and Central Asia" "Middle East and North Africa" "Europe and Central Asia" ...
$ incomecat : chr [1:2791] "LIC" "UMIC" "LMIC" "HIC" ...
$ income : chr [1:2791] "Low income" "Upper middle income" "Lower middle income" "High income" ...
- attr(*, "groups")= tibble [84 × 3] (S3: tbl_df/tbl/data.frame)
..$ target : chr [1:84] "Adoption of Convention 183" "Adoption of Convention 183" "Adoption of Convention 183" "Adoption of Convention 183" ...
..$ region_unicef_name: chr [1:84] "East Asia and Pacific" "Europe and Central Asia" "Latin America and Caribbean" "Middle East and North Africa" ...
..$ .rows : list<int> [1:84]
.. ..$ : int [1:30] 727 737 741 747 754 768 786 792 795 798 ...
.. ..$ : int [1:28] 722 726 732 752 753 756 765 769 770 773 ...
.. ..$ : int [1:30] 724 725 728 731 734 736 746 748 751 758 ...
.. ..$ : int [1:18] 721 729 761 787 788 790 793 796 799 802 ...
.. ..$ : int [1:2] 743 874
.. ..$ : int [1:8] 720 730 733 785 806 819 826 854
.. ..$ : int [1:46] 723 735 738 739 740 742 744 745 749 750 ...
.. ..$ : int [1:11] 482 491 494 496 502 504 509 511 512 519 ...
.. ..$ : int [1:8] 469 472 495 503 507 518 523 524
.. ..$ : int [1:6] 474 480 481 489 510 517
.. ..$ : int [1:4] 470 492 493 522
.. ..$ : int [1:6] 468 473 490 500 505 508
.. ..$ : int [1:25] 471 475 476 477 478 479 483 484 485 486 ...
.. ..$ : int [1:9] 988 992 999 1003 1005 1007 1016 1017 1019
.. ..$ : int [1:6] 966 968 982 991 1015 1021
.. ..$ : int [1:6] 970 977 978 986 1004 1014
.. ..$ : int [1:3] 989 990 1020
.. ..$ : int [1:7] 965 969 987 996 1000 1002 1013
.. ..$ : int [1:29] 967 971 972 973 974 975 976 979 980 981 ...
.. ..$ : int [1:13] 900 910 913 915 919 922 925 928 936 939 ...
.. ..$ : int [1:12] 883 885 887 904 914 926 932 943 950 956 ...
.. ..$ : int [1:10] 889 897 898 899 908 924 937 938 948 961
.. ..$ : int [1:8] 884 911 912 927 933 934 949 955
.. ..$ : int 960
.. ..$ : int [1:7] 882 886 909 920 929 935 947
.. ..$ : int [1:32] 888 890 891 892 893 894 895 896 901 902 ...
.. ..$ : int [1:132] 201 217 223 229 238 252 271 278 282 285 ...
.. ..$ : int [1:220] 194 196 200 202 203 208 209 214 218 234 ...
.. ..$ : int [1:141] 198 199 204 207 210 213 216 228 230 233 ...
.. ..$ : int [1:83] 195 205 245 272 273 275 279 283 287 290 ...
.. ..$ : int [1:8] 225 376 1609 1761 1968 2120 2252 2404
.. ..$ : int [1:39] 193 206 212 270 296 312 321 355 384 389 ...
.. ..$ : int [1:228] 197 211 215 219 220 221 222 224 226 227 ...
.. ..$ : int [1:27] 2417 2435 2441 2459 2476 2482 2486 2488 2498 2500 ...
.. ..$ : int [1:45] 2413 2416 2418 2419 2421 2422 2427 2430 2445 2447 ...
.. ..$ : int [1:25] 2415 2423 2426 2429 2440 2442 2444 2446 2451 2452 ...
.. ..$ : int [1:7] 2453 2477 2479 2483 2503 2516 2553
.. ..$ : int [1:2] 2437 2560
.. ..$ : int [1:8] 2412 2420 2425 2475 2495 2508 2517 2542
.. ..$ : int [1:40] 2414 2424 2428 2431 2432 2433 2434 2436 2438 2439 ...
.. ..$ : int [1:17] 1238 1240 1243 1245 1247 1253 1257 1260 1262 1790 ...
.. ..$ : int [1:20] 1217 1219 1221 1231 1239 1246 1250 1255 1259 1264 ...
.. ..$ : int [1:12] 1227 1228 1235 1244 1252 1258 1779 1780 1787 1795 ...
.. ..$ : int [1:10] 1218 1236 1237 1251 1263 1770 1788 1789 1802 1814
.. ..$ : int [1:4] 1220 1248 1772 1799
.. ..$ : int [1:36] 1222 1223 1224 1225 1226 1229 1230 1232 1233 1234 ...
.. ..$ : int [1:26] 1375 1376 1453 1454 1457 1458 1461 1462 1479 1480 ...
.. ..$ : int [1:86] 1365 1366 1369 1370 1373 1374 1377 1378 1383 1384 ...
.. ..$ : int [1:12] 1403 1404 1413 1414 1437 1438 1507 1508 1511 1512 ...
.. ..$ : int [1:24] 1367 1368 1379 1380 1445 1446 1449 1450 1455 1456 ...
.. ..$ : int [1:10] 1381 1382 1443 1444 1473 1474 1489 1490 1505 1506
.. ..$ : int [1:54] 1371 1372 1385 1386 1387 1388 1391 1392 1393 1394 ...
.. ..$ : int [1:90] 9 25 31 37 46 60 79 86 90 93 ...
.. ..$ : int [1:156] 2 4 8 10 11 16 17 22 26 42 ...
.. ..$ : int [1:96] 6 7 12 15 18 21 24 36 38 41 ...
.. ..$ : int [1:57] 3 13 53 80 81 83 87 91 95 98 ...
.. ..$ : int [1:6] 33 185 560 712 1057 1209
.. ..$ : int [1:24] 1 14 20 78 104 120 130 164 528 541 ...
.. ..$ : int [1:147] 5 19 23 27 28 29 30 32 34 35 ...
.. ..$ : int [1:18] 1840 1841 1868 1869 1872 1873 1884 1885 1888 1889 ...
.. ..$ : int [1:16] 1820 1821 1850 1851 1870 1871 1886 1887 1894 1895 ...
.. ..$ : int [1:20] 1824 1825 1828 1829 1836 1837 1838 1839 1844 1845 ...
.. ..$ : int [1:6] 1864 1865 1866 1867 1924 1925
.. ..$ : int [1:12] 1822 1823 1862 1863 1880 1881 1890 1891 1896 1897 ...
.. ..$ : int [1:46] 1818 1819 1826 1827 1830 1831 1832 1833 1834 1835 ...
.. ..$ : int [1:16] 1287 1288 1307 1308 1311 1312 1319 1320 1323 1324 ...
.. ..$ : int [1:16] 1271 1272 1293 1294 1309 1310 1325 1326 1331 1332 ...
.. ..$ : int [1:12] 1283 1284 1285 1286 1301 1302 1321 1322 1335 1336 ...
.. ..$ : int [1:10] 1267 1268 1303 1304 1305 1306 1333 1334 1355 1356
.. ..$ : int [1:6] 1269 1270 1317 1318 1327 1328
.. ..$ : int [1:38] 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 ...
.. ..$ : int [1:32] 2590 2591 2598 2599 2632 2633 2666 2667 2670 2671 ...
.. ..$ : int [1:72] 2568 2569 2572 2573 2574 2575 2578 2579 2584 2585 ...
.. ..$ : int [1:38] 2570 2571 2580 2581 2582 2583 2588 2589 2604 2605 ...
.. ..$ : int [1:12] 2624 2625 2656 2657 2662 2663 2674 2675 2716 2717 ...
.. ..$ : int [1:4] 2600 2601 2780 2781
.. ..$ : int [1:14] 2566 2567 2576 2577 2654 2655 2684 2685 2704 2705 ...
.. ..$ : int [1:54] 2586 2587 2594 2595 2596 2597 2602 2603 2618 2619 ...
.. ..$ : int [1:12] 2144 2145 2166 2167 2170 2171 2180 2181 2208 2209 ...
.. ..$ : int [1:10] 2168 2169 2182 2183 2186 2187 2200 2201 2214 2215
.. ..$ : int [1:20] 2134 2135 2140 2141 2142 2143 2148 2149 2150 2151 ...
.. ..$ : int [1:6] 2160 2161 2164 2165 2188 2189
.. ..$ : int [1:6] 2130 2131 2190 2191 2204 2205
.. ..$ : int [1:38] 2128 2129 2132 2133 2136 2137 2138 2139 2146 2147 ...
.. ..@ ptype: int(0)
..- attr(*, ".drop")= logi TRUE
Explore correlations
My goal is to understand how an indicator’s value changes as the predictors change.
Simple linear regression
My goal is to understand how value changes as age changes (I am interested in predicting value based on age_in_years)
# Find the correlation
cor(child_health_cleaned$age_in_years, child_health_cleaned$value)
[1] -0.2836684
The cor() function calculates the Pearson correlation coefficient by default. The Pearson correlation coefficient measures the linear relationship between two continuous variables. Negative correlation (-1 to 0) indicates that as age_in_years increases, value tends to decrease.
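As a small illustration of what cor() computes, the sketch below works out the Pearson coefficient by hand (the covariance of the two variables divided by the product of their spreads) and compares it with the built-in function. The vectors `x` and `y` are made-up numbers, not values from the data set.

```r
# Made-up values standing in for age_in_years (x) and value (y)
x <- c(0, 0, 1, 5, 10, 15)
y <- c(35, 28, 30, 12, 9, 4)

# Pearson r: sum of cross-products of deviations, scaled by both spreads
dx <- x - mean(x)
dy <- y - mean(y)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Built-in Pearson correlation (the default method of cor())
r_builtin <- cor(x, y)

print(c(manual = r_manual, builtin = r_builtin))
```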
fit1 <- lm(value ~ age_in_years, data = child_health_cleaned) # fit the linear model
summary(fit1)
Call:
lm(formula = value ~ age_in_years, data = child_health_cleaned)
Residuals:
Min 1Q Median 3Q Max
-31.061 -23.685 -7.963 12.424 75.477
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 31.06084 0.75946 40.90 <2e-16 ***
age_in_years -1.53556 0.09829 -15.62 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 30.88 on 2789 degrees of freedom
Multiple R-squared: 0.08047, Adjusted R-squared: 0.08014
F-statistic: 244.1 on 1 and 2789 DF, p-value: < 2.2e-16
Diagnostics: Residual standard error: 30.88 Multiple R-squared: 0.08047 Adjusted R-squared: 0.08014 F-statistic: 244.1 on 1 and 2789 DF, p-value: <2.2e−16
The analysis reveals that the coefficient for age_in_years is negative, indicating that as age_in_years increases, the value tends to decrease. Additionally, the R-squared value (0.08047) is quite low, suggesting that only about 8% of the variance in value is explained by age_in_years. Although both coefficients are highly significant (p-values < 2e-16), the model’s explanatory power is limited.
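The low R-squared is consistent with the correlation reported earlier: in a regression with a single predictor, R-squared equals the squared Pearson correlation. The sketch below checks this with the reported coefficient and then confirms the identity on simulated data (the simulated `x` and `y` are assumptions for illustration only).

```r
# Correlation reported by cor() earlier in this analysis
r_reported <- -0.2836684
r_squared_from_r <- r_reported^2
print(round(r_squared_from_r, 5))  # 0.08047, the Multiple R-squared of fit1

# The same identity on simulated data
set.seed(110)
x <- runif(100, 0, 15)
y <- 30 - 1.5 * x + rnorm(100, sd = 20)
fit <- lm(y ~ x)
print(c(r_squared = summary(fit)$r.squared, cor_squared = cor(x, y)^2))
```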
Multiple Linear Regression with Additional Predictors
Let’s see how an indicator’s value changes as age, indicator ID, and year change.
# Fit another linear model
fit2 <- lm(value ~ age_in_years + indicator_id + year, data = child_health_cleaned)
summary(fit2)
Call:
lm(formula = value ~ age_in_years + indicator_id + year, data = child_health_cleaned)
Residuals:
Min 1Q Median 3Q Max
-78.657 -13.758 -3.488 15.681 75.916
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20733.9626 659.6079 31.43 <2e-16 ***
age_in_years -6.4410 0.1953 -32.98 <2e-16 ***
indicator_id 4.1147 0.1690 24.34 <2e-16 ***
year -10.2657 0.3264 -31.45 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.86 on 2787 degrees of freedom
Multiple R-squared: 0.5395, Adjusted R-squared: 0.5391
F-statistic: 1089 on 3 and 2787 DF, p-value: < 2.2e-16
Residual standard error: 21.86 Multiple R-squared: 0.5395 Adjusted R-squared: 0.5391 F-statistic: 1089 on 3 and 2787 DF, p-value: <2.2e−16
Residuals gives summary statistics of the residuals (the differences between observed and predicted values). t value is the t-statistic for testing the hypothesis that the corresponding coefficient (the intercept or a slope) is 0. Pr(>|t|) is the p-value for that t-test. Residual standard error gives an idea of how much the observed values deviate from the fitted values. Multiple R-squared is the proportion of the variance in the dependent variable that is explained by the independent variables. The F-statistic tests the overall significance of the model by comparing it with a model that has no predictors (intercept only), and its p-value indicates whether the model as a whole is statistically significant.
The model identifies age_in_years, indicator_id, and year as significant predictors of the value. Age and year have negative coefficients, while indicator_id has a positive one. This means that as age increases, the value tends to decrease; as the indicator ID increases, the value tends to increase; and as the year progresses, the value tends to decrease. The R-squared value of 0.5395 indicates that approximately 54% of the variance in value is explained by these three predictors, a substantial improvement over Model 1. Additionally, the lower residual standard error (21.86 versus 30.88) suggests a better fit.
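The quantities described above can also be pulled out of a summary object directly. The sketch below uses R's built-in mtcars data rather than the UNICEF file, so the model and numbers are purely illustrative.

```r
# Illustrative model on a built-in data set (not the child health data)
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(s)
print(coef_table)

# Each t value is the estimate divided by its standard error
t_manual <- coef_table[, "Estimate"] / coef_table[, "Std. Error"]

# Residual standard error, R-squared, and adjusted R-squared
print(c(sigma = s$sigma, r2 = s$r.squared, adj_r2 = s$adj.r.squared))

# Overall F-test: p-value from the F statistic and its degrees of freedom
f <- s$fstatistic
p_overall <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
print(unname(p_overall))
```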
Create the Pairwise Scatterplot Matrix
The ggpairs function provides a comprehensive visual summary of relationships between multiple variables in the child_health_cleaned data frame, facilitating exploratory data analysis and helping identify important patterns, correlations, outliers and potential issues like multicollinearity.
Scatterplots: indicator_id vs. age_in_years, value, year: These scatterplots show the distribution and potential relationships between indicator_id and the other variables. The patterns suggest varying degrees of association. indicator_id vs. value: The correlation is close to zero (Corr: -0.054), so value barely changes with the indicator ID when no other variable is taken into account. indicator_id vs. year: This scatterplot shows a negative correlation (Corr: -0.297), suggesting that higher indicator IDs tend to have data from earlier years in the dataset. age_in_years vs. value: The correlation is weakly negative (about -0.28, matching the cor() result above), indicating that as age_in_years increases, value tends to decrease. value vs. year: Another negative correlation (Corr: -0.554), indicating that higher values are associated with earlier years.
Histograms: The histograms on the diagonal show the distribution of each variable. For example, the histogram for age_in_years shows how ages are distributed across the dataset.
Correlation Coefficients: The correlation coefficients provide a numerical summary of the strength and direction of the relationships between pairs of variables. The asterisks denote the significance levels, with more asterisks indicating higher significance.
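The ggpairs() call itself is not shown in this write-up. A minimal sketch of what such a call could look like is given below; it runs on simulated data with the same four column names (the data frame `toy` and its values are assumptions), and the same `columns` argument could be pointed at child_health_cleaned instead.

```r
library(GGally)

# Simulated stand-in with the four numeric columns discussed above
set.seed(110)
toy <- data.frame(
  indicator_id = sample(1:20, 50, replace = TRUE),
  age_in_years = sample(c(0, 5, 15), 50, replace = TRUE),
  value        = runif(50, 0, 100),
  year         = sample(2016:2021, 50, replace = TRUE)
)

# Scatterplots below the diagonal, densities on it, correlations above it
p <- ggpairs(toy, columns = c("indicator_id", "age_in_years", "value", "year"))

# print(p) draws the matrix when run interactively
```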
Create a matrix of scatterplots with psych library
The pairs.panels function from the psych package is used to create a matrix of scatterplots, correlation coefficients, histograms, and linear regression lines for a subset of variables from the data frame
Plotting every column is not really useful since there are too many variables, so I selected some specific ones. Columns 4, 9, 12, and 15 represent indicator_id, age_in_years, value, and year, respectively.
When interpreting the correlation coefficients, the top row (indicator_id) shows a strong positive correlation of 0.92 with age_in_years, while -0.05 (with value) and -0.30 (with year) are weak to moderate negative correlations. In the middle row (age_in_years), -0.28 (with value) and -0.14 (with year) are both negative, with -0.28 being the stronger of the two. Finally, in the bottom row, -0.55 between value and year is the strongest negative correlation in the matrix. In terms of graphical representation, the histograms on the diagonal show the distribution of each variable (year runs from 2016 to 2021), and the scatter plots show the relationship between each pair of variables. The red markings on the scatter plots summarize the trend of each pair and help draw attention to points that lie far from it, which might be outliers or key observations.
Create a Correlation Matrix Heatmap
The plot_correlation function is from the DataExplorer package and is designed to provide quick and easy exploratory data analysis. This will create a correlation plot that visually represents the relationships between the selected columns in the child_health_cleaned data frame. The plot will help to understand the linear relationships and potential multicollinearity among these variables.
I analyzed the variables year, value, age_in_years, and indicator_id, and their correlation coefficients. The diagonal cells show a perfect positive correlation for each variable with itself. The off-diagonal cells display the correlation between different pairs of variables: year vs. value, year vs. age_in_years, year vs. indicator_id, value vs. age_in_years, value vs. indicator_id, and age_in_years vs. indicator_id. The color coding ranges from blue for positive correlation to red for negative correlation, with darker shades representing stronger correlations. Using the coefficients reported above, the strongest negative correlation is between year and value (-0.55), which suggests that values tend to be lower in later years. The correlations between year and indicator_id (-0.30) and between value and age_in_years (-0.28) are weak to moderate inverse relationships, while those between year and age_in_years (-0.14) and between value and indicator_id (-0.05) are weak to negligible. Additionally, there is a strong positive correlation between age_in_years and indicator_id (0.92), suggesting that these variables tend to increase together and could cause multicollinearity if both are used as predictors.
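The heatmap is a picture of a correlation matrix, so the numbers behind it can be inspected with base R's cor(). The sketch below does this on simulated data with the same four column names (the data frame `toy` is an assumption for illustration; its correlations will not match the ones reported above).

```r
# Simulated stand-in with the four numeric columns used in the heatmap
set.seed(110)
toy <- data.frame(
  year         = sample(2016:2021, 50, replace = TRUE),
  value        = runif(50, 0, 100),
  age_in_years = sample(c(0, 5, 15), 50, replace = TRUE),
  indicator_id = sample(1:20, 50, replace = TRUE)
)

# Pairwise Pearson correlations; the diagonal is 1 and the matrix is symmetric
cor_mat <- cor(toy[, c("year", "value", "age_in_years", "indicator_id")])
print(round(cor_mat, 2))
```

Reading the rounded matrix next to the heatmap makes it easier to tell a genuinely strong correlation from a cell that merely looks dark.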
Another Multiple Linear Regression with More Predictors
fit3 <- lm(value ~ age_in_years + indicator_id + age_min + age_max + income + progress + year, data = child_health_cleaned)
summary(fit3)
Residual standard error: 15.52 Multiple R-squared: 0.7688 Adjusted R-squared: 0.7677 F-statistic: 710.3 on 13 and 2777 DF, p-value: <2.2e−16
With several predictors included, most of the coefficients in this model are highly significant. The R-squared value of 0.7688 suggests that approximately 77% of the variance in value is explained by the predictors, a clear improvement over both previous models. Additionally, the residual standard error has decreased further (15.52), indicating an even better fit. Including more predictors has helped to explain the variation in value, making this the best-fitting model of the three. Looking at the individual coefficients, age_in_years shows a negative association: each additional year of age is associated with a decrease in value of approximately 37.1563 units (p < 0.001), holding the other predictors constant. Conversely, indicator_id shows a positive relationship, where an increase of one unit is associated with an increase of about 22.8477 units in value (p < 0.001). Both age_min and age_max are also positively associated with value, with one-unit increases corresponding to approximately 9.0722 and 6.2920 units respectively (p < 0.001). Furthermore, higher income levels and poorer progress categories are associated with higher value and lower value, respectively (all p < 0.001). Finally, the model indicates that value decreases by about 0.6266 units with each additional calendar year (p < 0.05).
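The F-statistic above is reported on 13 degrees of freedom even though the formula lists only seven terms. This is because lm() expands each categorical predictor (here income and progress) into one indicator column per level, minus a reference level. The sketch below shows that expansion with model.matrix() on a small made-up data frame (the values in `toy` are assumptions for illustration).

```r
# Made-up data with one numeric and one categorical predictor
toy <- data.frame(
  value  = c(10, 20, 15, 30, 25, 40),
  year   = c(2016, 2017, 2018, 2019, 2020, 2021),
  income = c("Low income", "Lower middle income", "Upper middle income",
             "High income", "Low income", "High income")
)

# Design matrix that lm() would build for value ~ year + income
X <- model.matrix(value ~ year + income, data = toy)
print(colnames(X))

# Four income levels become three indicator columns; the first level
# in sort order ("High income") is absorbed into the intercept
n_income_columns <- sum(grepl("^income", colnames(X)))
print(n_income_columns)
```

Each income or progress coefficient in the model summary is therefore a difference from the reference level rather than a slope.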
autoplot(fit2, 1:6, nrows =2, ncol =2)
Residuals vs Fitted: This plot checks for non-linearity and homoscedasticity (equal variances) of residuals. Ideally, the residuals should be randomly scattered around the horizontal line at zero, with no discernible pattern.
Normal Q-Q: This plot assesses the normality of residuals. If the residuals are normally distributed, the points should fall approximately along the reference line. Outlying observations are labeled with their row number.
Scale-Location: This plot is another way to check homoscedasticity (homogeneous variance). The spread of residuals should be roughly the same across the range of fitted values.
Cook’s Distance: This plot shows how much each observation influences the fitted model. Observations with a large Cook’s distance may distort the regression and deserve a closer look.
Residuals vs Leverage: This plot identifies influential observations. Points that stand out far to the right are of particular interest and may warrant further investigation.
The Cook’s Distance vs Leverage plot identifies influential observations in your data. High leverage points (far right on x-axis) have unusual predictor values, while high Cook’s Distance points (far up on y-axis) greatly influence fitted response values. Points far right and up are particularly influential and warrant further investigation. Influential points can significantly impact the regression line and predictions.
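The influence measures behind these plots can also be computed directly. The sketch below uses base R's cooks.distance() and hatvalues() on a model fitted to the built-in mtcars data (not the child health data); the cut-offs 4/n and 2p/n are common rules of thumb and are assumptions, not thresholds taken from this analysis.

```r
# Illustrative model on a built-in data set
fit <- lm(mpg ~ wt + hp, data = mtcars)

cooks <- cooks.distance(fit)  # influence of each observation on the fit
lev   <- hatvalues(fit)       # leverage: how unusual the predictor values are

n <- nrow(mtcars)
p <- length(coef(fit))        # number of estimated coefficients

# Rule-of-thumb flags (assumed cut-offs, adjust as needed)
high_cook <- names(cooks)[cooks > 4 / n]
high_lev  <- names(lev)[lev > 2 * p / n]

print(high_cook)
print(high_lev)
```

Observations that appear in both lists are the ones that sit far to the right and high up in the Cook's Distance vs Leverage plot.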
Save the cleaned dataset into my working directory
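The code for this step is not shown in the write-up. A minimal sketch using readr's write_csv() is given below; the file name "child_health_cleaned.csv" and the two-row data frame are hypothetical, and the file is written to a temporary directory here so the example can run anywhere.

```r
library(readr)

# Two made-up rows standing in for child_health_cleaned
toy_cleaned <- data.frame(
  iso3  = c("AFG", "ALB"),
  value = c(35.2, 7.78)
)

# Hypothetical output file name, placed in a temporary directory
out_path <- file.path(tempdir(), "child_health_cleaned.csv")
write_csv(toy_cleaned, out_path)

# Read the file back to confirm that it was written as expected
round_trip <- read_csv(out_path, show_col_types = FALSE)
print(round_trip)
```

Replacing out_path with a bare file name would save the file into the current working directory instead.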
The first visualization aims to answer the question: How does each indicator progress over the years? I dragged Year to Columns and Value to Rows to get the x-axis and y-axis, respectively. Then I dragged Indicator to Color to distinguish the lines. I renamed the legend title to Indicator’s category. To make each indicator easy to pick out, I clicked on the arrow at the right of Indicator’s category and selected Highlight Selected Items. I changed the automatic color palette by clicking on Color, selecting Edit Colors, and choosing the Hue Circle palette because it has enough colors for my variables, then clicked Assign Palette and OK. Under Marks, I clicked on Path to change the line pattern to dashed lines. To get a good view of all the lines, I clicked on Color and set the transparency to 69%, since some lines cross others. I renamed the sheet to Indicator’s Line Chart and titled the chart Indicator’s Evolution Over Time by right-clicking on the arrow at the right of the initial title and selecting Edit title. I hid all the cards that I did not need by right-clicking on Marks, Filters, Columns, and Rows, then selecting Hide card.
An analysis of this line chart reveals that after 2019 there is no more data on child labour and primary school net attendance rate. Each line shows how the value of a particular indicator has changed over the years. Some indicators show a steady increase, while others fluctuate or decline. Overall, the indicators’ values have decreased over time, with some variation from year to year. Maternity Protection Convention 183 is static with negligible change, while the other categories have changed significantly over the years, with a large reduction in their rates. This probably means that those categories have met their targets or at least made substantial progress. The opposite holds for a category like Early initiation of breastfeeding, where a decreasing value means moving away from the goal. CRVS (Civil Registration and Vital Statistics): Birth registration had a huge drop from 2016 to 2017. This decline could be influenced by various factors, such as data collection challenges, social and cultural factors, conflict and displacement, health system challenges, legal and administrative factors, economic factors, migration and urbanization, and awareness campaigns. Child labour, however, seems to be the only category whose value increased between 2016 and 2019. Another interesting point is that data on the mortality rates is only available for the year 2020. I wonder why.
According to the article “Latest child mortality estimates reveal world remains off track to meeting Sustainable Development Goals,” published on unicef.org on December 19, 2021, the UN Inter-agency Group for Child Mortality Estimation (UN IGME) report provides estimates on child and youth mortality, and several factors explain why specific rates are highlighted for 2020. Data availability remains a significant issue, with recent and reliable data on child, adolescent, and youth mortality being unavailable for most countries, particularly low-income ones. Only 36 countries had high-quality nationally representative data on under-five mortality for 2021, and about half the world’s countries lacked data on child mortality in the last five years. The COVID-19 pandemic further exacerbated these challenges, impacting data availability and quality, underscoring the importance of monitoring child and youth mortality during this critical time. While the full extent and severity of the COVID-19 impact on children and youth are still unknown, there is a potential for a mortality crisis in 2020 that threatens years of significant improvement and progress. Despite these challenges, efforts continue to enhance data collection and estimation methods to better understand child and youth mortality worldwide.
Click on the link below to see the chart https://public.tableau.com/app/profile/duchelle.kemoue/viz/IndicatorsLineChart/IndicatorsLineChart
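The year-coverage observations above (no child labour data after 2019, mortality data only for 2020) can also be checked directly in R. The following is a minimal sketch in base R, assuming the cleaned data has `indicator`, `year`, and `value` columns as described earlier; the `toy` data frame and its values are made up for illustration and are not taken from the dataset.

```r
# Toy stand-in for child_health_cleaned; all values are made up for illustration.
toy <- data.frame(
  indicator = c("Child labour", "Child labour", "Under-five mortality rate",
                "Birth registration", "Birth registration"),
  year      = c(2016, 2019, 2020, 2016, 2017),
  value     = c(10, 12, 40, 80, 60)
)

# Keep only rows that actually carry a value.
has_value <- toy[!is.na(toy$value), ]

# First and last year with data, per indicator.
first_year <- tapply(has_value$year, has_value$indicator, min)
last_year  <- tapply(has_value$year, has_value$indicator, max)

coverage <- data.frame(
  indicator  = names(first_year),
  first_year = as.vector(first_year),
  last_year  = as.vector(last_year)
)
print(coverage)
```

Running the same steps on the real data frame instead of `toy` would list, for every indicator, the span of years for which values exist.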
Treemap of Countries involved in Child Labor
I wanted to visualize which countries were involved in child labor. I dragged Indicator (from Dimensions) to Filters and selected Child labor among the indicators. I dragged Region UNICEF Name to Color to color the rectangles by region, and Value to Size to make it easy to see which country has the highest rate of child labor. Tableau automatically aggregates the values to give the average percentage of child labor per country from 2016 to 2019. I dragged Income to Detail so that hovering over a rectangle shows the income category in which the country is classified. Then I dragged Country and Progress to Label so that they are directly readable in each rectangle, making it easy to see whether a country has progressed toward the Sustainable Development Goals tracked by UNICEF. I opened Show Me and selected the treemap option. I changed the automatic color of the treemap by clicking on Color, selecting Edit Colors, choosing the Summer palette in the dialog box, then clicking Assign Palette and OK. I also changed the border color of the rectangles by clicking on Color, then Border, and choosing the maroon color box. I clicked on Color again to set the transparency to 75%. I hid all the cards that I did not need by right-clicking on Marks, Filters, Columns, and Rows, then selecting Hide card. I clicked on the arrow at the right of Region Unicef Name (the legend) and selected Highlight selected items so that when a region is selected, only that region is highlighted on the treemap. I renamed the sheet Child Labor Chart and titled the chart Worldwide Child Labor Rate.
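For readers who prefer R, the aggregation behind the treemap can be approximated as follows. This is only a sketch in base R: it assumes the rectangle size is the average of `value` per country, as described above, and it uses a made-up `toy` data frame with placeholder country names rather than the real data. The pattern `child labou?r` is used because the indicator label may be spelled either way.

```r
# Toy stand-in for the dataset; countries and values are placeholders.
toy <- data.frame(
  country   = c("Country A", "Country A", "Country B", "Country B", "Country A"),
  indicator = c("Child labour", "Child labour", "Child labour",
                "Child labour", "Birth registration"),
  year      = c(2016, 2019, 2017, 2018, 2016),
  value     = c(20, 30, 5, NA, 90)
)

# Keep only the child labour rows (accepts both "labor" and "labour").
child_labour <- toy[grepl("child labou?r", toy$indicator, ignore.case = TRUE), ]

# Average value per country; rows with a missing value are dropped by default.
avg_by_country <- aggregate(value ~ country, data = child_labour, FUN = mean)

# Sort so the country with the highest average rate comes first.
avg_by_country <- avg_by_country[order(-avg_by_country$value), ]
print(avg_by_country)
```

The resulting table has one row per country, which is the same shape a treemap needs: one size value per rectangle.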
The treemap revealed that numerous countries in Sub-Saharan Africa have high rates of child labor, with Togo at 65.70% and Madagascar at 59.60%. The majority of the countries show little progress, indicating a need for urgent attention to achieve the goal of eradicating child labor. However, countries in the Middle East and North Africa show the lowest rates of child labor with good progress in eradicating it. Notably, many of the countries with high child labor rates are in the low-income category.
Click on the link below to see the chart https://public.tableau.com/app/profile/duchelle.kemoue/viz/ChildLaborChart/ChildLaborChart
# Load the libraries
library(ggalluvial)
library(ggthemes) # Provides additional themes for ggplot2
Warning: package 'ggthemes' was built under R version 4.4.1
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
# Transform the data
unicef_alluvial <- child_health_cleaned |>
  # Convert the income column to a factor with specified levels, ensuring consistent ordering.
  mutate(
    income = factor(income, levels = c("Low income", "Lower middle income", "Upper middle income", "High income")),
    # Convert the age_group column to a factor with specified levels, ensuring consistent ordering.
    age_group = factor(age_group, levels = c("0-27 days", "1-11 months", "1-4 years", "5-9 years", "10-14 years", "15-19 years"))
  ) |>
  # Filter rows where the indicator column contains the word "mortality", ignoring case sensitivity.
  filter(grepl("mortality", indicator, ignore.case = TRUE)) |>
  # Group the data by age_group, income, and region_unicef_name.
  group_by(age_group, income, region_unicef_name) |>
  # Summarize the data by calculating the total value for each group, ignoring NA values.
  summarise(total_value = sum(value, na.rm = TRUE)) |>
  # Create the alluvial plot.
  # Initialize the ggplot object with aesthetics defining the axes.
  ggplot(aes(axis1 = age_group, axis2 = income, axis3 = region_unicef_name, y = total_value)) +
  # Add alluvium layers to the plot, with each layer filled by region_unicef_name.
  geom_alluvium(aes(fill = region_unicef_name)) +
  # Use the "calc" color palette from the ggthemes library for the fill colors.
  scale_fill_calc() +
  # Adjust the width of the strata.
  geom_stratum(width = 1/2) +
  # Adjust text size, prevent overlap, and change text color to dark blue.
  geom_text(stat = "stratum", aes(label = after_stat(stratum)), size = 3, check_overlap = TRUE, color = "darkblue") +
  # Adjust the x-axis scale to add some padding.
  scale_x_discrete(expand = c(0.1, 0.1)) +
  # Customize title, axis labels, caption, and legend title.
  labs(
    title = "Child Mortality Rate by Age-Groups and Income Level \n (2020)",
    x = "Age Group, Region, and Income Level",
    y = "Child Mortality Prevalence",
    caption = "Source: UNICEF",
    fill = "Region"
  ) +
  # Apply a dark theme to the plot.
  theme_dark() +
  # Customize the x-axis text appearance.
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
`summarise()` has grouped output by 'age_group', 'income'. You can override
using the `.groups` argument.
# Convert to an interactive plot
ggplotly(unicef_alluvial)
This chunk of code creates an alluvial plot using the ggalluvial package to visualize the child mortality rate by age group, income level, and region based on the 2020 data. It filters for rows containing the word “mortality” in the indicator column, groups the data by age group, income level, and region, and calculates the total value for each group. The plot is customized with a dark theme, labels, and adjusted axis settings, and is then converted to an interactive plotly plot for enhanced user interaction.
Note: Plotly drops the caption.
The highlighting of selected items does not work as I want, since only the total value is displayed when hovering over the strata. Therefore, I will add more interactivity through a tooltip.
# Transform the data
unicef_alluvial <- child_health_cleaned |>
  # Convert the income and age_group columns to factors with specified levels, ensuring consistent ordering.
  mutate(
    income = factor(income, levels = c("Low income", "Lower middle income", "Upper middle income", "High income")),
    age_group = factor(age_group, levels = c("0-27 days", "1-11 months", "1-4 years", "5-9 years", "10-14 years", "15-19 years"))
  ) |>
  # Filter rows where the indicator column contains the word "mortality", ignoring case sensitivity.
  filter(grepl("mortality", indicator, ignore.case = TRUE)) |>
  # Group the data by age_group, income, and region_unicef_name.
  group_by(age_group, income, region_unicef_name) |>
  # Summarize the data by calculating the total value for each group, ignoring NA values.
  summarise(total_value = sum(value, na.rm = TRUE)) |>
  # Create the alluvial plot.
  ggplot(aes(
    axis1 = age_group, axis2 = income, axis3 = region_unicef_name,
    y = total_value, fill = region_unicef_name,
    # Add the details that will be displayed on the strata when hovering over them.
    text = paste("Age Group:", age_group, "<br>Income Level:", income, "<br>Region:", region_unicef_name, "<br>Value:", total_value)
  )) +
  geom_alluvium(aes(fill = region_unicef_name)) +
  scale_fill_calc() +
  geom_stratum(width = 1/2) +
  geom_text(stat = "stratum", aes(label = after_stat(stratum)), size = 3, check_overlap = TRUE, color = "darkblue") +
  scale_x_discrete(expand = c(0.1, 0.1)) +
  labs(
    title = "Child Mortality Rate by Age-Groups and Income Level \n 2020 Analysis",
    x = "Age Group, Region, and Income Level",
    y = "Child Mortality Prevalence",
    caption = "Source: UNICEF",
    fill = "Region"
  ) +
  theme_dark() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
`summarise()` has grouped output by 'age_group', 'income'. You can override
using the `.groups` argument.
# Convert ggplot object to plotly object and specify the tooltip
plotly_unicef_alluvial <- ggplotly(unicef_alluvial, tooltip = "text")
# Print the plot
plotly_unicef_alluvial
Note that when we double-click on a region, only that region is highlighted on the alluvial plot, as in the previous one. However, all the x-axis labels disappear and only the name of the region remains readable. Meanwhile, when we hover over the intersections, we can read details about the age group, the income level, and the child mortality rate of that specific category. I was not able to get rid of the NA under the region label.
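One possible way to remove the NA stratum is to drop incomplete rows before plotting. The sketch below, in base R on a made-up `toy` data frame, assumes the NA comes from rows where one of the three axis columns is missing. That is only a guess: a missing value can already be in the data, or it can be created when `factor()` is given a `levels` vector that does not contain every value in the column, since unmatched values become NA.

```r
# Toy stand-in for the summarised data; values are made up for illustration.
toy <- data.frame(
  age_group          = c("0-27 days", "1-11 months", "0-27 days"),
  income             = c("Low income", "High income", NA),
  region_unicef_name = c("Sub-Saharan Africa", NA, "South Asia"),
  value              = c(27, 3, 20)
)

# Columns used as axes in the alluvial plot.
axis_cols <- c("age_group", "income", "region_unicef_name")

# Keep only rows where none of the axis columns is missing.
toy_complete <- toy[complete.cases(toy[, axis_cols]), ]
print(toy_complete)

# In a dplyr pipeline the equivalent step would be, for example:
#   filter(!is.na(age_group), !is.na(income), !is.na(region_unicef_name))
```

Placing such a filter before `group_by()` would keep incomplete rows out of the strata, at the cost of excluding their values from the totals.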
The flow of data in the alluvial diagram visualizes the variability of child mortality rates across different regions and income levels. Substantial flows are observed in Sub-Saharan Africa, indicating high child mortality rates across various age groups, particularly associated with low and lower middle income levels. Regions with higher income levels, such as Europe and Central Asia, demonstrate lower child mortality rates. Moreover, the youngest age group (0-27 days) exhibits substantial flows, signifying higher mortality rates in the neonatal period across multiple regions. The recommendations drawn from the analysis emphasize the importance of targeted interventions in regions with high child mortality rates, particularly Sub-Saharan Africa. Neonatal care and health services also need enhancement to address the high mortality rates in the 0-27 days age group, along with the development of income-based strategies to ensure equitable access to healthcare services.
Let’s dive deeper into Sub-Saharan Africa, which has the highest mortality rate.
Child Mortality Rate in Sub-Saharan Africa
I chose to make a bar chart in Tableau. Click on the link https://public.tableau.com/app/profile/duchelle.kemoue/viz/ChildMortalityRateinSub-SaharanAfrica/ChildMortalityBarChart
I dragged Age group and Income to Columns, Value to Rows, and Region Unicef Name and Indicator to Filters, in order to select only Sub-Saharan Africa among the regions and only the indicators related to mortality. I then dragged Progress to Color. I clicked on Color to switch to a green color gradient that distinguishes the progress levels, and added a pink border between bars. I clicked on the right arrow of Age group to sort it manually from 0-27 days to 15-19 years. I did the same with Income to sort from Low income to High income, and with Progress to sort from Needs urgent attention to Target met. Still using the right arrow of Progress, I selected the Highlight selected items option so that clicking on a progress level accentuates it on the chart. I right-clicked on the x- and y-axes to bold the font and renamed the y-axis to Mortality rate (Per 1000 children in the age-group). I then renamed the chart title to “Child Mortality Rate in Sub-Saharan Africa (2020)” and renamed the sheet to Child Mortality Bar Chart. I hid all the cards that I did not need by right-clicking on Marks, Filters, Columns, and Rows, then selecting Hide card, and set the view to Entire View so that the chart fits any screen. The result is a bar chart that displays the child mortality rate per income level, colored by progress level. The darker the green, the better the progress. The y-axis represents the child mortality rate per 1000 children within each age group.
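The data preparation behind this bar chart can be sketched in base R as well. The example below uses a made-up `toy` data frame with hypothetical indicator names and values. The aggregation function (`mean`) is an assumption, since the Tableau steps above do not state how Value is aggregated; replace it with `sum` if that matches the view.

```r
# Toy stand-in for the dataset; indicator names and values are hypothetical.
toy <- data.frame(
  region_unicef_name = c("Sub-Saharan Africa", "Sub-Saharan Africa",
                         "Sub-Saharan Africa", "South Asia", "Sub-Saharan Africa"),
  indicator = c("Neonatal mortality rate", "Neonatal mortality rate",
                "Mortality rate age 15-19", "Neonatal mortality rate", "Child labour"),
  age_group = c("0-27 days", "0-27 days", "15-19 years", "0-27 days", NA),
  income    = c("Low income", "Low income", "Lower middle income",
                "Low income", "Low income"),
  value     = c(30, 26, 8, 25, 40)
)

# Same two filters as in Tableau: one region, mortality indicators only.
keep <- toy$region_unicef_name == "Sub-Saharan Africa" &
  grepl("mortality", toy$indicator, ignore.case = TRUE)
ssa <- toy[keep, ]

# Order the categories the way they were sorted manually in Tableau.
ssa$age_group <- factor(ssa$age_group, levels = c(
  "0-27 days", "1-11 months", "1-4 years", "5-9 years", "10-14 years", "15-19 years"))
ssa$income <- factor(ssa$income, levels = c(
  "Low income", "Lower middle income", "Upper middle income", "High income"))

# One bar per age group and income level (aggregation choice is an assumption).
bars <- aggregate(value ~ age_group + income, data = ssa, FUN = mean)
print(bars)
```

Each row of `bars` corresponds to one bar in the chart; combinations with no data simply do not appear.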
We observe that as income level decreases, the child mortality rate increases: low income has the highest rate of child mortality and high income the lowest. In addition, the mortality rate decreases as the age group increases. Children from 0 to 4 years have the highest mortality rate, with acceleration needed in the progress level, while children of 15 to 19 years have the lowest mortality rate. That the first age group has the highest mortality rate indicates that neonatal mortality is a significant concern in Sub-Saharan Africa.
Given the high mortality rates in the 0-27 days age group, targeted interventions to improve neonatal health could be beneficial. This could include improving prenatal care, increasing access to skilled birth attendants, and enhancing neonatal care facilities. Meanwhile, although the mortality rates are lower for older age groups, continued efforts are needed to sustain and further reduce these rates. This could involve improving nutrition, vaccination coverage, and access to healthcare services for children.
I could have mapped these data using leaflet, but I would have needed to merge the dataset with another one. I decided to stop here although much more analysis could have been done with this rich dataset.
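As a rough idea of what that merge could look like, the sketch below joins indicator rows to a coordinates table on the `iso3` country code using base R. Everything in it is hypothetical: the `coords` lookup table, its `lat` and `lng` columns, the placeholder ISO codes, and all values are invented for illustration and would need to be replaced by a real country-coordinates dataset.

```r
# Toy indicator rows keyed by ISO3 code; codes and values are placeholders.
indicators <- data.frame(
  iso3  = c("AAA", "BBB", "CCC"),
  value = c(12.5, 40.1, 7.3)
)

# Hypothetical lookup table with one coordinate pair per country.
coords <- data.frame(
  iso3 = c("AAA", "BBB"),
  lat  = c(10, -5),
  lng  = c(20, 30)
)

# Left join: keep every indicator row, even when no coordinates are found.
mappable <- merge(indicators, coords, by = "iso3", all.x = TRUE)

# Countries that could not be placed on a map because coordinates are missing.
unmatched <- as.character(mappable$iso3[is.na(mappable$lat)])
print(mappable)
print(unmatched)
```

Checking `unmatched` before mapping helps catch country codes that differ between the two sources.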