UNICEF Dataset

Assignment Requirements

The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work. Your task is to:

  1. Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 5 assignment!)
    • For each of the three chosen datasets:
      • Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset.

      • You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.

      • Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]

      • Perform the analysis requested in the discussion item. \(\ \ \) note\(^1\)

      • Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

  2. Please include in your homework submission, for each of the three chosen datasets:
    • The URL to the .Rmd file in your GitHub repository, and
    • The URL for your rpubs.com

Notes

  1. Notes from the bottom of the .csv were excluded. Main data was selected between rows 5 to 222. Post
  2. NA and ? were deliberately set to NA
  3. encoding set to latin1 to account for accents, in country naming convention
  4. NA columns are in odd number columns, select utilizing regex.
  5. Columns with a % indicator will be divided by 100 and (thousands) multiplied by 1000 accordingly
  6. NA, -, x and ? were deliberately set to NA
  7. NA columns are in odd number columns, select utilizing regex and specifying exception column 46.
  8. | was added as a delimiter to later seperate gender.
  9. cbind was not used to bind to df_Out_of_School at this point, because the number of rows do not match.
  10. Original data was in the format .xlsx and converted to .csv. A formatting error with blank space U+0096 occurred creating an issue upon import. .csv was opened with a notepad, then copied from Nutrition A.csv to Nutrition A 1.csv correcting the error and allowing the import to occur with the transformation of unnecessary characters into NA
  11. NA, -, x, and ? were deliberately set to NA
  12. NA columns are in odd number columns, select utilizing regex.

Datasets

Nutrition

Import Data

UTF-8 Character error resolved before import. note\(^{10}\) Data is imported from the .csv Nutrition A 1.csv from my github Project 2 folder using the function read_csv from the readr package.

df_Nutrition <- 
  read_csv(csv_Nutrition_A,
                col_names = FALSE,
          na = c("NA","?","–","x"," ",""),
          skip = 7,
          n_max = 202,
          locale = readr::locale(encoding = "latin1"),
          )

note\(^{11}\)

df_Nutrition <- 
  df_Nutrition %>% 
  select (everything(), -matches("[3579]$"),-matches("[12][13579]$"))

note\(^{12}\)

The Headers were not workable as a direct import from the csv, due to several columns having multi-headers as shown below


colnames(df_Nutrition)<- colnames_df_Nutrition
Column Names and Type
Countries and areas character
Weight at birth Low birthweight (%) 2015 numeric
Weight at birth Unweighed at birth (%) 2010–2018* numeric
Early initiation of breastfeeding (%) numeric
Exclusive breastfeeding (<6 months)(%) character
Introduction to solid, semi-solid or soft foods (6–8 months) (%) numeric
Continued breastfeeding (12–23 months) (%) All children numeric
Continued breastfeeding (12–23 months) (%) Poorest 20% numeric
Continued breastfeeding (12–23 months) (%) richest 20% numeric
Minimum diet diversity (6–23 months) (%) numeric
Minimum meal frequency (6–23 months) (%) numeric
Minimum acceptable diet (6–23 months) (%) numeric
Zero vegetable or fruit consumption (6–23 months) (%) numeric








Manual column names were used in place of the .csv multi-headers. Names are basically a concatenation of the Main Header and Sub Headers whenever applicable.
Manipulating Imported Data

note\(^{1}\)

Paraphrased Instructions, based on post, not in the same order, are as follows:

  • Remove all blank spaces and NA’s from the data set. (PARTIALLY COMPLETE ON IMPORT)
  • Select for Analysis:
    • Latin American Countries
    • Low birth weight percentage
    • Introduction to solid, semi-solid or soft foods (6-8 months) (%)
    • Zero vegetable or fruit consumption (6-23 months) (%)
  • Rename columns
  • Pivot wide
  • Calculate stats

According to WorldPopulationReview.com Latin countries consist of.

  • Argentina
  • Bolivia
  • Brazil
  • Chile
  • Colombia
  • Costa Rica
  • Cuba
  • The Dominican Republic
  • Ecuador
  • El Salvador
  • Guatemala
  • Honduras
  • Mexico
  • Nicaragua
  • Panama
  • Paraguay
  • Peru
  • Puerto Rico
  • Uruguay
  • Venezuela

Which is stored in the string str_latin_countries to filter with.

str_latin_countries<-"Argentina|Bolivia|Brazil|Chile|
Colombia|Costa Rica|Cuba|Dominican Republic|Ecuador|
El Salvador|Guatemala|Honduras|Mexico|Nicaragua|Panama|
Paraguay|Peru|Puerto Rico|Uruguay|Venezuela"

The required columns and rows are selected using pipe %>%, filter(), and select() functions.

Using str_latin_countries, the correct countries are specified.

Renaming takes place in accordance with instrucitons.

The resulting in the table is below.

df_Nutrition_Analysis<-df_Nutrition%>%
  filter(str_detect(`Countries and areas`, str_latin_countries))%>% 
  select(1,2, 6,13) %>%
  rename("Countries" = `Countries and areas`,
         "Low Birth Weight" = `Weight at birth Low birthweight (%) 2015`,
         "Intro to Solids"=`Introduction to solid, semi-solid or soft foods (6–8 months) (%)`,
         "Zero (Vegetables|Fruits)"=`Zero vegetable or fruit consumption (6–23 months) (%)`)
kbl(df_Nutrition_Analysis)
Countries Low Birth Weight Intro to Solids Zero (Vegetables|Fruits)
Argentina 7 97 NA
Bolivia (Plurinational State of) 7 81 20
Brazil 8 94 NA
Chile 6 NA NA
Costa Rica 7 90 NA
Cuba 5 91 27
Dominican Republic 11 81 35
Ecuador 11 74 NA
Guatemala 11 80 27
Honduras 11 86 36
Mexico 8 82 18
Nicaragua 11 89 NA
Panama 10 78 NA
Peru 9 95 7
Uruguay 8 NA NA
Venezuela (Bolivarian Republic of) 9 NA NA


removing NA values reduces the rows from 16 to 7.

df_Nutrition_Analysis<-na.omit(df_Nutrition_Analysis)
Countries Low Birth Weight Intro to Solids Zero (Vegetables|Fruits)
Bolivia (Plurinational State of) 7 81 20
Cuba 5 91 27
Dominican Republic 11 81 35
Guatemala 11 80 27
Honduras 11 86 36
Mexico 8 82 18
Peru 9 95 7


Analysis

Use of pivot_longer and pivot_wider did very little to enhance understanding of the data. Instead a linear model that shows a positive predictor in Low Birth weight from the Zero (Vegetable | Fruit) variable and a negative one with Intro to Solids.

df_Nutrition_Analysis %>%
pivot_wider(`Low Birth Weight`)
## # A tibble: 5 x 1
##   `Low Birth Weight`
##                <dbl>
## 1                  7
## 2                  5
## 3                 11
## 4                  8
## 5                  9
df_Nutrition_Analysis %>%
pivot_longer(`Zero (Vegetables|Fruits)`)
## # A tibble: 7 x 5
##   Countries             `Low Birth Weight` `Intro to Solid~ name           value
##   <chr>                              <dbl>            <dbl> <chr>          <dbl>
## 1 Bolivia (Plurination~                  7               81 Zero (Vegetab~    20
## 2 Cuba                                   5               91 Zero (Vegetab~    27
## 3 Dominican Republic                    11               81 Zero (Vegetab~    35
## 4 Guatemala                             11               80 Zero (Vegetab~    27
## 5 Honduras                              11               86 Zero (Vegetab~    36
## 6 Mexico                                 8               82 Zero (Vegetab~    18
## 7 Peru                                   9               95 Zero (Vegetab~     7

Child Mortality

Import Data

Data is imported from the .csv’s Child Mortality.csv from my github Project 2 folder using the function read_csv from the readr package.

df_Child_Mortality <-
  read_csv(csv_Child_Mortality,
                 col_names = FALSE,
                 na = c("NA","?"),
                         skip = 5,
                       n_max = 202,
                 locale = readr::locale(encoding = "latin1"))

note\(^{2\&3}\)

The Headers were not workable as a direct import from the csv, due to several columns having multi-headers as shown below


Names were stored to the variable colname_Child_Mortality and applied to the data frame. Columns comprised entirely of NA values, are removed as well.

df_Child_Mortality <- 
  df_Child_Mortality %>% 
  select (everything(),
          -matches("[3579]$"),
          -matches("[123][13579]$"))

notes\(^{4}\)

colnames(df_Child_Mortality)<- colnames_Child_Mortality
Column Names and Type
Countries and areas character
Under-5 mortality rate (deaths per 1,000 live births) 1990 numeric
Under-5 mortality rate (deaths per 1,000 live births) 2000 numeric
Under-5 mortality rate (deaths per 1,000 live births) 2018 numeric
Annual rate of reduction in under-5 mortality rate (%) numeric
Under-5 mortality rate by sex (deaths per 1,000 live births) 2018 (male) numeric
Under-5 mortality rate by sex (deaths per 1,000 live births) 2018 (female) numeric
Infant mortality rate (deaths per 1,000 live births) 1990 numeric
Infant mortality rate (deaths per 1,000 live births) 2018 numeric
Neonatal mortality rate (deaths per 1,000 live births) 1990 numeric
Neonatal mortality rate (deaths per 1,000 live births) 2000 numeric
Neonatal mortality rate (deaths per 1,000 live births) 2018 numeric
Probability of dying among children aged 5–14 (deaths per 1,000 children aged 5) 1990 numeric
Probability of dying among children aged 5–14 (deaths per 1,000 children aged 5) 2018 numeric
Annual number of under-5 deaths (thousands) 2018 numeric
Annual number of neonatal deaths (thousands) 2018 numeric
Neonatal deaths as proportion of all under-5 deaths (%) numeric
Number of deaths among children aged 5–14 (thousands)2018 numeric








Manual column names were used in place of the .csv multi-headers. Names are basically a concatenation of the Main Header and Sub Headers whenever applicable.
Manipulating Imported Data

The existing data frame df_Child_Mortality is a strong representation of the data stored, as is, on the .csv. Many values in the data set are specified in the header as (deaths per 1,000 lives births) however.

In order to simplify the values and names accordingly, I divide the values in those columns by 1000 and remove the comments from each column name.

#copy integer valued table to address formatting issues with kableExtra:: and reactable::
df_int_Child_Mortality<- 
  df_Child_Mortality
#located index of 1,000 columns for formatting with kableExtra:: and reactable::
column_1000_index<-
  grep("1,000 ",colnames(df_int_Child_Mortality))
#Divide values by 1000
df_Child_Mortality<-
  df_Child_Mortality %>%
  mutate_at( vars( matches("\\(deaths per 1,000 live births)")) ,
             funs(./1000))
#rename columns
colnames(df_Child_Mortality)<-
  str_replace_all(colnames(df_Child_Mortality),
    pattern = "\\(deaths per 1,000 live births\\)|\\(deaths per 1,000 children aged 5\\)",
                           replacement = "")
#change type back to numeric after renaming columns converted it into `num`
df_Child_Mortality[] <- 
  lapply(df_Child_Mortality,
         function(x) {
    if(is.factor(x)) round(as.numeric(as.character(x)),3) else x
})
Under-5 mortality rate 1990 Under-5 mortality rate 2000 Under-5 mortality rate 2018 Under-5 mortality rate by sex 2018 (male) Under-5 mortality rate by sex 2018 (female) Infant mortality rate 1990 Infant mortality rate 2018 Neonatal mortality rate 1990 Neonatal mortality rate 2000 Neonatal mortality rate 2018 Probability of dying among children aged 5–14 1990 Probability of dying among children aged 5–14 2018
0.179 0.129 0.062 0.066 0.059 0.121 0.048 0.075 0.061 0.037 0.016 0.005
0.041 0.026 0.009 0.009 0.008 0.035 0.008 0.013 0.012 0.007 0.007 0.002
0.050 0.040 0.023 0.025 0.022 0.042 0.020 0.023 0.021 0.015 0.009 0.004
0.011 0.006 0.003 0.003 0.003 0.009 0.003 0.006 0.003 0.001 0.007 0.001
0.223 0.206 0.077 0.083 0.071 0.132 0.052 0.054 0.051 0.028 0.046 0.016


The values inside the columns with % and (thousands) indicators will also be modified to reflect the exact numeric value. note\(^5\)

Annual rate of reduction in under-5 mortality rate (%) Annual number of under-5 deaths (thousands) 2018 Annual number of neonatal deaths (thousands) 2018 Neonatal deaths as proportion of all under-5 deaths (%) Number of deaths among children aged 5–14 (thousands)2018
4.1 74 45 60 5
6.0 0 0 74 0
2.9 24 15 62 3
4.4 0 0 50 0
#Divides % column values by 100
df_Child_Mortality<-
  df_Child_Mortality %>%
  mutate_at( vars( matches("\\%")) ,
             funs(./100))

#Multiplies "(thousands)" column values by 1000
df_Child_Mortality<-
  df_Child_Mortality %>%
  mutate_at( vars( matches("thousands")) ,
             funs(.*1000))
#rename columns
colnames(df_Child_Mortality)<-
  stringr::str_replace_all(colnames(df_Child_Mortality),
                           "\\(\\%\\)|\\(thousands\\)",
                           "")


The final table before calculating for analysis is below.

Analysis

I noted that the data set did not focus on Central or South America specifically. Mortality in these relationships are correlation just based on the label names. If a mortality rate in any aspect increases, so would the total mortality rate by country region etcetera. The analysis here is just to note the average by these two regions, not specifically accounted for in the original data set.

The calculation for the regions male & female mortality count is:

\(\sum_{i=first}^{last}(Rate\times Annual)_i\)

Where:

\(first\) & \(last\) refers to the country selected.

\(Rate\) is the specified mortality rate (male/female)

and the calculation for the Annual number of deaths is just \(\sum_{Annual\ per\ country}\)

df_CAmerica <-
  c("Belize","Costa Rica","El Salvador",
     "Guatemala","Honduras","Nicaragua","Panama")

df_CAmerica <-
  df_Child_Mortality %>%
  filter(`Countries and areas` %in% c(df_CAmerica)) %>%
    select(`Countries and areas`,
           `Under-5 mortality rate by sex  2018 (male)`,
           `Under-5 mortality rate by sex  2018 (female)`,
           `Annual number of under-5 deaths  2018`)

df_CAmerica<-
  df_CAmerica %>%
    rbind(c("Total Central America",
          sum(df_CAmerica$`Under-5 mortality rate by sex  2018 (male)`*
                df_CAmerica$`Annual number of under-5 deaths  2018`),
          sum(df_CAmerica$`Under-5 mortality rate by sex  2018 (female)`*
                df_CAmerica$`Annual number of under-5 deaths  2018`),
          sum(df_CAmerica$`Annual number of under-5 deaths  2018`)))
df_SAmerica <- 
  c("Argentina","Bolivia","Brazil","Chile","Colombia",
    "Ecuador","Guyana","Paraguay","Peru","Suriname",
    "Uruguay","Venezuela")

df_SAmerica<-
  df_Child_Mortality %>%
  filter(`Countries and areas` %in% c(df_SAmerica)) %>%
    select(`Countries and areas`,
           `Under-5 mortality rate by sex  2018 (male)`,
           `Under-5 mortality rate by sex  2018 (female)`,
           `Annual number of under-5 deaths  2018`)

df_SAmerica<-
  df_SAmerica %>%
    rbind(c("Total South America",
          sum(df_SAmerica$`Under-5 mortality rate by sex  2018 (male)`*
                df_SAmerica$`Annual number of under-5 deaths  2018`),
          sum(df_SAmerica$`Under-5 mortality rate by sex  2018 (female)`*
                df_SAmerica$`Annual number of under-5 deaths  2018`),
          sum(df_SAmerica$`Annual number of under-5 deaths  2018`)))
Central America
Countries Male Female Annual
Belize 0.014 0.012 0
Costa Rica 0.01 0.008 1000
El Salvador 0.015 0.012 2000
Guatemala 0.029 0.023 11000
Honduras 0.019 0.016 4000
Nicaragua 0.02 0.016 2000
Panama 0.017 0.014 1000
Total Central America 492 395 21000
South America
Countries Male Female Annual
Argentina 0.011 0.009 8000
Brazil 0.016 0.013 42000
Chile 0.008 0.007 2000
Colombia 0.016 0.013 10000
Ecuador 0.016 0.013 5000
Guyana 0.034 0.026 0
Paraguay 0.022 0.018 3000
Peru 0.016 0.013 8000
Suriname 0.021 0.017 0
Uruguay 0.008 0.007 0
Total South America 1210 985 78000


The data was reformatted during the manipulation. Utilizing pipe and cbind, it is recreated with the necessary values being numeric. The combined data frame is df_S_C_America

A visual representation of the total Annual death rate for those under 5 years old is highlighted in the plot below. Noteablly Brazil has a higher count then Central America as a whole. Brazil is a large country, however those numbers are still very high.

Unsurprisingly South America has a higher count for both genders, I would attribute this to the larger country size and high count of Brazil which resides in South America.

Education

Import Data

Data is imported from the .csv’s Education.csv from my github Project 2 folder using the function read_csv from the readr package.

df_Education <- 
  read_csv(csv_Child_Education,
                col_names = FALSE,
          na = c("NA","?","-","x"),
          skip = 7,
          n_max = 202,
          locale = readr::locale(encoding = "latin1"))

note\(^{5}\)

The Headers were again not workable as a direct import from the csv, due to several columns having multi-headers, sum as many as 4 (as shown below). All data is between the time range 2010-2018


Names were stored to the variable colname_Education and applied to the data frame. Columns comprised entirely of NA values, are removed as well.

df_Education <- 
  df_Education %>% select (everything(), -matches("[3579]$"),-matches("[1234][13579]$"),-46)
colnames(df_Education)<-
  colnames_Education
Column Names and Type
Countries and areas character
Out of School rate - One Year before primary entry age | male numeric
Out of School rate -One Year before primary entry age | female numeric
Out of School rate -Primary Education | male numeric
Out of School rate -Primary Education | female numeric
Out of School rate -Lower secondary education | male numeric
Out of School rate -Lower secondary education | female numeric
Out of School rate -Upper secondary education | male numeric
Out of School rate -Upper secondary education | female numeric
Completion Rate - Primary education | male numeric
Completion Rate - Primary education | female numeric
Completion Rate - Lower secondary education | male numeric
Completion Rate - Lower secondary education | female numeric
Completion Rate - Upper secondary education | male numeric
Completion Rate - Upper secondary education | female numeric
Proportion of children in grade 2 or 3 achieving minimum proficiency level | reading numeric
Proportion of children in grade 2 or 3 achieving minimum proficiency level | math numeric
Proportion of children at the end of primary achieving minimum proficiency level | reading numeric
Proportion of children at the end of primary achieving minimum proficiency level | math numeric
Proportion of children at the end of lower secondary achieving minimum proficiency level | reading numeric
Proportion of children at the end of lower secondary achieving minimum proficiency level | math numeric
Youth (15–24 years) literacy rate (%) | male numeric
Youth (15–24 years) literacy rate (%) | female numeric

Manual column names were used in place of the .csv multi-headers. Names are basically a concatenation of the Main Header and Sub Headers whenever applicable.

note\(^{8}\)

Manipulating Imported Data

The df_Out_of_School data frame was created utilizing pipe (%), select(),gather(),seperate(), and arrange() functions in conjunction with regex to specify columns in the Out of School category.

df_Out_of_School<-
  df_Education %>%
  select(1,matches("^Out.*male"))%>%
    gather("Grade", "Out of School Rate", 2:9)%>%
      separate("Grade", c("Grade Level", "Gender"),
               sep = "\\| ") %>%
        arrange(`Countries and areas`)

Using the same method, the data Completion rate and Youth (15–24 years) literacy rate (%) is extracted and stored in df_Completion and df_Literacy data frames respectively.

df_Completion<-
  df_Education %>%
  select(1,matches("^(Completion)"))%>%
    gather("Grade", "Completion Rate", 2:7)%>%
      separate("Grade", c("Grade Level", "Gender"),
               sep = "\\| ") %>%
        arrange(`Countries and areas`)%>%
          select(1,2:4)

df_Literacy<-
df_Education %>%
  select(1,matches("Youth"))%>%
    gather("Key", "Literacy Rate", 2:3)%>%
      separate("Key", c("Column Name", "Gender"), sep = "\\| ") %>%
        arrange(`Countries and areas`) %>%
          select(1,2:4)

note\(^{9}\)

A simpler naming convention is preferred for Grade Level in both df_Completion and df_Out_of_School. The below function resolves that for all rows.

df_Completion<-
 data.frame(lapply(df_Completion,
                   function(x) {
                  gsub("Completion Rate - ", "", x)
              }))
df_Out_of_School<-
 data.frame(lapply(df_Out_of_School,
                   function(x) {
                  gsub("Out of School rate - |Out of School rate -|One | entry age", "", x)
              }))
Out of School Rate
Countries Grade Levels Gender Rate
Afghanistan Primary education male 67
Afghanistan Primary education female 40
Afghanistan Lower secondary education male 49
Afghanistan Lower secondary education female 26
Afghanistan Upper secondary education male 32
Completion Rate
Countries Grade Levels Gender Rate
Afghanistan Year before primary male NA
Afghanistan Year before primary female NA
Afghanistan Primary Education male NA
Afghanistan Primary Education female NA
Afghanistan Lower secondary education male NA
Completion Rate
Countries Grade Levels Gender Rate
Afghanistan Youth (15–24 years) literacy rate (%) male 62
Afghanistan Youth (15–24 years) literacy rate (%) female 32
Albania Youth (15–24 years) literacy rate (%) male 99
Albania Youth (15–24 years) literacy rate (%) female 99
Algeria Youth (15–24 years) literacy rate (%) male NA

In order to do an analysis for a specific set and to match the number of rows (202) in df_Completion, df_Literacy and df_Out_of_School, all dataframes are filter by Upper secondary education and male. They are stored to dataframe df_Analysis.

df_Analysis<-
  df_Out_of_School %>%
      filter(str_detect(`Grade Level`, "Upper"),
             `Gender`== "male")

df_Analysis<-
  cbind(df_Analysis,
        df_Completion%>%
      filter(str_detect(`Grade Level`, "Upper"),
             `Gender`== "male")%>%
        select(4))

df_Analysis<-
  cbind(df_Analysis,
        df_Literacy%>%
      filter(`Gender`== "male")%>%
        select(4))
Analysis

A basic analysis with a linear model shows, that as Out of school rate increases, so does the literacy of males in this data set.