DATA607 Project 2

UNICEF Dataset

Assignment Requirements

The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work. Your task is to:

Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 5 assignment!)
- For each of the three chosen datasets:
  - Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset.
  - You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
  - Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
  - Perform the analysis requested in the discussion item. \(\ \ \) note\(^1\)
  - Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
Please include in your homework submission, for each of the three chosen datasets:
- The URL to the .Rmd file in your GitHub repository, and
- The URL for your rpubs.com

Notes

Notes from the bottom of the .csv were excluded. Main data was selected between rows 5 to 222.
NA and ? were deliberately set to NA
encoding set to latin1 to account for accents, in country naming convention
NA columns are in odd number columns, select utilizing regex.
Columns with a % indicator will be divided by 100 and (thousands) multiplied by 1000 accordingly
NA, -, x and ? were deliberately set to NA
NA columns are in odd number columns, select utilizing regex and specifying exception column 46.
| was added as a delimiter to later seperate gender.
cbind was not used to bind to df_Out_of_School at this point, because the number of rows do not match.
Original data was in the format .xlsx and converted to .csv. A formatting error with blank space U+0096 occurred creating an issue upon import. .csv was opened with a notepad, then copied from Nutrition A.csv to Nutrition A 1.csv correcting the error and allowing the import to occur with the transformation of unnecessary characters into NA
NA, -, x, and ? were deliberately set to NA
NA columns are in odd number columns, select utilizing regex.

Datasets

Nutrition

Import Data

UTF-8 Character error resolved before import. note\(^{10}\) Data is imported from the .csv Nutrition A 1.csv from my github Project 2 folder using the function read_csv from the readr package.

df_Nutrition <- 
  read_csv(csv_Nutrition_A,
                col_names = FALSE,
          na = c("NA","?","–","x"," ",""),
          skip = 7,
          n_max = 202,
          locale = readr::locale(encoding = "latin1"),
          )

note\(^{11}\)

df_Nutrition <- 
  df_Nutrition %>% 
  select (everything(), -matches("[3579]$"),-matches("[12][13579]$"))

note\(^{12}\)

The Headers were not workable as a direct import from the csv, due to several columns having multi-headers as shown below

colnames(df_Nutrition)<- colnames_df_Nutrition

Column Names and Type

Countries and areas	character
Weight at birth Low birthweight (%) 2015	numeric
Weight at birth Unweighed at birth (%) 2010–2018*	numeric
Early initiation of breastfeeding (%)	numeric
Exclusive breastfeeding (<6 months)(%)	character
Introduction to solid, semi-solid or soft foods (6–8 months) (%)	numeric
Continued breastfeeding (12–23 months) (%) All children	numeric
Continued breastfeeding (12–23 months) (%) Poorest 20%	numeric
Continued breastfeeding (12–23 months) (%) richest 20%	numeric
Minimum diet diversity (6–23 months) (%)	numeric
Minimum meal frequency (6–23 months) (%)	numeric
Minimum acceptable diet (6–23 months) (%)	numeric
Zero vegetable or fruit consumption (6–23 months) (%)	numeric

Manual column names were used in place of the .csv multi-headers. Names are basically a concatenation of the Main Header and Sub Headers whenever applicable.

Manipulating Imported Data

note\(^{1}\)

Paraphrased Instructions, based on post, not in the same order, are as follows:

Remove all blank spaces and NA’s from the data set. (PARTIALLY COMPLETE ON IMPORT)
Select for Analysis:
- Latin American Countries
- Low birth weight percentage
- Introduction to solid, semi-solid or soft foods (6-8 months) (%)
- Zero vegetable or fruit consumption (6-23 months) (%)
Rename columns
Pivot wide
Calculate stats

According to WorldPopulationReview.com Latin countries consist of.

Argentina
Bolivia
Brazil
Chile
Colombia
Costa Rica
Cuba
The Dominican Republic
Ecuador
El Salvador
Guatemala
Honduras
Mexico
Nicaragua
Panama
Paraguay
Peru
Puerto Rico
Uruguay
Venezuela

Which is stored in the string str_latin_countries to filter with.

str_latin_countries<-"Argentina|Bolivia|Brazil|Chile|
Colombia|Costa Rica|Cuba|Dominican Republic|Ecuador|
El Salvador|Guatemala|Honduras|Mexico|Nicaragua|Panama|
Paraguay|Peru|Puerto Rico|Uruguay|Venezuela"

The required columns and rows are selected using pipe %>%, filter(), and select() functions.

Using str_latin_countries, the correct countries are specified.

Renaming takes place in accordance with instrucitons.

The resulting in the table is below.

df_Nutrition_Analysis<-df_Nutrition%>%
  filter(str_detect(`Countries and areas`, str_latin_countries))%>% 
  select(1,2, 6,13) %>%
  rename("Countries" = `Countries and areas`,
         "Low Birth Weight" = `Weight at birth Low birthweight (%) 2015`,
         "Intro to Solids"=`Introduction to solid, semi-solid or soft foods (6–8 months) (%)`,
         "Zero (Vegetables|Fruits)"=`Zero vegetable or fruit consumption (6–23 months) (%)`)

kbl(df_Nutrition_Analysis)

Countries	Low Birth Weight	Intro to Solids	Zero (Vegetables\|Fruits)
Argentina	7	97	NA
Bolivia (Plurinational State of)	7	81	20
Brazil	8	94	NA
Chile	6	NA	NA
Costa Rica	7	90	NA
Cuba	5	91	27
Dominican Republic	11	81	35
Ecuador	11	74	NA
Guatemala	11	80	27
Honduras	11	86	36
Mexico	8	82	18
Nicaragua	11	89	NA
Panama	10	78	NA
Peru	9	95	7
Uruguay	8	NA	NA
Venezuela (Bolivarian Republic of)	9	NA	NA

removing NA values reduces the rows from 16 to 7.

df_Nutrition_Analysis<-na.omit(df_Nutrition_Analysis)

Countries	Low Birth Weight	Intro to Solids	Zero (Vegetables\|Fruits)
Bolivia (Plurinational State of)	7	81	20
Cuba	5	91	27
Dominican Republic	11	81	35
Guatemala	11	80	27
Honduras	11	86	36
Mexico	8	82	18
Peru	9	95	7

Analysis

Use of pivot_longer and pivot_wider did very little to enhance understanding of the data. Instead a linear model that shows a positive predictor in Low Birth weight from the Zero (Vegetable | Fruit) variable and a negative one with Intro to Solids.

df_Nutrition_Analysis %>%
pivot_wider(`Low Birth Weight`)

## # A tibble: 5 x 1
##   `Low Birth Weight`
##                <dbl>
## 1                  7
## 2                  5
## 3                 11
## 4                  8
## 5                  9

df_Nutrition_Analysis %>%
pivot_longer(`Zero (Vegetables|Fruits)`)

## # A tibble: 7 x 5
##   Countries             `Low Birth Weight` `Intro to Solid~ name           value
##   <chr>                              <dbl>            <dbl> <chr>          <dbl>
## 1 Bolivia (Plurination~                  7               81 Zero (Vegetab~    20
## 2 Cuba                                   5               91 Zero (Vegetab~    27
## 3 Dominican Republic                    11               81 Zero (Vegetab~    35
## 4 Guatemala                             11               80 Zero (Vegetab~    27
## 5 Honduras                              11               86 Zero (Vegetab~    36
## 6 Mexico                                 8               82 Zero (Vegetab~    18
## 7 Peru                                   9               95 Zero (Vegetab~     7

Child Mortality

Import Data

Data is imported from the .csv’s Child Mortality.csv from my github Project 2 folder using the function read_csv from the readr package.

df_Child_Mortality <-
  read_csv(csv_Child_Mortality,
                 col_names = FALSE,
                 na = c("NA","?"),
                         skip = 5,
                       n_max = 202,
                 locale = readr::locale(encoding = "latin1"))

note\(^{2\&3}\)

The Headers were not workable as a direct import from the csv, due to several columns having multi-headers as shown below

Names were stored to the variable colname_Child_Mortality and applied to the data frame. Columns comprised entirely of NA values, are removed as well.

df_Child_Mortality <- 
  df_Child_Mortality %>% 
  select (everything(),
          -matches("[3579]$"),
          -matches("[123][13579]$"))

notes\(^{4}\)

colnames(df_Child_Mortality)<- colnames_Child_Mortality

Column Names and Type

Countries and areas	character
Under-5 mortality rate (deaths per 1,000 live births) 1990	numeric
Under-5 mortality rate (deaths per 1,000 live births) 2000	numeric
Under-5 mortality rate (deaths per 1,000 live births) 2018	numeric
Annual rate of reduction in under-5 mortality rate (%)	numeric
Under-5 mortality rate by sex (deaths per 1,000 live births) 2018 (male)	numeric
Under-5 mortality rate by sex (deaths per 1,000 live births) 2018 (female)	numeric
Infant mortality rate (deaths per 1,000 live births) 1990	numeric
Infant mortality rate (deaths per 1,000 live births) 2018	numeric
Neonatal mortality rate (deaths per 1,000 live births) 1990	numeric
Neonatal mortality rate (deaths per 1,000 live births) 2000	numeric
Neonatal mortality rate (deaths per 1,000 live births) 2018	numeric
Probability of dying among children aged 5–14 (deaths per 1,000 children aged 5) 1990	numeric
Probability of dying among children aged 5–14 (deaths per 1,000 children aged 5) 2018	numeric
Annual number of under-5 deaths (thousands) 2018	numeric
Annual number of neonatal deaths (thousands) 2018	numeric
Neonatal deaths as proportion of all under-5 deaths (%)	numeric
Number of deaths among children aged 5–14 (thousands)2018	numeric

Manual column names were used in place of the .csv multi-headers. Names are basically a concatenation of the Main Header and Sub Headers whenever applicable.

Manipulating Imported Data

The existing data frame df_Child_Mortality is a strong representation of the data stored, as is, on the .csv. Many values in the data set are specified in the header as (deaths per 1,000 lives births) however.

In order to simplify the values and names accordingly, I divide the values in those columns by 1000 and remove the comments from each column name.

#copy integer valued table to address formatting issues with kableExtra:: and reactable::
df_int_Child_Mortality<- 
  df_Child_Mortality
#located index of 1,000 columns for formatting with kableExtra:: and reactable::
column_1000_index<-
  grep("1,000 ",colnames(df_int_Child_Mortality))
#Divide values by 1000
df_Child_Mortality<-
  df_Child_Mortality %>%
  mutate_at( vars( matches("\\(deaths per 1,000 live births)")) ,
             funs(./1000))

#rename columns
colnames(df_Child_Mortality)<-
  str_replace_all(colnames(df_Child_Mortality),
    pattern = "\\(deaths per 1,000 live births\\)|\\(deaths per 1,000 children aged 5\\)",
                           replacement = "")

#change type back to numeric after renaming columns converted it into `num`
df_Child_Mortality[] <- 
  lapply(df_Child_Mortality,
         function(x) {
    if(is.factor(x)) round(as.numeric(as.character(x)),3) else x
})

Under-5 mortality rate 1990	Under-5 mortality rate 2000	Under-5 mortality rate 2018	Under-5 mortality rate by sex 2018 (male)	Under-5 mortality rate by sex 2018 (female)	Infant mortality rate 1990	Infant mortality rate 2018	Neonatal mortality rate 1990	Neonatal mortality rate 2000	Neonatal mortality rate 2018	Probability of dying among children aged 5–14 1990	Probability of dying among children aged 5–14 2018
0.179	0.129	0.062	0.066	0.059	0.121	0.048	0.075	0.061	0.037	0.016	0.005
0.041	0.026	0.009	0.009	0.008	0.035	0.008	0.013	0.012	0.007	0.007	0.002
0.050	0.040	0.023	0.025	0.022	0.042	0.020	0.023	0.021	0.015	0.009	0.004
0.011	0.006	0.003	0.003	0.003	0.009	0.003	0.006	0.003	0.001	0.007	0.001
0.223	0.206	0.077	0.083	0.071	0.132	0.052	0.054	0.051	0.028	0.046	0.016

The values inside the columns with % and (thousands) indicators will also be modified to reflect the exact numeric value. note\(^5\)

Annual rate of reduction in under-5 mortality rate (%)	Annual number of under-5 deaths (thousands) 2018	Annual number of neonatal deaths (thousands) 2018	Neonatal deaths as proportion of all under-5 deaths (%)	Number of deaths among children aged 5–14 (thousands)2018
4.1	74	45	60	5
6.0	0	0	74	0
2.9	24	15	62	3
4.4	0	0	50	0

#Divides % column values by 100
df_Child_Mortality<-
  df_Child_Mortality %>%
  mutate_at( vars( matches("\\%")) ,
             funs(./100))

#Multiplies "(thousands)" column values by 1000
df_Child_Mortality<-
  df_Child_Mortality %>%
  mutate_at( vars( matches("thousands")) ,
             funs(.*1000))

#rename columns
colnames(df_Child_Mortality)<-
  stringr::str_replace_all(colnames(df_Child_Mortality),
                           "\\(\\%\\)|\\(thousands\\)",
                           "")

The final table before calculating for analysis is below.

Analysis

I noted that the data set did not focus on Central or South America specifically. Mortality in these relationships are correlation just based on the label names. If a mortality rate in any aspect increases, so would the total mortality rate by country region etcetera. The analysis here is just to note the average by these two regions, not specifically accounted for in the original data set.

The calculation for the regions male & female mortality count is:

\(\sum_{i=first}^{last}(Rate\times Annual)_i\)

Where:

\(first\) & \(last\) refers to the country selected.

\(Rate\) is the specified mortality rate (male/female)

and the calculation for the Annual number of deaths is just \(\sum_{Annual\ per\ country}\)

df_CAmerica <-
  c("Belize","Costa Rica","El Salvador",
     "Guatemala","Honduras","Nicaragua","Panama")

df_CAmerica <-
  df_Child_Mortality %>%
  filter(`Countries and areas` %in% c(df_CAmerica)) %>%
    select(`Countries and areas`,
           `Under-5 mortality rate by sex  2018 (male)`,
           `Under-5 mortality rate by sex  2018 (female)`,
           `Annual number of under-5 deaths  2018`)

df_CAmerica<-
  df_CAmerica %>%
    rbind(c("Total Central America",
          sum(df_CAmerica$`Under-5 mortality rate by sex  2018 (male)`*
                df_CAmerica$`Annual number of under-5 deaths  2018`),
          sum(df_CAmerica$`Under-5 mortality rate by sex  2018 (female)`*
                df_CAmerica$`Annual number of under-5 deaths  2018`),
          sum(df_CAmerica$`Annual number of under-5 deaths  2018`)))

df_SAmerica <- 
  c("Argentina","Bolivia","Brazil","Chile","Colombia",
    "Ecuador","Guyana","Paraguay","Peru","Suriname",
    "Uruguay","Venezuela")

df_SAmerica<-
  df_Child_Mortality %>%
  filter(`Countries and areas` %in% c(df_SAmerica)) %>%
    select(`Countries and areas`,
           `Under-5 mortality rate by sex  2018 (male)`,
           `Under-5 mortality rate by sex  2018 (female)`,
           `Annual number of under-5 deaths  2018`)

df_SAmerica<-
  df_SAmerica %>%
    rbind(c("Total South America",
          sum(df_SAmerica$`Under-5 mortality rate by sex  2018 (male)`*
                df_SAmerica$`Annual number of under-5 deaths  2018`),
          sum(df_SAmerica$`Under-5 mortality rate by sex  2018 (female)`*
                df_SAmerica$`Annual number of under-5 deaths  2018`),
          sum(df_SAmerica$`Annual number of under-5 deaths  2018`)))

Central America
Countries	Male	Female	Annual
Belize	0.014	0.012	0
Costa Rica	0.01	0.008	1000
El Salvador	0.015	0.012	2000
Guatemala	0.029	0.023	11000
Honduras	0.019	0.016	4000
Nicaragua	0.02	0.016	2000
Panama	0.017	0.014	1000
Total Central America	492	395	21000

South America
Countries	Male	Female	Annual
Argentina	0.011	0.009	8000
Brazil	0.016	0.013	42000
Chile	0.008	0.007	2000
Colombia	0.016	0.013	10000
Ecuador	0.016	0.013	5000
Guyana	0.034	0.026	0
Paraguay	0.022	0.018	3000
Peru	0.016	0.013	8000
Suriname	0.021	0.017	0
Uruguay	0.008	0.007	0
Total South America	1210	985	78000

The data was reformatted during the manipulation. Utilizing pipe and cbind, it is recreated with the necessary values being numeric. The combined data frame is df_S_C_America

A visual representation of the total Annual death rate for those under 5 years old is highlighted in the plot below. Noteablly Brazil has a higher count then Central America as a whole. Brazil is a large country, however those numbers are still very high.

Unsurprisingly South America has a higher count for both genders, I would attribute this to the larger country size and high count of Brazil which resides in South America.

Education

Import Data

Data is imported from the .csv’s Education.csv from my github Project 2 folder using the function read_csv from the readr package.

df_Education <- 
  read_csv(csv_Child_Education,
                col_names = FALSE,
          na = c("NA","?","-","x"),
          skip = 7,
          n_max = 202,
          locale = readr::locale(encoding = "latin1"))

note\(^{5}\)

The Headers were again not workable as a direct import from the csv, due to several columns having multi-headers, sum as many as 4 (as shown below). All data is between the time range 2010-2018

Names were stored to the variable colname_Education and applied to the data frame. Columns comprised entirely of NA values, are removed as well.

df_Education <- 
  df_Education %>% select (everything(), -matches("[3579]$"),-matches("[1234][13579]$"),-46)

colnames(df_Education)<-
  colnames_Education

Column Names and Type

Countries and areas	character
Out of School rate - One Year before primary entry age \| male	numeric
Out of School rate -One Year before primary entry age \| female	numeric
Out of School rate -Primary Education \| male	numeric
Out of School rate -Primary Education \| female	numeric
Out of School rate -Lower secondary education \| male	numeric
Out of School rate -Lower secondary education \| female	numeric
Out of School rate -Upper secondary education \| male	numeric
Out of School rate -Upper secondary education \| female	numeric
Completion Rate - Primary education \| male	numeric
Completion Rate - Primary education \| female	numeric
Completion Rate - Lower secondary education \| male	numeric
Completion Rate - Lower secondary education \| female	numeric
Completion Rate - Upper secondary education \| male	numeric
Completion Rate - Upper secondary education \| female	numeric
Proportion of children in grade 2 or 3 achieving minimum proficiency level \| reading	numeric
Proportion of children in grade 2 or 3 achieving minimum proficiency level \| math	numeric
Proportion of children at the end of primary achieving minimum proficiency level \| reading	numeric
Proportion of children at the end of primary achieving minimum proficiency level \| math	numeric
Proportion of children at the end of lower secondary achieving minimum proficiency level \| reading	numeric
Proportion of children at the end of lower secondary achieving minimum proficiency level \| math	numeric
Youth (15–24 years) literacy rate (%) \| male	numeric
Youth (15–24 years) literacy rate (%) \| female	numeric

Manual column names were used in place of the .csv multi-headers. Names are basically a concatenation of the Main Header and Sub Headers whenever applicable.

note\(^{8}\)

Manipulating Imported Data

The df_Out_of_School data frame was created utilizing pipe (%), select(),gather(),seperate(), and arrange() functions in conjunction with regex to specify columns in the Out of School category.

df_Out_of_School<-
  df_Education %>%
  select(1,matches("^Out.*male"))%>%
    gather("Grade", "Out of School Rate", 2:9)%>%
      separate("Grade", c("Grade Level", "Gender"),
               sep = "\\| ") %>%
        arrange(`Countries and areas`)

Using the same method, the data Completion rate and Youth (15–24 years) literacy rate (%) is extracted and stored in df_Completion and df_Literacy data frames respectively.

df_Completion<-
  df_Education %>%
  select(1,matches("^(Completion)"))%>%
    gather("Grade", "Completion Rate", 2:7)%>%
      separate("Grade", c("Grade Level", "Gender"),
               sep = "\\| ") %>%
        arrange(`Countries and areas`)%>%
          select(1,2:4)

df_Literacy<-
df_Education %>%
  select(1,matches("Youth"))%>%
    gather("Key", "Literacy Rate", 2:3)%>%
      separate("Key", c("Column Name", "Gender"), sep = "\\| ") %>%
        arrange(`Countries and areas`) %>%
          select(1,2:4)

note\(^{9}\)

A simpler naming convention is preferred for Grade Level in both df_Completion and df_Out_of_School. The below function resolves that for all rows.

df_Completion<-
 data.frame(lapply(df_Completion,
                   function(x) {
                  gsub("Completion Rate - ", "", x)
              }))

df_Out_of_School<-
 data.frame(lapply(df_Out_of_School,
                   function(x) {
                  gsub("Out of School rate - |Out of School rate -|One | entry age", "", x)
              }))

Out of School Rate
Countries	Grade Levels	Gender	Rate
Afghanistan	Primary education	male	67
Afghanistan	Primary education	female	40
Afghanistan	Lower secondary education	male	49
Afghanistan	Lower secondary education	female	26
Afghanistan	Upper secondary education	male	32

Completion Rate
Countries	Grade Levels	Gender	Rate
Afghanistan	Year before primary	male	NA
Afghanistan	Year before primary	female	NA
Afghanistan	Primary Education	male	NA
Afghanistan	Primary Education	female	NA
Afghanistan	Lower secondary education	male	NA

Completion Rate
Countries	Grade Levels	Gender	Rate
Afghanistan	Youth (15–24 years) literacy rate (%)	male	62
Afghanistan	Youth (15–24 years) literacy rate (%)	female	32
Albania	Youth (15–24 years) literacy rate (%)	male	99
Albania	Youth (15–24 years) literacy rate (%)	female	99
Algeria	Youth (15–24 years) literacy rate (%)	male	NA

In order to do an analysis for a specific set and to match the number of rows (202) in df_Completion, df_Literacy and df_Out_of_School, all dataframes are filter by Upper secondary education and male. They are stored to dataframe df_Analysis.

df_Analysis<-
  df_Out_of_School %>%
      filter(str_detect(`Grade Level`, "Upper"),
             `Gender`== "male")

df_Analysis<-
  cbind(df_Analysis,
        df_Completion%>%
      filter(str_detect(`Grade Level`, "Upper"),
             `Gender`== "male")%>%
        select(4))

df_Analysis<-
  cbind(df_Analysis,
        df_Literacy%>%
      filter(`Gender`== "male")%>%
        select(4))

Analysis

A basic analysis with a linear model shows, that as Out of school rate increases, so does the literacy of males in this data set.

DATA607 Project 2

Gabriel Campos

March 17 2021

UNICEF Dataset

Assignment Requirements

Notes

Datasets

Nutrition

Import Data

Manipulating Imported Data

Analysis

Child Mortality

Import Data

Manipulating Imported Data

Analysis

Education

Import Data

Manipulating Imported Data

Analysis