The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work. Your task is to:
Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset.
You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
Perform the analysis requested in the discussion item. \(\ \ \) note\(^1\)
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
NA and ? were deliberately set to NANA columns are in odd number columns, select utilizing regex.% indicator will be divided by 100 and (thousands) multiplied by 1000 accordinglyNA, -, x and ? were deliberately set to NANA columns are in odd number columns, select utilizing regex and specifying exception column 46.| was added as a delimiter to later seperate gender.cbind was not used to bind to df_Out_of_School at this point, because the number of rows do not match..xlsx and converted to .csv. A formatting error with blank space U+0096 occurred creating an issue upon import. .csv was opened with a notepad, then copied from Nutrition A.csv to Nutrition A 1.csv correcting the error and allowing the import to occur with the transformation of unnecessary characters into NANA, -, x, and ? were deliberately set to NANA columns are in odd number columns, select utilizing regex.UTF-8 Character error resolved before import. note\(^{10}\) Data is imported from the .csv Nutrition A 1.csv from my github Project 2 folder using the function read_csv from the readr package.
df_Nutrition <-
read_csv(csv_Nutrition_A,
col_names = FALSE,
na = c("NA","?","–","x"," ",""),
skip = 7,
n_max = 202,
locale = readr::locale(encoding = "latin1"),
)note\(^{11}\)
df_Nutrition <-
df_Nutrition %>%
select (everything(), -matches("[3579]$"),-matches("[12][13579]$"))note\(^{12}\)
The Headers were not workable as a direct import from the csv, due to several columns having multi-headers as shown below
| Countries and areas | character |
| Weight at birth Low birthweight (%) 2015 | numeric |
| Weight at birth Unweighed at birth (%) 2010–2018* | numeric |
| Early initiation of breastfeeding (%) | numeric |
| Exclusive breastfeeding (<6 months)(%) | character |
| Introduction to solid, semi-solid or soft foods (6–8 months) (%) | numeric |
| Continued breastfeeding (12–23 months) (%) All children | numeric |
| Continued breastfeeding (12–23 months) (%) Poorest 20% | numeric |
| Continued breastfeeding (12–23 months) (%) richest 20% | numeric |
| Minimum diet diversity (6–23 months) (%) | numeric |
| Minimum meal frequency (6–23 months) (%) | numeric |
| Minimum acceptable diet (6–23 months) (%) | numeric |
| Zero vegetable or fruit consumption (6–23 months) (%) | numeric |
note\(^{1}\)
Paraphrased Instructions, based on post, not in the same order, are as follows:
blank spaces and NA’s from the data set. (PARTIALLY COMPLETE ON IMPORT)According to WorldPopulationReview.com Latin countries consist of.
Which is stored in the string str_latin_countries to filter with.
str_latin_countries<-"Argentina|Bolivia|Brazil|Chile|
Colombia|Costa Rica|Cuba|Dominican Republic|Ecuador|
El Salvador|Guatemala|Honduras|Mexico|Nicaragua|Panama|
Paraguay|Peru|Puerto Rico|Uruguay|Venezuela"The required columns and rows are selected using pipe %>%, filter(), and select() functions.
Using str_latin_countries, the correct countries are specified.
Renaming takes place in accordance with instrucitons.
The resulting in the table is below.
df_Nutrition_Analysis<-df_Nutrition%>%
filter(str_detect(`Countries and areas`, str_latin_countries))%>%
select(1,2, 6,13) %>%
rename("Countries" = `Countries and areas`,
"Low Birth Weight" = `Weight at birth Low birthweight (%) 2015`,
"Intro to Solids"=`Introduction to solid, semi-solid or soft foods (6–8 months) (%)`,
"Zero (Vegetables|Fruits)"=`Zero vegetable or fruit consumption (6–23 months) (%)`)| Countries | Low Birth Weight | Intro to Solids | Zero (Vegetables|Fruits) |
|---|---|---|---|
| Argentina | 7 | 97 | NA |
| Bolivia (Plurinational State of) | 7 | 81 | 20 |
| Brazil | 8 | 94 | NA |
| Chile | 6 | NA | NA |
| Costa Rica | 7 | 90 | NA |
| Cuba | 5 | 91 | 27 |
| Dominican Republic | 11 | 81 | 35 |
| Ecuador | 11 | 74 | NA |
| Guatemala | 11 | 80 | 27 |
| Honduras | 11 | 86 | 36 |
| Mexico | 8 | 82 | 18 |
| Nicaragua | 11 | 89 | NA |
| Panama | 10 | 78 | NA |
| Peru | 9 | 95 | 7 |
| Uruguay | 8 | NA | NA |
| Venezuela (Bolivarian Republic of) | 9 | NA | NA |
removing NA values reduces the rows from 16 to 7.
| Countries | Low Birth Weight | Intro to Solids | Zero (Vegetables|Fruits) |
|---|---|---|---|
| Bolivia (Plurinational State of) | 7 | 81 | 20 |
| Cuba | 5 | 91 | 27 |
| Dominican Republic | 11 | 81 | 35 |
| Guatemala | 11 | 80 | 27 |
| Honduras | 11 | 86 | 36 |
| Mexico | 8 | 82 | 18 |
| Peru | 9 | 95 | 7 |
Use of pivot_longer and pivot_wider did very little to enhance understanding of the data. Instead a linear model that shows a positive predictor in Low Birth weight from the Zero (Vegetable | Fruit) variable and a negative one with Intro to Solids.
## # A tibble: 5 x 1
## `Low Birth Weight`
## <dbl>
## 1 7
## 2 5
## 3 11
## 4 8
## 5 9
## # A tibble: 7 x 5
## Countries `Low Birth Weight` `Intro to Solid~ name value
## <chr> <dbl> <dbl> <chr> <dbl>
## 1 Bolivia (Plurination~ 7 81 Zero (Vegetab~ 20
## 2 Cuba 5 91 Zero (Vegetab~ 27
## 3 Dominican Republic 11 81 Zero (Vegetab~ 35
## 4 Guatemala 11 80 Zero (Vegetab~ 27
## 5 Honduras 11 86 Zero (Vegetab~ 36
## 6 Mexico 8 82 Zero (Vegetab~ 18
## 7 Peru 9 95 Zero (Vegetab~ 7
Data is imported from the .csv’s Child Mortality.csv from my github Project 2 folder using the function read_csv from the readr package.
df_Child_Mortality <-
read_csv(csv_Child_Mortality,
col_names = FALSE,
na = c("NA","?"),
skip = 5,
n_max = 202,
locale = readr::locale(encoding = "latin1"))note\(^{2\&3}\)
The Headers were not workable as a direct import from the csv, due to several columns having multi-headers as shown below
Names were stored to the variable colname_Child_Mortality and applied to the data frame. Columns comprised entirely of NA values, are removed as well.
df_Child_Mortality <-
df_Child_Mortality %>%
select (everything(),
-matches("[3579]$"),
-matches("[123][13579]$"))notes\(^{4}\)
| Countries and areas | character |
| Under-5 mortality rate (deaths per 1,000 live births) 1990 | numeric |
| Under-5 mortality rate (deaths per 1,000 live births) 2000 | numeric |
| Under-5 mortality rate (deaths per 1,000 live births) 2018 | numeric |
| Annual rate of reduction in under-5 mortality rate (%) | numeric |
| Under-5 mortality rate by sex (deaths per 1,000 live births) 2018 (male) | numeric |
| Under-5 mortality rate by sex (deaths per 1,000 live births) 2018 (female) | numeric |
| Infant mortality rate (deaths per 1,000 live births) 1990 | numeric |
| Infant mortality rate (deaths per 1,000 live births) 2018 | numeric |
| Neonatal mortality rate (deaths per 1,000 live births) 1990 | numeric |
| Neonatal mortality rate (deaths per 1,000 live births) 2000 | numeric |
| Neonatal mortality rate (deaths per 1,000 live births) 2018 | numeric |
| Probability of dying among children aged 5–14 (deaths per 1,000 children aged 5) 1990 | numeric |
| Probability of dying among children aged 5–14 (deaths per 1,000 children aged 5) 2018 | numeric |
| Annual number of under-5 deaths (thousands) 2018 | numeric |
| Annual number of neonatal deaths (thousands) 2018 | numeric |
| Neonatal deaths as proportion of all under-5 deaths (%) | numeric |
| Number of deaths among children aged 5–14 (thousands)2018 | numeric |
The existing data frame df_Child_Mortality is a strong representation of the data stored, as is, on the .csv. Many values in the data set are specified in the header as (deaths per 1,000 lives births) however.
In order to simplify the values and names accordingly, I divide the values in those columns by 1000 and remove the comments from each column name.
#copy integer valued table to address formatting issues with kableExtra:: and reactable::
df_int_Child_Mortality<-
df_Child_Mortality
#located index of 1,000 columns for formatting with kableExtra:: and reactable::
column_1000_index<-
grep("1,000 ",colnames(df_int_Child_Mortality))
#Divide values by 1000
df_Child_Mortality<-
df_Child_Mortality %>%
mutate_at( vars( matches("\\(deaths per 1,000 live births)")) ,
funs(./1000))#rename columns
colnames(df_Child_Mortality)<-
str_replace_all(colnames(df_Child_Mortality),
pattern = "\\(deaths per 1,000 live births\\)|\\(deaths per 1,000 children aged 5\\)",
replacement = "")#change type back to numeric after renaming columns converted it into `num`
df_Child_Mortality[] <-
lapply(df_Child_Mortality,
function(x) {
if(is.factor(x)) round(as.numeric(as.character(x)),3) else x
})| Under-5 mortality rate 1990 | Under-5 mortality rate 2000 | Under-5 mortality rate 2018 | Under-5 mortality rate by sex 2018 (male) | Under-5 mortality rate by sex 2018 (female) | Infant mortality rate 1990 | Infant mortality rate 2018 | Neonatal mortality rate 1990 | Neonatal mortality rate 2000 | Neonatal mortality rate 2018 | Probability of dying among children aged 5–14 1990 | Probability of dying among children aged 5–14 2018 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.179 | 0.129 | 0.062 | 0.066 | 0.059 | 0.121 | 0.048 | 0.075 | 0.061 | 0.037 | 0.016 | 0.005 |
| 0.041 | 0.026 | 0.009 | 0.009 | 0.008 | 0.035 | 0.008 | 0.013 | 0.012 | 0.007 | 0.007 | 0.002 |
| 0.050 | 0.040 | 0.023 | 0.025 | 0.022 | 0.042 | 0.020 | 0.023 | 0.021 | 0.015 | 0.009 | 0.004 |
| 0.011 | 0.006 | 0.003 | 0.003 | 0.003 | 0.009 | 0.003 | 0.006 | 0.003 | 0.001 | 0.007 | 0.001 |
| 0.223 | 0.206 | 0.077 | 0.083 | 0.071 | 0.132 | 0.052 | 0.054 | 0.051 | 0.028 | 0.046 | 0.016 |
The values inside the columns with % and (thousands) indicators will also be modified to reflect the exact numeric value. note\(^5\)
| Annual rate of reduction in under-5 mortality rate (%) | Annual number of under-5 deaths (thousands) 2018 | Annual number of neonatal deaths (thousands) 2018 | Neonatal deaths as proportion of all under-5 deaths (%) | Number of deaths among children aged 5–14 (thousands)2018 |
|---|---|---|---|---|
| 4.1 | 74 | 45 | 60 | 5 |
| 6.0 | 0 | 0 | 74 | 0 |
| 2.9 | 24 | 15 | 62 | 3 |
| 4.4 | 0 | 0 | 50 | 0 |
#Divides % column values by 100
df_Child_Mortality<-
df_Child_Mortality %>%
mutate_at( vars( matches("\\%")) ,
funs(./100))
#Multiplies "(thousands)" column values by 1000
df_Child_Mortality<-
df_Child_Mortality %>%
mutate_at( vars( matches("thousands")) ,
funs(.*1000))#rename columns
colnames(df_Child_Mortality)<-
stringr::str_replace_all(colnames(df_Child_Mortality),
"\\(\\%\\)|\\(thousands\\)",
"")
The final table before calculating for analysis is below.
I noted that the data set did not focus on Central or South America specifically. Mortality in these relationships are correlation just based on the label names. If a mortality rate in any aspect increases, so would the total mortality rate by country region etcetera. The analysis here is just to note the average by these two regions, not specifically accounted for in the original data set.
The calculation for the regions male & female mortality count is:
\(\sum_{i=first}^{last}(Rate\times Annual)_i\)
Where:
\(first\) & \(last\) refers to the country selected.
\(Rate\) is the specified mortality rate (male/female)
and the calculation for the Annual number of deaths is just \(\sum_{Annual\ per\ country}\)
df_CAmerica <-
c("Belize","Costa Rica","El Salvador",
"Guatemala","Honduras","Nicaragua","Panama")
df_CAmerica <-
df_Child_Mortality %>%
filter(`Countries and areas` %in% c(df_CAmerica)) %>%
select(`Countries and areas`,
`Under-5 mortality rate by sex 2018 (male)`,
`Under-5 mortality rate by sex 2018 (female)`,
`Annual number of under-5 deaths 2018`)
df_CAmerica<-
df_CAmerica %>%
rbind(c("Total Central America",
sum(df_CAmerica$`Under-5 mortality rate by sex 2018 (male)`*
df_CAmerica$`Annual number of under-5 deaths 2018`),
sum(df_CAmerica$`Under-5 mortality rate by sex 2018 (female)`*
df_CAmerica$`Annual number of under-5 deaths 2018`),
sum(df_CAmerica$`Annual number of under-5 deaths 2018`)))df_SAmerica <-
c("Argentina","Bolivia","Brazil","Chile","Colombia",
"Ecuador","Guyana","Paraguay","Peru","Suriname",
"Uruguay","Venezuela")
df_SAmerica<-
df_Child_Mortality %>%
filter(`Countries and areas` %in% c(df_SAmerica)) %>%
select(`Countries and areas`,
`Under-5 mortality rate by sex 2018 (male)`,
`Under-5 mortality rate by sex 2018 (female)`,
`Annual number of under-5 deaths 2018`)
df_SAmerica<-
df_SAmerica %>%
rbind(c("Total South America",
sum(df_SAmerica$`Under-5 mortality rate by sex 2018 (male)`*
df_SAmerica$`Annual number of under-5 deaths 2018`),
sum(df_SAmerica$`Under-5 mortality rate by sex 2018 (female)`*
df_SAmerica$`Annual number of under-5 deaths 2018`),
sum(df_SAmerica$`Annual number of under-5 deaths 2018`)))| Countries | Male | Female | Annual |
|---|---|---|---|
| Belize | 0.014 | 0.012 | 0 |
| Costa Rica | 0.01 | 0.008 | 1000 |
| El Salvador | 0.015 | 0.012 | 2000 |
| Guatemala | 0.029 | 0.023 | 11000 |
| Honduras | 0.019 | 0.016 | 4000 |
| Nicaragua | 0.02 | 0.016 | 2000 |
| Panama | 0.017 | 0.014 | 1000 |
| Total Central America | 492 | 395 | 21000 |
| Countries | Male | Female | Annual |
|---|---|---|---|
| Argentina | 0.011 | 0.009 | 8000 |
| Brazil | 0.016 | 0.013 | 42000 |
| Chile | 0.008 | 0.007 | 2000 |
| Colombia | 0.016 | 0.013 | 10000 |
| Ecuador | 0.016 | 0.013 | 5000 |
| Guyana | 0.034 | 0.026 | 0 |
| Paraguay | 0.022 | 0.018 | 3000 |
| Peru | 0.016 | 0.013 | 8000 |
| Suriname | 0.021 | 0.017 | 0 |
| Uruguay | 0.008 | 0.007 | 0 |
| Total South America | 1210 | 985 | 78000 |
The data was reformatted during the manipulation. Utilizing pipe and cbind, it is recreated with the necessary values being numeric. The combined data frame is df_S_C_America
A visual representation of the total Annual death rate for those under 5 years old is highlighted in the plot below. Noteablly Brazil has a higher count then Central America as a whole. Brazil is a large country, however those numbers are still very high.
Unsurprisingly South America has a higher count for both genders, I would attribute this to the larger country size and high count of Brazil which resides in South America.
Data is imported from the .csv’s Education.csv from my github Project 2 folder using the function read_csv from the readr package.
df_Education <-
read_csv(csv_Child_Education,
col_names = FALSE,
na = c("NA","?","-","x"),
skip = 7,
n_max = 202,
locale = readr::locale(encoding = "latin1"))note\(^{5}\)
The Headers were again not workable as a direct import from the csv, due to several columns having multi-headers, sum as many as 4 (as shown below). All data is between the time range 2010-2018
Names were stored to the variable colname_Education and applied to the data frame. Columns comprised entirely of NA values, are removed as well.
df_Education <-
df_Education %>% select (everything(), -matches("[3579]$"),-matches("[1234][13579]$"),-46)| Countries and areas | character |
| Out of School rate - One Year before primary entry age | male | numeric |
| Out of School rate -One Year before primary entry age | female | numeric |
| Out of School rate -Primary Education | male | numeric |
| Out of School rate -Primary Education | female | numeric |
| Out of School rate -Lower secondary education | male | numeric |
| Out of School rate -Lower secondary education | female | numeric |
| Out of School rate -Upper secondary education | male | numeric |
| Out of School rate -Upper secondary education | female | numeric |
| Completion Rate - Primary education | male | numeric |
| Completion Rate - Primary education | female | numeric |
| Completion Rate - Lower secondary education | male | numeric |
| Completion Rate - Lower secondary education | female | numeric |
| Completion Rate - Upper secondary education | male | numeric |
| Completion Rate - Upper secondary education | female | numeric |
| Proportion of children in grade 2 or 3 achieving minimum proficiency level | reading | numeric |
| Proportion of children in grade 2 or 3 achieving minimum proficiency level | math | numeric |
| Proportion of children at the end of primary achieving minimum proficiency level | reading | numeric |
| Proportion of children at the end of primary achieving minimum proficiency level | math | numeric |
| Proportion of children at the end of lower secondary achieving minimum proficiency level | reading | numeric |
| Proportion of children at the end of lower secondary achieving minimum proficiency level | math | numeric |
| Youth (15–24 years) literacy rate (%) | male | numeric |
| Youth (15–24 years) literacy rate (%) | female | numeric |
Manual column names were used in place of the .csv multi-headers. Names are basically a concatenation of the Main Header and Sub Headers whenever applicable.
note\(^{8}\)
The df_Out_of_School data frame was created utilizing pipe (%), select(),gather(),seperate(), and arrange() functions in conjunction with regex to specify columns in the Out of School category.
df_Out_of_School<-
df_Education %>%
select(1,matches("^Out.*male"))%>%
gather("Grade", "Out of School Rate", 2:9)%>%
separate("Grade", c("Grade Level", "Gender"),
sep = "\\| ") %>%
arrange(`Countries and areas`)Using the same method, the data Completion rate and Youth (15–24 years) literacy rate (%) is extracted and stored in df_Completion and df_Literacy data frames respectively.
df_Completion<-
df_Education %>%
select(1,matches("^(Completion)"))%>%
gather("Grade", "Completion Rate", 2:7)%>%
separate("Grade", c("Grade Level", "Gender"),
sep = "\\| ") %>%
arrange(`Countries and areas`)%>%
select(1,2:4)
df_Literacy<-
df_Education %>%
select(1,matches("Youth"))%>%
gather("Key", "Literacy Rate", 2:3)%>%
separate("Key", c("Column Name", "Gender"), sep = "\\| ") %>%
arrange(`Countries and areas`) %>%
select(1,2:4)note\(^{9}\)
A simpler naming convention is preferred for Grade Level in both df_Completion and df_Out_of_School. The below function resolves that for all rows.
df_Completion<-
data.frame(lapply(df_Completion,
function(x) {
gsub("Completion Rate - ", "", x)
}))df_Out_of_School<-
data.frame(lapply(df_Out_of_School,
function(x) {
gsub("Out of School rate - |Out of School rate -|One | entry age", "", x)
}))| Countries | Grade Levels | Gender | Rate |
|---|---|---|---|
| Afghanistan | Primary education | male | 67 |
| Afghanistan | Primary education | female | 40 |
| Afghanistan | Lower secondary education | male | 49 |
| Afghanistan | Lower secondary education | female | 26 |
| Afghanistan | Upper secondary education | male | 32 |
| Countries | Grade Levels | Gender | Rate |
|---|---|---|---|
| Afghanistan | Year before primary | male | NA |
| Afghanistan | Year before primary | female | NA |
| Afghanistan | Primary Education | male | NA |
| Afghanistan | Primary Education | female | NA |
| Afghanistan | Lower secondary education | male | NA |
| Countries | Grade Levels | Gender | Rate |
|---|---|---|---|
| Afghanistan | Youth (15–24 years) literacy rate (%) | male | 62 |
| Afghanistan | Youth (15–24 years) literacy rate (%) | female | 32 |
| Albania | Youth (15–24 years) literacy rate (%) | male | 99 |
| Albania | Youth (15–24 years) literacy rate (%) | female | 99 |
| Algeria | Youth (15–24 years) literacy rate (%) | male | NA |
In order to do an analysis for a specific set and to match the number of rows (202) in df_Completion, df_Literacy and df_Out_of_School, all dataframes are filter by Upper secondary education and male. They are stored to dataframe df_Analysis.
df_Analysis<-
df_Out_of_School %>%
filter(str_detect(`Grade Level`, "Upper"),
`Gender`== "male")
df_Analysis<-
cbind(df_Analysis,
df_Completion%>%
filter(str_detect(`Grade Level`, "Upper"),
`Gender`== "male")%>%
select(4))
df_Analysis<-
cbind(df_Analysis,
df_Literacy%>%
filter(`Gender`== "male")%>%
select(4))A basic analysis with a linear model shows, that as Out of school rate increases, so does the literacy of males in this data set.