Thoughts (or Assumptions):
The code is hopefully again generic enough to allow minimal change if the dataset grows (e.g., more earnings brackets in the one, or more countries in the other). Also, like in the disability dataset, the total rows were left from the original in the earnings tidy dataset, and that is a compromise, as it is makes some analysis easier, and some harder.
The code is below, it is functionalized to help break up what is happening. Below that we will execute the code and provide commentary.
First the earnings core code of functions:
initialize <-function()
{
library(stringr)
library(tidyr)
library(dplyr)
library(plyr)
}
getEarningsAsDF <- function(filename = "earnings.csv")
{
df <- read.csv(filename, stringsAsFactors = FALSE, skip =5, header=FALSE)
#make col3 = col2 and so forth by 2
df[1,seq(3, length(df), by = 2)] <- df[1,seq(2, length(df), by = 2)]
df[1,] <- str_c(df[1,], " ", df[2,])
df <- df[-2,]
df <- df[1:39,]
df[1,1] <- "Ages"
colnames(df) <- df[1,]
return(df)
}
getEarningsAsTidyDF <- function(df)
{
df <- filter(df, Ages != "") #remove blank rows
df <- df[-1,]
# add column at end containg gender of corresponding rows
df["Sex"] <- unlist(list(rep("B", nrow(df)/3),rep("M", nrow(df)/3),rep("F", nrow(df)/3)))
##make df with just "number cols"
dfNums <- select(df, Ages, Sex, contains("number"))
##make df with just "pct cols"
dfPct <- select(df, Ages, Sex, contains("percent"))
#rotate both of them
dfEarnNumRot <- gather(dfNums, Income, Amt, -Ages, -Sex)
dfEarnPctRot <- gather(dfPct, Income, Pct, -Ages, - Sex)
#put those rotated columns together
dfEarnNumPct <- bind_cols(dfEarnNumRot, dfEarnPctRot[length(dfEarnPctRot)])
#some general text cleanup
dfEarnNumPct$Income <- str_replace(dfEarnNumPct$Income, "Number", "")
dfEarnNumPct$Ages <- str_replace(dfEarnNumPct$Ages, "Both sexes|Male|Female", "Total")
dfEarnNumPct$Amt <- (lapply(dfEarnNumPct$Amt, function(x) {str_replace_all(x, ",*", "")}))
return(dfEarnNumPct)
}
First we call code to load the data into a basic data frame, and clean up of some extraneous data. A subset of the original csv is shown, and then a subset of the partially tidied file.
initialize()
## Warning: package 'tidyr' was built under R version 3.3.1
## Warning: package 'dplyr' was built under R version 3.3.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Warning: package 'plyr' was built under R version 3.3.1
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
df <- getEarningsAsDF()
read.csv("earnings.csv")[1:6,1:12]
## table.with.row.headers.in.column.A.and.column.headers.in.rows.5.to.7..Leading.dots.indicate.subparts.
## 1 Table 17. Earnings of Full-Time, Year-Round Workers 15 Years and Over by Sex and Age: 2011
## 2 (Numbers in thousands. Civilian noninstitutionalized population1)
## 3
## 4 Sex and age
## 5
## 6
## X X.1 X.2 X.3 X.4 X.5
## 1
## 2
## 3
## 4 Earnings2
## 5 Total Under $5,000 $5,000 to $9,999
## 6 Number Percent Number Percent Number Percent
## X.6 X.7 X.8 X.9
## 1
## 2
## 3
## 4
## 5 $10,000 to $14,999 $15,000 to $19,999
## 6 Number Percent Number Percent
## X.10
## 1
## 2
## 3
## 4
## 5 $20,000 to $24,999
## 6 Number
df[1:5,1:10]
## Ages Total Number Total Percent Under $5,000 Number
## 1 Ages Total Number Total Percent Under $5,000 Number
## 3 Both sexes 101,676 100.0 646
## 4 .15 to 17 years 119 100.0 10
## 5 .18 to 24 years 6,392 100.0 105
## 6 .25 to 29 years 10,957 100.0 41
## Under $5,000 Percent $5,000 to $9,999 Number
## 1 Under $5,000 Percent $5,000 to $9,999 Number
## 3 0.6 1,183
## 4 8.7 9
## 5 1.6 257
## 6 0.4 132
## $5,000 to $9,999 Percent $10,000 to $14,999 Number
## 1 $5,000 to $9,999 Percent $10,000 to $14,999 Number
## 3 1.2 3,538
## 4 7.6 32
## 5 4.0 760
## 6 1.2 424
## $10,000 to $14,999 Percent $15,000 to $19,999 Number
## 1 $10,000 to $14,999 Percent $15,000 to $19,999 Number
## 3 3.5 6,289
## 4 26.9 14
## 5 11.9 1,185
## 6 3.9 858
Next we will do final tidying by rotating some of the income data and adding a gender column:
dfTidy <- getEarningsAsTidyDF(df)
dfTidy[1:10,]
## Ages Sex Income Amt Pct
## 1 Total B Total 101676 100.0
## 2 .15 to 17 years B Total 119 100.0
## 3 .18 to 24 years B Total 6392 100.0
## 4 .25 to 29 years B Total 10957 100.0
## 5 .30 to 34 years B Total 11656 100.0
## 6 .35 to 39 years B Total 11525 100.0
## 7 .40 to 44 years B Total 12767 100.0
## 8 .45 to 49 years B Total 13086 100.0
## 9 .50 to 54 years B Total 13187 100.0
## 10 .55 to 59 years B Total 10945 100.0
Some Analysis:
First male/female ratios of high and low income brackets, clearly showing that in these subset men are disprortionally representing in the higher income and the opposite for the lower income bracket.
#Percantages
maleHiIncome <- filter(dfTidy, Sex == "M" & str_trim(Income) == "$100,000 and over")[,5]
femaleHiIncome <- filter(dfTidy, Sex == "F" & str_trim(Income) == "$100,000 and over")[,5]
as.numeric(maleHiIncome)/as.numeric(femaleHiIncome)
## Warning: NAs introduced by coercion
## [1] 2.437500 NA 7.000000 2.142857 1.921569 2.064935 2.358974
## [8] 2.684932 2.716049 2.344828 2.810811 3.500000
maleLowIncome <- filter(dfTidy, Sex == "M" & str_trim(Income) == "$10,000 to $14,999")[,5]
femaleLowIncome <- filter(dfTidy, Sex == "F" & str_trim(Income) == "$10,000 to $14,999")[,5]
as.numeric(maleLowIncome)/as.numeric(femaleLowIncome)
## [1] 0.6511628 1.3539823 0.8059701 0.6326531 0.6315789 0.6052632 0.4722222
## [8] 0.6470588 0.5151515 0.5882353 0.6451613 0.8431373
#Absolute numbers
maleHiIncome <- filter(dfTidy, Sex == "M" & str_trim(Income) == "$100,000 and over")[,4]
femaleHiIncome <- filter(dfTidy, Sex == "F" & str_trim(Income) == "$100,000 and over")[,4]
as.numeric(maleHiIncome)/as.numeric(femaleHiIncome)
## Warning: NAs introduced by coercion
## [1] 3.249012 NA 7.428571 2.774510 2.699187 2.983425 3.130841
## [8] 3.446301 3.405920 2.961995 3.413043 5.642105
maleLowIncome <- filter(dfTidy, Sex == "M" & str_trim(Income) == "$10,000 to $14,999")[,4]
femaleLowIncome <- filter(dfTidy, Sex == "F" & str_trim(Income) == "$10,000 to $14,999")[,4]
as.numeric(maleLowIncome)/as.numeric(femaleLowIncome)
## [1] 0.8670185 1.6666667 1.0765027 0.8354978 0.8709677 0.8707865 0.6428571
## [8] 0.8281250 0.6512821 0.7423313 0.7812500 1.3875000
If we look for the most common income bracket for male and females, we again find that for males the income bracket is again higher than for females.
idx <- which.max(as.numeric(filter(dfTidy, str_trim(Ages) == ".45 to 49 years" & Sex == "M"
& str_trim(Income) != "Total")[,5]))
filter(dfTidy, str_trim(Ages) == ".45 to 49 years" & Sex == "M" & str_trim(Income) != "Total")[idx,]
## Ages Sex Income Amt Pct
## 8 .45 to 49 years M $50,000 to $74,999 1667 22.6
idx <- which.max(as.numeric(filter(dfTidy, str_trim(Ages) == ".45 to 49 years" & Sex == "F"
& str_trim(Income) != "Total")[,5]))
filter(dfTidy, str_trim(Ages) == ".45 to 49 years" & Sex == "F" & str_trim(Income) != "Total")[idx,]
## Ages Sex Income Amt Pct
## 7 .45 to 49 years F $35,000 to $49,999 1362 23.9
Now the code for the child mortality.
initialize <-function()
{
library(stringr)
library(tidyr)
library(dplyr)
library(plyr)
}
Get7ColsGatheredAndDrop4C <- function(df)
{
dfGather <- df[,1:7] #take 1st 7 cols
dfGather <- gather(dfGather, ColKey, ColVal, -1:-3) #gather around first 3 cols
df <- df[,-4:-7] #drop col 4-7 in original df
return(list(dfGather, df)) #return list of 2 dataframes one "gathered,
#one original with columns processed removed
}
#reads in filename and creates uncluttered DF (i.e., extraneous lines and columns and so forth removed)
GetMortalityDF <- function(filename = "SexSpecificMortality.csv")
{
dfCMort<- read.csv(filename, stringsAsFactors = FALSE, skip =6, header=FALSE)
dfCMort <-dfCMort[,1:19] ##drop trailing empty cols which were part of .csv
dfCMort[1,4:length(dfCMort)] <- unlist(rep(dfCMort[1, seq(4, length(dfCMort), 4)], each=4)) ##assign blank header values
dfCMort[1,] <- str_c(dfCMort[1,], dfCMort[2,])
colnames(dfCMort) <- (dfCMort[1,])
dfCMort <- dfCMort[c(-1,-2),]
dfCMort <- subset(dfCMort, Country != "") #drop closing comments which are in col 1
return(dfCMort)
}
#makes the df tidy for analysis
GetaTidyDeathDF <-function(df)
{
dfTidy <- NULL
dfWorking <- df
# this is going to loop through the four sections and with one section at a time
while(length(dfWorking) >= 7)
{
#get one section of data
listDF <- Get7ColsGatheredAndDrop4C(dfWorking)
if (is.null(dfTidy)) #i.e., first time thru loop
{
#make temp col names of male1990, male2000 or female 1990 etc.
cNames <- c(str_extract(colnames(dfWorking)[1:4],"[[:alpha:]]*"),
str_c(str_extract(colnames(dfWorking)[4],"[[:alpha:]]*"), " Values"))
##dfTidy gets fist df in list...the "tidy" one
dfTidy <- data.frame(listDF[1])
}
else
{
cNames <- c(colnames(dfTidy),
str_c(str_extract(colnames(dfWorking)[4],"[[:alpha:]]*"), " Values"))
#add two rotated columns
#dfTidy <- bind_cols(dfTidy, data.frame(listDF[1])[,5]) does not work???, so add 2 cols and delete one
dfTidy <- bind_cols(dfTidy, data.frame(listDF[1])[,4:5])
dfTidy <- dfTidy[,-(length(dfTidy)-1)]
}
colnames(dfTidy) <- cNames
dfWorking <- data.frame(listDF[2]) #dfworking is original with cols slowly being removed as processed
}
colnames(dfTidy)[4] = "Year"
colnames(dfTidy)[5] = "Mort Male"
colnames(dfTidy)[6] = "Mort Female"
colnames(dfTidy)[6] = "Sex Ratio"
dfTidy$Year <- str_replace(dfTidy$Year, "Male", "")
return(dfTidy)
}
Again we will initialize, and reading the .csv file, doing some basic clean up work, and we will display subsets fo the original and the partially tidied file.
initialize()
dfPartTidy <- GetMortalityDF()
read.csv("SexSpecificMortality.csv")[1:6,1:6]
## Child.Mortality.Estimates
## 1 Sex-specific under-five mortality rate
## 2 Estimates generated by the UN Inter-agency Group for Child Mortality Estimation (IGME) in 2015
## 3 downloaded from http://www.childmortality.org
## 4 Last update: 9 Sep 2015
## 5 Contact: childmortality@unicef.org
## 6 ISO Code
## X X.1 X.2 X.3 X.4
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 Country Uncertainty \nbounds* Male NA NA
dfPartTidy[1:6,1:6]
## ISO Code Country Uncertainty \nbounds* Male1990 Male2000 Male2010
## 3 AFG Afghanistan Lower 167.2 129.3 94.7
## 4 AFG Afghanistan Median 185.5 141.1 108.4
## 5 AFG Afghanistan Upper 207.3 154.7 124.6
## 6 AGO Angola Lower 210.3 198.5 136
## 7 AGO Angola Median 236.3 226.7 190.9
## 8 AGO Angola Upper 266.9 258.2 260.5
Next we will do the real work of tidying. This involves looping through the wide .csv, cutting up and working on one section of the 4 major columns at a time, to rotate the four minor columns under each of the major ones.
dfTidy <-GetaTidyDeathDF(dfPartTidy)
dfTidy[1:10,]
## ISO Country Uncertainty Year Mort Male Sex Ratio Sex Values
## 1 AFG Afghanistan Lower 1990 167.2 158.4 1.01
## 2 AFG Afghanistan Median 1990 185.5 176.2 1.05
## 3 AFG Afghanistan Upper 1990 207.3 196.7 1.1
## 4 AGO Angola Lower 1990 210.3 191.6 1.06
## 5 AGO Angola Median 1990 236.3 215.2 1.1
## 6 AGO Angola Upper 1990 266.9 243.3 1.14
## 7 ALB Albania Lower 1990 39.1 32 1.14
## 8 ALB Albania Median 1990 44.5 36.5 1.22
## 9 ALB Albania Upper 1990 50.3 41.3 1.31
## 10 AND Andorra Lower 1990 5.6 4.6 1.13
## P Values
## 1 0.92
## 2 0.96
## 3 1
## 4 0.97
## 5 1.01
## 6 1.05
## 7 0.93
## 8 1
## 9 1.07
## 10 0.93