Assignment Overview

This is parts 2 and 3 of the assignment (i.e., the 2nd and 3rd tidying efforts…)

Again as stated in the part 1 RMD file, this project’s main focus was to read in different data sets in various states of “messiness” and perform “Data Wrangling” to make them “tidy”, so as to allow easier analysis upon the data. Also, as part of the tasks, was to perform some rudimentary analytics upon the cleaned up data. Lastly, the assignment was to help practice in using some of dplyr tidyr modules.

The specifics…

Part 2, read in a wide .csv earnings file that contains counts of people per an earning groups (e.g., 20k-24.99k), further grouped by age and sexes. Then make it tidy and lastly do analysis to determine some gender inequalities in income. This dataset was from Ahsanul Choudhury’s submission.

Part 3, read in a file regarding child mortality (under 5 years of age) per country, via male, female, sex ratio and P factor, further divided by a series of years. Then take it and make it tidy for analysis, and do some analyis on which countries have best/worst mortality, average (mean) mortality, trend and so forth. This data set was from Upal Chowdhury’s submission.

The Earnings File

This file contains columns of data by an earning group (under 5k, 5k-9.99k, etc.) with one column as absolute count and the other the percentage that column is to the row as a whole. The rows of the dataset are 3 groups, male, female, and both. Each of those are bracketed into age groups, typical 5 years, but with variation for young and old wokers. The tidying will be to take the columns of income and rotate them into 3 columns, the income group, with the absolute number and percent columns adjacent to that column. Next the rowwise grouping of sexes will be made into a column adjacent to the other 3. Lastly the age groups that were grouped by sex will be left as a variable column (less the rotated sexes).

The Child Mortality File

This file contains the mortality of children under 5 for most (if not all countries in the world). The first 3 columns (country code, county, and uncertaintity) will be let alone, as those columns form nice variables. The other columns contain 4 groupings by male, female, sex ration, and P factor, each subdivided by 4 years (1990,2000,2010,2015). These 4 sets of columns will be condensed into just 4 columns of the groupings, with the year being rotated into a “year” column.

Thoughts (or Assumptions):

The code is hopefully again generic enough to allow minimal change if the dataset grows (e.g., more earnings brackets in the one, or more countries in the other). Also, like in the disability dataset, the total rows were left from the original in the earnings tidy dataset, and that is a compromise, as it is makes some analysis easier, and some harder.

The code is below, it is functionalized to help break up what is happening. Below that we will execute the code and provide commentary.

First the earnings core code of functions:

initialize <-function()
{
    library(stringr)
    library(tidyr)
    library(dplyr)
    library(plyr)
}


getEarningsAsDF <- function(filename = "earnings.csv")
{

    df <- read.csv(filename, stringsAsFactors = FALSE, skip =5, header=FALSE)
    
    #make col3 = col2 and so forth by 2
    df[1,seq(3, length(df), by = 2)] <- df[1,seq(2, length(df), by = 2)]
    df[1,] <- str_c(df[1,], " ", df[2,])
    df <- df[-2,]
    df <- df[1:39,]
    df[1,1] <- "Ages"
    colnames(df) <- df[1,]
    return(df)
}


getEarningsAsTidyDF <- function(df)
{
    df <- filter(df, Ages != "")   #remove blank rows
    df <- df[-1,]
    
    # add column at end containg gender of corresponding rows
    df["Sex"] <- unlist(list(rep("B", nrow(df)/3),rep("M", nrow(df)/3),rep("F", nrow(df)/3)))
    
    ##make df with just "number cols"
    dfNums <- select(df, Ages, Sex, contains("number"))
    
    ##make df with just  "pct cols"
    dfPct <- select(df, Ages, Sex, contains("percent"))
    
    #rotate both of them
    dfEarnNumRot <- gather(dfNums, Income, Amt, -Ages, -Sex)
    dfEarnPctRot <- gather(dfPct, Income, Pct, -Ages, - Sex)
    
    #put those rotated columns together
    dfEarnNumPct <- bind_cols(dfEarnNumRot, dfEarnPctRot[length(dfEarnPctRot)])
    
    #some general text cleanup
    dfEarnNumPct$Income <- str_replace(dfEarnNumPct$Income, "Number", "")
    dfEarnNumPct$Ages <- str_replace(dfEarnNumPct$Ages, "Both sexes|Male|Female", "Total")
    dfEarnNumPct$Amt <- (lapply(dfEarnNumPct$Amt, function(x) {str_replace_all(x, ",*", "")}))
    return(dfEarnNumPct)
}

First we call code to load the data into a basic data frame, and clean up of some extraneous data. A subset of the original csv is shown, and then a subset of the partially tidied file.

initialize()
## Warning: package 'tidyr' was built under R version 3.3.1
## Warning: package 'dplyr' was built under R version 3.3.1
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Warning: package 'plyr' was built under R version 3.3.1
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
df <- getEarningsAsDF()

read.csv("earnings.csv")[1:6,1:12]
##   table.with.row.headers.in.column.A.and.column.headers.in.rows.5.to.7..Leading.dots.indicate.subparts.
## 1            Table 17. Earnings of Full-Time, Year-Round Workers 15 Years and Over by Sex and Age: 2011
## 2                                     (Numbers in thousands. Civilian noninstitutionalized population1)
## 3                                                                                                      
## 4                                                                                           Sex and age
## 5                                                                                                      
## 6                                                                                                      
##           X     X.1          X.2     X.3                X.4     X.5
## 1                                                                  
## 2                                                                  
## 3                                                                  
## 4 Earnings2                                                        
## 5    Total          Under $5,000           $5,000 to $9,999        
## 6   Number  Percent      Number  Percent            Number  Percent
##                    X.6     X.7                  X.8     X.9
## 1                                                          
## 2                                                          
## 3                                                          
## 4                                                          
## 5   $10,000 to $14,999           $15,000 to $19,999        
## 6              Number  Percent              Number  Percent
##                   X.10
## 1                     
## 2                     
## 3                     
## 4                     
## 5   $20,000 to $24,999
## 6              Number
df[1:5,1:10]
##                 Ages Total  Number  Total  Percent Under $5,000 Number 
## 1               Ages Total  Number  Total  Percent Under $5,000 Number 
## 3         Both sexes        101,676          100.0                  646
## 4    .15 to 17 years            119          100.0                   10
## 5    .18 to 24 years          6,392          100.0                  105
## 6    .25 to 29 years         10,957          100.0                   41
##   Under $5,000 Percent   $5,000 to $9,999 Number 
## 1 Under $5,000 Percent   $5,000 to $9,999 Number 
## 3                  0.6                      1,183
## 4                  8.7                          9
## 5                  1.6                        257
## 6                  0.4                        132
##     $5,000 to $9,999 Percent   $10,000 to $14,999 Number 
## 1   $5,000 to $9,999 Percent   $10,000 to $14,999 Number 
## 3                        1.2                        3,538
## 4                        7.6                           32
## 5                        4.0                          760
## 6                        1.2                          424
##     $10,000 to $14,999 Percent   $15,000 to $19,999 Number 
## 1   $10,000 to $14,999 Percent   $15,000 to $19,999 Number 
## 3                          3.5                        6,289
## 4                         26.9                           14
## 5                         11.9                        1,185
## 6                          3.9                          858

Next we will do final tidying by rotating some of the income data and adding a gender column:

dfTidy <- getEarningsAsTidyDF(df)
dfTidy[1:10,]
##                  Ages Sex   Income    Amt   Pct
## 1               Total   B Total    101676 100.0
## 2     .15 to 17 years   B Total       119 100.0
## 3     .18 to 24 years   B Total      6392 100.0
## 4     .25 to 29 years   B Total     10957 100.0
## 5     .30 to 34 years   B Total     11656 100.0
## 6     .35 to 39 years   B Total     11525 100.0
## 7     .40 to 44 years   B Total     12767 100.0
## 8     .45 to 49 years   B Total     13086 100.0
## 9     .50 to 54 years   B Total     13187 100.0
## 10    .55 to 59 years   B Total     10945 100.0

Some Analysis:

First male/female ratios of high and low income brackets, clearly showing that in these subset men are disprortionally representing in the higher income and the opposite for the lower income bracket.

#Percantages
maleHiIncome <- filter(dfTidy, Sex == "M" & str_trim(Income) == "$100,000 and over")[,5]
femaleHiIncome <- filter(dfTidy, Sex == "F" & str_trim(Income) == "$100,000 and over")[,5]
as.numeric(maleHiIncome)/as.numeric(femaleHiIncome)
## Warning: NAs introduced by coercion
##  [1] 2.437500       NA 7.000000 2.142857 1.921569 2.064935 2.358974
##  [8] 2.684932 2.716049 2.344828 2.810811 3.500000
maleLowIncome <- filter(dfTidy, Sex == "M" & str_trim(Income) == "$10,000 to $14,999")[,5]
femaleLowIncome <- filter(dfTidy, Sex == "F" & str_trim(Income) == "$10,000 to $14,999")[,5]
as.numeric(maleLowIncome)/as.numeric(femaleLowIncome)
##  [1] 0.6511628 1.3539823 0.8059701 0.6326531 0.6315789 0.6052632 0.4722222
##  [8] 0.6470588 0.5151515 0.5882353 0.6451613 0.8431373
#Absolute numbers
maleHiIncome <- filter(dfTidy, Sex == "M" & str_trim(Income) == "$100,000 and over")[,4]
femaleHiIncome <- filter(dfTidy, Sex == "F" & str_trim(Income) == "$100,000 and over")[,4]
as.numeric(maleHiIncome)/as.numeric(femaleHiIncome)
## Warning: NAs introduced by coercion
##  [1] 3.249012       NA 7.428571 2.774510 2.699187 2.983425 3.130841
##  [8] 3.446301 3.405920 2.961995 3.413043 5.642105
maleLowIncome <- filter(dfTidy, Sex == "M" & str_trim(Income) == "$10,000 to $14,999")[,4]
femaleLowIncome <- filter(dfTidy, Sex == "F" & str_trim(Income) == "$10,000 to $14,999")[,4]
as.numeric(maleLowIncome)/as.numeric(femaleLowIncome)
##  [1] 0.8670185 1.6666667 1.0765027 0.8354978 0.8709677 0.8707865 0.6428571
##  [8] 0.8281250 0.6512821 0.7423313 0.7812500 1.3875000

If we look for the most common income bracket for male and females, we again find that for males the income bracket is again higher than for females.

idx <- which.max(as.numeric(filter(dfTidy, str_trim(Ages) == ".45 to 49 years" & Sex == "M"
                                    & str_trim(Income) != "Total")[,5]))
filter(dfTidy, str_trim(Ages) == ".45 to 49 years" & Sex == "M" & str_trim(Income) != "Total")[idx,]
##                 Ages Sex                 Income  Amt  Pct
## 8    .45 to 49 years   M   $50,000 to $74,999   1667 22.6
idx <- which.max(as.numeric(filter(dfTidy, str_trim(Ages) == ".45 to 49 years" & Sex == "F"
                                     & str_trim(Income) != "Total")[,5]))
filter(dfTidy, str_trim(Ages) == ".45 to 49 years" & Sex == "F" & str_trim(Income) != "Total")[idx,]
##                 Ages Sex                 Income  Amt  Pct
## 7    .45 to 49 years   F   $35,000 to $49,999   1362 23.9

Now the code for the child mortality.

initialize <-function()
{
    library(stringr)
    library(tidyr)
    library(dplyr)
    library(plyr)
}
 

Get7ColsGatheredAndDrop4C <- function(df)
 {
     dfGather <- df[,1:7]                                 #take 1st 7 cols
     dfGather <- gather(dfGather, ColKey, ColVal, -1:-3)   #gather around first 3 cols
     df <- df[,-4:-7]                                     #drop col 4-7 in original df
     return(list(dfGather, df))                           #return list of 2 dataframes one "gathered, 
                                                          #one original with columns processed removed
}


#reads in filename and creates uncluttered DF (i.e., extraneous lines and columns and so forth removed)
GetMortalityDF <- function(filename = "SexSpecificMortality.csv")
{
    dfCMort<- read.csv(filename, stringsAsFactors = FALSE, skip =6, header=FALSE)
    dfCMort <-dfCMort[,1:19]  ##drop trailing empty cols which were part of .csv
    dfCMort[1,4:length(dfCMort)] <- unlist(rep(dfCMort[1, seq(4, length(dfCMort), 4)], each=4))  ##assign blank header values
    dfCMort[1,] <- str_c(dfCMort[1,], dfCMort[2,])
    colnames(dfCMort) <- (dfCMort[1,])
    
    dfCMort <- dfCMort[c(-1,-2),]
    dfCMort <- subset(dfCMort, Country != "") #drop closing comments which are in col 1
    return(dfCMort)
}

#makes the df tidy for analysis
GetaTidyDeathDF <-function(df)
{    
    dfTidy <- NULL
    dfWorking <- df
    
    # this is going to loop through the four sections and with one section at a time
    while(length(dfWorking) >= 7)
    {
        #get one section of data
        listDF <- Get7ColsGatheredAndDrop4C(dfWorking)
        if (is.null(dfTidy))  #i.e., first time thru loop
        {
            #make temp col names of male1990, male2000 or female 1990 etc.
            cNames <- c(str_extract(colnames(dfWorking)[1:4],"[[:alpha:]]*"),
                        str_c(str_extract(colnames(dfWorking)[4],"[[:alpha:]]*"), " Values"))
            
            ##dfTidy gets fist df in list...the "tidy" one
            dfTidy <- data.frame(listDF[1])  
        }
        else
        {
            cNames <- c(colnames(dfTidy), 
                        str_c(str_extract(colnames(dfWorking)[4],"[[:alpha:]]*"), " Values"))
            
            #add two rotated columns
            #dfTidy <- bind_cols(dfTidy, data.frame(listDF[1])[,5]) does not work???, so add 2 cols and delete one
            dfTidy <- bind_cols(dfTidy, data.frame(listDF[1])[,4:5])
            dfTidy <- dfTidy[,-(length(dfTidy)-1)]
            
        }
        colnames(dfTidy) <- cNames
        dfWorking <- data.frame(listDF[2])  #dfworking is original with cols slowly being removed as processed
    }
    colnames(dfTidy)[4] = "Year"  
    colnames(dfTidy)[5] = "Mort Male"
    colnames(dfTidy)[6] = "Mort Female"
    colnames(dfTidy)[6] = "Sex Ratio"
    dfTidy$Year <- str_replace(dfTidy$Year, "Male", "")
    return(dfTidy)
}

Again we will initialize, and reading the .csv file, doing some basic clean up work, and we will display subsets fo the original and the partially tidied file.

initialize()
dfPartTidy <- GetMortalityDF()
read.csv("SexSpecificMortality.csv")[1:6,1:6]
##                                                                        Child.Mortality.Estimates
## 1                                                         Sex-specific under-five mortality rate
## 2 Estimates generated by the UN Inter-agency Group for Child Mortality Estimation (IGME) in 2015
## 3                                                  downloaded from http://www.childmortality.org
## 4                                                                        Last update: 9 Sep 2015
## 5                                                             Contact: childmortality@unicef.org
## 6                                                                                       ISO Code
##         X                   X.1  X.2 X.3 X.4
## 1                                     NA  NA
## 2                                     NA  NA
## 3                                     NA  NA
## 4                                     NA  NA
## 5                                     NA  NA
## 6 Country Uncertainty \nbounds* Male  NA  NA
dfPartTidy[1:6,1:6]
##   ISO Code     Country Uncertainty \nbounds* Male1990 Male2000 Male2010
## 3      AFG Afghanistan                 Lower    167.2    129.3     94.7
## 4      AFG Afghanistan                Median    185.5    141.1    108.4
## 5      AFG Afghanistan                 Upper    207.3    154.7    124.6
## 6      AGO      Angola                 Lower    210.3    198.5      136
## 7      AGO      Angola                Median    236.3    226.7    190.9
## 8      AGO      Angola                 Upper    266.9    258.2    260.5

Next we will do the real work of tidying. This involves looping through the wide .csv, cutting up and working on one section of the 4 major columns at a time, to rotate the four minor columns under each of the major ones.

dfTidy <-GetaTidyDeathDF(dfPartTidy)
dfTidy[1:10,]
##    ISO     Country Uncertainty Year Mort Male Sex Ratio Sex Values
## 1  AFG Afghanistan       Lower 1990     167.2     158.4       1.01
## 2  AFG Afghanistan      Median 1990     185.5     176.2       1.05
## 3  AFG Afghanistan       Upper 1990     207.3     196.7        1.1
## 4  AGO      Angola       Lower 1990     210.3     191.6       1.06
## 5  AGO      Angola      Median 1990     236.3     215.2        1.1
## 6  AGO      Angola       Upper 1990     266.9     243.3       1.14
## 7  ALB     Albania       Lower 1990      39.1        32       1.14
## 8  ALB     Albania      Median 1990      44.5      36.5       1.22
## 9  ALB     Albania       Upper 1990      50.3      41.3       1.31
## 10 AND     Andorra       Lower 1990       5.6       4.6       1.13
##    P Values
## 1      0.92
## 2      0.96
## 3         1
## 4      0.97
## 5      1.01
## 6      1.05
## 7      0.93
## 8         1
## 9      1.07
## 10     0.93

Here we can see child mortality is decreasing over the years.

mean(as.numeric(filter(dfTidy, Year == 1990 & Uncertainty == "Median")[,5]))
## [1] 76.70359
mean(as.numeric(filter(dfTidy, Year == 2000 & Uncertainty == "Median")[,5]))
## [1] 60.82
mean(as.numeric(filter(dfTidy, Year == 2010 & Uncertainty == "Median")[,5]))
## [1] 41.42615
mean(as.numeric(filter(dfTidy, Year == 2015 & Uncertainty == "Median")[,5]))
## [1] 34.25077
#this is not doing what I expected...
dfTidy %>% group_by(Year) %>% summarise(avg = mean(as.numeric(`Mort Male`)))
##        avg
## 1 54.00056

Finally, here we see the countries with the min and max mortality for a given year.

idx <- which.max(as.numeric(filter(dfTidy, Uncertainty == "Median")[,5]))
filter(dfTidy, Uncertainty == "Median")[idx,]
##     ISO Country Uncertainty Year Mort Male Sex Ratio Sex Values P Values
## 126 NER   Niger      Median 1990     332.4     323.9       1.03     0.97
idx <- which.min(as.numeric(filter(dfTidy, Uncertainty == "Median")[,5]))
filter(dfTidy, Uncertainty == "Median")[idx,]
##     ISO    Country Uncertainty Year Mort Male Sex Ratio Sex Values
## 690 LUX Luxembourg      Median 2015         2       1.7        1.2
##     P Values
## 690        1