What is wrong with the Unicef Data?
1. Generate Useful Variable Names
2. Code NA’s
3. Coercing and Rounding Numeric Values
4. Seperate Country and Region Data Rows
5. Separating Columns
Remove Blank Columns
Combining Columns
Reshaping
- Long to Wide
- Wide to Long

What is wrong with the Unicef Data?

The variable variable names are stored as the first row of the data, and they are not uniform.
Missing data is represented by a dash, rather than a blank or NA value.
The numeric variables are stored as class character and have too many decimal places.
The data do not abide by the “one unit of analysis per dataset” rule. The top of the data use countries as the unit of analysis, but toward the bottom are rows that have region-specific data.
Country and region appear in the first column as one variable.

unicef <- unicef.data

datatable(unicef.data)

1. Generate Useful Variable Names

# replace the variable names with the contents of the first row of the dataset. 
names(unicef) <- unicef[1,] ## replaces the column names

# Then remove the first two rows from the dataset
unicef <- unicef[-c(1:2),]

# look at the column names
colnames(unicef)

##  [1] "Countries and areas-NA"                     
##  [2] "Under-5 mortality rate (U5MR) (1990)"       
##  [3] "Under-5 mortality rate (U5MR) 2015"         
##  [4] "U5MR (male)"                                
##  [5] "U5MR (female)"                              
##  [6] "Neonatal  mortality  rate"                  
##  [7] "Total population (thousands)"               
##  [8] "Annual no. of births (thousands)"           
##  [9] "GNI per capita (US$)"                       
## [10] "Life expectancy at birth (years)"           
## [11] "Total adult literacy rate          (%)"     
## [12] "Continent of region"                        
## [13] "Primary school net enrolment ratio      (%)"
## [14] "Location of region within continent"

# create an ordered list of new meaningful column names
names(unicef) <- c("country", "u5mr.90", "u5mr.15", "u5mr.male", "u5mr.female", "neonatal.mr", "population", "births", "gni.per.cap", "life.exp", "literacy.rate", "continent", "school.enrollment", "location.cont")

# check that the new column names appear as expected
colnames(unicef)

##  [1] "country"           "u5mr.90"           "u5mr.15"          
##  [4] "u5mr.male"         "u5mr.female"       "neonatal.mr"      
##  [7] "population"        "births"            "gni.per.cap"      
## [10] "life.exp"          "literacy.rate"     "continent"        
## [13] "school.enrollment" "location.cont"

2. Code NA’s

In this dataset, there are dashes indicating missing data. In order for R to analyze this data, these values need to be recoded to NA.

# forces the dashes in the dataset to become NAs
unicef[unicef == "-"] <- NA

3. Coercing and Rounding Numeric Values

All of the variables in the dataset are stored as character vectors because the first row of each variable was previously a character string - the variable name. Now that we have removed those from the dataframe, we can coerce the variables that contain numbers to class numeric.

This section introduces the sapply() function, which applies another function across multiple variables in a dataset.

# displays the class of each variable in the dataset
sapply(unicef, FUN=class) # all are character

##           country           u5mr.90           u5mr.15         u5mr.male 
##       "character"       "character"       "character"       "character" 
##       u5mr.female       neonatal.mr        population            births 
##       "character"       "character"       "character"       "character" 
##       gni.per.cap          life.exp     literacy.rate         continent 
##       "character"       "character"       "character"       "character" 
## school.enrollment     location.cont 
##       "character"       "character"

# manually identify which of the variables should be numeric
# in this case is columns 2-11 and 13
num.columns <- c(2:11,13)

# apply the as.numeric() function to each of these columns and replace them in the dataset
unicef[,num.columns] <- sapply(unicef[,num.columns], FUN=as.numeric)
# check to see that the correct columns are now numeric
sapply(unicef, FUN=class)

##           country           u5mr.90           u5mr.15         u5mr.male 
##       "character"         "numeric"         "numeric"         "numeric" 
##       u5mr.female       neonatal.mr        population            births 
##         "numeric"         "numeric"         "numeric"         "numeric" 
##       gni.per.cap          life.exp     literacy.rate         continent 
##         "numeric"         "numeric"         "numeric"       "character" 
## school.enrollment     location.cont 
##         "numeric"       "character"

# round all values to the second decimal place
unicef[,num.columns] <- round(unicef[,num.columns], digits = 2)
datatable(unicef)

4. Seperate Country and Region Data Rows

There should only be one unit of observation per data set, so the rows that contain data about each individual country need to be stored in a separate data set from the rows that contain summary data about entire regions.

# identify where the region rows begin
summary.start <- which(unicef$country == "SUMMARY INDICATORS") 
# begins at row 198

# Create a new dataframe with only the region summaries:
unicef.region <- unicef[(summary.start + 1) : nrow(unicef),]

# For completeness, rename the first column in the new dataframe to be region instead of country
names(unicef.region)[1] <- "region"

## Now drop the regional data from the original dataset:
unicef <- unicef[1:(summary.start - 1),]


nrow(unicef.region) ## 10 rows in the summary/regional dataset

## [1] 10

nrow(unicef) ## 197 rows in the country-level dataset

## [1] 197

datatable(unicef.region)

5. Separating Columns

Currently the first column country contains the country name and region name. We want these stored as separate variables.

library(tidyr)
unicef <- separate(unicef, #dataframe
                   col = "country", #name of column with data to be split
                   into = c("Country","Region"),#names of new columns
                   sep = "-") #symbol that separates the two pieces of info
                  

head(unicef[,c(1:2)])

##               Country                          Region
## 4             Albania Eastern Europe and Central Asia
## 5             Algeria    Middle East and North Africa
## 6             Andorra                  Western Europe
## 7              Angola Least developed countries/areas
## 8 Antigua and Barbuda     Latin America and Caribbean
## 9           Argentina     Latin America and Caribbean

Remove Blank Columns

unicef.row.means <- rowMeans(is.na(unicef)) # this creates a vector that tells us what percentage of cells in each row are missing. If a row is 100% missing we'll remove it

head(unicef.row.means) #look at the means of the first few rows

##          4          5          6          7          8          9 
## 0.13333333 0.06666667 0.40000000 0.20000000 0.13333333 0.13333333

unicef <- unicef[rowMeans(is.na(unicef)) < 1,] ## only keeps the rows where there's some non-missing information

unicef <- unicef[,colMeans(is.na(unicef) | unicef == "x") < 1] ## removes the extra columns

dim(unicef)

## [1] 196  13

Combining Columns

Currently the columns 12 and 14 of the unicef.region dataframe contain the name of the continent on which the region is located and the region’s location on that continent. We want these stored as one variable.

head(unicef.region[,c(12,14)])

##     continent location.cont
## 202    Africa         South
## 203    Africa          East
## 204    Africa          West
## 205    Africa         North
## 206      Asia         South
## 207      Asia          East

# paste the contents of the location and continent columns into the location column
unicef.region$location <- NA
unicef.region$location <- paste(unicef.region$location.cont,unicef.region$continent, sep = " ")

# check that the values appeared as expected
head(unicef.region$location)

## [1] "South Africa" "East Africa"  "West Africa"  "North Africa" "South Asia"  
## [6] "East Asia"

# remove the continent column
unicef.region <- unicef.region[,c(-12,-14)]

Reshaping

Long to Wide

For this example we will use data from the American Community Survey (ACS). We can import data from the census database directly into our R environment using the tidycensus package.

The variables we will be using are:
B01003_001 - total population
B20004_001 - median annual income

Our unit of observation will be counties in PA.

library(tidycensus)

PA.inc.data <- get_acs(
  geography = "county",
  variables = c("B01003_001", "B20004_001"),
  state = "PA"
)

## Getting data from the 2014-2018 5-year ACS

dim(PA.inc.data)

## [1] 134   5

datatable(PA.inc.data)

Based on a visual check of the data, it seems like there are two rows for each county. To confirm that all rows/ids are duplicated with certainty, we should check using duplicated()

# Does the dataset contain exactly one row for every observation?
table(duplicated(PA.inc.data$NAME))

## 
## FALSE  TRUE 
##    67    67

table(table(PA.inc.data$NAME)) #all id numbers appear twice

## 
##  2 
## 67

# but none of the rows are completely identical
table(duplicated(PA.inc.data))

## 
## FALSE 
##   134

# this tells us that there are columns that contain different info on each respondent that must be preserved

Reshape the data from long to wide format resulting in one columns for each unique county name and two new columns containing the non-duplicate information from the variables.

library(tidyr)

PA.inc.data.reshaped <- pivot_wider(data = PA.inc.data, # name of data frame
                                    id_cols = c(1:2),   # columns of duplicate information that can be collapsed
                                    names_from = variable, # variable of labels that will become the new variable names
                                    values_from = estimate) # variable of values that will become the values of the new variables

#the number of columns in our reshaped data frame should be half that of the long data frame
#note: there are only 4 columns because we specified that we only wanted to include the values from 'estimate' and not moe
dim(PA.inc.data.reshaped)

## [1] 67  4

# visually check that all the data was manipulated correctly
datatable(PA.inc.data.reshaped)

# give the new columns meaningful names using the rename() function from the dplyr package
library(dplyr)
PA.inc.data.reshaped <- rename(PA.inc.data.reshaped,
                                "county.name" = "NAME",
                                "population" = "B01003_001",
                                "median.income" = "B20004_001") 
colnames(PA.inc.data.reshaped)

## [1] "GEOID"         "county.name"   "population"    "median.income"

# check whether there is one row per county.

table(duplicated(PA.inc.data.reshaped$GEOID)) # all should be FALSE - indicating no duplicates

## 
## FALSE 
##    67

length(unique(PA.inc.data.reshaped$GEOID)) # the number of unique id values should be equal to the number of rows in the final wide data frame

## [1] 67

Wide to Long

For this example we will use the data frame age.data displayed below. We can see that there are 5 columns - a binary variable that indicates whether an individual belongs to an age group (1) or not (0).

What we want instead is one variable that contains the name of the respondents age group for each of the 25 respondents.

datatable(age.data)

dim(age.data) # 25 observations and 6 variables

## [1] 25  6

# pivot_longer() condenses information stored across multiple columns into one column containting the names of the original variables
age.data.reshaped <- pivot_longer(age.data, # data frame
                         cols = c(2:6), # which columns we want to combine
                         names_to = "age", #name of our new label column
                         values_to = "binary") # name of our new value column
dim(age.data.reshaped)

## [1] 125   3

sum(age.data.reshaped$binary)

## [1] 25

age.data.reshaped.sub <- age.data.reshaped[age.data.reshaped$binary == 1,]
age.data.reshaped.sub <- age.data.reshaped.sub[,c(-3)]
dim(age.data.reshaped.sub)

## [1] 25  2

head(age.data.reshaped.sub$age)

## [1] "Age4049" "Age1829" "Age3039" "Age6069" "Age1829" "Age5059"

# we want to remove "Age" from the beginning of each character value in our 'age' column - to do this, we use the function substr() to specify that we only want characters 4 through 7
age.data.reshaped.sub$age <- substr(age.data.reshaped.sub$age, start = 4, stop = 7)

table(age.data.reshaped.sub$age) # 5 respondents in each group

## 
## 1829 3039 4049 5059 6069 
##    5    5    5    5    5

datatable(age.data.reshaped.sub)

Data Cleaning and Manipulation