Data Management with TidyR

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
growth <- read.csv("D:/Research Methods/R/WB GDP Growth.csv")
inequality <- read.csv("D:/Research Methods/R/world_inequality.csv")

In this first code chunk, I load the libraries and data we will use for this exercise. The ` marks (three in a row) set off the code chunk, while the information within the curly braces indicates which language to run and how to handle the chunk when the document is published.

include=FALSE will not print this segment of code or its results in the final document
echo=FALSE prints the results, but not the code in the final document
message=FALSE will not print messages generated by the code
warning=FALSE will not print warnings generated by the code
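
These options can also be set once for the whole document instead of chunk by chunk. The following is a minimal sketch, assuming the knitr package is installed; the particular defaults chosen here are only an illustration, not a recommendation.

```r
# Set default options for every chunk that follows in the document.
# Individual chunks can still override these in their own headers.
knitr::opts_chunk$set(
  echo = FALSE,     # show results but hide the code
  message = FALSE,  # suppress messages such as package start-up notes
  warning = FALSE   # suppress warnings
)

# Check the current default for one option
knitr::opts_chunk$get("echo")
```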

Note that the code chunk is distinguished by a different color background. You can run individual code chunks by clicking on the right arrow in the top right corner of the block. You can run all prior code chunks by clicking on the down arrow.

Create a Dataset

Creating a “tibble” is a user-friendly way to create a data frame (dataset) in R. Tibbles are built by supplying values or a formula for each variable, and later variables can refer to earlier ones. Note that when you create a tibble, you can use variable names that are not valid standard R names (for example, names that start with a number or contain special characters), as long as you wrap the name in backticks.

tib <- tibble(
        x = lubridate::today(),
        y = 1:5,
        z = runif(5),
        `:)` = y+z
)
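
Non-standard names need backticks again whenever you refer to them. A short sketch with a small illustrative tibble (smiley_tib and its values are made up for this example):

```r
library(tibble)

# A small tibble with one non-standard column name
smiley_tib <- tibble(
  y = 1:5,
  z = y / 10,      # later columns can use earlier ones
  `:)` = y + z
)

# Backticks are needed whenever the column is referenced by name
smiley_tib$`:)`

# Double square brackets with a quoted string also work
smiley_tib[[":)"]]
```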

Tribbles operate in a similar way, but require you to enter all data manually. Note that column headings (variable names) require a ~ before them. All entries must have a comma after them except the last one.

enlp <- tribble(
  ~country, ~year, ~enlp,
  "Kyrgyzstan", 1995, 34.7872,
  "Kyrgyzstan", 2000, 35.61634,
  "Kyrgyzstan", 2010, 4.90129,
  "Kyrgyzstan", 2015, 4.8225,
  "Armenia", 1995, 4.51276,
  "Armenia", 1999, 4.14779,
  "Armenia", 2003, 3.774,
  "Armenia", 2007, 3.226358,
  "Armenia", 2012, 2.733949,
  "Armenia", 2017, 2.4755,
  "Mexico", 1982, 1.49939,
  "Mexico", 1985, 1.82634,
  "Mexico", 1988, 3.0389,
  "Mexico", 1991, 2.21435,
  "Mexico", 1994, 2.28724,
  "Mexico", 1997, 2.85714,
  "Mexico", 2000, 2.78322
  )

Modify a Dataset

There are many ways to modify information within a dataset. We will first discuss changing the information itself, then the structure of the dataset.

Rename

To rename a variable, the command is rename. You need to specify the source of the data and then enter new names followed by old names in the command. For example, to change the variable names of “Country.Name” and “Country.Code” to “country” and “code”, use the following code:

growth <- rename(growth, country = Country.Name, code = Country.Code)
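
If the files are not at hand, the same pattern can be tried on a small stand-in. This sketch assumes a made-up data frame (toy_growth) with the same two column names:

```r
library(dplyr)

# Made-up stand-in for the World Bank file
toy_growth <- data.frame(
  Country.Name = c("Armenia", "Mexico"),
  Country.Code = c("ARM", "MEX"),
  stringsAsFactors = FALSE
)

# new_name = old_name
toy_growth <- rename(toy_growth, country = Country.Name, code = Country.Code)

names(toy_growth)  # "country" "code"
```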

Type

You can also change the type of data a variable holds. In R, there are many types of data, including numeric, character, logical (true/false), and factor (for categorical variables). When opening a dataset from a different source, variables may not be categorized the way you need (for example, character variables may appear as factors). It can be difficult to merge variables that appear as different types, so correcting the data type can become important. To do this, use one of the as. functions (for example, as.character or as.numeric). For example, to make the variable code a character variable, use the following code:

growth$code <- as.character(growth$code)
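
One common pitfall is converting a factor directly to numeric, which returns the internal level codes and not the values you see printed. A minimal sketch with a made-up factor of years:

```r
# A factor whose labels look like numbers
year_f <- factor(c("1995", "2000", "2010"))

# Direct conversion returns the level codes, not the years
as.numeric(year_f)                # 1 2 3

# Converting to character first recovers the actual values
as.numeric(as.character(year_f))  # 1995 2000 2010
```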

Recode

Finally, one of the most common modifications made to a dataset is recoding its values. This is particularly important when merging two files that use different spellings of countries or country codes, for example. The command to recode values of a variable is ‘recode’, and you enter the old values before the new values. For example, to change the country codes for Timor-Leste to TMP, Romania to ROM, and the Democratic Republic of the Congo to ZAR, use the following code:

growth$code <- recode(growth$code, 
                     "TLS"="TMP", 
                     "ROU"="ROM", 
                     "COD"="ZAR"
                     )
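
Values that are not listed in recode are left unchanged. A short sketch on a made-up character vector of codes:

```r
library(dplyr)

# Made-up vector of country codes
codes <- c("TLS", "ROU", "MEX")

# "old" = "new"; unlisted values pass through untouched
recode(codes, "TLS" = "TMP", "ROU" = "ROM")  # "TMP" "ROM" "MEX"
```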

Sometimes we do not want to change all values of a variable, but only some (based on characteristics of different observations). Sample code (which we will not run at this time) to change the name of the Czech Republic to Czechoslovakia for observations up to and including 1992 is below. Note that the command works by assigning a new value directly to the rows of the variable selected by a logical condition in square brackets.

vdem$country[vdem$country == "Czech Republic" & vdem$year <= 1992] <- "Czechoslovakia"
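
To see how this kind of conditional assignment behaves, here is a minimal sketch on a small made-up data frame (vdem_toy is a hypothetical stand-in, not the real dataset):

```r
# Hypothetical country-year data
vdem_toy <- data.frame(
  country = c("Czech Republic", "Czech Republic", "Czech Republic"),
  year = c(1990, 1992, 1995),
  stringsAsFactors = FALSE
)

# Only rows that satisfy both conditions are overwritten
vdem_toy$country[vdem_toy$country == "Czech Republic" & vdem_toy$year <= 1992] <- "Czechoslovakia"

vdem_toy$country  # "Czechoslovakia" "Czechoslovakia" "Czech Republic"
```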

Filter

To remove observations completely from a dataset, use the function filter followed by the logical conditions that describe the observations you want to keep. For example, to remove all observations in a country-year dataset prior to 1970, use the following code:

vdem <- filter(vdem, year>=1970)

In our sample dataset, we do not need the regional data the World Bank has provided. To remove it, run the following code block:

growth <- filter(growth, country!="Caribbean small states",
           country!="Central Europe and the Baltics")

We could remove every region by name (as above), but this is an inefficient way to code. We will discuss faster ways to do this later in the presentation.
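
One possible shortcut, shown here only as a sketch and not necessarily the approach covered later, is to store the unwanted names in a vector and filter with %in%. The data frame toy_regions and the vector regions are made up for this example:

```r
library(dplyr)

# Made-up data mixing countries and regional aggregates
toy_regions <- data.frame(
  country = c("Armenia", "Caribbean small states",
              "Mexico", "Central Europe and the Baltics"),
  stringsAsFactors = FALSE
)

# Names to drop, collected in one place
regions <- c("Caribbean small states", "Central Europe and the Baltics")

# Keep only rows whose country is NOT in the list
toy_regions <- filter(toy_regions, !country %in% regions)

toy_regions$country  # "Armenia" "Mexico"
```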

Select

To remove entire variables from a dataset, use the command select, followed by the variables you want to keep. If you only want to remove a few variables, put a negative sign (-) before each variable you want to remove. You can keep a series of adjacent variables (inclusive) by using a colon: firstvar:lastvar.

Since we do not need the country code or indicator name variables in this dataset, we can remove them with the following code:

growth <- select(growth, -code, -Indicator.Name)

Another way to perform the same function (removing two variables) is to list the variables you want to keep:

growth <- select(growth, country, X1960:X2015)
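
Both approaches can be compared on a small made-up tibble (toy_wide is hypothetical and has only three year columns):

```r
library(dplyr)
library(tibble)

toy_wide <- tibble(
  country = "Armenia", code = "ARM",
  X1960 = 1.2, X1961 = 0.8, X1962 = 2.1
)

# Drop one variable with a minus sign
dropped <- select(toy_wide, -code)

# Keep a variable plus an inclusive range of adjacent columns
kept <- select(toy_wide, country, X1960:X1962)

# Both routes give the same set of columns
identical(names(dropped), names(kept))  # TRUE
```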

Mutate

To add a new variable based on an existing variable, use the command mutate followed by the formula you want to use. For example:

mutate(data, gdp_percap = gdp / population)

mutate(data, log_gdp = log(gdp))

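Note that mutate does not change the original dataset: it returns a new data frame containing the added variable. To keep the new variable, assign the result to a new or existing object (for example, data <- mutate(data, log_gdp = log(gdp))).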
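
A short sketch of how mutate behaves with and without assignment, using a made-up tibble (econ and its values are hypothetical):

```r
library(dplyr)
library(tibble)

econ <- tibble(gdp = c(100, 400), population = c(10, 20))

# Without assignment the result is printed but not stored
mutate(econ, gdp_percap = gdp / population)
names(econ)  # "gdp" "population"

# With assignment the new variable is kept
econ <- mutate(econ, gdp_percap = gdp / population)
names(econ)  # "gdp" "population" "gdp_percap"
```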

Changing the Shape of a Dataset

Sometimes data is prepared so that data you want as observations are presented as variables (and vice versa). In order to reshape the dataset so that it presents the data the way you need, use the function gather or spread. Both functions operate using the options “key” and “value”.

Gather

The command gather will turn a “wide” dataset into a “long” dataset. It will turn variables you want to combine into a single column - for example, if you have years listed as variables that you want to turn into a single variable “year”. Here, the “key” is the new variable you create out of the old variable names. “Value” is what you want to call the variable that will contain the data stored in the frame.

growth <- gather(growth, X1960:X2015, key="year", value="gdp_growth")

growth$year <- substr(growth$year, 2, 5)
growth$year <- as.numeric(as.character(growth$year))
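
The same steps can be followed on a small made-up tibble to see what each one does. Newer versions of tidyr also provide pivot_longer, which does the same job as gather; it is included here for comparison. All object names and values in this sketch are hypothetical.

```r
library(tidyr)
library(tibble)

toy_wide <- tibble(
  country = c("Armenia", "Mexico"),
  X1960 = c(1.1, 2.2),
  X1961 = c(3.3, 4.4)
)

# Wide to long: one row per country-year
toy_long <- gather(toy_wide, X1960:X1961, key = "year", value = "gdp_growth")

# Strip the leading "X" and convert the year to a number
toy_long$year <- as.numeric(substr(toy_long$year, 2, 5))

# The pivot_longer equivalent of the gather call
toy_long_alt <- pivot_longer(toy_wide, X1960:X1961,
                             names_to = "year", values_to = "gdp_growth")

nrow(toy_long)  # 4
```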

Spread

The command spread will turn a “long” dataset into a “wide” dataset. It will turn observations you want to change into variables - for example, if you have a variable called “year” and want to turn that into a series of variables for each year. Here, the “key” is the name of the variable that contains the new variable names. “Value” is the name of the variable that contains the data that will be spread out among the new columns.

growth <- spread(growth, key = "year", value = "gdp_growth")
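
Here is a minimal sketch of the reverse operation on a made-up long tibble. Newer versions of tidyr also provide pivot_wider, which does the same job as spread; it is shown for comparison. Object names and values are hypothetical.

```r
library(tidyr)
library(tibble)

toy_long <- tibble(
  country = c("Armenia", "Armenia", "Mexico", "Mexico"),
  year = c(1960, 1961, 1960, 1961),
  gdp_growth = c(1.1, 3.3, 2.2, 4.4)
)

# Long to wide: one column per year
toy_wide <- spread(toy_long, key = "year", value = "gdp_growth")
names(toy_wide)  # "country" "1960" "1961"

# The pivot_wider equivalent of the spread call
toy_wide_alt <- pivot_wider(toy_long, names_from = year, values_from = gdp_growth)
```

Note that the new column names start with a number, so they need backticks when referenced (for example, toy_wide$`1960`).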

Merging Datasets

Merging datasets is one of the most challenging things you can do in R - it requires precision, but the program does not provide much information about the problems that arise. There are four types of merges (called joins) you can do in R:

An inner join keeps only the observations that appear in both datasets.
A left join keeps all the observations in the first dataset and adds the matching data from the second dataset.
A right join keeps all the observations in the second dataset and adds the matching data from the first dataset.
A full join keeps all observations from both datasets.
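
The difference between the four joins is easiest to see in the number of rows each one returns. A minimal sketch with two made-up tibbles that share two of their three countries (all values are invented for illustration):

```r
library(dplyr)
library(tibble)

toy_a <- tibble(country = c("Armenia", "Mexico", "Laos"),
                gini = c(30, 45, 36))
toy_b <- tibble(country = c("Armenia", "Mexico", "Russia"),
                gdp_growth = c(3, 2, 1))

nrow(inner_join(toy_a, toy_b, by = "country"))  # 2: Armenia, Mexico
nrow(left_join(toy_a, toy_b, by = "country"))   # 3: everything in toy_a
nrow(right_join(toy_a, toy_b, by = "country"))  # 3: everything in toy_b
nrow(full_join(toy_a, toy_b, by = "country"))   # 4: all countries
```

Rows without a match are kept with NA in the variables that come from the other dataset.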

In general, a best practice is to do a full join to make sure you do not lose data. The code is simple:

merge <- full_join(inequality, growth, by = "country")

## Warning: Column `country` joining factors with different levels, coercing
## to character vector

But many errors arise. What went wrong with this merge?

When observations fail to join correctly, there are several steps you can take to correct the problem.

First, you need to identify which observations failed to join. The simplest way is to look for observations that have data from only one of the two original datasets.
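
One way to list the unmatched observations is dplyr's anti_join, which returns the rows of the first dataset that have no match in the second. This is a sketch on made-up tibbles (ineq_toy, growth_toy, and their values are hypothetical), not the only way to do the check:

```r
library(dplyr)
library(tibble)

ineq_toy <- tibble(country = c("Armenia", "Kyrgyzstan", "Laos"),
                   gini = c(30, 29, 36))
growth_toy <- tibble(country = c("Armenia", "Kyrgyz Republic", "Lao PDR"),
                     gdp_growth = c(3, 4, 7))

# Countries in the inequality data with no match in the growth data
anti_join(ineq_toy, growth_toy, by = "country")$country  # "Kyrgyzstan" "Laos"

# Countries in the growth data with no match in the inequality data
anti_join(growth_toy, ineq_toy, by = "country")$country  # "Kyrgyz Republic" "Lao PDR"
```

Comparing the two lists side by side usually reveals which names are spelled differently in each source.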

Some potential solutions include:

* Use a different key (country code not country name, for example)
* Recode original dataset(s) so that keys match
* Make sure variables are same type (character/numeric)

growth$country <- as.character(growth$country)
inequality$country <- as.character(inequality$country)

growth$country <- recode(growth$country, "Bahamas, The" = "Bahamas",
                         "Bosnia and Herzegovina" = "Bosnia",
                         "Brunei Darussalam" = "Brunei",
                         "Cabo Verde" = "Cape Verde",
                         "Congo, Dem. Rep." = "Congo Kinshasa",
                         "Congo, Rep." = "Congo Brazzaville",
                         "Cote d'Ivoire" = "Ivory Coast",
                         "Egypt, Arab Rep." = "Egypt",
                         "Gambia, The" = "Gambia",
                         "Iran, Islamic Rep." = "Iran",
                         "Korea, Dem. People's Rep." = "Korea North",
                         "Korea, Rep." = "Korea South",
                         "Kyrgyz Republic" = "Kyrgyzstan",
                         "Lao PDR" = "Laos",
                         "Macedonia, FYR" = "Macedonia",
                         "Micronesia, Fed. Sts." = "Micronesia",
                         "Myanmar" = "Myanmar (Burma)",
                         "Russian Federation" = "Russia",
                         "Syrian Arab Republic" = "Syria",
                         "Timor-Leste" = "East Timor",
                         "Trinidad and Tobago" = "Trinidad",
                         "Venezuela, RB" = "Venezuela",
                         "Yemen, Rep." = "Yemen")

merge <- full_join(inequality, growth, by = "country")

Saving a new dataset

When you have finished, save your new dataset.

write_csv(merge, "inequality and growth.csv")