Instructor: Edgar Franco

Outline

  1. Reading data
  2. Selecting and subsampling data
  3. Merging & appending datasets
  4. The dplyr and magrittr packages

6. Reading data

R can read and data from and to different formats.

Functions read.csv(), read.table(), read.delim() read delimited data files, .csv, and .txt files

Similarly, functions write.csv() and write.table() export delimited data files, .csv, and .txt files.

The package “xlsx” reads, writes, and formats Excel 2007 and Excel 97/2000/XP/2003 files

The package “foreign” reads and writes data stored by some versions of Stata, SAS, SPSS, dBase files, and other programs.

Function read.dbf() reads a DBF file into a data frame. This is useful when working with geographic information systems like GIS or QGIS. Some of the spatial or geographical data for this systems come in the form of .dbf files.

6.1. Databases in packages

Remember: “Packages” are collections of objects including databases, functions, models, and compiled code.

Some packages are already installed in your computer and contain baseline functions and data.

Other functions need to be downloaded from the Comprehensive R Archive Network (CRAN).

You can see the available databases in a package by typing data(package=“package.name”)

For example, let’s work with the package “Zelig”

First load the package into the session using the function ‘library()’

library("Zelig")
## 
## Attaching package: 'Zelig'
## The following object is masked from 'package:base':
## 
##     mode

Note that “Zelig” contains different types of models, functions, and databases.

To list the available datasets in a package, use the function ‘data()’

Syntax: ‘data(package = “package.name”)’

data(package=“Zelig”)

The function ‘data()’ is also used to load a particular dataset into the workspace. Syntax: ‘data(“dataset.name”)’

In this and future sessions we will be working with the “turnout” dataset in the “Zelig” package.

data(package="Zelig")       # Lists the available datasets in a package

data("turnout")             # Loads a dataset into the workspace

Now the dataset “turnout” is available for use.

ls()

Almost all datasets from packages have a help file with a brief description and codebook of the data.

To access the help file of a data, use the operator ‘?’ or the function ‘help(“dataset.name”)’

?turnout help(“turnout”)

Now we now that dataset “turnout” contains individual-level turnout data.

It pools several American National Election Surveys conducted during the 1992 presidential election year.

Only the first 2,000 observations were included in the sample data.

Remember the following commands to get familiar with your data

  • dim(turnout) # Dimensions of the data frame. Syntax: dim(x)

  • head(turnout,10) # Shows first n rows. Syntax: head(x,n)

  • tail(turnout,10) # Shows last n rows. Syntax: head(x,n)

  • str(turnout) # Displays the structure of an object. Syntax: str(x)

  • summary(turnout) # Displays summary statistics. Syntax: summary(x)

  • colnames(turnout) # Column names of an object. Syntax: colnames(x)

7. Selecting and subsampling data

Selecting elements in objects is one of the most important skills to learn in R.

A dataset is a two dimensional object in R. Each element in a dataset has a set of coordinates: c(row,column)

To subsample data in a matrix, table, or dataframe, you should provide the row and column NUMBERS or NAMES of the elements you want to select.

Use data[rows,columns] to indicate these numbers, leave a blank to select all.

Let’s go back to our turnout dataset

head(turnout)

7.1. Subsampling data

Remember: for a dataset the index is [Row, Column]

# Let's select some rows
head(turnout)
turnout[3  ,  ]         # Row 3, all columns
turnout[c(3:5) ,  ]     # Rows 3 to 5, all columns
turnout[ c(1,3,5),  ]   # Rows 1, 3, and 5, all columns

# To take a random subsample, you can use the function 'sample()'
# Note that 'sample()' only works with vectors.
# Syntax: sample(vector,size)
sample(c(1:2000),10)

# Then we can select a random sample of the rows and all columns.
head(turnout)
turnout[sample(c(1:2000),10) ,  ]   # Random sample of size 10 for rows, all columns


# For illustration purposes, let's keep the first 10 rows of the 'turnout' dataset.
turnout <- turnout[c(1:10),]
turnout

# Columns or variables in datasets can be selected in three different ways
# Suppose we want to select the column for age.

# These four notations are equivalent
turnout[  ,  2]     # All rows, column 2
turnout[  , "age"]  # All rows, column "age"
turnout[  , 'age']  # All rows, column 'age'
turnout$age         # Variable age in dataset turnout
# The sign "$" indicates that vector "age" is an element of the dataset "turnout"

# To select two or more variables, you can use one the following notations
turnout
turnout[  , c(2,4)]             # All rows, columns 2 & 3
turnout[  , c("age","income")]  # All rows, column "age" and "income"
turnout[  , c('age','income')]  # All rows, column 'age' and 'income'

# You can also use the function 'data.frame()'
data.frame(turnout$age,turnout$income) # Data frame variables age and income 
                                           # in dataset 'turnout'
# This produces a new data frame.
class(data.frame(turnout$age,turnout$income)) 

# Note the missing variable names in this new dataset.
# To fix this, you just need name your variables again.
data.frame(age = turnout$age, income = turnout$income)

## NOTE, none of these subsamples exist in the workspace since we didn't assign them an object name.
#'object <- data.frame'
ls()

NOTE: Everytime you run a function with a randomizing component the results might be different. You can set a seed to obtain the same results

# Try, in this order:

sample(c(1:2000),10)
sample(c(1:2000),10)

set.seed(2)
sample(c(1:2000),10)
set.seed(2)
sample(c(1:2000),10)

### In this case, everytime you run a randomizing function just after set.seed(number) you'll get the same results

7.2. Subsampling data using conditionals

To select data samples that satisfy one or more conditions, use the following conditionals

  • ‘<’ less than
  • ‘<=’ less or equal than
  • ‘>’ greater than
  • ‘>=’ greater or equal than
  • ‘==’ equal to
  • ‘!=’ different to

You can use more than one conditional to select samples of your data by using

  • & and
  • ‘|’ or
# Let's use the first 20 rows of the 'turnout' data again.
rm(list=ls())
library("Zelig")
data("turnout")
turnout_short <- turnout[c(1:20),]
turnout_short

# First, let's select the observations for people that voted in the 1992 presidential election (vote==1).
# There are two types of vectors that are helpful for selecting cases.

turnout_short$vote == 1
# Reports whether a condition is TRUE or FALSE for each row.

which(turnout_short$vote == 1)

# Reports the row numbers of those cases that satisfy a condition.
                                    
# Note that the first vector reports TRUE or FALSE for all rows in the data.
# The second vector only reports sucess cases.

# Using these vectors, we can subsample our data
turnout_short

# The two following notations are equivalent:
turnout_short[turnout_short$vote==1         , ]  # Select cases (rows) where condition == TRUE
turnout_short[which(turnout_short$vote == 1), ]  # Select cases (rows) that satisfy the condition



############
# Now, let's select the observations for people that had 40 years old or less and voted in the 1992 presidential election.
# Conditions vote == 1 and age <= 40
turnout_short

# Once again, two types of vectors will be helpful to make this selection
turnout_short$vote == 1 & turnout_short$age <= 40
# Reports whether a condition is TRUE or FALSE for each row.

which(turnout_short$vote == 1 & turnout_short$age <= 40)
# Reports the row numbers of those cases that satisfy a condition

# Using these vectors, we can subsample our data
turnout_short

# The following notations are equivalent
turnout_short[turnout_short$vote==1 & turnout_short$age <= 40, ]  # Select cases where condition == TRUE
turnout_short[which(turnout_short$vote == 1 & turnout_short$age <= 40), ] # Select cases that satisfy the condition

7.3. Dropping data

# Use the '-' operator to drop cases or variables
# Examples

turnout_short
turnout_short[-3, ]         # Drops row 3
turnout_short[-c(3:5),]     # Drops rows 3 to 5
turnout_short[-c(1,3,5),]       # Drops rows 1, 3, and 5

# To drop columns or variables, you can use any of the following notations.
# Remember: the function 'colnames()' return the columns names of a dataset.

# Suppose we want to drop the variable "income".
# The following notations are similar
turnout_short
colnames(turnout_short)
turnout_short[  ,  -4]                          # Drops the 4th column
turnout_short[, c("race","age","educate","vote")] # Subsamples all data except column "income" 
turnout_short[, c(colnames(turnout)!="income")]   # Keeps columns whose name is not equal to "income".
turnout_short[,-c(which(colnames(turnout_short)=="income"))] # Drops the column whose name is equal to "income"

EXERCISE 3 (Data Munging)

  1. Estimate the mean of schooling years for people that were 40 years old or less and voted in the 1992 presidential election.

  2. The function ‘with()’ constructs a local environment defined by the data.No additional references are required.
    Syntax: with(data, ‘expression’). Use the function with() to obtain the education level for people that were 40 years old or less and voted in the 1992 presidential election.

  3. Alternatively, you can use the function ‘subset()’. ‘subset()’ returns subsets of vectors, matrices or data frames which meet conditions. Syntax: ’subset(data.frame, subset = conditions for rows, select = conditions for columns). First, from the dataset turnout_short get the people who voted using the function subset. Second, with the same function get the people that were 40 years old or less and voted Finally, get the level of schooling for these people.

  4. For the dataset turnout_short, drop the cases for people that were 30 years old or less and voted. Hint. You can use the ‘which()’ function.

8. Merging & appending datasets

8.1. Adding additional data

There are at least 3 ways to add additional data to a data frame:

  1. Appending new data.
  2. Adding existing information from a keyed table using the function ‘match()’
  3. Merging 2 datasets using the function ‘merge()’
# Let's create first two databases

Data_A <- data.frame(
                    ID = c(1:5),
                    name = c("Edgar","Katie","John","Mary","Stephen"),
                    age = c(31,26,34,21,20),
                    department = c("PoliSci","Education","Sociology","Economics","Education"),
                    treatment = c(1,1,0,1,0),
                    score_time1 = c(8.5, 8.6, 7.4, 9.3, 9.2)
                    )

Data_B <- data.frame(
                    name = c("Edgar","Katie","John","Stephen","Stephen","David"),
                    score_time2 = c(8.7, 8.8, 7.3, 9.3, 9.5, 8.2)
                    )

Data_A
Data_B

# Note:
# 'Mary' is not in Data_B
# 'David'is not in Data_A
# 'Stephen' has two rows with different information in Data_B

8.1.1. Appending two datasets

## Supose we have the following information for David
data.david <- data.frame(ID = 6, name = "David" , age = 25 , 
                         department = "Anthropology", 
                         treatment = 0, score_time1 = 8.3)
data.david
Data_A

## Note that data.david and Data_A have the same columns
## To append two datasets, you can use the function 'rbind()' which stands for row binding 
rbind(Data_A,data.david)

# Let's append the data for 'David' and replace Data_A
Data_A <- rbind(Data_A,data.david)
Data_A

8.1.2. Matching two datasets

## The function 'match' returns a vector of the positions of (first) matches of its first argument in its second.
## In other words, it looks for an argument in the first column of a 'keyed' table and returns the position of the first match.
## It works similar to the function 'vlookup' in Excel.
# Syntax: 'match(argument,table)'

# Let's go back to our original datasets
Data_A <- data.frame(
                    ID = c(1:5),
                    name = c("Edgar","Katie","John","Mary","Stephen"),
                    age = c(31,26,34,21,20),
                    department = c("PoliSci","Education","Sociology","Economics","Education"),
                    treatment = c(1,1,0,1,0),
                    score_time1 = c(8.5, 8.6, 7.4, 9.3, 9.2)
                    )

Data_B <- data.frame(
                    name = c("Edgar","Katie","John","Stephen","Stephen","David"),
                    score_time2 = c(8.7, 8.8, 7.3, 9.3, 9.5, 8.2)
                    )

Data_A
Data_B

# Note:
# 'Mary' is not in Data_B
# 'David'is not in Data_A
# 'Stephen' has two rows with different information in Data_B

# Let's add the information for 'score_time2' to Data_A
# We'll do the match by using the variable 'name' as a key.

match(Data_A$name, Data_B$name)     # Returns the POSITION of the FIRST argument in a VECTOR.
                                    # Note, 'Mary' is not in Data_B.
                                    # 'Stephen' has two rows, only the 'first' match is reported.
                    #  'David ia not reported at all

# Once we know the positions, we can import the information for 'score_time2'
# The following notations are equivalent:
Data_B[match(Data_A$name, Data_B$name) , "score_time2"]
Data_B$score_time2[match(Data_A$name, Data_B$name)]

# Finally, we add the information to new column in Data_A.
# The following notations are equivalent:
Data_A$score_time2 <- Data_B$score_time2[match(Data_A$name, Data_B$name)]
Data_A

Data_A[,"score_time2"] <- Data_B$score_time2[match(Data_A$name, Data_B$name)]
Data_A

# Note that only the first 'match' for Stephen is imported.
Data_B

8.1.3. Merging two datasets

Let’s go back to our original datasets

Data_A <- data.frame(
                    ID = c(1:5),
                    name = c("Edgar","Katie","John","Mary","Stephen"),
                    age = c(31,26,34,21,20),
                    department = c("Edgar","Katie","Sociology","Economics","Education"),
                    treatment = c(1,1,0,1,0),
                    score_time1 = c(8.5, 8.6, 7.4, 9.3, 9.2)
                    )

Data_B <- data.frame(
                    name = c("Edgar","Katie","John","Stephen","Stephen","David"),
                    score_time2 = c(8.7, 8.8, 7.3, 9.3, 9.5, 8.2)
                    )

Data_A
Data_B

# Note:
# 'Mary' is not in Data_B
# 'David'is not in Data_A
# 'Stephen' has two rows with different information in Data_B

# The function 'merge()' in R merges two databases X and Y.
# Syntax:
#       merge(  x,                      # X dataset
#               y,                      # Y dataset
#               by.x = "Identifier",    # Key in X. More than 1 variable can be used.
#               by.y = "Identifier",    # Key in Y. More than 1 variable can be used.
#               all.x = TRUE,           # Keep all observations in X (TRUE/FALSE)
#               all.y = TRUE,           # Keep all observations in Y (TRUE/FALSE)
#               suffixes = c(".x",".y") # If other variables have the same name in X and Y
#               )                       #  rename them according to the source
                                        #  by adding the suffixes c(".x",".y")

# Notes:In Stata, the merge is based on variables that have the same name in both databases.
#       In R, these variables can have different names.
#       In Stata, the merge can be 1:1, m:1, 1:m, or m:m.
#       In R, these options can be approximated by changing the parameters 'all.x', 'all.y'.
#       Note that R doesn't check for unique identifiers.
#       If there are other variables with the same name, Stata keeps the data of the master dataset.
#       Instead, R allows the user to keep both columns.

# Let's try the 4 different merges
# Note the suffixes for name
merge(Data_A,Data_B,by.x="name", by.y="name", all.x=F, all.y=F)  # It throws 'Mary'and 'David'
merge(Data_A,Data_B,by.x="name", by.y="name", all.x=T, all.y=F)  # It throws 'David'

# More on this soon...

EXERCISE 4

  1. Merge the two datasets (A and B ) by name excluding Mary. Remember that Mary is not in dataset B

  2. Merge the two datasets (A and B) keeping all info on both datasets

9. The dplyr and magrittr Packages

The dplyr package makes data manipulation much easier. It provides simple “verbs”, functions that correspond to the most common data manipulation tasks, to help you translate those thoughts into code. It also uses efficient data storage backends, so you spend less time waiting for the computer.

Here’s an useful cheat seet:

https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

9.1 dplyr

Let’s play with another dataset from “Zelig”

# This dataset contains voting data for the 1988 Mexican presidential election.
data("mexico")

#Remember the following commands to get familiar with your data
dim(mexico)         # Dimensions of the data frame.             Syntax: dim(x)
head(mexico,10)   # Shows first n rows.                     Syntax: head(x,n) 
tail(mexico,10)   # Shows last n rows.                  Syntax: head(x,n)
str(mexico)           # Displays the structure of an object.    Syntax: str(x)
summary(mexico)   # Displays summary statistics.            Syntax: summary(x)
colnames(mexico)    # Column names of an object.            Syntax: colnames(x)
#?mexico


## First, let's transform our data into a tbl_df form. This is just for covenience
library(dplyr)
mexico <- tbl_df(mexico)

Single table verbs

Dplyr aims to provide a function for each basic verb of data manipulation:

  • filter() (and slice()) Return rows with matching conditions.

  • arrange() Sort a variable in descending order.

  • select() Keeps only the specified variables

  • distinct() Retain only unique/distinct rows from an input tbl.

  • mutate() Creates new variables

  • summarise() Summarise multiple values to a single value.

  • sample_n() Makes it easy to select random rows from a table.

We can forget now about ‘$’ , ‘[]’, and other annoyances

#### FILTER & SLICE
### Lets filter all voters younger than 25 and female
### Syntax: filtert(df, statements)

mexico_young<- filter(mexico, age<25 & female==1)
dim(mexico_young)

## Equivalent to:
#mexico[mexico$age<25 & mexico$female==1, ]


###Slice selects rows by position
slice(mexico, 1:5)


### ARRANGE 
### Reorders rows
mexico
mexico <- arrange(mexico, vote88)
mexico

##We can use several rows
mexico
mexico <- arrange(mexico, vote88, -pri82)  # Now vote goes from 1 to 3 and, within each pri is ordered 
mexico

#### SELECT
#### Lets keep only some variables: vote in 88, age and sex
### Syntax: select(df, variables)
mexico_short <- select(mexico, vote88, age, female)


### We can rename with this function
mexico_short <- select(mexico, vote88, age, sex=female)

####MUTATE
### Add new columns that are transformation of existing columns
### Syntax: mutate(df, new_variables)
mexico_young <- mutate(mexico, young_female= (age<25 & female==1)) #This creates a logical statement

mexico_young <- mutate(mexico_young, young_female= as.numeric(young_female)) ## Transform to binary

## TRANSMUTE: If you want to keep these variables only

transmute(mexico, young_female= (age<25 & female==1))



#### SUMMARISE
# It collapses a data frame to a single row
summarise(mexico, female=mean(female) )
summarise(mexico, age=mean(age) )


####SAMPLE_N
sample_n(mexico_short, 10)

EXERCISE 5

  1. Get the dataset flights First, run install.packages(“nycflights13”). This dataset contains all 336776 flights that departed from New York City in 2013. Run some descritive functions for this dataset

  2. Transform flights into a tbl_df() object. Filter the flights of December 25. Now, do the same for Dec 24 and 26. What are the patterns for those days?

  3. Select the variables refering to date (year, month, day)

  4. Create a new column which calculating the difference between the delay at arrival and the delay at departure

  5. Reorder the dataset your new variable in arrival and carrier.

  6. Obtain the mean dealy time at departure. Be careful with the missing data!

9.1.1 Grouped operations in dplyr

The real power of dplyr comes when we combine arguments. In dplyr, you do this by with the group_by() function. It breaks down a dataset into specified groups of rows.

For Stata users: This is similar to the ‘collapse’ function’

Grouping affects the verbs as follows:

  • grouped select() is the same as ungrouped select(), except that grouping variables are always retained.

  • grouped arrange() orders first by the grouping variables

  • mutate() and filter() are most useful in conjunction with window functions (like rank(), or min(x) == x). They are described in detail in vignette(“window-functions”).

  • sample_n() and sample_frac() sample the specified number/fraction of rows in each group.

  • slice() extracts rows within each group.

  • summarise() is powerful and easy to understand, as described in more detail below.

Lets apply this to our flights data:

### Lets calculate the average delay in departure by carrier
library(nycflights13)

delay_carrier <- group_by(flights, carrier)  # This first statement only specifies the new arrangement but 
                                             # on its own is not very useful
delay_carrier <- summarise(delay_carrier, departure=mean(dep_delay, na.rm=T))
delay_carrier



###  Summarise is useful with aggregate functions, which take a vector of values and return a single number. There are many useful examples of such functions in base R like min(), max(), mean(), sum(), sd(), median(), and IQR(). dplyr provides a handful of others. For example, we could use these to find the number of planes and the number of flights that go to each possible destination:

destinations <- group_by(flights, dest) #Group by destination 
summarise(destinations,
  planes = n_distinct(tailnum),
  flights = n()
)

9.1.2 Combining datasets

dplyr has some very useful commands

  • lef_join(a, b, by = “x1”): Join matching rows from b to a.

  • right_join(a, b, by = “x1”): Join matching rows from a to b.

  • inner_join(a, b, by = “x1”): Join data. Retain only rows in both sets.

  • full_join(a, b, by = “x1”): Join data. Retain all values, all rows

### Let's create two dataframes

athletes <- data.frame(name= c("Phelps", "Biles", "Ledecky",  "Bolt", "Thompsom" ,
                                "Douglas","Nadal", "Murray")  ,
                          sport = c("swimming", "gymnastics", "swimming", "track", 
                                    "track", "gymnastics", "tennis","tennis"),
                          gender = c(0, 1, 1,0, 1, 1, 0,0),
                          country = c("US","US","US", "Jamaica", "Jamaica", 
                                      "US", "Spain", "Australia"))
                          
dim(athletes)


countries <- data.frame(country= c("US", "Jamaica", "Spain", "China"),
                        medals = c(121,11,17, 70))

dim(countries)


### Inner Join ####
#inner_join(x, y): Return all rows from x where there are matching values in y, 
#and all columns from x and y. If there are multiple matches between x and y,
#all combination of the matches are returned. This is a mutating join.

olympics <- inner_join(athletes, countries)
olympics
#We lose Murray in the join because, although he appears in x = athletes, 
#his country does not appear in y = countries. The join result has all variables from x =  plus yr_founded, from y.


### Semi Join ####
#semi_join(x, y): Return all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x. This is a filtering join.

olympics <- semi_join(athletes, countries)
olympics

#Now the effects of switching the x and y roles is more clear. The result resembles x = countries, but the county #China is lost, because there are no observations  in y = athletes.

olympics <- semi_join(countries, athletes)
olympics


#### Left join ####
#left_join(x, y): Return all rows from x, and all columns from x and y. 
#If there are multiple matches between x and y, #all combination of the matches are returned. This is a mutating join.

olympics <- left_join(athletes, countries)
olympics

#We basically get x = athletes back, but with the addition of variable medals,
#which is unique to y = countries. Murray, whose publisher does not appear in y =countries, has an NA for medals.


##Anti join ###
#anti_join(x, y): Return all rows from x where there are not matching values in y, keeping just columns from x. This is a filtering join.

olympics <- anti_join(athletes, countries)
olympics

#We keep only Murray now (and do not get medals).



### Full join ####
#full_join(x, y): Return all rows and all columns from both x and y. 
#Where there are not matching values, returns NA for the one missing. This is a mutating join.


olympics <- full_join(athletes, countries)
olympics

#We get all rows of x = athletes plus a new row from y = countries, containing China. We get all variables from x = athletes AND all variables from y = countries. Any row that derives solely from one table or the other carries NAs in the variables found only in the other table.

EXERCISE 5.1 (More dplyr)

  1. Summarise the flights dataset to count the number of flights by day of the year. This dataset should have 365 rows.

  2. Now, summarise the number of delays at departure by month. Hint: Before grouping create a delay variable using a logical statement (delayed if dep_delay>0). Also, consider the possibility of missing values.

  3. Using the datasets atheltes and countries perform a anti_join but now reverse the order. That is x=countries, y=athletes. Explain your results.

NOTE: The Pipe concept

9.2 Magritrr

# Let's say that we want to group our data by day, select the delay variables and obtain the average delays at arrival and departure by day, AND keep the days with mean times higher than 30 min....UFFFF. This is how it would look 
 

daily  <- group_by(flights, year, month, day)
daily  <- select(daily, arr_delay, dep_delay)
## Adding missing grouping variables: `year`, `month`, `day`
daily_mean <- summarise(daily, 
     arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE))
filter(daily_mean, arr>30 | dep>30)



# Lets peform this with pipes

flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)
## Adding missing grouping variables: `year`, `month`, `day`
## Cleaner, elegant, understandable!!!!
Note: This script is based on the R Workshop created by Gustavo Robles, some exercises are based on Cotton, R. (2013), Learning R , O’Reilly. Also from: https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html and http://stat545.com/bit001_dplyr-cheatsheet.html