Instructor: Edgar Franco
R can read and data from and to different formats.
Functions read.csv(), read.table(), read.delim() read delimited data files, .csv, and .txt files
Similarly, functions write.csv() and write.table() export delimited data files, .csv, and .txt files.
The package “xlsx” reads, writes, and formats Excel 2007 and Excel 97/2000/XP/2003 files
The package “foreign” reads and writes data stored by some versions of Stata, SAS, SPSS, dBase files, and other programs.
Function read.dbf() reads a DBF file into a data frame. This is useful when working with geographic information systems like GIS or QGIS. Some of the spatial or geographical data for this systems come in the form of .dbf files.
Remember: “Packages” are collections of objects including databases, functions, models, and compiled code.
Some packages are already installed in your computer and contain baseline functions and data.
Other functions need to be downloaded from the Comprehensive R Archive Network (CRAN).
You can see the available databases in a package by typing data(package=“package.name”)
For example, let’s work with the package “Zelig”
First load the package into the session using the function ‘library()’
library("Zelig")
##
## Attaching package: 'Zelig'
## The following object is masked from 'package:base':
##
## mode
Note that “Zelig” contains different types of models, functions, and databases.
To list the available datasets in a package, use the function ‘data()’
Syntax: ‘data(package = “package.name”)’
data(package=“Zelig”)
The function ‘data()’ is also used to load a particular dataset into the workspace. Syntax: ‘data(“dataset.name”)’
In this and future sessions we will be working with the “turnout” dataset in the “Zelig” package.
data(package="Zelig") # Lists the available datasets in a package
data("turnout") # Loads a dataset into the workspace
Now the dataset “turnout” is available for use.
ls()
Almost all datasets from packages have a help file with a brief description and codebook of the data.
To access the help file of a data, use the operator ‘?’ or the function ‘help(“dataset.name”)’
?turnout help(“turnout”)
Now we now that dataset “turnout” contains individual-level turnout data.
It pools several American National Election Surveys conducted during the 1992 presidential election year.
Only the first 2,000 observations were included in the sample data.
Remember the following commands to get familiar with your data
dim(turnout) # Dimensions of the data frame. Syntax: dim(x)
head(turnout,10) # Shows first n rows. Syntax: head(x,n)
tail(turnout,10) # Shows last n rows. Syntax: head(x,n)
str(turnout) # Displays the structure of an object. Syntax: str(x)
summary(turnout) # Displays summary statistics. Syntax: summary(x)
colnames(turnout) # Column names of an object. Syntax: colnames(x)
Selecting elements in objects is one of the most important skills to learn in R.
A dataset is a two dimensional object in R. Each element in a dataset has a set of coordinates: c(row,column)
To subsample data in a matrix, table, or dataframe, you should provide the row and column NUMBERS or NAMES of the elements you want to select.
Use data[rows,columns] to indicate these numbers, leave a blank to select all.
Let’s go back to our turnout dataset
head(turnout)
Remember: for a dataset the index is [Row, Column]
# Let's select some rows
head(turnout)
turnout[3 , ] # Row 3, all columns
turnout[c(3:5) , ] # Rows 3 to 5, all columns
turnout[ c(1,3,5), ] # Rows 1, 3, and 5, all columns
# To take a random subsample, you can use the function 'sample()'
# Note that 'sample()' only works with vectors.
# Syntax: sample(vector,size)
sample(c(1:2000),10)
# Then we can select a random sample of the rows and all columns.
head(turnout)
turnout[sample(c(1:2000),10) , ] # Random sample of size 10 for rows, all columns
# For illustration purposes, let's keep the first 10 rows of the 'turnout' dataset.
turnout <- turnout[c(1:10),]
turnout
# Columns or variables in datasets can be selected in three different ways
# Suppose we want to select the column for age.
# These four notations are equivalent
turnout[ , 2] # All rows, column 2
turnout[ , "age"] # All rows, column "age"
turnout[ , 'age'] # All rows, column 'age'
turnout$age # Variable age in dataset turnout
# The sign "$" indicates that vector "age" is an element of the dataset "turnout"
# To select two or more variables, you can use one the following notations
turnout
turnout[ , c(2,4)] # All rows, columns 2 & 3
turnout[ , c("age","income")] # All rows, column "age" and "income"
turnout[ , c('age','income')] # All rows, column 'age' and 'income'
# You can also use the function 'data.frame()'
data.frame(turnout$age,turnout$income) # Data frame variables age and income
# in dataset 'turnout'
# This produces a new data frame.
class(data.frame(turnout$age,turnout$income))
# Note the missing variable names in this new dataset.
# To fix this, you just need name your variables again.
data.frame(age = turnout$age, income = turnout$income)
## NOTE, none of these subsamples exist in the workspace since we didn't assign them an object name.
#'object <- data.frame'
ls()
NOTE: Everytime you run a function with a randomizing component the results might be different. You can set a seed to obtain the same results
# Try, in this order:
sample(c(1:2000),10)
sample(c(1:2000),10)
set.seed(2)
sample(c(1:2000),10)
set.seed(2)
sample(c(1:2000),10)
### In this case, everytime you run a randomizing function just after set.seed(number) you'll get the same results
To select data samples that satisfy one or more conditions, use the following conditionals
You can use more than one conditional to select samples of your data by using
# Let's use the first 20 rows of the 'turnout' data again.
rm(list=ls())
library("Zelig")
data("turnout")
turnout_short <- turnout[c(1:20),]
turnout_short
# First, let's select the observations for people that voted in the 1992 presidential election (vote==1).
# There are two types of vectors that are helpful for selecting cases.
turnout_short$vote == 1
# Reports whether a condition is TRUE or FALSE for each row.
which(turnout_short$vote == 1)
# Reports the row numbers of those cases that satisfy a condition.
# Note that the first vector reports TRUE or FALSE for all rows in the data.
# The second vector only reports sucess cases.
# Using these vectors, we can subsample our data
turnout_short
# The two following notations are equivalent:
turnout_short[turnout_short$vote==1 , ] # Select cases (rows) where condition == TRUE
turnout_short[which(turnout_short$vote == 1), ] # Select cases (rows) that satisfy the condition
############
# Now, let's select the observations for people that had 40 years old or less and voted in the 1992 presidential election.
# Conditions vote == 1 and age <= 40
turnout_short
# Once again, two types of vectors will be helpful to make this selection
turnout_short$vote == 1 & turnout_short$age <= 40
# Reports whether a condition is TRUE or FALSE for each row.
which(turnout_short$vote == 1 & turnout_short$age <= 40)
# Reports the row numbers of those cases that satisfy a condition
# Using these vectors, we can subsample our data
turnout_short
# The following notations are equivalent
turnout_short[turnout_short$vote==1 & turnout_short$age <= 40, ] # Select cases where condition == TRUE
turnout_short[which(turnout_short$vote == 1 & turnout_short$age <= 40), ] # Select cases that satisfy the condition
# Use the '-' operator to drop cases or variables
# Examples
turnout_short
turnout_short[-3, ] # Drops row 3
turnout_short[-c(3:5),] # Drops rows 3 to 5
turnout_short[-c(1,3,5),] # Drops rows 1, 3, and 5
# To drop columns or variables, you can use any of the following notations.
# Remember: the function 'colnames()' return the columns names of a dataset.
# Suppose we want to drop the variable "income".
# The following notations are similar
turnout_short
colnames(turnout_short)
turnout_short[ , -4] # Drops the 4th column
turnout_short[, c("race","age","educate","vote")] # Subsamples all data except column "income"
turnout_short[, c(colnames(turnout)!="income")] # Keeps columns whose name is not equal to "income".
turnout_short[,-c(which(colnames(turnout_short)=="income"))] # Drops the column whose name is equal to "income"
Estimate the mean of schooling years for people that were 40 years old or less and voted in the 1992 presidential election.
The function ‘with()’ constructs a local environment defined by the data.No additional references are required.
Syntax: with(data, ‘expression’). Use the function with() to obtain the education level for people that were 40 years old or less and voted in the 1992 presidential election.
Alternatively, you can use the function ‘subset()’. ‘subset()’ returns subsets of vectors, matrices or data frames which meet conditions. Syntax: ’subset(data.frame, subset = conditions for rows, select = conditions for columns). First, from the dataset turnout_short get the people who voted using the function subset. Second, with the same function get the people that were 40 years old or less and voted Finally, get the level of schooling for these people.
For the dataset turnout_short, drop the cases for people that were 30 years old or less and voted. Hint. You can use the ‘which()’ function.
There are at least 3 ways to add additional data to a data frame:
# Let's create first two databases
Data_A <- data.frame(
ID = c(1:5),
name = c("Edgar","Katie","John","Mary","Stephen"),
age = c(31,26,34,21,20),
department = c("PoliSci","Education","Sociology","Economics","Education"),
treatment = c(1,1,0,1,0),
score_time1 = c(8.5, 8.6, 7.4, 9.3, 9.2)
)
Data_B <- data.frame(
name = c("Edgar","Katie","John","Stephen","Stephen","David"),
score_time2 = c(8.7, 8.8, 7.3, 9.3, 9.5, 8.2)
)
Data_A
Data_B
# Note:
# 'Mary' is not in Data_B
# 'David'is not in Data_A
# 'Stephen' has two rows with different information in Data_B
## Supose we have the following information for David
data.david <- data.frame(ID = 6, name = "David" , age = 25 ,
department = "Anthropology",
treatment = 0, score_time1 = 8.3)
data.david
Data_A
## Note that data.david and Data_A have the same columns
## To append two datasets, you can use the function 'rbind()' which stands for row binding
rbind(Data_A,data.david)
# Let's append the data for 'David' and replace Data_A
Data_A <- rbind(Data_A,data.david)
Data_A
## The function 'match' returns a vector of the positions of (first) matches of its first argument in its second.
## In other words, it looks for an argument in the first column of a 'keyed' table and returns the position of the first match.
## It works similar to the function 'vlookup' in Excel.
# Syntax: 'match(argument,table)'
# Let's go back to our original datasets
Data_A <- data.frame(
ID = c(1:5),
name = c("Edgar","Katie","John","Mary","Stephen"),
age = c(31,26,34,21,20),
department = c("PoliSci","Education","Sociology","Economics","Education"),
treatment = c(1,1,0,1,0),
score_time1 = c(8.5, 8.6, 7.4, 9.3, 9.2)
)
Data_B <- data.frame(
name = c("Edgar","Katie","John","Stephen","Stephen","David"),
score_time2 = c(8.7, 8.8, 7.3, 9.3, 9.5, 8.2)
)
Data_A
Data_B
# Note:
# 'Mary' is not in Data_B
# 'David'is not in Data_A
# 'Stephen' has two rows with different information in Data_B
# Let's add the information for 'score_time2' to Data_A
# We'll do the match by using the variable 'name' as a key.
match(Data_A$name, Data_B$name) # Returns the POSITION of the FIRST argument in a VECTOR.
# Note, 'Mary' is not in Data_B.
# 'Stephen' has two rows, only the 'first' match is reported.
# 'David ia not reported at all
# Once we know the positions, we can import the information for 'score_time2'
# The following notations are equivalent:
Data_B[match(Data_A$name, Data_B$name) , "score_time2"]
Data_B$score_time2[match(Data_A$name, Data_B$name)]
# Finally, we add the information to new column in Data_A.
# The following notations are equivalent:
Data_A$score_time2 <- Data_B$score_time2[match(Data_A$name, Data_B$name)]
Data_A
Data_A[,"score_time2"] <- Data_B$score_time2[match(Data_A$name, Data_B$name)]
Data_A
# Note that only the first 'match' for Stephen is imported.
Data_B
Let’s go back to our original datasets
Data_A <- data.frame(
ID = c(1:5),
name = c("Edgar","Katie","John","Mary","Stephen"),
age = c(31,26,34,21,20),
department = c("Edgar","Katie","Sociology","Economics","Education"),
treatment = c(1,1,0,1,0),
score_time1 = c(8.5, 8.6, 7.4, 9.3, 9.2)
)
Data_B <- data.frame(
name = c("Edgar","Katie","John","Stephen","Stephen","David"),
score_time2 = c(8.7, 8.8, 7.3, 9.3, 9.5, 8.2)
)
Data_A
Data_B
# Note:
# 'Mary' is not in Data_B
# 'David'is not in Data_A
# 'Stephen' has two rows with different information in Data_B
# The function 'merge()' in R merges two databases X and Y.
# Syntax:
# merge( x, # X dataset
# y, # Y dataset
# by.x = "Identifier", # Key in X. More than 1 variable can be used.
# by.y = "Identifier", # Key in Y. More than 1 variable can be used.
# all.x = TRUE, # Keep all observations in X (TRUE/FALSE)
# all.y = TRUE, # Keep all observations in Y (TRUE/FALSE)
# suffixes = c(".x",".y") # If other variables have the same name in X and Y
# ) # rename them according to the source
# by adding the suffixes c(".x",".y")
# Notes:In Stata, the merge is based on variables that have the same name in both databases.
# In R, these variables can have different names.
# In Stata, the merge can be 1:1, m:1, 1:m, or m:m.
# In R, these options can be approximated by changing the parameters 'all.x', 'all.y'.
# Note that R doesn't check for unique identifiers.
# If there are other variables with the same name, Stata keeps the data of the master dataset.
# Instead, R allows the user to keep both columns.
# Let's try the 4 different merges
# Note the suffixes for name
merge(Data_A,Data_B,by.x="name", by.y="name", all.x=F, all.y=F) # It throws 'Mary'and 'David'
merge(Data_A,Data_B,by.x="name", by.y="name", all.x=T, all.y=F) # It throws 'David'
# More on this soon...
Merge the two datasets (A and B ) by name excluding Mary. Remember that Mary is not in dataset B
Merge the two datasets (A and B) keeping all info on both datasets
The dplyr package makes data manipulation much easier. It provides simple “verbs”, functions that correspond to the most common data manipulation tasks, to help you translate those thoughts into code. It also uses efficient data storage backends, so you spend less time waiting for the computer.
Here’s an useful cheat seet:
https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
Let’s play with another dataset from “Zelig”
# This dataset contains voting data for the 1988 Mexican presidential election.
data("mexico")
#Remember the following commands to get familiar with your data
dim(mexico) # Dimensions of the data frame. Syntax: dim(x)
head(mexico,10) # Shows first n rows. Syntax: head(x,n)
tail(mexico,10) # Shows last n rows. Syntax: head(x,n)
str(mexico) # Displays the structure of an object. Syntax: str(x)
summary(mexico) # Displays summary statistics. Syntax: summary(x)
colnames(mexico) # Column names of an object. Syntax: colnames(x)
#?mexico
## First, let's transform our data into a tbl_df form. This is just for covenience
library(dplyr)
mexico <- tbl_df(mexico)
Single table verbs
Dplyr aims to provide a function for each basic verb of data manipulation:
filter() (and slice()) Return rows with matching conditions.
arrange() Sort a variable in descending order.
select() Keeps only the specified variables
distinct() Retain only unique/distinct rows from an input tbl.
mutate() Creates new variables
summarise() Summarise multiple values to a single value.
sample_n() Makes it easy to select random rows from a table.
We can forget now about ‘$’ , ‘[]’, and other annoyances
#### FILTER & SLICE
### Lets filter all voters younger than 25 and female
### Syntax: filtert(df, statements)
mexico_young<- filter(mexico, age<25 & female==1)
dim(mexico_young)
## Equivalent to:
#mexico[mexico$age<25 & mexico$female==1, ]
###Slice selects rows by position
slice(mexico, 1:5)
### ARRANGE
### Reorders rows
mexico
mexico <- arrange(mexico, vote88)
mexico
##We can use several rows
mexico
mexico <- arrange(mexico, vote88, -pri82) # Now vote goes from 1 to 3 and, within each pri is ordered
mexico
#### SELECT
#### Lets keep only some variables: vote in 88, age and sex
### Syntax: select(df, variables)
mexico_short <- select(mexico, vote88, age, female)
### We can rename with this function
mexico_short <- select(mexico, vote88, age, sex=female)
####MUTATE
### Add new columns that are transformation of existing columns
### Syntax: mutate(df, new_variables)
mexico_young <- mutate(mexico, young_female= (age<25 & female==1)) #This creates a logical statement
mexico_young <- mutate(mexico_young, young_female= as.numeric(young_female)) ## Transform to binary
## TRANSMUTE: If you want to keep these variables only
transmute(mexico, young_female= (age<25 & female==1))
#### SUMMARISE
# It collapses a data frame to a single row
summarise(mexico, female=mean(female) )
summarise(mexico, age=mean(age) )
####SAMPLE_N
sample_n(mexico_short, 10)
Get the dataset flights First, run install.packages(“nycflights13”). This dataset contains all 336776 flights that departed from New York City in 2013. Run some descritive functions for this dataset
Transform flights into a tbl_df() object. Filter the flights of December 25. Now, do the same for Dec 24 and 26. What are the patterns for those days?
Select the variables refering to date (year, month, day)
Create a new column which calculating the difference between the delay at arrival and the delay at departure
Reorder the dataset your new variable in arrival and carrier.
Obtain the mean dealy time at departure. Be careful with the missing data!
The real power of dplyr comes when we combine arguments. In dplyr, you do this by with the group_by() function. It breaks down a dataset into specified groups of rows.
For Stata users: This is similar to the ‘collapse’ function’
Grouping affects the verbs as follows:
grouped select() is the same as ungrouped select(), except that grouping variables are always retained.
grouped arrange() orders first by the grouping variables
mutate() and filter() are most useful in conjunction with window functions (like rank(), or min(x) == x). They are described in detail in vignette(“window-functions”).
sample_n() and sample_frac() sample the specified number/fraction of rows in each group.
slice() extracts rows within each group.
summarise() is powerful and easy to understand, as described in more detail below.
Lets apply this to our flights data:
### Lets calculate the average delay in departure by carrier
library(nycflights13)
delay_carrier <- group_by(flights, carrier) # This first statement only specifies the new arrangement but
# on its own is not very useful
delay_carrier <- summarise(delay_carrier, departure=mean(dep_delay, na.rm=T))
delay_carrier
### Summarise is useful with aggregate functions, which take a vector of values and return a single number. There are many useful examples of such functions in base R like min(), max(), mean(), sum(), sd(), median(), and IQR(). dplyr provides a handful of others. For example, we could use these to find the number of planes and the number of flights that go to each possible destination:
destinations <- group_by(flights, dest) #Group by destination
summarise(destinations,
planes = n_distinct(tailnum),
flights = n()
)
dplyr has some very useful commands
lef_join(a, b, by = “x1”): Join matching rows from b to a.
right_join(a, b, by = “x1”): Join matching rows from a to b.
inner_join(a, b, by = “x1”): Join data. Retain only rows in both sets.
full_join(a, b, by = “x1”): Join data. Retain all values, all rows
### Let's create two dataframes
athletes <- data.frame(name= c("Phelps", "Biles", "Ledecky", "Bolt", "Thompsom" ,
"Douglas","Nadal", "Murray") ,
sport = c("swimming", "gymnastics", "swimming", "track",
"track", "gymnastics", "tennis","tennis"),
gender = c(0, 1, 1,0, 1, 1, 0,0),
country = c("US","US","US", "Jamaica", "Jamaica",
"US", "Spain", "Australia"))
dim(athletes)
countries <- data.frame(country= c("US", "Jamaica", "Spain", "China"),
medals = c(121,11,17, 70))
dim(countries)
### Inner Join ####
#inner_join(x, y): Return all rows from x where there are matching values in y,
#and all columns from x and y. If there are multiple matches between x and y,
#all combination of the matches are returned. This is a mutating join.
olympics <- inner_join(athletes, countries)
olympics
#We lose Murray in the join because, although he appears in x = athletes,
#his country does not appear in y = countries. The join result has all variables from x = plus yr_founded, from y.
### Semi Join ####
#semi_join(x, y): Return all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x. This is a filtering join.
olympics <- semi_join(athletes, countries)
olympics
#Now the effects of switching the x and y roles is more clear. The result resembles x = countries, but the county #China is lost, because there are no observations in y = athletes.
olympics <- semi_join(countries, athletes)
olympics
#### Left join ####
#left_join(x, y): Return all rows from x, and all columns from x and y.
#If there are multiple matches between x and y, #all combination of the matches are returned. This is a mutating join.
olympics <- left_join(athletes, countries)
olympics
#We basically get x = athletes back, but with the addition of variable medals,
#which is unique to y = countries. Murray, whose publisher does not appear in y =countries, has an NA for medals.
##Anti join ###
#anti_join(x, y): Return all rows from x where there are not matching values in y, keeping just columns from x. This is a filtering join.
olympics <- anti_join(athletes, countries)
olympics
#We keep only Murray now (and do not get medals).
### Full join ####
#full_join(x, y): Return all rows and all columns from both x and y.
#Where there are not matching values, returns NA for the one missing. This is a mutating join.
olympics <- full_join(athletes, countries)
olympics
#We get all rows of x = athletes plus a new row from y = countries, containing China. We get all variables from x = athletes AND all variables from y = countries. Any row that derives solely from one table or the other carries NAs in the variables found only in the other table.
Summarise the flights dataset to count the number of flights by day of the year. This dataset should have 365 rows.
Now, summarise the number of delays at departure by month. Hint: Before grouping create a delay variable using a logical statement (delayed if dep_delay>0). Also, consider the possibility of missing values.
Using the datasets atheltes and countries perform a anti_join but now reverse the order. That is x=countries, y=athletes. Explain your results.
NOTE: The Pipe concept
# Let's say that we want to group our data by day, select the delay variables and obtain the average delays at arrival and departure by day, AND keep the days with mean times higher than 30 min....UFFFF. This is how it would look
daily <- group_by(flights, year, month, day)
daily <- select(daily, arr_delay, dep_delay)
## Adding missing grouping variables: `year`, `month`, `day`
daily_mean <- summarise(daily,
arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE))
filter(daily_mean, arr>30 | dep>30)
# Lets peform this with pipes
flights %>%
group_by(year, month, day) %>%
select(arr_delay, dep_delay) %>%
summarise(
arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE)
) %>%
filter(arr > 30 | dep > 30)
## Adding missing grouping variables: `year`, `month`, `day`
## Cleaner, elegant, understandable!!!!