Chris Bail
Computational Sociology
1. The new wave of available data provides unprecedented opportunities for sociologists to address some of the most important questions in the field.
2. Programs such as Stata cannot be used to work with such data because they are not object-oriented or open-source.
3. We need to learn an object-oriented programming language, and I chose R because I believe it has the best combination of tools for statistics, visualization, and data processing.
1. Mastering different types of “objects” in R
2. Importing data from other sources (e.g. STATA or .csv)
3. Cleaning/Recoding/Reshaping/Merging Data
Your R environment is a combination of different types of objects.
Many objects in R are vectors. These are sequences of values. We define a vector as follows:
my_vector<-c(1, 3, 4, 9)
Next, let's try out some basic operations on numeric vectors:
mean(my_vector)
[1] 4.25
median(my_vector)
[1] 3.5
And a few more:
max(my_vector)
[1] 9
min(my_vector)
[1] 1
Or, we can get all of them at once:
summary(my_vector)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 2.50 3.50 4.25 5.25 9.00
Note that vectors can also be sequences of strings:
my_word_vector<-c("Norbert Elias","is","the Best")
We grab elements within a vector using []
my_word_vector[3]
[1] "the Best"
1) Create your own vector of numbers
2) Create a new variable that is the mean of your vector
my_new_vector<-c(100,200, 549)
average_vector<-mean(my_new_vector)
average_vector
[1] 283
Vectors are a building block of matrices, another critical type of object in R.
To create a matrix, we use the matrix() function:
my_matrix <- matrix(c(1, 2, 1, 2,
                      64000, 38000, 100000, 200000,
                      1, 5, 17, 21),
                    nrow = 4, ncol = 3)
To look at our matrix, we can run this line:
my_matrix
[,1] [,2] [,3]
[1,] 1 64000 1
[2,] 2 38000 5
[3,] 1 100000 17
[4,] 2 200000 21
Or, you can click on “my_matrix” in the upper-right pane of RStudio.
Often, we will need to grab one row of a matrix, or one column. This is called “indexing.” Indexing requires both brackets and commas:
my_matrix[1,]
[1] 1 64000 1
The comma specifies whether you are requesting the rows or the columns of the matrix. To request the first column, we would run:
my_matrix[,1]
[1] 1 2 1 2
To get the value of a cell within a matrix, we need to tell R about both the row and the column:
my_matrix[1,2]
[1] 64000
64,000 is the number that is in the second column of the first row
Lists are like vectors, but unique in that they may contain multiple types of data (e.g. strings, numbers, or even matrices).
Let's create a list:
my_list<-list(9, "Norbert Elias", my_matrix)
Let's take a look:
my_list
[[1]]
[1] 9
[[2]]
[1] "Norbert Elias"
[[3]]
[,1] [,2] [,3]
[1,] 1 64000 1
[2,] 2 38000 5
[3,] 1 100000 17
[4,] 2 200000 21
Let's say we wanted to grab “Norbert Elias” from our list. We can just write:
my_list[[2]]
[1] "Norbert Elias"
Lists can also have sub-elements, which are indexed using single brackets. For example, if the second element of a list were itself a vector with multiple sub-elements, we could obtain its second sub-element as follows: my_list[[2]][2]
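For example, here is a minimal sketch using a hypothetical list (not one of the objects created above) whose second element is itself a character vector:
another_list<-list(9, c("Norbert","Elias"), my_matrix)
another_list[[2]][2]
[1] "Elias"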
Why does this matter? Once you get into extracting data from the web or other unstructured data sources, knowing how matrices and lists work will be critical.
Matrices and lists are also important because they are the building blocks of what may be the most important type of object in R: data frames.
Data frames are very similar to datasets you might load into STATA/SPSS/SAS in that they have rows, columns, column names, etc.
In order to create a data frame, we can use the following command on our matrix:
my_data_frame<-as.data.frame(my_matrix)
This line creates a new object in the upper right “Environment” pane of RStudio. If we click up there, we see that R has already chosen some arbitrary names for our columns (V1, V2, V3)
R uses clunky syntax to change column names. This is worth our time, however, because column names often change when you import or manipulate data.
Let's change “V1” to “Sex”:
colnames(my_data_frame)[colnames(my_data_frame)=="V1"]<-"Sex"
But let's say we want to use words instead of numbers to describe sex. In this case, we need to change the contents of the data frame as follows:
my_data_frame$Sex[my_data_frame$Sex==1]<-"Female"
my_data_frame$Sex[my_data_frame$Sex==2]<-"Male"
The $ operator is
how you tell R that you are looking for a specific
variable within the data frame.
Now let's look at our data frame:
my_data_frame
Sex V2 V3
1 Female 64000 1
2 Male 38000 5
3 Female 100000 17
4 Male 200000 21
Now let's figure out the sex breakdown of our
data using the table command.
table(my_data_frame$Sex)
Female Male
2 2
1) Change the name of the second column in
my_data_frame to Income
2) Calculate the median of the Income variable.
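One possible solution, reusing the column-renaming syntax from above:
colnames(my_data_frame)[colnames(my_data_frame)=="V2"]<-"Income"
median(my_data_frame$Income)
[1] 82000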
Until now, we have been working at a very abstract level. This is because I needed to teach you some basic concepts before we can start to work with real data.
R data files have the extension .Rdata. We will work with these soon, but let's begin by pulling in other types of data files, because it's unlikely that you will be working with an .Rdata file if you are coming from another program such as STATA.
R has a variety of ways of importing data.
For example, data often comes in .csv
format. To read this, we use the read.csv
command.
sample_csv_data<-read.csv("Sample_CSV_Data.csv")
As the upper right hand pane of RStudio now shows, these data have 9909 observations and 406 variables.
By default, R assumes that the first line of these data contains the variable names. To list the first few variable names, we can write:
head(colnames(sample_csv_data))
[1] "institution_id" "UNITID" "school_id"
[4] "gss_code" "year" "Institution_Name"
R treats strings as factors during data import unless you tell it not to. This can become a problem if you try to perform operations on string variables that are actually factor variables.
In order to determine the “class” of a variable (that is, whether it is a numeric, character, or factor variable), we can use the class() command:
class(sample_csv_data$Institution_Name)
[1] "factor"
If we want to prevent R
from defaulting to this behavior, we can add
an option to our read.csv command. Options
for most commands are specified by a comma
after the name of the object you want to apply
the command to.
sample_csv_data<-read.csv("Sample_CSV_Data.csv", stringsAsFactors=FALSE)
How do you find the options for an R command?
In this case, we can run,
?read.csv
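Alternatively, if the data have already been imported and a variable came in as a factor, one quick fix (a sketch, not part of the original import code) is to convert it with as.character():
sample_csv_data$Institution_Name<-as.character(sample_csv_data$Institution_Name)
class(sample_csv_data$Institution_Name)
[1] "character"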
Thus far we have been using commands from “base R.” While base R includes an impressive array of commands, in order to take full advantage of R, and particularly cutting-edge techniques, you need to install packages.
Packages are user-contributed bundles of code. These are not unlike contributed commands in STATA/SPSS/SAS, but they are much richer because R has greater functionality. There are also many more people writing packages for R than for these other software programs.
The first package we will install helps us read in data from other stats programs (e.g. STATA/SPSS/SAS).
This package is called haven, but we could have used many others.
In order to install a package, we use the
install.packages() command:
install.packages("haven")
Each and every time we want to use a package we need to “load it” into R's memory as follows:
library("haven")
We can also write:
require("haven")
Let's use haven to import some STATA data that
I put in our class Dropbox:
sample_stata_data<-read_stata("Sample Stata Data.dta")
Note that this package not only allows you to read SPSS and SAS files, but also lets you write R data frames out to these formats.
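For instance, a hedged sketch of the corresponding haven commands (the file names below are placeholders, not files in our Dropbox):
sample_spss_data<-read_spss("Sample SPSS Data.sav")
sample_sas_data<-read_sas("Sample SAS Data.sas7bdat")
write_dta(sample_stata_data, "Sample Stata Data Copy.dta")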
Cleaning/Reshaping/“Wrangling” data is a core task of computational sociology and “data science” more broadly.
A recent New York Times Article suggests 80% of data scientists' time is spent cleaning data, while only 20% of their time is spent analyzing it. My experience is roughly the same.
Why does cleaning take so long? Because many web or text-based data sets are either unstructured or semi-structured. Many datasets also contain large amounts of missing data, or must be harmonized in order to be merged.
First, let's work on subsetting data. This simply means breaking a data frame into chunks. The syntax is similar to the syntax we used when we worked with matrices. For example, let's say we want to take all respondents from our Stata dataset who are younger than 50. The variable we want is called “age.”
respondents_under_50<-sample_stata_data[sample_stata_data$age<50,]
The [] and , are critical. They tell R which part of the data frame we want to manipulate. This is also the first time we have used a logical operator (<); we could have used <=, >, or != as well.
Next, let's examine how to identify missing data. To drop all rows
of a dataset with missing data on any variable, we can use the complete.cases()
command:
no_missing_data<-sample_csv_data[complete.cases(sample_csv_data),]
Often, you might want to identify patterns of missing data by individual variables within a data frame. There are some sophisticated visualization packages that will do this for you, but for pedagogical purposes, let's do it first the old fashioned way.
First, let's read in some new data:
pewdata<-read.csv("Sample_Pew_Data.csv")
Working with different datasets is useful because it gives you a sense of the range of different problems you might encounter with data cleaning.
In this data set, for example, missing data
was coded as 9 instead of NA (or empty cells, which R would
have read in as NA).
Let's look at missing data on the pew10 variable, which is about
whether Americans supported the construction of the “Ground Zero”
mosque in New York in 2011.
First, let's change the 9's to NAs
pewdata$pew10[pewdata$pew10==9]<-NA
Next, let's create a new data frame that contains the missing data:
missing<-pewdata[is.na(pewdata$pew10),]
If we want to take all the values where pew10 is NOT
missing, we would do this:
no_missing<-pewdata[!is.na(pewdata$pew10),]
Note that is.na() is a logical operator. To demonstrate:
head(is.na(pewdata$pew10))
[1] FALSE FALSE FALSE FALSE TRUE TRUE
What if we want to find all of the men in the dataset with missing data?
First, we need to find the variable for sex:
colnames(pewdata)
[1] "psraid" "state" "density" "pew1f" "pew2" "pew2os" "pew3a"
[8] "pew3b" "pew3c" "pew3d" "pew4" "pew5" "pew6" "pew7"
[15] "par2" "pew8" "pew9" "pew10" "sex" "age" "recage"
[22] "educ" "receduc" "hisp" "race" "racethn" "inc" "income"
[29] "party" "partyln" "weight"
Now let's see how the variable is coded:
table(pewdata$sex)
1 2
424 579
I happen to know that 1=male in these data,
so in order to find the men with missing data
we can combine the two variables with the logical
operator &
missing<-pewdata[is.na(pewdata$pew10)& pewdata$sex==1,]
Note again that we need the , because we are telling R
we want the rows.
If we wanted to trim columns from the data
we would need to put the content we want after the ,
We can either use the numbers of the columns or their names. Let's say we just want the two variables we've been working with so far:
gender_and_mosque<-pewdata[,c("sex","pew10")]
The c() operator is necessary here because we are asking for multiple variables.
Now let's say we wanted everything but the first column in the dataset.
First we would need to know the number of columns. We can use
ncol() for this purpose:
ncol(pewdata)
[1] 31
Next, we tell R that we want columns 2 through 31 using the : operator,
which indicates a sequence:
no_first_column<-pewdata[,2:31]
We could also combine the two steps as follows:
no_first_column<-pewdata[,2:ncol(pewdata)]
I'm noting this because it will be helpful to know that this is possible when we discuss programming later in this class.
What is the age of the oldest man in the pewdata dataset?
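One way to answer this, combining the indexing and logical operators covered above (a sketch; recall that 1=male in these data):
max(pewdata$age[pewdata$sex==1], na.rm=TRUE)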
Another very common task in computational sociology is reshaping data. For
example, suppose we wanted to examine partisanship by race. The
Partyln variable describes the following question within the Pew Data:
“As of today, do you lean more to the Republican Party or more to the Democratic Party?”
The possible answers are:
1: Republican;
2: Democrat;
9: Missing.
It's annoying that these are not already labelled, but this is a very common problem. First, let's recode the numeric values into character strings:
pewdata$partyln[pewdata$partyln==1]<-"Republican"
pewdata$partyln[pewdata$partyln==2]<-"Democrat"
pewdata$partyln[pewdata$partyln==9]<-NA
Let's check to make sure it worked:
table(pewdata$partyln)
Democrat Republican
138 146
Note that the table command does not
list missing values (NAs)
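If we want the table to count the NAs as well, table() has a useNA option:
table(pewdata$partyln, useNA="ifany")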
We also need to recode the race variable.
pewdata$race[pewdata$race==1]<-"White"
pewdata$race[pewdata$race==2]<-"African American"
pewdata$race[pewdata$race==3]<-"Asian or Pacific Islander"
pewdata$race[pewdata$race==4]<-"Mixed Race"
pewdata$race[pewdata$race==5]<-"Native American"
pewdata$race[pewdata$race==6]<-"Other"
pewdata$race[pewdata$race==9]<-NA
We can produce a cross tabulation as follows:
table(pewdata$partyln, pewdata$race)
African American Asian or Pacific Islander Mixed Race
Democrat 11 4 2
Republican 4 3 3
Native American Other White
Democrat 1 2 114
Republican 3 1 129
Now that we've cleaned up this data frame, let's
save it in .Rdata format so that we can use it
tomorrow:
save(pewdata, file="Pew Data.Rdata")
Remember that R will save files in whatever folder
you specified as your working directory. Use getwd() to find it.
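For example (the path below is only a placeholder for wherever you keep your class files):
getwd()                     # prints the current working directory
setwd("~/Dropbox/R Class")  # placeholder path: change it to a folder on your machine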
But let's say we want the average age by race. As is
common with R, there are many different ways to do this.
Let's continue using base R. For the record, one could also use the plyr package, the reshape package, or the data.table package, just to name a few.
aggregate(pewdata$age, by=list(pewdata$race), FUN=mean)
Group.1 x
1 African American 48.71134
2 Asian or Pacific Islander 39.00000
3 Mixed Race 48.46154
4 Native American 49.05263
5 Other 30.22222
6 White 53.76255
What if we want the average age by both race and party?
aggregate(pewdata$age, by=list(pewdata$race, pewdata$partyln), FUN=mean)
Once again, note we could store these data as follows:
age_by_race<-aggregate(pewdata$age, by=list(pewdata$race), FUN=mean)
Another very common task you might face in R is merging multiple datasets.
This is one of the most common tasks you might encounter in data cleaning and manipulation precisely because R can have so many objects loaded in memory at once.
Imagine, for example, that we want to add average income by race to our dataset that describes average age. I put a very small spreadsheet in the Dropbox that describes average income by race.
race_income_data<-read.csv("Income By Race.xlsx")
read.csv cannot properly import an Excel file, so in this type of situation you would need to do some research. One can search for R help within RStudio by typing ?? before the search term:
??xlsx
The website Stack Overflow is an invaluable resource. A common recommendation there is to use the XLConnect package.
First, we need to install it:
install.packages("XLConnect")
library(XLConnect)
race_income_data <- readWorksheet(loadWorkbook("Income By Race.xlsx"),sheet=1)
There are many different commands for merging
data frames in R (e.g. the merge command in base R).
We are going to use the plyr package because it is
more powerful, faster, and easier to use
install.packages("plyr")
The command for merging datasets in plyr is called join
library(plyr)
merged_data<-join(age_by_race, race_income_data)
This looked like it worked, but if we view the
merged dataset, we see that it added NAs instead
of the values
head(merged_data)
Group.1 x race income
1 African American 48.71134 <NA> NA
2 Asian or Pacific Islander 39.00000 <NA> NA
3 Mixed Race 48.46154 <NA> NA
4 Native American 49.05263 <NA> NA
5 Other 30.22222 <NA> NA
6 White 53.76255 <NA> NA
Why did this happen? This particular command from the plyr package
automatically searches for column names that are
shared by both files.
Let's look at our column names:
colnames(age_by_race)
[1] "Group.1" "x"
colnames(race_income_data)
[1] "race" "income"
When we ran the aggregate command above, it gave
the columns new, generic names.
We need to fix the column names so that they are the same across the datasets:
colnames(age_by_race)[colnames(age_by_race)=="Group.1"]<-"race"
colnames(age_by_race)[colnames(age_by_race)=="x"]<-"age"
Let's try to merge again:
merged_data<-join(age_by_race, race_income_data)
head(merged_data)
race age income
1 African American 48.71134 NA
2 Asian or Pacific Islander 39.00000 NA
3 Mixed Race 48.46154 NA
4 Native American 49.05263 NA
5 Other 30.22222 NA
6 White 53.76255 57009
R did not give us an error message!!!
Solution: Inspect your data regularly!
One can use View(), head(), or table() for this
purpose, or (in times of desperation) output a .csv file, as sketched below.
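For instance, a quick sketch of writing an object out to a .csv that you can eyeball in a spreadsheet program (the file name here is only illustrative):
write.csv(merged_data, "merged_data_check.csv", row.names=FALSE)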
table(age_by_race$race)
African American Asian or Pacific Islander
1 1
Mixed Race Native American
1 1
Other White
1 1
table(race_income_data$race)
Asian Black White
1 1 1
The %in% operator identifies common elements in two
vectors
age_by_race$race %in% race_income_data$race
[1] FALSE FALSE FALSE FALSE FALSE TRUE
We can also combine %in% with !:
!(age_by_race$race %in% race_income_data$race)
[1] TRUE TRUE TRUE TRUE TRUE FALSE
The problem is that the race categories were labelled inconsistently across the two datasets, so we need to run the following lines:
race_income_data[race_income_data=="Black"]<-"African American"
race_income_data[race_income_data=="Asian"]<-"Asian or Pacific Islander"
Finally, let's re-merge the data:
merged_data<-join(age_by_race, race_income_data)
head(merged_data)
race age income
1 African American 48.71134 33321
2 Asian or Pacific Islander 39.00000 68636
3 Mixed Race 48.46154 NA
4 Native American 49.05263 NA
5 Other 30.22222 NA
6 White 53.76255 57009
Load a dataset that you've worked with in another program
into R. Recode several variables of your choice. Use the aggregate
command to create a summary dataset that describes the mean score of
these variables. Finally, try merging this summary dataset back onto the
original dataset using the join function in plyr.
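Here is a rough sketch of that workflow; every file, variable, and value name below is a placeholder that you would swap out for your own data:
# (assumes library(haven) and library(plyr) have been loaded as above)
# 1) Load a Stata dataset (placeholder file name)
my_data<-read_stata("My Old Project.dta")
# 2) Recode a variable of your choice (placeholder variable and value)
my_data$region[my_data$region==1]<-"South"
# 3) Summarize a variable by group using aggregate
summary_data<-aggregate(my_data$income, by=list(my_data$region), FUN=mean)
colnames(summary_data)<-c("region","mean_income")
# 4) Merge the summary back onto the original data with plyr's join
merged<-join(my_data, summary_data)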
Basic Programming
Next class I will introduce you to programming in R. In its simplest form, programming refers to writing instructions for a computer to complete simple tasks many times in a row. In this class, we will go over basic forms of “loops” as well as vectorized programming commands and “pipes.” These three programming techniques do very similar things, but it is important that you learn all three so that you can read (and use) other people's code.
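To preview what these look like, here is a minimal sketch of the same tiny task done three ways (the pipe example assumes the magrittr package, which supplies the %>% operator, has been installed):
some_numbers<-c(1, 3, 4, 9)
# A loop: square each element one at a time
squares<-c()
for(i in 1:length(some_numbers)){
  squares[i]<-some_numbers[i]^2
}
# A vectorized command: apply the same function to every element at once
squares<-sapply(some_numbers, function(x) x^2)
# A pipe: pass an object into a function using %>% from the magrittr package
library(magrittr)
squares<-some_numbers %>% sapply(function(x) x^2)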