Chris Bail
Computational Sociology
1. The new wave of available data provides unprecedented opportunities for sociologists to address some of the most important questions in the field.
2. Programs such as Stata cannot be used to work with such data because they are not object-oriented or open-source.
3. We need to learn an object-oriented programming language, and I chose R because I believe it has the best combination of tools for statistics, visualization, and data processing.
1. Mastering different types of “objects” in R
2. Importing data from other sources (e.g. STATA or .csv)
3. Cleaning/Recoding/Reshaping/Merging Data
Your R environment is a combination of different types of objects.
Many objects in R are vectors. These are sequences of values. We define a vector as follows:
my_vector<-c(1, 3, 4, 9)
Next, let's try out some basic operations on numeric vectors:
mean(my_vector)
[1] 4.25
median(my_vector)
[1] 3.5
And a few more:
max(my_vector)
[1] 9
min(my_vector)
[1] 1
Or, we can get all of them at once:
summary(my_vector)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 2.50 3.50 4.25 5.25 9.00
Note that vectors can also be sequences of strings:
my_word_vector<-c("Norbert Elias","is","the Best")
We grab elements within a vector using []
my_word_vector[3]
[1] "the Best"
1) Create your own vector of numbers
2) Create a new variable that is the mean of your vector
my_new_vector<-c(100,200, 549)
average_vector<-mean(my_new_vector)
average_vector
[1] 283
Vectors are a building block of matrices, another critical type of object in R.
To create a matrix, we use the matrix() function:
my_matrix <- matrix(c(1, 2, 1, 2,
                      64000, 38000, 100000, 200000,
                      1, 5, 17, 21),
                    nrow = 4, ncol = 3)
To look at our matrix, we can run this line:
my_matrix
[,1] [,2] [,3]
[1,] 1 64000 1
[2,] 2 38000 5
[3,] 1 100000 17
[4,] 2 200000 21
Or, you can click on “my_matrix” in the upper-right pane of RStudio.
Often, we will need to grab one row of a matrix, or one column. This is called “indexing.” Indexing requires both brackets and commas:
my_matrix[1,]
[1] 1 64000 1
The comma specifies whether you are requesting the rows or the columns of the matrix. To request the first column, we would run:
my_matrix[,1]
[1] 1 2 1 2
To get the value of a cell within a matrix, we need to tell R about both the row and the column:
my_matrix[1,2]
[1] 64000
64,000 is the number that is in the second column of the first row
Lists are like vectors, but unique in that they may contain multiple types of data (e.g. strings, numbers, or even matrices).
Let's create a list:
my_list<-list(9, "Norbert Elias", my_matrix)
Let's take a look:
my_list
[[1]]
[1] 9
[[2]]
[1] "Norbert Elias"
[[3]]
[,1] [,2] [,3]
[1,] 1 64000 1
[2,] 2 38000 5
[3,] 1 100000 17
[4,] 2 200000 21
Let's say we wanted to grab “Norbert Elias” from our list. We can just write:
my_list[[2]]
[1] "Norbert Elias"
Lists can also have sub-elements, which are indexed using single brackets. For example, if the second element of a list were itself a vector with multiple sub-elements, we could obtain its second sub-element as follows: my_list[[2]][2]
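For example, here is a minimal sketch using a hypothetical list (not one of the objects created above) whose second element is itself a character vector:
another_list<-list(9, c("Norbert","Elias"), my_matrix)
another_list[[2]][2]
[1] "Elias"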
Why does this matter? Once you get into extracting data from the web or other unstructured data sources, knowing how matrices and lists work will be critical.
Matrices and lists are also important because they are the building blocks of what may be the most important type of object in R: data frames.
Data frames are very similar to datasets you might load into STATA/SPSS/SAS in that they have rows, columns, column names, etc.
In order to create a data frame, we can use the following command on our matrix:
my_data_frame<-as.data.frame(my_matrix)
This line creates a new object in the upper right “Environment” pane of RStudio. If we click up there, we see that R has already chosen some arbitrary names for our columns (V1, V2, V3)
R uses clunky syntax to change column names. This is worth our time, however, because column names often change when you import or manipulate data.
Let's change “V1” to “Sex”:
colnames(my_data_frame)[colnames(my_data_frame)=="V1"]<-"Sex"
But let's say we want to use words instead of numbers to describe sex. In this case, we need to change the contents of the data frame as follows:
my_data_frame$Sex[my_data_frame$Sex==1]<-"Female"
my_data_frame$Sex[my_data_frame$Sex==2]<-"Male"
The $ operator is
how you tell R that you are looking for a specific
variable within the data frame.
Now let's look at our data frame:
my_data_frame
Sex V2 V3
1 Female 64000 1
2 Male 38000 5
3 Female 100000 17
4 Male 200000 21
Now let's figure out the sex breakdown of our
data using the table command.
table(my_data_frame$Sex)
Female Male
2 2
1) Change the name of the second column in
my_data_frame to Income
2) Calculate the median of the Income variable.
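One possible solution, reusing the column-renaming syntax from above:
colnames(my_data_frame)[colnames(my_data_frame)=="V2"]<-"Income"
median(my_data_frame$Income)
[1] 82000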
Until now, we have been working at a very abstract level. This is because I needed to teach you some basic concepts before we can start to work with real data.
R data files have the extension .Rdata. We will work with these soon, but let's begin by pulling in other types of data files, because it's unlikely that you will be working with an .Rdata file if you are coming from another program such as STATA.
R has a variety of ways of importing data.
For example, data often comes in .csv
format. To read this, we use the read.csv
command.
sample_csv_data<-read.csv("Sample_CSV_Data.csv")
As the upper right hand pane of RStudio now shows, these data have 9909 observations and 406 variables.
By default, R assumes that the first line of these data contains the variable names. To list the first few variable names, we can write:
head(colnames(sample_csv_data))
[1] "institution_id" "UNITID" "school_id"
[4] "gss_code" "year" "Institution_Name"
R treats strings as factors during data import unless you tell it not to. This can become a problem if you try to perform operations on string variables that are actually factor variables.
In order to determine the “class” of a variable (that is, whether it is a numeric, character, or factor variable), we can use the class() command:
class(sample_csv_data$Institution_Name)
[1] "factor"
If we want to prevent R
from defaulting to this behavior, we can add
an option to our read.csv command. Options
for most commands are specified by a comma
after the name of the object you want to apply
the command to.
sample_csv_data<-read.csv("Sample_CSV_Data.csv", stringsAsFactors=FALSE)
How do you find the options for an R command?
In this case, we can run,
?read.csv
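Alternatively, if the data have already been imported and a variable came in as a factor, one quick fix (a sketch, not part of the original import code) is to convert it with as.character():
sample_csv_data$Institution_Name<-as.character(sample_csv_data$Institution_Name)
class(sample_csv_data$Institution_Name)
[1] "character"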
Thus far we have been using commands from “base R.” While base R includes an impressive array of commands, in order to take full advantage of R, and particularly cutting-edge techniques, you need to install packages.
Packages are user-contributed bundles of code. These are not unlike contributed commands in STATA/SPSS/SAS, but they are much richer because R has greater functionality. There are also many more people writing packages for R than for these other software programs.
The first package we will install helps us read in data from other stats programs (e.g. STATA/SPSS/SAS).
This package is called haven, but we could have used many others.
In order to install a package, we use the
install.packages() command:
install.packages("haven")
Each and every time we want to use a package we need to “load it” into R's memory as follows:
library("haven")
We can also write:
require("haven")
Let's use haven to import some STATA data that
I put in our class Dropbox:
sample_stata_data<-read_stata("Sample Stata Data.dta")
Note that this package not only allows you to read SPSS and SAS files, but also lets you write R data frames out to these formats.
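For instance, a hedged sketch of the corresponding haven commands (the file names below are placeholders, not files in our Dropbox):
sample_spss_data<-read_spss("Sample SPSS Data.sav")
sample_sas_data<-read_sas("Sample SAS Data.sas7bdat")
write_dta(sample_stata_data, "Sample Stata Data Copy.dta")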
Cleaning/Reshaping/“Wrangling” data is a core task of computational sociology and “data science” more broadly.
A recent New York Times Article suggests 80% of data scientists' time is spent cleaning data, while only 20% of their time is spent analyzing it. My experience is roughly the same.
Why does cleaning take so long? Because many web or text-based data sets are either unstructured or semi-structured. Many datasets also contain large amounts of missing data, or must be harmonized in order to be merged.
First, let's work on subsetting data. This simply means breaking a data frame into chunks. The syntax is similar to the syntax we used when we worked with matrices. For example, let's say we want to take all respondents from our Stata dataset who are younger than 50. The variable we want is called “age.”
respondents_under_50<-sample_stata_data[sample_stata_data$age<50,]
The [] and , are critical. They tell R which part of the data frame we want to manipulate. This is also the first time we have used a logical operator (<); we could have used <=, >, or != as well.
Next, let's examine how to identify missing data. To drop all rows
of a dataset with missing data on any variable, we can use the complete.cases()
command:
no_missing_data<-sample_csv_data[complete.cases(sample_csv_data),]
Often, you might want to identify patterns of missing data by individual variables within a data frame. There are some sophisticated visualization packages that will do this for you, but for pedagogical purposes, let's do it first the old fashioned way.
First, let's read in some new data:
pewdata<-read.csv("Sample_Pew_Data.csv")
Working with different datasets is useful because it gives you a sense of the range of different problems you might encounter with data cleaning.
In this data set, for example, missing data
was coded as 9 instead of NA (or empty cells, which R would
have read in as NA).
Let's look at missing data on the pew10 variable, which is about
whether Americans supported the construction of the “Ground Zero”
mosque in New York in 2011.
First, let's change the 9's to NAs
pewdata$pew10[pewdata$pew10==9]<-NA
Next, let's create a new data frame that contains the missing data:
missing<-pewdata[is.na(pewdata$pew10),]
If we want to take all the values where pew10 is NOT
missing, we would do this:
no_missing<-pewdata[!is.na(pewdata$pew10),]
Note that is.na() is a logical operator. To demonstrate:
head(is.na(pewdata$pew10))
[1] FALSE FALSE FALSE FALSE TRUE TRUE
What if we want to find all of the men in the dataset with missing data?
First, we need to find the variable for sex:
colnames(pewdata)
[1] "psraid" "state" "density" "pew1f" "pew2" "pew2os" "pew3a"
[8] "pew3b" "pew3c" "pew3d" "pew4" "pew5" "pew6" "pew7"
[15] "par2" "pew8" "pew9" "pew10" "sex" "age" "recage"
[22] "educ" "receduc" "hisp" "race" "racethn" "inc" "income"
[29] "party" "partyln" "weight"
Now let's see how the variable is coded:
table(pewdata$sex)
1 2
424 579
I happen to know that 1=male in these data,
so in order to find the men with missing data
we can combine the two variables with the logical
operator &
missing<-pewdata[is.na(pewdata$pew10)& pewdata$sex==1,]
Note again that we need the , because we are telling R
we want the rows.
If we wanted to trim columns from the data
we would need to put the content we want after the ,
We can either use the numbers of the columns or their names. Let's say we just want the two variables we've been working with so far:
gender_and_mosque<-pewdata[,c("sex","pew10")]
The c() operator is necessary here because we are asking for multiple variables.
Now let's say we wanted everything but the first column in the dataset.
First we would need to know the number of columns. We can use
ncol() for this purpose:
ncol(pewdata)
[1] 31
Next, we tell R that we want columns 2 through 31 using the : operator,
which indicates a sequence:
no_first_column<-pewdata[,2:31]
We could also combine the two steps as follows:
no_first_column<-pewdata[,2:ncol(pewdata)]
I'm noting this because it will be helpful to know that this is possible when we discuss programming later in this class.
What is the age of the oldest man in the pewdata dataset?
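One way to answer this, combining the indexing and logical operators covered above (a sketch; recall that 1=male in these data):
max(pewdata$age[pewdata$sex==1], na.rm=TRUE)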
Another very common task in computational sociology is reshaping data. For
example, suppose we wanted to examine partisanship by race. The
Partyln variable describes the following question within the Pew Data:
“As of today, do you lean more to the Republican Party or more to the Democratic Party?”
The possible answers are:
1: Republican;
2: Democrat;
9: Missing.
It's annoying that these are not already labelled, but this is a very common problem. First, let's recode the numeric values into character strings:
pewdata$partyln[pewdata$partyln==1]<-"Republican"
pewdata$partyln[pewdata$partyln==2]<-"Democrat"
pewdata$partyln[pewdata$partyln==9]<-NA
Let's check to make sure it worked:
table(pewdata$partyln)
Democrat Republican
138 146
Note that the table command does not
list missing values (NAs)
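If we want the table to count the NAs as well, table() has a useNA option:
table(pewdata$partyln, useNA="ifany")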
We also need to recode the race variable.
pewdata$race[pewdata$race==1]<-"White"
pewdata$race[pewdata$race==2]<-"African American"
pewdata$race[pewdata$race==3]<-"Asian or Pacific Islander"
pewdata$race[pewdata$race==4]<-"Mixed Race"
pewdata$race[pewdata$race==5]<-"Native American"
pewdata$race[pewdata$race==6]<-"Other"
pewdata$race[pewdata$race==9]<-NA
We can produce a cross tabulation as follows:
table(pewdata$partyln, pewdata$race)
African American Asian or Pacific Islander Mixed Race
Democrat 11 4 2
Republican 4 3 3
Native American Other White
Democrat 1 2 114
Republican 3 1 129
Now that we've cleaned up this data frame, let's
save it in .Rdata format so that we can use it
tomorrow:
save(pewdata, file="Pew Data.Rdata")
Remember that R will save files in whatever folder
you specified as your working directory. Use getwd() to find it.
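For example (the path below is only a placeholder for wherever you keep your class files):
getwd()                     # prints the current working directory
setwd("~/Dropbox/R Class")  # placeholder path: change it to a folder on your machine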
But let's say we want the average age by race. As is
common with R, there are many different ways to do this.
Let's continue using base R. For the record, one could also use the plyr package, the reshape package, or the data.table package, just to name a few.
aggregate(pewdata$age, by=list(pewdata$race), FUN=mean)
Group.1 x
1 African American 48.71134
2 Asian or Pacific Islander 39.00000
3 Mixed Race 48.46154
4 Native American 49.05263
5 Other 30.22222
6 White 53.76255
What if we want the average age by both race and party?
aggregate(pewdata$age, by=list(pewdata$race, pewdata$partyln), FUN=mean)
Once again, note we could store these data as follows:
age_by_race<-aggregate(pewdata$age, by=list(pewdata$race), FUN=mean)
Another very common task you might face in R is merging multiple datasets.
This is one of the most common tasks you might encounter in data cleaning and manipulation precisely because R can have so many objects loaded in memory at once.
Imagine, for example, that we want to add average income by race to our dataset that describes average age. I put a very small spreadsheet in the Dropbox that describes average income by race.
race_income_data<-read.csv("Income By Race.xlsx")
read.csv cannot properly import an Excel file, so in this type of situation you would need to do some research. One can search for R help within RStudio by typing ?? before the search term:
??xlsx
The website Stack Overflow is an invaluable resource. A common recommendation there is to use the XLConnect package.
First, we need to install it:
install.packages("XLConnect")
library(XLConnect)
race_income_data <- readWorksheet(loadWorkbook("Income By Race.xlsx"),sheet=1)
There are many different commands for merging
data frames in R (e.g. the merge command in base R).
We are going to use the plyr package because it is
more powerful, faster, and easier to use
install.packages("plyr")
The command for merging datasets in plyr is called join
library(plyr)
merged_data<-join(age_by_race, race_income_data)
This looked like it worked, but if we view the
merged dataset, we see that it added NAs instead
of the values
head(merged_data)
Group.1 x race income
1 African American 48.71134 <NA> NA
2 Asian or Pacific Islander 39.00000 <NA> NA
3 Mixed Race 48.46154 <NA> NA
4 Native American 49.05263 <NA> NA
5 Other 30.22222 <NA> NA
6 White 53.76255 <NA> NA
Why did this happen? This particular command from the plyr package
automatically searches for column names that are
shared by both files.
Let's look at our column names:
colnames(age_by_race)
[1] "Group.1" "x"
colnames(race_income_data)
[1] "race" "income"
When we ran the aggregate command above, it gave
the columns new, generic names.
We need to fix the column names so that they are the same across the datasets:
colnames(age_by_race)[colnames(age_by_race)=="Group.1"]<-"race"
colnames(age_by_race)[colnames(age_by_race)=="x"]<-"age"
Let's try to merge again:
merged_data<-join(age_by_race, race_income_data)
head(merged_data)
race age income
1 African American 48.71134 NA
2 Asian or Pacific Islander 39.00000 NA
3 Mixed Race 48.46154 NA
4 Native American 49.05263 NA
5 Other 30.22222 NA
6 White 53.76255 57009
R did not give us an error message!!!
Solution: Inspect your data regularly!
One can use View(), head(), or table() for this
purpose, or (in times of desperation) output a .csv file, as sketched below.
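For instance, a quick sketch of writing an object out to a .csv that you can eyeball in a spreadsheet program (the file name here is only illustrative):
write.csv(merged_data, "merged_data_check.csv", row.names=FALSE)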
table(age_by_race$race)
African American Asian or Pacific Islander
1 1
Mixed Race Native American
1 1
Other White
1 1
table(race_income_data$race)
Asian Black White
1 1 1
The %in% operator identifies common elements in two
vectors
age_by_race$race %in% race_income_data$race
[1] FALSE FALSE FALSE FALSE FALSE TRUE
We can also combine %in% with !:
!(age_by_race$race %in% race_income_data$race)
[1] TRUE TRUE TRUE TRUE TRUE FALSE
The problem is that the race categories were labelled inconsistently across the two datasets, so we need to run the following lines:
race_income_data[race_income_data=="Black"]<-"African American"
race_income_data[race_income_data=="Asian"]<-"Asian or Pacific Islander"
Finally, let's re-merge the data:
merged_data<-join(age_by_race, race_income_data)
head(merged_data)
race age income
1 African American 48.71134 33321
2 Asian or Pacific Islander 39.00000 68636
3 Mixed Race 48.46154 NA
4 Native American 49.05263 NA
5 Other 30.22222 NA
6 White 53.76255 57009
Load a dataset that you've worked with in another program
into R. Recode several variables of your choice. Use the aggregate
command to create a summary dataset that describes the mean score of
these variables. Finally, try merging this summary dataset back onto the
original dataset using the join function in plyr.
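Here is a rough sketch of that workflow; every file, variable, and value name below is a placeholder that you would swap out for your own data:
# (assumes library(haven) and library(plyr) have been loaded as above)
# 1) Load a Stata dataset (placeholder file name)
my_data<-read_stata("My Old Project.dta")
# 2) Recode a variable of your choice (placeholder variable and value)
my_data$region[my_data$region==1]<-"South"
# 3) Summarize a variable by group using aggregate
summary_data<-aggregate(my_data$income, by=list(my_data$region), FUN=mean)
colnames(summary_data)<-c("region","mean_income")
# 4) Merge the summary back onto the original data with plyr's join
merged<-join(my_data, summary_data)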
Basic Programming
Next class I will introduce you to programming in R. In its simplest form, programming refers to writing instructions for a computer to complete simple tasks many times in a row. In this class, we will go over basic forms of “loops” as well as vectorized programming commands and “pipes.” These three programming techniques do very similar things, but it is important that you learn all three so that you can read (and use) other people's code.
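To preview what these look like, here is a minimal sketch of the same tiny task done three ways (the pipe example assumes the magrittr package, which supplies the %>% operator, has been installed):
some_numbers<-c(1, 3, 4, 9)
# A loop: square each element one at a time
squares<-c()
for(i in 1:length(some_numbers)){
  squares[i]<-some_numbers[i]^2
}
# A vectorized command: apply the same function to every element at once
squares<-sapply(some_numbers, function(x) x^2)
# A pipe: pass an object into a function using %>% from the magrittr package
library(magrittr)
squares<-some_numbers %>% sapply(function(x) x^2)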