Chris Bail
Computational Sociology
Last class we learned how to work with different types of objects, because many of the data sources we will work with consist of data in different formats.
But in most cases, you will want to collect data from many different webpages, text files, or other data pipelines, not just one.
In order to do this, you need to learn some basic programming.
Today we are going to learn about four different types of programming that can be used to accomplish repetitive tasks in R (functions, loops, vectorized functions, and pipes).
I need to familiarize you with different types of programming so that you can a) use the right tool for the job; and b) understand other people's code.
1. Introduction to Programming
2. Functions
3. Loops
4. Vectorized Functions
5. Pipes
6. Debugging
7. Parallelizing your code
8. Running R in the cloud (Bonus!)
The most basic form of programming is a function. If you have used R before, you have probably already used one.
We actually use functions all the time in R. For example:
mean()
list.files()
read.csv()
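For instance, a quick call to mean() on a small vector (values chosen just for illustration) looks like this:
mean(c(2, 4, 6, 8))
[1] 5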
Though we use such functions all the time, many people do so without ever looking at the “source code,” the complicated list of instructions that R processes each time we run the command. For example, typing rowMeans without parentheses prints its source:
rowMeans
function (x, na.rm = FALSE, dims = 1L)
{
    if (is.data.frame(x))
        x <- as.matrix(x)
    if (!is.array(x) || length(dn <- dim(x)) < 2L)
        stop("'x' must be an array of at least two dimensions")
    if (dims < 1L || dims > length(dn) - 1L)
        stop("invalid 'dims'")
    p <- prod(dn[-(id <- seq_len(dims))])
    dn <- dn[id]
    z <- if (is.complex(x))
        .Internal(rowMeans(Re(x), prod(dn), p, na.rm)) + (0+1i) *
            .Internal(rowMeans(Im(x), prod(dn), p, na.rm))
    else .Internal(rowMeans(x, prod(dn), p, na.rm))
    if (length(dn) > 1L) {
        dim(z) <- dn
        dimnames(z) <- dimnames(x)[id]
    }
    else names(z) <- dimnames(x)[[1L]]
    z
}
<bytecode: 0x7f9ff5bb3ab0>
<environment: namespace:base>
We are not going to jump right into the relatively complicated code we just saw. Instead, let's start with a very basic type of function.
A function is simply a set of instructions or tasks that one may apply to any type of object in R. Let's take a very basic function:
my_function <- function(x) x+2
This function takes a number (x) and adds two to it. Let's try it:
my_function(2)
[1] 4
Functions can get much, much more complicated:
another_function <- function(x, y, z) {
  x <- mean(x)^2
  y <- cos(y)-5
  z <- log(z)*200
  c(x,y,z)
}
This function requires three inputs (x, y, and z).
The c() at the end combines the three results into a single vector, which is what the function returns (and what R displays when we call it).
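For example, calling it with a small vector and two numbers (values chosen only for illustration):
another_function(c(1, 2, 3), pi, 10)
# returns three numbers: 4 (the mean of 1, 2, 3, squared), -6 (cos(pi) - 5), and about 460.5 (log(10) * 200)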
But you will probably soon begin using other people's functions, and it is therefore critical to know how they work so that you can modify them to suit your own needs.
Write a function that takes the square root of a number and then multiplies this figure by 37.
my_function<-function(x){
  sqrt(x)*37
}
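Trying it out (4 is just an arbitrary test value):
my_function(4)
[1] 74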
1) Functions can make it easy to perform repetitive tasks;
2) Functions allow you to develop custom operations;
3) You may need to examine functions that someone else wrote in order to ensure that they do what you want them to do.
Now let's turn to loops. Let's begin with a very simple example:
for (i in 1:6){
  print("Jim Moody is bad-a$$")
}
[1] "Jim Moody is bad-a$$"
[1] "Jim Moody is bad-a$$"
[1] "Jim Moody is bad-a$$"
[1] "Jim Moody is bad-a$$"
[1] "Jim Moody is bad-a$$"
[1] "Jim Moody is bad-a$$"
But things can get MUCH more complicated very quickly:
Let's assume we have a large number of csv files that describe health indicators for all OECD countries. We want to loop through all of these files and create a new dataset that describes health indicators for one country: Korea.
I created a folder in our class Dropbox named “OECD Health Data.” Let's use list.files to look at the files:
list.files("OECD Health Data")
[1] "CHILDVACCIN-DTP-PC_CHILD-A.csv"
[2] "CSECTIONS-TOT-1000BIRTH-A.csv"
[3] "FLUVACCIN-TOT-PC_POP65-A.csv"
[4] "HOSPITALDISCHARGE-TOT-100000HAB-A.csv"
[5] "MRIEXAM-TOT-1000HAB-A.csv"
The next step is to initialize our loop. To do this, we need to count the number of files we want to repeat an action for, using the length function:
filenames<-list.files("OECD Health Data")
number_of_files<-length(filenames)
Next, we need to create a place to store the data we will extract from each of these files that describe health indicators for all OECD countries.
Let's create a blank data frame as follows:
koreadata<-as.data.frame(NULL)
Our loop will look like this:
for(i in 1:number_of_files){
  # build the path to the i-th file
  filepath<-paste("OECD Health Data/", filenames[i], sep="")
  # read that file into R
  data<-read.csv(filepath, stringsAsFactors = FALSE)
  # keep only the rows for Korea
  newdata<-data[data$Location=="Korea",]
  # record which indicator (file) the rows came from
  newdata$indicator<-filenames[i]
  # append the result to the data frame we initialized above
  koreadata<-rbind(koreadata,newdata)
}
This line is tricky:
filepath<-paste("OECD Health Data/", filenames[i], sep="")
We are telling R to create a new “filepath” to pass on to read.csv each time it goes through the loop. The paste command simply concatenates the two strings.
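For example, using one of the file names listed above:
paste("OECD Health Data/", "MRIEXAM-TOT-1000HAB-A.csv", sep="")
[1] "OECD Health Data/MRIEXAM-TOT-1000HAB-A.csv"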
Let's look at the data:
head(koreadata)
Location X2012 indicator
23 Korea 99.0 CHILDVACCIN-DTP-PC_CHILD-A.csv
11 Korea 360.0 CSECTIONS-TOT-1000BIRTH-A.csv
12 Korea 77.4 FLUVACCIN-TOT-PC_POP65-A.csv
111 Korea 15571.1 HOSPITALDISCHARGE-TOT-100000HAB-A.csv
121 Korea 19.6 MRIEXAM-TOT-1000HAB-A.csv
R has other kinds of control flow besides for loops, e.g. while loops and the conditional statements if and else. These are often used in conjunction with for loops.
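For instance, here is a minimal sketch (with made-up values) of an if/else statement inside a for loop, and of a while loop:
for(i in 1:5){
  if(i %% 2 == 0){
    print("even")
  } else {
    print("odd")
  }
}
i <- 1
while(i <= 3){
  print(i)
  i <- i + 1
}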
There are two big problems with loops in R: they can be slow on large tasks, and they force you to write a lot of bookkeeping code (counters, empty objects to store results, and so on).
In the next two sections, we are going to explore alternatives to loops for programming in R.
1) Write a “for” loop that goes through each variable in our Pew Dataset and replaces values of 9 with NA.
Hint: you may find the ncol function useful.
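One possible solution, as a sketch (this assumes the data frame is called pewdata, as it is when we load it below, and that every column is a numeric survey item in which 9 marks a missing value):
for(i in 1:ncol(pewdata)){
  # replace every 9 in column i with NA
  pewdata[, i][pewdata[, i] == 9] <- NA
}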
Vectorized functions are another way of automating a task in R, in this case by iteratively applying a function across the elements of an object.
The command we are going to focus on is sapply, which applies a function to each element of a vector or data frame.
Let's try a simple example first. Let's say we want to center or standardize the “pew10” variable in our Pew Data.
load("Pew Data.Rdata")
Next, let's write a quick function to standardize:
standardize<-function(x) x/mean(pewdata$pew10, na.rm=T)
na.rm tells R to ignore missing values when calculating the mean of pew10.
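To see what na.rm does on its own (a quick illustration with made-up values):
mean(c(2, 4, NA))
[1] NA
mean(c(2, 4, NA), na.rm = TRUE)
[1] 3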
Now we apply our “standardize” function to each element of pew10 as follows:
centered <- sapply(pewdata$pew10, standardize)
If we wanted to create a new variable within our pew data we could have written:
pewdata$pew10_centered<-sapply(pewdata$pew10, standardize)
Because apply commands can pass any function onto any object, they are very powerful.
Let's read in our OECD files and produce a “Korea” dataset using lapply
First, we need to get the names of the files in our directory again:
filenames<-list.files("OECD Health Data")
Next, let's add the file path into them:
filenames<-paste("OECD Health Data/", filenames, sep="")
We can now import all the data files with one line of code!
data<-lapply(filenames,read.csv)
This line simply says: for each file name in the filenames object, apply the read.csv function.
If we look at the data, however, we see that lapply has returned a list in which each element is one of the data frames:
data
[[1]]
Location X2012
1 Australia 91
2 Austria 83
3 Belgium 99
4 Brazil 95
5 Canada 96
6 Chile 91
7 China (People's Republic of) 99
8 Czech Republic 99
9 Denmark 94
10 Estonia 94
11 Finland 98
12 France 99
13 Germany 96
14 Greece 99
15 Hungary 99
16 Iceland 91
17 India 72
18 Indonesia 85
19 Ireland 96
20 Israel 94
21 Italy 97
22 Japan 98
23 Korea 99
24 Luxembourg 99
25 Mexico 83
26 Netherlands 97
27 New Zealand 92
28 Norway 94
29 Poland 99
30 Portugal 98
31 Russia 97
32 Slovak Republic 98
33 Slovenia 95
34 South Africa 65
35 Spain 96
36 Sweden 98
37 Switzerland 96
38 Turkey 98
39 United Kingdom 96
40 United States 94
[[2]]
Location X2012
1 Austria 288.4
2 Czech Republic 243.9
3 Denmark 211.8
4 Estonia 200.0
5 Finland 161.9
6 France 207.6
7 Hungary 352.9
8 Ireland 275.3
9 Israel 188.7
10 Italy 368.4
11 Korea 360.0
12 Luxembourg 269.7
13 New Zealand 254.7
14 Poland 315.7
15 Slovenia 194.5
16 Spain 251.5
17 Sweden 163.0
18 Turkey 479.8
19 United Kingdom 244.2
[[3]]
Location X2012
1 Canada 64.1
2 Chile 73.1
3 Denmark 43.1
4 Estonia 0.9
5 Finland 35.1
6 France 53.1
7 Hungary 28.5
8 Ireland 56.9
9 Israel 57.6
10 Italy 62.7
11 Japan 50.0
12 Korea 77.4
13 Luxembourg 44.7
14 New Zealand 64.3
15 Norway 11.4
16 Slovak Republic 15.4
17 Slovenia 17.0
18 Spain 57.0
19 Turkey 13.2
20 United Kingdom 75.5
[[4]]
Location X2012
1 Austria 27029.6
2 Czech Republic 20054.5
3 Estonia 17285.4
4 Finland 17747.7
5 France 16765.6
6 Germany 25093.1
7 Hungary 20201.8
8 Ireland 13606.3
9 Israel 16356.2
10 Italy 12878.3
11 Korea 15571.1
12 Luxembourg 14943.8
13 Mexico 4819.8
14 Netherlands 11862.8
15 New Zealand 14815.8
16 Poland 16221.9
17 Slovak Republic 19583.0
18 Slovenia 17106.8
19 Spain 9905.5
20 Switzerland 16636.8
21 Turkey 15762.3
[[5]]
Location X2012
1 Australia 26.0
2 Canada 53.7
3 Czech Republic 43.2
4 Denmark 67.0
5 Estonia 46.8
6 Finland 42.1
7 France 82.0
8 Greece 67.6
9 Hungary 34.1
10 Iceland 79.3
11 Israel 28.0
12 Korea 19.6
13 Luxembourg 78.8
14 Poland 17.9
15 Slovak Republic 40.9
16 Slovenia 33.2
17 Spain 64.5
18 Turkey 114.3
19 United States 104.8
If we wanted to clean this up, we would need to “unlist” it and convert it to a data frame, possibly by using yet another apply command.
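For example, here is one way to rebuild the Korea dataset from this list, as a sketch that reuses the data and filenames objects created above (do.call with rbind stacks the list elements into a single data frame; koreadata2 is just an illustrative name):
korea_list <- lapply(1:length(data), function(i){
  d <- data[[i]]                    # the i-th imported file
  d <- d[d$Location == "Korea", ]   # keep only the Korea row(s)
  d$indicator <- filenames[i]       # note which file the rows came from
  d
})
koreadata2 <- do.call(rbind, korea_list)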
Problems with apply commands:
It can be difficult to identify the correct version of apply for the right object/desired result or output.
Mistakes in functions can be more difficult to find because you do not watch your program process the data row by row.
I do not want to overload you with too many different types of programming styles.
…But I do want you to leave this class with a basic familiarity with all kinds of programming so that you can learn to read other people's code.
Pipes are relatively new to R, though they are quite common in other programming languages. They are essentially a way of writing cleaner, more intuitive code by eliminating loops and the initialization of objects such as data frames from your code.
As of right now, they work best with the dplyr commands that we have already used in this class.
Piping is not yet part of base R. So, we have to load the magrittr package:
install.packages("magrittr")
The key contribution of this package is the %>% operator.
Whatever is on the left side of this operator gets passed to the right side (by default, as the first argument of the function on the right).
The %>% operator works in conjunction with other functions.
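For example, once magrittr is loaded (as below), these two lines are equivalent (numbers chosen only for illustration):
sum(sqrt(c(1, 4, 9)))
[1] 6
c(1, 4, 9) %>% sqrt() %>% sum()
[1] 6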
Let's look at an example from Hadley Wickham's website. Let's use pipes to create a graph of the number of boys' and girls' names that begin with “Ste” over the past century.
There is data on this within the babynames package in R:
install.packages("babynames")
Next, let's load all the packages we need to do this:
library(magrittr)
library(dplyr)
library(ggplot2)
library(babynames)
And here is what our code will look like:
babynames %>%
  filter(name %>% substr(1, 3) %>% equals("Ste")) %>%
  group_by(year, sex) %>%
  summarize(total = sum(n)) %>%
  qplot(year, total, color = sex, data = ., geom = "line") %>%
  add(ggtitle('Names starting with "Ste"')) %>%
  print
Note that we never created a variable, a blank data frame or any other object.
But we passed a ton of data through multiple packages, each of which has its own syntax and options.
For some people, this is more intuitive; for others, it is less so.
Whether you are brand-new to coding or whether you've been doing it for years, it is extremely easy to make small mistakes that can make your code fail.
You can use RStudio to debug your code!
These simple tools are particularly useful if you are looking at a very large amount of code!
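As an illustration of the kinds of simple tools available (these are base R helpers offered as examples, not necessarily the exact tools shown in class): browser() pauses a function so you can step through it line by line, and traceback() shows the sequence of calls that produced the most recent error.
buggy_function <- function(x){
  browser()            # execution pauses here; type n to step through, Q to quit
  log(x) * 2
}
# buggy_function(10)   # run interactively to step through the function
# traceback()          # after an error, prints the call stack that produced it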
RStudio also has more sophisticated debugging tools that are described in detail here.
If you want to get into more advanced programming in R, check out this site authored by Hadley Wickham
Most computers now have multiple CPUs, and some, such as the Amazon machines we will discuss in the next section, have a very large number (e.g. 32 or even 64 CPUs).
Unless you parallelize your code, R will only use one CPU per job. Parallelizing your code can therefore lead to dramatic speed improvements.
Not everything can be parallelized. The only types of code that can be parallelized are those where each iteration (e.g. each pass through a loop) is not dependent upon the previous iteration. This is because parallel processing does not process each stage in order.
Several packages support parallelization: parallel, foreach, doParallel, and several others (some have been deprecated).
library(doParallel)
library(foreach)
# create a cluster that uses all but two of the available cores
cl <- makeCluster(detectCores() - 2, outfile = "")
registerDoParallel(cl)
# finaldata here is an example dataset of organizations and blog mentions;
# each iteration sums this organization's blog mentions up to and including the current row's date
finaldata$new_var <-
  foreach(i = 1:nrow(finaldata), .combine = rbind) %dopar% {
    date <- finaldata$date[i]
    orgname <- finaldata$orgname[i]
    tester <- finaldata[finaldata$date <= date & finaldata$orgname == orgname, ]
    return(sum(tester$Number_of_Blog_Mentions))
  }
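When the parallel work is finished, it is good practice to shut the cluster down so the extra R processes do not keep running:
stopCluster(cl)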
1. If you do not have a computer with lots of CPUs, parallelizing might not help very much.
2. It is sometimes more difficult to track the progress of parallelized code, and also to debug it.
Load RStudio in your web browser!
Why? Because cloud machines such as Amazon's can have far more CPUs and memory than your own computer (recall the 32- and 64-CPU machines mentioned in the parallelization section).
Possible downsides are: a) the cloud may not be safe for sensitive data; and b) you have to pay for the time you use… but some instances are free and even the most expensive are not too costly (e.g. ~$5/hour).
First, you need to create an Amazon Web Services account
Next, load an RStudio “Amazon Machine Image” (AMI)
This will redirect you to this page where you can choose how much power you want:
Configure your security group to allow incoming HTTP traffic on port 80
Create a Key Pair (security measure)
Once you launch:
The EC2 Console
Cut and paste the “Public DNS” address into your browser
Log in: By Default the user name and password are both set to “rstudio”
All done!
Change your password!
Proceed with caution when loading sensitive data into the cloud!
Always “Shut down your instances,” OR YOU MAY RUN UP A BIG BILL!!!
If you get stuck with programming, remember to check out Stack Overflow. There is occasionally decent advice on the R-help mailing list, but it is usually outdated and poorly formatted.
If you get an error code, google it.
Always view or summarize your data to identify coding errors
R sometimes crashes. Sometimes the only way to fix things is to restart. You can stop code by hitting the “stop sign” in the RStudio console
There is no “right” or “wrong” way to do things. When you get stuck try to approach the problem from many different angles
Write a for loop that creates 200 lines of output that include the text “random number” followed by a random number.
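One possible solution, as a sketch (runif() is just one of several random-number functions you could use here):
for(i in 1:200){
  print(paste("random number", runif(1)))
}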
Screen-Scraping and Working with APIs
Next class I will teach you how to download large amounts of data from the internet. Though many students will be interested in how to obtain data from social media sites, API technologies are now used to transmit many other types of data as well (e.g. academic citation databases, historical archives, GPS data, language translation, and anything that might be used by an app developer or computer programmer).