Basic Programming in R

Chris Bail
Computational Sociology

Recap

Last class we learned how to work with different types of objects, because many of the data sources we will work with contain data in different formats.

Recap

 

But in most cases, you will want to collect data from many different webpages, text files, or other data pipelines, not just one.

Recap

 

In order to do this, you need to learn some basic programming.

Agenda for today

 

Today we are going to learn about four different types of programs that can be used to accomplish repetitive tasks in R.

Why Can't we Just Learn One?

 

I need to familiarize you with different types of programming so that you can a) use the right tool for the job; and b) understand other people's code.

Agenda for today

 

1. Introduction to Programming
2. Functions
3. Loops
4. Vectorized Functions
5. Pipes
6. Debugging
7. Parallelizing your code
8. Running R in the cloud (Bonus!)

PART 1:

BASIC PROGRAMMING

Basic Programming in R

 

  • Programming is one of the most powerful aspects of R

Basic Programming in R

 

  • Programming is also one of the most difficult aspects to learn

Basic Programming in R

 

  • But this is probably because one's first programming language is always hard to learn

Basic Programming in R

 

  • The potential applications of programming in R are mind-boggling

Basic Programming in R

 

  • We are going to focus on some rather basic programming tasks

Basic Programming in R

 

  • My goal is to get you to the point where you can recognize what is going on in other people's code, and begin writing simple code yourself.

PART 2:

FUNCTIONS

Functions

 

The most basic form of programming is a function. If you have used R before, you have probably used a function before.

Anatomy of a Function

 

Functions

 

We actually use functions all the time in R. For example:

mean()
list.files()
read.csv()

Functions

 

Though we use such functions all the time, many people often do so without ever looking at the “source code,” or the complicated list of instructions that R processes each time we run a command.

What "rowMeans" looks like under the hood

 

rowMeans
function (x, na.rm = FALSE, dims = 1L) 
{
    if (is.data.frame(x)) 
        x <- as.matrix(x)
    if (!is.array(x) || length(dn <- dim(x)) < 2L) 
        stop("'x' must be an array of at least two dimensions")
    if (dims < 1L || dims > length(dn) - 1L) 
        stop("invalid 'dims'")
    p <- prod(dn[-(id <- seq_len(dims))])
    dn <- dn[id]
    z <- if (is.complex(x)) 
        .Internal(rowMeans(Re(x), prod(dn), p, na.rm)) + (0+1i) * 
            .Internal(rowMeans(Im(x), prod(dn), p, na.rm))
    else .Internal(rowMeans(x, prod(dn), p, na.rm))
    if (length(dn) > 1L) {
        dim(z) <- dn
        dimnames(z) <- dimnames(x)[id]
    }
    else names(z) <- dimnames(x)[[1L]]
    z
}
<bytecode: 0x7f9ff5bb3ab0>
<environment: namespace:base>

Babysteps

 

We are not going to jump right into the relatively complicated code that we just saw. Instead, let's start with a very basic type of function.

What is a Function?

 

A function is simply a set of instructions or tasks that one may apply to any type of object in R. Let's take a very basic function:

my_function <- function(x) x+2

This function takes a number (x) and adds two to it. Let's try it:

my_function(2)
[1] 4

Functions

 

Functions can get much, much more complicated:

another_function <- function(x, y, z) {
  x <- mean(x)^2
  y <- cos(y)-5
  z <- log(z)*200
  c(x,y,z)
  }

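This function requires three inputs (x, y, and z).

The c() combines the three results into a single vector. Because it is the last line of the function, that vector is what the function returns (and what R displays when we call it).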
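To see what another_function returns, here is a minimal sketch with made-up inputs. The values passed in are assumptions, chosen only so that the arithmetic is easy to check by hand.

```r
another_function <- function(x, y, z) {
  x <- mean(x)^2     # square of the mean of x
  y <- cos(y) - 5    # cosine of y, minus five
  z <- log(z) * 200  # natural log of z, times 200
  c(x, y, z)         # combine the three results; this last value is returned
}

# Hypothetical inputs: mean(c(1, 2, 3)) is 2, cos(0) is 1, log(1) is 0
result <- another_function(c(1, 2, 3), 0, 1)
result
# [1]  4 -4  0
```

Note that x can be a whole vector of numbers here, because mean() collapses it to a single value before it is squared.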

Functions

 

  • As beginners, you might not write many functions in R

Functions

 

But you will probably soon begin using other people's functions, and it is therefore critical to know how they work so that you can modify them to suit your own needs.

Now You Try It!

Write a function that takes the square root of a number and then multiplies the result by 37.

Example Solution

my_function<-function(x){
sqrt(x)*37
}
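A quick way to check a solution like this is to call it with an input whose answer you can work out by hand. The input 16 below is just an illustrative choice: its square root is 4, and 4 times 37 is 148.

```r
my_function <- function(x) {
  sqrt(x) * 37
}

my_function(16)
# [1] 148
```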

Why do I need to learn about Functions?

 

1) Functions can make it easy to perform repetitive tasks;

2) Functions allow you to develop custom operations;

3) You may need to examine functions that someone else wrote in order to ensure that they do what you want them to do.

PART 3:

FOR THE LOVE OF LOOPS

Loops

 

  • “for” loops are one of the oldest programming constructs in computer science.

Loops

 

  • for loops help us repeat a function over and over across a large data frame, or a large number of objects

Loops

Let's begin with a very simple example:

for (i in 1:6){
  print("Jim Moody is bad-a$$")
}
[1] "Jim Moody is bad-a$$"
[1] "Jim Moody is bad-a$$"
[1] "Jim Moody is bad-a$$"
[1] "Jim Moody is bad-a$$"
[1] "Jim Moody is bad-a$$"
[1] "Jim Moody is bad-a$$"

Loops

But things can get MUCH more complicated very quickly:

Let's assume we have a large number of csv files that describe health indicators for all OECD countries. We want to loop through all of these files and create a new dataset that describes health indicators for one country: Korea.

Loops

 

I created a folder in our class Dropbox named “OECD Health Data.” Let's use list.files to look at the files:

list.files("OECD Health Data")
[1] "CHILDVACCIN-DTP-PC_CHILD-A.csv"       
[2] "CSECTIONS-TOT-1000BIRTH-A.csv"        
[3] "FLUVACCIN-TOT-PC_POP65-A.csv"         
[4] "HOSPITALDISCHARGE-TOT-100000HAB-A.csv"
[5] "MRIEXAM-TOT-1000HAB-A.csv"            

Loops

 

The next step is to initialize our loop. To do this, we need to count the number of files that we want to repeat an action for.

Let's use the length function to get that count:

filenames<-list.files("OECD Health Data")
number_of_files<-length(filenames)

Loops

 

Next, we need to create a place to store the data we will extract from each of these files that describe health indicators for all OECD countries.

Let's create a blank data frame as follows:

koreadata<-as.data.frame(NULL)

Loops

 

Our loop will look like this:

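for(i in 1:number_of_files){
  # build the path to the i-th file
  filepath<-paste("OECD Health Data/", filenames[i], sep="")
  # read that file into a data frame
  data<-read.csv(filepath, stringsAsFactors = FALSE)
  # keep only the row(s) for Korea
  newdata<-data[data$Location=="Korea",]
  # record which file (indicator) the row(s) came from
  newdata$indicator<-filenames[i]
  # append the row(s) to the running results
  koreadata<-rbind(koreadata,newdata)
}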
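If you do not have the class Dropbox folder handy, the same loop can be tried on throwaway data. The sketch below is only an illustration: the folder name, file names, and values are all made up, and the files are written to a temporary directory.

```r
# Build a throwaway folder with two tiny CSV files (made-up values)
toy_dir <- file.path(tempdir(), "toy_health_data")
dir.create(toy_dir, showWarnings = FALSE)
write.csv(data.frame(Location = c("Japan", "Korea"), X2012 = c(1, 2)),
          file.path(toy_dir, "indicator_a.csv"), row.names = FALSE)
write.csv(data.frame(Location = c("Korea", "Spain"), X2012 = c(3, 4)),
          file.path(toy_dir, "indicator_b.csv"), row.names = FALSE)

filenames <- list.files(toy_dir)   # the two file names, in alphabetical order
koreadata <- data.frame()          # empty data frame to collect results

for (i in seq_along(filenames)) {
  filepath <- file.path(toy_dir, filenames[i])         # path to the i-th file
  data <- read.csv(filepath, stringsAsFactors = FALSE)
  newdata <- data[data$Location == "Korea", ]          # keep only Korea
  newdata$indicator <- filenames[i]                    # remember the source file
  koreadata <- rbind(koreadata, newdata)               # append to the results
}

koreadata   # one row per file, both for Korea
```

This version uses seq_along(filenames) instead of 1:number_of_files. The two behave the same when there is at least one file, but seq_along also does the sensible thing (zero passes through the loop) when the folder is empty.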

Anatomy of a For Loop

 

This line is tricky:

  filepath<-paste("OECD Health Data/", filenames[i], sep="")

We are telling R to create a new “filepath” to pass on to read.csv each time it goes through the loop.

The paste command simply concatenates strings. Setting sep="" tells it not to put a space between them.
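A small sketch of what paste does with and without the sep argument may help. The file name is taken from the list.files output shown earlier; the variable names are just illustrative.

```r
folder   <- "OECD Health Data/"
one_file <- "MRIEXAM-TOT-1000HAB-A.csv"

# By default, paste puts a single space between the pieces
with_default_sep <- paste(folder, one_file)

# sep = "" joins the pieces with nothing in between
no_sep <- paste(folder, one_file, sep = "")

# paste0 is a shortcut for paste(..., sep = "")
same_thing <- paste0(folder, one_file)

no_sep
# [1] "OECD Health Data/MRIEXAM-TOT-1000HAB-A.csv"
```

The default version would produce a path with a stray space after the slash, which is why the loop sets sep="".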

Loops

Let's look at the data:

head(koreadata)
    Location   X2012                             indicator
23     Korea    99.0        CHILDVACCIN-DTP-PC_CHILD-A.csv
11     Korea   360.0         CSECTIONS-TOT-1000BIRTH-A.csv
12     Korea    77.4          FLUVACCIN-TOT-PC_POP65-A.csv
111    Korea 15571.1 HOSPITALDISCHARGE-TOT-100000HAB-A.csv
121    Korea    19.6             MRIEXAM-TOT-1000HAB-A.csv

Loops

 

There are other types of loops in R

e.g. while and repeat

Conditional statements such as if and else are not loops themselves, but they are often used in conjunction with for loops
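As a minimal sketch of these other constructs (the variable names and values are made up for illustration): a while loop keeps running as long as its condition is true, and if/else chooses between two actions, here inside a for loop.

```r
# A while loop: keep adding one until the counter reaches 3
count <- 0
while (count < 3) {
  count <- count + 1
}
count
# [1] 3

# if / else inside a for loop: label each number as odd or even
labels <- character(4)
for (i in 1:4) {
  if (i %% 2 == 0) {      # %% gives the remainder after division
    labels[i] <- "even"
  } else {
    labels[i] <- "odd"
  }
}
labels
# [1] "odd"  "even" "odd"  "even"
```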

Loops

 

There are two big problems with loops in R:

  • 1) They can be slow, especially when an object such as a data frame grows with rbind on every pass
  • 2) Loops are somewhat complicated to write

In the next two sections, we are going to explore alternatives to loops for programming in R
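To make the contrast concrete, here is a small sketch (with made-up numbers) of the same calculation written as a loop and as a single vectorized expression. Both give the same answer; the second is shorter and, on large inputs, typically faster.

```r
x <- 1:10

# Loop version: fill a pre-allocated vector one element at a time
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Vectorized version: the ^ operator works on the whole vector at once
squares_vectorized <- x^2

identical(squares_loop, squares_vectorized)
# [1] TRUE
```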

Now you try it:

 

1) Write a “for” loop that goes through each variable in our Pew Dataset and replaces values of 9 with NA.  

Hint: you may find the ncol function useful.

PART 4:

VECTORIZED FUNCTIONS

Vectorized functions

 

Vectorized functions are another way of automating a task in R, in this case by applying a function to every element of an object.

Vectorized functions

 

The command we are going to focus on is sapply, which applies a function to each element of a vector or list (for a data frame, to each column).

Vectorized functions

Let's try a simple example first. Let's say we want to standardize the “pew10” variable in our Pew Data by dividing each value by the variable's mean.

load("Pew Data.Rdata")

Next, let's write a quick function to standardize:

standardize<-function(x) x/mean(pewdata$pew10, na.rm=TRUE)

na.rm=TRUE tells R to ignore missing values when calculating the mean of pew10.

Vectorized functions

Now we apply our “standardize” function to each value of pew10 as follows:

centered <- sapply(pewdata$pew10, standardize)

If we wanted to create a new variable within our pew data we could have written:

pewdata$pew10_centered<-sapply(pewdata$pew10, standardize)
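The same idea can be tried without the Pew file. The sketch below uses a made-up vector standing in for pewdata$pew10 (the values, including the missing one, are assumptions for illustration).

```r
# Hypothetical stand-in for pewdata$pew10, with one missing value
pew10 <- c(2, 4, NA, 6)

# Divide a value by the mean of pew10, ignoring missing values
standardize <- function(x) x / mean(pew10, na.rm = TRUE)

# sapply calls standardize once for each element of pew10
centered <- sapply(pew10, standardize)
centered
# [1] 0.5 1.0  NA 1.5

# Division is already vectorized, so this one-liner gives the same result
centered_direct <- pew10 / mean(pew10, na.rm = TRUE)
```

The mean of 2, 4, and 6 is 4, so each value is divided by 4. The missing value stays missing.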

Vectorized functions

 

Because apply commands can pass any function onto any object, they are very powerful.

Vectorized functions

 

Let's read in our OECD files and produce a “Korea” dataset using lapply

First, we need to get the names of the files in our directory again:

filenames<-list.files("OECD Health Data")

Vectorized functions

 

Next, let's add the file path into them:

filenames<-paste("OECD Health Data/", filenames, sep="")

Vectorized functions

 

We can now import all the data files with one line of code!

data<-lapply(filenames,read.csv)

This function simply says: for each file name in the filenames object, apply the read.csv function.

Vectorized functions

If we look at the data, however, we see that lapply has returned a list in which each element is one data frame:

data
[[1]]
                       Location X2012
1                     Australia    91
2                       Austria    83
3                       Belgium    99
4                        Brazil    95
5                        Canada    96
6                         Chile    91
7  China (People's Republic of)    99
8                Czech Republic    99
9                       Denmark    94
10                      Estonia    94
11                      Finland    98
12                       France    99
13                      Germany    96
14                       Greece    99
15                      Hungary    99
16                      Iceland    91
17                        India    72
18                    Indonesia    85
19                      Ireland    96
20                       Israel    94
21                        Italy    97
22                        Japan    98
23                        Korea    99
24                   Luxembourg    99
25                       Mexico    83
26                  Netherlands    97
27                  New Zealand    92
28                       Norway    94
29                       Poland    99
30                     Portugal    98
31                       Russia    97
32              Slovak Republic    98
33                     Slovenia    95
34                 South Africa    65
35                        Spain    96
36                       Sweden    98
37                  Switzerland    96
38                       Turkey    98
39               United Kingdom    96
40                United States    94

[[2]]
         Location X2012
1         Austria 288.4
2  Czech Republic 243.9
3         Denmark 211.8
4         Estonia 200.0
5         Finland 161.9
6          France 207.6
7         Hungary 352.9
8         Ireland 275.3
9          Israel 188.7
10          Italy 368.4
11          Korea 360.0
12     Luxembourg 269.7
13    New Zealand 254.7
14         Poland 315.7
15       Slovenia 194.5
16          Spain 251.5
17         Sweden 163.0
18         Turkey 479.8
19 United Kingdom 244.2

[[3]]
          Location X2012
1           Canada  64.1
2            Chile  73.1
3          Denmark  43.1
4          Estonia   0.9
5          Finland  35.1
6           France  53.1
7          Hungary  28.5
8          Ireland  56.9
9           Israel  57.6
10           Italy  62.7
11           Japan  50.0
12           Korea  77.4
13      Luxembourg  44.7
14     New Zealand  64.3
15          Norway  11.4
16 Slovak Republic  15.4
17        Slovenia  17.0
18           Spain  57.0
19          Turkey  13.2
20  United Kingdom  75.5

[[4]]
          Location   X2012
1          Austria 27029.6
2   Czech Republic 20054.5
3          Estonia 17285.4
4          Finland 17747.7
5           France 16765.6
6          Germany 25093.1
7          Hungary 20201.8
8          Ireland 13606.3
9           Israel 16356.2
10           Italy 12878.3
11           Korea 15571.1
12      Luxembourg 14943.8
13          Mexico  4819.8
14     Netherlands 11862.8
15     New Zealand 14815.8
16          Poland 16221.9
17 Slovak Republic 19583.0
18        Slovenia 17106.8
19           Spain  9905.5
20     Switzerland 16636.8
21          Turkey 15762.3

[[5]]
          Location X2012
1        Australia  26.0
2           Canada  53.7
3   Czech Republic  43.2
4          Denmark  67.0
5          Estonia  46.8
6          Finland  42.1
7           France  82.0
8           Greece  67.6
9          Hungary  34.1
10         Iceland  79.3
11          Israel  28.0
12           Korea  19.6
13      Luxembourg  78.8
14          Poland  17.9
15 Slovak Republic  40.9
16        Slovenia  33.2
17           Spain  64.5
18          Turkey 114.3
19   United States 104.8

If we wanted to clean this up, we would need to combine the list into a single data frame (for example with do.call and rbind), possibly after using yet another apply command to pull out the rows we want.
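One possible way to do that clean-up is sketched below. It is an illustration rather than the only approach: the list is a small made-up stand-in for the output of lapply(filenames, read.csv), and the element names play the role of the file names.

```r
# Hypothetical stand-in for the list returned by lapply(filenames, read.csv)
toy_list <- list(
  indicator_a = data.frame(Location = c("Japan", "Korea"), X2012 = c(1, 2)),
  indicator_b = data.frame(Location = c("Korea", "Spain"), X2012 = c(3, 4))
)

# For each data frame: keep the Korea row and tag it with its name
korea_rows <- lapply(names(toy_list), function(nm) {
  d <- toy_list[[nm]]
  d <- d[d$Location == "Korea", ]
  d$indicator <- nm
  d
})

# Stack the list of one-row data frames into a single data frame
korea_df <- do.call(rbind, korea_rows)
korea_df
```

do.call(rbind, korea_rows) is equivalent to writing rbind(korea_rows[[1]], korea_rows[[2]]), but it works for a list of any length.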

Vectorized functions

 

Problems with apply commands:

  • It can be difficult to identify the correct version of apply for the right object/desired result or output.

  • Mistakes in functions can be more difficult to find because you do not watch your program process the data row by row.
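To illustrate the first point, here is a short sketch of how three members of the apply family treat the same made-up input. The main difference is the shape of what comes back.

```r
x <- list(a = 1:3, b = 4:6)

# lapply always returns a list, one element per input element
as_list <- lapply(x, mean)

# sapply tries to simplify the result, here to a named numeric vector
as_vector <- sapply(x, mean)

# vapply is like sapply, but you state the expected type and length
# of each result up front, and it stops with an error if they differ
as_checked <- vapply(x, mean, numeric(1))

as_vector
# a b 
# 2 5 
```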

PART 5:

PIPES

Piping

 

I do not want to overload you with too many different types of programming styles.

…But I do want you to leave this class with basic familiarity about all kinds of programming so that you can learn to read other people's code

Piping

 

Pipes are relatively new to R, though they are quite common in other programming languages. They are essentially a way of writing cleaner, more intuitive code by eliminating loops and the initialization of objects such as data frames from your code.

As of right now, they work best with the dplyr commands that we have already used in this class.

Piping

 

The %>% pipe is not part of base R (R 4.1.0 and later include a native |> pipe, but the examples here use %>%). So, we have to install the magrittr package:

install.packages("magrittr")

Piping

 

The key contribution of this package is the %>% operator.

Whatever is on the left side of this operator gets passed to the right side, by default as the first argument of the function on the right.

The %>% operator works in conjunction with other functions.
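A minimal sketch of how %>% passes things along, assuming the magrittr package is installed. The numbers are made up for illustration.

```r
library(magrittr)

# Nested version: read from the inside out
nested <- sum(sqrt(c(1, 4, 9)))

# Piped version: read from left to right
piped <- c(1, 4, 9) %>% sqrt() %>% sum()
piped
# [1] 6

# A dot (.) marks where the left side should go when it is
# not the first argument of the function on the right
evens <- 10 %>% seq(2, ., by = 2)
evens
# [1]  2  4  6  8 10
```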

Piping

 

Let's look at an example from Hadley Wickham's website. Let's use pipes to create a graph of the number of boys' and girls' names that begin with “Ste” over the past century.

There is data on this within the babynames package in R.

install.packages("babynames")

Piping

 

Next, let's load all the packages we need to do this:

library(magrittr)
library(dplyr)
library(ggplot2)
library(babynames)

Piping

  And here is what our code will look like:

babynames %>%
  filter(name %>% substr(1, 3) %>% equals("Ste")) %>%
  group_by(year, sex) %>%
  summarize(total = sum(n)) %>%
  qplot(year, total, color = sex, data = ., geom = "line")%>%
  add(ggtitle('Names starting with "Ste"')) %>%
  print

Piping

 

  • Note that we never created a variable, a blank data frame or any other object.

  • But we passed a ton of data through multiple packages, each of which has its own syntax/options.

  • For some people, this is more intuitive, for others, it is less intuitive

PART 6

DEBUGGING YOUR CODE

Debugging your code

 

Whether you are brand-new to coding or whether you've been doing it for years, it is extremely easy to make small mistakes that can make your code fail.

Debugging your code

 

You can use RStudio to debug your code!

Debugging your code

 

These simple tools are particularly useful if you are looking at a very large amount of code!

RStudio also has more sophisticated debugging tools that are described in detail here

A Resource for Advanced Programming

 

If you want to get into more advanced programming in R, check out this site authored by Hadley Wickham

PART 7: PARALLELIZING YOUR CODE

What is Parallel Code?

 

Most computers now have multiple CPUs, and some, such as the Amazon machines we will discuss in the next section, have a very large number (e.g. 32 or even 64 CPUs).

Unless you parallelize your code, R will only use one CPU per job. Parallelizing your code can therefore lead to dramatic speed improvements.

What is Parallel Code?

 

Not everything can be parallelized. The only kind of code that can be parallelized is code where each iteration (e.g. each pass through a loop) does not depend upon the previous iteration. This is because parallel processing does not process each stage in order.
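The difference between independent and dependent iterations can be seen in a small sketch (the numbers are made up). Neither loop below is run in parallel; the point is only which one could be.

```r
x <- c(3, 1, 4)

# Independent: each pass uses only x[i], so the passes could run
# in any order, or at the same time on different CPUs
squares <- numeric(length(x))
for (i in seq_along(x)) {
  squares[i] <- x[i]^2
}
squares
# [1]  9  1 16

# Dependent: each pass needs the result of the previous pass,
# so the passes must run one after another, in order
running <- numeric(length(x))
running[1] <- x[1]
for (i in 2:length(x)) {
  running[i] <- running[i - 1] + x[i]
}
running
# [1] 3 4 8
```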

Packages for parallelizing your code in R

 

parallel, foreach, doParallel, several others (some have been deprecated)

First, create a cluster

 

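library(doParallel)
# leave two cores free for the rest of the system, but always use at least one
cl<-makeCluster(max(1, detectCores()-2, na.rm=TRUE), outfile="")
registerDoParallel(cl)
# when you are finished with parallel work, release the workers with stopCluster(cl)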

Example of Parallelized Code

 

library(foreach)

finaldata$new_var<-
  foreach(i = 1:nrow(finaldata), .combine=rbind) %dopar% {
  date<-finaldata$date[i]
  orgname<-finaldata$orgname[i]
  tester<-finaldata[finaldata$date<=date&finaldata$orgname==orgname,]
  return(sum(tester$Number_of_Blog_Mentions))
}
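The example above depends on a data frame (finaldata) that is not included here. As a self-contained sketch of the same idea, the code below uses the parallel package that ships with R instead of foreach, and a small made-up data frame. The column names, values, and the choice of two workers are all assumptions for illustration.

```r
library(parallel)

# Hypothetical toy data standing in for finaldata
toy <- data.frame(
  orgname  = c("A", "A", "B", "A", "B"),
  date     = as.Date(c("2015-01-01", "2015-01-02", "2015-01-02",
                       "2015-01-03", "2015-01-05")),
  mentions = c(1, 2, 3, 4, 5)
)

# Total mentions for the same organization up to and including row i.
# Each call depends only on i and the data, not on any other call,
# so the calls can safely be spread across workers.
running_total <- function(i, d) {
  same_org_so_far <- d$orgname == d$orgname[i] & d$date <= d$date[i]
  sum(d$mentions[same_org_so_far])
}

cl <- makeCluster(2)   # start two worker processes
toy$total_so_far <- parSapply(cl, seq_len(nrow(toy)), running_total, d = toy)
stopCluster(cl)        # always release the workers when done

toy$total_so_far
# [1] 1 3 3 7 8
```

On a data set this small the parallel version will be slower than an ordinary sapply, because starting the workers takes time. The pay-off only appears when each iteration does a substantial amount of work.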

Pitfalls of parallelizing

 

1. If you do not have a computer with lots of cpus, parallelizing might not help very much.

2. It is sometimes more difficult to track the progress of parallelized code, and also to debug it.

PART 8: RUNNING R IN THE CLOUD

Running R in the Cloud

Load R Studio in your Web Browser!

Running R in the Cloud

 

Why?

  • 100+ times the computing power
  • Cheaper than buying a faster machine
  • Ability to load multiple R sessions at once, or “parallelize” your code
  • Access your R code/data from anywhere

Running R in the Cloud

 

Why?

Possible downsides are: a) it may not be safe for sensitive data; and b) you have to pay for the time you use… but some options are free and even the most expensive are not too costly (e.g. ~$5/hour)

Running R in the Cloud

First, you need to create an account

Running R in the Cloud

Next, load an RStudio “Amazon Machine Image” (AMI)

Running R in the Cloud

This will redirect you to this page where you can choose how much power you want:

Running R in the Cloud

Configure your security group to allow incoming HTTP traffic on port 80

Running R in the Cloud

Create a Key Pair (security measure)

Running R in the Cloud

Once you launch:

Running R in the Cloud

The EC2 Console

Running R in the Cloud

Cut and paste the “Public DNS” address into your browser

Running R in the Cloud

Log in: By default, the user name and password are both set to “rstudio”

Running R in the Cloud

All done!

Running R in the Cloud

 

Change your password!

Proceed with caution when loading sensitive data into the cloud!

Always “Shut down your instances,” OR YOU MAY RUN UP A BIG BILL!!!

WRAPPING UP

Wrapping Up: Parting Tips

 

If you get stuck with programming, remember to check out Stack Overflow. There is occasionally decent advice on R-help, but it is usually outdated and poorly formatted.

Wrapping Up: Parting Tips

 

If you get an error code, google it.

Wrapping Up: Parting Tips

 

Always view or summarize your data to identify coding errors

Wrapping Up: Parting Tips

 

R sometimes crashes. Sometimes the only way to fix things is to restart. You can stop code by hitting the “stop sign” in the RStudio console.

Wrapping Up: Parting Tips

 

There is no “right” or “wrong” way to do things. When you get stuck try to approach the problem from many different angles

Questions?

 

Homework

Write a for loop that creates 200 lines of output that include the text “random number” followed by a random number.

Next Week

Screen-Scraping and Working with APIs

Next class I will teach you how to download large amounts of data from the internet. Though many students will be interested in how to obtain data from social media sites, API technologies are now used to transmit many other types of data as well (e.g. academic citation databases, historical archives, GPS data, language translation, and anything that might become used by an app developer or computer programmer).