R is based on the language S, which was developed for statistical analysis at Bell Labs, the same place that brought you C. When coding in any language it is important to understand how the language works: how it thinks.
R thinks that everything is a vector. Vectors are one-dimensional “list-like” objects. They are the heart and soul of R, and R can analyze vectors with amazing speed. A vector is not a list and not a matrix, and it can contain only one type of value (numbers or strings, for example). Lists can generally contain multiple types of objects, including other lists. Matrices are generally two-dimensional objects that contain numbers.
In contrast to R, Stata thinks everything is a flat data set. It is optimized for operations involving columns of data. Matlab is a matrix-based language, while Python and Mathematica are based on Lisp (List Processing).
The chapter discusses the characteristics of objects, including vectors, matrices and lists. It then covers basic statistics, control structures and standard optimization.
Objects in R
This section describes how R uses basic objects like vectors, matrices, lists and data frames. It also discusses manipulation of objects including mathematical operations.
Vectors
Consider the following operations. Create a vector of numbers called \(a\) and a vector of words called \(b\). You can try adding them together, but the way to combine them is the c() function, which concatenates, or joins, vectors. Note what happens to the numbers when you do this.
a <- c(1,2,3) # "<-" means assign which has a different meaning than "="
b <- c("one","two","three")
a
[1] 1 2 3
b
[1] "one" "two" "three"
# a + b # you cannot use a numerical operator like "+" on a
# non-numeric object.
d <- c(a,b)
d
[1] "1" "2" "3" "one" "two" "three"
# note that d is now a vector of strings.
There are a couple of R-centric coding features. It is important to use <- when assigning a variable name. It is possible to use = but it doesn’t always mean the same thing. Also note that c(1,2) is a vector with two elements, both numbers, while c("1","2") is a vector with two elements, both characters.
R can manipulate vectors very quickly. When coding it is important to remember this.
b <- c(4, 5, 6)
a + b
[1] 5 7 9
a*b
[1] 4 10 18
a/b
[1] 0.25 0.40 0.50
a + 2
[1] 3 4 5
a*2
[1] 2 4 6
a/2
[1] 0.5 1.0 1.5
Check out the operations above. Look carefully at what happens in each case. In particular, note that each operation is cell by cell. Also note what happens when you have a mathematical operation involving a single number and a vector. The single number operates on each element of the vector.
b <- c(4,5)
a + b
Warning in a + b: longer object length is not a multiple of shorter object
length
[1] 5 7 7
a*b
Warning in a * b: longer object length is not a multiple of shorter object
length
[1] 4 10 12
Note that things get a little strange when the two objects are not the same length. Importantly, R may do the operation regardless! It may or may not give you a warning!! Can you work out what it actually did?1
Matrices
R allows researchers to use a fairly large array of objects, which is very nice but can lead to issues if you are not careful. Matrices are one such object and they can be very useful. In creating matrices we can use cbind(), which joins columns together, or rbind(), which joins rows together. Note that the objects being joined must be the same length.
a <- c(1,2,3)
b <- c(4,5,6)
A <- cbind(a,b)
B <- rbind(a,b)
A
a b
[1,] 1 4
[2,] 2 5
[3,] 3 6
B
[,1] [,2] [,3]
a 1 2 3
b 4 5 6
is.matrix(A)
[1] TRUE
t(A)
[,1] [,2] [,3]
a 1 2 3
b 4 5 6
The transpose operation is t() in R.
As with vectors, arithmetic operations on matrices are cell by cell. However, the matrices must be the same dimension in order to do cell by cell operations. Try the operations that are commented out below (remove the #).
C <- A + 2
A + C
a b
[1,] 4 10
[2,] 6 12
[3,] 8 14
D <- B*2
B*D
[,1] [,2] [,3]
a 2 8 18
b 32 50 72
# A*B
# A + B
A^2
a b
[1,] 1 16
[2,] 4 25
[3,] 9 36
A + t(B)
a b
[1,] 2 8
[2,] 4 10
[3,] 6 12
You can also do standard matrix multiplication using the operator %*% to distinguish it from cell by cell multiplication. This operation follows the mathematical rules of matrix multiplication. For this operation to work, the two “inside” dimensions must be the same. Below, we are multiplying a \(3 \times 2\) matrix \(\mathbf{A}\) by a \(2 \times 3\) matrix \(\mathbf{B}\). The operation \(\mathbf{A} \mathbf{B}\) works because the inside dimension is 2 for both. How about the operation \(\mathbf{B} \mathbf{A}\)? Try it below.
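For example, using the matrices A and B from above:
A%*%B
[,1] [,2] [,3]
[1,] 17 22 27
[2,] 22 29 36
[3,] 27 36 45
# B%*%A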
Lists
The first computer language I really learned was called Logo. The language was developed by Seymour Papert in MIT’s Artificial Intelligence Lab. Like its descendant Scratch, Logo was designed to help children learn mathematics and programming. Logo is based on Lisp (List Processing). My father, who was a computer scientist, would get very excited about the list processing ability of Logo. As a ten year old, I didn’t quite understand the fuss. Today, using the list processing features of R, I think back wistfully to Logo’s abilities. R is not a list based language but it can process lists.
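For example, we can put the vectors and matrices from above into lists (a_list is used again in later sections):
a_list <- list(a, b)
c_list <- list(A, B)
c_list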
[[1]]
a b
[1,] 1 4
[2,] 2 5
[3,] 3 6
[[2]]
[,1] [,2] [,3]
a 1 2 3
b 4 5 6
c_list <- c(c("one","two"), c_list)
c_list
[[1]]
[1] "one"
[[2]]
[1] "two"
[[3]]
a b
[1,] 1 4
[2,] 2 5
[3,] 3 6
[[4]]
[,1] [,2] [,3]
a 1 2 3
b 4 5 6
d_list <- list(c("one","two"), c_list)
d_list
[[1]]
[1] "one" "two"
[[2]]
[[2]][[1]]
[1] "one"
[[2]][[2]]
[1] "two"
[[2]][[3]]
a b
[1,] 1 4
[2,] 2 5
[3,] 3 6
[[2]][[4]]
[,1] [,2] [,3]
a 1 2 3
b 4 5 6
Lists can be very useful for storing things, particularly different types of objects. Lists don’t require the different elements to be of the same type. This feature makes them very useful and you will find them in the background for a number of R statistical functions. They can also store objects within objects.
One thing to notice above is that the operations c() and list() look similar but don’t actually do the same thing: c() combined everything into a single flat list with four elements, while list() created a new list whose second element is itself a list. Look carefully at the difference.
Data Frames
Data frames are a very important object for statistical analysis. The data frame is like a matrix, but it allows different types of elements; the type must be the same within each column. Data frames also behave a lot like lists. Importantly, you can call a column using the $ symbol. See discussion in the next section.
x <- read.csv("minimum wage.csv", as.is = TRUE)
# as.is = TRUE keeps the data in the same format
# as it was originally.
# note R prefers to change types to factor.
# another option is stringsAsFactors = FALSE.
is.character(x$State)
[1] TRUE
is.numeric(x$Minimum.Wage)
[1] TRUE
The object x is a data frame. It has one column of strings and two numeric columns.
Interacting with Objects
This section discusses how we can transform objects and retrieve information from them.
Transforming Objects
We can transform objects of various types into various other types with varying degrees of success. Transformations, or conversions, of an object type usually involve as.someobjecttype(). Note that matrix() is what I often use to transform a vector into a matrix; you need to state how many rows or columns the matrix has.
as.vector(as.numeric(as.character(as.factor(a)))) == a
[1] TRUE TRUE TRUE
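As a small sketch of the matrix() transformation mentioned above (the numbers are just for illustration):
matrix(c(1,2,3,4,5,6), nrow = 3) # fills column by column by default
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6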
The oddest, and possibly most frustrating, type in R is the factor. R likes to store character vectors as factors because it is computationally efficient. But it is easy to confuse a factor object with a numeric object. They are not the same. I often have errors in my code due to this mistake.
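As a small illustration of the trap, consider a factor that looks numeric:
f <- as.factor(c(10, 20, 30))
as.numeric(f) # returns the underlying codes, not the values
[1] 1 2 3
as.numeric(as.character(f))
[1] 10 20 30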
Logical Expressions
Like other operations, logical operators are cell by cell in R. The == is used to determine whether something is true or false. We can also use order operations: >, <, >= and <=. The ! is used for not. The symbols & and | are used for “and” and “or” when combining logical arguments. There are also && and ||, which work only on single values (they compare just the first elements) and are typically used inside if () statements; in recent versions of R they throw an error if given vectors longer than one.
a==b
[1] FALSE FALSE FALSE
A==t(B)
a b
[1,] TRUE TRUE
[2,] TRUE TRUE
[3,] TRUE TRUE
a_list[[1]]==a_list[[2]]
[1] FALSE FALSE FALSE
a > b
[1] FALSE FALSE FALSE
b > 5
[1] FALSE FALSE TRUE
b >= 5
[1] FALSE TRUE TRUE
b <= 4
[1] TRUE FALSE FALSE
a != b
[1] TRUE TRUE TRUE
(b > 4) & a == 3
[1] FALSE FALSE TRUE
# (b > 4) && a == 3 (this may have been deprecated)
(b > 4) | a == 3
[1] FALSE TRUE TRUE
# (b > 4) || a == 3
Retrieving Information from a Position
There are a number of ways to retrieve information from objects. The simplest is to request it by its position in the object. Objects such as vectors have an index which gives the position of every element in the vector. Note that R thinks of matrices as just rearranged vectors, so the index can also be used for matrices. As matrices are two-dimensional objects, information can also be retrieved using the matrix coordinates.
a[1]
[1] 1
b[3]
[1] 6
A[5]
[1] 5
A[2,2]
b
5
We are not limited to a single index number. We can retrieve a subset of the object using various index notation including - to mean “not.” Note that there is no “end” index notation in R. In its stead I use length() or dim(). Note that dim() only works for objects with more than 1 dimension, like matrices.
a[1:2]
[1] 1 2
a[-3]
[1] 1 2
A[1,]
a b
1 4
A[c(1,5)]
[1] 1 5
a[2:length(a)]
[1] 2 3
D <- cbind(A, 2*A)
D
a b a b
[1,] 1 4 2 8
[2,] 2 5 4 10
[3,] 3 6 6 12
D[,3:dim(D)[2]]
a b
[1,] 2 8
[2,] 4 10
[3,] 6 12
Retrieving information from lists can be frustrating. Consider the three different methods below. Note that the first two look very similar and seem to produce similar results, but they are actually quite different: the single bracket returns a sub-list containing the selected elements, while the double bracket returns the element itself. Suffice it to say, the double bracket thingy is probably what you want.
a_list[2]
[[1]]
[1] 4 5 6
a_list[2][2]
[[1]]
NULL
a_list[[2]]
[1] 4 5 6
a_list[[2]][2]
[1] 5
As well as the index, positions in an object may be named. In that case, the name can be used to retrieve the information. This is particularly useful for lists. To retrieve a named item in a list you can use the $. RStudio has a nice feature of popping up the options when you type the $, so you don’t need to remember all the names.
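As a small illustration (the list and its names here are created just for this example):
name_list <- list(numbers = c(1,2,3), words = c("one","two","three"))
name_list$words
[1] "one" "two" "three"
names(name_list)
[1] "numbers" "words"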
Often we are trying to determine where certain information lies in the object. We can use logical expressions and the function which() to find the index positions of vectors.
which(a > 2)
[1] 3
which(A > 2)
[1] 3 4 5 6
which(a_list[[2]] > 2)
[1] 1 2 3
If we are going to use the index to subset another object, we don’t need to use which().
b[a > 2]
[1] 6
B[A > 2]
[1] 2 5 3 6
A[,colnames(A)=="b"]
[1] 4 5 6
We can also use match() or %in% to determine whether information belongs to a set. Note that %in% returns TRUE or FALSE, while match() returns the index of the first match or NA if there is no match. We can use match() to find objects in a list. Note that we need to be careful that we are looking for the correct type of object in the list. But match() can also find things that are not in exactly the same format. See the examples below.
d <- c("one", "two", "three", "four")
c("one", "four", "five") %in% d
[1] TRUE TRUE FALSE
match(a,c(1,2))
[1] 1 2 NA
match(a_list,c(4,5,6))
[1] NA NA
match(a_list,list(c(4,5,6)))
[1] NA 1
match("1",c(3,5,8,1,99))
[1] 4
Statistics
This section discusses basic statistical operations in R.
Data
set.seed(123456789)
a <- c(1:1000)
b <- c("one","two",NA,4,5:1000)
e <- rnorm(1000)
c <- 2 - 3*a + e
x <- as.data.frame(cbind(a,b,c))
Consider the simulated data. The data frame object has a nice property that allows you to have objects in which some variables are numeric while others are strings. That said, you need to be careful and keep track of the variable types. Note that, under the default settings in older versions of R, all the variables in x are converted to factors. R has transformed them.
x <- as.data.frame(cbind(a,b,c), stringsAsFactors = FALSE)
Some useful functions for creating simulated data are rnorm() and runif(). These are pseudo-random number generators; related functions (such as dnorm() and pnorm() for the normal) calculate densities and probability distributions. The numbers are generated by a deterministic algorithm: if the generator starts from a particular number, a seed, it will always produce the same sequence. So if you get the system to always start with the same seed, your results will be exactly reproducible. In R, set.seed() does this. I always set the seed with 123456789.
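For example, resetting the seed reproduces exactly the same draws:
set.seed(123456789)
rnorm(3)
set.seed(123456789)
rnorm(3) # the two calls print identical numbers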
As a general rule, I like to keep my scripts in the same folder as the data. In this way there are no long paths for writing or reading files. If you want to do this in RStudio, then make sure to go to Session > Set Working Directory > To Source File Location.
Missing Values
R has a particular way of dealing with missing values. It uses NA for missing. Any operation involving a missing value gives back a missing value. This is a nice feature which forces the programmer to think about how to deal with missing information. R will also happily carry NA along inside numeric vectors and calculations.
2+NA
[1] NA
2*NA
[1] NA
2+c(3,4,6,7,NA)
[1] 5 6 8 9 NA
Often in microeconometrics we want to ignore the missing values. We may be assuming that they are missing-at-random. We did this in our analysis of returns to schooling for men in Chapters 1 and 2.
We can have many R functions skip over the missing using the option na.rm = TRUE. We see this option used below. Some functions, like lm(), will do this automatically.
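For example:
mean(c(3, 4, 6, 7, NA))
[1] NA
mean(c(3, 4, 6, 7, NA), na.rm = TRUE)
[1] 5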
Summary Statistics
We can do some basic summary statistics on the data using mean() and sd(). We can also use the quantile function to find out more about the distribution.
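For example, using the simulated normal draws e from above:
mean(e)
sd(e)
quantile(e, c(0.05, 0.5, 0.95)) # compare to the true values of roughly -1.64, 0 and 1.64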
The standard OLS function is lm() for linear model. When this function is run it creates a list of objects which can be used for various things. For example, you can recall just the coefficient estimates or the residuals from the regression. A nice feature of these functions is that you can just use the variable names in the regression call. You can do this as long as you specify the data frame object using data=. There are various built-in non-linear regression functions that may be called using the glm() procedure.
x <- read.csv("x.csv", as.is = TRUE)
#summary(x)
lm1 <- lm(c ~ a, data=x)
summary(lm1)[[4]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.097396 0.0641312421 32.70475 4.934713e-160
a -3.000170 0.0001109953 -27029.69905 0.000000e+00
lm1$coefficients
(Intercept) a
2.097396 -3.000170
glm1 <- glm(c > -1500 ~ a, family = binomial(link=probit), data=x)
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
glm1$coefficients
(Intercept) a
3799.867658 -7.592143
Note that when R reads in the data using the read.csv() function it creates an extra variable (a first column containing the row numbers from the saved file).
Control
This section discusses basic control methods used in R. There is a piece of folklore that says you cannot do loops in R. This is not quite correct; there are good and bad ways to do loops in R.
Loops
Looping is a fundamental part of programming. The basic loop is the “for-loop.” The “while-loop” is an alternative. The for () loop is better when there is a particular number of iterations, while the while () loop is better when you want it to stop when some value is reached.
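Here is a minimal sketch of the two (the particular numbers are just for illustration):
total <- 0
for (i in 1:10) {
  total <- total + i
}
total
[1] 55
i <- 1
while (i^2 < 50) {
  i <- i + 1
}
i
[1] 8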
Looping in R
Looping in R has special challenges. Consider the following problem of creating a matrix with three consecutive numbers in each row.
# don't do it this way!!!
A <- NULL
start_time <- Sys.time()
for (i in 1:10000) {
  A <- rbind(A, c(i,i+1,i+2))
}
Sys.time() - start_time
Time difference of 0.2243783 secs
A[400,]
[1] 400 401 402
sum(A)
[1] 150045000
# A faster way
A <- matrix(NA,10000,3)
start_time <- Sys.time()
for (i in 1:10000) {
  A[i,] <- c(i,i+1,i+2)
}
Sys.time() - start_time
Time difference of 0.009110689 secs
A[400,]
[1] 400 401 402
sum(A)
[1] 150045000
# An even faster way! (Sometimes)
a <- c(1:10000)
start_time <- Sys.time()
A <- t(matrix(sapply(a, function(x) c(x,x+1,x+2)), nrow = 3))
Sys.time() - start_time
Time difference of 0.01088405 secs
A[400,]
[1] 400 401 402
sum(A)
[1] 150045000
There is a particular way not to loop in R. R loops are very, very slow if they involve creating and re-creating objects, and such loops also tend to be very memory hungry. In the example above you get a significant speed increase by creating the object at its full size at the start and then filling it in during the loop.
The example above also illustrates the apply() family of functions (here, sapply()). Using these functions is often the fastest way of looping. I don’t find them very intuitive and so I tend not to use this method, but it may be useful if you are trying to speed up some code.
If Else
R uses two different types of “if-else” commands. The command you will use most of the time is ifelse(). This function operates on vectors. The command takes three arguments. The first is the logical statement to be checked, the second is what to do if the statement is true and the third is what to do if the statement is false. Note that like other operations on vectors this function works cell by cell.
a <- c(1,2,3,4,5)
b <- ifelse(a==3,82,a)
a
[1] 1 2 3 4 5
b
[1] 1 2 82 4 5
The more standard programming command is if (){} or if (){} else{}. This command checks the logical statement and then runs the command in the curly brackets. Note that you don’t have to specify the “else.”
A <- "Chris"
# A <- "Write Your Name Here"
if (A=="Chris") {
  print("Hey Chris")
} else {
  print(paste("Hey",A))
}
[1] "Hey Chris"
Optimization
Optimization is fundamental to statistical modeling. The main command in R uses functions as inputs, so the section discusses how to create and use these objects. It also discusses the basic optimizer, optim().
Functions
If you plan to use a procedure repeatedly, then it may be more efficient to create a function. Functions also help keep your script clean and tidy, and they make debugging easier by narrowing down where a bug can be located.
y <- x[,c(2,4)]
apply(y, 2, mean)
a c
500.500 -1499.488
colMeans(y)
a c
500.500 -1499.488
b <- c(1:dim(y)[1])
summary(sapply(b, function(x) sum(y[x,])), digits = 2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2000.0 -1500.0 -1000.0 -1000.0 -500.0 0.5
summary(rowSums(y), digits =2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2000.0 -1500.0 -1000.0 -1000.0 -500.0 0.5
We can define functions on the fly, as in the sapply() calls above.
When writing functions it is good practice to check that the input is the correct type and provide a warning or error code if it is not the expected type.
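Here is a minimal sketch of that kind of check (the function and its messages are purely illustrative):
safe_log <- function(x_in) {
  # error if the input is not numeric
  if (!is.numeric(x_in)) {
    stop("x_in must be numeric.")
  }
  # warn if the result will not be a finite number
  if (any(x_in <= 0, na.rm = TRUE)) {
    warning("non-positive values will produce NaN or -Inf.")
  }
  return(log(x_in))
}
safe_log(c(1, 10, 100))
[1] 0.000000 2.302585 4.605170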
lm_iv <- function(Y_in, X_in, Z_in = X_in, Reps = 100, min_in = 0.05, max_in = 0.95) {
  # takes in the y variable, x explanatory variables
  # and the z variables if available.
  # defaults: Z_in = X_in,
  # Reps = 100, min_in = 0.05, max_in = 0.95
  # Set up
  index_na <- is.na(rowSums(cbind(Y_in, X_in, Z_in)))
  Yt <- as.matrix(Y_in[index_na==0])
  Xt <- as.matrix(cbind(1, X_in))
  Xt <- Xt[index_na==0,]
  Zt <- as.matrix(cbind(1, Z_in))
  Zt <- Zt[index_na==0,]
  N_temp <- length(Yt)
  # turns the inputs into matrices
  # removes observations with any missing values
  # adds a column of 1s to X and Z
  # N_temp is the number of complete observations
  # Bootstrap
  r <- c(1:Reps)
  ibs <- sapply(r, function(x) round(runif(N_temp, min = 1, max = N_temp)))
  # each column of ibs holds the resampled row indices for one replication
  bs_temp <- t(sapply(r, function(x)
    solve(t(Zt[ibs[,x],])%*%Xt[ibs[,x],])%*%t(Zt[ibs[,x],])%*%Yt[ibs[,x]]))
  # bs_temp has one row per replication and one column per coefficient
  # Present results
  res_temp <- matrix(NA, dim(Xt)[2], 4)
  res_temp[,1] <- colMeans(bs_temp)
  for (j in 1:dim(Xt)[2]) {
    res_temp[j,2] <- sd(bs_temp[,j])
    res_temp[j,3] <- quantile(bs_temp[,j], min_in)
    res_temp[j,4] <- quantile(bs_temp[,j], max_in)
  }
  colnames(res_temp) <- c("coef", "sd", as.character(min_in), as.character(max_in))
  return(res_temp)
}
Above is a version of the instrumental variable estimator. This function uses matrix algebra to calculate the instrumental variable regression and presents bootstrap results. The input variables listed at the start of the function can be set to default values, which means the user doesn’t have to enter them. The start of the function is where you define objects and variables; it is a good place to confirm that objects are what you think they are. This version of the function uses sapply() to do the bootstrap loop.2
optim()
The standard optimization procedure in R is optim(). The algorithms it uses were developed by John C. Nash (no relation to the Nobel prize winning mathematician) in the 1970s. Nash originally wrote the optimizer in a language called BASIC so it would run on the particular machine he was using. According to Nash himself, there are now better optimization algorithms available (Nash 2014).
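As a sketch of the kind of problem optim() solves, consider minimizing a simple least squares objective on the simulated data from above (the function name and starting values here are illustrative):
f_ls <- function(par, y_in, x_in) {
  # sum of squared residuals for the line y = par[1] + par[2]*x
  sum((y_in - par[1] - par[2]*x_in)^2)
}
opt1 <- optim(par = c(0,0), fn = f_ls, y_in = x$c, x_in = x$a)
opt1$par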
Above is a simple least squares function, and the problem is solved using optim(). The function optim() takes initial starting values, a function to optimize and any additional arguments that the function needs, such as the data. It has a large number of default settings; in particular, it uses the Nelder-Mead procedure by default. One thing to note is that this procedure may give different results depending on the starting values used. What happens if you change the starting values to par=c(1,1)?
Discussion and Further Reading
While I have been programming for a very long time, I am not a programmer. The objective of this chapter is to give you some basic insights into R. If you are interested in really learning how to program in R then I suggest purchasing one of the many books out there or taking an online course. I highly recommend Kabacoff (2011) as it is written by a statistician/computer scientist. I also suggest Paarsch and Golyaev (2016) as an excellent introduction to data analysis in economics.
Note
The views expressed here are those of the author and do not reflect those of the Federal Trade Commission or the Congressional Budget Office.
I am grateful for the help I received from friends and colleagues who read and commented on the book. I am particularly grateful to Emek Basker, David Vanness and Greg Kirwin who read early versions of the early chapters. I am grateful to George Deltas, Peter Newberry, Joris Pinkse and Nathan Wilson, for their thoughts and insights into the “structural chapters.” Also, to Ott Toomet and Grant Farnsworth who reviewed sections at the request of the editor. Finally, I am very grateful to Brian Krauth, David Prentice and Devesh Raval who worked through the whole text making suggestions and edits.
All errors are my own.
References
Kabacoff, Robert I. 2011. R in Action. Manning.
Nash, John C. 2014. “On Best Practice Optimization Methods in R.” Journal of Statistical Software 60 (2).
Paarsch, Harry J., and Konstantin Golyaev. 2016. A Gentle Introduction to Effective Computing in Quantitative Research: What Every Research Assistant Should Know. MIT Press.
Footnotes
In the case of a + b, it did the following operations: 1 + 4, 2 + 5, 3 + 4. That is, R simply started over again with the first element of b. This is referred to as recycling.↩︎
In some cases it may be faster to call a compiled language like C. It is also possible to create compiled functions in R; see Kabacoff (2011).↩︎