1 Basics

1.1 Clear workspace

You might want to begin your session with a completely clear workspace, meaning there are no packages loaded and no datasets or objects in your environment. (Some people advise against this for reasons I am not savvy enough to understand, but so far I’ve had no problems)

# Clear workspace
rm(list = ls())

1.2 Install and Load Packages

At the very top of your script, you want to load any packages you will use. If its the first time using the package, you likely will have to install it first, using install.package().

# code for installing packages - note that you can load multiple packages at once. Also note you need quotation marks
install.packages(c("tidyverse", "stargazer", "lfe", "haven", "ISLR", "gridExtra"))
# how to use already installed packages- note you can only load one at a time and there are no quotation marks around the package name
library(tidyverse) # for easier code syntax and ggplots
library(stargazer) # for exporting nice tables
library(lfe)       # for running FE models
library(haven)     # for loading dta files from stata
library(ISLR)      # for data
library(gridExtra) # for arranging multiple ggplots

1.3 Working Directory

One of the first things you want to do when you open an R script is set your working directory - this tells R where to pull files from and where to save files to. Note that “working directory” is R’s term for a file path.

For windows users, your working directory will probably start with “C:/Users/” and will go on to connect to the exact folder you want to pull data from or save data to. For Mac users, this will probably start with “User/” (no “C:/”) and then connect to the folder. If you forget how the working directory input should look (ie, how does it start, are you supposed to use backward slashes or forward slashes), you can always run getwd() first.

# what is my current working directory?
getwd() 
## [1] "C:/Users/User/Dropbox/Duke/7. Bass Connections"
# set to your desired working directory
setwd("C:/Users/User/Dropbox/Duke/7. Bass Connections/data") # I'm a windows user!

1.4 Loading data

There are several different data formats you can read into R. Below are some of the common commands you might use:

  • read.csv() - for csv files or specific sheets of excel workbooks

  • read.delim() - for tab-delimited files (like dat or txt files) or “|” delimited files (sometimes txt files). First load package readr.

  • read_dta() - for loading stata .dta files. First load package haven.

Some datasets can be accessed through packages or APIs. When a package contains a pre-loaded dataset, for example, the Auto dataset in the ISLR package, you will not need a function to load this dataset. Rather you will simply call the dataset.

csv = read.csv("./data_file.csv") # Reads CSV file named "data_file.csv" in my "7. Bass Connections/data" folder (the one I set my working directory to), names this dataset "csv" in my global environment

txt = read.delim("./data_file.txt", sep = "|") # Reads txt file where delimiter is | 

dat = read.delim("./data_file.dat", sep = "\t") # Reads tab-delimited dat file ("\t" is for tab)

dta = read_dta("./data_file.dta") # Reads .dta files output from stata 

df = Auto # Reads the Auto dataset from the ISLR package, names it "df"

install.packages("htmltools")

1.5 Viewing data sets

Once you have loaded a data set, you can view the dataset in a separate window using the command View(), where the name of the dataset (df, dat, dta above) appears in the parantheses. Note, you can also click on the dataset inside the Global Environment window on the right of the R studio window.

1.6 Data Type in R

R classifies every vector (column or variable) into one of five types. Often times, it will classify the data type of the vector/variable automatically. But sometimes, it will classify the data type incorrectly, or in a way that limits your ability to manipulate, combine, or perform functions on that data, and you’ll have to manually change the variable type.

The five basic data types are:

  • character - a data point is (usually) automatically classified as type “character” if it contains at least one non-numeric character. "2000 Science Drive", "10-15-2021", or "a" would all be classified as characters. Note that character-based data points are often referred to as “strings.” All three of the examples in this bullet are considered strings.

  • numeric - a numerical data point, such as 2 or 15.398. When all the data points in a vector (column or variable) are numbers, the data type will instead be “integer.” However, if at least one data point in a column or variable contains a letter or a character like ?, the entire column will be classified as type character. Note that R also refers to type numeric as “double.”

  • integer - when all the data points in a vector, column, or variable are whole numbers, R will still read the object as type “integer.” You can force R to store a whole number as an integer by typing an “L” after the number, as in 4L or 1000L, where the L means “this is an integer.”

  • logical - data points in logical vectors or variables can take on only two values: TRUE or FALSE. Note, the TRUE or FALSE must appear in capital letters and without quotation marks to be read as type logical (a vector containing 3 data points “true”, “false”, and “true” would be classified as type character!). You can also use T and F instead of typing out the full words.

  • complex - this is for complex numbers, such as 1+8i. I don’t ever deal with this, and I hope for your sake you don’t have to either, so let’s skip it!

  • factor - factor is technically a type of data structure in R (see below) rather than a data type, however, I think it is easiest to think of it as a type similar to character or numeric. Factor is used when the data is qualitative in nature. For example, if your variable was Eye.Color, your data points might be green, blue, brown, blue. Even though these data points are strings (i.e. words rather than numbers), the information they are conveying is actually categorical in nature. Designating this variable as type “factor” will tell R that blue, green, and brown are different categories of response, not simply strings. This difference might seem negligible at first, but consider a situation in which you ran a survey asking about people’s eye color, and the response options are encoded numerically, such that 1 = brown, 2 = blue, 3 = green and 4 = other. The variable eye.color would then be read by R as a numeric variable, but it is in fact categorical. Designating the variable as a factor will ensure R does not treat this variable numerically in a way that would be inappropriate - say, by taking the average of this variable, or inserting it into a regular regression. Factors are quite important for econometric tools like Fixed Effects Regressions, where the fixed effects are factor variables.

1.6.1 Retrieving data type

  • typeof() - returns the data type of the object inside the parentheses

1.6.2 Changing data type

You can change the data type of objects in order to better suit what you want to do with the object. It might not be clear why you’d want to do that yet, but consider the example above about factors and Fixed Effects regressions as just a single situation in which changing data type is necessary. You will encounter more of these situations as you become more advance.

  • as.character() - change the data type of the object to character
  • as.numeric() - change the data type of the object to numeric
  • as.logical() - change the data type of the object to logical
  • as.factor() - change the data type of the object to factor

1.6.2.1 Changing (“coercing”) vectors to numeric

Can you change a vector containing characters and numbers into type numeric? Let’s see what happens!

# create a vector with 3 data points - 2 numbers, 1 word, 1 combination
vect1 = c(1, "apple", 40.6, "March 8") # vect1 is the name of the vector.
vect1 # typing the name of the vector will print its contents 
## [1] "1"       "apple"   "40.6"    "March 8"
# check the type
typeof(vect1) #R classified this vector as "character" because it contains letters, even though the majority of data points are numbers!
## [1] "character"
# change the type to numeric
vect2 = as.numeric(vect1) # name the new numeric version vect2

# check the type
typeof(vect2) # the vector type is now numeric (or double)
## [1] "double"
# look at the numeric vector
vect2 # R replaced the non-numeric data points with NA, even the data point that contained a number.
## [1]  1.0   NA 40.6   NA

As you can see in the example above, the data point"apple" could not be coerced into a number, and so instead it was replaced with NA. The same thing happened to "March 8" - note that this was not converted to simply 8.

1.6.2.2 Changing (“coercing”) vectors to logical and back

When can we coerce a vector to logical? R will easily convert a vector of 1 and 0 to logical, where 0 == FALSE and 1 == TRUE, and vice versa.

## coerce a numeric vector to logical
vect1 = c(1,1,0)
typeof(vect1) # automatically read as numeric ("double")
## [1] "double"
# convert to logical
vect2 = as.logical(vect1) 
vect2 # reads 1 as True, 0 as false
## [1]  TRUE  TRUE FALSE
## coerce a logical vector to numeric 
vect3 = c(F,F,T) # T and F can be used instead of TRUE and FALSE
typeof(vect3) # reads as logical
## [1] "logical"
vect4 = as.numeric(vect3)
vect4 # F = 0 and T = 1
## [1] 0 0 1
## coerce a character vector to logical
vect5 = c("T", "T", "F")
typeof(vect5) # note, because the Ts and Fs are in quotations, this is character!
## [1] "character"
vect6 = as.logical(vect5)
vect6 # this was enough to let R know how to convert!
## [1]  TRUE  TRUE FALSE
rm(vect1, vect2, vect3, vect4, vect5, vect6) # delete these vectors from our environment

1.6.2.3 Converting vectors to a factor

Lastly, lets look at what happens when we convert a numeric vector to a factor.

# create a vector of six years
years = c(1991, 1992, 1993, 1994, 1995, 1996)
years # look at the vector
## [1] 1991 1992 1993 1994 1995 1996
# calculate the mean, min and max
mean(years)
## [1] 1993.5
min(years)
## [1] 1991
max(years)
## [1] 1996
# convert year vector to a factor
years.factor = as.factor(years)
years.factor # note it looks the same (but now the years are "levels")
## [1] 1991 1992 1993 1994 1995 1996
## Levels: 1991 1992 1993 1994 1995 1996
# can we take the mean, min, or max of a factor?
mean(years.factor) # no...we can't!
## [1] NA
#min(years.factor) # this won't compute - years.factor isn't a numeric object with a minimum!

1.7 Data Objects in R

R has several different object types, or “data structures.” This can be confusing at first, but it is important to bear them in mind early on. We’ll look at the various object types in more detail below.

  • vector - this is the most basic object in R. A vector is a single column of data points - these can be numbers or letters/words (“strings”). All the data points in a vector will be classified as the same “type” (ie character, numeric, logical).
## [1] 1 2 3
  • matrix - this is a rectangular object with 1 or more columns and 1 or more rows. You can think of a matrix as a combination of vectors of the same length (imagine several vectors being glued to each other horizontally) or as a simple data frame. All of the elements (the numbers or letters filling the rows or columns) must be the same data type (string, logical, numeric, integer).
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    1    4
## [2,]    2    5    2    5
## [3,]    3    6    3    6
  • data frame - this is an object with 1 or more variables (columns) and 1 or more observations (rows). Every row has the same number of columns, and every column has the same number of rows; in other words, a data frame is a rectangular object in which every column has equal length. Data frames can have column names (variable names) and row names (observation ids, for example). Importantly, columns in a data frame can have different formats - for example, some columns can be words (“strings”) and others can be numbers.
##   number letter name  male
## 1      1      a  Bob  TRUE
## 2      2      b  Joe  TRUE
## 3      3      c Lynn FALSE
  • list - this is maybe the most complicated data object in R, you will likely not need to deal with lists early on in learning R. However, they are an extremely important tool as you begin using more advanced functions in R. A list is a container - you might think of the list as a filing cabinet that contains several “folders”. Each “folder” in the list contains an R object such as a data frame, a matrix, or a vector. A list therefore allows you to group and then index several different object types into a single object.
## [[1]]
## [1] 1 2 3
## 
## [[2]]
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    1    4
## [2,]    2    5    2    5
## [3,]    3    6    3    6
## 
## [[3]]
## [1] "hello"
## 
## [[4]]
##   number letter name  male
## 1      1      a  Bob  TRUE
## 2      2      b  Joe  TRUE
## 3      3      c Lynn FALSE

1.8 Creating Objects in R

To create an object in R, you use a standard syntax: the name of the object (whatever you want to call it), followed by the operator = or <-, followed by whatever you want the object to be. Note that = is used to assign values to an object, where as == is used to mean “are these two things equal?”. = and <- are identical in meaning and can be used identically in R.

Once an object is named, you only need to type in the name to call that object up or use it. Below is a basic example of creating a vector and giving it a name.

# create a vector named A that contains the elements 1, 2, and 3 
A = c(1,2,3) 

# when you type A and hit ctrl + enter, the contents of A will print in the console
A 
## [1] 1 2 3
# perform a function on A
2*A # multiplies every element of A by 2
## [1] 2 4 6
# perform a function on A and save the result as a new object
AA = 2*A # create a new object named AA
AA # print the contents of the new object in the console
## [1] 2 4 6

Next, we walk through how to create different types of objects (vectors, matrices, data frames) from scratch.

1.8.1 Creating Vectors

There are several ways to create a vector:

  • c(1,2,3) or c("Linda", "Mark", "Johnny"): entering numbers or strings into parentheses preceded by c() tells R to combine these things into a vector. Note that characters or strings always need to appear in quotation marks.
## [1] 1 2 3
## [1] "Linda"  "Mark"   "Johnny"
  • seq(5, 10, 1) or seq(300, 500, 50): the former creates a vector out of the sequence from 5 to 10 by 1, the latter creates a vector out of the sequence from 300 to 500, by 50.
## [1]  5  6  7  8  9 10
## [1] 300 350 400 450 500
  • 5:10: creates a vector out of the sequence from 5 to 10 by 1
## [1]  5  6  7  8  9 10
  • vector() or logical(0): creates an empty logical vector (default) of length 0

  • vector("character",length = 4) or character(4): creates an empty character vector of length 4

## [1] "" "" "" ""
  • vector("numeric", length = 6) or numeric(4): creates an empty numeric vector of length 6 (all 0s)
## [1] 0 0 0 0 0 0

1.8.2 Creating Matrices

Below are simple functions for creating matrices from scratch. Note that you can also bind vectors together to get matrices, or convert data frames to matrices. We will deal with conversion below, and binding in section 2.

  • matrix(nrow = 2, ncol = 3): create a 2x3 matrix filled with NAs
##      [,1] [,2] [,3]
## [1,]   NA   NA   NA
## [2,]   NA   NA   NA
  • matrix(c(1,2,3,4), nrow = 2, ncol = 2): create a 2x2 matrix with elements 1, 2, 3, 4 that fill in by column
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
  • matrix(c(1,2,3,4), nrow = 2, ncol = 2, byrow = T): create a 2x2 matrix with elements 1, 2, 3, 4 that fill in by row
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4

1.8.3 Creating data frames

This is the most basic function for creating a data frame from scratch. Note that usually, you will bind together vectors to make data frames, or convert existing matrices to data frames. We will deal with conversion below, and binding in section 2.

  • data.frame(var1 = c(T,T,F), var2 = c("a", "b", "c"), var3 = 1:3): create a data frame with 3 variables and 3 observations
##    var1 var2 var3
## 1  TRUE    a    1
## 2  TRUE    b    2
## 3 FALSE    c    3

1.8.4 Converting objects to different types

The functions for converting an object to another type are straightforward:

  • as.vector() - converts an element, matrix or data frame to a vector
  • as.matrix() - converts a vector or data frame to a matrix
  • as.data.frame() - converts a matrix or vector to a data frame

However, there can be some idiosyncracies. Take a look at the examples below.

Vectors to Matrix

If you have a vector with 3 elements, using as.matrix() will convert the vector to a 3 x 1 matrix, taking the vector as a single column. If you want the vector to serve as a single row of a matrix, you can transpose the vector first using the transpose function t(), such that the matrix transformation function looks like as.matrix(t()).

v = c(1,2,3) # start with a vector
as.matrix(v) # converts the vector to a 3x1 matrix
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    3
as.matrix(t(v)) # converts the vector to a 1x3 matrix
##      [,1] [,2] [,3]
## [1,]    1    2    3

Data frame to Matrix

While the columns in a data frame can be different data types (i.e. logical, numeric, character), columns in a matrix must be all the same variable type. Therefore, when you convert a data frame with categorical, character, and numeric variables to a matrix, all the columns in the matrix will be coerced to a single type (likely “character”). However, variable names remain preserved in the matrix.

C # start with data frame used earlier
##   number letter name  male
## 1      1      a  Bob  TRUE
## 2      2      b  Joe  TRUE
## 3      3      c Lynn FALSE
as.matrix(C) # all elements appear in quotation marks, even numbers
##      number letter name   male   
## [1,] "1"    "a"    "Bob"  "TRUE" 
## [2,] "2"    "b"    "Joe"  "TRUE" 
## [3,] "3"    "c"    "Lynn" "FALSE"
typeof(as.matrix(C)) # the type has been converted to character
## [1] "character"

Matrix to data frame

Unlike matrices or vectors, data frames must have column names (row names are optional). When you convert a matrix to a data frame, R will automatically assign column names to the data frame - these will be standard like “V1”, “V2”, “V3” and so on. When you convert a vector to a dataframe, R will automatically assign the name of the vector as the column name (see below). You can rename the columns using the function colnames().

# matrix to data frame
B # start with matrix used above
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    1    4
## [2,]    2    5    2    5
## [3,]    3    6    3    6
as.data.frame(B) # no column names transfer
##   V1 V2 V3 V4
## 1  1  4  1  4
## 2  2  5  2  5
## 3  3  6  3  6
# vector to data frame
A # start with vector used above
## [1] 1 2 3
as.data.frame(A) # the name of the column is A, the name of the vector.
##   A
## 1 1
## 2 2
## 3 3

1.9 Referencing elements within objects

Sometimes you will want to reference specific elements in a vector, matrix, data frame or list. The notation is slightly different depending on the object type.

  • Vectors - lets reference items in a vector called vect
    • vect[2] - returns the 2nd element in the vector
    • which(vect == "lemon") - returns the index number of where the word “lemon” appears in the vector. This command is like the reverse of the above command. For example, if vect[3] == "lemon" then which(vect == "lemon") == 3
    • min(vect) - returns the smallest element in a numeric vector, or the first element alphabetically in a character vector
    • max(vect) - returns the largest element in a numeric, or the last element alphabetically in a character vector
  • Matrices - let’s reference items in a 4*4 matrix called mat
    • mat[1,] - returns the 1st row of the matrix
    • mat[,1] - returns the 1st column of the matrix
    • mat[2,4] - returns the element in the 2nd row, 4th column of the matrix
    • mat[3] - returns the third element in the matrix when starting at the top left corner and counting down by column
    • min(mat) - returns the smallest element in the entire matrix (or the first element alphabetically if the type is character)
    • min(mat[,4]) - returns the smallest element in the 4th column of the matrix
    • mat[,"var1"] - IF the matrix has variable names (ie, was converted from a data frame), you can call specific columns in the matrix by name
## 
## ========================
## name  age gender retired
## ------------------------
## Lou   44    M     FALSE 
## Mark  65    M     TRUE  
## Nancy 23    F     FALSE 
## ------------------------
  • Data Frames - let’s reference items in the data frame above, called df
    • df$age or df[,2] - returns the second variable “age” in the data frame as a vector
    • df[2] or df["age"] - returns the second variable in the data frame as a data frame (with 3 rows, 1 column)
    • df[2,] - returns the second row in the data frame as a dataframe (with 1 row, 4 columns)
    • min(df$age) - returns the minimum of the variable “age”

2 Manipulating and shaping data

2.1 Using dplyr functions

dplyr is a package commonly used in R to make data cleaning and shaping “tidier”. Let’s look at some basic dplyr functions below.

  • filter() - select the rows you want to retain from the data frame. This can be used with variable names and “conditions”.

    • filter(Var5 > 60) would create a new data frame containing only those observations with a value for Var5 greater than 60.
    • You can also add multiple conditions as well using the ‘or’ operator | and the ‘and’ operator &: filter(Var5 > 0 & Var5 < 100) (note you have to type out the variable name twice)
    • filtering NAs requires special syntax. If you wanted to include only the observations that are missing data on Var5, you would use: filter(is.na(Var5)). If you wanted to exclude the observations that are missing data on Var5, you would use: filter(!is.na(Var5)), where the ! indicates “is not”.
  • select() - select the columns you want to retain from the data frame.

    • This can be used with variable names, for example to select only the columns named Var1 and Var3, you would enter select("Var1", "Var3")
    • Alternatively, you can refer to the order in which variables appear: select(4:8) would retain the fourth through the eighth variable in the dataset.
    • You can also specify a variable to exclude: select(-"Var5") would select every variable except Var5.
  • mutate() - use this function to create new variables that are mutations of existing variables.

    • If you wanted to create a new variable that was the log of the existing variable gdp, you would use the following function: mutate(gdp.ln = log(gdp)).
    • If you wanted to retain an existing variable gdp but round it to the nearest 1000, you would use: mutate(gdp = round(gdp, -3)).
    • If you wanted to add Var2 to Var3 to create a new variable called Var4, you would use: mutate(Var4 = Var2 + Var3).
    • If you wanted the sum of Var2 and Var3 to replace the existing Var3, you would use: mutate(Var4 = Var2 + Var3).
  • arrange() - use this function to order the rows in a data set in increasing order by the value in a column.

    • arrange(var1) will arrange the rows of the dataset in ascending order by the value of var1
    • arrange(desc(var1)) will arrange the rows of the dataset in descending order by the value of var1
  • rename() - use this function to assign new names to variables. The syntax is as follows: rename(newname = oldname)

  • group_by() - groups observations by common values on the specified variable. In group_by(year), for example, all observations from year 2010 will be grouped together, all observations with year == 2011 are grouped together, etc.

  • summarize() - create new variables or mutate existing variables to summarize data, either within groups or within the entire dataset, using some specified function. Returns a dataset where the number of rows equals 1 if no group argument is specified, or to the number of unique groups in the variable specified in the group_by() argument. Some common summary functions:

    • mean(var1, na.rm = T) : take the average (within group) of var1, ignore NAs in var1 in calculating this mean
    • sum(var1, na.rm = T) : sum up the values of all the elements (within group, if specified) of var1, ignoring NAs
    • min(var1, na.rm = T) : take the min of all the elements (within group, if specified) of var1, ignoring NAs
    • max(var1, na.rm = T) : take the max of all the elements (within group, if specified) of var1, ignoring NAs
    • quantile(var1, probs = .25, na.rm = T): calculate a specified quantile (here,the 25th percentile) within group, ignoring NAs
    • n() : count how many observations in this group (note, if no by_group() argument is used, this will correspond to the number of rows in the dataframe)
    • n_distinct(var1) : count how many distinct values of var1 exist in this group (or in the entire data frame, if no by_group() argument is used)

It is important to note that although dplyr functions can be used as any other functions, it is often paired with an odd but intuitive syntax that makes performing multi-step data manipulation functions easier. In dplyr functions, instead of inserting the name of the object inside the parentheses after a function, for example min(df), you start with the name of the object and then use the linking operator %>% to link it to functions, such that you can continue to refer to the same object several times.

  • So, where typically you’d type filter(df, var1 > 0), you now can type df %>% filter(var1 > 0).
  • The benefits of this are obvious when you have several manipulations. If you wanted to rename var3 after filtering a dataset without using the %>% piping, you would have to type: rename(filter(df, var1 > 0), new.name = var3). Instead, we can see clearly the order of operations by entering: df %>% filter(var1 > 0) %>% rename(new.name = var3)

2.1.1 Example: using group_by and summarize()

We are given a data set of the following grades recorded in PoliSci 630 last semester:

gender grade hours per week
Male 94 5
Male 86 6
Male 75 3
Male 83 12
Male 98 9
Female 100 10
Female 77 2
Female 84 6
Female 93 6
Female 79 8

We want to know the average grade by gender, the average amount of time spent working by gender, and the total number of hours spent working by gender. For this we will use group_by() and summarize()

# Create a data frame corresponding to the table above
grades = data.frame(                              # command for making a new dataframe
  "gender" = c(rep("Male", 5), rep("Female", 5)), # make a variable called "gender" that consists of "Male" repeated 5 times, and "Female" repeated 5 times
  "grade" = c(94,86,75,83,98,100,77,84,93,79),    # make a variable called "grade" that consists of the datapoints above
  "hours" = c(5,6,3,12,9,10,2,6,6,8)              # make a variable called "hours" that consists of the datapoints above
  )

grades
##    gender grade hours
## 1    Male    94     5
## 2    Male    86     6
## 3    Male    75     3
## 4    Male    83    12
## 5    Male    98     9
## 6  Female   100    10
## 7  Female    77     2
## 8  Female    84     6
## 9  Female    93     6
## 10 Female    79     8
# Create a new data frame with average grade by gender
grades_avg = grades %>%
  group_by(gender) %>%               # group the data by gender
  summarize(grade.avg = mean(grade), # create a new variable equal to the average grade for each gender
            hours.avg = mean(hours), # create a new variable equal to the average hours spent by each gender
            hours.tot = sum(hours))  # create a new variable equal to the sum of hours spend by each gender

grades_avg # we now have a "summary" data set, with 2 rows and 4 columns (gender + the summary variables)
## # A tibble: 2 x 4
##   gender grade.avg hours.avg hours.tot
##   <chr>      <dbl>     <dbl>     <dbl>
## 1 Female      86.6       6.4        32
## 2 Male        87.2       7          35

3 Summarize the data

We’ll use the “Auto” dataset from the ISLR package for this section.

dat = Auto # loads the Auto dataset from ISLR package, names it "dat"

3.1 Viewing summaries within R

There are several ways to quickly pull numerical summaries of your data in R for your own reference.

  • head(data): will report the first 6-10 lines of the dataset. This is helpful if you don’t have a ton of columns in the data.

  • summary(data): reports the min, mean, median, max, and number of NAs for each variable in the dataset. Very helpful for getting a sense of what the dataset contains.

  • length(vect): counts how many elements are in a vector. Only for vectors. Use nrow() instead for data frames.

  • nrow(data) : count the number of rows in a data frame. Note, for a vector, use length() instead.

  • ncol(data): count number of columns in a data frame.

  • table(data$var1, useNA = "ifany"): how many observations take on each value of a variable? Helpful for factor or integer variables. Be sure to specify useNA = "ifany" if you also want to know how many observations have NA for that variable.

  • table(data$var1, data$var2, useNA = "ifany"): look at the cross-wise frequencies of two variables. Again, don’t forget to specify the useNA if you want NAs to appear.

  • quantile(data$var1, na.rm = T): provides the 0, 25, 50, 75, and 100 percentile of the variable. You can also specify the percentiles you want to view.

    • range(data$var1, na.rm = T): will provide the range of the variable (i.e. just the 0 and 100th percentile)
  • names(data): provides the names of every variable in the dataset. Good for trying to remember what the name or orders of your variables look like.

3.2 Outputting summary tables

You might also want to summarize data in a way that looks nice for outputting into reports. One way to do this is using stargazer(summary.frame, summary = F), where summary.frame is a data frame you create with summaries of the data.

# Creating function for extracting the relevant summary statistics
descriptives <- function(x) {
  cbind.data.frame(
    'Observations' = sum(is.na(x)==F), 
    'Median' = round(median(as.numeric(x), na.rm=T), 3),
    'Mean' = round(mean(as.numeric(x), na.rm = T), 3),
    'Min' = min(x, na.rm=T),
    'Max' = max(x, na.rm=T))
}

# Extracting descriptive statistics for Table 1 A 
summary.frame <- do.call(rbind, apply(dat, 2, function(x) descriptives(x))) %>% rownames_to_column(var = "Variable")

# Reporting Summary
stargazer(summary.frame, 
          type = "text", 
          summary = F,
          rownames = F,
          digits = 2)
## 
## ====================================================================================
## Variable     Observations  Median    Mean             Min                 Max       
## ------------------------------------------------------------------------------------
## mpg              392       22.75    23.45             9.0                 46.6      
## cylinders        392         4       5.47              3                   8        
## displacement     392        151     194.41           68.0                455.0      
## horsepower       392       93.50    104.47            46                  230       
## weight           392      2,803.50 2,977.58          1613                 5140      
## acceleration     392       15.50    15.54             8.0                 24.8      
## year             392         76     75.98             70                   82       
## origin           392         1       1.58              1                   3        
## name             392                        amc ambassador brougham vw rabbit custom
## ------------------------------------------------------------------------------------

4 Visualization

4.1 Histograms

First lets do a number of histograms using the Auto dataset, which is built into the ISLR package. We will use different methods:

  • hist(): built in feature for making histograms - great for simple plots, but you can’t do fancy things, like compare two groups side by side

  • geom_histogram(): ggplot feature for making histograms. NOTE: you link lines of code with +. This means you can hit ctrl+Enter on the last line of the plot, and the entire plot will be created (this is different from the hist() function!)

# Most simple way: hist()
hist(dat$weight,                  # what you're making a histogram of: a single variable or vector
     main = "Histogram of Weight", # title
     xlab = "Weight")              # x-axis label

# More complicated histogram using hist()
# NOTE: you have to run all this code at the same time in order for all the elements to appear on the plot!
hist(dat$weight,                    # variable or vector
     main = "Histogram of Weight",  # title
     xlab = "Weight",               # x-axis label
     breaks = 50,                   # I want approx. 50 individual bars
     xaxt = 'n')                    # says I want to draw my own axis
# adjust x-axis label
axis(side = 1,                           # adjust labeling of x-axis (y-axis would be side = 2)
     at = seq(1500, 5000, by = 100),     # put tick marks at every 100 between 1500 and 5000
     labels = seq(1500, 5000, by = 100), # put labels at the tick marks 
     cex.axis = .8)                      # change the font size - says make the d
# add additional lines for reference
abline(v = quantile(dat$weight)[2],   # put a vertical line at the 25th quantile
       col = "blue",                  # color it blue
       lty = 2)                       # make it dashed
abline(v = quantile(dat$weight)[4],   # put a vertical line at the 75th quantile
       col = "red",                   # color it red
       lty = 3)                       # make it dotted
abline(v = mean(dat$weight),          # put a vertical line at the mean
       col = "black",                 # color it black
       lwd = 2)                       # make it thick
# add a legend to describe lines
legend("topright",                      # put it in the top right of the plot
       legend = c("25th Percentile",    # these are the labels for your legend, you can list them in any order
                  "Mean",
                  "75th Percentile"),
       lty = c(2,1,3),                  # the type of line for each legend item (list in order of labels)
       lwd = c(1,2,1),                  # the width of line for each legend item (list in order of labels)
       col = c("blue", "black", "red"), # the color of line for each legend item (list in order of labels)
       cex = .75)                       # font size - says make the font .75x the regular size

# Basic histogram with ggplot()
ggplot(dat,                               # what data frame you're drawing from in all elements of the ggplot
       aes(x = weight))+                  # variables you're plotting in all elements of the ggplot (in this case, only an x-var)
        geom_histogram() +                # says "take this data and make it a histogram"
        xlab("Weight") +                  # label the x-axis
        ylab("Frequency")+                # label the y-axis
        ggtitle("Histogram of Weight")    # give the plot a title

# More complicated histogram with ggplot()
ggplot(dat, aes(x = weight)) +
        geom_histogram(binwidth = 300,    # now we specify the bin width - each bin will capture 300 lbs
                       fill = "black",    # makes the inside of the bars black
                       color = "grey") +  # outlines them in grey
        # Label X axis
        scale_x_continuous(name = "Weight",                   # names the x-axis
                           breaks = seq(1500,5100,by = 300))+ # specifies where to put ticks and labels - every 300
        ylab("Frequency") +  
        ggtitle("Histogram of Weight") +
        # Add a line for 25, 50, 75 percentiles
        geom_vline(aes(xintercept = quantile(weight)[2],      # note we don't have to say dat$ before weight bc dat is already called in the ggplot() line
                       color = "25th Percentile")) +          # when we specify the color command INSIDE the aes(), it assigns the object a color 
        geom_vline(aes(xintercept = quantile(weight)[4],      # and allows us to give it a name for the legend - it will draw a legend automatically!
                       color = "75th Percentile"))+
        geom_vline(aes(xintercept = mean(weight),
                        color = "Mean"))+
        scale_color_discrete("Quantiles")+ # Here we label the legend
        theme_bw()  # a different look for ggplots - white background, clean legends. you can play around with different themes!

4.2 Bar plots with comparisons

Next we’ll use geom_bar() with ggplot to compare different groups in the data. Note, bar plots are really only useful when at least one of your variables is discrete!

First, lets compare histograms of foreign and domestic cars by number of cylinders. We use two different types of comparisons, “stack” and “dodge.” Which do you prefer?

# First, we create a dummy variable for foreign origin
dat = dat %>% mutate(foreign = as.factor(ifelse(origin == 2 | origin == 3, 1, 0)))

# Next, we compare weight by origin. This time, we use geom_bar (rather than geom_histogram())
stack = ggplot(dat,                                # lets save this plot as an object called "stack"
               aes(x=as.factor(cylinders)))+       # This time, lets do it by cylinders (converted to a factor variable)
        geom_bar(stat = "count",                   # specifies that we want a histogram (y-axis is count)
                 aes(fill = foreign),              # use different colors based on the value of variable 'foreign'
                 position = "stack") +             # stack the different colored bars on top of eachother
        labs(title = "Histogram of Cylinders by Origin", # can enter all your labels in a single 'labs()' command
             subtitle = "Stacked Histogram",
             x = "Number of cylinders",
             y = "Count") +
        scale_fill_discrete("Origin",                          # name the legend thats providing the discrete fill colors
                            labels = c("Domestic", "Foreign")) # specify the names of the values in the legend

# Lets do the exact same comparison, but this time with the bars next to eachother
dodge = ggplot(dat, aes(x=as.factor(cylinders)))+      # lets save this plot as an object called "dodge"
        geom_bar(stat = "count", aes(fill = foreign),              
                 position = "dodge") +                 # this time, we want the bars next to each other!
        labs(title = "Histogram of Cylinders by Origin", 
             subtitle = "Dodged Histogram", x = "Number of cylinders", y = "Count") +
        scale_fill_discrete("Origin", labels = c("Domestic", "Foreign")) 

# Lets compare the two side by side
grid.arrange(stack, dodge, nrow = 2)

What if we want to compare some value across several different groups? geom_bar() is good for this also! Below, we calculate the mean mpg for cars made in each year of the dataset, and plot the mean for each year.

# Calculate means by year
yr_mean = dat %>% group_by(year) %>% summarize(mean.mpg = mean(mpg))

# Create bar graph 
horiz = ggplot(yr_mean, aes(x = as.factor(year), y = mean.mpg)) + # x var is year - but its discrete, so we add as.factor()
        geom_bar(stat = "identity") +                             # use stat = "identity" when plotting a bar graph with a y var
        scale_x_discrete(labels = paste0("19", yr_mean$year))+    # I add "19" to the front of the year labels (70,71...) to  improve readability
        labs(x = "Year", y = "Mean MPG",                          # add labels to graph
             title = "Mean MPG by Year") + 
        theme(axis.text.x = element_text(angle = 45))             # tilt x-axis lables 45 degrees to improve readability  

# Do the exact same thing but make it vertical
vert = ggplot(yr_mean, aes(x = as.factor(year), y = mean.mpg)) +
        geom_bar(stat = "identity") +
        scale_x_discrete(labels = paste0("19", yr_mean$year)) +
        labs(x = "Year", y = "Mean MPG", title = "Mean MPG by Year") +
        coord_flip() # this command will flip the axes, you don't need to do anything else differently

grid.arrange(horiz, vert, nrow = 1)

Now, lets do the same thing, but plot foreign and domestic made cars separately, and only draw the vertical plot.

# Calculate means by year, by origin
yr_mean_for = dat %>% group_by(year, foreign) %>% summarize(mean.mpg = mean(mpg))

# Do the exact same thing but make it vertical
ggplot(yr_mean_for, aes(x = as.factor(year), y = mean.mpg)) +
        geom_bar(stat = "identity",    # bar graph where we specify the y-var (ie: not a histogram)
                 aes(fill = foreign),  # draw two separate bars with different fill colors based on the value of foreign
                 position = "dodge") + # draw the bars for foreign and domestic next to each other
        scale_x_discrete(labels = paste0("19", yr_mean$year)) +
        labs(x = "Year", y = "Mean MPG", title = "Mean MPG by Year", subtitle = "By Origin") +
        scale_fill_discrete("Origin", labels = c("Domestic", "Foreign")) + # label the legend
        coord_flip() 

4.3 Line plots

Now, let’s plot how the average MPG of foreign and domestic cars varied over time. Here, we treat year as a continuous variable (compared to in the previous section). This will allow us to compare trends.

ggplot(yr_mean_for, aes(x = year, y = mean.mpg)) +
        geom_line(aes(group = foreign,      # plot a separate line for each value of foreign
                      color = foreign)) +   # color the line based on the value of foreign
        geom_point(aes(group = foreign,     # plot a separate point at each x for each value of foreign
                       color = foreign)) +  # color the points based on the value of foreign
        scale_x_continuous(breaks = seq(min(yr_mean_for$year), max(yr_mean_for$year), 2),  # make the x-axis ticks and labels appear every 2 years
                           labels = paste0("19", seq(min(yr_mean_for$year), max(yr_mean_for$year), 2))) +  # add "19" to the beginning of the year labels
        scale_color_discrete("Origin", labels = c("Domestic", "Foreign")) +  # Name the legend and specify the values of foreign 
        labs(x = "Year", y = "Mean MPG", title = "Mean MPG Over Time", subtitle = "By origin")

What about something more fancy, like how the mean MPG changes by year based on the number of cylinders? Here we see we don’t have observations for every year for each number of cylinders, so the plot will only connect the years for which we have datapoints.

yr_mean_cyl = dat %>% group_by(year, cylinders) %>% summarize(mean.mpg = mean(mpg))

ggplot(yr_mean_cyl, aes(x = year, y = mean.mpg)) +
        geom_line(aes(group = as.factor(cylinders),      # plot a separate line for each number of cylinders
                      color = as.factor(cylinders))) +   # color the line based on the number of cylinders
        geom_point(aes(group = as.factor(cylinders),     # plot a separate point for each number of cylinders
                       color = as.factor(cylinders))) +  # color the points based on number of cylinders
        scale_x_continuous(breaks = seq(min(yr_mean_for$year), max(yr_mean_for$year), 2),  # make the x-axis ticks and labels appear every 2 years
                           labels = paste0("19", seq(min(yr_mean_for$year), max(yr_mean_for$year), 2))) +  # add "19" to the beginning of the year labels
        scale_color_discrete("Number of Cylinders") +  # Name the legend. We don't need to specify values this time
        labs(x = "Year", y = "Mean MPG", title = "Mean MPG Over Time", subtitle = "By origin")

4.4 Visualize three-dimensions of data

A scatterplot, in which the points are colored based on a third variable, can help you visualize three dimensions at once. Here we look at the relationship between MPG and Weight by a factor variable, cylinders, and a continuous variable, horsepower.

ggplot(dat, aes(x = weight, y = mpg)) +
        geom_point(aes(color = as.factor(cylinders))) +
        scale_color_discrete("# of Cylinders") +
        labs(title = "Scatterplot of MPG by Weight",
             subtitle = "By number of cylinders")

ggplot(dat, aes(x = weight, y = mpg)) +
        geom_point(aes(color = horsepower)) +
        scale_color_gradient("Horsepower", low = "green", high = "red")+
        labs(title = "Scatterplot of MPG by Weight",
             subtitle = "By horsepower")

4.5 Saving Plots

For ggplots, you can simply add ggsave("name.png", height = 5, width = 7) at the end of the code for a plot. It will save the last plot as a 5x7 png in the folder specified in your working directory.

ggsave("by_horsepower.png", height = 5, width = 7) # saves the last plot

5 Running Models and Reporting the Results

5.1 Simple linear model

Let’s use the Auto data again to look at the effect of horsepower on mpg, controlling for acceleration and displacement. The syntax for running linear models in R is: lm(y ~ x1 + x2 + x3, data). We will then use a number of functions to look at the results.

  • summary(lm1): reports the coefficients, standard errors, and p-values along with some basic diagnostics. Helpful for viewing within R, not very pretty for reports.

  • stargazer(lm1): prints a pretty version of the coefficient table in the console. Stargazer is nice bc it offers lots of customizable features. You can save this table directly to your working directory as a .html file (out = "./tablename.html") which allows for easy screenshotting for reports/presentations or a text file (out = "./tablename.tex") for inserting the code directly into overleaf. You can also print the table directly in Rmd by specifying type = "latex".

lm1 = lm(mpg ~ horsepower + acceleration + displacement, data = dat) 

# View built-in summary
summary(lm1) 
## 
## Call:
## lm(formula = mpg ~ horsepower + acceleration + displacement, 
##     data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.1330  -3.1616  -0.6326   2.3641  16.6478 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.254707   2.578694  17.937  < 2e-16 ***
## horsepower   -0.088783   0.015835  -5.607 3.92e-08 ***
## acceleration -0.412230   0.116227  -3.547 0.000438 ***
## displacement -0.036660   0.005029  -7.290 1.75e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.468 on 388 degrees of freedom
## Multiple R-squared:  0.6749, Adjusted R-squared:  0.6724 
## F-statistic: 268.5 on 3 and 388 DF,  p-value: < 2.2e-16
# Save lm object outputs
coef = lm1$coefficients               # can call coefficients straight from lm object
se = summary(lm1)$coefficients[,2]    # to get standard errors, you have to pull them from the summary
pval = summary(lm1)$coefficients[,4]  # to get pvalues, you have to pull them from the summary

# Bind manually and print data frame
lm1.res = data.frame("Coef" = coef,       # 1st column is vector of coefficients
                     "Std.Error" = se,    # 2nd column is vector of SEs
                     "p.value" = pval)    # 3rd column is vector of pvals
lm1.res
##                     Coef   Std.Error      p.value
## (Intercept)  46.25470750 2.578693889 7.891585e-53
## horsepower   -0.08878252 0.015834755 3.919182e-08
## acceleration -0.41222985 0.116227322 4.377030e-04
## displacement -0.03665995 0.005028849 1.753751e-12
# Print pretty table
stargazer(lm1, # Model goes here, can be one or many models  
          dep.var.labels = "MPG", # name of dependent variable(s)
          covariate.labels = c("Horsepower",     # labels for the x-vars (mind the order!)
                               "Acceleration", 
                               "Displacement"), 
          title = "Effect of Horsepower on MPG", # title for the table
          out = "./table1.html",  # saves the table as a html in ur wd - makes it easy for opening and screenshotting.
          type = "text") # lets you view the table clearly in the console. If printing in RMarkdown, use type = "latex"
## 
## Effect of Horsepower on MPG
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                 MPG            
## -----------------------------------------------
## Horsepower                   -0.089***         
##                               (0.016)          
##                                                
## Acceleration                 -0.412***         
##                               (0.116)          
##                                                
## Displacement                 -0.037***         
##                               (0.005)          
##                                                
## Constant                     46.255***         
##                               (2.579)          
##                                                
## -----------------------------------------------
## Observations                    392            
## R2                             0.675           
## Adjusted R2                    0.672           
## Residual Std. Error      4.468 (df = 388)      
## F Statistic          268.457*** (df = 3; 388)  
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

5.2 Fixed Effects Model

Now let’s use the Auto data to run the same model as before, but this time including fixed effects for the year in which the car was manufactured. We will use the felm() function from the lfe package. The syntax for running a FE linear model using felm is as follows: felm(y ~ x1 + x2 + x3 | fe1 + fe2 | 0 | 0, data). Note, if you want to cluster the standard errors at a certain level, you would specify which variable to cluster by where the last 0 appears in the formula above.

felm1 = felm(mpg ~ horsepower + acceleration + displacement | year| 0 | 0 , data = dat)

# As before we can use summary
summary(felm1)
## 
## Call:
##    felm(formula = mpg ~ horsepower + acceleration + displacement |      year | 0 | 0, data = dat) 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.6913  -2.1314  -0.1809   1.7790  14.6637 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## horsepower   -0.071689   0.013176  -5.441 9.58e-08 ***
## acceleration -0.379759   0.092188  -4.119 4.67e-05 ***
## displacement -0.032251   0.004101  -7.865 3.95e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.513 on 376 degrees of freedom
## Multiple R-squared(full model): 0.8052   Adjusted R-squared: 0.7974 
## Multiple R-squared(proj model): 0.6583   Adjusted R-squared: 0.6447 
## F-statistic(full model):103.6 on 15 and 376 DF, p-value: < 2.2e-16 
## F-statistic(proj model): 241.5 on 3 and 376 DF, p-value: < 2.2e-16
# We can also save outputs
fe.coef = felm1$coefficients
fe.se = felm1$se
fe.pval = felm1$pval

fe.res = data.frame("coef" = fe.coef,
                    "std.error" = fe.se,
                    "p.val" = fe.pval)
fe.res
##                      mpg   std.error        p.val
## horsepower   -0.07168862 0.013176116 9.579469e-08
## acceleration -0.37975913 0.092187851 4.674519e-05
## displacement -0.03225108 0.004100517 3.949518e-14
# We can save fixed-effect related outputs as well:
fe.freq = as.data.frame(table(felm1$fe)) # how many observations in each FE category?
fe.n = nrow(fe.freq) # how many years are in the dataset

# Lets compare FE and non-FE regression in a nice table
stargazer(lm1, felm1,             # list the model objects in the order you want them to appear in the table
          dep.var.labels = "MPG", # name of dependent variable(s)
          covariate.labels = c("Horsepower", "Acceleration", "Displacement"), # labels for the x-vars (mind the order!)
          title = "Effect of Horsepower on MPG", # title for the table
          out = "./comparison.tex",  # saves the table as a .tex file in your wd - makes it easy for copying into overleaf
          add.lines = list(c("Year FEs", "No", "Yes"), # add lines at the bottom specifying which models have fixed effects
                           c("Number of FEs", "NA", fe.n)), # how many fixed effects?
          no.space = T,  # removes the space between rows - helps make smaller for inserting into documents
          omit.stat = c("ser","f"), # control which summary stats to report in the table
          type = "text") # lets you view the table clearly in the console. If printing in RMarkdown, use type = "latex"
## 
## Effect of Horsepower on MPG
## ==========================================
##                   Dependent variable:     
##               ----------------------------
##                           MPG             
##                    OLS           felm     
##                    (1)            (2)     
## ------------------------------------------
## Horsepower      -0.089***      -0.072***  
##                  (0.016)        (0.013)   
## Acceleration    -0.412***      -0.380***  
##                  (0.116)        (0.092)   
## Displacement    -0.037***      -0.032***  
##                  (0.005)        (0.004)   
## Constant        46.255***                 
##                  (2.579)                  
## ------------------------------------------
## Year FEs            No            Yes     
## Number of FEs       NA            13      
## Observations       392            392     
## R2                0.675          0.805    
## Adjusted R2       0.672          0.797    
## ==========================================
## Note:          *p<0.1; **p<0.05; ***p<0.01

6 Setting up an R Project

    1. Create a Project: File > New Project > New Directory
    • Type in a directory name, like “R_labs”
    • Click browse and select the folder on your computer that you want your project to appear in. For my PoliSci 630 labs, I choose the folder “PS630 TA”
    • Select “Create Project” (no need to check any boxes)
    1. Go to your Finder (mac) or File Explorer (windows). Identify the folder you just created (mine is “R_labs”). Click on that folder and create a new folder “data”. Save datasets you will use in the project in that folder.
    1. While in the project (the title of the project should appear in the top right corner of your window alongside a blue cube), open a new file. This file will automatically be linked to the project.
    1. Now, when you want to load data sets into your script, you can simply reference the data folder (no need to specify a working directory!): dat <- read_csv("data/dat.csv")

This tutorial provides a great walkthrough of this process.

7 Using R Markdown

R Markdown files (.rmd) allow you to combine text with code and code output.

7.1 Formatting Text

Formatting text in R Markdown is different than in a word processing software - you will have to manually format text using certain characters. This cheatsheet will be very helpful as you get more comfortable with Markdown. Find some formatting basics below:

  • For a bullet point, insert * followed by a space before your text.
    • A sub-bullet point can be made using + followed by a space.
  • Surrounding a word with asterisks (*) allows you to italicize a word.
  • Surrounding a word with two asterisks (**) allows you to bold a word.
  • Surrounding a word with will make it appear like code
  • You can include equations in your text by surrounding your equation with dollar signs: \(Area = \pi r^{2}\)
    • You can also make math/equations appear centered on their own line by surrounding the math with two dollar signs:
    \[Area = \pi r^{2}\]
    • Note that writing math in R Markdown has its own notation that you will just have to get used to! See here for a reference sheet on basic math notation.

7.2 Knitting your file

Knitting is the R word for “convert this to a pdf” (or some other file type, for example .html). When you submit your problem set, you will turn in both the R Markdown (.rmd) file and the knitted PDF file.

As we deal with more complicated functions and output in R, you will want to customize some of the settings for your code chunks to control what is output and what is included in the knitted PDF (note, “chunks” are the grey boxes in R Markdown where you enter your code). Chunk settings can either be set for the entire document, or for individual chunks.

  • To change settings for all chunks: insert this line in a chunk at the top of your document, knitr::opts_chunk$set( ) - the specific settings you want to use will go inside the parentheses.

  • To change settings for an individual chunk: insert your settings inside the {r, } at the top of the chunk, where the settings will go after the comma.

Below are some important settings you might want to use. Note that F and T stand for FALSE and TRUE. Either can be used.:

  • warning = F : warnings won’t appear in your knitted output
  • message = F : messages won’t appear in your knitted output
  • echo = F: the output from a chunk (for example, a plot) will appear in the knitted pdf, but the code will not.
  • include = F: neither the code nor the output from a chunk will appear in the knitted pdf, but the code will be evaluated (good for chunks loading packages, viewing data, etc.).
  • eval = F: neither the code nor the output from a chunk will appear in the knitted pdf, and the code will not be evaluated.
  • results = "hold": in the knitted pdf, present all the output from the chunk in a single block under the code (as opposed to under each line of code, which is the default).
    • results = "hide": don’t print the output from the chunk (but show the code)
    • results = "asis": use this code when you are printing stargazer tables (more on this later…)
  • fig.width = 5, fig.height = 3: plots will appear in the knitted pdf as 5 inches wide and 3 inches tall.

Knitting can get complicated. Refer to this resource for help, and don’t be afraid to google to solve your problem. You can also try commenting out (inserting a # in front of a line of code) problematic code to see if you can get the pdf to knit without that line.