You might want to begin your session with a completely clear workspace, meaning there are no packages loaded and no datasets or objects in your environment. (Some people advise against this for reasons I am not savvy enough to understand, but so far I’ve had no problems)
# Clear workspace
rm(list = ls())
At the very top of your script, you want to load any packages you will use. If its the first time using the package, you likely will have to install it first, using install.package()
.
# code for installing packages - note that you can load multiple packages at once. Also note you need quotation marks
install.packages(c("tidyverse", "stargazer", "lfe", "haven", "ISLR", "gridExtra"))
# how to use already installed packages- note you can only load one at a time and there are no quotation marks around the package name
library(tidyverse) # for easier code syntax and ggplots
library(stargazer) # for exporting nice tables
library(lfe) # for running FE models
library(haven) # for loading dta files from stata
library(ISLR) # for data
library(gridExtra) # for arranging multiple ggplots
One of the first things you want to do when you open an R script is set your working directory - this tells R where to pull files from and where to save files to. Note that “working directory” is R’s term for a file path.
For windows users, your working directory will probably start with “C:/Users/” and will go on to connect to the exact folder you want to pull data from or save data to. For Mac users, this will probably start with “User/” (no “C:/”) and then connect to the folder. If you forget how the working directory input should look (ie, how does it start, are you supposed to use backward slashes or forward slashes), you can always run getwd()
first.
# what is my current working directory?
getwd()
## [1] "C:/Users/User/Dropbox/Duke/7. Bass Connections"
# set to your desired working directory
setwd("C:/Users/User/Dropbox/Duke/7. Bass Connections/data") # I'm a windows user!
There are several different data formats you can read into R. Below are some of the common commands you might use:
read.csv()
- for csv files or specific sheets of excel workbooks
read.delim()
- for tab-delimited files (like dat or txt files) or “|” delimited files (sometimes txt files). First load package readr
.
read_dta()
- for loading stata .dta files. First load package haven
.
Some datasets can be accessed through packages or APIs. When a package contains a pre-loaded dataset, for example, the Auto
dataset in the ISLR
package, you will not need a function to load this dataset. Rather you will simply call the dataset.
csv = read.csv("./data_file.csv") # Reads CSV file named "data_file.csv" in my "7. Bass Connections/data" folder (the one I set my working directory to), names this dataset "csv" in my global environment
txt = read.delim("./data_file.txt", sep = "|") # Reads txt file where delimiter is |
dat = read.delim("./data_file.dat", sep = "\t") # Reads tab-delimited dat file ("\t" is for tab)
dta = read_dta("./data_file.dta") # Reads .dta files output from stata
df = Auto # Reads the Auto dataset from the ISLR package, names it "df"
install.packages("htmltools")
Once you have loaded a data set, you can view the dataset in a separate window using the command View()
, where the name of the dataset (df, dat, dta above) appears in the parantheses. Note, you can also click on the dataset inside the Global Environment window on the right of the R studio window.
R classifies every vector (column or variable) into one of five types. Often times, it will classify the data type of the vector/variable automatically. But sometimes, it will classify the data type incorrectly, or in a way that limits your ability to manipulate, combine, or perform functions on that data, and you’ll have to manually change the variable type.
The five basic data types are:
character - a data point is (usually) automatically classified as type “character” if it contains at least one non-numeric character. "2000 Science Drive"
, "10-15-2021"
, or "a"
would all be classified as characters. Note that character-based data points are often referred to as “strings.” All three of the examples in this bullet are considered strings.
numeric - a numerical data point, such as 2
or 15.398
. When all the data points in a vector (column or variable) are numbers, the data type will instead be “integer.” However, if at least one data point in a column or variable contains a letter or a character like ?
, the entire column will be classified as type character. Note that R also refers to type numeric as “double.”
integer - when all the data points in a vector, column, or variable are whole numbers, R will still read the object as type “integer.” You can force R to store a whole number as an integer by typing an “L” after the number, as in 4L
or 1000L
, where the L means “this is an integer.”
logical - data points in logical vectors or variables can take on only two values: TRUE
or FALSE
. Note, the TRUE or FALSE must appear in capital letters and without quotation marks to be read as type logical (a vector containing 3 data points “true”, “false”, and “true” would be classified as type character!). You can also use T
and F
instead of typing out the full words.
complex - this is for complex numbers, such as 1+8i
. I don’t ever deal with this, and I hope for your sake you don’t have to either, so let’s skip it!
factor - factor is technically a type of data structure in R (see below) rather than a data type, however, I think it is easiest to think of it as a type similar to character or numeric. Factor is used when the data is qualitative in nature. For example, if your variable was Eye.Color, your data points might be green
, blue
, brown
, blue
. Even though these data points are strings (i.e. words rather than numbers), the information they are conveying is actually categorical in nature. Designating this variable as type “factor” will tell R that blue
, green
, and brown
are different categories of response, not simply strings. This difference might seem negligible at first, but consider a situation in which you ran a survey asking about people’s eye color, and the response options are encoded numerically, such that 1 = brown
, 2 = blue
, 3 = green
and 4 = other
. The variable eye.color
would then be read by R as a numeric variable, but it is in fact categorical. Designating the variable as a factor will ensure R does not treat this variable numerically in a way that would be inappropriate - say, by taking the average of this variable, or inserting it into a regular regression. Factors are quite important for econometric tools like Fixed Effects Regressions, where the fixed effects are factor variables.
typeof()
- returns the data type of the object inside the parenthesesYou can change the data type of objects in order to better suit what you want to do with the object. It might not be clear why you’d want to do that yet, but consider the example above about factors and Fixed Effects regressions as just a single situation in which changing data type is necessary. You will encounter more of these situations as you become more advance.
as.character()
- change the data type of the object to characteras.numeric()
- change the data type of the object to numericas.logical()
- change the data type of the object to logicalas.factor()
- change the data type of the object to factorCan you change a vector containing characters and numbers into type numeric? Let’s see what happens!
# create a vector with 3 data points - 2 numbers, 1 word, 1 combination
vect1 = c(1, "apple", 40.6, "March 8") # vect1 is the name of the vector.
vect1 # typing the name of the vector will print its contents
## [1] "1" "apple" "40.6" "March 8"
# check the type
typeof(vect1) #R classified this vector as "character" because it contains letters, even though the majority of data points are numbers!
## [1] "character"
# change the type to numeric
vect2 = as.numeric(vect1) # name the new numeric version vect2
# check the type
typeof(vect2) # the vector type is now numeric (or double)
## [1] "double"
# look at the numeric vector
vect2 # R replaced the non-numeric data points with NA, even the data point that contained a number.
## [1] 1.0 NA 40.6 NA
As you can see in the example above, the data point"apple"
could not be coerced into a number, and so instead it was replaced with NA. The same thing happened to "March 8"
- note that this was not converted to simply 8
.
When can we coerce a vector to logical? R will easily convert a vector of 1
and 0
to logical, where 0 == FALSE
and 1 == TRUE
, and vice versa.
## coerce a numeric vector to logical
vect1 = c(1,1,0)
typeof(vect1) # automatically read as numeric ("double")
## [1] "double"
# convert to logical
vect2 = as.logical(vect1)
vect2 # reads 1 as True, 0 as false
## [1] TRUE TRUE FALSE
## coerce a logical vector to numeric
vect3 = c(F,F,T) # T and F can be used instead of TRUE and FALSE
typeof(vect3) # reads as logical
## [1] "logical"
vect4 = as.numeric(vect3)
vect4 # F = 0 and T = 1
## [1] 0 0 1
## coerce a character vector to logical
vect5 = c("T", "T", "F")
typeof(vect5) # note, because the Ts and Fs are in quotations, this is character!
## [1] "character"
vect6 = as.logical(vect5)
vect6 # this was enough to let R know how to convert!
## [1] TRUE TRUE FALSE
rm(vect1, vect2, vect3, vect4, vect5, vect6) # delete these vectors from our environment
Lastly, lets look at what happens when we convert a numeric vector to a factor.
# create a vector of six years
years = c(1991, 1992, 1993, 1994, 1995, 1996)
years # look at the vector
## [1] 1991 1992 1993 1994 1995 1996
# calculate the mean, min and max
mean(years)
## [1] 1993.5
min(years)
## [1] 1991
max(years)
## [1] 1996
# convert year vector to a factor
years.factor = as.factor(years)
years.factor # note it looks the same (but now the years are "levels")
## [1] 1991 1992 1993 1994 1995 1996
## Levels: 1991 1992 1993 1994 1995 1996
# can we take the mean, min, or max of a factor?
mean(years.factor) # no...we can't!
## [1] NA
#min(years.factor) # this won't compute - years.factor isn't a numeric object with a minimum!
R has several different object types, or “data structures.” This can be confusing at first, but it is important to bear them in mind early on. We’ll look at the various object types in more detail below.
## [1] 1 2 3
## [,1] [,2] [,3] [,4]
## [1,] 1 4 1 4
## [2,] 2 5 2 5
## [3,] 3 6 3 6
## number letter name male
## 1 1 a Bob TRUE
## 2 2 b Joe TRUE
## 3 3 c Lynn FALSE
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [,1] [,2] [,3] [,4]
## [1,] 1 4 1 4
## [2,] 2 5 2 5
## [3,] 3 6 3 6
##
## [[3]]
## [1] "hello"
##
## [[4]]
## number letter name male
## 1 1 a Bob TRUE
## 2 2 b Joe TRUE
## 3 3 c Lynn FALSE
To create an object in R, you use a standard syntax: the name of the object (whatever you want to call it), followed by the operator =
or <-
, followed by whatever you want the object to be. Note that =
is used to assign values to an object, where as ==
is used to mean “are these two things equal?”. =
and <-
are identical in meaning and can be used identically in R.
Once an object is named, you only need to type in the name to call that object up or use it. Below is a basic example of creating a vector and giving it a name.
# create a vector named A that contains the elements 1, 2, and 3
A = c(1,2,3)
# when you type A and hit ctrl + enter, the contents of A will print in the console
A
## [1] 1 2 3
# perform a function on A
2*A # multiplies every element of A by 2
## [1] 2 4 6
# perform a function on A and save the result as a new object
AA = 2*A # create a new object named AA
AA # print the contents of the new object in the console
## [1] 2 4 6
Next, we walk through how to create different types of objects (vectors, matrices, data frames) from scratch.
There are several ways to create a vector:
c(1,2,3)
or c("Linda", "Mark", "Johnny")
: entering numbers or strings into parentheses preceded by c()
tells R to combine these things into a vector. Note that characters or strings always need to appear in quotation marks.## [1] 1 2 3
## [1] "Linda" "Mark" "Johnny"
seq(5, 10, 1)
or seq(300, 500, 50)
: the former creates a vector out of the sequence from 5 to 10 by 1, the latter creates a vector out of the sequence from 300 to 500, by 50.## [1] 5 6 7 8 9 10
## [1] 300 350 400 450 500
5:10
: creates a vector out of the sequence from 5 to 10 by 1## [1] 5 6 7 8 9 10
vector()
or logical(0)
: creates an empty logical
vector (default) of length 0
vector("character",length = 4)
or character(4)
: creates an empty character
vector of length 4
## [1] "" "" "" ""
vector("numeric", length = 6)
or numeric(4)
: creates an empty numeric vector of length 6 (all 0s)## [1] 0 0 0 0 0 0
Below are simple functions for creating matrices from scratch. Note that you can also bind vectors together to get matrices, or convert data frames to matrices. We will deal with conversion below, and binding in section 2.
matrix(nrow = 2, ncol = 3)
: create a 2x3 matrix filled with NAs## [,1] [,2] [,3]
## [1,] NA NA NA
## [2,] NA NA NA
matrix(c(1,2,3,4), nrow = 2, ncol = 2)
: create a 2x2 matrix with elements 1, 2, 3, 4 that fill in by column## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
matrix(c(1,2,3,4), nrow = 2, ncol = 2, byrow = T)
: create a 2x2 matrix with elements 1, 2, 3, 4 that fill in by row## [,1] [,2]
## [1,] 1 2
## [2,] 3 4
This is the most basic function for creating a data frame from scratch. Note that usually, you will bind together vectors to make data frames, or convert existing matrices to data frames. We will deal with conversion below, and binding in section 2.
data.frame(var1 = c(T,T,F), var2 = c("a", "b", "c"), var3 = 1:3)
: create a data frame with 3 variables and 3 observations## var1 var2 var3
## 1 TRUE a 1
## 2 TRUE b 2
## 3 FALSE c 3
The functions for converting an object to another type are straightforward:
as.vector()
- converts an element, matrix or data frame to a vectoras.matrix()
- converts a vector or data frame to a matrixas.data.frame()
- converts a matrix or vector to a data frameHowever, there can be some idiosyncracies. Take a look at the examples below.
Vectors to Matrix
If you have a vector with 3 elements, using as.matrix()
will convert the vector to a 3 x 1 matrix, taking the vector as a single column. If you want the vector to serve as a single row of a matrix, you can transpose the vector first using the transpose function t()
, such that the matrix transformation function looks like as.matrix(t())
.
v = c(1,2,3) # start with a vector
as.matrix(v) # converts the vector to a 3x1 matrix
## [,1]
## [1,] 1
## [2,] 2
## [3,] 3
as.matrix(t(v)) # converts the vector to a 1x3 matrix
## [,1] [,2] [,3]
## [1,] 1 2 3
Data frame to Matrix
While the columns in a data frame can be different data types (i.e. logical, numeric, character), columns in a matrix must be all the same variable type. Therefore, when you convert a data frame with categorical, character, and numeric variables to a matrix, all the columns in the matrix will be coerced to a single type (likely “character”). However, variable names remain preserved in the matrix.
C # start with data frame used earlier
## number letter name male
## 1 1 a Bob TRUE
## 2 2 b Joe TRUE
## 3 3 c Lynn FALSE
as.matrix(C) # all elements appear in quotation marks, even numbers
## number letter name male
## [1,] "1" "a" "Bob" "TRUE"
## [2,] "2" "b" "Joe" "TRUE"
## [3,] "3" "c" "Lynn" "FALSE"
typeof(as.matrix(C)) # the type has been converted to character
## [1] "character"
Matrix to data frame
Unlike matrices or vectors, data frames must have column names (row names are optional). When you convert a matrix to a data frame, R will automatically assign column names to the data frame - these will be standard like “V1”, “V2”, “V3” and so on. When you convert a vector to a dataframe, R will automatically assign the name of the vector as the column name (see below). You can rename the columns using the function colnames()
.
# matrix to data frame
B # start with matrix used above
## [,1] [,2] [,3] [,4]
## [1,] 1 4 1 4
## [2,] 2 5 2 5
## [3,] 3 6 3 6
as.data.frame(B) # no column names transfer
## V1 V2 V3 V4
## 1 1 4 1 4
## 2 2 5 2 5
## 3 3 6 3 6
# vector to data frame
A # start with vector used above
## [1] 1 2 3
as.data.frame(A) # the name of the column is A, the name of the vector.
## A
## 1 1
## 2 2
## 3 3
Sometimes you will want to reference specific elements in a vector, matrix, data frame or list. The notation is slightly different depending on the object type.
vect
vect[2]
- returns the 2nd element in the vectorwhich(vect == "lemon")
- returns the index number of where the word “lemon” appears in the vector. This command is like the reverse of the above command. For example, if vect[3] == "lemon"
then which(vect == "lemon") == 3
min(vect)
- returns the smallest element in a numeric vector, or the first element alphabetically in a character vectormax(vect)
- returns the largest element in a numeric, or the last element alphabetically in a character vectormat
mat[1,]
- returns the 1st row of the matrixmat[,1]
- returns the 1st column of the matrixmat[2,4]
- returns the element in the 2nd row, 4th column of the matrixmat[3]
- returns the third element in the matrix when starting at the top left corner and counting down by columnmin(mat)
- returns the smallest element in the entire matrix (or the first element alphabetically if the type is character)min(mat[,4])
- returns the smallest element in the 4th column of the matrixmat[,"var1"]
- IF the matrix has variable names (ie, was converted from a data frame), you can call specific columns in the matrix by name##
## ========================
## name age gender retired
## ------------------------
## Lou 44 M FALSE
## Mark 65 M TRUE
## Nancy 23 F FALSE
## ------------------------
df
df$age
or df[,2]
- returns the second variable “age” in the data frame as a vectordf[2]
or df["age"]
- returns the second variable in the data frame as a data frame (with 3 rows, 1 column)df[2,]
- returns the second row in the data frame as a dataframe (with 1 row, 4 columns)min(df$age)
- returns the minimum of the variable “age”dplyr
functionsdplyr
is a package commonly used in R to make data cleaning and shaping “tidier”. Let’s look at some basic dplyr
functions below.
filter()
- select the rows you want to retain from the data frame. This can be used with variable names and “conditions”.
filter(Var5 > 60)
would create a new data frame containing only those observations with a value for Var5 greater than 60.|
and the ‘and’ operator &
: filter(Var5 > 0 & Var5 < 100)
(note you have to type out the variable name twice)filter(is.na(Var5))
. If you wanted to exclude the observations that are missing data on Var5, you would use: filter(!is.na(Var5))
, where the !
indicates “is not”.select()
- select the columns you want to retain from the data frame.
select("Var1", "Var3")
select(4:8)
would retain the fourth through the eighth variable in the dataset.select(-"Var5")
would select every variable except Var5.mutate()
- use this function to create new variables that are mutations of existing variables.
gdp
, you would use the following function: mutate(gdp.ln = log(gdp))
.gdp
but round it to the nearest 1000, you would use: mutate(gdp = round(gdp, -3))
.mutate(Var4 = Var2 + Var3)
.mutate(Var4 = Var2 + Var3)
.arrange()
- use this function to order the rows in a data set in increasing order by the value in a column.
arrange(var1)
will arrange the rows of the dataset in ascending order by the value of var1arrange(desc(var1))
will arrange the rows of the dataset in descending order by the value of var1rename()
- use this function to assign new names to variables. The syntax is as follows: rename(newname = oldname)
group_by()
- groups observations by common values on the specified variable. In group_by(year)
, for example, all observations from year 2010 will be grouped together, all observations with year == 2011
are grouped together, etc.
summarize()
- create new variables or mutate existing variables to summarize data, either within groups or within the entire dataset, using some specified function. Returns a dataset where the number of rows equals 1 if no group argument is specified, or to the number of unique groups in the variable specified in the group_by()
argument. Some common summary functions:
mean(var1, na.rm = T)
: take the average (within group) of var1, ignore NAs in var1 in calculating this meansum(var1, na.rm = T)
: sum up the values of all the elements (within group, if specified) of var1, ignoring NAsmin(var1, na.rm = T)
: take the min of all the elements (within group, if specified) of var1, ignoring NAsmax(var1, na.rm = T)
: take the max of all the elements (within group, if specified) of var1, ignoring NAsquantile(var1, probs = .25, na.rm = T)
: calculate a specified quantile (here,the 25th percentile) within group, ignoring NAsn()
: count how many observations in this group (note, if no by_group()
argument is used, this will correspond to the number of rows in the dataframe)n_distinct(var1)
: count how many distinct values of var1 exist in this group (or in the entire data frame, if no by_group()
argument is used)It is important to note that although dplyr functions can be used as any other functions, it is often paired with an odd but intuitive syntax that makes performing multi-step data manipulation functions easier. In dplyr functions, instead of inserting the name of the object inside the parentheses after a function, for example min(df)
, you start with the name of the object and then use the linking operator %>%
to link it to functions, such that you can continue to refer to the same object several times.
filter(df, var1 > 0)
, you now can type df %>% filter(var1 > 0)
.%>%
piping, you would have to type: rename(filter(df, var1 > 0), new.name = var3)
. Instead, we can see clearly the order of operations by entering: df %>% filter(var1 > 0) %>% rename(new.name = var3)
group_by
and summarize()
We are given a data set of the following grades recorded in PoliSci 630 last semester:
gender | grade | hours per week |
---|---|---|
Male | 94 | 5 |
Male | 86 | 6 |
Male | 75 | 3 |
Male | 83 | 12 |
Male | 98 | 9 |
Female | 100 | 10 |
Female | 77 | 2 |
Female | 84 | 6 |
Female | 93 | 6 |
Female | 79 | 8 |
We want to know the average grade by gender, the average amount of time spent working by gender, and the total number of hours spent working by gender. For this we will use group_by()
and summarize()
# Create a data frame corresponding to the table above
grades = data.frame( # command for making a new dataframe
"gender" = c(rep("Male", 5), rep("Female", 5)), # make a variable called "gender" that consists of "Male" repeated 5 times, and "Female" repeated 5 times
"grade" = c(94,86,75,83,98,100,77,84,93,79), # make a variable called "grade" that consists of the datapoints above
"hours" = c(5,6,3,12,9,10,2,6,6,8) # make a variable called "hours" that consists of the datapoints above
)
grades
## gender grade hours
## 1 Male 94 5
## 2 Male 86 6
## 3 Male 75 3
## 4 Male 83 12
## 5 Male 98 9
## 6 Female 100 10
## 7 Female 77 2
## 8 Female 84 6
## 9 Female 93 6
## 10 Female 79 8
# Create a new data frame with average grade by gender
grades_avg = grades %>%
group_by(gender) %>% # group the data by gender
summarize(grade.avg = mean(grade), # create a new variable equal to the average grade for each gender
hours.avg = mean(hours), # create a new variable equal to the average hours spent by each gender
hours.tot = sum(hours)) # create a new variable equal to the sum of hours spend by each gender
grades_avg # we now have a "summary" data set, with 2 rows and 4 columns (gender + the summary variables)
## # A tibble: 2 x 4
## gender grade.avg hours.avg hours.tot
## <chr> <dbl> <dbl> <dbl>
## 1 Female 86.6 6.4 32
## 2 Male 87.2 7 35
We’ll use the “Auto” dataset from the ISLR package for this section.
dat = Auto # loads the Auto dataset from ISLR package, names it "dat"
There are several ways to quickly pull numerical summaries of your data in R for your own reference.
head(data)
: will report the first 6-10 lines of the dataset. This is helpful if you don’t have a ton of columns in the data.
summary(data)
: reports the min, mean, median, max, and number of NAs for each variable in the dataset. Very helpful for getting a sense of what the dataset contains.
length(vect)
: counts how many elements are in a vector. Only for vectors. Use nrow()
instead for data frames.
nrow(data)
: count the number of rows in a data frame. Note, for a vector, use length()
instead.
ncol(data)
: count number of columns in a data frame.
table(data$var1, useNA = "ifany")
: how many observations take on each value of a variable? Helpful for factor or integer variables. Be sure to specify useNA = "ifany"
if you also want to know how many observations have NA for that variable.
table(data$var1, data$var2, useNA = "ifany")
: look at the cross-wise frequencies of two variables. Again, don’t forget to specify the useNA if you want NAs to appear.
quantile(data$var1, na.rm = T)
: provides the 0, 25, 50, 75, and 100 percentile of the variable. You can also specify the percentiles you want to view.
range(data$var1, na.rm = T)
: will provide the range of the variable (i.e. just the 0 and 100th percentile)names(data)
: provides the names of every variable in the dataset. Good for trying to remember what the name or orders of your variables look like.
You might also want to summarize data in a way that looks nice for outputting into reports. One way to do this is using stargazer(summary.frame, summary = F)
, where summary.frame is a data frame you create with summaries of the data.
# Creating function for extracting the relevant summary statistics
descriptives <- function(x) {
cbind.data.frame(
'Observations' = sum(is.na(x)==F),
'Median' = round(median(as.numeric(x), na.rm=T), 3),
'Mean' = round(mean(as.numeric(x), na.rm = T), 3),
'Min' = min(x, na.rm=T),
'Max' = max(x, na.rm=T))
}
# Extracting descriptive statistics for Table 1 A
summary.frame <- do.call(rbind, apply(dat, 2, function(x) descriptives(x))) %>% rownames_to_column(var = "Variable")
# Reporting Summary
stargazer(summary.frame,
type = "text",
summary = F,
rownames = F,
digits = 2)
##
## ====================================================================================
## Variable Observations Median Mean Min Max
## ------------------------------------------------------------------------------------
## mpg 392 22.75 23.45 9.0 46.6
## cylinders 392 4 5.47 3 8
## displacement 392 151 194.41 68.0 455.0
## horsepower 392 93.50 104.47 46 230
## weight 392 2,803.50 2,977.58 1613 5140
## acceleration 392 15.50 15.54 8.0 24.8
## year 392 76 75.98 70 82
## origin 392 1 1.58 1 3
## name 392 amc ambassador brougham vw rabbit custom
## ------------------------------------------------------------------------------------
First lets do a number of histograms using the Auto dataset, which is built into the ISLR package. We will use different methods:
hist()
: built in feature for making histograms - great for simple plots, but you can’t do fancy things, like compare two groups side by side
geom_histogram()
: ggplot feature for making histograms. NOTE: you link lines of code with +
. This means you can hit ctrl
+Enter
on the last line of the plot, and the entire plot will be created (this is different from the hist()
function!)
# Most simple way: hist()
hist(dat$weight, # what you're making a histogram of: a single variable or vector
main = "Histogram of Weight", # title
xlab = "Weight") # x-axis label
# More complicated histogram using hist()
# NOTE: you have to run all this code at the same time in order for all the elements to appear on the plot!
hist(dat$weight, # variable or vector
main = "Histogram of Weight", # title
xlab = "Weight", # x-axis label
breaks = 50, # I want approx. 50 individual bars
xaxt = 'n') # says I want to draw my own axis
# adjust x-axis label
axis(side = 1, # adjust labeling of x-axis (y-axis would be side = 2)
at = seq(1500, 5000, by = 100), # put tick marks at every 100 between 1500 and 5000
labels = seq(1500, 5000, by = 100), # put labels at the tick marks
cex.axis = .8) # change the font size - says make the d
# add additional lines for reference
abline(v = quantile(dat$weight)[2], # put a vertical line at the 25th quantile
col = "blue", # color it blue
lty = 2) # make it dashed
abline(v = quantile(dat$weight)[4], # put a vertical line at the 75th quantile
col = "red", # color it red
lty = 3) # make it dotted
abline(v = mean(dat$weight), # put a vertical line at the mean
col = "black", # color it black
lwd = 2) # make it thick
# add a legend to describe lines
legend("topright", # put it in the top right of the plot
legend = c("25th Percentile", # these are the labels for your legend, you can list them in any order
"Mean",
"75th Percentile"),
lty = c(2,1,3), # the type of line for each legend item (list in order of labels)
lwd = c(1,2,1), # the width of line for each legend item (list in order of labels)
col = c("blue", "black", "red"), # the color of line for each legend item (list in order of labels)
cex = .75) # font size - says make the font .75x the regular size
# Basic histogram with ggplot()
ggplot(dat, # what data frame you're drawing from in all elements of the ggplot
aes(x = weight))+ # variables you're plotting in all elements of the ggplot (in this case, only an x-var)
geom_histogram() + # says "take this data and make it a histogram"
xlab("Weight") + # label the x-axis
ylab("Frequency")+ # label the y-axis
ggtitle("Histogram of Weight") # give the plot a title
# More complicated histogram with ggplot()
ggplot(dat, aes(x = weight)) +
geom_histogram(binwidth = 300, # now we specify the bin width - each bin will capture 300 lbs
fill = "black", # makes the inside of the bars black
color = "grey") + # outlines them in grey
# Label X axis
scale_x_continuous(name = "Weight", # names the x-axis
breaks = seq(1500,5100,by = 300))+ # specifies where to put ticks and labels - every 300
ylab("Frequency") +
ggtitle("Histogram of Weight") +
# Add a line for 25, 50, 75 percentiles
geom_vline(aes(xintercept = quantile(weight)[2], # note we don't have to say dat$ before weight bc dat is already called in the ggplot() line
color = "25th Percentile")) + # when we specify the color command INSIDE the aes(), it assigns the object a color
geom_vline(aes(xintercept = quantile(weight)[4], # and allows us to give it a name for the legend - it will draw a legend automatically!
color = "75th Percentile"))+
geom_vline(aes(xintercept = mean(weight),
color = "Mean"))+
scale_color_discrete("Quantiles")+ # Here we label the legend
theme_bw() # a different look for ggplots - white background, clean legends. you can play around with different themes!
Next we’ll use geom_bar()
with ggplot to compare different groups in the data. Note, bar plots are really only useful when at least one of your variables is discrete!
First, lets compare histograms of foreign and domestic cars by number of cylinders. We use two different types of comparisons, “stack” and “dodge.” Which do you prefer?
# First, we create a dummy variable for foreign origin
dat = dat %>% mutate(foreign = as.factor(ifelse(origin == 2 | origin == 3, 1, 0)))
# Next, we compare weight by origin. This time, we use geom_bar (rather than geom_histogram())
stack = ggplot(dat, # lets save this plot as an object called "stack"
aes(x=as.factor(cylinders)))+ # This time, lets do it by cylinders (converted to a factor variable)
geom_bar(stat = "count", # specifies that we want a histogram (y-axis is count)
aes(fill = foreign), # use different colors based on the value of variable 'foreign'
position = "stack") + # stack the different colored bars on top of eachother
labs(title = "Histogram of Cylinders by Origin", # can enter all your labels in a single 'labs()' command
subtitle = "Stacked Histogram",
x = "Number of cylinders",
y = "Count") +
scale_fill_discrete("Origin", # name the legend thats providing the discrete fill colors
labels = c("Domestic", "Foreign")) # specify the names of the values in the legend
# Lets do the exact same comparison, but this time with the bars next to eachother
dodge = ggplot(dat, aes(x=as.factor(cylinders)))+ # lets save this plot as an object called "dodge"
geom_bar(stat = "count", aes(fill = foreign),
position = "dodge") + # this time, we want the bars next to each other!
labs(title = "Histogram of Cylinders by Origin",
subtitle = "Dodged Histogram", x = "Number of cylinders", y = "Count") +
scale_fill_discrete("Origin", labels = c("Domestic", "Foreign"))
# Lets compare the two side by side
grid.arrange(stack, dodge, nrow = 2)
What if we want to compare some value across several different groups? geom_bar()
is good for this also! Below, we calculate the mean mpg for cars made in each year of the dataset, and plot the mean for each year.
# Calculate means by year
yr_mean = dat %>% group_by(year) %>% summarize(mean.mpg = mean(mpg))
# Create bar graph
horiz = ggplot(yr_mean, aes(x = as.factor(year), y = mean.mpg)) + # x var is year - but its discrete, so we add as.factor()
geom_bar(stat = "identity") + # use stat = "identity" when plotting a bar graph with a y var
scale_x_discrete(labels = paste0("19", yr_mean$year))+ # I add "19" to the front of the year labels (70,71...) to improve readability
labs(x = "Year", y = "Mean MPG", # add labels to graph
title = "Mean MPG by Year") +
theme(axis.text.x = element_text(angle = 45)) # tilt x-axis lables 45 degrees to improve readability
# Do the exact same thing but make it vertical
vert = ggplot(yr_mean, aes(x = as.factor(year), y = mean.mpg)) +
geom_bar(stat = "identity") +
scale_x_discrete(labels = paste0("19", yr_mean$year)) +
labs(x = "Year", y = "Mean MPG", title = "Mean MPG by Year") +
coord_flip() # this command will flip the axes, you don't need to do anything else differently
grid.arrange(horiz, vert, nrow = 1)
Now, lets do the same thing, but plot foreign and domestic made cars separately, and only draw the vertical plot.
# Calculate means by year, by origin
yr_mean_for = dat %>% group_by(year, foreign) %>% summarize(mean.mpg = mean(mpg))
# Do the exact same thing but make it vertical
ggplot(yr_mean_for, aes(x = as.factor(year), y = mean.mpg)) +
geom_bar(stat = "identity", # bar graph where we specify the y-var (ie: not a histogram)
aes(fill = foreign), # draw two separate bars with different fill colors based on the value of foreign
position = "dodge") + # draw the bars for foreign and domestic next to each other
scale_x_discrete(labels = paste0("19", yr_mean$year)) +
labs(x = "Year", y = "Mean MPG", title = "Mean MPG by Year", subtitle = "By Origin") +
scale_fill_discrete("Origin", labels = c("Domestic", "Foreign")) + # label the legend
coord_flip()
Now, let’s plot how the average MPG of foreign and domestic cars varied over time. Here, we treat year as a continuous variable (compared to in the previous section). This will allow us to compare trends.
ggplot(yr_mean_for, aes(x = year, y = mean.mpg)) +
geom_line(aes(group = foreign, # plot a separate line for each value of foreign
color = foreign)) + # color the line based on the value of foreign
geom_point(aes(group = foreign, # plot a separate point at each x for each value of foreign
color = foreign)) + # color the points based on the value of foreign
scale_x_continuous(breaks = seq(min(yr_mean_for$year), max(yr_mean_for$year), 2), # make the x-axis ticks and labels appear every 2 years
labels = paste0("19", seq(min(yr_mean_for$year), max(yr_mean_for$year), 2))) + # add "19" to the beginning of the year labels
scale_color_discrete("Origin", labels = c("Domestic", "Foreign")) + # Name the legend and specify the values of foreign
labs(x = "Year", y = "Mean MPG", title = "Mean MPG Over Time", subtitle = "By origin")
What about something more fancy, like how the mean MPG changes by year based on the number of cylinders? Here we see we don’t have observations for every year for each number of cylinders, so the plot will only connect the years for which we have datapoints.
yr_mean_cyl = dat %>% group_by(year, cylinders) %>% summarize(mean.mpg = mean(mpg))
ggplot(yr_mean_cyl, aes(x = year, y = mean.mpg)) +
geom_line(aes(group = as.factor(cylinders), # plot a separate line for each number of cylinders
color = as.factor(cylinders))) + # color the line based on the number of cylinders
geom_point(aes(group = as.factor(cylinders), # plot a separate point for each number of cylinders
color = as.factor(cylinders))) + # color the points based on number of cylinders
scale_x_continuous(breaks = seq(min(yr_mean_for$year), max(yr_mean_for$year), 2), # make the x-axis ticks and labels appear every 2 years
labels = paste0("19", seq(min(yr_mean_for$year), max(yr_mean_for$year), 2))) + # add "19" to the beginning of the year labels
scale_color_discrete("Number of Cylinders") + # Name the legend. We don't need to specify values this time
labs(x = "Year", y = "Mean MPG", title = "Mean MPG Over Time", subtitle = "By origin")
A scatterplot, in which the points are colored based on a third variable, can help you visualize three dimensions at once. Here we look at the relationship between MPG and Weight by a factor variable, cylinders, and a continuous variable, horsepower.
ggplot(dat, aes(x = weight, y = mpg)) +
geom_point(aes(color = as.factor(cylinders))) +
scale_color_discrete("# of Cylinders") +
labs(title = "Scatterplot of MPG by Weight",
subtitle = "By number of cylinders")
ggplot(dat, aes(x = weight, y = mpg)) +
geom_point(aes(color = horsepower)) +
scale_color_gradient("Horsepower", low = "green", high = "red")+
labs(title = "Scatterplot of MPG by Weight",
subtitle = "By horsepower")
For ggplots, you can simply add ggsave("name.png", height = 5, width = 7)
at the end of the code for a plot. It will save the last plot as a 5x7 png in the folder specified in your working directory.
ggsave("by_horsepower.png", height = 5, width = 7) # saves the last plot
Let’s use the Auto data again to look at the effect of horsepower on mpg, controlling for acceleration and displacement. The syntax for running linear models in R is: lm(y ~ x1 + x2 + x3, data)
. We will then use a number of functions to look at the results.
summary(lm1)
: reports the coefficients, standard errors, and p-values along with some basic diagnostics. Helpful for viewing within R, not very pretty for reports.
stargazer(lm1)
: prints a pretty version of the coefficient table in the console. Stargazer is nice bc it offers lots of customizable features. You can save this table directly to your working directory as a .html file (out = "./tablename.html"
) which allows for easy screenshotting for reports/presentations or a text file (out = "./tablename.tex"
) for inserting the code directly into overleaf. You can also print the table directly in Rmd by specifying type = "latex"
.
lm1 = lm(mpg ~ horsepower + acceleration + displacement, data = dat)
# View built-in summary
summary(lm1)
##
## Call:
## lm(formula = mpg ~ horsepower + acceleration + displacement,
## data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.1330 -3.1616 -0.6326 2.3641 16.6478
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.254707 2.578694 17.937 < 2e-16 ***
## horsepower -0.088783 0.015835 -5.607 3.92e-08 ***
## acceleration -0.412230 0.116227 -3.547 0.000438 ***
## displacement -0.036660 0.005029 -7.290 1.75e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.468 on 388 degrees of freedom
## Multiple R-squared: 0.6749, Adjusted R-squared: 0.6724
## F-statistic: 268.5 on 3 and 388 DF, p-value: < 2.2e-16
# Save lm object outputs
coef = lm1$coefficients # can call coefficients straight from lm object
se = summary(lm1)$coefficients[,2] # to get standard errors, you have to pull them from the summary
pval = summary(lm1)$coefficients[,4] # to get pvalues, you have to pull them from the summary
# Bind manually and print data frame
lm1.res = data.frame("Coef" = coef, # 1st column is vector of coefficients
"Std.Error" = se, # 2nd column is vector of SEs
"p.value" = pval) # 3rd column is vector of pvals
lm1.res
## Coef Std.Error p.value
## (Intercept) 46.25470750 2.578693889 7.891585e-53
## horsepower -0.08878252 0.015834755 3.919182e-08
## acceleration -0.41222985 0.116227322 4.377030e-04
## displacement -0.03665995 0.005028849 1.753751e-12
# Print pretty table
stargazer(lm1, # Model goes here, can be one or many models
dep.var.labels = "MPG", # name of dependent variable(s)
covariate.labels = c("Horsepower", # labels for the x-vars (mind the order!)
"Acceleration",
"Displacement"),
title = "Effect of Horsepower on MPG", # title for the table
out = "./table1.html", # saves the table as a html in ur wd - makes it easy for opening and screenshotting.
type = "text") # lets you view the table clearly in the console. If printing in RMarkdown, use type = "latex"
##
## Effect of Horsepower on MPG
## ===============================================
## Dependent variable:
## ---------------------------
## MPG
## -----------------------------------------------
## Horsepower -0.089***
## (0.016)
##
## Acceleration -0.412***
## (0.116)
##
## Displacement -0.037***
## (0.005)
##
## Constant 46.255***
## (2.579)
##
## -----------------------------------------------
## Observations 392
## R2 0.675
## Adjusted R2 0.672
## Residual Std. Error 4.468 (df = 388)
## F Statistic 268.457*** (df = 3; 388)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Now let’s use the Auto data to run the same model as before, but this time including fixed effects for the year in which the car was manufactured. We will use the felm()
function from the lfe
package. The syntax for running a FE linear model using felm
is as follows: felm(y ~ x1 + x2 + x3 | fe1 + fe2 | 0 | 0, data)
. Note, if you want to cluster the standard errors at a certain level, you would specify which variable to cluster by where the last 0 appears in the formula above.
felm1 = felm(mpg ~ horsepower + acceleration + displacement | year| 0 | 0 , data = dat)
# As before we can use summary
summary(felm1)
##
## Call:
## felm(formula = mpg ~ horsepower + acceleration + displacement | year | 0 | 0, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.6913 -2.1314 -0.1809 1.7790 14.6637
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## horsepower -0.071689 0.013176 -5.441 9.58e-08 ***
## acceleration -0.379759 0.092188 -4.119 4.67e-05 ***
## displacement -0.032251 0.004101 -7.865 3.95e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.513 on 376 degrees of freedom
## Multiple R-squared(full model): 0.8052 Adjusted R-squared: 0.7974
## Multiple R-squared(proj model): 0.6583 Adjusted R-squared: 0.6447
## F-statistic(full model):103.6 on 15 and 376 DF, p-value: < 2.2e-16
## F-statistic(proj model): 241.5 on 3 and 376 DF, p-value: < 2.2e-16
# We can also save outputs
fe.coef = felm1$coefficients
fe.se = felm1$se
fe.pval = felm1$pval
fe.res = data.frame("coef" = fe.coef,
"std.error" = fe.se,
"p.val" = fe.pval)
fe.res
## mpg std.error p.val
## horsepower -0.07168862 0.013176116 9.579469e-08
## acceleration -0.37975913 0.092187851 4.674519e-05
## displacement -0.03225108 0.004100517 3.949518e-14
# We can save fixed-effect related outputs as well:
fe.freq = as.data.frame(table(felm1$fe)) # how many observations in each FE category?
fe.n = nrow(fe.freq) # how many years are in the dataset
# Lets compare FE and non-FE regression in a nice table
stargazer(lm1, felm1, # list the model objects in the order you want them to appear in the table
dep.var.labels = "MPG", # name of dependent variable(s)
covariate.labels = c("Horsepower", "Acceleration", "Displacement"), # labels for the x-vars (mind the order!)
title = "Effect of Horsepower on MPG", # title for the table
out = "./comparison.tex", # saves the table as a .tex file in your wd - makes it easy for copying into overleaf
add.lines = list(c("Year FEs", "No", "Yes"), # add lines at the bottom specifying which models have fixed effects
c("Number of FEs", "NA", fe.n)), # how many fixed effects?
no.space = T, # removes the space between rows - helps make smaller for inserting into documents
omit.stat = c("ser","f"), # control which summary stats to report in the table
type = "text") # lets you view the table clearly in the console. If printing in RMarkdown, use type = "latex"
##
## Effect of Horsepower on MPG
## ==========================================
## Dependent variable:
## ----------------------------
## MPG
## OLS felm
## (1) (2)
## ------------------------------------------
## Horsepower -0.089*** -0.072***
## (0.016) (0.013)
## Acceleration -0.412*** -0.380***
## (0.116) (0.092)
## Displacement -0.037*** -0.032***
## (0.005) (0.004)
## Constant 46.255***
## (2.579)
## ------------------------------------------
## Year FEs No Yes
## Number of FEs NA 13
## Observations 392 392
## R2 0.675 0.805
## Adjusted R2 0.672 0.797
## ==========================================
## Note: *p<0.1; **p<0.05; ***p<0.01
dat <- read_csv("data/dat.csv")
This tutorial provides a great walkthrough of this process.
R Markdown files (.rmd) allow you to combine text with code and code output.
Formatting text in R Markdown is different than in a word processing software - you will have to manually format text using certain characters. This cheatsheet will be very helpful as you get more comfortable with Markdown. Find some formatting basics below:
*
followed by a space before your text.
+
followed by a space.*
) allows you to italicize a word.**
) allows you to bold a word.
will make it appear like code
Knitting is the R word for “convert this to a pdf” (or some other file type, for example .html). When you submit your problem set, you will turn in both the R Markdown (.rmd) file and the knitted PDF file.
As we deal with more complicated functions and output in R, you will want to customize some of the settings for your code chunks to control what is output and what is included in the knitted PDF (note, “chunks” are the grey boxes in R Markdown where you enter your code). Chunk settings can either be set for the entire document, or for individual chunks.
To change settings for all chunks: insert this line in a chunk at the top of your document, knitr::opts_chunk$set( )
- the specific settings you want to use will go inside the parentheses.
To change settings for an individual chunk: insert your settings inside the {r, }
at the top of the chunk, where the settings will go after the comma.
Below are some important settings you might want to use. Note that F
and T
stand for FALSE
and TRUE
. Either can be used.:
warning = F
: warnings won’t appear in your knitted outputmessage = F
: messages won’t appear in your knitted outputecho = F
: the output from a chunk (for example, a plot) will appear in the knitted pdf, but the code will not.include = F
: neither the code nor the output from a chunk will appear in the knitted pdf, but the code will be evaluated (good for chunks loading packages, viewing data, etc.).eval = F
: neither the code nor the output from a chunk will appear in the knitted pdf, and the code will not be evaluated.results = "hold"
: in the knitted pdf, present all the output from the chunk in a single block under the code (as opposed to under each line of code, which is the default).
results = "hide"
: don’t print the output from the chunk (but show the code)results = "asis"
: use this code when you are printing stargazer tables (more on this later…)fig.width = 5, fig.height = 3
: plots will appear in the knitted pdf as 5 inches wide and 3 inches tall.Knitting can get complicated. Refer to this resource for help, and don’t be afraid to google to solve your problem. You can also try commenting out (inserting a #
in front of a line of code) problematic code to see if you can get the pdf to knit without that line.