The term “R” is used to refer to both the programming language and the software that interprets the scripts written using it.
RStudio is currently a very popular way to not only write your R scripts but also to interact with the R software. To function correctly, RStudio needs R and therefore both need to be installed on your computer.
Let’s start by learning about RStudio, which is an Integrated Development Environment (IDE) for working with R.
The RStudio IDE open-source product is free under the Affero General Public License (AGPL) v3. The RStudio IDE is also available with a commercial license and priority email support from RStudio, PBC.
We will use RStudio IDE to write code, navigate the files on our computer, inspect the variables we are going to create, and visualize the plots we will generate. RStudio can also be used for other things (e.g., version control, developing packages, writing Shiny apps) that we will not cover during the workshop.
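RStudio is divided into 4 “panes”. In the default layout these are the Source editor for your scripts and documents (top left), the R Console (bottom left), the Environment/History pane (top right), and the Files/Plots/Packages/Help pane (bottom right).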
The placement of these panes and their content can be customized (see menu, Tools -> Global Options -> Pane Layout). For ease of use, settings such as background color, font color, font size, and zoom level can also be adjusted in this menu (Global Options -> Appearance).
One of the advantages of using RStudio is that all the information you need to write code is available in a single window. Additionally, with many shortcuts, autocompletion, and highlighting for the major file types you use while developing in R, RStudio will make typing easier and less error-prone.
You’ll often need to install some R packages. An R package is a collection of functions, data, and documentation that extends the capabilities of base R. Using packages is key to the successful use of R. One important data package that you will need in this course is wooldridge. You can install this package with a single line of code:
install.packages("wooldridge")
On your own computer, type that line of code in the console, and then press enter to run it. R will download the package from CRAN and install it on your computer. If you have problems installing, make sure that you are connected to the internet, and that https://cloud.r-project.org/ isn’t blocked by your firewall or proxy.
You will not be able to use the functions, objects, and help files in a package until you load it. Once you have installed a package, you can load it with the library( ) function:
library(wooldridge)
Several other packages will be used often in this course. You can install them all with one call:
install.packages(c("ggplot2", "openxlsx", "lmtest", "sandwich", "AER", "stargazer", "car", "plm"))
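Reinstalling packages that are already on your computer is unnecessary. As a minimal sketch, the helper below reports which of the course packages are still missing, so that only those need to be passed to install.packages( ). The function name `missing_packages` is hypothetical (it is not part of R or of any package), and the install line is left commented out because it needs an internet connection.

```r
# Hypothetical helper: return the names in `pkgs` that are not installed yet.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

course_pkgs <- c("wooldridge", "ggplot2", "openxlsx", "lmtest",
                 "sandwich", "AER", "stargazer", "car", "plm")

to_install <- missing_packages(course_pkgs)
to_install  # character vector; empty if everything is already installed

# Install only what is missing (requires an internet connection):
# if (length(to_install) > 0) install.packages(to_install)
```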
As you start to apply the econometric tools to your own data you will soon find questions that we do not answer in the class. The following materials describe a few tips on how to get help, and to help you keep learning.
If you get stuck, start with Google. Typically adding “R” to a query is enough to restrict it to relevant results: if the search isn’t useful, it often means that there aren’t any R-specific results available. Google is particularly useful for error messages. If you get an error message and you have no idea what it means, try googling it! Chances are that someone else has been confused by it in the past, and there will be help somewhere on the web.
If Google doesn’t help, try Stack Overflow. Start by spending a little time searching for an existing answer, including [R] to restrict your search to questions and answers that use R. If you don’t find anything useful, prepare a minimal reproducible example, or reprex. A good reprex makes it easier for other people to help you, and often you’ll figure out the problem yourself in the course of making it.
Before we go any further, let’s make sure you’ve got a solid foundation in running R code, and that you know about some of the most helpful RStudio features.
You can use R as a calculator:
(1 / 200) * 30
## [1] 0.15
(59 + 73 + 2) / 3
## [1] 44.66667
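Beyond the four basic operations, a few more arithmetic operators and functions are worth knowing. The short sketch below is an illustration only; the numbers are arbitrary.

```r
7 %/% 2        # integer division: 3
7 %% 2         # remainder after division: 1
2^10           # exponentiation: 1024
sqrt(16)       # square root: 4
1 / 200 * 30   # / and * are evaluated left to right, so this equals (1 / 200) * 30
```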
You can create new objects with <- or =:
x <- 3
y <- x + 4
u = 1
v = u / 2
All R statements where you create objects, assignment statements, have the same form:
object_name <- value
When reading that code say “object name gets value” in your head.
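An assignment on its own prints nothing, which can be confusing at first. The sketch below shows two common ways to inspect the value you just stored: type the object name, or wrap the assignment in parentheses.

```r
x <- 3         # assignment alone prints nothing
x              # typing the name prints the value: 3
(y <- x + 4)   # parentheses assign and print in one step: 7
```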
You will make lots of assignments and <- is a pain to type. Instead, use RStudio’s keyboard shortcut: Alt + - (the minus sign) on a PC or Option + - on a Mac. Notice that RStudio automagically surrounds <- with spaces, which is a good code formatting practice. Code is miserable to read on a good day, so giveyoureyesabreak and use spaces. (See “Keyboard Shortcuts in the RStudio IDE”.)
What’s in a name? Object names must start with a letter, and can only contain letters, numbers, underscores (_), and periods (.). You want your object names to be descriptive, so you’ll need a convention for multiple words. We recommend snake_case, where you separate lowercase words with _.
this_is_a_really_long_name <- 2.5
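One rough way to check whether a candidate name follows these rules is to compare it with what base R's make.names( ) would turn it into: if the two match, the name is already valid. This is only a sketch, and the candidate names below are made up for illustration.

```r
# Made-up candidate names: one valid, one starting with a digit, one with a space
candidate_names <- c("weight_g", "2nd_weight", "mean weight")

# make.names() repairs invalid names, so a name is valid if it comes back unchanged
make.names(candidate_names)
candidate_names == make.names(candidate_names)   # TRUE FALSE FALSE
```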
A vector is the most common and basic data type in R, and is pretty much the workhorse of R. A vector is composed of a series of values, which can be either numbers or characters. We can assign a series of values to a vector using the c( ) function. For example, we can create a vector of animal weights and assign it to a new object weight_g:
weight_g <- c(50, 60, 65, 82)
weight_g
## [1] 50 60 65 82
A vector can also contain characters:
animals <- c("mouse", "rat", "dog")
animals
## [1] "mouse" "rat" "dog"
The quotes around “mouse”, “rat”, etc. are essential here. Without the quotes R will assume objects have been created called mouse, rat and dog. As these objects don’t exist in R’s memory, there will be an error message.
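To see the difference without stopping your session, the sketch below wraps the unquoted version in tryCatch( ), which catches the error and returns a short message instead. The name `no_such_animal` is made up and is assumed not to exist in your workspace.

```r
animals <- c("mouse", "rat", "dog")   # quoted: three character values

# Unquoted, R looks for an object with that name and fails if there is none.
# tryCatch() captures the error so the rest of the script keeps running.
outcome <- tryCatch(
  c(no_such_animal),
  error = function(e) "error: object not found"
)
outcome
```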
There are many functions that allow you to inspect the content of a vector. length( ) tells you how many elements are in a particular vector:
length(weight_g)
## [1] 4
length(animals)
## [1] 3
An important feature of a vector is that all of the elements are the same type of data. The function class() indicates what kind of object you are working with:
class(weight_g)
## [1] "numeric"
class(animals)
## [1] "character"
You can use the c() function to add other elements to your vector:
weight_g <- c(weight_g, 90) # add to the end of the vector
weight_g <- c(30, weight_g) # add to the beginning of the vector
weight_g
## [1] 30 50 60 65 82 90
Exercise-A:
- What happens if we try to mix different types in a single vector? Try to add the elements of “animals” to “weight_g”.
If we want to extract one or several values from a vector, we must provide one or several indices in square brackets. For instance:
animals <- c("mouse", "rat", "dog", "cat")
animals[2]
## [1] "rat"
animals[c(3, 2)]
## [1] "dog" "rat"
animals[2:4]
## [1] "rat" "dog" "cat"
Conditional subsetting
Another common way of subsetting is by using a logical vector. TRUE will select the element with the same index, while FALSE will not. For example, if we want to select only the values above 50:
weight_g <- c(21, 34, 39, 54, 55)
weight_g > 50 # will return logicals with TRUE for the indices that meet the condition
## [1] FALSE FALSE FALSE TRUE TRUE
weight_g[weight_g > 50]
## [1] 54 55
We can also combine multiple tests using & (both conditions are true, AND) or | (at least one of the conditions is true, OR):
weight_g[weight_g > 30 & weight_g < 50] # select values greater than 30 and smaller than 50
## [1] 34 39
weight_g[weight_g <= 30 | weight_g == 55] # select values "less than or equal to" 30 or values "equal to" 55
## [1] 21 55
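A few related tools are often useful alongside logical subsetting. The sketch below, which reuses the same weight_g values, shows which( ) for getting positions instead of values, %in% for matching against a set of values, and ! for negating a condition.

```r
weight_g <- c(21, 34, 39, 54, 55)

which(weight_g > 50)                # positions that meet the condition: 4 5
weight_g %in% c(21, 55)             # TRUE FALSE FALSE FALSE TRUE
weight_g[weight_g %in% c(21, 55)]   # values that match the set: 21 55
weight_g[!(weight_g > 50)]          # negate a condition with !: 21 34 39
```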
R has a large collection of built-in functions that are called like this:
function_name(arg1 = val1, arg2 = val2, ...)
Let’s try using seq( ), which makes regular sequences of numbers. If you want to know the function’s arguments and purpose, you can get more help in the Help tab in the lower right pane.
seq(1, 10)
## [1] 1 2 3 4 5 6 7 8 9 10
seq(1, 10, by=2)
## [1] 1 3 5 7 9
seq(1, 10, length.out=4)
## [1] 1 4 7 10
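Arguments can be matched by position or by name. As a small sketch, the calls below show that naming the arguments lets you write them in any order, and that the same pattern applies to other built-in functions such as round( ).

```r
seq(1, 10, by = 2)               # first two arguments matched by position
seq(from = 1, to = 10, by = 2)   # the same call with every argument named
seq(by = 2, to = 10, from = 1)   # named arguments can appear in any order

round(3.14159, digits = 2)       # another built-in function: 3.14

# ?seq opens the help page for seq() in the Help tab
```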
Exercise-B:
- Generate 1000 random numbers from a normal distribution with mean 0 and standard deviation 1. (Hint: Use rnorm( ).)
Up to now, your code has been in the console. This is useful for quick queries but not so helpful if you want to revisit your work for any reason. A new script can be opened by pressing Ctrl + Shift + N. It is wise to save your script file immediately. To do this press Ctrl + S. This will open a dialogue box where you can decide where to save your script file, and what to name it. The .R file extension is added automatically and ensures your file will open with RStudio.
Don’t forget to save your work periodically by pressing Ctrl + S.
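Once a script is saved, the whole file can be run at once with source( ). The sketch below is only an illustration: it writes a two-line script to a temporary folder (the file name `demo_script.R` is made up) and then runs it, after which the objects created by the script are available in your session.

```r
# Write a tiny script to a temporary location (hypothetical file name)
script_path <- file.path(tempdir(), "demo_script.R")
writeLines(
  c("x <- seq(0, 1, by = 0.2)",
    "total <- sum(x)"),
  con = script_path
)

source(script_path)   # run every line of the saved script
total                 # object created by the script
```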
Exercise-C:
- Write your R code for the previous exercise in the script window, and then save it as “random_normal.R”.
The comment character in R is #. Anything to the right of a # in a script will be ignored by R. It is useful to leave notes and explanations in your scripts. For convenience, RStudio provides a keyboard shortcut to comment or uncomment a paragraph: after selecting the lines you want to comment, press Ctrl + Shift + C. If you only want to comment out one line, you can put the cursor at any location of that line (i.e. no need to select the whole line), then press Ctrl + Shift + C.
x <- seq(0, 1, by=0.2) # generate a sequence from 0 to 1 with increment of 0.2
A data frame is a rectangular collection of variables (in the columns) and observations (in the rows). As an example, let’s take a look at the wage1 data frame found in the wooldridge package.
library(wooldridge) # First, load the package
To open the dataset in RStudio’s Data Viewer, use the View( ) function:
View(wage1)
We also can inspect the structure of a data frame using the function str( ):
str(wage1)
## 'data.frame': 526 obs. of 24 variables:
## $ wage : num 3.1 3.24 3 6 5.3 ...
## $ educ : int 11 12 11 8 12 16 18 12 12 17 ...
## $ exper : int 2 22 2 44 7 9 15 5 26 22 ...
## $ tenure : int 0 2 0 28 2 8 7 3 4 21 ...
## $ nonwhite: int 0 0 0 0 0 0 0 0 0 0 ...
## $ female : int 1 1 0 0 0 0 0 1 1 0 ...
## $ married : int 0 1 0 1 1 1 0 0 0 1 ...
## $ numdep : int 2 3 2 0 1 0 0 0 2 0 ...
## $ smsa : int 1 1 0 1 0 1 1 1 1 1 ...
## $ northcen: int 0 0 0 0 0 0 0 0 0 0 ...
## $ south : int 0 0 0 0 0 0 0 0 0 0 ...
## $ west : int 1 1 1 1 1 1 1 1 1 1 ...
## $ construc: int 0 0 0 0 0 0 0 0 0 0 ...
## $ ndurman : int 0 0 0 0 0 0 0 0 0 0 ...
## $ trcommpu: int 0 0 0 0 0 0 0 0 0 0 ...
## $ trade : int 0 0 1 0 0 0 1 0 1 0 ...
## $ services: int 0 1 0 0 0 0 0 0 0 0 ...
## $ profserv: int 0 0 0 0 0 1 0 0 0 0 ...
## $ profocc : int 0 0 0 0 0 1 1 1 1 1 ...
## $ clerocc : int 0 0 0 1 0 0 0 0 0 0 ...
## $ servocc : int 0 1 0 0 0 0 0 0 0 0 ...
## $ lwage : num 1.13 1.18 1.1 1.79 1.67 ...
## $ expersq : int 4 484 4 1936 49 81 225 25 676 484 ...
## $ tenursq : int 0 4 0 784 4 64 49 9 16 441 ...
## - attr(*, "time.stamp")= chr "25 Jun 2011 23:03"
To learn more about wage1, open its help page by running ?wage1.
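Several other base R functions give a quick overview of a data frame. So that the sketch below runs even without the wooldridge package, it uses mtcars, a data set that ships with R; the same calls should work on wage1 once the package is loaded.

```r
dim(mtcars)       # number of rows and columns: 32 11
nrow(mtcars)      # number of observations
ncol(mtcars)      # number of variables
names(mtcars)     # column names
head(mtcars, 3)   # first three rows
```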
To extract specific data from a data frame, we need to specify the “coordinates” we want. Row numbers come first, followed by column numbers, in the format data_frame[row_index, column_index]. For instance, to extract the first row and the second column from wage1 (Note: the second column is the variable “educ”, so this value is the first observation of “educ”):
wage1[1,2]
## [1] 11
wage1[1,"educ"]
## [1] 11
Note: As shown in this example, data frames can also be subset by calling their column names directly.
In addition, we can use shortcuts to select a number of rows or columns at once. To select all columns, leave the column index blank. For instance, to select all columns for the first row:
wage1[1,]
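A single column can also be pulled out as a vector with the $ operator, written data_frame$column_name. The sketch below uses the built-in mtcars data set so that it runs without any extra package; the column names are those of mtcars, not of wage1.

```r
mtcars$mpg[1:3]    # first three values of the column mpg
mtcars[1, "mpg"]   # the same first value, using [row, column] indexing

# Both ways of selecting the whole column return the same vector
identical(mtcars$mpg, mtcars[, "mpg"])   # TRUE
```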
Exercise-D:
- How do you select all observations (i.e., all rows) for “educ”?
- How do you select the first three rows of the 5th and 8th columns?
- Create a data.frame (wage1_200) containing only the data in row 200 of the wage1 dataset.
R provides a wide range of functions for obtaining summary statistics. One method of obtaining descriptive statistics is to use the summary( ) function. This function is automatically applied to each column, and for numeric columns it calculates the minimum, first quartile, median, mean, third quartile, and maximum.
For example, we get the descriptive statistics of wage1:
summary(wage1)
## wage educ exper tenure
## Min. : 0.530 Min. : 0.00 Min. : 1.00 Min. : 0.000
## 1st Qu.: 3.330 1st Qu.:12.00 1st Qu.: 5.00 1st Qu.: 0.000
## Median : 4.650 Median :12.00 Median :13.50 Median : 2.000
## Mean : 5.896 Mean :12.56 Mean :17.02 Mean : 5.105
## 3rd Qu.: 6.880 3rd Qu.:14.00 3rd Qu.:26.00 3rd Qu.: 7.000
## Max. :24.980 Max. :18.00 Max. :51.00 Max. :44.000
## nonwhite female married numdep
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000
## Median :0.0000 Median :0.0000 Median :1.0000 Median :1.000
## Mean :0.1027 Mean :0.4791 Mean :0.6084 Mean :1.044
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:2.000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :6.000
## smsa northcen south west
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.0000 Median :0.000 Median :0.0000 Median :0.0000
## Mean :0.7224 Mean :0.251 Mean :0.3555 Mean :0.1692
## 3rd Qu.:1.0000 3rd Qu.:0.750 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000
## construc ndurman trcommpu trade
## Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.04563 Mean :0.1141 Mean :0.04373 Mean :0.2871
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.0000 Max. :1.00000 Max. :1.0000
## services profserv profocc clerocc
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.1008 Mean :0.2586 Mean :0.3669 Mean :0.1673
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## servocc lwage expersq tenursq
## Min. :0.0000 Min. :-0.6349 Min. : 1.0 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.: 1.2030 1st Qu.: 25.0 1st Qu.: 0.00
## Median :0.0000 Median : 1.5369 Median : 182.5 Median : 4.00
## Mean :0.1407 Mean : 1.6233 Mean : 473.4 Mean : 78.15
## 3rd Qu.:0.0000 3rd Qu.: 1.9286 3rd Qu.: 676.0 3rd Qu.: 49.00
## Max. :1.0000 Max. : 3.2181 Max. :2601.0 Max. :1936.00
We can also get the summary statistics of a single column:
summary(wage1$educ)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 12.00 12.00 12.56 14.00 18.00
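Each of these statistics can also be computed on its own. As a small sketch, the calls below reuse the weight_g values from the vector examples; the same functions can be applied to a data frame column such as wage1$educ.

```r
weight_g <- c(21, 34, 39, 54, 55)

mean(weight_g)                              # 40.6
median(weight_g)                            # 39
min(weight_g)                               # 21
max(weight_g)                               # 55
sd(weight_g)                                # sample standard deviation
quantile(weight_g, probs = c(0.25, 0.75))   # first and third quartiles: 34 and 54
```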
Exercise-E:
- What is the sample variance of “educ” in the wage1 dataset? (Hint: Use var( ).)
- What is the sample correlation between “educ” and “wage” in the wage1 dataset? (Hint: Use cor( ).)
- What is the number of observations that have more than 12 years of education?
Suppose we are interested in the relationship between years of education and wage. Is it positive? Negative? Linear? Nonlinear? You can test your answer with the wage1 data frame. Put educ on the x-axis and wage on the y-axis:
plot(x=wage1$educ,
y=wage1$wage,
main="The relationship b/t educ and wage", # title of the plot
xlab="years of education", # label for x-axis
ylab="wage")
Another popular function used to get a plot is ggplot( ) in the ggplot2 package. In fact, R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile. ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs. With ggplot2, you can do more, and faster, by learning one system and applying it in many places. If you’d like to learn more about the theoretical underpinnings of ggplot2 before you start, I’d recommend reading “The Layered Grammar of Graphics”.
library(ggplot2)
ggplot(data = wage1) +
geom_point(mapping = aes(x = educ, y = wage))
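A ggplot2 graph is built by adding layers with +, so titles and axis labels can be added in the same way as the points. The sketch below is one possible illustration: it uses the built-in mtcars data set rather than wage1 so that it does not depend on the wooldridge package, and it only runs if ggplot2 is installed. The title and label text are arbitrary.

```r
# Runs only if ggplot2 is installed; mtcars ships with R
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
    geom_point() +                            # scatterplot layer
    labs(
      title = "Fuel economy and car weight",  # plot title
      x = "weight (1000 lbs)",                # label for x-axis
      y = "miles per gallon"                  # label for y-axis
    )

  print(p)   # draw the plot
}
```

Storing the plot in an object such as p is optional, but it makes it easy to add further layers later with p + ... .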
Exercise-F:
- Make a scatterplot of exper vs wage, and discuss the relationship between years of experience and average hourly earnings.
- At every level of education, is there any gender wage gap?