Lab 1: A Brief Introduction to R

0. Useful online resources for beginners

1. What is R? What is RStudio?

The term “R” is used to refer to both the programming language and the software that interprets the scripts written using it.

RStudio is a popular tool for writing R scripts and for interacting with the R software. RStudio needs R in order to function, so both must be installed on your computer.

2. Knowing your way around RStudio

Let’s start by learning about RStudio, which is an Integrated Development Environment (IDE) for working with R.

The RStudio IDE open-source product is free under the Affero General Public License (AGPL) v3. The RStudio IDE is also available with a commercial license and priority email support from RStudio, PBC.

We will use the RStudio IDE to write code, navigate the files on our computer, inspect the variables we create, and view the plots we generate. RStudio can also be used for other things (e.g., version control, developing packages, writing Shiny apps) that we will not cover during the workshop.

RStudio interface screenshot. Clockwise from top left: Source, Environment/History, Files/Plots/Packages/Help/Viewer, Console.

RStudio is divided into 4 “panes”:

The Source for your scripts and documents (top-left, in the default layout)
Your Environment/History (top-right) which shows all the objects in your working space (Environment) and your command history (History)
Your Files/Plots/Packages/Help/Viewer (bottom-right)
The R Console (bottom-left)

The placement of these panes and their content can be customized (see menu, Tools -> Global Options -> Pane Layout). For ease of use, settings such as background color, font color, font size, and zoom level can also be adjusted in this menu (Global Options -> Appearance).

One of the advantages of using RStudio is that all the information you need to write code is available in a single window. Additionally, with many shortcuts, autocompletion, and highlighting for the major file types you use while developing in R, RStudio will make typing easier and less error-prone.

3. R packages

You’ll often need to install some R packages. An R package is a collection of functions, data, and documentation that extends the capabilities of base R. Using packages is key to the successful use of R. One important data package that you will need in this course is wooldridge. You can install it with a single line of code:

install.packages("wooldridge")

On your own computer, type that line of code in the console, and then press Enter to run it. R will download the package from CRAN and install it on your computer. If you have problems installing, make sure that you are connected to the internet, and that https://cloud.r-project.org/ isn’t blocked by your firewall or proxy.

Installing a package does not make it available right away: you will not be able to use its functions, objects, and help files until you load it with the library( ) function. A package needs to be installed only once, but it must be loaded in every new R session:

library(wooldridge)

Several other packages will be used often in this course. You can install them all at once by passing a vector of package names to install.packages( ):

install.packages(c("ggplot2", "openxlsx", "lmtest", "sandwich", "AER", "stargazer", "car", "plm"))
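
To see which of these packages are already on your computer before installing anything, a short check along the following lines can help. This is only a sketch: the names course_pkgs and already_installed are made up for illustration, and the check relies on the base R function installed.packages( ).

```r
# Packages used in this course (the list above, plus wooldridge)
course_pkgs <- c("wooldridge", "ggplot2", "openxlsx", "lmtest",
                 "sandwich", "AER", "stargazer", "car", "plm")

# TRUE for each package that is already installed, FALSE otherwise
already_installed <- course_pkgs %in% rownames(installed.packages())
names(already_installed) <- course_pkgs
already_installed

# The packages that would still need install.packages()
course_pkgs[!already_installed]
```

Nothing is downloaded by this code; it only reports what is present.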

4. Getting help and learning more

As you start to apply econometric tools to your own data, you will soon run into questions that we do not answer in class. This section offers a few tips on how to get help and how to keep learning.

If you get stuck, start with Google. Typically adding “R” to a query is enough to restrict it to relevant results: if the search isn’t useful, it often means that there aren’t any R-specific results available. Google is particularly useful for error messages. If you get an error message and you have no idea what it means, try googling it! Chances are that someone else has been confused by it in the past, and there will be help somewhere on the web.

If Google doesn’t help, try Stack Overflow. Start by spending a little time searching for an existing answer, including [R] to restrict your search to questions and answers that use R. If you don’t find anything useful, prepare a minimal reproducible example, or reprex. A good reprex makes it easier for other people to help you, and often you’ll figure out the problem yourself in the course of making it. (See more details here.)
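
As a rough illustration of what a minimal reproducible example might look like, the sketch below uses a made-up vector (weights) and a made-up question ("why is my mean NA?"). The idea is that anyone can paste the code into a fresh R session and see the same result.

```r
# Made-up data, small enough to paste into a question
weights <- c(50, 60, NA, 82)

# The behaviour being asked about: the result is NA, not a number
mean(weights)

# dput() prints R code that recreates the object exactly,
# which is handy for sharing a small piece of your own data
dput(weights)

# Helpers often ask for your R version and loaded packages
sessionInfo()
```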

5. Coding basics

Before we go any further, let’s make sure you’ve got a solid foundation in running R code, and that you know about some of the most helpful RStudio features.

You can use R as a calculator:

(1 / 200) * 30

## [1] 0.15

(59 + 73 + 2) / 3

## [1] 44.66667
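
R follows the usual order of operations, so parentheses matter. The lines below are a small sketch of this, along with a few other arithmetic operators that base R provides; the expected values are noted in the comments.

```r
1 / 200 * 30      # division and multiplication run left to right: 0.15
1 / (200 * 30)    # parentheses change the grouping: about 0.000167
2^3               # exponentiation: 8
7 %/% 3           # integer division: 2
7 %% 3            # remainder after division: 1
```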

You can create new objects with <- or =:

x <- 3
y <- x + 4

u = 1
v = u / 2

All R statements where you create objects, assignment statements, have the same form:

object_name <- value

When reading that code say “object name gets value” in your head.

You will make lots of assignments and <- is a pain to type. Instead, use RStudio’s keyboard shortcut: Alt + - (the minus sign) on a PC or Option + - on a Mac. Notice that RStudio automatically surrounds <- with spaces, which is a good code formatting practice. Code is miserable to read on a good day, so giveyoureyesabreak and use spaces. (For more shortcuts, see “Keyboard Shortcuts in the RStudio IDE”.)

What’s in a name? Object names must start with a letter, and can only contain letters, numbers, underscores (_), and periods (.). You want your object names to be descriptive, so you’ll need a convention for multiple words. We recommend snake_case, where you separate lowercase words with _.

this_is_a_really_long_name <- 2.5

6. Vectors and data types

A vector is the most common and basic data type in R, and is pretty much the workhorse of R. A vector is composed of a series of values, which can be either numbers or characters. We can assign a series of values to a vector using the c( ) function. For example, we can create a vector of animal weights and assign it to a new object weight_g:

weight_g <- c(50, 60, 65, 82)
weight_g

## [1] 50 60 65 82

A vector can also contain characters:

animals <- c("mouse", "rat", "dog")
animals

## [1] "mouse" "rat"   "dog"

The quotes around “mouse”, “rat”, etc. are essential here. Without the quotes R will assume objects have been created called mouse, rat and dog. As these objects don’t exist in R’s memory, there will be an error message.

There are many functions that allow you to inspect the content of a vector. length( ) tells you how many elements are in a particular vector:

length(weight_g)

## [1] 4

length(animals)

## [1] 3

An important feature of a vector is that all of its elements are the same type of data. The function class( ) indicates what kind of object you are working with:

class(weight_g)

## [1] "numeric"

class(animals)

## [1] "character"

You can use the c() function to add other elements to your vector:

weight_g <- c(weight_g, 90)      # add to the end of the vector
weight_g <- c(30, weight_g)      # add to the beginning of the vector
weight_g

## [1] 30 50 60 65 82 90

Exercise-A:

What happens if we try to mix different types in a single vector? Try to add the elements of “animals” to “weight_g”.

7. Subsetting vectors

If we want to extract one or several values from a vector, we must provide one or several indices in square brackets. For instance:

animals <- c("mouse", "rat", "dog", "cat")
animals[2]

## [1] "rat"

animals[c(3, 2)]

## [1] "dog" "rat"

animals[2:4]

## [1] "rat" "dog" "cat"
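
Two further behaviours of square-bracket indexing are worth knowing, although they go slightly beyond the examples above: a negative index drops an element instead of selecting it, and asking for a position past the end of the vector returns NA rather than an error. A short sketch, re-creating the animals vector so that it runs on its own:

```r
animals <- c("mouse", "rat", "dog", "cat")

animals[-1]         # everything except the first element: "rat" "dog" "cat"
animals[-c(1, 2)]   # drop the first two elements: "dog" "cat"
animals[5]          # there is no fifth element, so the result is NA
```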

Conditional subsetting

Another common way of subsetting is by using a logical vector. TRUE will select the element with the same index, while FALSE will not. For example, if we want to select only the values above 50:

weight_g <- c(21, 34, 39, 54, 55)
weight_g > 50   # will return logicals with TRUE for the indices that meet the condition

## [1] FALSE FALSE FALSE  TRUE  TRUE

weight_g[weight_g > 50]

## [1] 54 55

We also can combine multiple tests using & (both conditions are true, AND) or | (at least one of the conditions is true, OR):

weight_g[weight_g > 30 & weight_g < 50]     # select values greater than 30 and smaller than 50

## [1] 34 39

weight_g[weight_g <= 30 | weight_g == 55]   # select values "less than or equal to" 30 or values "equal to" 55

## [1] 21 55
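
Because TRUE counts as 1 and FALSE as 0 in arithmetic, a logical vector can also be used to count how many elements meet a condition, and which( ) reports their positions. A brief sketch using the same weight_g values:

```r
weight_g <- c(21, 34, 39, 54, 55)

sum(weight_g > 50)       # number of values above 50: 2
which(weight_g > 50)     # positions of those values: 4 5
mean(weight_g > 50)      # share of values above 50: 0.4
```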

8. Calling functions

R has a large collection of built-in functions that are called like this:

function_name(arg1 = val1, arg2 = val2, ...)

Let’s try using seq( ), which makes regular sequences of numbers. To learn about a function’s arguments and purpose, run ?seq in the console; the documentation opens in the Help tab in the lower-right pane.

seq(1, 10)

##  [1]  1  2  3  4  5  6  7  8  9 10

seq(1, 10, by = 2)

## [1] 1 3 5 7 9

seq(1, 10, length.out = 4)

## [1]  1  4  7 10
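
Arguments can be matched by position or by name. The sketch below spells out the names of the first two arguments of seq( ) (from and to) and checks that the result matches the positional call; args( ) is a base R helper that prints what a function expects. The object names by_position and by_name are made up for illustration.

```r
# Positional and named calls give the same sequence
by_position <- seq(1, 10, by = 2)
by_name     <- seq(from = 1, to = 10, by = 2)
identical(by_position, by_name)   # TRUE

# Named arguments can be given in any order
seq(by = 2, to = 10, from = 1)

# Print the argument list of a function
args(seq)

# ?seq   # run this interactively to open the help page
```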

Exercise-B:

Generate 1000 random numbers from a normal distribution with mean 0 and standard deviation 1. (Hint: Use rnorm( ).)

9. Saving your code

Up to now, you have typed your code in the console. This is useful for quick queries, but not so helpful if you want to revisit your work later. To keep your code, write it in a script: open a new script by pressing Ctrl + Shift + N (Cmd + Shift + N on a Mac). It is wise to save your script file immediately. To do this, press Ctrl + S (Cmd + S on a Mac). This will open a dialog box where you can decide where to save your script file and what to name it. The .R file extension is added automatically and ensures your file will open with RStudio.

Don’t forget to save your work periodically by pressing Ctrl + S.
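
Once code lives in a script file, the whole file can be run at once with source( ). The sketch below writes a two-line script to R's temporary folder purely for illustration (the file name my_first_script.R and the object total are made up), then runs it.

```r
# Create a small script file in R's temporary folder (illustration only)
script_path <- file.path(tempdir(), "my_first_script.R")
writeLines(c("x <- seq(1, 10)",
             "total <- sum(x)"),
           script_path)

# Run every line of the script, as if it had been typed in the console
source(script_path)

total   # 55, created by the script
```

In everyday use you would point source( ) at a script you saved yourself rather than at a temporary file.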

Exercise-C:

Write your R code for the previous exercise (Exercise-B) in the script window, and then save it as “random_normal.R”.

10. Commenting your code

The comment character in R is #. Anything to the right of a # in a script will be ignored by R, which makes it useful for leaving notes and explanations in your scripts. For convenience, RStudio provides a keyboard shortcut to comment or uncomment several lines at once: select the lines you want to comment, then press Ctrl + Shift + C. To comment out a single line, place the cursor anywhere on that line (there is no need to select the whole line), then press Ctrl + Shift + C.

x <- seq(0, 1, by = 0.2)   # generate a sequence from 0 to 1 in increments of 0.2

11. Starting with a data frame

A data frame is a rectangular collection of variables (in the columns) and observations (in the rows). As an example, let’s take a look at the wage1 data frame found in the wooldridge package.

library(wooldridge)    # First, load the package

To open the dataset in RStudio’s Data Viewer, use the View( ) function:

View(wage1)

We also can inspect the structure of a data frame using the function str( ):

str(wage1)

## 'data.frame':    526 obs. of  24 variables:
##  $ wage    : num  3.1 3.24 3 6 5.3 ...
##  $ educ    : int  11 12 11 8 12 16 18 12 12 17 ...
##  $ exper   : int  2 22 2 44 7 9 15 5 26 22 ...
##  $ tenure  : int  0 2 0 28 2 8 7 3 4 21 ...
##  $ nonwhite: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ female  : int  1 1 0 0 0 0 0 1 1 0 ...
##  $ married : int  0 1 0 1 1 1 0 0 0 1 ...
##  $ numdep  : int  2 3 2 0 1 0 0 0 2 0 ...
##  $ smsa    : int  1 1 0 1 0 1 1 1 1 1 ...
##  $ northcen: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ south   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ west    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ construc: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ndurman : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ trcommpu: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ trade   : int  0 0 1 0 0 0 1 0 1 0 ...
##  $ services: int  0 1 0 0 0 0 0 0 0 0 ...
##  $ profserv: int  0 0 0 0 0 1 0 0 0 0 ...
##  $ profocc : int  0 0 0 0 0 1 1 1 1 1 ...
##  $ clerocc : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ servocc : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ lwage   : num  1.13 1.18 1.1 1.79 1.67 ...
##  $ expersq : int  4 484 4 1936 49 81 225 25 676 484 ...
##  $ tenursq : int  0 4 0 784 4 64 49 9 16 441 ...
##  - attr(*, "time.stamp")= chr "25 Jun 2011 23:03"

To learn more about wage1, open its help page by running ?wage1.
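
A few other base R functions give quick summaries of a data frame's shape. So that the sketch below runs without any package, it builds a small stand-in data frame (wage_mini, a made-up name) from the first five values of wage, educ, exper, and female shown in the str( ) output above; the same functions can be applied to wage1 itself.

```r
# Stand-in for wage1: first five rows of four columns, copied from the output above
wage_mini <- data.frame(
  wage   = c(3.1, 3.24, 3, 6, 5.3),
  educ   = c(11L, 12L, 11L, 8L, 12L),
  exper  = c(2L, 22L, 2L, 44L, 7L),
  female = c(1L, 1L, 0L, 0L, 0L)
)

dim(wage_mini)       # number of rows and columns: 5 4
names(wage_mini)     # column names
head(wage_mini, 3)   # first three rows
```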

12. Subsetting a data frame

To extract specific data from a data frame, we need to specify the “coordinates” of the values we want. Row numbers come first, followed by column numbers, in the format data_frame[row_index, column_index]. For instance, to extract the value in the first row and the second column of wage1 (Note: the second column is the variable “educ”, so this value is the first observation of “educ”):

wage1[1,2]

## [1] 11

wage1[1,"educ"]

## [1] 11

Note: As shown in this example, a column can be referred to by its name (in quotes) instead of by its position.

In addition, we can use shortcuts to select a number of rows or columns at once. To select all columns, leave the column index blank. For instance, to select all columns for the first row:

wage1[1,]
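
A single column can also be pulled out by name with the $ operator, and the logical tests from the section on subsetting vectors work for rows too. The sketch below uses a small stand-in data frame (wage_mini, a made-up name holding the first five rows of three wage1 columns) so that it runs without loading any package.

```r
# Stand-in for wage1, built from the first five observations
wage_mini <- data.frame(
  wage  = c(3.1, 3.24, 3, 6, 5.3),
  educ  = c(11L, 12L, 11L, 8L, 12L),
  exper = c(2L, 22L, 2L, 44L, 7L)
)

wage_mini$wage                          # the wage column as a vector
wage_mini[wage_mini$educ > 11, ]        # rows with more than 11 years of education
wage_mini[wage_mini$wage > 5, "educ"]   # educ for rows where wage exceeds 5: 8 12
```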

Exercise-D:

How would you select all observations (i.e., all rows) for “educ”?

How would you select the first three rows of the 5th and 8th columns?

Create a data.frame (wage1_200) containing only the data in row 200 of the wage1 dataset.

13. Descriptive statistics of a data frame

R provides a wide range of functions for obtaining summary statistics. One way to obtain descriptive statistics is the summary( ) function. Applied to a data frame, it works column by column and, for each numeric column, calculates:

minimum value of each column
maximum value of each column
mean value of each column
median value of each column
1st quartile of each column (25th percentile)
3rd quartile of each column (75th percentile)

For example, we get the descriptive statistics of wage1:

summary(wage1)

##       wage             educ           exper           tenure      
##  Min.   : 0.530   Min.   : 0.00   Min.   : 1.00   Min.   : 0.000  
##  1st Qu.: 3.330   1st Qu.:12.00   1st Qu.: 5.00   1st Qu.: 0.000  
##  Median : 4.650   Median :12.00   Median :13.50   Median : 2.000  
##  Mean   : 5.896   Mean   :12.56   Mean   :17.02   Mean   : 5.105  
##  3rd Qu.: 6.880   3rd Qu.:14.00   3rd Qu.:26.00   3rd Qu.: 7.000  
##  Max.   :24.980   Max.   :18.00   Max.   :51.00   Max.   :44.000  
##     nonwhite          female          married           numdep     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000  
##  Median :0.0000   Median :0.0000   Median :1.0000   Median :1.000  
##  Mean   :0.1027   Mean   :0.4791   Mean   :0.6084   Mean   :1.044  
##  3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:2.000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :6.000  
##       smsa           northcen         south             west       
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :1.0000   Median :0.000   Median :0.0000   Median :0.0000  
##  Mean   :0.7224   Mean   :0.251   Mean   :0.3555   Mean   :0.1692  
##  3rd Qu.:1.0000   3rd Qu.:0.750   3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.000   Max.   :1.0000   Max.   :1.0000  
##     construc          ndurman          trcommpu           trade       
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.00000   Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :0.04563   Mean   :0.1141   Mean   :0.04373   Mean   :0.2871  
##  3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:1.0000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000  
##     services         profserv         profocc          clerocc      
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.1008   Mean   :0.2586   Mean   :0.3669   Mean   :0.1673  
##  3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##     servocc           lwage            expersq          tenursq       
##  Min.   :0.0000   Min.   :-0.6349   Min.   :   1.0   Min.   :   0.00  
##  1st Qu.:0.0000   1st Qu.: 1.2030   1st Qu.:  25.0   1st Qu.:   0.00  
##  Median :0.0000   Median : 1.5369   Median : 182.5   Median :   4.00  
##  Mean   :0.1407   Mean   : 1.6233   Mean   : 473.4   Mean   :  78.15  
##  3rd Qu.:0.0000   3rd Qu.: 1.9286   3rd Qu.: 676.0   3rd Qu.:  49.00  
##  Max.   :1.0000   Max.   : 3.2181   Max.   :2601.0   Max.   :1936.00

We can also get the summary statistics of a single column:

summary(wage1$educ)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   12.00   12.00   12.56   14.00   18.00
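
Each number reported by summary( ) can also be computed on its own with a dedicated base R function. A minimal sketch, using only the first five wage values shown earlier (stored in a made-up vector wage_first5) so that it runs without the wooldridge package:

```r
wage_first5 <- c(3.1, 3.24, 3, 6, 5.3)

min(wage_first5)                 # smallest value: 3
max(wage_first5)                 # largest value: 6
mean(wage_first5)                # arithmetic mean: 4.128
median(wage_first5)              # middle value: 3.24
quantile(wage_first5, 0.25)      # 1st quartile: 3.1
quantile(wage_first5, 0.75)      # 3rd quartile: 5.3
```

The same calls work on a full column, for example mean(wage1$wage) once the wooldridge package is loaded.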

Exercise-E:

What is the sample variance of “educ” in the wage1 dataset? (Hint: Use var( ).)

What is the sample correlation between “educ” and “wage” in the wage1 dataset? (Hint: Use cor( ).)

How many observations have more than 12 years of education?

14. Creating a simple plot

Suppose we are interested in the relationship between years of education and wage. Is it positive? Negative? Linear? Nonlinear? You can test your answer with the wage1 data frame. Put educ on the x-axis and wage on the y-axis:

plot(x = wage1$educ,
     y = wage1$wage,
     main = "The relationship between educ and wage",   # title of the plot
     xlab = "years of education",                       # label for x-axis
     ylab = "wage")                                     # label for y-axis
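
To keep a copy of a base R plot, one option is to open a file-based graphics device, draw the plot, and then close the device. The sketch below is an illustration under assumptions: it uses the first five educ and wage values shown earlier rather than the full wage1 data, and the file name educ_wage.pdf is made up.

```r
# First five observations of educ and wage, copied from the earlier output
educ <- c(11, 12, 11, 8, 12)
wage <- c(3.1, 3.24, 3, 6, 5.3)

plot_file <- file.path(tempdir(), "educ_wage.pdf")

pdf(plot_file)                         # open a PDF graphics device
plot(x = educ, y = wage,
     main = "educ and wage (first five observations)",
     xlab = "years of education",
     ylab = "wage")
dev.off()                              # close the device so the file is written

file.exists(plot_file)                 # TRUE once the plot has been saved
```

The file is written to a temporary folder here; replacing plot_file with a path of your own would save the plot somewhere permanent.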

Another popular plotting function is ggplot( ) in the ggplot2 package. R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile. ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs. With ggplot2, you can do more, and do it faster, by learning one system and applying it in many places. If you’d like to learn more about the theoretical underpinnings of ggplot2 before you start, I’d recommend reading “The Layered Grammar of Graphics”.

library(ggplot2)
ggplot(data = wage1) + 
  geom_point(mapping = aes(x = educ, y = wage))

Exercise-F:

Make a scatterplot of exper vs wage, and discuss the relationship between years of experience and average hourly earnings.

Is there a gender wage gap at every level of education?