Purpose

The aim of this guide is to help you get familiar with R and RStudio, the software we will use in the Econometrics 2 (2020) course. R is the statistical program, and RStudio is an interface that makes working with R friendlier. If you do not have them yet, you need to download them. Here you can find the R installer, and here the RStudio installer; install R before installing RStudio.

From here on the guide is in English (I’m recycling it from when I taught it abroad).

Why learn R?

Advantages of using R

  • Free
  • Infinitely more intuitive and regular than Stata
  • Can be used as a calculator
  • You can have many objects in your environment
  • Integrated version control
  • Potential to understand better what you are doing
  • Makes reproducible research easy
  • Huge community willing to help
  • Kickass presentations like this one

Disadvantages

  • First time getting clustered standard errors (2 or more way) might be tough
  • Marginally slower than Stata for regressions with many fixed-effects

Interacting with R(studio)

Basics

  • R is the program actually doing all the work
  • Rstudio is an interface to improve the experience
  • Rstudio layout:
    • Interactive console (left)
    • Environment/History (tabbed in upper right)
    • Files/Plots/Packages/Help/Viewer (tabbed in lower right)
  • You can easily get help by typing one or two question marks in the console
    • If you know the function cumsum() but are not sure what it does, type ?cumsum
    • If you are not sure what you are looking for but it is something about sums, type ??sum

R as calculator

The simplest thing you could do in R is arithmetic and logic

100 + 4
[1] 104

If you type an incomplete statement, R will wait for you. The ‘+’ sign indicates R is waiting for you.

R as calculator (II)

When using R as calculator the order of operations is (highest to lowest precedence)

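  • Parentheses: ( )
  • Exponents: ^ or **
  • Multiply and divide: * and / (same precedence, evaluated left to right)
  • Add and subtract: + and - (same precedence, evaluated left to right)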

3 + 5 * 2
[1] 13

(3 + 5) * 2
[1] 16
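A few more cases worth trying at the console, as a small sketch of how these precedence rules play out. The values in the comments follow from R's rule that ^ binds tighter than the unary minus and groups from right to left.

```r
# Exponentiation binds tighter than the unary minus
-2^2        # -4, read as -(2^2)
(-2)^2      # 4

# ^ groups from right to left
2^3^2       # 512, read as 2^(3^2)

# * and / share one precedence level and run left to right
10 / 2 * 5  # 25

# + and - share one precedence level as well
10 - 2 + 5  # 13
```

When in doubt, add parentheses: they cost nothing and make the intent explicit.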

R as calculator (III)

R expresses small numbers in scientific notation

2/10000
[1] 2e-04

You can do it too

5e3 # note the lack of minus sign here
[1] 5000

Math functions

28 %% 6
[1] 4
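The modulo operator is only one of many built-in math functions. The short list below is an illustrative selection from base R, not the full set covered in the course.

```r
sqrt(16)          # square root: 4
exp(0)            # exponential: 1
log(1)            # natural logarithm: 0
log10(1000)       # base-10 logarithm: 3
abs(-3)           # absolute value: 3
28 %/% 6          # integer division: 4
28 %% 6           # remainder (modulo): 4
round(2.567, 1)   # rounding to 1 decimal place: 2.6
```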

Comparing things

R can also be used to compare things

1 == 1 # note the double equal (is equal to)
[1] TRUE

1 != 2 # note the exclamation mark (is not equal to)
[1] TRUE

1 < 2 # less than
[1] TRUE

1 <= 2 # less than or equal
[1] TRUE

Comparing things (II)

4 > 2 # greater than
[1] TRUE

4 >= 4 # greater than or equal
[1] TRUE

Some funny things with relational operators (careful!!!!)

'hola' > 4e100
[1] TRUE

T > -5
[1] TRUE

2006.00 == "2006"
[1] TRUE

"Tola" < "Tolín"
[1] TRUE

"Tolita" < "Tola"
[1] FALSE

Further reading about operators in general here, and about relational operators specifically here.
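The surprising comparisons above come from implicit type coercion: before comparing values of different types, R converts them to a common type. The sketch below makes those conversions explicit. Note that the ordering of strings depends on your locale, so comparisons such as "Tola" < "Tolín" may give a different answer on another machine.

```r
# A number compared with a string is first converted to a string
as.character(4e100)          # "4e+100"
as.character(2006.00)        # "2006"
"2006" == "2006"             # TRUE, which is what 2006.00 == "2006" reduces to

# Logical values are converted to numbers: TRUE is 1, FALSE is 0
as.numeric(TRUE)             # 1
TRUE + TRUE                  # 2

# A safer comparison converts explicitly before comparing
as.numeric("2006") == 2006   # TRUE
```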

Variable assignment

To store values, R prefers the operator <-

y <- 1/4
y
[1] 0.25

One nice feature of R: evaluation happens before assignment

A <- 100
A <- A + 1
A
[1] 101

Vectorization

R’s basic building block is the vector, for instance

1:5
[1] 1 2 3 4 5

2^(1:5)
[1]  2  4  8 16 32

x <- 1:5
2^x
[1]  2  4  8 16 32
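Vectorization means that most operators and functions work on every element at once, with no explicit loop. A minimal sketch using the same vector x:

```r
x <- 1:5

x + 10       # adds 10 to every element: 11 12 13 14 15
x * x        # element-by-element product: 1 4 9 16 25
x > 2        # element-by-element comparison: FALSE FALSE TRUE TRUE TRUE
sum(x)       # 15
sum(x > 2)   # counts the TRUE values: 3
```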

Managing your environment

If you want a list of all the objects in your environment

ls()
[1] "A" "x" "y"

If you want to remove an object

rm(A)
ls()
[1] "x" "y"

If you want to remove them all

rm(list=ls())
ls()
character(0)

R packages

It is possible to add functions to R by writing or obtaining packages. You can

  • See installed packages by typing installed.packages()
  • Install packages by typing install.packages("packagename"), where packagename is the package name, in quotes.
  • Update installed packages by typing update.packages()
  • Remove a package with remove.packages("packagename")
  • Make a package available for use with library(packagename)

Challenge

What will be the value of each variable after each statement in the following program?

mass <- 47.5
age <- 122
mass <- mass * 2.3
age <- age - 20

  • Run the code from the previous challenge, and write a command to compare mass to age. Is mass larger than age?
  • Clean up your working environment by deleting the mass and age variables.
  • Install the following packages: ggplot2, dplyr, gapminder

Data types

R deals with numerous data types. The basic four are:

  • values like 4.5 are called numerics
  • whole numbers written with an L suffix, like 4L, are called integers (a plain 4 is stored as a numeric). Integers are also numerics
  • boolean values (TRUE or FALSE) are called logical
  • text (or string) values are called characters
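You can check the type of any value with class(). The sketch below also shows a detail that is easy to miss: a whole number typed without the L suffix is stored as a numeric, not as an integer.

```r
class(4.5)       # "numeric"
class(4)         # "numeric": no L suffix, so it is stored as a double
class(4L)        # "integer"
is.numeric(4L)   # TRUE: integers are also numerics
class(TRUE)      # "logical"
class("four")    # "character"
```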

Challenge

Look at the help for the c function. What kind of vector do you expect you will create if you evaluate the following:

c(1, 2, 3)
c('d', 'e', 'f')
c(1, 2, 'f')

Data structures

R can handle many types of structures. We will focus on

  • Vectors: have no dimension, all elements of the same type
  • Matrices: have dimension, all elements of the same type
  • Data frames: have dimensions, elements can be of different type as long as they are the same length
  • Lists: super flexible, you can put anything in a list

Going to Las Vegas (Challenge)

Assign the value “Go!” to the variable vegas

Create 2 vectors containing your winnings and losses from poker and roulette

  • For poker.vector:
    • On Monday you won $140
    • Tuesday you lost $50
    • Wednesday you won $20
    • Thursday you lost $120
    • Friday you won $240
  • For roulette.vector:
    • On Monday you lost $24
    • Tuesday you lost $50
    • Wednesday you won $100
    • Thursday you lost $350
    • Friday you won $10

Going to Las Vegas (Challenge) (II)

Elements in vectors can be named

some.vector <- c("Pepito Perez", "Futbolista")
names(some.vector) <- c("Nombre", "Profesión")

Following the example, name the elements of the roulette winnings

names(poker.vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(roulette.vector) <-

  • Create a vector with the net winnings of each day. Name it total.daily
  • Create a named vector telling which days you made winnings. Name it days.won
  • Calculate the total winnings from the 5 days in Las Vegas. Name it total.week.

Going to Las Vegas (III)

Slicing in R is really simple: if you want the first three elements of a vector just use ‘[]’ and the appropriate indexes (from 1 to 3)

total.daily[1:3] # gets the daily wins from Mon-Wed

You can also perform operations and apply functions to parts of vectors

mean(total.daily) # gives the mean for the whole week
mean(total.daily[4:5]) # gives the mean from Thu-Fri
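Besides positions, a vector can be sliced by name or by a logical condition. The sketch below uses a made-up vector, rain.mm, that is not part of the Las Vegas exercise.

```r
# Hypothetical daily rainfall in millimetres
rain.mm <- c(Mon = 0, Tue = 12, Wed = 5, Thu = 0, Fri = 8)

rain.mm[1:3]           # by position: Mon to Wed
rain.mm[c(4, 5)]       # by position: Thu and Fri
rain.mm["Wed"]         # by name
rain.mm[rain.mm > 0]   # by condition: only the rainy days
mean(rain.mm[4:5])     # 4
```

A vector has a single dimension, so its index takes one argument: rain.mm[4:5] or rain.mm[c(4, 5)] works, while rain.mm[4, 5] raises an error.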

Matrices

Think of matrices as vectors with dimensions, or as a stack of vectors (of the same length)

matrix(1:9, nrow = 3, ncol = 3)

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

By default the function ‘matrix’ fills out the blank matrix by column.

matrix(1:9, byrow = T, ncol = 3)

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

Matrices challenge

  • The following vectors are the number of copies sold for 3 of Miles Davis’ albums.

  • The first number of each vector is the US sales and the second is the rest of the world.

kind.of.blue <- c(460.998, 314.4)
sketches <- c(290.475, 247.900)
miles.ahead <- c(309.306, 165.8)

  • Combine these three vectors into one single vector called ‘all.sales’.

  • Using the vector ‘all.sales’, create a matrix called ‘miles.sales’ that has US and rest of the world sales as columns.

  • Using help or Google, label the columns and rows of the matrix properly.

Selection of matrix elements

We can naturally slice a matrix by using the [] operator

some.matrix[3,6] # gets the element in the 3rd row and 6th column
some.matrix[,6] # gets all the elements in the 6th column
some.matrix[3,] # gets all the elements in the 3rd row
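Since some.matrix is not defined in this guide, here is a concrete sketch with a small made-up matrix m that you can run directly.

```r
m <- matrix(1:20, nrow = 4)   # 4 rows, 5 columns, filled by column

m[2, 3]             # single element: 10
m[, 3]              # whole 3rd column: 9 10 11 12
m[2, ]              # whole 2nd row: 2 6 10 14 18
m[1:2, 4:5]         # block: rows 1-2, columns 4-5
dim(m[1:2, 4:5])    # 2 2
```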

Challenge

  • Create a 50 by 20 matrix with the numbers from 1 to 1000 (the first column is the numbers from 1-50). Name it my.matrix
  • Extract the following matrix from my.matrix

     [,1] [,2] [,3] [,4]
[1,]   39   89  139  189
[2,]   40   90  140  190
[3,]   41   91  141  191
[4,]   42   92  142  192
[5,]   43   93  143  193
[6,]   44   94  144  194
[7,]   45   95  145  195
Arithmetic with matrices (I)

Just as with vectors, the normal operators perform element-wise operations

a <- matrix(5:8, nrow = 2, ncol = 2)
b <- matrix(1:4, byrow = T, ncol = 2)
c <- a * b

Scalars work the same way

d <- matrix(1:9, nrow = 3)
pi * d

         [,1]     [,2]     [,3]
[1,] 3.141593 12.56637 21.99115
[2,] 6.283185 15.70796 25.13274
[3,] 9.424778 18.84956 28.27433

Arithmetic with matrices (II)

If we want to use standard matrix multiplication we need to use the operator %*%

e <- matrix(1:6, nrow = 3, ncol = 2)
f <- matrix(11:20, nrow = 2, ncol = 5)
g <- e %*% f

For inverse we use the function solve()

h <- matrix(rnorm(9), nrow = 3)
h.inv <- solve(h)

For the transpose we use the function t()

i <- matrix(21:30, byrow = T, ncol = 2)
j <- t(i)
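A quick way to convince yourself these operations do what you expect is to check dimensions and known identities. The sketch below reuses e and f from above and adds a small made-up invertible matrix m, so the result does not depend on random numbers.

```r
e <- matrix(1:6, nrow = 3, ncol = 2)
f <- matrix(11:20, nrow = 2, ncol = 5)
dim(e %*% f)                       # (3 x 2) times (2 x 5) gives 3 5

m <- matrix(c(2, 1, 1, 3), nrow = 2)
m.inv <- solve(m)
all.equal(m %*% m.inv, diag(2))    # TRUE: a matrix times its inverse is the identity

dim(t(e))                          # transposing swaps the dimensions: 2 3
```

all.equal() is used instead of == because the product is only equal to the identity up to floating-point rounding.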

Functions and matrices

Applying functions to columns or rows is as easy as

k <- matrix(rnorm(100), byrow = T, nrow = 10)

column.mean <- mean(k[,1])
row.mean <- mean(k[4,])
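To get a statistic for every column or every row in one call, base R offers colMeans(), rowMeans() and the more general apply(). A minimal sketch; the seed value is arbitrary and only there to make the random numbers repeatable.

```r
set.seed(1)                        # arbitrary seed, for reproducibility
k <- matrix(rnorm(100), byrow = TRUE, nrow = 10)

col.means <- colMeans(k)           # one mean per column
row.means <- rowMeans(k)           # one mean per row
col.sds   <- apply(k, 2, sd)       # 2 = apply over columns (1 would be rows)

length(col.means)                  # 10
```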

Challenge:

Calculate the mean and standard deviation for Miles Davis’ sales for US and non-US markets separately.

Factors

Factors are the categorical variables in R. A factor is not just a label: R treats factors differently from plain character vectors. They are extremely important for research in Economics.

genre <- c('bebop', 'modal', 'cool', 'modal', 'cool', 'cool')
genre <- factor(genre)

Some factors are naturally ordered (e.g. days of the week) while others aren’t

fruits <- c('Lulo', 'Guama', 'Carambolo', 'Papaya', 'Guama')
fruits <- factor(fruits)

temperatures <- c('High', 'Medium', 'High', 'High', 'Low', 'Low')
temperatures <- factor(temperatures, ordered = T, levels = c('Low', 'Medium', 'High'))

And you can always change the name of your levels

levels(temperatures) <- c('L', 'M', 'H')
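The sketch below shows what "treated particularly" means in practice: a factor stores integer codes plus a set of levels, and an ordered factor can be compared with < and >. The variable name sizes is made up for this example.

```r
sizes <- factor(c("High", "Medium", "High", "High", "Low", "Low"),
                ordered = TRUE, levels = c("Low", "Medium", "High"))

levels(sizes)          # "Low" "Medium" "High"
as.integer(sizes)      # underlying codes: 3 2 3 3 1 1
table(sizes)           # counts per level: Low 2, Medium 1, High 3
sizes[1] > sizes[2]    # TRUE: High is above Medium in the declared order
```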

Data frames

This is the most typical data structure that economists work with (nowadays tibbles are even better)

Data in data frames can be of numerous types (but same length)

A typical example is household surveys

  • name (character)
  • age (numeric)
  • married (logical)
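As an illustration, a toy version of such a survey can be built with data.frame(). The names and values below are invented for the example.

```r
# Hypothetical three-person household survey
survey <- data.frame(
  name    = c("Ana", "Luis", "Marta"),   # character
  age     = c(34, 51, 28),               # numeric
  married = c(TRUE, FALSE, TRUE),        # logical
  stringsAsFactors = FALSE               # keep names as text, not factors
)

str(survey)    # one line per column, with its type
dim(survey)    # 3 3
```

Each column has its own type, but all columns have the same length (three rows).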

Some ways of getting to know your DF

mtcars # pre-loaded data frame always available in R

head(mtcars)
tail(mtcars)
str(mtcars)
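A few more base R functions are handy for a first look at a data frame. This is a short sketch on the same mtcars data.

```r
dim(mtcars)           # rows and columns: 32 11
nrow(mtcars)          # 32
ncol(mtcars)          # 11
names(mtcars)         # column names
summary(mtcars$mpg)   # min, quartiles, mean and max of one column
```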

Creating dataframes

Most of the time you will read a csv table to create a DF.

Sometimes, though, you have to create them

planets <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")

type <- c("Terrestrial", "Terrestrial", "Terrestrial", "Terrestrial", "Gas giant", "Gas giant", "Gas giant", "Gas giant")

diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

Challenge

Create a data frame with the previous vectors. Call it planets.df.

Use the function str() to analyze the structure of planets.df...

Slicing data frames

Similar to matrices, the [] operator can be used

df[1,2] # selects the first element of column 2
df[1:3, 2:8] # selects the first 3 elements of columns 2 to 8
df[1,] # what would this select?

Names of columns can also be used to slice

Challenge

Print out the value of the diameter of Mercury

Print out all data on Mars

Select the diameter of the furthest planets

Slicing data frames: columns

There are three ways of selecting columns

planets.df[,3]
planets.df[,'diameter']
planets.df$diameter

The last way is clearly the most concise.

Data frame: conditional selecting

Suppose we want to work only with planets that have rings.

has.rings <- planets.df$rings

planets.with.rings.df <- planets.df[has.rings,]

The function subset() can be used as a shorthand

planets.with.rings.df2 <- subset(planets.df, subset = rings)
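The same pattern works with any condition that returns TRUE or FALSE for each row. Here is a sketch on the built-in mtcars data, keeping only the four-cylinder cars; the object names are made up for the example.

```r
is.four <- mtcars$cyl == 4                 # logical vector, one value per row

four.cyl <- mtcars[is.four, ]              # bracket version
nrow(four.cyl)                             # 11

four.cyl2 <- subset(mtcars, subset = cyl == 4)   # subset() version
nrow(four.cyl2)                            # 11

# Conditions can be combined with & (and) or | (or)
light.four <- mtcars[mtcars$cyl == 4 & mtcars$wt < 2, ]
```

Do not forget the comma inside the brackets: the condition selects rows, and the empty slot after the comma keeps all columns.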

Challenge

Create a dataframe with the planets whose diameter is less than 1.

Lists

Think of lists as rucksacks or backpacks. At the end you want to put many things in one container.

my.list <- list(a,b,c,d, genre, planets.df)
my.list

If you want the elements of the list to have names

my.named.list <- list(some.matrix = b, some.df = planets.df, my.age = 36)
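Once things are in a list you need a way to get them back out. The sketch below uses a small made-up list, bag, to show the difference between $, [[ ]] and [ ].

```r
bag <- list(numbers = 1:3, label = "hello", cars = head(mtcars, 2))

names(bag)              # "numbers" "label" "cars"
bag$label               # by name: "hello"
bag[["numbers"]]        # by name: 1 2 3
bag[[3]]                # by position: the small data frame

class(bag["label"])     # "list": single brackets return a smaller list
class(bag[["label"]])   # "character": double brackets return the element itself
```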

Working with Dplyr and Magrittr: prelude

Let’s open a new R-script. Name it my.script.R

Using the function read.csv() read the csv file with the gapminder data.

This data is clean. It usually isn’t.

Load the libraries dplyr and magrittr

Challenge

Describe the data frame (dimensions and data types)

dplyr

It was created with data frames in mind

It is awesome to manipulate data in many, many useful ways

The names of the functions are the same verbs we’d use to describe what we want

  • filter() (and slice())
  • arrange()
  • select() (and rename())
  • distinct()
  • mutate() (and transmute())
  • summarize()
  • sample_n() (and sample_frac())
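To give a feel for these verbs, here is a minimal sketch on the built-in mtcars data rather than on gapminder. It assumes the dplyr package is installed; the column name hp.per.wt and the object names are made up for the example.

```r
library(dplyr)

four.cyl <- filter(mtcars, cyl == 4)                 # keep rows that meet a condition
four.cyl <- select(four.cyl, mpg, wt, hp)            # keep only some columns
four.cyl <- arrange(four.cyl, desc(mpg))             # sort, highest mpg first
four.cyl <- mutate(four.cyl, hp.per.wt = hp / wt)    # add a new column

# group_by() plus summarize() collapses the data to one row per group
by.cyl <- summarize(group_by(mtcars, cyl),
                    mean.mpg = mean(mpg),
                    num.cars = n())
```

Each verb takes a data frame as its first argument and returns a new data frame, which is what makes them easy to chain.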

magrittr

  • It is meant to write code in a ‘pipeline’ way
  • Code written using magrittr is more readable
  • It is probably clearer with examples
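A small sketch of the idea, assuming the magrittr package is installed. The pipe operator %>% takes the result on its left and passes it as the first argument of the function on its right, so the code reads in the order the steps happen.

```r
library(magrittr)

# Nested calls are read from the inside out
sum(sqrt(c(1, 4, 9)))                     # 6

# The same computation as a pipeline, read left to right
c(1, 4, 9) %>% sqrt() %>% sum()           # 6

# Pipes work with any function whose first argument is the data
mtcars %>% subset(cyl == 4) %>% nrow()    # 11
```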

Data frame challenges

Do the following using the gapminder data

  • Remove data from Africa and from North Korea (filter())
  • Keep only data from 2007
  • Create a column with the GDP of each country (mutate())
  • Group the data by continent (group_by())
  • Create a variable with the total population of the continent (call it continent.pop)
  • Create a variable with the number of countries per continent (n()); call it num.country
  • Summarize the data frame at continent level (the result should have 4 continents and 4 variables)

Extra