Why learn R?

Advantages of using R

Free
Infinitely more intuitive and regular than Stata
Can be used as a calculator
You can have many objects in your environment
Integrated version control
Potential to understand better what you are doing
Makes reproducible research easy
Huge community willing to help
Kickass presentations like this one

Disadvantages

First time getting clustered standard errors (2 or more way) might be tough
Marginally slower than Stata for regressions with many fixed-effects

Interacting with R(studio)

Basics

R is the program actually doing all the work
Rstudio is an interface to improve the experience
Rstudio layout:
Interactive console (left)
Environment/History (tabbed in upper right)
Files/Plots/Packages/Help/Viewer (tabbed in lower right)
You can easily get Help by typing 1 or 2 interrogation marks in the console
- you know the function cumsum() but you are not sure what it does, type ?cumsum
- you are not sure what you are looking for but it is something about sums, type ??sum

R as calculator

The simplest thing you could do in R is arithmetic and logic

100 + 4 [1] 104

If you type an incomplete statement, R will wait for you. The ‘+’ sign indicates R is waiting for you.

R as calculator (II)

When using R as calculator the order of operations is (highest to lowest precedence)

Parentheses: ( ) Exponents: ^ or ** Divide: / Multiply: * Add: + Subtract: -

3 + 5 * 2
[1] 13

(3 + 5) * 2
[1] 16

R as calculator (III)

R expresses small numbers in scientific notation

2/10000
[1] 2e-04

You can do it too

5e3 # note the lack of minus sign here
[1] 5000

Math functions

28 %% 6
[1] 4

Comparing things

R can also be used to compare things

1 == 1 # note the double equal (is equal to)
[1] TRUE

1 != 2 # note the exclamation mark (is not equal to)
[1] TRUE

1 < 2 # less than
[1] TRUE

1 <= 2 # less than or equal
[1] TRUE

Comparing things (II)

4 > 2 #greater than [1] TRUE

4 >= 4 #greater than or equal [1] TRUE

Some funny things with relational operators (careful!!!!)

'hola' > 4e100 [1] TRUE

T > -5 [1] TRUE

2006.00 == "2006" [1] TRUE

"Tola" < "Tolín" [1] TRUE

"Tolita" < "Tola" [1] FALSE

Further reading about operators in genera here, and about relational operators specifically here.

Variable assignment

To store values, R prefers the operator <-

y <- 1/4 y [1] 0.25

One nice feature of R: evaluations happens before assignment

A <- 100 A <- A+1 A [1] 101

Vectorization

R building block is the vector, for instance

1:5 [1] 1 2 3 4 5

2^(1:5) [1] 2 4 8 16 32

x <- 1:5 2^x [1] 2 4 8 16 32

Managing your environment

If you want a list of all the objects in your environment

ls() [1] "A" "x" "y"

If you want to remove an object

rm(A) ls() [1] "x" "y"

If you want to remove them all

rm(list=ls()) ls() character(0)

R packages

It is possible to add functions to R by writing or obtaining packages. You can

See installed packages by typing installed.packages()
Install packages by typing install.packages(“packagename”), where packagename is the package name, in quotes.
Update installed packages by typing update.packages()
Remove a package with remove.packages(“packagename”)
Make a package available for use with library(packagename)

Challenge

What will be the value of each variable after each statement in the following program?

mass <- 47.5 age <- 122 mass <- mass * 2.3 age <- age - 20

Run the code from the previous challenge, and write a command to compare mass to age. Is mass larger than age? Clean up your working environment by deleting the mass and age variables. Install the following packages: ggplot2, dplyr, gapminder

Data types

R deals with numerous data types. The basic four are:

values like 4.5 are called numerics
natural numbers like 4 are called integers. Integers are also numerics
boolean values (TRUE or FALSE) are called logical
text (or string) values are called characters

Challenge

Look at the help for the c function. What kind of vector do you expect you will create if you evaluate the following:

c(1, 2, 3) c('d', 'e', 'f') c(1, 2, 'f')

Data structures

R can handle many types of structures. We will focus on

Vectors: have no dimension, all elements of the same type Matrices: have dimension, all elements of the same type Data frames: have dimensions, elements can be of different type as long as they are the same length Lists: super flexible, you can put anything in a list

Going to Las Vegas (Challenge)

Assign the value “Go!” to the variable vegas

Create 2 vectors containing your winnings and loses from poker and roulette

For poker.vector:
- On Monday you won $140
- Tuesday you lost $50
- Wednesday you won $20
- Thursday you lost $120
- Friday you won $240
For roulette.vector:
- On Monday you lost $24
- Tuesday you lost $50
- Wednesday you won $100
- Thursday you lost $350
- Friday you won $10

Going to Las Vegas (Challenge) (II)

Elements in vectors can be named

some.vector <- c("Pepito Perez", "Futbolista") names(some.vector) <- c("Nombre", "Profesión")

Following the example, name the elements of the roulette winnings

names(poker.vector) <- c(“Monday”, “Tuesday”, “Wednesday”, “Thursday”, “Friday”) names(roulette.vector) <-

Create a vector with the net winnings of each day. Name it total.daily
Create a named vector telling which days you made winnings. Name it days.won
Calculate the total winnings from the 5 days in Las Vegas. Name it total.week.

Going to Las Vegas (III)

Slicing in R is really simple: if you want the first three elements of a vector just use ‘[]’ and the appropriate indexes (from 1 to 3)

total.daily[1:3] # gets the daily wins from Mon-Wed

You can also perform operations and apply functions to parts of vectors

mean(total.daily) # gives the mean for the whole week mean(total.daily[4,5]) # gives the mean from Thu-Fri

Matrices

Think of matrices as vector with dimensions or a stack of vectors (of the same dimension)

matrix(1:9, nrow = 3, ncol = 3)

.....[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9

By default the function ‘matrix’ fills out the blank matrix by column.

matrix(1:9, byrow = T, ncol = 3)

.....[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9

Matrices challenge

The following vectors are the number of copies sold for 3 of Miles Davis’ albums.
The first number of each vector is the US sales and the second is the rest of the world.

kind.of.blue <- c(460.998, 314.4)
sketches <- c(290.475, 247.900)
miles.ahead <- c(309.306, 165.8)

Combine these three vectors into one single vector called ‘all.sales’.
Using the vector ‘all.sales’, create a matrix called ‘miles.sales’ that has US and rest of the world sales as columns.
Using help or Google, label the columns and rows of the matrix properly.

Selection of matrix elements

We can naturally slice a matrix by using the [] operator

some.matrix[3,6] # gets the element in the 3rd row and 6th column some.matrix[,6] # gets all the elements in the 6th column some.matrix[3,] # gets all the elements in the 3rd row

Challenge

Create a 50 by 20 matrix with the numbers from 1 to 1000 (the first column is the numbers from 1-50). Name it my.matrix Extract the following matrix from my.matrix

.....[,1] [,2] [,3] [,4]
[1,] 39 89 139 189
[2,] 40 90 140 190
[3,] 41 91 141 191
[4,] 42 92 142 192
[5,] 43 93 143 193
[6,] 44 94 144 194
[7,] 45 95 145 195

Arithmetic with matrices (I)

Just as vectors, the normal operators perform element-wise operations

a <- matrix(5:8, nrow = 2, ncol = 2)
b <- matrix(1:4, byrow = T, ncol = 2)
c <- a * b

Scalars work the same way

d <- matrix(1:9, nrow = 3) pi * d

......[,1] [,2] [,3]
[1,] 3.141593 12.56637 21.99115
[2,] 6.283185 15.70796 25.13274
[3,] 9.424778 18.84956 28.27433

Arithmetic with matrices (II)

If we want to use standard matrix multiplication we need to use the operator %*%

e <- matrix(1:6, nrow = 3, ncol = 2)
f <- matrix(11:20, nrow = 2, ncol = 5)
g <- e %*% f

For inverse we use the function solve()

h <- matrix(rnorm(9), nrow = 3)
h.inv <- solve(h)
for transpose the function t()

i <- matrix(21:30, byrow = T, ncol = 2)
j <- t(i)

Functions and matrices

Applying functions to columns or rows is as easy as

k <- matrix(rnorm(100), byrow = T, nrow = 10)

column.mean <- mean(k[,1])
row.mean <- mean(k[4,])

Challenge:

Calculate the mean and standard deviation for Miles Davis’ sales for US and non-US markets separately.

Factors

Are the categorical variables in R. It is not just a name. They are treated particularly. They are extremely important for research in Economics.

genre <- c('bebop', 'modal', 'cool', 'modal', 'cool', 'cool') genre <- factor(genre)

Some factors are naturally ordered (e.g days of the week) while some others aren’t

fruits <- c('Lulo', 'Guama', 'Carambolo', 'Papaya', 'Guama') fruits <- factor(fruits)

temperatures <- c('High', 'Medium', 'High', 'High', 'Low', 'Low') temperatures <- factor(temperatures, ordered = T, levels = c('Low', 'Medium', 'High'))

And you can always change the name of your levels

levels(temperatures) <- c('L', 'M', 'H')

Data frames

This is the most typical data structure that economists work with (nowadys tibbles are even better)

Data in data frames can be of numerous types (but same length)

A typical example is household surveys

name (character)
age (numeric)
married (logical)

Some ways of getting to know your DF

mtcars # pre-loaded data frame always available in R

head(mtcars)
tail(mtcars)
str(mtcars)

Creating dataframes

Most of the time you will read a csv table to create DF.

Sometimes, though, you have to create them

planets <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")

type <- c("Terrestrial", "Terrestrial", "Terrestrial", "Terrestrial", "Gas giant", "Gas giant", "Gas giant", "Gas giant")

diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

Challenge

Create a data frame with the previous vectors. Call it planets.df.

Use the function str() to analyze the structure of planets.df...

Slicing data frames

Similar to matrices, the [] can be used

df[1,2] # selects the first elemnt of column 2
df[1:3, 2:8] # selects the first 3 elemnts of columns 2 to 8
df[1,] # what would this select?

names of columns can also be used to slice

Challenge

Print out the value of the diameter of Mercury

Print out all data on Mars

Select the diameter of the furthest planets

Slicing data frames: columns

There are three ways of selecting columns

planets.df[,3]
planets.df[,'diameter']
planets.df$diameter

The last way is clearly less verbose; more concise.

Data frame: conditional selecting

Suppose we want to work only with planets that have rings.

has.rings <- planets.df$rings

planets.with.rings.df <- planets.df[rings,]

the function subset() can be used as a shorthand

planets.with.rings.df2 <- subset(planets.df, subset = rings)

Challenge

Create a dataframe with the planets whose diameter is less than 1.

Lists

Think of lists as rucksacks or backpacks. At the end you want to put many things in one container.

my.list <- list(a,b,c,d, genre.vector, planets.df)
my.list

If you want the elements of the list to have names

my.named.list <- list(some.matrix = b, some.df = planets.df, my.age = 36)

Working with Dplyr and Magrittr: prelude

Let’s open a new R-script. Name it my.script.R

Using the function read.csv() read the csv file with the gapminder data.

This data is clean. It usually isn’t.

load the libraries dplyr and magrittr

Challenge

Describe the data frame (dimensions and data types)

dplyr

It was created with data frames in mind

It is awesome to manipulate data in many, many useful ways

the names of the functions are the same verbs we’d use to describe what we want

filter() (and slice())
arrange()
select() (and rename())
distinct()
mutate() (and transmute())
summarize()
sample_n() (and sample_frac())

magrittr

It is meant to write code in a ‘pipeline’ way Code written using magrittr is more readable It is probably clearer with examples

Data frame challenges

Do the following using the gapminder data

Remove data from Africa and from North Korea (filter()) keep only data from 2007 Create a column with the gdp of each country (mutate()) Group the data by continent (group_by()) Create a variable with the total population of the continent(call it continent.pop). Create a variable with the number of countries per continent (n()) call it num.country Summarize the data frame at continent level (the result should have 4 continents and 4 variables)

Guía de Estudio 1: aprendiendo R

Propósito

Why learn R?

Advantages of using R

Disadvantages

Interacting with R(studio)

Basics

R as calculator

R as calculator (II)

R as calculator (III)

Comparing things

Comparing things (II)

Variable assignment

Vectorization

Managing your environment

R packages

Challenge

Data types

Challenge

Data structures

Going to Las Vegas (Challenge)

Going to Las Vegas (Challenge) (II)

Going to Las Vegas (III)

Matrices

Matrices challenge

Selection of matrix elements

Challenge

Arithmetic with matrices (I)

Arithmetic with matrices (II)

Functions and matrices

Challenge:

Factors

Data frames

Creating dataframes

Challenge

Slicing data frames

Challenge

Slicing data frames: columns

Data frame: conditional selecting

Challenge

Lists

Working with Dplyr and Magrittr: prelude

Challenge

dplyr

magrittr

Data frame challenges

Extra