The aim of this guide is to help you get familiar with R and RStudio, the software we will use in the Econometría 2 (2020) course. R is the statistical program, and RStudio is an interface that makes working with R friendlier. If you do not have them yet, you need to download them. Here you will find the R installer, and here the RStudio installer; install R before installing RStudio.
From this point on, the guide was written in English (I’m recycling it from when I taught it abroad).
If you want to use a function such as cumsum() but you are not sure what it does, type ?cumsum. To search the help pages for a keyword, type ??sum.
The simplest thing you could do in R is arithmetic and logic
100 + 4
[1] 104
If you type an incomplete statement, R will wait for you. The ‘+’ sign indicates R is waiting for you.
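When using R as a calculator, the order of operations is (highest to lowest precedence)
Parentheses: ( )
Exponents: ^ or **
Divide and multiply: / and * (equal precedence, evaluated left to right)
Add and subtract: + and - (equal precedence, evaluated left to right)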
3 + 5 * 2
[1] 13
(3 + 5) * 2
[1] 16
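A few more cases make the precedence rules concrete. This is only a small sketch; the expressions are illustrative and were chosen for this example.

```r
# Exponentiation binds tighter than the unary minus, so this is -(2^2)
-2^2        # -4
(-2)^2      # 4

# Division and multiplication have the same precedence and are
# evaluated left to right: (10 / 2) * 5
10 / 2 * 5  # 25

# Exponents are evaluated right to left: 2^(3^2) = 2^9
2^3^2       # 512
```

When in doubt, add parentheses: they cost nothing and make the intention explicit.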
R expresses small numbers in scientific notation
2/10000
[1] 2e-04
You can do it too
5e3 # note the lack of minus sign here
[1] 5000
Math functions
28 %% 6
[1] 4
R can also be used to compare things
1 == 1 # note the double equal (is equal to)
[1] TRUE
1 != 2 # note the exclamation mark (is not equal to)
[1] TRUE
1 < 2 # less than
[1] TRUE
1 <= 2 # less than or equal
[1] TRUE
4 > 2 # greater than
[1] TRUE
4 >= 4 # greater than or equal
[1] TRUE
Some funny things with relational operators (careful!!!!)
'hola' > 4e100
[1] TRUE
T > -5
[1] TRUE
2006.00 == "2006"
[1] TRUE
"Tola" < "Tolín"
[1] TRUE
"Tolita" < "Tola"
[1] FALSE
Further reading about operators in general here, and about relational operators specifically here.
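The surprising comparisons between numbers, logicals and strings come from type coercion: when the two sides have different types, R converts one of them before comparing. A number compared with a string is turned into a string first, and a logical compared with a number is turned into 0 or 1. The sketch below makes those hidden conversions explicit; the variable name year.text is just an illustration.

```r
# A number compared with a string is first converted to text
as.character(4e100)      # "4e+100"
as.character(2006.00)    # "2006", which is why 2006.00 == "2006" is TRUE

# A logical compared with a number is first converted to 0 or 1
as.numeric(TRUE)         # 1
TRUE > -5                # TRUE

# A safer habit: convert explicitly, then compare like with like
year.text <- "2006"
as.numeric(year.text) == 2006   # TRUE
```

Comparisons between strings follow the sort order of your locale, so results with accented characters can differ between computers.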
To store values, R prefers the operator <-
y <- 1/4
y
[1] 0.25
One nice feature of R: evaluation happens before assignment
A <- 100
A <- A+1
A
[1] 101
R's building block is the vector, for instance
1:5
[1] 1 2 3 4 5
2^(1:5)
[1] 2 4 8 16 32
x <- 1:5
2^x
[1] 2 4 8 16 32
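Because the vector is the basic unit, arithmetic and comparisons are applied to every element at once. A brief sketch, with x defined again so the example runs on its own:

```r
x <- 1:5

x + 10     # 11 12 13 14 15
x * x      # 1 4 9 16 25
x > 3      # FALSE FALSE FALSE TRUE TRUE

# Functions such as sum() and length() summarise the whole vector
sum(x)     # 15
length(x)  # 5
```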
If you want a list of all the objects in your environment
ls()
[1] "A" "x" "y"
If you want to remove an object
rm(A)
ls()
[1] "x" "y"
If you want to remove them all
rm(list=ls())
ls()
character(0)
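It is possible to add functions to R by writing or obtaining packages. You can install a package with install.packages("packagename") and load it in a session with library(packagename).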
What will be the value of each variable after each statement in the following program?
mass <- 47.5
age <- 122
mass <- mass * 2.3
age <- age - 20
Run the code from the previous challenge, and write a command to compare mass to age. Is mass larger than age?
Clean up your working environment by deleting the mass and age variables.
Install the following packages: ggplot2, dplyr, gapminder
R deals with numerous data types. The basic four are:
Look at the help for the c function. What kind of vector do you expect you will create if you evaluate the following:
c(1, 2, 3)
c('d', 'e', 'f')
c(1, 2, 'f')
R can handle many types of structures. We will focus on
Vectors: have no dimension, all elements of the same type
Matrices: have dimensions, all elements of the same type
Data frames: have dimensions, columns can be of different types as long as they have the same length
Lists: super flexible, you can put anything in a list
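A quick way to see the difference between these four structures is to build a tiny one of each and ask R for its dimensions. This is a minimal sketch; all object names and values are made up for the illustration.

```r
v  <- c(1.5, 2.5, 3.5)                      # vector: one type, no dimension
m  <- matrix(1:6, nrow = 2)                 # matrix: one type, 2 rows x 3 columns
df <- data.frame(id = 1:3,                  # data frame: columns of different
                 name = c("a", "b", "c"),   # types, all of length 3
                 stringsAsFactors = FALSE)
l  <- list(v, m, df, "anything")            # list: holds objects of any kind

dim(v)      # NULL
dim(m)      # 2 3
dim(df)     # 3 2
length(l)   # 4
```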
Assign the value “Go!” to the variable vegas
Create 2 vectors containing your winnings and loses from poker and roulette
poker.vector:
roulette.vector:
Elements in vectors can be named
some.vector <- c("Pepito Perez", "Futbolista")
names(some.vector) <- c("Nombre", "Profesión")
Following the example, name the elements of the roulette winnings
names(poker.vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(roulette.vector) <-
total.daily
Slicing in R is really simple: if you want the first three elements of a vector, just use ‘[]’ and the appropriate indexes (from 1 to 3)
total.daily[1:3] # gets the daily wins from Mon-Wed
You can also perform operations and apply functions to parts of vectors
mean(total.daily) # gives the mean for the whole week
mean(total.daily[4:5]) # gives the mean from Thu-Fri
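Besides positions, a vector can be sliced by name or with a logical condition. The sketch below uses a made-up vector called wins (hypothetical numbers, not the exercise data) so that it runs on its own.

```r
# Hypothetical daily results, invented for this example
wins <- c(10, -5, 20, -15, 30)
names(wins) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

wins[1:3]                       # by position: Monday to Wednesday
wins[c("Thursday", "Friday")]   # by name
wins[wins > 0]                  # by condition: only the winning days
mean(wins[4:5])                 # mean of Thursday and Friday: 7.5
```

Note that a vector has a single index, so several positions go inside one c() or a range such as 4:5.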
Think of matrices as vectors with dimensions, or as a stack of vectors (of the same length)
matrix(1:9, nrow = 3, ncol = 3)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
By default the function ‘matrix’ fills out the blank matrix by column.
matrix(1:9, byrow = T, ncol = 3)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
The following vectors are the number of copies sold for 3 of Miles Davis’ albums.
The first number of each vector is the US sales and the second is the rest of the world.
kind.of.blue <- c(460.998, 314.4)
sketches <- c(290.475, 247.900)
miles.ahead <- c(309.306, 165.8)
Combine these three vectors into one single vector called ‘all.sales’.
Using the vector ‘all.sales’, create a matrix called ‘miles.sales’ that has US and rest of the world sales as columns.
Using help or Google, label the columns and rows of the matrix properly.
We can naturally slice a matrix by using the [] operator
some.matrix[3,6] # gets the element in the 3rd row and 6th column
some.matrix[,6] # gets all the elements in the 6th column
some.matrix[3,] # gets all the elements in the 3rd row
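To try these commands you need an actual matrix. The sketch below assumes a hypothetical 4 by 6 matrix named some.matrix, filled by column with the numbers 1 to 24.

```r
# A hypothetical 4 x 6 matrix, filled by column with 1..24
some.matrix <- matrix(1:24, nrow = 4, ncol = 6)

some.matrix[3, 6]   # 23: a single element
some.matrix[, 6]    # 21 22 23 24: the 6th column, returned as a vector
some.matrix[3, ]    # 3 7 11 15 19 23: the 3rd row, returned as a vector

# drop = FALSE keeps the result as a matrix (here 1 row x 6 columns)
dim(some.matrix[3, , drop = FALSE])   # 1 6
```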
Create a 50 by 20 matrix with the numbers from 1 to 1000 (the first column is the numbers from 1-50). Name it my.matrix
Extract the following matrix from my.matrix
     [,1] [,2] [,3] [,4]
[1,]   39   89  139  189
[2,]   40   90  140  190
[3,]   41   91  141  191
[4,]   42   92  142  192
[5,]   43   93  143  193
[6,]   44   94  144  194
[7,]   45   95  145  195
Just as with vectors, the normal operators perform element-wise operations
a <- matrix(5:8, nrow = 2, ncol = 2)
b <- matrix(1:4, byrow = T, ncol = 2)
c <- a * b
Scalars work the same way
d <- matrix(1:9, nrow = 3)
pi * d
         [,1]     [,2]     [,3]
[1,] 3.141593 12.56637 21.99115
[2,] 6.283185 15.70796 25.13274
[3,] 9.424778 18.84956 28.27433
If we want to use standard matrix multiplication we need to use the operator %*%
e <- matrix(1:6, nrow = 3, ncol = 2)
f <- matrix(11:20, nrow = 2, ncol = 5)
g <- e %*% f
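It is easy to confuse * with %*%, so it may help to see both applied to the same pair of matrices. The sketch below redefines a, b, e and f exactly as in the text so that it runs on its own.

```r
a <- matrix(5:8, nrow = 2, ncol = 2)       # columns are (5, 6) and (7, 8)
b <- matrix(1:4, byrow = TRUE, ncol = 2)   # rows are (1, 2) and (3, 4)

a * b     # element-wise product: rows (5, 14) and (18, 32)
a %*% b   # matrix product:       rows (26, 38) and (30, 44)

# %*% needs conformable dimensions: (3 x 2) times (2 x 5) gives 3 x 5
e <- matrix(1:6, nrow = 3, ncol = 2)
f <- matrix(11:20, nrow = 2, ncol = 5)
dim(e %*% f)   # 3 5
```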
For inverse we use the function solve()
h <- matrix(rnorm(9), nrow = 3)
h.inv <- solve(h)
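A simple sanity check for an inverse is to multiply it by the original matrix and confirm you get the identity. The sketch below does that; the seed is arbitrary and only there to make the random numbers reproducible, and the vector b is made up for the example.

```r
set.seed(1)                        # arbitrary seed, only for reproducibility
h <- matrix(rnorm(9), nrow = 3)
h.inv <- solve(h)

# h %*% h.inv should be the 3 x 3 identity matrix, up to rounding error
round(h %*% h.inv, 10)
all.equal(h %*% h.inv, diag(3))    # TRUE

# solve(h, b) solves the linear system h x = b directly
b <- c(1, 2, 3)
x <- solve(h, b)
all.equal(as.vector(h %*% x), b)   # TRUE
```

all.equal() is used instead of == because floating point results are rarely exactly equal.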
For the transpose, use the function t()
i <- matrix(21:30, byrow = T, ncol = 2)
j <- t(i)
Applying functions to columns or rows is as easy as
k <- matrix(rnorm(100), byrow = T, nrow = 10)
column.mean <- mean(k[,1])
row.mean <- mean(k[4,])
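When you need the same statistic for every column or every row, slicing one at a time gets tedious. One common alternative is apply(), sketched below on a matrix like k; the seed is arbitrary and the names col.means, row.means and col.sds are only illustrative.

```r
set.seed(42)                                     # arbitrary seed
k <- matrix(rnorm(100), byrow = TRUE, nrow = 10)

col.means <- apply(k, 2, mean)   # second argument 2: once per column
row.means <- apply(k, 1, mean)   # second argument 1: once per row
col.sds   <- apply(k, 2, sd)     # any function works, e.g. standard deviation

# colMeans() and rowMeans() are built-in shortcuts for the mean
all.equal(col.means, colMeans(k))   # TRUE
all.equal(row.means, rowMeans(k))   # TRUE
```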
Calculate the mean and standard deviation for Miles Davis’ sales for US and non-US markets separately.
Factors are the categorical variables in R. A factor is not just a label: R treats factors in a particular way. They are extremely important for research in Economics.
genre <- c('bebop', 'modal', 'cool', 'modal', 'cool', 'cool')
genre <- factor(genre)
Some factors are naturally ordered (e.g. days of the week) while others aren’t
fruits <- c('Lulo', 'Guama', 'Carambolo', 'Papaya', 'Guama')
fruits <- factor(fruits)
temperatures <- c('High', 'Medium', 'High', 'High', 'Low', 'Low')
temperatures <- factor(temperatures, ordered = T, levels = c('Low', 'Medium', 'High'))
And you can always change the name of your levels
levels(temperatures) <- c('L', 'M', 'H')
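To see what "treated particularly" means in practice, it helps to look inside a factor. The sketch below rebuilds the ordered temperatures factor with its original level names so that it runs on its own.

```r
temperatures <- factor(c('High', 'Medium', 'High', 'High', 'Low', 'Low'),
                       ordered = TRUE,
                       levels = c('Low', 'Medium', 'High'))

levels(temperatures)       # "Low" "Medium" "High"
table(temperatures)        # counts per level: Low 2, Medium 1, High 3
as.integer(temperatures)   # the underlying integer codes: 3 2 3 3 1 1

# Ordered factors can be compared, following the order of the levels
temperatures[1] > temperatures[2]   # TRUE, because High > Medium
as.character(max(temperatures))     # "High"
```

Internally a factor stores integer codes plus the list of levels, which is why statistical functions can treat it as a categorical variable rather than as text.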
Data frames are the most typical data structure that economists work with (nowadays tibbles are even better)
Columns in a data frame can be of different types (but must have the same length)
A typical example is household surveys
Some ways of getting to know your DF
mtcars # pre-loaded data frame always available in R
head(mtcars)
tail(mtcars)
str(mtcars)
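A few other base R functions are handy for a first look at a data frame. A short sketch using the same mtcars data:

```r
dim(mtcars)             # 32 11: number of rows and columns
names(mtcars)           # column names, starting with "mpg" "cyl" "disp"
summary(mtcars$mpg)     # minimum, quartiles, mean and maximum of one column
sapply(mtcars, class)   # data type of every column (all "numeric" here)
```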
Most of the time you will read a csv table to create a DF.
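Reading a csv file is done with read.csv(). So that the sketch below runs anywhere, it first writes a tiny made-up table to a temporary file and then reads it back; in practice you would pass the path of your own file. The names tmp.file, toy and toy.df are hypothetical.

```r
# Write a small made-up table to a temporary csv file
tmp.file <- tempfile(fileext = ".csv")
toy <- data.frame(country = c("Colombia", "Peru"),
                  year = c(2007, 2007),
                  stringsAsFactors = FALSE)
write.csv(toy, tmp.file, row.names = FALSE)

# Read it back into a data frame
toy.df <- read.csv(tmp.file, stringsAsFactors = FALSE)
str(toy.df)   # 2 observations of 2 variables
```

Setting stringsAsFactors = FALSE keeps text columns as character in older versions of R, where the default was to convert them to factors.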
Sometimes, though, you have to create them
planets <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial", "Terrestrial", "Terrestrial", "Terrestrial", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
Create a data frame with the previous vectors. Call it planets.df.
Use the function str() to analyze the structure of planets.df...
Similar to matrices, the [] operator can be used
df[1,2] # selects the first element of column 2
df[1:3, 2:8] # selects the first 3 elements of columns 2 to 8
df[1,] # what would this select?
Names of columns can also be used to slice
Print out the value of the diameter of Mercury
Print out all data on Mars
Select the diameter of the furthest planets
There are three ways of selecting columns
planets.df[,3]
planets.df[,'diameter']
planets.df$diameter
The last way is clearly the most concise.
Suppose we want to work only with planets that have rings.
has.rings <- planets.df$rings
planets.with.rings.df <- planets.df[has.rings,]
The function subset() can be used as a shorthand
planets.with.rings.df2 <- subset(planets.df, subset = rings)
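The same two approaches work with any condition, not only with a column that is already logical. The sketch below uses the built-in mtcars data, where the column am is 1 for cars with a manual transmission; the object names are illustrative.

```r
is.manual  <- mtcars$am == 1                     # logical vector, one value per row
manual.df  <- mtcars[is.manual, ]                # keep the rows where it is TRUE
manual.df2 <- subset(mtcars, subset = am == 1)   # same result with subset()

nrow(manual.df)                    # 13
all.equal(manual.df, manual.df2)   # TRUE

# Conditions can be combined with & (and) and | (or)
nrow(mtcars[mtcars$am == 1 & mtcars$cyl == 4, ])   # 8
```

Inside subset() the column names can be written directly, without the mtcars$ prefix.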
Create a dataframe with the planets whose diameter is less than 1.
Think of lists as rucksacks or backpacks. At the end you want to put many things in one container.
my.list <- list(a,b,c,d, genre, planets.df)
my.list
If you want the elements of the list to have names
my.named.list <- list(some.matrix = b, some.df = planets.df, my.age = 36)
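Once the elements have names, you can take them out of the list by name. The sketch below builds a small named list with made-up contents so that it runs on its own.

```r
my.named.list <- list(some.vector = 1:3,
                      some.text = "hola",
                      my.age = 36)

names(my.named.list)           # "some.vector" "some.text" "my.age"
my.named.list$my.age           # 36
my.named.list[["some.text"]]   # "hola"

# Single brackets return a shorter list; double brackets return the element
class(my.named.list["my.age"])     # "list"
class(my.named.list[["my.age"]])   # "numeric"
```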
Let’s open a new R-script. Name it my.script.R
Using the function read.csv() read the csv file with the gapminder data.
This data is clean. It usually isn’t.
Load the libraries dplyr and magrittr
Describe the data frame (dimensions and data types)
dplyr was created with data frames in mind
It is awesome for manipulating data in many, many useful ways
The names of its functions are the same verbs we’d use to describe what we want
magrittr is meant for writing code in a ‘pipeline’ way
Code written using magrittr is more readable
It is probably clearer with examples
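Here is one possible example of the pipeline style, using the built-in mtcars data rather than gapminder. It assumes dplyr and magrittr are installed; the column and object names (hp.per.wt, num.cars, mean.hp, by.cyl) are made up for the illustration.

```r
library(dplyr)      # data manipulation verbs; also provides the pipe %>%
library(magrittr)   # the package where the pipe %>% comes from

# Without the pipe: nested calls have to be read from the inside out
summarise(group_by(filter(mtcars, am == 1), cyl), mean.hp = mean(hp))

# With the pipe: the same kind of steps, read from top to bottom
by.cyl <- mtcars %>%
  filter(am == 1) %>%                 # keep cars with manual transmission
  mutate(hp.per.wt = hp / wt) %>%     # add a new column
  group_by(cyl) %>%                   # one group per number of cylinders
  summarise(num.cars = n(),           # number of rows in each group
            mean.hp = mean(hp))       # average horsepower in each group

by.cyl   # 3 rows, one per value of cyl (4, 6, 8)
```

The operator %>% passes the result on its left as the first argument of the function on its right, which is what lets each step sit on its own line.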
Do the following using the gapminder data
Remove data from Africa and from North Korea (filter())
Keep only data from 2007
Create a column with the gdp of each country (mutate())
Group the data by continent (group_by())
Create a variable with the total population of the continent (call it continent.pop).
Create a variable with the number of countries per continent (n()); call it num.country
Summarize the data frame at continent level (the result should have 4 continents and 4 variables)