Class 01: Basic Aspects of R

Introduction

R is a language for statistical analysis, data processing, and graphics. It is based on the S language, which was developed at Bell Laboratories (AT&T) largely through the efforts of John Chambers.

R provides a wide variety of tools for data analysis. One of its big advantages is that it is open-source software, so people from different parts of the world can contribute to its development and improve its performance.

R can be downloaded from https://cran.r-project.org/mirrors.html; the complete list of packages available for download is at https://cran.r-project.org/web/packages/available_packages_by_name.html.

Preparing your environment

One of the first things to remember is to clean your environment with the following code. rm() stands for remove; ls() lists the objects that were previously created and remain in the workspace, so passing that list to rm() deletes all of them.

rm(list = ls())

Basic mathematical operations

R evaluates arithmetic expressions directly, for example:

1+1

## [1] 2

2*3

## [1] 6

log(exp(sin(pi/4)^2)*exp(cos(pi/4)^2))

## [1] 1
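The last result is exactly 1 because of a logarithm rule and the Pythagorean identity. Written out, with the symbol \(\theta = \pi/4\) introduced here only for illustration:

\[
\log\left(e^{\sin^2\theta}\, e^{\cos^2\theta}\right) = \log\left(e^{\sin^2\theta + \cos^2\theta}\right) = \sin^2\theta + \cos^2\theta = 1
\]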

We can also use R to create numerical vectors and then analyze them:

x <- c(-5, 0, 1.8, 3.14, 4, 88.169, 13, 2, 5.263, 10.025)
x

##  [1] -5.000  0.000  1.800  3.140  4.000 88.169 13.000  2.000  5.263 10.025

The number of elements in the vector x:

length(x)

## [1] 10

And some mathematical operations with the vector:

2*x+3

##  [1]  -7.000   3.000   6.600   9.280  11.000 179.338  29.000   7.000  13.526
## [10]  23.050

7:1+x

## Warning in 7:1 + x: longer object length is not a multiple of shorter object
## length

##  [1]  2.000  6.000  6.800  7.140  7.000 90.169 14.000  9.000 11.263 15.025

7:1 + 1:7

## [1] 8 8 8 8 8 8 8

7:1*x + 1:7

## Warning in 7:1 * x: longer object length is not a multiple of shorter object
## length

## Warning in 7:1 * x + 1:7: longer object length is not a multiple of shorter
## object length

##  [1] -34.000   2.000  12.000  16.560  17.000 182.338  20.000  15.000  33.578
## [10]  53.125
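The warnings above come from R's recycling rule: when two vectors of different lengths are combined, the shorter one is repeated until it matches the longer one, and R warns when the longer length is not a multiple of the shorter. A minimal sketch with made-up vectors (the names short_v and long_v are only for illustration):

```r
short_v <- 1:2
long_v  <- 1:6

# short_v is recycled to 1 2 1 2 1 2 before the addition
# (no warning here, because 6 is a multiple of 2)
recycled_sum <- short_v + long_v

# The same result with the recycling written out explicitly
explicit_sum <- rep(short_v, length.out = length(long_v)) + long_v

recycled_sum                           # 2 4 4 6 6 8
identical(recycled_sum, explicit_sum)  # TRUE
```

In 7:1 + x the shorter vector has 7 elements and x has 10, so the recycling stops partway through a repetition, which is what triggers the warning.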

log(x)

## Warning in log(x): NaNs produced

##  [1]       NaN      -Inf 0.5877867 1.1442228 1.3862944 4.4792554 2.5649494
##  [8] 0.6931472 1.6607012 2.3050820

In the case of log(x), when an element is negative, R returns NaN, which stands for Not a Number; similarly, when the element is zero, R returns -Inf, which stands for \(-\infty\).
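These special values can be detected and filtered out before further analysis. A small sketch, using a shortened made-up vector v rather than the x from the text:

```r
v <- c(-5, 0, 1.8)
log_v <- suppressWarnings(log(v))  # silence the "NaNs produced" warning

is.nan(log_v)            # TRUE only for the negative input
is.infinite(log_v)       # TRUE only for the zero input
log_v[is.finite(log_v)]  # keep only the ordinary numbers
```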

Working with vectors

We can select a single element or a set of elements from our vector; a negative index drops the listed positions instead:

x[c(1, 4)]

## [1] -5.00  3.14

x[c(1:4)]

## [1] -5.00  0.00  1.80  3.14

x[-c(2, 3, 5, 7)]

## [1] -5.000  3.140 88.169  2.000  5.263 10.025

We can also create vectors automatically with commands such as:

rep(a, b): repeats the element a b times.
seq(a, b, c): creates a vector that starts at a and runs up to b in steps of c.
t1:t2: creates the sequence from \(t_1\) to \(t_2\) in steps of 1, for example a range of years.

ones <- rep(1, 10)
ones

##  [1] 1 1 1 1 1 1 1 1 1 1

even <- seq(from = 2, to = 20, by = 2)
even

##  [1]  2  4  6  8 10 12 14 16 18 20

even2 <- seq(2,20,2)
even2

##  [1]  2  4  6  8 10 12 14 16 18 20

trend <- 1986:2020
trend

##  [1] 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000
## [16] 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
## [31] 2016 2017 2018 2019 2020

y <- c(ones, even)
y

##  [1]  1  1  1  1  1  1  1  1  1  1  2  4  6  8 10 12 14 16 18 20
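The same commands accept a few other arguments that are often handy. The sketch below is only illustrative; the object names are made up for this example:

```r
by_times <- rep(c(1, 2), times = 3)    # whole vector repeated: 1 2 1 2 1 2
by_each  <- rep(c(1, 2), each = 3)     # each element repeated: 1 1 1 2 2 2
grid     <- seq(0, 1, length.out = 5)  # 5 equally spaced points from 0 to 1
index    <- seq_len(4)                 # 1 2 3 4

by_times
by_each
grid
index
```

Using length.out instead of by is convenient when the number of points matters more than the step size.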

Matrices

R is also suitable for creating matrices and applying algebraic operations to them. Note that matrix() fills the matrix column by column: from top to bottom within each column, and then from left to right across columns. The number of rows can be specified with the nrow option.

A <- matrix(1:6, nrow = 2)
A

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

B <- matrix(1:6, nrow = 3)
B

##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
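If the values should be laid out row by row instead, matrix() has a byrow argument. A short sketch (the names M_col and M_row are made up for this example):

```r
M_col <- matrix(1:6, nrow = 2)                # default: filled column by column
M_row <- matrix(1:6, nrow = 2, byrow = TRUE)  # filled row by row

M_row
# Filling by row gives the transpose of the 3 x 2 column-filled matrix
identical(M_row, t(matrix(1:6, nrow = 3)))    # TRUE
```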

Additionally, we can transpose the matrix with t(), get its dimensions with dim(), and obtain the number of rows with nrow() or the number of columns with ncol().

t(A)

##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6

dim(A)

## [1] 2 3

nrow(A)

## [1] 2

ncol(A)

## [1] 3

It is possible to build new matrices from a previously created matrix:

A1 <- A[1:2, c(1, 3)]
A1

##      [,1] [,2]
## [1,]    1    5
## [2,]    2    6

Some other matrix operations are: obtaining the eigenvalues and eigenvectors of the matrix with eigen(), getting the inverse matrix with solve(), and multiplying matrices with %*%.

eigen(A1)

## eigen() decomposition
## $values
## [1]  7.5311289 -0.5311289
## 
## $vectors
##            [,1]       [,2]
## [1,] -0.6078802 -0.9561723
## [2,] -0.7940288  0.2928046

solve(A1)

##      [,1]  [,2]
## [1,] -1.5  1.25
## [2,]  0.5 -0.25

A1 %*% solve(A1)

##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1
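It is easy to confuse the element-wise product * with the matrix product %*%. The sketch below contrasts them on a matrix S with the same values as A1, and also uses solve() with a second argument to solve a linear system directly; the right-hand side b is a made-up example:

```r
S <- matrix(c(1, 2, 5, 6), nrow = 2)  # same values as A1 in the text

elementwise <- S * S    # multiplies entry by entry
matprod     <- S %*% S  # row-by-column matrix product

b   <- c(6, 8)          # hypothetical right-hand side
sol <- solve(S, b)      # solves S %*% sol = b without forming the inverse

elementwise
matprod
sol                     # 1 1
```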

As with vectors, R lets us create matrices from pre-specified patterns. For example, diag(n) creates the \(n \times n\) identity matrix, so diag(4) is a \(4 \times 4\) matrix with 1s on the main diagonal and 0s elsewhere.

D <- diag(4)
D

##      [,1] [,2] [,3] [,4]
## [1,]    1    0    0    0
## [2,]    0    1    0    0
## [3,]    0    0    1    0
## [4,]    0    0    0    1

The code diag(a, b, c) creates a matrix with \(b\) rows and \(c\) columns that has the value \(a\) on the main diagonal.

diag(1, 4, 4)

##      [,1] [,2] [,3] [,4]
## [1,]    1    0    0    0
## [2,]    0    1    0    0
## [3,]    0    0    1    0
## [4,]    0    0    0    1

upper.tri() returns a logical matrix that is TRUE for the elements strictly above the main diagonal.

upper.tri(D)

##       [,1]  [,2]  [,3]  [,4]
## [1,] FALSE  TRUE  TRUE  TRUE
## [2,] FALSE FALSE  TRUE  TRUE
## [3,] FALSE FALSE FALSE  TRUE
## [4,] FALSE FALSE FALSE FALSE
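The logical matrix is mostly useful as an index. A brief sketch, using a made-up matrix M and the companion function lower.tri():

```r
M <- matrix(1:16, nrow = 4)

above <- M[upper.tri(M)]  # elements above the diagonal, read column by column
above                     # 5 9 10 13 14 15

U <- M
U[lower.tri(U)] <- 0      # zero out everything below the diagonal
U                         # an upper-triangular matrix
```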

We can combine matrices into new ones by using cbind() to bind them by columns and rbind() to bind them by rows.

cbind(1, A1)

##      [,1] [,2] [,3]
## [1,]    1    1    5
## [2,]    1    2    6

rbind(A1, diag(4, 2))

##      [,1] [,2]
## [1,]    1    5
## [2,]    2    6
## [3,]    4    0
## [4,]    0    4

Logical and String vectors

Let's take the vector \(x\) we previously created and check which values are larger than 3.5:

##  [1] -5.000  0.000  1.800  3.140  4.000 88.169 13.000  2.000  5.263 10.025

x>3.5

##  [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE

We can name the elements of the vector, either by typing the names or by using the built-in letters vector:

names(x) <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
x

##      a      b      c      d      e      f      g      h      i      j 
## -5.000  0.000  1.800  3.140  4.000 88.169 13.000  2.000  5.263 10.025

names(x) <- letters[1:10]
x

##      a      b      c      d      e      f      g      h      i      j 
## -5.000  0.000  1.800  3.140  4.000 88.169 13.000  2.000  5.263 10.025

We can subset the vector x by position, by name, or by a logical condition:

x[c(5:7,9:10)]

##      e      f      g      i      j 
##  4.000 88.169 13.000  5.263 10.025

x[c("e", "f", "g", "i", "j")]

##      e      f      g      i      j 
##  4.000 88.169 13.000  5.263 10.025

x[x>3.5]

##      e      f      g      i      j 
##  4.000 88.169 13.000  5.263 10.025
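Because TRUE counts as 1 and FALSE as 0, a logical vector can also be used to count or to compute shares. A small sketch on a copy of the values of x (named v here only to keep the example self-contained):

```r
v <- c(-5, 0, 1.8, 3.14, 4, 88.169, 13, 2, 5.263, 10.025)  # same values as x
above <- v > 3.5

sum(above)    # number of elements above 3.5: 5
mean(above)   # share of elements above 3.5: 0.5
which(above)  # their positions: 5 6 7 9 10
```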

Lists

In R we can create a list of elements and extract its components. For example, we create a list that stores a sample drawn from a normal distribution, the name of the distribution, and its parameters (mean and standard deviation); we can then extract the sample for other analyses.

list_norm <- list(sample = rnorm(10), family = "normal distribution", parameters = list(mean = 0, sd = 1))
list_norm

## $sample
##  [1]  0.4645735 -1.2465951 -1.8358194 -0.1549101 -0.3548163  1.1780128
##  [7] -0.4539898  0.4433819  0.8076383  1.3340896
## 
## $family
## [1] "normal distribution"
## 
## $parameters
## $parameters$mean
## [1] 0
## 
## $parameters$sd
## [1] 1

Now we can extract some elements from the list:

list_norm[[1]]

##  [1]  0.4645735 -1.2465951 -1.8358194 -0.1549101 -0.3548163  1.1780128
##  [7] -0.4539898  0.4433819  0.8076383  1.3340896

list_norm[["sample"]]

##  [1]  0.4645735 -1.2465951 -1.8358194 -0.1549101 -0.3548163  1.1780128
##  [7] -0.4539898  0.4433819  0.8076383  1.3340896

list_norm$sample

##  [1]  0.4645735 -1.2465951 -1.8358194 -0.1549101 -0.3548163  1.1780128
##  [7] -0.4539898  0.4433819  0.8076383  1.3340896

list_norm[[3]]$sd

## [1] 1

list_norm$parameters$sd

## [1] 1
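Single and double brackets behave differently on lists, which is a common source of confusion. A minimal sketch on a made-up list (demo_list and its contents are only for illustration):

```r
demo_list <- list(sample = c(0.1, -0.2, 0.3), family = "normal distribution")

class(demo_list["sample"])    # "list": single brackets keep the list wrapper
class(demo_list[["sample"]])  # "numeric": double brackets return the element itself

demo_list$n <- length(demo_list$sample)  # add a new component by name
names(demo_list)                         # "sample" "family" "n"
```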

Logical comparison

We can check whether the elements of a vector fulfill some logical criteria:

##      a      b      c      d      e      f      g      h      i      j 
## -5.000  0.000  1.800  3.140  4.000 88.169 13.000  2.000  5.263 10.025

x > 3 & x <= 4

##     a     b     c     d     e     f     g     h     i     j 
## FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

which(x>3 & x<=4)

## d e 
## 4 5

all(x>3)

## [1] FALSE

any(x>3)

## [1] TRUE

is.numeric(x)

## [1] TRUE

is.character(x)

## [1] FALSE

as.character(x)

##  [1] "-5"     "0"      "1.8"    "3.14"   "4"      "88.169" "13"     "2"     
##  [9] "5.263"  "10.025"

c(1, "a")

## [1] "1" "a"

is.character(c(1, "a"))

## [1] TRUE
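The last two results reflect that a vector holds a single type, so mixed inputs are coerced to the most general one. A short sketch of the rule, and of what happens when converting back (the name back is made up for this example):

```r
class(c(1, TRUE))       # "numeric": TRUE becomes 1
class(c(1, TRUE, "a"))  # "character": everything becomes text

# Converting text back to numbers gives NA where it is not possible
back <- suppressWarnings(as.numeric(c("1", "a")))
back                    # 1 NA
```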

Generating Random Variables

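To generate random variables, we can use the following commands. set.seed() fixes the state of the random number generator, so the same sequence of draws is reproduced every time the code is run: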

set.seed(123)
rnorm(4)

## [1] -0.56047565 -0.23017749  1.55870831  0.07050839

rnorm(4)

## [1]  0.1292877  1.7150650  0.4609162 -1.2650612

set.seed(123)
rnorm(4)

## [1] -0.56047565 -0.23017749  1.55870831  0.07050839

rnorm(4)

## [1]  0.1292877  1.7150650  0.4609162 -1.2650612

sample(1:10)

##  [1]  5  3  9  1  4  7 10  6  8  2

sample(c("male", "female"), size=10, replace=TRUE, prob = c(0.2, 0.8))

##  [1] "female" "female" "female" "male"   "male"   "female" "female" "female"
##  [9] "female" "female"
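The two ideas above, reproducibility and sampling with unequal probabilities, can be checked directly. A sketch with an arbitrary seed and a larger sample size than in the text, both chosen only for this example:

```r
set.seed(2020)  # arbitrary seed
first <- rnorm(4)
set.seed(2020)  # resetting the seed replays the same draws
second <- rnorm(4)
identical(first, second)  # TRUE

draws <- sample(c("male", "female"), size = 10000, replace = TRUE,
                prob = c(0.2, 0.8))
mean(draws == "female")   # should be close to 0.8
```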

Using some functions

We can use functions and control-flow statements to avoid repeating calculations or similar lines of code. For example, we can ask R to calculate the sum of the values of our vector \(x\) if a draw from rnorm() is larger than zero; otherwise, we ask it for the mean of the vector. Note that each call to rnorm(1) produces a new draw.

##      a      b      c      d      e      f      g      h      i      j 
## -5.000  0.000  1.800  3.140  4.000 88.169 13.000  2.000  5.263 10.025

rnorm(1)

## [1] -0.7843822

sum(x)

## [1] 122.397

mean(x)

## [1] 12.2397

if(rnorm(1)>0) sum(x) else mean(x)

## [1] 12.2397
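The same rule can be wrapped in a user-defined function so that it can be reused. This is only a sketch: the function name sum_or_mean and its draw argument are hypothetical, and draw is exposed as an argument so the behavior can be checked with a known value.

```r
# Returns the sum of v when draw is positive, and the mean otherwise
sum_or_mean <- function(v, draw = rnorm(1)) {
  if (draw > 0) sum(v) else mean(v)
}

sum_or_mean(c(1, 2, 3), draw = 1)   # 6
sum_or_mean(c(1, 2, 3), draw = -1)  # 2
sum_or_mean(c(1, 2, 3))             # random: either 6 or 2
```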

ifelse(x>4, sqrt(x), x^2)

## Warning in sqrt(x): NaNs produced

##         a         b         c         d         e         f         g         h 
## 25.000000  0.000000  3.240000  9.859600 16.000000  9.389835  3.605551  4.000000 
##         i         j 
##  2.294123  3.166228
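The warning above appears because ifelse() computes sqrt(x) for the whole vector, including the negative element, before picking the values it needs. One possible way to avoid it, sketched on a shortened copy of the data, is to replace negative inputs by 0 inside the branch that is never selected for them:

```r
v <- c(-5, 0, 1.8, 3.14, 4, 88.169)

# pmax(v, 0) only changes elements that fail v > 4, so the result is unaffected
safe <- ifelse(v > 4, sqrt(pmax(v, 0)), v^2)
safe
```

Unlike if, which tests a single condition, ifelse() works element by element and returns a vector of the same length as the test.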

To save time and lines of code, we can use a for loop to repeat an action. Because x is modified in place, each step uses the already updated value of x[i-1]:

for (i in 2:7) {
  x[i] <- x[i] - x[i-1]
}
x[-1]

##       b       c       d       e       f       g       h       i       j 
##   5.000  -3.200   6.340  -2.340  90.509 -77.509   2.000   5.263  10.025
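Since the loop overwrites the vector as it goes, its result differs from the first differences of the original values. A sketch comparing the two on a shortened made-up vector (orig and running are names introduced only here):

```r
orig <- c(-5, 0, 1.8, 3.14)

# In-place loop: each step sees the updated previous element
running <- orig
for (i in 2:length(running)) {
  running[i] <- running[i] - running[i - 1]
}
running[-1]  # 5.00 -3.20 6.34

# First differences of the original values, without a loop
diff(orig)   # 5.00 1.80 1.34
```

If first differences are what is needed, the vectorized diff() is the shorter and safer choice.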

Functions and estimations

R is commonly used for data science and estimation. Economics is no exception, and we can use R to estimate models such as a linear regression by OLS:

First, we check our data:

##  [1]  1  1  1  1  1  1  1  1  1  1  2  4  6  8 10 12 14 16 18 20

length(y)

## [1] 20

##       a       b       c       d       e       f       g       h       i       j 
##  -5.000   5.000  -3.200   6.340  -2.340  90.509 -77.509   2.000   5.263  10.025

length(x)

## [1] 10

Then, we create the formula for our estimation:

f <- y ~ x
class(f)

## [1] "formula"
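A formula only records the names of the variables; it does not look at their values until a modeling function uses it. A minimal sketch, with a made-up data frame d supplying the variables through the data argument of lm():

```r
f_demo <- y ~ x
all.vars(f_demo)  # "y" "x": the variable names stored in the formula

# Hypothetical data, used only to show how the formula is resolved
d <- data.frame(x = c(1, 2, 3, 4), y = c(2.1, 3.9, 6.2, 7.8))
fit <- lm(f_demo, data = d)
coef(fit)         # intercept and slope estimated from d
```

Passing the data explicitly keeps the model independent of whatever x and y happen to be in the workspace.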

We regenerate our data, so that x and y have the same length, and plot them:

x <- seq(from = 0, to = 100, by = 0.5)
x

##   [1]   0.0   0.5   1.0   1.5   2.0   2.5   3.0   3.5   4.0   4.5   5.0   5.5
##  [13]   6.0   6.5   7.0   7.5   8.0   8.5   9.0   9.5  10.0  10.5  11.0  11.5
##  [25]  12.0  12.5  13.0  13.5  14.0  14.5  15.0  15.5  16.0  16.5  17.0  17.5
##  [37]  18.0  18.5  19.0  19.5  20.0  20.5  21.0  21.5  22.0  22.5  23.0  23.5
##  [49]  24.0  24.5  25.0  25.5  26.0  26.5  27.0  27.5  28.0  28.5  29.0  29.5
##  [61]  30.0  30.5  31.0  31.5  32.0  32.5  33.0  33.5  34.0  34.5  35.0  35.5
##  [73]  36.0  36.5  37.0  37.5  38.0  38.5  39.0  39.5  40.0  40.5  41.0  41.5
##  [85]  42.0  42.5  43.0  43.5  44.0  44.5  45.0  45.5  46.0  46.5  47.0  47.5
##  [97]  48.0  48.5  49.0  49.5  50.0  50.5  51.0  51.5  52.0  52.5  53.0  53.5
## [109]  54.0  54.5  55.0  55.5  56.0  56.5  57.0  57.5  58.0  58.5  59.0  59.5
## [121]  60.0  60.5  61.0  61.5  62.0  62.5  63.0  63.5  64.0  64.5  65.0  65.5
## [133]  66.0  66.5  67.0  67.5  68.0  68.5  69.0  69.5  70.0  70.5  71.0  71.5
## [145]  72.0  72.5  73.0  73.5  74.0  74.5  75.0  75.5  76.0  76.5  77.0  77.5
## [157]  78.0  78.5  79.0  79.5  80.0  80.5  81.0  81.5  82.0  82.5  83.0  83.5
## [169]  84.0  84.5  85.0  85.5  86.0  86.5  87.0  87.5  88.0  88.5  89.0  89.5
## [181]  90.0  90.5  91.0  91.5  92.0  92.5  93.0  93.5  94.0  94.5  95.0  95.5
## [193]  96.0  96.5  97.0  97.5  98.0  98.5  99.0  99.5 100.0

y <- 1.25 + 3.5*x + 75*rnorm(201)  # 201 is the number of observations in x
y

##   [1]  -14.93990455  -22.11845683  -76.67743515    0.09325517   88.54579037
##   [6]   -0.90451632  -75.66586359  -47.88867919   66.62020584   -7.00423145
##  [11]  -79.61418085  -24.47062460   12.54419832   90.50521103   14.39530282
##  [16]   52.23434006 -212.79921223  -26.88438284   54.24114255  -57.03839868
##  [21]   68.84127828   98.01326494   27.45517735  134.71890812  -26.82887935
##  [26]   74.52814892   77.02235918  -17.98275373  -48.67032027   54.16329314
##  [31]   21.34026549  182.24043890  149.37945866   79.70176087  -17.92316284
##  [36]   23.43479921  185.99018913  -14.25511720  194.19154332   51.37326742
##  [41]   36.13496410   15.02663289  235.99395018  -23.57652212  115.44028600
##  [46]  172.54821801  129.32715936  114.40167061  144.76889808   75.56920252
##  [51]   71.58281387   22.94061871   37.12303832  -13.07643377  142.19626511
##  [56]   97.03513033   47.82198657   80.04998543   44.04522935   46.07520702
##  [61]   78.13999302   84.04546434  116.09078260   53.86447978   66.30668152
##  [66]   47.43468586  166.52965023  141.02093386  125.86426179  137.47795215
##  [71]   87.08078735   78.40362566  123.73124552  141.19635865  227.67294362
##  [76]   97.73326239  157.15974206  129.70084651  168.52725869  153.27586804
##  [81]  274.65562138  145.82621355  232.96650933  104.60981423   77.32865451
##  [86]  100.11085204  185.65226437  193.01417519  137.98033234  261.80699899
##  [91]  291.02397735  196.92010156  142.31957921  175.37085280  268.99573598
##  [96]  153.97042643   51.67436506  151.44555805  244.88577638  238.54215982
## [101]  207.65975266  203.49673848  224.48187943  321.85635039  228.46527773
## [106]  127.53733971  140.22551347  247.76426879  172.10176176  275.81148609
## [111]  282.61979762  318.98561282  211.72471394  169.55848344  201.27884925
## [116]   15.38734167  130.95277284  253.14266171  201.43946453  238.15987059
## [121]  129.24396508  156.34020295  194.95053109  160.06529771  251.30190679
## [126]  124.19119254  310.03940342  291.18793756  130.65218674  289.80913593
## [131]   52.62822693  276.32283526  228.65991933   54.06017188  234.30107829
## [136]  230.84859108  119.58863264  304.87819908  189.23293956  324.48227583
## [141]  206.03180572  288.19377923  112.60300246  115.48499831  356.19602801
## [146]  212.67978445  329.52334199  257.10245118  287.42777622  412.84791896
## [151]  175.40243760  208.86221384  242.40366366  247.72229777  294.31146751
## [156]  410.86290033  200.60621379  440.70028225  262.39992512  352.63572023
## [161]  216.18254128  245.41093062  343.66933722  128.83145074  285.08463027
## [166]  259.63929399  283.29255251  428.28553305  234.47553066  439.56750553
## [171]  351.92158303  355.71460809  404.68324152  260.78020804  245.39507603
## [176]  267.37006413  368.61609321  257.92946369  217.08424710  492.65983459
## [181]  234.23076888  332.43311867  310.28797140  217.36758951  358.49266437
## [186]  397.02867679  326.36440337  263.21516687  295.53407198  188.66365587
## [191]  361.48620328  300.82839570  361.39669189  317.69989576  422.63923326
## [196]  354.51622721  547.36262133  367.47174471  345.31173119  275.27414857
## [201]  367.13574461

plot(y ~ x)

lm1 <- lm(y ~ x)
summary(lm1)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -235.499  -51.120   -4.663   46.558  196.946 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6.4305    10.4487  -0.615    0.539    
## x             3.6413     0.1808  20.145   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 74.34 on 199 degrees of freedom
## Multiple R-squared:  0.671,  Adjusted R-squared:  0.6693 
## F-statistic: 405.8 on 1 and 199 DF,  p-value: < 2.2e-16
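Beyond the printed summary, the pieces of a fitted model can be extracted for further use. The sketch below regenerates data with the same design as above; the seed and the names x_sim, y_sim, and fit_sim are assumptions made for this example, so the numbers will not match the output shown in the text.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
x_sim <- seq(from = 0, to = 100, by = 0.5)
y_sim <- 1.25 + 3.5 * x_sim + 75 * rnorm(length(x_sim))
fit_sim <- lm(y_sim ~ x_sim)

coef(fit_sim)               # intercept and slope estimates
confint(fit_sim)            # 95% confidence intervals
summary(fit_sim)$r.squared  # R-squared as a plain number

# Fitted value at a new point, x = 50
predict(fit_sim, newdata = data.frame(x_sim = 50))
```

With noise this large relative to the intercept, the slope is typically estimated much more precisely than the intercept, which is also what the summary above shows.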

Applied Econometrics: Geocomputation and Spatial Methods

Prof. Augusto Delgado (Faculty of Political Science and Economics - Waseda University)

October, 2020