R is a language for statistical analysis, data processing, and graphics. It is based on the S language, which was developed at Bell Laboratories (AT&T) largely through the work of John Chambers.
R provides a wide variety of tools for data analysis. One of its main advantages is that it is open-source software, so people around the world can contribute to its development and improve its performance.
R can be downloaded from https://cran.r-project.org/mirrors.html. The complete list of packages available for download is at https://cran.r-project.org/web/packages/available_packages_by_name.html.
A good first habit is to clear your environment before you start, using the code below. rm() stands for remove; its list argument receives the names of all the objects that remain in the workspace, which ls() returns.
rm(list = ls())
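To remove only selected objects rather than the whole workspace, rm() can also be given individual names. The following is a minimal sketch; the objects tmp_a and tmp_b are hypothetical and exist only for this illustration.

```r
# Hypothetical objects created only for this illustration
tmp_a <- 1
tmp_b <- "text"

ls()              # names of the objects currently in the workspace
rm(tmp_a)         # removes only tmp_a
exists("tmp_a")   # FALSE: tmp_a is gone
exists("tmp_b")   # TRUE: tmp_b is still there
rm(tmp_b)         # clean up the second object as well
```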
R can evaluate arithmetic expressions directly, as in the following examples:
1+1
## [1] 2
2*3
## [1] 6
log(exp(sin(pi/4)^2)*exp(cos(pi/4)^2))
## [1] 1
We can also use R to create numeric vectors and then analyze them:
x <- c(-5, 0, 1.8, 3.14, 4, 88.169, 13, 2, 5.263, 10.025)
x
## [1] -5.000 0.000 1.800 3.140 4.000 88.169 13.000 2.000 5.263 10.025
The number of elements in the vector x:
length(x)
## [1] 10
And some mathematical operations with the vector:
2*x+3
## [1] -7.000 3.000 6.600 9.280 11.000 179.338 29.000 7.000 13.526
## [10] 23.050
x
## [1] -5.000 0.000 1.800 3.140 4.000 88.169 13.000 2.000 5.263 10.025
7:1+x
## Warning in 7:1 + x: longer object length is not a multiple of shorter object
## length
## [1] 2.000 6.000 6.800 7.140 7.000 90.169 14.000 9.000 11.263 15.025
7:1 + 1:7
## [1] 8 8 8 8 8 8 8
7:1*x + 1:7
## Warning in 7:1 * x: longer object length is not a multiple of shorter object
## length
## Warning in 7:1 * x + 1:7: longer object length is not a multiple of shorter
## object length
## [1] -34.000 2.000 12.000 16.560 17.000 182.338 20.000 15.000 33.578
## [10] 53.125
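The warnings above come from R's recycling rule: when two vectors of different lengths are combined, the shorter one is reused from its start until it matches the longer one. A small sketch with hypothetical vectors (short and long are not part of the examples above) may make this easier to see.

```r
# Hypothetical vectors used only for this illustration
short <- 1:2
long  <- c(10, 20, 30, 40)

# Length 4 is a multiple of length 2: short is reused as 1, 2, 1, 2 and R does not warn
ok <- short + long
ok        # 11 22 31 42

# Length 4 is not a multiple of length 3: 1:3 is reused as 1, 2, 3, 1 and R would warn.
# suppressWarnings() is used here only to keep the output clean.
partial <- suppressWarnings(1:3 + long)
partial   # 11 22 33 41
```

Recycling is silent whenever the longer length is an exact multiple of the shorter one, so it is worth checking lengths with length() before combining vectors.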
log(x)
## Warning in log(x): NaNs produced
## [1] NaN -Inf 0.5877867 1.1442228 1.3862944 4.4792554 2.5649494
## [8] 0.6931472 1.6607012 2.3050820
In the case of log(x), R returns NaN, which stands for Not a Number, when the element is negative; when the element is zero, R returns -Inf, which stands for \(-\infty\).
We can select a single element or a set of elements from our vector:
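R provides helper functions to detect these special values. The sketch below uses a short hypothetical vector v and shows one possible way to keep only the finite results.

```r
# Hypothetical short vector with a negative, a zero, and a positive element
v  <- c(-5, 0, 1.8)
lv <- suppressWarnings(log(v))   # the NaN warning is suppressed for readability

is.nan(lv)         # TRUE  FALSE FALSE
is.infinite(lv)    # FALSE TRUE  FALSE
is.finite(lv)      # FALSE FALSE TRUE
lv[is.finite(lv)]  # keeps only the finite value, log(1.8)
```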
x[c(1, 4)]
## [1] -5.00 3.14
x[c(1:4)]
## [1] -5.00 0.00 1.80 3.14
x[-c(2, 3, 5, 7)]
## [1] -5.000 3.140 88.169 2.000 5.263 10.025
Also, we can create some vectors automatically by using some commands such as:
rep(a, b): repeats the element a b times.
seq(a, b, c): creates a vector that starts at a and ends at b in steps of c.
t1:t2: creates a sequence from \(t_1\) to \(t_2\) in steps of 1.
ones <- rep(1, 10)
ones
## [1] 1 1 1 1 1 1 1 1 1 1
even <- seq(from = 2, to = 20, by = 2)
even
## [1] 2 4 6 8 10 12 14 16 18 20
even2 <- seq(2,20,2)
even2
## [1] 2 4 6 8 10 12 14 16 18 20
trend <- 1986:2020
trend
## [1] 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000
## [16] 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
## [31] 2016 2017 2018 2019 2020
y <- c(ones, even)
y
## [1] 1 1 1 1 1 1 1 1 1 1 2 4 6 8 10 12 14 16 18 20
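rep() and seq() accept a few further arguments that are often convenient. The sketch below shows some of them; the object names are hypothetical and chosen so that they do not overwrite the vectors created above.

```r
# times repeats the whole vector; each repeats every element in turn
rep_times <- rep(c("a", "b"), times = 3)   # "a" "b" "a" "b" "a" "b"
rep_each  <- rep(c("a", "b"), each = 3)    # "a" "a" "a" "b" "b" "b"

# length.out fixes the number of points instead of the step size
grid <- seq(0, 1, length.out = 5)          # 0.00 0.25 0.50 0.75 1.00

# seq_len(n) gives 1, 2, ..., n and returns an empty vector when n is 0
idx <- seq_len(5)                          # 1 2 3 4 5
```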
R is also well suited to creating matrices and applying algebraic operations to them. Note that the command matrix() fills the matrix column by column: from top to bottom within each column, moving through the columns from left to right. The number of rows can be specified with the nrow argument.
A <- matrix(1:6, nrow = 2)
A
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
B <- matrix(1:6, nrow = 3)
B
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
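If the values should instead run along the rows, matrix() has a byrow argument. A brief sketch, using hypothetical names A_col and A_row so that A and B above are left untouched:

```r
# Default: values fill the matrix column by column
A_col <- matrix(1:6, nrow = 2)
A_col[1, ]   # first row: 1 3 5

# byrow = TRUE: values fill the matrix row by row
A_row <- matrix(1:6, nrow = 2, byrow = TRUE)
A_row[1, ]   # first row: 1 2 3
```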
Additionally, we can transpose the matrix with t(), get its dimensions with dim(), and obtain the number of rows with nrow() or the number of columns with ncol().
t(A)
## [,1] [,2]
## [1,] 1 2
## [2,] 3 4
## [3,] 5 6
dim(A)
## [1] 2 3
nrow(A)
## [1] 2
ncol(A)
## [1] 3
It is possible to build new matrices from a previously created matrix:
A1 <- A[1:2, c(1, 3)]
A1
## [,1] [,2]
## [1,] 1 5
## [2,] 2 6
Some other matrix operations are: obtaining the eigenvalues and eigenvectors of a matrix with eigen(), computing the inverse with solve(), and multiplying matrices with %*%.
eigen(A1)
## eigen() decomposition
## $values
## [1] 7.5311289 -0.5311289
##
## $vectors
## [,1] [,2]
## [1,] -0.6078802 -0.9561723
## [2,] -0.7940288 0.2928046
solve(A1)
## [,1] [,2]
## [1,] -1.5 1.25
## [2,] 0.5 -0.25
A1 %*% solve(A1)
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
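Two quick checks can help connect these functions. The sketch below rebuilds the same matrix under the hypothetical name M, solves a linear system with an assumed right-hand side b, and verifies that the first eigenpair satisfies \(M v = \lambda v\) up to rounding error.

```r
# Same entries as A1 above, rebuilt under a hypothetical name
M <- matrix(c(1, 2, 5, 6), nrow = 2)

# solve(M, b) solves the system M %*% sol = b without forming the inverse explicitly
b   <- c(1, 0)             # assumed right-hand side, for illustration only
sol <- solve(M, b)
sol                        # -1.5 0.5, the first column of solve(M)

# Check the first eigenvalue/eigenvector pair: M v - lambda v should be ~0
e     <- eigen(M)
resid <- M %*% e$vectors[, 1] - e$values[1] * e$vectors[, 1]
max(abs(resid))            # numerically zero
```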
As with vectors, R lets us create matrices from predefined patterns. For example, diag(n) creates the \(n \times n\) identity matrix, with 1s on the diagonal and 0s elsewhere; here \(n = 4\).
D <- diag(4)
D
## [,1] [,2] [,3] [,4]
## [1,] 1 0 0 0
## [2,] 0 1 0 0
## [3,] 0 0 1 0
## [4,] 0 0 0 1
The code diag(a, b, c) creates a matrix with \(b\) rows and \(c\) columns that has the value \(a\) on the main diagonal.
diag(1, 4, 4)
## [,1] [,2] [,3] [,4]
## [1,] 1 0 0 0
## [2,] 0 1 0 0
## [3,] 0 0 1 0
## [4,] 0 0 0 1
upper.tri() returns a logical matrix that is TRUE for the elements strictly above the main diagonal.
upper.tri(D)
## [,1] [,2] [,3] [,4]
## [1,] FALSE TRUE TRUE TRUE
## [2,] FALSE FALSE TRUE TRUE
## [3,] FALSE FALSE FALSE TRUE
## [4,] FALSE FALSE FALSE FALSE
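The logical matrix is mostly useful as an index. As a sketch, with a hypothetical matrix S, it can extract the elements above the diagonal, and its counterpart lower.tri() can set the elements below the diagonal to zero.

```r
# Hypothetical 4 x 4 matrix filled column by column with 1, ..., 16
S <- matrix(1:16, nrow = 4)

# Elements strictly above the diagonal, returned in column order
above <- S[upper.tri(S)]
above                    # 5 9 10 13 14 15

# Zero out everything below the diagonal to obtain an upper triangular matrix
U <- S
U[lower.tri(U)] <- 0
U
```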
We can combine matrices to create new ones by using cbind() to bind by columns and rbind() to bind by rows.
cbind(1, A1)
## [,1] [,2] [,3]
## [1,] 1 1 5
## [2,] 1 2 6
rbind(A1, diag(4, 2))
## [,1] [,2]
## [1,] 1 5
## [2,] 2 6
## [3,] 4 0
## [4,] 0 4
Let's take the vector \(x\) we created previously and check which values are larger than 3.5:
x
## [1] -5.000 0.000 1.800 3.140 4.000 88.169 13.000 2.000 5.263 10.025
x>3.5
## [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE TRUE
We can name its elements:
names(x) <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
x
## a b c d e f g h i j
## -5.000 0.000 1.800 3.140 4.000 88.169 13.000 2.000 5.263 10.025
names(x) <- letters[1:10]
x
## a b c d e f g h i j
## -5.000 0.000 1.800 3.140 4.000 88.169 13.000 2.000 5.263 10.025
We can subset the vector x by position, by name, or by a logical condition:
x[c(5:7,9:10)]
## e f g i j
## 4.000 88.169 13.000 5.263 10.025
x[c("e", "f", "g", "i", "j")]
## e f g i j
## 4.000 88.169 13.000 5.263 10.025
x[x>3.5]
## e f g i j
## 4.000 88.169 13.000 5.263 10.025
In R we can create a list and extract its elements. For example, we will create a list that stores a sample drawn from a normal distribution, the name of the distribution, and its parameters (mean and standard deviation); we can then extract the sample for use in other analyses.
list_norm <- list(sample = rnorm(10), family = "normal distribution", parameters = list(mean = 0, sd = 1))
list_norm
## $sample
## [1] 0.4645735 -1.2465951 -1.8358194 -0.1549101 -0.3548163 1.1780128
## [7] -0.4539898 0.4433819 0.8076383 1.3340896
##
## $family
## [1] "normal distribution"
##
## $parameters
## $parameters$mean
## [1] 0
##
## $parameters$sd
## [1] 1
Now we can extract some elements from the list:
list_norm[[1]]
## [1] 0.4645735 -1.2465951 -1.8358194 -0.1549101 -0.3548163 1.1780128
## [7] -0.4539898 0.4433819 0.8076383 1.3340896
list_norm[["sample"]]
## [1] 0.4645735 -1.2465951 -1.8358194 -0.1549101 -0.3548163 1.1780128
## [7] -0.4539898 0.4433819 0.8076383 1.3340896
list_norm$sample
## [1] 0.4645735 -1.2465951 -1.8358194 -0.1549101 -0.3548163 1.1780128
## [7] -0.4539898 0.4433819 0.8076383 1.3340896
list_norm[[3]]$sd
## [1] 1
list_norm$parameters$sd
## [1] 1
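Single brackets behave differently from double brackets on a list: [ ] returns a shorter list, while [[ ]] and $ return the element itself. A minimal sketch, using a hypothetical list lst with made-up values:

```r
# Hypothetical list with made-up values, for illustration only
lst <- list(sample = c(0.5, -1.2), family = "normal distribution")

class(lst["sample"])     # "list": a list that contains one element
class(lst[["sample"]])   # "numeric": the element itself

# New elements can be added by name
lst$n <- length(lst$sample)
names(lst)               # "sample" "family" "n"
```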
We can check whether the elements of our vector satisfy logical conditions, and test or convert their type:
x
## a b c d e f g h i j
## -5.000 0.000 1.800 3.140 4.000 88.169 13.000 2.000 5.263 10.025
x > 3 & x <= 4
## a b c d e f g h i j
## FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
which(x>3 & x<=4)
## d e
## 4 5
all(x>3)
## [1] FALSE
any(x>3)
## [1] TRUE
is.numeric(x)
## [1] TRUE
is.character(x)
## [1] FALSE
as.character(x)
## [1] "-5" "0" "1.8" "3.14" "4" "88.169" "13" "2"
## [9] "5.263" "10.025"
c(1, "a")
## [1] "1" "a"
is.character(c(1, "a"))
## [1] TRUE
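To generate random numbers, we can use the following commands. set.seed() fixes the state of the random number generator, so the same sequence of draws is reproduced after each call with the same seed: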
set.seed(123)
rnorm(4)
## [1] -0.56047565 -0.23017749 1.55870831 0.07050839
rnorm(4)
## [1] 0.1292877 1.7150650 0.4609162 -1.2650612
set.seed(123)
rnorm(4)
## [1] -0.56047565 -0.23017749 1.55870831 0.07050839
rnorm(4)
## [1] 0.1292877 1.7150650 0.4609162 -1.2650612
sample(1:10)
## [1] 5 3 9 1 4 7 10 6 8 2
sample(c("male", "female"), size=10, replace=TRUE, prob = c(0.2, 0.8))
## [1] "female" "female" "female" "male" "male" "female" "female" "female"
## [9] "female" "female"
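The effect of the seed can be checked directly with identical(). This is a small sketch; the seed 42 and the object names are arbitrary choices for illustration.

```r
# Same seed, same draws
set.seed(42)
d1 <- rnorm(3)
set.seed(42)
d2 <- rnorm(3)
identical(d1, d2)   # TRUE

# Without resetting the seed, the generator moves on to new values
d3 <- rnorm(3)
identical(d1, d3)   # FALSE

# rnorm() also accepts the mean and standard deviation of the distribution
d4 <- rnorm(5, mean = 10, sd = 2)
```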
We can use conditional statements to choose between calculations instead of writing each case out separately. For example, we can ask R to calculate the sum of the values of our vector \(x\) if a draw from rnorm() is larger than zero, and the mean of the vector otherwise. Note that each call to rnorm(1) produces a new draw, so the value printed below is not the one tested inside the if statement.
x
## a b c d e f g h i j
## -5.000 0.000 1.800 3.140 4.000 88.169 13.000 2.000 5.263 10.025
rnorm(1)
## [1] -0.7843822
sum(x)
## [1] 122.397
mean(x)
## [1] 12.2397
if(rnorm(1)>0) sum(x) else mean(x)
## [1] 12.2397
ifelse(x>4, sqrt(x), x^2)
## Warning in sqrt(x): NaNs produced
## a b c d e f g h
## 25.000000 0.000000 3.240000 9.859600 16.000000 9.389835 3.605551 4.000000
## i j
## 2.294123 3.166228
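The warning above appears because ifelse() evaluates sqrt() on the whole vector, including the negative element, before picking values element by element. One possible way to avoid it, sketched here with a hypothetical vector v, is to make the input to sqrt() non-negative with pmax().

```r
# Hypothetical vector with a negative element
v <- c(-5, 0, 9, 16)

# pmax(v, 0) replaces negative values by 0 only inside sqrt(),
# so no NaN is produced; the selected results are unchanged
safe <- ifelse(v > 4, sqrt(pmax(v, 0)), v^2)
safe   # 25 0 3 4
```

In contrast to ifelse(), the if ... else statement tests a single condition and evaluates only the chosen branch.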
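To save time and lines of code, we can use a for loop to repeat an action. In the loop below, each element from the second to the seventh is replaced by its difference from the previous element, which has already been updated: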
for (i in 2:7) {
x[i] <- x[i] - x[i-1]
}
x[-1]
## b c d e f g h i j
## 5.000 -3.200 6.340 -2.340 90.509 -77.509 2.000 5.263 10.025
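Because the loop overwrites the vector as it goes, its result differs from the ordinary first differences that diff() computes from the original values. The sketch below compares the two on a short hypothetical vector v.

```r
# Hypothetical vector: the first four values of the original x
v <- c(-5, 0, 1.8, 3.14)

# Loop in the style used above: each step uses the already updated previous element
running <- v
for (i in 2:length(running)) {
  running[i] <- running[i] - running[i - 1]
}
running    # -5.00 5.00 -3.20 6.34

# diff() uses the original values: v[i] - v[i - 1]
d <- diff(v)
d          # 5.00 1.80 1.34
```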
R is commonly used for data science and estimation. Economics is no exception, and we can use R to estimate models such as a linear regression by ordinary least squares (OLS):
First, we check our data:
y
## [1] 1 1 1 1 1 1 1 1 1 1 2 4 6 8 10 12 14 16 18 20
length(y)
## [1] 20
x
## a b c d e f g h i j
## -5.000 5.000 -3.200 6.340 -2.340 90.509 -77.509 2.000 5.263 10.025
length(x)
## [1] 10
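Then, we create the formula for our estimation. Creating a formula does not evaluate the variables, so this step succeeds even though y and x currently have different lengths: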
f <- y ~ x
class(f)
## [1] "formula"
Because y has 20 elements and x has 10, we regenerate our data before plotting them and estimating the model:
x <- seq(from = 0, to = 100, by = 0.5)
x
## [1] 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
## [13] 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 11.5
## [25] 12.0 12.5 13.0 13.5 14.0 14.5 15.0 15.5 16.0 16.5 17.0 17.5
## [37] 18.0 18.5 19.0 19.5 20.0 20.5 21.0 21.5 22.0 22.5 23.0 23.5
## [49] 24.0 24.5 25.0 25.5 26.0 26.5 27.0 27.5 28.0 28.5 29.0 29.5
## [61] 30.0 30.5 31.0 31.5 32.0 32.5 33.0 33.5 34.0 34.5 35.0 35.5
## [73] 36.0 36.5 37.0 37.5 38.0 38.5 39.0 39.5 40.0 40.5 41.0 41.5
## [85] 42.0 42.5 43.0 43.5 44.0 44.5 45.0 45.5 46.0 46.5 47.0 47.5
## [97] 48.0 48.5 49.0 49.5 50.0 50.5 51.0 51.5 52.0 52.5 53.0 53.5
## [109] 54.0 54.5 55.0 55.5 56.0 56.5 57.0 57.5 58.0 58.5 59.0 59.5
## [121] 60.0 60.5 61.0 61.5 62.0 62.5 63.0 63.5 64.0 64.5 65.0 65.5
## [133] 66.0 66.5 67.0 67.5 68.0 68.5 69.0 69.5 70.0 70.5 71.0 71.5
## [145] 72.0 72.5 73.0 73.5 74.0 74.5 75.0 75.5 76.0 76.5 77.0 77.5
## [157] 78.0 78.5 79.0 79.5 80.0 80.5 81.0 81.5 82.0 82.5 83.0 83.5
## [169] 84.0 84.5 85.0 85.5 86.0 86.5 87.0 87.5 88.0 88.5 89.0 89.5
## [181] 90.0 90.5 91.0 91.5 92.0 92.5 93.0 93.5 94.0 94.5 95.0 95.5
## [193] 96.0 96.5 97.0 97.5 98.0 98.5 99.0 99.5 100.0
y <- 1.25 + 3.5*x + 75*rnorm(201) # 201 is the number of elements in x
y
## [1] -14.93990455 -22.11845683 -76.67743515 0.09325517 88.54579037
## [6] -0.90451632 -75.66586359 -47.88867919 66.62020584 -7.00423145
## [11] -79.61418085 -24.47062460 12.54419832 90.50521103 14.39530282
## [16] 52.23434006 -212.79921223 -26.88438284 54.24114255 -57.03839868
## [21] 68.84127828 98.01326494 27.45517735 134.71890812 -26.82887935
## [26] 74.52814892 77.02235918 -17.98275373 -48.67032027 54.16329314
## [31] 21.34026549 182.24043890 149.37945866 79.70176087 -17.92316284
## [36] 23.43479921 185.99018913 -14.25511720 194.19154332 51.37326742
## [41] 36.13496410 15.02663289 235.99395018 -23.57652212 115.44028600
## [46] 172.54821801 129.32715936 114.40167061 144.76889808 75.56920252
## [51] 71.58281387 22.94061871 37.12303832 -13.07643377 142.19626511
## [56] 97.03513033 47.82198657 80.04998543 44.04522935 46.07520702
## [61] 78.13999302 84.04546434 116.09078260 53.86447978 66.30668152
## [66] 47.43468586 166.52965023 141.02093386 125.86426179 137.47795215
## [71] 87.08078735 78.40362566 123.73124552 141.19635865 227.67294362
## [76] 97.73326239 157.15974206 129.70084651 168.52725869 153.27586804
## [81] 274.65562138 145.82621355 232.96650933 104.60981423 77.32865451
## [86] 100.11085204 185.65226437 193.01417519 137.98033234 261.80699899
## [91] 291.02397735 196.92010156 142.31957921 175.37085280 268.99573598
## [96] 153.97042643 51.67436506 151.44555805 244.88577638 238.54215982
## [101] 207.65975266 203.49673848 224.48187943 321.85635039 228.46527773
## [106] 127.53733971 140.22551347 247.76426879 172.10176176 275.81148609
## [111] 282.61979762 318.98561282 211.72471394 169.55848344 201.27884925
## [116] 15.38734167 130.95277284 253.14266171 201.43946453 238.15987059
## [121] 129.24396508 156.34020295 194.95053109 160.06529771 251.30190679
## [126] 124.19119254 310.03940342 291.18793756 130.65218674 289.80913593
## [131] 52.62822693 276.32283526 228.65991933 54.06017188 234.30107829
## [136] 230.84859108 119.58863264 304.87819908 189.23293956 324.48227583
## [141] 206.03180572 288.19377923 112.60300246 115.48499831 356.19602801
## [146] 212.67978445 329.52334199 257.10245118 287.42777622 412.84791896
## [151] 175.40243760 208.86221384 242.40366366 247.72229777 294.31146751
## [156] 410.86290033 200.60621379 440.70028225 262.39992512 352.63572023
## [161] 216.18254128 245.41093062 343.66933722 128.83145074 285.08463027
## [166] 259.63929399 283.29255251 428.28553305 234.47553066 439.56750553
## [171] 351.92158303 355.71460809 404.68324152 260.78020804 245.39507603
## [176] 267.37006413 368.61609321 257.92946369 217.08424710 492.65983459
## [181] 234.23076888 332.43311867 310.28797140 217.36758951 358.49266437
## [186] 397.02867679 326.36440337 263.21516687 295.53407198 188.66365587
## [191] 361.48620328 300.82839570 361.39669189 317.69989576 422.63923326
## [196] 354.51622721 547.36262133 367.47174471 345.31173119 275.27414857
## [201] 367.13574461
plot(y ~ x)
lm1 <- lm(y ~ x)
summary(lm1)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -235.499 -51.120 -4.663 46.558 196.946
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.4305 10.4487 -0.615 0.539
## x 3.6413 0.1808 20.145 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 74.34 on 199 degrees of freedom
## Multiple R-squared: 0.671, Adjusted R-squared: 0.6693
## F-statistic: 405.8 on 1 and 199 DF, p-value: < 2.2e-16
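The estimates reported by lm() can be extracted with coef() and, as a check, reproduced from the OLS formula \(\hat{\beta} = (X'X)^{-1}X'y\) with the matrix tools seen earlier. The sketch below uses a smaller simulated data set; the names xs, ys, fit, X, and beta_hat, the seed, and the noise level are assumptions made for illustration, not the data above.

```r
# Simulated data (assumed values, for illustration only)
set.seed(1)
xs <- seq(from = 0, to = 10, by = 0.5)
ys <- 1.25 + 3.5 * xs + rnorm(length(xs))

# OLS with lm()
fit <- lm(ys ~ xs)
coef(fit)                      # intercept and slope

# The same estimates from the normal equations: (X'X) beta = X'y
X        <- cbind(1, xs)       # a column of 1s for the intercept
beta_hat <- solve(t(X) %*% X, t(X) %*% ys)

# Both approaches agree up to numerical precision
all.equal(unname(coef(fit)), as.numeric(beta_hat))   # TRUE
```

lm() does not use this formula internally in exactly this form, so small differences at the level of rounding error are expected.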
Practice the commands learned in this class by modifying them and creating your own versions in R.