By the end of this activity, you should:
Be familiar with R notebooks.
Recall the basic functionality of R, such as libraries, variables, data types, and printing.
Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. Some questions require a short answer, which you type directly in this document; others require modifying R code, which you do in the chunks here. You can discuss your work with other students and with instructors.
Please fill in the survey at the end; it will only take a couple of minutes.
If you don’t have R and RStudio, install them.
You can load libraries (also called packages) with the commands library or require (google the difference between these two commands if you want). Before you can load a library, it must be installed with the command install.packages.
Example:
First, run install.packages('ISLR') (only once; the next time you open R, the library will still be installed).
Then run library('ISLR') (every time you start a new R session; the package stays on your computer but is not loaded into R automatically).
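One practical difference between library and require is how they react to a package that is not installed. The following is a minimal sketch of that behaviour; the package name is a made-up placeholder (an assumption, chosen so that it should not exist on your computer).

```r
# Hypothetical package name, assumed NOT to be installed anywhere
missing_pkg <- "aPackageThatDoesNotExist123"

# require() returns FALSE (normally with a warning) if the package cannot be loaded
loaded <- suppressWarnings(
  require(missing_pkg, character.only = TRUE, quietly = TRUE)
)
print(loaded)

# library() stops with an error instead; tryCatch() lets us observe it safely
result <- tryCatch(
  {
    library(missing_pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "error"
)
print(result)
```

Because require returns a logical value, it is sometimes used inside a condition, for example to install a package only when loading it fails.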
Below we load the libraries. Note that you need to install any libraries that you don’t have on your computer before you can run this chunk of R code.
library(ISLR) # for datasets from 'Introduction to Statistical Learning'
If you don’t have ISLR installed, install it.
The assignment operator in R is <- or =.
x <- 2020
To print the value of a variable, we can simply type its name as follows:
x
## [1] 2020
or use the command print (equivalent to just typing the variable name):
print(x)
## [1] 2020
or use the command cat (it can print several arguments):
cat("x = ", x)
## x = 2020
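As a small sketch of how cat joins its arguments: by default it puts one space between them, the argument sep changes that separator, and no line break is added unless you pass "\n" yourself. The variable name msg is only illustrative.

```r
x <- 2020

# By default, cat() separates its arguments with one space
cat("x =", x, "\n")

# The argument sep changes the separator between the arguments
cat("x", x, sep = " = ")
cat("\n")

# paste() builds the same string without printing it
msg <- paste("x", x, sep = " = ")
print(msg)
```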
The document that you are working with is a so-called R notebook. The file has the extension “Rmd”. An R notebook consists of normal text with some very simple markup and chunks of R code that you can run independently one by one or “knit” into a readable report or presentation. A knitted notebook can be published online.
Fedor’s lecture notes are also R notebooks and Fedor will share them with you for reference. Your lab reports, project report and project presentation will also be R notebooks.
You can insert mathematical equations into R notebooks. Here is a displayed equation: \[ \int_{-\infty}^{\infty}e^{-x^2}dx=\sqrt{\pi} \]
and here is an inline equation: \(e^{i\pi}+1=0\).
It is recommended to always work with notebooks rather than with pure R files, including in your non-MH4510 data projects and your future job. The same applies to Python: if you do a data project in Python, use Jupyter (similar to R notebooks) rather than Spyder.
Knit this R Notebook into PDF.
R has several atomic data types (you can google “atomic types in R” for details). The three most common ones are numeric, logical, and character. Note that a variable of an atomic data type contains a vector of values rather than a single value. Some of the values can be NA or, for numeric types, NaN, -Inf and Inf. The R command for constructing vectors is c.
# Here, we introduce a numeric vector
num_v <- c(-1.0, 2.5, -Inf, 0.6, NA, 2*6.2, -0/0, -12.0, 50/0)
cat("Below is a numeric vector\n")
## Below is a numeric vector
num_v
## [1] -1.0 2.5 -Inf 0.6 NA 12.4 NaN -12.0 Inf
cat("\nBelow is its class, i.e., data type\n")
##
## Below is its class, i.e., data type
class(num_v)
## [1] "numeric"
cat("\nBelow is its length\n")
##
## Below is its length
length(num_v)
## [1] 9
Use the operator [ ] for vector subsetting:
num_v[c(1, 2, 5)]
## [1] -1.0 2.5 NA
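The index inside [ ] does not have to be a list of individual positions. The sketch below, which re-creates num_v so that it runs on its own, shows a range of positions and negative positions; the names first_three, without_first and kept are illustrative.

```r
num_v <- c(-1.0, 2.5, -Inf, 0.6, NA, 2*6.2, -0/0, -12.0, 50/0)

# A range of positions
first_three <- num_v[1:3]
print(first_three)

# Negative positions drop entries instead of selecting them
without_first <- num_v[-1]
print(length(without_first))

# Dropping positions 3, 5, 7 and 9 leaves the entries at positions 1, 2, 4, 6, 8
kept <- num_v[-c(3, 5, 7, 9)]
print(kept)
```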
There are commands in R for calculating simple functions, such as the sum, mean, and standard deviation:
# first, we extract finite real values from num_v
x <- num_v[c(1, 2, 4, 6, 8)]
cat("We will calculate sum, mean, and SD of the following vector:\n")
## We will calculate sum, mean, and SD of the following vector:
print(x)
## [1] -1.0 2.5 0.6 12.4 -12.0
cat("Sum is", sum(x), "\n")
## Sum is 2.5
cat("Mean is", mean(x), "\n")
## Mean is 0.5
cat("Standard deviation is", sd(x), "\n")
## Standard deviation is 8.719518
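The function sd computes the sample standard deviation, i.e., it divides by \(n-1\) rather than \(n\). The sketch below checks this by hand on the same vector and also shows the argument na.rm, which these functions accept for skipping missing entries; the vector y is an illustrative assumption.

```r
x <- c(-1.0, 2.5, 0.6, 12.4, -12.0)

# Sample standard deviation computed by hand (note the divisor n - 1)
manual_sd <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))
cat("Manual SD:", manual_sd, " Built-in SD:", sd(x), "\n")

# With a missing entry, the result is NA unless we ask R to remove NAs first
y <- c(1, 2, NA, 4)
cat("Mean with NA:", mean(y), "\n")
cat("Mean ignoring NA:", mean(y, na.rm = TRUE), "\n")
```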
Functions and operations are applied to vectors element-wise. The following command applies the operation \(a\mapsto a^2-a-1\) to every entry \(a\) of the vector num_v:
num_v^2-num_v-1
## [1] 1.00 2.75 Inf -1.24 NA 140.36 NaN 155.00 NaN
Note that (-Inf)^2+Inf-1 = Inf, but (Inf)^2-Inf-1 = NaN. Explain why (it makes perfect sense).
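Element-wise operations also work between two vectors. If one of them is shorter, R recycles it (repeats it from the start) to match the longer one; a single number is simply a vector of length 1. A minimal sketch with two illustrative vectors u and v:

```r
u <- c(1, 2, 3, 4)
v <- c(10, 20)

# Same length: entries are combined position by position
squares <- u * u
print(squares)

# Different lengths: v is recycled to 10, 20, 10, 20
recycled <- u + v
print(recycled)

# Functions such as sqrt() are also applied to every entry
roots <- sqrt(c(1, 4, 9))
print(roots)
```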
Regular sequences are created with the operator : and the function seq:
1:12
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
-6.6:5
## [1] -6.6 -5.6 -4.6 -3.6 -2.6 -1.6 -0.6 0.4 1.4 2.4 3.4 4.4
seq(from = -1, to = 2, by = 0.2)
## [1] -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8
## [16] 2.0
seq(from = -1, to = 1.5, len = 6)
## [1] -1.0 -0.5 0.0 0.5 1.0 1.5
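Two details of these commands are easy to miss, so here is a short sketch. The operator : counts downwards when its left end is larger than its right end, which matters when the right end can be 0; the function seq_len avoids that. The variable n is an illustrative assumption.

```r
# The colon operator counts down when the left end is larger than the right end
down <- 3:1
print(down)

# This makes 1:n risky when n can be 0: we get two entries rather than none
n <- 0
print(1:n)

# seq_len(n) always has exactly n entries
print(length(seq_len(n)))

# len in the commands above is shorthand for the full argument name length.out
s <- seq(from = -1, to = 1.5, length.out = 6)
print(s)
```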
# Here, we introduce a logical vector
log_v <- c(TRUE, FALSE, NA, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE)
cat("Below is a logical vector\n")
## Below is a logical vector
log_v
## [1] TRUE FALSE NA TRUE FALSE FALSE TRUE TRUE TRUE
cat("\nBelow is its class, i.e., data type\n")
##
## Below is its class, i.e., data type
class(log_v)
## [1] "logical"
cat("\nBelow is its length\n")
##
## Below is its length
length(log_v)
## [1] 9
Logical vectors can be converted to numeric either explicitly:
as.numeric(log_v)
## [1] 1 0 NA 1 0 0 1 1 1
or implicitly: when a logical vector is involved in an expression with numeric vectors, TRUE becomes 1 and FALSE becomes 0:
log_v + 0
## [1] 1 0 NA 1 0 0 1 1 1
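A common use of this implicit conversion is counting: the sum of a logical vector is the number of TRUE entries and its mean is their proportion. A minimal sketch, re-creating log_v so that it runs on its own (n_true and prop_true are illustrative names):

```r
log_v <- c(TRUE, FALSE, NA, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE)

# Number of TRUE entries (the NA is skipped with na.rm = TRUE)
n_true <- sum(log_v, na.rm = TRUE)
print(n_true)

# Proportion of TRUE entries among the non-missing ones
prop_true <- mean(log_v, na.rm = TRUE)
print(prop_true)
```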
Logical vectors can be used to subset numeric (and character) vectors:
num_v
## [1] -1.0 2.5 -Inf 0.6 NA 12.4 NaN -12.0 Inf
log_v
## [1] TRUE FALSE NA TRUE FALSE FALSE TRUE TRUE TRUE
num_v[log_v]
## [1] -1.0 NA 0.6 NaN -12.0 Inf
Note that elements at positions corresponding to FALSE in the logical index vector are removed, elements at positions corresponding to TRUE are kept, and elements at positions corresponding to NA become NA.
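If the extra NA produced by a missing index is unwanted, one option is to convert the logical index to positions with which, which skips NA. A minimal sketch, re-creating both vectors so that it runs on its own (true_positions and picked are illustrative names):

```r
num_v <- c(-1.0, 2.5, -Inf, 0.6, NA, 2*6.2, -0/0, -12.0, 50/0)
log_v <- c(TRUE, FALSE, NA, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE)

# which() returns the positions of TRUE entries and silently skips NA
true_positions <- which(log_v)
print(true_positions)

# Subsetting by these positions therefore does not produce an extra NA
picked <- num_v[true_positions]
print(picked)
```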
Let us look at our numeric vector again
num_v
## [1] -1.0 2.5 -Inf 0.6 NA 12.4 NaN -12.0 Inf
Positions of missing entries are
is.na(num_v)
## [1] FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
The number of missing entries is
sum(is.na(num_v))
## [1] 2
Note that both NA and NaN are processed as missing entries in a numeric vector.
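When the two kinds of missing entries need to be told apart, the function is.nan can be used: it is TRUE for NaN only. A minimal sketch, re-creating num_v so that it runs on its own (only_nan and only_na are illustrative names):

```r
num_v <- c(-1.0, 2.5, -Inf, 0.6, NA, 2*6.2, -0/0, -12.0, 50/0)

# is.nan() is TRUE only for NaN, not for NA
print(is.nan(num_v))

# Positions of NaN entries
only_nan <- which(is.nan(num_v))
print(only_nan)

# Positions of entries that are NA but not NaN
only_na <- which(is.na(num_v) & !is.nan(num_v))
print(only_na)
```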
Positions of negative entries are extracted as follows:
num_v < 0
## [1] TRUE FALSE TRUE FALSE NA FALSE NA TRUE FALSE
Negative entries themselves are:
num_v[!is.na(num_v) & (num_v < 0)]
## [1] -1 -Inf -12
Positions of finite entries are
is.finite(num_v)
## [1] TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
Numeric positions of finite entries are
which(is.finite(num_v))
## [1] 1 2 4 6 8
Given a numeric vector, write a single R command that calculates the sum of squares of its finite entries. It should work for an arbitrary input, i.e., you cannot just extract entries of num_v with explicitly specified positions.
# Write your command here
# Here, we introduce a character vector
chr_v <- c("Fedor likes roller skating", NA, "-24", "FALSE", "asgoifw", NaN, 50)
cat("Below is a character vector\n")
## Below is a character vector
chr_v
## [1] "Fedor likes roller skating" NA
## [3] "-24" "FALSE"
## [5] "asgoifw" "NaN"
## [7] "50"
cat("\nBelow is its class, i.e., data type\n")
##
## Below is its class, i.e., data type
class(chr_v)
## [1] "character"
cat("\nBelow is its length\n")
##
## Below is its length
length(chr_v)
## [1] 7
Note that when we created this vector, NA was interpreted as a missing entry but NaN as the literal string “NaN”. The number of missing entries in our character vector is
sum(is.na(chr_v))
## [1] 1
Character vectors can be converted to numeric vectors. Everything that does not look like a number will become a missing entry:
as.numeric(chr_v)
## Warning: NAs introduced by coercion
## [1] NA NA -24 NA NA NaN 50
Same for conversion to logical vectors:
as.logical(chr_v)
## [1] NA NA NA FALSE NA NA NA
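The sketch below shows two practical details of these conversions, under the assumption that the coercion warning is expected and can be silenced: how to find the entries that failed to convert to numbers, and which spellings as.logical accepts. The names parsed, failed and flags are illustrative.

```r
chr_v <- c("Fedor likes roller skating", NA, "-24", "FALSE", "asgoifw", NaN, 50)

# suppressWarnings() hides the coercion warning when NAs are expected
parsed <- suppressWarnings(as.numeric(chr_v))

# Entries that were present in chr_v but could not be read as numbers
failed <- which(!is.na(chr_v) & is.na(parsed) & !is.nan(parsed))
print(chr_v[failed])

# as.logical() only understands a few spellings of TRUE and FALSE
flags <- as.logical(c("TRUE", "true", "T", "False", "yes", "1"))
print(flags)
```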
Conversely, numeric and logical vectors can be converted to character vectors:
num_v
## [1] -1.0 2.5 -Inf 0.6 NA 12.4 NaN -12.0 Inf
as.character(num_v)
## [1] "-1" "2.5" "-Inf" "0.6" NA "12.4" "NaN" "-12" "Inf"
log_v
## [1] TRUE FALSE NA TRUE FALSE FALSE TRUE TRUE TRUE
as.character(log_v)
## [1] "TRUE" "FALSE" NA "TRUE" "FALSE" "FALSE" "TRUE" "TRUE" "TRUE"
Below we find the length of every string in our character vector:
nchar(chr_v)
## [1] 26 NA 3 5 7 3 2
Write a single R command that extracts all strings of length at least 7 from a character vector. It should work for an arbitrary input, i.e., you cannot extract entries of chr_v with explicitly specified positions.
# Write your command here
Items of a list can be objects of any class (they can even, in turn, be lists). Below we construct a list:
# Here, we introduce a list
L <- list(-5, "MH4510", num_v, chr_v, log_v)
cat("Below is a list\n")
## Below is a list
L
## [[1]]
## [1] -5
##
## [[2]]
## [1] "MH4510"
##
## [[3]]
## [1] -1.0 2.5 -Inf 0.6 NA 12.4 NaN -12.0 Inf
##
## [[4]]
## [1] "Fedor likes roller skating" NA
## [3] "-24" "FALSE"
## [5] "asgoifw" "NaN"
## [7] "50"
##
## [[5]]
## [1] TRUE FALSE NA TRUE FALSE FALSE TRUE TRUE TRUE
cat("\nBelow is its class, i.e., data type\n")
##
## Below is its class, i.e., data type
class(L)
## [1] "list"
cat("\nBelow is its length\n")
##
## Below is its length
length(L)
## [1] 5
An individual element is referenced with the operator [[ ]]:
L[[3]]
## [1] -1.0 2.5 -Inf 0.6 NA 12.4 NaN -12.0 Inf
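Since items of a list can themselves be lists, the operator [[ ]] can be chained to reach into a nested list, and items can also be given names. The list below is a small illustrative example (the names nested, id, inner, label and flag are assumptions, not part of the list L above).

```r
# A small named list; the second item is itself a list
nested <- list(id = -5, inner = list(label = "MH4510", flag = TRUE))

# Named items can be referenced with $ or with [[ ]] and the name
print(nested$id)
print(nested[["id"]])

# [[ ]] (or $) can be chained to reach into the nested list
print(nested[["inner"]][["label"]])
print(nested$inner$flag)
```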
We can also extract several elements from a list with the operator [ ] and a vector index; the result is another list:
L[c(1, 4)]
## [[1]]
## [1] -5
##
## [[2]]
## [1] "Fedor likes roller skating" NA
## [3] "-24" "FALSE"
## [5] "asgoifw" "NaN"
## [7] "50"
What is the difference between L[[1]] and L[1]?
The data frame structure is the basic method of storing data in R. Essentially, it mimics an Excel spreadsheet.
# Here, we introduce a data frame
df <- data.frame(
name = c("Harry Potter", "Batman", "Sunny Baudelaire"),
gender = c("male", "male", "female"),
age = c(12, 35, 2),
occupation = c("wizard", "vigilante", "chef"),
has.siblings = c(FALSE, FALSE, TRUE)
)
cat("Below is a data frame\n")
## Below is a data frame
df
cat("\nBelow is its class, i.e., data type\n")
##
## Below is its class, i.e., data type
class(df)
## [1] "data.frame"
cat("\nBelow is its length\n")
##
## Below is its length
length(df)
## [1] 5
Note that a data frame is a list of its columns or variables under the hood. That’s why the length of this data frame is 5.
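The list nature of a data frame can be checked directly. The sketch below re-creates df so that it runs on its own; the argument stringsAsFactors = FALSE is added as an assumption to keep text columns as character in older R versions, and col_classes is an illustrative name.

```r
df <- data.frame(
  name = c("Harry Potter", "Batman", "Sunny Baudelaire"),
  gender = c("male", "male", "female"),
  age = c(12, 35, 2),
  occupation = c("wizard", "vigilante", "chef"),
  has.siblings = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE  # keep strings as character in every R version
)

# A data frame passes the list test
print(is.list(df))

# List-style access with [[ ]] returns a column as a vector
print(df[["age"]])

# sapply() visits each column (list item) and reports its class
col_classes <- sapply(df, class)
print(col_classes)
```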
Our variables are
names(df)
## [1] "name" "gender" "age" "occupation" "has.siblings"
The number of observations is
nrow(df)
## [1] 3
The number of variables is
ncol(df)
## [1] 5
Dimensions of our data frame are
dim(df)
## [1] 3 5
An individual column is
df$occupation
## [1] "wizard" "vigilante" "chef"
or
df[ , 4]
## [1] "wizard" "vigilante" "chef"
or
df[ , 'occupation']
## [1] "wizard" "vigilante" "chef"
Its class is
class(df$occupation)
## [1] "character"
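Extracting a single column with [ , ] simplifies the result to a plain vector, which is why its class is not "data.frame". If a one-column data frame is needed instead, the argument drop = FALSE prevents the simplification. A minimal sketch, re-creating df so that it runs on its own (as_vector and as_df are illustrative names):

```r
df <- data.frame(
  name = c("Harry Potter", "Batman", "Sunny Baudelaire"),
  gender = c("male", "male", "female"),
  age = c(12, 35, 2),
  occupation = c("wizard", "vigilante", "chef"),
  has.siblings = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE  # keep strings as character in every R version
)

# With a single column, [ , ] simplifies the result to a plain vector
as_vector <- df[, "occupation"]
print(class(as_vector))

# drop = FALSE keeps the result as a one-column data frame
as_df <- df[, "occupation", drop = FALSE]
print(class(as_df))
print(dim(as_df))
```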
And an individual row is
df[ 3 , ]
Its class is
class(df[3, ])
## [1] "data.frame"
And here we extract two rows and three columns from our data frame:
df[c(1, 2), c("name", "age", "has.siblings")]
Write a single R command that prints the names and occupations of everyone in our data frame whose age is at least 10. It should work for an arbitrary input that has the variables “name”, “age”, and “occupation”, i.e., you should not extract entries of df with explicitly specified positions.
# Write your command here
A matrix, like a data frame, is a table. The difference is that all entries of a matrix have the same type - usually, numeric (but there are logical and sometimes even character matrices). Matrices can be added and multiplied.
# Here, we introduce a matrix
A <- matrix(1:20, nrow = 5)
cat("Below is a matrix\n")
## Below is a matrix
A
## [,1] [,2] [,3] [,4]
## [1,] 1 6 11 16
## [2,] 2 7 12 17
## [3,] 3 8 13 18
## [4,] 4 9 14 19
## [5,] 5 10 15 20
cat("\nBelow is its class, i.e., data type\n")
##
## Below is its class, i.e., data type
class(A)
## [1] "matrix" "array"
cat("\nBelow is its length\n")
##
## Below is its length
length(A)
## [1] 20
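The matrix A above was filled column by column, which is the default behaviour of the function matrix. The sketch below contrasts this with the argument byrow = TRUE on a smaller example; by_col and by_row are illustrative names.

```r
# Default: the values fill the matrix column by column
by_col <- matrix(1:6, nrow = 2)
print(by_col)

# byrow = TRUE fills the matrix row by row instead
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
print(by_row)

# length() of a matrix counts all of its entries: rows times columns
print(length(by_col))
```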
The functions for the number of rows and the number of columns of a matrix are the same as for a data frame:
nrow(A)
## [1] 5
ncol(A)
## [1] 4
dim(A)
## [1] 5 4
The main difference from a data frame is that matrices can be added and multiplied, either element-wise or as matrices. Below we square every entry of our matrix A:
A^2
## [,1] [,2] [,3] [,4]
## [1,] 1 36 121 256
## [2,] 4 49 144 289
## [3,] 9 64 169 324
## [4,] 16 81 196 361
## [5,] 25 100 225 400
Now we will create another matrix of the same size
B <- matrix(seq(from = 0, to = 1, len = 20),
ncol = 4, byrow = TRUE)
B
## [,1] [,2] [,3] [,4]
## [1,] 0.0000000 0.05263158 0.1052632 0.1578947
## [2,] 0.2105263 0.26315789 0.3157895 0.3684211
## [3,] 0.4210526 0.47368421 0.5263158 0.5789474
## [4,] 0.6315789 0.68421053 0.7368421 0.7894737
## [5,] 0.8421053 0.89473684 0.9473684 1.0000000
We can add it to A:
A+B
## [,1] [,2] [,3] [,4]
## [1,] 1.000000 6.052632 11.10526 16.15789
## [2,] 2.210526 7.263158 12.31579 17.36842
## [3,] 3.421053 8.473684 13.52632 18.57895
## [4,] 4.631579 9.684211 14.73684 19.78947
## [5,] 5.842105 10.894737 15.94737 21.00000
And we can multiply A and B element-wise (this is called the Hadamard product, denoted \(A\odot B\) in mathematics):
A * B
## [,1] [,2] [,3] [,4]
## [1,] 0.0000000 0.3157895 1.157895 2.526316
## [2,] 0.4210526 1.8421053 3.789474 6.263158
## [3,] 1.2631579 3.7894737 6.842105 10.421053
## [4,] 2.5263158 6.1578947 10.315789 15.000000
## [5,] 4.2105263 8.9473684 14.210526 20.000000
We cannot calculate the matrix product \(AB\) because the dimensions do not match. Since \(A\) is a \(5\times 4\) matrix, we can only multiply it by a matrix with \(4\) rows. Below we create a \(4\times 3\) matrix:
C <- matrix(1:12, nrow = 4, byrow = TRUE)
C
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
## [4,] 10 11 12
Now the product \(AC\) is well-defined and is a \(5\times 3\) matrix:
A %*% C
## [,1] [,2] [,3]
## [1,] 262 296 330
## [2,] 284 322 360
## [3,] 306 348 390
## [4,] 328 374 420
## [5,] 350 400 450
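The dimension rule can be checked in code before multiplying. The sketch below re-creates the three matrices so that it runs on its own, shows that R raises an error for a product with mismatched dimensions, and recomputes one entry of \(AC\) by hand; msg and entry_11 are illustrative names.

```r
A <- matrix(1:20, nrow = 5)
B <- matrix(seq(from = 0, to = 1, length.out = 20), ncol = 4, byrow = TRUE)
C <- matrix(1:12, nrow = 4, byrow = TRUE)

# The product is defined only if ncol(left) equals nrow(right)
print(ncol(A) == nrow(C))  # dimensions match for A %*% C
print(ncol(A) == nrow(B))  # dimensions do not match for A %*% B

# Trying A %*% B raises an error; tryCatch() captures its message
msg <- tryCatch(
  {
    A %*% B
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Entry (1, 1) of A %*% C is the dot product of row 1 of A and column 1 of C
entry_11 <- sum(A[1, ] * C[, 1])
print(entry_11)
```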
Finally, the function t transposes a matrix:
t(C)
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
Calculate the matrix \(AA^{t}\) in a single R command
# Write your command here
Then calculate the determinant of the matrix \(AA^{t}\) and explain why the result is not 0.
# Write your command here