Info

Objectives

By the end of this activity, you should

  1. Be familiar with R notebooks.

  2. Recall basic functionality of R, such as libraries, variables, data types, and printing.

Mode

Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. Some questions require a short answer, which you type directly in this document; others require modifying R code, which you also do here. You can discuss your work with other students and with instructors.

Please fill in the survey at the end - it’ll only take a couple of minutes.

Question 1

If you don’t have R and RStudio, install them.

This is straightforward. Ask peers or instructors for help if you can’t do it on your own.

Libraries

You can load libraries (also called packages) with the commands library or require (google the difference between these two commands if you want). Before a library can be loaded, it must be installed with the command install.packages.

Example:

  • First you run install.packages('ISLR') (only once; the next time you open R, the library will be there)

  • library('ISLR') (every time you start a new R session; next time you open R the package will be on your computer but not loaded into R)

Below we load the libraries. Note that you need to install any libraries that you don’t have on your computer before you can run this chunk of R code.

library(ISLR) # for datasets from 'Introduction to Statistical Learning'
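
As a rough illustration of how library and require differ when a package is missing, here is a minimal sketch. The package name notARealPackage123 is a made-up placeholder that is assumed not to be installed on your computer.

```r
# Hypothetical package name, assumed NOT to be installed
pkg <- "notARealPackage123"

# require() returns FALSE (with a warning) if the package cannot be loaded
ok <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
ok

# library() stops with an error instead; here we catch that error
res <- tryCatch(
  {
    library(pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "error"
)
res
```

Because require returns a logical value, it is sometimes used inside an if statement, while library is the usual choice at the top of a notebook.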

Question 2

If you don’t have ISLR installed, install it.

This is straightforward.

Assignment and printing

The assignment operator in R is <- or =.

x <- 2020

To print the value of a variable, we can simply type its name as follows:

x
## [1] 2020

or use the command print (equivalent of just typing the variable name):

print(x)
## [1] 2020

or use the command cat (it can print several arguments):

cat("x = ", x)
## x =  2020
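
Note the two spaces in the output above: cat separates its arguments with a space by default, and the string "x = " already ends with one. The sketch below shows two ways to control this (the variable names line_default, line_nosep and s are only for illustration).

```r
x <- 2020

# Default behaviour: cat() inserts a space between its arguments
line_default <- capture.output(cat("x = ", x))
line_default

# sep = "" removes the automatic separator
line_nosep <- capture.output(cat("x = ", x, sep = ""))
line_nosep

# paste0() builds the same string without printing it
s <- paste0("x = ", x)
s
```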

R notebooks

The document that you are working with is a so-called R notebook. The file has the extension “Rmd”. An R notebook consists of normal text with some very simple markup and chunks of R code that you can run independently one by one or “knit” into a readable report or presentation. A knitted notebook can be published online.

Fedor’s lecture notes are also R notebooks and Fedor will share them with you for reference. Your lab reports, project report and project presentation will also be R notebooks.

You can insert mathematical equations into R notebooks. Here is a displayed equation: \[ \int_{-\infty}^{\infty}e^{-x^2}dx=\sqrt{\pi} \]

and here is an inline equation: \(e^{i\pi}+1=0\).

It is recommended to always (including your non-MH4510 data projects or your future job) work with notebooks rather than with pure R files. The same concerns Python - if you do a data project in Python, use Jupyter (similar to R notebooks) rather than Spyder.

Question 3

Knit this R Notebook into PDF.

You need to select the right option (“Knit to PDF”) in the RStudio menu.

Atomic types

R has several atomic data types (you can google “atomic types in R” for details). The three most common ones are numeric, logical, and character. Note that a variable of an atomic data type contains a vector of values rather than a single value. Some of the values can be NA, NaN or, for numeric types, -Inf and Inf. The R command for constructing vectors is c.
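
A short sketch of the points above: a “single value” is just a vector of length 1, and c produces a vector of one common type. The variable name a is only for illustration.

```r
a <- 5
length(a)               # 1: a single value is a vector of length 1
identical(a, c(5))      # TRUE

class(c(1.5, 2))        # "numeric"
class(c(TRUE, NA))      # "logical"
class(c("a", "b"))      # "character"

# Mixing types in c() coerces all entries to one common type
class(c(1, TRUE, "a"))  # "character"
```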

Numeric vectors

# Here, we introduce a numeric vector
num_v <- c(-1.0, 2.5, -Inf, 0.6, NA, 2*6.2, -0/0, -12.0, 50/0)
cat("Below is a numeric vector\n")
## Below is a numeric vector
num_v
## [1]  -1.0   2.5  -Inf   0.6    NA  12.4   NaN -12.0   Inf
cat("\nBelow is its class, i.e., data type\n")
## 
## Below is its class, i.e., data type
class(num_v)
## [1] "numeric"
cat("\nBelow is its length\n")
## 
## Below is its length
length(num_v)
## [1] 9

Vectors can be subsetted with the operator [ ]:

num_v[c(1, 2, 5)]
## [1] -1.0  2.5   NA

There are commands in R for calculating simple functions, such as sum, mean, standard deviation, etc.:

# first, we extract finite real values from num_v
x <- num_v[c(1, 2, 4, 6, 8)]
cat("We will calculate sum, mean, and SD of the following vector:\n")
## We will calculate sum, mean, and SD of the following vector:
print(x)
## [1]  -1.0   2.5   0.6  12.4 -12.0
cat("Sum is", sum(x), "\n")
## Sum is 2.5
cat("Mean is", mean(x), "\n")
## Mean is 0.5
cat("Standard deviation is", sd(x), "\n")
## Standard deviation is 8.719518
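
Note that sd computes the sample standard deviation, with \(n-1\) in the denominator: \(s=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}\). As a quick check, the sketch below recomputes it by hand for the same vector (manual_sd is a name introduced here for illustration).

```r
x <- c(-1.0, 2.5, 0.6, 12.4, -12.0)
n <- length(x)

# Sample standard deviation from the definition (denominator n - 1)
manual_sd <- sqrt(sum((x - mean(x))^2) / (n - 1))
manual_sd

# Compare with the built-in function, up to floating-point tolerance
all.equal(manual_sd, sd(x))
```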

Functions and operations are applied to vectors element-wise. The following command applies the operation \(a\mapsto a^2-a-1\) to every entry \(a\) of the vector num_v:

num_v^2-num_v-1
## [1]   1.00   2.75    Inf  -1.24     NA 140.36    NaN 155.00    NaN

Question 4

Note that (-Inf)^2+Inf-1 = Inf, but (Inf)^2-Inf-1 = NaN. Explain why (it makes perfect sense).

This is because \((-\infty)^2=\infty\), \(\infty + \infty = \infty\) and \(\infty + a=\infty\), where \(a\in\mathbb{R}\). However, \(\infty - \infty\) is undefined.

Indeed, \(\infty\) is not an element of the set \(\mathbb{R}\). Rather, it can be formally defined as a limit. However, the limit of the difference of two expressions with infinite limits can be positive or negative infinity, can be an arbitrary real number, or may not be defined at all. Look at the following examples:

  • \(\lim_{x\to \infty} \left(x^2 - x\right) = \infty\)

  • \(\lim_{x\to \infty} \left(x - x^2\right) = -\infty\)

  • \(\lim_{x\to \infty} \left((x+42) - x\right) = 42\)

  • \(\lim_{x\to \infty} \left(\left(x^2+\cos x\right) - x^2\right)\) is undefined.
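
The arithmetic rules used in this answer can be checked directly in R. This is a small sketch; the variable names are introduced here only for illustration.

```r
sq   <- (-Inf)^2    # Inf
plus <- Inf + Inf   # Inf
shft <- Inf - 1     # Inf
diff <- Inf - Inf   # NaN: the result is undefined

c(sq, plus, shft, diff)
is.nan(diff)
```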

Sequence vectors

This is self-explanatory:

1:12
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12
-6.6:5
##  [1] -6.6 -5.6 -4.6 -3.6 -2.6 -1.6 -0.6  0.4  1.4  2.4  3.4  4.4
seq(from = -1, to = 2, by = 0.2)
##  [1] -1.0 -0.8 -0.6 -0.4 -0.2  0.0  0.2  0.4  0.6  0.8  1.0  1.2  1.4  1.6  1.8
## [16]  2.0
seq(from = -1, to = 1.5, length.out = 6)
## [1] -1.0 -0.5  0.0  0.5  1.0  1.5
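
One thing that is less self-explanatory: the colon operator counts down when the right end is smaller than the left end. The sketch below shows this and the helper functions seq_len and seq_along, which are often safer when building index vectors (n and v are illustrative names).

```r
n <- 0
1:n           # counts DOWN from 1 to 0, giving two elements
seq_len(n)    # empty integer vector, usually what is wanted for n = 0

v <- c(10, 20, 30)
seq_along(v)  # indices 1, 2, 3 of the vector v
```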

Logical vectors

# Here, we introduce a logical vector
log_v <- c(TRUE, FALSE, NA, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE)
cat("Below is a logical vector\n")
## Below is a logical vector
log_v
## [1]  TRUE FALSE    NA  TRUE FALSE FALSE  TRUE  TRUE  TRUE
cat("\nBelow is its class, i.e., data type\n")
## 
## Below is its class, i.e., data type
class(log_v)
## [1] "logical"
cat("\nBelow is its length\n")
## 
## Below is its length
length(log_v)
## [1] 9

Logical vectors can be converted to numeric either explicitly

as.numeric(log_v)
## [1]  1  0 NA  1  0  0  1  1  1

or implicitly - when a logical vector is involved in an expression with numeric vectors, TRUE becomes 1 and FALSE becomes 0:

log_v + 0
## [1]  1  0 NA  1  0  0  1  1  1
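
A common use of this implicit conversion is counting: the sum of a logical vector is the number of TRUE entries, and its mean is their proportion. A minimal sketch with the same logical vector:

```r
log_v <- c(TRUE, FALSE, NA, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE)

sum(log_v)                 # NA, because one entry is missing
sum(log_v, na.rm = TRUE)   # number of TRUE entries among non-missing ones
mean(log_v, na.rm = TRUE)  # proportion of TRUE among non-missing entries
```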

Logical vectors can be used to subset numeric (and character) vectors:

num_v
## [1]  -1.0   2.5  -Inf   0.6    NA  12.4   NaN -12.0   Inf
log_v
## [1]  TRUE FALSE    NA  TRUE FALSE FALSE  TRUE  TRUE  TRUE
num_v[log_v]
## [1]  -1.0    NA   0.6   NaN -12.0   Inf

Note that elements at positions corresponding to FALSE in the logical index vector are removed, elements at positions corresponding to TRUE are kept, and elements at positions corresponding to NA become NA.
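
If the extra NA produced by a missing index is not wanted, one option is to convert the logical index to numeric positions with which, which ignores NA. A minimal sketch (kept is an illustrative name):

```r
num_v <- c(-1.0, 2.5, -Inf, 0.6, NA, 2*6.2, -0/0, -12.0, 50/0)
log_v <- c(TRUE, FALSE, NA, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE)

which(log_v)                 # positions of TRUE only; the NA is skipped
kept <- num_v[which(log_v)]  # no NA is introduced by the index
kept
```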

Logical vectors with information about numeric vectors

Let us look at our numeric vector again

num_v
## [1]  -1.0   2.5  -Inf   0.6    NA  12.4   NaN -12.0   Inf

Positions of missing entries are

is.na(num_v)
## [1] FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE

The number of missing entries is

sum(is.na(num_v))
## [1] 2

Note that both NA and NaN are processed as missing entries in a numeric vector.
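
To tell the two apart, is.nan can be used: it is TRUE for NaN but FALSE for NA. A short sketch with the same vector (only_na is an illustrative name):

```r
num_v <- c(-1.0, 2.5, -Inf, 0.6, NA, 2*6.2, -0/0, -12.0, 50/0)

is.nan(num_v)                             # TRUE only for the NaN entry
only_na <- is.na(num_v) & !is.nan(num_v)  # TRUE only for the NA entry
only_na
```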

Positions of negative entries are extracted as follows:

num_v < 0
## [1]  TRUE FALSE  TRUE FALSE    NA FALSE    NA  TRUE FALSE

Negative entries themselves are:

num_v[!is.na(num_v) & (num_v < 0)]
## [1]   -1 -Inf  -12

Positions of finite entries are

is.finite(num_v)
## [1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE

Numeric positions of finite entries are

which(is.finite(num_v))
## [1] 1 2 4 6 8

Question 5

Given a numeric vector, write a single R command that calculates the sum of squares of its finite entries. It should work for an arbitrary input, i.e., you cannot just extract entries of num_v with explicitly specified positions.

# Write your command here
sum(num_v[is.finite(num_v)]^2)
## [1] 305.37

Character vectors

# Here, we introduce a character vector
chr_v <- c("Fedor likes roller skating", NA, "-24", "FALSE", "asgoifw", NaN, 50)
cat("Below is a character vector\n")
## Below is a character vector
chr_v
## [1] "Fedor likes roller skating" NA                          
## [3] "-24"                        "FALSE"                     
## [5] "asgoifw"                    "NaN"                       
## [7] "50"
cat("\nBelow is its class, i.e., data type\n")
## 
## Below is its class, i.e., data type
class(chr_v)
## [1] "character"
cat("\nBelow is its length\n")
## 
## Below is its length
length(chr_v)
## [1] 7

Note that when we created this vector, NA was interpreted as a missing entry but NaN as the literal string “NaN”. The number of missing entries in our character vector is

sum(is.na(chr_v))
## [1] 1

Character vectors can be converted to numeric vectors. Everything that does not look like a number will become a missing entry:

as.numeric(chr_v)
## Warning: NAs introduced by coercion
## [1]  NA  NA -24  NA  NA NaN  50

Same for conversion to logical vectors:

as.logical(chr_v)
## [1]    NA    NA    NA FALSE    NA    NA    NA

Conversely, numeric and logical vectors can be converted to character vectors:

num_v
## [1]  -1.0   2.5  -Inf   0.6    NA  12.4   NaN -12.0   Inf
as.character(num_v)
## [1] "-1"   "2.5"  "-Inf" "0.6"  NA     "12.4" "NaN"  "-12"  "Inf"
log_v
## [1]  TRUE FALSE    NA  TRUE FALSE FALSE  TRUE  TRUE  TRUE
as.character(log_v)
## [1] "TRUE"  "FALSE" NA      "TRUE"  "FALSE" "FALSE" "TRUE"  "TRUE"  "TRUE"

Below we find the length of every string in our character vector:

nchar(chr_v)
## [1] 26 NA  3  5  7  3  2

Question 6

Write a single R command that extracts all strings of length at least 7 from a character vector. It should work for an arbitrary input, i.e., you cannot extract entries of chr_v with explicitly specified positions.

chr_v[!is.na(chr_v) & (nchar(chr_v) >= 7)]
## [1] "Fedor likes roller skating" "asgoifw"

Data in R

Lists

Items of a list are variables of any class (they can even, in turn, be lists). Below we construct a list

# Here, we introduce a list 
L <- list(-5, "MH4510", num_v, chr_v, log_v)
cat("Below is a list\n")
## Below is a list
L
## [[1]]
## [1] -5
## 
## [[2]]
## [1] "MH4510"
## 
## [[3]]
## [1]  -1.0   2.5  -Inf   0.6    NA  12.4   NaN -12.0   Inf
## 
## [[4]]
## [1] "Fedor likes roller skating" NA                          
## [3] "-24"                        "FALSE"                     
## [5] "asgoifw"                    "NaN"                       
## [7] "50"                        
## 
## [[5]]
## [1]  TRUE FALSE    NA  TRUE FALSE FALSE  TRUE  TRUE  TRUE
cat("\nBelow is its class, i.e., data type\n")
## 
## Below is its class, i.e., data type
class(L)
## [1] "list"
cat("\nBelow is its length\n")
## 
## Below is its length
length(L)
## [1] 5

An individual element is referenced with the operator [[]]:

L[[3]]
## [1]  -1.0   2.5  -Inf   0.6    NA  12.4   NaN -12.0   Inf

We can also extract several elements from a list with the operator [] and a vector index, but the result is another list:

L[c(1, 4)]
## [[1]]
## [1] -5
## 
## [[2]]
## [1] "Fedor likes roller skating" NA                          
## [3] "-24"                        "FALSE"                     
## [5] "asgoifw"                    "NaN"                       
## [7] "50"

Question 7

What is the difference between L[[1]] and L[1]?

The former is an element of a list and the latter is a list containing a single item.
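
The difference matters in practice: arithmetic works on the element but not on the one-item list. A minimal sketch with a shortened version of the list (res is an illustrative name):

```r
L <- list(-5, "MH4510")

class(L[[1]])  # "numeric": the element itself
class(L[1])    # "list": a list of length 1

L[[1]] + 1     # arithmetic works on the element

# Arithmetic on a list fails; here we catch the error
res <- tryCatch(L[1] + 1, error = function(e) "error")
res
```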

Data frames

The data frame is the basic structure for storing data in R. Essentially, it mimics an Excel spreadsheet.

# Here, we introduce a data frame 
df <- data.frame(
  name = c("Harry Potter", "Batman", "Sunny Baudelaire"),
  gender = c("male", "male", "female"),
  age = c(12, 35, 2),
  occupation = c("wizard", "vigilante", "chef"),
  has.siblings = c(FALSE, FALSE, TRUE)
)

cat("Below is a data frame\n")
## Below is a data frame
df
cat("\nBelow is its class, i.e., data type\n")
## 
## Below is its class, i.e., data type
class(df)
## [1] "data.frame"
cat("\nBelow is its length\n")
## 
## Below is its length
length(df)
## [1] 5

Note that a data frame is a list of its columns (variables) under the hood. That’s why the length of this data frame is 5.
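
Since a data frame is a list, list tools work on it column by column. The sketch below uses a shortened copy of the data frame; stringsAsFactors = FALSE is spelled out so that strings stay character in older versions of R as well.

```r
df <- data.frame(
  name = c("Harry Potter", "Batman", "Sunny Baudelaire"),
  age = c(12, 35, 2),
  has.siblings = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

is.list(df)                       # TRUE
df[["age"]]                       # list-style access, same as df$age
col_classes <- sapply(df, class)  # class of every column
col_classes
```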

Our variables are

names(df)
## [1] "name"         "gender"       "age"          "occupation"   "has.siblings"

The number of observations is

nrow(df)
## [1] 3

The number of variables is

ncol(df)
## [1] 5

Dimensions of our data frame are

dim(df)
## [1] 3 5

An individual column is

df$occupation
## [1] "wizard"    "vigilante" "chef"

or

df[ , 4]
## [1] "wizard"    "vigilante" "chef"

or

df[ , 'occupation']
## [1] "wizard"    "vigilante" "chef"

Its class is

class(df$occupation)
## [1] "character"

And an individual row is

df[ 3 , ]

Its class is

class(df[3, ])
## [1] "data.frame"
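
So a single row stays a data frame, while a single column extracted with [ , ] is simplified to a vector. If a one-column data frame is needed instead, the argument drop = FALSE (or single-bracket list indexing) keeps the data frame structure. A minimal sketch on a shortened copy of the data frame:

```r
df <- data.frame(
  name = c("Harry Potter", "Batman", "Sunny Baudelaire"),
  occupation = c("wizard", "vigilante", "chef"),
  stringsAsFactors = FALSE
)

class(df[, "occupation"])                # "character": simplified to a vector
class(df[, "occupation", drop = FALSE])  # "data.frame": structure is kept
class(df["occupation"])                  # "data.frame": list-style subsetting
```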

And here we extract two rows and three columns from our data frame:

df[c(1, 2), c("name", "age", "has.siblings")]

Question 8

Write a single R command that prints the names and occupations of everyone in our data frame whose age is at least 10. It should work for an arbitrary input that has variables “name”, “age”, and “occupation”, i.e., you should not extract entries of df with explicitly specified positions.

df[df$age >= 10, c("name", "occupation")]
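
One caveat with this approach: if some ages are missing, the logical index contains NA and the result gets a row of NAs. The function subset treats NA in the condition as FALSE. The sketch below uses a made-up data frame df2 with a missing age to show the difference.

```r
# Hypothetical data with one missing age (for illustration only)
df2 <- data.frame(
  name = c("A", "B", "C"),
  age = c(12, NA, 2),
  occupation = c("wizard", "vigilante", "chef"),
  stringsAsFactors = FALSE
)

# Bracket indexing: the NA age produces an extra row filled with NA
bracket <- df2[df2$age >= 10, c("name", "occupation")]
bracket

# subset(): rows where the condition is NA are dropped
sub <- subset(df2, age >= 10, select = c(name, occupation))
sub
```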

Matrices

A matrix, like a data frame, is a table. The difference is that all entries of a matrix have the same type - usually, numeric (but there are logical and sometimes even character matrices). Matrices can be added and multiplied.

# Here, we introduce a matrix
A <- matrix(1:20, nrow = 5)
cat("Below is a matrix\n")
## Below is a matrix
A
##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   18
## [4,]    4    9   14   19
## [5,]    5   10   15   20
cat("\nBelow is its class, i.e., data type\n")
## 
## Below is its class, i.e., data type
class(A)
## [1] "matrix" "array"
cat("\nBelow is its length\n")
## 
## Below is its length
length(A)
## [1] 20

The functions for the number of rows and the number of columns of a matrix are the same as for a data frame:

nrow(A)
## [1] 5
ncol(A)
## [1] 4
dim(A)
## [1] 5 4

The main difference from a data frame is that matrices can be added and multiplied, either element-wise or as matrices. Below we square every entry of our matrix A:

A^2
##      [,1] [,2] [,3] [,4]
## [1,]    1   36  121  256
## [2,]    4   49  144  289
## [3,]    9   64  169  324
## [4,]   16   81  196  361
## [5,]   25  100  225  400

Now we will create another matrix of the same size

B <- matrix(seq(from = 0, to = 1, length.out = 20), 
            ncol = 4, byrow = TRUE)
B
##           [,1]       [,2]      [,3]      [,4]
## [1,] 0.0000000 0.05263158 0.1052632 0.1578947
## [2,] 0.2105263 0.26315789 0.3157895 0.3684211
## [3,] 0.4210526 0.47368421 0.5263158 0.5789474
## [4,] 0.6315789 0.68421053 0.7368421 0.7894737
## [5,] 0.8421053 0.89473684 0.9473684 1.0000000

We can add it to A:

A+B
##          [,1]      [,2]     [,3]     [,4]
## [1,] 1.000000  6.052632 11.10526 16.15789
## [2,] 2.210526  7.263158 12.31579 17.36842
## [3,] 3.421053  8.473684 13.52632 18.57895
## [4,] 4.631579  9.684211 14.73684 19.78947
## [5,] 5.842105 10.894737 15.94737 21.00000

And we can multiply A and B element-wise (this is called the Hadamard product, denoted \(A\odot B\) in mathematics):

A * B
##           [,1]      [,2]      [,3]      [,4]
## [1,] 0.0000000 0.3157895  1.157895  2.526316
## [2,] 0.4210526 1.8421053  3.789474  6.263158
## [3,] 1.2631579 3.7894737  6.842105 10.421053
## [4,] 2.5263158 6.1578947 10.315789 15.000000
## [5,] 4.2105263 8.9473684 14.210526 20.000000

We cannot calculate the matrix product \(AB\) because the dimensions do not match: both \(A\) and \(B\) are \(5\times 4\). Since \(A\) is a \(5\times 4\) matrix, we can only multiply it on the right by a matrix with 4 rows. Below we create a \(4\times 3\) matrix:

C <- matrix(1:12, nrow = 4, byrow = TRUE)
C
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
## [4,]   10   11   12

Now the product \(AC\) is well-defined and is a \(5\times 3\) matrix:

A %*% C
##      [,1] [,2] [,3]
## [1,]  262  296  330
## [2,]  284  322  360
## [3,]  306  348  390
## [4,]  328  374  420
## [5,]  350  400  450

Finally, the function t transposes a matrix:

t(C)
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
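
The sketch below puts the last few steps together: it checks that the product \(AB\) fails, that \(AB^{t}\) is defined, and that the identity \((AC)^{t}=C^{t}A^{t}\) holds for our matrices. The matrices are redefined so that the chunk is self-contained; msg is an illustrative name.

```r
A <- matrix(1:20, nrow = 5)
B <- matrix(seq(from = 0, to = 1, length.out = 20), ncol = 4, byrow = TRUE)
C <- matrix(1:12, nrow = 4, byrow = TRUE)

# A and B are both 5 x 4, so A %*% B is an error; we catch its message
msg <- tryCatch(A %*% B, error = function(e) conditionMessage(e))
msg

# Transposing B makes the inner dimensions match: (5 x 4) times (4 x 5)
dim(A %*% t(B))

# The transpose of a product is the product of transposes in reverse order
all(t(A %*% C) == t(C) %*% t(A))
```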

Question 9

Calculate the matrix \(AA^{t}\) in a single R command

A %*% t(A)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  414  448  482  516  550
## [2,]  448  486  524  562  600
## [3,]  482  524  566  608  650
## [4,]  516  562  608  654  700
## [5,]  550  600  650  700  750

Then calculate the determinant of the matrix \(AA^{t}\) and explain why the result is not 0.

det(A %*% t(A))
## [1] 2.291403e-38

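The determinant is of the order \(10^{-38}\) (on Fedor’s computer), i.e., effectively it is 0. The exact determinant is 0: \(A\) has only 4 columns, so the \(5\times 5\) matrix \(AA^{t}\) has rank at most 4 and is therefore singular. The computed value is not precisely 0 because det works in floating-point arithmetic, where rounding errors accumulate.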
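
As a rough numerical check, the rank of \(A\) can be estimated with the function qr, and the determinant can be compared with 0 using a small tolerance instead of ==. The tolerance 1e-8 below is an arbitrary choice for illustration, and the exact value of det may differ between computers.

```r
A <- matrix(1:20, nrow = 5)
M <- A %*% t(A)

# Numerical rank of A: smaller than 5, so the 5 x 5 matrix M is singular
qr(A)$rank

# det() returns a tiny nonzero number (or 0) because of rounding errors
d <- det(M)
d

# Compare with 0 up to a tolerance rather than exactly
abs(d) < 1e-8
```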

Survey

There is a link to a simple survey after lab 0: