0. Introduction


In this lecture, we are going to learn some important R basics, including functions, conditional execution, and lists.

All these are important for moving forward into more advanced data analysis using R.


1. Functions


So far we have nearly always used built-in functions to accomplish tasks. However, built-in functions are sometimes not enough for a specific job. We saw such an example in the last lecture:

make_datetime_100 <- function(year, month, day, time, tz = "UTC") {
  make_datetime(year, month, day, time %/% 100, time %% 100, 0, tz)
}

In this example, we need to handle a time written in a format such as “319” (3:19 am). The function above splits that number into an hour and a minute, then builds a date-time object from all the time information.
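The arithmetic inside make_datetime_100() can be checked on its own. The sketch below uses a few made-up time values (an assumption for illustration, not data from the lecture) to show how integer division %/% and the remainder operator %% split such a number into hours and minutes.

```r
# Hypothetical times in the "hmm" / "hhmm" format, e.g. 319 means 3:19
time <- c(319, 1545, 5)

hour   <- time %/% 100   # integer division keeps the leading digits
minute <- time %% 100    # remainder keeps the last two digits

hour
## [1]  3 15  0
minute
## [1] 19 45  5
```

Because both operators are vectorised, the same function works on a whole column of times at once.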

Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting. We have learned how to write a Python function; the basics are much the same in R, but the syntax is slightly different.

function_Name <- function(<inputs separated by commas>){
  # function body, indented by two spaces
}

For example,

add_one <- function(x) { 
  x+1 
}

add_one(3)
## [1] 4

By default, the function will return the last value computed in the function.

add_one <- function(x) { 
  x+1
  print(x)
}

add_one(3)
## [1] 3

The function above returns x, because the last expression evaluated is print(x). Note that print() in R returns its argument (invisibly), whereas print() in Python returns None.

add_one <- function(x) { 
  print(x)
  return(x+1) 
  print(x)
}

y <- add_one(3)
## [1] 3
y
## [1] 4

As in Python, we use return() to specify the value a function returns. When a function reaches a return() call, it stops there, and nothing after it is executed.


Multiple arguments

my_func <- function(a=1,b=2,c=3,d=4,e=5){
  c(a,b,c,d,e)
}

my_func()
## [1] 1 2 3 4 5

When a function takes multiple arguments, simply separate them with commas. To give an argument a default value, write it after the argument name with =.

Calling a function with multiple arguments is similar to Python: we can match arguments by position or by name. The difference is that R is more flexible and doesn’t require named arguments to come after positional ones.

my_func <- function(a=1,b=2,c=3,d=4,e=5){
  c(a,b,c,d,e)
}

my_func(10, 20)
## [1] 10 20  3  4  5
my_func(10, 20, e = 50)
## [1] 10 20  3  4 50
my_func(, 10)   # This works but not recommended
## [1]  1 10  3  4  5
my_func(c = 30, 10, 20)
## [1] 10 20 30  4  5

Usually, in data analysis, it is recommended to omit the argument name for the first argument (the data set), since it appears in nearly every function. For other arguments, it’s better to write out their names in the function call.

ggplot(mpg) +   # For data set name, usually we don't use the argument name
  geom_bar(mapping = aes(x = as.factor(cyl), fill = as.factor(year)), col = "blue", position = "dodge")

Other than the data set mpg, it’s recommended that you write out the name of each argument.

Environment

An important thing to know is how a function interacts with the environment outside its own frame (what Python calls the global frame). For example,

f <- function(x) {
  x + y
}

In many programming languages, this would be an error, because y is not defined inside the function. In R, this is valid code because R uses rules called lexical scoping to find the value associated with a name. Since y is not defined inside the function, R will look in the environment where the function was defined:

y <- 100
f(10)
## [1] 110
y <- 1000
f(10)
## [1] 1010

This behaviour seems like a recipe for bugs, and indeed you should avoid creating functions like this deliberately, but by and large it doesn’t cause too many problems (especially if you regularly restart R to get to a clean slate).

The advantage of this behaviour is that from a language standpoint it allows R to be very consistent. This power and flexibility is what makes tools like ggplot2 and dplyr possible.
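To make the lookup rule concrete, here is a small sketch. The helper names g and f2 are hypothetical and introduced only for illustration. It shows that f finds y where f was defined, not where f is called, and that passing y as an argument avoids the dependency altogether.

```r
y <- 100

f <- function(x) {
  x + y            # y is not local, so R looks where f was defined
}

g <- function() {
  y <- 1           # a local y inside g; f never sees it
  f(10)
}

g()
## [1] 110

# A safer version takes y as an explicit argument
f2 <- function(x, y) {
  x + y
}

f2(10, y = 1)
## [1] 11
```

Writing functions like f2 keeps the result independent of whatever happens to be defined in the global environment.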


2. Conditional Execution


In R, using if to implement conditional execution looks like this:

if (condition) {
  # code executed when condition is TRUE
} else {
  # code executed when condition is FALSE
}

Below is a simple example:

my_func <- function(x) {
  if (x %% 2 == 0) {
    "even"
  } else {
    "odd"
  }
}

my_func(4)
## [1] "even"
my_func(3)
## [1] "odd"


Conditions

The condition must evaluate to a single TRUE or FALSE. If it’s a vector of length greater than one, you’ll get an error (R versions before 4.2.0 gave only a warning); if it’s NA, you’ll also get an error. Watch out for these messages in your own code:

if (c(TRUE, FALSE)) {}
#> Error in if (c(TRUE, FALSE)) {: the condition has length > 1

if (NA) {}
#> Error in if (NA) {: missing value where TRUE/FALSE needed

When you need logical operations in a condition, use && and ||. These are non-vectorised operators that return a single value, which is what if and other conditional constructs need. Avoid the vectorised operators & and | here; those are meant for element-wise work, such as inside filter().

my_func2 <- function(x) {
  if (x %% 2 == 0 && x %% 3 == 0){
    "x is divisible by six"
  } else {
    "x is not divisible by six"
  }
}

my_func2(6)
## [1] "x is divisible by six"
my_func2(7)
## [1] "x is not divisible by six"
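The difference between the two families of operators is easiest to see side by side. The sketch below uses a made-up vector x and a hypothetical helper is_positive(). It shows & working element by element, any() collapsing a logical vector into a single value for if, and && short-circuiting so that an NA never reaches the comparison.

```r
x <- c(2, 4, 7)

# & works element by element and returns a vector
x %% 2 == 0 & x > 3
## [1] FALSE  TRUE FALSE

# any() (or all()) collapses a logical vector to one value, safe for if
if (any(x > 5)) "at least one value above 5" else "no value above 5"
## [1] "at least one value above 5"

# && stops as soon as the answer is known: when v is NA, !is.na(v) is
# FALSE and v > 0 is never evaluated, so the result is FALSE, not NA
is_positive <- function(v) {
  !is.na(v) && v > 0
}

is_positive(NA)
## [1] FALSE
is_positive(3)
## [1] TRUE
```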


Lab Exercise: Write a function contain_abc(string) that returns TRUE if the given string contains the letter “a”, “b” or “c”, and FALSE otherwise. Use grepl(letter, string) to detect whether a letter is in a string or not.


3. Lists


We have learned how to create vectors that contain multiple values of the same type with the c() function. These vectors are called atomic vectors, of which there are six types: logical, integer, double, character, complex, and raw. Integer and double vectors are collectively known as numeric vectors. In this course we won’t use complex and raw vectors, so we ignore them here.

We can use the typeof() function to check the type of a vector:

typeof(c(1,2,3))
## [1] "double"
typeof(c(1L, 2L, 3L))
## [1] "integer"
typeof(c("a", "b", "c"))
## [1] "character"
typeof(c(T, F, TRUE, FALSE))
## [1] "logical"

Note that a number by default is considered a double type. To force it into an integer, we need to place an L after the number.

These vectors are homogeneous in the sense that all values must be of the same type. If not, they will be forced into the same type.

typeof(c(1L, 2, 3))
## [1] "double"
typeof(c(1, 2, "a"))
## [1] "character"

In the example above, a mixture of integers and doubles is coerced to a double vector, while a mixture of numbers and strings is coerced to a character vector.

To create a heterogeneous vector that contains values of different atomic types, we have to use lists, which are sometimes called recursive vectors because lists can contain other lists. The diagram below shows the hierarchy of R’s vector types.

There’s one other related object: NULL. NULL is often used to represent the absence of a vector (as opposed to NA which is used to represent the absence of a value in a vector). NULL typically behaves like a vector of length 0.
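A short sketch of the difference between NULL and NA, using only base R functions: NULL has length 0 and disappears when combined with other values, while NA occupies a position in the vector.

```r
length(NULL)
## [1] 0
length(NA)
## [1] 1

# NULL vanishes inside c(); NA stays as a missing element
c(1, NULL, 3)
## [1] 1 3
c(1, NA, 3)
## [1]  1 NA  3

is.null(NULL)
## [1] TRUE
is.na(NA)
## [1] TRUE
```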

Every vector has two key properties: its type, which you can check with typeof(), and its length, which you can check with length().

typeof(letters)
## [1] "character"
typeof(1:10)
## [1] "integer"
x <- list("a", "b", 1:10)
length(x)
## [1] 3

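Vectors can also contain arbitrary additional metadata in the form of attributes. These attributes are used to create augmented vectors, which build on additional behaviour. There are three important types of augmented vector: factors (built on integer vectors), dates and date-times (built on numeric vectors), and data frames and tibbles (built on lists).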
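As a rough illustration of what “augmented” means, the sketch below creates a factor and a date from made-up values and inspects them. The underlying type is an ordinary atomic vector; the attributes (here the class, plus the levels for a factor) supply the extra behaviour.

```r
f <- factor(c("low", "high", "low"))

typeof(f)        # the underlying storage is an integer vector
## [1] "integer"
class(f)         # the class attribute marks it as a factor
## [1] "factor"
levels(f)        # the levels attribute maps integer codes to labels
## [1] "high" "low"
as.integer(f)    # the stored integer codes
## [1] 2 1 2

d <- as.Date("2024-01-01")

typeof(d)        # a date is stored as a number of days
## [1] "double"
class(d)
## [1] "Date"
```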


Coercion

There are two ways to convert, or coerce, one type of vector to another:

  • Explicit coercion happens when you call a function like as.logical(), as.integer(), as.double(), or as.character(). Whenever you find yourself using explicit coercion, you should always check whether you can make the fix upstream, so that the vector never had the wrong type in the first place. For example, you may need to tweak your readr col_types specification.

  • Implicit coercion happens when you use a vector in a specific context that expects a certain type of vector. For example, when you use a logical vector with a numeric summary function, or when you use a double vector where an integer vector is expected.

Explicit coercion is used relatively rarely, and is largely easy to understand. For implicit coercion, we’ve already seen an example - using a logical vector in a numeric context. In this case TRUE is converted to 1 and FALSE converted to 0:

x <- 1:100
y <- x <= 40
print(y)
##   [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [37]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [97] FALSE FALSE FALSE FALSE
sum(y)
## [1] 40
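For comparison, here is a brief sketch of both kinds of coercion on small made-up inputs. The as.*() calls are explicit; taking the mean of a logical vector is implicit, and gives the proportion of TRUE values.

```r
# Explicit coercion
as.integer(c("1", "2", "3"))
## [1] 1 2 3
as.character(c(1.5, 2))
## [1] "1.5" "2"
as.logical(c("TRUE", "false", "T"))
## [1]  TRUE FALSE  TRUE

# Implicit coercion: TRUE counts as 1 and FALSE as 0
y <- 1:100 <= 40
mean(y)          # proportion of values that are <= 40
## [1] 0.4
```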


Recycling Rules

As well as implicitly coercing the types of vectors to be compatible, R will also implicitly coerce the length of vectors. This is called vector recycling, because the shorter vector is repeated, or recycled, to the same length as the longer vector.

This is generally most useful when you are mixing vectors and “scalars”. I put scalars in quotes because R doesn’t actually have scalars: instead, a single number is a vector of length 1. Because there are no scalars, most built-in functions are vectorised, meaning that they will operate on a vector of numbers. That’s why, for example, this code works:

1:10 + 100
##  [1] 101 102 103 104 105 106 107 108 109 110
1:10 > 5
##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

In R, basic mathematical operations work with vectors. That means that you should never need to perform explicit iteration when performing simple mathematical computations.

c(1,3,5)/2
## [1] 0.5 1.5 2.5
c(1,3,5)^2
## [1]  1  9 25

It’s intuitive what should happen if you add two vectors of the same length, or a vector and a “scalar”, but what happens if you add two vectors of different lengths?

c(1,3,5,7,9,11) + c(1,2)
## [1]  2  5  6  9 10 13

Here, R expands the shorter vector to the length of the longer one; this is what recycling means. It happens silently unless the length of the longer vector is not an integer multiple of the length of the shorter:

c(1,3,5) + c(1,3)
## Warning in c(1, 3, 5) + c(1, 3): longer object length is not a multiple of
## shorter object length
## [1] 2 6 6
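One way to see what recycling does is to write the repetition out by hand with rep(). The sketch below (the variable names long and short are just for illustration) reproduces the silent case, and shows how the shorter vector is cut off in the case that triggers the warning.

```r
long  <- c(1, 3, 5, 7, 9, 11)
short <- c(1, 2)

# Recycling short is the same as repeating it three times
long + rep(short, times = 3)
## [1]  2  5  6  9 10 13
identical(long + short, long + rep(short, times = 3))
## [1] TRUE

# When the lengths do not divide evenly, the shorter vector is
# repeated and then truncated to fit
rep(c(1, 3), length.out = 3)
## [1] 1 3 1
c(1, 3, 5) + rep(c(1, 3), length.out = 3)
## [1] 2 6 6
```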

While vector recycling can be used to create very succinct, clever code, it can also silently conceal problems. For this reason, the vectorised functions in the tidyverse will throw errors when you recycle anything other than a scalar. If you do want to recycle, you’ll need to do it yourself with rep():

tibble(x = 1:4, y = 1:2)   # Error: tibble() only recycles vectors of length 1
tibble(x = 1:4, y = rep(1:2, 2))
## # A tibble: 4 × 2
##       x     y
##   <int> <int>
## 1     1     1
## 2     2     2
## 3     3     1
## 4     4     2
tibble(x = 1:4, y = rep(1:2, each = 2))
## # A tibble: 4 × 2
##       x     y
##   <int> <int>
## 1     1     1
## 2     2     1
## 3     3     2
## 4     4     2


Naming Vectors

All types of vectors can be named. You can name them during creation with c() or with purrr::set_names():

v1 <- c(x = 1, y = 2, z = 4)
v2 <- set_names(1:3, c("a", "b", "c"))

Named vectors are like dictionaries in Python. We can retrieve the values by their names using [] or [[]]:

v1["x"]
## x 
## 1
v1[c("x","y")]
## x y 
## 1 2
v1[["x"]]
## [1] 1
v2[["b"]]
## [1] 2

The difference between [ and [[ is that [[ only extracts a single element, and always drops names. It’s a good idea to use it whenever you want to make it clear that you’re extracting a single item, as in a for loop. The distinction between [ and [[ is most important for lists, as we’ll see shortly.
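A small sketch of that pattern, using a made-up named vector called scores: [ keeps the name attached, while [[ returns the bare value, which is usually what you want inside a loop.

```r
scores <- c(math = 90, art = 75, music = 82)

scores["art"]      # [ keeps the name
## art 
##  75
scores[["art"]]    # [[ drops the name
## [1] 75

# Looping over the names and extracting one value at a time
for (subject in names(scores)) {
  cat(subject, "->", scores[[subject]], "\n")
}
## math -> 90 
## art -> 75 
## music -> 82 
```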


Lists

Lists are a step up in complexity from atomic vectors, because lists can contain other lists. This makes them suitable for representing hierarchical or tree-like structures. You create a list with list():

x <- list(1, 2, 3)
x
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3

A very useful tool for working with lists is str() because it focusses on the structure, not the contents.

str(x)
## List of 3
##  $ : num 1
##  $ : num 2
##  $ : num 3
x_named <- list(a = 1, b = 2, c = 3)
str(x_named)
## List of 3
##  $ a: num 1
##  $ b: num 2
##  $ c: num 3

Unlike atomic vectors, a list can contain a mix of object types, and can even contain other lists.

y <- list("a", 1L, 1.5, TRUE)
str(y)
## List of 4
##  $ : chr "a"
##  $ : int 1
##  $ : num 1.5
##  $ : logi TRUE
z <- list(list(1, 2), list(3, 4))
str(z)
## List of 2
##  $ :List of 2
##   ..$ : num 1
##   ..$ : num 2
##  $ :List of 2
##   ..$ : num 3
##   ..$ : num 4


Subsetting a list

There are three ways to subset a list, which we’ll illustrate with a list named a:

a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

  • [ extracts a sub-list. The result will always be a list.

str(a[1:2])
## List of 2
##  $ a: int [1:3] 1 2 3
##  $ b: chr "a string"
str(a[4])
## List of 1
##  $ d:List of 2
##   ..$ : num -1
##   ..$ : num -5

  • [[ extracts a single component from a list. It removes a level of hierarchy from the list.

str(a[[1]])
##  int [1:3] 1 2 3
str(a[[4]])
## List of 2
##  $ : num -1
##  $ : num -5

  • $ is a shorthand for extracting named elements of a list. It works similarly to [[ except that you don’t need to use quotes.

a$a
## [1] 1 2 3
a[["a"]]
## [1] 1 2 3
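These operators can be chained to reach into nested structures. The sketch below reuses the list a defined above, and compares the types returned by [ and [[ for the same element.

```r
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

# [ returns a list of length 1; [[ returns the element itself
typeof(a["c"])
## [1] "list"
typeof(a[["c"]])
## [1] "double"

# Chaining: first pick the component, then index inside it
a$d[[2]]         # second element of the inner list d
## [1] -5
a[[4]][[1]]      # same idea, by position
## [1] -1
a[["a"]][2]      # second value of the integer vector a
## [1] 2
```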


Application of lists - ggplot object

You may wonder why we need lists at all in data analysis, since each column of a data frame holds values of a single type. For example,

mpg
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # … with 224 more rows

So each column is a vector of either character, integer or double. What is the use of lists here? Let’s create a plot.

plot1 <- ggplot(mpg) + geom_bar(mapping = aes(x = as.factor(cyl), fill = drv), position = "dodge")
plot1

We stored the plot in plot1. You might think that plot1 is merely a picture, and indeed, typing plot1 into the console draws the same plot.

However, all the information about the plot is stored in plot1. If we check its type, we see:

typeof(plot1)
## [1] "list"

So plot1 is actually a list. What is in it? The answer is everything. If we want to see the data set that was used to create the plot, we can use

plot1$data
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # … with 224 more rows

So the first element in the list plot1 is the original data frame. If we want to see the labels of the plot, we can use

plot1$labels
## $x
## [1] "as.factor(cyl)"
## 
## $fill
## [1] "drv"
## 
## $y
## [1] "count"
## attr(,"fallback")
## [1] TRUE
## 
## $weight
## [1] "weight"
## attr(,"fallback")
## [1] TRUE

We have another list returned which has again a few items in it. For example, the x-label and y-label are

plot1$labels$x
## [1] "as.factor(cyl)"
plot1$labels$y
## [1] "count"
## attr(,"fallback")
## [1] TRUE

We can actually change the labels by directly changing the values here

plot1$labels$x <- "New_x_label"
plot1$labels$y <- "New_y_label"
plot1


Use ggplot_build to adjust all details of a plot

We can use the ggplot_build() function on a ggplot object to obtain the computed data used to draw the plot, and then adjust any detail.

plot1_data <- ggplot_build(plot1)
plot1_data
## $data
## $data[[1]]
##      fill  y count prop     x flipped_aes PANEL group ymin ymax xmin xmax
## 1 #F8766D 23    23    1 0.775       FALSE     1     1    0   23 0.55 1.00
## 2 #00BA38 58    58    1 1.225       FALSE     1     2    0   58 1.00 1.45
## 3 #00BA38  4     4    1 2.000       FALSE     1     3    0    4 1.55 2.45
## 4 #F8766D 32    32    1 2.700       FALSE     1     4    0   32 2.55 2.85
## 5 #00BA38 43    43    1 3.000       FALSE     1     5    0   43 2.85 3.15
## 6 #619CFF  4     4    1 3.300       FALSE     1     6    0    4 3.15 3.45
## 7 #F8766D 48    48    1 3.700       FALSE     1     7    0   48 3.55 3.85
## 8 #00BA38  1     1    1 4.000       FALSE     1     8    0    1 3.85 4.15
## 9 #619CFF 21    21    1 4.300       FALSE     1     9    0   21 4.15 4.45
##   colour linewidth linetype alpha
## 1     NA       0.5        1    NA
## 2     NA       0.5        1    NA
## 3     NA       0.5        1    NA
## 4     NA       0.5        1    NA
## 5     NA       0.5        1    NA
## 6     NA       0.5        1    NA
## 7     NA       0.5        1    NA
## 8     NA       0.5        1    NA
## 9     NA       0.5        1    NA
## 
## 
## $layout
## <ggproto object: Class Layout, gg>
##     coord: <ggproto object: Class CoordCartesian, Coord, gg>
##         aspect: function
##         backtransform_range: function
##         clip: on
##         default: TRUE
##         distance: function
##         expand: TRUE
##         is_free: function
##         is_linear: function
##         labels: function
##         limits: list
##         modify_scales: function
##         range: function
##         render_axis_h: function
##         render_axis_v: function
##         render_bg: function
##         render_fg: function
##         setup_data: function
##         setup_layout: function
##         setup_panel_guides: function
##         setup_panel_params: function
##         setup_params: function
##         train_panel_guides: function
##         transform: function
##         super:  <ggproto object: Class CoordCartesian, Coord, gg>
##     coord_params: list
##     facet: <ggproto object: Class FacetNull, Facet, gg>
##         compute_layout: function
##         draw_back: function
##         draw_front: function
##         draw_labels: function
##         draw_panels: function
##         finish_data: function
##         init_scales: function
##         map_data: function
##         params: list
##         setup_data: function
##         setup_params: function
##         shrink: TRUE
##         train_scales: function
##         vars: function
##         super:  <ggproto object: Class FacetNull, Facet, gg>
##     facet_params: list
##     finish_data: function
##     get_scales: function
##     layout: data.frame
##     map_position: function
##     panel_params: list
##     panel_scales_x: list
##     panel_scales_y: list
##     render: function
##     render_labels: function
##     reset_scales: function
##     setup: function
##     setup_panel_guides: function
##     setup_panel_params: function
##     train_position: function
##     xlabel: function
##     ylabel: function
##     super:  <ggproto object: Class Layout, gg>
## 
## $plot

## 
## attr(,"class")
## [1] "ggplot_built"

So here we have another list with three elements $data, $layout and $plot. If we check $data we see all details of the plot.

plot1_data$data[[1]]
##      fill  y count prop     x flipped_aes PANEL group ymin ymax xmin xmax
## 1 #F8766D 23    23    1 0.775       FALSE     1     1    0   23 0.55 1.00
## 2 #00BA38 58    58    1 1.225       FALSE     1     2    0   58 1.00 1.45
## 3 #00BA38  4     4    1 2.000       FALSE     1     3    0    4 1.55 2.45
## 4 #F8766D 32    32    1 2.700       FALSE     1     4    0   32 2.55 2.85
## 5 #00BA38 43    43    1 3.000       FALSE     1     5    0   43 2.85 3.15
## 6 #619CFF  4     4    1 3.300       FALSE     1     6    0    4 3.15 3.45
## 7 #F8766D 48    48    1 3.700       FALSE     1     7    0   48 3.55 3.85
## 8 #00BA38  1     1    1 4.000       FALSE     1     8    0    1 3.85 4.15
## 9 #619CFF 21    21    1 4.300       FALSE     1     9    0   21 4.15 4.45
##   colour linewidth linetype alpha
## 1     NA       0.5        1    NA
## 2     NA       0.5        1    NA
## 3     NA       0.5        1    NA
## 4     NA       0.5        1    NA
## 5     NA       0.5        1    NA
## 6     NA       0.5        1    NA
## 7     NA       0.5        1    NA
## 8     NA       0.5        1    NA
## 9     NA       0.5        1    NA

$data stores one data frame per layer of the plot. This plot has only one layer (geom_bar()), so $data has only one element. We extract it with [[1]] and get a data frame.

class(plot1_data$data[[1]])
## [1] "data.frame"
typeof(plot1_data$data[[1]])
## [1] "list"

Note that class() returns the class of the augmented vector. A data frame is built on top of a list, so typeof() reports “list”; but it is an augmented vector with its own class name, data.frame.
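The same relationship can be checked on any small data frame. In the sketch below, df is a made-up example rather than the plot data: its type is list, its class is data.frame, and list tools such as length() and [[ work on it column by column.

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

typeof(df)       # the underlying structure is a list of columns
## [1] "list"
class(df)        # the class attribute makes it a data frame
## [1] "data.frame"
sort(names(attributes(df)))
## [1] "class"     "names"     "row.names"

length(df)       # number of list elements, i.e. columns
## [1] 2
df[["x"]]        # [[ extracts one column as a plain vector
## [1] 1 2 3
```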

This data frame stores the details of this bar plot. For example, it records the x range (xmin and xmax) of each bar. Let’s change this and see how the plot is updated:

plot1_data$data[[1]]$xmin[3] <- 2
plot(ggplot_gtable(plot1_data))

So we have successfully changed the left border of the bar for “5”. Note that “4”, “5”, “6” and “8” are labels; the actual x-coordinates are 1 to 4, respectively. So when we set xmin of the “5” bar to 2, the bar shrinks to its right half (from 2 to 2.45).

The ggplot_gtable() function converts the built plot data into a gtable object, which can be drawn with the plot() function.

Lab Exercise: Change the count of “5” to 100 manually by modifying plot1_data$data[[1]].


Another example of using lists

Another example of using lists is the map() function. Recall that we used that function to count the NA values in each column of a data frame.

map(mpg, ~sum(is.na(.)))
## $manufacturer
## [1] 0
## 
## $model
## [1] 0
## 
## $displ
## [1] 0
## 
## $year
## [1] 0
## 
## $cyl
## [1] 0
## 
## $trans
## [1] 0
## 
## $drv
## [1] 0
## 
## $cty
## [1] 0
## 
## $hwy
## [1] 0
## 
## $fl
## [1] 0
## 
## $class
## [1] 0

So the map() function always returns a list. It applies the function to each element of a vector (or a list) and returns the result in a list with the same names. For example, the following code returns the number of unique values in each column.

result <- map(mpg, ~length(unique(.)))
result
## $manufacturer
## [1] 15
## 
## $model
## [1] 38
## 
## $displ
## [1] 35
## 
## $year
## [1] 2
## 
## $cyl
## [1] 4
## 
## $trans
## [1] 10
## 
## $drv
## [1] 3
## 
## $cty
## [1] 21
## 
## $hwy
## [1] 27
## 
## $fl
## [1] 5
## 
## $class
## [1] 7

We can unlist this list and make a bar plot out of it.

barplot(unlist(result), cex.names = 0.5)
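The same list-in, list-out pattern exists in base R, which can help when purrr is not loaded. The sketch below uses a small made-up data frame instead of mpg: lapply() returns a list just as map() does, and vapply() returns a named atomic vector directly, so no unlist() step is needed. purrr offers typed variants such as map_int() for the same purpose.

```r
# A small hypothetical data frame standing in for mpg
df <- data.frame(
  g = c("a", "b", "a"),
  n = c(1, 2, 2),
  z = c(TRUE, TRUE, TRUE)
)

# lapply() returns a list with one element per column
counts_list <- lapply(df, function(col) length(unique(col)))
typeof(counts_list)
## [1] "list"

# vapply() returns an atomic vector and checks each result is one integer
counts_vec <- vapply(df, function(col) length(unique(col)), integer(1))
counts_vec
## g n z 
## 2 2 1 

# Flattening the list gives the same named vector
identical(unlist(counts_list), counts_vec)
## [1] TRUE
```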


Brief summary

Lists are widely used in graphing, modeling and analysis packages to store a complicated nested structure of data. Therefore most complex data classes are built upon lists (such as data frames, tibbles, linear models etc.). Understanding lists will help us extract, modify and store useful information from more complex data structures.