In this lecture, we are going to learn some important R basics, including functions, conditional execution, vectors, and lists.
All of these are important for moving forward into more advanced data analysis using R.
So far we have nearly always used built-in functions to fulfill tasks. However, sometimes built-in functions are not enough to do some specific things that we hope to do. In the last lecture, there was such an example.
make_datetime_100 <- function(year, month, day, time, tz = "UTC") {
  make_datetime(year, month, day, time %/% 100, time %% 100, 0, tz)
}
In this example, we needed to handle a time stored in a format such as “319” (3:19 am). To split it into hour and minute and then build a date-time object from all the pieces, we created the function above.
Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting. We have already learned how to write a function in Python. The basics are much the same in R, but the syntax is slightly different:
function_name <- function(<inputs separated by commas>) {
  # function body, indented by two spaces
}
For example,
add_one <- function(x) {
  x + 1
}
add_one(3)
## [1] 4
By default, the function will return the last value computed in the function.
add_one <- function(x) {
  x + 1
  print(x)
}
add_one(3)
## [1] 3
The function above returns x, which is the value of its last line, print(x). Note that in R, print() displays its argument and also returns it (invisibly), which is different from Python, where print() returns None.
add_one <- function(x) {
  print(x)
  return(x + 1)
  print(x)
}
y <- add_one(3)
## [1] 3
y
## [1] 4
Similar to Python, we use the return() function to specify the value to be returned by a function. When a function reaches a return() call, it stops there, and anything after return() is not executed.
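A common use of return() is to leave a function early when a special case is detected. The sketch below uses a hypothetical helper, safe_divide(), which is not part of the lecture and is only meant to illustrate the idea:

```r
# safe_divide() is a hypothetical helper used only for illustration.
safe_divide <- function(x, y) {
  if (y == 0) {
    return(NA_real_)  # stop here; nothing below this line runs
  }
  x / y  # last value computed, returned implicitly
}

safe_divide(6, 3)  # 2
safe_divide(1, 0)  # NA
```

The explicit return() handles the special case, while the normal result relies on the last-value rule described earlier.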
my_func <- function(a = 1, b = 2, c = 3, d = 4, e = 5) {
  c(a, b, c, d, e)
}
my_func()
## [1] 1 2 3 4 5
When a function takes multiple arguments, simply separate them with commas. If an argument has a default value, declare it right after the argument name.
Calling a function with multiple arguments is similar to Python. We can match arguments either by position or by keyword. The difference is that R is more flexible and doesn’t require keyword arguments to come after positional arguments.
my_func <- function(a = 1, b = 2, c = 3, d = 4, e = 5) {
  c(a, b, c, d, e)
}
my_func(10, 20)
## [1] 10 20 3 4 5
my_func(10, 20, e = 50)
## [1] 10 20 3 4 50
my_func(, 10) # This works but not recommended
## [1] 1 10 3 4 5
my_func(c = 30, 10, 20)
## [1] 10 20 30 4 5
Usually, in data analysis, it is recommended to omit the keyword for the first argument (the data set name), since it appears in almost every function call. For other arguments, it’s better to write out their names when calling the function.
ggplot(mpg) + # For the data set name, we usually omit the argument name
  geom_bar(mapping = aes(x = as.factor(cyl), fill = as.factor(year)), col = "blue", position = "dodge")
Other than the data set mpg, it’s recommended that you write out the keyword name of each argument.
An important thing to know is how functions interact with the environment outside the function frame (in Python it’s called the global frame). For example,
f <- function(x) {
  x + y
}
In many programming languages, this would be an error, because y is not defined inside the function. In R, this is valid code because R uses rules called lexical scoping to find the value associated with a name. Since y is not defined inside the function, R will look in the environment where the function was defined:
y <- 100
f(10)
## [1] 110
y <- 1000
f(10)
## [1] 1010
This behaviour seems like a recipe for bugs, and indeed you should avoid creating functions like this deliberately, but by and large it doesn’t cause too many problems (especially if you regularly restart R to get to a clean slate).
The advantage of this behaviour is that from a language standpoint it allows R to be very consistent. This power and flexibility is what makes tools like ggplot2 and dplyr possible.
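The same lookup rule applies when a function is created inside another function: the inner function finds names in the environment where it was created, not where it is called. The sketch below uses a hypothetical function factory, make_adder(), introduced here only for illustration:

```r
# make_adder() is a hypothetical function factory used for illustration.
make_adder <- function(n) {
  # n is looked up in the environment where this inner function was created
  function(x) x + n
}

add_two <- make_adder(2)
n <- 100       # a global n does not affect add_two
add_two(10)    # 12
```

Because add_two() carries its own n, changing a global variable of the same name has no effect on it, unlike the function f above, which depends on the global y.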
In R, using if to implement conditional execution looks like this:
if (condition) {
  # code executed when condition is TRUE
} else {
  # code executed when condition is FALSE
}
Below is a simple example:
my_func <- function(x) {
  if (x %% 2 == 0) {
    "even"
  } else {
    "odd"
  }
}
my_func(4)
## [1] "even"
my_func(3)
## [1] "odd"
The condition must evaluate to a single TRUE or FALSE. If it is a vector of length greater than one, or NA, you will get an error (before R 4.2.0, a longer vector only produced a warning). Watch out for these messages in your own code:
if (c(TRUE, FALSE)) {}
#> Error in if (c(TRUE, FALSE)) {: the condition has length > 1
if (NA) {}
#> Error in if (NA) {: missing value where TRUE/FALSE needed
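One way to avoid both errors is to reduce the condition to a single, non-missing value before it reaches if. The following is a minimal sketch using the base R functions all() and isTRUE(); the variable names are made up for illustration:

```r
x <- c(2, 4, NA)

# Collapse a vector of conditions into one value before using it in if().
all_even <- all(x %% 2 == 0, na.rm = TRUE)
if (all_even) {
  msg <- "all non-missing values are even"
} else {
  msg <- "some value is odd"
}
msg

# isTRUE() maps NA (or anything that is not a single TRUE) to FALSE.
isTRUE(NA)        # FALSE
isTRUE(x[1] > 1)  # TRUE
```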
When you need to combine logical expressions, use && and || in your condition. These operators are not vectorised: they work on single values and are meant for if and other conditional execution. Do not use the vectorised operators & and |, which are the ones used in filter(), in an if condition.
my_func2 <- function(x) {
  if (x %% 2 == 0 && x %% 3 == 0) {
    "x is divisible by six"
  } else {
    "x is not divisible by six"
  }
}
my_func2(6)
## [1] "x is divisible by six"
my_func2(7)
## [1] "x is not divisible by six"
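To make the difference between the two families of operators concrete, here is a small sketch. The helper is_positive_number() is hypothetical and only serves to show short-circuit evaluation:

```r
x <- c(6, 7, 12)

# & is vectorised: it returns one result per element.
x %% 2 == 0 & x %% 3 == 0   # TRUE FALSE TRUE

# && works on single values and short-circuits: the right-hand side is
# skipped when the left-hand side is already FALSE.
is_positive_number <- function(v) {
  is.numeric(v) && v > 0
}
is_positive_number(5)     # TRUE
is_positive_number("a")   # FALSE; v > 0 is never evaluated
```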
Write a function contain_abc(string) that returns TRUE if the given string contains the letter “a”, “b” or “c”, and FALSE otherwise. Use grepl(letter, string) to detect whether a letter is in a string or not.
We have learned how to create vectors that contain multiple values of the same type with the c() function.
These vectors are called atomic vectors, of which there
are six types: logical, integer, double, character, complex, and raw.
Integer and double vectors are collectively known as numeric vectors. In
this course we won’t use complex and raw vectors so we ignore them
here.
We can use the typeof() function to check the type of a vector:
typeof(c(1,2,3))
## [1] "double"
typeof(c(1L, 2L, 3L))
## [1] "integer"
typeof(c("a", "b", "c"))
## [1] "character"
typeof(c(T, F, TRUE, FALSE))
## [1] "logical"
Note that a number is by default considered a double. To force it to be an integer, we need to place an L after the number.
These vectors are homogeneous in the sense that all values must be of the same type. If not, they will be forced into the same type.
typeof(c(1L, 2, 3))
## [1] "double"
typeof(c(1, 2, "a"))
## [1] "character"
In the example above, a mixture of integers and doubles becomes a double vector, while a mixture of numbers and strings becomes a character vector.
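More generally, when types are mixed in c(), the result takes the most flexible type present, following the order logical, integer, double, character. A few more cases, as a quick sketch:

```r
# Mixed inputs are coerced to the most flexible type present:
# logical -> integer -> double -> character.
typeof(c(TRUE, 1L))     # "integer"
typeof(c(TRUE, 1.5))    # "double"
typeof(c(1L, "a"))      # "character"
c(TRUE, 2L, 3.5, "a")   # "TRUE" "2" "3.5" "a"
```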
To create a heterogeneous vector that contains values of different atomic types, we have to use lists, which are sometimes called recursive vectors because lists can contain other lists. The diagram below shows the hierarchy of R’s vector types.
There’s one other related object: NULL. NULL is often used to represent the absence of a vector (as opposed to NA, which is used to represent the absence of a value in a vector). NULL typically behaves like a vector of length 0.
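The difference between NULL and NA is easiest to see by checking lengths and by combining them with other values. A short sketch:

```r
length(NULL)       # 0: there is no vector at all
length(NA)         # 1: a vector holding one missing value

c(1, NULL, 3)      # NULL vanishes when combined: 1 3
c(1, NA, 3)        # NA keeps its position: 1 NA 3

is.null(NULL)      # TRUE
is.na(NA)          # TRUE
```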
Every vector has two key properties:
Its type, which you can determine with typeof().
typeof(letters)
## [1] "character"
typeof(1:10)
## [1] "integer"
Its length, which you can determine with length().
x <- list("a", "b", 1:10)
length(x)
## [1] 3
Vectors can also contain arbitrary additional metadata in the form of attributes. These attributes are used to create augmented vectors which build on additional behaviour. There are three important types of augmented vector:
Factors are built on top of integer vectors.
Dates and date-times are built on top of numeric vectors.
Data frames and tibbles are built on top of lists.
There are two ways to convert, or coerce, one type of vector to another:
Explicit coercion happens when you call a function like as.logical(), as.integer(), as.double(), or as.character(). Whenever you find yourself using explicit coercion, you should always check whether you can make the fix upstream, so that the vector never had the wrong type in the first place. For example, you may need to tweak your readr col_types specification.
Implicit coercion happens when you use a vector in a specific context that expects a certain type of vector. For example, when you use a logical vector with a numeric summary function, or when you use a double vector where an integer vector is expected.
Explicit coercion is used relatively rarely, and is largely easy to understand. For implicit coercion, we’ve already seen an example - using a logical vector in a numeric context. In this case TRUE is converted to 1 and FALSE converted to 0:
x <- 1:100
y <- x <= 40
print(y)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [13] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [25] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [37] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE
sum(y)
## [1] 40
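For comparison, the sketch below puts implicit and explicit coercion side by side, using only base R functions named in this section plus mean():

```r
y <- (1:100) <= 40

# Implicit coercion: TRUE counts as 1 and FALSE as 0.
mean(y)                      # proportion of TRUE values: 0.4

# Explicit coercion with the as.*() functions.
as.integer(y[1:3])           # 1 1 1
as.integer(c("1", "2"))      # 1 2
as.logical(c("TRUE", "no"))  # TRUE NA ("no" is not a recognised value)
as.character(1:3)            # "1" "2" "3"
```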
As well as implicitly coercing the types of vectors to be compatible, R will also implicitly coerce the length of vectors. This is called vector recycling, because the shorter vector is repeated, or recycled, to the same length as the longer vector.
This is generally most useful when you are mixing vectors and “scalars”. I put scalars in quotes because R doesn’t actually have scalars: instead, a single number is a vector of length 1. Because there are no scalars, most built-in functions are vectorised, meaning that they will operate on a vector of numbers. That’s why, for example, this code works:
1:10 + 100
## [1] 101 102 103 104 105 106 107 108 109 110
1:10 > 5
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
In R, basic mathematical operations work with vectors. That means that you should never need to perform explicit iteration when performing simple mathematical computations.
c(1,3,5)/2
## [1] 0.5 1.5 2.5
c(1,3,5)^2
## [1] 1 9 25
It’s intuitive what should happen if you add two vectors of the same length, or a vector and a “scalar”, but what happens if you add two vectors of different lengths?
c(1,3,5,7,9,11) + c(1,2)
## [1] 2 5 6 9 10 13
Here, R expands the shorter vector to the same length as the longer one, which is called recycling. This is silent except when the length of the longer vector is not an integer multiple of the length of the shorter:
c(1,3,5) + c(1,3)
## Warning in c(1, 3, 5) + c(1, 3): longer object length is not a multiple of
## shorter object length
## [1] 2 6 6
While vector recycling can be used to create very succinct, clever code, it can also silently conceal problems. For this reason, the vectorised functions in tidyverse will throw errors when you recycle anything other than a scalar. If you do want to recycle, you’ll need to do it yourself with rep():
tibble(x = 1:4, y = 1:2) # error: y has length 2 but x has length 4
tibble(x = 1:4, y = rep(1:2, 2))
## # A tibble: 4 × 2
## x y
## <int> <int>
## 1 1 1
## 2 2 2
## 3 3 1
## 4 4 2
tibble(x = 1:4, y = rep(1:2, each = 2))
## # A tibble: 4 × 2
## x y
## <int> <int>
## 1 1 1
## 2 2 1
## 3 3 2
## 4 4 2
All types of vectors can be named. You can name them during creation with c() or with purrr::set_names():
v1 <- c(x = 1, y = 2, z = 4)
v2 <- set_names(1:3, c("a", "b", "c"))
Named vectors are like dictionaries in Python. We can retrieve the values by their names using [] or [[]]:
v1["x"]
## x
## 1
v1[c("x","y")]
## x y
## 1 2
v1[["x"]]
## [1] 1
v2[["b"]]
## [1] 2
The difference between [ and [[ is that [[ only extracts a single element, and always drops names. It’s a good idea to use it whenever you want to make it clear that you’re extracting a single item, as in a for loop. The distinction between [ and [[ is most important for lists, as we’ll see shortly.
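The sketch below checks the name-dropping behaviour directly and shows [[ inside a for loop. The vector v and the variable total are made up for illustration:

```r
v <- c(a = 1, b = 2, c = 3)

v["b"]            # a named vector of length 1
v[["b"]]          # the bare value, name dropped
names(v["b"])     # "b"
names(v[["b"]])   # NULL

# [[ ]] makes it explicit that one element is taken per iteration.
total <- 0
for (i in seq_along(v)) {
  total <- total + v[[i]]
}
total             # 6
```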
Lists are a step up in complexity from atomic vectors, because lists can contain other lists. This makes them suitable for representing hierarchical or tree-like structures. You create a list with list():
x <- list(1, 2, 3)
x
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
A very useful tool for working with lists is str() because it focusses on the structure, not the contents.
str(x)
## List of 3
## $ : num 1
## $ : num 2
## $ : num 3
x_named <- list(a = 1, b = 2, c = 3)
str(x_named)
## List of 3
## $ a: num 1
## $ b: num 2
## $ c: num 3
Unlike atomic vectors, lists can contain a mix of objects of different types, and can even contain other lists.
y <- list("a", 1L, 1.5, TRUE)
str(y)
## List of 4
## $ : chr "a"
## $ : int 1
## $ : num 1.5
## $ : logi TRUE
z <- list(list(1, 2), list(3, 4))
str(z)
## List of 2
## $ :List of 2
## ..$ : num 1
## ..$ : num 2
## $ :List of 2
## ..$ : num 3
## ..$ : num 4
There are three ways to subset a list, which we’ll illustrate with a list named a:
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
[ extracts a sub-list. The result will always be a list.
str(a[1:2])
## List of 2
## $ a: int [1:3] 1 2 3
## $ b: chr "a string"
str(a[4])
## List of 1
## $ d:List of 2
## ..$ : num -1
## ..$ : num -5
[[ extracts a single component from a list. It removes a level of hierarchy from the list.
str(a[[1]])
## int [1:3] 1 2 3
str(a[[4]])
## List of 2
## $ : num -1
## $ : num -5
$ is a shorthand for extracting named elements of a list. It works similarly to [[ except that you don’t need to use quotes.
a$a
## [1] 1 2 3
a[["a"]]
## [1] 1 2 3
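The three operators can also be chained to reach nested elements. A brief sketch using the same list a:

```r
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

typeof(a["a"])      # "list": [ keeps the list wrapper
typeof(a[["a"]])    # "integer": [[ returns the vector itself

# Chain extractions to reach nested elements.
a$d[[2]]            # -5
a[["d"]][[1]]       # -1
a$a[2]              # 2
```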
ggplot object
You may wonder why we need lists at all in data analysis, since every column of a data frame holds values of a single type. For example:
mpg
## # A tibble: 234 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # … with 224 more rows
So each column is a vector of either character, integer or double. What is the use of lists here? Let’s create a plot.
plot1 <- ggplot(mpg) + geom_bar(mapping = aes(x = as.factor(cyl), fill = drv), position = "dodge")
plot1
We named the plot plot1. You might think that plot1 is merely a picture, and indeed typing plot1 into the console displays the plot. However, all of the information needed to draw the plot is stored in plot1. If we check its type, we see that:
typeof(plot1)
## [1] "list"
So plot1 is actually a list. What is in it? The answer is everything. If we want to see the data set that was used to create the plot, we can use
plot1$data
## # A tibble: 234 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # … with 224 more rows
So the data element of plot1 is the original data frame. If we want to see the labels of the plot, we can use
plot1$labels
## $x
## [1] "as.factor(cyl)"
##
## $fill
## [1] "drv"
##
## $y
## [1] "count"
## attr(,"fallback")
## [1] TRUE
##
## $weight
## [1] "weight"
## attr(,"fallback")
## [1] TRUE
We have another list returned which has again a few items in it. For example, the x-label and y-label are
plot1$labels$x
## [1] "as.factor(cyl)"
plot1$labels$y
## [1] "count"
## attr(,"fallback")
## [1] TRUE
We can actually change the labels by directly changing the values here
plot1$labels$x <- "New_x_label"
plot1$labels$y <- "New_y_label"
plot1
ggplot_build to adjust all details of a plot
We can use the ggplot_build() function on a ggplot object to obtain all of the raw data used to produce the plot, and to adjust any details.
plot1_data <- ggplot_build(plot1)
plot1_data
## $data
## $data[[1]]
## fill y count prop x flipped_aes PANEL group ymin ymax xmin xmax
## 1 #F8766D 23 23 1 0.775 FALSE 1 1 0 23 0.55 1.00
## 2 #00BA38 58 58 1 1.225 FALSE 1 2 0 58 1.00 1.45
## 3 #00BA38 4 4 1 2.000 FALSE 1 3 0 4 1.55 2.45
## 4 #F8766D 32 32 1 2.700 FALSE 1 4 0 32 2.55 2.85
## 5 #00BA38 43 43 1 3.000 FALSE 1 5 0 43 2.85 3.15
## 6 #619CFF 4 4 1 3.300 FALSE 1 6 0 4 3.15 3.45
## 7 #F8766D 48 48 1 3.700 FALSE 1 7 0 48 3.55 3.85
## 8 #00BA38 1 1 1 4.000 FALSE 1 8 0 1 3.85 4.15
## 9 #619CFF 21 21 1 4.300 FALSE 1 9 0 21 4.15 4.45
## colour linewidth linetype alpha
## 1 NA 0.5 1 NA
## 2 NA 0.5 1 NA
## 3 NA 0.5 1 NA
## 4 NA 0.5 1 NA
## 5 NA 0.5 1 NA
## 6 NA 0.5 1 NA
## 7 NA 0.5 1 NA
## 8 NA 0.5 1 NA
## 9 NA 0.5 1 NA
##
##
## $layout
## <ggproto object: Class Layout, gg>
## coord: <ggproto object: Class CoordCartesian, Coord, gg>
## aspect: function
## backtransform_range: function
## clip: on
## default: TRUE
## distance: function
## expand: TRUE
## is_free: function
## is_linear: function
## labels: function
## limits: list
## modify_scales: function
## range: function
## render_axis_h: function
## render_axis_v: function
## render_bg: function
## render_fg: function
## setup_data: function
## setup_layout: function
## setup_panel_guides: function
## setup_panel_params: function
## setup_params: function
## train_panel_guides: function
## transform: function
## super: <ggproto object: Class CoordCartesian, Coord, gg>
## coord_params: list
## facet: <ggproto object: Class FacetNull, Facet, gg>
## compute_layout: function
## draw_back: function
## draw_front: function
## draw_labels: function
## draw_panels: function
## finish_data: function
## init_scales: function
## map_data: function
## params: list
## setup_data: function
## setup_params: function
## shrink: TRUE
## train_scales: function
## vars: function
## super: <ggproto object: Class FacetNull, Facet, gg>
## facet_params: list
## finish_data: function
## get_scales: function
## layout: data.frame
## map_position: function
## panel_params: list
## panel_scales_x: list
## panel_scales_y: list
## render: function
## render_labels: function
## reset_scales: function
## setup: function
## setup_panel_guides: function
## setup_panel_params: function
## train_position: function
## xlabel: function
## ylabel: function
## super: <ggproto object: Class Layout, gg>
##
## $plot
##
## attr(,"class")
## [1] "ggplot_built"
So here we have another list with three elements: $data, $layout and $plot. If we check $data, we see all the details of the plot.
plot1_data$data[[1]]
## fill y count prop x flipped_aes PANEL group ymin ymax xmin xmax
## 1 #F8766D 23 23 1 0.775 FALSE 1 1 0 23 0.55 1.00
## 2 #00BA38 58 58 1 1.225 FALSE 1 2 0 58 1.00 1.45
## 3 #00BA38 4 4 1 2.000 FALSE 1 3 0 4 1.55 2.45
## 4 #F8766D 32 32 1 2.700 FALSE 1 4 0 32 2.55 2.85
## 5 #00BA38 43 43 1 3.000 FALSE 1 5 0 43 2.85 3.15
## 6 #619CFF 4 4 1 3.300 FALSE 1 6 0 4 3.15 3.45
## 7 #F8766D 48 48 1 3.700 FALSE 1 7 0 48 3.55 3.85
## 8 #00BA38 1 1 1 4.000 FALSE 1 8 0 1 3.85 4.15
## 9 #619CFF 21 21 1 4.300 FALSE 1 9 0 21 4.15 4.45
## colour linewidth linetype alpha
## 1 NA 0.5 1 NA
## 2 NA 0.5 1 NA
## 3 NA 0.5 1 NA
## 4 NA 0.5 1 NA
## 5 NA 0.5 1 NA
## 6 NA 0.5 1 NA
## 7 NA 0.5 1 NA
## 8 NA 0.5 1 NA
## 9 NA 0.5 1 NA
$data stores one data frame per layer of the plot. This graph has only one layer (the geom_bar() layer), so $data has only one element. We extract it with [[1]] and get a data frame.
class(plot1_data$data[[1]])
## [1] "data.frame"
typeof(plot1_data$data[[1]])
## [1] "list"
Note that class() returns the name of the augmented vector. A data frame is built on top of a list, so typeof() reports a list, but as an augmented type it has its own class name: data.frame.
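The same contrast between typeof() and class() holds for the other augmented vectors mentioned earlier. A small sketch in base R, with object names chosen only for illustration:

```r
f <- factor(c("low", "high", "low"))
typeof(f)        # "integer"
class(f)         # "factor"
attributes(f)    # the levels and the class are stored as attributes

d <- as.Date("2020-01-01")
typeof(d)        # "double"
class(d)         # "Date"
unclass(d)       # 18262, the number of days since 1970-01-01

df <- data.frame(x = 1:2, y = c("a", "b"))
typeof(df)       # "list"
class(df)        # "data.frame"
```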
This data frame stores the details of this bar plot, for example the range of x for each bar. Let’s change this and see how the plot will be updated:
plot1_data$data[[1]]$xmin[3] <- 2
plot(ggplot_gtable(plot1_data))
So we successfully changed the left border of the bar above “5”. Note that “4”, “5”, “6” and “8” are labels, and the actual x-coordinates are 1 to 4 respectively. So when we set xmin of the bar at “5” to 2, we get the plot above.
The ggplot_gtable() function converts the built plot data into a gtable object (a table of graphical objects), which can be displayed by the plot() function.
Another example of using lists is the map() function. Recall that we used that function to count the NA values in each column of a data frame.
map(mpg, ~sum(is.na(.)))
## $manufacturer
## [1] 0
##
## $model
## [1] 0
##
## $displ
## [1] 0
##
## $year
## [1] 0
##
## $cyl
## [1] 0
##
## $trans
## [1] 0
##
## $drv
## [1] 0
##
## $cty
## [1] 0
##
## $hwy
## [1] 0
##
## $fl
## [1] 0
##
## $class
## [1] 0
So the map() function always returns a list. It applies a function to each element of a vector (or a list) and returns the results in a list with the same names. For example, the following code returns the number of unique values in each column.
result <- map(mpg, ~length(unique(.)))
result
## $manufacturer
## [1] 15
##
## $model
## [1] 38
##
## $displ
## [1] 35
##
## $year
## [1] 2
##
## $cyl
## [1] 4
##
## $trans
## [1] 10
##
## $drv
## [1] 3
##
## $cty
## [1] 21
##
## $hwy
## [1] 27
##
## $fl
## [1] 5
##
## $class
## [1] 7
We can unlist this list and make a bar plot out of it.
barplot(unlist(result), cex.names = 0.5)
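The same pattern can be sketched without purrr. The example below swaps in the base R functions lapply() and vapply() as stand-ins for map(), and uses a small made-up data frame instead of mpg, so it runs without loading any package:

```r
# A small made-up data frame, used only for illustration.
df <- data.frame(x = c(1, 1, 2), y = c("a", "b", NA))

# lapply() always returns a list, one element per column.
res_list <- lapply(df, function(col) length(unique(col)))
str(res_list)

# vapply() returns an atomic vector and checks each result's type and length.
res_vec <- vapply(df, function(col) length(unique(col)), integer(1))
res_vec

# unlist() flattens the list into a named vector.
unlist(res_list)
```

Returning an atomic vector directly avoids the unlist() step when every result is known to be a single value of the same type.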
Lists are widely used in graphing, modeling and analysis packages to store a complicated nested structure of data. Therefore most complex data classes are built upon lists (such as data frames, tibbles, linear models etc.). Understanding lists will help us extract, modify and store useful information from more complex data structures.
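As one illustration of this point, a fitted linear model can be inspected with the same list tools. The sketch uses lm() on the built-in mtcars data set, which is an example chosen here and not one from the lecture:

```r
fit <- lm(mpg ~ wt, data = mtcars)  # mtcars ships with base R

typeof(fit)               # "list"
class(fit)                # "lm"
names(fit)[1:3]           # the first few components stored in the model

fit$coefficients          # intercept and slope
fit[["residuals"]][1:3]   # first three residuals
```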