R Software

Applies

Apply functions such as sapply, apply and lapply allow us to, well, apply a function over an object (e.g., a vector, list or matrix) in R. Using an apply function is often cleaner and easier to read than a for loop, and is sometimes marginally faster (we won’t be working with truly large datasets, so we won’t be overly focused on processing speed). The setup is relatively simple, and we will start with sapply, which just stands for ‘simplify apply.’

# define a vector
x <- 1:10

# apply the square function over the vector
sapply(x, function(x) x^2)

##  [1]   1   4   9  16  25  36  49  64  81 100

The first argument is simply the object, and the second argument is the function we want to apply over the object. Here we use the square function, so we write function(x) x ^ 2, which tells sapply to apply a function that takes input x and returns output x ^ 2.

It is important to note that, for an operation like this, taking advantage of R’s vectorization (ability to automatically apply a function over a vector or other object) would be quicker. We can see that here, by using Sys.time() to measure how long our code takes to run.

# sapply solution
start <- Sys.time()
sapply(x, function(x) x^2)

##  [1]   1   4   9  16  25  36  49  64  81 100

print(Sys.time()-start)

## Time difference of 0.02497196 secs

# vectorization solution
start <- Sys.time()
x ^ 2

##  [1]   1   4   9  16  25  36  49  64  81 100

print(Sys.time() - start)

## Time difference of 0.001065969 secs
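A single Sys.time() difference on a ten-element vector mostly measures noise. As a rough sketch (the name x_big and its size are illustrative choices, not part of the original example), system.time() on a larger input makes the gap easier to see, and a quick check confirms that both approaches compute the same values:

```r
# a larger vector makes the timing difference easier to see
# (x_big and its size are illustrative choices)
x_big <- as.numeric(1:100000)

# sapply solution: calls the function once per element
time_sapply <- system.time(res_sapply <- sapply(x_big, function(v) v ^ 2))

# vectorized solution: one call on the whole vector
time_vec <- system.time(res_vec <- x_big ^ 2)

# both approaches return the same numbers
all.equal(res_sapply, res_vec)

# elapsed times in seconds (exact values vary by machine)
time_sapply["elapsed"]
time_vec["elapsed"]
```

The exact timings depend on the machine, so no expected values are shown; the point is only the relative difference.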

However, there are times when sapply can be useful. Imagine that you had a vector of 1000 random letters (the lowercase alphabet is always available in R via the object letters) and wanted to create a numeric vector that had value 1 if the corresponding letter was a vowel and 0 if it was a consonant. Here, we can make use of sapply (we use the head function to check the first few entries in the output):

# sample letters:
samp_letters <- sample(letters, size = 1000, replace = TRUE)

# sapply solution
start <- Sys.time()
head(sapply(samp_letters, function(x) {
            if (x %in% c("a", "e", "i", "o", "u")) return(1)
            return(0)
            }))

## e f w k z r 
## 1 0 0 0 0 0

print(Sys.time() - start)

## Time difference of 0.004759073 secs
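The if statement inside the function handles one letter at a time, which is why sapply is a natural fit here. For this particular task, though, %in% is itself vectorized, so the same vector can be built without sapply. This is a sketch; the seed and the names vowels, res_sapply and res_vec are illustrative additions:

```r
# sample letters (same setup as above; the seed is an illustrative addition)
set.seed(1)
samp_letters <- sample(letters, size = 1000, replace = TRUE)
vowels <- c("a", "e", "i", "o", "u")

# sapply solution, with the names dropped for comparison
res_sapply <- unname(sapply(samp_letters, function(x) {
  if (x %in% vowels) return(1)
  return(0)
}))

# vectorized solution: %in% tests every element at once
res_vec <- as.numeric(samp_letters %in% vowels)

# the two solutions agree
identical(res_sapply, res_vec)
head(res_vec)
```

sapply remains the better tool when the function applied to each element cannot be expressed as a single vectorized operation.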

We can use apply and lapply for similar applications, just with different objects. For example, apply works on arrays or matrices. We merely need to specify if we need to apply the function over the rows (we enter 1 as the second argument) or the columns (we enter 2 as the second argument).

We can test apply by setting up a matrix of random data and then, for example, finding the value in the 95th percentile of each row.

# generate data
x <- matrix(rnorm(1000 * 100),
            nrow = 1000,
            ncol = 100)

# apply 95th percentile function
head(apply(x, 1, function(x) quantile(x, .95)))

## [1] 1.846214 1.907840 1.468298 1.514143 1.992183 1.590457

If we wanted to apply that same function over the columns of the matrix, we would simply use 2 in the second argument instead of 1.
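A small matrix makes the row/column distinction easy to check by hand. The matrix m below is an illustrative example, not part of the data above:

```r
# small matrix: 2 rows, 3 columns, filled column by column
m <- matrix(1:6, nrow = 2)

# second argument 1: one result per row
row_totals <- apply(m, 1, sum)
row_totals   # 9 12

# second argument 2: one result per column
col_totals <- apply(m, 2, sum)
col_totals   # 3 7 11

# for sums, base R also has dedicated helpers
rowSums(m)
colSums(m)
```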

Finally, lapply works in the same way but on lists. A quick refresher on lists: they are R objects whose elements may be of different ‘types’ (numerics, strings, etc.). Essentially, they allow us to group together things that might not be the same type. If we use the more common concatenate function c, we get some strange results when we mix types:

c(4, "hello")

## [1] "4"     "hello"

Notice that concatenating a numeric, 4, with the string, hello, converts 4 to a string so that the two types match. Obviously, this would not be helpful if we wanted to maintain that 4 was a numeric; lists are helpful in this regard.

x <- list(4, "hello", rnorm(5))
x[[1]]

## [1] 4

x[[2]]

## [1] "hello"

x[[3]]

## [1]  1.4899459 -0.5525289  1.2706936 -1.0618517 -0.3645837

Note that the list maintains the type of the inputs; 4 and the rnorm(5) output are still numeric, while hello is still a string. Also note that we can index into a list with x[[i]] (this just grabs the $i^{th}$ element in the list).
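The difference between double and single brackets is a common stumbling block, so here is a brief sketch using a list like the one above, with fixed values in place of rnorm(5) so the results are reproducible:

```r
x <- list(4, "hello", c(1.5, 2.5))

# double brackets extract the element itself
class(x[[1]])   # "numeric"
x[[1]] + 1      # 5

# single brackets return a shorter list that still wraps the element
class(x[1])     # "list"
length(x[2:3])  # 2
```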

Let’s try using lapply, then, to simply find the type of all of the elements in a list (which we can do with the function class).

x <- list(4, "hello", rnorm(5))
lapply(x, class)

## [[1]]
## [1] "numeric"
## 
## [[2]]
## [1] "character"
## 
## [[3]]
## [1] "numeric"
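Since sapply stands for ‘simplify apply,’ it is worth seeing what the simplification does with the same input. The sketch below also shows vapply, a stricter base R relative; the names res_list, res_vec and res_checked are illustrative:

```r
x <- list(4, "hello", rnorm(5))

# lapply always returns a list
res_list <- lapply(x, class)

# sapply simplifies the same result to a character vector
res_vec <- sapply(x, class)
res_vec   # "numeric" "character" "numeric"

# vapply does the same but checks each result against a template,
# here "exactly one character string"
res_checked <- vapply(x, class, character(1))

identical(res_vec, res_checked)
```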

data.table

There are many handy objects for storing data in R; you’ve probably grown accustomed to using data.frame, matrix and similar functions. The data.table object (available, of course, in the data.table package) is especially popular and useful.

Since this isn’t a data science class, we won’t be covering the deep intricacies of the package; some of the complexities are explored in this vignette. Let’s start with actually defining a data.table:

library(data.table)
data <- data.table(letter = letters,
                   score = rnorm(26))
head(data)

##    letter      score
## 1:      a  1.4059180
## 2:      b -0.6760500
## 3:      c -0.4860818
## 4:      d  0.3373064
## 5:      e -0.5685178
## 6:      f  1.6164520

This might be a dataset about, say, how much someone likes each letter in the alphabet (they assign a numerical score based on how much they like the letter).

The $ operator allows us to access individual columns of the data.table:

head(data$letter)

## [1] "a" "b" "c" "d" "e" "f"

mean(data$score)

## [1] 0.2039537

We can also easily index into any row of the data.table:

# grab 17th row
data[17]

##    letter       score
## 1:      q -0.02860032

# grab 'm' row
data[which(letter == "m")]

##    letter     score
## 1:      m -1.094647

We can even index into a specific row and column:

# grab 23rd row, 2nd column
data[23, 2]

##        score
## 1: -2.118373

It’s very easy to add a column with the $ and to remove it by setting the column to NULL:

# add column...
data$score_sq <- data$score ^ 2
head(data)

##    letter      score  score_sq
## 1:      a  1.4059180 1.9766054
## 2:      b -0.6760500 0.4570436
## 3:      c -0.4860818 0.2362755
## 4:      d  0.3373064 0.1137756
## 5:      e -0.5685178 0.3232124
## 6:      f  1.6164520 2.6129170

# ...and remove it
data$score_sq <- NULL
head(data)

##    letter      score
## 1:      a  1.4059180
## 2:      b -0.6760500
## 3:      c -0.4860818
## 4:      d  0.3373064
## 5:      e -0.5685178
## 6:      f  1.6164520

If you don’t like the $ operator, data.table has a built-in way to add columns with the := operator. Simply put the name of the new column (score_sq here) in front of the := operator and put the value of the new column after the operator (here, score ^ 2). By leaving the first argument (before the ,) empty in data[], we are telling data.table to perform this calculation for all rows. Similar to the $ operator, we can use NULL to remove columns.

data[ , score_sq := score ^ 2]
head(data)

##    letter      score  score_sq
## 1:      a  1.4059180 1.9766054
## 2:      b -0.6760500 0.4570436
## 3:      c -0.4860818 0.2362755
## 4:      d  0.3373064 0.1137756
## 5:      e -0.5685178 0.3232124
## 6:      f  1.6164520 2.6129170

data[ , score_sq := NULL]
head(data)

##    letter      score
## 1:      a  1.4059180
## 2:      b -0.6760500
## 3:      c -0.4860818
## 4:      d  0.3373064
## 5:      e -0.5685178
## 6:      f  1.6164520
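One point worth knowing about := is that it changes the table in place rather than returning a modified copy. The sketch below uses a small made-up table (the names dt, alias and snapshot are illustrative) to show that a plain assignment does not protect the original, while data.table’s copy function does:

```r
library(data.table)

# small illustrative table
dt <- data.table(letter = c("a", "b", "c"),
                 score = c(1, -2, 3))

# plain assignment: alias refers to the same table as dt
alias <- dt

# copy() makes an independent table
snapshot <- copy(dt)

# := modifies dt in place
dt[ , score_sq := score ^ 2]

# the alias sees the new column; the copy does not
"score_sq" %in% names(alias)      # TRUE
"score_sq" %in% names(snapshot)   # FALSE
```

If the original table needs to be kept unchanged, taking a copy before using := is the safer habit.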

It’s also easy to use the := operator to change existing values in the table instead of adding an entirely new column. We can use the first argument in data[] to select the correct row, then use an existing column (here, score) before the := operator to set a new value:

# change 'f' row to zero
data[letter == "f", score := 0]
data[letter == "f"]

##    letter score
## 1:      f     0

We can quickly order the data.table by employing the order function:

head(data[order(score, decreasing = TRUE)])

##    letter     score
## 1:      z 2.6848363
## 2:      t 2.1318430
## 3:      o 1.4492866
## 4:      a 1.4059180
## 5:      r 0.9768593
## 6:      i 0.8467144

There are many more nifty things that you can do with data.table, but they are perhaps better suited for a data science course (check out the vignette linked above for more). We’ve covered a strong base here, and will be sure to work through any examples with any odd applications. The final benefit is that data.table objects will be very useful for our purposes: they work well with the ggplot2 visual suite, for instance (more on that to come!).

Time Series

Working with time series data - simply data over time - is usually a fundamental part of any practical statistical endeavor. Working with dates can be frustrating but, luckily, R provides plenty of tools to ease the load.

The xts package is one of the more popular time series packages in R.

library("xts")

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## 
## Attaching package: 'xts'

## The following objects are masked from 'package:data.table':
## 
##     first, last

We can initialize a time series simply with data and a vector of dates (which we specify with the order.by argument):

# replicate
set.seed(0)

# generate data
y <- 1:10 + rnorm(10)

# date vector
x <- seq(from = Sys.Date() - 9, to = Sys.Date(), by = 1)

# create time series
data <- xts(y, order.by = x)
data

##                 [,1]
## 2022-09-02  2.262954
## 2022-09-03  1.673767
## 2022-09-04  4.329799
## 2022-09-05  5.272429
## 2022-09-06  5.414641
## 2022-09-07  4.460050
## 2022-09-08  6.071433
## 2022-09-09  7.705280
## 2022-09-10  8.994233
## 2022-09-11 12.404653

We can access the dates in a time series by employing the index function (which simply asks for the ‘index’ of a time series):

index(data)

##  [1] "2022-09-02" "2022-09-03" "2022-09-04" "2022-09-05" "2022-09-06"
##  [6] "2022-09-07" "2022-09-08" "2022-09-09" "2022-09-10" "2022-09-11"
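Because the index is a vector of dates, it can also be used to pick out part of a series. The sketch below uses fixed dates so that it is reproducible (the series above depends on Sys.Date(), so its dates change from day to day); the names dates, ts_fixed, ts_recent and ts_window are illustrative:

```r
library(xts)

# fixed dates so the example is reproducible
dates <- seq(from = as.Date("2022-09-02"), by = "day", length.out = 10)
ts_fixed <- xts(1:10, order.by = dates)

# index() returns Date objects, so ordinary date comparisons work
ts_recent <- ts_fixed[index(ts_fixed) >= as.Date("2022-09-10")]
ts_recent

# xts also accepts a "from/to" date string inside the brackets
ts_window <- ts_fixed["2022-09-05/2022-09-07"]
as.numeric(ts_window)   # 4 5 6
```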

Base R is also pretty good at plotting these xts objects:

plot(data)

The above example was a univariate time series, but we can easily forge a multivariate time series by inputting multivariate data into xts:

# replicate
set.seed(0)

# generate data
y <- 1:10 + rnorm(10)
z <- 1:10 + rnorm(10)

# date vector
x <- seq(from = Sys.Date() - 9, to = Sys.Date(), by = 1)

# create time series
data <- cbind(y, z)
data <- xts(data, order.by = x)
data

##                    y        z
## 2022-09-02  2.262954 1.763593
## 2022-09-03  1.673767 1.200991
## 2022-09-04  4.329799 1.852343
## 2022-09-05  5.272429 3.710538
## 2022-09-06  5.414641 4.700785
## 2022-09-07  4.460050 5.588489
## 2022-09-08  6.071433 7.252223
## 2022-09-09  7.705280 7.108079
## 2022-09-10  8.994233 9.435683
## 2022-09-11 12.404653 8.762462

We can still easily generate a plot of this data:

plot(data)

And access the dates with index:

index(data)

##  [1] "2022-09-02" "2022-09-03" "2022-09-04" "2022-09-05" "2022-09-06"
##  [6] "2022-09-07" "2022-09-08" "2022-09-09" "2022-09-10" "2022-09-11"

Finally, if we want to access the individual variables, we can do so with the $ operator. We can recover the actual numeric vector with the as.numeric function:

# grab y time series
data$y

##                    y
## 2022-09-02  2.262954
## 2022-09-03  1.673767
## 2022-09-04  4.329799
## 2022-09-05  5.272429
## 2022-09-06  5.414641
## 2022-09-07  4.460050
## 2022-09-08  6.071433
## 2022-09-09  7.705280
## 2022-09-10  8.994233
## 2022-09-11 12.404653

# grab numeric vector
as.numeric(data$y)

##  [1]  2.262954  1.673767  4.329799  5.272429  5.414641  4.460050  6.071433
##  [8]  7.705280  8.994233 12.404653

We often have to compute rolling statistics on time series, which is made simple (and fast!) with the roll package (written by Jason Foster).

library("roll")

Let’s generate some long time series data and calculate the rolling mean. First, we can generate data with a steadily increasing mean (by inputting an increasing vector into the mean argument of the rnorm function).

# replicate
set.seed(0)

# generate data with shifting regimes
n <- 1000
y <- rnorm(n, -(n / 2):(n / 2 - 1) / (n / 4))

# date vector
x <- seq(from = Sys.Date() - (n - 1), to = Sys.Date(), by = 1)

# create time series
data <- xts(y, order.by = x)
plot(data)

We can then use the roll_mean function to calculate, well, a rolling mean of this data! We can input how large we want our ‘window’ to be (is this a 30-day rolling mean? 40-day? etc.) with the width argument. Here, we will do a monthly, 30-day rolling mean. We could also use the weights argument to specify how we are going to weight the different data points, which we will show in a later example.

data_roll <- roll_mean(data, width = 30)
head(data_roll, 35)

##                 [,1]
## 2019-12-17        NA
## 2019-12-18        NA
## 2019-12-19        NA
## 2019-12-20        NA
## 2019-12-21        NA
## 2019-12-22        NA
## 2019-12-23        NA
## 2019-12-24        NA
## 2019-12-25        NA
## 2019-12-26        NA
## 2019-12-27        NA
## 2019-12-28        NA
## 2019-12-29        NA
## 2019-12-30        NA
## 2019-12-31        NA
## 2020-01-01        NA
## 2020-01-02        NA
## 2020-01-03        NA
## 2020-01-04        NA
## 2020-01-05        NA
## 2020-01-06        NA
## 2020-01-07        NA
## 2020-01-08        NA
## 2020-01-09        NA
## 2020-01-10        NA
## 2020-01-11        NA
## 2020-01-12        NA
## 2020-01-13        NA
## 2020-01-14        NA
## 2020-01-15 -1.920049
## 2020-01-16 -1.966005
## 2020-01-17 -1.969226
## 2020-01-18 -2.023997
## 2020-01-19 -2.084060
## 2020-01-20 -2.069656

plot(data_roll)

Note that data_roll is a time series, just like the data we fed into it. Also note that the first 29 values are NA because we are using a 30 day window (up to 29 days in, we don’t have 30 days with which to calculate a 30-day rolling mean!). Also note that we can easily plot the rolling mean, which gives us a much cleaner visual.
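To see what a 30-day rolling mean is doing, here is a base R sketch that reproduces the same arithmetic on a plain numeric vector. It uses stats::filter rather than the roll package, so it illustrates the calculation only and is not how roll_mean is implemented; the names y_demo, width and roll_base are illustrative:

```r
# replicate
set.seed(0)

# plain numeric vector (illustrative data)
y_demo <- rnorm(100)
width <- 30

# equal weights over the current value and the 29 before it
roll_base <- as.numeric(stats::filter(y_demo,
                                      rep(1 / width, width),
                                      sides = 1))

# the first 29 entries are NA, as with roll_mean
sum(is.na(roll_base))

# entry 30 is the mean of the first 30 values
all.equal(roll_base[width], mean(y_demo[1:width]))

# entry 31 drops the first value and adds the 31st
all.equal(roll_base[width + 1], mean(y_demo[2:(width + 1)]))
```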

In addition to calculating means, the roll package uses a similar structure to calculate key statistics on both univariate and multivariate time series data. Some other useful functions include roll_max, roll_sum, roll_sd (standard deviation) and roll_lm (linear regression), just to name a few.

ggplot

The base R graphics package is reasonably versatile and useful, but the ggplot2 package provides access to a much more attractive, flexible suite of visuals.

library("ggplot2")

Many R users don’t work with ggplot2 because it can be tricky to learn. The biggest barrier to entry is that ggplot2 likes data in the long format instead of the wide format. Here is an example of the ‘wide’ format, which you likely think of as the standard, default format:

# replicate
set.seed(0)

# initialize data
data <- matrix(round(rexp(10 * 5, 1 / 5), 1), 
               nrow = 10,
               ncol = 5)
data <- cbind(1:10, data)


# convert to data.table and name
data <- data.table(data)
names(data) <- c("day", "matt", "tom", "sam", "anne", "tony")

# view data
data

##     day matt  tom  sam anne tony
##  1:   1  0.9  3.8  3.2  0.2  5.1
##  2:   2  0.7  6.2  1.5  1.6  6.5
##  3:   3  0.7 22.1  2.8  6.6  6.3
##  4:   4  2.2  5.3  0.5  1.0  2.8
##  5:   5 14.5  5.2  0.3  5.1  1.5
##  6:   6  6.1  9.4  2.9  1.5  6.5
##  7:   7  2.7  3.3 19.8  3.6  5.0
##  8:   8  4.8  1.7  5.9  3.8  2.6
##  9:   9  0.7  2.9  5.0  1.2 10.0
## 10:  10  7.0 11.8  7.2  5.4  2.1

This data could represent, say, the total number of miles walked or run by five people over ten days (for example, Matt walked 2.7 miles on day 7). Note that each row in the data.table is a day, and each column is a person.

We can convert the data to ‘long’ format using the melt function:

data_long <- melt(data, id.vars = "day")
head(data_long, 15)

##     day variable value
##  1:   1     matt   0.9
##  2:   2     matt   0.7
##  3:   3     matt   0.7
##  4:   4     matt   2.2
##  5:   5     matt  14.5
##  6:   6     matt   6.1
##  7:   7     matt   2.7
##  8:   8     matt   4.8
##  9:   9     matt   0.7
## 10:  10     matt   7.0
## 11:   1      tom   3.8
## 12:   2      tom   6.2
## 13:   3      tom  22.1
## 14:   4      tom   5.3
## 15:   5      tom   5.2

Note that we now have three simple columns in the data table: the ‘id’ column (which we specified in melt with the argument id.vars), the ‘variable’ column (which identifies which person that this specific row refers to) and the ‘value’ column (which identifies how far that person walked or ran on that day). This is considered ‘long’ data because every observation has its own row; that makes it pretty long!
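The default column names ‘variable’ and ‘value’ can be replaced with more descriptive ones. As a sketch on a tiny made-up table (the names wide, long, person and miles are illustrative), melt accepts variable.name and value.name arguments:

```r
library(data.table)

# tiny wide table: one row per day, one column per person
# (names and values are made up for illustration)
wide <- data.table(day = 1:3,
                   matt = c(1.0, 2.0, 3.0),
                   tom = c(4.0, 5.0, 6.0))

# melt to long format, naming the new columns explicitly
long <- melt(wide,
             id.vars = "day",
             variable.name = "person",
             value.name = "miles")
long

# 3 days x 2 people = 6 rows, one per observation
nrow(long)
```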

While long data might seem unfamiliar, ggplot2 likes this format. We can think of making a ggplot2 chart as adding blocks on top of each other to make a picture (instead of drawing one picture all at once). For instance, we can always just start with a blank canvas with the ggplot function:

ggplot()

And we can add a title to this blank canvas by adding (with the + operator; this reinforces the idea that we are always adding pieces to a blank canvas) the ggtitle function:

ggplot() +
  ggtitle("First ggplot!")

Let’s try plotting our data_long. We usually put this right into our initial ggplot function and specify what our x and y values will be within the aes argument of ggplot (aes stands for aesthetics).

ggplot(data_long, aes(x = day, y = value)) +
  ggtitle("First ggplot!")

Hmm…we see a nice grid, but no data! We can add lines or points with the geom_line and geom_point functions. Let’s try points:

ggplot(data_long, aes(x = day, y = value)) +
  ggtitle("First ggplot!") +
  geom_point()

This still isn’t ideal because we can’t tell who is who (which points are Matt? Which points are Tom?). Luckily, we can specify how we want to color the points (here, by person) with the color parameter in aes. This is why it’s so nice to have data in the long format; each row has an x value, a y value and a ‘categorical’ value to tell us which group the observation is in!

ggplot(data_long, aes(x = day, y = value, color = variable)) +
  ggtitle("First ggplot!") +
  geom_point()

These are just the basics, but we can really make our chart ‘fancy’ by adding some simple functions: xlab and ylab give us the axis labels, the alpha argument in geom_line gives us the darkness of the line (0 to 1, 0 being invisible and 1 being full color), geom_smooth allows us to add smoothed trends for each line (and setting the se, or ‘standard error,’ argument in geom_smooth to FALSE removes the default shaded bands, which provide a sort of ‘confidence interval’ for the smoothed trend), geom_hline allows us to draw a horizontal line (hline stands for horizontal line; in the function, we specify the yintercept to be 0) and theme_bw adds a simple black and white theme (there are some really neat themes, including Wall Street Journal and 538, which you can find here).

ggplot(data_long, aes(x = day, y = value, color = variable)) +
  geom_line(alpha = 1 / 2) +
  ggtitle("First (fancy) ggplot!") + 
  geom_smooth(se = FALSE) +
  xlab("days") + ylab("miles traveled") +
  geom_hline(yintercept = 0) +
  theme_bw()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

One last nifty tip is to use facet_wrap, which, instead of coloring each variable differently, creates a different chart for each variable (we could still color the lines differently with the col argument if we so desired):

ggplot(data_long, aes(x = day, y = value)) +
  facet_wrap(~variable) +
  geom_line(alpha = 1 / 2) +
  ggtitle("First (fancy) ggplot!") + 
  geom_smooth(se = FALSE) +
  xlab("days") + ylab("miles traveled") +
  geom_hline(yintercept = 0) +
  theme_bw()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

We simply put the variable we want to facet on (here, just variable) inside the facet_wrap function (don’t forget to add the ~ syntax!).

Note that, if we had even more categorical variables, we could facet by one variable and color by another variable; ggplot allows us to easily view complicated relationships and data breakdowns.
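As a sketch of that last point, the made-up data below has two categorical variables, person and activity (both hypothetical, along with the names plot_data and p), so that one can be used for the facets and the other for the colors:

```r
library(ggplot2)

# made-up long data with two categorical variables:
# person and activity (both hypothetical, for illustration)
plot_data <- expand.grid(day = 1:5,
                         person = c("matt", "tom"),
                         activity = c("walk", "run"))

# replicate
set.seed(0)
plot_data$miles <- round(rexp(nrow(plot_data), 1 / 5), 1)

# facet by person, color by activity
p <- ggplot(plot_data, aes(x = day, y = miles, color = activity)) +
  facet_wrap(~person) +
  geom_line() +
  xlab("days") + ylab("miles traveled") +
  theme_bw()
p
```

Each panel shows one person, and within each panel the two colored lines compare that person’s activities.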

Styling

We’ll close this lecture with a note on something crucial that many R users forget: styling. This is essentially the ‘formatting’ of your code; i.e., how the code looks to the human reader. Two code chunks that produce the same result when executed can have different style; the code looks the same to the computer, but different to us!

For a full guide on standard style practices, see Hadley Wickham’s style guide. We will report and discuss some of the more crucial (and more common) aspects of style here.

When commenting your code, it’s important to include a space after the # so that the human reader can more easily read when the comment starts. It’s also important to keep your comments succinct and neat; if you have something verbose to say, simply say it in a text chunk like this one before your code chunk.

# good comment

#bad comment




# this is a good, succinct comment

# this is a bad comment because it 
# doesn't get to the point and
# takes up a big chunk of the page and also has an extra long line

One of the subtlest conventions in R is using <- for assignment instead of =. Many are annoyed by this tendency; what’s the difference? If anything, <- is an extra keystroke (even two extra keystrokes, because you have to hold down ‘shift’)!

There are some more advanced programming reasons for why <- is preferred (which you can read about here) but, as a practical user, we can keep these two key reasons in mind:

Using <-, which looks like an arrow pointing to the left, means that we will never forget that the direction of assignment is right to left (i.e., x <- 4 means ‘put the value 4 into x’ instead of ‘put the value x into 4’). Naturally, with an equals sign, this direction isn’t clear (you can also do rightwards assignment, or 4 -> x, although this is far, far less common). It sounds silly to think that we will need to be constantly reminded of the direction of assignment. However, when working with long, complicated code, it can be surprising how often really simple errors pop up. It can’t hurt to eliminate the risk of making an assignment error!
Using <- for assignment makes it far easier for a human reader to pick out different sections of the code. Since we don’t use = for assignment, if the reader is scanning the code and sees an equals sign, they can be more certain that they are looking at a function (since =, not <-, is used for the assignment of arguments in functions). All told, don’t sweat that extra keystroke!

# good assignment
x <- 4

# bad assignment
x = 4

You can access the booleans (true or false in R) either with TRUE and FALSE or with T and F. It adds a few keystrokes, but just use the full name. It’s easier to read and, oddly, it is possible to assign values to T but not to TRUE (i.e., T <- 4 will execute but TRUE <- 4 throws an error; the same with FALSE and F), in which case, if you used T, you would be calling an object that is not a boolean!

# good boolean
TRUE; FALSE

## [1] TRUE

## [1] FALSE

# bad boolean
T <- 5
F <- "hi"
T; F

## [1] 5

## [1] "hi"

Space out your arithmetic; make sure that you have a space between each number and operator (i.e., +, -, *, /, ^, etc.). It makes it much easier to read!

# good arithmetic
6 * 5 - 4 ^ 2

## [1] 14

# bad arithmetic
6*5-4^2

## [1] 14

With loops and if statements, make sure that you have the proper spacing: include a space after for, if, etc., as well as after the closing parentheses. Again, the extra spacing makes reading easier.

# good loop
for (i in 1:10) {
  # do something
}


# bad loop
for(i in 1:10){
  # do something
}
# good if statement
if (2 + 2 != 4) {
  # do something
}

# bad if statement
if(2 + 2 != 4){
  # do something
}

When defining functions, make sure to use the same spacing structure as loops and if statements. Also, when naming functions, it’s useful to preface the name with FUN_ so that you automatically know the object is a function (instead of a vector, constant or some other object). Similarly, you’ll see that we often label any data with the prefix data_.

# good function
FUN_square <- function (x) {
  return(x ^ 2)
}

# bad function
square = function(x){
  return(x^2)
}

Lecture_2_R_Software

George Yanev

2022-09-11