Surprising R

Jesse Yang
Oct 5, 2017

The good and bad of the R language

Review of R Basics

  • Data structures in R
    • Atomic vectors: c(...)
    • Matrices: homogeneous multi-dimensional arrays
    • Lists: list(...) - analogous to a dictionary in other languages
    • Data frames: a special type of list
      • a list of equal-length vectors
  • Atomic Vector types
    • Double float: c(1, 1.)
    • Integer: c(1L, 5L, 6L)
    • Logical: c(TRUE, FALSE, c(9, 2) == 2)
    • Character (string): c(1, '2')
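The types listed above can be checked interactively. The following is a minimal sketch using only base R; the data frame `df` and its column names are made up for illustration.

```r
# typeof() reports the storage type of an atomic vector
typeof(c(1, 1.))                      # "double"
typeof(c(1L, 5L, 6L))                 # "integer"
typeof(c(TRUE, FALSE, c(9, 2) == 2))  # "logical"
typeof(c(1, '2'))                     # "character": the number 1 is coerced to "1"

# A data frame is a list of equal-length vectors (df is a made-up example)
df <- data.frame(id = 1:2, score = c(0.5, 1.5))
is.list(df)    # TRUE
lengths(df)    # every column has length 2
```

Note that mixing types inside c(...) never fails; R silently coerces everything to the most general type present.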

Annoying surprises

Too many inconsistencies

  • <- and =
x <- c('a', 'b', 'c')
x
[1] "a" "b" "c"
x <- matrix(data=c('a', 'b', 'c'), nrow=2)
x
     [,1] [,2]
[1,] "a"  "c" 
[2,] "b"  "a" 

Function names toooo short

which creates ambiguities.

  • str() - not “string”, but “structure”
  • structure() - actually means “re-structure”
  • get() - get what?
  • date() - current date? create a date? day of month of a Date object?
  • c() - combine? character?
  • ncol <- 2 - can I use it as a variable when it's already a function?

You may find that many common verbs/nouns you want to use for your own functions are already taken by built-in functions.

But...

R is smart enough to figure out what you mean

ncol
function (x) 
dim(x)[2L]
<bytecode: 0x7f9ab8214518>
<environment: namespace:base>
ncol <- 3
dim(matrix(1:12, ncol=ncol))
[1] 4 3
ncol
[1] 3
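The reason this works is that when a name is used in call position, R skips objects that are not functions while looking it up. A small sketch of that behaviour, reusing the `ncol` example above (`m` and `n_cols` are names introduced here):

```r
ncol <- 3                        # a plain numeric variable named like the function
m <- matrix(1:12, ncol = ncol)   # as an argument value, `ncol` is the number 3
n_cols <- ncol(m)                # in call position, R still finds base::ncol
n_cols                           # 3
is.function(ncol)                # FALSE: the bare name now refers to the number

rm(ncol)                         # remove the variable again
is.function(ncol)                # TRUE: the bare name resolves to the function
```

Relying on this is legal but can confuse readers, so a distinct name such as `n_cols` is usually the safer choice.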

Functions can be overridden

Therefore the order in which you load libraries matters:

library(plyr)
library(dplyr)

is different from

library(dplyr)
library(plyr)
Attaching package: ‘dplyr’

The following objects are masked from ‘package:plyr’:

    arrange, count, desc, failwith, id, mutate, rename, summarise, summarize

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union
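When a name is masked, the package-qualified form `pkg::fun` still reaches the original. The sketch below shows this without any extra packages: the local `filter()` is a hypothetical stand-in for a masking function, not dplyr's implementation.

```r
# A hypothetical user-defined filter() that masks stats::filter()
filter <- function(x, keep) x[keep]

kept <- filter(1:5, c(TRUE, FALSE, TRUE, FALSE, TRUE))
kept                                     # 1 3 5 (our version was called)

# The qualified name bypasses the mask and calls the stats version
ma <- stats::filter(1:5, rep(1/3, 3))    # centered 3-point moving average
as.numeric(ma)                           # NA at both ends, averages in between
```

Writing `dplyr::filter()` or `plyr::summarise()` explicitly makes a script independent of the order of its library() calls.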

Make Sense of Factors

  • stringsAsFactors=FALSE?
data.frame(name=c('a','b', 'a'), val=c(1,2,3), stringsAsFactors=FALSE) %>% str()
'data.frame':   3 obs. of  2 variables:
 $ name: chr  "a" "b" "a"
 $ val : num  1 2 3
data.frame(name=c('a','b', 'a'), val=c(1,2,3)) %>% str()
'data.frame':   3 obs. of  2 variables:
 $ name: Factor w/ 2 levels "a","b": 1 2 1
 $ val : num  1 2 3
  • Why factors when you have strings?
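One way to see what a factor buys you is to look at how it is stored. This is a minimal base R sketch; it sets stringsAsFactors explicitly because the default of data.frame() changed to FALSE in R 4.0.0, so the second output above depends on the R version.

```r
df <- data.frame(name = c('a', 'b', 'a'), val = c(1, 2, 3),
                 stringsAsFactors = TRUE)   # explicit, independent of R version

levels(df$name)        # "a" "b": each distinct string is stored once
as.integer(df$name)    # 1 2 1: every value is an integer code into the levels
table(df$name)         # counts per level: a = 2, b = 1
```

The levels act as a fixed set of categories, which is what modelling and plotting functions rely on.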

Quiz

Can you tell the result of the following code?

as.numeric(factor(5:10))
[1] 1 2 3 4 5 6
as.numeric(factor(c(2, 3, 3, 6, 6, 6)))
[1] 1 2 2 3 3 3

Quiz - Continued

as.numeric(factor(5:10))
[1] 1 2 3 4 5 6
  • 5:10 generates c(5, 6, 7, 8, 9, 10).
  • factor() coerces it into characters first: as.character(5:10)
  • Characters are assigned levels by factor()
  • Each value in a factor has an integer representation of the level it belongs to
  • as.numeric returns the internal numeric representations of factor values; it does not try to convert them into characters first.

    as.numeric(factor(c(2, 3, 3, 6, 6, 6)))
    
    [1] 1 2 2 3 3 3
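If the goal is to get the original numbers back, the factor has to go through its labels rather than its codes. A short sketch of the two common base R idioms (the variable names are introduced here for illustration):

```r
f <- factor(c(2, 3, 3, 6, 6, 6))

codes <- as.numeric(f)                    # 1 2 2 3 3 3: the internal level codes
via_chr <- as.numeric(as.character(f))    # 2 3 3 6 6 6: the original values
via_lvl <- as.numeric(levels(f))[f]       # 2 3 3 6 6 6: converts each level only once
```

The second idiom indexes the converted levels by the factor's codes, so it does the string-to-number conversion once per level instead of once per element.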
    

Pleasant Surprises

Factors can be used to ...

Change order of categorical variables

water %>% 
  mutate(
    risk = forcats::fct_relevel(risk, 'Less Risk', 'Some Risk', 'More Risk')
  ) %>%
  group_by(risk, section) %>%
  summarise(u238=mean(uranium238)) %>%
  ggplot(aes(x=section, y=u238, fill=risk)) +
  geom_bar(stat="identity", position="dodge")

[Figure: bar chart of mean uranium238 by section, with dodged bars filled by risk level]

Factors can be used to ...

  • Have a list of acceptable values as safeguard.
x <- factor(c('a', 'b', 'c'))
x[4] <- 'd'
Warning in `[<-.factor`(`*tmp*`, 4, value = "d"): invalid factor level, NA
generated
x <- data.frame(id=factor(c('a','b','a')), val=1:3)
y <- data.frame(id=factor(c('c','b','a')), val=1:3)
x %>% left_join(y, by='id')
Warning: Column `id` joining factors with different levels, coercing to
character vector
  id val.x val.y
1  a     1     3
2  b     2     2
3  a     3     3
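The safeguard in the first example can be lifted deliberately by declaring the new level before assigning it. A minimal base R sketch; the `sizes` vector is a made-up example:

```r
x <- factor(c('a', 'b', 'c'))
levels(x) <- c(levels(x), 'd')   # declare the new level first
x[4] <- 'd'                      # now the assignment succeeds without a warning
as.character(x)                  # "a" "b" "c" "d"

# Declaring the allowed levels up front turns unexpected values into NA
sizes <- factor(c('S', 'M', 'XL'), levels = c('S', 'M', 'L'))
is.na(sizes)                     # FALSE FALSE TRUE: 'XL' is not an allowed level
```

In the second case no warning is raised, so checking anyNA() after building the factor is a reasonable habit.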

Convenient vector operations

  • Logical expressions generate vectors
a <- c(1:5, 7:9)
a > 4
[1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
which(a > 6 & a < 8)
[1] 6
which(a == 6)
integer(0)
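Because a comparison returns a logical vector, it can be used directly as an index or summarised. A small sketch continuing with the same `a` (the result names are introduced here):

```r
a <- c(1:5, 7:9)

above4 <- a[a > 4]     # 5 7 8 9: keep the elements where the test is TRUE
n_above4 <- sum(a > 4) # 4: TRUE counts as 1, FALSE as 0
has_six <- any(a == 6) # FALSE: a safer test than checking which() for integer(0)
```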

Vector operations - cont.

  • All atomic vectors are flat
c(TRUE, FALSE, c(9, 2) == 2)
[1]  TRUE FALSE FALSE  TRUE
  • Nest as many layers as you want
c(TRUE, FALSE, c(9, 2) == 2, c(1, c(0, 1, 1)))
[1] 1 0 0 1 1 0 1 1
  • Intuitive sequence generation and subsetting
x <- c(2, 4, 6:8)  # both endpoints of a sequence are inclusive
x[c(2:4, 1, 3, 1)]  # you can pick a value more than one time
[1] 4 6 7 2 6 2

Vector operations - cont.

  • Subsetting rectangular data objects
x <- matrix(1:20, ncol = 5)
x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20
x[2, 2:5, drop=FALSE]  # what if you don't add drop=FALSE?
     [,1] [,2] [,3] [,4]
[1,]    6   10   14   18
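To answer the question in the comment: without drop=FALSE, a single-row selection loses its dimensions and becomes a plain vector. A short sketch on the same matrix (`row_vec` and `row_mat` are names introduced here):

```r
x <- matrix(1:20, ncol = 5)

row_vec <- x[2, 2:5]                 # default drop = TRUE
row_vec                              # 6 10 14 18
dim(row_vec)                         # NULL: no longer a matrix

row_mat <- x[2, 2:5, drop = FALSE]
dim(row_mat)                         # 1 4: still a 1-row matrix
```

This matters inside functions: code that later calls nrow() or ncol() on the result breaks when the selection happens to contain only one row.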

Vector operations - cont.

  • Negative indices and reverse sequences
x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20
x[-c(15:5)]
[1]  1  2  3  4 16 17 18 19 20
  • One-dimensional subsetting coerces a matrix to a vector

Vector operations - cont.

  • Works with data frames, too (with the help of dplyr)
x <- data.frame(name=c('a','b', 'a'), val1=1:3, val2=5:7)
select(x, val1:val2)
  val1 val2
1    1    5
2    2    6
3    3    7
  • But you can't put a negative sign directly in front of a column-name range: -val1:val2 is parsed as (-val1):val2.
select(x, -val1:val2)  # error!
Error in combine_vars(vars, ind_list) : 
  Each argument must yield either positive or negative integers
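The error comes from operator precedence: unary minus binds tighter than the colon. The same thing can be seen in base R with plain numbers, as in this sketch (`v` and `mixed_fails` are names introduced here). With dplyr, wrapping the range in parentheses, as in select(x, -(val1:val2)), should avoid the error.

```r
-1:3         # -1 0 1 2 3: parsed as (-1):3
-(1:3)       # -1 -2 -3: negate the whole range

v <- c(10, 20, 30, 40)
v[-(1:3)]    # 40: drop the first three elements

# Mixing negative and positive indices is an error, which is what -1:3 produces
mixed_fails <- inherits(tryCatch(v[-1:3], error = function(e) e), "error")
mixed_fails  # TRUE
```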

Vector operations - cont.

  • No, it won't work with Base R subsetting
x[, val1:val2]
Error in `[.data.frame`(x, , val1:val2) : object 'val1' not found
  • Do you know why?
cols <- val1:val2
x[, cols]
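The base R version fails because val1:val2 is evaluated as ordinary code before `[` ever sees it, and no variable named val1 exists. Two base R ways to select a range of columns are sketched below; `first`, `last`, `by_position` and `by_subset` are names introduced here.

```r
x <- data.frame(name = c('a', 'b', 'a'), val1 = 1:3, val2 = 5:7)

# Option 1: translate the column names into positions, then use a numeric range
first <- match('val1', names(x))
last  <- match('val2', names(x))
by_position <- x[, first:last]

# Option 2: subset() evaluates `select` with column names standing for positions
by_subset <- subset(x, select = val1:val2)

names(by_position)   # "val1" "val2"
names(by_subset)     # "val1" "val2"
```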

Vector operations - cont.

  • Convenient arithmetic operations
c(1:5) + c(10:15)
[1] 11 13 15 17 19 16
c(1:5) / c(10:15)
[1] 0.10000000 0.18181818 0.25000000 0.30769231 0.35714286 0.06666667
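The sixth element in both results comes from recycling: the shorter vector starts over from its first element. R warns when the longer length is not a multiple of the shorter one, but the warning is not shown in the output above. A small sketch that makes the warning visible (`silent` and `msg` are names introduced here):

```r
# Lengths 2 and 6: recycled silently because 6 is a multiple of 2
silent <- 1:2 + 1:6
silent                 # 2 4 4 6 6 8

# Lengths 5 and 6: same recycling, but R raises a warning; capture its text
msg <- tryCatch(1:5 + 10:15, warning = function(w) conditionMessage(w))
msg
```

When vectors are expected to have equal lengths, a check such as stopifnot(length(a) == length(b)) before the arithmetic catches accidental recycling early.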

Advanced vector operations

  • Advice: make sure your functions handle vectors. (r4ds, Chapter 21)
  • Rule of thumb: avoid loops whenever possible.
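library(stringr)  # str_c(), str_split()
library(purrr)    # map_chr(); %>% is assumed to be available, e.g. via dplyr

MergeDedup <- function(x, collapse='; ', sep=' *[,;\\*] *') {
  # Merge all strings in `x` into one, split the result on `sep`,
  # then drop duplicate and empty items
  str_c(x, collapse = collapse) %>%
    str_split(sep) %>%
    map_chr(function(items) {
      # drop duplicates and empty strings to keep the output clean
      items <- unique(items) %>% .[. != '']
      if (length(items) > 0) {
        str_c(items, collapse = collapse)
      } else {
        NA_character_  # typed NA so map_chr() always receives a character
      }
    })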
}
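The rule of thumb about loops can also be illustrated without any packages. The sketch below labels a made-up vector of delays twice, once with an explicit loop and once with a vectorized call; both give the same result.

```r
delays <- c(-3, 12, 0, 45)   # made-up example data

# Loop version: one element at a time
label_loop <- character(length(delays))
for (i in seq_along(delays)) {
  label_loop[i] <- if (delays[i] > 0) 'late' else 'on time'
}

# Vectorized version: the comparison and ifelse() work on the whole vector
label_vec <- ifelse(delays > 0, 'late', 'on time')

identical(label_loop, label_vec)   # TRUE
```

The vectorized form is shorter and works unchanged for a vector of any length, including length zero.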

Quotes or not quotes?

You don't need quotes around column names in dplyr functions.

# without pipe
filter(flights,  dep_delay > 0)

# with pipe
flights %>%
  filter(dep_delay > 0) %>%
  group_by(carrier) %>%
  summarize(avg.delay = mean(dep_delay)) %>%
  arrange(desc(avg.delay))
  • This is achieved by employing the so-called “lazy evaluation” of function arguments.
  • The expression dep_delay > 0 is not evaluated when filter() is called, only when filter() actually needs its value.
  • By that time, dplyr knows in which context (the columns of flights) it should look up the variable dep_delay.
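The mechanism described in the bullets above can be sketched in base R. `my_filter()` is a hypothetical helper written for illustration, not dplyr's actual implementation, and `toy` is a made-up stand-in for flights.

```r
# Hypothetical helper: capture the condition unevaluated, then evaluate it
# with the data frame's columns in scope
my_filter <- function(df, condition) {
  expr <- substitute(condition)            # the expression itself, not its value
  keep <- eval(expr, df, parent.frame())   # look up names in df first
  df[!is.na(keep) & keep, , drop = FALSE]  # treat NA as "do not keep"
}

toy <- data.frame(carrier = c('AA', 'UA', 'DL'), dep_delay = c(5, -2, NA))
delayed <- my_filter(toy, dep_delay > 0)
delayed                                    # only the 'AA' row remains
```

Because the argument is never evaluated in the caller's environment, dep_delay does not need to exist as a variable outside the data frame.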

Quotes or not quotes? - cont.

  • The Base R way
flights[flights$dep_delay > 0, ]
  • What really happened:

    1. Get the vector: x <- flights$dep_delay
    2. Evaluate the expression x > 0 to get a logical vector
    3. Use the logical vector to make a subset:

      flights[c(TRUE,FALSE,...,TRUE), ]
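The three steps can be run on a small data frame. The sketch below uses a made-up `toy` table instead of flights and also shows one difference worth knowing: an NA in the logical index produces a row of NAs, whereas wrapping the index in which() drops it.

```r
toy <- data.frame(carrier = c('AA', 'UA', 'DL'), dep_delay = c(5, -2, NA))

x <- toy$dep_delay        # step 1: get the vector
keep <- x > 0             # step 2: TRUE FALSE NA
with_na <- toy[keep, ]    # step 3: the NA index yields an all-NA row
nrow(with_na)             # 2

without_na <- toy[which(keep), ]   # which() keeps only the TRUE positions
nrow(without_na)                   # 1
```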
      

Quotes or not quotes? - cont.

Column selection needs quotes

x <- data.frame(name=c('a','b', 'a'), val1=1:3, val2=5:7)
x[, c('val1', 'val2')]
  val1 val2
1    1    5
2    2    6
3    3    7

Conditionally select columns

x[, startsWith(colnames(x), 'val')]
  val1 val2
1    1    5
2    2    6
3    3    7
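Any expression that returns one logical value per column can be used in the same position. A short sketch with two further base R variants (the result names are introduced here):

```r
x <- data.frame(name = c('a', 'b', 'a'), val1 = 1:3, val2 = 5:7)

# Select by a regular expression on the column names
by_pattern <- x[, grepl('^val[0-9]+$', colnames(x))]

# Select by column type instead of by name
by_type <- x[, sapply(x, is.numeric)]

# A single match would collapse to a vector unless drop = FALSE is given
one_col <- x[, startsWith(colnames(x), 'val1'), drop = FALSE]
```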

Bonus: Debugging R scripts

Debug your R scripts

  • Good practices:

    • Pretty-format your code. (Use Ctrl/Cmd+Shift+A if you are lazy)
    • Add comments to explain obscure things
    • Use meaningful variable names
  • Debug process:

    • Enable Break in code

Debug your R scripts - cont.

  • Debug process (continued):
    • Restart R Session (menu: Session -> Restart R)
    • Assign intermediate results to a new variable
    • Inspect intermediate results with str(...) (it prints by itself; no print() needed)
flights %>%
  filter(dep_delay > 0) %>%
  group_by(carrier) %>%
  summarize(avg.delay = mean(dep_delay)) %>%
  arrange(desc(avg.delay))
# ----- vs. -----
flights.delayed <- flights %>% filter(dep_delay > 0)
str(flights.delayed)
summarized <- flights.delayed %>% group_by(carrier) %>% summarize(avg.delay = mean(dep_delay))
summarized %>% arrange(desc(avg.delay))
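The same step-by-step style works without dplyr. The sketch below uses the built-in mtcars data set as a stand-in for flights and adds a stopifnot() check between the steps, so a wrong intermediate result stops the script early. The variable names and the weight threshold are chosen only for illustration.

```r
# Step 1: filter, then inspect the intermediate result
heavy <- mtcars[mtcars$wt > 3, ]
str(heavy)

# Fail fast if the intermediate result is not what the next step expects
stopifnot(nrow(heavy) > 0, !anyNA(heavy$mpg))

# Step 2: summarise per group, then sort
avg_mpg <- aggregate(mpg ~ cyl, data = heavy, FUN = mean)
avg_mpg[order(-avg_mpg$mpg), ]
```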

Bonus: Better Coding Workflow

  • Create projects

  • Fold your code

    • Section markers: “# ======” or “# -----” both OK

Bonus: Better Coding Workflow - cont.

  • Create reusable code snippets

Bonus: Better Coding Workflow - cont.

  • Type <Tab> to trigger autocomplete.