Load libraries either with library() or require().

R Basics

String construction.

dog <- "Chester"
print(paste("you are a dog", dog))
## [1] "you are a dog Chester"
nchar(dog)
## [1] 7

Vectors

Create a vector with the combine function c(). Reference vector elements by position in brackets, or by element name. R compares vectors element-wise. If you compare a vector to a single value, R recycles that value to the length of the vector.

There are two types of vectors in R: atomic vectors and lists. Atomic vectors are homogeneous and have one of six types: logical, integer, double, character, complex, and raw (don’t worry about the relatively uncommon complex and raw types). Lists are recursive vectors (they can contain other lists).

Vectors have two key properties: type, returned by typeof(), and length, returned by length(). Subset a list with single brackets and extract elements with double brackets. For example,

a <- list(
  a = 1:3,
  b = "a string",
  c = pi,
  d = list(-1, -5)
)
# A one-element list that contains d.
typeof(a[4])
## [1] "list"
# d itself, a list of two elements.
typeof(a[[4]])
## [1] "list"
# A one-element list that contains the first element of d.
typeof(a[[4]][1])
## [1] "list"
# The first value of d.
typeof(a[[4]][[1]])
## [1] "double"
numeric_vector <- c(1, 10, 49)
character_vector <- c("a", "b", "c")
boolean_vector <- c(TRUE, FALSE, TRUE)
character_vector[1]
## [1] "a"
boolean_vector[c(2,3)]
## [1] FALSE  TRUE
boolean_vector[2:3]
## [1] FALSE  TRUE
roulette_vector <- c(-24, -50, 100, -350, 10)
names(roulette_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
roulette_vector[1]
## Monday 
##    -24
roulette_vector["Monday"]
## Monday 
##    -24
# vector operations
sum(roulette_vector)
## [1] -314
mean(roulette_vector)
## [1] -62.8
# take a subset of a vector using booleans
roulette_vector[roulette_vector>0]
## Wednesday    Friday 
##       100        10
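
The element-wise comparison and recycling described at the start of this section can be sketched as follows. The vector names are illustrative assumptions and do not come from the notes above.

```r
# Illustrative vectors (hypothetical names and values).
wins_a <- c(140, -50, 20)
wins_b <- c(-24, -50, 100)

# Element-wise comparison of two vectors of equal length.
a_beats_b <- wins_a > wins_b

# Comparison against a single value: R recycles 0 to length 3.
a_positive <- wins_a > 0

# The logical result can subset the original vector.
wins_a[a_positive]
```

Both comparisons return a logical vector with one entry per element, which is why the result can be used directly inside brackets.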

Matrix

A matrix is a two-dimensional collection of elements. Create a matrix with the matrix(data, nrow, ncol, byrow) function. Label the rows with rownames() and the columns with colnames(). Sum each row and column into vectors with rowSums() and colSums(). Bind rows and columns to a matrix with rbind() and cbind(). Reference matrix items with brackets [row, col].

# Matrix of numbers 1:20, filling one row at a time, for 5 rows and 4 columns.  Specifying the number of columns is optional if number of rows is specified.
m <- matrix(1:20, byrow = TRUE, nrow = 5, ncol = 4)
rownames(m) <- c("row 1", "row 2", "row 3", "row 4", "row 5")
colnames(m) <- c("Col 1", "col 2", "col 3", "col 4")
m
##       Col 1 col 2 col 3 col 4
## row 1     1     2     3     4
## row 2     5     6     7     8
## row 3     9    10    11    12
## row 4    13    14    15    16
## row 5    17    18    19    20
# Bind row sums to matrix.
m.rowSum <- rowSums(m)
cbind(m, m.rowSum)
##       Col 1 col 2 col 3 col 4 m.rowSum
## row 1     1     2     3     4       10
## row 2     5     6     7     8       26
## row 3     9    10    11    12       42
## row 4    13    14    15    16       58
## row 5    17    18    19    20       74
# All rows of the second column of m.
m[,2]
## row 1 row 2 row 3 row 4 row 5 
##     2     6    10    14    18
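
The column-wise counterparts, colSums() and rbind(), work the same way as the row-wise calls above. A minimal sketch, rebuilding the matrix so the block runs on its own (the names col_totals and m_with_totals are illustrative):

```r
# Same values as the matrix above, without dimension names.
m <- matrix(1:20, byrow = TRUE, nrow = 5, ncol = 4)

# Column totals as a vector of length 4.
col_totals <- colSums(m)

# Append the totals as a sixth row.
m_with_totals <- rbind(m, col_totals)
dim(m_with_totals)
```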

Use nrow() and ncol() to determine the number of rows and columns.

for (i in 1:nrow(m)) {
  for (j in 1:ncol(m)) {
    print(paste("On row ", i, " and column ", j, " the matrix contains ", m[i,j]))
  }
}
## [1] "On row  1  and column  1  the matrix contains  1"
## [1] "On row  1  and column  2  the matrix contains  2"
## [1] "On row  1  and column  3  the matrix contains  3"
## [1] "On row  1  and column  4  the matrix contains  4"
## [1] "On row  2  and column  1  the matrix contains  5"
## [1] "On row  2  and column  2  the matrix contains  6"
## [1] "On row  2  and column  3  the matrix contains  7"
## [1] "On row  2  and column  4  the matrix contains  8"
## [1] "On row  3  and column  1  the matrix contains  9"
## [1] "On row  3  and column  2  the matrix contains  10"
## [1] "On row  3  and column  3  the matrix contains  11"
## [1] "On row  3  and column  4  the matrix contains  12"
## [1] "On row  4  and column  1  the matrix contains  13"
## [1] "On row  4  and column  2  the matrix contains  14"
## [1] "On row  4  and column  3  the matrix contains  15"
## [1] "On row  4  and column  4  the matrix contains  16"
## [1] "On row  5  and column  1  the matrix contains  17"
## [1] "On row  5  and column  2  the matrix contains  18"
## [1] "On row  5  and column  3  the matrix contains  19"
## [1] "On row  5  and column  4  the matrix contains  20"

Factors

The factor() function converts a variable into type factor. R needs to know whether a variable is continuous or categorical. To create an ordinal categorical variable, specify ordered = TRUE and list the levels from lowest to highest.

student_status <- c("student", "not student", "student", "not student")
categorical_student <- factor(student_status)
categorical_student
## [1] student     not student student     not student
## Levels: not student student
temperature_vector <- c("High", "Low", "High","Low", "Medium")
factor_temperature_vector <- factor(temperature_vector, ordered = TRUE, levels = c("Low", "Medium", "High"))
temperature_vector
## [1] "High"   "Low"    "High"   "Low"    "Medium"
# Character strings compare alphabetically, which is not meaningful here; ordered factors compare by level.
temperature_vector[1] > temperature_vector[2]
## [1] FALSE
factor_temperature_vector[1] > factor_temperature_vector[2]
## [1] TRUE
# Change the level names with the levels function.  Note the levels are initially in alphabetical order.
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector) <- c("Female", "Male")

# Notice how summary treats a factor variable differently from a character variable.
summary(survey_vector)
##    Length     Class      Mode 
##         5 character character
summary(factor_survey_vector)
## Female   Male 
##      2      3

Data Frames

A data frame is like a matrix, except each column can be a different data type. Several functions inspect data frames.

  • head() (tail()): by default prints the first (last) 6 rows of the data frame.
  • str(): prints the structure of the data frame. Probably the first function you’ll call with a new data set.
  • dim(): prints the dimensions of the data frame.
  • colnames(): prints the column names of the data frame.
  • na.omit(): removes rows with NA in any column.

mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
head(mtcars,6)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
dim(mtcars)
## [1] 32 11
colnames(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
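
tail() and na.omit() are listed above but not demonstrated. A minimal sketch on a small made-up data frame (the data and the names df and complete are assumptions for illustration):

```r
# Small data frame with one missing value (illustrative data).
df <- data.frame(id = 1:4, score = c(10.5, NA, 7.25, 9))

# Last two rows.
tail(df, 2)

# Drop every row that has an NA in any column (row 2 here).
complete <- na.omit(df)
nrow(complete)
```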

Create a data frame with the data.frame() function.

planets <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
planets_df <- data.frame(planets, type, diameter, rotation, rings)
# Select first 5 values of diameter column.  The $ is a short-cut method.
planets_df[1:5,"diameter"]
## [1]  0.382  0.949  1.000  0.532 11.209
planets_df$diameter[1:5]
## [1]  0.382  0.949  1.000  0.532 11.209

Use subset() to filter the data frame rows by a condition, like a SQL WHERE clause. Use order() to sort the rows, like ORDER BY.

subset(planets_df, subset = diameter < 1)
##   planets               type diameter rotation rings
## 1 Mercury Terrestrial planet    0.382    58.64 FALSE
## 2   Venus Terrestrial planet    0.949  -243.02 FALSE
## 4    Mars Terrestrial planet    0.532     1.03 FALSE
planets_df[order(planets_df$diameter),]
##   planets               type diameter rotation rings
## 1 Mercury Terrestrial planet    0.382    58.64 FALSE
## 4    Mars Terrestrial planet    0.532     1.03 FALSE
## 2   Venus Terrestrial planet    0.949  -243.02 FALSE
## 3   Earth Terrestrial planet    1.000     1.00 FALSE
## 8 Neptune          Gas giant    3.883     0.67  TRUE
## 7  Uranus          Gas giant    4.007    -0.72  TRUE
## 6  Saturn          Gas giant    9.449     0.43  TRUE
## 5 Jupiter          Gas giant   11.209     0.41  TRUE

Lists

Construct a list of objects with list(). Name the list items either with “=” at creation, or using names().

my_vector <- 1:10 
my_matrix <- matrix(1:9, ncol = 3)
my_df <- mtcars[1:10,]

my_list <- list(my_vector, my_matrix, my_df)
names(my_list) <- c("vec", "mat", "df")
my_list
## $vec
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $mat
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
## 
## $df
##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
# or
my_list2 <- list(vec = my_vector, mat = my_matrix, df = my_df)
my_list2
## $vec
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $mat
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
## 
## $df
##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

Reference an item in a list by its component number in double brackets, by its name in double brackets, or by its name after a dollar sign.

my_vector <- 1:10 
my_matrix <- matrix(1:9, ncol = 3)
my_df <- mtcars[1:10,]
my_list <- list(vec = my_vector, mat = my_matrix, df = my_df)

# Third col of second element of my_list (my_matrix)
my_list[[2]][,3]
## [1] 7 8 9
my_list$mat[,3]
## [1] 7 8 9

Append to a list with combine c().

my_vector <- 1:10 
my_matrix <- matrix(1:9, ncol = 3)
my_df <- mtcars[1:10,]
my_list <- list(vec = my_vector, mat = my_matrix, df = my_df)
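# Wrap the data frame in list() so it is appended as one element;
# without list(), c() would splice in each column separately.
my_list <- c(my_list, list(df2 = my_df))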
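
One detail worth checking when appending a data frame with c(): a data frame is itself a list of columns, so c() splices the columns in as separate elements unless the data frame is wrapped in list(). A small sketch with made-up objects (the names small_list, small_df, spliced, and appended are hypothetical):

```r
small_list <- list(vec = 1:3)
small_df <- data.frame(x = 1:2, y = c("a", "b"))

# The columns of small_df become separate list elements.
spliced <- c(small_list, df2 = small_df)
names(spliced)

# Wrapping in list() keeps the data frame as a single element.
appended <- c(small_list, list(df2 = small_df))
names(appended)
```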

Intermediate R

Conditionals

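Relational operators include ==, !=, <, >, <=, and >=. Logical operators are &, |, and !. Be careful with && and ||: they are not vectorized. Older versions of R evaluated only the first element of each operand, and R 4.3.0 and later raise an error when an operand has length greater than one. Control constructs are if, else if, and else.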

x <- 3
if (x %% 2 == 0) {
  print("x is divisible by 2")
} else if (x %% 3 == 0) {
  print("x is divisible by 3")
} else {
  print("x is divisible by neither 2 nor 3")
}
## [1] "x is divisible by 3"
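
The difference between the vectorized and the single-value logical operators can be sketched as follows. The variable names are illustrative assumptions.

```r
x <- c(1, 4, 7, 10)

# & and | work element-wise and return a logical vector.
in_range <- x > 2 & x < 9

# if() needs a single TRUE or FALSE, so collapse the vector with any() or all().
if (any(in_range)) {
  print("at least one value is between 2 and 9")
}

# && and || are meant for single values, such as the conditions inside if().
first_ok <- length(x) > 0 && x[1] > 0
```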

Loops

A while loop has the form while (condition) { ... }. Exit the loop early with if (condition) { break }.

i <- 1
while (i <= 10) {
  print(3 * i)
  # Parentheses are needed because %% binds more tightly than *.
  if ((3 * i) %% 8 == 0) {
    break
  }
  i <- i + 1
}
## [1] 3
## [1] 6
## [1] 9
## [1] 12
## [1] 15
## [1] 18
## [1] 21
## [1] 24

A for loop has the form for (var in seq) { expr }. The break statement abandons the active loop. The next statement skips the rest of the statements in the current loop iteration.

linkedin <- c(16, 9, 13, 5, 2, 17, 14)

# Loop version 1
for(views in linkedin) {
  print(views)
  if (views > 10) {
    break
  } else if (views < 5) {
    next
  }
}
## [1] 16
# Loop version 2
for(i in 1:length(linkedin)) {
  print(linkedin[i])
}
## [1] 16
## [1] 9
## [1] 13
## [1] 5
## [1] 2
## [1] 17
## [1] 14
# seq_along handles zero-length vectors and lists.
for (i in seq_along(linkedin)) {
  print(linkedin[i])
}
## [1] 16
## [1] 9
## [1] 13
## [1] 5
## [1] 2
## [1] 17
## [1] 14
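
The zero-length case mentioned in the comment above is where 1:length(x) goes wrong. A minimal sketch (variable names are illustrative):

```r
empty <- numeric(0)

# 1:length(empty) is 1:0, which counts down and yields c(1, 0).
bad_index <- 1:length(empty)

# seq_along(empty) has length zero, so a loop over it never runs.
good_index <- seq_along(empty)

iterations <- 0
for (i in good_index) {
  iterations <- iterations + 1
}
```

A loop over bad_index would run twice and index positions that do not exist, while the loop over good_index runs zero times.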

Functions

Get help on a function with help() or ?, or list its arguments with args(). Specify function arguments either by name or by position. Arguments that the documentation shows with a default value are optional.

#help(mean)
#?mean
args(mean)
## function (x, ...) 
## NULL
grades <- c(8.5, 7, 9, 5.5, 6)
mean(x=grades)
## [1] 7.2
mean(grades)
## [1] 7.2

Define a custom function with function(). A return() call exits the function immediately and is optional; without it, the function returns the value of its last evaluated expression. Set a default argument value with =.

multiply_a_b <- function(a, b = 1) {
  return (a * b)
}
result <- multiply_a_b(a = 3, b = 7)
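
The same function can be called by name, by position, or with the default left in place. A short sketch, restating the function without return() to show the implicit return value (the result variable names are illustrative):

```r
multiply_a_b <- function(a, b = 1) {
  a * b  # the last evaluated expression is the return value
}

by_name     <- multiply_a_b(a = 3, b = 7)
by_position <- multiply_a_b(3, 7)
use_default <- multiply_a_b(3)  # b falls back to its default of 1
```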

Install a package with install.packages(arg). Packages are hosted on the Comprehensive R Archive Network (CRAN). List the currently attached packages with search(). R attaches seven packages to its search list by default. Attach more packages with library() or require().

The Apply Family

Function lapply(X, FUN, ...) applies a function to each element of a list or vector. lapply() always returns a list, so if X is a vector, convert the result back to a vector with unlist(). If the function requires arguments, pass them in as additional arguments to lapply(). Functions can be named or anonymous, so if a function is used only once, define it within lapply().

lapply(list(1,2,3), function(x) { 3 * x })
## [[1]]
## [1] 3
## 
## [[2]]
## [1] 6
## 
## [[3]]
## [1] 9

Function sapply() calls lapply() then converts the list to a one-dimensional array (vector) or two-dimensional array (matrix). If sapply cannot simplify because the resulting list contains vectors of varying lengths, then sapply() returns the same result as lapply().

Function vapply() works like sapply() but takes a FUN.VALUE argument, a template for the type and length of each result. vapply() raises an error if a result does not match the template, which makes it a safe alternative to sapply().
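
A minimal side-by-side sketch of the three functions on the same input. The input list and result names are assumptions for illustration.

```r
words <- list("cat", "moose", "impala")

# lapply() always returns a list.
len_list <- lapply(words, nchar)

# sapply() simplifies to a vector when every result has length one.
len_vec <- sapply(words, nchar)

# vapply() also simplifies, but checks each result against FUN.VALUE.
len_safe <- vapply(words, nchar, FUN.VALUE = integer(1))

# Extra arguments after FUN are passed on to the function.
padded <- sapply(words, paste, "!", sep = "")
```

If nchar() ever returned something other than a single integer, the vapply() call would stop with an error instead of silently changing the shape of the result.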

purrr Package

The purrr package provides functions that map a function over a vector and return a vector. map() returns a list; the others, map_dbl(), map_lgl(), map_int(), and map_chr(), return an atomic vector of the named type. The purrr functions provide shortcuts for the f argument, are more consistent than lapply() and sapply(), and handle iteration well.

library(purrr)
## Warning: package 'purrr' was built under R version 3.4.4
cyl <- split(mtcars, mtcars$cyl)
# Regress mpg ~ wt on each cylinder class
map(cyl, function(df) lm(mpg ~ wt, data = df))
## $`4`
## 
## Call:
## lm(formula = mpg ~ wt, data = df)
## 
## Coefficients:
## (Intercept)           wt  
##      39.571       -5.647  
## 
## 
## $`6`
## 
## Call:
## lm(formula = mpg ~ wt, data = df)
## 
## Coefficients:
## (Intercept)           wt  
##       28.41        -2.78  
## 
## 
## $`8`
## 
## Call:
## lm(formula = mpg ~ wt, data = df)
## 
## Coefficients:
## (Intercept)           wt  
##      23.868       -2.192
# Same thing with shortcuts
models <- map(cyl, ~ lm(mpg ~ wt, data = .))
coefs <- map(models, coef)
map(coefs, "wt")
## $`4`
## [1] -5.647025
## 
## $`6`
## [1] -2.780106
## 
## $`8`
## [1] -2.192438
# Or, using a single command with pipes.
mtcars %>% 
  split(mtcars$cyl) %>%
  map(~ lm(mpg ~ wt, data = .)) %>%
  map(coef) %>% 
  map_dbl("wt")
##         4         6         8 
## -5.647025 -2.780106 -2.192438

The safely() function returns a list with two elements: result and error for each element. possibly() returns a default value on errors. quietly() captures all printed output, messages, and warnings instead of capturing errors.

# Pass the function itself to safely(), not the result of calling it.
safe_readLines <- safely(readLines)

# Call safe_readLines() on "http://example.org"
example_lines <- safe_readLines("http://example.org")
example_lines
## $result
## NULL
## 
## $error
## NULL
# Call safe_readLines() on "http://asdfasdasdkfjlda"
nonsense_lines <- safe_readLines("http://asdfasdasdkfjlda")
nonsense_lines
## $result
## NULL
## 
## $error
## NULL
n <- list(5, 10, 20)
mu <- list(1, 5, 10)
sd <- list(0.1, 1, 0.1)

# iterate over the lists
pmap(list(n, mu, sd), rnorm)
## [[1]]
## [1] 1.0380868 0.9605489 1.0786154 1.0073599 1.0234126
## 
## [[2]]
##  [1] 4.343431 6.307386 3.939620 3.125216 7.622740 5.457172 5.548574
##  [8] 4.371869 4.627905 5.260454
## 
## [[3]]
##  [1] 10.053020 10.053259 10.119406  9.824395  9.995872  9.749677  9.997900
##  [8] 10.128129 10.115909 10.197187 10.031033 10.080599  9.935449 10.055783
## [15] 10.083899  9.935934  9.781156 10.215975 10.060304 10.016733
funs <- list("rnorm", "runif", "rexp")

rnorm_params <- list(mean = 10)
runif_params <- list(min = 0, max = 5)
rexp_params <- list(rate = 5)
params <- list(
  rnorm_params,
  runif_params,
  rexp_params
)

# Call invoke_map() on funs supplying params and setting n to 5
invoke_map(funs, params, n = 5)
## [[1]]
## [1]  9.657600 12.019679 10.136912 11.521788  9.658688
## 
## [[2]]
## [1] 1.0613833 2.0008371 1.4973380 2.9227932 0.3804437
## 
## [[3]]
## [1] 0.07188987 0.07739475 0.03476835 0.33302093 0.17282787

walk() operates just like map() except it’s designed for functions that don’t return anything. Use walk() for functions with side effects like printing, plotting or saving.

#?walk2
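
A minimal sketch of walk() and walk2(), assuming the purrr package is installed. The scores list is made-up data for illustration.

```r
library(purrr)

# Illustrative data (not from the notes above).
scores <- list(math = c(8, 9), art = c(6, 7))

# walk() calls the function for its side effect (printing here)
# and returns its input invisibly.
returned <- walk(scores, function(s) print(mean(s)))

# walk2() iterates over two vectors in parallel.
walk2(names(scores), scores, function(nm, s) cat(nm, ":", mean(s), "\n"))
```

Because walk() returns its input unchanged, it can sit in the middle of a pipeline without interrupting it.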

stopifnot() is a quick way to stop a function if a condition fails. stopifnot() takes logical expressions as arguments and raises an error if any of them is not TRUE.

x <- c(NA, NA, NA)
y <- c( 1, NA, NA, NA)

both_na <- function(x, y) {
  stopifnot(length(x) == length(y))
  sum(is.na(x) & is.na(y))
}
#both_na(x, y)

Use stop() instead of stopifnot() to specify a more informative error message.

x <- c(NA, NA, NA)
y <- c( 1, NA, NA, NA)

both_na <- function(x, y) {
  if (length(x) != length(y)) {
    stop("x and y must have the same length", call. = FALSE)
  }    
  sum(is.na(x) & is.na(y))
}
#both_na(x, y)
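
The calls above are commented out because they halt with an error. One way to see the message without stopping the script is to wrap the call in tryCatch(). This is a sketch; the variable names msg and n_both are illustrative.

```r
both_na <- function(x, y) {
  if (length(x) != length(y)) {
    stop("x and y must have the same length", call. = FALSE)
  }
  sum(is.na(x) & is.na(y))
}

# Capture the error message instead of halting.
msg <- tryCatch(
  both_na(c(NA, NA, NA), c(1, NA, NA, NA)),
  error = function(e) conditionMessage(e)
)

# With equal lengths the function counts positions where both are NA.
n_both <- both_na(c(NA, 2, NA), c(NA, NA, 3))
```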

Useful Functions

R features a number of functions for manipulating data structures:

  • seq(from = 1, to = 2, by = .25): generates a sequence from 1 to 2 incremented by .25.
  • rep(x, times): replicates elements of vectors and lists.
  • sort(x): sorts a vector.
  • rev(x): reverses the elements of a data structure for which reversal is defined.
  • str(x): displays the structure of any R object x.
  • append(x, y): appends vector or list y to x.
  • is.*(): checks the class of R object x.
  • as.*(): casts R object x to another class.
  • unlist(x): flattens (possibly nested) lists to produce a vector.

myseq <- seq(8, 2, by=-2)
myseq
## [1] 8 6 4 2
myrep <- rep(myseq, times =2)
myrep
## [1] 8 6 4 2 8 6 4 2
myrep <- rep(myseq, each = 2)
myrep
## [1] 8 8 6 6 4 4 2 2
linkedin <- list(16, 9, 13, 5, 2, 17, 14)
facebook <- list(17, 7, 5, 16, 8, 13, 14)
li_vec <- unlist(linkedin)
fb_vec <- unlist(facebook)
social_vec <- append(li_vec, fb_vec)
sort(social_vec, decreasing = TRUE)
##  [1] 17 17 16 16 14 14 13 13  9  8  7  5  5  2

Regular expression functions include grepl(), grep(), sub(), and gsub(). grepl(pattern = "a", x = animals) returns TRUE for each element of x that matches the pattern. The regular expression "^a" matches strings that start with a; "a$" matches strings that end with a; ".*" matches any character zero or more times; "\\s" matches a whitespace character; "[0-9]+" matches one or more digits. grep(pattern = "a", x = animals) returns the indices of the elements of x that match the pattern. sub(pattern = "a", replacement = "o", x = animals) substitutes the first a in each element with o. gsub(pattern = "a", replacement = "o", x = animals) substitutes all a’s with o’s.

animals <- c("cat", "moose", "impala", "ant", "kiwi")
grepl(pattern = "a", x = animals)
## [1]  TRUE FALSE  TRUE  TRUE FALSE
which(grepl(pattern = "a", x = animals))
## [1] 1 3 4
grep(pattern = "a", x = animals)
## [1] 1 3 4
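
The anchors and the substitution functions described above can be sketched on the same vector. The result variable names are illustrative.

```r
animals <- c("cat", "moose", "impala", "ant", "kiwi")

starts_with_a <- grepl("^a", animals)  # TRUE only for "ant"
ends_with_a   <- grepl("a$", animals)  # TRUE only for "impala"

first_replaced <- sub("a", "o", animals)   # first "a" in each element
all_replaced   <- gsub("a", "o", animals)  # every "a" in each element
```

The two substitution results differ only for "impala", the one element with more than one a.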

There are two date-time classes in R: POSIXlt, a list with named components, and POSIXct, the number of seconds since 1970-01-01 00:00:00 UTC. POSIXct is more amenable to data frames, so you will encounter it much more often. Sys.Date() returns a Date equal to today. Sys.time() returns a POSIXct.

as.Date("2018-10-16")
## [1] "2018-10-16"
as.POSIXct("2018-11-28 08:34:00")
## [1] "2018-11-28 08:34:00 EST"
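
Because Date and POSIXct values are stored as numbers, arithmetic on them works directly. A minimal sketch (variable names are illustrative, and the time zone is set to UTC explicitly so the result does not depend on the local machine):

```r
start <- as.Date("2018-10-16")
end   <- as.Date("2018-11-28")

# Dates are stored as days since 1970-01-01, so subtraction gives days.
days_between <- as.numeric(end - start)

# POSIXct is stored as seconds since 1970-01-01 00:00:00 UTC.
one_day_in <- as.POSIXct("1970-01-02 00:00:00", tz = "UTC")
seconds <- as.numeric(one_day_in)

# format() converts a date to text with a chosen layout.
label <- format(start, "%d/%m/%Y")
```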

Importing Data

RData

The simplest file to import is RData.

url_rdata <- "https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/wine.RData"
download.file(url_rdata, "Programs/Data/wine_local.RData")
# loading wine_local.RData creates variable wine.
load("Programs/Data/wine_local.RData")
summary(wine)
##     Alcohol        Malic acid        Ash        Alcalinity of ash
##  Min.   :11.03   Min.   :0.74   Min.   :1.360   Min.   :10.60    
##  1st Qu.:12.36   1st Qu.:1.60   1st Qu.:2.210   1st Qu.:17.20    
##  Median :13.05   Median :1.87   Median :2.360   Median :19.50    
##  Mean   :12.99   Mean   :2.34   Mean   :2.366   Mean   :19.52    
##  3rd Qu.:13.67   3rd Qu.:3.10   3rd Qu.:2.560   3rd Qu.:21.50    
##  Max.   :14.83   Max.   :5.80   Max.   :3.230   Max.   :30.00    
##    Magnesium      Total phenols     Flavanoids    Nonflavanoid phenols
##  Min.   : 70.00   Min.   :0.980   Min.   :0.340   Min.   :0.1300      
##  1st Qu.: 88.00   1st Qu.:1.740   1st Qu.:1.200   1st Qu.:0.2700      
##  Median : 98.00   Median :2.350   Median :2.130   Median :0.3400      
##  Mean   : 99.59   Mean   :2.292   Mean   :2.023   Mean   :0.3623      
##  3rd Qu.:107.00   3rd Qu.:2.800   3rd Qu.:2.860   3rd Qu.:0.4400      
##  Max.   :162.00   Max.   :3.880   Max.   :5.080   Max.   :0.6600      
##  Proanthocyanins Color intensity       Hue           Proline      
##  Min.   :0.410   Min.   : 1.280   Min.   :1.270   Min.   : 278.0  
##  1st Qu.:1.250   1st Qu.: 3.210   1st Qu.:1.930   1st Qu.: 500.0  
##  Median :1.550   Median : 4.680   Median :2.780   Median : 672.0  
##  Mean   :1.587   Mean   : 5.055   Mean   :2.604   Mean   : 745.1  
##  3rd Qu.:1.950   3rd Qu.: 6.200   3rd Qu.:3.170   3rd Qu.: 985.0  
##  Max.   :3.580   Max.   :13.000   Max.   :4.000   Max.   :1680.0
# or, equivalently,
load(url("https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/wine.RData"))
summary(wine)
##     Alcohol        Malic acid        Ash        Alcalinity of ash
##  Min.   :11.03   Min.   :0.74   Min.   :1.360   Min.   :10.60    
##  1st Qu.:12.36   1st Qu.:1.60   1st Qu.:2.210   1st Qu.:17.20    
##  Median :13.05   Median :1.87   Median :2.360   Median :19.50    
##  Mean   :12.99   Mean   :2.34   Mean   :2.366   Mean   :19.52    
##  3rd Qu.:13.67   3rd Qu.:3.10   3rd Qu.:2.560   3rd Qu.:21.50    
##  Max.   :14.83   Max.   :5.80   Max.   :3.230   Max.   :30.00    
##    Magnesium      Total phenols     Flavanoids    Nonflavanoid phenols
##  Min.   : 70.00   Min.   :0.980   Min.   :0.340   Min.   :0.1300      
##  1st Qu.: 88.00   1st Qu.:1.740   1st Qu.:1.200   1st Qu.:0.2700      
##  Median : 98.00   Median :2.350   Median :2.130   Median :0.3400      
##  Mean   : 99.59   Mean   :2.292   Mean   :2.023   Mean   :0.3623      
##  3rd Qu.:107.00   3rd Qu.:2.800   3rd Qu.:2.860   3rd Qu.:0.4400      
##  Max.   :162.00   Max.   :3.880   Max.   :5.080   Max.   :0.6600      
##  Proanthocyanins Color intensity       Hue           Proline      
##  Min.   :0.410   Min.   : 1.280   Min.   :1.270   Min.   : 278.0  
##  1st Qu.:1.250   1st Qu.: 3.210   1st Qu.:1.930   1st Qu.: 500.0  
##  Median :1.550   Median : 4.680   Median :2.780   Median : 672.0  
##  Mean   :1.587   Mean   : 5.055   Mean   :2.604   Mean   : 745.1  
##  3rd Qu.:1.950   3rd Qu.: 6.200   3rd Qu.:3.170   3rd Qu.: 985.0  
##  Max.   :3.580   Max.   :13.000   Max.   :4.000   Max.   :1680.0

Flat files

There are three common packages designed to load flat files: utils, which comes with base R, readr, and data.table.

utils

The base R utils package includes flat file reading functions. read.table() is a generic flat file loading function. The wrapper functions read.csv() and read.delim() read comma-separated and tab-delimited files, respectively.

  • stringsAsFactors = TRUE treats string variables as categorical.
  • col.names = c() overrides, or sets, column names.
  • colClasses = c() sets data types. "NULL" elements in the vector drop the variable.
# Opt 1: set working dir to file location
# setwd("C:/Users/mpfol/OneDrive/Documents/Data Analysis/Data")
# Opt 2: define a file path relative to script file.
path <- file.path("Data", "swimming_pools.csv")

swimming_pools <- read.csv(path, stringsAsFactors = FALSE)

swimming_pools <- read.table(path, 
                             sep = ",",
                             header = TRUE,
                             col.names = c("name", "address", "ph", "ph2", "open_hr","facilities", "disabl","park","lat","longit"),
                             colClasses = c("factor", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "numeric", "numeric"))
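
Since swimming_pools.csv is not bundled with these notes, the effect of colClasses can be sketched on a small temporary file instead. The file contents and the name pools_small are made up for illustration.

```r
# Write a small illustrative CSV to a temporary file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,address,lat",
             "Pool A,1 Main St,-27.5",
             "Pool B,2 High St,-27.6"), tmp)

# "NULL" drops the address column; the others get the stated types.
pools_small <- read.table(tmp,
                          sep = ",",
                          header = TRUE,
                          colClasses = c("factor", "NULL", "numeric"))
str(pools_small)
unlink(tmp)
```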

readr

readr is similar to utils, but is faster and less verbose. readr returns a “tibble” instead of a data frame. Functions read_csv() and read_tsv() are wrappers for read_delim(), similar to the construction in package utils.

  • Default col_names = TRUE sets column names to the first row of data. Set col_names = FALSE for system-generated names or set col_names = c() to set the column names to a character vector.
  • col_types = c() sets data types. NULL elements in the vector drop the variable. Use shorthand strings, where col_types = "cd_il" means “character, double, (skip), integer, logical”.
  • Collector functions col_factor() and col_integer() also set column types.
library(readr)
pools <- file.path("Programs/Data", "swimming_pools.csv")
# or, if on the web,
pools.path <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/swimming_pools.csv"
pools <- read_csv(pools.path)
## Parsed with column specification:
## cols(
##   Name = col_character(),
##   Address = col_character(),
##   Latitude = col_double(),
##   Longitude = col_double()
## )
potatoes.path <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/potatoes.txt"
potatoes <- read_delim(potatoes.path, delim = "\t")
## Parsed with column specification:
## cols(
##   area = col_integer(),
##   temp = col_integer(),
##   size = col_integer(),
##   storage = col_integer(),
##   method = col_integer(),
##   texture = col_double(),
##   flavor = col_double(),
##   moistness = col_double()
## )
machine <- file.path("Programs/Data", "machine.txt")
properties <- c("new", "old")
machine.fragment <- read_tsv(machine, skip = 6, n_max = 5, 
                              col_names = properties)
## Parsed with column specification:
## cols(
##   new = col_double(),
##   old = col_double()
## )
hotdogs <- file.path("Programs/Data", "hotdogs.txt")
hotdogs_factor <- read_tsv(hotdogs,
                           col_names = c("type", "calories", "sodium"),
                           skip = 1)
## Parsed with column specification:
## cols(
##   type = col_character(),
##   calories = col_double(),
##   sodium = col_double()
## )
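
The compact col_types string can be tried on a small temporary file. This sketch assumes the readr package is installed; the file contents and the name items are made up for illustration.

```r
library(readr)

# Write a small illustrative CSV with five columns.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,price,note,qty,in_stock",
             "a1,2.5,skip me,3,TRUE",
             "b2,4.0,skip me,7,FALSE"), tmp)

# "cd_il": character, double, skip, integer, logical.
items <- read_csv(tmp, col_types = "cd_il")
names(items)
```

The note column is skipped, so the result has four columns rather than five.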

data.table

The data.table package is optimized for large files. fread() is faster and more convenient than read.table.

library(data.table)
## Warning: package 'data.table' was built under R version 3.4.4
## 
## Attaching package: 'data.table'
## The following object is masked from 'package:purrr':
## 
##     transpose
pools <- file.path("Programs/Data", "swimming_pools.csv")

machine <- file.path("Programs/Data", "machine.txt")
properties <- c("new", "old")
machine.fragment <- fread(machine)

Excel

There are three packages to choose from, readxl, gdata, and XLConnect. gdata only handles .xls files and will be replaced when readxl is more mature. XLConnect is designed to work with Excel through R.

readxl

readxl cannot read directly from the internet. First download the file, then import the file.

In package readxl, excel_sheets() lists the available sheets and read_excel() reads a sheet from the file.

  • Default col_names = TRUE sets column names to the first row of data. Set col_names = FALSE for system-generated names or set col_names = c() to set the column names to a character vector.
  • col_types = c() sets data types. “skip” elements in the vector (“blank” in older versions of readxl) drop the variable.
  • skip sets the number of rows to skip before reading. If the skipped rows include the header row, set the column names manually with col_names.
library(readxl)
## Warning: package 'readxl' was built under R version 3.4.4
url_xls <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/latitude.xls"
# mode = "wb" downloads in binary mode, which Excel files need on Windows.
download.file(url_xls, file.path("Programs/Data", "local_latitude.xls"), mode = "wb")
#excel_readxl <- read_excel(file.path("C:/Users/mpfol/OneDrive/Documents/Data Analysis/Programs/Data", "local_latitude.xls"))

mini.path <- file.path("Programs/Data", "MinitabIntroData.xlsx")
excel_sheets(mini.path)
## [1] "Sheet1" "Sheet2"
sheet1 <- read_excel(mini.path, sheet = "Sheet1")
sheet2 <- read_excel(mini.path, sheet = "Sheet2")
sheet.list <- list(sheet1, sheet2)

# Equivalently...
sheet.list <- lapply(excel_sheets(mini.path), 
                     read_excel, path = mini.path)
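
A small variation on the lapply() pattern, sketched here under the assumption that readxl 1.0.0 or later is installed (it ships example workbooks reachable through readxl_example()). Naming the list elements after the sheets makes each data frame easy to retrieve later.

```r
library(readxl)

# Example workbook bundled with readxl (an assumption: not the notes' own file).
path <- readxl_example("datasets.xlsx")
sheets <- excel_sheets(path)

# Each sheet name is passed to the sheet argument of read_excel().
sheet_list <- lapply(sheets, read_excel, path = path)

# Name the list elements so a sheet can be pulled out by name.
names(sheet_list) <- sheets
str(sheet_list, max.level = 1)
```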

gdata

gdata relies on Perl in the background. It can only read .xls files, but it can read them directly from a URL.

library(gdata)
## Warning: package 'gdata' was built under R version 3.4.4
## gdata: Unable to locate valid perl interpreter
## gdata: 
## gdata: read.xls() will be unable to read Excel XLS and XLSX files
## gdata: unless the 'perl=' argument is used to specify the location
## gdata: of a valid perl intrpreter.
## gdata: 
## gdata: (To avoid display of this message in the future, please
## gdata: ensure perl is installed and available on the executable
## gdata: search path.)
## gdata: Unable to load perl libaries needed by read.xls()
## gdata: to support 'XLX' (Excel 97-2004) files.
## 
## gdata: Unable to load perl libaries needed by read.xls()
## gdata: to support 'XLSX' (Excel 2007+) files.
## 
## gdata: Run the function 'installXLSXsupport()'
## gdata: to automatically download and install the perl
## gdata: libaries needed to support Excel XLS and XLSX formats.
## 
## Attaching package: 'gdata'
## The following objects are masked from 'package:data.table':
## 
##     first, last
## The following object is masked from 'package:purrr':
## 
##     keep
## The following object is masked from 'package:stats':
## 
##     nobs
## The following object is masked from 'package:utils':
## 
##     object.size
## The following object is masked from 'package:base':
## 
##     startsWith
url_xls <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/latitude.xls"
#read.xls(url_xls)

XLConnect

#library(XLConnect)
mini.path <- file.path("Programs/Data", "MinitabIntroData.xlsx")
#my_book <- loadWorkbook(mini.path)
#class(my_book)
#getSheets(my_book)
#readWorksheet(my_book, sheet = 2)
#all <- lapply(sheets, readWorksheet, object = my_book)
#str(all)
#createSheet(my_book, name = "year_2010")
#writeWorksheet(my_book, pop_2010, sheet = "year_2010")
#saveWorkbook(my_book, file = "MinitabIntroData2.xlsx")

Other Sources

Databases

There is a dedicated package for each DBMS: RMySQL, RPostgreSQL, ROracle, etc. Function dbGetQuery() is a convenient wrapper around three functions: dbSendQuery(), dbFetch(), and dbClearResult(). Use the three functions separately if the data set is large and only a chunk of data is needed at a time.

library(DBI)
## Warning: package 'DBI' was built under R version 3.4.4
con <- dbConnect(RMySQL::MySQL(), 
                 dbname = "tweater", 
                 host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com", 
                 port = 3306,
                 user = "student",
                 password = "datacamp")
con
## <MySQLConnection:0,0>
# read all tables into a list of data frames
table_names <- dbListTables(con)
tables <- lapply(table_names, dbReadTable, conn = con)
# read an entire table, then subset the rows you want (inefficient)
comments <- dbReadTable(con, "comments")
subset(comments, subset = user_id == 1)
##      id tweat_id user_id            message
## 4  1012       87       1   awesome! thanks!
## 7  1004       49       1  this is fabulous!
## 11 1020       77       1 couldn't be better
## 12 1014       77       1       saved my day
elisabeth <- dbGetQuery(con, "SELECT tweat_id FROM comments 
                        WHERE user_id = 1")
latest <- dbGetQuery(con, "SELECT post FROM tweats WHERE date > \"2015-09-21\"")

dbDisconnect(con)
## [1] TRUE
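
The chunked pattern can be sketched as follows. This is an illustration only: it swaps in an in-memory SQLite database (package RSQLite, an assumption) for the MySQL server above, and the comments table is a made-up stand-in.

```r
library(DBI)

# In-memory SQLite database standing in for the remote server (assumption).
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "comments",
             data.frame(id = 1:5, user_id = c(1, 2, 1, 3, 1)))

# Send the query once, then fetch the result two rows at a time.
res <- dbSendQuery(con, "SELECT id FROM comments WHERE user_id = 1")
n_rows <- 0
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 2)
  n_rows <- n_rows + nrow(chunk)
  print(chunk)
}

# Always release the result set and the connection.
dbClearResult(res)
dbDisconnect(con)
```

dbGetQuery() would run the same three steps in one call, but it returns all rows at once.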

Internet

If a file resides on the web, most import functions can read it directly from its URL, so there is no need to download it manually. readxl is an exception: download the file first, then import it.

url <- "http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r"
dest_path <- file.path("~", "local_cities.xlsx")
#download.file(url, dest_path)

The httr package also handles internet files.

library(httr)
## Warning: package 'httr' was built under R version 3.4.4
resp <- GET("http://www.example.com/")
raw_content <- content(resp, as = "raw")
head(raw_content)
## [1] 3c 21 64 6f 63 74

APIs and JSON

JSON is built from two structures: objects, which are collections of name-value pairs such as {"id":1,"name":"Frank"}, and arrays such as [1,2,3,"dog"].
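
A minimal sketch of how the two structures combine: with jsonlite's default simplification, an array of objects that share the same names comes back as a data frame. The records below are invented for illustration.

```r
library(jsonlite)

# An array of two objects with the same names (made-up data).
people_json <- '[{"id":1,"name":"Frank"},{"id":2,"name":"Julie"}]'

# fromJSON() simplifies an array of similar objects into a data frame.
people <- fromJSON(people_json)
people

# toJSON() converts the data frame back into a JSON array of objects.
toJSON(people)
```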

library(jsonlite)
## Warning: package 'jsonlite' was built under R version 3.4.4
## 
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
## 
##     flatten
wine_json <- '{"name":"Chateau Migraine", "year":1997, "alcohol_pct":12.4, "color":"red", "awarded":false}'
# Convert a JSON string into a list
wine <- fromJSON(wine_json)
str(wine)
## List of 5
##  $ name       : chr "Chateau Migraine"
##  $ year       : int 1997
##  $ alcohol_pct: num 12.4
##  $ color      : chr "red"
##  $ awarded    : logi FALSE
# Convert web API JSON into list
url_sw4 <- "http://www.omdbapi.com/?apikey=ff21610b&i=tt0076759&r=json"
url_sw3 <- "http://www.omdbapi.com/?apikey=ff21610b&i=tt0121766&r=json"

# Import two URLs with fromJSON(): sw4 and sw3
#sw4 <- fromJSON(url_sw4)
#sw3 <- fromJSON(url_sw3)

# Print the Title element of both lists
#sw4$Title
#sw3$Title

# Convert mtcars to a pretty JSON: pretty_json
pretty_json <- toJSON(mtcars, pretty = TRUE)
pretty_json
## [
##   {
##     "mpg": 21,
##     "cyl": 6,
##     "disp": 160,
##     "hp": 110,
##     "drat": 3.9,
##     "wt": 2.62,
##     "qsec": 16.46,
##     "vs": 0,
##     "am": 1,
##     "gear": 4,
##     "carb": 4,
##     "_row": "Mazda RX4"
##   },
##   {
##     "mpg": 21,
##     "cyl": 6,
##     "disp": 160,
##     "hp": 110,
##     "drat": 3.9,
##     "wt": 2.875,
##     "qsec": 17.02,
##     "vs": 0,
##     "am": 1,
##     "gear": 4,
##     "carb": 4,
##     "_row": "Mazda RX4 Wag"
##   },
##   {
##     "mpg": 22.8,
##     "cyl": 4,
##     "disp": 108,
##     "hp": 93,
##     "drat": 3.85,
##     "wt": 2.32,
##     "qsec": 18.61,
##     "vs": 1,
##     "am": 1,
##     "gear": 4,
##     "carb": 1,
##     "_row": "Datsun 710"
##   },
##   {
##     "mpg": 21.4,
##     "cyl": 6,
##     "disp": 258,
##     "hp": 110,
##     "drat": 3.08,
##     "wt": 3.215,
##     "qsec": 19.44,
##     "vs": 1,
##     "am": 0,
##     "gear": 3,
##     "carb": 1,
##     "_row": "Hornet 4 Drive"
##   },
##   {
##     "mpg": 18.7,
##     "cyl": 8,
##     "disp": 360,
##     "hp": 175,
##     "drat": 3.15,
##     "wt": 3.44,
##     "qsec": 17.02,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 2,
##     "_row": "Hornet Sportabout"
##   },
##   {
##     "mpg": 18.1,
##     "cyl": 6,
##     "disp": 225,
##     "hp": 105,
##     "drat": 2.76,
##     "wt": 3.46,
##     "qsec": 20.22,
##     "vs": 1,
##     "am": 0,
##     "gear": 3,
##     "carb": 1,
##     "_row": "Valiant"
##   },
##   {
##     "mpg": 14.3,
##     "cyl": 8,
##     "disp": 360,
##     "hp": 245,
##     "drat": 3.21,
##     "wt": 3.57,
##     "qsec": 15.84,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 4,
##     "_row": "Duster 360"
##   },
##   {
##     "mpg": 24.4,
##     "cyl": 4,
##     "disp": 146.7,
##     "hp": 62,
##     "drat": 3.69,
##     "wt": 3.19,
##     "qsec": 20,
##     "vs": 1,
##     "am": 0,
##     "gear": 4,
##     "carb": 2,
##     "_row": "Merc 240D"
##   },
##   {
##     "mpg": 22.8,
##     "cyl": 4,
##     "disp": 140.8,
##     "hp": 95,
##     "drat": 3.92,
##     "wt": 3.15,
##     "qsec": 22.9,
##     "vs": 1,
##     "am": 0,
##     "gear": 4,
##     "carb": 2,
##     "_row": "Merc 230"
##   },
##   {
##     "mpg": 19.2,
##     "cyl": 6,
##     "disp": 167.6,
##     "hp": 123,
##     "drat": 3.92,
##     "wt": 3.44,
##     "qsec": 18.3,
##     "vs": 1,
##     "am": 0,
##     "gear": 4,
##     "carb": 4,
##     "_row": "Merc 280"
##   },
##   {
##     "mpg": 17.8,
##     "cyl": 6,
##     "disp": 167.6,
##     "hp": 123,
##     "drat": 3.92,
##     "wt": 3.44,
##     "qsec": 18.9,
##     "vs": 1,
##     "am": 0,
##     "gear": 4,
##     "carb": 4,
##     "_row": "Merc 280C"
##   },
##   {
##     "mpg": 16.4,
##     "cyl": 8,
##     "disp": 275.8,
##     "hp": 180,
##     "drat": 3.07,
##     "wt": 4.07,
##     "qsec": 17.4,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 3,
##     "_row": "Merc 450SE"
##   },
##   {
##     "mpg": 17.3,
##     "cyl": 8,
##     "disp": 275.8,
##     "hp": 180,
##     "drat": 3.07,
##     "wt": 3.73,
##     "qsec": 17.6,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 3,
##     "_row": "Merc 450SL"
##   },
##   {
##     "mpg": 15.2,
##     "cyl": 8,
##     "disp": 275.8,
##     "hp": 180,
##     "drat": 3.07,
##     "wt": 3.78,
##     "qsec": 18,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 3,
##     "_row": "Merc 450SLC"
##   },
##   {
##     "mpg": 10.4,
##     "cyl": 8,
##     "disp": 472,
##     "hp": 205,
##     "drat": 2.93,
##     "wt": 5.25,
##     "qsec": 17.98,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 4,
##     "_row": "Cadillac Fleetwood"
##   },
##   {
##     "mpg": 10.4,
##     "cyl": 8,
##     "disp": 460,
##     "hp": 215,
##     "drat": 3,
##     "wt": 5.424,
##     "qsec": 17.82,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 4,
##     "_row": "Lincoln Continental"
##   },
##   {
##     "mpg": 14.7,
##     "cyl": 8,
##     "disp": 440,
##     "hp": 230,
##     "drat": 3.23,
##     "wt": 5.345,
##     "qsec": 17.42,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 4,
##     "_row": "Chrysler Imperial"
##   },
##   {
##     "mpg": 32.4,
##     "cyl": 4,
##     "disp": 78.7,
##     "hp": 66,
##     "drat": 4.08,
##     "wt": 2.2,
##     "qsec": 19.47,
##     "vs": 1,
##     "am": 1,
##     "gear": 4,
##     "carb": 1,
##     "_row": "Fiat 128"
##   },
##   {
##     "mpg": 30.4,
##     "cyl": 4,
##     "disp": 75.7,
##     "hp": 52,
##     "drat": 4.93,
##     "wt": 1.615,
##     "qsec": 18.52,
##     "vs": 1,
##     "am": 1,
##     "gear": 4,
##     "carb": 2,
##     "_row": "Honda Civic"
##   },
##   {
##     "mpg": 33.9,
##     "cyl": 4,
##     "disp": 71.1,
##     "hp": 65,
##     "drat": 4.22,
##     "wt": 1.835,
##     "qsec": 19.9,
##     "vs": 1,
##     "am": 1,
##     "gear": 4,
##     "carb": 1,
##     "_row": "Toyota Corolla"
##   },
##   {
##     "mpg": 21.5,
##     "cyl": 4,
##     "disp": 120.1,
##     "hp": 97,
##     "drat": 3.7,
##     "wt": 2.465,
##     "qsec": 20.01,
##     "vs": 1,
##     "am": 0,
##     "gear": 3,
##     "carb": 1,
##     "_row": "Toyota Corona"
##   },
##   {
##     "mpg": 15.5,
##     "cyl": 8,
##     "disp": 318,
##     "hp": 150,
##     "drat": 2.76,
##     "wt": 3.52,
##     "qsec": 16.87,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 2,
##     "_row": "Dodge Challenger"
##   },
##   {
##     "mpg": 15.2,
##     "cyl": 8,
##     "disp": 304,
##     "hp": 150,
##     "drat": 3.15,
##     "wt": 3.435,
##     "qsec": 17.3,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 2,
##     "_row": "AMC Javelin"
##   },
##   {
##     "mpg": 13.3,
##     "cyl": 8,
##     "disp": 350,
##     "hp": 245,
##     "drat": 3.73,
##     "wt": 3.84,
##     "qsec": 15.41,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 4,
##     "_row": "Camaro Z28"
##   },
##   {
##     "mpg": 19.2,
##     "cyl": 8,
##     "disp": 400,
##     "hp": 175,
##     "drat": 3.08,
##     "wt": 3.845,
##     "qsec": 17.05,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 2,
##     "_row": "Pontiac Firebird"
##   },
##   {
##     "mpg": 27.3,
##     "cyl": 4,
##     "disp": 79,
##     "hp": 66,
##     "drat": 4.08,
##     "wt": 1.935,
##     "qsec": 18.9,
##     "vs": 1,
##     "am": 1,
##     "gear": 4,
##     "carb": 1,
##     "_row": "Fiat X1-9"
##   },
##   {
##     "mpg": 26,
##     "cyl": 4,
##     "disp": 120.3,
##     "hp": 91,
##     "drat": 4.43,
##     "wt": 2.14,
##     "qsec": 16.7,
##     "vs": 0,
##     "am": 1,
##     "gear": 5,
##     "carb": 2,
##     "_row": "Porsche 914-2"
##   },
##   {
##     "mpg": 30.4,
##     "cyl": 4,
##     "disp": 95.1,
##     "hp": 113,
##     "drat": 3.77,
##     "wt": 1.513,
##     "qsec": 16.9,
##     "vs": 1,
##     "am": 1,
##     "gear": 5,
##     "carb": 2,
##     "_row": "Lotus Europa"
##   },
##   {
##     "mpg": 15.8,
##     "cyl": 8,
##     "disp": 351,
##     "hp": 264,
##     "drat": 4.22,
##     "wt": 3.17,
##     "qsec": 14.5,
##     "vs": 0,
##     "am": 1,
##     "gear": 5,
##     "carb": 4,
##     "_row": "Ford Pantera L"
##   },
##   {
##     "mpg": 19.7,
##     "cyl": 6,
##     "disp": 145,
##     "hp": 175,
##     "drat": 3.62,
##     "wt": 2.77,
##     "qsec": 15.5,
##     "vs": 0,
##     "am": 1,
##     "gear": 5,
##     "carb": 6,
##     "_row": "Ferrari Dino"
##   },
##   {
##     "mpg": 15,
##     "cyl": 8,
##     "disp": 301,
##     "hp": 335,
##     "drat": 3.54,
##     "wt": 3.57,
##     "qsec": 14.6,
##     "vs": 0,
##     "am": 1,
##     "gear": 5,
##     "carb": 8,
##     "_row": "Maserati Bora"
##   },
##   {
##     "mpg": 21.4,
##     "cyl": 4,
##     "disp": 121,
##     "hp": 109,
##     "drat": 4.11,
##     "wt": 2.78,
##     "qsec": 18.6,
##     "vs": 1,
##     "am": 1,
##     "gear": 4,
##     "carb": 2,
##     "_row": "Volvo 142E"
##   }
## ]
# Minify pretty_json: mini_json
mini_json <- minify(pretty_json)
mini_json
## [{"mpg":21,"cyl":6,"disp":160,"hp":110,"drat":3.9,"wt":2.62,"qsec":16.46,"vs":0,"am":1,"gear":4,"carb":4,"_row":"Mazda RX4"},{"mpg":21,"cyl":6,"disp":160,"hp":110,"drat":3.9,"wt":2.875,"qsec":17.02,"vs":0,"am":1,"gear":4,"carb":4,"_row":"Mazda RX4 Wag"},{"mpg":22.8,"cyl":4,"disp":108,"hp":93,"drat":3.85,"wt":2.32,"qsec":18.61,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Datsun 710"},{"mpg":21.4,"cyl":6,"disp":258,"hp":110,"drat":3.08,"wt":3.215,"qsec":19.44,"vs":1,"am":0,"gear":3,"carb":1,"_row":"Hornet 4 Drive"},{"mpg":18.7,"cyl":8,"disp":360,"hp":175,"drat":3.15,"wt":3.44,"qsec":17.02,"vs":0,"am":0,"gear":3,"carb":2,"_row":"Hornet Sportabout"},{"mpg":18.1,"cyl":6,"disp":225,"hp":105,"drat":2.76,"wt":3.46,"qsec":20.22,"vs":1,"am":0,"gear":3,"carb":1,"_row":"Valiant"},{"mpg":14.3,"cyl":8,"disp":360,"hp":245,"drat":3.21,"wt":3.57,"qsec":15.84,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Duster 360"},{"mpg":24.4,"cyl":4,"disp":146.7,"hp":62,"drat":3.69,"wt":3.19,"qsec":20,"vs":1,"am":0,"gear":4,"carb":2,"_row":"Merc 240D"},{"mpg":22.8,"cyl":4,"disp":140.8,"hp":95,"drat":3.92,"wt":3.15,"qsec":22.9,"vs":1,"am":0,"gear":4,"carb":2,"_row":"Merc 230"},{"mpg":19.2,"cyl":6,"disp":167.6,"hp":123,"drat":3.92,"wt":3.44,"qsec":18.3,"vs":1,"am":0,"gear":4,"carb":4,"_row":"Merc 280"},{"mpg":17.8,"cyl":6,"disp":167.6,"hp":123,"drat":3.92,"wt":3.44,"qsec":18.9,"vs":1,"am":0,"gear":4,"carb":4,"_row":"Merc 280C"},{"mpg":16.4,"cyl":8,"disp":275.8,"hp":180,"drat":3.07,"wt":4.07,"qsec":17.4,"vs":0,"am":0,"gear":3,"carb":3,"_row":"Merc 450SE"},{"mpg":17.3,"cyl":8,"disp":275.8,"hp":180,"drat":3.07,"wt":3.73,"qsec":17.6,"vs":0,"am":0,"gear":3,"carb":3,"_row":"Merc 450SL"},{"mpg":15.2,"cyl":8,"disp":275.8,"hp":180,"drat":3.07,"wt":3.78,"qsec":18,"vs":0,"am":0,"gear":3,"carb":3,"_row":"Merc 450SLC"},{"mpg":10.4,"cyl":8,"disp":472,"hp":205,"drat":2.93,"wt":5.25,"qsec":17.98,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Cadillac 
Fleetwood"},{"mpg":10.4,"cyl":8,"disp":460,"hp":215,"drat":3,"wt":5.424,"qsec":17.82,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Lincoln Continental"},{"mpg":14.7,"cyl":8,"disp":440,"hp":230,"drat":3.23,"wt":5.345,"qsec":17.42,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Chrysler Imperial"},{"mpg":32.4,"cyl":4,"disp":78.7,"hp":66,"drat":4.08,"wt":2.2,"qsec":19.47,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Fiat 128"},{"mpg":30.4,"cyl":4,"disp":75.7,"hp":52,"drat":4.93,"wt":1.615,"qsec":18.52,"vs":1,"am":1,"gear":4,"carb":2,"_row":"Honda Civic"},{"mpg":33.9,"cyl":4,"disp":71.1,"hp":65,"drat":4.22,"wt":1.835,"qsec":19.9,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Toyota Corolla"},{"mpg":21.5,"cyl":4,"disp":120.1,"hp":97,"drat":3.7,"wt":2.465,"qsec":20.01,"vs":1,"am":0,"gear":3,"carb":1,"_row":"Toyota Corona"},{"mpg":15.5,"cyl":8,"disp":318,"hp":150,"drat":2.76,"wt":3.52,"qsec":16.87,"vs":0,"am":0,"gear":3,"carb":2,"_row":"Dodge Challenger"},{"mpg":15.2,"cyl":8,"disp":304,"hp":150,"drat":3.15,"wt":3.435,"qsec":17.3,"vs":0,"am":0,"gear":3,"carb":2,"_row":"AMC Javelin"},{"mpg":13.3,"cyl":8,"disp":350,"hp":245,"drat":3.73,"wt":3.84,"qsec":15.41,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Camaro Z28"},{"mpg":19.2,"cyl":8,"disp":400,"hp":175,"drat":3.08,"wt":3.845,"qsec":17.05,"vs":0,"am":0,"gear":3,"carb":2,"_row":"Pontiac Firebird"},{"mpg":27.3,"cyl":4,"disp":79,"hp":66,"drat":4.08,"wt":1.935,"qsec":18.9,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Fiat X1-9"},{"mpg":26,"cyl":4,"disp":120.3,"hp":91,"drat":4.43,"wt":2.14,"qsec":16.7,"vs":0,"am":1,"gear":5,"carb":2,"_row":"Porsche 914-2"},{"mpg":30.4,"cyl":4,"disp":95.1,"hp":113,"drat":3.77,"wt":1.513,"qsec":16.9,"vs":1,"am":1,"gear":5,"carb":2,"_row":"Lotus Europa"},{"mpg":15.8,"cyl":8,"disp":351,"hp":264,"drat":4.22,"wt":3.17,"qsec":14.5,"vs":0,"am":1,"gear":5,"carb":4,"_row":"Ford Pantera L"},{"mpg":19.7,"cyl":6,"disp":145,"hp":175,"drat":3.62,"wt":2.77,"qsec":15.5,"vs":0,"am":1,"gear":5,"carb":6,"_row":"Ferrari 
Dino"},{"mpg":15,"cyl":8,"disp":301,"hp":335,"drat":3.54,"wt":3.57,"qsec":14.6,"vs":0,"am":1,"gear":5,"carb":8,"_row":"Maserati Bora"},{"mpg":21.4,"cyl":4,"disp":121,"hp":109,"drat":4.11,"wt":2.78,"qsec":18.6,"vs":1,"am":1,"gear":4,"carb":2,"_row":"Volvo 142E"}]

Statistics Packages, haven and foreign

R can import data files from SAS, Stata, and SPSS with the haven and foreign packages.

library(haven)
## Warning: package 'haven' was built under R version 3.4.4
sales <- read_sas("http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/sales.sas7bdat")
sugar <- read_dta("http://assets.datacamp.com/production/course_1478/datasets/trade.dta")
# Convert labeled values in Date column to dates
sugar$Date <- as.Date(as_factor(sugar$Date))
dat <- read_dta("http://assets.datacamp.com/production/course_1478/datasets/trade.dta")
library(foreign)
# foreign can read SAS transport (XPORT) files with read.xport(), but not sas7bdat files.
# load in the data and store it in the variable cars
cars <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/course/uva/mtcars.csv")
# print the first 6 rows of the dataset using the head() function
head(cars)
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb               car
## 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4         Mazda RX4
## 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4     Mazda RX4 Wag
## 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1        Datsun 710
## 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1    Hornet 4 Drive
## 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 Hornet Sportabout
## 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1           Valiant

Change the field separator for text files with the sep argument. Use sep = "\t" for tab.

# load in the dataset
cars <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/course/uva/mtcars_semicolon.csv", sep = ";")

# print the first 6 rows of the dataset
head(cars)
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
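
The tab case looks the same. This sketch reads from an inline string through the text argument so that it runs without a file; the two rows are copied from the mtcars output above.

```r
# Tab-separated text; "\t" is the escape sequence for a tab character.
tsv_text <- "mpg\tcyl\n21.0\t6\n22.8\t4"

# text = lets read.csv() parse a string instead of a file or URL.
cars_tab <- read.csv(text = tsv_text, sep = "\t")
cars_tab
```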

Get and set your working directory with getwd() and setwd().

getwd()
## [1] "C:/Users/mpfol/OneDrive/Documents/Data Analysis"
list.files()
##  [1] "Analyzing Survey Data in R.Rmd"     
##  [2] "Analyzing_Survey_Data_in_R.html"    
##  [3] "Cookbook for R.Rmd"                 
##  [4] "Cookbook_for_R.html"                
##  [5] "Cookbook_for_R.Rmd"                 
##  [6] "Cookbook_for_R_files"               
##  [7] "Coursework"                         
##  [8] "Data"                               
##  [9] "Data Analysis.docx"                 
## [10] "Data Analysis.xlsx"                 
## [11] "Data Visualization.docx"            
## [12] "Foundations of Inference.Rmd"       
## [13] "Foundations_of_Inference.html"      
## [14] "local_latitude.xls"                 
## [15] "Programs"                           
## [16] "rmarkdown-cheatsheet.pdf"           
## [17] "rsconnect"                          
## [18] "Statistical Analysis.docx"          
## [19] "Statistical Package Syntax (1).docx"
## [20] "Statistics Notes.docx"              
## [21] "Statistics v20170301.docx"
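
A short sketch of changing the working directory and then restoring it. It uses the session's temporary directory so nothing outside it is touched; the file name wd_example.txt is hypothetical.

```r
# Remember the current working directory so it can be restored.
old_wd <- getwd()

# Switch to the session's temporary directory and create a file there.
setwd(tempdir())
file.create("wd_example.txt")

# Relative paths now resolve against the new working directory.
found <- "wd_example.txt" %in% list.files()
found

# Clean up and switch back.
file.remove("wd_example.txt")
setwd(old_wd)
```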

Data Wrangling

Data Exploration

Data exploration starts with evaluating structure and characteristics using class() (it should be a data.frame), dim(), and names(). Create summaries with str() or glimpse(), and summary(). Run some initial visualizations for insight into distributions: histograms for univariate analysis, scatterplots for numeric-numeric bivariate analysis, and boxplots for numeric-factor bivariate analysis.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:gdata':
## 
##     combine, first, last
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Check structure
class(mtcars)
## [1] "data.frame"
dim(mtcars)
## [1] 32 11
names(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
# Initial summaries
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
glimpse(mtcars)  # Slightly cleaner version of str (requires dplyr).
## Observations: 32
## Variables: 11
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
hist(mtcars$mpg)

plot(mtcars$mpg, mtcars$qsec)

# View sample data
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
tail(mtcars)
##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2
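
The numeric-factor case is not shown above, so here is one possible sketch with base graphics. Treating cyl as the factor is a choice made for illustration.

```r
# Boxplot of a numeric variable (mpg) split by a factor (cyl).
# boxplot() also returns the plotted statistics invisibly.
bp <- boxplot(mpg ~ cyl, data = mtcars,
              xlab = "Number of cylinders", ylab = "Miles per gallon")

# Group sizes behind each box.
bp$n
```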

Tidying

In tidy data, each variable is a column, each observation is a row, and each type of observational unit is its own table. Use the tidyr package to tidy messy data.

library(tidyr)
## Warning: package 'tidyr' was built under R version 3.4.4
wide_df <- data.frame(Obs=c(1,2),
                      a=c(1,4),
                      b=c(2,5),
                      c=c(3,6),
                      year_mo=c("2010-05","2007-07"))
wide_df
##   Obs a b c year_mo
## 1   1 1 2 3 2010-05
## 2   2 4 5 6 2007-07
# Gather wide data into key-value pairs. Exclude Obs and year_mo
long_df <- gather(wide_df, my_key, my_val, -c(Obs,year_mo))
long_df
##   Obs year_mo my_key my_val
## 1   1 2010-05      a      1
## 2   2 2007-07      a      4
## 3   1 2010-05      b      2
## 4   2 2007-07      b      5
## 5   1 2010-05      c      3
## 6   2 2007-07      c      6
# The opposite of gather() is spread()
wide_df <- spread(long_df, my_key, my_val)
wide_df
##   Obs year_mo a b c
## 1   1 2010-05 1 2 3
## 2   2 2007-07 4 5 6
# Split a column using separate().
long_df_sep <- separate(long_df, col = year_mo, into = c("year","month"), sep = "-")
long_df_sep
##   Obs year month my_key my_val
## 1   1 2010    05      a      1
## 2   2 2007    07      a      4
## 3   1 2010    05      b      2
## 4   2 2007    07      b      5
## 5   1 2010    05      c      3
## 6   2 2007    07      c      6
# The opposite of separate() is unite()
long_df_uni <- unite(long_df_sep, year_mo, year, month, sep = "-")
long_df_uni
##   Obs year_mo my_key my_val
## 1   1 2010-05      a      1
## 2   2 2007-07      a      4
## 3   1 2010-05      b      2
## 4   2 2007-07      b      5
## 5   1 2010-05      c      3
## 6   2 2007-07      c      6
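
In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The sketch below repeats the reshaping above with the newer functions, assuming such a version is installed.

```r
library(tidyr)

wide_df <- data.frame(Obs = c(1, 2),
                      a = c(1, 4),
                      b = c(2, 5),
                      c = c(3, 6),
                      year_mo = c("2010-05", "2007-07"))

# Equivalent of gather(): stack columns a, b, c into key-value pairs.
long_df <- pivot_longer(wide_df, cols = c(a, b, c),
                        names_to = "my_key", values_to = "my_val")

# Equivalent of spread(): one column per key again.
wide_again <- pivot_wider(long_df, names_from = my_key, values_from = my_val)
wide_again
```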

Preparing for Analysis

Types of variables in R:

  • character
  • numeric, including NaN and Inf
  • integer, denoted with an L suffix, e.g. 123L
  • factor
  • logical, including NA

Coerce variables into data types with:

  • as.character()
  • as.numeric()
  • as.integer()
  • as.factor()
  • as.logical(), where 0 becomes FALSE and any other number becomes TRUE

Package lubridate coerces strings to dates. Its parsing function names are built from the letters y, m, d (year, month, day) and h, m, s (hour, minute, second). Unite several fields into one with unite(). Rearrange column order with select(). Change the type of multiple columns with mutate_at().
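
A brief sketch of these coercions. The input vectors are invented, and converting cyl and gear to factors is only an example of applying one function to several columns.

```r
library(dplyr)

# Strings that do not look like numbers become NA (with a warning).
nums <- suppressWarnings(as.numeric(c("1", "2", "three")))
nums

# Zero becomes FALSE; any other number becomes TRUE.
flags <- as.logical(c(0, 1, 2))
flags

# Convert several columns at once with mutate_at().
cars_fct <- mutate_at(mtcars, c("cyl", "gear"), as.factor)
str(cars_fct[, c("cyl", "gear")])
```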

Because the period (.) has special meaning in certain situations, use underscores (_) to separate words in variable names. Use all lowercase letters so that no one has to remember which letters are uppercase or lowercase.

Package lubridate manipulates dates. Round dates with round_date, floor_date, and ceiling_date. All three take a unit argument specifying the resolution of rounding: “second”, “minute”, “hour”, “day”, “week”, “month”, “bimonth”, “quarter”, “halfyear”, or “year”. Or, you can specify any multiple of those units, e.g. “5 years”, “3 minutes” etc.

library(lubridate)
## Warning: package 'lubridate' was built under R version 3.4.4
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday,
##     week, yday, year
## The following object is masked from 'package:base':
## 
##     date
# There are 3! = 6 date-parsing functions: ymd(), ydm(), mdy(), myd(), dmy(), dym().
# Parse datetimes by appending _h, _hm, or _hms, e.g. ymd_hms().
as.Date(ymd_hms("2005/10/23 14:40:00"))
## [1] "2005-10-23"
as.POSIXct(mdy("July 21, 2006"))
## [1] "2006-07-20 20:00:00 EDT"
ymd("2006-07-21")
## [1] "2006-07-21"
ymd("2006 Jul 21")
## [1] "2006-07-21"
mdy("July 21, 2006")
## [1] "2006-07-21"
hms("10:25:09")
## [1] "10H 25M 9S"
ymd_hms("2005/10/23 14:40:00")
## [1] "2005-10-23 14:40:00 UTC"
# If the date is in an unsupported order like dym_msh, use parse_date_time() with the orders argument to specify the order of the components.

# Combine date parts with make_date(year, month, day).
r_3_4_1 <- ymd_hms("2016-05-03 07:13:28 UTC")

# Date rounding
floor_date(r_3_4_1, unit = "day")
## [1] "2016-05-03 UTC"
round_date(r_3_4_1, unit = "5 minutes")
## [1] "2016-05-03 07:15:00 UTC"
ceiling_date(r_3_4_1, unit = "week")
## [1] "2016-05-08 UTC"
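
The two helpers mentioned in the comments above are not demonstrated, so here is a minimal sketch. The input values are made up, and the orders string "dmy HM" is just one example of spelling out the component order.

```r
library(lubridate)

# Build a date from separate numeric parts.
d <- make_date(year = 2006, month = 7, day = 21)
d

# Spell out the component order explicitly with the orders argument.
parsed <- parse_date_time("23/10/2005 14:40", orders = "dmy HM")
parsed
```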

Subtract dates with the - operator to get a difference in days, or get finer control with the base function difftime(t1, t2, units). The current datetime and date are available from now() and today().

date_landing <- mdy("July 20, 1969")
moment_step <- mdy_hms("July 20, 1969, 02:56:15", tz = "UTC")

difftime(today(), date_landing, units = "days")
## Time difference of 18075 days
difftime(now(), moment_step, units = "secs")
## Time difference of 1561709101 secs
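
Because today() and now() change from run to run, a sketch with two fixed dates may be easier to check. The second date (the Apollo 11 splashdown) is added here for illustration.

```r
library(lubridate)

date_landing <- mdy("July 20, 1969")
date_return <- mdy("July 24, 1969")

# The - operator returns a difftime in days.
date_return - date_landing

# difftime() lets you choose the units.
hours_away <- difftime(date_return, date_landing, units = "hours")
as.numeric(hours_away)
```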

Use timespans to add a fixed amount of time to dates. Distinguish periods (clock and calendar units as humans understand them) from durations (an exact number of seconds) to handle daylight saving time gracefully. Combine addition and multiplication with sequences to generate sequences of datetimes.

library(lubridate)
# Add a period of one week to mon_2pm
mon_2pm <- dmy_hm("27 Aug 2018 14:00")
mon_2pm + weeks(1)
## [1] "2018-09-03 14:00:00 UTC"
# Add a duration of 81 hours to tue_9am
tue_9am <- dmy_hm("28 Aug 2018 9:00")
tue_9am + dhours(81)
## [1] "2018-08-31 18:00:00 UTC"
# Here a period of five years is one day longer than a duration of five years, because the span contains a leap day.
today() - years(5)
## [1] "2014-01-14"
today() - dyears(5)
## [1] "2014-01-15"
# Create combined periods and durations.
eclipse_2017 <- ymd_hms("2017-08-21 18:26:40")
synodic <- ddays(29) + dhours(12) + dminutes(44) + dseconds(3)

# Create datetime for every two weeks for a year
today_8am <- today() + hours(8)
every_two_weeks <- 1:26 * weeks(2)
today_8am + every_two_weeks
##  [1] "2019-01-28 08:00:00 UTC" "2019-02-11 08:00:00 UTC"
##  [3] "2019-02-25 08:00:00 UTC" "2019-03-11 08:00:00 UTC"
##  [5] "2019-03-25 08:00:00 UTC" "2019-04-08 08:00:00 UTC"
##  [7] "2019-04-22 08:00:00 UTC" "2019-05-06 08:00:00 UTC"
##  [9] "2019-05-20 08:00:00 UTC" "2019-06-03 08:00:00 UTC"
## [11] "2019-06-17 08:00:00 UTC" "2019-07-01 08:00:00 UTC"
## [13] "2019-07-15 08:00:00 UTC" "2019-07-29 08:00:00 UTC"
## [15] "2019-08-12 08:00:00 UTC" "2019-08-26 08:00:00 UTC"
## [17] "2019-09-09 08:00:00 UTC" "2019-09-23 08:00:00 UTC"
## [19] "2019-10-07 08:00:00 UTC" "2019-10-21 08:00:00 UTC"
## [21] "2019-11-04 08:00:00 UTC" "2019-11-18 08:00:00 UTC"
## [23] "2019-12-02 08:00:00 UTC" "2019-12-16 08:00:00 UTC"
## [25] "2019-12-30 08:00:00 UTC" "2020-01-13 08:00:00 UTC"
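
The difference between periods and durations shows up most clearly across a daylight saving change. This sketch assumes the "America/New_York" time zone, where clocks moved forward on 11 March 2018.

```r
library(lubridate)

# 2 pm on the day before the clocks move forward.
start <- ymd_hms("2018-03-10 14:00:00", tz = "America/New_York")

# A period of one day keeps the clock time: 2 pm the next day.
next_day_period <- start + days(1)
hour(next_day_period)

# A duration of one day adds exactly 86400 seconds: 3 pm the next day.
next_day_duration <- start + ddays(1)
hour(next_day_duration)
```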

ymd("2018-01-31") + months(1) returns NA because February 31 does not exist. For situations like this, use the %m+% and %m-% operators, which roll back to the last valid day of the month.

library(lubridate)

# A sequence of 1 to 12 periods of 1 month
month_seq <- 1:12 * months(1)

# Add 1 to 12 months to jan_31.  This way returns NAs.
ymd("2018-01-31") + month_seq
##  [1] NA           "2018-03-31" NA           "2018-05-31" NA          
##  [6] "2018-07-31" "2018-08-31" NA           "2018-10-31" NA          
## [11] "2018-12-31" "2019-01-31"
# Better way.
ymd("2018-01-31") %m+% month_seq
##  [1] "2018-02-28" "2018-03-31" "2018-04-30" "2018-05-31" "2018-06-30"
##  [6] "2018-07-31" "2018-08-31" "2018-09-30" "2018-10-31" "2018-11-30"
## [11] "2018-12-31" "2019-01-31"

Intervals have a specific start and end time. There are two notations: datetime1 %--% datetime2, or interval(datetime1, datetime2).

# Two ways to create an interval.
dmy("5 January 1961") %--% dmy("30 January 1969")
## [1] 1961-01-05 UTC--1969-01-30 UTC
interval(dmy("5 January 1961"), dmy("30 January 1969"))
## [1] 1961-01-05 UTC--1969-01-30 UTC

Once you have an interval, you can find its start, end, and length (in seconds) with int_start(), int_end(), and int_length() respectively. Test whether a date falls within an interval with %within%, and whether two intervals overlap with int_overlaps().

my_intvl <- interval(dmy("5 January 1961"), dmy("30 January 1969"))
int_length(my_intvl)
## [1] 254620800
y2001 <- ymd("2001-01-01") %--% ymd("2001-12-31")
ymd("2001-03-30") %within% y2001
## [1] TRUE
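
The remaining interval helpers can be sketched the same way. The sixties interval is a hypothetical comparison interval introduced only for this example.

```r
library(lubridate)

my_intvl <- interval(dmy("5 January 1961"), dmy("30 January 1969"))
sixties  <- ymd("1960-01-01") %--% ymd("1969-12-31")  # hypothetical comparison interval

int_start(my_intvl)              # start of the interval
int_end(my_intvl)                # end of the interval
int_overlaps(my_intvl, sixties)  # TRUE: the two intervals share dates
```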

Convert an interval to a period or duration with as.period() and as.duration().

my_intvl <- interval(dmy("5 January 1961"), dmy("30 January 1969"))
as.period(my_intvl)
## [1] "8y 0m 25d 0H 0M 0S"
as.duration(my_intvl)
## [1] "254620800s (~8.07 years)"

Extract the time zone with tz(). Change the time zone while keeping the clock time with force_tz(dt, tzone=), or view the same instant in another time zone with with_tz(dt, tzone=). Get valid time zone names from OlsonNames().

game2 <- mdy_hm("June 11 2015 19:00")
game3 <- mdy_hm("June 15 2015 18:30")

# Set the timezones to "America/Edmonton" and "America/Winnipeg"
game2_local <- force_tz(game2, tzone = "America/Edmonton")
game3_local <- force_tz(game3, tzone = "America/Winnipeg")

# What time is game2_local in NZ?
with_tz(game2_local, tzone = "Pacific/Auckland")
## [1] "2015-06-12 13:00:00 NZST"
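
To make the contrast between force_tz() and with_tz() concrete, the sketch below applies both to the same datetime. The names forced and viewed are illustrative only.

```r
library(lubridate)

game2 <- mdy_hm("June 11 2015 19:00")
tz(game2)  # "UTC", the default for the parsing functions

# force_tz() keeps the clock time (19:00) but changes the instant.
forced <- force_tz(game2, tzone = "America/Edmonton")

# with_tz() keeps the instant but shows the local clock time (13:00).
viewed <- with_tz(game2, tzone = "America/Edmonton")

forced == game2  # FALSE
viewed == game2  # TRUE

# Valid time zone names come from OlsonNames().
"America/Edmonton" %in% OlsonNames()
```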

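stamp() is a convenient way to format a date. It returns a formatting function whose format string is inferred from an example you supply. An ambiguous example can match several formats, as the messages below show, and stamp() may then pick one you did not intend.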

stamp("09/20/2017")(today())
## Multiple formats matched: "%Om/%d/%y%H"(1), "%Om/%y/%d%H"(1), "%Om/%d/%Y"(1), "%m/%d/%y%H"(1), "%m/%y/%d%H"(1), "%m/%d/%Y"(1)
## Using: "%Om/%y/%d%H"
## [1] "01/19/1400"
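
One way to reduce the ambiguity is to restrict the candidate formats with the orders argument of stamp(). This is a sketch under the assumption that "mdy" is the intended order; the name mdy_stamp and the fixed date are illustrative only.

```r
library(lubridate)

# orders = "mdy" restricts the guess to month/day/year formats.
mdy_stamp <- stamp("09/20/2017", orders = "mdy")

# Apply the returned formatting function to a fixed date.
mdy_stamp(ymd("2019-01-14"))
```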

Package stringr manipulates strings.

library(stringr)
# trim whitespace.
str_trim("  this is a test  ")
## [1] "this is a test"
# pad string with zeros.
str_pad("2493", width = 7, side = "left", pad = "0")
## [1] "0002493"
# find pattern Alice
str_detect(c("Sarah", "Alice", "Tom"), "Alice")
## [1] FALSE  TRUE FALSE
# replace pattern Alice with Jeff
str_replace(c("Sarah", "Alice", "Tom"), "Alice", "Jeff")
## [1] "Sarah" "Jeff"  "Tom"
# Change case (base R functions; stringr offers str_to_upper() and str_to_lower())
toupper("DataCamp")
## [1] "DATACAMP"
tolower("DataCamp")
## [1] "datacamp"

Use is.na() to locate missing values (NA).

# 4x4 data frame with a few NAs.
df <- data.frame(A = c(1, NA, 8, NA),
                 B = c(3, NA, 88, 23), 
                 C = c(2, 45, 3, 1),
                 D = c("A", "", "C", "D"))
# Any NAs?
any(is.na(df))
## [1] TRUE
# locate the NAs.
is.na(df)
##          A     B     C     D
## [1,] FALSE FALSE FALSE FALSE
## [2,]  TRUE  TRUE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE
## [4,]  TRUE FALSE FALSE FALSE
# How many?
sum(is.na(df))
## [1] 3
# Summarize the NAs
summary(df)
##        A              B              C         D    
##  Min.   :1.00   Min.   : 3.0   Min.   : 1.00    :1  
##  1st Qu.:2.75   1st Qu.:13.0   1st Qu.: 1.75   A:1  
##  Median :4.50   Median :23.0   Median : 2.50   C:1  
##  Mean   :4.50   Mean   :38.0   Mean   :12.75   D:1  
##  3rd Qu.:6.25   3rd Qu.:55.5   3rd Qu.:13.50        
##  Max.   :8.00   Max.   :88.0   Max.   :45.00        
##  NA's   :2      NA's   :1
# Rows with no missing values, two ways
df[complete.cases(df),]
##   A  B C D
## 1 1  3 2 A
## 3 8 88 3 C
na.omit(df)
##   A  B C D
## 1 1  3 2 A
## 3 8 88 3 C
# Replace empty strings with NA
df$D[df$D == ""] <- NA
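
Two further base R idioms are often useful here: counting NAs per column and listing their positions. The sketch rebuilds the same df as above, with the empty string in column D still in place; na_counts is an illustrative name.

```r
df <- data.frame(A = c(1, NA, 8, NA),
                 B = c(3, NA, 88, 23),
                 C = c(2, 45, 3, 1),
                 D = c("A", "", "C", "D"))

# Count the NAs in each column.
na_counts <- colSums(is.na(df))
na_counts

# Row and column position of each NA.
na_positions <- which(is.na(df), arr.ind = TRUE)
na_positions
```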

df2 <- data.frame(A = rnorm(100,50,10),
                  B = c(rnorm(99,50,10), 500),
                  C = c(rnorm(99,50,10), -1))
# Find outliers using hist() or boxplot().
hist(df2$B)

boxplot(df2)

# Drop or replace outliers.  Use which() to find index of offending observation.
mymtcars <- mtcars
ind <- which(mymtcars$mpg == 15.0)
mymtcars$mpg[ind] <- 20.0
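
Outliers can also be flagged programmatically rather than by eye. The sketch below uses base R boxplot.stats(), which applies the same 1.5 x IQR rule that boxplot() draws; the vector x is hypothetical data made up for the example.

```r
# Hypothetical measurements with one extreme value.
x <- c(48, 52, 50, 49, 51, 500)

# boxplot.stats() returns the points beyond 1.5 times the IQR from the hinges.
out <- boxplot.stats(x)$out
ind <- which(x %in% out)

# Replace the flagged value with NA rather than dropping the observation.
x_clean <- x
x_clean[ind] <- NA
```

Whether to drop, replace, or keep a flagged value depends on the analysis; the rule only identifies candidates.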

3. Data Wrangling

3.1 dplyr

The dplyr package provides data wrangling tools. dplyr works with the tibble, a data frame constrained to display well in an R session. The tibble class inherits from the data frame class. Convert a data frame to a tibble with as_tibble(data.frame); the older tbl_df() is deprecated. glimpse(tbl) works with tibbles the way str(data.frame) works with data frames. Convert a tibble back to a data frame with as.data.frame(tbl).

library(dplyr)

# hflights is a data.frame of Houston based flights.
library(hflights)
hflights <- as_tibble(hflights)
head(hflights)
## # A tibble: 6 x 21
##    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##   <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
## 1  2011     1          1         6    1400    1500 AA                  428
## 2  2011     1          2         7    1401    1501 AA                  428
## 3  2011     1          3         1    1352    1502 AA                  428
## 4  2011     1          4         2    1403    1513 AA                  428
## 5  2011     1          5         3    1405    1507 AA                  428
## 6  2011     1          6         4    1359    1503 AA                  428
## # ... with 13 more variables: TailNum <chr>, ActualElapsedTime <int>,
## #   AirTime <int>, ArrDelay <int>, DepDelay <int>, Origin <chr>,
## #   Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## #   Cancelled <int>, CancellationCode <chr>, Diverted <int>
summary(hflights)
##       Year          Month          DayofMonth      DayOfWeek    
##  Min.   :2011   Min.   : 1.000   Min.   : 1.00   Min.   :1.000  
##  1st Qu.:2011   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:2.000  
##  Median :2011   Median : 7.000   Median :16.00   Median :4.000  
##  Mean   :2011   Mean   : 6.514   Mean   :15.74   Mean   :3.948  
##  3rd Qu.:2011   3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:6.000  
##  Max.   :2011   Max.   :12.000   Max.   :31.00   Max.   :7.000  
##                                                                 
##     DepTime        ArrTime     UniqueCarrier        FlightNum   
##  Min.   :   1   Min.   :   1   Length:227496      Min.   :   1  
##  1st Qu.:1021   1st Qu.:1215   Class :character   1st Qu.: 855  
##  Median :1416   Median :1617   Mode  :character   Median :1696  
##  Mean   :1396   Mean   :1578                      Mean   :1962  
##  3rd Qu.:1801   3rd Qu.:1953                      3rd Qu.:2755  
##  Max.   :2400   Max.   :2400                      Max.   :7290  
##  NA's   :2905   NA's   :3066                                    
##    TailNum          ActualElapsedTime    AirTime         ArrDelay      
##  Length:227496      Min.   : 34.0     Min.   : 11.0   Min.   :-70.000  
##  Class :character   1st Qu.: 77.0     1st Qu.: 58.0   1st Qu.: -8.000  
##  Mode  :character   Median :128.0     Median :107.0   Median :  0.000  
##                     Mean   :129.3     Mean   :108.1   Mean   :  7.094  
##                     3rd Qu.:165.0     3rd Qu.:141.0   3rd Qu.: 11.000  
##                     Max.   :575.0     Max.   :549.0   Max.   :978.000  
##                     NA's   :3622      NA's   :3622    NA's   :3622     
##     DepDelay          Origin              Dest              Distance     
##  Min.   :-33.000   Length:227496      Length:227496      Min.   :  79.0  
##  1st Qu.: -3.000   Class :character   Class :character   1st Qu.: 376.0  
##  Median :  0.000   Mode  :character   Mode  :character   Median : 809.0  
##  Mean   :  9.445                                         Mean   : 787.8  
##  3rd Qu.:  9.000                                         3rd Qu.:1042.0  
##  Max.   :981.000                                         Max.   :3904.0  
##  NA's   :2905                                                            
##      TaxiIn           TaxiOut         Cancelled       CancellationCode  
##  Min.   :  1.000   Min.   :  1.00   Min.   :0.00000   Length:227496     
##  1st Qu.:  4.000   1st Qu.: 10.00   1st Qu.:0.00000   Class :character  
##  Median :  5.000   Median : 14.00   Median :0.00000   Mode  :character  
##  Mean   :  6.099   Mean   : 15.09   Mean   :0.01307                     
##  3rd Qu.:  7.000   3rd Qu.: 18.00   3rd Qu.:0.00000                     
##  Max.   :165.000   Max.   :163.00   Max.   :1.00000                     
##  NA's   :3066      NA's   :2947                                         
##     Diverted       
##  Min.   :0.000000  
##  1st Qu.:0.000000  
##  Median :0.000000  
##  Mean   :0.002853  
##  3rd Qu.:0.000000  
##  Max.   :1.000000  
## 
# hflights consists of 227,496 observations and 21 variables.
nrow(hflights)
## [1] 227496
ncol(hflights)
## [1] 21
# Create a lookup table for the UniqueCarrier column using a named vector.
lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue", "CO" = "Continental", 
         "DL" = "Delta", "OO" = "SkyWest", "UA" = "United", "US" = "US_Airways", 
         "WN" = "Southwest", "EV" = "Atlantic_Southeast", "F9" = "Frontier", 
         "FL" = "AirTran", "MQ" = "American_Eagle", "XE" = "ExpressJet", "YV" = "Mesa")
hflights$Carrier <- lut[hflights$UniqueCarrier]
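
A minimal sketch of the tibble round trip, using the built-in mtcars data so it runs without the hflights package. The names cars_tbl and cars_df are illustrative only.

```r
library(dplyr)

# Convert a built-in data frame to a tibble (row names are dropped).
cars_tbl <- as_tibble(mtcars)
class(cars_tbl)      # includes "tbl_df" and "data.frame"

# Compact column-by-column overview, similar to str().
glimpse(cars_tbl)

# Convert back to a plain data frame.
cars_df <- as.data.frame(cars_tbl)
class(cars_df)       # "data.frame"
```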

dplyr features five verbs.

* select(.data, ...) where ... are variables. Use : to select a range of variables and - to exclude variables, similar to indexing a data.frame with square brackets. Use variable names or integer indexes. Use helper functions starts_with(), ends_with(), contains(), matches(), num_range(), and one_of().
* filter(.data, one or more comparisons) returns the rows that satisfy the comparisons. Among the operators are ==, !=, and %in%. Combine comparisons with & and |.
* arrange(.data, ...) returns a sorted data set. Wrap an argument with desc() to sort in descending order, e.g., arrange(desc(gdpPerCap)).
* mutate(.data, name-value pairs of expressions) adds or changes columns in the data set.
* summarise(.data, ...) collapses rows into summary values. Base R includes several aggregate functions, and dplyr adds first(), last(), nth(), n(), and n_distinct().

Pipe a data set into a verb with %>%. group_by(.data, col(s)) defines groups for the verbs that follow; it is most often paired with summarize(), so specify group_by() before summarize().

dplyr uses %>% from the magrittr package.

library(dplyr)
library(hflights)
hflights <- as_tibble(hflights)
lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue", "CO" = "Continental", 
         "DL" = "Delta", "OO" = "SkyWest", "UA" = "United", "US" = "US_Airways", 
         "WN" = "Southwest", "EV" = "Atlantic_Southeast", "F9" = "Frontier", 
         "FL" = "AirTran", "MQ" = "American_Eagle", "XE" = "ExpressJet", "YV" = "Mesa")
hflights$Carrier <- lut[hflights$UniqueCarrier]
# select example
select(hflights, UniqueCarrier, ends_with("Num"), starts_with("Cancell"))
## # A tibble: 227,496 x 5
##    UniqueCarrier FlightNum TailNum Cancelled CancellationCode
##  * <chr>             <int> <chr>       <int> <chr>           
##  1 AA                  428 N576AA          0 ""              
##  2 AA                  428 N557AA          0 ""              
##  3 AA                  428 N541AA          0 ""              
##  4 AA                  428 N403AA          0 ""              
##  5 AA                  428 N492AA          0 ""              
##  6 AA                  428 N262AA          0 ""              
##  7 AA                  428 N493AA          0 ""              
##  8 AA                  428 N477AA          0 ""              
##  9 AA                  428 N476AA          0 ""              
## 10 AA                  428 N504AA          0 ""              
## # ... with 227,486 more rows
# mutate example
g1 <- mutate(hflights, ActualGroundTime = ActualElapsedTime - AirTime)
# pipeline example: mutate, filter, group_by, and summarize
hflights %>%
  mutate(RealTime = ActualElapsedTime + 100, mph = 60 * Distance / RealTime) %>%
  filter(!is.na(mph) & mph < 70) %>%
  group_by(UniqueCarrier) %>%
  summarize(n_less = n(), n_dest = n_distinct(Dest), min_dist = min(Distance), max_dist = max(Distance))
## # A tibble: 6 x 5
##   UniqueCarrier n_less n_dest min_dist max_dist
##   <chr>          <int>  <int>    <dbl>    <dbl>
## 1 AA                40      1     224.     224.
## 2 CO              3393      4     140.     305.
## 3 MQ                12      1     247.     247.
## 4 OO               349      3     140.     224.
## 5 WN              1747      4     148.     239.
## 6 XE              1185     12      79.     253.
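
The arrange() verb with desc() does not appear in the examples above, so here is a short sketch on the built-in mtcars data. The column names n and avg_mpg are chosen for illustration.

```r
library(dplyr)

# Average mpg per cylinder count, sorted from highest to lowest.
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(n = n(), avg_mpg = mean(mpg)) %>%
  arrange(desc(avg_mpg))

mpg_by_cyl
```

Four-cylinder cars come first because they have the highest average mpg in this data set.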

dplyr works for data frames, data tables, and databases.

Use dplyr to merge data instead of base R merge() because dplyr syntax is intuitive, preserves row order, and works with databases.

The four mutating joins are left_join(tbl1, tbl2, by = c(col_names)), right_join(), inner_join(), and full_join().

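The filtering join semi_join(tbl1, tbl2) keeps the rows of tbl1 that have a match in tbl2, like an inner join that returns no columns from the second table. The filtering join anti_join(tbl1, tbl2) keeps the rows of tbl1 that have no match in tbl2, like a left join filtered to rows where the right table is null.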
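
The difference between mutating and filtering joins is easiest to see on a toy example. The two tibbles below are hypothetical data made up for this sketch.

```r
library(dplyr)

# Hypothetical toy tables sharing the key column "name".
bands <- tibble(name = c("Mick", "John", "Paul"),
                band = c("Stones", "Beatles", "Beatles"))
instruments <- tibble(name  = c("John", "Paul", "Keith"),
                      plays = c("guitar", "bass", "guitar"))

# Mutating joins add columns from the second table.
left  <- left_join(bands, instruments, by = "name")   # 3 rows, plays is NA for Mick
inner <- inner_join(bands, instruments, by = "name")  # 2 rows: John and Paul
full  <- full_join(bands, instruments, by = "name")   # 4 rows: everyone

# Filtering joins only keep or drop rows of the first table.
semi <- semi_join(bands, instruments, by = "name")    # John and Paul, no plays column
anti <- anti_join(bands, instruments, by = "name")    # Mick only
```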

The set functions are union(), intersect(), and setdiff().

setequal(set1, set2) checks whether two data sets contain the same rows (not necessarily in the same order).

If two data sets have identical structure, combine them with bind_rows() or bind_cols(), the dplyr equivalents of base R rbind() and cbind().
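
A brief sketch of the set functions and bind_rows() on two hypothetical tibbles with the same columns:

```r
library(dplyr)

# Hypothetical tables with identical structure.
a <- tibble(id = 1:3, x = c("a", "b", "c"))
b <- tibble(id = 2:4, x = c("b", "c", "d"))

in_either <- union(a, b)      # 4 distinct rows
in_both   <- intersect(a, b)  # 2 rows: ids 2 and 3
only_in_a <- setdiff(a, b)    # 1 row: id 1

# Row order does not matter to setequal().
same_rows <- setequal(a, a[3:1, ])

# bind_rows() stacks the tables and keeps duplicates.
stacked <- bind_rows(a, b)    # 6 rows
```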

dplyr improves on the base R function data.frame() with data_frame(). data_frame() will not change data types, add row names, change column names, or recycle vectors longer than length one. Function as_data_frame() parallels the behavior of data_frame(): it combines a list of vectors into a data frame. It is the column equivalent of bind_rows(), which combines data frames. In current versions of the tibble package, data_frame() and as_data_frame() are deprecated in favor of tibble() and as_tibble().

library(Lahman)
library(dplyr)

players <- Master %>% 
  distinct(playerID, nameFirst, nameLast)

players %>%
  # Find unsalaried players
  anti_join(Salaries, by = "playerID") %>% 
  # Join Batting to the unsalaried players
  left_join(Batting, by = "playerID") %>% 
  # Group by player
  group_by(playerID) %>% 
  # Sum at-bats for each player
  summarise(total_at_bat = sum(AB, na.rm = TRUE)) %>% 
  # Arrange in descending order
  arrange(desc(total_at_bat))
## # A tibble: 13,958 x 2
##    playerID  total_at_bat
##    <chr>            <int>
##  1 aaronha01        12364
##  2 yastrca01        11988
##  3 cobbty01         11434
##  4 musiast01        10972
##  5 mayswi01         10881
##  6 robinbr01        10654
##  7 wagneho01        10430
##  8 brocklo01        10332
##  9 ansonca01        10277
## 10 aparilu01        10230
## # ... with 13,948 more rows
library(Lahman)
library(dplyr)

# Find the distinct players that appear in HallOfFame
nominated <- HallOfFame %>% 
  distinct(playerID)

nominated %>% 
  # Count the number of players in nominated
  count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1  1260
# 1,260 players were nominated for the hall of fame.

nominated_full <- nominated %>% 
  # Join to Master
  left_join(Master, by = "playerID") %>% 
  # Return playerID, nameFirst, nameLast
  select(playerID, nameFirst, nameLast)

# Find distinct players in HallOfFame with inducted == "Y"
inducted <- HallOfFame %>% 
  filter(inducted == "Y") %>% 
  distinct(playerID)

inducted %>% 
  # Count the number of players in inducted
  count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1   317
# 317 players have been inducted.

inducted_full <- inducted %>% 
  # Join to Master
  left_join(Master, by = "playerID") %>% 
  # Return playerID, nameFirst, nameLast
  select(playerID, nameFirst, nameLast)


# Tally the number of awards in AwardsPlayers by playerID
nAwards <- AwardsPlayers %>% 
  group_by(playerID) %>% 
  tally()

nAwards %>% 
  # Filter to just the players in inducted 
  semi_join(inducted, by = "playerID") %>% 
  # Calculate the mean number of awards per player
  summarize(avg_n = mean(n, na.rm = TRUE))
## # A tibble: 1 x 1
##   avg_n
##   <dbl>
## 1  12.1
nAwards %>% 
  # Filter to just the players in nominated 
  semi_join(nominated, by = "playerID") %>% 
  # Filter to players NOT in inducted 
  anti_join(inducted, by = "playerID") %>% 
  # Calculate the mean number of awards per player
  summarize(avg_n = mean(n, na.rm = TRUE))
## # A tibble: 1 x 1
##   avg_n
##   <dbl>
## 1  4.23
# On average, inductees had about 12.1 - 4.23 = 7.9 more awards than non-inductees.


# Find the players who are in nominated, but not inducted
notInducted <- nominated %>% 
  setdiff(inducted)

Salaries %>% 
  # Find the players who are in notInducted
  semi_join(notInducted, by = "playerID") %>% 
  # Calculate the max salary by player
  group_by(playerID) %>%
  summarize(max_salary = max(salary, na.rm = TRUE)) %>% 
  # Calculate the average of the max salaries
  summarize(avg_salary = mean(max_salary, na.rm = TRUE)) 
## # A tibble: 1 x 1
##   avg_salary
##        <dbl>
## 1   5230273.
# Repeat for players who were inducted
Salaries %>% 
  semi_join(inducted, by = "playerID") %>% 
  group_by(playerID) %>%
  summarize(max_salary = max(salary, na.rm = TRUE)) %>% 
  summarize(avg_salary = mean(max_salary, na.rm = TRUE))
## # A tibble: 1 x 1
##   avg_salary
##        <dbl>
## 1   6092038.
Appearances %>% 
  # Filter Appearances against nominated
  semi_join(nominated, by = "playerID") %>% 
  # Find last year played by player
  group_by(playerID) %>% 
  summarize(last_year = max(yearID)) %>% 
  # Join to full HallOfFame
  left_join(HallOfFame, by = "playerID") %>% 
  # Filter for unusual observations: nominated fewer than 5 years after the last appearance
  filter((yearID - last_year) < 5)
## # A tibble: 194 x 10
##    playerID  last_year yearID votedBy ballots needed votes inducted
##    <chr>         <dbl>  <int> <chr>     <int>  <int> <int> <fct>   
##  1 altroni01     1933.   1937 BBWAA       201    151     3 N       
##  2 applilu01     1950.   1953 BBWAA       264    198     2 N       
##  3 bartedi01     1946.   1948 BBWAA       121     91     1 N       
##  4 beckro01      2004.   2008 BBWAA       543    408     2 N       
##  5 boudrlo01     1952.   1956 BBWAA       193    145     2 N       
##  6 camildo01     1945.   1948 BBWAA       121     91     1 N       
##  7 chandsp01     1947.   1950 BBWAA       168    126     2 N       
##  8 chandsp01     1947.   1951 BBWAA       226    170     1 N       
##  9 chapmbe01     1946.   1949 BBWAA       153    115     1 N       
## 10 cissebi01     1938.   1937 BBWAA       201    151     1 N       
## # ... with 184 more rows, and 2 more variables: category <fct>,
## #   needed_note <chr>

Data Visualization

Data visualization serves two purposes: exploratory analysis (investigating the data) and explanatory analysis (communicating findings).

There are seven grammatical layers of plots; three are required: data, aesthetics, and geometries. The other elements are facets (subplots), statistics (e.g., fitted lines), coordinates, and themes. The grammar of graphics is implemented in the ggplot2 package.
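
As a rough sketch, the layers map onto ggplot2 calls as follows. The built-in mtcars data, the axis limits, and the theme are arbitrary choices made for illustration.

```r
library(ggplot2)

p_layers <- ggplot(mtcars, aes(x = wt, y = mpg)) +  # data and aesthetics
  geom_point() +                                    # geometry
  facet_grid(. ~ cyl) +                             # facets: one panel per cyl value
  stat_smooth(method = "lm", se = FALSE) +          # statistics: fitted line
  coord_cartesian(ylim = c(10, 35)) +               # coordinates
  theme_minimal()                                   # theme

p_layers
```

Only the first two lines are required; each remaining line adds one optional layer.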

Base R provides plotting functionality, but it comes with limitations. The plot is drawn as an image, not returned as an object, so you cannot manipulate it further. It does not add a legend automatically. There is a separate function for each plot type, and the lack of a unified framework means you have to learn each one separately: plot(), points(), hist(), etc.

Scale the x axis with a scale_x_log10() layer. There are two main reasons to use logarithmic scales in charts and graphs. The first is to respond to skewness towards large values; i.e., cases in which one or a few points are much larger than the bulk of the data. The second is to show percent change or multiplicative factors. On a log-scaled axis with base 2, the value of each tick mark is double the value of the preceding one. An example of a multiplicative factor is a constant percentage growth rate.
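
A minimal sketch of the first use case, assuming the msleep data set that ships with ggplot2, where a few very heavy animals compress the rest of the points on a linear axis:

```r
library(ggplot2)

# Body weight is heavily right-skewed, so plot it on a log10 scale.
p_log <- ggplot(msleep, aes(x = bodywt, y = sleep_total)) +
  geom_point() +
  scale_x_log10()

p_log
```

Dropping the scale_x_log10() line shows the same data on a linear axis for comparison.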

Scatterplots

For scatterplots, map x, y, color, and shape to variables in the aesthetic layer. Set fixed attributes such as size, fill, shape, alpha (transparency), and position (e.g., “jitter”) in the geom_point() layer.

mtcars$cyl <- as.factor(mtcars$cyl)

# Use base r to create plots with a series for each cyl value.
# Add a linear fit line through the points, one for each series, and one overall.
plot(mtcars$wt, mtcars$mpg, col = factor(mtcars$cyl))
abline(lm(mpg ~ wt, data = mtcars), lty = 2)
# Loop over the three cyl levels (not all 32 rows) so each line is drawn once.
cyl_levels <- levels(mtcars$cyl)
for (i in seq_along(cyl_levels)) {
  abline(lm(mpg ~ wt, mtcars, subset = (cyl == cyl_levels[i])), col = i)
}
legend(x = 5, y = 33, legend = levels(mtcars$cyl),
       col = 1:3, pch = 1, bty = "n")

# Again in ggplot2
# The first geom_smooth inherits the ggplot color aesthetic as its group.
# The second geom_smooth explicitly sets group to a dummy 1.  The col = "All" adds it to the legend.
# When mapping onto color you can sometimes treat a continuous scale, like year, as an ordinal variable, but only if it is a regular series. The better alternative is to leave it as a continuous variable and use the group aesthetic as a factor to make sure your plot is drawn correctly. 
library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg, col = cyl, group = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) + 
  geom_smooth(method = "lm", se = FALSE, linetype = 2, aes(group = 1, col = "All"))

ggplot can visualize four attributes at once with x, y, col, and facet_grid. Such graphing requires tidy data, which in turn requires thoughtful definitions of metrics. In the iris data set, if measuring length vs width, then those are separate variables (cols). If measuring length (or width) vs species, then species is a variable. If measuring length (or width) vs part of flower (petal vs sepal), then flower part is a variable. To look at all four together, then length and width are members of the measure variable (because length and width share units).

library(ggplot2)
library(tidyr)

iris.tidy <- iris %>%
  # gather(data, key, value, <cols>)
  # Transpose all cols to rows except the identifier cols (Species)
  # The former column name becomes a value in the key column.
  gather(key, Value, -Species) %>%
  # separate(data, col, into, sep)
  separate(col = key, into = c("Part", "Measure"), sep = "\\.")

# If we want to plot Length vs Width, then each should be a column.
iris$Flower <- 1:nrow(iris)
iris.wide <- iris %>%
  gather(key, value, -Species, -Flower) %>%
  separate(key, c("Part", "Measure"), "\\.") %>%
  spread(Measure, value)
ggplot(iris.wide, aes(x = Length, y = Width, color = Part)) +
  geom_jitter() +
  facet_grid(. ~ Species)

Typical aesthetics are x, y, colour, fill, size, alpha, linetype, label, and shape. Shapes 0:20 accept only the color aesthetic, and shapes 21:25 accept both color and fill.
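
For instance, shape 21 is a circle whose interior follows fill and whose outline follows color. The sketch below uses mtcars; the black outline, size, and alpha value are arbitrary choices.

```r
library(ggplot2)

# Fill is mapped to cyl; the outline color is set to a fixed value.
p_fill <- ggplot(mtcars, aes(x = wt, y = mpg, fill = factor(cyl))) +
  geom_point(shape = 21, color = "black", size = 4, alpha = 0.6)

p_fill
```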

One common technique to use with solid shapes is alpha blending (i.e. adding transparency). An alternative is to use hollow shapes.

library(ggplot2)

# Basic scatter plot: wt on x-axis and mpg on y-axis; map cyl to col
ggplot(mtcars, aes(x = wt, y = mpg, color = cyl)) +
  geom_point(size = 4)

# Hollow circles - an improvement
ggplot(mtcars, aes(x = wt, y = mpg, color = cyl)) +
  geom_point(size = 4, shape = 1)