Load libraries either with `library()` or `require()`.

# R Basics

String construction.

``````dog <- "Chester"
print(paste("you are a dog", dog))``````
``## [1] "you are a dog Chester"``
``nchar(dog)``
``## [1] 7``

### Vectors

Create a vector with the combine function `c()`. Reference vector elements by position in brackets or by element name. R compares vectors element-wise. If you compare a vector to a single value, R recycles the value into a vector of matching length.
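As a minimal sketch of element-wise comparison and recycling (the vector `scores` is a made-up example, not part of the original notes):

```r
# Hypothetical example vector.
scores <- c(3, 8, 5)

# The single value 4 is recycled to c(4, 4, 4) before the comparison.
above_four <- scores > 4
above_four

# Two vectors of equal length are compared position by position.
same <- scores == c(3, 0, 5)
same
```

Both results are logical vectors with one entry per element of `scores`.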

There are two types of vectors in R: atomic vectors and lists. Atomic vectors are homogeneous and have one of six types: logical, integer, double, character, complex, and raw (don’t worry about the relatively uncommon complex and raw types). Lists are recursive vectors (they can contain other lists).

Vectors have two key properties: type (`typeof()`) and length (`length()`). Subset a list with single brackets and extract elements with double brackets. For example,

``````a <- list(
a = 1:3,
b = "a string",
c = pi,
d = list(-1, -5)
)
# A one-element list holding d.
typeof(a[4])``````
``## [1] "list"``
``````# The list d itself, with its two elements.
typeof(a[[4]])``````
``## [1] "list"``
``````# A one-element list holding the first element of d.
typeof(a[[4]][1])``````
``## [1] "list"``
``````# The first value of list d
typeof(a[[4]][[1]])``````
``## [1] "double"``
``````numeric_vector <- c(1, 10, 49)
character_vector <- c("a", "b", "c")
boolean_vector <- c(TRUE, FALSE, TRUE)
character_vector[1]``````
``## [1] "a"``
``boolean_vector[c(2,3)]``
``## [1] FALSE  TRUE``
``boolean_vector[2:3]``
``## [1] FALSE  TRUE``
``````roulette_vector <- c(-24, -50, 100, -350, 10)
names(roulette_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
roulette_vector[1]``````
``````## Monday
##    -24``````
``roulette_vector["Monday"]``
``````## Monday
##    -24``````
``````# vector operations
sum(roulette_vector)``````
``## [1] -314``
``mean(roulette_vector)``
``## [1] -62.8``
``````# take a subset of a vector using booleans
roulette_vector[roulette_vector>0]``````
``````## Wednesday    Friday
##       100        10``````

### Matrix

A matrix is a two-dimensional collection of elements. Create a matrix with the `matrix(data, nrow, ncol, byrow)` function. Label the rows with `rownames()` and the columns with `colnames()`. Sum each row and column into vectors with `rowSums()` and `colSums()`. Bind rows and columns to a matrix with `rbind()` and `cbind()`. Reference matrix items with brackets [row, col].

``````# Matrix of numbers 1:20, filling one row at a time, for 5 rows and 4 columns.  Specifying the number of columns is optional if number of rows is specified.
m <- matrix(1:20, byrow = TRUE, nrow = 5, ncol = 4)
rownames(m) <- c("row 1", "row 2", "row 3", "row 4", "row 5")
colnames(m) <- c("Col 1", "col 2", "col 3", "col 4")
m``````
``````##       Col 1 col 2 col 3 col 4
## row 1     1     2     3     4
## row 2     5     6     7     8
## row 3     9    10    11    12
## row 4    13    14    15    16
## row 5    17    18    19    20``````
``````# Bind row sums to matrix.
m.rowSum <- rowSums(m)
cbind(m, m.rowSum)``````
``````##       Col 1 col 2 col 3 col 4 m.rowSum
## row 1     1     2     3     4       10
## row 2     5     6     7     8       26
## row 3     9    10    11    12       42
## row 4    13    14    15    16       58
## row 5    17    18    19    20       74``````
``````# All rows of the second column of m.
m[,2]``````
``````## row 1 row 2 row 3 row 4 row 5
##     2     6    10    14    18``````

Use `nrow()` and `ncol()` to determine the number of rows and columns.

``````for (i in 1:nrow(m)) {
for (j in 1:ncol(m)) {
print(paste("On row ", i, " and column ", j, " the matrix contains ", m[i,j]))
}
}``````
``````## [1] "On row  1  and column  1  the matrix contains  1"
## [1] "On row  1  and column  2  the matrix contains  2"
## [1] "On row  1  and column  3  the matrix contains  3"
## [1] "On row  1  and column  4  the matrix contains  4"
## [1] "On row  2  and column  1  the matrix contains  5"
## [1] "On row  2  and column  2  the matrix contains  6"
## [1] "On row  2  and column  3  the matrix contains  7"
## [1] "On row  2  and column  4  the matrix contains  8"
## [1] "On row  3  and column  1  the matrix contains  9"
## [1] "On row  3  and column  2  the matrix contains  10"
## [1] "On row  3  and column  3  the matrix contains  11"
## [1] "On row  3  and column  4  the matrix contains  12"
## [1] "On row  4  and column  1  the matrix contains  13"
## [1] "On row  4  and column  2  the matrix contains  14"
## [1] "On row  4  and column  3  the matrix contains  15"
## [1] "On row  4  and column  4  the matrix contains  16"
## [1] "On row  5  and column  1  the matrix contains  17"
## [1] "On row  5  and column  2  the matrix contains  18"
## [1] "On row  5  and column  3  the matrix contains  19"
## [1] "On row  5  and column  4  the matrix contains  20"``````

### Factors

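The `factor()` function converts a vector to a factor, which tells R to treat the variable as categorical rather than continuous. To create an ordinal categorical variable, set `ordered = TRUE` (the example below writes `order = TRUE`, which R resolves by partial argument matching) and supply `levels` in ascending order.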

``````student_status <- c("student", "not student", "student", "not student")
categorical_student <- factor(student_status)
categorical_student``````
``````## [1] student     not student student     not student
## Levels: not student student``````
``````temperature_vector <- c("High", "Low", "High","Low", "Medium")
factor_temperature_vector <- factor(temperature_vector, order = TRUE, levels = c("Low", "Medium", "High"))
temperature_vector``````
``## [1] "High"   "Low"    "High"   "Low"    "Medium"``
``````# Character strings compare alphabetically; ordered factors compare by level.
temperature_vector[1] > temperature_vector[2]``````
``## [1] FALSE``
``factor_temperature_vector[1] > factor_temperature_vector[2]``
``## [1] TRUE``
``````# Change the level names with the levels function.  Note the levels are initially in alphabetical order.
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector) <- c("Female", "Male")

# Notice how summary treats a factor variable different from a regular variable.
summary(survey_vector)``````
``````##    Length     Class      Mode
##         5 character character``````
``summary(factor_survey_vector)``
``````## Female   Male
##      2      3``````

### Data Frames

A data frame is like a matrix, except each column can be a different data type. Several functions inspect data frames.

• `head()` (`tail()`): by default prints the first (last) 6 rows of the data frame.
• `str()`: prints the structure of the data frame. Probably the first function you’ll call with a new data set.
• `dim()`: prints the dimensions of the data frame.
• `colnames()`: prints the column names of the data frame.
• `na.omit()`: removes rows with NA in any column.
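`na.omit()` is the one function in this list that the examples below do not show. A minimal sketch, using a small made-up data frame (`df` and its columns are hypothetical):

```r
# Hypothetical data frame with missing values in two different columns.
df <- data.frame(
  id = 1:4,
  score = c(10, NA, 7, 3),
  grp = c("a", "b", NA, "b")
)

dim(df)   # 4 rows, 3 columns

# Keep only the rows that have no NA in any column.
complete <- na.omit(df)
complete
```

Rows 2 and 3 each contain an `NA`, so only rows 1 and 4 survive.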

``mtcars``
``````##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2``````
``head(mtcars,6)``
``````##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1``````
``str(mtcars)``
``````## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...``````
``dim(mtcars)``
``## [1] 32 11``
``colnames(mtcars)``
``````##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"``````

Create a data frame with the `data.frame()` function.

``````planets <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
planets_df <- data.frame(planets, type, diameter, rotation, rings)
# Select the first 5 values of the diameter column.  The $ operator is a shortcut.
planets_df[1:5,"diameter"]``````
``## [1]  0.382  0.949  1.000  0.532 11.209``
``planets_df$diameter[1:5]``
``## [1]  0.382  0.949  1.000  0.532 11.209``

Use `subset()` to filter data frame rows by a condition (like a SQL WHERE clause). Use `order()` to sort the data frame (like ORDER BY).

``subset(planets_df, subset = diameter < 1)``
``````##   planets               type diameter rotation rings
## 1 Mercury Terrestrial planet    0.382    58.64 FALSE
## 2   Venus Terrestrial planet    0.949  -243.02 FALSE
## 4    Mars Terrestrial planet    0.532     1.03 FALSE``````
``planets_df[order(planets_df$diameter),]``
``````##   planets               type diameter rotation rings
## 1 Mercury Terrestrial planet    0.382    58.64 FALSE
## 4    Mars Terrestrial planet    0.532     1.03 FALSE
## 2   Venus Terrestrial planet    0.949  -243.02 FALSE
## 3   Earth Terrestrial planet    1.000     1.00 FALSE
## 8 Neptune          Gas giant    3.883     0.67  TRUE
## 7  Uranus          Gas giant    4.007    -0.72  TRUE
## 6  Saturn          Gas giant    9.449     0.43  TRUE
## 5 Jupiter          Gas giant   11.209     0.41  TRUE``````
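The same two functions also cover compound conditions, column selection, and descending sorts. A small sketch on a made-up data frame (`cities` and its columns are hypothetical, chosen only to keep the example self-contained):

```r
# Hypothetical data frame, not the planets data above.
cities <- data.frame(
  name = c("A", "B", "C", "D"),
  pop = c(50, 200, 120, 80),
  coastal = c(TRUE, FALSE, TRUE, TRUE)
)

# Rows where both conditions hold, keeping only two columns.
big_coastal <- subset(cities, subset = coastal & pop > 60, select = c(name, pop))
big_coastal

# Sort by population, largest first.
by_pop <- cities[order(cities$pop, decreasing = TRUE), ]
by_pop
```

`order()` returns row positions, which is why it is used inside the row slot of the brackets.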

### Lists

Construct a list of objects with `list()`. Name the list items either with “=” at creation, or using `names()`.

``````my_vector <- 1:10
my_matrix <- matrix(1:9, ncol = 3)
my_df <- mtcars[1:10,]

my_list <- list(my_vector, my_matrix, my_df)
names(my_list) <- c("vec", "mat", "df")
my_list``````
``````## $vec
##  [1]  1  2  3  4  5  6  7  8  9 10
##
## $mat
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
##
## $df
##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4``````
``````# or
my_list2 <- list(vec = my_vector, mat = my_matrix, df = my_df)
my_list2``````
``````## $vec
##  [1]  1  2  3  4  5  6  7  8  9 10
##
## $mat
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
##
## $df
##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4``````

Reference a list item by its component number in double brackets, by its name in double brackets, or by its name after a dollar sign.

``````my_vector <- 1:10
my_matrix <- matrix(1:9, ncol = 3)
my_df <- mtcars[1:10,]
my_list <- list(vec = my_vector, mat = my_matrix, df = my_df)

# Third col of second element of my_list (my_matrix)
my_list[[2]][,3]``````
``## [1] 7 8 9``
``my_list$mat[,3]``
``## [1] 7 8 9``

Append to a list with combine `c()`.

``````my_vector <- 1:10
my_matrix <- matrix(1:9, ncol = 3)
my_df <- mtcars[1:10,]
my_list <- list(vec = my_vector, mat = my_matrix, df = my_df)
# Wrap the data frame in list() so c() appends it as a single component.
my_list <- c(my_list, list(df2 = my_df))``````
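One detail worth checking when appending with `c()`: a data frame is itself a list, so `c()` splices in its columns unless the new element is wrapped in `list()`. A minimal sketch with made-up objects (`base_list` and `small_df` are hypothetical):

```r
base_list <- list(vec = 1:3)
small_df <- data.frame(x = 1:2, y = c("a", "b"))

# Passing the data frame directly splices its columns into the list.
flattened <- c(base_list, df2 = small_df)
names(flattened)

# Wrapping it in list() appends the data frame as one component.
appended <- c(base_list, list(df2 = small_df))
names(appended)
```

The first result has three components (`vec`, `df2.x`, `df2.y`); the second has two (`vec`, `df2`).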

## Intermediate R

### Conditionals

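Relational operators include `==`, `!=`, `<`, `>`, `<=`, and `>=`. Logical operators are `&`, `|`, and `!`, and they work element-wise on vectors. Be careful with `&&` and `||`: they expect single values. Older versions of R silently evaluated only the first element of each operand, and R 4.3.0 and later raise an error when an operand is longer than one. The control construct is `if () {} else if () {} else {}`.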

``````x <- 3
if (x %% 2 == 0) {
print("x is divisible by 2")
} else if (x %% 3 == 0) {
print("x is divisible by 3")
} else {
print("x is divisible by neither 2 nor 3")
}``````
``## [1] "x is divisible by 3"``
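To make the difference between the vectorized and the single-value logical operators concrete, here is a short sketch (the vector `x` and the messages are made up for illustration):

```r
x <- c(2, 9, 12)

# `&` and `|` work element by element and return a logical vector.
in_range <- x > 1 & x < 10
in_range

# `&&` and `||` take single values, which suits an if() condition.
if (length(x) > 0 && x[1] %% 2 == 0) {
  msg <- "first element is even"
} else {
  msg <- "first element is odd or missing"
}
msg

# Collapse a logical vector to a single value with any() or all().
any_in_range <- any(in_range)
all_in_range <- all(in_range)
```

Using `any()` or `all()` is the usual way to turn a vector comparison into a condition that `if ()` can accept.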

### Loops

The while loop is `while () {}`. Break out of the loop early with `if (condition) { break }`.

``````i <- 1
while (i <= 10) {
print(3 * i)
if ((3 * i) %% 8 == 0) {
break
}
i <- i + 1
}``````
``````## [1] 3
## [1] 6
## [1] 9
## [1] 12
## [1] 15
## [1] 18
## [1] 21
## [1] 24``````

The for loop is `for (var in seq) {expr}`. The `break` statement abandons the active loop. The `next` statement skips the rest of the statements in the current loop iteration.

``````linkedin <- c(16, 9, 13, 5, 2, 17, 14)

# Loop version 1
for(views in linkedin) {
print(views)
if (views > 10) {
break
} else if (views < 5) {
next
}
}``````
``## [1] 16``
``````# Loop version 2
for(i in 1:length(linkedin)) {
print(linkedin[i])
}``````
``````## [1] 16
## [1] 9
## [1] 13
## [1] 5
## [1] 2
## [1] 17
## [1] 14``````
``````# seq_along handles zero-length vectors and lists.
for (i in seq_along(linkedin)) {
print(linkedin[i])
}``````
``````## [1] 16
## [1] 9
## [1] 13
## [1] 5
## [1] 2
## [1] 17
## [1] 14``````
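The comment about zero-length vectors is easiest to see with an empty input. A minimal sketch (the counters are hypothetical names used only to record how often each loop body runs):

```r
empty <- numeric(0)

# 1:length(empty) is 1:0, which counts down: c(1, 0). The body runs twice.
count_colon <- 0
for (i in 1:length(empty)) {
  count_colon <- count_colon + 1
}

# seq_along(empty) is integer(0), so the body never runs.
count_seq <- 0
for (i in seq_along(empty)) {
  count_seq <- count_seq + 1
}

c(count_colon, count_seq)
```

This is why `seq_along()` is the safer default when the input might be empty.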

### Functions

Get help on a function with `help()` or `?`, and list its arguments with `args()`. Specify function arguments either by name or by position. Arguments that have default values in the documentation are optional.

``````#help(mean)
#?mean
args(mean)``````
``````## function (x, ...)
## NULL``````
``````grades <- c(8.5, 7, 9, 5.5, 6)
# Pass the argument by name, then by position.
mean(x = grades)``````
``## [1] 7.2``
``mean(grades)``
``## [1] 7.2``

Define a custom function with the `function()` keyword. The `return()` statement exits the function immediately with its value; it is optional, because a function returns its last evaluated expression. Set a default argument value with `=`.

``````multiply_a_b <- function(a, b = 1) {
return (a * b)
}
result <- multiply_a_b(a = 3, b = 7)``````
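A short sketch of how the default value and the two calling styles behave, redefining the same function without an explicit `return()` (the result variable names are hypothetical):

```r
multiply_a_b <- function(a, b = 1) {
  a * b   # the last evaluated expression is the return value
}

by_default <- multiply_a_b(3)           # b falls back to its default, 1
by_position <- multiply_a_b(3, 7)       # arguments matched in order
by_name <- multiply_a_b(b = 7, a = 3)   # named arguments can come in any order

c(by_default, by_position, by_name)
```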

Install a package with `install.packages()`. Packages are hosted on the Comprehensive R Archive Network (CRAN). List the currently attached packages with `search()`. R attaches seven packages to its search list by default. Attach more packages with `library()` or `require()`.

### The Apply Family

Function `lapply(X, FUN, ...)` applies a function to each element of a list or vector. `lapply()` always returns a list, so if `X` is a vector, convert the result back to a vector with `unlist()`. If the function requires additional arguments, pass them as additional arguments to `lapply()`. Functions can be named or anonymous, so if a function is used only once, define it inline within `lapply()`.

``lapply(list(1,2,3), function(x) { 3 * x })``
``````## [[1]]
## [1] 3
##
## [[2]]
## [1] 6
##
## [[3]]
## [1] 9``````

Function `sapply()` calls `lapply()` then converts the list to a one-dimensional array (vector) or two-dimensional array (matrix). If `sapply` cannot simplify because the resulting list contains vectors of varying lengths, then `sapply()` returns the same result as `lapply()`.

Function `vapply()` works like `sapply()` but takes a `FUN.VALUE` argument that declares the type and length of each result. That makes `vapply()` a safe alternative to `sapply()`.
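The three functions can be compared side by side on one input. This is a sketch with a made-up character vector (`words` and the result names are hypothetical):

```r
words <- c("cat", "moose", "kiwi")

# lapply() always returns a list.
lens_list <- lapply(words, nchar)

# sapply() simplifies the list to a named integer vector.
lens_s <- sapply(words, nchar)

# vapply() does the same but checks each result against FUN.VALUE.
lens_v <- vapply(words, nchar, FUN.VALUE = integer(1))

# A mismatched FUN.VALUE is an error rather than a silent surprise.
mismatch <- tryCatch(
  vapply(words, nchar, FUN.VALUE = character(1)),
  error = function(e) "type mismatch"
)

# When results differ in length, sapply() falls back to a list.
ragged <- sapply(1:3, seq_len)
is.list(ragged)
```

The fallback in the last step is the reason `vapply()` is preferred inside functions: its return type does not depend on the data.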

### `purrr` Package

The `purrr` package maps a function over a vector and returns a vector. `map()` returns a list; the typed variants `map_dbl()`, `map_lgl()`, `map_int()`, and `map_chr()` return atomic vectors. The `purrr` functions provide shortcuts for the `.f` argument, are more consistent than `lapply()` and `sapply()`, and handle iteration well.

``library(purrr)``
``## Warning: package 'purrr' was built under R version 3.4.4``
``````cyl <- split(mtcars, mtcars$cyl)
# Regress mpg ~ wt on each cylinder class
map(cyl, function(df) lm(mpg ~ wt, data = df))``````
``````## $`4`
##
## Call:
## lm(formula = mpg ~ wt, data = df)
##
## Coefficients:
## (Intercept)           wt
##      39.571       -5.647
##
##
## $`6`
##
## Call:
## lm(formula = mpg ~ wt, data = df)
##
## Coefficients:
## (Intercept)           wt
##       28.41        -2.78
##
##
## $`8`
##
## Call:
## lm(formula = mpg ~ wt, data = df)
##
## Coefficients:
## (Intercept)           wt
##      23.868       -2.192``````
``````# Same thing with shortcuts
models <- map(cyl, ~ lm(mpg ~ wt, data = .))
coefs <- map(models, coef)
map(coefs, "wt")``````
``````## $`4`
## [1] -5.647025
##
## $`6`
## [1] -2.780106
##
## $`8`
## [1] -2.192438``````
``````# Or, using a single command with pipes.
mtcars %>%
split(mtcars$cyl) %>%
map(~ lm(mpg ~ wt, data = .)) %>%
map(coef) %>%
map_dbl("wt")``````
``````##         4         6         8
## -5.647025 -2.780106 -2.192438``````

The `safely()` function wraps another function so that each call returns a list with two elements, `result` and `error`. `possibly()` returns a default value on errors. `quietly()` captures all printed output, messages, and warnings instead of capturing errors.

``````safe_readLines <- safely(readLines)

# Call safe_readLines() on "http://example.org"
example_lines <- safe_readLines("http://example.org")
example_lines``````
``````## $result
## NULL
##
## $error
## NULL``````
``````# Call safe_readLines() on "http://asdfasdasdkfjlda"
nonsense_lines <- safe_readLines("http://asdfasdasdkfjlda")
nonsense_lines``````
``````## $result
## NULL
##
## $error
## NULL``````
``````n <- list(5, 10, 20)
mu <- list(1, 5, 10)
sd <- list(0.1, 1, 0.1)

# iterate over the lists
pmap(list(n, mu, sd), rnorm)``````
``````## [[1]]
## [1] 1.0380868 0.9605489 1.0786154 1.0073599 1.0234126
##
## [[2]]
##  [1] 4.343431 6.307386 3.939620 3.125216 7.622740 5.457172 5.548574
##  [8] 4.371869 4.627905 5.260454
##
## [[3]]
##  [1] 10.053020 10.053259 10.119406  9.824395  9.995872  9.749677  9.997900
##  [8] 10.128129 10.115909 10.197187 10.031033 10.080599  9.935449 10.055783
## [15] 10.083899  9.935934  9.781156 10.215975 10.060304 10.016733``````
``````funs <- list("rnorm", "runif", "rexp")

rnorm_params <- list(mean = 10)
runif_params <- list(min = 0, max = 5)
rexp_params <- list(rate = 5)
params <- list(
rnorm_params,
runif_params,
rexp_params
)

# Call invoke_map() on funs supplying params and setting n to 5
invoke_map(funs, params, n = 5)``````
``````## [[1]]
## [1]  9.657600 12.019679 10.136912 11.521788  9.658688
##
## [[2]]
## [1] 1.0613833 2.0008371 1.4973380 2.9227932 0.3804437
##
## [[3]]
## [1] 0.07188987 0.07739475 0.03476835 0.33302093 0.17282787``````

`walk()` operates just like `map()` except it’s designed for functions that don’t return anything. Use `walk()` for functions with side effects like printing, plotting or saving.

``#?walk2``

`stopifnot()` is a quick way to stop a function if a condition fails. `stopifnot()` takes logical expressions as arguments and raises an error if any of them is not `TRUE`.

``````x <- c(NA, NA, NA)
y <- c( 1, NA, NA, NA)

both_na <- function(x, y) {
stopifnot(length(x) == length(y))
sum(is.na(x) & is.na(y))
}
#both_na(x, y)``````

Use `stop()` instead of `stopifnot()` to specify a more informative error message.

``````x <- c(NA, NA, NA)
y <- c( 1, NA, NA, NA)

both_na <- function(x, y) {
if (length(x) != length(y)) {
stop("x and y must have the same length", call. = FALSE)
}
sum(is.na(x) & is.na(y))
}
#both_na(x, y)``````
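The calls above are commented out because they would halt the script. One way to see the error message without stopping execution is to wrap the call in `tryCatch()`. This is a sketch; the names `ok` and `msg` are hypothetical:

```r
both_na <- function(x, y) {
  if (length(x) != length(y)) {
    stop("x and y must have the same length", call. = FALSE)
  }
  sum(is.na(x) & is.na(y))
}

# Equal lengths: counts the positions where both vectors are NA.
ok <- both_na(c(NA, 1, NA), c(NA, NA, 2))
ok

# Unequal lengths: capture the error message instead of halting.
msg <- tryCatch(
  both_na(c(NA, NA, NA), c(1, NA, NA, NA)),
  error = function(e) conditionMessage(e)
)
msg
```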

### Useful Functions

R features a number of functions for manipulating data structures:

• `seq(from = 1, to = 2, by = .25)`: generates a sequence `from` 1 `to` 2 incremented `by` .25.
• `rep(x, times)`: replicates elements of vectors and lists.
• `sort(x)`: sorts a vector.
• `rev(x)`: reverses the elements of a data structure for which reversal is defined.
• `str(x)`: displays the structure of any R object `x`.
• `append(x, y)`: appends vector or list `y` to `x`.
• `is.*()`: checks the class of R object `x`.
• `as.*()`: casts R object `x` to another class.
• `unlist(x)`: flattens (possibly nested) lists to produce a vector.

``````myseq <- seq(8, 2, by=-2)
myseq``````
``## [1] 8 6 4 2``
``````myrep <- rep(myseq, times =2)
myrep``````
``## [1] 8 6 4 2 8 6 4 2``
``````myrep <- rep(myseq, each = 2)
myrep``````
``## [1] 8 8 6 6 4 4 2 2``
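``````linkedin <- list(16, 9, 13, 5, 2, 17, 14)
facebook <- list(17, 7, 5, 16, 8, 13, 14)
# Flatten the lists into vectors before combining and sorting.
li_vec <- unlist(linkedin)
fb_vec <- unlist(facebook)
social_vec <- append(li_vec, fb_vec)
sort(social_vec, decreasing = TRUE)``````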
``##  [1] 17 17 16 16 14 14 13 13  9  8  7  5  5  2``

Regular expression functions include `grepl()`, `grep()`, `sub()`, and `gsub()`. `grepl(pattern = "a", x = animals)` returns TRUE for each element of `x` matching the `pattern`. The regular expression `^a` matches strings that start with a; `a$` matches strings that end with a; `.*` means any character zero or more times; `\\s` means a whitespace character (written with a doubled backslash inside an R string); `[0-9]+` means digits 0 to 9 at least once. `grep(pattern = "a", x = animals)` returns the vector indices for each element of `x` matching the `pattern`. `sub(pattern = "a", replacement = "o", x = animals)` substitutes the first a with o. `gsub(pattern = "a", replacement = "o", x = animals)` substitutes all a’s with o’s.

``````animals <- c("cat", "moose", "impala", "ant", "kiwi")
grepl(pattern = "a", x = animals)``````
``## [1]  TRUE FALSE  TRUE  TRUE FALSE``
``which(grepl(pattern = "a", x = animals))``
``## [1] 1 3 4``
``grep(pattern = "a", x = animals)``
``## [1] 1 3 4``
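The substitution functions and the anchors are described above but not run. A brief sketch on the same `animals` vector (the result names are hypothetical):

```r
animals <- c("cat", "moose", "impala", "ant", "kiwi")

# sub() replaces only the first match in each element.
first_only <- sub(pattern = "a", replacement = "o", x = animals)
first_only

# gsub() replaces every match in each element.
all_matches <- gsub(pattern = "a", replacement = "o", x = animals)
all_matches

# Anchors: ^ ties the pattern to the start, $ to the end.
starts_with_a <- grepl(pattern = "^a", x = animals)
ends_with_a <- grepl(pattern = "a$", x = animals)
```

Only `"impala"` differs between the two substitutions, because it is the only element with more than one a.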

There are two date-time classes in R: `POSIXlt`, a list with named components, and `POSIXct`, the number of seconds since 1970-01-01 00:00:00 UTC. `POSIXct` is more amenable to data frames, so you will encounter it much more often. `Sys.Date()` returns a `Date` equal to today. `Sys.time()` returns the current time as `POSIXct`.

``as.Date("2018-10-16")``
``## [1] "2018-10-16"``
``as.POSIXct("2018-11-28 08:34:00")``
``## [1] "2018-11-28 08:34:00 EST"``
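A short sketch of what these classes allow once the values are parsed. The time zone is fixed to UTC here, which is an arbitrary choice made only so the results do not depend on the machine's locale; the variable names are hypothetical:

```r
# Dates are stored as days since 1970-01-01, so subtraction gives day counts.
start <- as.Date("2018-10-16")
end <- as.Date("2018-11-28")
days_between <- as.numeric(end - start)

# format() renders a date using strptime-style codes.
slashed <- format(start, "%Y/%m/%d")

# POSIXct counts seconds, so adding 3600 moves the time forward one hour.
stamp <- as.POSIXct("2018-11-28 08:34:00", tz = "UTC")
an_hour_later <- format(stamp + 3600, "%H:%M")

# POSIXlt exposes named components such as the hour.
hour_part <- as.POSIXlt(stamp)$hour
```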

# Importing Data

## RData

The simplest file to import is RData.

``````url_rdata <- "https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/wine.RData"
# Assumed step: load the objects stored in the file straight from the URL.
load(url(url_rdata))
summary(wine)``````
``````##     Alcohol        Malic acid        Ash        Alcalinity of ash
##  Min.   :11.03   Min.   :0.74   Min.   :1.360   Min.   :10.60
##  1st Qu.:12.36   1st Qu.:1.60   1st Qu.:2.210   1st Qu.:17.20
##  Median :13.05   Median :1.87   Median :2.360   Median :19.50
##  Mean   :12.99   Mean   :2.34   Mean   :2.366   Mean   :19.52
##  3rd Qu.:13.67   3rd Qu.:3.10   3rd Qu.:2.560   3rd Qu.:21.50
##  Max.   :14.83   Max.   :5.80   Max.   :3.230   Max.   :30.00
##    Magnesium      Total phenols     Flavanoids    Nonflavanoid phenols
##  Min.   : 70.00   Min.   :0.980   Min.   :0.340   Min.   :0.1300
##  1st Qu.: 88.00   1st Qu.:1.740   1st Qu.:1.200   1st Qu.:0.2700
##  Median : 98.00   Median :2.350   Median :2.130   Median :0.3400
##  Mean   : 99.59   Mean   :2.292   Mean   :2.023   Mean   :0.3623
##  3rd Qu.:107.00   3rd Qu.:2.800   3rd Qu.:2.860   3rd Qu.:0.4400
##  Max.   :162.00   Max.   :3.880   Max.   :5.080   Max.   :0.6600
##  Proanthocyanins Color intensity       Hue           Proline
##  Min.   :0.410   Min.   : 1.280   Min.   :1.270   Min.   : 278.0
##  1st Qu.:1.250   1st Qu.: 3.210   1st Qu.:1.930   1st Qu.: 500.0
##  Median :1.550   Median : 4.680   Median :2.780   Median : 672.0
##  Mean   :1.587   Mean   : 5.055   Mean   :2.604   Mean   : 745.1
##  3rd Qu.:1.950   3rd Qu.: 6.200   3rd Qu.:3.170   3rd Qu.: 985.0
##  Max.   :3.580   Max.   :13.000   Max.   :4.000   Max.   :1680.0``````
``````# or, equivalently,
summary(wine)``````
``````##     Alcohol        Malic acid        Ash        Alcalinity of ash
##  Min.   :11.03   Min.   :0.74   Min.   :1.360   Min.   :10.60
##  1st Qu.:12.36   1st Qu.:1.60   1st Qu.:2.210   1st Qu.:17.20
##  Median :13.05   Median :1.87   Median :2.360   Median :19.50
##  Mean   :12.99   Mean   :2.34   Mean   :2.366   Mean   :19.52
##  3rd Qu.:13.67   3rd Qu.:3.10   3rd Qu.:2.560   3rd Qu.:21.50
##  Max.   :14.83   Max.   :5.80   Max.   :3.230   Max.   :30.00
##    Magnesium      Total phenols     Flavanoids    Nonflavanoid phenols
##  Min.   : 70.00   Min.   :0.980   Min.   :0.340   Min.   :0.1300
##  1st Qu.: 88.00   1st Qu.:1.740   1st Qu.:1.200   1st Qu.:0.2700
##  Median : 98.00   Median :2.350   Median :2.130   Median :0.3400
##  Mean   : 99.59   Mean   :2.292   Mean   :2.023   Mean   :0.3623
##  3rd Qu.:107.00   3rd Qu.:2.800   3rd Qu.:2.860   3rd Qu.:0.4400
##  Max.   :162.00   Max.   :3.880   Max.   :5.080   Max.   :0.6600
##  Proanthocyanins Color intensity       Hue           Proline
##  Min.   :0.410   Min.   : 1.280   Min.   :1.270   Min.   : 278.0
##  1st Qu.:1.250   1st Qu.: 3.210   1st Qu.:1.930   1st Qu.: 500.0
##  Median :1.550   Median : 4.680   Median :2.780   Median : 672.0
##  Mean   :1.587   Mean   : 5.055   Mean   :2.604   Mean   : 745.1
##  3rd Qu.:1.950   3rd Qu.: 6.200   3rd Qu.:3.170   3rd Qu.: 985.0
##  Max.   :3.580   Max.   :13.000   Max.   :4.000   Max.   :1680.0``````

## Flat files

There are three common packages designed to load flat files: `utils`, which comes with base R, `readr`, and `data.table`.

### `utils`

The base R `utils` package includes flat file reading functions. `read.table()` is a generic flat file loading function. The wrapper function `read.csv()` reads comma-separated files, and `read.delim()` reads tab-delimited files.

• `stringsAsFactors = TRUE` treats string variables as categorical.
• `col.names = c()` overrides, or sets, column names.
• `colClasses = c()` sets data types. NULL elements in the vector drop the variable.
``````# Opt 1: set working dir to file location
# setwd("C:/Users/mpfol/OneDrive/Documents/Data Analysis/Data")
# Opt 2: define a file path relative to script file.
path <- file.path("Data", "swimming_pools.csv")

swimming_pools <- read.csv(path, stringsAsFactors = FALSE)

# The same file through the generic read.table(), renaming the columns and
# dropping those whose class is "NULL".  header = TRUE is an assumption.
swimming_pools <- read.table(path, header = TRUE,
sep = ",",
col.names = c("name", "address", "ph", "ph2", "open_hr","facilities", "disabl","park","lat","longit"),
colClasses = c("factor", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "numeric", "numeric"))``````
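Since the example above depends on a local file, here is a self-contained sketch of the same options. It writes a tiny made-up CSV to a temporary file first; the file contents, column names, and `pools_demo` are all hypothetical:

```r
# Write a small made-up CSV to a temporary file.
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("Name,Address,Latitude,Longitude",
    "Pool A,1 Main St,-27.5,153.0",
    "Pool B,2 High St,-27.4,153.1"),
  tmp
)

# Rename the columns on import and drop the address column via "NULL".
pools_demo <- read.csv(
  tmp,
  stringsAsFactors = FALSE,
  col.names = c("name", "address", "lat", "lon"),
  colClasses = c("character", "NULL", "numeric", "numeric")
)
str(pools_demo)

# Remove the temporary file.
unlink(tmp)
```

`col.names` names every column in the file, including the ones that `colClasses` then drops, so the two vectors have the same length.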

### `readr`

`readr` is similar to `utils`, but is faster and less verbose. `readr` returns a “tibble” instead of a data frame. Functions `read_csv()` and `read_tsv()` are wrappers for `read_delim()`, similar to the construction in package `utils`.

• Default `col_names = TRUE` sets column names to the first row of data. Set `col_names = FALSE` for system-generated names or set `col_names = c()` to set the column names to a character vector.
• `col_types` sets data types. In the shorthand string form, `col_types = "cd_il"` means “character, double, (skip), integer, logical”, so `_` drops the variable.
• Collector functions `col_factor()` and `col_integer()` also set column types.
``````library(readr)
pools <- file.path("Programs/Data", "swimming_pools.csv")
# or, if on the web,
pools.path <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/swimming_pools.csv"
swimming_pools <- read_csv(pools.path)``````
``````## Parsed with column specification:
## cols(
##   Name = col_character(),
##   Address = col_character(),
##   Latitude = col_double(),
##   Longitude = col_double()
## )``````
``````potatoes.path <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/potatoes.txt"
potatoes <- read_delim(potatoes.path, delim = "\t")``````
``````## Parsed with column specification:
## cols(
##   area = col_integer(),
##   temp = col_integer(),
##   size = col_integer(),
##   storage = col_integer(),
##   method = col_integer(),
##   texture = col_double(),
##   flavor = col_double(),
##   moistness = col_double()
## )``````
``````machine <- file.path("Programs/Data", "machine.txt")
properties <- c("new", "old")
machine.fragment <- read_tsv(machine, skip = 6, n_max = 5,
col_names = properties)``````
``````## Parsed with column specification:
## cols(
##   new = col_double(),
##   old = col_double()
## )``````
``````hotdogs <- file.path("Programs/Data", "hotdogs.txt")
# Assumed call: the file is taken to be tab-delimited, like machine.txt.
read_tsv(hotdogs,
col_names = c("type", "calories", "sodium"),
skip = 1)``````
``````## Parsed with column specification:
## cols(
##   type = col_character(),
##   calories = col_double(),
##   sodium = col_double()
## )``````

### `data.table`

The `data.table` package is optimized for large files. `fread()` is faster and more convenient than `read.table`.

``library(data.table)``
``## Warning: package 'data.table' was built under R version 3.4.4``
``````##
## Attaching package: 'data.table'``````
``````## The following object is masked from 'package:purrr':
##
##     transpose``````
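``````pools <- file.path("Programs/Data", "swimming_pools.csv")
# Assumed calls, mirroring the readr examples above.
swimming_pools <- fread(pools)

machine <- file.path("Programs/Data", "machine.txt")
properties <- c("new", "old")
machine.fragment <- fread(machine, skip = 6, nrows = 5, col.names = properties)``````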

## Excel

There are three packages to choose from, `readxl`, `gdata`, and `XLConnect`. `gdata` only handles .xls files and will be replaced when `readxl` is more mature. `XLConnect` is designed to work with Excel through R.

### `readxl`

`readxl` cannot read directly from the internet. First download the file, then import the file.

Package `readxl` provides `excel_sheets()`, which lists the available sheets, and `read_excel()`, which reads a sheet.

• Default `col_names = TRUE` takes the column names from the first row of data. Set `col_names = FALSE` for system-generated names, or pass a character vector to `col_names` to set the names yourself.
• `col_types` sets the data types. A “blank” element in the vector drops that variable.
• `skip` skips lines at the top of the sheet. If you skip past the row of column names, set the names manually with `col_names`.
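
The options above can be sketched as follows. This assumes the sample workbook `datasets.xlsx` that ships with `readxl` (located with `readxl_example()`), and the object names are illustrative.

```r
library(readxl)

# readxl_example() returns the path to a sample workbook bundled with readxl.
path <- readxl_example("datasets.xlsx")
sheets <- excel_sheets(path)

# Default: the first row supplies the column names.
first_sheet <- read_excel(path, sheet = sheets[1])

# Skip the header row, then supply the names manually.
# The names are reused from the default import to keep the example generic.
renamed <- read_excel(path, sheet = sheets[1], skip = 1,
                      col_names = names(first_sheet))
```

Both imports should contain the same data, since the second one skips only the header row and then restores its names.
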
``library(readxl)``
``## Warning: package 'readxl' was built under R version 3.4.4``
``````url_xls <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/latitude.xls"

mini.path <- file.path("Programs/Data", "MinitabIntroData.xlsx")
excel_sheets(mini.path)``````
``## [1] "Sheet1" "Sheet2"``
``````sheet1 <- read_excel(mini.path, sheet = "Sheet1")
sheet2 <- read_excel(mini.path, sheet = "Sheet2")
sheet.list = list(sheet1, sheet2)

# Equivalently...
sheet.list <- lapply(excel_sheets(mini.path),
read_excel, path = mini.path)``````

### `gdata`

`gdata` relies on Perl behind the scenes. It can only read `.xls` files, but it can read them directly from web sites.

``library(gdata)``
``## Warning: package 'gdata' was built under R version 3.4.4``
``````## gdata: Unable to locate valid perl interpreter
## gdata:
## gdata: read.xls() will be unable to read Excel XLS and XLSX files
## gdata: unless the 'perl=' argument is used to specify the location
## gdata: of a valid perl intrpreter.
## gdata:
## gdata: (To avoid display of this message in the future, please
## gdata: ensure perl is installed and available on the executable
## gdata: search path.)``````
``````## gdata: Unable to load perl libaries needed by read.xls()
## gdata: to support 'XLX' (Excel 97-2004) files.``````
``## ``
``````## gdata: Unable to load perl libaries needed by read.xls()
## gdata: to support 'XLSX' (Excel 2007+) files.``````
``## ``
``````## gdata: Run the function 'installXLSXsupport()'
## gdata: to automatically download and install the perl
## gdata: libaries needed to support Excel XLS and XLSX formats.``````
``````##
## Attaching package: 'gdata'``````
``````## The following objects are masked from 'package:data.table':
##
##     first, last``````
``````## The following object is masked from 'package:purrr':
##
##     keep``````
``````## The following object is masked from 'package:stats':
##
##     nobs``````
``````## The following object is masked from 'package:utils':
##
##     object.size``````
``````## The following object is masked from 'package:base':
##
##     startsWith``````
``````url_xls <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/latitude.xls"``````

### XLConnect

``````#library(XLConnect)
mini.path <- file.path("Programs/Data", "MinitabIntroData.xlsx")
#my_book <- loadWorkbook(mini.path)
#class(my_book)
#sheets <- getSheets(my_book)
#readWorksheet(my_book, sheet = 2)
#all <- lapply(sheets, readWorksheet, object = my_book)
#str(all)
#createSheet(my_book, name = "year_2010")
#writeWorksheet(my_book, pop_2010, sheet = "year_2010")
#saveWorkbook(my_book, file = "MinitabIntroData2.xlsx")``````

## Other Sources

### Databases

There is a dedicated package for each DBMS: `RMySQL`, `RPostgreSQL`, `ROracle`, etc. Function `dbGetQuery()` is a convenient wrapper around three functions: `dbSendQuery()`, `dbFetch()`, and `dbClearResult()`. Call the three functions separately if the data set is large and you only need a chunk of data at a time.
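
A minimal sketch of the three-step pattern, fetching a result in chunks. It assumes an in-memory SQLite database (via the `RSQLite` package) as a stand-in for a real server, and a chunk size of two rows chosen only for illustration.

```r
library(DBI)

# Assumption: an in-memory SQLite database stands in for a real DBMS.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

# Send the query without retrieving any rows yet.
res <- dbSendQuery(con, "SELECT mpg, cyl FROM mtcars WHERE cyl = 4")

# Fetch the result two rows at a time until it is exhausted.
n_rows <- 0
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 2)
  n_rows <- n_rows + nrow(chunk)
}

# Release the result set and close the connection.
dbClearResult(res)
dbDisconnect(con)
```

In real use, each `chunk` would be processed or written out inside the loop so that the full result never has to sit in memory at once.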

``library(DBI)``
``## Warning: package 'DBI' was built under R version 3.4.4``
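``````con <- dbConnect(RMySQL::MySQL(),
dbname = "tweater",
host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
port = 3306,
user = "student",
password = Sys.getenv("TWEATER_PASSWORD"))  # hypothetical environment variable; the password is not shown in the source
con``````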
``## <MySQLConnection:0,0>``
``````# read all tables into a list of data frames
table_names <- dbListTables(con)
tables <- lapply(table_names, dbReadTable, conn = con)
# read an entire table, then subset the rows you want (inefficient)
comments <- dbReadTable(con, "comments")
subset(comments, subset = user_id == 1)``````
``````##      id tweat_id user_id            message
## 4  1012       87       1   awesome! thanks!
## 7  1004       49       1  this is fabulous!
## 11 1020       77       1 couldn't be better
## 12 1014       77       1       saved my day``````
``````elisabeth <- dbGetQuery(con, "SELECT tweat_id FROM comments
WHERE user_id = 1")
latest <- dbGetQuery(con, "SELECT post FROM tweats WHERE date > \"2015-09-21\"")

dbDisconnect(con)``````
``## [1] TRUE``

### Internet

If a file resides on the web, reference it by URL instead of downloading it manually. The exception is `readxl`: download the Excel file first, then import the local copy.

``````url <- "http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r"
dest_path <- file.path("~", "local_cities.xlsx")``````

The `httr` package also handles internet files.

``library(httr)``
``## Warning: package 'httr' was built under R version 3.4.4``
``````resp <- GET("http://www.example.com/")
raw_content <- content(resp, as = "raw")
head(raw_content)``````
``## [1] 3c 21 64 6f 63 74``

### APIs and JSON

JSON data is either an object of name-value pairs, such as `{"id":1,"name":"Frank"}`, or an array, such as `[1,2,3,"dog"]`.
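
A short sketch of how `jsonlite::fromJSON()` maps these two shapes onto R objects. The second record and the object names are made up for illustration.

```r
library(jsonlite)

# A single JSON object becomes a named list.
person <- fromJSON('{"id":1,"name":"Frank"}')

# An array of objects with the same fields is simplified to a data frame.
people <- fromJSON('[{"id":1,"name":"Frank"},{"id":2,"name":"Julie"}]')

# simplifyVector = FALSE keeps the array as a list, so each element keeps its own type.
mixed <- fromJSON('[1,2,3,"dog"]', simplifyVector = FALSE)
```

The data frame simplification is what makes API responses that return arrays of records convenient to analyze directly.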

``library(jsonlite)``
``## Warning: package 'jsonlite' was built under R version 3.4.4``
``````##
## Attaching package: 'jsonlite'``````
``````## The following object is masked from 'package:purrr':
##
##     flatten``````
``````wine_json <- '{"name":"Chateau Migraine", "year":1997, "alcohol_pct":12.4, "color":"red", "awarded":false}'
# Convert file JSON into list
wine <- fromJSON(wine_json)
str(wine)``````
``````## List of 5
##  $ name       : chr "Chateau Migraine"
##  $ year       : int 1997
##  $ alcohol_pct: num 12.4
##  $ color      : chr "red"
##  $ awarded    : logi FALSE``````
``````# Convert web API JSON into list
url_sw4 <- "http://www.omdbapi.com/?apikey=ff21610b&i=tt0076759&r=json"
url_sw3 <- "http://www.omdbapi.com/?apikey=ff21610b&i=tt0121766&r=json"

# Import two URLs with fromJSON(): sw4 and sw3
#sw4 <- fromJSON(url_sw4)
#sw3 <- fromJSON(url_sw3)

# Print the Title element of both lists
#sw4$Title
#sw3$Title

# Convert mtcars to a pretty JSON: pretty_json
pretty_json <- toJSON(mtcars, pretty = TRUE)
pretty_json``````
``````## [
##   {
##     "mpg": 21,
##     "cyl": 6,
##     "disp": 160,
##     "hp": 110,
##     "drat": 3.9,
##     "wt": 2.62,
##     "qsec": 16.46,
##     "vs": 0,
##     "am": 1,
##     "gear": 4,
##     "carb": 4,
##     "_row": "Mazda RX4"
##   },
##   {
##     "mpg": 21,
##     "cyl": 6,
##     "disp": 160,
##     "hp": 110,
##     "drat": 3.9,
##     "wt": 2.875,
##     "qsec": 17.02,
##     "vs": 0,
##     "am": 1,
##     "gear": 4,
##     "carb": 4,
##     "_row": "Mazda RX4 Wag"
##   },
##   {
##     "mpg": 22.8,
##     "cyl": 4,
##     "disp": 108,
##     "hp": 93,
##     "drat": 3.85,
##     "wt": 2.32,
##     "qsec": 18.61,
##     "vs": 1,
##     "am": 1,
##     "gear": 4,
##     "carb": 1,
##     "_row": "Datsun 710"
##   },
##   {
##     "mpg": 21.4,
##     "cyl": 6,
##     "disp": 258,
##     "hp": 110,
##     "drat": 3.08,
##     "wt": 3.215,
##     "qsec": 19.44,
##     "vs": 1,
##     "am": 0,
##     "gear": 3,
##     "carb": 1,
##     "_row": "Hornet 4 Drive"
##   },
##   {
##     "mpg": 18.7,
##     "cyl": 8,
##     "disp": 360,
##     "hp": 175,
##     "drat": 3.15,
##     "wt": 3.44,
##     "qsec": 17.02,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 2,
##     "_row": "Hornet Sportabout"
##   },
##   {
##     "mpg": 18.1,
##     "cyl": 6,
##     "disp": 225,
##     "hp": 105,
##     "drat": 2.76,
##     "wt": 3.46,
##     "qsec": 20.22,
##     "vs": 1,
##     "am": 0,
##     "gear": 3,
##     "carb": 1,
##     "_row": "Valiant"
##   },
##   {
##     "mpg": 14.3,
##     "cyl": 8,
##     "disp": 360,
##     "hp": 245,
##     "drat": 3.21,
##     "wt": 3.57,
##     "qsec": 15.84,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 4,
##     "_row": "Duster 360"
##   },
##   {
##     "mpg": 24.4,
##     "cyl": 4,
##     "disp": 146.7,
##     "hp": 62,
##     "drat": 3.69,
##     "wt": 3.19,
##     "qsec": 20,
##     "vs": 1,
##     "am": 0,
##     "gear": 4,
##     "carb": 2,
##     "_row": "Merc 240D"
##   },
##   {
##     "mpg": 22.8,
##     "cyl": 4,
##     "disp": 140.8,
##     "hp": 95,
##     "drat": 3.92,
##     "wt": 3.15,
##     "qsec": 22.9,
##     "vs": 1,
##     "am": 0,
##     "gear": 4,
##     "carb": 2,
##     "_row": "Merc 230"
##   },
##   {
##     "mpg": 19.2,
##     "cyl": 6,
##     "disp": 167.6,
##     "hp": 123,
##     "drat": 3.92,
##     "wt": 3.44,
##     "qsec": 18.3,
##     "vs": 1,
##     "am": 0,
##     "gear": 4,
##     "carb": 4,
##     "_row": "Merc 280"
##   },
##   {
##     "mpg": 17.8,
##     "cyl": 6,
##     "disp": 167.6,
##     "hp": 123,
##     "drat": 3.92,
##     "wt": 3.44,
##     "qsec": 18.9,
##     "vs": 1,
##     "am": 0,
##     "gear": 4,
##     "carb": 4,
##     "_row": "Merc 280C"
##   },
##   {
##     "mpg": 16.4,
##     "cyl": 8,
##     "disp": 275.8,
##     "hp": 180,
##     "drat": 3.07,
##     "wt": 4.07,
##     "qsec": 17.4,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 3,
##     "_row": "Merc 450SE"
##   },
##   {
##     "mpg": 17.3,
##     "cyl": 8,
##     "disp": 275.8,
##     "hp": 180,
##     "drat": 3.07,
##     "wt": 3.73,
##     "qsec": 17.6,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 3,
##     "_row": "Merc 450SL"
##   },
##   {
##     "mpg": 15.2,
##     "cyl": 8,
##     "disp": 275.8,
##     "hp": 180,
##     "drat": 3.07,
##     "wt": 3.78,
##     "qsec": 18,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 3,
##     "_row": "Merc 450SLC"
##   },
##   {
##     "mpg": 10.4,
##     "cyl": 8,
##     "disp": 472,
##     "hp": 205,
##     "drat": 2.93,
##     "wt": 5.25,
##     "qsec": 17.98,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 4,
##     "_row": "Cadillac Fleetwood"
##   },
##   {
##     "mpg": 10.4,
##     "cyl": 8,
##     "disp": 460,
##     "hp": 215,
##     "drat": 3,
##     "wt": 5.424,
##     "qsec": 17.82,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 4,
##     "_row": "Lincoln Continental"
##   },
##   {
##     "mpg": 14.7,
##     "cyl": 8,
##     "disp": 440,
##     "hp": 230,
##     "drat": 3.23,
##     "wt": 5.345,
##     "qsec": 17.42,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 4,
##     "_row": "Chrysler Imperial"
##   },
##   {
##     "mpg": 32.4,
##     "cyl": 4,
##     "disp": 78.7,
##     "hp": 66,
##     "drat": 4.08,
##     "wt": 2.2,
##     "qsec": 19.47,
##     "vs": 1,
##     "am": 1,
##     "gear": 4,
##     "carb": 1,
##     "_row": "Fiat 128"
##   },
##   {
##     "mpg": 30.4,
##     "cyl": 4,
##     "disp": 75.7,
##     "hp": 52,
##     "drat": 4.93,
##     "wt": 1.615,
##     "qsec": 18.52,
##     "vs": 1,
##     "am": 1,
##     "gear": 4,
##     "carb": 2,
##     "_row": "Honda Civic"
##   },
##   {
##     "mpg": 33.9,
##     "cyl": 4,
##     "disp": 71.1,
##     "hp": 65,
##     "drat": 4.22,
##     "wt": 1.835,
##     "qsec": 19.9,
##     "vs": 1,
##     "am": 1,
##     "gear": 4,
##     "carb": 1,
##     "_row": "Toyota Corolla"
##   },
##   {
##     "mpg": 21.5,
##     "cyl": 4,
##     "disp": 120.1,
##     "hp": 97,
##     "drat": 3.7,
##     "wt": 2.465,
##     "qsec": 20.01,
##     "vs": 1,
##     "am": 0,
##     "gear": 3,
##     "carb": 1,
##     "_row": "Toyota Corona"
##   },
##   {
##     "mpg": 15.5,
##     "cyl": 8,
##     "disp": 318,
##     "hp": 150,
##     "drat": 2.76,
##     "wt": 3.52,
##     "qsec": 16.87,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 2,
##     "_row": "Dodge Challenger"
##   },
##   {
##     "mpg": 15.2,
##     "cyl": 8,
##     "disp": 304,
##     "hp": 150,
##     "drat": 3.15,
##     "wt": 3.435,
##     "qsec": 17.3,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 2,
##     "_row": "AMC Javelin"
##   },
##   {
##     "mpg": 13.3,
##     "cyl": 8,
##     "disp": 350,
##     "hp": 245,
##     "drat": 3.73,
##     "wt": 3.84,
##     "qsec": 15.41,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 4,
##     "_row": "Camaro Z28"
##   },
##   {
##     "mpg": 19.2,
##     "cyl": 8,
##     "disp": 400,
##     "hp": 175,
##     "drat": 3.08,
##     "wt": 3.845,
##     "qsec": 17.05,
##     "vs": 0,
##     "am": 0,
##     "gear": 3,
##     "carb": 2,
##     "_row": "Pontiac Firebird"
##   },
##   {
##     "mpg": 27.3,
##     "cyl": 4,
##     "disp": 79,
##     "hp": 66,
##     "drat": 4.08,
##     "wt": 1.935,
##     "qsec": 18.9,
##     "vs": 1,
##     "am": 1,
##     "gear": 4,
##     "carb": 1,
##     "_row": "Fiat X1-9"
##   },
##   {
##     "mpg": 26,
##     "cyl": 4,
##     "disp": 120.3,
##     "hp": 91,
##     "drat": 4.43,
##     "wt": 2.14,
##     "qsec": 16.7,
##     "vs": 0,
##     "am": 1,
##     "gear": 5,
##     "carb": 2,
##     "_row": "Porsche 914-2"
##   },
##   {
##     "mpg": 30.4,
##     "cyl": 4,
##     "disp": 95.1,
##     "hp": 113,
##     "drat": 3.77,
##     "wt": 1.513,
##     "qsec": 16.9,
##     "vs": 1,
##     "am": 1,
##     "gear": 5,
##     "carb": 2,
##     "_row": "Lotus Europa"
##   },
##   {
##     "mpg": 15.8,
##     "cyl": 8,
##     "disp": 351,
##     "hp": 264,
##     "drat": 4.22,
##     "wt": 3.17,
##     "qsec": 14.5,
##     "vs": 0,
##     "am": 1,
##     "gear": 5,
##     "carb": 4,
##     "_row": "Ford Pantera L"
##   },
##   {
##     "mpg": 19.7,
##     "cyl": 6,
##     "disp": 145,
##     "hp": 175,
##     "drat": 3.62,
##     "wt": 2.77,
##     "qsec": 15.5,
##     "vs": 0,
##     "am": 1,
##     "gear": 5,
##     "carb": 6,
##     "_row": "Ferrari Dino"
##   },
##   {
##     "mpg": 15,
##     "cyl": 8,
##     "disp": 301,
##     "hp": 335,
##     "drat": 3.54,
##     "wt": 3.57,
##     "qsec": 14.6,
##     "vs": 0,
##     "am": 1,
##     "gear": 5,
##     "carb": 8,
##     "_row": "Maserati Bora"
##   },
##   {
##     "mpg": 21.4,
##     "cyl": 4,
##     "disp": 121,
##     "hp": 109,
##     "drat": 4.11,
##     "wt": 2.78,
##     "qsec": 18.6,
##     "vs": 1,
##     "am": 1,
##     "gear": 4,
##     "carb": 2,
##     "_row": "Volvo 142E"
##   }
## ]``````
``````# Minify pretty_json: mini_json
mini_json <- minify(pretty_json)
mini_json``````
``## [{"mpg":21,"cyl":6,"disp":160,"hp":110,"drat":3.9,"wt":2.62,"qsec":16.46,"vs":0,"am":1,"gear":4,"carb":4,"_row":"Mazda RX4"},{"mpg":21,"cyl":6,"disp":160,"hp":110,"drat":3.9,"wt":2.875,"qsec":17.02,"vs":0,"am":1,"gear":4,"carb":4,"_row":"Mazda RX4 Wag"},{"mpg":22.8,"cyl":4,"disp":108,"hp":93,"drat":3.85,"wt":2.32,"qsec":18.61,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Datsun 710"},{"mpg":21.4,"cyl":6,"disp":258,"hp":110,"drat":3.08,"wt":3.215,"qsec":19.44,"vs":1,"am":0,"gear":3,"carb":1,"_row":"Hornet 4 Drive"},{"mpg":18.7,"cyl":8,"disp":360,"hp":175,"drat":3.15,"wt":3.44,"qsec":17.02,"vs":0,"am":0,"gear":3,"carb":2,"_row":"Hornet Sportabout"},{"mpg":18.1,"cyl":6,"disp":225,"hp":105,"drat":2.76,"wt":3.46,"qsec":20.22,"vs":1,"am":0,"gear":3,"carb":1,"_row":"Valiant"},{"mpg":14.3,"cyl":8,"disp":360,"hp":245,"drat":3.21,"wt":3.57,"qsec":15.84,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Duster 360"},{"mpg":24.4,"cyl":4,"disp":146.7,"hp":62,"drat":3.69,"wt":3.19,"qsec":20,"vs":1,"am":0,"gear":4,"carb":2,"_row":"Merc 240D"},{"mpg":22.8,"cyl":4,"disp":140.8,"hp":95,"drat":3.92,"wt":3.15,"qsec":22.9,"vs":1,"am":0,"gear":4,"carb":2,"_row":"Merc 230"},{"mpg":19.2,"cyl":6,"disp":167.6,"hp":123,"drat":3.92,"wt":3.44,"qsec":18.3,"vs":1,"am":0,"gear":4,"carb":4,"_row":"Merc 280"},{"mpg":17.8,"cyl":6,"disp":167.6,"hp":123,"drat":3.92,"wt":3.44,"qsec":18.9,"vs":1,"am":0,"gear":4,"carb":4,"_row":"Merc 280C"},{"mpg":16.4,"cyl":8,"disp":275.8,"hp":180,"drat":3.07,"wt":4.07,"qsec":17.4,"vs":0,"am":0,"gear":3,"carb":3,"_row":"Merc 450SE"},{"mpg":17.3,"cyl":8,"disp":275.8,"hp":180,"drat":3.07,"wt":3.73,"qsec":17.6,"vs":0,"am":0,"gear":3,"carb":3,"_row":"Merc 450SL"},{"mpg":15.2,"cyl":8,"disp":275.8,"hp":180,"drat":3.07,"wt":3.78,"qsec":18,"vs":0,"am":0,"gear":3,"carb":3,"_row":"Merc 450SLC"},{"mpg":10.4,"cyl":8,"disp":472,"hp":205,"drat":2.93,"wt":5.25,"qsec":17.98,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Cadillac 
Fleetwood"},{"mpg":10.4,"cyl":8,"disp":460,"hp":215,"drat":3,"wt":5.424,"qsec":17.82,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Lincoln Continental"},{"mpg":14.7,"cyl":8,"disp":440,"hp":230,"drat":3.23,"wt":5.345,"qsec":17.42,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Chrysler Imperial"},{"mpg":32.4,"cyl":4,"disp":78.7,"hp":66,"drat":4.08,"wt":2.2,"qsec":19.47,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Fiat 128"},{"mpg":30.4,"cyl":4,"disp":75.7,"hp":52,"drat":4.93,"wt":1.615,"qsec":18.52,"vs":1,"am":1,"gear":4,"carb":2,"_row":"Honda Civic"},{"mpg":33.9,"cyl":4,"disp":71.1,"hp":65,"drat":4.22,"wt":1.835,"qsec":19.9,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Toyota Corolla"},{"mpg":21.5,"cyl":4,"disp":120.1,"hp":97,"drat":3.7,"wt":2.465,"qsec":20.01,"vs":1,"am":0,"gear":3,"carb":1,"_row":"Toyota Corona"},{"mpg":15.5,"cyl":8,"disp":318,"hp":150,"drat":2.76,"wt":3.52,"qsec":16.87,"vs":0,"am":0,"gear":3,"carb":2,"_row":"Dodge Challenger"},{"mpg":15.2,"cyl":8,"disp":304,"hp":150,"drat":3.15,"wt":3.435,"qsec":17.3,"vs":0,"am":0,"gear":3,"carb":2,"_row":"AMC Javelin"},{"mpg":13.3,"cyl":8,"disp":350,"hp":245,"drat":3.73,"wt":3.84,"qsec":15.41,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Camaro Z28"},{"mpg":19.2,"cyl":8,"disp":400,"hp":175,"drat":3.08,"wt":3.845,"qsec":17.05,"vs":0,"am":0,"gear":3,"carb":2,"_row":"Pontiac Firebird"},{"mpg":27.3,"cyl":4,"disp":79,"hp":66,"drat":4.08,"wt":1.935,"qsec":18.9,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Fiat X1-9"},{"mpg":26,"cyl":4,"disp":120.3,"hp":91,"drat":4.43,"wt":2.14,"qsec":16.7,"vs":0,"am":1,"gear":5,"carb":2,"_row":"Porsche 914-2"},{"mpg":30.4,"cyl":4,"disp":95.1,"hp":113,"drat":3.77,"wt":1.513,"qsec":16.9,"vs":1,"am":1,"gear":5,"carb":2,"_row":"Lotus Europa"},{"mpg":15.8,"cyl":8,"disp":351,"hp":264,"drat":4.22,"wt":3.17,"qsec":14.5,"vs":0,"am":1,"gear":5,"carb":4,"_row":"Ford Pantera L"},{"mpg":19.7,"cyl":6,"disp":145,"hp":175,"drat":3.62,"wt":2.77,"qsec":15.5,"vs":0,"am":1,"gear":5,"carb":6,"_row":"Ferrari 
Dino"},{"mpg":15,"cyl":8,"disp":301,"hp":335,"drat":3.54,"wt":3.57,"qsec":14.6,"vs":0,"am":1,"gear":5,"carb":8,"_row":"Maserati Bora"},{"mpg":21.4,"cyl":4,"disp":121,"hp":109,"drat":4.11,"wt":2.78,"qsec":18.6,"vs":1,"am":1,"gear":4,"carb":2,"_row":"Volvo 142E"}]``

### Statistics Packages, `haven` and `foreign`

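R can import files from SAS, Stata, and SPSS. Package `haven` provides `read_sas()`, `read_dta()`, and `read_sav()`; package `foreign` provides `read.dta()`, `read.spss()`, and `read.xport()`.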

``library(haven)``
``## Warning: package 'haven' was built under R version 3.4.4``
``````sales <- read_sas("http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/sales.sas7bdat")
# Convert labeled values in Date column to dates
# (assumes sugar was imported beforehand, e.g. with read_dta())
sugar$Date <- as.Date(as_factor(sugar$Date))``````
``````library(foreign)
# foreign can load xport files but not sas7bdat files.``````
``````# load in the data and store it in the variable cars
# print the first 6 rows of the dataset using the head() function
head(cars)``````
``````##    mpg cyl disp  hp drat    wt  qsec vs am gear carb               car
## 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4         Mazda RX4
## 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4     Mazda RX4 Wag
## 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1        Datsun 710
## 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1    Hornet 4 Drive
## 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 Hornet Sportabout
## 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1           Valiant``````

Change the field separator for text files with the `sep` argument. Use `sep = "\t"` for tab.
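
A minimal sketch of reading tab-separated data. It uses an inline string in place of a file, and the values are illustrative.

```r
# Inline tab-separated text stands in for a file on disk.
tsv_text <- "mpg\tcyl\n21.0\t6\n22.8\t4"
cars_tab <- read.csv(text = tsv_text, sep = "\t")

# read.delim() is the same reader with sep = "\t" as its default.
cars_tab2 <- read.delim(text = tsv_text)
```

Either call yields a data frame with columns `mpg` and `cyl`.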

``````# load in the dataset
cars <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/course/uva/mtcars_semicolon.csv", sep = ";")

# print the first 6 rows of the dataset
head(cars)``````
``````##    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1``````

Get and set your working directory with `getwd()` and `setwd()`.
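
A small sketch of changing the working directory and restoring it afterwards. The session's temporary directory is used here only as a safe, illustrative target.

```r
# setwd() returns the previous directory invisibly, so it can be saved.
old_wd <- setwd(tempdir())

# The working directory is now the session's temporary directory.
current_wd <- getwd()

# Restore the original directory when finished.
setwd(old_wd)
```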

``getwd()``
``## [1] "C:/Users/mpfol/OneDrive/Documents/Data Analysis"``
``list.files()``
``````##  [1] "Analyzing Survey Data in R.Rmd"
##  [2] "Analyzing_Survey_Data_in_R.html"
##  [3] "Cookbook for R.Rmd"
##  [4] "Cookbook_for_R.html"
##  [5] "Cookbook_for_R.Rmd"
##  [6] "Cookbook_for_R_files"
##  [7] "Coursework"
##  [8] "Data"
##  [9] "Data Analysis.docx"
## [10] "Data Analysis.xlsx"
## [11] "Data Visualization.docx"
## [12] "Foundations of Inference.Rmd"
## [13] "Foundations_of_Inference.html"
## [14] "local_latitude.xls"
## [15] "Programs"
## [16] "rmarkdown-cheatsheet.pdf"
## [17] "rsconnect"
## [18] "Statistical Analysis.docx"
## [19] "Statistical Package Syntax (1).docx"
## [20] "Statistics Notes.docx"
## [21] "Statistics v20170301.docx"``````

# Data Wrangling

### Data Exploration

Data exploration starts with evaluation of structure and characteristics using `class()` (it better be a data.frame), `dim()`, and `names()`. Create summaries with `str()` or `glimpse()`, and `summary()`. Run some initial visualizations for insights into distributions. Use histograms for univariate analysis, scatterplots for numeric-numeric bi-variate analysis, and boxplots for numeric-factor bi-variate analysis.
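
For the numeric-factor case, a boxplot can be sketched with base graphics as below. The choice of `mpg` against `cyl` and the axis labels are illustrative.

```r
# Numeric vs. factor: distribution of mpg within each cylinder count.
# cyl is stored as a number, so convert it to a factor for grouping.
bp <- boxplot(mpg ~ factor(cyl), data = mtcars,
              xlab = "Cylinders", ylab = "Miles per gallon")
```

`boxplot()` also returns the plotted statistics invisibly, so `bp$stats` holds the five-number summary for each group.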

``library(dplyr)``
``````##
## Attaching package: 'dplyr'``````
``````## The following objects are masked from 'package:gdata':
##
##     combine, first, last``````
``````## The following objects are masked from 'package:data.table':
##
##     between, first, last``````
``````## The following objects are masked from 'package:stats':
##
##     filter, lag``````
``````## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union``````
``````# Check structure
class(mtcars)``````
``## [1] "data.frame"``
``dim(mtcars)``
``## [1] 32 11``
``names(mtcars)``
``````##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"``````
``````# Initial summaries
str(mtcars)``````
``````## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...``````
``glimpse(mtcars)  # Slightly cleaner version of str (requires dplyr).``
``````## Observations: 32
## Variables: 11
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...``````
``summary(mtcars)``
``````##       mpg             cyl             disp             hp
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
##       drat             wt             qsec             vs
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000
##        am              gear            carb
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000
##  Median :0.0000   Median :4.000   Median :2.000
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000``````
``hist(mtcars$mpg)``

``plot(mtcars$mpg, mtcars$qsec)``

``````# View sample data
head(mtcars)``````
``````##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1``````
``tail(mtcars)``
``````##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2``````

### Tidying

Tidy data stores each variable in its own column and each observation in its own row, with one type of observational unit per table. Use the `tidyr` package to tidy messy data.

``library(tidyr)``
``## Warning: package 'tidyr' was built under R version 3.4.4``
``````wide_df <- data.frame(Obs=c(1,2),
a=c(1,4),
b=c(2,5),
c=c(3,6),
year_mo=c("2010-05","2007-07"))
wide_df``````
``````##   Obs a b c year_mo
## 1   1 1 2 3 2010-05
## 2   2 4 5 6 2007-07``````
``````# Gather wide data into key-value pairs. Exclude Obs and year_mo
long_df <- gather(wide_df, my_key, my_val, -c(Obs,year_mo))
long_df``````
``````##   Obs year_mo my_key my_val
## 1   1 2010-05      a      1
## 2   2 2007-07      a      4
## 3   1 2010-05      b      2
## 4   2 2007-07      b      5
## 5   1 2010-05      c      3
## 6   2 2007-07      c      6``````
``````# The opposite of gather() is spread()
wide_df <- spread(long_df, my_key, my_val)
wide_df``````
``````##   Obs year_mo a b c
## 1   1 2010-05 1 2 3
## 2   2 2007-07 4 5 6``````
``````# Split a column using separate().
long_df_sep <- separate(long_df, col = year_mo, into = c("year","month"), sep = "-")
long_df_sep``````
``````##   Obs year month my_key my_val
## 1   1 2010    05      a      1
## 2   2 2007    07      a      4
## 3   1 2010    05      b      2
## 4   2 2007    07      b      5
## 5   1 2010    05      c      3
## 6   2 2007    07      c      6``````
``````# The opposite of separate() is unite()
long_df_uni <- unite(long_df_sep, year_mo, year, month, sep = "-")
long_df_uni``````
``````##   Obs year_mo my_key my_val
## 1   1 2010-05      a      1
## 2   2 2007-07      a      4
## 3   1 2010-05      b      2
## 4   2 2007-07      b      5
## 5   1 2010-05      c      3
## 6   2 2007-07      c      6``````

### Preparing for Analysis

Types of variables in R:

• character
• numeric, including `NaN` and `Inf`
• integer, denoted with an `L` suffix, e.g. `123L`
• factor
• logical, including `NA`

Coerce variables into data types with:

• `as.character()`
• `as.numeric()`
• `as.integer()`
• `as.factor()`
• `as.logical()`, where 0 becomes `FALSE` and any other number becomes `TRUE`

Package `lubridate` coerces strings to dates. The letters in its parser names are `y` (year), `m` (month), `d` (day), `h` (hour), `m` (minute), and `s` (second). Unite several fields into one with `unite()`. Rearrange column order with `select()`. Change the structure of multiple columns with `mutate_at()`.
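
A brief sketch of the base coercion functions listed above, including one common pitfall when converting factors. The values and object names are illustrative.

```r
# Character to numeric.
as.numeric(c("1.5", "2"))

# Numeric to integer truncates toward zero.
as.integer(3.9)

# Numeric to logical: 0 is FALSE, any other number is TRUE.
as.logical(c(0, 1, 5))

# Factor to numeric returns the internal level codes, not the labels,
# so convert through character first.
f <- factor(c("10", "20", "10"))
codes <- as.numeric(f)
values <- as.numeric(as.character(f))
```

Here `codes` holds the level positions, while `values` recovers the original numbers.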

Because the period (.) has special meaning in certain situations, use underscores (_) to separate words in variable names. Use all lowercase letters so that no one has to remember which letters are uppercase or lowercase.

Package `lubridate` manipulates dates. Round dates with `round_date`, `floor_date`, and `ceiling_date`. All three take a unit argument specifying the resolution of rounding: “second”, “minute”, “hour”, “day”, “week”, “month”, “bimonth”, “quarter”, “halfyear”, or “year”. Or, you can specify any multiple of those units, e.g. “5 years”, “3 minutes” etc.

``library(lubridate)``
``## Warning: package 'lubridate' was built under R version 3.4.4``
``````##
## Attaching package: 'lubridate'``````
``````## The following objects are masked from 'package:data.table':
##
##     hour, isoweek, mday, minute, month, quarter, second, wday,
##     week, yday, year``````
``````## The following object is masked from 'package:base':
##
##     date``````
``````# There are 3! = 6 date-parsing functions, one per ordering of year, month, and day: ymd(), ydm(), mdy(), myd(), dmy(), dym().
# Create datetimes by appending _h, _hm, or _hms, e.g. ymd_hms().
as.Date(ymd_hms("2005/10/23 14:40:00"))``````
``## [1] "2005-10-23"``
``as.POSIXct(mdy("July 21, 2006"))``
``## [1] "2006-07-20 20:00:00 EDT"``
``ymd("2006-07-21")``
``## [1] "2006-07-21"``
``ymd("2006 Jul 21")``
``## [1] "2006-07-21"``
``mdy("July 21, 2006")``
``## [1] "2006-07-21"``
``hms("10:25:09")``
``## [1] "10H 25M 9S"``
``ymd_hms("2005/10/23 14:40:00")``
``## [1] "2005-10-23 14:40:00 UTC"``
``````# If the date is in an unsupported order like dym_msh, use parse_date_time() with the orders argument specifying the order of the components.

# Combine date parts with make_date(year, month, day).
r_3_4_1 <- ymd_hms("2016-05-03 07:13:28 UTC")

# Date rounding
floor_date(r_3_4_1, unit = "day")``````
``## [1] "2016-05-03 UTC"``
``round_date(r_3_4_1, unit = "5 minutes")``
``## [1] "2016-05-03 07:15:00 UTC"``
``ceiling_date(r_3_4_1, unit = "week")``
``## [1] "2016-05-08 UTC"``
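
The two helpers mentioned in the comments above, `parse_date_time()` and `make_date()`, can be sketched as follows. The input string, the `orders` value, and the object names are chosen for illustration.

```r
library(lubridate)

# Day-year-month followed by hour:minute has no dedicated parser,
# so spell out the component order with parse_date_time().
odd_order <- parse_date_time("21 2006 07 14:40", orders = "dym HM")

# Build a date from separate numeric parts.
from_parts <- make_date(year = 2016, month = 5, day = 3)
```

`orders` also accepts a vector of candidate orders when a column mixes several formats.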

Subtract dates with the simple `-` operator for a difference in days, or get finer control with the `base` function `difftime(t1, t2, units)`. The current date and time are available from `today()` and `now()`.

``````date_landing <- mdy("July 20, 1969")
moment_step <- mdy_hms("July 20, 1969, 02:56:15", tz = "UTC")

difftime(today(), date_landing, units = "days")``````
``## Time difference of 18075 days``
``difftime(now(), moment_step, units = "secs")``
``## Time difference of 1561709101 secs``

Use timespans to add a fixed amount of time to dates. Distinguish periods (human units such as days and months, which follow the clock and calendar) from durations (an exact number of seconds) to handle daylight saving time gracefully. By combining addition and multiplication with sequences, you can generate sequences of datetimes.
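
To make the period/duration distinction concrete, here is a sketch that adds one day across a daylight saving change. The time zone and date are assumptions chosen because US clocks moved forward on 11 March 2018.

```r
library(lubridate)

# Noon on the day before the clocks moved forward.
before_dst <- ymd_hms("2018-03-10 12:00:00", tz = "America/New_York")

# Period: the same clock time on the next day.
plus_period <- before_dst + days(1)

# Duration: exactly 86,400 seconds later, which lands one hour later on the clock.
plus_duration <- before_dst + ddays(1)
```

The period lands at noon the next day, while the duration lands at 1 pm because that day is only 23 hours long.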

``````library(lubridate)
# Add a period of one week to mon_2pm
mon_2pm <- dmy_hm("27 Aug 2018 14:00")
mon_2pm + weeks(1)``````
``## [1] "2018-09-03 14:00:00 UTC"``
``````# Add a duration of 81 hours to tue_9am
tue_9am <- dmy_hm("28 Aug 2018 9:00")
tue_9am + dhours(81)``````
``## [1] "2018-08-31 18:00:00 UTC"``
``````# A period of five years is longer than a duration of 5 years!
today() - years(5)``````
``## [1] "2014-01-14"``
``today() - dyears(5)``
``## [1] "2014-01-15"``
``````# Create combined periods and durations.
eclipse_2017 <- ymd_hms("2017-08-21 18:26:40")
synodic <- ddays(29) + dhours(12) + dminutes(44) + dseconds(3)

# Create datetime for every two weeks for a year
today_8am <- today() + hours(8)
every_two_weeks <- 1:26 * weeks(2)
today_8am + every_two_weeks``````
``````##  [1] "2019-01-28 08:00:00 UTC" "2019-02-11 08:00:00 UTC"
##  [3] "2019-02-25 08:00:00 UTC" "2019-03-11 08:00:00 UTC"
##  [5] "2019-03-25 08:00:00 UTC" "2019-04-08 08:00:00 UTC"
##  [7] "2019-04-22 08:00:00 UTC" "2019-05-06 08:00:00 UTC"
##  [9] "2019-05-20 08:00:00 UTC" "2019-06-03 08:00:00 UTC"
## [11] "2019-06-17 08:00:00 UTC" "2019-07-01 08:00:00 UTC"
## [13] "2019-07-15 08:00:00 UTC" "2019-07-29 08:00:00 UTC"
## [15] "2019-08-12 08:00:00 UTC" "2019-08-26 08:00:00 UTC"
## [17] "2019-09-09 08:00:00 UTC" "2019-09-23 08:00:00 UTC"
## [19] "2019-10-07 08:00:00 UTC" "2019-10-21 08:00:00 UTC"
## [21] "2019-11-04 08:00:00 UTC" "2019-11-18 08:00:00 UTC"
## [23] "2019-12-02 08:00:00 UTC" "2019-12-16 08:00:00 UTC"
## [25] "2019-12-30 08:00:00 UTC" "2020-01-13 08:00:00 UTC"``````
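
The examples above do not cross a daylight saving change, which is where periods and durations differ most visibly. The following is a minimal sketch, assuming the `America/New_York` time zone and the 2018 spring change on 11 March; the variable names are hypothetical.

```r
library(lubridate)

# Noon on the day before US daylight saving time began in 2018 (11 March).
before_dst <- ymd_hms("2018-03-10 12:00:00", tz = "America/New_York")

# A period of one day keeps the clock time: noon the next day.
plus_period <- before_dst + days(1)

# A duration of one day adds exactly 86400 seconds: 1 pm the next day,
# because the day of the change is only 23 hours long.
plus_duration <- before_dst + ddays(1)

print(plus_period)
print(plus_duration)
```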

`ymd("2018-01-31") + months(1)` returns NA because February 31 does not exist. For situations like this, use alternative operators like `%m+%`, which rolls the result back to the last day of the month.

``````library(lubridate)

# A sequence of 1 to 12 periods of 1 month
month_seq <- 1:12 * months(1)

# Add 1 to 12 months to jan_31.  This way returns NAs.
ymd("2018-01-31") + month_seq``````
``````##  [1] NA           "2018-03-31" NA           "2018-05-31" NA
##  [6] "2018-07-31" "2018-08-31" NA           "2018-10-31" NA
## [11] "2018-12-31" "2019-01-31"``````
``````# Better way.
ymd("2018-01-31") %m+% month_seq``````
``````##  [1] "2018-02-28" "2018-03-31" "2018-04-30" "2018-05-31" "2018-06-30"
##  [6] "2018-07-31" "2018-08-31" "2018-09-30" "2018-10-31" "2018-11-30"
## [11] "2018-12-31" "2019-01-31"``````

Intervals have a specific start and end time. There are two notations: `datetime1 %--% datetime2`, or `interval(datetime1, datetime2)`.

``````# Two ways to create an interval.
dmy("5 January 1961") %--% dmy("30 January 1969")``````
``## [1] 1961-01-05 UTC--1969-01-30 UTC``
``interval(dmy("5 January 1961"), dmy("30 January 1969"))``
``## [1] 1961-01-05 UTC--1969-01-30 UTC``

Once you have an interval, you can find its start, end, and length with `int_start()`, `int_end()`, and `int_length()` respectively. You can test whether a date is `%within%` an interval. You can test whether two intervals overlap with `int_overlaps()`.

``````my_intvl <- interval(dmy("5 January 1961"), dmy("30 January 1969"))
int_length(my_intvl)``````
``## [1] 254620800``
``````y2001 <- ymd("2001-01-01") %--% ymd("2001-12-31")
ymd("2001-03-30") %within% y2001``````
``## [1] TRUE``
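
The example above covers `int_length()` and `%within%`. A minimal sketch of the remaining functions follows; the intervals and their names are arbitrary illustrations.

```r
library(lubridate)

# Three illustrative intervals (the dates are arbitrary).
first_half <- interval(ymd("2001-01-01"), ymd("2001-06-30"))
spring <- ymd("2001-03-01") %--% ymd("2001-05-31")
autumn <- ymd("2001-09-01") %--% ymd("2001-11-30")

int_start(first_half)   # start of the interval
int_end(first_half)     # end of the interval
int_length(first_half)  # length in seconds

# Do the intervals share any time?
int_overlaps(first_half, spring)
int_overlaps(first_half, autumn)
```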

Convert an interval to a period or duration with `as.period` and `as.duration`.

``````my_intvl <- interval(dmy("5 January 1961"), dmy("30 January 1969"))
as.period(my_intvl)``````
``## [1] "8y 0m 25d 0H 0M 0S"``
``as.duration(my_intvl)``
``## [1] "254620800s (~8.07 years)"``

Extract the time zone with `tz()`. Use `force_tz(dt, tzone=)` to change the time zone while keeping the clock time, or `with_tz(dt, tzone=)` to view the same moment in a different time zone. Get valid `tzone` names from `OlsonNames()`.

``````game2 <- mdy_hm("June 11 2015 19:00")
game3 <- mdy_hm("June 15 2015 18:30")

# Set the timezones to "America/Edmonton" and "America/Winnipeg" (clock times stay the same)
game2_local <- force_tz(game2, tzone = "America/Edmonton")
game3_local <- force_tz(game3, tzone = "America/Winnipeg")

# What time is game2_local in NZ?
with_tz(game2_local, tzone = "Pacific/Auckland")``````
``## [1] "2015-06-12 13:00:00 NZST"``
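
To make the difference between the two functions concrete, here is a small sketch that applies both to the same datetime. The variable names are hypothetical; the datetime is parsed as UTC, the `lubridate` default.

```r
library(lubridate)

# An illustrative datetime, parsed as UTC by default.
kickoff <- ymd_hms("2015-06-11 19:00:00")
tz(kickoff)

# force_tz() keeps the clock time (19:00) but changes the instant.
forced <- force_tz(kickoff, tzone = "America/Edmonton")

# with_tz() keeps the instant but shows a different clock time.
viewed <- with_tz(kickoff, tzone = "America/Edmonton")

hour(forced)        # still 19
hour(viewed)        # 13, because Edmonton is 6 hours behind UTC in June
forced == kickoff   # FALSE: a different moment in time
viewed == kickoff   # TRUE: the same moment in time
```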

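`stamp()` formats dates by example: it returns a function that formats dates like the template string you pass it. If the template is ambiguous, `stamp()` reports every format that matched and picks one, which can give an unexpected result, as the output below shows.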

``stamp("09/20/2017")(today())``
``## Multiple formats matched: "%Om/%d/%y%H"(1), "%Om/%y/%d%H"(1), "%Om/%d/%Y"(1), "%m/%d/%y%H"(1), "%m/%y/%d%H"(1), "%m/%d/%Y"(1)``
``## Using: "%Om/%y/%d%H"``
``## [1] "01/19/1400"``

Package `stringr` manipulates strings.

``````library(stringr)
# trim whitespace.
str_trim("  this is a test  ")``````
``## [1] "this is a test"``
``````# pad string with zeros.
str_pad("2493", width = 7, side = "left", pad = "0")``````
``## [1] "0002493"``
``````# find pattern Alice
str_detect(c("Sarah", "Alice", "Tom"), "Alice")``````
``## [1] FALSE  TRUE FALSE``
``````# replace pattern Alice with Jeff
str_replace(c("Sarah", "Alice", "Tom"), "Alice", "Jeff")``````
``## [1] "Sarah" "Jeff"  "Tom"``
``````# Change case with base R (stringr equivalents: str_to_upper(), str_to_lower())
toupper("DataCamp")``````
``## [1] "DATACAMP"``
``tolower("DataCamp")``
``## [1] "datacamp"``

Use `is.na()` to locate missing values (`NA`).

``````# 4x4 data frame with a few NAs.
df <- data.frame(A = c(1, NA, 8, NA),
B = c(3, NA, 88, 23),
C = c(2, 45, 3, 1),
D = c("A", "", "C", "D"))
# Any NAs?
any(is.na(df))``````
``## [1] TRUE``
``````# locate the NAs.
is.na(df)``````
``````##          A     B     C     D
## [1,] FALSE FALSE FALSE FALSE
## [2,]  TRUE  TRUE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE
## [4,]  TRUE FALSE FALSE FALSE``````
``````# How many?
sum(is.na(df))``````
``## [1] 3``
``````# Summarize the NAs
summary(df)``````
``````##        A              B              C         D
##  Min.   :1.00   Min.   : 3.0   Min.   : 1.00    :1
##  1st Qu.:2.75   1st Qu.:13.0   1st Qu.: 1.75   A:1
##  Median :4.50   Median :23.0   Median : 2.50   C:1
##  Mean   :4.50   Mean   :38.0   Mean   :12.75   D:1
##  3rd Qu.:6.25   3rd Qu.:55.5   3rd Qu.:13.50
##  Max.   :8.00   Max.   :88.0   Max.   :45.00
##  NA's   :2      NA's   :1``````
``````# Rows with no missing values, two ways
df[complete.cases(df),]``````
``````##   A  B C D
## 1 1  3 2 A
## 3 8 88 3 C``````
``na.omit(df)``
``````##   A  B C D
## 1 1  3 2 A
## 3 8 88 3 C``````
``````# Replace empty strings with NA
df$D[df$D == ""] <- NA

# Simulate data with one extreme value in B and one in C.
df2 <- data.frame(A = rnorm(100,50,10),
B = c(rnorm(99,50,10), 500),
C = c(rnorm(99,50,10), -1))
# Find outliers using hist() or boxplot().
hist(df2$B)``````

``boxplot(df2)``

``````# Drop or replace outliers.  Use which() to find index of offending observation.
mymtcars <- mtcars
ind <- which(mymtcars$mpg == 15.0)
mymtcars$mpg[ind] <- 20.0``````
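
The example above looks up a known value. When the outliers are not known in advance, one option is to ask base R which points `boxplot()` would flag. This is a minimal sketch on simulated data; the variable names and the choice to replace outliers with `NA` are assumptions made for illustration.

```r
# Illustrative data: 99 typical values plus one extreme value.
set.seed(1)
x <- c(rnorm(99, mean = 50, sd = 10), 500)

# boxplot.stats() returns the points that boxplot() would draw as outliers.
outliers <- boxplot.stats(x)$out

# which() gives the positions of those values in x.
ind <- which(x %in% outliers)

# One option: replace the outliers with NA rather than a made-up value.
x_clean <- x
x_clean[ind] <- NA
```

Whether to drop, replace, or keep a flagged value depends on the data; the boxplot rule only identifies candidates.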

# 3. Data Wrangling

## 3.1 `dplyr`

The `dplyr` package provides data wrangling tools. `dplyr` introduces the tibble, a data frame constrained to display well in an R session. The tibble class inherits from the data frame class. Convert a data frame to a tibble with `as_tibble(data.frame)` (the older `tbl_df(data.frame)` does the same but is deprecated in recent `dplyr` versions). `glimpse(tbl)` works with tibbles the way `str(data.frame)` works with data frames. Convert a tibble back to a data frame with `as.data.frame(tbl)`.
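
A minimal sketch of the round trip described above, using the built-in `mtcars` data and `as_tibble()`; the variable names are hypothetical.

```r
library(dplyr)

# Convert a built-in data frame to a tibble.
cars_tbl <- as_tibble(mtcars)
class(cars_tbl)

# glimpse() shows one line per column, similar to str().
glimpse(cars_tbl)

# Convert back to a plain data frame.
cars_df <- as.data.frame(cars_tbl)
class(cars_df)
```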

``````library(dplyr)

# hflights is a data.frame of Houston based flights.
library(hflights)``````
``## Warning: package 'hflights' was built under R version 3.4.4``
``````hflights <- as_tibble(hflights)
head(hflights)``````
``````## # A tibble: 6 x 21
##    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##   <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
## 1  2011     1          1         6    1400    1500 AA                  428
## 2  2011     1          2         7    1401    1501 AA                  428
## 3  2011     1          3         1    1352    1502 AA                  428
## 4  2011     1          4         2    1403    1513 AA                  428
## 5  2011     1          5         3    1405    1507 AA                  428
## 6  2011     1          6         4    1359    1503 AA                  428
## # ... with 13 more variables: TailNum <chr>, ActualElapsedTime <int>,
## #   AirTime <int>, ArrDelay <int>, DepDelay <int>, Origin <chr>,
## #   Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## #   Cancelled <int>, CancellationCode <chr>, Diverted <int>``````
``summary(hflights)``
``````##       Year          Month          DayofMonth      DayOfWeek
##  Min.   :2011   Min.   : 1.000   Min.   : 1.00   Min.   :1.000
##  1st Qu.:2011   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:2.000
##  Median :2011   Median : 7.000   Median :16.00   Median :4.000
##  Mean   :2011   Mean   : 6.514   Mean   :15.74   Mean   :3.948
##  3rd Qu.:2011   3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:6.000
##  Max.   :2011   Max.   :12.000   Max.   :31.00   Max.   :7.000
##
##     DepTime        ArrTime     UniqueCarrier        FlightNum
##  Min.   :   1   Min.   :   1   Length:227496      Min.   :   1
##  1st Qu.:1021   1st Qu.:1215   Class :character   1st Qu.: 855
##  Median :1416   Median :1617   Mode  :character   Median :1696
##  Mean   :1396   Mean   :1578                      Mean   :1962
##  3rd Qu.:1801   3rd Qu.:1953                      3rd Qu.:2755
##  Max.   :2400   Max.   :2400                      Max.   :7290
##  NA's   :2905   NA's   :3066
##    TailNum          ActualElapsedTime    AirTime         ArrDelay
##  Length:227496      Min.   : 34.0     Min.   : 11.0   Min.   :-70.000
##  Class :character   1st Qu.: 77.0     1st Qu.: 58.0   1st Qu.: -8.000
##  Mode  :character   Median :128.0     Median :107.0   Median :  0.000
##                     Mean   :129.3     Mean   :108.1   Mean   :  7.094
##                     3rd Qu.:165.0     3rd Qu.:141.0   3rd Qu.: 11.000
##                     Max.   :575.0     Max.   :549.0   Max.   :978.000
##                     NA's   :3622      NA's   :3622    NA's   :3622
##     DepDelay          Origin              Dest              Distance
##  Min.   :-33.000   Length:227496      Length:227496      Min.   :  79.0
##  1st Qu.: -3.000   Class :character   Class :character   1st Qu.: 376.0
##  Median :  0.000   Mode  :character   Mode  :character   Median : 809.0
##  Mean   :  9.445                                         Mean   : 787.8
##  3rd Qu.:  9.000                                         3rd Qu.:1042.0
##  Max.   :981.000                                         Max.   :3904.0
##  NA's   :2905
##      TaxiIn           TaxiOut         Cancelled       CancellationCode
##  Min.   :  1.000   Min.   :  1.00   Min.   :0.00000   Length:227496
##  1st Qu.:  4.000   1st Qu.: 10.00   1st Qu.:0.00000   Class :character
##  Median :  5.000   Median : 14.00   Median :0.00000   Mode  :character
##  Mean   :  6.099   Mean   : 15.09   Mean   :0.01307
##  3rd Qu.:  7.000   3rd Qu.: 18.00   3rd Qu.:0.00000
##  Max.   :165.000   Max.   :163.00   Max.   :1.00000
##  NA's   :3066      NA's   :2947
##     Diverted
##  Min.   :0.000000
##  1st Qu.:0.000000
##  Median :0.000000
##  Mean   :0.002853
##  3rd Qu.:0.000000
##  Max.   :1.000000
## ``````
``````# hflights consists of 227,496 observations and 21 variables.
nrow(hflights)``````
``## [1] 227496``
``ncol(hflights)``
``## [1] 21``
``````# Create a lookup table for the UniqueCarrier column using a named vector.
lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue", "CO" = "Continental",
"DL" = "Delta", "OO" = "SkyWest", "UA" = "United", "US" = "US_Airways",
"WN" = "Southwest", "EV" = "Atlantic_Southeast", "F9" = "Frontier",
"FL" = "AirTran", "MQ" = "American_Eagle", "XE" = "ExpressJet", "YV" = "Mesa")
hflights$Carrier <- lut[hflights$UniqueCarrier]``````

`dplyr` features five verbs.

* `select(.data, ...)` where `...` are variables. Use `:` to select a range of variables, and `-` to exclude variables, similar to indexing a data.frame with square brackets. Use variable names or integer indexes. Helper functions include `starts_with()`, `ends_with()`, `contains()`, `matches()`, `num_range()`, and `one_of()`.
* `filter(.data, one or more comparisons)` returns the rows that satisfy the comparisons. Among the operators are `==`, `!=`, and `%in%`. Combine comparisons with `&` and `|`.
* `arrange(.data, ...)` returns a sorted data set. Wrap an argument with `desc()` to sort in descending order, e.g., `arrange(desc(gdpPerCap))`.
* `mutate(.data, name-value pairs of expressions)` adds or changes columns in the data set.
* `summarise(.data, ...)` collapses rows into summary values. Base R includes several aggregate functions, and `dplyr` adds `first()`, `last()`, `nth()`, `n()`, and `n_distinct()`.

Pipe a data set with `%>%` into a verb. `group_by(.data, col(s))` defines groups for the verbs that follow it. It is most useful in combination with `summarise()` (also spelled `summarize()`), so specify `group_by()` before `summarise()`.

`dplyr` uses `%>%` from the `magrittr` package.

``````library(dplyr)
library(hflights)
hflights <- as_tibble(hflights)
lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue", "CO" = "Continental",
"DL" = "Delta", "OO" = "SkyWest", "UA" = "United", "US" = "US_Airways",
"WN" = "Southwest", "EV" = "Atlantic_Southeast", "F9" = "Frontier",
"FL" = "AirTran", "MQ" = "American_Eagle", "XE" = "ExpressJet", "YV" = "Mesa")
hflights$Carrier <- lut[hflights$UniqueCarrier]
# select example
select(hflights, UniqueCarrier, ends_with("Num"), starts_with("Cancell"))``````
``````## # A tibble: 227,496 x 5
##    UniqueCarrier FlightNum TailNum Cancelled CancellationCode
##  * <chr>             <int> <chr>       <int> <chr>
##  1 AA                  428 N576AA          0 ""
##  2 AA                  428 N557AA          0 ""
##  3 AA                  428 N541AA          0 ""
##  4 AA                  428 N403AA          0 ""
##  5 AA                  428 N492AA          0 ""
##  6 AA                  428 N262AA          0 ""
##  7 AA                  428 N493AA          0 ""
##  8 AA                  428 N477AA          0 ""
##  9 AA                  428 N476AA          0 ""
## 10 AA                  428 N504AA          0 ""
## # ... with 227,486 more rows``````
``````# mutate example
g1 <- mutate(hflights, ActualGroundTime = ActualElapsedTime - AirTime)``````
``````# filter example
hflights %>%
mutate(RealTime = ActualElapsedTime + 100, mph = 60 * Distance/ RealTime) %>%
filter(!is.na(mph) & mph < 70) %>%
group_by(UniqueCarrier) %>%
summarize(n_less = n(), n_dest = n_distinct(Dest), min_dist = min(Distance), max_dist = max(Distance))``````
``````## # A tibble: 6 x 5
##   UniqueCarrier n_less n_dest min_dist max_dist
##   <chr>          <int>  <int>    <dbl>    <dbl>
## 1 AA                40      1     224.     224.
## 2 CO              3393      4     140.     305.
## 3 MQ                12      1     247.     247.
## 4 OO               349      3     140.     224.
## 5 WN              1747      4     148.     239.
## 6 XE              1185     12      79.     253.``````
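
The examples above cover `select()`, `mutate()`, `filter()`, and `summarize()`. For completeness, here is a minimal sketch of `arrange()` with `desc()`, using the built-in `mtcars` data rather than `hflights`; the variable name `sorted` is hypothetical.

```r
library(dplyr)

# Sort mtcars by cylinders (ascending), then by mpg (descending) within each cyl.
sorted <- mtcars %>%
  select(mpg, cyl, wt) %>%
  arrange(cyl, desc(mpg))

head(sorted, 3)
```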

`dplyr` works for data frames, data tables, and databases.

Use `dplyr` to merge data instead of base R `merge()` because `dplyr` syntax is intuitive, preserves row order, and works with databases.

The four mutating joins are `left_join(tbl1, tbl2, by = c(col_names))`, `right_join`, `inner_join`, and `full_join`.
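
A minimal sketch of how the four mutating joins differ, using two small made-up tables that share a key column `id`; the table and column names are hypothetical.

```r
library(dplyr)

# Two small illustrative tables that share the key column `id`.
band <- tibble(id = c(1, 2, 3), name = c("Mick", "John", "Paul"))
plays <- tibble(id = c(2, 3, 4), instrument = c("guitar", "bass", "drums"))

left <- left_join(band, plays, by = "id")    # all rows of band; NA where plays has no match
right <- right_join(band, plays, by = "id")  # all rows of plays; NA where band has no match
inner <- inner_join(band, plays, by = "id")  # only ids found in both tables: 2 and 3
full <- full_join(band, plays, by = "id")    # every id from either table: 1, 2, 3, 4

print(full)
```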

Filtering join `semi_join` returns the rows of the first table that have a match in the second table, without adding any columns from the second table. Filtering join `anti_join` returns the rows of the first table that have no match in the second table.
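
A minimal sketch of the two filtering joins on small made-up tables; the table and column names are hypothetical.

```r
library(dplyr)

band <- tibble(id = c(1, 2, 3), name = c("Mick", "John", "Paul"))
plays <- tibble(id = c(2, 3, 4), instrument = c("guitar", "bass", "drums"))

# Rows of band that have a match in plays; only band's columns are returned.
matched <- semi_join(band, plays, by = "id")

# Rows of band that have no match in plays.
unmatched <- anti_join(band, plays, by = "id")

print(matched)
print(unmatched)
```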

`dplyr` also provides the set functions `union()`, `intersect()`, and `setdiff()`, which treat the rows of two data sets as set elements.

`setequal(set1, set2)` checks whether two data sets contain the same rows (not necessarily in the same order).
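
A minimal sketch of the set functions on two one-column tables; the tables `a` and `b` are made up for illustration.

```r
library(dplyr)

a <- tibble(x = c(1, 2, 3))
b <- tibble(x = c(3, 4))

either <- union(a, b)      # rows in either table, duplicates removed
both <- intersect(a, b)    # rows in both tables
only_a <- setdiff(a, b)    # rows in a but not in b

# setequal() ignores row order.
same_rows <- setequal(a, tibble(x = c(3, 2, 1)))
print(same_rows)
```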

If two datasets have identical structure, combine them with `bind_rows()` or `bind_cols()`, the `dplyr` equivalents of base R `rbind()` and `cbind()`.
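
A minimal sketch of both binding functions; the tables and their names are hypothetical.

```r
library(dplyr)

q1 <- tibble(id = 1:2, sales = c(10, 20))
q2 <- tibble(id = 3:4, sales = c(30, 40))

# Stack rows; columns are matched by name.
all_sales <- bind_rows(q1, q2)

# Put tables side by side; rows are matched by position, so the row counts must agree.
regions <- tibble(region = c("N", "S", "E", "W"))
with_region <- bind_cols(all_sales, regions)

print(with_region)
```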

`dplyr` offers `data_frame()` as an improvement on the base R function `data.frame()`. `data_frame()` will not change data types, alter column names, add row names, or recycle vectors longer than length one. Function `as_data_frame()` parallels the behavior of `data_frame()`: it combines a list of vectors into a data frame. It is the column equivalent of `bind_rows()`, which combines data frames. Recent versions of `tibble` deprecate both functions in favor of `tibble()` and `as_tibble()`.
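
A minimal sketch of two of these differences. It assumes `tibble()`, the current name for the constructor that `dplyr` re-exports, in place of `data_frame()`; the variable names are hypothetical.

```r
library(dplyr)

# data.frame() rewrites a non-syntactic column name by default.
df <- data.frame(`a b` = 1:2)
names(df)

# tibble() keeps the name as written.
tb <- tibble(`a b` = 1:2)
names(tb)

# tibble() recycles only length-one vectors; other length mismatches are errors.
ok <- tibble(x = 1:4, y = 0)
bad <- try(tibble(x = 1:4, y = 1:2), silent = TRUE)
inherits(bad, "try-error")
```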

``library(Lahman)``
``## Warning: package 'Lahman' was built under R version 3.4.4``
``````library(dplyr)

players <- Master %>%
distinct(playerID, nameFirst, nameLast)

players %>%
# Find unsalaried players
anti_join(Salaries, by = "playerID") %>%
# Join Batting to the unsalaried players
left_join(Batting, by = "playerID") %>%
# Group by player
group_by(playerID) %>%
# Sum at-bats for each player
summarise(total_at_bat = sum(AB, na.rm = TRUE)) %>%
# Arrange in descending order
arrange(desc(total_at_bat))``````
``````## # A tibble: 13,958 x 2
##    playerID  total_at_bat
##    <chr>            <int>
##  1 aaronha01        12364
##  2 yastrca01        11988
##  3 cobbty01         11434
##  4 musiast01        10972
##  5 mayswi01         10881
##  6 robinbr01        10654
##  7 wagneho01        10430
##  8 brocklo01        10332
##  9 ansonca01        10277
## 10 aparilu01        10230
## # ... with 13,948 more rows``````
``````library(Lahman)
library(dplyr)

# Find the distinct players that appear in HallOfFame
nominated <- HallOfFame %>%
distinct(playerID)

nominated %>%
# Count the number of players in nominated
count()``````
``````## # A tibble: 1 x 1
##       n
##   <int>
## 1  1260``````
``````# 1,260 players were nominated for the hall of fame.

nominated_full <- nominated %>%
# Join to Master
left_join(Master, by = "playerID") %>%
# Return playerID, nameFirst, nameLast
select(playerID, nameFirst, nameLast)

# Find distinct players in HallOfFame with inducted == "Y"
inducted <- HallOfFame %>%
filter(inducted == "Y") %>%
distinct(playerID)

inducted %>%
# Count the number of players in inducted
count()``````
``````## # A tibble: 1 x 1
##       n
##   <int>
## 1   317``````
``````# 317 players have been inducted.

inducted_full <- inducted %>%
# Join to Master
left_join(Master, by = "playerID") %>%
# Return playerID, nameFirst, nameLast
select(playerID, nameFirst, nameLast)

# Tally the number of awards in AwardsPlayers by playerID
nAwards <- AwardsPlayers %>%
group_by(playerID) %>%
tally()

nAwards %>%
# Filter to just the players in inducted
semi_join(inducted, by = "playerID") %>%
# Calculate the mean number of awards per player
summarize(avg_n = mean(n, na.rm = TRUE))``````
``````## # A tibble: 1 x 1
##   avg_n
##   <dbl>
## 1  12.1``````
``````nAwards %>%
# Filter to just the players in nominated
semi_join(nominated, by = "playerID") %>%
# Filter to players NOT in inducted
anti_join(inducted, by = "playerID") %>%
# Calculate the mean number of awards per player
summarize(avg_n = mean(n, na.rm = TRUE))``````
``````## # A tibble: 1 x 1
##   avg_n
##   <dbl>
## 1  4.23``````
``````# On average, inductees had about 12.1 - 4.23 = 7.9 more awards than non-inductees.

# Find the players who are in nominated, but not inducted
notInducted <- nominated %>%
setdiff(inducted)

Salaries %>%
# Find the players who are in notInducted
semi_join(notInducted, by = "playerID") %>%
# Calculate the max salary by player
group_by(playerID) %>%
summarize(max_salary = max(salary, na.rm = TRUE)) %>%
# Calculate the average of the max salaries
summarize(avg_salary = mean(max_salary, na.rm = TRUE)) ``````
``````## # A tibble: 1 x 1
##   avg_salary
##        <dbl>
## 1   5230273.``````
``````# Repeat for players who were inducted
Salaries %>%
semi_join(inducted, by = "playerID") %>%
group_by(playerID) %>%
summarize(max_salary = max(salary, na.rm = TRUE)) %>%
summarize(avg_salary = mean(max_salary, na.rm = TRUE))``````
``````## # A tibble: 1 x 1
##   avg_salary
##        <dbl>
## 1   6092038.``````
``````Appearances %>%
# Filter Appearances against nominated
semi_join(nominated, by = "playerID") %>%
# Find last year played by player
group_by(playerID) %>%
summarize(last_year = max(yearID)) %>%
# Join to full HallOfFame
left_join(HallOfFame, by = "playerID") %>%
# Filter for unusual observations
filter((yearID - last_year)<5)``````
``````## # A tibble: 194 x 10
##    playerID  last_year yearID votedBy ballots needed votes inducted
##    <chr>         <dbl>  <int> <chr>     <int>  <int> <int> <fct>
##  1 altroni01     1933.   1937 BBWAA       201    151     3 N
##  2 applilu01     1950.   1953 BBWAA       264    198     2 N
##  3 bartedi01     1946.   1948 BBWAA       121     91     1 N
##  4 beckro01      2004.   2008 BBWAA       543    408     2 N
##  5 boudrlo01     1952.   1956 BBWAA       193    145     2 N
##  6 camildo01     1945.   1948 BBWAA       121     91     1 N
##  7 chandsp01     1947.   1950 BBWAA       168    126     2 N
##  8 chandsp01     1947.   1951 BBWAA       226    170     1 N
##  9 chapmbe01     1946.   1949 BBWAA       153    115     1 N
## 10 cissebi01     1938.   1937 BBWAA       201    151     1 N
## # ... with 184 more rows, and 2 more variables: category <fct>,
## #   needed_note <chr>``````

## Data Visualization

Data visualization is about exploratory analysis (investigative) and explanatory analysis.

There are seven grammatical layers of plots; three are required: data, aesthetics, and geometries. The other elements are facets (subplots), statistics (e.g., fitted lines), coordinates, and themes. The grammar of graphics is implemented in the `ggplot2` package.

Base R provides plotting functionality, but it comes with limitations. The plot is drawn as an image, not returned as an object, so you cannot manipulate it further. It does not add a legend automatically. There is a separate function for each plot type, and the lack of a unified framework means you have to learn each one separately: `points()`, `hist()`, etc.

Scale the x axis with a `scale_x_log10` layer. There are two main reasons to use logarithmic scales in charts and graphs. The first is to respond to skewness towards large values; i.e., cases in which one or a few points are much larger than the bulk of the data. The second is to show percent change or multiplicative factors. On a scaled axis with base 2, the value of each tick mark is double the value of the preceding one. An example of a multiplicative factor is constant acceleration.
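
A minimal sketch of the first use case, on made-up data in which each `x` value is ten times the previous one; the data frame and plot names are hypothetical.

```r
library(ggplot2)

# Illustrative, right-skewed data: each x value is ten times the previous one.
skewed <- data.frame(x = 10^(0:5), y = 1:6)

# On a linear axis the small values bunch together near zero.
p_linear <- ggplot(skewed, aes(x = x, y = y)) +
  geom_point()

# scale_x_log10() spaces the points evenly, one step per factor of ten.
p_log <- p_linear + scale_x_log10()

print(p_log)
```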

### Scatterplots

For scatterplots, map `x`, `y`, `color`, and `shape` in the aesthetic layer. Map `size`, `fill`, `shape`, `alpha` (transparency), and `position` (e.g., “jitter”) in the `geom_point` layer.

``````mtcars$cyl <- as.factor(mtcars$cyl)

# Use base R to create a plot with a series for each cyl value.
# Add a linear fit line through the points, one for each series, and one overall.
plot(mtcars$wt, mtcars$mpg, col = factor(mtcars$cyl))
abline(lm(mpg ~ wt, data = mtcars), lty = 2)
# Loop over the three cyl levels rather than all 32 rows.
# invisible() suppresses the list of NULLs that lapply() would otherwise print.
invisible(lapply(levels(mtcars$cyl), function(x) {
abline(lm(mpg ~ wt, mtcars, subset = (cyl == x)), col = which(levels(mtcars$cyl) == x))
}))``````
``````legend(x = 5, y = 33, legend = levels(mtcars$cyl),
col = 1:3, pch = 1, bty = "n")

# Again in ggplot2
# The first geom_smooth inherits the ggplot color aesthetic as its group.
# The second geom_smooth explicitly sets group to a dummy 1.  The col = "All" adds it to the legend.
# When mapping onto color you can sometimes treat a continuous scale, like year, as an ordinal variable, but only if it is a regular series. The better alternative is to leave it as a continuous variable and use the group aesthetic as a factor to make sure your plot is drawn correctly.
library(ggplot2)``````
``## Warning: package 'ggplot2' was built under R version 3.4.4``

``````ggplot(mtcars, aes(x = wt, y = mpg, col = cyl, group = factor(cyl))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
geom_smooth(method = "lm", se = FALSE, linetype = 2, aes(group = 1, col = "All"))``````

`ggplot` can visualize four attributes at once with `x`, `y`, `col`, and `facet_grid`. Such graphing requires tidy data, which in turn requires thoughtful definitions of metrics. In the iris data set, if measuring length vs width, then those are separate variables (cols). If measuring length (or width) vs species, then species is a variable. If measuring length (or width) vs part of flower (petal vs sepal), then flower part is a variable. To look at all four together, then length and width are members of the measure variable (because length and width share units).

``````library(ggplot2)
library(tidyr)

iris.tidy <- iris %>%
# gather(data, key, value, <cols>)
# Transpose all cols to rows except the identifier cols (Species)
# The former column name becomes a value in the key column.
gather(key, Value, -Species) %>%
# separate(data, col, into, sep)
separate(col = key, into = c("Part", "Measure"), sep = "\\.")

# If we want to plot Length vs Width, then each should be a column.
# Flower identifies each row so spread() can pair Length with Width.
iris$Flower <- 1:nrow(iris)
iris.wide <- iris %>%
gather(key, value, -Species, -Flower) %>%
separate(key, c("Part", "Measure"), "\\.") %>%
spread(Measure, value)

ggplot(iris.wide, aes(x = Length, y = Width, color = Part)) +
geom_jitter() +
facet_grid(. ~ Species)``````

Typical aesthetics are `x`, `y`, `colour`, `fill`, `size`, `alpha`, `linetype`, `label`, and `shape`. `shape`s 1:20 accept only the `color` aesthetic, while `shape`s 21:25 accept both `color` and `fill`.

One common technique to use with solid shapes is alpha blending (i.e. adding transparency). An alternative is to use hollow shapes.

``````library(ggplot2)

# Basic scatter plot: wt on x-axis and mpg on y-axis; map cyl to col
ggplot(mtcars, aes(x = wt, y = mpg, color = cyl)) +
geom_point(size = 4)``````

``````# Hollow circles - an improvement
ggplot(mtcars, aes(x = wt, y = mpg, color = cyl)) +
geom_point(size = 4, shape = 1)``````
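
The two plots above contrast solid and hollow points. A minimal sketch of the other techniques mentioned, alpha blending and a fillable shape, follows; the alpha value of 0.6 and the plot names are arbitrary choices for illustration.

```r
library(ggplot2)

# Solid points with transparency: overlapping points appear darker.
p_alpha <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 4, alpha = 0.6)

# Shape 21 takes both an outline color and a fill.
p_fill <- ggplot(mtcars, aes(x = wt, y = mpg, fill = factor(cyl))) +
  geom_point(size = 4, shape = 21, color = "black", alpha = 0.6)

print(p_alpha)
print(p_fill)
```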