Base R Tutorial:

Setting up Your Workspace

The working directory is where R, by default, will search for input files and save all output. To set the working directory you can use the setwd() command. The working directory is essentially a file path or address, which can be copied from the explorer window almost directly into the RStudio console. There are two caveats, however:

  1. The first is that the file path must be surrounded by quotes "".
  2. The second is that each backslash "\" is replaced by a forward slash "/" or by an additional backslash, so that each folder separator appears as "\\".

These two rules are important because they apply anytime you want to import external csv or xlsx files, as well as when you are saving output.

Here is an example:

setwd("C:/….")
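
For instance, a minimal sketch showing both accepted separator styles (the folder path here is hypothetical; substitute your own) and how to confirm the change with getwd():

setwd("C:/Users/me/Documents/project")
# setwd("C:\\Users\\me\\Documents\\project")   # the doubled-backslash form is equivalent
getwd()   # prints the current working directory so you can confirm it changed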

Assigning Values to Objects

Instead of cells like a spreadsheet, R leverages the concept of objects. An object is a container that stores information. Objects can store all sorts of values, a fact that we will discuss further in the section on data types. There are also several different kinds of objects, or storage containers; we will get to these types later on.

Assign a value to my_apples:

my_apples <- 5 
my_apples
## [1] 5

Assign a value to my_oranges:

my_oranges <- 6 
my_oranges
## [1] 6

You can also assign objects using previously created objects. Assign to my_fruit the sum of both fruit objects:

my_fruit <- my_apples + my_oranges 
my_fruit
## [1] 11

Data Types:

As previously mentioned, there are several data types that R can support within an object. These include:

  - Decimal values like 4.5 are called numerics.
  - Natural numbers like 4 are called integers. Integers are also numerics.
  - Boolean values (TRUE or FALSE) are called logical.
  - Text (or string) values are called characters.
  - Factors are categorical values built from character values, optionally with an inherent ordering.
  - Dates are values representing calendar dates, such as "1/1/2019".
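
A quick way to confirm these types in the console is the class() function. A small illustration (the Date value is created with as.Date(), which we return to in the Dates section):

class(4.5)                     # "numeric"
class(4L)                      # "integer" (the L suffix forces an integer)
class(TRUE)                    # "logical"
class("universe")              # "character"
class(factor("Male"))          # "factor"
class(as.Date("2019-01-01"))   # "Date"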

Numeric:

my_numeric <- 42 
my_numeric
## [1] 42

Boolean:

my_logical <- FALSE 
my_logical
## [1] FALSE

Character:

my_character <- "universe" 
my_character
## [1] "universe"

Factor

A factor is a character variable with a discrete number of categories.

To create factors in R, you make use of the function factor().

For example, the sex_vector contains the sex of 5 different individuals:

sex_vector <- c("Male","Female","Female","Male","Male")

It is clear that there are two categories, or in R-terms ‘factor levels’, at work here: “Male” and “Female”.

The function factor() will encode the vector as a factor:

(factor_sex_vector <- factor(sex_vector))
## [1] Male   Female Female Male   Male  
## Levels: Female Male

There are two types of factor variables:

  1. Nominal categorical variable

A nominal variable is a categorical variable without an implied order. This means that it is impossible to say that one is worth more than the other.

For example, think of the categorical variable animals_vector with the categories “Elephant”, “Giraffe”, “Donkey” and “Horse”. Here, it is impossible to say that one stands above or below the other.

  2. Ordinal categorical variable

In contrast, ordinal variables do have a natural ordering. Consider for example the categorical variable temperature_vector with the categories: “Low”, “Medium” and “High”. Here it is obvious that “Medium” stands above “Low”, and “High” stands above “Medium”.

By default, the function factor() transforms vectors into an unordered factor. To create an ordered factor, you have to add two additional arguments: ordered and levels. Take the following script for example:

factor(some_vector, ordered = TRUE, levels = c("lev1", "lev2", …))

By setting the argument ordered to TRUE in the function factor(), you indicate that the factor is ordered. With the argument levels you give the values of the factor in the correct order.

temperature_vector <- c("High", "Low", "High","Low", "Medium")
factor_temperature_vector <- factor(temperature_vector, ordered = TRUE, levels = c("Low", "Medium", "High"))
factor_temperature_vector
## [1] High   Low    High   Low    Medium
## Levels: Low < Medium < High

The fact that factor_temperature_vector is now ordered enables us to compare different elements. You can do this simply by using the well-known comparison operators.

factor_temperature_vector[1] > factor_temperature_vector[2]
## [1] TRUE

To get an output of the levels associated with a factor, you can use the levels command:

levels(factor_temperature_vector)
## [1] "Low"    "Medium" "High"

There are two additional handy commands that can be used to understand the data type of an R object: the str() and summary() functions.

The str() function provides a compact summary of a vector, showing its data type and its first values.

str(factor_temperature_vector)
##  Ord.factor w/ 3 levels "Low"<"Medium"<..: 3 1 3 1 2

The summary() command, on the other hand, provides a statistical summary. If the vector is numeric, R provides a summary of the minimum, quartiles, mean, and maximum; if it is categorical, it provides a frequency count.

summary(factor_temperature_vector)
##    Low Medium   High 
##      2      1      2

NEED TO ADD SECTION ON ORDERING AND MODIFYING FACTORS FROM R FOR DATA SCIENCE
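
In the meantime, here is a minimal sketch of the kind of reordering and relabelling covered there, assuming the forcats package (part of the tidyverse) is installed; the sizes vector is purely illustrative:

library(forcats)
sizes <- factor(c("small", "large", "medium", "small"))
fct_relevel(sizes, "small", "medium", "large")                  # put the levels in an explicit order
fct_recode(sizes, sm = "small", md = "medium", lg = "large")    # rename the levels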

Dates

Dates in Base R

In base R, date-time formats consist of the following pieces:

Year
- %Y (4 digits).
- %y (2 digits); 00-69 -> 2000-2069, 70-99 -> 1970-1999.

Month
- %m (2 digits).
- %b (abbreviated name, like "Jan").
- %B (full name, "January").

Day
- %d (2 digits).
- %e (optional leading space).

Time
- %H 0-23 hour.
- %I 0-12, must be used with %p.
- %p AM/PM indicator.
- %M minutes.
- %S integer seconds.
- %OS real seconds.
- %Z Time zone (as name, e.g. America/Chicago).

Beware of abbreviations: if you’re American, note that “EST” is a Canadian time zone that does not have daylight savings time. It is not Eastern Standard Time! We’ll come back to time zones later.

- %z (as offset from UTC, e.g. +0800).
- %. skips one non-digit character.
- %* skips any number of non-digits.
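
As a quick illustration of these codes with base R's own parsers (the date strings below are made up; %b assumes an English locale):

as.Date("31/01/2019", format = "%d/%m/%Y")                                  # day/month/4-digit year
as.Date("Jan 31 19", format = "%b %d %y")                                   # abbreviated month, 2-digit year
strptime("2019-01-31 20:11:59", format = "%Y-%m-%d %H:%M:%S", tz = "UTC")   # full date-time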

Dates with Lubridate

There are three types of date/time data that refer to an instant in time:

  1. A date. Tibbles print this as <date>.

  2. A time within a day. Tibbles print this as <time>.

  3. A date-time is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second). Tibbles print this as <dttm>. Elsewhere in R these are called POSIXct, but I don't think that's a very useful name.

In this chapter we are only going to focus on dates and date-times as R doesn’t have a native class for storing times. If you need one, you can use the hms package.

You should always use the simplest possible data type that works for your needs. That means if you can use a date instead of a date-time, you should. Date-times are substantially more complicated because of the need to handle time zones, which we’ll come back to at the end of the chapter.

To get the current date or date-time you can use today() or now():

library(lubridate)
## Warning: package 'lubridate' was built under R version 3.4.4
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
today()
## [1] "2019-03-24"
now()
## [1] "2019-03-24 18:24:25 EDT"

Otherwise, there are three ways you’re likely to create a date/time:

  1. From a string.
  2. From individual date-time components.
  3. From an existing date/time object.
From strings

Date/time data often comes as strings. You've already seen one approach to parsing strings into date-times (the base R format codes above). Another approach is to use the helpers provided by lubridate.

They automatically work out the format once you specify the order of the components.

To use them, identify the order in which year, month, and day appear in your dates, then arrange “y”, “m”, and “d” in the same order. That gives you the name of the lubridate function that will parse your date. For example:

ymd("2017-01-31")
## [1] "2017-01-31"
mdy("January 31st, 2017")
## [1] "2017-01-31"
dmy("31-Jan-2017")
## [1] "2017-01-31"

These functions also take unquoted numbers. This is the most concise way to create a single date/time object, as you might need when filtering date/time data. ymd() is short and unambiguous:

ymd(20170131)

ymd() and friends create dates. To create a date-time, add an underscore and one or more of “h”, “m”, and “s” to the name of the parsing function:

ymd_hms("2017-01-31 20:11:59")
mdy_hm("01/31/2017 08:01")

From individual components

Instead of a single string, sometimes you’ll have the individual components of the date-time spread across multiple columns. This is what we have in the flights data:

To create a date/time from this sort of input, use make_date() for dates, or make_datetime() for date-times:
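
As a quick standalone illustration before applying this to flights (the date and time values are arbitrary):

make_date(2013, 1, 1)              # "2013-01-01"
make_datetime(2013, 1, 1, 5, 15)   # "2013-01-01 05:15:00 UTC"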

library(nycflights13)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.4
## -- Attaching packages ------------------------------------------------------------------------------------------ tidyverse 1.2.1 --
## v ggplot2 3.1.0     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.7
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.2.0
## Warning: package 'tidyr' was built under R version 3.4.4
## Warning: package 'dplyr' was built under R version 3.4.4
## Warning: package 'stringr' was built under R version 3.4.4
## -- Conflicts --------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date()        masks base::date()
## x dplyr::filter()          masks stats::filter()
## x lubridate::intersect()   masks base::intersect()
## x dplyr::lag()             masks stats::lag()
## x lubridate::setdiff()     masks base::setdiff()
## x lubridate::union()       masks base::union()
flights %>% 
select(year, month, day, hour, minute) %>% 
mutate(departure = make_datetime(year, month, day, hour, minute))  
## Warning: package 'bindrcpp' was built under R version 3.4.4
## # A tibble: 336,776 x 6
##     year month   day  hour minute departure          
##    <int> <int> <int> <dbl>  <dbl> <dttm>             
##  1  2013     1     1  5.00   15.0 2013-01-01 05:15:00
##  2  2013     1     1  5.00   29.0 2013-01-01 05:29:00
##  3  2013     1     1  5.00   40.0 2013-01-01 05:40:00
##  4  2013     1     1  5.00   45.0 2013-01-01 05:45:00
##  5  2013     1     1  6.00    0   2013-01-01 06:00:00
##  6  2013     1     1  5.00   58.0 2013-01-01 05:58:00
##  7  2013     1     1  6.00    0   2013-01-01 06:00:00
##  8  2013     1     1  6.00    0   2013-01-01 06:00:00
##  9  2013     1     1  6.00    0   2013-01-01 06:00:00
## 10  2013     1     1  6.00    0   2013-01-01 06:00:00
## # ... with 336,766 more rows

Let’s do the same thing for each of the four time columns in flights. The times are represented in a slightly odd format, so we use modulus arithmetic to pull out the hour and minute components. Once I’ve created the date-time variables, I focus in on the variables we’ll explore in the rest of the chapter.
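
To see what that modulus arithmetic does, take a departure time stored as 517 (meaning 5:17 am):

517 %/% 100   # integer division by 100 gives the hour: 5
517 %% 100    # the remainder gives the minutes: 17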

make_datetime_100 <- function(year, month, day, time) {
  make_datetime(year, month, day, time %/% 100, time %% 100)
}
flights_dt <- flights %>% 
  filter(!is.na(dep_time), !is.na(arr_time)) %>% 
  mutate(
    dep_time = make_datetime_100(year, month, day, dep_time),
    arr_time = make_datetime_100(year, month, day, arr_time),
    sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
    sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
  ) %>% 
  select(origin, dest, ends_with("delay"), ends_with("time"))
flights_dt
## # A tibble: 328,063 x 9
##    origin dest  dep_delay arr_delay dep_time           
##    <chr>  <chr>     <dbl>     <dbl> <dttm>             
##  1 EWR    IAH        2.00     11.0  2013-01-01 05:17:00
##  2 LGA    IAH        4.00     20.0  2013-01-01 05:33:00
##  3 JFK    MIA        2.00     33.0  2013-01-01 05:42:00
##  4 JFK    BQN       -1.00    -18.0  2013-01-01 05:44:00
##  5 LGA    ATL       -6.00    -25.0  2013-01-01 05:54:00
##  6 EWR    ORD       -4.00     12.0  2013-01-01 05:54:00
##  7 EWR    FLL       -5.00     19.0  2013-01-01 05:55:00
##  8 LGA    IAD       -3.00    -14.0  2013-01-01 05:57:00
##  9 JFK    MCO       -3.00    - 8.00 2013-01-01 05:57:00
## 10 LGA    ORD       -2.00      8.00 2013-01-01 05:58:00
## # ... with 328,053 more rows, and 4 more variables: sched_dep_time <dttm>,
## #   arr_time <dttm>, sched_arr_time <dttm>, air_time <dbl>

With this data, I can visualise the distribution of departure times across the year:

flights_dt %>% 
ggplot(aes(dep_time)) + 
geom_freqpoly(binwidth = 86400) # 86400 seconds = 1 day

Or within a single day:

flights_dt %>% 
filter(dep_time < ymd(20130102)) %>% 
ggplot(aes(dep_time)) + 
geom_freqpoly(binwidth = 600) # 600 s = 10 minutes

From other types

You may want to switch between a date-time and a date. That’s the job of as_datetime() and as_date():

as_datetime(today())
## [1] "2019-03-24 UTC"
as_date(now())
## [1] "2019-03-24"

Sometimes you’ll get date/times as numeric offsets from the “Unix Epoch”, 1970-01-01. If the offset is in seconds, use as_datetime(); if it’s in days, use as_date().

as_datetime(60 * 60 * 10)
## [1] "1970-01-01 10:00:00 UTC"
as_date(365 * 10 + 2)
## [1] "1980-01-01"
Date/Time Components
Getting components

You can pull out individual parts of the date with the accessor functions:

- year()
- month()
- mday() (day of the month)
- yday() (day of the year)
- wday() (day of the week)
- hour(), minute(), and second()

datetime <- ymd_hms("2016-07-08 12:34:56")
year(datetime)
## [1] 2016
month(datetime)
## [1] 7
mday(datetime)
## [1] 8
yday(datetime)
## [1] 190
wday(datetime)
## [1] 6
hour(datetime)
## [1] 12
minute(datetime)
## [1] 34
second(datetime)
## [1] 56

For month() and wday() you can set label = TRUE to return the abbreviated name of the month or day of the week. Set abbr = FALSE to return the full name.

month(datetime, label = TRUE)
## [1] Jul
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
wday(datetime, label = TRUE, abbr = FALSE)
## [1] Friday
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday

Object Types:

Vectors: What are they?

A vector is a sequence of data elements of the same basic type.

Creating Vectors:

To create a vector you use the c() function, separating each value with a comma. Remember, the data type must be the same for each value.

numeric_vector <- c(1, 10, 49)

numeric_vector
## [1]  1 10 49
character_vector <- c("a", "b", "c")

character_vector
## [1] "a" "b" "c"
boolean_vector <- c(TRUE, FALSE, TRUE)

boolean_vector
## [1]  TRUE FALSE  TRUE

Naming a Vector:

Poker winnings from Monday to Friday

poker_vector <- c(140, -50, 20, -120, 240)
poker_vector
## [1]  140  -50   20 -120  240

Roulette winnings from Monday to Friday

roulette_vector <- c(-24, -50, 100, -350, 10)
roulette_vector
## [1]  -24  -50  100 -350   10

Assign days as names of poker_vector

names(poker_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

Create a days_vector containing the names of each day:

days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

Assign the names of the day to roulette_vector and poker_vector

names(poker_vector) <- c(days_vector)
names(roulette_vector) <- c(days_vector)

With the names assigned, you now see the names alongside the values

poker_vector
##    Monday   Tuesday Wednesday  Thursday    Friday 
##       140       -50        20      -120       240
roulette_vector
##    Monday   Tuesday Wednesday  Thursday    Friday 
##       -24       -50       100      -350        10

Comparing vectors

To check if you realized higher total gains in poker than in roulette we can use logical comparison operators. R by default recognizes the following:

  - < for less than
  - > for greater than
  - <= for less than or equal to
  - >= for greater than or equal to
  - == for equal to each other
  - != for not equal to each other

Greater and less than

You can compare two vectors to one another using the equality and greater/less than operators.

You can also add an equal sign to express less than or equal to, or greater than or equal to, respectively. Have a look at the following R expressions, which all evaluate to FALSE:

(1 + 2) > 4

"dog" < "Cats"

TRUE <= FALSE

Remember that for string comparison, R determines the greater than relationship based on alphabetical order. Also, keep in mind that TRUE corresponds to 1 in R, and FALSE coerces to 0 behind the scenes.

Therefore, FALSE < TRUE is TRUE.

linkedin <- c(16, 9, 13, 5, 2, 17, 14)
facebook <- c(17, 7, 5, 16, 8, 13, 14)
linkedin > 15
## [1]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
linkedin <= 5
## [1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
linkedin > facebook
## [1] FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE
Equality

The following statements all evaluate to TRUE:

3 == (2 + 1)
## [1] TRUE

The != operator is the inverse of ==; it states that x is not equal to y:

"intermediate" != "r" 
## [1] TRUE
TRUE != FALSE 
## [1] TRUE
"Rchitect" != "rchitect"
## [1] TRUE

TRUE and TRUE evaluates to TRUE:

TRUE & TRUE
## [1] TRUE

FALSE or TRUE evaluates to TRUE:

FALSE | TRUE
## [1] TRUE
& and |

The & symbol is R's equivalent of Excel's AND() function, and | represents OR():

last <- tail(linkedin, 1)

Is last under 5 or above 10?

last < 5 | last > 10
## [1] TRUE

To test an interval such as 3 < x < 7, you need to write it with &'s, such as:

3 < x & x < 7

Is last between 15 (exclusive) and 20 (inclusive)?

last >15 & last <=20
## [1] FALSE
5 <= 5 & 2 < 3
## [1] TRUE
3 < 4 | 7 < 6
## [1] TRUE

Like relational operators, logical operators work perfectly fine with vectors and matrices.

linkedin > 10 & facebook < 10
## [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

When are views greater than or equal to 12 on either platform?

linkedin >= 12 | facebook >= 12 | linkedin >= 12 & facebook >= 12
## [1]  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE

When is facebook above 11 (exclusive) and linkedin at most 14 (inclusive)?

facebook > 11 & linkedin <= 14
## [1] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
The if statement

The if statement enables you to set the conditions under which R executes certain expressions:

if (condition) { expr }

medium <- "LinkedIn"
num_views <- 14

Examine the if statement for medium

if (medium == "LinkedIn") {
  print("Showing LinkedIn information")
}
## [1] "Showing LinkedIn information"
Add an else

You can only use an else statement in combination with an if statement. The else statement does not require a condition; its corresponding code is simply run if all of the preceding conditions in the control structure are FALSE. Here’s a recipe for its usage: if (condition) { expr1 } else { expr2 }

It’s important that the else keyword comes on the same line as the closing bracket of the if part!

if (medium == "LinkedIn") {
print("Showing LinkedIn information")
} else {
print("Unknown medium")
}
## [1] "Showing LinkedIn information"
Customize further: else if

The else if statement allows you to further customize your control structure. You can add as many else if statements as you like. Keep in mind that R ignores the remainder of the control structure once a condition has been found that is TRUE and the corresponding expressions have been executed. Here’s an overview of the syntax to freshen your memory: if (condition1) { expr1 } else if (condition2) { expr2 } else if (condition3) { expr3 } else { expr4 }

Again, it's important that the else if keyword comes on the same line as the closing bracket of the previous part of the control construct!

if (medium == "LinkedIn") {
  print("Showing LinkedIn information")
} else if (medium == "Facebook") {
print("Showing Facebook information")
} else {
  print("Unknown medium")
}
## [1] "Showing LinkedIn information"

Vector Selection:

Let's summarize what we have learned by putting it all together:

poker_vector <- c(140, -50, 20, -120, 240)
roulette_vector <- c(-24, -50, 100, -350, 10)
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(poker_vector) <- days_vector
names(roulette_vector) <- days_vector

In step order we:

  1. Created a Poker Vector with our daily winnings
  2. Created a Roulette Vector with our daily winnings
  3. Created a days vector with the names of each day of the week
  4. Assigned the day names to each daily winnings number.

In addition to creating objects and naming them, we can also select data from objects that already exist.

To select data from a vector you simply specify the location of the desired observations in the vector. For example, below we select the third element from the poker_vector to get our winnings for Wednesday, the third day of the week:

(poker_wednesday <- poker_vector[3])
## Wednesday 
##        20

We can also select the same data by using the observation's name instead of its location:

(poker_wednesday <- poker_vector["Wednesday"])
## Wednesday 
##        20

We can also define a new variable based on a selection with multiple values:

(poker_midweek <- poker_vector[c(2,3,4)])
##   Tuesday Wednesday  Thursday 
##       -50        20      -120

We can also define a new variable based on a range of values in the selection: roulette_selection_vector <- roulette_vector[c(2:5)]


We can also select poker results for multiple days using names rather than locations, as we did previously:

poker_start <- poker_vector[c("Monday","Tuesday","Wednesday")]

To subselect for days where we had positive winnings we can use the > logical operator:

(selection_vector <- poker_vector > 0)
##    Monday   Tuesday Wednesday  Thursday    Friday 
##      TRUE     FALSE      TRUE     FALSE      TRUE

In addition to selecting observations using their position and names, we can also select observations using other objects.

(poker_winning_days <- poker_vector[selection_vector])
##    Monday Wednesday    Friday 
##       140        20       240

In this situation, what R is doing is actually a multi-step process:

  1. It compares every element of poker_vector to 0, creating a vector of TRUE/FALSE values (selection_vector).
  2. It then subsets poker_vector, keeping only the observations that are TRUE.

This is a very powerful method that will be used going forward.

Matrix:

What’s a matrix?

In R, a matrix is a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns.

Since you are only working with rows and columns, a matrix is called two-dimensional.

You can construct a matrix in R with the matrix() function. Consider the following example:

matrix(1:9, byrow = TRUE, nrow = 3)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

Let’s break down the matrix function step-by-step:

  1. The first argument is the collection of elements that R will arrange into the rows and columns of the matrix.

Here, we use 1:9 which is a shortcut for c(1, 2, 3, 4, 5, 6, 7, 8, 9). We introduced this briefly in our selection section.

  2. The argument byrow indicates that the matrix is filled by the rows. If we want the matrix to be filled by the columns, we just place byrow = FALSE.

  3. The third argument nrow indicates that the matrix should have three rows. If we change the number of rows we will see the effect this has.

(matrix(1:9, byrow= TRUE, nrow=9))
##       [,1]
##  [1,]    1
##  [2,]    2
##  [3,]    3
##  [4,]    4
##  [5,]    5
##  [6,]    6
##  [7,]    7
##  [8,]    8
##  [9,]    9

Creating A Matrix:

To create a matrix we will do the following:

  1. Create one vector per film containing the US and non-US box office figures for Star Wars (in millions!).
  2. Use c(new_hope, empire_strikes, return_jedi) to combine the three vectors into one vector called box_office.
  3. Construct star_wars_matrix with matrix(), specifying nrow = 3 and byrow = TRUE.

new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi <- c(309.306, 165.8)
box_office <- c(new_hope, empire_strikes, return_jedi)
star_wars_matrix <- matrix(box_office, byrow = TRUE, nrow = 3)

Naming a Matrix:

Just as we named elements in our vector, we can also create names for our matrix:

To do so we first create our naming vectors, one for the columns, the other for rows:

region <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")

We can assign column names using the colnames() command:

colnames(star_wars_matrix) <- region

For row names, the procedure is the same, only the rownames() command is used:

rownames(star_wars_matrix) <- titles

If we print the star_wars_matrix we will now see all values with their names:

star_wars_matrix
##                              US non-US
## A New Hope              460.998  314.4
## The Empire Strikes Back 290.475  247.9
## Return of the Jedi      309.306  165.8

As it turns out, there is an easier way to complete the process of creating and naming a matrix. It can all be done in one step within the matrix() function, using the dimnames argument:

box_office <- c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8)
(star_wars_matrix <- matrix(box_office, nrow = 3, byrow = TRUE,
dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"), 
c("US", "non-US"))))
##                              US non-US
## A New Hope              460.998  314.4
## The Empire Strikes Back 290.475  247.9
## Return of the Jedi      309.306  165.8

Modifying Existing Matrices

Adding Columns

While it is handy to know how to create a matrix, it is also necessary to know how to modify an existing one:

You can add a column or multiple columns to a matrix with the cbind() function, which merges matrices and/or vectors together by column. For example: big_matrix <- cbind(matrix1, matrix2, vector1 …)

Using our Star Wars matrix, we can append the row totals to our matrix to create a grand-total column:

worldwide_vector <- rowSums(star_wars_matrix)
all_wars_matrix <- cbind(star_wars_matrix,worldwide_vector)
all_wars_matrix
##                              US non-US worldwide_vector
## A New Hope              460.998  314.4          775.398
## The Empire Strikes Back 290.475  247.9          538.375
## Return of the Jedi      309.306  165.8          475.106
Adding Rows

In addition to appending columns, we can also append rows to the bottom of our matrix:

sequels <- c(120, 150, 170, 190, 200, 210)
(star_wars_matrix2 <- matrix(box_office, nrow = 3, byrow = TRUE,
dimnames = list(c("Episode 1", "Episode 2", "Episode 3"), 
c("US", "non-US"))))
##                US non-US
## Episode 1 460.998  314.4
## Episode 2 290.475  247.9
## Episode 3 309.306  165.8

To do this we use the rbind() command, in which we list the matrices we wish to merge: all_wars_matrix <- rbind(star_wars_matrix, star_wars_matrix2)

To calculate the total of each row, we can use the rowSums() command:

rowSums(star_wars_matrix)
##              A New Hope The Empire Strikes Back      Return of the Jedi 
##                 775.398                 538.375                 475.106

To calculate the total of each column, we use the colSums() command:

colSums(star_wars_matrix)
##       US   non-US 
## 1060.779  728.100

Comparing Matrices

R’s ability to deal with different data structures for comparisons does not stop at vectors. Matrices and relational operators also work together seamlessly!

Instead of in vectors (as in the previous exercise), the LinkedIn and Facebook data is now stored in a matrix called views. You can create this matrix using the matrix() command, combining the previously built vectors with c().

The first row contains the LinkedIn information; the second row the Facebook information. The original vectors facebook and linkedin are still available as well.

linkedin <- c(16, 9, 13, 5, 2, 17, 14)
facebook <- c(17, 7, 5, 16, 8, 13, 14)
views <- matrix(c(linkedin, facebook), nrow = 2, byrow = TRUE)

When does views equal 13?

views == 13
##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]
## [1,] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE

When is views less than or equal to 14?

views <= 14
##       [,1] [,2] [,3]  [,4] [,5]  [,6] [,7]
## [1,] FALSE TRUE TRUE  TRUE TRUE FALSE TRUE
## [2,] FALSE TRUE TRUE FALSE TRUE  TRUE TRUE

Matrix Math

To practice doing element-wise mathematical operations, we can create two vectors and total them:

A_vector <- c(1, 2, 3)
B_vector <- c(4, 5, 6)
(total_vector <- c(A_vector + B_vector))
## [1] 5 7 9

Similarly, we can calculate our total winnings for the week with the command below. Note that if you wrap an assignment in parentheses, "( )", R will not only complete the assignment of the object, but will also print the result:

(total_daily <- c(roulette_vector + poker_vector))
##    Monday   Tuesday Wednesday  Thursday    Friday 
##       116      -100       120      -470       250

To calculate the total of our poker winnings only we can total our poker vector with the sum() function:

(total_poker <- sum(poker_vector))
## [1] 230

We can do the same to calculate our total roulette winnings:

(total_roulette <-  sum(roulette_vector))
## [1] -314

In addition to the sum, we can also calculate the average of the daily winnings using the mean() command:

(average_roulette <- mean(roulette_vector))
## [1] -62.8

To get a grand total of the winnings for the week we can sum the two total vectors:

(total_week <- total_roulette + total_poker)
## [1] -84

Data Frames:

What’s a data frame?

You may remember from the chapter about matrices that all the elements that you put in a matrix should be of the same type. Back then, your data set on Star Wars only contained numeric elements.

When doing a market research survey, however, you often have questions such as:

- 'Are you married?' or other 'yes/no' questions (logical)
- 'How old are you?' (numeric)
- 'What is your opinion on this product?' or other 'open-ended' questions (character)

The output, namely the respondents’ answers to the questions formulated above, is a data set of different data types. You will often find yourself working with data sets that contain different data types instead of only one.

A data frame has the variables of a data set as columns and the observations as rows. This will be a familiar concept for those coming from different statistical software packages such as SAS or SPSS.

Creating a data frame

You construct a data frame with the data.frame() function. As arguments, you pass the vectors from before: they will become the different columns of your data frame. Because every column has the same length, the vectors you pass should also have the same length. But don’t forget that it is possible (and likely) that they contain different types of data.

Use the function data.frame() to construct a data frame. Pass the vectors name, type, diameter, rotation and rings as arguments to data.frame(), in this order. Call the resulting data frame planets_df.

First we create vectors for each column of our dataset:

name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet", 
          "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

With our vectors built, the last step is to combine them all into one data frame:

(planets_df <- data.frame(name, type, diameter, rotation, rings))
##      name               type diameter rotation rings
## 1 Mercury Terrestrial planet    0.382    58.64 FALSE
## 2   Venus Terrestrial planet    0.949  -243.02 FALSE
## 3   Earth Terrestrial planet    1.000     1.00 FALSE
## 4    Mars Terrestrial planet    0.532     1.03 FALSE
## 5 Jupiter          Gas giant   11.209     0.41  TRUE
## 6  Saturn          Gas giant    9.449     0.43  TRUE
## 7  Uranus          Gas giant    4.007    -0.72  TRUE
## 8 Neptune          Gas giant    3.883     0.67  TRUE

Working with large data sets is not uncommon in data analysis. When you work with (extremely) large data sets and data frames, your first task as a data analyst is to develop a clear understanding of its structure and main elements.

Therefore, it is often useful to show only a small part of the entire data set.

So how to do this in R? Well, the function head() enables you to show the first observations of a data frame.

Similarly, the function tail() prints out the last observations in your data set.

Both head() and tail() print a top line called the ‘header’, which contains the names of the different variables in your data set.

head(planets_df)
##      name               type diameter rotation rings
## 1 Mercury Terrestrial planet    0.382    58.64 FALSE
## 2   Venus Terrestrial planet    0.949  -243.02 FALSE
## 3   Earth Terrestrial planet    1.000     1.00 FALSE
## 4    Mars Terrestrial planet    0.532     1.03 FALSE
## 5 Jupiter          Gas giant   11.209     0.41  TRUE
## 6  Saturn          Gas giant    9.449     0.43  TRUE
tail(planets_df)
##      name               type diameter rotation rings
## 3   Earth Terrestrial planet    1.000     1.00 FALSE
## 4    Mars Terrestrial planet    0.532     1.03 FALSE
## 5 Jupiter          Gas giant   11.209     0.41  TRUE
## 6  Saturn          Gas giant    9.449     0.43  TRUE
## 7  Uranus          Gas giant    4.007    -0.72  TRUE
## 8 Neptune          Gas giant    3.883     0.67  TRUE

Another method that is often used to get a rapid overview of your data is the function str(). The function str() shows you the structure of your data set. For a data frame it tells you:

- The total number of observations (e.g. 32 car types)
- The total number of variables (e.g. 11 car features)
- A full list of the variable names (e.g. mpg, cyl, ...)
- The data type of each variable (e.g. num)
- The first observations

Applying the str() function will often be the first thing that you do when receiving a new data set or data frame. It is a great way to get more insight into your data set before diving into the real analysis.

str(planets_df)
## 'data.frame':    8 obs. of  5 variables:
##  $ name    : Factor w/ 8 levels "Earth","Jupiter",..: 4 8 1 3 2 6 7 5
##  $ type    : Factor w/ 2 levels "Gas giant","Terrestrial planet": 2 2 2 2 1 1 1 1
##  $ diameter: num  0.382 0.949 1 0.532 11.209 ...
##  $ rotation: num  58.64 -243.02 1 1.03 0.41 ...
##  $ rings   : logi  FALSE FALSE FALSE FALSE TRUE TRUE ...

Dataframe Selection:

Similar to vectors and matrices, you select elements from a data frame with the help of square brackets [ ]. By using a comma, you can indicate what to select from the rows and the columns respectively. For example:

my_df[1,2] selects the value at the first row and second column in my_df.

my_df[1:3,2:4] selects rows 1, 2, 3 and columns 2, 3, 4 in my_df.

Sometimes you want to select all elements of a row or column. For example, my_df[1, ] selects all elements of the first row. Let us now apply this technique on planets_df!
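
For example, applying those patterns to planets_df (run these in your console to see the output):

planets_df[1, ]       # all columns for the first row, Mercury
planets_df[1:3, 2:4]  # rows 1-3 (Mercury, Venus, Earth) and columns type, diameter, rotation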

Instead of using numerics to select elements of a data frame, you can also use the variable names to select columns of a data frame.

Suppose you want to select the first three elements of the type column. One way to do this is to use the “:” symbol.

planets_df[1:3,2]
## [1] Terrestrial planet Terrestrial planet Terrestrial planet
## Levels: Gas giant Terrestrial planet

A possible disadvantage of this approach is that you have to know (or look up) the column number of type, which gets hard if you have a lot of variables. It is often easier to just make use of the variable name:

planets_df[1:3,"type"]
## [1] Terrestrial planet Terrestrial planet Terrestrial planet
## Levels: Gas giant Terrestrial planet

You will often want to select an entire column, namely one specific variable from a data frame. If you want to select all elements of the variable diameter, for example, both of these will do the trick:

planets_df[,3]
## [1]  0.382  0.949  1.000  0.532 11.209  9.449  4.007  3.883
planets_df[,"diameter"]
## [1]  0.382  0.949  1.000  0.532 11.209  9.449  4.007  3.883
planets_df[,c(2,4)]
##                 type rotation
## 1 Terrestrial planet    58.64
## 2 Terrestrial planet  -243.02
## 3 Terrestrial planet     1.00
## 4 Terrestrial planet     1.03
## 5          Gas giant     0.41
## 6          Gas giant     0.43
## 7          Gas giant    -0.72
## 8          Gas giant     0.67

However, there is a short-cut. If your columns have names, you can use the $ sign:

planets_df$diameter
## [1]  0.382  0.949  1.000  0.532 11.209  9.449  4.007  3.883

Filtering/Subsetting Dataframes

You probably remember from high school that some planets in our solar system have rings and others do not. Unfortunately you can not recall their names. Could R help you out?

If you type rings_vector in the console, you get: [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE

This means that the first four observations (or planets) do not have a ring (FALSE), but the other four do (TRUE).

However, you do not get a nice overview of the names of these planets, their diameter, etc. Let’s try to use rings_vector to select the data for the four planets with rings.

rings_vector <- planets_df$rings
planets_df[rings_vector, ]
##      name      type diameter rotation rings
## 5 Jupiter Gas giant   11.209     0.41  TRUE
## 6  Saturn Gas giant    9.449     0.43  TRUE
## 7  Uranus Gas giant    4.007    -0.72  TRUE
## 8 Neptune Gas giant    3.883     0.67  TRUE

You selected a subset from a data frame (planets_df) based on whether or not a certain condition was true (rings or no rings), and you managed to pull out all relevant data. This is an extremely important concept. We will be using this method to subset our larger data for values that meet certain criteria.

Now, let us move up one level and use the function subset(). You should see the subset() function as a short-cut to do exactly the same as what you did in the previous exercises.

subset(my_df, subset = some_condition)

The first argument of subset() specifies the data set for which you want a subset. By adding the second argument, you give R the necessary information and conditions to select the correct subset.

The code below will give the exact same result as you got in the previous exercise, but this time, you didn’t need the rings_vector!

subset(planets_df, subset = rings)
##      name      type diameter rotation rings
## 5 Jupiter Gas giant   11.209     0.41  TRUE
## 6  Saturn Gas giant    9.449     0.43  TRUE
## 7  Uranus Gas giant    4.007    -0.72  TRUE
## 8 Neptune Gas giant    3.883     0.67  TRUE

Here are some additional, more complex examples using the cars dataset:

data("mtcars")

subset(mtcars, mpg>20, c("mpg", "hp"))
##                 mpg  hp
## Mazda RX4      21.0 110
## Mazda RX4 Wag  21.0 110
## Datsun 710     22.8  93
## Hornet 4 Drive 21.4 110
## Merc 240D      24.4  62
## Merc 230       22.8  95
## Fiat 128       32.4  66
## Honda Civic    30.4  52
## Toyota Corolla 33.9  65
## Toyota Corona  21.5  97
## Fiat X1-9      27.3  66
## Porsche 914-2  26.0  91
## Lotus Europa   30.4 113
## Volvo 142E     21.4 109
subset(mtcars, mpg==max(mpg))
##                 mpg cyl disp hp drat    wt qsec vs am gear carb
## Toyota Corolla 33.9   4 71.1 65 4.22 1.835 19.9  1  1    4    1
subset(mtcars, mpg==max(mpg), mpg)
##                 mpg
## Toyota Corolla 33.9

Sorting Dataframes

Making and creating rankings is one of mankind’s favorite affairs. These rankings can be useful (best universities in the world), entertaining (most influential movie stars) or pointless (best 007 look-a-like).

In data analysis you can sort your data according to a certain variable in the data set. In R, this is done with the help of the function order().

order() is a function that gives you the ranked position of each element when it is applied on a variable, such as a vector for example:

a <- c(100, 10, 1000)
order(a)
## [1] 2 1 3

10, which is the second element in a, is the smallest element, so 2 comes first in the output of order(a). 100, which is the first element in a, is the second smallest element, so 1 comes second in the output of order(a).

This means we can use the output of order(a) to reshuffle our data. This leverages the same principles as were used in subsetting:

a[order(a)]
## [1]   10  100 1000
positions <-order(planets_df$diameter)
planets_df[positions,]
##      name               type diameter rotation rings
## 1 Mercury Terrestrial planet    0.382    58.64 FALSE
## 4    Mars Terrestrial planet    0.532     1.03 FALSE
## 2   Venus Terrestrial planet    0.949  -243.02 FALSE
## 3   Earth Terrestrial planet    1.000     1.00 FALSE
## 8 Neptune          Gas giant    3.883     0.67  TRUE
## 7  Uranus          Gas giant    4.007    -0.72  TRUE
## 6  Saturn          Gas giant    9.449     0.43  TRUE
## 5 Jupiter          Gas giant   11.209     0.41  TRUE

Lists

A list in R is similar to your to-do list at work or school: the different items on that list most likely differ in length, characteristic, and type of activity that has to be done.

A list in R allows you to gather a variety of objects under one name (that is, the name of the list) in an ordered way. These objects can be matrices, vectors, data frames, even other lists, etc. It is not even required that these objects are related to each other in any way.

You could say that a list is some kind of super data type: you can store practically any piece of information in it!

Creating a list

Let us create our first list! To construct a list you use the function list(): my_list <- list(comp1, comp2 …)

The arguments to the list function are the list components. Remember, these components can be matrices, vectors, other lists, …

Just like on your to-do list, you want to avoid not knowing or remembering what the components of your list stand for.

That is why you should give names to them:

my_list <- list(name1 = your_comp1, name2 = your_comp2)

This creates a list with components that are named name1, name2, and so on. If you want to name your lists after you’ve created them, you can use the names() function as you did with vectors. The following commands are fully equivalent to the assignment above:

my_list <- list(your_comp1, your_comp2)

names(my_list) <- c("name1", "name2")
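
The selection examples below refer to a list called shining_list. Here is a minimal sketch of how such a list might be built; the structure matches the text (title, actors, then a data frame of reviews), but the review values are illustrative, not from the original exercise:

mov <- "The Shining"
act <- c("Jack Nicholson", "Shelley Duvall", "Danny Lloyd")
rev <- data.frame(scores = c(4.5, 4.0, 5.0),
                  comments = c("Masterpiece", "Terrifying", "A classic"))
shining_list <- list(moviename = mov, actors = act, reviews = rev)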

Selecting elements from a list

Your list will often be built out of numerous elements and components. Therefore, getting a single element, multiple elements, or a component out of it is not always straightforward.

One way to select a component is using the numbered position of that component. For example, to "grab" the first component of shining_list you type:

shining_list[[1]]

A quick way to check this out is typing it in the console. Important to remember: to select elements from vectors, you use single square brackets: [ ]. Don’t mix them up!

You can also refer to the names of the components, with [[ ]] or with the $ sign. Both will select the data frame representing the reviews:

shining_list[["reviews"]]
shining_list$reviews

Besides selecting components, you often need to select specific elements out of these components.

For example, with shining_list[[2]][1] you select from the second component, actors (shining_list[[2]]), the first element ([1]). When you type this in the console, you will see the answer is Jack Nicholson.

Being proud of your first list, you shared it with the members of your movie hobby club. However, one of the senior members, a guy named M. McDowell, noted that you forgot to add the release year. Given your ambitions to become next year’s president of the club, you decide to add this information to the list.

To conveniently add elements to lists you can use the c() function, that you also used to build vectors:

ext_list <- c(my_list , my_val)

This will simply extend the original list, my_list, with the component my_val. This component gets appended to the end of the list. If you want to give the new list item a name, you just add the name as you did before:

ext_list <- c(my_list, my_name = my_val)
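
Using the shining_list sketched above, adding the release year (1980) might look like this:

ext_shining_list <- c(shining_list, year = 1980)
str(ext_shining_list)   # the new component appears at the end, named "year"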

TIDY Data:

Now that we have discussed the functions that come with the base R environment, it is time to introduce a collection of packages known as the tidyverse. It has the functions and datasets associated with the next series of tutorials:

install.packages('tidyverse')

You only need to install a package once, but you must load it every time you start a new session.

Once a package is installed, you can load it (and the datasets it contains) with the library() command.

library(tidyverse)

Importing Data with Readr

While it is helpful to know how to construct vectors, matrices, data frames, and lists from scratch, the vast majority of the time you will be importing data from external sources. Some of these include:

- CSV files
- Excel workbooks with multiple sheets
- Relational databases
- APIs

First we will begin with one of the most common data sources: CSVs. To import CSVs we will leverage one particular package that comes with the tidyverse library: readr.

In addition to read_csv(), there are additional readr functions that allow you to import other kinds of delimited files:

- read_csv2() reads semicolon-separated files (common in countries where , is used as the decimal place).
- read_tsv() reads tab-delimited files, and read_delim() reads in files with any delimiter (see the quick illustration after this list).
- read_fwf() reads fixed-width files. You can specify fields either by their widths with fwf_widths() or their position with fwf_positions(). read_table() reads a common variation of fixed-width files where columns are separated by white space.
- read_log() reads Apache style log files. (But also check out webreadr, which is built on top of read_log() and provides many more helpful tools.)
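
As a quick illustration of one of these variants, read_delim() can parse an inline string with any single-character delimiter (the values below are made up):

read_delim("a|b|c\n1|2|3", delim = "|")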

The first argument to read_csv() is the most important: it’s the path to the file to read.

heights <- read_csv("data/heights.csv")

When you run read_csv() it prints out a column specification that gives the name and type of each column.

By default, read_csv() uses the first line of the data for the column names, which is a very common convention. There are two cases where you might want to tweak this behaviour:

Metadata In Headers

Sometimes there are a few lines of metadata at the top of the file. You can use skip = n to skip the first n lines; or use comment = "#" to drop all lines that start with (e.g.) #.

read_csv("The first line of metadata
The second line of metadata
x,y,z
1,2,3", skip = 2)
## # A tibble: 1 x 3
##       x     y     z
##   <int> <int> <int>
## 1     1     2     3
read_csv("# A comment I want to skip
x,y,z
1,2,3", comment = "#")

Missing Column Names

The data might not have column names. You can use col_names = FALSE to tell read_csv() not to treat the first row as headings, and instead label them sequentially from X1 to Xn:

read_csv("1,2,3\n4,5,6", col_names = FALSE)
## # A tibble: 2 x 3
##      X1    X2    X3
##   <int> <int> <int>
## 1     1     2     3
## 2     4     5     6

("\n" is a convenient shortcut for adding a new line. You'll learn more about it and other types of string escape in string basics.)

Alternatively you can pass col_names a character vector which will be used as the column names:

read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
## # A tibble: 2 x 3
##       x     y     z
##   <int> <int> <int>
## 1     1     2     3
## 2     4     5     6

NA Values

Another option that commonly needs tweaking is na: this specifies the value (or values) that are used to represent missing values in your file:

read_csv("a,b,c\n1,2,.", na = ".")
## # A tibble: 1 x 3
##       a     b c    
##   <int> <int> <chr>
## 1     1     2 <NA>

Another alternative to read_csv is data.table::fread(). It doesn’t fit quite so well into the tidyverse, but it can be quite a bit faster.
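
A minimal sketch of that alternative, assuming the data.table package is installed and reusing the hypothetical path from earlier:

heights_dt <- data.table::fread("data/heights.csv")   # returns a data.table rather than a tibble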

Parse Functions

It was previously mentioned that read_csv() prints out a column specification giving the data type of each column. How does readr know what kind of values we have?

readr leverages a family of parse_*() functions that attempt to identify the type of values in each column of the file. By default, values are imported as character values; the parser samples the data and makes a best guess based on predefined rules. Knowing how readr parses data is helpful because there are times when readr is not able to recognize values. For these situations it is helpful to override the predefined rules to properly identify the data type.

Like all functions in the tidyverse, the parse_*() functions are uniform: the first argument is a character vector to parse, and the na argument specifies which strings should be treated as missing:

parse_integer(c("1", "231", ".", "456"), na = ".")
## [1]   1 231  NA 456

If parsing fails, you’ll get a warning:

x <- parse_integer(c("123", "345", "abc", "123.45"))
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 2 parsing failures.
## row # A tibble: 2 x 4 col     row   col expected               actual expected   <int> <int> <chr>                  <chr>  actual 1     3    NA an integer             abc    row 2     4    NA no trailing characters .45

Failures will be missing in the output:

x
## [1] 123 345  NA  NA
## attr(,"problems")
## # A tibble: 2 x 4
##     row   col expected               actual
##   <int> <int> <chr>                  <chr> 
## 1     3    NA an integer             abc   
## 2     4    NA no trailing characters .45

If there are many parsing failures, you'll need to use problems() to get the complete set. This returns a tibble, which you can then manipulate with dplyr.

problems(x)
## # A tibble: 2 x 4
##     row   col expected               actual
##   <int> <int> <chr>                  <chr> 
## 1     3    NA an integer             abc   
## 2     4    NA no trailing characters .45

Using parsers is mostly a matter of understanding what’s available and how they deal with different types of input. There are eight particularly important parsers:

-parse_logical() and parse_integer() parse logicals and integers respectively. There’s basically nothing that can go wrong with these parsers so I won’t describe them here further.

-parse_double() is a strict numeric parser, and parse_number() is a flexible numeric parser. These are more complicated than you might expect because different parts of the world write numbers in different ways.

-parse_character() seems so simple that it shouldn’t be necessary. But one complication makes it quite important: character encodings.

-parse_factor() creates factors, the data structure that R uses to represent categorical variables with fixed and known values.

-parse_datetime(), parse_date(), and parse_time() allow you to parse various date & time specifications. These are the most complicated because there are so many different ways of writing dates.

The following sections describe these parsers in more detail.

Parsing Numbers

It seems like it should be straightforward to parse a number, but three problems make it tricky:

  1. People write numbers differently in different parts of the world. For example, some countries use . in between the integer and fractional parts of a real number, while others use ,.

readr’s default locale is US-centric, because generally R is US-centric (i.e. the documentation of base R is written in American English). An alternative approach would be to try and guess the defaults from your operating system. This is hard to do well, and, more importantly, makes your code fragile: even if it works on your computer, it might fail when you email it to a colleague in another country.

To address this problem, readr has the notion of a “locale”, an object that specifies parsing options that differ from place to place. When parsing numbers, the most important option is the character you use for the decimal mark. You can override the default value of “.” by creating a new locale and setting the decimal_mark argument:

parse_double("1.23")
## [1] 1.23
parse_double("1,23", locale = locale(decimal_mark = ","))
## [1] 1.23
  2. Numbers are often surrounded by other characters that provide some context, like "$1000" or "10%".

parse_number() addresses the second problem: it ignores non-numeric characters before and after the number. This is particularly useful for currencies and percentages, but also works to extract numbers embedded in text.

parse_number("$100")
## [1] 100
parse_number("20%")
## [1] 20
parse_number("It cost $123.45")
## [1] 123.45
  3. Numbers often contain "grouping" characters to make them easier to read, like "1,000,000", and these grouping characters vary around the world.

The final problem is addressed by the combination of parse_number() and the locale as parse_number() will ignore the “grouping mark”:

USD Currency:

parse_number("$123,456,789")
## [1] 123456789

Used in many parts of Europe

parse_number("123.456.789", locale = locale(grouping_mark = "."))
## [1] 123456789

Used in Switzerland

parse_number("123'456'789", locale = locale(grouping_mark = "'"))
## [1] 123456789

Parsing Strings

It seems like parse_character() should be really simple - it could just return its input. Unfortunately life isn’t so simple, as there are multiple ways to represent the same string. To understand what’s going on, we need to dive into the details of how computers represent strings. In R, we can get at the underlying representation of a string using charToRaw():

charToRaw("Hadley")
## [1] 48 61 64 6c 65 79

Each hexadecimal number represents a byte of information: 48 is H, 61 is a, and so on. The mapping from hexadecimal number to character is called the encoding, and in this case the encoding is called ASCII. ASCII does a great job of representing English characters, because it’s the American Standard Code for Information Interchange.

Things get more complicated for languages other than English.

In the early days of computing there were many competing standards for encoding non-English characters, and to correctly interpret a string you needed to know both the values and the encoding.

For example, two common encodings are Latin1 (aka ISO-8859-1, used for Western European languages) and Latin2 (aka ISO-8859-2, used for Eastern European languages).

In Latin1, the byte b1 is "±", but in Latin2, it's "ą"! Fortunately, today there is one standard that is supported almost everywhere: UTF-8.

UTF-8 can encode just about every character used by humans today, as well as many extra symbols (like emoji!).

readr uses UTF-8 everywhere: it assumes your data is UTF-8 encoded when you read it, and always uses it when writing.

This is a good default, but will fail for data produced by older systems that don’t understand UTF-8.

If this happens to you, your strings will look weird when you print them. Sometimes just one or two characters might be messed up; other times you’ll get complete gibberish. For example:

x1 <- "El Ni\xf1o was particularly bad this year"
x2 <- "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"

x1
## [1] "El Niño was particularly bad this year"
x2
## [1] "‚±‚ñ‚É‚¿‚Í"

To fix the problem you need to specify the encoding in parse_character():

parse_character(x1, locale = locale(encoding = "Latin1"))
## [1] "El Niño was particularly bad this year"
parse_character(x2, locale = locale(encoding = "Shift-JIS"))
## [1] "<U+3053><U+3093><U+306B><U+3061><U+306F>"

How do you find the correct encoding? If you’re lucky, it’ll be included somewhere in the data documentation. Unfortunately, that’s rarely the case, so readr provides guess_encoding() to help you figure it out.

It’s not foolproof, and it works better when you have lots of text (unlike here), but it’s a reasonable place to start. Expect to try a few different encodings before you find the right one.

guess_encoding(charToRaw(x1))
## # A tibble: 2 x 2
##   encoding   confidence
##   <chr>           <dbl>
## 1 ISO-8859-1      0.460
## 2 ISO-8859-9      0.230
guess_encoding(charToRaw(x2))
## # A tibble: 1 x 2
##   encoding confidence
##   <chr>         <dbl>
## 1 KOI8-R        0.420

The first argument to guess_encoding() can either be a path to a file, or, as in this case, a raw vector (useful if the strings are already in R).

Encodings are a rich and complex topic, and we’ve only scratched the surface here. If you’d like to learn more I’d recommend reading the detailed explanation at http://kunststube.net/encoding/.

Parsing Factors

R uses factors to represent categorical variables that have a known set of possible values. Give parse_factor() a vector of known levels to generate a warning whenever an unexpected value is present:

fruit <- c("apple", "banana")
parse_factor(c("apple", "banana", "bananana"), levels = fruit)
## Warning: 1 parsing failure.
## row # A tibble: 1 x 4 col     row   col expected           actual   expected   <int> <int> <chr>              <chr>    actual 1     3    NA value in level set bananana
## [1] apple  banana <NA>  
## attr(,"problems")
## # A tibble: 1 x 4
##     row   col expected           actual  
##   <int> <int> <chr>              <chr>   
## 1     3    NA value in level set bananana
## Levels: apple banana

But if you have many problematic entries, it's often easier to leave them as character vectors and then use the tools you'll learn about in the strings and factors sections to clean them up.

Parsing Dates, date-times, and times

You pick between three parsers depending on whether you want:

  1. Date (the number of days since 1970-01-01)
  2. Date-time (the number of seconds since midnight 1970-01-01)
  3. Time (the number of seconds since midnight).

When called without any additional arguments parse_datetime() expects an ISO8601 date-time. ISO8601 is an international standard in which the components of a date are organised from biggest to smallest: year, month, day, hour, minute, second.

parse_datetime("2010-10-01T2010")
## [1] "2010-10-01 20:10:00 UTC"
parse_datetime("20101010")
## [1] "2010-10-10 UTC"

This is the most important date/time standard, and if you work with dates and times frequently, I recommend reading https://en.wikipedia.org/wiki/ISO_8601

parse_date() expects a four digit year, a - or /, the month, a - or /, then the day:

parse_date("2010-10-01")
## [1] "2010-10-01"

parse_time() expects the hour, :, minutes, optionally : and seconds, and an optional am/pm specifier:

library(hms)
## 
## Attaching package: 'hms'
## The following object is masked from 'package:lubridate':
## 
##     hms
parse_time("01:10 am")
## 01:10:00

Base R doesn't have a great built-in class for time data, so we use the one provided in the hms package.

If these defaults don’t work for your data you can supply your own date-time format, built up of the following pieces:

Year
-%Y (4 digits).
-%y (2 digits); 00-69 -> 2000-2069, 70-99 -> 1970-1999.

Month
-%m (2 digits).
-%b (abbreviated name, like "Jan").
-%B (full name, "January").

Day
-%d (2 digits).
-%e (optional leading space).

Time
-%H 0-23 hour.
-%I 0-12, must be used with %p.
-%p AM/PM indicator.
-%M minutes.
-%S integer seconds.
-%OS real seconds.

Time zone
-%Z time zone as a name, e.g. America/Chicago. Beware of abbreviations: if you're American, note that "EST" is a Canadian time zone that does not have daylight savings time. It is not Eastern Standard Time! We'll come back to time zones later.
-%z time zone as an offset from UTC, e.g. +0800.

Non-digits
-%. skips one non-digit character.
-%* skips any number of non-digits.

The best way to figure out the correct format is to create a few examples in a character vector, and test with one of the parsing functions. For example:

parse_time("20:10:01")
## 20:10:01
parse_date("01/02/15", "%m/%d/%y")
## [1] "2015-01-02"
parse_date("01/02/15", "%d/%m/%y")
## [1] "2015-02-01"
parse_date("01/02/15", "%y/%m/%d")
## [1] "2001-02-15"

If you’re using %b or %B with non-English month names, you’ll need to set the lang argument to locale(). See the list of built-in languages in date_names_langs(), or if your language is not already included, create your own with date_names().

parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
## [1] "2015-01-01"

Parsing a file

Now that you’ve learned how to parse an individual vector, it’s time to return to the beginning and explore how readr parses a file. There are two new things that you’ll learn about in this section:

  1. How readr automatically guesses the type of each column.

readr reads the first 1000 rows of each column and uses some (moderately conservative) heuristics to figure out its type. You can emulate this process with a character vector using guess_parser(), which returns readr's best guess, and parse_guess(), which uses that guess to parse the column:

guess_parser("2010-10-01")
## [1] "date"
guess_parser("15:01")
## [1] "time"
guess_parser(c("TRUE", "FALSE"))
## [1] "logical"
guess_parser(c("1", "5", "9"))
## [1] "integer"
guess_parser(c("12,352,561"))
## [1] "number"
str(parse_guess("2010-10-10"))
##  Date[1:1], format: "2010-10-10"

The heuristic tries each of the following types, stopping when it finds a match:

-logical: contains only "F", "T", "FALSE", or "TRUE".
-integer: contains only numeric characters (and -).
-double: contains only valid doubles (including numbers like 4.5e-5).
-number: contains valid doubles with the grouping mark inside.
-time: matches the default time_format.
-date: matches the default date_format.
-date-time: any ISO8601 date.

If none of these rules apply, then the column will stay as a vector of strings.
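
For example, values that match none of these rules are simply left as character; a quick check (the expected guess here is "character"):

guess_parser(c("apple", "banana"))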

  2. How to override the default specification.

Problems

These defaults don't always work for larger files. There are two basic problems:

  1. The first thousand rows might be a special case, and readr guesses a type that is not sufficiently general. For example, you might have a column of doubles that only contains integers in the first 1000 rows.

  2. The column might contain a lot of missing values. If the first 1000 rows contain only NAs, readr will guess that it’s a character vector, whereas you probably want to parse it as something more specific.

readr contains a challenging CSV that illustrates both of these problems:

challenge <- read_csv(readr_example("challenge.csv"))
## Parsed with column specification:
## cols(
##   x = col_integer(),
##   y = col_character()
## )
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 1000 parsing failures.
## row # A tibble: 5 x 5 col     row col   expected               actual             file               expected   <int> <chr> <chr>                  <chr>              <chr>              actual 1  1001 x     no trailing characters .23837975086644292 'C:/Users/andre/D~ file 2  1002 x     no trailing characters .41167997173033655 'C:/Users/andre/D~ row 3  1003 x     no trailing characters .7460716762579978  'C:/Users/andre/D~ col 4  1004 x     no trailing characters .723450553836301   'C:/Users/andre/D~ expected 5  1005 x     no trailing characters .614524137461558   'C:/Users/andre/D~
## ... ................. ... .......................................................................... ........ .......................................................................... ...... .......................................................................... .... .......................................................................... ... .......................................................................... ... .......................................................................... ........ ..........................................................................
## See problems(...) for more details.

(Note the use of readr_example(), which finds the path to one of the files included with the package.)

There are two printed outputs: the column specification generated by looking at the first 1000 rows, and the first five parsing failures. It’s always a good idea to explicitly pull out the problems(), so you can explore them in more depth:

problems(challenge)
## # A tibble: 1,000 x 5
##      row col   expected               actual             file             
##    <int> <chr> <chr>                  <chr>              <chr>            
##  1  1001 x     no trailing characters .23837975086644292 'C:/Users/andre/~
##  2  1002 x     no trailing characters .41167997173033655 'C:/Users/andre/~
##  3  1003 x     no trailing characters .7460716762579978  'C:/Users/andre/~
##  4  1004 x     no trailing characters .723450553836301   'C:/Users/andre/~
##  5  1005 x     no trailing characters .614524137461558   'C:/Users/andre/~
##  6  1006 x     no trailing characters .473980569280684   'C:/Users/andre/~
##  7  1007 x     no trailing characters .5784610391128808  'C:/Users/andre/~
##  8  1008 x     no trailing characters .2415937229525298  'C:/Users/andre/~
##  9  1009 x     no trailing characters .11437866208143532 'C:/Users/andre/~
## 10  1010 x     no trailing characters .2983446326106787  'C:/Users/andre/~
## # ... with 990 more rows

A good strategy is to work column by column until there are no problems remaining. Here we can see that there are a lot of parsing problems with the x column - there are trailing characters after the integer value. That suggests we need to use a double parser instead.

To fix the call, start by copying and pasting the column specification into your original call; you can then tweak the guessed types one column at a time:

challenge <- read_csv(
readr_example("challenge.csv"), 
col_types = cols(
x = col_integer(),
y = col_character()))
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 1000 parsing failures.
## row # A tibble: 5 x 5 col     row col   expected               actual             file               expected   <int> <chr> <chr>                  <chr>              <chr>              actual 1  1001 x     no trailing characters .23837975086644292 'C:/Users/andre/D~ file 2  1002 x     no trailing characters .41167997173033655 'C:/Users/andre/D~ row 3  1003 x     no trailing characters .7460716762579978  'C:/Users/andre/D~ col 4  1004 x     no trailing characters .723450553836301   'C:/Users/andre/D~ expected 5  1005 x     no trailing characters .614524137461558   'C:/Users/andre/D~
## ... ................. ... .......................................................................... ........ .......................................................................... ...... .......................................................................... .... .......................................................................... ... .......................................................................... ... .......................................................................... ........ ..........................................................................
## See problems(...) for more details.
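
From the failures we can see that the x column actually contains doubles (note the trailing decimal digits), so a reasonable next step (a sketch of the tweak, not necessarily the only fix) is to change col_integer() to col_double():

challenge <- read_csv(
  readr_example("challenge.csv"),
  col_types = cols(
    x = col_double(),
    y = col_character()
  )
)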

Tibbles:

Now that we have imported our data using the readr package, the next concept we need to discuss is the tibble. Tibbles are data frames, but slightly tweaked to work better in the tidyverse. Most other R packages use regular data frames, so you might want to coerce a data frame to a tibble. You can do that with as_tibble():

library(tidyverse)
as_tibble(iris)
## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1         5.10        3.50         1.40       0.200 setosa 
##  2         4.90        3.00         1.40       0.200 setosa 
##  3         4.70        3.20         1.30       0.200 setosa 
##  4         4.60        3.10         1.50       0.200 setosa 
##  5         5.00        3.60         1.40       0.200 setosa 
##  6         5.40        3.90         1.70       0.400 setosa 
##  7         4.60        3.40         1.40       0.300 setosa 
##  8         5.00        3.40         1.50       0.200 setosa 
##  9         4.40        2.90         1.40       0.200 setosa 
## 10         4.90        3.10         1.50       0.100 setosa 
## # ... with 140 more rows

Note that tibble() does much less than data.frame(): it never changes the type of the inputs (e.g. it never converts strings to factors!), it never changes the names of variables, and it never creates row names. When printed, a tibble shows only the first 10 rows, along with metadata describing each column's type.
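
For example, a small sketch of creating a tibble directly with tibble(); the character column stays a character vector:

tb <- tibble(
  x = 1:3,
  animal = c("cat", "dog", "bird")  # kept as character, not converted to a factor
)
tb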

The column type abbreviations are detailed below:

-int stands for integers.

-dbl stands for doubles, or real numbers.

-chr stands for character vectors, or strings.

-dttm stands for date-times (a date + a time).

There are three other common types of variables that aren't used in this dataset but that you'll encounter later on:

-lgl stands for logical, vectors that contain only TRUE or FALSE.

-fctr stands for factors, which R uses to represent categorical variables with fixed possible values.

-date stands for dates.

Tibbles also print only as many columns as fit in the console window; any additional columns are listed below the output.
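
If you want to force every column to print regardless of the window width, one option (a sketch relying on the n and width arguments of the tibble print method) is:

print(as_tibble(iris), n = 3, width = Inf)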

Tidying Data with tidyr:

Now that we have successfully imported our external data and formatted it as a tibble, the next step is to ensure it is a tidy dataset:

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” – Hadley Wickham

The tidy dataset approach is a consistent way to organise your data in R. Getting your data into this format requires some upfront work, but that work pays off in the long term.

Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the analytic questions at hand.

You can represent the same underlying data in multiple ways. However, there are three interrelated rules which make a dataset tidy:

  1. Each variable must have its own column
  2. Each observation must have its own row
  3. Each value must have its own cell

These three rules are interrelated because it’s impossible to only satisfy two of the three. That interrelationship leads to an even simpler set of practical instructions:

-Put each dataset in a tibble.
-Put each variable in a column.

Spreading and gathering with tidyr

The principles of tidy data seem so obvious that you might wonder if you’ll ever encounter a dataset that isn’t tidy. Unfortunately, however, most data that you will encounter will be untidy. There are two main reasons:

  1. Most people aren’t familiar with the principles of tidy data, and it’s hard to derive them yourself unless you spend a lot of time working with data.

  2. Data is often organised to facilitate some use other than analysis. For example, data is often organised to make data entry as easy as possible.

This means for most real analyses, you’ll need to do some tidying. The first step is always to figure out what the variables and observations are. Sometimes this is easy; other times you’ll need to consult with the people who originally generated the data. The second step is to resolve one of two common problems:

-One variable might be spread across multiple columns.

-One observation might be scattered across multiple rows.

Typically a dataset will only suffer from one of these problems; it’ll only suffer from both if you’re really unlucky! To fix these problems, you’ll need the two most important functions in tidyr: gather() and spread().

Gathering:

A common problem is a dataset where some of the column names are not names of variables, but values of a variable. Take table4a: the column names 1999 and 2000 represent values of the year variable, and each row represents two observations, not one.

table4a
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

To tidy a dataset like this, we need to gather those columns into a new pair of variables. To describe that operation we need three parameters:

-The set of columns that represent values, not variables. In this example, those are the columns 1999 and 2000.

-The name of the variable whose values form the column names. I call that the key, and here it is year.

-The name of the variable whose values are spread over the cells. I call that value, and here it’s the number of cases.

Together those parameters generate the call to gather():

table4a %>% 
gather(`1999`, `2000`, key = "year", value = "cases")
## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766

The columns to gather are specified with dplyr::select() style notation. Here there are only two columns, so we list them individually. Note that "1999" and "2000" are non-syntactic names (because they don't start with a letter) so we have to surround them in backticks. To refresh your memory of the other ways to select columns, see the documentation for dplyr::select().

In the final result, the gathered columns are dropped, and we get new key and value columns. Otherwise, the relationships between the original variables are preserved.
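
tidyr also ships a companion table, table4b, which stores population counts in the same wide layout. Assuming it is available alongside table4a, it can be gathered in exactly the same way:

table4b %>% 
  gather(`1999`, `2000`, key = "year", value = "population")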

Spreading:

Spreading is the opposite of gathering. You use it when an observation is scattered across multiple rows. For example, take table2: an observation is a country in a year, but each observation is spread across two rows.

table2
## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

To tidy this up, we first analyse the representation in a similar way to gather(). This time, however, we only need two parameters:

  1. The column that contains variable names, the key column. Here, it’s type.

  2. The column that contains values from multiple variables, the value column. Here it’s count.

Once we’ve figured that out, we can use spread(), as shown programmatically below:

table2 %>%
spread(key = type, value = count)
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

As you might have guessed from the common key and value arguments, spread() and gather() are complements. gather() makes wide tables narrower and longer; spread() makes long tables shorter and wider.

Separating and uniting with tidyr

So far you’ve learned how to tidy table2 and table4, but not table3. table3 has a different problem: we have one column (rate) that contains two variables (cases and population). To fix this problem, we’ll need the separate() function. You’ll also learn about the complement of separate(): unite(), which you use if a single variable is spread across multiple columns.

Separate

separate() pulls apart one column into multiple columns, by splitting wherever a separator character appears. Take table3:

table3
## # A tibble: 6 x 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

The rate column contains both cases and population variables, and we need to split it into two variables. separate() takes the name of the column to separate, and the names of the columns to separate into as shown in the code below.

table3 %>% 
separate(rate, into = c("cases", "population"))
## # A tibble: 6 x 4
##   country      year cases  population
## * <chr>       <int> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

By default, separate() will split values wherever it sees a non-alphanumeric character (i.e. a character that isn’t a number or letter). For example, in the code above, separate() split the values of rate at the forward slash characters. If you wish to use a specific character to separate a column, you can pass the character to the sep argument of separate(). For example, we could rewrite the code above as:

table3 %>% 
  separate(rate, into = c("cases", "population"), sep = "/")
## # A tibble: 6 x 4
##   country      year cases  population
## * <chr>       <int> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Look carefully at the column types: you’ll notice that cases and population are character columns. This is the default behaviour in separate(): it leaves the type of the column as is. Here, however, it’s not very useful as those really are numbers. We can ask separate() to try and convert to better types using convert = TRUE:

table3 %>% 
separate(rate, into = c("cases", "population"), convert = TRUE)
## # A tibble: 6 x 4
##   country      year  cases population
## * <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

You can also pass a vector of integers to sep. separate() will interpret the integers as positions to split at. Positive values start at 1 on the far left of the strings; negative values start at -1 on the far right of the strings. When using integers to separate strings, the length of sep should be one less than the number of names in into.

You can use this arrangement to separate the last two digits of each year. This makes the data less tidy, but it is useful in other cases, as you'll see in a little bit.

table3 %>% 
separate(year, into = c("century", "year"), sep = 2)
## # A tibble: 6 x 4
##   country     century year  rate             
## * <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583

Unite with tidyr:

unite() is the inverse of separate(): it combines multiple columns into a single column. You’ll need it much less frequently than separate(), but it’s still a useful tool to have in your back pocket.

We can use unite() to rejoin the century and year columns that we created in the last example. That data is saved as tidyr::table5. unite() takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in dplyr::select() style:

table5 %>% 
unite(new, century, year)
## # A tibble: 6 x 3
##   country     new   rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 19_99 745/19987071     
## 2 Afghanistan 20_00 2666/20595360    
## 3 Brazil      19_99 37737/172006362  
## 4 Brazil      20_00 80488/174504898  
## 5 China       19_99 212258/1272915272
## 6 China       20_00 213766/1280428583

In this case we also need to use the sep argument. The default will place an underscore (_) between the values from different columns. Here we don’t want any separator so we use “”:

table5 %>% 
unite(new, century, year, sep = "")
## # A tibble: 6 x 3
##   country     new   rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 1999  745/19987071     
## 2 Afghanistan 2000  2666/20595360    
## 3 Brazil      1999  37737/172006362  
## 4 Brazil      2000  80488/174504898  
## 5 China       1999  212258/1272915272
## 6 China       2000  213766/1280428583

Relational Data with dplyr:

It’s rare that a data analysis involves only a single table of data. Typically you have many tables of data, and you must combine them to answer the questions that you’re interested in. Collectively, multiple tables of data are called relational data because it is the relations, not just the individual datasets, that are important.

Relations are always defined between a pair of tables. All other relations are built up from this simple idea: the relations of three or more tables are always a property of the relations between each pair.

Sometimes both elements of a pair can be the same table! This is needed if, for example, you have a table of people, and each person has a reference to their parents.

To work with relational data you need verbs that work with pairs of tables. There are three families of verbs designed to work with relational data:

Mutating joins, which add new variables to one data frame from matching observations in another.

Filtering joins, which filter observations from one data frame based on whether or not they match an observation in the other table.

Set operations, which treat observations as if they were set elements.

The most common place to find relational data is in a relational database management system (or RDBMS), a term that encompasses almost all modern databases. If you’ve used a database before, you’ve almost certainly used SQL. If so, you should find the concepts in this chapter familiar, although their expression in dplyr is a little different.

Generally, dplyr is a little easier to use than SQL because dplyr is specialised to do data analysis: it makes common data analysis operations easier, at the expense of making it more difficult to do other things that aren’t commonly needed for data analysis.

nycflights13

library(nycflights13)

We will use the nycflights13 package to learn about relational data. nycflights13 contains four tibbles that are related to the flights table that you used in data transformation:

-airlines lets you look up the full carrier name from its abbreviated code:

head(airlines)
## # A tibble: 6 x 2
##   carrier name                    
##   <chr>   <chr>                   
## 1 9E      Endeavor Air Inc.       
## 2 AA      American Airlines Inc.  
## 3 AS      Alaska Airlines Inc.    
## 4 B6      JetBlue Airways         
## 5 DL      Delta Air Lines Inc.    
## 6 EV      ExpressJet Airlines Inc.

-airports gives information about each airport, identified by the faa airport code

head(airports)
## # A tibble: 6 x 8
##   faa   name                             lat   lon   alt    tz dst   tzone
##   <chr> <chr>                          <dbl> <dbl> <int> <dbl> <chr> <chr>
## 1 04G   Lansdowne Airport               41.1 -80.6  1044 -5.00 A     Amer~
## 2 06A   Moton Field Municipal Airport   32.5 -85.7   264 -6.00 A     Amer~
## 3 06C   Schaumburg Regional             42.0 -88.1   801 -6.00 A     Amer~
## 4 06N   Randall Airport                 41.4 -74.4   523 -5.00 A     Amer~
## 5 09J   Jekyll Island Airport           31.1 -81.4    11 -5.00 A     Amer~
## 6 0A9   Elizabethton Municipal Airport  36.4 -82.2  1593 -5.00 A     Amer~

-planes gives information about each plane, identified by its tailnum:

head(planes)
## # A tibble: 6 x 9
##   tailnum  year type       manufacturer  model  engines seats speed engine
##   <chr>   <int> <chr>      <chr>         <chr>    <int> <int> <int> <chr> 
## 1 N10156   2004 Fixed win~ EMBRAER       EMB-1~       2    55    NA Turbo~
## 2 N102UW   1998 Fixed win~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
## 3 N103US   1999 Fixed win~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
## 4 N104UW   1999 Fixed win~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
## 5 N10575   2002 Fixed win~ EMBRAER       EMB-1~       2    55    NA Turbo~
## 6 N105UW   1999 Fixed win~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~

-weather gives the weather at each NYC airport for each hour:

head(weather)
## # A tibble: 6 x 15
##   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##   <chr>  <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
## 1 EWR     2013  1.00     1     0  37.0  21.9  54.0      230       10.4
## 2 EWR     2013  1.00     1     1  37.0  21.9  54.0      230       13.8
## 3 EWR     2013  1.00     1     2  37.9  21.9  52.1      230       12.7
## 4 EWR     2013  1.00     1     3  37.9  23.0  54.5      230       13.8
## 5 EWR     2013  1.00     1     4  37.9  24.1  57.0      240       15.0
## 6 EWR     2013  1.00     1     6  39.0  26.1  59.4      270       10.4
## # ... with 5 more variables: wind_gust <dbl>, precip <dbl>,
## #   pressure <dbl>, visib <dbl>, time_hour <dttm>

You don't need to understand the entire dataset; you just need to understand the chain of relations between the tables that you are interested in.

For nycflights13:

-flights connects to planes via a single variable, tailnum.

-flights connects to airlines through the carrier variable.

-flights connects to airports in two ways: via the origin and dest variables.

-flights connects to weather via origin (the location), and year, month, day and hour (the time).

Keys

The variables used to connect each pair of tables are called keys. A key is a variable (or set of variables) that uniquely identifies an observation. In simple cases, a single variable is sufficient to identify an observation. For example, each plane is uniquely identified by its tailnum. In other cases, multiple variables may be needed. For example, to identify an observation in weather you need five variables: year, month, day, hour, and origin.

There are two types of keys:

  1. A primary key uniquely identifies an observation in its own table. For example, planes$tailnum is a primary key because it uniquely identifies each plane in the planes table.

  2. A foreign key uniquely identifies an observation in another table. For example, flights$tailnum is a foreign key because it appears in the flights table, where it matches each flight to a unique plane.

A variable can be both a primary key and a foreign key. For example, origin is part of the weather primary key, and is also a foreign key for the airports table.

Once you’ve identified the primary keys in your tables, it’s good practice to verify that they do indeed uniquely identify each observation. One way to do that is to count() the primary keys and look for entries where n is greater than one:

planes %>% 
count(tailnum) %>% 
filter(n > 1)
## # A tibble: 0 x 2
## # ... with 2 variables: tailnum <chr>, n <int>
weather %>% 
count(year, month, day, hour, origin) %>% 
filter(n > 1)
## # A tibble: 0 x 6
## # ... with 6 variables: year <dbl>, month <dbl>, day <int>, hour <int>,
## #   origin <chr>, n <int>

Sometimes a table doesn’t have an explicit primary key: each row is an observation, but no combination of variables reliably identifies it. For example, what’s the primary key in the flights table? You might think it would be the date plus the flight or tail number, but neither of those are unique:

flights %>% 
count(year, month, day, flight) %>% 
filter(n > 1)
## # A tibble: 29,768 x 5
##     year month   day flight     n
##    <int> <int> <int>  <int> <int>
##  1  2013     1     1      1     2
##  2  2013     1     1      3     2
##  3  2013     1     1      4     2
##  4  2013     1     1     11     3
##  5  2013     1     1     15     2
##  6  2013     1     1     21     2
##  7  2013     1     1     27     4
##  8  2013     1     1     31     2
##  9  2013     1     1     32     2
## 10  2013     1     1     35     2
## # ... with 29,758 more rows
flights %>% 
count(year, month, day, tailnum) %>% 
filter(n > 1)
## # A tibble: 64,928 x 5
##     year month   day tailnum     n
##    <int> <int> <int> <chr>   <int>
##  1  2013     1     1 N0EGMQ      2
##  2  2013     1     1 N11189      2
##  3  2013     1     1 N11536      2
##  4  2013     1     1 N11544      3
##  5  2013     1     1 N11551      2
##  6  2013     1     1 N12540      2
##  7  2013     1     1 N12567      2
##  8  2013     1     1 N13123      2
##  9  2013     1     1 N13538      3
## 10  2013     1     1 N13566      3
## # ... with 64,918 more rows

When starting to work with this data, I had naively assumed that each flight number would be only used once per day: that would make it much easier to communicate problems with a specific flight. Unfortunately that is not the case!

If a table lacks a primary key, it’s sometimes useful to add one with mutate() and row_number(). That makes it easier to match observations if you’ve done some filtering and want to check back in with the original data. This is called a surrogate key.
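
As a minimal sketch (flight_id is just an illustrative name for the surrogate key):

flights %>% 
  mutate(flight_id = row_number()) %>% 
  select(flight_id, everything())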

A primary key and the corresponding foreign key in another table form a relation. Relations are typically one-to-many. For example, each flight has one plane, but each plane has many flights.

In other data, you’ll occasionally see a 1-to-1 relationship. You can think of this as a special case of 1-to-many.

You can model many-to-many relations with a many-to-1 relation plus a 1-to-many relation.

For example, in this data there’s a many-to-many relationship between airlines and airports: each airline flies to many airports; each airport hosts many airlines.

Mutating joins

The first tool we’ll look at for combining a pair of tables is the mutating join. A mutating join allows you to combine variables from two tables. It first matches observations by their keys, then copies across variables from one table to the other.

Like mutate(), the join functions add variables to the right, so if you have a lot of variables already, the new variables won't get printed out. For these examples, we'll make it easier to see what's going on by creating a narrower dataset:

flights2 <- flights %>% 
select(year:day, hour, origin, dest, tailnum, carrier)
flights2
## # A tibble: 336,776 x 8
##     year month   day  hour origin dest  tailnum carrier
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
##  1  2013     1     1  5.00 EWR    IAH   N14228  UA     
##  2  2013     1     1  5.00 LGA    IAH   N24211  UA     
##  3  2013     1     1  5.00 JFK    MIA   N619AA  AA     
##  4  2013     1     1  5.00 JFK    BQN   N804JB  B6     
##  5  2013     1     1  6.00 LGA    ATL   N668DN  DL     
##  6  2013     1     1  5.00 EWR    ORD   N39463  UA     
##  7  2013     1     1  6.00 EWR    FLL   N516JB  B6     
##  8  2013     1     1  6.00 LGA    IAD   N829AS  EV     
##  9  2013     1     1  6.00 JFK    MCO   N593JB  B6     
## 10  2013     1     1  6.00 LGA    ORD   N3ALAA  AA     
## # ... with 336,766 more rows

Imagine you want to add the full airline name to the flights2 data. You can combine the airlines and flights2 data frames with left_join():

flights2 %>%
select(-origin, -dest) %>% 
left_join(airlines, by = "carrier")
## # A tibble: 336,776 x 7
##     year month   day  hour tailnum carrier name                    
##    <int> <int> <int> <dbl> <chr>   <chr>   <chr>                   
##  1  2013     1     1  5.00 N14228  UA      United Air Lines Inc.   
##  2  2013     1     1  5.00 N24211  UA      United Air Lines Inc.   
##  3  2013     1     1  5.00 N619AA  AA      American Airlines Inc.  
##  4  2013     1     1  5.00 N804JB  B6      JetBlue Airways         
##  5  2013     1     1  6.00 N668DN  DL      Delta Air Lines Inc.    
##  6  2013     1     1  5.00 N39463  UA      United Air Lines Inc.   
##  7  2013     1     1  6.00 N516JB  B6      JetBlue Airways         
##  8  2013     1     1  6.00 N829AS  EV      ExpressJet Airlines Inc.
##  9  2013     1     1  6.00 N593JB  B6      JetBlue Airways         
## 10  2013     1     1  6.00 N3ALAA  AA      American Airlines Inc.  
## # ... with 336,766 more rows

The result of joining airlines to flights2 is an additional variable: name. This is why I call this type of join a mutating join. In this case, you could have got to the same place using mutate() and R’s base subsetting:

flights2 %>%
select(-origin, -dest) %>% 
mutate(name = airlines$name[match(carrier, airlines$carrier)])
## # A tibble: 336,776 x 7
##     year month   day  hour tailnum carrier name                    
##    <int> <int> <int> <dbl> <chr>   <chr>   <chr>                   
##  1  2013     1     1  5.00 N14228  UA      United Air Lines Inc.   
##  2  2013     1     1  5.00 N24211  UA      United Air Lines Inc.   
##  3  2013     1     1  5.00 N619AA  AA      American Airlines Inc.  
##  4  2013     1     1  5.00 N804JB  B6      JetBlue Airways         
##  5  2013     1     1  6.00 N668DN  DL      Delta Air Lines Inc.    
##  6  2013     1     1  5.00 N39463  UA      United Air Lines Inc.   
##  7  2013     1     1  6.00 N516JB  B6      JetBlue Airways         
##  8  2013     1     1  6.00 N829AS  EV      ExpressJet Airlines Inc.
##  9  2013     1     1  6.00 N593JB  B6      JetBlue Airways         
## 10  2013     1     1  6.00 N3ALAA  AA      American Airlines Inc.  
## # ... with 336,766 more rows

But this is hard to generalise when you need to match multiple variables, and takes close reading to figure out the overall intent.

The following sections explain, in detail, how mutating joins work. You’ll start by learning a useful visual representation of joins. We’ll then use that to explain the four mutating join functions: the inner join, and the three outer joins. When working with real data, keys don’t always uniquely identify observations, so next we’ll talk about what happens when there isn’t a unique match. Finally, you’ll learn how to tell dplyr which variables are the keys for a given join.

Understanding joins

To see how the joins work, we'll use two small example tables, x and y, each containing a key column and one value column:

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)

Inner Join

The simplest type of join is the inner join. An inner join matches pairs of observations whenever their keys are equal:

(To be precise, this is an inner equijoin because the keys are matched using the equality operator. Since most joins are equijoins we usually drop that specification.)

The output of an inner join is a new data frame that contains the key, the x values, and the y values. We use by to tell dplyr which variable is the key:

x %>% 
inner_join(y, by = "key")
## # A tibble: 2 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1  1.00 x1    y1   
## 2  2.00 x2    y2

The most important property of an inner join is that unmatched rows are not included in the result. This means that inner joins are usually not appropriate for use in analysis because it's too easy to lose observations.

Outer joins

An inner join keeps observations that appear in both tables. An outer join keeps observations that appear in at least one of the tables. There are three types of outer joins:

  1. A left join keeps all observations in x.
  2. A right join keeps all observations in y.
  3. A full join keeps all observations in x and y.

These joins work by adding an additional “virtual” observation to each table. This observation has a key that always matches (if no other key matches), and a value filled with NA.

The most commonly used join is the left join: you use this whenever you look up additional data from another table, because it preserves the original observations even when there isn’t a match. The left join should be your default join: use it unless you have a strong reason to prefer one of the others.
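
As a quick sketch using the x and y tables defined above (the comments describe what each join keeps; no output is shown here):

# left join: every row of x is kept; key 3 has no match in y, so val_y is NA
left_join(x, y, by = "key")

# right join: every row of y is kept; key 4 has no match in x, so val_x is NA
right_join(x, y, by = "key")

# full join: every row of both tables is kept; unmatched keys get NA on the other side
full_join(x, y, by = "key")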

Duplicate keys

So far all our examples assumed that the keys are unique. But that’s not always the case. This section explains what happens when the keys are not unique. There are two possibilities:

  1. One table has duplicate keys. This is useful when you want to add in additional information as there is typically a one-to-many relationship.

In this situation, the key is a primary key in y and a foreign key in x.

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     1, "x4"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2"
)
left_join(x, y, by = "key")
## # A tibble: 4 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1  1.00 x1    y1   
## 2  2.00 x2    y2   
## 3  2.00 x3    y2   
## 4  1.00 x4    y1
  2. Both tables have duplicate keys. This is usually an error because in neither table do the keys uniquely identify an observation. When you join duplicated keys, you get all possible combinations, the Cartesian product:
x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     3, "x4"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     2, "y3",
     3, "y4"
)
left_join(x, y, by = "key")
## # A tibble: 6 x 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1  1.00 x1    y1   
## 2  2.00 x2    y2   
## 3  2.00 x2    y3   
## 4  2.00 x3    y2   
## 5  2.00 x3    y3   
## 6  3.00 x4    y4

Defining the key columns

So far, the pairs of tables have always been joined by a single variable, and that variable has the same name in both tables. That constraint was encoded by by = "key". You can use other values for by to connect the tables in other ways:

  1. The default, by = NULL, uses all variables that appear in both tables, the so called natural join. For example, the flights and weather tables match on their common variables: year, month, day, hour and origin.
flights2 %>% 
left_join(weather)
## Joining, by = c("year", "month", "day", "hour", "origin")
## # A tibble: 336,776 x 18
##     year month   day  hour origin dest  tailnum carrier  temp  dewp humid
##    <dbl> <dbl> <int> <dbl> <chr>  <chr> <chr>   <chr>   <dbl> <dbl> <dbl>
##  1  2013  1.00     1  5.00 EWR    IAH   N14228  UA       NA    NA    NA  
##  2  2013  1.00     1  5.00 LGA    IAH   N24211  UA       NA    NA    NA  
##  3  2013  1.00     1  5.00 JFK    MIA   N619AA  AA       NA    NA    NA  
##  4  2013  1.00     1  5.00 JFK    BQN   N804JB  B6       NA    NA    NA  
##  5  2013  1.00     1  6.00 LGA    ATL   N668DN  DL       39.9  26.1  57.3
##  6  2013  1.00     1  5.00 EWR    ORD   N39463  UA       NA    NA    NA  
##  7  2013  1.00     1  6.00 EWR    FLL   N516JB  B6       39.0  26.1  59.4
##  8  2013  1.00     1  6.00 LGA    IAD   N829AS  EV       39.9  26.1  57.3
##  9  2013  1.00     1  6.00 JFK    MCO   N593JB  B6       39.0  26.1  59.4
## 10  2013  1.00     1  6.00 LGA    ORD   N3ALAA  AA       39.9  26.1  57.3
## # ... with 336,766 more rows, and 7 more variables: wind_dir <dbl>,
## #   wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## #   visib <dbl>, time_hour <dttm>
  2. A character vector, by = "x". This is like a natural join, but uses only some of the common variables. For example, flights and planes have year variables, but they mean different things so we only want to join by tailnum.
flights2 %>% 
left_join(planes, by = "tailnum")
## # A tibble: 336,776 x 16
##    year.x month   day  hour origin dest  tailnum carrier year.y type      
##     <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int> <chr>     
##  1   2013     1     1  5.00 EWR    IAH   N14228  UA        1999 Fixed win~
##  2   2013     1     1  5.00 LGA    IAH   N24211  UA        1998 Fixed win~
##  3   2013     1     1  5.00 JFK    MIA   N619AA  AA        1990 Fixed win~
##  4   2013     1     1  5.00 JFK    BQN   N804JB  B6        2012 Fixed win~
##  5   2013     1     1  6.00 LGA    ATL   N668DN  DL        1991 Fixed win~
##  6   2013     1     1  5.00 EWR    ORD   N39463  UA        2012 Fixed win~
##  7   2013     1     1  6.00 EWR    FLL   N516JB  B6        2000 Fixed win~
##  8   2013     1     1  6.00 LGA    IAD   N829AS  EV        1998 Fixed win~
##  9   2013     1     1  6.00 JFK    MCO   N593JB  B6        2004 Fixed win~
## 10   2013     1     1  6.00 LGA    ORD   N3ALAA  AA          NA <NA>      
## # ... with 336,766 more rows, and 6 more variables: manufacturer <chr>,
## #   model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>
  3. A named character vector: by = c("a" = "b"). This will match variable a in table x to variable b in table y. The variables from x will be used in the output.

For example, if we want to draw a map we need to combine the flights data with the airports data which contains the location (lat and lon) of each airport. Each flight has an origin and destination airport, so we need to specify which one we want to join to:

flights2 %>% 
left_join(airports, c("dest" = "faa"))
## # A tibble: 336,776 x 15
##     year month   day  hour origin dest  tailnum carrier name     lat   lon
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>  <dbl> <dbl>
##  1  2013     1     1  5.00 EWR    IAH   N14228  UA      Georg~  30.0 -95.3
##  2  2013     1     1  5.00 LGA    IAH   N24211  UA      Georg~  30.0 -95.3
##  3  2013     1     1  5.00 JFK    MIA   N619AA  AA      Miami~  25.8 -80.3
##  4  2013     1     1  5.00 JFK    BQN   N804JB  B6      <NA>    NA    NA  
##  5  2013     1     1  6.00 LGA    ATL   N668DN  DL      Harts~  33.6 -84.4
##  6  2013     1     1  5.00 EWR    ORD   N39463  UA      Chica~  42.0 -87.9
##  7  2013     1     1  6.00 EWR    FLL   N516JB  B6      Fort ~  26.1 -80.2
##  8  2013     1     1  6.00 LGA    IAD   N829AS  EV      Washi~  38.9 -77.5
##  9  2013     1     1  6.00 JFK    MCO   N593JB  B6      Orlan~  28.4 -81.3
## 10  2013     1     1  6.00 LGA    ORD   N3ALAA  AA      Chica~  42.0 -87.9
## # ... with 336,766 more rows, and 4 more variables: alt <int>, tz <dbl>,
## #   dst <chr>, tzone <chr>

Other implementations

base::merge() can perform all four types of mutating join:

-inner_join(x, y): merge(x, y)
-left_join(x, y): merge(x, y, all.x = TRUE)
-right_join(x, y): merge(x, y, all.y = TRUE)
-full_join(x, y): merge(x, y, all.x = TRUE, all.y = TRUE)

The advantages of the specific dplyr verbs is that they more clearly convey the intent of your code: the difference between the joins is really important but concealed in the arguments of merge(). dplyr’s joins are considerably faster and don’t mess with the order of the rows.

SQL is the inspiration for dplyr’s conventions, so the translation is straightforward:

-inner_join(x, y, by = "z"): SELECT * FROM x INNER JOIN y USING (z)
-left_join(x, y, by = "z"): SELECT * FROM x LEFT OUTER JOIN y USING (z)
-right_join(x, y, by = "z"): SELECT * FROM x RIGHT OUTER JOIN y USING (z)
-full_join(x, y, by = "z"): SELECT * FROM x FULL OUTER JOIN y USING (z)

Note that “INNER” and “OUTER” are optional, and often omitted.

Joining different variables between the tables, e.g. inner_join(x, y, by = c("a" = "b")), uses a slightly different syntax in SQL: SELECT * FROM x INNER JOIN y ON x.a = y.b. As this syntax suggests, SQL supports a wider range of join types than dplyr because you can connect the tables using constraints other than equality (sometimes called non-equijoins).

Filtering Joins:

Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:

  1. semi_join(x, y) keeps all observations in x that have a match in y.
  2. anti_join(x, y) drops all observations in x that have a match in y.

Semi-joins are useful for matching filtered summary tables back to the original rows. For example, imagine you’ve found the top ten most popular destinations:

top_dest <- flights %>%
count(dest, sort = TRUE) %>%
head(10)
top_dest
## # A tibble: 10 x 2
##    dest      n
##    <chr> <int>
##  1 ORD   17283
##  2 ATL   17215
##  3 LAX   16174
##  4 BOS   15508
##  5 MCO   14082
##  6 CLT   14064
##  7 SFO   13331
##  8 FLL   12055
##  9 MIA   11728
## 10 DCA    9705

Now you want to find each flight that went to one of those destinations. You could construct a filter yourself:

flights %>% 
filter(dest %in% top_dest$dest)
## # A tibble: 141,145 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      542            540      2.00      923
##  2  2013     1     1      554            600     -6.00      812
##  3  2013     1     1      554            558     -4.00      740
##  4  2013     1     1      555            600     -5.00      913
##  5  2013     1     1      557            600     -3.00      838
##  6  2013     1     1      558            600     -2.00      753
##  7  2013     1     1      558            600     -2.00      924
##  8  2013     1     1      558            600     -2.00      923
##  9  2013     1     1      559            559      0         702
## 10  2013     1     1      600            600      0         851
## # ... with 141,135 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

But it’s difficult to extend that approach to multiple variables. For example, imagine that you’d found the 10 days with highest average delays. How would you construct the filter statement that used year, month, and day to match it back to flights?

Instead you can use a semi-join, which connects the two tables like a mutating join, but instead of adding new columns, only keeps the rows in x that have a match in y:

flights %>% 
semi_join(top_dest)
## Joining, by = "dest"
## # A tibble: 141,145 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      542            540      2.00      923
##  2  2013     1     1      554            600     -6.00      812
##  3  2013     1     1      554            558     -4.00      740
##  4  2013     1     1      555            600     -5.00      913
##  5  2013     1     1      557            600     -3.00      838
##  6  2013     1     1      558            600     -2.00      753
##  7  2013     1     1      558            600     -2.00      924
##  8  2013     1     1      558            600     -2.00      923
##  9  2013     1     1      559            559      0         702
## 10  2013     1     1      600            600      0         851
## # ... with 141,135 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

The inverse of a semi-join is an anti-join. An anti-join keeps the rows that don't have a match.

Anti-joins are useful for diagnosing join mismatches. For example, when connecting flights and planes, you might be interested to know that there are many flights that don’t have a match in planes:

flights %>%
anti_join(planes, by = "tailnum") %>%
count(tailnum, sort = TRUE)
## # A tibble: 722 x 2
##    tailnum     n
##    <chr>   <int>
##  1 <NA>     2512
##  2 N725MQ    575
##  3 N722MQ    513
##  4 N723MQ    507
##  5 N713MQ    483
##  6 N735MQ    396
##  7 N0EGMQ    371
##  8 N534MQ    364
##  9 N542MQ    363
## 10 N531MQ    349
## # ... with 712 more rows

Join problems

The data you’ve been working with in this chapter has been cleaned up so that you’ll have as few problems as possible. Your own data is unlikely to be so nice, so there are a few things that you should do with your own data to make your joins go smoothly.

  1. Start by identifying the variables that form the primary key in each table. You should usually do this based on your understanding of the data, not empirically by looking for a combination of variables that give a unique identifier.

If you just look for variables without thinking about what they mean, you might get (un)lucky and find a combination that’s unique in your current data but the relationship might not be true in general.

For example, the altitude and longitude uniquely identify each airport, but they are not good identifiers!

airports %>% count(alt, lon) %>% filter(n > 1)
## # A tibble: 0 x 3
## # ... with 3 variables: alt <int>, lon <dbl>, n <int>
  2. Check that none of the variables in the primary key are missing. If a value is missing then it can't identify an observation!

  3. Check that your foreign keys match primary keys in another table. The best way to do this is with an anti_join(). It's common for keys not to match because of data entry errors. Fixing these is often a lot of work.

  4. If you do have missing keys, you'll need to be thoughtful about your use of inner vs. outer joins, carefully considering whether or not you want to drop rows that don't have a match.

Be aware that simply checking the number of rows before and after the join is not sufficient to ensure that your join has gone smoothly. If you have an inner join with duplicate keys in both tables, you might get unlucky as the number of dropped rows might exactly equal the number of duplicated rows!

Set operations

The final type of two-table verb is the set operation. Generally, I use these the least frequently, but they are occasionally useful when you want to break a single complex filter into simpler pieces. All these operations work with a complete row, comparing the values of every variable. They expect the x and y inputs to have the same variables, and treat the observations like sets:

  1. intersect(x, y): return only observations in both x and y.
  2. union(x, y): return unique observations in x and y.
  3. setdiff(x, y): return observations in x, but not in y.

Given this simple data:

df1 <- tribble(
  ~x, ~y,
   1,  1,
   2,  1
)
df2 <- tribble(
  ~x, ~y,
   1,  1,
   1,  2
)

The four possibilities are:

intersect(df1, df2)
## # A tibble: 1 x 2
##       x     y
##   <dbl> <dbl>
## 1  1.00  1.00
union(df1, df2)
## # A tibble: 3 x 2
##       x     y
##   <dbl> <dbl>
## 1  1.00  2.00
## 2  2.00  1.00
## 3  1.00  1.00
setdiff(df1, df2)
## # A tibble: 1 x 2
##       x     y
##   <dbl> <dbl>
## 1  2.00  1.00
setdiff(df2, df1)
## # A tibble: 1 x 2
##       x     y
##   <dbl> <dbl>
## 1  1.00  2.00

Data Transformation with dplyr:

dplyr is a very powerful data manipulation package. Similar to SQL, it uses several functions to select, filter, group, summarise, and mutate data. We will be leveraging the flights data frame, which contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics and is documented in ?flights. You can get additional information by typing ? before a function or dataset name, or by clicking the Packages tab in RStudio.

library(nycflights13)

There are five key dplyr functions that allow you to solve the vast majority of your data manipulation challenges:

-Pick observations by their values (filter()).
-Reorder the rows (arrange()).
-Pick variables by their names (select()).
-Create new variables with functions of existing variables (mutate()).
-Collapse many values down to a single summary (summarise()).

These can all be used in conjunction with group_by() which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. These six functions provide the verbs for a language of data manipulation.
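
For example, a minimal sketch combining group_by() and summarise() to compute the number of flights and the average arrival delay for each destination (the column names n and mean_arr_delay are just illustrative):

flights %>%
  group_by(dest) %>%
  summarise(
    n = n(),
    mean_arr_delay = mean(arr_delay, na.rm = TRUE)
  )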

All verbs work similarly:

  1. The first argument is a data frame.

  2. The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).

  3. The result is a new data frame.

Filter rows with filter()

filter() allows you to subset observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame. For example, we can select all flights on January 1st with:

filter(flights, month == 1, day == 1)
## # A tibble: 842 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515      2.00      830
##  2  2013     1     1      533            529      4.00      850
##  3  2013     1     1      542            540      2.00      923
##  4  2013     1     1      544            545     -1.00     1004
##  5  2013     1     1      554            600     -6.00      812
##  6  2013     1     1      554            558     -4.00      740
##  7  2013     1     1      555            600     -5.00      913
##  8  2013     1     1      557            600     -3.00      709
##  9  2013     1     1      557            600     -3.00      838
## 10  2013     1     1      558            600     -2.00      753
## # ... with 832 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

When you run that line of code, dplyr executes the filtering operation and returns a new data frame. dplyr functions never modify their inputs, so if you want to save the result, you’ll need to use the assignment operator, “<-”:
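For example, we could store the January 1st flights in a new object and print it later by typing its name (jan1 is just an illustrative name):

jan1 <- filter(flights, month == 1, day == 1)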

R either prints out the results, or saves them to a variable. If you want to do both, you can wrap the assignment in parentheses:

(dec25 <- filter(flights, month == 12, day == 25))
## # A tibble: 719 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013    12    25      456            500     -4.00      649
##  2  2013    12    25      524            515      9.00      805
##  3  2013    12    25      542            540      2.00      832
##  4  2013    12    25      546            550     -4.00     1022
##  5  2013    12    25      556            600     -4.00      730
##  6  2013    12    25      557            600     -3.00      743
##  7  2013    12    25      557            600     -3.00      818
##  8  2013    12    25      559            600     -1.00      855
##  9  2013    12    25      559            600     -1.00      849
## 10  2013    12    25      600            600      0         850
## # ... with 709 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

To use filtering effectively, you have to know how to select the observations that you want using the comparison operators. R provides the standard suite: >, >=, <, <=, != (not equal), and == (equal).

When you’re starting out with R, the easiest mistake to make is to use = instead of == when testing for equality. If you write filter(flights, month = 1), dplyr stops with an informative error reminding you to use ==. The correct form is:

filter(flights, month == 1)
## # A tibble: 27,004 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515      2.00      830
##  2  2013     1     1      533            529      4.00      850
##  3  2013     1     1      542            540      2.00      923
##  4  2013     1     1      544            545     -1.00     1004
##  5  2013     1     1      554            600     -6.00      812
##  6  2013     1     1      554            558     -4.00      740
##  7  2013     1     1      555            600     -5.00      913
##  8  2013     1     1      557            600     -3.00      709
##  9  2013     1     1      557            600     -3.00      838
## 10  2013     1     1      558            600     -2.00      753
## # ... with 26,994 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

There’s another common problem you might encounter when using ==: floating point numbers. These results might surprise you!

sqrt(2) ^ 2 == 2
## [1] FALSE

Computers use finite precision arithmetic (they obviously can’t store an infinite number of digits!) so remember that every number you see is an approximation. Instead of relying on ==, use near():

near(sqrt(2) ^ 2,  2)
## [1] TRUE
Logical operators

Multiple arguments to filter() are combined with “and”: every expression must be true in order for a row to be included in the output. For other types of combinations, you’ll need to use Boolean operators yourself: & is “and”, | is “or”, and ! is “not”.

The following code finds all flights that departed in November or December:

filter(flights, month == 11 | month == 12)
## # A tibble: 55,403 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013    11     1        5           2359      6.00      352
##  2  2013    11     1       35           2250    105         123
##  3  2013    11     1      455            500   -  5.00      641
##  4  2013    11     1      539            545   -  6.00      856
##  5  2013    11     1      542            545   -  3.00      831
##  6  2013    11     1      549            600   - 11.0       912
##  7  2013    11     1      550            600   - 10.0       705
##  8  2013    11     1      554            600   -  6.00      659
##  9  2013    11     1      554            600   -  6.00      826
## 10  2013    11     1      554            600   -  6.00      749
## # ... with 55,393 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

The order of operations doesn’t work like English. You can’t write:

filter(flights, month == 11 | 12)

which you might literally translate as “find all flights that departed in November or December”.

Instead it finds all rows where month equals 11 | 12. The expression 11 | 12 evaluates to TRUE, and in a numeric context (like here) TRUE becomes 1, so this finds all flights in January, not November or December. This is quite confusing!

A useful short-hand for this problem is x %in% y. This will select every row where x is one of the values in y. We could use it to rewrite the code above:

nov_dec <- filter(flights, month %in% c(11, 12))

Missing values

One important feature of R that can make comparisons tricky is missing values, or NAs (“not availables”). NA represents an unknown value, so missing values are “contagious”: almost any operation involving an unknown value will also be unknown.

NA > 5
## [1] NA
10 == NA
## [1] NA
NA + 10
## [1] NA
NA / 2
## [1] NA

The most confusing result is this one:

NA == NA
## [1] NA
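One way to think about this result: NA means “I don’t know”, and comparing two unknown values cannot tell you whether they are equal, so the answer is itself unknown. A small sketch:

# Let x be one unknown age and y be another unknown age.
x <- NA
y <- NA
# Are the two ages equal? We can't know.
x == y
## [1] NA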

To check whether a value is missing, use the is.na() function:

x <- NA
is.na(x)
## [1] TRUE

filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly:

df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
## # A tibble: 1 x 1
##       x
##   <dbl>
## 1  3.00
filter(df, is.na(x) | x > 1)
## # A tibble: 2 x 1
##       x
##   <dbl>
## 1 NA   
## 2  3.00

Missing values also affect aggregation functions, which obey the usual rule of missing values: if there is any missing value in the input, the output will be a missing value. Fortunately, all aggregation functions have an na.rm argument which removes the missing values prior to computation. Consider the command below, which is run without na.rm:

flights %>% 
  group_by(year, month, day) %>% 
  summarise(mean = mean(dep_delay))
## # A tibble: 365 x 4
## # Groups:   year, month [?]
##     year month   day  mean
##    <int> <int> <int> <dbl>
##  1  2013     1     1    NA
##  2  2013     1     2    NA
##  3  2013     1     3    NA
##  4  2013     1     4    NA
##  5  2013     1     5    NA
##  6  2013     1     6    NA
##  7  2013     1     7    NA
##  8  2013     1     8    NA
##  9  2013     1     9    NA
## 10  2013     1    10    NA
## # ... with 355 more rows

To remove these NA values from the calculation we can use the na.rm argument:

flights %>% 
group_by(year, month, day) %>% 
summarise(mean = mean(dep_delay, na.rm = TRUE))
## # A tibble: 365 x 4
## # Groups:   year, month [?]
##     year month   day  mean
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.5 
##  2  2013     1     2 13.9 
##  3  2013     1     3 11.0 
##  4  2013     1     4  8.95
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.55
##  9  2013     1     9  2.28
## 10  2013     1    10  2.84
## # ... with 355 more rows

Arrange rows with arrange()

arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:

arrange(flights, year, month, day)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515      2.00      830
##  2  2013     1     1      533            529      4.00      850
##  3  2013     1     1      542            540      2.00      923
##  4  2013     1     1      544            545     -1.00     1004
##  5  2013     1     1      554            600     -6.00      812
##  6  2013     1     1      554            558     -4.00      740
##  7  2013     1     1      555            600     -5.00      913
##  8  2013     1     1      557            600     -3.00      709
##  9  2013     1     1      557            600     -3.00      838
## 10  2013     1     1      558            600     -2.00      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Use desc() to re-order by a column in descending order:

arrange(flights, desc(dep_delay))
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     9      641            900      1301     1242
##  2  2013     6    15     1432           1935      1137     1607
##  3  2013     1    10     1121           1635      1126     1239
##  4  2013     9    20     1139           1845      1014     1457
##  5  2013     7    22      845           1600      1005     1044
##  6  2013     4    10     1100           1900       960     1342
##  7  2013     3    17     2321            810       911      135
##  8  2013     6    27      959           1900       899     1236
##  9  2013     7    22     2257            759       898      121
## 10  2013    12     5      756           1700       896     1058
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Missing values are always sorted at the end:

df <- tibble(x = c(5, 2, NA))
arrange(df, x)
## # A tibble: 3 x 1
##       x
##   <dbl>
## 1  2.00
## 2  5.00
## 3 NA
arrange(df, desc(x))
## # A tibble: 3 x 1
##       x
##   <dbl>
## 1  5.00
## 2  2.00
## 3 NA

Select columns with select()

It’s not uncommon to get datasets with hundreds or even thousands of variables. In this case, the first challenge is often narrowing in on the variables you’re actually interested in. select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.

select() is not terribly useful with the flights data because we only have 19 variables, but you can still get the general idea:

select(flights, year, month, day)
## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows

You can also select columns that are between two specific fields:

select(flights, year:day)
## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows

To exclude fields, you can use the “-” symbol prior to the column name.

select(flights, -(year:day))
## # A tibble: 336,776 x 16
##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
##       <int>          <int>     <dbl>    <int>          <int>     <dbl>
##  1      517            515      2.00      830            819     11.0 
##  2      533            529      4.00      850            830     20.0 
##  3      542            540      2.00      923            850     33.0 
##  4      544            545     -1.00     1004           1022    -18.0 
##  5      554            600     -6.00      812            837    -25.0 
##  6      554            558     -4.00      740            728     12.0 
##  7      555            600     -5.00      913            854     19.0 
##  8      557            600     -3.00      709            723    -14.0 
##  9      557            600     -3.00      838            846    - 8.00
## 10      558            600     -2.00      753            745      8.00
## # ... with 336,766 more rows, and 10 more variables: carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

There are a number of helper functions you can use within select():

-starts_with(“abc”): matches names that begin with “abc”.

-ends_with(“xyz”): matches names that end with “xyz”.

-contains(“ijk”): matches names that contain “ijk”.

-matches(“(.)\\1”): selects variables that match a regular expression. This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings.

-num_range(“x”, 1:3): matches x1, x2 and x3.
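For example, a few of these helpers applied to the flights data (a minimal sketch; the output is omitted here):

select(flights, starts_with("dep"))   # columns whose names start with "dep"
select(flights, ends_with("time"))    # columns whose names end with "time"
select(flights, contains("arr"))      # columns whose names contain "arr"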

Another option is to use select() in conjunction with the everything() helper. This is useful if you have a handful of variables you’d like to move to the start of the data frame.

select(flights, time_hour, air_time, everything())
## # A tibble: 336,776 x 19
##    time_hour           air_time  year month   day dep_time sched_dep_time
##    <dttm>                 <dbl> <int> <int> <int>    <int>          <int>
##  1 2013-01-01 05:00:00    227    2013     1     1      517            515
##  2 2013-01-01 05:00:00    227    2013     1     1      533            529
##  3 2013-01-01 05:00:00    160    2013     1     1      542            540
##  4 2013-01-01 05:00:00    183    2013     1     1      544            545
##  5 2013-01-01 06:00:00    116    2013     1     1      554            600
##  6 2013-01-01 05:00:00    150    2013     1     1      554            558
##  7 2013-01-01 06:00:00    158    2013     1     1      555            600
##  8 2013-01-01 06:00:00     53.0  2013     1     1      557            600
##  9 2013-01-01 06:00:00    140    2013     1     1      557            600
## 10 2013-01-01 06:00:00    138    2013     1     1      558            600
## # ... with 336,766 more rows, and 12 more variables: dep_delay <dbl>,
## #   arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
## #   hour <dbl>, minute <dbl>

Renaming Fields with rename()

select() can be used to rename variables, but it’s rarely useful because it drops all of the variables not explicitly mentioned. Instead, use rename(), which is a variant of select() that keeps all the variables that aren’t explicitly mentioned:

rename(flights, tail_num = tailnum)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515      2.00      830
##  2  2013     1     1      533            529      4.00      850
##  3  2013     1     1      542            540      2.00      923
##  4  2013     1     1      544            545     -1.00     1004
##  5  2013     1     1      554            600     -6.00      812
##  6  2013     1     1      554            558     -4.00      740
##  7  2013     1     1      555            600     -5.00      913
##  8  2013     1     1      557            600     -3.00      709
##  9  2013     1     1      557            600     -3.00      838
## 10  2013     1     1      558            600     -2.00      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tail_num <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Add new variables with mutate()

Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. That’s the job of mutate().

mutate() always adds new columns at the end of your dataset, so we’ll start by creating a narrower dataset to make the new variables easier to see.

flights_sml <- select(flights, 
year:day, 
ends_with("delay"), 
distance, 
air_time)

mutate(flights_sml,
gain = dep_delay - arr_delay,
speed = distance / air_time * 60)
## # A tibble: 336,776 x 9
##     year month   day dep_delay arr_delay distance air_time   gain speed
##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>  <dbl> <dbl>
##  1  2013     1     1      2.00     11.0      1400    227   - 9.00   370
##  2  2013     1     1      4.00     20.0      1416    227   -16.0    374
##  3  2013     1     1      2.00     33.0      1089    160   -31.0    408
##  4  2013     1     1     -1.00    -18.0      1576    183    17.0    517
##  5  2013     1     1     -6.00    -25.0       762    116    19.0    394
##  6  2013     1     1     -4.00     12.0       719    150   -16.0    288
##  7  2013     1     1     -5.00     19.0      1065    158   -24.0    404
##  8  2013     1     1     -3.00    -14.0       229     53.0  11.0    259
##  9  2013     1     1     -3.00    - 8.00      944    140     5.00   405
## 10  2013     1     1     -2.00      8.00      733    138   -10.0    319
## # ... with 336,766 more rows

Note that you can refer to columns that you’ve just created:

mutate(flights_sml,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours)
## # A tibble: 336,776 x 10
##     year month   day dep_delay arr_delay distance air_time   gain hours
##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>  <dbl> <dbl>
##  1  2013     1     1      2.00     11.0      1400    227   - 9.00 3.78 
##  2  2013     1     1      4.00     20.0      1416    227   -16.0  3.78 
##  3  2013     1     1      2.00     33.0      1089    160   -31.0  2.67 
##  4  2013     1     1     -1.00    -18.0      1576    183    17.0  3.05 
##  5  2013     1     1     -6.00    -25.0       762    116    19.0  1.93 
##  6  2013     1     1     -4.00     12.0       719    150   -16.0  2.50 
##  7  2013     1     1     -5.00     19.0      1065    158   -24.0  2.63 
##  8  2013     1     1     -3.00    -14.0       229     53.0  11.0  0.883
##  9  2013     1     1     -3.00    - 8.00      944    140     5.00 2.33 
## 10  2013     1     1     -2.00      8.00      733    138   -10.0  2.30 
## # ... with 336,766 more rows, and 1 more variable: gain_per_hour <dbl>

If you only want to keep the new variables, use transmute():

transmute(flights,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours)
## # A tibble: 336,776 x 3
##      gain hours gain_per_hour
##     <dbl> <dbl>         <dbl>
##  1 - 9.00 3.78         - 2.38
##  2 -16.0  3.78         - 4.23
##  3 -31.0  2.67         -11.6 
##  4  17.0  3.05           5.57
##  5  19.0  1.93           9.83
##  6 -16.0  2.50         - 6.40
##  7 -24.0  2.63         - 9.11
##  8  11.0  0.883         12.5 
##  9   5.00 2.33           2.14
## 10 -10.0  2.30         - 4.35
## # ... with 336,766 more rows
Useful creation functions

There are many functions for creating new variables that you can use with mutate(). The key property is that the function must be vectorised: it must take a vector of values as input and return a vector with the same number of values as output. There’s no way to list every possible function that you might use, but here’s a selection of functions that are frequently useful:

-Arithmetic operators: +, -, *, /, ^

These are all vectorised, using the so called “recycling rules”. If one parameter is shorter than the other, it will be automatically extended to be the same length. This is most useful when one of the arguments is a single number: air_time / 60, hours * 60 + minute, etc.
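A quick illustration of recycling with plain vectors:

1:6 + 10          # 10 is recycled to length 6
## [1] 11 12 13 14 15 16
1:6 * c(1, 2)     # c(1, 2) is recycled to length 6
## [1]  1  4  3  8  5 12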

-Arithmetic operators are also useful in conjunction with the aggregate functions you’ll learn about later. For example, x / sum(x) calculates the proportion of a total, and y - mean(y) computes the difference from the mean.

-Modular arithmetic: %/% (integer division) and %% (remainder), where x == y * (x %/% y) + (x %% y). Modular arithmetic is a handy tool because it allows you to break integers up into pieces. For example, in the flights dataset, you can compute hour and minute from dep_time with:

transmute(flights,
dep_time,
hour = dep_time %/% 100,
minute = dep_time %% 100)
## # A tibble: 336,776 x 3
##    dep_time  hour minute
##       <int> <dbl>  <dbl>
##  1      517  5.00   17.0
##  2      533  5.00   33.0
##  3      542  5.00   42.0
##  4      544  5.00   44.0
##  5      554  5.00   54.0
##  6      554  5.00   54.0
##  7      555  5.00   55.0
##  8      557  5.00   57.0
##  9      557  5.00   57.0
## 10      558  5.00   58.0
## # ... with 336,766 more rows

-Logs: log(), log2(), log10(). Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude. They also convert multiplicative relationships to additive, a feature we’ll come back to in modelling.

All else being equal, I recommend using log2() because it’s easy to interpret: a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving.
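For example, each step in this vector doubles the previous value, so log2() increases by exactly 1 each time:

log2(c(1, 2, 4, 8))
## [1] 0 1 2 3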

-Offsets: lead() and lag() allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. x - lag(x)) or find when values change (x != lag(x)). They are most useful in conjunction with group_by(), which you’ll learn about shortly.

(x <- 1:10)
##  [1]  1  2  3  4  5  6  7  8  9 10
lag(x)
##  [1] NA  1  2  3  4  5  6  7  8  9
lead(x)
##  [1]  2  3  4  5  6  7  8  9 10 NA

-Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: cumsum(), cumprod(), cummin(), cummax(); and dplyr provides cummean() for cumulative means. If you need rolling aggregates (i.e. a sum computed over a rolling window), try the RcppRoll package.

cumsum(x)
##  [1]  1  3  6 10 15 21 28 36 45 55
cummean(x)
##  [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5

-Logical comparisons, <, <=, >, >=, !=, and ==, which you learned about earlier. If you’re doing a complex sequence of logical operations it’s often a good idea to store the interim values in new variables so you can check that each step is working as expected.
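As a minimal sketch of that advice (the column names here are purely illustrative), you can store each intermediate comparison with mutate() before combining them:

flights_sml %>%
  mutate(
    big_gain  = (dep_delay - arr_delay) > 30,   # interim logical value
    short_hop = distance < 500,                 # interim logical value
    flagged   = big_gain & short_hop            # final combination
  )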

-Ranking: there are a number of ranking functions, but you should start with min_rank(). It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th). The default gives the smallest values the smallest ranks; use desc(x) to give the largest values the smallest ranks.

y <- c(1, 2, 2, NA, 3, 4)
min_rank(y)
## [1]  1  2  2 NA  4  5
min_rank(desc(y))
## [1]  5  3  3 NA  2  1

If min_rank() doesn’t do what you need, look at the variants row_number(), dense_rank(), percent_rank(), cume_dist(), ntile(). See their help pages for more details.

row_number(y)
## [1]  1  2  3 NA  4  5
dense_rank(y)
## [1]  1  2  2 NA  3  4
percent_rank(y)
## [1] 0.00 0.25 0.25   NA 0.75 1.00
cume_dist(y)
## [1] 0.2 0.6 0.6  NA 0.8 1.0

Grouped summaries with summarise()

The last key verb is summarise(). It collapses a data frame to a single row:

summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 x 1
##   delay
##   <dbl>
## 1  12.6

summarise() is not terribly useful unless we pair it with group_by(). This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame they’ll be automatically applied “by group”. For example, if we applied exactly the same code to a data frame grouped by date, we get the average delay per date:

by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 365 x 4
## # Groups:   year, month [?]
##     year month   day delay
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.5 
##  2  2013     1     2 13.9 
##  3  2013     1     3 11.0 
##  4  2013     1     4  8.95
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.55
##  9  2013     1     9  2.28
## 10  2013     1    10  2.84
## # ... with 355 more rows

Together group_by() and summarise() provide one of the tools that you’ll use most commonly when working with dplyr: grouped summaries. But before we go any further with this, we need to introduce a powerful new idea: the pipe.

Combining multiple operations with the pipe

Imagine that we want to explore the relationship between the distance and average delay for each location. Using what you know about dplyr, you might write code like this:

by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE))
delay <- filter(delay, count > 20, dest != "HNL")

Plotting this data suggests that delays increase with distance up to ~750 miles and then decrease. Maybe as flights get longer there’s more ability to make up delays in the air?

ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

There are three steps to prepare this data:

  1. Group flights by destination.

  2. Summarise to compute distance, average delay, and number of flights.

  3. Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.

This code is a little frustrating to write because we have to give each intermediate data frame a name, even though we don’t care about it. Naming things is hard, so this slows down our analysis.

There’s another way to tackle the same problem with the pipe, %>%:

delays <- flights %>% 
group_by(dest) %>% 
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)) %>% 
filter(count > 20, dest != "HNL")

This focuses on the transformations, not what’s being transformed, which makes the code easier to read. You can read it as a series of imperative statements: group, then summarise, then filter. As suggested by this reading, a good way to pronounce %>% when reading code is “then”.

Behind the scenes, x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z) turns into g(f(x, y), z) and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom. We’ll use piping frequently from now on because it considerably improves the readability of code, and we’ll come back to it in more detail in pipes.
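To make that rewriting concrete, these two expressions are equivalent ways of counting flights per destination; the piped version simply reads left-to-right:

# Nested form: f(x, y) style
summarise(group_by(flights, dest), n = n())

# Piped form: x %>% f(y) style
flights %>% group_by(dest) %>% summarise(n = n())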

Working with the pipe is one of the key criteria for belonging to the tidyverse. The only exception is ggplot2: it was written before the pipe was discovered. Unfortunately, the next iteration of ggplot2, ggvis, which does use the pipe, isn’t quite ready for prime time yet.

Counts

Whenever you do any aggregation, it’s always a good idea to include either a count (n()), or a count of non-missing values (sum(!is.na(x))). That way you can check that you’re not drawing conclusions based on very small amounts of data. For example, let’s look at the planes (identified by their tail number) that have the highest average delays:

not_cancelled <- flights %>% 
filter(!is.na(dep_delay), !is.na(arr_delay))

delays <- not_cancelled %>% 
group_by(tailnum) %>% 
summarise(
delay = mean(arr_delay))

ggplot(data = delays, mapping = aes(x = delay)) + 
geom_freqpoly(binwidth = 10)

The story is a little more nuanced than that plot alone suggests, however. We can get more insight if we draw a scatterplot of number of flights vs. average delay:

delays <- not_cancelled %>% 
group_by(tailnum) %>% 
summarise(
delay = mean(arr_delay, na.rm = TRUE),
n = n())
ggplot(data = delays, mapping = aes(x = n, y = delay)) + 
geom_point(alpha = 1/10)

Not surprisingly, there is much greater variation in the average delay when there are few flights. The shape of this plot is very characteristic: whenever you plot a mean (or other summary) vs. group size, you’ll see that the variation decreases as the sample size increases.

When looking at this sort of plot, it’s often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups. This is what the following code does, as well as showing you a handy pattern for integrating ggplot2 into dplyr flows. It’s a bit painful that you have to switch from %>% to +, but once you get the hang of it, it’s quite convenient.

delays %>% 
filter(n > 25) %>% 
ggplot(mapping = aes(x = n, y = delay)) + 
geom_point(alpha = 1/10)

There’s another common variation of this type of pattern. Let’s look at how the average performance of batters in baseball is related to the number of times they’re at bat. Here I use data from the Lahman package to compute the batting average (number of hits / number of attempts) of every major league baseball player.

When I plot the skill of the batter (measured by the batting average, ba) against the number of opportunities to hit the ball (measured by at bat, ab), you see two patterns:

-As above, the variation in our aggregate decreases as we get more data points.

-There’s a positive correlation between skill (ba) and opportunities to hit the ball (ab). This is because teams control who gets to play, and obviously they’ll pick their best players.

batting <- as_tibble(Lahman::Batting)

batters <- batting %>% 
group_by(playerID) %>% 
summarise(
ba = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
ab = sum(AB, na.rm = TRUE))

batters %>% 
filter(ab > 100) %>% 
ggplot(mapping = aes(x = ab, y = ba)) +
geom_point() + 
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

You’ve seen n(), which takes no arguments, and returns the size of the current group. To count the number of non-missing values, use sum(!is.na(x)). To count the number of distinct (unique) values, use n_distinct(x).

not_cancelled %>% 
  group_by(dest) %>% 
  summarise(carriers = n_distinct(carrier)) %>% 
  arrange(desc(carriers))
## # A tibble: 104 x 2
##    dest  carriers
##    <chr>    <int>
##  1 ATL          7
##  2 BOS          7
##  3 CLT          7
##  4 ORD          7
##  5 TPA          7
##  6 AUS          6
##  7 DCA          6
##  8 DTW          6
##  9 IAD          6
## 10 MSP          6
## # ... with 94 more rows

Counts are so useful that dplyr provides a simple helper if all you want is a count:

not_cancelled %>% 
  count(dest)
## # A tibble: 104 x 2
##    dest      n
##    <chr> <int>
##  1 ABQ     254
##  2 ACK     264
##  3 ALB     418
##  4 ANC       8
##  5 ATL   16837
##  6 AUS    2411
##  7 AVL     261
##  8 BDL     412
##  9 BGR     358
## 10 BHM     269
## # ... with 94 more rows

You can optionally provide a weight variable. For example, you could use this to “count” (sum) the total number of miles a plane flew:

not_cancelled %>% 
  count(tailnum, wt = distance)
## # A tibble: 4,037 x 2
##    tailnum      n
##    <chr>    <dbl>
##  1 D942DN    3418
##  2 N0EGMQ  239143
##  3 N10156  109664
##  4 N102UW   25722
##  5 N103US   24619
##  6 N104UW   24616
##  7 N10575  139903
##  8 N105UW   23618
##  9 N107US   21677
## 10 N108UW   32070
## # ... with 4,027 more rows
Useful summary functions
Mean

Just using means, counts, and sums can get you a long way, but R provides many other useful summary functions:

Measures of location: we’ve used mean(x), but median(x) is also useful. The mean is the sum divided by the length; the median is a value where 50% of x is above it, and 50% is below it.

It’s sometimes useful to combine aggregation with logical subsetting. We haven’t talked about this sort of subsetting yet, but you’ll learn more about it in subsetting.

not_cancelled %>% 
group_by(year, month, day) %>% 
summarise(avg_delay1 = mean(arr_delay),
avg_delay2 = mean(arr_delay[arr_delay > 0])) 
## # A tibble: 365 x 5
## # Groups:   year, month [?]
##     year month   day avg_delay1 avg_delay2
##    <int> <int> <int>      <dbl>      <dbl>
##  1  2013     1     1     12.7         32.5
##  2  2013     1     2     12.7         32.0
##  3  2013     1     3      5.73        27.7
##  4  2013     1     4    - 1.93        28.3
##  5  2013     1     5    - 1.53        22.6
##  6  2013     1     6      4.24        24.4
##  7  2013     1     7    - 4.95        27.8
##  8  2013     1     8    - 3.23        20.8
##  9  2013     1     9    - 0.264       25.6
## 10  2013     1    10    - 5.90        27.3
## # ... with 355 more rows
Measures of Spread: sd(), IQR(), and mad()

Measures of spread: sd(x), IQR(x), mad(x). The root mean squared deviation, or standard deviation sd(x), is the standard measure of spread. The interquartile range IQR(x) and median absolute deviation mad(x) are robust equivalents that may be more useful if you have outliers.

Why is distance to some destinations more variable than to others?

not_cancelled %>% 
  group_by(dest) %>% 
  summarise(distance_sd = sd(distance)) %>% 
  arrange(desc(distance_sd))
## # A tibble: 104 x 2
##    dest  distance_sd
##    <chr>       <dbl>
##  1 EGE         10.5 
##  2 SAN         10.4 
##  3 SFO         10.2 
##  4 HNL         10.0 
##  5 SEA          9.98
##  6 LAS          9.91
##  7 PDX          9.87
##  8 PHX          9.86
##  9 LAX          9.66
## 10 IND          9.46
## # ... with 94 more rows
Measures of rank: min(x), quantile(), and max(x).

Quantiles are a generalisation of the median. For example, quantile(x, 0.25) will find a value of x that is greater than 25% of the values, and less than the remaining 75%.
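A small deterministic example: with the values 1 through 5, the 25th, 50th, and 75th percentiles are 2, 3, and 4.

quantile(c(1, 2, 3, 4, 5), c(0.25, 0.5, 0.75))
## 25% 50% 75% 
##   2   3   4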

When do the first and last flights leave each day?

not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(
    first = min(dep_time),
    last = max(dep_time))
## # A tibble: 365 x 5
## # Groups:   year, month [?]
##     year month   day  first  last
##    <int> <int> <int>  <dbl> <dbl>
##  1  2013     1     1 517     2356
##  2  2013     1     2  42.0   2354
##  3  2013     1     3  32.0   2349
##  4  2013     1     4  25.0   2358
##  5  2013     1     5  14.0   2357
##  6  2013     1     6  16.0   2355
##  7  2013     1     7  49.0   2359
##  8  2013     1     8 454     2351
##  9  2013     1     9   2.00  2252
## 10  2013     1    10   3.00  2320
## # ... with 355 more rows
Measures of position: first(x), nth(x, n), last(x).

These work similarly to x[1], x[2], and x[length(x)] but let you set a default value if that position does not exist (i.e. you’re trying to get the 3rd element from a group that only has two elements). For example, we can find the first and last departure for each day:

not_cancelled %>% 
group_by(year, month, day) %>% 
summarise(
first_dep = first(dep_time), 
last_dep = last(dep_time))
## # A tibble: 365 x 5
## # Groups:   year, month [?]
##     year month   day first_dep last_dep
##    <int> <int> <int>     <int>    <int>
##  1  2013     1     1       517     2356
##  2  2013     1     2        42     2354
##  3  2013     1     3        32     2349
##  4  2013     1     4        25     2358
##  5  2013     1     5        14     2357
##  6  2013     1     6        16     2355
##  7  2013     1     7        49     2359
##  8  2013     1     8       454     2351
##  9  2013     1     9         2     2252
## 10  2013     1    10         3     2320
## # ... with 355 more rows
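The default argument mentioned above can be seen with a small vector: asking for a position that doesn’t exist returns NA unless you supply a default.

x <- c(10, 20)
nth(x, 3)
## [1] NA
nth(x, 3, default = 0)
## [1] 0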
Counts and proportions of logical values: sum(x > 10), mean(y == 0).

When used with numeric functions, TRUE is converted to 1 and FALSE to 0. This makes sum() and mean() very useful: sum(x) gives the number of TRUEs in x, and mean(x) gives the proportion.

How many flights left before 5am? (these usually indicate delayed flights from the previous day)

not_cancelled %>% 
group_by(year, month, day) %>% 
summarise(n_early = sum(dep_time < 500))
## # A tibble: 365 x 4
## # Groups:   year, month [?]
##     year month   day n_early
##    <int> <int> <int>   <int>
##  1  2013     1     1       0
##  2  2013     1     2       3
##  3  2013     1     3       4
##  4  2013     1     4       3
##  5  2013     1     5       3
##  6  2013     1     6       2
##  7  2013     1     7       2
##  8  2013     1     8       1
##  9  2013     1     9       3
## 10  2013     1    10       3
## # ... with 355 more rows

What proportion of flights are delayed by more than an hour?

not_cancelled %>% 
group_by(year, month, day) %>% 
summarise(hour_perc = mean(arr_delay > 60))
## # A tibble: 365 x 4
## # Groups:   year, month [?]
##     year month   day hour_perc
##    <int> <int> <int>     <dbl>
##  1  2013     1     1    0.0722
##  2  2013     1     2    0.0851
##  3  2013     1     3    0.0567
##  4  2013     1     4    0.0396
##  5  2013     1     5    0.0349
##  6  2013     1     6    0.0470
##  7  2013     1     7    0.0333
##  8  2013     1     8    0.0213
##  9  2013     1     9    0.0202
## 10  2013     1    10    0.0183
## # ... with 355 more rows
Grouping by multiple variables

When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll up a dataset:

daily <- group_by(flights, year, month, day)
(per_day   <- summarise(daily, flights = n()))
## # A tibble: 365 x 4
## # Groups:   year, month [?]
##     year month   day flights
##    <int> <int> <int>   <int>
##  1  2013     1     1     842
##  2  2013     1     2     943
##  3  2013     1     3     914
##  4  2013     1     4     915
##  5  2013     1     5     720
##  6  2013     1     6     832
##  7  2013     1     7     933
##  8  2013     1     8     899
##  9  2013     1     9     902
## 10  2013     1    10     932
## # ... with 355 more rows
(per_month <- summarise(per_day, flights = sum(flights)))
## # A tibble: 12 x 3
## # Groups:   year [?]
##     year month flights
##    <int> <int>   <int>
##  1  2013     1   27004
##  2  2013     2   24951
##  3  2013     3   28834
##  4  2013     4   28330
##  5  2013     5   28796
##  6  2013     6   28243
##  7  2013     7   29425
##  8  2013     8   29327
##  9  2013     9   27574
## 10  2013    10   28889
## 11  2013    11   27268
## 12  2013    12   28135
(per_year  <- summarise(per_month, flights = sum(flights)))
## # A tibble: 1 x 2
##    year flights
##   <int>   <int>
## 1  2013  336776

Be careful when progressively rolling up summaries: it’s OK for sums and counts, but you need to think about weighting means and variances, and it’s not possible to do it exactly for rank-based statistics like the median. In other words, the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median.
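A tiny numeric example of why this matters for medians:

x <- c(1, 2, 3, 100, 200)
median(x)
## [1] 3
# The median of the groupwise medians (groups {1, 2, 3} and {100, 200}) is not the overall median:
median(c(median(c(1, 2, 3)), median(c(100, 200))))
## [1] 76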

Ungrouping

If you need to remove grouping, and return to operations on ungrouped data, use ungroup().

daily %>% 
ungroup() %>%           
summarise(flights = n())
## # A tibble: 1 x 1
##   flights
##     <int>
## 1  336776
Grouped mutates (and filters)

Grouping is most useful in conjunction with summarise(), but you can also do convenient operations with mutate() and filter():

Find the worst members of each group:

flights_sml %>% 
group_by(year, month, day) %>%
filter(rank(desc(arr_delay)) < 10)
## # A tibble: 3,306 x 7
## # Groups:   year, month, day [365]
##     year month   day dep_delay arr_delay distance air_time
##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>
##  1  2013     1     1       853       851      184     41.0
##  2  2013     1     1       290       338     1134    213  
##  3  2013     1     1       260       263      266     46.0
##  4  2013     1     1       157       174      213     60.0
##  5  2013     1     1       216       222      708    121  
##  6  2013     1     1       255       250      589    115  
##  7  2013     1     1       285       246     1085    146  
##  8  2013     1     1       192       191      199     44.0
##  9  2013     1     1       379       456     1092    222  
## 10  2013     1     2       224       207      550     94.0
## # ... with 3,296 more rows

Find all groups bigger than a threshold:

popular_dests <- flights %>% 
group_by(dest) %>% 
filter(n() > 365)
popular_dests
## # A tibble: 332,577 x 19
## # Groups:   dest [77]
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515      2.00      830
##  2  2013     1     1      533            529      4.00      850
##  3  2013     1     1      542            540      2.00      923
##  4  2013     1     1      544            545     -1.00     1004
##  5  2013     1     1      554            600     -6.00      812
##  6  2013     1     1      554            558     -4.00      740
##  7  2013     1     1      555            600     -5.00      913
##  8  2013     1     1      557            600     -3.00      709
##  9  2013     1     1      557            600     -3.00      838
## 10  2013     1     1      558            600     -2.00      753
## # ... with 332,567 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Standardise to compute per group metrics:

popular_dests %>% 
filter(arr_delay > 0) %>% 
mutate(prop_delay = arr_delay / sum(arr_delay)) %>% 
select(year:day, dest, arr_delay, prop_delay)
## # A tibble: 131,106 x 6
## # Groups:   dest [77]
##     year month   day dest  arr_delay prop_delay
##    <int> <int> <int> <chr>     <dbl>      <dbl>
##  1  2013     1     1 IAH       11.0   0.000111 
##  2  2013     1     1 IAH       20.0   0.000201 
##  3  2013     1     1 MIA       33.0   0.000235 
##  4  2013     1     1 ORD       12.0   0.0000424
##  5  2013     1     1 FLL       19.0   0.0000938
##  6  2013     1     1 ORD        8.00  0.0000283
##  7  2013     1     1 LAX        7.00  0.0000344
##  8  2013     1     1 DFW       31.0   0.000282 
##  9  2013     1     1 ATL       12.0   0.0000400
## 10  2013     1     1 DTW       16.0   0.000116 
## # ... with 131,096 more rows

Advanced Date Functions with Lubridate

Rounding Dates

An alternative approach to plotting individual components is to round the date to a nearby unit of time, with floor_date(), round_date(), and ceiling_date(). Each function takes a vector of dates to adjust and the name of the unit to round down to (floor_date()), round up to (ceiling_date()), or round to the nearest value of (round_date()). This, for example, allows us to plot the number of flights per week:

flights_dt %>% 
  count(week = floor_date(dep_time, "week")) %>% 
  ggplot(aes(week, n)) +
    geom_line()
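To see how the three rounding functions differ, here is a minimal sketch on a single date-time (assuming lubridate is loaded):

library(lubridate)
d <- ymd_hms("2019-03-14 15:09:26")
floor_date(d, "hour")
## [1] "2019-03-14 15:00:00 UTC"
round_date(d, "hour")
## [1] "2019-03-14 15:00:00 UTC"
ceiling_date(d, "hour")
## [1] "2019-03-14 16:00:00 UTC"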

Data Visualization:

Base R:

ggplot2:

This chapter will teach you how to visualise your data using ggplot2. ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs.

If you’d like to learn more about the theoretical underpinnings of ggplot2 before you start, I’d recommend reading “The Layered Grammar of Graphics”, http://vita.had.co.nz/papers/layered-grammar.pdf.

Let’s use our first graph to answer a question: Do cars with big engines use more fuel than cars with small engines? You probably already have an answer, but try to make your answer precise. What does the relationship between engine size and fuel efficiency look like? Is it positive? Negative? Linear? Nonlinear?

The mpg data frame:

You can test your answer with the mpg data frame found in ggplot2 (aka ggplot2::mpg). A data frame is a rectangular collection of variables (in the columns) and observations (in the rows). mpg contains observations collected by the US Environmental Protection Agency on 38 models of car.

mpg
## # A tibble: 234 x 11
##    manufacturer model    displ  year   cyl trans   drv     cty   hwy fl   
##    <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr>
##  1 audi         a4        1.80  1999     4 auto(l~ f        18    29 p    
##  2 audi         a4        1.80  1999     4 manual~ f        21    29 p    
##  3 audi         a4        2.00  2008     4 manual~ f        20    31 p    
##  4 audi         a4        2.00  2008     4 auto(a~ f        21    30 p    
##  5 audi         a4        2.80  1999     6 auto(l~ f        16    26 p    
##  6 audi         a4        2.80  1999     6 manual~ f        18    26 p    
##  7 audi         a4        3.10  2008     6 auto(a~ f        18    27 p    
##  8 audi         a4 quat~  1.80  1999     4 manual~ 4        18    26 p    
##  9 audi         a4 quat~  1.80  1999     4 auto(l~ 4        16    25 p    
## 10 audi         a4 quat~  2.00  2008     4 manual~ 4        20    28 p    
## # ... with 224 more rows, and 1 more variable: class <chr>

Among the variables in mpg are:

-displ, a car’s engine size, in litres.

-hwy, a car’s fuel efficiency on the highway, in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance.

To learn more about mpg, open its help page by running ?mpg.

Creating a ggplot

To plot mpg, run this code to put displ on the x-axis and hwy on the y-axis:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

The plot shows a negative relationship between engine size (displ) and fuel efficiency (hwy). In other words, cars with big engines use more fuel. Does this confirm or refute your hypothesis about fuel efficiency and engine size?

With ggplot2, you begin a plot with the function ggplot(). ggplot() creates a coordinate system that you can add layers to. The first argument of ggplot() is the dataset to use in the graph. So ggplot(data = mpg) creates an empty graph, but it’s not very interesting so I’m not going to show it here.

You complete your graph by adding one or more layers to ggplot(). The function geom_point() adds a layer of points to your plot, which creates a scatterplot. ggplot2 comes with many geom functions that each add a different type of layer to a plot. You’ll learn a whole bunch of them throughout this chapter.

Each geom function in ggplot2 takes a mapping argument. This defines how variables in your dataset are mapped to visual properties. The mapping argument is always paired with aes(), and the x and y arguments of aes() specify which variables to map to the x and y axes. ggplot2 looks for the mapped variables in the data argument, in this case, mpg.

A graphing template

Let’s turn this code into a reusable template for making graphs with ggplot2. To make a graph, replace the bracketed sections in the code below with a dataset, a geom function, or a collection of mappings.

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

Aesthetic mappings

You can add a third variable, like class, to a two dimensional scatterplot by mapping it to an aesthetic. An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, shape, and color of your points.

You can display a point in different ways by changing the values of its aesthetic properties. Since we already use the word “value” to describe data, let’s use the word “level” to describe aesthetic properties.

For example, we can map the class of each car to the color aesthetic, so that the color of each point reveals its class:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

To map an aesthetic to a variable, associate the name of the aesthetic to the name of the variable inside aes().

ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as scaling.

ggplot2 will also add a legend that explains which levels correspond to which values.

The colors reveal that many of the unusual points are two-seater cars. These cars don’t seem like hybrids, and are, in fact, sports cars! Sports cars have large engines like SUVs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have large engines.

In the above example, we mapped class to the color aesthetic, but we could have mapped class to the size aesthetic in the same way. In this case, the exact size of each point would reveal its class affiliation.

We get a warning here, because mapping an unordered variable (class) to an ordered aesthetic (size) is not a good idea.

ggplot(data = mpg) + 
geom_point(mapping = aes(x = displ, y = hwy, size = class))
## Warning: Using size for a discrete variable is not advised.

Or we could have mapped class to the alpha aesthetic, which controls the transparency of the points, or to the shape aesthetic, which controls the shape of the points.

Alpha

ggplot(data = mpg) + 
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
## Warning: Using alpha for a discrete variable is not advised.

Shape

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have 7.
## Consider specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (geom_point).

What happened to the SUVs? ggplot2 will only use six shapes at a time. By default, additional groups will go unplotted when you use the shape aesthetic.

For each aesthetic, you use aes() to associate the name of the aesthetic with a variable to display.

The aes() function gathers together each of the aesthetic mappings used by a layer and passes them to the layer’s mapping argument.

The syntax highlights a useful insight about x and y: the x and y locations of a point are themselves aesthetics, visual properties that you can map to variables to display information about the data.

Once you map an aesthetic, ggplot2 takes care of the rest. It selects a reasonable scale to use with the aesthetic, and it constructs a legend that explains the mapping between levels and values.

For x and y aesthetics, ggplot2 does not create a legend, but it creates an axis line with tick marks and a label. The axis line acts as a legend; it explains the mapping between locations and values.

You can also set the aesthetic properties of your geom manually. For example, we can make all of the points in our plot blue:

ggplot(data = mpg) + 
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

Here, the color doesn’t convey information about a variable; it only changes the appearance of the plot.

To set an aesthetic manually, set the aesthetic by name as an argument of your geom function; i.e. it goes outside of aes(). You’ll need to pick a level that makes sense for that aesthetic:

-The name of a color as a character string.

-The size of a point in mm.

-The shape of a point as a number, as described below.

R has 25 built-in shapes that are identified by numbers. There are some seeming duplicates:

The hollow shapes (0–14) have a border determined by colour; the solid shapes (15–18) are filled with colour; the filled shapes (21–24) have a border of colour and are filled with fill.
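As a minimal sketch of the filled shapes, shape 21 draws a circle whose border comes from colour and whose interior comes from fill (the specific colours here are arbitrary):

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy),
             shape = 21, colour = "black", fill = "lightblue",
             size = 3, stroke = 1)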

Facets

One way to add additional variables is with aesthetics. Another way, particularly useful for categorical variables, is to split your plot into facets, subplots that each display one subset of the data.

To facet your plot by a single variable, use facet_wrap(). The first argument of facet_wrap() should be a formula, which you create with ~ followed by a variable name (here “formula” is the name of a data structure in R, not a synonym for “equation”). The variable that you pass to facet_wrap() should be discrete.

ggplot(data = mpg) + 
geom_point(mapping = aes(x = displ, y = hwy)) + 
facet_wrap(~ class, nrow = 2)

Statistics In R:

In the dplyr section we briefly touched upon some statistical terms such as the mean, standard deviation (SD), and confidence intervals.

In this section, we will expand upon our statistical discussion primarily focusing around how R is used to conduct statistical calculations. While we will explain the terms, the focus will be around the R programming language.

Exploratory Data Analysis:

Classical statistics, also known as inferential statistics, focuses primarily on deriving conclusions about large populations based on data collected from a sample.

In EDA, however, the focus is on understanding the data itself, specifically its measures of location and variability:

Estimates of Location:

Variables with measured or count data might have thousands of distinct values. One basic step in getting to understand your data is to get an idea of what the typical value is for each feature (variable). In other words, an estimate of where most of the data is located. In statistics this is known as the central tendency of the data.

When it comes to measuring the central tendency of your data, there are several metrics that can do the job:

  1. Mean: the sum of all values divided by the number of values.
  2. Weighted Mean: the sum of all values times a weight, divided by the sum of the weights.
  3. Median: the value such that one-half of the data lies above it and one-half lies below it.
  4. Weighted Median: the value such that one-half of the sum of the weights lies above and below the sorted data.
  5. Trimmed Mean: the average of all values after dropping a fixed number of extreme values.

To calculate the mean and median in R we use the following commands:

state <- read.csv("state.csv")
mean(state$Population)
## [1] 6162876
median(state$Population)
## [1] 4436370

To trim the mean, we pass a trim argument giving the fraction of extreme values to drop from each end, expressed as a decimal. Note that trimming only affects mean(); median() simply ignores the trim argument because the median is already robust to extreme values:

mean(state$Population, trim = 0.1)
## [1] 4783697
median(state$Population, trim = 0.1)
## [1] 4436370

As we can see in the data, the populations of the various states differ greatly. Any analysis of rates, such as murder rates, should take this into account. To do so, we would need to use a weighted mean or median to account for the different state populations when calculating average murder rates.

Base R provides weighted.mean() for a weighted mean, but it does not include a weighted median. For that we can use the matrixStats package:

weighted.mean(state$Murder.Rate, w = state$Population)
## [1] 4.445834
matrixStats::weightedMedian(state$Murder.Rate, w = state$Population)
## [1] 4.4

Estimates of Variability

Measures of centrality are one way to summarize a feature of your data. Variability, also known as dispersion, measures whether the data values are tightly clustered or spread out.

At the heart of statistics lies variability; measuring it, reducing it, distinguishing random from real variability, identifying the various sources of real variability, and making decisions in the presence of it.

Some measures of variability are as follows:

-Variance: the sum of squared deviations from the mean divided by n - 1, where n is the number of values.

-Standard Deviation: the square root of the variance.

-Mean Absolute Deviation: the mean of the absolute values of the deviations from the mean.

-Median Absolute Deviation from the Median (MAD): the median of the absolute values of the deviations from the median.

-Range: the difference between the largest and smallest values.

-Percentiles: the value such that P percent of the values take on this value or less and (100 - P) percent take on this value or more.

-Interquartile Range: the difference between the 75th and 25th percentile values.

To calculate these measures of variability in R we use the following commands:

sd(state$Population)
## [1] 6848235
IQR(state$Population)
## [1] 4847308
mad(state$Population)
## [1] 3849870
quantile(state$Population)
##       0%      25%      50%      75%     100% 
##   563626  1833004  4436370  6680312 37253956

You will notice that the standard deviation is almost twice as large as the MAD. This is not surprising since the standard deviation is sensitive to outliers.

Distributions

With an understanding of the centrality and variability of the data, the next step is to explore and understand the overall distribution of the data.

For this purpose there are several visualizations that let you see the shape of the distribution. Many of these were discussed above in the section on Data Visualization. The following are best suited to visualizing a distribution:

  1. Boxplots: based on percentiles, provide a quick way to visualize distributions.
boxplot(log(state$Population), ylab="Population log")

  2. Frequency Tables

Frequency tables divide up a variable into equally spaced segments and provide counts for how many values fall in each segment.

To create one, we first build a breaks object with the seq() command, which generates equally spaced cut points from the minimum to the maximum of the field. This is setup work that tells R how to segment the field. From there, we pass it to cut() via the breaks= argument and tabulate the result with table().

breaks <- seq(from=min(state$Population), to=max(state$Population), length=11)
pop_freq <- cut(state$Population, breaks = breaks, right=TRUE, include.lowest = TRUE)
table(pop_freq)
## pop_freq
## [5.64e+05,4.23e+06]  (4.23e+06,7.9e+06]  (7.9e+06,1.16e+07] 
##                  24                  14                   6 
## (1.16e+07,1.52e+07] (1.52e+07,1.89e+07] (1.89e+07,2.26e+07] 
##                   2                   1                   1 
## (2.26e+07,2.62e+07] (2.62e+07,2.99e+07] (2.99e+07,3.36e+07] 
##                   1                   0                   0 
## (3.36e+07,3.73e+07] 
##                   1
  3. Histogram

A histogram is a way to visualize a frequency table with bins on the x-axis and data count on the y-axis. To create a histogram in R we use the hist command as follows:

hist(state$Population, breaks = breaks)

General rules for plotting histograms:

- Empty bins are included.
- Bins are of equal width.
- The number of bins is up to the user.
- Bars are contiguous: no empty space is shown between bars unless there is an empty bin.

Skewness and Kurtosis

In statistics, location and variability are referred to as the first and second moments of a distribution. The third and fourth moments are called skewness and kurtosis.

Skewness refers to whether the data is skewed to larger or smaller values.

Kurtosis indicates the propensity of the data to have extreme values.

To validate these measures, visual displays are commonly used. One of these is a density plot, which shows the distribution of data values as a continuous line. A density plot can be thought of as a smoothed histogram, although it is typically computed directly from the data through a kernel density estimate.

  1. Density Plot

To create a density plot, you first create a histogram and then overlay a line computed with the density() function.

hist(state$Murder.Rate, freq=FALSE)
lines(density(state$Murder.Rate),lwd=3,col="blue")

One major distinction between a histogram and a density plot is that the density plot corresponds to plotting the histogram as a proportion rather than as counts. This is specified with the argument freq=FALSE.

Exploring Binary and Categorical Data

Up until now, we have discussed how to conduct exploratory data analysis on numerical features. We will now discuss how to do the same with categorical ones. Recall that categorical features take values from a mutually exclusive and exhaustive set.

For such variables, simple proportions or percentages tell the story of the data.

There are also specific visualizations that can be used to help understand categorical data. Bar charts for instance show the frequency or proportion for each category plotted as bars.

In this example we show the different kinds of delays categorized by type:

dfw <- read.csv("dfw_airline.csv",header = TRUE)
barplot(as.matrix(dfw)/6, cex.axis = .5)

Bar charts are a common visual tool for displaying a single categorical variable. Categories are listed on the x-axis and frequencies or proportions on the y-axis.

Bar charts resemble histograms; however, in a bar chart the x-axis represents different categories of a factor variable, while in a histogram the x-axis represents values of a single variable on a numeric scale. In a histogram, the bars are typically shown touching each other, with gaps indicating values that did not occur in the data. In a bar chart, the bars are shown separate from one another.

Exploring Two or More Variables: Numeric

Up to now, we have discussed how to understand a single variable, whether it be numeric or categorical. From here on out we will see how to understand relationships between two or more variables.

Correlation

EDA in many modeling projects involves examining correlation among predictors, and between predictors and a target variable.

Two variables are said to be positively correlated if high values of X go with high values of Y, and low values of X with low values of Y.

If the relationship between variables is inverse, in other words, X increases while Y decreases then they are said to be negatively correlated.

To measure the degree to which two variables are correlated, we calculate a value called Pearson's Correlation Coefficient. This is done by summing the products of the deviations from the mean for the x and y variables, and then dividing by the product of the two standard deviations times n-1.
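
As a small illustration of that formula (a sketch using two made-up vectors, not data from this tutorial), the hand computation should agree with R's cor() function:

x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 6)
sum((x - mean(x)) * (y - mean(y))) / ((length(x) - 1) * sd(x) * sd(y))
cor(x, y)  # same value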

The correlation coefficient always lies between +1 and -1 (perfect positive and perfect negative correlation); 0 indicates no correlation. Variables can also have a relationship that is not linear, in which case the correlation coefficient may not be as helpful.

To display correlation coefficients among all numeric variables, we use what is called a correlation matrix. Along the diagonal we see values of 1, since this is the correlation of a variable with itself.

A correlation matrix is created in R using the package corrplot:

library(corrplot)
## Warning: package 'corrplot' was built under R version 3.4.4
## corrplot 0.84 loaded
sp500_px <- read.csv("sp500_data.csv")
sp500_sym <- read.csv("sp500_sectors.csv")
etf <- sp500_px[row.names(sp500_px)>"2012-07-01",sp500_sym[sp500_sym$sector=="etf",'symbol']]
corrplot(cor(etf),method = "ellipse")

Scatterplots are another way to visualize the relationship between two measured data variables. The x-axis represents one variable, the y-axis another, and each point is a record.

We can do this with the plot command:

plot(etf$GM, etf$DGX, xlab="GM", ylab="DGX")

It appears like there is a slight correlation between the returns of the two stocks. We can validate with the cor() command.

cor(etf$GM, etf$DGX)
## [1] 0.1484631

Another way to plot two numerical features is hexagonal binning. One example is the relationship between finished square feet and tax-assessed value for homes in King County.

Plotting points in this situation would result in a large dark cloud of points. By grouping values into hexagonal bins and coloring them with a heatmap based on the number of records, we can see a positive relationship between square feet and tax-assessed value.

library(hexbin)
## Warning: package 'hexbin' was built under R version 3.4.4
kc_tax <- read.csv("kc_tax.csv")
kc_tax0 <- subset(kc_tax, TaxAssessedValue<750000 & SqFtTotLiving>100 & SqFtTotLiving<3500) 
ggplot(kc_tax0, aes(x=SqFtTotLiving, y=TaxAssessedValue)) +
stat_binhex(color="white") + 
theme_bw() +
scale_fill_gradient(low="white", high="black") +
labs(x="Finished Square Feet", y="Tax Asessed Value")

Contour plots also visualize the relationship between two numeric variables, only this time in the form of a topographical map. Each contour band represents a specific density of points, increasing as one nears a "peak".

ggplot(kc_tax0, aes(SqFtTotLiving, TaxAssessedValue)) + 
theme_bw() + 
geom_point(alpha=0.1)+
geom_density2d(colour="white")+
labs(x="Finished Square Feet", y="Tax Assessed Value")

Exploring Two or More Variables: Categorical

A useful way to summarize two categorical variables is a contingency table - a table of counts by category.

In R, the CrossTable function in the descr package produces contingency tables. Here is an example with loans data:

library(descr)
## Warning: package 'descr' was built under R version 3.4.4
lc_loans <- read.csv("lc_loans.csv")
x_tab <- CrossTable(lc_loans$grade, lc_loans$status, prop.c = FALSE, prop.chisq = FALSE, prop.t = TRUE)
x_tab
##    Cell Contents 
## |-------------------------|
## |                       N | 
## |           N / Row Total | 
## |         N / Table Total | 
## |-------------------------|
## 
## =====================================================================
##                   lc_loans$status
## lc_loans$grade    Charged Off   Current   Fully Paid    Late    Total
## ---------------------------------------------------------------------
## A                        1562     50051        20408     469    72490
##                         0.022     0.690        0.282   0.006    0.161
##                         0.003     0.111        0.045   0.001         
## ---------------------------------------------------------------------
## B                        5302     93852        31160    2056   132370
##                         0.040     0.709        0.235   0.016    0.294
##                         0.012     0.208        0.069   0.005         
## ---------------------------------------------------------------------
## C                        6023     88928        23147    2777   120875
##                         0.050     0.736        0.191   0.023    0.268
##                         0.013     0.197        0.051   0.006         
## ---------------------------------------------------------------------
## D                        5007     53281        13681    2308    74277
##                         0.067     0.717        0.184   0.031    0.165
##                         0.011     0.118        0.030   0.005         
## ---------------------------------------------------------------------
## E                        2842     24639         5949    1374    34804
##                         0.082     0.708        0.171   0.039    0.077
##                         0.006     0.055        0.013   0.003         
## ---------------------------------------------------------------------
## F                        1526      8444         2328     606    12904
##                         0.118     0.654        0.180   0.047    0.029
##                         0.003     0.019        0.005   0.001         
## ---------------------------------------------------------------------
## G                         409      1990          643     199     3241
##                         0.126     0.614        0.198   0.061    0.007
##                         0.001     0.004        0.001   0.000         
## ---------------------------------------------------------------------
## Total                   22671    321185        97316    9789   450961
## =====================================================================

Exploring Two or More Variables: Categorical and Numeric

Boxplots are a simple way to visually compare the distributions of a numeric variable grouped according to a categorical variable.

airline_stats <- read.csv("airline_stats.csv")
boxplot(pct_carrier_delay ~ airline, data=airline_stats, ylim=c(0,50))

A violin plot is another way to visualize different distributions by category. It is an enhancement of the boxplot that adds a density estimate, with the density plotted on the y-axis.

The density is mirrored and flipped over, and the resulting shape is filled in, creating the image of a violin.

The advantage it provides over a traditional boxplot is that it can show nuances in the distribution that cannot be seen in a boxplot.

You can create a violin plot with the geom_violin option in ggplot.

ggplot(data=airline_stats, aes(airline, pct_carrier_delay))+
ylim(0,50)+
geom_violin()+
labs(x="",y="Daily % of Delayed Flights")
## Warning: Removed 38 rows containing non-finite values (stat_ydensity).

As was previously discussed, facets can also be used to break down Numerical data by different categorical values:

ggplot(subset(kc_tax0, ZipCode %in% c(98188,98105,98126)),
aes(x=SqFtTotLiving, y=TaxAssessedValue)) +
stat_binhex(colour="white") +
theme_bw()+
scale_fill_gradient(low="white",high="blue")+
labs(x="Finished Squared Feet", y="Tax Assessed Value")+
facet_wrap("ZipCode")

Sampling Distributions:

The central limit theorem states that the distribution of sample means approaches a normal distribution as the sample size grows, regardless of whether or not the population is itself normal.

We can prove this in R by taking random samples of different sizes and plotting them:

#1 Random Sample
loans_income <- read.csv("loans_income.csv")

samp_data <- data.frame(income=sample(loans_income[1:50000,], size = 1000),type='data_dist')

#Sample of Means of 5 Values
samp_mean_05 <- 
data.frame(income=tapply(sample(loans_income[1:50000,],1000*5),rep(1:1000,rep(5,1000)), FUN=mean), type = 'mean_of_5')

#Sample of Means of 20 Values
samp_mean_20 <- data.frame(income=tapply(sample(loans_income[1:50000,],1000*20),
rep(1:1000,rep(20,1000)), FUN=mean), type = 'mean_of_20')

# bind the data frames together and convert type to a factor

income <- rbind(samp_data, samp_mean_05, samp_mean_20)

income$type= factor(income$type, levels = c('data_dist','mean_of_5','mean_of_20'),labels = c('Data','Mean of 5','Mean of 20'))

ggplot(income, aes(x=income)) +
geom_histogram(bins=40)+
facet_grid(type ~ . )

The Bootstrap:

One easy and effective way to estimate the sampling distribution of a statistic is to draw additional samples, WITH REPLACEMENT, from the sample itself, recalculating the statistic for each resample. This is called the bootstrap, and it does not require any assumptions about the data or the sample statistic being normally distributed.

Conceptually, you can think of the bootstrap as replicating the original sample thousands or millions of times, thereby creating a hypothetical population using only your original sample. The steps for a bootstrap resampling of the mean, for example, are as follows:

  1. Draw a sample value, record it, and replace it.
  2. Repeat n times.
  3. Record the mean of the n resampled values.
  4. Repeat steps 1-3 R times.
  5. Use the R results to: a) calculate the standard deviation, b) produce a histogram or boxplot, c) compute confidence intervals.

R, the number of iterations of the bootstrap, is up to your discretion. A minimal manual sketch of these steps is shown below.
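
Here is that sketch, bootstrapping the mean of a made-up numeric vector x (the boot package shown next automates all of this):

x <- rnorm(100, mean = 50, sd = 10)   # assumed example data
R <- 1000                             # number of bootstrap iterations
boot_means <- rep(0, R)
for (i in 1:R) {
  resample <- sample(x, size = length(x), replace = TRUE)  # draw with replacement
  boot_means[i] <- mean(resample)                          # record the statistic
}
sd(boot_means)                          # bootstrap estimate of the standard error
quantile(boot_means, c(0.025, 0.975))   # a simple 95% confidence interval
hist(boot_means)                        # distribution of the bootstrapped means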

The R package boot combines all these steps into one function. For example, the following applies the bootstrap to the incomes of people taking out loans:

library(boot)
## Warning: package 'boot' was built under R version 3.4.4
stat_fun <- function(x,idx) median(x[idx])
(boot_obj <- boot(loans_income[1:5000,], R = 1000, statistic = stat_fun))
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = loans_income[1:5000, ], statistic = stat_fun, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*    63000 116.461    789.0006

Bootstrapping can be used with machine learning as well. With classification and regression trees, running multiple trees on bootstrap samples and then averaging their predictions (or, with classification, taking a majority vote) generally performs better than using a single tree. This process is called bagging (bootstrap aggregating).

Normal Distributions

The bell-shaped normal distribution is iconic in traditional statistics. The fact that distributions of sample statistics are often normally shaped has made it a powerful tool in the development of mathematical formulas that approximate those distributions.

In a normal distribution, 68% of the data lies within one standard deviation of the mean, 95% within two, and 99.7% within three. A standard normal distribution is one in which the units on the x-axis are expressed in terms of standard deviations from the mean. To convert a value, you subtract the mean of its sample or population and then divide by the standard deviation. This is called normalization or standardization.

The transformed value is termed a z-score. Because normal distributions are so important, it is also important to know how to determine whether or not your data is normally distributed.
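
As a sketch of the standardization step (assuming the state data frame from earlier is still loaded), the manual calculation and R's scale() function give the same z-scores:

z <- (state$Population - mean(state$Population)) / sd(state$Population)
head(z)
head(scale(state$Population))  # scale() centers and divides by the standard deviation by default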

To do so, there are several tools that can be used. One is a visual tool known as a QQ-plot. This plot puts the z-scores of the sorted data on the y-axis; the x-axis is the corresponding quantile of a normal distribution for each value's rank.

If the values in the data plot roughly on the diagonal line, then the sample distribution can be considered close to normal. Here is an example with a sample of 100 randomly generated values:

norm_samp <- rnorm(100)
qqnorm(norm_samp)
abline(a=0, b=1, col='grey')

Long Tailed Distributions

Long-tailed distributions are like normal distributions, only extreme values occur with greater frequency.

An example of a long-tailed distribution is stock returns. Here is a QQ-Plot that shows daily stock returns for NFLX:

nflx <- sp500_px[,'NFLX'] 
nflx <- diff(log(nflx[nflx>0]))
qqnorm(nflx)
abline(a=0,b=1,col = 'grey')

Compared to the normal distribution we randomly generated, you can see that there are more values below and above the line at the extremes.

T-Distribution

The t-distribution is a normally shaped distribution, only thicker and longer in the tails. It is used heavily in sample statistics, because sampling distributions of sample statistics often follow a t-distribution, especially when samples are small.

The larger the sample, the more normally shaped the t-distribution.
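
A quick visual sketch of this: the t-distribution has heavier tails than the standard normal, and it moves toward the normal as the degrees of freedom (driven by sample size) increase. The degrees of freedom below are arbitrary example values.

curve(dnorm(x), from = -4, to = 4, ylab = "density")  # standard normal
curve(dt(x, df = 3), add = TRUE, lty = 2)             # few degrees of freedom: heavier tails
curve(dt(x, df = 30), add = TRUE, lty = 3)            # more degrees of freedom: close to normal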

Binomial Distribution

A lot of analytics questions don't deal with distributions of continuous variables, but instead with discrete variables. Discrete variables have a limited set of values, as opposed to the infinite possibilities that exist for a continuous one.

One such distribution is the binomial, which has only two outcomes: yes or no.

Digital analytics is a perfect example of when this would be used. A user either purchases or does not in a particular session. Click vs don’t click, churn or don’t are other similar questions.

Central to binomial distributions is the concept of trials where each trial has two outcomes.

Flipping a coin 10 times, for example, is a binomial experiment with 10 trials. In this case, the probability for each outcome is 50/50. For ecommerce sites, however, conversion rates can be very low, with 1 being much more rare than 0.

To calculate binomial distribution probabilities in R, we can use the dbinom function:

dbinom(x=2,5,p=0.1)
## [1] 0.0729

This tells us that the probability of getting 2 successes in 5 trials, where the probability of success is 0.1, is equal to 0.0729, or roughly 7.3%.

Often, however, we are interested in determining the probability of x or fewer successes in n trials. To do this we use the pbinom function:

pbinom(2,5,0.1)
## [1] 0.99144

The probability of getting two or fewer successes is about 99%.

The mean of a binomial distribution is n x p. This is the expected number of successes in n trials. The variance is n x p x (1-p). With a large enough number of trials, the binomial distribution essentially converges to the normal distribution.
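
A rough simulation sketch of these formulas, using the same n and p as the dbinom example above:

n <- 5; p <- 0.1
draws <- rbinom(100000, size = n, prob = p)
mean(draws)  # should be close to n * p = 0.5
var(draws)   # should be close to n * p * (1 - p) = 0.45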

A/B Testing

Another major concept in statistics is hypothesis testing. Hypothesis testing is when you try to answer a question scientifically using an experiment. You split a representative sample of a desired population into two groups through random selection. This enables you to attribute any differences between the two groups to the different treatments.

Take for example an A/B test on a website. We can split our users into two different experiences and see which one performs better. In R, we will show this using example session data:

session_times <- read.csv("web_page_data.csv")
ggplot(session_times, aes(x=Page, y=Time))+
geom_boxplot()

Looking at the boxplot, it appears that page B has longer session times.

(mean_a <- mean(session_times[session_times['Page']=='Page A', 'Time']))
## [1] 1.263333
(mean_b <- mean(session_times[session_times['Page']=='Page B', 'Time']))
## [1] 1.62

Looking at the two pages, Page B does indeed have a higher mean session time. The question, however, is whether this is due to random chance or to the difference in treatments. We can answer this using a permutation test.

A permutation test combines all session times together, then repeatedly shuffles and reassigns them into groups of the original sample sizes. This essentially creates a large random population.

perm_fun <- function(x,n1,n2)
{
n <- n1 + n2
indx_b <- sample(1:n,n1)
indx_a <- setdiff(1:n, indx_b)
mean_diff <- mean(x[indx_b]) - mean(x[indx_a])
return(mean_diff)
}  

perm_diffs <- rep(0,1000)
for(i in 1:1000)
perm_diffs[i] = perm_fun(session_times[,'Time'],21,15)
hist(perm_diffs,xlab='Session time differences (in seconds)')
abline(v = mean_b - mean_a)

The histogram above shows that the mean difference of random permutations often exceeds the observed difference in session time. This suggests that the observed difference in session times between Page A and Page B is well within the range of random chance and thus not statistically significant.

Let’s look at another example with web conversion rates in a Price Test.

  1. Price A: 200 people converted out of 23,739 sessions (23,539 did not convert).
  2. Price B: 182 people converted out of 22,588 sessions (22,406 did not convert).

Again, we can apply a permutation test to see if the difference is within random chance variation.

obs_pct_diff <- 100*(200/23739 - 182/22588)
conversion <- c(rep(0,45945),rep(1,382))
perm_diffs <- rep(0,1000)
for(i in 1:1000)
perm_diffs[i] = 100*perm_fun(conversion,23739,22588)
hist(perm_diffs, xlab='Conversion Rate Percent Differences')
abline(v = obs_pct_diff)

As we can see from the histogram, the observed difference of 0.0368% is well within the range of chance variation.

Simply looking at the graph, however, is not a very precise way to determine statistical significance.

P-Values

A p-value is the probability that you would see a difference at least as extreme as the observed difference, assuming that there is no real difference between the two groups.

mean(perm_diffs > obs_pct_diff)
## [1] 0.291

In this case, it appears that there is roughly a 29% chance that we would see a difference this large as a result of random chance. This is a high probability. In general, statisticians use 5% as the decision point for when an experiment is statistically significant.

In the above example, we didn't necessarily have to use a permutation test to get a p-value. Since we are using data with a binomial distribution and large observation counts, we can approximate the p-value using the normal distribution. In R, we do this with the prop.test function:

prop.test(x=c(200,182), n=c(23739,22588),alternative = "greater")
## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  c(200, 182) out of c(23739, 22588)
## X-squared = 0.14893, df = 1, p-value = 0.3498
## alternative hypothesis: greater
## 95 percent confidence interval:
##  -0.001057439  1.000000000
## sample estimates:
##      prop 1      prop 2 
## 0.008424955 0.008057376

As you can see, we get a p-value of 0.3498, which is close to what we found with the permutation test.

We can also use a t-test, which relies on the t-distribution. This is often done when samples are small and the normal approximation cannot be relied upon.

t.test(Time ~ Page, data=session_times,alternative='less')
## 
##  Welch Two Sample t-test
## 
## data:  Time by Page
## t = -1.0983, df = 27.693, p-value = 0.1408
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf 0.1959674
## sample estimates:
## mean in group Page A mean in group Page B 
##             1.263333             1.620000

Here we can see that the p-value is 0.14, which is still greater than the conventional 5% threshold.

ANOVA

While the tests we discussed above work well for measuring differences between two groups, they don't work if we are trying to compare more than two groups.

In this situation, we use what is called an ANOVA, or analysis of variance.

We can conduct an ANOVA using the aov() command. Another important command is aovp(), which is part of the lmPerm package. This command conducts a permutation-test ANOVA. If you recall, a permutation test reshuffles all values based on a pooled sample.

library(lmPerm)
## Warning: package 'lmPerm' was built under R version 3.4.4
four_sessions <- read.csv("four_sessions.csv")
summary(aovp(Time ~ Page, data = four_sessions))
## [1] "Settings:  unique SS "
## Component 1 :
##             Df R Sum Sq R Mean Sq Iter Pr(Prob)  
## Page         3    831.4    277.13 5000   0.0692 .
## Residuals   16   1618.4    101.15                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we can see, the p-value is about 0.07. The column Iter lists the number of iterations taken in the permutation test. The other columns correspond to a traditional ANOVA.

F-statistic: similar to a t-statistic, the F-statistic is the ratio of the variance across group means to the variance due to residual error. The higher this ratio, the more statistically significant the result.

Below is a traditional anova:

summary(aov(Time ~ Page, data=four_sessions))
##             Df Sum Sq Mean Sq F value Pr(>F)  
## Page         3  831.4   277.1    2.74 0.0776 .
## Residuals   16 1618.4   101.2                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

You will also notice the following columns:

- Df: degrees of freedom.
- Sum Sq: the sum of squared differences between each value and the mean for that group.

Chi Square

Whereas an ANOVA is used to compare means across more than two groups, a chi-square test is used when comparing proportions or counts across categories.

clicks <- read.csv("click_rates.csv")
chisq.test(x=c(14/1000,8/1000,12/1000), y=c('A','B','C') , simulate.p.value = TRUE)
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  c(14/1000, 8/1000, 12/1000) and c("A", "B", "C")
## X-squared = 6, df = NA, p-value = 1

Machine Learning Models

Machine learning is the use of statistical algorithms to predict a certain outcome using existing data. This outcome can be a categorical or numerical value.

Regression Models:

Regression tries to answer the question: is the variable X (or X1:X4) related to Y? And if so, is the relationship strong enough that we can use it to predict future values of Y?

Another important way regression can be used is in anomaly detection. Anomalous values may be important such as in credit card fraud or loan auditing.

Simple Linear Regression

Simple linear regression models the relationship between the magnitude of one variable and that of a second. Similar to correlation, it tries to determine whether or not as X increases, Y also increases. Or if when X increases, Y decreases.

The equation tries to estimate exactly how much Y will change when X changes by a certain amount.

The major caveat here is that a linear regression assumes that the relationship between the variables is linear. If it is not, the linear regression formula will do a poor job of capturing the true relationship.

Let's consider the following scatterplot of lung capacity (PEFR) versus exposure to cotton dust. How are the two related?

PERF <- read.csv("LungDisease.csv")
plot(PEFR ~ Exposure, data = PERF)

We can modify the style of the points using the pch argument:

plot(PEFR ~ Exposure, data = PERF,pch=20)

plot(PEFR ~ Exposure, data = PERF,pch="+")

You can see the full breadth of options with the following command:

plot(1:20,1:20,pch=1:20)

It's hard to tell just by looking at the scatterplot whether there is a strong relationship between the two variables. In this situation, we use a simple linear regression to find the "best" line to predict our response PEFR as a function of the predictor variable Exposure.

model <- lm(PEFR ~ Exposure, data = PERF)
model
## 
## Call:
## lm(formula = PEFR ~ Exposure, data = PERF)
## 
## Coefficients:
## (Intercept)     Exposure  
##     424.583       -4.185

Here we can see that our y-intercept is 424.583. This represents the predicted lung capacity (PEFR) for someone with 0 years of exposure to cotton dust.

Looking at the coefficient for Exposure, we can see that for every additional year a worker is exposed to cotton dust, the worker's predicted lung capacity is reduced by 4.185.

We can also call additional details related to our model such as the confidence intervals for the coefficients:

confint(model)
##                  2.5 %     97.5 %
## (Intercept) 383.408147 465.757466
## Exposure     -6.807942  -1.561211

The summary command provides the full detail for the model:

names(model)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
summary(model)
## 
## Call:
## lm(formula = PEFR ~ Exposure, data = PERF)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -297.845  -58.783   -1.214   61.024  209.109 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  424.583     20.796  20.417  < 2e-16 ***
## Exposure      -4.185      1.325  -3.158  0.00201 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 101.4 on 120 degrees of freedom
## Multiple R-squared:  0.07674,    Adjusted R-squared:  0.06905 
## F-statistic: 9.974 on 1 and 120 DF,  p-value: 0.002008

To see how well our best fit regression line fits the data we can plot it on our scatterplot with the abline() command

plot(PEFR ~ Exposure, data = PERF)
abline(model)

We can modify the width and color of the line with the additional lwd and col arguments:

plot(PEFR ~ Exposure, data = PERF)
abline(model,lwd=3,col="red")

As was previously discussed, we can use this regression equation to predict future values. We do this in R with the predict command.

fitted <- predict(model)

In addition to predicting values, we can also measure how accurate our model is by evaluating the errors, also known as residuals:

resid <- residuals(model)
plot(resid)

R by default provides 4 diagnostic plots when plotting the output of lm(). To see them all on one screen we use the par() command with the mfrow option, which specifies the grid of plotting windows we want (rows are designated first).

par(mfrow=c(2,2))
plot(model)

Multiple Linear Regression

Multiple linear regression is the same concept as simple linear regression, only instead of using one variable to predict Y, we use multiple independent variables.

As an example, we will use housing data for King County (Seattle), Washington. We will try to predict house prices using various features of the house:

house <- read.delim("house_sales.csv")
house_lm <- lm(AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms + Bedrooms + BldgGrade, data = house , na.action=na.omit)
house_lm
## 
## Call:
## lm(formula = AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms + 
##     Bedrooms + BldgGrade, data = house, na.action = na.omit)
## 
## Coefficients:
##   (Intercept)  SqFtTotLiving        SqFtLot      Bathrooms       Bedrooms  
##    -5.219e+05      2.288e+02     -6.051e-02     -1.944e+04     -4.778e+04  
##     BldgGrade  
##     1.061e+05

Looking at the results of the model, we can see that each additional square foot of living area increases the predicted value of a house by roughly $229.

The summary function reports the RSE, or residual standard error: the square root of the sum of squared residuals divided by n-p-1.

The RSE is a measure of the error of our model. The higher the RSE the worse the model performed.
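
As a sketch of that definition, the RSE reported by summary() can be reproduced from the residuals and the residual degrees of freedom (n - p - 1):

sqrt(sum(residuals(house_lm)^2) / df.residual(house_lm))
summary(house_lm)$sigma  # should be the same value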

summary(house_lm)
## 
## Call:
## lm(formula = AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms + 
##     Bedrooms + BldgGrade, data = house, na.action = na.omit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1199508  -118879   -20982    87414  9472982 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -5.219e+05  1.565e+04 -33.349  < 2e-16 ***
## SqFtTotLiving  2.288e+02  3.898e+00  58.699  < 2e-16 ***
## SqFtLot       -6.051e-02  6.118e-02  -0.989    0.323    
## Bathrooms     -1.944e+04  3.625e+03  -5.362 8.32e-08 ***
## Bedrooms      -4.778e+04  2.489e+03 -19.194  < 2e-16 ***
## BldgGrade      1.061e+05  2.396e+03  44.287  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 261200 on 22683 degrees of freedom
## Multiple R-squared:  0.5407, Adjusted R-squared:  0.5406 
## F-statistic:  5340 on 5 and 22683 DF,  p-value: < 2.2e-16

We also get additional outputs such as the R-squared, adjusted R-squared, F-statistic, and the p-value associated with our model.

As we can see our model is statistically significant, however the residual std. error is quite large.

The R-squared is roughly 0.54, which is not very strong in terms of how much variation is accounted for.

As you can see, the output of a model can get unwieldy as more variables are added. In such cases, it can help to be specific about which pieces of model output you call. To do so, we can extract what we need using "$" and the output names, just as we would with a named list or data frame:

summary(model)$r.sq
## [1] 0.07674102
summary(model)$sigma
## [1] 101.4382
Model Selection & Stepwise Regression

Many variables could be used as predictors in a regression. Adding more variables, however, does not necessarily improve the performance of a model. All things being equal, Occam's Razor tells us that a simpler model should be preferred over an unnecessarily complex one.

While including more variables always increases the R-squared of a model and decreases its RMSE on the training data, it may not increase its predictive power for new data.

To account for this, several measures have been created that penalize models for having more variables:

  1. Akaike's Information Criterion (AIC): AIC = 2P + n log(RSS/n), where P is the number of variables and n is the number of records. The goal is to find the model that minimizes AIC (see the sketch after this list).
  2. Bayesian Information Criterion (BIC): similar to AIC, but with a stronger penalty for additional variables.
  3. Mallows Cp: a variant of AIC developed by Colin Mallows.
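
As a rough check of the AIC formula in item 1 (a sketch, not part of the original analysis): extractAIC(), which stepAIC() uses under the hood, reports AIC for lm models on the n*log(RSS/n) + 2*P scale, where P counts all estimated coefficients including the intercept. The AIC() function instead uses the full log-likelihood and so differs by an additive constant.

rss <- sum(residuals(house_lm)^2)
n <- length(residuals(house_lm))
p <- length(coefficients(house_lm))   # coefficients, including the intercept
n * log(rss / n) + 2 * p              # hand computation
extractAIC(house_lm)                  # edf and AIC; the AIC should match the line above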

To find the model that minimizes these values, we use a couple of different approaches:

  1. All Subset Regression: this approach looks through all possible models, and as a result is computationally expensive. It is not feasible for problems with large data and many variables.
  2. Stepwise Regression: successively adds and drops predictors to find a model that lowers AIC.

The MASS package offers a stepwise regression function called stepAIC:

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
step <- stepAIC(house_lm,direction = "both")
## Start:  AIC=566015.4
## AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms + Bedrooms + 
##     BldgGrade
## 
##                 Df  Sum of Sq        RSS    AIC
## - SqFtLot        1 6.6750e+10 1.5481e+15 566014
## <none>                        1.5481e+15 566015
## - Bathrooms      1 1.9622e+12 1.5500e+15 566042
## - Bedrooms       1 2.5142e+13 1.5732e+15 566379
## - BldgGrade      1 1.3386e+14 1.6819e+15 567895
## - SqFtTotLiving  1 2.3516e+14 1.7832e+15 569222
## 
## Step:  AIC=566014.3
## AdjSalePrice ~ SqFtTotLiving + Bathrooms + Bedrooms + BldgGrade
## 
##                 Df  Sum of Sq        RSS    AIC
## <none>                        1.5481e+15 566014
## + SqFtLot        1 6.6750e+10 1.5481e+15 566015
## - Bathrooms      1 1.9283e+12 1.5501e+15 566041
## - Bedrooms       1 2.5075e+13 1.5732e+15 566377
## - BldgGrade      1 1.3392e+14 1.6821e+15 567895
## - SqFtTotLiving  1 2.3977e+14 1.7879e+15 569279
step
## 
## Call:
## lm(formula = AdjSalePrice ~ SqFtTotLiving + Bathrooms + Bedrooms + 
##     BldgGrade, data = house, na.action = na.omit)
## 
## Coefficients:
##   (Intercept)  SqFtTotLiving      Bathrooms       Bedrooms      BldgGrade  
##     -522415.7          228.2       -19240.4       -47650.5       106138.4
Categorical Variables in Regression

Categorical variables take on a limited number of discrete values. Binary yes/no variables, also called indicator variables, are a special case of a factor variable.

Categorical variables can also be used in regression models; however, they must be recoded because regressions require numeric inputs. The most common approach is to convert a categorical variable into a set of binary dummy variables (0/1).

In the King County housing data, there is a factor variable for the property type:

unique(house[,"PropertyType"])
## [1] Multiplex     Single Family Townhouse    
## Levels: Multiplex Single Family Townhouse

To use this factor variable we need to convert it to a set of binary variables. We do this by creating a binary variable for each possible value of the factor variable. To do this in R, we use the model.matrix function:

prop_type_dummies <- model.matrix(~PropertyType -1, data=house)
head(prop_type_dummies)
##   PropertyTypeMultiplex PropertyTypeSingle Family PropertyTypeTownhouse
## 1                     1                         0                     0
## 2                     0                         1                     0
## 3                     0                         1                     0
## 4                     0                         1                     0
## 5                     0                         1                     0
## 6                     0                         0                     1

The categorical variable has three distinct levels, so it is translated into three columns. Usually, a factor variable with P levels is represented by P-1 columns. This is done to avoid multicollinearity: the intercept term uses one of the factor levels as its base, and all other coefficients are calculated relative to this base level.
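
As a sketch of that point: dropping the -1 from the model.matrix() call lets R choose a base level, producing an intercept column plus P-1 dummy columns instead of P.

head(model.matrix(~PropertyType, data = house))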

house_lm  <- lm(AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms + Bedrooms + BldgGrade + PropertyType, data=house)

In the example above, there is no coefficient for Multiplex, since it is implicitly defined when the other PropertyType dummies are set to 0.

The coefficients are interpreted as relative to Multiplex, so a home that is a Single Family is worth almost $85k less, and one that is a Townhouse is worth over $150k less.

Factors with Multiple Levels

Some factor variables can produce a huge number of binary dummies - zip codes, for example, are massive in quantity: there are roughly 43,000 zip codes in the US. In such situations it can be helpful to consolidate the factors into groups.

zip_groups <- house %>% mutate(resid = residuals(house_lm)) %>%
group_by(ZipCode) %>%
summarise(med_resid = median(resid), cnt=n()) %>%
arrange(med_resid) %>%   # sort zips by their median residual before grouping
mutate(cum_cnt = cumsum(cnt), ZipGroup = ntile(cum_cnt,5))
house <- house %>% left_join(zip_groups, by='ZipCode')
house %>% dplyr::select(starts_with("ZipGroup")) %>% distinct()
##   ZipGroup
## 1        1
## 2        5
## 3        2
## 4        3
## 5        4

In the example above, we calculated the median residual for each zip and used the ntile function to split the zips, sorted by the median, into five groups.

Ordered Factor Variables

As was previously discussed, some factor variables have levels with a natural ordering. These are called ordered factor variables.

A loan grade, for example, can be ordered into A, B, C, etc., with each grade carrying more risk than the prior one. Ordered factor variables can be converted to numerical values and used as is. In the housing data, for example, BldgGrade is an ordered factor variable.

str(house[,"BldgGrade"])
##  int [1:22689] 7 10 8 7 7 8 8 7 6 6 ...
Correlated Predictors
step$coefficients
##   (Intercept) SqFtTotLiving     Bathrooms      Bedrooms     BldgGrade 
##    -522415.74        228.23     -19240.42     -47650.53     106138.39

If we look at the output of our stepwise regression model, we notice that the Bedrooms coefficient is negative. If you think about it, this makes little sense in the real world: how can adding a bedroom make a house worth less?

The truth is that there are several correlated predictors in our model. The variables for bedrooms, house size, and number of bathrooms are all correlated. We can verify this using a measure called the variance inflation factor (VIF). The vif function, part of the car package, allows us to quickly identify predictor variables that are correlated and may thereby create multicollinearity:

library(car)
## Warning: package 'car' was built under R version 3.4.4
## Loading required package: carData
## Warning: package 'carData' was built under R version 3.4.4
## 
## Attaching package: 'car'
## The following object is masked from 'package:boot':
## 
##     logit
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
vif(house_lm)
##                   GVIF Df GVIF^(1/(2*Df))
## SqFtTotLiving 4.742407  1        2.177707
## SqFtLot       1.050959  1        1.025163
## Bathrooms     2.851288  1        1.688576
## Bedrooms      1.754294  1        1.324498
## BldgGrade     2.802485  1        1.674063
## PropertyType  1.301312  2        1.068059

The higher the VIF value, the greater the degree of multicollinearity.

If we fit our model without some of our correlated predictors we see something different:

update(step, . ~ . -SqFtTotLiving - SqFtFinBasement -Bathrooms)
## 
## Call:
## lm(formula = AdjSalePrice ~ Bedrooms + BldgGrade, data = house, 
##     na.action = na.omit)
## 
## Coefficients:
## (Intercept)     Bedrooms    BldgGrade  
##    -1166529        31247       211763

We can use the update function to remove variables from, or add variables to, our model.

In extreme cases, correlated variables produce multicollinearity, a condition in which there is redundancy among the predictor variables. This occurs when one predictor variable can be expressed as a linear combination of others. This can happen when:

- A variable is included multiple times by error.
- P dummies, instead of P-1 dummies, are created from a factor variable.
- Two variables are nearly perfectly correlated with one another.

Multicollinearity in regression must be addressed before interpreting coefficients because it can result in erroneous results.

Multicollinearity can be measured with the vif function: the higher the VIF value, the higher the degree of collinearity.

Interaction Effects

In some cases, the relationship between variables is not linear. There are times where one variable can amplify the effect of other variables.

In the housing model, for example, it is well known in real estate that, with location held constant, larger homes are more valuable. Additionally, the size of a home can have a multiplicative relationship with location in terms of the value of the home.

lm(AdjSalePrice ~ SqFtTotLiving*ZipGroup + SqFtLot + Bathrooms + Bedrooms + BldgGrade + PropertyType, data=house, na.action=na.omit)
## 
## Call:
## lm(formula = AdjSalePrice ~ SqFtTotLiving * ZipGroup + SqFtLot + 
##     Bathrooms + Bedrooms + BldgGrade + PropertyType, data = house, 
##     na.action = na.omit)
## 
## Coefficients:
##               (Intercept)              SqFtTotLiving  
##                -6.795e+05                  2.645e+02  
##                  ZipGroup                    SqFtLot  
##                 5.810e+04                  1.138e-02  
##                 Bathrooms                   Bedrooms  
##                -5.818e+03                 -4.640e+04  
##                 BldgGrade  PropertyTypeSingle Family  
##                 1.114e+05                 -6.569e+04  
##     PropertyTypeTownhouse     SqFtTotLiving:ZipGroup  
##                -1.212e+05                 -1.645e+01

In this model, we see new terms involving ZipGroup, including an interaction between SqFtTotLiving and ZipGroup (with ZipGroup coded as a factor, there would be a separate interaction coefficient for each group). Interpreting the coefficients: for each additional square foot in the lowest ZipGroup, the predicted price rises by about $177. In the highest ZipGroup, the slope is the main effect plus the interaction term, roughly $447 per square foot. In other words, adding a square foot in the most expensive zip code group boosts the sale price by a factor of almost 2.7 compared to the boost in the least expensive group.

Testing the Assumptions: Regression Diagnostics

One of the most important steps in generating a regression model is validating the model's assumptions.

The assumptions underlying a regression model are as follows:

Outliers:

Generally speaking, an extreme value, also called an outlier, is one that is distant from most of the other observations. Just as outliers need to be handled for measures of location and variability, they can also cause major problems with regression models. This is because a regression model will try to fit these noisy values, which can take away from its predictive power for typical values.

There are several ways to identify outliers. From a visual perspective, we discussed using the boxplot. However, we can also leverage the statistical rule behind the boxplot: any value that lies more than 1.5 times the IQR beyond the quartiles is likely an outlier. We can identify those values in our data using the boxplot.stats function combined with the $out element. Here is an example below:

x <- rnorm(100)
y <- rnorm(100)
df <- data.frame(x, y)
rm(x, y)
(a <- with(df, which(x %in% boxplot.stats(x)$out)))
## [1] 83
(b <- with(df, which(y %in% boxplot.stats(y)$out)))
## integer(0)
(outlier.list1 <- intersect(a,b))
## integer(0)
plot(df)
points(df[outlier.list1,], col="red", pch="+", cex=2.5)

(outlier.list2 <- union(a,b))
## [1] 83
plot(df)
points(df[outlier.list2,], col="blue", pch="x", cex=1)

Why is it important to deal with outliers? For one, it was previously mentioned that outliers can greatly influence any measure of variance or centrality. In a similar fashion, they can significantly alter the coefficient estimates in a regression model. Let's take data for sales of houses in zip code 98105, for example:

house_98105 <- house[house$ZipCode == 98105,]
lm_98105 <- lm(AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms + Bedrooms + BldgGrade, data=house_98105)

We can extract the standardized residuals using the rstandard function to obtain the index of the smallest residual using the order function:

sresid <- rstandard(lm_98105)
indx <- order(sresid)
sresid[indx[1]]
##     20431 
## -4.326732

As we can see, the biggest overestimate from the model is more than 4 standard errors away from the regression line, corresponding to an overestimate of $757,753. The original data record corresponding to this outlier is as follows:

house_98105[indx[1],c('AdjSalePrice','SqFtTotLiving','SqFtLot','Bathrooms','Bedrooms','BldgGrade')]
##       AdjSalePrice SqFtTotLiving SqFtLot Bathrooms Bedrooms BldgGrade
## 20431       119748          2900    7276         3        6         7

In this case, it appears that there is something wrong with this record: a house of this size typically sells for much more than $119,748 in this zip code. Looking deeper, this is clearly an outlier.

One way to deal with outliers is to replace them with the mean of the data excluding the outliers. We can do this with the subset() command, which was introduced in the data frame section, or with the which() command, which can also be used to filter values. Take the example below, which uses the same idea to replace missing values with the mean:

x<- c(1:4, NA, 6:7, NA)
complete.cases(x)
## [1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE
is.na(x)
## [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
which(is.na(x))
## [1] 5 8
# two equivalent ways to replace the missing values with the mean of the rest (either line alone suffices)
x[which(is.na(x))] <- mean(x, na.rm=TRUE)
x[is.na(x)] <- mean(x, na.rm=TRUE)
Influential Values

Influential values are those whose absence would significantly change the regression equation. Influential values are said to have high leverage on the regression model.

There are different ways of measuring influence. Here are some of the most common used:

- Hat-values: measure leverage. A common rule of thumb flags observations whose hat-value exceeds 2(P+1)/n.

- Cook's Distance: defines influence as a combination of leverage and residual size. An observation has high influence if Cook's distance exceeds 4/(n-P-1).

An influence plot, or bubble plot combines standardized residuals, the hat-value, and Cook’s distance in a single plot. Here is an example of how to do this in R with the house data:

std_resid <- rstandard(lm_98105)
cooks_D <- cooks.distance(lm_98105)
hat_values <- hatvalues(lm_98105)
plot(hat_values,std_resid,cex=10*sqrt(cooks_D))
abline(h=c(-2.5,2.5),lty=2)
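
A small follow-up sketch: the rule-of-thumb thresholds described above can also be used to flag the influential rows directly (here P is the number of predictors in lm_98105):

n <- nrow(house_98105)
P <- length(coefficients(lm_98105)) - 1   # predictors, excluding the intercept
which(cooks_D > 4 / (n - P - 1))          # rows with high Cook's distance
which(hat_values > 2 * (P + 1) / n)       # rows with high leverage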

As you can see in the chart, we have plotted the standard residual on the y-axis, the hat values on the x-axis, and assigned the size of the bubbles to the value of Cook’s distance.

Looking at the chart above, we can see that there are several data points that exert large influence on the regression. Including these values can negatively impact the performance of our model, because it will try to fit these noisy points as opposed to the true relationship.

It is important to note, however, that the impact of these high-leverage points diminishes as the overall dataset grows. The larger the data, the less chance a single point has to dramatically change the regression equation.

Heteroskedasticity, Non-Normality, and Correlated Errors

Statisticians spend a lot of time analyzing the distribution of the residuals. This is because it is important for statistical models to be unbiased in their degree of error. If there is a clear trend in the errors, it is likely that there is an incorrect assumption being made.

Heteroskedasticity is a measure of the degree to which errors are greater for some portions of the range of data than for others.

Below we can compare the absolute residuals versus the predicted values for the regression that we fitted for houses in the 98105 Zip Code.

df <- data.frame(resid = residuals(lm_98105),
pred = predict(lm_98105))
ggplot(df,aes(pred,abs(resid))) + 
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

If we look at the chart above, we can clearly see that the variance of the residuals gets larger as the value of the homes we are predicting increases. The variance is also larger for homes at the lower extreme. This indicates that our model has heteroskedastic errors.

Another assumption that statisticians like to verify is that the errors in the model are independent of each other. This can come into play when looking at data collected over the course of time. In certain situations, errors between two time periods can be correlated. This is called autocorrelation. In such a situation, it would be preferable to conduct a time series analysis rather than a linear regression.

Polynomial and Spline Regression

Linear regressions are relatively simple and, as a result, easy to interpret. However, there is one assumption of linear models that rarely holds in reality: the assumption of a linear relationship. In the real world, variables often do not have a strictly linear relationship. In these cases, we use what are called non-linear regressions.

Polynomial Regression

A polynomial regression involves adding polynomial terms to a regression equation. Take, for example, this quadratic polynomial regression (power of 2) for square footage in the King County housing data:

lm(AdjSalePrice ~ poly(SqFtTotLiving, 2) + SqFtLot + BldgGrade + Bathrooms + Bedrooms, data = house_98105)
## 
## Call:
## lm(formula = AdjSalePrice ~ poly(SqFtTotLiving, 2) + SqFtLot + 
##     BldgGrade + Bathrooms + Bedrooms, data = house_98105)
## 
## Coefficients:
##             (Intercept)  poly(SqFtTotLiving, 2)1  poly(SqFtTotLiving, 2)2  
##              -402530.47               3271519.49                776934.02  
##                 SqFtLot                BldgGrade                Bathrooms  
##                   32.56                135717.06                 -1435.12  
##                Bedrooms  
##                -9191.94

By adding the poly() function to the model, we can see that two coefficients have been fit for square footage: one for the linear term and one for the quadratic term.

Splines

Polynomial regressions, unfortunately, only capture a certain amount of curvature in a nonlinear relationship. As the degree of the polynomial increases, the flexibility of the regression equation reaches an undesirable extreme.

In such cases, statisticians have developed a superior approach that smoothly interpolates between fixed points, called knots. To create a spline in R we can use the splines package. The bs() function takes the knot locations we specify (here, the quartiles) and the degree of spline we would like to fit:

library(splines)
knots <- quantile(house_98105$SqFtTotLiving, p=c(.25,.5,.75))
lm_spline <- lm(AdjSalePrice ~ bs(SqFtTotLiving, knots = knots, degree=3) + SqFtLot + Bathrooms + Bedrooms + BldgGrade, data=house_98105)
lm_spline
## 
## Call:
## lm(formula = AdjSalePrice ~ bs(SqFtTotLiving, knots = knots, 
##     degree = 3) + SqFtLot + Bathrooms + Bedrooms + BldgGrade, 
##     data = house_98105)
## 
## Coefficients:
##                                   (Intercept)  
##                                    -414157.61  
## bs(SqFtTotLiving, knots = knots, degree = 3)1  
##                                    -199529.76  
## bs(SqFtTotLiving, knots = knots, degree = 3)2  
##                                    -120580.64  
## bs(SqFtTotLiving, knots = knots, degree = 3)3  
##                                     -71644.39  
## bs(SqFtTotLiving, knots = knots, degree = 3)4  
##                                     195677.89  
## bs(SqFtTotLiving, knots = knots, degree = 3)5  
##                                     845244.25  
## bs(SqFtTotLiving, knots = knots, degree = 3)6  
##                                     695545.67  
##                                       SqFtLot  
##                                         33.33  
##                                     Bathrooms  
##                                      -4778.21  
##                                      Bedrooms  
##                                      -5778.70  
##                                     BldgGrade  
##                                     134462.37

Here we can see the several additional coefficients that have been added. However, it is essentially impossible to interpret them directly. Instead, it is more useful to use a visual display to reveal the nature of the spline fit.

Generalized Additive Models

A generalized additive model (GAM) is a technique used to automatically fit a spline regression without having to specify the knots. The mgcv package in R can be used to fit a GAM to the housing data:

library(mgcv)
## Warning: package 'mgcv' was built under R version 3.4.4
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## This is mgcv 1.8-27. For overview type 'help("mgcv-package")'.
lm_gam <- gam(AdjSalePrice ~ s(SqFtTotLiving) + SqFtLot + Bathrooms + Bedrooms + BldgGrade, data = house_98105)

The term s() tells the gam function to find the best knots for a spline term on SqFtTotLiving.

Classification Models

Regression is used to understand and predict quantitative variables. However, what do we do when we need to predict a categorical variable? In these cases, we are trying to answer a classification problem.

The goal of classification models is to predict whether a record is a 0 or a 1 (e.g., transaction vs. no transaction in a web session), or in some cases one of several categories (e.g., Gmail filtering into Primary, Social, Forums, etc.).

The way classification works is that the probability that a record belongs to a particular class is estimated. Once the probability exceeds a predefined cutoff, the record is assigned to that class. The step-by-step process is as follows:

  1. Establish a cutoff probability for the class of interest, above which we consider a record as belonging to that class.
  2. Estimate the probability that a record belongs to the class of interest.
  3. If that probability is above the cutoff, assign the record to the class of interest (see the sketch below).
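
A minimal sketch of that last step, using a hypothetical vector of estimated probabilities and an arbitrary 0.5 cutoff:

probs <- c(0.12, 0.55, 0.48, 0.91)             # assumed example probabilities
pred_class <- ifelse(probs > 0.5, "Yes", "No") # assign the class of interest above the cutoff
pred_class
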
Logistic Regression

Logistic regression is analogous to multiple linear regression, except the outcome is binary. Various transformations are employed to convert the problem to one in which a linear model can be fit.

The key ingredients in a logistic regression are the logistic response function and the logit, which map a probability onto a linear model.

To run a logistic regression in R, the glm function is used with the family parameter set to binomial. glm() stands for generalized linear models; since logistic regression is linear in the predictors on the log-odds scale, it is included in the generalized linear model family.

The syntax of glm() is similar to that of lm(), except that one must add the family argument. For logistic regression, the binomial family is used. A binomial distribution with a single 0/1 trial is also called a Bernoulli distribution.

We will be using stock market data consisting of 1,250 days of S&P 500 percentage returns from 2001 to 2005 to train our models. Each observation is a day, with the percentage returns for each of the five previous trading days, Lag1 through Lag5. Volume is also a feature, representing the number of shares traded (in billions), and Direction states whether the market was Up or Down on the date in question.

library(ISLR)
## Warning: package 'ISLR' was built under R version 3.4.4
attach(Smarket)
names(Smarket)
## [1] "Year"      "Lag1"      "Lag2"      "Lag3"      "Lag4"      "Lag5"     
## [7] "Volume"    "Today"     "Direction"

The pairs() function provides a scatterplot matrix:

pairs(Smarket)

The cor() function provides a correlation matrix but will return an error if given qualitative variables (factors). For that reason we exclude Direction, which is a factor:

cor(Smarket[,-9])
##              Year         Lag1         Lag2         Lag3         Lag4
## Year   1.00000000  0.029699649  0.030596422  0.033194581  0.035688718
## Lag1   0.02969965  1.000000000 -0.026294328 -0.010803402 -0.002985911
## Lag2   0.03059642 -0.026294328  1.000000000 -0.025896670 -0.010853533
## Lag3   0.03319458 -0.010803402 -0.025896670  1.000000000 -0.024051036
## Lag4   0.03568872 -0.002985911 -0.010853533 -0.024051036  1.000000000
## Lag5   0.02978799 -0.005674606 -0.003557949 -0.018808338 -0.027083641
## Volume 0.53900647  0.040909908 -0.043383215 -0.041823686 -0.048414246
## Today  0.03009523 -0.026155045 -0.010250033 -0.002447647 -0.006899527
##                Lag5      Volume        Today
## Year    0.029787995  0.53900647  0.030095229
## Lag1   -0.005674606  0.04090991 -0.026155045
## Lag2   -0.003557949 -0.04338321 -0.010250033
## Lag3   -0.018808338 -0.04182369 -0.002447647
## Lag4   -0.027083641 -0.04841425 -0.006899527
## Lag5    1.000000000 -0.02200231 -0.034860083
## Volume -0.022002315  1.00000000  0.014591823
## Today  -0.034860083  0.01459182  1.000000000

Correlations between the lag variables and today's returns are close to zero. The only substantial correlation is between Year and Volume:

plot(Year,Volume)

Let’s go ahead and create our logistic regression model:

glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,data=Smarket,family=binomial)
summary(glm.fit)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Smarket)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.446  -1.203   1.065   1.145   1.326  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000   0.240736  -0.523    0.601
## Lag1        -0.073074   0.050167  -1.457    0.145
## Lag2        -0.042301   0.050086  -0.845    0.398
## Lag3         0.011085   0.049939   0.222    0.824
## Lag4         0.009359   0.049974   0.187    0.851
## Lag5         0.010313   0.049511   0.208    0.835
## Volume       0.135441   0.158360   0.855    0.392
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1731.2  on 1249  degrees of freedom
## Residual deviance: 1727.6  on 1243  degrees of freedom
## AIC: 1741.6
## 
## Number of Fisher Scoring iterations: 3

Looking at the coefficients, it appears that the market is more likely to go down today if it went up yesterday or the day before (negative coefficients on Lag1 and Lag2), and more likely to go up if it had gone up 3-5 days ago (positive coefficients on Lag3 through Lag5). However, none of the coefficients' p-values are statistically significant, so we can't put too much weight on these findings.

To extract the p-values associated with the coefficients, you can reference the column number of the coefficient matrix just like a data frame:

summary(glm.fit)$coef[,4]
## (Intercept)        Lag1        Lag2        Lag3        Lag4        Lag5 
##   0.6006983   0.1452272   0.3983491   0.8243333   0.8514445   0.8349974 
##      Volume 
##   0.3924004

To predict the probability that the market will go up for given values of the predictors, use the predict() function just like in linear regression. The type="response" argument tells R to output probabilities as opposed to the logit. Note that if no dataset is specified, the model defaults to the dataset that was used to fit it:

glm.probs=predict(glm.fit,type="response")
glm.probs[1:10]
##         1         2         3         4         5         6         7 
## 0.5070841 0.4814679 0.4811388 0.5152224 0.5107812 0.5069565 0.4926509 
##         8         9        10 
## 0.5092292 0.5176135 0.4888378

The contrasts() function indicates that R has created a dummy variable with a 1 for Up:

contrasts(Direction)
##      Up
## Down  0
## Up    1

In order to make a prediction as to whether the market will go up or down on a particular day, we must convert these predicted probabilities into class labels, "Up" or "Down". The following two commands create a vector of class predictions based on whether the predicted probability of a market increase is greater than or less than 0.5. You will notice that the vector has 1,250 observations (the total number of observations in the Smarket data).

glm.pred=rep("Down",1250)

Each "Down" value is then replaced with "Up" when the fitted probability is greater than our 50% threshold:

glm.pred[glm.probs>.5]="Up"

Measuring Classification Model Performance

At the heart of classification metrics is the confusion matrix. The confusion matrix is a table showing the number of correct and incorrect predictions categorized by type of response. Several packages are available in R to compute a confusion matrix, but it is also easy to do so by hand.

We can then create a confusion matrix of the results with the table() command. This will help us determine how many observations were correctly or incorrectly classified. The diagonals of the confusion matrix indicate our correct predictions, while the off-diagonals represent incorrect guesses.

table(glm.pred,Direction)
##         Direction
## glm.pred Down  Up
##     Down  145 141
##     Up    457 507
(table(glm.pred,Direction)[2,2]+table(glm.pred,Direction)[1,1])/dim(Smarket)[1]
## [1] 0.5216

The mean() function can be used to compute the fraction of days for which the prediction was correct.

mean(glm.pred==Direction)
## [1] 0.5216
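
Other common metrics can be read straight off the same table; a minimal sketch, treating "Up" as the class of interest:

cm <- table(glm.pred, Direction)
accuracy <- sum(diag(cm)) / sum(cm)                     # overall fraction of correct predictions
sensitivity <- cm["Up", "Up"] / sum(cm[, "Up"])         # correct "Up" calls among days the market actually went up
specificity <- cm["Down", "Down"] / sum(cm[, "Down"])   # correct "Down" calls among days the market actually went down
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)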

Cross Validation:

In cross validation, we create a training and a test data set, fit the model on the training set, and compare its error on the two sets. This is a vital concept in data science. The reason we do this is that statistical algorithms are good at fitting models to the data they are provided; however, those models may perform poorly when applied to new data.

To do this in R, we create two Boolean (TRUE/FALSE) vectors: one identifying the training set and one identifying the test set. In this case, we designate that observations that occurred before 2005 will be in the training set. Once the training indicator is created, say in an object called train, you can use it to create the test set. This is done by adding a ! in front of the object, thereby selecting the observations excluded from the training set.

The ! symbol can be used to reverse all of the elements of a Boolean vector. That is, !train is a vector similar to train, except that the elements that are TRUE in train get swapped to FALSE in !train, and the elements that are FALSE in train get swapped to TRUE in !train. Therefore, Smarket[!train,] yields a submatrix of the stock market data containing only the observations for which train is FALSE; that is, the observations with dates in 2005.
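
A quick illustration of the ! operator on a made-up logical vector:

train_example <- c(TRUE, TRUE, FALSE)
!train_example   # FALSE FALSE TRUE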

We can subset our data based on Year by filtering out the values that are less than 2005:

train=(Year<2005)
Smarket.2005=Smarket[!train,]

Smarket.2005 contains only the 2005 observations, created by excluding the rows of Smarket whose Year values are less than 2005.

As you can see by comparing the dimensions of the two datasets:

table(dim(Smarket.2005),dim(Smarket))
##      
##       9 1250
##   9   1    0
##   252 0    1

You can also use the train object to exclude values from individual vectors:

Direction.2005=Direction[!train]

We will now fit a model to our training data and evaluate its performance on the test data. With the training indicator ready, you can restrict the model to the training observations using the subset argument:

glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,data=Smarket,family=binomial,subset=train)

With the model fit on the training data, we can look at its results on the test data. Note that the dataset specified in predict() is the test set:

glm.probs=predict(glm.fit,Smarket.2005,type="response")

As we did previously, we create a vector the length of the test set (252 observations), with the predicted classes defaulted to "Down":

glm.pred=rep("Down",252)

We then replace with "Up" those observations for which our logistic model has predicted more than a 50% probability:

glm.pred[glm.probs>.5]="Up"

We create our confusion matrix

table(glm.pred,Direction.2005)
##         Direction.2005
## glm.pred Down Up
##     Down   77 97
##     Up     34 44

Percentage of correct and incorrect predictions:

mean(glm.pred==Direction.2005)
## [1] 0.4801587
mean(glm.pred!=Direction.2005)
## [1] 0.5198413

We can see that our model has a very high error rate, higher than the 50% we would expect from random chance. Perhaps by removing the Lag fields that were not significant we can improve the model:

glm.fit=glm(Direction~Lag1+Lag2,data=Smarket,family=binomial,subset=train)
glm.probs=predict(glm.fit,Smarket.2005,type="response")
glm.pred=rep("Down",252)
glm.pred[glm.probs>.5]="Up"
table(glm.pred,Direction.2005)
##         Direction.2005
## glm.pred Down  Up
##     Down   35  35
##     Up     76 106

Accuracy and Error Rates:

mean(glm.pred==Direction.2005)
## [1] 0.5595238
mean(glm.pred!=Direction.2005)
## [1] 0.4404762

By removing the predictors with low predictive power, we have improved the accuracy of our model. If we would like to use our model to predict new values, we can supply a data frame to the newdata argument of predict(), combined with type = "response":

predict(glm.fit,newdata=data.frame(Lag1=c(1.2,1.5),Lag2=c(1.1,-0.8)),type="response")
##         1         2 
## 0.4791462 0.4960939

Linear Discriminant Analysis

LDA is one of the most popular classification methods. When applied to the predictor variables, it maximizes the separation of the classes. It does this by focusing on variation between groups as opposed to variation within groups: LDA seeks the linear combination of predictors that maximizes the sum of squares between the groups relative to the sum of squares within the groups. The combination that maximizes this ratio yields the greatest separation between the two groups.

library(MASS)
train=(Year<2005)
Smarket.2005=Smarket[!train,]
Direction.2005=Direction[!train]
lda.fit=lda(Direction~Lag1+Lag2,data=Smarket,subset=train)
lda.fit
## Call:
## lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
## 
## Prior probabilities of groups:
##     Down       Up 
## 0.491984 0.508016 
## 
## Group means:
##             Lag1        Lag2
## Down  0.04279022  0.03389409
## Up   -0.03954635 -0.03132544
## 
## Coefficients of linear discriminants:
##             LD1
## Lag1 -0.6420190
## Lag2 -0.5135293

Plotting the model gives us a plot of the linear discriminant values (the coefficients applied to the training observations):

plot(lda.fit)

The predict() function returns a list of three elements: 1) class contains the predictions about the market's movement; 2) posterior is a matrix whose columns contain the posterior probability that each observation belongs to the corresponding class; 3) x contains the linear discriminants.

lda.pred=predict(lda.fit, Smarket.2005)
names(lda.pred)
## [1] "class"     "posterior" "x"
lda.class=lda.pred$class

With the predictions for each value created, we can now build the confusion matrix

table(lda.class,Direction.2005)
##          Direction.2005
## lda.class Down  Up
##      Down   35  35
##      Up     76 106

The results are almost identical to those of the logistic regression model:

mean(lda.class==Direction.2005)
## [1] 0.5595238

We can recreate the predictions contained in lda.pred$class by applying a 50% threshold to the posterior probabilities:

sum(lda.pred$posterior[,1]>=.5)
## [1] 70
sum(lda.pred$posterior[,1]<.5)
## [1] 182

We can also pull the first 20 observations and their posterior probabilities that the market will decrease:

lda.pred$posterior[1:20,1]
##       999      1000      1001      1002      1003      1004      1005 
## 0.4901792 0.4792185 0.4668185 0.4740011 0.4927877 0.4938562 0.4951016 
##      1006      1007      1008      1009      1010      1011      1012 
## 0.4872861 0.4907013 0.4844026 0.4906963 0.5119988 0.4895152 0.4706761 
##      1013      1014      1015      1016      1017      1018 
## 0.4744593 0.4799583 0.4935775 0.5030894 0.4978806 0.4886331

lda.class gives us the predicted direction instead of the probability:

lda.class[1:20]
##  [1] Up   Up   Up   Up   Up   Up   Up   Up   Up   Up   Up   Down Up   Up  
## [15] Up   Up   Up   Down Up   Up  
## Levels: Down Up

If we only want to predict a market decrease when we are very certain, we can raise the threshold from 50% to 90%. As we can see, there is no day in 2005 for which the posterior probability of a decrease exceeds 90%:

sum(lda.pred$posterior[,1]>.9)
## [1] 0

Quadratic Discriminant Analysis

We implement the QDA model using the qda() function, whose syntax is identical to that of lda():

qda.fit=qda(Direction~Lag1+Lag2,data=Smarket,subset=train)

The output provides our prior probabilities and group means, but no coefficients of linear discriminants, since the QDA classifier involves a quadratic rather than a linear function of the predictors:

qda.fit
## Call:
## qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
## 
## Prior probabilities of groups:
##     Down       Up 
## 0.491984 0.508016 
## 
## Group means:
##             Lag1        Lag2
## Down  0.04279022  0.03389409
## Up   -0.03954635 -0.03132544

The predict() function works the same way as for LDA:

qda.class=predict(qda.fit,Smarket.2005)$class
table(qda.class,Direction.2005)
##          Direction.2005
## qda.class Down  Up
##      Down   30  20
##      Up     81 121
mean(qda.class==Direction.2005)
## [1] 0.5992063

Looking at the results, the QDA predictions are accurate almost 60% of the time, so the test error rate is well below 50%. This suggests that the quadratic form assumed by QDA captures the true relationship more accurately, reducing bias, because the true shape of the relationship is not linear.
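
A convenient way to keep track of the methods tried so far is to collect their test accuracies in one place; a minimal sketch using the prediction objects created above:

data.frame(model = c("Logistic regression", "LDA", "QDA"),
           accuracy = c(mean(glm.pred == Direction.2005),
                        mean(lda.class == Direction.2005),
                        mean(qda.class == Direction.2005)))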

K-Nearest Neighbors

To perform KNN we can use the knn() function, which is part of the class library. It works differently from the previous models in the sense that it is done in one step, rather than the two-step model-then-predict process we used in our other methods.

The model requires 4 inputs: 1) a matrix containing the predictors associated with the training data; 2) a matrix containing the predictors associated with the data for which we wish to make predictions, labeled test.X below; 3) a vector containing the class labels for the training observations; 4) a value for K, the number of nearest neighbors to be used by the classifier.

  1. Matrix containing the predictors for the training data. To create the matrix, you bind the columns together using cbind() and then filter the rows using the same Boolean train object as before:
library(class)
train.X=cbind(Lag1,Lag2)[train,]
  2. Matrix containing the predictors for the test data:
test.X=cbind(Lag1,Lag2)[!train,]
  3. Vector containing the class labels for the training observations:
train.Direction=Direction[train]
  4. A value for K, the number of nearest neighbors.

Note: set.seed() is used because knn() breaks ties between nearest neighbors randomly; setting the seed makes the results reproducible:

set.seed(1)
knn.pred=knn(train.X,test.X,train.Direction,k=1)

Confusion Matrix:

table(knn.pred,Direction.2005)
##         Direction.2005
## knn.pred Down Up
##     Down   43 58
##     Up     68 83
mean(knn.pred==Direction.2005)
## [1] 0.5

The error rate is 50%, so accuracy is only as good as chance. Let's try a KNN model with K = 3. By increasing K to 3, we will reduce variance and overfitting, thereby hopefully increasing test accuracy. The lower K is, the higher the flexibility of the model and the greater the variance.

knn.pred=knn(train.X,test.X,train.Direction,k=3)
table(knn.pred,Direction.2005)
##         Direction.2005
## knn.pred Down Up
##     Down   48 54
##     Up     63 87
mean(knn.pred==Direction.2005)
## [1] 0.5357143

It appears that increasing K (the number of neighbors) improved accuracy slightly.
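
Rather than refitting by hand for each value, you can also sweep over several candidate values of K and compare test accuracy; a minimal sketch with an arbitrary set of candidates:

k_values <- c(1, 3, 5, 10)
sapply(k_values, function(k) {
  set.seed(1)   # reproducible tie-breaking
  mean(knn(train.X, test.X, train.Direction, k = k) == Direction.2005)
})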

Comparing Multiple Models: An Application to Caravan Insurance Data

dim(Caravan)
## [1] 5822   86
attach(Caravan)
summary(Purchase)
##   No  Yes 
## 5474  348
(348/5822)*100
## [1] 5.977327

About 6% of people purchased a caravan insurance policy.

KNN uses measures of distance to identify neighbors. The scale of the variables being used to measure this distance can have a major impact on the output of the analysis. A variable like salary, measured in thousands of dollars, will have a far greater effect on the distance than a variable like age, which goes up in increments of one.
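
To see why, consider the Euclidean distance between two hypothetical customers who differ by $1,000 in salary and by 50 years in age (the numbers are made up purely for illustration):

salary_diff <- 1000
age_diff <- 50
sqrt(salary_diff^2 + age_diff^2)   # roughly 1001: the salary difference dominates the distance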

A good way to handle this problem of scale is to standardize the data, so that instead of the raw values, variables are translated into z-scores, in other words the number of standard deviations from the mean. This can be done in R using the scale() function:

standardized.X=scale(Caravan[,-86])
var(Caravan[,1])
## [1] 165.0378
var(Caravan[,2])
## [1] 0.1647078
var(standardized.X[,1])
## [1] 1
var(standardized.X[,2])
## [1] 1

If you compare the variances before and after standardization, you can see that each standardized column has a variance of 1 instead of its original value.
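
For intuition, the same standardization can be done by hand for a single column; a minimal sketch that should match what scale() produces up to floating-point error:

manual_z <- (Caravan[, 1] - mean(Caravan[, 1])) / sd(Caravan[, 1])
var(manual_z)   # should also be 1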

Model 1: KNN

Recall that KNN models require some slight pre-processing work. The model requires 4 inputs: 1) a matrix containing the predictors associated with the training data; 2) a matrix containing the predictors associated with the data for which we wish to make predictions, labeled test.X below; 3) a vector containing the class labels for the training observations; 4) a value for K, the number of nearest neighbors to be used by the classifier.

The test dataset is the first 1,000 observations:

test=1:1000

Creating training and test datasets using the pre-defined test object

train.X=standardized.X[-test,]
test.X=standardized.X[test,]
train.Y=Purchase[-test]
test.Y=Purchase[test]

Building the Model

set.seed(1)
knn.pred=knn(train.X,test.X,train.Y,k=1)

Model performance

mean(test.Y!=knn.pred)
## [1] 0.118
mean(test.Y!="No")
## [1] 0.059

The test error rate is just under 12%. However, since only 6% of people purchased overall, we would get a lower error rate by simply predicting "No" for every observation. But that is the overall test error; the goal is to predict purchases, not non-purchases. Therefore, we are mainly concerned with how accurate the model is among the customers it predicts will purchase.

Confusion Matrix:

table(knn.pred,test.Y)
##         test.Y
## knn.pred  No Yes
##      No  873  50
##      Yes  68   9
9/(68+9)
## [1] 0.1168831

A success rate of roughly 11.7% among predicted purchasers is nearly double what we would get from random guessing. We can also try re-running the model with a less flexible value of K (3 vs. 1) to see if this improves performance.

knn.pred=knn(train.X,test.X,train.Y,k=3)
table(knn.pred,test.Y)
##         test.Y
## knn.pred  No Yes
##      No  920  54
##      Yes  21   5
5/26
## [1] 0.1923077

As you can see, here we have improved the rate to 19% by increasing K slightly. If we take this further and increase K to 5, we get a rate of 26.7%:

knn.pred=knn(train.X,test.X,train.Y,k=5)
table(knn.pred,test.Y)
##         test.Y
## knn.pred  No Yes
##      No  930  55
##      Yes  11   4
4/15
## [1] 0.2666667

Let's compare these results to a logistic regression model.

Model 2: Logistic Regression

Let's see whether a logistic regression fit to the same training data improves performance:

glm.fit=glm(Purchase~.,data=Caravan,family=binomial,subset=-test)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.probs=predict(glm.fit,Caravan[test,],type="response")
glm.pred=rep("No",1000)
glm.pred[glm.probs>.5]="Yes"
table(glm.pred,test.Y)
##         test.Y
## glm.pred  No Yes
##      No  934  59
##      Yes   7   0

With the 50% threshold, the confusion matrix shows that none of the 7 customers predicted to purchase actually did. Let's redo the classification using a 25% threshold:

glm.pred=rep("No",1000)
glm.pred[glm.probs>.25]="Yes"
table(glm.pred,test.Y)
##         test.Y
## glm.pred  No Yes
##      No  919  48
##      Yes  22  11
11/(22+11)
## [1] 0.3333333

As you can see above, by lowering the cutoff to 25% we have improved the success rate among predicted purchasers to 33%. As the KNN results also showed, it does not always make sense to choose the most flexible model or the default cutoff, as doing so may lead to overfitting or to predictions that miss the class of interest.
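
To explore this tradeoff more systematically, you can sweep over several cutoffs and compute the success rate among predicted purchasers at each one; a minimal sketch with arbitrary cutoff values (the rate is undefined if a cutoff yields no "Yes" predictions):

cutoffs <- c(0.5, 0.25, 0.1)
sapply(cutoffs, function(cut) {
  pred <- ifelse(glm.probs > cut, "Yes", "No")
  sum(pred == "Yes" & test.Y == "Yes") / sum(pred == "Yes")
})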