The working directory is where R, by default, will search for input files and save all output. To set the working directory you use the setwd() command. The working directory is essentially a file path or address. This path can be copied from the Explorer window almost directly into the RStudio console. There are two caveats, however: Windows paths use backslashes, which must be changed to forward slashes (or doubled), and the path must be wrapped in quotes.
These two rules are important because they apply any time you import external csv or xlsx files, as well as when you save output.
Here is an example below:
setwd("C:/….")
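Before changing directories, it can help to check where R is currently pointed. A minimal sketch, where the path passed to setwd() is just a placeholder:

# Print the current working directory
getwd()

# Set a new one; note the forward slashes and the quotes, even on Windows
setwd("C:/Users/yourname/Documents")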
Instead of cells like a spreadsheet, R leverages the concept of objects. An object is a container that stores information. Objects can store all sorts of values, a fact that we will discuss further in the section on data types. There are also several different kinds of objects, or storage containers; we will get to these types later on.
Assign a value to my_apples:
my_apples <- 5
my_apples
## [1] 5
Assign a value to my_oranges:
my_oranges <- 6
my_oranges
## [1] 6
You can also assign objects using previously created objects. Assign the sum of both fruit objects to my_fruit:
my_fruit <- my_apples + my_oranges
my_fruit
## [1] 11
As previously mentioned, there are several data types that R can support within an object. These include:

- Decimal values like 4.5 are called numerics.
- Natural numbers like 4 are called integers. Integers are also numerics.
- Boolean values (TRUE or FALSE) are called logical.
- Text (or string) values are called characters.
- Factors are categorical variables built from character values, optionally with an inherent ordering.
- Dates are character values specific to calendar dates, such as "1/1/2019".
my_numeric <- 42
my_numeric
## [1] 42
my_logical <- FALSE
my_logical
## [1] FALSE
my_character <- "universe"
my_character
## [1] "universe"
A factor is a character variable with a discrete number of categories.
To create factors in R, you make use of the function factor().
For example, the sex_vector contains the sex of 5 different individuals:
sex_vector <- c("Male","Female","Female","Male","Male")
It is clear that there are two categories, or in R-terms ‘factor levels’, at work here: “Male” and “Female”.
The function factor() will encode the vector as a factor:
(factor_sex_vector <- factor(sex_vector))
## [1] Male Female Female Male Male
## Levels: Female Male
There are two types of factor variables:
A nominal variable is a categorical variable without an implied order. This means that it is impossible to say that one is worth more than the other.
For example, think of the categorical variable animals_vector with the categories “Elephant”, “Giraffe”, “Donkey” and “Horse”. Here, it is impossible to say that one stands above or below the other.
In contrast, ordinal variables do have a natural ordering. Consider for example the categorical variable temperature_vector with the categories: “Low”, “Medium” and “High”. Here it is obvious that “Medium” stands above “Low”, and “High” stands above “Medium”.
By default, the function factor() transforms vectors into an unordered factor. To create an ordered factor, you have to add two additional arguments: ordered and levels. Take the following script for example:
factor(some_vector, ordered = TRUE, levels = c("lev1", "lev2", …))
By setting the argument ordered to TRUE in the function factor(), you indicate that the factor is ordered. With the argument levels you give the values of the factor in the correct order.
temperature_vector <- c("High", "Low", "High","Low", "Medium")
factor_temperature_vector <- factor(temperature_vector, ordered = TRUE, levels = c("Low", "Medium", "High"))
factor_temperature_vector
## [1] High Low High Low Medium
## Levels: Low < Medium < High
The fact that factor_temperature_vector is now ordered enables us to compare its elements. You can simply do this by using the well-known comparison operators.
factor_temperature_vector[1] > factor_temperature_vector[2]
## [1] TRUE
To get an output of the levels associated with a factor, you can use the levels() function:
levels(factor_temperature_vector)
## [1] "Low" "Medium" "High"
There are two additional handy functions that can be used to understand the data type of an R object: str() and summary().

The str() function provides a compact display of an object's structure, showing its data type and the first few values.
str(factor_temperature_vector)
## Ord.factor w/ 3 levels "Low"<"Medium"<..: 3 1 3 1 2
The summary() command, on the other hand, provides a statistical summary. If the vector is numeric, R provides the minimum, quartiles, median, mean, and maximum. If it is categorical, it provides a frequency count.
summary(factor_temperature_vector)
## Low Medium High
## 2 1 2
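For comparison, here is a small sketch of summary() applied to a made-up numeric vector; instead of a frequency count it returns the minimum, quartiles, median, mean, and maximum:

my_numbers <- c(140, -50, 20, -120, 240)
summary(my_numbers)
# Min. -120, 1st Qu. -50, Median 20, Mean 46, 3rd Qu. 140, Max. 240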
In base R, date-time formats consist of the following pieces:

- Year: %Y (4 digits). %y (2 digits; 00-69 -> 2000-2069, 70-99 -> 1970-1999).
- Month: %m (2 digits). %b (abbreviated name, like "Jan"). %B (full name, "January").
- Day: %d (2 digits). %e (optional leading space).
- Time: %H (0-23 hour). %I (0-12, must be used with %p). %p (AM/PM indicator). %M (minutes). %S (integer seconds). %OS (real seconds). %Z (time zone as a name, e.g. America/Chicago). %z (offset from UTC, e.g. +0800).
- Non-digits: %. skips one non-digit character. %* skips any number of non-digits.

Beware of abbreviations: if you're American, note that "EST" is a Canadian time zone that does not have daylight savings time. It is not Eastern Standard Time! We'll come back to time zones later.
There are three types of date/time data that refer to an instant in time:
- A date. Tibbles print this as <date>.
- A time within a day. Tibbles print this as <time>.
- A date-time, which is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second). Tibbles print this as <dttm>.
In this chapter we are only going to focus on dates and date-times as R doesn’t have a native class for storing times. If you need one, you can use the hms package.
You should always use the simplest possible data type that works for your needs. That means if you can use a date instead of a date-time, you should. Date-times are substantially more complicated because of the need to handle time zones, which we’ll come back to at the end of the chapter.
To get the current date or date-time you can use today() or now():
library(lubridate)
## Warning: package 'lubridate' was built under R version 3.4.4
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
today()
## [1] "2019-03-24"
now()
## [1] "2019-03-24 18:24:25 EDT"
Otherwise, there are three ways you're likely to create a date/time: from a string, from individual date-time components, or from an existing date/time object.
Date/time data often comes as strings. You’ve seen one approach to parsing strings into date-times in date-times. Another approach is to use the helpers provided by lubridate.
They automatically work out the format once you specify the order of the components.
To use them, identify the order in which year, month, and day appear in your dates, then arrange “y”, “m”, and “d” in the same order. That gives you the name of the lubridate function that will parse your date. For example:
ymd("2017-01-31")
## [1] "2017-01-31"
mdy("January 31st, 2017")
## [1] "2017-01-31"
dmy("31-Jan-2017")
## [1] "2017-01-31"
These functions also take unquoted numbers. This is the most concise way to create a single date/time object, as you might need when filtering date/time data. ymd() is short and unambiguous:
ymd(20170131)
ymd() and friends create dates. To create a date-time, add an underscore and one or more of “h”, “m”, and “s” to the name of the parsing function:
ymd_hms("2017-01-31 20:11:59")
mdy_hm("01/31/2017 08:01")
Instead of a single string, sometimes you’ll have the individual components of the date-time spread across multiple columns. This is what we have in the flights data:
To create a date/time from this sort of input, use make_date() for dates, or make_datetime() for date-times:
library(nycflights13)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.4
## -- Attaching packages ------------------------------------------------------------------------------------------ tidyverse 1.2.1 --
## v ggplot2 3.1.0 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.7
## v tidyr 0.8.1 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.2.0
## Warning: package 'tidyr' was built under R version 3.4.4
## Warning: package 'dplyr' was built under R version 3.4.4
## Warning: package 'stringr' was built under R version 3.4.4
## -- Conflicts --------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date() masks base::date()
## x dplyr::filter() masks stats::filter()
## x lubridate::intersect() masks base::intersect()
## x dplyr::lag() masks stats::lag()
## x lubridate::setdiff() masks base::setdiff()
## x lubridate::union() masks base::union()
flights %>%
select(year, month, day, hour, minute) %>%
mutate(departure = make_datetime(year, month, day, hour, minute))
## Warning: package 'bindrcpp' was built under R version 3.4.4
## # A tibble: 336,776 x 6
## year month day hour minute departure
## <int> <int> <int> <dbl> <dbl> <dttm>
## 1 2013 1 1 5.00 15.0 2013-01-01 05:15:00
## 2 2013 1 1 5.00 29.0 2013-01-01 05:29:00
## 3 2013 1 1 5.00 40.0 2013-01-01 05:40:00
## 4 2013 1 1 5.00 45.0 2013-01-01 05:45:00
## 5 2013 1 1 6.00 0 2013-01-01 06:00:00
## 6 2013 1 1 5.00 58.0 2013-01-01 05:58:00
## 7 2013 1 1 6.00 0 2013-01-01 06:00:00
## 8 2013 1 1 6.00 0 2013-01-01 06:00:00
## 9 2013 1 1 6.00 0 2013-01-01 06:00:00
## 10 2013 1 1 6.00 0 2013-01-01 06:00:00
## # ... with 336,766 more rows
Let’s do the same thing for each of the four time columns in flights. The times are represented in a slightly odd format, so we use modulus arithmetic to pull out the hour and minute components. Once I’ve created the date-time variables, I focus on the variables we’ll explore in the rest of the chapter.
make_datetime_100 <- function(year, month, day, time) {
make_datetime(year, month, day, time %/% 100, time %% 100)
}
flights_dt <- flights %>%
filter(!is.na(dep_time), !is.na(arr_time)) %>%
mutate(
dep_time = make_datetime_100(year, month, day, dep_time),
arr_time = make_datetime_100(year, month, day, arr_time),
sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
) %>%
select(origin, dest, ends_with("delay"), ends_with("time"))
flights_dt
## # A tibble: 328,063 x 9
## origin dest dep_delay arr_delay dep_time
## <chr> <chr> <dbl> <dbl> <dttm>
## 1 EWR IAH 2.00 11.0 2013-01-01 05:17:00
## 2 LGA IAH 4.00 20.0 2013-01-01 05:33:00
## 3 JFK MIA 2.00 33.0 2013-01-01 05:42:00
## 4 JFK BQN -1.00 -18.0 2013-01-01 05:44:00
## 5 LGA ATL -6.00 -25.0 2013-01-01 05:54:00
## 6 EWR ORD -4.00 12.0 2013-01-01 05:54:00
## 7 EWR FLL -5.00 19.0 2013-01-01 05:55:00
## 8 LGA IAD -3.00 -14.0 2013-01-01 05:57:00
## 9 JFK MCO -3.00 - 8.00 2013-01-01 05:57:00
## 10 LGA ORD -2.00 8.00 2013-01-01 05:58:00
## # ... with 328,053 more rows, and 4 more variables: sched_dep_time <dttm>,
## # arr_time <dttm>, sched_arr_time <dttm>, air_time <dbl>
With this data, I can visualise the distribution of departure times across the year:
flights_dt %>%
ggplot(aes(dep_time)) +
geom_freqpoly(binwidth = 86400) # 86400 seconds = 1 day
Or within a single day:
flights_dt %>%
filter(dep_time < ymd(20130102)) %>%
ggplot(aes(dep_time)) +
geom_freqpoly(binwidth = 600) # 600 s = 10 minutes
You may want to switch between a date-time and a date. That’s the job of as_datetime() and as_date():
as_datetime(today())
## [1] "2019-03-24 UTC"
as_date(now())
## [1] "2019-03-24"
Sometimes you’ll get date/times as numeric offsets from the “Unix Epoch”, 1970-01-01. If the offset is in seconds, use as_datetime(); if it’s in days, use as_date().
as_datetime(60 * 60 * 10)
## [1] "1970-01-01 10:00:00 UTC"
as_date(365 * 10 + 2)
## [1] "1980-01-01"
You can pull out individual parts of the date with the accessor functions:

- year()
- month()
- mday() (day of the month)
- yday() (day of the year)
- wday() (day of the week)
- hour(), minute(), and second()
datetime <- ymd_hms("2016-07-08 12:34:56")
year(datetime)
## [1] 2016
month(datetime)
## [1] 7
mday(datetime)
## [1] 8
yday(datetime)
## [1] 190
wday(datetime)
## [1] 6
hour(datetime)
## [1] 12
minute(datetime)
## [1] 34
second(datetime)
## [1] 56
For month() and wday() you can set label = TRUE to return the abbreviated name of the month or day of the week. Set abbr = FALSE to return the full name.
month(datetime, label = TRUE)
## [1] Jul
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
wday(datetime, label = TRUE, abbr = FALSE)
## [1] Friday
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
A vector is a sequence of data elements of the same basic type.
To create a vector you use the c() function, separating each value with a comma. Remember, the data type must be the same for each value.
numeric_vector <- c(1, 10, 49)
numeric_vector
## [1] 1 10 49
character_vector <- c("a", "b", "c")
character_vector
## [1] "a" "b" "c"
boolean_vector <- c(TRUE, FALSE, TRUE)
boolean_vector
## [1] TRUE FALSE TRUE
Poker winnings from Monday to Friday
poker_vector <- c(140, -50, 20, -120, 240)
poker_vector
## [1] 140 -50 20 -120 240
Roulette winnings from Monday to Friday
roulette_vector <- c(-24, -50, 100, -350, 10)
roulette_vector
## [1] -24 -50 100 -350 10
Assign days as names of poker_vector
names(poker_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
Alternatively, you can first create a vector containing the day names:
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
Then assign the day names to both poker_vector and roulette_vector:
names(poker_vector) <- days_vector
names(roulette_vector) <- days_vector
With the names assigned, you will now see the names alongside the values:
poker_vector
## Monday Tuesday Wednesday Thursday Friday
## 140 -50 20 -120 240
roulette_vector
## Monday Tuesday Wednesday Thursday Friday
## -24 -50 100 -350 10
To check whether you realized higher total gains in poker than in roulette, we can use logical comparison operators. R recognizes the following by default:

- < for less than
- > for greater than
- <= for less than or equal to
- >= for greater than or equal to
- == for equal to each other
- != for not equal to each other
You can compare two vectors to one another using the equality and greater/less-than operators.
You can also add an equal sign to express less than or equal to, or greater than or equal to, respectively. Have a look at the following R expressions, which all evaluate to FALSE:
(1 + 2) > 4
"dog" < "Cats"
TRUE <= FALSE
Remember that for string comparison, R determines the greater than relationship based on alphabetical order. Also, keep in mind that TRUE corresponds to 1 in R, and FALSE coerces to 0 behind the scenes.
Therefore, FALSE < TRUE is TRUE.
linkedin <- c(16, 9, 13, 5, 2, 17, 14)
facebook <- c(17, 7, 5, 16, 8, 13, 14)
linkedin > 15
## [1] TRUE FALSE FALSE FALSE FALSE TRUE FALSE
linkedin <= 5
## [1] FALSE FALSE FALSE TRUE TRUE FALSE FALSE
linkedin > facebook
## [1] FALSE TRUE TRUE FALSE FALSE TRUE FALSE
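Because TRUE coerces to 1 and FALSE to 0, you can also count how many days satisfy a condition by summing the logical vector. A small sketch using the linkedin vector above:

# Count the days with more than 15 LinkedIn views
sum(linkedin > 15)
## [1] 2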
The following statements all evaluate to TRUE:
3 == (2 + 1)
## [1] TRUE
The != operator is the inverse of ==; it states that x is not equal to y:
"intermediate" != "r"
## [1] TRUE
TRUE != FALSE
## [1] TRUE
"Rchitect" != "rchitect"
## [1] TRUE
TRUE and TRUE evaluates to TRUE:
TRUE & TRUE
## [1] TRUE
FALSE or TRUE evaluates to TRUE:
FALSE | TRUE
## [1] TRUE
The & symbol is the equivalent of Excel’s AND() function; | represents OR():
last <- tail(linkedin, 1)
Is last under 5 or above 10?
last < 5 | last > 10
## [1] TRUE
To check an interval such as 3 < x < 7, you need to write it with &, such as:
3 < x & x < 7
Is last between 15 (exclusive) and 20 (inclusive)?
last > 15 & last <= 20
## [1] FALSE
5 <= 5 & 2 < 3
## [1] TRUE
3 < 4 | 7 < 6
## [1] TRUE
Like relational operators, logical operators work perfectly fine with vectors and matrices.
linkedin > 10 & facebook < 10
## [1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE
When were one or both platforms viewed at least 12 times?
linkedin >= 12 | facebook >= 12 | linkedin >= 12 & facebook >= 12
## [1] TRUE FALSE TRUE TRUE FALSE TRUE TRUE
When is facebook above 11 and linkedin at most 14?
facebook > 11 & linkedin <= 14
## [1] FALSE FALSE FALSE TRUE FALSE FALSE TRUE
The if statement enables you to set the conditions under which R outputs certain expressions:
if (condition) { expr }
medium <- "LinkedIn"
num_views <- 14
Examine the if statement for medium
if (medium == "LinkedIn") {
print("Showing LinkedIn information")
}
## [1] "Showing LinkedIn information"
You can only use an else statement in combination with an if statement. The else statement does not require a condition; its corresponding code is simply run if all of the preceding conditions in the control structure are FALSE. Here’s a recipe for its usage: if (condition) { expr1 } else { expr2 }
It’s important that the else keyword comes on the same line as the closing bracket of the if part!
if (medium == "LinkedIn") {
print("Showing LinkedIn information")
} else {
print("Unknown medium")
}
## [1] "Showing LinkedIn information"
The else if statement allows you to further customize your control structure. You can add as many else if statements as you like. Keep in mind that R ignores the remainder of the control structure once a condition has been found that is TRUE and the corresponding expressions have been executed. Here’s an overview of the syntax to freshen your memory: if (condition1) { expr1 } else if (condition2) { expr2 } else if (condition3) { expr3 } else { expr4 }
Again, it’s important that the else if keyword comes on the same line as the closing bracket of the previous part of the control construct!
if (medium == "LinkedIn") {
print("Showing LinkedIn information")
} else if (medium == "Facebook") {
print("Showing Facebook information")
} else {
print("Unknown medium")
}
## [1] "Showing LinkedIn information"
Let’s summarize what we have learned by putting it all together:
poker_vector <- c(140, -50, 20, -120, 240)
roulette_vector <- c(-24, -50, 100, -350, 10)
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(poker_vector) <- days_vector
names(roulette_vector) <- days_vector
In step order we: (1) created the poker and roulette winnings vectors, (2) created a vector of day names, and (3) assigned those names to both winnings vectors.
In addition to creating objects and naming them, we can also select data from objects that already exist.
To select data from a vector you simply specify the location of the desired observations in the vector. For example, below we select the third element from the poker_vector to get our winnings for Wednesday, the third day of the week:
(poker_wednesday <- poker_vector[3])
## Wednesday
## 20
We can also select the same data by using the observation's name instead of its location:
(poker_wednesday <- poker_vector["Wednesday"])
## Wednesday
## 20
We can also define a new variable based on a selection with multiple values:
(poker_midweek <- poker_vector[c(2,3,4)])
## Tuesday Wednesday Thursday
## -50 20 -120
We can also define a new variable based on a range of values in the selection:
roulette_selection_vector <- roulette_vector[2:5]
We can also select poker results for multiple days using names instead of the locations we used previously:
poker_start <- poker_vector[c("Monday","Tuesday","Wednesday")]
To subselect for days where we had positive winnings we can use the > logical operator:
(selection_vector <- poker_vector > 0)
## Monday Tuesday Wednesday Thursday Friday
## TRUE FALSE TRUE FALSE TRUE
In addition to selecting observations using their position and names, we can also select observations using other objects.
(poker_winning_days <- poker_vector[selection_vector])
## Monday Wednesday Friday
## 140 20 240
In this situation, what R is doing is actually a multi-step process: it first evaluates which elements of selection_vector are TRUE, and then returns the elements of poker_vector at those positions.
This is a very powerful method that will be used going forward.
In R, a matrix is a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns.
Since you are only working with rows and columns, a matrix is called two-dimensional.
You can construct a matrix in R with the matrix() function. Consider the following example:
matrix(1:9, byrow = TRUE, nrow = 3)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
Let’s break down the matrix function step-by-step:
Here, we use 1:9 which is a shortcut for c(1, 2, 3, 4, 5, 6, 7, 8, 9). We introduced this briefly in our selection section.
The argument byrow indicates that the matrix is filled by the rows. If we want the matrix to be filled by the columns, we just set byrow = FALSE (a sketch of this follows the next example).
The third argument nrow indicates that the matrix should have three rows. If we change the number of rows we will see the effect this has.
(matrix(1:9, byrow = TRUE, nrow = 9))
## [,1]
## [1,] 1
## [2,] 2
## [3,] 3
## [4,] 4
## [5,] 5
## [6,] 6
## [7,] 7
## [8,] 8
## [9,] 9
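As referenced above, here is a sketch of the same numbers filled by columns instead of by rows:

matrix(1:9, byrow = FALSE, nrow = 3)
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9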
To create a matrix we will do the following:
new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi <- c(309.306, 165.8)
box_office <- c(new_hope, empire_strikes, return_jedi)
star_wars_matrix <- matrix(box_office, byrow = TRUE, nrow = 3)
Just as we named elements in our vector, we can also create names for our matrix:
To do so we first create our naming vectors, one for the columns, the other for rows:
region <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
We can assign column names using the colnames() command:
colnames(star_wars_matrix) <- region
For row names, the procedure is the same, only the rownames() command is used:
rownames(star_wars_matrix) <- titles
If we print the star_wars_matrix we will now see all values with their names:
star_wars_matrix
## US non-US
## A New Hope 460.998 314.4
## The Empire Strikes Back 290.475 247.9
## Return of the Jedi 309.306 165.8
As it turns out, there is an easier way to complete the process of creating and naming a matrix. It can all be done in one step within the matrix() function, using the dimnames argument:
box_office <- c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8)
(star_wars_matrix <- matrix(box_office, nrow = 3, byrow = TRUE,
dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
c("US", "non-US"))))
## US non-US
## A New Hope 460.998 314.4
## The Empire Strikes Back 290.475 247.9
## Return of the Jedi 309.306 165.8
While it is handy to know how to create a matrix, it is also necessary to know how to modify an existing one.
You can add a column or multiple columns to a matrix with the cbind() function, which merges matrices and/or vectors together by column. For example: big_matrix <- cbind(matrix1, matrix2, vector1 …)
Using our Star Wars matrix, we can append the row totals to our matrix to create a grand total column:
worldwide_vector <- rowSums(star_wars_matrix)
all_wars_matrix <- cbind(star_wars_matrix,worldwide_vector)
all_wars_matrix
## US non-US worldwide_vector
## A New Hope 460.998 314.4 775.398
## The Empire Strikes Back 290.475 247.9 538.375
## Return of the Jedi 309.306 165.8 475.106
In addition to appending columns, we can also append rows to the bottom of our matrix:
sequels <- c(120, 150, 170, 190, 200, 210)
(star_wars_matrix2 <- matrix(box_office, nrow = 3, byrow = TRUE,
dimnames = list(c("Episode 1", "Episode 2", "Episode 3"),
c("US", "non-US"))))
## US non-US
## Episode 1 460.998 314.4
## Episode 2 290.475 247.9
## Episode 3 309.306 165.8
To do this we use the rbind() command, in which we list the matrices we wish to merge:
all_wars_matrix <- rbind(star_wars_matrix, star_wars_matrix2)
To calculate the total of each row, we can use the rowSums() command:
rowSums(star_wars_matrix)
## A New Hope The Empire Strikes Back Return of the Jedi
## 775.398 538.375 475.106
To calculate the total of each column, we use the colSums() command:
colSums(star_wars_matrix)
## US non-US
## 1060.779 728.100
R’s ability to deal with different data structures for comparisons does not stop at vectors. Matrices and relational operators also work together seamlessly!
Instead of in vectors (as in the previous exercise), the LinkedIn and Facebook data is now stored in a matrix called views. You can do this using the matrix() command and leveraging the previously built vectors using c().
The first row contains the LinkedIn information; the second row the Facebook information. The original vectors facebook and linkedin are still available as well.
linkedin <- c(16, 9, 13, 5, 2, 17, 14)
facebook <- c(17, 7, 5, 16, 8, 13, 14)
views <- matrix(c(linkedin, facebook), nrow = 2, byrow = TRUE)
When does views equal 13?
views == 13
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE TRUE FALSE
When is views less than or equal to 14?
views <= 14
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] FALSE TRUE TRUE TRUE TRUE FALSE TRUE
## [2,] FALSE TRUE TRUE FALSE TRUE TRUE TRUE
To practice doing mathematical operations with matrices, we can create two vectors and total them:
A_vector <- c(1, 2, 3)
B_vector <- c(4, 5, 6)
(total_vector <- c(A_vector + B_vector))
## [1] 5 7 9
Similarly, we can calculate our total winnings for the week with the command below. Note that if you surround the assignment command <- in parentheses, ( ), R will not only complete the assignment of the object but will also print the result:
(total_daily <- c(roulette_vector + poker_vector))
## Monday Tuesday Wednesday Thursday Friday
## 116 -100 120 -470 250
To calculate the total of our poker winnings only we can total our poker vector with the sum() function:
(total_poker <- sum(poker_vector))
## [1] 230
We can do the same to calculate our total roulette winnings:
(total_roulette <- sum(roulette_vector))
## [1] -314
In addition to the sum, we can also calculate the average of the daily winnings using the mean() function:
(average_roulette <- mean(roulette_vector))
## [1] -62.8
To get a grand total of the winnings for the week, we can add the two totals together:
(total_week <- total_roulette + total_poker)
## [1] -84
What’s a data frame?
You may remember from the chapter about matrices that all the elements that you put in a matrix should be of the same type. Back then, your data set on Star Wars only contained numeric elements.
When doing a market research survey, however, you often have questions such as:

- 'Are you married?' or other yes/no questions (logical)
- 'How old are you?' (numeric)
- 'What is your opinion on this product?' or other open-ended questions (character)
The output, namely the respondents’ answers to the questions formulated above, is a data set of different data types. You will often find yourself working with data sets that contain different data types instead of only one.
A data frame has the variables of a data set as columns and the observations as rows. This will be a familiar concept for those coming from different statistical software packages such as SAS or SPSS.
You construct a data frame with the data.frame() function. As arguments, you pass the vectors from before: they will become the different columns of your data frame. Because every column has the same length, the vectors you pass should also have the same length. But don’t forget that it is possible (and likely) that they contain different types of data.
Use the function data.frame() to construct a data frame. Pass the vectors name, type, diameter, rotation and rings as arguments to data.frame(), in this order. Call the resulting data frame planets_df.
First we create vectors for each column of our dataset:
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet",
"Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
With our vectors built, the last step is to combine them all into one data frame:
(planets_df <- data.frame(name, type, diameter, rotation, rings))
## name type diameter rotation rings
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE
## 2 Venus Terrestrial planet 0.949 -243.02 FALSE
## 3 Earth Terrestrial planet 1.000 1.00 FALSE
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
## 5 Jupiter Gas giant 11.209 0.41 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 8 Neptune Gas giant 3.883 0.67 TRUE
Working with large data sets is not uncommon in data analysis. When you work with (extremely) large data sets and data frames, your first task as a data analyst is to develop a clear understanding of its structure and main elements.
Therefore, it is often useful to show only a small part of the entire data set.
So how to do this in R? Well, the function head() enables you to show the first observations of a data frame.
Similarly, the function tail() prints out the last observations in your data set.
Both head() and tail() print a top line called the ‘header’, which contains the names of the different variables in your data set.
head(planets_df)
## name type diameter rotation rings
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE
## 2 Venus Terrestrial planet 0.949 -243.02 FALSE
## 3 Earth Terrestrial planet 1.000 1.00 FALSE
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
## 5 Jupiter Gas giant 11.209 0.41 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
tail(planets_df)
## name type diameter rotation rings
## 3 Earth Terrestrial planet 1.000 1.00 FALSE
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
## 5 Jupiter Gas giant 11.209 0.41 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 8 Neptune Gas giant 3.883 0.67 TRUE
Another method that is often used to get a rapid overview of your data is the function str(). The function str() shows you the structure of your data set. For a data frame it tells you:

- The total number of observations (e.g. 32 car types)
- The total number of variables (e.g. 11 car features)
- A full list of the variable names (e.g. mpg, cyl, ...)
- The data type of each variable (e.g. num)
- The first observations
Applying the str() function will often be the first thing that you do when receiving a new data set or data frame. It is a great way to get more insight in your data set before diving into the real analysis.
str(planets_df)
## 'data.frame': 8 obs. of 5 variables:
## $ name : Factor w/ 8 levels "Earth","Jupiter",..: 4 8 1 3 2 6 7 5
## $ type : Factor w/ 2 levels "Gas giant","Terrestrial planet": 2 2 2 2 1 1 1 1
## $ diameter: num 0.382 0.949 1 0.532 11.209 ...
## $ rotation: num 58.64 -243.02 1 1.03 0.41 ...
## $ rings : logi FALSE FALSE FALSE FALSE TRUE TRUE ...
Similar to vectors and matrices, you select elements from a data frame with the help of square brackets [ ]. By using a comma, you can indicate what to select from the rows and the columns respectively. For example:
my_df[1,2] selects the value at the first row and second column in my_df.
my_df[1:3,2:4] selects rows 1, 2, 3 and columns 2, 3, 4 in my_df.
Sometimes you want to select all elements of a row or column. For example, my_df[1, ] selects all elements of the first row. Let us now apply this technique on planets_df!
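As a quick illustration of the bracket notation on planets_df, selecting the full first row returns Mercury's record:

planets_df[1, ]
## name type diameter rotation rings
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE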
Instead of using numerics to select elements of a data frame, you can also use the variable names to select columns of a data frame.
Suppose you want to select the first three elements of the type column. One way to do this is to use the ":" symbol.
planets_df[1:3,2]
## [1] Terrestrial planet Terrestrial planet Terrestrial planet
## Levels: Gas giant Terrestrial planet
A possible disadvantage of this approach is that you have to know (or look up) the column number of type, which gets hard if you have a lot of variables. It is often easier to just make use of the variable name:
planets_df[1:3,"type"]
## [1] Terrestrial planet Terrestrial planet Terrestrial planet
## Levels: Gas giant Terrestrial planet
You will often want to select an entire column, namely one specific variable from a data frame. If you want to select all elements of the variable diameter, for example, both of these will do the trick:
planets_df[,3]
## [1] 0.382 0.949 1.000 0.532 11.209 9.449 4.007 3.883
planets_df[,"diameter"]
## [1] 0.382 0.949 1.000 0.532 11.209 9.449 4.007 3.883
planets_df[,c(2,4)]
## type rotation
## 1 Terrestrial planet 58.64
## 2 Terrestrial planet -243.02
## 3 Terrestrial planet 1.00
## 4 Terrestrial planet 1.03
## 5 Gas giant 0.41
## 6 Gas giant 0.43
## 7 Gas giant -0.72
## 8 Gas giant 0.67
However, there is a short-cut. If your columns have names, you can use the $ sign:
planets_df$diameter
## [1] 0.382 0.949 1.000 0.532 11.209 9.449 4.007 3.883
You probably remember from high school that some planets in our solar system have rings and others do not. Unfortunately you cannot recall their names. Could R help you out?
First, extract the rings column of planets_df as a vector:
rings_vector <- planets_df$rings
rings_vector
## [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
This means that the first four observations (or planets) do not have a ring (FALSE), but the other four do (TRUE).
However, you do not get a nice overview of the names of these planets, their diameter, etc. Let’s try to use rings_vector to select the data for the four planets with rings.
planets_df[rings_vector, ]
## name type diameter rotation rings
## 5 Jupiter Gas giant 11.209 0.41 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 8 Neptune Gas giant 3.883 0.67 TRUE
You selected a subset from a data frame (planets_df) based on whether or not a certain condition was true (rings or no rings), and you managed to pull out all relevant data. This is an extremely important concept. We will be using this method to subset our larger data for values that meet certain criteria.
Now, let us move up one level and use the function subset(). You should see the subset() function as a short-cut to do exactly the same as what you did in the previous exercises.
subset(my_df, subset = some_condition)
The first argument of subset() specifies the data set for which you want a subset. By adding the second argument, you give R the necessary information and conditions to select the correct subset.
The code below will give the exact same result as you got in the previous exercise, but this time, you didn’t need the rings_vector!
subset(planets_df, subset = rings)
## name type diameter rotation rings
## 5 Jupiter Gas giant 11.209 0.41 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 8 Neptune Gas giant 3.883 0.67 TRUE
Here are some additional, more complex examples using the mtcars dataset:
data("mtcars")
subset(mtcars, mpg>20, c("mpg", "hp"))
## mpg hp
## Mazda RX4 21.0 110
## Mazda RX4 Wag 21.0 110
## Datsun 710 22.8 93
## Hornet 4 Drive 21.4 110
## Merc 240D 24.4 62
## Merc 230 22.8 95
## Fiat 128 32.4 66
## Honda Civic 30.4 52
## Toyota Corolla 33.9 65
## Toyota Corona 21.5 97
## Fiat X1-9 27.3 66
## Porsche 914-2 26.0 91
## Lotus Europa 30.4 113
## Volvo 142E 21.4 109
subset(mtcars, mpg==max(mpg))
## mpg cyl disp hp drat wt qsec vs am gear carb
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1
subset(mtcars, mpg==max(mpg), mpg)
## mpg
## Toyota Corolla 33.9
Making and creating rankings is one of mankind’s favorite affairs. These rankings can be useful (best universities in the world), entertaining (most influential movie stars) or pointless (best 007 look-a-like).
In data analysis you can sort your data according to a certain variable in the data set. In R, this is done with the help of the function order().
order() is a function that gives you the ranked position of each element when it is applied on a variable, such as a vector for example:
a <- c(100, 10, 1000)
order(a)
## [1] 2 1 3
10, which is the second element in a, is the smallest element, so 2 comes first in the output of order(a). 100, which is the first element in a, is the second smallest element, so 1 comes second in the output of order(a).
This means we can use the output of order(a) to reshuffle our data. This leverages the same principles as were used in subsetting:
a[order(a)]
## [1] 10 100 1000
positions <- order(planets_df$diameter)
planets_df[positions,]
## name type diameter rotation rings
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
## 2 Venus Terrestrial planet 0.949 -243.02 FALSE
## 3 Earth Terrestrial planet 1.000 1.00 FALSE
## 8 Neptune Gas giant 3.883 0.67 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 5 Jupiter Gas giant 11.209 0.41 TRUE
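As a small extension, order() also accepts a decreasing argument, so the same idea sorts the planets from largest to smallest diameter:

positions_desc <- order(planets_df$diameter, decreasing = TRUE)
planets_df[positions_desc, ]  # Jupiter now appears first and Mercury last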
A list in R is similar to your to-do list at work or school: the different items on that list most likely differ in length, characteristics, and the type of activity that has to be done.
A list in R allows you to gather a variety of objects under one name (that is, the name of the list) in an ordered way. These objects can be matrices, vectors, data frames, even other lists, etc. It is not even required that these objects are related to each other in any way.
You could say that a list is some kind of super data type: you can store practically any piece of information in it!
Let us create our first list! To construct a list you use the function list(): my_list <- list(comp1, comp2 …)
The arguments to the list function are the list components. Remember, these components can be matrices, vectors, other lists, …
Just like on your to-do list, you want to avoid not knowing or remembering what the components of your list stand for.
That is why you should give names to them:
my_list <- list(name1 = your_comp1, name2 = your_comp2)
This creates a list with components that are named name1, name2, and so on. If you want to name your lists after you’ve created them, you can use the names() function as you did with vectors. The following commands are fully equivalent to the assignment above:
my_list <- list(your_comp1, your_comp2)
names(my_list) <- c("name1", "name2")
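The selection examples that follow refer to a list called shining_list, which is never actually constructed in these notes. As a sketch, with made-up component values, it could be built like this:

# Hypothetical components: a movie title, a vector of actors, and a data frame of reviews
moviename <- "The Shining"
actors <- c("Jack Nicholson", "Shelley Duvall", "Danny Lloyd")
reviews <- data.frame(scores = c(4.5, 4.0, 5.0),
                      sources = c("IMDb", "Rotten Tomatoes", "Metacritic"))
shining_list <- list(moviename = moviename, actors = actors, reviews = reviews)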
Selecting elements from a list
Your list will often be built out of numerous elements and components. Therefore, getting a single element, multiple elements, or a component out of it is not always straightforward.
One way to select a component is using the numbered position of that component. For example, to "grab" the first component of shining_list you type:
shining_list[[1]]
A quick way to check this out is typing it in the console. Important to remember: to select elements from vectors, you use single square brackets: [ ]. Don’t mix them up!
You can also refer to the names of the components, with [[ ]] or with the $ sign. Both will select the data frame representing the reviews:
shining_list[["reviews"]]
shining_list$reviews
Besides selecting components, you often need to select specific elements out of these components.
For example, with shining_list[[2]][1] you select from the second component, actors (shining_list[[2]]), the first element ([1]). When you type this in the console, you will see the answer is Jack Nicholson.
Being proud of your first list, you shared it with the members of your movie hobby club. However, one of the senior members, a guy named M. McDowell, noted that you forgot to add the release year. Given your ambitions to become next year’s president of the club, you decide to add this information to the list.
To conveniently add elements to lists you can use the c() function, that you also used to build vectors:
ext_list <- c(my_list, my_val)
This will simply extend the original list, my_list, with the component my_val. This component gets appended to the end of the list. If you want to give the new list item a name, you just add the name as you did before:
ext_list <- c(my_list, my_name = my_val)
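Applying this to the shining_list sketched earlier, adding the release year that M. McDowell asked about could look like:

# Append a named component holding the release year
shining_list_full <- c(shining_list, year = 1980)
str(shining_list_full)  # the new year component appears at the end of the list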
Now that we have discussed the functions that come with the base R environment, it is time to introduce a collection of packages known as the tidyverse. It contains the functions and datasets associated with the next series of tutorials:
install.packages("tidyverse")
You only need to install a package once, but you must load it every time you start a new session.
Once a package is installed, you can load it, along with its bundled datasets, using the library() command.
library(tidyverse)
While it is helpful to know how to construct vectors, matrices, data frames, and lists from scratch, the vast majority of the time you will be importing data from external sources. Some of these include:

- CSV files
- Excel workbooks with multiple sheets
- Relational databases
- APIs
First we will begin with one of the most common data sources, CSVs. To import CSVs we will leverage one particular package that comes with the tidyverse: readr.
In addition to read_csv(), there are additional readr functions that allow you to import other kinds of delimited files:
- read_csv2() reads semicolon-separated files (common in countries where , is used as the decimal place).
- read_tsv() reads tab-delimited files, and read_delim() reads files with any delimiter.
- read_fwf() reads fixed-width files. You can specify fields either by their widths with fwf_widths() or by their positions with fwf_positions(). read_table() reads a common variation of fixed-width files where columns are separated by white space.
- read_log() reads Apache-style log files. (But also check out webreadr, which is built on top of read_log() and provides many more helpful tools.)
The first argument to read_csv() is the most important: it’s the path to the file to read.
heights <- read_csv("data/heights.csv")
When you run read_csv() it prints out a column specification that gives the name and type of each column.
In both cases read_csv() uses the first line of the data for the column names, which is a very common convention. There are two cases where you might want to tweak this behaviour:
Sometimes there are a few lines of metadata at the top of the file. You can use skip = n to skip the first n lines; or use comment = “#” to drop all lines that start with (e.g.) #.
read_csv("The first line of metadata
The second line of metadata
x,y,z
1,2,3", skip = 2)
## # A tibble: 1 x 3
## x y z
## <int> <int> <int>
## 1 1 2 3
read_csv("# A comment I want to skip
x,y,z
1,2,3", comment = "#")
The data might not have column names. You can use col_names = FALSE to tell read_csv() not to treat the first row as headings, and instead label them sequentially from X1 to Xn:
read_csv("1,2,3\n4,5,6", col_names = FALSE)
## # A tibble: 2 x 3
## X1 X2 X3
## <int> <int> <int>
## 1 1 2 3
## 2 4 5 6
("\n" is a convenient shortcut for adding a new line. You'll learn more about it and other types of string escape in string basics.)
Alternatively you can pass col_names a character vector which will be used as the column names:
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
## # A tibble: 2 x 3
## x y z
## <int> <int> <int>
## 1 1 2 3
## 2 4 5 6
Another option that commonly needs tweaking is na: this specifies the value (or values) that are used to represent missing values in your file:
read_csv("a,b,c\n1,2,.", na = ".")
## # A tibble: 1 x 3
## a b c
## <int> <int> <chr>
## 1 1 2 <NA>
Another alternative to read_csv is data.table::fread(). It doesn’t fit quite so well into the tidyverse, but it can be quite a bit faster.
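For completeness, a minimal sketch of the data.table approach, assuming the package is installed and reusing the same (hypothetical) file path as above:

library(data.table)
heights_dt <- fread("data/heights.csv")  # returns a data.table rather than a tibble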
It was previously mentioned that readr provides an output of the data types in each column. How does readr know what kind of values we have?
readr leverages a family of parse_*() functions to identify the type of values in each column of the file. Every column starts out as character values; readr then samples the data and makes a best guess based on predefined rules. Knowing how readr parses data is helpful because there are times when it is not able to recognize values correctly. For these situations it is helpful to override the predefined rules to properly identify the data type.
Like all functions in the tidyverse, the parse_*() functions are uniform: the first argument is a character vector to parse, and the na argument specifies which strings should be treated as missing:
parse_integer(c("1", "231", ".", "456"), na = ".")
## [1] 1 231 NA 456
If parsing fails, you’ll get a warning:
x <- parse_integer(c("123", "345", "abc", "123.45"))
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 2 parsing failures.
## row # A tibble: 2 x 4 col row col expected actual expected <int> <int> <chr> <chr> actual 1 3 NA an integer abc row 2 4 NA no trailing characters .45
Failures will be missing in the output:
x
## [1] 123 345 NA NA
## attr(,"problems")
## # A tibble: 2 x 4
## row col expected actual
## <int> <int> <chr> <chr>
## 1 3 NA an integer abc
## 2 4 NA no trailing characters .45
If there are many parsing failures, you’ll need to use problems() to get the complete set. This returns a tibble, which you can then manipulate with dplyr.
problems(x)
## # A tibble: 2 x 4
## row col expected actual
## <int> <int> <chr> <chr>
## 1 3 NA an integer abc
## 2 4 NA no trailing characters .45
Using parsers is mostly a matter of understanding what’s available and how they deal with different types of input. There are eight particularly important parsers:
-parse_logical() and parse_integer() parse logicals and integers respectively. There’s basically nothing that can go wrong with these parsers so I won’t describe them here further.
-parse_double() is a strict numeric parser, and parse_number() is a flexible numeric parser. These are more complicated than you might expect because different parts of the world write numbers in different ways.
-parse_character() seems so simple that it shouldn’t be necessary. But one complication makes it quite important: character encodings.
-parse_factor() creates factors, the data structure that R uses to represent categorical variables with fixed and known values.
-parse_datetime(), parse_date(), and parse_time() allow you to parse various date & time specifications. These are the most complicated because there are so many different ways of writing dates.
The following sections describe these parsers in more detail.
It seems like it should be straightforward to parse a number, but three problems make it tricky:

- People write numbers differently in different parts of the world: some countries use . between the integer and fractional parts of a real number, while others use ,.
- Numbers are often surrounded by other characters that provide context, like "$1000" or "10%".
- Numbers often contain "grouping" characters to make them easier to read, like "1,000,000", and these grouping characters vary around the world.
readr’s default locale is US-centric, because generally R is US-centric (i.e. the documentation of base R is written in American English). An alternative approach would be to try and guess the defaults from your operating system. This is hard to do well, and, more importantly, makes your code fragile: even if it works on your computer, it might fail when you email it to a colleague in another country.
To address this problem, readr has the notion of a “locale”, an object that specifies parsing options that differ from place to place. When parsing numbers, the most important option is the character you use for the decimal mark. You can override the default value of “.” by creating a new locale and setting the decimal_mark argument:
parse_double("1.23")
## [1] 1.23
parse_double("1,23", locale = locale(decimal_mark = ","))
## [1] 1.23
parse_number() addresses the second problem: it ignores non-numeric characters before and after the number. This is particularly useful for currencies and percentages, but also works to extract numbers embedded in text.
parse_number("$100")
## [1] 100
parse_number("20%")
## [1] 20
parse_number("It cost $123.45")
## [1] 123.45
The final problem is addressed by the combination of parse_number() and the locale as parse_number() will ignore the “grouping mark”:
USD Currency:
parse_number("$123,456,789")
## [1] 123456789
Used in many parts of Europe
parse_number("123.456.789", locale = locale(grouping_mark = "."))
## [1] 123456789
Used in Switzerland
parse_number("123'456'789", locale = locale(grouping_mark = "'"))
## [1] 123456789
It seems like parse_character() should be really simple - it could just return its input. Unfortunately life isn’t so simple, as there are multiple ways to represent the same string. To understand what’s going on, we need to dive into the details of how computers represent strings. In R, we can get at the underlying representation of a string using charToRaw():
charToRaw("Hadley")
## [1] 48 61 64 6c 65 79
Each hexadecimal number represents a byte of information: 48 is H, 61 is a, and so on. The mapping from hexadecimal number to character is called the encoding, and in this case the encoding is called ASCII. ASCII does a great job of representing English characters, because it’s the American Standard Code for Information Interchange.
Things get more complicated for languages other than English.
In the early days of computing there were many competing standards for encoding non-English characters, and to correctly interpret a string you needed to know both the values and the encoding.
For example, two common encodings are Latin1 (aka ISO-8859-1, used for Western European languages) and Latin2 (aka ISO-8859-2, used for Eastern European languages).
In Latin1, the byte b1 is “±”, but in Latin2, it’s “ą”! Fortunately, today there is one standard that is supported almost everywhere: UTF-8.
UTF-8 can encode just about every character used by humans today, as well as many extra symbols (like emoji!).
readr uses UTF-8 everywhere: it assumes your data is UTF-8 encoded when you read it, and always uses it when writing.
This is a good default, but will fail for data produced by older systems that don’t understand UTF-8.
If this happens to you, your strings will look weird when you print them. Sometimes just one or two characters might be messed up; other times you’ll get complete gibberish. For example:
x1 <- "El Ni\xf1o was particularly bad this year"
x2 <- "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"
x1
## [1] "El Niño was particularly bad this year"
x2
## [1] "±ñÉ¿Í"
To fix the problem you need to specify the encoding in parse_character():
parse_character(x1, locale = locale(encoding = "Latin1"))
## [1] "El Niño was particularly bad this year"
parse_character(x2, locale = locale(encoding = "Shift-JIS"))
## [1] "<U+3053><U+3093><U+306B><U+3061><U+306F>"
How do you find the correct encoding? If you’re lucky, it’ll be included somewhere in the data documentation. Unfortunately, that’s rarely the case, so readr provides guess_encoding() to help you figure it out.
It’s not foolproof, and it works better when you have lots of text (unlike here), but it’s a reasonable place to start. Expect to try a few different encodings before you find the right one.
guess_encoding(charToRaw(x1))
## # A tibble: 2 x 2
## encoding confidence
## <chr> <dbl>
## 1 ISO-8859-1 0.460
## 2 ISO-8859-9 0.230
guess_encoding(charToRaw(x2))
## # A tibble: 1 x 2
## encoding confidence
## <chr> <dbl>
## 1 KOI8-R 0.420
The first argument to guess_encoding() can either be a path to a file, or, as in this case, a raw vector (useful if the strings are already in R).
Encodings are a rich and complex topic, and we’ve only scratched the surface here. If you’d like to learn more I’d recommend reading the detailed explanation at http://kunststube.net/encoding/.
R uses factors to represent categorical variables that have a known set of possible values. Give parse_factor() a vector of known levels to generate a warning whenever an unexpected value is present:
fruit <- c("apple", "banana")
parse_factor(c("apple", "banana", "bananana"), levels = fruit)
## Warning: 1 parsing failure.
## row # A tibble: 1 x 4 col row col expected actual expected <int> <int> <chr> <chr> actual 1 3 NA value in level set bananana
## [1] apple banana <NA>
## attr(,"problems")
## # A tibble: 1 x 4
## row col expected actual
## <int> <int> <chr> <chr>
## 1 3 NA value in level set bananana
## Levels: apple banana
But if you have many problematic entries, it’s often easier to leave as character vectors and then use the tools you’ll learn about in strings and factors to clean them up.
You pick between three parsers depending on whether you want a date (the number of days since 1970-01-01), a date-time (the number of seconds since midnight 1970-01-01), or a time (the number of seconds since midnight).
When called without any additional arguments parse_datetime() expects an ISO8601 date-time. ISO8601 is an international standard in which the components of a date are organised from biggest to smallest: year, month, day, hour, minute, second.
parse_datetime("2010-10-01T2010")
## [1] "2010-10-01 20:10:00 UTC"
parse_datetime("20101010")
## [1] "2010-10-10 UTC"
This is the most important date/time standard, and if you work with dates and times frequently, I recommend reading https://en.wikipedia.org/wiki/ISO_8601
parse_date() expects a four digit year, a - or /, the month, a - or /, then the day:
parse_date("2010-10-01")
## [1] "2010-10-01"
parse_time() expects the hour, :, minutes, optionally : and seconds, and an optional am/pm specifier:
library(hms)
##
## Attaching package: 'hms'
## The following object is masked from 'package:lubridate':
##
## hms
parse_time("01:10 am")
## 01:10:00
Base R doesn’t have a great built in class for time data, so we use the one provided in the hms package.
If these defaults don’t work for your data you can supply your own date-time format, built up of the following pieces:
- Year: %Y (4 digits). %y (2 digits; 00-69 -> 2000-2069, 70-99 -> 1970-1999).
- Month: %m (2 digits). %b (abbreviated name, like "Jan"). %B (full name, "January").
- Day: %d (2 digits). %e (optional leading space).
- Time: %H (0-23 hour). %I (0-12, must be used with %p). %p (AM/PM indicator). %M (minutes). %S (integer seconds). %OS (real seconds). %Z (time zone as a name, e.g. America/Chicago). %z (offset from UTC, e.g. +0800).
- Non-digits: %. skips one non-digit character. %* skips any number of non-digits.

Beware of abbreviations: if you're American, note that "EST" is a Canadian time zone that does not have daylight savings time. It is not Eastern Standard Time! We'll come back to time zones later.
The best way to figure out the correct format is to create a few examples in a character vector, and test with one of the parsing functions. For example:
parse_time("20:10:01")
## 20:10:01
parse_date("01/02/15", "%m/%d/%y")
## [1] "2015-01-02"
parse_date("01/02/15", "%d/%m/%y")
## [1] "2015-02-01"
parse_date("01/02/15", "%y/%m/%d")
## [1] "2001-02-15"
If you’re using %b or %B with non-English month names, you’ll need to set the lang argument to locale(). See the list of built-in languages in date_names_langs(), or if your language is not already included, create your own with date_names().
parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
## [1] "2015-01-01"
Now that you’ve learned how to parse an individual vector, it’s time to return to the beginning and explore how readr parses a file. There are two new things that you’ll learn about in this section: how readr automatically guesses the type of each column, and how to override that default specification.
readr uses a heuristic to figure out the type of each column: it reads the first 1000 rows and applies some (moderately conservative) rules to guess each column's type. You can emulate this process with a character vector using guess_parser(), which returns readr’s best guess, and parse_guess(), which uses that guess to parse the column:
guess_parser("2010-10-01")
## [1] "date"
guess_parser("15:01")
## [1] "time"
guess_parser(c("TRUE", "FALSE"))
## [1] "logical"
guess_parser(c("1", "5", "9"))
## [1] "integer"
guess_parser(c("12,352,561"))
## [1] "number"
str(parse_guess("2010-10-10"))
## Date[1:1], format: "2010-10-10"
The heuristic tries each of the following types, stopping when it finds a match:
- logical: contains only "F", "T", "FALSE", or "TRUE".
- integer: contains only numeric characters (and -).
- double: contains only valid doubles (including numbers like 4.5e-5).
- number: contains valid doubles with the grouping mark inside.
- time: matches the default time_format.
- date: matches the default date_format.
- date-time: any ISO8601 date.
If none of these rules apply, then the column will stay as a vector of strings.
These defaults don’t always work for larger files. There are two basic problems:
The first thousand rows might be a special case, and readr guesses a type that is not sufficiently general. For example, you might have a column of doubles that only contains integers in the first 1000 rows.
The column might contain a lot of missing values. If the first 1000 rows contain only NAs, readr will guess that it’s a character vector, whereas you probably want to parse it as something more specific.
readr contains a challenging CSV that illustrates both of these problems:
challenge <- read_csv(readr_example("challenge.csv"))
## Parsed with column specification:
## cols(
## x = col_integer(),
## y = col_character()
## )
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 1000 parsing failures.
## row   col  expected                actual              file
## 1001  x    no trailing characters  .23837975086644292  'C:/Users/andre/D~
## 1002  x    no trailing characters  .41167997173033655  'C:/Users/andre/D~
## 1003  x    no trailing characters  .7460716762579978   'C:/Users/andre/D~
## 1004  x    no trailing characters  .723450553836301    'C:/Users/andre/D~
## 1005  x    no trailing characters  .614524137461558    'C:/Users/andre/D~
## See problems(...) for more details.
(Note the use of readr_example(), which finds the path to one of the files included with the package.)
There are two printed outputs: the column specification generated by looking at the first 1000 rows, and the first five parsing failures. It’s always a good idea to explicitly pull out the problems(), so you can explore them in more depth:
problems(challenge)
## # A tibble: 1,000 x 5
## row col expected actual file
## <int> <chr> <chr> <chr> <chr>
## 1 1001 x no trailing characters .23837975086644292 'C:/Users/andre/~
## 2 1002 x no trailing characters .41167997173033655 'C:/Users/andre/~
## 3 1003 x no trailing characters .7460716762579978 'C:/Users/andre/~
## 4 1004 x no trailing characters .723450553836301 'C:/Users/andre/~
## 5 1005 x no trailing characters .614524137461558 'C:/Users/andre/~
## 6 1006 x no trailing characters .473980569280684 'C:/Users/andre/~
## 7 1007 x no trailing characters .5784610391128808 'C:/Users/andre/~
## 8 1008 x no trailing characters .2415937229525298 'C:/Users/andre/~
## 9 1009 x no trailing characters .11437866208143532 'C:/Users/andre/~
## 10 1010 x no trailing characters .2983446326106787 'C:/Users/andre/~
## # ... with 990 more rows
A good strategy is to work column by column until there are no problems remaining. Here we can see that there are a lot of parsing problems with the x column - there are trailing characters after the integer value. That suggests we need to use a double parser instead.
To fix the call, start by copying and pasting the column specification into your original call:
challenge <- read_csv(
readr_example("challenge.csv"),
col_types = cols(
x = col_integer(),
y = col_character()))
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 1000 parsing failures.
## row   col  expected                actual              file
## 1001  x    no trailing characters  .23837975086644292  'C:/Users/andre/D~
## 1002  x    no trailing characters  .41167997173033655  'C:/Users/andre/D~
## 1003  x    no trailing characters  .7460716762579978   'C:/Users/andre/D~
## 1004  x    no trailing characters  .723450553836301    'C:/Users/andre/D~
## 1005  x    no trailing characters  .614524137461558    'C:/Users/andre/D~
## See problems(...) for more details.
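Based on that diagnosis, the natural next step is to change x from col_integer() to col_double(), so the values with trailing decimal digits parse cleanly; the rest of the call stays the same:
challenge <- read_csv(
  readr_example("challenge.csv"),
  col_types = cols(
    x = col_double(),
    y = col_character()))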
Now that we have imported our data using the readr package, the next concept we need to discuss is the tibble. Tibbles are data frames, but slightly tweaked to work better in the tidyverse. Most other R packages use regular data frames, so you might want to coerce a data frame to a tibble. You can do that with as_tibble():
library(tidyverse)
as_tibble(iris)
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.10 3.50 1.40 0.200 setosa
## 2 4.90 3.00 1.40 0.200 setosa
## 3 4.70 3.20 1.30 0.200 setosa
## 4 4.60 3.10 1.50 0.200 setosa
## 5 5.00 3.60 1.40 0.200 setosa
## 6 5.40 3.90 1.70 0.400 setosa
## 7 4.60 3.40 1.40 0.300 setosa
## 8 5.00 3.40 1.50 0.200 setosa
## 9 4.40 2.90 1.40 0.200 setosa
## 10 4.90 3.10 1.50 0.100 setosa
## # ... with 140 more rows
Note that tibble() does much less than data.frame(): it never changes the type of the inputs (e.g. it never converts strings to factors!), it never changes the names of variables, and it never creates row names. What the tibble print method does do is show only the first 10 rows of the data frame, along with metadata describing the type of each column.
This metadata is detailed below:
-int stands for integers.
-dbl stands for doubles, or real numbers.
-chr stands for character vectors, or strings.
-dttm stands for date-times (a date + a time).
There are three other common types of variables that aren’t used in this dataset but you’ll encounter later in the book:
-lgl stands for logical, vectors that contain only TRUE or FALSE.
-fct stands for factors, which R uses to represent categorical variables with fixed possible values.
-date stands for dates.
Tibbles also print only as many columns as fit in the console window; any additional columns are listed below the output.
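As a quick illustration of the earlier claim that tibble() never converts strings to factors, here is a minimal sketch; the vectors are invented, and stringsAsFactors = TRUE is set explicitly to reproduce the old data.frame() default:
tb <- tibble(letter = c("a", "b"))
df <- data.frame(letter = c("a", "b"), stringsAsFactors = TRUE)
class(tb$letter)  # "character" - tibble() leaves the input type alone
class(df$letter)  # "factor" - data.frame() converted the strings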
Now that we have successfully imported our external data and formatted it as a tibble, the next step is to ensure it is a tidy dataset:
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” – Hadley Wickham
The tidy dataset approach is a consistent way to organise your data in R. Getting your data into this format requires some upfront work, but that work pays off in the long term.
Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the analytic questions at hand.
You can represent the same underlying data in multiple ways. However, there are three interrelated rules which make a dataset tidy:
-Each variable must have its own column.
-Each observation must have its own row.
-Each value must have its own cell.
These three rules are interrelated because it’s impossible to only satisfy two of the three. That interrelationship leads to an even simpler set of practical instructions:
-Put each dataset in a tibble.
-Put each variable in a column.
The principles of tidy data seem so obvious that you might wonder if you’ll ever encounter a dataset that isn’t tidy. Unfortunately, however, most data that you will encounter will be untidy. There are two main reasons:
1) Most people aren’t familiar with the principles of tidy data, and it’s hard to derive them yourself unless you spend a lot of time working with data.
2) Data is often organised to facilitate some use other than analysis. For example, data is often organised to make entry as easy as possible.
This means for most real analyses, you’ll need to do some tidying. The first step is always to figure out what the variables and observations are. Sometimes this is easy; other times you’ll need to consult with the people who originally generated the data. The second step is to resolve one of two common problems:
-One variable might be spread across multiple columns.
-One observation might be scattered across multiple rows.
Typically a dataset will only suffer from one of these problems; it’ll only suffer from both if you’re really unlucky! To fix these problems, you’ll need the two most important functions in tidyr: gather() and spread().
A common problem is a dataset where some of the column names are not names of variables, but values of a variable. Take table4a: the column names 1999 and 2000 represent values of the year variable, and each row represents two observations, not one.
table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
To tidy a dataset like this, we need to gather those columns into a new pair of variables. To describe that operation we need three parameters:
-The set of columns that represent values, not variables. In this example, those are the columns 1999 and 2000.
-The name of the variable whose values form the column names. I call that the key, and here it is year.
-The name of the variable whose values are spread over the cells. I call that value, and here it’s the number of cases.
Together those parameters generate the call to gather():
table4a %>%
gather(`1999`, `2000`, key = "year", value = "cases")
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
The columns to gather are specified with dplyr::select() style notation. Here there are only two columns, so we list them individually. Note that “1999” and “2000” are non-syntactic names (because they don’t start with a letter) so we have to surround them in backticks. To refresh your memory of the other ways to select columns, see select.
In the final result, the gathered columns are dropped, and we get new key and value columns. Otherwise, the relationships between the original variables are preserved.
Spreading is the opposite of gathering. You use it when an observation is scattered across multiple rows. For example, take table2: an observation is a country in a year, but each observation is spread across two rows.
table2
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
To tidy this up, we first analyse the representation in a similar way to gather(). This time, however, we only need two parameters:
The column that contains variable names, the key column. Here, it’s type.
The column that contains values from multiple variables, the value column. Here it’s count.
Once we’ve figured that out, we can use spread(), as shown programmatically below:
table2 %>%
spread(key = type, value = count)
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
As you might have guessed from the common key and value arguments, spread() and gather() are complements. gather() makes wide tables narrower and longer; spread() makes long tables shorter and wider.
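As a quick sanity check of that complementarity, gathering table4a and then spreading the result back recovers the original wide layout (up to column ordering):
table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases") %>%
  spread(key = year, value = cases)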
So far you’ve learned how to tidy table2 and table4, but not table3. table3 has a different problem: we have one column (rate) that contains two variables (cases and population). To fix this problem, we’ll need the separate() function. You’ll also learn about the complement of separate(): unite(), which you use if a single variable is spread across multiple columns.
separate() pulls apart one column into multiple columns, by splitting wherever a separator character appears. Take table3:
table3
## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
The rate column contains both cases and population variables, and we need to split it into two variables. separate() takes the name of the column to separate, and the names of the columns to separate into as shown in the code below.
table3 %>%
separate(rate, into = c("cases", "population"))
## # A tibble: 6 x 4
## country year cases population
## * <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
By default, separate() will split values wherever it sees a non-alphanumeric character (i.e. a character that isn’t a number or letter). For example, in the code above, separate() split the values of rate at the forward slash characters. If you wish to use a specific character to separate a column, you can pass the character to the sep argument of separate(). For example, we could rewrite the code above as:
table3 %>%
separate(rate, into = c("cases", "population"), sep = "/")
## # A tibble: 6 x 4
## country year cases population
## * <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Look carefully at the column types: you’ll notice that cases and population are character columns. This is the default behaviour in separate(): it leaves the type of the column as is. Here, however, it’s not very useful as those really are numbers. We can ask separate() to try and convert to better types using convert = TRUE:
table3 %>%
separate(rate, into = c("cases", "population"), convert = TRUE)
## # A tibble: 6 x 4
## country year cases population
## * <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
You can also pass a vector of integers to sep. separate() will interpret the integers as positions to split at. Positive values start at 1 on the far left of the strings; negative values start at -1 on the far right of the strings. When using integers to separate strings, the length of sep should be one less than the number of names in into.
You can use this arrangement to separate the last two digits of each year. This makes the data less tidy, but it is useful in other cases, as you’ll see in a little bit.
table3 %>%
separate(year, into = c("century", "year"), sep = 2)
## # A tibble: 6 x 4
## country century year rate
## * <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
unite() is the inverse of separate(): it combines multiple columns into a single column. You’ll need it much less frequently than separate(), but it’s still a useful tool to have in your back pocket.
We can use unite() to rejoin the century and year columns that we created in the last example. That data is saved as tidyr::table5. unite() takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in dplyr::select() style:
table5 %>%
unite(new, century, year)
## # A tibble: 6 x 3
## country new rate
## <chr> <chr> <chr>
## 1 Afghanistan 19_99 745/19987071
## 2 Afghanistan 20_00 2666/20595360
## 3 Brazil 19_99 37737/172006362
## 4 Brazil 20_00 80488/174504898
## 5 China 19_99 212258/1272915272
## 6 China 20_00 213766/1280428583
In this case we also need to use the sep argument. The default will place an underscore (_) between the values from different columns. Here we don’t want any separator so we use “”:
table5 %>%
unite(new, century, year, sep = "")
## # A tibble: 6 x 3
## country new rate
## <chr> <chr> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
It’s rare that a data analysis involves only a single table of data. Typically you have many tables of data, and you must combine them to answer the questions that you’re interested in. Collectively, multiple tables of data are called relational data because it is the relations, not just the individual datasets, that are important.
Relations are always defined between a pair of tables. All other relations are built up from this simple idea: the relations of three or more tables are always a property of the relations between each pair.
Sometimes both elements of a pair can be the same table! This is needed if, for example, you have a table of people, and each person has a reference to their parents.
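Here is a minimal sketch of such a self-join, using left_join() (introduced below) and a hypothetical people table; the table, its columns, and the parent_name label are invented for illustration:
people <- tribble(
  ~id, ~name,    ~parent_id,
  1,   "Ayesha", NA,
  2,   "Ben",    1,
  3,   "Chloe",  1
)

# Join the table to itself: match each person's parent_id to another row's id,
# renaming the looked-up name so it doesn't clash with the child's name column.
people %>%
  left_join(select(people, id, parent_name = name), by = c("parent_id" = "id"))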
To work with relational data you need verbs that work with pairs of tables. There are three families of verbs designed to work with relational data:
Mutating joins, which add new variables to one data frame from matching observations in another.
Filtering joins, which filter observations from one data frame based on whether or not they match an observation in the other table.
Set operations, which treat observations as if they were set elements.
The most common place to find relational data is in a relational database management system (or RDBMS), a term that encompasses almost all modern databases. If you’ve used a database before, you’ve almost certainly used SQL. If so, you should find the concepts in this chapter familiar, although their expression in dplyr is a little different.
Generally, dplyr is a little easier to use than SQL because dplyr is specialised to do data analysis: it makes common data analysis operations easier, at the expense of making it more difficult to do other things that aren’t commonly needed for data analysis.
library(nycflights13)
We will use the nycflights13 package to learn about relational data. nycflights13 contains four tibbles that are related to the flights table that you used in data transformation:
-airlines lets you look up the full carrier name from its abbreviated code:
head(airlines)
## # A tibble: 6 x 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
-airports gives information about each airport, identified by the faa airport code
head(airports)
## # A tibble: 6 x 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <int> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5.00 A Amer~
## 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6.00 A Amer~
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6.00 A Amer~
## 4 06N Randall Airport 41.4 -74.4 523 -5.00 A Amer~
## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5.00 A Amer~
## 6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5.00 A Amer~
-planes gives information about each plane, identified by its tailnum:
head(planes)
## # A tibble: 6 x 9
## tailnum year type manufacturer model engines seats speed engine
## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
## 1 N10156 2004 Fixed win~ EMBRAER EMB-1~ 2 55 NA Turbo~
## 2 N102UW 1998 Fixed win~ AIRBUS INDUS~ A320-~ 2 182 NA Turbo~
## 3 N103US 1999 Fixed win~ AIRBUS INDUS~ A320-~ 2 182 NA Turbo~
## 4 N104UW 1999 Fixed win~ AIRBUS INDUS~ A320-~ 2 182 NA Turbo~
## 5 N10575 2002 Fixed win~ EMBRAER EMB-1~ 2 55 NA Turbo~
## 6 N105UW 1999 Fixed win~ AIRBUS INDUS~ A320-~ 2 182 NA Turbo~
-weather gives the weather at each NYC airport for each hour:
head(weather)
## # A tibble: 6 x 15
## origin year month day hour temp dewp humid wind_dir wind_speed
## <chr> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 2013 1.00 1 0 37.0 21.9 54.0 230 10.4
## 2 EWR 2013 1.00 1 1 37.0 21.9 54.0 230 13.8
## 3 EWR 2013 1.00 1 2 37.9 21.9 52.1 230 12.7
## 4 EWR 2013 1.00 1 3 37.9 23.0 54.5 230 13.8
## 5 EWR 2013 1.00 1 4 37.9 24.1 57.0 240 15.0
## 6 EWR 2013 1.00 1 6 39.0 26.1 59.4 270 10.4
## # ... with 5 more variables: wind_gust <dbl>, precip <dbl>,
## # pressure <dbl>, visib <dbl>, time_hour <dttm>
You don’t need to understand the entire dataset; you just need to understand the chain of relations between the tables that you are interested in.
For nycflights13:
-flights connects to planes via a single variable, tailnum.
-flights connects to airlines through the carrier variable.
-flights connects to airports in two ways: via the origin and dest variables.
-flights connects to weather via origin (the location), and year, month, day and hour (the time).
The variables used to connect each pair of tables are called keys. A key is a variable (or set of variables) that uniquely identifies an observation. In simple cases, a single variable is sufficient to identify an observation. For example, each plane is uniquely identified by its tailnum. In other cases, multiple variables may be needed. For example, to identify an observation in weather you need five variables: year, month, day, hour, and origin.
There are two types of keys:
A primary key uniquely identifies an observation in its own table. For example, planes$tailnum is a primary key because it uniquely identifies each plane in the planes table.
A foreign key uniquely identifies an observation in another table. For example, flights$tailnum is a foreign key because it appears in the flights table, where it matches each flight to a unique plane.
A variable can be both a primary key and a foreign key. For example, origin is part of the weather primary key, and is also a foreign key for the airport table.
Once you’ve identified the primary keys in your tables, it’s good practice to verify that they do indeed uniquely identify each observation. One way to do that is to count() the primary keys and look for entries where n is greater than one:
planes %>%
count(tailnum) %>%
filter(n > 1)
## # A tibble: 0 x 2
## # ... with 2 variables: tailnum <chr>, n <int>
weather %>%
count(year, month, day, hour, origin) %>%
filter(n > 1)
## # A tibble: 0 x 6
## # ... with 6 variables: year <dbl>, month <dbl>, day <int>, hour <int>,
## # origin <chr>, n <int>
Sometimes a table doesn’t have an explicit primary key: each row is an observation, but no combination of variables reliably identifies it. For example, what’s the primary key in the flights table? You might think it would be the date plus the flight or tail number, but neither of those are unique:
flights %>%
count(year, month, day, flight) %>%
filter(n > 1)
## # A tibble: 29,768 x 5
## year month day flight n
## <int> <int> <int> <int> <int>
## 1 2013 1 1 1 2
## 2 2013 1 1 3 2
## 3 2013 1 1 4 2
## 4 2013 1 1 11 3
## 5 2013 1 1 15 2
## 6 2013 1 1 21 2
## 7 2013 1 1 27 4
## 8 2013 1 1 31 2
## 9 2013 1 1 32 2
## 10 2013 1 1 35 2
## # ... with 29,758 more rows
flights %>%
count(year, month, day, tailnum) %>%
filter(n > 1)
## # A tibble: 64,928 x 5
## year month day tailnum n
## <int> <int> <int> <chr> <int>
## 1 2013 1 1 N0EGMQ 2
## 2 2013 1 1 N11189 2
## 3 2013 1 1 N11536 2
## 4 2013 1 1 N11544 3
## 5 2013 1 1 N11551 2
## 6 2013 1 1 N12540 2
## 7 2013 1 1 N12567 2
## 8 2013 1 1 N13123 2
## 9 2013 1 1 N13538 3
## 10 2013 1 1 N13566 3
## # ... with 64,918 more rows
When starting to work with this data, I had naively assumed that each flight number would be only used once per day: that would make it much easier to communicate problems with a specific flight. Unfortunately that is not the case!
If a table lacks a primary key, it’s sometimes useful to add one with mutate() and row_number(). That makes it easier to match observations if you’ve done some filtering and want to check back in with the original data. This is called a surrogate key.
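A minimal sketch of that idea (flight_id is just an illustrative name for the surrogate key):
flights %>%
  mutate(flight_id = row_number()) %>%
  select(flight_id, everything())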
A primary key and the corresponding foreign key in another table form a relation. Relations are typically one-to-many. For example, each flight has one plane, but each plane has many flights.
In other data, you’ll occasionally see a 1-to-1 relationship. You can think of this as a special case of 1-to-many.
You can model many-to-many relations with a many-to-1 relation plus a 1-to-many relation.
For example, in this data there’s a many-to-many relationship between airlines and airports: each airline flies to many airports; each airport hosts many airlines.
The first tool we’ll look at for combining a pair of tables is the mutating join. A mutating join allows you to combine variables from two tables. It first matches observations by their keys, then copies across variables from one table to the other.
Like mutate(), the join functions add variables to the right, so if you have a lot of variables already, the new variables won’t get printed out. For these examples, we’ll make it easier to see what’s going on by creating a narrower dataset:
flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
flights2
## # A tibble: 336,776 x 8
## year month day hour origin dest tailnum carrier
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
## 1 2013 1 1 5.00 EWR IAH N14228 UA
## 2 2013 1 1 5.00 LGA IAH N24211 UA
## 3 2013 1 1 5.00 JFK MIA N619AA AA
## 4 2013 1 1 5.00 JFK BQN N804JB B6
## 5 2013 1 1 6.00 LGA ATL N668DN DL
## 6 2013 1 1 5.00 EWR ORD N39463 UA
## 7 2013 1 1 6.00 EWR FLL N516JB B6
## 8 2013 1 1 6.00 LGA IAD N829AS EV
## 9 2013 1 1 6.00 JFK MCO N593JB B6
## 10 2013 1 1 6.00 LGA ORD N3ALAA AA
## # ... with 336,766 more rows
Imagine you want to add the full airline name to the flights2 data. You can combine the airlines and flights2 data frames with left_join():
flights2 %>%
select(-origin, -dest) %>%
left_join(airlines, by = "carrier")
## # A tibble: 336,776 x 7
## year month day hour tailnum carrier name
## <int> <int> <int> <dbl> <chr> <chr> <chr>
## 1 2013 1 1 5.00 N14228 UA United Air Lines Inc.
## 2 2013 1 1 5.00 N24211 UA United Air Lines Inc.
## 3 2013 1 1 5.00 N619AA AA American Airlines Inc.
## 4 2013 1 1 5.00 N804JB B6 JetBlue Airways
## 5 2013 1 1 6.00 N668DN DL Delta Air Lines Inc.
## 6 2013 1 1 5.00 N39463 UA United Air Lines Inc.
## 7 2013 1 1 6.00 N516JB B6 JetBlue Airways
## 8 2013 1 1 6.00 N829AS EV ExpressJet Airlines Inc.
## 9 2013 1 1 6.00 N593JB B6 JetBlue Airways
## 10 2013 1 1 6.00 N3ALAA AA American Airlines Inc.
## # ... with 336,766 more rows
The result of joining airlines to flights2 is an additional variable: name. This is why I call this type of join a mutating join. In this case, you could have got to the same place using mutate() and R’s base subsetting:
flights2 %>%
select(-origin, -dest) %>%
mutate(name = airlines$name[match(carrier, airlines$carrier)])
## # A tibble: 336,776 x 7
## year month day hour tailnum carrier name
## <int> <int> <int> <dbl> <chr> <chr> <chr>
## 1 2013 1 1 5.00 N14228 UA United Air Lines Inc.
## 2 2013 1 1 5.00 N24211 UA United Air Lines Inc.
## 3 2013 1 1 5.00 N619AA AA American Airlines Inc.
## 4 2013 1 1 5.00 N804JB B6 JetBlue Airways
## 5 2013 1 1 6.00 N668DN DL Delta Air Lines Inc.
## 6 2013 1 1 5.00 N39463 UA United Air Lines Inc.
## 7 2013 1 1 6.00 N516JB B6 JetBlue Airways
## 8 2013 1 1 6.00 N829AS EV ExpressJet Airlines Inc.
## 9 2013 1 1 6.00 N593JB B6 JetBlue Airways
## 10 2013 1 1 6.00 N3ALAA AA American Airlines Inc.
## # ... with 336,766 more rows
But this is hard to generalise when you need to match multiple variables, and takes close reading to figure out the overall intent.
The following sections explain, in detail, how mutating joins work. You’ll start by learning a useful visual representation of joins. We’ll then use that to explain the four mutating join functions: the inner join, and the three outer joins. When working with real data, keys don’t always uniquely identify observations, so next we’ll talk about what happens when there isn’t a unique match. Finally, you’ll learn how to tell dplyr which variables are the keys for a given join.
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
3, "x3"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2",
4, "y3"
)
The simplest type of join is the inner join. An inner join matches pairs of observations whenever their keys are equal:
(To be precise, this is an inner equijoin because the keys are matched using the equality operator. Since most joins are equijoins we usually drop that specification.)
The output of an inner join is a new data frame that contains the key, the x values, and the y values. We use by to tell dplyr which variable is the key:
x %>%
inner_join(y, by = "key")
## # A tibble: 2 x 3
## key val_x val_y
## <dbl> <chr> <chr>
## 1 1.00 x1 y1
## 2 2.00 x2 y2
The most important property of an inner join is that unmatched rows are not included in the result. This means that inner joins are usually not appropriate for use in analysis because it’s too easy to lose observations.
An inner join keeps observations that appear in both tables. An outer join keeps observations that appear in at least one of the tables. There are three types of outer joins: a left join keeps all observations in x, a right join keeps all observations in y, and a full join keeps all observations in both x and y.
These joins work by adding an additional “virtual” observation to each table. This observation has a key that always matches (if no other key matches), and a value filled with NA.
The most commonly used join is the left join: you use this whenever you look up additional data from another table, because it preserves the original observations even when there isn’t a match. The left join should be your default join: use it unless you have a strong reason to prefer one of the others.
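Using the x and y tables defined above (key 3 appears only in x and key 4 only in y), you can compare the three outer joins directly:
left_join(x, y, by = "key")   # keeps all rows of x; key 3 gets an NA for val_y
right_join(x, y, by = "key")  # keeps all rows of y; key 4 gets an NA for val_x
full_join(x, y, by = "key")   # keeps all rows of both; keys 3 and 4 each pick up an NA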
So far all our examples have assumed that the keys are unique. But that’s not always the case. This section explains what happens when the keys are not unique. There are two possibilities:
-One table has duplicate keys. This is useful when you want to add in additional information, as there is typically a one-to-many relationship. In the first example below, key is a primary key in y and a foreign key in x.
-Both tables have duplicate keys. This is usually an error, because in neither table do the keys uniquely identify an observation. When you join duplicated keys, you get all possible combinations, as the second example shows.
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
2, "x3",
1, "x4"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2"
)
left_join(x, y, by = "key")
## # A tibble: 4 x 3
## key val_x val_y
## <dbl> <chr> <chr>
## 1 1.00 x1 y1
## 2 2.00 x2 y2
## 3 2.00 x3 y2
## 4 1.00 x4 y1
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
2, "x3",
3, "x4"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2",
2, "y3",
3, "y4"
)
left_join(x, y, by = "key")
## # A tibble: 6 x 3
## key val_x val_y
## <dbl> <chr> <chr>
## 1 1.00 x1 y1
## 2 2.00 x2 y2
## 3 2.00 x2 y3
## 4 2.00 x3 y2
## 5 2.00 x3 y3
## 6 3.00 x4 y4
So far, the pairs of tables have always been joined by a single variable, and that variable has the same name in both tables. That constraint was encoded by by = "key". You can use other values for by to connect the tables in other ways: the default, by = NULL, uses all variables that appear in both tables (a natural join, as the flights2 and weather example below shows), while a character vector such as by = "tailnum" joins by that named variable only:
flights2 %>%
left_join(weather)
## Joining, by = c("year", "month", "day", "hour", "origin")
## # A tibble: 336,776 x 18
## year month day hour origin dest tailnum carrier temp dewp humid
## <dbl> <dbl> <int> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2013 1.00 1 5.00 EWR IAH N14228 UA NA NA NA
## 2 2013 1.00 1 5.00 LGA IAH N24211 UA NA NA NA
## 3 2013 1.00 1 5.00 JFK MIA N619AA AA NA NA NA
## 4 2013 1.00 1 5.00 JFK BQN N804JB B6 NA NA NA
## 5 2013 1.00 1 6.00 LGA ATL N668DN DL 39.9 26.1 57.3
## 6 2013 1.00 1 5.00 EWR ORD N39463 UA NA NA NA
## 7 2013 1.00 1 6.00 EWR FLL N516JB B6 39.0 26.1 59.4
## 8 2013 1.00 1 6.00 LGA IAD N829AS EV 39.9 26.1 57.3
## 9 2013 1.00 1 6.00 JFK MCO N593JB B6 39.0 26.1 59.4
## 10 2013 1.00 1 6.00 LGA ORD N3ALAA AA 39.9 26.1 57.3
## # ... with 336,766 more rows, and 7 more variables: wind_dir <dbl>,
## # wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## # visib <dbl>, time_hour <dttm>
flights2 %>%
left_join(planes, by = "tailnum")
## # A tibble: 336,776 x 16
## year.x month day hour origin dest tailnum carrier year.y type
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <int> <chr>
## 1 2013 1 1 5.00 EWR IAH N14228 UA 1999 Fixed win~
## 2 2013 1 1 5.00 LGA IAH N24211 UA 1998 Fixed win~
## 3 2013 1 1 5.00 JFK MIA N619AA AA 1990 Fixed win~
## 4 2013 1 1 5.00 JFK BQN N804JB B6 2012 Fixed win~
## 5 2013 1 1 6.00 LGA ATL N668DN DL 1991 Fixed win~
## 6 2013 1 1 5.00 EWR ORD N39463 UA 2012 Fixed win~
## 7 2013 1 1 6.00 EWR FLL N516JB B6 2000 Fixed win~
## 8 2013 1 1 6.00 LGA IAD N829AS EV 1998 Fixed win~
## 9 2013 1 1 6.00 JFK MCO N593JB B6 2004 Fixed win~
## 10 2013 1 1 6.00 LGA ORD N3ALAA AA NA <NA>
## # ... with 336,766 more rows, and 6 more variables: manufacturer <chr>,
## # model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>
For example, if we want to draw a map we need to combine the flights data with the airports data which contains the location (lat and lon) of each airport. Each flight has an origin and destination airport, so we need to specify which one we want to join to:
flights2 %>%
left_join(airports, c("dest" = "faa"))
## # A tibble: 336,776 x 15
## year month day hour origin dest tailnum carrier name lat lon
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 2013 1 1 5.00 EWR IAH N14228 UA Georg~ 30.0 -95.3
## 2 2013 1 1 5.00 LGA IAH N24211 UA Georg~ 30.0 -95.3
## 3 2013 1 1 5.00 JFK MIA N619AA AA Miami~ 25.8 -80.3
## 4 2013 1 1 5.00 JFK BQN N804JB B6 <NA> NA NA
## 5 2013 1 1 6.00 LGA ATL N668DN DL Harts~ 33.6 -84.4
## 6 2013 1 1 5.00 EWR ORD N39463 UA Chica~ 42.0 -87.9
## 7 2013 1 1 6.00 EWR FLL N516JB B6 Fort ~ 26.1 -80.2
## 8 2013 1 1 6.00 LGA IAD N829AS EV Washi~ 38.9 -77.5
## 9 2013 1 1 6.00 JFK MCO N593JB B6 Orlan~ 28.4 -81.3
## 10 2013 1 1 6.00 LGA ORD N3ALAA AA Chica~ 42.0 -87.9
## # ... with 336,766 more rows, and 4 more variables: alt <int>, tz <dbl>,
## # dst <chr>, tzone <chr>
Other implementations
base::merge() can perform all four types of mutating join:

dplyr               merge
inner_join(x, y)    merge(x, y)
left_join(x, y)     merge(x, y, all.x = TRUE)
right_join(x, y)    merge(x, y, all.y = TRUE)
full_join(x, y)     merge(x, y, all.x = TRUE, all.y = TRUE)
The advantage of the specific dplyr verbs is that they convey the intent of your code more clearly: the difference between the joins is really important but concealed in the arguments of merge(). dplyr’s joins are also considerably faster and don’t mess with the order of the rows.
SQL is the inspiration for dplyr’s conventions, so the translation is straightforward:
dplyr                         SQL
inner_join(x, y, by = "z")    SELECT * FROM x INNER JOIN y USING (z)
left_join(x, y, by = "z")     SELECT * FROM x LEFT OUTER JOIN y USING (z)
right_join(x, y, by = "z")    SELECT * FROM x RIGHT OUTER JOIN y USING (z)
full_join(x, y, by = "z")     SELECT * FROM x FULL OUTER JOIN y USING (z)
Note that “INNER” and “OUTER” are optional, and often omitted.
Joining different variables between the tables, e.g. inner_join(x, y, by = c("a" = "b")), uses a slightly different syntax in SQL: SELECT * FROM x INNER JOIN y ON x.a = y.b. As this syntax suggests, SQL supports a wider range of join types than dplyr because you can connect the tables using constraints other than equality (sometimes called non-equijoins).
Filtering joins
Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:
-semi_join(x, y) keeps all observations in x that have a match in y.
-anti_join(x, y) drops all observations in x that have a match in y.
Semi-joins are useful for matching filtered summary tables back to the original rows. For example, imagine you’ve found the top ten most popular destinations:
top_dest <- flights %>%
count(dest, sort = TRUE) %>%
head(10)
top_dest
## # A tibble: 10 x 2
## dest n
## <chr> <int>
## 1 ORD 17283
## 2 ATL 17215
## 3 LAX 16174
## 4 BOS 15508
## 5 MCO 14082
## 6 CLT 14064
## 7 SFO 13331
## 8 FLL 12055
## 9 MIA 11728
## 10 DCA 9705
Now you want to find each flight that went to one of those destinations. You could construct a filter yourself:
flights %>%
filter(dest %in% top_dest$dest)
## # A tibble: 141,145 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 542 540 2.00 923
## 2 2013 1 1 554 600 -6.00 812
## 3 2013 1 1 554 558 -4.00 740
## 4 2013 1 1 555 600 -5.00 913
## 5 2013 1 1 557 600 -3.00 838
## 6 2013 1 1 558 600 -2.00 753
## 7 2013 1 1 558 600 -2.00 924
## 8 2013 1 1 558 600 -2.00 923
## 9 2013 1 1 559 559 0 702
## 10 2013 1 1 600 600 0 851
## # ... with 141,135 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
But it’s difficult to extend that approach to multiple variables. For example, imagine that you’d found the 10 days with highest average delays. How would you construct the filter statement that used year, month, and day to match it back to flights?
Instead you can use a semi-join, which connects the two tables like a mutating join, but instead of adding new columns, only keeps the rows in x that have a match in y:
flights %>%
semi_join(top_dest)
## Joining, by = "dest"
## # A tibble: 141,145 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 542 540 2.00 923
## 2 2013 1 1 554 600 -6.00 812
## 3 2013 1 1 554 558 -4.00 740
## 4 2013 1 1 555 600 -5.00 913
## 5 2013 1 1 557 600 -3.00 838
## 6 2013 1 1 558 600 -2.00 753
## 7 2013 1 1 558 600 -2.00 924
## 8 2013 1 1 558 600 -2.00 923
## 9 2013 1 1 559 559 0 702
## 10 2013 1 1 600 600 0 851
## # ... with 141,135 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
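Returning to the earlier question about the 10 days with the highest average delays, a semi-join handles that case in exactly the same way. Here is a sketch; the object name top_days and the use of dep_delay as the delay measure are choices made for illustration:
top_days <- flights %>%
  group_by(year, month, day) %>%
  summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ungroup() %>%
  arrange(desc(avg_delay)) %>%
  head(10)

flights %>%
  semi_join(top_days, by = c("year", "month", "day"))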
The inverse of a semi-join is an anti-join. An anti-join keeps the rows that don’t have a match.
Anti-joins are useful for diagnosing join mismatches. For example, when connecting flights and planes, you might be interested to know that there are many flights that don’t have a match in planes:
flights %>%
anti_join(planes, by = "tailnum") %>%
count(tailnum, sort = TRUE)
## # A tibble: 722 x 2
## tailnum n
## <chr> <int>
## 1 <NA> 2512
## 2 N725MQ 575
## 3 N722MQ 513
## 4 N723MQ 507
## 5 N713MQ 483
## 6 N735MQ 396
## 7 N0EGMQ 371
## 8 N534MQ 364
## 9 N542MQ 363
## 10 N531MQ 349
## # ... with 712 more rows
The data you’ve been working with in this chapter has been cleaned up so that you’ll have as few problems as possible. Your own data is unlikely to be so nice, so there are a few things that you should do with your own data to make your joins go smoothly.
If you just look for variables without thinking about what they mean, you might get (un)lucky and find a combination that’s unique in your current data but the relationship might not be true in general.
For example, the altitude and longitude uniquely identify each airport, but they are not good identifiers!
airports %>% count(alt, lon) %>% filter(n > 1)
## # A tibble: 0 x 3
## # ... with 3 variables: alt <int>, lon <dbl>, n <int>
Check that none of the variables in the primary key are missing. If a value is missing then it can’t identify an observation!
Check that your foreign keys match primary keys in another table. The best way to do this is with an anti_join(). It’s common for keys not to match because of data entry errors. Fixing these is often a lot of work.
If you do have missing keys, you’ll need to be thoughtful about your use of inner vs. outer joins, carefully considering whether or not you want to drop rows that don’t have a match.
Be aware that simply checking the number of rows before and after the join is not sufficient to ensure that your join has gone smoothly. If you have an inner join with duplicate keys in both tables, you might get unlucky as the number of dropped rows might exactly equal the number of duplicated rows!
The final family of two-table verbs is the set operations. Generally, I use these the least frequently, but they are occasionally useful when you want to break a single complex filter into simpler pieces. All these operations work with a complete row, comparing the values of every variable. They expect the x and y inputs to have the same variables, and treat the observations like sets:
Given this simple data:
df1 <- tribble(
~x, ~y,
1, 1,
2, 1
)
df2 <- tribble(
~x, ~y,
1, 1,
1, 2
)
The four possibilities are:
intersect(df1, df2)
## # A tibble: 1 x 2
## x y
## <dbl> <dbl>
## 1 1.00 1.00
union(df1, df2)
## # A tibble: 3 x 2
## x y
## <dbl> <dbl>
## 1 1.00 2.00
## 2 2.00 1.00
## 3 1.00 1.00
setdiff(df1, df2)
## # A tibble: 1 x 2
## x y
## <dbl> <dbl>
## 1 2.00 1.00
setdiff(df2, df1)
## # A tibble: 1 x 2
## x y
## <dbl> <dbl>
## 1 1.00 2.00
dplyr is a very powerful data manipulation package. Similar to SQL, it provides functions to select, filter, group by, summarise, and mutate data. We will be leveraging the flights data frame, which contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics and is documented in ?flights. You can get additional information by using ? before a function or dataset name. You can also click the Packages tab in RStudio.
library(nycflights13)
There are five key dplyr functions that allow you to solve the vast majority of your data manipulation challenges:
-Pick observations by their values (filter()).
-Reorder the rows (arrange()).
-Pick variables by their names (select()).
-Create new variables with functions of existing variables (mutate()).
-Collapse many values down to a single summary (summarise()).
These can all be used in conjunction with group_by() which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. These six functions provide the verbs for a language of data manipulation.
All verbs work similarly:
The first argument is a data frame.
The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).
The result is a new data frame.
filter() allows you to subset observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame. For example, we can select all flights on January 1st with:
filter(flights, month == 1, day == 1)
## # A tibble: 842 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2.00 830
## 2 2013 1 1 533 529 4.00 850
## 3 2013 1 1 542 540 2.00 923
## 4 2013 1 1 544 545 -1.00 1004
## 5 2013 1 1 554 600 -6.00 812
## 6 2013 1 1 554 558 -4.00 740
## 7 2013 1 1 555 600 -5.00 913
## 8 2013 1 1 557 600 -3.00 709
## 9 2013 1 1 557 600 -3.00 838
## 10 2013 1 1 558 600 -2.00 753
## # ... with 832 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
When you run that line of code, dplyr executes the filtering operation and returns a new data frame. dplyr functions never modify their inputs, so if you want to save the result, you’ll need to use the assignment operator, “<-”:
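For example (jan1 is just an illustrative object name):
jan1 <- filter(flights, month == 1, day == 1)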
R either prints out the results, or saves them to a variable. If you want to do both, you can wrap the assignment in parentheses:
(dec25 <- filter(flights, month == 12, day == 25))
## # A tibble: 719 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 12 25 456 500 -4.00 649
## 2 2013 12 25 524 515 9.00 805
## 3 2013 12 25 542 540 2.00 832
## 4 2013 12 25 546 550 -4.00 1022
## 5 2013 12 25 556 600 -4.00 730
## 6 2013 12 25 557 600 -3.00 743
## 7 2013 12 25 557 600 -3.00 818
## 8 2013 12 25 559 600 -1.00 855
## 9 2013 12 25 559 600 -1.00 849
## 10 2013 12 25 600 600 0 850
## # ... with 709 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
To use filtering effectively, you have to know how to select the observations that you want using the comparison operators. R provides the standard suite: >, >=, <, <=, != (not equal), and == (equal).
When you’re starting out with R, the easiest mistake to make is to use = instead of == when testing for equality. If you do, dplyr will stop with an informative error suggesting that you use ==. With the correct operator, the filter works as expected:
filter(flights, month == 1)
## # A tibble: 27,004 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2.00 830
## 2 2013 1 1 533 529 4.00 850
## 3 2013 1 1 542 540 2.00 923
## 4 2013 1 1 544 545 -1.00 1004
## 5 2013 1 1 554 600 -6.00 812
## 6 2013 1 1 554 558 -4.00 740
## 7 2013 1 1 555 600 -5.00 913
## 8 2013 1 1 557 600 -3.00 709
## 9 2013 1 1 557 600 -3.00 838
## 10 2013 1 1 558 600 -2.00 753
## # ... with 26,994 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
There’s another common problem you might encounter when using ==: floating point numbers. These results might surprise you!
sqrt(2) ^ 2 == 2
## [1] FALSE
Computers use finite precision arithmetic (they obviously can’t store an infinite number of digits!) so remember that every number you see is an approximation. Instead of relying on ==, use near():
near(sqrt(2) ^ 2, 2)
## [1] TRUE
Multiple arguments to filter() are combined with “and”: every expression must be true in order for a row to be included in the output. For other types of combinations, you’ll need to use Boolean operators yourself: & is “and”, | is “or”, and ! is “not”.
The following code finds all flights that departed in November or December:
filter(flights, month == 11 | month == 12)
## # A tibble: 55,403 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 11 1 5 2359 6.00 352
## 2 2013 11 1 35 2250 105 123
## 3 2013 11 1 455 500 - 5.00 641
## 4 2013 11 1 539 545 - 6.00 856
## 5 2013 11 1 542 545 - 3.00 831
## 6 2013 11 1 549 600 - 11.0 912
## 7 2013 11 1 550 600 - 10.0 705
## 8 2013 11 1 554 600 - 6.00 659
## 9 2013 11 1 554 600 - 6.00 826
## 10 2013 11 1 554 600 - 6.00 749
## # ... with 55,393 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
The order of operations doesn’t work like English. You can’t write:
filter(flights, month == 11 | 12)
Which you might literally translate into “finds all flights that departed in November or December”.
Instead it finds all months that equal 11 | 12, an expression that evaluates to TRUE. In a numeric context (like here), TRUE becomes one, so this finds all flights in January, not November or December. This is quite confusing!
A useful short-hand for this problem is x %in% y. This will select every row where x is one of the values in y. We could use it to rewrite the code above:
nov_dec <- filter(flights, month %in% c(11, 12))
One important feature of R that can make comparisons tricky is missing values, or NAs (“not availables”). NA represents an unknown value, so missing values are “contagious”: almost any operation involving an unknown value will also be unknown.
NA > 5
## [1] NA
10 == NA
## [1] NA
NA + 10
## [1] NA
NA / 2
## [1] NA
The most confusing result is this one:
NA == NA
## [1] NA
To verify whether or not values are NA, use the is.na() function.
x <- NA
is.na(x)
## [1] TRUE
filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly:
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
## # A tibble: 1 x 1
## x
## <dbl>
## 1 3.00
filter(df, is.na(x) | x > 1)
## # A tibble: 2 x 1
## x
## <dbl>
## 1 NA
## 2 3.00
Another example is the command below. There are NA values in dep_delay, and aggregation functions obey the usual rule of missing values: if there’s any missing value in the input, the output will be a missing value. Fortunately, all aggregation functions have an na.rm argument which removes the missing values prior to computation:
flights %>%
group_by(year, month, day) %>%
summarise(mean = mean(dep_delay))
## # A tibble: 365 x 4
## # Groups: year, month [?]
## year month day mean
## <int> <int> <int> <dbl>
## 1 2013 1 1 NA
## 2 2013 1 2 NA
## 3 2013 1 3 NA
## 4 2013 1 4 NA
## 5 2013 1 5 NA
## 6 2013 1 6 NA
## 7 2013 1 7 NA
## 8 2013 1 8 NA
## 9 2013 1 9 NA
## 10 2013 1 10 NA
## # ... with 355 more rows
To remove these NA values, we can use the na.rm argument:
flights %>%
group_by(year, month, day) %>%
summarise(mean = mean(dep_delay, na.rm = TRUE))
## # A tibble: 365 x 4
## # Groups: year, month [?]
## year month day mean
## <int> <int> <int> <dbl>
## 1 2013 1 1 11.5
## 2 2013 1 2 13.9
## 3 2013 1 3 11.0
## 4 2013 1 4 8.95
## 5 2013 1 5 5.73
## 6 2013 1 6 7.15
## 7 2013 1 7 5.42
## 8 2013 1 8 2.55
## 9 2013 1 9 2.28
## 10 2013 1 10 2.84
## # ... with 355 more rows
arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:
arrange(flights, year, month, day)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2.00 830
## 2 2013 1 1 533 529 4.00 850
## 3 2013 1 1 542 540 2.00 923
## 4 2013 1 1 544 545 -1.00 1004
## 5 2013 1 1 554 600 -6.00 812
## 6 2013 1 1 554 558 -4.00 740
## 7 2013 1 1 555 600 -5.00 913
## 8 2013 1 1 557 600 -3.00 709
## 9 2013 1 1 557 600 -3.00 838
## 10 2013 1 1 558 600 -2.00 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Use desc() to re-order by a column in descending order:
arrange(flights, desc(dep_delay))
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 9 641 900 1301 1242
## 2 2013 6 15 1432 1935 1137 1607
## 3 2013 1 10 1121 1635 1126 1239
## 4 2013 9 20 1139 1845 1014 1457
## 5 2013 7 22 845 1600 1005 1044
## 6 2013 4 10 1100 1900 960 1342
## 7 2013 3 17 2321 810 911 135
## 8 2013 6 27 959 1900 899 1236
## 9 2013 7 22 2257 759 898 121
## 10 2013 12 5 756 1700 896 1058
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Missing values are always sorted at the end:
df <- tibble(x = c(5, 2, NA))
arrange(df, x)
## # A tibble: 3 x 1
## x
## <dbl>
## 1 2.00
## 2 5.00
## 3 NA
arrange(df, desc(x))
## # A tibble: 3 x 1
## x
## <dbl>
## 1 5.00
## 2 2.00
## 3 NA
It’s not uncommon to get datasets with hundreds or even thousands of variables. In this case, the first challenge is often narrowing in on the variables you’re actually interested in. select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.
select() is not terribly useful with the flights data because we only have 19 variables, but you can still get the general idea:
select(flights, year, month, day)
## # A tibble: 336,776 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ... with 336,766 more rows
You can also select columns that are between two specific fields:
select(flights, year:day)
## # A tibble: 336,776 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ... with 336,766 more rows
To exclude fields, you can use the “-” symbol prior to the column name.
select(flights, -(year:day))
## # A tibble: 336,776 x 16
## dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
## <int> <int> <dbl> <int> <int> <dbl>
## 1 517 515 2.00 830 819 11.0
## 2 533 529 4.00 850 830 20.0
## 3 542 540 2.00 923 850 33.0
## 4 544 545 -1.00 1004 1022 -18.0
## 5 554 600 -6.00 812 837 -25.0
## 6 554 558 -4.00 740 728 12.0
## 7 555 600 -5.00 913 854 19.0
## 8 557 600 -3.00 709 723 -14.0
## 9 557 600 -3.00 838 846 - 8.00
## 10 558 600 -2.00 753 745 8.00
## # ... with 336,766 more rows, and 10 more variables: carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
There are a number of helper functions you can use within select():
-starts_with(“abc”): matches names that begin with “abc”.
-ends_with(“xyz”): matches names that end with “xyz”.
-contains(“ijk”): matches names that contain “ijk”.
-matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings.
-num_range(“x”, 1:3): matches x1, x2 and x3.
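For example, a quick way to grab just the departure-related columns is with starts_with():
select(flights, starts_with("dep"))  # returns dep_time and dep_delay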
Another option is to use select() in conjunction with the everything() helper. This is useful if you have a handful of variables you’d like to move to the start of the data frame.
select(flights, time_hour, air_time, everything())
## # A tibble: 336,776 x 19
## time_hour air_time year month day dep_time sched_dep_time
## <dttm> <dbl> <int> <int> <int> <int> <int>
## 1 2013-01-01 05:00:00 227 2013 1 1 517 515
## 2 2013-01-01 05:00:00 227 2013 1 1 533 529
## 3 2013-01-01 05:00:00 160 2013 1 1 542 540
## 4 2013-01-01 05:00:00 183 2013 1 1 544 545
## 5 2013-01-01 06:00:00 116 2013 1 1 554 600
## 6 2013-01-01 05:00:00 150 2013 1 1 554 558
## 7 2013-01-01 06:00:00 158 2013 1 1 555 600
## 8 2013-01-01 06:00:00 53.0 2013 1 1 557 600
## 9 2013-01-01 06:00:00 140 2013 1 1 557 600
## 10 2013-01-01 06:00:00 138 2013 1 1 558 600
## # ... with 336,766 more rows, and 12 more variables: dep_delay <dbl>,
## # arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
## # hour <dbl>, minute <dbl>
select() can be used to rename variables, but it’s rarely useful because it drops all of the variables not explicitly mentioned. Instead, use rename(), which is a variant of select() that keeps all the variables that aren’t explicitly mentioned:
rename(flights, tail_num = tailnum)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2.00 830
## 2 2013 1 1 533 529 4.00 850
## 3 2013 1 1 542 540 2.00 923
## 4 2013 1 1 544 545 -1.00 1004
## 5 2013 1 1 554 600 -6.00 812
## 6 2013 1 1 554 558 -4.00 740
## 7 2013 1 1 555 600 -5.00 913
## 8 2013 1 1 557 600 -3.00 709
## 9 2013 1 1 557 600 -3.00 838
## 10 2013 1 1 558 600 -2.00 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tail_num <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. That’s the job of mutate().
mutate() always adds new columns at the end of your dataset so we’ll start by creating a narrower dataset so we can see the new variables.
flights_sml <- select(flights,
year:day,
ends_with("delay"),
distance,
air_time)
mutate(flights_sml,
gain = dep_delay - arr_delay,
speed = distance / air_time * 60)
## # A tibble: 336,776 x 9
## year month day dep_delay arr_delay distance air_time gain speed
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 2.00 11.0 1400 227 - 9.00 370
## 2 2013 1 1 4.00 20.0 1416 227 -16.0 374
## 3 2013 1 1 2.00 33.0 1089 160 -31.0 408
## 4 2013 1 1 -1.00 -18.0 1576 183 17.0 517
## 5 2013 1 1 -6.00 -25.0 762 116 19.0 394
## 6 2013 1 1 -4.00 12.0 719 150 -16.0 288
## 7 2013 1 1 -5.00 19.0 1065 158 -24.0 404
## 8 2013 1 1 -3.00 -14.0 229 53.0 11.0 259
## 9 2013 1 1 -3.00 - 8.00 944 140 5.00 405
## 10 2013 1 1 -2.00 8.00 733 138 -10.0 319
## # ... with 336,766 more rows
Note that you can refer to columns that you’ve just created:
mutate(flights_sml,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours)
## # A tibble: 336,776 x 10
## year month day dep_delay arr_delay distance air_time gain hours
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 2.00 11.0 1400 227 - 9.00 3.78
## 2 2013 1 1 4.00 20.0 1416 227 -16.0 3.78
## 3 2013 1 1 2.00 33.0 1089 160 -31.0 2.67
## 4 2013 1 1 -1.00 -18.0 1576 183 17.0 3.05
## 5 2013 1 1 -6.00 -25.0 762 116 19.0 1.93
## 6 2013 1 1 -4.00 12.0 719 150 -16.0 2.50
## 7 2013 1 1 -5.00 19.0 1065 158 -24.0 2.63
## 8 2013 1 1 -3.00 -14.0 229 53.0 11.0 0.883
## 9 2013 1 1 -3.00 - 8.00 944 140 5.00 2.33
## 10 2013 1 1 -2.00 8.00 733 138 -10.0 2.30
## # ... with 336,766 more rows, and 1 more variable: gain_per_hour <dbl>
If you only want to keep the new variables, use transmute():
transmute(flights,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours)
## # A tibble: 336,776 x 3
## gain hours gain_per_hour
## <dbl> <dbl> <dbl>
## 1 - 9.00 3.78 - 2.38
## 2 -16.0 3.78 - 4.23
## 3 -31.0 2.67 -11.6
## 4 17.0 3.05 5.57
## 5 19.0 1.93 9.83
## 6 -16.0 2.50 - 6.40
## 7 -24.0 2.63 - 9.11
## 8 11.0 0.883 12.5
## 9 5.00 2.33 2.14
## 10 -10.0 2.30 - 4.35
## # ... with 336,766 more rows
There are many functions for creating new variables that you can use with mutate(). The key property is that the function must be vectorised: it must take a vector of values as input, return a vector with the same number of values as output. There’s no way to list every possible function that you might use, but here’s a selection of functions that are frequently useful:
-Arithmetic operators: +, -, *, /, ^
These are all vectorised, using the so called “recycling rules”. If one parameter is shorter than the other, it will be automatically extended to be the same length. This is most useful when one of the arguments is a single number: air_time / 60, hours * 60 + minute, etc.
-Arithmetic operators are also useful in conjunction with the aggregate functions you’ll learn about later. For example, x / sum(x) calculates the proportion of a total, and y - mean(y) computes the difference from the mean.
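For example, a quick sketch (not from the original text) using the flights_sml data created above:
mutate(flights_sml,
       prop_distance = distance / sum(distance),                    # each flight's share of total distance
       delay_vs_mean = dep_delay - mean(dep_delay, na.rm = TRUE))   # difference from the mean delay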
-Modular arithmetic: %/% (integer division) and %% (remainder), where x == y * (x %/% y) + (x %% y). Modular arithmetic is a handy tool because it allows you to break integers up into pieces. For example, in the flights dataset, you can compute hour and minute from dep_time with:
transmute(flights,
dep_time,
hour = dep_time %/% 100,
minute = dep_time %% 100)
## # A tibble: 336,776 x 3
## dep_time hour minute
## <int> <dbl> <dbl>
## 1 517 5.00 17.0
## 2 533 5.00 33.0
## 3 542 5.00 42.0
## 4 544 5.00 44.0
## 5 554 5.00 54.0
## 6 554 5.00 54.0
## 7 555 5.00 55.0
## 8 557 5.00 57.0
## 9 557 5.00 57.0
## 10 558 5.00 58.0
## # ... with 336,766 more rows
-Logs: log(), log2(), log10(). Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude. They also convert multiplicative relationships to additive, a feature we’ll come back to in modelling.
All else being equal, I recommend using log2() because it’s easy to interpret: a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving.
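As a small illustration (a sketch, not in the original), a log2 transform inside mutate():
mutate(flights_sml, log2_distance = log2(distance))  # a difference of +1 on this scale means the distance doubled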
-Offsets: lead() and lag() allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. x - lag(x)) or find when values change (x != lag(x)). They are most useful in conjunction with group_by(), which you’ll learn about shortly.
(x <- 1:10)
## [1] 1 2 3 4 5 6 7 8 9 10
lag(x)
## [1] NA 1 2 3 4 5 6 7 8 9
lead(x)
## [1] 2 3 4 5 6 7 8 9 10 NA
-Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: cumsum(), cumprod(), cummin(), cummax(); and dplyr provides cummean() for cumulative means. If you need rolling aggregates (i.e. a sum computed over a rolling window), try the RcppRoll package.
cumsum(x)
## [1] 1 3 6 10 15 21 28 36 45 55
cummean(x)
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
-Logical comparisons, <, <=, >, >=, !=, and ==, which you learned about earlier. If you’re doing a complex sequence of logical operations it’s often a good idea to store the interim values in new variables so you can check that each step is working as expected.
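A minimal sketch of storing interim logical steps so each can be checked separately (the column names here are illustrative):
mutate(flights_sml,
       big_delay   = arr_delay > 60,       # more than an hour late
       long_flight = distance > 1000,      # longer than 1000 miles
       both        = big_delay & long_flight)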
-Ranking: there are a number of ranking functions, but you should start with min_rank(). It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th). The default gives smallest values the small ranks; use desc(x) to give the largest values the smallest ranks.
y <- c(1, 2, 2, NA, 3, 4)
min_rank(y)
## [1] 1 2 2 NA 4 5
min_rank(desc(y))
## [1] 5 3 3 NA 2 1
If min_rank() doesn’t do what you need, look at the variants row_number(), dense_rank(), percent_rank(), cume_dist(), ntile(). See their help pages for more details.
row_number(y)
## [1] 1 2 3 NA 4 5
dense_rank(y)
## [1] 1 2 2 NA 3 4
percent_rank(y)
## [1] 0.00 0.25 0.25 NA 0.75 1.00
cume_dist(y)
## [1] 0.2 0.6 0.6 NA 0.8 1.0
The last key verb is summarise(). It collapses a data frame to a single row:
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 x 1
## delay
## <dbl>
## 1 12.6
summarise() is not terribly useful unless we pair it with group_by(). This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame they’ll be automatically applied “by group”. For example, if we applied exactly the same code to a data frame grouped by date, we get the average delay per date:
by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 365 x 4
## # Groups: year, month [?]
## year month day delay
## <int> <int> <int> <dbl>
## 1 2013 1 1 11.5
## 2 2013 1 2 13.9
## 3 2013 1 3 11.0
## 4 2013 1 4 8.95
## 5 2013 1 5 5.73
## 6 2013 1 6 7.15
## 7 2013 1 7 5.42
## 8 2013 1 8 2.55
## 9 2013 1 9 2.28
## 10 2013 1 10 2.84
## # ... with 355 more rows
Together group_by() and summarise() provide one of the tools that you’ll use most commonly when working with dplyr: grouped summaries. But before we go any further with this, we need to introduce a powerful new idea: the pipe.
Imagine that we want to explore the relationship between the distance and average delay for each location. Using what you know about dplyr, you might write code like this:
by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE))
delay <- filter(delay, count > 20, dest != "HNL")
It looks like delays increase with distance up to ~750 miles and then decrease. Maybe as flights get longer there’s more ability to make up delays in the air?
ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
There are three steps to prepare this data:
-Group flights by destination.
-Summarise to compute distance, average delay, and number of flights.
-Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
This code is a little frustrating to write because we have to give each intermediate data frame a name, even though we don’t care about it. Naming things is hard, so this slows down our analysis.
There’s another way to tackle the same problem with the pipe, %>%:
delays <- flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)) %>%
filter(count > 20, dest != "HNL")
This focuses on the transformations, not what’s being transformed, which makes the code easier to read. You can read it as a series of imperative statements: group, then summarise, then filter. As suggested by this reading, a good way to pronounce %>% when reading code is “then”.
Behind the scenes, x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z) turns into g(f(x, y), z) and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom. We’ll use piping frequently from now on because it considerably improves the readability of code, and we’ll come back to it in more detail in pipes.
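As a quick illustration (not in the original text), these two expressions produce the same result:
summarise(group_by(flights, dest), n = n())        # nested form
flights %>% group_by(dest) %>% summarise(n = n())  # piped form, read left to right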
Working with the pipe is one of the key criteria for belonging to the tidyverse. The only exception is ggplot2: it was written before the pipe was discovered. Unfortunately, the next iteration of ggplot2, ggvis, which does use the pipe, isn’t quite ready for prime time yet.
Whenever you do any aggregation, it’s always a good idea to include either a count (n()), or a count of non-missing values (sum(!is.na(x))). That way you can check that you’re not drawing conclusions based on very small amounts of data. For example, let’s look at the planes (identified by their tail number) that have the highest average delays:
not_cancelled <- flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay))
delays <- not_cancelled %>%
group_by(tailnum) %>%
summarise(
delay = mean(arr_delay))
ggplot(data = delays, mapping = aes(x = delay)) +
geom_freqpoly(binwidth = 10)
The story is actually a little more nuanced. We can get more insight if we draw a scatterplot of number of flights vs. average delay:
delays <- not_cancelled %>%
group_by(tailnum) %>%
summarise(
delay = mean(arr_delay, na.rm = TRUE),
n = n())
ggplot(data = delays, mapping = aes(x = n, y = delay)) +
geom_point(alpha = 1/10)
Not surprisingly, there is much greater variation in the average delay when there are few flights. The shape of this plot is very characteristic: whenever you plot a mean (or other summary) vs. group size, you’ll see that the variation decreases as the sample size increases.
When looking at this sort of plot, it’s often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups. This is what the following code does, as well as showing you a handy pattern for integrating ggplot2 into dplyr flows. It’s a bit painful that you have to switch from %>% to +, but once you get the hang of it, it’s quite convenient.
delays %>%
filter(n > 25) %>%
ggplot(mapping = aes(x = n, y = delay)) +
geom_point(alpha = 1/10)
There’s another common variation of this type of pattern. Let’s look at how the average performance of batters in baseball is related to the number of times they’re at bat. Here I use data from the Lahman package to compute the batting average (number of hits / number of attempts) of every major league baseball player.
When I plot the skill of the batter (measured by the batting average, ba) against the number of opportunities to hit the ball (measured by at bat, ab), you see two patterns:
-As above, the variation in our aggregate decreases as we get more data points.
-There’s a positive correlation between skill (ba) and opportunities to hit the ball (ab). This is because teams control who gets to play, and obviously they’ll pick their best players.
batting <- as_tibble(Lahman::Batting)
batters <- batting %>%
group_by(playerID) %>%
summarise(
ba = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
ab = sum(AB, na.rm = TRUE))
batters %>%
filter(ab > 100) %>%
ggplot(mapping = aes(x = ab, y = ba)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
You’ve seen n(), which takes no arguments, and returns the size of the current group. To count the number of non-missing values, use sum(!is.na(x)). To count the number of distinct (unique) values, use n_distinct(x).
not_cancelled %>%
group_by(dest) %>%
summarise(carriers = n_distinct(carrier)) %>%
arrange(desc(carriers))
## # A tibble: 104 x 2
## dest carriers
## <chr> <int>
## 1 ATL 7
## 2 BOS 7
## 3 CLT 7
## 4 ORD 7
## 5 TPA 7
## 6 AUS 6
## 7 DCA 6
## 8 DTW 6
## 9 IAD 6
## 10 MSP 6
## # ... with 94 more rows
Counts are so useful that dplyr provides a simple helper if all you want is a count:
not_cancelled %>%
count(dest)
## # A tibble: 104 x 2
## dest n
## <chr> <int>
## 1 ABQ 254
## 2 ACK 264
## 3 ALB 418
## 4 ANC 8
## 5 ATL 16837
## 6 AUS 2411
## 7 AVL 261
## 8 BDL 412
## 9 BGR 358
## 10 BHM 269
## # ... with 94 more rows
You can optionally provide a weight variable. For example, you could use this to “count” (sum) the total number of miles a plane flew:
not_cancelled %>%
count(tailnum, wt = distance)
## # A tibble: 4,037 x 2
## tailnum n
## <chr> <dbl>
## 1 D942DN 3418
## 2 N0EGMQ 239143
## 3 N10156 109664
## 4 N102UW 25722
## 5 N103US 24619
## 6 N104UW 24616
## 7 N10575 139903
## 8 N105UW 23618
## 9 N107US 21677
## 10 N108UW 32070
## # ... with 4,027 more rows
Just using means, counts, and sums can get you a long way, but R provides many other useful summary functions:
Measures of location: we’ve used mean(x), but median(x) is also useful. The mean is the sum divided by the length; the median is a value where 50% of x is above it, and 50% is below it.
It’s sometimes useful to combine aggregation with logical subsetting. We haven’t talked about this sort of subsetting yet, but you’ll learn more about it in subsetting.
not_cancelled %>%
group_by(year, month, day) %>%
summarise(avg_delay1 = mean(arr_delay),
avg_delay2 = mean(arr_delay[arr_delay > 0]))
## # A tibble: 365 x 5
## # Groups: year, month [?]
## year month day avg_delay1 avg_delay2
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 12.7 32.5
## 2 2013 1 2 12.7 32.0
## 3 2013 1 3 5.73 27.7
## 4 2013 1 4 - 1.93 28.3
## 5 2013 1 5 - 1.53 22.6
## 6 2013 1 6 4.24 24.4
## 7 2013 1 7 - 4.95 27.8
## 8 2013 1 8 - 3.23 20.8
## 9 2013 1 9 - 0.264 25.6
## 10 2013 1 10 - 5.90 27.3
## # ... with 355 more rows
Measures of spread: sd(x), IQR(x), mad(x). The root mean squared deviation, or standard deviation sd(x), is the standard measure of spread. The interquartile range IQR(x) and median absolute deviation mad(x) are robust equivalents that may be more useful if you have outliers.
Why is distance to some destinations more variable than to others?
not_cancelled %>%
group_by(dest) %>%
summarise(distance_sd = sd(distance)) %>%
arrange(desc(distance_sd))
## # A tibble: 104 x 2
## dest distance_sd
## <chr> <dbl>
## 1 EGE 10.5
## 2 SAN 10.4
## 3 SFO 10.2
## 4 HNL 10.0
## 5 SEA 9.98
## 6 LAS 9.91
## 7 PDX 9.87
## 8 PHX 9.86
## 9 LAX 9.66
## 10 IND 9.46
## # ... with 94 more rows
Quantiles are a generalisation of the median. For example, quantile(x, 0.25) will find a value of x that is greater than 25% of the values, and less than the remaining 75%.
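For example, a small sketch (not in the original) of the 25th and 75th percentiles of departure times per day:
not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(
    q25 = quantile(dep_time, 0.25),
    q75 = quantile(dep_time, 0.75))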
When do the first and last flights leave each day?
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
first = min(dep_time),
last = max(dep_time))
## # A tibble: 365 x 5
## # Groups: year, month [?]
## year month day first last
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 517 2356
## 2 2013 1 2 42.0 2354
## 3 2013 1 3 32.0 2349
## 4 2013 1 4 25.0 2358
## 5 2013 1 5 14.0 2357
## 6 2013 1 6 16.0 2355
## 7 2013 1 7 49.0 2359
## 8 2013 1 8 454 2351
## 9 2013 1 9 2.00 2252
## 10 2013 1 10 3.00 2320
## # ... with 355 more rows
Measures of position: first(x), nth(x, 2), and last(x). These work similarly to x[1], x[2], and x[length(x)], but let you set a default value if that position does not exist (i.e. you’re trying to get the 3rd element from a group that only has two elements). For example, we can find the first and last departure for each day:
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
first_dep = first(dep_time),
last_dep = last(dep_time))
## # A tibble: 365 x 5
## # Groups: year, month [?]
## year month day first_dep last_dep
## <int> <int> <int> <int> <int>
## 1 2013 1 1 517 2356
## 2 2013 1 2 42 2354
## 3 2013 1 3 32 2349
## 4 2013 1 4 25 2358
## 5 2013 1 5 14 2357
## 6 2013 1 6 16 2355
## 7 2013 1 7 49 2359
## 8 2013 1 8 454 2351
## 9 2013 1 9 2 2252
## 10 2013 1 10 3 2320
## # ... with 355 more rows
When used with numeric functions, TRUE is converted to 1 and FALSE to 0. This makes sum() and mean() very useful: sum(x) gives the number of TRUEs in x, and mean(x) gives the proportion.
How many flights left before 5am? (these usually indicate delayed flights from the previous day)
not_cancelled %>%
group_by(year, month, day) %>%
summarise(n_early = sum(dep_time < 500))
## # A tibble: 365 x 4
## # Groups: year, month [?]
## year month day n_early
## <int> <int> <int> <int>
## 1 2013 1 1 0
## 2 2013 1 2 3
## 3 2013 1 3 4
## 4 2013 1 4 3
## 5 2013 1 5 3
## 6 2013 1 6 2
## 7 2013 1 7 2
## 8 2013 1 8 1
## 9 2013 1 9 3
## 10 2013 1 10 3
## # ... with 355 more rows
What proportion of flights are delayed by more than an hour?
not_cancelled %>%
group_by(year, month, day) %>%
summarise(hour_perc = mean(arr_delay > 60))
## # A tibble: 365 x 4
## # Groups: year, month [?]
## year month day hour_perc
## <int> <int> <int> <dbl>
## 1 2013 1 1 0.0722
## 2 2013 1 2 0.0851
## 3 2013 1 3 0.0567
## 4 2013 1 4 0.0396
## 5 2013 1 5 0.0349
## 6 2013 1 6 0.0470
## 7 2013 1 7 0.0333
## 8 2013 1 8 0.0213
## 9 2013 1 9 0.0202
## 10 2013 1 10 0.0183
## # ... with 355 more rows
When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll up a dataset:
daily <- group_by(flights, year, month, day)
(per_day <- summarise(daily, flights = n()))
## # A tibble: 365 x 4
## # Groups: year, month [?]
## year month day flights
## <int> <int> <int> <int>
## 1 2013 1 1 842
## 2 2013 1 2 943
## 3 2013 1 3 914
## 4 2013 1 4 915
## 5 2013 1 5 720
## 6 2013 1 6 832
## 7 2013 1 7 933
## 8 2013 1 8 899
## 9 2013 1 9 902
## 10 2013 1 10 932
## # ... with 355 more rows
(per_month <- summarise(per_day, flights = sum(flights)))
## # A tibble: 12 x 3
## # Groups: year [?]
## year month flights
## <int> <int> <int>
## 1 2013 1 27004
## 2 2013 2 24951
## 3 2013 3 28834
## 4 2013 4 28330
## 5 2013 5 28796
## 6 2013 6 28243
## 7 2013 7 29425
## 8 2013 8 29327
## 9 2013 9 27574
## 10 2013 10 28889
## 11 2013 11 27268
## 12 2013 12 28135
(per_year <- summarise(per_month, flights = sum(flights)))
## # A tibble: 1 x 2
## year flights
## <int> <int>
## 1 2013 336776
Be careful when progressively rolling up summaries: it’s OK for sums and counts, but you need to think about weighting means and variances, and it’s not possible to do it exactly for rank-based statistics like the median. In other words, the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median.
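To make the point about weighting concrete, here is a small sketch (not in the original; the names per_day_delay and n_obs are illustrative): an unweighted mean of daily mean delays differs from the overall mean, but weighting each day by its number of non-missing observations recovers it exactly.
per_day_delay <- summarise(daily,
                           mean_delay = mean(dep_delay, na.rm = TRUE),
                           n_obs = sum(!is.na(dep_delay)))
mean(per_day_delay$mean_delay)                                # naive mean of daily means
weighted.mean(per_day_delay$mean_delay, per_day_delay$n_obs)  # weighted by observations per day
mean(flights$dep_delay, na.rm = TRUE)                         # overall mean, for comparison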
If you need to remove grouping, and return to operations on ungrouped data, use ungroup().
daily %>%
ungroup() %>%
summarise(flights = n())
## # A tibble: 1 x 1
## flights
## <int>
## 1 336776
Grouping is most useful in conjunction with summarise(), but you can also do convenient operations with mutate() and filter():
Find the worst members of each group:
flights_sml %>%
group_by(year, month, day) %>%
filter(rank(desc(arr_delay)) < 10)
## # A tibble: 3,306 x 7
## # Groups: year, month, day [365]
## year month day dep_delay arr_delay distance air_time
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 853 851 184 41.0
## 2 2013 1 1 290 338 1134 213
## 3 2013 1 1 260 263 266 46.0
## 4 2013 1 1 157 174 213 60.0
## 5 2013 1 1 216 222 708 121
## 6 2013 1 1 255 250 589 115
## 7 2013 1 1 285 246 1085 146
## 8 2013 1 1 192 191 199 44.0
## 9 2013 1 1 379 456 1092 222
## 10 2013 1 2 224 207 550 94.0
## # ... with 3,296 more rows
Find all groups bigger than a threshold:
popular_dests <- flights %>%
group_by(dest) %>%
filter(n() > 365)
popular_dests
## # A tibble: 332,577 x 19
## # Groups: dest [77]
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2.00 830
## 2 2013 1 1 533 529 4.00 850
## 3 2013 1 1 542 540 2.00 923
## 4 2013 1 1 544 545 -1.00 1004
## 5 2013 1 1 554 600 -6.00 812
## 6 2013 1 1 554 558 -4.00 740
## 7 2013 1 1 555 600 -5.00 913
## 8 2013 1 1 557 600 -3.00 709
## 9 2013 1 1 557 600 -3.00 838
## 10 2013 1 1 558 600 -2.00 753
## # ... with 332,567 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Standardise to compute per group metrics:
popular_dests %>%
filter(arr_delay > 0) %>%
mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
select(year:day, dest, arr_delay, prop_delay)
## # A tibble: 131,106 x 6
## # Groups: dest [77]
## year month day dest arr_delay prop_delay
## <int> <int> <int> <chr> <dbl> <dbl>
## 1 2013 1 1 IAH 11.0 0.000111
## 2 2013 1 1 IAH 20.0 0.000201
## 3 2013 1 1 MIA 33.0 0.000235
## 4 2013 1 1 ORD 12.0 0.0000424
## 5 2013 1 1 FLL 19.0 0.0000938
## 6 2013 1 1 ORD 8.00 0.0000283
## 7 2013 1 1 LAX 7.00 0.0000344
## 8 2013 1 1 DFW 31.0 0.000282
## 9 2013 1 1 ATL 12.0 0.0000400
## 10 2013 1 1 DTW 16.0 0.000116
## # ... with 131,096 more rows
An alternative approach to plotting individual components is to round the date to a nearby unit of time with floor_date(), round_date(), and ceiling_date(). Each function takes a vector of dates to adjust and then the name of the unit to round down to (floor), round up to (ceiling), or round to. This, for example, allows us to plot the number of flights per week:
flights_dt %>%
count(week = floor_date(dep_time, "week")) %>%
ggplot(aes(week, n)) +
geom_line()
This chapter will teach you how to visualise your data using ggplot2. ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs.
If you’d like to learn more about the theoretical underpinnings of ggplot2 before you start, I’d recommend reading “The Layered Grammar of Graphics”, http://vita.had.co.nz/papers/layered-grammar.pdf.
Let’s use our first graph to answer a question: Do cars with big engines use more fuel than cars with small engines? You probably already have an answer, but try to make your answer precise. What does the relationship between engine size and fuel efficiency look like? Is it positive? Negative? Linear? Nonlinear?
The mpg data frame:
You can test your answer with the mpg data frame found in ggplot2 (aka ggplot2::mpg). A data frame is a rectangular collection of variables (in the columns) and observations (in the rows). mpg contains observations collected by the US Environmental Protection Agency on 38 models of car.
mpg
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>
## 1 audi a4 1.80 1999 4 auto(l~ f 18 29 p
## 2 audi a4 1.80 1999 4 manual~ f 21 29 p
## 3 audi a4 2.00 2008 4 manual~ f 20 31 p
## 4 audi a4 2.00 2008 4 auto(a~ f 21 30 p
## 5 audi a4 2.80 1999 6 auto(l~ f 16 26 p
## 6 audi a4 2.80 1999 6 manual~ f 18 26 p
## 7 audi a4 3.10 2008 6 auto(a~ f 18 27 p
## 8 audi a4 quat~ 1.80 1999 4 manual~ 4 18 26 p
## 9 audi a4 quat~ 1.80 1999 4 auto(l~ 4 16 25 p
## 10 audi a4 quat~ 2.00 2008 4 manual~ 4 20 28 p
## # ... with 224 more rows, and 1 more variable: class <chr>
Among the variables in mpg are:
-displ, a car’s engine size, in litres.
-hwy, a car’s fuel efficiency on the highway, in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance.
To learn more about mpg, open its help page by running ?mpg.
To plot mpg, run this code to put displ on the x-axis and hwy on the y-axis:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
The plot shows a negative relationship between engine size (displ) and fuel efficiency (hwy). In other words, cars with big engines use more fuel. Does this confirm or refute your hypothesis about fuel efficiency and engine size?
With ggplot2, you begin a plot with the function ggplot(). ggplot() creates a coordinate system that you can add layers to. The first argument of ggplot() is the dataset to use in the graph. So ggplot(data = mpg) creates an empty graph, but it’s not very interesting so I’m not going to show it here.
You complete your graph by adding one or more layers to ggplot(). The function geom_point() adds a layer of points to your plot, which creates a scatterplot. ggplot2 comes with many geom functions that each add a different type of layer to a plot. You’ll learn a whole bunch of them throughout this chapter.
Each geom function in ggplot2 takes a mapping argument. This defines how variables in your dataset are mapped to visual properties. The mapping argument is always paired with aes(), and the x and y arguments of aes() specify which variables to map to the x and y axes. ggplot2 looks for the mapped variables in the data argument, in this case, mpg.
Let’s turn this code into a reusable template for making graphs with ggplot2. To make a graph, replace the bracketed sections in the code below with a dataset, a geom function, or a collection of mappings.
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
You can add a third variable, like class, to a two-dimensional scatterplot by mapping it to an aesthetic. An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, shape, and color of your points.
You can display a point in different ways by changing the values of its aesthetic properties. Since we already use the word “value” to describe data, let’s use the word “level” to describe aesthetic properties: changing the levels of a point’s size, shape, and color can make it small, triangular, or blue.
For example, here we map the class variable to the color aesthetic:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
To map an aesthetic to a variable, associate the name of the aesthetic to the name of the variable inside aes().
ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as scaling.
ggplot2 will also add a legend that explains which levels correspond to which values.
The colors reveal that many of the unusual points are two-seater cars. These cars don’t seem like hybrids, and are, in fact, sports cars! Sports cars have large engines like SUVs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have large engines.
In the above example, we mapped class to the color aesthetic, but we could have mapped class to the size aesthetic in the same way. In this case, the exact size of each point would reveal its class affiliation.
We get a warning here, because mapping an unordered variable (class) to an ordered aesthetic (size) is not a good idea.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class))
## Warning: Using size for a discrete variable is not advised.
Or we could have mapped class to the alpha aesthetic, which controls the transparency of the points, or to the shape aesthetic, which controls the shape of the points.
Alpha
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
## Warning: Using alpha for a discrete variable is not advised.
Shape
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have 7.
## Consider specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (geom_point).
What happened to the SUVs? ggplot2 will only use six shapes at a time. By default, additional groups will go unplotted when you use the shape aesthetic.
For each aesthetic, you use aes() to associate the name of the aesthetic with a variable to display.
The aes() function gathers together each of the aesthetic mappings used by a layer and passes them to the layer’s mapping argument.
The syntax highlights a useful insight about x and y: the x and y locations of a point are themselves aesthetics, visual properties that you can map to variables to display information about the data.
Once you map an aesthetic, ggplot2 takes care of the rest. It selects a reasonable scale to use with the aesthetic, and it constructs a legend that explains the mapping between levels and values.
For x and y aesthetics, ggplot2 does not create a legend, but it creates an axis line with tick marks and a label. The axis line acts as a legend; it explains the mapping between locations and values.
You can also set the aesthetic properties of your geom manually. For example, we can make all of the points in our plot blue:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
Here, the color doesn’t convey information about a variable, but only changes the appearance of the plot.
To set an aesthetic manually, set the aesthetic by name as an argument of your geom function; i.e. it goes outside of aes(). You’ll need to pick a level that makes sense for that aesthetic:
-The name of a color as a character string.
-The size of a point in mm.
-The shape of a point as a number, as shown in Figure 3.1.
R has 25 built-in shapes that are identified by numbers. There are some seeming duplicates: the hollow shapes (0–14) have a border determined by colour; the solid shapes (15–18) are filled with colour; and the filled shapes (21–24) have a border of colour and are filled with fill.
One way to add additional variables is with aesthetics. Another way, particularly useful for categorical variables, is to split your plot into facets, subplots that each display one subset of the data.
To facet your plot by a single variable, use facet_wrap(). The first argument of facet_wrap() should be a formula, which you create with ~ followed by a variable name (here “formula” is the name of a data structure in R, not a synonym for “equation”). The variable that you pass to facet_wrap() should be discrete.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
In the dplyr section we quickly touched on some statistical terms:
-Mean
-Standard Deviation (SD)
-Confidence Intervals
In this section, we will expand upon our statistical discussion, focusing primarily on how R is used to conduct statistical calculations. While we will explain the terms, the focus will remain on the R programming language.
Classical statistics, also known as inferential statistics, focuses primarily on drawing conclusions about large populations based on data collected from a sample.
In EDA, however, the focus is on understanding the data itself, specifically its measures of location and variability:
Variables with measured or count data might have thousands of distinct values. One basic step in getting to understand your data is to get an idea of what the typical value is for each feature (variable). In other words, an estimate of where most of the data is located. In statistics this is known as the central tendency of the data.
When it comes to measuring the central tendency of your data, several metrics can do the job, most commonly the mean (including its trimmed and weighted variants) and the median.
To calculate the mean and median in R we use the following commands:
state <- read.csv("state.csv")
mean(state$Population)
## [1] 6162876
median(state$Population)
## [1] 4436370
To compute a trimmed mean, we pass a trim argument giving the fraction of observations to drop from each end, expressed as a decimal. (Base R's median() simply ignores a trim argument, and trimming an equal share from both ends does not move the median, which is why the value below is unchanged.)
mean(state$Population, trim = 0.1)
## [1] 4783697
median(state$Population, trim = 0.1)
## [1] 4436370
As we can see in the data, the populations of the various states differ greatly. Any analysis of rates, such as murder rates, should take this into account. To do so, we need a weighted mean or median that accounts for the different state populations when calculating an average murder rate.
Base R includes weighted.mean() for a weighted mean, but for a weighted median we need the matrixStats package.
weighted.mean(state$Murder.Rate, w = state$Population)
## [1] 4.445834
matrixStats::weightedMedian(state$Murder.Rate, w = state$Population)
## [1] 4.4
Measures of centrality are one way to summarize a feature of your data. Variability, also known as dispersion, measures whether the data values are tightly clustered or spread out.
At the heart of statistics lies variability: measuring it, reducing it, distinguishing random from real variability, identifying the various sources of real variability, and making decisions in the presence of it.
Some measures of variability are as follows:
-Variance: the sum of squared deviations from the mean, divided by n - 1, where n is the count of values
-Standard Deviation: the square root of the variance
-Mean Absolute Deviation: the mean of the absolute values of the deviations from the mean
-Median Absolute Deviation from the Median (MAD): the median of the absolute values of the deviations from the median
-Range: the difference between the largest and smallest values
-Percentiles: the value such that P percent of the values take on this value or less and (100 - P) percent take on this value or more
-Interquartile Range (IQR): the difference between the 75th and 25th percentile values.
To calculate these measures of variability in R we use the following commands:
sd(state$Population)
## [1] 6848235
IQR(state$Population)
## [1] 4847308
mad(state$Population)
## [1] 3849870
quantile(state$Population)
## 0% 25% 50% 75% 100%
## 563626 1833004 4436370 6680312 37253956
You will notice that the standard deviation is almost twice as large as the MAD. This is not surprising since the standard deviation is sensitive to outliers.
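Variance and range, which appear in the list above but not in the code, have base R equivalents as well; a minimal sketch:
var(state$Population)          # variance
diff(range(state$Population))  # range: largest value minus smallest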
With an understanding of the centrality and variability of the data, the next step is to explore and understand the overall distribution of the data.
For this purpose there are several visualizations that let you see the shape of the distribution; many of these were discussed above in the section on Data Visualization. A boxplot is one of the simplest:
boxplot(log(state$Population), ylab="Population log")
Frequency tables divide up a variable into equally spaced segments and provide counts for how many values fall in each segment.
To create one, we first build a breaks object with the seq() command, which generates equally spaced cut points from the minimum to the maximum of the field. This tells R how to segment the variable. From there, we pass it to the breaks= argument of cut():
breaks <- seq(from=min(state$Population), to=max(state$Population), length=11)
pop_freq <- cut(state$Population, breaks = breaks, right=TRUE, include.lowest = TRUE)
table(pop_freq)
## pop_freq
## [5.64e+05,4.23e+06] (4.23e+06,7.9e+06] (7.9e+06,1.16e+07]
## 24 14 6
## (1.16e+07,1.52e+07] (1.52e+07,1.89e+07] (1.89e+07,2.26e+07]
## 2 1 1
## (2.26e+07,2.62e+07] (2.62e+07,2.99e+07] (2.99e+07,3.36e+07]
## 1 0 0
## (3.36e+07,3.73e+07]
## 1
A histogram is a way to visualize a frequency table with bins on the x-axis and data count on the y-axis. To create a histogram in R we use the hist command as follows:
hist(state$Population, breaks = breaks)
General rules for plotting histograms:
-Empty bins are included
-Bins are of equal width
-The number of bins is up to the user
-Bars are contiguous; no empty space is shown between bars unless there is an empty bin.
In statistics, location and variability are referred to as the first and second moments of a distribution. The third and fourth moments are called skewness and kurtosis.
Skewness refers to whether the data is skewed to larger or smaller values.
Kurtosis indicates the propensity of the data to have extreme values.
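Base R has no built-in skewness or kurtosis functions. As an illustrative sketch, the moments package (an assumption here; packages such as e1071 offer similar functions) provides both:
library(moments)
skewness(state$Murder.Rate)  # skew of the murder-rate distribution
kurtosis(state$Murder.Rate)  # propensity for extreme values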
To validate these measures, visual displays are commonly used. One of these is a density plot, which shows the distribution of data values as a continuous line. A density plot can be thought of as a smoothed histogram, although it is typically computed directly from the data through a kernel density estimate.
To create a density plot, you draw a histogram and then layer a density curve on top using lines() together with the density() function.
hist(state$Murder.Rate, freq=FALSE)
lines(density(state$Murder.Rate),lwd=3,col="blue")
One major distinction between a histogram and a density plot is that a density plot corresponds to plotting the histogram as a proportion rather than as counts. This is specified with the argument freq=FALSE.
Up until now, we have discussed how to conduct exploratory data analysis on numerical features. We will now discuss how to do the same with categorical ones. Recall that categorical features have a mutually exclusive and exhaustive set of values.
For such variables, simple proportions or percentages tell the story of the data.
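As a small sketch (reusing the mpg data from the ggplot2 section above), counts and proportions for a categorical variable can be computed with table() and prop.table():
table(mpg$class)              # counts per category
prop.table(table(mpg$class))  # proportions per category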
There are also specific visualizations that can be used to help understand categorical data. Bar charts for instance show the frequency or proportion for each category plotted as bars.
In this example we show the different kinds of delays categorized by type:
dfw <- read.csv("dfw_airline.csv",header = TRUE)
barplot(as.matrix(dfw)/6, cex.axis = .5)
Bar charts are a common visual tool for displaying a single categorical variable. Categories are listed on the x-axis and frequencies or proportions on the y-axis.
Bar charts resemble histograms; however, in a bar chart the x-axis represents different categories of a factor variable, while in a histogram the x-axis represents values of a single variable on a numeric scale. In a histogram the bars are typically shown touching each other, with gaps indicating values that did not occur in the data; in a bar chart the bars are shown separate from one another.
Up to now, we have discussed how to understand a single variable, whether it be numeric or categorical. From here on out we will see how to understand relationships between two or more variables.
EDA in many modeling projects involves examining correlation among predictors, and between predictors and a target variable.
Two variables are said to be positively correlated if high values of X go with high values of Y, and low values of X with low values of Y.
If the relationship is inverse, that is, Y decreases as X increases, then the variables are said to be negatively correlated.
To measure the degree to which two variables are correlated, we calculate Pearson's correlation coefficient: sum the products of the deviations from the mean for x and y, then divide by the product of the two standard deviations times n - 1.
The correlation coefficient always lies between +1 and -1 (perfect correlation); 0 indicates no correlation. Variables can also have a relationship that is not linear, in which case the correlation coefficient may not be as helpful.
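A minimal sketch (not in the original) of computing Pearson's correlation coefficient by hand and checking it against cor(), using the state data loaded earlier:
x <- state$Population
y <- state$Murder.Rate
sum((x - mean(x)) * (y - mean(y))) / ((length(x) - 1) * sd(x) * sd(y))  # hand-rolled r
cor(x, y)                                                               # should match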
To display correlation coefficients among all numeric variables, we use what is called a correlation matrix. Along the diagonal we see values of 1, since this is the correlation of a variable with itself.
A correlation matrix is created in R using the package corrplot:
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.4.4
## corrplot 0.84 loaded
sp500_px <- read.csv("sp500_data.csv")
sp500_sym <- read.csv("sp500_sectors.csv")
etf <- sp500_px[row.names(sp500_px)>"2012-07-01",sp500_sym[sp500_sym$sector=="etf",'symbol']]
corrplot(cor(etf),method = "ellipse")
Scatterplots are another way to visualize the relationship between two measured data variables. The x-axis represents one variable, the y-axis another, and each point is a record.
We can do this with the plot command:
plot(etf$GM, etf$DGX, xlab="GM", ylab="DGX")
It appears that there is a slight correlation between the returns of the two stocks. We can check this with the cor() command.
cor(etf$GM, etf$DGX)
## [1] 0.1484631
Another way to plot two numerical features is hexagonal binning. One example is the relationship between finished square feet and tax-assessed value for homes in King County.
Plotting individual points here would result in a large dark cloud. By grouping values into hexagonal bins and coloring them with a heatmap based on the number of records in each bin, we can see a positive relationship between square footage and tax-assessed value.
library(hexbin)
## Warning: package 'hexbin' was built under R version 3.4.4
kc_tax <- read.csv("kc_tax.csv")
kc_tax0 <- subset(kc_tax, TaxAssessedValue<750000 & SqFtTotLiving>100 & SqFtTotLiving<3500)
ggplot(kc_tax0, (aes(x=SqFtTotLiving, y=TaxAssessedValue))) +
stat_binhex(color="white") +
theme_bw() +
scale_fill_gradient(low="white", high="black") +
labs(x="Finished Square Feet", y="Tax Asessed Value")
Contour plots also visualize the relationship between two numeric variables, only this time in the form of a topographical map. Each contour band represents a specific density of points, increasing as one nears a “peak”.
ggplot(kc_tax0, aes(SqFtTotLiving, TaxAssessedValue)) +
theme_bw() +
geom_point(alpha=0.1)+
geom_density2d(colour="white")+
labs(x="Finished Square Feet", y="Tax Assessed Value")
A useful way to summarize two categorical variables is a contingency table - a table of counts by category.
In R, the CrossTable function in the descr package produces contingency tables. Here is an example with loans data:
library(descr)
## Warning: package 'descr' was built under R version 3.4.4
lc_loans <- read.csv("lc_loans.csv")
x_tab <- CrossTable(lc_loans$grade, lc_loans$status, prop.c = FALSE, prop.chisq = FALSE, prop.t = TRUE)
x_tab
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Table Total |
## |-------------------------|
##
## =====================================================================
## lc_loans$status
## lc_loans$grade Charged Off Current Fully Paid Late Total
## ---------------------------------------------------------------------
## A 1562 50051 20408 469 72490
## 0.022 0.690 0.282 0.006 0.161
## 0.003 0.111 0.045 0.001
## ---------------------------------------------------------------------
## B 5302 93852 31160 2056 132370
## 0.040 0.709 0.235 0.016 0.294
## 0.012 0.208 0.069 0.005
## ---------------------------------------------------------------------
## C 6023 88928 23147 2777 120875
## 0.050 0.736 0.191 0.023 0.268
## 0.013 0.197 0.051 0.006
## ---------------------------------------------------------------------
## D 5007 53281 13681 2308 74277
## 0.067 0.717 0.184 0.031 0.165
## 0.011 0.118 0.030 0.005
## ---------------------------------------------------------------------
## E 2842 24639 5949 1374 34804
## 0.082 0.708 0.171 0.039 0.077
## 0.006 0.055 0.013 0.003
## ---------------------------------------------------------------------
## F 1526 8444 2328 606 12904
## 0.118 0.654 0.180 0.047 0.029
## 0.003 0.019 0.005 0.001
## ---------------------------------------------------------------------
## G 409 1990 643 199 3241
## 0.126 0.614 0.198 0.061 0.007
## 0.001 0.004 0.001 0.000
## ---------------------------------------------------------------------
## Total 22671 321185 97316 9789 450961
## =====================================================================
Boxplots are a simple way to visually compare the distributions of a numeric variable grouped according to a categorical variable.
airline_stats <- read.csv("airline_stats.csv")
boxplot(pct_carrier_delay ~ airline, data=airline_stats, ylim=c(0,50))
A violin plot is another way to visualize how a distribution differs by category. It is an enhancement to the boxplot that adds a density estimate, with the density shown along the y-axis.
The density is mirrored and flipped over, and the resulting shape is filled in, creating the image of a violin.
The advantage it provides over a traditional boxplot is that it can show nuances in the distribution that cannot be seen in a boxplot.
You can create a violin plot with the geom_violin option in ggplot.
ggplot(data=airline_stats, aes(airline, pct_carrier_delay))+
ylim(0,50)+
geom_violin()+
labs(x="",y="Daily % of Delayed Flights")
## Warning: Removed 38 rows containing non-finite values (stat_ydensity).
As was previously discussed, facets can also be used to break down Numerical data by different categorical values:
ggplot(subset(kc_tax0, ZipCode %in% c(98188,98105,98126)),
aes(x=SqFtTotLiving, y=TaxAssessedValue)) +
stat_binhex(colour="white") +
theme_bw()+
scale_fill_gradient(low="white",high="blue")+
labs(x="Finished Squared Feet", y="Tax Assessed Value")+
facet_wrap("ZipCode")
The central limit theorem states that the distribution of sample means tends toward a normal distribution as the sample size grows, regardless of whether the population itself is normally distributed.
We can illustrate this in R by taking random samples of different sizes and plotting them:
#1 Random Sample
loans_income <- read.csv("loans_income.csv")
samp_data <- data.frame(income=sample(loans_income[1:50000,], size = 1000),type='data_dist')
#Sample of Means of 5 Values
samp_mean_05 <-
data.frame(income=tapply(sample(loans_income[1:50000,],1000*5),rep(1:1000,rep(5,1000)), FUN=mean), type = 'mean_of_5')
#Sample of Means of 20 Values
samp_mean_20 <- data.frame(income=tapply(sample(loans_income[1:50000,],1000*20),
rep(1:1000,rep(20,1000)), FUN=mean), type = 'mean_of_20')
#bind the data frames together and convert type to a factor
income <- rbind(samp_data, samp_mean_05, samp_mean_20)
income$type= factor(income$type, levels = c('data_dist','mean_of_5','mean_of_20'),labels = c('Data','Mean of 5','Mean of 20'))
ggplot(income, aes(x=income)) +
geom_histogram(bins=40)+
facet_grid(type ~ . )
One easy and effective way to estimate the sampling distribution of a statistic is to draw additional samples, with replacement, from the sample itself, recalculating the statistic for each resample. This is called the bootstrap, and it does not require any assumptions about the data or the sample statistic being normally distributed.
Conceptually, you can think of the bootstrap as replicating the original sample thousands or millions of times, thereby creating a hypothetical population using only your original sample. The steps for a bootstrap resampling of the mean, for example, are as follows:
1. Draw a sample value, record it, and replace it
2. Repeat n times
3. Record the mean of the n resampled values
4. Repeat steps 1-3 R times
5. Use the R results to: a) calculate the standard deviation, b) draw a histogram or boxplot, c) compute confidence intervals
R, the number of iterations of the bootstrap, is up to your discretion.
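Before turning to the boot package, here is a minimal hand-rolled sketch of these steps for the median (it assumes loans_income holds a single income column, as in the boot() call below; the object names are illustrative):
x <- loans_income[1:5000, ]
boot_medians <- replicate(1000, median(sample(x, length(x), replace = TRUE)))  # steps 1-4
sd(boot_medians)                          # bootstrap standard error
quantile(boot_medians, c(0.025, 0.975))   # a rough 95% confidence interval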
The R package boot combines all of these steps into one function. For example, the following applies the bootstrap to the incomes of people taking out loans:
library(boot)
## Warning: package 'boot' was built under R version 3.4.4
stat_fun <- function(x,idx) median(x[idx])
(boot_obj <- boot(loans_income[1:5000,], R = 1000, statistic = stat_fun))
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = loans_income[1:5000, ], statistic = stat_fun, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 63000 116.461 789.0006
Bootstrapping can be used with machine learning as well. With classification and regression trees, running multiple trees on bootstrap samples and then averaging their predictions (or, with classification, taking a majority vote) generally performs better than using a single tree. This process is called bagging (bootstrap aggregating).
The bell-shaped normal distribution is iconic in traditional statistics. The fact that distributions of sample statistics are often normally shaped has made it a powerful tool in the development of mathematical formulas that approximate those distributions.
In a normal distribution, 68% of the data lies within one standard deviation of the mean, 95% within two, and 99.7% within three. A standard normal distribution is one in which the units on the x-axis are expressed in standard deviations from the mean. A value is transformed by subtracting the mean of its sample or population and then dividing by the standard deviation; this is called normalization or standardization.
The transformed value is termed a z-score. Because normal distributions are so important, it is also important to know how to determine whether your data is normally distributed.
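A quick sketch of standardization in R (not in the original), using the state murder rates from earlier:
z <- as.vector(scale(state$Murder.Rate))  # (x - mean(x)) / sd(x)
head(z)
mean(z)   # approximately 0
sd(z)     # 1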
To check this, there are several tools available. One is a visual tool known as a QQ-plot, which arranges the data's z-scores in ascending order on the y-axis against the corresponding quantile of a normal distribution for each value's rank on the x-axis.
If the values in the data plot roughly on the diagonal line, then the sample distribution can be considered close to normal. Here is an example with a sample of 100 randomly generated values:
norm_samp <- rnorm(100)
qqnorm(norm_samp)
abline(a=0, b=1, col='grey')
Long-tailed distributions are like normal distributions, except that extreme values occur with greater frequency.
An example of a long-tailed distribution is stock returns. Here is a QQ-Plot that shows daily stock returns for NFLX:
nflx <- sp500_px[,'NFLX']
nflx <- diff(log(nflx[nflx>0]))
qqnorm(nflx)
abline(a=0,b=1,col = 'grey')
Compared to the normal distribution we randomly generated, you can see that there are more values below and above the line at the extremes.
The t-distribution is a normally shaped distribution, only thicker and longer in the tails. It is used heavily in sample statistics because distributions of sample statistics are often shaped like a t-distribution.
The larger the sample, the more normally shaped the t-distribution becomes.
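A small sketch (not in the original) comparing t densities to the normal density:
curve(dnorm(x), from = -4, to = 4, ylab = "density")   # standard normal
curve(dt(x, df = 3), add = TRUE, lty = 2)              # heavy-tailed t with 3 degrees of freedom
curve(dt(x, df = 30), add = TRUE, lty = 3)             # with 30 degrees of freedom, nearly normal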
A lot of analytics questions don’t deal with distributions of continuous variables, but rather with discrete variables. Discrete variables have a limited set of values, as opposed to the infinite possibilities of a continuous variable.
One such distribution is the binomial, which has only two outcomes: yes or no.
Digital analytics is a perfect example of when this would be used. A user either purchases or does not in a particular session. Click vs. don’t click and churn vs. don’t churn are other similar questions.
Central to the binomial distribution is the concept of trials, where each trial has two possible outcomes.
Flipping a coin 10 times, for example, is a binomial experiment with 10 trials. In this case, the probability of each outcome is 50/50. For ecommerce sites, however, conversion rates can be very low, with 1 (a conversion) being much rarer than 0.
To calculate binomial distribution probabilities in R, we can use the dbinom() function:
dbinom(x=2,5,p=0.1)
## [1] 0.0729
This tells us that the probability of getting exactly 2 successes in 5 trials, where the probability of success is 0.1, is 0.0729, or about 7.3%.
Often, however, we are interested in the probability of x or fewer successes in n trials. For this we use the pbinom() function:
pbinom(2,5,0.1)
## [1] 0.99144
The probability of getting two or fewer successes is about 99.1%.
The mean of a binomial distribution is n * p; this is the expected number of successes in n trials. The variance is n * p * (1 - p). With a large enough number of trials, the binomial distribution is well approximated by the normal distribution.
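For example, a quick check of these formulas for the 5-trial, p = 0.1 case above:
n <- 5; p <- 0.1
n * p            # expected number of successes: 0.5
n * p * (1 - p)  # variance: 0.45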
Another major concept in statistics is hypothesis testing: trying to answer a question scientifically using an experiment. You split a representative sample of a desired population into two groups through random selection, which lets you attribute any differences between the two groups to the difference in treatments.
Take, for example, an A/B test on a website. We can split our users into two different experiences and see which one performs better. In R, we will show this using example session data:
session_times <- read.csv("web_page_data.csv")
ggplot(session_times, aes(x=Page, y=Time))+
geom_boxplot()
Looking at the boxplot, it appears that page B has longer session times.
(mean_a <- mean(session_times[session_times['Page']=='Page A', 'Time']))
## [1] 1.263333
(mean_b <- mean(session_times[session_times['Page']=='Page B', 'Time']))
## [1] 1.62
Looking at the two pages, Page B does indeed have a higher mean session time. The question, however, is whether this difference is due to random chance or to the difference in treatments. We can answer this using a permutation test.
A permutation test combines all session times together, then repeatedly shuffles and reassigns them into groups of the original sample sizes. This essentially creates a large random population.
perm_fun <- function(x,n1,n2)
{
  n <- n1 + n2
  indx_b <- sample(1:n,n1)                        # randomly pick n1 indices for one group
  indx_a <- setdiff(1:n, indx_b)                  # the remaining indices form the other group
  mean_diff <- mean(x[indx_b]) - mean(x[indx_a])  # difference in group means for this shuffle
  return(mean_diff)
}
perm_diffs <- rep(0,1000)
for(i in 1:1000)
perm_diffs[i] = perm_fun(session_times[,'Time'],21,15)
hist(perm_diffs,xlab='Session time differences (in seconds)')
abline(v = mean_b - mean_a)
The histogram above shows that the mean difference of the random permutations often exceeds the observed difference in session times. This suggests that the observed difference between Page A and Page B is well within the range of random chance and thus not statistically significant.
Let’s look at another example with web conversion rates in a Price Test.
Again, we can apply a permutation test to see if the difference is within random chance variation.
obs_pct_diff <- 100*(200/23739 - 182/22588)
conversion <- c(rep(0,45945),rep(1,392))
perm_diffs <- rep(0,1000)
for(i in 1:1000)
perm_diffs[i] = 100*perm_fun(conversion,23739,22588)
hist(perm_diffs, xlab='Conversion Rate Percent Differences')
abline(v = obs_pct_diff)
As we can see from the histogram, the observed difference of 0.0368% is well within the range of chance variation.
Simply looking at the graph, however, is not a very precise way to determine statistical significance.
A p-value is the probability that you would see a difference at least as large as the observed one, assuming there is no real difference between the two groups.
mean(perm_diffs > obs_pct_diff)
## [1] 0.291
In this case, it appears that there is roughly a 29% chance that we would see the observed difference as a result of random chance alone. This is a high probability. In general, statisticians use 5% as the decision point for when an experiment is statistically significant.
In the above example, we didn’t necessarily have to use a permutation test to get a p-value. Since we are working with a binomial distribution and large observation counts, we can approximate the p-value using the normal distribution. In R, we do this with the prop.test() function:
prop.test(x=c(200,182), n=c(23739,22588),alternative = "greater")
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: c(200, 182) out of c(23739, 22588)
## X-squared = 0.14893, df = 1, p-value = 0.3498
## alternative hypothesis: greater
## 95 percent confidence interval:
## -0.001057439 1.000000000
## sample estimates:
## prop 1 prop 2
## 0.008424955 0.008057376
As you can see, we get a p-value of 0.3498, which is close to what we found from the permutation test.
We can also use a standard t-test, based on the t-distribution. This is commonly done when sample sizes are small and the normal approximation cannot be relied on.
t.test(Time ~ Page, data=session_times,alternative='less')
##
## Welch Two Sample t-test
##
## data: Time by Page
## t = -1.0983, df = 27.693, p-value = 0.1408
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 0.1959674
## sample estimates:
## mean in group Page A mean in group Page B
## 1.263333 1.620000
Here we can see that the p-value is 0.14, which is still greater than the conventional 5% significance threshold.
While the tests we discussed above work well for measuring differences between two groups, they don’t work if we are trying to compare more than two groups.
In this situation, we use what is called an ANOVA, or analysis of variance.
We can conduct an ANOVA using the aov() command. Another important command is aovp(), which is part of the lmPerm package; it conducts a permutation-test ANOVA. If you recall, a permutation test reshuffles all values based on a pooled sample.
library(lmPerm)
## Warning: package 'lmPerm' was built under R version 3.4.4
four_sessions <- read.csv("four_sessions.csv")
summary(aovp(Time ~ Page, data = four_sessions))
## [1] "Settings: unique SS "
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## Page 3 831.4 277.13 5000 0.0692 .
## Residuals 16 1618.4 101.15
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As we can see, the p-value from the permutation test is 0.0692, or about 7%. The column Iter lists the number of iterations taken in the permutation test. The other columns correspond to a traditional ANOVA.
F-statistic: similar to a t-statistic, the F-statistic is the ratio of the variance across group means to the variance due to residual error. The higher this ratio, the more statistically significant the result.
Below is a traditional anova:
summary(aov(Time ~ Page, data=four_sessions))
## Df Sum Sq Mean Sq F value Pr(>F)
## Page 3 831.4 277.1 2.74 0.0776 .
## Residuals 16 1618.4 101.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
You will also notice:
-DF: degrees of freedom
-Sum Sq: the sum of squared differences between each value and the mean for that group
Whereas an ANOVA is used to compare means across more than two groups, a chi-square test is used when comparing more than two proportions.
clicks <- read.csv("click_rates.csv")
chisq.test(x=c(14/1000,8/1000,12/1000), y=c('A','B','C') , simulate.p.value = TRUE)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: c(14/1000, 8/1000, 12/1000) and c("A", "B", "C")
## X-squared = 6, df = NA, p-value = 1
Machine learning is the use of statistical algorithms to predict a certain outcome using existing data. This outcome can be a categorical or a numerical value.
Regression tries to answer the question; is the variable X (or X1:X4) related to Y? And if so, is the relationship strong enough that we can use it to predict future values of Y?
Another important way regression can be used is in anomaly detection. Anomalous values may be important such as in credit card fraud or loan auditing.
Simple linear regression models the relationship between the magnitude of one variable and that of a second. Similar to correlation, it tries to determine whether or not as X increases, Y also increases. Or if when X increases, Y decreases.
The equation tries to estimate exactly how much Y will change when X changes by a certain amount.
The major caveat here is that a linear regression assumes that the relationship between the variables is linear. If it is not, the linear regression formula will do a poor job of capturing the true relationship.
Let's consider the following scatterplot of lung capacity (PEFR) versus exposure to cotton dust. How are the two related?
PERF <- read.csv("LungDisease.csv")
plot(PEFR ~ Exposure, data = PERF)
We can modify the style of the points using the pch argument:
plot(PEFR ~ Exposure, data = PERF,pch=20)
plot(PEFR ~ Exposure, data = PERF,pch="+")
You can see the full breadth of options with the following command:
plot(1:20,1:20,pch=1:20)
It's hard to tell, just by looking at the scatterplot, whether there is a strong relationship between the two variables. In this situation, we use a simple linear regression to find the "best" line for predicting our response PEFR as a function of the predictor variable Exposure.
model <- lm(PEFR ~ Exposure, data = PERF)
model
##
## Call:
## lm(formula = PEFR ~ Exposure, data = PERF)
##
## Coefficients:
## (Intercept) Exposure
## 424.583 -4.185
Here we can see that our y-intercept is 424.583. This represents the predicted lung capacity for someone with zero years of exposure to cotton dust.
Looking at the coefficient for Exposure, we can see that for every year a worker is exposed to cotton dust, the worker's predicted lung capacity is reduced by 4.185.
We can also call additional details related to our model such as the confidence intervals for the coefficients:
confint(model)
## 2.5 % 97.5 %
## (Intercept) 383.408147 465.757466
## Exposure -6.807942 -1.561211
The summary command provides the full detail for the model:
names(model)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
summary(model)
##
## Call:
## lm(formula = PEFR ~ Exposure, data = PERF)
##
## Residuals:
## Min 1Q Median 3Q Max
## -297.845 -58.783 -1.214 61.024 209.109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 424.583 20.796 20.417 < 2e-16 ***
## Exposure -4.185 1.325 -3.158 0.00201 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 101.4 on 120 degrees of freedom
## Multiple R-squared: 0.07674, Adjusted R-squared: 0.06905
## F-statistic: 9.974 on 1 and 120 DF, p-value: 0.002008
To see how well our best fit regression line fits the data we can plot it on our scatterplot with the abline() command
plot(PEFR ~ Exposure, data = PERF)
abline(model)
We can modify the width and color of the line with the additional lwd and col arguments:
plot(PEFR ~ Exposure, data = PERF)
abline(model,lwd=3,col="red")
As was previously discussed, we can use this regression equation to predict future values. We do this in R with the predict command.
fitted <- predict(model)
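Calling predict() without new data, as above, simply returns the fitted values for the original observations. To predict lung capacity for new exposure levels, we can pass a data frame to the newdata argument (a sketch with made-up exposure values):
new_exposure <- data.frame(Exposure = c(5, 15, 25))  # hypothetical exposure values for illustration
predict(model, newdata = new_exposure)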
In addition to predicting values, we can also measure how accurate our model is by evaluating the error, also known as the residuals:
resid <- residuals(model)
plot(resid)
R by default provides four diagnostic plots when plotting the output of an lm(). To see them all on one screen, we use the par() command with the mfrow option, which specifies the layout of the plotting panels; rows are designated first.
par(mfrow=c(2,2))
plot(model)
Multiple linear regression is the same concept as simple linear regression, except that instead of using one variable to predict Y, we use multiple independent variables.
As an example, we will use housing data for King County (Seattle), Washington. We will try to predict house prices using various features of each house:
house <- read.delim("house_sales.csv")
house_lm <- lm(AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms + Bedrooms + BldgGrade, data = house , na.action=na.omit)
house_lm
##
## Call:
## lm(formula = AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms +
## Bedrooms + BldgGrade, data = house, na.action = na.omit)
##
## Coefficients:
## (Intercept) SqFtTotLiving SqFtLot Bathrooms Bedrooms
## -5.219e+05 2.288e+02 -6.051e-02 -1.944e+04 -4.778e+04
## BldgGrade
## 1.061e+05
Looking at the results of the model, we can see that adding a square foot of living area increases the predicted value of a house by roughly $229.
The summary function reports the RSE, the residual standard error: the square root of the sum of squared residuals divided by n - p - 1.
The RSE is a measure of the error of our model; the higher the RSE, the worse the model performs.
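To make the formula concrete, here is a quick check that computes the RSE by hand from the model residuals; it should match the residual standard error reported by summary() below:
rse_manual <- sqrt(sum(residuals(house_lm)^2) / df.residual(house_lm))  # sqrt of the sum of squared residuals over n - p - 1
rse_manual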
summary(house_lm)
##
## Call:
## lm(formula = AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms +
## Bedrooms + BldgGrade, data = house, na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1199508 -118879 -20982 87414 9472982
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.219e+05 1.565e+04 -33.349 < 2e-16 ***
## SqFtTotLiving 2.288e+02 3.898e+00 58.699 < 2e-16 ***
## SqFtLot -6.051e-02 6.118e-02 -0.989 0.323
## Bathrooms -1.944e+04 3.625e+03 -5.362 8.32e-08 ***
## Bedrooms -4.778e+04 2.489e+03 -19.194 < 2e-16 ***
## BldgGrade 1.061e+05 2.396e+03 44.287 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 261200 on 22683 degrees of freedom
## Multiple R-squared: 0.5407, Adjusted R-squared: 0.5406
## F-statistic: 5340 on 5 and 22683 DF, p-value: < 2.2e-16
We also get additional outputs such as the R-squared, adjusted R-squared, F-statistic, and the p-value associated with our model.
As we can see, our model is statistically significant; however, the residual standard error is quite large.
The R-squared is roughly 0.54, which is not very strong in terms of how much variation is accounted for.
As you can see, the output of a model can get unwieldy as more variables are added. In such cases, it can help to be specific about which pieces of the model output you call. To do so, we can extract what we need using "$" and the output names, just as we would with a normal vector or data frame:
summary(model)$r.sq
## [1] 0.07674102
summary(model)$sigma
## [1] 101.4382
Many variables could be used as predictors in a regression. Adding more variables, however, does not necessarily improve the performance of a model. All things being equal, Occam's razor tells us that a simpler model is preferable to an unnecessarily complex one.
While including more variables always increases the R-squared of a model and decreases its RMSE on the training data, it may not increase its predictive power for new data.
To account for this, several measures have been created that penalize models for having more variables; one of the most commonly used is AIC (Akaike's Information Criterion), which the stepwise procedure below relies on.
To find the model that minimizes such a measure, we can use a couple of different approaches.
The MASS package offers a stepwise regression function called stepAIC:
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
step <- stepAIC(house_lm,direction = "both")
## Start: AIC=566015.4
## AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms + Bedrooms +
## BldgGrade
##
## Df Sum of Sq RSS AIC
## - SqFtLot 1 6.6750e+10 1.5481e+15 566014
## <none> 1.5481e+15 566015
## - Bathrooms 1 1.9622e+12 1.5500e+15 566042
## - Bedrooms 1 2.5142e+13 1.5732e+15 566379
## - BldgGrade 1 1.3386e+14 1.6819e+15 567895
## - SqFtTotLiving 1 2.3516e+14 1.7832e+15 569222
##
## Step: AIC=566014.3
## AdjSalePrice ~ SqFtTotLiving + Bathrooms + Bedrooms + BldgGrade
##
## Df Sum of Sq RSS AIC
## <none> 1.5481e+15 566014
## + SqFtLot 1 6.6750e+10 1.5481e+15 566015
## - Bathrooms 1 1.9283e+12 1.5501e+15 566041
## - Bedrooms 1 2.5075e+13 1.5732e+15 566377
## - BldgGrade 1 1.3392e+14 1.6821e+15 567895
## - SqFtTotLiving 1 2.3977e+14 1.7879e+15 569279
step
##
## Call:
## lm(formula = AdjSalePrice ~ SqFtTotLiving + Bathrooms + Bedrooms +
## BldgGrade, data = house, na.action = na.omit)
##
## Coefficients:
## (Intercept) SqFtTotLiving Bathrooms Bedrooms BldgGrade
## -522415.7 228.2 -19240.4 -47650.5 106138.4
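AIC can also be computed directly for any fitted model with base R's AIC() function, which is the criterion stepAIC is minimizing at each step. Note that stepAIC reports a version of AIC (from extractAIC) that differs from AIC() by a constant, so the absolute numbers differ even though the model comparisons agree:
AIC(house_lm)  # AIC of the full five-predictor model
AIC(step)      # AIC of the model selected by stepAIC; should be slightly lower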
Categorical variables take on a limited number of discrete values. Binary yes/no variables, also called indicator variables, are a special case of a factor variable.
Categorical variables can also be used in regression models; however, they must be recoded because regressions require numeric inputs. The most common approach is to convert them into a set of binary dummy variables (0/1).
In the King County housing data, there is a factor variable for the property type:
unique(house[,"PropertyType"])
## [1] Multiplex Single Family Townhouse
## Levels: Multiplex Single Family Townhouse
To use this factor variable we need to convert it to a set of binary variables. We do this by creating a binary variable for each possible value of the factor variable. To do this in R, we use the model.matrix function:
prop_type_dummies <- model.matrix(~PropertyType -1, data=house)
head(prop_type_dummies)
## PropertyTypeMultiplex PropertyTypeSingle Family PropertyTypeTownhouse
## 1 1 0 0
## 2 0 1 0
## 3 0 1 0
## 4 0 1 0
## 5 0 1 0
## 6 0 0 1
The categorical variable has three distinct levels, so it is translated into three columns here. Usually, a factor variable with P levels is represented by P-1 columns. This is done to avoid multicollinearity: the intercept term absorbs one of the factor levels as its base, and all other coefficients are calculated relative to this base level.
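For comparison, dropping the -1 from the formula shows the default P-1 coding, in which the first level (here Multiplex) is absorbed into the intercept (a quick sketch):
head(model.matrix(~ PropertyType, data = house))  # an intercept column plus two dummy columns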
house_lm <- lm(AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms + Bedrooms + BldgGrade + PropertyType, data=house)
In the example above, you can see that there is no coefficient for Multiplex, since it is implicitly defined when the other PropertyType dummies are set to 0.
The coefficients are interpreted relative to Multiplex, so a home that is Single Family is worth almost $85k less, and one that is a Townhouse is worth over $150k less.
Some factor variables can produce a huge number of binary dummies; zip codes, for example, are massive in quantity, with roughly 43,000 zip codes in the US. In such situations it can be helpful to consolidate the factors into groups.
zip_groups <- house %>% mutate(resid = residuals(house_lm)) %>%
group_by(ZipCode) %>%
summarise(med_resid = median(resid),cnt=n()) %>%
mutate(cum_cnt = cumsum(cnt), ZipGroup = ntile(cum_cnt,5))
house <- house %>% left_join(zip_groups, by='ZipCode')
house %>% dplyr::select(starts_with("ZipGroup")) %>% distinct()
## ZipGroup
## 1 1
## 2 5
## 3 2
## 4 3
## 5 4
In the example above, we calculated the median residual for each zip code and used the ntile function on the cumulative house count to split the zip codes into five groups.
As was previously discussed, some factor variables reflect ordered levels of a factor. These are called ordered factor variables.
A loan grade, for example, can be ordered into A, B, C, etc., with each grade carrying more risk than the prior grade. Ordered factor variables can be converted to numeric values and used as is. For example, BldgGrade is an ordered factor variable:
str(house[,"BldgGrade"])
## int [1:22689] 7 10 8 7 7 8 8 7 6 6 ...
In some cases, the effect of one variable is not independent of the others: one variable can amplify the effect of another. These are known as interactions.
In the housing model, for example, it is well known in real estate that, with location held constant, larger homes are more valuable. Additionally, the size of a home can have a multiplicative relationship with location when it comes to the home's value.
lm(AdjSalePrice ~ SqFtTotLiving*ZipGroup + SqFtLot + Bathrooms + Bedrooms + BldgGrade + PropertyType, data=house, na.action=na.omit)
##
## Call:
## lm(formula = AdjSalePrice ~ SqFtTotLiving * ZipGroup + SqFtLot +
## Bathrooms + Bedrooms + BldgGrade + PropertyType, data = house,
## na.action = na.omit)
##
## Coefficients:
## (Intercept) SqFtTotLiving
## -6.795e+05 2.645e+02
## ZipGroup SqFtLot
## 5.810e+04 1.138e-02
## Bathrooms Bedrooms
## -5.818e+03 -4.640e+04
## BldgGrade PropertyTypeSingle Family
## 1.114e+05 -6.569e+04
## PropertyTypeTownhouse SqFtTotLiving:ZipGroup
## -1.212e+05 -1.645e+01
In this model we see a new coefficient, SqFtTotLiving:ZipGroup, the interaction between SqFtTotLiving and ZipGroup. Because ZipGroup entered the model as a numeric variable, there is a single interaction term; if ZipGroup were converted to a factor, we would instead get a separate interaction coefficient for each zip group (see the sketch below). The interpretation is that the slope for SqFtTotLiving is no longer constant: it equals the main effect of roughly $265 per square foot plus the interaction coefficient multiplied by the value of ZipGroup. In other words, how much an additional square foot of living space is worth depends on which zip code group the house is in.
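As a sketch (not run here), converting ZipGroup to a factor would estimate a separate slope adjustment for each zip group:
lm(AdjSalePrice ~ SqFtTotLiving * factor(ZipGroup) + SqFtLot + Bathrooms + Bedrooms + BldgGrade + PropertyType, data = house, na.action = na.omit)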
One of the most important steps in building a regression model is validating the model's assumptions.
The assumptions underlying a regression model include a linear relationship between the predictors and the response, independence of the errors, constant error variance (homoscedasticity), and normally distributed residuals. Outliers and influential values can also undermine a model, as discussed below.
Generally speaking, an extreme value, also called an outlier, is one that is distant from most of the other observations. Just as outliers need to be handled for measures of location and variability, they can also cause major problems for regression models, because the regression will try to fit these noisy values, which takes away from its ability to capture the true relationship for typical values.
There are several ways to identify outliers. From a visual perspective, we discussed using the boxplot. We can also leverage the statistical rule behind the boxplot: any value more than 1.5 times the IQR beyond the quartiles is commonly treated as a likely outlier. We can identify those values in our data using the boxplot.stats function combined with its $out element. Here is an example below:
x <- rnorm(100)
y <- rnorm(100)
df <- data.frame(x, y)
rm(x, y)
(a <- with(df, which(x %in% boxplot.stats(x)$out)))
## [1] 83
(b <- with(df, which(y %in% boxplot.stats(y)$out)))
## integer(0)
(outlier.list1 <- intersect(a,b))
## integer(0)
plot(df)
points(df[outlier.list1,], col="red", pch="+", cex=2.5)
(outlier.list2 <- union(a,b))
## [1] 83
plot(df)
points(df[outlier.list2,], col="blue", pch="x", cex=1)
Why is it important to deal with outliers? For one, as previously mentioned, outliers can greatly influence any measure of variance or centrality. In a similar fashion, they can significantly alter the coefficient estimates in a regression model. Let's take the sales of houses in zip code 98105, for example:
house_98105 <- house[house$ZipCode == 98105,]
lm_98105 <- lm(AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms + Bedrooms + BldgGrade, data=house_98105)
We can extract the standardized residuals using the rstandard function to obtain the index of the smallest residual using the order function:
sresid <- rstandard(lm_98105)
indx <- order(sresid)
sresid[indx[1]]
## 20431
## -4.326732
As we can see, the biggest overestimate from the model is more than 4 standard errors above the regression line, corresponding to an overestimate of $757,753. The original data record for this outlier is as follows:
house_98105[indx[1],c('AdjSalePrice','SqFtTotLiving','SqFtLot','Bathrooms','Bedrooms','BldgGrade')]
## AdjSalePrice SqFtTotLiving SqFtLot Bathrooms Bedrooms BldgGrade
## 20431 119748 2900 7276 3 6 7
In this case, it appears that there is something wrong with this record: a house of this size typically sells for much more than $119,748 in that zip code. This is clearly an outlier.
One way to deal with outliers is to replace them with the mean of the data excluding the outliers. We can do this with the subset() command, which was introduced in the data frame section, or with the which() command, which can also be used to filter values. The sample below illustrates the mechanics using missing values:
x<- c(1:4, NA, 6:7, NA)
complete.cases(x)
## [1] TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE
is.na(x)
## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
which(is.na(x))
## [1] 5 8
x[which(is.na(x))]<-mean(x, na.rm=T)
x[is.na(x)]<- mean(x, na.rm=T)
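The same mechanics can be applied to outliers: identify them with boxplot.stats() as shown earlier, then overwrite them with the mean of the remaining values. Here is a sketch using the df object created above:
v <- df$x
out_idx <- which(v %in% boxplot.stats(v)$out)             # positions of boxplot-rule outliers
if (length(out_idx) > 0) v[out_idx] <- mean(v[-out_idx])  # replace with the mean of the non-outlier values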
Influential values are those whose absence would significantly change the regression equation. Influential values are said to have high leverage on the regression model.
There are different ways of measuring influence. Here are some of the most commonly used:
Hat values: a measure of leverage; a common rule of thumb flags observations whose hat value exceeds 2(P+1)/n.
Cook's distance: defines influence as a combination of leverage and residual size. An observation has high influence if its Cook's distance exceeds 4/(n-P-1).
An influence plot, or bubble plot, combines the standardized residuals, the hat values, and Cook's distance in a single plot. Here is an example of how to do this in R with the house data:
std_resid <- rstandard(lm_98105)
cooks_D <- cooks.distance(lm_98105)
hat_values <- hatvalues(lm_98105)
plot(hat_values,std_resid,cex=10*sqrt(cooks_D))
abline(h=c(-2.5,2.5),lty=2)
As you can see in the chart, we have plotted the standardized residuals on the y-axis and the hat values on the x-axis, and we have scaled the size of the bubbles by the value of Cook's distance.
Looking at the chart, we can see that there are several data points with large influence on the regression. Including these values can hurt the performance of our model, because it will try to fit these noisy points rather than the true relationship.
It is important to note, however, that the impact of these high-leverage points shrinks as the overall dataset grows. The larger the data, the less chance any single point has to dramatically change the regression equation.
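As a sketch, we can also flag the observations that exceed the Cook's distance rule of thumb mentioned above (here P = 5, the number of predictors in lm_98105):
n <- nrow(house_98105)
P <- 5                                          # predictors in lm_98105
influential <- which(cooks_D > 4 / (n - P - 1))
length(influential)                             # number of high-influence observations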
Linear regressions are relatively simple and, as a result, easy to interpret. However, there is one assumption of linear models that rarely holds in reality: the assumption of a linear relationship. In the real world, variables often do not have a linear relationship. In these cases, we use what are called non-linear regressions.
A polynomial regression involves adding polynomial terms to a regression equation. Take, for example, this quadratic polynomial regression (power of 2) for square footage in the King County housing data:
lm(AdjSalePrice ~ poly(SqFtTotLiving, 2) + SqFtLot + BldgGrade + Bathrooms + Bedrooms, data = house_98105)
##
## Call:
## lm(formula = AdjSalePrice ~ poly(SqFtTotLiving, 2) + SqFtLot +
## BldgGrade + Bathrooms + Bedrooms, data = house_98105)
##
## Coefficients:
## (Intercept) poly(SqFtTotLiving, 2)1 poly(SqFtTotLiving, 2)2
## -402530.47 3271519.49 776934.02
## SqFtLot BldgGrade Bathrooms
## 32.56 135717.06 -1435.12
## Bedrooms
## -9191.94
By adding the poly() function to the model, we can see that two coefficients have been added: one for the linear term and one for the quadratic term.
Polynomial regressions, unfortunately, capture only a certain amount of curvature in a non-linear relationship. As the degree of the polynomial increases, the regression equation becomes so flexible that the fit reaches an undesirable, wiggly extreme.
In such cases, statisticians have developed a superior approach that smoothly interpolates between fixed points, called knots. To create a spline in R we can use the splines package. The bs() function takes the knot locations we specify (here the quartiles) and the degree of the spline we would like to fit:
library(splines)
knots <- quantile(house_98105$SqFtTotLiving, p=c(.25,.5,.75))
lm_spline <- lm(AdjSalePrice ~ bs(SqFtTotLiving, knots = knots, degree=3) + SqFtLot + Bathrooms + Bedrooms + BldgGrade, data=house_98105)
lm_spline
##
## Call:
## lm(formula = AdjSalePrice ~ bs(SqFtTotLiving, knots = knots,
## degree = 3) + SqFtLot + Bathrooms + Bedrooms + BldgGrade,
## data = house_98105)
##
## Coefficients:
## (Intercept)
## -414157.61
## bs(SqFtTotLiving, knots = knots, degree = 3)1
## -199529.76
## bs(SqFtTotLiving, knots = knots, degree = 3)2
## -120580.64
## bs(SqFtTotLiving, knots = knots, degree = 3)3
## -71644.39
## bs(SqFtTotLiving, knots = knots, degree = 3)4
## 195677.89
## bs(SqFtTotLiving, knots = knots, degree = 3)5
## 845244.25
## bs(SqFtTotLiving, knots = knots, degree = 3)6
## 695545.67
## SqFtLot
## 33.33
## Bathrooms
## -4778.21
## Bedrooms
## -5778.70
## BldgGrade
## 134462.37
Here we can see the additional coefficients that have been added. However, it is essentially impossible to interpret them directly. Instead, it is more useful to use a visual display to reveal the nature of the spline fit.
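One way to do this, as a sketch, is to predict over a grid of SqFtTotLiving values while holding the other predictors at their medians, and overlay the resulting curve on the scatterplot:
grid <- data.frame(SqFtTotLiving = seq(min(house_98105$SqFtTotLiving), max(house_98105$SqFtTotLiving), length.out = 100),
                   SqFtLot = median(house_98105$SqFtLot), Bathrooms = median(house_98105$Bathrooms),
                   Bedrooms = median(house_98105$Bedrooms), BldgGrade = median(house_98105$BldgGrade))
plot(AdjSalePrice ~ SqFtTotLiving, data = house_98105, pch = 20)
lines(grid$SqFtTotLiving, predict(lm_spline, newdata = grid), lwd = 2, col = "red")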
A GAM (generalized additive model) is a technique used to automatically fit a spline regression without having to specify the knots. The mgcv package in R can be used to fit a GAM to the housing data:
library(mgcv)
## Warning: package 'mgcv' was built under R version 3.4.4
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
## This is mgcv 1.8-27. For overview type 'help("mgcv-package")'.
lm_gam <- gam(AdjSalePrice ~ s(SqFtTotLiving) + SqFtLot + Bathrooms + Bedrooms + BldgGrade, data = house_98105)
The term s() tells the gam function to find the best knots for a spline term on SqFtTotLiving.
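To visualize the fitted smooth, mgcv provides a plot method for gam objects:
plot(lm_gam)  # plots the estimated smooth of SqFtTotLiving with standard-error bands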
Regression is used to understand and predict quantitative variables. However, what do we do when we need to predict a categorical variable? In these cases, we are dealing with a classification problem.
The goal of classification models is to predict whether a record is a 0 or a 1 (transaction vs. non-transaction in a web session), or in some cases one of several categories (Gmail filtering into Primary, Social, Forums, etc.).
The way classification works is that we calculate the probability that a record belongs to a particular class; once that probability crosses a predefined cutoff, the record is assigned to that class. The step-by-step process is: 1) establish a cutoff probability, 2) estimate the probability that the record belongs to the class of interest, and 3) assign the record to the class if the estimated probability exceeds the cutoff.
Logistic regression is analogous to multiple linear regression, except the outcome is binary. Various transformations are employed to convert the problem to one in which a linear model can be fit.
The key ingredients in a logistic regression are the logistic response function and the logit, through which we map a probability onto a scale suitable for a linear model.
To run a logistic regression in R, the glm function is used with the family argument set to binomial. glm() stands for generalized linear models, and since logistic regression is linear on the logit scale, it is included in the generalized linear model family.
The syntax of glm() is similar to that of lm(), except that one must add the family argument. For logistic regression, the binomial family is used. Binomial distributions that are 0/1 in nature are also called Bernoulli distributions.
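The general form, as a sketch, looks like this (outcome, predictor1, predictor2, and model_data are placeholder names):
glm(outcome ~ predictor1 + predictor2, data = model_data, family = binomial)  # placeholder names for illustration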
We will be using stock market data consisting of 1,250 days of S&P 500 percentage returns from 2001 to 2005 to train our models. Each observation is a day, with the percentage returns for each of the five previous trading days, Lag1 through Lag5. Volume is also a feature, representing the number of shares traded (in billions), and Direction states whether the market was Up or Down on the date in question.
library(ISLR)
## Warning: package 'ISLR' was built under R version 3.4.4
attach(Smarket)
names(Smarket)
## [1] "Year" "Lag1" "Lag2" "Lag3" "Lag4" "Lag5"
## [7] "Volume" "Today" "Direction"
The pairs() function provides a scatterplot matrix:
pairs(Smarket)
The cor() function provides a correlation matrix but will return an error if given qualitative variables (factors). For that reason, we exclude Direction, which is a factor:
cor(Smarket[,-9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 0.029699649 0.030596422 0.033194581 0.035688718
## Lag1 0.02969965 1.000000000 -0.026294328 -0.010803402 -0.002985911
## Lag2 0.03059642 -0.026294328 1.000000000 -0.025896670 -0.010853533
## Lag3 0.03319458 -0.010803402 -0.025896670 1.000000000 -0.024051036
## Lag4 0.03568872 -0.002985911 -0.010853533 -0.024051036 1.000000000
## Lag5 0.02978799 -0.005674606 -0.003557949 -0.018808338 -0.027083641
## Volume 0.53900647 0.040909908 -0.043383215 -0.041823686 -0.048414246
## Today 0.03009523 -0.026155045 -0.010250033 -0.002447647 -0.006899527
## Lag5 Volume Today
## Year 0.029787995 0.53900647 0.030095229
## Lag1 -0.005674606 0.04090991 -0.026155045
## Lag2 -0.003557949 -0.04338321 -0.010250033
## Lag3 -0.018808338 -0.04182369 -0.002447647
## Lag4 -0.027083641 -0.04841425 -0.006899527
## Lag5 1.000000000 -0.02200231 -0.034860083
## Volume -0.022002315 1.00000000 0.014591823
## Today -0.034860083 0.01459182 1.000000000
Correlations between the lag variables and today's returns are close to zero. The only substantial correlation is between Year and Volume:
plot(Year,Volume)
Let’s go ahead and create our logistic regression model:
glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,data=Smarket,family=binomial)
summary(glm.fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Smarket)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.446 -1.203 1.065 1.145 1.326
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000 0.240736 -0.523 0.601
## Lag1 -0.073074 0.050167 -1.457 0.145
## Lag2 -0.042301 0.050086 -0.845 0.398
## Lag3 0.011085 0.049939 0.222 0.824
## Lag4 0.009359 0.049974 0.187 0.851
## Lag5 0.010313 0.049511 0.208 0.835
## Volume 0.135441 0.158360 0.855 0.392
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1741.6
##
## Number of Fisher Scoring iterations: 3
Looking at the coefficients, it appears that the market is more likely to go down today if it went up yesterday or the day before, while it is more likely to go up if the market had gone up three to five days ago. However, none of the coefficients' p-values are statistically significant, so we cannot put too much weight on these findings.
To extract the p-values associated with the coefficients, you can reference the column number just as with a data frame:
summary(glm.fit)$coef[,4]
## (Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
## 0.6006983 0.1452272 0.3983491 0.8243333 0.8514445 0.8349974
## Volume
## 0.3924004
To predict the probability that the market will go up for given values of the predictors, use the predict() function, just as in linear regression. The type="response" argument tells R to output probabilities as opposed to the logit. Note that if no dataset is specified, the model defaults to the dataset that was used to fit it:
glm.probs=predict(glm.fit,type="response")
glm.probs[1:10]
## 1 2 3 4 5 6 7
## 0.5070841 0.4814679 0.4811388 0.5152224 0.5107812 0.5069565 0.4926509
## 8 9 10
## 0.5092292 0.5176135 0.4888378
The contrasts function indicates that R has created a dummy variable with a 1 for Up:
contrasts(Direction)
## Up
## Down 0
## Up 1
In order to make a prediction as to whether the market will go up or down on a particular day, we must convert these predicted probabilities into class labels, "Up" or "Down". The following two commands create a vector of class predictions based on whether the predicted probability of a market increase is greater than or less than 0.5. You will notice that the vector has 1,250 observations (the total number of observations in the Smarket data):
glm.pred=rep("Down",1250)
Each "Down" value is then replaced with "Up" when the fitted probability is greater than our 50% threshold:
glm.pred[glm.probs>.5]="Up"
At the heart of classification metrics is the confusion matrix. The confusion matrix is a table showing the number of correct and incorrect predictions categorized by type of response. Several packages are available in R to compute a confusion matrix, but it is also easy to do so by hand.
We can create a confusion matrix of the results with the table() command. This will help us determine how many observations were correctly or incorrectly classified: the diagonals of the confusion matrix indicate correct predictions, while the off-diagonals represent incorrect guesses.
table(glm.pred,Direction)
## Direction
## glm.pred Down Up
## Down 145 141
## Up 457 507
(table(glm.pred,Direction)[2,2]+table(glm.pred,Direction)[1,1])/dim(Smarket)[1]
## [1] 0.5216
The mean() function can be used to compute the fraction of days for which the prediction was correct.
mean(glm.pred==Direction)
## [1] 0.5216
In cross-validation, we create a training and a test dataset, fit the model on the training data, and evaluate its error on the test data. This is a vital concept in data science: statistical algorithms are good at fitting the data they are given, but a model that looks good on its training data may perform poorly when applied to new data.
To do this in R, we create a Boolean (TRUE/FALSE) vector that identifies the training observations. In this case, we designate observations that occurred before 2005 as the training set. Once the training vector is created, say in an object called train, you can use it to create the test set by adding a ! in front of the object, thereby selecting the remaining observations.
The ! symbol reverses all of the elements of a Boolean vector: elements that are TRUE in train become FALSE in !train, and elements that are FALSE in train become TRUE in !train. Therefore, Smarket[!train,] yields a submatrix of the stock market data containing only the observations for which train is FALSE, that is, the observations with dates in 2005.
We can create the training indicator based on Year by flagging the observations with Year values less than 2005:
train=(Year<2005)
Smarket.2005=Smarket[!train,]
Smarket.2005 keeps only the observations from 2005, by excluding the rows of Smarket with Year values less than 2005.
You can confirm this by comparing the dimensions of the two datasets:
table(dim(Smarket.2005),dim(Smarket))
##
## 9 1250
## 9 1 0
## 252 0 1
You can also use the train object to exclude the training values from individual vectors:
Direction.2005=Direction[!train]
We will now fit a model to our training data and evaluate its performance on the test data. With the training indicator ready, you can restrict the model to the training observations using the subset argument:
glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,data=Smarket,family=binomial,subset=train)
With the model fit on the training data, we can look at the results on the test data. Note that the dataset specified when predicting the response is the test set:
glm.probs=predict(glm.fit,Smarket.2005,type="response")
As we did previously, we create a vector the length of the test set, with the response classes defaulted to "Down":
glm.pred=rep("Down",252)
We then replace those for which our logistic model predicts more than a 50% probability:
glm.pred[glm.probs>.5]="Up"
We create our confusion matrix
table(glm.pred,Direction.2005)
## Direction.2005
## glm.pred Down Up
## Down 77 97
## Up 34 44
Correct % Predicted
mean(glm.pred==Direction.2005)
## [1] 0.4801587
mean(glm.pred!=Direction.2005)
## [1] 0.5198413
We can see that our model has a very high error rate, higher than the 50% we would expect from random chance. Perhaps by removing the Lag fields that were not significant we can improve the model:
glm.fit=glm(Direction~Lag1+Lag2,data=Smarket,family=binomial,subset=train)
glm.probs=predict(glm.fit,Smarket.2005,type="response")
glm.pred=rep("Down",252)
glm.pred[glm.probs>.5]="Up"
table(glm.pred,Direction.2005)
## Direction.2005
## glm.pred Down Up
## Down 35 35
## Up 76 106
Error and Accuracy Rates:
mean(glm.pred==Direction.2005)
## [1] 0.5595238
mean(glm.pred!=Direction.2005)
## [1] 0.4404762
By removing the predictors with low predictive power, we have improved the accuracy of our model. If we would like to use our model to predict new values, we can use the newdata argument combined with type="response":
predict(glm.fit,newdata=data.frame(Lag1=c(1.2,1.5),Lag2=c(1.1,-0.8)),type="response")
## 1 2
## 0.4791462 0.4960939
LDA (linear discriminant analysis) is one of the most popular classification methods. When applied to the predictor variables, it maximizes the separation of the classes. It does this by emphasizing variation between groups as opposed to variation within groups: LDA focuses on maximizing the sum of squares between the groups relative to the sum of squares within the groups. The combination of predictors that maximizes this sum-of-squares ratio yields the greatest separation between the two groups.
library(MASS)
train=(Year<2005)
Smarket.2005=Smarket[!train,]
Direction.2005=Direction[!train]
lda.fit=lda(Direction~Lag1+Lag2,data=Smarket,subset=train)
lda.fit
## Call:
## lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.491984 0.508016
##
## Group means:
## Lag1 Lag2
## Down 0.04279022 0.03389409
## Up -0.03954635 -0.03132544
##
## Coefficients of linear discriminants:
## LD1
## Lag1 -0.6420190
## Lag2 -0.5135293
Plotting the model gives us a plot of the linear discriminants (the coefficients applied to the training observations):
plot(lda.fit)
The predict function returns a list of three elements: 1) class contains the predictions about the market's movement, 2) posterior is a matrix whose columns contain the posterior probability that each observation belongs to the corresponding class, and 3) x contains the linear discriminants.
lda.pred=predict(lda.fit, Smarket.2005)
names(lda.pred)
## [1] "class" "posterior" "x"
lda.class=lda.pred$class
With the predictions for each value created, we can now build the confusion matrix
table(lda.class,Direction.2005)
## Direction.2005
## lda.class Down Up
## Down 35 35
## Up 76 106
The results are almost identical to the logistic regression model:
mean(lda.class==Direction.2005)
## [1] 0.5595238
We can recreate the predictions made by the model by applying a 50% threshold to the posterior probabilities:
sum(lda.pred$posterior[,1]>=.5)
## [1] 70
sum(lda.pred$posterior[,1]<.5)
## [1] 182
We can also pull the first 20 observations and their posterior probabilities that the market will decrease:
lda.pred$posterior[1:20,1]
## 999 1000 1001 1002 1003 1004 1005
## 0.4901792 0.4792185 0.4668185 0.4740011 0.4927877 0.4938562 0.4951016
## 1006 1007 1008 1009 1010 1011 1012
## 0.4872861 0.4907013 0.4844026 0.4906963 0.5119988 0.4895152 0.4706761
## 1013 1014 1015 1016 1017 1018
## 0.4744593 0.4799583 0.4935775 0.5030894 0.4978806 0.4886331
This gives us the predicted direction instead of the probability:
lda.class[1:20]
## [1] Up Up Up Up Up Up Up Up Up Up Up Down Up Up
## [15] Up Up Up Down Up Up
## Levels: Down Up
If we only want to predict a market decrease when we are very confident, we can raise the posterior probability threshold from 50% to 90%. As we can see, there is no day for which the predicted probability of a decrease exceeds 90%:
sum(lda.pred$posterior[,1]>.9)
## [1] 0
We implement the QDA model using the qda() function, whose syntax works just like lda():
qda.fit=qda(Direction~Lag1+Lag2,data=Smarket,subset=train)
This provides us with the prior probabilities and group means, but no coefficients of linear discriminants, because the QDA classifier involves a quadratic rather than a linear function of the predictors:
qda.fit
## Call:
## qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.491984 0.508016
##
## Group means:
## Lag1 Lag2
## Down 0.04279022 0.03389409
## Up -0.03954635 -0.03132544
The predict function works the same way as for LDA:
qda.class=predict(qda.fit,Smarket.2005)$class
table(qda.class,Direction.2005)
## Direction.2005
## qda.class Down Up
## Down 30 20
## Up 81 121
mean(qda.class==Direction.2005)
## [1] 0.5992063
Looking at the results, the test error rate is less than 50%. This suggests that the quadratic form assumed by QDA may capture the true relationship more accurately than the linear forms assumed by logistic regression and LDA.
To perform KNN we can use the knn() function, which is part of the class library. It works differently than the previous approaches in the sense that it is done in one step, versus the two-step model-then-predict process we used in the other methods.
The function requires four inputs: 1) a matrix containing the predictors associated with the training data, 2) a matrix containing the predictors associated with the data for which we wish to make predictions, labeled test.X below, 3) a vector containing the class labels for the training observations, and 4) a value for K, the number of nearest neighbors to be used by the classifier.
library(class)
train.X=cbind(Lag1,Lag2)[train,]
test.X=cbind(Lag1,Lag2)[!train,]
train.Direction=Direction[train]
Note: set.seed() is used so that the random tie-breaking between nearest neighbors is reproducible:
set.seed(1)
knn.pred=knn(train.X,test.X,train.Direction,k=1)
Confusion Matrix:
table(knn.pred,Direction.2005)
## Direction.2005
## knn.pred Down Up
## Down 43 58
## Up 68 83
mean(knn.pred==Direction.2005)
## [1] 0.5
The error rate is 50%, so accuracy is no better than chance. Let's try a KNN model with K = 3. By increasing K to 3, we reduce variance and overfitting, which should increase test accuracy. The lower K is, the more flexible the model and the higher its variance.
knn.pred=knn(train.X,test.X,train.Direction,k=3)
table(knn.pred,Direction.2005)
## Direction.2005
## knn.pred Down Up
## Down 48 54
## Up 63 87
mean(knn.pred==Direction.2005)
## [1] 0.5357143
It appears that increasing K (the number of neighbors) improved accuracy slightly.
Next, let's apply KNN to the Caravan insurance data, which is also part of the ISLR package:
dim(Caravan)
## [1] 5822 86
attach(Caravan)
summary(Purchase)
## No Yes
## 5474 348
(348/5822)*100
## [1] 5.977327
5.97% of people purchase a caravan insurance policy
KNN uses measures of distance to identify neighbors, and the scale of the variables used to measure that distance can have a major impact on the output of the analysis. Salary measured in thousands, for example, will have a far greater effect on the distance than age, which only changes in increments of one.
A good way to handle this problem of scale is to standardize the data so that, instead of the raw values, the variables are translated into z-scores, in other words the number of standard deviations from the mean. This can be done in R using the scale function:
standardized.X=scale(Caravan[,-86])
var(Caravan[,1])
## [1] 165.0378
var(Caravan[,2])
## [1] 0.1647078
var(standardized.X[,1])
## [1] 1
var(standardized.X[,2])
## [1] 1
Comparing the variances before and after standardization, you can see that each standardized column has a variance of 1 instead of its original value.
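To see what scale() is doing under the hood, here is a quick check, as a sketch, that the first standardized column matches z-scores computed by hand:
z_by_hand <- (Caravan[,1] - mean(Caravan[,1])) / sd(Caravan[,1])
all.equal(as.numeric(z_by_hand), as.numeric(standardized.X[,1]))  # should be TRUE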
Model 1: KNN
Recall that KNN models require some slight pre-processing work. The model requires 4 inputs: 1) A matrix containing the predictors associated with the training data 2) A matrix containing the predictors associated with the data for which we wish to make predictions, labeled test.X below 3) A vector containing the class labels for the training observations 4) A value for K, the number of nearest neighbors to be used by the classifier
Test dataset is the first 1,000 observations
test=1:1000
Creating training and test datasets using the pre-defined test object
train.X=standardized.X[-test,]
test.X=standardized.X[test,]
train.Y=Purchase[-test]
test.Y=Purchase[test]
Building the Model
set.seed(1)
knn.pred=knn(train.X,test.X,train.Y,k=1)
Model performance
mean(test.Y!=knn.pred)
## [1] 0.118
mean(test.Y!="No")
## [1] 0.059
The test error rate is just under 12%. However, since only 6% of people purchased overall, we could get a lower error rate simply by predicting No for every observation. But overall test error is not the goal here: the goal is to predict purchases, not non-purchases. Therefore, we are really interested in how well the model does among the customers it predicts will purchase.
Confusion Matrix:
table(knn.pred,test.Y)
## test.Y
## knn.pred No Yes
## No 873 50
## Yes 68 9
9/(68+9)
## [1] 0.1168831
Among the 77 customers predicted to purchase, 9 (about 11.7%) actually do, which is roughly double the rate we would get from random guessing. We can also try re-running the model with a less flexible K value (3 vs. 1) to see if this improves performance:
knn.pred=knn(train.X,test.X,train.Y,k=3)
table(knn.pred,test.Y)
## test.Y
## knn.pred No Yes
## No 920 54
## Yes 21 5
5/26
## [1] 0.1923077
As you can see, we have improved this rate to 19% by increasing K slightly. Taking this further and continuing to increase K to 5, we get a rate of 26.7%:
knn.pred=knn(train.X,test.X,train.Y,k=5)
table(knn.pred,test.Y)
## test.Y
## knn.pred No Yes
## No 930 55
## Yes 11 4
4/15
## [1] 0.2666667
Model 2: Logistic Regression
Let's compare the results of our KNN model to those of a logistic regression and see whether that improves performance:
glm.fit=glm(Purchase~.,data=Caravan,family=binomial,subset=-test)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.probs=predict(glm.fit,Caravan[test,],type="response")
glm.pred=rep("No",1000)
glm.pred[glm.probs>.5]="Yes"
table(glm.pred,test.Y)
## test.Y
## glm.pred No Yes
## No 934 59
## Yes 7 0
The confusion matrix shows that, with a 50% cutoff, we correctly identify none of the customers who actually purchased. Let's lower the cutoff to 25%:
glm.pred=rep("No",1000)
glm.pred[glm.probs>.25]="Yes"
table(glm.pred,test.Y)
## test.Y
## glm.pred No Yes
## No 919 48
## Yes 22 11
11/(22+11)
## [1] 0.3333333
As you can see above, by lowering the cutoff to 25%, about 33% of the customers we predict will purchase actually do so, far better than random guessing. The choice of cutoff, like the choice of K earlier, should be driven by the goal of the analysis rather than by overall accuracy alone.