Source file ⇒ 2017-lec7.Rmd

last compiled on Wed Feb 8 00:20:19 2017

Announcements

none

Today:

  1. DC chapter 10 Joins and relational databases
  2. The role of the backtick ` in R
  3. DC chapter 11 gather and spread data verbs

Joins and relational databases

It’s rare that a data analysis involves only a single table of data. Typically you have many tables of data, and you must combine them to answer the questions that you’re interested in. Collectively, multiple tables of data are called relational data because it is the relations, not just the individual datasets, that are important.

Example: Grades and Enrollment

The data tables Grades and Courses gives grade and course related info in a school.

Grades <- read.csv("http://tiny.cc/mosaic/grades.csv")
sid grade sessionID
2197 S31680 B session3518
259 S31242 B session2897
4188 S32127 A session2002
3880 S32058 A- session2952
Courses <- read.csv("http://tiny.cc/mosaic/courses.csv")
sessionID dept level sem enroll iid
640 session2568 J 100 FA2002 15 inst223
76 session1940 d 100 FA2000 16 inst409
1218 session3242 m 200 SP2004 30 inst476

The primary and foreign key is sessionID.

A join is a data verb that combines left and right tables.

Your left table is usually your primary table (if there is one) and your right table has information you need not available in the left table.

Example: Average class size

Goal: Figure out the average class size for each student.

  • Student (sid) comes from Grades.
  • enroll comes from Courses table.
  • sessionID is in both tables.

Starting with Grades table we need Courses table to get enrollment for classes. Alternatively, starting with Courses you can find all the students in those Courses with the Grades table.

Grades %>%
  left_join(Courses) %>% sample_n(size=4)
sid grade sessionID dept level sem enroll iid
4043 S32061 B+ session3644 K 300 SP2005 18 inst239
1536 S31518 AU session2901 J 100 FA2003 22 inst224
1844 S31587 B- session3191 b 200 SP2004 10 inst403
3593 S31962 B session3821 W 300 SP2005 24 inst138
Courses %>%
  left_join(Grades) %>% sample_n(size=4)
sessionID dept level sem enroll iid sid grade
2050 session2454 Y 200 FA2002 13 inst397 S31587 C+
3823 session2990 m 200 FA2003 20 inst476 S32310 B+
3354 session2851 O 300 SP2003 14 inst308 S31509 A
5405 session3573 M 100 SP2005 23 inst263 S31830 S

Once Courses and Grades are joined, it’s straightforward to find the average enrollment seen by each student.

AveClassEachStudent <- Grades %>% 
  left_join(Courses) %>%
  group_by(sid) %>%
  summarise(ave_enroll = mean(enroll, na.rm=TRUE))
sid ave_enroll
S32124 25.10000
S32154 22.93333
S32415 20.87500

Establishing a match between cases

  • You specify what is the primary and foreign key.
  • Cases must have exactly equal values in the left primary key and right foreign key for a match to be made.

Example:

Grades %>% 
  left_join(Courses, by = c(sessionID = "sessionID")) %>%
  head(4)
sid grade sessionID dept level sem enroll iid
S31185 D+ session1784 M 100 FA1991 22 inst265
S31185 B+ session1785 k 100 FA1991 52 inst458
S31185 A- session1791 J 100 FA1993 22 inst223
S31185 B+ session1792 J 300 FA1993 20 inst235

The default value of by= is all variables with the same names in both tables.
- This is not reliable unless you’ve checked.

Grades %>% 
  left_join(Courses) %>%
  head(4)
sid grade sessionID dept level sem enroll iid
S31185 D+ session1784 M 100 FA1991 22 inst265
S31185 B+ session1785 k 100 FA1991 52 inst458
S31185 A- session1791 J 100 FA1993 22 inst223
S31185 B+ session1792 J 300 FA1993 20 inst235

If the names match you can write:

Grades %>% 
  left_join(Courses, by="sessionID") %>%
  head(4)
sid grade sessionID dept level sem enroll iid
S31185 D+ session1784 M 100 FA1991 22 inst265
S31185 B+ session1785 k 100 FA1991 52 inst458
S31185 A- session1791 J 100 FA1993 22 inst223
S31185 B+ session1792 J 300 FA1993 20 inst235

Example (nycflights)

#install.packages("nycflights13")
library(nycflights13)

The data table nycflights13::flights has many variables of which we select just a few

flights2 <- flights %>% 
  select(year:day, hour, tailnum, carrier)
head(flights2,2)
year month day hour tailnum carrier
2013 1 1 5 N14228 UA
2013 1 1 5 N24211 UA

The data table airlines lets you look up the full carrier name from its abbreviated code:

head(airlines,2)
carrier name
9E Endeavor Air Inc.
AA American Airlines Inc.

The data tables weather give the weather at each NYC airport for each hour:

head(weather,2)
origin year month day hour temp dewp humid wind_dir wind_speed wind_gust precip pressure visib time_hour
EWR 2013 1 1 0 37.04 21.92 53.97 230 10.35702 11.91865 0 1013.9 10 2012-12-31 16:00:00
EWR 2013 1 1 1 37.04 21.92 53.97 230 13.80936 15.89154 0 1013.0 10 2012-12-31 17:00:00

One way to show the relationships between the different tables is with a drawing:

For nycflights13:

flights connects to airlines through the carrier variable.

flights connects to weather via origin (the location), and year, month, day and hour (the time).

Imagine you want to add the full airline name to the flights2 data. You can combine the airlines and flights2 data frames with left_join():

flights2 %>%
  left_join(airlines, by="carrier") %>%
  head(3)
year month day hour tailnum carrier name
2013 1 1 5 N14228 UA United Air Lines Inc.
2013 1 1 5 N24211 UA United Air Lines Inc.
2013 1 1 5 N619AA AA American Airlines Inc.

The result of joining airlines to flights2 is an additional variable: name.

Underderstanding joins

To help you learn how joins work, I’m going to use a visual representation:

The coloured column represents the “key” variable: these are used to match the rows between the tables. The grey column represents the “value” column that is carried along for the ride. In these examples I’ll show a single key variable, but the idea generalises in a straightforward way to multiple keys and multiple values.

A join is a way of connecting each row in x to zero, one, or more rows in y. The following diagram shows each potential match as an intersection of a pair of lines.

In an actual join, matches will be indicated with dots. The number of dots = the number of matches = the number of rows in the output.

Inner Join

The simplest type of join is the inner join. An inner join matches pairs of observations whenever their keys are equal:

The output of an inner join is a new data frame that contains the key, the x values, and the y values. We use by= to tell dplyr which variable is the key:

x %>% 
  inner_join(y, by="key")

The most important property of an inner join is that unmatched rows are not included in the result. This means that generally inner joins are usually not appropriate for use in analysis because it’s too easy to lose observations.

Outer Joins

An inner join keeps observations that appear in both tables. An outer join keeps observations that appear in at least one of the tables. There are three types of outer joins:

  • A left join keeps all observations in x.
  • A right join keeps all observations in y.
  • A full join keeps all observations in x and y.

These joins work by adding an additional “virtual” observation to each table. This observation has a key that always matches (if no other key matches), and a value filled with NA.

Graphically, that looks like:

The most commonly used join is the left join: you use this whenever you look up additional data from another table, because it preserves the original observations even when there isn’t a match. For example, lets say you want a list of all the employees in a buisiness plus any dependents. One table called EMPLOYEE contains a list of all employees and another table called DEPENDENTS contains information of all employee’s dependents.You need a left join here because if an employee doesn’t have a dependent, then their record won’t show up at all – because there’s no matching record in the DEPENDENT table. So, you use a left join which keeps all the data on the “left” (i.e. the first table) and pulls in any matching data on the “right” (the second table).

The left join should be your default join: use it unless you have a strong reason to prefer one of the others. For example sometimes when data wrangling you don’t want to have NA in your table so you use an inner join.

Another way to depict the different types of joins is with a Venn diagram:

More complicated example

Suppose you have these two tables:

  • Left: cases are medical clinics. The variables: clinicName, postalCode.
clinicName postalCode
A 22120
B 35752
C 56718
D 35752
E 67756
F 69129
G 73455
H 73455
I 76292
  • Right: cases are postal codes. Variables reflect the demographics of that postal code: postalCode, over65, etc.
over65 postalCode
0.46 35752
0.72 22120
0.93 22120
0.26 92332
0.46 84739
0.94 67756

The diagram below shows the cases in the left and right tables. The lines show the matches between left and right. The cases connected by a match are the overlap cases; there are five of them in the diagram. Cases without a match also appear in both the left and right tables.

An inner join gives the matching pairs. Note that clinic A, which had two matches in the right table, appears twice, once for each matching pair in which clinic A is involved.

LL %>% inner_join(RR)
clinicName postalCode over65
A 22120 0.72
A 22120 0.93
B 35752 0.46
D 35752 0.46
E 67756 0.94

An outer join can include cases where there is no match. You might want to include the unmatched cases from the left table, from the right table, or from both tables.

Unmatched cases from the left table
LL %>% left_join( RR)
clinicName postalCode over65
A 22120 0.72
A 22120 0.93
B 35752 0.46
C 56718 NA
D 35752 0.46
E 67756 0.94
F 69129 NA
G 73455 NA
H 73455 NA
I 76292 NA
Unmatched cases from the right table
LL %>%  right_join(RR) 
clinicName postalCode over65
B 35752 0.46
D 35752 0.46
A 22120 0.72
A 22120 0.93
NA 92332 0.26
NA 84739 0.46
E 67756 0.94
Unmatched cases from both tables
LL %>% full_join(RR)
clinicName postalCode over65
A 22120 0.72
A 22120 0.93
B 35752 0.46
C 56718 NA
D 35752 0.46
E 67756 0.94
F 69129 NA
G 73455 NA
H 73455 NA
I 76292 NA
NA 92332 0.26
NA 84739 0.46

In class exercise

install.packages("fivethirtyeight")
install.packages("exercisedata")
library(fivethirtyeight)
library(exercisedata)
head(bad_drivers)
state num_drivers perc_speeding perc_alcohol perc_not_distracted perc_no_previous insurance_premiums losses
Alabama 18.8 39 30 96 80 784.55 145.08
Alaska 18.1 41 25 90 94 1053.48 133.93
Arizona 18.6 35 28 84 96 899.47 110.35
Arkansas 22.4 18 26 94 95 827.34 142.39
California 12.0 35 28 91 89 878.41 165.63
Colorado 13.6 37 28 79 95 835.50 139.91
head(State_pop_area)
state pop_2015 area
New Jersey 8958013 7354
Rhode Island 1056298 1034
Massachusetts 6794422 7800
Connecticut 3590886 4842
Maryland 6006401 9707
Delaware 945934 1949

Next you will join these two data tables to make a glyph ready table

http://gandalf.berkeley.edu:3838/alucas/Chapter-10-collection/

2 The role of the backtick ` in R

This will be important when we discuss the data verb gather below.

We saw in lecture 2 about rules for naming objects in R. For example, %>% or the number 19 can’t be an object name. But sometimes we want these to be names. We can get around this by putting our illegal name between backticks. For example:

`%>%` <- "pipe"
`%>%`
## [1] "pipe"
`19` <- 24
`19`
## [1] 24

3. Wide versus Narrow data tables (chapter 11)

Two data verbs we will be using are tidyr::narrow() and tidyr::gather(). These aren’t part of the DataComputing package you will need to install it once at the console:

install.packages("tidyr")

Remember to load it in under library(DataComputing) at the top of your Rmd document.

library(tidyr)

A data table can be presented in wide or narrow format. Each have their own advantatges.

Wide format is easier to get the difference of sbp (systolic blood pressure) before and after of a test for each patient. This table isn’t tidy as discussed in lecture 1.

BP_wide
subject before after
BHO 120 160
GWB 115 135
WJC 105 145

Below is a narrow format (tidy). More often your graphics will require a tidy data table.

BP_narrow
subject when sbp
BHO before 160
GWB before 115
WJC after 145
GWB after 135
WJC before 105
BHO after 160

The data verbs ’spread()andgather()` convert between these formats.

gather() transforms wide into narrow

A common problem is a dataset where some of the column names are not names of variables, but values of a variable. You use gather() when you notice that you have columns that are not variables. To tidy a dataset like this, we need to gather those columns into a new pair of variables. To describe that operation we need three parameters:

  1. key: The name of the variable whose values form the column names. I call that the key, and here it is when.

  2. value: The name of the variable whose values are spread over the cells. I call that value, and here it’s the sbp measurements.

  3. The set of columns that represent values, not variables. In this example, those are the columns before and after. Use bare variable names by putting the entries in backticks. For example if the entry in the column is the sting is 19, write it as 19.

BP_narrow1 <-  BP_wide %>%
  gather(key= when, value = sbp, `before`, `after`)  # before or after are legal names so it isn't necessary to use backticks (but it doesn't hurt)
BP_narrow1
subject when sbp
BHO before 120
GWB before 115
WJC before 105
BHO after 160
GWB after 135
WJC after 145

spread() transforms narrow into wide

Spreading is the opposite of gathering. You use it when an observation is scattered across multiple rows.

To take a narrow data table and make it wide we only need two parameters:

  1. The column that contains variable names; the key column. Here, it’s when.

  2. The column that contains values that will be spread; the value column. Here it’s ‘sbp’.

BP_wide1 <-  BP_narrow %>% 
  spread(key= when, value = sbp)
BP_wide1
subject after before
BHO 160 160
GWB 135 115
WJC 145 105

in class exercise

We make a data table of country populations in different years.

a <- c("Afghanistan","Brazil", "China")
b <- c(745,37737,212258)
c <- c(266,80488,213766)
df_wide <- data.frame(a,b,c)
names(df_wide) <- c("countries","1999", "2000")
df_wide
countries 1999 2000
Afghanistan 745 266
Brazil 37737 80488
China 212258 213766

We wish to gather this table into tidy form.

countries year population
Afghanistan 1999 745
Brazil 1999 37737
China 1999 212258
Afghanistan 2000 266
Brazil 2000 80488
China 2000 213766

With paper and pencil fill in the necessary code:

df_narrow <- df_wide %>% 
  gather(fill__this__out)
df_narrow

example

Is the following data set narrow or wide? Convert it to the other data table format.

## Source: local data frame [3 x 3]
## Groups: name [3]
## 
##    name       F      M
## * <chr>   <int>  <int>
## 1 Allen    1836 263375
## 2  Mary 4112464  15151
## 3   Sue  144410    519
## Source: local data frame [6 x 3]
## Groups: name [3]
## 
##    name   sex   value
##   <chr> <chr>   <int>
## 1 Allen     F    1836
## 2  Mary     F 4112464
## 3   Sue     F  144410
## 4 Allen     M  263375
## 5  Mary     M   15151
## 6   Sue     M     519

i-clicker questions

Answ: yes

Answ: none of the above—key is sex

example

Lets examine the wide iris data table:

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Suppose you want to make the following plot:

The data table iris isn’t gyph ready. Here is the glyph ready table:

##   Species  Part Measure Value
## 1  setosa Sepal  Length   5.1
## 2  setosa Sepal  Length   4.9
## 3  setosa Sepal  Length   4.7
## 4  setosa Sepal  Length   4.6
## 5  setosa Sepal  Length   5.0
## 6  setosa Sepal  Length   5.4

step 1: Use gather

iris_narrow <- iris %>%
  gather(key, Value, -Species) %>%  #here -Species means all columns except Species
  head()
iris_narrow %>% head()
##   Species          key Value
## 1  setosa Sepal.Length   5.1
## 2  setosa Sepal.Length   4.9
## 3  setosa Sepal.Length   4.7
## 4  setosa Sepal.Length   4.6
## 5  setosa Sepal.Length   5.0
## 6  setosa Sepal.Length   5.4

step 2: Use the data verb separate()

iris_narrow_sep <- iris_narrow %>% separate(key, into=c("Part", "Measure"), sep="\\.")
head(iris_narrow_sep)
##   Species  Part Measure Value
## 1  setosa Sepal  Length   5.1
## 2  setosa Sepal  Length   4.9
## 3  setosa Sepal  Length   4.7
## 4  setosa Sepal  Length   4.6
## 5  setosa Sepal  Length   5.0
## 6  setosa Sepal  Length   5.4

in class exercise:

Show how we use seperate, gather and spread to go from the top to the bottom table:

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Flower
## 1          5.1         3.5          1.4         0.2  setosa      1
## 2          4.9         3.0          1.4         0.2  setosa      2
## 3          4.7         3.2          1.3         0.2  setosa      3
## 4          4.6         3.1          1.5         0.2  setosa      4
## 5          5.0         3.6          1.4         0.2  setosa      5
## 6          5.4         3.9          1.7         0.4  setosa      6
iris.wide <- iris %>%
  gather(key, value, -Species, -Flower) 
iris.wide %>% head()
##   Species Flower          key value
## 1  setosa      1 Sepal.Length   5.1
## 2  setosa      2 Sepal.Length   4.9
## 3  setosa      3 Sepal.Length   4.7
## 4  setosa      4 Sepal.Length   4.6
## 5  setosa      5 Sepal.Length   5.0
## 6  setosa      6 Sepal.Length   5.4
iris.wide <- iris %>%
  gather(key, value, -Species, -Flower) %>%
  separate(key, c("Part", "Measure"), "\\.") 
iris.wide %>% head()
##   Species Flower  Part Measure value
## 1  setosa      1 Sepal  Length   5.1
## 2  setosa      2 Sepal  Length   4.9
## 3  setosa      3 Sepal  Length   4.7
## 4  setosa      4 Sepal  Length   4.6
## 5  setosa      5 Sepal  Length   5.0
## 6  setosa      6 Sepal  Length   5.4
iris.wide <- iris %>%
  gather(key, value, -Species, -Flower) %>%
  separate(key, c("Part", "Measure"), "\\.") %>%
  spread(Measure, value) 
iris.wide %>% head()
##   Species Flower  Part Length Width
## 1  setosa      1 Petal    1.4   0.2
## 2  setosa      1 Sepal    5.1   3.5
## 3  setosa      2 Petal    1.4   0.2
## 4  setosa      2 Sepal    4.9   3.0
## 5  setosa      3 Petal    1.3   0.2
## 6  setosa      3 Sepal    4.7   3.2

To Do: Read chapter 10 and 11 and do HW 3