Source file ⇒ 2017-lec7.Rmd

last compiled on Wed Feb 8 00:20:19 2017

Announcements

none

Today:

DC chapter 10 Joins and relational databases
The role of the backtick ` in R
DC chapter 11 gather and spread data verbs

Joins and relational databases

It’s rare that a data analysis involves only a single table of data. Typically you have many tables of data, and you must combine them to answer the questions that you’re interested in. Collectively, multiple tables of data are called relational data because it is the relations, not just the individual datasets, that are important.

Example: Grades and Enrollment

The data tables Grades and Courses gives grade and course related info in a school.

Grades <- read.csv("http://tiny.cc/mosaic/grades.csv")

	sid	grade	sessionID
2197	S31680	B	session3518
259	S31242	B	session2897
4188	S32127	A	session2002
3880	S32058	A-	session2952

Courses <- read.csv("http://tiny.cc/mosaic/courses.csv")

	sessionID	dept	level	sem	enroll	iid
640	session2568	J	100	FA2002	15	inst223
76	session1940	d	100	FA2000	16	inst409
1218	session3242	m	200	SP2004	30	inst476

The primary and foreign key is sessionID.

A join is a data verb that combines left and right tables.

Your left table is usually your primary table (if there is one) and your right table has information you need not available in the left table.

Example: Average class size

Goal: Figure out the average class size for each student.

Student (sid) comes from Grades.
enroll comes from Courses table.
sessionID is in both tables.

Starting with Grades table we need Courses table to get enrollment for classes. Alternatively, starting with Courses you can find all the students in those Courses with the Grades table.

Grades %>%
  left_join(Courses) %>% sample_n(size=4)

	sid	grade	sessionID	dept	level	sem	enroll	iid
4043	S32061	B+	session3644	K	300	SP2005	18	inst239
1536	S31518	AU	session2901	J	100	FA2003	22	inst224
1844	S31587	B-	session3191	b	200	SP2004	10	inst403
3593	S31962	B	session3821	W	300	SP2005	24	inst138

Courses %>%
  left_join(Grades) %>% sample_n(size=4)

	sessionID	dept	level	sem	enroll	iid	sid	grade
2050	session2454	Y	200	FA2002	13	inst397	S31587	C+
3823	session2990	m	200	FA2003	20	inst476	S32310	B+
3354	session2851	O	300	SP2003	14	inst308	S31509	A
5405	session3573	M	100	SP2005	23	inst263	S31830	S

Once Courses and Grades are joined, it’s straightforward to find the average enrollment seen by each student.

AveClassEachStudent <- Grades %>% 
  left_join(Courses) %>%
  group_by(sid) %>%
  summarise(ave_enroll = mean(enroll, na.rm=TRUE))

sid	ave_enroll
S32124	25.10000
S32154	22.93333
S32415	20.87500

Establishing a match between cases

You specify what is the primary and foreign key.
Cases must have exactly equal values in the left primary key and right foreign key for a match to be made.

Example:

Grades %>% 
  left_join(Courses, by = c(sessionID = "sessionID")) %>%
  head(4)

sid	grade	sessionID	dept	level	sem	enroll	iid
S31185	D+	session1784	M	100	FA1991	22	inst265
S31185	B+	session1785	k	100	FA1991	52	inst458
S31185	A-	session1791	J	100	FA1993	22	inst223
S31185	B+	session1792	J	300	FA1993	20	inst235

The default value of by= is all variables with the same names in both tables.
- This is not reliable unless you’ve checked.

Grades %>% 
  left_join(Courses) %>%
  head(4)

sid	grade	sessionID	dept	level	sem	enroll	iid
S31185	D+	session1784	M	100	FA1991	22	inst265
S31185	B+	session1785	k	100	FA1991	52	inst458
S31185	A-	session1791	J	100	FA1993	22	inst223
S31185	B+	session1792	J	300	FA1993	20	inst235

If the names match you can write:

Grades %>% 
  left_join(Courses, by="sessionID") %>%
  head(4)

sid	grade	sessionID	dept	level	sem	enroll	iid
S31185	D+	session1784	M	100	FA1991	22	inst265
S31185	B+	session1785	k	100	FA1991	52	inst458
S31185	A-	session1791	J	100	FA1993	22	inst223
S31185	B+	session1792	J	300	FA1993	20	inst235

Example (nycflights)

#install.packages("nycflights13")
library(nycflights13)

The data table nycflights13::flights has many variables of which we select just a few

flights2 <- flights %>% 
  select(year:day, hour, tailnum, carrier)
head(flights2,2)

year	month	day	hour	tailnum	carrier
2013	1	1	5	N14228	UA
2013	1	1	5	N24211	UA

The data table airlines lets you look up the full carrier name from its abbreviated code:

head(airlines,2)

carrier	name
9E	Endeavor Air Inc.
AA	American Airlines Inc.

The data tables weather give the weather at each NYC airport for each hour:

head(weather,2)

origin	year	month	day	hour	temp	dewp	humid	wind_dir	wind_speed	wind_gust	precip	pressure	visib	time_hour
EWR	2013	1	1	0	37.04	21.92	53.97	230	10.35702	11.91865	0	1013.9	10	2012-12-31 16:00:00
EWR	2013	1	1	1	37.04	21.92	53.97	230	13.80936	15.89154	0	1013.0	10	2012-12-31 17:00:00

One way to show the relationships between the different tables is with a drawing:

For nycflights13:

flights connects to airlines through the carrier variable.

flights connects to weather via origin (the location), and year, month, day and hour (the time).

Imagine you want to add the full airline name to the flights2 data. You can combine the airlines and flights2 data frames with left_join():

flights2 %>%
  left_join(airlines, by="carrier") %>%
  head(3)

year	month	day	hour	tailnum	carrier	name
2013	1	1	5	N14228	UA	United Air Lines Inc.
2013	1	1	5	N24211	UA	United Air Lines Inc.
2013	1	1	5	N619AA	AA	American Airlines Inc.

The result of joining airlines to flights2 is an additional variable: name.

Underderstanding joins

To help you learn how joins work, I’m going to use a visual representation:

The coloured column represents the “key” variable: these are used to match the rows between the tables. The grey column represents the “value” column that is carried along for the ride. In these examples I’ll show a single key variable, but the idea generalises in a straightforward way to multiple keys and multiple values.

A join is a way of connecting each row in x to zero, one, or more rows in y. The following diagram shows each potential match as an intersection of a pair of lines.

In an actual join, matches will be indicated with dots. The number of dots = the number of matches = the number of rows in the output.

Inner Join

The simplest type of join is the inner join. An inner join matches pairs of observations whenever their keys are equal:

The output of an inner join is a new data frame that contains the key, the x values, and the y values. We use by= to tell dplyr which variable is the key:

x %>% 
  inner_join(y, by="key")

The most important property of an inner join is that unmatched rows are not included in the result. This means that generally inner joins are usually not appropriate for use in analysis because it’s too easy to lose observations.

Outer Joins

An inner join keeps observations that appear in both tables. An outer join keeps observations that appear in at least one of the tables. There are three types of outer joins:

A left join keeps all observations in x.
A right join keeps all observations in y.
A full join keeps all observations in x and y.

These joins work by adding an additional “virtual” observation to each table. This observation has a key that always matches (if no other key matches), and a value filled with NA.

Graphically, that looks like:

The most commonly used join is the left join: you use this whenever you look up additional data from another table, because it preserves the original observations even when there isn’t a match. For example, lets say you want a list of all the employees in a buisiness plus any dependents. One table called EMPLOYEE contains a list of all employees and another table called DEPENDENTS contains information of all employee’s dependents.You need a left join here because if an employee doesn’t have a dependent, then their record won’t show up at all – because there’s no matching record in the DEPENDENT table. So, you use a left join which keeps all the data on the “left” (i.e. the first table) and pulls in any matching data on the “right” (the second table).

The left join should be your default join: use it unless you have a strong reason to prefer one of the others. For example sometimes when data wrangling you don’t want to have NA in your table so you use an inner join.

Another way to depict the different types of joins is with a Venn diagram:

More complicated example

Suppose you have these two tables:

Left: cases are medical clinics. The variables: clinicName, postalCode.

clinicName	postalCode
A	22120
B	35752
C	56718
D	35752
E	67756
F	69129
G	73455
H	73455
I	76292

Right: cases are postal codes. Variables reflect the demographics of that postal code: postalCode, over65, etc.

over65	postalCode
0.46	35752
0.72	22120
0.93	22120
0.26	92332
0.46	84739
0.94	67756

The diagram below shows the cases in the left and right tables. The lines show the matches between left and right. The cases connected by a match are the overlap cases; there are five of them in the diagram. Cases without a match also appear in both the left and right tables.

An inner join gives the matching pairs. Note that clinic A, which had two matches in the right table, appears twice, once for each matching pair in which clinic A is involved.

LL %>% inner_join(RR)

clinicName	postalCode	over65
A	22120	0.72
A	22120	0.93
B	35752	0.46
D	35752	0.46
E	67756	0.94

An outer join can include cases where there is no match. You might want to include the unmatched cases from the left table, from the right table, or from both tables.

Unmatched cases from the left table

LL %>% left_join( RR)

clinicName	postalCode	over65
A	22120	0.72
A	22120	0.93
B	35752	0.46
C	56718	NA
D	35752	0.46
E	67756	0.94
F	69129	NA
G	73455	NA
H	73455	NA
I	76292	NA

Unmatched cases from the right table

LL %>%  right_join(RR)

clinicName	postalCode	over65
B	35752	0.46
D	35752	0.46
A	22120	0.72
A	22120	0.93
NA	92332	0.26
NA	84739	0.46
E	67756	0.94

Unmatched cases from both tables

LL %>% full_join(RR)

clinicName	postalCode	over65
A	22120	0.72
A	22120	0.93
B	35752	0.46
C	56718	NA
D	35752	0.46
E	67756	0.94
F	69129	NA
G	73455	NA
H	73455	NA
I	76292	NA
NA	92332	0.26
NA	84739	0.46

In class exercise

install.packages("fivethirtyeight")
install.packages("exercisedata")

library(fivethirtyeight)
library(exercisedata)
head(bad_drivers)

state	num_drivers	perc_speeding	perc_alcohol	perc_not_distracted	perc_no_previous	insurance_premiums	losses
Alabama	18.8	39	30	96	80	784.55	145.08
Alaska	18.1	41	25	90	94	1053.48	133.93
Arizona	18.6	35	28	84	96	899.47	110.35
Arkansas	22.4	18	26	94	95	827.34	142.39
California	12.0	35	28	91	89	878.41	165.63
Colorado	13.6	37	28	79	95	835.50	139.91

head(State_pop_area)

state	pop_2015	area
New Jersey	8958013	7354
Rhode Island	1056298	1034
Massachusetts	6794422	7800
Connecticut	3590886	4842
Maryland	6006401	9707
Delaware	945934	1949

Next you will join these two data tables to make a glyph ready table

http://gandalf.berkeley.edu:3838/alucas/Chapter-10-collection/

2 The role of the backtick ` in R

This will be important when we discuss the data verb gather below.

We saw in lecture 2 about rules for naming objects in R. For example, %>% or the number 19 can’t be an object name. But sometimes we want these to be names. We can get around this by putting our illegal name between backticks. For example:

`%>%` <- "pipe"
`%>%`

## [1] "pipe"

`19` <- 24
`19`

## [1] 24

3. Wide versus Narrow data tables (chapter 11)

Two data verbs we will be using are tidyr::narrow() and tidyr::gather(). These aren’t part of the DataComputing package you will need to install it once at the console:

install.packages("tidyr")

Remember to load it in under library(DataComputing) at the top of your Rmd document.

library(tidyr)

A data table can be presented in wide or narrow format. Each have their own advantatges.

Wide format is easier to get the difference of sbp (systolic blood pressure) before and after of a test for each patient. This table isn’t tidy as discussed in lecture 1.

BP_wide

subject	before	after
BHO	120	160
GWB	115	135
WJC	105	145

Below is a narrow format (tidy). More often your graphics will require a tidy data table.

BP_narrow

subject	when	sbp
BHO	before	160
GWB	before	115
WJC	after	145
GWB	after	135
WJC	before	105
BHO	after	160

The data verbs ’spread()andgather()` convert between these formats.

`gather()` transforms wide into narrow

A common problem is a dataset where some of the column names are not names of variables, but values of a variable. You use gather() when you notice that you have columns that are not variables. To tidy a dataset like this, we need to gather those columns into a new pair of variables. To describe that operation we need three parameters:

key: The name of the variable whose values form the column names. I call that the key, and here it is when.
value: The name of the variable whose values are spread over the cells. I call that value, and here it’s the sbp measurements.
The set of columns that represent values, not variables. In this example, those are the columns before and after. Use bare variable names by putting the entries in backticks. For example if the entry in the column is the sting is 19, write it as 19.

BP_narrow1 <-  BP_wide %>%
  gather(key= when, value = sbp, `before`, `after`)  # before or after are legal names so it isn't necessary to use backticks (but it doesn't hurt)
BP_narrow1

subject	when	sbp
BHO	before	120
GWB	before	115
WJC	before	105
BHO	after	160
GWB	after	135
WJC	after	145

`spread()` transforms narrow into wide

Spreading is the opposite of gathering. You use it when an observation is scattered across multiple rows.

To take a narrow data table and make it wide we only need two parameters:

The column that contains variable names; the key column. Here, it’s when.
The column that contains values that will be spread; the value column. Here it’s ‘sbp’.

BP_wide1 <-  BP_narrow %>% 
  spread(key= when, value = sbp)
BP_wide1

subject	after	before
BHO	160	160
GWB	135	115
WJC	145	105

in class exercise

We make a data table of country populations in different years.

a <- c("Afghanistan","Brazil", "China")
b <- c(745,37737,212258)
c <- c(266,80488,213766)
df_wide <- data.frame(a,b,c)
names(df_wide) <- c("countries","1999", "2000")
df_wide

countries	1999	2000
Afghanistan	745	266
Brazil	37737	80488
China	212258	213766

We wish to gather this table into tidy form.

countries	year	population
Afghanistan	1999	745
Brazil	1999	37737
China	1999	212258
Afghanistan	2000	266
Brazil	2000	80488
China	2000	213766

With paper and pencil fill in the necessary code:

df_narrow <- df_wide %>% 
  gather(fill__this__out)
df_narrow

example

Is the following data set narrow or wide? Convert it to the other data table format.

## Source: local data frame [3 x 3]
## Groups: name [3]
## 
##    name       F      M
## * <chr>   <int>  <int>
## 1 Allen    1836 263375
## 2  Mary 4112464  15151
## 3   Sue  144410    519

## Source: local data frame [6 x 3]
## Groups: name [3]
## 
##    name   sex   value
##   <chr> <chr>   <int>
## 1 Allen     F    1836
## 2  Mary     F 4112464
## 3   Sue     F  144410
## 4 Allen     M  263375
## 5  Mary     M   15151
## 6   Sue     M     519

i-clicker questions

Answ: yes

Answ: none of the above—key is sex

example

Lets examine the wide iris data table:

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Suppose you want to make the following plot:

The data table iris isn’t gyph ready. Here is the glyph ready table:

##   Species  Part Measure Value
## 1  setosa Sepal  Length   5.1
## 2  setosa Sepal  Length   4.9
## 3  setosa Sepal  Length   4.7
## 4  setosa Sepal  Length   4.6
## 5  setosa Sepal  Length   5.0
## 6  setosa Sepal  Length   5.4

step 1: Use gather

iris_narrow <- iris %>%
  gather(key, Value, -Species) %>%  #here -Species means all columns except Species
  head()
iris_narrow %>% head()

##   Species          key Value
## 1  setosa Sepal.Length   5.1
## 2  setosa Sepal.Length   4.9
## 3  setosa Sepal.Length   4.7
## 4  setosa Sepal.Length   4.6
## 5  setosa Sepal.Length   5.0
## 6  setosa Sepal.Length   5.4

step 2: Use the data verb separate()

iris_narrow_sep <- iris_narrow %>% separate(key, into=c("Part", "Measure"), sep="\\.")
head(iris_narrow_sep)

##   Species  Part Measure Value
## 1  setosa Sepal  Length   5.1
## 2  setosa Sepal  Length   4.9
## 3  setosa Sepal  Length   4.7
## 4  setosa Sepal  Length   4.6
## 5  setosa Sepal  Length   5.0
## 6  setosa Sepal  Length   5.4

in class exercise:

Show how we use seperate, gather and spread to go from the top to the bottom table:

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Flower
## 1          5.1         3.5          1.4         0.2  setosa      1
## 2          4.9         3.0          1.4         0.2  setosa      2
## 3          4.7         3.2          1.3         0.2  setosa      3
## 4          4.6         3.1          1.5         0.2  setosa      4
## 5          5.0         3.6          1.4         0.2  setosa      5
## 6          5.4         3.9          1.7         0.4  setosa      6

iris.wide <- iris %>%
  gather(key, value, -Species, -Flower) 
iris.wide %>% head()

##   Species Flower          key value
## 1  setosa      1 Sepal.Length   5.1
## 2  setosa      2 Sepal.Length   4.9
## 3  setosa      3 Sepal.Length   4.7
## 4  setosa      4 Sepal.Length   4.6
## 5  setosa      5 Sepal.Length   5.0
## 6  setosa      6 Sepal.Length   5.4

iris.wide <- iris %>%
  gather(key, value, -Species, -Flower) %>%
  separate(key, c("Part", "Measure"), "\\.") 
iris.wide %>% head()

##   Species Flower  Part Measure value
## 1  setosa      1 Sepal  Length   5.1
## 2  setosa      2 Sepal  Length   4.9
## 3  setosa      3 Sepal  Length   4.7
## 4  setosa      4 Sepal  Length   4.6
## 5  setosa      5 Sepal  Length   5.0
## 6  setosa      6 Sepal  Length   5.4

iris.wide <- iris %>%
  gather(key, value, -Species, -Flower) %>%
  separate(key, c("Part", "Measure"), "\\.") %>%
  spread(Measure, value) 
iris.wide %>% head()

##   Species Flower  Part Length Width
## 1  setosa      1 Petal    1.4   0.2
## 2  setosa      1 Sepal    5.1   3.5
## 3  setosa      2 Petal    1.4   0.2
## 4  setosa      2 Sepal    4.9   3.0
## 5  setosa      3 Petal    1.3   0.2
## 6  setosa      3 Sepal    4.7   3.2

Lecture 7

Adam Lucas

Announcements

Today:

Joins and relational databases

Example: Average class size

Establishing a match between cases

Example (nycflights)

Underderstanding joins

Inner Join

Outer Joins

More complicated example

Unmatched cases from the left table

Unmatched cases from the right table

Unmatched cases from both tables

In class exercise

2 The role of the backtick ` in R

3. Wide versus Narrow data tables (chapter 11)

`gather()` transforms wide into narrow

`spread()` transforms narrow into wide

in class exercise

example

i-clicker questions

example

in class exercise:

To Do: Read chapter 10 and 11 and do HW 3

Lecture 7

Adam Lucas

Announcements

Today:

Joins and relational databases

Example: Average class size

Establishing a match between cases

Example (nycflights)

Underderstanding joins

Inner Join

Outer Joins

More complicated example

Unmatched cases from the left table

Unmatched cases from the right table

Unmatched cases from both tables

In class exercise

2 The role of the backtick ` in R

3. Wide versus Narrow data tables (chapter 11)

gather() transforms wide into narrow

spread() transforms narrow into wide

in class exercise

example

i-clicker questions

example

in class exercise:

To Do: Read chapter 10 and 11 and do HW 3

`gather()` transforms wide into narrow

`spread()` transforms narrow into wide