Source file ⇒ 2017-lec7.Rmd
last compiled on Wed Feb 8 00:20:19 2017
none
It’s rare that a data analysis involves only a single table of data. Typically you have many tables of data, and you must combine them to answer the questions that you’re interested in. Collectively, multiple tables of data are called relational data because it is the relations, not just the individual datasets, that are important.
Example: Grades and Enrollment
The data tables Grades
and Courses
gives grade and course related info in a school.
Grades <- read.csv("http://tiny.cc/mosaic/grades.csv")
sid | grade | sessionID | |
---|---|---|---|
2197 | S31680 | B | session3518 |
259 | S31242 | B | session2897 |
4188 | S32127 | A | session2002 |
3880 | S32058 | A- | session2952 |
Courses <- read.csv("http://tiny.cc/mosaic/courses.csv")
sessionID | dept | level | sem | enroll | iid | |
---|---|---|---|---|---|---|
640 | session2568 | J | 100 | FA2002 | 15 | inst223 |
76 | session1940 | d | 100 | FA2000 | 16 | inst409 |
1218 | session3242 | m | 200 | SP2004 | 30 | inst476 |
The primary and foreign key is sessionID
.
A join is a data verb that combines left and right tables.
Your left table is usually your primary table (if there is one) and your right table has information you need not available in the left table.
Goal: Figure out the average class size for each student.
sid
) comes from Grades
.enroll
comes from Courses
table.sessionID
is in both tables.Starting with Grades table we need Courses table to get enrollment for classes. Alternatively, starting with Courses you can find all the students in those Courses with the Grades table.
Grades %>%
left_join(Courses) %>% sample_n(size=4)
sid | grade | sessionID | dept | level | sem | enroll | iid | |
---|---|---|---|---|---|---|---|---|
4043 | S32061 | B+ | session3644 | K | 300 | SP2005 | 18 | inst239 |
1536 | S31518 | AU | session2901 | J | 100 | FA2003 | 22 | inst224 |
1844 | S31587 | B- | session3191 | b | 200 | SP2004 | 10 | inst403 |
3593 | S31962 | B | session3821 | W | 300 | SP2005 | 24 | inst138 |
Courses %>%
left_join(Grades) %>% sample_n(size=4)
sessionID | dept | level | sem | enroll | iid | sid | grade | |
---|---|---|---|---|---|---|---|---|
2050 | session2454 | Y | 200 | FA2002 | 13 | inst397 | S31587 | C+ |
3823 | session2990 | m | 200 | FA2003 | 20 | inst476 | S32310 | B+ |
3354 | session2851 | O | 300 | SP2003 | 14 | inst308 | S31509 | A |
5405 | session3573 | M | 100 | SP2005 | 23 | inst263 | S31830 | S |
Once Courses
and Grades
are joined, it’s straightforward to find the average enrollment seen by each student.
AveClassEachStudent <- Grades %>%
left_join(Courses) %>%
group_by(sid) %>%
summarise(ave_enroll = mean(enroll, na.rm=TRUE))
sid | ave_enroll |
---|---|
S32124 | 25.10000 |
S32154 | 22.93333 |
S32415 | 20.87500 |
Example:
Grades %>%
left_join(Courses, by = c(sessionID = "sessionID")) %>%
head(4)
sid | grade | sessionID | dept | level | sem | enroll | iid |
---|---|---|---|---|---|---|---|
S31185 | D+ | session1784 | M | 100 | FA1991 | 22 | inst265 |
S31185 | B+ | session1785 | k | 100 | FA1991 | 52 | inst458 |
S31185 | A- | session1791 | J | 100 | FA1993 | 22 | inst223 |
S31185 | B+ | session1792 | J | 300 | FA1993 | 20 | inst235 |
The default value of by=
is all variables with the same names in both tables.
- This is not reliable unless you’ve checked.
Grades %>%
left_join(Courses) %>%
head(4)
sid | grade | sessionID | dept | level | sem | enroll | iid |
---|---|---|---|---|---|---|---|
S31185 | D+ | session1784 | M | 100 | FA1991 | 22 | inst265 |
S31185 | B+ | session1785 | k | 100 | FA1991 | 52 | inst458 |
S31185 | A- | session1791 | J | 100 | FA1993 | 22 | inst223 |
S31185 | B+ | session1792 | J | 300 | FA1993 | 20 | inst235 |
If the names match you can write:
Grades %>%
left_join(Courses, by="sessionID") %>%
head(4)
sid | grade | sessionID | dept | level | sem | enroll | iid |
---|---|---|---|---|---|---|---|
S31185 | D+ | session1784 | M | 100 | FA1991 | 22 | inst265 |
S31185 | B+ | session1785 | k | 100 | FA1991 | 52 | inst458 |
S31185 | A- | session1791 | J | 100 | FA1993 | 22 | inst223 |
S31185 | B+ | session1792 | J | 300 | FA1993 | 20 | inst235 |
#install.packages("nycflights13")
library(nycflights13)
The data table nycflights13::flights
has many variables of which we select just a few
flights2 <- flights %>%
select(year:day, hour, tailnum, carrier)
head(flights2,2)
year | month | day | hour | tailnum | carrier |
---|---|---|---|---|---|
2013 | 1 | 1 | 5 | N14228 | UA |
2013 | 1 | 1 | 5 | N24211 | UA |
The data table airlines
lets you look up the full carrier name from its abbreviated code:
head(airlines,2)
carrier | name |
---|---|
9E | Endeavor Air Inc. |
AA | American Airlines Inc. |
The data tables weather
give the weather at each NYC airport for each hour:
head(weather,2)
origin | year | month | day | hour | temp | dewp | humid | wind_dir | wind_speed | wind_gust | precip | pressure | visib | time_hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
EWR | 2013 | 1 | 1 | 0 | 37.04 | 21.92 | 53.97 | 230 | 10.35702 | 11.91865 | 0 | 1013.9 | 10 | 2012-12-31 16:00:00 |
EWR | 2013 | 1 | 1 | 1 | 37.04 | 21.92 | 53.97 | 230 | 13.80936 | 15.89154 | 0 | 1013.0 | 10 | 2012-12-31 17:00:00 |
One way to show the relationships between the different tables is with a drawing:
For nycflights13:
flights
connects to airlines
through the carrier
variable.
flights
connects to weather
via origin
(the location), and year
, month
, day
and hour
(the time).
Imagine you want to add the full airline name to the flights2
data. You can combine the airlines
and flights2
data frames with left_join()
:
flights2 %>%
left_join(airlines, by="carrier") %>%
head(3)
year | month | day | hour | tailnum | carrier | name |
---|---|---|---|---|---|---|
2013 | 1 | 1 | 5 | N14228 | UA | United Air Lines Inc. |
2013 | 1 | 1 | 5 | N24211 | UA | United Air Lines Inc. |
2013 | 1 | 1 | 5 | N619AA | AA | American Airlines Inc. |
The result of joining airlines to flights2 is an additional variable: name
.
To help you learn how joins work, I’m going to use a visual representation:
The coloured column represents the “key” variable: these are used to match the rows between the tables. The grey column represents the “value” column that is carried along for the ride. In these examples I’ll show a single key variable, but the idea generalises in a straightforward way to multiple keys and multiple values.
A join is a way of connecting each row in x to zero, one, or more rows in y. The following diagram shows each potential match as an intersection of a pair of lines.
In an actual join, matches will be indicated with dots. The number of dots = the number of matches = the number of rows in the output.
The simplest type of join is the inner join. An inner join matches pairs of observations whenever their keys are equal:
The output of an inner join is a new data frame that contains the key, the x values, and the y values. We use by=
to tell dplyr which variable is the key:
x %>%
inner_join(y, by="key")
The most important property of an inner join is that unmatched rows are not included in the result. This means that generally inner joins are usually not appropriate for use in analysis because it’s too easy to lose observations.
An inner join keeps observations that appear in both tables. An outer join keeps observations that appear in at least one of the tables. There are three types of outer joins:
x
.y
.x
and y
.These joins work by adding an additional “virtual” observation to each table. This observation has a key that always matches (if no other key matches), and a value filled with NA
.
Graphically, that looks like:
The most commonly used join is the left join: you use this whenever you look up additional data from another table, because it preserves the original observations even when there isn’t a match. For example, lets say you want a list of all the employees in a buisiness plus any dependents. One table called EMPLOYEE contains a list of all employees and another table called DEPENDENTS contains information of all employee’s dependents.You need a left join here because if an employee doesn’t have a dependent, then their record won’t show up at all – because there’s no matching record in the DEPENDENT table. So, you use a left join which keeps all the data on the “left” (i.e. the first table) and pulls in any matching data on the “right” (the second table).
The left join should be your default join: use it unless you have a strong reason to prefer one of the others. For example sometimes when data wrangling you don’t want to have NA
in your table so you use an inner join.
Another way to depict the different types of joins is with a Venn diagram:
Suppose you have these two tables:
clinicName
, postalCode
.clinicName | postalCode |
---|---|
A | 22120 |
B | 35752 |
C | 56718 |
D | 35752 |
E | 67756 |
F | 69129 |
G | 73455 |
H | 73455 |
I | 76292 |
postalCode
, over65
, etc.over65 | postalCode |
---|---|
0.46 | 35752 |
0.72 | 22120 |
0.93 | 22120 |
0.26 | 92332 |
0.46 | 84739 |
0.94 | 67756 |
The diagram below shows the cases in the left and right tables. The lines show the matches between left and right. The cases connected by a match are the overlap cases; there are five of them in the diagram. Cases without a match also appear in both the left and right tables.
An inner join gives the matching pairs. Note that clinic A, which had two matches in the right table, appears twice, once for each matching pair in which clinic A is involved.
LL %>% inner_join(RR)
clinicName | postalCode | over65 |
---|---|---|
A | 22120 | 0.72 |
A | 22120 | 0.93 |
B | 35752 | 0.46 |
D | 35752 | 0.46 |
E | 67756 | 0.94 |
An outer join can include cases where there is no match. You might want to include the unmatched cases from the left table, from the right table, or from both tables.
LL %>% left_join( RR)
clinicName | postalCode | over65 |
---|---|---|
A | 22120 | 0.72 |
A | 22120 | 0.93 |
B | 35752 | 0.46 |
C | 56718 | NA |
D | 35752 | 0.46 |
E | 67756 | 0.94 |
F | 69129 | NA |
G | 73455 | NA |
H | 73455 | NA |
I | 76292 | NA |
LL %>% right_join(RR)
clinicName | postalCode | over65 |
---|---|---|
B | 35752 | 0.46 |
D | 35752 | 0.46 |
A | 22120 | 0.72 |
A | 22120 | 0.93 |
NA | 92332 | 0.26 |
NA | 84739 | 0.46 |
E | 67756 | 0.94 |
LL %>% full_join(RR)
clinicName | postalCode | over65 |
---|---|---|
A | 22120 | 0.72 |
A | 22120 | 0.93 |
B | 35752 | 0.46 |
C | 56718 | NA |
D | 35752 | 0.46 |
E | 67756 | 0.94 |
F | 69129 | NA |
G | 73455 | NA |
H | 73455 | NA |
I | 76292 | NA |
NA | 92332 | 0.26 |
NA | 84739 | 0.46 |
install.packages("fivethirtyeight")
install.packages("exercisedata")
library(fivethirtyeight)
library(exercisedata)
head(bad_drivers)
state | num_drivers | perc_speeding | perc_alcohol | perc_not_distracted | perc_no_previous | insurance_premiums | losses |
---|---|---|---|---|---|---|---|
Alabama | 18.8 | 39 | 30 | 96 | 80 | 784.55 | 145.08 |
Alaska | 18.1 | 41 | 25 | 90 | 94 | 1053.48 | 133.93 |
Arizona | 18.6 | 35 | 28 | 84 | 96 | 899.47 | 110.35 |
Arkansas | 22.4 | 18 | 26 | 94 | 95 | 827.34 | 142.39 |
California | 12.0 | 35 | 28 | 91 | 89 | 878.41 | 165.63 |
Colorado | 13.6 | 37 | 28 | 79 | 95 | 835.50 | 139.91 |
head(State_pop_area)
state | pop_2015 | area |
---|---|---|
New Jersey | 8958013 | 7354 |
Rhode Island | 1056298 | 1034 |
Massachusetts | 6794422 | 7800 |
Connecticut | 3590886 | 4842 |
Maryland | 6006401 | 9707 |
Delaware | 945934 | 1949 |
Next you will join these two data tables to make a glyph ready table
http://gandalf.berkeley.edu:3838/alucas/Chapter-10-collection/
This will be important when we discuss the data verb gather below.
We saw in lecture 2 about rules for naming objects in R. For example, %>% or the number 19 can’t be an object name. But sometimes we want these to be names. We can get around this by putting our illegal name between backticks. For example:
`%>%` <- "pipe"
`%>%`
## [1] "pipe"
`19` <- 24
`19`
## [1] 24
Two data verbs we will be using are tidyr::narrow()
and tidyr::gather()
. These aren’t part of the DataComputing
package you will need to install it once at the console:
install.packages("tidyr")
Remember to load it in under library(DataComputing)
at the top of your Rmd document.
library(tidyr)
A data table can be presented in wide or narrow format. Each have their own advantatges.
Wide format is easier to get the difference of sbp
(systolic blood pressure) before and after of a test for each patient. This table isn’t tidy as discussed in lecture 1.
BP_wide
subject | before | after |
---|---|---|
BHO | 120 | 160 |
GWB | 115 | 135 |
WJC | 105 | 145 |
Below is a narrow format (tidy). More often your graphics will require a tidy data table.
BP_narrow
subject | when | sbp |
---|---|---|
BHO | before | 160 |
GWB | before | 115 |
WJC | after | 145 |
GWB | after | 135 |
WJC | before | 105 |
BHO | after | 160 |
The data verbs ’spread()and
gather()` convert between these formats.
gather()
transforms wide into narrowA common problem is a dataset where some of the column names are not names of variables, but values of a variable. You use gather()
when you notice that you have columns that are not variables. To tidy a dataset like this, we need to gather those columns into a new pair of variables. To describe that operation we need three parameters:
key
: The name of the variable whose values form the column names. I call that the key
, and here it is when
.
value
: The name of the variable whose values are spread over the cells. I call that value, and here it’s the sbp
measurements.
The set of columns that represent values, not variables. In this example, those are the columns before
and after
. Use bare variable names by putting the entries in backticks. For example if the entry in the column is the sting is 19, write it as 19
.
BP_narrow1 <- BP_wide %>%
gather(key= when, value = sbp, `before`, `after`) # before or after are legal names so it isn't necessary to use backticks (but it doesn't hurt)
BP_narrow1
subject | when | sbp |
---|---|---|
BHO | before | 120 |
GWB | before | 115 |
WJC | before | 105 |
BHO | after | 160 |
GWB | after | 135 |
WJC | after | 145 |
spread()
transforms narrow into wideSpreading is the opposite of gathering. You use it when an observation is scattered across multiple rows.
To take a narrow data table and make it wide we only need two parameters:
The column that contains variable names; the key
column. Here, it’s when
.
The column that contains values that will be spread; the value
column. Here it’s ‘sbp’.
BP_wide1 <- BP_narrow %>%
spread(key= when, value = sbp)
BP_wide1
subject | after | before |
---|---|---|
BHO | 160 | 160 |
GWB | 135 | 115 |
WJC | 145 | 105 |
We make a data table of country populations in different years.
a <- c("Afghanistan","Brazil", "China")
b <- c(745,37737,212258)
c <- c(266,80488,213766)
df_wide <- data.frame(a,b,c)
names(df_wide) <- c("countries","1999", "2000")
df_wide
countries | 1999 | 2000 |
---|---|---|
Afghanistan | 745 | 266 |
Brazil | 37737 | 80488 |
China | 212258 | 213766 |
We wish to gather this table into tidy form.
countries | year | population |
---|---|---|
Afghanistan | 1999 | 745 |
Brazil | 1999 | 37737 |
China | 1999 | 212258 |
Afghanistan | 2000 | 266 |
Brazil | 2000 | 80488 |
China | 2000 | 213766 |
With paper and pencil fill in the necessary code:
df_narrow <- df_wide %>%
gather(fill__this__out)
df_narrow
Is the following data set narrow or wide? Convert it to the other data table format.
## Source: local data frame [3 x 3]
## Groups: name [3]
##
## name F M
## * <chr> <int> <int>
## 1 Allen 1836 263375
## 2 Mary 4112464 15151
## 3 Sue 144410 519
## Source: local data frame [6 x 3]
## Groups: name [3]
##
## name sex value
## <chr> <chr> <int>
## 1 Allen F 1836
## 2 Mary F 4112464
## 3 Sue F 144410
## 4 Allen M 263375
## 5 Mary M 15151
## 6 Sue M 519
Answ: yes
Answ: none of the above—key is
sex
Lets examine the wide iris data table:
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Suppose you want to make the following plot:
The data table iris
isn’t gyph ready. Here is the glyph ready table:
## Species Part Measure Value
## 1 setosa Sepal Length 5.1
## 2 setosa Sepal Length 4.9
## 3 setosa Sepal Length 4.7
## 4 setosa Sepal Length 4.6
## 5 setosa Sepal Length 5.0
## 6 setosa Sepal Length 5.4
step 1: Use gather
iris_narrow <- iris %>%
gather(key, Value, -Species) %>% #here -Species means all columns except Species
head()
iris_narrow %>% head()
## Species key Value
## 1 setosa Sepal.Length 5.1
## 2 setosa Sepal.Length 4.9
## 3 setosa Sepal.Length 4.7
## 4 setosa Sepal.Length 4.6
## 5 setosa Sepal.Length 5.0
## 6 setosa Sepal.Length 5.4
step 2: Use the data verb separate()
iris_narrow_sep <- iris_narrow %>% separate(key, into=c("Part", "Measure"), sep="\\.")
head(iris_narrow_sep)
## Species Part Measure Value
## 1 setosa Sepal Length 5.1
## 2 setosa Sepal Length 4.9
## 3 setosa Sepal Length 4.7
## 4 setosa Sepal Length 4.6
## 5 setosa Sepal Length 5.0
## 6 setosa Sepal Length 5.4
Show how we use seperate, gather and spread to go from the top to the bottom table:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Flower
## 1 5.1 3.5 1.4 0.2 setosa 1
## 2 4.9 3.0 1.4 0.2 setosa 2
## 3 4.7 3.2 1.3 0.2 setosa 3
## 4 4.6 3.1 1.5 0.2 setosa 4
## 5 5.0 3.6 1.4 0.2 setosa 5
## 6 5.4 3.9 1.7 0.4 setosa 6
iris.wide <- iris %>%
gather(key, value, -Species, -Flower)
iris.wide %>% head()
## Species Flower key value
## 1 setosa 1 Sepal.Length 5.1
## 2 setosa 2 Sepal.Length 4.9
## 3 setosa 3 Sepal.Length 4.7
## 4 setosa 4 Sepal.Length 4.6
## 5 setosa 5 Sepal.Length 5.0
## 6 setosa 6 Sepal.Length 5.4
iris.wide <- iris %>%
gather(key, value, -Species, -Flower) %>%
separate(key, c("Part", "Measure"), "\\.")
iris.wide %>% head()
## Species Flower Part Measure value
## 1 setosa 1 Sepal Length 5.1
## 2 setosa 2 Sepal Length 4.9
## 3 setosa 3 Sepal Length 4.7
## 4 setosa 4 Sepal Length 4.6
## 5 setosa 5 Sepal Length 5.0
## 6 setosa 6 Sepal Length 5.4
iris.wide <- iris %>%
gather(key, value, -Species, -Flower) %>%
separate(key, c("Part", "Measure"), "\\.") %>%
spread(Measure, value)
iris.wide %>% head()
## Species Flower Part Length Width
## 1 setosa 1 Petal 1.4 0.2
## 2 setosa 1 Sepal 5.1 3.5
## 3 setosa 2 Petal 1.4 0.2
## 4 setosa 2 Sepal 4.9 3.0
## 5 setosa 3 Petal 1.3 0.2
## 6 setosa 3 Sepal 4.7 3.2