Today

Chapter 10 Joining two data tables
Chapter 8 more ggplot2

Chapter 10: Joining two Tables

Grades <- read.file("http://tiny.cc/mosaic/grades.csv")

## Reading data with read.csv()

head(Grades,10)

##       sid grade   sessionID
## 1  S31185    D+ session1784
## 2  S31185    B+ session1785
## 3  S31185    A- session1791
## 4  S31185    B+ session1792
## 5  S31185    B- session1794
## 6  S31185    C+ session1795
## 7  S31185    B- session1796
## 8  S31185    NC session1807
## 9  S31188     C session1801
## 10 S31188    C+ session1802

Courses <- read.file("http://tiny.cc/mosaic/courses.csv")

## Reading data with read.csv()

head(Courses,10)

##      sessionID dept level    sem enroll     iid
## 1  session1784    M   100 FA1991     22 inst265
## 2  session1785    k   100 FA1991     52 inst458
## 3  session1791    J   100 FA1993     22 inst223
## 4  session1792    J   300 FA1993     20 inst235
## 5  session1794    J   200 FA1993     22 inst234
## 6  session1795    J   200 SP1994     26 inst230
## 7  session1796    k   200 IT1994     25 inst466
## 8  session1801    C   100 SP1997     38 inst144
## 9  session1802    o   300 SP1997     16 inst390
## 10 session1807    W   100 SP1998     43 inst138

Joins

A join is a data verb that combines two tables.

These care called the left and the right tables. There are several kinds of join.
All involve establishing a correspondance — a match — between each case in the left table and zero or more cases in the right table.
The various joins differ in how they handle multiple matches or missing matches.

Example: Average class size

Goal: Figure out the average class size seen by each student.

enroll comes from Courses table.
Student (sid) comes from Grades.
sessionID is in both tables.

joined_tables <- Grades %>%
  select(sid, sessionID) %>%
  left_join(Courses %>% select(enroll, sessionID), by=c("sessionID"="sessionID"))
 head(joined_tables, 10)

##       sid   sessionID enroll
## 1  S31185 session1784     22
## 2  S31185 session1785     52
## 3  S31185 session1791     22
## 4  S31185 session1792     20
## 5  S31185 session1794     22
## 6  S31185 session1795     26
## 7  S31185 session1796     25
## 8  S31185 session1807     43
## 9  S31188 session1801     38
## 10 S31188 session1802     16

Task for you

Make the following plot of the average class size for each student.

 avg_class_size <-  joined_tables %>%
   group_by(sid) %>%
   summarise(avg=mean(enroll, na.rm=TRUE))
#head(avg_class_size)
avg_class_size %>% ggplot(aes(x=sid, y=avg)) + geom_point() + scale_x_discrete(breaks=NULL)

Note: To figure out how to remove the tick mark labels (which aren’t legible) a good resource is the Cookbook for R

Most popular joins:

left_join() The output has all the cases from the Left, even if there is no match in the right. If there is no match in the right you get NA
inner_join() the output has only the cases from the Left with a match in the Right

inner joins are usefull for filtering. For example suppose I want to know the departments where the most students get good grades.

good_grades <- Grades %>% 
  filter(grade %in% c("A","A-"))
head(good_grades)

##      sid grade   sessionID
## 1 S31185    A- session1791
## 2 S31188    A- session1810
## 3 S31188     A session2644
## 4 S31188    A- session3139
## 5 S31188     A session3141
## 6 S31188    A- session3187

good_depts <- good_grades %>% 
  inner_join(Courses) %>% 
  group_by(dept) %>% 
  tally(sort=TRUE)

## Joining by: "sessionID"

head(good_depts)

## Source: local data frame [6 x 2]
## 
##    dept     n
##   (chr) (int)
## 1     m   226
## 2     d   221
## 3     O   208
## 4     W   186
## 5     q   167
## 6     k   153

Chap 8 (More ggplot2)

It is important to be able to look up the syntax and aesetics for the different geoms. Here is a good resourse from ggplot2.org

Example

The data table Birth78 in MosaicData package gives a day by day record of the number of births in the United States in 1978.

head(Births78)

##         date births dayofyear  wday
## 1 1978-01-01   7701         1   Sun
## 2 1978-01-02   7527         2   Mon
## 3 1978-01-03   8825         3  Tues
## 4 1978-01-04   8859         4   Wed
## 5 1978-01-05   9043         5 Thurs
## 6 1978-01-06   9208         6   Fri

Task for you

Make a new table DailyBirths that gives the average of births on the different days of the week for each month of the year. You will need to use the function lubridate::month(date) which gives you the month from date. For example:

head(Births78$date)

## [1] "1978-01-01 UTC" "1978-01-02 UTC" "1978-01-03 UTC" "1978-01-04 UTC"
## [5] "1978-01-05 UTC" "1978-01-06 UTC"

head(month(Births78$date))

## [1] 1 1 1 1 1 1

For purposes of clarity later, we will rename wday as Wday using rename():

myBirths78 <- Births78 %>% rename(Wday=wday)
head(myBirths78)

##         date births dayofyear  Wday
## 1 1978-01-01   7701         1   Sun
## 2 1978-01-02   7527         2   Mon
## 3 1978-01-03   8825         3  Tues
## 4 1978-01-04   8859         4   Wed
## 5 1978-01-05   9043         5 Thurs
## 6 1978-01-06   9208         6   Fri

Make the following plot:

avg_Births <- myBirths78 %>% 
  mutate(Month=month(date)) %>%
  group_by(Month,Wday) %>% 
  summarise(avg=mean(births))
head(avg_Births)

## Source: local data frame [6 x 3]
## Groups: Month [1]
## 
##   Month   Wday     avg
##   (dbl) (fctr)   (dbl)
## 1     1    Sun 7634.00
## 2     1    Mon 8918.60
## 3     1   Tues 9164.20
## 4     1    Wed 9069.75
## 5     1  Thurs 9228.75
## 6     1    Fri 9158.50

plot1 <- avg_Births %>% ggplot(aes(x=Wday,y=avg)) +geom_point(aes(color=as.factor(Month)))
plot1

You can find a data set listing US federal holidays at http://tiny.cc/dcf/US-Holidays.csv. Read it in as follows:¹

Holidays <- read.csv("http://tiny.cc/dcf/US-Holidays.csv") %>%
mutate(date = lubridate::dmy(date))

Filter out cases that are not in the month of November or December

my_months=c(11,12)
myHolidays <- Holidays %>% filter(month(date) %in% my_months , year(date)==1978)

Add a layer to your plot from (1). Draw a vertical bar at each date which is a holiday in November or January. You’ll use the geom_vline() glyph. You can give a data = argument to geom_vline() to tell it to plot out the information from Holidays rather than MyTwoYears.

plot2 <- avg_Births %>% ggplot(aes(x=Wday, y=avg)) + geom_point(aes(color=Wday)) +geom_vline(data= myHolidays, aes(xintercept=wday(date), color=wday(date,label=TRUE))) 

plot2

Add a text label to each of the vertical bars to identify which holiday it is. Use the geom_text() glyph.²

plot3 <- plot2 + geom_text(data = myHolidays,
           aes(x = wday(date), y = 9000, label = holiday, angle = 65))
plot3

Next your layer your graph with a horizontal line with yintercept equal to the average number of births in for a day in 1978. Make the following plot:

avg <- Births78 %>% summarise(avg=mean(births))
plot4 <- plot3 + geom_hline(yintercept = avg[1,1])
plot4

Use annotate() to label the horizontal line as average

plot4 + annotate("text", label="avg", x="Sun", y=9000)

The point of the lubridate::dmy() function is to convert the character-string date stored in the CSV to a POSIX date-number.↩
Hints: You’ll have to make up a y-coordinate for each label. You can set the orientation of each label with the angle aesthetic.↩

Lec9

Today

Chapter 10: Joining two Tables

Joins

Example: Average class size

Goal: Figure out the average class size seen by each student.

Task for you

Most popular joins:

Chap 8 (More ggplot2)

Example

Task for you