Source file ⇒ Lec9soln.Rmd

Today

  1. Chapter 10 Joining two data tables
  2. Chapter 8 more ggplot2

Chapter 10: Joining two Tables

Grades <- read.file("http://tiny.cc/mosaic/grades.csv")
## Reading data with read.csv()
head(Grades,10)
##       sid grade   sessionID
## 1  S31185    D+ session1784
## 2  S31185    B+ session1785
## 3  S31185    A- session1791
## 4  S31185    B+ session1792
## 5  S31185    B- session1794
## 6  S31185    C+ session1795
## 7  S31185    B- session1796
## 8  S31185    NC session1807
## 9  S31188     C session1801
## 10 S31188    C+ session1802
Courses <- read.file("http://tiny.cc/mosaic/courses.csv")
## Reading data with read.csv()
head(Courses,10)
##      sessionID dept level    sem enroll     iid
## 1  session1784    M   100 FA1991     22 inst265
## 2  session1785    k   100 FA1991     52 inst458
## 3  session1791    J   100 FA1993     22 inst223
## 4  session1792    J   300 FA1993     20 inst235
## 5  session1794    J   200 FA1993     22 inst234
## 6  session1795    J   200 SP1994     26 inst230
## 7  session1796    k   200 IT1994     25 inst466
## 8  session1801    C   100 SP1997     38 inst144
## 9  session1802    o   300 SP1997     16 inst390
## 10 session1807    W   100 SP1998     43 inst138

Joins

A join is a data verb that combines two tables.

  • These care called the left and the right tables. There are several kinds of join.

  • All involve establishing a correspondance — a match — between each case in the left table and zero or more cases in the right table.
  • The various joins differ in how they handle multiple matches or missing matches.

Example: Average class size

Goal: Figure out the average class size seen by each student.

  • enroll comes from Courses table.
  • Student (sid) comes from Grades.
  • sessionID is in both tables.
joined_tables <- Grades %>%
  select(sid, sessionID) %>%
  left_join(Courses %>% select(enroll, sessionID), by=c("sessionID"="sessionID"))
 head(joined_tables, 10) 
##       sid   sessionID enroll
## 1  S31185 session1784     22
## 2  S31185 session1785     52
## 3  S31185 session1791     22
## 4  S31185 session1792     20
## 5  S31185 session1794     22
## 6  S31185 session1795     26
## 7  S31185 session1796     25
## 8  S31185 session1807     43
## 9  S31188 session1801     38
## 10 S31188 session1802     16

Task for you

Make the following plot of the average class size for each student.

 avg_class_size <-  joined_tables %>%
   group_by(sid) %>%
   summarise(avg=mean(enroll, na.rm=TRUE))
#head(avg_class_size)
avg_class_size %>% ggplot(aes(x=sid, y=avg)) + geom_point() + scale_x_discrete(breaks=NULL)

Note: To figure out how to remove the tick mark labels (which aren’t legible) a good resource is the Cookbook for R

Chap 8 (More ggplot2)

It is important to be able to look up the syntax and aesetics for the different geoms. Here is a good resourse from ggplot2.org

Example

The data table Birth78 in MosaicData package gives a day by day record of the number of births in the United States in 1978.

head(Births78)
##         date births dayofyear  wday
## 1 1978-01-01   7701         1   Sun
## 2 1978-01-02   7527         2   Mon
## 3 1978-01-03   8825         3  Tues
## 4 1978-01-04   8859         4   Wed
## 5 1978-01-05   9043         5 Thurs
## 6 1978-01-06   9208         6   Fri

Task for you

Make a new table DailyBirths that gives the average of births on the different days of the week for each month of the year. You will need to use the function lubridate::month(date) which gives you the month from date. For example:

head(Births78$date)
## [1] "1978-01-01 UTC" "1978-01-02 UTC" "1978-01-03 UTC" "1978-01-04 UTC"
## [5] "1978-01-05 UTC" "1978-01-06 UTC"
head(month(Births78$date))
## [1] 1 1 1 1 1 1

For purposes of clarity later, we will rename wday as Wday using rename():

myBirths78 <- Births78 %>% rename(Wday=wday)
head(myBirths78)
##         date births dayofyear  Wday
## 1 1978-01-01   7701         1   Sun
## 2 1978-01-02   7527         2   Mon
## 3 1978-01-03   8825         3  Tues
## 4 1978-01-04   8859         4   Wed
## 5 1978-01-05   9043         5 Thurs
## 6 1978-01-06   9208         6   Fri
  1. Make the following plot:
avg_Births <- myBirths78 %>% 
  mutate(Month=month(date)) %>%
  group_by(Month,Wday) %>% 
  summarise(avg=mean(births))
head(avg_Births)
## Source: local data frame [6 x 3]
## Groups: Month [1]
## 
##   Month   Wday     avg
##   (dbl) (fctr)   (dbl)
## 1     1    Sun 7634.00
## 2     1    Mon 8918.60
## 3     1   Tues 9164.20
## 4     1    Wed 9069.75
## 5     1  Thurs 9228.75
## 6     1    Fri 9158.50
plot1 <- avg_Births %>% ggplot(aes(x=Wday,y=avg)) +geom_point(aes(color=as.factor(Month)))
plot1

  1. You can find a data set listing US federal holidays at http://tiny.cc/dcf/US-Holidays.csv. Read it in as follows:1
Holidays <- read.csv("http://tiny.cc/dcf/US-Holidays.csv") %>%
mutate(date = lubridate::dmy(date))

Filter out cases that are not in the month of November or December

my_months=c(11,12)
myHolidays <- Holidays %>% filter(month(date) %in% my_months , year(date)==1978) 
  1. Add a layer to your plot from (1). Draw a vertical bar at each date which is a holiday in November or January. You’ll use the geom_vline() glyph. You can give a data = argument to geom_vline() to tell it to plot out the information from Holidays rather than MyTwoYears.
plot2 <- avg_Births %>% ggplot(aes(x=Wday, y=avg)) + geom_point(aes(color=Wday)) +geom_vline(data= myHolidays, aes(xintercept=wday(date), color=wday(date,label=TRUE))) 

plot2

  1. Add a text label to each of the vertical bars to identify which holiday it is. Use the geom_text() glyph.2
plot3 <- plot2 + geom_text(data = myHolidays,
           aes(x = wday(date), y = 9000, label = holiday, angle = 65))
plot3

  1. Next your layer your graph with a horizontal line with yintercept equal to the average number of births in for a day in 1978. Make the following plot:
avg <- Births78 %>% summarise(avg=mean(births))
plot4 <- plot3 + geom_hline(yintercept = avg[1,1])
plot4

  1. Use annotate() to label the horizontal line as average
plot4 + annotate("text", label="avg", x="Sun", y=9000)


  1. The point of the lubridate::dmy() function is to convert the character-string date stored in the CSV to a POSIX date-number.

  2. Hints: You’ll have to make up a y-coordinate for each label. You can set the orientation of each label with the angle aesthetic.