Source file ⇒ Lec9.Rmd

Today

  1. Chapter 10 Joining two data tables
  2. Chapter 8 more ggplot2

Chapter 10: Joining two Tables

Grades <- read.file("http://tiny.cc/mosaic/grades.csv")
## Reading data with read.csv()
head(Grades,10)
##       sid grade   sessionID
## 1  S31185    D+ session1784
## 2  S31185    B+ session1785
## 3  S31185    A- session1791
## 4  S31185    B+ session1792
## 5  S31185    B- session1794
## 6  S31185    C+ session1795
## 7  S31185    B- session1796
## 8  S31185    NC session1807
## 9  S31188     C session1801
## 10 S31188    C+ session1802
Courses <- read.file("http://tiny.cc/mosaic/courses.csv")
## Reading data with read.csv()
head(Courses,10)
##      sessionID dept level    sem enroll     iid
## 1  session1784    M   100 FA1991     22 inst265
## 2  session1785    k   100 FA1991     52 inst458
## 3  session1791    J   100 FA1993     22 inst223
## 4  session1792    J   300 FA1993     20 inst235
## 5  session1794    J   200 FA1993     22 inst234
## 6  session1795    J   200 SP1994     26 inst230
## 7  session1796    k   200 IT1994     25 inst466
## 8  session1801    C   100 SP1997     38 inst144
## 9  session1802    o   300 SP1997     16 inst390
## 10 session1807    W   100 SP1998     43 inst138

Joins

A join is a data verb that combines two tables.

  • These care called the left and the right tables. There are several kinds of join.

  • All involve establishing a correspondance — a match — between each case in the left table and zero or more cases in the right table.
  • The various joins differ in how they handle multiple matches or missing matches.

Example: Average class size

Goal: Figure out the average class size seen by each student.

  • enroll comes from Courses table.
  • Student (sid) comes from Grades.
  • sessionID is in both tables.
joined_tables <- Grades %>%
  select(sid, sessionID) %>%
  left_join(Courses %>% select(enroll, sessionID), by=c("sessionID"="sessionID"))
 head(joined_tables, 10) 
##       sid   sessionID enroll
## 1  S31185 session1784     22
## 2  S31185 session1785     52
## 3  S31185 session1791     22
## 4  S31185 session1792     20
## 5  S31185 session1794     22
## 6  S31185 session1795     26
## 7  S31185 session1796     25
## 8  S31185 session1807     43
## 9  S31188 session1801     38
## 10 S31188 session1802     16

Task for you

Make the following plot of the average class size for each student.

Note: To figure out how to remove the tick mark labels (which aren’t legible) a good resource is the Cookbook for R

Chap 8 (More ggplot2)

It is important to be able to look up the syntax and aesetics for the different geoms. Here is a good resourse from ggplot2.org

Example

The data table Birth78 in MosaicData package a day by day record of the number of births in the United States in 1978.

head(Births78)
##         date births dayofyear  wday
## 1 1978-01-01   7701         1   Sun
## 2 1978-01-02   7527         2   Mon
## 3 1978-01-03   8825         3  Tues
## 4 1978-01-04   8859         4   Wed
## 5 1978-01-05   9043         5 Thurs
## 6 1978-01-06   9208         6   Fri

Task for you

Make a new table DailyBirths that gives the average of births on the different days of the week for each month of the year. You will need to use the function lubridate::month(date) which gives you the month from date. For example:

head(Births78$date)
## [1] "1978-01-01 UTC" "1978-01-02 UTC" "1978-01-03 UTC" "1978-01-04 UTC"
## [5] "1978-01-05 UTC" "1978-01-06 UTC"
head(month(Births78$date))
## [1] 1 1 1 1 1 1

We will rename wday as Wday using rename():

myBirths78 <- Births78 %>% rename(Wday=wday)
head(myBirths78)
##         date births dayofyear  Wday
## 1 1978-01-01   7701         1   Sun
## 2 1978-01-02   7527         2   Mon
## 3 1978-01-03   8825         3  Tues
## 4 1978-01-04   8859         4   Wed
## 5 1978-01-05   9043         5 Thurs
## 6 1978-01-06   9208         6   Fri
  1. Make the following plot:

  1. You can find a data set listing US federal holidays at http://tiny.cc/dcf/US-Holidays.csv. Read it in as follows:1
Holidays <- read.csv("http://tiny.cc/dcf/US-Holidays.csv") %>%
mutate(date = lubridate::dmy(date))

Filter out cases that are not in the month of November or December

my_months=c(11,12)
myHolidays <- Holidays %>% filter(month(date) %in% my_months , year(date)==1978) 
  1. Add a layer to your plot from (1). Draw a vertical bar at each date which is a holiday in November or January. You’ll use the geom_vline() glyph. You can give a data = argument to geom_vline() to tell it to plot out the information from Holidays rather than MyTwoYears.

  1. Add a text label to each of the vertical bars to identify which holiday it is. Use the geom_text() glyph.2

  2. Next your layer your graph with a horizontal line with yintercept equal to the average number of births in for a day in 1978. Make the following plot:

  1. Use annotate() to label the horizontal line as average
plot4 + annotate("text", label="avg", x="Sun", y=9000)


  1. The point of the lubridate::dmy() function is to convert the character-string date stored in the CSV to a POSIX date-number.

  2. Hints: You’ll have to make up a y-coordinate for each label. You can set the orientation of each label with the angle aesthetic.