Source file ⇒ Lec9soln.Rmd
Grades <- read.file("http://tiny.cc/mosaic/grades.csv")
## Reading data with read.csv()
head(Grades,10)
## sid grade sessionID
## 1 S31185 D+ session1784
## 2 S31185 B+ session1785
## 3 S31185 A- session1791
## 4 S31185 B+ session1792
## 5 S31185 B- session1794
## 6 S31185 C+ session1795
## 7 S31185 B- session1796
## 8 S31185 NC session1807
## 9 S31188 C session1801
## 10 S31188 C+ session1802
Courses <- read.file("http://tiny.cc/mosaic/courses.csv")
## Reading data with read.csv()
head(Courses,10)
## sessionID dept level sem enroll iid
## 1 session1784 M 100 FA1991 22 inst265
## 2 session1785 k 100 FA1991 52 inst458
## 3 session1791 J 100 FA1993 22 inst223
## 4 session1792 J 300 FA1993 20 inst235
## 5 session1794 J 200 FA1993 22 inst234
## 6 session1795 J 200 SP1994 26 inst230
## 7 session1796 k 200 IT1994 25 inst466
## 8 session1801 C 100 SP1997 38 inst144
## 9 session1802 o 300 SP1997 16 inst390
## 10 session1807 W 100 SP1998 43 inst138
A join is a data verb that combines two tables.
These care called the left and the right tables. There are several kinds of join.
The various joins differ in how they handle multiple matches or missing matches.
joined_tables <- Grades %>%
select(sid, sessionID) %>%
left_join(Courses %>% select(enroll, sessionID), by=c("sessionID"="sessionID"))
head(joined_tables, 10)
## sid sessionID enroll
## 1 S31185 session1784 22
## 2 S31185 session1785 52
## 3 S31185 session1791 22
## 4 S31185 session1792 20
## 5 S31185 session1794 22
## 6 S31185 session1795 26
## 7 S31185 session1796 25
## 8 S31185 session1807 43
## 9 S31188 session1801 38
## 10 S31188 session1802 16
Make the following plot of the average class size for each student.
avg_class_size <- joined_tables %>%
group_by(sid) %>%
summarise(avg=mean(enroll, na.rm=TRUE))
#head(avg_class_size)
avg_class_size %>% ggplot(aes(x=sid, y=avg)) + geom_point() + scale_x_discrete(breaks=NULL)
Note: To figure out how to remove the tick mark labels (which aren’t legible) a good resource is the Cookbook for R
inner joins are usefull for filtering. For example suppose I want to know the departments where the most students get good grades.
good_grades <- Grades %>%
filter(grade %in% c("A","A-"))
head(good_grades)
## sid grade sessionID
## 1 S31185 A- session1791
## 2 S31188 A- session1810
## 3 S31188 A session2644
## 4 S31188 A- session3139
## 5 S31188 A session3141
## 6 S31188 A- session3187
good_depts <- good_grades %>%
inner_join(Courses) %>%
group_by(dept) %>%
tally(sort=TRUE)
## Joining by: "sessionID"
head(good_depts)
## Source: local data frame [6 x 2]
##
## dept n
## (chr) (int)
## 1 m 226
## 2 d 221
## 3 O 208
## 4 W 186
## 5 q 167
## 6 k 153
It is important to be able to look up the syntax and aesetics for the different geoms. Here is a good resourse from ggplot2.org
The data table Birth78
in MosaicData package gives a day by day record of the number of births in the United States in 1978.
head(Births78)
## date births dayofyear wday
## 1 1978-01-01 7701 1 Sun
## 2 1978-01-02 7527 2 Mon
## 3 1978-01-03 8825 3 Tues
## 4 1978-01-04 8859 4 Wed
## 5 1978-01-05 9043 5 Thurs
## 6 1978-01-06 9208 6 Fri
Make a new table DailyBirths that gives the average of births on the different days of the week for each month of the year. You will need to use the function lubridate::month(date) which gives you the month from date
. For example:
head(Births78$date)
## [1] "1978-01-01 UTC" "1978-01-02 UTC" "1978-01-03 UTC" "1978-01-04 UTC"
## [5] "1978-01-05 UTC" "1978-01-06 UTC"
head(month(Births78$date))
## [1] 1 1 1 1 1 1
For purposes of clarity later, we will rename wday as Wday using rename()
:
myBirths78 <- Births78 %>% rename(Wday=wday)
head(myBirths78)
## date births dayofyear Wday
## 1 1978-01-01 7701 1 Sun
## 2 1978-01-02 7527 2 Mon
## 3 1978-01-03 8825 3 Tues
## 4 1978-01-04 8859 4 Wed
## 5 1978-01-05 9043 5 Thurs
## 6 1978-01-06 9208 6 Fri
avg_Births <- myBirths78 %>%
mutate(Month=month(date)) %>%
group_by(Month,Wday) %>%
summarise(avg=mean(births))
head(avg_Births)
## Source: local data frame [6 x 3]
## Groups: Month [1]
##
## Month Wday avg
## (dbl) (fctr) (dbl)
## 1 1 Sun 7634.00
## 2 1 Mon 8918.60
## 3 1 Tues 9164.20
## 4 1 Wed 9069.75
## 5 1 Thurs 9228.75
## 6 1 Fri 9158.50
plot1 <- avg_Births %>% ggplot(aes(x=Wday,y=avg)) +geom_point(aes(color=as.factor(Month)))
plot1
http://tiny.cc/dcf/US-Holidays.csv
. Read it in as follows:1Holidays <- read.csv("http://tiny.cc/dcf/US-Holidays.csv") %>%
mutate(date = lubridate::dmy(date))
Filter out cases that are not in the month of November or December
my_months=c(11,12)
myHolidays <- Holidays %>% filter(month(date) %in% my_months , year(date)==1978)
geom_vline()
glyph. You can give a data =
argument to geom_vline()
to tell it to plot out the information from Holidays
rather than MyTwoYears
.plot2 <- avg_Births %>% ggplot(aes(x=Wday, y=avg)) + geom_point(aes(color=Wday)) +geom_vline(data= myHolidays, aes(xintercept=wday(date), color=wday(date,label=TRUE)))
plot2
geom_text()
glyph.2plot3 <- plot2 + geom_text(data = myHolidays,
aes(x = wday(date), y = 9000, label = holiday, angle = 65))
plot3
avg <- Births78 %>% summarise(avg=mean(births))
plot4 <- plot3 + geom_hline(yintercept = avg[1,1])
plot4
annotate()
to label the horizontal line as averageplot4 + annotate("text", label="avg", x="Sun", y=9000)