Albert Y. Kim
Monday 2015/02/02
Get comfortable with this: dplyr cheat sheet from the folks at RStudio.
tbl_df(UCB)
Source: local data frame [24 x 4]
Admit Gender Dept Freq
1 Admitted Male A 512
2 Rejected Male A 313
3 Admitted Female A 89
4 Rejected Female A 19
5 Admitted Male B 353
6 Rejected Male B 207
7 Admitted Female B 17
8 Rejected Female B 8
9 Admitted Male C 120
10 Rejected Male C 205
.. ... ... ... ...
tbl_df(UCB) %>% group_by(Admit, Dept) %>% summarize(Freq=sum(Freq))
Source: local data frame [12 x 3]
Groups: Admit
Admit Dept Freq
1 Admitted A 601
2 Admitted B 370
3 Admitted C 322
4 Admitted D 269
5 Admitted E 147
6 Admitted F 46
7 Rejected A 332
8 Rejected B 215
9 Rejected C 596
10 Rejected D 523
11 Rejected E 437
12 Rejected F 668
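The UCB object is not defined in these notes. Assuming it is R's built-in UCBAdmissions table converted to a data frame (its counts match the output above), the grouped sum can be reproduced with the sketch below. The name by_dept is made up for illustration, and the .groups argument assumes dplyr 1.0 or later.

```r
library(dplyr)

# Assumption: UCB is the built-in UCBAdmissions table as a data frame
# with columns Admit, Gender, Dept, Freq (24 rows).
UCB <- as.data.frame(UCBAdmissions)

# Collapse over Gender: one row per Admit x Dept combination.
by_dept <- UCB %>%
  group_by(Admit, Dept) %>%
  summarize(Freq = sum(Freq), .groups = "drop")

by_dept
```

Because Gender is left out of group_by(), each Freq is the sum of the male and female counts for that Admit and Dept pair.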
He is the author of the following R packages:
ggplot
dplyr
lubridate: handling dates and times (later)
stringr: handling character strings, i.e. computer text (later)
The rvest package works by scraping the HTML code used to make webpages. To view a webpage's raw HTML code:
The html_nodes() function looks for HTML tags; here it selects every table element.
nhl is a list that contains all the HTML tables on the page.
webpage <- html("http://www.nhl.com")
nhl <- webpage %>%
html_nodes("table")
Now we pull the first table and then apply the html_table() function to convert it to a data frame that R can work with.
webpage <- html("http://www.nhl.com")
nhl <- webpage %>%
html_nodes("table") %>%
.[[1]] %>% html_table()
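The same pipeline can be tried without a network connection. The sketch below parses a made-up HTML snippet (the table contents and the names page and standings are hypothetical) and pulls its first table. It uses read_html(), which replaced html() in more recent versions of rvest.

```r
library(rvest)

# Hypothetical page: a made-up HTML snippet standing in for a real website.
page <- read_html("<html><body>
  <table>
    <tr><th>Team</th><th>Wins</th></tr>
    <tr><td>Team A</td><td>10</td></tr>
    <tr><td>Team B</td><td>7</td></tr>
  </table>
</body></html>")

# html_nodes() returns every <table> element; take the first and convert it.
standings <- page %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table()

standings
```

On a real page the first table is not always the one you want, so it is worth inspecting the full list returned by html_nodes("table") before choosing an index.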
We also use the leaflet package to draw maps: http://www.r-bloggers.com/the-leaflet-package-for-online-mapping-in-r/
Article from the Washington Post: Colleges often give discounts to the rich. But here’s one that gave up on ‘merit aid.’
Using the data provided, we investigate which states have the highest average no-need grants (averaged over schools).
Please read this article for Wednesday.
A lot of sophisticated tools are used here, so don't try to figure out all the data preprocessing. Focus on the dplyr and ggplot code. Some new concepts:
rename() command in dplyr allows you to rename columns
ifelse(TEST, A, B): run TEST, return A if TRUE, return B if FALSE. Example: ifelse(3 <= 5, 1, 0) returns 1
NA: is.na(x) tests if x is a missing value. If you set na.rm=TRUE in many functions, they will ignore the missing values. Make sure you want to do this!
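The three concepts above can be tried on a small example. This is only a sketch: the grants data frame and its column names are made up for illustration and are not the data from the article.

```r
library(dplyr)

# Hypothetical data: made-up grant amounts, one of them missing.
grants <- data.frame(state = c("MA", "MA", "VT"), amt = c(1000, NA, 500))

# rename(): the new name goes on the left, the old name on the right.
grants <- grants %>% rename(amount = amt)

# ifelse(): the test is applied element by element.
ifelse(3 <= 5, 1, 0)
grants$large <- ifelse(grants$amount >= 750, "yes", "no")

# is.na(): flags the missing values.
is.na(grants$amount)

# Without na.rm = TRUE, a single NA makes the whole mean NA.
mean(grants$amount)
mean(grants$amount, na.rm = TRUE)
```

Note that ifelse() returns NA wherever the test itself is NA, so the missing amount stays missing in the new large column.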