Albert Y. Kim
Friday 2015/01/30
What's up with the geom_bar(stat = "identity") code from last time?
The default stat for bar plots is “bin”, which bins (counts) the raw data for you.
If the data are already binned, we don't want to bin them again, so we set
stat = "identity", meaning “take the numbers as they are”.
See R code.
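As a rough illustration of the point above, here is a minimal sketch with pre-binned data. The data frame `counts` and its columns `group` and `n` are made up for this example and are not from the course code.

```r
library(ggplot2)

# Hypothetical pre-binned data: one row per category, counts already computed
counts <- data.frame(
  group = c("a", "b", "c"),
  n     = c(10, 25, 5)
)

# stat = "identity" tells geom_bar() to use n directly as the bar height,
# instead of computing heights from the number of rows in each group
p <- ggplot(counts, aes(x = group, y = n)) +
  geom_bar(stat = "identity")
print(p)
```

Without stat = "identity", the bar heights would be computed from the rows themselves rather than taken from the n column.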
The qplot() command (Chapter 2 in ggplot text) describes a way to make “quick” plots, such as simple histograms and scatterplots.
It is built on the same grammar of graphics, so you can still add layers to a qplot() result.
See R code.
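A small sketch of what a quick plot plus an added layer might look like, using the mtcars data set that ships with R (chosen here only for illustration). Note that recent ggplot2 releases mark qplot() as deprecated, so this may print a warning.

```r
library(ggplot2)

# Quick scatterplot of fuel efficiency against weight
q <- qplot(wt, mpg, data = mtcars)

# The result is an ordinary ggplot object, so further layers can be added
q2 <- q + geom_smooth(method = "lm")
print(q2)
```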
Male vs female admissions
Go to R code.
We now discuss a grammar for data manipulation. Other terms for “data manipulation” include:
Most data manipulations can be achieved by the following verbs on a “tidy” data frame:
filter: keep rows matching criteria
summarise: reduce variables to values
mutate: add new variables
arrange: reorder rows
select: pick columns by name
Each of these is a command from the dplyr package.
The beauty of this “grammar” (and the grammar of graphics) is that it is programming language/software agnostic.
Even if later on you don't end up using R, the previous five verbs are still how you would think about manipulating your data.
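To make the five verbs concrete, here is a minimal sketch applying each one to the built-in mtcars data set. The data set and the derived column name `wt_kg` are choices made for this example only.

```r
library(dplyr)

# filter: keep rows matching criteria (cars with 4 cylinders)
four_cyl <- filter(mtcars, cyl == 4)

# summarise: reduce variables to values (mean miles per gallon)
avg_mpg <- summarise(mtcars, mean_mpg = mean(mpg))

# mutate: add new variables (wt is recorded in 1000s of lbs)
with_kg <- mutate(mtcars, wt_kg = wt * 1000 * 0.4536)

# arrange: reorder rows (most fuel-efficient cars first)
by_mpg <- arrange(mtcars, desc(mpg))

# select: pick columns by name
two_cols <- select(mtcars, mpg, cyl)
```

Each verb takes a data frame as its first argument and returns a new data frame, which is what makes them easy to chain.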
%>% command, described as “then”.
TRUE or FALSE.
group_by() command that is useful for summarise()'ations.
The %>% command, described as “then”.
This saves you from a morass of nesting.
For example, say you want to apply functions h(), then g(), and then f() to data x. You can write either
f(g(h(x)))
OR
h(x) %>% g() %>% f()
The piped form breaks the task down sequentially, allowing you and, more importantly, others to understand what you are doing!
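Here is a small sketch comparing the nested and piped forms, with sqrt(), mean() and round() standing in for h(), g() and f(). The second half chains group_by() into summarise() on mtcars. The vector `x` and the choice of functions are illustrative only.

```r
library(dplyr)  # makes %>% available

x <- c(1, 4, 9, 16)

# Nested form: must be read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped form: "take x, then sqrt, then mean, then round"
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)  # TRUE

# group_by() then summarise(): one mean per number of cylinders
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))
```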
== equals
5 == 3 yields FALSE
!= not equal to
5 != 3 yields TRUE
| or
5 < 3 | 5 < 10 yields TRUE
& and
5 < 3 & 5 < 10 yields FALSE
%in% is x in y?
c(1, 3, 2) %in% c(1, 2) yields TRUE FALSE TRUE
We need to install the devtools package. Unfortunately, you can't just download it from CRAN (i.e. directly from RStudio as we've been doing). Follow the instructions here. Beforehand
sessionInfo() in R), and install Rtools32.exe
Then run
devtools::install_github("hadley/rvest")
library(rvest)
and ensure the package rvest loads.
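# Install rvest from GitHub only if it is not already available
if (!require("rvest")) devtools::install_github("hadley/rvest")
library(rvest)
# Download and parse the page (read_html() supersedes the older html())
webpage <- read_html("http://en.wikipedia.org/wiki/List_of_Stanley_Cup_champions")
# Take the third table on the page and convert it to a data frame
stanley.cup <- webpage %>% html_nodes("table") %>% .[[3]] %>% html_table()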
Get comfortable with this: dplyr cheat sheet from the folks at RStudio.