Albert Y. Kim
Monday 2016/2/15
We now discuss a grammar for data manipulation. Other terms for “data manipulation” include:
Deceptively powerful concept of tidy data, represented in R either as a data.frame or a tbl_df table data frame:
Example of Codd's 3rd Normal Form of database normalization:
We will revisit this concept later.
~85% of data manipulations can be achieved by the following verbs on tidy data.
filter(): subset rows matching criteria
select(): subset columns chosen by name
mutate(): add new variables by mutating existing ones
arrange(): reorder rows
summarise(): reduce variables to values
Each of these verbs is a command from the dplyr package.
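As a rough illustration of the five verbs, here is a minimal sketch on a small made-up data frame. The `scores` table and its column names are assumptions invented for this example; they are not from the lecture.

```r
library(dplyr)

# Hypothetical toy data, made up purely for illustration
scores <- data.frame(
  name  = c("Ana", "Ben", "Cal", "Dee"),
  exam1 = c(90, 65, 78, 82),
  exam2 = c(85, 70, 92, 60)
)

# filter(): keep only the rows matching a criterion
passed <- filter(scores, exam1 >= 70)

# select(): keep only the columns named
trimmed <- select(scores, name, exam1)

# mutate(): add a new variable computed from existing ones
with_avg <- mutate(scores, avg = (exam1 + exam2) / 2)

# arrange(): reorder rows, here by exam1 in ascending order
sorted <- arrange(scores, exam1)

# summarise(): reduce a variable to a single value
overall <- summarise(scores, mean_exam1 = mean(exam1))
```

Each verb takes a data frame as its first argument and returns a new data frame, leaving `scores` itself unchanged.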
The beauty of this package is that it is built on principles that are programming language/software agnostic, specifically Database Normalization, which SQL is based on as well.
Even if you don't end up using R later on, the previous five verbs are still how you would think about manipulating your data.
TRUE or FALSE.
group_by(): groups rows so that summarise() is computed within each group.
Piping originates from the magrittr package, which the dplyr package loads by default.
It allows you to take the output of one function and pipe it as the input of the next function, and build a sequence.
The pipe is written %>% and is read as “then”. This saves you from nested parentheses.
For example, say you want to apply the functions h(), then g(), and then f() to data x. You can do
f(g(h(x))) OR
h(x) %>% g() %>% f()
This allows for a sequential breaking down of tasks, allowing you and, more importantly, others to understand what you are doing!
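To make the comparison concrete, here is a small sketch that writes the same computation twice, once nested and once piped, and uses group_by() together with summarise(). The `sales` data frame and its columns are hypothetical, invented for this example.

```r
library(dplyr)

# Hypothetical toy data, made up purely for illustration
sales <- data.frame(
  region = c("east", "east", "west", "west", "west"),
  amount = c(10, 20, 5, 15, 40)
)

# Nested form: has to be read from the inside out
nested <- summarise(
  group_by(filter(sales, amount >= 10), region),
  total = sum(amount)
)

# Piped form: read top to bottom, saying "then" at each %>%
piped <- sales %>%
  filter(amount >= 10) %>%   # keep sales of at least 10, then
  group_by(region) %>%       # group the remaining rows by region, then
  summarise(total = sum(amount))  # compute one total per region
```

Both forms should give the same one-row-per-region table; the piped version simply lists the steps in the order they happen.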
== equals
5 == 3 yields FALSE
!= not equal to
5 != 3 yields TRUE
| or
5 < 3 | 5 < 10 yields TRUE
& and
5 < 3 & 5 < 10 yields FALSE
%in% is x in y?
c(1, 3, 2) %in% c(1, 2) yields TRUE FALSE TRUE
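These operators are typically what you write inside filter() to describe which rows to keep. The sketch below assumes a made-up `trips` data frame; the table and its column names are hypothetical and serve only to show the operators at work.

```r
library(dplyr)

# Hypothetical toy data, made up purely for illustration
trips <- data.frame(
  carrier = c("AA", "UA", "DL", "AA"),
  delay   = c(5, 45, 0, 60)
)

# A comparison is evaluated element by element: one TRUE/FALSE per row
is_late <- trips$delay > 30

# & keeps rows where both conditions hold
late_aa <- filter(trips, carrier == "AA" & delay > 30)

# | keeps rows where at least one condition holds
dl_or_late <- filter(trips, carrier == "DL" | delay > 30)

# %in% keeps rows whose value appears in the given set
aa_or_dl <- filter(trips, carrier %in% c("AA", "DL"))

# != keeps rows that differ from the given value
not_ua <- filter(trips, carrier != "UA")
```

In each call, filter() keeps exactly the rows for which the condition evaluates to TRUE.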