Source file ⇒ 2017-lec6.Rmd
last compiled on Fri Feb 3 22:43:56 2017
You have already seen two data verbs:
summarise()
group_by()
Although these are being written in computer notation, it’s also perfectly legitimate to express actions using them as English verbs. For instance: “Group the baby names by sex and year. Then summarize the groups by adding up the total number of births for each group. This will be the result.” That’s English. Here’s the equivalent statement in computer notation:
head(BabyNames)
name | sex | count | year |
---|---|---|---|
Mary | F | 7065 | 1880 |
Anna | F | 2604 | 1880 |
Emma | F | 2003 | 1880 |
Elizabeth | F | 1939 | 1880 |
Minnie | F | 1746 | 1880 |
Margaret | F | 1578 | 1880 |
BabyNames %>%
group_by( sex, year ) %>%
summarise( total=sum( count ) )
## Source: local data frame [268 x 3]
## Groups: sex [?]
##
## sex year total
## <chr> <int> <int>
## 1 F 1880 90993
## 2 F 1881 91954
## 3 F 1882 107850
## 4 F 1883 112322
## 5 F 1884 129022
## 6 F 1885 133055
## 7 F 1886 144535
## 8 F 1887 145982
## 9 F 1888 178627
## 10 F 1889 178366
## # ... with 258 more rows
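To see the grouping and summing at a scale that can be checked by hand, here is a minimal sketch. The `Toy` table and its values are made up for illustration; only the column names mirror BabyNames. The final `ungroup()` is an optional addition, not part of the statement above.

```r
library(dplyr)

# Made-up miniature stand-in for BabyNames (values are illustrative only)
Toy <- data.frame(
  name  = c("Mary", "Anna", "John", "Mary", "John"),
  sex   = c("F", "F", "M", "F", "M"),
  count = c(10, 5, 8, 12, 9),
  year  = c(1880, 1880, 1880, 1881, 1881)
)

# One output row per (sex, year) group, holding the sum of count in that group
Totals <- Toy %>%
  group_by(sex, year) %>%
  summarise(total = sum(count)) %>%
  ungroup()

Totals
```

The five input cases collapse to four rows because the two girls' names in 1880 fall in the same group.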
Consider the data table DataComputing::ZipGeography, where we examine small states.
## ZIP State Population LandArea
## 1 05001 Vermont 9172 110.3
## 2 05009 Vermont NA NA
## 3 05030 Vermont NA NA
## 4 05031 Vermont 98 11.5
## 5 05032 Vermont 2682 189.8
Here’s a graphic showing the mean population of all the ZIP codes in each small state.
Is this data table glyph-ready to produce the plot of average population per ZIP code for each small state?
Answer: No. We first need to group_by(State) and then summarise(aveZipPopulation=mean(Population,na.rm=TRUE), area=sum(LandArea,na.rm=TRUE)), as in the data wrangling below:
#Data wrangling
Zip <- ZipGeography %>% filter(State != "") %>%
group_by(State)%>%
summarise(aveZipPopulation=mean(Population,na.rm=TRUE), area=sum(LandArea,na.rm=TRUE)) %>%
filter(area<50000)
Here are the ggplot commands that produce the graphic:
Zip$State <- factor(Zip$State, levels = Zip$State[order(Zip$aveZipPopulation)]) # to make states ordered by aveZipPopulation in ggplot
Zip %>% ggplot(aes(x=State,y=aveZipPopulation)) + geom_point(aes(color=area)) + theme(axis.text.x = element_text(angle = 80, hjust = 1))
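The factor reordering step is easy to miss, so here is a small sketch of what it does on its own. The `Toy` table is made up for illustration; only the column names mirror the Zip table.

```r
# Made-up values for illustration only
Toy <- data.frame(
  State = c("A", "B", "C"),
  aveZipPopulation = c(30, 10, 20)
)

# order() gives the row positions from smallest to largest value,
# so the factor levels end up sorted by aveZipPopulation
Toy$State <- factor(Toy$State, levels = Toy$State[order(Toy$aveZipPopulation)])

levels(Toy$State)
```

ggplot places a factor along the axis in level order, which is why the states appear sorted by their average ZIP population rather than alphabetically.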
We will discuss 4 more data verbs:
select()
mutate()
filter()
arrange()
As with group_by() and summarise(), each is a standard English word whose action on data is reflected in the colloquial, everyday meaning. And, like English, intricate and detailed statements can be made by combining the words into expressions.
Selecting from a data table means choosing one or more variables from the table. Reasons to do this:
The syntax is similar to that of group_by() or summarise(). A data table is provided as input along with the names of the variables you are selecting. The result produced is a new data table with just those variables.
To illustrate, here are the first few cases in the BabyNames data table:
name | sex | count | year |
---|---|---|---|
Mary | F | 7065 | 1880 |
Anna | F | 2604 | 1880 |
Emma | F | 2003 | 1880 |
Elizabeth | F | 1939 | 1880 |
Minnie | F | 1746 | 1880 |
Margaret | F | 1578 | 1880 |
And here is the result of selecting just the name and year variables:
name | year |
---|---|
Mary | 1880 |
Anna | 1880 |
Emma | 1880 |
Elizabeth | 1880 |
Minnie | 1880 |
Margaret | 1880 |
If you want to rename a variable, use a named argument, as with when=year in the following:
BabyNames %>% select( name, when=year )
name | when |
---|---|
Mary | 1880 |
Anna | 1880 |
Emma | 1880 |
Elizabeth | 1880 |
Minnie | 1880 |
Margaret | 1880 |
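A minimal sketch of selecting and renaming in one step, using a two-row table typed in by hand from the BabyNames cases shown above (the name `Toy` is just for illustration):

```r
library(dplyr)

# Two cases copied from the head of BabyNames shown earlier
Toy <- data.frame(
  name  = c("Mary", "Anna"),
  sex   = c("F", "F"),
  count = c(7065, 2604),
  year  = c(1880, 1880)
)

# Keep name and year; year is renamed to when on the way through
Picked <- Toy %>% select(name, when = year)

names(Picked)
```

The cases are untouched: `Picked` has the same two rows, just fewer variables.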
To “filter” means to remove unwanted material. The data verb “filter” removes unwanted cases, passing through to the result only those cases that are wanted or needed. Filtering contrasts with selecting. Selecting passes the specified variables; filtering passes the specified cases.
In selecting, the variables are specified by name, e.g.
BabyNames %>% select( year, count ) %>% head()
year | count |
---|---|
1880 | 7065 |
1880 | 2604 |
1880 | 2003 |
1880 | 1939 |
1880 | 1746 |
1880 | 1578 |
With filtering, the cases are specified by one or more criteria or tests. The tests are generally constructed with variables and functions like ==, >, <, %in%, and so on. For instance, here’s how you can filter out the boys, producing a result with only the girls’ names:
BabyNames %>% filter( sex=="F") %>%
sample_n( size=6 )
row | name | sex | count | year |
---|---|---|---|---|
121991 | Cornelious | F | 8 | 1924 |
681602 | Kaytlin | F | 143 | 1993 |
815517 | Jaliah | F | 26 | 2001 |
622725 | Andre | F | 33 | 1989 |
614790 | Sherissa | F | 8 | 1988 |
530929 | Joyce | F | 663 | 1982 |
Here are the cases for either sex for babies born after 1990:
BabyNames %>% filter( year > 1990 ) %>%
sample_n( size=6 )
row | name | sex | count | year |
---|---|---|---|---|
52296 | Chassity | F | 82 | 1993 |
546718 | Dixi | F | 5 | 2009 |
27144 | Doreen | F | 67 | 1992 |
391465 | Marzell | M | 5 | 2004 |
389512 | Johany | M | 6 | 2004 |
363564 | Inaya | F | 43 | 2004 |
Here are the girls born after 1990:
BabyNames %>% filter( year > 1990, sex=="F") %>%
head()
row | name | sex | count | year |
---|---|---|---|---|
185121 | Kayra | F | 21 | 2002 |
360082 | Cheyanna | F | 27 | 2011 |
390870 | Priyana | F | 6 | 2012 |
18974 | Christyn | F | 31 | 1992 |
238137 | Chantelle | F | 47 | 2005 |
60760 | Crystalina | F | 5 | 1994 |
You can specify as many tests as you like. The filter() function will pass through only those cases that pass all the tests.
Sometimes you may want to set “either-or” criteria, say the babies who are female or born after 1990:
BabyNames %>% filter( year>1990 | sex=="F")
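The difference between "all tests" and "either-or" is easiest to see on a table small enough to count by hand. This is a sketch on made-up rows; the names and years are not taken from BabyNames.

```r
library(dplyr)

# Made-up rows for illustration only
Toy <- data.frame(
  name = c("Ann", "Bea", "Cal", "Dan"),
  sex  = c("F", "F", "M", "M"),
  year = c(1985, 1995, 1985, 1995)
)

both_comma <- Toy %>% filter(year > 1990, sex == "F")   # every test must pass
both_amp   <- Toy %>% filter(year > 1990 & sex == "F")  # the same thing written with &
either     <- Toy %>% filter(year > 1990 | sex == "F")  # at least one test must pass

nrow(both_comma)
nrow(either)
```

Separating tests with a comma behaves like `&`, so the first two results are the same single case, while the `|` version keeps three of the four cases.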
It’s also possible to test for a variable being any of several different values. For instance, here are the babies born in any of 1980, 1990, 2000, and 2010:
BabyNames %>%
filter( year %in% c(1980, 1990, 2000, 2010)) %>%
sample_n( size=6 )
row | name | sex | count | year |
---|---|---|---|---|
94778 | Vladimir | M | 179 | 2010 |
47188 | Kavya | F | 46 | 2000 |
98659 | Nahshon | M | 19 | 2010 |
94896 | Landin | M | 150 | 2010 |
28440 | Niema | F | 9 | 1990 |
75182 | Keila | F | 187 | 2010 |
filter() works well together with group_by(). For example, suppose you want only those names where the minimum of the counts is greater than 100.
BabyNames %>% group_by(name) %>%
filter(count==min(count)) %>%
filter(count>100) %>%
head()
## Source: local data frame [4 x 4]
## Groups: name [4]
##
## name sex count year
## <chr> <chr> <int> <int>
## 1 Jessie M 143 1881
## 2 Jacqueli F 157 1989
## 3 Cassandr F 152 1989
## 4 Christop M 1082 1989
Notice that group_by together with filter doesn’t change any of the variables.
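Inside a grouped filter, summary functions such as min() are computed separately for each group. The sketch below, on made-up data, contrasts the two-step approach used above with a variant (not from the original notes) that keeps every case for the qualifying names.

```r
library(dplyr)

# Made-up counts for two hypothetical names
Toy <- data.frame(
  name  = c("A", "A", "B", "B"),
  count = c(150, 200, 50, 300),
  year  = c(1900, 1901, 1900, 1901)
)

# Approach from the notes: keep each name's minimum-count case, then test it
MinRows <- Toy %>%
  group_by(name) %>%
  filter(count == min(count)) %>%
  filter(count > 100)

# Variant: keep all cases of any name whose minimum count exceeds 100
AllRows <- Toy %>%
  group_by(name) %>%
  filter(min(count) > 100)
```

Name B is dropped in both versions because its smallest count is 50. The first version returns one case for A, the second returns both of A's cases.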
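We can also report only the baby names used for over 100 years. Note that n() counts cases, and BabyNames has a separate case for each sex, so a name given to both boys and girls in the same year is counted twice. That is why years_used below can exceed the 134 years covered by the data.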
BabyNames %>% group_by(name) %>%
summarise(years_used=n()) %>%
filter(years_used>100) %>%
head()
name | years_used |
---|---|
Aaron | 218 |
Abbie | 176 |
Abby | 156 |
Abe | 134 |
Abel | 150 |
Abigail | 169 |
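If the goal is the number of distinct years rather than the number of cases, dplyr's n_distinct() is one option. A minimal sketch on made-up rows (the counts are illustrative, not from BabyNames):

```r
library(dplyr)

# A hypothetical name recorded for both sexes in 1880 and for one sex in 1881
Toy <- data.frame(
  name  = c("Jessie", "Jessie", "Jessie"),
  sex   = c("F", "M", "F"),
  count = c(20, 30, 25),
  year  = c(1880, 1880, 1881)
)

Usage <- Toy %>%
  group_by(name) %>%
  summarise(cases = n(),                    # counts rows: 3
            years_used = n_distinct(year))  # counts different years: 2

Usage
```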
The word “mutate” means to change in form or nature. The data verb “mutate” is a bit more specific: to change a variable or add new variables based on the existing ones. The data verb always refers to variables; mutation leaves the cases exactly as they were.
Often, mutation is used to combine or transform existing variables into a new variable. For instance, the CountryData data table has variables pop and area giving the population and area (in km^2) of each country. Suppose you wanted to know the population density, that is, how many people per unit area. Using mutate, you create a new variable that is population / area.
CountryData %>%
mutate( popDensity=pop/area ) %>%
select( country, pop, area, popDensity) %>%
sample_n(size=6)
row | country | pop | area | popDensity |
---|---|---|---|---|
70 | Ecuador | 15654411 | 283561 | 55.206502 |
161 | Navassa Island | NA | 5 | NA |
192 | Saint Kitts and Nevis | 51538 | 261 | 197.463602 |
198 | San Marino | 32742 | 61 | 536.754098 |
124 | Kiribati | 104488 | 811 | 128.838471 |
15 | Australia | 22507617 | 7741220 | 2.907503 |
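The arithmetic can be checked on a two-row table typed in by hand from the output above (the table name `Toy` is just for illustration):

```r
library(dplyr)

# Two cases copied from the CountryData output shown above
Toy <- data.frame(
  country = c("Ecuador", "San Marino"),
  pop     = c(15654411, 32742),
  area    = c(283561, 61)
)

# mutate() adds popDensity and leaves the cases and existing variables alone
Dens <- Toy %>% mutate(popDensity = pop / area)

Dens
```

The result has the same two cases and one extra variable, with densities matching the values in the table above.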
Arranging sets the order of cases. It does not change the variables; that’s a job for mutate(). Similarly, arranging does not filter the cases. Arranging merely sets the order of cases according to some criterion that you specify.
For instance, here are the first-choices from the Minneapolis mayoral election in 2013 found by counting the ballots:
Minneapolis2013 %>%
group_by( First ) %>%
summarise( total=n() ) %>%
head()
First | total |
---|---|
ABDUL M RAHAMAN “THE ROCK” | 338 |
ALICIA K. BENNETT | 351 |
BETSY HODGES | 28935 |
BILL KAHN | 97 |
BOB “AGAIN” CARNEY JR | 56 |
BOB FINE | 2094 |
The alphabetical order in the above might be good for some purposes. If your goal is to show who won and how they did compared to the other candidates, it’s better to arrange the results by total in descending order.
Minneapolis2013 %>%
group_by( First ) %>%
summarise( total=n() ) %>%
arrange( desc(total) ) %>%
head()
First | total |
---|---|
BETSY HODGES | 28935 |
MARK ANDREW | 19584 |
DON SAMUELS | 8335 |
CAM WINTON | 7511 |
JACKIE CHERRYHOMES | 3524 |
BOB FINE | 2094 |
By default, the arrangement goes in ascending order: from lowest to highest.
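A short sketch of the two directions, using three of the totals from the table above typed in by hand (the table name `Toy` is just for illustration):

```r
library(dplyr)

# Three first-choice totals copied from the results shown above
Toy <- data.frame(
  First = c("BETSY HODGES", "BOB FINE", "MARK ANDREW"),
  total = c(28935, 2094, 19584)
)

Ascending  <- Toy %>% arrange(total)        # default: lowest total first
Descending <- Toy %>% arrange(desc(total))  # desc() reverses the order

Ascending
Descending
```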
The notation used in these notes is dplyr. Keep in mind that this is just one of several notations. Some of them are:
Here is the same expression in these different notations:
BabyNames %>% group_by(year,sex) %>% summarise( nNames=n() )
"BabyNames" > GROUP_BY("year", "sex") > SUMMARISE(COUNT() AS "nNames")
BabyNames[, length(count), by=c("sex","year") ]
Please copy and paste this URL into your web browser.
http://gandalf.berkeley.edu:3838/alucas/Chapter-09-collection/
Please write a wrangling statement to extract only those names where the total number of births over all the years and both sexes is greater than 10,000. Your result will look like this (although the names in Twenty will be different).
Twenty %>%
group_by(name) %>%
summarise(total=sum(count)) %>% # add up births; n() would only count cases
filter(total>10000)
Write a wrangling statement to extract only those names for which there is some year where the total number of babies is greater than 50 (combining boys and girls). Your result should look like this: the name followed by the total count in the best year and the year in which that occurred.
Twenty %>% group_by(name, year) %>%
summarise(total=sum(count)) %>% # combine boys and girls within each year
group_by(name) %>%
filter(total==max(total)) %>% # keep each name's best year
filter(total>50)
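Now write a wrangling statement that extracts only those names which appear at least 50 times per year for at least 80 years of the 134-year time span.
Twenty %>%
group_by(name, year) %>%
summarise(total=sum(count)) %>% # combine boys and girls within each year
filter(total>=50) %>% # keep only the years with at least 50 babies
group_by(name) %>%
summarise(nyears=n()) %>% # one case per qualifying year
filter(nyears>=80)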
Here’s an attempt to calculate the number of votes for candidates who received more than 10,000 votes altogether.
Minneapolis2013 %>%
group_by(First) %>%
summarise(total_votes = n()) %>%
filter(total_votes > 10000)
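We will work with the mtcars data set. Note that in mtcars, am == 0 codes an automatic transmission and am == 1 a manual one, so despite their suffixes mtcars_m holds the automatic cars and mtcars_a the manual cars.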
mtcars_m <- mtcars %>%
filter(am==0)
mtcars_a <- mtcars %>%
filter(am==1)
head(mtcars_a)
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|
21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 |
30.4 | 4 | 75.7 | 52 | 4.93 | 1.615 | 18.52 | 1 | 1 | 4 | 2 |
33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.90 | 1 | 1 | 4 | 1 |
Before ggplot there was plotting with base R graphics. Many research papers still make their plots with base graphics, so you should be familiar with it.
In base graphics, a scatter diagram of mpg versus wt for the cars in mtcars_m (am == 0, automatic transmission) is made with:
plot( mtcars_m$wt,mtcars_m$mpg, col=as.factor(mtcars_m$cyl))
That is fine, but suppose we wish to add a layer of points for the cars in mtcars_a (am == 1, manual transmission):
plot(mtcars_m$wt,mtcars_m$mpg, col=as.factor(mtcars_m$cyl))
points( mtcars_a$wt,mtcars_a$mpg, col="blue")
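Here we see a major limitation of base graphics: the axis limits are fixed by the first plot() call, so any added points that fall outside that range are left off the plot.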
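One common workaround in base graphics, sketched here on the assumption that the problem is the axis range being set by the first layer, is to compute the limits from the full data and pass them as xlim and ylim. The names `cars0`, `cars1`, `xr` and `yr` are introduced only for this sketch.

```r
# Same subsets as in the notes, built with base R only
cars0 <- mtcars[mtcars$am == 0, ]   # am == 0: automatic transmission
cars1 <- mtcars[mtcars$am == 1, ]   # am == 1: manual transmission

# Axis limits wide enough for both groups
xr <- range(mtcars$wt)
yr <- range(mtcars$mpg)

plot(cars0$wt, cars0$mpg, col = as.factor(cars0$cyl),
     xlim = xr, ylim = yr, xlab = "wt", ylab = "mpg")
points(cars1$wt, cars1$mpg, col = "blue", pch = 2)  # triangles for the second group
```

This works, but the limits have to be managed by hand for every layer that is added.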
In ggplot we would have:
mtcars %>% ggplot(aes(x=wt,y=mpg)) + geom_point(aes(col=as.factor(cyl))) + facet_wrap(~am)
The vector precip gives the yearly precipitation in different cities. Using the base function hist(), make a histogram of precip (hint: try hist(precip)). Next make the plot in ggplot using the geom_histogram() function. You will need to convert the vector precip to a data frame using as.data.frame(precip). This might be helpful: ggplot2.org
As another example, suppose we wish to make a linear model of how mpg varies with car weight.
# Use lm() to calculate a linear model and save it as carModel
carModel <- lm(mpg ~ wt, data = mtcars_m)
carModel
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars_m)
##
## Coefficients:
## (Intercept) wt
## 31.416 -3.786
From the output, the best-fitting line through the scatter plot for the cars in mtcars_m is mpg = 31.416 - 3.786 * wt.
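The coefficients can also be pulled out of the model object and used for prediction. This is a minimal sketch with base R only; `cars0` is the same subset as mtcars_m, and the weight of 3 is a hypothetical value chosen for illustration.

```r
# Same rows as mtcars_m in the notes (am == 0)
cars0 <- mtcars[mtcars$am == 0, ]

carModel <- lm(mpg ~ wt, data = cars0)

# Named vector holding the intercept and the slope for wt
b <- coef(carModel)
b

# Predicted mpg for a hypothetical car with wt = 3
pred <- predict(carModel, newdata = data.frame(wt = 3))
pred
```

The prediction is simply the intercept plus 3 times the slope, which is what the fitted line says.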
We can draw the regression line through our plot of the mtcars_m cars. In base graphics the legend has to be constructed by hand, which is tedious.
plot(mtcars_m$wt,mtcars_m$mpg, col=as.factor(mtcars_m$cyl))
carModel <- lm(mpg ~ wt, data = mtcars_m)
abline(carModel, lty = 2)
legend(x = 5, y = 25, legend = levels(as.factor(mtcars_m$cyl)), col = 1:3, pch = 1, bty = "n")
In fact you can draw the regression line for each cylinder type.
plot(mtcars_m$wt,mtcars_m$mpg, col=as.factor(mtcars_m$cyl))
abline(lm(mpg ~ wt, data = mtcars_m, subset= (cyl ==4)), lty = 2)
abline(lm(mpg ~ wt, data = mtcars_m, subset= (cyl ==6)), lty = 2)
abline(lm(mpg ~ wt, data = mtcars_m, subset= (cyl ==8)), lty = 2)
legend(x = 5, y = 25, legend = levels(as.factor(mtcars_m$cyl)), col = 1:3, pch = 1, bty = "n")
Or, more efficiently, using lapply(), which you will soon learn about in the DataCamp course Intermediate R.
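plot(mtcars_m$wt,mtcars_m$mpg, col=as.factor(mtcars_m$cyl))
cyls <- sort(unique(mtcars_m$cyl)) # loop over each cylinder count once, not once per car
invisible(lapply(seq_along(cyls), function(i) {
# col = i matches the point colors and the legend (1:3)
abline(lm(mpg ~ wt, mtcars_m, subset = (cyl == cyls[i])), col = i)
}))
legend(x = 5, y = 25, legend = levels(as.factor(mtcars_m$cyl)), col = 1:3, pch = 1, bty = "n")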
In ggplot it is much easier. Note that we put the color aesthetic in the ggplot frame instead of geom_point since we want both the points and the regression lines to be categorized by color.
mtcars_m %>% ggplot(aes(x = wt, y = mpg, col = as.factor(cyl))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)