Source file ⇒ 2017-lec6.Rmd

last compiled on Fri Feb 3 22:43:56 2017

Announcements

no way for me to penalize you for late DataCamp!

Today:

DC chapter 9 more data verbs
ggplot2 versus base package graphics

1. DC chapter 9 more data verbs

You have already seen two data verbs:

summarise()
group_by()

Although these are being written in computer notation, it’s also perfectly legitimate to express actions using them as English verbs. For instance: “Group the baby names by sex and year. Then summarize the groups by adding up the total number of births for each group. This will be the result.” That’s English. Here’s the equivalent statement in computer notation:

head(BabyNames)

name	sex	count	year
Mary	F	7065	1880
Anna	F	2604	1880
Emma	F	2003	1880
Elizabeth	F	1939	1880
Minnie	F	1746	1880
Margaret	F	1578	1880

BabyNames %>%  
  group_by( sex, year ) %>% 
  summarise( total=sum( count ) )

## Source: local data frame [268 x 3]
## Groups: sex [?]
## 
##      sex  year  total
##    <chr> <int>  <int>
## 1      F  1880  90993
## 2      F  1881  91954
## 3      F  1882 107850
## 4      F  1883 112322
## 5      F  1884 129022
## 6      F  1885 133055
## 7      F  1886 144535
## 8      F  1887 145982
## 9      F  1888 178627
## 10     F  1889 178366
## # ... with 258 more rows
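To see exactly what the two verbs do, it can help to run the same statement on a table small enough to check by hand. The sketch below uses a made-up five-case table called toy_names (not part of BabyNames; the values are invented for illustration) and assumes dplyr version 1.0 or later for the .groups argument.

```r
library(dplyr)

# Hypothetical miniature table with the same variables as BabyNames
toy_names <- data.frame(
  name  = c("Mary", "Anna", "John", "Mary", "John"),
  sex   = c("F", "F", "M", "F", "M"),
  count = c(10, 5, 8, 12, 9),
  year  = c(1880, 1880, 1880, 1881, 1881)
)

# Group by sex and year, then add up the births within each group
toy_totals <- toy_names %>%
  group_by(sex, year) %>%
  summarise(total = sum(count), .groups = "drop")

toy_totals
```

There are four sex and year combinations in the toy table, so the result has four cases. The two girls' names from 1880 are added together into a single total.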

i-clicker question

Consider the data table DataComputing::ZipGeography, where we examine small states.

##     ZIP   State Population LandArea
## 1 05001 Vermont       9172    110.3
## 2 05009 Vermont         NA       NA
## 3 05030 Vermont         NA       NA
## 4 05031 Vermont         98     11.5
## 5 05032 Vermont       2682    189.8

Here’s a graphic showing the mean population of all the ZIP codes in each small state.

Is this data table glyph-ready to produce the plot below of average population per ZIP code for each small state?

Answer: No, since we first need to wrangle the data with group_by(State) and summarise(aveZipPopulation=mean(Population,na.rm=TRUE), area=sum(LandArea,na.rm=TRUE)) (see below):

#Data wrangling
Zip <- ZipGeography %>% filter(State != "") %>%
  group_by(State)%>%
  summarise(aveZipPopulation=mean(Population,na.rm=TRUE), area=sum(LandArea,na.rm=TRUE)) %>%
  filter(area<50000)
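The na.rm=TRUE arguments matter because some ZIP codes have a missing Population or LandArea. As a small sketch in base R, here are the five Vermont populations printed above, typed in by hand (the vector name vt_pop is just for illustration):

```r
# Population values of the five Vermont ZIP codes shown above
vt_pop <- c(9172, NA, NA, 98, 2682)

mean(vt_pop)                 # NA: one missing value makes the whole mean missing
mean(vt_pop, na.rm = TRUE)   # mean of the three known values
```

With na.rm = TRUE the missing cases are dropped before averaging, so the result is (9172 + 98 + 2682) / 3 = 3984.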

Here are the ggplot commands that produce the graphic shown above:

Zip$State <- factor(Zip$State, levels = Zip$State[order(Zip$aveZipPopulation)]) # to make states ordered by aveZipPopulation in ggplot

Zip %>% ggplot(aes(x=State,y=aveZipPopulation)) + geom_point(aes(color=area)) + theme(axis.text.x = element_text(angle = 80, hjust = 1))

More data verbs

We will discuss 4 more data verbs:

select()
mutate()
filter()
arrange()

As with group_by() and summarise(), each is a standard English word whose action on data is reflected in the colloquial, everyday meaning. And, like English, intricate and detailed statements can be made by combining the words into expressions.

Select

Selecting from a data table means choosing one or more variables from the table. Reasons to do this:

Simplify the table you are working on.
Rename one or more of the variables to make it more convenient to work with.

The syntax is similar to that of group_by() or summarise(). A data table is provided as input along with the names of the variables you are selecting. The result produced is a new data table with just those variables.

To illustrate, here’s the first few cases in the BabyNames data table:

name	sex	count	year
Mary	F	7065	1880
Anna	F	2604	1880
Emma	F	2003	1880
Elizabeth	F	1939	1880
Minnie	F	1746	1880
Margaret	F	1578	1880

And here is the result of selecting just the name and year variables:

name	year
Mary	1880
Anna	1880
Emma	1880
Elizabeth	1880
Minnie	1880
Margaret	1880

If you want to rename a variable, use a named argument, as with when=year in the following:

BabyNames %>% select( name, when=year )

name	when
Mary	1880
Anna	1880
Emma	1880
Elizabeth	1880
Minnie	1880
Margaret	1880
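If you only want to rename and not drop anything, dplyr also has a rename() verb. The sketch below compares the two on a made-up table (toy_names is hypothetical, with the same variable names as BabyNames).

```r
library(dplyr)

# Hypothetical miniature table with the same variables as BabyNames
toy_names <- data.frame(
  name  = c("Mary", "Anna", "John"),
  sex   = c("F", "F", "M"),
  count = c(10, 5, 8),
  year  = c(1880, 1880, 1881)
)

selected <- toy_names %>% select(name, when = year)  # keeps only the two named variables
renamed  <- toy_names %>% rename(when = year)        # keeps all four variables

names(selected)
names(renamed)
```

select() returns just name and when, while rename() returns name, sex, count and when. Neither verb changes the cases.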

Filter

To “filter” means to remove unwanted material. The data verb “filter” removes unwanted cases, passing through to the result only those cases that are wanted or needed. Filtering contrasts with selecting. Selecting passes the specified variables; filtering passes the specified cases.

In selecting, the variables are specified by name, e.g.

BabyNames %>% select( year, count ) %>% head()

year	count
1880	7065
1880	2604
1880	2003
1880	1939
1880	1746
1880	1578

With filtering, the cases are specified by one or more criteria or tests. The tests are generally constructed with variables and functions like ==, >, <, %in%, and so on. For instance, here’s how you can filter out the boys, producing a result with only the girls’ names:

BabyNames %>% filter( sex=="F") %>%
  sample_n( size=6 )

	name	sex	count	year
121991	Cornelious	F	8	1924
681602	Kaytlin	F	143	1993
815517	Jaliah	F	26	2001
622725	Andre	F	33	1989
614790	Sherissa	F	8	1988
530929	Joyce	F	663	1982

Here are the cases for either sex for babies born after 1990:

BabyNames %>% filter( year > 1990 ) %>% 
  sample_n( size=6 )

	name	sex	count	year
52296	Chassity	F	82	1993
546718	Dixi	F	5	2009
27144	Doreen	F	67	1992
391465	Marzell	M	5	2004
389512	Johany	M	6	2004
363564	Inaya	F	43	2004

Here are the girls born after 1990:

BabyNames %>% filter( year > 1990, sex=="F") %>%
  head()

	name	sex	count	year
185121	Kayra	F	21	2002
360082	Cheyanna	F	27	2011
390870	Priyana	F	6	2012
18974	Christyn	F	31	1992
238137	Chantelle	F	47	2005
60760	Crystalina	F	5	1994

You can specify as many tests as you like. The filter() function will pass through only those cases that pass all the tests.

Sometimes you may want to set “either-or” criteria, say the babies who are female or born after 1990:

BabyNames %>% filter( year>1990 | sex=="F")
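The difference between "all the tests" and "either-or" is easy to check on a tiny table. This is only a sketch: toy_names below is a made-up four-case table, not BabyNames.

```r
library(dplyr)

# Hypothetical four-case table, used only for illustration
toy_names <- data.frame(
  name = c("Ann", "Bob", "Cat", "Dan"),
  sex  = c("F", "M", "F", "M"),
  year = c(1985, 1995, 1995, 1985)
)

both_comma <- toy_names %>% filter(year > 1990, sex == "F")   # must pass both tests
both_amp   <- toy_names %>% filter(year > 1990 & sex == "F")  # same thing written with &
either     <- toy_names %>% filter(year > 1990 | sex == "F")  # must pass at least one test

both_comma
either
```

Only Cat passes both tests. Ann, Bob and Cat each pass at least one, so the either-or version lets three cases through. Dan fails both tests and is dropped in every version.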

It’s also possible to test for a variable being any of several different values. For instance, here are the babies born in any of 1980, 1990, 2000, and 2010:

BabyNames %>% 
  filter( year %in% c(1980, 1990, 2000, 2010)) %>%
  sample_n( size=6 )

	name	sex	count	year
94778	Vladimir	M	179	2010
47188	Kavya	F	46	2000
98659	Nahshon	M	19	2010
94896	Landin	M	150	2010
28440	Niema	F	9	1990
75182	Keila	F	187	2010

filter() works well together with group_by(). For example, suppose you want only those names where the minimum of the counts is greater than 100.

BabyNames %>% group_by(name) %>%
  filter(count==min(count)) %>%
  filter(count>100) %>%
  head()

## Source: local data frame [4 x 4]
## Groups: name [4]
## 
##       name   sex count  year
##      <chr> <chr> <int> <int>
## 1   Jessie     M   143  1881
## 2 Jacqueli     F   157  1989
## 3 Cassandr     F   152  1989
## 4 Christop     M  1082  1989

Notice that group_by together with filter doesn’t change any of the variables.
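Here is a small sketch of the same grouped filter on a made-up table (toy_names is hypothetical), so you can follow each step by hand.

```r
library(dplyr)

# Hypothetical table: two names, two years each
toy_names <- data.frame(
  name  = c("Ann", "Ann", "Bob", "Bob"),
  count = c(150, 120, 90, 200),
  year  = c(2000, 2001, 2000, 2001)
)

lowest_years <- toy_names %>%
  group_by(name) %>%
  filter(count == min(count)) %>%  # within each name, keep the case with the lowest count
  filter(count > 100) %>%          # then keep it only if that lowest count is above 100
  ungroup()

lowest_years
```

The minimum is computed separately for each name because of group_by(). Ann's lowest count is 120, which passes. Bob's lowest count is 90, which fails, so Bob disappears. The surviving case still has all three original variables.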

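We can also report only the baby names that appear in more than 100 cases. Note that n() counts cases, and each case is one combination of name, sex and year, so a name given to both boys and girls can contribute two cases per year. That is why years_used can be larger than the number of years in the data.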

BabyNames %>% group_by(name) %>%
  summarise(years_used=n()) %>%
  filter(years_used>100) %>%
  head()

name	years_used
Aaron	218
Abbie	176
Abby	156
Abe	134
Abel	150
Abigail	169

Mutate

The word “mutate” means to change in form or nature. The data verb “mutate” is a bit more specific: to change a variable or add new variables based on the existing ones. The data verb always refers to variables; mutation leaves the cases exactly as they were.

Often, mutation is used to combine or transform existing variables into a new variable. For instance, the CountryData data table has variables pop and area giving the population and area (in km^2) of each country. Suppose you wanted to know the population density, that is, how many people per unit area. Using mutate, you create a new variable that is population / area.

CountryData %>% 
  mutate( popDensity=pop/area ) %>% 
  select( country, pop, area, popDensity) %>%
  sample_n(size=6)

	country	pop	area	popDensity
70	Ecuador	15654411	283561	55.206502
161	Navassa Island	NA	5	NA
192	Saint Kitts and Nevis	51538	261	197.463602
198	San Marino	32742	61	536.754098
124	Kiribati	104488	811	128.838471
15	Australia	22507617	7741220	2.907503
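As a sketch of what mutate() does and does not change, here is the same calculation on a made-up table (toy_countries and its values are invented for illustration, not taken from CountryData).

```r
library(dplyr)

# Hypothetical three-country table
toy_countries <- data.frame(
  country = c("Alpha", "Beta", "Gamma"),
  pop     = c(1000, NA, 300),
  area    = c(10, 5, 60)
)

with_density <- toy_countries %>%
  mutate(popDensity = pop / area)  # adds one variable; the cases stay the same

with_density
```

The result has the same three cases plus one new variable. Where pop is missing, popDensity is missing too, just like Navassa Island in the output above.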

Arrange

Arranging sets the order of cases. It does not change the variables — that’s a job for mutate(). Similarly, arranging does not filter the cases. Arranging merely sets the order of cases according to some criterion that you specify.

For instance, here are the first-choices from the Minneapolis mayoral election in 2013 found by counting the ballots:

Minneapolis2013 %>%
  group_by( First ) %>% 
  summarise( total=n() ) %>%
  head()

First	total
ABDUL M RAHAMAN “THE ROCK”	338
ALICIA K. BENNETT	351
BETSY HODGES	28935
BILL KAHN	97
BOB “AGAIN” CARNEY JR	56
BOB FINE	2094

The alphabetical order in the above might be good for some purposes. If your goal is to show who won and how they did compared to the other candidates, it’s better to arrange the results by total in descending order.

Minneapolis2013 %>%
  group_by( First ) %>% 
  summarise( total=n() ) %>%
  arrange( desc(total) ) %>%
  head()

First	total
BETSY HODGES	28935
MARK ANDREW	19584
DON SAMUELS	8335
CAM WINTON	7511
JACKIE CHERRYHOMES	3524
BOB FINE	2094

By default, the arrangement goes in ascending order: from lowest to highest.
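A quick sketch of the two directions, using a made-up tally table (the candidate labels are hypothetical):

```r
library(dplyr)

# Hypothetical vote tallies
tallies <- data.frame(
  candidate = c("A", "B", "C"),
  total     = c(351, 28935, 97)
)

ascending  <- tallies %>% arrange(total)        # default: lowest to highest
descending <- tallies %>% arrange(desc(total))  # desc() reverses the order

ascending
descending
```

Both results contain the same three cases and the same variables. Only the order of the cases differs.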

Data verb languages

The notation used in these notes is dplyr. Keep in mind that this is just one of several notations. Some of them are:

dplyr - R
data.table - R (for big data; see the course “Data Analysis in R, the data.table Way” if interested)
SQL - database servers

Here is the same expression in these different notations:

dplyr: BabyNames %>% group_by(year,sex) %>% summarise( nNames=n() )
SQL: SELECT year, sex, COUNT(*) AS nNames FROM BabyNames GROUP BY year, sex
data.table: BabyNames[, length(count), by=c("sex","year") ]

In class exercises (FIXED)

Please copy and paste the URL into your web browser.
http://gandalf.berkeley.edu:3838/alucas/Chapter-09-collection/

Answers:

Please write a wrangling statement to extract out only those names where the total number of births over all the years and both sexes is greater than 10,000. Your result will look like this (although the names in Twenty will be different.)

Twenty %>% 
  group_by(name) %>%
  summarise(total=sum(count)) %>%
  filter(total>10000)

Write a wrangling statement to extract only those names for which there is some year where the total number of babies is greater than 50 (combining boys and girls). Your result should look like this: the name followed by the total count in the best year and the year in which that occurred.

Twenty %>% group_by(name, year) %>%
  summarise(total=sum(count)) %>%   # combine boys and girls within each year
  filter(total==max(total)) %>%     # still grouped by name: keep each name's best year
  filter(total>50)

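Now write a wrangling statement that extracts only those names which appear at least 50 times per year for at least 80 of the 134-year time span.

Twenty %>% 
  group_by(name, year) %>%
  summarise(total=sum(count)) %>%   # births per name per year (assumes both sexes are combined)
  filter(total>=50) %>%             # keep the years with at least 50 births
  summarise(nyears=n()) %>%         # count those years for each name
  filter(nyears>=80)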

Here’s an attempt to calculate the number of votes for candidates who received more than 10,000 votes altogether.

Minneapolis2013 %>%
  group_by(First) %>%
  summarise(total_votes = n()) %>%
  filter(total_votes > 10000)

2. ggplot2 versus base package graphics

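We will work with the mtcars data set. In mtcars, am = 0 means automatic transmission and am = 1 means manual, so despite their names mtcars_m (am == 0) holds the automatic cars and mtcars_a (am == 1) holds the manual cars.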

mtcars_m <- mtcars %>% 
  filter(am==0)

mtcars_a <- mtcars %>%
  filter(am==1) 
  
head(mtcars_a)

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21.0	6	160.0	110	3.90	2.620	16.46	0	1	4	4
21.0	6	160.0	110	3.90	2.875	17.02	0	1	4	4
22.8	4	108.0	93	3.85	2.320	18.61	1	1	4	1
32.4	4	78.7	66	4.08	2.200	19.47	1	1	4	1
30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
33.9	4	71.1	65	4.22	1.835	19.90	1	1	4	1

Before ggplot there was plotting with the base R package. Many research papers still make their plots with the base package, so you should be familiar with it.

example 1

In the base package, here is how to make a scatter diagram of mpg versus wt for the cars in mtcars_m:

plot( mtcars_m$wt,mtcars_m$mpg, col=as.factor(mtcars_m$cyl))

That is fine, but suppose we wish to add an additional layer of points corresponding to the cars in mtcars_a:

plot(mtcars_m$wt,mtcars_m$mpg,  col=as.factor(mtcars_m$cyl))
points( mtcars_a$wt,mtcars_a$mpg, col="blue")

Here we see some major limitations of base package drawing:

The plot doesn’t get redrawn when a layer is added (for example, the axes are not rescaled to fit the new points).
The plot is drawn as an image: every layer is drawn on top of the image. In ggplot the plot is an object which we can change.
We need to add a legend ourselves, and by the time you make the legend manually you may have forgotten what the different colors mean.

In ggplot we would have:

mtcars %>% ggplot(aes(x=wt,y=mpg)) + geom_point(aes(col=as.factor(cyl))) + facet_wrap(~am)
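To illustrate the point that a ggplot is an object we can change, here is a sketch that builds the same plot in steps. The names p and p_facets are just for illustration.

```r
library(ggplot2)

# Build the scatter plot and store it instead of drawing it right away
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(col = as.factor(cyl)))

# Change the stored plot: split into panels by am and tidy the legend title
p_facets <- p + facet_wrap(~am) + labs(col = "cyl")

p_facets  # printing the object draws it, with the legend made automatically
```

Nothing is drawn until the object is printed, so layers, facets and labels can be added or changed in any order, and the axes and legend are worked out from the final object.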

in class exercise

The vector precip gives the yearly precipitation in different cities. Using the base package function hist, make a histogram of precip (hint: try hist(precip)). Next make the plot in ggplot using the geom_histogram() function. You will need to convert the vector precip to a data frame using as.data.frame(precip). This might be helpful: ggplot2.org
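One possible way to do this is sketched below. The name precip_df and the choice of 10 bins are assumptions made for illustration. Other bin settings are just as reasonable.

```r
library(ggplot2)

# Base package version
hist(precip)

# ggplot version: as.data.frame() gives a data frame with one variable named precip
precip_df <- as.data.frame(precip)

ggplot(precip_df, aes(x = precip)) +
  geom_histogram(bins = 10)
```

The two histograms will not look identical, because hist() and geom_histogram() choose their bin edges differently.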

example 2

As another example, suppose we wish to make a linear model of how mpg varies with car weight.

# Use lm() to calculate a linear model and save it as carModel
carModel <- lm(mpg ~ wt, data = mtcars_m)
carModel

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars_m)
## 
## Coefficients:
## (Intercept)           wt  
##      31.416       -3.786

From the output we can read off the best-fitting line through the scatter plot for the cars in mtcars_m: mpg = 31.416 - 3.786 wt.

We can draw this regression line through our plot for the cars in mtcars_m. The legend is hard to make manually.
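The fitted model is also an object, so the coefficients can be pulled out and used rather than read off the printout. The sketch below uses only base R and rebuilds the data with subset() so that it runs on its own. The name cars_am0 is just for illustration and holds the same cases as mtcars_m.

```r
# Same cases as mtcars_m: the cars with am equal to 0
cars_am0 <- subset(mtcars, am == 0)

carModel <- lm(mpg ~ wt, data = cars_am0)

coef(carModel)  # named vector holding the intercept and the slope for wt

# Predicted mpg for a car with wt = 3 (wt is measured in 1000 lbs)
predict(carModel, newdata = data.frame(wt = 3))
```

Using the rounded coefficients printed above, the prediction is roughly 31.416 - 3.786 * 3 = 20.06 mpg. The value from predict() may differ slightly in the later decimals because it uses the unrounded coefficients.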

plot(mtcars_m$wt,mtcars_m$mpg,  col=as.factor(mtcars_m$cyl))
carModel <- lm(mpg ~ wt, data = mtcars_m)
abline(carModel, lty = 2)
legend(x = 5, y = 25, legend = levels(as.factor(mtcars_m$cyl)), col = 1:3, pch = 1, bty = "n")

In fact you can draw the regression line for each cylinder type.

plot(mtcars_m$wt,mtcars_m$mpg,  col=as.factor(mtcars_m$cyl))
abline(lm(mpg ~ wt, data = mtcars_m, subset= (cyl ==4)), lty = 2)
abline(lm(mpg ~ wt, data = mtcars_m, subset= (cyl ==6)), lty = 2)
abline(lm(mpg ~ wt, data = mtcars_m, subset= (cyl ==8)), lty = 2)
legend(x = 5, y = 25, legend = levels(as.factor(mtcars_m$cyl)), col = 1:3, pch = 1, bty = "n")

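or more efficiently using lapply(), which you will learn about soon in the DataCamp course Intermediate R. We loop over the distinct cylinder values (not over every car) and give each line the same color number as its points.

plot(mtcars_m$wt,mtcars_m$mpg,  col=as.factor(mtcars_m$cyl))
cyl_levels <- sort(unique(mtcars_m$cyl))   # 4, 6, 8
invisible(lapply(seq_along(cyl_levels), function(i) {
  abline(lm(mpg ~ wt, mtcars_m, subset = (cyl == cyl_levels[i])), col = i)
  }))
legend(x = 5, y = 25, legend = levels(as.factor(mtcars_m$cyl)), col = 1:3, pch = 1, bty = "n")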

In ggplot it is much easier. Note that we put the color aesthetic in the ggplot frame instead of geom_point since we want both the points and the regression lines to be categorized by color.

mtcars_m %>% ggplot(aes(x = wt, y = mpg, col = as.factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Adam Lucas