Source file ⇒ 2017-lec6.Rmd

last compiled on Fri Feb 3 22:43:56 2017

Announcements

  1. no way for me to penalize you for late DataCamp!

Today:

  1. DC chapter 9 more data verbs
  2. ggplot2 versus base package graphics

1. DC chapter 9 more data verbs

You have already seen two data verbs: group_by() and summarise().

Although these are being written in computer notation, it’s also perfectly legitimate to express actions using them as English verbs. For instance: “Group the baby names by sex and year. Then summarize the groups by adding up the total number of births for each group. This will be the result.” That’s English. Here’s the equivalent statement in computer notation:

head(BabyNames)
name sex count year
Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
Minnie F 1746 1880
Margaret F 1578 1880
BabyNames %>%  
  group_by( sex, year ) %>% 
  summarise( total=sum( count ) ) 
## Source: local data frame [268 x 3]
## Groups: sex [?]
## 
##      sex  year  total
##    <chr> <int>  <int>
## 1      F  1880  90993
## 2      F  1881  91954
## 3      F  1882 107850
## 4      F  1883 112322
## 5      F  1884 129022
## 6      F  1885 133055
## 7      F  1886 144535
## 8      F  1887 145982
## 9      F  1888 178627
## 10     F  1889 178366
## # ... with 258 more rows

i-clicker question

Consider the data table DataComputing::ZipGeography, where we examine small states.

##     ZIP   State Population LandArea
## 1 05001 Vermont       9172    110.3
## 2 05009 Vermont         NA       NA
## 3 05030 Vermont         NA       NA
## 4 05031 Vermont         98     11.5
## 5 05032 Vermont       2682    189.8

Here’s a graphic showing the mean population of all the ZIP codes in each small state.

Is this data table glyph-ready for producing the plot of average population per ZIP code for each small state?

Answ: No. We first need to wrangle the data with group_by(State) and summarise(aveZipPopulation=mean(Population,na.rm=TRUE), area=sum(LandArea,na.rm=TRUE)), and then keep only the small states (see below):

#Data wrangling
Zip <- ZipGeography %>% filter(State != "") %>%
  group_by(State)%>%
  summarise(aveZipPopulation=mean(Population,na.rm=TRUE), area=sum(LandArea,na.rm=TRUE)) %>%
  filter(area<50000)
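The na.rm=TRUE arguments matter because several ZIP codes have missing values. A small sketch, using only the five Vermont rows printed above rather than the full table, shows the difference:

```r
# Population and LandArea values copied from the five Vermont rows shown above
pop  <- c(9172, NA, NA, 98, 2682)
area <- c(110.3, NA, NA, 11.5, 189.8)

mean_with_na    <- mean(pop)                # NA: a single missing value makes the mean NA
mean_without_na <- mean(pop, na.rm = TRUE)  # 3984: average of the 3 known values
area_total      <- sum(area, na.rm = TRUE)  # 311.6: missing areas are skipped

print(c(mean_with_na, mean_without_na, area_total))
```

Without na.rm=TRUE, any state with even one missing ZIP code population would end up with NA as its average.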

Here are the ggplot commands that produce the graphic:

Zip$State <- factor(Zip$State, levels = Zip$State[order(Zip$aveZipPopulation)]) # to make states ordered by aveZipPopulation in ggplot

Zip %>% ggplot(aes(x=State,y=aveZipPopulation)) + geom_point(aes(color=area)) + theme(axis.text.x = element_text(angle = 80, hjust = 1))

More data verbs

We will discuss 4 more data verbs:

  • select()
  • mutate()
  • filter()
  • arrange()

As with group_by() and summarise(), each is a standard English word whose action on data is reflected in the colloquial, everyday meaning. And, like English, intricate and detailed statements can be made by combining the words into expressions.

Select

Selecting from a data table means choosing one or more variables from the table. Reasons to do this:

  • Simplify the table you are working on.
  • Rename one or more of the variables to make it more convenient to work with.

The syntax is similar to that of group_by() or summarise(). A data table is provided as input along with the names of the variables you are selecting. The result produced is a new data table with just those variables.

To illustrate, here are the first few cases in the BabyNames data table:

name sex count year
Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
Minnie F 1746 1880
Margaret F 1578 1880

And here is the result of selecting just the name and year variables:

name year
Mary 1880
Anna 1880
Emma 1880
Elizabeth 1880
Minnie 1880
Margaret 1880

If you want to rename a variable, use a named argument, as with when=year in the following:

BabyNames %>% select( name, when=year )
name when
Mary 1880
Anna 1880
Emma 1880
Elizabeth 1880
Minnie 1880
Margaret 1880

Filter

To “filter” means to remove unwanted material. The data verb “filter” removes unwanted cases, passing through to the result only those cases that are wanted or needed. Filtering contrasts with selecting. Selecting passes the specified variables; filtering passes the specified cases.

In selecting, the variables are specified by name, e.g.

BabyNames %>% select( year, count ) %>% head()
year count
1880 7065
1880 2604
1880 2003
1880 1939
1880 1746
1880 1578

With filtering, the cases are specified by one or more criteria or tests. The tests are generally constructed with variables and functions like ==, >, <, %in%, and so on. For instance, here’s how you can filter out the boys, producing a result with only the girls’ names:

BabyNames %>% filter( sex=="F") %>%
  sample_n( size=6 )
name sex count year
121991 Cornelious F 8 1924
681602 Kaytlin F 143 1993
815517 Jaliah F 26 2001
622725 Andre F 33 1989
614790 Sherissa F 8 1988
530929 Joyce F 663 1982

Here are the cases for either sex for babies born after 1990:

BabyNames %>% filter( year > 1990 ) %>% 
  sample_n( size=6 ) 
name sex count year
52296 Chassity F 82 1993
546718 Dixi F 5 2009
27144 Doreen F 67 1992
391465 Marzell M 5 2004
389512 Johany M 6 2004
363564 Inaya F 43 2004

Here are the girls born after 1990:

BabyNames %>% filter( year > 1990, sex=="F") %>%
  sample_n( size=6 )
name sex count year
185121 Kayra F 21 2002
360082 Cheyanna F 27 2011
390870 Priyana F 6 2012
18974 Christyn F 31 1992
238137 Chantelle F 47 2005
60760 Crystalina F 5 1994

You can specify as many tests as you like. The filter() function will pass through only those cases that pass all the tests.

Sometimes you may want to set “either-or” criteria, say the babies who are female or born after 1990:

BabyNames %>% filter( year>1990 | sex=="F") 
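To see the difference between "all tests" and "either-or", here is a minimal sketch on a made-up four-row table. The table toy and its values are hypothetical stand-ins for BabyNames, chosen only so the results are easy to check by eye:

```r
library(dplyr)

# Hypothetical toy table (values are made up, not taken from BabyNames)
toy <- data.frame(
  name = c("Ann", "Bob", "Cat", "Dan"),
  sex  = c("F", "M", "F", "M"),
  year = c(1985, 1985, 1995, 1995),
  stringsAsFactors = FALSE
)

both   <- toy %>% filter(year > 1990, sex == "F")   # comma: every test must pass
both2  <- toy %>% filter(year > 1990 & sex == "F")  # & means the same as the comma
either <- toy %>% filter(year > 1990 | sex == "F")  # |: at least one test must pass

both$name    # "Cat"
either$name  # "Ann" "Cat" "Dan"
```

Only Cat is both female and born after 1990, while Ann, Cat and Dan each satisfy at least one of the two tests.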

It’s also possible to test for a variable being any of several different values. For instance, here are the babies born in any of 1980, 1990, 2000, and 2010:

BabyNames %>% 
  filter( year %in% c(1980, 1990, 2000, 2010)) %>%
  sample_n( size=6 ) 
name sex count year
94778 Vladimir M 179 2010
47188 Kavya F 46 2000
98659 Nahshon M 19 2010
94896 Landin M 150 2010
28440 Niema F 9 1990
75182 Keila F 187 2010

filter() works well together with group_by(). For example, suppose you want only those names where the minimum of the counts is greater than 100.

BabyNames %>% group_by(name) %>%
  filter(count==min(count)) %>%
  filter(count>100) %>%
  head()
## Source: local data frame [4 x 4]
## Groups: name [4]
## 
##       name   sex count  year
##      <chr> <chr> <int> <int>
## 1   Jessie     M   143  1881
## 2 Jacqueli     F   157  1989
## 3 Cassandr     F   152  1989
## 4 Christop     M  1082  1989

Notice that group_by together with filter doesn’t change any of the variables.
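A small sketch of that contrast, on a made-up table (toy and its values are hypothetical): a grouped filter() keeps whole rows with every variable, whereas summarise() collapses each group down to the new summary variable.

```r
library(dplyr)

# Hypothetical toy table (values are made up)
toy <- data.frame(
  name  = c("Ann", "Ann", "Bob", "Bob"),
  year  = c(1990, 1991, 1990, 1991),
  count = c(120, 150, 90, 300),
  stringsAsFactors = FALSE
)

# Grouped filter: keeps the full row where each name hits its minimum count
low_rows <- toy %>%
  group_by(name) %>%
  filter(count == min(count)) %>%
  ungroup()

# Summarise: one row per name, only the grouping and summary variables remain
low_summary <- toy %>%
  group_by(name) %>%
  summarise(min_count = min(count))

names(low_rows)     # "name" "year" "count"
names(low_summary)  # "name" "min_count"
```

The filtered result still carries the year in which the minimum occurred; the summarised result does not.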

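We can also report only the baby names used for over 100 years. Note that n() counts rows, one per sex and year, so a name given to both boys and girls is counted twice in that year; that is why Aaron shows 218, more than the 134 years in the data.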

BabyNames %>% group_by(name) %>%
  summarise(years_used=n()) %>%
  filter(years_used>100) %>%
  head()
name years_used
Aaron 218
Abbie 176
Abby 156
Abe 134
Abel 150
Abigail 169
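Because n() counts rows rather than years, a name given to both sexes in the same year is counted twice. A minimal sketch on a made-up table (the values are hypothetical, not from BabyNames) compares n() with n_distinct(year), which counts each year only once:

```r
library(dplyr)

# Hypothetical toy table: Jessie appears for both sexes in 1990
toy <- data.frame(
  name  = c("Jessie", "Jessie", "Jessie", "Mary"),
  sex   = c("F", "M", "F", "F"),
  year  = c(1990, 1990, 1991, 1990),
  count = c(50, 40, 60, 900),
  stringsAsFactors = FALSE
)

usage <- toy %>%
  group_by(name) %>%
  summarise(rows_used  = n(),               # number of rows (sex-year pairs)
            years_used = n_distinct(year))  # number of different years

usage$rows_used   # 3 1
usage$years_used  # 2 1
```

If the goal is strictly "number of years in which the name was used", n_distinct(year) is one way to get it.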

Mutate

The word “mutate” means to change in form or nature. The data verb “mutate” is a bit more specific: to change a variable or add new variables based on the existing ones. The data verb always refers to variables; mutation leaves the cases exactly as they were.

Often, mutation is used to combine or transform existing variables into a new variable. For instance, the CountryData data table has variables pop and area giving the population and area (in km^2) of each country. Suppose you wanted to know the population density, that is, how many people per unit area. Using mutate, you create a new variable that is population / area.

CountryData %>% 
  mutate( popDensity=pop/area ) %>% 
  select( country, pop, area, popDensity) %>%
  sample_n(size=6) 
country pop area popDensity
70 Ecuador 15654411 283561 55.206502
161 Navassa Island NA 5 NA
192 Saint Kitts and Nevis 51538 261 197.463602
198 San Marino 32742 61 536.754098
124 Kiribati 104488 811 128.838471
15 Australia 22507617 7741220 2.907503
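mutate() can also change an existing variable, not just add one. The sketch below uses a made-up two-row table (toy and its values are hypothetical) to show both uses, and also how a missing input gives a missing result, as with Navassa Island above:

```r
library(dplyr)

# Hypothetical toy table; country B has no recorded area
toy <- data.frame(
  country = c("A", "B"),
  pop     = c(1000, 5000),
  area    = c(10, NA),
  stringsAsFactors = FALSE
)

out <- toy %>%
  mutate(popDensity = pop / area,  # new variable, computed from the original pop
         pop        = pop / 1000)  # existing variable, now in thousands

out$popDensity  # 100  NA
out$pop         # 1 5
nrow(out)       # 2: the cases are unchanged
```

The arguments to mutate() are evaluated in order, so popDensity here is computed before pop is rescaled.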

Arrange

Arranging sets the order of cases. It does not change the variables — that’s a job for mutate(). Similarly, arranging does not filter the cases. Arranging merely sets the order of cases according to some criterion that you specify.

For instance, here are the first-choices from the Minneapolis mayoral election in 2013 found by counting the ballots:

Minneapolis2013 %>%
  group_by( First ) %>% 
  summarise( total=n() ) %>%
  head()
First total
ABDUL M RAHAMAN “THE ROCK” 338
ALICIA K. BENNETT 351
BETSY HODGES 28935
BILL KAHN 97
BOB “AGAIN” CARNEY JR 56
BOB FINE 2094

The alphabetical order in the above might be good for some purposes. If your goal is to show who won and how they did compared to the other candidates, it’s better to arrange the results by total in descending order.

Minneapolis2013 %>%
  group_by( First ) %>% 
  summarise( total=n() ) %>%
  arrange( desc(total) ) %>%
  head() 
First total
BETSY HODGES 28935
MARK ANDREW 19584
DON SAMUELS 8335
CAM WINTON 7511
JACKIE CHERRYHOMES 3524
BOB FINE 2094

By default, the arrangement goes in ascending order: from lowest to highest.
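As a quick illustration of ascending versus descending order, here is a sketch on a made-up vote table (toy and its totals are hypothetical). A second variable is used to break ties so the result is fully determined:

```r
library(dplyr)

# Hypothetical toy results; candidates A and C are tied
toy <- data.frame(
  First = c("A", "B", "C", "D"),
  total = c(5, 20, 5, 12),
  stringsAsFactors = FALSE
)

ascending  <- toy %>% arrange(total, First)        # lowest first; ties broken by name
descending <- toy %>% arrange(desc(total), First)  # highest first; ties broken by name

ascending$First   # "A" "C" "D" "B"
descending$First  # "B" "D" "A" "C"
```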

Data verb languages

The notation used in these notes is dplyr. Keep in mind that this is just one of several notations. Some of them are:

  • dplyr - R
  • data.table - R (for big data; see the course “Data Analysis in R, the data.table Way” if interested)
  • SQL - database servers

Here is the same expression in these different notations:

  • dplyr BabyNames %>% group_by(year,sex) %>% summarise( nNames=n() )
  • SQL SELECT year, sex, COUNT(*) AS nNames FROM BabyNames GROUP BY year, sex
  • data.table BabyNames[, .(nNames=.N), by=.(year,sex) ] (with BabyNames stored as a data.table)

In class exercises (FIXED)

Please copy and paste url into your web browser.
http://gandalf.berkeley.edu:3838/alucas/Chapter-09-collection/

Answs:

Please write a wrangling statement to extract out only those names where the total number of births over all the years and both sexes is greater than 10,000. Your result will look like this (although the names in Twenty will be different.)

Twenty %>% 
  group_by(name) %>%
  summarise(total=sum(count)) %>%
  filter(total>10000)

Write a wrangling statement to extract only those names for which there is some year where the total number of babies is greater than 50 (combining boys and girls). Your result should look like this: the name followed by the total count in the best year and the year in which that occurred.

Twenty %>% group_by(name, year) %>%
  summarise(total=sum(count)) %>%  # combine boys and girls; result is still grouped by name
  filter(total==max(total)) %>%
  filter(total>50)

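Now write a wrangling statement that extracts only those names which appear at least 50 times per year for at least 80 years of the 134-year time span.

Twenty %>% 
  group_by(name, year) %>%
  summarise(total=sum(count)) %>%  # assumes boys and girls are combined, as in the previous exercise
  filter(total>=50) %>%
  summarise(nyears=n()) %>%
  filter(nyears>=80)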

Here’s an attempt to calculate the number of votes for candidates who received more than 10,000 votes altogether.

Minneapolis2013 %>%
  group_by(First) %>%
  summarise(total_votes = n()) %>%
  filter(total_votes > 10000)

2. ggplot2 versus base package graphics

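We will work with the mtcars data set. Note that mtcars codes am as 0 = automatic and 1 = manual, so despite the names, mtcars_m below holds the automatic cars and mtcars_a the manual ones.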

mtcars_m <- mtcars %>% 
  filter(am==0)

mtcars_a <- mtcars %>%
  filter(am==1) 
  
head(mtcars_a)
mpg cyl disp hp drat wt qsec vs am gear carb
21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1

Before ggplot there was plotting with the base R package. Many research papers still make their plots with base package so you should be familiar with it.

example 1

In the base package, to make a scatter diagram of mpg versus wt for the cars in mtcars_m (am==0, which mtcars documents as automatic transmission), colored by number of cylinders:

plot( mtcars_m$wt,mtcars_m$mpg, col=as.factor(mtcars_m$cyl))

That is fine, but suppose we wish to add an additional layer of points for the cars in mtcars_a (am==1, manual transmission):

plot(mtcars_m$wt,mtcars_m$mpg,  col=as.factor(mtcars_m$cyl))
points( mtcars_a$wt,mtcars_a$mpg, col="blue")

Here we see some major limitations of base package drawing.

  1. The plot doesn’t get redrawn: the axes were fixed by the first plot() call, so points added later that fall outside that range are not shown.
  2. The plot is drawn as an image (every layer is drawn on top of the image; in ggplot the plot is an object which we can change).
  3. We need to add a legend ourselves (by the time you make the legend manually, you may have forgotten what the different colors mean).

In ggplot we would have:

mtcars %>% ggplot(aes(x=wt,y=mpg)) + geom_point(aes(col=as.factor(cyl))) + facet_wrap(~am)
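Point 2 above says that a ggplot is an object which can be changed. As a rough illustration, the same plot can be built up step by step and stored along the way; the names p, p_faceted and p_smooth are just placeholders for this sketch:

```r
library(ggplot2)

# Base plot object: nothing is drawn until the object is printed
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(col = as.factor(cyl)))

# Adding to a stored plot returns a new plot object; p itself is unchanged
p_faceted <- p + facet_wrap(~am)
p_smooth  <- p_faceted + geom_smooth(method = "lm", se = FALSE)

n_layers_before <- length(p$layers)         # layers in the original plot
n_layers_after  <- length(p_smooth$layers)  # one more after adding geom_smooth()
```

Typing p_smooth at the console draws the final version, with axes and legend recomputed for everything in the plot.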

in class exercise

The vector precip gives the yearly precipitation in different cities. Using the base package function hist, make a histogram of precip (hint: try hist(precip)). Next, make the plot in ggplot using the geom_histogram() function. You will need to convert the vector precip to a data frame using as.data.frame(precip). This might be helpful: ggplot2.org

example 2

As another example, suppose we wish to make a linear model of how mpg varies with car weight.

# Use lm() to calculate a linear model and save it as carModel
carModel <- lm(mpg ~ wt, data = mtcars_m)
carModel
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars_m)
## 
## Coefficients:
## (Intercept)           wt  
##      31.416       -3.786

From the output, the best-fitting line through the scatter plot for the mtcars_m cars (am==0, automatic transmission) is mpg = 31.416 - 3.786 wt.
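To make the fitted line concrete, the coefficients can be pulled out with coef() and used to predict mpg for a given weight. This is a small sketch, assuming the same am==0 subset as mtcars_m and a hypothetical car with wt = 3 (wt is recorded in units of 1000 lbs):

```r
# Same subset as mtcars_m above, rebuilt here so the block is self-contained
cars0 <- subset(mtcars, am == 0)
carModel <- lm(mpg ~ wt, data = cars0)

b <- coef(carModel)  # named vector with "(Intercept)" and "wt"

# Prediction by hand: intercept + slope * weight
by_hand <- b[["(Intercept)"]] + b[["wt"]] * 3

# The same prediction using predict()
by_predict <- predict(carModel, newdata = data.frame(wt = 3))

print(c(by_hand = by_hand, by_predict = unname(by_predict)))
```

With the coefficients printed above, this works out to roughly 31.416 - 3.786 * 3, or about 20.06 mpg.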

We can draw the regression line through our plot for the mtcars_m cars with abline(). The legend has to be made manually, which is tedious.

plot(mtcars_m$wt,mtcars_m$mpg,  col=as.factor(mtcars_m$cyl))
carModel <- lm(mpg ~ wt, data = mtcars_m)
abline(carModel, lty = 2)
legend(x = 5, y = 25, legend = levels(as.factor(mtcars_m$cyl)), col = 1:3, pch = 1, bty = "n")

In fact you can draw the regression line for each cylinder type.

plot(mtcars_m$wt,mtcars_m$mpg,  col=as.factor(mtcars_m$cyl))
abline(lm(mpg ~ wt, data = mtcars_m, subset= (cyl ==4)), lty = 2)
abline(lm(mpg ~ wt, data = mtcars_m, subset= (cyl ==6)), lty = 2)
abline(lm(mpg ~ wt, data = mtcars_m, subset= (cyl ==8)), lty = 2)
legend(x = 5, y = 25, legend = levels(as.factor(mtcars_m$cyl)), col = 1:3, pch = 1, bty = "n")

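Or, more compactly, using lapply(), which you will learn about soon in the DataCamp course Intermediate R. Loop over the distinct cyl values (not over every car) and use each value’s position as the color, so the lines match the points:

plot(mtcars_m$wt,mtcars_m$mpg,  col=as.factor(mtcars_m$cyl))
cyls <- sort(unique(mtcars_m$cyl))  # 4, 6, 8
invisible(lapply(seq_along(cyls), function(i) {
  abline(lm(mpg ~ wt, mtcars_m, subset = (cyl == cyls[i])), col = i, lty = 2)
  }))
legend(x = 5, y = 25, legend = levels(as.factor(mtcars_m$cyl)), col = 1:3, pch = 1, bty = "n")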

In ggplot it is much easier. Note that we put the color aesthetic in the ggplot frame instead of geom_point since we want both the points and the regression lines to be categorized by color.

mtcars_m %>% ggplot(aes(x = wt, y = mpg, col = as.factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

To Do: Read chapter 9 (with the in-class exercises) and do HW 2. Next time: Chapter 10 in the DC textbook and theme() in ggplot2 from DataCamp.