Source file ⇒ 2017-lec5.Rmd

Announcements

  1. Thank you for your feedback!
  2. HW 2 is out!

Today:

  1. Review ideas from DC Chapter 6 (Frames, Glyphs, and other components of graphs)
  2. DC Chapter 8 ggplot2
  3. DC Chapter 7 and 9 (Wrangling and Data Verbs)

Chapter 6

Vocabulary:

roads <- read.csv(file="http://tiny.cc/dcf/table-6-2.csv")
head(roads)
##               country      gdp educ roadways net_users
## 1             Albania  9383.46  3.3     0.63      >35%
## 2             Algeria  7335.03  4.3     0.05       >5%
## 3              Angola  6904.82  3.5     0.04       >0%
## 4            Anguilla 10903.89  2.8     1.92      >15%
## 5 Antigua and Barbuda 17635.14  2.4     2.64      >60%
## 6           Argentina 17920.07  6.3     0.08      >15%

In class exercise: What are these ggplot2 components for the data table and plot above?

Frame = The space for drawing glyphs.
Glyph = A symbol or a marking in a frame. (Here: rectangular region)
Aesthetics = The properties of a glyph related to a variable in the dataset. (Here: x, y, size)
Scales = The mapping between a property of a glyph and a variable in the dataset. (Here: x=roadways, y=gdp, size=educ)
Guide = A display of a scale. (Here: tick marks on the x and y axes, facet guide, legend)
Facets = Multiple side-by-side graphs used to display levels of a categorical variable. (Here: internet usage is a facet)
Layers = Multiple glyphs. (Here: only one)
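As a rough sketch of how these components translate into ggplot2 code, the block below retypes the six rows of the roads table printed above (so it runs without the download) and maps x=roadways, y=gdp, size=educ, with one facet per net_users level. The name roads_small and the use of points as the glyph are assumptions made for illustration; the plot from class may have used a different geom.

```r
library(ggplot2)

# First six rows of the roads table shown above, typed in by hand
roads_small <- data.frame(
  country   = c("Albania", "Algeria", "Angola", "Anguilla",
                "Antigua and Barbuda", "Argentina"),
  gdp       = c(9383.46, 7335.03, 6904.82, 10903.89, 17635.14, 17920.07),
  educ      = c(3.3, 4.3, 3.5, 2.8, 2.4, 6.3),
  roadways  = c(0.63, 0.05, 0.04, 1.92, 2.64, 0.08),
  net_users = c(">35%", ">5%", ">0%", ">15%", ">60%", ">15%")
)

# Frame and aesthetics: x, y and size are each tied to a variable
roads_plot <- ggplot(roads_small, aes(x = roadways, y = gdp, size = educ)) +
  geom_point() +              # glyph (assumed here to be a point); a single layer
  facet_wrap(~ net_users)     # facets: one panel per level of net_users

roads_plot                    # guides (axes, size legend, facet labels) are drawn automatically
```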

Here is where to find the details of the ggplot2 components:
http://docs.ggplot2.org/current/

Chapter 8: Graphics and their Grammar

ggplot2 is a graphics package built on the components of graphs (i.e. glyphs, aesthetics, frames, scales, layers), together called the grammar of graphics.

We call glyphs geoms now.

examples

Here is the data table mosaicData::CPS85:

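library(dplyr)    # provides the %>% pipe used below
library(ggplot2)  # provides ggplot(), aes() and the geoms
data(CPS85,package="mosaicData")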
head(CPS85)
##   wage educ race sex hispanic south married exper union age   sector
## 1  9.0   10    W   M       NH    NS Married    27   Not  43    const
## 2  5.5   12    W   M       NH    NS Married    20   Not  38    sales
## 3  3.8   12    W   F       NH    NS  Single     4   Not  22    sales
## 4 10.5   12    W   F       NH    NS Married    29   Not  47 clerical
## 5 15.0   12    W   M       NH    NS Married    40 Union  58    const
## 6  9.0   16    W   F       NH    NS Married    27   Not  49 clerical
frame <- CPS85 %>% ggplot(aes(x=age,y=wage)) 
frame + geom_point()

frame <- CPS85 %>% ggplot(aes(x=age,y=wage)) 
frame + geom_point(aes(shape=sex))

frame <- CPS85 %>% ggplot(aes(x=age,y=wage)) 
frame + geom_point(aes(shape=sex)) + facet_grid(married ~ .)

frame <- CPS85 %>% ggplot(aes(x=age,y=wage)) 
frame + geom_point(aes(shape=married)) + ylim(0,30)
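Each example above draws a single layer. As a minimal sketch of the "layers" component, the block below adds a second geom on top of the points. It retypes the wage, age and sex columns from the six CPS85 rows printed above so that it runs without mosaicData; the name cps_small and the choice of a straight-line smoother are assumptions for illustration only.

```r
library(ggplot2)

# wage, age and sex from the six CPS85 rows shown above
cps_small <- data.frame(
  wage = c(9.0, 5.5, 3.8, 10.5, 15.0, 9.0),
  age  = c(43, 38, 22, 47, 58, 49),
  sex  = c("M", "M", "F", "F", "M", "F")
)

frame <- ggplot(cps_small, aes(x = age, y = wage))

layered <- frame +
  geom_point(aes(shape = sex)) +           # layer 1: one point per case
  geom_smooth(method = "lm", se = FALSE)   # layer 2: a fitted straight line

layered
```

Aesthetics set in ggplot() (here x and y) are shared by both layers, while shape=sex applies only to the point layer where it is declared.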

Here are my two favorite resources for ggplot2:

ggplot

Rstudio

In class exercise

Please make from data table CPS85.

Chapter 7 Wrangling and Data Verbs

Rarely is a data table glyph ready. We usually need to wrangle the data. A data verb is a function that takes a data table as an input and produces a data table as an output.

summarise() and group_by()

Two of the most useful Data Verbs often used together are summarise() and group_by():

summarise() turns multiple cases into a single case using reduction functions such as n(), sum(), or mean().

For example:

head(DataComputing::BabyNames)
##        name sex count year
## 1      Mary   F  7065 1880
## 2      Anna   F  2604 1880
## 3      Emma   F  2003 1880
## 4 Elizabeth   F  1939 1880
## 5    Minnie   F  1746 1880
## 6  Margaret   F  1578 1880
BabyNames %>% summarise(num_cases=n())  #gives the number of rows.
##   num_cases
## 1   1792091

Here n() is a reduction function that takes no arguments and only works inside dplyr verbs such as summarise().

Instead of using summarise(num_cases= n()) you could use nrow() or tally().

BabyNames %>% nrow()
## [1] 1792091
BabyNames %>% tally()
##         n
## 1 1792091

tally() is a data verb (since it outputs a data table) but nrow() isn’t since it outputs a number.
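The distinction can be checked directly. This small sketch uses a three-row table typed in from the BabyNames rows above (the name toy is an assumption) and asks whether each result is a data table:

```r
library(dplyr)

# Three rows copied from the head of BabyNames shown above
toy <- data.frame(
  name  = c("Mary", "Anna", "Emma"),
  count = c(7065, 2604, 2003)
)

via_summarise <- toy %>% summarise(num_cases = n())
via_tally     <- toy %>% tally()
via_nrow      <- toy %>% nrow()

is.data.frame(via_summarise)  # TRUE: summarise() returns a data table
is.data.frame(via_tally)      # TRUE: tally() returns a data table
is.data.frame(via_nrow)       # FALSE: nrow() returns a bare integer
```

Because tally() returns a data table, its result can be piped straight into another data verb; the result of nrow() cannot.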

BabyNames %>% summarise(average=mean(count)) #gives the average of counts
##    average
## 1 186.0496

mean() and sum() are reduction functions that take a variable (for example, count) as an argument.
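Several reduction functions can be used in one summarise() call, and the result is still a single case. A small sketch, using the six BabyNames rows printed earlier (the name six_rows is an assumption):

```r
library(dplyr)

# The six BabyNames rows shown by head() above
six_rows <- data.frame(
  name  = c("Mary", "Anna", "Emma", "Elizabeth", "Minnie", "Margaret"),
  sex   = "F",
  count = c(7065, 2604, 2003, 1939, 1746, 1578),
  year  = 1880
)

reduced <- six_rows %>%
  summarise(
    num_cases = n(),          # no argument: counts the cases
    total     = sum(count),   # takes the variable count
    average   = mean(count)   # takes the variable count
  )

reduced   # one row, three columns
```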

If we want to summarise by name we would use group_by() and summarise() together.

BabyNames %>% 
  group_by(name) %>%
  summarise(num_cases=n()) 
## # A tibble: 92,600 × 2
##         name num_cases
##        <chr>     <int>
## 1      Aaban         6
## 2      Aabha         2
## 3      Aabid         1
## 4  Aabriella         1
## 5      Aadam        22
## 6      Aadan         8
## 7    Aadarsh        13
## 8      Aaden        14
## 9     Aadesh         3
## 10    Aadhav         7
## # ... with 92,590 more rows

If we want to divide each name into subgroups according to the sex of the baby, we give group_by() two arguments, name and sex (order matters).

BabyNames %>% 
  group_by(name, sex) %>%
  summarise(num_cases=n(), sum_cases=sum(count)) 
## Source: local data frame [102,690 x 4]
## Groups: name [?]
## 
##         name   sex num_cases sum_cases
##        <chr> <chr>     <int>     <int>
## 1      Aaban     M         6        56
## 2      Aabha     F         2        12
## 3      Aabid     M         1         5
## 4  Aabriella     F         1         5
## 5      Aadam     M        22       177
## 6      Aadan     M         8       104
## 7    Aadarsh     M        13       140
## 8      Aaden     F         1         5
## 9      Aaden     M        13      3677
## 10    Aadesh     M         3        15
## # ... with 102,680 more rows

Notice that the table now has 4 columns.
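To see why the order of the group_by() arguments matters, here is a sketch on a made-up table (toy and all of its values are hypothetical, not taken from BabyNames). The same groups are formed either way, but the rows come out sorted differently, and the grouping that remains after summarise() is the first variable listed, assuming dplyr's default behaviour of dropping the last grouping level.

```r
library(dplyr)

# Hypothetical data in the same shape as BabyNames
toy <- data.frame(
  name  = c("Alex", "Alex", "Alex", "Sam", "Sam"),
  sex   = c("F", "M", "M", "F", "M"),
  count = c(5, 10, 20, 7, 3),
  year  = c(2000, 2000, 2001, 2000, 2001)
)

by_name_sex <- toy %>%
  group_by(name, sex) %>%
  summarise(num_cases = n(), sum_cases = sum(count))

by_sex_name <- toy %>%
  group_by(sex, name) %>%
  summarise(num_cases = n(), sum_cases = sum(count))

group_vars(by_name_sex)  # "name": rows sorted by name, then sex
group_vars(by_sex_name)  # "sex":  rows sorted by sex, then name
```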

What are the 3 most popular names in BabyNames?

BabyNames %>%
  group_by(name) %>%
  summarise(tot=sum(count)) %>%
  arrange(desc(tot)) %>%
  head(3)
## # A tibble: 3 × 2
##     name     tot
##    <chr>   <int>
## 1  James 5114325
## 2   John 5095590
## 3 Robert 4809858
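As an aside, dplyr's count() with its wt and sort arguments can express the same group, sum, and sort steps in one call. The sketch below compares the two pipelines on a made-up table (toy and its values are hypothetical); it is offered as an alternative spelling, not as the approach used in class.

```r
library(dplyr)

# Hypothetical names and counts, two years each
toy <- data.frame(
  name  = c("Ann", "Ann", "Bob", "Bob", "Cy", "Cy", "Dee", "Dee"),
  count = c(10, 15, 40, 5, 3, 4, 30, 1),
  year  = c(2000, 2001, 2000, 2001, 2000, 2001, 2000, 2001)
)

# The pipeline from the notes
long_way <- toy %>%
  group_by(name) %>%
  summarise(tot = sum(count)) %>%
  arrange(desc(tot)) %>%
  head(3)

# count() with wt = sums the count column per name; sort = TRUE puts the largest first
short_way <- toy %>%
  count(name, wt = count, sort = TRUE, name = "tot") %>%
  head(3)

long_way
short_way
```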

In class exercises:

I suggest you open this in another browser (not in rpubs)
http://gandalf.berkeley.edu:3838/alucas/Chapter-07-collection/

Suppose we want to make a bar chart ranking the candidates by how many ballots list them in first place.

#not glyph ready
head(DataComputing::Minneapolis2013)
##   Precinct           First      Second      Third Ward
## 1     P-10    BETSY HODGES   undervote  undervote  W-7
## 2     P-06        BOB FINE MARK ANDREW  undervote W-10
## 3     P-09 KURTIS W. HANNA    BOB FINE MIKE GOULD W-10
## 4     P-05    BETSY HODGES DON SAMUELS  undervote W-13
## 5     P-01     DON SAMUELS   undervote  undervote  W-5
## 6     P-04       undervote   undervote  undervote  W-6
FirstPlaceTally <- Minneapolis2013 %>% 
  rename(candidate=First) %>%
  group_by(candidate) %>%
  summarise(total=n()) %>%
  arrange( desc(total))

#glyph ready

FirstPlaceTally
## # A tibble: 38 × 2
##             candidate total
##                 <chr> <int>
## 1        BETSY HODGES 28935
## 2         MARK ANDREW 19584
## 3         DON SAMUELS  8335
## 4          CAM WINTON  7511
## 5  JACKIE CHERRYHOMES  3524
## 6            BOB FINE  2094
## 7           DAN COHEN  1798
## 8  STEPHANIE WOODRUFF  1010
## 9     MARK V ANDERSON   975
## 10          undervote   834
## # ... with 28 more rows
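With a glyph-ready table, the bar chart itself takes only a few lines. The sketch below retypes the top five rows of FirstPlaceTally from the output above (the name tally_small, the axis labels, and the rotated tick text are assumptions) and draws one bar per candidate, ordered from most to fewest first-place ballots.

```r
library(ggplot2)

# Top five rows of FirstPlaceTally as printed above
tally_small <- data.frame(
  candidate = c("BETSY HODGES", "MARK ANDREW", "DON SAMUELS",
                "CAM WINTON", "JACKIE CHERRYHOMES"),
  total     = c(28935, 19584, 8335, 7511, 3524)
)

bar_chart <- ggplot(tally_small,
                    aes(x = reorder(candidate, -total), y = total)) +
  geom_col() +                                   # bar height is taken from total
  labs(x = "candidate", y = "first-place ballots") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

bar_chart
```

reorder(candidate, -total) is what ranks the bars; without it ggplot2 would order the candidates alphabetically.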

In Class exercises

  1. Suppose we want to order the Precincts by the number of ballots cast, largest first.
Minneapolis2013 %>%
  group_by(Precinct) %>%
  summarise(count=n()) %>%     # n() finds how many cases there are
  arrange(desc(count))
## # A tibble: 16 × 2
##    Precinct count
##       <chr> <int>
## 1      P-06  9711
## 2      P-02  9551
## 3      P-08  9430
## 4      P-03  8703
## 5      P-05  8490
## 6      P-07  8104
## 7      P-04  7753
## 8      P-01  7301
## 9      P-09  5342
## 10     P-10  1561
## 11    P-04D   852
## 12    P-02D   822
## 13    P-05A   742
## 14    P-03A   730
## 15    P-01C   505
## 16     P-6C   504
#or


Minneapolis2013 %>%
  group_by(Precinct) %>%
  tally(sort=TRUE)  
## # A tibble: 16 × 2
##    Precinct     n
##       <chr> <int>
## 1      P-06  9711
## 2      P-02  9551
## 3      P-08  9430
## 4      P-03  8703
## 5      P-05  8490
## 6      P-07  8104
## 7      P-04  7753
## 8      P-01  7301
## 9      P-09  5342
## 10     P-10  1561
## 11    P-04D   852
## 12    P-02D   822
## 13    P-05A   742
## 14    P-03A   730
## 15    P-01C   505
## 16     P-6C   504
  2. What if we want to order the Precincts by the number of ballots cast with “BETSY HODGES” as First?
Minneapolis2013 %>%
    filter(First =="BETSY HODGES") %>% 
    group_by(Precinct) %>%
    tally(sort=TRUE)
## # A tibble: 16 × 2
##    Precinct     n
##       <chr> <int>
## 1      P-06  3762
## 2      P-02  3739
## 3      P-08  3480
## 4      P-07  3073
## 5      P-05  2895
## 6      P-01  2793
## 7      P-03  2663
## 8      P-04  2571
## 9      P-09  1943
## 10     P-10   486
## 11    P-02D   326
## 12    P-04D   319
## 13    P-05A   303
## 14    P-03A   295
## 15    P-01C   161
## 16     P-6C   126
  3. How many ballots are marked “BETSY HODGES” as both the First and the Second choice?
Minneapolis2013 %>%
  filter(Second == "BETSY HODGES", First == "BETSY HODGES") %>%
  tally()   #could also use nrow here
##     n
## 1 222
Minneapolis2013 %>%
  group_by(First,Second) %>%
  tally() %>%
  filter(First=="BETSY HODGES", Second=="BETSY HODGES")
## Source: local data frame [1 x 3]
## Groups: First [1]
## 
##          First       Second     n
##          <chr>        <chr> <int>
## 1 BETSY HODGES BETSY HODGES   222
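A third way to get the same count is to sum a logical condition inside summarise(), since TRUE counts as 1 and FALSE as 0. The sketch below checks this against the filter() and tally() version on a made-up table of four ballots (ballots and its rows are hypothetical).

```r
library(dplyr)

# Four hypothetical ballots in the shape of Minneapolis2013
ballots <- data.frame(
  First  = c("BETSY HODGES", "BETSY HODGES", "BOB FINE", "BETSY HODGES"),
  Second = c("BETSY HODGES", "undervote", "BETSY HODGES", "BETSY HODGES")
)

# Keep only the matching ballots, then count them
via_filter <- ballots %>%
  filter(First == "BETSY HODGES", Second == "BETSY HODGES") %>%
  tally()

# Count the matching ballots without filtering first
via_sum <- ballots %>%
  summarise(n = sum(First == "BETSY HODGES" & Second == "BETSY HODGES"))

via_filter
via_sum
```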

To Do: Read chapters 7-9 and start homework 1. Next time: data verbs and ggplot.