Chapter 7: Wrangling and Data Verbs

Last class we learned about the data verbs arrange(), filter(), select(), mutate(), group_by(), and summarise().

A few notes:

The function sum() takes a vector and adds up the elements.

a <- c(1,2,2,2,3)
sum(a)  # sum of elements of a

## [1] 10

sum(a==2) # number of elements of a that have the value 2

## [1] 3
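
Why does sum(a==2) count elements? A minimal sketch in base R: the comparison produces a logical vector, and sum() treats TRUE as 1 and FALSE as 0. The variable name is_two is only for illustration.

```r
a <- c(1, 2, 2, 2, 3)

# The comparison returns one TRUE/FALSE per element of a
is_two <- a == 2        # FALSE TRUE TRUE TRUE FALSE

# sum() counts the TRUEs, because TRUE is treated as 1 and FALSE as 0
sum(is_two)             # 3

# mean() of a logical vector gives the proportion of TRUEs
mean(is_two)            # 0.6
```

The same trick works with any condition, which is what makes sum() handy inside summarise().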

sum() is useful in summarise(). For example, suppose you want to display the number of times the names “Adam” and “John” appear in the BabyNames data table (that is, the number of cases with each name):

BabyNames %>%
  summarise(Adam_sum= sum(name=="Adam"), John_sum= sum(name == "John"))

##   Adam_sum John_sum
## 1      190      268
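
Note that sum(name=="Adam") counts cases (rows), not babies. The sketch below uses a small made-up table, toy_names, with the same columns as BabyNames; the values are invented for illustration. It contrasts counting rows with adding up the count column.

```r
library(dplyr)

# Hypothetical miniature of BabyNames (values are made up)
toy_names <- data.frame(
  name  = c("Adam", "John", "Adam", "John", "Mary"),
  sex   = c("M", "M", "M", "M", "F"),
  count = c(10, 40, 12, 35, 50),
  year  = c(1880, 1880, 1881, 1881, 1881),
  stringsAsFactors = FALSE
)

result <- toy_names %>%
  summarise(
    Adam_rows   = sum(name == "Adam"),        # number of cases named Adam
    Adam_babies = sum(count[name == "Adam"])  # total of count over those cases
  )
result
```

Here Adam_rows is 2 (two cases) while Adam_babies is 22 (10 + 12), so which one you want depends on the question being asked.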

n() counts the number of cases. It is useful inside summarise(). Let's find the number of boys' names in the year 1890:

BabyNames %>%
  filter(year==1890, sex=="M") %>%
  summarise( total = n())

##   total
## 1  1161

tally() is another way to count the cases in a data table.

BabyNames %>%
  filter(year==1890, sex=="M") %>%
  tally()   # tally calls the number of cases n

##      n
## 1 1161

It is similar to nrow(), which returns a plain number rather than a data table:

BabyNames %>%
  filter(year==1890, sex=="M") %>%
  nrow()

## [1] 1161
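
To compare the three approaches side by side, here is a small sketch on a made-up table (toy and its values are assumptions, not part of BabyNames). All three give the same count; they differ in what kind of object comes back.

```r
library(dplyr)

# Hypothetical data: four cases, two of them boys in 1890
toy <- data.frame(
  year = c(1890, 1890, 1890, 1891),
  sex  = c("M", "M", "F", "M"),
  stringsAsFactors = FALSE
)

boys_1890 <- toy %>% filter(year == 1890, sex == "M")

via_n     <- boys_1890 %>% summarise(total = n())  # data table with a column named total
via_tally <- boys_1890 %>% tally()                 # data table with a column named n
via_nrow  <- nrow(boys_1890)                       # a single number, not a data table
```

Use summarise() or tally() when the count should stay in a pipeline as a data table, and nrow() when you only need the number.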

More complex examples of putting together data verbs:

Recall the data table Minneapolis2013, in which each case is one ballot with its precinct, its ward, and the voter's first, second, and third choices:

head(Minneapolis2013)

##   Precinct           First      Second      Third Ward
## 1     P-10    BETSY HODGES   undervote  undervote  W-7
## 2     P-06        BOB FINE MARK ANDREW  undervote W-10
## 3     P-09 KURTIS W. HANNA    BOB FINE MIKE GOULD W-10
## 4     P-05    BETSY HODGES DON SAMUELS  undervote W-13
## 5     P-01     DON SAMUELS   undervote  undervote  W-5
## 6     P-04       undervote   undervote  undervote  W-6

Suppose we want to order the precincts by the number of ballots cast, from most to fewest:

Minneapolis2013 %>%
  group_by(Precinct) %>%
  summarise(count=n()) %>%     # n() finds how many cases there are
  arrange(desc(count))

## Source: local data frame [16 x 2]
## 
##    Precinct count
##       (chr) (int)
## 1      P-06  9711
## 2      P-02  9551
## 3      P-08  9430
## 4      P-03  8703
## 5      P-05  8490
## 6      P-07  8104
## 7      P-04  7753
## 8      P-01  7301
## 9      P-09  5342
## 10     P-10  1561
## 11    P-04D   852
## 12    P-02D   822
## 13    P-05A   742
## 14    P-03A   730
## 15    P-01C   505
## 16     P-6C   504

tally(sort=TRUE) gives the same ordering more compactly:

Minneapolis2013 %>%
  group_by(Precinct) %>%
  tally(sort=TRUE)

## Source: local data frame [16 x 2]
## 
##    Precinct     n
##       (chr) (int)
## 1      P-06  9711
## 2      P-02  9551
## 3      P-08  9430
## 4      P-03  8703
## 5      P-05  8490
## 6      P-07  8104
## 7      P-04  7753
## 8      P-01  7301
## 9      P-09  5342
## 10     P-10  1561
## 11    P-04D   852
## 12    P-02D   822
## 13    P-05A   742
## 14    P-03A   730
## 15    P-01C   505
## 16     P-6C   504
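
dplyr also provides count(), which combines the group_by() and tally() steps. The sketch below checks this on a small made-up ballot table; toy_ballots and its values are assumptions for illustration, not the Minneapolis2013 data.

```r
library(dplyr)

# Hypothetical ballots: six cases in three precincts
toy_ballots <- data.frame(
  Precinct = c("P-01", "P-02", "P-02", "P-03", "P-02", "P-01"),
  First    = c("A", "B", "A", "A", "B", "A"),
  stringsAsFactors = FALSE
)

# Long form: group, count the cases in each group, then sort
by_summarise <- toy_ballots %>%
  group_by(Precinct) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

# Short form: count() groups, counts, and sorts in one step
by_count <- toy_ballots %>%
  count(Precinct, sort = TRUE)
```

Both tables list P-02 first with 3 ballots, then P-01 with 2, then P-03 with 1.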

What if we want to order the precincts by the number of ballots cast with “BETSY HODGES” as First?

Minneapolis2013 %>%
    filter(First =="BETSY HODGES") %>% 
    group_by(Precinct) %>%
    tally(sort=TRUE)

## Source: local data frame [16 x 2]
## 
##    Precinct     n
##       (chr) (int)
## 1      P-06  3762
## 2      P-02  3739
## 3      P-08  3480
## 4      P-07  3073
## 5      P-05  2895
## 6      P-01  2793
## 7      P-03  2663
## 8      P-04  2571
## 9      P-09  1943
## 10     P-10   486
## 11    P-02D   326
## 12    P-04D   319
## 13    P-05A   303
## 14    P-03A   295
## 15    P-01C   161
## 16     P-6C   126

Your turn:

How many votes are marked “BETSY HODGES” in the First and Second choice selection?

Here is BabyNames:

head(BabyNames)

##        name sex count year
## 1      Mary   F  7065 1880
## 2      Anna   F  2604 1880
## 3      Emma   F  2003 1880
## 4 Elizabeth   F  1939 1880
## 5    Minnie   F  1746 1880
## 6  Margaret   F  1578 1880

What are the three most popular names in BabyNames?

Chapter 8: Graphics and their Grammar

ggplot2 is a graphics package built on the components of a graph (glyphs, aesthetics, frames, scales, layers), collectively called the grammar of graphics.

In ggplot2, glyphs are called geoms.

Examples:

The following plots use the data table mosaicData::CPS85:

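# Frame: age on the x axis, wage on the y axis
frame <- CPS85 %>% ggplot(aes(x=age,y=wage)) 
frame + geom_point()   # one point (geom) per case

frame <- CPS85 %>% ggplot(aes(x=age,y=wage)) 
frame + geom_point(aes(shape=sex))   # point shape shows sex

frame <- CPS85 %>% ggplot(aes(x=age,y=wage)) 
frame + geom_point(aes(shape=sex)) + facet_grid(married ~ .)   # one row of panels per value of married

frame <- CPS85 %>% ggplot(aes(x=age,y=wage)) 
frame + geom_point(aes(shape=married)) + ylim(0,30)   # limit the y scale to 0 through 30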
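
If mosaicData is not installed, the same grammar can be tried on mtcars, which ships with R. This is only a sketch: the variables wt, mpg, am, and cyl stand in for the CPS85 variables and are not part of the original examples.

```r
library(ggplot2)

# Frame: which variables go on the x and y axes
frame <- ggplot(mtcars, aes(x = wt, y = mpg))

p <- frame +
  geom_point(aes(shape = factor(am))) +  # geom: points, with shape mapped to a variable
  facet_grid(cyl ~ .) +                  # facets: one row of panels per value of cyl
  ylim(0, 40)                            # scale: fix the range of the y axis

p  # printing the object draws the plot
```

Each piece added with + is one component of the grammar, so a plot can be built up and changed one layer at a time.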

Returning to the Chapter 7 question, here are the three most popular names in BabyNames:

BabyNames %>%
  group_by(name) %>%
  summarise(tot=sum(count)) %>%
  arrange(desc(tot)) %>%
  head(3)

## Source: local data frame [3 x 2]
## 
##     name     tot
##    (chr)   (int)
## 1  James 5114325
## 2   John 5095590
## 3 Robert 4809858
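
Why sum(count) rather than n()? In BabyNames each case is one name in one year, so n() would count years of appearance rather than babies. The sketch below shows the difference on a made-up table; toy_names and its values are assumptions for illustration.

```r
library(dplyr)

# Hypothetical miniature of BabyNames (values are made up)
toy_names <- data.frame(
  name  = c("Mary", "Anna", "Mary", "Emma", "Anna", "Ruth"),
  count = c(70, 26, 65, 20, 30, 5),
  year  = c(1880, 1880, 1881, 1881, 1881, 1881),
  stringsAsFactors = FALSE
)

top3 <- toy_names %>%
  group_by(name) %>%
  summarise(tot   = sum(count),  # total babies given the name
            cases = n()) %>%     # number of rows for the name
  arrange(desc(tot)) %>%
  head(3)
top3
```

Mary comes first with tot = 135 from only 2 cases, followed by Anna (56) and Emma (20), which shows that the two summaries answer different questions.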

Here is a cheat sheet: [RStudio](https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf)

Your turn

Please make

Lec8

Stat 133

Today’s plan: