Source file ⇒ lec8.Rmd
Last class we learned about the data verbs arrange(), filter(), select(), mutate(), group_by(), and summarise().
A few notes:
a <- c(1,2,2,2,3)
sum(a) # sum of elements of a
## [1] 10
sum(a==2) # number of elements of a that have the value 2
## [1] 3
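The second call works because `a == 2` produces a logical vector, and `sum()` treats TRUE as 1 and FALSE as 0. A minimal base-R sketch of that step, reusing the vector `a` from above (the name `is_two` is introduced here only for illustration):

```r
a <- c(1, 2, 2, 2, 3)

is_two <- a == 2   # logical vector: FALSE TRUE TRUE TRUE FALSE
sum(is_two)        # TRUE counts as 1, so this is the number of 2s: 3
mean(is_two)       # proportion of elements equal to 2: 0.6
```

The same idea is what makes `sum(name == "Adam")` inside summarise() count matching rows.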
BabyNames %>%
summarise(Adam_sum= sum(name=="Adam"), John_sum= sum(name == "John"))
## Adam_sum John_sum
## 1 190 268
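Note that `sum(name == "Adam")` counts rows (one per year and sex in which the name appears), not babies. To total the babies, one would sum the `count` column instead. A small sketch of the difference on a made-up table (`toy` and all of its values are hypothetical, not taken from BabyNames):

```r
library(dplyr)

# Hypothetical miniature of BabyNames (values invented for illustration)
toy <- data.frame(
  name  = c("Adam", "Adam", "John", "John", "John"),
  sex   = c("M", "M", "M", "M", "F"),
  count = c(10, 20, 100, 200, 5),
  year  = c(1880, 1881, 1880, 1881, 1881),
  stringsAsFactors = FALSE
)

res <- toy %>%
  summarise(
    Adam_rows   = sum(name == "Adam"),         # number of rows for Adam: 2
    Adam_babies = sum(count[name == "Adam"])   # number of babies named Adam: 30
  )
res
```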
BabyNames %>%
filter(year==1890, sex=="M") %>%
summarise( total = n())
## total
## 1 1161
BabyNames %>%
filter(year==1890, sex=="M") %>%
tally() # tally() counts the cases and names the resulting column n
## n
## 1 1161
This is similar to nrow():
BabyNames %>%
filter(year==1890, sex=="M") %>%
nrow()
## [1] 1161
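The three approaches give the same number but differ in what they return: summarise(n()) and tally() return a one-row data table, while nrow() returns a bare integer. A short sketch on invented data (the table `toy` and the variable names are assumptions made for illustration):

```r
library(dplyr)

# Hypothetical data, invented for illustration
toy <- data.frame(
  name = c("Adam", "John", "Mary", "Anna"),
  sex  = c("M", "M", "F", "F"),
  year = c(1890, 1890, 1890, 1891),
  stringsAsFactors = FALSE
)

boys_1890 <- toy %>% filter(year == 1890, sex == "M")

via_summarise <- boys_1890 %>% summarise(total = n())  # one-row data frame, column `total`
via_tally     <- boys_1890 %>% tally()                 # one-row data frame, column `n`
via_nrow      <- boys_1890 %>% nrow()                  # plain integer, not a data frame
```

If the count feeds into further data verbs, the data-table forms are usually more convenient; nrow() is handy when only the number itself is needed.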
Recall the data table Minneapolis2013, which lists the ballots cast in different wards and precincts:
head(Minneapolis2013)
## Precinct First Second Third Ward
## 1 P-10 BETSY HODGES undervote undervote W-7
## 2 P-06 BOB FINE MARK ANDREW undervote W-10
## 3 P-09 KURTIS W. HANNA BOB FINE MIKE GOULD W-10
## 4 P-05 BETSY HODGES DON SAMUELS undervote W-13
## 5 P-01 DON SAMUELS undervote undervote W-5
## 6 P-04 undervote undervote undervote W-6
Suppose we want to order the precincts by the number of ballots cast, from most to fewest:
Minneapolis2013 %>%
group_by(Precinct) %>%
summarise(count=n()) %>% # n() finds how many cases there are
arrange(desc(count))
## Source: local data frame [16 x 2]
##
## Precinct count
## (chr) (int)
## 1 P-06 9711
## 2 P-02 9551
## 3 P-08 9430
## 4 P-03 8703
## 5 P-05 8490
## 6 P-07 8104
## 7 P-04 7753
## 8 P-01 7301
## 9 P-09 5342
## 10 P-10 1561
## 11 P-04D 852
## 12 P-02D 822
## 13 P-05A 742
## 14 P-03A 730
## 15 P-01C 505
## 16 P-6C 504
or, using tally():
Minneapolis2013 %>%
group_by(Precinct) %>%
tally(sort=TRUE)
## Source: local data frame [16 x 2]
##
## Precinct n
## (chr) (int)
## 1 P-06 9711
## 2 P-02 9551
## 3 P-08 9430
## 4 P-03 8703
## 5 P-05 8490
## 6 P-07 8104
## 7 P-04 7753
## 8 P-01 7301
## 9 P-09 5342
## 10 P-10 1561
## 11 P-04D 852
## 12 P-02D 822
## 13 P-05A 742
## 14 P-03A 730
## 15 P-01C 505
## 16 P-6C 504
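dplyr also provides count(), which combines the grouping and tallying steps. A minimal sketch on invented ballots (the table `ballots` and its values are hypothetical, not drawn from Minneapolis2013):

```r
library(dplyr)

# Hypothetical ballots, invented for illustration
ballots <- data.frame(
  Precinct = c("P-01", "P-02", "P-02", "P-03", "P-02", "P-01"),
  First    = c("A", "B", "A", "A", "B", "A"),
  stringsAsFactors = FALSE
)

# Two steps: group, then tally with sorting
by_tally <- ballots %>% group_by(Precinct) %>% tally(sort = TRUE)

# One step: count() groups by Precinct and tallies
by_count <- ballots %>% count(Precinct, sort = TRUE)
```

Both results list P-02 first (3 ballots), then P-01 (2), then P-03 (1).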
What if we want to order the precincts by the number of ballots cast with “BETSY HODGES” as the First choice?
Minneapolis2013 %>%
filter(First =="BETSY HODGES") %>%
group_by(Precinct) %>%
tally(sort=TRUE)
## Source: local data frame [16 x 2]
##
## Precinct n
## (chr) (int)
## 1 P-06 3762
## 2 P-02 3739
## 3 P-08 3480
## 4 P-07 3073
## 5 P-05 2895
## 6 P-01 2793
## 7 P-03 2663
## 8 P-04 2571
## 9 P-09 1943
## 10 P-10 486
## 11 P-02D 326
## 12 P-04D 319
## 13 P-05A 303
## 14 P-03A 295
## 15 P-01C 161
## 16 P-6C 126
Now recall the BabyNames data table:
head(BabyNames)
## name sex count year
## 1 Mary F 7065 1880
## 2 Anna F 2604 1880
## 3 Emma F 2003 1880
## 4 Elizabeth F 1939 1880
## 5 Minnie F 1746 1880
## 6 Margaret F 1578 1880
What are the 3 most popular names in BabyNames?
ggplot2 is a graphics package built around the components of graphs (i.e. glyphs, aesthetics, frames, scales, layers), together called the grammar of graphics.
In ggplot2, glyphs are called geoms.
Here is the data table mosaicData::CPS85:
frame <- CPS85 %>% ggplot(aes(x=age,y=wage))
frame + geom_point()
frame <- CPS85 %>% ggplot(aes(x=age,y=wage))
frame + geom_point(aes(shape=sex))
frame <- CPS85 %>% ggplot(aes(x=age,y=wage))
frame + geom_point(aes(shape=sex)) + facet_grid(married ~ .)
frame <- CPS85 %>% ggplot(aes(x=age,y=wage))
frame + geom_point(aes(shape=married)) + ylim(0,30)
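The four plots above follow one pattern: start from a frame, then add a geom, an optional facet, and an optional scale. A self-contained sketch of that pattern, assuming the built-in `mtcars` data as a stand-in for CPS85 so that it runs without mosaicData installed (the variable choices are illustrative only):

```r
library(ggplot2)

# mtcars ships with R; it stands in for CPS85 here (an assumption)
frame <- ggplot(mtcars, aes(x = wt, y = mpg))   # frame: data plus x/y aesthetics

p <- frame +
  geom_point(aes(shape = factor(am))) +         # geom (glyph) with a shape aesthetic
  facet_grid(cyl ~ .) +                         # one row of panels per value of cyl
  ylim(0, 40)                                   # scale: fix the y-axis range

p  # printing the object draws the plot
```

One design point to keep in mind: ylim() drops observations that fall outside the limits (with a warning), whereas coord_cartesian(ylim = c(0, 40)) zooms in without discarding them.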
To answer the earlier question about the 3 most popular names in BabyNames:
BabyNames %>%
group_by(name) %>%
summarise(tot=sum(count)) %>%
arrange(desc(tot)) %>%
head(3)
## Source: local data frame [3 x 2]
##
## name tot
## (chr) (int)
## 1 James 5114325
## 2 John 5095590
## 3 Robert 4809858
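The group_by() step matters because each name appears on many rows (one per year and sex), so the counts have to be added up per name before sorting. A small sketch of the same pipeline on invented data (`toy` and its values are hypothetical):

```r
library(dplyr)

# Hypothetical data: each name can appear on several rows
toy <- data.frame(
  name  = c("James", "James", "John", "Mary", "Mary", "Anna"),
  count = c(5, 7, 9, 4, 1, 2),
  stringsAsFactors = FALSE
)

top2 <- toy %>%
  group_by(name) %>%
  summarise(tot = sum(count)) %>%   # total per name across all of its rows
  arrange(desc(tot)) %>%            # largest totals first
  head(2)                           # keep the top two: James (12), John (9)
top2
```

Without the grouping, arrange(desc(count)) would put John's single row of 9 first, even though James has the larger total.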
Here is a cheat sheet: [RStudio](https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf)
Please make