Source file ⇒ 2017-lec5.Rmd
roads <- read.csv(file="http://tiny.cc/dcf/table-6-2.csv")
head(roads)
## country gdp educ roadways net_users
## 1 Albania 9383.46 3.3 0.63 >35%
## 2 Algeria 7335.03 4.3 0.05 >5%
## 3 Angola 6904.82 3.5 0.04 >0%
## 4 Anguilla 10903.89 2.8 1.92 >15%
## 5 Antigua and Barbuda 17635.14 2.4 2.64 >60%
## 6 Argentina 17920.07 6.3 0.08 >15%
Frame = the space for drawing glyphs.
Glyph = a symbol or a marking in a frame. Example: rectangular region.
Aesthetics = the properties of a glyph that are related to a variable in the dataset. Example: x, y, size.
Scales = the mapping between a property of a glyph and a variable in the dataset. Example: x=roadways, y=gdp, size=educ.
Guide = a display of a scale. Example: tick marks on the x and y axes, facet guide, legend.
Facets = multiple side-by-side graphs used to display the levels of a categorical variable. Example: internet usage is a facet.
Layers = multiple sets of glyphs drawn in the same frame. Example: only one layer.
Here is where to find the details of the ggplot2 components:
http://docs.ggplot2.org/current/
ggplot2 is a graphics package built on the components of graphs (i.e. glyphs, aesthetics, frames, scales, layers), which together are called the grammar of graphics.
We call glyphs geoms now.
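As a rough sketch of how the components listed above map onto ggplot2 code, the block below rebuilds the six rows printed by head(roads) by hand (so it runs without downloading the CSV) and maps x=roadways, y=gdp and size=educ, with one facet per level of net_users. The object names roads_head and p are made up for this sketch, and drawing the glyphs as points with geom_point() is an assumption, not necessarily what the original graph used.

```r
library(ggplot2)

# The six rows shown by head(roads), typed in by hand for illustration.
roads_head <- data.frame(
  country   = c("Albania", "Algeria", "Angola", "Anguilla",
                "Antigua and Barbuda", "Argentina"),
  gdp       = c(9383.46, 7335.03, 6904.82, 10903.89, 17635.14, 17920.07),
  educ      = c(3.3, 4.3, 3.5, 2.8, 2.4, 6.3),
  roadways  = c(0.63, 0.05, 0.04, 1.92, 2.64, 0.08),
  net_users = c(">35%", ">5%", ">0%", ">15%", ">60%", ">15%")
)

# Frame and aesthetics: which variable drives which glyph property.
p <- ggplot(roads_head, aes(x = roadways, y = gdp, size = educ)) +
  geom_point() +             # one layer of glyphs (points are an assumption)
  facet_wrap(~ net_users)    # facets: one panel per level of net_users

# Typing p at the console (or calling print(p)) draws the plot.
# The axes, the size legend and the panel labels are the guides.
```

The scales are chosen by ggplot2 automatically here; only the aesthetic mapping is spelled out.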
Here is the data table mosaicData::CPS85:
data(CPS85,package="mosaicData")
head(CPS85)
## wage educ race sex hispanic south married exper union age sector
## 1 9.0 10 W M NH NS Married 27 Not 43 const
## 2 5.5 12 W M NH NS Married 20 Not 38 sales
## 3 3.8 12 W F NH NS Single 4 Not 22 sales
## 4 10.5 12 W F NH NS Married 29 Not 47 clerical
## 5 15.0 12 W M NH NS Married 40 Union 58 const
## 6 9.0 16 W F NH NS Married 27 Not 49 clerical
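library(dplyr)    # provides the pipe %>%
library(ggplot2)
# Frame: age on the x axis, wage on the y axis; one point per case
frame <- CPS85 %>% ggplot(aes(x=age,y=wage))
frame + geom_point()
# Map sex to the shape of each point
frame <- CPS85 %>% ggplot(aes(x=age,y=wage))
frame + geom_point(aes(shape=sex))
# Add facets: one row of panels per level of married
frame <- CPS85 %>% ggplot(aes(x=age,y=wage))
frame + geom_point(aes(shape=sex)) + facet_grid(married ~ .)
# Map married to shape and limit the y axis to wages between 0 and 30
frame <- CPS85 %>% ggplot(aes(x=age,y=wage))
frame + geom_point(aes(shape=married)) + ylim(0,30)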
Here are my two favorite resources for ggplot2:
Please make from data table CPS85.
Rarely is a data table glyph ready. We usually need to wrangle the data. A data verb is a function that takes a data table as an input and produces a data table as an output.
summarise() and group_by()
Two of the most useful data verbs, often used together, are summarise() and group_by():
summarise() turns multiple cases into a single case using reduction functions such as n(), sum(), or mean().
For example:
head(DataComputing::BabyNames)
## name sex count year
## 1 Mary F 7065 1880
## 2 Anna F 2604 1880
## 3 Emma F 2003 1880
## 4 Elizabeth F 1939 1880
## 5 Minnie F 1746 1880
## 6 Margaret F 1578 1880
BabyNames %>% summarise(num_cases=n()) #gives the number of rows.
## num_cases
## 1 1792091
Here n() is a reduction function that takes no arguments and only works inside dplyr data verbs such as summarise().
Instead of using summarise(num_cases=n()) you could use nrow() or tally().
BabyNames %>% nrow()
## [1] 1792091
BabyNames %>% tally()
## n
## 1 1792091
tally() is a data verb (since it outputs a data table), but nrow() isn’t, since it outputs a number.
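A small sketch of that distinction, using the six rows printed by head(BabyNames) typed in by hand (the names baby_head, as_table and as_number are made up for illustration):

```r
library(dplyr)

# First six rows of BabyNames as printed above, entered by hand.
baby_head <- data.frame(
  name  = c("Mary", "Anna", "Emma", "Elizabeth", "Minnie", "Margaret"),
  sex   = rep("F", 6),
  count = c(7065, 2604, 2003, 1939, 1746, 1578),
  year  = rep(1880, 6)
)

as_table  <- baby_head %>% tally()  # a one-row data table with a column n
as_number <- baby_head %>% nrow()   # a bare number

is.data.frame(as_table)    # TRUE
is.data.frame(as_number)   # FALSE

# Because tally() returns a data table, its result can be piped
# straight into another data verb.
as_table %>% mutate(doubled = 2 * n)
```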
BabyNames %>% summarise(average=mean(count)) #gives the average of counts
## average
## 1 186.0496
mean() and sum() are reduction functions that take a variable (for example count) as an argument.
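Several reduction functions can be combined in a single summarise() call. The sketch below does this on the six rows printed by head(BabyNames), typed in by hand so that the result is easy to check; baby_head and overview are names invented for this example.

```r
library(dplyr)

# First six rows of BabyNames as printed above, entered by hand.
baby_head <- data.frame(
  name  = c("Mary", "Anna", "Emma", "Elizabeth", "Minnie", "Margaret"),
  sex   = rep("F", 6),
  count = c(7065, 2604, 2003, 1939, 1746, 1578),
  year  = rep(1880, 6)
)

# Six cases are reduced to a single case with three summary variables.
overview <- baby_head %>%
  summarise(
    num_cases = n(),          # number of rows: 6
    total     = sum(count),   # 16935
    average   = mean(count)   # 2822.5
  )
overview
```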
If we want to summarise by name, we use group_by() and summarise() together.
BabyNames %>%
group_by(name) %>%
summarise(num_cases=n())
## # A tibble: 92,600 × 2
## name num_cases
## <chr> <int>
## 1 Aaban 6
## 2 Aabha 2
## 3 Aabid 1
## 4 Aabriella 1
## 5 Aadam 22
## 6 Aadan 8
## 7 Aadarsh 13
## 8 Aaden 14
## 9 Aadesh 3
## 10 Aadhav 7
## # ... with 92,590 more rows
If we want to divide each name into subgroups according to the sex of the baby, give group_by() two arguments, name and sex (order matters).
BabyNames %>%
group_by(name, sex) %>%
summarise(num_cases=n(), sum_cases=sum(count))
## Source: local data frame [102,690 x 4]
## Groups: name [?]
##
## name sex num_cases sum_cases
## <chr> <chr> <int> <int>
## 1 Aaban M 6 56
## 2 Aabha F 2 12
## 3 Aabid M 1 5
## 4 Aabriella F 1 5
## 5 Aadam M 22 177
## 6 Aadan M 8 104
## 7 Aadarsh M 13 140
## 8 Aaden F 1 5
## 9 Aaden M 13 3677
## 10 Aadesh M 3 15
## # ... with 102,680 more rows
Notice that the table now has 4 columns.
What are the 3 most popular names in BabyNames?
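One reason the order of the grouping variables matters: by default, summarise() drops only the last grouping variable, which is why the output above still reports "Groups: name". The sketch below shows this on a hypothetical mini table in the shape of BabyNames; the rows and the names mini and by_name_sex are invented for illustration.

```r
library(dplyr)

# Hypothetical rows in the shape of BabyNames (values are made up).
mini <- data.frame(
  name  = c("Alex", "Alex", "Alex", "Sam", "Sam"),
  sex   = c("F", "M", "M", "F", "M"),
  count = c(10, 20, 30, 5, 7),
  year  = c(2000, 2000, 2001, 2000, 2000)
)

by_name_sex <- mini %>%
  group_by(name, sex) %>%
  summarise(num_cases = n(), sum_cases = sum(count))

by_name_sex               # one row per (name, sex) pair, 4 columns
group_vars(by_name_sex)   # "name": only sex, the last grouping variable, was dropped
```

With group_by(sex, name) the rows would be the same pairs, but the result would remain grouped by sex instead.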
BabyNames %>%
group_by(name) %>%
summarise(tot=sum(count)) %>%
arrange(desc(tot)) %>%
head(3)
## # A tibble: 3 × 2
## name tot
## <chr> <int>
## 1 James 5114325
## 2 John 5095590
## 3 Robert 4809858
I suggest you open this in another browser (not in rpubs)
http://gandalf.berkeley.edu:3838/alucas/Chapter-07-collection/
Suppose we want to make a bar chart ranking the candidates by how many ballots place them first.
#not glyph ready
head(DataComputing::Minneapolis2013)
## Precinct First Second Third Ward
## 1 P-10 BETSY HODGES undervote undervote W-7
## 2 P-06 BOB FINE MARK ANDREW undervote W-10
## 3 P-09 KURTIS W. HANNA BOB FINE MIKE GOULD W-10
## 4 P-05 BETSY HODGES DON SAMUELS undervote W-13
## 5 P-01 DON SAMUELS undervote undervote W-5
## 6 P-04 undervote undervote undervote W-6
FirstPlaceTally <- Minneapolis2013 %>%
rename(candidate=First) %>%
group_by(candidate) %>%
summarise(total=n()) %>%
arrange( desc(total))
#glyph ready
FirstPlaceTally
## # A tibble: 38 × 2
## candidate total
## <chr> <int>
## 1 BETSY HODGES 28935
## 2 MARK ANDREW 19584
## 3 DON SAMUELS 8335
## 4 CAM WINTON 7511
## 5 JACKIE CHERRYHOMES 3524
## 6 BOB FINE 2094
## 7 DAN COHEN 1798
## 8 STEPHANIE WOODRUFF 1010
## 9 MARK V ANDERSON 975
## 10 undervote 834
## # ... with 28 more rows
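Once the table is glyph ready, the bar chart itself is short. The sketch below is one possible version, not necessarily the chart intended in the lecture: it uses the top five rows of FirstPlaceTally as printed above, typed in by hand, and the names top_tally and bar are made up for illustration.

```r
library(ggplot2)

# Top five rows of FirstPlaceTally as printed above, entered by hand.
top_tally <- data.frame(
  candidate = c("BETSY HODGES", "MARK ANDREW", "DON SAMUELS",
                "CAM WINTON", "JACKIE CHERRYHOMES"),
  total     = c(28935, 19584, 8335, 7511, 3524)
)

# Order the candidates by decreasing total so the bars are ranked.
top_tally$candidate <- reorder(top_tally$candidate, -top_tally$total)

# One bar per candidate; bar height is the number of first-place ballots.
bar <- ggplot(top_tally, aes(x = candidate, y = total)) +
  geom_col() +
  labs(x = "candidate", y = "first-place ballots")

# Typing bar at the console (or calling print(bar)) draws the chart.
```

Without the reorder() step, ggplot2 would place the candidates in alphabetical order rather than by rank.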
# Number of ballots cast in each precinct
Minneapolis2013 %>%
group_by(Precinct) %>%
summarise(count=n()) %>% # n() finds how many cases there are
arrange(desc(count))
## # A tibble: 16 × 2
## Precinct count
## <chr> <int>
## 1 P-06 9711
## 2 P-02 9551
## 3 P-08 9430
## 4 P-03 8703
## 5 P-05 8490
## 6 P-07 8104
## 7 P-04 7753
## 8 P-01 7301
## 9 P-09 5342
## 10 P-10 1561
## 11 P-04D 852
## 12 P-02D 822
## 13 P-05A 742
## 14 P-03A 730
## 15 P-01C 505
## 16 P-6C 504
# or, more compactly, with tally()
Minneapolis2013 %>%
group_by(Precinct) %>%
tally(sort=TRUE)
## # A tibble: 16 × 2
## Precinct n
## <chr> <int>
## 1 P-06 9711
## 2 P-02 9551
## 3 P-08 9430
## 4 P-03 8703
## 5 P-05 8490
## 6 P-07 8104
## 7 P-04 7753
## 8 P-01 7301
## 9 P-09 5342
## 10 P-10 1561
## 11 P-04D 852
## 12 P-02D 822
## 13 P-05A 742
## 14 P-03A 730
## 15 P-01C 505
## 16 P-6C 504
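The two pipelines above give the same counts, and dplyr's count() offers a third spelling. The sketch below checks this on a hypothetical six-ballot table; the data and the names ballots, via_summarise, via_tally and via_count are invented for illustration.

```r
library(dplyr)

# Hypothetical ballots (values are made up).
ballots <- data.frame(
  Precinct = c("P-01", "P-02", "P-02", "P-03", "P-02", "P-01"),
  First    = c("A", "B", "A", "A", "B", "A")
)

# Explicit version: group, count the cases, sort.
via_summarise <- ballots %>%
  group_by(Precinct) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

# tally() combines the summarise() and arrange() steps.
via_tally <- ballots %>%
  group_by(Precinct) %>%
  tally(sort = TRUE)

# count() also does the grouping.
via_count <- ballots %>%
  count(Precinct, sort = TRUE)

via_tally   # P-02 has 3 ballots, P-01 has 2, P-03 has 1
```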
# Ballots with Betsy Hodges in first place, counted by precinct
Minneapolis2013 %>%
filter(First =="BETSY HODGES") %>%
group_by(Precinct) %>%
tally(sort=TRUE)
## # A tibble: 16 × 2
## Precinct n
## <chr> <int>
## 1 P-06 3762
## 2 P-02 3739
## 3 P-08 3480
## 4 P-07 3073
## 5 P-05 2895
## 6 P-01 2793
## 7 P-03 2663
## 8 P-04 2571
## 9 P-09 1943
## 10 P-10 486
## 11 P-02D 326
## 12 P-04D 319
## 13 P-05A 303
## 14 P-03A 295
## 15 P-01C 161
## 16 P-6C 126
# Ballots that list Betsy Hodges as both first and second choice
Minneapolis2013 %>%
filter(Second == "BETSY HODGES", First == "BETSY HODGES") %>%
tally() #could also use nrow here
## n
## 1 222
# Same count another way: tally every (First, Second) pair, then filter
Minneapolis2013 %>%
group_by(First,Second) %>%
tally() %>%
filter(First=="BETSY HODGES", Second=="BETSY HODGES")
## Source: local data frame [1 x 3]
## Groups: First [1]
##
## First Second n
## <chr> <chr> <int>
## 1 BETSY HODGES BETSY HODGES 222
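The last two pipelines reach the same number by different routes: filter the cases and then count, or count every (First, Second) pair and then filter the summary. A minimal sketch on a hypothetical five-ballot table (the data and the names ballots, filter_first and tally_first are invented for illustration):

```r
library(dplyr)

# Hypothetical ballots (values are made up).
ballots <- data.frame(
  First  = c("A", "A", "B", "A", "B"),
  Second = c("A", "B", "A", "A", "B")
)

# Route 1: keep only the matching ballots, then count them.
filter_first <- ballots %>%
  filter(First == "A", Second == "A") %>%
  tally()

# Route 2: count every (First, Second) pair, then keep the matching row.
tally_first <- ballots %>%
  group_by(First, Second) %>%
  tally() %>%
  filter(First == "A", Second == "A")

filter_first$n   # 2
tally_first$n    # 2
```

Route 2 also keeps the First and Second columns in its output, which is why the last result above has three columns rather than one.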