We introduce the verbs:
gather()
purrr::map()
These functions are available after loading tidyverse.
First we illustrate on Wisconsin lottery data. The first 16 lines of the data file describe the variables.
fileName <-
"http://fisher.stats.uwo.ca/faculty/aim/2018/3859A/data/WiscLottery.csv"
readLines(fileName, 16)
## [1] "#Data is descibed in Frees' textbook (p. 23 and Table 3.11, p.102)"
## [2] "#State of Wisconsin lottery sales from 50 randomly selected areas"
## [3] "#identified by their zip codes. Socio-demographic variables that"
## [4] "#may be of interest to administrators are also provided."
## [5] "#ZIP - zip code"
## [6] "#PERPERHN - persons per household"
## [7] "#MEDSCHYR - median years of schooling"
## [8] "#MEDHVL- median home value in $1000's for owner occupied homes"
## [9] "#PRCRENT - percent of housing that is owner occupied"
## [10] "#PRC55P - percent of population that is over 55"
## [11] "#HHMEDAGE - household median age"
## [12] "#MEDINC - estimated median income in $1,000"
## [13] "#SALES - online lottery sales to individuals"
## [14] "#POP - population in thousands"
## [15] "ZIP,PERPERHH,MEDSCHYR,MEDHVL,PRCRENT,PRC55P,HHMEDAGE,MEDINC,SALES,POP"
## [16] "53003,3,12.6,71.3,21,38,48,54.2,1285.4,435"
Using read_csv() we input as tibble.
wdata <- read_csv(fileName, skip=14)
## Parsed with column specification:
## cols(
## ZIP = col_integer(),
## PERPERHH = col_double(),
## MEDSCHYR = col_double(),
## MEDHVL = col_double(),
## PRCRENT = col_integer(),
## PRC55P = col_integer(),
## HHMEDAGE = col_integer(),
## MEDINC = col_double(),
## SALES = col_double(),
## POP = col_integer()
## )
wdata
## # A tibble: 50 x 10
## ZIP PERPERHH MEDSCHYR MEDHVL PRCRENT PRC55P HHMEDAGE MEDINC SALES
## <int> <dbl> <dbl> <dbl> <int> <int> <int> <dbl> <dbl>
## 1 53003 3 12.6 71.3 21 38 48 54.2 1285.
## 2 53033 3.2 12.9 98 6 28 46 70.7 3571.
## 3 53038 2.8 12.4 58.7 25 35 45 43.6 2407.
## 4 53059 3.1 12.5 65.7 24 29 45 51.9 1224.
## 5 53072 2.6 13.1 96.7 32 27 42 63.1 15046.
## 6 53083 2.7 12.8 66.4 25 38 48 55.7 9129.
## 7 53095 2.8 12.9 91 31 37 48 54.9 33181.
## 8 53098 2.9 12.5 61 26 40 50 46.9 2243.
## 9 53104 2.8 12.8 91.5 18 35 48 62.3 21588.
## 10 53172 2.6 12.7 68.8 37 39 47 49.1 15693.
## # ... with 40 more rows, and 1 more variable: POP <int>
We are interested in SALES and POP. So first we looked at a simple scatterplot and noticed that both variables were strongly skewed. Things always simpler when the data is symmetric! As a check we have included a rug on both axes. This clearly shows the skewed nature of the distribution of both variables.
wdata %>%
ggplot(aes(x=POP, y=SALES)) +
geom_point(size=1.5, shape=1, stroke=1.25) +
geom_smooth(method="lm", se=FALSE) +
geom_rug() +
ggtitle(label=NULL, subtitle="Random Sample of 50 Zip Codes") +
labs(x="Poplulation", y="Sales in $")
We take logarithms (log to the base e). We use the mutate() verb.
wdata %>%
mutate(
logPOP = log(POP),
logSALES = log(SALES)
) %>%
ggplot(aes(x=logPOP, y=logSALES)) +
geom_point(size=1.5, shape=1, stroke=1.25) +
geom_smooth(method="lm", se=FALSE) +
geom_rug() +
ggtitle(label=NULL, subtitle="Random Sample of 50 Zip Codes") +
labs(x="log Poplulation", y="log Sales in $")
## Warning: package 'bindrcpp' was built under R version 3.4.4
The next script illustrates the use of mutate(), arrange() and select(). Note also the use of desc() since this function modifies the verb arrange() it is considered at adverb. The dplyr verbs have many adverbs – see documentation for each verb.
Let’s drill down and find the observations corresponding to the 5 largest ratios of SALES/POP. The ZIP codes corresponding to the largest 53104 and 53003 and correspond to small border towns where out-of-state people go to purchase Wisconsin lottery tickets.
wdata2 <-
wdata %>%
mutate(
logPOP = log(POP),
logSALES = log(SALES),
SALESPerCapita=SALES/POP
) %>%
arrange(desc(SALESPerCapita)) %>%
select(ZIP, contains("POP"), contains("SALES"))
wdata2[1:5,]
## # A tibble: 5 x 6
## ZIP POP logPOP SALES logSALES SALESPerCapita
## <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 53104 4464 8.40 21588. 9.98 4.84
## 2 53003 435 6.08 1285. 7.16 2.95
## 3 53952 1217 7.10 2001. 7.60 1.64
## 4 53072 13337 9.50 15046. 9.62 1.13
## 5 53520 5825 8.67 6209. 8.73 1.07
I wonder if any other variables can be considered discrete or perhaps categorical. So I compute the number of unique values for each variable using the map() and map_int() functions. The tidyverse package purrr enhance functional programming in R and provide an alternative to the base R suite of apply functions. Scripts using map functions are typically easier to read and more elegant. For maximum efficiency the underlying implementation using C.
map_int(map(wdata, unique), length)
## ZIP PERPERHH MEDSCHYR MEDHVL PRCRENT PRC55P HHMEDAGE MEDINC
## 50 11 13 47 26 25 16 48
## SALES POP
## 50 50
We use summarize() with helper functions group_by() and n().
wdata %>%
mutate(SALESPerCapita=SALES/POP) %>%
group_by(HHMEDAGE) %>%
summarize(average=mean(SALESPerCapita), sd=sd(SALESPerCapita), n=n())
## # A tibble: 16 x 4
## HHMEDAGE average sd n
## <int> <dbl> <dbl> <int>
## 1 41 0.497 NaN 1
## 2 42 0.878 0.308 3
## 3 44 0.322 0.0841 3
## 4 45 0.624 0.206 5
## 5 46 0.795 0.180 4
## 6 47 0.518 0.327 2
## 7 48 1.40 1.65 8
## 8 49 0.858 0.164 2
## 9 50 0.514 0.180 7
## 10 51 0.538 0.133 4
## 11 52 0.742 0.240 4
## 12 54 0.437 0.280 2
## 13 55 0.251 0.0335 2
## 14 57 0.482 NaN 1
## 15 58 0.879 NaN 1
## 16 59 1.64 NaN 1
We use filter() to select rows.
wdata %>%
filter(HHMEDAGE==41 | HHMEDAGE>=57 )
## # A tibble: 4 x 10
## ZIP PERPERHH MEDSCHYR MEDHVL PRCRENT PRC55P HHMEDAGE MEDINC SALES
## <int> <dbl> <dbl> <dbl> <int> <int> <int> <dbl> <dbl>
## 1 53211 2.3 15.9 120 62 27 41 61.1 18389.
## 2 53934 2.4 12.2 43.7 20 56 57 31.2 2344.
## 3 53952 2.2 12.5 45.9 18 56 59 39.1 2001.
## 4 54634 2.4 12.2 38.4 40 53 58 27.9 2030.
## # ... with 1 more variable: POP <int>
wdata2 <-
wdata %>%
mutate(
logPOP = log(POP),
logSALES = log(SALES),
SALESPerCapita=SALES/POP
) %>%
arrange(desc(SALESPerCapita)) %>%
select(ZIP, contains("POP"), contains("SALES"))
wdata2[1:5,]
## # A tibble: 5 x 6
## ZIP POP logPOP SALES logSALES SALESPerCapita
## <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 53104 4464 8.40 21588. 9.98 4.84
## 2 53003 435 6.08 1285. 7.16 2.95
## 3 53952 1217 7.10 2001. 7.60 1.64
## 4 53072 13337 9.50 15046. 9.62 1.13
## 5 53520 5825 8.67 6209. 8.73 1.07