Introducing dplyr

dplyr: data plyers package

We introduce the verbs:

filter()
select()
mutate()
arrange()
summarize()
gather()
purrr::map()

These functions are available after loading tidyverse.

First we illustrate on Wisconsin lottery data. The first 16 lines of the data file describe the variables.

fileName <- 
  "http://fisher.stats.uwo.ca/faculty/aim/2018/3859A/data/WiscLottery.csv"
readLines(fileName, 16)

##  [1] "#Data is descibed in Frees' textbook (p. 23 and Table 3.11, p.102)"   
##  [2] "#State of Wisconsin lottery sales from 50 randomly selected areas"    
##  [3] "#identified by their zip codes. Socio-demographic variables that"     
##  [4] "#may be of interest to administrators are also provided."             
##  [5] "#ZIP - zip code"                                                      
##  [6] "#PERPERHN - persons per household"                                    
##  [7] "#MEDSCHYR - median years of schooling"                                
##  [8] "#MEDHVL- median home value in $1000's for owner occupied homes"       
##  [9] "#PRCRENT - percent of housing that is owner occupied"                 
## [10] "#PRC55P - percent of population that is over 55"                      
## [11] "#HHMEDAGE - household median age"                                     
## [12] "#MEDINC - estimated median income in $1,000"                          
## [13] "#SALES - online lottery sales to individuals"                         
## [14] "#POP - population in thousands"                                       
## [15] "ZIP,PERPERHH,MEDSCHYR,MEDHVL,PRCRENT,PRC55P,HHMEDAGE,MEDINC,SALES,POP"
## [16] "53003,3,12.6,71.3,21,38,48,54.2,1285.4,435"

Using read_csv() we input as tibble.

wdata <- read_csv(fileName, skip=14)

## Parsed with column specification:
## cols(
##   ZIP = col_integer(),
##   PERPERHH = col_double(),
##   MEDSCHYR = col_double(),
##   MEDHVL = col_double(),
##   PRCRENT = col_integer(),
##   PRC55P = col_integer(),
##   HHMEDAGE = col_integer(),
##   MEDINC = col_double(),
##   SALES = col_double(),
##   POP = col_integer()
## )

wdata

## # A tibble: 50 x 10
##      ZIP PERPERHH MEDSCHYR MEDHVL PRCRENT PRC55P HHMEDAGE MEDINC  SALES
##    <int>    <dbl>    <dbl>  <dbl>   <int>  <int>    <int>  <dbl>  <dbl>
##  1 53003      3       12.6   71.3      21     38       48   54.2  1285.
##  2 53033      3.2     12.9   98         6     28       46   70.7  3571.
##  3 53038      2.8     12.4   58.7      25     35       45   43.6  2407.
##  4 53059      3.1     12.5   65.7      24     29       45   51.9  1224.
##  5 53072      2.6     13.1   96.7      32     27       42   63.1 15046.
##  6 53083      2.7     12.8   66.4      25     38       48   55.7  9129.
##  7 53095      2.8     12.9   91        31     37       48   54.9 33181.
##  8 53098      2.9     12.5   61        26     40       50   46.9  2243.
##  9 53104      2.8     12.8   91.5      18     35       48   62.3 21588.
## 10 53172      2.6     12.7   68.8      37     39       47   49.1 15693.
## # ... with 40 more rows, and 1 more variable: POP <int>

We are interested in SALES and POP. So first we looked at a simple scatterplot and noticed that both variables were strongly skewed. Things always simpler when the data is symmetric! As a check we have included a rug on both axes. This clearly shows the skewed nature of the distribution of both variables.

wdata %>%
  ggplot(aes(x=POP, y=SALES)) +
  geom_point(size=1.5, shape=1, stroke=1.25) +
  geom_smooth(method="lm", se=FALSE) + 
  geom_rug() +
  ggtitle(label=NULL, subtitle="Random Sample of 50 Zip Codes") +
  labs(x="Poplulation", y="Sales in $")

We take logarithms (log to the base e). We use the mutate() verb.

wdata %>%
  mutate(
    logPOP = log(POP),
    logSALES = log(SALES)
  ) %>%
  ggplot(aes(x=logPOP, y=logSALES)) +
  geom_point(size=1.5, shape=1, stroke=1.25) +
  geom_smooth(method="lm", se=FALSE) + 
  geom_rug() +
  ggtitle(label=NULL, subtitle="Random Sample of 50 Zip Codes") +
  labs(x="log Poplulation", y="log Sales in $")

## Warning: package 'bindrcpp' was built under R version 3.4.4

The next script illustrates the use of mutate(), arrange() and select(). Note also the use of desc() since this function modifies the verb arrange() it is considered at adverb. The dplyr verbs have many adverbs – see documentation for each verb.

Let’s drill down and find the observations corresponding to the 5 largest ratios of SALES/POP. The ZIP codes corresponding to the largest 53104 and 53003 and correspond to small border towns where out-of-state people go to purchase Wisconsin lottery tickets.

wdata2 <- 
 wdata %>%
   mutate(
  logPOP = log(POP),
  logSALES = log(SALES),
  SALESPerCapita=SALES/POP
 ) %>%
 arrange(desc(SALESPerCapita)) %>%
 select(ZIP, contains("POP"), contains("SALES"))
 wdata2[1:5,]

## # A tibble: 5 x 6
##     ZIP   POP logPOP  SALES logSALES SALESPerCapita
##   <int> <int>  <dbl>  <dbl>    <dbl>          <dbl>
## 1 53104  4464   8.40 21588.     9.98           4.84
## 2 53003   435   6.08  1285.     7.16           2.95
## 3 53952  1217   7.10  2001.     7.60           1.64
## 4 53072 13337   9.50 15046.     9.62           1.13
## 5 53520  5825   8.67  6209.     8.73           1.07

I wonder if any other variables can be considered discrete or perhaps categorical. So I compute the number of unique values for each variable using the map() and map_int() functions. The tidyverse package purrr enhance functional programming in R and provide an alternative to the base R suite of apply functions. Scripts using map functions are typically easier to read and more elegant. For maximum efficiency the underlying implementation using C.

map_int(map(wdata, unique), length)

##      ZIP PERPERHH MEDSCHYR   MEDHVL  PRCRENT   PRC55P HHMEDAGE   MEDINC 
##       50       11       13       47       26       25       16       48 
##    SALES      POP 
##       50       50

We use summarize() with helper functions group_by() and n().

wdata %>% 
   mutate(SALESPerCapita=SALES/POP) %>%
   group_by(HHMEDAGE) %>%
   summarize(average=mean(SALESPerCapita), sd=sd(SALESPerCapita), n=n())

## # A tibble: 16 x 4
##    HHMEDAGE average       sd     n
##       <int>   <dbl>    <dbl> <int>
##  1       41   0.497 NaN          1
##  2       42   0.878   0.308      3
##  3       44   0.322   0.0841     3
##  4       45   0.624   0.206      5
##  5       46   0.795   0.180      4
##  6       47   0.518   0.327      2
##  7       48   1.40    1.65       8
##  8       49   0.858   0.164      2
##  9       50   0.514   0.180      7
## 10       51   0.538   0.133      4
## 11       52   0.742   0.240      4
## 12       54   0.437   0.280      2
## 13       55   0.251   0.0335     2
## 14       57   0.482 NaN          1
## 15       58   0.879 NaN          1
## 16       59   1.64  NaN          1

We use filter() to select rows.

wdata %>%
  filter(HHMEDAGE==41 | HHMEDAGE>=57 )

## # A tibble: 4 x 10
##     ZIP PERPERHH MEDSCHYR MEDHVL PRCRENT PRC55P HHMEDAGE MEDINC  SALES
##   <int>    <dbl>    <dbl>  <dbl>   <int>  <int>    <int>  <dbl>  <dbl>
## 1 53211      2.3     15.9  120        62     27       41   61.1 18389.
## 2 53934      2.4     12.2   43.7      20     56       57   31.2  2344.
## 3 53952      2.2     12.5   45.9      18     56       59   39.1  2001.
## 4 54634      2.4     12.2   38.4      40     53       58   27.9  2030.
## # ... with 1 more variable: POP <int>

wdata2 <- 
 wdata %>%
   mutate(
  logPOP = log(POP),
  logSALES = log(SALES),
  SALESPerCapita=SALES/POP
 ) %>%
 arrange(desc(SALESPerCapita)) %>%
 select(ZIP, contains("POP"), contains("SALES"))
 wdata2[1:5,]

## # A tibble: 5 x 6
##     ZIP   POP logPOP  SALES logSALES SALESPerCapita
##   <int> <int>  <dbl>  <dbl>    <dbl>          <dbl>
## 1 53104  4464   8.40 21588.     9.98           4.84
## 2 53003   435   6.08  1285.     7.16           2.95
## 3 53952  1217   7.10  2001.     7.60           1.64
## 4 53072 13337   9.50 15046.     9.62           1.13
## 5 53520  5825   8.67  6209.     8.73           1.07

Introducing dplyr

A. I. McLeod

September 24, 2018

dplyr: data plyers package