Cleaning Data with data.table and dplyr

Figuring out which packages to use is the single largest downside to learning R. There are over 10,000 packages in CRAN, knowing which one to turn to could take you to a place that’s not the most intuitive or worse straight up bad. New learners tend to use the packages written by Hadley Wickham, which are great.

dplyr, for example, is used to manipulate data frames, handles data in an intuitive way that makes your work easier to follow, and easier to read.

One of the downsides of manipulating dataframes is that they aren’t the most nimble things to be moving around. Manipulating datasets that are much larger (>5m rows) becomes much slower, and grouping operations start to lag. Thus, the data.table package was created by Matt Dowle and Arun Srinivasan to improve speed and memory usage for larger sets of data.

There are some minor functionality differences and major syntax differences when manipulating data with dplyr compared to data.table. To illustrate this we’ll examine a daily data set on air pollution and weather for the city of Chicago from 1987 to 2005.

First, let’s read in the data and create both a data.frame and a data.table.

#*.RDS and *.RData are file types which can store in-memory R objects on disk efficiently.
fileURL <- "https://github.com/uky994/STA-4233.-INTRODUCTION-TO-PROGRAMMING-AND-DATA-MANAGEMENT-IN-R/blob/master/chicago.rds?raw=true"
if(!file.exists("./data")){
  dir.create("./data")
}
#On my machine, this code works changing the download.file as follows:
#download.file(fileURL, destfile = "./data/chicago.rds", extra='-L',mode='wb')
#Setting the method to 'curl' didn't help, so I thought to play with the mode parameter based on the advice that it is mostly good #to set it to 'wb'..

#It seems that RDS is binary file and in my case it was recognized as such by default. But it is not something to rely on, as #suggested here: http://stackoverflow.com/questions/29814405/why-do-i-need-to-use-mode-wb-with-download-file-for-this-rds-file

download.file(fileURL, destfile = "./data/chicago.rds", method = "curl", extra='-L')

#create data frame
chicago <- readRDS("./data/chicago.rds")
dim(chicago)

## [1] 6940    8

#create data table
chicdt<-as.data.table(chicago)
dim(chicdt)

## [1] 6940    8

Recall that the dplyr package introduces some basic verbs that helped to streamline the data manipulation process. - select: return a subset of the columns of a data frame - filter: extract a subset of rows from a data frame based on logical conditions - arrange: reorder rows of a data frame - rename: rename variables in a data frame - mutate: add new variables/columns or transform existing variables - summarize: generate summary statistics of different variables in the data frame, possibly within strata

In contrast, a data.table has a general form of data.table syntax, as shown below:

DT[i, j, by]                                       
                                                   
   R:        i                   j            by
 SQL:  where | order by   select | update  group by

Those of you who have an SQL background might perhaps immediately relate to this syntax.

Selecting Columns

If you take a look at the dimensions of the data set you’ll see that there are 6,940 rows and eight columns, and here are the first couple of rows of the data of the data set and the first five columns.

head(select(chicago, 1:5))

##   city tmpd   dptp       date pm25tmean2
## 1 chic 31.5 31.500 1987-01-01         NA
## 2 chic 33.0 29.875 1987-01-02         NA
## 3 chic 33.0 27.375 1987-01-03         NA
## 4 chic 29.0 28.625 1987-01-04         NA
## 5 chic 32.0 28.875 1987-01-05         NA
## 6 chic 40.0 35.125 1987-01-06         NA

I can do the same thing using data.table by placing nothing in the ith position and 1:5 in the jth position. Now, R prints more rows than just the first few, but data.table provides a succint output that doesn’t overwhelm the window so the head function is convenient and unneccessary.

chicdt[,1:5]

##       city tmpd   dptp       date pm25tmean2
##    1: chic 31.5 31.500 1987-01-01         NA
##    2: chic 33.0 29.875 1987-01-02         NA
##    3: chic 33.0 27.375 1987-01-03         NA
##    4: chic 29.0 28.625 1987-01-04         NA
##    5: chic 32.0 28.875 1987-01-05         NA
##   ---                                       
## 6936: chic 40.0 33.600 2005-12-27   23.56000
## 6937: chic 37.0 34.500 2005-12-28   17.75000
## 6938: chic 35.0 29.400 2005-12-29    7.45000
## 6939: chic 36.0 31.000 2005-12-30   15.05714
## 6940: chic 35.0 30.100 2005-12-31   15.00000

Now one of the things you can do for any data frame is to look at the names, of the variable names using the names function. So, here I’m going to print out the first few variable names. One of the nice things you can do with the select function in dplyr is to actually access the columns or set of columns in the data frame using the names rather than the column indices.

Say I wanted to look at the columns starting with city and ending with DPTP, which is the dew point column, and I want to include all the columns in between. I can use this notation, which is “city colon DBTP”.

names(chicago)[1:3]

## [1] "city" "tmpd" "dptp"

head(select(chicago, city:dptp))

##   city tmpd   dptp
## 1 chic 31.5 31.500
## 2 chic 33.0 29.875
## 3 chic 33.0 27.375
## 4 chic 29.0 28.625
## 5 chic 32.0 28.875
## 6 chic 40.0 35.125

Which is not a notation that you would normally be able to use in other functions but you can use here in the select function and you can see it selects all the columns between the city column and the dew point column. It’s a nice and handy way to look at subsets of columns or data frame by just you referring to them by their names.

I can perform the same function using the data.table. I placed 1:5 in the ith positon of my data.table subset syntax only to mirror the output from dplyr.

names(chicdt)[1:3]

## [1] "city" "tmpd" "dptp"

chicdt[1:5,city:dptp]

##    city tmpd   dptp
## 1: chic 31.5 31.500
## 2: chic 33.0 29.875
## 3: chic 33.0 27.375
## 4: chic 29.0 28.625
## 5: chic 32.0 28.875

Similarly, you can use the minus sign to say, “I want to look at all the columns except for the columns indicated by this range.” You can use the select function to just say minus on that city colon dew point sequence, and you’ll get all of the columns except those few columns.

head(select(chicago, -(city:dptp)))

##         date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
## 1 1987-01-01         NA   34.00000 4.250000  19.98810
## 2 1987-01-02         NA         NA 3.304348  23.19099
## 3 1987-01-03         NA   34.16667 3.333333  23.81548
## 4 1987-01-04         NA   47.00000 4.375000  30.43452
## 5 1987-01-05         NA         NA 4.750000  30.33333
## 6 1987-01-06         NA   48.00000 5.833333  25.77233

Similar logic can be used in data.table

chicdt[1:5,-(city:dptp)]

##          date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
## 1: 1987-01-01         NA   34.00000 4.250000  19.98810
## 2: 1987-01-02         NA         NA 3.304348  23.19099
## 3: 1987-01-03         NA   34.16667 3.333333  23.81548
## 4: 1987-01-04         NA   47.00000 4.375000  30.43452
## 5: 1987-01-05         NA         NA 4.750000  30.33333

The equivalent code in regular R, without using the dplyr package, is a little bit tricky because you have to find the index for where the city column is and you have to find the index for where the dew point column is and then you need to take the negative of those indices. It’s not particularly complicated but it’s an extra two lines of code and perhaps not as readable.

## Equivalent base R
i <- match("city", names(chicago))
j <- match("dptp", names(chicago))
head(chicago[, -(i:j)])

##         date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
## 1 1987-01-01         NA   34.00000 4.250000  19.98810
## 2 1987-01-02         NA         NA 3.304348  23.19099
## 3 1987-01-03         NA   34.16667 3.333333  23.81548
## 4 1987-01-04         NA   47.00000 4.375000  30.43452
## 5 1987-01-05         NA         NA 4.750000  30.33333
## 6 1987-01-06         NA   48.00000 5.833333  25.77233

Filtering Rows

The filter function in dplyris used subset rows based on conditions. For example, you might want to take all the rows in the Chicago data set where pm 2.5 is greater than 30. I’ve got that condition as the second argument creating a logical sequence to subsets the rows. All of the values of pm2.5 are greater than 30.

chic.f <- filter(chicago, pm25tmean2 > 30)
head(select(chic.f, 1:3, pm25tmean2), 10)

##    city tmpd dptp pm25tmean2
## 1  chic   23 21.9      38.10
## 2  chic   28 25.8      33.95
## 3  chic   55 51.3      39.40
## 4  chic   59 53.7      35.40
## 5  chic   57 52.0      33.30
## 6  chic   57 56.0      32.10
## 7  chic   75 65.8      56.50
## 8  chic   61 59.0      33.80
## 9  chic   73 60.3      30.30
## 10 chic   78 67.1      41.40

Filtering is really just a game testing your logic coding. For data table the same pm25tmean2 > 30 logic can be placed in the ith position to achieve the same result.

chicdt[pm25tmean2 > 30 , 1:3]

##      city tmpd dptp
##   1: chic   23 21.9
##   2: chic   28 25.8
##   3: chic   55 51.3
##   4: chic   59 53.7
##   5: chic   57 52.0
##  ---               
## 190: chic    8 -1.8
## 191: chic   12  7.7
## 192: chic   22 23.3
## 193: chic   41 32.6
## 194: chic   37 35.2

Of course, you don’t have to just subset rows based on values in one column. You could take multiple columns and create a more complicated logical sequence. Here I’m looking at pm2.5 greater than 30 and temperature greater than 80. When I print out the few first couple rows here, all of the temperature values are greater than 80 and all the pm2.5 values are greater than 30.

#DPLYR
chic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)
head(select(chic.f, 1:3, pm25tmean2, tmpd), 10)

##    city tmpd dptp pm25tmean2
## 1  chic   81 71.2    39.6000
## 2  chic   81 70.4    31.5000
## 3  chic   82 72.2    32.3000
## 4  chic   84 72.9    43.7000
## 5  chic   85 72.6    38.8375
## 6  chic   84 72.6    38.2000
## 7  chic   82 67.4    33.0000
## 8  chic   82 63.5    42.5000
## 9  chic   81 70.4    33.1000
## 10 chic   82 66.2    38.8500

#DATA.TABLE
chicdt[pm25tmean2 > 30 & tmpd > 80,1:3]

##     city tmpd dptp
##  1: chic   81 71.2
##  2: chic   81 70.4
##  3: chic   82 72.2
##  4: chic   84 72.9
##  5: chic   85 72.6
##  6: chic   84 72.6
##  7: chic   82 67.4
##  8: chic   82 63.5
##  9: chic   81 70.4
## 10: chic   82 66.2
## 11: chic   82 65.1
## 12: chic   84 68.4
## 13: chic   86 63.4
## 14: chic   82 64.6
## 15: chic   85 64.1
## 16: chic   84 67.0
## 17: chic   84 69.0

You can have an arbitrarily complex logical sequence there, and the rows subset based on that sequence in both the filter function and using data.table’s way. The nice thing about both methods is you can refer to the variable names directly using their names and you don’t have to kind of subset out each variable using the subset operators like $.

Sorting

The arrange function in dplyr is used to reorder the rows of a data frame based on the values of a column. This is a pain in base R and the arrange function makes it quite a bit simpler and easier to read.

Suppose, I want to order the rows of the chicago data according to the date variable from lowest to highest.

chicago <- arrange(chicago, date)
head(select(chicago, date, pm25tmean2), 3)

##         date pm25tmean2
## 1 1987-01-01         NA
## 2 1987-01-02         NA
## 3 1987-01-03         NA

tail(select(chicago, date, pm25tmean2), 3)

##            date pm25tmean2
## 6938 2005-12-29    7.45000
## 6939 2005-12-30   15.05714
## 6940 2005-12-31   15.00000

As you can see that the first couple of rows, start in 1987 and then when I use the tail function to look at the last couple rows, you see that they all end in 2005.

We can use the R function order() to accomplish this in data.table.

chicdt[order(date),3:4]

##         dptp       date
##    1: 31.500 1987-01-01
##    2: 29.875 1987-01-02
##    3: 27.375 1987-01-03
##    4: 28.625 1987-01-04
##    5: 28.875 1987-01-05
##   ---                  
## 6936: 33.600 2005-12-27
## 6937: 34.500 2005-12-28
## 6938: 29.400 2005-12-29
## 6939: 31.000 2005-12-30
## 6940: 30.100 2005-12-31

order() is internally optimised! In addition, order(...) within the frame of a data.table uses data.table’s internal fast radix order forder(). This sort provided such a compelling improvement over R’s base::order that the R project adopted the data.table algorithm as its default sort in 2016 for R 3.3.0, see ?sort and the R Release NEWS.

Now, we might want to arrange the rows in descending order. In this case, the desc function can be used. We can use “-” on a character columns within the frame of a data.table to sort in decreasing order.

#DPLYR
chicago <- arrange(chicago, desc(date))
head(select(chicago, date, pm25tmean2), 3)

##         date pm25tmean2
## 1 2005-12-31   15.00000
## 2 2005-12-30   15.05714
## 3 2005-12-29    7.45000

tail(select(chicago, date, pm25tmean2), 3)

##            date pm25tmean2
## 6938 1987-01-03         NA
## 6939 1987-01-02         NA
## 6940 1987-01-01         NA

#DATA.TABLE
chicdt[order(-date),3:4]

##         dptp       date
##    1: 30.100 2005-12-31
##    2: 31.000 2005-12-30
##    3: 29.400 2005-12-29
##    4: 34.500 2005-12-28
##    5: 33.600 2005-12-27
##   ---                  
## 6936: 28.875 1987-01-05
## 6937: 28.625 1987-01-04
## 6938: 27.375 1987-01-03
## 6939: 29.875 1987-01-02
## 6940: 31.500 1987-01-01

Now when the rows are printed, they start with all the values of 2005 and go backwards ending in 1987. There’s not a big adjustment to reverse the order of all the rows in the data frame according to the date.

Renaming Columns

The rename function is used to …, wait for it, … rename a variable in R. Without a function like this, it is surprisingly annoying to rename a variable in R.

So suppose I think the pm2.5 variable, called pm25tmean2 is a bit of a mouthful, and I want to simplify it by calling it pm25. Similarly, the dew point variable is called dptp, which is not particularly intuitive so I might want to rename that variable too. So here, I just call the rename function in dplyr and pass the chicago data frame.

head(chicago[, 1:5], 3)

##   city tmpd dptp       date pm25tmean2
## 1 chic   35 30.1 2005-12-31   15.00000
## 2 chic   36 31.0 2005-12-30   15.05714
## 3 chic   35 29.4 2005-12-29    7.45000

chicago <- rename(chicago, dewpoint = dptp, pm25 = pm25tmean2)
head(chicago[, 1:5], 3)

##   city tmpd dewpoint       date     pm25
## 1 chic   35     30.1 2005-12-31 15.00000
## 2 chic   36     31.0 2005-12-30 15.05714
## 3 chic   35     29.4 2005-12-29  7.45000

In data.table, we wrap the variables (column names) within list(), which ensures that a data.table is returned. In case of a single column name, not wrapping with list() returns a vector instead.

data.table also allows wrapping columns with .() which is an alias to list(). Since .() is the same as list(), we can name columns as we would while creating a list.

chicdt[, .(city, tmpd,dewpoint = dptp,date, pm25 = pm25tmean2)]

##       city tmpd dewpoint       date     pm25
##    1: chic 31.5   31.500 1987-01-01       NA
##    2: chic 33.0   29.875 1987-01-02       NA
##    3: chic 33.0   27.375 1987-01-03       NA
##    4: chic 29.0   28.625 1987-01-04       NA
##    5: chic 32.0   28.875 1987-01-05       NA
##   ---                                       
## 6936: chic 40.0   33.600 2005-12-27 23.56000
## 6937: chic 37.0   34.500 2005-12-28 17.75000
## 6938: chic 35.0   29.400 2005-12-29  7.45000
## 6939: chic 36.0   31.000 2005-12-30 15.05714
## 6940: chic 35.0   30.100 2005-12-31 15.00000

Be careful here. Just renaming the columns selects only those columns. Thus, I have to include all of the columns I want within my .() list. Not really as convenient as the rename function.

Nevertheless, I supply the new name and set it equal to the old name. Dew-point equals dptp and pm25 equals pm25tmean2 which renames those two columns. In the data frame all of the other columns are untouched. When the first couple rows of this data are printed you can see that the dewpoint variable and the pm25 variable are properly named.

Creating New Variables

The mutate function transforms existing variables or to create new variables. Now suppose I want to create a new variable called pm25detrend which is the pm25 variable centered around the mean. I’ll create this new variable called pm25detrend by subtracting the mean of pm25 from the pm25 variable.

chicago <- mutate(chicago, pm25detrend=pm25-mean(pm25, na.rm=TRUE))
head(select(chicago, pm25, pm25detrend))

##       pm25 pm25detrend
## 1 15.00000   -1.230958
## 2 15.05714   -1.173815
## 3  7.45000   -8.780958
## 4 17.75000    1.519042
## 5 23.56000    7.329042
## 6  8.40000   -7.830958

data.table adds new columns very fast and efficiently. You use this the := command to add a variable to the data table.

chicdt[,pm25detrend:=pm25tmean2-mean(pm25tmean2, na.rm=TRUE)] 
chicdt

##       city tmpd   dptp       date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
##    1: chic 31.5 31.500 1987-01-01         NA   34.00000 4.250000  19.98810
##    2: chic 33.0 29.875 1987-01-02         NA         NA 3.304348  23.19099
##    3: chic 33.0 27.375 1987-01-03         NA   34.16667 3.333333  23.81548
##    4: chic 29.0 28.625 1987-01-04         NA   47.00000 4.375000  30.43452
##    5: chic 32.0 28.875 1987-01-05         NA         NA 4.750000  30.33333
##   ---                                                                     
## 6936: chic 40.0 33.600 2005-12-27   23.56000   27.00000 4.468750  23.50000
## 6937: chic 37.0 34.500 2005-12-28   17.75000   27.50000 3.260417  19.28563
## 6938: chic 35.0 29.400 2005-12-29    7.45000   23.50000 6.794837  19.97222
## 6939: chic 36.0 31.000 2005-12-30   15.05714   19.20000 3.034420  22.80556
## 6940: chic 35.0 30.100 2005-12-31   15.00000   23.50000 2.531250  13.25000
##       pm25detrend
##    1:          NA
##    2:          NA
##    3:          NA
##    4:          NA
##    5:          NA
##   ---            
## 6936:    7.329042
## 6937:    1.519042
## 6938:   -8.780958
## 6939:   -1.173815
## 6940:   -1.230958

The nice thing is when you’re adding a new variable to a data frame, R will copy over the entire data frame and add a new variable to it, so you get two copies of the the data frame in memory. When dealing with big data sets, this is going to cause lots of memory problems, which you don’t have with data table because a new copy isn’t being created.

You’ll notice that I had to use the old pm25tmean2 variable name instead of pm25. As data.table is working by reference, saving the names isn’t hard, but it isn’t trivial either.

Splitting Data by a dimension

Finally the group_by function allows you to split a data frame according to categorical variables. Let’s create a temperature category variable which will indicate whether a given day was hot or cold, depending on whether the temperature was over 80 degrees or not. I create this factor variable called tempcat, and then I use the group by function to create a new kind of data structure. Based on the original data frame and this tempcat variable. And so I call that new data structure hot/cold, and you can print it out here.

chicago <- mutate(chicago, tempcat = factor(1 * (tmpd > 80), labels = c("cold", "hot")))
hotcold <- group_by(chicago, tempcat)

## Warning: Factor `tempcat` contains implicit NA, consider using
## `forcats::fct_explicit_na`

summarize(hotcold, pm25 = mean(pm25, na.rm = TRUE),o3 = max(o3tmean2, na.rm = TRUE), 
          no2 = median(no2tmean2, na.rm = TRUE))

## # A tibble: 3 x 4
##   tempcat  pm25    o3   no2
##   <fct>   <dbl> <dbl> <dbl>
## 1 cold     16.0 66.6   24.5
## 2 hot      26.5 63.0   24.9
## 3 <NA>     47.7  9.42  37.4

In data.table, we’ll combine the i and j statements together with by to perform operations by group.

chicdt[,tempcat:=factor(1 * (tmpd > 80), labels = c("cold", "hot"))]
chicdt[,.(pm25=mean(pm25tmean2, na.rm = TRUE),o3=max(o3tmean2, na.rm = TRUE),
          no2 = median(no2tmean2, na.rm = TRUE)),by=tempcat]

##    tempcat     pm25        o3      no2
## 1:    cold 15.97807 66.587500 24.54924
## 2:     hot 26.48118 62.969656 24.93870
## 3:    <NA> 47.73750  9.416667 37.44444

We pass our aggregations in the j arguments and use the tempcat column in the by section to tell data.table how to calculate the requested statistics.

You can now see the different syntax in practice where I use the summarize function to aggregate by the hot/cold group which has been split based on the temperature category variable in contrast to the statements in the separate parts of the data.table syntax. In either case, I calculate the mean pm2.5 for both hot and cold days, the maximum ozone for hot and cold days, and the median nitrogen dioxide or no2 for both hot and cold days. It prints out a little data frame that has the levels of the categorical variable which are cold, hot, and NA so there are some missing values. Both give you the summary statistics for pm25 and ozone and no2 in each of those categories. Now since there are missing data in the pm25 variable, I need to specify the na.rm equals true in order to get some the means of those values by ignoring the missing values. You can see that that pm2.5 is tends to be lower when it’s cold and higher when it’s hot.

These are the summary statistics for each of the pollutants in these temperature categories. I could also categorize the data set on other variables. For example I might want to do a summary for each year in the data set, I can use the mutate function to create a year variable.

chicago <- mutate(chicago,year = as.POSIXlt(date)$year + 1900)
years <- group_by(chicago, year)
summarize(years, pm25 = mean(pm25, na.rm = TRUE),o3 = max(o3tmean2, na.rm = TRUE),
          no2 = median(no2tmean2, na.rm = TRUE))

## # A tibble: 19 x 4
##     year  pm25    o3   no2
##    <dbl> <dbl> <dbl> <dbl>
##  1  1987 NaN    63.0  23.5
##  2  1988 NaN    61.7  24.5
##  3  1989 NaN    59.7  26.1
##  4  1990 NaN    52.2  22.6
##  5  1991 NaN    63.1  21.4
##  6  1992 NaN    50.8  24.8
##  7  1993 NaN    44.3  25.8
##  8  1994 NaN    52.2  28.5
##  9  1995 NaN    66.6  27.3
## 10  1996 NaN    58.4  26.4
## 11  1997 NaN    56.5  25.5
## 12  1998  18.3  50.7  24.6
## 13  1999  18.5  57.5  24.7
## 14  2000  16.9  55.8  23.5
## 15  2001  16.9  51.8  25.1
## 16  2002  15.3  54.9  22.7
## 17  2003  15.2  56.2  24.6
## 18  2004  14.6  44.5  23.4
## 19  2005  16.2  58.8  22.6

I can use the same as.POSIXlt function in data.table to create the year variable there

chicdt[,year := as.POSIXlt(date)$year + 1900]
chicdt[,.(pm25=mean(pm25tmean2, na.rm = TRUE),o3=max(o3tmean2, na.rm = TRUE),
            no2 = median(no2tmean2, na.rm = TRUE)),by=year]

##     year     pm25       o3      no2
##  1: 1987      NaN 62.96966 23.49369
##  2: 1988      NaN 61.67708 24.52296
##  3: 1989      NaN 59.72727 26.14062
##  4: 1990      NaN 52.22917 22.59583
##  5: 1991      NaN 63.10417 21.38194
##  6: 1992      NaN 50.82870 24.78921
##  7: 1993      NaN 44.30093 25.76993
##  8: 1994      NaN 52.17844 28.47500
##  9: 1995      NaN 66.58750 27.26042
## 10: 1996      NaN 58.39583 26.38715
## 11: 1997      NaN 56.54167 25.48143
## 12: 1998 18.26467 50.66250 24.58649
## 13: 1999 18.49646 57.48864 24.66667
## 14: 2000 16.93806 55.76103 23.46082
## 15: 2001 16.92632 51.81984 25.06522
## 16: 2002 15.27335 54.88043 22.73750
## 17: 2003 15.23183 56.16608 24.62500
## 18: 2004 14.62864 44.48240 23.39130
## 19: 2005 16.18556 58.84126 22.62387

Piping or Chaining

The dplyr package implements a special operator that allows you to kind of chain different operations together in a kind of readable way. It’s indicated by a percent symbol and then the greater than symbol and then the percent symbol (%>%). We call it the pipeline operator and you can take a data set and feed through a pipeline of operations to create a new data set.

chicago %>% 
  mutate(month = as.POSIXlt(date)$mon + 1) %>% 
  group_by(month) %>% 
  summarize(pm25 = mean(pm25, na.rm = TRUE), o3 = max(o3tmean2, na.rm = TRUE), 
                no2 = median(no2tmean2, na.rm = TRUE))

## # A tibble: 12 x 4
##    month  pm25    o3   no2
##    <dbl> <dbl> <dbl> <dbl>
##  1     1  17.8  28.2  25.4
##  2     2  20.4  37.4  26.8
##  3     3  17.4  39.0  26.8
##  4     4  13.9  47.9  25.0
##  5     5  14.1  52.8  24.2
##  6     6  15.9  66.6  25.0
##  7     7  16.6  59.5  22.4
##  8     8  16.9  54.0  23.0
##  9     9  15.9  57.5  24.5
## 10    10  14.2  47.1  24.2
## 11    11  15.2  29.5  23.6
## 12    12  17.5  27.7  24.5

Here I’ve got the Chicago data set and I want to mutate to create a month variable because I want to create a summary of each of the pollutant variables by month. I create the month variable with mutate and then I want to take the output of mutate group by the month variable. Then I take the output of group by and then run it through summarize. Notice each call to mutate, group by and summarize, is separated by the pipeline operator (%>%). I don’t have the specify the data frame since it’s the first argument. The output of this long pipeline of operations is a data frame that shows you the summary statistics of each of the three pollutant variables by each of the 12 months of the year.

The pipeline operator is really handy. It allows you to kind of change a bunch of operations in one sequence that’s both readable and powerful.

As data.table is working by reference, there is no assignment and it is like it is chained by default without any pipe-like operator. Pipe operators are a way to chain operation without making an assignment. data.table can chain expressions together on after another by tacking expressions one after another, forming a chain of operations, i.e., DT[ ... ][ ... ][ ... ].

Or you can also chain them vertically:

DT[ ...
   ][ ...
     ][ ...
       ]

Thus,

chicdt[,month := as.POSIXlt(date)$mon + 1][,.(pm25 = mean(pm25tmean2, na.rm = TRUE), 
                                              o3 = max(o3tmean2, na.rm = TRUE),
                                              no2 = median(no2tmean2, na.rm = TRUE)), by=month]

##     month     pm25       o3      no2
##  1:     1 17.76996 28.22222 25.35417
##  2:     2 20.37513 37.37500 26.78034
##  3:     3 17.40818 39.05000 26.76984
##  4:     4 13.85879 47.94907 25.03125
##  5:     5 14.07420 52.75000 24.22222
##  6:     6 15.86461 66.58750 25.01140
##  7:     7 16.57087 59.54167 22.38442
##  8:     8 16.93380 53.96701 22.98333
##  9:     9 15.91279 57.48864 24.47917
## 10:    10 14.23557 47.09275 24.15217
## 11:    11 15.15794 29.45833 23.56537
## 12:    12 17.52221 27.70833 24.45773

Summary and Further Reading

data.table provides speed and memory gains, and dplyr has more accessible syntax. Aggregating and joining is so much faster with data.table for large tables that I wish I found out about data.table earlier so then I could gain the intuition for writing consistent code.

data.table was born to be an efficient and comprehensive all-in-one package. It has a very fast file reader. The syntax is succint (I mean you can do a lot in a single line of code). It is currently the fastest open source data manipulation tool for data that fit in memory (see the benchmark at Rdatatable/data.table. data.table is also memory-efficient, and is written in C for the most part.

Conversely, dplyr is happy to be complemented by functions provided by other packages (The author of the package also wrote readr and tidyr for file reading and long/wide operations. dplyr allows you to munge data from several DBs using the same syntax; in this respect it has some similarity to Python’s Blaze project. Finally, dplyr’s code is made more efficient by the use of C++.

These differences are not clear-cut: no one forbids data.table from implementing an extra layer to interface seamlessly to DBMS. And maybe dplyr could leverage its code base to implement clever parallelism and gain a speed edge. I tend to use dplyr, but am happy to switch to data.table when performance and memory usage are bottlenecks.

Read these: - stack overflow: data.table vs dplyr: can one do something well the other can’t or does poorly? - Hadley Wickham’s take on dplyr vs data.table link - Hadleyverse vs data.table link - Fast data lookups in R: dplyr vs data.table link