Figuring out which packages to use is the single largest downside to learning R. There are over 10,000 packages in CRAN, knowing which one to turn to could take you to a place that’s not the most intuitive or worse straight up bad. New learners tend to use the packages written by Hadley Wickham, which are great.
dplyr
, for example, is used to manipulate data frames, handles data in an intuitive way that makes your work easier to follow, and easier to read.
One of the downsides of manipulating dataframes is that they aren’t the most nimble things to be moving around. Manipulating datasets that are much larger (>5m rows) becomes much slower, and grouping operations start to lag. Thus, the data.table
package was created by Matt Dowle and Arun Srinivasan to improve speed and memory usage for larger sets of data.
There are some minor functionality differences and major syntax differences when manipulating data with dplyr
compared to data.table
. To illustrate this we’ll examine a daily data set on air pollution and weather for the city of Chicago from 1987 to 2005.
First, let’s read in the data and create both a data.frame
and a data.table
.
#*.RDS and *.RData are file types which can store in-memory R objects on disk efficiently.
fileURL <- "https://github.com/uky994/STA-4233.-INTRODUCTION-TO-PROGRAMMING-AND-DATA-MANAGEMENT-IN-R/blob/master/chicago.rds?raw=true"
if(!file.exists("./data")){
dir.create("./data")
}
#On my machine, this code works changing the download.file as follows:
#download.file(fileURL, destfile = "./data/chicago.rds", extra='-L',mode='wb')
#Setting the method to 'curl' didn't help, so I thought to play with the mode parameter based on the advice that it is mostly good #to set it to 'wb'..
#It seems that RDS is binary file and in my case it was recognized as such by default. But it is not something to rely on, as #suggested here: http://stackoverflow.com/questions/29814405/why-do-i-need-to-use-mode-wb-with-download-file-for-this-rds-file
download.file(fileURL, destfile = "./data/chicago.rds", method = "curl", extra='-L')
#create data frame
chicago <- readRDS("./data/chicago.rds")
dim(chicago)
## [1] 6940 8
#create data table
chicdt<-as.data.table(chicago)
dim(chicdt)
## [1] 6940 8
Recall that the dplyr
package introduces some basic verbs that helped to streamline the data manipulation process. - select: return a subset of the columns of a data frame - filter: extract a subset of rows from a data frame based on logical conditions - arrange: reorder rows of a data frame - rename: rename variables in a data frame - mutate: add new variables/columns or transform existing variables - summarize: generate summary statistics of different variables in the data frame, possibly within strata
In contrast, a data.table
has a general form of data.table syntax, as shown below:
DT[i, j, by]
R: i j by
SQL: where | order by select | update group by
Those of you who have an SQL background might perhaps immediately relate to this syntax.
If you take a look at the dimensions of the data set you’ll see that there are 6,940 rows and eight columns, and here are the first couple of rows of the data of the data set and the first five columns.
head(select(chicago, 1:5))
## city tmpd dptp date pm25tmean2
## 1 chic 31.5 31.500 1987-01-01 NA
## 2 chic 33.0 29.875 1987-01-02 NA
## 3 chic 33.0 27.375 1987-01-03 NA
## 4 chic 29.0 28.625 1987-01-04 NA
## 5 chic 32.0 28.875 1987-01-05 NA
## 6 chic 40.0 35.125 1987-01-06 NA
I can do the same thing using data.table
by placing nothing in the i
th position and 1:5
in the j
th position. Now, R
prints more rows than just the first few, but data.table
provides a succint output that doesn’t overwhelm the window so the head function is convenient and unneccessary.
chicdt[,1:5]
## city tmpd dptp date pm25tmean2
## 1: chic 31.5 31.500 1987-01-01 NA
## 2: chic 33.0 29.875 1987-01-02 NA
## 3: chic 33.0 27.375 1987-01-03 NA
## 4: chic 29.0 28.625 1987-01-04 NA
## 5: chic 32.0 28.875 1987-01-05 NA
## ---
## 6936: chic 40.0 33.600 2005-12-27 23.56000
## 6937: chic 37.0 34.500 2005-12-28 17.75000
## 6938: chic 35.0 29.400 2005-12-29 7.45000
## 6939: chic 36.0 31.000 2005-12-30 15.05714
## 6940: chic 35.0 30.100 2005-12-31 15.00000
Now one of the things you can do for any data frame is to look at the names, of the variable names using the names
function. So, here I’m going to print out the first few variable names. One of the nice things you can do with the select
function in dplyr
is to actually access the columns or set of columns in the data frame using the names rather than the column indices.
Say I wanted to look at the columns starting with city
and ending with DPTP
, which is the dew point column, and I want to include all the columns in between. I can use this notation, which is “city
colon DBTP
”.
names(chicago)[1:3]
## [1] "city" "tmpd" "dptp"
head(select(chicago, city:dptp))
## city tmpd dptp
## 1 chic 31.5 31.500
## 2 chic 33.0 29.875
## 3 chic 33.0 27.375
## 4 chic 29.0 28.625
## 5 chic 32.0 28.875
## 6 chic 40.0 35.125
Which is not a notation that you would normally be able to use in other functions but you can use here in the select
function and you can see it selects all the columns between the city
column and the dew point column. It’s a nice and handy way to look at subsets of columns or data frame by just you referring to them by their names.
I can perform the same function using the data.table
. I placed 1:5
in the i
th positon of my data.table
subset syntax only to mirror the output from dplyr
.
names(chicdt)[1:3]
## [1] "city" "tmpd" "dptp"
chicdt[1:5,city:dptp]
## city tmpd dptp
## 1: chic 31.5 31.500
## 2: chic 33.0 29.875
## 3: chic 33.0 27.375
## 4: chic 29.0 28.625
## 5: chic 32.0 28.875
Similarly, you can use the minus sign to say, “I want to look at all the columns except for the columns indicated by this range.” You can use the select
function to just say minus on that city colon dew point sequence, and you’ll get all of the columns except those few columns.
head(select(chicago, -(city:dptp)))
## date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
## 1 1987-01-01 NA 34.00000 4.250000 19.98810
## 2 1987-01-02 NA NA 3.304348 23.19099
## 3 1987-01-03 NA 34.16667 3.333333 23.81548
## 4 1987-01-04 NA 47.00000 4.375000 30.43452
## 5 1987-01-05 NA NA 4.750000 30.33333
## 6 1987-01-06 NA 48.00000 5.833333 25.77233
Similar logic can be used in data.table
chicdt[1:5,-(city:dptp)]
## date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
## 1: 1987-01-01 NA 34.00000 4.250000 19.98810
## 2: 1987-01-02 NA NA 3.304348 23.19099
## 3: 1987-01-03 NA 34.16667 3.333333 23.81548
## 4: 1987-01-04 NA 47.00000 4.375000 30.43452
## 5: 1987-01-05 NA NA 4.750000 30.33333
The equivalent code in regular R, without using the dplyr package, is a little bit tricky because you have to find the index for where the city column is and you have to find the index for where the dew point column is and then you need to take the negative of those indices. It’s not particularly complicated but it’s an extra two lines of code and perhaps not as readable.
## Equivalent base R
i <- match("city", names(chicago))
j <- match("dptp", names(chicago))
head(chicago[, -(i:j)])
## date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
## 1 1987-01-01 NA 34.00000 4.250000 19.98810
## 2 1987-01-02 NA NA 3.304348 23.19099
## 3 1987-01-03 NA 34.16667 3.333333 23.81548
## 4 1987-01-04 NA 47.00000 4.375000 30.43452
## 5 1987-01-05 NA NA 4.750000 30.33333
## 6 1987-01-06 NA 48.00000 5.833333 25.77233
The filter
function in dplyr
is used subset rows based on conditions. For example, you might want to take all the rows in the Chicago data set where pm 2.5
is greater than 30. I’ve got that condition as the second argument creating a logical sequence to subsets the rows. All of the values of pm2.5
are greater than 30.
chic.f <- filter(chicago, pm25tmean2 > 30)
head(select(chic.f, 1:3, pm25tmean2), 10)
## city tmpd dptp pm25tmean2
## 1 chic 23 21.9 38.10
## 2 chic 28 25.8 33.95
## 3 chic 55 51.3 39.40
## 4 chic 59 53.7 35.40
## 5 chic 57 52.0 33.30
## 6 chic 57 56.0 32.10
## 7 chic 75 65.8 56.50
## 8 chic 61 59.0 33.80
## 9 chic 73 60.3 30.30
## 10 chic 78 67.1 41.40
Filtering is really just a game testing your logic coding. For data table the same pm25tmean2 > 30
logic can be placed in the ith position to achieve the same result.
chicdt[pm25tmean2 > 30 , 1:3]
## city tmpd dptp
## 1: chic 23 21.9
## 2: chic 28 25.8
## 3: chic 55 51.3
## 4: chic 59 53.7
## 5: chic 57 52.0
## ---
## 190: chic 8 -1.8
## 191: chic 12 7.7
## 192: chic 22 23.3
## 193: chic 41 32.6
## 194: chic 37 35.2
Of course, you don’t have to just subset rows based on values in one column. You could take multiple columns and create a more complicated logical sequence. Here I’m looking at pm2.5 greater than 30 and temperature greater than 80. When I print out the few first couple rows here, all of the temperature values are greater than 80 and all the pm2.5 values are greater than 30.
#DPLYR
chic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)
head(select(chic.f, 1:3, pm25tmean2, tmpd), 10)
## city tmpd dptp pm25tmean2
## 1 chic 81 71.2 39.6000
## 2 chic 81 70.4 31.5000
## 3 chic 82 72.2 32.3000
## 4 chic 84 72.9 43.7000
## 5 chic 85 72.6 38.8375
## 6 chic 84 72.6 38.2000
## 7 chic 82 67.4 33.0000
## 8 chic 82 63.5 42.5000
## 9 chic 81 70.4 33.1000
## 10 chic 82 66.2 38.8500
#DATA.TABLE
chicdt[pm25tmean2 > 30 & tmpd > 80,1:3]
## city tmpd dptp
## 1: chic 81 71.2
## 2: chic 81 70.4
## 3: chic 82 72.2
## 4: chic 84 72.9
## 5: chic 85 72.6
## 6: chic 84 72.6
## 7: chic 82 67.4
## 8: chic 82 63.5
## 9: chic 81 70.4
## 10: chic 82 66.2
## 11: chic 82 65.1
## 12: chic 84 68.4
## 13: chic 86 63.4
## 14: chic 82 64.6
## 15: chic 85 64.1
## 16: chic 84 67.0
## 17: chic 84 69.0
You can have an arbitrarily complex logical sequence there, and the rows subset based on that sequence in both the filter
function and using data.table
’s way. The nice thing about both methods is you can refer to the variable names directly using their names and you don’t have to kind of subset out each variable using the subset operators like $
.
The arrange
function in dplyr
is used to reorder the rows of a data frame based on the values of a column. This is a pain in base R and the arrange function makes it quite a bit simpler and easier to read.
Suppose, I want to order the rows of the chicago data according to the date
variable from lowest to highest.
chicago <- arrange(chicago, date)
head(select(chicago, date, pm25tmean2), 3)
## date pm25tmean2
## 1 1987-01-01 NA
## 2 1987-01-02 NA
## 3 1987-01-03 NA
tail(select(chicago, date, pm25tmean2), 3)
## date pm25tmean2
## 6938 2005-12-29 7.45000
## 6939 2005-12-30 15.05714
## 6940 2005-12-31 15.00000
As you can see that the first couple of rows, start in 1987 and then when I use the tail function to look at the last couple rows, you see that they all end in 2005.
We can use the R function order()
to accomplish this in data.table
.
chicdt[order(date),3:4]
## dptp date
## 1: 31.500 1987-01-01
## 2: 29.875 1987-01-02
## 3: 27.375 1987-01-03
## 4: 28.625 1987-01-04
## 5: 28.875 1987-01-05
## ---
## 6936: 33.600 2005-12-27
## 6937: 34.500 2005-12-28
## 6938: 29.400 2005-12-29
## 6939: 31.000 2005-12-30
## 6940: 30.100 2005-12-31
order()
is internally optimised! In addition, order(...)
within the frame of a data.table
uses data.table
’s internal fast radix order forder()
. This sort provided such a compelling improvement over R’s base::order
that the R project adopted the data.table
algorithm as its default sort in 2016 for R 3.3.0, see ?sort
and the R Release NEWS.
Now, we might want to arrange the rows in descending order. In this case, the desc
function can be used. We can use “-” on a character
columns within the frame of a data.table
to sort in decreasing order.
#DPLYR
chicago <- arrange(chicago, desc(date))
head(select(chicago, date, pm25tmean2), 3)
## date pm25tmean2
## 1 2005-12-31 15.00000
## 2 2005-12-30 15.05714
## 3 2005-12-29 7.45000
tail(select(chicago, date, pm25tmean2), 3)
## date pm25tmean2
## 6938 1987-01-03 NA
## 6939 1987-01-02 NA
## 6940 1987-01-01 NA
#DATA.TABLE
chicdt[order(-date),3:4]
## dptp date
## 1: 30.100 2005-12-31
## 2: 31.000 2005-12-30
## 3: 29.400 2005-12-29
## 4: 34.500 2005-12-28
## 5: 33.600 2005-12-27
## ---
## 6936: 28.875 1987-01-05
## 6937: 28.625 1987-01-04
## 6938: 27.375 1987-01-03
## 6939: 29.875 1987-01-02
## 6940: 31.500 1987-01-01
Now when the rows are printed, they start with all the values of 2005 and go backwards ending in 1987. There’s not a big adjustment to reverse the order of all the rows in the data frame according to the date.
The rename
function is used to …, wait for it, … rename a variable in R. Without a function like this, it is surprisingly annoying to rename a variable in R.
So suppose I think the pm2.5 variable, called pm25tmean2
is a bit of a mouthful, and I want to simplify it by calling it pm25
. Similarly, the dew point variable is called dptp
, which is not particularly intuitive so I might want to rename that variable too. So here, I just call the rename
function in dplyr
and pass the chicago
data frame.
head(chicago[, 1:5], 3)
## city tmpd dptp date pm25tmean2
## 1 chic 35 30.1 2005-12-31 15.00000
## 2 chic 36 31.0 2005-12-30 15.05714
## 3 chic 35 29.4 2005-12-29 7.45000
chicago <- rename(chicago, dewpoint = dptp, pm25 = pm25tmean2)
head(chicago[, 1:5], 3)
## city tmpd dewpoint date pm25
## 1 chic 35 30.1 2005-12-31 15.00000
## 2 chic 36 31.0 2005-12-30 15.05714
## 3 chic 35 29.4 2005-12-29 7.45000
In data.table
, we wrap the variables (column names) within list()
, which ensures that a data.table
is returned. In case of a single column name, not wrapping with list()
returns a vector instead.
data.table
also allows wrapping columns with .()
which is an alias to list()
. Since .() is the same as list()
, we can name columns as we would while creating a list.
chicdt[, .(city, tmpd,dewpoint = dptp,date, pm25 = pm25tmean2)]
## city tmpd dewpoint date pm25
## 1: chic 31.5 31.500 1987-01-01 NA
## 2: chic 33.0 29.875 1987-01-02 NA
## 3: chic 33.0 27.375 1987-01-03 NA
## 4: chic 29.0 28.625 1987-01-04 NA
## 5: chic 32.0 28.875 1987-01-05 NA
## ---
## 6936: chic 40.0 33.600 2005-12-27 23.56000
## 6937: chic 37.0 34.500 2005-12-28 17.75000
## 6938: chic 35.0 29.400 2005-12-29 7.45000
## 6939: chic 36.0 31.000 2005-12-30 15.05714
## 6940: chic 35.0 30.100 2005-12-31 15.00000
Be careful here. Just renaming the columns selects only those columns. Thus, I have to include all of the columns I want within my .()
list. Not really as convenient as the rename
function.
Nevertheless, I supply the new name and set it equal to the old name. Dew-point equals dptp and pm25 equals pm25tmean2 which renames those two columns. In the data frame all of the other columns are untouched. When the first couple rows of this data are printed you can see that the dewpoint variable and the pm25 variable are properly named.
The mutate
function transforms existing variables or to create new variables. Now suppose I want to create a new variable called pm25detrend which is the pm25 variable centered around the mean. I’ll create this new variable called pm25detrend by subtracting the mean of pm25 from the pm25 variable.
chicago <- mutate(chicago, pm25detrend=pm25-mean(pm25, na.rm=TRUE))
head(select(chicago, pm25, pm25detrend))
## pm25 pm25detrend
## 1 15.00000 -1.230958
## 2 15.05714 -1.173815
## 3 7.45000 -8.780958
## 4 17.75000 1.519042
## 5 23.56000 7.329042
## 6 8.40000 -7.830958
data.table
adds new columns very fast and efficiently. You use this the :=
command to add a variable to the data table.
chicdt[,pm25detrend:=pm25tmean2-mean(pm25tmean2, na.rm=TRUE)]
chicdt
## city tmpd dptp date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
## 1: chic 31.5 31.500 1987-01-01 NA 34.00000 4.250000 19.98810
## 2: chic 33.0 29.875 1987-01-02 NA NA 3.304348 23.19099
## 3: chic 33.0 27.375 1987-01-03 NA 34.16667 3.333333 23.81548
## 4: chic 29.0 28.625 1987-01-04 NA 47.00000 4.375000 30.43452
## 5: chic 32.0 28.875 1987-01-05 NA NA 4.750000 30.33333
## ---
## 6936: chic 40.0 33.600 2005-12-27 23.56000 27.00000 4.468750 23.50000
## 6937: chic 37.0 34.500 2005-12-28 17.75000 27.50000 3.260417 19.28563
## 6938: chic 35.0 29.400 2005-12-29 7.45000 23.50000 6.794837 19.97222
## 6939: chic 36.0 31.000 2005-12-30 15.05714 19.20000 3.034420 22.80556
## 6940: chic 35.0 30.100 2005-12-31 15.00000 23.50000 2.531250 13.25000
## pm25detrend
## 1: NA
## 2: NA
## 3: NA
## 4: NA
## 5: NA
## ---
## 6936: 7.329042
## 6937: 1.519042
## 6938: -8.780958
## 6939: -1.173815
## 6940: -1.230958
The nice thing is when you’re adding a new variable to a data frame, R will copy over the entire data frame and add a new variable to it, so you get two copies of the the data frame in memory. When dealing with big data sets, this is going to cause lots of memory problems, which you don’t have with data table because a new copy isn’t being created.
You’ll notice that I had to use the old pm25tmean2
variable name instead of pm25
. As data.table
is working by reference, saving the names isn’t hard, but it isn’t trivial either.
Finally the group_by
function allows you to split a data frame according to categorical variables. Let’s create a temperature category variable which will indicate whether a given day was hot or cold, depending on whether the temperature was over 80 degrees or not. I create this factor variable called tempcat, and then I use the group by function to create a new kind of data structure. Based on the original data frame and this tempcat variable. And so I call that new data structure hot/cold, and you can print it out here.
chicago <- mutate(chicago, tempcat = factor(1 * (tmpd > 80), labels = c("cold", "hot")))
hotcold <- group_by(chicago, tempcat)
## Warning: Factor `tempcat` contains implicit NA, consider using
## `forcats::fct_explicit_na`
summarize(hotcold, pm25 = mean(pm25, na.rm = TRUE),o3 = max(o3tmean2, na.rm = TRUE),
no2 = median(no2tmean2, na.rm = TRUE))
## # A tibble: 3 x 4
## tempcat pm25 o3 no2
## <fct> <dbl> <dbl> <dbl>
## 1 cold 16.0 66.6 24.5
## 2 hot 26.5 63.0 24.9
## 3 <NA> 47.7 9.42 37.4
In data.table
, we’ll combine the i
and j
statements together with by
to perform operations by group.
chicdt[,tempcat:=factor(1 * (tmpd > 80), labels = c("cold", "hot"))]
chicdt[,.(pm25=mean(pm25tmean2, na.rm = TRUE),o3=max(o3tmean2, na.rm = TRUE),
no2 = median(no2tmean2, na.rm = TRUE)),by=tempcat]
## tempcat pm25 o3 no2
## 1: cold 15.97807 66.587500 24.54924
## 2: hot 26.48118 62.969656 24.93870
## 3: <NA> 47.73750 9.416667 37.44444
We pass our aggregations in the j
arguments and use the tempcat
column in the by
section to tell data.table
how to calculate the requested statistics.
You can now see the different syntax in practice where I use the summarize function to aggregate by the hot/cold group which has been split based on the temperature category variable in contrast to the statements in the separate parts of the data.table
syntax. In either case, I calculate the mean pm2.5 for both hot and cold days, the maximum ozone for hot and cold days, and the median nitrogen dioxide or no2 for both hot and cold days. It prints out a little data frame that has the levels of the categorical variable which are cold, hot, and NA so there are some missing values. Both give you the summary statistics for pm25 and ozone and no2 in each of those categories. Now since there are missing data in the pm25 variable, I need to specify the na.rm equals true in order to get some the means of those values by ignoring the missing values. You can see that that pm2.5 is tends to be lower when it’s cold and higher when it’s hot.
These are the summary statistics for each of the pollutants in these temperature categories. I could also categorize the data set on other variables. For example I might want to do a summary for each year in the data set, I can use the mutate function to create a year variable.
chicago <- mutate(chicago,year = as.POSIXlt(date)$year + 1900)
years <- group_by(chicago, year)
summarize(years, pm25 = mean(pm25, na.rm = TRUE),o3 = max(o3tmean2, na.rm = TRUE),
no2 = median(no2tmean2, na.rm = TRUE))
## # A tibble: 19 x 4
## year pm25 o3 no2
## <dbl> <dbl> <dbl> <dbl>
## 1 1987 NaN 63.0 23.5
## 2 1988 NaN 61.7 24.5
## 3 1989 NaN 59.7 26.1
## 4 1990 NaN 52.2 22.6
## 5 1991 NaN 63.1 21.4
## 6 1992 NaN 50.8 24.8
## 7 1993 NaN 44.3 25.8
## 8 1994 NaN 52.2 28.5
## 9 1995 NaN 66.6 27.3
## 10 1996 NaN 58.4 26.4
## 11 1997 NaN 56.5 25.5
## 12 1998 18.3 50.7 24.6
## 13 1999 18.5 57.5 24.7
## 14 2000 16.9 55.8 23.5
## 15 2001 16.9 51.8 25.1
## 16 2002 15.3 54.9 22.7
## 17 2003 15.2 56.2 24.6
## 18 2004 14.6 44.5 23.4
## 19 2005 16.2 58.8 22.6
I can use the same as.POSIXlt
function in data.table to create the year variable there
chicdt[,year := as.POSIXlt(date)$year + 1900]
chicdt[,.(pm25=mean(pm25tmean2, na.rm = TRUE),o3=max(o3tmean2, na.rm = TRUE),
no2 = median(no2tmean2, na.rm = TRUE)),by=year]
## year pm25 o3 no2
## 1: 1987 NaN 62.96966 23.49369
## 2: 1988 NaN 61.67708 24.52296
## 3: 1989 NaN 59.72727 26.14062
## 4: 1990 NaN 52.22917 22.59583
## 5: 1991 NaN 63.10417 21.38194
## 6: 1992 NaN 50.82870 24.78921
## 7: 1993 NaN 44.30093 25.76993
## 8: 1994 NaN 52.17844 28.47500
## 9: 1995 NaN 66.58750 27.26042
## 10: 1996 NaN 58.39583 26.38715
## 11: 1997 NaN 56.54167 25.48143
## 12: 1998 18.26467 50.66250 24.58649
## 13: 1999 18.49646 57.48864 24.66667
## 14: 2000 16.93806 55.76103 23.46082
## 15: 2001 16.92632 51.81984 25.06522
## 16: 2002 15.27335 54.88043 22.73750
## 17: 2003 15.23183 56.16608 24.62500
## 18: 2004 14.62864 44.48240 23.39130
## 19: 2005 16.18556 58.84126 22.62387
The dplyr
package implements a special operator that allows you to kind of chain different operations together in a kind of readable way. It’s indicated by a percent symbol and then the greater than symbol and then the percent symbol (%>%
). We call it the pipeline operator and you can take a data set and feed through a pipeline of operations to create a new data set.
chicago %>%
mutate(month = as.POSIXlt(date)$mon + 1) %>%
group_by(month) %>%
summarize(pm25 = mean(pm25, na.rm = TRUE), o3 = max(o3tmean2, na.rm = TRUE),
no2 = median(no2tmean2, na.rm = TRUE))
## # A tibble: 12 x 4
## month pm25 o3 no2
## <dbl> <dbl> <dbl> <dbl>
## 1 1 17.8 28.2 25.4
## 2 2 20.4 37.4 26.8
## 3 3 17.4 39.0 26.8
## 4 4 13.9 47.9 25.0
## 5 5 14.1 52.8 24.2
## 6 6 15.9 66.6 25.0
## 7 7 16.6 59.5 22.4
## 8 8 16.9 54.0 23.0
## 9 9 15.9 57.5 24.5
## 10 10 14.2 47.1 24.2
## 11 11 15.2 29.5 23.6
## 12 12 17.5 27.7 24.5
Here I’ve got the Chicago
data set and I want to mutate
to create a month
variable because I want to create a summary of each of the pollutant variables by month
. I create the month
variable with mutate
and then I want to take the output of mutate
group by the month
variable. Then I take the output of group by
and then run it through summarize
. Notice each call to mutate
, group by
and summarize
, is separated by the pipeline operator (%>%
). I don’t have the specify the data frame since it’s the first argument. The output of this long pipeline of operations is a data frame that shows you the summary statistics of each of the three pollutant variables by each of the 12 months of the year.
The pipeline operator is really handy. It allows you to kind of change a bunch of operations in one sequence that’s both readable and powerful.
As data.table
is working by reference, there is no assignment and it is like it is chained by default without any pipe-like operator. Pipe operators are a way to chain operation without making an assignment. data.table
can chain expressions together on after another by tacking expressions one after another, forming a chain of operations, i.e., DT[ ... ][ ... ][ ... ]
.
Or you can also chain them vertically:
DT[ ...
][ ...
][ ...
]
Thus,
chicdt[,month := as.POSIXlt(date)$mon + 1][,.(pm25 = mean(pm25tmean2, na.rm = TRUE),
o3 = max(o3tmean2, na.rm = TRUE),
no2 = median(no2tmean2, na.rm = TRUE)), by=month]
## month pm25 o3 no2
## 1: 1 17.76996 28.22222 25.35417
## 2: 2 20.37513 37.37500 26.78034
## 3: 3 17.40818 39.05000 26.76984
## 4: 4 13.85879 47.94907 25.03125
## 5: 5 14.07420 52.75000 24.22222
## 6: 6 15.86461 66.58750 25.01140
## 7: 7 16.57087 59.54167 22.38442
## 8: 8 16.93380 53.96701 22.98333
## 9: 9 15.91279 57.48864 24.47917
## 10: 10 14.23557 47.09275 24.15217
## 11: 11 15.15794 29.45833 23.56537
## 12: 12 17.52221 27.70833 24.45773
data.table
provides speed and memory gains, and dplyr
has more accessible syntax. Aggregating and joining is so much faster with data.table for large tables that I wish I found out about data.table
earlier so then I could gain the intuition for writing consistent code.
data.table
was born to be an efficient and comprehensive all-in-one package. It has a very fast file reader. The syntax is succint (I mean you can do a lot in a single line of code). It is currently the fastest open source data manipulation tool for data that fit in memory (see the benchmark at Rdatatable/data.table. data.table
is also memory-efficient, and is written in C for the most part.
Conversely, dplyr
is happy to be complemented by functions provided by other packages (The author of the package also wrote readr
and tidyr
for file reading and long/wide operations. dplyr
allows you to munge data from several DBs using the same syntax; in this respect it has some similarity to Python’s Blaze project. Finally, dplyr
’s code is made more efficient by the use of C++.
These differences are not clear-cut: no one forbids data.table
from implementing an extra layer to interface seamlessly to DBMS. And maybe dplyr
could leverage its code base to implement clever parallelism and gain a speed edge. I tend to use dplyr, but am happy to switch to data.table
when performance and memory usage are bottlenecks.
Read these: - stack overflow: data.table vs dplyr: can one do something well the other can’t or does poorly? - Hadley Wickham’s take on dplyr vs data.table link - Hadleyverse vs data.table link - Fast data lookups in R: dplyr vs data.table link