Package: data.table

Package Overview

Background

data.table is one of the 15,000 add-on R packages that is widely used in the field of data science. It provides an enhanced version of data.frame that allows us to do fast data manipulations with terse syntax. If you know R language and haven’t learned about data.table package yet, then this documentation will help you get started right away!

Why do we prefer Data.table?

Concise: Complicated operations can be carried out without having to write verbose code.
Memory Efficient: It does a shallow copy of the data frame intead of physically copied the entire data frame in system’s memory.
Fast: Thanks to its high memory efficiency, data.table provides blazing fast speed when it comes to loading and manipulating large datasets.

Syntax

The general form of data.table programming is DT[i, j, by], which can be interpreted as taking a data.table called DT, subseting rows using i, then manipulating columns using j while grouping by by.

Below are a few special but useful features of data.table you need to know before start:

data.table doesn’t really depend on functions to support fast data manipulation. But it does provide some useful functions such as except fread() and set().
A data.table never sets, needs or uses row names.
A data.table never automatically converts character columns to factors, and read Dates and Datetimes as character by default.
The data.table package has .() as an alias for list() (we will use this a lot when passing values to arguments)
.N returns the the number of rows(we will see how it works in examples).

Version history: (since 2019)

Version	Release Date	New features ( not inclusive)
data.table v1.12.9	In development	%chin% subset on character vectors ( speeding up significantly)
data.table v1.12.8	09.Dec.2019	DT [, {. . . ; .( A,B ) } ] (i.e. when .() is the final item of a multi-statement {…}) now auto-names the columns A and B (just like DT[, .(A,B)]) rather than V1 and V2
data.table v1.12.4	03.Oct.2019	rleid ( ) functions now support long vectors (length > 2 billion).
data.table v1.12.2	07.Apr.2019	rbind and rbindlist now retain the position of duplicate column names rather than grouping them together
data.table v1.12.0	13.Jan.2019	fread() can now read a remote compressed file in one step; for example, fread (“https://domain.org/file.csv.bz2”).

Tips before diving deeper:

We use head() in the following sections to show the output of our exmaples for presentation conciseness. It is not part of the example itself.
Our presentation will cover examples of usages of 9 “functions”,similar packages and reflections, and we design this order for logic flow from basic to advanced.

Examples of Usage

The dataset used for introducing this package was uploaded on Kaggle by an individual who has extracted top 1000 Apple iOS apps during January to December 2016 time frame, which resulted in a large dataset of over 7000 observations. We sample 200 observations and extracted 8 variables:

Variable	Description
Rating	Average User Rating Value (out of scale 1 to 5)
Size	Size of an app (in Bytes)
Rating_count	User rating counts
Device_num	Number of supporting devices
Screenshots	Number of screenshots shown for display
Languages_num	Number of supported languages
Price	Price amount of each mobile app in US dollars
Genre	Primary Genre
Age	Content Rating

Creating data.table()

There are three ways to create a data.table:

convert data.frame to data.table using as.data.table()

app <- read.csv("appstore_sample.csv",header= TRUE)
app<- as.data.table(app)
head(app,5)

##         size price rating_count rating age    genre device_num screenshots
## 1:   8593408  0.99            1    3.0  4+ Entertai         38           2
## 2:  47495168  0.99            1    4.0  9+ Entertai         37           5
## 3:  34856960  0.99            1    5.0  9+    Games         38           4
## 4: 199166976  2.99            1    5.0  4+    Games         40           5
## 5:  24403968  0.99            2    3.5  4+ Educatio         26           4
##    languages_num
## 1:             1
## 2:             1
## 3:             2
## 4:             1
## 5:            16

import local file as data.table using fread().

app <- fread("appstore_sample.csv")
head(app,5)

##         size price rating_count rating age    genre device_num screenshots
## 1:   8593408  0.99            1    3.0  4+ Entertai         38           2
## 2:  47495168  0.99            1    4.0  9+ Entertai         37           5
## 3:  34856960  0.99            1    5.0  9+    Games         38           4
## 4: 199166976  2.99            1    5.0  4+    Games         40           5
## 5:  24403968  0.99            2    3.5  4+ Educatio         26           4
##    languages_num
## 1:             1
## 2:             1
## 3:             2
## 4:             1
## 5:            16

Create a data.table from scratch using data.table()

DT = data.table(
  ID = c("b","b","b","a","a","c"),
  a = 1:6,
  b = 7:12,
  c = 13:18
)
DT

##    ID a  b  c
## 1:  b 1  7 13
## 2:  b 2  8 14
## 3:  b 3  9 15
## 4:  a 4 10 16
## 5:  a 5 11 17
## 6:  c 6 12 18

We can also check if an object’s type is data.table using is.data.table(). function.

is.data.table(app)

## [1] TRUE

Subsetting rows using i

(a) subsetting rows with row index

Subsetting data with row index is straightward. You can pass a vector into the square bracket in the i. Adding a comma after i or not doesn’t affect the output.

head(app[2:5,],5) # with comma

##         size price rating_count rating age    genre device_num screenshots
## 1:  47495168  0.99            1    4.0  9+ Entertai         37           5
## 2:  34856960  0.99            1    5.0  9+    Games         38           4
## 3: 199166976  2.99            1    5.0  4+    Games         40           5
## 4:  24403968  0.99            2    3.5  4+ Educatio         26           4
##    languages_num
## 1:             1
## 2:             2
## 3:             1
## 4:            16

head(app[2:5],5) # without comma

##         size price rating_count rating age    genre device_num screenshots
## 1:  47495168  0.99            1    4.0  9+ Entertai         37           5
## 2:  34856960  0.99            1    5.0  9+    Games         38           4
## 3: 199166976  2.99            1    5.0  4+    Games         40           5
## 4:  24403968  0.99            2    3.5  4+ Educatio         26           4
##    languages_num
## 1:             1
## 2:             2
## 3:             1
## 4:            16

(b) subsetting rows with conditions

We can also subsetting data.table using conditions. In the first example, we want to find the apps with price 0.99. In the second example, we want to find apps which have price of either 0.99 or 2.99. Note: %in% is used to identify if an element can be found in a vector.

price99 <-app[price==0.99]
head(price99,5)

##        size price rating_count rating age    genre device_num screenshots
## 1:  8593408  0.99            1    3.0  4+ Entertai         38           2
## 2: 47495168  0.99            1    4.0  9+ Entertai         37           5
## 3: 34856960  0.99            1    5.0  9+    Games         38           4
## 4: 24403968  0.99            2    3.5  4+ Educatio         26           4
## 5: 11927552  0.99            2    1.0  4+    Games         43           2
##    languages_num
## 1:             1
## 2:             1
## 3:             2
## 4:            16
## 5:             1

price_of_two = app[price %in% c(0.99,2.99)]
head(price_of_two,5)

##         size price rating_count rating age    genre device_num screenshots
## 1:   8593408  0.99            1    3.0  4+ Entertai         38           2
## 2:  47495168  0.99            1    4.0  9+ Entertai         37           5
## 3:  34856960  0.99            1    5.0  9+    Games         38           4
## 4: 199166976  2.99            1    5.0  4+    Games         40           5
## 5:  24403968  0.99            2    3.5  4+ Educatio         26           4
##    languages_num
## 1:             1
## 2:             1
## 3:             2
## 4:             1
## 5:            16

Subsetting by column & Data Manipulation on columns in j

The general format to subset data by columns is to put whatever variable names that you want to select in j after the comma after i. Such as app[,variable nanames].

(a) select only one column

Suppose we want to know what genre do the sampled apps belong to:

genres <-app[,genre]
head(genres,5)

## [1] "Entertai" "Entertai" "Games"    "Games"    "Educatio"

(b) select multiple columns

You can select multiple columns from a data.table by passing column names in .() or list(). Two ways should give the same output.

# use list()
lists <-app[, list(price,genre)] 
head(lists,5)

##    price    genre
## 1:  0.99 Entertai
## 2:  0.99 Entertai
## 3:  0.99    Games
## 4:  2.99    Games
## 5:  0.99 Educatio

#use .()
parent<-app[,.(price,genre)]  
head(parent,5)

##    price    genre
## 1:  0.99 Entertai
## 2:  0.99 Entertai
## 3:  0.99    Games
## 4:  2.99    Games
## 5:  0.99 Educatio

(c) Data manipulation on j

All the aggregationa and data manipulation operations are carried out in j. We can do operations on one or multiple columns at the same time and name the aggregated value.

Example 1: aggregate on one column

Suppose we want to calculae the mean price of all apps in the sample.

app[,median(price)]

## [1] 1.99

Example 2: aggregate on multiple columns

Suppose we want to calculate the median price and mean rating for all apps in the sample. This can be achieved by putting two functions in the .() and passing them together to j.

app[,.(median(price),mean(rating))]

##      V1    V2
## 1: 1.99 4.115

Example 3: name the aggregated value

It may be confusing to readers if we don’t name the aggregated value. THe folowing command outputs the same values as above, but the outputs are assigned with new names.

app[,.(median_price = median(price),mean_rating = mean(rating))]

##    median_price mean_rating
## 1:         1.99       4.115

Data Manipulation by group

Now, let us look at the last syntax, by. The purpose for it is similar to the groupbyfunction in dyplyr package. For every code line below, we want the operations to by carried out separately for each group as specified.

Example 1: Group by one variable

Suppose we want to find the the mean price for apps in each genre:

app[,.(mean_price = mean(price)), by = genre]

##       genre mean_price
## 1: Entertai   1.836154
## 2:    Games   2.631892
## 3: Educatio   2.951538

Example 2: Group by multiple variables

Suppose we want to find the mean price of apps in each genre with different age range.

app[,(mean_price = mean(price)), by = .(genre,age)]

##       genre age       V1
## 1: Entertai  4+ 2.066923
## 2: Entertai  9+ 0.990000
## 3:    Games  9+ 2.418571
## 4:    Games  4+ 2.762152
## 5: Educatio  4+ 2.951538
## 6: Entertai 12+ 1.990000
## 7:    Games 12+ 2.623333
## 8: Entertai 17+ 1.990000
## 9:    Games 17+ 1.990000

Example 3: Aggregate by group on multiple columns in a subset of data

Suppose we want to find the mean price of apps with more than 1 language in each genre of different age range.

app[languages_num > 1,.(mean_price = mean(price)), by = .(genre,age)]

##       genre age mean_price
## 1:    Games  9+   2.451538
## 2: Educatio  4+   3.240000
## 3:    Games  4+   2.556667
## 4:    Games 12+   3.190000
## 5: Entertai  4+   2.490000
## 6:    Games 17+   2.990000
## 7: Entertai 17+   2.990000
## 8: Entertai 12+   1.990000

Adding/Updating Columns By Reference

:= is used in the j operator to assign values to columns by reference which is only usable for a data.table object. Remember the general form of a data table: DT[i, j, by]. Now let’s update the our dataset.

(a) Update a column

Let’s try update 1 column, for example, the price of the Apps. The price before update:

head(app[,price],5)

## [1] 0.99 0.99 0.99 2.99 0.99

Now, update the existing price column with the rounded price:

app[ , price:= round(price)]

The price column after update:

head(app[,price],5)

## [1] 1 1 1 3 1

We can also update multiple columns at the same time. For example, let’s remove the ‘+’ sign in the age threshold column and change the column of support language number into the one that indicate whether the app supports more than 10 languages.

The original columns:

head(app[ , c('age','languages_num')],5)

##    age languages_num
## 1:  4+             1
## 2:  9+             1
## 3:  9+             2
## 4:  4+             1
## 5:  4+            16

Columns “age” and “languages_num” are updated by what is after :=:

#way 1
app[ , c('age','languages_num'):=list(substring(age,1,1), languages_num>10)]

#way 2: 
app[,':='(age=substring(age,1,1), languages_num=languages_num>10)]

Now it becomes:

head(app[ , c('age','languages_num')],5)

##    age languages_num
## 1:   4         FALSE
## 2:   9         FALSE
## 3:   9         FALSE
## 4:   4         FALSE
## 5:   4         FALSE

(b) Add a column

Add a column that gives the difference between the price of the app and the mean price:

app[ , price.diff:= price-mean(price),] 
head(app[, price.diff],5)

## [1] -1.58 -1.58 -1.58  0.42 -1.58

(c) Remove a column

To remove columns, we can simply use := NULL:

#remove age column
app[ , age:= NULL]
head(app,5)

##         size price rating_count rating    genre device_num screenshots
## 1:   8593408     1            1    3.0 Entertai         38           2
## 2:  47495168     1            1    4.0 Entertai         37           5
## 3:  34856960     1            1    5.0    Games         38           4
## 4: 199166976     3            1    5.0    Games         40           5
## 5:  24403968     1            2    3.5 Educatio         26           4
##    languages_num price.diff
## 1:         FALSE      -1.58
## 2:         FALSE      -1.58
## 3:         FALSE      -1.58
## 4:         FALSE       0.42
## 5:         FALSE      -1.58

#remove both language and size columns
app[ , c('size','languages_num'):= NULL]
head(app,5)

##    price rating_count rating    genre device_num screenshots price.diff
## 1:     1            1    3.0 Entertai         38           2      -1.58
## 2:     1            1    4.0 Entertai         37           5      -1.58
## 3:     1            1    5.0    Games         38           4      -1.58
## 4:     3            1    5.0    Games         40           5       0.42
## 5:     1            2    3.5 Educatio         26           4      -1.58

Chaining

A good way to avoid intermediate assignment to a temporary variable altogether is by chaining expressions.

For example, to get the app genre with average rating greater than 4, originally we need to:

calculate the average rating by genre:

genre.avg.rating <- app[,.(avg.rating=mean(rating)),by=genre]

Select the genre of which the average rating is above 4:

genre.avg.rating[avg.rating>4]

##    genre avg.rating
## 1: Games   4.263514

However, we simply can do it through chaining, which is similar to the piping function, and we get the same result:

app[,.(avg.rating=mean(rating)),by=genre][avg.rating>4]

##    genre avg.rating
## 1: Games   4.263514

Read & Write Files

Although data.table is not a function-based package, it does provide some useful functions such as fread() and fwrite() for importing and exporting data.

(a) Importing local & oneline file

fread() can automatically identify the structure of the importing dataset without having to specify features such as column seperator, column class, and header.It can import both local files and online files at ease by passing the local path or URL into the function.

#Local File 
read_local <- fread('appstore_sample.csv')
head(read_local,5)

##         size price rating_count rating age    genre device_num screenshots
## 1:   8593408  0.99            1    3.0  4+ Entertai         38           2
## 2:  47495168  0.99            1    4.0  9+ Entertai         37           5
## 3:  34856960  0.99            1    5.0  9+    Games         38           4
## 4: 199166976  2.99            1    5.0  4+    Games         40           5
## 5:  24403968  0.99            2    3.5  4+ Educatio         26           4
##    languages_num
## 1:             1
## 2:             1
## 3:             2
## 4:             1
## 5:            16

#Online File
read_online <- fread('http://www.stats.ox.ac.uk/pub/datasets/csb/ch11b.dat')
head(read_online,5)

##    V1  V2   V3    V4 V5
## 1:  1 307  930 36.58  0
## 2:  2 307  940 36.73  0
## 3:  3 307  950 36.93  0
## 4:  4 307 1000 37.15  0
## 5:  5 307 1010 37.23  0

(b) Subsetting when importing

Sometimes the dataset we want to import doesn’t have reguler stucture. For example, it may have paragraphs of data description at the start or the end of the file. We can then subset the data while importing by using nrow argument to specify the maximum number of rows to read, or use skip argument to specify the number rows to avoid from reading, starting on the first line.

#read the first 3 rows 
read_3row <- fread('appstore_sample.csv',nrows = 3)
head(read_3row)

##        size price rating_count rating age    genre device_num screenshots
## 1:  8593408  0.99            1      3  4+ Entertai         38           2
## 2: 47495168  0.99            1      4  9+ Entertai         37           5
## 3: 34856960  0.99            1      5  9+    Games         38           4
##    languages_num
## 1:             1
## 2:             1
## 3:             2

#skip the first three rows
#column names will also be removed! Be careful!
skip_3rows <- fread('appstore_sample.csv', skip = 3)
head(skip_3rows,4)

##           V1   V2 V3  V4 V5       V6 V7 V8 V9
## 1:  34856960 0.99  1 5.0 9+    Games 38  4  2
## 2: 199166976 2.99  1 5.0 4+    Games 40  5  1
## 3:  24403968 0.99  2 3.5 4+ Educatio 26  4 16
## 4:  11927552 0.99  2 1.0 4+    Games 43  2  1

Some times we may need only a few columns from a large dataset, then we can use select or drop arguments to specify the columns we want to read or not to read by passing a vector of column namesor a vector of column index after select and drop.

#import only size and price column by passing column names
select_byname <- fread('appstore_sample.csv', select = c('size','price'))
head(select_byname,5)

##         size price
## 1:   8593408  0.99
## 2:  47495168  0.99
## 3:  34856960  0.99
## 4: 199166976  2.99
## 5:  24403968  0.99

#import only size and price column by passing column index
select_byind <- fread('appstore_sample.csv', select = c(1,2))
head(select_byind,5)

##         size price
## 1:   8593408  0.99
## 2:  47495168  0.99
## 3:  34856960  0.99
## 4: 199166976  2.99
## 5:  24403968  0.99

#drop price and size column by passing column names
drop_byname <- fread('appstore_sample.csv', drop = c('size','price'))
head(drop_byname,5)

##    rating_count rating age    genre device_num screenshots languages_num
## 1:            1    3.0  4+ Entertai         38           2             1
## 2:            1    4.0  9+ Entertai         37           5             1
## 3:            1    5.0  9+    Games         38           4             2
## 4:            1    5.0  4+    Games         40           5             1
## 5:            2    3.5  4+ Educatio         26           4            16

(c) Set column class type when importing

fread() does not automatically import character variables as factor type as many importing functions do. But you can always specify the column type by (1) passing an unnamed vector of types corresponding to the columns in the file, or (2) a named list specifying types for a subset of columns by name to the argument colClasses. The second method is more useful when you have multiple columns to set types.

#method (1): import age as factor and screenshots as numeric
col_type <- fread('appstore_sample.csv', colClasses = c(age = 'factor', screenshots = 'numeric'))
print(paste0('type of age: ', class(col_type$age), ",  ", "type of screenshots: ", class(col_type$screenshots)))

## [1] "type of age: factor,  type of screenshots: numeric"

#method (2): import the 7-9th columns as numeric, and the age and genre column as factor using method
col_type2 <- fread('appstore_sample.csv', colClasses = list(numeric = 7:9, factor=c('age','genre')))
str(col_type2)

## Classes 'data.table' and 'data.frame':   200 obs. of  9 variables:
##  $ size         :integer64 8593408 47495168 34856960 199166976 24403968 11927552 5627904 102250496 ... 
##  $ price        : num  0.99 0.99 0.99 2.99 0.99 0.99 4.99 3.99 0.99 2.99 ...
##  $ rating_count : int  1 1 1 1 2 2 3 3 5 5 ...
##  $ rating       : num  3 4 5 5 3.5 1 2 4.5 5 4.5 ...
##  $ age          : Factor w/ 4 levels "12+","17+","4+",..: 3 4 4 3 3 3 3 3 4 3 ...
##  $ genre        : Factor w/ 3 levels "Educatio","Entertai",..: 2 2 3 3 1 3 1 3 2 3 ...
##  $ device_num   : num  38 37 38 40 26 43 43 38 37 38 ...
##  $ screenshots  : num  2 5 4 5 4 2 0 5 0 5 ...
##  $ languages_num: num  1 1 2 1 16 1 1 13 1 2 ...
##  - attr(*, ".internal.selfref")=<externalptr>

(d) fwrite() for exporting files

We can export data.table as an csv file using fwrite() function by passing the data.table and file name in double quotes to the function.

#subset the first 5 rows of app data.table and write it to a csv file called small_app.csv
fwrite(head(app,5), 'small_app.csv')
fread('small_app.csv')

##         size price rating_count rating age    genre device_num screenshots
## 1:   8593408  0.99            1    3.0  4+ Entertai         38           2
## 2:  47495168  0.99            1    4.0  9+ Entertai         37           5
## 3:  34856960  0.99            1    5.0  9+    Games         38           4
## 4: 199166976  2.99            1    5.0  4+    Games         40           5
## 5:  24403968  0.99            2    3.5  4+ Educatio         26           4
##    languages_num
## 1:             1
## 2:             1
## 3:             2
## 4:             1
## 5:            16

.SD & .SDcols

.SD is an special operater in data.table package, which stands for subset of the data. .SD is itself a data.table but with the caveat that not all data.table function works on it. It is usually used to apply the same operation to multiple columns. It can also be used with by to perform the same operations on each sub-data.table before the re-assembled result is returned to us.

(a) Perform the same operations on multiple columns with lapply() and .SD

Suppose you want to know the mean for all columns, it would be inefficient to calculate mean one column after the other. But using lapply and .SD can speed this process up.

multiple_mean <- app[,lapply(.SD, mean, na.rm = T)]
head(multiple_mean,5)

##         size price rating_count rating age genre device_num screenshots
## 1: 224292462  2.57      6886.88  4.115  NA    NA     38.715       4.215
##    languages_num
## 1:         5.095

(b) Using .SDcols for column subsetting

One way to impact what .SD is is to limit the columns included in .SD by passing a vector of column names or column index to .SDcols argument. The following 4 commands have the same output.

#include only the size and price column by passing vector of column names
SD_include <- app[,.SD, .SDcols = c('size', 'price')]
head(SD_include, 5)

##         size price
## 1:   8593408  0.99
## 2:  47495168  0.99
## 3:  34856960  0.99
## 4: 199166976  2.99
## 5:  24403968  0.99

#exclude column except price and size by passing vector of column index 
exclude_col <- app[,.SD, .SDcols = -c(3:ncol(app))]
head(exclude_col,5)

##         size price
## 1:   8593408  0.99
## 2:  47495168  0.99
## 3:  34856960  0.99
## 4: 199166976  2.99
## 5:  24403968  0.99

#subseting using [] on .SD by passing column index
bracket_subset <- app[,.SD[,c(1,2)]]
head(bracket_subset,5)

##         size price
## 1:   8593408  0.99
## 2:  47495168  0.99
## 3:  34856960  0.99
## 4: 199166976  2.99
## 5:  24403968  0.99

#subseting using [] on .SD by passing column names
bracket_subset2 <- app[,.SD[,c('size','price')]]
head(bracket_subset2,5)

##         size price
## 1:   8593408  0.99
## 2:  47495168  0.99
## 3:  34856960  0.99
## 4: 199166976  2.99
## 5:  24403968  0.99

(c) Batch Type Conversion

Based on what we learned above, .SD, lapply(), and .SDcols can be used together to convrt groups of columns to a common type. Please note that we need to put () around a named vector if we want to passing it into the column operator.

#specify the columns you want to convert 
fkt <- c('age','genre')
#change the type of age and genre from character to factor
sd_convert <- app[ , (fkt) := lapply(.SD, factor), .SDcols = fkt]
print(paste0('type of age: ', class(sd_convert$age), ",  ", "type of genre: ", class(sd_convert$genre)))

## [1] "type of age: factor,  type of genre: factor"

(d) Group subsetting

Group subsetting and performing operation by groups can be achieved through combining .SD with by argument. The following example gets the app with the most and the least rating count for each genre. Just to remind, .N standads for row numbers, which in this case is the row index for the app with the lowest rating count in each genre subset.

#the dataset is first sorted by rating count in descending order 
sorted_app <- app[order(rating_count, decreasing = T),]
most_least_rating <- sorted_app[,.SD[c(1,.N)],by = genre]
head(most_least_rating,5)

##       genre      size price rating_count rating age device_num screenshots
## 1:    Games 624107810  0.99       326482    5.0 12+         43           5
## 2:    Games 199166976  2.99            1    5.0  4+         40           5
## 3: Entertai  93021184  1.99        26333    4.5 12+         37           5
## 4: Entertai  47495168  0.99            1    4.0  9+         37           5
## 5: Educatio 150221824  4.99         6503    4.5  4+         24           5
##    languages_num
## 1:            13
## 2:             1
## 3:            14
## 4:             1
## 5:            12

(e) group operations

By combining by argument with lapply() and .SD, we can performe the same operations on multiple columns of grouped subset of data. In the following example, we will calculate the mean size, mean price, mean rating_count and mean rating for each app genre.

group_mean<- app[,lapply(.SD, mean, na.rm = T), by = genre, .SDcols = 1:4]
head(group_mean)

##       genre      size    price rating_count   rating
## 1: Entertai  85306047 1.836154    1911.1538 3.596154
## 2:    Games 256272887 2.631892    8828.2297 4.263514
## 3: Educatio 181236460 2.951538     811.8462 3.788462

set() family

The set() family functions allows for super fast assignment operations(even faster than := and are usuelly used in for-loops. It has the same structure as data.table syntax but can be used on a data.frame project as well.

(a) set()

The structure of set() function is : set(DT, index, column, value), where:

DT is the data.table you wish to update
index is the row index used in a loop
column is columns you wish to update in the loop
value specifies how column should be updated

The original app dataset does not include unique keys for each app included, so let’s now created one for each. There are two ways we can achieve that using data.table package:

Using :=to add ID column

system.time(for (i in 1:nrow(app)) app[i, ID := i])

##    user  system elapsed 
##   0.170   0.003   0.095

2.Usingset()` to add ID column

#
system.time(for (i in 1:nrow(app)) set(app, i, 'ID2', i))

##    user  system elapsed 
##   0.007   0.000   0.004

We can see that set() is much faster than := even for a small dataset with only 200 rows.

(b) setnames()

setnames() function is used to change column names of data.table object with the structure setnames(DT, old_name, new_name).

Let’s try to abbreviate the column names for device number and language number using setnames() in the folllowing example.

setnames(app, c('device_num', 'languages_num'), c('d_num', 'l_num'))
head(app,1)

##       size price rating_count rating age    genre d_num screenshots l_num ID
## 1: 8593408  0.99            1      3  4+ Entertai    38           2     1  1
##    ID2
## 1:   1

(c) setcolorder()

Isn’t weird if we see a dataset with ID column as the last column?

Let’s reorder our column by moving the two ID columns created in (a) to the front of the dataset! We can do this in two ways.

#first way 
order_col1 <- setcolorder(app, c("ID", "ID2",colnames(app)[1:9]))
head(order_col1,5)

##    ID ID2      size price rating_count rating age    genre d_num screenshots
## 1:  1   1   8593408  0.99            1    3.0  4+ Entertai    38           2
## 2:  2   2  47495168  0.99            1    4.0  9+ Entertai    37           5
## 3:  3   3  34856960  0.99            1    5.0  9+    Games    38           4
## 4:  4   4 199166976  2.99            1    5.0  4+    Games    40           5
## 5:  5   5  24403968  0.99            2    3.5  4+ Educatio    26           4
##    l_num
## 1:     1
## 2:     1
## 3:     2
## 4:     1
## 5:    16

#second way 
order_col2 <- setcolorder(app, c("ID","ID2"))
head(order_col2,5)

##    ID ID2      size price rating_count rating age    genre d_num screenshots
## 1:  1   1   8593408  0.99            1    3.0  4+ Entertai    38           2
## 2:  2   2  47495168  0.99            1    4.0  9+ Entertai    37           5
## 3:  3   3  34856960  0.99            1    5.0  9+    Games    38           4
## 4:  4   4 199166976  2.99            1    5.0  4+    Games    40           5
## 5:  5   5  24403968  0.99            2    3.5  4+ Educatio    26           4
##    l_num
## 1:     1
## 2:     1
## 3:     2
## 4:     1
## 5:    16

Similar Packages

data.table vs dataframe

We can think of data.tables as data.frames with extra features. While data.frame is part of base R, data.table is a package that extends data.frames. Two of its most notable features are cleaner syntax and speed.

Similarities

1. Syntax:

Both data.table and dataframe are similar in some syntaxs, such as reading in data:

#data table
app.DT <- fread('appstore_sample.csv')

#dataframe
app.DF <- read.csv('appstore_sample.csv')

and subsetting:

head(app.DF[,1]) #data.table subsetting

## [1]   8593408  47495168  34856960 199166976  24403968  11927552

head(app.DT[,1]) #data.frame subsetting

##         size
## 1:   8593408
## 2:  47495168
## 3:  34856960
## 4: 199166976
## 5:  24403968
## 6:  11927552

Despite of differences in their output formats.

Differences

1. Syntax

From the example below, when we add, update, and delete a column or values in a data set, we can see that we did not assign the results back to data table since:=operator modifies the input object by reference.

#add a new column
app.DT[ , price.diff:= price-mean(price)] 
app.DF$price.diff <- app.DF$price-mean(app.DF$price)

#update a column
app.DT[ , price:= round(price)] 
app.DF$price <- round(app.DF$price)

#delete a column
app.DT[ , 1:= NULL]
app.DF <- app.DF[, -1]

2. Memory Usage

The inneccessity of assignment in the example above as well as in the chaining process results in shallow copies in R which leads to better performance with less memory usage.

3. Speed

Even in this dataset with only 200 rows, we can tell that the fread(), the function in the data.table package processes the data faster. The difference will be amplified if we read in large data files.

#data.table function
system.time(fread('appstore_sample.csv'))

##    user  system elapsed 
##   0.003   0.000   0.001

#dataframe
system.time(read_csv('appstore_sample.csv'))

## Parsed with column specification:
## cols(
##   size = col_double(),
##   price = col_double(),
##   rating_count = col_integer(),
##   rating = col_double(),
##   age = col_character(),
##   genre = col_character(),
##   device_num = col_integer(),
##   screenshots = col_integer(),
##   languages_num = col_integer()
## )

##    user  system elapsed 
##   0.020   0.008   0.030

data.table vs dplyr

Similarities

1. Common in Functionalities

Both data.table and dplyr aim at a more efficient manipulation of data. They share a lot of functionalities, such as:

Operations	data.table	dplyr
Filter rows	`app.DT[price > 5]`	`filter(app.DF, price > 5)`
Sort rows	`app.DT[order(price)]`	`arrange(app.DF, price)`
Select columns	`app.DT[, list(price)]`	`select(app.DF, price)`
Summarise data	`app.DT[, mean(price)]`	`summarise(app.DF, mean(price))`
by	`app.DT[, .(mean.price = mean(price), by = "genre"]`	`app.DF %>% group_by(genre) %>% summarise(mean.price = mean(price))`

Differences

1. Syntax

The syntax of dplyr is based on five main functions filter(), arrange(), select(), mutate(), summarise() and group_by(), while general data.table syntax is : DT[i, j, by, ...].

For example, let’s calculate the mean of languages supported of each genre where the rating of the App is greater than 2:

#data.table syntax
app[rating> 2,mean(languages_num), by=genre]

##       genre       V1
## 1: Entertai  2.12000
## 2:    Games  4.44898
## 3: Educatio 12.36000

#dplyr syntax
app %>% filter(rating > 2) %>% group_by(genre) %>% summarise(mean(languages_num))

## # A tibble: 3 x 2
##   genre    `mean(languages_num)`
##   <chr>                    <dbl>
## 1 Educatio                 12.4 
## 2 Entertai                  2.12
## 3 Games                     4.45

We can tell that, compared to dplyr’s quite verbose syntax, data.table syntax is compact.

2. Memory Usage

In data.table, objects can be manipulated ‘by reference’ (using the set() function or with the := operator). Such manipulation will not copy the data but simply modify it, minimizing the memory requirements. However,dplyr have to make copies when functions like arrange() and summarise() are used.

3. Speed

#data.table syntax
system.time(app[rating> 2,mean(languages_num), by=genre])

##    user  system elapsed 
##   0.002   0.000   0.001

#dplyr syntax
system.time(app %>% filter(rating > 2) %>% group_by(genre) %>% summarise(mean(languages_num)))

##    user  system elapsed 
##   0.003   0.000   0.002

While the current dataset is too small to tell the large differences, “data.table” performs much faster when processing big data as indicated by the image below:

Benchmarks by Matt Dowle

4. Ecosystem

The data.table package has no dependency whereas dplyr is part of the tidyverse. So, for example, while data.table includes functions to read, write, or reshape data, dplyr delegates these tasks to companion packages like readr or tidyr.

To summarize this section:

	Similarities	Differences
data.table VS dataframe	Both have similar basic syntax, such as reading data and subsetting	data.table uses less memory and processes data faster
data.table VS dplyr	Both aim to process data efficiently and share lots of functionality	data.table has more concise syntax and no dependency, uses less memory, and processes data faster

Reflection

Advantages: fast! very fast! super fast!

Data.table can easily optimise operations for speed and more efficient memory usage. Its concise and consistent syntax also reduces the workload for coding. These makes the package especially user-friendly when dealing with large dataset and complicated operations.

Shortcomings:

Be ready for the steep learning curve!

The special syntax DT[i,j,by] of data.table, and the fact that it doesn’t really depend on functions to do operations makes it more like having its own language. Therefore, it may appears counter-intuitive for people who are used to working with other R functions like those in dplyr. To be specific, it doesn’t have the function names and pipeline which together facilitates a structured reading of the steps that were taken. Its lack of readibility will cause trouble when users have to show their codes to other people who its not familiar with data.table syntax.

Be careful and watch out for mistakes!

Another thing to note when using data.table is that it adds/modifies columns by reference. This feature makes it extremely fast but may not be desirable and especially risky when you know you might make some mistakes in coding. When using <- operator to update columns you can always run the operation before assigning it to the original dataset to check for mistakes. But this is not the case with data.table because the input data will autimatically be modified and when the column being referenced is changed by mistake the derived column will also be affected.

Limitations & Things to Improve

You can’t rely on your old R coding intuition!

For example, suppose we want to add an ID column to app data.table with unique value for every row. The traditional way to achieve this is through app$ID <- 1:nrow(app). But if you want to run app[, ID := 1:nrow(app)], it will reports an error because := doesn’t allow assignment of an vector to a column. It only allows for assingning an single value (e.g. NA) or an operation over existing columns(e.g. mean(price)) for column creation/updates. For now, you have to run for (i in 1:nrow(app)) set(app, i, 'ID2', i) to achieve this using the data.table syntax. It is dfinitely faster when the item assigned is large but the coding is undeniably less intuitive.

fread() can be smarter!

Notice that in the examples for fread(), we showed that if we want to skip the first two rows of the local csv file, the header row will also be removed. This problem cannot be fixed even if we add header = T, which suggests a limitation of fread() function.

Admitedly, there might be a trade off between computational efficiency and function felxibility and it may not be practical for data.table to outperform other packages in both aspects. But the developers of data.table should think seriously about how to find the balance.

Package Overview

Background

Syntax

Version history: (since 2019)

Tips before diving deeper:

Examples of Usage

Creating data.table()

Subsetting rows using i

(a) subsetting rows with row index

(b) subsetting rows with conditions

Subsetting by column & Data Manipulation on columns in j

(a) select only one column

(b) select multiple columns

(c) Data manipulation on j

Data Manipulation by group

Adding/Updating Columns By Reference

(a) Update a column

(b) Add a column

(c) Remove a column

Chaining

Read & Write Files

(a) Importing local & oneline file

(b) Subsetting when importing

(c) Set column class type when importing

(d) fwrite() for exporting files

.SD & .SDcols

(a) Perform the same operations on multiple columns with lapply() and .SD

(b) Using .SDcols for column subsetting

(c) Batch Type Conversion

(d) Group subsetting

(e) group operations

set() family

(a) set()

(b) setnames()

(c) setcolorder()

Similar Packages

data.table vs dataframe

Similarities

1. Syntax:

Differences

1. Syntax

2. Memory Usage

3. Speed

data.table vs dplyr

Similarities

1. Common in Functionalities

Differences

1. Syntax

2. Memory Usage

3. Speed

4. Ecosystem

Reflection

Advantages: fast! very fast! super fast!

Shortcomings:

Limitations & Things to Improve

Bibliography