Package Overview
Background
data.table is one of the 15,000 add-on R packages that is widely used in the field of data science. It provides an enhanced version of data.frame that allows us to do fast data manipulations with terse syntax. If you know R language and haven’t learned about data.table package yet, then this documentation will help you get started right away!
Why do we prefer Data.table?
- Concise: Complicated operations can be carried out without having to write verbose code.
- Memory Efficient: It does a shallow copy of the data frame intead of physically copied the entire data frame in system’s memory.
- Fast: Thanks to its high memory efficiency, data.table provides blazing fast speed when it comes to loading and manipulating large datasets.
Syntax
The general form of data.table programming is DT[i, j, by], which can be interpreted as taking a data.table called DT, subseting rows using i, then manipulating columns using j while grouping by by.
Below are a few special but useful features of data.table you need to know before start:
- data.table doesn’t really depend on functions to support fast data manipulation. But it does provide some useful functions such as except fread() and set().
- A data.table never sets, needs or uses row names.
- A data.table never automatically converts character columns to factors, and read Dates and Datetimes as character by default.
- The data.table package has
.()as an alias forlist()(we will use this a lot when passing values to arguments) .Nreturns the the number of rows(we will see how it works in examples).
Version history: (since 2019)
| Version | Release Date | New features ( not inclusive) |
|---|---|---|
| data.table v1.12.9 | In development | %chin% subset on character vectors ( speeding up significantly) |
| data.table v1.12.8 | 09.Dec.2019 | DT [, {. . . ; .( A,B ) } ] (i.e. when .() is the final item of a multi-statement {…}) now auto-names the columns A and B (just like DT[, .(A,B)]) rather than V1 and V2 |
| data.table v1.12.4 | 03.Oct.2019 | rleid ( ) functions now support long vectors (length > 2 billion). |
| data.table v1.12.2 | 07.Apr.2019 | rbind and rbindlist now retain the position of duplicate column names rather than grouping them together |
| data.table v1.12.0 | 13.Jan.2019 | fread() can now read a remote compressed file in one step; for example, fread (“https://domain.org/file.csv.bz2”). |
Tips before diving deeper:
- We use head() in the following sections to show the output of our exmaples for presentation conciseness. It is not part of the example itself.
- Our presentation will cover examples of usages of 9 “functions”,similar packages and reflections, and we design this order for logic flow from basic to advanced.
Examples of Usage
The dataset used for introducing this package was uploaded on Kaggle by an individual who has extracted top 1000 Apple iOS apps during January to December 2016 time frame, which resulted in a large dataset of over 7000 observations. We sample 200 observations and extracted 8 variables:
| Variable | Description |
|---|---|
| Rating | Average User Rating Value (out of scale 1 to 5) |
| Size | Size of an app (in Bytes) |
| Rating_count | User rating counts |
| Device_num | Number of supporting devices |
| Screenshots | Number of screenshots shown for display |
| Languages_num | Number of supported languages |
| Price | Price amount of each mobile app in US dollars |
| Genre | Primary Genre |
| Age | Content Rating |
Creating data.table()
There are three ways to create a data.table:
- convert data.frame to data.table using
as.data.table()
app <- read.csv("appstore_sample.csv",header= TRUE)
app<- as.data.table(app)
head(app,5)## size price rating_count rating age genre device_num screenshots
## 1: 8593408 0.99 1 3.0 4+ Entertai 38 2
## 2: 47495168 0.99 1 4.0 9+ Entertai 37 5
## 3: 34856960 0.99 1 5.0 9+ Games 38 4
## 4: 199166976 2.99 1 5.0 4+ Games 40 5
## 5: 24403968 0.99 2 3.5 4+ Educatio 26 4
## languages_num
## 1: 1
## 2: 1
## 3: 2
## 4: 1
## 5: 16
- import local file as data.table using
fread().
app <- fread("appstore_sample.csv")
head(app,5)## size price rating_count rating age genre device_num screenshots
## 1: 8593408 0.99 1 3.0 4+ Entertai 38 2
## 2: 47495168 0.99 1 4.0 9+ Entertai 37 5
## 3: 34856960 0.99 1 5.0 9+ Games 38 4
## 4: 199166976 2.99 1 5.0 4+ Games 40 5
## 5: 24403968 0.99 2 3.5 4+ Educatio 26 4
## languages_num
## 1: 1
## 2: 1
## 3: 2
## 4: 1
## 5: 16
- Create a data.table from scratch using
data.table()
DT = data.table(
ID = c("b","b","b","a","a","c"),
a = 1:6,
b = 7:12,
c = 13:18
)
DT## ID a b c
## 1: b 1 7 13
## 2: b 2 8 14
## 3: b 3 9 15
## 4: a 4 10 16
## 5: a 5 11 17
## 6: c 6 12 18
We can also check if an object’s type is data.table using is.data.table(). function.
is.data.table(app) ## [1] TRUE
Subsetting rows using i
(a) subsetting rows with row index
Subsetting data with row index is straightward. You can pass a vector into the square bracket in the i. Adding a comma after i or not doesn’t affect the output.
head(app[2:5,],5) # with comma## size price rating_count rating age genre device_num screenshots
## 1: 47495168 0.99 1 4.0 9+ Entertai 37 5
## 2: 34856960 0.99 1 5.0 9+ Games 38 4
## 3: 199166976 2.99 1 5.0 4+ Games 40 5
## 4: 24403968 0.99 2 3.5 4+ Educatio 26 4
## languages_num
## 1: 1
## 2: 2
## 3: 1
## 4: 16
head(app[2:5],5) # without comma## size price rating_count rating age genre device_num screenshots
## 1: 47495168 0.99 1 4.0 9+ Entertai 37 5
## 2: 34856960 0.99 1 5.0 9+ Games 38 4
## 3: 199166976 2.99 1 5.0 4+ Games 40 5
## 4: 24403968 0.99 2 3.5 4+ Educatio 26 4
## languages_num
## 1: 1
## 2: 2
## 3: 1
## 4: 16
(b) subsetting rows with conditions
We can also subsetting data.table using conditions. In the first example, we want to find the apps with price 0.99. In the second example, we want to find apps which have price of either 0.99 or 2.99. Note: %in% is used to identify if an element can be found in a vector.
price99 <-app[price==0.99]
head(price99,5)## size price rating_count rating age genre device_num screenshots
## 1: 8593408 0.99 1 3.0 4+ Entertai 38 2
## 2: 47495168 0.99 1 4.0 9+ Entertai 37 5
## 3: 34856960 0.99 1 5.0 9+ Games 38 4
## 4: 24403968 0.99 2 3.5 4+ Educatio 26 4
## 5: 11927552 0.99 2 1.0 4+ Games 43 2
## languages_num
## 1: 1
## 2: 1
## 3: 2
## 4: 16
## 5: 1
price_of_two = app[price %in% c(0.99,2.99)]
head(price_of_two,5)## size price rating_count rating age genre device_num screenshots
## 1: 8593408 0.99 1 3.0 4+ Entertai 38 2
## 2: 47495168 0.99 1 4.0 9+ Entertai 37 5
## 3: 34856960 0.99 1 5.0 9+ Games 38 4
## 4: 199166976 2.99 1 5.0 4+ Games 40 5
## 5: 24403968 0.99 2 3.5 4+ Educatio 26 4
## languages_num
## 1: 1
## 2: 1
## 3: 2
## 4: 1
## 5: 16
Subsetting by column & Data Manipulation on columns in j
The general format to subset data by columns is to put whatever variable names that you want to select in j after the comma after i. Such as app[,variable nanames].
(a) select only one column
Suppose we want to know what genre do the sampled apps belong to:
genres <-app[,genre]
head(genres,5)## [1] "Entertai" "Entertai" "Games" "Games" "Educatio"
(b) select multiple columns
You can select multiple columns from a data.table by passing column names in .() or list(). Two ways should give the same output.
# use list()
lists <-app[, list(price,genre)]
head(lists,5)## price genre
## 1: 0.99 Entertai
## 2: 0.99 Entertai
## 3: 0.99 Games
## 4: 2.99 Games
## 5: 0.99 Educatio
#use .()
parent<-app[,.(price,genre)]
head(parent,5)## price genre
## 1: 0.99 Entertai
## 2: 0.99 Entertai
## 3: 0.99 Games
## 4: 2.99 Games
## 5: 0.99 Educatio
(c) Data manipulation on j
All the aggregationa and data manipulation operations are carried out in j. We can do operations on one or multiple columns at the same time and name the aggregated value.
Example 1: aggregate on one column
Suppose we want to calculae the mean price of all apps in the sample.
app[,median(price)] ## [1] 1.99
Example 2: aggregate on multiple columns
Suppose we want to calculate the median price and mean rating for all apps in the sample. This can be achieved by putting two functions in the .() and passing them together to j.
app[,.(median(price),mean(rating))] ## V1 V2
## 1: 1.99 4.115
Example 3: name the aggregated value
It may be confusing to readers if we don’t name the aggregated value. THe folowing command outputs the same values as above, but the outputs are assigned with new names.
app[,.(median_price = median(price),mean_rating = mean(rating))] ## median_price mean_rating
## 1: 1.99 4.115
Data Manipulation by group
Now, let us look at the last syntax, by. The purpose for it is similar to the groupbyfunction in dyplyr package. For every code line below, we want the operations to by carried out separately for each group as specified.
Example 1: Group by one variable
Suppose we want to find the the mean price for apps in each genre:
app[,.(mean_price = mean(price)), by = genre] ## genre mean_price
## 1: Entertai 1.836154
## 2: Games 2.631892
## 3: Educatio 2.951538
Example 2: Group by multiple variables
Suppose we want to find the mean price of apps in each genre with different age range.
app[,(mean_price = mean(price)), by = .(genre,age)] ## genre age V1
## 1: Entertai 4+ 2.066923
## 2: Entertai 9+ 0.990000
## 3: Games 9+ 2.418571
## 4: Games 4+ 2.762152
## 5: Educatio 4+ 2.951538
## 6: Entertai 12+ 1.990000
## 7: Games 12+ 2.623333
## 8: Entertai 17+ 1.990000
## 9: Games 17+ 1.990000
Example 3: Aggregate by group on multiple columns in a subset of data
Suppose we want to find the mean price of apps with more than 1 language in each genre of different age range.
app[languages_num > 1,.(mean_price = mean(price)), by = .(genre,age)] ## genre age mean_price
## 1: Games 9+ 2.451538
## 2: Educatio 4+ 3.240000
## 3: Games 4+ 2.556667
## 4: Games 12+ 3.190000
## 5: Entertai 4+ 2.490000
## 6: Games 17+ 2.990000
## 7: Entertai 17+ 2.990000
## 8: Entertai 12+ 1.990000
Adding/Updating Columns By Reference
:= is used in the j operator to assign values to columns by reference which is only usable for a data.table object. Remember the general form of a data table: DT[i, j, by]. Now let’s update the our dataset.
(a) Update a column
Let’s try update 1 column, for example, the price of the Apps. The price before update:
head(app[,price],5)## [1] 0.99 0.99 0.99 2.99 0.99
Now, update the existing price column with the rounded price:
app[ , price:= round(price)] The price column after update:
head(app[,price],5)## [1] 1 1 1 3 1
We can also update multiple columns at the same time. For example, let’s remove the ‘+’ sign in the age threshold column and change the column of support language number into the one that indicate whether the app supports more than 10 languages.
The original columns:
head(app[ , c('age','languages_num')],5)## age languages_num
## 1: 4+ 1
## 2: 9+ 1
## 3: 9+ 2
## 4: 4+ 1
## 5: 4+ 16
Columns “age” and “languages_num” are updated by what is after :=:
#way 1
app[ , c('age','languages_num'):=list(substring(age,1,1), languages_num>10)]
#way 2:
app[,':='(age=substring(age,1,1), languages_num=languages_num>10)]Now it becomes:
head(app[ , c('age','languages_num')],5)## age languages_num
## 1: 4 FALSE
## 2: 9 FALSE
## 3: 9 FALSE
## 4: 4 FALSE
## 5: 4 FALSE
(b) Add a column
Add a column that gives the difference between the price of the app and the mean price:
app[ , price.diff:= price-mean(price),]
head(app[, price.diff],5)## [1] -1.58 -1.58 -1.58 0.42 -1.58
(c) Remove a column
To remove columns, we can simply use := NULL:
#remove age column
app[ , age:= NULL]
head(app,5)## size price rating_count rating genre device_num screenshots
## 1: 8593408 1 1 3.0 Entertai 38 2
## 2: 47495168 1 1 4.0 Entertai 37 5
## 3: 34856960 1 1 5.0 Games 38 4
## 4: 199166976 3 1 5.0 Games 40 5
## 5: 24403968 1 2 3.5 Educatio 26 4
## languages_num price.diff
## 1: FALSE -1.58
## 2: FALSE -1.58
## 3: FALSE -1.58
## 4: FALSE 0.42
## 5: FALSE -1.58
#remove both language and size columns
app[ , c('size','languages_num'):= NULL]
head(app,5)## price rating_count rating genre device_num screenshots price.diff
## 1: 1 1 3.0 Entertai 38 2 -1.58
## 2: 1 1 4.0 Entertai 37 5 -1.58
## 3: 1 1 5.0 Games 38 4 -1.58
## 4: 3 1 5.0 Games 40 5 0.42
## 5: 1 2 3.5 Educatio 26 4 -1.58
Chaining
A good way to avoid intermediate assignment to a temporary variable altogether is by chaining expressions.
For example, to get the app genre with average rating greater than 4, originally we need to:
- calculate the average rating by genre:
genre.avg.rating <- app[,.(avg.rating=mean(rating)),by=genre]- Select the genre of which the average rating is above 4:
genre.avg.rating[avg.rating>4]## genre avg.rating
## 1: Games 4.263514
However, we simply can do it through chaining, which is similar to the piping function, and we get the same result:
app[,.(avg.rating=mean(rating)),by=genre][avg.rating>4]## genre avg.rating
## 1: Games 4.263514
Read & Write Files
Although data.table is not a function-based package, it does provide some useful functions such as fread() and fwrite() for importing and exporting data.
(a) Importing local & oneline file
fread() can automatically identify the structure of the importing dataset without having to specify features such as column seperator, column class, and header.It can import both local files and online files at ease by passing the local path or URL into the function.
#Local File
read_local <- fread('appstore_sample.csv')
head(read_local,5)## size price rating_count rating age genre device_num screenshots
## 1: 8593408 0.99 1 3.0 4+ Entertai 38 2
## 2: 47495168 0.99 1 4.0 9+ Entertai 37 5
## 3: 34856960 0.99 1 5.0 9+ Games 38 4
## 4: 199166976 2.99 1 5.0 4+ Games 40 5
## 5: 24403968 0.99 2 3.5 4+ Educatio 26 4
## languages_num
## 1: 1
## 2: 1
## 3: 2
## 4: 1
## 5: 16
#Online File
read_online <- fread('http://www.stats.ox.ac.uk/pub/datasets/csb/ch11b.dat')
head(read_online,5)## V1 V2 V3 V4 V5
## 1: 1 307 930 36.58 0
## 2: 2 307 940 36.73 0
## 3: 3 307 950 36.93 0
## 4: 4 307 1000 37.15 0
## 5: 5 307 1010 37.23 0
(b) Subsetting when importing
Sometimes the dataset we want to import doesn’t have reguler stucture. For example, it may have paragraphs of data description at the start or the end of the file. We can then subset the data while importing by using nrow argument to specify the maximum number of rows to read, or use skip argument to specify the number rows to avoid from reading, starting on the first line.
#read the first 3 rows
read_3row <- fread('appstore_sample.csv',nrows = 3)
head(read_3row)## size price rating_count rating age genre device_num screenshots
## 1: 8593408 0.99 1 3 4+ Entertai 38 2
## 2: 47495168 0.99 1 4 9+ Entertai 37 5
## 3: 34856960 0.99 1 5 9+ Games 38 4
## languages_num
## 1: 1
## 2: 1
## 3: 2
#skip the first three rows
#column names will also be removed! Be careful!
skip_3rows <- fread('appstore_sample.csv', skip = 3)
head(skip_3rows,4)## V1 V2 V3 V4 V5 V6 V7 V8 V9
## 1: 34856960 0.99 1 5.0 9+ Games 38 4 2
## 2: 199166976 2.99 1 5.0 4+ Games 40 5 1
## 3: 24403968 0.99 2 3.5 4+ Educatio 26 4 16
## 4: 11927552 0.99 2 1.0 4+ Games 43 2 1
Some times we may need only a few columns from a large dataset, then we can use select or drop arguments to specify the columns we want to read or not to read by passing a vector of column namesor a vector of column index after select and drop.
#import only size and price column by passing column names
select_byname <- fread('appstore_sample.csv', select = c('size','price'))
head(select_byname,5)## size price
## 1: 8593408 0.99
## 2: 47495168 0.99
## 3: 34856960 0.99
## 4: 199166976 2.99
## 5: 24403968 0.99
#import only size and price column by passing column index
select_byind <- fread('appstore_sample.csv', select = c(1,2))
head(select_byind,5)## size price
## 1: 8593408 0.99
## 2: 47495168 0.99
## 3: 34856960 0.99
## 4: 199166976 2.99
## 5: 24403968 0.99
#drop price and size column by passing column names
drop_byname <- fread('appstore_sample.csv', drop = c('size','price'))
head(drop_byname,5)## rating_count rating age genre device_num screenshots languages_num
## 1: 1 3.0 4+ Entertai 38 2 1
## 2: 1 4.0 9+ Entertai 37 5 1
## 3: 1 5.0 9+ Games 38 4 2
## 4: 1 5.0 4+ Games 40 5 1
## 5: 2 3.5 4+ Educatio 26 4 16
(c) Set column class type when importing
fread() does not automatically import character variables as factor type as many importing functions do. But you can always specify the column type by (1) passing an unnamed vector of types corresponding to the columns in the file, or (2) a named list specifying types for a subset of columns by name to the argument colClasses. The second method is more useful when you have multiple columns to set types.
#method (1): import age as factor and screenshots as numeric
col_type <- fread('appstore_sample.csv', colClasses = c(age = 'factor', screenshots = 'numeric'))
print(paste0('type of age: ', class(col_type$age), ", ", "type of screenshots: ", class(col_type$screenshots)))## [1] "type of age: factor, type of screenshots: numeric"
#method (2): import the 7-9th columns as numeric, and the age and genre column as factor using method
col_type2 <- fread('appstore_sample.csv', colClasses = list(numeric = 7:9, factor=c('age','genre')))
str(col_type2)## Classes 'data.table' and 'data.frame': 200 obs. of 9 variables:
## $ size :integer64 8593408 47495168 34856960 199166976 24403968 11927552 5627904 102250496 ...
## $ price : num 0.99 0.99 0.99 2.99 0.99 0.99 4.99 3.99 0.99 2.99 ...
## $ rating_count : int 1 1 1 1 2 2 3 3 5 5 ...
## $ rating : num 3 4 5 5 3.5 1 2 4.5 5 4.5 ...
## $ age : Factor w/ 4 levels "12+","17+","4+",..: 3 4 4 3 3 3 3 3 4 3 ...
## $ genre : Factor w/ 3 levels "Educatio","Entertai",..: 2 2 3 3 1 3 1 3 2 3 ...
## $ device_num : num 38 37 38 40 26 43 43 38 37 38 ...
## $ screenshots : num 2 5 4 5 4 2 0 5 0 5 ...
## $ languages_num: num 1 1 2 1 16 1 1 13 1 2 ...
## - attr(*, ".internal.selfref")=<externalptr>
(d) fwrite() for exporting files
We can export data.table as an csv file using fwrite() function by passing the data.table and file name in double quotes to the function.
#subset the first 5 rows of app data.table and write it to a csv file called small_app.csv
fwrite(head(app,5), 'small_app.csv')
fread('small_app.csv')## size price rating_count rating age genre device_num screenshots
## 1: 8593408 0.99 1 3.0 4+ Entertai 38 2
## 2: 47495168 0.99 1 4.0 9+ Entertai 37 5
## 3: 34856960 0.99 1 5.0 9+ Games 38 4
## 4: 199166976 2.99 1 5.0 4+ Games 40 5
## 5: 24403968 0.99 2 3.5 4+ Educatio 26 4
## languages_num
## 1: 1
## 2: 1
## 3: 2
## 4: 1
## 5: 16
.SD & .SDcols
.SD is an special operater in data.table package, which stands for subset of the data. .SD is itself a data.table but with the caveat that not all data.table function works on it. It is usually used to apply the same operation to multiple columns. It can also be used with by to perform the same operations on each sub-data.table before the re-assembled result is returned to us.
(a) Perform the same operations on multiple columns with lapply() and .SD
Suppose you want to know the mean for all columns, it would be inefficient to calculate mean one column after the other. But using lapply and .SD can speed this process up.
multiple_mean <- app[,lapply(.SD, mean, na.rm = T)]
head(multiple_mean,5)## size price rating_count rating age genre device_num screenshots
## 1: 224292462 2.57 6886.88 4.115 NA NA 38.715 4.215
## languages_num
## 1: 5.095
(b) Using .SDcols for column subsetting
One way to impact what .SD is is to limit the columns included in .SD by passing a vector of column names or column index to .SDcols argument. The following 4 commands have the same output.
#include only the size and price column by passing vector of column names
SD_include <- app[,.SD, .SDcols = c('size', 'price')]
head(SD_include, 5)## size price
## 1: 8593408 0.99
## 2: 47495168 0.99
## 3: 34856960 0.99
## 4: 199166976 2.99
## 5: 24403968 0.99
#exclude column except price and size by passing vector of column index
exclude_col <- app[,.SD, .SDcols = -c(3:ncol(app))]
head(exclude_col,5)## size price
## 1: 8593408 0.99
## 2: 47495168 0.99
## 3: 34856960 0.99
## 4: 199166976 2.99
## 5: 24403968 0.99
#subseting using [] on .SD by passing column index
bracket_subset <- app[,.SD[,c(1,2)]]
head(bracket_subset,5)## size price
## 1: 8593408 0.99
## 2: 47495168 0.99
## 3: 34856960 0.99
## 4: 199166976 2.99
## 5: 24403968 0.99
#subseting using [] on .SD by passing column names
bracket_subset2 <- app[,.SD[,c('size','price')]]
head(bracket_subset2,5)## size price
## 1: 8593408 0.99
## 2: 47495168 0.99
## 3: 34856960 0.99
## 4: 199166976 2.99
## 5: 24403968 0.99
(c) Batch Type Conversion
Based on what we learned above, .SD, lapply(), and .SDcols can be used together to convrt groups of columns to a common type. Please note that we need to put () around a named vector if we want to passing it into the column operator.
#specify the columns you want to convert
fkt <- c('age','genre')
#change the type of age and genre from character to factor
sd_convert <- app[ , (fkt) := lapply(.SD, factor), .SDcols = fkt]
print(paste0('type of age: ', class(sd_convert$age), ", ", "type of genre: ", class(sd_convert$genre)))## [1] "type of age: factor, type of genre: factor"
(d) Group subsetting
Group subsetting and performing operation by groups can be achieved through combining .SD with by argument. The following example gets the app with the most and the least rating count for each genre. Just to remind, .N standads for row numbers, which in this case is the row index for the app with the lowest rating count in each genre subset.
#the dataset is first sorted by rating count in descending order
sorted_app <- app[order(rating_count, decreasing = T),]
most_least_rating <- sorted_app[,.SD[c(1,.N)],by = genre]
head(most_least_rating,5)## genre size price rating_count rating age device_num screenshots
## 1: Games 624107810 0.99 326482 5.0 12+ 43 5
## 2: Games 199166976 2.99 1 5.0 4+ 40 5
## 3: Entertai 93021184 1.99 26333 4.5 12+ 37 5
## 4: Entertai 47495168 0.99 1 4.0 9+ 37 5
## 5: Educatio 150221824 4.99 6503 4.5 4+ 24 5
## languages_num
## 1: 13
## 2: 1
## 3: 14
## 4: 1
## 5: 12
(e) group operations
By combining by argument with lapply() and .SD, we can performe the same operations on multiple columns of grouped subset of data. In the following example, we will calculate the mean size, mean price, mean rating_count and mean rating for each app genre.
group_mean<- app[,lapply(.SD, mean, na.rm = T), by = genre, .SDcols = 1:4]
head(group_mean)## genre size price rating_count rating
## 1: Entertai 85306047 1.836154 1911.1538 3.596154
## 2: Games 256272887 2.631892 8828.2297 4.263514
## 3: Educatio 181236460 2.951538 811.8462 3.788462
set() family
The set() family functions allows for super fast assignment operations(even faster than := and are usuelly used in for-loops. It has the same structure as data.table syntax but can be used on a data.frame project as well.
(a) set()
The structure of set() function is : set(DT, index, column, value), where:
DTis the data.table you wish to updateindexis the row index used in a loopcolumnis columns you wish to update in the loopvaluespecifies howcolumnshould be updated
The original app dataset does not include unique keys for each app included, so let’s now created one for each. There are two ways we can achieve that using data.table package:
- Using
:=to add ID column
system.time(for (i in 1:nrow(app)) app[i, ID := i])## user system elapsed
## 0.170 0.003 0.095
2.Usingset()` to add ID column
#
system.time(for (i in 1:nrow(app)) set(app, i, 'ID2', i))## user system elapsed
## 0.007 0.000 0.004
We can see that set() is much faster than := even for a small dataset with only 200 rows.
(b) setnames()
setnames() function is used to change column names of data.table object with the structure setnames(DT, old_name, new_name).
Let’s try to abbreviate the column names for device number and language number using setnames() in the folllowing example.
setnames(app, c('device_num', 'languages_num'), c('d_num', 'l_num'))
head(app,1)## size price rating_count rating age genre d_num screenshots l_num ID
## 1: 8593408 0.99 1 3 4+ Entertai 38 2 1 1
## ID2
## 1: 1
(c) setcolorder()
Isn’t weird if we see a dataset with ID column as the last column?
Let’s reorder our column by moving the two ID columns created in (a) to the front of the dataset! We can do this in two ways.
#first way
order_col1 <- setcolorder(app, c("ID", "ID2",colnames(app)[1:9]))
head(order_col1,5)## ID ID2 size price rating_count rating age genre d_num screenshots
## 1: 1 1 8593408 0.99 1 3.0 4+ Entertai 38 2
## 2: 2 2 47495168 0.99 1 4.0 9+ Entertai 37 5
## 3: 3 3 34856960 0.99 1 5.0 9+ Games 38 4
## 4: 4 4 199166976 2.99 1 5.0 4+ Games 40 5
## 5: 5 5 24403968 0.99 2 3.5 4+ Educatio 26 4
## l_num
## 1: 1
## 2: 1
## 3: 2
## 4: 1
## 5: 16
#second way
order_col2 <- setcolorder(app, c("ID","ID2"))
head(order_col2,5)## ID ID2 size price rating_count rating age genre d_num screenshots
## 1: 1 1 8593408 0.99 1 3.0 4+ Entertai 38 2
## 2: 2 2 47495168 0.99 1 4.0 9+ Entertai 37 5
## 3: 3 3 34856960 0.99 1 5.0 9+ Games 38 4
## 4: 4 4 199166976 2.99 1 5.0 4+ Games 40 5
## 5: 5 5 24403968 0.99 2 3.5 4+ Educatio 26 4
## l_num
## 1: 1
## 2: 1
## 3: 2
## 4: 1
## 5: 16
Similar Packages
data.table vs dataframe
We can think of data.tables as data.frames with extra features. While data.frame is part of base R, data.table is a package that extends data.frames. Two of its most notable features are cleaner syntax and speed.
Similarities
1. Syntax:
Both data.table and dataframe are similar in some syntaxs, such as reading in data:
#data table
app.DT <- fread('appstore_sample.csv')
#dataframe
app.DF <- read.csv('appstore_sample.csv')and subsetting:
head(app.DF[,1]) #data.table subsetting## [1] 8593408 47495168 34856960 199166976 24403968 11927552
head(app.DT[,1]) #data.frame subsetting## size
## 1: 8593408
## 2: 47495168
## 3: 34856960
## 4: 199166976
## 5: 24403968
## 6: 11927552
Despite of differences in their output formats.
Differences
1. Syntax
From the example below, when we add, update, and delete a column or values in a data set, we can see that we did not assign the results back to data table since:=operator modifies the input object by reference.
#add a new column
app.DT[ , price.diff:= price-mean(price)]
app.DF$price.diff <- app.DF$price-mean(app.DF$price)
#update a column
app.DT[ , price:= round(price)]
app.DF$price <- round(app.DF$price)
#delete a column
app.DT[ , 1:= NULL]
app.DF <- app.DF[, -1]2. Memory Usage
The inneccessity of assignment in the example above as well as in the chaining process results in shallow copies in R which leads to better performance with less memory usage.
3. Speed
Even in this dataset with only 200 rows, we can tell that the fread(), the function in the data.table package processes the data faster. The difference will be amplified if we read in large data files.
#data.table function
system.time(fread('appstore_sample.csv'))## user system elapsed
## 0.003 0.000 0.001
#dataframe
system.time(read_csv('appstore_sample.csv'))## Parsed with column specification:
## cols(
## size = col_double(),
## price = col_double(),
## rating_count = col_integer(),
## rating = col_double(),
## age = col_character(),
## genre = col_character(),
## device_num = col_integer(),
## screenshots = col_integer(),
## languages_num = col_integer()
## )
## user system elapsed
## 0.020 0.008 0.030
data.table vs dplyr
Similarities
1. Common in Functionalities
Both data.table and dplyr aim at a more efficient manipulation of data. They share a lot of functionalities, such as:
| Operations | data.table | dplyr |
|---|---|---|
| Filter rows | app.DT[price > 5] |
filter(app.DF, price > 5) |
| Sort rows | app.DT[order(price)] |
arrange(app.DF, price) |
| Select columns | app.DT[, list(price)] |
select(app.DF, price) |
| Summarise data | app.DT[, mean(price)] |
summarise(app.DF, mean(price)) |
| by | app.DT[, .(mean.price = mean(price), by = "genre"] |
app.DF %>% group_by(genre) %>% summarise(mean.price = mean(price)) |
Differences
1. Syntax
The syntax of dplyr is based on five main functions filter(), arrange(), select(), mutate(), summarise() and group_by(), while general data.table syntax is : DT[i, j, by, ...].
For example, let’s calculate the mean of languages supported of each genre where the rating of the App is greater than 2:
#data.table syntax
app[rating> 2,mean(languages_num), by=genre]## genre V1
## 1: Entertai 2.12000
## 2: Games 4.44898
## 3: Educatio 12.36000
#dplyr syntax
app %>% filter(rating > 2) %>% group_by(genre) %>% summarise(mean(languages_num))## # A tibble: 3 x 2
## genre `mean(languages_num)`
## <chr> <dbl>
## 1 Educatio 12.4
## 2 Entertai 2.12
## 3 Games 4.45
We can tell that, compared to dplyr’s quite verbose syntax, data.table syntax is compact.
2. Memory Usage
In data.table, objects can be manipulated ‘by reference’ (using the set() function or with the := operator). Such manipulation will not copy the data but simply modify it, minimizing the memory requirements. However,dplyr have to make copies when functions like arrange() and summarise() are used.
3. Speed
#data.table syntax
system.time(app[rating> 2,mean(languages_num), by=genre])## user system elapsed
## 0.002 0.000 0.001
#dplyr syntax
system.time(app %>% filter(rating > 2) %>% group_by(genre) %>% summarise(mean(languages_num)))## user system elapsed
## 0.003 0.000 0.002
While the current dataset is too small to tell the large differences, “data.table” performs much faster when processing big data as indicated by the image below:
Benchmarks by Matt Dowle
4. Ecosystem
The data.table package has no dependency whereas dplyr is part of the tidyverse. So, for example, while data.table includes functions to read, write, or reshape data, dplyr delegates these tasks to companion packages like readr or tidyr.
To summarize this section:
| Similarities | Differences | |
|---|---|---|
| data.table VS dataframe | Both have similar basic syntax, such as reading data and subsetting | data.table uses less memory and processes data faster |
| data.table VS dplyr | Both aim to process data efficiently and share lots of functionality | data.table has more concise syntax and no dependency, uses less memory, and processes data faster |
Reflection
Advantages: fast! very fast! super fast!
Data.table can easily optimise operations for speed and more efficient memory usage. Its concise and consistent syntax also reduces the workload for coding. These makes the package especially user-friendly when dealing with large dataset and complicated operations.
Shortcomings:
- Be ready for the steep learning curve!
The special syntax DT[i,j,by] of data.table, and the fact that it doesn’t really depend on functions to do operations makes it more like having its own language. Therefore, it may appears counter-intuitive for people who are used to working with other R functions like those in dplyr. To be specific, it doesn’t have the function names and pipeline which together facilitates a structured reading of the steps that were taken. Its lack of readibility will cause trouble when users have to show their codes to other people who its not familiar with data.table syntax.
- Be careful and watch out for mistakes!
Another thing to note when using data.table is that it adds/modifies columns by reference. This feature makes it extremely fast but may not be desirable and especially risky when you know you might make some mistakes in coding. When using <- operator to update columns you can always run the operation before assigning it to the original dataset to check for mistakes. But this is not the case with data.table because the input data will autimatically be modified and when the column being referenced is changed by mistake the derived column will also be affected.
Limitations & Things to Improve
- You can’t rely on your old R coding intuition!
For example, suppose we want to add an ID column to app data.table with unique value for every row. The traditional way to achieve this is through app$ID <- 1:nrow(app). But if you want to run app[, ID := 1:nrow(app)], it will reports an error because := doesn’t allow assignment of an vector to a column. It only allows for assingning an single value (e.g. NA) or an operation over existing columns(e.g. mean(price)) for column creation/updates. For now, you have to run for (i in 1:nrow(app)) set(app, i, 'ID2', i) to achieve this using the data.table syntax. It is dfinitely faster when the item assigned is large but the coding is undeniably less intuitive.
- fread() can be smarter!
Notice that in the examples for fread(), we showed that if we want to skip the first two rows of the local csv file, the header row will also be removed. This problem cannot be fixed even if we add header = T, which suggests a limitation of fread() function.
Admitedly, there might be a trade off between computational efficiency and function felxibility and it may not be practical for data.table to outperform other packages in both aspects. But the developers of data.table should think seriously about how to find the balance.