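A Brief Introduction to the data.table Package

data.table is a very useful R package for data manipulation. Its speed and convenience have made it widely popular. Its main characteristics are as follows:

In short, data.table inherits from data.frame, so functions written for a data.frame generally also work on data in data.table form. Because much of data.table is implemented in C, it typically runs considerably faster.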
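The inheritance relationship can be checked directly. The following is a minimal sketch; the table and its column names (id, value) are made up for illustration and do not come from the original text.

```r
library(data.table)

# A small illustrative table (column names are hypothetical)
DT <- data.table(id = 1:3, value = c(2.5, 3.5, 4.5))

# The class vector lists data.table first, then data.frame
cls <- class(DT)
is_df <- inherits(DT, "data.frame")

# Base functions written for data frames accept the table unchanged
n_rows <- nrow(DT)
col_means <- colMeans(DT)   # named numeric vector of column means

print(cls)
print(col_means)
```

Because the class vector contains "data.frame", S3 dispatch falls back to the data.frame methods whenever no data.table-specific method exists.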

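Let's now look at some of the features that make data.table quick and convenient for processing data.

Getting started

Installing and loading data.table

If the package is not yet installed, install it first with install.packages("data.table"). If it is already installed, just load it: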

library(data.table)

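Creating a data table

Creating a data table works the same way as creating a data frame.

We first create a data.frame and then a data.table with the same variables and structure. Note that the x and z values differ between the two, because rnorm() draws new random numbers on each call.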

DF = data.frame(x = rnorm(9), y = rep(c("a", "b", "c"), each = 3), z = rnorm(9))
head(DF)
##            x y           z
## 1 -1.2133384 a -0.86008803
## 2  1.5356639 a -1.30916114
## 3 -0.5450018 a  0.00868531
## 4  0.3438450 b  0.41848500
## 5  1.9883591 b  0.05216796
## 6  1.1103583 b -1.30507142
DT = data.table(x = rnorm(9), y = rep(c("a", "b", "c"), each = 3), z = rnorm(9))
head(DT)
##             x y           z
## 1:  0.9320433 a  0.31312968
## 2:  0.9076802 a -0.47774093
## 3: -0.3034488 a -0.35465277
## 4: -0.9983883 b -0.37324598
## 5:  0.8210649 b  0.07856847
## 6:  2.6389940 b  0.17203514

Judging from the printed output, the two look much alike, but they do differ.

Use the tables() function to list all data tables currently in memory:

tables()
##      NAME NROW NCOL MB COLS  KEY
## [1,] DT      9    3  1 x,y,z    
## Total: 1MB

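This shows some information about each table: its name, number of rows and columns, memory used, column names, and key.

Subsetting

Subsetting a data table is mostly similar to subsetting a data frame, but there are a few differences.

Selecting rows

Selecting rows works the same way as with a data frame: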

DT[2, ]
##            x y          z
## 1: 0.9076802 a -0.4777409
DF[2, ]
##          x y         z
## 2 1.535664 a -1.309161
DT[DT$y == "a", ]
##             x y          z
## 1:  0.9320433 a  0.3131297
## 2:  0.9076802 a -0.4777409
## 3: -0.3034488 a -0.3546528
DF[DF$y == "a", ]
##            x y           z
## 1 -1.2133384 a -0.86008803
## 2  1.5356639 a -1.30916114
## 3 -0.5450018 a  0.00868531

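The following operation differs from a data frame: with a single index and no comma, a data frame selects columns, whereas a data table selects rows: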

DT[c(2, 3)]
##             x y          z
## 1:  0.9076802 a -0.4777409
## 2: -0.3034488 a -0.3546528
DF[c(2, 3)]
##   y           z
## 1 a -0.86008803
## 2 a -1.30916114
## 3 a  0.00868531
## 4 b  0.41848500
## 5 b  0.05216796
## 6 b -1.30507142
## 7 c -0.47888247
## 8 c -0.22511917
## 9 c -2.71346937

Selecting columns

  • The subsetting function is modified for data.table
  • The argument you pass after the comma is called an “expression”
  • In R an expression is a collection of statements enclosed in curly brackets

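Selecting columns from a data table with the data frame syntax does not work in the data.table version used here, because the argument after the comma is evaluated as an expression. The results are shown below. (Newer releases, from 1.9.8 on, return the columns in this case.)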

DT[,c(2, 3)]
## [1] 2 3
DF[, c(2, 3)]
##   y           z
## 1 a -0.86008803
## 2 a -1.30916114
## 3 a  0.00868531
## 4 b  0.41848500
## 5 b  0.05216796
## 6 b -1.30507142
## 7 c -0.47888247
## 8 c -0.22511917
## 9 c -2.71346937
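For reference, here is a hedged sketch of a few ways to select columns that do not depend on how a bare numeric vector in j is treated. The small table is made up for illustration; list(), with = FALSE and .SDcols are standard data.table features.

```r
library(data.table)

# Illustrative data (not the random table used in the text)
DT <- data.table(x = 1:3, y = c("a", "b", "c"), z = c(0.1, 0.2, 0.3))

# Unquoted column names wrapped in list() return a data.table
by_name <- DT[, list(y, z)]

# A character vector of names, with = FALSE makes j behave like a data frame index
by_char <- DT[, c("y", "z"), with = FALSE]

# .SDcols restricts .SD (the "subset of data") to the chosen columns
by_sd <- DT[, .SD, .SDcols = c("y", "z")]

print(by_name)
```

All three calls return the columns y and z as a data.table.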

Computing on variables with expressions

Compared with the data frame equivalent, the following is rather convenient:

DT[, list(mean(x), sum(z))]
##           V1        V2
## 1: 0.5992462 -2.835869
DT[, table(y)]
## y
## a b c 
## 3 3 3
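The default result names V1 and V2 can be replaced by naming the list elements. A small sketch with made-up numbers, so the results are easy to check by hand:

```r
library(data.table)

# Illustrative data with easily verified results
DT <- data.table(x = c(1, 2, 3, 4), z = c(10, 20, 30, 40))

# Named list elements become the column names of the result
summary_dt <- DT[, list(mean_x = mean(x), sum_z = sum(z))]

# A j expression that is not a list returns a plain vector
mean_only <- DT[, mean(x)]

print(summary_dt)
```

Here summary_dt is a one-row data.table with columns mean_x and sum_z, while mean_only is an ordinary numeric value.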

Adding new columns

New columns are added with a new operator, :=

DT[, w := z ^ 2]
##             x y           z           w
## 1:  0.9320433 a  0.31312968 0.098050195
## 2:  0.9076802 a -0.47774093 0.228236396
## 3: -0.3034488 a -0.35465277 0.125778586
## 4: -0.9983883 b -0.37324598 0.139312563
## 5:  0.8210649 b  0.07856847 0.006173005
## 6:  2.6389940 b  0.17203514 0.029596088
## 7:  0.1165524 c  0.43671341 0.190718599
## 8:  0.5424233 c -0.62805803 0.394456890
## 9:  0.7362949 c -2.00261783 4.010478178

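There is a pitfall here, though: DT2 = DT does not make a copy, so modifying DT by reference with := also changes DT2.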

DT2 = DT
DT[, y := 2]
## Warning in `[.data.table`(DT, , `:=`(y, 2)): Coerced 'double' RHS to
## 'character' to match the column's type; may have truncated precision.
## Either change the target column to 'double' first (by creating a new
## 'double' vector length 9 (nrows of entire table) and assign that; i.e.
## 'replace' column), or coerce RHS to 'character' (e.g. 1L,
## NA_[real|integer]_, as.*, etc) to make your intent clear and for speed.
## Or, set the column type correctly up front when you create the table and
## stick to it, please.
##             x y           z           w
## 1:  0.9320433 2  0.31312968 0.098050195
## 2:  0.9076802 2 -0.47774093 0.228236396
## 3: -0.3034488 2 -0.35465277 0.125778586
## 4: -0.9983883 2 -0.37324598 0.139312563
## 5:  0.8210649 2  0.07856847 0.006173005
## 6:  2.6389940 2  0.17203514 0.029596088
## 7:  0.1165524 2  0.43671341 0.190718599
## 8:  0.5424233 2 -0.62805803 0.394456890
## 9:  0.7362949 2 -2.00261783 4.010478178
head(DT)
##             x y           z           w
## 1:  0.9320433 2  0.31312968 0.098050195
## 2:  0.9076802 2 -0.47774093 0.228236396
## 3: -0.3034488 2 -0.35465277 0.125778586
## 4: -0.9983883 2 -0.37324598 0.139312563
## 5:  0.8210649 2  0.07856847 0.006173005
## 6:  2.6389940 2  0.17203514 0.029596088
head(DT2, n = 3)
##             x y          z         w
## 1:  0.9320433 2  0.3131297 0.0980502
## 2:  0.9076802 2 -0.4777409 0.2282364
## 3: -0.3034488 2 -0.3546528 0.1257786
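One way to keep an independent version is data.table's copy() function. The sketch below uses a small made-up table; the names alias and snapshot are illustrative only.

```r
library(data.table)

DT <- data.table(x = 1:3, y = c("a", "b", "c"))

alias <- DT           # plain assignment: both names refer to the same table
snapshot <- copy(DT)  # copy() creates an independent table

# Modify DT by reference
DT[, y := "changed"]

print(alias$y)     # follows the change made to DT
print(snapshot$y)  # keeps the original values
```

Copying only when an independent table is really needed keeps the memory savings that updating by reference provides.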

If computing the new variable takes several expressions, wrap them in {}; only the value of the last expression is assigned.

DT[, m := {tmp = (x + z); log2(tmp + 5)}]
##             x y           z           w        m
## 1:  0.9320433 2  0.31312968 0.098050195 2.642742
## 2:  0.9076802 2 -0.47774093 0.228236396 2.440936
## 3: -0.3034488 2 -0.35465277 0.125778586 2.118326
## 4: -0.9983883 2 -0.37324598 0.139312563 1.859320
## 5:  0.8210649 2  0.07856847 0.006173005 2.560625
## 6:  2.6389940 2  0.17203514 0.029596088 2.965513
## 7:  0.1165524 2  0.43671341 0.190718599 2.473336
## 8:  0.5424233 2 -0.62805803 0.394456890 2.297005
## 9:  0.7362949 2 -2.00261783 4.010478178 1.900597

plyr-like operations

DT[, a := x > 0]
##             x y           z           w        m     a
## 1:  0.9320433 2  0.31312968 0.098050195 2.642742  TRUE
## 2:  0.9076802 2 -0.47774093 0.228236396 2.440936  TRUE
## 3: -0.3034488 2 -0.35465277 0.125778586 2.118326 FALSE
## 4: -0.9983883 2 -0.37324598 0.139312563 1.859320 FALSE
## 5:  0.8210649 2  0.07856847 0.006173005 2.560625  TRUE
## 6:  2.6389940 2  0.17203514 0.029596088 2.965513  TRUE
## 7:  0.1165524 2  0.43671341 0.190718599 2.473336  TRUE
## 8:  0.5424233 2 -0.62805803 0.394456890 2.297005  TRUE
## 9:  0.7362949 2 -2.00261783 4.010478178 1.900597  TRUE

Here we group by a:

DT[, b := mean(x + w), by = a]
##             x y           z           w        m     a         b
## 1:  0.9320433 2  0.31312968 0.098050195 2.642742  TRUE  1.664680
## 2:  0.9076802 2 -0.47774093 0.228236396 2.440936  TRUE  1.664680
## 3: -0.3034488 2 -0.35465277 0.125778586 2.118326 FALSE -0.518373
## 4: -0.9983883 2 -0.37324598 0.139312563 1.859320 FALSE -0.518373
## 5:  0.8210649 2  0.07856847 0.006173005 2.560625  TRUE  1.664680
## 6:  2.6389940 2  0.17203514 0.029596088 2.965513  TRUE  1.664680
## 7:  0.1165524 2  0.43671341 0.190718599 2.473336  TRUE  1.664680
## 8:  0.5424233 2 -0.62805803 0.394456890 2.297005  TRUE  1.664680
## 9:  0.7362949 2 -2.00261783 4.010478178 1.900597  TRUE  1.664680
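Grouping with by behaves differently depending on whether := is used. The following sketch, on a small made-up table with hypothetical columns g and v, contrasts the two forms:

```r
library(data.table)

DT <- data.table(g = c("a", "a", "b", "b", "b"),
                 v = c(1, 3, 2, 4, 6))

# With :=, the group result is recycled to every row; the row count is unchanged
DT[, g_mean := mean(v), by = g]

# Without :=, the result has one row per group
agg <- DT[, list(mean_v = mean(v), n = .N), by = g]

print(DT)
print(agg)
```

The first form is handy for adding group-level statistics next to the raw data. The second produces a summary table.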

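The special variable .N

.N is a special variable holding the number of rows in the current group, which makes it easy to count how often each value of a variable occurs.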

set.seed(123)
DT = data.table(x = sample(letters[1:3],
                           1E5, TRUE))
DT[, .N, by = x]
##    x     N
## 1: a 33387
## 2: c 33201
## 3: b 33412
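The groups are listed in order of first appearance, not alphabetically. A tiny hand-checkable sketch (the data are made up) shows this, along with two other uses of .N:

```r
library(data.table)

DT <- data.table(x = c("a", "b", "a", "c", "a", "b"))

# One row per group, in order of first appearance
counts <- DT[, .N, by = x]

# Combined with a row filter, .N counts the matching rows
n_a <- DT[x == "a", .N]

# Used as a row index, .N is the total row count, so this is the last row
last_row <- DT[.N]

print(counts)
```

The per-group counts agree with what base R's table(DT$x) reports, although table() sorts its categories.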

Keys

DT = data.table(x = rep(c("a", "b", "c"), each = 100),
                y = rnorm(300))
setkey(DT, x)
DT['a']
##      x           y
##   1: a  0.25958973
##   2: a  0.91751072
##   3: a -0.72231834
##   4: a -0.80828402
##   5: a -0.14135202
##   6: a  2.25701345
##   7: a -2.37955015
##   8: a -0.45425393
##   9: a -0.06007418
##  10: a  0.86090061
##  11: a -1.78466393
##  12: a -0.13074225
##  13: a -0.36983749
##  14: a -0.18065990
##  15: a -1.04973030
##  16: a  0.37831550
##  17: a -1.37079353
##  18: a -0.31611578
##  19: a  0.39435003
##  20: a -1.68987831
##  21: a -1.46233527
##  22: a  2.55837664
##  23: a  0.08788697
##  24: a  1.73141492
##  25: a  1.21512638
##  26: a  0.29954390
##  27: a -0.17245754
##  28: a  1.13249663
##  29: a  0.02319828
##  30: a  1.33587399
##  31: a -1.09879007
##  32: a -0.58176064
##  33: a  0.03892452
##  34: a  1.07315441
##  35: a  1.34969593
##  36: a  1.19527937
##  37: a -0.02217912
##  38: a  0.69849448
##  39: a  0.67240626
##  40: a -0.79164585
##  41: a -0.21790545
##  42: a  0.02307037
##  43: a  0.11539395
##  44: a -0.27708029
##  45: a  0.03688377
##  46: a  0.47520014
##  47: a  1.70748924
##  48: a  1.07600560
##  49: a -1.34571320
##  50: a -1.44024891
##  51: a -0.39392783
##  52: a  0.58106297
##  53: a -0.17078819
##  54: a -0.90585446
##  55: a  0.15621346
##  56: a -0.37322530
##  57: a -0.34587104
##  58: a -0.35828720
##  59: a -0.13306601
##  60: a -0.08959642
##  61: a  0.62793032
##  62: a -1.42882873
##  63: a  0.17255399
##  64: a -0.79115025
##  65: a  1.26204078
##  66: a -0.26940548
##  67: a  0.15698296
##  68: a -0.76059823
##  69: a  1.37060069
##  70: a  0.03758155
##  71: a  0.44949417
##  72: a  2.78868764
##  73: a -0.46848614
##  74: a  1.01260608
##  75: a -0.04374086
##  76: a  1.40669725
##  77: a  0.41992874
##  78: a  0.31008615
##  79: a  1.11904687
##  80: a -1.29814018
##  81: a -1.28248182
##  82: a  1.65942788
##  83: a  0.78374544
##  84: a  0.57771022
##  85: a -0.26724640
##  86: a -0.64569141
##  87: a -0.44952912
##  88: a -0.82619821
##  89: a  1.05503854
##  90: a -0.87926983
##  91: a -1.27712832
##  92: a -0.63412243
##  93: a  0.66470047
##  94: a -0.50958183
##  95: a  0.40736335
##  96: a  1.67774776
##  97: a -1.05205570
##  98: a -0.63690737
##  99: a  0.56539163
## 100: a  0.38015779
##      x           y

Joining two data tables

DT1 = data.table(x = c('a', 'a', 'b', 'dt1'),
                 y = 1:4)
DT2 = data.table(x = c('a', 'b', 'dt2'),
                 z = 5:7)
setkey(DT1, x); setkey(DT2, x)
merge(DT1, DT2)
##    x y z
## 1: a 1 5
## 2: a 2 5
## 3: b 3 6
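The default merge() keeps only keys present in both tables, which is why the rows for dt1 and dt2 disappear. A sketch of the other join types, using the same two tables and the standard all and all.x arguments of merge():

```r
library(data.table)

DT1 <- data.table(x = c("a", "a", "b", "dt1"), y = 1:4)
DT2 <- data.table(x = c("a", "b", "dt2"), z = 5:7)

# Inner join: only keys found in both tables
inner <- merge(DT1, DT2, by = "x")

# Full outer join: all keys from both tables, NA where there is no match
outer <- merge(DT1, DT2, by = "x", all = TRUE)

# Left join: all rows of DT1, NA in z where DT2 has no match
left <- merge(DT1, DT2, by = "x", all.x = TRUE)

print(outer)
```

Passing by = "x" explicitly means the tables do not need to be keyed first.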

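Speed comparison

Finally, let's run an experiment comparing how fast the data frame reader and the data.table reader load the same file, as a demonstration of data.table's strength (and to show that learning this package is not a waste of time).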

big_df <- data.frame(x=rnorm(1E6), y=rnorm(1E6))
file <- tempfile()
write.table(big_df, file=file, row.names=FALSE, col.names=TRUE, sep="\t", quote=FALSE)
system.time(fread(file))
## Read 67.0% of 1000000 rows
## Read 1000000 rows and 2 (of 2) columns from 0.035 GB file in 00:00:03
##    user  system elapsed 
##    2.81    0.01    2.92

fread is the data.table function for reading data into R. In the run above it took about 2.92 s of elapsed time.

system.time(read.table(file, header=TRUE, sep="\t"))
##    user  system elapsed 
##   14.79    0.19   15.38

The familiar read.table, reading the same data, took about 15.38 s of elapsed time in the run above.

Timing comparison (elapsed time from the runs shown above):

Method         Elapsed time (s)
read.table()   15.38
fread()        2.92
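Timings like these vary with the machine, the data.table version and the file, so it is worth repeating the comparison locally. The sketch below uses a smaller file (1e5 rows, an arbitrary choice) and also checks that both readers return the same data. It makes no claim about what the numbers will be.

```r
library(data.table)

set.seed(1)
n <- 1e5  # smaller than in the text so the sketch runs quickly
big_df <- data.frame(x = rnorm(n), y = rnorm(n))

path <- tempfile(fileext = ".tsv")
write.table(big_df, file = path, row.names = FALSE, sep = "\t", quote = FALSE)

# Time both readers on the same file and keep the results for comparison
t_fread <- system.time(dt <- fread(path))["elapsed"]
t_base <- system.time(df <- read.table(path, header = TRUE, sep = "\t"))["elapsed"]

# Both readers should produce the same shape and (numerically) the same values
same_shape <- identical(dim(dt), dim(df))
same_values <- isTRUE(all.equal(dt$x, df$x)) && isTRUE(all.equal(dt$y, df$y))

print(c(fread = unname(t_fread), read.table = unname(t_base)))
unlink(path)
```

A single system.time() call is a rough measure. Repeating the run a few times gives a more reliable picture.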