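data.table is a very useful R package for data processing; its speed and convenience have made it widely popular. Its key characteristic is this: data.table inherits from data.frame, so functions that work on a data.frame can generally be applied to a data.table as well. Because its core is written in C, it also runs much faster.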
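Now let's look at some of the features that make data.table fast and convenient for data processing.
If the data.table package is already installed on your machine, just load it: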
library(data.table)
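Creating data tables works the same way as creating data frames. We first create a data.frame, then a data.table with the same column definitions (the numeric values differ because rnorm() is called again):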
DF = data.frame(x = rnorm(9), y = rep(c("a", "b", "c"), each = 3), z = rnorm(9))
head(DF)
## x y z
## 1 -1.2133384 a -0.86008803
## 2 1.5356639 a -1.30916114
## 3 -0.5450018 a 0.00868531
## 4 0.3438450 b 0.41848500
## 5 1.9883591 b 0.05216796
## 6 1.1103583 b -1.30507142
DT = data.table(x = rnorm(9), y = rep(c("a", "b", "c"), each = 3), z = rnorm(9))
head(DT)
## x y z
## 1: 0.9320433 a 0.31312968
## 2: 0.9076802 a -0.47774093
## 3: -0.3034488 a -0.35465277
## 4: -0.9983883 b -0.37324598
## 5: 0.8210649 b 0.07856847
## 6: 2.6389940 b 0.17203514
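The inheritance mentioned at the start can be checked directly. The following is a minimal sketch on a small made-up table (`dt`, `id` and `value` are illustrative names, not objects from this post):

```r
library(data.table)

# Hypothetical toy table, not the DT used in the text
dt <- data.table(id = 1:3, value = c(2.5, 4.0, 6.5))

# A data.table carries both classes, so it also counts as a data.frame
classes <- class(dt)
is_df <- is.data.frame(dt)

# Base functions written for data frames still accept it
n_rows <- nrow(dt)
col_means <- colMeans(dt)
```

Because "data.frame" is among the classes, code that dispatches on data frames keeps working, while data.table's own `[` method adds the extra behaviour described in the rest of this post.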
Judging by the printed output the two look almost the same, but they do differ.
The tables() function lists all data tables currently in memory:
tables()
## NAME NROW NCOL MB COLS KEY
## [1,] DT 9 3 1 x,y,z
## Total: 1MB
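The output shows some information about each table: its name, number of rows and columns, memory used, column names and key.
Subsetting works much the same for data tables and data frames, with a few differences.
Selecting rows works just as it does for data frames: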
DT[2, ]
## x y z
## 1: 0.9076802 a -0.4777409
DF[2, ]
## x y z
## 2 1.535664 a -1.309161
DT[DT$y == "a", ]
## x y z
## 1: 0.9320433 a 0.3131297
## 2: 0.9076802 a -0.4777409
## 3: -0.3034488 a -0.3546528
DF[DF$y == "a", ]
## x y z
## 1 -1.2133384 a -0.86008803
## 2 1.5356639 a -1.30916114
## 3 -0.5450018 a 0.00868531
The next operation behaves differently: with a single index, a data frame selects columns while a data table selects rows:
DT[c(2, 3)]
## x y z
## 1: 0.9076802 a -0.4777409
## 2: -0.3034488 a -0.3546528
DF[c(2, 3)]
## y z
## 1 a -0.86008803
## 2 a -1.30916114
## 3 a 0.00868531
## 4 b 0.41848500
## 5 b 0.05216796
## 6 b -1.30507142
## 7 c -0.47888247
## 8 c -0.22511917
## 9 c -2.71346937
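Using the data frame idiom to select columns does not work on a data table in the version used here; j is evaluated as an expression, so the result is just the vector itself (data.table 1.9.8 and later return the columns instead):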
DT[,c(2, 3)]
## [1] 2 3
DF[, c(2, 3)]
## y z
## 1 a -0.86008803
## 2 a -1.30916114
## 3 a 0.00868531
## 4 b 0.41848500
## 5 b 0.05216796
## 6 b -1.30507142
## 7 c -0.47888247
## 8 c -0.22511917
## 9 c -2.71346937
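To select columns from a data table regardless of version, a few spellings are commonly used. The sketch below uses a small made-up table (`dt` and its values are illustrative assumptions):

```r
library(data.table)

# Hypothetical toy table
dt <- data.table(x = 1:3, y = c("a", "b", "c"), z = c(0.5, 1.5, 2.5))

# Unquoted column names wrapped in .() (an alias for list())
by_name <- dt[, .(y, z)]

# Character names or positions, with = FALSE so j is treated as a selector
by_string <- dt[, c("y", "z"), with = FALSE]
by_index <- dt[, c(2, 3), with = FALSE]
```

All three return a data table with the columns y and z. Selecting by name is usually the safer choice, since positions change when columns are added or removed.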
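Don't the following operations feel convenient compared with data frames? Expressions in j are evaluated with the columns as variables: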
DT[, list(mean(x), sum(z))]
## V1 V2
## 1: 0.5992462 -2.835869
DT[, table(y)]
## y
## a b c
## 3 3 3
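The V1 and V2 column names above are defaults. Naming the elements of the list gives the result readable column names, as in this small sketch (the table and the names `mean_x` and `sum_z` are made up for illustration):

```r
library(data.table)

# Hypothetical toy table
dt <- data.table(x = c(1, 2, 3, 4), z = c(10, 20, 30, 40))

# Named list elements become the column names of the result
res <- dt[, .(mean_x = mean(x), sum_z = sum(z))]
```

Here `res` is a one-row data table with the columns mean_x and sum_z.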
Adding a new column uses a new operator, :=
DT[, w := z ^ 2]
## x y z w
## 1: 0.9320433 a 0.31312968 0.098050195
## 2: 0.9076802 a -0.47774093 0.228236396
## 3: -0.3034488 a -0.35465277 0.125778586
## 4: -0.9983883 b -0.37324598 0.139312563
## 5: 0.8210649 b 0.07856847 0.006173005
## 6: 2.6389940 b 0.17203514 0.029596088
## 7: 0.1165524 c 0.43671341 0.190718599
## 8: 0.5424233 c -0.62805803 0.394456890
## 9: 0.7362949 c -2.00261783 4.010478178
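But there is a catch here: DT2 = DT does not copy the table. Both names refer to the same object, so a := on DT also changes DT2 (use copy(DT) to get an independent copy):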
DT2 = DT
DT[, y := 2]
## Warning in `[.data.table`(DT, , `:=`(y, 2)): Coerced 'double' RHS to
## 'character' to match the column's type; may have truncated precision.
## Either change the target column to 'double' first (by creating a new
## 'double' vector length 9 (nrows of entire table) and assign that; i.e.
## 'replace' column), or coerce RHS to 'character' (e.g. 1L,
## NA_[real|integer]_, as.*, etc) to make your intent clear and for speed.
## Or, set the column type correctly up front when you create the table and
## stick to it, please.
## x y z w
## 1: 0.9320433 2 0.31312968 0.098050195
## 2: 0.9076802 2 -0.47774093 0.228236396
## 3: -0.3034488 2 -0.35465277 0.125778586
## 4: -0.9983883 2 -0.37324598 0.139312563
## 5: 0.8210649 2 0.07856847 0.006173005
## 6: 2.6389940 2 0.17203514 0.029596088
## 7: 0.1165524 2 0.43671341 0.190718599
## 8: 0.5424233 2 -0.62805803 0.394456890
## 9: 0.7362949 2 -2.00261783 4.010478178
head(DT)
## x y z w
## 1: 0.9320433 2 0.31312968 0.098050195
## 2: 0.9076802 2 -0.47774093 0.228236396
## 3: -0.3034488 2 -0.35465277 0.125778586
## 4: -0.9983883 2 -0.37324598 0.139312563
## 5: 0.8210649 2 0.07856847 0.006173005
## 6: 2.6389940 2 0.17203514 0.029596088
head(DT2, n = 3)
## x y z w
## 1: 0.9320433 2 0.3131297 0.0980502
## 2: 0.9076802 2 -0.4777409 0.2282364
## 3: -0.3034488 2 -0.3546528 0.1257786
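The output above shows DT2 changing together with DT. A minimal sketch of how copy() avoids this, on a made-up table (`dt`, `alias` and `snapshot` are illustrative names):

```r
library(data.table)

# Hypothetical toy table
dt <- data.table(id = 1:3, y = c("a", "b", "c"))

alias <- dt           # plain assignment: another name for the same table
snapshot <- copy(dt)  # explicit copy: an independent table

# := modifies dt in place, by reference
dt[, y := "changed"]

alias_y <- alias$y        # follows the change
snapshot_y <- snapshot$y  # keeps the original values
```

Taking a copy() before a block of := updates is a reasonable habit whenever the original values are still needed afterwards.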
If computing the new variable takes several expressions, wrap them in {}; only the value of the last expression is assigned:
DT[, m := {tmp = (x + z); log2(tmp + 5)}]
## x y z w m
## 1: 0.9320433 2 0.31312968 0.098050195 2.642742
## 2: 0.9076802 2 -0.47774093 0.228236396 2.440936
## 3: -0.3034488 2 -0.35465277 0.125778586 2.118326
## 4: -0.9983883 2 -0.37324598 0.139312563 1.859320
## 5: 0.8210649 2 0.07856847 0.006173005 2.560625
## 6: 2.6389940 2 0.17203514 0.029596088 2.965513
## 7: 0.1165524 2 0.43671341 0.190718599 2.473336
## 8: 0.5424233 2 -0.62805803 0.394456890 2.297005
## 9: 0.7362949 2 -2.00261783 4.010478178 1.900597
plyr-like operations
DT[, a := x > 0]
## x y z w m a
## 1: 0.9320433 2 0.31312968 0.098050195 2.642742 TRUE
## 2: 0.9076802 2 -0.47774093 0.228236396 2.440936 TRUE
## 3: -0.3034488 2 -0.35465277 0.125778586 2.118326 FALSE
## 4: -0.9983883 2 -0.37324598 0.139312563 1.859320 FALSE
## 5: 0.8210649 2 0.07856847 0.006173005 2.560625 TRUE
## 6: 2.6389940 2 0.17203514 0.029596088 2.965513 TRUE
## 7: 0.1165524 2 0.43671341 0.190718599 2.473336 TRUE
## 8: 0.5424233 2 -0.62805803 0.394456890 2.297005 TRUE
## 9: 0.7362949 2 -2.00261783 4.010478178 1.900597 TRUE
Here we group by a:
DT[, b := mean(x + w), by = a]
## x y z w m a b
## 1: 0.9320433 2 0.31312968 0.098050195 2.642742 TRUE 1.664680
## 2: 0.9076802 2 -0.47774093 0.228236396 2.440936 TRUE 1.664680
## 3: -0.3034488 2 -0.35465277 0.125778586 2.118326 FALSE -0.518373
## 4: -0.9983883 2 -0.37324598 0.139312563 1.859320 FALSE -0.518373
## 5: 0.8210649 2 0.07856847 0.006173005 2.560625 TRUE 1.664680
## 6: 2.6389940 2 0.17203514 0.029596088 2.965513 TRUE 1.664680
## 7: 0.1165524 2 0.43671341 0.190718599 2.473336 TRUE 1.664680
## 8: 0.5424233 2 -0.62805803 0.394456890 2.297005 TRUE 1.664680
## 9: 0.7362949 2 -2.00261783 4.010478178 1.900597 TRUE 1.664680
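Combining := with by keeps every row and repeats the group value, as seen above. Dropping the := returns one row per group instead. A small sketch on made-up data (`g`, `v` and the result names are illustrative assumptions):

```r
library(data.table)

# Hypothetical toy table with two groups
dt <- data.table(g = c("a", "a", "b", "b", "b"), v = c(1, 3, 2, 4, 6))

# Aggregation: one row per group
summary_dt <- dt[, .(mean_v = mean(v)), by = g]

# Assignment by group: same number of rows, group mean repeated on each row
dt[, group_mean := mean(v), by = g]
```

The first form resembles a plyr-style summarise, the second a grouped mutate.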
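.N
The special variable .N holds the number of rows in each group, which makes it easy to count how often each value of a variable occurs: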
set.seed(123)
DT = data.table(x = sample(letters[1:3],
1E5, TRUE))
DT[, .N, by = x]
## x N
## 1: a 33387
## 2: c 33201
## 3: b 33412
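.N also works without by, and after a row filter. A small sketch with made-up values (the table `dt` here is illustrative, not the one above):

```r
library(data.table)

# Hypothetical toy table
dt <- data.table(x = c("a", "b", "a", "c", "a", "b"))

n_total <- dt[, .N]          # number of rows in the whole table
counts <- dt[, .N, by = x]   # rows per value of x, in order of first appearance
n_a <- dt[x == "a", .N]      # rows that satisfy a condition
```

When by is used, groups appear in the order in which they first occur in the data, which is why c comes before b in the output above.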
Setting a key with setkey() sorts the table by that column and lets you subset rows by key value:
DT = data.table(x = rep(c("a", "b", "c"), each = 100),
y = rnorm(300))
setkey(DT, x)
DT['a']
## x y
## 1: a 0.25958973
## 2: a 0.91751072
## 3: a -0.72231834
## 4: a -0.80828402
## 5: a -0.14135202
## 6: a 2.25701345
## 7: a -2.37955015
## 8: a -0.45425393
## 9: a -0.06007418
## 10: a 0.86090061
## 11: a -1.78466393
## 12: a -0.13074225
## 13: a -0.36983749
## 14: a -0.18065990
## 15: a -1.04973030
## 16: a 0.37831550
## 17: a -1.37079353
## 18: a -0.31611578
## 19: a 0.39435003
## 20: a -1.68987831
## 21: a -1.46233527
## 22: a 2.55837664
## 23: a 0.08788697
## 24: a 1.73141492
## 25: a 1.21512638
## 26: a 0.29954390
## 27: a -0.17245754
## 28: a 1.13249663
## 29: a 0.02319828
## 30: a 1.33587399
## 31: a -1.09879007
## 32: a -0.58176064
## 33: a 0.03892452
## 34: a 1.07315441
## 35: a 1.34969593
## 36: a 1.19527937
## 37: a -0.02217912
## 38: a 0.69849448
## 39: a 0.67240626
## 40: a -0.79164585
## 41: a -0.21790545
## 42: a 0.02307037
## 43: a 0.11539395
## 44: a -0.27708029
## 45: a 0.03688377
## 46: a 0.47520014
## 47: a 1.70748924
## 48: a 1.07600560
## 49: a -1.34571320
## 50: a -1.44024891
## 51: a -0.39392783
## 52: a 0.58106297
## 53: a -0.17078819
## 54: a -0.90585446
## 55: a 0.15621346
## 56: a -0.37322530
## 57: a -0.34587104
## 58: a -0.35828720
## 59: a -0.13306601
## 60: a -0.08959642
## 61: a 0.62793032
## 62: a -1.42882873
## 63: a 0.17255399
## 64: a -0.79115025
## 65: a 1.26204078
## 66: a -0.26940548
## 67: a 0.15698296
## 68: a -0.76059823
## 69: a 1.37060069
## 70: a 0.03758155
## 71: a 0.44949417
## 72: a 2.78868764
## 73: a -0.46848614
## 74: a 1.01260608
## 75: a -0.04374086
## 76: a 1.40669725
## 77: a 0.41992874
## 78: a 0.31008615
## 79: a 1.11904687
## 80: a -1.29814018
## 81: a -1.28248182
## 82: a 1.65942788
## 83: a 0.78374544
## 84: a 0.57771022
## 85: a -0.26724640
## 86: a -0.64569141
## 87: a -0.44952912
## 88: a -0.82619821
## 89: a 1.05503854
## 90: a -0.87926983
## 91: a -1.27712832
## 92: a -0.63412243
## 93: a 0.66470047
## 94: a -0.50958183
## 95: a 0.40736335
## 96: a 1.67774776
## 97: a -1.05205570
## 98: a -0.63690737
## 99: a 0.56539163
## 100: a 0.38015779
## x y
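What setkey() does can be seen more easily on a tiny table. The sketch below uses made-up data (`dt`, `via_key` and `via_filter` are illustrative names):

```r
library(data.table)

# Hypothetical toy table, deliberately unsorted
dt <- data.table(x = c("b", "a", "c", "a"), y = 1:4)

# setkey() reorders the rows by x in place and records x as the key
setkey(dt, x)
current_key <- key(dt)

# Subsetting by key value and by an ordinary filter select the same rows
via_key <- dt["a"]
via_filter <- dt[x == "a"]
```

For a large result such as the 100 rows above, wrapping the call in head(), as in head(DT['a']), keeps the printed output short.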
Data tables keyed on the same column can be joined with merge():
DT1 = data.table(x = c('a', 'a', 'b', 'dt1'),
y = 1:4)
DT2 = data.table(x = c('a', 'b', 'dt2'),
z = 5:7)
setkey(DT1, x); setkey(DT2, x)
merge(DT1, DT2)
## x y z
## 1: a 1 5
## 2: a 2 5
## 3: b 3 6
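By default merge() keeps only the keys found in both tables, which is why the dt1 and dt2 rows are missing above. The all and all.x arguments change that. A sketch using the same values as the example, in lower-case names so the originals are not overwritten:

```r
library(data.table)

# Same values as DT1 and DT2 in the text
dt1 <- data.table(x = c("a", "a", "b", "dt1"), y = 1:4)
dt2 <- data.table(x = c("a", "b", "dt2"), z = 5:7)

inner <- merge(dt1, dt2, by = "x")                # keys present in both
left <- merge(dt1, dt2, by = "x", all.x = TRUE)   # every row of dt1
full <- merge(dt1, dt2, by = "x", all = TRUE)     # every key from either table
```

Rows without a match get NA in the columns that come from the other table.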
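Finally, let's run an experiment that compares how fast the data frame and data table functions read in data, to show how powerful data tables are (and that learning this package was not a waste of time).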
big_df <- data.frame(x=rnorm(1E6), y=rnorm(1E6))
file <- tempfile()
write.table(big_df, file=file, row.names=FALSE, col.names=TRUE, sep="\t", quote=FALSE)
system.time(fread(file))
## Read 67.0% of 1000000 rows
## Read 1000000 rows and 2 (of 2) columns from 0.035 GB file in 00:00:03
## user system elapsed
## 2.81 0.01 2.92
fread is the data.table function for reading data into R; its elapsed time is the last value in the output above.
Now let's read the same data with the familiar read.table:
system.time(read.table(file, header=TRUE, sep="\t"))
## user system elapsed
## 14.79 0.19 15.38
Timing comparison:
Method | Time (s) |
---|---|
read.table() | 11.99 |
fread() | 2.23 |
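The comparison can be collected into a table programmatically. The following is a scaled-down sketch under stated assumptions: 1e4 rows instead of 1e6 so it runs quickly, fwrite() in place of write.table(), and illustrative object names. Absolute times depend on the machine and the data.table version, so no expected timings are given:

```r
library(data.table)

set.seed(1)
# Smaller than the 1E6 rows used in the text, to keep the run short
small_df <- data.frame(x = rnorm(1e4), y = rnorm(1e4))
tmp <- tempfile(fileext = ".tsv")
fwrite(small_df, tmp, sep = "\t")

# Time both readers on the same file and keep only the elapsed seconds
t_fread <- system.time(dt <- fread(tmp))[["elapsed"]]
t_base <- system.time(df <- read.table(tmp, header = TRUE, sep = "\t"))[["elapsed"]]

timings <- data.table(method = c("fread()", "read.table()"),
                      elapsed_s = c(t_fread, t_base))
unlink(tmp)
```

On a file this small the gap may be negligible; the difference shown in the text appears as the file grows.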