After installing the data.table function using install.packages('data.table') load the package -

> library(data.table)

Reading data

fread is short for fast read -

> statecsv <- fread("Data/statedata.csv")  # Reading CSV file
> head(statecsv)
         State state_code    region   governor                  psychRegions
1:     Alabama         AL     South Republican     Friendly and Conventional
2:     Arizona         AZ      West Republican          Relaxed and Creative
3:    Arkansas         AR     South Republican     Friendly and Conventional
4:  California         CA      West   Democrat          Relaxed and Creative
5:    Colorado         CO      West   Democrat     Friendly and Conventional
6: Connecticut         CT Northeast   Democrat Temperamental and Uninhibited
   extraversion agreeableness conscientiousness neuroticism openness instagram
1:         55.5          52.7              55.5        48.7     42.7      0.64
2:         50.6          46.6              58.4        38.1     54.7      0.18
3:         49.9          52.7              41.0        56.2     40.3      0.46
4:         51.4          49.0              43.2        39.1     65.0      1.47
5:         45.3          47.5              58.8        34.3     57.9     -1.04
6:         57.6          38.6              34.2        53.4     53.9      0.37
   facebook retweet entrepreneur  gdpr privacy university mortgage volunteering
1:     1.65    0.35         0.26 -0.77    0.58       1.74     1.41        -1.49
2:    -0.26   -0.57         0.56 -0.31   -0.45      -0.77     1.01         0.96
3:     1.10   -0.60         0.25 -0.60    0.69       0.02    -0.66        -1.19
4:    -0.42    0.48         0.50  1.12    0.23      -1.92    -0.88         0.57
5:    -1.06   -0.90         0.02  0.59   -0.22      -0.44     1.49         1.01
6:    -0.98    1.15         0.07  0.71    0.36       0.37     0.48         1.28
   museum scrapbook modernDance
1:  -1.11      0.06       -1.27
2:  -0.13      0.34        0.41
3:  -0.95      0.85       -1.44
4:   0.05     -0.69        0.38
5:   0.72     -0.34       -0.29
6:   1.18     -0.81        0.55
> class(statecsv)
[1] "data.table" "data.frame"

So, functions that accept a data.frame will work just fine on data.table as well.

A simple demonstration of how fast fread is:

> # Create a big data frame
> big_df <- data.frame(x = rnorm(10E5),
+                      y = runif(10E5))
> # create a temporary file 
> file <- tempfile()
> # write the data frame to the file
> write.table(big_df, file=file, col.names = T, sep="\t")

Now using read.table(), read the file and find the time it took using system.time():

> system.time(read.table(file, header=T, sep="\t"))
   user  system elapsed 
   7.29    0.14    7.74 
> system.time(fread(file, header=T, sep="\t"))
   user  system elapsed 
   0.28    0.05    0.22 

SO using fread(), data can be read in faster speed than the regular read.table() function.

Data Frame to Data Table conversion

Creating copy and convert

Using data.table(df) or as.data.table(df) -

> class(iris)
[1] "data.frame"
> iris2 <- data.table(iris)
> class(iris2)
[1] "data.table" "data.frame"
> iris3 <- as.data.table(iris)
> class(iris3)
[1] "data.table" "data.frame"

Get R Programming assignment help at www.homeworkhelponline.net.

Converting inplace

Using setDT(df) -

> iris4 <- copy(iris)
> class(iris4)
[1] "data.frame"

Converting inplace -

> setDT(iris4)
> class(iris4)
[1] "data.table" "data.frame"

Data Table to Data Frame conversion

Creating copy and convert

Using as.data.frame(dt) or data.frame(dt)-

> class(iris4)
[1] "data.table" "data.frame"
> iris5 <- data.frame(iris4)
> class(iris5)
[1] "data.frame"

Converting inplace

Using setDF(dt)-

> class(iris4)
[1] "data.table" "data.frame"
> setDF(iris4)
> class(iris4)
[1] "data.frame"

caution with Rownames

The built in mtcars data frame has rownames but data table rows do not contain any rownames.

> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
> class(mtcars)
[1] "data.frame"
> mtcars2 <- copy(mtcars)
> setDT(mtcars2)
> head(mtcars2)   # rownames are gone
    mpg cyl disp  hp drat    wt  qsec vs am gear carb
1: 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2: 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3: 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
4: 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
5: 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
6: 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

So remember to put the rownames in a column first -

> mtcars2$carname <- rownames(mtcars)
> head(mtcars2)
    mpg cyl disp  hp drat    wt  qsec vs am gear carb           carname
1: 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4         Mazda RX4
2: 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4     Mazda RX4 Wag
3: 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1        Datsun 710
4: 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1    Hornet 4 Drive
5: 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 Hornet Sportabout
6: 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1           Valiant

Filtering rows and columns

Gives all the columns where cyl is 4 and gear is 4 -

> mtcars2[cyl==4 & gear==4, ]
    mpg cyl  disp  hp drat    wt  qsec vs am gear carb        carname
1: 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1     Datsun 710
2: 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2      Merc 240D
3: 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2       Merc 230
4: 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1       Fiat 128
5: 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2    Honda Civic
6: 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1 Toyota Corolla
7: 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1      Fiat X1-9
8: 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2     Volvo 142E

To specify columns -

> mtcars2[cyl==4 & gear==4,c(12,4,1,8)]
          carname  hp  mpg vs
1:     Datsun 710  93 22.8  1
2:      Merc 240D  62 24.4  1
3:       Merc 230  95 22.8  1
4:       Fiat 128  66 32.4  1
5:    Honda Civic  52 30.4  1
6: Toyota Corolla  65 33.9  1
7:      Fiat X1-9  66 27.3  1
8:     Volvo 142E 109 21.4  1

The above code is equivalent to this one shown below. By supplying a list of column names we can get the specified columns -

> mtcars2[cyl==4 & gear==4, list(carname, hp, mpg, vs)]
          carname  hp  mpg vs
1:     Datsun 710  93 22.8  1
2:      Merc 240D  62 24.4  1
3:       Merc 230  95 22.8  1
4:       Fiat 128  66 32.4  1
5:    Honda Civic  52 30.4  1
6: Toyota Corolla  65 33.9  1
7:      Fiat X1-9  66 27.3  1
8:     Volvo 142E 109 21.4  1

Instead of using list, a simple . also does the work!

> mtcars2[cyl==4 & gear==4, .(carname, hp, mpg, vs)]
          carname  hp  mpg vs
1:     Datsun 710  93 22.8  1
2:      Merc 240D  62 24.4  1
3:       Merc 230  95 22.8  1
4:       Fiat 128  66 32.4  1
5:    Honda Civic  52 30.4  1
6: Toyota Corolla  65 33.9  1
7:      Fiat X1-9  66 27.3  1
8:     Volvo 142E 109 21.4  1

By specifying the column names as string we can pass a vector of strings to the columns place -

> mtcars2[cyl>=4 & gear>4, c('carname', 'hp', 'mpg', 'drat')]
          carname  hp  mpg drat
1:  Porsche 914-2  91 26.0 4.43
2:   Lotus Europa 113 30.4 3.77
3: Ford Pantera L 264 15.8 4.22
4:   Ferrari Dino 175 19.7 3.62
5:  Maserati Bora 335 15.0 3.54

If a column name isn’t specified directly in the function but in a vector outside the function then we have to set with=F:

> cols <- c('carname', 'hp', 'mpg', 'drat')
> mtcars2[, cols, with=F]
                carname  hp  mpg drat
 1:           Mazda RX4 110 21.0 3.90
 2:       Mazda RX4 Wag 110 21.0 3.90
 3:          Datsun 710  93 22.8 3.85
 4:      Hornet 4 Drive 110 21.4 3.08
 5:   Hornet Sportabout 175 18.7 3.15
 6:             Valiant 105 18.1 2.76
 7:          Duster 360 245 14.3 3.21
 8:           Merc 240D  62 24.4 3.69
 9:            Merc 230  95 22.8 3.92
10:            Merc 280 123 19.2 3.92
11:           Merc 280C 123 17.8 3.92
12:          Merc 450SE 180 16.4 3.07
13:          Merc 450SL 180 17.3 3.07
14:         Merc 450SLC 180 15.2 3.07
15:  Cadillac Fleetwood 205 10.4 2.93
16: Lincoln Continental 215 10.4 3.00
17:   Chrysler Imperial 230 14.7 3.23
18:            Fiat 128  66 32.4 4.08
19:         Honda Civic  52 30.4 4.93
20:      Toyota Corolla  65 33.9 4.22
21:       Toyota Corona  97 21.5 3.70
22:    Dodge Challenger 150 15.5 2.76
23:         AMC Javelin 150 15.2 3.15
24:          Camaro Z28 245 13.3 3.73
25:    Pontiac Firebird 175 19.2 3.08
26:           Fiat X1-9  66 27.3 4.08
27:       Porsche 914-2  91 26.0 4.43
28:        Lotus Europa 113 30.4 3.77
29:      Ford Pantera L 264 15.8 4.22
30:        Ferrari Dino 175 19.7 3.62
31:       Maserati Bora 335 15.0 3.54
32:          Volvo 142E 109 21.4 4.11
                carname  hp  mpg drat

Rename columns

Let’s look at the column names of the mtcars2 data table -

> names(mtcars2)
 [1] "mpg"     "cyl"     "disp"    "hp"      "drat"    "wt"      "qsec"   
 [8] "vs"      "am"      "gear"    "carb"    "carname"

Using setnames() renaming can be done inplace.
Let’s rename the carname to CarName:

> setnames(mtcars2, 'carname', 'CarName')
> names(mtcars2)
 [1] "mpg"     "cyl"     "disp"    "hp"      "drat"    "wt"      "qsec"   
 [8] "vs"      "am"      "gear"    "carb"    "CarName"

Adding new column

Let’s take only the columns CarName, mpg, cyl, hp, gear to the data table named dt:

> dt <- mtcars2[,list(CarName, mpg, cyl, hp, gear)]
> head(dt)
             CarName  mpg cyl  hp gear
1:         Mazda RX4 21.0   6 110    4
2:     Mazda RX4 Wag 21.0   6 110    4
3:        Datsun 710 22.8   4  93    4
4:    Hornet 4 Drive 21.4   6 110    3
5: Hornet Sportabout 18.7   8 175    3
6:           Valiant 18.1   6 105    3

Let’s add a new column to the table which shows kmpg (km per gallon) calculated from the column mpg:

> dtcopy <- copy(dt)
> dtcopy[,kmpg := mpg * 1.6]
> head(dtcopy)
             CarName  mpg cyl  hp gear  kmpg
1:         Mazda RX4 21.0   6 110    4 33.60
2:     Mazda RX4 Wag 21.0   6 110    4 33.60
3:        Datsun 710 22.8   4  93    4 36.48
4:    Hornet 4 Drive 21.4   6 110    3 34.24
5: Hornet Sportabout 18.7   8 175    3 29.92
6:           Valiant 18.1   6 105    3 28.96

To add multiple columns:

> dtcopy <- copy(dt)
> dtcopy[,`:=`(kmpg = mpg*1.6,
+              bhp = hp*0.986,
+              mileage_type = ifelse(mpg > 22, 'high', 'low'))]  
> # bhp is distinct from hp because it takes into account power loss due to friction
> head(dtcopy)
             CarName  mpg cyl  hp gear  kmpg     bhp mileage_type
1:         Mazda RX4 21.0   6 110    4 33.60 108.460          low
2:     Mazda RX4 Wag 21.0   6 110    4 33.60 108.460          low
3:        Datsun 710 22.8   4  93    4 36.48  91.698         high
4:    Hornet 4 Drive 21.4   6 110    3 34.24 108.460          low
5: Hornet Sportabout 18.7   8 175    3 29.92 172.550          low
6:           Valiant 18.1   6 105    3 28.96 103.530          low

To see only the columns that is added new to the data table use list or . symbol:

> dt1 <-  dt[, .(CarName,
+                kmpg = mpg*1.6,
+                bhp = hp*0.986)]
> head(dt1)
             CarName  kmpg     bhp
1:         Mazda RX4 33.60 108.460
2:     Mazda RX4 Wag 33.60 108.460
3:        Datsun 710 36.48  91.698
4:    Hornet 4 Drive 34.24 108.460
5: Hornet Sportabout 29.92 172.550
6:           Valiant 28.96 103.530

Using lag and lead

> pdt <- data.table(Date=seq(as.Date("2010/1/1"), as.Date("2019/1/1"), "years"),
+                  Prod=round(runif(10,1000,20000)))
> pdt
          Date  Prod
 1: 2010-01-01  8121
 2: 2011-01-01 11675
 3: 2012-01-01 12676
 4: 2013-01-01 13028
 5: 2014-01-01  3545
 6: 2015-01-01 10370
 7: 2016-01-01  1105
 8: 2017-01-01 18542
 9: 2018-01-01  9225
10: 2019-01-01  8349
> pdt[ , Lag := shift(Prod, 1, type="lag")]  # lag by 1
> pdt
          Date  Prod   Lag
 1: 2010-01-01  8121    NA
 2: 2011-01-01 11675  8121
 3: 2012-01-01 12676 11675
 4: 2013-01-01 13028 12676
 5: 2014-01-01  3545 13028
 6: 2015-01-01 10370  3545
 7: 2016-01-01  1105 10370
 8: 2017-01-01 18542  1105
 9: 2018-01-01  9225 18542
10: 2019-01-01  8349  9225
> pdt[ , Lead := shift(Prod, 1, type="lead")]  # lead by 1
> pdt
          Date  Prod   Lag  Lead
 1: 2010-01-01  8121    NA 11675
 2: 2011-01-01 11675  8121 12676
 3: 2012-01-01 12676 11675 13028
 4: 2013-01-01 13028 12676  3545
 5: 2014-01-01  3545 13028 10370
 6: 2015-01-01 10370  3545  1105
 7: 2016-01-01  1105 10370 18542
 8: 2017-01-01 18542  1105  9225
 9: 2018-01-01  9225 18542  8349
10: 2019-01-01  8349  9225    NA
> pdt[ , `:=`(Lag3 = shift(Prod, 3, type="lag"),
+             Lead2 = shift(Prod, 2, type="lead"))]
> pdt
          Date  Prod   Lag  Lead  Lag3 Lead2
 1: 2010-01-01  8121    NA 11675    NA 12676
 2: 2011-01-01 11675  8121 12676    NA 13028
 3: 2012-01-01 12676 11675 13028    NA  3545
 4: 2013-01-01 13028 12676  3545  8121 10370
 5: 2014-01-01  3545 13028 10370 11675  1105
 6: 2015-01-01 10370  3545  1105 12676 18542
 7: 2016-01-01  1105 10370 18542 13028  9225
 8: 2017-01-01 18542  1105  9225  3545  8349
 9: 2018-01-01  9225 18542  8349 10370    NA
10: 2019-01-01  8349  9225    NA  1105    NA

Delete columns

Using ! to delete single or multiple columns:

> colnames(mtcars2)
 [1] "mpg"     "cyl"     "disp"    "hp"      "drat"    "wt"      "qsec"   
 [8] "vs"      "am"      "gear"    "carb"    "CarName"
> del <- mtcars2[,!'vs', with=F]
> colnames(del)
 [1] "mpg"     "cyl"     "disp"    "hp"      "drat"    "wt"      "qsec"   
 [8] "am"      "gear"    "carb"    "CarName"

Using variables to assign the column names and passing it to the function: (use with=F)

> dropcols <- c('drat','wt','qsec','am','carb')
> colnames(mtcars2)
 [1] "mpg"     "cyl"     "disp"    "hp"      "drat"    "wt"      "qsec"   
 [8] "vs"      "am"      "gear"    "carb"    "CarName"
> del <- mtcars2[,!dropcols, with=F]
> colnames(del)
[1] "mpg"     "cyl"     "disp"    "hp"      "vs"      "gear"    "CarName"

Another way is to using := to assign and setting the value to NULL :

> mtcars3 <- copy(mtcars2)
> colnames(mtcars3)
 [1] "mpg"     "cyl"     "disp"    "hp"      "drat"    "wt"      "qsec"   
 [8] "vs"      "am"      "gear"    "carb"    "CarName"
> mtcars3[, c('vs','wt','am') := NULL]
> colnames(mtcars3)
[1] "mpg"     "cyl"     "disp"    "hp"      "drat"    "qsec"    "gear"   
[8] "carb"    "CarName"

Grouping

Let’s group by gear and calculate the mean of mpg:

> dt[, .(mean_mpg = mean(mpg)), by=gear]
   gear mean_mpg
1:    4 24.53333
2:    3 16.10667
3:    5 21.38000
> head(dt)
             CarName  mpg cyl  hp gear
1:         Mazda RX4 21.0   6 110    4
2:     Mazda RX4 Wag 21.0   6 110    4
3:        Datsun 710 22.8   4  93    4
4:    Hornet 4 Drive 21.4   6 110    3
5: Hornet Sportabout 18.7   8 175    3
6:           Valiant 18.1   6 105    3

Let’s add the new column to the data set:

> dt2 <- copy(dt)
> dt2[, mean_mpg := round(mean(mpg),3), by=gear]
> head(dt2)
             CarName  mpg cyl  hp gear mean_mpg
1:         Mazda RX4 21.0   6 110    4   24.533
2:     Mazda RX4 Wag 21.0   6 110    4   24.533
3:        Datsun 710 22.8   4  93    4   24.533
4:    Hornet 4 Drive 21.4   6 110    3   16.107
5: Hornet Sportabout 18.7   8 175    3   16.107
6:           Valiant 18.1   6 105    3   16.107

We can also group by multiple columns:

> dt2 <- copy(dt)
> dt2[, .(mean_mpg = mean(mpg)), by=.(gear, cyl)]
   gear cyl mean_mpg
1:    4   6   19.750
2:    4   4   26.925
3:    3   6   19.750
4:    3   8   15.050
5:    3   4   21.500
6:    5   4   28.200
7:    5   8   15.400
8:    5   6   19.700

.N

> dt[, .N, by=cyl]
   cyl  N
1:   6  7
2:   4 11
3:   8 14
> dt[, .N, by=gear]
   gear  N
1:    4 12
2:    3 15
3:    5  5

Order

In descending order:

> dt[, .N, by=gear][order(N, decreasing = T),]
   gear  N
1:    3 15
2:    4 12
3:    5  5

In Ascending order:

> dt[, .N, by=gear][order(N, decreasing = F),]
   gear  N
1:    5  5
2:    4 12
3:    3 15

Chaining

Using previous result to calculate the later:

> dtcopy[,`:=`(kmpg = mpg*1.6,
+              bhp = hp*0.986)][, mileage_type := ifelse(kmpg > 30, 
+                                                        'high', 'low')]
> head(dtcopy)
             CarName  mpg cyl  hp gear  kmpg     bhp mileage_type
1:         Mazda RX4 21.0   6 110    4 33.60 108.460         high
2:     Mazda RX4 Wag 21.0   6 110    4 33.60 108.460         high
3:        Datsun 710 22.8   4  93    4 36.48  91.698         high
4:    Hornet 4 Drive 21.4   6 110    3 34.24 108.460         high
5: Hornet Sportabout 18.7   8 175    3 29.92 172.550          low
6:           Valiant 18.1   6 105    3 28.96 103.530          low

Setting Key

Setting one or more keys on a data.table enables it to perform binary search, which is many order of magnitudes faster than linear search, especially for large data.
Using setkey() key can be set:

> setkey(dt, CarName)

To check the keys for a data table:

> key(dt)
[1] "CarName"

To remove keys:

> setkey(dt, NULL)

Setting multiple keys:

> setkey(dt, CarName, cyl)
> # It first sorts in CarName's ascending order then in cyl's ascending order
> key(dt)
[1] "CarName" "cyl"    

Using key we can filter/find rows that satisfies specific conditions too fast

> DT <- data.table(X = rep(c('a','b','c'), each=10),Y = rnorm(30))
> setkey(DT, X)
> key(DT)
[1] "X"
> DT['c']
    X          Y
 1: c  1.4711518
 2: c -0.1449743
 3: c -0.3280360
 4: c -1.4182936
 5: c  0.9825383
 6: c  1.0700048
 7: c  0.4630494
 8: c -0.7679049
 9: c -1.6136755
10: c -0.9212752

Merge

> setkey(dt, CarName)
> key(dt)
[1] "CarName"

Let’s make two tables:

> dt1 <- dt[c(1:12,15,18), .(CarName, mpg, cyl)]
> dt2 <- dt[c(4:10,14:16), .(CarName, gear)]
> key(dt1);key(dt2)
[1] "CarName"
[1] "CarName"

Inner Join

Since we have already defined the key we can easily merge the data tables without defining key again:

> merge(dt1, dt2)
               CarName  mpg cyl gear
1:   Chrysler Imperial 14.7   8    3
2:          Datsun 710 22.8   4    4
3:    Dodge Challenger 15.5   8    3
4:          Duster 360 14.3   8    3
5:        Ferrari Dino 19.7   6    5
6:            Fiat 128 32.4   4    4
7:           Fiat X1-9 27.3   4    4
8: Lincoln Continental 10.4   8    3

Outer Join

> merge(dt1, dt2, all=T)
                CarName  mpg cyl gear
 1:         AMC Javelin 15.2   8   NA
 2:  Cadillac Fleetwood 10.4   8   NA
 3:          Camaro Z28 13.3   8   NA
 4:   Chrysler Imperial 14.7   8    3
 5:          Datsun 710 22.8   4    4
 6:    Dodge Challenger 15.5   8    3
 7:          Duster 360 14.3   8    3
 8:        Ferrari Dino 19.7   6    5
 9:            Fiat 128 32.4   4    4
10:           Fiat X1-9 27.3   4    4
11:      Ford Pantera L 15.8   8   NA
12:         Honda Civic 30.4   4   NA
13:   Hornet Sportabout   NA  NA    3
14: Lincoln Continental 10.4   8    3
15:        Lotus Europa   NA  NA    5
16:           Mazda RX4 21.0   6   NA

Left join

> merge(dt1, dt2, all.x=T)
                CarName  mpg cyl gear
 1:         AMC Javelin 15.2   8   NA
 2:  Cadillac Fleetwood 10.4   8   NA
 3:          Camaro Z28 13.3   8   NA
 4:   Chrysler Imperial 14.7   8    3
 5:          Datsun 710 22.8   4    4
 6:    Dodge Challenger 15.5   8    3
 7:          Duster 360 14.3   8    3
 8:        Ferrari Dino 19.7   6    5
 9:            Fiat 128 32.4   4    4
10:           Fiat X1-9 27.3   4    4
11:      Ford Pantera L 15.8   8   NA
12:         Honda Civic 30.4   4   NA
13: Lincoln Continental 10.4   8    3
14:           Mazda RX4 21.0   6   NA

Right join

> merge(dt1, dt2, all.y=T)
                CarName  mpg cyl gear
 1:   Chrysler Imperial 14.7   8    3
 2:          Datsun 710 22.8   4    4
 3:    Dodge Challenger 15.5   8    3
 4:          Duster 360 14.3   8    3
 5:        Ferrari Dino 19.7   6    5
 6:            Fiat 128 32.4   4    4
 7:           Fiat X1-9 27.3   4    4
 8:   Hornet Sportabout   NA  NA    3
 9: Lincoln Continental 10.4   8    3
10:        Lotus Europa   NA  NA    5

Pivot Table

dcast.data.table(dt, x ~ y, fun.aggregate = list(functions), value.var='column name of value'):

> dcast.data.table(mtcars2, gear ~ cyl, fun.aggregate = list(length), value.var = 'mpg')
   gear 4 6  8
1:    3 1 2 12
2:    4 8 4  0
3:    5 2 1  2
> dcast.data.table(mtcars2, cyl ~ vs, 
+                  fun.aggregate = list(sum, mean), 
+                  value.var = 'mpg')
   cyl mpg_sum_0 mpg_sum_1 mpg_mean_0 mpg_mean_1
1:   4      26.0     267.3   26.00000     26.730
2:   6      61.7      76.5   20.56667     19.125
3:   8     211.4       0.0   15.10000        NaN

Set

set(dt, i, j, value)
It is usually used in for-loops and is literally thousands of times faster.

Here is a demonstration of how fast the set() function is compared to other functions:

> m = matrix(1,nrow=100000,ncol=100)
> DF = as.data.frame(m)
> DT = as.data.table(m)  
> 
> system.time(for (i in 1:10000) DF[i,1] <- i)
   user  system elapsed 
   2.73    1.72    4.47 
> system.time(for (i in 1:10000) set(DT,i,1L,i))
   user  system elapsed 
   0.03    0.00    0.03 

Here you can find more about the data.table package.

---
title: "Manipulation with data.table"
author: "Md Ahsanul Islam"
output: 
  html_document:
    toc: true
    toc_float: true
    toc_depth: 4
    theme: cerulean
    code_download: true
---

---
```{r, include=FALSE}
knitr::opts_chunk$set(
  comment = "", prompt = TRUE, message=F, warning = F
)
```


After installing the `data.table` function using `install.packages('data.table')` load the package -
```{r}
library(data.table)
```

## Reading data

`fread` is short for fast read -
```{r}
statecsv <- fread("Data/statedata.csv")  # Reading CSV file
head(statecsv)
```

```{r}
class(statecsv)
```
So, functions that accept a `data.frame` will work just fine on `data.table` as well.

A simple demonstration of how fast fread is:
```{r}
# Create a big data frame
big_df <- data.frame(x = rnorm(10E5),
                     y = runif(10E5))
# create a temporary file 
file <- tempfile()
# write the data frame to the file
write.table(big_df, file=file, col.names = T, sep="\t")
```

Now using `read.table()`, read the file and find the time it took using system.time():

```{r}
system.time(read.table(file, header=T, sep="\t"))
```

```{r}
system.time(fread(file, header=T, sep="\t"))
```

SO using `fread()`, data can be read in faster speed than the regular `read.table()` function.


## Data Frame to Data Table conversion

### Creating copy and convert
Using data.table(df) or as.data.table(df) -
```{r}
class(iris)
```

```{r}
iris2 <- data.table(iris)
class(iris2)
```

```{r}
iris3 <- as.data.table(iris)
class(iris3)
```

Get R Programming assignment help at [www.homeworkhelponline.net](https://www.homeworkhelponline.net "R programming assignment help").

### Converting inplace

Using `setDT(df)` -

```{r}
iris4 <- copy(iris)
class(iris4)
```

Converting inplace -
```{r}
setDT(iris4)
class(iris4)
```

## Data Table to Data Frame conversion

### Creating copy and convert
Using `as.data.frame(dt)` or `data.frame(dt)`-
```{r}
class(iris4)
iris5 <- data.frame(iris4)
class(iris5)
```

### Converting inplace

Using `setDF(dt)`-
```{r}
class(iris4)
setDF(iris4)
class(iris4)
```

## caution with Rownames

The built in mtcars data frame has rownames but data table rows do not contain any rownames. 

```{r}
head(mtcars)
```

```{r}
class(mtcars)
```

```{r}
mtcars2 <- copy(mtcars)
setDT(mtcars2)
head(mtcars2)   # rownames are gone
```

So remember to put the rownames in a column first -
```{r}
mtcars2$carname <- rownames(mtcars)
head(mtcars2)
```

## Filtering rows and columns

Gives all the columns where cyl is 4 and gear is 4 -
```{r}
mtcars2[cyl==4 & gear==4, ]
```

To specify columns -
```{r}
mtcars2[cyl==4 & gear==4,c(12,4,1,8)]
```

The above code is equivalent to this one shown below. By supplying a list of column names we can get the specified columns -
```{r}
mtcars2[cyl==4 & gear==4, list(carname, hp, mpg, vs)]
```

Instead of using list, a simple . also does the work!
```{r}
mtcars2[cyl==4 & gear==4, .(carname, hp, mpg, vs)]
```

By specifying the column names as string we can pass a vector of strings to the columns place -
```{r}
mtcars2[cyl>=4 & gear>4, c('carname', 'hp', 'mpg', 'drat')]
```

If a column name isn't specified directly in the function but in a vector outside the function then we have to set with=F:

```{r}
cols <- c('carname', 'hp', 'mpg', 'drat')
mtcars2[, cols, with=F]
```

## Rename columns

Let's look at the column names of the mtcars2 data table -
```{r}
names(mtcars2)
```
Using `setnames()` renaming can be done `inplace`.  
Let's rename the `carname` to `CarName`:
```{r}
setnames(mtcars2, 'carname', 'CarName')
names(mtcars2)
```

## Adding new column

Let's take only the columns `CarName`, `mpg`, `cyl`, `hp`, `gear` to the data table named dt:
```{r}
dt <- mtcars2[,list(CarName, mpg, cyl, hp, gear)]
head(dt)
```

Let's add a new column to the table which shows kmpg (km per gallon) calculated from the column mpg:
```{r}
dtcopy <- copy(dt)
dtcopy[,kmpg := mpg * 1.6]
head(dtcopy)
```

To add multiple columns:
```{r}
dtcopy <- copy(dt)
dtcopy[,`:=`(kmpg = mpg*1.6,
             bhp = hp*0.986,
             mileage_type = ifelse(mpg > 22, 'high', 'low'))]  
# bhp is distinct from hp because it takes into account power loss due to friction
head(dtcopy)
```

To see only the columns that is added new to the data table use list or . symbol:
```{r}
dt1 <-  dt[, .(CarName,
               kmpg = mpg*1.6,
               bhp = hp*0.986)]
head(dt1)
```


### Using lag and lead
```{r}
pdt <- data.table(Date=seq(as.Date("2010/1/1"), as.Date("2019/1/1"), "years"),
                 Prod=round(runif(10,1000,20000)))
pdt
pdt[ , Lag := shift(Prod, 1, type="lag")]  # lag by 1
pdt
pdt[ , Lead := shift(Prod, 1, type="lead")]  # lead by 1
pdt
pdt[ , `:=`(Lag3 = shift(Prod, 3, type="lag"),
            Lead2 = shift(Prod, 2, type="lead"))]
pdt
```

## Delete columns

Using ! to delete single or multiple columns:

```{r}
colnames(mtcars2)
del <- mtcars2[,!'vs', with=F]
colnames(del)
```

Using variables to assign the column names and passing it to the function: (use with=F)
```{r}
dropcols <- c('drat','wt','qsec','am','carb')
colnames(mtcars2)
del <- mtcars2[,!dropcols, with=F]
colnames(del)
```


Another way is to using := to assign and setting the value to NULL :
```{r}
mtcars3 <- copy(mtcars2)
colnames(mtcars3)
mtcars3[, c('vs','wt','am') := NULL]
colnames(mtcars3)
```

## Grouping

Let's group by gear and calculate the mean of mpg:
```{r}
dt[, .(mean_mpg = mean(mpg)), by=gear]
```


```{r}
head(dt)
```

Let's add the new column to the data set:
```{r}
dt2 <- copy(dt)
dt2[, mean_mpg := round(mean(mpg),3), by=gear]
head(dt2)
```

We can also group by multiple columns:
```{r}
dt2 <- copy(dt)
dt2[, .(mean_mpg = mean(mpg)), by=.(gear, cyl)]
```

## .N

```{r}
dt[, .N, by=cyl]
dt[, .N, by=gear]
```

## Order

In descending order:
```{r}
dt[, .N, by=gear][order(N, decreasing = T),]
```

In Ascending order:
```{r}
dt[, .N, by=gear][order(N, decreasing = F),]
```

## Chaining

Using previous result to calculate the later:
```{r}
dtcopy[,`:=`(kmpg = mpg*1.6,
             bhp = hp*0.986)][, mileage_type := ifelse(kmpg > 30, 
                                                       'high', 'low')]
head(dtcopy)
```

## Setting Key

Setting one or more keys on a data.table enables it to perform binary search, which is many order of magnitudes faster than linear search, especially for large data.   
Using `setkey()` key can be set:
```{r}
setkey(dt, CarName)
```

To check the keys for a data table:
```{r}
key(dt)
```

To remove keys:
```{r}
setkey(dt, NULL)
```

Setting multiple keys:
```{r}
setkey(dt, CarName, cyl)
# It first sorts in CarName's ascending order then in cyl's ascending order
key(dt)
```

Using key we can filter/find rows that satisfies specific conditions too fast
```{r}
DT <- data.table(X = rep(c('a','b','c'), each=10),Y = rnorm(30))
setkey(DT, X)
key(DT)
DT['c']
```


### Merge

```{r}
setkey(dt, CarName)
key(dt)
```

Let's make two tables:
```{r}
dt1 <- dt[c(1:12,15,18), .(CarName, mpg, cyl)]
dt2 <- dt[c(4:10,14:16), .(CarName, gear)]
key(dt1);key(dt2)
```


#### Inner Join

Since we have already defined the key we can easily merge the data tables without defining key again:
```{r}
merge(dt1, dt2)
```

#### Outer Join

```{r}
merge(dt1, dt2, all=T)
```

#### Left join

```{r}
merge(dt1, dt2, all.x=T)
```

#### Right join

```{r}
merge(dt1, dt2, all.y=T)
```

## Pivot Table

`dcast.data.table(dt, x ~ y, fun.aggregate = list(functions), value.var='column name of value')`:

```{r}
dcast.data.table(mtcars2, gear ~ cyl, fun.aggregate = list(length), value.var = 'mpg')
```

```{r}
dcast.data.table(mtcars2, cyl ~ vs, 
                 fun.aggregate = list(sum, mean), 
                 value.var = 'mpg')
```

## Set

`set(dt, i, j, value)`   
It is usually used in for-loops and is literally thousands of times faster.

Here is a demonstration of how fast the set() function is compared to other functions:
```{r}
m = matrix(1,nrow=100000,ncol=100)
DF = as.data.frame(m)
DT = as.data.table(m)  

system.time(for (i in 1:10000) DF[i,1] <- i)

system.time(for (i in 1:10000) set(DT,i,1L,i))
```

[Here](https://www.machinelearningplus.com/data-manipulation/datatable-in-r-complete-guide/) you can find more about the data.table package.