This document is a reference to help you understand the data.table structure.

First, I want to explain why I use the data.table package. Here is the result when I use a data.table.

> data_table
   year month_num num_obs
1: 2010         1      10
2: 2010         2       8
3: 2010         3       7
4: 2011         1       6
5: 2011         2       5
6: 2012         2       4
7: 2012         3       8
> str(data_table)
Classes 'data.table' and 'data.frame':  7 obs. of  3 variables:
 $ year     : num  2010 2010 2010 2011 2011 ...
 $ month_num: num  1 2 3 1 2 2 3
 $ num_obs  : num  10 8 7 6 5 4 8
 - attr(*, ".internal.selfref")=<externalptr> 
 - attr(*, "sorted")= chr  "year" "month_num"

And here is the result when I use a data.frame.

> data_frame
  year month_num num_obs
1 2010         1      10
2 2010         2       8
3 2010         3       7
4 2011         1       6
5 2011         2       5
6 2012         2       4
7 2012         3       8
> str(data_frame)
'data.frame':   7 obs. of  3 variables:
 $ year     : num  2010 2010 2010 2011 2011 ...
 $ month_num: num  1 2 3 1 2 2 3
 $ num_obs  : num  10 8 7 6 5 4 8

Can you spot the difference between the data.table and the data.frame? They look very similar, but they are not the same. The str() output shows that the data.table carries the extra class 'data.table' and a "sorted" attribute recording its key columns (year and month_num). data.frame and matrix are basic structures in R, so they are light and simple. If you want to build or restructure data without any extra packages, you can, but only by writing complicated code. That is why people use packages. For example, the ggplot2 package can make nicer graphs than the base plot function. In the same way, the data.table package helps users handle data easily.
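For readers who want to follow along, here is a minimal sketch that rebuilds the two objects printed above. The values are copied from the output; the way the key is set (the key argument of data.table()) is an assumption, since the original construction code is not shown.

```r
library(data.table)

# Same values as in the printed output above.
# Assumption: the "sorted" attribute came from setting a key on year and month_num.
data_table <- data.table(
  year      = c(2010, 2010, 2010, 2011, 2011, 2012, 2012),
  month_num = c(1, 2, 3, 1, 2, 2, 3),
  num_obs   = c(10, 8, 7, 6, 5, 4, 8),
  key       = c("year", "month_num")
)

# A plain data.frame with identical contents.
data_frame <- data.frame(
  year      = c(2010, 2010, 2010, 2011, 2011, 2012, 2012),
  month_num = c(1, 2, 3, 1, 2, 2, 3),
  num_obs   = c(10, 8, 7, 6, 5, 4, 8)
)

class(data_table)  # both "data.table" and "data.frame"
class(data_frame)  # only "data.frame"
key(data_table)    # the key columns stored in the "sorted" attribute
```

Because a data.table also inherits from data.frame, it can still be passed to functions that expect a data.frame.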


Now, I want to compare the results of the same code on both structures. (The error message in the output below means ‘unused argument’.)

> data_table[, list(month_num=seq.int(1, 3)), by=year]
   year month_num
1: 2010         1
2: 2010         2
3: 2010         3
4: 2011         1
5: 2011         2
6: 2011         3
7: 2012         1
8: 2012         2
9: 2012         3
> data_frame[, list(month_num=seq.int(1, 3)), by=year]
Error in `[.data.frame`(data_frame, , list(month_num = seq.int(1, 3)), : unused argument (by = year)

As you can see above, the same code raises an error on a data.frame. The by argument inside [ ] works only on a data.table.
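If the data starts out as a data.frame, one way around the error is to convert it first. This is a minimal sketch, assuming a data_frame with the same values as the one printed earlier; converted, result, and failed are illustrative names.

```r
library(data.table)

# Same values as the data.frame printed earlier.
data_frame <- data.frame(
  year      = c(2010, 2010, 2010, 2011, 2011, 2012, 2012),
  month_num = c(1, 2, 3, 1, 2, 2, 3),
  num_obs   = c(10, 8, 7, 6, 5, 4, 8)
)

# as.data.table() returns a data.table copy; data_frame itself is unchanged.
converted <- as.data.table(data_frame)
result <- converted[, list(month_num = seq.int(1, 3)), by = year]

# The plain data.frame still rejects `by`; tryCatch() records the failure
# instead of stopping the script.
failed <- tryCatch({
  data_frame[, list(month_num = seq.int(1, 3)), by = year]
  FALSE
}, error = function(e) TRUE)
```

as.data.table() leaves the original object alone. setDT() is the in-place alternative when a copy is not wanted.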


Why do I use list(...) in the code?

> data_table[, by=year]
   year month_num num_obs
1: 2010         1      10
2: 2010         2       8
3: 2010         3       7
4: 2011         1       6
5: 2011         2       5
6: 2012         2       4
7: 2012         3       8
> data_table[, seq.int(1, 3), by=year]
   year V1
1: 2010  1
2: 2010  2
3: 2010  3
4: 2011  1
5: 2011  2
6: 2011  3
7: 2012  1
8: 2012  2
9: 2012  3
> data_table[, month_num=seq.int(1, 3), by=year]
Error in `[.data.table`(data_table, , month_num = seq.int(1, 3), by = year): unused argument (month_num = seq.int(1, 3))
> data_table[, list(month_num=seq.int(1, 3)), by=year]
   year month_num
1: 2010         1
2: 2010         2
3: 2010         3
4: 2011         1
5: 2011         2
6: 2011         3
7: 2012         1
8: 2012         2
9: 2012         3

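The reason I use list(...) is that it defines the columns of the result. Without it, data_table[, by=year] simply returns every column, including month_num and num_obs, and data_table[, seq.int(1, 3), by=year] gives the new column the default name V1. Wrapping the expression in list(...) is a handy trick: I can name the column and build the table at the same time. It compresses the following two or three lines of code into one.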

> request <- data_table[, seq.int(1, 3), by=year]
> colnames(request)[2] <- 'month_num'
> request
   year month_num
1: 2010         1
2: 2010         2
3: 2010         3
4: 2011         1
5: 2011         2
6: 2011         3
7: 2012         1
8: 2012         2
9: 2012         3
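Two other spellings of the same idea, sketched under the assumption that data_table holds the values printed earlier: .() is data.table's shorthand for list() inside [ ], and setnames() renames a column by its old name rather than by position. The names short and request are only for illustration.

```r
library(data.table)

# Same values as the data.table printed earlier.
data_table <- data.table(
  year      = c(2010, 2010, 2010, 2011, 2011, 2012, 2012),
  month_num = c(1, 2, 3, 1, 2, 2, 3),
  num_obs   = c(10, 8, 7, 6, 5, 4, 8)
)

# .() is an alias for list() in the j position.
short <- data_table[, .(month_num = seq.int(1, 3)), by = year]

# Two-step version: build with the default name V1, then rename by reference.
request <- data_table[, seq.int(1, 3), by = year]
setnames(request, "V1", "month_num")
```

Renaming by name with setnames() keeps working if the column order changes, which the positional colnames(request)[2] version does not.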

Last, I want to show a general version of the request line. The code request <- data_table[, list(month_num=seq.int(1, 3)), by=year] has a problem if you don’t know the range of the data, because the range 1 to 3 is hard-coded.

> min(data_table$month_num)
[1] 1
> max(data_table$month_num)
[1] 3
> data_table[ , 
+            list(month_num=seq.int(min(data_table$month_num),
+                                   max(data_table$month_num))),
+            by=year]
   year month_num
1: 2010         1
2: 2010         2
3: 2010         3
4: 2011         1
5: 2011         2
6: 2011         3
7: 2012         1
8: 2012         2
9: 2012         3

I know… it is quite complicated code, but it will be useful when you build an automated system.
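To make the general version easier to reuse, it can be wrapped in a small function. This is only a sketch: month_grid is a hypothetical helper name, not part of data.table, and it assumes the table has year and month_num columns like the example data.

```r
library(data.table)

# Same values as the data.table printed earlier.
data_table <- data.table(
  year      = c(2010, 2010, 2010, 2011, 2011, 2012, 2012),
  month_num = c(1, 2, 3, 1, 2, 2, 3),
  num_obs   = c(10, 8, 7, 6, 5, 4, 8)
)

# Hypothetical helper: one row per year for every month between the
# smallest and largest month_num found in the data.
month_grid <- function(dt) {
  lo <- min(dt$month_num)  # computed once, not once per group
  hi <- max(dt$month_num)
  dt[, list(month_num = seq.int(lo, hi)), by = year]
}

request <- month_grid(data_table)
```

Computing the minimum and maximum before the grouped call keeps the line inside [ ] short, and the range follows the data automatically if new months appear.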