1. What are tibbles?

tibble

Tibbles are a modern take on data frames. They keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating (i.e. converting character vectors to factors).

https://tibble.tidyverse.org/articles/tibble.html, Source: vignettes/tibble.Rmd

Alternatively, a description aimed at beginners:

Tibbles are data.frames that are lazy and surly: they do less (i.e. they don’t change variable names or types, and don’t do partial matching) and complain more (e.g. when a variable does not exist).

2. Usage

tibble() is a nice way to create data frames. It encapsulates best practices for data frames:

2.1. Create a tibble from an existing object with as_tibble(), which coerces objects into tibbles.

as_tibble() is much simpler than as.data.frame().

library(tibble)

data <- data.frame(a = 1:3, b = letters[1:3], c = Sys.Date() - 1:3)
data

##   a b          c
## 1 1 a 2022-08-06
## 2 2 b 2022-08-05
## 3 3 c 2022-08-04

as_tibble(data)

## # A tibble: 3 × 3
##       a b     c         
##   <int> <chr> <date>    
## 1     1 a     2022-08-06
## 2     2 b     2022-08-05
## 3     3 c     2022-08-04

This will work for reasonable inputs that are already data.frames, lists, matrices, or tables.
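
As a minimal sketch of the non-data-frame inputs mentioned above, the following coerces a named list and a matrix. The object names (from_list, from_matrix, m) are made up for illustration.

```r
library(tibble)

# A named list of equal-length vectors becomes one column per element.
from_list <- as_tibble(list(id = 1:3, label = c("a", "b", "c")))

# A matrix with column names becomes one column per matrix column.
m <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("p", "q")))
from_matrix <- as_tibble(m)

from_list
from_matrix
```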

2.2. You can also create a new tibble from column vectors with tibble():

Columns are built in order, so later columns can refer to earlier ones, and list-columns are easy to use:

tibble(x = 1:5, y = 1, z = x ^ 2 + y)

## # A tibble: 5 × 3
##       x     y     z
##   <int> <dbl> <dbl>
## 1     1     1     2
## 2     2     1     5
## 3     3     1    10
## 4     4     1    17
## 5     5     1    26

tibble(x1 = 1:3, y1 = list(1:5, 1:10, 1:20))

## # A tibble: 3 × 2
##      x1 y1        
##   <int> <list>    
## 1     1 <int [5]> 
## 2     2 <int [10]>
## 3     3 <int [20]>

2.3. You can define a tibble row-by-row with tribble():

tribble(
  ~x, ~y,  ~z,
  "a", 2,  3.6,
  "b", 1,  8.5
)

## # A tibble: 2 × 3
##   x         y     z
##   <chr> <dbl> <dbl>
## 1 a         2   3.6
## 2 b         1   8.5

2.4. Some characteristics of tibble()

tibble() does much less than data.frame(): it never changes the type of the inputs (e.g. it never converts strings to factors!), it never changes the names of variables, and it only recycles inputs of length 1.

You can read more about these features in vignette("tibble").
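
The type-preservation point can be checked with a small sketch. Note that data.frame() only converts strings to factors by default in R versions before 4.0.0, so the conversion is requested explicitly here. The column name label is made up for illustration.

```r
library(tibble)

# Ask data.frame() for the old string-to-factor conversion explicitly.
df <- data.frame(label = c("a", "b"), stringsAsFactors = TRUE)
class(df$label)

# tibble() keeps the character vector as it is.
tbl <- tibble(label = c("a", "b"))
class(tbl$label)
```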

# It never adjusts the names of variables:

names(data.frame(`crazy name` = 1))

## [1] "crazy.name"

names(tibble(`crazy name` = 1))

## [1] "crazy name"

It evaluates its arguments lazily and sequentially:

tibble(x = 1:5, y = x ^ 2)

## # A tibble: 5 × 2
##       x     y
##   <int> <dbl>
## 1     1     1
## 2     2     4
## 3     3     9
## 4     4    16
## 5     5    25

It also never creates row.names(). The whole point of tidy data is to store variables in a consistent way, so it never stores a variable as a special attribute. (Every variable is stored the same tidy way, as an ordinary column.)
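
If the row names of an existing data frame carry information, one option is to move them into a regular column during conversion. The sketch below shows two ways to do that with tibble functions; the data and the column name "name" are made up for illustration.

```r
library(tibble)

df <- data.frame(score = c(90, 85), row.names = c("ann", "bob"))

# Option 1: ask as_tibble() to store the row names in a column called "name".
t1 <- as_tibble(df, rownames = "name")

# Option 2: move the row names into a column first, then convert.
t2 <- as_tibble(rownames_to_column(df, var = "name"))

t1
t2
```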

3. Tibbles vs data frames

There are three key differences: printing, subsetting, and recycling rules.

3.1. Printing

When you print a tibble, it only shows the first ten rows and all the columns that fit on one screen. It also prints an abbreviated description of the column type, and uses font styles and color for highlighting:

tibble(x = -5:100, y = 123.456 * (3 ^ x))

## # A tibble: 106 × 2
##        x         y
##    <int>     <dbl>
##  1    -5     0.508
##  2    -4     1.52 
##  3    -3     4.57 
##  4    -2    13.7  
##  5    -1    41.2  
##  6     0   123.   
##  7     1   370.   
##  8     2  1111.   
##  9     3  3333.   
## 10     4 10000.   
## # … with 96 more rows
## # ℹ Use `print(n = ...)` to see more rows

You can control the default appearance with options:
- options(pillar.print_max = n, pillar.print_min = m): if there are more than n rows, print only the first m rows. Use options(pillar.print_max = Inf) to always show all rows.
- options(pillar.width = n): use n character slots horizontally to show the data. If n > getOption("width"), this will result in multiple tiers. Use options(pillar.width = Inf) to always print all columns, regardless of the width of the screen.

See ?pillar::pillar_options and ?tibble_options for the available options, vignette("types") for an overview of the type abbreviations, vignette("numbers") for details on the formatting of numbers, and vignette("digits") for a comparison with data frame printing.
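
The options above can be tried out as in the following sketch. It sets the options temporarily and restores the previous values afterwards. The n argument of print() is used for a one-off override; the specific values (3, 5) are arbitrary.

```r
library(tibble)

tbl <- tibble(x = -5:100, y = 123.456 * (3 ^ x))

# One-off: show only the first 3 rows for this call.
print(tbl, n = 3)

# Session-wide: tables with more than 5 rows print only their first 5 rows.
old <- options(pillar.print_max = 5, pillar.print_min = 5)
print(tbl)

# Restore the previous settings.
options(old)
```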

3.2. Subsetting

Tibbles are quite strict about subsetting. [ always returns another tibble. Contrast this with a data frame: sometimes [ returns a data frame and sometimes it just returns a vector:

df1 <- data.frame(x = 1:3, y = 3:1)
class(df1[, 1:2])

## [1] "data.frame"

class(df1[, 1])

## [1] "integer"

df2 <- tibble(x = 1:3, y = 3:1)
class(df2[, 1:2])

## [1] "tbl_df"     "tbl"        "data.frame"

class(df2[, 1])

## [1] "tbl_df"     "tbl"        "data.frame"

To extract a single column use [[ or $:

class(df2[[1]])

## [1] "integer"

class(df2$x)

## [1] "integer"

Tibbles are also stricter with $. Tibbles never do partial matching, and will throw a warning and return NULL if the column does not exist:

df <- data.frame(abc = 1)
df$a

## [1] 1

df2 <- tibble(abc = 1)
df2$a

## Warning: Unknown or uninitialised column: `a`.

## NULL
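
Because a missing column only produces a warning and NULL, a script may want to check for the column explicitly before reading it. This is a sketch of one possible check using base R, not something the source prescribes.

```r
library(tibble)

tbl <- tibble(abc = 1)

# Check whether a column exists before reading it; no partial matching is involved.
"a" %in% names(tbl)
"abc" %in% names(tbl)

# Read the column by its full name.
tbl[["abc"]]
```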

However, tibbles respect the drop argument if it is provided:

data.frame(a = 1:3)[, "a", drop = TRUE]

## [1] 1 2 3

tibble(a = 1:3)[, "a", drop = TRUE]

## [1] 1 2 3

Tibbles do not support row names. They are removed when converting to a tibble or when subsetting:

df <- data.frame(a = 1:3, row.names = letters[1:3])
rownames(df)

## [1] "a" "b" "c"

rownames(as_tibble(df))

## [1] "1" "2" "3"

tbl <- tibble(a = 1:3)
rownames(tbl) <- letters[1:3]

## Warning: Setting row names on a tibble is deprecated.

rownames(tbl)

## [1] "a" "b" "c"

rownames(tbl[1, ])

## [1] "1"

See vignette("invariants") (https://tibble.tidyverse.org/articles/invariants.html) for a detailed comparison between tibbles and data frames.

3.3. Recycling rules

When constructing a tibble, only values of length 1 are recycled. The first column with a length other than one determines the number of rows in the tibble; conflicts lead to an error (the columns must have matching lengths):

tibble(a = 1, b = 1:3)

## # A tibble: 3 × 2
##       a     b
##   <dbl> <int>
## 1     1     1
## 2     1     2
## 3     1     3

tibble(a = 1:3, b = 1)

## # A tibble: 3 × 2
##       a     b
##   <int> <dbl>
## 1     1     1
## 2     2     1
## 3     3     1

# tibble(a = 1:3, c = 1:2)
#> Error:
#> Tibble columns must have compatible sizes.
#> Size 3: Existing data.
#> Size 2: Column `c`.
#> Only values of size one are recycled.
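
The failing call above is commented out so that the document still knits. As a sketch, the same error can be observed without stopping the script by catching it with tryCatch(); the variable name result is made up for illustration.

```r
library(tibble)

# Lengths 3 and 2 are incompatible, so tibble() signals an error.
# tryCatch() turns that error into its message text instead of stopping.
result <- tryCatch(
  tibble(a = 1:3, c = 1:2),
  error = function(e) conditionMessage(e)
)

# A character result means the error branch was taken.
is.character(result)
```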

This also extends to tibbles with zero rows, which is sometimes important for programming:

tibble(a = 1, b = integer())

## # A tibble: 0 × 2
## # … with 2 variables: a <dbl>, b <int>
## # ℹ Use `colnames()` to see all variable names

tibble(a = integer(), b = 1)

## # A tibble: 0 × 2
## # … with 2 variables: a <int>, b <dbl>
## # ℹ Use `colnames()` to see all variable names

Arithmetic operations: not supported

Unlike data frames, tibbles don't support arithmetic operations on all columns. The result is silently coerced to a data frame. Do not rely on this behavior; it may become an error in a forthcoming version.

tbl <- tibble(a = 1:3, b = 4:6)
tbl * 2

##   a  b
## 1 2  8
## 2 4 10
## 3 6 12
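
One possible way to get the same numbers while keeping a tibble is to apply the operation column by column and convert the resulting list back. This is an assumption about how one might work around the limitation, not an approach taken from the source; the name doubled is made up for illustration.

```r
library(tibble)

tbl <- tibble(a = 1:3, b = 4:6)

# lapply() returns a named list of doubled columns; as_tibble() rebuilds a tibble from it.
doubled <- as_tibble(lapply(tbl, function(col) col * 2))

class(doubled)
doubled
```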

# A data frame can do this:
df1 <- data.frame(x = 1:3, y = 3:1)
df1

##   x y
## 1 1 3
## 2 2 2
## 3 3 1

df1 * 2

##   x y
## 1 2 6
## 2 4 4
## 3 6 2

Tibble of Tidyverse

Bao Huyen, Coursera, in Hanoi

2022-AUG-07