Introduction

Dataframes are fundamental to R: they are its preferred way to store data in a straightforward, human-readable form. For the most part, dataframes do their job well. They can store data of different types in the same object, they can be subsetted relatively easily, and they interact cleanly with other objects in R. However, as data science has evolved, some dataframe behaviors have become inconvenient and inefficient. These include defaults like stringsAsFactors = TRUE (the default before R 4.0.0), printing every row and column when the dataframe is called, and partial column matching. These “features” often lead to long function calls, unintentionally printing an entire dataframe, and referencing the wrong column, all of which create errors and frustration. Because dataframes are used so frequently in R, these inconveniences and inefficiencies can snowball over time. Tibbles are designed to address these problems.
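The partial column matching mentioned above is easy to miss, so here is a minimal sketch of it. The objects pay_df and pay_tbl and the column name median_income are hypothetical names made up for this illustration.

```r
library(tibble)

# Hypothetical example data: one column named "median_income".
pay_df  <- data.frame(median_income = c(110000, 75000))
pay_tbl <- tibble(median_income = c(110000, 75000))

# `$` on a dataframe partially matches "med" to "median_income",
# so a typo or abbreviation can silently return a column.
df_result <- pay_df$med
df_result

# `$` on a tibble does not partially match: it warns about an unknown
# column and returns NULL. The warning is suppressed here only to keep
# the sketch quiet.
tbl_result <- suppressWarnings(pay_tbl$med)
tbl_result
```

With the tibble, a misspelled or abbreviated column name surfaces immediately instead of quietly resolving to some other column.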

What is a tibble?

Simply put, a tibble is a lightweight dataframe that removes the annoying default behaviors of dataframes while preserving their core functionality. Tibbles are deliberately more restrictive than dataframes to nudge users toward writing less sloppy code. Tibbles can do almost everything dataframes can do (barring a few niche exceptions), often faster and with less overhead. Let’s get into some examples.

Installing tibble

The tibble package can be installed like any other R package using the install.packages() function. It is available from several sources, including CRAN, the tidyverse meta-package, and GitHub. The package is actively maintained and easily accessible. The following code installs tibble from each of these sources; pick whichever you prefer.

#Install tibble directly from CRAN
install.packages("tibble")

#Install tibble through the tidyverse package
install.packages("tidyverse")

#Install the development version of tibble from GitHub (requires the devtools package)
devtools::install_github("tidyverse/tibble")
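If the installation step lives in a script that gets run repeatedly, it can help to install only when the package is missing. The following is a minimal sketch of that pattern; the CRAN mirror URL is just one possible choice and is an assumption of this example, not a recommendation from the original text.

```r
# Install tibble from CRAN only if it is not already available.
# The mirror below is an assumed choice; any CRAN mirror works.
if (!requireNamespace("tibble", quietly = TRUE)) {
  install.packages("tibble", repos = "https://cloud.r-project.org")
}

library(tibble)

# Report which version ended up being loaded.
tibble_version <- packageVersion("tibble")
tibble_version
```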

Data

I will use the women in STEM dataset from FiveThirtyEight’s GitHub repository to show some examples of tibble. The link to the data is here:

https://github.com/fivethirtyeight/data/blob/master/college-majors/women-stem.csv

This data contains information about women in various STEM fields, like the proportion of women in the field and the median income of the field. This data was used for an article about picking a college major from an economic perspective. The link to the article is here:

https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/

The data can be seen below as a tibble:

library(tibble)
library(readr)

women_stem_tibble = read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/women-stem.csv")
## Parsed with column specification:
## cols(
##   Rank = col_double(),
##   Major_code = col_double(),
##   Major = col_character(),
##   Major_category = col_character(),
##   Total = col_double(),
##   Men = col_double(),
##   Women = col_double(),
##   ShareWomen = col_double(),
##   Median = col_double()
## )
women_stem_tibble
## # A tibble: 76 x 9
##     Rank Major_code Major Major_category Total   Men Women ShareWomen
##    <dbl>      <dbl> <chr> <chr>          <dbl> <dbl> <dbl>      <dbl>
##  1     1       2419 PETR~ Engineering     2339  2057   282      0.121
##  2     2       2416 MINI~ Engineering      756   679    77      0.102
##  3     3       2415 META~ Engineering      856   725   131      0.153
##  4     4       2417 NAVA~ Engineering     1258  1123   135      0.107
##  5     5       2418 NUCL~ Engineering     2573  2200   373      0.145
##  6     6       2405 CHEM~ Engineering    32260 21239 11021      0.342
##  7     7       5001 ASTR~ Physical Scie~  1792   832   960      0.536
##  8     8       2414 MECH~ Engineering    91227 80320 10907      0.120
##  9     9       2401 AERO~ Engineering    15058 12953  2105      0.140
## 10    10       2408 ELEC~ Engineering    81527 65511 16016      0.196
## # ... with 66 more rows, and 1 more variable: Median <dbl>
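The column specification message printed above reports the types read_csv() guessed. As a hedged sketch, the types can also be declared up front with the col_types argument. To keep the example self-contained and offline, it reads a tiny temporary CSV holding two rows copied from the preview above (with ShareWomen rounded as displayed) instead of the full file; csv_path and stem_sample are hypothetical names.

```r
library(readr)

# Write a tiny CSV to a temporary file so the sketch needs no network access.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("Rank,Major,ShareWomen",
    "1,PETROLEUM ENGINEERING,0.121",
    "2,MINING AND MINERAL ENGINEERING,0.102"),
  csv_path
)

# Declaring col_types makes the expected types explicit, so read_csv()
# does not need to guess them or print the guessed specification.
stem_sample <- read_csv(
  csv_path,
  col_types = cols(
    Rank       = col_double(),
    Major      = col_character(),
    ShareWomen = col_double()
  )
)
stem_sample
```

The same col_types argument could be passed when reading the full dataset from its URL, provided every column is listed or a default is given.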

Tibbles vs. Data Frames

As specified in the introduction, tibbles are lightweight dataframes with different default behaviors. There are a few characteristics that separate tibbles from dataframes:

  • Tibbles preview only the first few rows when called.
  • Tibbles do not change input data types.
  • Tibbles do not change variable names.
  • Tibbles do not use row names.
  • Tibbles only recycle single-element vectors.

It is important to remember that tibbles are designed to be more restrictive than dataframes to prevent common bugs that can occur when using dataframes. The following sections will provide examples of the differing characteristics and how these differences can be useful.

Data previewing

When loading data, most people look at a preview to check at a glance that the data appears correct. This usually involves fetching the first few rows and checking them for major issues. The typical call for this is head(dataframe), where head() fetches the first few rows of dataframe. head() is needed because calling a dataframe on its own prints the entire dataframe (up to R's max.print limit). This “feature” can be seen below. Notice how long and disorienting the data appears when called this way.

women_stem_df = read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/women-stem.csv")

women_stem_df
##    Rank Major_code
## 1     1       2419
## 2     2       2416
## 3     3       2415
## 4     4       2417
## 5     5       2418
## 6     6       2405
## 7     7       5001
## 8     8       2414
## 9     9       2401
## 10   10       2408
## 11   11       2407
## 12   12       5008
## 13   13       2404
## 14   14       2409
## 15   15       2402
## 16   16       2412
## 17   17       2400
## 18   18       2403
## 19   19       2102
## 20   20       2502
## 21   21       2413
## 22   22       2499
## 23   23       2406
## 24   24       2500
## 25   25       2411
## 26   26       2410
## 27   27       6107
## 28   28       2503
## 29   29       5102
## 30   30       2105
## 31   31       2100
## 32   32       5007
## 33   33       3701
## 34   34       3700
## 35   35       3702
## 36   36       3607
## 37   37       6105
## 38   38       5006
## 39   39       2501
## 40   40       4005
## 41   41       6104
## 42   42       2101
## 43   43       4006
## 44   44       2504
## 45   45       2599
## 46   46       5000
## 47   47       1401
## 48   48       3605
## 49   49       3603
## 50   50       6108
## 51   51       5003
## 52   52       3606
## 53   53       2106
## 54   54       3601
## 55   55       3602
## 56   56       2107
## 57   57       5004
## 58   58       5005
## 59   59       6199
## 60   60       1301
## 61   61       5002
## 62   62       2001
## 63   63       5098
## 64   64       3608
## 65   65       3611
## 66   66       6103
## 67   67       4002
## 68   68       6110
## 69   69       3699
## 70   70       6106
## 71   71       3600
## 72   72       3604
## 73   73       6109
## 74   74       6100
## 75   75       6102
## 76   76       3609
##                                                         Major
## 1                                       PETROLEUM ENGINEERING
## 2                              MINING AND MINERAL ENGINEERING
## 3                                   METALLURGICAL ENGINEERING
## 4                   NAVAL ARCHITECTURE AND MARINE ENGINEERING
## 5                                         NUCLEAR ENGINEERING
## 6                                        CHEMICAL ENGINEERING
## 7                                  ASTRONOMY AND ASTROPHYSICS
## 8                                      MECHANICAL ENGINEERING
## 9                                       AEROSPACE ENGINEERING
## 10                                     ELECTRICAL ENGINEERING
## 11                                       COMPUTER ENGINEERING
## 12                                          MATERIALS SCIENCE
## 13                                     BIOMEDICAL ENGINEERING
## 14                  ENGINEERING MECHANICS PHYSICS AND SCIENCE
## 15                                     BIOLOGICAL ENGINEERING
## 16                   INDUSTRIAL AND MANUFACTURING ENGINEERING
## 17                                        GENERAL ENGINEERING
## 18                                  ARCHITECTURAL ENGINEERING
## 19                                           COMPUTER SCIENCE
## 20                          ELECTRICAL ENGINEERING TECHNOLOGY
## 21                MATERIALS ENGINEERING AND MATERIALS SCIENCE
## 22                                  MISCELLANEOUS ENGINEERING
## 23                                          CIVIL ENGINEERING
## 24                                   ENGINEERING TECHNOLOGIES
## 25                     GEOLOGICAL AND GEOPHYSICAL ENGINEERING
## 26                                  ENVIRONMENTAL ENGINEERING
## 27                                                    NURSING
## 28                         INDUSTRIAL PRODUCTION TECHNOLOGIES
## 29 NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL TECHNOLOGIES
## 30                                       INFORMATION SCIENCES
## 31                           COMPUTER AND INFORMATION SYSTEMS
## 32                                                    PHYSICS
## 33                                        APPLIED MATHEMATICS
## 34                                                MATHEMATICS
## 35                            STATISTICS AND DECISION SCIENCE
## 36                                               PHARMACOLOGY
## 37                           MEDICAL TECHNOLOGIES TECHNICIANS
## 38                                               OCEANOGRAPHY
## 39                      ENGINEERING AND INDUSTRIAL MANAGEMENT
## 40                           MATHEMATICS AND COMPUTER SCIENCE
## 41                                 MEDICAL ASSISTING SERVICES
## 42                   COMPUTER PROGRAMMING AND DATA PROCESSING
## 43                        COGNITIVE SCIENCE AND BIOPSYCHOLOGY
## 44                MECHANICAL ENGINEERING RELATED TECHNOLOGIES
## 45                     MISCELLANEOUS ENGINEERING TECHNOLOGIES
## 46                                          PHYSICAL SCIENCES
## 47                                               ARCHITECTURE
## 48                                                   GENETICS
## 49                                          MOLECULAR BIOLOGY
## 50        PHARMACY PHARMACEUTICAL SCIENCES AND ADMINISTRATION
## 51                                                  CHEMISTRY
## 52                                               MICROBIOLOGY
## 53            COMPUTER ADMINISTRATION MANAGEMENT AND SECURITY
## 54                                       BIOCHEMICAL SCIENCES
## 55                                                     BOTANY
## 56                 COMPUTER NETWORKING AND TELECOMMUNICATIONS
## 57                                  GEOLOGY AND EARTH SCIENCE
## 58                                                GEOSCIENCES
## 59                   MISCELLANEOUS HEALTH MEDICAL PROFESSIONS
## 60                                      ENVIRONMENTAL SCIENCE
## 61                       ATMOSPHERIC SCIENCES AND METEOROLOGY
## 62                                 COMMUNICATION TECHNOLOGIES
## 63                      MULTI-DISCIPLINARY OR GENERAL SCIENCE
## 64                                                 PHYSIOLOGY
## 65                                               NEUROSCIENCE
## 66                 HEALTH AND MEDICAL ADMINISTRATIVE SERVICES
## 67                                         NUTRITION SCIENCES
## 68                                COMMUNITY AND PUBLIC HEALTH
## 69                                      MISCELLANEOUS BIOLOGY
## 70                    HEALTH AND MEDICAL PREPARATORY PROGRAMS
## 71                                                    BIOLOGY
## 72                                                    ECOLOGY
## 73                              TREATMENT THERAPY PROFESSIONS
## 74                        GENERAL MEDICAL AND HEALTH SERVICES
## 75              COMMUNICATION DISORDERS SCIENCES AND SERVICES
## 76                                                    ZOOLOGY
##             Major_category  Total    Men  Women ShareWomen Median
## 1              Engineering   2339   2057    282 0.12056434 110000
## 2              Engineering    756    679     77 0.10185185  75000
## 3              Engineering    856    725    131 0.15303738  73000
## 4              Engineering   1258   1123    135 0.10731320  70000
## 5              Engineering   2573   2200    373 0.14496697  65000
## 6              Engineering  32260  21239  11021 0.34163050  65000
## 7        Physical Sciences   1792    832    960 0.53571429  62000
## 8              Engineering  91227  80320  10907 0.11955890  60000
## 9              Engineering  15058  12953   2105 0.13979280  60000
## 10             Engineering  81527  65511  16016 0.19645026  60000
## 11             Engineering  41542  33258   8284 0.19941264  60000
## 12             Engineering   4279   2949   1330 0.31082028  60000
## 13             Engineering  14955   8407   6548 0.43784687  60000
## 14             Engineering   4321   3526    795 0.18398519  58000
## 15             Engineering   8925   6062   2863 0.32078431  57100
## 16             Engineering  18968  12453   6515 0.34347322  57000
## 17             Engineering  61152  45683  15469 0.25295984  56000
## 18             Engineering   2825   1835    990 0.35044248  54000
## 19 Computers & Mathematics 128319  99743  28576 0.22269500  53000
## 20             Engineering  11565   8181   3384 0.29260700  52000
## 21             Engineering   2993   2020    973 0.32509188  52000
## 22             Engineering   9133   7398   1735 0.18997044  50000
## 23             Engineering  53153  41081  12072 0.22711794  50000
## 24             Engineering   3600   2695    905 0.25138889  50000
## 25             Engineering    720    488    232 0.32222222  50000
## 26             Engineering   4047   2662   1385 0.34222881  50000
## 27                  Health 209394  21773 187621 0.89601899  48000
## 28             Engineering   4631   3477   1154 0.24919024  46000
## 29       Physical Sciences   2116    528   1588 0.75047259  46000
## 30 Computers & Mathematics  11913   9005   2908 0.24410308  45000
## 31 Computers & Mathematics  36698  27392   9306 0.25358330  45000
## 32       Physical Sciences  32142  23080   9062 0.28193641  45000
## 33 Computers & Mathematics   4939   2794   2145 0.43429844  45000
## 34 Computers & Mathematics  72397  39956  32441 0.44809868  45000
## 35 Computers & Mathematics   6251   2960   3291 0.52647576  45000
## 36  Biology & Life Science   1762    515   1247 0.70771850  45000
## 37                  Health  15914   3916  11998 0.75392736  45000
## 38       Physical Sciences   2418    752   1666 0.68899917  44700
## 39             Engineering   2906   2400    506 0.17412251  44000
## 40 Computers & Mathematics    609    500    109 0.17898194  42000
## 41                  Health  11123    803  10320 0.92780725  42000
## 42 Computers & Mathematics   4168   3046   1122 0.26919386  41300
## 43  Biology & Life Science   3831   1667   2164 0.56486557  41000
## 44             Engineering   4790   4419    371 0.07745303  40000
## 45             Engineering   8804   7043   1761 0.20002272  40000
## 46       Physical Sciences   1436    894    542 0.37743733  40000
## 47             Engineering  46420  25463  20957 0.45146489  40000
## 48  Biology & Life Science   3635   1761   1874 0.51554333  40000
## 49  Biology & Life Science  18300   7426  10874 0.59420765  40000
## 50                  Health  23551   8697  14854 0.63071632  40000
## 51       Physical Sciences  66530  32923  33607 0.50514054  39000
## 52  Biology & Life Science  15232   6383   8849 0.58094800  38000
## 53 Computers & Mathematics   8066   6607   1459 0.18088272  37500
## 54  Biology & Life Science  39107  18951  20156 0.51540645  37400
## 55  Biology & Life Science   1329    626    703 0.52896915  37000
## 56 Computers & Mathematics   7613   5291   2322 0.30500460  36400
## 57       Physical Sciences  10972   5813   5159 0.47019687  36200
## 58       Physical Sciences   1978    809   1169 0.59100101  36000
## 59                  Health  13386   1589  11797 0.88129389  36000
## 60  Biology & Life Science  25965  10787  15178 0.58455613  35600
## 61       Physical Sciences   4043   2744   1299 0.32129607  35000
## 62 Computers & Mathematics  18035  11431   6604 0.36617688  35000
## 63       Physical Sciences  62052  27015  35037 0.56463933  35000
## 64  Biology & Life Science  22060   8422  13638 0.61822303  35000
## 65  Biology & Life Science  13663   4944   8719 0.63814682  35000
## 66                  Health  18109   4266  13843 0.76442653  35000
## 67                  Health  18909   2563  16346 0.86445608  35000
## 68                  Health  19735   4103  15632 0.79209526  34000
## 69  Biology & Life Science  10706   4747   5959 0.55660377  33500
## 70                  Health  12740   5521   7219 0.56664050  33500
## 71  Biology & Life Science 280709 111762 168947 0.60185815  33400
## 72  Biology & Life Science   9154   3878   5276 0.57636006  33000
## 73                  Health  48491  13487  35004 0.72186591  33000
## 74                  Health  33599   7574  26025 0.77457662  32400
## 75                  Health  38279   1225  37054 0.96799812  28000
## 76  Biology & Life Science   8409   3050   5359 0.63729338  26000

Unlike dataframes, tibbles have built-in data preview functionality. When a tibble is called, it shows only the first 10 rows and the columns that fit on screen instead of the entire dataset.

women_stem_tibble
## # A tibble: 76 x 9
##     Rank Major_code Major Major_category Total   Men Women ShareWomen
##    <dbl>      <dbl> <chr> <chr>          <dbl> <dbl> <dbl>      <dbl>
##  1     1       2419 PETR~ Engineering     2339  2057   282      0.121
##  2     2       2416 MINI~ Engineering      756   679    77      0.102
##  3     3       2415 META~ Engineering      856   725   131      0.153
##  4     4       2417 NAVA~ Engineering     1258  1123   135      0.107
##  5     5       2418 NUCL~ Engineering     2573  2200   373      0.145
##  6     6       2405 CHEM~ Engineering    32260 21239 11021      0.342
##  7     7       5001 ASTR~ Physical Scie~  1792   832   960      0.536
##  8     8       2414 MECH~ Engineering    91227 80320 10907      0.120
##  9     9       2401 AERO~ Engineering    15058 12953  2105      0.140
## 10    10       2408 ELEC~ Engineering    81527 65511 16016      0.196
## # ... with 66 more rows, and 1 more variable: Median <dbl>
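When the default preview is too short or too narrow, the tibble print method accepts n and width arguments. The sketch below uses a hypothetical stand-in dataset (demo_df and demo_tbl, with 76 rows to mirror the data above) so that it runs without downloading anything.

```r
library(tibble)

# Hypothetical stand-in data with 76 rows.
demo_df  <- data.frame(id = 1:76, value = (1:76) * 100)
demo_tbl <- as_tibble(demo_df)

# Dataframe: a preview has to be requested explicitly.
first_rows <- head(demo_df, 10)
first_rows

# Tibble: printing previews by default; n and width adjust how much is shown.
print(demo_tbl, n = 15)       # show 15 rows instead of the default
print(demo_tbl, width = Inf)  # show every column, however wide
```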

Input data types

One quirk of dataframes is that they historically changed strings to the factor data type automatically: before R 4.0.0, stringsAsFactors = TRUE was the default for data.frame() and read.csv(). This was a convenience when data was simpler and string vectors were almost always categorical variables. It came in handy for columns with predefined categories like gender, where the vector would only contain values such as “male”, “female”, and “other”. However, with the increasing popularity of natural language processing and of using words as data, it is much more convenient to keep string vectors as strings than to change them to factors.

Unlike dataframes with that default, tibbles do not change strings to the factor data type automatically. Instead, tibbles preserve the input data type by default. The difference can be seen below; the dataframe output reflects the older stringsAsFactors = TRUE default. Notice that the Major column is a 76-level factor in the dataframe, while the same column is a character vector in the tibble.

str(women_stem_df$Major)
##  Factor w/ 76 levels "AEROSPACE ENGINEERING",..: 68 54 52 61 63 12 5 48 1 26 ...
str(women_stem_tibble$Major)
##  chr [1:76] "PETROLEUM ENGINEERING" "MINING AND MINERAL ENGINEERING" ...
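Because the dataframe default depends on the R version, a small sketch that sets stringsAsFactors explicitly makes the comparison reproducible anywhere. The vector majors and the three objects built from it are hypothetical names for this illustration.

```r
library(tibble)

majors <- c("NURSING", "ZOOLOGY", "NURSING")

# Dataframe with the old default made explicit: strings become a factor.
df_factor <- data.frame(Major = majors, stringsAsFactors = TRUE)

# Dataframe with conversion switched off: strings stay character.
df_chr <- data.frame(Major = majors, stringsAsFactors = FALSE)

# Tibble: strings stay character without any extra argument.
tbl_chr <- tibble(Major = majors)

class(df_factor$Major)
class(df_chr$Major)
class(tbl_chr$Major)
```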

Variable names

Another quirk of dataframes is that they don’t like spaces in their column names. By default (check.names = TRUE), data.frame() changes spaces in column names to periods so that every name is syntactically valid. The motivation is that referencing a column whose name contains spaces was very annoying without autocomplete, because the name had to be wrapped in quotes or backticks. With tab completion now a standard feature of modern R environments, such column references are simple, and tibbles leave the names unchanged.

nwsdf = data.frame("name with spaces" = 2)
nwsdf
##   name.with.spaces
## 1                2
nwsdf$name.with.spaces
## [1] 2
nwst = tibble("name with spaces" = 2)
nwst
## # A tibble: 1 x 1
##   `name with spaces`
##                <dbl>
## 1                  2
nwst$`name with spaces`
## [1] 2
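For completeness, the renaming in data.frame() can be switched off with check.names = FALSE, and the renaming rule itself is exposed through make.names(). A minimal sketch, where nwsdf_keep is a hypothetical name:

```r
# check.names = FALSE tells data.frame() to keep the name exactly as given.
nwsdf_keep <- data.frame("name with spaces" = 2, check.names = FALSE)
names(nwsdf_keep)

# The column then needs backticks (or quotes with [[ ]]) to be referenced.
nwsdf_keep$`name with spaces`

# make.names() shows what the default renaming would have produced.
make.names("name with spaces")
```

The difference with tibbles is therefore one of defaults: data.frame() repairs names unless told not to, while tibble() keeps them as written.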

Row names

To keep things simple, tibbles do not use row names. Row names are a special attribute stored differently from normal columns, and they can complicate dataframes because that attribute was not designed to store long names. If row.names() is called on a tibble, it returns the row indices as character strings.

row.names(women_stem_tibble)
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28"
## [29] "29" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42"
## [43] "43" "44" "45" "46" "47" "48" "49" "50" "51" "52" "53" "54" "55" "56"
## [57] "57" "58" "59" "60" "61" "62" "63" "64" "65" "66" "67" "68" "69" "70"
## [71] "71" "72" "73" "74" "75" "76"
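When a dataframe does carry meaningful row names, they can be moved into an ordinary column before converting to a tibble. The sketch below uses R's built-in mtcars dataset, whose row names are car models; the column name "model" is an arbitrary choice for this example.

```r
library(tibble)

# mtcars stores car models as row names; move them into a regular column.
cars_with_names <- rownames_to_column(mtcars, var = "model")

# Convert to a tibble; the model names are now ordinary data.
cars_tbl <- as_tibble(cars_with_names)
cars_tbl
```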

Recycling

One distinctive characteristic of R is its use of element recycling. Recycling is when the elements of a shorter vector are reused to match the length of a longer vector it is combined with.

In the example below, the 1 in the x column is recycled 10 times to match the 10 elements in the y column.

tibble(x = 1, y = 1:10)
## # A tibble: 10 x 2
##        x     y
##    <dbl> <int>
##  1     1     1
##  2     1     2
##  3     1     3
##  4     1     4
##  5     1     5
##  6     1     6
##  7     1     7
##  8     1     8
##  9     1     9
## 10     1    10

Recycling can be useful if the user is lazy and does not want to call the rep() function every time they want to repeat a number to the same length as another vector.

Recycling can also be used for vectors with multiple elements, as seen below. The catch is that, in data.frame(), the longer length must be a whole multiple of the shorter one. Notice that x is 1:4 repeated twice (total length 8), while y is printed as given (total length 8).

data.frame(x = 1:4, y = 1:8)
##   x y
## 1 1 1
## 2 2 2
## 3 3 3
## 4 4 4
## 5 1 5
## 6 2 6
## 7 3 7
## 8 4 8

If the longer vector's length is not a multiple of the shorter one's, the vectors cannot be recycled and data.frame() throws an error. This error happens frequently for people who rely on recycling, and it can cause many problems.

data.frame(x = 1:3, y = 1:8)
## Error in data.frame(x = 1:3, y = 1:8): arguments imply differing number of rows: 3, 8

Tibbles circumvent this multiple-element recycling problem by only allowing vectors of length 1 to be recycled. This minimizes the errors that can occur when recycling, because length 1 vectors can always be recycled.

tibble(x = 1:4, y = 1:8)
## Tibble columns must have consistent lengths, only values of length one are recycled:
## * Length 4: Column `x`
## * Length 8: Column `y`
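If the repetition is actually intended, it can be spelled out with rep() so that both columns have the same length before they reach tibble(). This is a minimal sketch; the tryCatch() wrapper is there only to show that the implicit version fails without stopping the script.

```r
library(tibble)

# Explicit repetition: x is 1:4 repeated twice, so both columns have length 8.
explicit_tbl <- tibble(x = rep(1:4, times = 2), y = 1:8)
explicit_tbl

# Implicit multiple-element recycling is rejected by tibble().
implicit_result <- tryCatch(
  tibble(x = 1:4, y = 1:8),
  error = function(e) "tibble() refused to recycle x"
)
implicit_result
```

Writing the repetition explicitly costs a few extra characters, but it documents the intent and removes any dependence on recycling rules.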

Extension by Henry Otuadinma

Austin already did a great job on this. I am going to add a little more to what he has by demonstrating the glimpse() and as_tibble() functions of the tibble package.

glimpse() simply gives a glimpse of the data. According to the documentation, it is like a transposed version of print(): columns run down the page, and data runs across. This makes it possible to see every column in a data frame. It’s a little like str() applied to a data frame, but it tries to show you as much data as possible. (And it always shows the underlying data, even when applied to a remote data source.)

glimpse(women_stem_tibble)
## Observations: 76
## Variables: 9
## $ Rank           <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ Major_code     <dbl> 2419, 2416, 2415, 2417, 2418, 2405, 5001, 2414,...
## $ Major          <chr> "PETROLEUM ENGINEERING", "MINING AND MINERAL EN...
## $ Major_category <chr> "Engineering", "Engineering", "Engineering", "E...
## $ Total          <dbl> 2339, 756, 856, 1258, 2573, 32260, 1792, 91227,...
## $ Men            <dbl> 2057, 679, 725, 1123, 2200, 21239, 832, 80320, ...
## $ Women          <dbl> 282, 77, 131, 135, 373, 11021, 960, 10907, 2105...
## $ ShareWomen     <dbl> 0.1205643, 0.1018519, 0.1530374, 0.1073132, 0.1...
## $ Median         <dbl> 110000, 75000, 73000, 70000, 65000, 65000, 6200...

as_tibble() coerces lists and matrices to tibbles. as.data.frame() is effectively a thin wrapper around data.frame() and hence is rather slow (because it calls data.frame() on each element before cbinding them together), whereas as_tibble() is an S3 generic with more efficient methods for matrices and data frames.

m <- matrix(rnorm(50), ncol = 5)
colnames(m) <- c("a", "b", "c", "d", "e")
df <- as_tibble(m)
df
## # A tibble: 10 x 5
##         a       b      c       d       e
##     <dbl>   <dbl>  <dbl>   <dbl>   <dbl>
##  1  0.299  0.568   1.52  -1.99   -0.289 
##  2 -2.58   0.128   0.535 -0.331  -0.744 
##  3 -0.626  1.22    0.816  0.552  -1.49  
##  4  0.182  0.103  -1.57  -0.846   1.29  
##  5 -0.170 -0.457  -0.814 -1.44    0.592 
##  6  1.51   0.124  -0.843  0.875   1.37  
##  7 -0.429 -0.519  -1.69  -1.43    1.05  
##  8 -2.13  -1.40    0.323 -1.18    0.550 
##  9 -0.455 -0.359   2.54  -0.0912  0.239 
## 10  0.261  0.0978 -0.881 -0.328  -0.0994
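The matrix above is random, so its values change on every run. The sketch below repeats the conversion with a fixed seed (42 is an arbitrary choice) and also shows as_tibble() applied to a named list; the object names are hypothetical.

```r
library(tibble)

# Fix the seed so the random matrix is the same on every run.
set.seed(42)
m <- matrix(
  rnorm(50),
  ncol = 5,
  dimnames = list(NULL, c("a", "b", "c", "d", "e"))
)

# Matrix to tibble: column names carry over, every column is a double.
matrix_tbl <- as_tibble(m)
matrix_tbl

# Named list to tibble: each element becomes a column with its own type.
list_tbl <- as_tibble(list(x = 1:3, y = c("a", "b", "c")))
list_tbl
```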

Conclusion

Tibbles are a very useful, lightweight form of dataframe. In place of the bloated and outdated features of dataframes, tibbles offer a simplified version that preserves core functionality while removing behaviors that are prone to errors and inefficiency, like row names, printing the entire dataframe, and multiple-element recycling. Tibbles are especially useful for string data because they do not change column names or convert string columns to other data types.

Overall, tibbles can do almost everything dataframes can do while being faster, more efficient, and more convenient.