Dataframes have long been fundamental to R as its preferred way to store data in a straightforward, human-readable form. For the most part, dataframes do their job well: they can store data of different types in the same object, they can be subsetted relatively easily, and they interact cleanly with other objects in R. However, as data science has evolved, some dataframe behavior has become inconvenient and inefficient. This includes default behavior like stringsAsFactors = TRUE (the default before R 4.0.0), printing every row and column when calling the dataframe, and partial column matching. These “features” often lead to long function calls, unintentionally printing an entire dataframe, and calling the wrong columns, all of which create errors and frustration. Given that dataframes are used so frequently in R, these inconveniences and inefficiencies can snowball over time. Tibbles are designed to solve these problems.
Simply put, a tibble is a lightweight dataframe that removes the annoying default behaviors of dataframes while still preserving core functionality. Tibbles are deliberately more restrictive than dataframes, which nudges users toward writing less sloppy code. Tibbles can do everything dataframes can do (barring a few niche exceptions) faster and with less overhead. Let’s get into some examples.
Tibble can be installed like any other package in R using the install.packages() function. The tibble package can be installed on its own from CRAN, as part of the tidyverse, or as the development version from GitHub. The package is frequently maintained and easily accessible, which is nice. The following code shows each option according to your preference; you only need one of them.
#Install tibble directly from CRAN
install.packages("tibble")
#Install tibble through the tidyverse package
install.packages("tidyverse")
#Install the development version of tibble from GitHub (requires the devtools package)
devtools::install_github("tidyverse/tibble")
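If you are not sure whether tibble is already installed, a small guard avoids reinstalling it every time a script runs. The following is a minimal sketch using base R's requireNamespace() and packageVersion(); it assumes CRAN is the source you want.

```r
# Install tibble from CRAN only if it is not already available
if (!requireNamespace("tibble", quietly = TRUE)) {
  install.packages("tibble")
}

# Report which version of tibble is installed
tibble_version <- packageVersion("tibble")
print(tibble_version)
```

The same pattern works for the tidyverse by swapping the package name.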
I will be using the women in STEM dataset from FiveThirtyEight’s GitHub to show some examples of tibbles. The link to the data is here:
https://github.com/fivethirtyeight/data/blob/master/college-majors/women-stem.csv
This data contains information about women in various STEM fields, like the proportion of women in the field and the median income of the field. This data was used for an article about picking a college major from an economic perspective. The link to the article is here:
https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/
The data can be seen below as a tibble:
library(tibble)
library(readr)
women_stem_tibble = read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/women-stem.csv")
## Parsed with column specification:
## cols(
## Rank = col_double(),
## Major_code = col_double(),
## Major = col_character(),
## Major_category = col_character(),
## Total = col_double(),
## Men = col_double(),
## Women = col_double(),
## ShareWomen = col_double(),
## Median = col_double()
## )
women_stem_tibble
## # A tibble: 76 x 9
## Rank Major_code Major Major_category Total Men Women ShareWomen
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 2419 PETR~ Engineering 2339 2057 282 0.121
## 2 2 2416 MINI~ Engineering 756 679 77 0.102
## 3 3 2415 META~ Engineering 856 725 131 0.153
## 4 4 2417 NAVA~ Engineering 1258 1123 135 0.107
## 5 5 2418 NUCL~ Engineering 2573 2200 373 0.145
## 6 6 2405 CHEM~ Engineering 32260 21239 11021 0.342
## 7 7 5001 ASTR~ Physical Scie~ 1792 832 960 0.536
## 8 8 2414 MECH~ Engineering 91227 80320 10907 0.120
## 9 9 2401 AERO~ Engineering 15058 12953 2105 0.140
## 10 10 2408 ELEC~ Engineering 81527 65511 16016 0.196
## # ... with 66 more rows, and 1 more variable: Median <dbl>
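As specified in the introduction, tibbles are lightweight dataframes with different default behaviors. A few characteristics separate tibbles from dataframes: tibbles print only a preview of the data, they do not convert strings to factors, they leave column names unchanged, they do not support row names, and they only recycle vectors of length 1.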
It is important to remember that tibbles are designed to be more restrictive than dataframes to prevent common bugs that can occur when using dataframes. The following sections will provide examples of the differing characteristics and how these differences can be useful.
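Partial column matching, mentioned in the introduction, is a good first illustration of this restrictiveness. The following is a minimal sketch on hypothetical toy data (the names income_df, income_tb, and median_income are made up for illustration and are not part of the women in STEM dataset).

```r
library(tibble)

# Hypothetical toy data with a single column
income_df <- data.frame(median_income = c(110000, 75000))
income_tb <- tibble(median_income = c(110000, 75000))

# A dataframe silently matches the abbreviation `med` to `median_income`
df_result <- income_df$med

# A tibble requires the full column name: the abbreviation returns NULL
# and raises a warning (suppressed here to keep the output short)
tb_result <- suppressWarnings(income_tb$med)

print(df_result)
print(tb_result)
```

With a dataframe, a typo or abbreviation can quietly pick up the wrong column. With a tibble, the mistake surfaces immediately.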
When loading data, most people look at a preview to check at a glance that the data appears correct. This usually involves fetching the first few rows of the data and checking them for major issues. The typical call for this type of operation is head(dataframe), where the head() function fetches the first few rows from dataframe. head() is needed because calling a dataframe by itself prints the entire dataframe. This “feature” can be seen below. Notice how long and disorienting the data can appear when called this way.
women_stem_df = read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/women-stem.csv")
women_stem_df
## Rank Major_code
## 1 1 2419
## 2 2 2416
## 3 3 2415
## 4 4 2417
## 5 5 2418
## 6 6 2405
## 7 7 5001
## 8 8 2414
## 9 9 2401
## 10 10 2408
## 11 11 2407
## 12 12 5008
## 13 13 2404
## 14 14 2409
## 15 15 2402
## 16 16 2412
## 17 17 2400
## 18 18 2403
## 19 19 2102
## 20 20 2502
## 21 21 2413
## 22 22 2499
## 23 23 2406
## 24 24 2500
## 25 25 2411
## 26 26 2410
## 27 27 6107
## 28 28 2503
## 29 29 5102
## 30 30 2105
## 31 31 2100
## 32 32 5007
## 33 33 3701
## 34 34 3700
## 35 35 3702
## 36 36 3607
## 37 37 6105
## 38 38 5006
## 39 39 2501
## 40 40 4005
## 41 41 6104
## 42 42 2101
## 43 43 4006
## 44 44 2504
## 45 45 2599
## 46 46 5000
## 47 47 1401
## 48 48 3605
## 49 49 3603
## 50 50 6108
## 51 51 5003
## 52 52 3606
## 53 53 2106
## 54 54 3601
## 55 55 3602
## 56 56 2107
## 57 57 5004
## 58 58 5005
## 59 59 6199
## 60 60 1301
## 61 61 5002
## 62 62 2001
## 63 63 5098
## 64 64 3608
## 65 65 3611
## 66 66 6103
## 67 67 4002
## 68 68 6110
## 69 69 3699
## 70 70 6106
## 71 71 3600
## 72 72 3604
## 73 73 6109
## 74 74 6100
## 75 75 6102
## 76 76 3609
## Major
## 1 PETROLEUM ENGINEERING
## 2 MINING AND MINERAL ENGINEERING
## 3 METALLURGICAL ENGINEERING
## 4 NAVAL ARCHITECTURE AND MARINE ENGINEERING
## 5 NUCLEAR ENGINEERING
## 6 CHEMICAL ENGINEERING
## 7 ASTRONOMY AND ASTROPHYSICS
## 8 MECHANICAL ENGINEERING
## 9 AEROSPACE ENGINEERING
## 10 ELECTRICAL ENGINEERING
## 11 COMPUTER ENGINEERING
## 12 MATERIALS SCIENCE
## 13 BIOMEDICAL ENGINEERING
## 14 ENGINEERING MECHANICS PHYSICS AND SCIENCE
## 15 BIOLOGICAL ENGINEERING
## 16 INDUSTRIAL AND MANUFACTURING ENGINEERING
## 17 GENERAL ENGINEERING
## 18 ARCHITECTURAL ENGINEERING
## 19 COMPUTER SCIENCE
## 20 ELECTRICAL ENGINEERING TECHNOLOGY
## 21 MATERIALS ENGINEERING AND MATERIALS SCIENCE
## 22 MISCELLANEOUS ENGINEERING
## 23 CIVIL ENGINEERING
## 24 ENGINEERING TECHNOLOGIES
## 25 GEOLOGICAL AND GEOPHYSICAL ENGINEERING
## 26 ENVIRONMENTAL ENGINEERING
## 27 NURSING
## 28 INDUSTRIAL PRODUCTION TECHNOLOGIES
## 29 NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL TECHNOLOGIES
## 30 INFORMATION SCIENCES
## 31 COMPUTER AND INFORMATION SYSTEMS
## 32 PHYSICS
## 33 APPLIED MATHEMATICS
## 34 MATHEMATICS
## 35 STATISTICS AND DECISION SCIENCE
## 36 PHARMACOLOGY
## 37 MEDICAL TECHNOLOGIES TECHNICIANS
## 38 OCEANOGRAPHY
## 39 ENGINEERING AND INDUSTRIAL MANAGEMENT
## 40 MATHEMATICS AND COMPUTER SCIENCE
## 41 MEDICAL ASSISTING SERVICES
## 42 COMPUTER PROGRAMMING AND DATA PROCESSING
## 43 COGNITIVE SCIENCE AND BIOPSYCHOLOGY
## 44 MECHANICAL ENGINEERING RELATED TECHNOLOGIES
## 45 MISCELLANEOUS ENGINEERING TECHNOLOGIES
## 46 PHYSICAL SCIENCES
## 47 ARCHITECTURE
## 48 GENETICS
## 49 MOLECULAR BIOLOGY
## 50 PHARMACY PHARMACEUTICAL SCIENCES AND ADMINISTRATION
## 51 CHEMISTRY
## 52 MICROBIOLOGY
## 53 COMPUTER ADMINISTRATION MANAGEMENT AND SECURITY
## 54 BIOCHEMICAL SCIENCES
## 55 BOTANY
## 56 COMPUTER NETWORKING AND TELECOMMUNICATIONS
## 57 GEOLOGY AND EARTH SCIENCE
## 58 GEOSCIENCES
## 59 MISCELLANEOUS HEALTH MEDICAL PROFESSIONS
## 60 ENVIRONMENTAL SCIENCE
## 61 ATMOSPHERIC SCIENCES AND METEOROLOGY
## 62 COMMUNICATION TECHNOLOGIES
## 63 MULTI-DISCIPLINARY OR GENERAL SCIENCE
## 64 PHYSIOLOGY
## 65 NEUROSCIENCE
## 66 HEALTH AND MEDICAL ADMINISTRATIVE SERVICES
## 67 NUTRITION SCIENCES
## 68 COMMUNITY AND PUBLIC HEALTH
## 69 MISCELLANEOUS BIOLOGY
## 70 HEALTH AND MEDICAL PREPARATORY PROGRAMS
## 71 BIOLOGY
## 72 ECOLOGY
## 73 TREATMENT THERAPY PROFESSIONS
## 74 GENERAL MEDICAL AND HEALTH SERVICES
## 75 COMMUNICATION DISORDERS SCIENCES AND SERVICES
## 76 ZOOLOGY
## Major_category Total Men Women ShareWomen Median
## 1 Engineering 2339 2057 282 0.12056434 110000
## 2 Engineering 756 679 77 0.10185185 75000
## 3 Engineering 856 725 131 0.15303738 73000
## 4 Engineering 1258 1123 135 0.10731320 70000
## 5 Engineering 2573 2200 373 0.14496697 65000
## 6 Engineering 32260 21239 11021 0.34163050 65000
## 7 Physical Sciences 1792 832 960 0.53571429 62000
## 8 Engineering 91227 80320 10907 0.11955890 60000
## 9 Engineering 15058 12953 2105 0.13979280 60000
## 10 Engineering 81527 65511 16016 0.19645026 60000
## 11 Engineering 41542 33258 8284 0.19941264 60000
## 12 Engineering 4279 2949 1330 0.31082028 60000
## 13 Engineering 14955 8407 6548 0.43784687 60000
## 14 Engineering 4321 3526 795 0.18398519 58000
## 15 Engineering 8925 6062 2863 0.32078431 57100
## 16 Engineering 18968 12453 6515 0.34347322 57000
## 17 Engineering 61152 45683 15469 0.25295984 56000
## 18 Engineering 2825 1835 990 0.35044248 54000
## 19 Computers & Mathematics 128319 99743 28576 0.22269500 53000
## 20 Engineering 11565 8181 3384 0.29260700 52000
## 21 Engineering 2993 2020 973 0.32509188 52000
## 22 Engineering 9133 7398 1735 0.18997044 50000
## 23 Engineering 53153 41081 12072 0.22711794 50000
## 24 Engineering 3600 2695 905 0.25138889 50000
## 25 Engineering 720 488 232 0.32222222 50000
## 26 Engineering 4047 2662 1385 0.34222881 50000
## 27 Health 209394 21773 187621 0.89601899 48000
## 28 Engineering 4631 3477 1154 0.24919024 46000
## 29 Physical Sciences 2116 528 1588 0.75047259 46000
## 30 Computers & Mathematics 11913 9005 2908 0.24410308 45000
## 31 Computers & Mathematics 36698 27392 9306 0.25358330 45000
## 32 Physical Sciences 32142 23080 9062 0.28193641 45000
## 33 Computers & Mathematics 4939 2794 2145 0.43429844 45000
## 34 Computers & Mathematics 72397 39956 32441 0.44809868 45000
## 35 Computers & Mathematics 6251 2960 3291 0.52647576 45000
## 36 Biology & Life Science 1762 515 1247 0.70771850 45000
## 37 Health 15914 3916 11998 0.75392736 45000
## 38 Physical Sciences 2418 752 1666 0.68899917 44700
## 39 Engineering 2906 2400 506 0.17412251 44000
## 40 Computers & Mathematics 609 500 109 0.17898194 42000
## 41 Health 11123 803 10320 0.92780725 42000
## 42 Computers & Mathematics 4168 3046 1122 0.26919386 41300
## 43 Biology & Life Science 3831 1667 2164 0.56486557 41000
## 44 Engineering 4790 4419 371 0.07745303 40000
## 45 Engineering 8804 7043 1761 0.20002272 40000
## 46 Physical Sciences 1436 894 542 0.37743733 40000
## 47 Engineering 46420 25463 20957 0.45146489 40000
## 48 Biology & Life Science 3635 1761 1874 0.51554333 40000
## 49 Biology & Life Science 18300 7426 10874 0.59420765 40000
## 50 Health 23551 8697 14854 0.63071632 40000
## 51 Physical Sciences 66530 32923 33607 0.50514054 39000
## 52 Biology & Life Science 15232 6383 8849 0.58094800 38000
## 53 Computers & Mathematics 8066 6607 1459 0.18088272 37500
## 54 Biology & Life Science 39107 18951 20156 0.51540645 37400
## 55 Biology & Life Science 1329 626 703 0.52896915 37000
## 56 Computers & Mathematics 7613 5291 2322 0.30500460 36400
## 57 Physical Sciences 10972 5813 5159 0.47019687 36200
## 58 Physical Sciences 1978 809 1169 0.59100101 36000
## 59 Health 13386 1589 11797 0.88129389 36000
## 60 Biology & Life Science 25965 10787 15178 0.58455613 35600
## 61 Physical Sciences 4043 2744 1299 0.32129607 35000
## 62 Computers & Mathematics 18035 11431 6604 0.36617688 35000
## 63 Physical Sciences 62052 27015 35037 0.56463933 35000
## 64 Biology & Life Science 22060 8422 13638 0.61822303 35000
## 65 Biology & Life Science 13663 4944 8719 0.63814682 35000
## 66 Health 18109 4266 13843 0.76442653 35000
## 67 Health 18909 2563 16346 0.86445608 35000
## 68 Health 19735 4103 15632 0.79209526 34000
## 69 Biology & Life Science 10706 4747 5959 0.55660377 33500
## 70 Health 12740 5521 7219 0.56664050 33500
## 71 Biology & Life Science 280709 111762 168947 0.60185815 33400
## 72 Biology & Life Science 9154 3878 5276 0.57636006 33000
## 73 Health 48491 13487 35004 0.72186591 33000
## 74 Health 33599 7574 26025 0.77457662 32400
## 75 Health 38279 1225 37054 0.96799812 28000
## 76 Biology & Life Science 8409 3050 5359 0.63729338 26000
Unlike dataframes, tibbles have built-in data preview functionality. When tibbles are called, they only show the first few rows of the data instead of the entire dataset.
women_stem_tibble
## # A tibble: 76 x 9
## Rank Major_code Major Major_category Total Men Women ShareWomen
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 2419 PETR~ Engineering 2339 2057 282 0.121
## 2 2 2416 MINI~ Engineering 756 679 77 0.102
## 3 3 2415 META~ Engineering 856 725 131 0.153
## 4 4 2417 NAVA~ Engineering 1258 1123 135 0.107
## 5 5 2418 NUCL~ Engineering 2573 2200 373 0.145
## 6 6 2405 CHEM~ Engineering 32260 21239 11021 0.342
## 7 7 5001 ASTR~ Physical Scie~ 1792 832 960 0.536
## 8 8 2414 MECH~ Engineering 91227 80320 10907 0.120
## 9 9 2401 AERO~ Engineering 15058 12953 2105 0.140
## 10 10 2408 ELEC~ Engineering 81527 65511 16016 0.196
## # ... with 66 more rows, and 1 more variable: Median <dbl>
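The preview is only a default, and it can be adjusted when more or less output is needed. The sketch below uses the n and width arguments of print() for tibbles. To stay self-contained it uses mtcars, a dataset that ships with R, instead of the women in STEM data.

```r
library(tibble)

# mtcars ships with R; it has 32 rows and 11 columns
cars_tb <- as_tibble(mtcars)

# Show only the first 3 rows
print(cars_tb, n = 3)

# Show 15 rows and every column, regardless of console width
print(cars_tb, n = 15, width = Inf)
```

Setting n = Inf prints every row, which recovers the dataframe behavior when it is actually wanted.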
One quirk of dataframes is that, before R 4.0.0, they changed strings to the factor data type automatically (stringsAsFactors = TRUE was the default; since R 4.0.0 the default is FALSE). This used to be a convenience feature when data was simpler and string vectors were almost always categorical variables. It came in handy for columns with predefined categories like gender, where the vector would only contain the values “male”, “female”, and “other”. However, with the increasing popularity of natural language processing and using words as data, it is much more convenient to keep string vectors as strings rather than to change them to factors.
Unlike those dataframes, tibbles never change strings to the factor data type automatically. Instead, tibbles preserve the input data type by default. This difference can be seen below. Notice that the Major column is a 76-level factor in the dataframe (the output was produced with a version of R older than 4.0.0), while the same column is a character data type in the tibble.
str(women_stem_df$Major)
## Factor w/ 76 levels "AEROSPACE ENGINEERING",..: 68 54 52 61 63 12 5 48 1 26 ...
str(women_stem_tibble$Major)
## chr [1:76] "PETROLEUM ENGINEERING" "MINING AND MINERAL ENGINEERING" ...
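The difference can be reproduced on any version of R by setting stringsAsFactors explicitly. This is a minimal sketch on a hypothetical three-element vector; when a factor really is wanted in a tibble, it has to be requested with factor().

```r
library(tibble)

# Hypothetical toy vector of majors
majors <- c("PHYSICS", "NURSING", "PHYSICS")

# Explicitly ask data.frame() for the old default behavior
df_factor <- data.frame(major = majors, stringsAsFactors = TRUE)

# A tibble keeps the strings as character
tb_chr <- tibble(major = majors)

# Converting to a factor in a tibble is an explicit choice
tb_factor <- tibble(major = factor(majors))

class(df_factor$major)
class(tb_chr$major)
class(tb_factor$major)
```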
Another weird quirk of dataframes is that they don’t like spaces in their column names. By default, data.frame() converts column names into syntactically valid R names (check.names = TRUE), so spaces are changed to periods. The motivation is that a column name containing spaces has to be wrapped in backticks or quotes every time it is referenced, which was very annoying without autocomplete. However, modern R environments offer tab completion, which makes such column references very simple. Tibbles therefore leave column names exactly as they were given.
nwsdf = data.frame("name with spaces" = 2)
nwsdf
## name.with.spaces
## 1 2
nwsdf$name.with.spaces
## [1] 2
nwst = tibble("name with spaces" = 2)
nwst
## # A tibble: 1 x 1
## `name with spaces`
## <dbl>
## 1 2
nwst$`name with spaces`
## [1] 2
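The renaming is done by the base R function make.names(), and data.frame() can be told to skip it. The sketch below shows both, next to the tibble default. The object names spaced_df and spaced_tb are made up for illustration.

```r
library(tibble)

# make.names() is the base R function behind the renaming
fixed_name <- make.names("name with spaces")
print(fixed_name)

# check.names = FALSE tells data.frame() to leave the name alone
spaced_df <- data.frame("name with spaces" = 2, check.names = FALSE)
names(spaced_df)

# A tibble keeps the name as given by default
spaced_tb <- tibble("name with spaces" = 2)
names(spaced_tb)
```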
In order to maintain simplicity, tibbles do not support row names. Row names are a special attribute stored differently from normal columns, and they complicate dataframes because that attribute was not designed to store long names. If row.names is called on a tibble, it simply returns the row indices as character strings.
row.names(women_stem_tibble)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28"
## [29] "29" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42"
## [43] "43" "44" "45" "46" "47" "48" "49" "50" "51" "52" "53" "54" "55" "56"
## [57] "57" "58" "59" "60" "61" "62" "63" "64" "65" "66" "67" "68" "69" "70"
## [71] "71" "72" "73" "74" "75" "76"
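When a dataframe does carry meaningful row names, they can be kept by moving them into an ordinary column during conversion. The sketch below uses the rownames argument of as_tibble() on mtcars, a dataset that ships with R; the column name "model" is an arbitrary choice.

```r
library(tibble)

# mtcars stores the car models as row names
head(row.names(mtcars), 3)

# Move the row names into a regular column called `model`
cars_tb <- as_tibble(mtcars, rownames = "model")
cars_tb

# The resulting tibble has no row names of its own
has_rownames(cars_tb)
```

Once the names are in a normal column, they can be filtered, joined, and sorted like any other variable.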
One unique characteristic of R is its use of element recycling. Recycling is when elements are reused when a vector does not have enough elements to match the length of another vector it is attached to. An example of recycling can be seen below.
In the example below, the 1 in the x column is recycled 10 times to match the 10 elements in the y column.
tibble(x = 1, y = 1:10)
## # A tibble: 10 x 2
## x y
## <dbl> <int>
## 1 1 1
## 2 1 2
## 3 1 3
## 4 1 4
## 5 1 5
## 6 1 6
## 7 1 7
## 8 1 8
## 9 1 9
## 10 1 10
Recycling can be useful if the user is lazy and does not want to call the rep() function every time they want to repeat a number to the same length as another vector.
Recycling can also be used for vectors with multiple elements, as seen below. However, the catch is that the length of the longer vector must be a whole multiple of the length of the shorter one. Notice that x is 1:4 repeated twice (total length 8), while y is printed normally (total length 8).
data.frame(x = 1:4, y = 1:8)
## x y
## 1 1 1
## 2 2 2
## 3 3 3
## 4 4 4
## 5 1 5
## 6 2 6
## 7 3 7
## 8 4 8
If the length of one vector is not a multiple of the other, the shorter vector cannot be recycled evenly and data.frame() throws an error. This error happens frequently for people who rely on recycling, and it can cause many problems.
data.frame(x = 1:3, y = 1:8)
## Error in data.frame(x = 1:3, y = 1:8): arguments imply differing number of rows: 3, 8
Tibbles circumvent this multiple element recycling problem by only allowing vectors of length 1 to be recycled. This minimizes the possible errors that can occur when recycling, because length 1 vectors can always be recycled. As a result, the call that worked for a dataframe above fails for a tibble:
tibble(x = 1:4, y = 1:8)
## Tibble columns must have consistent lengths, only values of length one are recycled:
## * Length 4: Column `x`
## * Length 8: Column `y`
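When the repetition is intended, it can be spelled out with rep() so that every column has the same length. This is a minimal sketch; the object names are made up, and try() is used only so the rejected call does not stop the script.

```r
library(tibble)

# Spell out the repetition so both columns have length 8
recycled_tb <- tibble(x = rep(1:4, times = 2), y = 1:8)
recycled_tb

# Length 1 values are still recycled automatically
constant_tb <- tibble(x = 0, y = 1:8)

# The implicit version is rejected; try() keeps the script running
attempt <- try(tibble(x = 1:4, y = 1:8), silent = TRUE)
inherits(attempt, "try-error")
```

Writing rep() is slightly more typing, but the intent is visible in the code rather than hidden in a recycling rule.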
Two other useful parts of the tibble package are the glimpse and as_tibble functions.
glimpse simply gives a glimpse of the data. According to the documentation, it is like a transposed version of print such that columns run down the page, and data runs across. This makes it possible to see every column in a data frame. It’s a little like the str function applied to a data frame but it tries to show you as much data as possible. (And it always shows the underlying data, even when applied to a remote data source.)
glimpse(women_stem_tibble)
## Observations: 76
## Variables: 9
## $ Rank <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ Major_code <dbl> 2419, 2416, 2415, 2417, 2418, 2405, 5001, 2414,...
## $ Major <chr> "PETROLEUM ENGINEERING", "MINING AND MINERAL EN...
## $ Major_category <chr> "Engineering", "Engineering", "Engineering", "E...
## $ Total <dbl> 2339, 756, 856, 1258, 2573, 32260, 1792, 91227,...
## $ Men <dbl> 2057, 679, 725, 1123, 2200, 21239, 832, 80320, ...
## $ Women <dbl> 282, 77, 131, 135, 373, 11021, 960, 10907, 2105...
## $ ShareWomen <dbl> 0.1205643, 0.1018519, 0.1530374, 0.1073132, 0.1...
## $ Median <dbl> 110000, 75000, 73000, 70000, 65000, 65000, 6200...
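glimpse is not limited to tibbles. The sketch below calls it on mtcars, an ordinary dataframe that ships with R, and also shows that glimpse returns its input invisibly, so the result can be captured or passed along.

```r
library(tibble)

# glimpse() accepts ordinary dataframes, not only tibbles
glimpse(mtcars)

# The input is returned invisibly, so it can be captured or piped onward
returned <- glimpse(mtcars)
identical(returned, mtcars)
```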
as_tibble coerces lists and matrices to data frames (specifically, to tibbles). as.data.frame is effectively a thin wrapper around data.frame, and hence is rather slow (because it calls data.frame() on each element before cbinding together). as_tibble, in contrast, is a new S3 generic with more efficient methods for matrices and data frames.
m <- matrix(rnorm(50), ncol = 5)
colnames(m) <- c("a", "b", "c", "d", "e")
df <- as_tibble(m)
df
## # A tibble: 10 x 5
## a b c d e
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.299 0.568 1.52 -1.99 -0.289
## 2 -2.58 0.128 0.535 -0.331 -0.744
## 3 -0.626 1.22 0.816 0.552 -1.49
## 4 0.182 0.103 -1.57 -0.846 1.29
## 5 -0.170 -0.457 -0.814 -1.44 0.592
## 6 1.51 0.124 -0.843 0.875 1.37
## 7 -0.429 -0.519 -1.69 -1.43 1.05
## 8 -2.13 -1.40 0.323 -1.18 0.550
## 9 -0.455 -0.359 2.54 -0.0912 0.239
## 10 0.261 0.0978 -0.881 -0.328 -0.0994
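The same function also converts existing dataframes and named lists. The sketch below uses iris, a dataset that ships with R, plus a hypothetical list, and shows how to check for a tibble and convert back to a plain dataframe.

```r
library(tibble)

# From a dataframe (iris ships with R)
iris_tb <- as_tibble(iris)

# From a named list whose elements all have the same length
list_tb <- as_tibble(list(id = 1:3, label = c("a", "b", "c")))

# Check the class, and convert back when a plain dataframe is needed
is_tibble(iris_tb)
is_tibble(iris)
plain_df <- as.data.frame(list_tb)
class(plain_df)
```

Converting back with as.data.frame() is useful for older functions that expect a plain dataframe.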
Tibbles are a very useful form of dataframe that is fast and lightweight. Instead of carrying the bloated and outdated features of dataframes, tibbles offer a simplified version that preserves core functionality while removing features that are prone to errors and inefficiency, like row names, printing the entire dataframe, and multiple element recycling. Tibbles are especially useful for string data because they do not change column names or convert string columns to factors.
Overall, tibbles can do almost everything dataframes can do while being faster, more efficient, and more convenient.