- Background on data.table: what it’s for, who’s behind it, and how it fits with the tidyverse
- Intro to data.table objects and syntax
- Data manipulation with data.table: how to subset, extract, summarize, group, etc.
- An in-class exercise
What is a data.table?
A data.table is basically a data frame object with an upgrade. Because a data.table is still a data.frame, saving a dataset as a data.table object lets you keep using the packages and functions you typically use with data frames, like dplyr, and adds all of the data.table package functionality.
Why would I use data.table package functions over something else?
The syntax is concise, and you may find it more intuitive. data.table is also memory-efficient and fast, making it a good choice for exploring very large datasets.
Creating a data.table object is easy!
library(data.table)
example_dt <- data.table(your_dataset)
Note on loading your dataset: fread() does the same job as read.csv(), just much faster, and it returns a data.table directly. Great for reading very large datasets.
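A minimal sketch of the two steps above, assuming a small made-up data frame in place of your_dataset (the values and the temporary CSV file are hypothetical, only there so the example runs on its own):

```r
library(data.table)

# Hypothetical toy data standing in for your_dataset
your_dataset <- data.frame(
  country = c("Germany", "France", "Kenya"),
  cases   = c(120L, 95L, 40L)
)

# Convert an existing data frame into a data.table
example_dt <- data.table(your_dataset)

# The object carries both classes, so data frame tools still work on it
class(example_dt)

# Write a temporary CSV, then read it back with fread()
tmp <- tempfile(fileext = ".csv")
fwrite(example_dt, tmp)
from_disk <- fread(tmp)   # fread() returns a data.table
```

as.data.table(your_dataset) is another way to do the conversion.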
your_dataset[i, j, by]
             |  |  |
             |  |  --> grouped by
             |  -----> columns/computations
             --------> rows
This dataset is built into the coronavirus package. It covers COVID-19 cases by country and is pulled from the Johns Hopkins University Center for Systems Science and Engineering.
It’s already in long, tidy format, and it’s pretty massive: 518,682 observations of 15 variables, making it a great candidate for looking at with data.table. Note that there are some anomalies in the data due to changes in reporting practices and methodologies, such as when false positives were removed.
We’re using i for subsetting rows. Note that the trailing comma isn’t strictly necessary when you supply only i, but it is useful for clarity.
Examples:
your_dataset[1:5, ] # Subset first 5 rows
your_dataset[-(1:5), ] # Subset everything but first 5 rows
your_dataset[.N, ] # Returns the last row.
“.N” is useful, giving us the integer number of rows in the data.table.
your_dataset[city_name == "Berlin", ] # Returns all rows that match a variable value.
You can use logical expressions in i, as in dplyr::filter():
<, >, <=, >=, ==
is.na(), !is.na()
%in%
|, &
Special data.table operators:
%like% ← Allows searching for a pattern in a character or factor column
Example: dt[species %like% "bat", ]
%between% ← Allows searching for values within a closed interval
Example: dt[pop %between% c(2500, 5000), ]
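The row filters above can be tried on a toy table. This is a sketch only: the species, city_name, and pop values are invented for illustration and are not from the coronavirus data.

```r
library(data.table)

# Hypothetical toy data; column names follow the slide examples
dt <- data.table(
  species   = c("fruit bat", "wombat", "otter", "heron", "vampire bat", "lynx"),
  city_name = c("Berlin", "Perth", "Oslo", "Berlin", "Lima", "Oslo"),
  pop       = c(3000, 5200, 2500, 800, 4100, 5000)
)

first_two <- dt[1:2, ]                          # rows 1 and 2
last_row  <- dt[.N, ]                           # .N is the row count, so this is the last row
berlin    <- dt[city_name == "Berlin", ]        # exact match on a value
bats      <- dt[species %like% "bat", ]         # pattern match anywhere in the string
mid_pop   <- dt[pop %between% c(2500, 5000), ]  # both endpoints are included
```

Note that %like% matches the pattern anywhere in the value, so "wombat" is returned alongside the two bats.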
Let’s try this in R!
Now we’re using j, which refers to columns. Remember that j takes column names either unquoted inside .() or as a character vector, e.g. c("country", "cases")
… (and remember the commas!)
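Examples:
ans <- your_dataset[, .(country, cases)] # the "." stands in for list() here
ans <- dt[, c(1, 4)] # Can specify column number
Note: columns extracted with .() or c() are returned as a data.table object, not a vector. A single bare column name, e.g. your_dataset[, country], returns a vector instead.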
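A short sketch of the column-selection forms above, run on a made-up four-column table (the rows are hypothetical; only the column names echo the coronavirus examples):

```r
library(data.table)

# Hypothetical toy data with four columns
dt <- data.table(
  country = c("Germany", "Kenya"),
  date    = as.Date(c("2020-03-01", "2020-03-01")),
  type    = c("confirmed", "confirmed"),
  cases   = c(120L, 40L)
)

by_name <- dt[, .(country, cases)]       # unquoted names inside .()
by_chr  <- dt[, c("country", "cases")]   # character vector of names
by_pos  <- dt[, c(1, 4)]                 # column positions
as_vec  <- dt[, country]                 # single bare name gives a plain vector
```

Selecting by name is usually safer than by position, since positions shift when columns are added or reordered.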
In data.table, you can run computations directly in j. The easiest place to start is calculating summary statistics.
Example: your_dataset[, mean(pop)]
You can also easily add a filter in i to narrow your results:
Example: your_dataset[species == "bat", mean(pop)]
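To make the two examples above concrete, here is a sketch on invented numbers (the species and pop values are hypothetical):

```r
library(data.table)

# Hypothetical toy data
dt <- data.table(
  species = c("bat", "bat", "otter", "otter"),
  pop     = c(100, 300, 50, 150)
)

overall_mean <- dt[, mean(pop)]                  # 150: mean over every row
bat_mean     <- dt[species == "bat", mean(pop)]  # 200: i filters first, then j computes

# Wrapping j in .() returns a named data.table with several statistics at once
bat_summary <- dt[species == "bat", .(mean_pop = mean(pop), n = .N)]
```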
j also allows you to create new columns easily.
Example: your_dataset[, gdp_per_capita := gdp / pop]
This adds a new calculated column, gdp_per_capita, to your_dataset in place (by reference), so there is no need to assign the result back.
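A small sketch of := in action, using invented gdp and pop figures (hypothetical values). It also shows copy(), which is worth knowing because := changes the original table:

```r
library(data.table)

# Hypothetical toy data
dt <- data.table(
  country = c("A", "B"),
  gdp     = c(1000, 4500),
  pop     = c(10, 30)
)

# Adds the column to dt itself; no "dt <- ..." needed
dt[, gdp_per_capita := gdp / pop]   # values: 100, 150

# copy() makes an independent table, so edits to it leave dt untouched
scratch <- copy(dt)
scratch[, gdp_per_capita := NULL]   # assigning NULL removes the column from scratch only
```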
Let’s take a closer look at using j in R.
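Similar to dplyr::group_by(), the by argument lets you group your results by a given categorical variable.
Example: your_dataset[, gdp_per_capita := gdp / pop, by = continent]
This evaluates j separately within each continent and stores the result in a new column, gdp_per_capita. Because gdp / pop is a row-by-row calculation, the values match the ungrouped version; grouping starts to matter once j aggregates, e.g. your_dataset[, mean(pop), by = continent].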
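A sketch of by with an aggregating j, on made-up continent figures (all values are hypothetical). The second call combines by with := to add a per-group result to every row:

```r
library(data.table)

# Hypothetical toy data
dt <- data.table(
  continent = c("Europe", "Europe", "Africa", "Africa"),
  country   = c("A", "B", "C", "D"),
  gdp       = c(400, 600, 100, 300)
)

# One row per continent: j is evaluated once for each group
totals <- dt[, .(total_gdp = sum(gdp), countries = .N), by = continent]

# With :=, the grouped result is written back onto every row of dt
dt[, share_of_continent_gdp := gdp / sum(gdp), by = continent]
```

Here sum(gdp) is computed within each continent, so the shares add up to 1 per group.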
Note: data("coronavirus") loads the relevant dataset once the coronavirus package is attached.