Speed Up Your Data Analysis with data.table

Gentle introduction

Author

DSA_406_001_SP25_wk4_unity_id

Today’s Lesson: Efficient Big Data Analysis with data.table

Objectives

  • Understand why data.table is crucial for big data analysis

  • Learn the syntax and key features of data.table

  • Practice with real-world Spotify data

  • Compare performance with traditional methods

Why data.table?

data.table is designed for handling large datasets (1GB to 100GB) with:

  • Lightning-fast operations

  • Memory efficiency

  • Concise syntax

  • Enhanced features for data manipulation

Setting Up

First, let’s install and load our required packages:

# Install and load required packages
required_packages <- c("data.table", "tidyverse", "microbenchmark", "pryr")

# Install missing packages
for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

# Load libraries individually
#library(data.table)
#library(tidyverse)
#library(microbenchmark) # For precise timing
#library(pryr) # For memory tracking

#Load libraries
lapply(required_packages, library, character.only = TRUE)
[[1]]
[1] "data.table" "stats"      "graphics"   "grDevices"  "utils"     
[6] "datasets"   "methods"    "base"      

[[2]]
 [1] "forcats"    "stringr"    "dplyr"      "purrr"      "readr"     
 [6] "tidyr"      "tibble"     "ggplot2"    "tidyverse"  "data.table"
[11] "stats"      "graphics"   "grDevices"  "utils"      "datasets"  
[16] "methods"    "base"      

[[3]]
 [1] "microbenchmark" "forcats"        "stringr"        "dplyr"         
 [5] "purrr"          "readr"          "tidyr"          "tibble"        
 [9] "ggplot2"        "tidyverse"      "data.table"     "stats"         
[13] "graphics"       "grDevices"      "utils"          "datasets"      
[17] "methods"        "base"          

[[4]]
 [1] "pryr"           "microbenchmark" "forcats"        "stringr"       
 [5] "dplyr"          "purrr"          "readr"          "tidyr"         
 [9] "tibble"         "ggplot2"        "tidyverse"      "data.table"    
[13] "stats"          "graphics"       "grDevices"      "utils"         
[17] "datasets"       "methods"        "base"          

Let’s read our Spotify data using data.table:

# Read the dataset using fread (much faster than read.csv() or read_csv())
#spotify_dt <- fread("https://docs.google.com/spreadsheets/d/e/2PACX-1vT9iNNoFknWSmgIglkiEXnmXkJplHHeEf9wPn07tvtOV-5eTh-MdYnLS3eyST_Bx2OXpd0GC9dw5gBt/pub?gid=1801845240&single=true&output=csv")

#read in teh data from data folder
spotify_dt <- fread("spotify_clean.csv")


spotify_csv <- read_csv("spotify_clean.csv")

Lets test it out with the fread function

# Let's measure memory usage
mem_before <- pryr::mem_used()
system.time(spotify_dt <- data.table::fread("spotify_clean.csv"))
   user  system elapsed 
   0.00    0.00    0.02 
mem_after_fread <- pryr::mem_used()
fread_memory <- mem_after_fread - mem_before

βœ…Now you try:

Now your turn try it out with the readr function read_csv

#comment and write your code below:
mem_before <- pryr::mem_used()
system.time(spotify_dt <- readr::read_csv("spotify_clean.csv"))
Rows: 9999 Columns: 30
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (11): track_name, artist_name_s, album_name, album_artist_name_s, album...
dbl  (16): disc_number, track_number, track_duration_ms, popularity, danceab...
lgl   (2): explicit, album_genres
dttm  (1): added_at

β„Ή Use `spec()` to retrieve the full column specification for this data.
β„Ή Specify the column types or set `show_col_types = FALSE` to quiet this message.
   user  system elapsed 
   0.02    0.06    0.18 
mem_after_fread <- pryr::mem_used()
fread_memory <- mem_after_fread - mem_before

When you have an existing data frame (like one from tidyverse or base R like spotify_csv), you can convert it to a data.table using setDT().

# Option 1: setDT() - converts in place (memory efficient)
#setDT(spotify_csv)  # Converts the existing object to data.table

# Option 2: as.data.table() - creates a copy (uses more memory)
#spotify_dt <- as.data.table(spotify_csv)

πŸŽŠπŸŽ‰ Congratulations you finished week 4 and now on to a mini project analysis using data.table functions.

πŸ’‘ Mini-Project Guidelines:

Create a comprehensive analysis using data.table that:

Demonstrates at least three different data.table operations Includes performance comparisons with equivalent dplyr operations Creates insightful visualizations from the results

Remember to:

  • Document your code

  • Explain your analysis choices

  • Consider performance implications

  • Think about scalability

Want more….

  • Explore more advanced data.table features

Analysis: Who are the three most successful artists in the Hip Hop genre ?

Prompt for Chat GPT/Claude/Gemini

You are a data scientist using exploring a large data set using exploratory data analysis principles. Your task is to create code in R using data.table functions to …. {Add what you want it to do}

Prompt (Chat GPT) You are a data scientist using exploring a large data set using exploratory data analysis principles. Your task is to create code in R using data.table functions to find the 3 most successful artists in the Hip Hop genre.

library(data.table)

# Load dataset
dt <- fread("spotify_clean.csv")

# Filter for Hip Hop genre using grepl to capture variations
hiphop_dt <- dt[grepl("Hip Hop", main_genre, ignore.case = TRUE)]


# Compute success metrics per artist
hiphop_dt <- hiphop_dt[, .(
  Total_Success = sum(popularity, na.rm = TRUE),
  Avg_Popularity = mean(popularity, na.rm = TRUE)
), by = artist_name_s]

hiphop_dt
                      artist_name_s Total_Success Avg_Popularity
  1:       Run–D.M.C., Jason Nevins           109       36.33333
  2:                         Eminem           457       19.86957
  3:  Zay Hilfigerrr, Zayion McCall            61       61.00000
  4:                          Nitty             0        0.00000
  5: Heavy D & The Boyz, Aaron Hall            65       32.50000
 ---                                                            
115:               LunchMoney Lewis            62       62.00000
116:     Naughty By Nature, Phiness            52       52.00000
117:                    The Streets            62       62.00000
118:             Travis Scott, HVME            51       51.00000
119:                         Fugees             0        0.00000
# Define a weighted score for ranking
hiphop_dt[, Weighted_Score := (Total_Success * 0.5) + (Avg_Popularity * 0.5)]


# Get the top 3 unique artists based on Weighted_Score
top_artists <- hiphop_dt[order(-Weighted_Score)][1:3]

# Print top 3 artists
print(top_artists)
   artist_name_s Total_Success Avg_Popularity Weighted_Score
1:        Eminem           457       19.86957       238.4348
2:     Lil Nas X           248       62.00000       155.0000
3:       Outkast           261       43.50000       152.2500