# Read the dataset using fread (much faster than read.csv() or read_csv())#spotify_dt <- fread("https://docs.google.com/spreadsheets/d/e/2PACX-1vT9iNNoFknWSmgIglkiEXnmXkJplHHeEf9wPn07tvtOV-5eTh-MdYnLS3eyST_Bx2OXpd0GC9dw5gBt/pub?gid=1801845240&single=true&output=csv")#read in teh data from data folderspotify_dt <-fread("data/spotify_clean.csv")spotify_csv <-read_csv("data/spotify_clean.csv")
Now your turn try it out with the readr function read_csv
#comment and write your code below:my_spotify_csv <-read_csv("data/spotify_clean.csv")
Rows: 9999 Columns: 30
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): track_name, artist_name_s, album_name, album_artist_name_s, album...
dbl (16): disc_number, track_number, track_duration_ms, popularity, danceab...
lgl (2): explicit, album_genres
dttm (1): added_at
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
When you have an existing data frame (like one from tidyverse or base R like spotify_csv), you can convert it to a data.table using setDT().
# Option 1: setDT() - converts in place (memory efficient)#setDT(spotify_csv) # Converts the existing object to data.tablesetDT(spotify_dt)# Option 2: as.data.table() - creates a copy (uses more memory)#spotify_dt <- as.data.table(spotify_csv)my_spotify_dt <-as.data.table(my_spotify_csv)
setDT() is more memory efficient because it modifies the existing object
as.data.table() creates a copy, using twice the memory
For big data, it’s best to read directly with fread() rather than converting
The conversion only needs to happen if you start with a regular data frame
| means “OR” (at least one condition must be true)
! means “NOT” (reverses the condition)
2.) Column Operations (j)
The j argument lets you select columns and perform computations:
# Select specific columnsspotify_dt[, .(track_name, artist_name_s, popularity)]
track_name
<char>
1: Justified & Ancient - Stand by the Jams
2: I Know You Want Me (Calle Ocho)
3: From the Bottom of My Broken Heart
4: Apeman - 2014 Remastered Version
5: You Can't Always Get What You Want
---
9995: Kernkraft 400 (A Better Day)
9996: Never Say Never - Radio Edit
9997: Groovejet (If This Ain't Love) [feat. Sophie Ellis-Bextor]
9998: Lay Low
9999: Padam Padam
artist_name_s popularity
<char> <int>
1: The KLF 0
2: Pitbull 64
3: Britney Spears 56
4: The Kinks 42
5: The Rolling Stones 0
---
9995: Topic, A7S 79
9996: Vandalism 17
9997: Spiller, Sophie Ellis-Bextor 62
9998: Tiësto 87
9999: Kylie Minogue 69
# Select specific columnsspotify_dt[, .(artist_name_s, track_name)]
artist_name_s
<char>
1: The KLF
2: Pitbull
3: Britney Spears
4: The Kinks
5: The Rolling Stones
---
9995: Topic, A7S
9996: Vandalism
9997: Spiller, Sophie Ellis-Bextor
9998: Tiësto
9999: Kylie Minogue
track_name
<char>
1: Justified & Ancient - Stand by the Jams
2: I Know You Want Me (Calle Ocho)
3: From the Bottom of My Broken Heart
4: Apeman - 2014 Remastered Version
5: You Can't Always Get What You Want
---
9995: Kernkraft 400 (A Better Day)
9996: Never Say Never - Radio Edit
9997: Groovejet (If This Ain't Love) [feat. Sophie Ellis-Bextor]
9998: Lay Low
9999: Padam Padam
# Create new columns in outputspotify_dt[, .(song = track_name,artist = artist_name_s,popularity_score = popularity/100)]
song
<char>
1: Justified & Ancient - Stand by the Jams
2: I Know You Want Me (Calle Ocho)
3: From the Bottom of My Broken Heart
4: Apeman - 2014 Remastered Version
5: You Can't Always Get What You Want
---
9995: Kernkraft 400 (A Better Day)
9996: Never Say Never - Radio Edit
9997: Groovejet (If This Ain't Love) [feat. Sophie Ellis-Bextor]
9998: Lay Low
9999: Padam Padam
artist popularity_score
<char> <num>
1: The KLF 0.00
2: Pitbull 0.64
3: Britney Spears 0.56
4: The Kinks 0.42
5: The Rolling Stones 0.00
---
9995: Topic, A7S 0.79
9996: Vandalism 0.17
9997: Spiller, Sophie Ellis-Bextor 0.62
9998: Tiësto 0.87
9999: Kylie Minogue 0.69
#comment and write your code below:# Select specific columnsspotify_dt[, .(track_name, label, energy)]
track_name
<char>
1: Justified & Ancient - Stand by the Jams
2: I Know You Want Me (Calle Ocho)
3: From the Bottom of My Broken Heart
4: Apeman - 2014 Remastered Version
5: You Can't Always Get What You Want
---
9995: Kernkraft 400 (A Better Day)
9996: Never Say Never - Radio Edit
9997: Groovejet (If This Ain't Love) [feat. Sophie Ellis-Bextor]
9998: Lay Low
9999: Padam Padam
label energy
<char> <num>
1: Jams Communications 0.872
2: Mr.305/Polo Grounds Music/J Records 0.743
3: Jive 0.665
4: Sanctuary Records 0.728
5: Universal Music Group 0.627
---
9995: Virgin 0.727
9996: Vicious 0.841
9997: Defected Records 0.806
9998: Musical Freedom 0.855
9999: Liberator Music 0.620
track_count avg_popularity avg_loudness
<int> <num> <num>
1: 9999 37.62466 NA
# Select specific columnsspotify_dt[, .(artist_genres, label)]
artist_genres
<char>
1: acid house,ambient house,big beat,hip house
2: dance pop,miami hip hop,pop
3: dance pop,pop
4: album rock,art rock,british invasion,classic rock,folk rock,glam rock,protopunk,psychedelic rock,rock,singer-songwriter
5: album rock,british invasion,classic rock,rock
---
9995: german dance,pop dance,pop edm,uk dance,pop dance,scandipop,uk dance
9996: australian dance,melbourne bounce
9997: disco house,vocal house,dance pop,europop,new wave pop
9998: big room,brostep,dutch edm,edm,house,pop dance,slap house,trance
9999: australian dance,australian pop,dance pop,eurodance,new wave pop
label
<char>
1: Jams Communications
2: Mr.305/Polo Grounds Music/J Records
3: Jive
4: Sanctuary Records
5: Universal Music Group
---
9995: Virgin
9996: Vicious
9997: Defected Records
9998: Musical Freedom
9999: Liberator Music
# Create new columns in outputspotify_dt[, .(song = track_name,artist = artist_name_s,popularity_score = popularity/100)]
song
<char>
1: Justified & Ancient - Stand by the Jams
2: I Know You Want Me (Calle Ocho)
3: From the Bottom of My Broken Heart
4: Apeman - 2014 Remastered Version
5: You Can't Always Get What You Want
---
9995: Kernkraft 400 (A Better Day)
9996: Never Say Never - Radio Edit
9997: Groovejet (If This Ain't Love) [feat. Sophie Ellis-Bextor]
9998: Lay Low
9999: Padam Padam
artist popularity_score
<char> <num>
1: The KLF 0.00
2: Pitbull 0.64
3: Britney Spears 0.56
4: The Kinks 0.42
5: The Rolling Stones 0.00
---
9995: Topic, A7S 0.79
9996: Vandalism 0.17
9997: Spiller, Sophie Ellis-Bextor 0.62
9998: Tiësto 0.87
9999: Kylie Minogue 0.69
main_genre explicit track_count avg_popularity
<char> <lgcl> <int> <num>
1: Dance/Electronic FALSE 252 32.81746
2: Pop FALSE 5345 39.74275
3: Rock FALSE 2363 37.50994
4: Hip Hop FALSE 132 31.15152
5: Other FALSE 924 25.98377
6: R&B/Soul FALSE 301 35.05316
7: Pop TRUE 349 48.36390
8: Jazz/Blues FALSE 54 22.24074
9: Hip Hop TRUE 82 34.69512
10: Folk/Country FALSE 89 33.40449
11: Reggae/Caribbean FALSE 27 48.22222
12: Rock TRUE 50 40.68000
13: Other TRUE 22 30.90909
14: Classical/Instrumental TRUE 1 0.00000
15: Dance/Electronic TRUE 2 26.00000
16: Classical/Instrumental FALSE 3 19.66667
17: Folk/Country TRUE 2 90.50000
18: R&B/Soul TRUE 1 0.00000
# Group by genre for popularity analysisgenre_stats <- spotify_dt[, .(avg_popularity =mean(popularity, na.rm =TRUE),track_count = .N,avg_duration =mean(track_duration_ms)/1000), by = main_genre][]
✅Now you try:
Grouping (i)
#comment and write your code below:grouping_spotify <- spotify_dt[1:10, .(track_count = .N,avg_popularity =mean(popularity),avg_danceability =mean(danceability)), by= .(label, main_genre)]grouping_spotify
label main_genre track_count
<char> <char> <int>
1: Jams Communications Dance/Electronic 1
2: Mr.305/Polo Grounds Music/J Records Pop 1
3: Jive Pop 1
4: Sanctuary Records Rock 1
5: Universal Music Group Rock 1
6: Rhino/Warner Records Rock 1
7: Benny Blanco Solo Album PS Pop 1
8: EMI Rock 1
9: Rhino Atlantic Rock 1
10: Universal Music Group Pop 1
avg_popularity avg_danceability
<num> <num>
1: 0 0.617
2: 64 0.825
3: 56 0.677
4: 42 0.683
5: 0 0.319
6: 79 0.671
7: 78 0.560
8: 61 0.480
9: 74 0.357
10: 0 0.562
Think of it like a sentence:
i = “Which rows do you want?”
j = “What do you want to do with them?”
by = “How should they be grouped?”
5. Advanced Features
Adding/Updating Columns (:=)
We use the empty brackets at the end to print but you could have also
# Add duration in secondsspotify_dt[, duration_sec := track_duration_ms/1000][]
# Load lubridatelibrary(lubridate)# Convert and extract date componentsspotify_dt <- spotify_dt[, `:=`(release_date =as.Date(album_release_date),release_year =year(as.Date(album_release_date)),release_month =month(as.Date(album_release_date)),release_day =day(as.Date(album_release_date)))]# View the new date columnsspotify_dt[, .(release_date, release_year, release_month, release_day)][1:5]
Key: <main_genre>
main_genre danceability energy valence
<char> <num> <num> <num>
1: Classical/Instrumental 0.5580000 0.5570000 0.6500000
2: Dance/Electronic 0.6779331 0.8231969 0.5845252
3: Folk/Country 0.5683407 0.5402066 0.5439121
4: Hip Hop 0.7607477 0.7146449 0.6088201
5: Jazz/Blues 0.5241852 0.5895185 0.7212222
6: Other NA NA NA
7: Pop 0.6243319 0.6954419 0.5665183
8: R&B/Soul 0.6557980 0.6104570 0.6947848
9: Reggae/Caribbean 0.7564815 0.6411481 0.7690000
10: Rock 0.5480435 0.6759753 0.5957726
Watch out for these common mistakes:
Forgetting the .() when selecting multiple columns
Using $ instead of := for assignments
Forgetting to add [] at the end when you want to see results
Remember:
tidyverse is great for readability and smaller datasets
data.table shines with big data and performance needs
Both have their place in your R toolkit
🎊🎉 Congratulations you finished week 4 and now on to a mini project analysis using data.table functions.
💡 Mini-Project Guidelines:
Create a comprehensive analysis using data.table that:
Demonstrates at least three different data.table operations Includes performance comparisons with equivalent dplyr operations Creates insightful visualizations from the results
Remember to:
Document your code
Explain your analysis choices
Consider performance implications
Think about scalability
Want more….
Explore more advanced data.table features
Prompt for Chat GPT/Claude/Gemini:
You are a data scientist using exploring a large data set using exploratory data analysis principles. Your task is to create code in R using data.table functions to investigate the mean and standard deviation of the energy and danceability of songs from songs from each decade represented in the data set. Demonstrate at least three different data.table operations, include performance comparisons with equivalent dplyr operations, and create insightful visualizations from the results. Make sure the x-axis is from 1950-2020.
library(data.table)library(dplyr)library(ggplot2)# Load the datasetdata <-fread("data/spotify_clean.csv")# Extract decadedata[, decade :=as.numeric(substr(album_release_date, 1, 4))]data[, decade :=floor(decade /10) *10]# Compute mean and standard deviation using data.tablesummary_dt <- data[, .(mean_energy =mean(energy, na.rm =TRUE),sd_energy =sd(energy, na.rm =TRUE),mean_danceability =mean(danceability, na.rm =TRUE),sd_danceability =sd(danceability, na.rm =TRUE)), by = decade]# Compute equivalent operations using dplyrsummary_dplyr <- data %>%group_by(decade) %>%summarise(mean_energy =mean(energy, na.rm =TRUE),sd_energy =sd(energy, na.rm =TRUE),mean_danceability =mean(danceability, na.rm =TRUE),sd_danceability =sd(danceability, na.rm =TRUE) )# Performance comparisonsystem.time(data[, .(mean_energy =mean(energy, na.rm =TRUE),sd_energy =sd(energy, na.rm =TRUE),mean_danceability =mean(danceability, na.rm =TRUE),sd_danceability =sd(danceability, na.rm =TRUE)), by = decade])
# Visualizationplot1 <-ggplot(summary_dt, aes(x = decade)) +geom_line(aes(y = mean_energy, color ="Mean Energy")) +geom_line(aes(y = mean_danceability, color ="Mean Danceability")) +labs(title ="Trends in Energy and Danceability Over Decades", x ="Decade", y ="Average Song Characteristics") +scale_x_continuous(limits =c(1950, 2020)) +scale_color_manual(values =c("Mean Energy"="blue", "Mean Danceability"="red")) +theme_minimal()print(plot1)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_line()`).
Removed 2 rows containing missing values or values outside the scale range
(`geom_line()`).
I decided to recreate the analysis I completed last week. I used chat-GPT using the prompt given above, and received the code output seen in the above code block. This code gives us the mean and standard deviation of the energy and danceability variables by decade to study musical trends over time. It also shows the computing power and time differences between dplyr and data.table functions (which seems to be about even). Finally, it graphs the means of each variable over time, which helps visualize the changes seen in the analysis.