    # Read the dataset using fread() (much faster than read.csv() or read_csv())
    # spotify_dt <- fread("https://docs.google.com/spreadsheets/d/e/2PACX-1vT9iNNoFknWSmgIglkiEXnmXkJplHHeEf9wPn07tvtOV-5eTh-MdYnLS3eyST_Bx2OXpd0GC9dw5gBt/pub?gid=1801845240&single=true&output=csv")

    # Read in the data from the data folder
    spotify_dt <- fread("spotify_clean.csv")
    spotify_csv <- read_csv("spotify_clean.csv")
When you already have a data frame (for example, a tibble from the tidyverse or a base R data.frame, such as spotify_csv), you can convert it to a data.table using setDT().
    # Option 1: setDT() - converts in place (memory efficient)
    # setDT(spotify_csv)  # Converts the existing object to data.table

    # Option 2: as.data.table() - creates a copy (uses more memory)
    # spotify_dt <- as.data.table(spotify_csv)
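The difference between the two options can be checked on a small example. The sketch below uses a hypothetical toy data frame (`toy_df`, not part of the course data) in place of spotify_csv, and assumes only that the data.table package is installed.

```r
library(data.table)

# Hypothetical toy data frame standing in for spotify_csv
toy_df <- data.frame(artist = c("A", "B", "C"), popularity = c(10, 20, 30))

# as.data.table() returns a new object and leaves toy_df unchanged
toy_copy <- as.data.table(toy_df)
is.data.table(toy_df)    # still a plain data.frame at this point
is.data.table(toy_copy)  # the copy is a data.table

# setDT() converts toy_df itself, without creating a second object
setDT(toy_df)
is.data.table(toy_df)    # now a data.table
```

For a large table, setDT() avoids holding two copies in memory, which is why it is described as the memory-efficient option.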
Congratulations, you have finished week 4! Now on to a mini-project analysis using data.table functions.
Mini-Project Guidelines:
Create a comprehensive analysis using data.table that:
Demonstrates at least three different data.table operations
Includes performance comparisons with equivalent dplyr operations
Creates insightful visualizations from the results
Remember to:
Document your code
Explain your analysis choices
Consider performance implications
Think about scalability
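One way to approach the performance comparison in the guidelines is to time the same grouped summary in both packages with system.time(). The sketch below is only an illustration: the data are synthetic, the column names `genre` and `popularity` are placeholders, and it assumes both data.table and dplyr are installed. Timings vary by machine, so no expected numbers are shown.

```r
library(data.table)
library(dplyr)

set.seed(42)
n <- 1e6

# Synthetic data; column names are placeholders, not the Spotify schema
toy_dt <- data.table(
  genre      = sample(c("Hip Hop", "Rock", "Pop"), n, replace = TRUE),
  popularity = sample(0:100, n, replace = TRUE)
)
toy_tbl <- as_tibble(toy_dt)

# Grouped mean with data.table
time_dt <- system.time(
  res_dt <- toy_dt[, .(avg_popularity = mean(popularity)), by = genre]
)

# Equivalent grouped mean with dplyr
time_dplyr <- system.time(
  res_dplyr <- toy_tbl %>%
    group_by(genre) %>%
    summarise(avg_popularity = mean(popularity))
)

# Elapsed seconds for each approach
elapsed <- c(data.table = time_dt[["elapsed"]], dplyr = time_dplyr[["elapsed"]])
print(elapsed)

# Confirm that both approaches produce the same summary
all.equal(res_dt[order(genre)]$avg_popularity,
          arrange(res_dplyr, genre)$avg_popularity)
```

A single run is a rough guide at best. Repeating the timing several times, or on larger data, gives a better sense of how each approach scales.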
Want more…
Explore more advanced data.table features
Analysis: Who are the three most successful artists in the Hip Hop genre?
Prompt for ChatGPT/Claude/Gemini
You are a data scientist exploring a large data set using exploratory data analysis principles. Your task is to create code in R using data.table functions to … {Add what you want it to do}
Prompt (ChatGPT)
You are a data scientist exploring a large data set using exploratory data analysis principles. Your task is to create code in R using data.table functions to find the 3 most successful artists in the Hip Hop genre.
    library(data.table)

    # Load dataset
    dt <- fread("spotify_clean.csv")

    # Filter for Hip Hop genre using grepl to capture variations
    hiphop_dt <- dt[grepl("Hip Hop", main_genre, ignore.case = TRUE)]

    # Compute success metrics per artist
    hiphop_dt <- hiphop_dt[, .(Total_Success = sum(popularity, na.rm = TRUE),
                               Avg_Popularity = mean(popularity, na.rm = TRUE)),
                           by = artist_name_s]
    hiphop_dt
                          artist_name_s Total_Success Avg_Popularity
      1:       Run–D.M.C., Jason Nevins           109       36.33333
      2:                         Eminem           457       19.86957
      3:  Zay Hilfigerrr, Zayion McCall            61       61.00000
      4:                          Nitty             0        0.00000
      5: Heavy D & The Boyz, Aaron Hall            65       32.50000
     ---
    115:               LunchMoney Lewis            62       62.00000
    116:     Naughty By Nature, Phiness            52       52.00000
    117:                    The Streets            62       62.00000
    118:             Travis Scott, HVME            51       51.00000
    119:                         Fugees             0        0.00000
    # Define a weighted score for ranking
    hiphop_dt[, Weighted_Score := (Total_Success * 0.5) + (Avg_Popularity * 0.5)]

    # Get the top 3 unique artists based on Weighted_Score
    top_artists <- hiphop_dt[order(-Weighted_Score)][1:3]

    # Print top 3 artists
    print(top_artists)
       artist_name_s Total_Success Avg_Popularity Weighted_Score
    1:        Eminem           457       19.86957       238.4348
    2:     Lil Nas X           248       62.00000       155.0000
    3:       Outkast           261       43.50000       152.2500
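Because Total_Success is a sum over tracks, an artist with many tracks can outrank one with a higher average popularity, as Eminem does above. The sketch below reruns the same filter, aggregate, score and sort steps as one chained expression on a hypothetical toy table, adding a `.N` track count so the effect is visible. The artists and values are invented for illustration; only the column names mirror the dataset.

```r
library(data.table)

# Hypothetical toy data; artists and values are made up
toy <- data.table(
  artist_name_s = c(rep("Artist A", 5), "Artist B", "Artist C"),
  main_genre    = c(rep("Hip Hop", 4), "hip hop", "Rock", "Southern Hip Hop"),
  popularity    = c(rep(30, 5), 90, 80)
)

# Filter, aggregate, score and sort in one chain
toy_top <- toy[grepl("Hip Hop", main_genre, ignore.case = TRUE),
               .(Tracks         = .N,
                 Total_Success  = sum(popularity, na.rm = TRUE),
                 Avg_Popularity = mean(popularity, na.rm = TRUE)),
               by = artist_name_s
][, Weighted_Score := 0.5 * Total_Success + 0.5 * Avg_Popularity
][order(-Weighted_Score)]

print(toy_top)
```

Here Artist A (five tracks at popularity 30) scores 90 and ranks above Artist C (one track at popularity 80), which scores 80. Whether that matches your definition of "successful" is an analysis choice worth explaining in the mini-project.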