With 330,000 rows, this dataset covers approximately 3,300 weeks of music history. This equates to 63 years, aligning perfectly with the Billboard timeline from 1958 to 2021.
Data Quality Conclusion
1. Data Scale:
Extensive coverage of 330k+ records spanning over six decades.
2. Data Completeness
Zero missing values in essential fields (Date, Song, Artist, Rank).
Note on New Entries: The 32,312 NA values in last-week are confirmed as debut tracks (their first week on the chart).
3. Duplicate Check
0 identical rows (no exact duplicates at the record level).
- Data Overview (Glimpse)
# Use glimpse() to inspect variable types and data dimensionsglimpse(D1_BillboardHot100)
# Preview the first few rows to understand the data contenthead(D1_BillboardHot100)
# A tibble: 6 × 7
date rank song artist `last-week` `peak-rank` `weeks-on-board`
<date> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 2021-11-06 1 Easy On Me Adele 1 1 3
2 2021-11-06 2 Stay The K… 2 1 16
3 2021-11-06 3 Industry Baby Lil N… 3 1 14
4 2021-11-06 4 Fancy Like Walke… 4 3 19
5 2021-11-06 5 Bad Habits Ed Sh… 5 2 18
6 2021-11-06 6 Way 2 Sexy Drake… 6 1 8
- Data Integrity Check (Duplicates)
# Check for duplicate records in the datasetsum(duplicated(D1_BillboardHot100))
[1] 0
- Missing Value Analysis
# Identify missing values; note that 'last-week = 0' represents new entriescolSums(is.na(D1_BillboardHot100))
date rank song artist last-week
0 0 0 0 32312
peak-rank weeks-on-board
0 0
D2_SpotifyHitPredictor
Introduction
Data Quality Conclusion
1. Data Scale and Distribution
Total Sample Size: 41,106 tracks.
Decade Distribution: Six decades (1960s–2010s).
2. Data Completeness (Missing Values)
Result: zero missing values.
3. Duplicate Identification
Row-level check: 0 identical rows (no exact duplicates).
URI-level check: 546 duplicate URIs detected.
Interpretation: This confirms that 546 tracks overlap between different decade files (e.g., a song appearing in both the 70s and 80s datasets).
4. Key Variable Observations
Target Variable: A track was a “Hit” (1) or a “Flop” (0).
Date Precision: The dataset lacks specific dates and only provides decade labels. I will rely on D1 (Billboard) or D3 (Music Dataset) to align these tracks with precise years for time-series analysis.
# Duplicate Check## Check for identical rows across the entire datasetsum(duplicated(D2_Spotify_All))
[1] 0
## Detect duplicate entries based on unique Spotify URIssum(duplicated(D2_Spotify_All$uri))
[1] 546
# Missing ValuescolSums(is.na(D2_Spotify_All))
track artist uri danceability
0 0 0 0
energy key loudness mode
0 0 0 0
speechiness acousticness instrumentalness liveness
0 0 0 0
valence tempo duration_ms time_signature
0 0 0 0
chorus_hit sections target decade
0 0 0 0
D3_MusicDataset1950_2019
Introduction
Data3 maps the evolution of lyrical themes and audio characteristics across genres and decades.
Data Quality Conclusion
1. Data Scale and Distribution
Contains 28,372 records and 31 variables, spanning from 1950 to 2019, providing a robust sample size for analysis.
2. Data Completeness
Critical audio features and lyrical topics are complete with no significant missing values.
3. Duplicate Check
No duplicate records were identified.
4. Key Variable Observations
The dataset effectively integrates quantitative audio features (e.g., danceability, acousticness) with categorical lyrical topics (e.g., romantic, sadness), forming a strong foundation for comparative analysis.
# check music genre table(D3_MusicDataset1950_2019$genre)
blues country hip hop jazz pop reggae rock
4604 5445 904 3845 7042 2498 4034
Saving to RDS
# Saving cleaned data to RDSsaveRDS(D1_BillboardHot100, "../Data/cleaned_D1.rds")saveRDS(D2_Spotify_All, "../Data/cleaned_D2.rds")saveRDS(D3_MusicDataset1950_2019, "../Data/cleaned_D3.rds")cat("Data exported successfully. Proceeding to pattern analysis.")
Data exported successfully. Proceeding to pattern analysis.
Part B: Initial Data Preparation
Light Cleaning Performed
✅ Removed URI prefix (spotify:track:)
✅ Merged decade files and added decade labels
✅ Dropped unnecessary column (...1)
Rationale
These minimal transformations were performed during exploration to:
Facilitate subsequent analysis
Standardize identifier format
Improve data structure clarity
Data Integration & Research Focus
D1 (Billboard): Ensures that the research subjects are “market-validated” within the U.S. mainstream.
Preliminary inspection shows that Data 3 includes a wide range of genres (e.g., Jazz, Reggae) and may contain tracks from non-U.S. markets. To ensure geographic consistency, this study will perform an Inner Join with the Billboard dataset during the data cleaning phase. This process will filter out tracks that did not enter the U.S. mainstream market, focusing the analysis on the American music landscape.
---title: "01 Initial Data Exploration"author: "Fu Wei Hsu"format: html: theme: cosmo toc: true toc-location: right toc-title: "On this page" code-tools: true embed-resources: trueexecute: warning: false message: falseeditor: visual---# Read.csv```{r}# Load librarylibrary(tidyverse)D1_BillboardHot100 <-read_csv("../Data/Data1_charts.csv", show_col_types =FALSE)D2_SpotifyHitPredictor_00s <-read_csv("../Data/Data2_dataset-of-00s.csv", show_col_types =FALSE)D2_SpotifyHitPredictor_10s <-read_csv("../Data/Data2_dataset-of-10s.csv", show_col_types =FALSE)D2_SpotifyHitPredictor_60s <-read_csv("../Data/Data2_dataset-of-60s.csv", show_col_types =FALSE)D2_SpotifyHitPredictor_70s <-read_csv("../Data/Data2_dataset-of-70s.csv", show_col_types =FALSE)D2_SpotifyHitPredictor_80s <-read_csv("../Data/Data2_dataset-of-80s.csv", show_col_types =FALSE)D2_SpotifyHitPredictor_90s <-read_csv("../Data/Data2_dataset-of-90s.csv", show_col_types =FALSE)D3_MusicDataset1950_2019 <-read_csv("../Data/Data3_tcc_ceds_music.csv", show_col_types =FALSE)# Output final confirmation messagecat("All datasets loaded successfully.")```# Part A: Exploratory Data Analysis## D1_BillboardHot100### IntroductionWith 330,000 rows, this dataset covers approximately 3,300 weeks of music history. This equates to 63 years, aligning perfectly with the Billboard timeline from 1958 to 2021.**Data Quality Conclusion****1. Data Scale:**- Extensive coverage of 330k+ records spanning over six decades.**2. Data Completeness**- Zero missing values in essential fields (Date, Song, Artist, Rank).- Note on New Entries: The 32,312 NA values in last-week are confirmed as debut tracks (their first week on the chart).**3. Duplicate Check**- 0 identical rows (no exact duplicates at the record level).#### - Data Overview (Glimpse)```{r}# Use glimpse() to inspect variable types and data dimensionsglimpse(D1_BillboardHot100)```#### - Data Preview```{r}# Preview the first few rows to understand the data contenthead(D1_BillboardHot100)```#### - Data Integrity Check (Duplicates)```{r}# Check for duplicate records in the datasetsum(duplicated(D1_BillboardHot100))```#### - Missing Value Analysis```{r}# Identify missing values; note that 'last-week = 0' represents new entriescolSums(is.na(D1_BillboardHot100))```## D2_SpotifyHitPredictor### Introduction**Data Quality Conclusion****1. Data Scale and Distribution**- Total Sample Size: 41,106 tracks.- Decade Distribution: Six decades (1960s–2010s).**2. Data Completeness (Missing Values)**- Result: zero missing values.**3. Duplicate Identification**- Row-level check: 0 identical rows (no exact duplicates).- URI-level check: 546 duplicate URIs detected.Interpretation: This confirms that 546 tracks overlap between different decade files (e.g., a song appearing in both the 70s and 80s datasets).**4. Key Variable Observations**- Target Variable: A track was a "Hit" (1) or a "Flop" (0).- Date Precision: The dataset lacks specific dates and only provides decade labels. I will rely on D1 (Billboard) or D3 (Music Dataset) to align these tracks with precise years for time-series analysis.#### Labeling Decades While Merging Datasets```{r}D2_Spotify_All <-bind_rows(read_csv("../Data/Data2_dataset-of-60s.csv", show_col_types =FALSE) %>%mutate(decade ="60s"),read_csv("../Data/Data2_dataset-of-70s.csv", show_col_types =FALSE) %>%mutate(decade ="70s"),read_csv("../Data/Data2_dataset-of-80s.csv", show_col_types =FALSE) %>%mutate(decade ="80s"),read_csv("../Data/Data2_dataset-of-90s.csv", show_col_types =FALSE) %>%mutate(decade ="90s"),read_csv("../Data/Data2_dataset-of-00s.csv", show_col_types =FALSE) %>%mutate(decade ="00s"),read_csv("../Data/Data2_dataset-of-10s.csv", show_col_types =FALSE) %>%mutate(decade ="10s"))``````{r}# glimpse D2_Spotify_Allglimpse(D2_Spotify_All)# Preview Rowshead(D2_Spotify_All)# Remove Prefix wordingsD2_Spotify_All <- D2_Spotify_All %>%mutate(uri =str_remove(uri, "spotify:track:"))length(unique(D2_Spotify_All$uri)) ==nrow(D2_Spotify_All) -546glimpse(D2_Spotify_All)head(D2_Spotify_All)# check Decade Distribution from 60s to 10s table(D2_Spotify_All$decade)# Duplicate Check## Check for identical rows across the entire datasetsum(duplicated(D2_Spotify_All))## Detect duplicate entries based on unique Spotify URIssum(duplicated(D2_Spotify_All$uri))# Missing ValuescolSums(is.na(D2_Spotify_All))```## D3_MusicDataset1950_2019### IntroductionData3 maps the evolution of lyrical themes and audio characteristics across genres and decades.**Data Quality Conclusion****1. Data Scale and Distribution**- Contains 28,372 records and 31 variables, spanning from 1950 to 2019, providing a robust sample size for analysis.**2. Data Completeness**- Critical audio features and lyrical topics are complete with no significant missing values.**3. Duplicate Check**- No duplicate records were identified.**4. Key Variable Observations**- The dataset effectively integrates quantitative audio features (e.g., danceability, acousticness) with categorical lyrical topics (e.g., romantic, sadness), forming a strong foundation for comparative analysis.```{r}# glimpse D3_MusicDataset1950_2019glimpse(D3_MusicDataset1950_2019)# Preview Rowshead(D3_MusicDataset1950_2019)# Duplicate Checksum(duplicated(D3_MusicDataset1950_2019))# Missing ValuescolSums(is.na(D3_MusicDataset1950_2019))# Remove first no meaningful columnD3_MusicDataset1950_2019 <- D3_MusicDataset1950_2019 %>%select(-...1)glimpse(D3_MusicDataset1950_2019)# check music genre table(D3_MusicDataset1950_2019$genre)```### Saving to RDS```{r}# Saving cleaned data to RDSsaveRDS(D1_BillboardHot100, "../Data/cleaned_D1.rds")saveRDS(D2_Spotify_All, "../Data/cleaned_D2.rds")saveRDS(D3_MusicDataset1950_2019, "../Data/cleaned_D3.rds")cat("Data exported successfully. Proceeding to pattern analysis.")```# Part B: Initial Data Preparation### Light Cleaning Performed1. ✅ Removed URI prefix (`spotify:track:`)2. ✅ Merged decade files and added `decade` labels3. ✅ Dropped unnecessary column (`...1`)### RationaleThese minimal transformations were performed during exploration to:- Facilitate subsequent analysis- Standardize identifier format- Improve data structure clarity# Data Integration & Research Focus- D1 (Billboard): Ensures that the research subjects are "market-validated" within the U.S. mainstream.- D2 (Spotify): Provides "audio-sensory" analytical capabilities (e.g., danceability, energy).- D3 (Music Dataset): Provides "lyrical sentiment" analytical capabilities (e.g., romantic, violence).# Geographic Consistency & Filtering StrategyPreliminary inspection shows that Data 3 includes a wide range of genres (e.g., Jazz, Reggae) and may contain tracks from non-U.S. markets. To ensure geographic consistency, this study will perform an Inner Join with the Billboard dataset during the data cleaning phase. This process will filter out tracks that did not enter the U.S. mainstream market, focusing the analysis on the American music landscape.# Time Coverage Comparison```{r}library(gt)time_coverage <-data.frame(Dataset =c("D1 (Billboard Hot 100)", "D2 (Spotify Hit Predictor)", "D3 (Music Dataset 1950-2019)"),Time_Range =c("1958 - 2021", "1960s - 2010s", "1950 - 2019"),Granularity =c("Weekly", "Decade", "Yearly"),Total_Records =c(330087, 41106, 28372))time_coverage %>%gt() %>%tab_header(title ="Dataset Time Coverage Summary" ) %>%cols_label(Dataset ="Dataset",Time_Range ="Time Range",Granularity ="Granularity",Total_Records ="Total Records" ) %>%fmt_number(columns = Total_Records,decimals =0,use_seps =TRUE ) %>%tab_options(table.width =pct(100) )```