01 Initial Data Exploration

Author

Fu Wei Hsu

Reading the CSV Files

# Load library
library(tidyverse)
D1_BillboardHot100 <- read_csv("../Data/Data1_charts.csv", show_col_types = FALSE)
D2_SpotifyHitPredictor_00s <- read_csv("../Data/Data2_dataset-of-00s.csv", show_col_types = FALSE)
D2_SpotifyHitPredictor_10s <- read_csv("../Data/Data2_dataset-of-10s.csv", show_col_types = FALSE)
D2_SpotifyHitPredictor_60s <- read_csv("../Data/Data2_dataset-of-60s.csv", show_col_types = FALSE)
D2_SpotifyHitPredictor_70s <- read_csv("../Data/Data2_dataset-of-70s.csv", show_col_types = FALSE)
D2_SpotifyHitPredictor_80s <- read_csv("../Data/Data2_dataset-of-80s.csv", show_col_types = FALSE)
D2_SpotifyHitPredictor_90s <- read_csv("../Data/Data2_dataset-of-90s.csv", show_col_types = FALSE)
D3_MusicDataset1950_2019 <- read_csv("../Data/Data3_tcc_ceds_music.csv", show_col_types = FALSE)

# Output final confirmation message
cat("All datasets loaded successfully.")
All datasets loaded successfully.

Part A: Exploratory Data Analysis

D1_BillboardHot100

Introduction

With 330,087 rows and 100 chart positions per week, this dataset covers roughly 3,300 weekly charts. That corresponds to about 63 years, consistent with the Billboard Hot 100 timeline from 1958 to 2021.
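
The estimate above can be reproduced with a quick back-of-the-envelope calculation. This is a minimal sketch that assumes every weekly chart contributes exactly 100 rows and that a year averages 365.2425 / 7 weeks; the variable names are illustrative only.

```r
# Row count reported by glimpse(D1_BillboardHot100)
n_rows <- 330087

# Assumption: each weekly Hot 100 chart contributes 100 rows
positions_per_week <- 100
n_weeks <- n_rows / positions_per_week

# Assumption: average year length of 365.2425 days, 7 days per week
weeks_per_year <- 365.2425 / 7
n_years <- n_weeks / weeks_per_year

round(n_weeks)      # about 3,301 weekly charts
round(n_years, 1)   # about 63.3 years
```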

Data Quality Conclusion

1. Data Scale:

  • Extensive coverage of 330k+ records spanning over six decades.

2. Data Completeness

  • Zero missing values in essential fields (Date, Song, Artist, Rank).

  • Note on New Entries: The 32,312 NA values in last-week correspond to tracks with no position on the previous week's chart and are interpreted as debut tracks (their first week on the chart).

3. Duplicate Check

  • 0 identical rows (no exact duplicates at the record level).

- Data Overview (Glimpse)

# Use glimpse() to inspect variable types and data dimensions
glimpse(D1_BillboardHot100)
Rows: 330,087
Columns: 7
$ date             <date> 2021-11-06, 2021-11-06, 2021-11-06, 2021-11-06, 2021…
$ rank             <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ song             <chr> "Easy On Me", "Stay", "Industry Baby", "Fancy Like", …
$ artist           <chr> "Adele", "The Kid LAROI & Justin Bieber", "Lil Nas X …
$ `last-week`      <dbl> 1, 2, 3, 4, 5, 6, 9, 7, 11, 8, 12, 10, 14, 15, 21, 16…
$ `peak-rank`      <dbl> 1, 1, 1, 3, 2, 1, 7, 1, 9, 2, 9, 3, 12, 14, 15, 11, 1…
$ `weeks-on-board` <dbl> 3, 16, 14, 19, 18, 8, 7, 24, 20, 56, 17, 29, 41, 18, …

- Data Preview

# Preview the first few rows to understand the data content
head(D1_BillboardHot100)
# A tibble: 6 × 7
  date        rank song          artist `last-week` `peak-rank` `weeks-on-board`
  <date>     <dbl> <chr>         <chr>        <dbl>       <dbl>            <dbl>
1 2021-11-06     1 Easy On Me    Adele            1           1                3
2 2021-11-06     2 Stay          The K…           2           1               16
3 2021-11-06     3 Industry Baby Lil N…           3           1               14
4 2021-11-06     4 Fancy Like    Walke…           4           3               19
5 2021-11-06     5 Bad Habits    Ed Sh…           5           2               18
6 2021-11-06     6 Way 2 Sexy    Drake…           6           1                8

- Data Integrity Check (Duplicates)

# Check for duplicate records in the dataset
sum(duplicated(D1_BillboardHot100))
[1] 0

- Missing Value Analysis

# Identify missing values; note that NA in 'last-week' marks tracks with no previous-week position (new entries)
colSums(is.na(D1_BillboardHot100))
          date           rank           song         artist      last-week 
             0              0              0              0          32312 
     peak-rank weeks-on-board 
             0              0 
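
Whether an NA in last-week really marks a debut can be checked directly rather than assumed. The sketch below uses a small made-up tibble with the same column names as D1 (the rows are hypothetical, not taken from the dataset) and counts how many NA rows have weeks-on-board equal to 1. NA rows with a larger value would point to re-entries rather than debuts.

```r
library(dplyr)

# Hypothetical mini chart with the same column names as D1_BillboardHot100
toy_chart <- tibble::tibble(
  song             = c("Song A", "Song B", "Song C", "Song D"),
  `last-week`      = c(NA, 3, NA, NA),
  `weeks-on-board` = c(1, 5, 1, 12)  # Song D: no previous-week rank, but 12 weeks on the board
)

# Split the NA rows into first-week entries and possible re-entries
na_summary <- toy_chart %>%
  filter(is.na(`last-week`)) %>%
  summarise(
    n_na      = n(),
    n_debut   = sum(`weeks-on-board` == 1),
    n_reentry = sum(`weeks-on-board` > 1)
  )

na_summary
```

Running the same pipeline on D1_BillboardHot100 would show whether all 32,312 NA rows are first-week entries.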

D2_SpotifyHitPredictor

Introduction

Data Quality Conclusion

1. Data Scale and Distribution

  • Total Sample Size: 41,106 tracks.
  • Decade Distribution: Six decades (1960s–2010s).

2. Data Completeness (Missing Values)

  • Result: zero missing values.

3. Duplicate Identification

  • Row-level check: 0 identical rows (no exact duplicates).
  • URI-level check: 546 duplicate URIs detected.

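Interpretation: 546 rows carry a URI that already appears elsewhere in the combined data. The most likely explanation is that the same track is listed in more than one decade file (e.g., a song appearing in both the 70s and 80s datasets), although the URI count alone does not rule out repeats within a single file.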

4. Key Variable Observations

  • Target Variable: The target column labels each track as a “Hit” (1) or a “Flop” (0).
  • Date Precision: The dataset lacks specific dates and only provides decade labels. I will rely on D1 (Billboard) or D3 (Music Dataset) to align these tracks with precise years for time-series analysis.
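
As one possible way to obtain a year for each track, the first chart appearance in D1 could serve as a proxy. The sketch below uses hypothetical rows with D1's date, song, and artist columns. Grouping on exact title and artist strings is an assumption and would likely need text normalization in practice.

```r
library(dplyr)

# Hypothetical weekly chart rows (not taken from the dataset)
toy_billboard <- tibble::tibble(
  date   = as.Date(c("1999-05-01", "1999-05-08", "2000-01-15", "2010-03-06")),
  song   = c("Song A", "Song A", "Song A", "Song B"),
  artist = c("Artist X", "Artist X", "Artist X", "Artist Y")
)

# Earliest chart date per song/artist pair, reduced to a calendar year
first_chart_year <- toy_billboard %>%
  group_by(song, artist) %>%
  summarise(first_year = as.integer(format(min(date), "%Y")), .groups = "drop")

first_chart_year
```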

Labeling Decades While Merging Datasets

D2_Spotify_All <- bind_rows(
  read_csv("../Data/Data2_dataset-of-60s.csv", show_col_types = FALSE) %>% mutate(decade = "60s"),
  read_csv("../Data/Data2_dataset-of-70s.csv", show_col_types = FALSE) %>% mutate(decade = "70s"),
  read_csv("../Data/Data2_dataset-of-80s.csv", show_col_types = FALSE) %>% mutate(decade = "80s"),
  read_csv("../Data/Data2_dataset-of-90s.csv", show_col_types = FALSE) %>% mutate(decade = "90s"),
  read_csv("../Data/Data2_dataset-of-00s.csv", show_col_types = FALSE) %>% mutate(decade = "00s"),
  read_csv("../Data/Data2_dataset-of-10s.csv", show_col_types = FALSE) %>% mutate(decade = "10s")
)
# glimpse D2_Spotify_All
glimpse(D2_Spotify_All)
Rows: 41,106
Columns: 20
$ track            <chr> "Jealous Kind Of Fella", "Initials B.B.", "Melody Twi…
$ artist           <chr> "Garland Green", "Serge Gainsbourg", "Lord Melody", "…
$ uri              <chr> "spotify:track:1dtKN6wwlolkM8XZy2y9C1", "spotify:trac…
$ danceability     <dbl> 0.417, 0.498, 0.657, 0.590, 0.515, 0.697, 0.662, 0.72…
$ energy           <dbl> 0.6200, 0.5050, 0.6490, 0.5450, 0.7650, 0.6730, 0.272…
$ key              <dbl> 3, 3, 5, 7, 11, 0, 0, 5, 2, 2, 10, 9, 2, 2, 4, 5, 6, …
$ loudness         <dbl> -7.727, -12.475, -13.392, -12.058, -3.515, -10.573, -…
$ mode             <dbl> 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1,…
$ speechiness      <dbl> 0.0403, 0.0337, 0.0380, 0.1040, 0.1240, 0.0266, 0.031…
$ acousticness     <dbl> 0.4900, 0.0180, 0.8460, 0.7060, 0.8570, 0.7140, 0.360…
$ instrumentalness <dbl> 0.00e+00, 1.07e-01, 4.42e-06, 2.46e-02, 8.72e-04, 9.1…
$ liveness         <dbl> 0.0779, 0.1760, 0.1190, 0.0610, 0.2130, 0.1220, 0.096…
$ valence          <dbl> 0.8450, 0.7970, 0.9080, 0.9670, 0.9060, 0.7780, 0.591…
$ tempo            <dbl> 185.655, 101.801, 115.940, 105.592, 114.617, 112.117,…
$ duration_ms      <dbl> 173533, 213613, 223960, 157907, 245600, 167667, 13436…
$ time_signature   <dbl> 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
$ chorus_hit       <dbl> 32.94975, 48.82510, 37.22663, 24.75484, 21.79874, 65.…
$ sections         <dbl> 9, 10, 12, 8, 14, 7, 7, 8, 6, 9, 7, 9, 6, 5, 8, 8, 13…
$ target           <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0,…
$ decade           <chr> "60s", "60s", "60s", "60s", "60s", "60s", "60s", "60s…
# Preview Rows
head(D2_Spotify_All)
# A tibble: 6 × 20
  track        artist uri   danceability energy   key loudness  mode speechiness
  <chr>        <chr>  <chr>        <dbl>  <dbl> <dbl>    <dbl> <dbl>       <dbl>
1 Jealous Kin… Garla… spot…        0.417  0.62      3    -7.73     1      0.0403
2 Initials B.… Serge… spot…        0.498  0.505     3   -12.5      1      0.0337
3 Melody Twist Lord … spot…        0.657  0.649     5   -13.4      1      0.038 
4 Mi Bomba So… Celia… spot…        0.59   0.545     7   -12.1      0      0.104 
5 Uravu Solla  P. Su… spot…        0.515  0.765    11    -3.52     0      0.124 
6 Beat n. 3    Ennio… spot…        0.697  0.673     0   -10.6      1      0.0266
# ℹ 11 more variables: acousticness <dbl>, instrumentalness <dbl>,
#   liveness <dbl>, valence <dbl>, tempo <dbl>, duration_ms <dbl>,
#   time_signature <dbl>, chorus_hit <dbl>, sections <dbl>, target <dbl>,
#   decade <chr>
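# Strip the "spotify:track:" prefix so that only the track ID remains
D2_Spotify_All <- D2_Spotify_All %>% 
  mutate(uri = str_remove(uri, "^spotify:track:"))
# Confirm that the number of unique URIs is 546 fewer than the number of rows
length(unique(D2_Spotify_All$uri)) == nrow(D2_Spotify_All) - 546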
[1] TRUE
glimpse(D2_Spotify_All)
Rows: 41,106
Columns: 20
$ track            <chr> "Jealous Kind Of Fella", "Initials B.B.", "Melody Twi…
$ artist           <chr> "Garland Green", "Serge Gainsbourg", "Lord Melody", "…
$ uri              <chr> "1dtKN6wwlolkM8XZy2y9C1", "5hjsmSnUefdUqzsDogisiX", "…
$ danceability     <dbl> 0.417, 0.498, 0.657, 0.590, 0.515, 0.697, 0.662, 0.72…
$ energy           <dbl> 0.6200, 0.5050, 0.6490, 0.5450, 0.7650, 0.6730, 0.272…
$ key              <dbl> 3, 3, 5, 7, 11, 0, 0, 5, 2, 2, 10, 9, 2, 2, 4, 5, 6, …
$ loudness         <dbl> -7.727, -12.475, -13.392, -12.058, -3.515, -10.573, -…
$ mode             <dbl> 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1,…
$ speechiness      <dbl> 0.0403, 0.0337, 0.0380, 0.1040, 0.1240, 0.0266, 0.031…
$ acousticness     <dbl> 0.4900, 0.0180, 0.8460, 0.7060, 0.8570, 0.7140, 0.360…
$ instrumentalness <dbl> 0.00e+00, 1.07e-01, 4.42e-06, 2.46e-02, 8.72e-04, 9.1…
$ liveness         <dbl> 0.0779, 0.1760, 0.1190, 0.0610, 0.2130, 0.1220, 0.096…
$ valence          <dbl> 0.8450, 0.7970, 0.9080, 0.9670, 0.9060, 0.7780, 0.591…
$ tempo            <dbl> 185.655, 101.801, 115.940, 105.592, 114.617, 112.117,…
$ duration_ms      <dbl> 173533, 213613, 223960, 157907, 245600, 167667, 13436…
$ time_signature   <dbl> 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
$ chorus_hit       <dbl> 32.94975, 48.82510, 37.22663, 24.75484, 21.79874, 65.…
$ sections         <dbl> 9, 10, 12, 8, 14, 7, 7, 8, 6, 9, 7, 9, 6, 5, 8, 8, 13…
$ target           <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0,…
$ decade           <chr> "60s", "60s", "60s", "60s", "60s", "60s", "60s", "60s…
head(D2_Spotify_All)
# A tibble: 6 × 20
  track        artist uri   danceability energy   key loudness  mode speechiness
  <chr>        <chr>  <chr>        <dbl>  <dbl> <dbl>    <dbl> <dbl>       <dbl>
1 Jealous Kin… Garla… 1dtK…        0.417  0.62      3    -7.73     1      0.0403
2 Initials B.… Serge… 5hjs…        0.498  0.505     3   -12.5      1      0.0337
3 Melody Twist Lord … 6uk8…        0.657  0.649     5   -13.4      1      0.038 
4 Mi Bomba So… Celia… 7aNj…        0.59   0.545     7   -12.1      0      0.104 
5 Uravu Solla  P. Su… 1rQ0…        0.515  0.765    11    -3.52     0      0.124 
6 Beat n. 3    Ennio… 32VB…        0.697  0.673     0   -10.6      1      0.0266
# ℹ 11 more variables: acousticness <dbl>, instrumentalness <dbl>,
#   liveness <dbl>, valence <dbl>, tempo <dbl>, duration_ms <dbl>,
#   time_signature <dbl>, chorus_hit <dbl>, sections <dbl>, target <dbl>,
#   decade <chr>
# Check the decade distribution (60s to 10s)
table(D2_Spotify_All$decade)

 00s  10s  60s  70s  80s  90s 
5872 6398 8642 7766 6908 5520 
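
Because decade is stored as text, table() sorts the labels alphabetically, which places the 00s and 10s ahead of the 60s. A minimal sketch of one way to restore chronological order, using a hypothetical vector of labels in place of D2_Spotify_All$decade:

```r
# Hypothetical decade labels standing in for D2_Spotify_All$decade
decade <- c("00s", "60s", "10s", "90s", "60s", "70s", "80s")

# Declare the chronological order explicitly as factor levels
decade_levels <- c("60s", "70s", "80s", "90s", "00s", "10s")
decade_fct <- factor(decade, levels = decade_levels)

# Counts are now reported in the order of the factor levels
table(decade_fct)
```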
# Duplicate Check
## Check for identical rows across the entire dataset
sum(duplicated(D2_Spotify_All))
[1] 0
## Detect duplicate entries based on unique Spotify URIs
sum(duplicated(D2_Spotify_All$uri))
[1] 546
# Missing Values
colSums(is.na(D2_Spotify_All))
           track           artist              uri     danceability 
               0                0                0                0 
          energy              key         loudness             mode 
               0                0                0                0 
     speechiness     acousticness instrumentalness         liveness 
               0                0                0                0 
         valence            tempo      duration_ms   time_signature 
               0                0                0                0 
      chorus_hit         sections           target           decade 
               0                0                0                0 
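
The duplicate-URI count alone does not show where the repeats come from. One way to check, sketched here on hypothetical rows with the uri and decade columns of D2_Spotify_All, is to count how many distinct decades each repeated URI appears in:

```r
library(dplyr)

# Hypothetical rows: "aaa" repeats across decades, "bbb" repeats within one decade
toy_tracks <- tibble::tibble(
  uri    = c("aaa", "aaa", "bbb", "bbb", "ccc"),
  decade = c("70s", "80s", "90s", "90s", "00s")
)

# For every URI that occurs more than once, count rows and distinct decades
dup_summary <- toy_tracks %>%
  group_by(uri) %>%
  summarise(n_rows = n(), n_decades = n_distinct(decade), .groups = "drop") %>%
  filter(n_rows > 1)

dup_summary
```

A URI with n_decades greater than 1 is a cross-decade overlap, while n_decades equal to 1 indicates a repeat inside a single decade file.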

D3_MusicDataset1950_2019

Introduction

D3 supports analysis of how lyrical themes and audio characteristics evolved across genres and decades.

Data Quality Conclusion

1. Data Scale and Distribution

  • Contains 28,372 records and 31 variables (30 after the import index column ...1 is dropped), spanning from 1950 to 2019, providing a robust sample size for analysis.

2. Data Completeness

  • All 31 columns, including the audio features and lyrical topic scores, have zero missing values.

3. Duplicate Check

  • No duplicate records were identified.

4. Key Variable Observations

  • The dataset combines quantitative audio features (e.g., danceability, acousticness) with numeric lyrical topic scores (e.g., romantic, sadness) and a categorical topic label, forming a strong foundation for comparative analysis.
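
In the glimpse() output for D3, the topic label appears to match whichever topic score is largest in each row. A minimal sketch of how that relationship could be checked, using three hypothetical rows and only three of the score columns:

```r
# Hypothetical rows with a subset of D3's topic-score columns
toy_topics <- data.frame(
  sadness  = c(0.38, 0.01, 0.02),
  romantic = c(0.02, 0.05, 0.16),
  music    = c(0.04, 0.12, 0.32),
  topic    = c("sadness", "music", "music")
)

score_cols <- c("sadness", "romantic", "music")

# max.col() returns, for each row, the index of the column holding the largest value
dominant <- score_cols[max.col(as.matrix(toy_topics[score_cols]), ties.method = "first")]

# TRUE where the stored label agrees with the highest-scoring topic
dominant == toy_topics$topic
```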
# glimpse D3_MusicDataset1950_2019
glimpse(D3_MusicDataset1950_2019)
Rows: 28,372
Columns: 31
$ ...1                       <dbl> 0, 4, 6, 10, 12, 14, 15, 17, 20, 23, 28, 32…
$ artist_name                <chr> "mukesh", "frankie laine", "johnnie ray", "…
$ track_name                 <chr> "mohabbat bhi jhoothi", "i believe", "cry",…
$ release_date               <dbl> 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1…
$ genre                      <chr> "pop", "pop", "pop", "pop", "pop", "pop", "…
$ lyrics                     <chr> "hold time feel break feel untrue convince …
$ len                        <dbl> 95, 51, 24, 54, 48, 98, 179, 21, 30, 61, 11…
$ dating                     <dbl> 0.0005980861, 0.0355371338, 0.0027700831, 0…
$ violence                   <dbl> 0.0637461276, 0.0967767423, 0.0027700832, 0…
$ `world/life`               <dbl> 0.0005980862, 0.4434351738, 0.0027700833, 0…
$ `night/time`               <dbl> 0.0005980862, 0.0012836971, 0.0027700833, 0…
$ `shake the audience`       <dbl> 0.0005980861, 0.0012836971, 0.0027700831, 0…
$ `family/gospel`            <dbl> 0.0488570152, 0.0270074774, 0.0027700831, 0…
$ romantic                   <dbl> 0.0171043389, 0.0012836971, 0.1585644657, 0…
$ communication              <dbl> 0.2637508813, 0.0012836971, 0.2506679099, 0…
$ obscene                    <dbl> 0.0005980862, 0.0012836971, 0.0027700833, 0…
$ music                      <dbl> 0.0392883659, 0.1180338412, 0.3237940522, 0…
$ `movement/places`          <dbl> 0.0005980862, 0.0012836971, 0.0027700835, 0…
$ `light/visual perceptions` <dbl> 0.0005980861, 0.2126810672, 0.0027700833, 0…
$ `family/spiritual`         <dbl> 0.0005980861, 0.0511241990, 0.0027700833, 0…
$ `like/girls`               <dbl> 0.0005980862, 0.0012836971, 0.0027700835, 0…
$ sadness                    <dbl> 0.3802988952, 0.0012836971, 0.0027700832, 0…
$ feelings                   <dbl> 0.1171754514, 0.0012836972, 0.2254223233, 0…
$ danceability               <dbl> 0.3577385, 0.3317448, 0.4562981, 0.6869923,…
$ loudness                   <dbl> 0.4541189, 0.6475399, 0.5852883, 0.7444043,…
$ acousticness               <dbl> 0.99799197, 0.95481923, 0.84036129, 0.08393…
$ instrumentalness           <dbl> 9.018219e-01, 1.528340e-06, 0.000000e+00, 1…
$ valence                    <dbl> 0.33944765, 0.32502061, 0.35181369, 0.77535…
$ energy                     <dbl> 0.1371102, 0.2632403, 0.1391123, 0.7437357,…
$ topic                      <chr> "sadness", "world/life", "music", "romantic…
$ age                        <dbl> 1.0000000, 1.0000000, 1.0000000, 1.0000000,…
# Preview Rows
head(D3_MusicDataset1950_2019)
# A tibble: 6 × 31
   ...1 artist_name  track_name release_date genre lyrics   len  dating violence
  <dbl> <chr>        <chr>             <dbl> <chr> <chr>  <dbl>   <dbl>    <dbl>
1     0 mukesh       mohabbat …         1950 pop   hold …    95 5.98e-4  0.0637 
2     4 frankie lai… i believe          1950 pop   belie…    51 3.55e-2  0.0968 
3     6 johnnie ray  cry                1950 pop   sweet…    24 2.77e-3  0.00277
4    10 pérez prado  patricia           1950 pop   kiss …    54 4.82e-2  0.00155
5    12 giorgos pap… apopse ei…         1950 pop   till …    48 1.35e-3  0.00135
6    14 perry como   round and…         1950 pop   convo…    98 1.05e-3  0.421  
# ℹ 22 more variables: `world/life` <dbl>, `night/time` <dbl>,
#   `shake the audience` <dbl>, `family/gospel` <dbl>, romantic <dbl>,
#   communication <dbl>, obscene <dbl>, music <dbl>, `movement/places` <dbl>,
#   `light/visual perceptions` <dbl>, `family/spiritual` <dbl>,
#   `like/girls` <dbl>, sadness <dbl>, feelings <dbl>, danceability <dbl>,
#   loudness <dbl>, acousticness <dbl>, instrumentalness <dbl>, valence <dbl>,
#   energy <dbl>, topic <chr>, age <dbl>
# Duplicate Check
sum(duplicated(D3_MusicDataset1950_2019))
[1] 0
# Missing Values
colSums(is.na(D3_MusicDataset1950_2019))
                    ...1              artist_name               track_name 
                       0                        0                        0 
            release_date                    genre                   lyrics 
                       0                        0                        0 
                     len                   dating                 violence 
                       0                        0                        0 
              world/life               night/time       shake the audience 
                       0                        0                        0 
           family/gospel                 romantic            communication 
                       0                        0                        0 
                 obscene                    music          movement/places 
                       0                        0                        0 
light/visual perceptions         family/spiritual               like/girls 
                       0                        0                        0 
                 sadness                 feelings             danceability 
                       0                        0                        0 
                loudness             acousticness         instrumentalness 
                       0                        0                        0 
                 valence                   energy                    topic 
                       0                        0                        0 
                     age 
                       0 
# Drop the first column (...1), an unnamed row index that read_csv named automatically on import
D3_MusicDataset1950_2019 <- D3_MusicDataset1950_2019 %>% 
  select(-...1)
glimpse(D3_MusicDataset1950_2019)
Rows: 28,372
Columns: 30
$ artist_name                <chr> "mukesh", "frankie laine", "johnnie ray", "…
$ track_name                 <chr> "mohabbat bhi jhoothi", "i believe", "cry",…
$ release_date               <dbl> 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1…
$ genre                      <chr> "pop", "pop", "pop", "pop", "pop", "pop", "…
$ lyrics                     <chr> "hold time feel break feel untrue convince …
$ len                        <dbl> 95, 51, 24, 54, 48, 98, 179, 21, 30, 61, 11…
$ dating                     <dbl> 0.0005980861, 0.0355371338, 0.0027700831, 0…
$ violence                   <dbl> 0.0637461276, 0.0967767423, 0.0027700832, 0…
$ `world/life`               <dbl> 0.0005980862, 0.4434351738, 0.0027700833, 0…
$ `night/time`               <dbl> 0.0005980862, 0.0012836971, 0.0027700833, 0…
$ `shake the audience`       <dbl> 0.0005980861, 0.0012836971, 0.0027700831, 0…
$ `family/gospel`            <dbl> 0.0488570152, 0.0270074774, 0.0027700831, 0…
$ romantic                   <dbl> 0.0171043389, 0.0012836971, 0.1585644657, 0…
$ communication              <dbl> 0.2637508813, 0.0012836971, 0.2506679099, 0…
$ obscene                    <dbl> 0.0005980862, 0.0012836971, 0.0027700833, 0…
$ music                      <dbl> 0.0392883659, 0.1180338412, 0.3237940522, 0…
$ `movement/places`          <dbl> 0.0005980862, 0.0012836971, 0.0027700835, 0…
$ `light/visual perceptions` <dbl> 0.0005980861, 0.2126810672, 0.0027700833, 0…
$ `family/spiritual`         <dbl> 0.0005980861, 0.0511241990, 0.0027700833, 0…
$ `like/girls`               <dbl> 0.0005980862, 0.0012836971, 0.0027700835, 0…
$ sadness                    <dbl> 0.3802988952, 0.0012836971, 0.0027700832, 0…
$ feelings                   <dbl> 0.1171754514, 0.0012836972, 0.2254223233, 0…
$ danceability               <dbl> 0.3577385, 0.3317448, 0.4562981, 0.6869923,…
$ loudness                   <dbl> 0.4541189, 0.6475399, 0.5852883, 0.7444043,…
$ acousticness               <dbl> 0.99799197, 0.95481923, 0.84036129, 0.08393…
$ instrumentalness           <dbl> 9.018219e-01, 1.528340e-06, 0.000000e+00, 1…
$ valence                    <dbl> 0.33944765, 0.32502061, 0.35181369, 0.77535…
$ energy                     <dbl> 0.1371102, 0.2632403, 0.1391123, 0.7437357,…
$ topic                      <chr> "sadness", "world/life", "music", "romantic…
$ age                        <dbl> 1.0000000, 1.0000000, 1.0000000, 1.0000000,…
# Check the distribution of music genres
table(D3_MusicDataset1950_2019$genre)

  blues country hip hop    jazz     pop  reggae    rock 
   4604    5445     904    3845    7042    2498    4034 

Saving to RDS

# Saving cleaned data to RDS
saveRDS(D1_BillboardHot100, "../Data/cleaned_D1.rds")
saveRDS(D2_Spotify_All, "../Data/cleaned_D2.rds")
saveRDS(D3_MusicDataset1950_2019, "../Data/cleaned_D3.rds")

cat("Data exported successfully. Proceeding to pattern analysis.")
Data exported successfully. Proceeding to pattern analysis.

Part B: Initial Data Preparation

Light Cleaning Performed

  1. ✅ Removed URI prefix (spotify:track:)
  2. ✅ Merged decade files and added decade labels
  3. ✅ Dropped unnecessary column (...1)

Rationale

These minimal transformations were performed during exploration to:

  • Facilitate subsequent analysis
  • Standardize identifier format
  • Improve data structure clarity

Data Integration & Research Focus

  • D1 (Billboard): Ensures that the research subjects are “market-validated” within the U.S. mainstream.
  • D2 (Spotify): Provides “audio-sensory” analytical capabilities (e.g., danceability, energy).
  • D3 (Music Dataset): Provides “lyrical sentiment” analytical capabilities (e.g., romantic, violence).

Geographic Consistency & Filtering Strategy

Preliminary inspection shows that D3 includes a wide range of genres (e.g., jazz, reggae) and may contain tracks from non-U.S. markets. To ensure geographic consistency, this study will perform an inner join with the Billboard dataset (D1) during the data cleaning phase. The join will keep only tracks that entered the U.S. mainstream chart, focusing the analysis on the American music landscape.
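
A minimal sketch of the planned filtering step, assuming tracks are matched on lower-cased, trimmed title and artist strings. The toy rows and the norm_key() helper are hypothetical; real matching would likely need more careful handling of punctuation and featured artists.

```r
library(dplyr)

# Hypothetical helper: lower-case and trim whitespace so keys compare consistently
norm_key <- function(x) trimws(tolower(x))

# Toy stand-ins for D1 (Billboard) and D3 (Music Dataset); rows are made up
toy_d1 <- tibble::tibble(
  song   = c("Song A", " Song B"),
  artist = c("Artist X", "Artist Y")
)

toy_d3 <- tibble::tibble(
  track_name  = c("song a", "song b", "song c"),
  artist_name = c("artist x", "artist y", "artist z"),
  genre       = c("pop", "rock", "jazz")
)

# One row per charted song, with normalized join keys
d1_keys <- toy_d1 %>%
  transmute(track_key = norm_key(song), artist_key = norm_key(artist)) %>%
  distinct()

# Keep only the D3 tracks that also appear in the toy Billboard list
d3_charted <- toy_d3 %>%
  mutate(track_key = norm_key(track_name), artist_key = norm_key(artist_name)) %>%
  inner_join(d1_keys, by = c("track_key", "artist_key"))

d3_charted$track_name
```

Reducing D1 to distinct keys before joining avoids multiplying each D3 track by the number of weeks it spent on the chart.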

Time Coverage Comparison

library(gt)

time_coverage <- data.frame(
  Dataset = c("D1 (Billboard Hot 100)", 
              "D2 (Spotify Hit Predictor)", 
              "D3 (Music Dataset 1950-2019)"),
  Time_Range = c("1958 - 2021", 
                 "1960s - 2010s", 
                 "1950 - 2019"),
  Granularity = c("Weekly", 
                  "Decade", 
                  "Yearly"),
  Total_Records = c(330087, 
                    41106, 
                    28372)
)

time_coverage %>% 
  gt() %>% 
  tab_header(
    title = "Dataset Time Coverage Summary"
  ) %>% 
  cols_label(
    Dataset = "Dataset",
    Time_Range = "Time Range",
    Granularity = "Granularity",
    Total_Records = "Total Records"
  ) %>% 
  fmt_number(
    columns = Total_Records,
    decimals = 0,
    use_seps = TRUE
  ) %>%
  tab_options(
    table.width = pct(100)
  )
Dataset Time Coverage Summary

Dataset                        Time Range      Granularity   Total Records
D1 (Billboard Hot 100)         1958 - 2021     Weekly              330,087
D2 (Spotify Hit Predictor)     1960s - 2010s   Decade               41,106
D3 (Music Dataset 1950-2019)   1950 - 2019     Yearly               28,372