Project Part 2: Quarto document

Your Project Assignment 1: First Contact with Your Dataset Using Arrow

Assignment Overview

This week you’ll apply the READY + SCAN frameworks to your own dataset using Arrow for efficient big data exploration. You’ll become a “data detective” investigating your dataset systematically.

Learning Objectives

By completing this assignment, you will:

  • Apply the READY framework to plan your data investigation

  • Use the SCAN framework to systematically explore your dataset

  • Practice using Arrow for memory-efficient data loading

  • Document your initial findings and develop investigation questions

Part 1: Data Setup and Loading

Step 1: Extract and Load Your Data

Use the appropriate code pattern below based on your data format:

LOAD LIBRARIES

library(arrow)

Attaching package: 'arrow'
The following object is masked from 'package:utils':

    timestamp
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
tracks_csv <- open_dataset("tracks_features.csv", format = "csv")

glimpse(tracks_csv)
FileSystemDataset with 1 csv file
1,204,025 rows x 24 columns
$ id               <string> "7lmeHLHBe4nmXzuXc0HDjk", "1wsRitfRRtWyEapl0q22o8", …
$ name             <string> "Testify", "Guerrilla Radio", "Calm Like a Bomb", "M…
$ album            <string> "The Battle Of Los Angeles", "The Battle Of Los Ange…
$ album_id         <string> "2eia0myWFgoHuttJytCxgX", "2eia0myWFgoHuttJytCxgX", …
$ artists          <string> "['Rage Against The Machine']", "['Rage Against The …
$ artist_ids       <string> "['2d0hyoQ5ynDBnkvAbJKORj']", "['2d0hyoQ5ynDBnkvAbJK…
$ track_number      <int64> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5…
$ disc_number       <int64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ explicit         <string> "False", "True", "False", "True", "False", "False", …
$ danceability     <double> 0.470, 0.599, 0.315, 0.440, 0.426, 0.298, 0.417, 0.2…
$ energy           <double> 0.978, 0.957, 0.970, 0.967, 0.929, 0.848, 0.976, 0.8…
$ key               <int64> 7, 11, 7, 11, 2, 2, 9, 11, 7, 9, 7, 6, 4, 7, 1, 7, 4…
$ loudness         <double> -5.399, -5.764, -5.424, -5.830, -6.729, -5.947, -6.0…
$ mode              <int64> 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1…
$ speechiness      <double> 0.0727, 0.1880, 0.4830, 0.2370, 0.0701, 0.0727, 0.17…
$ acousticness     <double> 0.026100, 0.012900, 0.023400, 0.163000, 0.001620, 0.…
$ instrumentalness <double> 1.09e-05, 7.06e-05, 2.03e-06, 3.64e-06, 1.05e-01, 1.…
$ liveness         <double> 0.3560, 0.1550, 0.1220, 0.1210, 0.0789, 0.2010, 0.10…
$ valence          <double> 0.503, 0.489, 0.370, 0.574, 0.539, 0.194, 0.483, 0.6…
$ tempo            <double> 117.906, 103.680, 149.749, 96.752, 127.059, 148.282,…
$ duration_ms       <int64> 210133, 206200, 298893, 213640, 205600, 280960, 2020…
$ time_signature   <double> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
$ year              <int64> 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999…
$ release_date     <string> "1999-11-02", "1999-11-02", "1999-11-02", "1999-11-0…
# Load required libraries

For ZIP files containing CSV(s):

# Open your CSV directly with Arrow
library(arrow)
library(dplyr)

my_dataset <- open_dataset("~/Downloads/tracks_features.csv", format = "csv")

# Check memory usage
glue::glue("Memory used by Arrow object: {format(object.size(my_dataset), units = 'KB')}")
Memory used by Arrow object: 0.5 Kb
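The small size reported above reflects that `open_dataset()` only records where the file lives and what its schema is; rows are read when a query is collected. Note also that `object.size()` measures the R-side object, not memory managed by Arrow itself. A minimal sketch of the contrast, assuming a hypothetical toy CSV written to a temporary file rather than the real `tracks_features.csv`:

```r
library(arrow)
library(dplyr)

# Hypothetical toy data standing in for the real tracks file
set.seed(1)
toy_tracks <- data.frame(
  id = sprintf("track_%04d", 1:1000),
  energy = runif(1000),
  tempo = runif(1000, min = 60, max = 180)
)
toy_path <- tempfile(fileext = ".csv")
write.csv(toy_tracks, toy_path, row.names = FALSE)

# Lazy: R holds only a reference to the file and its schema
lazy_ds <- open_dataset(toy_path, format = "csv")

# Eager: collect() reads every row into an R data frame
in_memory <- lazy_ds %>% collect()

# Compare the R-side footprint of the two objects
lazy_size <- as.numeric(object.size(lazy_ds))
eager_size <- as.numeric(object.size(in_memory))
format(object.size(lazy_ds), units = "KB")
format(object.size(in_memory), units = "KB")
```

The same pattern applies to the real file: filtering and selecting on the lazy dataset before calling `collect()` keeps the in-memory data frame as small as the question allows.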

Part 2: READY Framework Analysis

Work through each component of READY with your dataset:

R - Representative Data

Document your thoughts as comments:

What is the scope of your data?

My data contains track metadata and audio features for over 1.2 million songs.

Time period covered:

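The first rows shown are from 1999, but the release years in the data extend to 2020 and the later plots reach back to about 1900. The summary statistics also show a minimum year of 0, which points to some invalid year entries.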

Geographic coverage:

The geographic coverage is global.

Population represented:

The music tracks included in Spotify’s library

Potential biases or limitations:

The data is biased toward artists who distribute on Spotify, and some genres may be underrepresented.

Example questions to consider:

  • Do we have complete coverage of what we’re studying?

  • Are there any obvious gaps in the data?

  • What might be missing?

E - Executive Driven Questions

Who would care about insights from your data?

Artists, producers, and record labels would all care about insights from my data.

Primary stakeholders:

Key business/research questions they might ask:

What decisions could this data inform?

What musical features are most correlated with popular music? How have song characteristics changed over time across genres? Are certain features becoming less common?

Examples:

  • If this is sales data: “How can we optimize our sales strategy?”

  • If this is health data: “What patterns affect patient outcomes?”

  • If this is social media data: “How can we improve engagement?”

Your stakeholder questions: 1. 2. 3.

A - Analytical Framework

Your exploration strategy:

Phase 1: Data Quality Assessment

  • Check for missing values

  • Identify data types and consistency

  • Look for outliers or anomalies

Check for missing or extreme values in the dataset to make sure all audio features are usable

Phase 2: Descriptive Analysis

  • What are the key variables?

  • What’s the distribution of important metrics?

  • What time patterns exist?

Summarize and visualize the data using histograms and boxplots to understand the distributions of each musical feature

Phase 3: Pattern Investigation

  • What relationships might exist between variables?

  • Are there seasonal or temporal patterns?

  • What groupings or segments emerge?

Explore relationships between variables, such as how tempo relates to energy or danceability, and look for any clusters or patterns among tracks

Your specific analytical approach: 1. 2. 3.

D - Data Best Practices

Quality checks to perform:

Verify each audio feature column is stored as numeric data. Check for duplicate tracks or missing song IDs. Confirm all variables fall within expected ranges.

Missing data assessment:

Look for any missing values in musical features. Assess whether missing data appear random or patterned across certain tracks

Data type verification: Are numeric columns actually numeric? Are dates properly formatted? Are categorical variables consistent?

All musical characteristics must be numeric, and track names should be text. Confirm the release year column uses a consistent format. Check that the categorical columns only contain valid values.
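The checks described above can be sketched in a few lines of dplyr. This is a minimal illustration on hypothetical toy rows (the `toy_tracks` values, including the deliberately bad ones, are made up for the example), not a run against the real dataset:

```r
library(dplyr)

# Hypothetical toy rows; the real data has 24 columns
toy_tracks <- data.frame(
  id = c("a1", "b2", "b2", "c3"),
  energy = c(0.91, 0.42, 0.42, 1.30),  # 1.30 is deliberately out of range
  release_date = c("1999-11-02", "2005-06-14", "2005-06-14", "2012")
)

# 1. Type check: audio features should be numeric
is_numeric_energy <- is.numeric(toy_tracks$energy)

# 2. Duplicate check: IDs that appear more than once
duplicate_ids <- toy_tracks %>%
  count(id) %>%
  filter(n > 1)

# 3. Range check: energy is expected to lie between 0 and 1
out_of_range <- toy_tracks %>%
  filter(energy < 0 | energy > 1)

# 4. Format check: dates that are not written as YYYY-MM-DD
bad_dates <- toy_tracks %>%
  filter(!grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", release_date))
```

Each result is a small data frame of offending rows, so an empty result means the check passed.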

Your quality concerns: 1. 2. 3.

Y - Your Insights

Initial hypotheses about what you might find:

I think songs with higher energy, danceability, and tempo will often appear together, showing a pattern in upbeat or popular tracks.

Based on your domain knowledge, what patterns do you expect? What would surprise you? What would be most valuable to discover?

Tracks with higher valence will also have higher energy and tempo. More recent songs may trend toward louder, more energetic characteristics than older ones. Songs in major keys may have higher average valence scores than songs in minor keys. I would be surprised if slower songs had higher valence.

Your predictions: 1. 2. 3.

Part 3: Data Quality Assessment Summary

S - Stakeholders (Revisited)

glimpse(tracks_csv)
FileSystemDataset with 1 csv file
1,204,025 rows x 24 columns
$ id               <string> "7lmeHLHBe4nmXzuXc0HDjk", "1wsRitfRRtWyEapl0q22o8", …
$ name             <string> "Testify", "Guerrilla Radio", "Calm Like a Bomb", "M…
$ album            <string> "The Battle Of Los Angeles", "The Battle Of Los Ange…
$ album_id         <string> "2eia0myWFgoHuttJytCxgX", "2eia0myWFgoHuttJytCxgX", …
$ artists          <string> "['Rage Against The Machine']", "['Rage Against The …
$ artist_ids       <string> "['2d0hyoQ5ynDBnkvAbJKORj']", "['2d0hyoQ5ynDBnkvAbJK…
$ track_number      <int64> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5…
$ disc_number       <int64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ explicit         <string> "False", "True", "False", "True", "False", "False", …
$ danceability     <double> 0.470, 0.599, 0.315, 0.440, 0.426, 0.298, 0.417, 0.2…
$ energy           <double> 0.978, 0.957, 0.970, 0.967, 0.929, 0.848, 0.976, 0.8…
$ key               <int64> 7, 11, 7, 11, 2, 2, 9, 11, 7, 9, 7, 6, 4, 7, 1, 7, 4…
$ loudness         <double> -5.399, -5.764, -5.424, -5.830, -6.729, -5.947, -6.0…
$ mode              <int64> 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1…
$ speechiness      <double> 0.0727, 0.1880, 0.4830, 0.2370, 0.0701, 0.0727, 0.17…
$ acousticness     <double> 0.026100, 0.012900, 0.023400, 0.163000, 0.001620, 0.…
$ instrumentalness <double> 1.09e-05, 7.06e-05, 2.03e-06, 3.64e-06, 1.05e-01, 1.…
$ liveness         <double> 0.3560, 0.1550, 0.1220, 0.1210, 0.0789, 0.2010, 0.10…
$ valence          <double> 0.503, 0.489, 0.370, 0.574, 0.539, 0.194, 0.483, 0.6…
$ tempo            <double> 117.906, 103.680, 149.749, 96.752, 127.059, 148.282,…
$ duration_ms       <int64> 210133, 206200, 298893, 213640, 205600, 280960, 2020…
$ time_signature   <double> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
$ year              <int64> 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999…
$ release_date     <string> "1999-11-02", "1999-11-02", "1999-11-02", "1999-11-0…

After examining the data structure, who else might be interested?

Music streaming platforms, marketing teams, and researchers studying patterns.

What specific questions would they have?

What characteristics define popular or “feel good” songs across genres?

What concerns might they have about data quality?

I am concerned that older tracks have less reliable measurements.

C - Columns and Coverage

Create a summary table of your variables:

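# Collect the Arrow dataset into an in-memory data frame
# (this reads all 1.2 million rows into R)
tracks_df <- as.data.frame(tracks_csv)

# Count missing values per column, then transpose to one row per variable
summary_table <- tracks_df %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  t() %>%
  as.data.frame()

colnames(summary_table) <- c("Missing_Values")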
summary_table
                 Missing_Values
id                            0
name                          0
album                         0
album_id                      0
artists                       0
artist_ids                    0
track_number                  0
disc_number                   0
explicit                      0
danceability                  0
energy                        0
key                           0
loudness                      0
mode                          0
speechiness                   0
acousticness                  0
instrumentalness              0
liveness                      0
valence                       0
tempo                         0
duration_ms                   0
time_signature                0
year                          0
release_date                  0

A - Aggregates: Overall Picture

# Get comprehensive dataset statistics
summary(tracks_df)
      id                name              album             album_id        
 Length:1204025     Length:1204025     Length:1204025     Length:1204025    
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
   artists           artist_ids         track_number     disc_number    
 Length:1204025     Length:1204025     Min.   : 1.000   Min.   : 1.000  
 Class :character   Class :character   1st Qu.: 3.000   1st Qu.: 1.000  
 Mode  :character   Mode  :character   Median : 7.000   Median : 1.000  
                                       Mean   : 7.656   Mean   : 1.056  
                                       3rd Qu.:10.000   3rd Qu.: 1.000  
                                       Max.   :50.000   Max.   :13.000  
   explicit          danceability        energy            key        
 Length:1204025     Min.   :0.0000   Min.   :0.0000   Min.   : 0.000  
 Class :character   1st Qu.:0.3560   1st Qu.:0.2520   1st Qu.: 2.000  
 Mode  :character   Median :0.5010   Median :0.5240   Median : 5.000  
                    Mean   :0.4931   Mean   :0.5095   Mean   : 5.194  
                    3rd Qu.:0.6330   3rd Qu.:0.7660   3rd Qu.: 8.000  
                    Max.   :1.0000   Max.   :1.0000   Max.   :11.000  
    loudness            mode         speechiness       acousticness   
 Min.   :-60.000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
 1st Qu.:-15.254   1st Qu.:0.0000   1st Qu.:0.03510   1st Qu.:0.0376  
 Median : -9.791   Median :1.0000   Median :0.04460   Median :0.3890  
 Mean   :-11.809   Mean   :0.6715   Mean   :0.08438   Mean   :0.4468  
 3rd Qu.: -6.717   3rd Qu.:1.0000   3rd Qu.:0.07230   3rd Qu.:0.8610  
 Max.   :  7.234   Max.   :1.0000   Max.   :0.96900   Max.   :0.9960  
 instrumentalness       liveness         valence          tempo       
 Min.   :0.0000000   Min.   :0.0000   Min.   :0.000   Min.   :  0.00  
 1st Qu.:0.0000076   1st Qu.:0.0968   1st Qu.:0.191   1st Qu.: 94.05  
 Median :0.0080800   Median :0.1250   Median :0.403   Median :116.73  
 Mean   :0.2828605   Mean   :0.2016   Mean   :0.428   Mean   :117.63  
 3rd Qu.:0.7190000   3rd Qu.:0.2450   3rd Qu.:0.644   3rd Qu.:137.05  
 Max.   :1.0000000   Max.   :1.0000   Max.   :1.000   Max.   :248.93  
  duration_ms      time_signature       year      release_date      
 Min.   :   1000   Min.   :0.000   Min.   :   0   Length:1204025    
 1st Qu.: 174090   1st Qu.:4.000   1st Qu.:2002   Class :character  
 Median : 224339   Median :4.000   Median :2009   Mode  :character  
 Mean   : 248840   Mean   :3.832   Mean   :2007                     
 3rd Qu.: 285840   3rd Qu.:4.000   3rd Qu.:2015                     
 Max.   :6061090   Max.   :5.000   Max.   :2020                     
tracks_df %>%
  select_if(is.numeric) %>%
  summarise_all(list(mean = mean, min = min, max = max), na.rm = TRUE)
# A tibble: 1 × 48
  track_number_mean disc_number_mean danceability_mean energy_mean key_mean
              <dbl>            <dbl>             <dbl>       <dbl>    <dbl>
1              7.66             1.06             0.493       0.510     5.19
# ℹ 43 more variables: loudness_mean <dbl>, mode_mean <dbl>,
#   speechiness_mean <dbl>, acousticness_mean <dbl>,
#   instrumentalness_mean <dbl>, liveness_mean <dbl>, valence_mean <dbl>,
#   tempo_mean <dbl>, duration_ms_mean <dbl>, time_signature_mean <dbl>,
#   year_mean <dbl>, track_number_min <int>, disc_number_min <int>,
#   danceability_min <dbl>, energy_min <dbl>, key_min <int>,
#   loudness_min <dbl>, mode_min <int>, speechiness_min <dbl>, …

N - Notable Segments

tracks_df %>%
  group_by(explicit) %>%
  summarise(
    avg_danceability = mean(danceability, na.rm = TRUE),
    avg_energy = mean(energy, na.rm = TRUE),
    avg_valence = mean(valence, na.rm = TRUE)
  )
# A tibble: 2 × 4
  explicit avg_danceability avg_energy avg_valence
  <chr>               <dbl>      <dbl>       <dbl>
1 False               0.483      0.497       0.424
2 True                0.629      0.686       0.485
tracks_df %>%
  group_by(mode) %>%
  summarise(avg_tempo = mean(tempo, na.rm = TRUE))
# A tibble: 2 × 2
   mode avg_tempo
  <int>     <dbl>
1     0      117.
2     1      118.
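The segment summaries above run on the collected `tracks_df`. The same kind of grouped summary can also be expressed against an Arrow object so that only the small result is pulled into R. A minimal sketch, assuming a hypothetical four-row table built with `arrow_table()` in place of the real dataset:

```r
library(arrow)
library(dplyr)

# Hypothetical toy rows standing in for the real tracks data
toy_tracks <- data.frame(
  explicit = c("False", "False", "True", "True"),
  danceability = c(0.40, 0.50, 0.60, 0.70)
)

# Convert to an Arrow Table so the dplyr verbs are evaluated by Arrow
toy_arrow <- arrow_table(toy_tracks)

segment_means <- toy_arrow %>%
  group_by(explicit) %>%
  summarise(avg_danceability = mean(danceability, na.rm = TRUE)) %>%
  collect() %>%        # only the two summary rows reach R
  arrange(explicit)    # Arrow does not guarantee group order
```

With the real file, replacing `toy_arrow` with the object returned by `open_dataset()` should follow the same pattern, since the grouping and averaging happen before `collect()`.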

Complete this comprehensive assessment:

DATASET OVERVIEW:

  • Records: Over 1.2 million rows of data representing songs and their characteristics

  • Time span: Music released from about 1900 to 2020 (a minimum year of 0 in the summary indicates some invalid entries)

  • Key metrics: Tempo, Valence, Release date, and Energy

DATA COMPLETENESS:

  • Core fields: 100% complete

  • Name: 100% complete

  • Danceability: 100% complete

DATA QUALITY STRENGTHS:

1. No missing data

2. All variables are correctly typed

3. Standardized measurement scales make comparisons reliable

DATA QUALITY CONCERNS:

1. Some songs may lack genre or popularity context

2. No streaming or listener data, so analysis focuses on musical qualities only

3. Features like mode and key are categorical but not directly interpretable without domain context

RELIABILITY ASSESSMENT:

  • Most reliable variables: danceability, energy, valence, tempo, and loudness

  • Variables needing caution: mode, key, and explicit

  • Overall confidence level: High

JUSTIFICATION: The dataset is highly reliable because it comes from Spotify’s official API, which uses consistent digital signal processing to quantify musical attributes. All key variables are complete and numeric, making the data ready for quantitative analysis.

Part 4: Missing Data Analysis & Strategy (Week 10)

In this section, I’ll quantify and visualize missing values, classify their likely causes, and decide how to handle them before deeper analysis.

4.1 Quantify Missing Data

library(dplyr)
library(tidyr)
library(knitr)

missing_summary <- tracks_csv %>%
  dplyr::summarise(dplyr::across(
    .cols = everything(),
    .fns  = ~ sum(is.na(.x)) / n() * 100,
    .names = "{.col}"
  )) %>%
  dplyr::collect() %>%   # ✅ Convert Arrow dataset to an R data frame
  tidyr::pivot_longer(
    cols = everything(),
    names_to = "variable",
    values_to = "missing_percent"
  ) %>%
  dplyr::mutate(missing_percent = round(missing_percent, 2)) %>%
  dplyr::arrange(dplyr::desc(missing_percent))

knitr::kable(
  missing_summary,
  caption = "Percentage of Missing Values per Variable"
)
Percentage of Missing Values per Variable
variable          missing_percent
id                0
name              0
album             0
album_id          0
artists           0
artist_ids        0
track_number      0
disc_number       0
explicit          0
danceability      0
energy            0
key               0
loudness          0
mode              0
speechiness       0
acousticness      0
instrumentalness  0
liveness          0
valence           0
tempo             0
duration_ms       0
time_signature    0
year              0
release_date      0

Interpretation:

All of the variables in the Spotify dataset show 0% missing values, confirming that no field contains explicit NA entries. This suggests the dataset was generated by a consistent extraction process rather than entered manually. Since there are no missing values, no imputation or row removal is needed before further analysis, although values such as the minimum year of 0 in the summary statistics should still be checked as possible placeholders.
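An `is.na()` count only catches values stored as NA. The summary statistics in Part 3 report a minimum of 0 for `year`, `tempo`, and `time_signature`, which may be placeholders rather than real measurements. A minimal sketch of how such zeros could be counted and optionally recoded, using hypothetical toy values rather than the real data:

```r
library(dplyr)

# Hypothetical toy rows; zeros here stand in for possible placeholder values
toy_tracks <- data.frame(
  year = c(1999, 0, 2015, 2020),
  tempo = c(117.9, 0, 96.8, 0),
  time_signature = c(4, 4, 0, 3)
)

# Count zeros per column
zero_counts <- toy_tracks %>%
  summarise(across(everything(), ~ sum(.x == 0)))

# Optionally recode zeros to NA so they appear in a missing-value summary
recoded <- toy_tracks %>%
  mutate(across(c(year, tempo), ~ na_if(.x, 0)))
```

Whether a zero is truly a placeholder is a judgment call that depends on the variable, so recoding is shown only for the two columns where zero is clearly implausible.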

Part 5: Variable Selection & Focus

5.1 Variable Inventory

## Part 5: Variable Selection & Focus

library(dplyr)
library(knitr)

# Create a table describing variable roles
variable_roles <- tibble::tibble(
  Variable = names(tracks_csv),
  Role = c(
    "Identifier",        # id
    "Text / Title",      # name
    "Text / Album",      # album
    "Identifier",        # album_id
    "Categorical",       # artists
    "Identifier",        # artist_ids
    "Numeric",           # track_number
    "Numeric",           # disc_number
    "Categorical",       # explicit
    "Numeric (Feature)", # danceability
    "Numeric (Feature)", # energy
    "Numeric (Feature)", # key
    "Numeric (Feature)", # loudness
    "Numeric (Feature)", # mode
    "Numeric (Feature)", # speechiness
    "Numeric (Feature)", # acousticness
    "Numeric (Feature)", # instrumentalness
    "Numeric (Feature)", # liveness
    "Numeric (Feature)", # valence
    "Numeric (Feature)", # tempo
    "Numeric",           # duration_ms
    "Numeric",           # time_signature
    "Numeric (Time)",    # year
    "Date"               # release_date
  )
)

knitr::kable(variable_roles, caption = "Variable Roles and Classifications")
Variable Roles and Classifications
Variable          Role
id                Identifier
name              Text / Title
album             Text / Album
album_id          Identifier
artists           Categorical
artist_ids        Identifier
track_number      Numeric
disc_number       Numeric
explicit          Categorical
danceability      Numeric (Feature)
energy            Numeric (Feature)
key               Numeric (Feature)
loudness          Numeric (Feature)
mode              Numeric (Feature)
speechiness       Numeric (Feature)
acousticness      Numeric (Feature)
instrumentalness  Numeric (Feature)
liveness          Numeric (Feature)
valence           Numeric (Feature)
tempo             Numeric (Feature)
duration_ms       Numeric
time_signature    Numeric
year              Numeric (Time)
release_date      Date

5.2 Correlation Analysis

# Compute correlations among key numeric audio features
numeric_features <- tracks_csv %>%
  dplyr::select(danceability, energy, loudness, valence, tempo,
                acousticness, liveness) %>%
  dplyr::collect()

cor_matrix <- cor(numeric_features, use = "complete.obs")

knitr::kable(cor_matrix,
             caption = "Correlation Matrix for Key Audio Features")
Correlation Matrix for Key Audio Features
               danceability       energy     loudness      valence        tempo acousticness     liveness
danceability      1.0000000    0.2830157    0.3781937    0.5634361    0.0605899   -0.2857497   -0.0443279
energy            0.2830157    1.0000000    0.8179337    0.3995307    0.2682314   -0.7962421    0.2134935
loudness          0.3781937    0.8179337    1.0000000    0.3850047    0.2462482   -0.6715526    0.1381228
valence           0.5634361    0.3995307    0.3850047    1.0000000    0.1761994   -0.2688359    0.0626624
tempo             0.0605899    0.2682314    0.2462482    0.1761994    1.0000000   -0.2310215    0.0301493
acousticness     -0.2857497   -0.7962421   -0.6715526   -0.2688359   -0.2310215    1.0000000   -0.1150012
liveness         -0.0443279    0.2134935    0.1381228    0.0626624    0.0301493   -0.1150012    1.0000000

5.3 Interpretation:

The correlation matrix shows that energy and loudness are highly correlated (r = .82), meaning that energetic songs tend to be louder. Danceability and valence also show a positive correlation (r = .56), indicating that more danceable tracks tend to be more positive in tone. Acousticness, however, is negatively correlated with both energy (r = -.80) and loudness (r = -.67), so acoustic songs are generally quieter and less energetic. These correlations are broadly what I expected for how the audio features would relate to each other.
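For reference, the r values reported here are Pearson correlations: the sum of the products of each variable's deviations from its mean, divided by the square root of the product of the two sums of squared deviations. A minimal sketch that computes this by hand and compares it with `cor()`, using five made-up energy and loudness values (not taken from the dataset):

```r
# Hypothetical toy values, chosen only to illustrate the calculation
energy <- c(0.20, 0.45, 0.60, 0.85, 0.95)
loudness <- c(-18.0, -12.5, -9.0, -6.5, -5.0)

# Deviations of each value from its variable's mean
dx <- energy - mean(energy)
dy <- loudness - mean(loudness)

# Pearson r: co-variation divided by the product of the spreads
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Built-in equivalent (Pearson is the default method)
r_builtin <- cor(energy, loudness)
```

The two results agree up to floating-point error, which is a quick way to confirm what `cor()` is reporting in the matrix.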

Part 6: Exploratory Visualizations and Analysis

6.1 Visualization 1 - Energy vs Loudness

How are energy and loudness related across songs?

library(ggplot2)

# Plot from the collected data frame (tracks_df) rather than the Arrow dataset
ggplot(tracks_df, aes(x = energy, y = loudness)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Energy vs Loudness",
       x = "Energy",
       y = "Loudness (dB)")
`geom_smooth()` using formula = 'y ~ x'

Interpretation: This visualization shows the clear positive correlation between energy and loudness. Songs with higher energy values tend to be louder, which makes logical sense as energetic songs would naturally be louder.
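With roughly 1.2 million rows, a scatter plot of every track overplots heavily and can be slow to render. One option is to plot a random sample instead. A minimal sketch on hypothetical simulated data (the slope, noise level, and sample size of 500 are arbitrary choices for illustration):

```r
library(dplyr)
library(ggplot2)

# Hypothetical simulated tracks with a built-in positive relationship
set.seed(42)
toy_tracks <- data.frame(energy = runif(5000))
toy_tracks$loudness <- -30 + 25 * toy_tracks$energy + rnorm(5000, sd = 3)

# Draw a random sample of rows to keep the plot readable
plot_sample <- toy_tracks %>%
  slice_sample(n = 500)

p <- ggplot(plot_sample, aes(x = energy, y = loudness)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ x, color = "red") +
  labs(title = "Energy vs Loudness (random sample)",
       x = "Energy",
       y = "Loudness (dB)")
```

Setting a seed makes the sample reproducible, and the fitted line can still be estimated on the full data if precision matters more than rendering speed.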

6.2 Visualization 2 - Danceability vs Valence

Do happier songs tend to be more danceable?

# Plot from the collected data frame (tracks_df) rather than the Arrow dataset
ggplot(tracks_df, aes(x = valence, y = danceability)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", color = "orange") +
  labs(title = "Danceability vs Valence",
       x = "Valence (Happiness Score)",
       y = "Danceability")
`geom_smooth()` using formula = 'y ~ x'

Interpretation: There is a positive relationship between danceability and valence, meaning that songs with higher happiness scores tend to be more danceable. This suggests that upbeat songs often share the qualities that make a track suited to dancing.

6.3 Visualization 3 - Tempo and Energy Over Time

How have tempo and energy changed over time?

library(lubridate)

Attaching package: 'lubridate'
The following object is masked from 'package:arrow':

    duration
The following objects are masked from 'package:base':

    date, intersect, setdiff, union
library(dplyr)
library(ggplot2)

# Convert Arrow dataset to a regular data frame
tracks_df <- as.data.frame(tracks_csv)

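# Parse release_date as a Date; values that do not match YYYY-MM-DD become NA
tracks_df$release_date <- as.Date(tracks_df$release_date, format = "%Y-%m-%d")

# Overwrite the existing year column with the year of the parsed date
# (rows with an unparsed date get NA)
tracks_df$year <- year(tracks_df$release_date)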

# Compute yearly averages
tempo_energy_year <- tracks_df %>%
  group_by(year) %>%
  summarise(
    mean_tempo = mean(tempo, na.rm = TRUE),
    mean_energy = mean(energy, na.rm = TRUE)
  )

# Plot
ggplot(tempo_energy_year, aes(x = year)) +
  geom_line(aes(y = mean_tempo, color = "Tempo")) +
  geom_line(aes(y = mean_energy * 200, color = "Energy (x200 scale)")) +
  labs(
    title = "Average Tempo and Energy Over Time",
    y = "Tempo (BPM)",
    color = "Metric"
  ) +
  theme_minimal()
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).

Interpretation: This plot shows how average tempo and energy change over time. Because energy is multiplied by 200 to share the tempo axis, the vertical gap between the two lines is not meaningful; only the shape of each line should be compared. Tempo is relatively constant over this time period, with some fluctuation in the first half of the 1900s, while energy fluctuates heavily for the entire time period. This suggests that while the speed of music does not change often, its style does.
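When two metrics on different scales share one panel, a labeled secondary axis makes the rescaling explicit to the reader. A minimal sketch using ggplot2's `sec_axis()`, with four hypothetical yearly averages (the numbers are invented for illustration) and the same factor of 200 used in the plot above:

```r
library(ggplot2)

# Hypothetical yearly averages, not taken from the dataset
yearly <- data.frame(
  year = c(1990, 2000, 2010, 2020),
  mean_tempo = c(118, 117, 119, 120),
  mean_energy = c(0.48, 0.55, 0.60, 0.62)
)

# Factor that maps energy (0 to 1) onto the tempo axis
scale_factor <- 200

p <- ggplot(yearly, aes(x = year)) +
  geom_line(aes(y = mean_tempo, color = "Tempo")) +
  geom_line(aes(y = mean_energy * scale_factor, color = "Energy")) +
  scale_y_continuous(
    name = "Tempo (BPM)",
    # The secondary axis undoes the rescaling so it reads in energy units
    sec.axis = sec_axis(~ . / scale_factor, name = "Energy (0 to 1)")
  ) +
  labs(title = "Average Tempo and Energy Over Time", color = "Metric")
```

An alternative design is to facet the two metrics into separate panels, which avoids a shared axis altogether.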

6.4 Visualization 4 - Acousticness vs Energy

How does acousticness relate to energy and loudness?

# Plot from the collected data frame (tracks_df) rather than the Arrow dataset
ggplot(tracks_df, aes(x = acousticness, y = energy, color = loudness)) +
  geom_point(alpha = 0.4) +
  scale_color_viridis_c() +
  labs(title = "Acousticness vs Energy (Colored by Loudness)",
       x = "Acousticness",
       y = "Energy",
       color = "Loudness (dB)")

Interpretation: This plot shows a negative relationship between acousticness and energy. Songs that are more acoustic tend to be less energetic, and vice versa. This is consistent with the idea that heavily produced tracks trade acoustic sound for electronic sound.

6.5 Visualization 5 - Correlation Heatmap of Audio Features

Which audio features cluster together most strongly?

library(dplyr)
library(ggplot2)
library(tidyr)

# Select numeric features
num_features <- tracks_df %>%
  select(danceability, energy, loudness, valence, tempo, acousticness, liveness)

# Compute correlation matrix
corr_matrix <- cor(num_features, use = "complete.obs")

# Convert to long format for ggplot
melted <- as.data.frame(as.table(corr_matrix))

# Plot heatmap
ggplot(melted, aes(x = Var1, y = Var2, fill = Freq)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, 
                       limit = c(-1, 1), name = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  labs(
    title = "Correlation Heatmap of Key Audio Features",
    x = "",
    y = ""
  )

Interpretation: This heatmap shows multiple strong correlations across the Spotify audio features. Energy and loudness are the most positively correlated pair. Danceability and valence are also positively related, while acousticness is negatively correlated with energy and loudness.

6.6 Visualization 6 - Acousticness over time

Has popular music become more or less acoustic over time?

ggplot(tracks_df, aes(x = year, y = acousticness)) +
  geom_smooth(color = "blue", se = FALSE) +
  labs(title = "Average Acousticness Over Time",
       x = "Year", y = "Acousticness (0 = electronic, 1 = acoustic)")
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 131186 rows containing non-finite outside the scale range
(`stat_smooth()`).

Interpretation: The trend in average acousticness shows that songs were at their most acoustic between 1925 and 1950. There was a major dip in acousticness from the late 1970s to the early 1980s, followed by a rise in the 1990s and then a steady decline to the present. Acousticness is currently at an all-time low for this dataset. This makes sense because so much music is now electronically produced.
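The warning above reports 131,186 rows dropped as non-finite, which is consistent with `year` being NA wherever `release_date` did not parse as YYYY-MM-DD. One possible cause, assumed here and not verified against the real file, is that some dates are stored as a year only or as a year and month. A minimal sketch of how to count such rows and recover the year from the text, using four hypothetical date strings:

```r
# Hypothetical date strings; the partial formats are an assumption
release_date <- c("1999-11-02", "1985", "2003-07", "2010-05-17")

# Strict parsing: anything that is not a full YYYY-MM-DD date becomes NA
parsed <- as.Date(release_date, format = "%Y-%m-%d")
n_unparsed <- sum(is.na(parsed))

# Fallback: take the year from the first four characters of the string
release_year <- as.integer(substr(release_date, 1, 4))
```

If the real data behaves this way, using the fallback year (or the original `year` column) would keep those tracks in the time-trend plots instead of silently dropping them.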

Part 7 - Stakeholder Communication

Executive Summary

Over the past century, music has evolved substantially in both how it is created and how it sounds. Using the Spotify tracks dataset, which spans from about 1900 to 2020, this analysis examines the evolution of the audio traits that are the building blocks of these tracks. It shows how, over the past century, music has become less acoustic, louder, and more energetic.

The first two audio features I examined over time are tempo and energy. Figure 6.3 shows that energy fluctuates heavily over the examined time period. It is currently at a peak, with other major peaks in the 1980s and the 1950s. Tempo, however, is much more constant over this time period. It is at its highest, over 125 beats per minute (bpm), at the start before dipping below 100 bpm, rising back up, and stabilizing around 115 bpm from 1950 to the present.

Acousticness over time can be seen in figure 6.6, which shows an inverted, roughly parabolic curve from 1900 to 1975, followed by a brief spike and then a steady decline until the present. Throughout my analysis, acousticness was consistently the most intriguing of the audio features. As figure 6.5 shows, a heatmap detailing the features' relationships with each other, acousticness was negatively correlated with every other audio feature included in the heatmap. Acoustic songs tend to be quieter, as seen in figure 6.4, and acousticness is negatively related to energy and danceability. These findings suggest that the traits most common in recent music are higher energy, stronger rhythm, and moderately fast tempos. This may be due to the rise of hip hop and pop in the past few decades, or it may reflect a general distaste for slower, more acoustic music.

Returning to tempo, the lack of change in this feature is interesting given how much the others fluctuate. As the heatmap shows, tempo is only weakly correlated with the other audio features. It seems that through genre shifts, cultural changes, and major changes in energy, tempo remains relatively stable.

This analysis matters for both music carriers and artists. Anyone looking to break into the music industry should understand its current state, while producers and music carriers can better understand which traits dominate music at the moment before making decisions. Ultimately, this study suggests that technological innovation has changed what listeners value in their music.

Hero Visualization

ggplot(tracks_df, aes(x = year, y = acousticness)) +
  geom_smooth(color = "#1f77b4", se = FALSE, linewidth = 1.2) +
  labs(
    title = "The Decline of Acoustic Music Over Time",
    subtitle = "Average acousticness of songs from 1900–2020 based on Spotify audio features",
    x = "Year",
    y = "Average Acousticness (0 = Electronic, 1 = Acoustic)",
    caption = "Source: Spotify Tracks Dataset"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(size = 12, color = "gray30"),
    plot.caption = element_text(size = 10, color = "gray40"),
    panel.grid.minor = element_blank()
  )
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 131186 rows containing non-finite outside the scale range
(`stat_smooth()`).

This figure shows the decline in acousticness in popular music. It highlights how the acoustic sounds so popular in the mid 1900s have been replaced by electronically produced music.

Limitations

While this analysis provides insight into the evolution of audio traits in music from the 1900s to today, several key variables are missing that would make it significantly more valuable. The most useful additions would be a genre variable and a measure of each song's popularity. Popularity would be difficult to measure consistently given how old some of the tracks are, but it would still add important context. Another limitation is that some of the older eras may not be fully represented because of the recording technology available at the time.

Recommendations

Future analyses could expand on this project by adding the context described in the Limitations section, such as a genre or popularity variable, to show how traits differ across styles. Another step could be analyzing how audio features vary across regions to see whether the trends seen here hold everywhere. A final improvement could be the use of machine learning models to help predict future trends in music through analysis of past patterns.

Deliverables Checklist

Ensure your submission includes:

  • Complete READY framework analysis with thoughtful responses

  • Systematic SCAN framework exploration with specific findings

  • Successful data loading with Arrow

  • Professional data description and summary statistics

  • Comprehensive missing value analysis with percentages

  • Variable summary table documenting key fields

  • Memory efficiency demonstration

  • 3-5 well-defined, specific exploratory research questions

  • Data quality assessment with honest evaluation

  • Professional summary with clear next steps

Grading Criteria

  • READY Framework (20%): Thoughtful strategic planning showing understanding of stakeholders and analytical approach

  • Data Loading (15%): Successful Arrow implementation with proper documentation

  • SCAN Framework (25%): Systematic exploration with specific, meaningful findings

  • Data Quality Assessment (20%): Comprehensive evaluation with specific evidence

  • Research Questions (15%): Clear, answerable questions tied to stakeholder needs and data capabilities

  • Professional Communication (5%): Clear, honest, well-organized presentation throughout

Tips for Success

  • Be specific in your observations - avoid vague statements

  • Think like a stakeholder - what would decision-makers actually want to know?

  • Document your reasoning for all assessment decisions

  • Be honest about limitations - this builds credibility

  • Focus on actionable insights - what can actually be learned from this data?

  • Ask for help if your data format doesn’t match the provided templates

Remember: This is exploratory data analysis - you’re learning about your data, not proving predetermined hypotheses. Let your curiosity guide your investigation while maintaining systematic rigor.