Your Project Title: Part 2 - Deep Dive Analysis

Author

J Sanchez-Collins

Project Part 2: From Data Understanding to Stakeholder Insights

Overview

Welcome to Part 2 of your data analysis project! In Part 1, you successfully loaded your data, applied the READY and SCAN frameworks, and developed 3-5 research questions. Now it’s time to dig deeper.

What You’ll Accomplish in Part 2:

Part 4: Systematically analyze and handle missing data
Part 5: Select the most meaningful variables for your analysis
Part 6: Create compelling visualizations that answer exploratory questions
Part 7: Communicate your findings to stakeholders

Remember: This is still exploratory data analysis - you’re investigating patterns and building understanding, not proving predetermined hypotheses.

Setup and Data Loading

Load Required Libraries

Packages attached, in load order: tidyverse (ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr, forcats, lubridate), arrow, duckdb, DBI, glue, naniar, corrr, scales.

Reconnect to Your Dataset

# Path to the dataset. It is already a plain CSV, so extraction is only
# needed when a ZIP archive is supplied instead.
musical_data <- "~/Downloads/Fall 2025 Semester/Data/tracks_features.csv"
outdir <- file.path(dirname(musical_data), "extracted_data")
dir.create(outdir, showWarnings = FALSE)
if (grepl("\\.zip$", musical_data, ignore.case = TRUE)) {
  unzip(musical_data, exdir = outdir, overwrite = TRUE)
}

# Get list of any extracted CSV files
csv_files <- list.files(outdir, pattern = "\\.csv$", full.names = TRUE)
names(csv_files) <- tools::file_path_sans_ext(basename(csv_files))

# Open with Arrow - specify the main file you want to work with
my_music_dataset <- open_dataset(musical_data, format = "csv")

# Check memory usage
glue("Memory used by Arrow object: {format(object.size(my_music_dataset), units = 'KB')}")

Memory used by Arrow object: 0.5 Kb

#Take a glimpse of the data
glimpse(my_music_dataset)

FileSystemDataset with 1 csv file
1,204,025 rows x 24 columns
$ id               <string> "7lmeHLHBe4nmXzuXc0HDjk", "1wsRitfRRtWyEapl0q22o8", …
$ name             <string> "Testify", "Guerrilla Radio", "Calm Like a Bomb", "M…
$ album            <string> "The Battle Of Los Angeles", "The Battle Of Los Ange…
$ album_id         <string> "2eia0myWFgoHuttJytCxgX", "2eia0myWFgoHuttJytCxgX", …
$ artists          <string> "['Rage Against The Machine']", "['Rage Against The …
$ artist_ids       <string> "['2d0hyoQ5ynDBnkvAbJKORj']", "['2d0hyoQ5ynDBnkvAbJK…
$ track_number      <int64> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5…
$ disc_number       <int64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ explicit         <string> "False", "True", "False", "True", "False", "False", …
$ danceability     <double> 0.470, 0.599, 0.315, 0.440, 0.426, 0.298, 0.417, 0.2…
$ energy           <double> 0.978, 0.957, 0.970, 0.967, 0.929, 0.848, 0.976, 0.8…
$ key               <int64> 7, 11, 7, 11, 2, 2, 9, 11, 7, 9, 7, 6, 4, 7, 1, 7, 4…
$ loudness         <double> -5.399, -5.764, -5.424, -5.830, -6.729, -5.947, -6.0…
$ mode              <int64> 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1…
$ speechiness      <double> 0.0727, 0.1880, 0.4830, 0.2370, 0.0701, 0.0727, 0.17…
$ acousticness     <double> 0.026100, 0.012900, 0.023400, 0.163000, 0.001620, 0.…
$ instrumentalness <double> 1.09e-05, 7.06e-05, 2.03e-06, 3.64e-06, 1.05e-01, 1.…
$ liveness         <double> 0.3560, 0.1550, 0.1220, 0.1210, 0.0789, 0.2010, 0.10…
$ valence          <double> 0.503, 0.489, 0.370, 0.574, 0.539, 0.194, 0.483, 0.6…
$ tempo            <double> 117.906, 103.680, 149.749, 96.752, 127.059, 148.282,…
$ duration_ms       <int64> 210133, 206200, 298893, 213640, 205600, 280960, 2020…
$ time_signature   <double> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
$ year              <int64> 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999…
$ release_date     <string> "1999-11-02", "1999-11-02", "1999-11-02", "1999-11-0…

Part 5: Variable Selection & Focus

Building on your Part 2 analytical framework

Introduction

You started with 10+ variables. Now it’s time to narrow down to the 3-7 variables that matter most for your research questions.

Step 1: Variable Inventory

❓ EXPLORATORY QUESTION 8: What role does each variable play in our analysis?

Create a comprehensive inventory categorizing all variables:

# A tibble: 24 × 5
   variable     role                       data_type usefulness  notes          
   <chr>        <chr>                      <chr>     <chr>       <chr>          
 1 danceability measurement of music       double    very useful might use in a…
 2 energy       measurement of music       double    very useful will likely be…
 3 tempo        measurement of music       double    very useful likely to be i…
 4 track_number identifier within an album integer   useful      the largest va…
 5 name         Easier song identifier     string    semi-useful no notes       
 6 year         date                       integer   semi-useful good if analyz…
 7 release_date date                       string    semi-useful also good for …
 8 album        identifier                 string    semi-useful no notes       
 9 artists      identifier                 string    semi-useful no notes       
10 loudness     measurement of music       double    semi-useful difficult to i…
# ℹ 14 more rows
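
The code that built this inventory is not shown in the rendered output. As a rough sketch of how such a table could be started in base R, the example below records each variable's name and storage type and leaves the role to be filled in by hand. The data frame `toy_tracks`, its values, and the role labels are illustrative assumptions, not the author's code.

```r
# Toy stand-in for a few columns of the tracks data (values are made up)
toy_tracks <- data.frame(
  name = c("Song A", "Song B", "Song C"),
  track_number = c(1L, 2L, 3L),
  energy = c(0.91, 0.74, 0.62),
  stringsAsFactors = FALSE
)

# One row per variable: its name and storage type
inventory <- data.frame(
  variable = names(toy_tracks),
  data_type = unname(vapply(toy_tracks, function(col) class(col)[1], character(1))),
  stringsAsFactors = FALSE
)

# Roles are a judgment call, so they are typed in by hand (hypothetical labels)
inventory$role <- c("identifier", "identifier within an album", "measurement of music")

print(inventory)
```

Usefulness ratings and notes could be added as further hand-written columns in the same way.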

📝 Your Variable Assessment:

For each variable, justify its potential usefulness:

High Usefulness Variables (keep):
Tempo: Could be correlated with a song's energy
Track number: Main variable I'm interested in studying the effect of
Danceability: Interested in its correlation with track number.
Energy: Interested in this variable's correlation with track number.

Medium Usefulness Variables (consider):
Year: Could have an effect since newer songs might have more energy and danceability.
Speechiness: Could also have an effect?
Valence: Could also maybe be correlated with track number?
Instrumentalness: Could maybe have an effect?

Low Usefulness Variables (likely exclude):
Explicit: I doubt this variable would have an effect
Mode: Unsure what this variable indicates
id and album_id: the names are provided so the id numbers aren't needed

Step 2: Examine Relationships

For Numeric Variables: Correlation Analysis

❓ EXPLORATORY QUESTION 9: Which numeric variables are correlated with each other?

# Visualize correlations
C <- my_music_dataset|>
  select(where(is.numeric)) |>  # Select only numeric columns
  collect() |>
  sample_n(min(10000, n()))|> #because the data is large
  cor()
# corrplot is not among the packages attached above, so call it by namespace
corrplot::corrplot(C, method = "number", number.cex = 0.75)

📝 Correlation Interpretation:

Answer these questions:

Which variables are strongly correlated (|r| > 0.7)?
- Loudness ~ Energy: 0.81
- Acousticness ~ Energy: -0.79
Also notable, though below the 0.7 threshold:
- Acousticness ~ Loudness: -0.67
- Danceability ~ Valence: 0.55
- Valence ~ Energy: 0.41
- Loudness ~ Danceability: 0.36
- Loudness ~ Valence: 0.36

Are any of these redundant? Can we drop one?

We should probably keep all of these variables for now, even though the correlations all make logical sense.

Do any correlations surprise you?

I wasn’t expecting danceability and valence to have a moderately high correlation. It makes sense that positive-sounding music correlates with danceable music, but I had never paid enough attention to notice. (Valence describes how positive a track sounds musically, not the language of its lyrics.)
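
The |r| > 0.7 threshold can also be applied mechanically rather than by eye. The sketch below is one way to do that in base R. It uses three of the coefficients reported above in a small hand-built matrix; `C_toy` is an illustrative name, and in the analysis the same steps would presumably be run on the full matrix `C`.

```r
# Small correlation matrix built from three coefficients reported above
vars <- c("loudness", "energy", "acousticness")
C_toy <- matrix(
  c( 1.00,  0.81, -0.67,
     0.81,  1.00, -0.79,
    -0.67, -0.79,  1.00),
  nrow = 3, dimnames = list(vars, vars)
)

# Look only at the upper triangle so each pair appears once, then flag |r| > 0.7
idx <- which(upper.tri(C_toy) & abs(C_toy) > 0.7, arr.ind = TRUE)
strong_pairs <- data.frame(
  var_1 = rownames(C_toy)[idx[, "row"]],
  var_2 = colnames(C_toy)[idx[, "col"]],
  r = C_toy[idx]
)
print(strong_pairs)
```

With these inputs only the loudness/energy and energy/acousticness pairs pass the threshold; acousticness/loudness at -0.67 falls just short.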

For Categorical Variables: Relationship Analysis

❓ EXPLORATORY QUESTION 10: How do categorical variables relate to our outcome?

# Examine relationship between track number and mean energy
relationship_analysis <- my_music_dataset |>
  group_by(track_number) |>
  summarise(
    count = n(),
    mean_energy = mean(energy, na.rm = TRUE),
    median_energy = median(energy, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(track_number) |>
  collect()

# Visualize
ggplot(relationship_analysis, 
       aes(x = track_number, 
           y = mean_energy)) +
  geom_col(fill = "#4575b4") +
  coord_flip() +
  labs(
    title = "Mean Energy by Track Number",
    subtitle = "Does Energy add value to our analysis?",
    x = "Track Number",
    y = "Mean Energy"
  ) +
  theme_minimal()

#examining track number and danceability
relationship_analysis_2 <- my_music_dataset |>
  group_by(track_number) |>
  summarise(
    count = n(),
    mean_danceability = mean(danceability, na.rm = TRUE),
    median_danceability = median(danceability, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(mean_danceability)) |>
  collect()


# Visualize
ggplot(relationship_analysis_2, 
       aes(x = track_number, 
           y = mean_danceability)) +
  geom_col(fill = "#4575b4") +
  coord_flip() +
  labs(
    title = "Mean Danceability by Track Number",
    subtitle = "Does Danceability add value to our analysis?",
    x = "Track Number",
    y = "Mean Danceability"
  ) +
  theme_minimal()

#examining track number and valence
relationship_analysis <- my_music_dataset |>
  group_by(track_number) |>
  summarise(
    count = n(),
    mean_valence = mean(valence, na.rm = TRUE),
    median_valence = median(valence, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(mean_valence)) |>
  collect()



# Visualize
ggplot(relationship_analysis, 
       aes(x = track_number, 
           y = mean_valence)) +
  geom_col(fill = "#4575b4") +
  coord_flip() +
  labs(
    title = "Mean Valence by Track Number",
    subtitle = "Does Valence add value to our analysis?",
    x = "Track Number",
    y = "Mean Valence"
  ) +
  theme_minimal()

#Examining track number and instrumentalness
relationship_analysis <- my_music_dataset |>
  group_by(track_number) |>
  summarise(
    count = n(),
    mean_instrumentalness = mean(instrumentalness, na.rm = TRUE),
    median_instrumentalness = median(instrumentalness, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(mean_instrumentalness)) |>
  collect()


# Visualize
ggplot(relationship_analysis, 
       aes(x = track_number, 
           y = mean_instrumentalness)) +
  geom_col(fill = "#4575b4") +
  coord_flip() +
  labs(
    title = "Mean Instrumentalness by Track Number",
    subtitle = "Does Instrumentalness add value to our analysis?",
    x = "Track Number",
    y = "Mean Instrumentalness"
  ) +
  theme_minimal()

#examining track number and tempo
relationship_analysis <- my_music_dataset |>
  group_by(track_number) |>
  summarise(
    count = n(),
    mean_tempo = mean(tempo, na.rm = TRUE),
    median_tempo = median(tempo, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(mean_tempo)) |>
  collect()



# Visualize
ggplot(relationship_analysis, 
       aes(x = track_number, 
           y = mean_tempo)) +
  geom_col(fill = "#4575b4") +
  coord_flip() +
  labs(
    title = "Mean Tempo by Track Number",
    subtitle = "Does Tempo add value to our analysis?",
    x = "Track Number",
    y = "Mean Tempo"
  ) +
  theme_minimal()

#examining track number and acousticness
relationship_analysis <- my_music_dataset |>
  group_by(track_number) |>
  summarise(
    count = n(),
    mean_acousticness = mean(acousticness, na.rm = TRUE),
    median_acousticness = median(acousticness, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(mean_acousticness)) |>
  collect()



# Visualize
ggplot(relationship_analysis, 
       aes(x = track_number, 
           y = mean_acousticness)) +
  geom_col(fill = "#4575b4") +
  coord_flip() +
  labs(
    title = "Mean Acousticness by Track Number",
    subtitle = "Does Acousticness add value to our analysis?",
    x = "Track Number",
    y = "Mean Acousticness"
  ) +
  theme_minimal()

📝 Your Interpretation:

There was no noticeable trend between track number and tempo, valence, or danceability. However, there was a noticeable trend between track number and acousticness, instrumentalness, and energy.
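
Group means for high track numbers can rest on very few songs, which may make a bar chart look noisier than the underlying relationship. One way to check a trend more carefully is to keep only track numbers with enough songs and then compute a rank correlation. The sketch below uses made-up numbers in the same shape as `relationship_analysis`; `toy_summary` and the `min_count` cutoff are assumptions chosen for illustration.

```r
# Made-up summary in the same shape as relationship_analysis
toy_summary <- data.frame(
  track_number = 1:8,
  count = c(5000, 4800, 4500, 4000, 3000, 1500, 40, 12),
  mean_energy = c(0.62, 0.60, 0.59, 0.57, 0.56, 0.54, 0.80, 0.20)
)

min_count <- 100  # hypothetical cutoff for "enough songs to trust the mean"
reliable <- subset(toy_summary, count >= min_count)

# Spearman's rho measures whether the means rise or fall monotonically
rho <- cor(reliable$track_number, reliable$mean_energy, method = "spearman")
print(rho)
```

A value of rho near -1 or 1 would indicate a consistent decrease or increase across track numbers, while a value near 0 would match the "no noticeable trend" reading.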

Step 3: Final Variable Selection

❓ EXPLORATORY QUESTION 11: Which 3-7 variables best answer our research questions?

Reminder of what our original research questions were:

What factors increase a song’s danceability?
What factors increase a song’s energy?
In an album, are the characteristics of a song associated with the track number?

Based on your inventory, correlations, and relationships, select your final variable set:

FileSystemDataset with 1 csv file (query)
1,101,615 rows x 5 columns
$ track_number      <int64> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5…
$ acousticness     <double> 0.026100, 0.012900, 0.023400, 0.163000, 0.001620, 0.…
$ energy           <double> 0.978, 0.957, 0.970, 0.967, 0.929, 0.848, 0.976, 0.8…
$ instrumentalness <double> 1.09e-05, 7.06e-05, 2.03e-06, 3.64e-06, 1.05e-01, 1.…
$ year              <int64> 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999…
Call `print()` for query details

📋 Final Variable Selection Table

final_variables <- tribble(
  ~variable, ~role, ~why_included, ~research_question_addressed,
  "track number", "predictor", "Key Grouping Factor", "Question 3",
  "energy", "outcome", "metric of interest", "Question 3",
  "acousticness", "outcome", "metric of interest", "Question 3",
  "instrumentalness", "outcome", "metric of interest", "Question 3",
  "year", "predictor", "key grouping factor", "None"
)

final_variables

# A tibble: 5 × 4
  variable         role      why_included        research_question_addressed
  <chr>            <chr>     <chr>               <chr>                      
1 track number     predictor Key Grouping Factor Question 3                 
2 energy           outcome   metric of interest  Question 3                 
3 acousticness     outcome   metric of interest  Question 3                 
4 instrumentalness outcome   metric of interest  Question 3                 
5 year             predictor key grouping factor None

📝 Variables Excluded and Why:

Excluded Variables:
1. Tempo: It barely changed between different track numbers, indicating no relationship.

2. Valence: The numbers were all over the place depending on the track number, indicating no relationship.

3. Danceability: There was a slight trend, but it was more obvious in the other selected variables and felt more like a stretch to use it.

Step 4: Tool Selection Documentation

❓ EXPLORATORY QUESTION 12: Are we using the right tools for our data size?

Document which tools you’re using and why:

# Calculate your dataset characteristics
dataset_stats <- my_music_dataset|>
  summarise(
    total_rows = n(),
    total_cols = ncol(my_music_dataset)
  ) |>
  collect()

# Estimate size
# Note: This is approximate
estimated_size <- dataset_stats$total_rows * dataset_stats$total_cols * 8 / 1e9  # GB

glue("
Dataset Statistics:
- Rows: {scales::comma(dataset_stats$total_rows)}
- Columns: {dataset_stats$total_cols}
- Estimated Size: ~{round(estimated_size, 1)} GB
")

Dataset Statistics:
- Rows: 1,204,025
- Columns: 24
- Estimated Size: ~0.2 GB

📝 Tool Selection Justification:

Tools Being Used:
☑ Arrow (< 5GB, simple operations)
☐ DuckDB (5-50GB, complex analytics)
☐ Spark (50GB+, distributed computing)

Justification for Tool Choice:
The estimated size was approximately 0.2GB, so I will be using Arrow.
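
The size estimate can be reproduced by hand from the row and column counts reported above. The sketch below repeats that arithmetic and compares the result with the thresholds in the checklist. It assumes 8 bytes per cell, as the original estimate does, which is only a rough guide because the string columns are not fixed-width.

```r
# Values taken from the Dataset Statistics output above
total_rows <- 1204025
total_cols <- 24
bytes_per_value <- 8  # assumption: every cell is stored as an 8-byte number

estimated_gb <- total_rows * total_cols * bytes_per_value / 1e9
print(round(estimated_gb, 3))

# Compare against the size thresholds listed in the checklist
tool <- if (estimated_gb < 5) "Arrow" else if (estimated_gb <= 50) "DuckDB" else "Spark"
print(tool)
```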

Part 6: Exploratory Visualizations & Analysis

Building on your Part 3 notable segments

Introduction

Now that you have clean data and focused variables, it’s time to create compelling visualizations that answer your research questions. You’ll create 5+ visualizations using at least 3 different chart types.

For each visualization, you must:
1. State the exploratory question
2. Create a publication-quality chart
3. Interpret the patterns you find
4. Explain implications for stakeholders

Visualization 1: Line Graph

❓ EXPLORATORY QUESTION 13: How have the average characteristics of music changed over time?

# Line chart of acousticness over time
viz_data_1 <- my_data_focused |>
  group_by(year) |>
  summarise(
    metric = mean(acousticness, na.rm = TRUE),
    .groups = "drop"
  ) |>
  collect()

ggplot(viz_data_1, aes(x = year, y = metric)) +
  geom_line(color = "#4575b4", linewidth = 1.5) +
  geom_point(size = 3, color = "#4575b4") +
  scale_y_continuous(labels = scales::comma) +
  xlim(1938,2020)+
  labs(
    title = "Average Song Acousticness by Year",
    subtitle = "Shows a decrease in the average acousticness of songs.",
    x = "Year",
    y = "Acousticness Score (0 - 1)",
    caption = "Source: Spotify 1.2M+ Songs"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 11)
  )

#Line chart of energy over time
viz_data_2 <- my_data_focused |>
  group_by(year) |>
  summarise(
    metric = mean(energy, na.rm = TRUE),
    .groups = "drop"
  ) |>
  collect()

ggplot(viz_data_2, aes(x = year, y = metric)) +
  geom_line(color = "#4575b4", linewidth = 1.5) +
  geom_point(size = 3, color = "#4575b4") +
  scale_y_continuous(labels = scales::comma) +
  xlim(1938,2020)+
  labs(
    title = "Average Song Energy by Year",
    subtitle = "Shows a fluctuation of the average energy of songs.",
    x = "Year",
    y = "Energy Score (0 - 1)",
    caption = "Source: Spotify 1.2M+ Songs"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 11)
  )

#Line chart of instrumentalness over time
viz_data_3 <- my_data_focused |>
  group_by(year) |>
  summarise(
    metric = mean(instrumentalness, na.rm = TRUE),
    .groups = "drop"
  ) |>
  collect()


ggplot(viz_data_3, aes(x = year, y = metric)) +
  geom_line(color = "#4575b4", linewidth = 1.5) +
  geom_point(size = 3, color = "#4575b4") +
  scale_y_continuous(labels = scales::comma) +
  xlim(1938,2020)+
  labs(
    title = "Average Song Instrumentalness by Year",
    subtitle = "Shows a semi-consistent level of the average instrumentalness of songs.",
    x = "Year",
    y = "Instrumentalness Score (0 - 1)",
    caption = "Source: Spotify 1.2M+ Songs"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 11)
  )

📝 Interpretation:

Pattern Observed:

Describe what you see in the visualization: 
The three characteristics of interest each have a different relationship with time: acousticness decreases, energy increases, and instrumentalness remains fairly consistent.

Statistical Evidence:

Any specific numbers or trends to highlight:
Between 1980 and 1990, all of the variables I'm studying shift noticeably, so that decade is worth a closer look.

Stakeholder Implications:

What does this mean for decision-makers?
If someone is looking for more energetic songs, they should listen to more recent music, and if someone prefers more acoustic music, they should try listening to older music. Also, year has little to no effect on the instrumentalness of a song.
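
To put numbers behind a shift like the one between 1980 and 1990, the yearly averages could be grouped into decades and compared. The sketch below shows the idea in base R on made-up values shaped like `viz_data_1`; `toy_yearly` and its numbers are illustrative only and are not results from the dataset.

```r
# Made-up yearly averages in the same shape as viz_data_1
toy_yearly <- data.frame(
  year = c(1975, 1978, 1982, 1986, 1991, 1995),
  metric = c(0.60, 0.58, 0.50, 0.42, 0.35, 0.33)
)

# Bucket each year into its decade, then average within the decade
toy_yearly$decade <- (toy_yearly$year %/% 10) * 10
by_decade <- aggregate(metric ~ decade, data = toy_yearly, FUN = mean)

# Change from one decade to the next (the first decade has no predecessor)
by_decade$change <- c(NA, diff(by_decade$metric))
print(by_decade)
```

Note that averaging yearly means treats every year equally regardless of how many songs it contains, so a song-weighted version might give somewhat different numbers.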

Visualization 2: Bar Chart

❓ EXPLORATORY QUESTION 14: How does track number affect the averages of the song characteristics?

#Bar chart of energy by track number
viz_data_4 <- my_data_focused |>
  group_by(track_number) |>
  summarise(
    total = mean(energy, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(total)) |>
  collect()

ggplot(viz_data_4, aes(x = track_number, y = total)) +
  geom_col(fill = "#4575b4") +
  coord_flip() +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Average Energy by Track Number",
    subtitle = "Shows a slight decrease in energy as an album goes on.",
    x = "Track Number",
    y = "Energy Score (0 - 1)",
    caption = "Source: Spotify 1.2M+ Songs"
  ) +
  theme_minimal()

Average Song Characteristics in Relation to Track Number

#Bar chart of instrumentalness by track number
viz_data_5 <- my_data_focused |>
  group_by(track_number) |>
  summarise(
    total = mean(instrumentalness, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(total)) |>
  collect()

ggplot(viz_data_5, aes(x = track_number, y = total)) +
  geom_col(fill = "#4575b4") +
  coord_flip() +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Average Instrumentalness by Track Number",
    subtitle = "Shows increases and decreases in instrumentalness as an album goes on.",
    x = "Track Number",
    y = "Instrumentalness Score (0 - 1)",
    caption = "Source: Spotify 1.2M+ Songs"
  ) +
  theme_minimal()

#Bar chart of acousticness by track number
viz_data_6 <- my_data_focused |>
  group_by(track_number) |>
  summarise(
    total = mean(acousticness, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(total)) |>
  collect()

ggplot(viz_data_6, aes(x = track_number, y = total)) +
  geom_col(fill = "#4575b4") +
  coord_flip() +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Average Acousticness by Track Number",
    subtitle = "Shows an increase in acousticness as an album goes on.",
    x = "Track Number",
    y = "Acousticness Score (0 - 1)",
    caption = "Source: Spotify 1.2M+ Songs"
  ) +
  theme_minimal()

📝 Interpretation:

Pattern Observed:

It looks like energy decreases as an album plays, instrumentalness fluctuates, and acousticness increases.

Statistical Evidence:

Not much to add here since bar plots are easier to interpret than line graphs.

Stakeholder Implications:

This tells listeners that earlier songs tend to have more energy, middle songs have a higher instrumentalness score, and later songs tend to be more acoustic.

Visualization 3: Scatter plot

❓ EXPLORATORY QUESTION 15: Do our numerical variables have a linear relationship?

#Scatterplot of Energy vs. Acousticness
viz_data_7 <- my_data_focused |>
  collect() |>
  sample_n(min(5000, n()))

ggplot(viz_data_7, aes(x = energy, y = acousticness)) +
  geom_point(alpha = 0.3, color = "#4575b4") +
  geom_smooth(method = "lm", color = "#d73027", se = TRUE) +
  labs(
    title = "Energy vs. Acousticness",
    subtitle = "Depicts a negative relationship between the variables",
    x = "Energy Score (0 - 1)",
    y = "Acousticness Score (0 - 1)",
    caption = "Source: Spotify 1.2M+ Songs (n = 5000)"
  ) +
  theme_minimal()

#Scatterplot of Energy vs. Instrumentalness
viz_data_8 <- my_data_focused |>
  collect() |>
  sample_n(min(5000, n()))  

ggplot(viz_data_8, aes(x = instrumentalness, y = energy)) +
  geom_point(alpha = 0.3, color = "#4575b4") +
  geom_smooth(method = "lm", color = "#d73027", se = TRUE) +
  labs(
    title = "Instrumentalness vs. Energy",
    subtitle = "Depicts no relationship between the variables",
    x = "Instrumentalness Score (0 - 1)",
    y = "Energy Score (0 - 1)",
    caption = "Source: Spotify 1.2M+ Songs (n = 5000)"
  ) +
  theme_minimal()

#Scatterplot of Instrumentalness vs. Acousticness
viz_data_9 <- my_data_focused |>
  collect() |>
  sample_n(min(5000, n()))  

ggplot(viz_data_9, aes(x = acousticness, y = instrumentalness)) +
  geom_point(alpha = 0.3, color = "#4575b4") +
  geom_smooth(method = "lm", color = "#d73027", se = TRUE) +
  labs(
    title = "Acousticness vs. Instrumentalness",
    subtitle = "Depicts a slightly positive relationship between the variables",
    x = "Acousticness Score (0 - 1)",
    y = "Instrumentalness Score (0 - 1)",
    caption = "Source: Spotify 1.2M+ Songs (n = 5000)"
  ) +
  theme_minimal()

📝 Interpretation:

Pattern Observed:

The only clear relationship is the negative relationship between energy and acousticness.

Statistical Evidence:

The trends seen here make a lot of sense since, apart from energy and acousticness (r = -0.79), none of these variables had a high correlation in the correlation matrix from earlier.

Stakeholder Implications:

This doesn't impact listeners much, since it is already intuitive: someone who wants a high-energy song, which is more likely to be a recent song, can expect it to sound more digital than acoustic.
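
A scatter plot with a fitted line shows direction, but a correlation coefficient with a confidence interval gives a number to report alongside it. The sketch below uses `cor.test()` from base R on simulated scores, not the real data; the variable names mirror the dataset, while the simulated relationship and the seed are assumptions made only so the example runs on its own.

```r
set.seed(42)  # fixed seed so the simulated sample is reproducible

# Simulated stand-ins for energy and acousticness scores (not the real data)
n <- 500
energy <- runif(n)
acousticness <- pmin(pmax(1 - energy + rnorm(n, sd = 0.25), 0), 1)

# Pearson correlation with a 95% confidence interval
result <- cor.test(energy, acousticness, method = "pearson")
print(result$estimate)
print(result$conf.int)
```

Setting a seed before sampling would also make the three scatter plots above reproducible from one run to the next.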

Visualization Summary

❓ EXPLORATORY QUESTION 18: What’s the overall story from our visualizations?

# A tibble: 3 × 4
  viz_number chart_type    key_finding                         supports_question
       <dbl> <chr>         <chr>                               <chr>            
1          1 Line charts   Acousticness decreased, energy inc… How has song cha…
2          2 Bar charts    Track number has the most signific… How does track n…
3          3 Scatter plots Energy and acousticness have a mod… How are song cha…

📝 Narrative Synthesis:

Connect your visualizations into a coherent story:

The Big Picture:
Over time, as music has changed, so have the characteristics that make up songs. A song's placement in an album also has a small ability to predict some of those characteristics.

Surprising Findings:
I was surprised to find a lack of relationship between most of the numeric variables.

Patterns Across Visualizations:
Across all of these visualizations, the instrumentalness variable consistently has the weakest trends and patterns. It was not as helpful as I originally thought it would be.

Part 7: Stakeholder Communication

Telling the complete story

Introduction

You’ve done deep analysis - now it’s time to communicate your findings effectively to decision-makers who may not be data experts.

Executive Summary

❓ EXPLORATORY QUESTION 19: If a stakeholder has only 2 minutes, what must they know?

Write a 300-500 word executive summary using the BLUF (Bottom Line Up Front) approach:

Predicting Song Characteristics in an Album - Key Findings

Bottom Line: A song’s placement in an album does have an effect, albeit a small one, on its characteristics.

Context: I chose to analyze the Spotify 1.2M+ Songs dataset from Kaggle. I wanted to figure out if there was a different way for people to listen to albums of music they are unfamiliar with.

Key Findings:

Year has an effect on a song’s characteristics: This makes sense since decades can almost be defined by their music. However, the proportion of a song taken by just instruments has remained largely the same over time. Listeners can take note of what they like in a song, and listen to more from that time period.
Song placement in an album has minor effects on its characteristics: Earlier songs in an album are more likely to have a higher energy score and a lower acousticness score. So listeners who prefer more energetic music should listen to the first song of albums, and those who prefer less digital music should start with the later songs of an album.
The characteristics of a song are not good predictors of another characteristic: There were no unexpected correlations between song characteristics. It makes sense that energy and acousticness have a negative relationship, and it also makes sense that few other variables had a correlation. For listeners this just means they shouldn’t look for something they like in

Recommendations:

If you like a song from a decade, you will likely enjoy more from that decade.
If you like high energy songs, listen to the first song of albums. If you like less digital sounding music, you should opt for later songs in an album.
More relationships need to be looked into since the three I looked into didn’t come up with any useful results.

Data Quality Note: Data doesn’t include song genre as a categorical variable, and song popularity also wasn’t included.

Hero Visualization

❓ EXPLORATORY QUESTION 20: Which single visualization tells our story best?

Create ONE publication-quality “hero” visualization that could stand alone in a presentation or report. This should be self-explanatory for non-technical audiences.

# Your BEST, most polished visualization
# Include:
# - Clear, descriptive title
# - Informative subtitle
# - Proper axis labels with units
# - Readable text size
# - Color scheme that's accessible
# - Annotations if helpful
# - Source citation

#Bar chart of acousticness
viz_data_6 <- my_data_focused |>
  group_by(track_number) |>
  summarise(
    total = mean(acousticness, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(total)) |>
  collect()

ggplot(viz_data_6, aes(x = track_number, y = total)) +
  geom_col(fill = "#4575b4") +
  coord_flip() +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Average Acousticness by Track Number",
    subtitle = "Shows an increase in acousticness as an album goes on.",
    x = "Track Number",
    y = "Acousticness Score (0 - 1)",
    caption = "Source: Spotify 1.2M+ Songs"
  ) +
  theme_minimal()+
  theme(text = element_text(size = 20))

Your hero visualization - make it count!

📝 Hero Visualization Explanation:

Why this visualization?

Explain why you chose this as your hero visualization:
I chose this visualization because it is easy to understand and most clearly demonstrates the effect that track number has on the characteristics of a song.

What should viewers notice first?

Viewers should notice the upward trend in acoustic score as we get later into an album.

The “so what” factor:

Why does this matter to stakeholders?
This matters because using information from multiple graphs like this, they could discern where they want to start listening to an album based on what they like to hear in a song.

Limitations

❓ EXPLORATORY QUESTION 21: What are the honest limitations of our analysis?

Be transparent about what this analysis can and cannot tell us:

Data Limitations:

1. The scope of this data covers many numerical characteristics of songs, but not enough categorical ones. Specifically, the genre of each song is absent.


2. The data also doesn't include a song's popularity, which could have been an interesting variable to study the effect of.

Methodological Limitations:

1. For conciseness, I only looked into 3 different song characteristic variables when there were 9.

Scope Limitations:

1. What effect does track number have on the other numerical variables? Does the number of songs in an album have an effect on our findings?


2. We've seen how year affects the characteristics of music, so adding recent music to the dataset should help strengthen our conclusions.
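
One way future work could account for album length is to convert track number into a relative position within the album, so that track 4 of 4 and track 12 of 12 are both treated as the last song. The sketch below shows the calculation in base R on a made-up table; `toy_albums` is illustrative, and using the highest track number as the album length is an assumption that would need checking against multi-disc albums, where numbering can restart on each disc.

```r
# Made-up tracks from two albums of different lengths
toy_albums <- data.frame(
  album_id = c("a", "a", "a", "a", "b", "b"),
  track_number = c(1, 2, 3, 4, 1, 2)
)

# Album length approximated by the highest track number in each album
toy_albums$album_length <- ave(toy_albums$track_number, toy_albums$album_id, FUN = max)

# Relative position: 0 = first track, 1 = last track
# (single-track albums would give 0/0 and need separate handling)
toy_albums$relative_position <-
  (toy_albums$track_number - 1) / (toy_albums$album_length - 1)
print(toy_albums)
```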

Actionable Recommendations

❓ EXPLORATORY QUESTION 22: Based on our findings, what should stakeholders do?

Provide 2-3 concrete, actionable recommendations:

Recommendation 1: Find a decade of music you like.

Finding it’s based on: Visualizations 1 and 2.

Proposed action:

Music released within the same cluster of years is more likely to share similar characteristics, so search within the decade you like for new albums.

Expected impact:

You should be able to find new music that sounds good to you.

Implementation difficulty: ☐ Easy

Recommendation 2: Listening to an Album

Finding it’s based on: Visualizations 4, 5, and 6

Proposed action:

If you're looking for calmer, more acoustic vibes, then you should listen to the later songs in an album. If you are looking for more energetic, danceable vibes, then you should listen to the earlier songs in an album.

Expected impact:

It should be easier to find new music that you enjoy.

Implementation difficulty: ☐ Moderate

Hypothesis Development

❓ EXPLORATORY QUESTION 23: Based on our EDA, what hypotheses can we test in future work?

After exploratory analysis, we can formulate testable hypotheses:

Hypothesis 1:

Null Hypothesis (H₀):

A song's placement in an album has no effect on its characteristics.

Alternative Hypothesis (H₁):

A song's placement in an album has an effect on its characteristics.

Evidence from EDA:

Visualizations 4, 5, and 6 are all consistent with the alternative hypothesis. The bars in the bar charts aren't level with one another, so the average of each characteristic differs by track number.

How to test:

A linear regression of each song characteristic on track number could test whether the relationship is statistically significant.
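
As a minimal sketch of what such a test might look like, the R code below fits a simple linear regression of acousticness on track number. It runs on simulated data so that it is self-contained: the data frame `songs_sim`, the column names `track_number` and `acousticness`, and the simulated slope are all assumptions made up for illustration, not results from the real dataset. The simulated scores are also not clipped to the 0 - 1 range.

```r
# Simulated stand-in for the real data (assumption: the real columns are
# named track_number and acousticness; every value here is made up).
set.seed(42)
n <- 2000
songs_sim <- data.frame(track_number = sample(1:15, n, replace = TRUE))
songs_sim$acousticness <- 0.30 + 0.01 * songs_sim$track_number +
  rnorm(n, sd = 0.25)

# Fit acousticness as a linear function of track number.
fit <- lm(acousticness ~ track_number, data = songs_sim)

# Extract the slope estimate and its p-value.
# H0 corresponds to a slope of 0 (track position unrelated to acousticness).
coefs   <- summary(fit)$coefficients
slope   <- coefs["track_number", "Estimate"]
p_value <- coefs["track_number", "Pr(>|t|)"]

cat("Estimated slope:", round(slope, 4), "\n")
cat("p-value:", signif(p_value, 3), "\n")
```

A small p-value would be evidence against H₀ rather than proof of H₁. With more than a million songs, even a tiny slope is likely to come out statistically significant, so the size of the slope deserves as much attention as the p-value.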

Presentation Plan

❓ EXPLORATORY QUESTION 24: How do we tell this story in 5 minutes?

Create an outline for a 5-minute presentation:

Slide 1: Title & Hook (30 seconds)

Title: Predicting Song Characteristics in an Album
Hook: Hello enjoyer of music! Isn't it difficult to find new music to get into?

Slide 2: Context & Question (45 seconds)

Dataset: I chose to analyze the Spotify 1.2M+ Song dataset from Kaggle. 
Why it matters: I tend not to explore music much because the sheer amount of it is overwhelming. Through this research, I wanted to create at least a starting point.

Key questions: What effect does year have on a song's characteristics? Does track number have an effect on the characteristics of a song?

Slide 3: Key Finding #1 (60 seconds)

Visualization: Visualization 6, which is a bar chart of the average acousticness score for each track number.
Finding: It shows that later songs in an album tend to have a higher acousticness score.
Implication: This means that people can personalize the order in which they listen to new albums to enhance their experience.

Slide 4: Key Finding #2 (60 seconds)

Visualization: Visualization 1, the line chart of the average song acousticness by year.
Finding: It shows that over time, the average acousticness of music has decreased.
Implication: This means that someone who prefers more acoustic music should look into older songs.

Slide 5: Recommendations (60 seconds)

Action 1: Find a decade of music you like and start there.
Action 2: Determine what characteristics of music you like.
Action 3: Using the data, pick a random album and start listening at the point where the characteristics you picked are most prevalent.

Slide 6: Questions & Next Steps (45 seconds)

Limitations: I was only able to look at 3 of the 9 numerical variables in the dataset, and I may have gotten lucky by picking three that made track number look significant.
Future work: Add more recent songs and look at the other numeric variables. Possibly run a regression on the track number variable.
Thank you + questions

Final Reflection

❓ EXPLORATORY QUESTION 25: What did we learn from this entire process?

Take a moment to reflect on your complete analysis journey:

Most Surprising Finding:

I was shocked at the lack of correlations between all the numeric variables.

Biggest Challenge:

Deciding what variables to focus on and creating research questions from them.

What You’d Do Differently:

I think I would have picked an alternate version of this dataset that included popularity statistics to have a bit more variety in my research questions.

Skills Developed:

I got practice in data cleaning and narrowing down a dataset to focus on a few select variables.

Confidence in Findings: ☐ Medium

Explanation:

I'm not fully confident in my findings because, while the bar charts did show some variation, I didn't look at the other variables and I'm worried I got lucky. I'm also worried that targeting my research questions at consumers rather than producers was too risky an idea.

Cleanup

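# Close the database connection if using DuckDB.
# Only disconnect when `con` exists and is still an open, valid connection.
if (exists("con") && DBI::dbIsValid(con)) {
  # shutdown = TRUE also shuts down the embedded DuckDB instance
  dbDisconnect(con, shutdown = TRUE)
  glue("✅ Database connection closed successfully")
}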

Grading Criteria

Part 4: Missing Data Analysis (20%)
- Systematic quantification and pattern identification
- Evidence-based classification (MCAR/MAR/MNAR)
- Thoughtful handling strategies with justifications
- Impact assessment

Part 5: Variable Selection (15%)
- Comprehensive variable inventory
- Appropriate relationship analysis
- Justified final variable set (3-7 variables)
- Clear documentation of exclusions

Part 6: Exploratory Visualizations (30%)
- 5+ publication-quality visualizations
- Variety in chart types (3+ different types)
- Clear exploratory questions
- Insightful interpretations
- Stakeholder-focused implications
- Coherent narrative synthesis

Part 7: Stakeholder Communication (25%)
- Clear, concise executive summary (BLUF)
- Compelling hero visualization
- Honest limitations discussion
- Actionable recommendations
- Well-structured presentation plan
- Testable hypotheses developed

Professional Communication (10%)
- Code organization and documentation
- Clear writing throughout
- Logical flow of analysis
- Appropriate use of visualizations
- Professional presentation

Tips for Success

Start Early: This is substantial work - don’t wait until the last minute!
Be Honest: Acknowledge limitations and uncertainties in your data
Think Like a Stakeholder: Every finding should answer “so what?”
Quality Over Quantity: Better to have 5 excellent visualizations than 10 mediocre ones
Tell a Story: Connect your findings into a coherent narrative
Document Everything: Explain your reasoning for all major decisions
Ask for Help: Use office hours if you’re stuck on any section
Iterate: Review and refine your work before submitting
Proofread: Check for typos and ensure all code runs
Be Specific: Avoid vague statements - provide evidence and examples

Remember: This is exploratory data analysis - you’re building understanding and generating insights, not proving predetermined hypotheses. Let your curiosity guide you while maintaining systematic rigor!

NSF Acknowledgement: This material is based upon work supported by the National Science Foundation under Grant #DGE-2222148.
