Project Part 1: Quarto document

Author

J Sanchez-Collins

Your Project Assignment 1: First Contact with Your Dataset Using Arrow

Assignment Overview

This week you’ll apply the READY + SCAN frameworks to your own dataset using Arrow for efficient big data exploration. You’ll become a “data detective” investigating your dataset systematically.

Learning Objectives

By completing this assignment, you will:

- Apply the READY framework to plan your data investigation

- Use the SCAN framework to systematically explore your dataset

- Practice using Arrow for memory-efficient data loading

- Document your initial findings and develop investigation questions

Part 1: Data Setup and Loading

Step 1: Extract and Load Your Data

Use the appropriate code pattern below based on your data format:

LOAD LIBRARIES

# Load required libraries
library(tidyverse)
library(glue)
library(arrow)

For ZIP files containing CSV(s):

# Path to the data file (this one is already a plain CSV, not a ZIP archive)
musical_data <- "~/Downloads/Fall 2025 Semester/Data/tracks_features.csv"
outdir <- file.path(dirname(musical_data), "extracted_data")
dir.create(outdir, showWarnings = FALSE)

# Only unzip real ZIP archives; calling unzip() on a CSV fails with
# "error 1 in extracting from zip file"
if (tolower(tools::file_ext(musical_data)) == "zip") {
  unzip(musical_data, exdir = outdir, overwrite = TRUE)
}

# Get list of CSV files
csv_files <- list.files(outdir, pattern = "\\.csv$", full.names = TRUE)
names(csv_files) <- tools::file_path_sans_ext(basename(csv_files))

# Open with Arrow - specify the main file you want to work with
my_music_dataset <- open_dataset("~/Downloads/Fall 2025 Semester/Data/tracks_features.csv", format = "csv")  

# Check memory usage
glue("Memory used by Arrow object: {format(object.size(my_music_dataset), units = 'KB')}")

Memory used by Arrow object: 0.5 Kb

#Take a glimpse of the data
glimpse(my_music_dataset)

FileSystemDataset with 1 csv file
1,204,025 rows x 24 columns
$ id               <string> "7lmeHLHBe4nmXzuXc0HDjk", "1wsRitfRRtWyEapl0q22o8", …
$ name             <string> "Testify", "Guerrilla Radio", "Calm Like a Bomb", "M…
$ album            <string> "The Battle Of Los Angeles", "The Battle Of Los Ange…
$ album_id         <string> "2eia0myWFgoHuttJytCxgX", "2eia0myWFgoHuttJytCxgX", …
$ artists          <string> "['Rage Against The Machine']", "['Rage Against The …
$ artist_ids       <string> "['2d0hyoQ5ynDBnkvAbJKORj']", "['2d0hyoQ5ynDBnkvAbJK…
$ track_number      <int64> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5…
$ disc_number       <int64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ explicit         <string> "False", "True", "False", "True", "False", "False", …
$ danceability     <double> 0.470, 0.599, 0.315, 0.440, 0.426, 0.298, 0.417, 0.2…
$ energy           <double> 0.978, 0.957, 0.970, 0.967, 0.929, 0.848, 0.976, 0.8…
$ key               <int64> 7, 11, 7, 11, 2, 2, 9, 11, 7, 9, 7, 6, 4, 7, 1, 7, 4…
$ loudness         <double> -5.399, -5.764, -5.424, -5.830, -6.729, -5.947, -6.0…
$ mode              <int64> 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1…
$ speechiness      <double> 0.0727, 0.1880, 0.4830, 0.2370, 0.0701, 0.0727, 0.17…
$ acousticness     <double> 0.026100, 0.012900, 0.023400, 0.163000, 0.001620, 0.…
$ instrumentalness <double> 1.09e-05, 7.06e-05, 2.03e-06, 3.64e-06, 1.05e-01, 1.…
$ liveness         <double> 0.3560, 0.1550, 0.1220, 0.1210, 0.0789, 0.2010, 0.10…
$ valence          <double> 0.503, 0.489, 0.370, 0.574, 0.539, 0.194, 0.483, 0.6…
$ tempo            <double> 117.906, 103.680, 149.749, 96.752, 127.059, 148.282,…
$ duration_ms       <int64> 210133, 206200, 298893, 213640, 205600, 280960, 2020…
$ time_signature   <double> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
$ year              <int64> 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999…
$ release_date     <string> "1999-11-02", "1999-11-02", "1999-11-02", "1999-11-0…

Part 2: READY Framework Analysis

Work through each component of READY with your dataset:

# Look at the number of songs and the dataset's scope
dataset_overview <- my_music_dataset |>
  summarise(
    total_songs = n(),
    latest_year = max(year, na.rm = TRUE)
  ) |>
  collect()
dataset_overview

# A tibble: 1 × 2
  total_songs latest_year
        <int>       <int>
1     1204025        2020

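# I couldn't get min() to work, so I sorted by year instead.
# Note: arrange(desc(year)) sorts newest first, so these rows confirm the
# latest year (2020); arrange(year) would be needed to show the earliest year.
my_music_dataset |>
  select(year) |>
  arrange(desc(year)) |>
  head(10) |>
  collect()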

# A tibble: 10 × 1
    year
   <int>
 1  2020
 2  2020
 3  2020
 4  2020
 5  2020
 6  2020
 7  2020
 8  2020
 9  2020
10  2020
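
The comment above mentions trouble with min(). As a minimal sketch, two ways of finding the earliest year might look like the following. It runs on a small made-up table (`toy_tracks` and its values are assumptions for illustration, not the real data), and whether `min()` works inside `summarise()` may depend on the installed Arrow version.

```r
library(arrow)
library(dplyr)

# Toy stand-in for the real dataset (values are made up for illustration)
toy_tracks <- arrow_table(data.frame(
  name = c("a", "b", "c", "d"),
  year = c(2005L, 1999L, 2020L, 2011L)
))

# Option 1: sort ascending so the oldest row comes first
earliest_rows <- toy_tracks |>
  select(year) |>
  arrange(year) |>
  head(1) |>
  collect()

# Option 2: compute both ends of the range in one summarise() call
year_range <- toy_tracks |>
  summarise(
    earliest_year = min(year, na.rm = TRUE),
    latest_year = max(year, na.rm = TRUE)
  ) |>
  collect()

earliest_rows
year_range
```

Swapping `toy_tracks` for `my_music_dataset` would apply the same pattern to the full file.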

R - Representative Data

Document your thoughts as comments:

What is the scope of your data?

There are 1,204,025 songs in this dataset.

Time period covered:

This data set contains songs from 1999 to 2020.

Population represented:

Artists of varying musical genres who uploaded their music to Spotify.

Potential biases or limitations:

International artists may be underrepresented, and no data beyond 2020 is available. The dataset also likely skews toward more popular songs, because 1.2 million songs spanning 21 years isn’t that many.

Example questions to consider:

Do we have complete coverage of what we’re studying?
Are there any obvious gaps in the data?
What might be missing?

E - Executive Driven Questions

Who would care about insights from your data?

Primary stakeholders: Music artists, Spotify users, record labels.

Your stakeholder questions:

What factors increase a song’s danceability?
What factors increase a song’s energy?
In an album, are the characteristics of a song associated with the track number?

A - Analytical Framework

Your exploration strategy:

Phase 1: Data Quality Assessment: See where there are missing values. Look at variable types.

Phase 2: Descriptive Analysis: Identify key variables needed to answer the question. Find relationships between numerical variables.

Phase 3: Pattern Investigation: I will most likely focus on the third stakeholder question. For this, I should use histograms of numerical variables such as energy and liveness, grouped by track number within an album, to look for patterns.
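
Before plotting, a grouped summary can give a first look at whether a variable changes with track position. The sketch below is only an illustration on a made-up table (`toy_tracks` and its values are assumptions); it averages energy per track number and keeps the computation in Arrow until `collect()`.

```r
library(arrow)
library(dplyr)

# Toy stand-in for the real dataset (values are made up for illustration)
toy_tracks <- arrow_table(data.frame(
  track_number = c(1L, 1L, 2L, 2L, 3L, 3L),
  energy = c(0.9, 0.7, 0.5, 0.3, 0.8, 0.6)
))

# Average energy for each track position, with the group size for context
energy_by_track <- toy_tracks |>
  group_by(track_number) |>
  summarise(
    n_songs = n(),
    mean_energy = mean(energy, na.rm = TRUE)
  ) |>
  arrange(track_number) |>
  collect()

energy_by_track
```

Including `n_songs` matters because high track numbers are likely to have far fewer songs, which makes their averages noisier.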

D - Data Best Practices

Quality checks to perform: Check the variable types to ensure they can be worked with.

Missing data assessment: See which variables have the most missing data.

Data type verification: Are numeric columns actually numeric? Are dates properly formatted? Are categorical variables consistent?

Your quality concerns:

The release_date variable is a string, and track_number is numeric, but I want to use track_number as a categorical variable.
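
One possible way to address both concerns after collecting a manageable subset is sketched below. The rows are made up, and the sketch assumes every release_date value follows the year-month-day format seen in the glimpse output; values in another format would become NA.

```r
library(dplyr)

# Toy stand-in for a collected subset (values are made up for illustration)
toy_tracks <- tibble(
  track_number = c(1L, 2L, 3L),
  release_date = c("1999-11-02", "2005-06-14", "2020-01-31")
)

converted <- toy_tracks |>
  mutate(
    # Parse the string into a Date; non-matching strings become NA
    release_date = as.Date(release_date, format = "%Y-%m-%d"),
    # Treat track position as a category rather than a quantity
    track_number = factor(track_number)
  )

str(converted)
```

Counting the NA values in the converted release_date column afterwards would show how many rows did not match the assumed format.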

Y - Your Insights

Your predictions:

I predict that songs at the beginning and end of albums will have more energy and loudness than songs in the middle.
I wonder if year would have an impact on this question.
I expect tempo and valence to also have an effect.

Part 3: Data Quality Assessment Summary

S - Stakeholders (Revisited)

After examining the data structure, who else might be interested?

Composers

What specific questions would they have?

They could be interested in how instrumentalness relates to loudness, and how it correlates with acousticness and speechiness.

What concerns might they have about data quality?

They may be concerned about the measurement techniques and whether they were applied consistently across tracks. They may also be concerned about the accuracy of older song measurements, since sound technology has grown more precise over time.

C - Columns and Coverage

Create a summary table of your variables:

Variable Name	Data Type	What do we notice from the output? Things to keep an eye on?
id	string	Unique to each song
name	string	Contains spaces
album	string	Also contains spaces
album_id	string	unique to each album
artists	string	has brackets
artist_ids	string	has brackets
track_number	integer	Is an integer, but we might want to change it to char later on
disc_number	integer	Unclear how many levels are present
explicit	string	Boolean
danceability	double	score from 0 to 1
energy	double	score from 0 to 1
key	integer	whole numbers?
loudness	double	Has a lot of negative numbers
mode	integer	Appears to be 0 or 1
speechiness	double	score from 0 to 1
acousticness	double	score from 0 to 1
instrumentalness	double	really small numbers
liveness	double	score from 0 to 1
valence	double	score from 0 to 1
tempo	double	Likely ranged around 70 to 140
duration_ms	integer	measured in milliseconds
time_signature	double	Unclear how many levels there are, may need to be converted to char
year	integer	starts at 1999, ends at 2020
release_date	string	is in year-month-day format
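
Two of the notes in the table, the Boolean stored as a string in explicit and the brackets in artists, could be handled along these lines. This is a rough base R sketch on made-up rows. It assumes the only values in explicit are "True" and "False", and that artist names contain no apostrophes of their own, since the pattern removes every single quote.

```r
# Toy stand-in for a collected subset (values are made up for illustration)
toy <- data.frame(
  explicit = c("False", "True", "False"),
  artists = c("['Rage Against The Machine']", "['A', 'B']", "['Solo Artist']"),
  stringsAsFactors = FALSE
)

# Turn the "True"/"False" strings into a real logical column
toy$explicit <- toy$explicit == "True"

# Remove the leading "[", the trailing "]", and all single quotes
toy$artists_clean <- gsub("^\\[|\\]$|'", "", toy$artists)

toy
```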

A - Aggregates: Overall Picture

# Get comprehensive dataset statistics
dataset_statistics <- my_music_dataset |> 
  summarise(
    total_songs = n(),
    latest_year = max(year, na.rm = TRUE),
    average_dance_score = mean(danceability, na.rm = TRUE),
    average_energy_score = mean(energy, na.rm = TRUE),
    average_valence_score = mean(valence, na.rm = TRUE),
    average_volume_score = mean(loudness, na.rm = TRUE)
  ) |> 
  collect()
dataset_statistics

# A tibble: 1 × 6
  total_songs latest_year average_dance_score average_energy_score
        <int>       <int>               <dbl>                <dbl>
1     1204025        2020               0.493                0.510
# ℹ 2 more variables: average_valence_score <dbl>, average_volume_score <dbl>

N - Notable Segments

#Checking for missing values
missing_check <- my_music_dataset |> 
  summarise(
    total_rows = n(),
    missing_id = sum(is.na(id)),
    missing_name = sum(is.na(name)),
    missing_album = sum(is.na(album)),
    missing_album_id = sum(is.na(album_id)),
    missing_artists = sum(is.na(artists)),
    missing_artist_ids = sum(is.na(artist_ids)),
    missing_track_number = sum(is.na(track_number)),
    missing_disc_number = sum(is.na(disc_number)),
    missing_explicit = sum(is.na(explicit)),
    missing_danceability = sum(is.na(danceability)),
    missing_energy = sum(is.na(energy)),
    missing_key = sum(is.na(key)),
    missing_loudness = sum(is.na(loudness)),
    missing_mode = sum(is.na(mode)),
    missing_speechiness = sum(is.na(speechiness)),
    missing_acousticness = sum(is.na(acousticness)),
    missing_instrumentalness = sum(is.na(instrumentalness)),
    missing_liveness = sum(is.na(liveness)),
    missing_valence = sum(is.na(valence)),
    missing_tempo = sum(is.na(tempo)),
    missing_duration_ms = sum(is.na(duration_ms)),
    missing_time_signature = sum(is.na(time_signature)),
    missing_year = sum(is.na(year)),
    missing_release_date = sum(is.na(release_date)),
  ) |> 
  collect()

missing_check

# A tibble: 1 × 25
  total_rows missing_id missing_name missing_album missing_album_id
       <int>      <int>        <int>         <int>            <int>
1    1204025          0            0             0                0
# ℹ 20 more variables: missing_artists <int>, missing_artist_ids <int>,
#   missing_track_number <int>, missing_disc_number <int>,
#   missing_explicit <int>, missing_danceability <int>, missing_energy <int>,
#   missing_key <int>, missing_loudness <int>, missing_mode <int>,
#   missing_speechiness <int>, missing_acousticness <int>,
#   missing_instrumentalness <int>, missing_liveness <int>,
#   missing_valence <int>, missing_tempo <int>, missing_duration_ms <int>, …
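
The printed tibble above shows only the first few of its 25 columns. One way to see every count at once is to reshape the one-row summary into a long table. The sketch below illustrates this on a small made-up tibble (`toy_tracks` is an assumption); it also uses `across()` to avoid writing one line per column. With an Arrow dataset, `across()` support may depend on the Arrow version, so the reshaping step could instead be applied to the collected `missing_check` result.

```r
library(dplyr)
library(tidyr)

# Toy stand-in with a few deliberately missing values
toy_tracks <- tibble(
  name = c("a", NA, "c"),
  energy = c(0.5, 0.7, NA),
  year = c(1999L, 2005L, 2020L)
)

missing_long <- toy_tracks |>
  # Count NA values in every column in one step
  summarise(across(everything(), ~ sum(is.na(.x)))) |>
  # One row per variable, so nothing is hidden by the print width
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing")

missing_long
```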

Complete this comprehensive assessment:

DATASET OVERVIEW:

- Records: 1204025 representing the number of songs.

- Time span: Years from 1938 to 2020

- Key metrics:

Average Dance Score = 0.493

Average Energy Score = 0.510

Average Valence Score = 0.428

DATA COMPLETENESS:

No missing values

DATA QUALITY STRENGTHS:

1. Several variables are on a scale from 0 to 1, making them easy to analyze.

2. This dataset seems more reliable because the songs were measured and evaluated in several different ways.

3. There are no missing values so the coverage is good.

DATA QUALITY CONCERNS:

1. There are two main issues. The instrumentalness and loudness variables have values that are difficult to interpret, and the release_date variable is a string, which makes it difficult to work with.

2. Despite there being over a million songs, this data set spans several years, so it likely only has the most popular songs of each year.

3. Variables that aren’t easily interpreted.

RELIABILITY ASSESSMENT:

- Most reliable variables: name, artists, danceability, energy, mode, speechiness, acousticness, liveness, tempo, valence, and year.

- Variables needing caution: loudness, instrumentalness, time_signature

- Overall confidence level: High

JUSTIFICATION:

There is no missing data, and the variables that are harder to work with aren’t relevant to the research questions I want to answer.