# Load required libraries
library(tidyverse)
library(glue)
library(arrow)
Project Part 1: Quarto document
Project Assignment 1: First Contact with Your Dataset Using Arrow
Assignment Overview
This week you’ll apply the READY + SCAN frameworks to your own dataset using Arrow for efficient big data exploration. You’ll become a “data detective” investigating your data set systematically.
Learning Objectives
By completing this assignment, you will:
- Apply the READY framework to plan your data investigation
- Use the SCAN framework to systematically explore your dataset
- Practice using Arrow for memory-efficient data loading
- Document your initial findings and develop investigation questions
Part 1: Data Setup and Loading
Step 1: Extract and Load Your Data
Use the appropriate code pattern below based on your data format:
LOAD LIBRARIES
For ZIP files containing CSV(s):
# Set up the path to your data file
musical_data <- "~/Downloads/Fall 2025 Semester/Data/tracks_features.csv"
outdir <- file.path(dirname(musical_data), "extracted_data")
dir.create(outdir, showWarnings = FALSE)
# This file is already a CSV, so unzip() fails on it ("error 1 in extracting from zip file");
# only extract when the path actually points to a ZIP archive
if (tolower(tools::file_ext(musical_data)) == "zip") {
  unzip(musical_data, exdir = outdir, overwrite = TRUE)
}
# Get list of CSV files
csv_files <- list.files(outdir, pattern = "\\.csv$", full.names = TRUE)
names(csv_files) <- tools::file_path_sans_ext(basename(csv_files))
# Open with Arrow - specify the main file you want to work with
my_music_dataset <- open_dataset("~/Downloads/Fall 2025 Semester/Data/tracks_features.csv", format = "csv")
# Check memory usage
glue("Memory used by Arrow object: {format(object.size(my_music_dataset), units = 'KB')}")
Memory used by Arrow object: 0.5 Kb
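The 0.5 Kb figure describes only the small R object that refers to the dataset, not the rows themselves: `object.size()` does not measure data held outside R's own memory. The sketch below illustrates this with a made-up in-memory table. `toy_df` and `toy_arrow` are illustrative names and values, not part of the assignment data, and an Arrow Table keeps its data in Arrow-managed memory rather than on disk as `open_dataset()` does.

```r
library(arrow)

# Toy data standing in for the real file (hypothetical values)
toy_df <- data.frame(
  year   = rep(1999:2020, length.out = 100000),
  energy = runif(100000)
)

# Same data held as an Arrow Table; R only keeps a small handle to it
toy_arrow <- arrow_table(toy_df)

size_df    <- object.size(toy_df)     # full data frame stored in R memory
size_arrow <- object.size(toy_arrow)  # R-side handle only

format(size_df, units = "KB")
format(size_arrow, units = "KB")
```

So a tiny `object.size()` result confirms that R is not holding the data, but it says nothing about how large the data is once collected.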
# Take a glimpse of the data
glimpse(my_music_dataset)
FileSystemDataset with 1 csv file
1,204,025 rows x 24 columns
$ id <string> "7lmeHLHBe4nmXzuXc0HDjk", "1wsRitfRRtWyEapl0q22o8", …
$ name <string> "Testify", "Guerrilla Radio", "Calm Like a Bomb", "M…
$ album <string> "The Battle Of Los Angeles", "The Battle Of Los Ange…
$ album_id <string> "2eia0myWFgoHuttJytCxgX", "2eia0myWFgoHuttJytCxgX", …
$ artists <string> "['Rage Against The Machine']", "['Rage Against The …
$ artist_ids <string> "['2d0hyoQ5ynDBnkvAbJKORj']", "['2d0hyoQ5ynDBnkvAbJK…
$ track_number <int64> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5…
$ disc_number <int64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ explicit <string> "False", "True", "False", "True", "False", "False", …
$ danceability <double> 0.470, 0.599, 0.315, 0.440, 0.426, 0.298, 0.417, 0.2…
$ energy <double> 0.978, 0.957, 0.970, 0.967, 0.929, 0.848, 0.976, 0.8…
$ key <int64> 7, 11, 7, 11, 2, 2, 9, 11, 7, 9, 7, 6, 4, 7, 1, 7, 4…
$ loudness <double> -5.399, -5.764, -5.424, -5.830, -6.729, -5.947, -6.0…
$ mode <int64> 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1…
$ speechiness <double> 0.0727, 0.1880, 0.4830, 0.2370, 0.0701, 0.0727, 0.17…
$ acousticness <double> 0.026100, 0.012900, 0.023400, 0.163000, 0.001620, 0.…
$ instrumentalness <double> 1.09e-05, 7.06e-05, 2.03e-06, 3.64e-06, 1.05e-01, 1.…
$ liveness <double> 0.3560, 0.1550, 0.1220, 0.1210, 0.0789, 0.2010, 0.10…
$ valence <double> 0.503, 0.489, 0.370, 0.574, 0.539, 0.194, 0.483, 0.6…
$ tempo <double> 117.906, 103.680, 149.749, 96.752, 127.059, 148.282,…
$ duration_ms <int64> 210133, 206200, 298893, 213640, 205600, 280960, 2020…
$ time_signature <double> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
$ year <int64> 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999…
$ release_date <string> "1999-11-02", "1999-11-02", "1999-11-02", "1999-11-0…
Part 2: READY Framework Analysis
Work through each component of READY with your dataset:
# Look at the number of songs and the dataset's scope
dataset_overview <- my_music_dataset |>
  summarise(
    total_songs = n(),
    latest_year = max(year, na.rm = TRUE)
  ) |>
  collect()
dataset_overview
# A tibble: 1 × 2
total_songs latest_year
<int> <int>
1 1204025 2020
# Couldn't get the min function to work, so tried sorting by year instead.
# Note: desc(year) puts the latest years first; arrange(year) would show the earliest.
my_music_dataset |>
  select(year) |>
  arrange(desc(year)) |>
  head(10) |>
  collect()
# A tibble: 10 × 1
year
<int>
1 2020
2 2020
3 2020
4 2020
5 2020
6 2020
7 2020
8 2020
9 2020
10 2020
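Because the sort above is descending, the ten rows shown are the latest years, not the earliest. A minimal sketch of two ways to get the earliest year follows, using a small made-up table. `toy_music` and its values are hypothetical, and whether `min()` runs cleanly on the real dataset is an assumption to check.

```r
library(arrow)
library(dplyr)

# Toy stand-in for my_music_dataset (hypothetical values)
toy_music <- arrow_table(
  name = c("a", "b", "c", "d"),
  year = c(2020L, 1999L, 1985L, 2005L)
)

# Option 1: min() and max() inside summarise()
year_range <- toy_music |>
  summarise(
    earliest_year = min(year, na.rm = TRUE),
    latest_year   = max(year, na.rm = TRUE)
  ) |>
  collect()

# Option 2: sort ascending (no desc()) so the earliest rows come first
earliest_rows <- toy_music |>
  select(year) |>
  arrange(year) |>
  head(2) |>
  collect()

year_range
earliest_rows
```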
R - Representative Data
Document your thoughts as comments:
What is the scope of your data?
There are 1204025 songs in this data set.
Time period covered:
This data set contains songs from 1999 to 2020.
Population represented:
Artists of varying musical genres that uploaded their music to Spotify.
Potential biases or limitations:
International artists may be underrepresented, and nothing after 2020 is included. The dataset also likely skews toward more popular songs, because 1.2 million songs spread over 21 years is a small share of everything released.
Example questions to consider:
Do we have complete coverage of what we’re studying?
Are there any obvious gaps in the data?
What might be missing?
E - Executive Driven Questions
Who would care about insights from your data?
Primary stakeholders: Music artists, Spotify users, record labels.
Your stakeholder questions:
What factors increase a song’s danceability?
What factors increase a song’s energy?
In an album, are the characteristics of a song associated with the track number?
A - Analytical Framework
Your exploration strategy:
Phase 1: Data Quality Assessment: Find where values are missing and check the variable types.
Phase 2: Descriptive Analysis: Identify the key variables needed to answer the questions and look for relationships between numerical variables.
Phase 3: Pattern Investigation: I will most likely focus on the third stakeholder question. For this I plan to use histograms to look for patterns in numerical variables such as energy and liveness across track numbers within an album.
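One possible starting point for Phase 3 is a per-track-number summary computed in Arrow before any plotting. The sketch below uses a tiny made-up table. `toy_music` and its values are hypothetical, and the column names are assumed to match the real dataset.

```r
library(arrow)
library(dplyr)

# Toy stand-in for my_music_dataset: two albums with three tracks each
toy_music <- arrow_table(
  album_id     = c("A", "A", "A", "B", "B", "B"),
  track_number = c(1L, 2L, 3L, 1L, 2L, 3L),
  energy       = c(0.9, 0.5, 0.7, 0.8, 0.4, 0.6),
  liveness     = c(0.3, 0.1, 0.2, 0.5, 0.1, 0.2)
)

# Average each numeric variable by position on the album
by_track <- toy_music |>
  group_by(track_number) |>
  summarise(
    n_songs       = n(),
    mean_energy   = mean(energy, na.rm = TRUE),
    mean_liveness = mean(liveness, na.rm = TRUE)
  ) |>
  arrange(track_number) |>
  collect()

by_track
```

Only the small summary is collected into R, so the same pattern should stay light on memory with the full 1.2 million rows.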
D - Data Best Practices
Quality checks to perform: Check the variable types to ensure they can be worked with.
Missing data assessment: See which variables have the most missing data.
Data type verification: Are numeric columns actually numeric? Are dates properly formatted? Are categorical variables consistent?
Your quality concerns:
- The release_date variable is a string, and track_number is numeric even though I want to treat it as a categorical/string variable.
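These type concerns can be handled with a few conversions. The sketch below is one way to do it under stated assumptions: `toy_music` is a made-up table, `track_label` is a hypothetical new column name, and the date parsing assumes every release_date value is in "YYYY-MM-DD" form, which has not been verified for the full dataset.

```r
library(arrow)
library(dplyr)

# Toy stand-in for my_music_dataset (hypothetical values)
toy_music <- arrow_table(
  track_number = c(1L, 2L, 3L),
  explicit     = c("False", "True", "False"),
  release_date = c("1999-11-02", "1999-11-02", "2005-06-14")
)

converted <- toy_music |>
  mutate(
    track_label = as.character(track_number),  # numeric -> categorical label
    explicit    = explicit == "True"            # "True"/"False" strings -> logical
  ) |>
  collect() |>
  # Date parsing is done in plain R after collect(); assumes "YYYY-MM-DD" everywhere
  mutate(release_date = as.Date(release_date, format = "%Y-%m-%d"))

converted
```

On the real data it would be sensible to `select()` only the needed columns before `collect()`, since collecting brings every remaining row into R memory.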
Y - Your Insights
Your predictions:
- I predict that songs at the beginning and end of albums will have more energy and loudness than songs in the middle.
- I wonder if year would have an impact on this question.
- I expect tempo and valence to also have an effect.
Part 3: Data Quality Assessment Summary
S - Stakeholders (Revisited)
After examining the data structure, who else might be interested?
Composers
What specific questions would they have?
They could be interested in how instrumentalness relates to loudness, and how it correlates with acousticness and speechiness.
What concerns might they have about data quality?
They may be concerned about the measuring techniques and whether they were applied consistently across tracks. They may also be concerned about the accuracy of measurements for older songs, since sound technology has grown more precise over time.
C - Columns and Coverage
Create a summary table of your variables:
| Variable Name | Data Type | What do we notice from the output? Things to keep an eye on? |
|---|---|---|
| id | string | Unique to each song |
| name | string | Contains spaces |
| album | string | Also contains spaces |
| album_id | string | Unique to each album |
| artists | string | List stored as a string, with brackets |
| artist_ids | string | List stored as a string, with brackets |
| track_number | integer | An integer, but we might want to convert it to character later on |
| disc_number | integer | Unclear how many levels are present |
| explicit | string | Boolean stored as the strings "True" and "False" |
| danceability | double | Score from 0 to 1 |
| energy | double | Score from 0 to 1 |
| key | integer | Whole numbers? |
| loudness | double | Has a lot of negative numbers |
| mode | integer | Appears to be 0 or 1 |
| speechiness | double | Score from 0 to 1 |
| acousticness | double | Score from 0 to 1 |
| instrumentalness | double | Mostly very small numbers, shown in scientific notation |
| liveness | double | Score from 0 to 1 |
| valence | double | Score from 0 to 1 |
| tempo | double | Likely ranges from around 70 to 140 |
| duration_ms | integer | Measured in milliseconds |
| time_signature | double | Unclear how many levels there are; may need to be converted to character |
| year | integer | Starts at 1999, ends at 2020 |
| release_date | string | In year-month-day format |
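Several notes in this table ("score from 0 to 1", "likely ranges from around 70 to 140") are impressions from the first few rows. A sketch of how they could be checked with `min()` and `max()` follows. `toy_music` holds only the first three glimpse rows for three columns, so the results here describe the toy table, not the full dataset.

```r
library(arrow)
library(dplyr)

# First three rows of three columns, copied from the glimpse output above
toy_music <- arrow_table(
  danceability = c(0.470, 0.599, 0.315),
  loudness     = c(-5.399, -5.764, -5.424),
  tempo        = c(117.906, 103.680, 149.749)
)

# One row of minimums and maximums to compare against the expected ranges
ranges <- toy_music |>
  summarise(
    min_danceability = min(danceability, na.rm = TRUE),
    max_danceability = max(danceability, na.rm = TRUE),
    min_loudness     = min(loudness, na.rm = TRUE),
    max_loudness     = max(loudness, na.rm = TRUE),
    min_tempo        = min(tempo, na.rm = TRUE),
    max_tempo        = max(tempo, na.rm = TRUE)
  ) |>
  collect()

ranges
```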
A - Aggregates: Overall Picture
# Get comprehensive dataset statistics
dataset_statistics <- my_music_dataset |>
  summarise(
    total_songs = n(),
    latest_year = max(year, na.rm = TRUE),
    average_dance_score = mean(danceability, na.rm = TRUE),
    average_energy_score = mean(energy, na.rm = TRUE),
    average_valence_score = mean(valence, na.rm = TRUE),
    average_volume_score = mean(loudness, na.rm = TRUE)
  ) |>
  collect()
dataset_statistics
# A tibble: 1 × 6
total_songs latest_year average_dance_score average_energy_score
<int> <int> <dbl> <dbl>
1 1204025 2020 0.493 0.510
# ℹ 2 more variables: average_valence_score <dbl>, average_volume_score <dbl>
N - Notable Segments
#Checking for missing values
missing_check <- my_music_dataset |>
summarise(
total_rows = n(),
missing_id = sum(is.na(id)),
missing_name = sum(is.na(name)),
missing_album = sum(is.na(album)),
missing_album_id = sum(is.na(album_id)),
missing_artists = sum(is.na(artists)),
missing_artist_ids = sum(is.na(artist_ids)),
missing_track_number = sum(is.na(track_number)),
missing_disc_number = sum(is.na(disc_number)),
missing_explicit = sum(is.na(explicit)),
missing_danceability = sum(is.na(danceability)),
missing_energy = sum(is.na(energy)),
missing_key = sum(is.na(key)),
missing_loudness = sum(is.na(loudness)),
missing_mode = sum(is.na(mode)),
missing_speechiness = sum(is.na(speechiness)),
missing_acousticness = sum(is.na(acousticness)),
missing_instrumentalness = sum(is.na(instrumentalness)),
missing_liveness = sum(is.na(liveness)),
missing_valence = sum(is.na(valence)),
missing_tempo = sum(is.na(tempo)),
missing_duration_ms = sum(is.na(duration_ms)),
missing_time_signature = sum(is.na(time_signature)),
missing_year = sum(is.na(year)),
missing_release_date = sum(is.na(release_date)),
) |>
collect()
missing_check
# A tibble: 1 × 25
total_rows missing_id missing_name missing_album missing_album_id
<int> <int> <int> <int> <int>
1 1204025 0 0 0 0
# ℹ 20 more variables: missing_artists <int>, missing_artist_ids <int>,
# missing_track_number <int>, missing_disc_number <int>,
# missing_explicit <int>, missing_danceability <int>, missing_energy <int>,
# missing_key <int>, missing_loudness <int>, missing_mode <int>,
# missing_speechiness <int>, missing_acousticness <int>,
# missing_instrumentalness <int>, missing_liveness <int>,
# missing_valence <int>, missing_tempo <int>, missing_duration_ms <int>, …
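The printed tibble shows only the first four counts and truncates the other 20, so the wide layout hides most of the result. One way to see every count is to reshape the collected result to long form. This sketch uses a made-up stand-in (`toy_missing`, with hypothetical counts) for the collected `missing_check` tibble.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the collected missing_check tibble (hypothetical counts)
toy_missing <- tibble(
  total_rows    = 5L,
  missing_id    = 0L,
  missing_name  = 1L,
  missing_tempo = 0L
)

# One row per variable, with the largest missing counts first
missing_long <- toy_missing |>
  select(starts_with("missing_")) |>
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") |>
  arrange(desc(n_missing))

print(missing_long, n = Inf)

# Single TRUE/FALSE answer to "is anything missing?"
any_missing <- any(missing_long$n_missing > 0)
any_missing
```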
Complete this comprehensive assessment:
DATASET OVERVIEW:
- Records: 1204025 representing the number of songs.
- Time span: Years from 1938 to 2020
- Key metrics:
Average Dance Score = 0.493
Average Energy Score = 0.510
Average Valence Score = 0.428
DATA COMPLETENESS:
No missing values
DATA QUALITY STRENGTHS:
1. Several variables are on a scale from 0 to 1, making them easy to analyze.
2. The dataset seems more reliable because each song is measured and evaluated in several different ways.
3. There are no missing values, so the coverage is good.
DATA QUALITY CONCERNS:
1. There are two main issues. Both instrumentalness and loudness have values that are difficult to interpret, and release_date is stored as a string, which makes it harder to work with.
2. Although there are over a million songs, the dataset spans many years, so it likely includes only the most popular songs from each year.
3. Some variables aren’t easily interpreted.
RELIABILITY ASSESSMENT:
- Most reliable variables: name, artists, danceability, energy, mode, speechiness, acousticness, liveness, tempo, valence, and year.
- Variables needing caution: loudness, instrumentalness, time_signature
- Overall confidence level: High
JUSTIFICATION:
There is no missing data, and the variables that are harder to work with aren’t relevant to the research questions I want to answer.