Project Part 2: Quarto document

Project Assignment 1: First Contact with Your Dataset Using Arrow

Assignment Overview

This week you’ll apply the READY + SCAN frameworks to your own dataset using Arrow for efficient big data exploration. You’ll become a “data detective” investigating your dataset systematically.

Learning Objectives

By completing this assignment, you will:

  • Apply the READY framework to plan your data investigation

  • Use the SCAN framework to systematically explore your dataset

  • Practice using Arrow for memory-efficient data loading

  • Document your initial findings and develop investigation questions

Part 1: Data Setup and Loading

Step 1: Extract and Load Your Data

Use the appropriate code pattern below based on your data format:

LOAD LIBRARIES

library(arrow)

Attaching package: 'arrow'
The following object is masked from 'package:utils':

    timestamp
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
tracks_csv <- open_dataset("tracks_features.csv", format = "csv")

glimpse(tracks_csv)
FileSystemDataset with 1 csv file
1,204,025 rows x 24 columns
$ id               <string> "7lmeHLHBe4nmXzuXc0HDjk", "1wsRitfRRtWyEapl0q22o8", …
$ name             <string> "Testify", "Guerrilla Radio", "Calm Like a Bomb", "M…
$ album            <string> "The Battle Of Los Angeles", "The Battle Of Los Ange…
$ album_id         <string> "2eia0myWFgoHuttJytCxgX", "2eia0myWFgoHuttJytCxgX", …
$ artists          <string> "['Rage Against The Machine']", "['Rage Against The …
$ artist_ids       <string> "['2d0hyoQ5ynDBnkvAbJKORj']", "['2d0hyoQ5ynDBnkvAbJK…
$ track_number      <int64> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5…
$ disc_number       <int64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ explicit         <string> "False", "True", "False", "True", "False", "False", …
$ danceability     <double> 0.470, 0.599, 0.315, 0.440, 0.426, 0.298, 0.417, 0.2…
$ energy           <double> 0.978, 0.957, 0.970, 0.967, 0.929, 0.848, 0.976, 0.8…
$ key               <int64> 7, 11, 7, 11, 2, 2, 9, 11, 7, 9, 7, 6, 4, 7, 1, 7, 4…
$ loudness         <double> -5.399, -5.764, -5.424, -5.830, -6.729, -5.947, -6.0…
$ mode              <int64> 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1…
$ speechiness      <double> 0.0727, 0.1880, 0.4830, 0.2370, 0.0701, 0.0727, 0.17…
$ acousticness     <double> 0.026100, 0.012900, 0.023400, 0.163000, 0.001620, 0.…
$ instrumentalness <double> 1.09e-05, 7.06e-05, 2.03e-06, 3.64e-06, 1.05e-01, 1.…
$ liveness         <double> 0.3560, 0.1550, 0.1220, 0.1210, 0.0789, 0.2010, 0.10…
$ valence          <double> 0.503, 0.489, 0.370, 0.574, 0.539, 0.194, 0.483, 0.6…
$ tempo            <double> 117.906, 103.680, 149.749, 96.752, 127.059, 148.282,…
$ duration_ms       <int64> 210133, 206200, 298893, 213640, 205600, 280960, 2020…
$ time_signature   <double> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
$ year              <int64> 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999…
$ release_date     <string> "1999-11-02", "1999-11-02", "1999-11-02", "1999-11-0…

If your data arrived as a ZIP file, extract the CSV first and then open it directly with Arrow:

# Open your CSV directly with Arrow
library(arrow)
library(dplyr)

my_dataset <- open_dataset("~/Downloads/tracks_features.csv", format = "csv")

# Check memory usage
glue::glue("Memory used by Arrow object: {format(object.size(my_dataset), units = 'KB')}")
Memory used by Arrow object: 0.5 Kb
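If you would rather do the extraction from R as well, here is a minimal sketch. The archive path and the assumption that it contains a single CSV are mine, not part of the assignment template; adjust them to your own download.

# Minimal sketch for ZIP archives (paths are assumptions -- adjust to your download)
zip_path    <- "~/Downloads/tracks_features.zip"   # hypothetical archive name
extract_dir <- tempdir()

# Extract the CSV(s) from the archive, then open lazily with Arrow
unzip(path.expand(zip_path), exdir = extract_dir)
csv_path <- list.files(extract_dir, pattern = "\\.csv$", full.names = TRUE)[1]

my_dataset <- open_dataset(csv_path, format = "csv")

# The Arrow object itself stays tiny; rows are only read when you collect()
format(object.size(my_dataset), units = "KB")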

Part 2: READY Framework Analysis

Work through each component of READY with your dataset:

R - Representative Data

Document your thoughts as comments:

What is the scope of your data?

My data contains audio features for just over 1.2 million songs.

Time period covered:

Based on the year column, the data covers releases up through 2020; a minimum year of 0 in the summary statistics indicates a few invalid release years.

Geographic coverage:

The geographic coverage is global.

Population represented:

The music tracks included in Spotify’s library.

Potential biases or limitations:

Biased toward artists who distribute on Spotify; some genres may not be properly represented.

Example questions to consider (a quick coverage check sketch follows this list):

  • Do we have complete coverage of what we’re studying?

  • Are there any obvious gaps in the data?

  • What might be missing?
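One way to probe these gap questions is to count tracks per release year directly on the Arrow dataset, without collecting the full table. A minimal sketch, assuming the year column shown in the glimpse above:

# Count tracks per release year; only the small result table is collected
tracks_per_year <- tracks_csv %>%
  group_by(year) %>%
  summarise(n_tracks = n()) %>%
  arrange(year) %>%
  collect()

# Years with unusually few tracks (or year == 0) point to coverage gaps
head(tracks_per_year)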

E - Executive Driven Questions

Who would care about insights from your data?

Artists, producers, and record labels would all care about insights from my data.

Primary stakeholders, the key business/research questions they might ask, and the decisions this data could inform:

What musical features are most correlated with popular music? How have song characteristics changed over time across genres? Are certain features becoming less common?

Examples:

  • If this is sales data: “How can we optimize our sales strategy?”

  • If this is health data: “What patterns affect patient outcomes?”

  • If this is social media data: “How can we improve engagement?”

Your stakeholder questions: 1. 2. 3.

A - Analytical Framework

Your exploration strategy:

Phase 1: Data Quality Assessment

  • Check for missing values

  • Identify data types and consistency

  • Look for outliers or anomalies

Check for missing or extreme values in the dataset to make sure all audio features are usable.

Phase 2: Descriptive Analysis

  • What are the key variables?

  • What’s the distribution of important metrics?

  • What time patterns exist?

Summarize and visualize the data using histograms and boxplots to understand the distributions of each musical feature.

Phase 3: Pattern Investigation

  • What relationships might exist between variables?

  • Are there seasonal or temporal patterns?

  • What groupings or segments emerge?

Explore relationships between variables, such as how tempo relates to energy or danceability, and look for any clusters or patterns among tracks.
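As a starting point for Phase 3, the pairwise correlations among a few audio features can be computed on just the columns of interest. A minimal sketch; the choice of features here is my own, not part of the assignment template:

# Pull only the features needed for the correlation check, then compute cor()
feature_cors <- tracks_csv %>%
  select(tempo, energy, danceability, valence) %>%
  collect() %>%
  cor(use = "complete.obs")

round(feature_cors, 2)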

Your specific analytical approach: 1. 2. 3.

D - Data Best Practices

Quality checks to perform:

Verify each audio feature column is stored as numeric data. Check for duplicate tracks or missing song IDs. Confirm all variables fall within expected ranges.
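A hedged sketch of those checks, run lazily on the Arrow dataset where possible. The column names follow the glimpse above, and the 0–1 range check applies only to the bounded audio features:

# Duplicate track ids: compare total rows to distinct ids
n_rows <- tracks_csv %>% summarise(n = n()) %>% collect()
n_ids  <- tracks_csv %>% distinct(id) %>% summarise(n = n()) %>% collect()

# Values outside the expected [0, 1] range for a few bounded features
out_of_range <- tracks_csv %>%
  filter(danceability < 0 | danceability > 1 |
         energy < 0 | energy > 1 |
         valence < 0 | valence > 1) %>%
  summarise(n_bad = n()) %>%
  collect()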

Missing data assessment:

Look for any missing values in musical features. Assess whether missing data appear random or patterned across certain tracks.

Data type verification: Are numeric columns actually numeric? Are dates properly formatted? Are categorical variables consistent?

All musical characteristics should be numeric, and track names should be text. Confirm the release date and year columns use a consistent format. Check that categorical columns contain only valid values.
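Arrow exposes the types it inferred from the CSV, so type verification does not require collecting any rows. A minimal sketch; the recoding of explicit (which the glimpse shows was read as the text values "True"/"False") is my own suggested follow-up:

# Inspect the schema Arrow inferred from the CSV
tracks_csv$schema

# Example follow-up: recode explicit from text to logical in a lazy query
tracks_csv %>%
  mutate(explicit_lgl = explicit == "True") %>%
  select(id, explicit, explicit_lgl) %>%
  head(5) %>%
  collect()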

Your quality concerns: 1. 2. 3.

Y - Your Insights

Initial hypotheses about what you might find:

I think songs with higher energy, danceability, and tempo will often appear together, showing a pattern in upbeat or popular tracks.

Based on your domain knowledge, what patterns do you expect? What would surprise you? What would be most valuable to discover?

Tracks with higher valence will also have higher energy and tempo. More recent songs may trend toward louder, more energetic characteristics than older ones. Songs in major keys may have higher average valence scores than songs in minor keys. I would be surprised if slower songs had higher valence.

Your predictions: 1. 2. 3.

Part 3: SCAN Framework Exploration and Data Quality Assessment

S - Stakeholders (Revisited)

glimpse(tracks_csv)
FileSystemDataset with 1 csv file
1,204,025 rows x 24 columns
$ id               <string> "7lmeHLHBe4nmXzuXc0HDjk", "1wsRitfRRtWyEapl0q22o8", …
$ name             <string> "Testify", "Guerrilla Radio", "Calm Like a Bomb", "M…
$ album            <string> "The Battle Of Los Angeles", "The Battle Of Los Ange…
$ album_id         <string> "2eia0myWFgoHuttJytCxgX", "2eia0myWFgoHuttJytCxgX", …
$ artists          <string> "['Rage Against The Machine']", "['Rage Against The …
$ artist_ids       <string> "['2d0hyoQ5ynDBnkvAbJKORj']", "['2d0hyoQ5ynDBnkvAbJK…
$ track_number      <int64> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5…
$ disc_number       <int64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ explicit         <string> "False", "True", "False", "True", "False", "False", …
$ danceability     <double> 0.470, 0.599, 0.315, 0.440, 0.426, 0.298, 0.417, 0.2…
$ energy           <double> 0.978, 0.957, 0.970, 0.967, 0.929, 0.848, 0.976, 0.8…
$ key               <int64> 7, 11, 7, 11, 2, 2, 9, 11, 7, 9, 7, 6, 4, 7, 1, 7, 4…
$ loudness         <double> -5.399, -5.764, -5.424, -5.830, -6.729, -5.947, -6.0…
$ mode              <int64> 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1…
$ speechiness      <double> 0.0727, 0.1880, 0.4830, 0.2370, 0.0701, 0.0727, 0.17…
$ acousticness     <double> 0.026100, 0.012900, 0.023400, 0.163000, 0.001620, 0.…
$ instrumentalness <double> 1.09e-05, 7.06e-05, 2.03e-06, 3.64e-06, 1.05e-01, 1.…
$ liveness         <double> 0.3560, 0.1550, 0.1220, 0.1210, 0.0789, 0.2010, 0.10…
$ valence          <double> 0.503, 0.489, 0.370, 0.574, 0.539, 0.194, 0.483, 0.6…
$ tempo            <double> 117.906, 103.680, 149.749, 96.752, 127.059, 148.282,…
$ duration_ms       <int64> 210133, 206200, 298893, 213640, 205600, 280960, 2020…
$ time_signature   <double> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
$ year              <int64> 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999…
$ release_date     <string> "1999-11-02", "1999-11-02", "1999-11-02", "1999-11-0…

After examining the data structure, who else might be interested?

Music streaming platforms, marketing teams, and researchers studying musical trends.

What specific questions would they have?

What characteristics define popular or “feel good” songs across genres?

What concerns might they have about data quality?

I am concerned that older tracks may have less reliable audio feature measurements.

C - Columns and Coverage

Create a summary table of your variables:

# Collect the Arrow dataset into an in-memory data frame (~1.2 million rows)
tracks_df <- as.data.frame(tracks_csv)

# Count missing values in every column, then transpose into a one-column table
summary_table <- tracks_df %>%
  summarise_all(~ sum(is.na(.))) %>%
  t() %>%
  as.data.frame()

colnames(summary_table) <- c("Missing_Values")
summary_table
                 Missing_Values
id                            0
name                          0
album                         0
album_id                      0
artists                       0
artist_ids                    0
track_number                  0
disc_number                   0
explicit                      0
danceability                  0
energy                        0
key                           0
loudness                      0
mode                          0
speechiness                   0
acousticness                  0
instrumentalness              0
liveness                      0
valence                       0
tempo                         0
duration_ms                   0
time_signature                0
year                          0
release_date                  0
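The table above gives raw counts, while the deliverable also asks for percentages. Since tracks_df has already been collected, a quick base-R sketch converts the counts into percentages per column:

# Percentage of missing values per column (uses the already-collected tracks_df)
missing_pct <- round(colMeans(is.na(tracks_df)) * 100, 2)
sort(missing_pct, decreasing = TRUE)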

A - Aggregates: Overall Picture

# Get comprehensive dataset statistics
summary(tracks_df)
      id                name              album             album_id        
 Length:1204025     Length:1204025     Length:1204025     Length:1204025    
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
   artists           artist_ids         track_number     disc_number    
 Length:1204025     Length:1204025     Min.   : 1.000   Min.   : 1.000  
 Class :character   Class :character   1st Qu.: 3.000   1st Qu.: 1.000  
 Mode  :character   Mode  :character   Median : 7.000   Median : 1.000  
                                       Mean   : 7.656   Mean   : 1.056  
                                       3rd Qu.:10.000   3rd Qu.: 1.000  
                                       Max.   :50.000   Max.   :13.000  
   explicit          danceability        energy            key        
 Length:1204025     Min.   :0.0000   Min.   :0.0000   Min.   : 0.000  
 Class :character   1st Qu.:0.3560   1st Qu.:0.2520   1st Qu.: 2.000  
 Mode  :character   Median :0.5010   Median :0.5240   Median : 5.000  
                    Mean   :0.4931   Mean   :0.5095   Mean   : 5.194  
                    3rd Qu.:0.6330   3rd Qu.:0.7660   3rd Qu.: 8.000  
                    Max.   :1.0000   Max.   :1.0000   Max.   :11.000  
    loudness            mode         speechiness       acousticness   
 Min.   :-60.000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
 1st Qu.:-15.254   1st Qu.:0.0000   1st Qu.:0.03510   1st Qu.:0.0376  
 Median : -9.791   Median :1.0000   Median :0.04460   Median :0.3890  
 Mean   :-11.809   Mean   :0.6715   Mean   :0.08438   Mean   :0.4468  
 3rd Qu.: -6.717   3rd Qu.:1.0000   3rd Qu.:0.07230   3rd Qu.:0.8610  
 Max.   :  7.234   Max.   :1.0000   Max.   :0.96900   Max.   :0.9960  
 instrumentalness       liveness         valence          tempo       
 Min.   :0.0000000   Min.   :0.0000   Min.   :0.000   Min.   :  0.00  
 1st Qu.:0.0000076   1st Qu.:0.0968   1st Qu.:0.191   1st Qu.: 94.05  
 Median :0.0080800   Median :0.1250   Median :0.403   Median :116.73  
 Mean   :0.2828605   Mean   :0.2016   Mean   :0.428   Mean   :117.63  
 3rd Qu.:0.7190000   3rd Qu.:0.2450   3rd Qu.:0.644   3rd Qu.:137.05  
 Max.   :1.0000000   Max.   :1.0000   Max.   :1.000   Max.   :248.93  
  duration_ms      time_signature       year      release_date      
 Min.   :   1000   Min.   :0.000   Min.   :   0   Length:1204025    
 1st Qu.: 174090   1st Qu.:4.000   1st Qu.:2002   Class :character  
 Median : 224339   Median :4.000   Median :2009   Mode  :character  
 Mean   : 248840   Mean   :3.832   Mean   :2007                     
 3rd Qu.: 285840   3rd Qu.:4.000   3rd Qu.:2015                     
 Max.   :6061090   Max.   :5.000   Max.   :2020                     
# Column-wise mean, min, and max for every numeric feature
tracks_df %>%
  select_if(is.numeric) %>%
  summarise_all(list(mean = mean, min = min, max = max), na.rm = TRUE)
# A tibble: 1 × 48
  track_number_mean disc_number_mean danceability_mean energy_mean key_mean
              <dbl>            <dbl>             <dbl>       <dbl>    <dbl>
1              7.66             1.06             0.493       0.510     5.19
# ℹ 43 more variables: loudness_mean <dbl>, mode_mean <dbl>,
#   speechiness_mean <dbl>, acousticness_mean <dbl>,
#   instrumentalness_mean <dbl>, liveness_mean <dbl>, valence_mean <dbl>,
#   tempo_mean <dbl>, duration_ms_mean <dbl>, time_signature_mean <dbl>,
#   year_mean <dbl>, track_number_min <int>, disc_number_min <int>,
#   danceability_min <dbl>, energy_min <dbl>, key_min <int>,
#   loudness_min <dbl>, mode_min <int>, speechiness_min <dbl>, …
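The same kind of aggregates can also be computed lazily on the Arrow dataset, so the 1.2 million rows never need to sit in memory as a data frame; only the one-row result is collected. A minimal sketch for a few features, relying on the aggregation functions Arrow supports inside summarise():

# Aggregate inside Arrow; only the small result table crosses into R
tracks_csv %>%
  summarise(
    mean_danceability = mean(danceability, na.rm = TRUE),
    mean_energy       = mean(energy, na.rm = TRUE),
    min_tempo         = min(tempo, na.rm = TRUE),
    max_tempo         = max(tempo, na.rm = TRUE)
  ) %>%
  collect()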

N - Notable Segments

# Compare average audio features for explicit vs. non-explicit tracks
tracks_df %>%
  group_by(explicit) %>%
  summarise(
    avg_danceability = mean(danceability, na.rm = TRUE),
    avg_energy = mean(energy, na.rm = TRUE),
    avg_valence = mean(valence, na.rm = TRUE)
  )
# A tibble: 2 × 4
  explicit avg_danceability avg_energy avg_valence
  <chr>               <dbl>      <dbl>       <dbl>
1 False               0.483      0.497       0.424
2 True                0.629      0.686       0.485
# Average tempo by mode (in Spotify's encoding, 0 = minor and 1 = major)
tracks_df %>%
  group_by(mode) %>%
  summarise(avg_tempo = mean(tempo, na.rm = TRUE))
# A tibble: 2 × 2
   mode avg_tempo
  <int>     <dbl>
1     0      117.
2     1      118.
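To follow up on the prediction that newer tracks trend louder and more energetic, the same grouping pattern can be applied by era. A minimal sketch; the decade binning is my own addition, and the filter drops the invalid year == 0 rows visible in the summary:

# Average loudness and energy by release decade (drops invalid year 0 rows)
tracks_df %>%
  filter(year > 0) %>%
  mutate(decade = (year %/% 10) * 10) %>%
  group_by(decade) %>%
  summarise(
    n_tracks     = n(),
    avg_loudness = mean(loudness, na.rm = TRUE),
    avg_energy   = mean(energy, na.rm = TRUE)
  ) %>%
  arrange(decade)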

Complete this comprehensive assessment:

DATASET OVERVIEW:

  • Records: 1,204,025 rows, one per song, with its audio characteristics

  • Time span: release years up through 2020 per the summary (a minimum year of 0 flags some invalid entries)

  • Key metrics: tempo, valence, energy, and release date

DATA COMPLETENESS:

  • Core fields: 100% complete

  • Name: 100% complete

  • Danceability: 100% complete

DATA QUALITY STRENGTHS:

1. No missing values in any column

2. Most variables are stored with appropriate types (explicit is read as text and is easy to convert to logical)

3. Standardized measurement scales make comparisons reliable

DATA QUALITY CONCERNS:

1. Some songs may lack genre or popularity context

2. No streaming or listener data, so analysis focuses on musical qualities only

3. Features like mode and key are categorical but not directly interpretable without domain context

RELIABILITY ASSESSMENT:

  • Most reliable variables: danceability, energy, valence, tempo, and loudness

  • Variables needing caution: mode, key, and explicit

  • Overall confidence level: High

JUSTIFICATION: The dataset is highly reliable because it comes from Spotify’s official API, which uses consistent digital signal processing to quantify musical attributes. All key variables are complete and numeric, making the data ready for quantitative analysis.

Deliverables Checklist

Ensure your submission includes:

  • Complete READY framework analysis with thoughtful responses

  • Systematic SCAN framework exploration with specific findings

  • Successful data loading with Arrow

  • Professional data description and summary statistics

  • Comprehensive missing value analysis with percentages

  • Variable summary table documenting key fields

  • Memory efficiency demonstration

  • 3-5 well-defined, specific exploratory research questions

  • Data quality assessment with honest evaluation

  • Professional summary with clear next steps

Grading Criteria

  • READY Framework (20%): Thoughtful strategic planning showing understanding of stakeholders and analytical approach

  • Data Loading (15%): Successful Arrow implementation with proper documentation

  • SCAN Framework (25%): Systematic exploration with specific, meaningful findings

  • Data Quality Assessment (20%): Comprehensive evaluation with specific evidence

  • Research Questions (15%): Clear, answerable questions tied to stakeholder needs and data capabilities

  • Professional Communication (5%): Clear, honest, well-organized presentation throughout

Tips for Success

  • Be specific in your observations - avoid vague statements

  • Think like a stakeholder - what would decision-makers actually want to know?

  • Document your reasoning for all assessment decisions

  • Be honest about limitations - this builds credibility

  • Focus on actionable insights - what can actually be learned from this data?

  • Ask for help if your data format doesn’t match the provided templates

Remember: This is exploratory data analysis - you’re learning about your data, not proving predetermined hypotheses. Let your curiosity guide your investigation while maintaining systematic rigor.