Project Part 2: Quarto Document

Your Project Assignment 1: First Contact with Your Dataset Using Arrow

Assignment Overview

This week you’ll apply the READY + SCAN frameworks to your own dataset using Arrow for efficient big data exploration. You’ll become a “data detective” investigating your dataset systematically.

Learning Objectives

By completing this assignment, you will:

  • Apply the READY framework to plan your data investigation

  • Use the SCAN framework to systematically explore your dataset

  • Practice using Arrow for memory-efficient data loading

  • Document your initial findings and develop investigation questions

Part 1: Data Setup and Loading

Step 1: Extract and Load Your Data

Use the appropriate code pattern below based on your data format:

LOAD LIBRARIES

# Required packages for this assignment
required_packages <- c("tidyverse", "arrow", "glue", "janitor")

# Install any packages that are missing
for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

# Load libraries, suppressing startup messages and lapply's return value
invisible(lapply(required_packages, function(pkg) {
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}))

For ZIP files containing CSV(s):

# Set up and extract your ZIP file
zip_path <- "C:/Classes/DSC 406/data/SpotifyData.zip"  # UPDATE THIS PATH
outdir <- file.path(dirname(zip_path), "tracks_features")
dir.create(outdir, showWarnings = FALSE)
unzip(zip_path, exdir = outdir, overwrite = TRUE)

# Get list of CSV files
csv_files <- list.files(outdir, pattern = "\\.csv$", full.names = TRUE)
names(csv_files) <- tools::file_path_sans_ext(basename(csv_files))

# Open with Arrow - specify the main file you want to work with
my_dataset <- open_dataset(csv_files[1], format = "csv")  # Adjust [1] as needed

# Check memory usage
glue("Memory used by Arrow object: {format(object.size(my_dataset), units = 'KB')}")
Memory used by Arrow object: 0.5 Kb
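
The Arrow object is tiny because open_dataset() only reads metadata; rows are materialized when a query is collected. As a minimal sketch of this lazy evaluation (using columns that appear later in this document's variable summary), filtering and column selection can be pushed down so only the matching slice ever enters memory:

# Push filtering and column selection down to Arrow; only the matching
# rows and columns are materialized by collect()
high_energy <- my_dataset %>%
  filter(energy > 0.9) %>%
  select(name, artists, energy, danceability) %>%
  collect()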

Part 2: READY Framework Analysis

Work through each component of READY with your dataset:

R - Representative Data

Document your thoughts as comments:

What is the scope of your data? Spotify tracks with audio features and metadata

Time period covered: Multiple years based on release_date and year columns

Geographic coverage: Global, though skewed toward regions with higher Spotify usage

Population represented: All tracks in the dataset; may include curated playlists, albums, or user interactions

Potential biases or limitations:

  • May overrepresent popular or trending tracks.

  • Regional differences in Spotify usage could skew results.

  • Missing metadata (e.g., genre or release year) can limit analysis depth.

  • If user-generated, it may reflect personal preferences rather than overall trends.

Example questions to consider:

  • Do we have complete coverage of all tracks, or only those from certain playlists?

  • Are audio features consistently recorded?

  • Are there temporal patterns in track releases?

  • Is the data periodically updated, or is it a one-time snapshot?

  • Are independent or less-streamed artists underrepresented?

E - Executive Driven Questions

Who would care about insights from your data?

Primary stakeholders:

Key business/research questions they might ask:

What decisions could this data inform?

Examples:

  • If this is sales data: “How can we optimize our sales strategy?”

  • If this is health data: “What patterns affect patient outcomes?”

  • If this is social media data: “How can we improve engagement?”

Your stakeholder questions:

1. Which audio features (like energy, danceability, or tempo) most influence a track’s popularity?

2. How do track characteristics differ across albums, artists, or years?

3. Which track features are most consistent in long-running playlists or collections?

A - Analytical Framework

Your exploration strategy:

Phase 1: Data Quality Assessment

  • Check for missing values
  • Identify data types and consistency
  • Look for outliers or anomalies

Phase 2: Descriptive Analysis

  • What are the key variables?
  • What’s the distribution of important metrics?
  • What time patterns exist?

Phase 3: Pattern Investigation

  • What relationships might exist between variables?
  • Are there seasonal or temporal patterns?
  • What groupings or segments emerge?

Your specific analytical approach:

  1. Perform exploratory data analysis (EDA) to summarize distributions and detect missing or inconsistent values.
  2. Use correlation and regression analysis to find relationships between song features and popularity (a first-pass sketch follows this list).
  3. Apply clustering to identify natural groupings of songs or genres with similar characteristics.
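
To make step 2 concrete before the full analysis, here is a first-pass sketch of pairwise correlations among the core audio features (the columns listed as key metrics later in this document). If the dataset lacks a popularity column, correlations among the features themselves are still a useful starting point, and only the selected columns are pulled into memory:

# First-pass sketch: pairwise correlations among core audio features.
# Only these six columns are collected into memory.
feature_cors <- my_dataset %>%
  select(danceability, energy, loudness, tempo, valence, duration_ms) %>%
  collect() %>%
  cor(use = "pairwise.complete.obs") %>%
  round(2)

feature_cors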

D - Data Best Practices

Quality checks to perform:

Missing data assessment: performed systematically in Part 3 below.

Data type verification: Are numeric columns actually numeric? Are dates properly formatted? Are categorical variables consistent?

Your quality concerns:

  1. Some tracks may have incomplete metadata (e.g., missing genre or release date).
  2. Popularity metrics might change over time, affecting consistency.
  3. Duplicate entries or remastered versions could distort feature averages and trend patterns (see the sketch after this list).
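
Concern 3 is directly checkable once the data are loaded: janitor (attached above) provides get_dupes(). A minimal sketch, assuming a name + artists combination identifies a track version:

# Minimal sketch: rows whose name/artists combination repeats, which may
# indicate duplicates, remasters, or re-releases
possible_dupes <- my_dataset %>%
  select(name, artists, duration_ms) %>%
  collect() %>%
  get_dupes(name, artists)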

Y - Your Insights

Initial hypotheses about what you might find:

Based on your domain knowledge, what patterns do you expect? What would surprise you? What would be most valuable to discover?

Your predictions:

  1. Songs with higher danceability and energy scores will tend to be more popular.
  2. Longer tracks may be instrumental or live recordings (a quick check is sketched after this list).
  3. Certain albums or artists will show consistent audio feature patterns.
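
Prediction 2 can be sanity-checked cheaply. The sketch below assumes the dataset includes instrumentalness and liveness columns (standard in Spotify audio-feature exports, but verify against your own schema):

# Hypothetical check: do long tracks (> 10 minutes) skew instrumental or live?
# Assumes `instrumentalness` and `liveness` columns exist in the dataset.
my_dataset %>%
  select(duration_ms, instrumentalness, liveness) %>%
  collect() %>%
  mutate(long_track = duration_ms > 10 * 60 * 1000) %>%
  group_by(long_track) %>%
  summarise(
    across(c(instrumentalness, liveness), ~ mean(.x, na.rm = TRUE)),
    n = n()
  )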

Part 3: Data Quality Assessment Summary

S - Stakeholders (Revisited)

# Convert the Arrow dataset to a regular data frame for analysis.
# Note: collect() pulls every row into RAM; filter or summarise in Arrow
# first when working with larger-than-memory data.
tracks_features_df <- my_dataset %>% collect()

# View dataset structure summary
glue("✅ The dataset contains {nrow(tracks_features_df)} rows and {ncol(tracks_features_df)} columns.")
✅ The dataset contains 1204025 rows and 24 columns.
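
For contrast with the 0.5 Kb Arrow handle from Part 1, it is worth recording how much RAM the collected data frame occupies:

# Unlike the Arrow handle, the collected tibble holds every row in RAM
glue("Memory used by collected data frame: {format(object.size(tracks_features_df), units = 'MB')}")
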
# Check for missing values across all columns
missing_summary <- tracks_features_df %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "column", values_to = "missing_count") %>%
  mutate(missing_percent = round(missing_count / nrow(tracks_features_df) * 100, 2)) %>%
  arrange(desc(missing_count))

# Glimpse the missing value summary
glimpse(missing_summary)
Rows: 24
Columns: 3
$ column          <chr> "id", "name", "album", "album_id", "artists", "artist_…
$ missing_count   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ missing_percent <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
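
The same missing-value count can often be pushed into Arrow so the full table never needs to be collected. This is a sketch only: it depends on your arrow version supporting across() inside summarise() on Arrow queries, and the collected version above is the fallback if it errors:

# Sketch: count NAs per column inside Arrow (support for across() in
# Arrow queries varies by version; fall back to the collected version)
missing_arrow <- my_dataset %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  collect()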

After examining the data structure, who else might be interested?

Interested parties: Spotify analysts, artists, music marketers, researchers

Questions they may have:

  • Which features define popular collections?

  • How do albums/artists differ in audio metrics?

  • Are certain audio patterns consistent across time or playlists?

Data quality concerns:

  • Outliers in track duration or numeric features

  • Duplicate or composite entries in album/artist fields

C - Columns and Coverage

Create a summary table of your variables:

# Build a per-variable summary table: type, missing count, distinct values
# (constructed directly to avoid reshaping "{col}_{fn}" name conflicts)
variable_summary_long <- tibble(
  variable = names(tracks_features_df),
  type = sapply(tracks_features_df, function(x) class(x)[1]),
  missing = sapply(tracks_features_df, function(x) sum(is.na(x))),
  unique = sapply(tracks_features_df, function(x) n_distinct(x))
)

# View the summary table
variable_summary_long
# A tibble: 24 × 4
   variable     type      missing  unique
   <chr>        <chr>       <int>   <int>
 1 id           character       0 1204025
 2 name         character       0  850944
 3 album        character       0  106162
 4 album_id     character       0  118382
 5 artists      character       0  165365
 6 artist_ids   character       0  166423
 7 track_number integer         0      50
 8 disc_number  integer         0      13
 9 explicit     character       0       2
10 danceability numeric         0    1362
# ℹ 14 more rows

A - Aggregates: Overall Picture

# Get comprehensive dataset statistics:
# summarize numeric variables for a high-level view
numeric_summary <- tracks_features_df %>%
  select(where(is.numeric)) %>%
  summarise(across(everything(), list(
    mean = ~mean(., na.rm = TRUE),
    median = ~median(., na.rm = TRUE),
    sd = ~sd(., na.rm = TRUE),
    min = ~min(., na.rm = TRUE),
    max = ~max(., na.rm = TRUE)
  )))

# View the summary
numeric_summary
# A tibble: 1 × 80
  track_number_mean track_number_median track_number_sd track_number_min
              <dbl>               <int>           <dbl>            <int>
1              7.66                   7            5.99                1
# ℹ 76 more variables: track_number_max <int>, disc_number_mean <dbl>,
#   disc_number_median <int>, disc_number_sd <dbl>, disc_number_min <int>,
#   disc_number_max <int>, danceability_mean <dbl>, danceability_median <dbl>,
#   danceability_sd <dbl>, danceability_min <dbl>, danceability_max <dbl>,
#   energy_mean <dbl>, energy_median <dbl>, energy_sd <dbl>, energy_min <dbl>,
#   energy_max <dbl>, key_mean <dbl>, key_median <int>, key_sd <dbl>,
#   key_min <int>, key_max <int>, loudness_mean <dbl>, loudness_median <dbl>, …
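
The 1 × 80 layout above is hard to read. One optional reshape, assuming the "{column}_{stat}" names produced by the summarise() call, yields one row per variable:

# Reshape the wide 1 x 80 summary into one row per variable
numeric_summary_long <- numeric_summary %>%
  pivot_longer(
    everything(),
    names_to = c("variable", "stat"),
    names_pattern = "(.*)_(mean|median|sd|min|max)$",
    values_to = "value"
  ) %>%
  pivot_wider(names_from = stat, values_from = value)

numeric_summary_long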

N - Notable Segments

# Analyze key categorical segments (modify based on your specific data)
# Top albums by number of tracks
top_albums <- tracks_features_df %>%
  group_by(album) %>%
  summarise(
    track_count = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(track_count)) %>%
  slice_head(n = 10)

# Top artists by number of tracks
top_artists <- tracks_features_df %>%
  group_by(artists) %>%
  summarise(
    track_count = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(track_count)) %>%
  slice_head(n = 10)

# Top tracks by duration
top_tracks <- tracks_features_df %>%
  select(name, duration_ms) %>%
  arrange(desc(duration_ms)) %>%
  slice_head(n = 10)

# Show results
top_albums
# A tibble: 10 × 2
   album                      track_count
   <chr>                            <int>
 1 Greatest Hits                     1800
 2 Live                               872
 3 The Collection                     343
 4 Love Songs                         337
 5 Super Hits                         330
 6 Platinum & Gold Collection         300
 7 RCA 100 Años de Música             280
 8 The Best Of                        280
 9 II                                 278
10 Handel: Messiah                    268
top_artists
# A tibble: 10 × 2
   artists                                         track_count
   <chr>                                                 <int>
 1 "['Various Artists']"                                  1753
 2 "['Vitamin String Quartet']"                           1577
 3 "['Aretha Franklin']"                                  1209
 4 "['The City of Prague Philharmonic Orchestra']"        1042
 5 "[\"Dan Gibson's Solitudes\"]"                          997
 6 "['Bob Dylan']"                                         913
 7 "['The Fall']"                                          812
 8 "[\"Pickin' On Series\"]"                               774
 9 "['Dolly Parton']"                                      747
10 "['Guided By Voices']"                                  739
top_tracks
# A tibble: 10 × 2
   name                                                              duration_ms
   <chr>                                                                   <int>
 1 Bargrooves Deluxe Edition 2018 Mix 2 - Continuous Mix                 6061090
 2 Doctorow's Third Law                                                  6054655
 3 Gothic Lolita                                                         5764624
 4 Bargrooves Deluxe Edition 2017 - Continuous Mix 2                     5713196
 5 Bargrooves Deluxe Edition 2018 Mix 1 - Continuous Mix                 5679399
 6 Monstercat Podcast Ep. 086 (Staff Picks 2015)                         5646226
 7 Doctorow's Second Law                                                 5645108
 8 Arc Angel - Continuous Mix                                            5577278
 9 Bargrooves Lounge (Continuous Mix 1)                                  5531591
10 Los Jefes - Banda Sonora de la Película (feat. Big Man, feat. Ca…     5440375
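
Note that the artists values above are Python-style list strings (e.g. "['Bob Dylan']"), so the top-artists table currently counts artist combinations rather than individual artists. A rough cleanup sketch follows; quoting edge cases such as apostrophes inside names will need more care:

# Rough sketch: explode the list-string artists column into one row per artist
artist_counts <- tracks_features_df %>%
  select(id, artists) %>%
  mutate(artists = str_remove_all(artists, "^\\[|\\]$")) %>%
  separate_rows(artists, sep = ",\\s+") %>%
  mutate(artists = str_remove_all(artists, "^['\"]|['\"]$")) %>%
  count(artists, sort = TRUE)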

Complete this comprehensive assessment:

DATASET OVERVIEW:

  • Records: 1,204,025 representing Spotify tracks with audio features and metadata

  • Time span: Multiple years from earliest track release to most recent

  • Key metrics: danceability, energy, loudness, tempo, valence, duration_ms, track_number, disc_number

DATA COMPLETENESS:

  • Core fields: 100% complete

  • Variable 1: danceability (100% complete)

  • Variable 2: energy (100% complete)

DATA QUALITY STRENGTHS:

  • No missing values across any columns

  • Consistent data types for numeric and categorical variables

  • Excellent coverage of tracks, albums, and artists

DATA QUALITY CONCERNS:

  • Some track_number and disc_number values are extreme or unusual

  • Numeric features may include outliers (e.g., very long duration, loudness extremes)

  • Categorical identifiers may have duplicates or variations in naming

MISSING DATA IMPACT:

  • Most missing: None at 0%

  • Impact on analysis: No missing data; analyses can proceed without bias

  • Handling strategy: Standard quality checks; no special handling required

RELIABILITY ASSESSMENT:

  • Most reliable variables: danceability, energy, duration_ms, explicit, track_number, disc_number

  • Variables needing caution: album, artists, name, album_id, artist_ids

  • Overall confidence level: High

JUSTIFICATION: Confidence is high because the dataset has complete coverage, no missing values, consistent types, and a large number of records, though some categorical identifiers may need careful handling for duplicates or composite entries.

Deliverables Checklist

Ensure your submission includes:

  • Complete READY framework analysis with thoughtful responses

  • Systematic SCAN framework exploration with specific findings

  • Successful data loading with Arrow

  • Professional data description and summary statistics

  • Comprehensive missing value analysis with percentages

  • Variable summary table documenting key fields

  • Memory efficiency demonstration

  • 3-5 well-defined, specific exploratory research questions

  • Data quality assessment with honest evaluation

  • Professional summary with clear next steps

Grading Criteria

  • READY Framework (20%): Thoughtful strategic planning showing understanding of stakeholders and analytical approach

  • Data Loading (15%): Successful Arrow implementation with proper documentation

  • SCAN Framework (25%): Systematic exploration with specific, meaningful findings

  • Data Quality Assessment (20%): Comprehensive evaluation with specific evidence

  • Research Questions (15%): Clear, answerable questions tied to stakeholder needs and data capabilities

  • Professional Communication (5%): Clear, honest, well-organized presentation throughout

Tips for Success

  • Be specific in your observations - avoid vague statements

  • Think like a stakeholder - what would decision-makers actually want to know?

  • Document your reasoning for all assessment decisions

  • Be honest about limitations - this builds credibility

  • Focus on actionable insights - what can actually be learned from this data?

  • Ask for help if your data format doesn’t match the provided templates

Remember: This is exploratory data analysis - you’re learning about your data, not proving predetermined hypotheses. Let your curiosity guide your investigation while maintaining systematic rigor.