## Data Analysis Project - Spotify’s Top 100 Songs
This is a data analysis project focused on the 100 most-streamed songs of all time on Spotify. The project will be published on my LinkedIn, and the data is sourced from a Kaggle dataset.
The project aims to explore the characteristics of the top 100 songs on Spotify and the trends and patterns that emerge from the data, and to provide insights into the music industry and the interests of the general audience.
My main goal is to determine the average duration of songs in the top 100. This information can help singers identify the most effective duration range for building a successful song, which they can then apply to their own work. Additionally, I aim to identify outliers, such as artists with multiple songs in the top 100, or songs with unusually long durations. This analysis can provide valuable insights to artists on how to stand out and succeed in the music industry.
#The first code loads the tidyverse package, which is a collection of R packages designed for data manipulation, visualization, and analysis.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.2.0
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# I am uploading my data
#This block of code reads in a CSV file from the specified file path and assigns it to the variable first_import.
filepath <-"~/Downloads/projecto/Features.csv"
first_import <- read_csv(filepath)
## Rows: 100 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): id, name
## dbl (12): duration, energy, key, loudness, mode, speechiness, acousticness, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
first_import
## # A tibble: 100 × 14
## id name durat…¹ energy key loudn…² mode speec…³ acous…⁴ instr…⁵
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0VjIjW4GlUZ… Blin… 3.33 0.73 1 -5.93 1 0.0598 0.00146 9.54e-5
## 2 7qiZfU4dY1l… Shap… 3.9 0.652 1 -3.18 0 0.0802 0.581 0
## 3 2XU0oxnq2qx… Danc… 3.49 0.588 6 -6.4 0 0.0924 0.692 1.04e-4
## 4 7qEHsqek33r… Some… 3.04 0.405 1 -5.68 1 0.0319 0.751 0
## 5 0e7ipj03S05… Rock… 3.64 0.52 5 -6.14 0 0.0712 0.124 7.01e-5
## 6 3KkXRkHbMCA… Sunf… 2.63 0.479 2 -5.57 1 0.0466 0.556 0
## 7 1zi7xx7UVEF… One … 2.9 0.625 1 -5.61 1 0.0536 0.00776 1.8 e-3
## 8 7BKLCZ1jbUB… Clos… 4.08 0.524 8 -5.60 1 0.0338 0.414 0
## 9 789CxjEOtO7… Stay 4.01 0.31 9 -10.2 0 0.0283 0.945 6.12e-5
## 10 0pqnGHJpmpx… Beli… 3.41 0.78 10 -4.37 0 0.128 0.0622 0
## # … with 90 more rows, 4 more variables: liveness <dbl>, valence <dbl>,
## # tempo <dbl>, danceability <dbl>, and abbreviated variable names ¹duration,
## # ²loudness, ³speechiness, ⁴acousticness, ⁵instrumentalness
#This block of code removes the columns I don't need from first_import: "energy", "key", "loudness", "speechiness", "acousticness", "instrumentalness", "liveness", "valence", "tempo", "mode", "danceability", and "id". Only "name" and "duration" remain.
first_import <- subset(first_import, select = -c(energy, key, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, mode, danceability, id))
first_import
## # A tibble: 100 × 2
## name duration
## <chr> <dbl>
## 1 Blinding Lights 3.33
## 2 Shape of You 3.9
## 3 Dance Monkey 3.49
## 4 Someone You Loved 3.04
## 5 Rockstar 3.64
## 6 Sunflower 2.63
## 7 One Dance 2.9
## 8 Closer 4.08
## 9 Stay 4.01
## 10 Believer 3.41
## # … with 90 more rows
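When only two columns are needed, it can be easier to name the columns to keep than to list every column to drop. The sketch below shows that alternative with dplyr's select(). The tibble `features` is a small stand-in built from three rows of the printed output, with placeholder ids and only a few of the 14 columns; it is not the real file.

```r
library(dplyr)

# Stand-in for first_import: three rows from the printed output.
# The id values are placeholders, not the real Spotify ids.
features <- tibble::tibble(
  id = c("a", "b", "c"),
  name = c("Blinding Lights", "Shape of You", "Dance Monkey"),
  duration = c(3.33, 3.9, 3.49),
  energy = c(0.73, 0.652, 0.588),
  key = c(1, 1, 6)
)

# Keep only the columns the analysis uses.
trimmed <- select(features, name, duration)
trimmed
```

This version keeps working if the source file gains extra columns, because nothing has to be added to a drop list.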
#uploading more data
#This block of code reads in a second CSV file from the specified file path and assigns it to the variable second_import. The data frame is also printed to the console.
filepath <- "~/Downloads/projecto/Streams.csv"
second_import <- read_csv(filepath)
## Rows: 100 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Song, Artist, Release Date
## dbl (1): Streams (Billions)
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
second_import
## # A tibble: 100 × 4
## Song Artist Streams (Billio…¹ Relea…²
## <chr> <chr> <dbl> <chr>
## 1 Blinding Lights The Weeknd 3.45 29-Nov…
## 2 Shape of You Ed Sheeran 3.40 06-Jan…
## 3 Dance Monkey Tones And I 2.77 10-May…
## 4 Someone You Loved Lewis Capaldi 2.68 08-Nov…
## 5 Rockstar Post Malone featuring 21 Savage 2.62 15-Sep…
## 6 Sunflower Post Malone and Swae Lee 2.58 18-Oct…
## 7 One Dance Drake featuring Wizkid and Kyla 2.56 05-Apr…
## 8 Closer The Chainsmokers featuring Halsey 2.48 29-Jul…
## 9 Stay The Kid Laroi and Justin Bieber 2.43 09-Jul…
## 10 Believer Imagine Dragons 2.41 01-Feb…
## # … with 90 more rows, and abbreviated variable names ¹`Streams (Billions)`,
## # ²`Release Date`
#preparing to join
#This block of code renames the first column of first_import to "song" and prints the updated data frame.
#I want to join first_import with second_import on the song title, so I rename the key column in both tibbles to give it the same name.
names(first_import)[1] <- "song"
first_import
## # A tibble: 100 × 2
## song duration
## <chr> <dbl>
## 1 Blinding Lights 3.33
## 2 Shape of You 3.9
## 3 Dance Monkey 3.49
## 4 Someone You Loved 3.04
## 5 Rockstar 3.64
## 6 Sunflower 2.63
## 7 One Dance 2.9
## 8 Closer 4.08
## 9 Stay 4.01
## 10 Believer 3.41
## # … with 90 more rows
names(second_import)[1] <- "song"
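Renaming by position works here, but it silently renames the wrong column if the column order ever changes. A minimal sketch of renaming by name with dplyr's rename() follows. The tibble `streams` is a two-row stand-in taken from the printed output, and the new names `streams_billions` and `release_date` are my own suggestions, not names used elsewhere in this project.

```r
library(dplyr)

# Stand-in for second_import: two rows from the printed output.
streams <- tibble::tibble(
  Song = c("Blinding Lights", "Shape of You"),
  Artist = c("The Weeknd", "Ed Sheeran"),
  `Streams (Billions)` = c(3.45, 3.40),
  `Release Date` = c("29-Nov-19", "06-Jan-17")
)

# rename() takes new_name = old_name pairs; backticks are needed
# for names that contain spaces or parentheses.
streams_renamed <- rename(
  streams,
  song = Song,
  artist = Artist,
  streams_billions = `Streams (Billions)`,
  release_date = `Release Date`
)
names(streams_renamed)
```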
#joining
#This block of code joins first_import and second_import on the "song" column using a full join and assigns the result to the variable new_tibble.
new_tibble <- full_join(first_import, second_import, by = "song")
new_tibble
## # A tibble: 100 × 5
## song duration Artist Stream…¹ Relea…²
## <chr> <dbl> <chr> <dbl> <chr>
## 1 Blinding Lights 3.33 The Weeknd 3.45 29-Nov…
## 2 Shape of You 3.9 Ed Sheeran 3.40 06-Jan…
## 3 Dance Monkey 3.49 Tones And I 2.77 10-May…
## 4 Someone You Loved 3.04 Lewis Capaldi 2.68 08-Nov…
## 5 Rockstar 3.64 Post Malone featuring 21 Savage 2.62 15-Sep…
## 6 Sunflower 2.63 Post Malone and Swae Lee 2.58 18-Oct…
## 7 One Dance 2.9 Drake featuring Wizkid and Kyla 2.56 05-Apr…
## 8 Closer 4.08 The Chainsmokers featuring Halsey 2.48 29-Jul…
## 9 Stay 4.01 The Kid Laroi and Justin Bieber 2.43 09-Jul…
## 10 Believer 3.41 Imagine Dragons 2.41 01-Feb…
## # … with 90 more rows, and abbreviated variable names ¹`Streams (Billions)`,
## # ²`Release Date`
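A full join keeps unmatched rows from both sides, so a misspelled or differently formatted title would show up as extra rows with missing values. The 100-row result above suggests every title matched, but that can be checked directly. The sketch below uses anti_join() on two small stand-in tibbles; the trailing space in one title is a hypothetical mismatch added only to show what the check reports.

```r
library(dplyr)

durations <- tibble::tibble(
  song = c("Blinding Lights", "Shape of You", "Dance Monkey"),
  duration = c(3.33, 3.9, 3.49)
)
streams <- tibble::tibble(
  # "Dance Monkey " has a trailing space: a hypothetical mismatch.
  song = c("Blinding Lights", "Shape of You", "Dance Monkey "),
  Artist = c("The Weeknd", "Ed Sheeran", "Tones And I")
)

# Rows that have no partner in the other table.
only_in_durations <- anti_join(durations, streams, by = "song")
only_in_streams <- anti_join(streams, durations, by = "song")

# Duplicate keys would multiply rows in the join, so check for them too.
duplicate_titles <- count(streams, song) %>% filter(n > 1)

# The full join keeps both unmatched rows, giving 4 rows instead of 3.
joined <- full_join(durations, streams, by = "song")
nrow(joined)
```

If both anti joins return zero rows and there are no duplicate titles, a full join and an inner join give the same result.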
#This block of code renames the third column of new_tibble to "artist" and prints the updated data frame.
#I do this because I prefer lowercase column names and want to keep my data consistent.
names(new_tibble)[3] <- "artist"
new_tibble
## # A tibble: 100 × 5
## song duration artist Stream…¹ Relea…²
## <chr> <dbl> <chr> <dbl> <chr>
## 1 Blinding Lights 3.33 The Weeknd 3.45 29-Nov…
## 2 Shape of You 3.9 Ed Sheeran 3.40 06-Jan…
## 3 Dance Monkey 3.49 Tones And I 2.77 10-May…
## 4 Someone You Loved 3.04 Lewis Capaldi 2.68 08-Nov…
## 5 Rockstar 3.64 Post Malone featuring 21 Savage 2.62 15-Sep…
## 6 Sunflower 2.63 Post Malone and Swae Lee 2.58 18-Oct…
## 7 One Dance 2.9 Drake featuring Wizkid and Kyla 2.56 05-Apr…
## 8 Closer 4.08 The Chainsmokers featuring Halsey 2.48 29-Jul…
## 9 Stay 4.01 The Kid Laroi and Justin Bieber 2.43 09-Jul…
## 10 Believer 3.41 Imagine Dragons 2.41 01-Feb…
## # … with 90 more rows, and abbreviated variable names ¹`Streams (Billions)`,
## # ²`Release Date`
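One of the goals is to find artists with several songs in the top 100. Counting the artist column as it stands would miss cases like Post Malone, who appears with different collaborators on different songs. The sketch below is one possible approach: it strips everything after " featuring " or " and " to get a lead artist, then counts. The four rows come from the printed output; the column name `lead_artist` and the regular expression are my own assumptions, and the rule would wrongly split a band whose name contains a lowercase " and ".

```r
library(dplyr)
library(stringr)

songs <- tibble::tibble(
  song = c("Blinding Lights", "Rockstar", "Sunflower", "One Dance"),
  artist = c(
    "The Weeknd",
    "Post Malone featuring 21 Savage",
    "Post Malone and Swae Lee",
    "Drake featuring Wizkid and Kyla"
  )
)

artist_counts <- songs %>%
  # Drop the collaborators: everything from " featuring " or " and " onward.
  mutate(lead_artist = str_remove(artist, " (featuring|and) .*$")) %>%
  count(lead_artist, sort = TRUE) %>%
  # Keep only artists that appear more than once.
  filter(n > 1)
artist_counts
```

The match is case-sensitive, so a name such as "Tones And I" is left intact.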
#This block of code removes the fourth column of new_tibble, "Streams (Billions)". The "Release Date" column is kept.
#I am cleaning my data further, because I found a column that I don't need.
new_tibble <- new_tibble[, -4]
new_tibble
## # A tibble: 100 × 4
## song duration artist `Release Date`
## <chr> <dbl> <chr> <chr>
## 1 Blinding Lights 3.33 The Weeknd 29-Nov-19
## 2 Shape of You 3.9 Ed Sheeran 06-Jan-17
## 3 Dance Monkey 3.49 Tones And I 10-May-19
## 4 Someone You Loved 3.04 Lewis Capaldi 08-Nov-18
## 5 Rockstar 3.64 Post Malone featuring 21 Savage 15-Sep-17
## 6 Sunflower 2.63 Post Malone and Swae Lee 18-Oct-18
## 7 One Dance 2.9 Drake featuring Wizkid and Kyla 05-Apr-16
## 8 Closer 4.08 The Chainsmokers featuring Halsey 29-Jul-16
## 9 Stay 4.01 The Kid Laroi and Justin Bieber 09-Jul-21
## 10 Believer 3.41 Imagine Dragons 01-Feb-17
## # … with 90 more rows
#This block of code calculates the mean of the "duration" column of new_tibble, ignoring any missing values. The mean is assigned to the variable mean_value and printed to the console.
# I wanted to do this because it is important for the problem my project is solving.
mean_value <- mean(new_tibble$duration, na.rm=TRUE)
mean_value
## [1] 3.6353
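The mean alone can be pulled up by a few long songs, so it helps to look at the median and the range as well, and to flag unusually short or long songs. The sketch below does this with the common 1.5 × IQR rule, which is my choice of outlier rule and not something the dataset prescribes. It uses only the ten durations visible in the printed output, and it assumes the values are in decimal minutes.

```r
library(dplyr)

# The ten songs shown in the printed output (assumed to be in minutes).
songs <- tibble::tibble(
  song = c("Blinding Lights", "Shape of You", "Dance Monkey",
           "Someone You Loved", "Rockstar", "Sunflower", "One Dance",
           "Closer", "Stay", "Believer"),
  duration = c(3.33, 3.9, 3.49, 3.04, 3.64, 2.63, 2.9, 4.08, 4.01, 3.41)
)

duration_summary <- summarise(
  songs,
  mean_duration = mean(duration, na.rm = TRUE),
  median_duration = median(duration, na.rm = TRUE),
  shortest = min(duration, na.rm = TRUE),
  longest = max(duration, na.rm = TRUE)
)
duration_summary

# Flag songs more than 1.5 * IQR below the first or above the third quartile.
q <- quantile(songs$duration, c(0.25, 0.75), na.rm = TRUE)
fence <- 1.5 * (q[[2]] - q[[1]])
outliers <- filter(songs, duration < q[[1]] - fence | duration > q[[2]] + fence)
outliers
```

None of these ten songs falls outside the fences; running the same code on the full new_tibble may give a different picture.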
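#I create a new file (spotify_data.csv) to export my data.
#write_csv() comes from readr, which library(tidyverse) has already attached, so no extra library() call is needed.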
filepath <- "~/Downloads/projecto/spotify_data.csv"
write_csv(new_tibble, filepath)
The goal of the project is to present an informative analysis that tells a story about the data, its insights and what they reveal about the music industry.
The dashboard is available at this link: https://public.tableau.com/views/MostStreamedSongsofAlltimeonSpotify/Dashboard2