## Data Analysis Project - Spotify’s Top 100 Songs

This is a data analysis project focused on the top 100 songs of all time on Spotify. The project will be published on my LinkedIn, and the data is sourced from a Kaggle dataset.

The project aims to explore the characteristics of the top 100 songs on Spotify and the trends and patterns that emerge from the data, and to provide insights into the music industry and general listener interest.

My main goal is to determine the average duration of songs in the top 100. This information can help singers identify the most effective duration range for building a successful song, which they can then apply to their own work. Additionally, I aim to identify outliers, such as artists with multiple songs in the top 100, or songs with unusually long durations. This analysis can provide valuable insights to artists on how to stand out and succeed in the music industry.

#The first code chunk loads the tidyverse, a collection of R packages designed for data manipulation, visualization, and analysis.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.2.0
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# I am loading my data
#This block of code reads a CSV file from the specified file path and assigns it to the variable first_import. The data frame is then printed to the console.

filepath <-"~/Downloads/projecto/Features.csv"
first_import <- read_csv(filepath)
## Rows: 100 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): id, name
## dbl (12): duration, energy, key, loudness, mode, speechiness, acousticness, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
first_import
## # A tibble: 100 × 14
##    id           name  durat…¹ energy   key loudn…²  mode speec…³ acous…⁴ instr…⁵
##    <chr>        <chr>   <dbl>  <dbl> <dbl>   <dbl> <dbl>   <dbl>   <dbl>   <dbl>
##  1 0VjIjW4GlUZ… Blin…    3.33  0.73      1   -5.93     1  0.0598 0.00146 9.54e-5
##  2 7qiZfU4dY1l… Shap…    3.9   0.652     1   -3.18     0  0.0802 0.581   0      
##  3 2XU0oxnq2qx… Danc…    3.49  0.588     6   -6.4      0  0.0924 0.692   1.04e-4
##  4 7qEHsqek33r… Some…    3.04  0.405     1   -5.68     1  0.0319 0.751   0      
##  5 0e7ipj03S05… Rock…    3.64  0.52      5   -6.14     0  0.0712 0.124   7.01e-5
##  6 3KkXRkHbMCA… Sunf…    2.63  0.479     2   -5.57     1  0.0466 0.556   0      
##  7 1zi7xx7UVEF… One …    2.9   0.625     1   -5.61     1  0.0536 0.00776 1.8 e-3
##  8 7BKLCZ1jbUB… Clos…    4.08  0.524     8   -5.60     1  0.0338 0.414   0      
##  9 789CxjEOtO7… Stay     4.01  0.31      9  -10.2      0  0.0283 0.945   6.12e-5
## 10 0pqnGHJpmpx… Beli…    3.41  0.78     10   -4.37     0  0.128  0.0622  0      
## # … with 90 more rows, 4 more variables: liveness <dbl>, valence <dbl>,
## #   tempo <dbl>, danceability <dbl>, and abbreviated variable names ¹​duration,
## #   ²​loudness, ³​speechiness, ⁴​acousticness, ⁵​instrumentalness
#This block of code removes every column from first_import except "name" and "duration". The removed columns are "energy", "key", "loudness", "speechiness", "acousticness", "instrumentalness", "liveness", "valence", "tempo", "mode", "danceability", and "id".

first_import <- subset(first_import, select = -c(energy, key, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, mode, danceability, id))
first_import
## # A tibble: 100 × 2
##    name              duration
##    <chr>                <dbl>
##  1 Blinding Lights       3.33
##  2 Shape of You          3.9 
##  3 Dance Monkey          3.49
##  4 Someone You Loved     3.04
##  5 Rockstar              3.64
##  6 Sunflower             2.63
##  7 One Dance             2.9 
##  8 Closer                4.08
##  9 Stay                  4.01
## 10 Believer              3.41
## # … with 90 more rows
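The same cleanup can also be written by naming the columns to keep instead of the ones to drop. The sketch below is only an illustration, not the method used in this project: it uses a three-row sample copied from the printed output, a hypothetical object name `features_sample`, and placeholder `id` values, and it also renames `name` to `song` in the same step.

```r
library(tidyverse)

# Three rows copied from the printed output above.
# The id values are shortened placeholders and most feature columns are left out.
features_sample <- tibble(
  id = c("id_1", "id_2", "id_3"),
  name = c("Blinding Lights", "Shape of You", "Dance Monkey"),
  duration = c(3.33, 3.9, 3.49),
  energy = c(0.73, 0.652, 0.588)
)

# select() keeps only the listed columns; `song = name` renames while selecting
songs_sample <- features_sample %>%
  select(song = name, duration)

songs_sample
```

Listing the columns to keep means the code does not need to change if the source file gains extra feature columns.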
#loading more data
#This block of code reads in a second CSV file from the specified file path and assigns it to the variable second_import. The data frame is also printed to the console.

filepath <- "~/Downloads/projecto/Streams.csv"
second_import <- read_csv(filepath)
## Rows: 100 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Song, Artist, Release Date
## dbl (1): Streams (Billions)
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
second_import
## # A tibble: 100 × 4
##    Song              Artist                            Streams (Billio…¹ Relea…²
##    <chr>             <chr>                                         <dbl> <chr>  
##  1 Blinding Lights   The Weeknd                                     3.45 29-Nov…
##  2 Shape of You      Ed Sheeran                                     3.40 06-Jan…
##  3 Dance Monkey      Tones And I                                    2.77 10-May…
##  4 Someone You Loved Lewis Capaldi                                  2.68 08-Nov…
##  5 Rockstar          Post Malone featuring 21 Savage                2.62 15-Sep…
##  6 Sunflower         Post Malone and Swae Lee                       2.58 18-Oct…
##  7 One Dance         Drake featuring Wizkid and Kyla                2.56 05-Apr…
##  8 Closer            The Chainsmokers featuring Halsey              2.48 29-Jul…
##  9 Stay              The Kid Laroi and Justin Bieber                2.43 09-Jul…
## 10 Believer          Imagine Dragons                                2.41 01-Feb…
## # … with 90 more rows, and abbreviated variable names ¹​`Streams (Billions)`,
## #   ²​`Release Date`
#preparing to join
#This block of code renames the first column of first_import to "song" and prints the updated data frame.
#I do this because I want to join first_import with second_import on the song column, so the column needs the same name in both data frames.

names(first_import)[1] <- "song"
first_import
## # A tibble: 100 × 2
##    song              duration
##    <chr>                <dbl>
##  1 Blinding Lights       3.33
##  2 Shape of You          3.9 
##  3 Dance Monkey          3.49
##  4 Someone You Loved     3.04
##  5 Rockstar              3.64
##  6 Sunflower             2.63
##  7 One Dance             2.9 
##  8 Closer                4.08
##  9 Stay                  4.01
## 10 Believer              3.41
## # … with 90 more rows
names(second_import)[1] <- "song"
#joining
#This block of code joins first_import and second_import on the "song" column using a full join and assigns the result to the variable new_tibble. 
new_tibble <- full_join(first_import, second_import, by = "song")
new_tibble
## # A tibble: 100 × 5
##    song              duration Artist                            Stream…¹ Relea…²
##    <chr>                <dbl> <chr>                                <dbl> <chr>  
##  1 Blinding Lights       3.33 The Weeknd                            3.45 29-Nov…
##  2 Shape of You          3.9  Ed Sheeran                            3.40 06-Jan…
##  3 Dance Monkey          3.49 Tones And I                           2.77 10-May…
##  4 Someone You Loved     3.04 Lewis Capaldi                         2.68 08-Nov…
##  5 Rockstar              3.64 Post Malone featuring 21 Savage       2.62 15-Sep…
##  6 Sunflower             2.63 Post Malone and Swae Lee              2.58 18-Oct…
##  7 One Dance             2.9  Drake featuring Wizkid and Kyla       2.56 05-Apr…
##  8 Closer                4.08 The Chainsmokers featuring Halsey     2.48 29-Jul…
##  9 Stay                  4.01 The Kid Laroi and Justin Bieber       2.43 09-Jul…
## 10 Believer              3.41 Imagine Dragons                       2.41 01-Feb…
## # … with 90 more rows, and abbreviated variable names ¹​`Streams (Billions)`,
## #   ²​`Release Date`
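Because the join key is the song title, a title spelled differently in the two files, or a title that appears twice, would add extra rows or missing values to a full join. The result above has 100 rows, which suggests the titles lined up. A minimal sketch of how this could be checked is shown below, using three-row samples copied from the printed output and hypothetical object names (`first_sample`, `second_sample`).

```r
library(tidyverse)

# Small samples copied from the printed output; the real tables have 100 rows each
first_sample <- tibble(
  song = c("Blinding Lights", "Shape of You", "Dance Monkey"),
  duration = c(3.33, 3.9, 3.49)
)
second_sample <- tibble(
  song = c("Blinding Lights", "Shape of You", "Dance Monkey"),
  Artist = c("The Weeknd", "Ed Sheeran", "Tones And I")
)

# Songs that appear in one table but have no match in the other
only_in_first <- anti_join(first_sample, second_sample, by = "song")
only_in_second <- anti_join(second_sample, first_sample, by = "song")

# Titles that occur more than once would multiply rows in the join
duplicated_titles <- first_sample %>%
  count(song) %>%
  filter(n > 1)

# All three counts are 0 when the titles match one to one
nrow(only_in_first)
nrow(only_in_second)
nrow(duplicated_titles)
```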
#This block of code renames the third column of new_tibble to "artist" and prints the updated data frame.
#I do this because I don't like to work with uppercase letters and want to keep a standard with my data.

names(new_tibble)[3] <- "artist"
new_tibble
## # A tibble: 100 × 5
##    song              duration artist                            Stream…¹ Relea…²
##    <chr>                <dbl> <chr>                                <dbl> <chr>  
##  1 Blinding Lights       3.33 The Weeknd                            3.45 29-Nov…
##  2 Shape of You          3.9  Ed Sheeran                            3.40 06-Jan…
##  3 Dance Monkey          3.49 Tones And I                           2.77 10-May…
##  4 Someone You Loved     3.04 Lewis Capaldi                         2.68 08-Nov…
##  5 Rockstar              3.64 Post Malone featuring 21 Savage       2.62 15-Sep…
##  6 Sunflower             2.63 Post Malone and Swae Lee              2.58 18-Oct…
##  7 One Dance             2.9  Drake featuring Wizkid and Kyla       2.56 05-Apr…
##  8 Closer                4.08 The Chainsmokers featuring Halsey     2.48 29-Jul…
##  9 Stay                  4.01 The Kid Laroi and Justin Bieber       2.43 09-Jul…
## 10 Believer              3.41 Imagine Dragons                       2.41 01-Feb…
## # … with 90 more rows, and abbreviated variable names ¹​`Streams (Billions)`,
## #   ²​`Release Date`
#This block of code removes the fourth column of new_tibble, "Streams (Billions)".
#I am cleaning my data more, because I found some columns that I don't need

new_tibble <- new_tibble[, -4]
new_tibble
## # A tibble: 100 × 4
##    song              duration artist                            `Release Date`
##    <chr>                <dbl> <chr>                             <chr>         
##  1 Blinding Lights       3.33 The Weeknd                        29-Nov-19     
##  2 Shape of You          3.9  Ed Sheeran                        06-Jan-17     
##  3 Dance Monkey          3.49 Tones And I                       10-May-19     
##  4 Someone You Loved     3.04 Lewis Capaldi                     08-Nov-18     
##  5 Rockstar              3.64 Post Malone featuring 21 Savage   15-Sep-17     
##  6 Sunflower             2.63 Post Malone and Swae Lee          18-Oct-18     
##  7 One Dance             2.9  Drake featuring Wizkid and Kyla   05-Apr-16     
##  8 Closer                4.08 The Chainsmokers featuring Halsey 29-Jul-16     
##  9 Stay                  4.01 The Kid Laroi and Justin Bieber   09-Jul-21     
## 10 Believer              3.41 Imagine Dragons                   01-Feb-17     
## # … with 90 more rows
#This block of code calculates the mean of the "duration" column of new_tibble, ignoring any missing values. The mean is assigned to the variable mean_value and printed to the console.
# I wanted to do this because it is important for the problem my project is solving.
mean_value <- mean(new_tibble$duration, na.rm=TRUE)
mean_value
## [1] 3.6353
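The project goals also mention outliers: artists with several songs in the list and songs with unusually long durations. The sketch below shows one possible way to look for both, using the ten rows printed above. The 1.5 × IQR fence is an assumption chosen for illustration, not a rule taken from this project, and `sample_songs` is a hypothetical object name. Artists are counted by their exact text, so "Post Malone featuring 21 Savage" and "Post Malone and Swae Lee" count as two different artists.

```r
library(tidyverse)

# The ten rows shown in the printed output above
sample_songs <- tibble(
  song = c("Blinding Lights", "Shape of You", "Dance Monkey",
           "Someone You Loved", "Rockstar", "Sunflower", "One Dance",
           "Closer", "Stay", "Believer"),
  duration = c(3.33, 3.9, 3.49, 3.04, 3.64, 2.63, 2.9, 4.08, 4.01, 3.41),
  artist = c("The Weeknd", "Ed Sheeran", "Tones And I", "Lewis Capaldi",
             "Post Malone featuring 21 Savage", "Post Malone and Swae Lee",
             "Drake featuring Wizkid and Kyla",
             "The Chainsmokers featuring Halsey",
             "The Kid Laroi and Justin Bieber", "Imagine Dragons")
)

# Artist strings that appear on more than one song (exact text match only)
repeat_artists <- sample_songs %>%
  count(artist, sort = TRUE) %>%
  filter(n > 1)

# Assumed rule: a song is "unusually long" if it lies above Q3 + 1.5 * IQR
q <- quantile(sample_songs$duration, c(0.25, 0.75), na.rm = TRUE)
upper_fence <- q[[2]] + 1.5 * (q[[2]] - q[[1]])
long_songs <- sample_songs %>%
  filter(duration > upper_fence)

repeat_artists
long_songs
```

On the full data, `sample_songs` would be replaced by `new_tibble`. Grouping collaborations under a lead artist would need extra text cleaning that is not shown here.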
#I create a new file (spotify_data.csv) to export my data. write_csv() comes from readr, which library(tidyverse) already attached.
filepath <- "~/Downloads/projecto/spotify_data.csv"
write_csv(new_tibble, filepath)

The goal of the project is to present an informative analysis that tells a story about the data, its insights and what they reveal about the music industry.

The dashboard is available at this link: https://public.tableau.com/views/MostStreamedSongsofAlltimeonSpotify/Dashboard2