Data Wrangling Final Project

Introduction

Problem Statement: The dataset for this project comes from the music streaming application Spotify. It contains roughly 33,000 tracks across various genres and artists, along with mood-related attributes such as energy, danceability and valence (musical positivity). Our goal is to analyze the popular genres among Spotify listeners and to categorize artists and their best-known tracks by genre. In addition, for each mood attribute we will look at the most popular artists and their tracks.
Motivation for choosing this topic: Music reaches our feelings and connects people across boundaries. Our analysis will help people choose the most popular tracks according to their mood.
Solution: Our study of the dataset provides key insights into which songs to play according to one’s mood and which artist has the highest number of songs in each of the six genres.

Packages Required

library(knitr) #displaying an aligned table on the screen
library(readr) #load .csv file
library(ggplot2) #visualize the data
library(dplyr) #manipulate data
library(tidyr) #tidying the data
library(DT) #output data in table
library(GGally) #Visualize the data
library(plotly) #Visualize the data

We used the following packages to analyze the dataset:

tidyr: Used to tidy the data
readr: Used to import CSV data files
knitr: Used to display an aligned table on the screen
ggplot2: Used to visualize data
dplyr: Used to manipulate data
DT: Used to output data in a table
GGally: Used to visualize correlation among variables
plotly: Used to visualize distributions of the numerical variables. Note: double-click a legend entry to view the distribution of the selected genre only.

Data Preparation

Steps followed to prepare data for analysis:

Data Understanding

Data Import

The spotify_songs data file is loaded directly from the TidyTuesday GitHub repository. The dataset originally comes from Spotify via the spotifyr package, which was written to make it easy for anyone to retrieve their own data or general song metadata from Spotify’s API.

#Loading the dataset
sp_data <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')

Data Description

#Display the dimensions of raw dataset
dim(sp_data)

## [1] 32833    23

The dataset contains 32833 observations and 23 variables. Names of the variables are below:

colnames(sp_data)

##  [1] "track_id"                 "track_name"              
##  [3] "track_artist"             "track_popularity"        
##  [5] "track_album_id"           "track_album_name"        
##  [7] "track_album_release_date" "playlist_name"           
##  [9] "playlist_id"              "playlist_genre"          
## [11] "playlist_subgenre"        "danceability"            
## [13] "energy"                   "key"                     
## [15] "loudness"                 "mode"                    
## [17] "speechiness"              "acousticness"            
## [19] "instrumentalness"         "liveness"                
## [21] "valence"                  "tempo"                   
## [23] "duration_ms"

Not all of the 23 variables are relevant to our analysis. First, we check the data types: 7 of the character variables are better cast as factors for the analysis.

#Checking the datatype of the columns
lapply(sp_data, typeof)

## $track_id
## [1] "character"
## 
## $track_name
## [1] "character"
## 
## $track_artist
## [1] "character"
## 
## $track_popularity
## [1] "double"
## 
## $track_album_id
## [1] "character"
## 
## $track_album_name
## [1] "character"
## 
## $track_album_release_date
## [1] "character"
## 
## $playlist_name
## [1] "character"
## 
## $playlist_id
## [1] "character"
## 
## $playlist_genre
## [1] "character"
## 
## $playlist_subgenre
## [1] "character"
## 
## $danceability
## [1] "double"
## 
## $energy
## [1] "double"
## 
## $key
## [1] "double"
## 
## $loudness
## [1] "double"
## 
## $mode
## [1] "double"
## 
## $speechiness
## [1] "double"
## 
## $acousticness
## [1] "double"
## 
## $instrumentalness
## [1] "double"
## 
## $liveness
## [1] "double"
## 
## $valence
## [1] "double"
## 
## $tempo
## [1] "double"
## 
## $duration_ms
## [1] "double"

The variable “playlist_genre” contains 6 distinct categories and “playlist_subgenre” contains 24, so converting them to factors makes them easier to analyze. We will then prune our list of variables and explore the dataset further using only the retained variables.

#Converting the non-numerical variables into categorical variables
sp_data$track_id <- as.factor(sp_data$track_id)
sp_data$track_artist <- as.factor(sp_data$track_artist)
sp_data$track_name <- as.factor(sp_data$track_name)
sp_data$track_album_name <- as.factor(sp_data$track_album_name)
sp_data$playlist_name <- as.factor(sp_data$playlist_name)
sp_data$playlist_genre <- as.factor(sp_data$playlist_genre)
sp_data$playlist_subgenre <- as.factor(sp_data$playlist_subgenre)

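#Selecting the interesting variables (dropping album ID, release date, playlist ID,
#key, loudness, mode, speechiness, acousticness, instrumentalness and tempo)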
sp_songs <- select(sp_data,-c(5,7,9,14:19,22))
dim(sp_songs)

## [1] 32833    13

colnames(sp_songs)

##  [1] "track_id"          "track_name"        "track_artist"     
##  [4] "track_popularity"  "track_album_name"  "playlist_name"    
##  [7] "playlist_genre"    "playlist_subgenre" "danceability"     
## [10] "energy"            "liveness"          "valence"          
## [13] "duration_ms"
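
As a minimal sketch, the factor conversion and column pruning above can also be written by column name rather than by position, which is less fragile if the column order ever changes. The data frame `toy` and its values are made up for illustration, and `across()` and `all_of()` assume dplyr 1.0.0 or later.

```r
library(dplyr)

# Toy stand-in for sp_data; the values are invented for illustration only.
toy <- data.frame(
  track_id       = c("a1", "b2", "a1"),
  playlist_genre = c("pop", "rock", "pop"),
  key            = c(1, 5, 1),
  tempo          = c(120, 98, 120),
  energy         = c(0.8, 0.6, 0.8),
  stringsAsFactors = FALSE
)

factor_cols <- c("track_id", "playlist_genre")  # columns to treat as categorical
drop_cols   <- c("key", "tempo")                # columns not needed downstream

toy_clean <- toy %>%
  mutate(across(all_of(factor_cols), as.factor)) %>%  # convert by name
  select(-all_of(drop_cols))                          # drop by name

str(toy_clean)
```

The same pattern should carry over to sp_data by listing its seven categorical columns in `factor_cols` and the ten dropped columns in `drop_cols`.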

We have now shortened our list of variables from 23 to 13. The metadata for these variables is provided below:

Variable Name	Description
track_id	Song unique ID
track_name	Song Name
track_artist	Song Artist
track_popularity	Song Popularity (0-100) where higher is better
track_album_name	Song album name
playlist_name	Name of playlist
playlist_genre	Playlist genre
playlist_subgenre	Playlist subgenre
danceability	Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy	Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
liveness	Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence	A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
duration_ms	Duration of song in milliseconds

Data Cleaning

#Dimensions of the updated dataset
dim(sp_songs)

## [1] 32833    13

Renaming the columns:

sp_songs = sp_songs %>% rename(track_danceability = danceability, 
                               track_energy_level = energy, 
                               live_performed = liveness,  
                               musical_positivity = valence, 
                               song_duration = duration_ms
                               )

Finding missing values

#Finding missing values
missing = colSums(is.na(sp_songs))
missing

##           track_id         track_name       track_artist 
##                  0                  5                  5 
##   track_popularity   track_album_name      playlist_name 
##                  0                  5                  0 
##     playlist_genre  playlist_subgenre track_danceability 
##                  0                  0                  0 
## track_energy_level     live_performed musical_positivity 
##                  0                  0                  0 
##      song_duration 
##                  0

There are 5 missing values each in track_name, track_artist and track_album_name. These correspond to the same 5 observations, which make up about 0.015% of the dataset. We do not want to keep the missing values, so we remove these rows.
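
One way to check that missing values fall on the same rows, and to see what share of rows would be dropped, is sketched below in base R. The data frame `toy` is a made-up stand-in for sp_songs, so the numbers it produces are illustrative only.

```r
# Toy stand-in for sp_songs; the values are invented for illustration only.
toy <- data.frame(
  track_name       = c("Song A", NA, "Song C", NA),
  track_artist     = c("X", NA, "Z", NA),
  track_album_name = c("Album 1", NA, "Album 3", NA),
  track_popularity = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

cols <- c("track_name", "track_artist", "track_album_name")

# Number of missing values in each row across the three columns
na_per_row <- rowSums(is.na(toy[cols]))

# TRUE if every affected row is missing all three fields at once
same_rows <- all(na_per_row %in% c(0, length(cols)))

# Percentage of rows that na.omit() would remove
pct_dropped <- 100 * mean(!complete.cases(toy))

same_rows
pct_dropped
```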

sp_songs = na.omit(sp_songs)
#Dimensions of the cleansed dataset
dim(sp_songs)

## [1] 32828    13

Our final dataset after removing missing observations contains 32828 observations with 13 variables.

Table View of the dataset:

datatable(head(sp_songs,100),extensions = 'FixedColumns', options = list(scrollX = TRUE, scrollY = "400px",fixedColumns= TRUE))

Data Exploration

Structure of the data

str(sp_songs)

## Classes 'tbl_df', 'tbl' and 'data.frame':    32828 obs. of  13 variables:
##  $ track_id          : Factor w/ 28356 levels "0017A6SJgTbfQVU2EtsPNo",..: 22912 2531 7160 25706 4705 26672 9521 22445 26146 5283 ...
##  $ track_name        : Factor w/ 23449 levels "'39 - 2011 Mix",..: 9370 12876 1076 3237 18356 2101 13856 15783 20922 9819 ...
##  $ track_artist      : Factor w/ 10692 levels "'Til Tuesday",..: 2840 6171 10632 9370 5519 2840 4993 8313 771 8556 ...
##  $ track_popularity  : num  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_name  : Factor w/ 19743 levels "'74 - '75 (feat. Susan Tyler)",..: 7926 10675 1059 2942 15182 1959 11512 13091 17780 8153 ...
##  $ playlist_name     : Factor w/ 449 levels "\"Permanent Wave\"",..: 309 309 309 309 309 309 309 309 309 309 ...
##  $ playlist_genre    : Factor w/ 6 levels "edm","latin",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ playlist_subgenre : Factor w/ 24 levels "album rock","big room",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ track_danceability: num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ track_energy_level: num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ live_performed    : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ musical_positivity: num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ song_duration     : num  194754 162600 176616 169093 189052 ...
##  - attr(*, "na.action")= 'omit' Named int  8152 9283 9284 19569 19812
##   ..- attr(*, "names")= chr  "8152" "9283" "9284" "19569" ...

The cleansed data contains information about tracks, artists, genre, duration and other relevant information. We know there are 7 categorical and 6 numerical variables now.

Inference:

Converting the non-numerical variables to factors was useful: it shows that columns such as track_id (the unique song ID) contain duplicate values, since track_id has 28356 levels while the dataset has 32828 observations. These 7 categorical variables will be used for grouping in the analysis that follows. Three of them have a small enough number of categories to give readily interpretable insights:

Playlist_genre - 6
Playlist_subgenre - 24, and
Playlist_name - 449 categories
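
The duplicate check described above can be sketched in base R as follows. The data frame `toy` is invented for illustration; in the real data the same comparison would be between the 28356 track_id levels and the 32828 rows.

```r
# Toy stand-in: the same track can appear in more than one playlist.
toy <- data.frame(
  track_id      = c("t1", "t2", "t1", "t3", "t1"),
  playlist_name = c("P1", "P1", "P2", "P2", "P3"),
  stringsAsFactors = FALSE
)

n_rows   <- nrow(toy)                     # total observations
n_unique <- length(unique(toy$track_id))  # distinct track IDs

# Number of appearances per track, most frequent first
appearances <- sort(table(toy$track_id), decreasing = TRUE)

n_rows - n_unique  # rows that repeat an already-seen track ID
appearances
```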

Summary statistics of variables

#Summary of the data 
summary(sp_songs)

##                    track_id        track_name              track_artist  
##  7BKLCZ1jbUBVqRi2FVlTVw:   10   Poison  :   22   Martin Garrix   :  161  
##  14sOS5L36385FJ3OL8hew4:    9   Breathe :   21   Queen           :  136  
##  3eekarcy7kvN4yt5ZFzltW:    9   Alive   :   20   The Chainsmokers:  123  
##  0nbXyq5TXYPCO7pr3N8S4I:    8   Forever :   20   David Guetta    :  110  
##  0qaWEvPkts34WF68r8Dzx9:    8   Paradise:   19   Don Omar        :  102  
##  0rIAC4PXANcKmitJfoqmVm:    8   Stay    :   19   Drake           :  100  
##  (Other)               :32776   (Other) :32707   (Other)         :32096  
##  track_popularity                    track_album_name
##  Min.   :  0.00   Greatest Hits              :  139  
##  1st Qu.: 24.00   Ultimate Freestyle Mega Mix:   42  
##  Median : 45.00   Gold                       :   35  
##  Mean   : 42.48   Malibu                     :   30  
##  3rd Qu.: 62.00   Rock & Rios (Remastered)   :   29  
##  Max.   :100.00   Appetite For Destruction   :   28  
##                   (Other)                    :32525  
##                                                                          playlist_name  
##  Indie Poptimism                                                                :  308  
##  2020 Hits & 2019  Hits – Top Global Tracks <U+0001F525><U+0001F525><U+0001F525>:  247  
##  Permanent Wave                                                                 :  244  
##  Hard Rock Workout                                                              :  219  
##  Ultimate Indie Presents... Best Indie Tracks of the 2010s                      :  198  
##  Fitness Workout Electro | House | Dance | Progressive House                    :  195  
##  (Other)                                                                        :31417  
##  playlist_genre                 playlist_subgenre track_danceability
##  edm  :6043     progressive electro house: 1809   Min.   :0.0000    
##  latin:5153     southern hip hop         : 1674   1st Qu.:0.5630    
##  pop  :5507     indie poptimism          : 1672   Median :0.6720    
##  r&b  :5431     latin hip hop            : 1655   Mean   :0.6549    
##  rap  :5743     neo soul                 : 1637   3rd Qu.:0.7610    
##  rock :4951     pop edm                  : 1517   Max.   :0.9830    
##                 (Other)                  :22864                     
##  track_energy_level live_performed   musical_positivity song_duration   
##  Min.   :0.000175   Min.   :0.0000   Min.   :0.0000     Min.   :  4000  
##  1st Qu.:0.581000   1st Qu.:0.0927   1st Qu.:0.3310     1st Qu.:187805  
##  Median :0.721000   Median :0.1270   Median :0.5120     Median :216000  
##  Mean   :0.698603   Mean   :0.1902   Mean   :0.5106     Mean   :225797  
##  3rd Qu.:0.840000   3rd Qu.:0.2480   3rd Qu.:0.6930     3rd Qu.:253581  
##  Max.   :1.000000   Max.   :0.9960   Max.   :0.9910     Max.   :517810  
##

The summary statistics above show that no missing values remain in the dataset. The results for the categorical columns are consistent with the inferences made earlier.

Inference on numerical variables:

#Summary Statistics Table 
data.table::data.table(
  Variable.Name = c("track_popularity",
                    "track_danceability","track_energy_level","live_performed",
                    "musical_positivity","song_duration(in min)"),
            Min = c(0, 0, 0.000175, 0, 0, 0.067),
           Mean = c(42.48, 0.65, 0.698, 0.19, 0.51, 3.76),
         Median = c(45, 0.67, 0.721, 0.13, 0.51, 3.6),
            Max = c(100, 0.98, 1, 0.99, 0.99, 8.63)
)

##            Variable.Name      Min   Mean Median    Max
## 1:      track_popularity 0.000000 42.480 45.000 100.00
## 2:    track_danceability 0.000000  0.650  0.670   0.98
## 3:    track_energy_level 0.000175  0.698  0.721   1.00
## 4:        live_performed 0.000000  0.190  0.130   0.99
## 5:    musical_positivity 0.000000  0.510  0.510   0.99
## 6: song_duration(in min) 0.067000  3.760  3.600   8.63

The higher the metric, the better the song!

Half of the tracks have a popularity score of 45 or below (the median), with 100 being the maximum.
Half of the tracks have a danceability score of 0.67 or higher (on a 0-1 scale), so most tracks are fairly suitable for dancing.
Half of the tracks have an energy score of 0.72 or higher (on a 0-1 scale), implying mostly energetic tracks that feel fast, loud and noisy.
The live_performed values are concentrated at the low end of the 0-1 range (median 0.13), which implies that few tracks in our dataset were performed live.
The musical positivity measure has a mean and median both of about 0.51, so the tracks are roughly balanced in terms of valence.
Half of the tracks in the dataset are about 3.6 minutes long or shorter.
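
The duration figures in minutes follow from the millisecond values in the summary() output by dividing by 60,000. A short worked sketch, using only the four numbers reported above:

```r
# Duration statistics in milliseconds, copied from the summary() output
duration_ms <- c(Min = 4000, Median = 216000, Mean = 225797, Max = 517810)

# 1 minute = 60,000 milliseconds
duration_min <- round(duration_ms / 60000, 2)

duration_min
```

Rounded to two decimals this gives 0.07, 3.60, 3.76 and 8.63 minutes, in line with the table.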

Exploratory Data Analysis

(count <- sp_songs %>% count(playlist_genre) %>% knitr::kable())

playlist_genre	n
edm	6043
latin	5153
pop	5507
r&b	5431
rap	5743
rock	4951

Inference: The table shows us the count of songs in each genre.
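
To make the counts easier to compare, they can be turned into percentage shares. The sketch below uses only the six counts from the table; the variable names are chosen for illustration.

```r
# Track counts per genre, copied from the table above
genre_n <- c(edm = 6043, latin = 5153, pop = 5507,
             `r&b` = 5431, rap = 5743, rock = 4951)

total <- sum(genre_n)  # should match the 32828 rows of the cleaned data

# Share of each genre in percent, largest first
share_pct <- round(100 * genre_n / total, 1)
sort(share_pct, decreasing = TRUE)
```

The shares range from about 15% (rock) to about 18% (edm), so the six genres are fairly evenly represented.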

ggcorr(sp_songs,label = TRUE)

Inference: The graph above shows no strong correlation among the variables. track_danceability and musical_positivity have the highest correlation, at 0.3.
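
The numbers behind such a correlation plot can be inspected directly with base R's cor(). This is a sketch on an invented data frame `toy`, assuming Pearson correlation (which, as far as we know, is also what ggcorr uses by default); non-numeric columns are excluded first.

```r
# Toy stand-in for sp_songs; the values are invented for illustration only.
toy <- data.frame(
  genre        = c("pop", "rock", "pop", "rap", "rock"),
  danceability = c(0.70, 0.45, 0.80, 0.75, 0.50),
  valence      = c(0.60, 0.40, 0.65, 0.55, 0.30),
  energy       = c(0.80, 0.90, 0.60, 0.70, 0.85),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns, then compute the Pearson correlation matrix
num_cols <- sapply(toy, is.numeric)
cor_mat  <- round(cor(toy[num_cols]), 2)

cor_mat
```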

green <- "#1ed760"
yellow <- "#e7e247"
pink <- "#ff6f59"
blue <- "#17bebb"
orange <- "#ffa500"
grey <- "#808080"

#Plotting density distributions
#1. Danceability feature
viz1 <- ggplot(sp_songs, aes(x=track_danceability, fill=playlist_genre,
                    text = paste(playlist_genre)))+
  geom_density(alpha=0.7, color=NA)+
  scale_fill_manual(values=c(green, yellow, grey, blue, orange, pink))+
  labs(x="Danceability", y="Density") +
  guides(fill=guide_legend(title="Genres"))+
  theme_minimal()+
  ggtitle("Distribution of Danceability Data")

ggplotly(viz1, tooltip=c("text"))

Inference: The danceability distributions of all genres are right-skewed except for rock, which is approximately normally distributed. The latin genre has the highest peak density.

#2. Popularity feature
viz2 <- ggplot(sp_songs, aes(x=track_popularity, fill=playlist_genre,
                    text = paste(playlist_genre)))+
  geom_density(alpha=0.7, color=NA)+
  scale_fill_manual(values=c(green, yellow, grey, blue, orange, pink))+
  labs(x="Tracks popularity score", y="Density") +
  guides(fill=guide_legend(title="Genres"))+
  theme_minimal()+
  ggtitle("Distribution of Tracks popularity")

ggplotly(viz2, tooltip=c("text"))

Inference: Every genre contains some tracks with very low popularity, but the majority of tracks are well spread across the popularity range of 15-100.

gen_valence <- sp_songs %>%
  group_by(playlist_genre)%>%
  mutate(max=max(musical_positivity))%>%
  mutate(min=min(musical_positivity))%>%
  select(playlist_genre, max, min)%>%
  unique()

viz3 <- plot_ly(gen_valence, color = I("gray80"),  
              hoverinfo = 'text') %>%
  add_segments(x = ~max, xend = ~min, y = ~playlist_genre, yend = ~playlist_genre, showlegend = FALSE) %>%
  add_markers(x = ~max, y = ~playlist_genre, name = "High Positivity", color = I(pink), text=~paste('Max Valence: ', max)) %>%
  add_markers(x = ~min, y = ~playlist_genre, name = "Low Positivity", color = I(blue), text=~paste('Min Valence: ', min))%>%
  layout(
    title = "Genres' Positivity Range",
    xaxis = list(title = "Positivity Level"),
    yaxis= list(title=""))

#viz3 is already a plotly object, so it can be displayed directly
viz3

Inference: The graph above shows the range of musical positivity (minimum to maximum valence) in each genre. We can infer that rock and r&b have the highest positivity, while latin and rock have the lowest.

gen_energy <- sp_songs %>%
  group_by(playlist_genre)%>%
  mutate(max=max(track_energy_level))%>%
  mutate(min=min(track_energy_level))%>%
  select(playlist_genre, max, min)%>%
  unique()

viz4 <- plot_ly(gen_energy, color = I("gray80"),  
              hoverinfo = 'text') %>%
  add_segments(x = ~max, xend = ~min, y = ~playlist_genre, yend = ~playlist_genre, showlegend = FALSE) %>%
  add_markers(x = ~max, y = ~playlist_genre, name = "Maximum Energy Level Value", color = I(pink), text=~paste('Max Energy Level: ', max)) %>%
  add_markers(x = ~min, y = ~playlist_genre, name = "Minimum Energy Level Value", color = I(blue), text=~paste('Min Energy Level: ', min))%>%
  layout(
    title = "Genres' Energy Level Range",
    xaxis = list(title = "Energy Level"),
    yaxis= list(title=""))

#viz4 is already a plotly object, so it can be displayed directly
viz4

Inference: The graph above shows the range of energy levels (minimum to maximum) in each genre. We can infer that pop and latin have the lowest minimum energy levels, while all genres reach similarly high maximum energy levels.
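
The per-genre minimum and maximum tables used for the two range plots can also be built with a single summarise() call instead of mutate() followed by unique(). This is a sketch on an invented data frame `toy`, not a replacement for the code above; the result has one row per genre with the same `max` and `min` columns.

```r
library(dplyr)

# Toy stand-in for sp_songs; the values are invented for illustration only.
toy <- data.frame(
  playlist_genre     = c("pop", "pop", "rock", "rock", "rap"),
  track_energy_level = c(0.40, 0.90, 0.55, 0.99, 0.70),
  stringsAsFactors = FALSE
)

# One row per genre with the highest and lowest energy level
gen_range <- toy %>%
  group_by(playlist_genre) %>%
  summarise(max = max(track_energy_level),
            min = min(track_energy_level))

gen_range
```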

#Artist with the most tracks in each genre
sp_most_played <- sp_songs %>% group_by(playlist_genre) %>% count(track_artist) %>%
  arrange(-n) %>% top_n(1, n)
datatable(sp_most_played)

Inference: The table above lists, for each genre, the artist with the most tracks and the number of those tracks.
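
A variant of the same "top artist per genre" query is sketched below on an invented data frame `toy`. It uses slice_max(), which assumes dplyr 1.0.0 or later; the column name `n_tracks` is chosen here for illustration. With `with_ties = TRUE`, artists tied for first place in a genre are all kept.

```r
library(dplyr)

# Toy stand-in for sp_songs; the values are invented for illustration only.
toy <- data.frame(
  playlist_genre = c("pop", "pop", "pop", "rock", "rock", "rock"),
  track_artist   = c("A", "A", "B", "C", "D", "D"),
  stringsAsFactors = FALSE
)

top_artist <- toy %>%
  count(playlist_genre, track_artist, name = "n_tracks") %>%  # tracks per artist and genre
  group_by(playlist_genre) %>%
  slice_max(n_tracks, n = 1, with_ties = TRUE) %>%             # keep the largest count per genre
  ungroup()

top_artist
```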

#Finding the most popular track and its artist in each genre and subgenre
final <- sp_songs %>%
  group_by(playlist_genre, playlist_subgenre) %>%
  select(track_name, track_artist, track_popularity,
         track_danceability, track_energy_level, musical_positivity) %>%
  slice(which.max(track_popularity)) %>%
  arrange(-track_popularity)

datatable(final[with(final, order(playlist_genre, playlist_subgenre)),])

Inference: The table above lists the most popular song and its artist for each genre and for each subgenre within it.

Conclusion