Midterm Project Evaluation

1. Introduction

(1.1) Introduction

Spotify is an audio streaming provider that offers recorded music and podcasts, with a catalog of more than 70 million songs. This project analyzes data about songs streaming on Spotify to uncover interesting trends and identify the factors that contribute most to a song's popularity.

(1.2) Problem Statement

To understand the dataset and what it contains, and to determine how it can be used for our analysis.
To perform data cleaning where necessary so that the data is usable for our analysis.
To perform exploratory data analysis and understand the variables contributing to a song’s popularity.

(1.3) Approach

We will examine the properties of each variable in the dataset and clean the data where there are too many missing values or outliers.
We will then plot the popularity of a song against each of the other variables to see which variables correlate well with popularity.
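
As a rough sketch of the second step, the correlation between popularity and each numeric feature can be computed with base R's cor() before any plotting. The data frame below is a small hand-typed sample (six rows, far too few for real conclusions) that only borrows column names from the dataset; the name toy and the choice of three features are assumptions made for illustration.

```r
# Small hand-typed sample standing in for the full Spotify data.
# Only the column names match the real dataset; this is for illustration only.
toy <- data.frame(
  track_popularity = c(66, 67, 70, 60, 69, 67),
  danceability     = c(0.748, 0.726, 0.675, 0.718, 0.650, 0.675),
  energy           = c(0.916, 0.815, 0.931, 0.930, 0.833, 0.919),
  valence          = c(0.518, 0.693, 0.613, 0.277, 0.725, 0.585)
)

# Every column except the target is treated as a candidate feature.
features <- setdiff(names(toy), "track_popularity")

# Pearson correlation of each feature with popularity.
cors <- sapply(features, function(f) cor(toy[[f]], toy$track_popularity))

# Rank features by the strength (absolute value) of the correlation.
cors_sorted <- cors[order(abs(cors), decreasing = TRUE)]
print(round(cors_sorted, 3))
```

Ranking by absolute value keeps strongly negative relationships near the top, which a plain descending sort would hide.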

(1.4) Consumer Impact

Determining the most important variables affecting the popularity of songs can help creators compose new songs that reach a wider audience, much as Netflix has used user data to create better web series.
It can help companies decide which types of songs or artists to invest in so that they can maximize revenue.

2. Packages Required

(2.1 - 2.3) Packages Used

library(tidyr)    # tidyr reshapes data into a tidy format that is easy to work with
library(dplyr)    # dplyr simplifies tabular data wrangling with a small set of functions that can be combined to extract and summarize insights
library(ggplot2)  # ggplot2 is used to create visualisations
library(DT)       # DT displays tables in HTML

3. Data Preparation

(3.1) Data Source

Source: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md

Importing the data into R:

spotify <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
knitr::kable(head(spotify), align = "lccrr")

track_id	track_name	track_artist	track_popularity	track_album_id	track_album_name	track_album_release_date	playlist_name	playlist_id	playlist_genre	playlist_subgenre	danceability	energy	key	loudness	mode	speechiness	acousticness	instrumentalness	liveness	valence	tempo	duration_ms
6f807x0ima9a1j3VPbc7VN	I Don’t Care (with Justin Bieber) - Loud Luxury Remix	Ed Sheeran	66	2oCs0DGTsRO98Gh5ZSl2Cx	I Don’t Care (with Justin Bieber) [Loud Luxury Remix]	2019-06-14	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.748	0.916	6	-2.634	1	0.0583	0.1020	0.00e+00	0.0653	0.518	122.036	194754
0r7CVbZTWZgbTCYdfa2P31	Memories - Dillon Francis Remix	Maroon 5	67	63rPSO264uRjW1X5E6cWv6	Memories (Dillon Francis Remix)	2019-12-13	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.726	0.815	11	-4.969	1	0.0373	0.0724	4.21e-03	0.3570	0.693	99.972	162600
1z1Hg7Vb0AhHDiEmnDE79l	All the Time - Don Diablo Remix	Zara Larsson	70	1HoSmj2eLcsrR0vE9gThr4	All the Time (Don Diablo Remix)	2019-07-05	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.675	0.931	1	-3.432	0	0.0742	0.0794	2.33e-05	0.1100	0.613	124.008	176616
75FpbthrwQmzHlBJLuGdC7	Call You Mine - Keanu Silva Remix	The Chainsmokers	60	1nqYsOef1yKKuGOVchbsk6	Call You Mine - The Remixes	2019-07-19	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.718	0.930	7	-3.778	1	0.1020	0.0287	9.40e-06	0.2040	0.277	121.956	169093
1e8PAfcKUYoKkxPhrHqw4x	Someone You Loved - Future Humans Remix	Lewis Capaldi	69	7m7vv9wlQ4i0LFuJiE2zsQ	Someone You Loved (Future Humans Remix)	2019-03-05	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.650	0.833	1	-4.672	1	0.0359	0.0803	0.00e+00	0.0833	0.725	123.976	189052
7fvUMiyapMsRRxr07cU8Ef	Beautiful People (feat. Khalid) - Jack Wins Remix	Ed Sheeran	67	2yiy9cd2QktrNvWC2EUi0k	Beautiful People (feat. Khalid) [Jack Wins Remix]	2019-07-11	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.675	0.919	8	-5.385	1	0.1270	0.0799	0.00e+00	0.1430	0.585	124.982	163049

(3.2) About the Dataset

The dataset contains 32,833 rows and 23 columns and was collected in January 2020. Our target variable is track_popularity. The remaining columns describe each song, including song credits and audio features such as danceability, energy, and acousticness. There are 15 missing values in the data (5 each in track_name, track_artist, and track_album_name); the affected rows are removed during data cleaning.

d <- read.csv("spotify_dd.csv")
DT::datatable(d, options = list(pageLength = 50, scrollX = TRUE, autoWidth = TRUE), class = 'cell-border stripe')

Data Cleaning

(3.3) Dropping identifier columns that carry no analytical information

ids <- c("track_id","track_album_id","playlist_id") 
spotify.data <- data.frame(spotify[,!(names(spotify) %in% ids)])

Identifying missing data in the dataset and removing the rows

colSums(is.na(spotify.data))

##               track_name             track_artist         track_popularity 
##                        5                        5                        0 
##         track_album_name track_album_release_date            playlist_name 
##                        5                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0

spotify.data <- na.omit(spotify.data) 
colSums(is.na(spotify.data))

##               track_name             track_artist         track_popularity 
##                        0                        0                        0 
##         track_album_name track_album_release_date            playlist_name 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0
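
To confirm how many rows a step like na.omit() actually drops, the row counts before and after can be compared against the number of incomplete rows. The following is a minimal sketch on an invented four-row data frame (the name toy and all of its values are hypothetical), not the project's real data.

```r
# Invented example data: one row has missing text fields.
toy <- data.frame(
  track_name       = c("Song A", NA, "Song C", "Song D"),
  track_artist     = c("Artist 1", NA, "Artist 3", "Artist 4"),
  track_popularity = c(66, 0, 70, 60),
  stringsAsFactors = FALSE
)

n_before     <- nrow(toy)
n_incomplete <- sum(!complete.cases(toy))  # rows with at least one NA

toy_clean <- na.omit(toy)                  # drops every incomplete row
n_after   <- nrow(toy_clean)

cat("Rows before:", n_before, "\n")
cat("Rows dropped:", n_before - n_after, "\n")
```

Because several missing values can sit in the same row, the number of dropped rows can be smaller than the total count of missing values reported by colSums(is.na(...)).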

Converting the categorical variables ‘key’ and ‘mode’ to factors

spotify.data$key <- as.factor(spotify.data$key)
spotify.data$mode <- as.factor(spotify.data$mode)
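
If readable labels are wanted in later plots, the factor levels can be named when the factors are created. The sketch below assumes the encoding given in Spotify's audio-feature documentation (key in pitch-class notation with 0 = C, and mode with 1 = major and 0 = minor); the names toy and key_labels are hypothetical, and the key and mode values are those of the six preview rows shown earlier.

```r
# Pitch-class labels for key values 0..11 (assumed encoding: 0 = C).
key_labels <- c("C", "C#/Db", "D", "D#/Eb", "E", "F",
                "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")

# key and mode values from the six preview rows.
toy <- data.frame(
  key  = c(6, 11, 1, 7, 1, 8),
  mode = c(1, 1, 0, 1, 1, 1)
)

# Any value outside the listed levels (e.g. -1 for "no key detected")
# would become NA rather than receive a label.
toy$key  <- factor(toy$key,  levels = 0:11,    labels = key_labels)
toy$mode <- factor(toy$mode, levels = c(0, 1), labels = c("minor", "major"))

print(toy)
```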

(3.4) Final data after cleaning

knitr::kable(head(spotify.data), align = "lccrr")

track_name	track_artist	track_popularity	track_album_name	track_album_release_date	playlist_name	playlist_genre	playlist_subgenre	danceability	energy	key	loudness	mode	speechiness	acousticness	instrumentalness	liveness	valence	tempo	duration_ms
I Don’t Care (with Justin Bieber) - Loud Luxury Remix	Ed Sheeran	66	I Don’t Care (with Justin Bieber) [Loud Luxury Remix]	2019-06-14	Pop Remix	pop	dance pop	0.748	0.916	6	-2.634	1	0.0583	0.1020	0.00e+00	0.0653	0.518	122.036	194754
Memories - Dillon Francis Remix	Maroon 5	67	Memories (Dillon Francis Remix)	2019-12-13	Pop Remix	pop	dance pop	0.726	0.815	11	-4.969	1	0.0373	0.0724	4.21e-03	0.3570	0.693	99.972	162600
All the Time - Don Diablo Remix	Zara Larsson	70	All the Time (Don Diablo Remix)	2019-07-05	Pop Remix	pop	dance pop	0.675	0.931	1	-3.432	0	0.0742	0.0794	2.33e-05	0.1100	0.613	124.008	176616
Call You Mine - Keanu Silva Remix	The Chainsmokers	60	Call You Mine - The Remixes	2019-07-19	Pop Remix	pop	dance pop	0.718	0.930	7	-3.778	1	0.1020	0.0287	9.40e-06	0.2040	0.277	121.956	169093
Someone You Loved - Future Humans Remix	Lewis Capaldi	69	Someone You Loved (Future Humans Remix)	2019-03-05	Pop Remix	pop	dance pop	0.650	0.833	1	-4.672	1	0.0359	0.0803	0.00e+00	0.0833	0.725	123.976	189052
Beautiful People (feat. Khalid) - Jack Wins Remix	Ed Sheeran	67	Beautiful People (feat. Khalid) [Jack Wins Remix]	2019-07-11	Pop Remix	pop	dance pop	0.675	0.919	8	-5.385	1	0.1270	0.0799	0.00e+00	0.1430	0.585	124.982	163049

Checking for outliers and distribution for each numerical column:

Boxplots

num_cols<-c('danceability','energy','loudness','speechiness','acousticness','instrumentalness',
            'liveness','valence','tempo')

par(mfrow=c(3,3))
for (i in num_cols) {
  boxplot(spotify.data[[i]], main = sprintf('Boxplot of %s', i))
}

Histograms

par(mfrow=c(3,3))
for (i in num_cols){
  hist(x = spotify.data[[i]], 
       col="blue",
       lty=1,
       freq = FALSE,
       main = sprintf('Histogram of %s', i),
       xlab = i)
  lines(density(x = spotify.data[[i]],na.rm=TRUE), lwd=2, col='red')
}

Outliers: The box plots show a few outliers, but most of these metrics appear to be approximately normally distributed. Manipulating the outliers is therefore unlikely to add much value to the analysis.
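
To put a number on "a few outliers", the points outside the common 1.5 × IQR fences can be counted per column. This is a minimal sketch on invented values: the helper count_outliers and the data frame toy are hypothetical, and quantile() uses R's default quartile definition, which can differ slightly from the hinges that boxplot() draws.

```r
# Count values outside the 1.5 * IQR fences (hypothetical helper).
count_outliers <- function(x) {
  q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
}

# Invented values for illustration; tempo contains two extreme points.
toy <- data.frame(
  tempo    = c(122, 100, 124, 122, 124, 125, 121, 210),
  loudness = c(-2.6, -5.0, -3.4, -3.8, -4.7, -5.4, -4.1, -4.4)
)

# Apply the helper to every column of the data frame.
outlier_counts <- sapply(toy, count_outliers)
print(outlier_counts)
```

Dividing each count by the number of rows gives the share of flagged points, which is an easier basis for deciding whether outliers can safely be left in place.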

(3.5) Summary of Variables

d1 <- read.csv("summary.csv")
DT::datatable(d1, options = list(pageLength = 50, scrollX = TRUE, autoWidth = TRUE), class = 'cell-border stripe')

4. Proposed EDA

(4.1) Uncovering new information

We will run a correlation analysis to check whether the explanatory variables are correlated with one another.
We can derive a few new columns from the existing ones to check whether they explain the data better. For example, we can check whether ‘remix’ songs are more popular than other songs by creating a new remix column with values 0 or 1.
We can summarize the data by picking the most important qualitative variables, such as genre, and some numerical variables, such as energy, tempo, and danceability, to see how they affect popularity.
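
The remix idea could be sketched as follows: flag a track as a remix when its name contains the word "remix", then compare mean popularity across the two groups. The column name is_remix and the data frame toy are hypothetical; the first two rows reuse names and popularity values from the preview, while the last two rows are invented.

```r
toy <- data.frame(
  track_name = c("Memories - Dillon Francis Remix",
                 "All the Time - Don Diablo Remix",
                 "Hypothetical Song A",   # invented row
                 "Hypothetical Song B"),  # invented row
  track_popularity = c(67, 70, 72, 58),
  stringsAsFactors = FALSE
)

# 1 if the track name mentions "remix" (case-insensitive), otherwise 0.
toy$is_remix <- as.integer(grepl("remix", toy$track_name, ignore.case = TRUE))

# Mean popularity for remixes versus all other tracks.
remix_summary <- aggregate(track_popularity ~ is_remix, data = toy, FUN = mean)
print(remix_summary)
```

A name-based flag is only a heuristic: it misses remixes that are not labelled as such and would also match an original song whose title happens to contain the word.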

(4.2) Plots and Tables

We plan to create bivariate plots of popularity against the other variables to help us gauge the relationships between them.
Slicing the data by qualitative factors and plotting the results on a bar chart can show whether popularity differs noticeably between the levels of these factors.
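
One way the bar-chart idea might look with the dplyr and ggplot2 packages loaded earlier is sketched below. The data frame toy and its values are invented, and mean popularity per genre is just one possible summary; a bar chart on its own does not establish statistical significance.

```r
library(dplyr)
library(ggplot2)

# Invented values; only the column names follow the dataset.
toy <- data.frame(
  playlist_genre   = c("pop", "pop", "rock", "rock", "rap", "rap"),
  track_popularity = c(66, 70, 40, 50, 55, 65)
)

# Mean popularity and track count for each genre, highest mean first.
genre_summary <- toy %>%
  group_by(playlist_genre) %>%
  summarise(mean_popularity = mean(track_popularity),
            n_tracks = n()) %>%
  arrange(desc(mean_popularity))

# Bar chart with genres ordered by mean popularity.
p <- ggplot(genre_summary,
            aes(x = reorder(playlist_genre, -mean_popularity),
                y = mean_popularity)) +
  geom_col(fill = "steelblue") +
  labs(x = "Playlist genre", y = "Mean track popularity",
       title = "Mean popularity by genre (toy data)")
print(p)
```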

(4.3) Need to Learn

We plan to learn techniques such as feature engineering for our exploratory analysis. We are also looking at ways to classify songs as popular or not popular and to apply various ML techniques in order to gain better insight into the data.

(4.4) ML Techniques

We believe that regression analysis is appropriate for estimating or predicting the popularity of a song, since it is a numerical variable. We can also use ML classification techniques such as logistic regression and decision trees to assess popularity by splitting the target variable ‘popularity’ into ‘popular’ and ‘not popular’ based on a cutoff value.
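
Both modelling ideas can be sketched in base R with lm() and glm(). Everything in the example is an assumption made for illustration: the data are simulated rather than drawn from the Spotify dataset, only two predictors are used, and the cutoff of 50 is an arbitrary placeholder that the real analysis would need to justify.

```r
set.seed(42)
n <- 200

# Simulated predictors and a simulated popularity score clipped to 0..100.
toy <- data.frame(danceability = runif(n), energy = runif(n))
toy$track_popularity <- pmin(100, pmax(0,
  40 + 30 * toy$danceability - 10 * toy$energy + rnorm(n, sd = 10)))

# Regression: popularity treated as a numerical target.
lin_fit <- lm(track_popularity ~ danceability + energy, data = toy)
print(round(coef(lin_fit), 2))

# Classification: binarise the target with a hypothetical cutoff.
cutoff <- 50
toy$popular <- as.integer(toy$track_popularity >= cutoff)

# Logistic regression on the binary target.
log_fit <- glm(popular ~ danceability + energy, data = toy, family = binomial)

# Training-set accuracy at a 0.5 probability threshold.
pred     <- as.integer(predict(log_fit, type = "response") >= 0.5)
accuracy <- mean(pred == toy$popular)
cat("Training accuracy:", round(accuracy, 3), "\n")
```

The accuracy here is measured on the same rows used for fitting, so it is optimistic; a held-out test split would give a fairer estimate.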