Introduction

Problem statement

Music plays an important role in our lives. Learning how to analyze it broadens our knowledge and gives us more to talk about with others. The goal of this project is to use R Markdown to import a real-world data set and generate a fully reproducible HTML report. We apply a variety of data cleaning and tidying techniques before carrying out a basic exploratory data analysis. Our research question is: which characteristics of a song make it popular?

Addressing the problem statement

We will use data manipulation and data visualization tools to clean and inspect the data. Data cleaning matters because it improves the quality of the data and, in turn, the reliability of the analysis: incorrect, missing, or implausible values are corrected or removed before we draw any conclusions.

The data used:

df = readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
## Rows: 32833 Columns: 23
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (10): track_id, track_name, track_artist, track_album_id, track_album_na...
## dbl (13): track_popularity, danceability, energy, key, loudness, mode, speec...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Objective

Our analysis will help consumers understand more about popular songs. Popularity is measured by the variable track_popularity. Based on it, we plan to conduct an analysis that helps consumers:

  • find out the distribution of the popularity index.

  • identify common characteristics of popular songs or, in other words, what makes a song popular.

  • create a model to predict the popularity of a song from its current features.
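
As a rough sketch of the first objective, the distribution of a popularity score can be tabulated by binning it. The values below are made-up toy numbers, not rows from the Spotify data, and the 20-point bin width is an arbitrary choice for illustration.

```r
# Toy popularity scores (made up for illustration; not taken from the Spotify data).
popularity <- c(0, 5, 12, 24, 33, 45, 45, 52, 62, 67, 70, 88, 100)

# Bin the scores into 20-point intervals without drawing the plot.
bins <- hist(popularity, breaks = seq(0, 100, by = 20), plot = FALSE)

# One row per bin: left edge, right edge, and number of songs in the bin.
popularity_dist <- data.frame(
  from  = head(bins$breaks, -1),
  to    = tail(bins$breaks, -1),
  count = bins$counts
)
print(popularity_dist)
```

On the real data, the track_popularity column of the imported data frame would take the place of the toy vector, and dropping plot = FALSE would draw the histogram.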

Packages Required

Packages Used

The following packages are loaded for this project.

suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidyr))
suppressPackageStartupMessages(library(readr))
suppressPackageStartupMessages(library(stringr))
suppressPackageStartupMessages(library(kableExtra))
Suppressing the startup messages

The suppressPackageStartupMessages() function is used to hide the startup messages that packages print when they are loaded. It does not suppress warnings.
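
A minimal sketch of what suppressPackageStartupMessages() does and does not silence. The function noisy_load() is hypothetical and only mimics a package that emits both a startup message and a warning when loaded.

```r
# Hypothetical loader that mimics a package: it emits a startup message
# and a warning, then returns a value.
noisy_load <- function() {
  packageStartupMessage("Welcome to a hypothetical package")
  warning("this warning is not a startup message")
  invisible("loaded")
}

# Count the warnings that still get through suppressPackageStartupMessages().
warnings_seen <- 0
result <- withCallingHandlers(
  suppressPackageStartupMessages(noisy_load()),
  warning = function(w) {
    warnings_seen <<- warnings_seen + 1
    invokeRestart("muffleWarning")  # keep the console quiet after counting
  }
)

warnings_seen  # the warning was still raised
```

If warnings also need to be hidden, suppressWarnings() can be wrapped around the call, or warning=FALSE can be set in the chunk options.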

Purpose of each package

Each package serves the following purpose:

  • ggplot2: Based on The Grammar of Graphics, ggplot2 is a system for declaratively creating graphics. You provide the data, tell ggplot2 how to map variables to aesthetics and which graphical primitives to use, and it takes care of the details.

  • dplyr: dplyr is a data manipulation package that provides a consistent set of verbs for solving the most common data manipulation problems.

  • tidyr: tidyr provides a set of functions that help you get to tidy data. Tidy data has a uniform format: each variable is a column, each observation is a row, and each cell holds a single value.

  • readr: readr provides a fast and friendly way to read rectangular data (such as csv, tsv, and fwf files). It is designed to parse many types of data found in the wild while still failing cleanly when the data changes unexpectedly.

  • stringr: stringr is a collection of functions that make working with strings as simple as possible. It is built on top of stringi, which uses the ICU C library to provide fast, correct string manipulations.

  • kableExtra: kableExtra extends the basic functionality of tables produced with knitr::kable(). Because knitr::kable() is simple by design, it lacks many features commonly found in other packages, and kableExtra fills that gap. Most of its table features work for both HTML and PDF output.

Data Preparation

Data attributes

The dataset has 23 columns. Twelve of them are audio features: confidence measures like acousticness, liveness, speechiness and instrumentalness; perceptual measures like energy, loudness, danceability and valence (positiveness); and descriptors like duration, tempo, key, and mode. The remaining columns identify the track, album, and playlist and record the track's popularity.

A brief description of the variables is given below:

Importing data
ss <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
## Rows: 32833 Columns: 23
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (10): track_id, track_name, track_artist, track_album_id, track_album_na...
## dbl (13): track_popularity, danceability, energy, key, loudness, mode, speec...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

-Variable Names

names(ss)
##  [1] "track_id"                 "track_name"              
##  [3] "track_artist"             "track_popularity"        
##  [5] "track_album_id"           "track_album_name"        
##  [7] "track_album_release_date" "playlist_name"           
##  [9] "playlist_id"              "playlist_genre"          
## [11] "playlist_subgenre"        "danceability"            
## [13] "energy"                   "key"                     
## [15] "loudness"                 "mode"                    
## [17] "speechiness"              "acousticness"            
## [19] "instrumentalness"         "liveness"                
## [21] "valence"                  "tempo"                   
## [23] "duration_ms"

No variable names need to be changed, as they are consistent and easy to understand.

-Variable Types

Before going further, we need to know the data type of every variable so that each one is analyzed appropriately. We therefore used str() to inspect the type of each column and changed the type where necessary.

str(ss)
## tibble [32,833 x 23] (S3: tbl_df/tbl/data.frame)
##  $ track_id                : chr [1:32833] "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr [1:32833] "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr [1:32833] "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : num [1:32833] 66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr [1:32833] "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr [1:32833] "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr [1:32833] "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr [1:32833] "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr [1:32833] "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr [1:32833] "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr [1:32833] "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num [1:32833] 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num [1:32833] 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : num [1:32833] 6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num [1:32833] -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : num [1:32833] 1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num [1:32833] 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num [1:32833] 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num [1:32833] 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num [1:32833] 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num [1:32833] 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num [1:32833] 122 100 124 122 124 ...
##  $ duration_ms             : num [1:32833] 194754 162600 176616 169093 189052 ...
Observations
  • mode is stored as a numeric field, although it only takes the values 0 and 1 in the rows shown, so we convert it to a factor.
  • track_album_release_date is a character column, but it actually holds date values. It is important to change its data type, as we need this column in date format for the analysis.

-Modifying Data Types

ss$mode <- as.factor(ss$mode)
ss$track_album_release_date <- as.Date(ss$track_album_release_date)

-Null Values in the Dataset

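After the type conversion, the 32833 x 23 data frame contains 1901 missing values in total. Five each fall in three columns (track_artist, track_name and track_album_name), which is surprisingly few given the size of the dataset. The remaining 1886 are in track_album_release_date; since read_csv() imported that column as character, these are most likely values that as.Date() could not parse as complete dates rather than values missing from the source file.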

sum(is.na(ss))
## [1] 1901
colSums(is.na(ss))
##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                     1886                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0
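
The 1886 missing values in track_album_release_date are worth a closer look. One plausible explanation, stated here as an assumption rather than something verified against the file, is that some release dates are partial (a year only, or a year and month), which a strict date conversion turns into NA. The sketch below uses made-up strings and a hypothetical helper, pad_partial_date(), to show the effect and one possible workaround.

```r
# Toy release dates (made up): a full date, a year-only value, a year-month value.
release_raw <- c("2019-06-14", "2012", "2012-05")

# A strict conversion keeps only the complete date; the rest become NA.
strict <- as.Date(release_raw, format = "%Y-%m-%d")

# Hypothetical helper: pad partial values to the first day of the year or
# month before converting (assumes partial values look like "YYYY" or "YYYY-MM").
pad_partial_date <- function(x) {
  x <- ifelse(nchar(x) == 4, paste0(x, "-01-01"), x)  # "2012"    -> "2012-01-01"
  x <- ifelse(nchar(x) == 7, paste0(x, "-01"), x)     # "2012-05" -> "2012-05-01"
  as.Date(x, format = "%Y-%m-%d")
}
padded <- pad_partial_date(release_raw)

sum(is.na(strict))  # values lost by the strict conversion
sum(is.na(padded))  # values lost after padding
```

Padding invents a day (and possibly a month), so it suits analyses by year better than analyses by exact date.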

-Missing Value Treatment

As seen above, three columns (track_artist, track_album_name and track_name) have 5 missing values each.

We imputed these missing values with the character constant 'NA'. We do not delete these rows because they still carry useful information in the other columns, which we can use in our EDA. The 1886 missing values in track_album_release_date are left untreated at this stage.

ss$track_artist[is.na(ss$track_artist)] <- 'NA'
ss$track_album_name[is.na(ss$track_album_name)] <- 'NA'
ss$track_name[is.na(ss$track_name)] <- 'NA'
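
One side effect of imputing with the string 'NA' is that is.na() no longer reports those rows. The sketch below, on a made-up vector, contrasts this with an explicit label; the label "Unknown" is an illustrative alternative, not the choice made in this report.

```r
# Toy artist column (made up) with one missing value.
artist <- c("Ed Sheeran", NA, "Maroon 5")

# Replacing with the string "NA" hides the gap from is.na().
as_string_na <- ifelse(is.na(artist), "NA", artist)

# An explicit label keeps imputed rows easy to spot in tables and plots.
as_unknown <- ifelse(is.na(artist), "Unknown", artist)

sum(is.na(as_string_na))      # missing values reported after imputing with "NA"
sum(as_unknown == "Unknown")  # imputed rows that can still be counted
```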
Generating summary

In this part, let's look at the numerical summary of all numeric variables.

summary(select_if(ss,is.numeric))
##  track_popularity  danceability        energy              key        
##  Min.   :  0.00   Min.   :0.0000   Min.   :0.000175   Min.   : 0.000  
##  1st Qu.: 24.00   1st Qu.:0.5630   1st Qu.:0.581000   1st Qu.: 2.000  
##  Median : 45.00   Median :0.6720   Median :0.721000   Median : 6.000  
##  Mean   : 42.48   Mean   :0.6548   Mean   :0.698619   Mean   : 5.374  
##  3rd Qu.: 62.00   3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.: 9.000  
##  Max.   :100.00   Max.   :0.9830   Max.   :1.000000   Max.   :11.000  
##     loudness        speechiness      acousticness    instrumentalness   
##  Min.   :-46.448   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000  
##  1st Qu.: -8.171   1st Qu.:0.0410   1st Qu.:0.0151   1st Qu.:0.0000000  
##  Median : -6.166   Median :0.0625   Median :0.0804   Median :0.0000161  
##  Mean   : -6.720   Mean   :0.1071   Mean   :0.1753   Mean   :0.0847472  
##  3rd Qu.: -4.645   3rd Qu.:0.1320   3rd Qu.:0.2550   3rd Qu.:0.0048300  
##  Max.   :  1.275   Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000  
##     liveness         valence           tempo         duration_ms    
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  4000  
##  1st Qu.:0.0927   1st Qu.:0.3310   1st Qu.: 99.96   1st Qu.:187819  
##  Median :0.1270   Median :0.5120   Median :121.98   Median :216000  
##  Mean   :0.1902   Mean   :0.5106   Mean   :120.88   Mean   :225800  
##  3rd Qu.:0.2480   3rd Qu.:0.6930   3rd Qu.:133.92   3rd Qu.:253585  
##  Max.   :0.9960   Max.   :0.9910   Max.   :239.44   Max.   :517810

From the descriptive statistics of the numeric variables above, we see that for some variables the mean is not close to the median, which indicates skewness in the data. For example, instrumentalness has a mean of 0.0847 but a median of 0.0000161.

To check further whether the variables contain outliers, we plot their distributions using boxplots (in the Visual Summary section).
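
The gap between mean and median can be turned into a number. The sketch below defines a moment-based skewness function (a standard formula, written here by hand rather than taken from a package) and applies it to made-up values with many near-zero entries and a long right tail.

```r
# Moment-based sample skewness: positive values indicate a long right tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Toy right-skewed values (made up): mostly near zero with a few large entries.
x <- c(0, 0, 0, 0, 0.001, 0.002, 0.01, 0.3, 0.9)

c(mean = mean(x), median = median(x), skewness = skewness(x))
```

A mean well above the median together with a positive skewness points to a right-skewed variable; a mean below the median with negative skewness points to a left-skewed one.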

-Visual Summary

-Generating boxplots

boxplot(ss$danceability, main = 'Boxplot distribution of Danceability')

boxplot(ss$loudness, main = 'Boxplot distribution of loudness')

boxplot(ss$tempo , main = 'Boxplot distribution of tempo')

Box plots are helpful for outlier detection. In the numerical summary above, we observed that a few columns have the mean pulled towards one side due to outliers or skewness. Here we check the boxplots of these variables to identify outliers and then treat them.

Outlier Detection and Treatment

From the boxplots we see that danceability has one value at 0, which stands apart from the rest of the values. Similarly, loudness has one very low value (about -46), and tempo has one value that is much higher and one that is much lower than the majority of data points.

We remove these records and then visualize the dataset again to see how the distributions change.

-Trimming Outliers

df_2 <- subset(ss, danceability > min(danceability) & loudness > min(loudness) & tempo > min(tempo) & tempo < max(tempo))
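
The subset() call above drops every row that ties the minimum or maximum of a column. A common alternative, which is not the method used in this report, is the 1.5 x IQR rule that boxplots use to draw their whiskers. A minimal sketch on made-up tempo values:

```r
# Toy tempo values (made up) with one very low and one very high value.
tempo <- c(0, 95, 100, 110, 120, 122, 124, 128, 134, 140, 239)

# boxplot.stats() returns the points lying beyond the whiskers
# (more than 1.5 times the hinge spread away from the box by default).
flagged <- grDevices::boxplot.stats(tempo)$out

# Keep only the values that were not flagged.
tempo_trimmed <- tempo[!tempo %in% flagged]

flagged
length(tempo_trimmed)
```

The rule flags points by their distance from the bulk of the data rather than by rank, so it can flag several points on one side or none at all.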

-Visualizing the distributions again

boxplot(df_2$danceability, main = 'Distribution of Danceability')

boxplot(df_2$loudness, main = 'Distribution of loudness')

boxplot(df_2$tempo , main = 'Distribution of tempo')

Thus, in data cleaning, we checked the variable types, imputed the missing values, examined the numerical summaries, and detected and treated the outliers.

-Data in the most condensed form possible

The below table shows a glimpse of the final cleaned dataset.

knitr::kable(head(df_2,5), "simple")
| track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6f807x0ima9a1j3VPbc7VN | I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 66 | 2oCs0DGTsRO98Gh5ZSl2Cx | I Don’t Care (with Justin Bieber) [Loud Luxury Remix] | 2019-06-14 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194754 |
| 0r7CVbZTWZgbTCYdfa2P31 | Memories - Dillon Francis Remix | Maroon 5 | 67 | 63rPSO264uRjW1X5E6cWv6 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162600 |
| 1z1Hg7Vb0AhHDiEmnDE79l | All the Time - Don Diablo Remix | Zara Larsson | 70 | 1HoSmj2eLcsrR0vE9gThr4 | All the Time (Don Diablo Remix) | 2019-07-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176616 |
| 75FpbthrwQmzHlBJLuGdC7 | Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | 1nqYsOef1yKKuGOVchbsk6 | Call You Mine - The Remixes | 2019-07-19 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.718 | 0.930 | 7 | -3.778 | 1 | 0.1020 | 0.0287 | 9.40e-06 | 0.2040 | 0.277 | 121.956 | 169093 |
| 1e8PAfcKUYoKkxPhrHqw4x | Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | 7m7vv9wlQ4i0LFuJiE2zsQ | Someone You Loved (Future Humans Remix) | 2019-03-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.650 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.00e+00 | 0.0833 | 0.725 | 123.976 | 189052 |

Proposed Exploratory Data Analysis

The dataset contains information about each song, such as track name, artist name, danceability, key, acousticness, speechiness, tempo, liveness, valence and popularity, along with other factors that can help us determine whether a song can be classified as a hit.

We would like to know which song parameters appeal most to the audience: what types of music are often popular, and what sort of music we might enjoy.

For the exploratory data analysis, we will analyze different variables statistically.

Analysis includes:

Top 10 Songs vs all of the dataset

We will compare the top 10 songs with the rest of the dataset. This is done by sorting on the popularity column. We will then check which features have an impact. We can use box plots to compare the two groups.
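
A minimal sketch of this comparison, assuming a data frame with a track_popularity column and one audio feature. The 15 rows below are made up for illustration and do not come from the Spotify data.

```r
# Toy data (made up): 15 songs with a popularity score and one audio feature.
songs <- data.frame(
  track_popularity = c(90, 85, 80, 78, 75, 72, 70, 68, 66, 64, 40, 35, 30, 20, 10),
  danceability     = c(0.80, 0.75, 0.70, 0.72, 0.68, 0.74, 0.66, 0.71, 0.69, 0.73,
                       0.50, 0.55, 0.45, 0.60, 0.40)
)

# Sort by popularity (highest first) and label the first 10 rows as the top group.
songs <- songs[order(-songs$track_popularity), ]
songs$group <- ifelse(seq_len(nrow(songs)) <= 10, "top 10", "rest")

# Compare the feature between the two groups by its median.
group_medians <- tapply(songs$danceability, songs$group, median)
print(group_medians)
```

Calling boxplot(danceability ~ group, data = songs) would show the same comparison visually. On real data, ties in popularity at the cut-off would need a rule for which songs count as the top 10.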

Energy versus Danceability/Valence and Liveness

Looking at the data, we can intuitively say that these features might have an impact on popularity. Analyzing them more deeply might give us further insights.
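
One simple way to start, sketched here on made-up numbers, is to compute the correlation of each feature with popularity. The column names follow the dataset, but the values are invented, so the resulting correlations say nothing about the real songs.

```r
# Toy data (made up): five songs with popularity and four audio features.
songs <- data.frame(
  track_popularity = c(20, 35, 50, 65, 80),
  energy           = c(0.90, 0.70, 0.80, 0.60, 0.50),
  danceability     = c(0.40, 0.50, 0.60, 0.70, 0.80),
  valence          = c(0.30, 0.50, 0.40, 0.60, 0.70),
  liveness         = c(0.10, 0.30, 0.20, 0.15, 0.25)
)

# Pearson correlation of each feature with popularity.
features <- c("energy", "danceability", "valence", "liveness")
feature_cor <- sapply(songs[features], cor, y = songs$track_popularity)
print(round(feature_cor, 2))
```

Correlation only captures linear association, so scatter plots of each pair remain useful alongside these numbers.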

Predictive modelling

Popularity (i.e., how many times a song is played) plays a key role for a song. As a result, we would like to apply what we have learned about regression so far and model the variables so that we can use them to forecast the likelihood of a song becoming a hit.
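
A minimal sketch of such a model, assuming a linear regression of track_popularity on two audio features. The choice of predictors and all values below are illustrative assumptions, not results from this analysis.

```r
# Toy data (made up): six songs with popularity and two audio features.
songs <- data.frame(
  track_popularity = c(30, 42, 55, 61, 70, 78),
  danceability     = c(0.40, 0.50, 0.60, 0.65, 0.72, 0.80),
  energy           = c(0.90, 0.60, 0.80, 0.50, 0.70, 0.55)
)

# Fit a linear model of popularity on the two features.
fit <- lm(track_popularity ~ danceability + energy, data = songs)

# Predict the popularity of a new, hypothetical song.
new_song  <- data.frame(danceability = 0.70, energy = 0.60)
predicted <- predict(fit, newdata = new_song)

coef(fit)
predicted
```

To forecast whether a song becomes a hit rather than its popularity score, the outcome could instead be a yes/no variable (for example, popularity above a chosen threshold) modelled with logistic regression via glm(..., family = binomial).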