Music plays an important role in our lives, and learning how to analyze it broadens our knowledge and gives us more to talk about with others. The goal of this project is to use R Markdown to import a real-world data set and generate a fully reproducible HTML report. We apply a variety of data cleaning and tidying techniques before carrying out a basic exploratory data analysis. Our research question is: which characteristics of a song make it popular?
We will use data manipulation and data visualization tools to clean the data. Data cleaning matters because it improves the quality of the data and, as a result, the reliability of any analysis built on it. Cleaning lets us find and handle missing, inconsistent, or incorrect values before they distort the results.
The data used:
df = readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
## Rows: 32833 Columns: 23
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (10): track_id, track_name, track_artist, track_album_id, track_album_na...
## dbl (13): track_popularity, danceability, energy, key, loudness, mode, speec...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Our analysis will help consumers understand more about popular songs. Popularity is evaluated using the variable track_popularity. Based on that, we plan to conduct analyses that help the consumer:
find out the distribution of the popularity index.
identify the common characteristics of popular songs, in other words, what makes a song popular.
create a model to predict the popularity of songs from their current features.
The following packages are loaded for this project.
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidyr))
suppressPackageStartupMessages(library(readr))
suppressPackageStartupMessages(library(stringr))
suppressPackageStartupMessages(library(kableExtra))
suppressPackageStartupMessages() is used to hide the messages that packages print when they are attached; it does not suppress warnings.
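A minimal sketch of what suppressPackageStartupMessages() does and does not silence. The function noisy_load() is hypothetical (it is not part of any package) and simply emits a package startup message followed by a warning, so that the two can be told apart.

```r
# Hypothetical stand-in for a package being attached: it emits a startup
# message and then a warning.
noisy_load <- function() {
  packageStartupMessage("Attaching package: demo")
  warning("this warning is still raised")
  invisible(TRUE)
}

# The startup message is muffled, but the warning passes through and is
# caught by tryCatch(), which returns its text.
caught <- tryCatch(
  suppressPackageStartupMessages(noisy_load()),
  warning = function(w) conditionMessage(w)
)
print(caught)
```

If warnings also need to be hidden in the report, that is a separate decision, usually made per chunk with the warning=FALSE chunk option.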
These packages serve the following purposes:
ggplot2: Based on The Grammar of Graphics, ggplot2 is a system for declaratively constructing graphics. You give ggplot2 the data, tell it how to map variables to aesthetics and which graphical primitives to use, and it does the rest.
dplyr: dplyr is a data manipulation package that provides a consistent collection of verbs to tackle the most frequent data manipulation problems.
tidyr: tidyr provides a series of functions to assist you in obtaining tidy data. Tidy data has a uniform format: in a nutshell, each variable belongs in a column, and each column is a variable.
readr: readr is a tool for reading rectangular data that is both quick and easy to use (like csv, tsv, and fwf). It’s built to parse a wide range of data formats found in the world while also failing cleanly when the data changes unexpectedly.
stringr: stringr is a collection of functions that make working with strings as simple as possible. It is built on top of stringi, which uses the ICU C library to deliver quick and accurate string manipulations.
kableExtra: The kableExtra package extends the basic functionality of tables produced using knitr::kable(). Since knitr::kable() is simple by design, it lacks many features commonly seen in other packages, and kableExtra fills that gap. Most of its table features work for both HTML and PDF formats.
The data set has 23 columns. Twelve of them are audio features: confidence measures like acousticness, liveness, speechiness and instrumentalness, perceptual measures like energy, loudness, danceability and valence (positiveness), and descriptors like duration, tempo, key, and mode. The remaining columns identify the track, album, and playlist and record the track's popularity.
The variables are described below:
ss <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
## Rows: 32833 Columns: 23
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (10): track_id, track_name, track_artist, track_album_id, track_album_na...
## dbl (13): track_popularity, danceability, energy, key, loudness, mode, speec...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
-Variable Names
names(ss)
## [1] "track_id" "track_name"
## [3] "track_artist" "track_popularity"
## [5] "track_album_id" "track_album_name"
## [7] "track_album_release_date" "playlist_name"
## [9] "playlist_id" "playlist_genre"
## [11] "playlist_subgenre" "danceability"
## [13] "energy" "key"
## [15] "loudness" "mode"
## [17] "speechiness" "acousticness"
## [19] "instrumentalness" "liveness"
## [21] "valence" "tempo"
## [23] "duration_ms"
No variable names need to change, as they are consistent and easy to understand.
-Variable Types
Before moving on, we need to understand the data type of every variable in the data set so that the later analysis is done correctly. We therefore used str() to inspect the data type of each column and changed the data type wherever necessary.
str(ss)
## tibble [32,833 x 23] (S3: tbl_df/tbl/data.frame)
## $ track_id : chr [1:32833] "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr [1:32833] "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr [1:32833] "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : num [1:32833] 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr [1:32833] "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr [1:32833] "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr [1:32833] "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_name : chr [1:32833] "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr [1:32833] "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr [1:32833] "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr [1:32833] "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num [1:32833] 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num [1:32833] 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : num [1:32833] 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num [1:32833] -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : num [1:32833] 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num [1:32833] 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num [1:32833] 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num [1:32833] 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num [1:32833] 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num [1:32833] 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num [1:32833] 122 100 124 122 124 ...
## $ duration_ms : num [1:32833] 194754 162600 176616 169093 189052 ...
mode is a numeric field, although it is a binary indicator, so we convert it to a factor. track_album_release_date is a character column, but it actually holds date values. It is important to change its data type, as we need this column in date format for analysis.
-Modifying Data Types
ss$mode <- as.factor(ss$mode)
ss$track_album_release_date <- as.Date(ss$track_album_release_date)
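The sketch below shows, on a toy vector, how as.Date() treats entries that are not complete year-month-day dates. The example values and the helper pad_partial_date() are assumptions made for illustration; padding to the first day of the period is one possible choice, not something the data set prescribes.

```r
# Toy release dates (assumed for illustration): one full date, one year-only
# entry, and one year-month entry.
release <- c("2019-06-14", "2012", "2015-03")

# as.Date() settles on the "%Y-%m-%d" format from the first element, so the
# partial entries come back as NA.
direct <- as.Date(release)

# Hypothetical helper: pad partial dates to the first day of the period.
# This padding is an assumption, not something the data set specifies.
pad_partial_date <- function(x) {
  x <- ifelse(nchar(x) == 4, paste0(x, "-01-01"), x)  # "YYYY"    -> "YYYY-01-01"
  x <- ifelse(nchar(x) == 7, paste0(x, "-01"), x)     # "YYYY-MM" -> "YYYY-MM-01"
  as.Date(x)
}
padded <- pad_partial_date(release)

print(is.na(direct))
print(padded)
```

Whether padding is appropriate depends on the analysis: it keeps the release year usable but invents a month and day, so a year-only column may be the safer derived variable.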
-Null Values in the Dataset
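There are 1901 missing values in total in the 32833 x 23 data frame. Fifteen of them are spread evenly across three text columns (track_artist, track_name and track_album_name, 5 each), which is surprisingly few for a data set of this size. The remaining 1886 are in track_album_release_date; these appear to have been introduced by the as.Date() conversion above, which returns NA for any entry that is not a complete year-month-day date.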
sum(is.na(ss))
## [1] 1901
colSums(is.na(ss))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 1886 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
-Missing Value Treatment
As seen above, three columns (track_artist, track_album_name and track_name) have 5 missing values each.
We replace these missing values with the character constant 'NA'. We do not drop these rows because they still carry a lot of information in the other columns that we can use for our EDA. The missing values in track_album_release_date are left as they are.
ss$track_artist[is.na(ss$track_artist)] <- 'NA'
ss$track_album_name[is.na(ss$track_album_name)] <- 'NA'
ss$track_name[is.na(ss$track_name)] <- 'NA'
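A small sketch of the effect of this replacement, using a toy data frame that is assumed for illustration. It shows that once a missing entry is overwritten with the string 'NA', is.na() no longer reports it.

```r
# Toy data frame (assumed for illustration) with one missing artist.
toy <- data.frame(
  track_name   = c("Song A", "Song B", "Song C"),
  track_artist = c("Artist 1", NA, "Artist 3"),
  stringsAsFactors = FALSE
)

missing_before <- colSums(is.na(toy))

# Same replacement as in the report: overwrite missing entries with the
# literal string 'NA'.
toy$track_artist[is.na(toy$track_artist)] <- "NA"

# The string "NA" is ordinary text, so is.na() no longer flags it.
missing_after <- colSums(is.na(toy))

print(missing_before)
print(missing_after)
```

Because the text 'NA' looks identical to a true missing value when printed, a more distinctive label such as 'Unknown' may be easier to work with later; that is a design suggestion rather than part of the original analysis.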
For this part, let's look at the numerical summary of each numeric variable.
summary(select_if(ss,is.numeric))
## track_popularity danceability energy key
## Min. : 0.00 Min. :0.0000 Min. :0.000175 Min. : 0.000
## 1st Qu.: 24.00 1st Qu.:0.5630 1st Qu.:0.581000 1st Qu.: 2.000
## Median : 45.00 Median :0.6720 Median :0.721000 Median : 6.000
## Mean : 42.48 Mean :0.6548 Mean :0.698619 Mean : 5.374
## 3rd Qu.: 62.00 3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000
## Max. :100.00 Max. :0.9830 Max. :1.000000 Max. :11.000
## loudness speechiness acousticness instrumentalness
## Min. :-46.448 Min. :0.0000 Min. :0.0000 Min. :0.0000000
## 1st Qu.: -8.171 1st Qu.:0.0410 1st Qu.:0.0151 1st Qu.:0.0000000
## Median : -6.166 Median :0.0625 Median :0.0804 Median :0.0000161
## Mean : -6.720 Mean :0.1071 Mean :0.1753 Mean :0.0847472
## 3rd Qu.: -4.645 3rd Qu.:0.1320 3rd Qu.:0.2550 3rd Qu.:0.0048300
## Max. : 1.275 Max. :0.9180 Max. :0.9940 Max. :0.9940000
## liveness valence tempo duration_ms
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 4000
## 1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.: 99.96 1st Qu.:187819
## Median :0.1270 Median :0.5120 Median :121.98 Median :216000
## Mean :0.1902 Mean :0.5106 Mean :120.88 Mean :225800
## 3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:133.92 3rd Qu.:253585
## Max. :0.9960 Max. :0.9910 Max. :239.44 Max. :517810
From the descriptive statistics of the numeric variables above, we see that for some variables the mean is not very close to the median, which suggests skewness in the data.
To check whether these variables also contain outliers, we plot their distributions using box plots (in the Visual Summary section).
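The mean-versus-median comparison can be made explicit with a small helper. The sketch below uses a toy vector and a hypothetical function mean_median_gap(), both assumed for illustration; a large gap relative to the spread of the data hints at skewness or outliers.

```r
# Toy right-skewed variable (assumed for illustration): one large value
# drags the mean upward while the median stays put.
x <- c(1, 2, 2, 3, 3, 3, 4, 50)

# Hypothetical helper: mean, median, and their gap for one numeric vector.
mean_median_gap <- function(v) {
  c(mean = mean(v), median = median(v), gap = mean(v) - median(v))
}

gap <- mean_median_gap(x)
print(gap)
```

Applied to the report's data, the same helper could be run over every numeric column, for example with sapply(ss[sapply(ss, is.numeric)], mean_median_gap).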
-Visual Summary
-Generating boxplots
boxplot(ss$danceability, main = 'Boxplot distribution of Danceability')
boxplot(ss$loudness, main = 'Boxplot distribution of loudness')
boxplot(ss$tempo , main = 'Boxplot distribution of tempo')
Box plots are helpful for outlier detection. In the numerical summary above, a few columns have the mean pulled to one side by outliers or skewness, so we check the box plots of these variables to identify outliers and treat them.
From the box plots we see that danceability has one value at 0 that stands apart from the rest of the data. Similarly, loudness has one very low value (about -46), and tempo has one value that is far higher and one that is far lower than the majority of data points.
We remove these records from the data set and then visualize it again to see the change in the distributions.
-Trimming Outliers
df_2 <- subset(ss, danceability > min(danceability) & loudness > min(loudness) & tempo > min(tempo) & tempo < max(tempo))
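The trimming above drops the rows sitting at the minimum or maximum of each variable. A different technique, sketched here as an alternative and not as the method used in this report, is to flag the points that boxplot() itself draws beyond the whiskers, which boxplot.stats() returns in its out element. The toy data and the helper is_boxplot_outlier() are assumptions made for illustration.

```r
# Toy data (assumed for illustration): tempo has one extreme value.
toy <- data.frame(
  id    = 1:7,
  tempo = c(10, 12, 11, 13, 12, 11, 100)
)

# Hypothetical helper: TRUE for values that boxplot() would draw as points
# beyond the whiskers (more than 1.5 times the hinge spread from the box).
is_boxplot_outlier <- function(x) {
  x %in% boxplot.stats(x)$out
}

flagged <- toy$tempo[is_boxplot_outlier(toy$tempo)]
trimmed <- toy[!is_boxplot_outlier(toy$tempo), ]

print(flagged)
print(nrow(trimmed))
```

On real data this rule usually flags many more points than the min/max approach, so the flagged values should be inspected before any rows are removed.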
-Visualizing the distributions again
boxplot(df_2$danceability, main = 'Distribution of Danceability')
boxplot(df_2$loudness, main = 'Distribution of loudness')
boxplot(df_2$tempo , main = 'Distribution of tempo')
Thus, in data cleaning we checked the variable types, imputed the missing values, reviewed the numerical summaries, and detected and treated the outliers.
-Data in the most condensed form possible
The below table shows a glimpse of the final cleaned dataset.
knitr::kable(head(df_2,5), "simple")
| track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6f807x0ima9a1j3VPbc7VN | I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 66 | 2oCs0DGTsRO98Gh5ZSl2Cx | I Don’t Care (with Justin Bieber) [Loud Luxury Remix] | 2019-06-14 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194754 |
| 0r7CVbZTWZgbTCYdfa2P31 | Memories - Dillon Francis Remix | Maroon 5 | 67 | 63rPSO264uRjW1X5E6cWv6 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162600 |
| 1z1Hg7Vb0AhHDiEmnDE79l | All the Time - Don Diablo Remix | Zara Larsson | 70 | 1HoSmj2eLcsrR0vE9gThr4 | All the Time (Don Diablo Remix) | 2019-07-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176616 |
| 75FpbthrwQmzHlBJLuGdC7 | Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | 1nqYsOef1yKKuGOVchbsk6 | Call You Mine - The Remixes | 2019-07-19 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.718 | 0.930 | 7 | -3.778 | 1 | 0.1020 | 0.0287 | 9.40e-06 | 0.2040 | 0.277 | 121.956 | 169093 |
| 1e8PAfcKUYoKkxPhrHqw4x | Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | 7m7vv9wlQ4i0LFuJiE2zsQ | Someone You Loved (Future Humans Remix) | 2019-03-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.650 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.00e+00 | 0.0833 | 0.725 | 123.976 | 189052 |
The data set contains information about each song, such as track name, artist name, danceability, key, acousticness, speechiness, tempo, liveness, valence and popularity, along with other factors that can help us determine whether a song can be classified as a hit.
We would like to know which song parameters appeal most to the audience: what styles of music are often popular, and what sort of music we might enjoy.
For the exploratory data analysis, we will perform statistical analysis of the different variables.
Analysis includes:
The data set has many measures on songs. At first glance, danceability, energy and valence seem likely to be strongly associated with song popularity. There are also very specific measures that are hard to understand for anyone not deeply into music; for instance, acousticness, liveness, and speechiness are technical terms most of us are not familiar with. Some of these measures may be correlated, and analyzing the correlations between them would give us meaningful insights.
We will compare the top 10 songs with the rest of the data set by sorting on the popularity column. Then we will check which features have an impact. We can use box plots to compare the songs.
Intuitively, these features might have an impact on popularity, and analyzing them more deeply might give us further insights.
By analyzing this data we can assess which features are responsible for making songs popular; based on this, we can improve the suggested playlist.
On top of that, we can also perform track analysis (i.e. which particular words make a song popular).
Popularity plays a key role for a song (i.e. how many times a song is played). As a result, we would like to apply what we have learned about regression so far and model the variables so that we can forecast the likelihood of a song becoming a hit.
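The correlation and regression steps in the plan above could look roughly like the sketch below. It is not the model fitted in this project: the data are synthetic, the generating coefficients are made up, and the choice of danceability, energy and valence as predictors simply follows the intuition stated in the plan.

```r
set.seed(1)  # reproducible toy data

# Synthetic stand-in for the cleaned data (assumed for illustration only);
# the coefficients used to generate popularity are made up.
n <- 200
toy <- data.frame(
  danceability = runif(n),
  energy       = runif(n),
  valence      = runif(n)
)
toy$track_popularity <- 40 + 20 * toy$danceability - 10 * toy$energy +
  rnorm(n, sd = 5)

# Pairwise correlations between popularity and the audio features.
cors <- cor(toy)

# Linear regression of popularity on the three features.
fit <- lm(track_popularity ~ danceability + energy + valence, data = toy)

print(round(cors["track_popularity", ], 2))
print(coef(fit))
```

With the real data, the same two calls would be run on the cleaned data frame, and summary(fit) would show which features are associated with popularity once the others are held fixed.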