Almost everyone listens to music throughout the day, and I am hooked on it too: I need music no matter what I am doing. My taste is eclectic, ranging from high-tempo dance music to sweet, mellow acoustic songs. Learning more about music and how to analyze it broadens our knowledge and gives us more to talk about with others. The goal of this project is to use R Markdown to import a real-world data set and generate a fully reproducible HTML report. Several data cleaning and tidying techniques are applied before a basic exploratory data analysis. The research question is: which characteristics of a song make it popular? This analysis should give us a much better understanding of listening tastes and habits.
To clean the data, we’ll employ a variety of data manipulation and visualization tools. Data cleansing is vital since it enhances the quality of our data and so boosts our overall productivity. When we clean our data, we eliminate all of the obsolete or erroneous information, leaving us with only the best data accessible.
The data used:
df = readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
## Rows: 32833 Columns: 23
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (10): track_id, track_name, track_artist, track_album_id, track_album_na...
## dbl (13): track_popularity, danceability, energy, key, loudness, mode, speec...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
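The column-specification message above can be avoided by declaring the column types up front. The following is a minimal sketch on a tiny inline CSV, assuming readr 2.0 or later (for literal data via I()); the two rows are made up for illustration and only the column names come from the data set.

```r
library(readr)

# Tiny inline CSV standing in for the real file (values are illustrative only).
csv_text <- I("track_id,track_popularity,track_album_release_date\nabc,66,2019-06-14\ndef,67,2012\n")

# Declaring col_types pins the types and silences the specification message.
songs <- read_csv(
  csv_text,
  col_types = cols(
    track_id = col_character(),
    track_popularity = col_double(),
    # Kept as text here: a value like "2012" is not a full date.
    track_album_release_date = col_character()
  )
)

str(songs)
```

Passing show_col_types = FALSE instead would only hide the message and still let readr guess the types.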
Our investigation should give consumers a better understanding of popular songs. A song's popularity is measured with the track_popularity variable. Based on this information, we want to:
ascertain the distribution of the popularity index.
identify the attributes of popular songs, i.e. what makes a song popular.
develop a model to forecast song popularity based on current characteristics.
The following packages are used in this project.
#install.packages("plotly")
#install.packages("factoextra")
#install.packages("gridExtra")
#install.packages("cowplot")
#install.packages("wordcloud")
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 4.1.2
## Loading required package: RColorBrewer
#install.packages("RColorBrewer")
library(RColorBrewer)
#install.packages("wordcloud2")
library(wordcloud2)
## Warning: package 'wordcloud2' was built under R version 4.1.2
library(gridExtra)
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.1.2
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(RColorBrewer))
suppressPackageStartupMessages(library(wordcloud))
suppressPackageStartupMessages(library(wordcloud2))
suppressPackageStartupMessages(library(factoextra))
suppressPackageStartupMessages(library(gridExtra))
suppressPackageStartupMessages(library(plotly))
## Warning: package 'plotly' was built under R version 4.1.2
suppressPackageStartupMessages(library(cowplot))
## Warning: package 'cowplot' was built under R version 4.1.2
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidyr))
suppressPackageStartupMessages(library(readr))
suppressPackageStartupMessages(library(stringr))
suppressPackageStartupMessages(library(kableExtra))
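suppressPackageStartupMessages() is used to hide the messages that packages print when they are loaded. It does not hide warnings, which is why the "built under R version 4.1.2" warnings still appear above.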
The following packages will be used for this project:
ggplot2: Based on The Grammar of Graphics, ggplot2 is a system for declaratively constructing graphics. You give ggplot2 the data, tell it how to map variables to aesthetics and which graphical primitives to use, and it does the rest.
dplyr: dplyr is a data manipulation package that provides a consistent collection of verbs to tackle the most frequent data manipulation problems.
tidyr: tidyr provides a set of functions that help you get to tidy data. Tidy data has a consistent form: in brief, every variable goes in a column, and every column is a variable.
readr: readr is a tool for reading rectangular data that is both quick and easy to use (like csv, tsv, and fwf). It’s built to parse a wide range of data formats found in the world while also failing cleanly when the data changes unexpectedly.
stringr: stringr is a collection of functions that make working with strings as simple as possible. It is built on top of stringi, which uses the ICU C library to deliver fast and correct string manipulations.
kableExtra : The kableExtra package is designed to extend the basic functionality of tables produced using knitr::kable(). Since knitr::kable() is simple by design, it definitely has a lot of missing features that are commonly seen in other packages, and kableExtra has filled the gap perfectly. The most amazing thing about kableExtra is that most of its table features work for both HTML and PDF formats.
factoextra : factoextra is an R package that makes it easy to extract and visualize the output of exploratory multivariate data analyses.
corrplot : R package corrplot provides a visual exploratory tool on correlation matrix that supports automatic variable reordering to help detect hidden patterns among variables.
GGally : ggplot2 is a plotting system for R based on the grammar of graphics. GGally extends ggplot2 by adding several functions to reduce the complexity of combining geoms with transformed data. Some of these functions include a pairwise plot matrix, a scatterplot plot matrix, a parallel coordinates plot, a survival plot, and several functions to plot networks.
RColorBrewer : RColorBrewer provides ready-made color palettes for graphs. Depending on the palette, the maximum number of colors is between 8 and 12.
cowplot : The cowplot package is a simple add-on to ggplot2. It provides various features that help with creating publication-quality figures, such as a set of themes, functions to align plots and arrange them into complex compound figures, and functions that make it easy to annotate plots and/or mix plots with images.
The data set has 23 columns. Besides track, album and playlist metadata, they include audio features: confidence measures like acousticness, liveness, speechiness and instrumentalness, perceptual measures like energy, loudness, danceability and valence (positiveness), and descriptors like duration, tempo, key, and mode.
The variables are listed below:
ss <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
## Rows: 32833 Columns: 23
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (10): track_id, track_name, track_artist, track_album_id, track_album_na...
## dbl (13): track_popularity, danceability, energy, key, loudness, mode, speec...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
-Variable Names
names(ss)
## [1] "track_id" "track_name"
## [3] "track_artist" "track_popularity"
## [5] "track_album_id" "track_album_name"
## [7] "track_album_release_date" "playlist_name"
## [9] "playlist_id" "playlist_genre"
## [11] "playlist_subgenre" "danceability"
## [13] "energy" "key"
## [15] "loudness" "mode"
## [17] "speechiness" "acousticness"
## [19] "instrumentalness" "liveness"
## [21] "valence" "tempo"
## [23] "duration_ms"
There is no need to rename any variables, as the names are consistent and easy to understand.
-Variable Types
Before moving on, it is important to understand the data type of every variable in the data set so that the analysis is done properly. We therefore use str() to inspect the data type of each column and change it wherever necessary.
str(ss)
## tibble [32,833 x 23] (S3: tbl_df/tbl/data.frame)
## $ track_id : chr [1:32833] "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr [1:32833] "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr [1:32833] "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : num [1:32833] 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr [1:32833] "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr [1:32833] "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr [1:32833] "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_name : chr [1:32833] "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr [1:32833] "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr [1:32833] "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr [1:32833] "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num [1:32833] 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num [1:32833] 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : num [1:32833] 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num [1:32833] -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : num [1:32833] 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num [1:32833] 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num [1:32833] 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num [1:32833] 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num [1:32833] 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num [1:32833] 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num [1:32833] 122 100 124 122 124 ...
## $ duration_ms : num [1:32833] 194754 162600 176616 169093 189052 ...
mode is a numeric field that only takes the values 0 and 1, so we convert it to a factor. track_album_release_date is a character column, but it actually holds date values. It is important to change its data type, as we need this column in date format for the analysis.
-Modifying Data Types
ss$mode <- as.factor(ss$mode)
ss$track_album_release_date <- as.Date(ss$track_album_release_date)
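One thing to watch with this conversion: as.Date() returns NA for any string that is not a complete year-month-day date. The sketch below shows this on made-up values and one possible way to keep such rows by padding partial dates to the first of the month or year. The helper name pad_partial_date is hypothetical, and padding is an assumption about how partial dates should be treated, not something the report itself does.

```r
# Made-up release dates: one complete, one year-only, one year-month.
dates_raw <- c("2019-06-14", "2012", "1998-01")

# Direct conversion: incomplete strings become NA.
direct <- as.Date(dates_raw, format = "%Y-%m-%d")

# Hypothetical helper: pad "YYYY" and "YYYY-MM" before converting.
pad_partial_date <- function(x) {
  x <- ifelse(nchar(x) == 4, paste0(x, "-01-01"), x)  # "2012"    -> "2012-01-01"
  x <- ifelse(nchar(x) == 7, paste0(x, "-01"), x)     # "1998-01" -> "1998-01-01"
  as.Date(x, format = "%Y-%m-%d")
}

padded <- pad_partial_date(dates_raw)

sum(is.na(direct))  # number of dates lost by the direct conversion
padded
```

If only the year is needed later on, padding keeps every row usable for year-based summaries.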
-Null Values in the Dataset
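The raw 32833 x 23 data frame contains only 15 missing values, which is surprisingly few for a data set of this size: 5 each in track_artist, track_name and track_album_name. The total of 1901 reported below also includes 1886 NA values in track_album_release_date. These were most likely introduced by the as.Date() conversion above, which returns NA for entries that are not in a complete year-month-day format.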
sum(is.na(ss))
## [1] 1901
colSums(is.na(ss))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 1886 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
-Missing Value Treatment
As seen above, there are 5 missing values in each of 3 columns: track_artist, track_album_name and track_name.
We impute these missing values with the character constant 'NA'. We do not delete these rows because they still carry a lot of information that we can use in our EDA.
ss$track_artist[is.na(ss$track_artist)] <- 'NA'
ss$track_album_name[is.na(ss$track_album_name)] <- 'NA'
ss$track_name[is.na(ss$track_name)] <- 'NA'
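The same imputation can be written in one step with tidyr::replace_na(). This is a minimal sketch on a made-up three-row table; the label "Unknown" is our assumption, chosen here only because it cannot be confused with a real missing value the way the string 'NA' can.

```r
library(dplyr)
library(tidyr)

# Made-up rows with missing text fields (illustrative only).
toy <- tibble(
  track_name = c("Song A", NA, "Song C"),
  track_artist = c("Artist 1", "Artist 2", NA),
  track_popularity = c(10, 20, 30)
)

# Replace missing values in the chosen character columns with a label.
filled <- toy %>%
  mutate(across(c(track_name, track_artist), ~ replace_na(.x, "Unknown")))

colSums(is.na(filled))
```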
In this part, let's look at the distribution of the numeric variables, first through summary statistics and then by plotting them.
summary(select_if(ss,is.numeric))
## track_popularity danceability energy key
## Min. : 0.00 Min. :0.0000 Min. :0.000175 Min. : 0.000
## 1st Qu.: 24.00 1st Qu.:0.5630 1st Qu.:0.581000 1st Qu.: 2.000
## Median : 45.00 Median :0.6720 Median :0.721000 Median : 6.000
## Mean : 42.48 Mean :0.6548 Mean :0.698619 Mean : 5.374
## 3rd Qu.: 62.00 3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000
## Max. :100.00 Max. :0.9830 Max. :1.000000 Max. :11.000
## loudness speechiness acousticness instrumentalness
## Min. :-46.448 Min. :0.0000 Min. :0.0000 Min. :0.0000000
## 1st Qu.: -8.171 1st Qu.:0.0410 1st Qu.:0.0151 1st Qu.:0.0000000
## Median : -6.166 Median :0.0625 Median :0.0804 Median :0.0000161
## Mean : -6.720 Mean :0.1071 Mean :0.1753 Mean :0.0847472
## 3rd Qu.: -4.645 3rd Qu.:0.1320 3rd Qu.:0.2550 3rd Qu.:0.0048300
## Max. : 1.275 Max. :0.9180 Max. :0.9940 Max. :0.9940000
## liveness valence tempo duration_ms
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 4000
## 1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.: 99.96 1st Qu.:187819
## Median :0.1270 Median :0.5120 Median :121.98 Median :216000
## Mean :0.1902 Mean :0.5106 Mean :120.88 Mean :225800
## 3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:133.92 3rd Qu.:253585
## Max. :0.9960 Max. :0.9910 Max. :239.44 Max. :517810
From the descriptive statistics of the numeric variables above, we see that for some variables the mean is not close to the median, which indicates skewness in the data.
To check whether these variables have outliers, we plot their distributions using boxplots (in the Visual Summary section).
-Visual Summary
-Generating boxplots
par(mfrow = c(2, 2))
boxplot(ss$danceability, main = 'Boxplot distribution of Danceability')
boxplot(ss$loudness, main = 'Boxplot distribution of loudness')
boxplot(ss$tempo, main = 'Boxplot distribution of tempo')
Box plots are helpful for outlier detection. In the summary above, we observed that a few columns have the mean pulled to one side by outliers or skewness. Here we check the boxplots of these variables to identify outliers and then treat them.
From the boxplots we see that danceability has one value at 0, which stands apart from the rest of the values. Similarly, loudness has one very low value (about -46), and tempo has one value that is much higher and one that is much lower than the majority of data points.
Since these are isolated records, we remove them from the data set and visualize it again to see the change in distribution.
-Trimming Outliers
df_2 <- subset(ss, danceability > min(danceability) & loudness > min(loudness) & tempo > min(tempo) & tempo < max(tempo))
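The trimming above drops only the minimum and maximum values. For comparison, the points that boxplot() itself draws as outliers can be listed with boxplot.stats(), which applies the default 1.5 x IQR whisker rule. The sketch below uses made-up tempo values, not the real column.

```r
# Made-up tempo values with one very low and one very high entry.
tempo_toy <- c(0, 98, 100, 120, 122, 124, 128, 130, 240)

# Values beyond the whiskers, exactly as boxplot() would plot them.
out <- boxplot.stats(tempo_toy)$out

# Keep everything that is not flagged as an outlier.
trimmed <- tempo_toy[!(tempo_toy %in% out)]

out
trimmed
```

On the real data this rule would flag many more points than the min/max filter, so it is worth inspecting the flagged values before removing them.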
-Visualizing the distributions again
par(mfrow = c(2, 2))
boxplot(df_2$danceability, main = 'Distribution of Danceability')
boxplot(df_2$loudness, main = 'Distribution of loudness')
boxplot(df_2$tempo , main = 'Distribution of tempo')
Thus, in data cleaning we checked the variable types, imputed the missing values, reviewed the numerical summaries, and detected and treated the outliers.
-Data in the most condensed form possible
The below table shows a glimpse of the final cleaned dataset.
knitr::kable(head(df_2,5), "simple")
| track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6f807x0ima9a1j3VPbc7VN | I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 66 | 2oCs0DGTsRO98Gh5ZSl2Cx | I Don’t Care (with Justin Bieber) [Loud Luxury Remix] | 2019-06-14 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194754 |
| 0r7CVbZTWZgbTCYdfa2P31 | Memories - Dillon Francis Remix | Maroon 5 | 67 | 63rPSO264uRjW1X5E6cWv6 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162600 |
| 1z1Hg7Vb0AhHDiEmnDE79l | All the Time - Don Diablo Remix | Zara Larsson | 70 | 1HoSmj2eLcsrR0vE9gThr4 | All the Time (Don Diablo Remix) | 2019-07-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176616 |
| 75FpbthrwQmzHlBJLuGdC7 | Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | 1nqYsOef1yKKuGOVchbsk6 | Call You Mine - The Remixes | 2019-07-19 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.718 | 0.930 | 7 | -3.778 | 1 | 0.1020 | 0.0287 | 9.40e-06 | 0.2040 | 0.277 | 121.956 | 169093 |
| 1e8PAfcKUYoKkxPhrHqw4x | Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | 7m7vv9wlQ4i0LFuJiE2zsQ | Someone You Loved (Future Humans Remix) | 2019-03-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.650 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.00e+00 | 0.0833 | 0.725 | 123.976 | 189052 |
#### Extracting Year from songs
ss <- ss %>%
separate(track_album_release_date,
c("year","month","day"),
sep = "-")
#### Creating minutes from duration
ss<-ss %>%
mutate(duration_min=duration_ms/60000)
#### changing data type of year column
ss$year <- as.numeric(ss$year)
popularity_order<-select(ss,track_popularity,track_artist,track_album_name)
arrange(popularity_order,desc(track_popularity))
## # A tibble: 32,833 x 3
## track_popularity track_artist track_album_name
## <dbl> <chr> <chr>
## 1 100 Tones and I Dance Monkey (Stripped Back) / Dance Monkey
## 2 100 Tones and I Dance Monkey (Stripped Back) / Dance Monkey
## 3 99 Arizona Zervas ROXANNE
## 4 99 Arizona Zervas ROXANNE
## 5 99 Arizona Zervas ROXANNE
## 6 99 Arizona Zervas ROXANNE
## 7 98 KAROL G Tusa
## 8 98 Maroon 5 Memories
## 9 98 The Weeknd Blinding Lights
## 10 98 Maroon 5 Memories
## # ... with 32,823 more rows
# Keep the first row seen for each distinct popularity score
popularity_order <- popularity_order %>% filter(!duplicated(track_popularity))
popularity_order
## # A tibble: 101 x 3
## track_popularity track_artist track_album_name
## <dbl> <chr> <chr>
## 1 66 Ed Sheeran I Don't Care (with Justin Bieber) [Loud Lu~
## 2 67 Maroon 5 Memories (Dillon Francis Remix)
## 3 70 Zara Larsson All the Time (Don Diablo Remix)
## 4 60 The Chainsmokers Call You Mine - The Remixes
## 5 69 Lewis Capaldi Someone You Loved (Future Humans Remix)
## 6 62 Katy Perry Never Really Over (R3HAB Remix)
## 7 68 Avicii Tough Love (Tiësto Remix)
## 8 58 Ed Sheeran Cross Me (feat. Chance the Rapper & PnB Ro~
## 9 63 Martin Garrix Summer Days (feat. Macklemore & Patrick St~
## 10 65 David Guetta Say My Name (feat. Bebe Rexha & J Balvin) ~
## # ... with 91 more rows
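Because the filter keeps only the first row seen for each popularity score, the artist attached to a score depends on row order. If the aim is to rank artists, one alternative is to take each artist's best score instead. This is a sketch on made-up rows, and the column name max_popularity is our own.

```r
library(dplyr)

# Made-up tracks (illustrative only).
toy <- tibble(
  track_artist = c("A", "A", "B", "C", "C"),
  track_popularity = c(50, 90, 70, 90, 20)
)

# One row per artist, ranked by that artist's most popular track.
artist_rank <- toy %>%
  group_by(track_artist) %>%
  summarise(max_popularity = max(track_popularity), .groups = "drop") %>%
  arrange(desc(max_popularity), track_artist)

artist_rank
```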
pop2 <- arrange(popularity_order,desc(track_popularity))
pop2
## # A tibble: 101 x 3
## track_popularity track_artist track_album_name
## <dbl> <chr> <chr>
## 1 100 Tones and I Dance Monkey (Stripped Back) / Dance Mo~
## 2 99 Arizona Zervas ROXANNE
## 3 98 KAROL G Tusa
## 4 97 Billie Eilish everything i wanted
## 5 96 The Black Eyed Peas RITMO (Bad Boys For Life)
## 6 95 Billie Eilish WHEN WE ALL FALL ASLEEP, WHERE DO WE GO?
## 7 94 Regard Ride It
## 8 93 Anuel AA China
## 9 92 Juice WRLD Bandit (with YoungBoy Never Broke Again)
## 10 91 MEDUZA Lose Control
## # ... with 91 more rows
pop3 <- table(pop2$track_artist)
pop3
##
## 真之介 香取慎吾 5 Seconds of Summer A R I Z O N A
## 1 1 1 1
## AAAMYYY Alesso Ant Saunders Anuel AA
## 1 1 1 1
## Arizona Zervas Avicii Axwell /\\ Ingrosso Bastille
## 1 2 1 1
## Billie Eilish Bolier Boys Get Hurt Carly Rae Jepsen
## 3 1 1 1
## Catiso Charli XCX Charlie Puth Clean Bandit
## 1 1 1 1
## Coldplay Daddy Yankee David Guetta Deee-Lite
## 1 1 1 1
## Deorro Disclosure DVBBS E-girls
## 1 1 1 1
## Ed Sheeran Ellie Goulding EMMA WAHLIN Grace
## 2 3 1 1
## Gryffin Hardwell Herve Pagez Jonas Blue
## 3 1 1 1
## JP Cooper Juice WRLD KAROL G Kaskade
## 1 1 1 1
## Katy Perry Kygo Lewis Capaldi Lil Nas X
## 2 2 1 1
## Lindsey Stirling Maggie Lindemann Maroon 5 Marshmello
## 1 1 1 1
## Martin Garrix Matt Simons MAX MEDUZA
## 1 1 1 1
## Molella Nikki Vianna ODESZA OneRepublic
## 1 1 1 1
## R3HAB Regard Riton Ryuji Imaichi
## 1 1 1 1
## SAINt JHN Sakurako Ohara SHAED Shallou
## 1 1 1 1
## Shinn Yamada Sia Starley Steve Aoki
## 1 1 1 4
## STVCKS SUNMI Swedish House Mafia T-Spoon
## 1 1 1 1
## The Black Eyed Peas The Chainsmokers Tiësto Tones and I
## 1 3 1 1
## TWICE Tyler Shaw Vinil Why Don't We
## 1 1 1 2
## Wolves By Night Yves V Zara Larsson Zedd
## 1 1 2 1
# Horizontal bar chart of how often each artist appears in pop2
barplot(sort(pop3), horiz = TRUE, las = 1, cex.names = 0.5)
suppressWarnings(wordcloud(words = pop2$track_artist, freq = pop2$track_popularity, min.freq = 1, max.words=200, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2")))
Artists are sized according to popularity: the larger the word, the higher the popularity score of the artist's track.
library(dplyr)
aa2<-ss
aa2$speech_only <- cut(aa2$speechiness, breaks = 10)
aa2 %>%
ggplot( aes(x = speech_only )) +
geom_bar(width = 0.8, fill = "blue", colour = "black") +
scale_x_discrete(name = "Speechiness")
From the plot we can see that most songs fall into the lowest speechiness bins, which suggests that songs with low speechiness are favored by users.
aa<-ss
aa$energy_only <- cut(aa$energy, breaks = 10)
aa %>%
ggplot( aes(x = energy_only )) +
geom_bar(width = 1, fill = "blue", colour = "black") +
scale_x_discrete(name = "Energy")
From the above plot we can see that most songs have an energy of around 0.8 to 0.9, which suggests that this range is the most preferred among users.
library(corrplot)
## corrplot 0.92 loaded
library(GGally)
## Warning: package 'GGally' was built under R version 4.1.2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
corr <- ss %>%
select(track_popularity,danceability,energy,loudness,speechiness,acousticness,instrumentalness, liveness, valence, tempo)
ggcorr(corr,
nbreaks = 6,
label = TRUE,
label_size = 3,
color = "grey50")
Based on the plot, we can state that popularity does not have a strong correlation with any of the other track features. The only notably stronger correlation in the plot involves energy, at around 0.7, and it is with loudness rather than with popularity.
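To read exact values instead of rounded plot labels, the correlations with popularity can be pulled from the matrix returned by cor(). The sketch below uses a small made-up data frame with three of the column names, so the numbers it prints say nothing about the real data.

```r
# Made-up values for three of the columns (illustrative only).
toy <- data.frame(
  track_popularity = c(10, 40, 30, 80, 60),
  energy = c(0.9, 0.5, 0.7, 0.4, 0.6),
  loudness = c(-3, -8, -5, -9, -6)
)

# Full Pearson correlation matrix.
corr_matrix <- cor(toy)

# Correlation of every other feature with popularity, sorted ascending.
corr_with_pop <- corr_matrix[, "track_popularity"]
corr_with_pop <- sort(corr_with_pop[names(corr_with_pop) != "track_popularity"])

corr_with_pop
```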
library(ggplot2)
#### Plotting Density Plots
ggplot(ss) +
geom_density(aes(energy, fill ="energy"), alpha = 0.1) +
geom_density(aes(danceability, fill ="danceability"), alpha = 0.1) +
geom_density(aes(valence, fill ="valence"), alpha = 0.1) +
geom_density(aes(acousticness, fill ="acousticness"), alpha = 0.1) +
geom_density(aes(speechiness, fill ="speechiness"), alpha = 0.1) +
geom_density(aes(liveness, fill ="liveness"), alpha = 0.1) +
scale_x_continuous(name = "Energy, Danceability, Valence, Acousticness, Speechiness and Liveness") +
scale_y_continuous(name = "Density") +
ggtitle("Density plot of Energy, Danceability, Valence, Acousticness, Speechiness and Liveness") +
theme_bw() +
theme(plot.title = element_text(size = 10, face = "bold"),
text = element_text(size = 10)) +
theme(legend.title=element_blank()) +
scale_fill_brewer(palette="Accent")
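The six density layers above can also be drawn from a single layer by first reshaping the features into long format with tidyr::pivot_longer(). This is a minimal sketch on made-up values; the column names feature and value are our own.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Made-up feature values (illustrative only).
toy <- tibble(
  energy = c(0.9, 0.5, 0.7, 0.8),
  danceability = c(0.7, 0.6, 0.8, 0.5),
  valence = c(0.5, 0.3, 0.6, 0.4)
)

# One row per (song, feature) pair.
long <- toy %>%
  pivot_longer(everything(), names_to = "feature", values_to = "value")

# A single density layer, filled by feature; alpha is a fixed setting, not an aesthetic.
p <- ggplot(long, aes(value, fill = feature)) +
  geom_density(alpha = 0.1) +
  scale_fill_brewer(palette = "Accent") +
  theme_bw()

p
```

With this layout, adding or removing a feature only changes the columns passed to pivot_longer().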
bp <- ggplot(ss, aes(energy, playlist_genre)) +
geom_boxplot(aes(fill = playlist_genre)) +
theme_minimal() +
theme(legend.position = "top")
bp
From the above plot we can see that the EDM genre has the songs with the highest energy.
bp1 <- ggplot(ss, aes(danceability, playlist_genre)) +
geom_boxplot(aes(fill = playlist_genre)) +
theme_minimal() +
theme(legend.position = "top")
bp1
From the above plot we can see that the Rap genre has the songs with the highest danceability.
bp2 <- ggplot(ss, aes(liveness, playlist_genre)) +
geom_boxplot(aes(fill = playlist_genre)) +
theme_minimal() +
theme(legend.position = "top")
bp2
From the above plot we can see that the EDM genre has the songs with the most liveness.
bp3 <- ggplot(ss, aes(valence, playlist_genre)) +
geom_boxplot(aes(fill = playlist_genre)) +
theme_minimal() +
theme(legend.position = "top")
bp3
From the above plot we can see that the Latin genre has songs with higher valence.
bp4 <- ggplot(ss, aes(loudness, playlist_genre)) +
geom_boxplot(aes(fill = playlist_genre)) +
theme_minimal() +
theme(legend.position = "top")
bp4
From the above plot we can see that songs in the EDM genre are louder than those in the other genres.
trend_chart <- function(arg){
# Average of the chosen variable per release year, for albums released after 2010
trend_change <- ss %>% filter(year>2010) %>% group_by(year) %>% summarize(Average = mean(.data[[arg]]))
chart<- ggplot(data = trend_change, aes(x = year, y = Average)) +
geom_line(color = "black", size = 1) +
scale_x_continuous(breaks=seq(2011, 2020, 3)) + scale_y_continuous(name=arg)
return(chart)
}
trend_chart_track_popularity<-trend_chart("track_popularity") + theme_classic()
trend_chart_danceability<-trend_chart("danceability") + theme_classic()
trend_chart_energy<-trend_chart("energy") + theme_classic()
trend_chart_loudness<-trend_chart("loudness") + theme_classic()
trend_chart_duration_min<-trend_chart("duration_min") + theme_classic()
trend_chart_speechiness<-trend_chart("speechiness") + theme_classic()
plot_grid(trend_chart_track_popularity, trend_chart_danceability, trend_chart_energy, trend_chart_loudness, trend_chart_duration_min, trend_chart_speechiness,ncol = 3, label_size = 3)
From the above plot we can see that the average duration of songs has been decreasing over the years.
library(factoextra)
library(cluster)
## Warning: package 'cluster' was built under R version 4.1.2
suppressPackageStartupMessages(library(factoextra))
suppressPackageStartupMessages(library(cluster))
We now apply K-means clustering to the dataset. K-means clustering is a technique in which each observation in a dataset is placed into one of K clusters.
Z <-select(ss ,danceability,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo)
clus<-head(Z,20)
clus
## # A tibble: 20 x 9
## danceability energy loudness speechiness acousticness instrumentalness
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.748 0.916 -2.63 0.0583 0.102 0
## 2 0.726 0.815 -4.97 0.0373 0.0724 0.00421
## 3 0.675 0.931 -3.43 0.0742 0.0794 0.0000233
## 4 0.718 0.93 -3.78 0.102 0.0287 0.00000943
## 5 0.65 0.833 -4.67 0.0359 0.0803 0
## 6 0.675 0.919 -5.38 0.127 0.0799 0
## 7 0.449 0.856 -4.79 0.0623 0.187 0
## 8 0.542 0.903 -2.42 0.0434 0.0335 0.00000483
## 9 0.594 0.935 -3.56 0.0565 0.0249 0.00000397
## 10 0.642 0.818 -4.55 0.032 0.0567 0
## 11 0.679 0.923 -6.5 0.181 0.146 0.00000492
## 12 0.437 0.774 -4.92 0.0554 0.148 0
## 13 0.744 0.726 -4.68 0.0463 0.0399 0
## 14 0.572 0.915 -4.45 0.0625 0.0111 0
## 15 0.69 0.78 -4.45 0.0594 0.00733 0.00183
## 16 0.805 0.835 -4.60 0.0896 0.13 0.00000503
## 17 0.694 0.901 -4.32 0.0948 0.0702 0
## 18 0.678 0.747 -5.29 0.165 0.0395 0
## 19 0.746 0.557 -6.72 0.0542 0.103 0.0036
## 20 0.467 0.821 -5.47 0.0934 0.00791 0.000441
## # ... with 3 more variables: liveness <dbl>, valence <dbl>, tempo <dbl>
# Finding optimal number of clusters using K-Means
fviz_nbclust(clus, kmeans, method = "wss")
Typically, when we create this type of plot, we look for an “elbow” where the sum of squares begins to “bend” or level off. This is typically the optimal number of clusters.
In this plot, there appears to be a bit of an elbow or “bend” at k = 4 clusters.
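The curve that the elbow plot shows is the total within-cluster sum of squares for each k, and it can be computed by hand. The sketch below uses R's built-in iris measurements as stand-in data rather than the song features; the seed and the range 1 to 6 are arbitrary choices.

```r
set.seed(42)  # k-means starts from random centers, so fix the seed

# Stand-in data: four numeric columns of the built-in iris data, standardized.
x <- scale(iris[, 1:4])

# Total within-cluster sum of squares for k = 1..6.
wss <- sapply(1:6, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

# The "elbow" is where this curve stops dropping sharply.
plot(1:6, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

For k = 1 the value equals the total sum of squares of the data, and every additional cluster can only lower it, which is why the bend rather than the minimum is used to choose k.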
#calculate gap statistic based on number of clusters
gap_stat <- clusGap(clus,
FUN = kmeans,
nstart = 25,
K.max = 10,
B = 50)
#plot number of clusters vs. gap statistic
fviz_gap_stat(gap_stat)
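The gap statistic plot can also be summarized as a single suggested k with cluster::maxSE(), which applies a selection rule to the gap values and their standard errors. This is a sketch on the built-in iris data as a stand-in; the "firstSEmax" rule and the smaller K.max and B values are our choices for a quick run, not settings taken from this report.

```r
library(cluster)

set.seed(42)  # bootstrap samples and k-means starts are random

# Stand-in data: four numeric columns of the built-in iris data, standardized.
x <- scale(iris[, 1:4])

# Gap statistic for k = 1..6 with 20 bootstrap reference sets.
gap <- clusGap(x, FUN = kmeans, nstart = 25, K.max = 6, B = 20)

# Suggested number of clusters under the "firstSEmax" rule.
best_k <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")

best_k
```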
#perform k-means clustering with k = 4 clusters
km <- kmeans(clus, centers = 4, nstart = 25)
#view results
km
## K-means clustering with 4 clusters of sizes 9, 1, 2, 8
##
## Cluster means:
## danceability energy loudness speechiness acousticness instrumentalness
## 1 0.6494444 0.8743333 -4.169111 0.064500 0.05590333 2.074589e-04
## 2 0.7260000 0.8150000 -4.969000 0.037300 0.07240000 4.210000e-03
## 3 0.5975000 0.7065000 -5.755000 0.058250 0.14500000 1.800000e-03
## 4 0.6456250 0.8422500 -4.697750 0.099525 0.07277625 5.691875e-05
## liveness valence tempo
## 1 0.2170111 0.5426667 125.3212
## 2 0.3570000 0.6930000 99.9720
## 3 0.1570000 0.2380000 112.3045
## 4 0.2040375 0.4598750 121.4769
##
## Clustering vector:
## [1] 4 2 1 4 1 1 3 1 1 1 4 4 4 1 1 1 4 4 3 4
##
## Within cluster sum of squares by cluster:
## [1] 22.935639 0.000000 2.214049 29.170506
## (between_SS / total_SS = 93.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
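In the cluster means above, tempo ranges from about 100 to 125 while most other features lie between 0 and 1, so Euclidean distances are dominated by tempo. A common remedy is to standardize the columns before clustering. The following is a minimal sketch on made-up rows; k = 2 is chosen only because the toy table is tiny.

```r
set.seed(42)  # k-means starts from random centers

# Made-up songs with features on very different scales (illustrative only).
toy <- data.frame(
  danceability = c(0.75, 0.73, 0.68, 0.72, 0.45, 0.54),
  energy = c(0.92, 0.82, 0.93, 0.93, 0.86, 0.90),
  tempo = c(122, 100, 124, 122, 113, 128)
)

# Standardize each column to mean 0 and standard deviation 1.
toy_scaled <- scale(toy)

# Cluster on the standardized values so no single feature dominates.
km_scaled <- kmeans(toy_scaled, centers = 2, nstart = 25)

km_scaled$cluster
```

The cluster means of a model fitted this way are in standardized units, so they need to be mapped back to the original scale before interpretation.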
#plot results of final k-means model
fviz_cluster(km, data = clus) + theme_classic()
#add cluster assigment to original data
final_data <- cbind(clus, cluster = km$cluster)
#view final data
head(final_data,20)
## danceability energy loudness speechiness acousticness instrumentalness
## 1 0.748 0.916 -2.634 0.0583 0.10200 0.00e+00
## 2 0.726 0.815 -4.969 0.0373 0.07240 4.21e-03
## 3 0.675 0.931 -3.432 0.0742 0.07940 2.33e-05
## 4 0.718 0.930 -3.778 0.1020 0.02870 9.43e-06
## 5 0.650 0.833 -4.672 0.0359 0.08030 0.00e+00
## 6 0.675 0.919 -5.385 0.1270 0.07990 0.00e+00
## 7 0.449 0.856 -4.788 0.0623 0.18700 0.00e+00
## 8 0.542 0.903 -2.419 0.0434 0.03350 4.83e-06
## 9 0.594 0.935 -3.562 0.0565 0.02490 3.97e-06
## 10 0.642 0.818 -4.552 0.0320 0.05670 0.00e+00
## 11 0.679 0.923 -6.500 0.1810 0.14600 4.92e-06
## 12 0.437 0.774 -4.918 0.0554 0.14800 0.00e+00
## 13 0.744 0.726 -4.675 0.0463 0.03990 0.00e+00
## 14 0.572 0.915 -4.451 0.0625 0.01110 0.00e+00
## 15 0.690 0.780 -4.446 0.0594 0.00733 1.83e-03
## 16 0.805 0.835 -4.603 0.0896 0.13000 5.03e-06
## 17 0.694 0.901 -4.322 0.0948 0.07020 0.00e+00
## 18 0.678 0.747 -5.289 0.1650 0.03950 0.00e+00
## 19 0.746 0.557 -6.722 0.0542 0.10300 3.60e-03
## 20 0.467 0.821 -5.466 0.0934 0.00791 4.41e-04
## liveness valence tempo cluster
## 1 0.0653 0.518 122.036 4
## 2 0.3570 0.693 99.972 2
## 3 0.1100 0.613 124.008 1
## 4 0.2040 0.277 121.956 4
## 5 0.0833 0.725 123.976 1
## 6 0.1430 0.585 124.982 1
## 7 0.1760 0.152 112.648 3
## 8 0.1110 0.367 127.936 1
## 9 0.6370 0.366 127.015 1
## 10 0.0919 0.590 124.957 1
## 11 0.1240 0.752 121.984 4
## 12 0.1330 0.329 123.125 4
## 13 0.3740 0.687 121.985 4
## 14 0.3390 0.678 123.919 1
## 15 0.0729 0.238 126.070 1
## 16 0.3650 0.722 125.028 1
## 17 0.4270 0.368 118.051 4
## 18 0.1740 0.516 120.002 4
## 19 0.1380 0.324 111.961 3
## 20 0.1310 0.232 122.676 4
With K-means clustering we segregated the songs (here, the first 20 rows of the data set) into different clusters.
A common belief is that energy drives popularity, i.e. that energetic songs are more popular. Nevertheless, we couldn’t find any relationship between popularity and energy.
Some of the key relationships we found were:
-Songs with lower speechiness are more favored by users
-An energy range of around 0.8 to 0.9 is the most preferred among users
-The EDM genre has the songs with the highest energy and the most liveness
-The Rap genre has the songs with the highest danceability
-The Latin genre has songs with higher valence
-Songs in the EDM genre are louder than those in the other genres
-The duration of songs has been decreasing over the years
Within the last decade, the average popularity of songs reached its minimum in 2014 and has been growing continually since then, which suggests that songs are becoming more popular over time.
We used trend charts to see how the features change over time. To understand the relationships among the features, we used a correlation plot in R. We used boxplots to find the outliers.
As we have a limited number of records (about 32k) for our examination, we could not gain a full picture of the components of music. The analysis could also be improved with customer-related features such as playlist history, country of residence, or whether the customer is a premium subscriber. Finally, we used the K-means clustering algorithm to group songs into clusters based on their audio features.