Spotify Data Analysis

Introduction

Problem statement

Music is part of everyday life for many people, and I am no exception: I listen to it during almost every activity. My taste is eclectic, ranging from high-tempo dance music to mellow acoustic tracks. Learning how to analyze music broadens our knowledge and gives us more to talk about with others.

The goal of this project is to use R Markdown to import a real-world data set and generate a fully reproducible HTML report. We apply a variety of data cleaning and tidying techniques before carrying out a basic exploratory data analysis. Our research question is: which characteristics of a song make it popular? This analysis should give us a much better understanding of listening tastes and habits.

Addressing the problem statement

To clean the data, we use a variety of data manipulation and visualization tools. Data cleaning matters because it improves the quality of the data and, with it, the reliability of the analysis. Cleaning removes or corrects obsolete and erroneous records, so that the analysis rests on the best data available.

The data used:

df = readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')

## Rows: 32833 Columns: 23

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (10): track_id, track_name, track_artist, track_album_id, track_album_na...
## dbl (13): track_popularity, danceability, energy, key, loudness, mode, speec...

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Objective

Our investigation should give consumers a better understanding of popular songs. A song's popularity is measured by the track_popularity variable. Based on it, we want to:

Determine the distribution of the popularity index.
Identify the attributes that make a song popular.
Develop a model to predict song popularity from the available characteristics.

Packages Required

Packages Used

The following packages are used in this project.

#install.packages("plotly")
#install.packages("factoextra")
#install.packages("gridExtra")
#install.packages("cowplot")
#install.packages("wordcloud")
library(wordcloud)

## Warning: package 'wordcloud' was built under R version 4.1.2

## Loading required package: RColorBrewer

#install.packages("RColorBrewer")
library(RColorBrewer)
#install.packages("wordcloud2")
library(wordcloud2)

## Warning: package 'wordcloud2' was built under R version 4.1.2

library(gridExtra)
library(factoextra)

## Warning: package 'factoextra' was built under R version 4.1.2

## Loading required package: ggplot2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(RColorBrewer))
suppressPackageStartupMessages(library(wordcloud))
suppressPackageStartupMessages(library(wordcloud2))
suppressPackageStartupMessages(library(factoextra))
suppressPackageStartupMessages(library(gridExtra))
suppressPackageStartupMessages(library(plotly))

## Warning: package 'plotly' was built under R version 4.1.2

suppressPackageStartupMessages(library(cowplot))

## Warning: package 'cowplot' was built under R version 4.1.2

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidyr))
suppressPackageStartupMessages(library(readr))
suppressPackageStartupMessages(library(stringr))
suppressPackageStartupMessages(library(kableExtra))

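Suppressing the startup messages

The suppressPackageStartupMessages() function is used to suppress the startup messages that packages print when they are loaded. It does not suppress warnings such as "built under R version 4.1.2"; those would need suppressWarnings() or the warning=FALSE chunk option.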
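In base R, suppressPackageStartupMessages() only silences conditions of class packageStartupMessage; warnings are a separate kind of condition. A minimal sketch of the difference, using a hypothetical noisy_load() function as a stand-in for a package that emits both:

```r
# Hypothetical stand-in for a package that emits both a startup message
# and a warning when it is attached.
noisy_load <- function() {
  packageStartupMessage("Welcome to the hypothetical package!")
  warning("built under a different R version")
  invisible(TRUE)
}

# The startup message is silenced, but the warning still surfaces and
# is caught by the tryCatch() handler.
warning_seen <- tryCatch(
  {
    suppressPackageStartupMessages(noisy_load())
    FALSE
  },
  warning = function(w) TRUE
)
print(warning_seen)

# Wrapping the call in suppressWarnings() as well silences both.
suppressWarnings(suppressPackageStartupMessages(noisy_load()))
```

In an R Markdown document, the chunk options message=FALSE and warning=FALSE achieve the same effect for a whole chunk.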

Purpose of each package

The following packages will be used for this project:

ggplot2: Based on The Grammar of Graphics, ggplot2 is a system for declaratively constructing graphics. You give ggplot2 the data, tell it how to map variables to aesthetics and which graphical primitives to use, and it does the rest.
dplyr: dplyr is a data manipulation package that provides a consistent collection of verbs to tackle the most frequent data manipulation problems.
tidyr: tidyr provides a series of functions to assist you in obtaining tidy data. Tidy data has a uniform format: in a nutshell, each variable is a column, each observation is a row, and each cell holds a single value.
readr: readr is a tool for reading rectangular data (like csv, tsv, and fwf) that is both quick and easy to use. It is built to parse a wide range of data formats found in the wild while also failing cleanly when the data changes unexpectedly.
stringr: stringr is a collection of functions that make working with strings as simple as possible. It is built on top of stringi, which uses the ICU C library to deliver quick and accurate string manipulations.
kableExtra: The kableExtra package extends the basic functionality of tables produced using knitr::kable(). Since knitr::kable() is simple by design, it lacks many features that are commonly seen in other packages, and kableExtra fills that gap. Most of its table features work for both HTML and PDF formats.
factoextra: factoextra is an R package that makes it easy to extract and visualize the output of exploratory multivariate data analyses.
corrplot: The corrplot package provides a visual exploratory tool for correlation matrices that supports automatic variable reordering to help detect hidden patterns among variables.
GGally: GGally extends ggplot2 by adding several functions that reduce the complexity of combining geoms with transformed data. These include a pairwise plot matrix, a scatterplot matrix, a parallel coordinates plot, a survival plot, and several functions to plot networks.
RColorBrewer: RColorBrewer can be used to create colorful graphs with pre-made palettes that consist of 8 to 12 colors.
cowplot: The cowplot package is a simple add-on to ggplot2. It provides various features that help with creating publication-quality figures, such as a set of themes, functions to align plots and arrange them into complex compound figures, and functions that make it easy to annotate plots or mix plots with images.

Data Preparation

Data attributes

The data set has 23 columns. Besides identifiers and playlist information, they include audio features: confidence measures like acousticness, liveness, speechiness and instrumentalness; perceptual measures like energy, loudness, danceability and valence (positiveness); and descriptors like duration, tempo, key, and mode.

A brief description of the variables is as mentioned below:

Importing data

ss <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')

## Rows: 32833 Columns: 23

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (10): track_id, track_name, track_artist, track_album_id, track_album_na...
## dbl (13): track_popularity, danceability, energy, key, loudness, mode, speec...

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

-Variable Names

names(ss)

##  [1] "track_id"                 "track_name"              
##  [3] "track_artist"             "track_popularity"        
##  [5] "track_album_id"           "track_album_name"        
##  [7] "track_album_release_date" "playlist_name"           
##  [9] "playlist_id"              "playlist_genre"          
## [11] "playlist_subgenre"        "danceability"            
## [13] "energy"                   "key"                     
## [15] "loudness"                 "mode"                    
## [17] "speechiness"              "acousticness"            
## [19] "instrumentalness"         "liveness"                
## [21] "valence"                  "tempo"                   
## [23] "duration_ms"

There is no need to rename any variables, as the names are consistent and easy to understand.

-Variable Types

Before moving on, we need to know the data type of every variable in the dataset so that the analysis treats each one correctly. We therefore use str() to inspect the data type of each column and change it wherever necessary.

str(ss)

## tibble [32,833 x 23] (S3: tbl_df/tbl/data.frame)
##  $ track_id                : chr [1:32833] "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr [1:32833] "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr [1:32833] "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : num [1:32833] 66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr [1:32833] "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr [1:32833] "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr [1:32833] "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr [1:32833] "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr [1:32833] "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr [1:32833] "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr [1:32833] "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num [1:32833] 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num [1:32833] 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : num [1:32833] 6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num [1:32833] -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : num [1:32833] 1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num [1:32833] 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num [1:32833] 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num [1:32833] 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num [1:32833] 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num [1:32833] 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num [1:32833] 122 100 124 122 124 ...
##  $ duration_ms             : num [1:32833] 194754 162600 176616 169093 189052 ...

Observations

mode is a numeric field, but it only takes the values 0 and 1, so it is better treated as a factor.
track_album_release_date is a character column, but it actually holds date values. It is important to change the data type, as we need this column in date format for the analysis.

-Modifying Data types

ss$mode <- as.factor(ss$mode)
ss$track_album_release_date <- as.Date(ss$track_album_release_date)

-Null Values in the Dataset

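After the type conversion, the 32833 x 23 data frame contains 1901 missing values in total. Only 15 of them come from the source data: 5 each in track_artist, track_name and track_album_name, which is surprisingly few for a data set of this size. The remaining 1886 are in track_album_release_date; they most likely appear because as.Date() could not parse some release dates (for example, dates that give only a year) and returned NA for them.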

sum(is.na(ss))

## [1] 1901

colSums(is.na(ss))

##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                     1886                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0
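The 1886 missing values in track_album_release_date deserve a closer look. One plausible explanation, which is an assumption here rather than something the report verifies, is that some release dates are stored as a year or a year and month only. A minimal sketch with made-up dates shows how as.Date() behaves in that case, and how a hypothetical helper, pad_partial_date(), could pad partial dates before conversion:

```r
# Toy release dates (hypothetical values): a full date, a year only,
# and a year-month.
release <- c("2019-06-14", "2012", "2015-03")

# as.Date() picks its format from the first non-missing element, so the
# partial dates fail to parse and become NA.
naive <- as.Date(release)
print(sum(is.na(naive)))

# Hypothetical helper: pad partial dates to the first day of the
# year or month before converting.
pad_partial_date <- function(x) {
  x <- ifelse(nchar(x) == 4, paste0(x, "-01-01"), x)  # "YYYY"    -> "YYYY-01-01"
  x <- ifelse(nchar(x) == 7, paste0(x, "-01"), x)     # "YYYY-MM" -> "YYYY-MM-01"
  as.Date(x)
}

padded <- pad_partial_date(release)
print(padded)
```

Padding invents a day (and possibly a month), so it is only appropriate when the analysis uses the year alone. The helper would have to be applied to the original character column, before the as.Date() conversion.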

-Missing Value Treatment

As seen above, there are 5 missing values in each of 3 columns: track_artist, track_album_name and track_name.

We imputed these missing values with the character constant 'NA'. We do not delete these rows because they still carry a lot of information that we can use in our EDA. The missing values in track_album_release_date are left as they are.

ss$track_artist[is.na(ss$track_artist)] <- 'NA'
ss$track_album_name[is.na(ss$track_album_name)] <- 'NA'
ss$track_name[is.na(ss$track_name)] <- 'NA'

Generating summary

For this part, let's look at a numerical summary of the numeric variables.

summary(select_if(ss,is.numeric))

##  track_popularity  danceability        energy              key        
##  Min.   :  0.00   Min.   :0.0000   Min.   :0.000175   Min.   : 0.000  
##  1st Qu.: 24.00   1st Qu.:0.5630   1st Qu.:0.581000   1st Qu.: 2.000  
##  Median : 45.00   Median :0.6720   Median :0.721000   Median : 6.000  
##  Mean   : 42.48   Mean   :0.6548   Mean   :0.698619   Mean   : 5.374  
##  3rd Qu.: 62.00   3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.: 9.000  
##  Max.   :100.00   Max.   :0.9830   Max.   :1.000000   Max.   :11.000  
##     loudness        speechiness      acousticness    instrumentalness   
##  Min.   :-46.448   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000  
##  1st Qu.: -8.171   1st Qu.:0.0410   1st Qu.:0.0151   1st Qu.:0.0000000  
##  Median : -6.166   Median :0.0625   Median :0.0804   Median :0.0000161  
##  Mean   : -6.720   Mean   :0.1071   Mean   :0.1753   Mean   :0.0847472  
##  3rd Qu.: -4.645   3rd Qu.:0.1320   3rd Qu.:0.2550   3rd Qu.:0.0048300  
##  Max.   :  1.275   Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000  
##     liveness         valence           tempo         duration_ms    
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  4000  
##  1st Qu.:0.0927   1st Qu.:0.3310   1st Qu.: 99.96   1st Qu.:187819  
##  Median :0.1270   Median :0.5120   Median :121.98   Median :216000  
##  Mean   :0.1902   Mean   :0.5106   Mean   :120.88   Mean   :225800  
##  3rd Qu.:0.2480   3rd Qu.:0.6930   3rd Qu.:133.92   3rd Qu.:253585  
##  Max.   :0.9960   Max.   :0.9910   Max.   :239.44   Max.   :517810

From the descriptive statistics of only the numeric variables that we obtained above, we see that for some variables the mean is not very close to the median, which indicates the skewness in the data.

To check whether the variables contain outliers, we plot their distributions using boxplots (in the Visual Summary section).

-Visual Summary

-Generating boxplots

par(mfrow = c(2, 2))
# Draw the boxplots directly; assigning them to a, b, c is unnecessary
# and would mask the base function c()
boxplot(ss$danceability, main = 'Boxplot distribution of Danceability')
boxplot(ss$loudness, main = 'Boxplot distribution of loudness')
boxplot(ss$tempo, main = 'Boxplot distribution of tempo')

Box plots are helpful for outlier detection. In the numerical summary above, we observed that a few columns have the mean pulled towards one side by outliers or skewness. Here we check the boxplots of these variables to identify outliers and treat them.

Outlier Detection and Treatment

From the boxplots we see that danceability has one value at 0, which stands apart from the rest of the values. Similarly, loudness has one very low value (about -46), and tempo has one value far above and one far below the majority of data points.

We remove these records and visualize the data again to see how the distributions change.

-Trimming Outliers

df_2 <- subset(ss, danceability > min(danceability) & loudness > min(loudness) & tempo > min(tempo) & tempo < max(tempo))
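The subset() call above drops only the rows that sit at the extreme minimum or maximum of each variable. As an alternative sketch, not the method used in this report, the conventional 1.5 x IQR rule that boxplot whiskers are based on can be written as a small helper. The function name iqr_outliers() and the toy values are made up for illustration:

```r
# Flag values outside [Q1 - k * IQR, Q3 + k * IQR]; k = 1.5 is the
# conventional boxplot whisker rule.
iqr_outliers <- function(x, k = 1.5) {
  q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  iqr <- q[2] - q[1]
  x < (q[1] - k * iqr) | x > (q[2] + k * iqr)
}

# Toy tempo values (made up for illustration), with one implausible entry.
tempo_toy <- c(98, 102, 110, 118, 120, 124, 128, 0)
flagged <- iqr_outliers(tempo_toy)
print(tempo_toy[flagged])
```

On the real data this rule would flag many more points than the three or four removed above, so whether to use it depends on how aggressive the trimming should be.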

-Visualizing the distributions again

par(mfrow = c(2, 2))

boxplot(df_2$danceability, main = 'Distribution of Danceability')
boxplot(df_2$loudness, main = 'Distribution of loudness')
boxplot(df_2$tempo , main = 'Distribution of tempo')

Thus, in data cleaning, we checked the variable types, imputed the missing values, reviewed the numerical summaries, and detected and treated the outliers.

-Data in the most condensed form possible

The below table shows a glimpse of the final cleaned dataset.

knitr::kable(head(df_2,5), "simple")

track_id	track_name	track_artist	track_popularity	track_album_id	track_album_name	track_album_release_date	playlist_name	playlist_id	playlist_genre	playlist_subgenre	danceability	energy	key	loudness	mode	speechiness	acousticness	instrumentalness	liveness	valence	tempo	duration_ms
6f807x0ima9a1j3VPbc7VN	I Don’t Care (with Justin Bieber) - Loud Luxury Remix	Ed Sheeran	66	2oCs0DGTsRO98Gh5ZSl2Cx	I Don’t Care (with Justin Bieber) [Loud Luxury Remix]	2019-06-14	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.748	0.916	6	-2.634	1	0.0583	0.1020	0.00e+00	0.0653	0.518	122.036	194754
0r7CVbZTWZgbTCYdfa2P31	Memories - Dillon Francis Remix	Maroon 5	67	63rPSO264uRjW1X5E6cWv6	Memories (Dillon Francis Remix)	2019-12-13	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.726	0.815	11	-4.969	1	0.0373	0.0724	4.21e-03	0.3570	0.693	99.972	162600
1z1Hg7Vb0AhHDiEmnDE79l	All the Time - Don Diablo Remix	Zara Larsson	70	1HoSmj2eLcsrR0vE9gThr4	All the Time (Don Diablo Remix)	2019-07-05	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.675	0.931	1	-3.432	0	0.0742	0.0794	2.33e-05	0.1100	0.613	124.008	176616
75FpbthrwQmzHlBJLuGdC7	Call You Mine - Keanu Silva Remix	The Chainsmokers	60	1nqYsOef1yKKuGOVchbsk6	Call You Mine - The Remixes	2019-07-19	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.718	0.930	7	-3.778	1	0.1020	0.0287	9.40e-06	0.2040	0.277	121.956	169093
1e8PAfcKUYoKkxPhrHqw4x	Someone You Loved - Future Humans Remix	Lewis Capaldi	69	7m7vv9wlQ4i0LFuJiE2zsQ	Someone You Loved (Future Humans Remix)	2019-03-05	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.650	0.833	1	-4.672	1	0.0359	0.0803	0.00e+00	0.0833	0.725	123.976	189052

Exploratory Data Analysis

#### Extracting Year from songs
ss <- ss %>%
  separate(track_album_release_date,
           c("year", "month", "day"),
           sep = "-")

#### Creating minutes from duration
ss<-ss %>% 
  mutate(duration_min=duration_ms/60000)

#### changing data type of year column
ss$year <- as.numeric(ss$year)

popularity_order<-select(ss,track_popularity,track_artist,track_album_name)
arrange(popularity_order,desc(track_popularity))

## # A tibble: 32,833 x 3
##    track_popularity track_artist   track_album_name                           
##               <dbl> <chr>          <chr>                                      
##  1              100 Tones and I    Dance Monkey (Stripped Back) / Dance Monkey
##  2              100 Tones and I    Dance Monkey (Stripped Back) / Dance Monkey
##  3               99 Arizona Zervas ROXANNE                                    
##  4               99 Arizona Zervas ROXANNE                                    
##  5               99 Arizona Zervas ROXANNE                                    
##  6               99 Arizona Zervas ROXANNE                                    
##  7               98 KAROL G        Tusa                                       
##  8               98 Maroon 5       Memories                                   
##  9               98 The Weeknd     Blinding Lights                            
## 10               98 Maroon 5       Memories                                   
## # ... with 32,823 more rows

popularity_order <- popularity_order%>% filter(duplicated(track_popularity)== FALSE)
popularity_order

## # A tibble: 101 x 3
##    track_popularity track_artist     track_album_name                           
##               <dbl> <chr>            <chr>                                      
##  1               66 Ed Sheeran       I Don't Care (with Justin Bieber) [Loud Lu~
##  2               67 Maroon 5         Memories (Dillon Francis Remix)            
##  3               70 Zara Larsson     All the Time (Don Diablo Remix)            
##  4               60 The Chainsmokers Call You Mine - The Remixes                
##  5               69 Lewis Capaldi    Someone You Loved (Future Humans Remix)    
##  6               62 Katy Perry       Never Really Over (R3HAB Remix)            
##  7               68 Avicii           Tough Love (Tiësto Remix)                  
##  8               58 Ed Sheeran       Cross Me (feat. Chance the Rapper & PnB Ro~
##  9               63 Martin Garrix    Summer Days (feat. Macklemore & Patrick St~
## 10               65 David Guetta     Say My Name (feat. Bebe Rexha & J Balvin) ~
## # ... with 91 more rows

pop2 <- arrange(popularity_order,desc(track_popularity))
pop2

## # A tibble: 101 x 3
##    track_popularity track_artist        track_album_name                        
##               <dbl> <chr>               <chr>                                   
##  1              100 Tones and I         Dance Monkey (Stripped Back) / Dance Mo~
##  2               99 Arizona Zervas      ROXANNE                                 
##  3               98 KAROL G             Tusa                                    
##  4               97 Billie Eilish       everything i wanted                     
##  5               96 The Black Eyed Peas RITMO (Bad Boys For Life)               
##  6               95 Billie Eilish       WHEN WE ALL FALL ASLEEP, WHERE DO WE GO?
##  7               94 Regard              Ride It                                 
##  8               93 Anuel AA            China                                   
##  9               92 Juice WRLD          Bandit (with YoungBoy Never Broke Again)
## 10               91 MEDUZA              Lose Control                            
## # ... with 91 more rows

pop3 <- table(pop2$track_artist)
pop3

## 
##              真之介            香取慎吾 5 Seconds of Summer       A R I Z O N A 
##                   1                   1                   1                   1 
##             AAAMYYY              Alesso        Ant Saunders            Anuel AA 
##                   1                   1                   1                   1 
##      Arizona Zervas              Avicii Axwell /\\ Ingrosso            Bastille 
##                   1                   2                   1                   1 
##       Billie Eilish              Bolier       Boys Get Hurt    Carly Rae Jepsen 
##                   3                   1                   1                   1 
##              Catiso          Charli XCX        Charlie Puth        Clean Bandit 
##                   1                   1                   1                   1 
##            Coldplay        Daddy Yankee        David Guetta           Deee-Lite 
##                   1                   1                   1                   1 
##              Deorro          Disclosure               DVBBS             E-girls 
##                   1                   1                   1                   1 
##          Ed Sheeran      Ellie Goulding         EMMA WAHLIN               Grace 
##                   2                   3                   1                   1 
##             Gryffin            Hardwell         Herve Pagez          Jonas Blue 
##                   3                   1                   1                   1 
##           JP Cooper          Juice WRLD             KAROL G             Kaskade 
##                   1                   1                   1                   1 
##          Katy Perry                Kygo       Lewis Capaldi           Lil Nas X 
##                   2                   2                   1                   1 
##    Lindsey Stirling    Maggie Lindemann            Maroon 5          Marshmello 
##                   1                   1                   1                   1 
##       Martin Garrix         Matt Simons                 MAX              MEDUZA 
##                   1                   1                   1                   1 
##             Molella        Nikki Vianna              ODESZA         OneRepublic 
##                   1                   1                   1                   1 
##               R3HAB              Regard               Riton       Ryuji Imaichi 
##                   1                   1                   1                   1 
##           SAINt JHN      Sakurako Ohara               SHAED             Shallou 
##                   1                   1                   1                   1 
##        Shinn Yamada                 Sia             Starley          Steve Aoki 
##                   1                   1                   1                   4 
##              STVCKS               SUNMI Swedish House Mafia             T-Spoon 
##                   1                   1                   1                   1 
## The Black Eyed Peas    The Chainsmokers              Tiësto         Tones and I 
##                   1                   3                   1                   1 
##               TWICE          Tyler Shaw               Vinil        Why Don't We 
##                   1                   1                   1                   2 
##     Wolves By Night              Yves V        Zara Larsson                Zedd 
##                   1                   1                   2                   1

# coord_flip() is a ggplot2 function and has no effect on a base R barplot;
# use horiz = TRUE to draw horizontal bars instead
barplot(pop3, horiz = TRUE, las = 1, cex.names = 0.5)

suppressWarnings(wordcloud(words = pop2$track_artist,
                           freq = pop2$track_popularity,
                           min.freq = 1,
                           max.words = 200,
                           random.order = FALSE,
                           rot.per = 0.35,
                           colors = brewer.pal(8, "Dark2")))

Artists are sized according to popularity: the larger the word, the greater the popularity of the artist's track.

library(dplyr)

aa2<-ss

aa2$speech_only <- cut(aa2$speechiness, breaks = 10)
aa2 %>%
  ggplot( aes(x = speech_only )) +
  geom_bar(width = 0.8, fill = "blue", colour = "black") +
  scale_x_discrete(name = "Speechiness")

The plot shows that most songs in the data set have low speechiness, which suggests that listeners favor songs with fewer spoken words.
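The bar chart counts songs per speechiness bin. To see how popularity itself varies across bins, one option is to compare the mean track_popularity per bin. A minimal sketch on made-up data, where the songs data frame and its values are assumptions for illustration only:

```r
# Toy data (made up): speechiness and popularity for six songs.
songs <- data.frame(
  speechiness      = c(0.03, 0.05, 0.08, 0.30, 0.45, 0.90),
  track_popularity = c(70, 60, 50, 40, 30, 20)
)

# Bin speechiness into equal-width intervals, as cut() does in the report
# (three bins here instead of ten, because the toy data is tiny).
songs$speech_bin <- cut(songs$speechiness, breaks = 3)

# Number of songs per bin (what the bar chart shows) ...
bin_counts <- table(songs$speech_bin)

# ... versus mean popularity per bin (what "favored" would require).
mean_pop <- tapply(songs$track_popularity, songs$speech_bin, mean)

print(bin_counts)
print(mean_pop)
```

On the real data, the same two lines with aa2$speech_only and aa2$track_popularity would show whether the crowded low-speechiness bins are also the most popular ones.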

aa<-ss
aa$energy_only <- cut(aa$energy, breaks = 10)
aa %>%
  ggplot( aes(x = energy_only )) +
  geom_bar(width = 1, fill = "blue", colour = "black") +
  scale_x_discrete(name = "Energy")

The plot shows that songs with energy around 0.8 to 0.9 are the most common, which suggests that this range is the most preferred among listeners.

library(corrplot)

## corrplot 0.92 loaded

library(GGally)

## Warning: package 'GGally' was built under R version 4.1.2

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

corr <- ss %>%
select(track_popularity,danceability,energy,loudness,speechiness,acousticness,instrumentalness, liveness, valence, tempo)

ggcorr(corr,
       nbreaks = 6,
       label = TRUE,
       label_size = 3,
       color = "grey50")

Based on the plot, popularity does not correlate strongly with any of the other track features. The only comparatively strong correlation in the matrix, around 0.7, is between energy and loudness.
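The coefficients behind a correlation plot can also be listed numerically with base R's cor(). The sketch below uses a hypothetical helper, cor_with_target(), and made-up toy data; it is one possible way to rank features by their correlation with a target column, not part of the original analysis:

```r
# Hypothetical helper: correlation of every numeric column with a target
# column, sorted by absolute strength.
cor_with_target <- function(df, target) {
  num <- df[sapply(df, is.numeric)]
  r <- cor(num, use = "complete.obs")[, target]
  r <- r[names(r) != target]
  r[order(abs(r), decreasing = TRUE)]
}

# Toy data (made up): 'loud' tracks 'pop' perfectly, 'quiet' only weakly.
toy <- data.frame(
  pop   = c(10, 20, 30, 40, 50),
  loud  = c(1, 2, 3, 4, 5),
  quiet = c(3, 1, 4, 1, 5)
)
res <- cor_with_target(toy, "pop")
print(round(res, 2))
```

Applied to the report's data, the call would be cor_with_target(corr, "track_popularity"), where corr is the data frame of selected features built above.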

library(ggplot2)
#### Plotting Density Plots
ggplot(ss) +
  # alpha is a fixed setting, so it belongs outside aes()
  geom_density(aes(energy, fill = "energy"), alpha = 0.1) + 
  geom_density(aes(danceability, fill = "danceability"), alpha = 0.1) + 
  geom_density(aes(valence, fill = "valence"), alpha = 0.1) + 
  geom_density(aes(acousticness, fill = "acousticness"), alpha = 0.1) + 
  geom_density(aes(speechiness, fill = "speechiness"), alpha = 0.1) + 
  geom_density(aes(liveness, fill = "liveness"), alpha = 0.1) + 
  scale_x_continuous(name = "Energy, Danceability, Valence, Acousticness, Speechiness and Liveness") +
  scale_y_continuous(name = "Density") +
  ggtitle("Density plot of Energy, Danceability, Valence, Acousticness, Speechiness and Liveness") +
  theme_bw() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        text = element_text(size = 10)) +
  theme(legend.title=element_blank()) +
  scale_fill_brewer(palette="Accent")

bp <- ggplot(ss, aes(energy, playlist_genre)) + 
  geom_boxplot(aes(fill = playlist_genre)) +
  theme_minimal() +
  theme(legend.position = "top")
bp

From the above plot we can see that the EDM genre has the songs with the highest energy.

bp1 <- ggplot(ss, aes(danceability, playlist_genre)) + 
  geom_boxplot(aes(fill = playlist_genre)) +
  theme_minimal() +
  theme(legend.position = "top")
bp1

From the above plot we can see that the Rap genre has the songs with the highest danceability.

bp2 <- ggplot(ss, aes(liveness, playlist_genre)) + 
  geom_boxplot(aes(fill = playlist_genre)) +
  theme_minimal() +
  theme(legend.position = "top")
bp2

From the above plot we can see that the EDM genre has the songs with the most liveness.

bp3 <- ggplot(ss, aes(valence, playlist_genre)) + 
  geom_boxplot(aes(fill = playlist_genre)) +
  theme_minimal() +
  theme(legend.position = "top")
bp3

From the above plot we can see that the Latin genre has the songs with the highest valence.

bp4 <- ggplot(ss, aes(loudness, playlist_genre)) + 
  geom_boxplot(aes(fill = playlist_genre)) +
  theme_minimal() +
  theme(legend.position = "top")
bp4

From the above plot we can see that the EDM genre has louder songs than the other genres.

trend_chart <- function(arg){
  # Average of the chosen variable per release year, for years after 2010.
  # summarize() with .data[[arg]] replaces the deprecated funs() helper.
  trend_change <- ss %>%
    filter(year > 2010) %>%
    group_by(year) %>%
    summarize(Average = mean(.data[[arg]]))
  chart <- ggplot(data = trend_change, aes(x = year, y = Average)) +
    geom_line(color = "black", size = 1) +
    scale_x_continuous(breaks = seq(2011, 2020, 3)) +
    scale_y_continuous(name = arg)
  return(chart)
}

trend_chart_track_popularity<-trend_chart("track_popularity") + theme_classic()

trend_chart_danceability <- trend_chart("danceability") + theme_classic()
trend_chart_energy <- trend_chart("energy") + theme_classic()
trend_chart_loudness <- trend_chart("loudness")
trend_chart_duration_min <- trend_chart("duration_min") + theme_classic()
trend_chart_speechiness <- trend_chart("speechiness") + theme_classic()

plot_grid(trend_chart_track_popularity, trend_chart_danceability, trend_chart_energy,
          trend_chart_loudness, trend_chart_duration_min, trend_chart_speechiness,
          ncol = 3, label_size = 3)

From the above plot we can see that the average duration of songs decreases with each year.
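A downward trend like this can be quantified by fitting a straight line to the yearly averages. The sketch below assumes a hypothetical table `toy_trend` with invented values; with the real data, the yearly averages computed inside `trend_chart()` would take its place.

```r
# Toy yearly averages (hypothetical values, not taken from the Spotify data)
toy_trend <- data.frame(
  year = 2011:2020,
  duration_min = 4.0 - 0.05 * (0:9)
)

# Fit a straight line; the slope is the average change in minutes per year
fit <- lm(duration_min ~ year, data = toy_trend)
slope <- unname(coef(fit)["year"])
slope
```

A negative slope supports the reading that songs are getting shorter, and its size gives the average change in minutes per year.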

suppressPackageStartupMessages(library(factoextra))
suppressPackageStartupMessages(library(cluster))

Clustering Analysis

We apply K-means clustering to the dataset. K-means clustering is a technique that assigns each observation in a dataset to one of K clusters. Here we cluster the first 20 songs using nine audio features.

Z <- select(ss, danceability, energy, loudness, speechiness, acousticness,
            instrumentalness, liveness, valence, tempo)
clus <- head(Z, 20)
clus

## # A tibble: 20 x 9
##    danceability energy loudness speechiness acousticness instrumentalness
##           <dbl>  <dbl>    <dbl>       <dbl>        <dbl>            <dbl>
##  1        0.748  0.916    -2.63      0.0583      0.102         0         
##  2        0.726  0.815    -4.97      0.0373      0.0724        0.00421   
##  3        0.675  0.931    -3.43      0.0742      0.0794        0.0000233 
##  4        0.718  0.93     -3.78      0.102       0.0287        0.00000943
##  5        0.65   0.833    -4.67      0.0359      0.0803        0         
##  6        0.675  0.919    -5.38      0.127       0.0799        0         
##  7        0.449  0.856    -4.79      0.0623      0.187         0         
##  8        0.542  0.903    -2.42      0.0434      0.0335        0.00000483
##  9        0.594  0.935    -3.56      0.0565      0.0249        0.00000397
## 10        0.642  0.818    -4.55      0.032       0.0567        0         
## 11        0.679  0.923    -6.5       0.181       0.146         0.00000492
## 12        0.437  0.774    -4.92      0.0554      0.148         0         
## 13        0.744  0.726    -4.68      0.0463      0.0399        0         
## 14        0.572  0.915    -4.45      0.0625      0.0111        0         
## 15        0.69   0.78     -4.45      0.0594      0.00733       0.00183   
## 16        0.805  0.835    -4.60      0.0896      0.13          0.00000503
## 17        0.694  0.901    -4.32      0.0948      0.0702        0         
## 18        0.678  0.747    -5.29      0.165       0.0395        0         
## 19        0.746  0.557    -6.72      0.0542      0.103         0.0036    
## 20        0.467  0.821    -5.47      0.0934      0.00791       0.000441  
## # ... with 3 more variables: liveness <dbl>, valence <dbl>, tempo <dbl>
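K-means relies on Euclidean distance, so a feature measured on a large scale (such as tempo in beats per minute) can outweigh features that lie between 0 and 1. One common precaution is to standardise the columns first. The sketch below is an illustration under assumptions, not the approach used in this report: `toy`, `toy_scaled` and `km_scaled` are hypothetical names and the values are made up.

```r
# Toy feature table (hypothetical values) mixing 0-1 features with tempo in BPM
toy <- data.frame(
  danceability = c(0.75, 0.73, 0.45, 0.44, 0.68, 0.47),
  energy       = c(0.92, 0.82, 0.86, 0.77, 0.75, 0.82),
  tempo        = c(122, 100, 113, 123, 120, 123)
)

# Standardise each column to mean 0 and standard deviation 1
toy_scaled <- scale(toy)

# k-means starts from random centres, so fix the seed for reproducibility
set.seed(42)
km_scaled <- kmeans(toy_scaled, centers = 2, nstart = 25)
km_scaled$cluster
```

Whether scaling is appropriate depends on whether every feature should carry equal weight in the distance calculation.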

# Finding optimal number of clusters using K-Means

fviz_nbclust(clus, kmeans, method = "wss")

Typically, when we create this type of plot, we look for an “elbow” where the sum of squares begins to “bend” or level off. The value of k at the elbow is typically taken as the optimal number of clusters.

For this plot, there appears to be a bit of an elbow or “bend” at k = 4 clusters.
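The quantity behind the elbow plot is the total within-cluster sum of squares for each candidate k. A minimal sketch of computing it directly with base R's `kmeans()` follows; the simulated data `toy` and the name `wss` are assumptions made for illustration only.

```r
set.seed(1)

# Toy data: three well-separated 2-D groups (simulated, not the Spotify data)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 6, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# The drop in wss flattens once k reaches the number of real groups
round(wss, 2)
```

Plotting `wss` against k gives the same kind of curve as the elbow plot, with the bend where adding more clusters stops reducing the sum of squares by much.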

#calculate gap statistic based on number of clusters
gap_stat <- clusGap(clus,
                    FUN = kmeans,
                    nstart = 25,
                    K.max = 10,
                    B = 50)

#plot number of clusters vs. gap statistic
fviz_gap_stat(gap_stat)
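Besides reading the gap statistic plot, a suggested k can be extracted programmatically from the `clusGap` result with `maxSE()` from the cluster package. This is a sketch on simulated data under stated assumptions: `toy`, `gap_toy` and `best_k` are hypothetical names, and `"firstSEmax"` is only one of the selection rules that `maxSE()` offers.

```r
library(cluster)

set.seed(1)

# Toy data: three well-separated 2-D groups (simulated, not the Spotify data)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 6, sd = 0.3), ncol = 2)
)

# Gap statistic for k = 1..6 (a small B keeps the example fast)
gap_toy <- clusGap(toy, FUNcluster = kmeans, nstart = 25,
                   K.max = 6, B = 20, verbose = FALSE)

# Choose k from the gap values and their simulation standard errors
best_k <- maxSE(gap_toy$Tab[, "gap"], gap_toy$Tab[, "SE.sim"],
                method = "firstSEmax")
best_k
```

Different rules can suggest different values of k, so the result is best read alongside the elbow plot rather than taken as definitive.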

#perform k-means clustering with k = 4 clusters
km <- kmeans(clus, centers = 4, nstart = 25)

#view results
km

## K-means clustering with 4 clusters of sizes 9, 1, 2, 8
## 
## Cluster means:
##   danceability    energy  loudness speechiness acousticness instrumentalness
## 1    0.6494444 0.8743333 -4.169111    0.064500   0.05590333     2.074589e-04
## 2    0.7260000 0.8150000 -4.969000    0.037300   0.07240000     4.210000e-03
## 3    0.5975000 0.7065000 -5.755000    0.058250   0.14500000     1.800000e-03
## 4    0.6456250 0.8422500 -4.697750    0.099525   0.07277625     5.691875e-05
##    liveness   valence    tempo
## 1 0.2170111 0.5426667 125.3212
## 2 0.3570000 0.6930000  99.9720
## 3 0.1570000 0.2380000 112.3045
## 4 0.2040375 0.4598750 121.4769
## 
## Clustering vector:
##  [1] 4 2 1 4 1 1 3 1 1 1 4 4 4 1 1 1 4 4 3 4
## 
## Within cluster sum of squares by cluster:
## [1] 22.935639  0.000000  2.214049 29.170506
##  (between_SS / total_SS =  93.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

#plot results of final k-means model
fviz_cluster(km, data = clus) + theme_classic()

#add cluster assignment to original data
final_data <- cbind(clus, cluster = km$cluster)

#view final data
head(final_data,20)

##    danceability energy loudness speechiness acousticness instrumentalness
## 1         0.748  0.916   -2.634      0.0583      0.10200         0.00e+00
## 2         0.726  0.815   -4.969      0.0373      0.07240         4.21e-03
## 3         0.675  0.931   -3.432      0.0742      0.07940         2.33e-05
## 4         0.718  0.930   -3.778      0.1020      0.02870         9.43e-06
## 5         0.650  0.833   -4.672      0.0359      0.08030         0.00e+00
## 6         0.675  0.919   -5.385      0.1270      0.07990         0.00e+00
## 7         0.449  0.856   -4.788      0.0623      0.18700         0.00e+00
## 8         0.542  0.903   -2.419      0.0434      0.03350         4.83e-06
## 9         0.594  0.935   -3.562      0.0565      0.02490         3.97e-06
## 10        0.642  0.818   -4.552      0.0320      0.05670         0.00e+00
## 11        0.679  0.923   -6.500      0.1810      0.14600         4.92e-06
## 12        0.437  0.774   -4.918      0.0554      0.14800         0.00e+00
## 13        0.744  0.726   -4.675      0.0463      0.03990         0.00e+00
## 14        0.572  0.915   -4.451      0.0625      0.01110         0.00e+00
## 15        0.690  0.780   -4.446      0.0594      0.00733         1.83e-03
## 16        0.805  0.835   -4.603      0.0896      0.13000         5.03e-06
## 17        0.694  0.901   -4.322      0.0948      0.07020         0.00e+00
## 18        0.678  0.747   -5.289      0.1650      0.03950         0.00e+00
## 19        0.746  0.557   -6.722      0.0542      0.10300         3.60e-03
## 20        0.467  0.821   -5.466      0.0934      0.00791         4.41e-04
##    liveness valence   tempo cluster
## 1    0.0653   0.518 122.036       4
## 2    0.3570   0.693  99.972       2
## 3    0.1100   0.613 124.008       1
## 4    0.2040   0.277 121.956       4
## 5    0.0833   0.725 123.976       1
## 6    0.1430   0.585 124.982       1
## 7    0.1760   0.152 112.648       3
## 8    0.1110   0.367 127.936       1
## 9    0.6370   0.366 127.015       1
## 10   0.0919   0.590 124.957       1
## 11   0.1240   0.752 121.984       4
## 12   0.1330   0.329 123.125       4
## 13   0.3740   0.687 121.985       4
## 14   0.3390   0.678 123.919       1
## 15   0.0729   0.238 126.070       1
## 16   0.3650   0.722 125.028       1
## 17   0.4270   0.368 118.051       4
## 18   0.1740   0.516 120.002       4
## 19   0.1380   0.324 111.961       3
## 20   0.1310   0.232 122.676       4

Using K-means clustering, we grouped the 20 sampled songs into four clusters based on their audio features.
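Once each song carries a cluster label, the clusters can be described by averaging the features within each one. The sketch below uses a tiny made-up table; `toy`, `labelled` and `profile` are hypothetical names, and with the real data `final_data` would play the role of `labelled`.

```r
# Toy feature table (hypothetical values, not the Spotify data)
toy <- data.frame(
  energy  = c(0.90, 0.88, 0.30, 0.35),
  valence = c(0.80, 0.75, 0.20, 0.25)
)

# Cluster the rows and attach the labels, mirroring the cbind() step above
set.seed(7)
km_toy <- kmeans(toy, centers = 2, nstart = 10)
labelled <- cbind(toy, cluster = km_toy$cluster)

# Mean of every feature within each cluster
profile <- aggregate(. ~ cluster, data = labelled, FUN = mean)
profile
```

The per-cluster means should match the cluster means that `kmeans()` reports, and they make it easier to give each cluster a plain-language description.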

Summary

A common assumption is that energy drives popularity, that is, that energetic songs are better known. However, we could not find any relationship between popularity and energy.

Some of the key relationships we found were:

-The lower the speechiness, the more a song is favored by users

-An energy range of around 0.8 to 0.9 is most preferred among users

-The EDM genre has the songs with the highest energy and liveness

-The Rap genre has the songs with the highest danceability

-The Latin genre has songs with higher valence

-Songs in the EDM genre are louder than those in other genres

-The duration of songs is decreasing every year

The average popularity of songs reached its lowest value of recent decades in 2014 and has been growing steadily since then, suggesting that songs are becoming more popular over time.

We used trend charts to see how the features change over time, the corrplot function in R to understand the relationships among the features, and boxplots to find outliers.

As we had a limited number of records (about 32k) for our analysis, we could not gain a full picture of the features of music. The analysis could also be improved with customer-related information such as playlist history, country of residence, and whether the listener is a premium customer. Finally, we used the K-means clustering algorithm to group songs with similar audio features.

Data Wrangling Group 11 - Abhiram Daivala ,Jashanpreet ,Truong (Jack)
