Spotify Songs: An analysis of the spotify dataset

Sanjay Jayakumar

4/3/2020

Analysis of the popular songs on Spotify

Introduction

1.Problem Statement

If music be the food of love, play on…

Over the centuries since Shakespeare,music has continued to be an indispensable aspect of our daily life.Not a day passes by without you humming a tune or listening to one.Our sources of listening to them have also evolved over time. From the vintage gramaphones through mix tapes through radio, our listening medium has now reached online music-streaming platforms. Spotify, Apple music, Google play etc are some of the most popular of these.

With over 36% market share among online music subscribers and having a base of over 100 million subscribers, Spotify occupies the top spot. As a music afficionado, this prompted me to dig deeper in to the spotify songs database to discover interesting trends regarding the songs, their artists and hopefully help people discover new music.

2.Data

The Spotify_data used for this analysis is a contiguous dataset of over 30,000 songs with 12 audio features for each track, including confidence measures like acousticness,liveness, speechiness and instrumentalness, perceptual measures like energy, loudness, danceablity and valence (positiveness), and descriptors like duration,key, tempo and mode.

3.The approach

A three pronged approach is taken here :

  • An overview of the genres and the artists

  • An analysis of the track popularity and how it extends aross genres

  • Based on the preferences of the user, build a simple system that gives out related songs based on genre and / or artist

Several univariate and multivariate analyses are done across variables post data preparation to substantiate the findings / results obtained by the above approaches

4.What can you do with the analysis

  • Identify the association between different types of songs

  • As a music aficionado, find the songs that were off your radar

  • If you love that genre, get a recommendation for it

  • Identify the popular songs, the popularity of artists etc

Since a song can be associated with multiple genres on Spotify, there may be cases of multiple instances of the same song appearing while analysing

Packages Required

The packages used in the project (currently) are:

  • data.table: To read the csv files in the fastest possible way

  • DT : Filtering, pagination, and sorting of data tables in html outputs

  • kableExtra : Manipulate table styles for good visualizations

  • knitr : Aligned displays of table in a html doc

  • stringr: String replacements and pattern matching

  • formattable: Allows to create aesthetic tables in R

  • plotly: Interactive graphing library

  • wordcloud: Creates wordclouds

  • RColorBrewer: offers several color palette for R

  • tidyselect: allows to create selecting verbs that are consistent with other tidyverse packages

  • tm: Used for text mining - creating Corpus

  • shiny: to create the shiny app

  • shinythemes: to alter shiny themes

  • tidyverse : Collection of R packages for data manipulation, exploration and visualization. I am currently using

    • dplyr: Data manipulation using filter, joins, summarise etc.
    • magrittr: The pipe %>% operator
# Loading the packages
library(data.table)
library(DT)
library(kableExtra)
library(knitr)
library(stringr)
library(formattable)
library(plotly)
library(wordcloud)
library(RColorBrewer)
library(tidyverse)
library(tidyselect)
library(tm)
library(shiny)
library(shinythemes)

Data Preparation

Description of dataset

The data for the analysis comes from Spotify via the spotifyr package. Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff authored this package to make it easier to get either your own data or general metadata arounds songs from Spotify’s API. This data was gathered in Jan’2020

Importing the dataset

spotify <- read.csv('D:/Cinci prep/Coursework/Data Wrangling with R/spotify_songs.csv')

The attribute of the dataset and the column information is provided for the key variables.

#dimension of the dataset
dimension <- str_trim(paste0(dim(spotify), sep = "  ", collapse = ""))

#column names of the dataset
vars <- str_trim(paste0(names(spotify), sep = " | ", collapse = ""))

# Creating a table
table_attributes <- data_frame(Data = 'Spotify_songs',
  `Rows  Columns` = dimension,
  Variables = vars)

# Printing the table
kable(table_attributes, format = "html") %>%
  kable_styling(bootstrap_options = "striped") %>%
    column_spec(2, width = "12em")
Data Rows Columns Variables
Spotify_songs 32833 23 track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |

Cleaning & Data Manipulation

We have 13 numeric variables (4 - integer, 9 - float) and 10 character variables in this dataset.

Are there any missing values?

The next step in cleaning the data was to identify any missing values across columns. A quick is.na() check is used fo the puropse.

# Identifying the missing values across columns

miss <- colSums(is.na(spotify))
print(miss[miss>0])
##       track_name     track_artist track_album_name 
##                5                5                5

Since these removing these observations are not expected to affect our approach, we remove these incomplete observations

# Identifying the missing values across columns
spotify <- na.omit(spotify)

Should we remove any columns?

# Removing irrelevant columns

spotify <- spotify %>% select(-c(playlist_id,playlist_name,track_album_id,playlist_subgenre))

The columns playlist_id,playlist_name,playlist_subgenre, track_album_id will not be used in the ensuing analysis. Hence we proceed to remove these columns as well

Are there any duplicate values?

# Identifying duplicate values
print(paste("Number of duplicate observations : " , sum(duplicated(spotify))))
## [1] "Number of duplicate observations :  2448"
spotify <- spotify[!duplicated(spotify),]

print(paste("We have ",sum(complete.cases(spotify)),"Complete cases!"))
## [1] "We have  30380 Complete cases!"

How are the variables distributed?

#Converting df from wide to long for plotting the box plot of variable distributions
long_df <- gather(spotify,decriptive_vars, values, danceability, energy,
                  speechiness, acousticness,  liveness,valence,
                  factor_key=TRUE)

#box plot of the audio features

var_distribution <- ggplot(long_df, aes(x = decriptive_vars, y = values)) +
        geom_boxplot() + 
        coord_flip() +
        labs(title = "Distribution across rest of the variables", x = "Audio features", y = "values")

key_box_plot <- ggplot(spotify, aes(y = key)) + geom_boxplot() +  guides(fill=FALSE) + labs(title = 'Distribution of Key') 

loudness_box_plot <- ggplot(spotify, aes(y = loudness)) + geom_boxplot() +  guides(fill=FALSE) + labs(title = 'Distribution of loudness') 

instrumentalness_box_plot <- ggplot(spotify, aes(y = instrumentalness)) + geom_boxplot() +  guides(fill=FALSE) + labs(title = 'Distribution of instrumentalness') 

tempo_box_plot <- ggplot(spotify, aes(y = tempo)) + geom_boxplot() +  guides(fill=FALSE) + labs(title = 'Distribution of tempo') 

duration_box_plot <- ggplot(spotify, aes(y = duration_ms)) + geom_boxplot() + guides(fill=FALSE) + labs(title = 'Distribution of duration_ms') 


ggplotly(key_box_plot)
ggplotly(loudness_box_plot)
ggplotly(instrumentalness_box_plot)
ggplotly(tempo_box_plot)
ggplotly(duration_box_plot)
ggplotly(var_distribution)

There are common outliers in all the variables that have values equal to 0 across multiple variables.We remove those observations as well.

#removing outliers from the dataset
spotify <- spotify[!(spotify$duration_ms==4000),]

duration_outliers <- boxplot(spotify$duration_ms, 
                             plot = FALSE, range = 4)$out

spotify <- spotify %>%
  filter(!duration_ms %in% duration_outliers) 

duration_new_plot <- spotify  %>%
  ggplot(aes(y = duration_ms)) +
  geom_boxplot() +
  coord_flip() +
  labs(title = 'Duration, outliers removed') 

duration_new_plot

We also remove outliers in the duration_ms column having values outside of the 4x range of values

Data Manipulation

We are creating 2 new columns

  • year - which will have the year of release of the album / song

  • durn_minutes - which will have the duration of the song in minutes

#creatiing a year columns
spotify$track_album_release_date <- as.character(spotify$track_album_release_date, "%m/%d/%Y")
spotify$year <- substr(spotify$track_album_release_date,1,4)

#Creating a duration in minutes column
spotify$durn_minutes <- spotify$duration_ms/(1000*60)

Final dimensions of the data

#dimension of the dataset
dimension <- str_trim(paste0(dim(spotify), sep = "  ", collapse = ""))

#column names of the dataset
vars <- str_trim(paste0(names(spotify), sep = " | ", collapse = ""))

# Creating a table
table_attributes <- data_frame(Data = 'Spotify_songs',
  `Rows  Columns` = dimension,
  Variables = vars)

# Printing the table
kable(table_attributes, format = "html") %>%
  kable_styling(bootstrap_options = "striped") %>%
    column_spec(2, width = "12em")
Data Rows Columns Variables
Spotify_songs 30379 21 track_id | track_name | track_artist | track_popularity | track_album_name | track_album_release_date | playlist_genre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | year | durn_minutes |

Snapshot of the Dataset

datatable(head(spotify, 100))

Data Description

The description of the variables are given below:

Data Data Type Variable description
track_id factor Song unique ID
track_name factor Song Name
track_artist factor Song Artist
track_popularity integer Song Popularity (0-100) where higher is better
track_album_name factor Song album name
track_album_release_date character Date when album released
playlist_genre factor playlist genre
danceability numeric Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1 is most danceable.
energy numeric Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key integer The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation . E.g. 0 = C, 2 = D, and so on. If no key was detected, the value is -1.
loudness numeric The overall loudness of a track in decibels (dB). Loudness values are averaged across the full track and are useful for comparing relative loudness of track. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db
mode integer Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0
speechiness numeric Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks
acousticness numeric A confidence measure from 0.0 to 1.0 of whether the track is acoustic.1 represents high confidence the track is acoustic
instrumentalness numeric Predicts whether a track contains no vocals. OOH and AAH sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly vocal. The closer the instrumentalness is to 1, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness numeric Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence numeric A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo numeric The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms integer Duration of song in milliseconds
year character Year of release of the album
durn_minutes numeric duration of song in minutes

Exploratory Data Analysis

We need to talk about the Genres

The spread across genres

First, let us find out the distribution of songs across genres. What genre has the most number of songs in the dataset?

# songs per genre
spotify %>% group_by(Genre = playlist_genre) %>%
  summarise(No_of_tracks = n()) %>% knitr::kable()
Genre No_of_tracks
edm 5537
latin 4639
pop 5132
r&b 5138
rap 5483
rock 4450

EDM is the genre in which most songs have been released, followed by rap and then pop.

Who are the artists with the most releases?

# artists with most releases
most_releases <- spotify %>% group_by(Artist = track_artist) %>%
  summarise(No_of_tracks = n()) %>%
  arrange(desc(No_of_tracks)) %>%
  top_n(15, wt = No_of_tracks) %>% 
  ggplot(aes(x = Artist, y = No_of_tracks)) +
        geom_bar(stat = "identity") +
        coord_flip() + labs(title = "artists with the most releases", x = "artists", y = "no of releases")

ggplotly(most_releases)

With ~130 tracks in their name, Queen have been the busiest artists over time. David Guetta comes in second with ~90 tracks

What are the popular words featuring in titles?

#Create a vector containing only the text
text <- spotify$track_name 
# Create a corpus  
docs <- Corpus(VectorSource(text))

#clean text data
docs <- docs %>%
        tm_map(removeNumbers) %>%
        tm_map(removePunctuation) %>%
        tm_map(stripWhitespace)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords,c("feat","edit","remix","remastered","remaster","radio","version","original","mix"))

#create a doument-term matrix

dtm <- TermDocumentMatrix(docs) 
matrix <- as.matrix(dtm) 
words <- sort(rowSums(matrix),decreasing=TRUE) 
df <- data.frame(word = names(words),freq=words)

#generate the word cloud
wordcloud(words = df$word, freq = df$freq,scale=c(8,0.25), min.freq = 1,
          max.words=150, random.order=FALSE, rot.per=0.25, 
          colors=brewer.pal(8, "Dark2"))

No word has been more associated with music than love. Love is the most frequently used word in track titles. Like, don’t, one etc are the other frequent ones

When were the tracks released?

Is there something as a golden age of music? A span of years when a lot of songs were released? The graph below will tell us

# grouping tracks by years

plot_year <- spotify %>% 
  select(year) %>%
  group_by(year) %>%
  summarise(count = n()) 

#plotting releases across years

year_plot <- ggplot(plot_year,aes(x = year, y = count,group = 1)) + 
  geom_line() +
  theme(legend.position = "none",axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Release of songs across years", x = "Year", 
       y = "No of songs released")

ggplotly(year_plot)

On further inspection, over 80% of the songs have been released in the 21st century. Let us get a clearer picture of this period, shall we?

# zooming into 21st century

plot_zoom_year <- spotify %>% 
        select(year) %>%
        group_by(year) %>%
        summarise(count = n()) %>% 
        subset(year >= 2000)


graph_zoom_year <- ggplot(plot_zoom_year,aes(x = year, y = count,group = 1)) + 
  geom_line() +
  theme(legend.position = "none",axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Music in 21st Century", x = "Year", 
       y = "No of songs released")

ggplotly(graph_zoom_year)

The advent and popularity of internet may have led to this remarkable spike in the number of songs released post 2000. The dip in 2020 is because we only have data for less than a month in 2020.

What characteristics describe a genre?

Now let us really start exploring the genres, understand the features that characterise a genre

#DF of characteristics of genres

genre_description <- spotify %>% 
  group_by(Genre = playlist_genre) %>%
  summarise(Danceability = median(danceability),
            Energy = median(energy),
            Key = median(key),
            Loudness = median(loudness),
            Mode = median(mode),
            Speechiness = median(speechiness),
            Acousticness = median(acousticness),
            Instrumentalness = median(instrumentalness),
            Liveness = median(liveness),
            Valence = median(valence),
            Tempo = median(tempo),
            Duration = median(durn_minutes))

kable(genre_description , format = "html") %>%
  kable_styling(bootstrap_options = "striped") %>%
    column_spec(2, width = "12em")
Genre Danceability Energy Key Loudness Mode Speechiness Acousticness Instrumentalness Liveness Valence Tempo Duration
edm 0.661 0.8270 6 -5.0270 1 0.06060 0.02040 3.75e-03 0.138 0.376 127.0040 3.406967
latin 0.726 0.7270 6 -5.8620 1 0.06620 0.13800 3.00e-06 0.121 0.620 112.0830 3.513850
pop 0.650 0.7270 5 -5.8740 1 0.04890 0.07595 1.38e-05 0.123 0.499 120.0255 3.517558
r&b 0.688 0.5990 6 -7.4125 1 0.06785 0.16400 4.80e-06 0.120 0.542 108.8660 3.870667
rap 0.734 0.6650 6 -6.5180 1 0.17500 0.11100 0.00e+00 0.128 0.516 119.9990 3.500000
rock 0.523 0.7785 5 -6.9510 1 0.04200 0.03680 2.05e-04 0.137 0.530 124.0135 3.947783

As expected, EDM is the “high energy genre”. Rap is also a popular party genre with 0.73 danceability (top rank). EDM, rock and pop have the highest tempo

How closely related are the genres?

names <- names(spotify)[c(8:10,12:19)]

# average features by genre
avg_genre_matrix <- spotify %>%
  group_by(playlist_genre) %>%
  summarise_if(is.numeric, median, na.rm = TRUE) %>%
  ungroup() 

#converting to matrix
avg_genre_cor <- avg_genre_matrix %>%
  select(names, -mode) %>% 
  scale() %>%
  t() %>%
  as.matrix() %>%
  cor() 

colnames(avg_genre_cor) <- avg_genre_matrix$playlist_genre
row.names(avg_genre_cor) <- avg_genre_matrix$playlist_genre

avg_genre_cor %>% corrplot::corrplot(method = 'color', 
                     order = 'hclust',
                     type = 'upper',
                     tl.col = 'black',
                     diag = FALSE,
                     addCoef.col = "black",
                     number.cex = 0.75,
                     mar = c(2,2,2,2),
                     main = 'Correlation Between Median Genre Feature Values',
                     family = 'Avenir')

The correlation values help us understand that EDM and Rock can be easily demarcated from the other genres. Having a correlation of 0.57, latin and r&b are the most similar in nature. Latin is poles apart from EDM and rock with correlation around -0.6. Source

..let’s not forget about the popularity analysis either

Now that we have had an overview of the genres and related aspects in the previous section, let us find out if there is anything discernible about the popularity of tracks

Who are the popular artists overall?

#finding popular artists
popular_artists <- spotify %>% group_by(Artist = track_artist) %>%
  summarise(No_of_tracks = n(),Popularity = mean(track_popularity))  %>% 
  filter(No_of_tracks > 2) %>%
  arrange(desc(Popularity)) %>%
  top_n(15, wt = Popularity) %>% 
  ggplot(aes(x = Artist, y = Popularity)) +
        geom_bar(stat = "identity") +
        coord_flip() + labs(title = "popular artists overall", x = "Artists", y = "Popularity")

ggplotly(popular_artists)

Trevor Daniel, Y2K and Don Toliver are the most popular artists. I have given a condition of artists having minimum 2 credits to their name so as to eliminate “one hit wonders”

Who are the top artists in each genre?

# top artists in each genre
top_artists_genre <- spotify %>% 
  group_by(Genre = playlist_genre, Artist = track_artist) %>%
  summarise(No_of_tracks = n(), Popularity = mean(track_popularity)) %>% 
  filter(No_of_tracks > 2) %>%
  arrange(desc(Popularity)) %>%
  top_n(1, wt = Popularity)


kable(top_artists_genre , format = "html") %>%
  kable_styling(bootstrap_options = "striped") %>%
    column_spec(2, width = "12em")
Genre Artist No_of_tracks Popularity
latin Billie Eilish 3 91.66667
pop Travis Scott 3 89.00000
r&b XXXTENTACION 3 86.33333
edm MEDUZA 3 85.33333
rap JACKBOYS 3 84.33333
rock MGMT 3 74.66667

What is the distribution of popularity among genres?

#popularity among genres
rating_plot <- ggplot(spotify, aes(x = playlist_genre, y = track_popularity)) +
        geom_boxplot() +
        coord_flip() +
        labs(title = "Popularity across genres", x = "Genres", y = "Popularity")

ggplotly(rating_plot)

Pop has the highest median popularity among the genres. EDM has least median popularity

How has the genre popularity changed over time?

For better clarity, I am only considering years 2000 & above

#poplarity movement across years
pop_across_years <- spotify %>% 
  group_by(playlist_genre, year) %>% 
  summarise(avg = mean(track_popularity) )%>% 
        subset(year >=2000)

year_graph <- pop_across_years %>%
  ggplot(aes(x = year, y = avg, 
             group = playlist_genre, color = playlist_genre)) +
        geom_line() +   labs(title = "21st Century", x = "Year of release",
                             y = "Average popularity") + 
  theme(legend.position = "none",
                            axis.text.x = element_text(angle = 90, hjust = 1))

year_graph

We don’t see any discernible pattern here.

Does the saying “Old is gold”, still hold true?

I wanted to see if the old songs were popular than the new ones. I have taken songs released from 1950s onwards.We see that, barring a significant dip during the decade of 2000 - 2010, the mean popularity has not changed much over the years

trend <- spotify %>% 
  group_by(year) %>%
  summarise(num_songs = n(), rating = sum(track_popularity)/n()) %>%
  ungroup() %>%
  ggplot(aes(x = year, y = rating, group = 1)) +
  geom_line() +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Rating vs. Year of release", x = "Year of release", y = "Average popularity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))


ggplotly(trend)

Here’s a song you will like

Who are the people in your life who point you in the direction of new artists, songs and even new styles of music?

I have created a shiny app which will collect the user preferences :

  • by genre - to list out top songs in the genre based on a customizable rating scale

  • by artist - to list out top songs of the artist based on a customizable rating scale

The shiny app is hosted here

The App

Code

User Interface for the app (ui.R)

shinyUI(navbarPage(theme = shinytheme("cosmo"),"Song recommender",
                   tabPanel("Based on Genre",
                            sidebarPanel(
                                # Genre Selection
                                
                                selectInput(inputId = "Columns", label = "Which genres do you like?",
                                            unique(songs$playlist_genre), multiple = FALSE),
                                verbatimTextOutput("rock"),
                                
                                sliderInput(inputId = "range", label = "Ragne of Ratings that you wish to read?",
                                            min = min(songs$track_popularity),max = 100,value = c(55,100))
                            ),
                            mainPanel(
                                    h2("Top songs of the genre"),
                                DT::dataTableOutput(outputId = "songsreco")
                            )
                   ),
                   tabPanel("Based on Artist",
                            sidebarPanel(selectInput(inputId = "singers", label = "Which singer do you like?",
                                                     unique(songs$track_artist), multiple = FALSE),
                                         verbatimTextOutput("Ed Sheeran"),
                                         
                                         sliderInput(inputId = "range_2", label = "Ragne of Ratings that you wish to read?",
                                                     min = min(songs$track_popularity),max = 100,value = c(55,100))),
                            mainPanel(
                                h2("Top songs of the artist"),
                                DT::dataTableOutput(outputId = "songsreco_artist")))))

Server Logic for the app (server.R)

shinyServer(function(input, output) {
    
    datasetInput <- reactive({
        
        # Filtering the books based on genre and rating
        songs %>% filter(playlist_genre %in% as.vector(input$Columns)) %>%
            group_by(track_name) %>% filter(track_popularity >= as.numeric(input$range[1]), track_popularity <= as.numeric(input$range[2])) %>%
            arrange(desc(track_popularity)) %>%
            select(track_name, track_artist, track_popularity, playlist_genre) %>%
            rename(`song` = track_name, `Genre(s)` = playlist_genre)
        

    })
    
    datasetInput2 <- reactive({
        
        # Filtering the books based on genre and rating
        songs %>% filter(track_artist %in% as.vector(input$singers)) %>%
            group_by(track_name) %>% filter(track_popularity >= as.numeric(input$range_2[1]), track_popularity <= as.numeric(input$range_2[2])) %>%
            arrange(desc(track_popularity)) %>%
            select(track_name, track_artist, track_popularity, playlist_genre) %>%
            rename(`song` = track_name, `Genre(s)` = playlist_genre)
        
        
    })
    
    
    #Rendering the table
    output$songsreco <- DT::renderDataTable({
        
        DT::datatable(head(datasetInput(), n = 50), escape = FALSE, options = list(scrollX = '1000px'))
    })
    
    output$songsreco_artist <- DT::renderDataTable({
        
        DT::datatable(head(datasetInput2(), n = 100), escape = FALSE, options = list(scrollX = '1000px'))
    })
})

Summary

I hope the insights given in the previous sections have been informative to you. I also hope that you have explored the app created and gotten hold of atleast a couple of new songs that you might want to try out.

Given below is a summary of the findings that I have come across during the exploration of this dataset

  • EDM is the genre in which most songs have been released, followed by rap and then pop.

  • Pop has the highest median popularity among the genres. EDM has least median popularity

  • EDM is the “high energy genre”. Rap is also a popular party genre with 0.73 danceability (top rank). EDM, rock and pop have the highest tempo

  • EDM and Rock can be easily demarcated from the other genres. . Latin is poles apart from EDM and rock

  • Barring a significant dip in mean popularityduring the decade of 2000 - 2010, the mean popularity has not changed much over the years

  • With ~130 tracks in their name, Queen have been the busiest artists over time. David Guetta comes in second with ~90 tracks

  • Trevor Daniel, Y2K and Don Toliver are the most popular artists. I have given a condition of artists having minimum 2 credits to their name so as to eliminate “one hit wonders”

  • Love is the most frequently used word in track titles. Like, don’t, one etc are the other frequent ones

Further scope

Next Steps

  • Finding track popularity using the given predictors. Prominent factors in SVM / Clustering can be used to identify any strong indicators on how a song gets a good popularity

  • A recommendation engine based on the genre selected (being slightly ambitious here)