Introduction

Problem Statement

The purpose of this project is to explore the Spotify music dataset and answer questions such as which types of music people prefer, which are better for dancing, which are more energetic, and which artists are most loved. Music is arguably the most widely used language in human history. Even without a shared spoken language, people can understand each other’s emotions through music and express their own. By analyzing music, we can get a realistic picture of how people felt in a given period of time.

Solution

After the data is collated, I will use a number of packages to analyze variables such as popularity, song artist, playlist genre and subgenre, loudness, and valence.

Packages Required

library(tibble)     # Stores data as a tibble, which is easier to handle and manipulate
library(DT)         # Displays data on the screen in a scrollable table
library(knitr)      # Displays static tables
library(tidyverse)  # Loads ggplot2 for plotting, along with other tidyverse packages
library(dplyr)      # Data manipulation
library(lubridate)  # Date helpers used below: floor_date(), ymd(), month()

Data Preparation

Data Import

The original spotify_songs.csv dataset has 23 variables, covering 32,833 songs over the last 10 years.

Data Import Code:

library(tibble)
url <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv"
music <- as_tibble(read.csv(url,stringsAsFactors = FALSE))
class(music)

## [1] "tbl_df"     "tbl"        "data.frame"

colnames(music)

##  [1] "track_id"                 "track_name"              
##  [3] "track_artist"             "track_popularity"        
##  [5] "track_album_id"           "track_album_name"        
##  [7] "track_album_release_date" "playlist_name"           
##  [9] "playlist_id"              "playlist_genre"          
## [11] "playlist_subgenre"        "danceability"            
## [13] "energy"                   "key"                     
## [15] "loudness"                 "mode"                    
## [17] "speechiness"              "acousticness"            
## [19] "instrumentalness"         "liveness"                
## [21] "valence"                  "tempo"                   
## [23] "duration_ms"

dim(music)

## [1] 32833    23

Data Cleaning

I used the “track_name” column to remove duplicate songs from the dataset, since they would distort the analysis.

new_music <- music[!duplicated(music$track_name), ]

After removing the duplicates, I rechecked the dimensions of the data. The dataset now contains 23 variables and 23,450 songs with unique names.

dim(new_music)

## [1] 23450    23
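
Removing duplicates by “track_name” alone also drops different songs that happen to share a title. The following is a minimal sketch of a stricter rule, assuming a duplicate means the same title by the same artist. The `toy` data frame and its values are invented for illustration and are not taken from the dataset.

```r
# Toy data (hypothetical values): two different songs share the title "Stay"
toy <- data.frame(
  track_name   = c("Stay", "Stay", "Halo", "Halo"),
  track_artist = c("Artist A", "Artist B", "Artist C", "Artist C"),
  stringsAsFactors = FALSE
)

# Deduplicating on the title alone keeps only the first "Stay"
by_name <- toy[!duplicated(toy$track_name), ]

# Deduplicating on title and artist keeps both "Stay" tracks
by_name_artist <- toy[!duplicated(toy[, c("track_name", "track_artist")]), ]

nrow(by_name)         # 2
nrow(by_name_artist)  # 3
```

Which rule is appropriate depends on whether the same recording appearing in several playlists or albums should count once or several times.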

Data

In the data, each row is a song and each column holds one attribute of that song.

The following is the cleaned dataset (top 100 rows):
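
A preview like this could be produced with the DT package listed above. The sketch below is an assumption about how such a table might be rendered, not the code used in this report. The `toy` data frame is a hypothetical stand-in for `new_music`, and the `pageLength` and `scrollX` settings are illustrative choices.

```r
# Stand-in for the cleaned data (hypothetical values); the report would use new_music
toy <- data.frame(
  track_name       = paste("Song", 1:150),
  track_popularity = rep(c(10, 50, 90), 50)
)

# Keep only the first 100 rows for display
preview <- head(toy, 100)

# Render a paginated, horizontally scrollable table when DT is installed
if (requireNamespace("DT", quietly = TRUE)) {
  DT::datatable(preview, options = list(pageLength = 10, scrollX = TRUE))
}
```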

Data Dictionary

Variable	Class	Description
track_id	character	Song unique ID
track_name	character	Song Name
track_artist	character	Song Artist
track_popularity	integer	Song Popularity (0-100) where higher is better
track_album_id	character	Album unique ID
track_album_name	character	Song album name
track_album_release_date	character	Date when album released
playlist_name	character	Name of playlist
playlist_id	character	Playlist ID
playlist_genre	character	Playlist genre
playlist_subgenre	character	Playlist subgenre
danceability	numeric	Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy	numeric	Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key	integer	The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness	numeric	The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode	integer	Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness	numeric	Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness	numeric	A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness	numeric	Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness	numeric	Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence	numeric	A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo	numeric	The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms	integer	Duration of song in milliseconds

# Convert relevant variables to numeric
numeric_cols <- c("track_popularity", "danceability", "energy", "key",
                  "loudness", "mode", "speechiness", "acousticness",
                  "instrumentalness", "liveness", "valence", "tempo",
                  "duration_ms")
music[numeric_cols] <- lapply(music[numeric_cols], as.numeric)

music <- na.omit(music)
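
Because `na.omit()` drops every row that has at least one missing value, it can help to count the missing values per column first. A minimal sketch in base R, using an invented `toy` data frame rather than the real data:

```r
# Toy data (hypothetical values) with one missing tempo and one missing key
toy <- data.frame(
  tempo = c(120.5, NA, 98.0, 130.2),
  key   = c(0, 5, NA, 7)
)

# Missing values per column, checked before any rows are dropped
missing_per_column <- colSums(is.na(toy))

# na.omit() removes every row that has at least one NA
cleaned <- na.omit(toy)
rows_dropped <- nrow(toy) - nrow(cleaned)

missing_per_column
rows_dropped  # 2
```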

Exploratory Data Analysis

track_album_release_date

theme_set(theme_bw() +  theme(plot.title = element_text(hjust = 0.5)))

# Truncate each album release date to the first day of its month
music <- music %>% 
  mutate(track_album_release_month = floor_date(as.Date(track_album_release_date), "month"))

# Count songs per release month, keeping months after January 2010
release_counts <- music %>% 
  group_by(track_album_release_month) %>% 
  summarise(count = n()) %>%
  filter(track_album_release_month > ymd("2010-01-01"))

# Create a line plot
ggplot(release_counts, aes(x = track_album_release_month, y = count)) + 
  geom_line() +
  labs(title = "Number of Songs Released by Month", x = "Year", y = "Count")

This line plot shows the number of songs released per month from February 2010 to January 2020. The number rises gradually from 2010 to 2019 and peaks in November 2019. In January 2020, the last month in the data, the number drops sharply, possibly because that month is only partly covered.
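
Counting by month relies on every release date parsing as a full date. The sketch below assumes that some values of track_album_release_date contain only a year or a year and month, and shows one way to pad them before parsing. The `pad_date()` helper and the sample values are hypothetical.

```r
# Hypothetical release dates: full date, year only, and year-month
dates <- c("2019-11-15", "2012", "2012-05")

# Parsed strictly as year-month-day, the partial dates become NA
strict <- as.Date(dates, format = "%Y-%m-%d")

# Pad partial dates to the first day of the year or month before parsing
pad_date <- function(x) {
  x <- ifelse(nchar(x) == 4, paste0(x, "-01-01"), x)  # "2012"    -> "2012-01-01"
  x <- ifelse(nchar(x) == 7, paste0(x, "-01"), x)     # "2012-05" -> "2012-05-01"
  as.Date(x, format = "%Y-%m-%d")
}
padded <- pad_date(dates)

sum(is.na(strict))  # 2
sum(is.na(padded))  # 0
```

Padding moves year-only dates into January, which would inflate January counts, so dropping those rows may be the safer choice for a month-level analysis.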

# Extract the calendar month (1-12) from the album release date
music$month <- month(ymd(music$track_album_release_date))

# Calculate the number of songs per month
music_monthly_count <- music %>% 
  group_by(month) %>% 
  summarize(count = n())

# Draw a bar chart
ggplot(music_monthly_count, aes(x = month, y = count)) +
  geom_col(fill = "steelblue", alpha = 0.5, width = 0.5) +
  labs(title = "Number of Songs Released by Month", x = "Month", y = "Count") +
  scale_x_continuous(breaks = 1:12)

The bar chart displays the number of songs released in each calendar month. October and November have the most releases, with 1645 and 1668 songs, respectively. February and March have the fewest, with 851 and 1097 songs, respectively.
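
Month numbers on the x-axis can be replaced by month names. A small sketch using base R's built-in `month.abb` vector, with invented month numbers standing in for `music$month`:

```r
# Hypothetical month numbers extracted from release dates
month_num <- c(1, 11, 11, 10, 2)

# month.abb is a built-in vector of abbreviated English month names;
# fixing the levels keeps the months in calendar order
month_label <- factor(month.abb[month_num], levels = month.abb)

# Counts per month in calendar order, including months with zero songs
month_table <- table(month_label)
month_table
```

A factor built this way can be mapped to the x aesthetic in place of the numeric month, in which case the `scale_x_continuous()` call is no longer needed.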

playlist_genre

playlist_genre_count <- music %>%
  group_by(playlist_genre) %>%
  summarise(count = n()) %>%
  arrange(desc(count))



# Draw a bar chart
ggplot(playlist_genre_count, aes(x = playlist_genre, y = count, fill = playlist_genre)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = count), vjust = -0.5, size = 3.5) +
  labs(title = "Number of Songs by Genre", x = "Genre", y = "Count")

The resulting playlist_genre_count table lists the number of songs in each playlist genre in descending order. The bar chart shows the same information, with the playlist genres on the x-axis and the song counts on the y-axis. The bars are colored by genre, and a label above each bar gives the exact count. EDM has the most songs (5,787), followed by R&B (5,228), pop (4,500), and Latin (1,251).
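
Raw counts can also be expressed as shares of all songs, which makes genres easier to compare. A minimal base R sketch with invented genre labels standing in for `music$playlist_genre`:

```r
# Hypothetical genre labels standing in for music$playlist_genre
genre <- c("edm", "edm", "pop", "rap", "edm", "pop")

# Count songs per genre and sort from largest to smallest
genre_count <- sort(table(genre), decreasing = TRUE)

# Convert counts to percentages of all songs
genre_share <- round(100 * prop.table(genre_count), 1)

genre_count
genre_share
```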

playlist_subgenre

# Create a table of subgenre counts
subgenre_counts <- music %>%
  group_by(playlist_subgenre) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
subgenre_counts

## # A tibble: 24 × 2
##    playlist_subgenre         count
##    <chr>                     <int>
##  1 progressive electro house  1809
##  2 southern hip hop           1674
##  3 indie poptimism            1672
##  4 latin hip hop              1655
##  5 neo soul                   1637
##  6 pop edm                    1517
##  7 electro house              1511
##  8 hard rock                  1485
##  9 gangster rap               1456
## 10 electropop                 1408
## # … with 14 more rows

# Create the bar chart
ggplot(data = subgenre_counts, aes(x = reorder(playlist_subgenre, -count), y = count)) +
  geom_bar(stat = "identity", fill = "steelblue", alpha = 0.8) +
  labs(title = "Number of Songs by Playlist Subgenre", x = "Playlist Subgenre", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The bar chart above shows the number of songs per playlist subgenre in descending order. According to the table, the subgenres “progressive electro house”, “southern hip hop”, and “indie poptimism” have the most songs, with counts of 1809, 1674, and 1672, respectively. The x-axis displays the playlist subgenres, and the y-axis represents the count of songs for each subgenre. The x-axis labels are angled at 45 degrees for better readability.
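
The printed tibble shows only the first 10 of its 24 rows, so the smallest subgenres are not visible in the output. A sketch of how both ends of the ranking could be extracted, using a hypothetical `counts` data frame in place of `subgenre_counts`:

```r
# Hypothetical counts standing in for subgenre_counts
counts <- data.frame(
  playlist_subgenre = c("a", "b", "c", "d", "e"),
  count = c(40, 10, 30, 50, 20),
  stringsAsFactors = FALSE
)

# Sort once in descending order, then take both ends
sorted  <- counts[order(-counts$count), ]
top3    <- head(sorted, 3)
bottom1 <- tail(sorted, 1)

top3$playlist_subgenre     # "d" "a" "c"
bottom1$playlist_subgenre  # "b"
```

With a tibble, `print(subgenre_counts, n = Inf)` is another option, since it prints all rows instead of the first 10.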

track_popularity

ggplot(music, aes(x = track_popularity)) + 
  geom_histogram(binwidth = 5, fill = "steelblue", alpha = 0.5) +
  labs(x = "Track Popularity", y = "Count", title = "Distribution of Track Popularity")

The plot shows that most tracks in the music data frame have a popularity score between 20 and 60. The distribution appears slightly skewed to the left, with fewer tracks at very low popularity scores (0-10) and more tracks at moderate popularity scores (30-60). A smaller peak in the 90-95 range indicates that a small number of tracks have very high popularity scores.
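
Statements about the shape of a histogram can be checked numerically. The sketch below uses an invented `popularity` vector rather than the real scores, and computes the share of tracks in the 20 to 60 range along with the mean, median, and quartiles.

```r
# Hypothetical popularity scores standing in for music$track_popularity
popularity <- c(0, 5, 25, 35, 45, 55, 60, 70, 85, 95)

# Share of tracks with a score between 20 and 60 (inclusive)
share_20_60 <- mean(popularity >= 20 & popularity <= 60)

# A mean below the median suggests a left skew, a mean above it a right skew
centre <- c(mean = mean(popularity), median = median(popularity))

# Quartiles give a compact summary of the spread
quartiles <- quantile(popularity, c(0.25, 0.5, 0.75))

share_20_60  # 0.5
centre
```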

ggplot(music, aes(x = playlist_genre, y = track_popularity)) + 
  geom_boxplot(fill = "steelblue", alpha = 0.5) +
  labs(x = "Playlist Genre", y = "Track Popularity", 
       title = "Distribution of Track Popularity by Playlist Genre")

The box plot indicates that the pop playlist genre has the highest median track popularity. The Latin playlist genre has the widest interquartile range and the highest maximum popularity value. The EDM and R&B playlist genres have relatively low median and maximum popularity values.
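
The medians and interquartile ranges that a box plot draws can also be computed directly, which makes comparisons between genres easier to verify. A minimal base R sketch on an invented `toy` data frame:

```r
# Toy data (hypothetical values) with two genres
toy <- data.frame(
  playlist_genre   = c("pop", "pop", "pop", "edm", "edm", "edm"),
  track_popularity = c(40, 60, 80, 10, 30, 70),
  stringsAsFactors = FALSE
)

# Median popularity per genre (the line inside each box)
genre_median <- tapply(toy$track_popularity, toy$playlist_genre, median)

# Interquartile range per genre (the height of each box)
genre_iqr <- tapply(toy$track_popularity, toy$playlist_genre, IQR)

genre_median
genre_iqr
```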

# Create a new data frame with average track popularity by playlist subgenre
subgenre_popularity <- music %>% 
  group_by(playlist_subgenre) %>% 
  summarise(avg_track_popularity = mean(as.numeric(track_popularity))) %>% 
  arrange(desc(avg_track_popularity))

# Create the bar chart with subgenres ordered by average popularity
ggplot(subgenre_popularity, aes(x = reorder(playlist_subgenre, avg_track_popularity), y = avg_track_popularity)) +
  geom_col(fill = "steelblue", alpha = 0.5, width = 0.5) +
  labs(title = "Average Track Popularity by Playlist Subgenre", x = "Playlist Subgenre", y = "Track Popularity") +
  theme(plot.title = element_text(hjust = 0.5), 
        axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid.major.x = element_blank())

The bar chart above shows the average track popularity for each playlist subgenre, ordered from lowest to highest. The “post-teen pop” subgenre has the highest average track popularity, followed by “hip pop” and “dance pop”. On the other hand, “progressive electro house” has the lowest average track popularity, with “neo soul” and “big room” following closely behind.
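
The order of the bars is controlled by `reorder()`, which sorts factor levels by increasing value of its second argument. Negating that argument reverses the order. A small sketch with invented values:

```r
# Hypothetical subgenres and their average popularity
subgenre       <- factor(c("a", "b", "c"))
avg_popularity <- c(30, 10, 20)

# Levels sorted by increasing average popularity
ascending <- levels(reorder(subgenre, avg_popularity))

# Negating the values sorts the levels by decreasing average popularity
descending <- levels(reorder(subgenre, -avg_popularity))

ascending   # "b" "c" "a"
descending  # "a" "c" "b"
```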

Summary

From 2010 to 2019, the number of songs released each month gradually increased, peaking in November 2019.
The months with the highest number of song releases are October and November, and the lowest are February and March.
The genre with the most songs is EDM, followed by R&B, pop, and Latin.
Most of the tracks in the music dataset have a popularity rating between 20 and 60.

Music Exploratory Analysis

Shumei Liu

2023-04-27