Client: Ubisoft Game Corp - with a focus on the game Just Dance
The video game Just Dance allows players to enjoy dancing along with the latest popular songs. However, each year when a new version is released, deciding which songs to include becomes a critical challenge. Choosing the wrong tracks can reduce player engagement and lead to a decline in sales. Conversely, selecting the right songs can significantly enhance the game’s appeal and strengthen its market share. Therefore, the problem we aim to address is: “Which songs would be the most effective to include in the next version of Just Dance?”
In 2009, the first Just Dance game used the Nintendo Wii’s motion-sensing technology to offer a gaming experience that invited players of all ages to move to the world’s biggest hits. With its innovative gameplay and irresistible tracks, the Just Dance series became a benchmark for dancing games and is the number 1 music game franchise of all time. With over 10 titles released on a wide range of platforms over the years, the franchise has reached an impressive 135+ million players and created more than 500 unique choreographies.
We are approaching this assignment as executives for the company in charge of the game “Just Dance”. In this sense, our goal is to find songs with high levels of danceability as well as songs with high popularity, to pick a selection for the next release.
By looking at these factors, we can find a balance between songs that draw people to the game and songs that keep them dancing. We also compare which other categories are associated with higher popularity and danceability.
To address the problem, we use a Spotify dataset that provides both song-level metadata and audio features. The dataset includes variables such as:
- Song information: track_name, track_artist, track_popularity, track_album_release_date
- Playlist data (categorical): playlist_genre, playlist_subgenre
- Audio features: danceability, energy, tempo, valence, speechiness, acousticness, instrumentalness, liveness, etc.
Our main focus will be danceability as well as popularity, but there are other factors we would like to consider, such as tempo, loudness, and energy, to determine categories such as skill level.
First, we will clean the data by removing rows with null values in track name, track genre, popularity, and danceability, since these are the key components of our analysis. We treat danceability and popularity as the top two factors, and use a scatterplot of danceability vs popularity to identify a group of songs that score highly on both. After selecting songs on these two factors, we will use other factors to place them in categories such as level, genre, and intensity. For example, we will use tempo and valence to determine level, where a higher level corresponds to a song of higher intensity. We will also factor in song length to find the amount of dancing a player may prefer.
This analysis will help the client decide which songs to use in advertising: the most popular songs could be used to promote the game and draw in customers. But popularity does not always mean high danceability, so the remaining songs, while less popular, would keep customers playing because of their danceability. The rest of the data analyzed would then be used to categorize the songs into levels and genres for the game itself.
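The categorization step described above could be sketched as follows. This is only an illustration: the `toy_songs` data frame, the 110 and 130 BPM cut points, the 0.5 valence threshold, and the level and mood labels are all assumptions made for the example, not values taken from the analysis.

```r
library(dplyr)

# Hypothetical stand-in for the cleaned Spotify data (values are illustrative only)
toy_songs <- data.frame(
  track_name  = c("Song A", "Song B", "Song C", "Song D"),
  tempo       = c(95, 118, 128, 150),
  valence     = c(0.30, 0.55, 0.80, 0.65),
  duration_ms = c(180000, 210000, 195000, 240000)
)

levelled <- toy_songs %>%
  mutate(
    # Assumed cut points: up to 110 BPM = easy, above 110 up to 130 = medium,
    # above 130 = hard
    level = cut(tempo,
                breaks = c(-Inf, 110, 130, Inf),
                labels = c("easy", "medium", "hard")),
    # Assumed threshold: valence of 0.5 or more is treated as an upbeat mood
    mood = ifelse(valence >= 0.5, "upbeat", "mellow"),
    # Convert milliseconds to minutes so song length is easier to read
    duration_min = duration_ms / 60000
  )

print(levelled)
```

Here tempo alone drives the level and valence is kept as a separate mood label; a combined intensity score would be an equally reasonable design.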
Spotify dataset via spotifyr package
Click here to go to the data source
Kaylin Pavlik had a recent blog post using the audio features to explore and classify songs. She used the spotifyr package to collect about 5000 songs from 6 main categories (EDM, Latin, Pop, R&B, Rap, & Rock). There were 23 variables in her data, some of which we are examining.
The spotifyr package was initially authored by Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff. This package allows you to enter an artist’s name and retrieve their entire discography in seconds, along with Spotify’s audio features and track/album popularity metrics. You can also pull song and playlist information for a given Spotify user.
library(dplyr) #data manipulation
library(tidyverse) #creating tidy data
library(ggplot2) #visualising data
library(tibble) #converting data to tables
library(knitr) #allowing R code integration into markdown file
#Imported data using read.csv
songs <- read.csv("spotify_songs.csv")
#Used head to take a look at the data without seeing entire data set
head(songs, 5)
## track_id track_name
## 1 6f807x0ima9a1j3VPbc7VN I Don't Care (with Justin Bieber) - Loud Luxury Remix
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories - Dillon Francis Remix
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the Time - Don Diablo Remix
## 4 75FpbthrwQmzHlBJLuGdC7 Call You Mine - Keanu Silva Remix
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone You Loved - Future Humans Remix
## track_artist track_popularity track_album_id
## 1 Ed Sheeran 66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2 Maroon 5 67 63rPSO264uRjW1X5E6cWv6
## 3 Zara Larsson 70 1HoSmj2eLcsrR0vE9gThr4
## 4 The Chainsmokers 60 1nqYsOef1yKKuGOVchbsk6
## 5 Lewis Capaldi 69 7m7vv9wlQ4i0LFuJiE2zsQ
## track_album_name
## 1 I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2 Memories (Dillon Francis Remix)
## 3 All the Time (Don Diablo Remix)
## 4 Call You Mine - The Remixes
## 5 Someone You Loved (Future Humans Remix)
## track_album_release_date playlist_name playlist_id playlist_genre
## 1 2019-06-14 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 2 2019-12-13 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 3 2019-07-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 4 2019-07-19 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 5 2019-03-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## playlist_subgenre danceability energy key loudness mode speechiness
## 1 dance pop 0.748 0.916 6 -2.634 1 0.0583
## 2 dance pop 0.726 0.815 11 -4.969 1 0.0373
## 3 dance pop 0.675 0.931 1 -3.432 0 0.0742
## 4 dance pop 0.718 0.930 7 -3.778 1 0.1020
## 5 dance pop 0.650 0.833 1 -4.672 1 0.0359
## acousticness instrumentalness liveness valence tempo duration_ms
## 1 0.1020 0.00e+00 0.0653 0.518 122.036 194754
## 2 0.0724 4.21e-03 0.3570 0.693 99.972 162600
## 3 0.0794 2.33e-05 0.1100 0.613 124.008 176616
## 4 0.0287 9.43e-06 0.2040 0.277 121.956 169093
## 5 0.0803 0.00e+00 0.0833 0.725 123.976 189052
#Use str to check the structure
str(songs)
## 'data.frame': 32833 obs. of 23 variables:
## $ track_id : chr "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
## $ track_name : chr "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
## $ track_artist : chr "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
## $ track_popularity : int 66 67 70 60 69 67 62 69 68 67 ...
## $ track_album_id : chr "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
## $ track_album_name : chr "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
## $ track_album_release_date: chr "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
## $ playlist_name : chr "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
## $ playlist_id : chr "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
## $ playlist_genre : chr "pop" "pop" "pop" "pop" ...
## $ playlist_subgenre : chr "dance pop" "dance pop" "dance pop" "dance pop" ...
## $ danceability : num 0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
## $ energy : num 0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
## $ key : int 6 11 1 7 1 8 5 4 8 2 ...
## $ loudness : num -2.63 -4.97 -3.43 -3.78 -4.67 ...
## $ mode : int 1 1 0 1 1 1 0 0 1 1 ...
## $ speechiness : num 0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
## $ acousticness : num 0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
## $ instrumentalness : num 0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
## $ liveness : num 0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
## $ valence : num 0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
## $ tempo : num 122 100 124 122 124 ...
## $ duration_ms : int 194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...
#Checked for missing values, the total number and missing values in each row and column
sum(is.na(songs))
## [1] 15
colSums(is.na(songs))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
#rowSums(is.na(songs))
Used “dplyr” to remove the 5 missing values in track_name and 5 missing values in track_artist. Then checked using colSums.
songs <- songs %>%
filter(!is.na(track_name), !is.na(track_artist))
colSums(is.na(songs[, c("track_name", "track_artist")]))
## track_name track_artist
## 0 0
#duplicated(songs)
sum(duplicated(songs))
## [1] 0
colnames(songs)
## [1] "track_id" "track_name"
## [3] "track_artist" "track_popularity"
## [5] "track_album_id" "track_album_name"
## [7] "track_album_release_date" "playlist_name"
## [9] "playlist_id" "playlist_genre"
## [11] "playlist_subgenre" "danceability"
## [13] "energy" "key"
## [15] "loudness" "mode"
## [17] "speechiness" "acousticness"
## [19] "instrumentalness" "liveness"
## [21] "valence" "tempo"
## [23] "duration_ms"
songs$track_album_release_date <- as.Date(songs$track_album_release_date)
Changed playlist_genre and playlist_subgenre to factors so that they are treated as categories rather than plain text strings.
songs$playlist_genre <- as.factor(songs$playlist_genre)
songs$playlist_subgenre <- as.factor(songs$playlist_subgenre)
After converting track_album_release_date to a date, we noticed missing values and counted them using sum(is.na()). The column had no missing values before the conversion (see the colSums output above), so these 1886 NAs were introduced by as.Date(), presumably for entries it could not parse. Because release date is not the main focus, we keep these rows and flag them as missing so they are easy to filter and inspect later.
sum(is.na(songs$track_album_release_date))
## [1] 1886
songs <- songs %>%
mutate(missing_release_date = is.na(track_album_release_date))
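If the missing dates come from entries that contain only a year, or only a year and month (an assumption, not something verified here), they could be recovered by padding them to a full date before conversion. The helper `parse_partial_date()` below is hypothetical, and padding to the first day of the month or year is an arbitrary choice.

```r
# Hypothetical helper: pads "YYYY" and "YYYY-MM" strings so as.Date() can parse them
parse_partial_date <- function(x) {
  x <- as.character(x)
  # Year only, e.g. "2012" -> "2012-01-01"
  x <- ifelse(grepl("^[0-9]{4}$", x), paste0(x, "-01-01"), x)
  # Year and month, e.g. "2012-05" -> "2012-05-01"
  x <- ifelse(grepl("^[0-9]{4}-[0-9]{2}$", x), paste0(x, "-01"), x)
  # With an explicit format, anything still unparseable becomes NA
  as.Date(x, format = "%Y-%m-%d")
}

# Illustrative input, not taken from the dataset
example_dates <- c("2019-06-14", "2012", "2012-05", "unknown")
parsed <- parse_partial_date(example_dates)
print(parsed)
```

A helper like this would have to be applied to the original character column, in place of the plain as.Date() call.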
dHead <- head(songs)
knitr::kable(dHead, format = "html", align = "lccrr", caption = "Spotify Songs Head")
| track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | playlist_subgenre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | missing_release_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6f807x0ima9a1j3VPbc7VN | I Don’t Care (with Justin Bieber) - Loud Luxury Remix | Ed Sheeran | 66 | 2oCs0DGTsRO98Gh5ZSl2Cx | I Don’t Care (with Justin Bieber) [Loud Luxury Remix] | 2019-06-14 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.748 | 0.916 | 6 | -2.634 | 1 | 0.0583 | 0.1020 | 0.00e+00 | 0.0653 | 0.518 | 122.036 | 194754 | FALSE |
| 0r7CVbZTWZgbTCYdfa2P31 | Memories - Dillon Francis Remix | Maroon 5 | 67 | 63rPSO264uRjW1X5E6cWv6 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.726 | 0.815 | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 4.21e-03 | 0.3570 | 0.693 | 99.972 | 162600 | FALSE |
| 1z1Hg7Vb0AhHDiEmnDE79l | All the Time - Don Diablo Remix | Zara Larsson | 70 | 1HoSmj2eLcsrR0vE9gThr4 | All the Time (Don Diablo Remix) | 2019-07-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.931 | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.33e-05 | 0.1100 | 0.613 | 124.008 | 176616 | FALSE |
| 75FpbthrwQmzHlBJLuGdC7 | Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | 1nqYsOef1yKKuGOVchbsk6 | Call You Mine - The Remixes | 2019-07-19 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.718 | 0.930 | 7 | -3.778 | 1 | 0.1020 | 0.0287 | 9.40e-06 | 0.2040 | 0.277 | 121.956 | 169093 | FALSE |
| 1e8PAfcKUYoKkxPhrHqw4x | Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | 7m7vv9wlQ4i0LFuJiE2zsQ | Someone You Loved (Future Humans Remix) | 2019-03-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.650 | 0.833 | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.00e+00 | 0.0833 | 0.725 | 123.976 | 189052 | FALSE |
| 7fvUMiyapMsRRxr07cU8Ef | Beautiful People (feat. Khalid) - Jack Wins Remix | Ed Sheeran | 67 | 2yiy9cd2QktrNvWC2EUi0k | Beautiful People (feat. Khalid) [Jack Wins Remix] | 2019-07-11 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | dance pop | 0.675 | 0.919 | 8 | -5.385 | 1 | 0.1270 | 0.0799 | 0.00e+00 | 0.1430 | 0.585 | 124.982 | 163049 | FALSE |
variable_summary <- data.frame(
Variable = colnames(songs),
Type = sapply(songs, class),
Missing_Values = colSums(is.na(songs))
)
print(variable_summary)
## Variable Type Missing_Values
## track_id track_id character 0
## track_name track_name character 0
## track_artist track_artist character 0
## track_popularity track_popularity integer 0
## track_album_id track_album_id character 0
## track_album_name track_album_name character 0
## track_album_release_date track_album_release_date Date 1886
## playlist_name playlist_name character 0
## playlist_id playlist_id character 0
## playlist_genre playlist_genre factor 0
## playlist_subgenre playlist_subgenre factor 0
## danceability danceability numeric 0
## energy energy numeric 0
## key key integer 0
## loudness loudness numeric 0
## mode mode integer 0
## speechiness speechiness numeric 0
## acousticness acousticness numeric 0
## instrumentalness instrumentalness numeric 0
## liveness liveness numeric 0
## valence valence numeric 0
## tempo tempo numeric 0
## duration_ms duration_ms integer 0
## missing_release_date missing_release_date logical 0
ggplot(data = songs, aes(x = danceability, y = track_popularity)) +
geom_jitter(alpha = 0.2, color = "violet") +
labs(title = "Jitter Plot of Popularity against Danceability",
x = "Danceability",
y= "Popularity")
This jitter plot was used to look for a correlation between danceability and popularity, the two variables most important to our analysis. However, we see no clear linear relationship between the two. Because the large number of data points may be crowding the chart, we filtered down to the 30 most danceable songs.
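To put a number on what the jitter plot suggests, the correlation between the two variables could be computed with `cor()`. The sketch below uses a small made-up data frame (`toy_songs`) so that it runs on its own. With the real data, the same calls would take `songs$danceability` and `songs$track_popularity`.

```r
# Made-up values for illustration only; not taken from the Spotify data
toy_songs <- data.frame(
  danceability     = c(0.45, 0.60, 0.72, 0.81, 0.90, 0.55),
  track_popularity = c(70, 35, 66, 20, 58, 81)
)

# Pearson correlation measures the strength of a linear relationship
pearson <- cor(toy_songs$danceability, toy_songs$track_popularity,
               method = "pearson")

# Spearman correlation is rank-based, so it also reflects monotonic non-linear trends
spearman <- cor(toy_songs$danceability, toy_songs$track_popularity,
                method = "spearman")

print(c(pearson = pearson, spearman = spearman))
```

Values close to 0 would support the reading that danceability and popularity are largely unrelated.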
dance_top <- songs %>%
arrange(desc(danceability)) %>%
head(30)
ggplot(data = dance_top, aes(x = danceability, y = track_popularity)) +
  geom_jitter(color = "violet") +
  labs(title = "Jitter Plot of Popularity against Danceability",
       subtitle = "Top 30 most danceable songs",
       x = "Danceability",
       y = "Popularity") +
  geom_vline(xintercept = 0.975, linetype = "dashed", color = "red") +
  geom_hline(yintercept = 40, linetype = "dashed", color = "red")
From the plot above, we still don’t see any correlation, but it lets us better visualize which quadrant to pick our top songs from. We decided to focus on the first (upper-right) quadrant, where danceability is above 0.975 and popularity is above 40.
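Picking out the upper-right quadrant amounts to a simple filter on the two reference lines. A minimal sketch, using a made-up `dance_top_toy` data frame in place of `dance_top` and reusing the 0.975 and 40 cut-offs from the plot:

```r
library(dplyr)

# Made-up stand-in for dance_top; values are illustrative only
dance_top_toy <- data.frame(
  track_name       = c("Song A", "Song B", "Song C", "Song D"),
  danceability     = c(0.979, 0.981, 0.970, 0.968),
  track_popularity = c(55, 12, 61, 30)
)

# Keep songs to the right of the vertical line and above the horizontal line
upper_right <- dance_top_toy %>%
  filter(danceability >= 0.975, track_popularity >= 40)

print(upper_right)
```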
songs %>%
group_by(playlist_genre) %>%
summarise(avg_dance = mean(danceability, na.rm = TRUE)) %>%
ggplot(aes(x = reorder(playlist_genre, avg_dance), y = avg_dance)) +
geom_col(fill = "gray") +
labs(
title = "Avg Danceability by Genre",
x = "Genre",
y = "Avg Danceability"
) +
theme_classic()
This visualization shows average danceability by genre. Rap and latin are on average the most danceable genres, while rock and pop are the least.
ggplot(songs, aes(x = danceability)) +
geom_histogram(binwidth = 0.05, fill = "blue", color = "white") +
facet_wrap(~ playlist_genre) +
labs(
title = "Histogram of Danceability by Genre",
x = "Danceability",
y = "Count"
)
This visualization is a histogram of danceability by genre. It shows the distribution of how danceable each genre is. The key takeaways are that rap and latin are overall very danceable, while rock is not. It also shows that each genre has some very danceable songs, even if the genre is not danceable overall.
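The observation that every genre contains some very danceable songs could be checked by computing, per genre, the share of songs above a cut-off. The sketch below assumes a cut-off of 0.8 and uses a made-up `toy_songs` data frame, so the numbers it prints say nothing about the real data.

```r
library(dplyr)

# Made-up stand-in for the songs data; values are illustrative only
toy_songs <- data.frame(
  playlist_genre = c("rap", "rap", "rock", "rock", "rock", "latin"),
  danceability   = c(0.85, 0.78, 0.40, 0.82, 0.55, 0.90)
)

# Share of songs in each genre above an assumed "very danceable" cut-off of 0.8
share_by_genre <- toy_songs %>%
  group_by(playlist_genre) %>%
  summarise(
    n_songs         = n(),
    # The mean of a logical vector is the proportion of TRUE values
    share_danceable = mean(danceability > 0.8),
    .groups = "drop"
  ) %>%
  arrange(desc(share_danceable))

print(share_by_genre)
```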
ggplot(songs, aes(x = tempo)) +
geom_histogram(binwidth = 10,
fill = "grey",
color = "white",
alpha = 0.8) +
labs(title = "Tempo Distribution of Songs",
subtitle = "Highlighting the Danceable Tempo Range (100–140 BPM)",
x = "Tempo (BPM)",
y = "Number of Songs") +
geom_histogram(data = songs[songs$tempo >= 100 & songs$tempo <=140, ],
binwidth = 10, fill ="#1DB954", color="white") +
theme_classic()
The tempo distribution histogram shows the tempo (BPM) of the Spotify songs. Most tempos are concentrated in the 100-140 BPM range (highlighted in green), which is a range suitable for dancing. In other words, the cleaned data offers plenty of songs in a danceable tempo range.
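The share of songs inside the highlighted range can also be computed directly, which gives a concrete figure to go with the histogram. A minimal sketch on made-up tempos (`toy_tempo` is hypothetical). With the real data the same expression would use `songs$tempo`.

```r
# Made-up tempos (BPM) for illustration only
toy_tempo <- c(92, 104, 118, 121, 128, 135, 150, 174)

# Logical vector: TRUE where the tempo falls inside the 100-140 BPM range
in_range <- toy_tempo >= 100 & toy_tempo <= 140

# The mean of a logical vector is the proportion of TRUE values
share_in_range <- mean(in_range)
print(share_in_range)
```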
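To build the shortlist, we filter for songs with danceability above 0.8 and popularity above 80 and take the first 10 matches. All ten fall within the pop genre and the dance pop subgenre. Note that head(10) returns rows in dataset order rather than ranked by score, so this result likely reflects the ordering of the data (its first rows all come from the same “Pop Remix” playlist) rather than a pattern across genres.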
library(knitr)
head_1 <- songs %>%
  filter(danceability > 0.8, track_popularity > 80) %>%
  select(track_name, track_popularity, danceability, playlist_genre, playlist_subgenre) %>%
  head(10)
Top 10 songs for Just Dance’s New Release
knitr::kable(head_1, format = "html", align = "lccrr")
| track_name | track_popularity | danceability | playlist_genre | playlist_subgenre |
|---|---|---|---|---|
| Taki Taki (with Selena Gomez, Ozuna & Cardi B) | 83 | 0.842 | pop | dance pop |
| Giant (with Rag’n’Bone Man) | 83 | 0.807 | pop | dance pop |
| Ride It | 94 | 0.880 | pop | dance pop |
| Tusa | 98 | 0.803 | pop | dance pop |
| Morado | 82 | 0.881 | pop | dance pop |
| Easy - Remix | 81 | 0.886 | pop | dance pop |
| Blanco | 88 | 0.870 | pop | dance pop |
| Dance Monkey | 92 | 0.824 | pop | dance pop |
| Rare | 88 | 0.838 | pop | dance pop |
| Shape of You | 86 | 0.825 | pop | dance pop |
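Because `head(10)` returns rows in dataset order, one alternative would be to rank the filtered songs explicitly. The sketch below is an assumption about how that could be done, not the method used above. It keeps one row per track with `distinct()` (a track may appear in more than one playlist), builds a hypothetical `score` that weights the two measures equally, and keeps the highest-scoring rows. The `toy_songs` data frame is made up so the example runs on its own.

```r
library(dplyr)

# Made-up stand-in for the songs data; values are illustrative only
toy_songs <- data.frame(
  track_id         = c("a1", "a1", "b2", "c3", "d4"),
  track_name       = c("Song A", "Song A", "Song B", "Song C", "Song D"),
  track_popularity = c(90, 90, 88, 95, 60),
  danceability     = c(0.85, 0.85, 0.90, 0.70, 0.95),
  playlist_genre   = c("pop", "latin", "rap", "pop", "edm")
)

top_ranked <- toy_songs %>%
  # Keep one row per track so repeated playlist entries do not fill the list
  distinct(track_id, .keep_all = TRUE) %>%
  filter(danceability > 0.8, track_popularity > 80) %>%
  # Hypothetical equal-weight score; popularity is rescaled from 0-100 to 0-1
  mutate(score = (track_popularity / 100 + danceability) / 2) %>%
  arrange(desc(score)) %>%
  head(2)

print(top_ranked)
```

The equal weighting is arbitrary. Weighting popularity more heavily would favour songs suited to advertising, while weighting danceability more heavily would favour gameplay.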
We anticipate that our conclusions will contribute to the success of the new edition of Just Dance that Ubisoft intends to release.