Let’s Explore!

Our Goal

Client: Ubisoft Game Corp - with a focus on the game Just Dance

The video game Just Dance allows players to enjoy dancing along with the latest popular songs. However, each year when a new version is released, deciding which songs to include becomes a critical challenge. Choosing the wrong tracks can reduce player engagement and lead to a decline in sales. Conversely, selecting the right songs can significantly enhance the game’s appeal and strengthen its market share. Therefore, the problem we aim to address is: “Which songs would be the most effective to include in the next version of Just Dance?”

Looks fun?

What is Just Dance?

In 2009, the first Just Dance game used the Nintendo Wii’s motion-sensing technology to offer a gaming experience that invited players of all ages to move to the world’s biggest hits. With its innovative gameplay and irresistible tracks, the Just Dance series became a benchmark for dancing games and is the number 1 music game franchise of all time. With over 10 titles released on a wide range of platforms over the years, the franchise has reached an impressive 135+ million players and created more than 500 unique choreographies.

Click here to go to Just Dance’s page!

Approach

We are approaching this assignment as executives for the company in charge of the game “Just Dance”. In this sense, our goal is to find songs with high levels of danceability as well as songs with high popularity, to pick a selection for the next release.

By weighing these factors, we can find a balance between songs that draw people to the game and songs that keep them dancing. We will also compare which other categories are associated with higher popularity and danceability.

Expected Data process

To address the problem, we use a Spotify dataset that provides both song-level metadata and audio features. The dataset includes variables such as:
- Song information: track_name, track_artist, track_popularity, track_album_release_date
- Playlist data (categorical): playlist_genre, playlist_subgenre
- Audio features: danceability, energy, tempo, valence, speechiness, acousticness, instrumentalness, liveness, etc.

Our main focus will be danceability as well as popularity, but there are other factors we would like to consider, such as tempo, loudness, and energy, to determine categories such as skill level.

First, we will clean the data by removing rows with null values in track name, track genre, popularity, or danceability, since these are the key components of our analysis. We treat danceability and popularity as the top two factors, and a scatterplot of danceability vs popularity will help us identify a group of songs that fall in the higher range for both. After selecting songs on those two factors, we will use other variables to place them in categories such as level, genre, and intensity. For example, we will use tempo and valence to determine level, where a higher level corresponds to a higher-intensity song. We will also factor in song length to find the amount of dancing a player may prefer.
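One way the level idea could look in practice is sketched below. It is only an illustration under stated assumptions: the toy data, the `assign_level` helper, the equal weighting of tempo and valence, and the three cut points are all hypothetical and are not part of the analysis that follows.

```r
# Toy data; the real analysis would use the cleaned `songs` data frame.
toy <- data.frame(
  track_name = c("A", "B", "C", "D"),
  tempo      = c(85, 110, 128, 150),
  valence    = c(0.30, 0.55, 0.80, 0.60)
)

# Hypothetical rule: rescale tempo to 0-1, average it with valence,
# then split the resulting intensity score into three equal-width bands.
assign_level <- function(tempo, valence) {
  tempo_scaled <- (tempo - min(tempo)) / (max(tempo) - min(tempo))
  intensity <- (tempo_scaled + valence) / 2
  cut(intensity,
      breaks = c(0, 1 / 3, 2 / 3, 1),
      labels = c("Easy", "Medium", "Hard"),
      include.lowest = TRUE)
}

toy$level <- assign_level(toy$tempo, toy$valence)
print(toy)
```

The weighting and band edges would need to be tuned against songs whose difficulty is already known.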

Expected Use

This analysis will help the client determine which songs to use in advertising: the most popular songs could be used to promote the game and draw in customers. But popularity doesn’t always mean a song is highly danceable, so the rest of the songs, while less popular, would keep customers playing because of their danceability. The remaining variables would then be used to categorize the songs into levels and genres for the game itself.

Data Preparation

Data Source

Spotify dataset via spotifyr package

Click here to go to the data source

Kaylin Pavlik wrote a blog post that used the audio features to explore and classify songs. The data was collected with the spotifyr package: about 5000 songs from 6 main categories (EDM, Latin, Pop, R&B, Rap, & Rock). Her data contains 23 variables, some of which we examine here.

The spotifyr package was initially authored by Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff. It allows you to enter an artist’s name and retrieve their entire discography in seconds, along with Spotify’s audio features and track/album popularity metrics. You can also pull song and playlist information for a given Spotify user.

Packages Used

library(dplyr) #data manipulation
library(tidyverse) #creating tidy data
library(ggplot2) #visualising data
library(tibble) #converting data to tables
library(knitr) #allowing R code integration into markdown file

Importing and Checking Structure

#Imported data using read.csv
songs <- read.csv("spotify_songs.csv")

#Used head to take a look at the data without seeing entire data set
head(songs, 5)

##                 track_id                                            track_name
## 1 6f807x0ima9a1j3VPbc7VN I Don't Care (with Justin Bieber) - Loud Luxury Remix
## 2 0r7CVbZTWZgbTCYdfa2P31                       Memories - Dillon Francis Remix
## 3 1z1Hg7Vb0AhHDiEmnDE79l                       All the Time - Don Diablo Remix
## 4 75FpbthrwQmzHlBJLuGdC7                     Call You Mine - Keanu Silva Remix
## 5 1e8PAfcKUYoKkxPhrHqw4x               Someone You Loved - Future Humans Remix
##       track_artist track_popularity         track_album_id
## 1       Ed Sheeran               66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2         Maroon 5               67 63rPSO264uRjW1X5E6cWv6
## 3     Zara Larsson               70 1HoSmj2eLcsrR0vE9gThr4
## 4 The Chainsmokers               60 1nqYsOef1yKKuGOVchbsk6
## 5    Lewis Capaldi               69 7m7vv9wlQ4i0LFuJiE2zsQ
##                                        track_album_name
## 1 I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2                       Memories (Dillon Francis Remix)
## 3                       All the Time (Don Diablo Remix)
## 4                           Call You Mine - The Remixes
## 5               Someone You Loved (Future Humans Remix)
##   track_album_release_date playlist_name            playlist_id playlist_genre
## 1               2019-06-14     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 2               2019-12-13     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 3               2019-07-05     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 4               2019-07-19     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 5               2019-03-05     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
##   playlist_subgenre danceability energy key loudness mode speechiness
## 1         dance pop        0.748  0.916   6   -2.634    1      0.0583
## 2         dance pop        0.726  0.815  11   -4.969    1      0.0373
## 3         dance pop        0.675  0.931   1   -3.432    0      0.0742
## 4         dance pop        0.718  0.930   7   -3.778    1      0.1020
## 5         dance pop        0.650  0.833   1   -4.672    1      0.0359
##   acousticness instrumentalness liveness valence   tempo duration_ms
## 1       0.1020         0.00e+00   0.0653   0.518 122.036      194754
## 2       0.0724         4.21e-03   0.3570   0.693  99.972      162600
## 3       0.0794         2.33e-05   0.1100   0.613 124.008      176616
## 4       0.0287         9.43e-06   0.2040   0.277 121.956      169093
## 5       0.0803         0.00e+00   0.0833   0.725 123.976      189052

#Use str to check the structure
str(songs)

## 'data.frame':    32833 obs. of  23 variables:
##  $ track_id                : chr  "6f807x0ima9a1j3VPbc7VN" "0r7CVbZTWZgbTCYdfa2P31" "1z1Hg7Vb0AhHDiEmnDE79l" "75FpbthrwQmzHlBJLuGdC7" ...
##  $ track_name              : chr  "I Don't Care (with Justin Bieber) - Loud Luxury Remix" "Memories - Dillon Francis Remix" "All the Time - Don Diablo Remix" "Call You Mine - Keanu Silva Remix" ...
##  $ track_artist            : chr  "Ed Sheeran" "Maroon 5" "Zara Larsson" "The Chainsmokers" ...
##  $ track_popularity        : int  66 67 70 60 69 67 62 69 68 67 ...
##  $ track_album_id          : chr  "2oCs0DGTsRO98Gh5ZSl2Cx" "63rPSO264uRjW1X5E6cWv6" "1HoSmj2eLcsrR0vE9gThr4" "1nqYsOef1yKKuGOVchbsk6" ...
##  $ track_album_name        : chr  "I Don't Care (with Justin Bieber) [Loud Luxury Remix]" "Memories (Dillon Francis Remix)" "All the Time (Don Diablo Remix)" "Call You Mine - The Remixes" ...
##  $ track_album_release_date: chr  "2019-06-14" "2019-12-13" "2019-07-05" "2019-07-19" ...
##  $ playlist_name           : chr  "Pop Remix" "Pop Remix" "Pop Remix" "Pop Remix" ...
##  $ playlist_id             : chr  "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" "37i9dQZF1DXcZDD7cfEKhW" ...
##  $ playlist_genre          : chr  "pop" "pop" "pop" "pop" ...
##  $ playlist_subgenre       : chr  "dance pop" "dance pop" "dance pop" "dance pop" ...
##  $ danceability            : num  0.748 0.726 0.675 0.718 0.65 0.675 0.449 0.542 0.594 0.642 ...
##  $ energy                  : num  0.916 0.815 0.931 0.93 0.833 0.919 0.856 0.903 0.935 0.818 ...
##  $ key                     : int  6 11 1 7 1 8 5 4 8 2 ...
##  $ loudness                : num  -2.63 -4.97 -3.43 -3.78 -4.67 ...
##  $ mode                    : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ speechiness             : num  0.0583 0.0373 0.0742 0.102 0.0359 0.127 0.0623 0.0434 0.0565 0.032 ...
##  $ acousticness            : num  0.102 0.0724 0.0794 0.0287 0.0803 0.0799 0.187 0.0335 0.0249 0.0567 ...
##  $ instrumentalness        : num  0.00 4.21e-03 2.33e-05 9.43e-06 0.00 0.00 0.00 4.83e-06 3.97e-06 0.00 ...
##  $ liveness                : num  0.0653 0.357 0.11 0.204 0.0833 0.143 0.176 0.111 0.637 0.0919 ...
##  $ valence                 : num  0.518 0.693 0.613 0.277 0.725 0.585 0.152 0.367 0.366 0.59 ...
##  $ tempo                   : num  122 100 124 122 124 ...
##  $ duration_ms             : int  194754 162600 176616 169093 189052 163049 187675 207619 193187 253040 ...

#Checked for missing values, the total number and missing values in each row and column
sum(is.na(songs))

## [1] 15

colSums(is.na(songs))

##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0

#rowSums(is.na(songs))
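The column totals show 5 missing values in each of three columns, which raises the question of whether they sit in the same rows. A minimal base R sketch of that check, on a made-up data frame (`toy` and `incomplete` are hypothetical names), could look like this:

```r
# Toy data with one row that is missing all three text fields.
toy <- data.frame(
  track_name       = c("Song A", NA, "Song C"),
  track_artist     = c("Artist A", NA, "Artist C"),
  track_album_name = c("Album A", NA, "Album C"),
  danceability     = c(0.7, 0.6, 0.5),
  stringsAsFactors = FALSE
)

# Keep only rows that contain at least one NA.
incomplete <- toy[!complete.cases(toy), ]

nrow(incomplete)            # how many rows are affected
rowSums(is.na(incomplete))  # how many NAs each affected row carries
```

If the number of affected rows equals the per-column count, the missing values are concentrated in the same rows and one filter removes them all.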

Removing NAs

We used dplyr to remove the rows with a missing track_name or track_artist (5 missing values in each column), then checked the result with colSums.

songs <- songs %>%
  filter(!is.na(track_name), !is.na(track_artist))

colSums(is.na(songs[, c("track_name", "track_artist")]))

##   track_name track_artist 
##            0            0

Checking for Duplicates

#duplicated(songs)
sum(duplicated(songs))

## [1] 0
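Note that duplicated() on the whole data frame only flags rows that are identical in every column. Since the data is organised by playlist, the same track could in principle appear under more than one playlist, so a check on track_id alone may be worth running. The sketch below uses made-up rows; `toy` and `unique_tracks` are hypothetical names, and whether the real data contains such repeats is an assumption to verify.

```r
# Toy data: track "t1" appears in two different playlists.
toy <- data.frame(
  track_id      = c("t1", "t2", "t1", "t3"),
  playlist_name = c("Pop Remix", "Pop Remix", "Dance Pop", "Dance Pop"),
  danceability  = c(0.75, 0.68, 0.75, 0.81),
  stringsAsFactors = FALSE
)

# Fully identical rows: none here, because playlist_name differs.
full_row_dups <- sum(duplicated(toy))

# Repeated track IDs: one repeat of "t1".
track_id_dups <- sum(duplicated(toy$track_id))

# Keep the first occurrence of each track.
unique_tracks <- toy[!duplicated(toy$track_id), ]
```

Keeping one row per track avoids the same song being counted or selected twice later on.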

Changing Release Date to a proper date

colnames(songs)

##  [1] "track_id"                 "track_name"              
##  [3] "track_artist"             "track_popularity"        
##  [5] "track_album_id"           "track_album_name"        
##  [7] "track_album_release_date" "playlist_name"           
##  [9] "playlist_id"              "playlist_genre"          
## [11] "playlist_subgenre"        "danceability"            
## [13] "energy"                   "key"                     
## [15] "loudness"                 "mode"                    
## [17] "speechiness"              "acousticness"            
## [19] "instrumentalness"         "liveness"                
## [21] "valence"                  "tempo"                   
## [23] "duration_ms"

songs$track_album_release_date <- as.Date(songs$track_album_release_date)

Changing Data Types

Changed playlist_genre and playlist_subgenre to factors so that they are treated as categories rather than plain text strings.

songs$playlist_genre <- as.factor(songs$playlist_genre)
songs$playlist_subgenre <- as.factor(songs$playlist_subgenre)

After converting track_album_release_date to a date, we noticed missing values and counted them with sum(is.na()). There are 1886 missing values in track_album_release_date. Because date is not the main focus, we keep these rows and flag them as missing so they are easy to filter and inspect later on.

sum(is.na(songs$track_album_release_date))

## [1] 1886

songs <- songs %>%
  mutate(missing_release_date = is.na(track_album_release_date))
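One possible cause of these NAs is that some release dates are stored only as a year or a year and month, which as.Date() cannot parse in the "%Y-%m-%d" format. That is an assumption here, not something the output above confirms. If it holds, a sketch like the following could recover those dates from the raw strings before conversion; `pad_partial_date` is a hypothetical helper, and padding to the first day of the period is an arbitrary choice.

```r
# Toy raw values: a full date, a year-only value, a year-month value, and a true NA.
raw_dates <- c("2019-06-14", "2012", "1998-03", NA)

# Hypothetical helper: pad partial dates to the first day of the period.
pad_partial_date <- function(x) {
  x <- as.character(x)
  n <- nchar(x)
  year_only  <- !is.na(x) & n == 4   # e.g. "2012"
  year_month <- !is.na(x) & n == 7   # e.g. "1998-03"
  x[year_only]  <- paste0(x[year_only], "-01-01")
  x[year_month] <- paste0(x[year_month], "-01")
  as.Date(x, format = "%Y-%m-%d")
}

parsed <- pad_partial_date(raw_dates)
print(parsed)
```

The padded dates are only accurate to the year or month, so they suit coarse comparisons such as release decade, not day-level analysis.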

Cleaned Data

dHead <- head(songs)
knitr::kable(dHead, format = "html", align = "lccrr", caption = "Spotify Songs Head")

Spotify Songs Head
track_id	track_name	track_artist	track_popularity	track_album_id	track_album_name	track_album_release_date	playlist_name	playlist_id	playlist_genre	playlist_subgenre	danceability	energy	key	loudness	mode	speechiness	acousticness	instrumentalness	liveness	valence	tempo	duration_ms	missing_release_date
6f807x0ima9a1j3VPbc7VN	I Don’t Care (with Justin Bieber) - Loud Luxury Remix	Ed Sheeran	66	2oCs0DGTsRO98Gh5ZSl2Cx	I Don’t Care (with Justin Bieber) [Loud Luxury Remix]	2019-06-14	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.748	0.916	6	-2.634	1	0.0583	0.1020	0.00e+00	0.0653	0.518	122.036	194754	FALSE
0r7CVbZTWZgbTCYdfa2P31	Memories - Dillon Francis Remix	Maroon 5	67	63rPSO264uRjW1X5E6cWv6	Memories (Dillon Francis Remix)	2019-12-13	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.726	0.815	11	-4.969	1	0.0373	0.0724	4.21e-03	0.3570	0.693	99.972	162600	FALSE
1z1Hg7Vb0AhHDiEmnDE79l	All the Time - Don Diablo Remix	Zara Larsson	70	1HoSmj2eLcsrR0vE9gThr4	All the Time (Don Diablo Remix)	2019-07-05	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.675	0.931	1	-3.432	0	0.0742	0.0794	2.33e-05	0.1100	0.613	124.008	176616	FALSE
75FpbthrwQmzHlBJLuGdC7	Call You Mine - Keanu Silva Remix	The Chainsmokers	60	1nqYsOef1yKKuGOVchbsk6	Call You Mine - The Remixes	2019-07-19	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.718	0.930	7	-3.778	1	0.1020	0.0287	9.40e-06	0.2040	0.277	121.956	169093	FALSE
1e8PAfcKUYoKkxPhrHqw4x	Someone You Loved - Future Humans Remix	Lewis Capaldi	69	7m7vv9wlQ4i0LFuJiE2zsQ	Someone You Loved (Future Humans Remix)	2019-03-05	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.650	0.833	1	-4.672	1	0.0359	0.0803	0.00e+00	0.0833	0.725	123.976	189052	FALSE
7fvUMiyapMsRRxr07cU8Ef	Beautiful People (feat. Khalid) - Jack Wins Remix	Ed Sheeran	67	2yiy9cd2QktrNvWC2EUi0k	Beautiful People (feat. Khalid) [Jack Wins Remix]	2019-07-11	Pop Remix	37i9dQZF1DXcZDD7cfEKhW	pop	dance pop	0.675	0.919	8	-5.385	1	0.1270	0.0799	0.00e+00	0.1430	0.585	124.982	163049	FALSE

Data Summary

variable_summary <- data.frame(
  Variable = colnames(songs),
  Type = sapply(songs, class),
  Missing_Values = colSums(is.na(songs))
)

print(variable_summary)

##                                          Variable      Type Missing_Values
## track_id                                 track_id character              0
## track_name                             track_name character              0
## track_artist                         track_artist character              0
## track_popularity                 track_popularity   integer              0
## track_album_id                     track_album_id character              0
## track_album_name                 track_album_name character              0
## track_album_release_date track_album_release_date      Date           1886
## playlist_name                       playlist_name character              0
## playlist_id                           playlist_id character              0
## playlist_genre                     playlist_genre    factor              0
## playlist_subgenre               playlist_subgenre    factor              0
## danceability                         danceability   numeric              0
## energy                                     energy   numeric              0
## key                                           key   integer              0
## loudness                                 loudness   numeric              0
## mode                                         mode   integer              0
## speechiness                           speechiness   numeric              0
## acousticness                         acousticness   numeric              0
## instrumentalness                 instrumentalness   numeric              0
## liveness                                 liveness   numeric              0
## valence                                   valence   numeric              0
## tempo                                       tempo   numeric              0
## duration_ms                           duration_ms   integer              0
## missing_release_date         missing_release_date   logical              0

Exploratory Data Analysis

Plots and Correlations

ggplot(data = songs, aes(x = danceability, y = track_popularity)) +
  geom_jitter(alpha = 0.2, color = "violet") +
  labs(title = "Jitter Plot of Popularity against Danceability",
       x = "Danceability", 
       y= "Popularity")

This jitter plot was meant to identify any correlation between danceability and popularity, the two most important variables we are looking at. However, we see no clear linear relationship between the two. Because the large number of data points may be crowding the chart, we filtered the data down to the 30 most danceable songs.

dance_top <- songs %>%
  arrange(desc(danceability)) %>%
  head(30)

ggplot(data = dance_top, aes(x = danceability, y = track_popularity)) +
  geom_jitter(color = "violet") +
  labs(title = "Jitter Plot of Popularity against Danceability",
       subtitle = "Top 30 most danceable songs",
       x = "Danceability",
       y = "Popularity") +
  geom_vline(xintercept = 0.975, linetype = "dashed", color = "red") +
  geom_hline(yintercept = 40, linetype = "dashed", color = "red")

From the plot above, we still don’t see any correlation, but the dashed lines let us better visualize which quadrant to pick our top songs from. We decided to focus on the first (upper-right) quadrant of the jitter plot, where danceability is above 0.975 and popularity is above 40.
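The visual impression can be backed by a number, and the quadrant choice can be written as a filter. The sketch below does both in base R on made-up values; `toy` and `quadrant_one` are hypothetical names, and the cut-offs 0.975 and 40 are the dashed lines from the plot.

```r
# Toy data standing in for the danceability and popularity columns.
toy <- data.frame(
  danceability     = c(0.60, 0.70, 0.80, 0.90, 0.98),
  track_popularity = c(55, 20, 70, 35, 60)
)

# Pearson measures linear association; Spearman measures monotonic association.
pearson  <- cor(toy$danceability, toy$track_popularity)
spearman <- cor(toy$danceability, toy$track_popularity, method = "spearman")

# Upper-right quadrant: above both dashed reference lines.
quadrant_one <- toy[toy$danceability > 0.975 & toy$track_popularity > 40, ]
```

On the full data, values of either coefficient near zero would support the reading that danceability and popularity are not linearly related.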

Genres

songs %>% 
  group_by(playlist_genre) %>% 
  summarise(avg_dance = mean(danceability, na.rm = TRUE)) %>% 
  ggplot(aes(x = reorder(playlist_genre, avg_dance), y = avg_dance)) +
  geom_col(fill = "gray") +
  labs(
    title = "Avg Danceability by Genre",
    x = "Genre",
    y = "Avg Danceability"
  ) +
  theme_classic()

This visualization shows average danceability by genre. Rap and latin are, on average, the most danceable genres, while rock and pop are the least.

ggplot(songs, aes(x = danceability)) +
  geom_histogram(binwidth = 0.05, fill = "blue", color = "white") +
  facet_wrap(~ playlist_genre) +
  labs(
    title = "Histogram of Danceability by Genre",
    x = "Danceability",
    y = "Count"
  )

This visualization is a histogram of danceability by genre. It shows the distribution of how danceable each genre is. The key takeaways are that rap and latin are overall very danceable, while rock is not. It also shows that each genre has some very danceable songs, even if the genre is not danceable overall.
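The two genre plots can be condensed into a small table: the mean danceability per genre and the share of each genre's songs above a "very danceable" cut-off. This is a base R sketch on made-up values; the 0.8 cut-off mirrors the threshold used later in the report, and `toy` and `genre_summary` are hypothetical names.

```r
# Toy data standing in for playlist_genre and danceability.
toy <- data.frame(
  playlist_genre = c("rap", "rap", "rock", "rock", "rock", "latin"),
  danceability   = c(0.85, 0.70, 0.40, 0.82, 0.50, 0.75)
)

# Mean danceability and share of songs above 0.8, computed per genre.
avg_dance  <- tapply(toy$danceability, toy$playlist_genre, mean)
share_high <- tapply(toy$danceability > 0.8, toy$playlist_genre, mean)

genre_summary <- data.frame(
  genre      = names(avg_dance),
  avg_dance  = as.numeric(avg_dance),
  share_high = as.numeric(share_high)
)

# Most danceable genre first.
genre_summary <- genre_summary[order(-genre_summary$avg_dance), ]
print(genre_summary)
```

A genre with a low average but a non-zero share_high still contributes candidates, which is the point the histograms make visually.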

Tempo

ggplot(songs, aes(x = tempo)) +
  geom_histogram(binwidth = 10,
                 fill = "grey", 
                 color = "white", 
                 alpha = 0.8) +
  labs(title = "Tempo Distribution of Songs",
       subtitle = "Highlighting the Danceable Tempo Range (100–140 BPM)",
       x = "Tempo (BPM)",
       y = "Number of Songs") +
  geom_histogram(data = songs[songs$tempo >= 100 & songs$tempo <=140, ],
                 binwidth = 10, fill ="#1DB954", color="white") +
  theme_classic()

The tempo distribution histogram shows the tempo (BPM) of the Spotify songs. Most tempos are concentrated in the 100–140 BPM range (highlighted in green), which we treat as a range suitable for dancing. This suggests that the cleaned data offers plenty of song options in that range.
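The statement that most songs fall in the highlighted band can be checked directly by computing the share of tempos between 100 and 140 BPM. A minimal sketch on made-up tempos (`toy_tempo` is a hypothetical stand-in for `songs$tempo`):

```r
# Toy tempos in BPM.
toy_tempo <- c(92, 100, 118, 124, 140, 155, 171, 128)

# TRUE for songs inside the danceable band, inclusive at both ends.
in_range <- toy_tempo >= 100 & toy_tempo <= 140

# The mean of a logical vector is the proportion of TRUE values.
share_in_range <- mean(in_range)
print(share_in_range)
```

A share above 0.5 on the real column would confirm the "most songs" reading of the histogram.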

Conclusions

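After our analysis, we use the following code to select 10 songs that score above 0.8 on danceability and above 80 on popularity. Note that head(10) returns the first ten matching rows in file order rather than the ten highest-scoring songs. All ten fall within the pop genre and the dance pop subgenre, which likely reflects that the file begins with the pop playlists rather than a property of danceable, popular songs in general.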

library(knitr)

songs %>% 
  filter(danceability > 0.8, track_popularity > 80) %>% 
  select(track_name, track_popularity, danceability, playlist_genre, playlist_subgenre) %>% 
  head(10) -> head_1

Top 10 songs for Just Dance’s New Release

knitr::kable(head_1, format = "html", align = "lccrr")

track_name	track_popularity	danceability	playlist_genre	playlist_subgenre
Taki Taki (with Selena Gomez, Ozuna & Cardi B)	83	0.842	pop	dance pop
Giant (with Rag’n’Bone Man)	83	0.807	pop	dance pop
Ride It	94	0.880	pop	dance pop
Tusa	98	0.803	pop	dance pop
Morado	82	0.881	pop	dance pop
Easy - Remix	81	0.886	pop	dance pop
Blanco	88	0.870	pop	dance pop
Dance Monkey	92	0.824	pop	dance pop
Rare	88	0.838	pop	dance pop
Shape of You	86	0.825	pop	dance pop
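A selection that does not depend on row order could rank the filtered songs by a combined score before taking the top entries. The sketch below is one possible version in base R, not the method used for the table above: the toy rows, the equal weighting of popularity and danceability, and the names `shortlist` and `top_songs` are all assumptions made for illustration.

```r
# Toy data: "t1" appears twice, "t4" fails the popularity threshold.
toy <- data.frame(
  track_id         = c("t1", "t2", "t1", "t3", "t4"),
  track_popularity = c(90, 88, 90, 95, 70),
  danceability     = c(0.85, 0.90, 0.85, 0.81, 0.95),
  playlist_genre   = c("pop", "latin", "rap", "pop", "edm"),
  stringsAsFactors = FALSE
)

# Same thresholds as the report: danceability > 0.8 and popularity > 80.
shortlist <- toy[toy$danceability > 0.8 & toy$track_popularity > 80, ]

# Keep one row per track so a song cannot be picked twice.
shortlist <- shortlist[!duplicated(shortlist$track_id), ]

# Hypothetical score: popularity rescaled to 0-1 plus danceability, equally weighted.
shortlist$score <- shortlist$track_popularity / 100 + shortlist$danceability

# Highest score first, then take the top 3 (top 10 on the real data).
shortlist <- shortlist[order(-shortlist$score), ]
top_songs <- head(shortlist, 3)
print(top_songs)
```

Ranking before head() lets songs from any genre reach the list, so the genre mix of the result reflects the scores rather than the order of the file.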

We anticipate that our conclusions will contribute to the success of the new edition of Just Dance that Ubisoft intends to release.

Spotify Dataset - Just Dance Song Analysis

Dami, Matt, Sindhu & Soyeong

2025-10-08