We explore the Spotify songs data provided through the TidyTuesday repository. We start by importing the data.
spotify_songs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
The details on the data table and column definitions can be found here - link.
colnames(spotify_songs)
## [1] "track_id" "track_name"
## [3] "track_artist" "track_popularity"
## [5] "track_album_id" "track_album_name"
## [7] "track_album_release_date" "playlist_name"
## [9] "playlist_id" "playlist_genre"
## [11] "playlist_subgenre" "danceability"
## [13] "energy" "key"
## [15] "loudness" "mode"
## [17] "speechiness" "acousticness"
## [19] "instrumentalness" "liveness"
## [21] "valence" "tempo"
## [23] "duration_ms"
The spotify_songs data has 23 columns and 32833 rows. Each row describes a Spotify track, its playlist, and its audio features.
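The `colnames()` output lists the columns but not the row count. A minimal sketch of how both numbers could be confirmed, using a small stand-in data frame (`toy` is hypothetical and not part of the original analysis):

```r
# Stand-in for spotify_songs (hypothetical toy data).
toy <- data.frame(
  track_id = c("a", "b", "c"),
  danceability = c(0.5, 0.7, 0.9),
  stringsAsFactors = FALSE
)

# dim() returns c(number of rows, number of columns).
dims <- dim(toy)
cat("rows:", dims[1], "columns:", dims[2], "\n")
```

On the real data, `dim(spotify_songs)` should report the 32833 rows and 23 columns quoted above.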
Head of the data
head(spotify_songs)
## # A tibble: 6 x 23
## track_id track_name track_artist track_popularity track_album_id
## <chr> <chr> <chr> <dbl> <chr>
## 1 6f807x0ima9a1j3VPbc7VN I Don't C~ Ed Sheeran 66 2oCs0DGTsRO98~
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories ~ Maroon 5 67 63rPSO264uRjW~
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the T~ Zara Larsson 70 1HoSmj2eLcsrR~
## 4 75FpbthrwQmzHlBJLuGdC7 Call You ~ The Chainsm~ 60 1nqYsOef1yKKu~
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone Y~ Lewis Capal~ 69 7m7vv9wlQ4i0L~
## 6 7fvUMiyapMsRRxr07cU8Ef Beautiful~ Ed Sheeran 67 2yiy9cd2QktrN~
## # ... with 18 more variables: track_album_name <chr>,
## # track_album_release_date <chr>, playlist_name <chr>, playlist_id <chr>,
## # playlist_genre <chr>, playlist_subgenre <chr>, danceability <dbl>,
## # energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
## # acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## # tempo <dbl>, duration_ms <dbl>
Tail of the data
tail(spotify_songs)
## # A tibble: 6 x 23
## track_id track_name track_artist track_popularity track_album_id
## <chr> <chr> <chr> <dbl> <chr>
## 1 0aBDrRTgDCwWbcOnEIp7DJ Many Ways~ Ferry Corst~ 27 59XOfNjuYZB6f~
## 2 7bxnKAamR3snQ1VGLuVfC1 City Of L~ Lush & Simon 42 2azRoBBWEEEYh~
## 3 5Aevni09Em4575077nkWHz Closer - ~ Tegan and S~ 20 6kD6KLxj7s8eC~
## 4 7ImMqPP3Q1yfUHvsdn7wEo Sweet Sur~ Starkillers 14 0ltWNSY9JgxoI~
## 5 2m69mhnfQ1Oq6lGtXuYhgX Only For ~ Mat Zo 15 1fGrOkHnHJcSt~
## 6 29zWqhca3zt5NsckZqDf6c Typhoon -~ Julian Calor 27 0X3mUOm6MhxR7~
## # ... with 18 more variables: track_album_name <chr>,
## # track_album_release_date <chr>, playlist_name <chr>, playlist_id <chr>,
## # playlist_genre <chr>, playlist_subgenre <chr>, danceability <dbl>,
## # energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
## # acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## # tempo <dbl>, duration_ms <dbl>
The data type for each variable is
sapply(spotify_songs, class)
## track_id track_name track_artist
## "character" "character" "character"
## track_popularity track_album_id track_album_name
## "numeric" "character" "character"
## track_album_release_date playlist_name playlist_id
## "character" "character" "character"
## playlist_genre playlist_subgenre danceability
## "character" "character" "numeric"
## energy key loudness
## "numeric" "numeric" "numeric"
## mode speechiness acousticness
## "numeric" "numeric" "numeric"
## instrumentalness liveness valence
## "numeric" "numeric" "numeric"
## tempo duration_ms
## "numeric" "numeric"
Out of the 23 columns we have 10 character columns and 13 numeric columns.
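The type counts can also be tallied directly instead of counting by eye. A small sketch on a hypothetical stand-in data frame (`toy` and its values are assumptions for illustration):

```r
# Stand-in for spotify_songs (hypothetical toy data).
toy <- data.frame(
  track_id = c("a", "b"),
  playlist_genre = c("pop", "rap"),
  energy = c(0.8, 0.6),
  stringsAsFactors = FALSE
)

# Take the first class of each column, then count how often each class occurs.
col_classes <- vapply(toy, function(x) class(x)[1], character(1))
type_counts <- table(col_classes)
print(type_counts)
```

Applied to `spotify_songs`, the same two lines should give the 10 character and 13 numeric columns noted above.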
We observe below that our spotify_songs data has 3 columns with missing values. The track_name, track_artist and track_album_name columns each have 5 missing values.
sapply(spotify_songs,function(x) sum(is.na(x)))
## track_id track_name track_artist
## 0 5 5
## track_popularity track_album_id track_album_name
## 0 0 5
## track_album_release_date playlist_name playlist_id
## 0 0 0
## playlist_genre playlist_subgenre danceability
## 0 0 0
## energy key loudness
## 0 0 0
## mode speechiness acousticness
## 0 0 0
## instrumentalness liveness valence
## 0 0 0
## tempo duration_ms
## 0 0
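Since the affected columns show the same number of missing values, it may be worth checking whether they are missing in the same rows. A sketch under that assumption, on a hypothetical toy data frame (the names `toy`, `text_cols` and `na_per_row` are illustrative):

```r
# Stand-in for spotify_songs (hypothetical toy data).
toy <- data.frame(
  track_id = c("a", "b", "c"),
  track_name = c("x", NA, "z"),
  track_artist = c("p", NA, "r"),
  track_album_name = c("m", NA, "o"),
  stringsAsFactors = FALSE
)

text_cols <- c("track_name", "track_artist", "track_album_name")

# Number of missing values per row across the three text columns.
na_per_row <- rowSums(is.na(toy[, text_cols]))

# Rows with at least one of the three fields missing.
print(toy[na_per_row > 0, ])

# TRUE when every affected row is missing all three fields at once.
all_same_rows <- all(na_per_row %in% c(0, length(text_cols)))
print(all_same_rows)
```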
Summary statistics for each column are shown below.
summary(spotify_songs)
## track_id track_name track_artist track_popularity
## Length:32833 Length:32833 Length:32833 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 24.00
## Mode :character Mode :character Mode :character Median : 45.00
## Mean : 42.48
## 3rd Qu.: 62.00
## Max. :100.00
## track_album_id track_album_name track_album_release_date
## Length:32833 Length:32833 Length:32833
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## playlist_name playlist_id playlist_genre playlist_subgenre
## Length:32833 Length:32833 Length:32833 Length:32833
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## danceability energy key loudness
## Min. :0.0000 Min. :0.000175 Min. : 0.000 Min. :-46.448
## 1st Qu.:0.5630 1st Qu.:0.581000 1st Qu.: 2.000 1st Qu.: -8.171
## Median :0.6720 Median :0.721000 Median : 6.000 Median : -6.166
## Mean :0.6548 Mean :0.698619 Mean : 5.374 Mean : -6.720
## 3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000 3rd Qu.: -4.645
## Max. :0.9830 Max. :1.000000 Max. :11.000 Max. : 1.275
## mode speechiness acousticness instrumentalness
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000000
## 1st Qu.:0.0000 1st Qu.:0.0410 1st Qu.:0.0151 1st Qu.:0.0000000
## Median :1.0000 Median :0.0625 Median :0.0804 Median :0.0000161
## Mean :0.5657 Mean :0.1071 Mean :0.1753 Mean :0.0847472
## 3rd Qu.:1.0000 3rd Qu.:0.1320 3rd Qu.:0.2550 3rd Qu.:0.0048300
## Max. :1.0000 Max. :0.9180 Max. :0.9940 Max. :0.9940000
## liveness valence tempo duration_ms
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 4000
## 1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.: 99.96 1st Qu.:187819
## Median :0.1270 Median :0.5120 Median :121.98 Median :216000
## Mean :0.1902 Mean :0.5106 Mean :120.88 Mean :225800
## 3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:133.92 3rd Qu.:253585
## Max. :0.9960 Max. :0.9910 Max. :239.44 Max. :517810
Checking for duplicate rows. Each row of the spotify_songs data is expected to be unique at the track_id and playlist_id level.
nrow(spotify_songs)
## [1] 32833
library(dplyr)
nrow(distinct(spotify_songs, track_id, playlist_id, .keep_all = TRUE))
## [1] 32251
From the above, we have 32251 unique track_id and playlist_id combinations, which leaves 582 duplicate rows (32833 - 32251). We now extract the duplicated rows.
ind <- duplicated(spotify_songs[,c("track_id", "playlist_id")])
spotify_dup <- spotify_songs[ind,]
On further investigation we see that for these 582 rows we have multiple values of playlist_genre and playlist_subgenre.
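One way such an investigation could be carried out is to count the distinct genre and subgenre labels within each repeated track_id and playlist_id pair. This is a sketch on hypothetical toy data, not necessarily the check that was actually run:

```r
library(dplyr)

# Stand-in for spotify_songs (hypothetical toy data).
toy <- data.frame(
  track_id = c("t1", "t1", "t2", "t3"),
  playlist_id = c("p1", "p1", "p1", "p2"),
  playlist_genre = c("pop", "rock", "pop", "rap"),
  playlist_subgenre = c("dance pop", "classic rock", "dance pop", "trap"),
  stringsAsFactors = FALSE
)

# Keep only key pairs that occur more than once, then count distinct labels.
dup_summary <- toy %>%
  group_by(track_id, playlist_id) %>%
  filter(n() > 1) %>%
  summarise(
    n_rows = n(),
    n_genres = n_distinct(playlist_genre),
    n_subgenres = n_distinct(playlist_subgenre),
    .groups = "drop"
  )
print(dup_summary)
```

Note that `duplicated()` flags only the second and later occurrences of a key, so the 582 rows are the extra copies, while the grouped filter above keeps every row of each repeated pair.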
Checking for duplicate columns
names(spotify_songs)[duplicated(names(spotify_songs))]
## character(0)
The spotify_songs data doesn’t have any duplicate columns.
hist(spotify_songs$track_popularity, main = "Track popularity")
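The summary statistics above (median 45, quartiles 24 and 62) show that half of the tracks have a popularity between 24 and 62, and fewer than a quarter fall in the 0-20 range.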
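A statement about where most of the data lies can be checked numerically rather than read off the histogram. A small sketch with made-up popularity values (the vector below is an assumption for illustration only):

```r
# Hypothetical popularity scores; replace with spotify_songs$track_popularity.
popularity <- c(0, 5, 18, 30, 45, 52, 61, 70, 88, 100)

# The mean of a logical vector is the proportion of TRUE values,
# so this is the share of tracks with popularity of 20 or less.
share_0_20 <- mean(popularity <= 20)
print(share_0_20)

# Quartiles, for comparison with the summary() output.
print(quantile(popularity, probs = c(0.25, 0.5, 0.75)))
```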
boxplot(track_popularity ~ playlist_genre, data = spotify_songs, main = "Track popularity")
Pop has the highest median popularity (the centre line of each box), and no outliers are flagged.
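A boxplot draws the median of each group, not the mean, so the two can be compared per genre explicitly. A minimal sketch on hypothetical toy data:

```r
# Stand-in for spotify_songs (hypothetical toy data).
toy <- data.frame(
  playlist_genre = c("pop", "pop", "pop", "rap", "rap", "rap"),
  track_popularity = c(10, 60, 80, 40, 45, 50)
)

# Per-genre mean and median, computed separately and shown side by side.
genre_mean <- tapply(toy$track_popularity, toy$playlist_genre, mean)
genre_median <- tapply(toy$track_popularity, toy$playlist_genre, median)
print(cbind(mean = genre_mean, median = genre_median))
```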
hist(spotify_songs$danceability, main = "Danceability")
Values are left-skewed and close to normal, with a mean of about 0.65.
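The direction of skew can be checked without the plot, either by comparing mean and median or with a moment-based skewness coefficient. The helper `sample_skewness` below is a hypothetical name and uses one common definition; the input values are made up:

```r
# Moment-based sample skewness (one common definition, assumed here).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Hypothetical values with a long left tail.
toy <- c(0.2, 0.55, 0.65, 0.7, 0.75, 0.8)

# A negative coefficient indicates a longer left tail.
print(sample_skewness(toy))

# In a left-skewed sample the mean usually sits below the median.
print(mean(toy) < median(toy))
```

For danceability, the summary above gives a mean of 0.6548 and a median of 0.6720, which is consistent with a mild left skew.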
boxplot(danceability ~ playlist_genre, data = spotify_songs, main = "Danceability")
From the boxplot of danceability we observe
hist(spotify_songs$energy, main = "Energy")
Values are left-skewed and close to normal, with a mean of about 0.7.
boxplot(energy ~ playlist_genre, data = spotify_songs, main = "Energy")
As expected, EDM has the highest median, but it also has many outliers below 0.4 (beyond the lower whisker).
out <- barplot(table(spotify_songs$key), main="Key")
Key 1 is the most frequent value, although the quartiles above (2, 6 and 9) show that songs are spread across all keys.
hist(spotify_songs$loudness, main = "Loudness")
The majority of songs have a loudness between -10 and 0.
boxplot(loudness ~ playlist_genre, data = spotify_songs, main = "Loudness")
As expected, EDM has the highest median. EDM, R&B and rap have outliers above the upper whisker. All genres have multiple outliers below the lower whisker.
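The outliers drawn by `boxplot()` are points beyond the whiskers, which by default extend at most 1.5 times the interquartile range from the box. They can be counted per genre with `boxplot.stats()`. A sketch on hypothetical toy data (`count_outliers` is an illustrative helper, not part of the original analysis):

```r
# Stand-in for spotify_songs (hypothetical toy data).
toy <- data.frame(
  playlist_genre = rep(c("edm", "rock"), each = 6),
  loudness = c(-5, -4.5, -4, -3.5, -3, -20,
               -8, -7.5, -7, -6.5, -6, 1)
)

# Count points below the lower whisker and above the upper whisker.
count_outliers <- function(x) {
  s <- boxplot.stats(x)   # default coef = 1.5, the same rule as boxplot()
  lower <- s$stats[1]     # end of the lower whisker
  upper <- s$stats[5]     # end of the upper whisker
  c(below = sum(x < lower), above = sum(x > upper))
}

# One column per genre, rows "below" and "above".
outlier_counts <- sapply(split(toy$loudness, toy$playlist_genre), count_outliers)
print(outlier_counts)
```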
out <- barplot(table(spotify_songs$mode), main="Mode")
Most songs have mode 1.
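For coded variables such as key and mode, the bar heights can be turned into counts and proportions. A short sketch with made-up values (`mode_flag` and `key` below are hypothetical vectors standing in for the real columns):

```r
# Hypothetical values; replace with spotify_songs$mode and spotify_songs$key.
mode_flag <- c(1, 1, 0, 1, 0)
key <- c(0, 1, 1, 7, 1, 9, 7)

# Share of each mode value.
mode_share <- prop.table(table(mode_flag))
print(mode_share)

# Most frequent key value (returned as a character label).
key_counts <- table(key)
most_common_key <- names(key_counts)[which.max(key_counts)]
print(most_common_key)
```

Because mode is coded as 0 or 1, its mean equals the share of songs with mode 1. The mean of 0.5657 in the summary above therefore corresponds to roughly 57% of songs.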
hist(spotify_songs$speechiness, main = "Speechiness")
Most songs have speechiness in the range 0 - 0.2.
boxplot(speechiness ~ playlist_genre, data = spotify_songs, main = "Speechiness")
hist(spotify_songs$acousticness, main = "Acousticness")
Most songs have acousticness in the range 0 - 0.2.
boxplot(acousticness ~ playlist_genre, data = spotify_songs, main = "Acousticness")
The medians are low (the overall median is about 0.08, per the summary above), and there are many outliers above the upper whisker.