1. Importing the data

We explore the Spotify songs data published by the TidyTuesday project. First, we import the data.

spotify_songs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')

2. Identifying and reviewing the codebook (if available) or website of origin

The details on the data table and column definitions can be found here - link.

3. Learning about the data

Assessing dimensions
colnames(spotify_songs)
##  [1] "track_id"                 "track_name"              
##  [3] "track_artist"             "track_popularity"        
##  [5] "track_album_id"           "track_album_name"        
##  [7] "track_album_release_date" "playlist_name"           
##  [9] "playlist_id"              "playlist_genre"          
## [11] "playlist_subgenre"        "danceability"            
## [13] "energy"                   "key"                     
## [15] "loudness"                 "mode"                    
## [17] "speechiness"              "acousticness"            
## [19] "instrumentalness"         "liveness"                
## [21] "valence"                  "tempo"                   
## [23] "duration_ms"

The spotify_songs data has 23 columns and 32833 rows. It contains the audio features of Spotify songs, along with track, album, and playlist metadata.
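colnames() lists only the column names; row and column counts come from the dimensions of the data frame. A minimal sketch on a small, made-up stand-in (toy_songs is hypothetical and not part of the original data):

```r
# Hypothetical stand-in for spotify_songs, small enough to check by eye.
toy_songs <- data.frame(
  track_id = c("a1", "b2", "c3"),
  track_popularity = c(66, 67, 70),
  stringsAsFactors = FALSE
)

dim(toy_songs)   # rows first, then columns
nrow(toy_songs)  # number of rows
ncol(toy_songs)  # number of columns
```

Calling dim(spotify_songs) on the real data would be the direct way to confirm the 32833 rows and 23 columns.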

Viewing the head and tail of the data

Head of the data

head(spotify_songs)
## # A tibble: 6 x 23
##   track_id               track_name track_artist track_popularity track_album_id
##   <chr>                  <chr>      <chr>                   <dbl> <chr>         
## 1 6f807x0ima9a1j3VPbc7VN I Don't C~ Ed Sheeran                 66 2oCs0DGTsRO98~
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories ~ Maroon 5                   67 63rPSO264uRjW~
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the T~ Zara Larsson               70 1HoSmj2eLcsrR~
## 4 75FpbthrwQmzHlBJLuGdC7 Call You ~ The Chainsm~               60 1nqYsOef1yKKu~
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone Y~ Lewis Capal~               69 7m7vv9wlQ4i0L~
## 6 7fvUMiyapMsRRxr07cU8Ef Beautiful~ Ed Sheeran                 67 2yiy9cd2QktrN~
## # ... with 18 more variables: track_album_name <chr>,
## #   track_album_release_date <chr>, playlist_name <chr>, playlist_id <chr>,
## #   playlist_genre <chr>, playlist_subgenre <chr>, danceability <dbl>,
## #   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
## #   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## #   tempo <dbl>, duration_ms <dbl>

Tail of the data

tail(spotify_songs)
## # A tibble: 6 x 23
##   track_id               track_name track_artist track_popularity track_album_id
##   <chr>                  <chr>      <chr>                   <dbl> <chr>         
## 1 0aBDrRTgDCwWbcOnEIp7DJ Many Ways~ Ferry Corst~               27 59XOfNjuYZB6f~
## 2 7bxnKAamR3snQ1VGLuVfC1 City Of L~ Lush & Simon               42 2azRoBBWEEEYh~
## 3 5Aevni09Em4575077nkWHz Closer - ~ Tegan and S~               20 6kD6KLxj7s8eC~
## 4 7ImMqPP3Q1yfUHvsdn7wEo Sweet Sur~ Starkillers                14 0ltWNSY9JgxoI~
## 5 2m69mhnfQ1Oq6lGtXuYhgX Only For ~ Mat Zo                     15 1fGrOkHnHJcSt~
## 6 29zWqhca3zt5NsckZqDf6c Typhoon -~ Julian Calor               27 0X3mUOm6MhxR7~
## # ... with 18 more variables: track_album_name <chr>,
## #   track_album_release_date <chr>, playlist_name <chr>, playlist_id <chr>,
## #   playlist_genre <chr>, playlist_subgenre <chr>, danceability <dbl>,
## #   energy <dbl>, key <dbl>, loudness <dbl>, mode <dbl>, speechiness <dbl>,
## #   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## #   tempo <dbl>, duration_ms <dbl>

Identifying the data types of each variable

The data type of each variable is shown below.

sapply(spotify_songs, class)
##                 track_id               track_name             track_artist 
##              "character"              "character"              "character" 
##         track_popularity           track_album_id         track_album_name 
##                "numeric"              "character"              "character" 
## track_album_release_date            playlist_name              playlist_id 
##              "character"              "character"              "character" 
##           playlist_genre        playlist_subgenre             danceability 
##              "character"              "character"                "numeric" 
##                   energy                      key                 loudness 
##                "numeric"                "numeric"                "numeric" 
##                     mode              speechiness             acousticness 
##                "numeric"                "numeric"                "numeric" 
##         instrumentalness                 liveness                  valence 
##                "numeric"                "numeric"                "numeric" 
##                    tempo              duration_ms 
##                "numeric"                "numeric"

Out of the 23 columns we have 10 character columns and 13 numeric columns.
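The counts per type can be tallied rather than read off the output by hand. A minimal sketch on a made-up data frame (toy_songs is hypothetical); it assumes every column has a single class, which holds for plain character and numeric columns:

```r
# Hypothetical stand-in with two character columns and one numeric column.
toy_songs <- data.frame(
  track_id = c("a1", "b2", "c3"),
  playlist_genre = c("pop", "rap", "rock"),
  energy = c(0.92, 0.58, 0.71),
  stringsAsFactors = FALSE
)

# Class of each column, then the number of columns per class.
type_counts <- table(sapply(toy_songs, class))
type_counts
```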

Identifying missing data

We observe below that the spotify_songs data has 3 columns with missing values: track_name, track_artist, and track_album_name each have 5 missing values.

sapply(spotify_songs,function(x) sum(is.na(x)))
##                 track_id               track_name             track_artist 
##                        0                        5                        5 
##         track_popularity           track_album_id         track_album_name 
##                        0                        0                        5 
## track_album_release_date            playlist_name              playlist_id 
##                        0                        0                        0 
##           playlist_genre        playlist_subgenre             danceability 
##                        0                        0                        0 
##                   energy                      key                 loudness 
##                        0                        0                        0 
##                     mode              speechiness             acousticness 
##                        0                        0                        0 
##         instrumentalness                 liveness                  valence 
##                        0                        0                        0 
##                    tempo              duration_ms 
##                        0                        0
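
Beyond per-column counts, it can help to pull out the incomplete rows and see whether the missing values fall in the same records. A minimal sketch on made-up data (toy_songs and its values are hypothetical):

```r
# Hypothetical stand-in with a few missing names and artists.
toy_songs <- data.frame(
  track_id = c("a1", "b2", "c3", "d4"),
  track_name = c("Song A", NA, "Song C", NA),
  track_artist = c("Artist A", NA, "Artist C", "Artist D"),
  stringsAsFactors = FALSE
)

# Missing values per column, keeping only columns that have any.
na_counts <- colSums(is.na(toy_songs))
na_counts[na_counts > 0]

# Rows with at least one missing value.
incomplete_rows <- toy_songs[!complete.cases(toy_songs), ]
incomplete_rows
```
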

Computing summary statistics for the variables

The summary statistics for each column are shown below.

summary(spotify_songs)
##    track_id          track_name        track_artist       track_popularity
##  Length:32833       Length:32833       Length:32833       Min.   :  0.00  
##  Class :character   Class :character   Class :character   1st Qu.: 24.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 45.00  
##                                                           Mean   : 42.48  
##                                                           3rd Qu.: 62.00  
##                                                           Max.   :100.00  
##  track_album_id     track_album_name   track_album_release_date
##  Length:32833       Length:32833       Length:32833            
##  Class :character   Class :character   Class :character        
##  Mode  :character   Mode  :character   Mode  :character        
##                                                                
##                                                                
##                                                                
##  playlist_name      playlist_id        playlist_genre     playlist_subgenre 
##  Length:32833       Length:32833       Length:32833       Length:32833      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   danceability        energy              key            loudness      
##  Min.   :0.0000   Min.   :0.000175   Min.   : 0.000   Min.   :-46.448  
##  1st Qu.:0.5630   1st Qu.:0.581000   1st Qu.: 2.000   1st Qu.: -8.171  
##  Median :0.6720   Median :0.721000   Median : 6.000   Median : -6.166  
##  Mean   :0.6548   Mean   :0.698619   Mean   : 5.374   Mean   : -6.720  
##  3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.: 9.000   3rd Qu.: -4.645  
##  Max.   :0.9830   Max.   :1.000000   Max.   :11.000   Max.   :  1.275  
##       mode         speechiness      acousticness    instrumentalness   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000  
##  1st Qu.:0.0000   1st Qu.:0.0410   1st Qu.:0.0151   1st Qu.:0.0000000  
##  Median :1.0000   Median :0.0625   Median :0.0804   Median :0.0000161  
##  Mean   :0.5657   Mean   :0.1071   Mean   :0.1753   Mean   :0.0847472  
##  3rd Qu.:1.0000   3rd Qu.:0.1320   3rd Qu.:0.2550   3rd Qu.:0.0048300  
##  Max.   :1.0000   Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000  
##     liveness         valence           tempo         duration_ms    
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  4000  
##  1st Qu.:0.0927   1st Qu.:0.3310   1st Qu.: 99.96   1st Qu.:187819  
##  Median :0.1270   Median :0.5120   Median :121.98   Median :216000  
##  Mean   :0.1902   Mean   :0.5106   Mean   :120.88   Mean   :225800  
##  3rd Qu.:0.2480   3rd Qu.:0.6930   3rd Qu.:133.92   3rd Qu.:253585  
##  Max.   :0.9960   Max.   :0.9910   Max.   :239.44   Max.   :517810

Checking for duplicate rows or columns

Checking for duplicate rows. Each row of the spotify_songs data should be identified by a unique combination of track_id and playlist_id.

nrow(spotify_songs)
## [1] 32833
library(dplyr)
nrow(distinct(spotify_songs, track_id, playlist_id, .keep_all = TRUE))
## [1] 32251

From the above we see 32251 unique track_id and playlist_id combinations, so 582 rows (32833 - 32251) are duplicates on those two columns. We now extract the duplicated rows.

ind <- duplicated(spotify_songs[,c("track_id", "playlist_id")])
spotify_dup <- spotify_songs[ind,]

On further investigation we see that for these 582 rows we have multiple values of playlist_genre and playlist_subgenre.
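
One way to carry out this kind of investigation is to keep every row of a duplicated key (not only the repeats) and count the distinct genre values per key. A minimal base R sketch on made-up data (toy_songs, key_cols, and the values are hypothetical):

```r
# Hypothetical stand-in: track t1 appears twice in playlist p1 with different genres.
toy_songs <- data.frame(
  track_id = c("t1", "t1", "t2", "t3"),
  playlist_id = c("p1", "p1", "p1", "p2"),
  playlist_genre = c("pop", "latin", "pop", "rock"),
  stringsAsFactors = FALSE
)

key_cols <- c("track_id", "playlist_id")

# duplicated() flags only the repeats; scanning from both ends flags every member.
is_dup <- duplicated(toy_songs[, key_cols]) |
  duplicated(toy_songs[, key_cols], fromLast = TRUE)
dup_groups <- toy_songs[is_dup, ]
dup_groups[order(dup_groups$track_id, dup_groups$playlist_id), ]

# Number of distinct genres per key; values above 1 explain the duplicates.
genres_per_key <- aggregate(
  playlist_genre ~ track_id + playlist_id,
  data = toy_songs,
  FUN = function(g) length(unique(g))
)
genres_per_key[genres_per_key$playlist_genre > 1, ]
```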

Checking for duplicate columns

names(spotify_songs)[duplicated(names(spotify_songs))]
## character(0)

The spotify_songs data doesn’t have any duplicate columns.

4. Learn about the data visually by plotting

Track popularity
hist(spotify_songs$track_popularity, main = "Track popularity")

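The 0-20 range stands out in the histogram, but it does not hold the majority of tracks: the summary statistics above give a first quartile of 24 and a median of 45.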

boxplot(track_popularity ~ playlist_genre, data = spotify_songs, main = "Track popularity")

Pop has the highest median popularity (a boxplot shows medians, not means), and there are no outliers.
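
A boxplot marks each group's median, which can rank groups differently from the mean. A small sketch on made-up values (toy_songs and its numbers are hypothetical) where the two statistics disagree:

```r
# Hypothetical popularity scores; one low pop track pulls the pop mean down.
toy_songs <- data.frame(
  playlist_genre = c("pop", "pop", "pop", "rock", "rock", "rock"),
  track_popularity = c(10, 60, 62, 40, 45, 50)
)

# Per-genre mean and median of popularity.
genre_means <- tapply(toy_songs$track_popularity, toy_songs$playlist_genre, mean)
genre_medians <- tapply(toy_songs$track_popularity, toy_songs$playlist_genre, median)

genre_means
genre_medians
```

In this toy data pop has the higher median but the lower mean, so it is worth computing both on the real data before comparing genres.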

Danceability
hist(spotify_songs$danceability, main = "Danceability")

The distribution is left-skewed and close to normal, with a mean of about 0.65.
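
Skewness can also be checked numerically instead of by eye. A minimal sketch using the moment-based sample skewness, where a negative value indicates a longer left tail (sample_skewness is a hypothetical helper and the values are made up):

```r
# Hypothetical helper: third central moment divided by the
# second central moment raised to the power 1.5.
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Made-up values with a long left tail.
x <- c(0.2, 0.6, 0.7, 0.7, 0.8)

sample_skewness(x)  # negative for a left-skewed sample
mean(x)
median(x)
```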

boxplot(danceability ~ playlist_genre, data = spotify_songs, main = "Danceability")

From the boxplot of danceability we observe

  • surprisingly, rap has the highest median
  • rock has outliers above the upper whisker
  • the other genres have many outliers below the lower whisker
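
By default, base R's boxplot() extends the whiskers to the most extreme points within 1.5 times the interquartile range of the box and draws anything beyond them as an outlier. The same rule can be inspected with boxplot.stats(); a minimal sketch on made-up values:

```r
# Made-up danceability values with one unusually low track.
danceability <- c(0.10, 0.55, 0.60, 0.65, 0.70, 0.72, 0.75, 0.80)

stats <- boxplot.stats(danceability)

# Lower whisker, lower hinge, median, upper hinge, upper whisker.
stats$stats

# Points beyond the whiskers, drawn individually by boxplot().
stats$out
```
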

Energy
hist(spotify_songs$energy, main = "Energy")

The distribution is left-skewed and close to normal, with a mean of about 0.7.

boxplot(energy ~ playlist_genre, data = spotify_songs, main = "Energy")

As expected, EDM has the highest median, but it also has many outliers below the lower whisker (around 0.4).

Key
out <- barplot(table(spotify_songs$key), main="Key")

Key 1 is the most common key.
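
The integer keys are easier to read as note names. Spotify's audio features use standard pitch class notation (0 = C, 1 = C#/Db, 2 = D, and so on); a minimal sketch that assumes this dataset follows that convention, on made-up key values:

```r
# Pitch class names in order; index 1 corresponds to key 0.
pitch_classes <- c("C", "C#/Db", "D", "D#/Eb", "E", "F",
                   "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")

# Made-up key values standing in for spotify_songs$key.
key_values <- c(1, 0, 7, 1, 11)

# Shift by one because R vectors are 1-indexed.
key_names <- factor(pitch_classes[key_values + 1], levels = pitch_classes)
table(key_names)
```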

Loudness
hist(spotify_songs$loudness, main = "Loudness")

The majority of songs have a loudness between -10 and 0.

boxplot(loudness ~ playlist_genre, data = spotify_songs, main = "Loudness")

As expected, EDM has the highest median. EDM, R&B, and rap have outliers above the upper whisker. All genres have multiple outliers below the lower whisker.

Mode
out <- barplot(table(spotify_songs$mode), main="Mode")

About 57% of songs have mode 1 (the mean of this 0/1 variable is 0.5657).
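
For a 0/1 variable such as mode, the share of each value can be read from a proportion table, and the share of 1s equals the mean. A minimal sketch on made-up values:

```r
# Made-up mode values standing in for spotify_songs$mode.
mode_values <- c(1, 0, 1, 1, 0)

# Share of each value.
mode_share <- prop.table(table(mode_values))
mode_share

# The mean of a 0/1 variable is the share of 1s.
mean(mode_values)
```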

Speechiness
hist(spotify_songs$speechiness, main = "Speechiness")

Most songs have speechiness in the range 0 - 0.2.

boxplot(speechiness ~ playlist_genre, data = spotify_songs, main = "Speechiness")

Acousticness
hist(spotify_songs$acousticness, main = "Acousticness")

Most songs have acousticness in the range 0 - 0.2.

boxplot(acousticness ~ playlist_genre, data = spotify_songs, main = "Acousticness")

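The genre medians sit at or below roughly 0.2 (the overall median is 0.08 per the summary statistics), and there are many outliers above the upper whisker.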