The following packages are required to reproduce the report:
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(outliers)
To preprocess this dataset I have undertaken various tasks. Firstly I imported the datasets data.csv and data_by_artists.csv. As they could not yet be merged, I first dealt with tidying the data by unnesting lists in the "artists" variable and changing column names of the artist data.
Next, I provided a summary of the data and its variables before changing "instrumentalness"" into a 3-level categorical variable and labelling the "key" factor variable.
A new variable "relative_danceability"" was then added by comparing the danceability of a song to its artist's average danceability.
The data was scanned for any missing values, special values and obvious errors. It was apparent that none of these had occurred in the data.
The data was then also scanned for any outliers, which were deemed to be present. To deal with this, the outliers were replaced with their nearest neighbour non-outliers.
Finally, the variable for duration was converted from milliseconds to minutes to aid undestanding of the variable.
The files being used are data.csv and data_by_artist.csv from https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks?select=data.csv.
The data.csv file countains songs collected from spotify. The data_by_artist.csv file contains averages or most common values (for categorical variables) of the same data but for each artist. The data_by_artist file does not contain the song-specific variables: id, name, release date, year and explicit. This file does however include the variable "count", which is the number of songs the artist has participated in.
The variables are:
Identity:
-X (An id created for this dataset from 0 upwards in integers, the name "X" comes from the code as the name is blank)
-id (Spotify's ID of the song)
Numerical:
-acousticness (From 0 to 1, measure of the song being acoustic)
-danceability (From 0 to 1, measure of the song being danceable)
-energy (From 0 to 1, energy of the song)
-duration_ms (Integer usually between 200k and 300k, duration of the track in milliseconds)
-instrumentalness (From 0 to 1, measure of the song being instrumental)
-valence (From 0 to 1, the positiveness of the song)
-popularity (From 0 to 100, popularity of the song recently)
-tempo (Usually between 50 and 150, in beats per minute)
-liveness (From 0 to 1, measure of if the song sounds like a live performance)
-loudness (From 0 to 1, relative loudness of the song -60dB -0dB)
-speechiness (From 0 to 1, the relative length of the track containing human voice)
-year (From 1921 to 2020)
-count (Only in data_by_artist, from 1 to 2267)
Binary:
-mode (0 = Minor, 1 = Major, binary value showing whether the song starts with a major chord progression)
-explicit (0 = No explicit content, 1 = explicit content)
Categorical:
-key (Keys on octave encoded as 0 to 11, starting from C as 0)
-artists (List of artists in the song)
-release_date (Date of release in dd-mm-yyyy or yyyy if day and month arent available)
-name (Name of the song)
As data.csv contains lists of artists, this dataset is untidy. This also prevents merging the two datasets immediately, which means that the merging will occur after data.csv has been tidied.
Also, both data sets contain many of the same column names, despite the values referring to either a particular song or the average for a specific artist. Due to this, I have altered the relevant names that are relevant to data to be prefixed by "artists_".
#Reading in dataset "data" into dataframe "df"
df <- read.csv("data.csv")
#Reading the dataset "data_by_artist" into dataframe "df_artists"
df_artists <- read.csv("data_by_artist.csv")
#Renaming columns that show an artist's averages to make them different to the names in df
df_artists = df_artists %>%
rename(
artists_acousticness = acousticness,
artists_danceability = danceability,
artists_duration_ms = duration_ms,
artists_energy = energy,
artists_instrumentalness = instrumentalness,
artists_liveness = liveness,
artists_loudness = loudness,
artists_speechiness = speechiness,
artists_tempo = tempo,
artists_valence = valence,
artists_popularity = popularity,
artists_key = key,
artists_mode = mode
)
head(df,5)
## X acousticness
## 1 0 0.732
## 2 1 0.982
## 3 2 0.996
## 4 3 0.982
## 5 4 0.957
## artists
## 1 ['Dennis Day']
## 2 ['Sergei Rachmaninoff', 'James Levine', 'Berliner Philharmoniker']
## 3 ['John McCormack']
## 4 ['Sergei Rachmaninoff', 'James Levine', 'Berliner Philharmoniker']
## 5 ['Phil Regan']
## danceability duration_ms energy explicit id
## 1 0.819 180533 0.341 0 7xPhfUan2yNtyFG0cUWkt8
## 2 0.279 831667 0.211 0 4BJqT0PrAfrxzMOxytFOIz
## 3 0.518 159507 0.203 0 5uNZnElqOS3W4fRmRYPk4T
## 4 0.279 831667 0.211 0 1SCWBjhk5WmXPxhDduD3HM
## 5 0.418 166693 0.193 0 4d6HGyGT8e121BsdKmw9v6
## instrumentalness key liveness loudness mode
## 1 0.00e+00 7 0.160 -12.441 1
## 2 8.78e-01 10 0.665 -20.096 1
## 3 0.00e+00 0 0.115 -10.589 1
## 4 8.78e-01 10 0.665 -20.096 1
## 5 1.68e-06 3 0.229 -10.096 1
## name popularity
## 1 Clancy Lowered the Boom 8
## 2 Piano Concerto No. 3 in D Minor, Op. 30: III. Finale. Alla breve 5
## 3 The Wearing of the Green 6
## 4 Piano Concerto No. 3 in D Minor, Op. 30: III. Finale. Alla breve 4
## 5 When Irish Eyes Are Smiling 4
## release_date speechiness tempo valence year
## 1 1921 0.4150 60.936 0.9630 1921
## 2 1921 0.0366 80.954 0.0594 1921
## 3 1921 0.0615 66.221 0.4060 1921
## 4 1921 0.0366 80.954 0.0594 1921
## 5 1921 0.0380 101.665 0.2530 1921
head(df_artists,5)
## artists artists_acousticness artists_danceability
## 1 Francisco Canaro 0.9830723 0.6547106
## 2 Frédéric Chopin 0.9899607 0.3400871
## 3 Ludwig van Beethoven 0.9550187 0.3401573
## 4 Wolfgang Amadeus Mozart 0.9620839 0.3538952
## 5 Johann Sebastian Bach 0.9584050 0.3542243
## artists_duration_ms artists_energy artists_instrumentalness artists_liveness
## 1 177776.5 0.2926219 0.4906754 0.2011177
## 2 251871.9 0.1068738 0.8768993 0.1556774
## 3 439361.3 0.1531762 0.6907695 0.1640777
## 4 327808.0 0.1383482 0.5148373 0.1870906
## 5 203461.8 0.2019323 0.7443225 0.1658367
## artists_loudness artists_speechiness artists_tempo artists_valence
## 1 -11.73337 0.11100684 123.60879 0.7464689
## 2 -22.57558 0.04291348 90.97777 0.2036439
## 3 -20.10770 0.05554984 104.83354 0.2602549
## 4 -20.21415 0.06775596 108.59492 0.3328552
## 5 -20.93652 0.04884376 106.55187 0.5705262
## artists_popularity artists_key artists_mode count
## 1 0.05425673 2 1 2267
## 2 5.74812734 1 1 1068
## 3 5.23730570 0 1 965
## 4 9.62487047 7 1 965
## 5 14.70466884 7 1 921
The tidy data principles are:
1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.
As mentioned previously, the data.csv file is not tidy due to having multiple values in cells in the "artists" column, due to multiple artists contributing to some of the songs. This means that it violates the third principle listed above.
The required steps to clean this data are:
1. Cleaning the names by removing quotes (') and square brackets ([]).
2. Splitting the artists by commas.
3. Unnesting the lists of artists so that each has a separate row.
#Showing untidy data (a subset to make it clearer)
head(df %>% dplyr::select(X, acousticness, artists),5)
## X acousticness
## 1 0 0.732
## 2 1 0.982
## 3 2 0.996
## 4 3 0.982
## 5 4 0.957
## artists
## 1 ['Dennis Day']
## 2 ['Sergei Rachmaninoff', 'James Levine', 'Berliner Philharmoniker']
## 3 ['John McCormack']
## 4 ['Sergei Rachmaninoff', 'James Levine', 'Berliner Philharmoniker']
## 5 ['Phil Regan']
#Removing [, ] and ' from the variable "artists"
df$artists=gsub("\\[|\\]|\\'", "", df$artists)
#Splitting the artists by commas
df$artists=strsplit(df$artists,", ")
#Unnesting the lists of artists into separate rows
df=unnest(df,cols=c(artists))
#Showing unnested data
head(df %>% dplyr::select(X, acousticness, artists),5)
## # A tibble: 5 x 3
## X acousticness artists
## <int> <dbl> <chr>
## 1 0 0.732 Dennis Day
## 2 1 0.982 Sergei Rachmaninoff
## 3 1 0.982 James Levine
## 4 1 0.982 Berliner Philharmoniker
## 5 2 0.996 John McCormack
Now that the datasets are both tidied, they can be combined via the "artists" variable. inner_join is used to ensure that all columns are present for all rows, as not all the artists listed in the data.csv file are listed in the data_by_artists file.
#Merging the two datasets into one so that the artist's data is shown with the song's data.
df=df %>% inner_join(df_artists, by = "artists")
The types of variables and data structure are now checked. The structure of the data is that is a data frame with 221475 rows and 34 columns.
The variables, as listed earlier, are of the following types:
Nominal (all char, except for X):
id, name, artists, X
Ordinal (all integer):
key, mode, explicit
Continuous quantitative variables (all numeric):
acousticness, danceability, energy, instrumentalness, valence, popularity, tempo, liveness, loudness, speechiness
Discrete continuous variables (all integer):
duration_ms, year, count
In order to simplify instrumentalness, I have chosen to convert it from a continuous quantitative variable to an ordinal variable with three levels: Not_Instrumental, Partly_Instrumental and Instrumental. The thresholds for these levels are as follows:
0-0.1 = Not_Instrumental
0.1-0.8 = Partly_Instrumental
0.8-1 = Instrumental
This change was chosen as it makes the data easier to decipher due to a large portion of the data being previously listed as values with high negative powers due to being extremely small values. Such values are now simply listed as "Not_Instrumental". These values were chosen in place of using equal width/depth binning as the Partly_Instrumental level requires a larger portion of the range than Not_Instrumental (when considering what each value means) and equal depth binning would not be appropriate as it is possible for all of the songs to be either instrumental or not.
The key variable (a factor variable) was chosen to be labelled with the relevant keys on an octave from C (correlating to a value of 0), all the way to b (which correlates to 11).
As both of the variables that have been altered also have a second variable related to the artist's average values, these artist variables have also been altered in the same way.
#Checking the structure and attributes of the merged dataframe
print(str(df))
## tibble [221,475 x 34] (S3: tbl_df/tbl/data.frame)
## $ X : int [1:221475] 0 1 1 1 2 3 3 3 4 5 ...
## $ acousticness : num [1:221475] 0.732 0.982 0.982 0.982 0.996 0.982 0.982 0.982 0.957 0.957 ...
## $ artists : chr [1:221475] "Dennis Day" "Sergei Rachmaninoff" "James Levine" "Berliner Philharmoniker" ...
## $ danceability : num [1:221475] 0.819 0.279 0.279 0.279 0.518 0.279 0.279 0.279 0.418 0.259 ...
## $ duration_ms : int [1:221475] 180533 831667 831667 831667 159507 831667 831667 831667 166693 186467 ...
## $ energy : num [1:221475] 0.341 0.211 0.211 0.211 0.203 0.211 0.211 0.211 0.193 0.212 ...
## $ explicit : int [1:221475] 0 0 0 0 0 0 0 0 0 0 ...
## $ id : chr [1:221475] "7xPhfUan2yNtyFG0cUWkt8" "4BJqT0PrAfrxzMOxytFOIz" "4BJqT0PrAfrxzMOxytFOIz" "4BJqT0PrAfrxzMOxytFOIz" ...
## $ instrumentalness : num [1:221475] 0.00 8.78e-01 8.78e-01 8.78e-01 0.00 8.78e-01 8.78e-01 8.78e-01 1.68e-06 2.22e-04 ...
## $ key : int [1:221475] 7 10 10 10 0 10 10 10 3 2 ...
## $ liveness : num [1:221475] 0.16 0.665 0.665 0.665 0.115 0.665 0.665 0.665 0.229 0.236 ...
## $ loudness : num [1:221475] -12.4 -20.1 -20.1 -20.1 -10.6 ...
## $ mode : int [1:221475] 1 1 1 1 1 1 1 1 1 1 ...
## $ name : chr [1:221475] "Clancy Lowered the Boom" "Piano Concerto No. 3 in D Minor, Op. 30: III. Finale. Alla breve" "Piano Concerto No. 3 in D Minor, Op. 30: III. Finale. Alla breve" "Piano Concerto No. 3 in D Minor, Op. 30: III. Finale. Alla breve" ...
## $ popularity : int [1:221475] 8 5 5 5 6 4 4 4 4 2 ...
## $ release_date : chr [1:221475] "1921" "1921" "1921" "1921" ...
## $ speechiness : num [1:221475] 0.415 0.0366 0.0366 0.0366 0.0615 0.0366 0.0366 0.0366 0.038 0.0358 ...
## $ tempo : num [1:221475] 60.9 81 81 81 66.2 ...
## $ valence : num [1:221475] 0.963 0.0594 0.0594 0.0594 0.406 0.0594 0.0594 0.0594 0.253 0.218 ...
## $ year : int [1:221475] 1921 1921 1921 1921 1921 1921 1921 1921 1921 1921 ...
## $ artists_acousticness : num [1:221475] 0.868 0.98 0.941 0.935 0.994 ...
## $ artists_danceability : num [1:221475] 0.5 0.332 0.272 0.234 0.437 ...
## $ artists_duration_ms : num [1:221475] 180234 273919 365311 548717 200880 ...
## $ artists_energy : num [1:221475] 0.339 0.162 0.18 0.151 0.171 ...
## $ artists_instrumentalness: num [1:221475] 2.46e-06 8.08e-01 5.01e-01 7.23e-01 1.60e-05 ...
## $ artists_liveness : num [1:221475] 0.139 0.231 0.252 0.21 0.107 ...
## $ artists_loudness : num [1:221475] -11.1 -20.1 -19.9 -19.6 -13.2 ...
## $ artists_speechiness : num [1:221475] 0.1462 0.041 0.0482 0.0453 0.0649 ...
## $ artists_tempo : num [1:221475] 86.7 96.5 106.7 95.9 124.3 ...
## $ artists_valence : num [1:221475] 0.575 0.271 0.175 0.141 0.279 ...
## $ artists_popularity : num [1:221475] 3.5 4.36 25.13 21.26 2 ...
## $ artists_key : int [1:221475] 7 2 10 2 7 2 10 2 3 3 ...
## $ artists_mode : int [1:221475] 1 1 1 1 1 1 1 1 1 1 ...
## $ count : int [1:221475] 4 242 15 115 3 242 15 115 3 3 ...
## NULL
#Converting numerical variable "instrumentalness" to categorical so that it shows more easily how instrumental a song is.
df$instrumentalness[df$instrumentalness <0.1] = "Not_Instrumental"
df$instrumentalness[df$instrumentalness >0.1 & df$instrumentalness <0.8] = "Partly_Instrumental"
df$instrumentalness[df$instrumentalness >0.8 & df$instrumentalness <1.1] = "Instrumental"
#Converting numerical variable "artists_instrumentalness" to categorical
df$artists_instrumentalness[df$artists_instrumentalness <0.1] = "Not_Instrumental"
df$artists_instrumentalness[df$artists_instrumentalness >0.1 & df$artists_instrumentalness <0.8] = "Partly_Instrumental"
df$artists_instrumentalness[df$artists_instrumentalness >0.8 & df$artists_instrumentalness <1.1] = "Instrumental"
#Labeling the factor variable "key" with the corresponding key names.
df$key <- factor(df$key,
levels = c(0,1,2,3,4,5,6,7,8,9,10,11),
labels = c('c', 'c_sharp', 'd', 'd_sharp', 'e', 'f', 'f_sharp','g', 'g_sharp','a', 'a_sharp', 'b'))
#Repeating the same for the artist's key
df$artists_key <- factor(df$artists_key,
levels = c(0,1,2,3,4,5,6,7,8,9,10,11),
labels = c('c', 'c_sharp', 'd', 'd_sharp', 'e', 'f', 'f_sharp','g', 'g_sharp','a', 'a_sharp', 'b'))
#Showing changed data
head(df %>% dplyr::select(artists, instrumentalness, artists_instrumentalness, key, artists_key),5)
## # A tibble: 5 x 5
## artists instrumentalness artists_instrumentaln~ key artists_key
## <chr> <chr> <chr> <fct> <fct>
## 1 Dennis Day Not_Instrumental Not_Instrumental g g
## 2 Sergei Rachmaninoff Instrumental Instrumental a_sha~ d
## 3 James Levine Instrumental Partly_Instrumental a_sha~ a_sharp
## 4 Berliner Philharmo~ Instrumental Partly_Instrumental a_sha~ d
## 5 John McCormack Not_Instrumental Not_Instrumental c g
I chose to create a new variable called "relative_danceability". What this variable shows is how danceable a song is compared to its artist's average. I calculated this variable by subtracting the artist's danceability from the song's danceability.
I then converted this to an ordinal categorical variable with the following levels:
-1 to -0.1 = Much_Less_Danceable
-0.1 to 0 (not inclusive of 0) = Less_Danceable
0 = Same
0 to 0.1 (not inclusive of 0) = More_Danceable
0.1 to 1 = Much_More_Danceable
Numerical values before being changed to categorical can only be in the range from -1 to 1 as danceability only ranged from 0 to 1.
#Creating new variable
df$relative_danceability <- df$danceability - df$artists_danceability
#Setting levels for variable
df$relative_danceability[df$relative_danceability > -1.1 & df$relative_danceability < -0.1] = "Much_Less_Danceable"
df$relative_danceability[df$relative_danceability < 0] = "Less_Danceable"
df$relative_danceability[df$relative_danceability ==0] = "Same"
df$relative_danceability[df$relative_danceability > 0 & df$relative_danceability < 0.1] = "More_Danceable"
df$relative_danceability[df$relative_danceability > 0.1 & df$relative_danceability < 1.1] = "Much_More_Danceable"
#Showing changed data
head(df %>% dplyr::select(artists, relative_danceability),5)
## # A tibble: 5 x 2
## artists relative_danceability
## <chr> <chr>
## 1 Dennis Day Much_More_Danceable
## 2 Sergei Rachmaninoff Less_Danceable
## 3 James Levine More_Danceable
## 4 Berliner Philharmoniker More_Danceable
## 5 John McCormack More_Danceable
The data must be scanned for missing values, special values and obvious errors. Missing values and special values are determined quite simply by the first line in the code below. This dataset does not contain any missing values. It is worth noting that the data would have missing values if joined by a method other than inner_join. Merging this way caused rows that were not present on both datasets to be removed.
Next, obvious errors must be checked. For this data, that meant checking acousticness, danceability, energy, valence, popularity, liveness and speechiness (and their corresponding artist variables) for values that did not occur within their set ranges. As shown by the output of the code below, no obvious errors are present.
If inappropriate values were to occur, the ideal method of dealing with the missing values would be to either removed the row (if an important variable like artist is missing), or if only one of the numerical variables that has a corresponding artist value (or vice versa), then setting the value to its corresponding value could provide a solution.
#Checking for missing and special values
sum(is.na(df), sapply(df, is.nan), sum(sapply(df, is.infinite)))
## [1] 0
#Checking for obvious errors by each column necessary
obvious_ac <- df$acousticness<0|df$acousticness>1
obvious_da <- df$danceability<0|df$danceability>1
obvious_en <- df$energy<0|df$energy>1
obvious_va <- df$valence<0|df$valence>1
obvious_po <- df$popularity<0|df$popularity>100
obvious_li <- df$liveness<0|df$liveness>1
obvious_sp <- df$speechiness<0|df$speechiness>1
obvious_art_ac <- df$artists_acousticness<0|df$artists_acousticness>1
obvious_art_da <- df$artists_danceability<0|df$artists_danceability>1
obvious_art_en <- df$artists_energy<0|df$artists_energy>1
obvious_art_va <- df$artists_valence<0|df$artists_valence>1
obvious_art_po <- df$artists_popularity<0|df$artists_popularity>100
obvious_art_li <- df$artists_liveness<0|df$artists_liveness>1
obvious_art_sp <- df$artists_speechiness<0|df$artists_speechiness>1
#Combining the error lists
obvious=c(obvious_ac, obvious_da, obvious_en, obvious_va, obvious_po, obvious_li, obvious_sp, obvious_art_ac,
obvious_art_da, obvious_art_en, obvious_art_va, obvious_art_po, obvious_art_li, obvious_art_sp)
#Showing total number of errors
sum(obvious)
## [1] 0
It is necessary to scan the numeric variables with potential for outliers (tempo, loudness and duration_ms), it is worth noting that various other numeric variables were checked for unreasonable values in the first scan. The first step taken is to calculate z-scores for these 3 variables and then show the number of outliers. The process for counting the number of outliers is shown for the variable "tempo" below (517 outliers). After proving that outliers have occurred, the next step is to choose a method of dealing the outliers.I have chosen the capping method. The capping method involves replacing outliers with the nearest neighbours that are not outliers. To do this, I have utilised a function listed in model 6.
I have shown a summary of each of the variables both before and after the values have been capped to show the difference that occurs. For instance, for tempo, the minimum changes from "0"" to "30.95"" while the maximum changes from "244.09"" to "198.65". The new values are placed into the dataframe.
#Calculating z scores
z.scores <- df$tempo %>% scores(type = "z")
z.scores %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.71971 -0.77342 -0.08302 0.00000 0.61718 4.17206
#Showing number of outliers
length(which( abs(z.scores) >3 ))
## [1] 517
#Utilising function from Module 6 to cap values from outside the limits
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
#Showing summaries of numeric variables before capping
summary(df$tempo)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 91.13 112.48 115.05 134.14 244.09
summary(df$loudness)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -60.000 -15.652 -11.188 -12.126 -7.417 3.855
summary(df$duration_ms)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5108 171667 211200 240443 270893 5403500
#Capping numeric variables
df$tempo <- df$tempo %>% cap()
df$loudness <- df$loudness %>% cap()
df$duration_ms <- df$duration_ms %>% cap()
#Repeating for artist variables
df$artists_tempo <- df$artists_tempo %>% cap()
df$artists_loudness <- df$artists_loudness %>% cap()
df$artists_duration_ms <- df$artists_duration_ms %>% cap()
#Showing summaries of numeric variables after capping
summary(df$tempo)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 30.95 91.13 112.48 114.89 134.14 198.65
summary(df$loudness)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -28.002 -15.652 -11.188 -11.970 -7.417 3.855
summary(df$duration_ms)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22933 171667 211200 230347 270893 461520
The transformation I chose to make was to change duration from milliseconds to minutes. This allows for the variable to be understood more easily. To achieve this, I simply divided by 1000 to convert to seconds and then divided by 60 to convert to minutes. As the variable name is "duration_ms", I changed this to "duration_mins". I repeated the same for the artists' averages.
df$duration_ms<-df$duration_ms/(60*1000)
df$artists_duration_ms<-df$artists_duration_ms/(60*1000)
df=df %>% rename(duration_mins = duration_ms)
df=df %>% rename(artists_duration_mins = artists_duration_ms)
#Showing changed data
head(df %>% dplyr::select(artists, duration_mins, artists_duration_mins),5)
## # A tibble: 5 x 3
## artists duration_mins artists_duration_mins
## <chr> <dbl> <dbl>
## 1 Dennis Day 3.01 3.00
## 2 Sergei Rachmaninoff 7.69 4.57
## 3 James Levine 7.69 6.09
## 4 Berliner Philharmoniker 7.69 6.61
## 5 John McCormack 2.66 3.35