DIMESION REDUCTION USING PCA

PAIDAMOYO SIMBA

468074

Introduction

“This project engages in an in-depth exploration and analysis of a robust music data set, concentrating on the intrinsic components that define the characteristics of songs genres. The data set spans a wide range of musical genres and artists, aiming to unravel the underlying structures that contribute to the richness of musical compositions. Beginning with meticulous data cleaning and transformation, the project employs Principal Component Analysis (PCA) as a central technique to distill the data set into its fundamental components. The emphasis lies in understanding the variance and relationships among these components.Through statistical tests and visualization techniques, we aim to shed light on the key factors that shape musical compositions, providing valuable insights for understanding the essence of music beyond chronological trends.The data set was downloaded from kaggle and its called Top 10000 Songs on Spotify 1960-Now.”

Data loading

Firstly i loaded the data set and determined the structure and class of the data set so that i can have an understanding.

library(readr)

## Warning: package 'readr' was built under R version 4.3.2

music_dataset=read.csv("top_10000_1960-now.csv")
(music_dataset)

class(music_dataset)

## [1] "data.frame"

colnames(music_dataset)

##  [1] "Track.URI"            "Track.Name"           "Artist.URI.s."       
##  [4] "Artist.Name.s."       "Album.URI"            "Album.Name"          
##  [7] "Album.Artist.URI.s."  "Album.Artist.Name.s." "Album.Release.Date"  
## [10] "Album.Image.URL"      "Disc.Number"          "Track.Number"        
## [13] "Track.Duration..ms."  "Track.Preview.URL"    "Explicit"            
## [16] "Popularity"           "ISRC"                 "Added.By"            
## [19] "Added.At"             "Artist.Genres"        "Danceability"        
## [22] "Energy"               "Key"                  "Loudness"            
## [25] "Mode"                 "Speechiness"          "Acousticness"        
## [28] "Instrumentalness"     "Liveness"             "Valence"             
## [31] "Tempo"                "Time.Signature"       "Album.Genres"        
## [34] "Label"                "Copyrights"

str(music_dataset)

## 'data.frame':    9999 obs. of  35 variables:
##  $ Track.URI           : chr  "spotify:track:1XAZlnVtthcDZt2NI1Dtxo" "spotify:track:6a8GbQIlV8HBUW3c6Uk9PH" "spotify:track:70XtWbcVZcpaOddJftMcVi" "spotify:track:1NXUWyPJk5kO6DQJ5t7bDu" ...
##  $ Track.Name          : chr  "Justified & Ancient - Stand by the Jams" "I Know You Want Me (Calle Ocho)" "From the Bottom of My Broken Heart" "Apeman - 2014 Remastered Version" ...
##  $ Artist.URI.s.       : chr  "spotify:artist:6dYrdRlNZSKaVxYg5IrvCH" "spotify:artist:0TnOYISbd1XYRBk9myaseg" "spotify:artist:26dSoYclwsYLMAKD3tpOr4" "spotify:artist:1SQRv42e4PjEYfPhS0Tk9E" ...
##  $ Artist.Name.s.      : chr  "The KLF" "Pitbull" "Britney Spears" "The Kinks" ...
##  $ Album.URI           : chr  "spotify:album:4MC0ZjNtVP1nDD5lsLxFjc" "spotify:album:5xLAcbvbSAlRtPXnKkggXA" "spotify:album:3WNxdumkSMGMJRhEgK80qx" "spotify:album:6lL6HugNEN4Vlc8sj0Zcse" ...
##  $ Album.Name          : chr  "Songs Collection" "Pitbull Starring In Rebelution" "...Baby One More Time (Digital Deluxe Version)" "Lola vs. Powerman and the Moneygoround, Pt. One + Percy (Super Deluxe)" ...
##  $ Album.Artist.URI.s. : chr  "spotify:artist:6dYrdRlNZSKaVxYg5IrvCH" "spotify:artist:0TnOYISbd1XYRBk9myaseg" "spotify:artist:26dSoYclwsYLMAKD3tpOr4" "spotify:artist:1SQRv42e4PjEYfPhS0Tk9E" ...
##  $ Album.Artist.Name.s.: chr  "The KLF" "Pitbull" "Britney Spears" "The Kinks" ...
##  $ Album.Release.Date  : chr  "1992-08-03" "2009-10-23" "1999-01-12" "2014-10-20" ...
##  $ Album.Image.URL     : chr  "https://i.scdn.co/image/ab67616d0000b27355346bc1f268730f607f9544" "https://i.scdn.co/image/ab67616d0000b27326d73ab8423a350faa5d395a" "https://i.scdn.co/image/ab67616d0000b2738e49866860c25afffe2f1a02" "https://i.scdn.co/image/ab67616d0000b2731e7c5307ccbbb74101e0cc77" ...
##  $ Disc.Number         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Track.Number        : int  3 3 6 11 9 4 1 1 2 2 ...
##  $ Track.Duration..ms. : int  216270 237120 312533 233400 448720 193346 173799 240546 259800 208733 ...
##  $ Track.Preview.URL   : chr  "" "https://p.scdn.co/mp3-preview/d6f8883fc955cb0ecb7f3e1e06e77a9d8611158d?cid=9950ac751e34487dbbe027c4fd7f8e99" "https://p.scdn.co/mp3-preview/1de5faef947224dcb7efb26a5303ae0735b28167?cid=9950ac751e34487dbbe027c4fd7f8e99" "https://p.scdn.co/mp3-preview/c4df3a832509cc5506bd0c91419146f78d864825?cid=9950ac751e34487dbbe027c4fd7f8e99" ...
##  $ Explicit            : chr  "false" "false" "false" "false" ...
##  $ Popularity          : int  0 64 56 42 0 79 78 61 74 0 ...
##  $ ISRC                : chr  "QMARG1760056" "USJAY0900144" "USJI19910455" "GB5KW1499822" ...
##  $ Added.By            : chr  "spotify:user:bradnumber1" "spotify:user:bradnumber1" "spotify:user:bradnumber1" "spotify:user:bradnumber1" ...
##  $ Added.At            : chr  "2020-03-05T09:20:39Z" "2021-08-08T09:26:31Z" "2021-08-08T09:26:31Z" "2021-08-08T09:26:31Z" ...
##  $ Artist.Genres       : chr  "acid house,ambient house,big beat,hip house" "dance pop,miami hip hop,pop" "dance pop,pop" "album rock,art rock,british invasion,classic rock,folk rock,glam rock,protopunk,psychedelic rock,rock,singer-songwriter" ...
##  $ Danceability        : num  0.617 0.825 0.677 0.683 0.319 0.671 0.56 0.48 0.357 0.562 ...
##  $ Energy              : num  0.872 0.743 0.665 0.728 0.627 0.71 0.68 0.628 0.653 0.681 ...
##  $ Key                 : int  8 2 7 9 0 9 6 6 9 11 ...
##  $ Loudness            : num  -12.3 -6 -5.17 -8.92 -9.61 ...
##  $ Mode                : int  1 1 1 1 1 1 0 1 1 0 ...
##  $ Speechiness         : num  0.048 0.149 0.0305 0.259 0.0687 0.0356 0.321 0.0262 0.0654 0.0871 ...
##  $ Acousticness        : num  0.0158 0.0142 0.56 0.568 0.675 0.0393 0.555 0.174 0.0828 0.113 ...
##  $ Instrumentalness    : num  1.12e-01 2.12e-05 1.01e-06 5.08e-05 7.29e-05 1.12e-05 0.00 3.28e-05 0.00 0.00 ...
##  $ Liveness            : num  0.408 0.237 0.338 0.0384 0.289 0.0387 0.116 0.0753 0.0844 0.11 ...
##  $ Valence             : num  0.504 0.8 0.706 0.833 0.497 0.834 0.319 0.541 0.522 0.357 ...
##  $ Tempo               : num  111.5 127 75 75.3 85.8 ...
##  $ Time.Signature      : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ Album.Genres        : logi  NA NA NA NA NA NA ...
##  $ Label               : chr  "Jams Communications" "Mr.305/Polo Grounds Music/J Records" "Jive" "Sanctuary Records" ...
##  $ Copyrights          : chr  "C 1992 Copyright Control, P 1992 Jams Communications" "P (P) 2009 RCA/JIVE Label Group, a unit of Sony Music Entertainment" "P (P) 1999 Zomba Recording LLC" "C © 2014 Sanctuary Records Group Ltd., a BMG Company, P ℗ 2014 Sanctuary Records Group Ltd., a BMG Company" ...

head(music_dataset)

Data cleaning

As i explored my data i noticed that most of the columns are just identifiers and they do not contribute towards the analysis and i removed the columns.

mymusic_data <- music_dataset[, !names(music_dataset) %in% c("Track.URI", "Artist.URI.s.","Time.Signature", "Album.URI", "Album.Name",
                                                             "Album.Artist.URI.s.", "Album.Artist.Name.s.", "Album.Image.URL",
                                                             "Disc.Number", "Track.Number",
                                                             "Track.Preview.URL", "Explicit", "ISRC", "Added.By", "Added.At",
                                                             "Instrumentalness","Album.Genres",
                                                             "Label","Artist.Name.s.","Album.Release.Date","Copyrights","Track.Name")]
colnames(mymusic_data)

##  [1] "Track.Duration..ms." "Popularity"          "Artist.Genres"      
##  [4] "Danceability"        "Energy"              "Key"                
##  [7] "Loudness"            "Mode"                "Speechiness"        
## [10] "Acousticness"        "Liveness"            "Valence"            
## [13] "Tempo"

dim(mymusic_data)

## [1] 9999   13

str(mymusic_data)

## 'data.frame':    9999 obs. of  13 variables:
##  $ Track.Duration..ms.: int  216270 237120 312533 233400 448720 193346 173799 240546 259800 208733 ...
##  $ Popularity         : int  0 64 56 42 0 79 78 61 74 0 ...
##  $ Artist.Genres      : chr  "acid house,ambient house,big beat,hip house" "dance pop,miami hip hop,pop" "dance pop,pop" "album rock,art rock,british invasion,classic rock,folk rock,glam rock,protopunk,psychedelic rock,rock,singer-songwriter" ...
##  $ Danceability       : num  0.617 0.825 0.677 0.683 0.319 0.671 0.56 0.48 0.357 0.562 ...
##  $ Energy             : num  0.872 0.743 0.665 0.728 0.627 0.71 0.68 0.628 0.653 0.681 ...
##  $ Key                : int  8 2 7 9 0 9 6 6 9 11 ...
##  $ Loudness           : num  -12.3 -6 -5.17 -8.92 -9.61 ...
##  $ Mode               : int  1 1 1 1 1 1 0 1 1 0 ...
##  $ Speechiness        : num  0.048 0.149 0.0305 0.259 0.0687 0.0356 0.321 0.0262 0.0654 0.0871 ...
##  $ Acousticness       : num  0.0158 0.0142 0.56 0.568 0.675 0.0393 0.555 0.174 0.0828 0.113 ...
##  $ Liveness           : num  0.408 0.237 0.338 0.0384 0.289 0.0387 0.116 0.0753 0.0844 0.11 ...
##  $ Valence            : num  0.504 0.8 0.706 0.833 0.497 0.834 0.319 0.541 0.522 0.357 ...
##  $ Tempo              : num  111.5 127 75 75.3 85.8 ...

Data type

After removing identifiers features from my data i was left with 13 features in my case the only identifier that i left was the music genres.To proceed with my data cleaning i changed all integers features to numeric

int_cols <- sapply(mymusic_data, is.integer)
mymusic_data[int_cols] <- lapply(mymusic_data[int_cols], as.numeric)
str(mymusic_data)

## 'data.frame':    9999 obs. of  13 variables:
##  $ Track.Duration..ms.: num  216270 237120 312533 233400 448720 ...
##  $ Popularity         : num  0 64 56 42 0 79 78 61 74 0 ...
##  $ Artist.Genres      : chr  "acid house,ambient house,big beat,hip house" "dance pop,miami hip hop,pop" "dance pop,pop" "album rock,art rock,british invasion,classic rock,folk rock,glam rock,protopunk,psychedelic rock,rock,singer-songwriter" ...
##  $ Danceability       : num  0.617 0.825 0.677 0.683 0.319 0.671 0.56 0.48 0.357 0.562 ...
##  $ Energy             : num  0.872 0.743 0.665 0.728 0.627 0.71 0.68 0.628 0.653 0.681 ...
##  $ Key                : num  8 2 7 9 0 9 6 6 9 11 ...
##  $ Loudness           : num  -12.3 -6 -5.17 -8.92 -9.61 ...
##  $ Mode               : num  1 1 1 1 1 1 0 1 1 0 ...
##  $ Speechiness        : num  0.048 0.149 0.0305 0.259 0.0687 0.0356 0.321 0.0262 0.0654 0.0871 ...
##  $ Acousticness       : num  0.0158 0.0142 0.56 0.568 0.675 0.0393 0.555 0.174 0.0828 0.113 ...
##  $ Liveness           : num  0.408 0.237 0.338 0.0384 0.289 0.0387 0.116 0.0753 0.0844 0.11 ...
##  $ Valence            : num  0.504 0.8 0.706 0.833 0.497 0.834 0.319 0.541 0.522 0.357 ...
##  $ Tempo              : num  111.5 127 75 75.3 85.8 ...

Whist the feature Artist Genre remained as a character as it is the identifier of all other features #Missing values

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.3.2

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

missing_values <- colSums(is.na(mymusic_data))
print(missing_values)

## Track.Duration..ms.          Popularity       Artist.Genres        Danceability 
##                   0                   0                   0                   2 
##              Energy                 Key            Loudness                Mode 
##                   2                   2                   2                   2 
##         Speechiness        Acousticness            Liveness             Valence 
##                   2                   2                   2                   2 
##               Tempo 
##                   2

mymusic_data <- na.omit(mymusic_data)

Identifying Outliers

identify_outliers <- function(x) {
  z_scores <- scale(x)
  outliers <- abs(z_scores) > 3  
  return(outliers)
}

numeric_cols <- sapply(mymusic_data, is.numeric)
outliers <- mymusic_data[, numeric_cols] %>%
  lapply(identify_outliers)
outlier_counts <- sapply(outliers, sum)
print(outlier_counts)

## Track.Duration..ms.          Popularity        Danceability              Energy 
##                 110                   0                  20                  27 
##                 Key            Loudness                Mode         Speechiness 
##                   0                  83                   0                 301 
##        Acousticness            Liveness             Valence               Tempo 
##                  20                 253                   0                  53

I have about 1000 outliers in my data set so i can remove them.

rows_with_outliers <- apply(do.call(cbind, outliers), 1, any)

mymusic_data <- mymusic_data[!rows_with_outliers, ]

The observations of my data set changed from 9999obs to 8798 obs

Standardization

All numeric variables where scaled to allow comparison.

numeric_data <- mymusic_data[sapply(mymusic_data, is.numeric)]
standardized_data <- scale(numeric_data)
summary(standardized_data)

##  Track.Duration..ms.   Popularity       Danceability          Energy       
##  Min.   :-2.78893    Min.   :-1.2792   Min.   :-3.03548   Min.   :-3.0682  
##  1st Qu.:-0.64963    1st Qu.:-1.2792   1st Qu.:-0.63587   1st Qu.:-0.6636  
##  Median :-0.06206    Median : 0.1449   Median : 0.07072   Median : 0.1397  
##  Mean   : 0.00000    Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.57976    3rd Qu.: 0.8908   3rd Qu.: 0.70036   3rd Qu.: 0.8029  
##  Max.   : 3.59085    Max.   : 2.0437   Max.   : 2.63824   Max.   : 1.6763  
##       Key              Loudness            Mode          Speechiness     
##  Min.   :-1.44036   Min.   :-3.2383   Min.   :-1.5285   Min.   :-0.8892  
##  1st Qu.:-0.88156   1st Qu.:-0.5687   1st Qu.:-1.5285   1st Qu.:-0.6109  
##  Median :-0.04337   Median : 0.2210   Median : 0.6542   Median :-0.3780  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.79482   3rd Qu.: 0.7385   3rd Qu.: 0.6542   3rd Qu.: 0.1736  
##  Max.   : 1.63301   Max.   : 2.2365   Max.   : 0.6542   Max.   : 4.8604  
##   Acousticness        Liveness          Valence             Tempo         
##  Min.   :-0.8370   Min.   :-1.3627   Min.   :-2.31061   Min.   :-2.79198  
##  1st Qu.:-0.7622   1st Qu.:-0.7038   1st Qu.:-0.78768   1st Qu.:-0.71785  
##  Median :-0.4560   Median :-0.3993   Median : 0.05376   Median :-0.02379  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.4370   3rd Qu.: 0.5038   3rd Qu.: 0.83271   3rd Qu.: 0.51226  
##  Max.   : 3.0796   Max.   : 3.9787   Max.   : 1.69498   Max.   : 3.15083

PCA

KMO Test

The Kaiser-Meyer-Olkin (KMO) test was used to determine whether the data is suitable for principal component analysis (PCA).

library(psych)

## Warning: package 'psych' was built under R version 4.3.2

kmo_result <- KMO(numeric_data)
print(paste("KMO Test Statistic:", kmo_result$MSA, "\n"))

## [1] "KMO Test Statistic: 0.544964564384212 \n"

In this case, with a KMO test statistic of 0.54496, it falls in the range where the sampling adequacy is somewhat moderate henceforth there is need to do other test before proceeding do do PCA.Henceforth the Bartlett Test Statistic was used to further determine if the variables are fit for PCA.

Bartlett Test

bartlett_result <- cortest.bartlett(cor(numeric_data), n = nrow(numeric_data))
print(paste("Bartlett's Test Statistic:", bartlett_result$chisq, "\n"))

## [1] "Bartlett's Test Statistic: 17694.8069022018 \n"

print(paste("Bartlett's Test p-value:", bartlett_result$p.value, "\n"))

## [1] "Bartlett's Test p-value: 0 \n"

The Bartlett test results indicate a highly significant difference between the observed covariance matrix and the identity matrix, with a test statistic of 15980.00 and a p-value close to zero. This implies that there are significant correlations among the variables in the data set, making it well-suited for factor analysis or principal component analysis (PCA). In this case we proceed to use PCA as the Bartlett test has re-confirmed the use of PCA.

Correlation

Proceeding with PCA I first determined the correlation within the variance and determined the relationship between the components.

library(corrplot)

## Warning: package 'corrplot' was built under R version 4.3.2

## corrplot 0.92 loaded

cor_matrix <- cor(standardized_data)
corrplot(cor_matrix, method = "color", type = "lower", order = "hclust", tl.col = "black", tl.cex = 0.5)

From the graph we can analyse relationships between many components including energy and track duration that there is weak positive correlation which suggests that there is a slight tendency for longer track duration to be associated with higher energy. As well as a weak negative correlation between popularity and acousticness implies that more popular tracks might be slightly less acoustic.A moderate positive correlation between denceability and valence indicates that tracks with higher danceability tend to have higher valence .Whilst a strong positive correlation between energy and loudness suggests that higher energy tracks are likely to be louder. A strong negative correlation between loudness and acousticness implies that quieter tracks are more likely to be acoustic.A moderate positive correlation between energy and speechiness indicates that more speech-like tracks tend to be more energetic.A weak negative correlation between valence and acousticness indicates that more positive-valences tracks tend to be slightly less acoustic.In general the graph above shows the relationship between two features and a lot can be explained.

PCA SUMMARY

pca_result <- prcomp(standardized_data, center = TRUE, scale. = TRUE)
summary(pca_result)

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.5658 1.2183 1.0970 1.04738 0.99730 0.99530 0.95327
## Proportion of Variance 0.2043 0.1237 0.1003 0.09142 0.08288 0.08255 0.07573
## Cumulative Proportion  0.2043 0.3280 0.4283 0.51970 0.60258 0.68514 0.76086
##                            PC8    PC9    PC10    PC11    PC12
## Standard deviation     0.94580 0.9060 0.69941 0.67805 0.45325
## Proportion of Variance 0.07454 0.0684 0.04076 0.03831 0.01712
## Cumulative Proportion  0.83541 0.9038 0.94457 0.98288 1.00000

The PCA results indicate that the data set variability is well-captured by the first eight principal components (PC1 to PC8). PC1 has the highest standard deviation and explains the largest proportion of variance . Subsequent components contribute decreasing proportions of variance. This was further explained by this graph #Scree plot

plot(pca_result, type = "l", main = "Scree Plot")

This is plot showing the variance between one cluster to the other cluster this is furthur explained using the cumulative Distribution plot.Henceforth taking 8 PC.

Cumulative Distribution of Variance

cumulative_var <- cumsum(pca_result$sdev^2 / sum(pca_result$sdev^2))
barplot(cumulative_var, col = "skyblue", names.arg = 1:length(cumulative_var),
        xlab = "Principal Components", ylab = "Cumulative Proportion of Variance",
        main = "Cumulative Distribution of Variance", ylim = c(0, 1.2))
text(1:length(cumulative_var), cumulative_var, labels = paste0(round(cumulative_var * 100, 2), "%"),
     pos = 3, cex = 0.8, col = "darkblue")

By looking at this graph, you can determine how much cumulative variance is explained by including a certain number of principal components.In this case i will opt to capture variance of above 90% using the first 8 of 12 components.By the eighth component a cumulative 90.83 of the total variance is explained. Beyond PC8, the contribution of each component diminishes,suggesting that all variability is captured, but the bulk of the information is contained in the first eight components.

Variable factor map

The values in the graph indicate the strength and direction of the relationship between each variable and each PC

library(factoextra)

## Warning: package 'factoextra' was built under R version 4.3.2

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 4.3.2

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

fviz_pca_var(pca_result, col.var = "brown4")

For each PC, higher absolute values of loading indicate stronger contributions of the corresponding variables to that PC while the negative suggest lower contributions to the PCA. Positive loading suggest a positive relationship like Energy,Loudness,Valence,Danceability while negative loading suggest a negative relationship like Mode,Acousticness and Track duration.

Contribution of variables to DIM

library(factoextra)
fviz_contrib(pca_result, "var", axes = 1:6, fill = "tomato3", color = "tomato4")

The bar plot clearly shows that mostly the major contribution towards fundamental components shaping musical composition of genre comes from energy,tempo and Danceability which contributes more than 90% whilst valence,liveness and Acousticness contribute more than 70% with mode and speechiness contributing less.Also the type of genre can determine the popularity of the song.

Conclusion

In conclusion, this analysis of a diverse music data set has unveiled intricate relationships and key components that define the characteristics of various genres. Through meticulous data pre-processing and advanced statistical techniques, Principal Component Analysis (PCA), the study identified significant correlations among musical features. Noteworthy relationships include the subtle link between energy and track duration, the contrasting dynamics of loudness and acousticness, and the impact of variables like danceability and valence on musical compositions. The project revealed that energy, tempo, and danceability are paramount contributors to the fundamental components shaping musical genres, while variables like valence and acousticness play crucial roles in specific aspects.Whilst the popularity of a song can be determined by its genres Overall, thisexploration provides valuable insights into the essence of music, offering a foundation for deeper understanding and future investigations in the realm of musical analysis.

Exploratory Analysis and Dimensionality Reduction of Music Dataset using PCA