—————– Introduction to the project —————–

This project is an analysis of Spotify using text mining techniques and social media and networking methods. These analyses are based on two different datasets. The first one, which I created, contains the lyrics of the top 20 songs on Spotify on the 19th of December. It contains the name of the artist and the lyrics found on lyricfind.com. The other dataset, used for the network analysis, describes users’ playlists. It includes the name of the user, the name of their playlist, and the song’s name and artist (e.g. 9cc0cfd4d7d7885102480dd99e7a90d6,“Elvis Costello”,“(The Angels Wanna Wear My) Red Shoes”,“HARD ROCK 2010”). These data are used to discover connections between the music tastes of individuals. The aim of this project is, as a first step, to discover patterns within the top songs of the media provider. This analysis could be useful both for artists trying to create a new hit song and for the platform itself to recommend songs to its users, knowing that the majority of them will like them. Secondly, the other half of the project provides some knowledge about possible relationships between the users’ playlists or listening habits. Knowing which users listen to the same artists is really important in order to later make recommendations that are targeted as well as possible.

—————– Literature review —————–

Some published papers approach the topic of song lyrics analysis. The authors generally study this topic in the hope of creating a good recommendation tool or for musicological purposes. An article by Michael Fell and Caroline Sporleder presents the results of their semantic analysis. They focused on genre detection, distinguishing the “quality” of the songs and the publication time. They discovered that an n-gram model is a good first approximation, but also that extending the feature space with more sophisticated features nearly always improves their results. Another paper on this topic suggests a preliminary analysis of the songs to select the ones most likely to provide consistent content for the analysis. To decide if a song was suitable for the semantic analysis, they started by evaluating the content of the words, classifying them to extract the meaning of each song. With the new dataset they explored which kind of word content appeared frequently and analysed the grammatical structure of the lyrics. Concerning the comparison between users, the paper “Recommender System based on Collaborative Filtering for Spotify’s Users” tackles the computation time required to compare users with each other. The authors preferred the model using the Phi coefficient over Pearson’s coefficient to evaluate the accuracy of the recommendations.


        First Part of the project 
        -------------------------
        

The first part of the project consists of the analysis of the songs’ lyrics. The different analyses provided in this section are:

1 frequency analysis

2 wordcloud

3 sentiment analysis

4 clustering

Import the necessary packages

library(tm)
library(SnowballC)
library(tidyverse)
library(tidytext)
library(glue)
library(data.table)
library(readxl)
library(ggplot2)
library(wordcloud)
library(dplyr)
library(stringr)
library(textdata)
library(cluster)
library(stringi)
library(proxy)
library(fpc)

Data Preparation
# Get the lyrics Dataset
setwd("C:/Users/Home/Documents/Erasmus Warsaw/text_mining_projet")
lyrics <-read_excel("projet excel chansons.xlsx") 
lyrics <- lyrics[,3]

#Text parsing 
docs <- Corpus(VectorSource(lyrics))

# Helper transformer to replace a given pattern with a space (defined here but not applied below)
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

# Cleaning text
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, content_transformer(tolower))

# Stopword removal 
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords, c("doo"))

# Stemming 
docs <- tm_map(docs, stemDocument)

# Term frequency matrix
docTermMatrix <- TermDocumentMatrix(docs)
matrixLy <- as.matrix(docTermMatrix)

Frequency Analysis
# Staging the data   
docTermMatrix <- DocumentTermMatrix(docs)
termDocMatrix <- TermDocumentMatrix(docs)

# Word frequencies from the document-term matrix
freq <- colSums(as.matrix(docTermMatrix))   
ord <- order(freq)   
Matrix <- as.matrix(docTermMatrix)   

# Deal with the sparse terms
docTermMatrixSparse <- removeSparseTerms(docTermMatrix, 0.2)

# Table after removing sparse terms
freq <- colSums(as.matrix(docTermMatrixSparse))   

# Terms that appear frequently   
findFreqTerms(docTermMatrixSparse, lowfreq=20) 
##  [1] "away"   "back"   "bring"  "cant"   "danc"   "dont"   "ever"  
##  [8] "gave"   "give"   "ill"    "just"   "know"   "let"    "like"  
## [15] "love"   "make"   "move"   "need"   "never"  "one"    "rhythm"
## [22] "run"    "say"    "see"    "sugar"  "tell"   "thing"  "wanna" 
## [29] "want"   "wont"   "your"
# Plot words that appear at least 20 times  

wordDF <- data.frame(word=names(freq), freq=freq)   
plotword <- ggplot(subset(wordDF, freq>20), aes(x=reorder(word, -freq), freq))    
plotword <- plotword + geom_bar(stat="identity")   
plotword <- plotword + theme(axis.text.x=element_text(angle=45, hjust=1))   
plotword

This bar chart illustrates the occurrence frequency of each word in the lyrics. These are the words that appear at least 20 times in the selected songs. We can see that the most recurring word is love, followed by see, danc-, move and like. Just by looking at this top 5 we can note that the most listened-to songs are mainly love songs or happy songs about dancing (which we can guess have happy and energetic melodies).
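
The same top words can also be read directly from the frequency table; a minimal check, reusing the freq vector computed above:

# Top 5 most frequent (stemmed) words, in decreasing order of frequency
head(sort(freq, decreasing = TRUE), 5)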

Wordcloud of the Data
docTermMatrixSparse <- removeSparseTerms(docTermMatrix, 0.15)   
freq <- colSums(as.matrix(docTermMatrixSparse))   
dark2 <- brewer.pal(6, "Dark2")   
wordcloud(names(freq), freq, max.words=100, rot.per=0.2, colors=dark2)  

Sentiment analysis
get_sentiments("afinn")
## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ... with 6,776 more rows
get_sentiments("nrc") 
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,891 more rows
colnames(lyrics) <- 'lyrics'
lyricstoken <- lyrics %>% 
  unnest_tokens(word, 'lyrics')
summary(lyricstoken)
##      word          
##  Length:8117       
##  Class :character  
##  Mode  :character
lyricstoken <- lyricstoken %>%
  anti_join(stop_words)
#words related to positive emotions
nrc_positive <- get_sentiments("nrc") %>%  filter(sentiment == "positive")
positivewords <- lyricstoken %>%
  inner_join(nrc_positive) %>%
  count(word, sort = TRUE)

positivewords %>%
  filter(n > 5) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

#words related to negative emotions
nrc_negative <- get_sentiments("nrc") %>%  filter(sentiment == "negative")
negativewords <- lyricstoken %>%
  inner_join(nrc_negative) %>%
  count(word, sort = TRUE)

negativewords %>%
  filter(n > 5)  %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

#ratio of positive words compared to negative ones. 
ratio = sum(as.numeric(positivewords$n), na.rm = TRUE) / (sum(as.numeric(negativewords$n), na.rm = TRUE) + sum(as.numeric(positivewords$n), na.rm = TRUE))
print(paste("The positive words represent " , toString( ratio*100) ,"% of the total amount of words"))
## [1] "The positive words represent  68.8888888888889 % of the total amount of words"
newsgroup_sentiments <- lyricstoken %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  summarize(score = sum(value))

newsgroup_sentiments$score
## [1] 179
contributions <- lyricstoken %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(word) %>%
  summarize(occurences = n(),
            contribution = sum(value))

# Which words had the most effect on sentiment scores overall
contributions %>%
  top_n(20, abs(contribution)) %>%
  mutate(word = reorder(word, contribution)) %>%
  ggplot(aes(word, contribution, fill = contribution > 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip()

This analysis reveals that the majority of the words used in the songs have a positive connotation. Dance and love, which were the most recurring words detected by the frequency analysis, dominate the ranking of positive emotions. On the contrary, the ranking of negative words is led by the words bad, beg, lose or fall. However, the maximum occurrence of the top negative word (bad) does not surpass twenty, whereas love, the top word of the positive ranking, reaches 80 mentions. The ratio of positive words is more than 2/3 of the total amount of words in the lyrics. The afinn sentiment score evaluating the overall impression of the top 20 equals 179. Firstly, this value is positive, which means that the lyrics are overall positive, and secondly this value is quite high, which implies that the feeling of the text is significantly positive.
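
The overall afinn score could also be broken down per song to see which tracks drive this positive impression. A minimal sketch, assuming we add a song index before tokenizing (the index and the perSongSentiment object are introduced here only for illustration and are not part of the data preparation above):

# Hedged sketch: per-song AFINN score (re-tokenizes the raw lyrics; song index added for illustration)
perSongSentiment <- lyrics %>%
  mutate(song = row_number()) %>%
  unnest_tokens(word, lyrics) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(song) %>%
  summarize(score = sum(value))
perSongSentiment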

Clustering

Data Preparation

setwd("C:/Users/Home/Documents/Erasmus Warsaw/text_mining_projet")
lyrics <-read.csv("projet excel chansons.csv", sep=";", dec=",", header=TRUE) 
names <- unlist(lyrics[,1])
lyrics <- lyrics[,3]
corpus <- VCorpus(VectorSource(lyrics))

ndocs <- length(corpus)
minTermFreq <- ndocs * 0.01
maxTermFreq <- ndocs * .5
dtm = DocumentTermMatrix(corpus, control = list( stopwords = TRUE, wordLengths=c(4, 15), removePunctuation = T, removeNumbers = T,  bounds = list(global = c(minTermFreq, maxTermFreq))))

dtm.matrix = as.matrix(dtm)
matrix <- as.matrix(dtm)
distMatrix <- dist(matrix , method="euclidean")

Dendrogram

groups <- hclust(distMatrix, method="ward.D")
plot(groups, labels = names ,cex=0.9, hang=-1)
rect.hclust(groups, k=6)

K-Means

kfit <- kmeans(distMatrix, 5)   
clusplot(as.matrix(distMatrix), kfit$cluster , color=T, shade=T, labels= 4, lines=0)

The clustering analysis allows the division of the dataset of songs into smaller groups of songs that are evaluated as similar. The dendrogram shows that the song Dance Monkey is really different from the rest of the dataset, as it is clustered as a singleton and its merging point is high. RITMO, which is half Spanish and half English, also forms a cluster of its own, probably because of this difference. Surprisingly, the three Christmas songs (All I Want For Christmas Is You, Last Christmas and Santa Tell Me) were not grouped in the same cluster.
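
To check explicitly which songs end up together, the dendrogram can be cut into the same six groups and the membership listed; a short sketch reusing the groups and names objects from above:

# Cut the hierarchical clustering into 6 groups and list the songs belonging to each cluster
clusterMembership <- cutree(groups, k = 6)
split(names, clusterMembership)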


        Second Part of the project 
        -------------------------
        

The second part of the project is a network analysis. The analyses below are made on a restricted number of users to allow a clear visual representation of the results. However, the same analyses were run on a bigger scale and also on another subset of the data, and the results obtained were very consistent.

Data Preparation
library(magrittr) 
library(dplyr) 
library(igraph)

# Get the playlist dataset
setwd("C:/Users/Home/Documents/Erasmus Warsaw/text_mining_projet")
network <- read.csv("spotify.csv", header=T, sep = ";", nrows = 20000)
network <- network[ ,2:3]

#create the nodes 
nodes <- as.vector(unique(network$user_id))
nbr_listeners <- length(unique(network$user_id))


# Keep only unique user/artist pairs (a user who listens to many songs of the same artist is counted once)
network<-unique(network[c("user_id", "artistname")])


network$user_id = as.character(network$user_id)

#Group by the name of the artist       
network <- network %>% group_by(artistname) %>% summarize(value = list(unique(user_id)))

Create the Network
# create empty graph with nodes
graph <- make_empty_graph(n = nbr_listeners)
graph <- set.vertex.attribute(graph, "name", value=nodes)
plot(graph)

for (variable in 1:length(network$artistname)) {
  # For each artist, retrieve the list of users who listen to it
  firstPerson <- network[variable,2][[1]]
  lenfirst <- length(firstPerson[[1]])
  
  # Build all pairs of this artist's listeners (to be linked below)
  matrix1 <- unique(expand.grid(firstPerson[[1]],firstPerson[[1]]))
  listetotal <- list()
  for (element in 1:  dim(matrix1)[[1]]  ){
    a<- as.character( matrix1$Var1[element])
    b <- as.character(matrix1$Var2[element])
    listetotal <- c(listetotal, a , b)
  }

  # Link the listeners' nodes when the artist has more than one listener
  if(lenfirst >1) {
    graph <- add.edges(graph = graph, edges = unlist(listetotal))
  }
  graph <- igraph::simplify(graph, remove.multiple = TRUE, remove.loops = TRUE)
  
}

#Transform the graph into an undirected graph
graph <- as.undirected(graph, mode = c("collapse", "each", "mutual"),edge.attr.comb = igraph_opt("edge.attr.comb"))


#layout 
layout1<- layout_with_fr(graph)
# Plot with Layout 
plot(graph, edge.curved=0, vertex.label.cex=.3,
     vertex.color="lightblue", vertex.frame.color="#555555",
     vertex.label.color="black", layout= layout1)

Network Evaluation
# network density
edgeDensity <- edge_density(graph, loops=F)
cat ('the edge density value = ', edgeDensity , '\n') 
## the edge density value =  0.5098039
networkDensity <- ecount(graph)/(vcount(graph)*(vcount(graph)-1))
cat ('the network density value = ', networkDensity, '\n') 
## the network density value =  0.254902
# transitivity
globalTransitivity <- transitivity(graph, type="global")  
cat ('the global transitivity value = ', globalTransitivity , '\n') 
## the global transitivity value =  0.7882653
# diameter
diameterNetwork <- diameter(graph, directed=F)
cat ('the diameter of the network value = ', diameterNetwork , '\n') 
## the diameter of the network value =  3

Density is defined as the number of connections a user has divided by the total number of possible connections he or she could have. The manual computation above divides the edge count by n(n-1), the number of ordered pairs, and gives about 0.25, meaning that only a fourth of the potential directed connections between users actually exist; edge_density uses the undirected denominator n(n-1)/2 and therefore reports twice that value (0.51). Transitivity refers to the extent to which the relation between two nodes is transitive. The transitivity value is quite high (0.78), which implies that in most cases the transitivity rule applies. Finally, the diameter, i.e. the shortest distance between the two most distant nodes of the network, is 3.
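
A minimal check of this factor of two, using the undirected density formula |E| / (|V|(|V|-1)/2):

# Undirected density: existing edges divided by the number of possible unordered pairs
undirectedDensity <- ecount(graph) / (vcount(graph) * (vcount(graph) - 1) / 2)
cat('the undirected density value = ', undirectedDensity, '\n')
# This should match the edge_density(graph, loops = FALSE) value reported above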

Node Analysis
# degrees of nodes
degreesNetwork <- degree(graph, mode="all")
# Plot the nodes with a size proportional to their degree (without labels for clarity)
plot(graph, vertex.size=degreesNetwork *1.001, vertex.label=NA)

hist(degreesNetwork, breaks=1:vcount(graph)-1, main="Histogram of node degree of Network")

# distribution of degrees 
distanceDegreeNetwork <- degree_distribution(graph, cumulative=T, mode="all")
plot( x=0:max(degreesNetwork), y=1-distanceDegreeNetwork , pch=16, cex=1.2, col="blue", 
      xlab="Degree", ylab="Cumulative Frequency")

Observing the graph of the nodes’ degree distribution, we can see that most of the nodes have either no connection to others (so a degree of 0) or a really high degree value. This is due to the fact that most of the nodes are connected to a lot of their neighbours.
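
The same observation can be summarized numerically by listing the highest-degree users and counting the isolated ones; a short sketch reusing the degreesNetwork vector:

# Users with the most shared-artist connections (potential hubs)
head(sort(degreesNetwork, decreasing = TRUE), 5)
# Number of isolated users (degree 0)
sum(degreesNetwork == 0)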

Path Analysis
# distances, paths
cat ( 'the average distance in the graph is ', mean_distance(graph), '\n') 
## the average distance in the graph is  1.507353
# finding the shortest paths
news.path <- shortest_paths(graph, 
                            from = V(graph)[name=="f844835ad2842f8134f4283a4a7554e2"], 
                            to  = V(graph)[name=="944c80d26922ae634d6ce445b1fdff7f"],
                            output = "both") # both path nodes and edges

# Generate edge color variable to plot the path:
ecol <- rep("gray80", ecount(graph))
ecol[unlist(news.path$epath)] <- "orange"
# Generate edge width variable to plot the path:
ew <- rep(2, ecount(graph))
ew[unlist(news.path$epath)] <- 4
# Generate node color variable to plot the path:
vcol <- rep("gray40", vcount(graph))
vcol[unlist(news.path$vpath)] <- "gold"

plot(graph, vertex.color=vcol, edge.color=ecol, 
     edge.width=ew, edge.arrow.mode=0, vertex.label.cex = 0.5)

Neighbor Analysis
# immediate neighbors
neigh.nodes <- neighbors(graph, V(graph)[name=="db937456654d2465292c4daa947c95de"], mode="out")
neigh.nodes
## + 8/18 vertices, named, from 1a43bda:
## [1] 07f0fc3be95dcd878966b1f9572ff670 944c80d26922ae634d6ce445b1fdff7f
## [3] c50566d83fba17b20697039d5824db78 650c4d63a819dbb77cc15a87f407039a
## [5] 7511e45f2cc6f6e609ae46c15506538c 29320fd2d36575b40d21b623adcf12b2
## [7] 9fb60017f2d971b03885ba80748b19bd 1ed9910b0db7fcb779ec65b2ded4892f
# Set colors to plot the neighbors:
vcol[neigh.nodes] <- "#ff9d00"
plot(graph, vertex.color=vcol, vertex.label.cex = 0.5)

This neighborhood analysis is useful to analyze the environment in which a user evolves. It can be helpful to understand by whom he or she is influenced and, once again, to help them with targeted recommendations.
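
As a hedged sketch of how neighbour information could feed such a recommendation, assume a copy of the deduplicated user/artist table had been kept before the group_by step (called userArtist here; it is not created in the code above): artists listened to by a user's neighbours but not by the user are natural candidates.

# Hypothetical sketch: recommend artists listened to by a user's neighbours.
# userArtist is assumed to be the unique (user_id, artistname) table kept
# before the group_by(artistname) transformation above.
recommendFor <- function(user, graph, userArtist) {
  # names of the user's immediate neighbours in the shared-artist graph
  neigh <- neighbors(graph, V(graph)[name == user])$name
  # artists the neighbours listen to, minus the ones the user already knows
  neighArtists <- unique(userArtist$artistname[userArtist$user_id %in% neigh])
  ownArtists   <- userArtist$artistname[userArtist$user_id == user]
  setdiff(neighArtists, ownArtists)
}
# Example call (user id taken from the neighbour analysis above):
# recommendFor("db937456654d2465292c4daa947c95de", graph, userArtist)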

Subgroups and communities
# cliques
cliquesNetwork <- cliques(graph) 
sizeCliques <- sapply(cliques(graph), length) 
largestClique <- largest_cliques(graph) 
vcol <- rep("grey80", vcount(graph))
vcol[unlist(largestClique)] <- "gold"
plot(as.undirected(graph), vertex.label=V(graph)$name, vertex.color=vcol, vertex.label.cex = 0.5)

# community detection
clusterEdgeBetweenness <- cluster_edge_betweenness(graph) 
dendPlot(clusterEdgeBetweenness, mode="hclust")

plot(clusterEdgeBetweenness, graph, vertex.label.cex = 0.5)

length(clusterEdgeBetweenness) # no of communities
## [1] 6
clusteringNodes <- cluster_label_prop(graph)
plot(clusteringNodes, graph, vertex.label.cex = 0.5)

This clustering is based on the edge betweenness centrality of the graph. It operates on the edges and allows users to be grouped together. This is very helpful for targeting and segmentation strategies.
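
The segmentation itself can be read from the clustering object; a short sketch listing the size of each community and the assignment of every user:

# Size of each community found by edge betweenness
sizes(clusterEdgeBetweenness)
# Community assignment of every user
membership(clusterEdgeBetweenness)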

Other Statistics
# optimization of modularity
greedyAlgoCluster <- cluster_fast_greedy(graph)
plot(greedyAlgoCluster, graph , vertex.label.cex = 0.5)

V(graph)$community <- greedyAlgoCluster$membership
colrs <- adjustcolor( c("gray50", "tomato", "gold", "yellowgreen"), alpha=.6)
plot(graph, vertex.color=colrs[V(graph)$community], vertex.label.cex = 0.5)

In this particular subset, 4 different communities were found.
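
Since this algorithm optimizes modularity, the quality of the partition can also be checked directly; a quick sketch reusing the greedyAlgoCluster object:

# Number and size of the communities found by the greedy algorithm
length(greedyAlgoCluster)
sizes(greedyAlgoCluster)
# Modularity of the partition (higher values indicate a stronger community structure)
modularity(greedyAlgoCluster)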

# k-core decomposition
kcore <- coreness(graph, mode="all")
plot(graph, vertex.size=kcore*6, vertex.label=kcore, vertex.color=colrs[kcore])

The k-core of the graph is a maximal subgraph in which every vertex has a degree of at least k. We can see that the coreness is high for nearly all vertices; people tend to stay together with people they know and maybe trust.
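
To isolate this densest part of the network, the maximal core can be extracted as a subgraph; a short sketch reusing the kcore vector computed above:

# Extract the subgraph formed by the vertices with the highest coreness
maxCore <- induced_subgraph(graph, which(kcore == max(kcore)))
vcount(maxCore)   # number of users in the densest core
ecount(maxCore)   # number of connections between them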