Introduction

A network is a group or system of interconnected people or things - a way to represent relationships between entities. Network analysis has its roots in discrete mathematics (graph theory). Its applications are numerous, from epidemiology to sociology, and these days, with the rise of network industries, network analysis has also become part of microeconomics.

Most standard methods of statistical analysis require the data to be in tabular format. However, such a structure is not always easy to obtain, and network analysis is a case in point. Storing network information in this standard format requires some dimensionality reduction technique, especially when analysing a vast network.

In this paper, I present methods for the analysis of a large network and, in particular, applications of general dimension reduction tools to network data. I apply these methods to data about playcounts and similarities (the 5 most similar artists to each, selected by the last.fm recommendation engine) between Polish musicians, obtained from the last.fm website. The main point of this paper is to investigate the position of the top 100 most listened artists in the vast network and to see whether one can confirm that these artists form a cluster as a result of a preferential attachment process (homophily).

Disclaimer: These data come from a group project for an Applied Microeconomics course that I took last semester at my home university (University of Warsaw in Poland). In that paper, we fitted playcount data to some distributions obeying power laws and compared it with our own simulation that takes the network into account. In this assignment, however, I only describe the network using network measures and a reduction of its dimensionality, which was not done in the above-mentioned paper. Hence, the work presented here is original.

Literature Review

According to UNCTAD (United Nations Conference on Trade and Development) figures, the global market value of industries relying heavily on creative and cultural inputs is estimated at $1.3 trillion and growing. The relatively small recording industry is part of this vast family of creative industries. According to the IFPI Global Music Report, total worldwide revenues for 2018 were equal to $19.1 billion. The market grew by 9.7% that year (the fourth consecutive year of growth). However, the music industry is characterized by great inequalities. There is a vast number of artists, but only a few manage to achieve so-called superstardom. On the one hand, the emergence of stars is shaped by the preferences of customers. On the other, our choices are influenced heavily by the network we are in and, in the music industry specifically, by recommendation engines (in 2018, about 40% of total revenues came from streaming, according to IFPI).

There are two substantial theories of stardom which reach back to the eighties. They are distinct but not mutually exclusive and state what drives the demand for superstar services: according to Rosen (1981) this factor is superior talent combined with perfect reproducibility of art, whilst Adler (1985) claimed network externalities of popularity to be responsible.

According to Rosen (1981), poorer quality is an imperfect substitute for higher quality. He claims that small differences in talent translate into large earnings differentials. Most people are less satisfied with a performance of a less talented and cheaper artist when they have an opportunity to enjoy a top performance, even at a higher cost (Frey (1998)). If the best artist is significantly better than the competition, "each consumer consuming the best" is a special case (Adler (1985)). In these circumstances, the artist becomes a monopolist whose profit-maximizing strategy depends on the elasticity of demand for his product (if the demand is highly elastic, it pays off to serve the whole market). In Rosen's (1981) model, there are two extreme options: either there is a very top artist who sets a high price and sells to only a fraction of consumers (unless the demand is highly elastic), or there are several equally talented artists, one of whom serves the whole market but remains poor. In conclusion, if a star is both extraordinarily popular and rich, his talent must be unquestionably greater than the rest.

The second theory, by Adler (1985), refers to the concept of "consumption capital" (Stigler and Becker (1977)). Generally speaking, this theory predicts that the utility consumers derive from a particular good or service increases with prior consumption. Consumers build their consumption capital in art, and the larger it is, the greater the enjoyment from each encounter with its subject (art and artist). Moreover, consumers want to consume the same art that others do, which is a key factor underlying the production of superstars. When an artist is popular, it is easier to find other people familiar with his works or media coverage. However, in Adler's (1985) model, the emergence of a star is a chance event: consumers first include artists randomly in their consumption basket, and it is "pure luck" that one artist ends up with more patrons than the rest. This obviously gives him an advantage, which can then snowball into superstardom.

MacDonald (1988), in turn, described a dynamic process through which stars arise. Each artist is capable of a good or bad performance. The difference in talent is then defined differently than in the two previous models: it is not the quality of performances but rather the probability that a particular performance will be good (which is, in the model, constant throughout the artist's career). From the viewpoint of audiences, however, this probability is lower for a new performer than for a famous one. Those who perform poorly drop out, while good artists stay on the market and increase their probability (again, from the viewpoint of audiences) of performing well in the future. In this way, artists with a good track record can charge higher prices. In conclusion, artists of comparable talent do comparably well.

Recently, many research contributions on musical data focus on the existence of power-law relationships (Rafailidis and Yannis, 2008). A random variable obeys a power-law distribution if its probability density function is given by \(p(x)=Cx^{-a}\), where \(a\) and \(C\) are real-valued constants.
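To make this definition concrete, here is a minimal sketch (my own illustration, not part of the original analysis) that draws samples from a continuous power law with exponent \(a = 2.5\) via inverse-transform sampling; on a log-log scale, the empirical complementary CDF of such a sample is approximately a straight line with slope \(-(a-1)\).

# Minimal sketch: sampling from a continuous power law p(x) = C * x^(-a), x >= x_min
# Inverse-transform sampling: x = x_min * (1 - u)^(-1 / (a - 1)), u ~ Uniform(0, 1)
set.seed(10)
a <- 2.5
x_min <- 1
u <- runif(10000)
x <- x_min * (1 - u)^(-1 / (a - 1))

# Empirical complementary CDF on a log-log scale: approximately linear, slope -(a - 1)
x_sorted <- sort(x)
ccdf <- 1 - (seq_along(x_sorted) - 1) / length(x_sorted)
plot(x_sorted, ccdf, log = "xy", type = "l",
     xlab = "x", ylab = "P(X >= x)")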

Similar to the power law is Zipf's law (also known as the discrete Pareto distribution). The main difference is that in the case of Zipf's law, rank-frequency plots are used, which are equivalent to the cumulative distribution of the variable under study. In fact, if a variable obeys a power-law distribution with exponent \(a\), then the corresponding slope for Zipf's law will be \(a-1\). However, Zipf's law is an empirical distribution which was originally fitted to linguistic data (Zipf (1949)). When investigating power laws in the music industry, a similar, theoretical Yule-Simon distribution is typically used. It arose originally as the limiting distribution of a particular stochastic process studied by Yule (1924) as a model for the distribution of biological taxa and subtaxa. Simon called this process the "Yule process" (Simon (1955)), but it is more commonly known today as a "preferential attachment process". The preferential attachment process is an urn process in which balls are added to a growing number of urns, each ball being allocated to an urn with probability linear in the number of balls the urn already contains. The distribution also arises as a compound distribution, in which the parameter of a geometric distribution is treated as a function of a random variable having an exponential distribution. Moreover, the tail of the Yule-Simon distribution (further referred to as "Yule" or the "Yule distribution") is a realization of Zipf's law: the probability mass function of Yule can be used to model, for example, the relative frequency of the most listened artist in a large pool of performers, which according to Zipf's law is inversely proportional to a (typically small) power \(k\).
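To illustrate the urn scheme (a sketch of my own, with an assumed new-urn probability \(\alpha = 0.1\)), the simulation below adds balls one at a time: with probability \(\alpha\) a ball starts a new urn, otherwise it joins an existing urn with probability proportional to the urn's current size. The resulting urn sizes are heavy-tailed: a few huge urns and many tiny ones.

# Minimal sketch of a preferential-attachment (Yule) urn process
set.seed(10)
alpha <- 0.1        # assumed probability of starting a new urn
n_balls <- 10000
urns <- c(1)        # start with one urn containing one ball

for (i in seq_len(n_balls)) {
  if (runif(1) < alpha) {
    urns <- c(urns, 1)                                  # open a new urn
  } else {
    j <- sample(seq_along(urns), size = 1, prob = urns) # pick an urn ~ its size
    urns[j] <- urns[j] + 1
  }
}

# A few very large urns and a mass of small ones
head(sort(urns, decreasing = TRUE), 10)
summary(urns)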

The previously mentioned theories on stardom focus on the relationship between market factors and popularity. However, in modern studies there is a growing tendency to incorporate the influence of social networks, which are believed not only to affect individuals' choices but also to initiate changes in the behaviour of whole cohorts (Doreian and Romney (1989)). Researchers stress that network participants are more likely to be interested in an issue popular amongst other agents in the network (Lind and Herrmann (2007)). In the economic literature, the connection between individual and group behaviour is incorporated in such concepts as the bandwagon, Veblen and snob effects, where people are believed to adjust their individual actions on the basis of the group popularity of some items (Leibenstein (1950)). In the case of music artists, homophily (the hypothesis of which is closely linked to the preferential attachment process) also plays a role. As mentioned previously, only a few artists out of a vast pool manage to achieve stardom. Since their number is relatively small, they usually form a cluster, and the network can be characterized by degree heterogeneity (Kim and Altmann (2017)) - a large number of nodes have few links, while a few nodes have many. Kim and Altmann (2017) find that homophily may affect the evolution of the degree distribution of scale-free networks. More specifically, homophily may cause a bias towards convexity instead of the often hypothesised concave shape. Thus, homophily can significantly (and uniformly) affect the emergence of scale-free networks influenced by preferential attachment, regardless of the type of seed network observed (e.g. whether it is centralized or decentralized).

The most important conclusion from the presented literature is that top artists are few and usually located close to each other in the network of similarities. In this paper, I am going to check whether that is the case in the last.fm playcount data.

Methods

Dimension reduction for the network data

One might wonder why dimension reduction is sometimes necessary for network analysis. One of the classic ways to represent a graph is an adjacency matrix, in which each row and column represents one node (vertex), and each cell of the matrix contains either 1 (the vertices are connected) or 0 (they are not). After presenting the graph in this way, it is tempting to follow the standard data analysis path from this point, treating each row as an observation and each column as a feature.
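For example (a toy illustration of the idea), igraph can produce this matrix directly; note that by default it returns a sparse representation:

library(igraph)

# A 4-vertex toy graph with edges 1-2, 2-3, 1-3, 3-4
g <- graph_from_edgelist(matrix(c(1, 2,
                                  2, 3,
                                  1, 3,
                                  3, 4),
                                ncol = 2, byrow = TRUE),
                         directed = FALSE)

# Rows and columns are vertices; 1 marks an edge, "." (zero) its absence
as_adjacency_matrix(g)                  # sparse by default
as_adjacency_matrix(g, sparse = FALSE)  # dense 4 x 4 matrix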

However, there are two problems with such a setting. Firstly, the data is high-dimensional: the number of columns equals the number of vertices, so for a graph with \(n\) vertices one gets a huge \(n \times n\) matrix. For sparse graphs (with far fewer edges than the possible maximum), the matrix contains mostly zeros, which is problematic for some algorithms.

Another non-trivial limitation of this representation is that the adjacency matrix carries information only about the closest neighbors. In many cases, even indirect neighborhood can carry important information about a particular node.

Therefore, reducing the dimensionality of network data is sometimes necessary for further analysis. That is, converting a specific representation of each vertex, with specified neighbors, to an abstract one, where each vertex is a point in a low-dimensional space. One should be aware that while the placement of each point in this space contains information about its placement in the original graph, we lose information about specific neighbors.

Graph plotting layouts computation

Graph plotting plays a crucial role in network analysis. Plotting the data is usually a first step of Exploratory Data Analysis, and as such, reliable methods for network plotting have been developed. Obtaining a layout of the network in 2D space (that is, \(x\) and \(y\) coordinates for each vertex on the plot) requires some dimensionality reduction technique.

There exist well-defined criteria for what makes a good network visualization. The most important ones, according to Kaufmann and Wagner (2003), are:

  1. Minimization of edge crossings
  2. Even distribution of the elements
  3. Minimization of edge lengths
  4. Finding and preservation of symmetry between subgroups of nodes (displaying similar areas of the graph in the same way).

The family of plot layout techniques I am going to use here is called force-directed layouts. The underlying process for such techniques was inspired by real-world physical systems: vertices are modeled as particles repelling each other, while the edges act like elastic springs attracting them. After dragging one specific vertex, the others adjust to the change. These methods have dimensionality reduction techniques built in, especially for the analysis of big networks.

The two most widespread force-directed layouts are the Fruchterman and Reingold (1991) algorithm and Kamada-Kawai (1989). They are similar and both use a simulation. Kobourov (2004) describes the difference like this: "Whereas the algorithm of Fruchterman-Reingold aim to keep adjacent vertices close to each other while ensuring that vertices are not too close to each other, Kamada and Kawai take graph theoretic approach […]. In this model, the "perfect" drawing of a graph would be one in which the pair-wise geometric distances between the drawn vertices match the graph theoretic pairwise distances."
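Concretely, Kamada and Kawai (1989) formalize this as a spring energy to be minimized (notation adapted):

\[ E = \sum_{i<j} \frac{k_{ij}}{2}\left(\lVert p_i - p_j \rVert - l_{ij}\right)^2, \]

where \(p_i\) is the drawn position of vertex \(i\), \(l_{ij}\) is the desirable length of the spring between \(i\) and \(j\), proportional to the graph-theoretic distance \(d_{ij}\), and the spring strength \(k_{ij} = K/d_{ij}^2\) (for a constant \(K\)) penalizes mismatches between nearby vertices more heavily.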

There is also another layout algorithm worth mentioning. Large-Graph-Layout (Adai et al. 2004) is aimed specifically at drawing big (>1 million vertices) graphs.

Below, I have generated a small artificial network (100 nodes, grown by preferential attachment) and plotted it using these various conventions.

library(igraph)

# Generate a small artificial scale-free network (100 nodes, preferential attachment)
set.seed(10)
net <- sample_pa(100)

# Cosmetic vertex and edge attributes for plotting
V(net)$size <- 8
V(net)$frame.color <- NA
V(net)$color <- "red"
V(net)$label <- ""
E(net)$arrow.mode <- 0

# Draw the same network with four different layouts
par(mfrow = c(1, 4))
plot(net, layout = layout_randomly, main = "Random")
plot(net, layout = layout_with_fr, main = "Fruchterman-Reingold")
plot(net, layout = layout_with_kk, main = "Kamada-Kawai")
plot(net, layout = layout_with_lgl, main = "Large-Graph-Layout")

General Dimension reduction methods

However, it is also possible to use standard dimension reduction algorithms for the layout computation. The main obstacle is that the chosen method must be able to work with a dissimilarity matrix instead of a standard data frame. This means that Principal Component Analysis cannot be used.

In this analysis, I have used two other standard dimension reduction algorithms: Multi-Dimensional Scaling (MDS) and t-distributed Stochastic Neighbor Embedding (t-SNE; Maaten and Hinton 2008). The latter is relatively new and, as such, requires some explanation. t-SNE is a method aimed specifically at visualizing high-dimensional, non-linear data in 2- or 3-dimensional space. The basic outline of the method is the same as for MDS: learn a representation of high-dimensional points in a smaller space by trying to preserve pairwise distances. t-SNE, however, employs a different similarity measure, based on the conditional probability of two points being neighbors. A full explanation of the inner workings of the algorithm is beyond the scope of this assignment; for a detailed description of the algorithm and its potential pitfalls, see Maaten and Hinton (2008).
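For reference, the similarity used by t-SNE in the high-dimensional space (following Maaten and Hinton (2008)) is the conditional probability that point \(x_i\) would pick \(x_j\) as its neighbor under a Gaussian centered at \(x_i\):

\[ p_{j\mid i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}, \]

where each bandwidth \(\sigma_i\) is tuned to match a user-chosen perplexity. The low-dimensional embedding is then optimized so that analogous (Student-t based) similarities between the embedded points match these probabilities.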

Network Demographics

Further on, I am going to calculate well-known network statistics and interpret their values qualitatively (each is a single function call in igraph, as sketched after the list). The particular statistics that interest me are:

  1. Degree - a number of links each node has.

  2. Density - the number of actual links divided by the number of potential links.

  3. Diameter - the shortest distance between the two most distant nodes in the network.

  4. Global average path length - the average distance from one node to another.

  5. Closeness - measures how many steps are required to access every other vertex from a given vertex.

  6. Transitivity (global clustering coefficient) - a measure of the degree to which nodes in a graph tend to cluster together.

  7. Assortativity - a preference for a network’s nodes to attach to others that are similar in some way, often examined in terms of node’s degree.

  8. Betweenness - the betweenness centrality of each vertex is the number of shortest paths that pass through the vertex. For every pair of vertices in a connected graph, there exists at least one shortest path between them such that either the number of edges the path passes through (for unweighted graphs) or the sum of the weights of its edges (for weighted graphs) is minimized.
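As a quick reference, the sketch below (my own, run on a toy preferential-attachment network rather than the artist data) shows which igraph function computes each of these statistics.

# Toy scale-free network for illustration
set.seed(10)
net <- sample_pa(100, directed = FALSE)

degree(net)                          # 1. links per node
edge_density(net)                    # 2. actual / potential links
diameter(net)                        # 3. longest shortest path
mean_distance(net)                   # 4. global average path length
closeness(net)                       # 5. reciprocal of summed distances to all others
transitivity(net, type = "global")   # 6. global clustering coefficient
assortativity_degree(net)            # 7. degree assortativity
betweenness(net)                     # 8. shortest paths through each vertex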

Dataset description

The data used here was obtained via the open API provided by the last.fm site. Last.fm is a service where users can log the music they play across different services, including Spotify, YouTube, Apple Music and others. The total number of artists tagged "polish" there is 17178. However, the API poses a limit on the maximum number of calls for one information type, and thus data about 9998 random artists was obtained.

The available information about each artist is: the artist's name, the number of times the artist was played by all users (further called "playcount"), and the names of the 5 most similar artists (generated by the last.fm recommendation engine).

Below, I have loaded the dataset and presented its initial structure.

# Loading data
temp <- new.env()
load("data/clean_datasets.Rdata", envir = temp)
general_info <- temp$general_info
similar_artists <- temp$similar_artists

Artist data
general_info %>%
  arrange(-playcount) %>%
  head() %>%
  select(-listeners) %>%
  knitr::kable() %>%
  kableExtra::kable_styling()
name playcount
Coma 34745306
Myslovitz 31309608
Hey 27841125
happysad 27167021
Kult 25944772
Pidżama Porno 22457056

Similar artists
similar_artists %>%
  head(10) %>%
  select(-rank) %>%
  knitr::kable() %>%
  kableExtra::kable_styling()
artist similar
Myslovitz Artur Rojek
Myslovitz happysad
Myslovitz T.Love
Myslovitz Hey
Myslovitz Wilki
Brodka Mela Koteluk
Brodka The Dumplings
Brodka Dawid Podsiadło
Brodka Krzysztof Zalewski
Brodka Mikromusic

I have used the information about similar artists to construct a network of similarities between artists. For further analysis, I have excluded the vertices of artists for which no further data is available.
The methods for layout computation that I present further on usually work better if the graph is connected, i.e. if it is possible to go from each vertex to any other vertex. As such, I have selected only the biggest connected subgraph for later analysis. It has 6618 artists in total, so it captures the majority of them. It also contains mostly artists who were played at least once.

I have also made the graph undirected. That is, from information such as "Brodka is similar to Dawid Podsiadło", I obtain "Brodka and Dawid Podsiadło are similar". This step makes the analysis easier; the pairwise distance matrix should also be symmetric to avoid surprises later on.

# Remove artists not available in the database
similar_artists <- similar_artists %>%
  drop_na() %>%
  filter(similar %in% general_info$name) 

general_info %>%
  select(-listeners) -> general_info

# Create a graph
artist_graph <- graph_from_data_frame(similar_artists, 
                                vertices = general_info)

# Convert to undirected
artist_graph %>% 
  as.undirected(mode = "collapse") -> artist_graph

# Select biggest component of the graph
comp <- components(artist_graph)
biggest_subgraph <- delete_vertices(artist_graph, 
                                    V(artist_graph)[comp[["membership"]] != 1])

artist_graph <- biggest_subgraph

The one quantitative piece of information about each artist is playcount. Below, I have plotted its distribution, log-transforming the values for readability. As can be seen, playcount is highly right-skewed.

tibble(log_playcount = log(V(artist_graph)$playcount)) %>%
  mutate(log_playcount = ifelse(is.infinite(log_playcount), 0, log_playcount)) %>%
ggplot(aes(x = log_playcount)) +
  geom_histogram(color = "red", fill="red", alpha = 0.9) +
  labs(x = "log(playcount)") +
  theme_minimal(base_size = 17)

Analysis

The library used most commonly for graph analysis in R is igraph (Csardi and Nepusz 2006). It provides a vast number of functions for graph manipulation, statistics and plotting. For interactive plotting, I have used the visNetwork library (Almende B.V., Thieurmel, and Robert 2019). In the later examples, I have saved the output from visNetwork as images, as interactive visualization is very resource-heavy.

Below, I have computed coordinates in 2-dimensional space using the Fruchterman-Reingold, Kamada-Kawai, LGL, Random, MDS and t-SNE methods. The t-SNE and MDS methods require a distance matrix. A natural distance metric for a network is the length of the shortest path between two given vertices. The function igraph::distances produces an Inf value when there is no such path between particular vertices. As MDS cannot operate on infinite values, I convert these to a large number. The last step is to rescale all coordinates to the \([-1, 1]\) range.

# Fruchterman-Reingold
set.seed(10)
coords_fr <- layout_with_fr(artist_graph, niter = 5000)

#  Kamada-Kawai
set.seed(10)
coords_kk <- layout_with_kk(artist_graph)

# LGL
set.seed(10)
coords_lgl <- layout_with_lgl(artist_graph, maxiter = 500)

# Random
set.seed(10)
coords_rand <- layout_randomly(artist_graph)

# Create distance matrix
dist_graph <- distances(artist_graph)
dist_graph[is.infinite(dist_graph)] <- 1000000 # convert from infinity
dist_graph <- as.dist(dist_graph)

# MDS
set.seed(10)
coords_mds <- cmdscale(dist_graph)

# t-SNE
set.seed(10)
coords_tsne <- Rtsne::Rtsne(dist_graph, is_distance = TRUE, pca = FALSE)

save(coords_fr,
     coords_lgl,
     coords_rand,
     dist_graph,
     coords_mds,
     coords_tsne,
     file = "data/coords_computed.Rdata")
save(coords_kk, file = "data/coords_kk.Rdata")

One thing worth mentioning is that the MDS method is very slow. Its cubic time complexity makes it unreasonable to run on even moderately sized datasets. The above computation took ~1 hour on my computer for ~6,000 observations; cubic running time means that doubling the observation count would make the algorithm take about 8 times longer. One should also be aware that for really big graphs with millions of vertices, it makes sense to operate on vertex and edge densities rather than exact positions, as the human eye cannot perceive such detail anyway.

# Load the coordinates from file
load("data/coords_computed.Rdata")
load("data/coords_kk.Rdata")

# Extract result matrix from tsne
coords_tsne <- coords_tsne$Y

coords_fr <- norm_coords(coords_fr, xmin = -1, xmax = 1, ymin = -1, ymax = 1)
coords_lgl <- norm_coords(coords_lgl, xmin = -1, xmax = 1, ymin = -1, ymax = 1)
coords_rand <- norm_coords(coords_rand, xmin = -1, xmax = 1, ymin = -1, ymax = 1)
coords_mds <- norm_coords(coords_mds, xmin = -1, xmax = 1, ymin = -1, ymax = 1)
coords_tsne <- norm_coords(coords_tsne, xmin = -1, xmax = 1, ymin = -1, ymax = 1)
coords_kk <- norm_coords(coords_kk, xmin = -1, xmax = 1, ymin = -1, ymax = 1)

One consideration when using dimensionality reduction of any kind is which technique is the best one. There is no algorithm-agnostic method of comparing the results, and thus the assessment can be subjective. I have decided to check which layout presents meaningful information about popular artists, highlighting the top 100.

(Figures: the network drawn with each layout, with the top 100 artists highlighted - t-SNE, Multi-Dimensional Scaling, Fruchterman-Reingold, Kamada-Kawai, Large Graph Layout, Random.)

After manually inspecting the graphs, it is visible that in the t-SNE, MDS and Fruchterman-Reingold layouts, the top 100 artists lie close to each other. This is an indication that predicting popularity using the graph layout as features may give promising results. The remaining layouts, however, do not show much. With the network represented in a lower-dimensional space, the subsequent analyses are fairly easy. As t-SNE seems to have yielded the most promising results, I have decided to run the further analysis on the coordinates obtained from this method.

Firstly, I have run the k-means algorithm. By the criterion of maximum silhouette, the optimal number of clusters is 3.

library(factoextra)  # provides fviz_nbclust

coords_df <- as_tibble(coords_tsne)
fviz_nbclust(coords_df, kmeans, method = "silhouette", linecolor = "red") +
  theme_minimal()

I have compared playcounts for each of the clusters. On the density plots below, the playcount was log-transformed for clarity. The densities look similar at first sight. However, in cluster 2 the right tail is heavier than in the other ones, meaning that the most popular artists are over-represented in this cluster. The same property is visible in the quantile analysis: the 75% playcount quantile for cluster 2 is 3.4 times bigger than in "the least popular" cluster 3.

Density plot
set.seed(10)
km <- kmeans(coords_df, centers = 3)

coords_df %>%
  mutate(cluster = as.character(km$cluster),
         playcount = V(artist_graph)$playcount,
         artist = V(artist_graph)$name) -> df

potential_colors <- adjustcolor(RColorBrewer::brewer.pal(8, "Reds"), alpha.f = 0.1)
ggplot(df, aes(x = log(playcount), fill = cluster)) +
  scale_fill_manual(values = potential_colors[c(2,5,7)]) +
  geom_density(alpha = 0.7) +
  facet_grid(rows = vars(cluster)) +
  theme_minimal()

Quantiles table
df %>% 
  group_by(cluster) %>%
  summarise(`25%` = quantile(playcount, 0.25),
            `50%` = quantile(playcount, 0.5),
            `75%` = quantile(playcount, 0.75),
            `95%` = quantile(playcount, 0.95)) %>%
  knitr::kable() %>%
  kableExtra::kable_styling()
cluster 25% 50% 75% 95%
1 144.00 1208.0 9148.00 134335.0
2 286.25 1762.5 22460.25 966653.9
3 167.50 907.0 6571.50 98124.7

Network Statistics

Finally, I’m going to calculate standard network statistics using the biggest subgraph created earlier.

Degree
# degree is the number of links each node has
de <- degree(biggest_subgraph)

ggplot(df, aes(de)) + 
  geom_histogram(aes(y=..density..),
                 color="red",
                 fill="red",
                 position="identity",  
                 alpha=1)  + 
  geom_density(alpha=0.2) +
  labs(x = "Histogram of Degree")

max(de)
## [1] 75
min(de)
## [1] 1
length(de[de==75])
## [1] 1
length(de[de==1])
## [1] 762
length(de[de>20])/length(de)
## [1] 0.02508311

It can be seen that the histogram of degree is highly right-skewed. The maximum number of links a node has is 75; however, there is just one node with that property. 762 artists are linked to only one other node. The long right tail of the histogram appears to begin at a degree of about 20; artists with more than 20 links make up only about 2.5% of all artists.

Density, Diameter and Global Average Path Length
# the density of a graph is the ratio of the number of edges to the number of possible edges
dens <- edge_density(biggest_subgraph)
dens
## [1] 0.0008146823
# the diameter is the shortest distance between the two most distant nodes in the network
dia <- diameter(biggest_subgraph)
dia
## [1] 20
# global average path length
gapl <- mean_distance(biggest_subgraph)
gapl
## [1] 6.527887

The edge density of this network is equal to about 0.08%, which means that the network is not dense at all, but highly diffused. The diameter of 20 steps also points to long distances between nodes in the network. Moving from one node to another takes 6.5 steps on average, which is quite a lot but reasonable for a large network.

Closeness
# closeness of a node is a measure of centrality in a network, calculated as the reciprocal of the sum of the length of the shortest paths between the node and all other nodes in the graph - the more central a node is, the closer it is to all other nodes
cl <- closeness(biggest_subgraph, mode = 'All')

options(scipen=999)
max(cl)
## [1] 0.00003259559
min(cl)
## [1] 0.00001201288
ggplot(df, aes(cl)) + 
  geom_histogram(aes(y=..density..),
                 color="red",
                 fill="red",
                 position="identity",  
                 alpha=0.9)  + 
  geom_density(alpha=0.2) +
  labs(x = "Histogram of Closeness")

It can be seen in the above histogram that the closeness of all nodes is almost 0, which means that no node is central in this network and, in general, nodes are not close to each other. However, the histogram is slightly left-skewed. A possible explanation might be that a group of more popular artists has a slightly higher closeness because they are located closer to each other.

Transitivity
# local transitivity (clustering coefficient): the degree to which the neighbours of each node are themselves connected
tr <- transitivity(biggest_subgraph, type = "local")

ggplot(df, aes(tr)) + 
  geom_histogram(aes(y=..count..),
                 color="red",
                 fill="red",
                 position="identity",  
                 alpha=0.9)  + 
  labs(x = "Histogram of Transitivity")

length(tr[tr==0])/length(tr)
## [1] 0.3520701
length(tr[tr<0.5])/length(tr)
## [1] 0.8218495
length(tr[tr==1])/length(tr)
## [1] 0.1755817
length(tr[tr==1])
## [1] 1162

The above histogram of transitivity gives us information about the degree to which nodes in the graph tend to cluster together. The clustering coefficient of 82% of the nodes is less than 0.5, and of 35% of the nodes it is equal to zero. This means that clusters are rarely formed in this network. 17% of the artists (1162 of them) have a coefficient equal to 1, which means that all possible connections among their neighbours are actually realised.

Assortativity
# assortativity is a preference for a network's nodes to attach to others that are similar in some way, often examined in terms of node's degree
assortativity(biggest_subgraph,V(biggest_subgraph)$playcount)
## [1] 0.3744589

The assortativity coefficient \(r\) is the Pearson correlation coefficient of a node attribute (classically the degree) between pairs of linked nodes. Positive values of \(r\) indicate a correlation between nodes with similar values, while negative values indicate relationships between nodes with different values. In general, \(r\in[-1,1]\). When \(r = 1\), the network is said to have perfect assortative mixing patterns; when \(r = 0\) the network is non-assortative; and at \(r = -1\) the network is completely disassortative. In terms of playcount, \(r=0.37\). This means the network is mildly assortative: artists usually attach to those with a similar playcount.
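Since assortativity is classically examined in terms of degree, a cross-check may be worthwhile; igraph has a dedicated helper for that variant (shown below as a supplementary sketch, not part of the original analysis):

# Degree assortativity: do high-degree artists link to other high-degree artists?
assortativity_degree(biggest_subgraph)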

Betweenness
# the betweenness centrality of each vertex is the number of shortest paths that pass through it
be <- betweenness(biggest_subgraph, directed = FALSE, weights = NA)
V(biggest_subgraph)$betweenness <- be

graph_data <- tibble(name = V(biggest_subgraph)$name, 
       be = V(biggest_subgraph)$betweenness, 
       playcount = V(biggest_subgraph)$playcount)

graph_data %>% 
  select_if(is.numeric) %>%
  mutate_all(dense_rank)%>%
  GGally::ggpairs()

graph_data %>%
  select_if(is.numeric) %>%
  cor(method='s')
##                  be playcount
## be        1.0000000 0.2935827
## playcount 0.2935827 1.0000000

Above, I have plotted the playcount and betweenness centrality of the nodes. It can be seen that the shapes of the distributions are similar for both variables. Moreover, the correlation coefficient between them is 0.33 according to Pearson and 0.29 according to Spearman. There exists a decent correlation between the popularity of artists (measured by playcount) and their centrality in the network (measured by betweenness).

Conclusion

In this empirical paper, I have presented methods of analysing a large network, including dimensionality reduction, visualisation and qualitative interpretation of network statistics. I investigated the hypothesis that the most popular artists are clustered together in this network. Visual inspection of the graphs after dimensionality reduction and examination of the network demographics support that claim. In the layouts created by most algorithms, the top 100 artists were located close to each other. All of the popular artists were placed in one k-means cluster (indicated by the fat right tail of cluster 2). The degree is highly right-skewed, the density is almost 0, and the diameter and global average path length are relatively large. The closeness of all nodes is almost equal to 0, and the transitivity coefficient is 1 only for a small subset (17%). An assortativity of 0.37 indicates that artists usually stick to those with a similar playcount. Finally, betweenness is correlated with playcount strongly enough to argue that the most listened artists are central and clustered together in this network.

Discussion

One property poses a limitation to this study. The network of similarities was constructed using information about the 5 most similar artists provided by the last.fm recommendation engine. Most recommendation engines are pre-trained machine learning algorithms based on three main aspects: collaborative filtering, natural language processing (NLP) and audio assessment models. The first utilizes information about users' choices: using the streaming history of each agent, the model seeks similarity, and when high similarity between a group of users' behaviour is found, the engine is likely to recommend to an individual the other users' choices that they have not discovered yet. The second filter is based on text analysis; it may include analysis of lyrics, or even tracking online websites for co-occurrences of artists in one article or for similarity of the adjectives used to describe them in different sources. The last is based on audio models that look for musical resemblance in songs and recommend tracks that sound familiar to the user. Last.fm uses mostly collaborative filtering. It also has the additional advantage of access to social network data, linking users who might not have similar tastes (although this is quite possible) but are socially connected. The combination of these two types of data results in a very powerful recommendation engine. However, if another recommendation engine were used, the results might have been different. In a larger study, it would be valuable to compare various musical networks: even though last.fm "scrobbles" data from Spotify, iTunes and so on, these services have their own recommendation engines. Also, due to the last.fm API limit, I got only a random subset of artists (albeit more than 50%). I do not know which artists were left out, because the dataset is too vast to analyse manually; taking all of them into consideration might also alter the results.

Furthermore, it would be valuable to find the real relationship between the distribution of playcount and the network properties. In this context, fitting this distribution to well-known distributions obeying power laws might deliver interesting information about whether stars emerge following a power-law pattern.

Adai, Alex T, Shailesh V Date, Shannon Wieland, and Edward M Marcotte. 2004. “LGL: Creating a Map of Protein Function with an Algorithm for Visualizing Very Large Biological Networks.” Journal of Molecular Biology 340 (1). Elsevier: 179–90.

Adler, Moshe. 1985. “Stardom and Talent.” The American Economic Review 75 (1): 208–12.

Almende B.V., Benoit Thieurmel, and Titouan Robert. 2019. VisNetwork: Network Visualization Using ’Vis.js’ Library. https://CRAN.R-project.org/package=visNetwork.

Csardi, Gabor, and Tamas Nepusz. 2006. “The Igraph Software Package for Complex Network Research.” InterJournal Complex Systems: 1695. http://igraph.org.

Doreian, P., and A. Romney. 1989. “Models of Network Effects on Social Actors.” Research Methods in Social Network Analysis.

Frey, B. S. 1998. “Superstar Museums. An Economic Analysis.” Journal of Cultural Economics 22: 113–25.

Fruchterman, Thomas MJ, and Edward M Reingold. 1991. “Graph Drawing by Force-Directed Placement.” Software: Practice and Experience 21 (11). Wiley Online Library: 1129–64.

Kamada, Tomihisa, and Satoru Kawai. 1989. “An Algorithm for Drawing General Undirected Graphs.” Information Processing Letters 31 (1): 7–15.

Kaufmann, Michael, and Dorothea Wagner. 2003. Drawing Graphs: Methods and Models. Vol. 2025. Springer.

Kim, K., and J. Altmann. 2017. “Effect of Homophily on Network Formation.” Communications in Nonlinear Science and Numerical Simulation 44.

Kobourov, Stephen G. 2004. “Force-Directed Drawing Algorithms.” Citeseer.

Leibenstein, H. 1950. “Bandwagon, Snob, and Veblen Effects in the Theory of Consumers’ Demand.” The Quarterly Journal of Economics 64 (2).

Lind, P. G., L. R. da Silva, and H. J. Herrmann. 2007. “Spreading Gossip in Social Networks.” Physical Review E 76 (3).

Maaten, Laurens van der, and Geoffrey Hinton. 2008. “Visualizing Data Using t-SNE.” Journal of Machine Learning Research 9 (Nov): 2579–2605.

MacDonald, Glenn. 1988. “The Economics of Rising Stars.” American Economic Review 78 (1): 155–66.

Rosen, Sherwin. 1981. “The Economics of Superstars.” The American Economic Review 71 (5): 845–58.

Simon, Herbert A. 1955. “On a Class of Skew Distribution Functions.” Biometrika 42 (3).

Stigler, George J., and Gary S. Becker. 1977. “De Gustibus Non Est Disputandum.” The American Economic Review 67 (2): 76–90.

Yule, G. Udny. 1924. “A Mathematical Theory of Evolution, Based on the Conclusions of Dr. J. C. Willis.” Philosophical Transactions of the Royal Society of London B (213).

Zipf, George K. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley Press.