MDS, PCA and Hierarchical Clustering of Movies


1 Introduction

The main idea behind this paper was the author's interest in movies and Unsupervised Learning. All data gathering and manipulation were performed in RStudio. The main data source was Filmweb, a Polish IMDb-like website. If you are also interested in clustering methods (PAM, CLARA, K-means), you can read my previous publication here

1.1 Web Scraping as a tool for Clustering

Wikipedia defines Web Scraping as follows: "Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser." (short intro taken from https://en.wikipedia.org/wiki/Web_scraping)

In other words, Web Scraping is a widespread method for gathering resources, data, images and more from websites. It can be really useful, especially when we cannot otherwise gain access to interesting data.

1.1.1 Assumptions

In this paper, movies were the area of interest. The research was driven by curiosity about whether it is possible to get interesting insights into how movies can be grouped when various scaling methods are applied to them. Thus, the main purpose of Web Scraping here was to build our own database.

We did the Web Scraping on http://filmweb.pl. Web Scraping can be performed in RStudio using the rvest library.

1.1.2 Code

#libraries needed for the scraper
library(rvest)     #read_html, html_nodes, html_text
library(stringr)   #str_c
library(stringi)   #stri_replace_all
library(tibble)    #tibble
library(readr)     #write_csv

f3 = function(n){

tryCatch({
  
  url= 'https://www.filmweb.pl/films/search?endRate=10&orderBy=popularity&descending=true&startCount=100&startRate=1'
  list_of_pages <- str_c(url, '&page=', 2:n)
  data_base = tibble()
  
  for(i in seq_along(list_of_pages)){   #seq_along avoids indexing past the n-1 generated pages
  
    baza_filmow = read_html(list_of_pages[i])
    
    get_title_html = html_nodes(baza_filmow,".filmPreview__title")
    get_rating_html = html_nodes(baza_filmow,".rateBox__rate")
    get_genre_html = html_nodes(baza_filmow,".filmPreview__info--genres")
    get_country_html = html_nodes(baza_filmow,".filmPreview__info--countries")
    get_seen_html = html_nodes(baza_filmow,".rateBox__votes")
    get_iwanttosee_html = html_nodes(baza_filmow,".wantToSee__count")
    
    #deleting unnecessary strings
    remove = html_text(get_seen_html)
    remove_first = stri_replace_all(remove,"",fixed="głosy")
    seen = stri_replace_all(remove_first,"",fixed="głosów")
    seen = gsub(" ", "", seen, fixed = TRUE)
    remove_second = html_text(get_genre_html)
    genre = stri_replace_all(remove_second,"",fixed="gatunek:")
    remove_third = html_text(get_country_html)
    country = stri_replace_all(remove_third,"",fixed="kraj:")
    remove_fourth = html_text(get_rating_html)
    rating = stri_replace_all(remove_fourth,".",fixed=",")
    
    #adding spaces to split concatenated genres/countries
    genre= gsub("([a-z])([A-Z])", "\\1 \\2", genre)
    genre= gsub("(ł)([A-Z])", "\\1 \\2", genre)   #handle Polish 'ł' before a capital letter
    country = gsub("([a-z])([A-Z])", "\\1 \\2", country)
    country = gsub("([A])([A-Z])","\\1 \\2",country)
    
    #some cleaning (leaving only the first factor)
    genre = gsub('([A-Za-z]+) .*', '\\1', genre)
    genre = gsub('([ł]+) .*', '\\1', genre)
    country = gsub('([A-Za-z]+) .*', '\\1',country)
    
    
    #alliasing
    title = html_text(get_title_html)
    i_want_to_see = html_text(get_iwanttosee_html)
    i_want_to_see = gsub(" ", "", i_want_to_see, fixed = TRUE)
    
    
    #building datatable
    data_table = tibble(movie_title = title,
                         movie_score = rating,
                         movie_genre = genre,
                         release_country = country,
                         people_seen = seen,
                         people_want_to_see = i_want_to_see)
    
    data_base = rbind(data_base,data_table)   #check.rows is not an rbind argument; dropping it avoids a spurious row
  
   
    }
  }, error=function(e){})   #silently stop on a failed request, keeping whatever was scraped so far
View(data_base)
data_base = data_base[data_base$movie_score !=FALSE,]   #drop rows without a proper rating
write_csv(data_base,'RawFilmweb_Data_Base3.csv')
}

#applying the function for 1000 pages
f3(1000)
Generated database (head of 75 records; if you want to see them all, check here).

1.2 Dimension Scaling Methods and Hierarchical Clustering

Dimension scaling, and dimensionality reduction in general, is very useful when analysing large amounts of data. Sometimes we want to get insights with respect to a few principal variables which can explain the variability of our data sample efficiently. In this paper we will focus on three methods. First we will discuss MDS, which stands for Multidimensional Scaling. Then we will proceed to PCA, which stands for Principal Component Analysis, and in the end we will finish with Hierarchical Clustering.

1.2.1 MDS

Multidimensional Scaling (MDS) is a statistical method for getting insights about hidden variables which describe the relations and similarities among the analysed objects. When performing MDS, first of all we have to prepare a matrix containing the Euclidean distances between values; we can use a correlation matrix for this purpose. After that we obtain k coordinates per data point, each corresponding to one dimension in a Cartesian coordinate system. If k <= 3 we can visualise the data (up to 3D). If you want to learn more about MDS, check this link.

Below is a simple Multidimensional Scaling graph from the field of marketing:

Marketing MDS

1.2.2 PCA

Principal Component Analysis (PCA) is a statistical method of factor analysis. We can describe our data set as a cloud of N points in K dimensions, where N is the number of observations and K the number of variables. The goal of PCA is to fit the coordinate system in such a way that the variance of the first coordinate is maximized, then the variance of the second, and so on. PCA is widely used to reduce the dimensionality of a data set by dropping the least informative dimensions. (translated from here: https://pl.wikipedia.org/wiki/Analiza_głównych_składowych)

When performing PCA, we assign to each Principal Component the variables that describe that component best; our goal is to keep at most 3 Principal Components, because our visual perception can only handle plots up to 3D.

If you want to know more about PCA, you can read more here, or if you are interested in how to perform PCA from scratch, I suggest watching this video:

1.2.3 Hierarchical Clustering

Hierarchical Clustering is one of the automatic clustering methods. When we perform Hierarchical Clustering, we assume at the beginning that every object is a single cluster, and at the end that all of them form one cluster. We can distinguish two approaches to hierarchical clustering:

  • Agglomerative methods: we create a matrix of similarities, then in subsequent iterations we merge the objects that are most similar to each other into clusters

  • Divisive methods: we start with one huge cluster containing all our data, then in subsequent iterations we divide it into smaller clusters whose objects are most similar to each other

Also, most of the time when we perform Hierarchical Clustering we want a dendrogram as output, which shows the pathway of the divisive or agglomerative clustering. More about hierarchical clustering here.

A projection of Agglomerative Clustering can be seen below (Divisive works exactly the opposite way):

Agglomerative Clustering

The figure below shows how a dendrogram corresponding to Agglomerative Clustering works:

Dendrogram

2 Review of the Data Set

In this paragraph we will review our scraped data set. We will check basic statistics, take another glance at it and prepare it for further analysis.

2.1 Data Summary

Now it's time to check the basic features of our data set before preparing it for the further journey.

 movie_title         movie_score    movie_genre        release_country   
 Length:5790        Min.   :1.400   Length:5790        Length:5790       
 Class :character   1st Qu.:6.300   Class :character   Class :character  
 Mode  :character   Median :6.900   Mode  :character   Mode  :character  
                    Mean   :6.794                                        
                    3rd Qu.:7.400                                        
                    Max.   :8.700                                        
  people_seen     people_want_to_see
 Min.   :    97   Min.   :   429    
 1st Qu.: 10238   1st Qu.:  4254    
 Median : 21592   Median :  7130    
 Mean   : 46914   Mean   : 11543    
 3rd Qu.: 51771   3rd Qu.: 13841    
 Max.   :727515   Max.   :131610    
As we can see, movie_genre, movie_title and release_country are of no use for clustering in their current form. That's why we have to do some cleaning and transform the data from characters to numeric values. For this purpose we prepared one more Web Scraping beforehand, to replace each release_country with its GDP per capita. We drop the movie title for now, because we won't use it in clustering, and the movie genres as characters will be replaced with their percentage share among all genres.

Then the summary looks as follows:

  movie_score     people_seen     people_want_to_see    PKB_2017     
 Min.   :1.400   Min.   :   117   Min.   :   428     Min.   :  1834  
 1st Qu.:6.300   1st Qu.: 10206   1st Qu.:  4214     1st Qu.: 42514  
 Median :6.900   Median : 21500   Median :  7071     Median : 56935  
 Mean   :6.791   Mean   : 46834   Mean   : 11500     Mean   : 47767  
 3rd Qu.:7.400   3rd Qu.: 51662   3rd Qu.: 13782     3rd Qu.: 56935  
 Max.   :8.700   Max.   :725921   Max.   :131319     Max.   :107865  
   rate.Freq        
 Min.   :0.0001733  
 1st Qu.:0.0408870  
 Median :0.0684338  
 Mean   :0.1606753  
 3rd Qu.:0.3395703  
 Max.   :0.3395703  

The code for this cleaning can be found here:

filmweb_database = read_csv('RawFilmweb_Data_Base2.csv')

#changing to numeric
filmweb_database$movie_score = as.numeric(filmweb_database$movie_score)
filmweb_database$people_seen = as.numeric(filmweb_database$people_seen)
filmweb_database$people_want_to_see = as.numeric(filmweb_database$people_want_to_see)

#checking the unique values in genres and countries

unique_vector_of_genres = unique(filmweb_database$movie_genre)
unique_vector_of_countries = unique(filmweb_database$release_country)

#assigning numerical value to string values from genres and countries

genres_copy = c(unique_vector_of_genres)
no_uniq_gen = as.double(length(genres_copy))
num_value_of_genres = seq_len(no_uniq_gen)

names(num_value_of_genres) = genres_copy

countries_copy = c(unique_vector_of_countries)
no_uniq_countr = as.double(length(countries_copy))
num_value_of_countries = seq_len(no_uniq_countr)

names(num_value_of_countries) = countries_copy

#getting rid of titles and assigning a numerical id
movie_id = c(filmweb_database$movie_title)
no_movie_id = as.double(length(movie_id))
num_value_of_movie_titles = seq_len(no_movie_id)

#creating dictionaries that back up the names and their assigned numbers
dict_of_genres = tibble(unique_vector_of_genres,num_value_of_genres)
dict_of_countries = tibble(unique_vector_of_countries,num_value_of_countries)
dict_of_movies_id = tibble(filmweb_database$movie_title,num_value_of_movie_titles)   #pair each title with its numeric id, not with itself

filmweb_database_conclusions = read_csv("Filmweb_Data_Base.csv") #we will need movie titles for final conclusions but not for clustering purposes

Clusterable_Data_Base = read_csv("Filmweb_Data_Base.csv")
Clusterable_Data_Base$movie_title = NULL

#####################################################################################
#Changing genre to binary (DO NOT RECOMMEND! But it gives an output using clustering)
#####################################################################################
# database_genre_copy = data.frame(filmweb_database2$movie_genre)                   #
# binary_df = data.frame(row.names=rownames(database_genre_copy))                   #
#  for (i in colnames(database_genre_copy)) {                                       #
#    for (x in names(num_value_of_genres)) {                                        #
#                                                                                   #
#      binary_df[paste0(i, "_", x)] = as.numeric(database_genre_copy[i] == x)       #
#                                                                                   #
#    }                                                                              #
#  }                                                                                #
# filmweb_database2 = cbind(filmweb_database2,binary_df)                            #
#####################################################################################

#Binding everything together; PKB.df holds the GDP per capita per country from the separate scrape mentioned above
Clusterable_Data_Base = merge(Clusterable_Data_Base,PKB.df,  by.x ='release_country',by.y = "Country", all.x = T)
Clusterable_Data_Base = na.omit(Clusterable_Data_Base)

#Checking the occurrence of each genre to encode it as a percentage share, because a binary encoding is too noisy for clustering purposes
table_of_genres = table(Clusterable_Data_Base$movie_genre)
table_of_genres.df = data.frame(genre = names(table_of_genres), occurence = as.vector(table_of_genres), rate = prop.table(table(Clusterable_Data_Base$movie_genre)))
table_of_genres.df$rate.Var1 = NULL

Clusterable_Data_Base = merge(Clusterable_Data_Base,table_of_genres.df,  by.x ='movie_genre',by.y = "genre", all.x = T)
Clusterable_Data_Base$release_country = NULL
Clusterable_Data_Base$occurence = NULL
Clusterable_Data_Base$movie_genre = NULL

#Let's take a look and save it
View(Clusterable_Data_Base)
write_csv(Clusterable_Data_Base,"FilmwebClustering_DataBase.csv")

It's also important to explain our variables:

  • movie_score is the movie's rating

  • seen is how many people have seen the movie

  • will_see is how many people want to see the movie

  • country refers to the country where the movie was released (encoded as GDP per capita, PKB_2017)

  • genre is the genre encoded as its percentage share among all genres

3 Multidimensional Scaling

In this paragraph we will take a closer look at MDS; we will perform it on our data and try to gather some insights.

First of all we will need some libraries, and then we have to read our data set.
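The corresponding chunks are hidden in the rendered document; a minimal sketch of what they could look like (the exact library set is an assumption) is:

#a sketch of the hidden setup chunk
library(readr)   #read_csv

#read the clustering-ready data set prepared in section 2
Clusterable_Data_Base = read_csv("FilmwebClustering_DataBase.csv")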

3.1 Preparation

Once we have read our data, the next step is to prepare it for MDS. We need to create a matrix of distances between units with respect to the variables. Afterwards we will prepare the data to perform some tests on it.
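The code for this step is likewise hidden; a sketch with base R (the standardization and the Euclidean metric are assumptions) could be:

#standardize the variables so distances are comparable across columns
scaled_data = scale(Clusterable_Data_Base)

#Euclidean distance matrix between units (movies)
dist_matrix = dist(scaled_data, method = "euclidean")

#classical (metric) MDS reduced to k = 2 dimensions, as summarised in section 3.2
mds_points = cmdscale(dist_matrix, k = 2)
head(mds_points)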

Here we will prepare the data for the upcoming Mantel and Kruskal tests.

Now we will make a correlation plot for the nominal (raw) values:

            movie_score   seen will_see country  genre
movie_score       1.000  0.231    0.387   0.016  0.135
seen              0.231  1.000    0.574   0.059 -0.110
will_see          0.387  0.574    1.000   0.031  0.130
country           0.016  0.059    0.031   1.000 -0.067
genre             0.135 -0.110    0.130  -0.067  1.000

From this plot and the data, we can see that there is a significant positive correlation between the following variables:

  • seen and will_see
  • will_see and movie_score

Let's make a trial visualisation.

For the purposes of a clearer view and better scale, we will standardize the data.

Now our preparations are done; let's proceed to the next point, where we will compute some basic summaries and statistics to check the traits of our data set.

3.2 Summary

Here we have a short summary of this set:

       V1                V2         
 Min.   :-9.7951   Min.   :-3.8665  
 1st Qu.:-0.3698   1st Qu.:-0.8475  
 Median : 0.2923   Median :-0.2566  
 Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.7714   3rd Qu.: 0.8904  
 Max.   : 4.2805   Max.   : 3.0088  
As we can see, we now have two variables instead of five. This was done on purpose by setting k=2, in order to reduce the dimensionality of the data.

Let's now perform Mantel's test to check whether the dissimilarity matrix and the similarity matrix converted to a dissimilarity one are the same.
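The output format below matches mantel.test() from the ape package, so the hidden call was presumably along these lines (a sketch; the matrix names and the similarity-to-dissimilarity conversion are assumptions):

library(ape)

#dissimilarity matrix, and a similarity matrix converted to dissimilarities,
#both assumed to be built from the same standardized variables
diss_matrix = as.matrix(dist(scaled_data))
sim_as_diss = 1 - cor(t(scaled_data))   #one common conversion; an assumption here

mantel.test(diss_matrix, sim_as_diss, nperm = 999, alternative = "two.sided")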

Here we see the results of the Mantel test:

$z.stat
[1] 0.7964331

$p
[1] 0.007

$alternative
[1] "two.sided"

If we set our p-value threshold at 0.05, then according to the results of the Mantel test we reject the null hypothesis and conclude that the matrices are similar.

The next test we will perform is Kruskal's stress test, to check the MDS quality. The formula looks as follows: \[ \sigma(X) = \frac{\sqrt{\sum_{i<j} w_{ij}(\delta_{ij}-d_{ij}(X))^{2}}}{n(n-1)/2}\] where \(\delta_{ij}\) are the original dissimilarities and \(d_{ij}(X)\) are the distances in the reduced configuration.

[1] 0.6090237
[1] 0.001227381
[1] 0.002015326

The first value is the mean of the random stress function for samples of our size (n = 5772); the second is the mean stress of our data; the last one is the ratio of our data's stress to the random stress. We can see that the ratio is pretty small (about 0.002), hence, reading from the scale below, the quality of our MDS is rather excellent. Hurray!

0.20 = poor, 0.10 = fair, 0.05 = good, 0.025 = excellent, 0.00 = perfect
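The simulation above is hidden in the rendered document. For reference, one standard way to obtain Kruskal's stress for a 2D configuration in R is MASS::isoMDS, sketched below (not necessarily the code used here; it assumes no duplicate rows, i.e. no zero distances):

library(MASS)

#non-metric MDS; $stress is Kruskal's stress, reported as a percentage
nmds_fit = isoMDS(dist_matrix, k = 2)
nmds_fit$stress / 100   #rescaled to the 0-1 scale of the quality table above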

3.3 Visualisation

Let's visualise it again, first with respect to the variables.

Now let's compare the results with K-means clusters.
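A sketch of such a comparison, assuming the 2D MDS coordinates from before and an arbitrary choice of 3 centers:

set.seed(123)

#cluster the MDS coordinates; k = 3 is an assumption for illustration
km = kmeans(mds_points, centers = 3, nstart = 25)

#plot the MDS map colored by K-means cluster
plot(mds_points, col = km$cluster,
     xlab = "Dimension 1", ylab = "Dimension 2",
     main = "MDS configuration with K-means clusters")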

It's visible that some variables are clustered together, while two others lie far away in separate clusters. It's hard to interpret in this form, but one can say that for potential viewers the movie score and the numbers of people who have seen or want to see the movie fall into one prior basket, while the type of movie and the country it comes from form two different baskets.

This suggests that it is a matter of viewer preference whether they choose a movie by the three principals (movie score, seen and will_see) or rather by genre or the country where the movie was released.

Now we will do the same, but with respect to units.

And we also compare it to the clusters.

Summarizing, Multidimensional Scaling as a dimensionality reduction method can be very useful in various fields. We can plot the results in two different ways, with respect to units and with respect to variables, which is invaluable for example in marketing, where we sometimes have to compare our product or company to competitors in at most two or three dimensions. We can also easily cluster the results to get even more insights about the traits of our data set. However, MDS gives us only a limited view of the data; as an analysis tool we can use it for comparisons, or for studies such as shopping-basket analysis.

4 Principal Component Analysis

In this paragraph we will present the application of PCA to our data set.

4.1 Preparations

First of all we have to prepare our data in such a way that we can perform the PCA analysis and get some summaries. Thus, let's start by checking the correlation between variables with normalized values:

            movie_score   seen will_see country  genre
movie_score       1.000  0.231    0.387   0.016  0.135
seen              0.231  1.000    0.574   0.059 -0.110
will_see          0.387  0.574    1.000   0.031  0.130
country           0.016  0.059    0.031   1.000 -0.067
genre             0.135 -0.110    0.130  -0.067  1.000

We can notice that after normalizing the data, the graph and summary look pretty similar to the previous ones. We still have correlation between:

  • people_want_to_see and people_seen (now 0.574, then 0.574)
  • people_want_to_see and movie_score (now 0.387, then 0.387)

Basically, the results are just the same.

Now we will perform a test to check how many Principal Components to retain.
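The header of the output below matches the hornpa package, so the hidden call was presumably similar to this sketch:

library(hornpa)

#Horn's parallel analysis: 100 random correlation matrices
#for 5 variables and a sample size of 5772, with seed 123
hornpa(k = 5, size = 5772, reps = 100, seed = 123)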


 Parallel Analysis Results  
 
Method: pca 
Number of variables: 5 
Sample size: 5772 
Number of correlation matrices: 100 
Seed: 123 
Percentile: 0.95 
 
Compare your observed eigenvalues from your original dataset to the 95 percentile in the table below generated using random data. If your eigenvalue is greater than the percentile indicated (not the mean), you have support to retain that factor/component. 
 
 Component  Mean  0.95
         1 1.036 1.051
         2 1.016 1.027
         3 1.000 1.008
         4 0.983 0.993
         5 0.965 0.978

4.2 Summary

The first point of the summary is to compare our observed eigenvalues to the ones generated by hornpa, to check whether we have support to retain each factor/component.

1.824094 1.134637 0.9539321 0.7200782 0.367258

We can see that our first eigenvalue is 1.82 > 1.051, the second is 1.13 > 1.027, and the third 0.95 < 1.008. The main idea of the test is to check whether the eigenvalues from the random sets (95th percentile) are smaller than our empirical eigenvalues; wherever this holds, we have support to retain that component. Comparing our results to the randomly generated ones in the parallel analysis, the number of Principal Components to retain here is 2.

Having this in mind, let's check anyway how the PCA for this data set looks.
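A sketch of the hidden chunk using base R's prcomp (centering and scaling are assumptions, consistent with the normalization above):

#PCA on the standardized variables; summary() produces the variance table below
pca_fit = prcomp(Clusterable_Data_Base, center = TRUE, scale. = TRUE)
summary(pca_fit)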

Importance of components:
                          PC1    PC2    PC3    PC4     PC5
Standard deviation     1.3506 1.0652 0.9767 0.8486 0.60602
Proportion of Variance 0.3648 0.2269 0.1908 0.1440 0.07345
Cumulative Proportion  0.3648 0.5917 0.7825 0.9265 1.00000

From the table above we can conclude that 4 Principal Components are enough to explain the variance of the whole data set. However, we will also visualise the plot of the number of PCs to check the cohesion of the results.

It also confirms that, when we sum up the percentages of explained variance, 4 PCs are enough to describe this data.

In the final stage, let's check the loadings of our PCA to see what is inside (in this analysis we will omit the 5th Principal Component because, as was said before, 4 are enough).

Standard deviations (1, .., p=5):
[1] 1.3505903 1.0651936 0.9766945 0.8485742 0.6060182

Rotation (n x k) = (5 x 4):
                    PC1         PC2         PC3         PC4
movie_score -0.48289385  0.25217943 -0.15111335  0.80885185
seen        -0.57761303 -0.31391114  0.23605121 -0.32982115
will_see    -0.64726638  0.02398036  0.05336807 -0.23976545
country     -0.06684445 -0.48651065 -0.86967370 -0.04991279
genre       -0.09876628  0.77498236 -0.40282016 -0.42071123

We have to keep in mind that the minimum threshold for a significant loading (the correlation between a variable and a Principal Component) is usually around 0.3 to 0.5 in absolute value; here we will treat only loadings above 0.5 or below -0.5 as significant.

We can describe the table above as follows:

  • In PC1 the most important (negatively correlated) variables are will_see (-0.64726638) and seen (-0.57761303)

  • PC2 best describes genre, with a high positive loading of 0.77498236

  • PC3 best describes country, with a high negative loading of -0.86967370

  • In PC4 the most important variable is movie_score, with a loading of 0.80885185

4.3 Visualisation

First of all we will visualise our data as a 2D plot with respect to the two main Principal Components.
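A sketch of one way to draw such a correlation-circle plot, using the factoextra package (the styling options are assumptions):

library(factoextra)

#variables projected onto PC1 and PC2
fviz_pca_var(pca_fit,
             col.var = "contrib",   #color variables by their contribution
             repel = TRUE)          #avoid overlapping labels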

We can see on the plot that the variables genre and country are dominated by the second Principal Component (which explains 22.7% of total variance), while movie_score, will_see and seen are dominated by the first Principal Component (which explains 36.5% of total variance).

From the plot we can also read that when the angle between the arrows of two variables is more than about 130 degrees they are not correlated at all, and when it is less than about 60 degrees they seem to be positively correlated.

Thus, keeping in mind the previous observations, the summary and this plot, we conclude that when it comes to choosing a movie, viewers are likely to:

  • check how many people have seen and want to see the movie,

  • check the movie score and how many people want to see the movie,

  • or choose by genre or release country, without concern for the other variables.

Let's plot it with respect to units (standardized set):

Now with labeling (colors correspond to genres):

Summarizing, PCA as a dimensionality reduction method can be superior to MDS in terms of getting a better overview of the variables and our data. PCA sorts our variables with respect to the correlations between them, and then reduces the dimensionality of the whole data set by assigning multiple variables to newly created superior ones, reflected in Principal Component no. 1, no. 2 etc.

Analysing this kind of result can be really invaluable, mostly because we can reduce our data set to 2 or 3 Principal Components, which are easier to plot and understand. After creating those components we can assign new meanings to them according to the variables inside them. In such a way we can explain, for example, customer behavior when it comes to purchasing goods, where there are multiple variables to consider (design, content, brand etc.)

5 Hierarchical Clustering

This is the final part about the methods mentioned in the beginning. In this chapter we will perform Hierarchical Clustering on our data set, visualise it and give a short overall summary.

We will perform hierarchical clustering using the stats package and its hclust() function.

5.1 Preparations

First of all we have to standardize our data.

Then we have to measure the distances between the observations and assign the clustering to a variable.
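A sketch of these two steps with the stats package (the complete linkage is an assumption; the hidden chunk may use another method):

#standardize, measure Euclidean distances and build the hierarchy
hc_data = scale(Clusterable_Data_Base)
hc_dist = dist(hc_data, method = "euclidean")
hc = hclust(hc_dist, method = "complete")   #linkage method is an assumption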

Let's also check the density, to see what the distribution of distances looks like.

And visualise the trial dendrogram

Now we will limit the observations to within 5 standard deviations and cut the tree. Then we will proceed to the summary.

5.2 Summary

In this part we will check summaries of the clustering produced by hclust; maybe we will find something interesting?

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   2.000   6.492  12.000  21.000 

A maximum of 21 clusters? That's a lot, but we won't give up; let's change the stdev value to end up with 3-4 clusters.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   1.000   1.264   1.000   4.000 

Neighbors within 10 standard deviations is still A LOT; let's try another way.

We will set the number of clusters manually now.
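A sketch of cutting the tree into a fixed number of groups (k = 3 matches the maximum cluster id in the summary below):

#cut the dendrogram into exactly 3 clusters
clusters = cutree(hc, k = 3)
summary(clusters)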

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    1.00    1.00    1.26    1.00    3.00 

Now it looks reasonable; let's plot it.

Fair enough, let's proceed to visualising.

5.3 Visualisation

In this part we will try to visualise the dendrogram in a more beautiful way.

First, let's try a triangle layout.
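A minimal sketch with base R's dendrogram plotting (leaf labels suppressed, since thousands of them would be unreadable):

#triangle-style dendrogram
plot(as.dendrogram(hc), type = "triangle",
     leaflab = "none",   #hide the thousands of leaf labels
     main = "Triangle dendrogram")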

Looks nice, a bit like modern art; let's try another one!

In the case above, we cut the triangle dendrogram a little to make it more readable, but we still have a problem with labeling, because of the sheer number of observations.

But, anyway, let’s make some magic!

We can see that Hierarchical Clustering is a really cool tool when it comes to visualisation; however, the main limitation of this method is the number of observations. The more we have, the less clarity it offers, so we would have to limit the number of observations and visualise them separately, which would give us an enormous number of dendrograms that are hard to interpret. Still, I really recommend this method for small data sets, up to maybe 500 observations, where we can keep the picture clear and plot it in a proper way.

6 Final Word

Throughout this paper we took a closer look at dimension scaling methods and hierarchical clustering. First of all we did Web Scraping, which is a very useful tool when it comes to gathering data from sources that don't provide ready-to-use databases. Then a short introduction to the topic was made; we provided some handmade definitions and visual examples taken from the internet to better understand the subject. Afterwards we briefly elaborated on each method, based on the example of our scraped database. Along the way we discovered new ways of visualising data with respect to each of those methods, which can be very useful in future work.

I hope this paper was and will be useful for you, reader, in your work and fields of interest. I also encourage you to check out my other paper about clustering methods, which is available here

Korneliusz Krysiak

12.2018