1. Introduction
2. Review of the Data Set
3. Clustering Data
4. Insights from Clustering
5. Final Word
6. References

1 Introduction

This paper grew out of the author’s interest in movies and Unsupervised Learning. All data gathering and manipulation were performed in RStudio. The data were gathered from Filmweb, a Polish website similar to IMDb.

1.1 Web Scraping as a tool for Clustering

Wikipedia defines Web Scraping as follows: “Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser.” (Source: https://en.wikipedia.org/wiki/Web_scraping)

In other words, Web Scraping is a widely used method for gathering data, images and other resources from websites. It is especially useful when the data we are interested in cannot be obtained in any other way.

1.1.1 Assumptions

In this paper, movies were the area of interest. The research was driven by curiosity about whether Clustering methods can reveal interesting insights into how movies group together. Web Scraping was therefore used here to build our own database.

The data were scraped from http://filmweb.pl. Web Scraping can be done directly in RStudio, using the rvest library.

1.1.2 Code

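library(rvest)    # read_html(), html_nodes(), html_text()
library(stringr)  # str_c()
library(stringi)  # stri_replace_all()
library(tibble)   # tibble()
library(readr)    # write_csv()

f3 = function(n){

tryCatch({
  
  url= 'https://www.filmweb.pl/films/search?endRate=10&orderBy=popularity&descending=true&startCount=100&startRate=1'
  #one URL per results page, pages 1 to n
  list_of_pages <- str_c(url, '&page=', 1:n)
  data_base = tibble()
  
  for(i in seq_along(list_of_pages)){
  
    baza_filmow = read_html(list_of_pages[i])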
    
    get_title_html = html_nodes(baza_filmow,".filmPreview__title")
    get_rating_html = html_nodes(baza_filmow,".rateBox__rate")
    get_genre_html = html_nodes(baza_filmow,".filmPreview__info--genres")
    get_country_html = html_nodes(baza_filmow,".filmPreview__info--countries")
    get_seen_html = html_nodes(baza_filmow,".rateBox__votes")
    get_iwanttosee_html = html_nodes(baza_filmow,".wantToSee__count")
    
    #deleting unnecessary strings
    remove = html_text(get_seen_html)
    remove_first = stri_replace_all(remove,"",fixed="głosy")
    seen = stri_replace_all(remove_first,"",fixed="głosów")
    seen = gsub(" ", "", seen, fixed = TRUE)
    remove_second = html_text(get_genre_html)
    genre = stri_replace_all(remove_second,"",fixed="gatunek:")
    remove_third = html_text(get_country_html)
    country = stri_replace_all(remove_third,"",fixed="kraj:")
    remove_fourth = html_text(get_rating_html)
    rating = stri_replace_all(remove_fourth,".",fixed=",")
    
    #adding spaces between names that were glued together (e.g. two genres in one string)
    genre= gsub("([a-z])([A-Z])", "\\1 \\2", genre)
    genre= gsub("([ł])([A-Z])", "\\1 \\2", genre)
    country = gsub("([a-z])([A-Z])", "\\1 \\2", country)
    country = gsub("([A])([A-Z])","\\1 \\2",country)
    
    #some cleaning (leaving only the first genre and the first country)
    genre = gsub('([A-Za-z]+) .*', '\\1', genre)
    genre = gsub('([ł]+) .*', '\\1', genre)
    country = gsub('([A-Za-z]+) .*', '\\1',country)
    
    
    #aliasing
    title = html_text(get_title_html)
    i_want_to_see = html_text(get_iwanttosee_html)
    i_want_to_see = gsub(" ", "", i_want_to_see, fixed = TRUE)
    
    
    #building datatable
    data_table = tibble(movie_title = title,
                         movie_score = rating,
                         movie_genre = genre,
                         release_country = country,
                         people_seen = seen,
                         people_want_to_see = i_want_to_see)
    
    #appending the rows from this page to the results
    data_base = rbind(data_base,data_table)
  
   
    }
  }, error=function(e){
    #report why scraping stopped instead of failing silently
    message("Scraping stopped: ", conditionMessage(e))
  })
View(data_base)
write_csv(data_base,'RawFilmweb_Data_Base3.csv')
}

#applying the function for 1000 pages
f3(1000)
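The string cleaning inside the function can be tried out without touching the website. The sketch below is a simplified illustration in base R, not the scraping code itself: the input strings are made up to resemble what html_text() might return, and the variable names are hypothetical.

```r
# Hypothetical raw strings, standing in for html_text() output.
raw_votes  <- c("12 345 głosów", "97 głosy")
raw_rating <- c("7,4", "6,0")
raw_genre  <- c("DramatKomedia", "Thriller")

# Keep only the digits of the vote counts, then convert to numbers.
votes <- as.numeric(gsub("[^0-9]", "", raw_votes))

# Swap the decimal comma for a decimal point.
rating <- as.numeric(sub(",", ".", raw_rating, fixed = TRUE))

# Insert a space at each lower-case/upper-case boundary, then keep the first word.
genre <- gsub("([a-z])([A-Z])", "\\1 \\2", raw_genre)
genre <- sub(" .*", "", genre)

votes
rating
genre
```

Stripping every non-digit character is a slightly broader rule than removing the words "głosy" and "głosów" one by one, so it also copes with unexpected separators in the vote counts.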

Generated Data Base

1.2 Clustering

According to Wikipedia, cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Source: https://en.wikipedia.org/wiki/Cluster_analysis)

In this paper, Clustering methods were used to get insights from the gathered data set. To make the idea of clustering easier to follow, here is a short introduction to three principal clustering methods.

1.2.1 K-Means

K-Means is one of the most important Clustering methods. It starts from k randomly chosen centroids, assigns every data point to its nearest centroid, moves each centroid to the mean of the points assigned to it, and repeats these two steps until the assignments stop changing. In RStudio, K-means clustering can be performed with library(factoextra) and the function eclust(data,"kmeans").
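As a minimal sketch of the idea, the example below runs K-means with the base R function kmeans() rather than eclust(), so that it needs no extra packages. The toy data and all object names are made up for illustration.

```r
set.seed(42)  # k-means starts from random centroids, so fix the seed

# Toy data: two well-separated groups of 50 points each.
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2)
)

# nstart = 10 reruns the algorithm from 10 random starts and keeps the best result.
km <- kmeans(toy, centers = 2, nstart = 10)

km$centers   # final centroids: the mean of the points in each cluster
km$size      # number of points per cluster
```

Note that the centroids are averages, so they usually do not coincide with any observed data point.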

Simplified Animated K-Means Clustering:

K-Means

1.2.2 PAM

Partitioning Around Medoids (PAM), also known as K-Medoids, is the second most important clustering method. This time the cluster centres are medoids: actual observations taken from the data set rather than computed averages. This makes the method more “real”, because we do not introduce any artificial centre point of our own.
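The difference from K-means can be seen in a small sketch using pam() from the cluster package (a recommended package included in most R installations). The toy data are made up for illustration.

```r
library(cluster)  # pam()

set.seed(42)
# Toy data: two groups of 20 points each.
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2)
)

pm <- pam(toy, k = 2)

pm$id.med    # row numbers of the observations chosen as medoids
pm$medoids   # their coordinates, the same values as toy[pm$id.med, ]
```

Because each medoid is a real row of the data, it can be looked up directly, for example as the most "typical" movie of a cluster.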

Comparison between K-Means and K-Medoids (PAM):

Slight Comparison

1.2.3 CLARA

Clustering Large Applications (CLARA) does essentially the same as PAM, but it runs PAM on samples of the data instead of the whole data set. This makes it far more efficient on large data sets, where PAM is very slow. For example, clustering this movie database (around 5800 records) took PAM around 1 hour and CLARA only 30 seconds.
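A minimal sketch of a CLARA run with clara() from the cluster package is shown below. The data are synthetic and the values chosen for samples and sampsize are arbitrary illustrations, not the settings used in this paper.

```r
library(cluster)  # clara()

set.seed(42)
# Toy data: 5000 points in two groups.
big <- rbind(
  matrix(rnorm(5000, mean = 0), ncol = 2),
  matrix(rnorm(5000, mean = 6), ncol = 2)
)

# CLARA draws `samples` subsets of `sampsize` rows, runs PAM on each subset,
# and keeps the medoids that fit the whole data set best.
cl <- clara(big, k = 2, samples = 20, sampsize = 100)

cl$medoids            # medoids are still actual rows of the data
table(cl$clustering)  # cluster sizes over all 5000 points
```

The speed-up comes from the fact that the expensive medoid search only ever sees sampsize rows at a time.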

2 Review of the Data Set

In this section we review the scraped data set: we check its basic statistics, take another look at it and prepare it for the clustering process.

2.1 Data Summary

Let’s check the basic features of our data set before preparing it for clustering.

##  movie_title         movie_score    movie_genre        release_country   
##  Length:5790        Min.   :1.400   Length:5790        Length:5790       
##  Class :character   1st Qu.:6.300   Class :character   Class :character  
##  Mode  :character   Median :6.900   Mode  :character   Mode  :character  
##                     Mean   :6.794                                        
##                     3rd Qu.:7.400                                        
##                     Max.   :8.700                                        
##   people_seen     people_want_to_see
##  Min.   :    97   Min.   :   429    
##  1st Qu.: 10238   1st Qu.:  4254    
##  Median : 21592   Median :  7130    
##  Mean   : 46914   Mean   : 11543    
##  3rd Qu.: 51771   3rd Qu.: 13841    
##  Max.   :727515   Max.   :131610

As we can see, movie_genre, movie_title and release_country are of no use for clustering in their current form. We therefore have to clean the data and transform it from character to numeric values. For that purpose we ran one more round of Web Scraping beforehand, which lets us replace each release country with its GDP per capita (the PKB_2017 column below; PKB is the Polish abbreviation for GDP). We delete the movie title for now, because we won’t use it in clustering, and movie genres are replaced with their percentage share among all genres.

Then the summary looks as follows:

##   movie_score     people_seen     people_want_to_see    PKB_2017     
##  Min.   :1.400   Min.   :   117   Min.   :   428     Min.   :  1834  
##  1st Qu.:6.300   1st Qu.: 10206   1st Qu.:  4214     1st Qu.: 42514  
##  Median :6.900   Median : 21500   Median :  7071     Median : 56935  
##  Mean   :6.791   Mean   : 46834   Mean   : 11500     Mean   : 47767  
##  3rd Qu.:7.400   3rd Qu.: 51662   3rd Qu.: 13782     3rd Qu.: 56935  
##  Max.   :8.700   Max.   :725921   Max.   :131319     Max.   :107865  
##    rate.Freq        
##  Min.   :0.0001733  
##  1st Qu.:0.0408870  
##  Median :0.0684338  
##  Mean   :0.1606753  
##  3rd Qu.:0.3395703  
##  Max.   :0.3395703

The code for this cleaning step is as follows:

library(readr)   # read_csv(), write_csv()
library(tibble)  # tibble()

filmweb_database = read_csv('RawFilmweb_Data_Base2.csv')

#changing to numeric
filmweb_database$movie_score = as.numeric(filmweb_database$movie_score)
filmweb_database$people_seen = as.numeric(filmweb_database$people_seen)
filmweb_database$people_want_to_see = as.numeric(filmweb_database$people_want_to_see)

#checking the unique values in genres and countries

unique_vector_of_genres = unique(filmweb_database$movie_genre)
unique_vector_of_countries = unique(filmweb_database$release_country)

#assigning a numerical id to each distinct genre and country

num_value_of_genres = seq_along(unique_vector_of_genres)
names(num_value_of_genres) = unique_vector_of_genres

num_value_of_countries = seq_along(unique_vector_of_countries)
names(num_value_of_countries) = unique_vector_of_countries

#getting rid of titles and assigning a numerical id to each movie
movie_id = seq_along(filmweb_database$movie_title)

#creating dictionaries that back up each name together with its assigned number
dict_of_genres = tibble(unique_vector_of_genres,num_value_of_genres)
dict_of_countries = tibble(unique_vector_of_countries,num_value_of_countries)
dict_of_movies_id = tibble(movie_title = filmweb_database$movie_title,movie_id)

filmweb_database_conclusions = read_csv("Filmweb_Data_Base.csv") #we will need movie titles for final conclusions but not for clustering purposes

Clusterable_Data_Base = read_csv("Filmweb_Data_Base.csv")
Clusterable_Data_Base$movie_title = NULL

#####################################################################################
#Changing genre to binary (DO NOT RECOMMEND! But it gives an output using clustering)
#####################################################################################
# database_genre_copy = data.frame(filmweb_database2$movie_genre)                   #
# binary_df = data.frame(row.names=rownames(database_genre_copy))                   #
#  for (i in colnames(database_genre_copy)) {                                       #
#    for (x in names(num_value_of_genres)) {                                        #
#                                                                                   #
#      binary_df[paste0(i, "_", x)] = as.numeric(database_genre_copy[i] == x)       #
#                                                                                   #
#    }                                                                              #
#  }                                                                                #
# filmweb_database2 = cbind(filmweb_database2,binary_df)                            #
#####################################################################################

#Attaching GDP per capita to every movie by its release country
#PKB.df (columns Country and PKB_2017) comes from the separate GDP scraping step and is not defined in this listing
Clusterable_Data_Base = merge(Clusterable_Data_Base,PKB.df,  by.x ='release_country',by.y = "Country", all.x = TRUE)
Clusterable_Data_Base = na.omit(Clusterable_Data_Base)

#Checking the occurrence of each genre to express it as a share of all movies, because binary values are too noisy for clustering purposes
table_of_genres = table(Clusterable_Data_Base$movie_genre)
table_of_genres.df = data.frame(genre = names(table_of_genres), occurence = as.vector(table_of_genres), rate = prop.table(table(Clusterable_Data_Base$movie_genre)))
#prop.table() contributes two columns, rate.Var1 and rate.Freq; only rate.Freq is kept
table_of_genres.df$rate.Var1 = NULL

Clusterable_Data_Base = merge(Clusterable_Data_Base,table_of_genres.df,  by.x ='movie_genre',by.y = "genre", all.x = T)
Clusterable_Data_Base$release_country = NULL
Clusterable_Data_Base$occurence = NULL
Clusterable_Data_Base$movie_genre = NULL

#Let's take a look and save it
View(Clusterable_Data_Base)
write_csv(Clusterable_Data_Base,"FilmwebClustering_DataBase.csv")

3 Clustering Data

In this part we start working with Clustering and go through the whole process step by step. First we check whether the data is clusterable, then we run some trial clustering, determine the optimal number of clusters for each method, and finally decide which method suits our data best. Afterwards we run the proper clustering on the data set and visualise the clusters in a readable way.

3.1 Preparation

First of all we will need some libraries.

Then we have to read our data set.

First let’s check what happens when we cluster all the data with an arbitrary number of clusters. As the database is quite big, we will use CLARA.

As we can see, the result is pretty ugly and doesn’t tell us anything. Let’s check whether the large values of people_seen and people_want_to_see are similar in shape to a normal distribution.

They aren’t, so let’s standardise the data.
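A minimal sketch of the standardisation step, using the base R function scale(), is shown below. The toy columns are made up to mimic a score on a 1-10 scale next to a vote count in the tens of thousands; the name sdata is borrowed from the CLARA output later in this paper.

```r
set.seed(1)
# Made-up columns on very different scales.
toy <- data.frame(
  movie_score = runif(200, min = 1, max = 10),
  people_seen = rexp(200, rate = 1 / 40000)
)

# scale() centres every column to mean 0 and rescales it to standard deviation 1.
sdata <- scale(toy)

round(colMeans(sdata), 10)   # column means after scaling
apply(sdata, 2, sd)          # column standard deviations after scaling
```

Without this step, distance-based methods such as K-means or CLARA would be dominated by the column with the largest numbers, here the vote counts.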

Clusterability

Now it’s time to check clusterability with the Hopkins statistic, before choosing the optimal number of clusters.

Our output looks like this: $hopkins_stat [1] 0.05249729. In general, the Hopkins statistic measures clusterability in the following way: \[H = \frac{\sum_{i=1}^{m} u_{i}^{d}}{\sum_{i=1}^{m} u_{i}^{d} + \sum_{i=1}^{m} w_{i}^{d}}\] Here \(u_{i}\) is the distance from the i-th of m randomly placed points to its nearest data point, \(w_{i}\) is the distance from the i-th of m sampled data points to its nearest neighbour in the data, and \(d\) is the number of dimensions.

With this formula, H close to 1 means that the data set is highly clusterable; H around 0.5 means that the data points are randomly distributed and clustering is not recommended; H close to 0 means that the points are spread evenly (closest to a uniform distribution) and clustering is pointless.

Taken at face value, that would mean that our data is close to a uniform distribution and there is no point in clustering it. However, the get_clust_tendency function in library(factoextra) calculates the Hopkins statistic in a slightly different way. The documentation of get_clust_tendency{factoextra} states: “Hopkins statistic: If the value of Hopkins statistic is close to zero (far below 0.5), then we can conclude that the dataset is significantly clusterable”. Under that convention, our value of about 0.05 indicates highly clusterable data.
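To make the formula above concrete, here is a minimal base R sketch of it. It is not the factoextra implementation: the function name hopkins_stat, the choice of m (one tenth of the rows) and the use of the data's bounding box for the random points are all assumptions made for illustration. It follows the convention of the formula, so values close to 1 indicate clusterable data.

```r
# Hypothetical helper implementing H = sum(u^d) / (sum(u^d) + sum(w^d)).
hopkins_stat <- function(x, m = floor(nrow(x) / 10)) {
  x <- as.matrix(x)
  n <- nrow(x)
  d <- ncol(x)

  # Euclidean distance from point p to the nearest row of pts.
  nearest <- function(p, pts) sqrt(min(rowSums(sweep(pts, 2, p)^2)))

  # m artificial points drawn uniformly from the bounding box of the data.
  mins <- apply(x, 2, min)
  maxs <- apply(x, 2, max)
  rand <- matrix(runif(m * d, rep(mins, each = m), rep(maxs, each = m)), nrow = m)

  # u: artificial point -> nearest real point.
  u <- apply(rand, 1, nearest, pts = x)

  # w: sampled real point -> nearest other real point.
  idx <- sample.int(n, m)
  w <- sapply(idx, function(i) nearest(x[i, ], x[-i, , drop = FALSE]))

  sum(u^d) / (sum(u^d) + sum(w^d))
}

set.seed(7)
# Two tight, far-apart groups versus points scattered at random.
clustered <- rbind(matrix(rnorm(200, 0, 0.3), ncol = 2),
                   matrix(rnorm(200, 10, 0.3), ncol = 2))
scattered <- matrix(runif(400), ncol = 2)

h_clustered <- hopkins_stat(clustered)
h_scattered <- hopkins_stat(scattered)
h_clustered
h_scattered
```

On the clustered toy data the random points fall mostly into empty space, so the u distances dominate and H moves towards 1; on the scattered data u and w are of similar size and H stays near 0.5.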

3.2 Optimal number of clusters and silhouette

As we now know, our data is highly clusterable. Next we have to determine the optimal number of clusters for each available method and plot silhouettes to decide between K-Means and CLARA (we reject PAM because our data set is too big for it).

First of all let us start with K-Means. We will check optimal number of clusters and silhouette for this clustering method.

As we can see, 2 clusters explain 78% of the variability, but in my opinion that is not enough to describe this population, so my choice is 3 clusters.
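One common way to read "variability explained" for K-means is the ratio of the between-cluster sum of squares to the total sum of squares. Whether this is exactly the quantity behind the 78% above is an assumption; the sketch below only shows how such a ratio can be computed for several values of k on made-up data.

```r
set.seed(42)
# Toy data: three groups of 100 points each.
toy <- rbind(
  matrix(rnorm(200, mean = 0), ncol = 2),
  matrix(rnorm(200, mean = 4), ncol = 2),
  matrix(rnorm(200, mean = 8), ncol = 2)
)

# Share of the total variability explained by the clustering, for k = 1..6.
explained <- sapply(1:6, function(k) {
  km <- kmeans(toy, centers = k, nstart = 10)
  km$betweenss / km$totss
})

round(explained, 2)
```

The ratio always grows with k, so the usual approach is to look for the point after which adding another cluster brings little improvement.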

3.2.1 K-means optimal number of Clusters and Silhouette

##   cluster size ave.sil.width
## 1       1 4329          0.39
## 2       2  745          0.29
## 3       3  698          0.14
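A table like the one above can be produced from any clustering with silhouette() from the cluster package. The sketch below uses made-up data and base R kmeans(); it illustrates the calculation and is not the code that generated the output shown here.

```r
library(cluster)  # silhouette()

set.seed(42)
# Toy data: three groups of 100 points each.
toy <- rbind(
  matrix(rnorm(200, mean = 0), ncol = 2),
  matrix(rnorm(200, mean = 4), ncol = 2),
  matrix(rnorm(200, mean = 8), ncol = 2)
)

km <- kmeans(toy, centers = 3, nstart = 10)

# Silhouette width of every point, from the cluster labels and pairwise distances.
sil <- silhouette(km$cluster, dist(toy))

summary(sil)$clus.sizes        # points per cluster
summary(sil)$clus.avg.widths   # average silhouette width per cluster
mean(sil[, "sil_width"])       # overall average silhouette width
```

Widths close to 1 mean a point sits well inside its cluster, values near 0 mean it lies between two clusters, and negative values suggest it may have been assigned to the wrong one.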

3.2.2 CLARA optimal number of Clusters and Silhouette

## $pamobject
## Call:     clara(x = sdata, k = k) 
## Medoids:
##      movie_score people_seen people_want_to_see    PKB_2017  rate.Freq
## [1,]  -0.1155632  -0.2773701         -0.4748466 -0.06942759 -0.9807155
## [2,]   0.2644393  -0.3373910         -0.1273311 -0.06942759  1.3230280
## Objective function:   1.711484
## Clustering vector:    int [1:5772] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
## Cluster sizes:            3751 2021 
## Best sample:
##  [1]   66  174  233  359  582  770  856  975 1009 1030 1116 1205 1272 1840
## [15] 1954 2030 2085 2089 2313 2520 2538 2618 2867 2868 2994 3046 3417 3681
## [29] 3885 4159 4173 4252 4269 4315 4371 4735 4867 4893 5026 5162 5250 5359
## [43] 5367 5644
## 
## Available components:
##  [1] "sample"     "medoids"    "i.med"      "clustering" "objective" 
##  [6] "clusinfo"   "diss"       "call"       "silinfo"    "data"      
## 
## $nc
## [1] 2
## 
## $crit
##  [1] 0.0000000 0.2756372 0.2119296 0.2413333 0.1847550 0.2450437 0.2299515
##  [8] 0.2303386 0.2227659 0.2165695 0.2182948 0.2076607 0.2178446 0.2170536
## [15] 0.2318194 0.2168398 0.2131170 0.2184835 0.2001230 0.2081288 0.2093967
## [22] 0.2008916 0.1844133 0.1955363 0.1957706 0.2205315 0.2063810 0.2085979
## [29] 0.1986365 0.2039860 0.2183373 0.2016073 0.1979206 0.2017878 0.1974226
## [36] 0.1986464 0.1849371 0.1989937 0.2028897 0.2019285

CLARA Silhouette

##   cluster size ave.sil.width
## 1       1 3743          0.11
## 2       2  500          0.06
## 3       3 1529          0.06

Both methods suggest that the optimal number of clusters is 2, but in this paper we will set it to 3 anyway because, as I said previously, two clusters are in my opinion not enough for this data set.
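The $crit vector in the CLARA output above holds one value per candidate number of clusters, and it looks like an average silhouette width; that reading is an assumption. A hedged sketch of the same idea, choosing k by the highest average silhouette width reported by clara(), could look like this on made-up data:

```r
library(cluster)  # clara()

set.seed(42)
# Toy data: three groups of 100 points each.
toy <- rbind(
  matrix(rnorm(200, mean = 0), ncol = 2),
  matrix(rnorm(200, mean = 4), ncol = 2),
  matrix(rnorm(200, mean = 8), ncol = 2)
)

ks <- 2:6

# clara() stores silhouette information for its best sample in $silinfo.
avg_width <- sapply(ks, function(k) {
  clara(toy, k = k, samples = 20)$silinfo$avg.width
})

best_k <- ks[which.max(avg_width)]

round(avg_width, 3)
best_k
```

As in this paper, the value suggested by the criterion is only a starting point; choosing a different k for the sake of interpretability is a judgment call.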

3.3 Choosing best method

Putting everything together, we have to decide which method to use. K-means gives clearly higher average silhouette widths (0.39, 0.29 and 0.14, against 0.11, 0.06 and 0.06 for CLARA), so we will opt for the K-means solution in the rest of this paper.

3.4 Visualising Clusters

First let’s visualise the K-means clustering of this data set.

Now let’s do the same with CLARA.

This confirms that K-means is the better choice: the shape, silhouette and distribution of the clusters look much better.

Let’s take a look at the correlation plot for K-means with respect to clusters.

4 Insights from Clustering

In this chapter we will try to get some insights from the clustered data set. To do so, we plot the data grouped by the cluster each movie belongs to and then carry out some explanatory analysis. Drawing on our knowledge about movies, we will describe the traits of each cluster as well as we can. After that we will compute some basic summaries for each cluster and try to draw a conclusion.

4.1 Assumptions

Our assumptions about the clustering are as follows:

  • Each cluster is different in terms of content
  • Each cluster describes some trend in how viewers evaluate movies
  • Each cluster is explicable in terms of chosen main variable
  • Each cluster is describable in terms of differences between movie “sets”

To begin, let’s take a look at the whole data set with respect to each variable and cluster.

First, we have to describe each variable:

  • Rating is the movie score, on a scale from 1 to 10

  • People_seen is the number of people who have seen movies from a given cluster

  • People_want_to_see is the number of people who want to see movies from a given cluster

  • Countries is the release country, replaced by its GDP per capita

  • Genre is the movie genre, replaced by its percentage share among all genres

The quick conclusions from those graphs can be as follows:

  • Cluster 3 represents rather good, well-rated movies that everyone wants to watch, where the release country doesn’t really matter

  • Cluster 2 represents popular movies with rather inconsistent ratings. We can clearly see outliers, which may mean that movies in this cluster are well advertised and have a huge box office but are rather poor in terms of plot, visuals etc. As for the country, they tend to come from big countries with lower GDP per capita or from small, poor countries

  • Cluster 1 represents the least popular movies, with rather high ratings (few outliers), from countries with rather high GDP per capita. From these marks we can conclude that films in this cluster are rather ambitious and known only to a few (like Swedish or German movies)

Genres are a bit tricky, because out of roughly 6000 movies in the database almost 1800 are Dramas. Because of that, the distribution of genres in this graph looks rather similar in each cluster. To check whether it differs “inside” the clusters, we will perform further analysis.

Here we can see the same comparison using box plots:

4.2 Description of each Cluster

Now that we have a basic and really general description of the clusters, let’s dig for more insights. Maybe we will find something interesting? Brace yourself, because this part contains a lot of graphs.

First of all let’s take a look at those troublesome genres. What types of films are in each cluster? We will choose the top 6 from each. (I’m sorry for the language on the graphs, but the Web Scraping was conducted on a Polish website, so genres, names, countries etc. are in Polish.)

Looking at these graphs we can see, as was mentioned before, that Drama takes the lead in every cluster. Let’s compare the distribution of the remaining genres.

  • In the first cluster Drama dominates, followed by Comedy, Thriller, Horror, Biography and Animation

  • In the second cluster Drama also dominates, followed by Comedy, Thriller, Love Drama, Biography and Documentary

  • In the third cluster the order is Drama, Thriller, Biography, Comedy, Action and Fantasy
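Rankings like the ones listed above can be computed by counting values within each cluster. The sketch below uses a tiny made-up data frame; the column names cluster and genre are assumptions, and the same pattern would work for countries.

```r
# Made-up cluster labels and genres, standing in for the clustered data set.
movies <- data.frame(
  cluster = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  genre   = c("Dramat", "Dramat", "Komedia", "Thriller",
              "Dramat", "Komedia", "Komedia",
              "Dramat", "Dramat", "Akcja")
)

# For every cluster: count genres, sort by frequency, keep at most the top 6.
top_genres <- lapply(split(movies$genre, movies$cluster), function(g) {
  head(sort(table(g), decreasing = TRUE), 6)
})

top_genres[["1"]]
top_genres[["2"]]
```

Each element of top_genres is a small frequency table, which can be passed straight to barplot() to draw one chart per cluster.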

Now let’s take a look at the countries in each cluster, again the top 6.

Here we can see a rather surprising thing: in cluster 2 the majority is Poland. I didn’t expect that at all, but let’s proceed to the description of each cluster.

  • In cluster number one, the majority of movies clearly come from the USA, followed by France, Germany, Great Britain, Canada and Belgium. As was said before, these are rather wealthy and quite specific countries. I mentioned that this could be the cluster of “niche and ambitious cinema”, and it seems so. The leader is of course the USA, but a lot of great movies come from Germany, France and Great Britain, and in my opinion, as a person who has watched more than 1800 movies in total, movies from those countries are more ambitious and often have a deeper plot than standard blockbusters from the USA.

  • In cluster number two, surprisingly, the leader is Poland, followed (not surprisingly) by the USA, China, the Czech Republic, Spain and France. This confirms the thesis stated before: we have two big countries, China and the USA, and rather poor ones like Poland and the Czech Republic. With Poland in the majority we can also confirm the thesis about ratings: most movies made in Poland are rather bad, but there are some very good ones that everyone knows. The same goes for China, which is third in the table. The USA always stands for divergence: it produces both extremely good and extremely bad movies, so it is hard to interpret.

  • In cluster number three, the majority is again the USA, followed by France, Germany, Poland, Australia and Canada. This also confirms the earlier thesis that the movies here are rather good, with high marks, and that the release country does not seem to matter, although the USA has a huge share. As was said before, the USA makes both extremely good and extremely bad movies. In this case it looks like the best movies from the USA probably went to cluster number 3 and the rest to the others.

For the numbers of people who have seen or want to see a movie, it is pointless to plot a histogram per cluster, because it would only show raw numbers. Instead, we will check the summary statistics of these two variables in each cluster.

Cluster no.1 Seen:

##     seen.cl1     
##  Min.   :   117  
##  1st Qu.:  9201  
##  Median : 16894  
##  Mean   : 27763  
##  3rd Qu.: 35756  
##  Max.   :219732

Cluster no.1 Want to see:

##  want_to_see.cl1
##  Min.   :  428  
##  1st Qu.: 3872  
##  Median : 6116  
##  Mean   : 7973  
##  3rd Qu.:10507  
##  Max.   :38404

Cluster no.2 Seen:

##     seen.cl2     
##  Min.   :   809  
##  1st Qu.:  8946  
##  Median : 17666  
##  Mean   : 31743  
##  3rd Qu.: 38446  
##  Max.   :294573

Cluster no.2 Want to see:

##  want_to_see.cl2
##  Min.   :  820  
##  1st Qu.: 4327  
##  Median : 6628  
##  Mean   : 9183  
##  3rd Qu.:11900  
##  Max.   :41764

Cluster no.3 Seen:

##     seen.cl3     
##  Min.   : 15492  
##  1st Qu.: 85692  
##  Median :150884  
##  Mean   :174145  
##  3rd Qu.:225234  
##  Max.   :725921

Cluster no.3 Want to see:

##  want_to_see.cl3 
##  Min.   :  1741  
##  1st Qu.: 23510  
##  Median : 32178  
##  Mean   : 35558  
##  3rd Qu.: 44532  
##  Max.   :131319
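Per-cluster summaries like the ones above can be obtained in one step with tapply(). The sketch below uses made-up numbers; the data frame and its column names are assumptions for illustration.

```r
set.seed(1)
# Made-up view counts for three clusters of 20 movies each.
movies <- data.frame(
  cluster     = rep(1:3, each = 20),
  people_seen = round(c(rexp(20, 1 / 20000),
                        rexp(20, 1 / 30000),
                        rexp(20, 1 / 150000)))
)

# summary() of people_seen, computed separately within each cluster.
seen_by_cluster <- tapply(movies$people_seen, movies$cluster, summary)

seen_by_cluster[["1"]]
seen_by_cluster[["3"]]
```

The result is a list indexed by cluster label, so the same call with people_want_to_see would give the second set of tables.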

Finally, we have to take a look at the movie ratings in each cluster. For this purpose we will make a density plot.

From the density plots we can see that our theory about the clusters may be true.

  • Cluster one represents rather well-scored movies, with a median around 6.8 and outliers reaching down to a score of 2 but never up to 10. This suggests that movies from this cluster are rather ordinary. A score around 6.5 is common on Filmweb, so we can assume that movies from cluster one are not outstanding in terms of plot or anything else.

  • Cluster two represents movies with, as was said before, rather inconsistent ratings. Outliers here reach down to 1 but never up to 10. We noticed before that movies from this cluster come mostly from Poland and from countries that do not put a lot of effort into making movies. Thus we can conclude that there are some “rara avis” titles, but most are watchable and nothing more.

  • Cluster three represents the most ambitious movies among the clusters. The median is around 7.8, which indicates very good movies; in the overall Filmweb ranking it is hard to find movies rated that well. The outliers are not significant, with the left tail reaching 5 and the right one 10. Also, judging by skewness, the density plot for cluster 3 resembles a normal distribution more closely than those of the other clusters. All of this confirms our previous theory that movies from this cluster are the best ones.

5 Final Word

Throughout this paper, we tried to present Clustering as a tool, and Clustering methods, in the most pleasant and understandable way. We got acquainted with some theory, three basic methods and the assumptions of clustering. Then we conducted our own Web Scraping to gather data, cleaned it and prepared it for clustering. Once the Clustering was done, we chose the best-fitting method, visualised the clusters and silhouettes, and finally proceeded to the analysis inside the clusters. We tried to analyse the clusters in a simple way that is pleasant to read and understandable for someone who is new to the topic. I want to emphasise that the clustering was done on data from a Polish website, so the conclusions were written with respect to the preferences of people from Poland.

Considering everything that has been done in this paper, we can say that Clustering is a very powerful tool for getting interesting insights from the data we want to describe. It is an Unsupervised Learning method that automatically groups data according to the similarities among them, which makes it very useful.

I hope this paper was interesting for you and gave you a new point of view on data analysis, especially if you weren’t familiar with Unsupervised Learning methods before.

Now you can take a look at the final table which contains input data and assigned clusters, to have better perspective which movies were in which cluster.