MDS, PCA and Hierarchical Clustering of Movies
1 Introduction
The main idea behind this paper was the author's interest in movies and Unsupervised Learning. All data gathering and manipulation were performed in RStudio. The main source used to gather the data was Filmweb, a Polish IMDb-like website. If you are also interested in clustering methods (PAM, CLARA, K-means), you can read my previous publication here
1.1 Web Scraping as a tool for Clustering
Definition of Web Scraping from Wikipedia is “Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser.”(short intro taken from here: https://en.wikipedia.org/wiki/Web_scraping)
In other words, web scraping is a widespread method for gathering resources, data, images and more from websites. It can be really useful, especially when we cannot otherwise gain access to the data we are interested in.
1.1.1 Assumptions
In this paper, movies were the area of interest. The research was driven by curiosity about whether it is possible to get interesting insights into how movies can be grouped when various scaling methods are applied to them. That was the main purpose of using web scraping here: to build our own database.
We scraped http://filmweb.pl. Web scraping can be performed in RStudio using the rvest library.
1.1.2 Code
#required packages for scraping and string handling
library(rvest)      #read_html(), html_nodes(), html_text()
library(stringr)    #str_c()
library(stringi)    #stri_replace_all()
library(tidyverse)  #tibble(), write_csv()
f3 = function(n){
tryCatch({
url= 'https://www.filmweb.pl/films/search?endRate=10&orderBy=popularity&descending=true&startCount=100&startRate=1'
list_of_pages <- c(url, str_c(url, '&page=', 2:n)) #page 1 is the base url itself, so the vector has n elements for the 1:n loop below
data_base = tibble()
for(i in 1:n){
baza_filmow = read_html(list_of_pages[i])
get_title_html = html_nodes(baza_filmow,".filmPreview__title")
get_rating_html = html_nodes(baza_filmow,".rateBox__rate")
get_genre_html = html_nodes(baza_filmow,".filmPreview__info--genres")
get_country_html = html_nodes(baza_filmow,".filmPreview__info--countries")
get_seen_html = html_nodes(baza_filmow,".rateBox__votes")
get_iwanttosee_html = html_nodes(baza_filmow,".wantToSee__count")
#deleting unnecessary strings
remove = html_text(get_seen_html)
remove_first = stri_replace_all(remove,"",fixed="głosy")
seen = stri_replace_all(remove_first,"",fixed="głosów")
seen = gsub(" ", "", seen, fixed = TRUE)
remove_second = html_text(get_genre_html)
genre = stri_replace_all(remove_second,"",fixed="gatunek:")
remove_third = html_text(get_country_html)
country = stri_replace_all(remove_third,"",fixed="kraj:")
remove_fourth = html_text(get_rating_html)
rating = stri_replace_all(remove_fourth,".",fixed=",")
#adding spaces to remove extra genres
genre= gsub("([a-z])([A-Z])", "\\1 \\2", genre)
genre= gsub("([Ĺ‚])([A-Z])", "\\1 \\2 \\3", genre)
country = gsub("([a-z])([A-Z])", "\\1 \\2", country)
country = gsub("([A])([A-Z])","\\1 \\2",country)
#some cleaning (leaving only one factor)
genre = gsub('([A-z]+) .*', '\\1', genre)
genre = gsub('(ł+) .*', '\\1', genre)
country = gsub('([A-z]+) .*', '\\1', country)
#alliasing
title = html_text(get_title_html)
i_want_to_see = html_text(get_iwanttosee_html)
i_want_to_see = gsub(" ", "", i_want_to_see, fixed = TRUE)
#building datatable
data_table = tibble(movie_title = title,
movie_score = rating,
movie_genre = genre,
release_country = country,
people_seen = seen,
people_want_to_see = i_want_to_see)
data_base = rbind(data_base, data_table)
}
}, error=function(e){})
View(data_base)
data_base = data_base[data_base$movie_score !=FALSE,]
write_csv(data_base,'RawFilmweb_Data_Base3.csv')
}
#applying the function for 1000 pages
f3(1000)
Generated data base (head of 75 records; if you want to see them all, check it here).
#Scraping PKB per capita for future purposes
www_PKB = read_html("https://pl.tradingeconomics.com/country-list/gdp-per-capita")
countries_list_PKB = read_html("https://pl.wikipedia.org/wiki/Lista_pa%C5%84stw_%C5%9Bwiata_wed%C5%82ug_PKB_nominalnego_per_capita")
get_amount = html_nodes(www_PKB,"td:nth-child(2)")
PKB_amount = html_text(get_amount)
get_country = html_nodes(countries_list_PKB,"td:nth-child(2)")
PKB_country = html_text(get_country)
removed = stri_replace_all(PKB_country,"",fixed=" ")
#Cleaning data
trim.leading = function (x) sub("^\\s+", "", x)
trim.trailing = function (x) sub("\\s+$", "", x)
PKB_country = trim.leading(PKB_country)
PKB_country = gsub("([A-z]+) .*", "\\1 \\2", PKB_country)
PKB_country = trim.trailing(PKB_country)
PKB_country = gsub("Stany", "USA", PKB_country, fixed = TRUE)
PKB_values = gsub("([.])", "\\1 \\2", PKB_amount)
PKB_values = gsub('([.]+) .*', '\\1 \\2', PKB_values)
PKB_values = gsub(".", "", PKB_values, fixed = TRUE)
PKB_values = as.numeric(PKB_values)
#Building Data.Frame
Countries.df = data.frame(Country = PKB_country)
changed = Countries.df[-nrow(Countries.df),]
PKB.df = data.frame(Country = changed, PKB_2017 = PKB_values)
PKB.df$Country = as.character(PKB.df$Country)
View(PKB.df)
write_csv(PKB.df,"Countries_PKB")
1.2 Dimension Scaling Methods and Hierarchical Clustering
Dimension scaling and dimensionality reduction in general are very useful when analysing large amounts of data. Sometimes we want to get insights in terms of a few principal variables which can explain the variability of our sample efficiently. In this paper we will focus on three methods. First we will discuss MDS, which stands for Multidimensional Scaling; then we will proceed to PCA, which stands for Principal Component Analysis; and in the end we will finish with Hierarchical Clustering.
1.2.1 MDS
Multidimensional Scaling (MDS) is a statistical method for getting insights into hidden variables which describe the relations and similarities among the analysed objects. When performing MDS, we first have to prepare a matrix of distances between objects (a correlation matrix converted to dissimilarities can be used for this purpose). From it we obtain k coordinates for every object, each corresponding to one dimension of a Cartesian coordinate system. If k <= 3 we can visualise the data (up to 3D). If you want to learn more about MDS, check this link. Below is a simple multidimensional scaling graph from the field of marketing:
(Figure: Marketing MDS.)
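For intuition, here is a minimal sketch of the mechanics on the built-in eurodist data set (inter-city distances) rather than on the movie data: a distance matrix goes in, k = 2 coordinates per object come out.
#classical MDS on a ready-made distance matrix, reduced to k = 2 coordinates
mds_demo = cmdscale(eurodist, k = 2)
plot(mds_demo[,1], -mds_demo[,2], type = "n", xlab = "MDS 1", ylab = "MDS 2")
text(mds_demo[,1], -mds_demo[,2], labels = labels(eurodist), cex = 0.7)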
1.2.2 PCA
Principal Component Analysis (PCA) is a statistical method of factor analysis. We can describe our data set as a cloud of N points in K dimensions, where N is the number of observations and K the number of variables. The goal of PCA is to rotate the coordinate system in such a way that the variance of the first coordinate is maximized, then the variance of the second, and so on. PCA is widely used to reduce the dimensions of a data set, dropping the least informative ones (translated from https://pl.wikipedia.org/wiki/Analiza_głównych_składowych). When performing a PCA analysis we assign to each principal component the variables that describe it best; our goal is to keep at most 3 principal components, because our visual perception can handle only 3D plots.
If you want to know more about PCA, you can read something here, or if you are interested in how to perform PCA from scratch, I suggest watching a video on the topic.
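As a rough illustration of the idea (not the author's code), a from-scratch PCA on the built-in iris measurements boils down to centring the data, eigen-decomposing the covariance matrix and projecting onto the eigenvectors:
#PCA from scratch: centre, eigen-decompose the covariance matrix, project
X = scale(iris[,1:4], center = TRUE, scale = FALSE)
eig = eigen(cov(X))
scores = X %*% eig$vectors                      #principal component scores
round(eig$values / sum(eig$values), 3)          #proportion of variance explained per component
plot(scores[,1], scores[,2], xlab = "PC1", ylab = "PC2")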
1.2.3 Hierarchical Clustering
Hierarchical Clustering is one of the automatic clustering methods. When we perform hierarchical clustering, we assume at the beginning that every object is a single cluster, and at the end that all objects belong to one cluster. We can distinguish two ways of performing hierarchical clustering:
- Agglomerative methods: we create a matrix of similarities, and then in subsequent iterations we merge the objects (clusters) that are most similar to each other
- Divisive methods: we start with one huge cluster containing all the data, and then in subsequent iterations we split it into smaller clusters whose objects are most similar to each other
Also, most of the time when we perform hierarchical clustering we want a dendrogram as the output, which shows the pathway of the divisive or agglomerative clustering. More about hierarchical clustering here.
A projection of agglomerative clustering is shown below (divisive clustering works exactly the other way round):
(Figure: agglomerative clustering, step by step.)
And here is how the corresponding dendrogram for agglomerative clustering is built:
(Figure: dendrogram produced by agglomerative clustering.)
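A tiny, self-contained example of agglomerative clustering and its dendrogram (on the built-in USArrests data, not the movie data) might look like this:
#every observation starts as its own cluster; the closest clusters are merged step by step
d_toy = dist(scale(USArrests), method = "euclidean")
hc_toy = hclust(d_toy, method = "complete")
plot(hc_toy, cex = 0.6, main = "Agglomerative clustering of USArrests")
cutree(hc_toy, k = 4)   #cutting the dendrogram gives a flat cluster assignment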
2 Review of the Data Set
In this paragraph we will review our Scraped data set. We will check basic statistics, take a glance at it again and prepare it for further analysis.
2.1 Data Summary
Now it's time to check the basic features of our data set before preparing it for the further journey.
movie_title movie_score movie_genre release_country
Length:5790 Min. :1.400 Length:5790 Length:5790
Class :character 1st Qu.:6.300 Class :character Class :character
Mode :character Median :6.900 Mode :character Mode :character
Mean :6.794
3rd Qu.:7.400
Max. :8.700
people_seen people_want_to_see
Min. : 97 Min. : 429
1st Qu.: 10238 1st Qu.: 4254
Median : 21592 Median : 7130
Mean : 46914 Mean : 11543
3rd Qu.: 51771 3rd Qu.: 13841
Max. :727515 Max. :131610
As we can see, movie_genre, movie_title and release_country are of no use for clustering in their current form. That's why we have to do some cleaning and transform the data from characters to numeric values. This is why we prepared one more web-scraping step earlier, to replace release_country with the country's GDP per capita. We drop the movie title for now, because we won't use it in clustering, and the movie genre (a character variable) will be replaced with its percentage share among all genres. Then the summary looks as follows:
movie_score people_seen people_want_to_see PKB_2017
Min. :1.400 Min. : 117 Min. : 428 Min. : 1834
1st Qu.:6.300 1st Qu.: 10206 1st Qu.: 4214 1st Qu.: 42514
Median :6.900 Median : 21500 Median : 7071 Median : 56935
Mean :6.791 Mean : 46834 Mean : 11500 Mean : 47767
3rd Qu.:7.400 3rd Qu.: 51662 3rd Qu.: 13782 3rd Qu.: 56935
Max. :8.700 Max. :725921 Max. :131319 Max. :107865
rate.Freq
Min. :0.0001733
1st Qu.:0.0408870
Median :0.0684338
Mean :0.1606753
3rd Qu.:0.3395703
Max. :0.3395703
You can find the code for this cleaning here:
filmweb_database = read_csv('RawFilmweb_Data_Base2.csv')
#changing to numeric
filmweb_database$movie_score = as.numeric(filmweb_database$movie_score)
filmweb_database$people_seen = as.numeric(filmweb_database$people_seen)
filmweb_database$people_want_to_see = as.numeric(filmweb_database$people_want_to_see)
#checking the unique values in genres and countries
unique_vector_of_genres = unique(filmweb_database$movie_genre)
unique_vector_of_countries = unique(filmweb_database$release_country)
#assigning numerical value to string values from genres and countries
genres_copy = c(unique_vector_of_genres)
no_uniq_gen = as.double(length(genres_copy))
num_value_of_genres = seq(1:(no_uniq_gen))
names(num_value_of_genres) = genres_copy
countries_copy = c(unique_vector_of_countries)
no_uniq_countr = as.double(length(countries_copy))
num_value_of_countries = seq(1:(no_uniq_countr))
names(num_value_of_countries) = countries_copy
#geting rid of titles and assigning numerical id
movie_id = c(filmweb_database$movie_title)
no_movie_id = as.double(length(movie_id))
num_value_of_movie_titles = seq(1:(no_movie_id))
#creating dictionary and backing up the names and assigned number
dict_of_genres = tibble(unique_vector_of_genres,num_value_of_genres)
dict_of_countries = tibble(unique_vector_of_countries,num_value_of_countries)
dict_of_movies_id = tibble(filmweb_database$movie_title,movie_id)
filmweb_database_conclusions = read_csv("Filmweb_Data_Base.csv") #we will need movie titles for final conclusions but not for clustering purposes
Clusterable_Data_Base = read_csv("Filmweb_Data_Base.csv")
Clusterable_Data_Base$movie_title = NULL
#####################################################################################
#Changing genre to binary (DO NOT RECOMMEND! But it gives an output using clustering)
#####################################################################################
# database_genre_copy = data.frame(filmweb_database2$movie_genre) #
# binary_df = data.frame(row.names=rownames(database_genre_copy)) #
# for (i in colnames(database_genre_copy)) { #
# for (x in names(num_value_of_genres)) { #
# #
# binary_df[paste0(i, "_", x)] = as.numeric(database_genre_copy[i] == x) #
# #
# } #
# } #
# filmweb_database2 = cbind(filmweb_database2,binary_df) #
#####################################################################################
#Binding everything together and dropping the movie_genre column, because we no longer need it
Clusterable_Data_Base = merge(Clusterable_Data_Base,PKB.df, by.x ='release_country',by.y = "Country", all.x = T)
Clusterable_Data_Base = na.omit(Clusterable_Data_Base)
#Checking the occurrence of each genre to encode it as a percentage value, because a binary encoding is too dirty for clustering purposes
table_of_genres = table(Clusterable_Data_Base$movie_genre)
table_of_genres.df = data.frame(genre = names(table_of_genres), occurence = as.vector(table_of_genres), rate = prop.table(table(Clusterable_Data_Base$movie_genre)))
table_of_genres.df$rate.Var1 = NULL
Clusterable_Data_Base = merge(Clusterable_Data_Base,table_of_genres.df, by.x ='movie_genre',by.y = "genre", all.x = T)
Clusterable_Data_Base$release_country = NULL
Clusterable_Data_Base$occurence = NULL
Clusterable_Data_Base$movie_genre = NULL
#Let's take a look and save it
View(Clusterable_Data_Base)
write_csv(Clusterable_Data_Base,"FilmwebClustering_DataBase.csv")
It is also important to explain our variables:
- movie_score is the movie rating
- seen is the number of people who have seen the movie
- will_see is the number of people who want to see the movie
- country corresponds to the country where the movie was released (represented by its GDP per capita)
- genre corresponds to the genre, expressed as its percentage share among all genres
3 Multidimensional Scaling
In this paragraph we will take a closer look at MDS: we will perform it on our data and try to gather some insights.
First of all we will need some libraries:
library(smacof)
library(labdsv)
library(vegan)
library(MASS)
library(ape)
library(ggfortify)
library(pls)
library(ClusterR)
library(maptools)
library(factoextra)
library(cluster)
library(flexclust)
library(fpc)
library(clustertend)
library(corrplot)
library(bazar)
library(hornpa)
library(ggbiplot)
library(devtools)
library(pca3d)
library(dendextend)
library(readr)   #for read_csv() below
Then we have to read our data set
database = read_csv("FilmwebClustering_DataBase.csv")
database_conclusions = read.xlsx("Filmweb_DataBase_Final.xlsx", sheetIndex = 1, skipEmptyRows = FALSE)
3.1 Preparation
Once we have read our data, the next step is to prepare it for MDS. We need to create a matrix of distances between units with respect to the variables. Afterwards we prepare the data for some tests.
Here we prepare the data for Mantel's test and Kruskal's stress measure, used later on:
sim=cor(database)
dis=dist(database)
dis.t=dist(t(database))
dis_test=sim2diss(sim, method=1, to.dist = TRUE)
fit.data=mds(dis_test, ndim=2, type="ordinal")
Now let's look at the correlation matrix of our numeric variables:
movie_score seen will_see country genre
movie_score 1.000 0.231 0.387 0.016 0.135
seen 0.231 1.000 0.574 0.059 -0.110
will_see 0.387 0.574 1.000 0.031 0.130
country 0.016 0.059 0.031 1.000 -0.067
genre 0.135 -0.110 0.130 -0.067 1.000
From this plot and the data, we can see that there is a significant positive correlation between the variables:
- seen and will_see
- will_see and movie_score
Let's make a trial visualisation.
For the purposes of a clear view and better scale, we will standardize the data.
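The code behind these plots is not echoed in the rendered output; a minimal sketch, assuming base scale() for standardization and cmdscale() for the configuration (standar_database is the object name used later in the plotting code), could look like this:
#standardize the variables, then compute a 2-dimensional classical MDS configuration
standar_database = scale(database)
dist_units = dist(standar_database)              #Euclidean distances between movies
mds_units = cmdscale(dist_units, k = 2)          #k = 2 -> two coordinates per movie
plot(mds_units, xlab = "MDS 1", ylab = "MDS 2")
summary(as.data.frame(mds_units))                #the summary shown in the next subsection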
Now our preparations are done; let's proceed to the next point, where we will compute some basic summaries and statistics to check the traits of our data set.
3.2 Summary
Here we have a brief summary of this set:
V1 V2
Min. :-9.7951 Min. :-3.8665
1st Qu.:-0.3698 1st Qu.:-0.8475
Median : 0.2923 Median :-0.2566
Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.7714 3rd Qu.: 0.8904
Max. : 4.2805 Max. : 3.0088
As we can see, we now have two variables instead of five. This was done on purpose by setting k = 2 to reduce the dimension of the data.
Let's now perform Mantel's test to check whether the dissimilarity matrix and the similarity matrix converted to dissimilarities are associated.
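The call producing the output below is not shown; a plausible sketch uses mantel.test() from the already-loaded ape package on the two variable-level dissimilarity matrices built in the preparation step (the exact call is an assumption):
#Mantel test: permutation test of association between two dissimilarity matrices
mantel_result = mantel.test(as.matrix(dis.t), as.matrix(dis_test),
                            nperm = 999, alternative = "two.sided")
mantel_result[c("z.stat", "p", "alternative")]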
Here we see the results of the Mantel test:
$z.stat
[1] 0.7964331
$p
[1] 0.007
$alternative
[1] "two.sided"
If we set our significance level at 0.05, then according to the results of the Mantel test (p = 0.007) we reject the null hypothesis of no association between the matrices and conclude that they are indeed similar.
Next we will compute Kruskal's stress, to check the quality of the MDS. The formula used here looks as follows: \[ \sigma(X) = \frac{\sqrt{\sum_{i<j} w_{ij}(d_{ij}-d_{ij}(X))^{2}}}{n(n-1)/2}\]
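The chunk producing the three numbers below is not echoed; one way to obtain such a comparison is smacof's randomstress() for the random baseline together with the stress stored in the fitted object (a sketch under that assumption, not necessarily the author's exact call):
#mean stress of random data vs. the stress of our fitted configuration (fit.data)
set.seed(123)
random_stress = randomstress(n = nrow(database), ndim = 2, nrep = 100, type = "ordinal") #slow for large n
mean(random_stress)                      #mean random stress
fit.data$stress                          #stress of our MDS
fit.data$stress / mean(random_stress)    #ratio: the smaller, the better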
[1] 0.6090237
[1] 0.001227381
[1] 0.002015326
The first value is the mean of the random stress function for a sample of the same size as ours (5772 observations), the second is the stress of our data's MDS configuration, and the last one is the ratio of our stress to the random stress. We can notice that this ratio, 0.002015, is very small; hence, reading from the rule-of-thumb scale below, the quality of our MDS is rather excellent. Hurray!
0.20 = poor, 0.10 = fair, 0.05 = good, 0.025 = excellent, 0.00 = perfect
3.3 Visualisation
Let's visualise it again, first with respect to the variables:
dist.reg2=dist(t(standar_database))
mds2=cmdscale(dist.reg2, k=2)
plot(mds2, col=c("pink","orange","blue","green","red"), xlab="MDS 1", ylab="MDS 2")
pointLabel(mds2, rownames(mds2), cex=0.8)
Now let's compare the results with the k-means clusters.
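A sketch of that comparison, assuming k-means on the two MDS coordinates of the variables and factoextra's fviz_cluster() for the picture (the number of clusters, 3, is an assumption matching the interpretation below):
#cluster the 2-D MDS coordinates of the variables with k-means and plot the result
set.seed(123)
km_vars = kmeans(mds2, centers = 3, nstart = 25)
fviz_cluster(list(data = mds2, cluster = km_vars$cluster),
             repel = TRUE, main = "K-means clusters of the variables in MDS space")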
It is visible that some variables are clustered together, while two others lie far away in different clusters. It is hard to interpret it this way, but one can say that, for potential viewers, the movie score and the numbers of people who have seen or want to see the movie fall into one primary basket, while the type of movie and the country it comes from form two different baskets.
This may mean that it comes down to viewers' preferences: whether they choose a movie by the three principal factors (movie score, seen and will_see) or rather by genre or by the country where the movie was released.
Now, we will do the same but with respect to units
And also compare it to clusters
Summarizing multidimensional scaling as a dimension reduction method, one can say that it can be very useful in various fields. We can plot the results in two different ways, with respect to units and to variables, which is invaluable, for example, in marketing, where we sometimes have to compare our product or company to competitors in at most two or three dimensions. We can also see that we can easily cluster the results to get even more insights into the traits of our data set. However, MDS gives us only a limited view of the data; as an analysis tool, we can use it for comparisons or for studies of shopping baskets.
4 Principal Component Analysis
In this paragraph we will present the application of the PCA on our data set.
4.1 Preparations
First of all we have to prepare our data in such a way that we can perform the PCA analysis and get a summary of it. Let's start by checking the correlations between the variables on normalized values:
normalized.set=normalize(standar_database, range=c(0,1))
normalized.set.cor=cor(normalized.set, method="pearson")
movie_score seen will_see country genre
movie_score 1.000 0.231 0.387 0.016 0.135
seen 0.231 1.000 0.574 0.059 -0.110
will_see 0.387 0.574 1.000 0.031 0.130
country 0.016 0.059 0.031 1.000 -0.067
genre 0.135 -0.110 0.130 -0.067 1.000
We can notice that after normalizing the data, the plot and the summary look pretty similar to the previous ones. We still have correlations between:
- people_want_to_see and people_seen (now 0.574, previously 0.574)
- people_want_to_see and movie_score (now 0.387, previously 0.387)
Basically, the results are just the same.
Now we will run Horn's parallel analysis to check how many principal components to retain.
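The parallel-analysis output below is consistent with a hornpa() call along these lines (argument names follow the hornpa package; the exact call is an assumption):
#Horn's parallel analysis: eigenvalue percentiles from random data with the same dimensions
hornpa(k = 5, size = 5772, reps = 100, seed = 123)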
Parallel Analysis Results
Method: pca
Number of variables: 5
Sample size: 5772
Number of correlation matrices: 100
Seed: 123
Percentile: 0.95
Compare your observed eigenvalues from your original dataset to the 95 percentile in the table below generated using random data. If your eigenvalue is greater than the percentile indicated (not the mean), you have support to retain that factor/component.
Component Mean 0.95
1 1.036 1.051
2 1.016 1.027
3 1.000 1.008
4 0.983 0.993
5 0.965 0.978
4.2 Summary
The first step of the summary is to compare our observed eigenvalues to those generated by hornpa, to check whether we have support to retain each factor/component.
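A sketch of how the observed eigenvalues below can be obtained: the squared standard deviations of a prcomp fit on the scaled data are the eigenvalues of the correlation matrix (pca.fit is a hypothetical object name):
#PCA on the standardized variables; squared component standard deviations = eigenvalues
pca.fit = prcomp(database, center = TRUE, scale. = TRUE)
pca.fit$sdev^2
summary(pca.fit) would then yield the importance-of-components table shown a little further down.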
1.824094 1.134637 0.9539321 0.7200782 0.367258
We can see that our first eigenvalue is 1.82 > 1.05 and the second is 1.13 > 1.027, while the third, 0.95, is smaller than 1.008. The main idea of the test is to check whether the eigenvalues from the random data (95th percentile) are smaller than our empirical ones; the number of components for which this holds is the number of principal components to retain. In our case it is 2.
With this in mind, we will still check what the full PCA for this data set looks like.
Importance of components:
PC1 PC2 PC3 PC4 PC5
Standard deviation 1.3506 1.0652 0.9767 0.8486 0.60602
Proportion of Variance 0.3648 0.2269 0.1908 0.1440 0.07345
Cumulative Proportion 0.3648 0.5917 0.7825 0.9265 1.00000
From the table above we can conclude that 4 principal components are fair enough to explain the variance of the whole data set. However, we will also visualise the number of PCs on a plot to check the cohesion of the results.
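One possible way to draw that plot is factoextra's scree plot (pca.fit as in the sketch above):
#scree plot: percentage of variance explained by each principal component
fviz_eig(pca.fit, addlabels = TRUE, ylim = c(0, 50))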
It also confirms that, when we sum up the percentages of explained variance, 4 PCs are enough to describe this data.
In the final stage let's check the loadings of our PCA, to see what is inside each component (in this analysis we will omit the 5th principal component because, as stated before, 4 are enough).
Standard deviations (1, .., p=5):
[1] 1.3505903 1.0651936 0.9766945 0.8485742 0.6060182
Rotation (n x k) = (5 x 4):
PC1 PC2 PC3 PC4
movie_score -0.48289385 0.25217943 -0.15111335 0.80885185
seen -0.57761303 -0.31391114 0.23605121 -0.32982115
will_see -0.64726638 0.02398036 0.05336807 -0.23976545
country -0.06684445 -0.48651065 -0.86967370 -0.04991279
genre -0.09876628 0.77498236 -0.40282016 -0.42071123
We should keep in mind that the usual minimum threshold for a meaningful loading on a principal component is around 0.3-0.5 (or -0.5 to -0.3); here we will not treat anything between -0.5 and 0.5 as significant.
We can describe the table above as follows:
- In PC1 the most important variables, both negatively loaded, are will_see (-0.64726638) and seen (-0.57761303)
- PC2 best describes genre, with a high positive loading of 0.77498236
- PC3 best describes country, with a high negative loading of -0.86967370
- In PC4 the most important variable is movie_score, with a loading of 0.80885185
4.3 Visualisation
First of all we will visualise our data as a 2D plot with respect to two main Principal Components
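A sketch of such a biplot with the already-loaded ggbiplot package (pca.fit as above; the styling arguments are assumptions):
#biplot of observations and variable loadings on the first two principal components
ggbiplot(pca.fit, obs.scale = 1, var.scale = 1, alpha = 0.1, varname.size = 4)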
We can see on the plot, that variables: genre and country are dominated by the Second Principal Component (which explains 22.7% of total variance), where movie_score, will_see and seen are dominated by First Principal Component (which explains 36.5% of total variance).
From the plot we can also read correlations from the angles between the variable arrows: vectors at roughly right angles are essentially uncorrelated, vectors separated by less than about 60 degrees appear positively correlated, and vectors pointing in nearly opposite directions are negatively correlated.
Thus, keeping in mind the previous observations, the summary and this plot, we can say that when it comes to choosing a movie, viewers are likely to:
- check how many people have seen the movie and how many want to see it,
- look at the movie score together with how many people want to see the movie,
- or, finally, choose by genre or by release country alone, without concerning themselves with the other variables.
Let's plot it with respect to units (standardized set):
Now with labels (colours correspond to genres):
Summarizing PCA as a dimension reduction method, we can say that it can be superior to MDS in terms of getting a better overview of the variables and of our data. PCA sorts our variables with respect to the correlations between them, and then reduces the dimension of the whole data set by assigning multiple variables to a newly created superior one, reflected in principal component no. 1, no. 2, etc.
Analysing this kind of result can be really valuable, mostly because we can reduce our data set to 2 or 3 principal components, which is easier to plot and understand. After creating those components we can assign new meaning to them, according to the variables inside them. In this way we can explain, for example, customer behaviour when it comes to purchasing goods, where there are multiple variables to consider (design, content, brand, etc.).
5 Hierarchical Clustering
This is the final part about the methods we mentioned in the beginning. In this chapter we will perform hierarchical clustering on our data set, visualise it and give a brief overall summary.
We will perform hierarchical clustering using the stats package and the function hclust().
5.1 Preparations
First of all we have to standardize our data.
Then we have to measure the distances between observations and store the clustering result in a variable.
Let's also check the density, to see what the distribution of distances looks like.
And visualise a trial dendrogram.
Now we will limit the observations to 5 standard deviations and cut the tree. Then we will proceed to the summary.
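The preparation chunks are not echoed in the rendered output; a minimal sketch, reusing the standardized data and the hclustering object name used later in the plotting code (the linkage method and the 5-standard-deviation cut are assumptions about the author's choices):
#standardize, compute Euclidean distances, run agglomerative clustering
standar_database = scale(database)
dist_matrix = dist(standar_database, method = "euclidean")
hclustering = hclust(dist_matrix, method = "ward.D2")
plot(density(as.numeric(dist_matrix)))                            #distribution of pairwise distances
plot(hclustering, labels = FALSE)                                 #trial dendrogram
groups = cutree(hclustering, h = 5 * sd(as.numeric(dist_matrix))) #cut the tree at 5 "standard deviations"
summary(groups) then gives the cluster-membership summaries discussed in the next subsection.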
5.2 Summary
In this part we will check summaries of the clustering produced by hclust; maybe we will find something interesting?
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.000 2.000 6.492 12.000 21.000
A maximum of 21 clusters? That's a lot, but we won't give up; let's change the stdev value to end up with 3-4 clusters.
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.000 1.000 1.264 1.000 4.000
Neighbors within 10 standard deviations is still A LOT; let's check another way.
Now we will set the number of clusters manually.
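Presumably something like cutree() with an explicit k; k = 3 is an assumption consistent with the summary below:
groups_k = cutree(hclustering, k = 3)   #force exactly three clusters
summary(groups_k)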
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 1.00 1.00 1.26 1.00 3.00
Now it looks reasonable, let's plot it.
Fair enough, let's proceed to the visualisation.
5.3 Visualisation
In this part we will try to visualise these dendrograms in a more appealing way.
First, let's try a triangle layout.
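A sketch of that, assuming base plot.dendrogram() with a triangle layout:
dend = as.dendrogram(hclustering)
plot(dend, type = "triangle", leaflab = "none")   #suppress the unreadable leaf labels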
Looks nice, a bit like modern art, let’s try with another one!
In the case above, we have trimmed the triangle dendrogram a little to make it more readable, but we still have a problem with labelling because of the sheer number of observations.
But, anyway, let’s make some magic!
nodePar = list(lab.cex = 1, pch = c(NA, 19),
cex = 10, col = "green")
plot(as.phylo(hclustering), type = "cladogram", cex = 0.6,
edge.color = "steelblue", edge.width = 2, edge.lty = 2,
tip.color = "steelblue")We can see that Hierarchical Clustering is really cool tool when it comes to visualisation, however main limitation of this method is amount of observations. The more we have, the less clarity it has, thus we have to delimit the amount of observations and visualise them separately, but it would give us enormus amount of dendrograms, hard to interpret. Yet, I really recommend this method for small data sets, maybe up to 500 observations, where we can clarify the vision, and plot it in a proper way.
6 Final Word
Throughout this paper we took a closer look at methods of dimension scaling and hierarchical clustering. First of all, we did web scraping, which is a very useful tool for gathering data from sources that don't provide ready-to-use databases. Then a short introduction to the topic was given: we provided some handmade definitions and visual examples taken from the internet to better understand the subject. After that, we briefly elaborated on every method, based on the example of our scraped database. In the end, we discovered new ways of visualising data with respect to each of those methods, which can be very useful in future work.
I hope this paper was, and will be, useful for you, reader, in your work and field of interest. I also encourage you to check out my other paper about clustering methods, which is available here.
7 References
http://www.sthda.com/english/wiki/beautiful-dendrogram-visualizations-in-r-5-must-known-methods-unsupervised-machine-learning
https://en.wikipedia.org/wiki/Principal_component_analysis
https://en.wikipedia.org/wiki/Hierarchical_clustering