Dimensionality reduction is the process of decreasing the number of variables under consideration by obtaining a set of principal variables. A lower number of features makes it easier to visualize the dataset and leads to better human interpretation.
In this article, I will try to describe a large number of young people's hobbies by a smaller number of features. To do that, I will use Multidimensional Scaling and compare the results to Hierarchical Clustering.
The article is organized as follows: firstly, I will describe the methodology used in the research, focusing on Multidimensional Scaling, Hierarchical Clustering and the handling of missing data. Then I will describe the dataset and impute the missing data. In the next chapter I will present the results of Multidimensional Scaling with k-means clusters. After that I will show the results of Hierarchical Clustering. Finally, I will discuss the results and possible directions for further research.
Multidimensional Scaling (MDS) is a technique that helps to visualize the relative positions of objects, given only a table of distances between them. MDS can reduce dimensionality to one, two, three or even more dimensions. The table of distances is created either directly from experiments or is derived from a correlation matrix.
Trevor F. Cox and Michael A. A. Cox (2000) in their comprehensive text provide the following definition of MDS:
A narrow definition of multidimensional scaling (often abbreviated to MDS) is the search for a low dimensional space, usually Euclidean, in which points in the space represent the objects (whiskies), one point representing one object, and such that the distances between the points in the space, \(d_{rs}\), match, as well as possible, the original dissimilarities \(\delta_{rs}\). The techniques used for the search for the space and the associated configuration of points form metric and nonmetric multidimensional scaling.1
There are two general approaches to the MDS problem. Metric (classical) MDS tries to reproduce the original distances, while non-metric MDS considers only the rank order of the distances, so the original distances are not reproduced. Metric MDS is applicable only when the dissimilarities are measured on an interval or ratio scale; if the data are ordinal, non-metric MDS should be used.
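To make the distinction concrete, below is a minimal sketch of both approaches, using R's built-in eurodist road-distance matrix purely as an illustration; the mds function comes from the smacof package, which is also used for the survey data later in this article.

library(smacof) # non-metric MDS

# Metric (classical) MDS: tries to reproduce the distances themselves
conf.metric <- cmdscale(eurodist, k = 2)

# Non-metric (ordinal) MDS: uses only the rank order of the distances
conf.ordinal <- mds(eurodist, ndim = 2, type = "ordinal")

head(conf.metric)   # coordinates of the first cities
conf.ordinal$stress # stress of the ordinal solution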
In MDS we try to model the distances, so the goodness-of-fit statistic compares the distances between the projected points in the low-dimensional space with the original dissimilarities. This measure is called stress:
\[stress = \sqrt{\frac{\sum{(d_{ij}-\hat{d}_{ij})^2}}{\sum{d_{ij}^2}}}\] where \(\hat{d}_{ij}\) is the distance predicted by the MDS model and \(d_{ij}\) is the original dissimilarity.
The lower the value of stress, the better the fit; ideally, the stress value would be zero. According to Kruskal, the rules of thumb for stress are as follows: "Our experience with experimental and synthetic data suggests the following verbal evaluation: 20% poor, 10% fair, 5% good, 2.5% excellent, 0% perfect."2
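As a small worked example, the stress formula above can be computed directly. This is only a sketch, assuming the original dissimilarities and the distances reproduced by an MDS solution are available as plain numeric vectors; the numbers below are made up.

# stress as defined above: d = original dissimilarities, d_hat = distances from the MDS model
stress_value <- function(d, d_hat) {
  sqrt(sum((d - d_hat)^2) / sum(d^2))
}

d     <- c(2.0, 3.5, 1.2, 4.1) # made-up dissimilarities
d_hat <- c(2.1, 3.3, 1.0, 4.4) # made-up fitted distances
stress_value(d, d_hat)         # the closer to 0, the better the fit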
Hierarchical clustering is a method for grouping similar objects into clusters. The graphical representation of the hierarchical relationship between the clusters is a dendrogram (classification tree). It can be cut at a selected height, so that we obtain the desired number of clusters.
The agglomerative algorithm for hierarchical clustering might be described as follows:
1. Assign each object to its own cluster.
2. Compute the distances (similarities) between every pair of clusters.
3. Merge the two closest (most similar) clusters.
4. Repeat steps 2 and 3 until all objects belong to a single cluster.
The measure of distance (similarity) used in step 2 can be any appropriate distance metric, e.g. the Euclidean distance.
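A minimal sketch of this procedure in base R, on randomly generated toy data rather than the survey data, might look as follows:

set.seed(1)
toy <- matrix(rnorm(20), nrow = 10)    # 10 objects described by 2 features
d   <- dist(toy, method = "euclidean") # step 2: pairwise Euclidean distances
hc  <- hclust(d, method = "complete")  # agglomerative clustering, complete linkage
plot(hc)                               # dendrogram
cutree(hc, k = 3)                      # cut the tree into 3 clusters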
Missingness is usually a small inconvenience in research and not the main issue; however, handling it in an incorrect way may lead to results that are biased, inefficient and unreliable. The first and crucial step in dealing with missing data is the analysis of missing data patterns. In general, there are three missing data mechanisms:
Missing Completely at Random (MCAR) - there is no relationship between the propensity of missing values and any values (neither observed nor missing);
Missing at Random (MAR) - there is a systematic relationship between the missingness of the data and the observed data but not the missing data, e.g. if women are more likely than men to withhold information about their weight, then weight is MAR (illustrated in the sketch after this list);
Missing Not at Random (MNAR) - there is a relationship between the missingness of the data and the values of the missing data themselves, e.g. if people with lower education are more likely to conceal their education level.
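To make the MAR mechanism concrete, here is a small simulation sketch of the weight example; the variables gender and weight are hypothetical and generated only for illustration.

set.seed(123)
n      <- 1000
gender <- sample(c("female", "male"), n, replace = TRUE)
weight <- rnorm(n, mean = ifelse(gender == "male", 80, 65), sd = 10)

# women withhold their weight more often than men -> missingness depends
# only on the observed variable gender, so weight is MAR (not MCAR)
p_miss <- ifelse(gender == "female", 0.4, 0.1)
weight[runif(n) < p_miss] <- NA

tapply(is.na(weight), gender, mean) # observed share of missing weight by gender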
To test whether the missing data mechanism of the dataset is missing completely at random (MCAR), the R package MissMech3 can be used. The TestMCARNormality function follows the methodology proposed by Jamshidian and Jalal (2010). It is based on testing the equality of covariances between groups consisting of identical missing data patterns. According to the authors:
If this test is rejected, it will be concluded that the covariances are non-homogeneous, and data are not MCAR.4
Once we establish the missing data mechanism, we can choose an appropriate way of handling it. One of the most popular methods for missing data is case deletion, also known as listwise deletion (LD). LD involves deleting from the dataset all observations that have a missing value on any of the variables. When the missing data are not MCAR, the results of case deletion may be biased. In other words, listwise deletion requires that the data are MCAR. However, even when MCAR holds, case deletion may still be inefficient.5
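For reference, listwise deletion itself is a one-liner in base R. The sketch below assumes data is the survey data frame analyzed later in this article.

data_complete <- na.omit(data)     # drop every row with at least one NA
nrow(data) - nrow(data_complete)   # number of observations that would be lost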
Schafer and Graham (2002) argue that there are two highly recommended methods for handling missing data: maximum likelihood (ML) and Bayesian multiple imputation (MI). These methods also require assumptions about the distribution of missingness. The ML method uses maximum likelihood estimation to compute the values of the parameters that are most likely to have produced the observed data. This method provides unbiased parameter estimates, but it is limited to linear models. MI imputes the missing values multiple times, resulting in multiple complete datasets. Each dataset is then analyzed and the results are combined. MI also yields approximately unbiased estimates, and the variation across the imputed datasets reflects the uncertainty caused by the missing values.
The R function mice from the mice6 package is designed to create multiple imputations (replacement values) for multivariate missing data. Each incomplete variable is imputed by a separate model (Fully Conditional Specification). According to the authors, mice can handle both MAR and missing not at random (MNAR) data; multiple imputation under MNAR requires additional modeling assumptions that influence the generated imputations.7
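The full MI workflow described above (impute, analyze each completed dataset, combine the results) could be sketched as follows; the linear model is only a hypothetical analysis step chosen for illustration, using hobby columns from this dataset.

library(mice)

imp <- mice(data, m = 5, method = "pmm", seed = 500)  # 5 imputed datasets
fit <- with(imp, lm(History ~ Politics + Psychology)) # fit the model on each completed dataset
summary(pool(fit))                                    # pool the estimates with Rubin's rules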
The data come from a survey conducted by the Faculty of Social and Economic Sciences of Comenius University in Bratislava. The original dataset and the description of the survey are available at https://www.kaggle.com/miroslavsabo/young-people-survey. Students of the Statistics class were asked to invite their friends to fill in a questionnaire containing 150 questions concerning their hobbies, personality traits and demographics. The questionnaire was filled in by participants in both electronic and written form. The original survey was presented in Slovak and later translated into English. All participants are Slovaks aged between 15 and 30.
In this paper, the dataset has been limited to the area of hobbies and interests. The full list of questions is as follows:
| Original description | Short name | Scale |
|---|---|---|
| History | History | Not interested 1-2-3-4-5 Very interested |
| Psychology | Psychology | Not interested 1-2-3-4-5 Very interested |
| Politics | Politics | Not interested 1-2-3-4-5 Very interested |
| Mathematics | Mathematics | Not interested 1-2-3-4-5 Very interested |
| Physics | Physics | Not interested 1-2-3-4-5 Very interested |
| Internet | Internet | Not interested 1-2-3-4-5 Very interested |
| PC Software, Hardware | PC | Not interested 1-2-3-4-5 Very interested |
| Economy, Management | Economy Management | Not interested 1-2-3-4-5 Very interested |
| Biology | Biology | Not interested 1-2-3-4-5 Very interested |
| Chemistry | Chemistry | Not interested 1-2-3-4-5 Very interested |
| Poetry reading | Reading | Not interested 1-2-3-4-5 Very interested |
| Geography | Geography | Not interested 1-2-3-4-5 Very interested |
| Foreign languages | Foreign languages | Not interested 1-2-3-4-5 Very interested |
| Medicine | Medicine | Not interested 1-2-3-4-5 Very interested |
| Law | Law | Not interested 1-2-3-4-5 Very interested |
| Cars | Cars | Not interested 1-2-3-4-5 Very interested |
| Art | Art exhibitions | Not interested 1-2-3-4-5 Very interested |
| Religion | Religion | Not interested 1-2-3-4-5 Very interested |
| Outdoor activities | Countryside, outdoors | Not interested 1-2-3-4-5 Very interested |
| Dancing | Dancing | Not interested 1-2-3-4-5 Very interested |
| Playing musical instruments | Musical instruments | Not interested 1-2-3-4-5 Very interested |
| Poetry writing | Writing | Not interested 1-2-3-4-5 Very interested |
| Sport and leisure activities | Passive sport | Not interested 1-2-3-4-5 Very interested |
| Sport at competitive level | Active sport | Not interested 1-2-3-4-5 Very interested |
| Gardening | Gardening | Not interested 1-2-3-4-5 Very interested |
| Celebrity lifestyle | Celebrities | Not interested 1-2-3-4-5 Very interested |
| Shopping | Shopping | Not interested 1-2-3-4-5 Very interested |
| Science and technology | Science and technology | Not interested 1-2-3-4-5 Very interested |
| Theatre | Theatre | Not interested 1-2-3-4-5 Very interested |
| Socializing | Fun with friends | Not interested 1-2-3-4-5 Very interested |
| Adrenaline sports | Adrenaline sports | Not interested 1-2-3-4-5 Very interested |
| Pets | Pets | Not interested 1-2-3-4-5 Very interested |
The original dataset contains missing values (15.45% of rows contain at least one missing value). To decide how to handle them, I will first check how many there are in each column and in each row.
missing_in_cols <- sapply(data, function(x) sum(is.na(x))/nrow(data)) # share of missing values per column
percent(missing_in_cols) # display as percentages
## History Psychology Politics
## 0.20% 0.50% 0.10%
## Mathematics Physics Internet
## 0.30% 0.30% 0.40%
## PC Economy.Management Biology
## 0.59% 0.50% 0.59%
## Chemistry Reading Geography
## 0.99% 0.59% 0.89%
## Foreign.languages Medicine Law
## 0.50% 0.50% 0.10%
## Cars Art.exhibitions Religion
## 0.40% 0.59% 0.30%
## Countryside..outdoors Dancing Musical.instruments
## 0.69% 0.30% 0.10%
## Writing Passive.sport Active.sport
## 0.59% 1.49% 0.40%
## Gardening Celebrities Shopping
## 0.69% 0.20% 0.20%
## Science.and.technology Theatre Fun.with.friends
## 0.59% 0.79% 0.40%
## Adrenaline.sports Pets
## 0.30% 0.40%
missing_in_rows <- rowSums(is.na(data))
missing_in_rows.t <- t(table(missing_in_rows))
rownames(missing_in_rows.t) <- c("Frequency of missing values in observations")
kable(missing_in_rows.t, format = "html", row.names = T) %>%
kable_styling(bootstrap_options = c("striped", "hover")) %>%
add_header_above(c(" ", "Number of missing values" = 6)) %>%
column_spec(2:7, width = "7em")
|  | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| Frequency of missing values in observations | 886 | 103 | 15 | 2 | 3 | 1 |
In most cases, only one value per row is missing. Moreover, the missingness seems to be evenly distributed across the columns. The data might therefore be missing completely at random (MCAR). To test this, I will perform the nonparametric test of MCAR.
MCAR_test <- TestMCARNormality(data = data, method = "Nonparametric")
MCAR_test$pnormality
## [1] 0.3634555
The nonparametric test of homoscedasticity results in a high p-value (0.3635), so we fail to reject the null hypothesis of homoscedasticity. Thus, we can conclude that there is no evidence that the data are not MCAR, and we could use listwise deletion, maximum likelihood (ML) or Bayesian multiple imputation (MI). Because there is no evidence against MCAR, listwise deletion should not introduce bias; however, it would mean discarding over 15% of the observations. In order not to lose this information, multiple imputation will be used.
imputed_Data <- mice(data, m=5, maxit = 50, method = 'pmm', seed = 500) # 5 imputations by predictive mean matching
data <- complete(imputed_Data) # extract the first completed dataset
Having imputed the missing values, I will plot a bar chart for each variable of the complete dataset.
melt.data <- melt(data, id.vars = NULL)
ggplot(data = melt.data) +
geom_bar(aes(x = value)) +
theme_classic() +
xlab(label = NULL) + ylab(label = NULL) + ggtitle("Frequency distribution of answers by categories") +
theme(plot.title = element_text(hjust = 0.5, size = 14)) +
facet_wrap(~ variable, scales = "free", ncol = 4)
The bar charts show quite clearly that the variables have various distributions. There are hobbies that are generally popular, e.g. spending time with friends and browsing the Internet, as well as hobbies that are mostly disliked, e.g. writing and gardening. There are also hobbies whose answers are distributed quite evenly, like adrenaline sports.
To understand the data more profoundly, I will also produce a correlation plot. As the data are ordinal, I will use Spearman's rank correlation coefficient.
cor.matrix <- cor(data, method = c("spearman"))
corrplot(cor.matrix, type = "lower", order = "hclust", tl.col = "black", tl.cex = 0.5)
The correlation plot has been ordered by hierarchical clustering. The strongest correlations appear to be between chemistry and biology, biology and medicine, chemistry and medicine, and mathematics and physics.
To visualize the level of similarity of individual hobbies, I will first generate the correlation matrix. Then I will convert the similarities to dissimilarities, which I will later use in MDS.
sim <-cor(data, method = "spearman") # Correlation Matrix
delta <- sim2diss(sim, method = 1, to.dist = TRUE) # Observed dissimilarities, not normalized
fit.data <- mds(delta, ndim = 2, type = "ordinal") # MDS for ordinal data
Stress per point (in %):
print(fit.data$spp, digits = 2)
## History Psychology Politics
## 4.88 3.38 4.89
## Mathematics Physics Internet
## 2.98 2.24 3.79
## PC Economy.Management Biology
## 1.70 4.47 4.22
## Chemistry Reading Geography
## 3.76 1.95 2.38
## Foreign.languages Medicine Law
## 5.09 2.76 3.30
## Cars Art.exhibitions Religion
## 1.34 0.95 1.97
## Countryside..outdoors Dancing Musical.instruments
## 4.11 2.29 3.61
## Writing Passive.sport Active.sport
## 1.85 3.73 4.48
## Gardening Celebrities Shopping
## 5.14 2.33 2.08
## Science.and.technology Theatre Fun.with.friends
## 1.94 1.68 2.74
## Adrenaline.sports Pets
## 2.83 5.15
Total stress:
print(fit.data$stress)
## [1] 0.2332024
df <- as.data.frame(fit.data$conf)
df$label <- rownames(df)
ggplot(df, aes(D1, D2, label = label)) +
geom_point(color = "red") +
geom_text_repel() +
theme_classic() +
ggtitle("Plot 1. Configuration Plot\n") +
theme(plot.title = element_text(hjust = 0.5, size = 12),
panel.background = element_rect(fill = "white", colour = "grey50", linetype = "dashed"))
The results seem reasonable, as hobbies that are similar (for example adrenaline sports and active sport, or art exhibitions and theatre) are plotted close to each other. The mds function also returns the stress per point, which shows the contribution of each point to the total stress (the larger a point's stress, the worse its fit).
However, the total stress value is quite high. A high stress value indicates that two dimensions might not be enough to represent the data. To check this, I will plot stress against the number of dimensions:
stress.matrix <- matrix(0, 9, 1)
for (i in 2:10){
fit.data.temp <- mds(delta, ndim = i, type = "ordinal") # MDS for ordinal data
stress.matrix[i-1,1] <- fit.data.temp$stress
}
stress.df <- as.data.frame(stress.matrix)
stress.df$Num.of.dim <- 2:10
colnames(stress.df) <- c("Stress.value", "Number.of.dimensions")
ggplot(stress.df, aes(x=Number.of.dimensions, y=Stress.value)) +
geom_point() + geom_line() +
geom_hline(yintercept=c(0.2, 0.1), linetype="dashed", color = "red", size=1) +
scale_x_continuous(breaks = seq(1,10))
The "elbow" in the curve seems to be at 3 dimensions. The total stress for 3 dimensions is about 13%, which, according to the rule of thumb quoted above, is acceptable. Therefore, I will use 3 dimensions in MDS.
fit.data.3 <- mds(delta, ndim = 3, type = "ordinal") # MDS for ordinal data
df.3 <- as.data.frame(fit.data.3$conf)
I will also use k-means to cluster similar hobbies. To do that, I need to choose the optimal number of clusters, for which I will use the NbClust package; it calculates 26 indices for determining the number of clusters.
nb <- NbClust(df.3, distance="euclidean", min.nc=2, max.nc=10, method="ward.D2", index="all")
nb_best <- t(nb$Best.nc)
nb_best <- data.frame(nb_best)
barplot(table(nb_best[,"Number_clusters"]),
col=c("grey"),
main="Optimal number of clusters",
xlab="Number of clusters",
ylab="Frequency among all indices")
The majority of indices suggest using 2 or 8 clusters. In my analysis, I will use 8 clusters.
km <- eclust(df.3,FUNcluster="kmeans", k=8,hc_metric = "euclidean")
Now, I will plot the MDS results in 3 dimensions, broken down into clusters.
df.3$Cluster <- km$cluster
t <- list(
family = "sans serif",
size = 12,
color = toRGB("grey50"))
p <- plot_ly(df.3, x = ~D1, y = ~D2, z = ~D3, color = ~Cluster, text = rownames(df.3)) %>%
add_markers() %>%
add_text(textfont = t, textposition = "top right") %>%
layout(scene = list(xaxis = list(title = 'D1'),
yaxis = list(title = 'D2'),
zaxis = list(title = 'D3')))
p
The results are not as easy to interpret as they were with two dimensions. However, we can still see that similar hobbies are close to each other, e.g. social sciences (Law, Economics, Geography and Politics) or artistic interests (writing, reading, theatre and art exhibitions).
I expect hierarchical clustering to give results similar to those above. First, I will standardize the data, then calculate the distances between the variables and plot a dendrogram.
data.standarized <- data.Normalization(data, type="n1",normalization="column") # n1 - standardization ((x-mean)/sd)
distance.m<-dist(t(data.standarized)) # distances between variables
hc<-hclust(distance.m, method="complete") # hierarchical clustering using Complete Linkage
plot(hc, hang=-1)
plot(density(distance.m))
Each leaf in the dendrogram shown above corresponds to one variable (hobby). As we move up the tree, similar variables are grouped into branches. To compare the results with the MDS and k-means analysis above, I will divide the tree into 8 clusters.
plot(hc, hang=-1)
rect.hclust(hc, k = 8, border = 2:5)
sub_grp<-cutree(hc, k=8) # division into 8 clusters
fviz_cluster(list(data = distance.m, cluster = sub_grp))
There are noticeable differences between the clusters generated from the hierarchical tree and those obtained from k-means on the MDS configuration.
The main goal of this research was to examine whether human interests can be described by a smaller number of latent concepts through dimension reduction. The results of the analysis suggest that human interests cannot be described by two dimensions. However, they can possibly be visualized in three dimensions and clustered into small groups. K-means clustering on the MDS configuration grouped hobbies into 8 clusters, and the results are rather consistent with intuition - e.g. writing, reading, theatre and art exhibitions form a cluster. In this article, MDS and hierarchical clustering were applied to the same dataset to explore whether the two methods give the same results; it was shown that the final results differ. Further research could involve using Factor Analysis to search for underlying factors that might explain the correlations among human interests. Additionally, adding more variables from the Young People Survey dataset, such as gender or character traits, might also help to explain the variation in human interests.
Trevor F. Cox and Michael A. A. Cox. 2000. "Multidimensional Scaling". Chapman and Hall.↩
Kruskal, J. B. 1964. “Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis”. Psychometrika, 29, 1–27↩
https://cran.r-project.org/web/packages/MissMech/MissMech.pdf↩
Jamshidian, Mortaza & Jalal, Siavash & Jansen, Camden. 2014. “MissMech: An R Package for Testing Homoscedasticity, Multivariate Normality, and Missing Completely at Random (MCAR)” Journal of statistical software 56(6): 1-31↩
Joseph L. Schafer and John W. Graham. 2002. “Missing Data: Our View of the State of the Art.” Psychological Methods 7(2): 147-177↩
Stef van Buuren, Karin Groothuis-Oudshoorn. 2011. “mice: Multivariate Imputation by Chained Equations in R.” Journal of statistical software 45(3), 1–67.↩