Dimensionality reduction is the process of decreasing the number of variables under consideration by obtaining a set of principal variables. A lower number of features makes it easier to visualize the dataset and leads to better human interpretation.
In this article, I will try to describe a large number of young people's hobbies by a smaller number of features. To do that, I will use Multidimensional Scaling and compare the results to Hierarchical Clustering.
The article is organized as follows: firstly, I will describe the methodology used in the research, focusing on Multidimensional Scaling, Hierarchical Clustering and the handling of missing data. Then I will describe the dataset and impute the missing data. In the next chapter I will present the results of Multidimensional Scaling with k-means clusters. After that I will show the results of Hierarchical Clustering. Finally, I will discuss the results and possible directions for further research.
Multidimensional Scaling (MDS) is a technique that helps to visualize the relative positions of objects, given only a table of distances between them. MDS can reduce dimensionality to one, two, three or even more dimensions. The table of distances is created either directly from experiments or is derived from a correlation matrix.
Trevor F. Cox and Michael A. A. Cox (2000) in their comprehensive text provide the following definition of MDS:
A narrow definition of multidimensional scaling (often abbreviated to MDS) is the search for a low dimensional space, usually Euclidean, in which points in the space represent the objects (whiskies), one point representing one object, and such that the distances between the points in the space, \(d_{rs}\), match, as well as possible, the original dissimilarities \(\delta_{rs}\). The techniques used for the search for the space and the associated configuration of points form metric and nonmetric multidimensional scaling.1
There are two general approaches to the MDS problem. Metric (classical) MDS tries to reproduce the original distances, while non-metric MDS considers only the rank order of the distances, so the original distances are not reproduced. Metric MDS is applicable only when the dissimilarities are measured on an interval or ratio scale; if the data are ordinal, non-metric MDS should be used.
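To make the distinction concrete, below is a minimal sketch of both approaches, using R's built-in eurodist road-distance matrix purely as an illustration; the mds function comes from the smacof package, which is also used for the survey data later in this article.

library(smacof) # non-metric MDS

# Metric (classical) MDS: tries to reproduce the distances themselves
conf.metric <- cmdscale(eurodist, k = 2)

# Non-metric (ordinal) MDS: uses only the rank order of the distances
conf.ordinal <- mds(eurodist, ndim = 2, type = "ordinal")

head(conf.metric)   # coordinates of the first cities
conf.ordinal$stress # stress of the ordinal solution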
In MDS we try to model the distances, so the goodness-of-fit statistic compares the distances between the projected points in the low-dimensional space with the original dissimilarities. This measure is called stress:
\[stress = \sqrt{\frac{\sum{(d_{ij}-\hat{d}_{ij})^2}}{\sum{d_{ij}^2}}}\] where \(\hat{d}_{ij}\) is the distance predicted by the MDS model and \(d_{ij}\) is the original dissimilarity.
The lower the value of stress, the better the fit; ideally, the stress value would be zero. According to Kruskal, the rules of thumb for stress are as follows: "Our experience with experimental and synthetic data suggests the following verbal evaluation: 20% poor, 10% fair, 5% good, 2.5% excellent, 0% perfect."2
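As a small worked example, the stress formula above can be computed directly. This is only a sketch, assuming the original dissimilarities and the distances reproduced by an MDS solution are available as plain numeric vectors; the numbers below are made up.

# stress as defined above: d = original dissimilarities, d_hat = distances from the MDS model
stress_value <- function(d, d_hat) {
  sqrt(sum((d - d_hat)^2) / sum(d^2))
}

d     <- c(2.0, 3.5, 1.2, 4.1) # made-up dissimilarities
d_hat <- c(2.1, 3.3, 1.0, 4.4) # made-up fitted distances
stress_value(d, d_hat)         # the closer to 0, the better the fit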
Hierarchical clustering is a method for grouping similar objects into clusters. The graphical representation of the hierarchical relationship between the clusters is a dendrogram (classification tree). It can be cut at a selected height, so that we obtain the desired number of clusters.
The agglomerative algorithm for hierarchical clustering might be described as follows:
1. Assign each object to its own cluster.
2. Compute the distances (similarities) between every pair of clusters.
3. Merge the two closest (most similar) clusters.
4. Repeat steps 2 and 3 until all objects belong to a single cluster.
The measure of distance (similarity) used in step 2 can be any appropriate distance metric, e.g. the Euclidean distance.
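A minimal sketch of this procedure in base R, on randomly generated toy data rather than the survey data, might look as follows:

set.seed(1)
toy <- matrix(rnorm(20), nrow = 10)    # 10 objects described by 2 features
d   <- dist(toy, method = "euclidean") # step 2: pairwise Euclidean distances
hc  <- hclust(d, method = "complete")  # agglomerative clustering, complete linkage
plot(hc)                               # dendrogram
cutree(hc, k = 3)                      # cut the tree into 3 clusters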
Missingness is usually a small inconvenience in research and not the main issue; however, handling it in an incorrect way may lead to results that are biased, inefficient and unreliable. The first and crucial step in dealing with missing data is the analysis of missing data patterns. In general, there are three missing data mechanisms:
Missing Completely at Random (MCAR) - there is no relationship between the propensity of missing values and any values (neither observed nor missing);
Missing at Random (MAR) - there is a systematic relationship between the missingness of the data and the observed data but not the missing data, e.g. if women are more likely than men to withhold information about their weight, then weight is MAR (illustrated in the sketch after this list);
Missing Not at Random (MNAR) - there is a relationship between the missingness of the data and the values of the missing data themselves, e.g. if people with lower education are more likely to conceal their education level.
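To make the MAR mechanism concrete, here is a small simulation sketch of the weight example; the variables gender and weight are hypothetical and generated only for illustration.

set.seed(123)
n      <- 1000
gender <- sample(c("female", "male"), n, replace = TRUE)
weight <- rnorm(n, mean = ifelse(gender == "male", 80, 65), sd = 10)

# women withhold their weight more often than men -> missingness depends
# only on the observed variable gender, so weight is MAR (not MCAR)
p_miss <- ifelse(gender == "female", 0.4, 0.1)
weight[runif(n) < p_miss] <- NA

tapply(is.na(weight), gender, mean) # observed share of missing weight by gender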
To test whether the missing data mechanism of the dataset is missing completely at random (MCAR), the R package MissMech3 can be used. The TestMCARNormality function follows the methodology proposed by Jamshidian and Jalal (2010). It is based on testing the equality of covariances between groups consisting of identical missing data patterns. According to the authors:
If this test is rejected, it will be concluded that the covariances are non-homogeneous, and data are not MCAR.4
Once we establish the missing data mechanism, we can choose an appropriate way of handling it. One of the most popular methods for missing data is case deletion, also known as listwise deletion (LD). LD involves deleting from the dataset all observations that have a missing value on any of the variables. When the missing data are not MCAR, the results of case deletion may be biased. In other words, listwise deletion requires that the data are MCAR. However, even when MCAR holds, case deletion may still be inefficient.5
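For reference, listwise deletion itself is a one-liner in base R. The sketch below assumes data is the survey data frame analyzed later in this article.

data_complete <- na.omit(data)     # drop every row with at least one NA
nrow(data) - nrow(data_complete)   # number of observations that would be lost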
Schafer and Graham (2002) argue that there are two highly recommended methods for handling missing data: maximum likelihood (ML) and Bayesian multiple imputation (MI). These methods also require assumptions about the distribution of missingness. The ML method uses maximum likelihood estimation to compute the values of the parameters that are most likely to have produced the observed data. This method provides unbiased parameter estimates, but it is limited to linear models. MI imputes the missing values multiple times, resulting in multiple complete datasets. Each dataset is then analyzed and the results are combined. MI also yields approximately unbiased estimates, and the variation across the imputed datasets reflects the uncertainty caused by the missing values.
The R function mice from the mice6 package is designed to create multiple imputations (replacement values) for multivariate missing data. Each incomplete variable is imputed by a separate model (Fully Conditional Specification). According to the authors, mice can handle both MAR and missing not at random (MNAR) data; multiple imputation under MNAR requires additional modeling assumptions that influence the generated imputations.7
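The full MI workflow described above (impute, analyze each completed dataset, combine the results) could be sketched as follows; the linear model is only a hypothetical analysis step chosen for illustration, using hobby columns from this dataset.

library(mice)

imp <- mice(data, m = 5, method = "pmm", seed = 500)  # 5 imputed datasets
fit <- with(imp, lm(History ~ Politics + Psychology)) # fit the model on each completed dataset
summary(pool(fit))                                    # pool the estimates with Rubin's rules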
The data come from a survey conducted by the Faculty of Social and Economic Sciences of Comenius University in Bratislava. The original dataset and the description of the survey are available at https://www.kaggle.com/miroslavsabo/young-people-survey. Students of the Statistics class were asked to invite their friends to fill in a questionnaire containing 150 questions concerning their hobbies, personality traits and demographics. The questionnaire was filled in by participants in both electronic and written form. The original survey was presented in Slovak and later translated into English. All participants are Slovaks aged between 15 and 30.
In this paper, the dataset has been limited to the area of hobbies and interests. The full list of questions is as follows:
| Original description | Short name | Scale |
|---|---|---|
| History | History | Not interested 1-2-3-4-5 Very interested |
| Psychology | Psychology | Not interested 1-2-3-4-5 Very interested |
| Politics | Politics | Not interested 1-2-3-4-5 Very interested |
| Mathematics | Mathematics | Not interested 1-2-3-4-5 Very interested |
| Physics | Physics | Not interested 1-2-3-4-5 Very interested |
| Internet | Internet | Not interested 1-2-3-4-5 Very interested |
| PC Software, Hardware | PC | Not interested 1-2-3-4-5 Very interested |
| Economy, Management | Economy Management | Not interested 1-2-3-4-5 Very interested |
| Biology | Biology | Not interested 1-2-3-4-5 Very interested |
| Chemistry | Chemistry | Not interested 1-2-3-4-5 Very interested |
| Poetry reading | Reading | Not interested 1-2-3-4-5 Very interested |
| Geography | Geography | Not interested 1-2-3-4-5 Very interested |
| Foreign languages | Foreign languages | Not interested 1-2-3-4-5 Very interested |
| Medicine | Medicine | Not interested 1-2-3-4-5 Very interested |
| Law | Law | Not interested 1-2-3-4-5 Very interested |
| Cars | Cars | Not interested 1-2-3-4-5 Very interested |
| Art | Art exhibitions | Not interested 1-2-3-4-5 Very interested |
| Religion | Religion | Not interested 1-2-3-4-5 Very interested |
| Outdoor activities | Countryside, outdoors | Not interested 1-2-3-4-5 Very interested |
| Dancing | Dancing | Not interested 1-2-3-4-5 Very interested |
| Playing musical instruments | Musical instruments | Not interested 1-2-3-4-5 Very interested |
| Poetry writing | Writing | Not interested 1-2-3-4-5 Very interested |
| Sport and leisure activities | Passive sport | Not interested 1-2-3-4-5 Very interested |
| Sport at competitive level | Active sport | Not interested 1-2-3-4-5 Very interested |
| Gardening | Gardening | Not interested 1-2-3-4-5 Very interested |
| Celebrity lifestyle | Celebrities | Not interested 1-2-3-4-5 Very interested |
| Shopping | Shopping | Not interested 1-2-3-4-5 Very interested |
| Science and technology | Science and technology | Not interested 1-2-3-4-5 Very interested |
| Theatre | Theatre | Not interested 1-2-3-4-5 Very interested |
| Socializing | Fun with friends | Not interested 1-2-3-4-5 Very interested |
| Adrenaline sports | Adrenaline sports | Not interested 1-2-3-4-5 Very interested |
| Pets | Pets | Not interested 1-2-3-4-5 Very interested |
The original dataset contains missing values (15.45% of rows contain at least one missing value). To decide how to handle them, I will first check how many there are in each column and in each row.
missing_in_cols <- sapply(data, function(x) sum(is.na(x))/nrow(data)) # share of missing values per column
percent(missing_in_cols) # display as percentages
## History Psychology Politics
## 0.20% 0.50% 0.10%
## Mathematics Physics Internet
## 0.30% 0.30% 0.40%
## PC Economy.Management Biology
## 0.59% 0.50% 0.59%
## Chemistry Reading Geography
## 0.99% 0.59% 0.89%
## Foreign.languages Medicine Law
## 0.50% 0.50% 0.10%
## Cars Art.exhibitions Religion
## 0.40% 0.59% 0.30%
## Countryside..outdoors Dancing Musical.instruments
## 0.69% 0.30% 0.10%
## Writing Passive.sport Active.sport
## 0.59% 1.49% 0.40%
## Gardening Celebrities Shopping
## 0.69% 0.20% 0.20%
## Science.and.technology Theatre Fun.with.friends
## 0.59% 0.79% 0.40%
## Adrenaline.sports Pets
## 0.30% 0.40%
missing_in_rows <- rowSums(is.na(data))
missing_in_rows.t <- t(table(missing_in_rows))
rownames(missing_in_rows.t) <- c("Frequency of missing values in observations")
kable(missing_in_rows.t, format = "html", row.names = T) %>%
kable_styling(bootstrap_options = c("striped", "hover")) %>%
add_header_above(c(" ", "Number of missing values" = 6)) %>%
column_spec(2:7, width = "7em")
|  | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| Frequency of missing values in observations | 886 | 103 | 15 | 2 | 3 | 1 |
In most cases, only one value per row is missing. Moreover, the missingness seems to be evenly distributed across the columns. The data might therefore be missing completely at random (MCAR). To test this, I will perform the nonparametric test of MCAR.
MCAR_test <- TestMCARNormality(data = data, method = "Nonparametric")
MCAR_test$pnormality
## [1] 0.3634555
The nonparametric test of homoscedasticity results in a high p-value (0.3635), so we fail to reject the null hypothesis of homoscedasticity. Thus, we can conclude that there is no evidence that the data are not MCAR, and we could use listwise deletion, maximum likelihood (ML) or Bayesian multiple imputation (MI). Because there is no evidence against MCAR, listwise deletion should not introduce bias; however, it would mean discarding over 15% of the observations. In order not to lose this information, multiple imputation will be used.
imputed_Data <- mice(data, m=5, maxit = 50, method = 'pmm', seed = 500) # 5 imputations by predictive mean matching
data <- complete(imputed_Data) # extract the first completed dataset
Having imputed the missing values, I will plot a bar chart for each variable of the complete dataset.
melt.data <- melt(data, id.vars = NULL)
ggplot(data = melt.data) +
geom_bar(aes(x = value)) +
theme_classic() +
xlab(label = NULL) + ylab(label = NULL) + ggtitle("Frequency distribution of answers by categories") +
theme(plot.title = element_text(hjust = 0.5, size = 14)) +
facet_wrap(~ variable, scales = "free", ncol = 4)
The bar charts show quite clearly that the variables have various distributions. There are hobbies that are generally popular, e.g. spending time with friends and browsing the Internet, as well as hobbies that are mostly disliked, e.g. writing and gardening. There are also hobbies whose answers are distributed quite evenly, like adrenaline sports.
To understand the data more profoundly, I will also produce a correlation plot. As the data are ordinal, I will use Spearman's rank correlation coefficient.
cor.matrix <- cor(data, method = c("spearman"))
corrplot(cor.matrix, type = "lower", order = "hclust", tl.col = "black", tl.cex = 0.5)
The correlation plot has been ordered by hierarchical clustering. The strongest correlations appear to be between chemistry and biology, biology and medicine, chemistry and medicine, and mathematics and physics.
To visualize the level of similarity of individual hobbies, I will first generate the correlation matrix. Then I will convert the similarities to dissimilarities, which I will later use in MDS.
sim <-cor(data, method = "spearman") # Correlation Matrix
delta <- sim2diss(sim, method = 1, to.dist = TRUE) # Observed dissimilarities, not normalized
fit.data <- mds(delta, ndim = 2, type = "ordinal") # MDS for ordinal data
Stress per point (in %):
print(fit.data$spp, digits = 2)
## History Psychology Politics
## 4.88 3.38 4.89
## Mathematics Physics Internet
## 2.98 2.24 3.79
## PC Economy.Management Biology
## 1.70 4.47 4.22
## Chemistry Reading Geography
## 3.76 1.95 2.38
## Foreign.languages Medicine Law
## 5.09 2.76 3.30
## Cars Art.exhibitions Religion
## 1.34 0.95 1.97
## Countryside..outdoors Dancing Musical.instruments
## 4.11 2.29 3.61
## Writing Passive.sport Active.sport
## 1.85 3.73 4.48
## Gardening Celebrities Shopping
## 5.14 2.33 2.08
## Science.and.technology Theatre Fun.with.friends
## 1.94 1.68 2.74
## Adrenaline.sports Pets
## 2.83 5.15
Total stress:
print(fit.data$stress)
## [1] 0.2332024
df <- as.data.frame(fit.data$conf)
df$label <- rownames(df)
ggplot(df, aes(D1, D2, label = label)) +
geom_point(color = "red") +
geom_text_repel() +
theme_classic() +
ggtitle("Plot 1. Configuration Plot\n") +
theme(plot.title = element_text(hjust = 0.5, size = 12),
panel.background = element_rect(fill = "white", colour = "grey50", linetype = "dashed"))
The results seem reasonable, as hobbies that are similar (for example adrenaline sports and active sport, or art exhibitions and theatre) are plotted close to each other. The mds function also returns the stress per point, which shows the contribution of each point to the total stress (the larger a point's stress, the worse its fit).
However, the total stress value is quite high. A high stress value indicates that two dimensions might not be enough to represent the data. To check this, I will plot stress against the number of dimensions:
stress.matrix <- matrix(0, 9, 1)
for (i in 2:10){
fit.data.temp <- mds(delta, ndim = i, type = "ordinal") # MDS for ordinal data
stress.matrix[i-1,1] <- fit.data.temp$stress
}
stress.df <- as.data.frame(stress.matrix)
stress.df$Num.of.dim <- 2:10
colnames(stress.df) <- c("Stress.value", "Number.of.dimensions")
ggplot(stress.df, aes(x=Number.of.dimensions, y=Stress.value)) +
geom_point() + geom_line() +
geom_hline(yintercept=c(0.2, 0.1), linetype="dashed", color = "red", size=1) +
scale_x_continuous(breaks = seq(1,10))
The "elbow" in the curve seems to be at 3 dimensions. The total stress for 3 dimensions is about 13%, which, according to the rule of thumb quoted above, is acceptable. Therefore, I will use 3 dimensions in MDS.
fit.data.3 <- mds(delta, ndim = 3, type = "ordinal") # MDS for ordinal data
df.3 <- as.data.frame(fit.data.3$conf)
I will also use k-means to cluster similar hobbies. To do that, I need to choose the optimal number of clusters, for which I will use the NbClust package; it calculates 26 indices for determining the number of clusters.
nb <- NbClust(df.3, distance="euclidean", min.nc=2, max.nc=10, method="ward.D2", index="all")
nb_best <- t(nb$Best.nc)
nb_best <- data.frame(nb_best)
barplot(table(nb_best[,"Number_clusters"]),
col=c("grey"),
main="Optimal number of clusters",
xlab="Number of clusters",
ylab="Frequency among all indices")
The majority of indices suggest using 2 or 8 clusters. In my analysis, I will use 8 clusters.
km <- eclust(df.3,FUNcluster="kmeans", k=8,hc_metric = "euclidean")
Now, I will plot the MDS results in 3 dimensions, broken down into clusters.
df.3$Cluster <- km$cluster
t <- list(
family = "sans serif",
size = 12,
color = toRGB("grey50"))
p <- plot_ly(df.3, x = ~D1, y = ~D2, z = ~D3, color = ~Cluster, text = rownames(df.3)) %>%
add_markers() %>%
add_text(textfont = t, textposition = "top right") %>%
layout(scene = list(xaxis = list(title = 'D1'),
yaxis = list(title = 'D2'),
zaxis = list(title = 'D3')))
p
The results are not as easy to interpret as they were with two dimensions. However, we can still see that similar hobbies are close to each other, e.g. social sciences (Law, Economics, Geography and Politics) or artistic interests (writing, reading, theatre and art exhibitions).
I expect hierarchical clustering to give results similar to those above. First, I will standardize the data, then calculate the distances between the variables and plot a dendrogram.
data.standarized <- data.Normalization(data, type="n1",normalization="column") # n1 - standardization ((x-mean)/sd)
distance.m<-dist(t(data.standarized)) # distances between variables
hc<-hclust(distance.m, method="complete") # hierarchical clustering using Complete Linkage
plot(hc, hang=-1)
plot(density(distance.m))
Each leaf in the dendrogram shown above corresponds to one variable (hobby). As we move up the tree, similar variables are grouped into branches. To compare the results with the MDS and k-means analysis above, I will divide the tree into 8 clusters.
plot(hc, hang=-1)
rect.hclust(hc, k = 8, border = 2:5)
sub_grp<-cutree(hc, k=8) # division into 8 clusters
fviz_cluster(list(data = distance.m, cluster = sub_grp))
There are noticeable differences between the clusters generated from the hierarchical tree and those obtained from k-means on the MDS configuration.
The main goal of this research was to examine whether human interests can be described by a smaller number of latent concepts through dimension reduction. The results of the analysis suggest that human interests cannot be described by two dimensions. However, they can possibly be visualized in three dimensions and clustered into small groups. K-means clustering on the MDS configuration grouped hobbies into 8 clusters, and the results are rather consistent with intuition - e.g. writing, reading, theatre and art exhibitions form a cluster. In this article, MDS and hierarchical clustering were applied to the same dataset to explore whether the two methods give the same results; it was shown that the final results differ. Further research could involve using Factor Analysis to search for underlying factors that might explain the correlations among human interests. Additionally, adding more variables from the Young People Survey dataset, such as gender or character traits, might also help to explain the variation in human interests.
Trevor F. Cox and Michael A. A. Cox. 2000. "Multidimensional Scaling". Chapman and Hall.↩
Kruskal, J. B. 1964. “Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis”. Psychometrika, 29, 1–27↩
https://cran.r-project.org/web/packages/MissMech/MissMech.pdf↩
Jamshidian, Mortaza & Jalal, Siavash & Jansen, Camden. 2014. “MissMech: An R Package for Testing Homoscedasticity, Multivariate Normality, and Missing Completely at Random (MCAR)” Journal of statistical software 56(6): 1-31↩
Joseph L. Schafer and John W. Graham. 2002. “Missing Data: Our View of the State of the Art.” Psychological Methods 7(2): 147-177↩
Stef van Buuren, Karin Groothuis-Oudshoorn. 2011. “mice: Multivariate Imputation by Chained Equations in R.” Journal of statistical software 45(3), 1–67.↩