PROJECT II - DIMENSION REDUCTION

 

 

This project explores the Young People Survey data. The survey was conducted in 2013 by students of the Statistics class at FSEV UK. All of the respondents were of Slovak nationality. The dataset was downloaded from Kaggle (https://www.kaggle.com/miroslavsabo/young-people-survey).

 

The report is divided into … parts:

  1. Introduction
  2. EDA part
  3. Dimension Reduction Through PCA
  4. PCA - music preferences
  5. PCA - movies preferences
  6. PCA - personal interests - different knowledge domains
  7. PCA - personal interests - hobbies
  8. PCA - fears
  9. PCA - money spending habits
  10. Final Conclusions

 

The following libraries were used during the analysis:

  • corrplot
  • FactoMineR
  • stats
  • tidyverse
  • psych
  • data.table
  • knitr
  • rmarkdown
  • factoextra

 

INTRODUCTION

 

The original survey dataset consists of 150 columns and 1010 observations. The questions asked in the survey cover areas such as:

  • music preferences
  • movies preferences
  • personal interests concerning different domains (such as chemistry/medicine etc.)
  • hobbies
  • fears
  • lifestyle (drugs, workaholism etc.)
  • self-judgment (on socializing, children etc.)
  • personal finance and spending
  • demographic data

The vast majority of the variables are coded on a Likert scale (1-5). Five variables have descriptive answers instead, and the demographic data is coded differently as well.

The Likert-scale questions are going to be addressed by the dimension reduction techniques. The demographic and other data are going to be treated in the EDA section.

As far as missing data is concerned, the survey dataset contains many missing values: there are 571 NAs in the whole dataset. For each segment’s dimension reduction, I am going to use only the complete cases to build the principal components.
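The complete-case selection described above can be sketched on a toy data frame. The object `toy`, its column names and its values are made up for illustration; only `is.na()` and `complete.cases()` mirror what is applied to the survey data.

```r
# Toy stand-in for the survey: two "segments" of Likert answers with a few NAs.
# All names and values are illustrative assumptions, not the real data.
toy <- data.frame(
  Rock   = c(5, 4, NA, 2, 3),
  Pop    = c(3, NA, 4, 4, 5),
  Horror = c(1, 2, 3, NA, 5),
  Comedy = c(5, 5, 4, 4, 3)
)

# total number of missing values in the whole table
n_missing <- sum(is.na(toy))

# complete cases are selected per segment, so a gap in one segment
# does not remove the respondent from the other segments
music_cols  <- c("Rock", "Pop")
movies_cols <- c("Horror", "Comedy")
music_cc  <- toy[complete.cases(toy[, music_cols]),  music_cols]
movies_cc <- toy[complete.cases(toy[, movies_cols]), movies_cols]

n_missing        # 3
nrow(music_cc)   # 3
nrow(movies_cc)  # 4
```

Selecting complete cases per segment keeps more rows than dropping every respondent with any NA, at the cost of each segment being based on a slightly different sub-sample.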

<However, some of the missing data is going to be predicted by the original variables and principal components. We are going to perform such an exercise later.>

 

 

 

EDA

Let’s take a look at the dataset.

# inspect the dataset
paged_table(data[1:10,])
#str(data)
#describe(data)
#summary(data)

 

Firstly, let’s get to know our respondents and analyze the demographic data. To begin with, let’s take a look at the age distribution.

ggplot(data = data, aes(x = Age)) +
  # set the number of bins on the histogram layer itself;
  # a separate stat_bin() call would draw a second, overlapping layer
  geom_histogram(bins = 17) +
  theme_minimal() +
  labs(x = "", y = "", title = "Respondents Age Distribution")
## Warning: Removed 7 rows containing non-finite values (stat_bin).

This is what we should expect. Most respondents are around 20 years old, with a few cases closer to 30. No surprise here, because the survey was conducted at the university by statistics students, who asked the questions to their friends.

 

 

The next step is to check the gender distribution among the students.

data %>%
  group_by(Gender) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, x = Gender)) +
  geom_bar(stat="identity", fill = 'darkgrey') +
  theme_minimal() +
  theme(axis.text.x = element_text(size=15),
        title =element_text(size=16, face='bold'))  +
  labs(x = "", y = "", title = "Gender counts") +
  geom_text(aes(label = counts), vjust = -1, size = 5, color = "black") +
  theme(panel.grid.major.y = element_blank(), legend.position = "none") +
  scale_y_continuous(labels = NULL) +
  expand_limits(y = 700)

First of all, there are a few cases with an empty value - these are probably people who did not want to state their gender. There are more females than males in the sample (59% vs. 41%). A few hypotheses come to mind - maybe the women did their work more diligently? Or maybe both males and females simply have more female friends. Since we do not know which respondents were chosen by which student, there is no way to validate these suspicions.

 

 

It might be interesting to check left-/right-handedness by gender among the respondents.

data[Gender!='' & Left.right.handed!='', ] %>%
  group_by(Gender, Left.right.handed) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, x = reorder(Left.right.handed, -counts))) +
  geom_bar(stat="identity", fill = 'brown') +
  theme_minimal() +
  theme(axis.text.x = element_text(size=15),
        title =element_text(size=15, face='bold'))  +
  labs(x = "", y = "", title = "Gender ~ Left/Right-Handed") +
  geom_text(aes(label = counts), vjust = 1.2, size = 5, color = "white") +
  theme(panel.grid.major.y = element_blank(), legend.position = "none") +
  scale_y_continuous(labels = NULL) +
  facet_wrap(~ Gender)
## `summarise()` has grouped output by 'Gender'. You can override using the `.groups` argument.

There are very few left-handed people. As we can see, there are fewer men than women in the sample, yet a larger share of men is left-handed (13% vs. 7% of the respective sub-samples). This probably does not affect the other variables in any significant way, but as an exploratory element it proves to be quite interesting.
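The within-gender shares quoted above are row-wise proportions of a two-way table. A minimal sketch, using made-up counts (the data frame `toy` is an assumption and does not reproduce the survey numbers):

```r
# Illustrative counts only, not the survey's real numbers
toy <- data.frame(
  Gender = c(rep("female", 10), rep("male", 5)),
  Hand   = c(rep("right handed", 9), "left handed",
             rep("right handed", 4), "left handed")
)

# two-way table of counts: rows = gender, columns = handedness
counts <- table(toy$Gender, toy$Hand)

# margin = 1 normalizes each row, i.e. gives shares within each gender
shares <- prop.table(counts, margin = 1)

round(100 * shares[, "left handed"], 1)  # female 10, male 20
```

Comparing shares rather than raw counts matters here because the two sub-samples differ in size.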

 

 

Another story is the number of siblings among the respondents.

data %>%
  group_by(Number.of.siblings) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, x = Number.of.siblings)) +
  geom_bar(stat="identity", fill = 'violet') +
  theme_minimal() +
  theme(axis.text.x = element_text(size=15),
        title =element_text(size=16, face='bold'))  +
  labs(x = "", y = "", title = "Number of siblings among the respondents") +
  geom_text(aes(label = counts), vjust = -1, size = 5, color = "black") +
  theme(panel.grid.major.y = element_blank(), legend.position = "off") +
  expand_limits(y = 600) +
  scale_x_continuous(breaks = 1:10) 
## Warning: Removed 1 rows containing missing values (position_stack).
## Warning: Removed 1 rows containing missing values (geom_text).

Most of the young people have one sibling. There are slightly more respondents with two siblings than only children (202 vs. 163). There is one person with 10 siblings. This is an outlier and may be a mistake (e.g. a zero added by accident if the data were entered manually after the survey), but it is possible, so let’s assume there really is such a person among the surveyed young people.

 

 

Another important factor concerning the respondents is their level of education. Let’s take a look at the education distribution (I expect most of them to be students, since they are supposed to be friends of the young students who collected the data):

data$edu <- ifelse(data$Education =='secondary school', 'secondary', 
                   ifelse(data$Education =='college/bachelor degree', 'BA',
                          ifelse(data$Education =='masters degree', 'MS',
                                 ifelse(data$Education =='primary school', 'primary',
                          ifelse(data$Education =='currently a primary school pupil', 'current primary',
                                 ifelse(data$Education =='doctorate degree', 'PhD', ''))))))

data[edu !='',] %>%
  group_by(edu) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, reorder(edu, -counts))) +
  geom_bar(stat="identity", fill = 'navyblue') +
  theme_minimal() +
  theme(axis.text.x = element_text(size=14),
        title =element_text(size=16, face='bold'))  +
  labs(x = "", y = "", title = "Level of Education of the respondents") +
  geom_text(aes(label = counts), vjust = -1, size = 5, color = "black") +
  theme(panel.grid.major.y = element_blank(), legend.position = "off") +
  expand_limits(y = 700) 

Most of the respondents have finished secondary school. Many of them are also BA graduates, and some have finished a Master’s programme. There are also 90 respondents who are attending or have finished only primary school - perhaps these are the siblings of the students collecting the data?

 

 

Let’s take a look at the last dimension of the demographic data - place of residence. Firstly, let’s check the village vs. town split:

data[Village.town !='',]%>%
  group_by(Village.town) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, reorder(Village.town, -counts))) +
  geom_bar(stat="identity", fill = 'orange') +
  theme_minimal() +
  theme(axis.text.x = element_text(size=14),
        title =element_text(size=16, face='bold'))  +
  labs(x = "", y = "", title = "Where do the respondents live") +
  geom_text(aes(label = counts), vjust = -1, size = 5, color = "black") +
  theme(panel.grid.major.y = element_blank(), legend.position = "off") +
  expand_limits(y = 780) 

Most of the respondents live in a city. Yet many of the surveyed people live in villages (the proportion is roughly 7:3).

 

 

What is probably more interesting is to check what kind of housing people from cities and villages live in (house/block of flats):

data[Village.town!='' & House.block.of.flats!='', ] %>%
  group_by(Village.town, House.block.of.flats) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, x = reorder(House.block.of.flats, -counts))) +
  geom_bar(stat="identity", fill = '#00C0B8') +
  theme_minimal() +
  theme(axis.text.x = element_text(size=15),
        title =element_text(size=15, face='bold'))  +
  labs(x = "", y = "", title = "Village/Town ~ House/Block of Flats") +
  geom_text(aes(label = counts), vjust = 1.2, size = 5, color = "white") +
  theme(panel.grid.major.y = element_blank(), legend.position = "none") +
  scale_y_continuous(labels = NULL) +
  facet_wrap(~ Village.town)
## `summarise()` has grouped output by 'Village.town'. You can override using the `.groups` argument.

The results are intuitive. Most people from cities live in blocks of flats, whereas villagers live mostly in houses. 21.6% of city residents live in houses, and 13.7% of village residents live in blocks of flats. Hence, we can state that the respondents from villages are more homogeneous when it comes to the type of residence.

 

 

Let’s move on to the questions where no numeric scale is applied. We are going to check the attitude of the respondents towards some products/habits. Five questions are explored below:

  • Smoking
  • Alcohol
  • Punctuality
  • Lying
  • Internet usage

 

 

Firstly, let’s take a look at the approach to smoking and alcohol among the respondents. After looking at each of them separately, we are going to look at them jointly.

data[Smoking !='',]%>%
  group_by(Smoking) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, reorder(Smoking, -counts))) +
  geom_bar(stat="identity", fill = '#7CAE00') +
  theme_minimal() +
  theme(axis.text.x = element_text(size=14),
        title =element_text(size=16, face='bold'))  +
  labs(x = "", y = "", title = "Smoking habits of the respondents") +
  geom_text(aes(label = counts), vjust = -1, size = 5, color = "black") +
  theme(panel.grid.major.y = element_blank(), legend.position = "off") +
  expand_limits(y = 780) 

Most of the young people have tried smoking at some point in their life. The proportion of people who never smoked is similar to that of current/former smokers. Let’s take a look at the attitude towards alcohol.

data[Alcohol !='',]%>%
  group_by(Alcohol) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, reorder(Alcohol, -counts))) +
  geom_bar(stat="identity", fill = '#424242') +
  theme_minimal() +
  theme(axis.text.x = element_text(size=14),
        title =element_text(size=16, face='bold'))  +
  labs(x = "", y = "", title = "Alcohol habits of the respondents") +
  geom_text(aes(label = counts), vjust = -1, size = 5, color = "black") +
  theme(panel.grid.major.y = element_blank(), legend.position = "off") +
  expand_limits(y = 780) 

Most of the young people are social drinkers (which may go together with recreational smoking, I presume). There are quite a lot of hard drinkers among the respondents (although it is unclear what exactly constitutes ‘drink a lot’; how the questions are defined and the distortions arising from that issue is a very interesting topic for me, but we are not going to dig into the details in this project). There are 124 people who never tried alcohol. I hope these are the respondents from primary/secondary school, although there could certainly be adults among them as well.

 

 

Now let’s take a look at the alcohol and smoking habits together. What interests me most is whether party drinking goes together with having tried smoking, and whether the hard drinkers are the current/former smokers:

data[Smoking!='' & Alcohol!='', ] %>%
  group_by(Smoking, Alcohol) %>%
  summarize(counts = n()) %>%
  mutate(freq = formattable::percent(counts / sum(counts))) %>%
  ggplot(aes(y=freq, x = reorder(Alcohol, freq))) +
  geom_bar(stat="identity", fill = '#FFC107') +
  theme_minimal() +
theme(axis.text.x = element_text(size=12),
        title =element_text(size=15, face='bold'))  +
  labs(x = "", y = "", title = "Alcohol and Smoking habits of the respondents") +
  geom_text(aes(label = freq), vjust = -1, size = 4, color = "black") +
  theme(panel.grid.major.y = element_blank(), legend.position = "off") +
  scale_y_discrete(labels = NULL) +
  facet_wrap(~ Smoking) #+
## `summarise()` has grouped output by 'Smoking'. You can override using the `.groups` argument.

  #expand_limits(y = 400) 

As we can see, many of the social drinkers tried smoking. These are the highest proportions across the different smoking habits. At the same time, the proportion of social drinkers is also relatively high among current/former smokers and non-smokers. It seems that there is some connection between social drinking and smoking, but it is not that strong. Interestingly, there are hardly any hard drinkers among the non-smokers. At the same time, 41.5% of the current smokers are also hard drinkers, and drinking a lot is quite common among former smokers as well. In this case the co-occurrence of smoking and alcohol is more visible, and the two seem to form a notable pair of unhealthy habits.

 

 

Let’s take a look at the punctuality of the respondents.

data[Punctuality !='',]%>%
  group_by(Punctuality) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, reorder(Punctuality, -counts))) +
  geom_bar(stat="identity", fill = '#311B92') +
  theme_minimal() +
  theme(axis.text.x = element_text(size=13),
        title =element_text(size=16, face='bold'))  +
  labs(x = "", y = "", title = "Punctuality of the respondents") +
  geom_text(aes(label = counts), vjust = -1, size = 5, color = "black") +
  theme(panel.grid.major.y = element_blank(), legend.position = "off") +
  expand_limits(y = 450) 

The most numerous group consists of respondents who claim that they are always on time (399). Many of them (327) state that they are often early. The smallest (282), yet still sizeable, group consists of those who state that they are often late. An interesting possibility here is that the answers are influenced by “social acceptance”. Some of the surveyed people may answer in a way that makes their behaviour look more socially acceptable, and thus be dishonest. It is one of the dangers of survey data - people can lie. Let’s check how often and how much!

 

 

Yes, there is also a question about lies in the survey. Let’s take a look and check whether there might be a similar “social acceptance” issue present as well.

data[Lying !='',]%>%
  group_by(Lying) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, reorder(Lying, counts))) +
  geom_bar(stat="identity", fill = '#3E2723') +
  theme_minimal() +
  theme(axis.text.y = element_text(size=12),
        title =element_text(size=16, face='bold'))  +
  labs(x = "", y = "", title = "Do the respondents lie?") +
  geom_text(aes(label = counts), hjust = -.3, size = 5, color = "black") +
  theme(panel.grid.major.y = element_blank(), legend.position = "off") +
  coord_flip() +
  expand_limits(y = 650) 

Being perfectly honest here ;), the results match my intuition very well. The vast majority state that they lie sometimes, only to avoid hurting someone, or never. Only 14% lie whenever it serves them. There is a problem with how the question is formulated - I presume that less radical options are missing (e.g. I lie often, mostly in a harmless manner etc.). At the same time, this case shows clearly how survey data may be distorted by the social acceptance paradigm. However, it is also possible that the respondents are perfectly honest and my view is distorted; I am just suspicious of human behaviour.

 

 

Finally, let’s take a look at the internet usage of the respondents.

  data%>%
  group_by(Internet.usage) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, reorder(Internet.usage, counts))) +
  geom_bar(stat="identity", fill = '#E57373') +
  theme_minimal() +
  theme(axis.text.y = element_text(size=12),
        title =element_text(size=16, face='bold'))  +
  labs(x = "", y = "", title = "Internet usage among the respondents") +
  geom_text(aes(label = counts), hjust = -.3, size = 5, color = "black") +
  theme(panel.grid.major.y = element_blank(), legend.position = "off") +
  coord_flip() +
  expand_limits(y = 800) 

The result shows clearly that the respondents represent the generation of the digital era. 74% of the young people use the internet a few hours a day and 12% use it most of the day, which means that over 85% of the surveyed people are deeply immersed in the internet (with all the pros and cons of this fact).

It seems that we were able to get to know a little bit about the respondents. Let’s move on to the segments of the survey, where we are going to perform the PCA.

 

 

 

DIMENSION REDUCTION THROUGH PCA

 

As stated at the beginning, the survey covers several different domains:

  • music preferences
  • movies preferences
  • personal interests concerning different domains (such as chemistry/medicine etc.)
  • hobbies
  • fears
  • lifestyle (drugs, workaholism etc.)
  • personal finance and spending

Each of the segments is going to be explored in order to catch the potential principal components and explore hidden connections between the questions within the defined segments of the survey.

The structure of the PCA part for each segment of questions is going to be as follows:

  • preliminary tests to check if PCA is applicable and useful - Kaiser-Meyer-Olkin test and Bartlett’s test
  • correlation plot of the variables
  • PCA, eigenvalues construction and visualisation (+ scree plot)
  • visualisation of two major components with the original variables
  • contribution of the original variables to principal components
  • quality of representation via cos2 (correlation plot and cumulative barplot for top two components)

The preliminary assumption for each segment (which we are going to check each time) is that the answers are correlated. There is a debate whether Likert-scale data should be treated as normally distributed. This matter determines the method for calculating the correlation matrices. The most popular method is the Pearson correlation, which is usually applied to continuous, roughly normal data. In this project, I assume that the Likert scale is close enough to a normal distribution to use Pearson correlations. At the same time, I am aware that this method might introduce some bias.
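To get a feeling for how much the choice of correlation method matters on 1-5 answers, here is a minimal sketch on simulated data. Everything in it is an assumption made for illustration: the latent correlation of 0.6, the cut points, and the helper `to_likert()` are not part of the survey or of the original analysis. It compares Pearson with the rank-based Spearman correlation, which does not rely on the interval interpretation of the scale.

```r
set.seed(1)
n <- 500

# two latent continuous traits, correlated by construction (assumed value 0.6)
z1 <- rnorm(n)
z2 <- 0.6 * z1 + sqrt(1 - 0.6^2) * rnorm(n)

# hypothetical helper: cut a latent trait into five ordered categories (1-5)
to_likert <- function(z) {
  as.integer(cut(z, breaks = c(-Inf, -1.5, -0.5, 0.5, 1.5, Inf)))
}
x <- to_likert(z1)
y <- to_likert(z2)

# Pearson treats 1-5 as interval data, Spearman uses ranks only
pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")
c(pearson = pearson, spearman = spearman)
```

If the two coefficients are close on the real answers as well, the Pearson-based correlation matrix is unlikely to change the PCA picture much; a large gap would be a reason to reconsider the assumption.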

In the course of the exploration via PCA, I expect that we will be able to distinguish some interesting features of the data. In each case, we are going to use full records (with no missing data).

 

 

 

PCA - MUSIC

 

The first sphere, which we are going to explore, is the approach towards Music. The data encompasses the general approach to music and opera and the specific music genres.

 

Firstly, let’s take a look at the distributions of the questions concerning music:

# prepare appropriate datasets for each questions segment
music <- (data[complete.cases(data[,1:19]),1:19])
movies <- (data[complete.cases(data[,20:31]),20:31])
domain <- (data[complete.cases(data[,32:49]),32:49])
hobby <- (data[complete.cases(data[,50:63]),50:63])
fear <- (data[complete.cases(data[,64:73]),64:73])
life <- (data[complete.cases(data[,c(80:107, 110:132)]),c(80:107, 110:132)])
money <- (data[complete.cases(data[,134:140]),134:140])

# assign an appropriate dataset to perform pca
pca_data <- music

# plot distributions of the answers
plt_data <- melt(pca_data, id.vars = NULL)
## Warning in melt.data.table(pca_data, id.vars = NULL): id.vars and measure.vars
## are internally guessed when both are 'NULL'. All non-numeric/integer/logical
## type columns are considered id.vars, which in this case are columns []. Consider
## providing at least one of 'id' or 'measure' vars in future.
ggplot(data = plt_data) + geom_bar(aes(x = value)) +  
  theme(plot.title = element_text(hjust = 0.5, size = 15)) +
  facet_wrap(~ variable, scales = "free", ncol = 4) +
  theme(axis.text.y=element_blank(), axis.ticks.y=element_blank(),legend.position = "off") +
    labs(x = NULL, y = NULL)

Almost all of the respondents really enjoy listening to music. Young people are not so enthusiastic about folk and country music, and they do not like opera that much either. They are rather neutral about musicals and classical music. On the other hand, the young people love rock and pop music. Latino and hip-hop music have many enthusiasts as well as opponents.

 

 

The next step is the PCA. First of all, I am going to check if PCA is appropriate for the dataset.

# prepare a correlation matrix
cor_mat <- cor(pca_data)

# Kaiser-Meyer-Olkin factor adequacy test
KMO(pca_data)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = pca_data)
## Overall MSA =  0.77
## MSA for each item = 
##                    Music Slow.songs.or.fast.songs                    Dance 
##                     0.74                     0.75                     0.74 
##                     Folk                  Country          Classical.music 
##                     0.81                     0.78                     0.79 
##                  Musical                      Pop                     Rock 
##                     0.79                     0.69                     0.79 
##        Metal.or.Hardrock                     Punk              Hiphop..Rap 
##                     0.78                     0.77                     0.75 
##              Reggae..Ska              Swing..Jazz              Rock.n.roll 
##                     0.65                     0.80                     0.83 
##              Alternative                   Latino           Techno..Trance 
##                     0.83                     0.79                     0.64 
##                    Opera 
##                     0.78
# Bartlett’s test
cortest.bartlett(cor_mat, n = nrow(pca_data))
## $chisq
## [1] 4641.865737
## 
## $p.value
## [1] 0
## 
## $df
## [1] 171

I assume that the Measure of Sampling Adequacy (MSA) index should exceed 0.5 for factor analysis to make sense. In this case the overall MSA is equal to 0.77, which is a reassuring result.

For additional reliability, Bartlett’s test is also conducted. H0 states that the correlation matrix is an identity matrix. For the music preferences data, the reported p-value is 0, so we reject H0: the variables are correlated and PCA is applicable here.
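Both checks can be reproduced by hand from a correlation matrix, which makes it clearer what they measure. The sketch below uses simulated data (the one-factor structure, `n` and `p` are assumptions for illustration) and base R only; it follows the standard textbook formulas rather than calling the `psych` functions used in the report.

```r
set.seed(42)
n <- 300
p <- 4

# simulated answers driven by one shared factor (illustrative only)
f <- rnorm(n)
X <- sapply(1:p, function(j) 0.7 * f + rnorm(n))
R <- cor(X)

# Bartlett's test of sphericity: H0 says R is an identity matrix
chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
df    <- p * (p - 1) / 2
p_val <- pchisq(chisq, df = df, lower.tail = FALSE)

# overall KMO / MSA: squared correlations compared with
# squared partial correlations (off-diagonal elements only)
R_inv   <- solve(R)
partial <- -R_inv / sqrt(outer(diag(R_inv), diag(R_inv)))
off     <- row(R) != col(R)
msa     <- sum(R[off]^2) / (sum(R[off]^2) + sum(partial[off]^2))

c(chisq = chisq, df = df, p_value = p_val, MSA = msa)
```

The degrees of freedom depend only on the number of variables, which is why the music segment with 19 variables reports df = 171 and the movies segment with 12 variables reports df = 66.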

 

 

Let’s take a look at the correlation plot for the data:

# correlation plot
corrplot(cor_mat, type = "lower", order = "hclust", tl.col = "black", tl.cex = 0.7)

There are visible correlations between variables - mostly positively correlated pairs (e.g. Rock ~ Metal), but there are also some negatively correlated pairs (weaker and fewer, e.g. Metal ~ Dance).

 

 

Let’s take a look at the eigenvalues for music preferences PCA:

# before pca - scale and center the data
data_scale <- scale(pca_data, center = TRUE, scale = TRUE)

# pca
pca_res <- PCA(data_scale, scale.unit = F, ncp = ncol(pca_data), graph = FALSE)
# eigenvalues plot
fviz_eig(pca_res, choice = "eigenvalue", ncp = ncol(pca_data), barfill = "orange", barcolor = "black", linecolor = "black",  addlabels = TRUE,   main = "Eigenvalues for the PCA")

Three dimensions stand out with the largest eigenvalues. According to Kaiser’s rule, there are 5 reasonable dimensions (those with eigenvalues larger than 1). Yet dimensions 4-5 explain just a little more variance than one original variable. We are going to explore the two major components further.
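Kaiser’s rule follows from the fact that each standardized variable carries a variance of 1, so a component with an eigenvalue above 1 explains more than a single original variable. A minimal sketch on simulated data (the two-factor structure and all object names are assumptions for illustration, and `eigen()` on the correlation matrix stands in for the `PCA()` call used in the report):

```r
set.seed(7)
n <- 400

# two hidden factors drive six observed variables (simulated, illustrative)
f1 <- rnorm(n)
f2 <- rnorm(n)
X <- cbind(
  a1 = f1 + rnorm(n, sd = 0.6), a2 = f1 + rnorm(n, sd = 0.6),
  a3 = f1 + rnorm(n, sd = 0.6), b1 = f2 + rnorm(n, sd = 0.6),
  b2 = f2 + rnorm(n, sd = 0.6), b3 = f2 + rnorm(n, sd = 0.6)
)

# eigenvalues of the correlation matrix = variances of the principal components
eig <- eigen(cor(X))$values

# Kaiser's rule: keep components with eigenvalue > 1
n_kaiser <- sum(eig > 1)

# share of variance per component and cumulative share (what a scree plot shows)
var_explained <- eig / sum(eig)
cum_var       <- cumsum(var_explained)

n_kaiser
round(100 * cum_var, 1)
```

Because the eigenvalues sum to the number of variables, an eigenvalue only slightly above 1 (as for dimensions 4-5 in the music segment) adds little beyond what a single question already tells us.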

 

 

Let’s move on to the screeplot informing about the variance explained by transformed dimensions:

# scree plot
fviz_eig(pca_res,  ncp = ncol(pca_data), barfill = "orange", barcolor = "black", linecolor = "black",  
         addlabels = TRUE,   main = "Screeplot for the PCA")

The most influential dimension explains 20% of the entire variance in the data, whereas the top three together explain almost half of the variance. The result is far from ideal, yet we were able to extract some informative factors. Let’s dig deeper.

 

 

Let’s take a look at the two major components with the original variables mapped on the axes:

# two major components with original variables plot
fviz_pca_var(pca_res, col.var = "orange")

Along the first dimension, there is a visible positive correlation for e.g. Rock and Roll, Classical music, Swing-Jazz and Opera. The variable Rock and Roll has the highest influence. Interestingly, for the variable slow/fast songs the correlation is negative (yet quite weak). The original variable answers the question: Do you prefer slow (low values) or fast (high values) music? Hence, I would say that the first dimension could be named with a question such as: Are you a music introvert (slow songs played in private) or extrovert (listening to rock and roll and classical music, visiting the opera and listening to music in a social manner - during concerts etc.)?

The second dimension is far less informative. The most influential original variables are dance/pop/latino, which are positively correlated with the dimension. In this case, there are no negatively correlated variables. One proposal for naming this dimension could be: Do you like rhythmic, danceable music?

 

 

Let’s take a look at the contribution plot for two major components:

# contribution plot
fviz_contrib(pca_res, choice = "var", fill = "orange", axes = 1:2, top = 19)

As we’ve seen earlier, for the first dimension, the most influential variable was Rock and Roll, but taking into consideration two top components, swing/jazz and metal/hardrock variables bring the largest contribution. In fact, dance, latino, classical music and rock variables also contribute to two major dimensions greatly.

 

 

Let’s check the quality of representation of the original variables in the reduced dimensions (again the top 2 dims). It is expressed by the squared cosine (cos2). In general, this measure indicates the contribution of a component to the squared distance of a point from the origin; for a variable, it equals the squared correlation between the variable and the component.

# cos2 plot: barplot cum top 2
fviz_cos2(pca_res, choice="var", fill = "orange", axes = 1:2, top = 19 )

The higher the cos2, the better the representation of original variables on the factor map. Again, the best fit is observed for swing/jazz and metal/hardrock music.
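The link between the factor map, cos2 and the contributions can be sketched with base R. Note that this uses `prcomp()` on simulated data instead of FactoMineR’s `PCA()` used in the report, and all object names are assumptions for illustration.

```r
set.seed(3)
n <- 300

# v1 and v2 share a factor, v3 and v4 are pure noise (simulated, illustrative)
f <- rnorm(n)
X <- cbind(v1 = f + rnorm(n), v2 = f + rnorm(n), v3 = rnorm(n), v4 = rnorm(n))

pca <- prcomp(X, center = TRUE, scale. = TRUE)
eig <- pca$sdev^2

# coordinates of the variables on the factor map:
# correlations between the original variables and the components
coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# cos2: quality of representation of each variable on each component
cos2 <- coord^2

# contribution (in %) of each variable to each component
contrib <- 100 * sweep(cos2, 2, eig, `/`)

# quality of representation on the first two components together
cos2_top2 <- rowSums(cos2[, 1:2])
round(cos2_top2, 2)
```

Across all components the cos2 values of a variable sum to 1, so the cumulative cos2 on the top two dimensions reads directly as the share of that variable’s variance kept by the reduced representation.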

We were able to distinguish two interesting components for young people music preferences. Let’s move on to movies preferences.

 

 

 

PCA - MOVIES

 

Once we analyzed the music tastes, let’s move on and dig into the movies preferences. At first - let’s plot the distributions for all of the questions concerning movies:

# assign an appropriate dataset to perform pca
pca_data <- movies

# plot distributions of the answers
plt_data <- melt(pca_data, id.vars = NULL)
## Warning in melt.data.table(pca_data, id.vars = NULL): id.vars and measure.vars
## are internally guessed when both are 'NULL'. All non-numeric/integer/logical
## type columns are considered id.vars, which in this case are columns []. Consider
## providing at least one of 'id' or 'measure' vars in future.
ggplot(data = plt_data) + geom_bar(aes(x = value)) +  
  theme(plot.title = element_text(hjust = 0.5, size = 15)) +
  facet_wrap(~ variable, scales = "free", ncol = 4) +
  theme(axis.text.y=element_blank(), axis.ticks.y=element_blank(),legend.position = "off") +
    labs(x = NULL, y = NULL)

Most of the young people really enjoy watching films. Among the movie genres, youngsters especially love comedies. They also like animated movies and fantasy/fairy tales very much. Quite surprisingly, the attitude towards documentaries is very positive. On the other hand, it seems that the respondents don’t like westerns (in fact, it is the only genre that gathered very negative feedback).

 

 

Let’s conduct the Kaiser-Meyer-Olkin and Bartlett’s tests:

# prepare a correlation matrix
cor_mat <- cor(pca_data)

# Kaiser-Meyer-Olkin factor adequacy test
KMO(pca_data)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = pca_data)
## Overall MSA =  0.67
## MSA for each item = 
##              Movies              Horror            Thriller              Comedy 
##                0.73                0.58                0.65                0.62 
##            Romantic              Sci.fi                 War Fantasy.Fairy.tales 
##                0.72                0.81                0.78                0.59 
##            Animated         Documentary             Western              Action 
##                0.58                0.69                0.74                0.73
# Bartlett’s test
cortest.bartlett(cor_mat, n = nrow(pca_data))
## $chisq
## [1] 2186.757519
## 
## $p.value
## [1] 0
## 
## $df
## [1] 66

As we can see, the overall MSA is higher than 0.5 (0.67), while the p-value for Bartlett’s test is reported as 0 - we are fine and can proceed with the PCA.

 

 

Let’s check correlations for the movies preferences:

# correlation plot
corrplot(cor_mat, type = "lower", order = "hclust", tl.col = "black", tl.cex = 0.7)

Youngsters who like fantasy and fairy tales are also interested in animated movies. Thriller enthusiasts also like horror movies. Interestingly, all of the movie genres are negatively correlated with romantic films (romantic film enthusiasts might be a homogeneous subgroup that does not like other genres - yet the correlations are very weak, so it is only a loose hypothesis).

 

 

Let’s take a look at the eigenvalues for movies preferences PCA:

# before pca - scale and center the data
data_scale <- scale(pca_data, center = TRUE, scale = TRUE)

# pca
pca_res <- PCA(data_scale, scale.unit = F, ncp = ncol(pca_data), graph = FALSE)
# eigenvalues plot
fviz_eig(pca_res, choice = "eigenvalue", ncp = ncol(pca_data), barfill = "darkgreen", barcolor = "black", linecolor = "black",  addlabels = TRUE,   main = "Eigenvalues for the PCA")

There are three dimensions that pass Kaiser’s threshold of 1. The top two dimensions seem to be much stronger than the rest, so these two are going to be interpreted.

 

 

Let’s take a look at the screeplot for movies:

# scree plot
fviz_eig(pca_res,  ncp = ncol(pca_data), barfill = "darkgreen", barcolor = "black", linecolor = "black",  
         addlabels = TRUE,   main = "Screeplot for the PCA")

The first dimension explains 21.2% of the overall variation in the data. Taking three dimensions into account, more than half of the variance can be explained.

 

 

Let’s plot the factor map and explore the top two dimensions:

# two major components with original variables plot
fviz_pca_var(pca_res, col.var = "darkgreen")

As far as the first dimension is concerned, the strongest positive correlation is observed for War, Thriller, Action and Western movies. There is also a positive correlation for Sci-fi and Horror. All of these films are rather “serious”, intense and often controversial or emotionally difficult.

On the other hand, for the second dimension we can see that comedies, fairy tales and animated productions are positively correlated. Hence, we could assume that the second dimension concentrates on light, pleasant movies.

We could therefore rename the first and second dimensions as questions: Do you like serious, intense movies (dim1)?
Do you like light, pleasant movies (dim2)?

 

 

Let’s take a look at the contribution plot for top 2 dimensions:

# contribution plot
fviz_contrib(pca_res, choice = "var", fill = "darkgreen", axes = 1:2, top = 12)

As we can see, Fantasy/Fairy tales and Animated movies have the highest contribution (mainly to the first dimension). On the other hand, the least influential variable proves to be Documentaries.

 

 

Finally, let’s take a look at the cos2 plot:

# cos2 plot: barplot cum top 2
fviz_cos2(pca_res, choice="var", fill = "darkgreen", axes = 1:2, top = 12 )

The results correspond to the previous results concerning the contribution. The cos2 measure is especially high for fantasy and animated films, which means that the first two components are a decent representation for these variables.
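The agreement between the two plots is not a coincidence. The sketch below, on synthetic data (the `toy` data frame is an assumption), computes cos2 and contributions by hand from a standardized PCA. The combined contribution for axes 1:2 is weighted by the eigenvalues, which is my understanding of how factoextra aggregates it, and under that definition it is proportional to the summed cos2.

```r
# Synthetic data with two latent factors (NOT the survey data)
set.seed(3)
n <- 250
f1 <- rnorm(n)
f2 <- rnorm(n)
toy <- data.frame(
  x1 = f1 + rnorm(n, sd = 0.6),
  x2 = f1 + rnorm(n, sd = 0.6),
  x3 = f2 + rnorm(n, sd = 0.6),
  x4 = f2 + rnorm(n, sd = 0.6),
  x5 = rnorm(n)
)

pca <- prcomp(toy, center = TRUE, scale. = TRUE)
eig <- pca$sdev^2

# variable coordinates: correlation of each variable with each component
coord <- sweep(pca$rotation, 2, pca$sdev, `*`)
cos2 <- coord^2                              # quality of representation
contrib <- 100 * sweep(cos2, 2, eig, `/`)    # % contribution within each component

# joint measures for the first two components
cos2_12 <- rowSums(cos2[, 1:2])
contrib_12 <- (contrib[, 1] * eig[1] + contrib[, 2] * eig[2]) / (eig[1] + eig[2])

print(round(cbind(cos2_12, contrib_12), 3))
```

Since `contrib_12` equals `100 * cos2_12 / (eig[1] + eig[2])`, the ranking of the variables is the same on both plots; only the scale differs.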

 

 

 

PCA - KNOWLEDGE DOMAINS

 

Let’s analyze the interest in different knowledge domains among the young people. Below are the distributions of their answers on this topic:

# assign an appropriate dataset to perform pca
pca_data <- domain

# plot distributions of the answers
plt_data <- melt(pca_data, id.vars = NULL)
## Warning in melt.data.table(pca_data, id.vars = NULL): id.vars and measure.vars
## are internally guessed when both are 'NULL'. All non-numeric/integer/logical
## type columns are considered id.vars, which in this case are columns []. Consider
## providing at least one of 'id' or 'measure' vars in future.
ggplot(data = plt_data) + geom_bar(aes(x = value)) +  
  theme(plot.title = element_text(hjust = 0.5, size = 15)) +
  facet_wrap(~ variable, scales = "free", ncol = 4) +
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank(), legend.position = "none") +
    labs(x = NULL, y = NULL)

Young people mostly dislike physics and mathematics. They are also not interested in law, politics, art and religion. There are a few spheres that I would call neutral (they have both enthusiasts and ‘enemies’), such as geography, history, psychology and reading in general. Among the youngsters, there is positive feedback about the internet and foreign languages - these are the only spheres that fascinate the vast majority of respondents.
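Statements like these can be backed with simple shares instead of reading them off the bar charts. A minimal sketch, assuming two made-up items (the `toy` data frame and its answer probabilities are invented for illustration), computes the percentage of low (1-2) and high (4-5) answers per item:

```r
# Invented answers for two items; the probabilities are arbitrary
set.seed(8)
toy <- data.frame(
  Internet = sample(1:5, 100, replace = TRUE, prob = c(.02, .05, .13, .30, .50)),
  Physics  = sample(1:5, 100, replace = TRUE, prob = c(.45, .25, .15, .10, .05))
)

summary_tab <- data.frame(
  item     = names(toy),
  pct_low  = sapply(toy, function(x) 100 * mean(x <= 2)),  # answers 1-2
  pct_high = sapply(toy, function(x) 100 * mean(x >= 4)),  # answers 4-5
  median   = sapply(toy, median)
)

print(summary_tab, row.names = FALSE)
```

An item with both `pct_low` and `pct_high` far from zero would be one of the "neutral" spheres with enthusiasts and opponents.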

 

 

Let’s perform the PCA adequacy tests:

# prepare a correlation matrix
cor_mat <- cor(pca_data)

# Kaiser-Meyer-Olkin factor adequacy test
KMO(pca_data)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = pca_data)
## Overall MSA =  0.73
## MSA for each item = 
##            History         Psychology           Politics        Mathematics 
##               0.73               0.74               0.76               0.61 
##            Physics           Internet                 PC Economy.Management 
##               0.64               0.68               0.71               0.67 
##            Biology          Chemistry            Reading          Geography 
##               0.74               0.76               0.75               0.74 
##  Foreign.languages           Medicine                Law               Cars 
##               0.74               0.77               0.74               0.73 
##    Art.exhibitions           Religion 
##               0.80               0.84
# Bartlett’s test
cortest.bartlett(cor_mat, n = nrow(pca_data))
## $chisq
## [1] 4523.3114
## 
## $p.value
## [1] 0
## 
## $df
## [1] 153

The overall MSA according to the KMO test is 0.73, which is more than 0.5. Again, we reject H0 of Bartlett’s test - the correlation matrix differs from the identity matrix. Let’s move on to further analysis.
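For reference, Bartlett’s statistic can be reproduced by hand from the determinant of the correlation matrix. The sketch below uses the textbook formula on synthetic data (the `bartlett_sphericity` function and the `toy` matrix are hypothetical names introduced here, not part of psych):

```r
# chi-square = -(n - 1 - (2p + 5) / 6) * ln|R|, with p(p - 1) / 2 degrees of freedom
bartlett_sphericity <- function(R, n) {
  p <- ncol(R)
  chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
  df <- p * (p - 1) / 2
  list(chisq = chisq, df = df, p.value = pchisq(chisq, df, lower.tail = FALSE))
}

# synthetic data: a and b share a common factor, c and d are pure noise
set.seed(4)
n <- 200
z <- rnorm(n)
toy <- cbind(a = z + rnorm(n), b = z + rnorm(n), c = rnorm(n), d = rnorm(n))

res <- bartlett_sphericity(cor(toy), n)
print(res)
```

With 18 variables the formula gives 18 * 17 / 2 = 153 degrees of freedom, which matches the output above. A printed p-value of exactly 0 most likely means that the true value is too small to be represented as a double.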

 

 

Below is the correlation plot for the knowledge domains interests analysis:

# correlation plot
corrplot(cor_mat, type = "lower", order = "hclust", tl.col = "black", tl.cex = 0.7)

There is a quite strong positive correlation between chemistry and biology, chemistry and medicine, and biology and medicine (quite obvious). What is more interesting - there is a visible negative correlation between cars and reading - it seems that those interested in cars do not like reading and bookworms are not fascinated by cars.

 

 

Let’s conduct the PCA and take a look at the eigenvalues:

# before pca - scale and center the data
data_scale <- scale(pca_data, center = TRUE, scale = TRUE)

# pca
pca_res <- PCA(data_scale, scale.unit = FALSE, ncp = ncol(pca_data), graph = FALSE)
# eigenvalues plot
fviz_eig(pca_res, choice = "eigenvalue", ncp = ncol(pca_data), barfill = "steelblue", barcolor = "black", linecolor = "black",  addlabels = TRUE,   main = "Eigenvalues for the PCA")

Dimensions 1-3 prove to be especially informative. According to the Kaiser rule, we could also take dimensions 4-5 into consideration, yet it is reasonable to analyze 1-2 or 1-3.

 

 

Let’s check the variance explained by specific dimensions:

# scree plot
fviz_eig(pca_res,  ncp = ncol(pca_data), barfill = "steelblue", barcolor = "black", linecolor = "black",  
         addlabels = TRUE,   main = "Screeplot for the PCA")

The first component explains 17.6% of all variance. Dimensions 1-3 explain 45% of the variance. The results are not that bad, yet we were able to attain better results in the earlier analyses.

 

 

Let’s map the original variables onto two major components:

# two major components with original variables plot
fviz_pca_var(pca_res, col.var = "steelblue")

For the first dimension, we can see that Medicine, Biology and Chemistry are the most influential variables. They are positively correlated with the first component. This component could be named: Are you interested in domains connected with medicine?

As far as the second dimension is concerned, the strongest positive correlation is observed for PC, Economy, Politics, Cars and Internet. Here the interpretation of the component is a little more difficult. Yet, I would opt for this question: Are you interested in domains concerning power, value and influence?

 

 

Let’s take a look at the contribution and cos2 plots:

# contribution plot
fviz_contrib(pca_res, choice = "var", fill = "steelblue", axes = 1:2, top = 18)

# cos2 plot: barplot cum top 2
fviz_cos2(pca_res, choice="var", fill = "steelblue", axes = 1:2, top = 18 )

As far as dimensions 1-2 are concerned, biology, medicine and politics make the largest contribution to the dimensions and present the highest cos2 measure. The least influential variable, which is also not well represented by the constructed dimensions, is the question about Foreign Languages.

 

 

 

PCA - HOBBIES

 

Let’s move on to hobbies of the respondents:

# assign an appropriate dataset to perform pca
pca_data <- hobby

# plot distributions of the answers
plt_data <- melt(pca_data, id.vars = NULL)
## Warning in melt.data.table(pca_data, id.vars = NULL): id.vars and measure.vars
## are internally guessed when both are 'NULL'. All non-numeric/integer/logical
## type columns are considered id.vars, which in this case are columns []. Consider
## providing at least one of 'id' or 'measure' vars in future.
ggplot(data = plt_data) + geom_bar(aes(x = value)) +  
  theme(plot.title = element_text(hjust = 0.5, size = 15)) +
  facet_wrap(~ variable, scales = "free", ncol = 4) +
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank(), legend.position = "none") +
    labs(x = NULL, y = NULL)

Young people seem to be very social creatures - hanging out with friends is important to them. They also like outdoor activities very much. Youngsters are rather skeptical towards dancing, playing musical instruments and writing. They also deem gardening boring. There is no visible trend for shopping, science and technology, theatre and extreme sports.

 

 

Let’s check the MSA measure and the results of Bartlett’s test:

# prepare a correlation matrix
cor_mat <- cor(pca_data)

# Kaiser-Meyer-Olkin factor adequacy test
KMO(pca_data)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = pca_data)
## Overall MSA =  0.66
## MSA for each item = 
##  Countryside..outdoors                Dancing    Musical.instruments 
##                   0.69                   0.73                   0.67 
##                Writing          Passive.sport           Active.sport 
##                   0.60                   0.65                   0.62 
##              Gardening            Celebrities               Shopping 
##                   0.72                   0.60                   0.63 
## Science.and.technology                Theatre       Fun.with.friends 
##                   0.65                   0.69                   0.65 
##      Adrenaline.sports                   Pets 
##                   0.62                   0.76
# Bartlett’s test
cortest.bartlett(cor_mat, n = nrow(pca_data))
## $chisq
## [1] 1581.433573
## 
## $p.value
## [1] 2.153364668e-270
## 
## $df
## [1] 91

The MSA is 0.66 and the p-value is near 0, so again we are fine as far as the preliminary conditions for PCA effectiveness are concerned. Let’s proceed.
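The MSA compares ordinary correlations with partial correlations: it is close to 1 when the variables share common variance and drops towards 0 when the partial correlations dominate. Below is a sketch of that computation in base R, assuming synthetic data (`kmo_by_hand` and `toy` are hypothetical names; this is meant to illustrate the formula, not to replace `KMO` from psych):

```r
kmo_by_hand <- function(R) {
  Rinv <- solve(R)
  # partial (anti-image) correlations derived from the inverse correlation matrix
  Q <- -Rinv / sqrt(outer(diag(Rinv), diag(Rinv)))
  diag(Q) <- 0
  R0 <- R
  diag(R0) <- 0
  list(
    overall  = sum(R0^2) / (sum(R0^2) + sum(Q^2)),
    per_item = colSums(R0^2) / (colSums(R0^2) + colSums(Q^2))
  )
}

# synthetic data: four variables driven by one common factor
set.seed(5)
n <- 300
g <- rnorm(n)
toy <- cbind(a = g + rnorm(n), b = g + rnorm(n), c = g + rnorm(n), d = g + rnorm(n))

res <- kmo_by_hand(cor(toy))
print(round(res$overall, 2))
print(round(res$per_item, 2))
```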

 

 

Let’s take a look at the correlation plot for hobbies:

# correlation plot
corrplot(cor_mat, type = "lower", order = "hclust", tl.col = "black", tl.cex = 0.7)

There are mainly weak positive correlations present. Some of them are: Shopping ~ Celebrities, Active sport ~ Adrenaline sports. There are also some very weak negative correlations present -> e.g. Celebrities ~ Science and technology.

 

 

Let’s construct the screeplot and eigenvalues plot in order to assess the dimensions found through PCA:

# before pca - scale and center the data
data_scale <- scale(pca_data, center = TRUE, scale = TRUE)

# pca
pca_res <- PCA(data_scale, scale.unit = FALSE, ncp = ncol(pca_data), graph = FALSE)
# eigenvalues plot
fviz_eig(pca_res, choice = "eigenvalue", ncp = ncol(pca_data), barfill = "brown", barcolor = "black", linecolor = "black",  addlabels = TRUE,   main = "Eigenvalues for the PCA")

# scree plot
fviz_eig(pca_res,  ncp = ncol(pca_data), barfill = "brown", barcolor = "black", linecolor = "black",  
         addlabels = TRUE,   main = "Screeplot for the PCA")

Four dimensions exceed the 1.0 threshold for eigenvalues according to the Kaiser rule. Yet, it seems that three of them are the most informative ones (especially the first one, which explains 17.5% of the entire variance). The three major components explain 41.2% of the variance for this dataset. The result is far from ideal, yet there may be some hidden potential in the first dimensions.
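The threshold of 1.0 only makes sense because the data are centered and scaled before the PCA, which makes the analysis equivalent to decomposing the correlation matrix. A short sketch on synthetic data (the `toy` data frame is an assumption, and base R functions are used instead of FactoMineR) checks this equivalence:

```r
# Synthetic data with variables on very different scales
set.seed(9)
toy <- as.data.frame(matrix(rnorm(120 * 4), ncol = 4))
toy$V2 <- toy$V1 * 3 + rnorm(120)   # correlated with V1, larger variance

toy_scaled <- scale(toy, center = TRUE, scale = TRUE)

eig_cov_scaled <- eigen(cov(toy_scaled), symmetric = TRUE)$values  # covariance of scaled data
eig_cor_raw    <- eigen(cor(toy), symmetric = TRUE)$values         # correlation of raw data
eig_prcomp     <- prcomp(toy, scale. = TRUE)$sdev^2                # PCA with scaling

print(round(rbind(eig_cov_scaled, eig_cor_raw, eig_prcomp), 4))
```

All three rows should agree, so each eigenvalue can be read as "how many standardized variables' worth of variance" the dimension carries.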

 

 

Let’s prepare the variable correlation plot:

# two major components with original variables plot
fviz_pca_var(pca_res, col.var = "brown")

Dancing, Gardening and Theatre are the variables for which we observe the highest positive correlation with the first dimension. Hence, I suggest that this dimension could be perceived as artistic hobbies connected with aesthetics.

The second dimension is a little more tricky. Three variables concerning sport are positively correlated with the dimension, but there is also the science and technology variable, which seems to be a significant factor. One suggestion for the dimension’s name could be: hobbies concerning physical and mental activity.

 

 

Let’s take a look at the contribution and cos2 plots:

# contribution plot
fviz_contrib(pca_res, choice = "var", fill = "brown", axes = 1:2, top = 14)

# cos2 plot: barplot cum top 2
fviz_cos2(pca_res, choice="var", fill = "brown", axes = 1:2, top = 14 )

The most influential variable is Adrenaline Sports - it is quite well represented within the top two dimensions. The Pets variable seems to be insignificant in terms of contribution.

 

 

 

PCA - FEARS

 

Let’s dig into probably one of the most interesting and fascinating spheres of the whole survey - the fears of young people:

# assign an appropriate dataset to perform pca
pca_data <- fear

# plot distributions of the answers
plt_data <- melt(pca_data, id.vars = NULL)
## Warning in melt.data.table(pca_data, id.vars = NULL): id.vars and measure.vars
## are internally guessed when both are 'NULL'. All non-numeric/integer/logical
## type columns are considered id.vars, which in this case are columns []. Consider
## providing at least one of 'id' or 'measure' vars in future.
ggplot(data = plt_data) + geom_bar(aes(x = value)) +  
  theme(plot.title = element_text(hjust = 0.5, size = 15)) +
  facet_wrap(~ variable, scales = "free", ncol = 4) +
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank(), legend.position = "none") +
    labs(x = NULL, y = NULL)

Young people are not scared of flying or storms. They are also rather tolerant of rats and darkness. Views on snakes and spiders are polarized - many of the respondents either do not fear them at all or are very scared of these creatures. It seems that youngsters ignore ageing at this stage of their life. A moderate fear is noted for dangerous dogs and public speaking.

 

 

The next step is to perform the Kaiser-Meyer-Olkin factor adequacy test and Bartlett’s test:

# prepare a correlation matrix
cor_mat <- cor(pca_data)

# Kaiser-Meyer-Olkin factor adequacy test
KMO(pca_data)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = pca_data)
## Overall MSA =  0.8
## MSA for each item = 
##                  Flying                   Storm                Darkness 
##                    0.80                    0.77                    0.76 
##                 Heights                 Spiders                  Snakes 
##                    0.86                    0.85                    0.78 
##                    Rats                  Ageing          Dangerous.dogs 
##                    0.80                    0.85                    0.84 
## Fear.of.public.speaking 
##                    0.81
# Bartlett’s test
cortest.bartlett(cor_mat, n = nrow(pca_data))
## $chisq
## [1] 1852.896822
## 
## $p.value
## [1] 0
## 
## $df
## [1] 45

The MSA measure is quite large - 0.8. According to Bartlett’s test - we reject H0 stating that the correlation matrix is an identity matrix.

 

 

Let’s take a look at the correlation plot for the youngsters’ fears:

# correlation plot
corrplot(cor_mat, type = "lower", order = "hclust", tl.col = "black", tl.cex = 0.7)

What is interesting - there are no negative correlations between any pair of variables. There are some quite strong positive correlations - for instance: rats ~ snakes and storm ~ darkness.
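Observations like these can also be checked directly on the correlation matrix rather than by eye. A minimal sketch, assuming a synthetic data frame (the `toy` columns only borrow names from the survey, the values are simulated), lists all pairs sorted by the strength of the correlation and tests whether any of them is negative:

```r
# Simulated data; the column names are borrowed for readability only
set.seed(6)
n <- 200
z <- rnorm(n)
toy <- data.frame(
  rats   = z + rnorm(n),
  snakes = z + rnorm(n),
  storm  = 0.5 * z + rnorm(n),
  ageing = rnorm(n)
)

cor_toy <- cor(toy)

# one row per unordered pair of variables (upper triangle of the matrix)
idx <- which(upper.tri(cor_toy), arr.ind = TRUE)
pair_tab <- data.frame(
  var1 = rownames(cor_toy)[idx[, "row"]],
  var2 = colnames(cor_toy)[idx[, "col"]],
  r    = cor_toy[upper.tri(cor_toy)]
)
pair_tab <- pair_tab[order(-abs(pair_tab$r)), ]

any_negative <- any(pair_tab$r < 0)

print(pair_tab, row.names = FALSE)
print(any_negative)
```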

 

 

Let’s take a look at the screeplot and eigenvalues plot for youngsters’ fears PCA:

# before pca - scale and center the data
data_scale <- scale(pca_data, center = TRUE, scale = TRUE)

# pca
pca_res <- PCA(data_scale, scale.unit = FALSE, ncp = ncol(pca_data), graph = FALSE)
# eigenvalues plot
fviz_eig(pca_res, choice = "eigenvalue", ncp = ncol(pca_data), barfill = "darkgrey", barcolor = "black", linecolor = "black",  addlabels = TRUE,   main = "Eigenvalues for the PCA")

# scree plot
fviz_eig(pca_res,  ncp = ncol(pca_data), barfill = "darkgrey", barcolor = "black", linecolor = "black",  
         addlabels = TRUE,   main = "Screeplot for the PCA")

According to the Kaiser rule, only two dimensions surpass the 1.0 eigenvalue threshold. The first dimension explains almost 1/3 of all variation, the second dimension merely 11.3%.

 

 

Let’s plot the original variables onto the top two dimensions:

# two major components with original variables plot
fviz_pca_var(pca_res, col.var = "darkgrey")

For the first dimension, the strongest positive correlation is observed for Rats, Snakes, Storm, Darkness, Dangerous Dogs and Spiders. Hence, I would opt for naming the dimension: fears concerning animate (snakes, rats, spiders) and inanimate (darkness, storms) nature. It may be perceived as a primal, instinctual fear.

The second dimension is less informative, yet there are two interesting positive correlations present - the fear of flying and the fear of heights. Hence, the second dimension could be named “Fear of Heights”.

 

 

Let’s present the contribution and cos2 plots:

# contribution plot
fviz_contrib(pca_res, choice = "var", fill = "darkgrey", axes = 1:2, top = 10)

# cos2 plot: barplot cum top 2
fviz_cos2(pca_res, choice="var", fill = "darkgrey", axes = 1:2, top = 10 )

Rats, snakes, storm and darkness make the largest contribution to the top two dimensions. Ageing and Fear of public speaking do not contribute in a significant way and they are not well represented by the top two dimensions (very low cos2 measure).

 

 

 

PCA - MONEY

 

The last part of the PCA is the personal finance sphere. Let’s take a look at the distributions of the answers:

# assign an appropriate dataset to perform pca
pca_data <- money

# plot distributions of the answers
plt_data <- melt(pca_data, id.vars = NULL)
## Warning in melt.data.table(pca_data, id.vars = NULL): id.vars and measure.vars
## are internally guessed when both are 'NULL'. All non-numeric/integer/logical
## type columns are considered id.vars, which in this case are columns []. Consider
## providing at least one of 'id' or 'measure' vars in future.
ggplot(data = plt_data) + geom_bar(aes(x = value)) +  
  theme(plot.title = element_text(hjust = 0.5, size = 15)) +
  facet_wrap(~ variable, scales = "free", ncol = 4) +
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank(), legend.position = "none") +
    labs(x = NULL, y = NULL)

The young people enjoy visiting the shopping centres. They also prefer branded to non-branded clothing. Some of the youngsters tend to spend much money on partying and socializing, but the trend is weaker than I would expect. What is more, young people do not spend much on gadgets (another surprise here). What is even more unexpected - the vast majority of respondents would happily pay more for good quality, healthy food.

 

 

Let’s conduct the KMO and Bartlett’s test for the last time:

# prepare a correlation matrix
cor_mat <- cor(pca_data)

# Kaiser-Meyer-Olkin factor adequacy test
KMO(pca_data)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = pca_data)
## Overall MSA =  0.73
## MSA for each item = 
##                   Finances           Shopping.centres 
##                       0.69                       0.64 
##           Branded.clothing     Entertainment.spending 
##                       0.80                       0.73 
##          Spending.on.looks        Spending.on.gadgets 
##                       0.72                       0.79 
## Spending.on.healthy.eating 
##                       0.77
# Bartlett’s test
cortest.bartlett(cor_mat, n = nrow(pca_data))
## $chisq
## [1] 1163.782456
## 
## $p.value
## [1] 3.211053673e-233
## 
## $df
## [1] 21

The MSA is higher than 0.5 and the p-value for Bartlett’s test indicates that we reject H0 - the correlation matrix differs from the identity matrix.

 

 

Now let’s take a look at the correlation plot:

# correlation plot
corrplot(cor_mat, type = "lower", order = "hclust", tl.col = "black", tl.cex = 0.7)

As we can see, there is a quite strong positive correlation between spending on looks and shopping centres/entertainment spending/branded clothing. There is also a visible negative correlation between the saving rate and entertainment spending, which fits well with the intuition about the connection between these variables.

 

 

Let’s analyse the scree- and eigenvalues plots:

# before pca - scale and center the data
data_scale <- scale(pca_data, center = TRUE, scale = TRUE)

# pca
pca_res <- PCA(data_scale, scale.unit = FALSE, ncp = ncol(pca_data), graph = FALSE)
# eigenvalues plot
fviz_eig(pca_res, choice = "eigenvalue", ncp = ncol(pca_data), barfill = "navyblue", barcolor = "black", linecolor = "black",  addlabels = TRUE,   main = "Eigenvalues for the PCA")

# scree plot
fviz_eig(pca_res,  ncp = ncol(pca_data), barfill = "navyblue", barcolor = "black", linecolor = "black",  
         addlabels = TRUE,   main = "Screeplot for the PCA")

According to the Kaiser rule, there are two informative dimensions (where eigenvalue > 1). The first one seems especially strong - it accounts for 36.3% of all variation in the dataset. Together with the second one, they explain 51.9% of the entire variation.

 

 

Let’s plot the original variables and check for the correlations with top two dimensions:

# two major components with original variables plot
fviz_pca_var(pca_res, col.var = "navyblue")

As we can see, for the first dimension, the strongest positive correlation is observed for spending on looks, the significance of branded clothing, and spending on entertainment and gadgets. This dimension may be perceived as a “consumerism index”.

As far as the second dimension is concerned, there is a visible positive correlation for the Finances variable. It informs about the saving habits of the respondent - the higher the value, the more the respondent saves her/his money. For this dimension we can also observe a negative correlation for the Entertainment Spending variable. Hence, the second dimension could be named “Saving Rate”.
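Once the dimensions are named, each respondent can be given a score on them, which is what makes an "index" usable in practice. The sketch below is only an illustration on simulated data (the `toy` data frame, its column names and the `scores` object are assumptions) and uses `prcomp`; in FactoMineR the analogous coordinates are stored in `pca_res$ind$coord`.

```r
# Simulated spending data driven by two latent traits (NOT the survey data)
set.seed(7)
n <- 150
spend  <- rnorm(n)   # latent tendency to spend
thrift <- rnorm(n)   # latent tendency to save
toy <- data.frame(
  looks         = spend + rnorm(n, sd = 0.5),
  branded       = spend + rnorm(n, sd = 0.5),
  entertainment = spend - 0.5 * thrift + rnorm(n, sd = 0.5),
  finances      = thrift + rnorm(n, sd = 0.5)
)

pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# individual scores on the first two components
scores <- as.data.frame(pca$x[, 1:2])
names(scores) <- c("dim1", "dim2")

# the sign of a component is arbitrary: orient dim1 so that it rises with "looks"
if (cor(scores$dim1, toy$looks) < 0) scores$dim1 <- -scores$dim1

print(head(round(scores, 2)))
```

The scores are centered and mutually uncorrelated, so they can be used as two separate summary variables in further analysis.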

 

 

Let’s plot the contribution and cos2 measure:

# contribution plot
fviz_contrib(pca_res, choice = "var", fill = "navyblue", axes = 1:2, top = 10)

# cos2 plot: barplot cum top 2
fviz_cos2(pca_res, choice="var", fill = "navyblue", axes = 1:2, top = 10 )

We observe the highest contribution for the Finances variable - it is also very well represented by the two major components. Spending on gadgets and healthy eating shows the opposite pattern - these variables do not contribute much, and the proposed dimensions do not represent them well.

 

 

 

FINAL CONCLUSIONS

 

The surveyed young people from Slovakia are mainly about 20 years old, and there are slightly more females than males. Most of the youngsters have 1-3 siblings and live mainly in the city.

Most of the young people are social drinkers, and most of them have at least tried smoking during their life (some of them are current or former smokers). The respondents admit that they lie, yet only sometimes or with good intentions.

Youngsters love listening to music. Their music tastes could be summed up by two major components put as questions: Are you a music intro/extravert? Do you like rhythmic, danceable music?

The respondents also enjoy watching movies. They are especially enthusiastic about comedies, animations, and fairy tales. The movie tastes of the youngsters could be expressed in terms of two major questions: Do you like serious, emotional and intense movies? Do you like light, pleasant movies?

Young people are not very interested in Maths and Physics. They also seem not to care about law, politics, art, and religion. However, they really like the internet and foreign languages. Two major components encompass the domains concerning medicine (first dimension) and power/value/influence (second dimension).

As far as youngsters’ hobbies are concerned, they seem to love having fun with friends and going outdoors. They do not like gardening and have a rather negative attitude towards dancing and musical instruments. Two interesting components concerning hobbies were found: artistic hobbies (first component) and physical/mental activity hobbies (second component).

The survey also contains information about the fears of young people. They do not fear flying and storms. Snakes and spiders polarize the respondents - many of them fear them greatly, while the others are not scared at all. Two major components were discovered. They could be summed up as fears connected with nature (animate and inanimate) and fear of heights.

Young people like visiting shopping centers. They prefer branded to non-branded clothes. At the same time, youngsters do not spend much money on gadgets and would pay more money for good quality food. There is an interesting component observed for the dataset - it could be described as the “consumerism index”. The other significant dimension is a “saving rate” component.

This project was quite a challenge - there were lots of variables and much work involved to analyze the data. Yet, I believe, there is much hidden value in such surveys - it is only a single endeavor to gain some insight into the data.