Project 2: Dimension Reduction

Dimension Reduction in Analysis of Wellbeing and Lifestyle Survey

The database in use was acquired from Kaggle (Dalat, 2020) and is based on a survey conducted by Authentic Happiness (2015) in 2015, with responses collected over the 5 year time span. It consists of 12756 observations and 23 variables, that will further be explained, giving us 293388 cells. It is a substantial dataset and dimension reduction can widely facilitate its readability and use.
The survey was carried out to determine how we live our lives. More precisely, it examines how we as a society thrive in our personal and professional lives - how efficiently we shape our lifestyles, rituals, habits to optimise the overall life satisfaction along the given 5 dimensions:
“1. Healthy body, reflecting your fitness and healthy habits
2. Healthy mind, indicating how well you embrace positive emotions
3. Expertise, measuring the ability to grow your expertise and achieve something unique
4. Connection, assessing the strength of your social network and your inclination to discover the world
5. Meaning, evaluating your compassion, generosity and how much”you are living the life of your dreams“.”

Introduction

The goal of this paper is to use different methods of dimension reduction, PCA and MCA mixed with PCA on a questionnaire data. MCA (Multiple correspondence analysis) is an equivalent to the process of PCA (Principal Component Analysis), a statistical method often implemented on expanded, manifold databases, for the trimming of an X number of dimensions and thereby creating a simpler version of the dataset with a minimum loss of information, allowing for a proficient visualisation on 2-dimensional plots for a dependency determination between the variables (Hayden, 2018), but for variables that are categorical. In our dataset, we only have categorical data, mostly in the numerical form. ## Data Preparation First, let us have a look at the dataset in question, the dimensions and variables. As a first step, before the actual preprocessing, I assigned new variable names to the dataset for a cleaner read.

wellbeing <- read.csv("Wellbeing_and_lifestyle_data.csv", header = TRUE, sep = ",", col.names = c("Timestamp", "FruitsOrVeggiesPerDay", "DailyStress", "PlacesVisitedAnnually", "CoreCircle", "SupportingOthers", "DailySocialNetworking", "PersonalAchievements", "AnnualDonations", "BMIRange", "WeeklyTaskCompletion", "HoursInFocus", "DailySteps", "LifeVisionAhead", "HoursOfSleep", "LostVacationDays", "DailyAnger", "IncomeSufficiency", "PersonalAwards", "TimeForPassion", "WeeklyMeditation", "AgeRange", "Gender"), stringsAsFactors = FALSE)

head(wellbeing, 10)

##    Timestamp FruitsOrVeggiesPerDay DailyStress PlacesVisitedAnnually CoreCircle
## 1     7/7/15                     3           2                     2          5
## 2     7/7/15                     2           3                     4          3
## 3     7/7/15                     2           3                     3          4
## 4     7/7/15                     3           3                    10          3
## 5     7/7/15                     5           1                     3          3
## 6     7/8/15                     3           2                     3          9
## 7     7/8/15                     4           2                    10          6
## 8     7/9/15                     3           4                     5          3
## 9     7/9/15                     5           3                     6          4
## 10   7/10/15                     4           4                     2          6
##    SupportingOthers DailySocialNetworking PersonalAchievements AnnualDonations
## 1                 0                     5                    2               0
## 2                 8                    10                    5               2
## 3                 4                    10                    3               2
## 4                10                     7                    2               5
## 5                10                     4                    2               4
## 6                10                    10                    2               3
## 7                10                    10                    3               5
## 8                 5                     7                    4               0
## 9                 3                     3                    5               4
## 10               10                    10                    0               4
##    BMIRange WeeklyTaskCompletion HoursInFocus DailySteps LifeVisionAhead
## 1         1                    6            4          5               0
## 2         2                    5            2          5               5
## 3         2                    2            2          4               5
## 4         2                    3            5          5               0
## 5         2                    5            0          5               0
## 6         1                    6            1          7              10
## 7         2                    8            8          7               5
## 8         1                    8            2          8              10
## 9         1                   10            2          1               5
## 10        2                    3            2          3               0
##    HoursOfSleep LostVacationDays DailyAnger IncomeSufficiency PersonalAwards
## 1             7                5          5                 1              4
## 2             8                2          2                 2              3
## 3             8               10          2                 2              4
## 4             5                7          5                 1              5
## 5             7                0          0                 2              8
## 6             8                0          2                 2             10
## 7             7               10          0                 2             10
## 8             6                0          2                 2              8
## 9            10                0          2                 2             10
## 10            6                0          0                 1              3
##    TimeForPassion WeeklyMeditation   AgeRange Gender
## 1               0                5   36 to 50 Female
## 2               2                6   36 to 50 Female
## 3               8                3   36 to 50 Female
## 4               2                0 51 or more Female
## 5               1                5 51 or more Female
## 6               8                3 51 or more Female
## 7               8               10 51 or more   Male
## 8               2                2   21 to 35 Female
## 9               3               10   21 to 35 Female
## 10              8                1 51 or more Female

The variables correspond to:
Timestamp - Date when survey was submitted
FruitsOrVeggiesPerDay - How many fruits or vegetables do you eat on a daily basis?
DailyStress - How much stress do you typically experience?
PlacesVisitedAnnually - How many places do you typically visit in a year? Such as new states, cities, museums…
CoreCircle - How many people are you very close to?
SupportingOthers - How many people do you help at achieving a better life? Score of selflessness
DailySocialNetworking - How many people do you interact with on a typical day?
PersonalAchievements - How many achievements are you proud of?
AnnualDonations - How many times a year do you donate your time or money to good causes?
BMIRange - What is your BMI range? 1 - Below 25; 2 - Above 25
WeeklyTaskCompletion - How well do you complete your weekly to-do lists?
HoursInFocus - In a typical day, how many hours do you experience a good “flow”?
DailySteps - How many steps (expressed in thousands) do you typically walk daily?
LifeVisionAhead - How many years ahead is your life planned for?
HoursOfSleep- How many hours do you typically sleep every night?
LostVacationDays - How many days of vacation do you usually lose annually?
DailyAnger - How often do you shout at someone on a daily basis?
IncomeSufficiency - How sufficient is your income to cover your basic like expenses? 1- Not or hardly sufficient; 2 - Sufficient
PersonalAwards - How many awards have you received in your life?
TimeForPassion - How many hours daily do you spend doing what you are passionate about?
WeeklyMeditation - How many times do you have the opportunity to think about yourself weekly?
AgeRange - Age groups: Less than 20, 21 to 35, 36 to 50, 51 or more
Gender - Male, Female

All the questions were asked in a pre-defined answers form, most of them on a 1-10 or 1-5 sequence, with a couple of exceptions. In variables IncomeSufficiency, BMIRange we have a 2 level answer choices. Apart from that, we have the AgeRange variable with 4 subgroups and Gender variable with two options, both variables with non numerical alternatives. Excluding the Timestamp variable, all of the inputs are categorical. However, some of them include simple answer to “how many?” and a simple count. Others have assigned levels. Therefore, we cannot simply conduct a PCA or MCA on the whole dataset, but divide the variables based on their background and carry out the respective tests. We will also implement t-SNE.

But first, let’s transform the text data into numerics for a unified set. We will also get rid of the Timestamp variable, as it is only a survey completion information, not connected with the survey itself and its contents that we aim to reduce.

library("dplyr")

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

wellbeing <- select(wellbeing, -Timestamp)
colnames(wellbeing)

##  [1] "FruitsOrVeggiesPerDay" "DailyStress"           "PlacesVisitedAnnually"
##  [4] "CoreCircle"            "SupportingOthers"      "DailySocialNetworking"
##  [7] "PersonalAchievements"  "AnnualDonations"       "BMIRange"             
## [10] "WeeklyTaskCompletion"  "HoursInFocus"          "DailySteps"           
## [13] "LifeVisionAhead"       "HoursOfSleep"          "LostVacationDays"     
## [16] "DailyAnger"            "IncomeSufficiency"     "PersonalAwards"       
## [19] "TimeForPassion"        "WeeklyMeditation"      "AgeRange"             
## [22] "Gender"

Now that we got rid of the initially unnecessary variable, we can transform the AgeRange and Gender variables into numerical inputs for some descriptive statistics further on. Let’s create a new variable AgeGroup, and subset the Gender inputs.

wellbeing$AgeGroup <- ifelse(wellbeing$AgeRange == "Less than 20", 1,
                             ifelse(wellbeing$AgeRange == "21 to 35", 2,
                                    ifelse(wellbeing$AgeRange == "36 to 50", 3, 4)))


wellbeing$Gender <- ifelse(wellbeing$Gender == "Male", 0, 1)

wellbeing <- select(wellbeing, -AgeRange)

wellbeing$DailyStress[wellbeing$DailyStress == "1/1/00"] = 1

Let us check some statistics now.

library("ggplot2")
library("reshape2")

#Converting the DailyStress variable to numeric as a precaution
wellbeing$DailyStress <- as.numeric(wellbeing$DailyStress)

wellbeingPlot <- melt(wellbeing[, 1:12], id.vars = NULL)
ggplot(data = wellbeingPlot) + geom_bar(aes(x = value)) +  theme(plot.title = element_text(hjust = 0.5, size = 15)) +
  facet_wrap(~ variable, scales = "free", ncol = 3)

wellbeingPlot2 <- melt(wellbeing[, 13:22], id.vars = NULL)
ggplot(data = wellbeingPlot2) + geom_bar(aes(x = value)) +  theme(plot.title = element_text(hjust = 0.5, size = 15)) +
  facet_wrap(~ variable, scales = "free", ncol = 3)

summary(wellbeing)

##  FruitsOrVeggiesPerDay  DailyStress    PlacesVisitedAnnually   CoreCircle    
##  Min.   :0.00          Min.   :0.000   Min.   : 0.000        Min.   : 0.000  
##  1st Qu.:2.00          1st Qu.:2.000   1st Qu.: 3.000        1st Qu.: 3.000  
##  Median :3.00          Median :3.000   Median : 5.000        Median : 5.000  
##  Mean   :2.93          Mean   :2.783   Mean   : 5.339        Mean   : 5.485  
##  3rd Qu.:4.00          3rd Qu.:4.000   3rd Qu.: 8.000        3rd Qu.: 8.000  
##  Max.   :5.00          Max.   :5.000   Max.   :10.000        Max.   :10.000  
##  SupportingOthers DailySocialNetworking PersonalAchievements AnnualDonations
##  Min.   : 0.000   Min.   : 0.000        Min.   : 0.000       Min.   :0.000  
##  1st Qu.: 3.000   1st Qu.: 4.000        1st Qu.: 2.000       1st Qu.:1.000  
##  Median : 5.000   Median : 6.000        Median : 3.000       Median :3.000  
##  Mean   : 5.578   Mean   : 6.551        Mean   : 3.963       Mean   :2.701  
##  3rd Qu.:10.000   3rd Qu.:10.000        3rd Qu.: 6.000       3rd Qu.:5.000  
##  Max.   :10.000   Max.   :10.000        Max.   :10.000       Max.   :5.000  
##     BMIRange     WeeklyTaskCompletion  HoursInFocus      DailySteps    
##  Min.   :1.000   Min.   : 0.000       Min.   : 0.000   Min.   : 1.000  
##  1st Qu.:1.000   1st Qu.: 4.000       1st Qu.: 1.000   1st Qu.: 3.000  
##  Median :1.000   Median : 6.000       Median : 3.000   Median : 5.000  
##  Mean   :1.401   Mean   : 5.706       Mean   : 3.127   Mean   : 5.705  
##  3rd Qu.:2.000   3rd Qu.: 8.000       3rd Qu.: 4.000   3rd Qu.: 8.000  
##  Max.   :2.000   Max.   :10.000       Max.   :10.000   Max.   :10.000  
##  LifeVisionAhead   HoursOfSleep    LostVacationDays   DailyAnger    
##  Min.   : 0.000   Min.   : 1.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 1.000   1st Qu.: 6.000   1st Qu.: 0.000   1st Qu.: 1.000  
##  Median : 3.000   Median : 7.000   Median : 0.000   Median : 2.000  
##  Mean   : 3.712   Mean   : 7.035   Mean   : 2.833   Mean   : 2.921  
##  3rd Qu.: 5.000   3rd Qu.: 8.000   3rd Qu.: 5.000   3rd Qu.: 4.000  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.000  
##  IncomeSufficiency PersonalAwards   TimeForPassion   WeeklyMeditation
##  Min.   :1.000     Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:1.000     1st Qu.: 3.000   1st Qu.: 1.000   1st Qu.: 4.000  
##  Median :2.000     Median : 5.000   Median : 2.000   Median : 7.000  
##  Mean   :1.728     Mean   : 5.703   Mean   : 3.267   Mean   : 6.253  
##  3rd Qu.:2.000     3rd Qu.: 9.000   3rd Qu.: 5.000   3rd Qu.:10.000  
##  Max.   :2.000     Max.   :10.000   Max.   :10.000   Max.   :10.000  
##      Gender          AgeGroup    
##  Min.   :0.0000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:2.000  
##  Median :1.0000   Median :2.000  
##  Mean   :0.6047   Mean   :2.582  
##  3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :4.000

First we had to turn all the data into numerics, just as a precaution, as an character outlier (DailyStress) was spotted. We do not need to scale the data as all of it is simply in the integer measure. We do not face the problem of any NAs, as all the questions were compulsory for the survey completion.

I decided to split the variables into two plots for a better clarity of graphs.

As we can see, around 60% of the respondents are women. A substantial share of the respondents, over 38% of them belong to the second age group (21 to 35 years old), with the second largest group being the set or 36 to 50 years olds.

More respondents are in the first BMI range, meaning they are not obese, and their weight/height ratio is correct. Following on the physical health variables, majority of the respondents eats at least one fruit/vegetable per day, sleeps 6-8 hours. Noteworthy, the biggest share in the DailySteps variable is the one equivalent to 10 thousand steps a day, which is the recommended number to stay in shape and get the daily dose of exercise. Nonetheless, the overwhelming majority of the respondents does not walk 10 thousands steps daily, varying somewhat evenly from 1 to 8k with a big decrease in 9k.
Moving onto the financial freedom, we can see that significant part of the respondents is able to cover all the needs with the incomes. Moreover, they visit in many cases 10 or more new places annually, and in approximately 1/3 of the cases donate at least 5 times a year. That shows a balances and diversified life.
When we move onto the mental wellbeing we are faces with some thought-provoking instances. By looking at DailyMeditation, we see that approx 2/3 of the group takes good care of their mental health with meditation, relaxation or fitness. On the other hand, a visible majority of people spend 2,5 or less hours per day doing what they really like, leading to a conclusion that they do not enjoy their careers and activities. Luckily we also see a nice share of people spending at least 5 hours per day enjoying their tasks.
Most of the people are in the present, without going too much ahead of their lives, having planned at longest 2,5 years to the future. Approximately 1/3 of the respondents has at least a 5 year plan.
Surprisingly, although most people experience a low flow - a time in focus - from 0 to 3 hours per day, they complete more than half of the tasks planned for the week (5.706 tasks on average). However, they also experience a relative stress of 2,73 out of 5. This is probably due to poor flow abilities, where they cannot focus on their duties, rushing to the deadlines.
Lastly, the generality of people mingle with a well numbered circle of connections, both on the close proximity and the daily interactions, where most of the cases have at least 10 of them daily. They also give their support to multiple people, leading to a conclusion of a well balanced individual and intra group relationships.

PCA and MCA

After some deeper look into the data structure, I gained some doubts whether PCA is the appropriate measure. Although all of our data is in the numerical form, it is categorical, have only a discrete number of possibilities. Moreover we can distinct two subsets - questions of “how many?”, where a simple number is the final answer, and variables where an number is corresponding to a given level, or a range. For the MCA measure, we will divide the dataset, and convert the leveled answers to text values. The rest will be left unchanged. Here is the division. Numerical set:
* FruitsOrVeggiesPerDay
* PlacesVisitedAnnually
* CoreCircle
* SupportingOthers
* DailySocialNetworking
* PersonalAchievements
* AnnualDonations
* WeeklyTaskCompletion
* HoursInFocus
* DailySteps
* LifeVisionAhead
* HoursOfSleep
* LostVacationDays
* PersonalAwards
* TimeForPassion
* WeeklyMeditation

Text based set:
* DailyStress
* BMIRange
* DailyAnger
* IncomeSufficiency
* AgeRange
* Gender

Kaiser-Meyer-Olkin and Bartlett’s test

But first let’s conduct two individual tests to determine, whether our data is PCA testable.
* Kaiser-Meyer-Olkin test - a test used to examine whether the Principal Component Analysis is useful for the given variables or not.
* Bartlett’s test - a test used to find out whether the correlation matrix for a set of data is the identity matrix. If the identity correlation matrix is confirmed, PCA is not an appropriate measure (Carillo et al., 2019).

library("psych")

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

library("corrplot")

## corrplot 0.84 loaded

#Correlation Matrix
cor_w <- cor(wellbeing)
corrplot(cor_w, type = "lower", order = "hclust", tl.col = "black", tl.cex = 0.5)

KMO(cor_w)

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor_w)
## Overall MSA =  0.84
## MSA for each item = 
## FruitsOrVeggiesPerDay           DailyStress PlacesVisitedAnnually 
##                  0.84                  0.73                  0.89 
##            CoreCircle      SupportingOthers DailySocialNetworking 
##                  0.90                  0.87                  0.83 
##  PersonalAchievements       AnnualDonations              BMIRange 
##                  0.89                  0.87                  0.60 
##  WeeklyTaskCompletion          HoursInFocus            DailySteps 
##                  0.91                  0.85                  0.83 
##       LifeVisionAhead          HoursOfSleep      LostVacationDays 
##                  0.92                  0.72                  0.73 
##            DailyAnger     IncomeSufficiency        PersonalAwards 
##                  0.72                  0.83                  0.89 
##        TimeForPassion      WeeklyMeditation                Gender 
##                  0.86                  0.83                  0.57 
##              AgeGroup 
##                  0.57

cortest.bartlett(cor_w, n = 12756)

## $chisq
## [1] 41370.92
## 
## $p.value
## [1] 0
## 
## $df
## [1] 231

In KMO test, our Overall MSA = 0.84, the test is satisfied, as it surpassed the .6 threshold to be able to proceed with factor analysis. In the bartlett’s test, the p-value = 0, meaning that our correlation matrix is completely different from the identity matrix.
After positive outcomes from both test, we can confidently proceed with the PCA. Regardless of this fact, we will conduct the MCA as well for a confrontation of the results. Let’s start with PCA first.

Principal Component Analysis (PCA)

Optimal number of components

To determine the optimal number of components to remain in the set, we will use the Kaiser’s Stopping Rule - a measure used when deciding which components should be chosen. Here, the ones with eigenvalue higher over 1 should stay (Brown, 2009). To better visualise the eigenvalues, we will plot the scree test. There, all the elements are plotted in the descending order, and based on the rule of elbow, we choose the optimum number.

library("factoextra")

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

pca <- prcomp(wellbeing, center = TRUE, scale = TRUE)
eigen(cor_w)$values

##  [1] 4.1824990 1.6794879 1.4774633 1.2836214 1.0746203 1.0521988 0.9717958
##  [8] 0.9253522 0.8946419 0.8325454 0.7688360 0.7482203 0.7124629 0.6874012
## [15] 0.6796851 0.6585689 0.6234278 0.6160834 0.5799556 0.5470580 0.5312043
## [22] 0.4728706

fviz_eig(pca, choice = "eigenvalue", ncp = 22, barfill = "hotpink3", barcolor = "hotpink4", linecolor = "brown4",  addlabels = TRUE,   main = "Eigenvalues")

Based on the Kaiser’s rule, 6 components should be chosen, as the eigenvalue for those is above 1.

fviz_pca_var(pca, col.var = "brown4")

fviz_screeplot(pca, addlabels = TRUE, barfill = "salmon", barcolor = "tomato3", linecolor = "tomato4")

Basing solely on the plot above, a few components are visible, although highly stacked - the downside of plotting 22 variables at once. We can distinguish the first dimension with DailyStress, Daily Anger, LostVacationDays and BMIRange, the second with SupportingOthers, DailySocialNetworking, AnnualDonations and some others, the third with HoursInFocus, PersonalAchievements, Personal Awards, fourth with TimeForPassion, PlacesVisitedAnnually, WeeklyTaskCompletion, DailySteps and more, fifth one of Weekly Meditation, and the sixth one of HoursOfSleep. However, the actual distribution of dimesions is not of the highest clarity. To better determine which variables are allocated to each of the components, let us plot the contribution of variables to the principal components.

library("pdp")
pca_var <- get_pca_var(pca)

fviz_contrib(pca, "var", axes = 1:6, fill = "tomato3", color = "tomato4")

The variables of AgeGroup, Gender, DailyStress, DailySocialNetworking, TimeForPassion, WeeklyMeditation, DailySteps, HoursInFocus, SupportingOthers, PersonalAchievements and HoursOfSleep contribute the most to the results of the survey.

MCA

For the MCA, first, we need to divide the dataset into two subsets, and transform one of them to character type.

library("dplyr")

wellbeing_text <- select(wellbeing, DailyStress, BMIRange, IncomeSufficiency, AgeGroup, Gender)


wellbeing_text$DailyStress <- factor(wellbeing_text$DailyStress, ordered = TRUE)
wellbeing_text$BMIRange <- factor(wellbeing_text$BMIRange, ordered = TRUE)
wellbeing_text$IncomeSufficiency <- factor(wellbeing_text$IncomeSufficiency, ordered = TRUE)
wellbeing_text$AgeGroup <- factor(wellbeing_text$AgeGroup, ordered = TRUE)
wellbeing_text$Gender <- factor(wellbeing_text$Gender, ordered = FALSE)


wellbeing_num <- select(wellbeing, FruitsOrVeggiesPerDay, PlacesVisitedAnnually, CoreCircle, SupportingOthers, DailySocialNetworking, PersonalAchievements, AnnualDonations, WeeklyTaskCompletion, HoursInFocus, DailySteps, LifeVisionAhead, HoursOfSleep, LostVacationDays, PersonalAwards, TimeForPassion, WeeklyMeditation, DailyAnger)

wellbeing_text$DailyStress <- ifelse(wellbeing_text$DailyStress == 0, "no stress",
                             ifelse(wellbeing_text$DailyStress == 1, "not a lot of stress",
                                    ifelse(wellbeing_text$DailyStress == 2, "some stress", 
                                            ifelse(wellbeing_text$DailyStress == 3, "relative stress",
                                                   ifelse(wellbeing_text$DailyStress == 4, "a lot of stress", "constant stress")))))

wellbeing_text$BMIRange <- ifelse(wellbeing_text$BMIRange == 1, "Below 25", "Above 25")

wellbeing_text$IncomeSufficiency <- ifelse(wellbeing_text$IncomeSufficiency == 1, "Not or hardly sufficient", "Sufficient or more")

wellbeing_text$AgeGroup <- ifelse(wellbeing$AgeGroup == 1, "Less than 20",
                             ifelse(wellbeing$AgeGroup == 2, "21 to 35",
                                    ifelse(wellbeing$AgeGroup == 3, "36 to 50", "50 or more")))

wellbeing_text$Gender <- wellbeing$Gender <- ifelse(wellbeing$Gender == 0, "Male", "Female")

head(wellbeing_text)

##           DailyStress BMIRange        IncomeSufficiency   AgeGroup Gender
## 1         some stress Below 25 Not or hardly sufficient   36 to 50 Female
## 2     relative stress Above 25       Sufficient or more   36 to 50 Female
## 3     relative stress Above 25       Sufficient or more   36 to 50 Female
## 4     relative stress Above 25 Not or hardly sufficient 50 or more Female
## 5 not a lot of stress Above 25       Sufficient or more 50 or more Female
## 6         some stress Below 25       Sufficient or more 50 or more Female

head(wellbeing_num)

##   FruitsOrVeggiesPerDay PlacesVisitedAnnually CoreCircle SupportingOthers
## 1                     3                     2          5                0
## 2                     2                     4          3                8
## 3                     2                     3          4                4
## 4                     3                    10          3               10
## 5                     5                     3          3               10
## 6                     3                     3          9               10
##   DailySocialNetworking PersonalAchievements AnnualDonations
## 1                     5                    2               0
## 2                    10                    5               2
## 3                    10                    3               2
## 4                     7                    2               5
## 5                     4                    2               4
## 6                    10                    2               3
##   WeeklyTaskCompletion HoursInFocus DailySteps LifeVisionAhead HoursOfSleep
## 1                    6            4          5               0            7
## 2                    5            2          5               5            8
## 3                    2            2          4               5            8
## 4                    3            5          5               0            5
## 5                    5            0          5               0            7
## 6                    6            1          7              10            8
##   LostVacationDays PersonalAwards TimeForPassion WeeklyMeditation DailyAnger
## 1                5              4              0                5          5
## 2                2              3              2                6          2
## 3               10              4              8                3          2
## 4                7              5              2                0          5
## 5                0              8              1                5          0
## 6                0             10              8                3          2

Now we can run MCA on categorical data and another PCA run on the subset wellbeing_num.

library("FactoMineR")

mca1 <- MCA(wellbeing_text, ncp = 5, graph = FALSE)
fviz_screeplot(mca1, addlabels = TRUE, barfill = "salmon", barcolor = "tomato3", linecolor = "tomato4")

fviz_contrib(mca1, choice = "var", axes = 1:4, fill = "tomato3", color = "tomato4")

Scree plots represent the percentage of variance. Now, let’s have a look at the contribution plot, shows which variables contribute to the results in 2 dimensional space, as we have 5 variables, 2 of them have 5 dimensions, 2 have 2, and one has 4. An average of this is 3.6, so let’s choose 4 dimensions.
The options of “Less than 20” and “50 or more” in variable AgeGroup, “Not or hardly sufficient” in variable IncomeSufficiency, “contant stress” and “no stress” in variable DailyStress, “Male” in Gender, and “Above 25” in BMIRange contribute the most to the results.

Now, let’s run PCA and wellbeing_num. Both test have favourable outcomes for running PCA.

#Correlation Matrix on wellbeing_num
cor_w2 <- cor(wellbeing_num)
KMO(cor_w2)

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor_w2)
## Overall MSA =  0.87
## MSA for each item = 
## FruitsOrVeggiesPerDay PlacesVisitedAnnually            CoreCircle 
##                  0.86                  0.89                  0.90 
##      SupportingOthers DailySocialNetworking  PersonalAchievements 
##                  0.87                  0.85                  0.89 
##       AnnualDonations  WeeklyTaskCompletion          HoursInFocus 
##                  0.86                  0.91                  0.84 
##            DailySteps       LifeVisionAhead          HoursOfSleep 
##                  0.85                  0.92                  0.70 
##      LostVacationDays        PersonalAwards        TimeForPassion 
##                  0.71                  0.89                  0.86 
##      WeeklyMeditation            DailyAnger 
##                  0.85                  0.83

cortest.bartlett(cor_w2, n = 12756)

## $chisq
## [1] 32744.83
## 
## $p.value
## [1] 0
## 
## $df
## [1] 136

pca2 <- prcomp(wellbeing_num, center = TRUE, scale = TRUE)
eigen(cor_w2)$values

##  [1] 4.0113786 1.3834533 1.1704243 1.0143174 0.9862448 0.9272378 0.8872879
##  [8] 0.8092303 0.7607359 0.7578666 0.7195338 0.6773710 0.6614664 0.6407978
## [15] 0.5601076 0.5470651 0.4854814

fviz_eig(pca2, choice = "eigenvalue", ncp = 17, barfill = "hotpink3", barcolor = "hotpink4", linecolor = "brown4",  addlabels = TRUE,   main = "Eigenvalues")

Here, we have a recommended number of 3 dimensions (Eigenvalues above 1). Let’s see the circle plot for the variable distribution now. And below that we will have the contribution plot, to see which numerical variables contribute the most.

fviz_pca_var(pca2, col.var = "darkred")

fviz_contrib(pca2, choice = "var", axes = 1:3, fill = "tomato3", color = "tomato4")

In this case we have the variables of HoursInFocus, TimeForPassion, SupportingOthers, PersonalAchievements, PlacesVisitedAnnually and LiveVisionAhead, that contribute at the greatest spectrum to the dataset results.

Conclusions

Although the percentages of explained variances in both cases of PCA and PCA mixed with MCA were not high enough for the proposed number of dimensions to assume that all the points have been correctly allocated, the outcomes seem to be relatively sensible in both scenarios. Gender and AgeGroup are highly relevant for the survey outcome, where age groups of below 20 and over 50 years old had lower wellbeing scores. That is rather unsurprising, as people below the age of 20 rarely take care of their mental health, as well as the needs covering, while people over the age of 50 encounter the phase of decrease in vitality and beginnings of diseases etc. BMIRange was also an important indicator of the outcomes, with a positive shift for the below 25 spectrum. Apart from the purely categorical data, the variables of our relationships, time spend on passion AND in focus, and physical activity variables (Daily Steps, WeeklyMeditation) seem to have the biggest impact on the results. Nonetheless, dimension reduction can be successfully conducted on a survey data, both in the sphere of variables - number of questions, where they can be lowered to general categories, as well as in number of options, in some cases to even binary.

References

Automatic Happiness. (2015). YOUR WORK-LIFE BALANCE SCORE. Retrieved from http://www.authentic-happiness.com/your-life-satisfaction-score

Brown, J. (2009). Choosing the right number of components or factors in PCA and EFA. JALT Testing & Evaluation SIG Newsletter, 13(2). Retrieved from http://hosted.jalt.org/test/bro_30.htm

Carillo, M. F., Largo, F. F., & Ceballos, R. F. (2019). Principal Component Analysis on the Philippine Health Data. arXiv preprint arXiv: 1902.07905.

Dalat, Y. (2019). Lifestyle_and_Wellbeing_Data. 12,757 survey responses with 23 attributes describing our lifestyle & behavior. Retrieved from https://www.kaggle.com/ydalat/lifestyle-and-wellbeing-data

Hayden, L. (2018). Principal Component Analysis in R. Retrieved from https://www.datacamp.com/community/tutorials/pca-analysis-r

Nirmal, A., J. (2019). Getting started with t-SNE for biologist (R). Retrieved from https://ajitjohnson.com/tsne-for-biologist-tutorial/