The database in use was acquired from Kaggle (Dalat, 2020) and is based on a survey conducted by Authentic Happiness (2015) in 2015, with responses collected over the 5 year time span. It consists of 12756 observations and 23 variables, that will further be explained, giving us 293388 cells. It is a substantial dataset and dimension reduction can widely facilitate its readability and use.
The survey was carried out to determine how we live our lives. More precisely, it examines how we as a society thrive in our personal and professional lives - how efficiently we shape our lifestyles, rituals, habits to optimise the overall life satisfaction along the given 5 dimensions:
“1. Healthy body, reflecting your fitness and healthy habits
2. Healthy mind, indicating how well you embrace positive emotions
3. Expertise, measuring the ability to grow your expertise and achieve something unique
4. Connection, assessing the strength of your social network and your inclination to discover the world
5. Meaning, evaluating your compassion, generosity and how much”you are living the life of your dreams“.”
The goal of this paper is to use different methods of dimension reduction, PCA and MCA mixed with PCA on a questionnaire data. MCA (Multiple correspondence analysis) is an equivalent to the process of PCA (Principal Component Analysis), a statistical method often implemented on expanded, manifold databases, for the trimming of an X number of dimensions and thereby creating a simpler version of the dataset with a minimum loss of information, allowing for a proficient visualisation on 2-dimensional plots for a dependency determination between the variables (Hayden, 2018), but for variables that are categorical. In our dataset, we only have categorical data, mostly in the numerical form. ## Data Preparation First, let us have a look at the dataset in question, the dimensions and variables. As a first step, before the actual preprocessing, I assigned new variable names to the dataset for a cleaner read.
wellbeing <- read.csv("Wellbeing_and_lifestyle_data.csv", header = TRUE, sep = ",", col.names = c("Timestamp", "FruitsOrVeggiesPerDay", "DailyStress", "PlacesVisitedAnnually", "CoreCircle", "SupportingOthers", "DailySocialNetworking", "PersonalAchievements", "AnnualDonations", "BMIRange", "WeeklyTaskCompletion", "HoursInFocus", "DailySteps", "LifeVisionAhead", "HoursOfSleep", "LostVacationDays", "DailyAnger", "IncomeSufficiency", "PersonalAwards", "TimeForPassion", "WeeklyMeditation", "AgeRange", "Gender"), stringsAsFactors = FALSE)
head(wellbeing, 10)
## Timestamp FruitsOrVeggiesPerDay DailyStress PlacesVisitedAnnually CoreCircle
## 1 7/7/15 3 2 2 5
## 2 7/7/15 2 3 4 3
## 3 7/7/15 2 3 3 4
## 4 7/7/15 3 3 10 3
## 5 7/7/15 5 1 3 3
## 6 7/8/15 3 2 3 9
## 7 7/8/15 4 2 10 6
## 8 7/9/15 3 4 5 3
## 9 7/9/15 5 3 6 4
## 10 7/10/15 4 4 2 6
## SupportingOthers DailySocialNetworking PersonalAchievements AnnualDonations
## 1 0 5 2 0
## 2 8 10 5 2
## 3 4 10 3 2
## 4 10 7 2 5
## 5 10 4 2 4
## 6 10 10 2 3
## 7 10 10 3 5
## 8 5 7 4 0
## 9 3 3 5 4
## 10 10 10 0 4
## BMIRange WeeklyTaskCompletion HoursInFocus DailySteps LifeVisionAhead
## 1 1 6 4 5 0
## 2 2 5 2 5 5
## 3 2 2 2 4 5
## 4 2 3 5 5 0
## 5 2 5 0 5 0
## 6 1 6 1 7 10
## 7 2 8 8 7 5
## 8 1 8 2 8 10
## 9 1 10 2 1 5
## 10 2 3 2 3 0
## HoursOfSleep LostVacationDays DailyAnger IncomeSufficiency PersonalAwards
## 1 7 5 5 1 4
## 2 8 2 2 2 3
## 3 8 10 2 2 4
## 4 5 7 5 1 5
## 5 7 0 0 2 8
## 6 8 0 2 2 10
## 7 7 10 0 2 10
## 8 6 0 2 2 8
## 9 10 0 2 2 10
## 10 6 0 0 1 3
## TimeForPassion WeeklyMeditation AgeRange Gender
## 1 0 5 36 to 50 Female
## 2 2 6 36 to 50 Female
## 3 8 3 36 to 50 Female
## 4 2 0 51 or more Female
## 5 1 5 51 or more Female
## 6 8 3 51 or more Female
## 7 8 10 51 or more Male
## 8 2 2 21 to 35 Female
## 9 3 10 21 to 35 Female
## 10 8 1 51 or more Female
The variables correspond to:
Timestamp - Date when survey was submitted
FruitsOrVeggiesPerDay - How many fruits or vegetables do you eat on a daily basis?
DailyStress - How much stress do you typically experience?
PlacesVisitedAnnually - How many places do you typically visit in a year? Such as new states, cities, museums…
CoreCircle - How many people are you very close to?
SupportingOthers - How many people do you help at achieving a better life? Score of selflessness
DailySocialNetworking - How many people do you interact with on a typical day?
PersonalAchievements - How many achievements are you proud of?
AnnualDonations - How many times a year do you donate your time or money to good causes?
BMIRange - What is your BMI range? 1 - Below 25; 2 - Above 25
WeeklyTaskCompletion - How well do you complete your weekly to-do lists?
HoursInFocus - In a typical day, how many hours do you experience a good “flow”?
DailySteps - How many steps (expressed in thousands) do you typically walk daily?
LifeVisionAhead - How many years ahead is your life planned for?
HoursOfSleep- How many hours do you typically sleep every night?
LostVacationDays - How many days of vacation do you usually lose annually?
DailyAnger - How often do you shout at someone on a daily basis?
IncomeSufficiency - How sufficient is your income to cover your basic like expenses? 1- Not or hardly sufficient; 2 - Sufficient
PersonalAwards - How many awards have you received in your life?
TimeForPassion - How many hours daily do you spend doing what you are passionate about?
WeeklyMeditation - How many times do you have the opportunity to think about yourself weekly?
AgeRange - Age groups: Less than 20, 21 to 35, 36 to 50, 51 or more
Gender - Male, Female
All the questions were asked in a pre-defined answers form, most of them on a 1-10 or 1-5 sequence, with a couple of exceptions. In variables IncomeSufficiency, BMIRange we have a 2 level answer choices. Apart from that, we have the AgeRange variable with 4 subgroups and Gender variable with two options, both variables with non numerical alternatives. Excluding the Timestamp variable, all of the inputs are categorical. However, some of them include simple answer to “how many?” and a simple count. Others have assigned levels. Therefore, we cannot simply conduct a PCA or MCA on the whole dataset, but divide the variables based on their background and carry out the respective tests. We will also implement t-SNE.
But first, let’s transform the text data into numerics for a unified set. We will also get rid of the Timestamp variable, as it is only a survey completion information, not connected with the survey itself and its contents that we aim to reduce.
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
wellbeing <- select(wellbeing, -Timestamp)
colnames(wellbeing)
## [1] "FruitsOrVeggiesPerDay" "DailyStress" "PlacesVisitedAnnually"
## [4] "CoreCircle" "SupportingOthers" "DailySocialNetworking"
## [7] "PersonalAchievements" "AnnualDonations" "BMIRange"
## [10] "WeeklyTaskCompletion" "HoursInFocus" "DailySteps"
## [13] "LifeVisionAhead" "HoursOfSleep" "LostVacationDays"
## [16] "DailyAnger" "IncomeSufficiency" "PersonalAwards"
## [19] "TimeForPassion" "WeeklyMeditation" "AgeRange"
## [22] "Gender"
Now that we got rid of the initially unnecessary variable, we can transform the AgeRange and Gender variables into numerical inputs for some descriptive statistics further on. Let’s create a new variable AgeGroup, and subset the Gender inputs.
wellbeing$AgeGroup <- ifelse(wellbeing$AgeRange == "Less than 20", 1,
ifelse(wellbeing$AgeRange == "21 to 35", 2,
ifelse(wellbeing$AgeRange == "36 to 50", 3, 4)))
wellbeing$Gender <- ifelse(wellbeing$Gender == "Male", 0, 1)
wellbeing <- select(wellbeing, -AgeRange)
wellbeing$DailyStress[wellbeing$DailyStress == "1/1/00"] = 1
Let us check some statistics now.
library("ggplot2")
library("reshape2")
#Converting the DailyStress variable to numeric as a precaution
wellbeing$DailyStress <- as.numeric(wellbeing$DailyStress)
wellbeingPlot <- melt(wellbeing[, 1:12], id.vars = NULL)
ggplot(data = wellbeingPlot) + geom_bar(aes(x = value)) + theme(plot.title = element_text(hjust = 0.5, size = 15)) +
facet_wrap(~ variable, scales = "free", ncol = 3)
wellbeingPlot2 <- melt(wellbeing[, 13:22], id.vars = NULL)
ggplot(data = wellbeingPlot2) + geom_bar(aes(x = value)) + theme(plot.title = element_text(hjust = 0.5, size = 15)) +
facet_wrap(~ variable, scales = "free", ncol = 3)
summary(wellbeing)
## FruitsOrVeggiesPerDay DailyStress PlacesVisitedAnnually CoreCircle
## Min. :0.00 Min. :0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:2.00 1st Qu.:2.000 1st Qu.: 3.000 1st Qu.: 3.000
## Median :3.00 Median :3.000 Median : 5.000 Median : 5.000
## Mean :2.93 Mean :2.783 Mean : 5.339 Mean : 5.485
## 3rd Qu.:4.00 3rd Qu.:4.000 3rd Qu.: 8.000 3rd Qu.: 8.000
## Max. :5.00 Max. :5.000 Max. :10.000 Max. :10.000
## SupportingOthers DailySocialNetworking PersonalAchievements AnnualDonations
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. :0.000
## 1st Qu.: 3.000 1st Qu.: 4.000 1st Qu.: 2.000 1st Qu.:1.000
## Median : 5.000 Median : 6.000 Median : 3.000 Median :3.000
## Mean : 5.578 Mean : 6.551 Mean : 3.963 Mean :2.701
## 3rd Qu.:10.000 3rd Qu.:10.000 3rd Qu.: 6.000 3rd Qu.:5.000
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :5.000
## BMIRange WeeklyTaskCompletion HoursInFocus DailySteps
## Min. :1.000 Min. : 0.000 Min. : 0.000 Min. : 1.000
## 1st Qu.:1.000 1st Qu.: 4.000 1st Qu.: 1.000 1st Qu.: 3.000
## Median :1.000 Median : 6.000 Median : 3.000 Median : 5.000
## Mean :1.401 Mean : 5.706 Mean : 3.127 Mean : 5.705
## 3rd Qu.:2.000 3rd Qu.: 8.000 3rd Qu.: 4.000 3rd Qu.: 8.000
## Max. :2.000 Max. :10.000 Max. :10.000 Max. :10.000
## LifeVisionAhead HoursOfSleep LostVacationDays DailyAnger
## Min. : 0.000 Min. : 1.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 1.000 1st Qu.: 6.000 1st Qu.: 0.000 1st Qu.: 1.000
## Median : 3.000 Median : 7.000 Median : 0.000 Median : 2.000
## Mean : 3.712 Mean : 7.035 Mean : 2.833 Mean : 2.921
## 3rd Qu.: 5.000 3rd Qu.: 8.000 3rd Qu.: 5.000 3rd Qu.: 4.000
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000
## IncomeSufficiency PersonalAwards TimeForPassion WeeklyMeditation
## Min. :1.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:1.000 1st Qu.: 3.000 1st Qu.: 1.000 1st Qu.: 4.000
## Median :2.000 Median : 5.000 Median : 2.000 Median : 7.000
## Mean :1.728 Mean : 5.703 Mean : 3.267 Mean : 6.253
## 3rd Qu.:2.000 3rd Qu.: 9.000 3rd Qu.: 5.000 3rd Qu.:10.000
## Max. :2.000 Max. :10.000 Max. :10.000 Max. :10.000
## Gender AgeGroup
## Min. :0.0000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:2.000
## Median :1.0000 Median :2.000
## Mean :0.6047 Mean :2.582
## 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :1.0000 Max. :4.000
First we had to turn all the data into numerics, just as a precaution, as an character outlier (DailyStress) was spotted. We do not need to scale the data as all of it is simply in the integer measure. We do not face the problem of any NAs, as all the questions were compulsory for the survey completion.
I decided to split the variables into two plots for a better clarity of graphs.
As we can see, around 60% of the respondents are women. A substantial share of the respondents, over 38% of them belong to the second age group (21 to 35 years old), with the second largest group being the set or 36 to 50 years olds.
More respondents are in the first BMI range, meaning they are not obese, and their weight/height ratio is correct. Following on the physical health variables, majority of the respondents eats at least one fruit/vegetable per day, sleeps 6-8 hours. Noteworthy, the biggest share in the DailySteps variable is the one equivalent to 10 thousand steps a day, which is the recommended number to stay in shape and get the daily dose of exercise. Nonetheless, the overwhelming majority of the respondents does not walk 10 thousands steps daily, varying somewhat evenly from 1 to 8k with a big decrease in 9k.
Moving onto the financial freedom, we can see that significant part of the respondents is able to cover all the needs with the incomes. Moreover, they visit in many cases 10 or more new places annually, and in approximately 1/3 of the cases donate at least 5 times a year. That shows a balances and diversified life.
When we move onto the mental wellbeing we are faces with some thought-provoking instances. By looking at DailyMeditation, we see that approx 2/3 of the group takes good care of their mental health with meditation, relaxation or fitness. On the other hand, a visible majority of people spend 2,5 or less hours per day doing what they really like, leading to a conclusion that they do not enjoy their careers and activities. Luckily we also see a nice share of people spending at least 5 hours per day enjoying their tasks.
Most of the people are in the present, without going too much ahead of their lives, having planned at longest 2,5 years to the future. Approximately 1/3 of the respondents has at least a 5 year plan.
Surprisingly, although most people experience a low flow - a time in focus - from 0 to 3 hours per day, they complete more than half of the tasks planned for the week (5.706 tasks on average). However, they also experience a relative stress of 2,73 out of 5. This is probably due to poor flow abilities, where they cannot focus on their duties, rushing to the deadlines.
Lastly, the generality of people mingle with a well numbered circle of connections, both on the close proximity and the daily interactions, where most of the cases have at least 10 of them daily. They also give their support to multiple people, leading to a conclusion of a well balanced individual and intra group relationships.
After some deeper look into the data structure, I gained some doubts whether PCA is the appropriate measure. Although all of our data is in the numerical form, it is categorical, have only a discrete number of possibilities. Moreover we can distinct two subsets - questions of “how many?”, where a simple number is the final answer, and variables where an number is corresponding to a given level, or a range. For the MCA measure, we will divide the dataset, and convert the leveled answers to text values. The rest will be left unchanged. Here is the division. Numerical set:
* FruitsOrVeggiesPerDay
* PlacesVisitedAnnually
* CoreCircle
* SupportingOthers
* DailySocialNetworking
* PersonalAchievements
* AnnualDonations
* WeeklyTaskCompletion
* HoursInFocus
* DailySteps
* LifeVisionAhead
* HoursOfSleep
* LostVacationDays
* PersonalAwards
* TimeForPassion
* WeeklyMeditation
Text based set:
* DailyStress
* BMIRange
* DailyAnger
* IncomeSufficiency
* AgeRange
* Gender
But first let’s conduct two individual tests to determine, whether our data is PCA testable.
* Kaiser-Meyer-Olkin test - a test used to examine whether the Principal Component Analysis is useful for the given variables or not.
* Bartlett’s test - a test used to find out whether the correlation matrix for a set of data is the identity matrix. If the identity correlation matrix is confirmed, PCA is not an appropriate measure (Carillo et al., 2019).
library("psych")
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library("corrplot")
## corrplot 0.84 loaded
#Correlation Matrix
cor_w <- cor(wellbeing)
corrplot(cor_w, type = "lower", order = "hclust", tl.col = "black", tl.cex = 0.5)
KMO(cor_w)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor_w)
## Overall MSA = 0.84
## MSA for each item =
## FruitsOrVeggiesPerDay DailyStress PlacesVisitedAnnually
## 0.84 0.73 0.89
## CoreCircle SupportingOthers DailySocialNetworking
## 0.90 0.87 0.83
## PersonalAchievements AnnualDonations BMIRange
## 0.89 0.87 0.60
## WeeklyTaskCompletion HoursInFocus DailySteps
## 0.91 0.85 0.83
## LifeVisionAhead HoursOfSleep LostVacationDays
## 0.92 0.72 0.73
## DailyAnger IncomeSufficiency PersonalAwards
## 0.72 0.83 0.89
## TimeForPassion WeeklyMeditation Gender
## 0.86 0.83 0.57
## AgeGroup
## 0.57
cortest.bartlett(cor_w, n = 12756)
## $chisq
## [1] 41370.92
##
## $p.value
## [1] 0
##
## $df
## [1] 231
In KMO test, our Overall MSA = 0.84, the test is satisfied, as it surpassed the .6 threshold to be able to proceed with factor analysis. In the bartlett’s test, the p-value = 0, meaning that our correlation matrix is completely different from the identity matrix.
After positive outcomes from both test, we can confidently proceed with the PCA. Regardless of this fact, we will conduct the MCA as well for a confrontation of the results. Let’s start with PCA first.
To determine the optimal number of components to remain in the set, we will use the Kaiser’s Stopping Rule - a measure used when deciding which components should be chosen. Here, the ones with eigenvalue higher over 1 should stay (Brown, 2009). To better visualise the eigenvalues, we will plot the scree test. There, all the elements are plotted in the descending order, and based on the rule of elbow, we choose the optimum number.
library("factoextra")
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
pca <- prcomp(wellbeing, center = TRUE, scale = TRUE)
eigen(cor_w)$values
## [1] 4.1824990 1.6794879 1.4774633 1.2836214 1.0746203 1.0521988 0.9717958
## [8] 0.9253522 0.8946419 0.8325454 0.7688360 0.7482203 0.7124629 0.6874012
## [15] 0.6796851 0.6585689 0.6234278 0.6160834 0.5799556 0.5470580 0.5312043
## [22] 0.4728706
fviz_eig(pca, choice = "eigenvalue", ncp = 22, barfill = "hotpink3", barcolor = "hotpink4", linecolor = "brown4", addlabels = TRUE, main = "Eigenvalues")
Based on the Kaiser’s rule, 6 components should be chosen, as the eigenvalue for those is above 1.
fviz_pca_var(pca, col.var = "brown4")
fviz_screeplot(pca, addlabels = TRUE, barfill = "salmon", barcolor = "tomato3", linecolor = "tomato4")
Basing solely on the plot above, a few components are visible, although highly stacked - the downside of plotting 22 variables at once. We can distinguish the first dimension with DailyStress, Daily Anger, LostVacationDays and BMIRange, the second with SupportingOthers, DailySocialNetworking, AnnualDonations and some others, the third with HoursInFocus, PersonalAchievements, Personal Awards, fourth with TimeForPassion, PlacesVisitedAnnually, WeeklyTaskCompletion, DailySteps and more, fifth one of Weekly Meditation, and the sixth one of HoursOfSleep. However, the actual distribution of dimesions is not of the highest clarity. To better determine which variables are allocated to each of the components, let us plot the contribution of variables to the principal components.
library("pdp")
pca_var <- get_pca_var(pca)
fviz_contrib(pca, "var", axes = 1:6, fill = "tomato3", color = "tomato4")
The variables of AgeGroup, Gender, DailyStress, DailySocialNetworking, TimeForPassion, WeeklyMeditation, DailySteps, HoursInFocus, SupportingOthers, PersonalAchievements and HoursOfSleep contribute the most to the results of the survey.
For the MCA, first, we need to divide the dataset into two subsets, and transform one of them to character type.
library("dplyr")
wellbeing_text <- select(wellbeing, DailyStress, BMIRange, IncomeSufficiency, AgeGroup, Gender)
wellbeing_text$DailyStress <- factor(wellbeing_text$DailyStress, ordered = TRUE)
wellbeing_text$BMIRange <- factor(wellbeing_text$BMIRange, ordered = TRUE)
wellbeing_text$IncomeSufficiency <- factor(wellbeing_text$IncomeSufficiency, ordered = TRUE)
wellbeing_text$AgeGroup <- factor(wellbeing_text$AgeGroup, ordered = TRUE)
wellbeing_text$Gender <- factor(wellbeing_text$Gender, ordered = FALSE)
wellbeing_num <- select(wellbeing, FruitsOrVeggiesPerDay, PlacesVisitedAnnually, CoreCircle, SupportingOthers, DailySocialNetworking, PersonalAchievements, AnnualDonations, WeeklyTaskCompletion, HoursInFocus, DailySteps, LifeVisionAhead, HoursOfSleep, LostVacationDays, PersonalAwards, TimeForPassion, WeeklyMeditation, DailyAnger)
wellbeing_text$DailyStress <- ifelse(wellbeing_text$DailyStress == 0, "no stress",
ifelse(wellbeing_text$DailyStress == 1, "not a lot of stress",
ifelse(wellbeing_text$DailyStress == 2, "some stress",
ifelse(wellbeing_text$DailyStress == 3, "relative stress",
ifelse(wellbeing_text$DailyStress == 4, "a lot of stress", "constant stress")))))
wellbeing_text$BMIRange <- ifelse(wellbeing_text$BMIRange == 1, "Below 25", "Above 25")
wellbeing_text$IncomeSufficiency <- ifelse(wellbeing_text$IncomeSufficiency == 1, "Not or hardly sufficient", "Sufficient or more")
wellbeing_text$AgeGroup <- ifelse(wellbeing$AgeGroup == 1, "Less than 20",
ifelse(wellbeing$AgeGroup == 2, "21 to 35",
ifelse(wellbeing$AgeGroup == 3, "36 to 50", "50 or more")))
wellbeing_text$Gender <- wellbeing$Gender <- ifelse(wellbeing$Gender == 0, "Male", "Female")
head(wellbeing_text)
## DailyStress BMIRange IncomeSufficiency AgeGroup Gender
## 1 some stress Below 25 Not or hardly sufficient 36 to 50 Female
## 2 relative stress Above 25 Sufficient or more 36 to 50 Female
## 3 relative stress Above 25 Sufficient or more 36 to 50 Female
## 4 relative stress Above 25 Not or hardly sufficient 50 or more Female
## 5 not a lot of stress Above 25 Sufficient or more 50 or more Female
## 6 some stress Below 25 Sufficient or more 50 or more Female
head(wellbeing_num)
## FruitsOrVeggiesPerDay PlacesVisitedAnnually CoreCircle SupportingOthers
## 1 3 2 5 0
## 2 2 4 3 8
## 3 2 3 4 4
## 4 3 10 3 10
## 5 5 3 3 10
## 6 3 3 9 10
## DailySocialNetworking PersonalAchievements AnnualDonations
## 1 5 2 0
## 2 10 5 2
## 3 10 3 2
## 4 7 2 5
## 5 4 2 4
## 6 10 2 3
## WeeklyTaskCompletion HoursInFocus DailySteps LifeVisionAhead HoursOfSleep
## 1 6 4 5 0 7
## 2 5 2 5 5 8
## 3 2 2 4 5 8
## 4 3 5 5 0 5
## 5 5 0 5 0 7
## 6 6 1 7 10 8
## LostVacationDays PersonalAwards TimeForPassion WeeklyMeditation DailyAnger
## 1 5 4 0 5 5
## 2 2 3 2 6 2
## 3 10 4 8 3 2
## 4 7 5 2 0 5
## 5 0 8 1 5 0
## 6 0 10 8 3 2
Now we can run MCA on categorical data and another PCA run on the subset wellbeing_num.
library("FactoMineR")
mca1 <- MCA(wellbeing_text, ncp = 5, graph = FALSE)
fviz_screeplot(mca1, addlabels = TRUE, barfill = "salmon", barcolor = "tomato3", linecolor = "tomato4")
fviz_contrib(mca1, choice = "var", axes = 1:4, fill = "tomato3", color = "tomato4")
Scree plots represent the percentage of variance. Now, let’s have a look at the contribution plot, shows which variables contribute to the results in 2 dimensional space, as we have 5 variables, 2 of them have 5 dimensions, 2 have 2, and one has 4. An average of this is 3.6, so let’s choose 4 dimensions.
The options of “Less than 20” and “50 or more” in variable AgeGroup, “Not or hardly sufficient” in variable IncomeSufficiency, “contant stress” and “no stress” in variable DailyStress, “Male” in Gender, and “Above 25” in BMIRange contribute the most to the results.
Now, let’s run PCA and wellbeing_num. Both test have favourable outcomes for running PCA.
#Correlation Matrix on wellbeing_num
cor_w2 <- cor(wellbeing_num)
KMO(cor_w2)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor_w2)
## Overall MSA = 0.87
## MSA for each item =
## FruitsOrVeggiesPerDay PlacesVisitedAnnually CoreCircle
## 0.86 0.89 0.90
## SupportingOthers DailySocialNetworking PersonalAchievements
## 0.87 0.85 0.89
## AnnualDonations WeeklyTaskCompletion HoursInFocus
## 0.86 0.91 0.84
## DailySteps LifeVisionAhead HoursOfSleep
## 0.85 0.92 0.70
## LostVacationDays PersonalAwards TimeForPassion
## 0.71 0.89 0.86
## WeeklyMeditation DailyAnger
## 0.85 0.83
cortest.bartlett(cor_w2, n = 12756)
## $chisq
## [1] 32744.83
##
## $p.value
## [1] 0
##
## $df
## [1] 136
pca2 <- prcomp(wellbeing_num, center = TRUE, scale = TRUE)
eigen(cor_w2)$values
## [1] 4.0113786 1.3834533 1.1704243 1.0143174 0.9862448 0.9272378 0.8872879
## [8] 0.8092303 0.7607359 0.7578666 0.7195338 0.6773710 0.6614664 0.6407978
## [15] 0.5601076 0.5470651 0.4854814
fviz_eig(pca2, choice = "eigenvalue", ncp = 17, barfill = "hotpink3", barcolor = "hotpink4", linecolor = "brown4", addlabels = TRUE, main = "Eigenvalues")
Here, we have a recommended number of 3 dimensions (Eigenvalues above 1). Let’s see the circle plot for the variable distribution now. And below that we will have the contribution plot, to see which numerical variables contribute the most.
fviz_pca_var(pca2, col.var = "darkred")
fviz_contrib(pca2, choice = "var", axes = 1:3, fill = "tomato3", color = "tomato4")
In this case we have the variables of HoursInFocus, TimeForPassion, SupportingOthers, PersonalAchievements, PlacesVisitedAnnually and LiveVisionAhead, that contribute at the greatest spectrum to the dataset results.
Although the percentages of explained variances in both cases of PCA and PCA mixed with MCA were not high enough for the proposed number of dimensions to assume that all the points have been correctly allocated, the outcomes seem to be relatively sensible in both scenarios. Gender and AgeGroup are highly relevant for the survey outcome, where age groups of below 20 and over 50 years old had lower wellbeing scores. That is rather unsurprising, as people below the age of 20 rarely take care of their mental health, as well as the needs covering, while people over the age of 50 encounter the phase of decrease in vitality and beginnings of diseases etc. BMIRange was also an important indicator of the outcomes, with a positive shift for the below 25 spectrum. Apart from the purely categorical data, the variables of our relationships, time spend on passion AND in focus, and physical activity variables (Daily Steps, WeeklyMeditation) seem to have the biggest impact on the results. Nonetheless, dimension reduction can be successfully conducted on a survey data, both in the sphere of variables - number of questions, where they can be lowered to general categories, as well as in number of options, in some cases to even binary.
Automatic Happiness. (2015). YOUR WORK-LIFE BALANCE SCORE. Retrieved from http://www.authentic-happiness.com/your-life-satisfaction-score
Brown, J. (2009). Choosing the right number of components or factors in PCA and EFA. JALT Testing & Evaluation SIG Newsletter, 13(2). Retrieved from http://hosted.jalt.org/test/bro_30.htm
Carillo, M. F., Largo, F. F., & Ceballos, R. F. (2019). Principal Component Analysis on the Philippine Health Data. arXiv preprint arXiv: 1902.07905.
Dalat, Y. (2019). Lifestyle_and_Wellbeing_Data. 12,757 survey responses with 23 attributes describing our lifestyle & behavior. Retrieved from https://www.kaggle.com/ydalat/lifestyle-and-wellbeing-data
Hayden, L. (2018). Principal Component Analysis in R. Retrieved from https://www.datacamp.com/community/tutorials/pca-analysis-r
Nirmal, A., J. (2019). Getting started with t-SNE for biologist (R). Retrieved from https://ajitjohnson.com/tsne-for-biologist-tutorial/