Puvaneswari Poobalan (S2182524)
Hun Yee Chong (S2197999)
Chunli Wang (22064827)
Vinoshini Loganathan (17090738)
Ze Ying Tan (22058059)
This project explores the relationship between personality traits and happiness and develops a data-driven method to predict an individual's happiness level from their personality characteristics. The method rests on the hypothesis that some personality traits correlate strongly and positively with happiness, and that identifying these traits allows a person's happiness score to be predicted more accurately.
Questions: 1. How do personality traits affect the happiness score? 2. How many clusters can be obtained from the data?
The project aims to achieve the following goals:
The project involves the following key processes:
Two data sets are used for this project:
Five Personality Data: This data set provides information on respondents' personality traits and their countries.
Link: https://www.kaggle.com/code/yadhua/eda-and-cluster-analysis/input
Year: 2018
Purpose: To analyze the relationship between personality traits and life satisfaction.
Content: The dataset includes features such as Big Five personality traits (openness, conscientiousness, extraversion, agreeableness, and neuroticism), and countries.
Structure: The dataset is provided in a comma-separated values (CSV) format with each row representing an individual and each column representing a specific feature.
Dimension: 1015341 rows and 110 columns
World Happiness Data 2018: This data set contains happiness scores for different countries.
Link: https://www.kaggle.com/datasets/mathurinache/world-happiness-report?select=2018.csv
Year: 2018
Content: The dataset includes features such as the happiness score, GDP per capita, social support, life expectancy, and freedom to make life choices.
Structure: The dataset is provided in a comma-separated values (CSV) format with each row representing a specific country and each column representing a specific feature.
Dimension: 156 rows and 9 variables
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(ggrepel)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(treemap)
library(corrplot)
## corrplot 0.92 loaded
library(ggcorrplot)
library(treemapify)
library(geomapdata)
library(BSDA)
## Loading required package: lattice
##
## Attaching package: 'BSDA'
## The following object is masked from 'package:datasets':
##
## Orange
library(caret)
library(klaR)
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:plotly':
##
## select
## The following object is masked from 'package:dplyr':
##
## select
library(class)
library(mlbench)
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:corrplot':
##
## corrplot
## The following object is masked from 'package:stats':
##
## loadings
library(boot)
##
## Attaching package: 'boot'
## The following object is masked from 'package:lattice':
##
## melanoma
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(stats)
library(cluster)
df<-read.delim("C:\\Users\\Puvanes\\Documents\\R Programming\\Programming for Data Science\\WQD7004 Group Project\\Five Personality Data.txt")
df<-data.frame(df)
country_code<-read.csv("C:\\Users\\Puvanes\\Documents\\R Programming\\Programming for Data Science\\WQD7004 Group Project\\wikipedia-iso-country-codes.csv")
country_code2<-country_code[c("Full_Name","Alpha.2.code")]
happiness<-read.csv("C:\\Users\\Puvanes\\Documents\\R Programming\\Programming for Data Science\\WQD7004 Group Project\\2018.csv")
happiness2<-happiness[c("Country.or.region","Score")]
df2 <- left_join(df, country_code2, by=c('country'='Alpha.2.code'))
df3 <- left_join(df2, happiness2, by=c('Full_Name'='Country.or.region'))
dim(df3)
## [1] 1015341 112
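A left join keeps every row of the left table, so the row count is unchanged as long as the lookup keys are unique, and any key without a match is filled with NA. The following is a minimal sketch on toy data, not the project's actual tables: `people`, `codes`, and `scores` are hypothetical stand-ins for `df`, `country_code2`, and `happiness2`, and all values are made up.

```r
suppressPackageStartupMessages(library(dplyr))

# Toy stand-ins for df, country_code2 and happiness2 (all values hypothetical)
people <- data.frame(id = 1:4, country = c("AA", "BB", "CC", NA),
                     stringsAsFactors = FALSE)
codes  <- data.frame(Full_Name = c("Country A", "Country B"),
                     Alpha.2.code = c("AA", "BB"),
                     stringsAsFactors = FALSE)
scores <- data.frame(Country.or.region = "Country A", Score = 6.5,
                     stringsAsFactors = FALSE)

joined <- people %>%
  left_join(codes,  by = c("country" = "Alpha.2.code")) %>%
  left_join(scores, by = c("Full_Name" = "Country.or.region"))

nrow(joined)            # equals nrow(people): the lookup keys are unique here
colSums(is.na(joined))  # missing counts accumulate across the two joins
```

In the toy result the NA count rises from `country` to `Full_Name` to `Score`, the same pattern that a per-column missing-value count would show after two chained lookups.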
missing_counts<-colSums(is.na(df3))
missing_counts
## EXT1 EXT2 EXT3
## 0 0 0
## EXT4 EXT5 EXT6
## 0 0 0
## EXT7 EXT8 EXT9
## 0 0 0
## EXT10 EST1 EST2
## 0 0 0
## EST3 EST4 EST5
## 0 0 0
## EST6 EST7 EST8
## 0 0 0
## EST9 EST10 AGR1
## 0 0 0
## AGR2 AGR3 AGR4
## 0 0 0
## AGR5 AGR6 AGR7
## 0 0 0
## AGR8 AGR9 AGR10
## 0 0 0
## CSN1 CSN2 CSN3
## 0 0 0
## CSN4 CSN5 CSN6
## 0 0 0
## CSN7 CSN8 CSN9
## 0 0 0
## CSN10 OPN1 OPN2
## 0 0 0
## OPN3 OPN4 OPN5
## 0 0 0
## OPN6 OPN7 OPN8
## 0 0 0
## OPN9 OPN10 EXT1_E
## 0 0 0
## EXT2_E EXT3_E EXT4_E
## 0 0 0
## EXT5_E EXT6_E EXT7_E
## 0 0 0
## EXT8_E EXT9_E EXT10_E
## 0 0 0
## EST1_E EST2_E EST3_E
## 0 0 0
## EST4_E EST5_E EST6_E
## 0 0 0
## EST7_E EST8_E EST9_E
## 0 0 0
## EST10_E AGR1_E AGR2_E
## 0 0 0
## AGR3_E AGR4_E AGR5_E
## 0 0 0
## AGR6_E AGR7_E AGR8_E
## 0 0 0
## AGR9_E AGR10_E CSN1_E
## 0 0 0
## CSN2_E CSN3_E CSN4_E
## 0 0 0
## CSN5_E CSN6_E CSN7_E
## 0 0 0
## CSN8_E CSN9_E CSN10_E
## 0 0 0
## OPN1_E OPN2_E OPN3_E
## 0 0 0
## OPN4_E OPN5_E OPN6_E
## 0 0 0
## OPN7_E OPN8_E OPN9_E
## 0 0 0
## OPN10_E dateload screenw
## 0 0 0
## screenh introelapse testelapse
## 0 0 0
## endelapse IPC country
## 0 0 77
## lat_appx_lots_of_err long_appx_lots_of_err Full_Name
## 0 0 13796
## Score
## 21503
missing_values<-df3[is.na(df3)]
rows_with_missing<-df3[!complete.cases(df3),]
distinct_column<-unique(rows_with_missing$Full_Name)
distinct_column
## [1] "Hong Kong S.A.R., China"
## [2] "Oman"
## [3] NA
## [4] "Brunei Darussalam"
## [5] "Puerto Rico"
## [6] "Namibia"
## [7] "Trinidad and Tobago"
## [8] "Bahamas"
## [9] "Isle of Man"
## [10] "Maldives"
## [11] "Gibraltar"
## [12] "Macao"
## [13] "Macedonia, the former Yugoslav Republic of"
## [14] "Grenada"
## [15] "Cayman Islands"
## [16] "Barbados"
## [17] "Côte d'Ivoire"
## [18] "Papua New Guinea"
## [19] "Antigua and Barbuda"
## [20] "Virgin Islands, U.S."
## [21] "Swaziland"
## [22] "Bermuda"
## [23] "Fiji"
## [24] "Saint Vincent and the Grenadines"
## [25] "Guernsey"
## [26] "Guadeloupe"
## [27] "Åland Islands"
## [28] "Libyan Arab Jamahiriya"
## [29] "Jersey"
## [30] "Northern Mariana Islands"
## [31] "Syrian Arab Republic"
## [32] "Palestinian Territory, Occupied"
## [33] "Moldova, Republic of"
## [34] "Guam"
## [35] "Virgin Islands, British"
## [36] "French Polynesia"
## [37] "Dominica"
## [38] "Aruba"
## [39] "Saint Lucia"
## [40] "Guyana"
## [41] "Cape Verde"
## [42] "Gambia"
## [43] "Lao People's Democratic Republic"
## [44] "Suriname"
## [45] "Cuba"
## [46] "New Caledonia"
## [47] "Seychelles"
## [48] "Faroe Islands"
## [49] "Palau"
## [50] "Niue"
## [51] "Anguilla"
## [52] "Saint Kitts and Nevis"
## [53] "Vanuatu"
## [54] "Monaco"
## [55] "Cook Islands"
## [56] "Martinique"
## [57] "Antarctica"
## [58] "Greenland"
## [59] "Montserrat"
## [60] "Falkland Islands (Malvinas)"
## [61] "Marshall Islands"
## [62] "Turks and Caicos"
## [63] "Reunion"
## [64] "San Marino"
## [65] "Andorra"
## [66] "French Guiana"
## [67] "Comoros"
## [68] "Liechtenstein"
## [69] "Micronesia, Federated States of"
## [70] "Tonga"
## [71] "Samoa"
## [72] "Timor-Leste"
## [73] "Equatorial Guinea"
## [74] "Saint Martin"
## [75] "Saint Pierre and Miquelon"
## [76] "Djibouti"
## [77] "American Samoa"
## [78] "Saint Helena, Ascension and Tristan da Cunha"
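Some of the names listed above may be spelling mismatches between the country-code table and the happiness table rather than countries that are truly absent. A hedged sketch of how such mismatches could be inspected and recoded before joining is shown below. The toy tables and the `name_fix` lookup are hypothetical; each mapping would need to be checked against the actual contents of `2018.csv`.

```r
# Toy stand-ins for the two name columns (the real tables are not reproduced here)
codes <- data.frame(
  Full_Name = c("Moldova, Republic of", "Syrian Arab Republic", "Fiji", "France"),
  stringsAsFactors = FALSE
)
happiness <- data.frame(
  Country.or.region = c("Moldova", "Syria", "France"),
  stringsAsFactors = FALSE
)

# Names present in the code table but absent from the happiness table
unmatched_before <- setdiff(codes$Full_Name, happiness$Country.or.region)

# Hypothetical recoding table: code-table name -> spelling in the happiness data
name_fix <- c("Moldova, Republic of" = "Moldova",
              "Syrian Arab Republic" = "Syria")

hit <- codes$Full_Name %in% names(name_fix)
codes$Full_Name[hit] <- unname(name_fix[codes$Full_Name[hit]])

# Only names with no counterpart at all remain unmatched
unmatched_after <- setdiff(codes$Full_Name, happiness$Country.or.region)
unmatched_before
unmatched_after
```

Names that remain in `unmatched_after` have no happiness score in the toy table, so dropping those rows is the only option for them.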
df4<-na.omit(df3)
class(df4)
## [1] "data.frame"
str(df4)
## 'data.frame': 993761 obs. of 112 variables:
## $ EXT1 : chr "4" "3" "2" "2" ...
## $ EXT2 : chr "1" "5" "3" "2" ...
## $ EXT3 : chr "5" "3" "4" "2" ...
## $ EXT4 : chr "2" "4" "4" "3" ...
## $ EXT5 : chr "5" "3" "3" "4" ...
## $ EXT6 : chr "1" "3" "2" "2" ...
## $ EXT7 : chr "5" "2" "1" "2" ...
## $ EXT8 : chr "2" "5" "3" "4" ...
## $ EXT9 : chr "4" "1" "2" "1" ...
## $ EXT10 : chr "1" "5" "5" "4" ...
## $ EST1 : chr "1" "2" "4" "3" ...
## $ EST2 : chr "4" "3" "4" "3" ...
## $ EST3 : chr "4" "4" "4" "3" ...
## $ EST4 : chr "2" "1" "2" "2" ...
## $ EST5 : chr "2" "3" "2" "3" ...
## $ EST6 : chr "2" "1" "2" "2" ...
## $ EST7 : chr "2" "2" "2" "2" ...
## $ EST8 : chr "2" "1" "2" "2" ...
## $ EST9 : chr "3" "3" "1" "4" ...
## $ EST10 : chr "2" "1" "3" "3" ...
## $ AGR1 : chr "2" "1" "1" "2" ...
## $ AGR2 : chr "5" "4" "4" "4" ...
## $ AGR3 : chr "2" "1" "1" "3" ...
## $ AGR4 : chr "4" "5" "4" "4" ...
## $ AGR5 : chr "2" "1" "2" "2" ...
## $ AGR6 : chr "3" "5" "4" "4" ...
## $ AGR7 : chr "2" "3" "1" "2" ...
## $ AGR8 : chr "4" "4" "4" "4" ...
## $ AGR9 : chr "3" "5" "4" "3" ...
## $ AGR10 : chr "4" "3" "3" "4" ...
## $ CSN1 : chr "3" "3" "4" "2" ...
## $ CSN2 : chr "4" "2" "2" "4" ...
## $ CSN3 : chr "3" "5" "2" "4" ...
## $ CSN4 : chr "2" "3" "2" "4" ...
## $ CSN5 : chr "2" "3" "3" "1" ...
## $ CSN6 : chr "4" "1" "3" "2" ...
## $ CSN7 : chr "4" "3" "4" "2" ...
## $ CSN8 : chr "2" "3" "2" "3" ...
## $ CSN9 : chr "4" "5" "4" "1" ...
## $ CSN10 : chr "4" "3" "2" "4" ...
## $ OPN1 : chr "5" "1" "5" "4" ...
## $ OPN2 : chr "1" "2" "1" "2" ...
## $ OPN3 : chr "4" "4" "2" "5" ...
## $ OPN4 : chr "1" "2" "1" "2" ...
## $ OPN5 : chr "4" "3" "4" "3" ...
## $ OPN6 : chr "1" "1" "2" "1" ...
## $ OPN7 : chr "5" "4" "5" "4" ...
## $ OPN8 : chr "3" "2" "3" "4" ...
## $ OPN9 : chr "4" "5" "4" "3" ...
## $ OPN10 : chr "5" "3" "4" "3" ...
## $ EXT1_E : chr "9419" "7235" "4657" "3996" ...
## $ EXT2_E : chr "5491" "3598" "3549" "2896" ...
## $ EXT3_E : chr "3959" "3315" "2543" "5096" ...
## $ EXT4_E : chr "4821" "2564" "3335" "4240" ...
## $ EXT5_E : chr "5611" "2976" "5847" "5168" ...
## $ EXT6_E : chr "2756" "3050" "2540" "5456" ...
## $ EXT7_E : chr "2388" "4787" "4922" "4360" ...
## $ EXT8_E : chr "2113" "3228" "3142" "4496" ...
## $ EXT9_E : chr "5900" "3465" "14621" "5240" ...
## $ EXT10_E : chr "4110" "3309" "2191" "4000" ...
## $ EST1_E : chr "6135" "9036" "5128" "3736" ...
## $ EST2_E : chr "4150" "2406" "3675" "4616" ...
## $ EST3_E : chr "5739" "3484" "3442" "3015" ...
## $ EST4_E : chr "6364" "3359" "4546" "2711" ...
## $ EST5_E : chr "3663" "3061" "8275" "3960" ...
## $ EST6_E : chr "5070" "2539" "2185" "4064" ...
## $ EST7_E : chr "5709" "4226" "2164" "4208" ...
## $ EST8_E : chr "4285" "2962" "1175" "2936" ...
## $ EST9_E : chr "2587" "1799" "3813" "7336" ...
## $ EST10_E : chr "3997" "1607" "1593" "3896" ...
## $ AGR1_E : chr "4750" "2158" "1089" "6062" ...
## $ AGR2_E : chr "5475" "2090" "2203" "11952" ...
## $ AGR3_E : chr "11641" "2143" "3386" "1040" ...
## $ AGR4_E : chr "3115" "2807" "1464" "2264" ...
## $ AGR5_E : chr "3207" "3422" "2562" "3664" ...
## $ AGR6_E : chr "3260" "5324" "1493" "3049" ...
## $ AGR7_E : chr "10235" "4494" "3067" "4912" ...
## $ AGR8_E : chr "5897" "3627" "13719" "7545" ...
## $ AGR9_E : chr "1758" "1850" "3892" "4632" ...
## $ AGR10_E : chr "3081" "1747" "4100" "6896" ...
## $ CSN1_E : chr "6602" "5163" "4286" "2824" ...
## $ CSN2_E : chr "5457" "5240" "4775" "520" ...
## $ CSN3_E : chr "1569" "7208" "2713" "2368" ...
## $ CSN4_E : chr "2129" "2783" "2813" "3225" ...
## $ CSN5_E : chr "3762" "4103" "4237" "2848" ...
## $ CSN6_E : chr "4420" "3431" "6308" "6264" ...
## $ CSN7_E : chr "9382" "3347" "2690" "3760" ...
## $ CSN8_E : chr "5286" "2399" "1516" "10472" ...
## $ CSN9_E : chr "4983" "3360" "2379" "3192" ...
## $ CSN10_E : chr "6339" "5595" "2983" "7704" ...
## $ OPN1_E : chr "3146" "2624" "1930" "3456" ...
## $ OPN2_E : chr "4067" "4985" "1470" "6665" ...
## $ OPN3_E : chr "2959" "1684" "1644" "1977" ...
## $ OPN4_E : chr "3411" "3026" "1683" "3728" ...
## $ OPN5_E : chr "2170" "4742" "2229" "4128" ...
## $ OPN6_E : chr "4920" "3336" "8114" "3776" ...
## $ OPN7_E : chr "4436" "2718" "2043" "2984" ...
## $ OPN8_E : chr "3116" "3374" "6295" "4192" ...
## $ OPN9_E : chr "2992" "3096" "1585" "3480" ...
## [list output truncated]
## - attr(*, "na.action")= 'omit' Named int [1:21580] 20 61 70 73 84 97 120 167 180 227 ...
## ..- attr(*, "names")= chr [1:21580] "20" "61" "70" "73" ...
summary(df4)
## EXT1 EXT2 EXT3 EXT4
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## EXT5 EXT6 EXT7 EXT8
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## EXT9 EXT10 EST1 EST2
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## EST3 EST4 EST5 EST6
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## EST7 EST8 EST9 EST10
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## AGR1 AGR2 AGR3 AGR4
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## AGR5 AGR6 AGR7 AGR8
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## AGR9 AGR10 CSN1 CSN2
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## CSN3 CSN4 CSN5 CSN6
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## CSN7 CSN8 CSN9 CSN10
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## OPN1 OPN2 OPN3 OPN4
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## OPN5 OPN6 OPN7 OPN8
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## OPN9 OPN10 EXT1_E EXT2_E
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## EXT3_E EXT4_E EXT5_E EXT6_E
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## EXT7_E EXT8_E EXT9_E EXT10_E
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## EST1_E EST2_E EST3_E EST4_E
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## EST5_E EST6_E EST7_E EST8_E
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## EST9_E EST10_E AGR1_E AGR2_E
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## AGR3_E AGR4_E AGR5_E AGR6_E
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## AGR7_E AGR8_E AGR9_E AGR10_E
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## CSN1_E CSN2_E CSN3_E CSN4_E
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## CSN5_E CSN6_E CSN7_E CSN8_E
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## CSN9_E CSN10_E OPN1_E OPN2_E
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## OPN3_E OPN4_E OPN5_E OPN6_E
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## OPN7_E OPN8_E OPN9_E OPN10_E
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## dateload screenw screenh introelapse
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## testelapse endelapse IPC country
## Length:993761 Min. :1.000e+00 Min. : 1.00 Length:993761
## Class :character 1st Qu.:9.000e+00 1st Qu.: 1.00 Class :character
## Mode :character Median :1.300e+01 Median : 1.00 Mode :character
## Mean :2.725e+03 Mean : 10.57
## 3rd Qu.:1.800e+01 3rd Qu.: 2.00
## Max. :1.493e+09 Max. :725.00
## lat_appx_lots_of_err long_appx_lots_of_err Full_Name Score
## Length:993761 Length:993761 Length:993761 Min. :2.905
## Class :character Class :character Class :character 1st Qu.:6.886
## Mode :character Mode :character Mode :character Median :6.886
## Mean :6.767
## 3rd Qu.:6.965
## Max. :7.632
df5<-dplyr::select(df4,-(EXT1_E:IPC))
dim(df5)
## [1] 993761 55
df6<-dplyr::select(df5,-(lat_appx_lots_of_err:long_appx_lots_of_err ))
dim(df6)
## [1] 993761 53
# Summary of df6
summary(df6)
## EXT1 EXT2 EXT3 EXT4
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## EXT5 EXT6 EXT7 EXT8
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## EXT9 EXT10 EST1 EST2
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## EST3 EST4 EST5 EST6
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## EST7 EST8 EST9 EST10
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## AGR1 AGR2 AGR3 AGR4
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## AGR5 AGR6 AGR7 AGR8
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## AGR9 AGR10 CSN1 CSN2
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## CSN3 CSN4 CSN5 CSN6
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## CSN7 CSN8 CSN9 CSN10
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## OPN1 OPN2 OPN3 OPN4
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## OPN5 OPN6 OPN7 OPN8
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## OPN9 OPN10 country Full_Name
## Length:993761 Length:993761 Length:993761 Length:993761
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Score
## Min. :2.905
## 1st Qu.:6.886
## Median :6.886
## Mean :6.767
## 3rd Qu.:6.965
## Max. :7.632
Perform an exploratory analysis of the merged personality data to gain insights into the distribution, relationships, and characteristics of the personality traits. This step involves data cleaning, handling missing values, and identifying any outliers.
Extroversion (outgoing/energetic vs. solitary/reserved)
EXT<-df6 %>% dplyr::select(EXT1:EXT10)
# Replace the literal string "NULL" with NA
EXT <- EXT %>%
mutate(across(everything(), ~ ifelse(. == "NULL", NA, .)))
# Remove rows with any NA values
EXT <- EXT %>% filter(complete.cases(.))
EXT_counts <- count(EXT,EXT1)
plot_ly(EXT_counts, labels = ~EXT1, values = ~n, type = "pie",
text = ~paste(EXT1, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(EXT_counts$EXT1)), "Dark2")),title = "Chart of EXT 1: I am the life of the party") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 28.5% selected a neutral option. 44.3% had chosen 1 & 2 and 26.96% had chosen 4 & 5.
EXT_counts <- count(EXT,EXT2)
plot_ly(EXT_counts, labels = ~EXT2, values = ~n, type = "pie",
text = ~paste(EXT2, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(EXT_counts$EXT2)), "Dark2")),title = "Chart of EXT 2: I don't talk a lot") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 24.1% selected a neutral option. 44.1% had chosen 1 & 2 and 31.3% had chosen 4 & 5.
EXT_counts <- count(EXT,EXT3)
plot_ly(EXT_counts, labels = ~EXT3, values = ~n, type = "pie",
text = ~paste(EXT3, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(EXT_counts$EXT3)), "Dark2")),title = "Chart of EXT 3: I feel comfortable around people") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 26.5% selected a neutral option. 26.44% had chosen 1 & 2 and 54.6% had chosen 4 & 5.
EXT_counts <- count(EXT,EXT4)
plot_ly(EXT_counts, labels = ~EXT4, values = ~n, type = "pie",
text = ~paste(EXT4, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(EXT_counts$EXT4)), "Dark2")),title = "Chart of EXT 4: I keep in the background") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 27.6% selected a neutral option. 30.7% had chosen 1 & 2 and 41.1% had chosen 4 & 5.
EXT_counts <- count(EXT,EXT5)
plot_ly(EXT_counts, labels = ~EXT5, values = ~n, type = "pie",
text = ~paste(EXT5, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(EXT_counts$EXT5)), "Dark2")),title = "Chart of EXT 5: I start conversations") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 23.1% selected a neutral option. 27.28% had chosen 1 & 2 and 48.7% had chosen 4 & 5.
EXT_counts <- count(EXT,EXT6)
plot_ly(EXT_counts, labels = ~EXT6, values = ~n, type = "pie",
text = ~paste(EXT6, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(EXT_counts$EXT6)), "Dark2")),title = "Chart of EXT 6: I have little to say") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 18.9% selected a neutral option. 59.7% had chosen 1 & 2 and 25.91% had chosen 4 & 5.
EXT_counts <- count(EXT,EXT7)
plot_ly(EXT_counts, labels = ~EXT7, values = ~n, type = "pie",
text = ~paste(EXT7, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(EXT_counts$EXT7)), "Dark2")),title = "Chart of EXT 7: I talk to a lot of different people at parties") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 19.2% selected a neutral option. 45.8% had chosen 1 & 2 and 34.2% had chosen 4 & 5.
EXT_counts <- count(EXT,EXT8)
plot_ly(EXT_counts, labels = ~EXT8, values = ~n, type = "pie",
text = ~paste(EXT8, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(EXT_counts$EXT8)), "Dark2")),title = "Chart of EXT 8: I don't like to draw attention to myself") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 22.7% selected a neutral option. 25.09% had chosen 1 & 2 and 51.7% had chosen 4 & 5.
EXT_counts <- count(EXT,EXT9)
plot_ly(EXT_counts, labels = ~EXT9, values = ~n, type = "pie",
text = ~paste(EXT9, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(EXT_counts$EXT9)), "Dark2")),title = "Chart of EXT 9: I don't mind being the center of attention") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 21.2% selected a neutral option. 39% had chosen 1 & 2 and 39.1% had chosen 4 & 5.
EXT_counts <- count(EXT,EXT10)
plot_ly(EXT_counts, labels = ~EXT10, values = ~n, type = "pie",
text = ~paste(EXT10, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(EXT_counts$EXT10)), "Dark2")),title = "Chart of EXT 10: I am quiet around strangers") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 18.9% selected a neutral option. 23.15% had chosen 1 & 2 and 57.3% had chosen 4 & 5.
Although most respondents feel comfortable around people and are not afraid to start conversations, they prefer to engage in smaller groups and do not like to draw attention to themselves.
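The low / neutral / high percentages quoted in the comments can also be computed directly instead of being read off the pie charts. The sketch below assumes the item has already been converted to integers; `response_shares` is a hypothetical helper name and the input vector is toy data.

```r
# Share of responses that are low (1-2), neutral (3) or high (4-5), in percent.
# Values outside 1..5 (for example 0) and NA are excluded from the base.
response_shares <- function(x) {
  x <- x[!is.na(x)]
  groups <- cut(x,
                breaks = c(0, 2, 3, 5),
                labels = c("low (1-2)", "neutral (3)", "high (4-5)"))
  round(100 * prop.table(table(groups)), 1)
}

toy_item <- c(1, 2, 3, 4, 5, 5, 3, 1, 2, 4)
shares <- response_shares(toy_item)
shares
```

Applied to a real column, the call would look like `response_shares(as.integer(EXT$EXT1))`.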
EXT<-df6 %>% dplyr::select(EXT1:Score)
EXT<-dplyr::select(EXT,-(EST1:Full_Name))
# Replace the literal string "NULL" with NA
EXT <- EXT %>%
mutate(across(everything(), ~ ifelse(. == "NULL", NA, .)))
# Remove rows with any NA values
EXT <- EXT %>% filter(complete.cases(.))
# Convert the ten item columns from character to integer
EXT <- EXT %>%
  mutate(across(EXT1:EXT10, as.integer))
EXT_CORR<-cor(EXT)
# as number
ggcorrplot(EXT_CORR,
hc.order = TRUE,
type = "lower",
lab = TRUE)
Comment: The EXT items are not highly correlated with one another, and they show very little correlation with the happiness score.
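When only the relationship with the happiness score matters, the `Score` column of the correlation matrix can be extracted and ranked by absolute value. This is a minimal sketch on a toy data frame; the item names and values are hypothetical and were chosen so that the correlations are easy to verify by hand.

```r
# Toy data: item1 rises with Score, item2 falls with Score, item3 is unrelated
toy <- data.frame(
  item1 = c(1, 2, 3, 4, 5),
  item2 = c(5, 4, 3, 2, 1),
  item3 = c(1, 3, 2, 3, 1),
  Score = c(2, 4, 6, 8, 10)
)

corr <- cor(toy)

# Keep only each item's correlation with Score, dropping Score itself
score_cor <- corr[rownames(corr) != "Score", "Score"]

# Strongest relationships first, regardless of sign
ranked <- score_cor[order(abs(score_cor), decreasing = TRUE)]
ranked
```

The same two lines after `cor()` would work on `EXT_CORR`, since it also contains a `Score` row and column.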
Neuroticism (sensitive/nervous vs. resilient/confident)
EST<-df6 %>% dplyr::select(EST1:EST10)
# Replace the literal string "NULL" with NA
EST <- EST %>%
mutate(across(everything(), ~ ifelse(. == "NULL", NA, .)))
# Remove rows with any NA values
EST <- EST %>% filter(complete.cases(.))
distinct_column<-unique(EST$EST1)
distinct_column
## [1] "1" "2" "4" "3" "5" "0"
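The distinct values above include "0", which lies outside the 1 to 5 response scale. A hedged sketch of a cleaning helper is given below: it turns the literal string "NULL" into NA, converts to integer, and can optionally treat 0 as missing. Treating 0 as a skipped item is an assumption, not something the data source confirms here, and `clean_item` is a hypothetical helper name.

```r
# Convert one questionnaire item from character to integer.
# zero_as_na = TRUE additionally treats 0 as a missing response (assumption).
clean_item <- function(x, zero_as_na = FALSE) {
  x[which(x == "NULL")] <- NA              # literal "NULL" strings become NA
  out <- as.integer(x)
  if (zero_as_na) out[which(out == 0L)] <- NA_integer_
  out
}

# Toy data with the same kinds of values as the real item columns
raw <- data.frame(EST1 = c("1", "NULL", "0", "5"),
                  EST2 = c("4", "3", "2", "NULL"),
                  stringsAsFactors = FALSE)

cleaned  <- as.data.frame(lapply(raw, clean_item, zero_as_na = TRUE))
complete <- cleaned[complete.cases(cleaned), , drop = FALSE]
complete
```

With `zero_as_na = FALSE` the zeros are kept as a sixth category, which is how they appear in the pie charts.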
EST_counts <- count(EST,EST1)
plot_ly(EST_counts, labels = ~EST1, values = ~n, type = "pie",
text = ~paste(EST1, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(EST_counts$EST1)), "Dark2")),title = "Chart of EST 1: I get stressed out easily") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 20.1% selected a neutral option. 29.7% had chosen 1 & 2 and 49.5% had chosen 4 & 5.
EST_counts <- count(EST,EST2)
plot_ly(EST_counts, labels = ~EST2, values = ~n, type = "pie",
text = ~paste(EST2, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(EST_counts$EST2)), "Dark2")),title = "Chart of EST 2: I am relaxed most of the time") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 25.9% selected a neutral option. 30.25% had chosen 1 & 2 and 42.9% had chosen 4 & 5.
EST_counts <- count(EST,EST3)
plot_ly(EST_counts, labels = ~EST3, values = ~n, type = "pie",
text = ~paste(EST3, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(EST_counts$EST3)), "Dark2")),title = "Chart of EST 3: I worry about things") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 14.6% selected a neutral option. 14.52% had chosen 1 & 2 and 70.4% had chosen 4 & 5.
EST_counts <- count(EST,EST4)
plot_ly(EST_counts, labels = ~EST4, values = ~n, type = "pie",
text = ~paste(EST4, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(EST_counts$EST4)), "Dark2")),title = "Chart of EST 4: I seldom feel blue") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 24.9% selected a neutral option. 47.5% had chosen 1 & 2 and 26.48% had chosen 4 & 5.
EST_counts <- count(EST,EST5)
plot_ly(EST_counts, labels = ~EST5, values = ~n, type = "pie",
text = ~paste(EST5, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(EST_counts$EST5)), "Dark2")),title = "Chart of EST 5: I am easily disturbed") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 22.5% selected a neutral option. 42.8% had chosen 1 & 2 and 34.2% had chosen 4 & 5.
EST_counts <- count(EST,EST6)
plot_ly(EST_counts, labels = ~EST6, values = ~n, type = "pie",
text = ~paste(EST6, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(EST_counts$EST6)), "Dark2")),title = "Chart of EST 6: I get upset easily") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 21% selected a neutral option. 42.9% had chosen 1 & 2 and 35% had chosen 4 & 5.
EST_counts <- count(EST,EST7)
plot_ly(EST_counts, labels = ~EST7, values = ~n, type = "pie",
text = ~paste(EST7, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(EST_counts$EST7)), "Dark2")),title = "Chart of EST 7: I change my mood a lot") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 23% selected a neutral option. 36.4% had chosen 1 & 2 and 40% had chosen 4 & 5.
EST_counts <- count(EST,EST8)
plot_ly(EST_counts, labels = ~EST8, values = ~n, type = "pie",
text = ~paste(EST8, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(EST_counts$EST8)), "Dark2")),title = "Chart of EST 8: I have frequent mood swings") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 19.7% selected a neutral option. 48.9% had chosen 1 & 2 and 30.7% had chosen 4 & 5.
EST_counts <- count(EST,EST9)
plot_ly(EST_counts, labels = ~EST9, values = ~n, type = "pie",
text = ~paste(EST9, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(EST_counts$EST9)), "Dark2")),title = "Chart of EST 9: I get irritated easily") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 20.8% selected a neutral option. 35.2% had chosen 1 & 2 and 43.4% had chosen 4 & 5.
EST_counts <- count(EST,EST10)
plot_ly(EST_counts, labels = ~EST10, values = ~n, type = "pie",
text = ~paste(EST10, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(EST_counts$EST10)), "Dark2")),title = "Chart of EST 10: I often feel blue") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 21.6% selected a neutral option. 45.1% had chosen 1 & 2 and 32.5% had chosen 4 & 5.
Although the share of respondents who get stressed out easily and the share who are relaxed most of the time are almost equal, 70.4% still worry about things.
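The percentages quoted in the comments above can be reproduced from the response counts. The following is a minimal sketch in base R, assuming responses are coded 1 to 5 with 3 as the neutral option. The helper name `likert_shares` and the toy vector are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: share of low (1-2), neutral (3) and high (4-5) answers.
# Any other code (for example 0) is counted as "other".
likert_shares <- function(x) {
  x <- as.integer(x)
  groups <- ifelse(x %in% 1:2, "low",
            ifelse(x == 3, "neutral",
            ifelse(x %in% 4:5, "high", "other")))
  groups <- factor(groups, levels = c("low", "neutral", "high", "other"))
  round(100 * prop.table(table(groups)), 1)
}

# Toy responses, made up for illustration only
toy <- c("1", "2", "2", "3", "3", "4", "5", "5", "0", "4")
shares <- likert_shares(toy)
print(shares)
```

Keeping an explicit "other" group makes it visible when the three reported shares do not add up to 100%.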
EST<-df6 %>% dplyr::select(EST1:Score)
EST<-dplyr::select(EST,-(AGR1:Full_Name))
# Remove values containing "NULL"
EST <- EST %>%
mutate(across(everything(), ~ ifelse(. == "NULL", NA, .)))
# Remove rows with any NA values
EST <- EST %>% filter(complete.cases(.))
# Convert the ten item columns from character to integer
EST <- EST %>% mutate(across(EST1:EST10, as.integer))
EST_CORR<-cor(EST)
# as number
ggcorrplot(EST_CORR,
hc.order = TRUE,
type = "lower",
lab = TRUE)
Comment: EST7 is highly correlated with EST8. All variables have very little correlation with the happiness score.
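Reading individual pairs off a correlation heat map can be error-prone. As a hedged sketch, the strongest pair and each item's correlation with the score can also be listed directly from the matrix returned by `cor()`. The simulated columns below are hypothetical stand-ins, not the survey data.

```r
set.seed(1)
n <- 200
# Simulated stand-ins: item_b is built to track item_a; score is unrelated noise
item_a <- sample(1:5, n, replace = TRUE)
item_b <- pmin(5, pmax(1, item_a + sample(-1:1, n, replace = TRUE)))
item_c <- sample(1:5, n, replace = TRUE)
score  <- rnorm(n, mean = 6.9, sd = 0.02)
toy <- data.frame(item_a, item_b, item_c, score)

corr <- cor(toy)

# Strongest off-diagonal pair among the items (lower triangle only)
items <- corr[1:3, 1:3]
items[upper.tri(items, diag = TRUE)] <- NA
best <- which(abs(items) == max(abs(items), na.rm = TRUE), arr.ind = TRUE)
strongest_pair <- c(rownames(items)[best[1, "row"]],
                    colnames(items)[best[1, "col"]])
print(strongest_pair)

# Correlation of every item with the score, sorted by absolute size
with_score <- corr[1:3, "score"]
print(sort(abs(with_score), decreasing = TRUE))
```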
agreeableness (friendly/compassionate vs. critical/rational)
AGR<-df6 %>% dplyr::select(AGR1:AGR10)
# Remove values containing "NULL"
AGR <- AGR %>%
mutate(across(everything(), ~ ifelse(. == "NULL", NA, .)))
# Remove rows with any NA values
AGR <- AGR %>% filter(complete.cases(.))
distinct_column<-unique(AGR$AGR1)
distinct_column
## [1] "2" "1" "5" "4" "3" "0"
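The output above contains a `0` in addition to the five scale values 1 to 5. A minimal sketch of one way to handle it, assuming `0` marks an unanswered item (an assumption that should be checked against the dataset's codebook). The toy data frame is hypothetical.

```r
# Toy data frame standing in for the AGR items (made-up values)
toy <- data.frame(AGR1 = c("2", "1", "0", "5", "NULL"),
                  AGR2 = c("4", "0", "3", "5", "2"),
                  stringsAsFactors = FALSE)

# Treat both the string "NULL" and the code "0" as missing (assumption)
toy[toy == "NULL" | toy == "0"] <- NA

# Keep only fully answered rows, then convert every column to integer
toy_clean <- toy[complete.cases(toy), , drop = FALSE]
toy_clean[] <- lapply(toy_clean, as.integer)
print(toy_clean)
```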
AGR_counts <- count(AGR,AGR1)
plot_ly(AGR_counts, labels = ~AGR1, values = ~n, type = "pie",
text = ~paste(AGR1, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(AGR_counts$AGR1)), "Dark2")),title = "Chart of AGR 1: I feel little concern for others") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 12.7% selected a neutral option. 64.8% had chosen 1 & 2 and 21.94% had chosen 4 & 5.
AGR_counts <- count(AGR,AGR2)
plot_ly(AGR_counts, labels = ~AGR2, values = ~n, type = "pie",
text = ~paste(AGR2, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(AGR_counts$AGR2)), "Dark2")),title = "Chart of AGR 2: I am interested in people") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 18.5% selected a neutral option. 12.24% had chosen 1 & 2 and 68.4% had chosen 4 & 5.
AGR_counts <- count(AGR,AGR3)
plot_ly(AGR_counts, labels = ~AGR3, values = ~n, type = "pie",
text = ~paste(AGR3, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(AGR_counts$AGR3)), "Dark2")),title = "Chart of AGR 3: I insult people") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 17.3% selected a neutral option. 61.5% had chosen 1 & 2 and 20.82% had chosen 4 & 5.
AGR_counts <- count(AGR,AGR4)
plot_ly(AGR_counts, labels = ~AGR4, values = ~n, type = "pie",
text = ~paste(AGR4, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(AGR_counts$AGR4)), "Dark2")),title = "Chart of AGR 4: I sympathize with others' feelings") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 14.3% selected a neutral option. 11.64% had chosen 1 & 2 and 73.3% had chosen 4 & 5.
AGR_counts <- count(AGR,AGR5)
plot_ly(AGR_counts, labels = ~AGR5, values = ~n, type = "pie",
text = ~paste(AGR5, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(AGR_counts$AGR5)), "Dark2")),title = "Chart of AGR 5: I am not interested in other people's problems") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 17.7% selected a neutral option. 64.7% had chosen 1 & 2 and 16.95% had chosen 4 & 5.
AGR_counts <- count(AGR,AGR6)
plot_ly(AGR_counts, labels = ~AGR6, values = ~n, type = "pie",
text = ~paste(AGR6, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(AGR_counts$AGR6)), "Dark2")),title = "Chart of AGR 6: I have a soft heart") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 18.1% selected a neutral option. 15.52% had chosen 1 & 2 and 65.3% had chosen 4 & 5.
AGR_counts <- count(AGR,AGR7)
plot_ly(AGR_counts, labels = ~AGR7, values = ~n, type = "pie",
text = ~paste(AGR7, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(AGR_counts$AGR7)), "Dark2")),title = "Chart of AGR 7: I am not really interested in others") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 17.9% selected a neutral option. 66.9% had chosen 1 & 2 and 14.56% had chosen 4 & 5.
AGR_counts <- count(AGR,AGR8)
plot_ly(AGR_counts, labels = ~AGR8, values = ~n, type = "pie",
text = ~paste(AGR8, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(AGR_counts$AGR8)), "Dark2")),title = "Chart of AGR 8: I take time out for others") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 21.2% selected a neutral option. 13.64% had chosen 1 & 2 and 64.3% had chosen 4 & 5.
AGR_counts <- count(AGR,AGR9)
plot_ly(AGR_counts, labels = ~AGR9, values = ~n, type = "pie",
text = ~paste(AGR9, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(AGR_counts$AGR9)), "Dark2")),title = "Chart of AGR 9: I feel others' emotions") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 15.7% selected a neutral option. 14.83% had chosen 1 & 2 and 68.8% had chosen 4 & 5.
AGR_counts <- count(AGR,AGR10)
plot_ly(AGR_counts, labels = ~AGR10, values = ~n, type = "pie",
text = ~paste(AGR10, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(AGR_counts$AGR10)), "Dark2")),title = "Chart of AGR 10: I make people feel at ease") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 30% selected a neutral option. 12.96% had chosen 1 & 2 and 56.1% had chosen 4 & 5.
64.8% of respondents disagree that they feel little concern for others, and 61.5% disagree that they insult people.
AGR<-df6 %>% dplyr::select(AGR1:Score)
AGR<-dplyr::select(AGR,-(CSN1:Full_Name))
# Remove values containing "NULL"
AGR <- AGR %>%
mutate(across(everything(), ~ ifelse(. == "NULL", NA, .)))
# Remove rows with any NA values
AGR <- AGR %>% filter(complete.cases(.))
# Convert the ten item columns from character to integer
AGR <- AGR %>% mutate(across(AGR1:AGR10, as.integer))
AGR_CORR<-cor(AGR)
ggcorrplot(AGR_CORR,
hc.order = TRUE,
type = "lower",
lab = TRUE)
Comment: All the variables are moderately correlated. All variables have very little correlation to happiness score.
conscientiousness (efficient/organized vs. extravagant/careless)
CSN<-df6 %>% dplyr::select(CSN1:CSN10)
# Remove values containing "NULL"
CSN <- CSN %>%
mutate(across(everything(), ~ ifelse(. == "NULL", NA, .)))
# Remove rows with any NA values
CSN <- CSN %>% filter(complete.cases(.))
distinct_column<-unique(CSN$CSN1)
distinct_column
## [1] "3" "4" "2" "5" "1" "0"
CSN_counts <- count(CSN,CSN1)
plot_ly(CSN_counts, labels = ~CSN1, values = ~n, type = "pie",
text = ~paste(CSN1, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(CSN_counts$CSN1)), "Dark2")),title = "Chart of CSN 1: I am always prepared") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 26.4% selected a neutral option. 23.59% had chosen 1 & 2 and 48.9% had chosen 4 & 5.
CSN_counts <- count(CSN,CSN2)
plot_ly(CSN_counts, labels = ~CSN2, values = ~n, type = "pie",
text = ~paste(CSN2, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(CSN_counts$CSN2)), "Dark2")),title = "Chart of CSN 2: I leave my belongings around") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 18.8% selected a neutral option. 40.6% had chosen 1 & 2 and 40% had chosen 4 & 5.
CSN_counts <- count(CSN,CSN3)
plot_ly(CSN_counts, labels = ~CSN3, values = ~n, type = "pie",
text = ~paste(CSN3, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(CSN_counts$CSN3)), "Dark2")),title = "Chart of CSN 3: I pay attention to details") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 16.1% selected a neutral option. 9.28% had chosen 1 & 2 and 74.1% had chosen 4 & 5.
CSN_counts <- count(CSN,CSN4)
plot_ly(CSN_counts, labels = ~CSN4, values = ~n, type = "pie",
text = ~paste(CSN4, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(CSN_counts$CSN4)), "Dark2")),title = "Chart of CSN 4: I make a mess of things") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 22.8% selected a neutral option. 50% had chosen 1 & 2 and 26.59% had chosen 4 & 5.
CSN_counts <- count(CSN,CSN5)
plot_ly(CSN_counts, labels = ~CSN5, values = ~n, type = "pie",
text = ~paste(CSN5, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(CSN_counts$CSN5)), "Dark2")),title = "Chart of CSN 5: I get chores done right away") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 23.5% selected a neutral option. 49.2% had chosen 1 & 2 and 26.55% had chosen 4 & 5.
CSN_counts <- count(CSN,CSN6)
plot_ly(CSN_counts, labels = ~CSN6, values = ~n, type = "pie",
text = ~paste(CSN6, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(CSN_counts$CSN6)), "Dark2")),title = "Chart of CSN 6: I often forget to put things back in their proper place") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 16.1% selected a neutral option. 46% had chosen 1 & 2 and 37.1% had chosen 4 & 5.
CSN_counts <- count(CSN,CSN7)
plot_ly(CSN_counts, labels = ~CSN7, values = ~n, type = "pie",
text = ~paste(CSN7, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(CSN_counts$CSN7)), "Dark2")),title = "Chart of CSN 7: I like order") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 22.5% selected a neutral option. 13.58% had chosen 1 & 2 and 63.2% had chosen 4 & 5.
CSN_counts <- count(CSN,CSN8)
plot_ly(CSN_counts, labels = ~CSN8, values = ~n, type = "pie",
text = ~paste(CSN8, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(CSN_counts$CSN8)), "Dark2")),title = "Chart of CSN 8: I shirk my duties") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 31.2% selected a neutral option. 50.3% had chosen 1 & 2 and 17.74% had chosen 4 & 5.
CSN_counts <- count(CSN,CSN9)
plot_ly(CSN_counts, labels = ~CSN9, values = ~n, type = "pie",
text = ~paste(CSN9, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(CSN_counts$CSN9)), "Dark2")),title = "Chart of CSN 9: I follow a schedule") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 22.7% selected a neutral option. 30.1% had chosen 1 & 2 and 46.5% had chosen 4 & 5.
CSN_counts <- count(CSN,CSN10)
plot_ly(CSN_counts, labels = ~CSN10, values = ~n, type = "pie",
text = ~paste(CSN10, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(CSN_counts$CSN10)), "Dark2")),title = "Chart of CSN 10: I am exacting in my work") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 31.3% selected a neutral option. 12.13% had chosen 1 & 2 and 55.7% had chosen 4 & 5.
The responses for this trait appear to conflict with one another. Although 74.1% claim to pay attention to details, 40% say they leave their belongings around. Similarly, 63.2% claim that they like order, yet 49.2% do not get chores done right away. On the other hand, 55.7% say they are exacting in their work and only 17.74% admit to shirking their duties, while 50.3% say they do not.
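Marginal percentages alone cannot show whether the same respondents gave conflicting answers. A minimal sketch of a cross-tabulation that would check this, using made-up responses for two items. The column names mirror CSN7 and CSN5, but the values are hypothetical.

```r
# Made-up responses (1 = disagree ... 5 = agree), for illustration only
toy <- data.frame(
  CSN7 = c(5, 4, 4, 2, 5, 3, 4, 1, 5, 4),  # "I like order"
  CSN5 = c(1, 2, 4, 2, 5, 3, 1, 1, 2, 4)   # "I get chores done right away"
)

agree    <- function(x) x >= 4
disagree <- function(x) x <= 2

# Joint table: likes order (rows) vs. does not do chores right away (columns)
joint <- table(likes_order = agree(toy$CSN7),
               delays_chores = disagree(toy$CSN5))
print(joint)

# Share of respondents who like order AND do not get chores done right away
conflict_share <- mean(agree(toy$CSN7) & disagree(toy$CSN5))
print(conflict_share)
```

Only the joint share in the last line describes respondents whose own answers conflict; the row and column totals correspond to the marginal percentages reported in the comments.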
CSN<-df6 %>% dplyr::select(CSN1:Score)
CSN<-dplyr::select(CSN,-(OPN1:Full_Name))
# Remove values containing "NULL"
CSN <- CSN %>%
mutate(across(everything(), ~ ifelse(. == "NULL", NA, .)))
# Remove rows with any NA values
CSN <- CSN %>% filter(complete.cases(.))
# Convert the ten item columns from character to integer
CSN <- CSN %>% mutate(across(CSN1:CSN10, as.integer))
CSN_CORR<-cor(CSN)
ggcorrplot(CSN_CORR,
hc.order = TRUE,
type = "lower",
lab = TRUE)
Comment: The variables are not highly correlated with each other and they have very little correlation with the happiness score.
openness to experience (inventive/curious vs. consistent/cautious)
OPN<-df6 %>% dplyr::select(OPN1:OPN10)
# Remove values containing "NULL"
OPN <- OPN %>%
mutate(across(everything(), ~ ifelse(. == "NULL", NA, .)))
# Remove rows with any NA values
OPN <- OPN %>% filter(complete.cases(.))
distinct_column<-unique(OPN$OPN1)
distinct_column
## [1] "5" "1" "4" "3" "2" "0"
OPN_counts <- count(OPN,OPN1)
plot_ly(OPN_counts, labels = ~OPN1, values = ~n, type = "pie",
text = ~paste(OPN1, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(OPN_counts$OPN1)), "Dark2")),title = "Chart of OPN 1: I have a rich vocabulary") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 24.8% selected a neutral option. 14.41% had chosen 1 & 2 and 59.9% had chosen 4 & 5.
OPN_counts <- count(OPN,OPN2)
plot_ly(OPN_counts, labels = ~OPN2, values = ~n, type = "pie",
text = ~paste(OPN2, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(OPN_counts$OPN2)), "Dark2")),title = "Chart of OPN 2: I have difficulty understanding abstract ideas") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 18.4% selected a neutral option. 68.9% had chosen 1 & 2 and 12.19% had chosen 4 & 5.
OPN_counts <- count(OPN,OPN3)
plot_ly(OPN_counts, labels = ~OPN3, values = ~n, type = "pie",
text = ~paste(OPN3, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(OPN_counts$OPN3)), "Dark2")),title = "Chart of OPN 3: I have a vivid imagination") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 17.2% selected a neutral option. 9.51% had chosen 1 & 2 and 72.5% had chosen 4 & 5.
OPN_counts <- count(OPN,OPN4)
plot_ly(OPN_counts, labels = ~OPN4, values = ~n, type = "pie",
text = ~paste(OPN4, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(OPN_counts$OPN4)), "Dark2")),title = "Chart of OPN 4: I am not interested in abstract ideas") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 18.6% selected a neutral option. 70.6% had chosen 1 & 2 and 10.09% had chosen 4 & 5.
OPN_counts <- count(OPN,OPN5)
plot_ly(OPN_counts, labels = ~OPN5, values = ~n, type = "pie",
text = ~paste(OPN5, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(OPN_counts$OPN5)), "Dark2")),title = "Chart of OPN 5: I have excellent ideas") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 27% selected a neutral option. 7.46% had chosen 1 & 2 and 64.9% had chosen 4 & 5.
OPN_counts <- count(OPN,OPN6)
plot_ly(OPN_counts, labels = ~OPN6, values = ~n, type = "pie",
text = ~paste(OPN6, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(OPN_counts$OPN6)), "Dark2")),title = "Chart of OPN 6: I Do Not Have A Good Imagination") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 11.7% selected a neutral option. 76.6% had chosen 1 & 2 and 10.91% had chosen 4 & 5.
OPN_counts <- count(OPN,OPN7)
plot_ly(OPN_counts, labels = ~OPN7, values = ~n, type = "pie",
text = ~paste(OPN7, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(OPN_counts$OPN7)), "Dark2")),title = "Chart of OPN 7: I am quick to understand things") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 17.9% selected a neutral option. 7.1% had chosen 1 & 2 and 74.2% had chosen 4 & 5.
OPN_counts <- count(OPN,OPN8)
plot_ly(OPN_counts, labels = ~OPN8, values = ~n, type = "pie",
text = ~paste(OPN8, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(OPN_counts$OPN8)), "Dark2")),title = "Chart of OPN 8: I use difficult words") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 24.9% selected a neutral option. 29.8% had chosen 1 & 2 and 44.5% had chosen 4 & 5.
OPN_counts <- count(OPN,OPN9)
plot_ly(OPN_counts, labels = ~OPN9, values = ~n, type = "pie",
text = ~paste(OPN9, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(OPN_counts$OPN9)), "Dark2")),title = "Chart of OPN 9: I spend time reflecting on things") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 12.6% selected a neutral option. 7.56% had chosen 1 & 2 and 79.1% had chosen 4 & 5.
OPN_counts <- count(OPN,OPN10)
plot_ly(OPN_counts, labels = ~OPN10, values = ~n, type = "pie",
text = ~paste(OPN10, ": ", n),
marker = list(colors = RColorBrewer::brewer.pal(length(unique(OPN_counts$OPN10)), "Dark2")),title = "Chart of OPN 10: I am full of ideas") %>%
layout(scene = list(aspectmode = "data"), showlegend = FALSE)
Comment: 20.9% selected a neutral option. 7.91% had chosen 1 & 2 and 70.6% had chosen 4 & 5.
More than 70% of respondents claim to have a vivid imagination (72.5%) and to be full of ideas (70.6%).
OPN<-df6 %>% dplyr::select(OPN1:Score)
OPN<-dplyr::select(OPN,-(country:Full_Name))
# Remove values containing "NULL"
OPN <- OPN %>%
mutate(across(everything(), ~ ifelse(. == "NULL", NA, .)))
# Remove rows with any NA values
OPN <- OPN %>% filter(complete.cases(.))
# Convert the ten item columns from character to integer
OPN <- OPN %>% mutate(across(OPN1:OPN10, as.integer))
OPN_CORR<-cor(OPN)
ggcorrplot(OPN_CORR,
hc.order = TRUE,
type = "lower",
lab = TRUE)
Comment: The variables are not highly correlated with each other and they have very little correlation with the happiness score.
distinct_column<-unique(df6$Full_Name)
distinct_column
## [1] "United Kingdom" "Malaysia" "Kenya"
## [4] "Sweden" "United States" "Finland"
## [7] "Ukraine" "Philippines" "France"
## [10] "Australia" "India" "Canada"
## [13] "Netherlands" "South Africa" "Brazil"
## [16] "Switzerland" "Thailand" "Italy"
## [19] "Spain" "United Arab Emirates" "Croatia"
## [22] "Greece" "Ireland" "Germany"
## [25] "Portugal" "Singapore" "Romania"
## [28] "Norway" "Bangladesh" "Nigeria"
## [31] "Lithuania" "Ethiopia" "Indonesia"
## [34] "Belgium" "Austria" "Denmark"
## [37] "Tanzania" "Luxembourg" "Poland"
## [40] "Japan" "Mexico" "Cyprus"
## [43] "Uganda" "Sri Lanka" "Turkey"
## [46] "Myanmar" "Colombia" "Estonia"
## [49] "Argentina" "Iceland" "Hungary"
## [52] "Pakistan" "Tunisia" "Latvia"
## [55] "Czech Republic" "New Zealand" "Serbia"
## [58] "Israel" "Jamaica" "Chile"
## [61] "Qatar" "Saudi Arabia" "Vietnam"
## [64] "Kazakhstan" "Bosnia and Herzegovina" "Mauritius"
## [67] "Egypt" "Peru" "Slovenia"
## [70] "Jordan" "Taiwan" "Dominican Republic"
## [73] "Algeria" "Kuwait" "Morocco"
## [76] "Malta" "Venezuela" "Russia"
## [79] "South Korea" "Liberia" "Guatemala"
## [82] "Bulgaria" "Ghana" "Somalia"
## [85] "Slovakia" "China" "Azerbaijan"
## [88] "Albania" "Cambodia" "Lebanon"
## [91] "Uruguay" "Zimbabwe" "Uzbekistan"
## [94] "Honduras" "Costa Rica" "Georgia"
## [97] "Nepal" "Iran" "Mongolia"
## [100] "Zambia" "Nicaragua" "Bahrain"
## [103] "Sudan" "Belize" "Paraguay"
## [106] "Panama" "El Salvador" "Montenegro"
## [109] "Angola" "Kyrgyzstan" "Afghanistan"
## [112] "Rwanda" "Belarus" "Gabon"
## [115] "Armenia" "Ecuador" "Yemen"
## [118] "Botswana" "Burundi" "Cameroon"
## [121] "Lesotho" "Iraq" "Bolivia"
## [124] "Mozambique" "Senegal" "Malawi"
## [127] "Madagascar" "Benin" "Bhutan"
## [130] "Haiti" "Congo (Brazzaville)" "Mali"
## [133] "Congo (Kinshasa)" "Mauritania" "Burkina Faso"
## [136] "Tajikistan" "Sierra Leone" "Togo"
## [139] "Chad" "Niger" "Guinea"
count(df6, Full_Name)%>%arrange(desc(n))
## Full_Name n
## 1 United States 546403
## 2 United Kingdom 66596
## 3 Canada 61849
## 4 Australia 50030
## 5 Philippines 19847
## 6 India 17491
## 7 Germany 14095
## 8 New Zealand 12992
## 9 Norway 11417
## 10 Malaysia 11355
## 11 Mexico 11152
## 12 Sweden 10493
## 13 Netherlands 9785
## 14 Singapore 7686
## 15 Indonesia 6489
## 16 Brazil 6245
## 17 France 6145
## 18 Denmark 5512
## 19 Ireland 5409
## 20 Italy 5319
## 21 Spain 5008
## 22 Poland 4659
## 23 Finland 4340
## 24 Romania 3858
## 25 Belgium 3824
## 26 South Africa 3751
## 27 Colombia 3619
## 28 Pakistan 3511
## 29 Russia 3323
## 30 Argentina 3154
## 31 Switzerland 3124
## 32 United Arab Emirates 3061
## 33 Turkey 2891
## 34 Portugal 2519
## 35 Greece 2513
## 36 Vietnam 2337
## 37 Croatia 2245
## 38 Austria 2215
## 39 Chile 2193
## 40 Serbia 2065
## 41 Czech Republic 2014
## 42 Thailand 1971
## 43 Japan 1933
## 44 Peru 1659
## 45 South Korea 1593
## 46 Hungary 1506
## 47 Israel 1432
## 48 Kenya 1405
## 49 China 1340
## 50 Bulgaria 1271
## 51 Venezuela 1260
## 52 Ecuador 1146
## 53 Lithuania 1101
## 54 Saudi Arabia 1097
## 55 Egypt 1033
## 56 Estonia 1020
## 57 Slovakia 992
## 58 Nigeria 952
## 59 Taiwan 921
## 60 Slovenia 866
## 61 Lebanon 837
## 62 Ukraine 747
## 63 Sri Lanka 701
## 64 Costa Rica 650
## 65 Nepal 642
## 66 Iceland 612
## 67 Bosnia and Herzegovina 550
## 68 Kazakhstan 526
## 69 Latvia 517
## 70 Jamaica 512
## 71 Morocco 499
## 72 Jordan 437
## 73 Albania 436
## 74 Iran 429
## 75 Guatemala 425
## 76 Kuwait 397
## 77 Cambodia 386
## 78 Malta 378
## 79 Bolivia 369
## 80 Qatar 366
## 81 Dominican Republic 359
## 82 Georgia 359
## 83 Uruguay 351
## 84 Bangladesh 319
## 85 Cyprus 311
## 86 Paraguay 281
## 87 Ethiopia 277
## 88 Ghana 272
## 89 Algeria 239
## 90 Honduras 225
## 91 Luxembourg 220
## 92 Bahrain 207
## 93 Panama 198
## 94 El Salvador 197
## 95 Tunisia 192
## 96 Mauritius 187
## 97 Nicaragua 174
## 98 Belarus 166
## 99 Belize 152
## 100 Uganda 149
## 101 Montenegro 142
## 102 Armenia 109
## 103 Botswana 106
## 104 Zimbabwe 103
## 105 Zambia 98
## 106 Iraq 93
## 107 Sudan 90
## 108 Myanmar 86
## 109 Tanzania 86
## 110 Azerbaijan 80
## 111 Mongolia 72
## 112 Afghanistan 54
## 113 Kyrgyzstan 54
## 114 Uzbekistan 35
## 115 Cameroon 33
## 116 Rwanda 32
## 117 Mozambique 26
## 118 Malawi 24
## 119 Senegal 18
## 120 Somalia 16
## 121 Haiti 15
## 122 Angola 14
## 123 Bhutan 14
## 124 Yemen 14
## 125 Lesotho 12
## 126 Madagascar 12
## 127 Sierra Leone 6
## 128 Liberia 5
## 129 Gabon 4
## 130 Congo (Brazzaville) 3
## 131 Congo (Kinshasa) 3
## 132 Mali 3
## 133 Mauritania 3
## 134 Togo 3
## 135 Benin 2
## 136 Burkina Faso 2
## 137 Tajikistan 2
## 138 Burundi 1
## 139 Chad 1
## 140 Guinea 1
## 141 Niger 1
## Remove Outliers
df_hist<-df6 %>% dplyr::select(Score) %>% arrange(Score)
ggplot(df_hist, aes(x=Score)) + geom_histogram(binwidth=1, color="blue", fill="lightgreen") + labs(title="Distribution of Score", x="Score", y="Number of Respondents")
Comment: Although the histogram seems like a normal distribution, it is slightly skewed.
Q1 <- quantile(df6$Score, .25)
Q3 <- quantile(df6$Score, .75)
# Store the interquartile range under a name that does not shadow the IQR() function
iqr <- IQR(df6$Score)
df7 <- subset(df6, Score > (Q1 - 1.5*iqr) & Score < (Q3 + 1.5*iqr))
dim(df7)
## [1] 575094 53
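The filter above keeps only scores strictly between Q1 - 1.5 × IQR and Q3 + 1.5 × IQR. A small worked sketch on a made-up vector (the values are hypothetical) shows which points such a rule drops.

```r
# Made-up scores with one low and one high extreme value
x <- c(3.0, 6.2, 6.4, 6.5, 6.6, 6.7, 6.8, 6.9, 7.0, 9.9)

q1  <- quantile(x, 0.25, names = FALSE)
q3  <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1               # same value as IQR(x)

lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

kept    <- x[x > lower & x < upper]
dropped <- x[x <= lower | x >= upper]
print(c(lower = lower, upper = upper))
print(dropped)
```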
ggplot(df7, aes(x=Score)) + geom_histogram(binwidth=1, color="blue", fill="lightgreen") + labs(title="Distribution of Score", x="Score", y="Number of Respondents")
df7 <- df7 %>%
mutate(across(everything(), ~ ifelse(. == "NULL", NA, .)))
df7 <- df7[complete.cases(df7), ]
# Convert all 50 item columns (EXT1 to OPN10) from character to integer
df7 <- df7 %>% mutate(across(EXT1:OPN10, as.integer))
df7<-dplyr::select(df7,-(country:Full_Name))
head(df7)
## EXT1 EXT2 EXT3 EXT4 EXT5 EXT6 EXT7 EXT8 EXT9 EXT10 EST1 EST2 EST3 EST4 EST5
## 7 4 3 4 3 3 3 5 3 4 3 2 4 4 2 4
## 22 3 2 2 4 4 4 5 3 1 3 3 3 4 4 3
## 50 3 5 4 3 3 4 2 2 2 4 1 2 3 2 2
## 52 1 2 3 4 3 3 2 5 1 5 3 2 4 1 2
## 54 1 4 5 4 4 1 2 1 1 2 2 3 4 2 2
## 72 3 4 1 4 3 2 2 4 3 5 5 3 4 3 5
## EST6 EST7 EST8 EST9 EST10 AGR1 AGR2 AGR3 AGR4 AGR5 AGR6 AGR7 AGR8 AGR9 AGR10
## 7 2 2 2 4 4 1 2 1 5 3 5 3 4 4 5
## 22 3 5 4 3 4 1 5 1 4 5 3 2 3 4 2
## 50 2 3 2 2 4 3 4 3 2 1 4 2 3 3 4
## 52 3 3 3 4 3 3 3 4 3 4 5 3 3 3 2
## 54 2 4 4 2 3 1 5 1 5 1 4 1 5 5 5
## 72 1 3 2 5 2 4 2 5 1 5 2 4 1 1 4
## CSN1 CSN2 CSN3 CSN4 CSN5 CSN6 CSN7 CSN8 CSN9 CSN10 OPN1 OPN2 OPN3 OPN4 OPN5
## 7 3 2 4 2 1 4 4 2 2 5 5 2 4 3 4
## 22 3 3 2 2 2 2 4 3 2 2 3 4 3 2 2
## 50 5 4 4 2 4 2 2 2 5 5 3 2 4 2 2
## 52 3 3 4 2 3 1 4 1 1 3 4 4 5 2 3
## 54 3 4 4 1 2 4 4 1 3 5 5 1 5 1 4
## 72 2 5 4 3 1 4 5 3 3 5 3 4 2 2 4
## OPN6 OPN7 OPN8 OPN9 OPN10 Score
## 7 1 5 5 4 4 6.886
## 22 5 3 2 1 2 6.886
## 50 2 4 2 4 5 6.774
## 52 1 3 4 4 4 6.886
## 54 1 5 5 5 4 6.977
## 72 4 5 3 4 2 6.886
The data set is split into a 70% training set and a 30% test set.
# Step 1: Split the dataset into training and test sets
set.seed(123) # for reproducibility
trainIndex <- createDataPartition(df7$Score, p = 0.7, list = FALSE)
trainData <- df7[trainIndex, ]
testData <- df7[-trainIndex, ]
nrow(trainData)
## [1] 402203
nrow(testData)
## [1] 172371
# Step 2: Fit the model
model <- lm(Score ~ ., data = trainData) # Linear regression model
# Step 3: Evaluate the model on the training set
trainPred <- predict(model, newdata = trainData)
trainError <- sqrt(mean((trainData$Score - trainPred)^2))
cat("Training RMSE:", trainError, "\n")
## Training RMSE: 0.01871987
# Step 4: Evaluate the model on the test set
testPred <- predict(model, newdata = testData)
testError <- sqrt(mean((testData$Score - testPred)^2))
cat("Test RMSE:", testError, "\n")
## Test RMSE: 0.01863382
If the training RMSE is significantly lower than the test RMSE, it suggests that the model may be overfitting the training data. In contrast, if both the training and test RMSE are high, it indicates underfitting, suggesting that the model is not capturing the patterns in the data well.
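Comment: The training RMSE and test RMSE are almost equal, so the model shows no sign of overfitting. Equal RMSE values alone do not rule out underfitting; that has to be judged against how much Score itself varies.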
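A small RMSE is easier to interpret next to a baseline. As a hedged sketch, the RMSE of a linear model can be compared with the RMSE of always predicting the training mean; if the two are close, the predictors add little. The simulated data below are hypothetical and only mimic a response with a narrow range.

```r
set.seed(123)
n <- 1000
# Simulated predictors that are unrelated to the response (assumption for the demo)
toy <- data.frame(x1 = sample(1:5, n, replace = TRUE),
                  x2 = sample(1:5, n, replace = TRUE),
                  Score = rnorm(n, mean = 6.9, sd = 0.02))

idx   <- sample(seq_len(n), size = round(0.7 * n))
train <- toy[idx, ]
test  <- toy[-idx, ]

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

fit <- lm(Score ~ ., data = train)
model_rmse    <- rmse(test$Score, predict(fit, newdata = test))
baseline_rmse <- rmse(test$Score, mean(train$Score))  # mean-only predictor

cat("Model RMSE:   ", model_rmse, "\n")
cat("Baseline RMSE:", baseline_rmse, "\n")
```

If the model RMSE is barely below the baseline RMSE, the model is effectively predicting the mean, even when both numbers look small.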
results <- data.frame(Predicted = testPred, Actual = testData$Score)
head(results)
## Predicted Actual
## 7 6.890553 6.886
## 22 6.889074 6.886
## 72 6.889343 6.886
## 83 6.891699 6.965
## 105 6.890599 6.886
## 152 6.888636 6.886
# Define the number of folds (k)
k <- 10
# Create an empty vector to store the evaluation results
evaluation_results <- numeric(k)
rmse_results <- numeric(k)
rsquared_results <- numeric(k)
# Create an index vector to shuffle the data
index <- sample(1:nrow(df7))
# Calculate the fold size
fold_size <- floor(nrow(df7) / k)
# Perform k-fold cross-validation
for (i in 1:k) {
# Define the start and end indices of the current fold
start_index <- (i - 1) * fold_size + 1
end_index <- start_index + fold_size - 1
# Extract the training and testing data for the current fold
training_data <- df7[-index[start_index:end_index], ]
testing_data <- df7[index[start_index:end_index], ]
# Fit your model on the training data
model <- lm(Score ~ ., data = training_data)
# Make predictions on the testing data
predictions <- predict(model, newdata = testing_data)
# Evaluate the performance of the model
evaluation_results[i] <- mean((testing_data$Score - predictions)^2)
# Compute RMSE
rmse_results[i] <- sqrt(mean((testing_data$Score - predictions)^2))
# Compute R-squared (taken from the fit on the training folds, not from the held-out fold)
rsquared_results[i] <- summary(model)$r.squared
}
# Calculate the average RMSE and R-squared across all folds
average_rmse <- mean(rmse_results)
average_rsquared <- mean(rsquared_results)
cat("Average RMSE:", average_rmse, "\n")
## Average RMSE: 0.01869467
cat("Average R-Squared:", average_rsquared, "\n")
## Average R-Squared: 0.01459357
# Calculate the average performance across all folds
average_performance <- mean(evaluation_results)
cat("Average Performance:", average_performance, "\n")
## Average Performance: 0.0003495235
Generally, a higher R-squared indicates that the model explains more of the variability in the response, and a low R-squared is usually a bad sign for a predictive model. Although the model shows a small average RMSE (0.0187), its R-squared is very low (about 0.015). The "Average Performance" figure is the mean squared error across folds, which is roughly the square of the RMSE, so it carries no additional information. Possible cause: Most of the variables are not related to the happiness score. Future action: 1. Feature selection or feature engineering to create more informative variables. 2. Hyperparameter tuning.
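The R-squared reported above comes from `summary(model)`, which describes the fit on the training folds. A minimal sketch of an out-of-fold alternative, assuming the usual definition 1 - SSE/SST computed on each held-out fold with the training mean as the reference. The simulated data and the variable names are hypothetical.

```r
set.seed(42)
n <- 500
toy <- data.frame(x = rnorm(n))
toy$y <- 2 * toy$x + rnorm(n)      # simulated response with a real signal

k <- 5
fold_id  <- sample(rep(1:k, length.out = n))   # random fold label per row
oof_r2   <- numeric(k)
oof_rmse <- numeric(k)

for (i in 1:k) {
  train <- toy[fold_id != i, ]
  test  <- toy[fold_id == i, ]
  fit   <- lm(y ~ x, data = train)
  pred  <- predict(fit, newdata = test)

  sse <- sum((test$y - pred)^2)
  sst <- sum((test$y - mean(train$y))^2)  # reference: training mean
  oof_r2[i]   <- 1 - sse / sst
  oof_rmse[i] <- sqrt(sse / nrow(test))
}

cat("Mean out-of-fold R-squared:", mean(oof_r2), "\n")
cat("Mean out-of-fold RMSE:", mean(oof_rmse), "\n")
```

Assigning fold labels with `rep(1:k, length.out = n)` also ensures that every row lands in exactly one held-out fold, including any remainder when `n` is not divisible by `k`.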
df<-na.omit(df)
df<-dplyr::select(df,-(EXT1_E:IPC))
df<-dplyr::select(df,-(lat_appx_lots_of_err:long_appx_lots_of_err ))
df<-dplyr::select(df,-(country))
df <- df %>%
mutate(across(everything(), ~ ifelse(. == "NULL", NA, .)))
df <- df[complete.cases(df), ]
# Convert all 50 item columns (EXT1 to OPN10) from character to integer
df <- df %>% mutate(across(EXT1:OPN10, as.integer))
Clustering on all 50 variables is not feasible on our hardware. Hence, we created 10 new variables by summing items that share the same characteristic.
## Creating new variables
ext <- c('EXT1', 'EXT3', 'EXT5', 'EXT7', 'EXT9')
int <- c('EXT2', 'EXT4', 'EXT6', 'EXT8', 'EXT10')
opn <- c('OPN3', 'OPN5', 'OPN7', 'OPN8', 'OPN9', 'OPN10')
cst <- c('OPN1', 'OPN2', 'OPN4', 'OPN6')
agr <- c('AGR2', 'AGR4', 'AGR6', 'AGR8', 'AGR10', 'AGR9')
cpt <- c('AGR1', 'AGR3', 'AGR5', 'AGR7')
csn <- c('CSN1', 'CSN3', 'CSN5', 'CSN7', 'CSN9', 'CSN10')
spt <- c('CSN2', 'CSN4', 'CSN6', 'CSN8')
est <- c('EST2', 'EST4', 'EST5', 'EST6')
nrt <- c('EST1', 'EST3', 'EST7', 'EST8', 'EST9', 'EST10')
df$extroversion <- rowSums(df[, ext])
df$introversion <- rowSums(df[, int])
df$open <- rowSums(df[, opn])
df$consistency <- rowSums(df[, cst])
df$agreeable <- rowSums(df[, agr])
df$competitiveness <- rowSums(df[, cpt])
df$conscientious <- rowSums(df[, csn])
df$spontaneity <- rowSums(df[, spt])
df$emotionally_stable <- rowSums(df[, est])
df$neurotic <- rowSums(df[, nrt])
df<-dplyr::select(df,-(EXT1:OPN10))
head(df)
## extroversion introversion open consistency agreeable competitiveness
## 1 23 6 25 8 23 8
## 2 12 20 21 6 26 6
## 3 12 16 22 9 23 5
## 4 11 13 22 9 23 9
## 5 17 16 28 8 26 4
## 6 16 13 24 8 21 7
## conscientious spontaneity emotionally_stable neurotic
## 1 20 12 10 14
## 2 22 9 8 13
## 3 19 9 10 16
## 4 14 13 10 17
## 5 28 4 10 13
## 6 21 8 9 13
As the data set has almost 1 million rows, clustering is computationally expensive, so we apply dimensionality reduction before clustering. Principal Component Analysis (PCA): PCA is a dimensionality reduction technique that transforms the original features into a new set of uncorrelated variables called principal components.
The sdev component of the pca_result object contains the standard deviations of the principal components, which can be used to calculate the proportion of variance explained.
The variance_explained variable stores the proportion of variance explained by each principal component.
The cumulative_variance variable stores the cumulative proportion of variance explained.
# Prepare your data
X <- df # Features (numeric matrix or data frame)
# Perform PCA
pca_result <- prcomp(X)
# Access the principal components
principal_components <- pca_result$x
# Access the proportion of variance explained by each component
variance_explained <- pca_result$sdev^2 / sum(pca_result$sdev^2)
# Access the cumulative proportion of variance explained
cumulative_variance <- cumsum(variance_explained)
# Plot the scree plot
par(mar = c(2, 2, 2, 2) + 0.1) # Adjust the margin values as needed
plot(1:length(variance_explained), variance_explained, type = "b", xlab = "Principal Component", ylab = "Variance Explained", main = "Scree Plot")
plot(1:length(cumulative_variance), cumulative_variance, type = "b", xlab = "Principal Component", ylab = "Cumulative Variance Explained", main = "Cumulative Variance Explained")
The scree plot displays the variance explained by each principal component and helps determine how many principal components to retain. The x-axis represents the principal components, ordered from the largest to the smallest variance. The y-axis represents the proportion of variance explained by each component. Look for an "elbow", the point where the curve starts to level off. Components up to the elbow capture most of the variability in the data, so retain the components up to that point.
The cumulative variance explained plot shows the cumulative proportion of variance explained as successive principal components are added. The x-axis represents the number of principal components. The y-axis represents the cumulative proportion of variance explained. Look for the point where the curve starts to plateau: components added after it contribute little additional variance. Retain enough components to reach that point so that a substantial share of the variance is kept. Examining both plots together gives an informed choice of the number of principal components to retain for the subsequent clustering.
In this case, the first 5 principal components together explain more than 80% of the variability.
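The number of components can also be picked programmatically instead of by reading the plot. The sketch below assumes a cut-off of 80% cumulative variance and uses synthetic data as a stand-in for the trait scores. The names `toy`, `threshold` and `num_components_toy` are hypothetical and not part of the project code.

```r
# Synthetic data standing in for the 10 derived trait scores (assumption)
set.seed(1)
toy <- as.data.frame(matrix(rnorm(200 * 10), ncol = 10))
toy$V2 <- toy$V1 + rnorm(200, sd = 0.3)  # introduce some correlation

# Same PCA steps as in the report, applied to the toy data
pca_toy <- prcomp(toy)
variance_explained_toy  <- pca_toy$sdev^2 / sum(pca_toy$sdev^2)
cumulative_variance_toy <- cumsum(variance_explained_toy)

# Smallest number of components whose cumulative variance reaches the cut-off
threshold <- 0.80  # assumed cut-off
num_components_toy <- which(cumulative_variance_toy >= threshold)[1]
cat("Components needed:", num_components_toy, "\n")
```

Applied to the project data, the same `which(...)[1]` expression on `cumulative_variance` would return the value to use for `num_components`.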
# Select the desired number of components
num_components <- 5 # Adjust the number of components as needed
# Retrieve the selected components
selected_components <- principal_components[, 1:num_components]
result_kmeans <- kmeans(selected_components, centers = 5)
# Access the cluster assignments
cluster_assignments <- result_kmeans$cluster
# Create a data frame with original variables and cluster assignments
cluster_data <- data.frame(df, Cluster = factor(cluster_assignments))
# Print the cluster assignments and original variables in a table format
head(cluster_data)
## extroversion introversion open consistency agreeable competitiveness
## 1 23 6 25 8 23 8
## 2 12 20 21 6 26 6
## 3 12 16 22 9 23 5
## 4 11 13 22 9 23 9
## 5 17 16 28 8 26 4
## 6 16 13 24 8 21 7
## conscientious spontaneity emotionally_stable neurotic Cluster
## 1 20 12 10 14 2
## 2 22 9 8 13 5
## 3 19 9 10 16 5
## 4 14 13 10 17 4
## 5 28 4 10 13 2
## 6 21 8 9 13 2
clusplot(df,result_kmeans$cluster)
The clusplot shows that the clusters are not very distinct. Hence we profile the clusters based on the variables to see their characteristics.
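When clusters overlap, one common way to check whether the chosen number of clusters is reasonable is the elbow method on the total within-cluster sum of squares. The following is a minimal sketch on synthetic scores, not the project data. The object names and the range of k are assumptions, and this is not the procedure used to choose 5 clusters in this report.

```r
set.seed(123)
# Synthetic 2-D scores standing in for the selected principal components
toy_scores <- rbind(
  matrix(rnorm(200, mean = 0), ncol = 2),
  matrix(rnorm(200, mean = 4), ncol = 2),
  matrix(rnorm(200, mean = 8), ncol = 2)
)

# Total within-cluster sum of squares for each candidate k
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(toy_scores, centers = k, nstart = 10)$tot.withinss
})

# The elbow of this curve suggests a suitable number of clusters
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares",
     main = "Elbow Plot")
```

On the full data set this loop may be slow, so it could be run on a random sample of rows of `selected_components`.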
cluster_data$Cluster = as.integer(cluster_data$Cluster)
mean_cluster <- aggregate(. ~ Cluster, data = cluster_data, mean)
mean_cluster <- round(mean_cluster, 2)
mean_cluster
## Cluster extroversion introversion open consistency agreeable competitiveness
## 1 1 18.35 11.16 23.92 9.59 24.42 8.60
## 2 2 19.64 9.86 24.17 9.11 25.11 7.09
## 3 3 10.12 19.00 22.18 10.14 20.46 10.78
## 4 4 12.63 15.34 21.37 9.52 16.61 11.48
## 5 5 12.57 17.05 22.89 9.78 24.19 7.90
## conscientious spontaneity emotionally_stable neurotic
## 1 18.69 13.22 12.04 22.37
## 2 22.60 8.61 11.15 13.36
## 3 18.09 13.37 11.95 24.09
## 4 18.41 10.35 10.83 13.89
## 5 23.48 8.61 11.41 18.83
mean_overall <-colMeans(cluster_data)
mean_overall <-data.frame(t(mean_overall))
mean_overall <-dplyr::select(mean_overall,-(Cluster))
mean_overall
## extroversion introversion open consistency agreeable competitiveness
## 1 14.94897 14.27228 23.03318 9.625085 22.57257 8.981891
## conscientious spontaneity emotionally_stable neurotic
## 1 20.38859 10.84528 11.51305 18.72826
cluster_col<-c("overall")
mean_overall<-cbind(Cluster=cluster_col,mean_overall)
mean_cluster<-rbind(mean_overall,mean_cluster)
# Reshape the data frame from wide to long format
mean_cluster_long <- tidyr::gather(mean_cluster, variable, value, -Cluster)
# Sort the values within each cluster from highest to lowest
mean_cluster_sorted <- mean_cluster_long %>%
  group_by(Cluster) %>%
  arrange(Cluster, desc(value)) %>%
  mutate(variable = factor(variable, levels = unique(variable)))
# Plot the bar chart for each cluster with sorted values
ggplot(mean_cluster_sorted, aes(x = variable, y = value, fill = Cluster)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ Cluster, ncol = 5) +
  labs(x = "Variable", y = "Mean Value", title = "Bar Chart - Cluster-wise Mean Values (Sorted)") +
  theme_bw() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.margin = margin(0, 0, 0, 0, "cm") # No extra margin around the plot (top, right, bottom, left)
  )
mean_cluster
## Cluster extroversion introversion open consistency agreeable
## 1 overall 14.94897 14.27228 23.03318 9.625085 22.57257
## 2 1 18.35000 11.16000 23.92000 9.590000 24.42000
## 3 2 19.64000 9.86000 24.17000 9.110000 25.11000
## 4 3 10.12000 19.00000 22.18000 10.140000 20.46000
## 5 4 12.63000 15.34000 21.37000 9.520000 16.61000
## 6 5 12.57000 17.05000 22.89000 9.780000 24.19000
## competitiveness conscientious spontaneity emotionally_stable neurotic
## 1 8.981891 20.38859 10.84528 11.51305 18.72826
## 2 8.600000 18.69000 13.22000 12.04000 22.37000
## 3 7.090000 22.60000 8.61000 11.15000 13.36000
## 4 10.780000 18.09000 13.37000 11.95000 24.09000
## 5 11.480000 18.41000 10.35000 10.83000 13.89000
## 6 7.900000 23.48000 8.61000 11.41000 18.83000
To check for survivorship bias, we compare the mean of every variable in each cluster with the overall mean to see whether the clusters behave differently from the worldwide data. From this comparison, we can conclude that:
Cluster 1 and Cluster 2 are similar to each other: both are more extroverted, more agreeable and less competitive than the overall mean. Cluster 1 tends to be more spontaneous and neurotic, whereas Cluster 2 is more neutral in its feelings and more organised.
Cluster 4 and Cluster 5 are similar to each other: both are more introverted than the overall mean. Cluster 4 is more competitive, whereas Cluster 5 is more agreeable with other people and more conscientious. Cluster 5 takes its work seriously and prefers consistency, and it is also more neurotic than Cluster 4 (18.83 versus 13.89), although this is close to the overall mean (18.73).
Cluster 3 is the most introverted of all clusters and prefers consistency and a stable life, although it also has the highest neurotic score (24.09).
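The comparison behind these profiles can be made explicit by subtracting the overall mean from each cluster mean. The sketch below copies a subset of the values from the mean_cluster output above. The object names `cluster_means`, `overall_means` and `deviation` are hypothetical and introduced only for illustration.

```r
# Cluster means copied from the mean_cluster output (subset of variables)
cluster_means <- data.frame(
  Cluster      = 1:5,
  extroversion = c(18.35, 19.64, 10.12, 12.63, 12.57),
  introversion = c(11.16, 9.86, 19.00, 15.34, 17.05),
  agreeable    = c(24.42, 25.11, 20.46, 16.61, 24.19),
  neurotic     = c(22.37, 13.36, 24.09, 13.89, 18.83)
)

# Overall means copied from the "overall" row of the same output
overall_means <- c(extroversion = 14.94897, introversion = 14.27228,
                   agreeable = 22.57257, neurotic = 18.72826)

# Subtract the overall mean from each cluster mean, column by column
deviation <- sweep(as.matrix(cluster_means[, names(overall_means)]),
                   2, overall_means, "-")
rownames(deviation) <- paste("Cluster", cluster_means$Cluster)

# Positive values are above the overall mean, negative values below it
round(deviation, 2)
```

Reading the signs row by row gives the same profiles as the bar chart, for example a positive introversion deviation for Clusters 3, 4 and 5.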
In conclusion, the project on “Predicting Happiness and Personality-Based Clustering” has provided valuable insights into understanding the relationship between individual personality traits and subjective well-being through the analysis of various data sets and the application of machine learning techniques.
From the EDA, it is concluded that all the variables have very little to no relationship with the happiness score.
This project predicts an individual's happiness level from their personality characteristics using a Multiple Linear Regression model. However, the predictions show a low average performance. K-fold cross-validation of the Multiple Linear Regression model gives a low RMSE but also a very low R-squared.
The k-means clustering applied to the personality traits data splits the data into five clusters. Although the clusters are not very distinct, we can still observe some slight differences among them.
In the future, in order to improve the Multiple Linear Regression model in this project, model hyperparameter tuning and feature engineering are required.