Group Members:

  1. Puvaneswari Poobalan (S2182524)

  2. Hun Yee Chong (S2197999)

  3. Chunli Wang (22064827)

  4. Vinoshini Loganathan (17090738)

  5. Ze Ying Tan (22058059)

1. Introduction

This project explores the relationship between personality traits and happiness and develops a data-driven method to predict an individual’s happiness level from their personality characteristics. The method rests on the hypothesis that some personality traits are strongly and positively correlated with happiness, and that identifying these traits allows a person’s happiness score to be predicted more accurately.

Questions:

  1. How do personality traits affect the happiness score?
  2. How many clusters can be obtained from the data?

The project aims to achieve the following goals:

  1. To investigate the correlation between personality traits and happiness scores.
  2. To group individuals into clusters based on their personality features.
  3. To build a predictive model to forecast happiness scores.
  4. To evaluate the effectiveness of the clustering and prediction approach.

2. Preprocessing Processes Involved:

The project involves the following key processes:

2.1 Data Collection:

Two data sets are used for this project:

Five Personality Data: This data set provides respondents’ answers to Big Five personality test items together with each respondent’s country.

Link: https://www.kaggle.com/code/yadhua/eda-and-cluster-analysis/input

Year: 2018

Purpose: To analyze the relationship between personality traits and life satisfaction.

Content: The dataset includes features such as Big Five personality traits (openness, conscientiousness, extraversion, agreeableness, and neuroticism), and countries.

Structure: The dataset is provided in a comma-separated values (CSV) format with each row representing an individual and each column representing a specific feature.

Dimension: 1015341 rows and 110 columns

World Happiness Data 2018: This data set contains happiness scores for different countries.

Link: https://www.kaggle.com/datasets/mathurinache/world-happiness-report?select=2018.csv

Year: 2018

Content: The dataset includes features such as the happiness score, GDP per capita, social support, life expectancy, and freedom to make life choices.

Structure: The dataset is provided in a comma-separated values (CSV) format with each row representing a specific country and each column representing a specific feature.

Dimension: 156 rows and 9 variables

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(ggrepel)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(treemap)
library(corrplot)
## corrplot 0.92 loaded
library(ggcorrplot)
library(treemapify)
library(geomapdata)
library(BSDA)
## Loading required package: lattice
## 
## Attaching package: 'BSDA'
## The following object is masked from 'package:datasets':
## 
##     Orange
library(caret)
library(klaR)
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:plotly':
## 
##     select
## The following object is masked from 'package:dplyr':
## 
##     select
library(class)
library(mlbench)
library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:corrplot':
## 
##     corrplot
## The following object is masked from 'package:stats':
## 
##     loadings
library(boot)
## 
## Attaching package: 'boot'
## The following object is masked from 'package:lattice':
## 
##     melanoma
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(stats)
library(cluster)
# Load the personality responses (tab-delimited file).
df<-read.delim("C:\\Users\\Puvanes\\Documents\\R Programming\\Programming for Data Science\\WQD7004 Group Project\\Five Personality Data.txt")
df<-data.frame(df)
# Load the ISO country codes and keep only the full name and two-letter code.
country_code<-read.csv("C:\\Users\\Puvanes\\Documents\\R Programming\\Programming for Data Science\\WQD7004 Group Project\\wikipedia-iso-country-codes.csv")
country_code2<-country_code[c("Full_Name","Alpha.2.code")]
# Load the 2018 happiness data and keep only the country name and score.
happiness<-read.csv("C:\\Users\\Puvanes\\Documents\\R Programming\\Programming for Data Science\\WQD7004 Group Project\\2018.csv")
happiness2<-happiness[c("Country.or.region","Score")]
# Attach the full country name to each respondent, then that country's happiness score.
df2 <- left_join(df, country_code2, by=c('country'='Alpha.2.code'))
df3 <- left_join(df2, happiness2, by=c('Full_Name'='Country.or.region'))

Checking Dimensions of the data

dim(df3)
## [1] 1015341     112

Missing Data

missing_counts<-colSums(is.na(df3))
missing_counts
##                  EXT1                  EXT2                  EXT3 
##                     0                     0                     0 
##                  EXT4                  EXT5                  EXT6 
##                     0                     0                     0 
##                  EXT7                  EXT8                  EXT9 
##                     0                     0                     0 
##                 EXT10                  EST1                  EST2 
##                     0                     0                     0 
##                  EST3                  EST4                  EST5 
##                     0                     0                     0 
##                  EST6                  EST7                  EST8 
##                     0                     0                     0 
##                  EST9                 EST10                  AGR1 
##                     0                     0                     0 
##                  AGR2                  AGR3                  AGR4 
##                     0                     0                     0 
##                  AGR5                  AGR6                  AGR7 
##                     0                     0                     0 
##                  AGR8                  AGR9                 AGR10 
##                     0                     0                     0 
##                  CSN1                  CSN2                  CSN3 
##                     0                     0                     0 
##                  CSN4                  CSN5                  CSN6 
##                     0                     0                     0 
##                  CSN7                  CSN8                  CSN9 
##                     0                     0                     0 
##                 CSN10                  OPN1                  OPN2 
##                     0                     0                     0 
##                  OPN3                  OPN4                  OPN5 
##                     0                     0                     0 
##                  OPN6                  OPN7                  OPN8 
##                     0                     0                     0 
##                  OPN9                 OPN10                EXT1_E 
##                     0                     0                     0 
##                EXT2_E                EXT3_E                EXT4_E 
##                     0                     0                     0 
##                EXT5_E                EXT6_E                EXT7_E 
##                     0                     0                     0 
##                EXT8_E                EXT9_E               EXT10_E 
##                     0                     0                     0 
##                EST1_E                EST2_E                EST3_E 
##                     0                     0                     0 
##                EST4_E                EST5_E                EST6_E 
##                     0                     0                     0 
##                EST7_E                EST8_E                EST9_E 
##                     0                     0                     0 
##               EST10_E                AGR1_E                AGR2_E 
##                     0                     0                     0 
##                AGR3_E                AGR4_E                AGR5_E 
##                     0                     0                     0 
##                AGR6_E                AGR7_E                AGR8_E 
##                     0                     0                     0 
##                AGR9_E               AGR10_E                CSN1_E 
##                     0                     0                     0 
##                CSN2_E                CSN3_E                CSN4_E 
##                     0                     0                     0 
##                CSN5_E                CSN6_E                CSN7_E 
##                     0                     0                     0 
##                CSN8_E                CSN9_E               CSN10_E 
##                     0                     0                     0 
##                OPN1_E                OPN2_E                OPN3_E 
##                     0                     0                     0 
##                OPN4_E                OPN5_E                OPN6_E 
##                     0                     0                     0 
##                OPN7_E                OPN8_E                OPN9_E 
##                     0                     0                     0 
##               OPN10_E              dateload               screenw 
##                     0                     0                     0 
##               screenh           introelapse            testelapse 
##                     0                     0                     0 
##             endelapse                   IPC               country 
##                     0                     0                    77 
##  lat_appx_lots_of_err long_appx_lots_of_err             Full_Name 
##                     0                     0                 13796 
##                 Score 
##                 21503
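The 77 missing `country` values may not be truly missing. By default, `read.delim` and `read.csv` convert the literal string "NA" into a missing value, and "NA" is also the ISO two-letter code for Namibia. The sketch below uses a small made-up table, not the project files, to show how `na.strings = ""` keeps the code as text. Whether this explains the 77 rows here is an assumption that would need checking against the raw file.

```r
# Toy data standing in for the country-code file (values are illustrative).
csv_text <- "Full_Name,Alpha.2.code\nNamibia,NA\nMalaysia,MY\n"

# Default behaviour: the string "NA" is read as a missing value.
codes_default <- read.csv(text = csv_text)

# Treat only empty fields as missing, so "NA" survives as a country code.
codes_kept <- read.csv(text = csv_text, na.strings = "")

# Count the missing codes under each setting.
n_missing_default <- sum(is.na(codes_default$Alpha.2.code))
n_missing_kept <- sum(is.na(codes_kept$Alpha.2.code))
```

The same argument would have to be passed when reading the personality file as well, since both sides of the join are affected in the same way.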
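# Extract the NA entries themselves (a vector of NA values).
missing_values<-df3[is.na(df3)]
# Keep every row that has at least one missing value.
rows_with_missing<-df3[!complete.cases(df3),]
# List the distinct country names that occur in those rows.
distinct_column<-unique(rows_with_missing$Full_Name)
distinct_column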
##  [1] "Hong Kong S.A.R., China"                     
##  [2] "Oman"                                        
##  [3] NA                                            
##  [4] "Brunei Darussalam"                           
##  [5] "Puerto Rico"                                 
##  [6] "Namibia"                                     
##  [7] "Trinidad and Tobago"                         
##  [8] "Bahamas"                                     
##  [9] "Isle of Man"                                 
## [10] "Maldives"                                    
## [11] "Gibraltar"                                   
## [12] "Macao"                                       
## [13] "Macedonia, the former Yugoslav Republic of"  
## [14] "Grenada"                                     
## [15] "Cayman Islands"                              
## [16] "Barbados"                                    
## [17] "Côte d'Ivoire"                               
## [18] "Papua New Guinea"                            
## [19] "Antigua and Barbuda"                         
## [20] "Virgin Islands, U.S."                        
## [21] "Swaziland"                                   
## [22] "Bermuda"                                     
## [23] "Fiji"                                        
## [24] "Saint Vincent and the Grenadines"            
## [25] "Guernsey"                                    
## [26] "Guadeloupe"                                  
## [27] "Åland Islands"                               
## [28] "Libyan Arab Jamahiriya"                      
## [29] "Jersey"                                      
## [30] "Northern Mariana Islands"                    
## [31] "Syrian Arab Republic"                        
## [32] "Palestinian Territory, Occupied"             
## [33] "Moldova, Republic of"                        
## [34] "Guam"                                        
## [35] "Virgin Islands, British"                     
## [36] "French Polynesia"                            
## [37] "Dominica"                                    
## [38] "Aruba"                                       
## [39] "Saint Lucia"                                 
## [40] "Guyana"                                      
## [41] "Cape Verde"                                  
## [42] "Gambia"                                      
## [43] "Lao People's Democratic Republic"            
## [44] "Suriname"                                    
## [45] "Cuba"                                        
## [46] "New Caledonia"                               
## [47] "Seychelles"                                  
## [48] "Faroe Islands"                               
## [49] "Palau"                                       
## [50] "Niue"                                        
## [51] "Anguilla"                                    
## [52] "Saint Kitts and Nevis"                       
## [53] "Vanuatu"                                     
## [54] "Monaco"                                      
## [55] "Cook Islands"                                
## [56] "Martinique"                                  
## [57] "Antarctica"                                  
## [58] "Greenland"                                   
## [59] "Montserrat"                                  
## [60] "Falkland Islands (Malvinas)"                 
## [61] "Marshall Islands"                            
## [62] "Turks and Caicos"                            
## [63] "Reunion"                                     
## [64] "San Marino"                                  
## [65] "Andorra"                                     
## [66] "French Guiana"                               
## [67] "Comoros"                                     
## [68] "Liechtenstein"                               
## [69] "Micronesia, Federated States of"             
## [70] "Tonga"                                       
## [71] "Samoa"                                       
## [72] "Timor-Leste"                                 
## [73] "Equatorial Guinea"                           
## [74] "Saint Martin"                                
## [75] "Saint Pierre and Miquelon"                   
## [76] "Djibouti"                                    
## [77] "American Samoa"                              
## [78] "Saint Helena, Ascension and Tristan da Cunha"
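Several names in the list above are written in the ISO style (for example "Moldova, Republic of" or "Syrian Arab Republic"), so some missing scores may come from spelling differences between the two files rather than from countries that are absent from the happiness data. A minimal sketch of listing the unmatched names and recoding them before the join follows. The toy tables, the scores, and the `name_fixes` lookup are hypothetical, and the target spelling is an assumption to be checked against `2018.csv`.

```r
# Toy stand-ins for country_code2 and happiness2 (values are illustrative).
codes <- data.frame(
  Full_Name = c("Moldova, Republic of", "Malaysia", "Niue"),
  stringsAsFactors = FALSE
)
scores <- data.frame(
  Country.or.region = c("Moldova", "Malaysia"),
  Score = c(5, 6),
  stringsAsFactors = FALSE
)

# Names in the code table with no exact match in the score table.
unmatched <- setdiff(codes$Full_Name, scores$Country.or.region)

# Hypothetical lookup: ISO-style name -> spelling used in the score table.
name_fixes <- c("Moldova, Republic of" = "Moldova")

# Recode only the names found in the lookup; leave the rest unchanged.
needs_fix <- codes$Full_Name %in% names(name_fixes)
codes$Join_Name <- codes$Full_Name
codes$Join_Name[needs_fix] <- name_fixes[codes$Full_Name[needs_fix]]

# Left join on the recoded name; rows still unmatched keep Score = NA.
joined <- merge(codes, scores, by.x = "Join_Name",
                by.y = "Country.or.region", all.x = TRUE)
```

After recoding, only names with no counterpart at all in the score table (here the toy entry "Niue") remain without a score.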

Removing Missing Data

df4<-na.omit(df3)
class(df4)
## [1] "data.frame"
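`na.omit` drops every row that has a missing value in any column, which here removes 1015341 - 993761 = 21580 rows. The sketch below, on a made-up data frame, shows how to confirm that count from the `na.action` attribute and how `complete.cases` can restrict the check to a single column instead. The toy frame and its values are illustrative only.

```r
# Toy frame with missing values in two different columns (illustrative).
toy <- data.frame(
  country = c("MY", NA, "GB", "NU"),
  Score   = c(6, 5, NA, NA),
  stringsAsFactors = FALSE
)

# Drop rows with a missing value in any column.
cleaned <- na.omit(toy)

# The dropped row indices are recorded in the "na.action" attribute.
n_dropped <- length(attr(cleaned, "na.action"))

# Alternative: keep rows that have a Score, whatever the other columns hold.
has_score <- toy[complete.cases(toy["Score"]), ]
```

Restricting the check to the columns an analysis actually uses keeps more rows, at the cost of having to handle the remaining missing values later.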

Checking datatypes

str(df4)
## 'data.frame':    993761 obs. of  112 variables:
##  $ EXT1                 : chr  "4" "3" "2" "2" ...
##  $ EXT2                 : chr  "1" "5" "3" "2" ...
##  $ EXT3                 : chr  "5" "3" "4" "2" ...
##  $ EXT4                 : chr  "2" "4" "4" "3" ...
##  $ EXT5                 : chr  "5" "3" "3" "4" ...
##  $ EXT6                 : chr  "1" "3" "2" "2" ...
##  $ EXT7                 : chr  "5" "2" "1" "2" ...
##  $ EXT8                 : chr  "2" "5" "3" "4" ...
##  $ EXT9                 : chr  "4" "1" "2" "1" ...
##  $ EXT10                : chr  "1" "5" "5" "4" ...
##  $ EST1                 : chr  "1" "2" "4" "3" ...
##  $ EST2                 : chr  "4" "3" "4" "3" ...
##  $ EST3                 : chr  "4" "4" "4" "3" ...
##  $ EST4                 : chr  "2" "1" "2" "2" ...
##  $ EST5                 : chr  "2" "3" "2" "3" ...
##  $ EST6                 : chr  "2" "1" "2" "2" ...
##  $ EST7                 : chr  "2" "2" "2" "2" ...
##  $ EST8                 : chr  "2" "1" "2" "2" ...
##  $ EST9                 : chr  "3" "3" "1" "4" ...
##  $ EST10                : chr  "2" "1" "3" "3" ...
##  $ AGR1                 : chr  "2" "1" "1" "2" ...
##  $ AGR2                 : chr  "5" "4" "4" "4" ...
##  $ AGR3                 : chr  "2" "1" "1" "3" ...
##  $ AGR4                 : chr  "4" "5" "4" "4" ...
##  $ AGR5                 : chr  "2" "1" "2" "2" ...
##  $ AGR6                 : chr  "3" "5" "4" "4" ...
##  $ AGR7                 : chr  "2" "3" "1" "2" ...
##  $ AGR8                 : chr  "4" "4" "4" "4" ...
##  $ AGR9                 : chr  "3" "5" "4" "3" ...
##  $ AGR10                : chr  "4" "3" "3" "4" ...
##  $ CSN1                 : chr  "3" "3" "4" "2" ...
##  $ CSN2                 : chr  "4" "2" "2" "4" ...
##  $ CSN3                 : chr  "3" "5" "2" "4" ...
##  $ CSN4                 : chr  "2" "3" "2" "4" ...
##  $ CSN5                 : chr  "2" "3" "3" "1" ...
##  $ CSN6                 : chr  "4" "1" "3" "2" ...
##  $ CSN7                 : chr  "4" "3" "4" "2" ...
##  $ CSN8                 : chr  "2" "3" "2" "3" ...
##  $ CSN9                 : chr  "4" "5" "4" "1" ...
##  $ CSN10                : chr  "4" "3" "2" "4" ...
##  $ OPN1                 : chr  "5" "1" "5" "4" ...
##  $ OPN2                 : chr  "1" "2" "1" "2" ...
##  $ OPN3                 : chr  "4" "4" "2" "5" ...
##  $ OPN4                 : chr  "1" "2" "1" "2" ...
##  $ OPN5                 : chr  "4" "3" "4" "3" ...
##  $ OPN6                 : chr  "1" "1" "2" "1" ...
##  $ OPN7                 : chr  "5" "4" "5" "4" ...
##  $ OPN8                 : chr  "3" "2" "3" "4" ...
##  $ OPN9                 : chr  "4" "5" "4" "3" ...
##  $ OPN10                : chr  "5" "3" "4" "3" ...
##  $ EXT1_E               : chr  "9419" "7235" "4657" "3996" ...
##  $ EXT2_E               : chr  "5491" "3598" "3549" "2896" ...
##  $ EXT3_E               : chr  "3959" "3315" "2543" "5096" ...
##  $ EXT4_E               : chr  "4821" "2564" "3335" "4240" ...
##  $ EXT5_E               : chr  "5611" "2976" "5847" "5168" ...
##  $ EXT6_E               : chr  "2756" "3050" "2540" "5456" ...
##  $ EXT7_E               : chr  "2388" "4787" "4922" "4360" ...
##  $ EXT8_E               : chr  "2113" "3228" "3142" "4496" ...
##  $ EXT9_E               : chr  "5900" "3465" "14621" "5240" ...
##  $ EXT10_E              : chr  "4110" "3309" "2191" "4000" ...
##  $ EST1_E               : chr  "6135" "9036" "5128" "3736" ...
##  $ EST2_E               : chr  "4150" "2406" "3675" "4616" ...
##  $ EST3_E               : chr  "5739" "3484" "3442" "3015" ...
##  $ EST4_E               : chr  "6364" "3359" "4546" "2711" ...
##  $ EST5_E               : chr  "3663" "3061" "8275" "3960" ...
##  $ EST6_E               : chr  "5070" "2539" "2185" "4064" ...
##  $ EST7_E               : chr  "5709" "4226" "2164" "4208" ...
##  $ EST8_E               : chr  "4285" "2962" "1175" "2936" ...
##  $ EST9_E               : chr  "2587" "1799" "3813" "7336" ...
##  $ EST10_E              : chr  "3997" "1607" "1593" "3896" ...
##  $ AGR1_E               : chr  "4750" "2158" "1089" "6062" ...
##  $ AGR2_E               : chr  "5475" "2090" "2203" "11952" ...
##  $ AGR3_E               : chr  "11641" "2143" "3386" "1040" ...
##  $ AGR4_E               : chr  "3115" "2807" "1464" "2264" ...
##  $ AGR5_E               : chr  "3207" "3422" "2562" "3664" ...
##  $ AGR6_E               : chr  "3260" "5324" "1493" "3049" ...
##  $ AGR7_E               : chr  "10235" "4494" "3067" "4912" ...
##  $ AGR8_E               : chr  "5897" "3627" "13719" "7545" ...
##  $ AGR9_E               : chr  "1758" "1850" "3892" "4632" ...
##  $ AGR10_E              : chr  "3081" "1747" "4100" "6896" ...
##  $ CSN1_E               : chr  "6602" "5163" "4286" "2824" ...
##  $ CSN2_E               : chr  "5457" "5240" "4775" "520" ...
##  $ CSN3_E               : chr  "1569" "7208" "2713" "2368" ...
##  $ CSN4_E               : chr  "2129" "2783" "2813" "3225" ...
##  $ CSN5_E               : chr  "3762" "4103" "4237" "2848" ...
##  $ CSN6_E               : chr  "4420" "3431" "6308" "6264" ...
##  $ CSN7_E               : chr  "9382" "3347" "2690" "3760" ...
##  $ CSN8_E               : chr  "5286" "2399" "1516" "10472" ...
##  $ CSN9_E               : chr  "4983" "3360" "2379" "3192" ...
##  $ CSN10_E              : chr  "6339" "5595" "2983" "7704" ...
##  $ OPN1_E               : chr  "3146" "2624" "1930" "3456" ...
##  $ OPN2_E               : chr  "4067" "4985" "1470" "6665" ...
##  $ OPN3_E               : chr  "2959" "1684" "1644" "1977" ...
##  $ OPN4_E               : chr  "3411" "3026" "1683" "3728" ...
##  $ OPN5_E               : chr  "2170" "4742" "2229" "4128" ...
##  $ OPN6_E               : chr  "4920" "3336" "8114" "3776" ...
##  $ OPN7_E               : chr  "4436" "2718" "2043" "2984" ...
##  $ OPN8_E               : chr  "3116" "3374" "6295" "4192" ...
##  $ OPN9_E               : chr  "2992" "3096" "1585" "3480" ...
##   [list output truncated]
##  - attr(*, "na.action")= 'omit' Named int [1:21580] 20 61 70 73 84 97 120 167 180 227 ...
##   ..- attr(*, "names")= chr [1:21580] "20" "61" "70" "73" ...
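The `str` output shows every item column stored as character, so the responses would need converting to numbers before any correlation or clustering step. A minimal sketch on a toy frame follows. The column-name pattern, the "n/a" entry, and the choice to turn unparseable entries into `NA` are assumptions for illustration and are not taken from the project code.

```r
# Toy frame mimicking the character-typed item columns (values illustrative).
toy <- data.frame(
  EXT1    = c("4", "3", "n/a"),
  EXT1_E  = c("9419", "7235", "4657"),
  country = c("GB", "MY", "GB"),
  stringsAsFactors = FALSE
)

# Pick the trait items and their timing columns by name pattern.
item_cols <- grep("^(EXT|EST|AGR|CSN|OPN)[0-9]+(_E)?$", names(toy), value = TRUE)

# as.numeric() turns unparseable strings into NA; the warning is suppressed.
toy[item_cols] <- lapply(toy[item_cols],
                         function(x) suppressWarnings(as.numeric(x)))
```

Columns that do not match the pattern, such as `country`, are left as character.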
summary(df4)
##      EXT1               EXT2               EXT3               EXT4          
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      EXT5               EXT6               EXT7               EXT8          
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      EXT9              EXT10               EST1               EST2          
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      EST3               EST4               EST5               EST6          
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      EST7               EST8               EST9              EST10          
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      AGR1               AGR2               AGR3               AGR4          
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      AGR5               AGR6               AGR7               AGR8          
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      AGR9              AGR10               CSN1               CSN2          
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      CSN3               CSN4               CSN5               CSN6          
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      CSN7               CSN8               CSN9              CSN10          
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      OPN1               OPN2               OPN3               OPN4          
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      OPN5               OPN6               OPN7               OPN8          
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      OPN9              OPN10              EXT1_E             EXT2_E         
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     EXT3_E             EXT4_E             EXT5_E             EXT6_E         
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     EXT7_E             EXT8_E             EXT9_E            EXT10_E         
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     EST1_E             EST2_E             EST3_E             EST4_E         
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     EST5_E             EST6_E             EST7_E             EST8_E         
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     EST9_E            EST10_E             AGR1_E             AGR2_E         
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     AGR3_E             AGR4_E             AGR5_E             AGR6_E         
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     AGR7_E             AGR8_E             AGR9_E            AGR10_E         
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     CSN1_E             CSN2_E             CSN3_E             CSN4_E         
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     CSN5_E             CSN6_E             CSN7_E             CSN8_E         
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     CSN9_E            CSN10_E             OPN1_E             OPN2_E         
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     OPN3_E             OPN4_E             OPN5_E             OPN6_E         
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     OPN7_E             OPN8_E             OPN9_E            OPN10_E         
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    dateload           screenw            screenh          introelapse       
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   testelapse          endelapse              IPC           country         
##  Length:993761      Min.   :1.000e+00   Min.   :  1.00   Length:993761     
##  Class :character   1st Qu.:9.000e+00   1st Qu.:  1.00   Class :character  
##  Mode  :character   Median :1.300e+01   Median :  1.00   Mode  :character  
##                     Mean   :2.725e+03   Mean   : 10.57                     
##                     3rd Qu.:1.800e+01   3rd Qu.:  2.00                     
##                     Max.   :1.493e+09   Max.   :725.00                     
##  lat_appx_lots_of_err long_appx_lots_of_err  Full_Name             Score      
##  Length:993761        Length:993761         Length:993761      Min.   :2.905  
##  Class :character     Class :character      Class :character   1st Qu.:6.886  
##  Mode  :character     Mode  :character      Mode  :character   Median :6.886  
##                                                                Mean   :6.767  
##                                                                3rd Qu.:6.965  
##                                                                Max.   :7.632

Removing Unwanted Columns

df5<-dplyr::select(df4,-(EXT1_E:IPC))
dim(df5)
## [1] 993761     55
df6<-dplyr::select(df5,-(lat_appx_lots_of_err:long_appx_lots_of_err ))
dim(df6)
## [1] 993761     53
# Summary of df6
summary(df6)
##      EXT1               EXT2               EXT3               EXT4          
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      EXT5               EXT6               EXT7               EXT8          
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      EXT9              EXT10               EST1               EST2          
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      EST3               EST4               EST5               EST6          
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      EST7               EST8               EST9              EST10          
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      AGR1               AGR2               AGR3               AGR4          
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      AGR5               AGR6               AGR7               AGR8          
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      AGR9              AGR10               CSN1               CSN2          
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      CSN3               CSN4               CSN5               CSN6          
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      CSN7               CSN8               CSN9              CSN10          
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      OPN1               OPN2               OPN3               OPN4          
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      OPN5               OPN6               OPN7               OPN8          
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      OPN9              OPN10             country           Full_Name        
##  Length:993761      Length:993761      Length:993761      Length:993761     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      Score      
##  Min.   :2.905  
##  1st Qu.:6.886  
##  Median :6.886  
##  Mean   :6.767  
##  3rd Qu.:6.965  
##  Max.   :7.632

2.2 Exploratory Data Analysis (EDA):

Perform an exploratory analysis of the personality data to gain insight into the distribution, relationships, and characteristics of the personality traits. This step involves data cleaning, handling missing values, and identifying any outliers.

Performing EDA on the first group of data

Extroversion (outgoing/energetic vs. solitary/reserved)

EXT<-df6 %>% dplyr::select(EXT1:EXT10)
# Replace the string "NULL" with NA
EXT <- EXT %>% 
  mutate(across(everything(), ~ ifelse(. == "NULL", NA, .)))

# Remove rows with any NA values
EXT <- EXT %>% filter(complete.cases(.))
EXT_counts <- count(EXT,EXT1)

plot_ly(EXT_counts, labels = ~EXT1, values = ~n, type = "pie",
        text = ~paste(EXT1, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(EXT_counts$EXT1)), "Dark2")),title = "Chart of EXT 1: I am the life of the party") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 28.5% selected a neutral option. 44.3% had chosen 1 & 2 and 26.96% had chosen 4 & 5.
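The three shares quoted in each comment can be reproduced from the raw responses. Below is a minimal base R sketch, assuming the responses are stored as character values as in df6. The helper name `likert_shares` and the toy vector are illustrative only and are not part of the original analysis. The sketch also assumes that a response of 0 marks an unanswered item, which may explain why the three quoted shares do not always add up to exactly 100%.

```r
# Hypothetical helper (not part of the original analysis): percentage of
# responses in each answer group for one Likert item stored as character.
likert_shares <- function(x) {
  x <- suppressWarnings(as.integer(x))  # "NULL" and other text become NA
  x <- x[!is.na(x)]                     # drop missing responses
  groups <- cut(
    x,
    breaks = c(-Inf, 0, 2, 3, 5),
    labels = c("0 (assumed unanswered)", "1 & 2", "3 (neutral)", "4 & 5")
  )
  shares <- 100 * prop.table(table(groups))
  setNames(round(as.numeric(shares), 2), names(shares))
}

# Toy responses, for illustration only
toy_item <- c("1", "2", "3", "4", "5", "0", "NULL", "3")
toy_shares <- likert_shares(toy_item)
print(toy_shares)
```

On the real data the same helper could be applied to every item at once, for example with `sapply(EXT, likert_shares)`, which would return one column of shares per question.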

EXT_counts <- count(EXT,EXT2)

plot_ly(EXT_counts, labels = ~EXT2, values = ~n, type = "pie",
        text = ~paste(EXT2, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(EXT_counts$EXT2)), "Dark2")),title = "Chart of EXT 2: I don't talk a lot") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 24.1% selected a neutral option. 44.1% had chosen 1 & 2 and 31.3% had chosen 4 & 5.

EXT_counts <- count(EXT,EXT3)

plot_ly(EXT_counts, labels = ~EXT3, values = ~n, type = "pie",
        text = ~paste(EXT3, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(EXT_counts$EXT3)), "Dark2")),title = "Chart of EXT 3: I feel comfortable around people") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 26.5% selected a neutral option. 26.44% had chosen 1 & 2 and 54.6% had chosen 4 & 5.

EXT_counts <- count(EXT,EXT4)

plot_ly(EXT_counts, labels = ~EXT4, values = ~n, type = "pie",
        text = ~paste(EXT4, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(EXT_counts$EXT4)), "Dark2")),title = "Chart of EXT 4: I keep in the background") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 27.6% selected a neutral option. 30.7% had chosen 1 & 2 and 41.1% had chosen 4 & 5.

EXT_counts <- count(EXT,EXT5)

plot_ly(EXT_counts, labels = ~EXT5, values = ~n, type = "pie",
        text = ~paste(EXT5, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(EXT_counts$EXT5)), "Dark2")),title = "Chart of EXT 5: I start conversations") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 23.1% selected a neutral option. 27.28% had chosen 1 & 2 and 48.7% had chosen 4 & 5.

EXT_counts <- count(EXT,EXT6)

plot_ly(EXT_counts, labels = ~EXT6, values = ~n, type = "pie",
        text = ~paste(EXT6, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(EXT_counts$EXT6)), "Dark2")),title = "Chart of EXT 6: I have little to say") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 18.9% selected a neutral option. 59.7% had chosen 1 & 2 and 25.91% had chosen 4 & 5.

EXT_counts <- count(EXT,EXT7)

plot_ly(EXT_counts, labels = ~EXT7, values = ~n, type = "pie",
        text = ~paste(EXT7, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(EXT_counts$EXT7)), "Dark2")),title = "Chart of EXT 7: I talk to a lot of different people at parties") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 19.2% selected a neutral option. 45.8% had chosen 1 & 2 and 34.2% had chosen 4 & 5.

EXT_counts <- count(EXT,EXT8)

plot_ly(EXT_counts, labels = ~EXT8, values = ~n, type = "pie",
        text = ~paste(EXT8, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(EXT_counts$EXT8)), "Dark2")),title = "Chart of EXT 8: I don't like to draw attention to myself") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 22.7% selected a neutral option. 25.09% had chosen 1 & 2 and 51.7% had chosen 4 & 5.

EXT_counts <- count(EXT,EXT9)

plot_ly(EXT_counts, labels = ~EXT9, values = ~n, type = "pie",
        text = ~paste(EXT9, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(EXT_counts$EXT9)), "Dark2")),title = "Chart of EXT 9: I don't mind being the center of attention") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 21.2% selected a neutral option. 39% had chosen 1 & 2 and 39.1% had chosen 4 & 5.

EXT_counts <- count(EXT,EXT10)

plot_ly(EXT_counts, labels = ~EXT10, values = ~n, type = "pie",
        text = ~paste(EXT10, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(EXT_counts$EXT10)), "Dark2")),title = "Chart of EXT 10: I am quiet around strangers") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 18.9% selected a neutral option. 23.15% had chosen 1 & 2 and 57.3% had chosen 4 & 5.

Although most respondents feel comfortable around people and are not afraid to start conversations, they prefer to engage in smaller groups and do not like to draw attention to themselves.

EXT<-df6 %>% dplyr::select(EXT1:Score) 
EXT<-dplyr::select(EXT,-(EST1:Full_Name))

# Replace the string "NULL" with NA
EXT <- EXT %>% 
  mutate(across(everything(), ~ ifelse(. == "NULL", NA, .)))

# Remove rows with any NA values
EXT <- EXT %>% filter(complete.cases(.))
# Convert the ten item responses from character to integer
EXT <- EXT %>% mutate(across(EXT1:EXT10, as.integer))
EXT_CORR<-cor(EXT)

# Correlation heatmap with the coefficients printed as numbers
ggcorrplot(EXT_CORR,
           hc.order = TRUE,
           type = "lower",
           lab = TRUE)

Comment: The variables are not highly correlated with one another, and they have very little correlation with the happiness score.
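To read the item-to-score correlations without scanning the whole heatmap, the Score column of the correlation matrix can be pulled out and sorted. The following is a small sketch on randomly generated toy data, not on the project data, so the printed values say nothing about the real result. The names `toy` and `score_corr` are placeholders. Note that in this project Score is a country-level value repeated for every respondent from the same country.

```r
# Toy data for illustration only: two Likert items and a numeric score
set.seed(1)
toy <- data.frame(
  EXT1  = sample(1:5, 200, replace = TRUE),
  EXT2  = sample(1:5, 200, replace = TRUE),
  Score = runif(200, min = 2.9, max = 7.6)
)

# Pearson correlation matrix, as produced by cor() in the report
corr <- cor(toy)

# Keep only the correlations between each item and Score
score_corr <- corr[setdiff(rownames(corr), "Score"), "Score"]

# Sort by absolute strength so the most related items come first
score_corr <- score_corr[order(abs(score_corr), decreasing = TRUE)]
print(round(score_corr, 3))
```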

Performing EDA on the second group of data

Neuroticism (sensitive/nervous vs. resilient/confident)

EST<-df6 %>% dplyr::select(EST1:EST10)
# Replace the string "NULL" with NA
EST <- EST %>% 
  mutate(across(everything(), ~ ifelse(. == "NULL", NA, .)))

# Remove rows with any NA values
EST <- EST %>% filter(complete.cases(.))
distinct_column<-unique(EST$EST1)
distinct_column
## [1] "1" "2" "4" "3" "5" "0"
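The output above shows a "0" among the response values, which lies outside the 1 to 5 answer scale. A minimal base R sketch of one way to handle it is given below, assuming that 0 means the question was left unanswered (this is an assumption and should be checked against the dataset's codebook). The function name `clean_likert` and the toy data frame are hypothetical and are not part of the original pipeline.

```r
# Hypothetical helper: convert character responses to integer, treat
# "NULL" and 0 as missing, and keep only fully answered rows.
clean_likert <- function(df) {
  df[] <- lapply(df, function(col) {
    col <- suppressWarnings(as.integer(col))  # "NULL" becomes NA
    col[!is.na(col) & col == 0L] <- NA        # assumption: 0 = no answer
    col
  })
  df[complete.cases(df), , drop = FALSE]
}

# Toy data for illustration only
toy <- data.frame(
  EST1 = c("1", "0", "NULL", "5"),
  EST2 = c("2", "3", "4", "NULL"),
  stringsAsFactors = FALSE
)
cleaned <- clean_likert(toy)
print(cleaned)
```

Only the first toy row survives, because every other row contains a "NULL" or a 0. Whether dropping such rows is appropriate for the real data depends on how many respondents it would remove.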
EST_counts <- count(EST,EST1)

plot_ly(EST_counts, labels = ~EST1, values = ~n, type = "pie",
        text = ~paste(EST1, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(EST_counts$EST1)), "Dark2")),title = "Chart of EST 1: I get stressed out easily") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 20.1% selected a neutral option. 29.7% had chosen 1 & 2 and 49.5% had chosen 4 & 5.

EST_counts <- count(EST,EST2)

plot_ly(EST_counts, labels = ~EST2, values = ~n, type = "pie",
        text = ~paste(EST2, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(EST_counts$EST2)), "Dark2")),title = "Chart of EST 2: I am relaxed most of the time") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 25.9% selected a neutral option. 30.25% had chosen 1 & 2 and 42.9% had chosen 4 & 5.

EST_counts <- count(EST,EST3)

plot_ly(EST_counts, labels = ~EST3, values = ~n, type = "pie",
        text = ~paste(EST3, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(EST_counts$EST3)), "Dark2")),title = "Chart of EST 3: I worry about things") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 14.6% selected a neutral option. 14.52% had chosen 1 & 2 and 70.4% had chosen 4 & 5.

EST_counts <- count(EST,EST4)

plot_ly(EST_counts, labels = ~EST4, values = ~n, type = "pie",
        text = ~paste(EST4, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(EST_counts$EST4)), "Dark2")),title = "Chart of EST 4: I seldom feel blue") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 24.9% selected a neutral option. 47.5% had chosen 1 & 2 and 26.48% had chosen 4 & 5.

EST_counts <- count(EST,EST5)

plot_ly(EST_counts, labels = ~EST5, values = ~n, type = "pie",
        text = ~paste(EST5, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(EST_counts$EST5)), "Dark2")),title = "Chart of EST 5: I am easily disturbed") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 22.5% selected a neutral option. 42.8% had chosen 1 & 2 and 34.2% had chosen 4 & 5.

EST_counts <- count(EST,EST6)

plot_ly(EST_counts, labels = ~EST6, values = ~n, type = "pie",
        text = ~paste(EST6, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(EST_counts$EST6)), "Dark2")),title = "Chart of EST 6: I get upset easily") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 21% selected a neutral option. 42.9% had chosen 1 & 2 and 35% had chosen 4 & 5.

EST_counts <- count(EST,EST7)

plot_ly(EST_counts, labels = ~EST7, values = ~n, type = "pie",
        text = ~paste(EST7, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(EST_counts$EST7)), "Dark2")),title = "Chart of EST 7: I change my mood a lot") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 23% selected a neutral option. 36.4% had chosen 1 & 2 and 40% had chosen 4 & 5.

EST_counts <- count(EST,EST8)

plot_ly(EST_counts, labels = ~EST8, values = ~n, type = "pie",
        text = ~paste(EST8, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(EST_counts$EST8)), "Dark2")),title = "Chart of EST 8: I have frequent mood swings") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 19.7% selected a neutral option. 48.9% had chosen 1 & 2 and 30.7% had chosen 4 & 5.

EST_counts <- count(EST,EST9)

plot_ly(EST_counts, labels = ~EST9, values = ~n, type = "pie",
        text = ~paste(EST9, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(EST_counts$EST9)), "Dark2")),title = "Chart of EST 9: I get irritated easily") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 20.8% selected a neutral option. 35.2% had chosen 1 & 2 and 43.4% had chosen 4 & 5.

EST_counts <- count(EST,EST10)

plot_ly(EST_counts, labels = ~EST10, values = ~n, type = "pie",
        text = ~paste(EST10, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(EST_counts$EST10)), "Dark2")),title = "Chart of EST 10:   I often feel blue") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 21.6% selected a neutral option. 45.1% had chosen 1 & 2 and 32.5% had chosen 4 & 5.

Although the share of respondents who get stressed out easily (49.5%) is close to the share who are relaxed most of the time (42.9%), 70.4% of respondents still worry about things.

EST<-df6 %>% dplyr::select(EST1:Score) 
EST<-dplyr::select(EST,-(AGR1:Full_Name))

# Replace the string "NULL" with NA
EST <- EST %>% 
  mutate(across(everything(), ~ ifelse(. == "NULL", NA, .)))

# Remove rows with any NA values
EST <- EST %>% filter(complete.cases(.))
# Convert the ten item responses from character to integer
EST <- EST %>% mutate(across(EST1:EST10, as.integer))
EST_CORR<-cor(EST)

# Correlation heatmap with the coefficients printed as numbers
ggcorrplot(EST_CORR,
           hc.order = TRUE,
           type = "lower",
           lab = TRUE)

Comment: EST7 is highly correlated with EST8. All variables have very little correlation with the happiness score.
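Pairs of strongly related items can also be listed programmatically instead of being read off the heatmap. The sketch below uses a small hand-made toy data set, not the project data, and a cut-off of 0.7 that is an arbitrary assumption rather than a value taken from the report. The names `toy`, `threshold`, and `high_pairs` are illustrative.

```r
# Toy data for illustration only: EST8 closely follows EST7, EST1 does not
toy <- data.frame(
  EST7 = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  EST8 = c(1, 2, 3, 5, 5, 2, 2, 3, 4, 4),
  EST1 = c(3, 1, 4, 1, 5, 5, 1, 4, 1, 3)
)

corr <- cor(toy)
threshold <- 0.7  # assumed cut-off for "highly correlated"

# Look only at the upper triangle so each pair is reported once
idx <- which(abs(corr) >= threshold & upper.tri(corr), arr.ind = TRUE)

high_pairs <- data.frame(
  item_a = rownames(corr)[idx[, "row"]],
  item_b = colnames(corr)[idx[, "col"]],
  r      = round(corr[idx], 3)
)
print(high_pairs)
```

In the toy data only the EST7 and EST8 pair passes the cut-off. Applied to `EST_CORR`, the same steps would list every pair at or above the chosen threshold.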

Performing EDA on the third group of data

Agreeableness (friendly/compassionate vs. critical/rational)

AGR<-df6 %>% dplyr::select(AGR1:AGR10)
# Replace the string "NULL" with NA
AGR <- AGR %>% 
  mutate(across(everything(), ~ ifelse(. == "NULL", NA, .)))

# Remove rows with any NA values
AGR <- AGR %>% filter(complete.cases(.))
distinct_column<-unique(AGR$AGR1)
distinct_column
## [1] "2" "1" "5" "4" "3" "0"
AGR_counts <- count(AGR,AGR1)

plot_ly(AGR_counts, labels = ~AGR1, values = ~n, type = "pie",
        text = ~paste(AGR1, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(AGR_counts$AGR1)), "Dark2")),title = "Chart of AGR 1: I feel little concern for others") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 12.7% selected a neutral option. 64.8% had chosen 1 & 2 and 21.94% had chosen 4 & 5.

AGR_counts <- count(AGR,AGR2)

plot_ly(AGR_counts, labels = ~AGR2, values = ~n, type = "pie",
        text = ~paste(AGR2, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(AGR_counts$AGR2)), "Dark2")),title = "Chart of AGR 2: I am interested in people") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 18.5% selected a neutral option. 12.24% had chosen 1 & 2 and 68.4% had chosen 4 & 5.

AGR_counts <- count(AGR,AGR3)

plot_ly(AGR_counts, labels = ~AGR3, values = ~n, type = "pie",
        text = ~paste(AGR3, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(AGR_counts$AGR3)), "Dark2")),title = "Chart of AGR 3: I insult people") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 17.3% selected a neutral option. 61.5% had chosen 1 & 2 and 20.82% had chosen 4 & 5.

AGR_counts <- count(AGR,AGR4)

plot_ly(AGR_counts, labels = ~AGR4, values = ~n, type = "pie",
        text = ~paste(AGR4, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(AGR_counts$AGR4)), "Dark2")),title = "Chart of AGR 4: I sympathize with others' feelings") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 14.3% selected a neutral option. 11.64% had chosen 1 & 2 and 73.3% had chosen 4 & 5.

AGR_counts <- count(AGR,AGR5)

plot_ly(AGR_counts, labels = ~AGR5, values = ~n, type = "pie",
        text = ~paste(AGR5, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(AGR_counts$AGR5)), "Dark2")),title = "Chart of AGR 5: I am not interested in other people's problems") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 17.7% selected a neutral option. 64.7% had chosen 1 & 2 and 16.95% had chosen 4 & 5.

AGR_counts <- count(AGR,AGR6)

plot_ly(AGR_counts, labels = ~AGR6, values = ~n, type = "pie",
        text = ~paste(AGR6, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(AGR_counts$AGR6)), "Dark2")),title = "Chart of AGR 6: I have a soft heart") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 18.1% selected a neutral option. 15.52% had chosen 1 & 2 and 65.3% had chosen 4 & 5.

AGR_counts <- count(AGR,AGR7)

plot_ly(AGR_counts, labels = ~AGR7, values = ~n, type = "pie",
        text = ~paste(AGR7, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(AGR_counts$AGR7)), "Dark2")),title = "Chart of AGR 7: I am not really interested in others") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 17.9% selected a neutral option. 66.9% had chosen 1 & 2 and 14.56% had chosen 4 & 5.

AGR_counts <- count(AGR,AGR8)

plot_ly(AGR_counts, labels = ~AGR8, values = ~n, type = "pie",
        text = ~paste(AGR8, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(AGR_counts$AGR8)), "Dark2")),title = "Chart of AGR 8: I take time out for others") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 21.2% selected a neutral option. 13.64% had chosen 1 & 2 and 64.3% had chosen 4 & 5.

AGR_counts <- count(AGR,AGR9)

plot_ly(AGR_counts, labels = ~AGR9, values = ~n, type = "pie",
        text = ~paste(AGR9, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(AGR_counts$AGR9)), "Dark2")),title = "Chart of AGR 9: I feel others' emotions") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 15.7% selected a neutral option. 14.83% had chosen 1 & 2 and 68.8% had chosen 4 & 5.

AGR_counts <- count(AGR,AGR10)

plot_ly(AGR_counts, labels = ~AGR10, values = ~n, type = "pie",
        text = ~paste(AGR10, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(AGR_counts$AGR10)), "Dark2")),title = "Chart of AGR 10:   I make people feel at ease") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 30% selected a neutral option. 12.96% had chosen 1 & 2 and 56.1% had chosen 4 & 5.

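64.8% of respondents disagreed with the statement "I feel little concern for others" (AGR1) and 61.5% disagreed with "I insult people" (AGR3), which suggests that most respondents care about other people and try not to insult them.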

AGR<-df6 %>% dplyr::select(AGR1:Score) 
AGR<-dplyr::select(AGR,-(CSN1:Full_Name))

# Replace the string "NULL" with NA
AGR <- AGR %>% 
  mutate(across(everything(), ~ ifelse(. == "NULL", NA, .)))

# Remove rows with any NA values
AGR <- AGR %>% filter(complete.cases(.))
# Convert the ten item responses from character to integer
AGR <- AGR %>% mutate(across(AGR1:AGR10, as.integer))
AGR_CORR<-cor(AGR)

ggcorrplot(AGR_CORR,
           hc.order = TRUE,
           type = "lower",
           lab = TRUE)

Comment: The variables are moderately correlated with one another. All variables have very little correlation with the happiness score.

Performing EDA on the fourth group of data

Conscientiousness (efficient/organized vs. extravagant/careless)

CSN<-df6 %>% dplyr::select(CSN1:CSN10)
# Replace the string "NULL" with NA
CSN <- CSN %>% 
  mutate(across(everything(), ~ ifelse(. == "NULL", NA, .)))

# Remove rows with any NA values
CSN <- CSN %>% filter(complete.cases(.))
distinct_column<-unique(CSN$CSN1)
distinct_column
## [1] "3" "4" "2" "5" "1" "0"
CSN_counts <- count(CSN,CSN1)

plot_ly(CSN_counts, labels = ~CSN1, values = ~n, type = "pie",
        text = ~paste(CSN1, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(CSN_counts$CSN1)), "Dark2")),title = "Chart of CSN 1: I am always prepared") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 26.4% selected a neutral option. 23.59% had chosen 1 & 2 and 48.9% had chosen 4 & 5.

CSN_counts <- count(CSN,CSN2)

plot_ly(CSN_counts, labels = ~CSN2, values = ~n, type = "pie",
        text = ~paste(CSN2, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(CSN_counts$CSN2)), "Dark2")),title = "Chart of CSN 2: I leave my belongings around") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 18.8% selected a neutral option. 40.6% had chosen 1 & 2 and 40% had chosen 4 & 5.

CSN_counts <- count(CSN,CSN3)

plot_ly(CSN_counts, labels = ~CSN3, values = ~n, type = "pie",
        text = ~paste(CSN3, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(CSN_counts$CSN3)), "Dark2")),title = "Chart of CSN 3: I pay attention to details") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 16.1% selected a neutral option. 9.28% had chosen 1 & 2 and 74.1% had chosen 4 & 5.

CSN_counts <- count(CSN,CSN4)

plot_ly(CSN_counts, labels = ~CSN4, values = ~n, type = "pie",
        text = ~paste(CSN4, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(CSN_counts$CSN4)), "Dark2")),title = "Chart of CSN 4: I make a mess of things") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 22.8% selected a neutral option. 50% had chosen 1 & 2 and 26.59% had chosen 4 & 5.

CSN_counts <- count(CSN,CSN5)

plot_ly(CSN_counts, labels = ~CSN5, values = ~n, type = "pie",
        text = ~paste(CSN5, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(CSN_counts$CSN5)), "Dark2")),title = "Chart of CSN 5: I get chores done right away") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 23.5% selected a neutral option. 49.2% had chosen 1 & 2 and 26.55% had chosen 4 & 5.

CSN_counts <- count(CSN,CSN6)

plot_ly(CSN_counts, labels = ~CSN6, values = ~n, type = "pie",
        text = ~paste(CSN6, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(CSN_counts$CSN6)), "Dark2")),title = "Chart of CSN 6: I often forget to put things back in their proper place") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 16.1% selected a neutral option. 46% had chosen 1 & 2 and 37.1% had chosen 4 & 5.

CSN_counts <- count(CSN,CSN7)

plot_ly(CSN_counts, labels = ~CSN7, values = ~n, type = "pie",
        text = ~paste(CSN7, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(CSN_counts$CSN7)), "Dark2")),title = "Chart of CSN 7: I like order") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 22.5% selected a neutral option. 13.58% had chosen 1 & 2 and 63.2% had chosen 4 & 5.

CSN_counts <- count(CSN,CSN8)

plot_ly(CSN_counts, labels = ~CSN8, values = ~n, type = "pie",
        text = ~paste(CSN8, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(CSN_counts$CSN8)), "Dark2")),title = "Chart of CSN 8: I shirk my duties") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 31.2% selected a neutral option. 50.3% had chosen 1 & 2 and 17.74% had chosen 4 & 5.

CSN_counts <- count(CSN,CSN9)

plot_ly(CSN_counts, labels = ~CSN9, values = ~n, type = "pie",
        text = ~paste(CSN9, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(CSN_counts$CSN9)), "Dark2")),title = "Chart of CSN 9: I follow a schedule") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 22.7% selected a neutral option. 30.1% had chosen 1 & 2 and 46.5% had chosen 4 & 5.

CSN_counts <- count(CSN,CSN10)

plot_ly(CSN_counts, labels = ~CSN10, values = ~n, type = "pie",
        text = ~paste(CSN10, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(CSN_counts$CSN10)), "Dark2")),title = "Chart of CSN 10:   I am exacting in my work") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 31.3% selected a neutral option. 12.13% had chosen 1 & 2 and 55.7% had chosen 4 & 5.

The responses in this group are not fully consistent with one another. Although 74.1% agreed that they pay attention to details, 40% also agreed that they leave their belongings around. Similarly, 63.2% agreed that they like order, yet 49.2% disagreed that they get chores done right away. On the other hand, 55.7% agreed that they are exacting in their work, which fits with the 50.3% who disagreed that they shirk their duties (only 17.74% agreed).
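The percentages quoted in the comments can also be computed directly instead of being read off the pie charts. The following is a minimal sketch on toy data, not the code used for this report. It assumes that 1 and 2 mean disagreement, 3 is neutral, 4 and 5 mean agreement, and that 0 marks a missing answer; the function name `likert_shares` and the vector `responses` are hypothetical.

```r
# Toy responses stored as character, as in the item columns above.
# Assumption: 0 means "no answer".
responses <- c("3", "4", "2", "5", "1", "0", "4", "4", "3", "5")

likert_shares <- function(x) {
  x <- as.integer(x)
  # Map each answer to one of four groups: 0, 1-2, 3, 4-5.
  group <- cut(x,
               breaks = c(-Inf, 0, 2, 3, 5),
               labels = c("no answer (0)", "disagree (1-2)",
                          "neutral (3)", "agree (4-5)"))
  # Percentage of respondents in each group, rounded to one decimal.
  round(100 * prop.table(table(group)), 1)
}

shares <- likert_shares(responses)
print(shares)
```

Keeping the 0 answers as a separate group shows why the three percentages in each comment add up to slightly less than 100%.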

CSN<-df6 %>% dplyr::select(CSN1:Score) 
CSN<-dplyr::select(CSN,-(OPN1:Full_Name))

# Remove values containing "NULL"
CSN <- CSN %>% 
  mutate(across(everything(), ~ ifelse(. == "NULL", NA, .)))

# Remove rows with any NA values
CSN <- CSN %>% filter(complete.cases(.))
# Convert the ten item columns from character to integer
CSN <- CSN %>% mutate(across(CSN1:CSN10, as.integer))
CSN_CORR<-cor(CSN)

ggcorrplot(CSN_CORR,
           hc.order = TRUE,
           type = "lower",
           lab = TRUE)

Comment: The variables are not highly correlated with each other and they have very little correlation with the happiness score.
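When only the relationship with the happiness score matters, the relevant column of the correlation matrix can be extracted and sorted rather than read from the plot. This is a small sketch on simulated data; the data frame `toy` and its values are made up for illustration, and only the column naming follows the report.

```r
set.seed(1)
# Toy stand-in for the CSN data frame: three item columns plus Score.
toy <- data.frame(
  CSN1 = sample(1:5, 200, replace = TRUE),
  CSN2 = sample(1:5, 200, replace = TRUE),
  CSN3 = sample(1:5, 200, replace = TRUE)
)
# Score is simulated so that it depends on CSN1 only.
toy$Score <- 6.9 + 0.02 * toy$CSN1 + rnorm(200, sd = 0.02)

# Correlation of every item with Score, sorted by absolute size.
item_cols <- setdiff(names(toy), "Score")
score_cor <- cor(toy[, item_cols], toy$Score)[, 1]
score_cor <- score_cor[order(abs(score_cor), decreasing = TRUE)]
print(round(score_cor, 3))
```

The same pattern should work for the other item groups by changing the data frame passed to `cor()`.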

Performing EDA on the last group of data

openness to experience (inventive/curious vs. consistent/cautious)

OPN<-df6 %>% dplyr::select(OPN1:OPN10)
# Remove values containing "NULL"
OPN <- OPN %>% 
  mutate(across(everything(), ~ ifelse(. == "NULL", NA, .)))

# Remove rows with any NA values
OPN <- OPN %>% filter(complete.cases(.))
distinct_column<-unique(OPN$OPN1)
distinct_column
## [1] "5" "1" "4" "3" "2" "0"
OPN_counts <- count(OPN,OPN1)

plot_ly(OPN_counts, labels = ~OPN1, values = ~n, type = "pie",
        text = ~paste(OPN1, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(OPN_counts$OPN1)), "Dark2")),title = "Chart of OPN 1: I have a rich vocabulary") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 24.8% selected a neutral option. 14.41% had chosen 1 & 2 and 59.9% had chosen 4 & 5.

OPN_counts <- count(OPN,OPN2)

plot_ly(OPN_counts, labels = ~OPN2, values = ~n, type = "pie",
        text = ~paste(OPN2, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(OPN_counts$OPN2)), "Dark2")),title = "Chart of OPN 2: I have difficulty understanding abstract ideas") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 18.4% selected a neutral option. 68.9% had chosen 1 & 2 and 12.19% had chosen 4 & 5.

OPN_counts <- count(OPN,OPN3)

plot_ly(OPN_counts, labels = ~OPN3, values = ~n, type = "pie",
        text = ~paste(OPN3, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(OPN_counts$OPN3)), "Dark2")),title = "Chart of OPN 3: I have a vivid imagination") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 17.2% selected a neutral option. 9.51% had chosen 1 & 2 and 72.5% had chosen 4 & 5.

OPN_counts <- count(OPN,OPN4)

plot_ly(OPN_counts, labels = ~OPN4, values = ~n, type = "pie",
        text = ~paste(OPN4, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(OPN_counts$OPN4)), "Dark2")),title = "Chart of OPN 4: I am not interested in abstract ideas") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 18.6% selected a neutral option. 70.6% had chosen 1 & 2 and 10.09% had chosen 4 & 5.

OPN_counts <- count(OPN,OPN5)

plot_ly(OPN_counts, labels = ~OPN5, values = ~n, type = "pie",
        text = ~paste(OPN5, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(OPN_counts$OPN5)), "Dark2")),title = "Chart of OPN 5: I have excellent ideas") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 27% selected a neutral option. 7.46% had chosen 1 & 2 and 64.9% had chosen 4 & 5.

OPN_counts <- count(OPN,OPN6)

plot_ly(OPN_counts, labels = ~OPN6, values = ~n, type = "pie",
        text = ~paste(OPN6, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(OPN_counts$OPN6)), "Dark2")),title = "Chart of OPN 6: I Do Not Have A Good Imagination") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 11.7% selected a neutral option. 76.6% had chosen 1 & 2 and 10.91% had chosen 4 & 5.

OPN_counts <- count(OPN,OPN7)

plot_ly(OPN_counts, labels = ~OPN7, values = ~n, type = "pie",
        text = ~paste(OPN7, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(OPN_counts$OPN7)), "Dark2")),title = "Chart of OPN 7: I am quick to understand things") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 17.9% selected a neutral option. 7.1% had chosen 1 & 2 and 74.2% had chosen 4 & 5.

OPN_counts <- count(OPN,OPN8)

plot_ly(OPN_counts, labels = ~OPN8, values = ~n, type = "pie",
        text = ~paste(OPN8, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(OPN_counts$OPN8)), "Dark2")),title = "Chart of OPN 8: I use difficult words") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 24.9% selected a neutral option. 29.8% had chosen 1 & 2 and 44.5% had chosen 4 & 5.

OPN_counts <- count(OPN,OPN9)

plot_ly(OPN_counts, labels = ~OPN9, values = ~n, type = "pie",
        text = ~paste(OPN9, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(OPN_counts$OPN9)), "Dark2")),title = "Chart of OPN 9: I spend time reflecting on things") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 12.6% selected a neutral option. 7.56% had chosen 1 & 2 and 79.1% had chosen 4 & 5.

OPN_counts <- count(OPN,OPN10)

plot_ly(OPN_counts, labels = ~OPN10, values = ~n, type = "pie",
        text = ~paste(OPN10, ": ", n),
        marker = list(colors = RColorBrewer::brewer.pal(length(unique(OPN_counts$OPN10)), "Dark2")),title = "Chart of OPN 10:   I am full of ideas") %>%
  layout(scene = list(aspectmode = "data"), showlegend = FALSE)  

Comment: 20.9% selected a neutral option. 7.91% had chosen 1 & 2 and 70.6% had chosen 4 & 5.

More than 70% of respondents describe themselves as creative: they claim to have a vivid imagination (72.5%) and to be full of ideas (70.6%).

OPN<-df6 %>% dplyr::select(OPN1:Score) 
OPN<-dplyr::select(OPN,-(country:Full_Name))

# Remove values containing "NULL"
OPN <- OPN %>% 
  mutate(across(everything(), ~ ifelse(. == "NULL", NA, .)))

# Remove rows with any NA values
OPN <- OPN %>% filter(complete.cases(.))
# Convert the ten item columns from character to integer
OPN <- OPN %>% mutate(across(OPN1:OPN10, as.integer))
OPN_CORR<-cor(OPN)

ggcorrplot(OPN_CORR,
           hc.order = TRUE,
           type = "lower",
           lab = TRUE)

Comment: The variables are not highly correlated with each other and they have very little correlation with the happiness score.

2.3 Data Preprocessing for modelling:

distinct_column<-unique(df6$Full_Name)
distinct_column
##   [1] "United Kingdom"         "Malaysia"               "Kenya"                 
##   [4] "Sweden"                 "United States"          "Finland"               
##   [7] "Ukraine"                "Philippines"            "France"                
##  [10] "Australia"              "India"                  "Canada"                
##  [13] "Netherlands"            "South Africa"           "Brazil"                
##  [16] "Switzerland"            "Thailand"               "Italy"                 
##  [19] "Spain"                  "United Arab Emirates"   "Croatia"               
##  [22] "Greece"                 "Ireland"                "Germany"               
##  [25] "Portugal"               "Singapore"              "Romania"               
##  [28] "Norway"                 "Bangladesh"             "Nigeria"               
##  [31] "Lithuania"              "Ethiopia"               "Indonesia"             
##  [34] "Belgium"                "Austria"                "Denmark"               
##  [37] "Tanzania"               "Luxembourg"             "Poland"                
##  [40] "Japan"                  "Mexico"                 "Cyprus"                
##  [43] "Uganda"                 "Sri Lanka"              "Turkey"                
##  [46] "Myanmar"                "Colombia"               "Estonia"               
##  [49] "Argentina"              "Iceland"                "Hungary"               
##  [52] "Pakistan"               "Tunisia"                "Latvia"                
##  [55] "Czech Republic"         "New Zealand"            "Serbia"                
##  [58] "Israel"                 "Jamaica"                "Chile"                 
##  [61] "Qatar"                  "Saudi Arabia"           "Vietnam"               
##  [64] "Kazakhstan"             "Bosnia and Herzegovina" "Mauritius"             
##  [67] "Egypt"                  "Peru"                   "Slovenia"              
##  [70] "Jordan"                 "Taiwan"                 "Dominican Republic"    
##  [73] "Algeria"                "Kuwait"                 "Morocco"               
##  [76] "Malta"                  "Venezuela"              "Russia"                
##  [79] "South Korea"            "Liberia"                "Guatemala"             
##  [82] "Bulgaria"               "Ghana"                  "Somalia"               
##  [85] "Slovakia"               "China"                  "Azerbaijan"            
##  [88] "Albania"                "Cambodia"               "Lebanon"               
##  [91] "Uruguay"                "Zimbabwe"               "Uzbekistan"            
##  [94] "Honduras"               "Costa Rica"             "Georgia"               
##  [97] "Nepal"                  "Iran"                   "Mongolia"              
## [100] "Zambia"                 "Nicaragua"              "Bahrain"               
## [103] "Sudan"                  "Belize"                 "Paraguay"              
## [106] "Panama"                 "El Salvador"            "Montenegro"            
## [109] "Angola"                 "Kyrgyzstan"             "Afghanistan"           
## [112] "Rwanda"                 "Belarus"                "Gabon"                 
## [115] "Armenia"                "Ecuador"                "Yemen"                 
## [118] "Botswana"               "Burundi"                "Cameroon"              
## [121] "Lesotho"                "Iraq"                   "Bolivia"               
## [124] "Mozambique"             "Senegal"                "Malawi"                
## [127] "Madagascar"             "Benin"                  "Bhutan"                
## [130] "Haiti"                  "Congo (Brazzaville)"    "Mali"                  
## [133] "Congo (Kinshasa)"       "Mauritania"             "Burkina Faso"          
## [136] "Tajikistan"             "Sierra Leone"           "Togo"                  
## [139] "Chad"                   "Niger"                  "Guinea"
count(df6, Full_Name)%>%arrange(desc(n))
##                  Full_Name      n
## 1            United States 546403
## 2           United Kingdom  66596
## 3                   Canada  61849
## 4                Australia  50030
## 5              Philippines  19847
## 6                    India  17491
## 7                  Germany  14095
## 8              New Zealand  12992
## 9                   Norway  11417
## 10                Malaysia  11355
## 11                  Mexico  11152
## 12                  Sweden  10493
## 13             Netherlands   9785
## 14               Singapore   7686
## 15               Indonesia   6489
## 16                  Brazil   6245
## 17                  France   6145
## 18                 Denmark   5512
## 19                 Ireland   5409
## 20                   Italy   5319
## 21                   Spain   5008
## 22                  Poland   4659
## 23                 Finland   4340
## 24                 Romania   3858
## 25                 Belgium   3824
## 26            South Africa   3751
## 27                Colombia   3619
## 28                Pakistan   3511
## 29                  Russia   3323
## 30               Argentina   3154
## 31             Switzerland   3124
## 32    United Arab Emirates   3061
## 33                  Turkey   2891
## 34                Portugal   2519
## 35                  Greece   2513
## 36                 Vietnam   2337
## 37                 Croatia   2245
## 38                 Austria   2215
## 39                   Chile   2193
## 40                  Serbia   2065
## 41          Czech Republic   2014
## 42                Thailand   1971
## 43                   Japan   1933
## 44                    Peru   1659
## 45             South Korea   1593
## 46                 Hungary   1506
## 47                  Israel   1432
## 48                   Kenya   1405
## 49                   China   1340
## 50                Bulgaria   1271
## 51               Venezuela   1260
## 52                 Ecuador   1146
## 53               Lithuania   1101
## 54            Saudi Arabia   1097
## 55                   Egypt   1033
## 56                 Estonia   1020
## 57                Slovakia    992
## 58                 Nigeria    952
## 59                  Taiwan    921
## 60                Slovenia    866
## 61                 Lebanon    837
## 62                 Ukraine    747
## 63               Sri Lanka    701
## 64              Costa Rica    650
## 65                   Nepal    642
## 66                 Iceland    612
## 67  Bosnia and Herzegovina    550
## 68              Kazakhstan    526
## 69                  Latvia    517
## 70                 Jamaica    512
## 71                 Morocco    499
## 72                  Jordan    437
## 73                 Albania    436
## 74                    Iran    429
## 75               Guatemala    425
## 76                  Kuwait    397
## 77                Cambodia    386
## 78                   Malta    378
## 79                 Bolivia    369
## 80                   Qatar    366
## 81      Dominican Republic    359
## 82                 Georgia    359
## 83                 Uruguay    351
## 84              Bangladesh    319
## 85                  Cyprus    311
## 86                Paraguay    281
## 87                Ethiopia    277
## 88                   Ghana    272
## 89                 Algeria    239
## 90                Honduras    225
## 91              Luxembourg    220
## 92                 Bahrain    207
## 93                  Panama    198
## 94             El Salvador    197
## 95                 Tunisia    192
## 96               Mauritius    187
## 97               Nicaragua    174
## 98                 Belarus    166
## 99                  Belize    152
## 100                 Uganda    149
## 101             Montenegro    142
## 102                Armenia    109
## 103               Botswana    106
## 104               Zimbabwe    103
## 105                 Zambia     98
## 106                   Iraq     93
## 107                  Sudan     90
## 108                Myanmar     86
## 109               Tanzania     86
## 110             Azerbaijan     80
## 111               Mongolia     72
## 112            Afghanistan     54
## 113             Kyrgyzstan     54
## 114             Uzbekistan     35
## 115               Cameroon     33
## 116                 Rwanda     32
## 117             Mozambique     26
## 118                 Malawi     24
## 119                Senegal     18
## 120                Somalia     16
## 121                  Haiti     15
## 122                 Angola     14
## 123                 Bhutan     14
## 124                  Yemen     14
## 125                Lesotho     12
## 126             Madagascar     12
## 127           Sierra Leone      6
## 128                Liberia      5
## 129                  Gabon      4
## 130    Congo (Brazzaville)      3
## 131       Congo (Kinshasa)      3
## 132                   Mali      3
## 133             Mauritania      3
## 134                   Togo      3
## 135                  Benin      2
## 136           Burkina Faso      2
## 137             Tajikistan      2
## 138                Burundi      1
## 139                   Chad      1
## 140                 Guinea      1
## 141                  Niger      1
# Distribution of Score before removing outliers
df_hist<-df6 %>% dplyr::select(Score) %>% arrange(Score)
ggplot(df_hist, aes(x=Score)) + geom_histogram(binwidth=1, color="blue", fill="lightgreen") + labs(title="Distribution of Score", x="Score", y="Number of Respondents")

Comment: Although the histogram seems like a normal distribution, it is slightly skewed.

Removing Outliers

Q1 <- quantile(df6$Score, .25)
Q3 <- quantile(df6$Score, .75)
IQR <- IQR(df6$Score)

df7 <- subset(df6, df6$Score > (Q1 - 1.5*IQR) & df6$Score < (Q3 + 1.5*IQR))
dim(df7)
## [1] 575094     53
ggplot(df7, aes(x=Score)) + geom_histogram(binwidth=1, color="blue", fill="lightgreen") + labs(title="Distribution of Score after removing outliers", x="Score", y="Number of Respondents")
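The 1.5 x IQR rule used above keeps only values strictly between Q1 - 1.5*IQR and Q3 + 1.5*IQR. The sketch below walks through the arithmetic on a small made-up vector; the numbers are illustrative and are not taken from df6.

```r
# Toy scores: most values sit close together, two lie far away.
score <- c(6.7, 6.8, 6.8, 6.9, 6.9, 6.9, 7.0, 7.0, 7.1, 4.5, 7.6)

q1  <- quantile(score, 0.25, names = FALSE)
q3  <- quantile(score, 0.75, names = FALSE)
iqr <- q3 - q1                      # same value as IQR(score)

# Lower and upper fences of the 1.5 x IQR rule.
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

kept <- score[score > lower & score < upper]
cat("fences:", lower, upper, "\n")
cat("kept", length(kept), "of", length(score), "values\n")
```

Because Score is a country-level value repeated for every respondent, and more than half of the rows come from a single country, the quartiles can lie very close together. The fences then become tight and may exclude whole countries rather than isolated extreme rows.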

df7 <- df7 %>% 
  mutate(across(everything(), ~ ifelse(. == "NULL", NA, .)))
df7 <- df7[complete.cases(df7), ]
# Convert all 50 item columns (EXT, EST, AGR, CSN, OPN) from character to integer
df7 <- df7 %>% mutate(across(EXT1:OPN10, as.integer))


df7<-dplyr::select(df7,-(country:Full_Name))
head(df7)
##    EXT1 EXT2 EXT3 EXT4 EXT5 EXT6 EXT7 EXT8 EXT9 EXT10 EST1 EST2 EST3 EST4 EST5
## 7     4    3    4    3    3    3    5    3    4     3    2    4    4    2    4
## 22    3    2    2    4    4    4    5    3    1     3    3    3    4    4    3
## 50    3    5    4    3    3    4    2    2    2     4    1    2    3    2    2
## 52    1    2    3    4    3    3    2    5    1     5    3    2    4    1    2
## 54    1    4    5    4    4    1    2    1    1     2    2    3    4    2    2
## 72    3    4    1    4    3    2    2    4    3     5    5    3    4    3    5
##    EST6 EST7 EST8 EST9 EST10 AGR1 AGR2 AGR3 AGR4 AGR5 AGR6 AGR7 AGR8 AGR9 AGR10
## 7     2    2    2    4     4    1    2    1    5    3    5    3    4    4     5
## 22    3    5    4    3     4    1    5    1    4    5    3    2    3    4     2
## 50    2    3    2    2     4    3    4    3    2    1    4    2    3    3     4
## 52    3    3    3    4     3    3    3    4    3    4    5    3    3    3     2
## 54    2    4    4    2     3    1    5    1    5    1    4    1    5    5     5
## 72    1    3    2    5     2    4    2    5    1    5    2    4    1    1     4
##    CSN1 CSN2 CSN3 CSN4 CSN5 CSN6 CSN7 CSN8 CSN9 CSN10 OPN1 OPN2 OPN3 OPN4 OPN5
## 7     3    2    4    2    1    4    4    2    2     5    5    2    4    3    4
## 22    3    3    2    2    2    2    4    3    2     2    3    4    3    2    2
## 50    5    4    4    2    4    2    2    2    5     5    3    2    4    2    2
## 52    3    3    4    2    3    1    4    1    1     3    4    4    5    2    3
## 54    3    4    4    1    2    4    4    1    3     5    5    1    5    1    4
## 72    2    5    4    3    1    4    5    3    3     5    3    4    2    2    4
##    OPN6 OPN7 OPN8 OPN9 OPN10 Score
## 7     1    5    5    4     4 6.886
## 22    5    3    2    1     2 6.886
## 50    2    4    2    4     5 6.774
## 52    1    3    4    4     4 6.886
## 54    1    5    5    5     4 6.977
## 72    4    5    3    4     2 6.886

3. Regression Model:

Building a regression model to predict happiness score

The data set is split into a 70% training set and a 30% testing set.

# Step 1: Split the dataset into training and test sets
set.seed(123)  # for reproducibility
trainIndex <- createDataPartition(df7$Score, p = 0.7, list = FALSE)
trainData <- df7[trainIndex, ]
testData <- df7[-trainIndex, ]
nrow(trainData)
## [1] 402203
nrow(testData)
## [1] 172371
# Step 2: Fit the model
model <- lm(Score ~ ., data = trainData)  # Linear regression model

Model evaluation

# Step 3: Evaluate the model on the training set
trainPred <- predict(model, newdata = trainData)
trainError <- sqrt(mean((trainData$Score - trainPred)^2))
cat("Training RMSE:", trainError, "\n")
## Training RMSE: 0.01871987
# Step 4: Evaluate the model on the test set
testPred <- predict(model, newdata = testData)
testError <- sqrt(mean((testData$Score - testPred)^2))
cat("Test RMSE:", testError, "\n")
## Test RMSE: 0.01863382

If the training RMSE is significantly lower than the test RMSE, it suggests that the model may be overfitting the training data. In contrast, if both the training and test RMSE are high, it indicates underfitting, suggesting that the model is not capturing the patterns in the data well.

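Comment: The training RMSE and the test RMSE are almost equal, so there is no sign of overfitting. Equal errors alone do not rule out underfitting; that has to be judged from how much of the variance in Score the model explains (see the R-squared in the k-fold validation).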
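One way to put an RMSE into context is to compare it with a naive baseline that always predicts the mean score of the training set. The sketch below illustrates this on simulated data in which the response barely depends on the predictor; all names and values here (`toy`, `x`, `rmse`) are hypothetical and the numbers are not results from this project.

```r
set.seed(42)
# Toy data: Score varies very little and barely depends on x.
n <- 1000
toy <- data.frame(x = sample(1:5, n, replace = TRUE))
toy$Score <- 6.89 + 0.002 * toy$x + rnorm(n, sd = 0.02)

# 70/30 split with base R.
train_idx <- sample(n, size = 700)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

fit <- lm(Score ~ x, data = train)
model_rmse <- rmse(test$Score, predict(fit, newdata = test))
# Baseline: always predict the mean Score of the training set.
baseline_rmse <- rmse(test$Score, mean(train$Score))

cat("Model RMSE:   ", model_rmse, "\n")
cat("Baseline RMSE:", baseline_rmse, "\n")
```

If the two numbers are close, a small RMSE mostly reflects the small spread of the response rather than predictive skill.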

results <- data.frame(Predicted = testPred, Actual = testData$Score)
head(results)
##     Predicted Actual
## 7    6.890553  6.886
## 22   6.889074  6.886
## 72   6.889343  6.886
## 83   6.891699  6.965
## 105  6.890599  6.886
## 152  6.888636  6.886

k-fold validation

# Define the number of folds (k)
k <- 10

# Create an empty vector to store the evaluation results
evaluation_results <- numeric(k)
rmse_results <- numeric(k)
rsquared_results <- numeric(k)

# Create an index vector to shuffle the data
index <- sample(1:nrow(df7))

# Calculate the fold size
fold_size <- floor(nrow(df7) / k)

# Perform k-fold cross-validation
for (i in 1:k) {
  # Define the start and end indices of the current fold
  start_index <- (i - 1) * fold_size + 1
  end_index <- start_index + fold_size - 1
  
  # Extract the training and testing data for the current fold
  training_data <- df7[-index[start_index:end_index], ]
  testing_data <- df7[index[start_index:end_index], ]
  
  # Fit your model on the training data
  model <- lm(Score ~ ., data = training_data)
  
  # Make predictions on the testing data
  predictions <- predict(model, newdata = testing_data)
  
  # Mean squared error (MSE) on the held-out fold
  evaluation_results[i] <- mean((testing_data$Score - predictions)^2)
  
  # Root mean squared error (RMSE) on the held-out fold
  rmse_results[i] <- sqrt(mean((testing_data$Score - predictions)^2))
  
  # R-squared of the fit on this fold's training data (in-sample, not held-out)
  rsquared_results[i] <- summary(model)$r.squared
}

# Calculate the average RMSE and R-squared across all folds
average_rmse <- mean(rmse_results)
average_rsquared <- mean(rsquared_results)
cat("Average RMSE:", average_rmse, "\n")
## Average RMSE: 0.01869467
cat("Average R-Squared:", average_rsquared, "\n")
## Average R-Squared: 0.01459357
# Calculate the average MSE across all folds (printed as "Average Performance")
average_performance <- mean(evaluation_results)
cat("Average Performance:", average_performance, "\n")
## Average Performance: 0.0003495235

Generally, a higher R-squared indicates that more of the variability in the response is explained by the model, and a low R-squared is a bad sign for a predictive model. Although the average RMSE (0.0187) and the average MSE (0.00035) are small, the R-squared is only about 0.015, so the model explains roughly 1.5% of the variance in Score. The errors are small mainly because Score itself varies very little in df7, not because the predictions are accurate.

Possible cause: Most of the variables are not related to the happiness score.

Future action: 1. Feature selection or feature engineering to create more informative variables. 2. Hyperparameter tuning.
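The loop above takes R-squared from `summary(model)`, which is computed on the training rows of each fold. A held-out version can be computed from the test-fold predictions instead. The following is a sketch on simulated data, not a re-run of the analysis: the data frame `toy`, the choice of k = 5 and the helper names are assumptions made for illustration.

```r
set.seed(123)
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$Score <- 7 + 0.5 * toy$x1 + rnorm(n, sd = 1)

k <- 5
# Assign every row to exactly one fold, so no rows are left untested.
fold_id <- sample(rep(1:k, length.out = n))

cv <- sapply(1:k, function(i) {
  train <- toy[fold_id != i, ]
  test  <- toy[fold_id == i, ]
  fit   <- lm(Score ~ ., data = train)
  pred  <- predict(fit, newdata = test)
  sse <- sum((test$Score - pred)^2)
  # Total sum of squares around the training mean (held-out R-squared).
  sst <- sum((test$Score - mean(train$Score))^2)
  c(rmse = sqrt(sse / nrow(test)), r2 = 1 - sse / sst)
})

cv_means <- rowMeans(cv)
print(round(cv_means, 3))
```

A held-out R-squared can be negative when the model predicts worse than the training mean, which makes it a stricter check than the in-sample value.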

4. Clustering:

4.1 Data Preprocessing

df<-na.omit(df)
 
df<-dplyr::select(df,-(EXT1_E:IPC))
df<-dplyr::select(df,-(lat_appx_lots_of_err:long_appx_lots_of_err ))
df<-dplyr::select(df,-(country))

df <- df %>% 
  mutate(across(everything(), ~ ifelse(. == "NULL", NA, .)))
df <- df[complete.cases(df), ]

# Convert all 50 item columns (EXT, EST, AGR, CSN, OPN) from character to integer
df <- df %>% mutate(across(EXT1:OPN10, as.integer))

Clustering on all 50 variables was not feasible on our hardware. Hence we created 10 new variables, each the sum of the items that share the same characteristic.

## Creating new variables
ext <- c('EXT1', 'EXT3', 'EXT5', 'EXT7', 'EXT9')
# Note: 'EXT6' is listed twice and 'EXT8' is not used, so EXT6 counts double in introversion
int <- c('EXT2', 'EXT4', 'EXT6', 'EXT6', 'EXT10')
opn <- c('OPN3', 'OPN5', 'OPN7', 'OPN8', 'OPN9', 'OPN10')
cst <- c('OPN1', 'OPN2', 'OPN4', 'OPN6')
agr <- c('AGR2', 'AGR4', 'AGR6', 'AGR8', 'AGR10', 'AGR9')
cpt <- c('AGR1', 'AGR3', 'AGR5', 'AGR7')
csn <- c('CSN1', 'CSN3', 'CSN5', 'CSN7', 'CSN9', 'CSN10')
spt <- c('CSN2', 'CSN4', 'CSN6', 'CSN8')
est <- c('EST2', 'EST4', 'EST5', 'EST6')
nrt <- c('EST1', 'EST3', 'EST7', 'EST8', 'EST9', 'EST10')

df$extroversion <- rowSums(df[, ext])
df$introversion <- rowSums(df[, int])
df$open <- rowSums(df[, opn])
df$consistency <- rowSums(df[, cst])
df$agreeable <- rowSums(df[, agr])
df$competitiveness <- rowSums(df[, cpt])
df$conscientious <- rowSums(df[, csn])
df$spontaneity <- rowSums(df[, spt])
df$emotionally_stable <- rowSums(df[, est])
df$neurotic <- rowSums(df[, nrt])



df<-dplyr::select(df,-(EXT1:OPN10))
head(df)
##   extroversion introversion open consistency agreeable competitiveness
## 1           23            6   25           8        23               8
## 2           12           20   21           6        26               6
## 3           12           16   22           9        23               5
## 4           11           13   22           9        23               9
## 5           17           16   28           8        26               4
## 6           16           13   24           8        21               7
##   conscientious spontaneity emotionally_stable neurotic
## 1            20          12                 10       14
## 2            22           9                  8       13
## 3            19           9                 10       16
## 4            14          13                 10       17
## 5            28           4                 10       13
## 6            21           8                  9       13
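The ten row sums above can also be written as a single loop over a named list of item groups. The following is a minimal sketch on a made-up three-row data frame; `toy`, `groups` and `composites` are hypothetical names, only two of the ten groups are shown, and this is not the code that produced the results in this report.

```r
# Toy data: three respondents, four extraversion items scored 1-5 (made-up values)
toy <- data.frame(
  EXT1 = c(4L, 2L, 3L),
  EXT2 = c(1L, 5L, 3L),
  EXT3 = c(5L, 1L, 3L),
  EXT4 = c(2L, 4L, 3L)
)

# Named list: composite name -> items summed into it (hypothetical subset)
groups <- list(
  extroversion = c("EXT1", "EXT3"),
  introversion = c("EXT2", "EXT4")
)

# Guard against an item being listed twice within a group
stopifnot(!any(vapply(groups, anyDuplicated, integer(1)) > 0))

# Sum each group of items row-wise; drop = FALSE keeps a data frame even for one item
composites <- as.data.frame(
  lapply(groups, function(items) rowSums(toy[, items, drop = FALSE]))
)

print(composites)
```

Keeping the groups in one list makes it easy to check them for duplicated or missing items before the sums are computed.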

4.2 PCA

As we have almost 1 million rows of data, clustering on all ten variables is computationally expensive, so we reduce the number of dimensions first. Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms the original features into a new set of uncorrelated variables called principal components.

The sdev component of the pca_result object contains the standard deviations of the principal components, which can be used to calculate the proportion of variance explained.

The variance_explained variable stores the proportion of variance explained by each principal component.

The cumulative_variance variable stores the cumulative proportion of variance explained.

# Prepare your data
X <- df  # Features (numeric matrix or data frame)

# Perform PCA
pca_result <- prcomp(X)

# Access the principal components
principal_components <- pca_result$x

# Access the proportion of variance explained by each component
variance_explained <- pca_result$sdev^2 / sum(pca_result$sdev^2)

# Access the cumulative proportion of variance explained
cumulative_variance <- cumsum(variance_explained)

# Plot the scree plot
par(mar = c(2, 2, 2, 2) + 0.1)  # Adjust the margin values as needed
plot(1:length(variance_explained), variance_explained, type = "b", xlab = "Principal Component", ylab = "Variance Explained", main = "Scree Plot")

plot(1:length(cumulative_variance), cumulative_variance, type = "b", xlab = "Principal Component", ylab = "Cumulative Variance Explained", main = "Cumulative Variance Explained")

The scree plot displays the variance explained by each principal component. It helps determine the optimal number of principal components to retain in your analysis. The x-axis represents the principal components, typically ordered from left to right. The y-axis represents the variance explained by each principal component. Look for an “elbow” or a point where the curve starts to level off. This indicates the number of principal components that capture most of the variability in the data. Choose the number of components before the elbow point.

The cumulative variance explained plot shows the cumulative proportion of variance explained by adding successive principal components. The x-axis represents the principal components. The y-axis represents the cumulative proportion of variance explained. Look for a point where the curve starts to plateau or reaches a saturation point. This indicates the number of components needed to explain a significant portion of the total variance. Choose the number of components before the saturation point to retain a substantial amount of variance. By examining both the scree plot and the cumulative variance explained plot, you can make an informed decision about the number of principal components to retain for your subsequent analysis or clustering.

In this case, the first five principal components together explain more than 80% of the variance.
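The number of components can also be read off programmatically instead of from the plots. The sketch below is a minimal illustration on simulated data, assuming an 80% cut-off; `toy`, `threshold` and `n_keep` are hypothetical names and the data are made up, so the resulting number does not apply to the report's data set.

```r
# Toy data: 200 rows, 6 columns built from 2 latent factors plus noise (made-up)
set.seed(1)
f1 <- rnorm(200)
f2 <- rnorm(200)
toy <- data.frame(
  a = f1 + rnorm(200, sd = 0.3),
  b = f1 + rnorm(200, sd = 0.3),
  c = f1 + rnorm(200, sd = 0.3),
  d = f2 + rnorm(200, sd = 0.3),
  e = f2 + rnorm(200, sd = 0.3),
  f = f2 + rnorm(200, sd = 0.3)
)

# Same calculation as in the report: variance share per component, then running total
pca_toy <- prcomp(toy)
var_toy <- pca_toy$sdev^2 / sum(pca_toy$sdev^2)
cum_toy <- cumsum(var_toy)

# Smallest number of components whose cumulative share reaches the cut-off
threshold <- 0.80  # assumed cut-off, adjust as needed
n_keep <- which(cum_toy >= threshold)[1]

print(round(cum_toy, 3))
print(n_keep)
```

Applied to the report's `cumulative_variance` vector, the same `which(...)[1]` expression would give the component count to pass on as `num_components`.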

# Select the desired number of components
num_components <- 5  # Adjust the number of components as needed

# Retrieve the selected components
selected_components <- principal_components[, 1:num_components]


# Run k-means with five clusters on the selected principal components
result_kmeans <- kmeans(selected_components, centers = 5)

# Access the cluster assignments
cluster_assignments <- result_kmeans$cluster

# Create a data frame with original variables and cluster assignments
cluster_data <- data.frame(df, Cluster = factor(cluster_assignments))

# Print the cluster assignments and original variables in a table format
head(cluster_data)
##   extroversion introversion open consistency agreeable competitiveness
## 1           23            6   25           8        23               8
## 2           12           20   21           6        26               6
## 3           12           16   22           9        23               5
## 4           11           13   22           9        23               9
## 5           17           16   28           8        26               4
## 6           16           13   24           8        21               7
##   conscientious spontaneity emotionally_stable neurotic Cluster
## 1            20          12                 10       14       2
## 2            22           9                  8       13       5
## 3            19           9                 10       16       5
## 4            14          13                 10       17       4
## 5            28           4                 10       13       2
## 6            21           8                  9       13       2
clusplot(df,result_kmeans$cluster)

The clusplot shows that the clusters are not very distinct. Hence we profile the clusters based on the variables to see their characteristics.
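One way to put a number on how distinct the clusters are is the share of the total sum of squares that lies between clusters, which the `kmeans` result exposes as `betweenss` and `totss`. The following is a minimal sketch on made-up data that contrasts a well-separated case with an overlapping one; this ratio was not computed in the report, and all object names are hypothetical.

```r
set.seed(42)

# Two well-separated groups in two dimensions (made-up data)
separated <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)

# One homogeneous cloud with no real group structure
overlapping <- matrix(rnorm(200), ncol = 2)

# nstart = 10 reruns k-means from 10 random starts and keeps the best solution
km_sep <- kmeans(separated, centers = 2, nstart = 10)
km_ovl <- kmeans(overlapping, centers = 2, nstart = 10)

# Share of total variation explained by the cluster split (closer to 1 = more distinct)
ratio_sep <- km_sep$betweenss / km_sep$totss
ratio_ovl <- km_ovl$betweenss / km_ovl$totss

print(c(separated = ratio_sep, overlapping = ratio_ovl))
```

A ratio near 1 means most of the variation is between clusters; a low ratio is consistent with the overlapping picture that the clusplot gives.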

Profiling

cluster_data$Cluster = as.integer(cluster_data$Cluster)  
mean_cluster <- aggregate(. ~ Cluster, data = cluster_data, mean)
mean_cluster <- round(mean_cluster, 2)
mean_cluster
##   Cluster extroversion introversion  open consistency agreeable competitiveness
## 1       1        18.35        11.16 23.92        9.59     24.42            8.60
## 2       2        19.64         9.86 24.17        9.11     25.11            7.09
## 3       3        10.12        19.00 22.18       10.14     20.46           10.78
## 4       4        12.63        15.34 21.37        9.52     16.61           11.48
## 5       5        12.57        17.05 22.89        9.78     24.19            7.90
##   conscientious spontaneity emotionally_stable neurotic
## 1         18.69       13.22              12.04    22.37
## 2         22.60        8.61              11.15    13.36
## 3         18.09       13.37              11.95    24.09
## 4         18.41       10.35              10.83    13.89
## 5         23.48        8.61              11.41    18.83
mean_overall <-colMeans(cluster_data)
mean_overall <-data.frame(t(mean_overall))
mean_overall <-dplyr::select(mean_overall,-(Cluster))
mean_overall
##   extroversion introversion     open consistency agreeable competitiveness
## 1     14.94897     14.27228 23.03318    9.625085  22.57257        8.981891
##   conscientious spontaneity emotionally_stable neurotic
## 1      20.38859    10.84528           11.51305 18.72826
cluster_col<-c("overall")
mean_overall<-cbind(Cluster=cluster_col,mean_overall)
mean_cluster<-rbind(mean_overall,mean_cluster)

# Reshape the data frame from wide to long format
mean_cluster_long <- tidyr::gather(mean_cluster, variable, value, -Cluster)

# Sort the values within each cluster from highest to lowest
mean_cluster_sorted <- mean_cluster_long %>%
  group_by(Cluster) %>%
  arrange(Cluster, desc(value)) %>%
  mutate(variable = factor(variable, levels = unique(variable)))

# Plot the bar chart for each cluster with sorted values
ggplot(mean_cluster_sorted, aes(x = variable, y = value, fill = Cluster)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ Cluster, ncol = 5) +
  labs(x = "Variable", y = "Mean Value", title = "Bar Chart - Cluster-wise Mean Values (Sorted)") +
  theme_bw() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.margin = margin(0, 0, 0, 0, "cm")  # Plot margins (top, right, bottom, left), all 0 cm here
  )

mean_cluster
##   Cluster extroversion introversion     open consistency agreeable
## 1 overall     14.94897     14.27228 23.03318    9.625085  22.57257
## 2       1     18.35000     11.16000 23.92000    9.590000  24.42000
## 3       2     19.64000      9.86000 24.17000    9.110000  25.11000
## 4       3     10.12000     19.00000 22.18000   10.140000  20.46000
## 5       4     12.63000     15.34000 21.37000    9.520000  16.61000
## 6       5     12.57000     17.05000 22.89000    9.780000  24.19000
##   competitiveness conscientious spontaneity emotionally_stable neurotic
## 1        8.981891      20.38859    10.84528           11.51305 18.72826
## 2        8.600000      18.69000    13.22000           12.04000 22.37000
## 3        7.090000      22.60000     8.61000           11.15000 13.36000
## 4       10.780000      18.09000    13.37000           11.95000 24.09000
## 5       11.480000      18.41000    10.35000           10.83000 13.89000
## 6        7.900000      23.48000     8.61000           11.41000 18.83000

To check for survivorship bias, we compare the mean of every variable in each cluster against the overall mean to see whether the clusters behave differently from the data as a whole. From this comparison, we can conclude that:

  1. Cluster 1 and Cluster 2 are similar: both are more extroverted, more agreeable and less competitive than overall. Cluster 1 tends to be more spontaneous and neurotic, whereas Cluster 2 is less neurotic and more organised.

  2. Cluster 4 and Cluster 5 are similar: both are more introverted than overall, but Cluster 4 is more competitive, whereas Cluster 5 is more agreeable and conscientious. However, as Cluster 5 takes their job seriously and prefers consistency, they are also more neurotic than Cluster 4 (18.83 versus 13.89).

  3. Cluster 3 is the most introverted of all clusters and prefers consistency and a stable life.
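The comparison behind these conclusions can be made explicit by subtracting the overall mean from each cluster mean. The sketch below re-enters three columns of the `mean_cluster` table above by hand, with the overall means rounded to two decimals; `profile`, `overall` and `deviation` are hypothetical names and this step is an illustration, not part of the original analysis.

```r
# Cluster means copied from the mean_cluster table (three variables only)
profile <- data.frame(
  Cluster         = c("1", "2", "3", "4", "5"),
  extroversion    = c(18.35, 19.64, 10.12, 12.63, 12.57),
  introversion    = c(11.16, 9.86, 19.00, 15.34, 17.05),
  competitiveness = c(8.60, 7.09, 10.78, 11.48, 7.90)
)

# Overall means from the same table, rounded to two decimals
overall <- c(extroversion = 14.95, introversion = 14.27, competitiveness = 8.98)

# Subtract the overall mean column by column; positive values are above overall
deviation <- sweep(as.matrix(profile[, names(overall)]), 2, overall, FUN = "-")
rownames(deviation) <- paste("Cluster", profile$Cluster)

print(round(deviation, 2))
```

Reading the signs row by row gives the same picture as the list above, for example a positive introversion deviation for Clusters 3, 4 and 5 and a negative one for Clusters 1 and 2.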

5. Conclusion

In conclusion, the project on “Predicting Happiness and Personality-Based Clustering” has provided valuable insights into understanding the relationship between individual personality traits and subjective well-being through the analysis of various data sets and the application of machine learning techniques.

The EDA shows that the variables have very little to no relationship with the happiness score.

The Multiple Linear Regression model can predict an individual’s happiness level from their personality characteristics, but its predictive performance is poor. K-fold cross-validation of the model gives a low RMSE but also a very low R-squared.

The k-means clustering applied to the personality traits data, on the other hand, splits the data into five clusters. Although the clusters are not very distinct, we can still observe some slight differences among them.

In the future, the Multiple Linear Regression model in this project could be improved through hyperparameter tuning and feature engineering.