This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
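We replace missing values with the column average.
# Load the packages used below: GGally provides ggcorr(), ggpairs() and wrap(); cluster provides clusplot()
library(GGally)
library(cluster)
# Read CSV
whr.df <- read.csv(file.choose(), header = TRUE)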
# Read the CSV a second time to keep the country names for use as row names
whr.df.names <- read.csv(file.choose(), header = TRUE)
head(whr.df)
## country LifeLadder LnGDPpc SocSupp LifeExp LifeChoice
## 1 Afghanistan 4.220169 7.497288 0.5590718 49.87127 0.5225662
## 2 Albania 4.511101 9.282300 0.6384115 68.69838 0.7298189
## 3 Algeria 5.388171 9.549138 0.7481497 64.82995 NA
## 4 Argentina 6.427221 NA 0.8828191 67.44399 0.8477022
## 5 Armenia 4.325472 8.989569 0.7092183 65.40947 0.6109869
## 6 Australia 7.250080 10.696281 0.9423342 72.52163 0.9223157
## Generosity Corruption GDPpc
## 1 0.05739315 0.7932456 1803.145
## 2 -0.01792729 0.9010708 10746.120
## 3 NA NA 14032.594
## 4 NA 0.8509245 NA
## 5 -0.15581442 0.9214211 8018.998
## 6 0.22308631 0.3985451 44191.221
str(whr.df)
## 'data.frame': 141 obs. of 9 variables:
## $ country : Factor w/ 141 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ LifeLadder: num 4.22 4.51 5.39 6.43 4.33 ...
## $ LnGDPpc : num 7.5 9.28 9.55 NA 8.99 ...
## $ SocSupp : num 0.559 0.638 0.748 0.883 0.709 ...
## $ LifeExp : num 49.9 68.7 64.8 67.4 65.4 ...
## $ LifeChoice: num 0.523 0.73 NA 0.848 0.611 ...
## $ Generosity: num 0.0574 -0.0179 NA NA -0.1558 ...
## $ Corruption: num 0.793 0.901 NA 0.851 0.921 ...
## $ GDPpc : num 1803 10746 14033 NA 8019 ...
whr.df <- whr.df[,-1]
# Replacing missing data with average
whr.df$LifeExp<- ifelse(is.na(whr.df$LifeExp), mean(whr.df$LifeExp, na.rm=TRUE), whr.df$LifeExp)
whr.df$LifeChoice<- ifelse(is.na(whr.df$LifeChoice), mean(whr.df$LifeChoice, na.rm=TRUE), whr.df$LifeChoice)
whr.df$Generosity<- ifelse(is.na(whr.df$Generosity), mean(whr.df$Generosity, na.rm=TRUE), whr.df$Generosity)
whr.df$Corruption<- ifelse(is.na(whr.df$Corruption), mean(whr.df$Corruption, na.rm=TRUE), whr.df$Corruption)
whr.df$GDPpc<- ifelse(is.na(whr.df$GDPpc), mean(whr.df$GDPpc, na.rm=TRUE), whr.df$GDPpc)
# LnGDPpc is filled with the log of the (already imputed) GDPpc rather than with its own mean
whr.df$LnGDPpc<- ifelse(is.na(whr.df$LnGDPpc), log(whr.df$GDPpc), whr.df$LnGDPpc)
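The column-by-column imputation above can also be written as a loop. The following is a minimal sketch on toy data, not the notebook's own code: `impute_mean` and `toy` are hypothetical names, and unlike the notebook it treats every numeric column the same way (the notebook fills LnGDPpc from GDPpc instead).

```r
# Hypothetical helper (not part of the original notebook): replace NAs in
# every numeric column of a data frame with that column's mean.
impute_mean <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      missing <- is.na(df[[col]])
      df[[col]][missing] <- mean(df[[col]], na.rm = TRUE)
    }
  }
  df
}

# Toy data standing in for whr.df
toy <- data.frame(a = c(1, NA, 3), b = c(NA, 10, 20))
toy.filled <- impute_mean(toy)
toy.filled  # the NA in a becomes 2, the NA in b becomes 15
```

Mean imputation keeps each column's mean unchanged but reduces its variance, which is worth keeping in mind when many values are missing.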
# Plot to see the relationships between input variables
ggcorr(whr.df[, c(1:8)], label=TRUE, cex=3)
ggpairs(whr.df, columns= c(1:8), upper = list(continuous = wrap("cor", size = 3)))
#Data Transformation
From the plots, we can see that GDP per capita is strongly positively skewed, while social support, life choice and corruption are strongly negatively skewed. To bring the distributions closer to normal, we use the logarithm of GDP per capita (LnGDPpc) and power transformations (squares and cubes) for social support, life choice and corruption.
Next, we apply these transformations. The original data is heavily skewed, which could affect the accuracy of the results.
whr.df$SocSupp_sq<-whr.df$SocSupp^2
hist(whr.df$SocSupp_sq)
whr.df$SocSupp_3<-whr.df$SocSupp^3
hist(whr.df$SocSupp_3)
whr.df$LifeChoice_sq<-whr.df$LifeChoice^2
hist(whr.df$LifeChoice_sq)
whr.df$Corruption_sq<-whr.df$Corruption^2
hist(whr.df$Corruption_sq)
whr.df$Corruption_3<-whr.df$Corruption^3
hist(whr.df$Corruption_3)
ggpairs(whr.df, columns= c(1,2,4,6,10,11,13), upper = list(continuous = wrap("cor", size = 3)))
As a result, the data looks more normally distributed.
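The visual impression from the histograms can be backed by a number. The sketch below computes the sample skewness before and after a power transformation on simulated data; `skewness` is a hypothetical helper written here (base R has no built-in one), and the Beta-distributed `x` is only a stand-in for a left-skewed variable such as SocSupp.

```r
# Sample skewness: third central moment divided by the cubed standard deviation
# (hypothetical helper; uses the population form of the moments)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

set.seed(1)
# Toy left-skewed variable on (0, 1), standing in for SocSupp
x <- rbeta(500, shape1 = 5, shape2 = 1.5)

# Values closer to 0 indicate a more symmetric distribution
skew <- c(raw = skewness(x), squared = skewness(x^2), cubed = skewness(x^3))
skew
```

Applied to the notebook's data, the same helper could be called on `whr.df$SocSupp`, `whr.df$SocSupp_sq` and `whr.df$SocSupp_3` to decide which power to keep.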
From the correlations between variables, we can observe that, apart from the 0.783 correlation between life expectancy and Ln GDP per capita, there is no significant collinearity among the factors once Life Ladder is excluded. Therefore, we omit Life Ladder and include the remaining 6 factors in our model: Ln GDP per capita, life expectancy, generosity, social support cubed, life choice squared and corruption cubed.
Next, we standardize the data so that every variable has mean 0 and standard deviation 1. The variables are measured on very different scales, and without this step the ones with larger ranges would dominate the distance calculations used in clustering.
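As a quick check of what standardization does, here is a minimal sketch on a toy data frame (`toy` and `toy.norm` are illustrative names, not objects from the notebook). It uses the same `sapply(..., scale)` pattern as the code that follows.

```r
# Toy data frame standing in for whr.df.new
toy <- data.frame(x = c(2, 4, 6, 8), y = c(10, 10, 20, 40))

# scale() centers each column to mean 0 and rescales it to standard deviation 1;
# sapply() collects the columns into a numeric matrix
toy.norm <- sapply(toy, scale)

round(colMeans(toy.norm), 10)  # column means are 0 up to rounding
apply(toy.norm, 2, sd)         # column standard deviations are 1
```

Standardization changes location and scale only; it leaves the shape of each distribution, including any skew, unchanged.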
whr.df.new<-whr.df[,c(1,2,4,6,10,11,13)]
# Normalize input variables
whr.df.norm <- sapply(whr.df.new, scale)
whr.df.norm
## LifeLadder LnGDPpc LifeExp Generosity SocSupp_3
## [1,] -1.03039735 -1.56706372 -1.622661723 0.47297884 -1.83285739
## [2,] -0.77623324 -0.02933493 0.725175776 -0.07477943 -1.43787696
## [3,] -0.01000758 0.20053722 0.242762618 0.00000000 -0.70496288
## [4,] 0.89772756 0.45934963 0.568747307 0.00000000 0.53971620
## [5,] -0.93840232 -0.28151321 0.315031874 -1.07754599 -0.99167937
## [6,] 1.61659329 1.18876314 1.201954597 1.67796116 1.22724540
## [7,] 1.44011489 1.18731063 1.092715836 0.57584020 1.03336958
## [8,] -0.08363252 0.32291661 -0.104543974 -1.60416156 -0.47002416
## [9,] 0.67272866 1.19780440 0.384174750 0.00000000 0.32720279
## [10,] -0.73688529 -1.04735632 -0.104386513 -0.52487060 -1.37635351
## [11,] -0.19370480 0.31731769 0.359339716 -0.91176431 1.03613689
## [12,] 1.35350824 1.13354043 1.077255187 -0.41168620 1.06493693
## [13,] -1.21631324 -1.46003885 -1.510679053 -0.22454491 -2.08733444
## [14,] 0.32332440 -0.44732595 -0.353867632 -0.23453736 -0.30967289
## [15,] -0.19111371 -0.06561825 0.610098122 1.07902301 -0.20494834
## [16,] -1.66047984 0.26564246 -0.958936540 -1.82374746 -0.54429230
## [17,] 0.85194645 0.18018310 0.242884400 -0.74671969 0.87087191
## [18,] -0.49103143 0.38767269 0.666732620 -1.21119952 1.03000897
## [19,] -1.04309454 -1.67144851 -1.468559013 0.17897251 -0.57606675
## [20,] -0.81977568 -1.00763560 -0.542008031 0.62191748 -0.72236205
## [21,] -0.50966439 -1.11675093 -1.813832886 0.08586072 -1.31591944
## [22,] 1.61202055 1.17001024 1.174104498 1.48168203 1.01049993
## [23,] -2.36450865 0.45934963 -2.351942913 0.00000000 -2.52761613
## [24,] -1.19710000 -1.49215029 -2.216647487 0.35646344 -1.55906522
## [25,] 1.03037359 0.59758506 1.115846870 0.70825081 0.11264662
## [26,] -0.06523327 0.21076644 0.771107699 -1.72773524 -0.75456901
## [27,] 0.72867668 0.14634473 0.140019231 -0.72834582 0.52979743
## [28,] -1.11834912 -0.52137879 -1.120399470 -0.79401811 -1.56304009
## [29,] -0.76676784 -2.31167980 -1.570433273 0.10096201 0.34223804
## [30,] 1.51659670 0.25227899 0.845315959 -0.17001126 0.73691402
## [31,] 0.34507333 0.86729869 1.216957612 -0.21155116 -0.39231431
## [32,] 1.16715703 0.86750655 0.815256652 -1.42149763 1.08446245
## [33,] 1.88540832 1.17740920 1.001668312 1.01065234 1.37837805
## [34,] -0.14058949 0.19156333 0.053068066 -0.54911485 0.67044487
## [35,] 0.62534707 -0.08041206 0.533579802 -0.07109489 0.12212001
## [36,] -0.73636124 -0.06023046 -0.179827820 -1.03745613 -0.19123108
## [37,] 0.64665235 -0.25742264 0.152852045 -1.35837755 -0.32980983
## [38,] 0.21844805 0.77867633 0.560487991 -1.08514251 1.17064623
## [39,] -0.96253447 -1.66999517 -0.852263771 0.43584300 -0.92452458
## [40,] 1.97457065 1.07891005 1.059923086 -0.20529951 1.37192582
## [41,] 0.93965035 1.05089900 1.211416514 -0.68177813 0.56250676
## [42,] -0.49609531 0.46855597 -0.844516796 -1.66056962 -0.44666865
## [43,] -0.83102194 -0.13552633 0.371107886 -1.67596398 -1.93904394
## [44,] 1.28783529 1.19555743 0.995136783 1.05153112 0.79720654
## [45,] -0.77334097 -0.86615661 -1.145720208 0.74839865 -1.38692132
## [46,] -0.08474686 0.68811427 1.043393036 -1.89171012 -0.25078758
## [47,] 0.83805494 -0.35528097 -0.089264095 0.12792696 -0.17287121
## [48,] -1.56969495 -1.95484397 -1.545115866 0.02612367 -1.21618873
## [49,] -1.78858437 -1.64238094 -1.206583787 2.21992071 -1.72114259
## [50,] 0.21711960 -0.70965295 0.087647116 0.64402691 -0.49805985
## [51,] 0.08630899 1.36276354 1.635687167 0.64935118 0.02225864
## [52,] 0.04304816 0.69535549 0.551946102 -1.37028392 0.72354808
## [53,] 1.84369464 1.18251600 1.143581847 2.04813594 1.77594913
## [54,] -1.06620817 -0.51963134 -0.440665142 0.31102744 -1.57309127
## [55,] -0.23002467 -0.02493055 -0.300036981 3.58944069 -0.34575187
## [56,] -0.65250232 0.37256341 0.345910358 1.19629560 -1.80120584
## [57,] -0.86234042 0.23291238 -0.253100831 -0.54739469 -0.92281924
## [58,] 1.43370214 1.36525286 1.062588977 1.30488357 1.42520323
## [59,] 1.53703342 0.90549790 1.175094931 1.12239693 0.61423014
## [60,] 0.48476988 0.96095886 1.227635449 -0.57911360 1.04402006
## [61,] -0.74876225 -0.99833473 -2.185324213 -0.13796790 -1.55275621
## [62,] 0.48488069 1.01434128 1.522100396 -0.49424170 0.72649144
## [63,] -0.11212163 -0.06582446 0.177707061 -0.32944375 -0.09253708
## [64,] 0.11700011 0.66351269 0.296245997 -0.33211564 1.05115173
## [65,] -0.87667599 -1.13000924 -1.035128520 2.34908568 -1.01456666
## [66,] 0.31431639 -0.13118028 -0.077050659 0.91584641 -0.05640099
## [67,] 0.47836672 1.52777033 0.291897035 -0.76684085 0.15045439
## [68,] -0.47445594 -1.05495985 -0.036138451 0.65133121 0.89308896
## [69,] 0.47247136 0.63454288 0.329425024 -1.12731301 0.92446664
## [70,] -0.11261152 0.12103744 0.746857790 0.30668120 -0.01778467
## [71,] -1.39029699 -1.25622203 -2.432480525 -0.62613004 -0.29117056
## [72,] -1.78650879 -2.26964927 -1.413419679 0.67887831 -1.41396568
## [73,] 0.02966571 0.26621920 -0.197752658 -0.88598938 0.46728894
## [74,] 0.40704252 0.77511405 0.271449970 -1.93612447 1.17257611
## [75,] 1.36958677 1.84603397 1.248805597 0.08354195 1.21403997
## [76,] -0.04707062 0.14641785 0.351103234 0.61788819 0.41591996
## [77,] -1.51707578 -1.79440628 -0.778495068 -0.36184526 -0.71776221
## [78,] -1.68008737 -1.98366261 -0.947900014 0.52418180 -1.97438680
## [79,] -1.20873867 -1.34343994 -1.657706495 -0.49161223 0.06255836
## [80,] 1.04067005 0.45934963 1.093782383 0.00000000 1.08177897
## [81,] -0.81026196 -0.91936650 -1.185876508 -1.10508535 -0.40610751
## [82,] 0.18378990 0.46151932 0.324108465 0.99572309 0.06040241
## [83,] 1.24451239 0.35167609 0.634336613 -1.08570764 0.65646708
## [84,] 0.15564238 -0.72928245 0.127816736 0.09706734 0.07291777
## [85,] -0.29932515 0.01829577 0.005412162 0.59863195 1.29107113
## [86,] -0.08348297 0.28909714 0.453839959 -0.58998024 0.35872442
## [87,] -0.01163556 -0.35223545 0.154091751 -1.75893968 -1.33923105
## [88,] -0.67837118 0.45934963 -0.705092189 0.00000000 -0.33153889
## [89,] -0.26216130 -1.35761265 -0.227822431 1.39433385 0.07021773
## [90,] 1.87063951 1.24119995 1.045526130 1.69974885 1.02891686
## [91,] 1.59528385 0.99178457 1.071206030 1.91849480 1.15710090
## [92,] 0.53562820 -0.68066671 0.401758027 0.36803341 0.22521531
## [93,] -1.01774974 -2.15731754 -1.213557254 0.10577985 -1.16898045
## [94,] -0.15730245 -0.60065757 -2.171449112 0.29240813 -0.23143153
## [95,] 0.37347412 0.45934963 0.000000000 0.00000000 -0.20508260
## [96,] 1.91908544 1.50947914 0.985337109 0.82311224 1.44559059
## [97,] 0.13006638 -0.71311582 -0.684033833 0.65076423 -1.50165320
## [98,] -0.43070146 -0.73616465 0.008946228 -0.90725950 -0.11274885
## [99,] 0.62726956 0.58041366 0.616137765 -0.66014394 0.53583929
## [100,] 0.35098035 -0.20247433 0.065838562 -0.36887991 1.19694411
## [101,] 0.26296194 0.06039408 0.297325863 -0.99173131 -0.24854858
## [102,] 0.02726291 -0.36683455 -0.410113473 -0.49855075 -0.07989352
## [103,] 0.66609179 0.71816114 0.595952341 -0.67853330 0.92825638
## [104,] 0.04106985 0.76864428 0.981806373 -1.63508109 0.78136406
## [105,] 0.49730335 0.53955262 0.506950380 -0.80448872 -0.19113474
## [106,] 0.39777624 0.64261470 -0.084270544 -1.34580744 0.85326311
## [107,] -1.80545418 -1.60329569 -0.971757937 0.30982972 -1.28046059
## [108,] 0.93852560 1.29781439 0.120665354 -1.09109921 0.61721150
## [109,] -0.70334439 -1.33234602 -0.622969978 -0.44020725 0.08921446
## [110,] 0.30850018 0.14476734 0.397089387 -0.43052224 0.67201599
## [111,] -0.58241868 -1.69116469 -2.353499918 0.85729909 -1.33138744
## [112,] 0.55374793 1.70885010 1.686692086 0.96337847 1.01922424
## [113,] 0.51852574 0.80201600 0.658442857 -0.49077137 1.26238154
## [114,] 0.46930456 0.83648778 0.987297045 -0.39153504 1.13142513
## [115,] -0.63921443 0.45934963 -1.910543764 0.00000000 -1.66977810
## [116,] -0.55028122 0.08528649 -1.581808505 -0.53145350 0.46009913
## [117,] 0.49878303 0.99206308 1.492828868 0.16263462 -0.17352822
## [118,] -2.19410807 -1.59807838 -1.681356056 0.20255994 -1.94400529
## [119,] 0.80284441 0.95747545 1.381292417 -0.37322239 1.21989376
## [120,] 1.72026074 1.23013273 1.132097201 1.01186855 0.86631978
## [121,] 1.79869045 1.37934506 1.256260007 0.62326302 1.04897500
## [122,] 0.97253515 0.45934963 1.009705947 0.00000000 0.67306322
## [123,] -0.25850836 -1.23142615 -0.140194803 0.16396372 0.26527187
## [124,] -2.18133403 -1.24826646 -0.799298442 1.30809383 -1.44157797
## [125,] 0.58883183 0.29593169 0.363677261 2.57120141 0.81447346
## [126,] -1.32881728 -1.77616685 -1.316057551 0.05137243 -2.02943511
## [127,] -0.76718900 -0.02416979 0.209324801 -1.24553608 -1.04273033
## [128,] -0.06412726 0.48025661 0.338447384 0.00000000 0.50928888
## [129,] 0.42582461 0.31945620 -0.538856437 -0.10874718 1.06575297
## [130,] -1.01895947 -1.59459661 -1.424297792 1.11111995 -0.66282190
## [131,] -1.19767654 -0.33328045 0.041196190 0.28316569 0.56292424
## [132,] 1.25043316 1.54610587 0.701728869 0.82205788 0.19184570
## [133,] 1.24460904 1.08112415 1.029528877 1.79996526 1.37353786
## [134,] 1.22653929 1.34677764 0.903126417 0.98684568 0.69267314
## [135,] 0.67431164 0.50703532 0.669586893 -0.56653763 0.73331125
## [136,] 0.43061854 -0.53781983 -0.281918529 1.54637787 1.26142836
## [137,] -1.18682228 0.18754673 0.232700351 -1.44356295 0.75097286
## [138,] -0.29472326 -0.53775575 0.381458433 -0.58749907 0.47003265
## [139,] -1.37507328 0.45934963 -1.077530183 0.00000000 -0.48560260
## [140,] -0.91911992 -0.96287790 -1.246294345 0.89047959 -0.55455764
## [141,] -1.45390061 -1.63165631 -1.391128753 -0.33035335 -0.54328856
## LifeChoice_sq Corruption_3
## [1,] -1.7753148416 0.095099243
## [2,] -0.3633544510 1.132109681
## [3,] -0.0878906972 -0.280463287
## [4,] 0.6482521868 0.616986539
## [5,] -1.2300840159 1.358265328
## [6,] 1.3666739356 -1.849142112
## [7,] 1.0337071401 -1.491026094
## [8,] -0.4986707865 -1.134984861
## [9,] 1.0354191493 -0.280463287
## [10,] 0.9012141074 -0.679711939
## [11,] -0.9039118969 -0.825249965
## [12,] 0.8165582001 -1.585023817
## [13,] 0.0470521080 0.490968525
## [14,] 0.9685584570 0.633185798
## [15,] -1.0779944794 1.782157753
## [16,] 0.6851621296 -0.402055711
## [17,] 0.2781195316 -0.005679438
## [18,] -0.5932622256 1.526411195
## [19,] -0.9999296777 -0.462734586
## [20,] 1.7298128713 0.516418192
## [21,] -0.4991836782 0.902778250
## [22,] 1.2679472758 -1.876788095
## [23,] -1.1422772732 0.696705665
## [24,] -1.7601766666 0.326183087
## [25,] -0.9462513971 0.687352254
## [26,] -0.0878906972 -0.280463287
## [27,] 0.5316736253 1.094044216
## [28,] 0.0991030604 0.071103353
## [29,] -1.0509438921 0.856940794
## [30,] 0.8847808923 -0.010009792
## [31,] -0.1499258813 1.094967739
## [32,] 0.6725090578 1.125160063
## [33,] 1.6303686649 -2.090288580
## [34,] 0.8823164468 -0.344423425
## [35,] 0.6356650628 -0.062392415
## [36,] -0.9209505679 0.305899618
## [37,] 0.2193563423 0.129518768
## [38,] 0.6029027008 -0.967136074
## [39,] -0.2471694014 -0.582465876
## [40,] 1.6318297829 -2.062120605
## [41,] 0.1065795941 -1.054435916
## [42,] -0.6033334030 0.297288549
## [43,] -1.2600092997 -1.344241083
## [44,] 0.8614790076 -1.735986353
## [45,] -0.1913570253 1.055401851
## [46,] -1.9990036017 1.103940268
## [47,] 0.7875679778 0.257058466
## [48,] -0.3960842341 0.176367678
## [49,] -2.7595880791 0.498558680
## [50,] 0.6699045838 0.091981708
## [51,] 0.2184549299 -1.839973287
## [52,] -1.5915191971 1.389773572
## [53,] 1.6652897637 -0.471355344
## [54,] 0.3975515125 -0.136563002
## [55,] 0.4861678260 1.009869229
## [56,] -0.0878906972 -0.280463287
## [57,] -0.8467709999 0.142768958
## [58,] 0.9001545974 -1.849143821
## [59,] -0.0162534632 0.187385233
## [60,] -1.1444146885 1.150948622
## [61,] -0.0456615999 -0.192911065
## [62,] 0.5416592299 -0.616863254
## [63,] -0.0242030658 -0.280463287
## [64,] 0.0726407303 -0.588172707
## [65,] -0.2130579930 0.404554104
## [66,] 0.4632407948 1.584277327
## [67,] 0.5863838707 -0.280463287
## [68,] 0.3430676342 1.307404297
## [69,] -0.7060661383 0.782160757
## [70,] -0.9101478347 0.638262544
## [71,] -0.3659667871 -0.302718340
## [72,] -0.0899495595 1.134247415
## [73,] 0.4182446661 -0.280463287
## [74,] -1.2084068328 1.685831495
## [75,] 0.9744757293 -1.929698624
## [76,] -0.5480169515 0.803160342
## [77,] -1.4955969531 0.747357069
## [78,] 0.3072478592 0.360756191
## [79,] -0.6256049203 0.728962461
## [80,] 1.3037514363 -0.624306996
## [81,] -2.0766580481 0.529846947
## [82,] 0.3895874270 1.020303642
## [83,] -0.1877215766 0.226737965
## [84,] -1.5708576991 1.933336190
## [85,] -0.1208983910 1.125392247
## [86,] -1.5018616098 0.598057907
## [87,] 0.3662783225 -0.484775513
## [88,] 0.9278138464 -1.132441469
## [89,] 0.5728662408 0.302211698
## [90,] 1.2173246758 -1.768623014
## [91,] 1.4095224817 -2.035415002
## [92,] -0.4678769732 -0.385687644
## [93,] -0.5805867340 0.278867093
## [94,] 0.2006157183 1.171778409
## [95,] 0.1879853623 -0.788701820
## [96,] 1.6937270041 -1.824836358
## [97,] -1.0729669266 0.089079508
## [98,] -1.2520768449 0.260895991
## [99,] 0.9947980376 0.484034433
## [100,] 0.7022246437 -0.203159564
## [101,] 0.4852852391 0.764866711
## [102,] 1.2201457665 0.084309657
## [103,] 0.8633020042 0.586377807
## [104,] 0.5599158441 1.367036866
## [105,] 0.4123040007 1.681633453
## [106,] -0.4906556261 1.404389520
## [107,] 1.2512109582 -2.113741441
## [108,] 0.0003236584 -0.280463287
## [109,] -0.2518484824 0.104444175
## [110,] -1.2075280439 1.010799353
## [111,] -0.7365224736 0.738308954
## [112,] 1.1821108388 -2.131066064
## [113,] -0.5945344770 1.303876559
## [114,] 1.1802966082 0.498099925
## [115,] 1.3167084686 -1.749456522
## [116,] -0.0007823621 0.264381321
## [117,] -1.3610531632 0.723886832
## [118,] -2.2080342795 0.029004507
## [119,] -0.0508084825 0.315134401
## [120,] 1.3238329348 -2.064980843
## [121,] 1.4841243683 -2.009200681
## [122,] -0.4492071383 0.243766879
## [123,] -0.5721829267 -1.006035512
## [124,] 0.0105849970 -0.329368309
## [125,] 1.3850550306 0.887565656
## [126,] -0.3596385581 0.283755786
## [127,] -1.2070766264 0.245742489
## [128,] -1.0036789661 -0.144500379
## [129,] -0.2130895489 -0.280463287
## [130,] -0.2867015166 0.248594895
## [131,] -1.8869783609 1.024698442
## [132,] 1.6395436572 -0.280463287
## [133,] 0.4075811215 -1.702087132
## [134,] -0.1361537133 -0.331763936
## [135,] 1.0130253263 -0.752181229
## [136,] 2.0042321151 -0.280463287
## [137,] -2.1217019031 1.014609520
## [138,] 1.0903184421 0.145962311
## [139,] -1.7156114998 -0.280463287
## [140,] 0.3221578235 -0.089860065
## [141,] -0.3382686322 -0.441315811
# Add row names:
#row.names(whr.df.norm) <- row.names(whr.df)
row.names(whr.df.norm) <- whr.df.names[,1]
We first exclude the variable LifeLadder when clustering.
We start by looking for the best number of clusters.
set.seed(123)
# Initialize total within sum of squares error: wss
wss <- 0
# For 1 to 15 cluster centers
for (i in 1:15) {
km.out <- kmeans(whr.df.norm[,-1], centers = i, nstart=20)
# Save total within sum of squares to wss variable
wss[i] <- km.out$tot.withinss
}
# Plot total within sum of squares vs. number of clusters
plot(1:15, wss, type = "b",
xlab = "Number of Clusters",
ylab = "Within groups sum of squares")
From the plot, the elbow suggests that the best number of clusters is 3. This is consistent with our judgment, as we believe “happiness” depends a lot on wealth. An indicator of wealth is GDPpc, which can be separated into three categories: high, medium and low. Therefore, having three clusters makes intuitive sense to our group.
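To see what a clear elbow looks like, the sketch below runs the same within-cluster sum of squares calculation on synthetic data with three known groups. The data, `blob.means` and `toy` are assumptions made for illustration only.

```r
set.seed(42)
# Synthetic data (an assumption for illustration): three well-separated blobs
blob.means <- rbind(c(0, 0), c(10, 10), c(-10, 10))
toy <- do.call(rbind, lapply(1:3, function(j) {
  cbind(rnorm(30, blob.means[j, 1]), rnorm(30, blob.means[j, 2]))
}))

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 20)$tot.withinss)

# Share of the total sum of squares still unexplained at each k;
# the elbow is the k after which this curve flattens
round(wss / wss[1], 3)
```

On real data such as the happiness indicators, the bend is usually less sharp than in this toy case, so the choice of k remains partly a judgment call.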
Next, we set k to 3 and start running the model.
# Set k equal to the number of clusters corresponding to the elbow location
k <- 3
# Build model with k=3 clusters
km.out <- kmeans(whr.df.norm[,-1], centers = k, nstart = 25, iter.max = 50)
km.out$cluster
## Afghanistan Albania
## 2 1
## Algeria Argentina
## 1 1
## Armenia Australia
## 1 3
## Austria Azerbaijan
## 3 1
## Bahrain Bangladesh
## 3 2
## Belarus Belgium
## 1 3
## Benin Bolivia
## 2 1
## Bosnia and Herzegovina Botswana
## 1 1
## Brazil Bulgaria
## 1 1
## Burkina Faso Cambodia
## 2 2
## Cameroon Canada
## 2 3
## Central African Republic Chad
## 2 2
## Chile China
## 1 1
## Colombia Congo (Brazzaville)
## 1 2
## Congo (Kinshasa) Costa Rica
## 2 1
## Cyprus Czech Republic
## 1 1
## Denmark Dominican Republic
## 3 1
## Ecuador Egypt
## 1 1
## El Salvador Estonia
## 1 1
## Ethiopia Finland
## 2 3
## France Gabon
## 3 1
## Georgia Germany
## 1 3
## Ghana Greece
## 2 1
## Guatemala Guinea
## 1 2
## Haiti Honduras
## 2 2
## Hong Kong Hungary
## 3 1
## Iceland India
## 3 2
## Indonesia Iran
## 2 2
## Iraq Ireland
## 1 3
## Israel Italy
## 3 1
## Ivory Coast Japan
## 2 3
## Jordan Kazakhstan
## 1 1
## Kenya Kosovo
## 2 1
## Kuwait Kyrgyzstan
## 1 1
## Latvia Lebanon
## 1 1
## Lesotho Liberia
## 2 2
## Libya Lithuania
## 1 1
## Luxembourg Macedonia
## 3 1
## Madagascar Malawi
## 2 2
## Mali Malta
## 2 3
## Mauritania Mauritius
## 2 1
## Mexico Moldova
## 1 1
## Mongolia Montenegro
## 1 1
## Morocco Myanmar
## 1 1
## Nepal Netherlands
## 2 3
## New Zealand Nicaragua
## 3 1
## Niger Nigeria
## 2 2
## Turkish Republic of Northern Cyprus Norway
## 1 3
## Pakistan Palestine
## 2 1
## Panama Paraguay
## 1 1
## Peru Philippines
## 1 1
## Poland Portugal
## 1 1
## Romania Russia
## 1 1
## Rwanda Saudi Arabia
## 2 1
## Senegal Serbia
## 2 1
## Sierra Leone Singapore
## 2 3
## Slovakia Slovenia
## 1 1
## Somalia South Africa
## 2 1
## South Korea South Sudan
## 1 2
## Spain Sweden
## 1 3
## Switzerland Taiwan
## 3 1
## Tajikistan Tanzania
## 2 2
## Thailand Togo
## 3 2
## Tunisia Turkey
## 1 1
## Turkmenistan Uganda
## 1 2
## Ukraine United Arab Emirates
## 1 3
## United Kingdom United States
## 3 3
## Uruguay Uzbekistan
## 3 3
## Venezuela Vietnam
## 1 1
## Yemen Zambia
## 2 2
## Zimbabwe
## 2
# plot the clusters
plot(whr.df.norm, col = km.out$cluster, main = "k-means with 3 clusters", xlab = "", ylab = "")
#centroid plot
plot(c(0), xaxt = 'n', ylab = "", type = "l", ylim = c(min(km.out$centers), max(km.out$centers)), xlim = c(0,6))
# Label x-axes
axis(1, at = c(1:6), labels = names(whr.df.new[,-1]), cex.axis = 0.7)
# Plot Centroids
for (i in c(1:k))
lines(km.out$centers[i,], lty = i, lwd = 2, col = ifelse(i %in% c(2), "black", "dark gray"))
# Name the clusters
text(x = 0.5, y = km.out$centers[,1], labels = paste("Cluster", c(1:k)))
# Plot the clusters (drop LifeLadder so the plot uses the same variables as the model)
clusplot(whr.df.norm[,-1], km.out$cluster, main = "Cluster Plot with K-means excluding LifeLadder", color = TRUE, shade = TRUE, labels = 2, lines = 0, cex = 0.7)
From the graph, we can see the characteristics of the three clusters. Cluster 3 has high GDP per capita and high scores for life expectancy, generosity and social support, with very low corruption. Cluster 2, on the other hand, has very low GDP per capita, low life expectancy, social support and life choice, but significantly high corruption and surprisingly high generosity. Cluster 1 has medium GDP per capita, social support and life choice, but the highest corruption and lowest generosity among the three clusters.
A handful of countries from each cluster, numbered as in the assignments printed above:
Cluster 1: Greece, Hungary, China, Russia and Mexico
Cluster 2: Rwanda, Kenya, Malawi, Haiti and Ivory Coast
Cluster 3: Luxembourg, Finland, Switzerland, New Zealand and Denmark
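Lists like the one above can be read directly from a fitted model. The following is a minimal sketch on toy data: the row names `Country1` to `Country10` and the object `km.toy` are hypothetical, and only the `cluster` and `centers` components of the `kmeans` result are used.

```r
set.seed(7)
# Toy standardized matrix with hypothetical row names standing in for countries
toy <- rbind(matrix(rnorm(10, mean = -5), ncol = 2),
             matrix(rnorm(10, mean = 5), ncol = 2))
rownames(toy) <- paste0("Country", 1:10)
colnames(toy) <- c("LnGDPpc", "LifeExp")

km.toy <- kmeans(toy, centers = 2, nstart = 25)

# Members of each cluster, by name
members <- split(names(km.toy$cluster), km.toy$cluster)
members

# Centroids: the mean of each variable within each cluster
round(km.toy$centers, 2)
```

Because k-means cluster numbers are arbitrary and can change between runs, it is safer to describe a cluster by its centroid than by its number.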
#Hierarchical clustering model introduction (how to measure the distance between clusters)
# Compare different linkage methods
# Note: dist(whr.df.norm) is computed on all seven columns, including LifeLadder; use whr.df.norm[,-1] to leave it out
# Complete linkage
hc.out.complete <- hclust(dist(whr.df.norm), method = "complete")
plot(hc.out.complete)
# Single linkage
hc.out.single <- hclust(dist(whr.df.norm), method = "single")
plot(hc.out.single)
# Average linkage
hc.out.average <- hclust(dist(whr.df.norm), method = "average")
plot(hc.out.average)
# Centroid linkage
hc.out.centroid <- hclust(dist(whr.df.norm), method = "centroid")
plot(hc.out.centroid)
# "complete" gives the most balanced clustering
Different linkage methods (ways of measuring the distance between clusters) give different results; we choose the most balanced one, which is the “complete” method.
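Balance can also be compared numerically instead of by eye. The sketch below cuts each dendrogram into three groups and tabulates the cluster sizes; it runs on random toy data, so the sizes it prints say nothing about the happiness data itself.

```r
set.seed(3)
# Toy standardized data standing in for whr.df.norm (40 rows, 3 variables)
toy <- scale(matrix(rnorm(120), ncol = 3))
d <- dist(toy)

# Cluster sizes after cutting each dendrogram into 3 groups:
# one column per linkage method, one row per cluster
methods <- c("complete", "single", "average", "centroid")
sizes <- sapply(methods, function(m) {
  as.vector(table(cutree(hclust(d, method = m), k = 3)))
})
sizes
```

A method whose column contains one very large cluster and several tiny ones is chaining observations together, which is typical of single linkage.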
Now, with k = 3, we will start building our hierarchical model
# Cut the dendrogram into k=3 clusters
cut.whr<-cutree(hc.out.complete,k= 3)
# plot heatmap
heatmap(whr.df.norm[,-1], Colv = NA, hclustfun = hclust,
col=rev(paste("gray",1:99,sep="")), cexRow = 0.2, cexCol = 0.9)
# Plot the clusters
clusplot(whr.df.norm, cut.whr, main = "Cluster Plot for Hierarchical model without LifeLadder", color = TRUE, shade = TRUE, labels = 2, lines = 0, cex = 0.7)
From the heatmap we can see that the first cluster is characterized by high GDP per capita, long life expectancy, strong social support and more freedom of life choices, while people are less generous and government corruption is not widespread. The second cluster includes countries with GDP per capita and life expectancy similar to the first cluster, but with less social support, very little freedom of life choices and very severe corruption. Countries in the third cluster generally have lower GDP per capita, shorter life expectancy, little social support and widespread corruption, but their people are very generous and have more freedom of life choices than those in the second cluster.
From the plots, we see a significant difference between the two models.
# comparing cluster results
table(km.out$cluster)
##
## 1 2 3
## 71 42 28
table(cut.whr)
## cut.whr
## 1 2 3
## 25 79 37
As we can see, the hierarchical model provides some conflicting results. In cluster 1 of the hierarchical model, we see Finland, Belgium and Switzerland together with Rwanda and Kenya. This may result from the hierarchical model's sensitivity to outliers.
Therefore, for this case, we choose k-means as our best model.
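The two size tables above show how large the clusters are, but not whether the two models group the same countries together. A cross-tabulation answers that question; the sketch below shows the idea on toy data (`toy`, `km.toy` and `hc.toy` are illustrative names).

```r
set.seed(11)
# Toy data (an assumption for illustration): two separated groups of 20 points
toy <- rbind(matrix(rnorm(40, mean = -6), ncol = 2),
             matrix(rnorm(40, mean = 6), ncol = 2))

km.toy <- kmeans(toy, centers = 2, nstart = 25)$cluster
hc.toy <- cutree(hclust(dist(toy), method = "complete"), k = 2)

# Rows: k-means labels; columns: hierarchical labels.
# Labels are arbitrary, so agreement shows up as one large count per row,
# not necessarily on the diagonal.
agreement <- table(kmeans = km.toy, hclust = hc.toy)
agreement
```

For the notebook's models, the equivalent call would be `table(km.out$cluster, cut.whr)`; rows whose counts are spread over several columns mark the countries on which the two models disagree.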
#Include “LifeLadder” Column
Now we will look at the models when the variable LifeLadder is included.
#Hierarchical Model
# As seen above, the "complete" linkage method gave the most balanced clustering, so we use the same method here
hc.out.complete.1 <- hclust(dist(whr.df.norm), method = "complete")
plot(hc.out.complete.1)
# Cut the dendrogram into k=3 clusters
cut.whr.1 <- cutree(hc.out.complete.1, k = 3)
# Plot heatmap
heatmap(whr.df.norm, Colv = NA, hclustfun = hclust,
col=rev(paste("gray",1:99,sep="")), cexRow = 0.2, cexCol = 0.9)
# Plot the clusters
clusplot(whr.df.norm, cut.whr.1, main = "Cluster Plot for Hierarchical Model including LifeLadder", color = TRUE, shade = TRUE, labels = 2, lines = 0, cex = 0.7)
From the heatmap, we can see the characteristics of the clusters. The cluster with the lowest LifeLadder has low GDP per capita with very high generosity and corruption. The cluster with medium LifeLadder shows many conflicting characteristics within the cluster itself. For example, although countries in this cluster share similar GDP per capita, their generosity and life choice vary. The cluster with the highest LifeLadder has very high GDP per capita as well as high scores for the other factors and very little corruption.
Countries for clusters
Cluster 1: Haiti, Ghana, Ivory Coast, Mali, Central African Republic
Cluster 2: Greece, Russia, China, Spain, Italy
Cluster 3: Luxembourg, Finland, Switzerland, New Zealand, Kenya
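# Build model with k=3 clusters
km.out.1 <- kmeans(whr.df.norm, centers = k, nstart = 25, iter.max = 50)
# Note: this prints the assignments of the earlier model (km.out, fitted without LifeLadder); use km.out.1$cluster to inspect the model that includes LifeLadder
km.out$cluster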
## Afghanistan Albania
## 2 1
## Algeria Argentina
## 1 1
## Armenia Australia
## 1 3
## Austria Azerbaijan
## 3 1
## Bahrain Bangladesh
## 3 2
## Belarus Belgium
## 1 3
## Benin Bolivia
## 2 1
## Bosnia and Herzegovina Botswana
## 1 1
## Brazil Bulgaria
## 1 1
## Burkina Faso Cambodia
## 2 2
## Cameroon Canada
## 2 3
## Central African Republic Chad
## 2 2
## Chile China
## 1 1
## Colombia Congo (Brazzaville)
## 1 2
## Congo (Kinshasa) Costa Rica
## 2 1
## Cyprus Czech Republic
## 1 1
## Denmark Dominican Republic
## 3 1
## Ecuador Egypt
## 1 1
## El Salvador Estonia
## 1 1
## Ethiopia Finland
## 2 3
## France Gabon
## 3 1
## Georgia Germany
## 1 3
## Ghana Greece
## 2 1
## Guatemala Guinea
## 1 2
## Haiti Honduras
## 2 2
## Hong Kong Hungary
## 3 1
## Iceland India
## 3 2
## Indonesia Iran
## 2 2
## Iraq Ireland
## 1 3
## Israel Italy
## 3 1
## Ivory Coast Japan
## 2 3
## Jordan Kazakhstan
## 1 1
## Kenya Kosovo
## 2 1
## Kuwait Kyrgyzstan
## 1 1
## Latvia Lebanon
## 1 1
## Lesotho Liberia
## 2 2
## Libya Lithuania
## 1 1
## Luxembourg Macedonia
## 3 1
## Madagascar Malawi
## 2 2
## Mali Malta
## 2 3
## Mauritania Mauritius
## 2 1
## Mexico Moldova
## 1 1
## Mongolia Montenegro
## 1 1
## Morocco Myanmar
## 1 1
## Nepal Netherlands
## 2 3
## New Zealand Nicaragua
## 3 1
## Niger Nigeria
## 2 2
## Turkish Republic of Northern Cyprus Norway
## 1 3
## Pakistan Palestine
## 2 1
## Panama Paraguay
## 1 1
## Peru Philippines
## 1 1
## Poland Portugal
## 1 1
## Romania Russia
## 1 1
## Rwanda Saudi Arabia
## 2 1
## Senegal Serbia
## 2 1
## Sierra Leone Singapore
## 2 3
## Slovakia Slovenia
## 1 1
## Somalia South Africa
## 2 1
## South Korea South Sudan
## 1 2
## Spain Sweden
## 1 3
## Switzerland Taiwan
## 3 1
## Tajikistan Tanzania
## 2 2
## Thailand Togo
## 3 2
## Tunisia Turkey
## 1 1
## Turkmenistan Uganda
## 1 2
## Ukraine United Arab Emirates
## 1 3
## United Kingdom United States
## 3 3
## Uruguay Uzbekistan
## 3 3
## Venezuela Vietnam
## 1 1
## Yemen Zambia
## 2 2
## Zimbabwe
## 2
# Plot the clusters of the model that includes LifeLadder
plot(whr.df.norm, col = km.out.1$cluster, main = "k-means with 3 clusters", xlab = "", ylab = "")
# Centroid plot
plot(c(0), xaxt = 'n', ylab = "", type = "l", ylim = c(min(km.out.1$centers), max(km.out.1$centers)), xlim = c(0,7))
# Label x-axes
axis(1, at = c(1:7), labels = names(whr.df.new), cex.axis = 0.7)
# Plot Centroids
for (i in c(1:k))
lines(km.out.1$centers[i,], lty = i, lwd = 2, col = ifelse(i %in% c(2), "black", "dark gray"))
# Name the clusters
text(x = 0.5, y = km.out.1$centers[,1], labels = paste("Cluster", c(1:k)))
# Plot the clusters
clusplot(whr.df.norm, km.out.1$cluster, main = "Cluster Plot for K-means including LifeLadder", color = TRUE, shade = TRUE, labels = 2, lines = 0, cex = 0.7)
From the graph, we can see the characteristics of the three clusters. Cluster 2, with the highest LifeLadder score, has high GDP per capita and high scores for life expectancy, generosity and social support, with very low corruption. Cluster 1, with the lowest LifeLadder score, has very low GDP per capita, low life expectancy, social support and life choice, but significantly high corruption and surprisingly high generosity. Cluster 3, with a medium LifeLadder score, has medium GDP per capita, social support and life choice, but the highest corruption and lowest generosity among the three clusters.
Countries for clusters
Cluster 1: Rwanda, Kenya, Malawi, Haiti, Ivory Coast
Cluster 2: Luxembourg, Finland, New Zealand, Austria, Switzerland
Cluster 3: Greece, Hungary, Russia, Italy, Spain
# comparing cluster results on K-means
table(km.out.1$cluster)
##
## 1 2 3
## 41 29 71
# Comparison on Hierarchical Model
table(cut.whr.1)
## cut.whr.1
## 1 2 3
## 25 79 37
As we can see, the hierarchical model provides some conflicting results. In cluster 3 of the hierarchical model, we see Finland, Belgium and Switzerland together with Rwanda and Kenya. This may result from the hierarchical model's sensitivity to outliers.
Therefore, for this case, we choose k-means as our best model.
K-Means
Advantages:
1) On large datasets, K-Means is usually computationally faster than hierarchical clustering, provided k is kept small
2) K-Means often produces tighter clusters than hierarchical clustering
3) Easy to implement
Disadvantages:
1) Strong sensitivity to outliers and noise
2) Doesn’t work well with non-circular cluster shapes
3) The number of clusters and the initial seed value need to be specified beforehand
4) The order of the data has an impact on the final results
5) Selection of the optimal number of clusters is difficult
6) Not recommended if the dataset has many categorical variables
7) Assumes that clusters are spherical, distinct, and approximately equal in size
Hierarchical Model
Advantages:
1) It is easier to decide on the number of clusters by looking at the dendrogram
2) No prior information about the number of clusters is required
3) Dendrograms are great for visualization
4) Only a distance or proximity matrix is required to compute the hierarchical clustering
Disadvantages:
1) Time complexity: not suitable for large datasets
2) Merges are greedy: once two clusters are joined, the decision cannot be undone (unlike K-Means, there are no random initial seeds)
3) The order of the data can affect the final results when distances are tied
4) Very sensitive to outliers