Using data from the 2018 World Happiness Index, we draw two types of decision trees: a regression tree and a classification tree.

First, load essential packages.

library(rpart)
library(rpart.plot)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Load the data and inspect its structure.

happiness=read.csv('worldhappiness2018.csv')
happiness%>% glimpse
## Observations: 156
## Variables: 10
## $ 癤풰ank       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Region        <fct> eur, eur, eur, eur, eur, eur, ame, oce, eur, oce...
## $ country       <fct> Finland, Norway, Denmark, Iceland, Swizerland, N...
## $ score         <dbl> 7.632, 7.594, 7.555, 7.495, 7.487, 7.441, 7.328,...
## $ gdp           <dbl> 1.305, 1.456, 1.351, 1.343, 1.420, 1.361, 1.330,...
## $ socialsupport <dbl> 1.592, 1.582, 1.590, 1.644, 1.549, 1.488, 1.532,...
## $ healthylife   <dbl> 0.874, 0.861, 0.868, 0.914, 0.927, 0.878, 0.896,...
## $ freedom       <dbl> 0.681, 0.686, 0.683, 0.677, 0.660, 0.638, 0.653,...
## $ generosity    <dbl> 0.192, 0.286, 0.284, 0.353, 0.256, 0.333, 0.321,...
## $ corruption    <fct> 0.393, 0.34, 0.408, 0.138, 0.357, 0.295, 0.291, ...
# The first column name was garbled by a byte-order mark in the CSV; rename it.
colnames(happiness)[1]='rank'
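
The garbled first column name in the output above is typical of a CSV file that starts with a UTF-8 byte-order mark. Renaming the column works, but the mark can also be dropped while reading. The sketch below is only an illustration: it builds a small temporary file with made-up contents (it is not the original `worldhappiness2018.csv`) and reads it with `fileEncoding = "UTF-8-BOM"`.

```r
# Build a tiny CSV that starts with a UTF-8 byte-order mark (EF BB BF).
# The file and its two rows are a stand-in for illustration only.
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("Rank,score\n1,7.632\n2,7.594\n")), tmp)

# "UTF-8-BOM" asks R to strip the mark while reading the file.
demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(demo)
```

With this option the first column name should come through clean, so no manual renaming is needed afterwards.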

Keep only the columns needed for the model and shuffle the rows at random.

happiness1=select(happiness,score,gdp,socialsupport,healthylife,freedom)
set.seed(2019)
samp=sample(1:nrow(happiness1),replace=F)
happiness1=happiness1[order(samp),]
head(happiness1,n=8)
##     score   gdp socialsupport healthylife freedom
## 152 3.355 0.442         1.073       0.343   0.244
## 8   7.324 1.268         1.601       0.876   0.669
## 110 4.623 0.720         1.034       0.441   0.626
## 53  5.933 1.148         1.454       0.671   0.363
## 35  6.322 1.161         1.258       0.669   0.356
## 59  5.810 1.151         1.479       0.599   0.399
## 6   7.441 1.361         1.488       0.878   0.638
## 5   7.487 1.420         1.549       0.927   0.660
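
The shuffle above indexes the rows by `order(samp)`. Since `samp` is a random permutation, `order(samp)` is its inverse, which is also a random permutation, so the result is still a valid shuffle. A minimal sketch on made-up data (the data frame `df` is hypothetical) shows both facts:

```r
set.seed(2019)

# Made-up data frame, used only to illustrate the row shuffle.
df <- data.frame(id = 1:6, x = c(0.2, 0.5, 0.1, 0.9, 0.4, 0.7))

perm <- sample(nrow(df))      # a random permutation of 1..6
shuffled <- df[perm, ]        # same rows, new order

# order(perm) is the inverse permutation: it puts the rows back in place.
restored <- shuffled[order(perm), ]
identical(restored$id, df$id)
```

Indexing directly with `sample(nrow(df))` is the shorter way to get the same kind of shuffle.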

Fit a regression tree using the ANOVA method, with score as the response.

anov=rpart(score~.,data=happiness1,method='anova')
anov
## n= 156 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 156 194.2605000 5.375917  
##    2) gdp< 1.066 100  72.5158400 4.782650  
##      4) healthylife< 0.4655 45  14.5064300 4.158222  
##        8) gdp< 0.4825 27   7.1571910 3.913963  
##         16) freedom< 0.3085 8   1.3486560 3.485750 *
##         17) freedom>=0.3085 19   3.7239500 4.094263 *
##        9) gdp>=0.4825 18   3.3220160 4.524611 *
##      5) healthylife>=0.4655 55  26.1076800 5.293545  
##       10) socialsupport< 1.0515 11   1.8257320 4.438818 *
##       11) socialsupport>=1.0515 44  14.2367600 5.507227  
##         22) freedom< 0.406 13   1.8798330 5.106615 *
##         23) freedom>=0.406 31   9.3956330 5.675226 *
##    3) gdp>=1.066 56  23.6971400 6.435321  
##      6) freedom< 0.6355 44  10.9558900 6.203068  
##       12) gdp< 1.231 23   2.6522280 5.925217 *
##       13) gdp>=1.231 21   4.5833050 6.507381  
##         26) healthylife>=0.902 7   0.7818994 6.051714 *
##         27) healthylife< 0.902 14   1.6212680 6.735214 *
##      7) freedom>=0.6355 12   1.6652510 7.286917 *
rpart.plot(anov,type=3,digits=3,fallen.leaves = T)

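  1. Root node: the first split is on whether the gdp score is >= 1.07 (1.066 in the printed output, before rounding).
  2. There are ten terminal nodes. In each box, the first line is the mean score of the countries in that group, and the second line is the percentage of all countries that fall into that group. Across the terminal nodes these percentages sum to 100.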
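
The two numbers in each terminal box can be checked from the fitted object itself. The sketch below uses the built-in `mtcars` data as a stand-in, since the happiness CSV is not bundled with R; the same calls should apply to `anov`. It relies on the `frame` and `where` components of an rpart object.

```r
library(rpart)

# Stand-in regression tree on a built-in data set (not the happiness data).
fit <- rpart(mpg ~ ., data = mtcars, method = "anova")

# Terminal nodes are the rows of the frame marked "<leaf>".
leaves <- fit$frame[fit$frame$var == "<leaf>", ]
leaves$share <- 100 * leaves$n / nrow(mtcars)   # percent of observations per leaf

# fit$where gives, for each observation, the frame row of its leaf,
# so the leaf value (yval) can be compared with the group mean.
leaf_means <- tapply(mtcars$mpg, fit$where, mean)
all.equal(as.numeric(leaf_means),
          fit$frame$yval[as.integer(names(leaf_means))])

sum(leaves$share)   # the shares add up to 100
```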

Make another decision tree, this time using the classification method.

The Oceanian countries are dropped first: there are only 2 of them, and having too many categories is undesirable for classification.

happiness2=select(happiness,Region,gdp,socialsupport,healthylife,freedom)
oce=which(happiness2$Region=='oce')
happiness2=happiness2[-oce,]
set.seed(2019)
samp=sample(1:nrow(happiness2),replace=F)
happiness2=happiness2[order(samp),]
class=rpart(Region~.,data=happiness2,method='class')
rpart.plot(class,type=3,digits=3,fallen.leaves=T)
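
One detail when removing a category as above: subsetting a factor keeps the removed level as an empty level. The sketch below uses a made-up `region` vector to show this and how `droplevels()` removes it. Whether the empty level changes the plot is not something the output above shows; this only keeps later tables and summaries tidy.

```r
# Made-up region labels, used only to illustrate empty factor levels.
region <- factor(c("eur", "eur", "ame", "oce", "asi", "afr", "oce"))

kept <- region[region != "oce"]
levels(kept)    # "oce" is still listed as a level
table(kept)     # ... with a count of 0

kept <- droplevels(kept)
levels(kept)    # the empty level is gone
```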

  1. Root node: the first split is on whether the healthylife score is >= 0.421.
  2. Each box shows four numbers, one per region (Africa, America, Asia, Europe). The first red box means that 94.9% of the countries classified as ‘Africa’ are actually African countries.
  3. Because the countries within each continent differ widely, America, Europe and Asia are each split into several subgroups.
  4. For the same reason, the classification contains some errors.
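
The classification errors mentioned above can be counted with a confusion matrix. The sketch below uses the built-in `iris` data as a stand-in, since the happiness CSV is not bundled with R; replacing `iris` and `Species` with `happiness2` and `Region` should give the corresponding table for the tree above.

```r
library(rpart)

# Stand-in classification tree on a built-in data set (not the happiness data).
fit <- rpart(Species ~ ., data = iris, method = "class")

# Predicted class for each training row, cross-tabulated with the true class.
pred <- predict(fit, type = "class")
conf <- table(actual = iris$Species, predicted = pred)
conf

# Share of each predicted class that is correct (the kind of figure
# quoted for the 'Africa' box), and the overall training accuracy.
precision <- diag(conf) / colSums(conf)
accuracy <- sum(diag(conf)) / sum(conf)
```

These figures are computed on the training data, so they describe how well the tree fits the data rather than how it would perform on new countries.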