library(ggplot2)
library(ggfortify)
library(knitr)
The two datasets shared on Kaggle that I’m going to use are:
**Human Development UN Human Development Index Index](https://www.kaggle.com/robbies/humandevelopmentindex)**
Sustainable Development Solutions Network’s World Happiness Report from 2015
The World Happiness Report is a landmark survey of the state of global happiness. The World Happiness Report 2016 Update, which ranks 156 countries by their happiness levels, was released in Rome in advance of UN World Happiness Day, March 20th. The reports review the state of happiness in the world today and show how the new science of happiness explains personal and national variations in happiness. They reflect a new worldwide demand for more attention to happiness as a criteria for government policy.
So, the code below reads in the data sources and joins them together by country name. In here, we use dplyr and join the two datasets for comparison.
library(readr)
library(dplyr)
Read in data files from X2015 and human_development datasets
human_development <- read.csv("~/Documents/Rstudio/datasets/human_development.csv", stringsAsFactors = F)
X2015 <- read.csv("~/Documents/Rstudio/datasets/2015.csv", stringsAsFactors = F)
view data
glimpse(human_development)
## Observations: 195
## Variables: 8
## $ HDI.Rank <int> 1, 2, 3, 4, 5, 6, 6, 8,...
## $ Country <chr> "Norway", "Australia", ...
## $ Human.Development.Index..HDI. <dbl> 0.944, 0.935, 0.930, 0....
## $ Life.Expectancy.at.Birth <dbl> 81.6, 82.4, 83.0, 80.2,...
## $ Expected.Years.of.Education <dbl> 17.5, 20.2, 15.8, 18.7,...
## $ Mean.Years.of.Education <dbl> 12.6, 13.0, 12.8, 12.7,...
## $ Gross.National.Income..GNI..per.Capita <chr> "64,992", "42,261", "56...
## $ GNI.per.Capita.Rank.Minus.HDI.Rank <int> 5, 17, 6, 11, 9, 11, 16...
glimpse(X2015)
## Observations: 158
## Variables: 12
## $ Country <chr> "Switzerland", "Iceland", "Denma...
## $ Region <chr> "Western Europe", "Western Europ...
## $ Happiness.Rank <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1...
## $ Happiness.Score <dbl> 7.587, 7.561, 7.527, 7.522, 7.42...
## $ Standard.Error <dbl> 0.03411, 0.04884, 0.03328, 0.038...
## $ Economy..GDP.per.Capita. <dbl> 1.39651, 1.30232, 1.32548, 1.459...
## $ Family <dbl> 1.34951, 1.40223, 1.36058, 1.330...
## $ Health..Life.Expectancy. <dbl> 0.94143, 0.94784, 0.87464, 0.885...
## $ Freedom <dbl> 0.66557, 0.62877, 0.64938, 0.669...
## $ Trust..Government.Corruption. <dbl> 0.41978, 0.14145, 0.48357, 0.365...
## $ Generosity <dbl> 0.29678, 0.43630, 0.34139, 0.346...
## $ Dystopia.Residual <dbl> 2.51738, 2.70201, 2.49204, 2.465...
create HDI vs Hap
HDIvsHap <- human_development %>%
left_join(X2015 , by = "Country") %>%
mutate(Country = factor(Country)) %>%
select(Country,HDI.Rank,Happiness.Rank, Happiness.Score,Human.Development.Index..HDI., Mean.Years.of.Education,Dystopia.Residual)
view the summary of our newly joined dataset
summary(HDIvsHap)
## Country HDI.Rank Happiness.Rank
## Afghanistan : 1 Min. : 1.00 Min. : 1.00
## Albania : 1 1st Qu.: 47.75 1st Qu.: 36.25
## Algeria : 1 Median : 94.00 Median : 79.50
## Andorra : 1 Mean : 94.31 Mean : 78.19
## Angola : 1 3rd Qu.:141.25 3rd Qu.:118.75
## Antigua and Barbuda: 1 Max. :188.00 Max. :158.00
## (Other) :189 NA's :7 NA's :57
## Happiness.Score Human.Development.Index..HDI. Mean.Years.of.Education
## Min. :2.839 Min. :0.3480 Min. : 1.400
## 1st Qu.:4.526 1st Qu.:0.5770 1st Qu.: 5.550
## Median :5.232 Median :0.7210 Median : 8.400
## Mean :5.416 Mean :0.6918 Mean : 8.079
## 3rd Qu.:6.322 3rd Qu.:0.8000 3rd Qu.:10.600
## Max. :7.587 Max. :0.9440 Max. :13.100
## NA's :57
## Dystopia.Residual
## Min. :0.6704
## 1st Qu.:1.8090
## Median :2.0954
## Mean :2.1122
## 3rd Qu.:2.4624
## Max. :3.6021
## NA's :57
We observed some NA’s so we have to remove NA’s
HDIvsHap <-na.omit(HDIvsHap)
again, we check our new dataset
summary(HDIvsHap)
## Country HDI.Rank Happiness.Rank Happiness.Score
## Afghanistan: 1 Min. : 1.00 Min. : 1.00 Min. :2.839
## Albania : 1 1st Qu.: 40.25 1st Qu.: 36.25 1st Qu.:4.526
## Algeria : 1 Median : 85.00 Median : 79.50 Median :5.232
## Angola : 1 Mean : 90.50 Mean : 78.19 Mean :5.416
## Argentina : 1 3rd Qu.:144.50 3rd Qu.:118.75 3rd Qu.:6.322
## Armenia : 1 Max. :188.00 Max. :158.00 Max. :7.587
## (Other) :132
## Human.Development.Index..HDI. Mean.Years.of.Education Dystopia.Residual
## Min. :0.3480 Min. : 1.400 Min. :0.6704
## 1st Qu.:0.5497 1st Qu.: 5.675 1st Qu.:1.8090
## Median :0.7330 Median : 8.500 Median :2.0954
## Mean :0.7007 Mean : 8.177 Mean :2.1122
## 3rd Qu.:0.8357 3rd Qu.:10.900 3rd Qu.:2.4624
## Max. :0.9440 Max. :13.100 Max. :3.6021
##
Now that we already have our dataset, we can take a closer look at it. We can use formattable.
library(formattable)
HDIvsHap %>%
head(39) %>%
formattable((list(
Country = color_bar("yellow"),
HDI.Rank = color_bar("lightgreen"),
Happiness.Rank = color_bar ("light pink"),
Happiness.Score = color_bar("deepskyblue"),
Human.Development.Index..HDI = color_bar("yellow"),
Mean.Years.of.Education = color_bar("deepskyblue"),
Dystopia.Residual = color_bar("deepskyblue"))))
| Country | HDI.Rank | Happiness.Rank | Happiness.Score | Human.Development.Index..HDI. | Mean.Years.of.Education | Dystopia.Residual | |
|---|---|---|---|---|---|---|---|
| 1 | Norway | 1 | 4 | 7.522 | 0.944 | 12.6 | 2.46531 |
| 2 | Australia | 2 | 10 | 7.284 | 0.935 | 13.0 | 2.26646 |
| 3 | Switzerland | 3 | 1 | 7.587 | 0.930 | 12.8 | 2.51738 |
| 4 | Denmark | 4 | 3 | 7.527 | 0.923 | 12.7 | 2.49204 |
| 5 | Netherlands | 5 | 7 | 7.378 | 0.922 | 11.9 | 2.46570 |
| 6 | Germany | 6 | 26 | 6.750 | 0.916 | 13.1 | 2.11569 |
| 7 | Ireland | 6 | 18 | 6.940 | 0.916 | 12.2 | 1.97570 |
| 8 | United States | 8 | 15 | 7.119 | 0.915 | 12.9 | 2.51011 |
| 9 | Canada | 9 | 5 | 7.427 | 0.913 | 13.0 | 2.45176 |
| 10 | New Zealand | 9 | 9 | 7.286 | 0.913 | 12.5 | 2.26425 |
| 11 | Singapore | 11 | 24 | 6.798 | 0.912 | 10.6 | 1.88501 |
| 14 | Sweden | 14 | 8 | 7.364 | 0.907 | 12.1 | 2.37119 |
| 15 | United Kingdom | 14 | 21 | 6.867 | 0.907 | 13.1 | 1.96994 |
| 16 | Iceland | 16 | 2 | 7.561 | 0.899 | 10.6 | 2.70201 |
| 18 | Israel | 18 | 11 | 7.278 | 0.894 | 12.5 | 3.08854 |
| 19 | Luxembourg | 19 | 17 | 6.946 | 0.892 | 11.7 | 1.96961 |
| 20 | Japan | 20 | 46 | 5.987 | 0.891 | 11.5 | 1.68435 |
| 21 | Belgium | 21 | 19 | 6.937 | 0.890 | 11.3 | 2.41484 |
| 22 | France | 22 | 29 | 6.575 | 0.888 | 11.1 | 2.21126 |
| 23 | Austria | 23 | 13 | 7.200 | 0.885 | 10.8 | 2.53320 |
| 24 | Finland | 24 | 6 | 7.406 | 0.883 | 10.3 | 2.61955 |
| 25 | Slovenia | 25 | 55 | 5.848 | 0.880 | 11.9 | 1.61583 |
| 26 | Spain | 26 | 36 | 6.329 | 0.876 | 9.6 | 2.12367 |
| 27 | Italy | 27 | 50 | 5.948 | 0.873 | 10.1 | 2.02518 |
| 28 | Czech Republic | 28 | 31 | 6.505 | 0.870 | 12.3 | 2.67782 |
| 29 | Greece | 29 | 102 | 4.857 | 0.865 | 10.3 | 1.80101 |
| 30 | Estonia | 30 | 73 | 5.429 | 0.861 | 12.5 | 1.58782 |
| 32 | Cyprus | 32 | 67 | 5.689 | 0.850 | 11.6 | 1.88931 |
| 33 | Qatar | 32 | 28 | 6.611 | 0.850 | 9.1 | 1.55674 |
| 35 | Slovakia | 35 | 45 | 5.995 | 0.844 | 12.2 | 2.24639 |
| 36 | Poland | 36 | 60 | 5.791 | 0.843 | 11.8 | 1.86565 |
| 37 | Lithuania | 37 | 56 | 5.833 | 0.839 | 12.4 | 2.44649 |
| 38 | Malta | 37 | 37 | 6.302 | 0.839 | 10.3 | 1.64880 |
| 39 | Saudi Arabia | 39 | 35 | 6.411 | 0.837 | 8.7 | 2.43872 |
| 40 | Argentina | 40 | 30 | 6.574 | 0.836 | 9.8 | 2.83600 |
| 41 | United Arab Emirates | 41 | 20 | 6.901 | 0.835 | 9.5 | 2.24743 |
| 42 | Chile | 42 | 27 | 6.670 | 0.832 | 9.8 | 2.67585 |
| 43 | Portugal | 43 | 88 | 5.102 | 0.830 | 8.2 | 1.26462 |
| 44 | Hungary | 44 | 104 | 4.800 | 0.828 | 11.6 | 1.24074 |
Well, looks like our table contains several surprises. We could see Australia as the 4th HDI rank but she only ranks 10th in Happiness ranking. Check also Germany, were she ranks 6th in HDI but ranks 26th in happiness. Iceland, which ranks 2nd in happiness, drops down to 16th in HDI. Even Japan, which ranks 20th in HDI but almost double down on happiness ranking at 46th. Over-all , the table tells us that having high HDI does not translate to happy people in a country. From above observation, I have to admire countries like Norway, Australia, Switzerland, Denmark, the Netherlands, New Zealand and Canada for having well-rounded policies in each of their individual countries which results in a better work-life balance for their citizens.
**Will happiness increase or decrease if people both have longer lives and spend long years at school? In here we will try to answer our question and run some simple linear regression models
my.data <- human_development %>%
left_join(X2015 , by = "Country") %>%
mutate(Country = factor(Country)) %>%
select(Happiness.Score,Life.Expectancy.at.Birth, Expected.Years.of.Education, Dystopia.Residual)
its very important to always check our joined dataset. I dont know..I guess this has become a habit of mine.
glimpse(my.data)
## Observations: 195
## Variables: 4
## $ Happiness.Score <dbl> 7.522, 7.284, 7.587, 7.527, 7.378,...
## $ Life.Expectancy.at.Birth <dbl> 81.6, 82.4, 83.0, 80.2, 81.6, 80.9...
## $ Expected.Years.of.Education <dbl> 17.5, 20.2, 15.8, 18.7, 17.9, 16.5...
## $ Dystopia.Residual <dbl> 2.46531, 2.26646, 2.51738, 2.49204...
we will change the column names so it will look better before we plot our new dataset to corrplot
colnames(my.data) <- c("happiness", "lifeSpan", "education","dystopiaResidual")
again, check if our columns are in order
summary(my.data)
## happiness lifeSpan education dystopiaResidual
## Min. :2.839 Min. :49.00 Min. : 4.10 Min. :0.6704
## 1st Qu.:4.526 1st Qu.:65.75 1st Qu.:11.10 1st Qu.:1.8090
## Median :5.232 Median :73.10 Median :13.10 Median :2.0954
## Mean :5.416 Mean :71.07 Mean :12.86 Mean :2.1122
## 3rd Qu.:6.322 3rd Qu.:76.80 3rd Qu.:14.90 3rd Qu.:2.4624
## Max. :7.587 Max. :84.00 Max. :20.20 Max. :3.6021
## NA's :57 NA's :57
remove NA’s
Happiness <-na.omit(my.data)
summary
summary(Happiness)
## happiness lifeSpan education dystopiaResidual
## Min. :2.839 Min. :49.00 Min. : 5.40 Min. :0.6704
## 1st Qu.:4.526 1st Qu.:64.65 1st Qu.:11.10 1st Qu.:1.8090
## Median :5.232 Median :74.10 Median :13.50 Median :2.0954
## Mean :5.416 Mean :71.36 Mean :13.10 Mean :2.1122
## 3rd Qu.:6.322 3rd Qu.:77.55 3rd Qu.:15.28 3rd Qu.:2.4624
## Max. :7.587 Max. :83.50 Max. :20.20 Max. :3.6021
library(corrplot)
## corrplot 0.82 loaded
#plot matrix of all variables
plot(Happiness, pch=16 , col ="blue", main = "Matrix Scatterplot of Happiness , LifeSpan ,
Education and Dystopia Residual")
The matrix plot above allows us to vizualise the relationship among all variables in one single image.For example, we can see how happiness and LifeSpan are related (see first column, second row top to bottom graph).
Another interesting example is the relationship between LifeSpan and education (third column left to right second row top to bottom graph). Here we can see that as LifeSpan increases, average years spent in schooling also increases! Whew! talking about stoping school at at early age.
Also from the matrix plot, note how dystopiaResidual seems to have a similar pattern relative to happiness when plotted against education (fourth column left to right second row top to bottom graph).
Now, we will try to run simple linear regression on our dataset and create some models
create formula
fmla <- happiness ~ lifeSpan + education
fit the model
Happiness_model <- lm(fmla, data = Happiness)
print Happiness_model and call summary()
Happiness_model
##
## Call:
## lm(formula = fmla, data = Happiness)
##
## Coefficients:
## (Intercept) lifeSpan education
## -0.88554 0.06396 0.13266
summary(Happiness_model)
##
## Call:
## lm(formula = fmla, data = Happiness)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7665 -0.5195 0.0839 0.4873 1.5826
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.88554 0.55906 -1.584 0.115539
## lifeSpan 0.06396 0.01176 5.437 2.46e-07 ***
## education 0.13266 0.03462 3.832 0.000194 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7396 on 135 degrees of freedom
## Multiple R-squared: 0.6037, Adjusted R-squared: 0.5978
## F-statistic: 102.8 on 2 and 135 DF, p-value: < 2.2e-16
we have a negative coefficient, so our results tells us that a decrease in happiness means an increase in both LifeSpan and education. In simple terms, as people tend to live longer and spend longer years in schooling , it would also decrease their chances of being happy. :(
plot(Happiness_model)
predict
Happiness$prediction <- predict(Happiness_model)
ggplot
ggplot(Happiness, aes(x = prediction, y = happiness)) +
geom_point() +
geom_abline(color = "blue")
Plot a correlation graph
happinesscor = cor(Happiness[1:4])
corrplot(happinesscor, method = "number")
graphically evaluate Happiness_model
Make predictions from the model
Happiness$predictions <- predict(Happiness_model)
Fill in the blanks to plot predictions (on x-axis) versus the happiness
ggplot(Happiness, aes(x = predictions, y = happiness)) +
geom_point() +
geom_abline()
Calculate residuals
Happiness$residuals <- Happiness$predictions - Happiness$happiness
Fill in the blanks to plot the residuals vs. predictions
ggplot(Happiness, aes(x = predictions, y = residuals)) +
geom_pointrange(aes(ymin = 0, ymax = residuals)) +
geom_hline(yintercept = 0, linetype = 3) +
ggtitle("residuals vs. linear model prediction")
Load the package WVPlots
library(WVPlots)
# Plot the Gain Curve
GainCurvePlot(Happiness, "predictions", "happiness", "Happiness model")
#comment:A relative gini coefficient close to one shows that the model correctly sorts situations from lower ones.
view summary
summary(Happiness)
## happiness lifeSpan education dystopiaResidual
## Min. :2.839 Min. :49.00 Min. : 5.40 Min. :0.6704
## 1st Qu.:4.526 1st Qu.:64.65 1st Qu.:11.10 1st Qu.:1.8090
## Median :5.232 Median :74.10 Median :13.50 Median :2.0954
## Mean :5.416 Mean :71.36 Mean :13.10 Mean :2.1122
## 3rd Qu.:6.322 3rd Qu.:77.55 3rd Qu.:15.28 3rd Qu.:2.4624
## Max. :7.587 Max. :83.50 Max. :20.20 Max. :3.6021
## prediction predictions residuals
## Min. :3.312 Min. :3.312 Min. :-1.5826
## 1st Qu.:4.656 1st Qu.:4.656 1st Qu.:-0.4873
## Median :5.631 Median :5.631 Median :-0.0839
## Mean :5.416 Mean :5.416 Mean : 0.0000
## 3rd Qu.:6.035 3rd Qu.:6.035 3rd Qu.: 0.5195
## Max. :7.064 Max. :7.064 Max. : 1.7665
For convenience put the residuals in the variable res
res <- Happiness$residuals
# Calculate RMSE, assign it to the variable rmse and print it
(rmse <- sqrt(mean(res^2)))
## [1] 0.7315398
# Calculate the standard deviation of happiness and print it
(sd_Happiness <- sd(Happiness$happiness))
## [1] 1.166301
Comment:An RMSE much smaller than the outcome’s standard deviation suggests a model that predicts well.
This is my first R project and I just started using it. The models still neeeds more improvement. (Even the graphs!) I hope you all enjoy this post and feel free to comment below. I would like to hear from more experience R users so you could help me improve my skills. Thanks!