Provide the packages required to reproduce the report. Make sure you fulfilled the minimum requirement #10.
library(ggplot2)
library(tidyverse)
library(dplyr)
library(outliers)
library(forecast)
Sys.setlocale(locale="English")
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
In this assignment, we investigate two data sets, the World Happiness Report and the Human Development Report. We join the two data sets, look for missing values and outliers, and finally check the distribution of some variables and apply a transformation.
The two data sets used here are the World Happiness Report and the Human Development Report.
The World Happiness Report is a landmark survey of the state of global happiness. The first report was published in 2012, and an update was published in 2016. The World Happiness Report 2017, which ranks 155 countries by their happiness levels, was released at the United Nations at an event celebrating the International Day of Happiness on March 20th. The report has gained worldwide recognition as governments, organizations and civil society increasingly use happiness indicators to inform their policy-making decisions. Leading experts across fields (economics, psychology, survey analysis, national statistics, health, public policy and more) describe how measurements of well-being can be used effectively to assess the progress of nations. The reports review the state of happiness in the world today and show how the new science of happiness explains personal and national variations in happiness. The dataset was downloaded from https://www.kaggle.com/unsdsn/world-happiness/downloads/world-happiness-report.zip/2
The human development report data set is part of Data Science for Good: Kiva Crowdfunding. It consists of HDI, GDI, population, education, and health-related data. A newer data set contains country boundaries in GeoJSON format, which is helpful for geospatial analysis. The dataset was downloaded from https://www.kaggle.com/sudhirnl7/human-development-index-hdi
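Since the human development report contains too many variables, we keep only some of them.
The following variables are chosen from the human development report:
- Country
- HDI.Rank
- HDI
- Life.expectancy
- Mean.years.of.schooling
- Gross.national.income..GNI..per.capita
- GNI.per.capita.rank.minus.HDI.rank
The column subset used in the code (columns 3 to 12) also retains Change.in.HDI.rank.2010.2015, Average.annual.HDI.growth.1990.2000, and Average.annual.HDI.growth.2000.2010.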
Merge the two datasets on their shared Country column.
hdi <- read.csv("HDI.csv")
hdi <- hdi[,3:12]
happiness <- read.csv("2017.csv")
data <- merge(hdi, happiness)
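Called without a `by` argument, `merge()` joins on the columns the two data frames share (here `Country`) and keeps only rows present in both, so a country spelled differently in the two files is dropped silently. A minimal sketch with invented toy data frames (`hdi_toy` and `happy_toy` are hypothetical, not the real files) shows one way to list the unmatched names:

```r
# Toy stand-ins for the two files; the values are invented for illustration.
hdi_toy <- data.frame(Country = c("Norway", "Chad", "Viet Nam"),
                      HDI = c(0.95, 0.40, 0.68),
                      stringsAsFactors = FALSE)
happy_toy <- data.frame(Country = c("Norway", "Chad", "Vietnam"),
                        Happiness.Score = c(7.5, 3.9, 5.1),
                        stringsAsFactors = FALSE)

# Inner join on the shared column, the same behaviour as merge(hdi, happiness).
merged_toy <- merge(hdi_toy, happy_toy, by = "Country")

# Countries that appear in only one of the two inputs.
only_in_hdi <- setdiff(hdi_toy$Country, happy_toy$Country)
only_in_happy <- setdiff(happy_toy$Country, hdi_toy$Country)

print(merged_toy)
print(only_in_hdi)
print(only_in_happy)
```

Running the same two `setdiff()` calls on the real `hdi$Country` and `happiness$Country` would show which countries were lost in the join and whether any can be recovered by harmonising names.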
Check the data type and structure of the merged dataset.
dim(data)
## [1] 137 21
str(data)
## 'data.frame': 137 obs. of 21 variables:
## $ Country : Factor w/ 195 levels "Afghanistan",..: 1 2 3 5 7 8 9 10 11 13 ...
## $ HDI.Rank : int 169 75 83 150 45 84 2 24 78 47 ...
## $ HDI : num 0.479 0.764 0.745 0.533 0.827 0.743 0.939 0.893 0.759 0.824 ...
## $ Life.expectancy : num 60.7 78 75 52.7 76.5 74.9 82.5 81.6 70.9 76.7 ...
## $ Mean.years.of.schooling : num 3.6 9.6 7.8 5 9.9 11.3 13.2 11.3 11.2 9.4 ...
## $ Gross.national.income..GNI..per.capita: int 1871 10252 13533 6291 20945 8189 42822 43609 16413 37236 ...
## $ GNI.per.capita.rank.minus.HDI.rank : int 1 24 -1 -27 12 28 19 -4 -12 -19 ...
## $ Change.in.HDI.rank.2010.2015 : int -2 4 3 4 -2 1 1 -1 -2 -3 ...
## $ Average.annual.HDI.growth.1990.2000 : num 1.43 0.41 1.11 NA 0.9 0.16 0.38 0.53 NA 0.63 ...
## $ Average.annual.HDI.growth.2000.2010 : num 2.95 1.1 1.18 2.38 0.57 1.24 0.31 0.5 1.43 0.23 ...
## $ Happiness.Rank : int 141 109 53 140 24 121 10 13 85 41 ...
## $ Happiness.Score : num 3.79 4.64 5.87 3.8 6.6 ...
## $ Whisker.high : num 3.87 4.75 5.98 3.95 6.69 ...
## $ Whisker.low : num 3.71 4.54 5.77 3.64 6.51 ...
## $ Economy..GDP.per.Capita. : num 0.401 0.996 1.092 0.858 1.185 ...
## $ Family : num 0.582 0.804 1.146 1.104 1.44 ...
## $ Health..Life.Expectancy. : num 0.1807 0.7312 0.6176 0.0499 0.6951 ...
## $ Freedom : num 0.106 0.381 0.233 0 0.495 ...
## $ Generosity : num 0.3119 0.2013 0.0694 0.0979 0.1095 ...
## $ Trust..Government.Corruption. : num 0.0612 0.0399 0.1461 0.0697 0.0597 ...
## $ Dystopia.Residual : num 2.15 1.49 2.57 1.61 2.61 ...
As the output shows, there are 137 observations and 21 variables. Only one variable, the country name, is a factor; all other variables are numeric.
head(data)
The dataset is already tidy, so no further tidying is needed.
Split the dataset into three groups according to the happiness score: if the score is greater than 6, the group is high; if the score is less than 4, the group is low; otherwise the group is middle.
# score > 6 is "high", score < 4 is "low", everything else is "middle"
data <- data %>% mutate(group = ifelse(Happiness.Score > 6, "high", ifelse(Happiness.Score < 4, "low", "middle")))
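To confirm that the thresholds behave as described, the same rule can be applied to a few hand-picked scores, including the boundary values 4 and 6. This is a sketch in base R only; `scores` and `assign_group` are hypothetical names introduced for illustration.

```r
# Hypothetical helper applying the rule from the text:
# > 6 is "high", < 4 is "low", everything else (4 to 6 inclusive) is "middle".
assign_group <- function(score) {
  ifelse(score > 6, "high", ifelse(score < 4, "low", "middle"))
}

# Hand-picked scores, including both boundary values.
scores <- c(3.5, 4.0, 5.2, 6.0, 7.1)
groups <- assign_group(scores)
print(data.frame(score = scores, group = groups))

# Count how many observations fall into each group.
print(table(groups))
```

Calling `table(data$group)` on the real data gives the same kind of count per group and is a quick check that no group is empty.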
Count the NA values in each column, then drop every row that contains an NA value.
colSums(is.na(data))
## Country
## 0
## HDI.Rank
## 1
## HDI
## 1
## Life.expectancy
## 0
## Mean.years.of.schooling
## 1
## Gross.national.income..GNI..per.capita
## 0
## GNI.per.capita.rank.minus.HDI.rank
## 1
## Change.in.HDI.rank.2010.2015
## 1
## Average.annual.HDI.growth.1990.2000
## 18
## Average.annual.HDI.growth.2000.2010
## 9
## Happiness.Rank
## 0
## Happiness.Score
## 0
## Whisker.high
## 0
## Whisker.low
## 0
## Economy..GDP.per.Capita.
## 0
## Family
## 0
## Health..Life.Expectancy.
## 0
## Freedom
## 0
## Generosity
## 0
## Trust..Government.Corruption.
## 0
## Dystopia.Residual
## 0
## group
## 0
data <- na.omit(data)
dim(data)
## [1] 119 22
The following columns contain NA values:
- HDI.Rank
- HDI
- Mean.years.of.schooling
- GNI.per.capita.rank.minus.HDI.rank
- Change.in.HDI.rank.2010.2015
- Average.annual.HDI.growth.1990.2000
- Average.annual.HDI.growth.2000.2010
After cleaning, there are 119 rows and 22 columns.
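`na.omit()` drops a row if any column is missing, so a column with many gaps can remove rows that are otherwise complete. The sketch below uses invented values (`toy` and `needed` are hypothetical names) to contrast dropping rows across all columns with dropping rows only for the columns needed later:

```r
# Invented data with gaps in two columns.
toy <- data.frame(
  Country = c("A", "B", "C", "D"),
  HDI = c(0.9, NA, 0.7, 0.6),
  Growth.1990.2000 = c(NA, 0.5, NA, 0.3),
  stringsAsFactors = FALSE
)

# Missing values per column, as in colSums(is.na(data)).
print(colSums(is.na(toy)))

# Drop every row with at least one NA in any column.
all_complete <- na.omit(toy)

# Keep rows that are complete in the columns of interest only.
needed <- c("Country", "HDI")
hdi_complete <- toy[complete.cases(toy[, needed]), ]

print(nrow(all_complete))
print(nrow(hdi_complete))
```

Whether the stricter or the narrower rule is appropriate depends on which variables the later analysis actually uses.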
Scan the numeric data for outliers. Loop over each numeric variable, find its outlier, and remove the row that contains it.
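# numeric columns: everything except the first (Country) and the last (group)
names <- colnames(data)
names <- names[-1]
names <- names[-length(names)]
for (name in names) {
  # drop the row(s) holding the value farthest from the column mean
  data <- data[data[, name] != outlier(data[, name]), ]
}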
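`outlier()` from the outliers package returns the single value farthest from the sample mean, so the loop above removes one extreme value per column whether or not that value is unusual. As an alternative sketch (not the method used in this report), the base R function `boxplot.stats()` flags only values that lie more than 1.5 times the box length beyond the hinges; `x` is invented data.

```r
# Invented sample with one obvious extreme value.
x <- c(4.1, 4.8, 5.0, 5.2, 5.5, 5.9, 6.1, 12.0)

# Value farthest from the mean, which is what the loop in this report removes.
farthest <- x[which.max(abs(x - mean(x)))]

# Values outside the boxplot whiskers (coef = 1.5 by default).
flagged <- boxplot.stats(x)$out

# Keep only the observations that were not flagged.
x_clean <- x[!x %in% flagged]

print(farthest)
print(flagged)
print(x_clean)
```

With this rule a column without extreme values loses no rows, which avoids shrinking the dataset by one row per variable regardless of the data.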
Apply a square root transformation to the variable Dystopia.Residual. First, we check the distribution of the data with a histogram and find that it is left-skewed. Then we apply the square root transformation; afterwards, the distribution looks closer to a normal distribution.
hist(data$Dystopia.Residual)
# after square root transformation
hist(sqrt(data$Dystopia.Residual))
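Histograms can be hard to compare by eye, so a numeric skewness value before and after the transformation is a useful check. The sketch below computes the moment-based sample skewness in base R on simulated right-skewed data (the simulated sample and the `skewness` helper are assumptions for illustration, not the report's data); a value near 0 indicates a roughly symmetric distribution.

```r
# Hypothetical helper: moment-based sample skewness.
# Positive values mean a longer right tail, negative a longer left tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

set.seed(42)
x <- rexp(500, rate = 1)  # simulated right-skewed sample

# Skewness of the raw sample and of its square root.
skew_before <- skewness(x)
skew_after <- skewness(sqrt(x))

print(skew_before)
print(skew_after)
```

With `data$Dystopia.Residual` in place of `x`, the same two calls would show whether the square root moves the skewness toward 0 or away from it. A square root compresses the right tail, so it typically helps right-skewed data.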