Required packages

# This is the R chunk for the required packages
library(readr)
library(ggplot2) # Useful for creating plots
library(dplyr)  # Useful for data manipulation
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(knitr) # Useful for creating nice tables
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:tidyr':
## 
##     extract
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(outliers)
library(MVN)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## sROC 0.1-2 loaded
library(infotheo)
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

Executive Summary

The purpose of wrangling the datasets in this assignment is to compare the effect of happiness on obesity in two years, 2015 and 2016. Producing a clean dataset takes the seven stages described below:

Data stage: There are three datasets: “Obesity among adults by country, 1975-2016”, “The World Happiness Report 2015” and “The World Happiness Report 2016”. In this stage, I imported all the datasets and subset them to keep only the data for 2015 and 2016. I then renamed the columns where necessary, joined the two World Happiness Reports into one dataset (named WHP_join) and renamed its columns, and finally joined everything into a single dataset called “all_join”.

Understand stage: I checked the data type of every variable to see which ones need conversion. All data conversion and factorising is done in the next stage.

Tidy & Manipulate Data 1 stage: I found two variables that needed tidying. I separated the values packed into each of these columns, removed the unnecessary parts, and converted the cleaned variables to the numeric data type. I then reshaped the data from wide format to long format and factorised the year variable.

Tidy & Manipulate Data 2 stage: From the cleaned dataset, I used mutate to create two new variables called “Difference_hapiness” and “Difference_obesity”.

Scan 1 stage: I counted the missing values in each variable and removed the observations that contained them, so that the dataset has no missing values. I also checked for special values (infinite or NaN) and did not find any.

Scan 2 stage: I used a scatter plot to check visually whether there are any outliers. Then, I used the Mahalanobis distance method to identify specific outliers. Finally, I chose to keep all outliers.

Transform stage: I plotted histograms to see how the variables are distributed. I then applied a Box-Cox transformation to make the distributions more symmetrical.

Data

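Three datasets are used in this assignment: “Obesity among adults by country, 1975-2016”, “The World Happiness Report 2015” and “The World Happiness Report 2016”.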

The obesity dataset is used to analyse the relationship between obesity and happiness levels in 2015 and 2016, so only the data for these two years are kept and all other years are excluded.

The World Happiness Reports for 2015 and 2016 are used to see the difference in happiness score between the two years. The Happiness Score is the sum of the scores of seven other factors.

In this stage, I rename the columns of the datasets where necessary, subset the three datasets, and then merge them all together. After that, I check the head of the merged data.

# Read Obesity dataset and subset for year_2015 and year_2016
Obesity<-read.csv("D:/RMIT/2. Data Wrangling/Assignment 2/Dataset/Group 1/obesity among adult by countries/data.csv")

knitr::kable(head(Obesity))
Output shortened to the first 7 of 127 columns; the remaining columns (X2014 to X1975.2) repeat the same Both sexes / Male / Female layout for each earlier year. In every year column, the first row holds the indicator label “Prevalence of obesity among adults, BMI ≥ 30 (age-standardized estimate) (%)” and the second row holds “18+ years”; the remaining four rows are:

X | X2016 | X2016.1 | X2016.2 | X2015 | X2015.1 | X2015.2
Country | Both sexes | Male | Female | Both sexes | Male | Female
Afghanistan | 5.5 [3.4-8.1] | 3.2 [1.3-6.4] | 7.6 [4.3-12.4] | 5.2 [3.3-7.7] | 3.0 [1.3-6.0] | 7.3 [4.1-11.8]
Albania | 21.7 [17.0-26.7] | 21.6 [14.8-29.0] | 21.8 [15.3-28.9] | 21.1 [16.6-26.0] | 20.9 [14.4-28.1] | 21.3 [15.1-28.1]
Algeria | 27.4 [22.5-32.7] | 19.9 [13.6-27.1] | 34.9 [27.6-42.7] | 26.7 [21.9-31.8] | 19.2 [13.2-26.1] | 34.2 [27.1-41.7]
# Subset Obesity dataset for year_2015 and year_2016
Obesity<-Obesity[-c(1:3), c(1,2,5)]

head(Obesity)
# Changing column name of Obesity
colnames(Obesity)<- c("Country","Obesity_2016_percentage","Obesity_2015_percentage")
head(Obesity)
# Read World Happiness Report 2015
WHP_2015<-read_csv("D:/RMIT/2. Data Wrangling/Assignment 2/Dataset/Group 1/World Happiness Report/2015.csv")
## Parsed with column specification:
## cols(
##   Country = col_character(),
##   Region = col_character(),
##   `Happiness Rank` = col_double(),
##   `Happiness Score` = col_double(),
##   `Standard Error` = col_double(),
##   `Economy (GDP per Capita)` = col_double(),
##   Family = col_double(),
##   `Health (Life Expectancy)` = col_double(),
##   Freedom = col_double(),
##   `Trust (Government Corruption)` = col_double(),
##   Generosity = col_double(),
##   `Dystopia Residual` = col_double()
## )
head(WHP_2015)
# Subset World Happiness Report 2015 to keep Country, Region and Happiness Score
WHP_2015<-WHP_2015[,c(1,2,4)]
head(WHP_2015)
# Read World Happiness Report 2016 
WHP_2016<-read_csv("D:/RMIT/2. Data Wrangling/Assignment 2/Dataset/Group 1/World Happiness Report/2016.csv")
## Parsed with column specification:
## cols(
##   Country = col_character(),
##   Region = col_character(),
##   `Happiness Rank` = col_double(),
##   `Happiness Score` = col_double(),
##   `Lower Confidence Interval` = col_double(),
##   `Upper Confidence Interval` = col_double(),
##   `Economy (GDP per Capita)` = col_double(),
##   Family = col_double(),
##   `Health (Life Expectancy)` = col_double(),
##   Freedom = col_double(),
##   `Trust (Government Corruption)` = col_double(),
##   Generosity = col_double(),
##   `Dystopia Residual` = col_double()
## )
head(WHP_2016)
# Subset World Happiness Report 2016 to keep Country and Happiness Score
WHP_2016<-WHP_2016[,c(1,4)]
head(WHP_2016)
# Joining the World Happiness Reports for 2015 and 2016
WHP_join<-WHP_2015 %>% left_join(WHP_2016, by="Country")
head(WHP_join,10)
# Changing column names of the joined World Happiness Report data
colnames(WHP_join)<-c("Country","Region","Score_2015","Score_2016")
head(WHP_join)
# Joining the Obesity dataset and the World Happiness Report data together
all_join<-left_join(WHP_join,Obesity, by = "Country")
head(all_join)
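The behaviour of the joins above can be illustrated on toy data. This is a minimal sketch, not the report's data: the country names and scores are invented, and `whp_2015`, `whp_2016`, `joined` and `unmatched` are hypothetical names. It shows that `left_join()` keeps every row of the left table and fills unmatched rows with `NA`, which is a likely source of missing values after joining.

```r
library(dplyr)

# Toy stand-ins for the report's tables; all values are invented.
whp_2015 <- tibble(Country    = c("Aland", "Borduria", "Syldavia"),
                   Score_2015 = c(7.1, 5.4, 6.2))
whp_2016 <- tibble(Country    = c("Aland", "Borduria"),
                   Score_2016 = c(7.0, 5.6))

# left_join() keeps every row of the left table; rows without a match get NA.
joined <- left_join(whp_2015, whp_2016, by = "Country")

# anti_join() lists the left-hand rows that found no match on the right.
unmatched <- anti_join(whp_2015, whp_2016, by = "Country")

print(joined)
print(unmatched)
```

Running `anti_join()` before a join is a cheap way to see which countries will end up with missing values, for example because a country name is spelled differently in the two sources.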

Understand

Checking the structure of the joined dataset to see its data types and dimensions

# Structure of all_join
str(all_join)
## tibble [158 x 6] (S3: tbl_df/tbl/data.frame)
##  $ Country                : chr [1:158] "Switzerland" "Iceland" "Denmark" "Norway" ...
##  $ Region                 : chr [1:158] "Western Europe" "Western Europe" "Western Europe" "Western Europe" ...
##  $ Score_2015             : num [1:158] 7.59 7.56 7.53 7.52 7.43 ...
##  $ Score_2016             : num [1:158] 7.51 7.5 7.53 7.5 7.4 ...
##  $ Obesity_2016_percentage: chr [1:158] "19.5 [16.0-23.3]" "21.9 [18.0-26.0]" "19.7 [16.2-23.3]" "23.1 [19.3-27.1]" ...
##  $ Obesity_2015_percentage: chr [1:158] "19.1 [15.7-22.7]" "21.5 [17.8-25.4]" "19.3 [16.1-22.7]" "22.6 [19.0-26.4]" ...

At this stage, it is clear that the all_join dataset contains two data types, character and numeric. The dataset has 158 observations and 6 variables.

Tidy & Manipulate Data I

First, I need to check the dataset for any problems.

# Checking head

head(all_join)

As can be seen from the first six observations above, the columns named Obesity_2016_percentage and Obesity_2015_percentage are untidy because each cell contains both an estimate and its interval. Before they can be converted to numeric, these columns need to be tidied, and the dataset then needs to be reshaped.

# Using separate function to split "Obesity_2016_percentage" and its interval
all_join %<>% separate("Obesity_2016_percentage",into= c("Obesity_2016_percentage", "Intervals_1"), sep=" ") 
knitr::kable(head(all_join,10))
Country Region Score_2015 Score_2016 Obesity_2016_percentage Intervals_1 Obesity_2015_percentage
Switzerland Western Europe 7.587 7.509 19.5 [16.0-23.3] 19.1 [15.7-22.7]
Iceland Western Europe 7.561 7.501 21.9 [18.0-26.0] 21.5 [17.8-25.4]
Denmark Western Europe 7.527 7.526 19.7 [16.2-23.3] 19.3 [16.1-22.7]
Norway Western Europe 7.522 7.498 23.1 [19.3-27.1] 22.6 [19.0-26.4]
Canada North America 7.427 7.404 29.4 [25.7-33.3] 28.8 [25.3-32.4]
Finland Western Europe 7.406 7.413 22.2 [19.0-25.7] 21.8 [18.8-25.1]
Netherlands Western Europe 7.378 7.339 20.4 [16.9-24.2] 20.0 [16.7-23.6]
Sweden Western Europe 7.364 7.291 20.6 [17.1-24.3] 20.2 [16.9-23.6]
New Zealand Australia and New Zealand 7.286 7.334 30.8 [27.3-34.3] 30.2 [26.9-33.5]
Australia Australia and New Zealand 7.284 7.313 29.0 [25.3-32.9] 28.4 [24.9-32.1]
# Using separate function to split "Obesity_2015_percentage" and its interval
all_join %<>% separate("Obesity_2015_percentage",into= c("Obesity_2015_percentage", "Intervals_2"), sep=" ") 
knitr::kable(head(all_join,10))
Country Region Score_2015 Score_2016 Obesity_2016_percentage Intervals_1 Obesity_2015_percentage Intervals_2
Switzerland Western Europe 7.587 7.509 19.5 [16.0-23.3] 19.1 [15.7-22.7]
Iceland Western Europe 7.561 7.501 21.9 [18.0-26.0] 21.5 [17.8-25.4]
Denmark Western Europe 7.527 7.526 19.7 [16.2-23.3] 19.3 [16.1-22.7]
Norway Western Europe 7.522 7.498 23.1 [19.3-27.1] 22.6 [19.0-26.4]
Canada North America 7.427 7.404 29.4 [25.7-33.3] 28.8 [25.3-32.4]
Finland Western Europe 7.406 7.413 22.2 [19.0-25.7] 21.8 [18.8-25.1]
Netherlands Western Europe 7.378 7.339 20.4 [16.9-24.2] 20.0 [16.7-23.6]
Sweden Western Europe 7.364 7.291 20.6 [17.1-24.3] 20.2 [16.9-23.6]
New Zealand Australia and New Zealand 7.286 7.334 30.8 [27.3-34.3] 30.2 [26.9-33.5]
Australia Australia and New Zealand 7.284 7.313 29.0 [25.3-32.9] 28.4 [24.9-32.1]
# Removing intervals 
all_join<-all_join[ , -c(6,8)]
head(all_join)
# Convert data type from character to numeric
all_join[ , c(5)]<-as.numeric(unlist(all_join[ , c(5)]))
## Warning: NAs introduced by coercion
all_join[ , c(6)]<-as.numeric(unlist(all_join[ , c(6)]))
## Warning: NAs introduced by coercion
str(all_join)
## tibble [158 x 6] (S3: tbl_df/tbl/data.frame)
##  $ Country                : chr [1:158] "Switzerland" "Iceland" "Denmark" "Norway" ...
##  $ Region                 : chr [1:158] "Western Europe" "Western Europe" "Western Europe" "Western Europe" ...
##  $ Score_2015             : num [1:158] 7.59 7.56 7.53 7.52 7.43 ...
##  $ Score_2016             : num [1:158] 7.51 7.5 7.53 7.5 7.4 ...
##  $ Obesity_2016_percentage: num [1:158] 19.5 21.9 19.7 23.1 29.4 22.2 20.4 20.6 30.8 29 ...
##  $ Obesity_2015_percentage: num [1:158] 19.1 21.5 19.3 22.6 28.8 21.8 20 20.2 30.2 28.4 ...
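The "NAs introduced by coercion" warnings above mean that some cells did not contain a number. The sketch below shows one way to extract the estimate and list the offending cells explicitly. It assumes, without evidence from the report, that the non-numeric cells hold text such as "No data"; `parse_estimate` is a hypothetical helper, not part of any package.

```r
# Hypothetical helper: pull the leading number out of strings such as
# "19.5 [16.0-23.3]"; anything that does not start with a number becomes NA.
parse_estimate <- function(x) {
  number_pattern <- "^([0-9]+(\\.[0-9]+)?).*$"
  has_number <- grepl(number_pattern, x)  # FALSE for NA and for text like "No data"
  out <- rep(NA_real_, length(x))
  out[has_number] <- as.numeric(sub(number_pattern, "\\1", x[has_number]))
  out
}

# Toy input; "No data" is an assumed example of a non-numeric cell.
raw <- c("19.5 [16.0-23.3]", "No data", NA, "4 [2.0-6.5]")
estimates <- parse_estimate(raw)
print(estimates)

# Cells that were present but not numeric, i.e. the ones behind the warning.
print(raw[is.na(estimates) & !is.na(raw)])
```

Listing the non-numeric cells makes it possible to confirm that the NAs come from genuinely missing estimates rather than from a parsing mistake.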
# Reshape the World Happiness Report scores to long format
all_join1 <- all_join %>% 
  pivot_longer(3:4, names_to="Years", values_to="happiness_score")
knitr::kable(head(all_join1))
Country Region Obesity_2016_percentage Obesity_2015_percentage Years happiness_score
Switzerland Western Europe 19.5 19.1 Score_2015 7.587
Switzerland Western Europe 19.5 19.1 Score_2016 7.509
Iceland Western Europe 21.9 21.5 Score_2015 7.561
Iceland Western Europe 21.9 21.5 Score_2016 7.501
Denmark Western Europe 19.7 19.3 Score_2015 7.527
Denmark Western Europe 19.7 19.3 Score_2016 7.526
# Factorise Year variable
all_join1$Years <-all_join1$Years %>%factor(levels=c("Score_2015", "Score_2016"),labels=c("2015", "2016"))
head(all_join1)
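The long format above pivots only the happiness scores, so the two obesity columns stay wide. One possible alternative, sketched here on invented data, pivots both measures at once with the `.value` sentinel of `pivot_longer()`. It assumes the columns are first renamed to a `<measure>_<year>` pattern (for example `Obesity_2015` instead of `Obesity_2015_percentage`); the names `wide` and `long` are hypothetical.

```r
library(tidyr)

# Toy wide table; column names follow <measure>_<year> (an assumption).
wide <- data.frame(
  Country      = c("Aland", "Borduria"),
  Score_2015   = c(7.1, 5.4),
  Score_2016   = c(7.0, 5.6),
  Obesity_2015 = c(19.1, 21.5),
  Obesity_2016 = c(19.5, 21.9)
)

# ".value" turns the first part of each name into its own column,
# while the second part (the year) becomes the Year column.
long <- pivot_longer(
  wide,
  cols      = -Country,
  names_to  = c(".value", "Year"),
  names_sep = "_"
)

# Year is a categorical label here, so store it as a factor.
long$Year <- factor(long$Year, levels = c("2015", "2016"))
print(long)
```

The result has one row per country and year, with `Score` and `Obesity` side by side.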

Tidy & Manipulate Data II

# Mutating the combined dataset to see the changes between 2015 and 2016
all_join<-mutate(all_join,Difference_hapiness= Score_2016 - Score_2015, 
                 Difference_obesity=(Obesity_2016_percentage - Obesity_2015_percentage) )
knitr::kable(head(all_join))
Country Region Score_2015 Score_2016 Obesity_2016_percentage Obesity_2015_percentage Difference_hapiness Difference_obesity
Switzerland Western Europe 7.587 7.509 19.5 19.1 -0.078 0.4
Iceland Western Europe 7.561 7.501 21.9 21.5 -0.060 0.4
Denmark Western Europe 7.527 7.526 19.7 19.3 -0.001 0.4
Norway Western Europe 7.522 7.498 23.1 22.6 -0.024 0.5
Canada North America 7.427 7.404 29.4 28.8 -0.023 0.6
Finland Western Europe 7.406 7.413 22.2 21.8 0.007 0.4

Scan I

In this stage, I scan the columns Difference_hapiness and Difference_obesity for missing values. Applying sum() to the result of is.na() gives the number of missing values in each variable.

If there are any missing values, I exclude the observations that contain them. I do not replace missing values with the mean, median or mode because each country has its own value, and each value changes every year depending on the country’s circumstances. For example, during the Covid-19 period the happiness score would be significantly different, so a missing value cannot reasonably be predicted or replaced.

# Scanning for missing values in the differences in obesity and happiness score
sum(is.na(all_join$Difference_hapiness))
## [1] 7
sum(is.na(all_join$Difference_obesity))
## [1] 25

The “Difference_hapiness” column has 7 missing values, while “Difference_obesity” contains 25. In the next step, I remove these observations by applying complete.cases() to each variable.

# Excluding missing data
all_join<-all_join[complete.cases(all_join$Difference_hapiness),]
all_join<-all_join[complete.cases(all_join$Difference_obesity),]
sum(is.na(all_join))
## [1] 0
# Create function to check for special values
is.special <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}
# Applying function
sum(sapply(all_join$Difference_hapiness, is.special))
## [1] 0
sum(sapply(all_join$Difference_obesity, is.special))
## [1] 0
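The missing-value and special-value checks above can also be run over every column in one pass. The following is a minimal base R sketch on invented data; `df`, `count_problems`, `report` and `clean` are hypothetical names. Note that `is.na()` is also `TRUE` for `NaN`, so `complete.cases()` removes `NaN` rows as well, while `Inf` has to be checked separately.

```r
# Toy data frame with one NA, one NaN and one Inf (values invented).
df <- data.frame(
  Country = c("Aland", "Borduria", "Syldavia", "Zubrowka"),
  diff_a  = c(0.1, NA, -0.2, NaN),
  diff_b  = c(0.4, 0.5, Inf, 0.3)
)

# Count plain NA, NaN and infinite values in one column.
count_problems <- function(col) {
  if (!is.numeric(col)) {
    return(c(missing = sum(is.na(col)), not_a_number = 0L, infinite = 0L))
  }
  c(missing      = sum(is.na(col) & !is.nan(col)),
    not_a_number = sum(is.nan(col)),
    infinite     = sum(is.infinite(col)))
}

# One column of counts per variable.
report <- sapply(df, count_problems)
print(report)

# Keep only rows where both numeric columns are present and finite.
clean <- df[complete.cases(df) & is.finite(df$diff_a) & is.finite(df$diff_b), ]
print(clean)
```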

Scan II

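In this stage, because there are two variables, I first plot them against each other to look for outliers visually, and then use a multivariate method based on the Mahalanobis distance to identify specific outliers.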

# Scanning for outliers
all_join%>%plot(Difference_obesity~Difference_hapiness, data=., main="Relationship of Happiness levels and Obesity",
                 xlab = "Happiness", ylab = "Obesity")

# Subsetting data for outlier check
Happiness_Obesity<-all_join%>%select(Difference_hapiness,Difference_obesity)
head(Happiness_Obesity)
# Multivariate outlier detection using Mahalanobis distance with QQ plots
results <- mvn(data = Happiness_Obesity, multivariateOutlierMethod = "quan", showOutliers = TRUE)

results$multivariateOutliers
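To make the idea behind the Mahalanobis distance check concrete, here is a minimal base R sketch on simulated data. It is not the procedure that `mvn()` runs internally, which may use different (for example robust) estimates; this version uses the ordinary sample mean and covariance and a chi-square cutoff at the 0.975 quantile, both of which are assumptions chosen for illustration. All variable names are hypothetical.

```r
set.seed(1)

# Toy data: 50 ordinary points plus one planted extreme point (row 51).
xy <- data.frame(d_happiness = c(rnorm(50), 8),
                 d_obesity   = c(rnorm(50), -8))

# Classical estimates of centre and spread.
centre <- colMeans(xy)
spread <- cov(xy)

# mahalanobis() returns squared distances from the centre.
d2 <- mahalanobis(as.matrix(xy), center = centre, cov = spread)

# For roughly normal data, squared distances are compared with a
# chi-square quantile with one degree of freedom per variable.
cutoff <- qchisq(0.975, df = ncol(xy))
outliers <- which(d2 > cutoff)
print(outliers)
```

A point is flagged when it lies far from the centre relative to the joint spread of both variables, which is why a value can be unremarkable on each axis alone and still be a multivariate outlier.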

Transform

In this stage, I check the distributions of the Difference_hapiness and Difference_obesity variables.

# Checking distribution for Difference_hapiness variable
all_join$Difference_hapiness %>% hist(col="grey",xlab = "Happiness",main = "Histogram of difference in happiness")

# Applying BoxCox transformation
boxcox_happiness<-BoxCox(all_join$Difference_hapiness, lambda="auto")
hist(boxcox_happiness)

# Checking distribution for Difference_obesity variable
all_join$Difference_obesity %>% hist(col="grey",xlab = "Obesity",main = "Histogram of difference in Obesity")

# Applying BoxCox transformation
boxcox_obesity<-BoxCox(all_join$Difference_obesity,lambda="auto")
hist(boxcox_obesity)
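For reference, the textbook Box-Cox transform is (x^lambda - 1) / lambda for lambda not equal to 0, and log(x) for lambda equal to 0, and it is defined only for strictly positive x. The sketch below writes this out in base R and applies it to the first six Difference_hapiness values shown earlier. Because those values include negatives, the sketch shifts them to be positive first; the shift and the lambda values are assumptions for illustration, and this is not a reproduction of what `forecast::BoxCox()` does with `lambda = "auto"`. `box_cox` is a hypothetical helper.

```r
# Textbook Box-Cox transform; only defined for strictly positive x.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# First six Difference_hapiness values from the table above.
diffs <- c(-0.078, -0.060, -0.001, -0.024, -0.023, 0.007)

# Hypothetical shift so that the smallest value becomes 1.
shift <- 1 - min(diffs)
shifted <- diffs + shift

print(box_cox(shifted, lambda = 0.5))
print(box_cox(shifted, lambda = 0))  # lambda = 0 is the log transform
```

With lambda equal to 1 the transform only subtracts 1 from each value, so the shape of the distribution is unchanged; values of lambda below 1 compress the right tail.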

Reference