Provide the packages required to reproduce the report. Make sure you fulfilled the minimum requirement #10.
library(readr)
library(foreign)
library(gdata)
library(rvest)
library(dplyr)
library(tidyr)
library(deductive)
library(deducorrect)
library(editrules)
library(validate)
library(Hmisc)
library(forecast)
library(outliers)
library(stringr)
library(lubridate)
library(MASS)
library(mlr)
library(ggplot2)
This report uses two data sets: the level of alcohol consumption in each country (covering three types of alcohol) and the World Happiness Index (which also contains macro-level indices for each country). The purpose of the report is to understand the relationship between national alcohol consumption, the happiness index and the macro data. The report has seven main parts. The first two parts describe the composition of the two data sets and import the data into R. The third part processes the two data sets separately and changes the data formats to make them as clean as possible. The fourth part combines the two processed data sets, tidies the result, and uses mutate() to create one new variable. The fifth part scans for missing values and inconsistencies. The sixth part scans for outliers: z-scores are used to find them, and each outlier is then replaced with the mean of its variable so that outliers are removed without producing missing values. In the last part, the Family variable is selected for data transformation.
This report contains two data sets: “World Happiness Report” and “Alcohol consumption by country”.
To import the data sets, the function read_csv() was used, because it reads ".csv" files and the readr functions are around 10x faster than their base R equivalents.
I import 2016.csv with read_csv() to hp, the happiness data set. Then I import drinks.csv with read_csv() to al, the alcohol consumption data set.
hp <- read_csv("2016.csv")
al <- read_csv("drinks.csv")
head(hp)
head(al)
The Alcohol Consumption Report covers the amount of alcohol consumed in different countries and was published in 2016. It contains 193 countries and regions and investigates three types of alcohol: beer, spirits and wine.
The six variables are listed below. The country and continent variables are character variables, and the other variables are numeric. In the subsequent processing, the continent variable is converted into a factor variable for convenience, to label the main areas.
Country: Name of the country.
Beer_servings: Servings (per capita) of beer consumed.
Spirit_servings: Servings (per capita) of spirits consumed.
Wine_servings: Servings (per capita) of wine consumed.
total_litres_of_pure_alcohol: Total litres of pure alcohol.
Continent: The continent where the country is located.
The World Happiness Report is a landmark survey of global happiness, covering happiness-related indices in more than 150 countries. It is published every year at the United Nations celebration of the International Day of Happiness. So that it can be used accurately in conjunction with the alcohol consumption data set, the 2016 edition was used. It contains 157 countries and regions and 13 variables. The country and region variables are character variables, and the other 11 variables are numeric. In the subsequent processing, the region variable is converted into a factor variable for convenience, to label the main areas. The happiness rank is stored as a numeric variable in the data set, but it is really a factor variable, so it is also converted in subsequent operations. The 13 variables are:
Country: Name of the country.
Region: Region the country belongs to.
Happiness Rank: Rank of the country based on the Happiness Score.
Happiness Score: A metric measured in 2016 by asking the sampled people the question: “How would you rate your happiness on a scale of 0 to 10 where 10 is the happiest”.
Lower Confidence Interval: Lower Confidence Interval of the Happiness Score.
Upper Confidence Interval: Upper Confidence Interval of the Happiness Score.
Economy (GDP per Capita): The extent to which GDP contributes to the calculation of the Happiness Score.
Family: The extent to which Family contributes to the calculation of the Happiness Score.
Health (Life Expectancy): The extent to which Life expectancy contributed to the calculation of the Happiness Score.
Freedom: The extent to which Freedom contributed to the calculation of the Happiness Score.
Trust (Government Corruption): The extent to which Perception of Corruption contributes to Happiness Score.
Generosity: The extent to which Generosity contributed to the calculation of the Happiness Score.
Dystopia Residual: The extent to which Dystopia Residual contributed to the calculation of the Happiness Score.
Before combining the two data sets, we need to modify each of them to make the subsequent data processing easier.
# Step 1: Convert the region variable from character to factor, and the
# happiness rank variable from numeric to factor
hp$Region <- factor(hp$Region)
hp$`Happiness Rank` <- factor(hp$`Happiness Rank`)
class(hp$Region)
[1] "factor"
class(hp$`Happiness Rank` )
[1] "factor"
head(hp)
# Step 2: Standardize the country names by removing punctuation so that the
# two data sets can be combined successfully.
hp$Country <- hp$Country %>% str_replace_all(pattern = "[[:punct:]]", replacement = "")
# Step 1: Rename the variables of the alcohol data set so they are easier to view and use.
colnames(al) <- c("Country", "Beer", "Spirit", "Wine", "Total pure alcohol", "Continent")
head(al)
# Step 2: Modify the Continent variable from a character variable to a factor variable
al$Continent <- factor(al$Continent)
class(al$Continent)
[1] "factor"
head(al)
# Step 3: Standardize the country names and remove punctuation so that they
# match the country name format in the Happiness data set.
# Change all "&" symbols to "and"
al$Country <- al$Country %>% str_replace_all(pattern = "&", replacement = "and")
# Remove all punctuation
al$Country <- al$Country %>% str_replace_all(pattern = "[[:punct:]]",
replacement = "")
# Change the only country abbreviation to its full name
al$Country <- al$Country %>% str_replace_all(pattern = "USA",
replacement = "United States")
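The name-cleaning steps above can be sketched in base R on a few made-up country strings. The helper name clean_country() and the sample names are assumptions for illustration only; they are not taken from either data set.

```r
# Hypothetical helper: normalise country names before joining (base R only).
clean_country <- function(x) {
  x <- gsub("&", "and", x, fixed = TRUE)  # spell out ampersands
  x <- gsub("[[:punct:]]", "", x)         # drop remaining punctuation
  x <- gsub("\\s+", " ", x)               # collapse repeated spaces
  trimws(x)                               # remove leading/trailing spaces
}

# Made-up examples, not rows from the report's data
raw <- c("Trinidad & Tobago", "Bosnia-Herzegovina", "St. Lucia", "USA")
cleaned <- clean_country(raw)
cleaned[cleaned == "USA"] <- "United States"  # expand the one abbreviation
print(cleaned)
```

Replacing "&" before stripping punctuation matters: once the punctuation class has removed the ampersand, there is nothing left to turn into "and".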
After the previous step, the two data sets meet the requirements for combination, so they can now be combined and tidied.
Here, inner_join() is used for the combination. Although both data sets cover more than 150 countries, the countries surveyed are not identical. For data integrity, we therefore use inner_join() (with Country as the key) to exclude countries that do not appear in both data sets, keeping only the countries, and their associated data, that are contained in both. The result is the new data set "hpal".
hpal <- inner_join(hp,al,by="Country")
head(hpal)
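Before relying on an inner join, it can help to see which keys would be dropped. The following is a minimal base R sketch on invented toy data (hp_toy, al_toy and their values are assumptions); merge() performs an inner join by default, much like inner_join().

```r
# Toy stand-ins for the two data sets; all values are invented.
hp_toy <- data.frame(Country = c("Denmark", "Chile", "Kosovo"),
                     Score = c(7.5, 6.7, 5.4),
                     stringsAsFactors = FALSE)
al_toy <- data.frame(Country = c("Denmark", "Chile", "Tonga"),
                     Beer = c(224, 130, 36),
                     stringsAsFactors = FALSE)

# Inner join: keeps only countries present in both tables
joined <- merge(hp_toy, al_toy, by = "Country")

# Countries that the inner join silently drops from each side
only_hp <- setdiff(hp_toy$Country, al_toy$Country)
only_al <- setdiff(al_toy$Country, hp_toy$Country)

print(joined)
print(only_hp)
print(only_al)
```

Inspecting the two setdiff() results is a quick way to spot names that fail to match only because of spelling or formatting differences.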
On inspecting the data set, we found that the region and continent variables carry similar geographical information (continent holds the six major continents, while region is more specific). To avoid duplication, we therefore delete the continent variable.
hpal <- subset(hpal, select = -Continent)
head(hpal)
To observe more closely the relationship between the three kinds of alcohol consumption (beer, spirits and wine) and the happiness score, we create a new variable that is the total of the three.
hpal <- hpal %>% mutate(`Sum of alcohol` = Beer + Spirit + Wine)
head(hpal)
hpalsubset<-subset(hpal,,c("Country", "Happiness Rank", "Happiness Score", "Sum of alcohol"))
head(hpalsubset)
Scan the data for missing values, inconsistencies and obvious errors.
Use is.na() to judge the entire data set to determine if there are missing values. Then use sum() to show how many missing values are in total.
sum(is.na(hpal))
[1] 0
According to the results, the total number of missing values for the entire data set is zero, so there are no missing values.
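A single total hides where missing values sit when there are any. As a small sketch on invented data (the toy data frame is an assumption), colSums() and complete.cases() break the count down by column and by row:

```r
# Toy data with deliberately missing entries (values invented)
toy <- data.frame(Country = c("A", "B", "C"),
                  Beer = c(10, NA, 30),
                  Wine = c(NA, NA, 5))

# Missing values per column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows with no missing value at all
complete_rows <- sum(complete.cases(toy))
print(complete_rows)
```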
is.special <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}
sapply(hpal, is.special)%>% summary()
Length Class Mode
Country 0 -none- NULL
Region 0 -none- NULL
Happiness Rank 0 -none- NULL
Happiness Score 143 -none- logical
Lower Confidence Interval 143 -none- logical
Upper Confidence Interval 143 -none- logical
Economy (GDP per Capita) 143 -none- logical
Family 143 -none- logical
Health (Life Expectancy) 143 -none- logical
Freedom 143 -none- logical
Trust (Government Corruption) 143 -none- logical
Generosity 143 -none- logical
Dystopia Residual 143 -none- logical
Beer 143 -none- logical
Spirit 143 -none- logical
Wine 143 -none- logical
Total pure alcohol 143 -none- logical
Sum of alcohol 143 -none- logical
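The summary confirms that all 15 numeric variables were checked across the 143 countries (non-numeric variables return NULL). No infinite or NaN values were identified, so there are no further inconsistencies or special values in the 'hpal' data set.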
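Because summary() of the sapply() result reports only lengths and modes, a per-column count makes the outcome easier to read. This is a sketch under assumptions: count_special() is a hypothetical variant of is.special(), and the toy data are invented.

```r
# Hypothetical variant of is.special(): count Inf/-Inf/NaN per column,
# returning NA for columns that are not numeric.
count_special <- function(x) {
  if (is.numeric(x)) sum(is.infinite(x) | is.nan(x)) else NA_integer_
}

# Toy data with deliberately planted special values
toy <- data.frame(Country = c("A", "B", "C"),
                  Beer = c(1, Inf, 3),
                  Wine = c(NaN, 2, -Inf),
                  stringsAsFactors = FALSE)

special_counts <- sapply(toy, count_special)
print(special_counts)
```

A column of zeros for every numeric variable would then directly support the conclusion that no special values are present.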
Scan the numeric data for outliers, using z-scores to flag them and which() with length() to count them. The country and region variables hold only names, so there is no need to look for outliers in them. For the same reason, happiness rank is only an ordering, so it is excluded as well.
Calculate z-scores for each variable (excluding the country, region and happiness rank variables) and then look for outliers. A value whose absolute z-score is greater than 3 is considered an outlier.
z.scores1 <- scores(hpal[,4],type = "z")
z.scores2 <- scores(hpal[,5],type = "z")
z.scores3 <- scores(hpal[,6],type = "z")
z.scores4 <- scores(hpal[,7],type = "z")
z.scores5 <- scores(hpal[,8],type = "z")
z.scores6 <- scores(hpal[,9],type = "z")
z.scores7 <- scores(hpal[,10],type = "z")
z.scores8 <- scores(hpal[,11],type = "z")
z.scores9 <- scores(hpal[,12],type = "z")
z.scores10 <- scores(hpal[,13],type = "z")
z.scores11 <- scores(hpal[,14],type = "z")
z.scores12 <- scores(hpal[,15],type = "z")
z.scores13 <- scores(hpal[,16],type = "z")
z.scores14 <- scores(hpal[,17],type = "z")
z.scores15 <- scores(hpal[,18],type = "z")
length(which( abs(z.scores1) >3 ))
[1] 0
length(which( abs(z.scores2) >3 ))
[1] 0
length(which( abs(z.scores3) >3 ))
[1] 0
length(which( abs(z.scores4) >3 ))
[1] 0
length(which( abs(z.scores5) >3 ))
[1] 0
length(which( abs(z.scores6) >3 ))
[1] 0
length(which( abs(z.scores7) >3 ))
[1] 0
length(which( abs(z.scores8) >3 ))
[1] 2
length(which( abs(z.scores9) >3 ))
[1] 1
length(which( abs(z.scores10) >3 ))
[1] 0
length(which( abs(z.scores11) >3 ))
[1] 0
length(which( abs(z.scores12) >3 ))
[1] 2
length(which( abs(z.scores13) >3 ))
[1] 2
length(which( abs(z.scores14) >3 ))
[1] 0
length(which( abs(z.scores15) >3 ))
[1] 0
As can be seen from the above calculations, the Trust (Government Corruption), Generosity, Spirit and Wine variables have outliers; the numbers of outliers are 2, 1, 2 and 2 respectively.
Outliers can also be imputed rather than removed, using the mean or the median to replace them. Before imputing, we should check whether an outlier is the result of a data entry or processing error; in this case, the data set has already been checked in the earlier sections.
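The per-column z-score check can also be written as one loop over all numeric columns. The sketch below uses base R instead of outliers::scores(); count_outliers() and the toy data are assumptions for illustration.

```r
# Hypothetical helper: number of |z| > cutoff values in each numeric column
count_outliers <- function(df, cutoff = 3) {
  numeric_cols <- df[sapply(df, is.numeric)]
  sapply(numeric_cols, function(x) {
    z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)  # z-score
    sum(abs(z) > cutoff, na.rm = TRUE)
  })
}

# Toy data: Wine has one extreme value, Beer is evenly spread
toy <- data.frame(Country = letters[1:20],
                  Wine = c(rep(10, 19), 400),
                  Beer = seq(50, 240, by = 10))

outlier_counts <- count_outliers(toy)
print(outlier_counts)
```

Selecting columns by type rather than by position also avoids hard-coded indices such as hpal[,4] to hpal[,18], which break if the column order changes.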
# Replace the flagged values with the variable mean (for Generosity, the mean
# of the non-outlying values).
A <- hpal$`Trust (Government Corruption)`[which(abs(z.scores8) > 3)] <-
  mean(hpal$`Trust (Government Corruption)`, na.rm = TRUE)
B <- hpal$Generosity[which(abs(z.scores9) > 3)] <-
  mean(hpal$Generosity[-which(abs(z.scores9) > 3)], na.rm = TRUE)
C <- hpal$Spirit[which(abs(z.scores12) > 3)] <- mean(hpal$Spirit, na.rm = TRUE)
D <- hpal$Wine[which(abs(z.scores13) > 3)] <- mean(hpal$Wine, na.rm = TRUE)
length(which( abs(scores(A,type = "z")) >3 ))
[1] 0
length(which( abs(scores(B,type = "z")) >3 ))
[1] 0
length(which( abs(scores(C,type = "z")) >3 ))
[1] 0
length(which( abs(scores(D,type = "z")) >3 ))
[1] 0
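After imputation, the checks above return 0 for all four variables. Note that the chained assignment stores only the imputed mean (a single value) in A to D, so these checks do not re-test the full columns; recomputing the z-scores on the imputed columns of hpal is the stricter test.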
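One way to impute and then re-check a whole column is sketched below in base R. The helper impute_outliers() and the toy vector are assumptions, the data are assumed to contain no missing values (as found earlier for hpal), and the replacement uses the mean of the non-outlying values.

```r
# Hypothetical helper: replace |z| > cutoff values with the mean of the rest.
# Assumes x has no missing values.
impute_outliers <- function(x, cutoff = 3) {
  z <- (x - mean(x)) / sd(x)
  is_out <- abs(z) > cutoff
  x[is_out] <- mean(x[!is_out])
  x
}

# Toy vector: 1..19 plus one extreme value
wine <- c(seq(1, 19), 400)
wine_imputed <- impute_outliers(wine)

# Re-check the full imputed column, not just the imputed value
z_after <- (wine_imputed - mean(wine_imputed)) / sd(wine_imputed)
remaining <- sum(abs(z_after) > 3)
print(remaining)
```

Because the standard deviation shrinks after imputation, values that were just under the cutoff can become new outliers on real data, so the re-check is worth running on every imputed column.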
As our main object of study, we selected the Family variable as the target of the data transformation. First we use hist() to observe the distribution of the data.
hist(hpal$Family)
The histogram shows that the distribution of Family tends to be left-skewed, so we consider a square transformation.
x1<-hpal$Family
x1^2 %>% hist()
According to the figure, the transformation was not perfectly successful, so we try another method, the Box-Cox transformation.
BoxCox(x1,lambda = "auto") %>% hist()
As shown in the figure above, the distribution of the transformed Family variable is very close to a normal distribution, so we consider the transformation successful.
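To make the transformation concrete, here is a minimal base R sketch of the standard Box-Cox formula, (x^lambda - 1) / lambda for lambda != 0 and log(x) for lambda = 0, together with a simple moment-based skewness measure. The helpers box_cox() and skewness() and the toy values are assumptions; forecast::BoxCox() with lambda = "auto" chooses lambda itself, and its internals may differ from this sketch.

```r
# Hypothetical helper: standard Box-Cox transform for strictly positive data
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Hypothetical helper: moment-based sample skewness (negative = left-skewed)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Invented left-skewed values, not the Family variable itself
x <- c(0.2, 0.6, 0.8, 0.9, 1.0, 1.05, 1.1, 1.15)

skew_raw <- skewness(x)
skew_sq <- skewness(box_cox(x, 2))  # lambda = 2 is equivalent to squaring
print(c(raw = skew_raw, squared = skew_sq))
```

Comparing skewness before and after gives a numeric complement to judging the histograms by eye; with lambda = 2 the transform is a rescaled square, which is why squaring is the natural first attempt for left-skewed data.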