The following packages are used for the assignment (Wickham, Hester, and Francois (2017), Wickham, François, et al. (2018), Wickham and Henry (2018), van der Loo and de Jonge (2017), van der Loo and de Jonge (2018), Harrell (2018), Komsta (2011), Korkmaz, Goksuluk, and Zararsiz (2018), Meyer (2014), Bischl et al. (2018), Wickham, Chang, et al. (2018), Xie (2018), Wickham (2018)).
library(readr)
library(dplyr)
library(tidyr)
library(deductive)
library(validate)
library(Hmisc)
library(outliers)
library(MVN)
library(infotheo)
library(mlr)
library(ggplot2)
library(knitr)
library(stringr)
For this assignment, World Happiness measures by country were combined with a subset of CIA World Fact Book data, including geographic, social and economic statistics, to produce a tidy dataset ready for statistical analysis. Data pre-processing was performed in the R statistical programming environment using an array of data manipulation and cleaning packages. The combined data set was checked for consistency against logical rules arising from its features. The few missing values found in the dataset were imputed using geographical regional means. The data was checked for univariate and some multivariate outliers. The majority of features with outliers were highly skewed, and were transformed using logarithmic (base 10) or square root transformations. This reduced the skew in these features for future analysis and removed the majority of outliers. Where these transformations proved ineffective, equal frequency binning was applied. The remaining few outliers were treated by capping values lying more than 1.5 times the interquartile range beyond the quartiles. The final, pre-processed data set is complete and valid, with outliers treated, and is ready for further statistical analysis.
For this assignment, country level happiness scores as published in the World Happiness Report (Helliwell (2017)) were merged with general country information taken from the US CIA World Fact Book (“The World Factbook — Central Intelligence Agency” 2018). Both data sets were sourced from kaggle.com. This section gives a brief summary of both data sets.
Data from the World Happiness Report was taken from kaggle.com (“World Happiness Report” 2018), which contained world happiness ratings by country for the years 2015-2017. Only 2017 data was used here, the variables of which are outlined below. Further information can be sourced from Helliwell (2017).
| Variable | Description |
|---|---|
| Country | Country Name |
| Happiness Rank | Country happiness rank |
| Happiness Score | Average reported happiness (0-10 scale) for a country |
| Upper CI95 | Upper 95% confidence interval bound for happiness score |
| Lower CI95 | Lower 95% confidence interval bound for happiness score |
| Economy (GDP per Capita) | The extent to which Gross Domestic Product contributed to Happiness Score |
| Family | The extent to which Family contributed to Happiness Score |
| Health (Life Expectancy) | The extent to which life expectancy contributed to Happiness Score |
| Freedom | The extent to which freedom contributed to Happiness Score |
| Trust (Government Corruption) | The extent to which perception of corruption contributed to Happiness Score |
| Generosity | The extent to which generosity contributed to Happiness Score |
| Dystopia Residual | The extent to which Dystopia Residual contributed to the calculation of the Happiness Score. |
All data was imported into R using the readr::read_csv function. An extract of the imported data is shown below.
#Use readr functions to import the data
happy <- read_csv("2017.csv")
#Clean up the names
names(happy) = c("Country", "Happiness Rank", "Happiness Score", "Upper CI95", "Lower CI95",
"Economy (GDP per Capita)","Family", "Health (Life Expectancy)", "Freedom",
"Generosity", "Trust (Government Corruption)", "Dystopia Residual")
#Show an extract of the data
happy %>% head(5) %>% kable(digits = 3, format = "markdown")
| Country | Happiness Rank | Happiness Score | Upper CI95 | Lower CI95 | Economy (GDP per Capita) | Family | Health (Life Expectancy) | Freedom | Generosity | Trust (Government Corruption) | Dystopia Residual |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Norway | 1 | 7.537 | 7.594 | 7.480 | 1.616 | 1.534 | 0.797 | 0.635 | 0.362 | 0.316 | 2.277 |
| Denmark | 2 | 7.522 | 7.582 | 7.462 | 1.482 | 1.551 | 0.793 | 0.626 | 0.355 | 0.401 | 2.314 |
| Iceland | 3 | 7.504 | 7.622 | 7.386 | 1.481 | 1.611 | 0.834 | 0.627 | 0.476 | 0.154 | 2.323 |
| Switzerland | 4 | 7.494 | 7.562 | 7.426 | 1.565 | 1.517 | 0.858 | 0.620 | 0.291 | 0.367 | 2.277 |
| Finland | 5 | 7.469 | 7.528 | 7.410 | 1.444 | 1.540 | 0.809 | 0.618 | 0.245 | 0.383 | 2.430 |
Additional information for countries of the world was sourced from kaggle.com (“Countries of the World” 2018) based on values published in the CIA World Fact book (“The World Factbook — Central Intelligence Agency” 2018). It is based on 2010 data, which on some measures may be slightly out of date, but should be stable for country geographic information, with economic and social data indicative of broad trends.
This data consists of 20 variables for 227 countries. These variables are:
| Variable | Description |
|---|---|
| Country | Country Name |
| Region | Country Geographical Region |
| Population | Population of the country |
| Area (sq. mi.) | Geographic area of the country in square miles |
| Pop. Density (per sq. mi.) | Population density in people per square mile |
| Coastline (coast/area ratio) | Ratio of coastal length to area |
| Net migration | Net migration for every 1,000 people |
| Infant mortality (per 1000 births) | Infant mortality rate for every 1,000 births |
| GDP ($ per capita) | Per capita Gross Domestic Product |
| Literacy (%) | Literacy rates as a percent |
| Phones (per 1000) | Number of phones per 1,000 people |
| Arable (%) | Percentage of land area with arable land (but not used for crops) |
| Crops (%) | Percentage of land area used for crops |
| Other (%) | Percentage of land area neither arable nor used for crops |
| Birthrate | Annual rate of births per 1,000 people |
| Deathrate | Annual rate of deaths per 1,000 people |
| Agriculture | Fraction of GDP attributable to the agriculture sector |
| Industry | Fraction of GDP attributable to the industry sector |
| Service | Fraction of GDP attributable to the service sector |
Another variable named Climate was removed from the data set as it was encoded as a factor variable without labels. A brief examination of the original source material (“The World Factbook — Central Intelligence Agency” 2018) couldn’t resolve the mapping used to produce the kaggle data set from the CIA fact book. An extract of this data set is shown below:
# Use readr to source country data. Note that the European decimal convention was used in the original data.
countryInfo2010 <- read_csv("countries of the world.csv", locale = locale(decimal_mark = ","))
#Remove the Climate variable
countryInfo2010 <- countryInfo2010 %>% select(-Climate)
#Show an extract of the data
countryInfo2010 %>% head(2) %>% t()
## [,1] [,2]
## Country "Afghanistan" "Albania"
## Region "ASIA (EX. NEAR EAST)" "EASTERN EUROPE"
## Population "31056997" " 3581655"
## Area (sq. mi.) "647500" " 28748"
## Pop. Density (per sq. mi.) " 48.0" "124.6"
## Coastline (coast/area ratio) "0.00" "1.26"
## Net migration "23.06" "-4.93"
## Infant mortality (per 1000 births) "163.07" " 21.52"
## GDP ($ per capita) " 700" "4500"
## Literacy (%) "36.0" "86.5"
## Phones (per 1000) " 3.2" "71.2"
## Arable (%) "12.13" "21.09"
## Crops (%) "0.22" "4.42"
## Other (%) "87.65" "74.49"
## Birthrate "46.60" "15.11"
## Deathrate "20.34" " 5.22"
## Agriculture "0.380" "0.232"
## Industry "0.240" "0.188"
## Service "0.380" "0.579"
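The effect of the decimal-mark setting can be illustrated with a small sketch. It uses base R's read.csv on an inline two-row extract (density values copied from the output above) rather than readr, purely so that it runs without any packages. The `dec` argument here plays the role that `decimal_mark` plays in the readr locale, and the object names are illustrative only.

```r
# Minimal sketch (base R stand-in for the readr call above).
# The inline CSV mimics the source file: numbers are quoted and use a decimal comma.
csv_text <- 'Country,Density
Afghanistan,"48,0"
Albania,"124,6"'

# Default decimal mark: "48,0" cannot be parsed as a number, so the column stays text
as_text <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# dec = "," tells R that the comma is the decimal separator
as_number <- read.csv(text = csv_text, dec = ",", stringsAsFactors = FALSE)

is.numeric(as_text$Density)    # FALSE
is.numeric(as_number$Density)  # TRUE
as_number$Density
```

Without the locale setting, every affected column would be imported as character and the later numeric checks and transformations would not apply.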
The Happiness data was merged with the Fact Book country information. The join was chosen so that only countries which appear in the Happiness index were included, as other countries are not of interest to this study.
To ensure a good join, it was considered prudent to check that the spelling of each country name didn’t vary between the data sets. The dplyr::anti_join function was used to find the countries in the Happiness data which didn’t have matches in the Fact Book. Fifteen countries couldn’t be matched initially. Searching the list of countries in the Fact Book identified ten of these under alternate names or spellings. These were renamed in the Fact Book data to ensure consistency. The remaining five countries or areas couldn’t be matched and were omitted by using an inner join when merging the data sets.
# Check set difference between two data sets.
anti_join(happy, countryInfo2010) %>% select(Country) %>% unique() %>% unlist(use.names = FALSE)
## Joining, by = "Country"
## [1] "Taiwan Province of China" "Trinidad and Tobago"
## [3] "South Korea" "North Cyprus"
## [5] "Hong Kong S.A.R., China" "Kosovo"
## [7] "Montenegro" "Bosnia and Herzegovina"
## [9] "Palestinian Territories" "Myanmar"
## [11] "Congo (Brazzaville)" "Congo (Kinshasa)"
## [13] "Ivory Coast" "South Sudan"
## [15] "Central African Republic"
# Create a function to substitute terms in Fact Book data
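countryFix <- function(X, y){
  # fixed = TRUE treats X as a literal string, so "." and "(" are not read as regex symbols
  countryInfo2010$Country <- gsub(pattern = X, replacement = y,
                                  x = countryInfo2010$Country, fixed = TRUE)
  return(countryInfo2010)
}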
# Replace country names to ensure effective merge.
countryInfo2010 <- countryFix("&", "and")
countryInfo2010 <- countryFix("Korea, South", "South Korea")
countryInfo2010 <- countryFix("Congo, Dem. Rep.", "Congo (Kinshasa)")
countryInfo2010 <- countryFix("Congo, Repub. of the", "Congo (Brazzaville)")
countryInfo2010 <- countryFix("Burma", "Myanmar")
countryInfo2010 <- countryFix("Central African Rep.", "Central African Republic")
countryInfo2010 <- countryFix("Cote d'Ivoire", "Ivory Coast")
countryInfo2010 <- countryFix("Taiwan", "Taiwan Province of China")
countryInfo2010 <- countryFix("Hong Kong", "Hong Kong S.A.R., China")
#Perform the inner join and show a sample of the output
data <- inner_join(happy, countryInfo2010, by = "Country", suffix = c("", ".2010"))
data %>% head(2) %>% t()
## [,1] [,2]
## Country "Norway" "Denmark"
## Happiness Rank "1" "2"
## Happiness Score "7.537" "7.522"
## Upper CI95 "7.594445" "7.581728"
## Lower CI95 "7.479556" "7.462272"
## Economy (GDP per Capita) "1.616463" "1.482383"
## Family "1.533524" "1.551122"
## Health (Life Expectancy) "0.7966665" "0.7925655"
## Freedom "0.6354226" "0.6260067"
## Generosity "0.3620122" "0.3552805"
## Trust (Government Corruption) "0.3159638" "0.4007701"
## Dystopia Residual "2.277027" "2.313707"
## Region "WESTERN EUROPE" "WESTERN EUROPE"
## Population "4610820" "5450661"
## Area (sq. mi.) "323802" " 43094"
## Pop. Density (per sq. mi.) " 14.2" "126.5"
## Coastline (coast/area ratio) " 7.77" "16.97"
## Net migration "1.74" "2.48"
## Infant mortality (per 1000 births) "3.70" "4.56"
## GDP ($ per capita) "37800" "31100"
## Literacy (%) "100" "100"
## Phones (per 1000) "461.7" "614.6"
## Arable (%) " 2.87" "54.02"
## Crops (%) "0.00" "0.19"
## Other (%) "97.13" "45.79"
## Birthrate "11.46" "11.13"
## Deathrate " 9.40" "10.36"
## Agriculture "0.021" "0.018"
## Industry "0.415" "0.246"
## Service "0.564" "0.735"
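The match-rename-merge sequence above can be sketched on toy data. This is a base R illustration only (setdiff and merge stand in for dplyr::anti_join and dplyr::inner_join), and the data frames, scores and populations are made-up values, not taken from either source.

```r
# Toy stand-ins for the two data sets (all values hypothetical)
happy_toy <- data.frame(Country = c("Norway", "South Korea", "Kosovo"),
                        Score = c(7.5, 5.8, 5.3),
                        stringsAsFactors = FALSE)
facts_toy <- data.frame(Country = c("Norway", "Korea, South", "Albania"),
                        Population = c(5, 50, 3),
                        stringsAsFactors = FALSE)

# Countries in the happiness data without a match (the anti-join step)
unmatched_before <- setdiff(happy_toy$Country, facts_toy$Country)

# Rename an alternate spelling, treating the pattern as a literal string
facts_toy$Country <- sub("Korea, South", "South Korea", facts_toy$Country, fixed = TRUE)
unmatched_after <- setdiff(happy_toy$Country, facts_toy$Country)

# merge() keeps only keys present in both inputs by default (an inner join)
merged <- merge(happy_toy, facts_toy, by = "Country")

unmatched_before  # "South Korea" "Kosovo"
unmatched_after   # "Kosovo"
nrow(merged)      # 2
```

A useful sanity check after the real merge is the same identity used here: the number of merged rows should equal the number of happiness rows minus the number of countries still unmatched.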
The structure of the data was assessed using the utils::str function. After merging, there are 30 attributes in the data: 2 character, 4 integer and 24 numeric. For the majority of the attributes this assignment is logical; however, to facilitate future analysis, the Region variable was coerced into a factor.
# Encode the Region variable as a factor
data$Region <- data$Region %>% as.factor()
#Check the structure of the dataset
str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 150 obs. of 30 variables:
## $ Country : chr "Norway" "Denmark" "Iceland" "Switzerland" ...
## $ Happiness Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Happiness Score : num 7.54 7.52 7.5 7.49 7.47 ...
## $ Upper CI95 : num 7.59 7.58 7.62 7.56 7.53 ...
## $ Lower CI95 : num 7.48 7.46 7.39 7.43 7.41 ...
## $ Economy (GDP per Capita) : num 1.62 1.48 1.48 1.56 1.44 ...
## $ Family : num 1.53 1.55 1.61 1.52 1.54 ...
## $ Health (Life Expectancy) : num 0.797 0.793 0.834 0.858 0.809 ...
## $ Freedom : num 0.635 0.626 0.627 0.62 0.618 ...
## $ Generosity : num 0.362 0.355 0.476 0.291 0.245 ...
## $ Trust (Government Corruption) : num 0.316 0.401 0.154 0.367 0.383 ...
## $ Dystopia Residual : num 2.28 2.31 2.32 2.28 2.43 ...
## $ Region : Factor w/ 11 levels "ASIA (EX. NEAR EAST)",..: 11 11 11 11 11 11 8 9 11 9 ...
## $ Population : int 4610820 5450661 299388 7523934 5231372 16491461 33098932 4076140 9016596 20264082 ...
## $ Area (sq. mi.) : int 323802 43094 103000 41290 338145 41526 9984670 268680 449964 7686850 ...
## $ Pop. Density (per sq. mi.) : num 14.2 126.5 2.9 182.2 15.5 ...
## $ Coastline (coast/area ratio) : num 7.77 16.97 4.83 0 0.37 ...
## $ Net migration : num 1.74 2.48 2.38 4.05 0.95 2.91 5.96 4.05 1.67 3.98 ...
## $ Infant mortality (per 1000 births): num 3.7 4.56 3.31 4.39 3.57 5.04 4.75 5.85 2.77 4.69 ...
## $ GDP ($ per capita) : int 37800 31100 30900 32700 27400 28600 29800 21600 26800 29000 ...
## $ Literacy (%) : num 100 100 99.9 99 100 99 97 99 99 100 ...
## $ Phones (per 1000) : num 462 615 648 681 405 ...
## $ Arable (%) : num 2.87 54.02 0.07 10.42 7.19 ...
## $ Crops (%) : num 0 0.19 0 0.61 0.03 0.97 0.02 6.99 0.01 0.04 ...
## $ Other (%) : num 97.1 45.8 99.9 89 92.8 ...
## $ Birthrate : num 11.46 11.13 13.64 9.71 10.45 ...
## $ Deathrate : num 9.4 10.36 6.72 8.49 9.86 ...
## $ Agriculture : num 0.021 0.018 0.086 0.015 0.028 0.021 0.022 0.043 0.011 0.038 ...
## $ Industry : num 0.415 0.246 0.15 0.34 0.295 0.244 0.294 0.273 0.282 0.262 ...
## $ Service : num 0.564 0.735 0.765 0.645 0.676 0.736 0.684 0.684 0.707 0.7 ...
The data was checked to ensure that it is in the tidy data format, that is, each variable forms a column and each observation (here, a country) forms a row.
The structure of the data was found to be tidy.
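The variable Population Growth (%) was calculated from the existing variables. Net migration, Birthrate and Deathrate are all rates per 1,000 people, so their sum is divided by 10 to express growth as a percentage, as per the formula below:
\[ \text{Population Growth (\%)} = \frac{\text{Net migration} + \text{Birthrate} - \text{Deathrate}}{10} \]
The new Population Growth (%) variable was evaluated using the dplyr::mutate function.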
#Create Population Growth variable and show some output
data <- data %>% mutate(`Population Growth %` = (`Net migration` + Birthrate - Deathrate)/10 )
data %>% head(2) %>% t()
## [,1] [,2]
## Country "Norway" "Denmark"
## Happiness Rank "1" "2"
## Happiness Score "7.537" "7.522"
## Upper CI95 "7.594445" "7.581728"
## Lower CI95 "7.479556" "7.462272"
## Economy (GDP per Capita) "1.616463" "1.482383"
## Family "1.533524" "1.551122"
## Health (Life Expectancy) "0.7966665" "0.7925655"
## Freedom "0.6354226" "0.6260067"
## Generosity "0.3620122" "0.3552805"
## Trust (Government Corruption) "0.3159638" "0.4007701"
## Dystopia Residual "2.277027" "2.313707"
## Region "WESTERN EUROPE" "WESTERN EUROPE"
## Population "4610820" "5450661"
## Area (sq. mi.) "323802" " 43094"
## Pop. Density (per sq. mi.) " 14.2" "126.5"
## Coastline (coast/area ratio) " 7.77" "16.97"
## Net migration "1.74" "2.48"
## Infant mortality (per 1000 births) "3.70" "4.56"
## GDP ($ per capita) "37800" "31100"
## Literacy (%) "100" "100"
## Phones (per 1000) "461.7" "614.6"
## Arable (%) " 2.87" "54.02"
## Crops (%) "0.00" "0.19"
## Other (%) "97.13" "45.79"
## Birthrate "11.46" "11.13"
## Deathrate " 9.40" "10.36"
## Agriculture "0.021" "0.018"
## Industry "0.415" "0.246"
## Service "0.564" "0.735"
## Population Growth % "0.380" "0.325"
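The derived values in the extract above can be checked by hand. The sketch below recomputes them in base R from the Norway and Denmark figures shown in the output; the data frame and its column names are illustrative only.

```r
# Values copied from the extract above (all three are rates per 1,000 people)
rates <- data.frame(
  Country       = c("Norway", "Denmark"),
  net_migration = c(1.74, 2.48),
  birthrate     = c(11.46, 11.13),
  deathrate     = c(9.40, 10.36)
)

# A rate per 1,000 divided by 10 is a percentage
rates$growth_pct <- (rates$net_migration + rates$birthrate - rates$deathrate) / 10

round(rates$growth_pct, 3)  # 0.380 0.325, matching the extract
```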
The data set was scanned for missing values, inconsistencies and obvious errors.
The data was validated using the validate::check_that function against a set of requirements, which are encoded as the rules in the code below:
# Create the validator-class object using check_that with all the rules (note, some tolerance allowed for rounding errors )
valid <- data %>% check_that(
round(`Arable (%)` + `Crops (%)` + `Other (%)`) == 100,
round(Agriculture*100 + Industry*100 + Service*100) <= 100,
between(`Infant mortality (per 1000 births)`,0,1000) == TRUE,
between(Birthrate,0,1000) == TRUE,
between(Deathrate,0,1000) == TRUE,
between(`Phones (per 1000)`,0,1000) == TRUE,
between(`Literacy (%)`, 0,100)==TRUE,
between(Population, 0, 8e9)==TRUE,
`Area (sq. mi.)` > 0,
abs(`Pop. Density (per sq. mi.)` - Population/`Area (sq. mi.)`) < 1
)
#Summarise the checks by rule
aggregate(valid)
There was 1 observation which failed the validation checks (only rule 2, regarding composition of the economy). This record was extracted, and corresponded to observations of Sierra Leone, where the sum of economy sectors was 1.01. This was deemed an acceptable rounding error (only 1% out).
# Find which values didn't pass, by extracting the values from valid, and applying the all function row-wise to look for any violations to rules.
data[which(!(apply(values(valid), MARGIN = 1, FUN = all))),] %>% select(1,2,Agriculture, Industry, Service) %>%
mutate(total = Agriculture + Industry + Service)
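The behaviour of the economy-composition rule can be reproduced without the validate package. In the sketch below the first two rows use the Afghanistan and Albania shares shown in the earlier extract, while the third row is a hypothetical set of shares chosen to sum to 1.01 (the actual Sierra Leone values are not reproduced here). The 0.015 tolerance in the second check is an assumption for illustration, not a rule used in the report.

```r
# Sector shares: rows 1-2 from the extract above, row 3 hypothetical (sums to 1.01)
econ <- data.frame(
  Agriculture = c(0.380, 0.232, 0.49),
  Industry    = c(0.240, 0.188, 0.31),
  Service     = c(0.380, 0.579, 0.21)
)

# Rule 2 as written in the report: the rounded percentage total must not exceed 100
strict <- with(econ, round(Agriculture * 100 + Industry * 100 + Service * 100) <= 100)

# A looser variant (assumed tolerance) that accepts totals within 1.5% of 1
tolerant <- with(econ, abs(Agriculture + Industry + Service - 1) <= 0.015)

strict    # TRUE TRUE FALSE
tolerant  # TRUE TRUE TRUE
```

Under the strict rule a total of 0.999 passes because it rounds to 100, whereas 1.01 rounds to 101 and fails, which is the pattern reported for the single failing record.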
The mlr::summarizeColumns function was used to scan all variables for missing values. Missing values were found in 8 variables.
data %>% summarizeColumns() %>% select(1,2,3) %>% subset(na > 0)
Only a few values were missing from the Literacy (%), Birthrate, Deathrate and Phones (per 1000) variables. These were imputed using regional averages, via a combination of the dplyr::group_by and Hmisc::impute functions.
# Impute mean values (by region) for missing values in Literacy, Birthrate, Deathrate and Phones.
data = data %>% group_by(Region) %>% mutate(`Literacy (%)` = Hmisc::impute(`Literacy (%)`, fun = mean),
`Birthrate` = Hmisc::impute(`Birthrate`, fun = mean),
`Deathrate` = Hmisc::impute(`Deathrate`, fun = mean),
`Phones (per 1000)` = Hmisc::impute(`Phones (per 1000)`, fun = mean))
# Ungroup the data, as interferes with future operations
data = ungroup(data)
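Group-wise mean imputation can be sketched in base R as follows. This is an illustration on made-up data using stats::ave, not the Hmisc::impute call used above; the helper name `impute_mean` and the toy regions are hypothetical.

```r
# Toy data: two regions, each with one missing literacy value (values hypothetical)
toy <- data.frame(
  Region   = c("A", "A", "A", "B", "B"),
  Literacy = c(90, NA, 80, 60, NA)
)

# Replace missing entries with the mean of the observed entries in the same vector
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# ave() applies the function within each Region and returns values in the original order
toy$Literacy <- ave(toy$Literacy, toy$Region, FUN = impute_mean)

toy$Literacy  # 90 85 80 60 60
```

Each missing value takes the mean of its own region only, so a gap in region A is filled with 85 rather than the overall mean.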
Population growth is a derived feature, and can be recalculated now that its constituent features have been imputed.
# Recalculate the Population Growth to fill missing values.
data <- data %>% mutate(`Population Growth %` = (`Net migration` + Birthrate - Deathrate)/10 )
Numeric data values were scanned for both univariate and multivariate outliers. The following section discusses identified outliers and proposed treatments. All treatments are applied in the next section, Transformation.
To check for the presence of univariate outliers in the data, the z-scores of all numeric variables were calculated, with outliers defined as observations with absolute z-scores greater than 3. Under this criterion, 16 variables were found to contain outliers, and were flagged for closer inspection.
# Find and store references to numeric (and integer) variables
numericCols <- data %>% summarizeColumns() %>% select(type)
numericCols <- which(numericCols[[1]] %in% c("numeric", "integer"))
# Calculate the z scores for each variable and count the number of outliers (absolute z-scores greater than 3)
z_score_outliers <- data %>% select(numericCols) %>% mutate_all(.funs = scores, type = "z") %>%
mutate_all(.funs = function(x) abs(x)>3) %>%
sapply(sum) %>% as.data.frame()
# Show the number of outliers per variable
names(z_score_outliers) = "outliers"
z_score_outliers %>% subset(outliers > 0)
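The z-score criterion itself is simple enough to sketch in base R. The example below assumes the usual definition, the deviation from the mean divided by the sample standard deviation, and uses a made-up vector with one extreme value.

```r
# Toy vector: twenty ordinary values and one extreme value (hypothetical data)
x <- c(1:20, 200)

# z-score: distance from the mean in units of the sample standard deviation
z <- (x - mean(x)) / sd(x)

# Positions flagged under the |z| > 3 criterion
which(abs(z) > 3)  # 21
```

Because the extreme value inflates both the mean and the standard deviation, this criterion is conservative on small or heavily skewed samples, which is one reason the distributions are inspected visually as well.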
In the first instance, a cursory look at the distribution of the data was performed (shown below). Most variables were found to have highly skewed distributions. Outlier treatments should favour transformation/normalisation techniques when distributions are highly skewed.
# Create a labeller function to shorten names (drop the bracketed suffix):
shorten_names <- function(x){
  # Position of the first opening bracket in each name (NA if there is none)
  index <- str_locate(x, pattern = "\\(")[, "start"] - 2
  out <- substr(x, 1, index)
  # Names without a bracket are kept unchanged
  out[is.na(out)] <- x[is.na(out)]
  return(out)
}
#Identify which variables contain outliers (according to z-score tests)
outlier.vars <- data %>% select(numericCols) %>% mutate_all(.funs = scores, type = "z") %>%
summarizeColumns() %>% subset(min < -3 | max >3) %>% select(name) %>% unlist()
#Plot histograms of the data with outliers
data %>% select(one_of(outlier.vars)) %>% gather(key = variable, value = value) %>% ggplot(aes(x = value)) +
geom_histogram() + facet_wrap(.~variable, scales = "free", labeller = as_labeller(shorten_names))
The flagged values in some features were not treated as outliers, as they are valid data points. These include Agriculture, Arable (%), Crops (%) and Industry. All these features were found consistent with related variables when checking for obvious errors/inconsistencies. E.g. given the validation rules, an outlier in Agriculture should have a corresponding outlier in Industry or Service (given Agriculture + Industry + Service = 1), but this is not the case.
The size-sensitive features Population and Area were examined for multivariate outliers using a scatter plot (below). The plot revealed what could be outliers in the data; however, given the right-skewed distributions of these two features, these outliers might be treatable via a transformation. The effect of log10 scaling on these two features was tested using another scatter plot. This plot shows no obvious outliers, and hence these features were flagged for log10 transformation.
data %>% ggplot(aes(x = Population, y = `Area (sq. mi.)`)) + geom_point()
data %>% ggplot(aes(x = Population, y = `Area (sq. mi.)`)) + geom_point() + scale_x_log10() + scale_y_log10()
The remaining variables were grouped into right skewed, left-skewed or largely symmetric distributions. Potential outliers and treatment strategies are discussed.
Variables with Right-skewed Distributions
Coastline (coast/area ratio), Deathrate, GDP ($ per capita), Infant mortality (per 1000 births), Phones (per 1000), Pop. Density (per sq. mi.) and Trust (Government Corruption) all showed right-skewed distributions and some outliers. Given the right skew, the effects of a base 10 logarithmic transformation and a square root transformation were examined. Based on these figures, a log10 transformation is prescribed for the Deathrate, GDP ($ per capita), Infant mortality (per 1000 births) and Phones (per 1000) variables. It is also recommended for the Pop. Density (per sq. mi.) variable, with remaining outliers capped at 1.5 times the Interquartile Range. Similarly, the Trust (Government Corruption) variable is recommended to undergo a square root transform, and then capping (1.5xIQR).
Neither the log10 nor square root transformation is sufficient to treat the Coastline variable. The higher data values in this feature are not expected to be inaccurate, or actual outliers. Rather they are indicative of island nations. As such, removal of apparent outliers isn’t deemed appropriate, and transformation is preferred. Since log10 and square root transforms are insufficient, a binning treatment (equal frequency) was selected.
# Select variables with right skew
data_skewR <- data %>% select(`Coastline (coast/area ratio)`, Deathrate, `GDP ($ per capita)`,
`Infant mortality (per 1000 births)`, `Phones (per 1000)`,
`Pop. Density (per sq. mi.)`, `Trust (Government Corruption)`)
# Convert to long format for ggplot
data_skewR <- data_skewR %>% gather(key = "variable", value = "Null")
#Define theme to remove x-axis labels (from boxplots)
NULLx <- theme(axis.title.x=element_blank(), axis.text.x=element_blank(),axis.ticks.x=element_blank())
# Apply the transformations (and convert back to long format)
data_skewR<- data_skewR %>% mutate(log10 = log10(Null +1), sqrt = sqrt(Null)) %>%
gather(key = "transform", value = value, -1)
# Produce box plots for right skewed distributions
data_skewR %>% subset(transform == "Null") %>% ggplot(aes(y = value)) + geom_boxplot() +
facet_wrap(variable~., scales = "free", ncol = 7, labeller = as_labeller(shorten_names)) +
ggtitle("Right Skewed Variables (No transform)") + NULLx
# Produce box plots for right skewed distributions after log10 transformation
data_skewR %>% subset(transform == "log10") %>% ggplot(aes(y = value)) + geom_boxplot() +
facet_wrap(variable~., scales = "free", ncol = 7, labeller = as_labeller(shorten_names))+
ggtitle("Right Skewed Variables (log10 Transform)")+ NULLx
# Produce box plots for right skewed distributions after square-root transformation
data_skewR %>% subset(transform == "sqrt") %>% ggplot(aes(y = value)) + geom_boxplot() +
facet_wrap(variable~., scales = "free", ncol = 7, labeller = as_labeller(shorten_names)) +
ggtitle("Right Skewed Variables (Square Root Transform)")+ NULLx
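The box plots above are judged visually; a numeric complement is to compare sample skewness before and after each transformation. The sketch below is an illustration on a made-up right-skewed vector, using a simple moment-based skewness estimate. The `skewness` helper is defined here and is not taken from any of the packages loaded above.

```r
# Moment-based sample skewness: third central moment over the 1.5 power of the second
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Hypothetical right-skewed values with one very large observation
x <- c(1, 2, 3, 5, 8, 13, 1000)

# Skewness of the raw data and of the two candidate transformations
c(raw   = skewness(x),
  log10 = skewness(log10(x + 1)),
  sqrt  = skewness(sqrt(x)))
```

A transformation that brings the estimate closer to zero has pulled in the long right tail. On real data the comparison would be run per variable alongside the plots.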
Variables with Left-skewed Distributions
Literacy (%) has a left-skewed distribution and was flagged as containing outliers. Looking at a box plot of this variable (below), the outliers can simply be capped at 1.5xIQR.
# Produce boxplot of Literacy variable
data %>% ggplot(aes(y = `Literacy (%)`)) + geom_boxplot() + NULLx
Variables with Symmetric Distributions
Dystopia Residual, Family, Net migration and Population Growth (%) are all roughly symmetrically distributed, with suspected outliers. From box plots of this data, all but Net migration can be dealt with by capping at 1.5xIQR. Net migration is instead flagged for equal frequency binning.
# Plot symmetrically distributed variables
data %>% select(`Dystopia Residual`, Family, `Net migration`, `Population Growth %`) %>% gather() %>%
ggplot(aes(y = value)) + geom_boxplot() + facet_wrap(.~key, ncol = 4, scales = "free") + NULLx
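Equal frequency binning, flagged above for Coastline and Net migration, assigns roughly the same number of observations to each bin regardless of how far apart the values are. The sketch below is a rank-based base R illustration on made-up values. It is not the infotheo::discretize implementation, and the helper name `equal_freq_bins` is hypothetical.

```r
# Assign each value to one of nbins bins holding (as near as possible) equal counts
equal_freq_bins <- function(x, nbins) {
  # Rank the values (ties broken by position), then scale the ranks onto 1..nbins
  ceiling(rank(x, ties.method = "first") * nbins / length(x))
}

# Hypothetical coastline ratios: mostly small, with a few very large island-like values
coast <- c(0, 0, 0.1, 0.4, 1.3, 2.5, 7.8, 17, 80, 870)

bins <- equal_freq_bins(coast, nbins = 5)
bins         # 1 1 2 2 3 3 4 4 5 5
table(bins)  # two observations in every bin
```

The extreme value 870 simply lands in the top bin alongside 80, so it no longer dominates the scale, at the cost of discarding the distances between values.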
Several variables were flagged for transformation during outlier identification activities. These are summarised in the table below:
| Feature | Transformation | Function |
|---|---|---|
| Population | Logarithmic (Base 10) | base::log10 |
| Area (sq. mi.) | Logarithmic (Base 10) | base::log10 |
| Coastline (coast/area ratio) | Equal Frequency Binning | infotheo::discretize |
| Deathrate | Logarithmic (Base 10) | base::log10 |
| GDP ($ per capita) | Logarithmic (Base 10) | base::log10 |
| Infant mortality (per 1000 births) | Logarithmic (Base 10) | base::log10 |
| Phones (per 1000) | Logarithmic (Base 10) | base::log10 |
| Pop. Density (per sq. mi.) | Logarithmic (Base 10), then Winsorising | base::log10, then user function |
| Trust (Government Corruption) | Square Root, then Winsorising | base::sqrt, then user function |
| Literacy (%) | Winsorising | User function |
| Dystopia Residual | Winsorising | User function |
| Family | Winsorising | User function |
| Population Growth % | Winsorising | User function |
| Net migration | Equal Frequency Binning | infotheo::discretize |
Variable transformations were performed using the dplyr::mutate and dplyr::mutate_at functions to overwrite existing values. The specific functions used for each transformation are detailed in the table above. Note that the winsorising function cap is taken from Stack Overflow (Francois (2018)).
#Perform log10 transformations
#Define the variables to undergo log transformation
log10_var <- c("Population", "Area (sq. mi.)", "Deathrate", "GDP ($ per capita)",
"Infant mortality (per 1000 births)", "Phones (per 1000)", "Pop. Density (per sq. mi.)")
#Use mutate_at to perform the log10 transformation
data <- data %>% mutate_at(.vars = vars(one_of(log10_var)), .fun = function(x) log10(x+1))
#Perform the Square Root Transformation
data <- data %>% mutate(`Trust (Government Corruption)` = sqrt(`Trust (Government Corruption)`))
#Perform the winsorising (function taken from Stack Overflow)
cap <- function(x){
quantiles <- quantile(x, c(.05,0.25,0.75,.95))
x[x < quantiles[2] - 1.5*IQR(x)] <- quantiles[1]
x[x > quantiles[3] + 1.5*IQR(x)] <- quantiles[4]
return(x)
}
#Specify the variables to be capped.
cap_var <- c("Pop. Density (per sq. mi.)", "Trust (Government Corruption)", "Literacy (%)",
"Dystopia Residual", "Family", "Population Growth %")
#Perform the capping
data <- data %>% mutate_at(.vars = vars(one_of(cap_var)), .fun = cap)
#Perform the Equal frequency binning
data[,c("Coastline (coast/area ratio)", "Net migration")] <-
data[,c("Coastline (coast/area ratio)", "Net migration")] %>% discretize(disc = "equalfreq")
data %>% head()
data %>% head(2) %>% t()
## [,1] [,2]
## Country "Norway" "Denmark"
## Happiness Rank "1" "2"
## Happiness Score "7.537" "7.522"
## Upper CI95 "7.594445" "7.581728"
## Lower CI95 "7.479556" "7.462272"
## Economy (GDP per Capita) "1.616463" "1.482383"
## Family "1.533524" "1.551122"
## Health (Life Expectancy) "0.7966665" "0.7925655"
## Freedom "0.6354226" "0.6260067"
## Generosity "0.3620122" "0.3552805"
## Trust (Government Corruption) "0.5621066" "0.5895341"
## Dystopia Residual "2.277027" "2.313707"
## Region "WESTERN EUROPE" "WESTERN EUROPE"
## Population "6.663778" "6.736449"
## Area (sq. mi.) "5.510281" "4.634427"
## Pop. Density (per sq. mi.) "1.181844" "2.105510"
## Coastline (coast/area ratio) "5" "5"
## Net migration "5" "5"
## Infant mortality (per 1000 births) "0.6720979" "0.7450748"
## GDP ($ per capita) "4.577503" "4.492774"
## Literacy (%) "100" "100"
## Phones (per 1000) "2.665299" "2.789299"
## Arable (%) " 2.87" "54.02"
## Crops (%) "0.00" "0.19"
## Other (%) "97.13" "45.79"
## Birthrate "11.46" "11.13"
## Deathrate "1.017033" "1.055378"
## Agriculture "0.021" "0.018"
## Industry "0.415" "0.246"
## Service "0.564" "0.735"
## Population Growth % "0.380" "0.325"
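The behaviour of the cap function used above is easiest to see on a small vector. The sketch below repeats the function as given in the code above and applies it to made-up values. Note that values beyond the 1.5xIQR fences are replaced by the 5th or 95th percentile, not by the fence itself.

```r
# cap() as defined in the transformation code above
cap <- function(x){
  quantiles <- quantile(x, c(.05, 0.25, 0.75, .95))
  x[x < quantiles[2] - 1.5*IQR(x)] <- quantiles[1]
  x[x > quantiles[3] + 1.5*IQR(x)] <- quantiles[4]
  return(x)
}

# Hypothetical data: nineteen ordinary values and one extreme value
x <- c(1:19, 100)

# Upper fence: third quartile plus 1.5 times the interquartile range
upper_fence <- unname(quantile(x, 0.75)) + 1.5 * IQR(x)

capped <- cap(x)

upper_fence      # 29.5
capped[20]       # 23.05, the 95th percentile of the original data
all(capped[1:19] == x[1:19])  # TRUE: values inside the fences are unchanged
```

Only the extreme value is altered. Because the replacement is a percentile of the original data, the capped value always lies within the observed range.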
Bischl, Bernd, Michel Lang, Lars Kotthoff, Julia Schiffner, Jakob Richter, Zachary Jones, Giuseppe Casalicchio, and Mason Gallo. 2018. Mlr: Machine Learning in R. https://CRAN.R-project.org/package=mlr.
Francois, Romain. 2018. “Dataset - How to Replace Outliers with the 5th and 95th Percentile Values in R.” Stack Overflow. Accessed October 20. https://stackoverflow.com/questions/13339685/how-to-replace-outliers-with-the-5th-and-95th-percentile-values-in-r.
Harrell, Frank E, Jr. 2018. Hmisc: Harrell Miscellaneous. https://CRAN.R-project.org/package=Hmisc.
Helliwell, John. 2017. World Happiness Report 2017. Protecteur du Citoyen du Québec. http://deslibris.ca/ID/10090444.
Komsta, Lukasz. 2011. Outliers: Tests for Outliers. https://CRAN.R-project.org/package=outliers.
Korkmaz, Selcuk, Dincer Goksuluk, and Gokmen Zararsiz. 2018. MVN: Multivariate Normality Tests. https://CRAN.R-project.org/package=MVN.
Meyer, Patrick E. 2014. Infotheo: Information-Theoretic Measures. https://CRAN.R-project.org/package=infotheo.
van der Loo, Mark, and Edwin de Jonge. 2017. Deductive: Data Correction and Imputation Using Deductive Methods. https://CRAN.R-project.org/package=deductive.
———. 2018. Validate: Data Validation Infrastructure. https://CRAN.R-project.org/package=validate.
Wickham, Hadley. 2018. Stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.
Wickham, Hadley, and Lionel Henry. 2018. Tidyr: Easily Tidy Data with ’Spread()’ and ’Gather()’ Functions. https://CRAN.R-project.org/package=tidyr.
Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, and Kara Woo. 2018. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2018. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
Wickham, Hadley, Jim Hester, and Romain Francois. 2017. Readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.
Xie, Yihui. 2018. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://CRAN.R-project.org/package=knitr.