MATH2349 Semester 2, 2019

Required packages

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr)
library(tidyr)
library(outliers)
library(editrules)

## Loading required package: igraph

## 
## Attaching package: 'igraph'

## The following object is masked from 'package:tidyr':
## 
##     crossing

## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

## 
## Attaching package: 'editrules'

## The following objects are masked from 'package:igraph':
## 
##     blocks, normalize

## The following object is masked from 'package:tidyr':
## 
##     separate

## The following object is masked from 'package:dplyr':
## 
##     contains

Executive Summary

The objective of this assignment is to import two different dataset and perform a variety data preprocessing tasks on them. The datasets being used pertain to world happiness rankings and world literacy rates. A primary outcome of this task is to merge the datasets using a common variable between the two and in this case it is the common country. Before we commence merging the data sets, both need to be cleaned in order to make it easier to use. This involved removing unnecessary columns and renaming some of the variables.

Once the data was cleaned, it was then merged. Furthermore, some of the data types for the variable had to be changed as such to comply with assignment specifications. Variables that were changed include region (it was changed from character to factor) and literacy_rate_percent_all (changed from character to numeric). Furthermore, a new variable was created to examine the ratio between life_expectancy and happiness score. Finally, the merged dataset was examined for missing values, special values and inconsistencies of which all were dealt with appropriately (see the relevant code comments).

Outliers were then detected using box plots for each of the numeric variables of the merged dataset. The outliers were then subsequently dealt with by imputing the mean. Histograms were also constructed to observe the distributions of the family and happiness score variables. Appropriate transformations were applied to normalise the data

Data

This assignment utilises two different datasets obtained from Kaggle. We would like to acknowlege that we do not own these datasets and are publicaly available. The first dataset is the world happiness report 2017. Here it examines happiness rankings of 155 countries. Each coutry is given a happiness rank based on they achieve from their happiness score. Then the subsequent variables- economic production, social support, life expectancy, freedom, absence of corruption, and generosity - provide an estimate to the extent that these factors played in determining the happiness score. These variables are self explanatory and are slightly changed for easier Rcode manipulaiton .The link to this dataset is available below:

world happiness report: https://www.kaggle.com/unsdsn/world-happiness#2017.csv

The second dataset examines the world literacy rates of 2017 from 150 different countries. The data is relatively simple to understand, it contains 5 key variables which are country( country being examined), literacy_rate_percent_all(combined literacy rates of both genders), male_lietracy_percent_rate(male lieracy rates), female_literacy_percent_rate (female literacy rates) and gender_difference_percent_world (male literacy rates - female literacy rates). The link to the dataset is available below:

world literacy rates report: https://www.kaggle.com/niyamatalmass/youth-unemployment-gdp-and-literacy-percentage#literacy_rate.csv

happiness <- read_csv("D:/RMIT Semester 1/Data Preprocessing/Assignment 3/Final data/World_Happiness_Report_2017.csv")

## Parsed with column specification:
## cols(
##   Country = col_character(),
##   Region = col_character(),
##   `Happiness Rank` = col_double(),
##   `Happiness Score` = col_double(),
##   `Lower Confidence Interval` = col_double(),
##   `Upper Confidence Interval` = col_double(),
##   `Economy (GDP per Capita)` = col_double(),
##   Family = col_double(),
##   `Health (Life Expectancy)` = col_double(),
##   Freedom = col_double(),
##   `Trust (Government Corruption)` = col_double(),
##   Generosity = col_double(),
##   `Dystopia Residual` = col_double()
## )

#Preview the database
head(happiness)

## # A tibble: 6 x 13
##   Country Region `Happiness Rank` `Happiness Scor~ `Lower Confiden~
##   <chr>   <chr>             <dbl>            <dbl>            <dbl>
## 1 Denmark Weste~                1             7.53             7.46
## 2 Switze~ Weste~                2             7.51             7.43
## 3 Iceland Weste~                3             7.50             7.33
## 4 Norway  Weste~                4             7.50             7.42
## 5 Finland Weste~                5             7.41             7.35
## 6 Canada  North~                6             7.40             7.34
## # ... with 8 more variables: `Upper Confidence Interval` <dbl>, `Economy
## #   (GDP per Capita)` <dbl>, Family <dbl>, `Health (Life
## #   Expectancy)` <dbl>, Freedom <dbl>, `Trust (Government
## #   Corruption)` <dbl>, Generosity <dbl>, `Dystopia Residual` <dbl>

#Check the dimensions of the database
dim(happiness)

## [1] 157  13

#Check the colnames of the database
colnames(happiness)

##  [1] "Country"                       "Region"                       
##  [3] "Happiness Rank"                "Happiness Score"              
##  [5] "Lower Confidence Interval"     "Upper Confidence Interval"    
##  [7] "Economy (GDP per Capita)"      "Family"                       
##  [9] "Health (Life Expectancy)"      "Freedom"                      
## [11] "Trust (Government Corruption)" "Generosity"                   
## [13] "Dystopia Residual"

#Here we observe that some columns have spaces present in between them.In that case we replace the spaces with "_" so that we don't have to call quotation marks when we are calling the columns
names(happiness) <- gsub(" ","_",names(happiness))
#To check whether the column names have been sucessfully updated or not
colnames(happiness)

##  [1] "Country"                       "Region"                       
##  [3] "Happiness_Rank"                "Happiness_Score"              
##  [5] "Lower_Confidence_Interval"     "Upper_Confidence_Interval"    
##  [7] "Economy_(GDP_per_Capita)"      "Family"                       
##  [9] "Health_(Life_Expectancy)"      "Freedom"                      
## [11] "Trust_(Government_Corruption)" "Generosity"                   
## [13] "Dystopia_Residual"

#Renaming column Health_(Life_Expectancy) to Life_Expectancy,Economy_(GDP_per_Capita) to Economy_GDP and Trust_(Government_Corruption) to Trust_Government to make it easier to manipulate
colnames(happiness)[colnames(happiness)=="Health_(Life_Expectancy)"] <- "Life_Expectancy"
colnames(happiness)[colnames(happiness)=="Economy_(GDP_per_Capita)"] <- "Economy_GDP"
colnames(happiness)[colnames(happiness)=="Trust_(Government_Corruption)"] <- "Trust_Government"

#Remove dystopia and confidence columns
happiness <- select(happiness,-13,-5,-6)

#Now we import the second database
literacy <- read_csv("D:/RMIT Semester 1/Data Preprocessing/Assignment 3/Final data/literacy_rate_2017.csv")

## Parsed with column specification:
## cols(
##   country = col_character(),
##   literacy_rate_percent_all = col_character(),
##   male_literacy_percent_rate = col_double(),
##   female_literacy_percent_rate = col_double(),
##   gender_difference_percentWorld = col_double(),
##   `86.3` = col_double(),
##   `90.0` = col_character(),
##   `82.7` = col_character(),
##   `7.3` = col_character()
## )

## Warning: 150 parsing failures.
## row col  expected    actual                                                                                   file
##   1  -- 9 columns 5 columns 'D:/RMIT Semester 1/Data Preprocessing/Assignment 3/Final data/literacy_rate_2017.csv'
##   2  -- 9 columns 5 columns 'D:/RMIT Semester 1/Data Preprocessing/Assignment 3/Final data/literacy_rate_2017.csv'
##   3  -- 9 columns 5 columns 'D:/RMIT Semester 1/Data Preprocessing/Assignment 3/Final data/literacy_rate_2017.csv'
##   4  -- 9 columns 5 columns 'D:/RMIT Semester 1/Data Preprocessing/Assignment 3/Final data/literacy_rate_2017.csv'
##   5  -- 9 columns 5 columns 'D:/RMIT Semester 1/Data Preprocessing/Assignment 3/Final data/literacy_rate_2017.csv'
## ... ... ......... ......... ......................................................................................
## See problems(...) for more details.

parse_error <- problems(literacy)
parse_error

## # A tibble: 150 x 5
##      row col   expected  actual   file                                     
##    <int> <chr> <chr>     <chr>    <chr>                                    
##  1     1 <NA>  9 columns 5 colum~ 'D:/RMIT Semester 1/Data Preprocessing/A~
##  2     2 <NA>  9 columns 5 colum~ 'D:/RMIT Semester 1/Data Preprocessing/A~
##  3     3 <NA>  9 columns 5 colum~ 'D:/RMIT Semester 1/Data Preprocessing/A~
##  4     4 <NA>  9 columns 5 colum~ 'D:/RMIT Semester 1/Data Preprocessing/A~
##  5     5 <NA>  9 columns 5 colum~ 'D:/RMIT Semester 1/Data Preprocessing/A~
##  6     6 <NA>  9 columns 5 colum~ 'D:/RMIT Semester 1/Data Preprocessing/A~
##  7     7 <NA>  9 columns 5 colum~ 'D:/RMIT Semester 1/Data Preprocessing/A~
##  8     8 <NA>  9 columns 5 colum~ 'D:/RMIT Semester 1/Data Preprocessing/A~
##  9     9 <NA>  9 columns 5 colum~ 'D:/RMIT Semester 1/Data Preprocessing/A~
## 10    10 <NA>  9 columns 5 colum~ 'D:/RMIT Semester 1/Data Preprocessing/A~
## # ... with 140 more rows

#So the parse error was caused due to the database having 9 columns but  with only 5 of those columns being populated. In porder to rectify this error, the last 4 columns need to be removed.
#We check the head of our second database
head(literacy)

## # A tibble: 6 x 9
##   country literacy_rate_p~ male_literacy_p~ female_literacy~
##   <chr>   <chr>                       <dbl>            <dbl>
## 1 Afghan~ 38.2                         52               24.2
## 2 Albania 97.6                         98.4             96.8
## 3 Algeria 80.2                         87.2             73.1
## 4 Angola  71.1                         82               60.7
## 5 Argent~ 98.1                         98               98.1
## 6 Armenia 99.8                         99.8             99.7
## # ... with 5 more variables: gender_difference_percentWorld <dbl>,
## #   `86.3` <dbl>, `90.0` <chr>, `82.7` <chr>, `7.3` <chr>

#We notice that the columns "86.3","90.0","82.7","7.3" donot have any values and they don't mean anything.They msybe as a result of human error during the compilation of the database. Eitherway, those variables do not make any sense on their own and will therefore be removed.

literacy <- select(literacy,-6:-9)
head(literacy)

## # A tibble: 6 x 5
##   country literacy_rate_p~ male_literacy_p~ female_literacy~
##   <chr>   <chr>                       <dbl>            <dbl>
## 1 Afghan~ 38.2                         52               24.2
## 2 Albania 97.6                         98.4             96.8
## 3 Algeria 80.2                         87.2             73.1
## 4 Angola  71.1                         82               60.7
## 5 Argent~ 98.1                         98               98.1
## 6 Armenia 99.8                         99.8             99.7
## # ... with 1 more variable: gender_difference_percentWorld <dbl>

#It looks OK

#Now we check the dimensions of the database
dim(literacy)

## [1] 150   5

#Check the colnames of the 2nd database
colnames(literacy)

## [1] "country"                        "literacy_rate_percent_all"     
## [3] "male_literacy_percent_rate"     "female_literacy_percent_rate"  
## [5] "gender_difference_percentWorld"

#For the purpose of this task, we do not require the literacy rate to be subsetted on the basis of gender, we only require the literacy rate of people per country. Hence, Gender literacy rate columns along with their difference will be removed.
literacy <- select(literacy,-3:-5)
head(literacy,10)

## # A tibble: 10 x 2
##    country     literacy_rate_percent_all
##    <chr>       <chr>                    
##  1 Afghanistan 38.2                     
##  2 Albania     97.6                     
##  3 Algeria     80.2                     
##  4 Angola      71.1                     
##  5 Argentina   98.1                     
##  6 Armenia     99.8                     
##  7 Azerbaijan  99.8                     
##  8 Bahrain     95.7                     
##  9 Bangladesh  61.5                     
## 10 Belarus     99.7

#Inorder to merge the two databases, there needs to be a mutual column between the datasets. The datasets do posses a mutual variable column but they slightly differ from one another.The variable "country" in the literacy dataset will be changed to "Country" as to match the variable in happiness.

colnames(literacy)[colnames(literacy)=="country"] <- "Country"
#Now we just check
colnames(literacy)

## [1] "Country"                   "literacy_rate_percent_all"

#An inner join will be performed on these data sets by joining them using the common variable "Country".

combined <- inner_join(happiness,literacy,by="Country")
head(combined,15)

## # A tibble: 15 x 11
##    Country Region Happiness_Rank Happiness_Score Economy_GDP Family
##    <chr>   <chr>           <dbl>           <dbl>       <dbl>  <dbl>
##  1 Costa ~ Latin~             14            7.09        1.07  1.02 
##  2 Brazil  Latin~             17            6.95        1.09  1.04 
##  3 Mexico  Latin~             21            6.78        1.12  0.715
##  4 Singap~ South~             22            6.74        1.65  0.868
##  5 Chile   Latin~             24            6.70        1.22  0.906
##  6 Panama  Latin~             25            6.70        1.18  0.989
##  7 Argent~ Latin~             26            6.65        1.15  1.07 
##  8 United~ Middl~             28            6.57        1.57  0.871
##  9 Uruguay Latin~             29            6.54        1.18  1.03 
## 10 Malta   Weste~             30            6.49        1.31  1.10 
## 11 Colomb~ Latin~             31            6.48        1.03  1.02 
## 12 Thaila~ South~             33            6.47        1.09  1.04 
## 13 Saudi ~ Middl~             34            6.38        1.49  0.848
## 14 Qatar   Middl~             36            6.38        1.82  0.880
## 15 Spain   Weste~             37            6.36        1.34  1.13 
## # ... with 5 more variables: Life_Expectancy <dbl>, Freedom <dbl>,
## #   Trust_Government <dbl>, Generosity <dbl>,
## #   literacy_rate_percent_all <chr>

#Now we check the dimensions of the new database that is formed
dim(combined)

## [1] 123  11

#There are 123 rows and 11 columns.We now know that all the records from the happines database have been sucessfully joined with the literacy database as the total columns present in the combined database is 11 (which is total columns in literacy+happiness-"Country")

Understand

In this section, we determined the structure of the combined dataset and changed some of the data types into more useable forms and also to comply with assignment specification.

#We check the structure of our database
str(combined)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 123 obs. of  11 variables:
##  $ Country                  : chr  "Costa Rica" "Brazil" "Mexico" "Singapore" ...
##  $ Region                   : chr  "Latin America and Caribbean" "Latin America and Caribbean" "Latin America and Caribbean" "Southeastern Asia" ...
##  $ Happiness_Rank           : num  14 17 21 22 24 25 26 28 29 30 ...
##  $ Happiness_Score          : num  7.09 6.95 6.78 6.74 6.71 ...
##  $ Economy_GDP              : num  1.07 1.09 1.12 1.65 1.22 ...
##  $ Family                   : num  1.022 1.039 0.715 0.868 0.906 ...
##  $ Life_Expectancy          : num  0.761 0.614 0.711 0.947 0.819 ...
##  $ Freedom                  : num  0.552 0.404 0.377 0.488 0.378 ...
##  $ Trust_Government         : num  0.105 0.142 0.184 0.47 0.115 ...
##  $ Generosity               : num  0.226 0.158 0.117 0.327 0.316 ...
##  $ literacy_rate_percent_all: chr  "97.8" "92.6" "94.4" "96.8" ...

#We find that the majority of our columns are  numeric which is OK although there are some columns that are better to be shifted into another class such as literacy_rate_percent_all which should be shifted to numeric format rather than character format

#Befor we convert the column literacy_rate_percent_all into a numeric type the double quotes must first be removed
combined$literacy_rate_percent_all <- noquote(combined$literacy_rate_percent_all)

#Converting to numeric
combined$literacy_rate_percent_all <- as.numeric(combined$literacy_rate_percent_all)

#Checking the class type
class(combined$literacy_rate_percent_all)

## [1] "numeric"

#The variable region is currently presented as a character but would be more suitable if it were a factor. Hence the variable is converted to a factor. 

combined$Region <- factor(combined$Region,levels = c("Latin America and Caribbean","Southeastern Asia","Middle East and Northern Africa","Western Europe","Central and Eastern Europe","Sub-Saharan Africa","Eastern Asia","Southern Asia"),labels = c("Latin America and Caribbean","Southeast Asia","Middle East and North Africa","Western Europe","Central and East Europe","Sub-Saharan Africa","East Asia","South Asia"))

#Final check of the combined database
str(combined)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 123 obs. of  11 variables:
##  $ Country                  : chr  "Costa Rica" "Brazil" "Mexico" "Singapore" ...
##  $ Region                   : Factor w/ 8 levels "Latin America and Caribbean",..: 1 1 1 2 1 1 1 3 1 4 ...
##  $ Happiness_Rank           : num  14 17 21 22 24 25 26 28 29 30 ...
##  $ Happiness_Score          : num  7.09 6.95 6.78 6.74 6.71 ...
##  $ Economy_GDP              : num  1.07 1.09 1.12 1.65 1.22 ...
##  $ Family                   : num  1.022 1.039 0.715 0.868 0.906 ...
##  $ Life_Expectancy          : num  0.761 0.614 0.711 0.947 0.819 ...
##  $ Freedom                  : num  0.552 0.404 0.377 0.488 0.378 ...
##  $ Trust_Government         : num  0.105 0.142 0.184 0.47 0.115 ...
##  $ Generosity               : num  0.226 0.158 0.117 0.327 0.316 ...
##  $ literacy_rate_percent_all: num  97.8 92.6 94.4 96.8 97.3 95 98.1 93.8 98.4 94.1 ...

Tidy & Manipulate Data I

For a dataset to be considered a stidy it must adhere to three rule: 1. Each variable must have its own column 2. Each observation must have it own row 3. Each value must have its own cell

Upon inspection, the dataset has been found to adhere to the tidy principles and is therefore considered as tidy.

#After carefully observing our data we have come to our conclusion tht the given data is in tidy format and thus no changes are needed

Tidy & Manipulate Data II

In this component, a new variable was mutated using the life_expectancy variable as well as the happiness_score.

#Here we create some new variables using the arrange,group_by and mutate functions

#this new variable will look at the ratio of life expectancy to happiness score.
mutate(combined,happines_Life_Expectancy = (Life_Expectancy*100)/Happiness_Score)

## # A tibble: 123 x 12
##    Country Region Happiness_Rank Happiness_Score Economy_GDP Family
##    <chr>   <fct>           <dbl>           <dbl>       <dbl>  <dbl>
##  1 Costa ~ Latin~             14            7.09        1.07  1.02 
##  2 Brazil  Latin~             17            6.95        1.09  1.04 
##  3 Mexico  Latin~             21            6.78        1.12  0.715
##  4 Singap~ South~             22            6.74        1.65  0.868
##  5 Chile   Latin~             24            6.70        1.22  0.906
##  6 Panama  Latin~             25            6.70        1.18  0.989
##  7 Argent~ Latin~             26            6.65        1.15  1.07 
##  8 United~ Middl~             28            6.57        1.57  0.871
##  9 Uruguay Latin~             29            6.54        1.18  1.03 
## 10 Malta   Weste~             30            6.49        1.31  1.10 
## # ... with 113 more rows, and 6 more variables: Life_Expectancy <dbl>,
## #   Freedom <dbl>, Trust_Government <dbl>, Generosity <dbl>,
## #   literacy_rate_percent_all <dbl>, happines_Life_Expectancy <dbl>

Scan I

The following code chunk displays the sum of Na and special values for each column. Inconsistencies are also accounted for.

#Scanning for any missing values or NA values in every column
colSums(is.na(combined))

##                   Country                    Region 
##                         0                         0 
##            Happiness_Rank           Happiness_Score 
##                         0                         0 
##               Economy_GDP                    Family 
##                         0                         0 
##           Life_Expectancy                   Freedom 
##                         0                         0 
##          Trust_Government                Generosity 
##                         0                         0 
## literacy_rate_percent_all 
##                         0

#We determine that there are no null values present in the database as all the columns show that they have zero missing or NA values

#We now try to determine if our database has any infinite or special values present in it or not
#The function below will check firstly if every value for x is numeric. If it is, the fucntion will then proceed to check if the value is non finite. This function is being applied to our combined dataset but without using the character columns (as character columns will return false to is.numeric() and halt the fucntion)

special <- function(x){
  if(is.numeric(x)) !is.finite(x)
}
special_val <- (sapply(combined,special))
special_val <- as.data.frame(special_val[-c(1,2)])
colSums(special_val)

##            Happiness_Rank           Happiness_Score 
##                         0                         0 
##               Economy_GDP                    Family 
##                         0                         0 
##           Life_Expectancy                   Freedom 
##                         0                         0 
##          Trust_Government                Generosity 
##                         0                         0 
## literacy_rate_percent_all 
##                         0

colnames(combined)

##  [1] "Country"                   "Region"                   
##  [3] "Happiness_Rank"            "Happiness_Score"          
##  [5] "Economy_GDP"               "Family"                   
##  [7] "Life_Expectancy"           "Freedom"                  
##  [9] "Trust_Government"          "Generosity"               
## [11] "literacy_rate_percent_all"

#Hence here again we determine that there are no special values

#  NOTE: it is important to note that the following variables Economy_GDP, Family, Life_Expectancy, Freedom, Generosity, Trust Government describe the extent to which these factors contribute in evaluating the happiness score in each country. so upon initial inspection of the data, having a life_expectancy of 0 or freedom score of 0 would not make sense. However when viewed from this angle, it just means that these variables were not considered in determining the happiness scores of those countries. Hence, the constraints needed to be designed with this in mind.
(Rule1 <- editset(c("Happiness_Rank>=0")))

## 
## Edit set:
## num1 : 0 <= Happiness_Rank

(Rule2 <- editset(c("Happiness_Score>=0","Happiness_Score<10")))

## 
## Edit set:
## num1 : 0 <= Happiness_Score
## num2 : Happiness_Score < 10

(Rule5 <- editset(c("Economy_GDP>=0")))

## 
## Edit set:
## num1 : 0 <= Economy_GDP

(Rule6 <- editset(c("Family>=0")))

## 
## Edit set:
## num1 : 0 <= Family

(Rule7 <- editset(c("Life_Expectancy>=0","Life_Expectancy<1")))

## 
## Edit set:
## num1 : 0 <= Life_Expectancy
## num2 : Life_Expectancy < 1

(Rule8 <- editset(c("Freedom>=0","Freedom<1")))

## 
## Edit set:
## num1 : 0 <= Freedom
## num2 : Freedom < 1

(Rule9 <- editset(c("Trust_Government>=0","Trust_Government<1")))

## 
## Edit set:
## num1 : 0 <= Trust_Government
## num2 : Trust_Government < 1

(Rule10 <- editset(c("Generosity>=0","Generosity<1")))

## 
## Edit set:
## num1 : 0 <= Generosity
## num2 : Generosity < 1

(Rule11 <- editset(c("Dystopia_Residual>=0")))

## 
## Edit set:
## num1 : 0 <= Dystopia_Residual

(Rule12 <- editset(c("literacy_rate_percent_all>=0","literacy_rate_percent_all<100")))

## 
## Edit set:
## num1 : 0 <= literacy_rate_percent_all
## num2 : literacy_rate_percent_all < 100

sum(violatedEdits(Rule1,combined))

## [1] 0

sum(violatedEdits(Rule2,combined))

## [1] 0

sum(violatedEdits(Rule5,combined))

## [1] 0

sum(violatedEdits(Rule6,combined))

## [1] 0

sum(violatedEdits(Rule7,combined))

## [1] 0

sum(violatedEdits(Rule8,combined))

## [1] 0

sum(violatedEdits(Rule9,combined))

## [1] 0

sum(violatedEdits(Rule10,combined))

## [1] 0

sum(violatedEdits(Rule12,combined))

## [1] 0

#If the sum of violatedEdits equals to 0, it means that the variables adhere to the constraints and have no errors. this was done to reduce the amount of output generated from just using violatedEdits(Rule#,combined).

Scan II

Here in this chunk, outliers were detected using boxplots and were then subsequently dealt with by imputing the mean.

#boxplots of numeric values were plotted (excluding happiness rank). All the variables were scanned for outliers. Variables showing outliers were shown to save space. After dealing with outliers, new boxplots were coded but not displayed to save space. 
#boxplot(combined$Happiness_Score, main="Happiness Score")
#boxplot(combined$Economy_GDP, main="Economy GDP")
boxplot(combined$Family, main="Family")

#boxplot(combined$Life_Expectancy, main="Life Expectancy")
#boxplot(combined$Freedom, main="Freedom")
boxplot(combined$Trust_Government, main= "Trust of Government")

boxplot(combined$Generosity, main= "Generosity")

boxplot(combined$literacy_rate_percent_all, main= "Literacy rate percent all")

#Box plots shown for Family, Trust_Government, Generosity and literacy_rate_percent_all. Their outliers were confirmed by using the tukeys method of outlier detection. This method will determine the upper and lower outlier fences.
#Lower fence = Q1-1.5IQR
#Upper fence= Q3+1.5IQR
#We then check the outliers for Generosity where we  get

quantile(combined$Generosity)

##       0%      25%      50%      75%     100% 
## 0.000000 0.143525 0.204400 0.275040 0.819710

IQR(combined$Generosity)

## [1] 0.131515

#Using Tukey's method we then get that any value ranging between 0.053 and 0.4723 are outliers
combined %>% filter(Generosity<0.04109|Generosity>0.37617)

## # A tibble: 20 x 11
##    Country Region Happiness_Rank Happiness_Score Economy_GDP Family
##    <chr>   <fct>           <dbl>           <dbl>       <dbl>  <dbl>
##  1 Malta   Weste~             30            6.49       1.31   1.10 
##  2 Thaila~ South~             33            6.47       1.09   1.04 
##  3 Malays~ South~             47            6.00       1.25   0.880
##  4 Russia  Centr~             56            5.86       1.23   1.05 
##  5 Lithua~ Centr~             60            5.81       1.27   1.06 
##  6 Indone~ South~             79            5.31       0.951  0.876
##  7 Bhutan  South~             84            5.20       0.853  0.908
##  8 Kyrgyz~ Centr~             85            5.18       0.560  0.954
##  9 Morocco Middl~             90            5.15       0.841  0.386
## 10 Tunisia Middl~             98            5.04       0.977  0.432
## 11 Greece  Weste~             99            5.03       1.25   0.755
## 12 Laos    South~            102            4.88       0.680  0.550
## 13 Iran    Middl~            105            4.81       1.12   0.389
## 14 Nepal   South~            107            4.79       0.446  0.697
## 15 Sri La~ South~            117            4.42       0.973  0.848
## 16 Myanmar South~            119            4.39       0.341  0.700
## 17 Kenya   Sub-S~            122            4.36       0.523  0.762
## 18 Haiti   Latin~            136            4.03       0.341  0.296
## 19 Cambod~ South~            140            3.91       0.556  0.538
## 20 Syria   Middl~            156            3.07       0.747  0.149
## # ... with 5 more variables: Life_Expectancy <dbl>, Freedom <dbl>,
## #   Trust_Government <dbl>, Generosity <dbl>,
## #   literacy_rate_percent_all <dbl>

#We then impute them with the mean of the entire column
combined$Generosity[which(combined$Generosity<0.04109|combined$Generosity>0.37617)] <- mean(combined$Generosity,na.rm = T)
#Now we will check the boxplot again for any changes
#boxplot(combined$Generosity, main= "Genorosity - Outliers imputed")

#This process must be repeated for Family, Trust_Governement and Literacy_rate_percent_all
#Family outlier detection and imputation
quantile(combined$Family)

##       0%      25%      50%      75%     100% 
## 0.000000 0.625095 0.809750 0.947110 1.168100

IQR(combined$Family)

## [1] 0.322015

combined %>% filter(Family<0.1420725|Family>1.4301325)

## # A tibble: 3 x 11
##   Country Region Happiness_Rank Happiness_Score Economy_GDP Family
##   <chr>   <fct>           <dbl>           <dbl>       <dbl>  <dbl>
## 1 Benin   Sub-S~            153            3.48       0.395  0.104
## 2 Afghan~ South~            154            3.36       0.382  0.110
## 3 Togo    Sub-S~            155            3.30       0.281  0    
## # ... with 5 more variables: Life_Expectancy <dbl>, Freedom <dbl>,
## #   Trust_Government <dbl>, Generosity <dbl>,
## #   literacy_rate_percent_all <dbl>

combined$Family[which(combined$Family<0.1420725|combined$Family>1.4301325)] <- mean(combined$Family,na.rm = T)
#Now we will check the boxplot again for any changes
#boxplot(combined$Family, main= "Family - Outliers imputed")

#Trust_Governemt outlier detection and imputation
quantile(combined$Trust_Government)

##       0%      25%      50%      75%     100% 
## 0.000000 0.052965 0.087220 0.136415 0.505210

IQR(combined$Trust_Government)

## [1] 0.08345

combined %>% filter(Trust_Government< -0.07221|Trust_Government>0.26159)

## # A tibble: 8 x 11
##   Country Region Happiness_Rank Happiness_Score Economy_GDP Family
##   <chr>   <fct>           <dbl>           <dbl>       <dbl>  <dbl>
## 1 Singap~ South~             22            6.74       1.65   0.868
## 2 United~ Middl~             28            6.57       1.57   0.871
## 3 Saudi ~ Middl~             34            6.38       1.49   0.848
## 4 Qatar   Middl~             36            6.38       1.82   0.880
## 5 Uzbeki~ Centr~             49            5.99       0.736  1.17 
## 6 Turkme~ Centr~             65            5.66       1.08   1.04 
## 7 Georgia Centr~            126            4.25       0.838  0.192
## 8 Rwanda  Sub-S~            152            3.52       0.328  0.616
## # ... with 5 more variables: Life_Expectancy <dbl>, Freedom <dbl>,
## #   Trust_Government <dbl>, Generosity <dbl>,
## #   literacy_rate_percent_all <dbl>

combined$Trust_Government[which(combined$Trust_Government< -0.07221|combined$Trust_Government>0.26159)] <- mean(combined$Trust_Government,na.rm = T)
#Now we will check the boxplot again for any changes
#boxplot(combined$Trust_Government, main= "Trust_Government - Outliers imputed")

#Literacy_rate_percent_all outlier detection and imputation
quantile(combined$literacy_rate_percent_all)

##   0%  25%  50%  75% 100% 
## 19.1 75.1 93.9 98.1 99.9

IQR(combined$literacy_rate_percent_all)

## [1] 23

combined %>% filter(literacy_rate_percent_all< 40.6|literacy_rate_percent_all>132.6)

## # A tibble: 8 x 11
##   Country Region Happiness_Rank Happiness_Score Economy_GDP Family
##   <chr>   <fct>           <dbl>           <dbl>       <dbl>  <dbl>
## 1 Mali    Sub-S~            135            4.07       0.313  0.863
## 2 Niger   Sub-S~            142            3.86       0.133  0.605
## 3 South ~ Sub-S~            143            3.83       0.394  0.185
## 4 Chad    Sub-S~            144            3.76       0.422  0.632
## 5 Burkin~ Sub-S~            145            3.74       0.320  0.631
## 6 Guinea  Sub-S~            151            3.61       0.224  0.311
## 7 Benin   Sub-S~            153            3.48       0.395  0.750
## 8 Afghan~ South~            154            3.36       0.382  0.750
## # ... with 5 more variables: Life_Expectancy <dbl>, Freedom <dbl>,
## #   Trust_Government <dbl>, Generosity <dbl>,
## #   literacy_rate_percent_all <dbl>

combined$literacy_rate_percent_all[which(combined$literacy_rate_percent_all< 40.6|combined$literacy_rate_percent_all>132.6)] <- mean(combined$literacy_rate_percent_all,na.rm = T)
#Now we will check the boxplot again for any changes
#boxplot(combined$literacy_rate_percent_all, main= "Literacy Rate - Outliers imputed")

Transform

The Two variables that we applied transformations to are the happiness scores and family datasets. When examining the happiness score data set, the data demonstrated a slight left skew. In order to normalise the data, we determined that is was best to apply a square transformation to it.

Similarly, the family dataset was skewed to the left but this time, it was more apparent. A square transformation was applied to the data in order to obtain a normal distribution.Below there are two histograms for family, the red represents the data pre-transformation. The blue histogram shows the variable post-transformation.

#Family
hist(combined$Family, col = "red",main = "Histogram of Family", xlab = "family score")

y <- (combined$Family)^2
hist(y, xlim = c(0,2.0), ylim = c(0,30), col = "Blue",main = "Histogram of Family transformed",xlab = "family score")