MATH2349 Data Wrangling

Required packages

# Importing Data
library(readr)
library(readxl)

#general use and data transformation
library(dplyr)
library(tidyr)
library(stringr)

# graphs
library(ggplot2)

# NAs
library(validate)
library(Hmisc)
library(editrules)

# outliers
library(MVN)
library(forecast)

# transformation
library(car)

Executive Summary

The life expectancy and the GDP/Population datasets are read using read_csv() and read_xlsx(). After diagnosing mismatches between the data frames with anti_join(), they are merged by country name using inner_join(). head() is used to evaluate the merged dataset. class() is called for each column using sapply() and str() is used to explore the dimensions and highlight incorrectly assigned classes and column names. After imputing missing values with NA, type conversions are applied and less ambiguous column names are chosen. According to the principles of tidy data the data structure is determined untidy. To convert to a tidy format the gendered population and life expectancy columns are separated into Sex, Population and life expectancy using unite(), gather() and separate(). GDP per population is created using mutate(). NA’s are Identified with is.na() and handled by imputing the median using Hmisc::impute() or by inferring the values from other observations. A user-created function identifies special values and using the editrules package, rules are created to scan for nonsensical values or errors. Outliers are identified for each numeric variable using boxplots. After removing extreme outliers, the remaining outliers are capped using a user-defined function. The capping was not successful due to the extreme skew of the data. Instead, transformations are explored as a more appropriate method of handling outliers. Distributions are determined using histograms and logarithmic transformations are tested. Q-Q plots identify which transformation best approximates a normal distribution and the most appropriate transformation is applied. Boxplots are then recreated for the transformed variables to identify potential outliers. These are capped using the user-created cap function. A positive linear relationship between Population and GDP is confirmed with a scatter plot before a Chi-Square Q-Q plot is created to test for multivariate outliers using the chi-square distribution critical value approach.

Data

Life Expectancy Data

The life expectancy dataset taken from https://www.kaggle.com/amansaxena/lifeexpectancy provides the life expectancy of 223 Countries by Sex and ranks them according to that life expectancy. The data itself was sourced from a table on wikipedia which was sourced from cia.gov. While the kaggle dataset was posted in 2018, the listed date of the information on the CIA dataset was 2017 so it is safe to assume these values correspond with the year 2017. The data set is public domain.

Variables

Rank: The rank of the given country according to it’s life expectancy with rank 1 having the highest life expectancy
Country: The name of the given country
Overall Life: The overall life expectancy of that given country. Life expectancy is an estimate of the age members of a population will be when they die if current mortality rates for each age remained the same.
Male Life: Male life expectancy for the given Country
Female Life: female life expectancy for the given Country
Continent: The Continent which the given country lies in
- “Africa”
- “Asia”
- “Europe”
- “North America”
- “Oceania”
- “South America”

GDP and Population

The GDP/Population dataset was taken from the https://databank.worldbank.org/source/world-development-indicators#

The World Bank is a financial institution which offers loans and grants to lower income countries, they have a substantial open database that mostly focuses on developmental indicators for over 200 countries. There databank allows you to specify the variables, format and layout of a dataset. This specific dataset was taken from their World Development Indicators database, all available countries were selected. Series, which is the variables to be included in the dataset, were selected as gdp(current US$), population(total), population (Female) and population (Male) and the year was selected as 2019. The dataset was then downloaded as an xlsx. The GDP/Population dataset also contains observation for a number of groups of countries for example, European Union. As we are more interested in single countries information these rows will be excluded from the dataset. All data on the worldbank website is listed as public domain.

Variables

Country Name: The name of the country. Country here is a vague term used to define different regional areas. Many of the Countries listed may not be considered countries by certain definitions or according to different political groups.
Country Code: A unique 3 character identifier for each country
GDP (Current US$): GDP is described as “…the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products.” These figures were converted to current USD using 2019s official exchange rates for each country. The data was taken from World Bank national accounts data and OECD National Accounts data files.
Population (Total): Population information in the dataset was sourced from the United Nations World Population Prospects. “Overall” refers to the total population for the given Country in 2019.
Population (Female): The female population for the given country in 2019.
Population (Male): The male population for the given country in 2019.
Time: This variable is the year the data was taken from. As it contains 2019 for every observation it will be excluded from the dataframe
Time Code: A unique 6 character identifier for the year 2019. This variable will also be excluded from the dataset.

The dataset also contains the following metadata for each Country

Income Group: Income groups uses 2019 gross national income (GNI) per capita of each country to separate them into the following groups:
- low income: $1,035 or less
- lower middle income: $1,036 - 4,045
- upper middle income: $4,046 - 12,535
- high income: $12,536 or more
Currency Unit: The local currency for the given country

As the dataset is an excel file which contains the metadata for each country on a separate sheet, those tables will need to be combined before the full dataset can be merged with the life expectancy data. GDP data and metadata share a common variable Country Code which will be used to perform the join. There are also a lot of unneeded columns in the metadata set that will be removed before joining.

Reading Data

# Importing life expectancy Data
lifeExpectancy <- read_csv("Life_expectancy_dataset.csv")

## Parsed with column specification:
## cols(
##   Rank = col_double(),
##   Country = col_character(),
##   `Overall Life` = col_double(),
##   `Male Life` = col_double(),
##   `Female Life` = col_double(),
##   Continent = col_character()
## )

head(lifeExpectancy)

# Reading GDP data
gdp <- read_xlsx("Data_Extract_From_World_Development_Indicators (1).xlsx", sheet = "Data")

# Excluding first 2 columns
gdp <- gdp[1:217,3:8]

# Reading Country Metadata
metadata <- read_xlsx("Data_Extract_From_World_Development_Indicators (1).xlsx", sheet = "Country - Metadata")
metadata <- metadata %>% select(`Code`, `Income Group`, `Currency Unit`)

# joining metadata to GDP
gdpComplete <- left_join(gdp, metadata, by = c("Country Code" = "Code"))
head(gdpComplete)

After downloading the life expectancy csv from kaggle.com and moving it to the project directory, Life expectancy data was read using read_csv() from the readr Package.

The XLSX containing the GDP/population data was moved into the project directory and read into r using the read_xlsx() function from the readxl package. sheet = "data" was entered to read the correct table. The first two columns and the rows that did not contain country information were removed by subsetting the data set. The metadata was read using the argument sheet = "Country - Metadata", The rows Code, Income Group, Currency Unit were selected using select() and the data was joined to gdp using a left_join()

As can be expected when joining countries based on country name there are a considerable number of discrepancies, using anti_join to diagnose mismatches between the life expectancy dataset and the GDP complete dataset 47 mismatches were found. A number of these were just differences in naming conventions for example “Korea, Rep” versus “South Korea”. In these cases the GDP/Population datasets Country Name was changed to match the Life expectancy dataset. There were also a number of cases where Countries or principalities were divided into different regions between datasets, for example, the GDP/Population dataset contains “west Bank” and “Gaza Strip” combined whereas Life Expectancy separates them into two regions. These observations will be ignored when merging the final dataset for the sake of clarity and to avoid introducing misleading data. There were also a number of countries like Jersey and Geurnsey which had no information at all in the GDP/Population dataset.

# Diagnosing mismatches
anti_join(lifeExpectancy, gdpComplete, by = c("Country" = "Country Name")) %>% head()

# Changing Country Names
gdpComplete$`Country Name`[str_which(gdpComplete$`Country Name`, "Macao")] <- "Macau; China"
gdpComplete$`Country Name`[str_which(gdpComplete$`Country Name`, "Hong Kong")] <-   "Hong Kong, China"
gdpComplete$`Country Name`[str_which(gdpComplete$`Country Name`, "Korea, Rep")] <- "South Korea"
gdpComplete$`Country Name`[str_which(gdpComplete$`Country Name`, "Korea, Dem")] <- "North Korea"
gdpComplete$`Country Name`[str_which(gdpComplete$`Country Name`, "France")] <- "France, metropolitan"
gdpComplete$`Country Name`[str_which(gdpComplete$`Country Name`, "U.S.")] <- "U.S. Virgin Islands"
gdpComplete$`Country Name`[str_which(gdpComplete$`Country Name`, "Sint Maarten")] <- "Sint Maarten"
gdpComplete$`Country Name`[str_which(gdpComplete$`Country Name`, "Mariana")] <- "Northern Mariana Islands; US"
gdpComplete$`Country Name`[str_which(gdpComplete$`Country Name`, "Lucia")] <- "Saint Lucia"
gdpComplete$`Country Name`[str_which(gdpComplete$`Country Name`, "Brunei")] <- "Brunei"
gdpComplete$`Country Name`[str_which(gdpComplete$`Country Name`, "Slovak")] <- "Slovakia"
gdpComplete$`Country Name`[str_which(gdpComplete$`Country Name`, "Macedonia")] <-   "Republic of Macedonia"
gdpComplete$`Country Name`[str_which(gdpComplete$`Country Name`, "Venezuela")] <- "Venezuela"
gdpComplete$`Country Name`[str_which(gdpComplete$`Country Name`, "Nevis")] <-  "Saint Kitts and Nevis"
lifeExpectancy$Country[str_which(lifeExpectancy$Country, "Republic of China")] <- "China"
gdpComplete$`Country Name`[str_which(gdpComplete$`Country Name`, "Vincent")] <- "Saint Vincent and the Grenadines"
gdpComplete$`Country Name`[str_which(gdpComplete$`Country Name`, "Syria")] <- "Syria"
gdpComplete$`Country Name`[str_which(gdpComplete$`Country Name`, "Micronesia")] <- "Federated States of Micronesia"
gdpComplete$`Country Name`[str_which(gdpComplete$`Country Name`, "Egypt")] <- "Egypt"
gdpComplete$`Country Name`[str_which(gdpComplete$`Country Name`, "Bahamas")] <- "The Bahamas"
gdpComplete$`Country Name`[str_which(gdpComplete$`Country Name`, "Iran")] <- "Iran"
gdpComplete$`Country Name`[str_which(gdpComplete$`Country Name`, "Kyrg")] <-  "Kyrgyzstan"
gdpComplete$`Country Name`[str_which(gdpComplete$`Country Name`, "Russia")] <- "Russia"
gdpComplete$`Country Name`[str_which(gdpComplete$`Country Name`, "Myanmar")] <- "Burma"
gdpComplete$`Country Name`[str_which(gdpComplete$`Country Name`, "Yemen")] <- "Yemen"
lifeExpectancy$Country[str_which(lifeExpectancy$Country, "S�o Tom� and Pr�ncipe")] <- "Sao Tome and Principe"    
gdpComplete$`Country Name`[str_which(gdpComplete$`Country Name`, "Gambia")] <- "The Gambia"
gdpComplete$`Country Name`[str_which(gdpComplete$`Country Name`, "Lao")] <- "Laos"
gdpComplete$`Country Name`[str_which(gdpComplete$`Country Name`, "Congo, Rep")] <- "Republic of the Congo"
gdpComplete$`Country Name`[str_which(gdpComplete$`Country Name`, "Congo, Dem")] <- "Democratic Republic of the Congo"
lifeExpectancy$Country[str_which(lifeExpectancy$Country, "C�te d'Ivoire")]  <-  "Cote d'Ivoire"
gdpComplete$`Country Name`[str_which(gdpComplete$`Country Name`, "Eswatini")] <- "Swaziland"

# Checking remaining mismatches
anti_join(lifeExpectancy, gdpComplete, by = c("Country" = "Country Name")) %>% nrow()

## [1] 17

After, changing the Country Names to conform with each other the total number of join mismatches from life expectancy to gdp was reduced to 17. As the relationship of interest is between gdp and life expectancy the inner_join will be used to retain only observations that occur in both datasets. This will avoid issues with regions being divided differently between datasets and create a more cohesive dataset. If the goal was to create a complete dataset with all given gdp and life expectancy information, a full_join would be more suitable. However, this would create a considerable number of NA’s as well as inconsistencies with regions.

# Joining Datasets
lifeExpectancyGDP <- inner_join(gdpComplete, lifeExpectancy, by = c("Country Name" = "Country"))
head(lifeExpectancyGDP)

Understand

Summarising the Dataset

Using sapply(), class() can be applied to each column. This will return the data type for each column.

# checking classes
sapply(lifeExpectancyGDP, class)

##                           Country Name                           Country Code 
##                            "character"                            "character" 
##     GDP (current US$) [NY.GDP.MKTP.CD]        Population, total [SP.POP.TOTL] 
##                            "character"                            "character" 
##   Population, male [SP.POP.TOTL.MA.IN] Population, female [SP.POP.TOTL.FE.IN] 
##                            "character"                            "character" 
##                           Income Group                          Currency Unit 
##                            "character"                            "character" 
##                                   Rank                           Overall Life 
##                              "numeric"                              "numeric" 
##                              Male Life                            Female Life 
##                              "numeric"                              "numeric" 
##                              Continent 
##                            "character"

As the stringsAsFactors argument defaults to FALSE each string variable is imported with the data type of character. While this is correct in the case of Country Name and Country Code, Income Group should be converted to an ordered factor and Continent would also be more useful as a factor. Currency Unit could be considered a factor as a number of countries share Currency Units, however due to the large number of unique values it would be more useful to keep it as character. Likewise, Life expectancy rank is an ordinal variable and could be could be converted to a factor, however as there is a different level for each observation and they are already ordered, factorising rank would be inefficient and not useful. Instead Rank will be converted to an integer as rank should always contain whole numbers.

The class function highlights another problem with the dataset, the numeric information taken from the gdp dataset (‘Population, male’, ‘Population, female’, ‘Population, total’ and ‘GDP.Current.USD’) were all imported as character variables. Looking at the data set the reason for this is that missing values are entered as ‘..’ instead of NA. this will need to be rectified before a proper data type conversion can take place.

# Checking data structure
str(lifeExpectancyGDP)

## tibble [206 x 13] (S3: tbl_df/tbl/data.frame)
##  $ Country Name                          : chr [1:206] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##  $ Country Code                          : chr [1:206] "AFG" "ALB" "DZA" "ASM" ...
##  $ GDP (current US$) [NY.GDP.MKTP.CD]    : chr [1:206] "19101353832.737125" "15278077446.864292" "169988236398.12585" ".." ...
##  $ Population, total [SP.POP.TOTL]       : chr [1:206] "38041754" "2854191" "43053054" "55312" ...
##  $ Population, male [SP.POP.TOTL.MA.IN]  : chr [1:206] "19529725" "1453180" "21749666" ".." ...
##  $ Population, female [SP.POP.TOTL.FE.IN]: chr [1:206] "18512029" "1401011" "21303388" ".." ...
##  $ Income Group                          : chr [1:206] "Low income" "Upper middle income" "Lower middle income" "Upper middle income" ...
##  $ Currency Unit                         : chr [1:206] "Afghan afghani" "Albanian lek" "Algerian dinar" "U.S. dollar" ...
##  $ Rank                                  : num [1:206] 221 59 79 103 8 207 85 75 118 80 ...
##  $ Overall Life                          : num [1:206] 51.3 78.3 76.8 75.4 82.8 56 76.5 77.1 74.6 76.8 ...
##  $ Male Life                             : num [1:206] 49.9 75.7 75.5 72.4 80.6 54.8 74.4 74 71.4 73.7 ...
##  $ Female Life                           : num [1:206] 52.7 81.2 78.2 78.5 85.1 57.2 78.8 80.4 78.3 79.9 ...
##  $ Continent                             : chr [1:206] "Asia" "Europe" "Africa" "Oceania" ...

The str() function displays the structure of the data including data types, attribute names, dimensions of the dataset as well as a sample for each attribute. alternatively we could use the attributes() function to return the colnames, rownames and class of the dataset and the dim() function to return dimensions.

Str() shows 206 observations of 13 columns. This equals the 223 observations from the original dataset minus the 17 mismatches which were excluded with the inner join The column names highlight some redundant information in a number of columns from the gdp dataset as well as some ambiguously labeled columns, these should be changed to make the dataset more coherent.

# Changing Column Names
colnames(lifeExpectancyGDP) <- c("Country.Name", "Country.Code", "GDP.Current.USD", "Population.Total", "Population.Male", "Population.Female", "Income.Group", "Currency.Unit", "Life.Exp.Rank", "Life.Exp.Overall", "Life.Exp.Male", "Life.Exp.Female", "Continent")

Converting Datatypes

# Replacing missing values with NA's
lifeExpectancyGDP$`GDP.Current.USD`[lifeExpectancyGDP$`GDP.Current.USD` == ".."] <- NA
lifeExpectancyGDP$`Population.Total`[lifeExpectancyGDP$`Population.Total` == ".."] <- NA
lifeExpectancyGDP$`Population.Male`[lifeExpectancyGDP$`Population.Male` == ".."] <- NA
lifeExpectancyGDP$`Population.Female`[lifeExpectancyGDP$`Population.Female` == ".."] <- NA

# Converting to numeric
lifeExpectancyGDP$`GDP.Current.USD` <- as.numeric(lifeExpectancyGDP$`GDP.Current.USD`)
lifeExpectancyGDP[4:6] <- lapply(lifeExpectancyGDP[4:6], as.integer)
lifeExpectancyGDP$`Life.Exp.Rank` <- as.integer(lifeExpectancyGDP$`Life.Exp.Rank`)

By searching the relevant columns for cells that contain “..” and replacing those cells with NA it was possible to use the as.integer() function to convert the rows into integers. lapply() was utilised with the required subset of columns to automatically apply as.integer to the required columns. integer was chosen over double as population will always be counted in integers. Rank was also converted to an integer using as.integer().

# Converting Continent Factor
unique(lifeExpectancyGDP$Continent)

## [1] "Asia"          "Europe"        "Africa"        "Oceania"      
## [5] "North America" "South America"

lifeExpectancyGDP$Continent <- factor(lifeExpectancyGDP$Continent, 
                                      levels = c("Africa", "Asia", "Europe", "North America", "Oceania", "South America"),
                                      ordered = FALSE)
class(lifeExpectancyGDP$Continent)

## [1] "factor"

levels(lifeExpectancyGDP$Continent)

## [1] "Africa"        "Asia"          "Europe"        "North America"
## [5] "Oceania"       "South America"

Using the unique() function we can see that continent variable contains 6 unique values. By using the factor() and setting the levels as those values and ordered = FALSE, continent was converted to a factor. Class() and levels() were then called to ensure that the conversion was successful and the appropriate levels were applied.

# Converting Income.Group Factor
unique(lifeExpectancyGDP$`Income.Group`)

## [1] "Low income"          "Upper middle income" "Lower middle income"
## [4] "High income"

lifeExpectancyGDP$`Income.Group` <- factor(lifeExpectancyGDP$`Income.Group`,
                                           levels = c("Low income", "Lower middle income", "Upper middle income", "High income"),
                                           labels = c("Low", "Lower Middle", "Upper Middle", "High"),
                                           ordered = TRUE)
class(lifeExpectancyGDP$`Income.Group`)

## [1] "ordered" "factor"

levels(lifeExpectancyGDP$`Income.Group`)

## [1] "Low"          "Lower Middle" "Upper Middle" "High"

Calling unique() on Income.Group, 5 clearly ordered levels can be seen. Using a similar process the Income.Group column was converted to a factor this time labeling the levels as “Low”, “Lower Middle”, “Upper Middle”, “High” and setting ordered = TRUE Then class() and levels() were again called to ensure the conversion was successful.

# Rechecking Data Structure
str(lifeExpectancyGDP)

## tibble [206 x 13] (S3: tbl_df/tbl/data.frame)
##  $ Country.Name     : chr [1:206] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##  $ Country.Code     : chr [1:206] "AFG" "ALB" "DZA" "ASM" ...
##  $ GDP.Current.USD  : num [1:206] 1.91e+10 1.53e+10 1.70e+11 NA 3.15e+09 ...
##  $ Population.Total : int [1:206] 38041754 2854191 43053054 55312 77142 31825295 97118 44938712 2957731 106314 ...
##  $ Population.Male  : int [1:206] 19529725 1453180 21749666 NA NA 15744777 46852 21918496 1391270 50445 ...
##  $ Population.Female: int [1:206] 18512029 1401011 21303388 NA NA 16080518 50266 23020216 1566461 55869 ...
##  $ Income.Group     : Ord.factor w/ 4 levels "Low"<"Lower Middle"<..: 1 3 2 3 4 2 4 3 3 4 ...
##  $ Currency.Unit    : chr [1:206] "Afghan afghani" "Albanian lek" "Algerian dinar" "U.S. dollar" ...
##  $ Life.Exp.Rank    : int [1:206] 221 59 79 103 8 207 85 75 118 80 ...
##  $ Life.Exp.Overall : num [1:206] 51.3 78.3 76.8 75.4 82.8 56 76.5 77.1 74.6 76.8 ...
##  $ Life.Exp.Male    : num [1:206] 49.9 75.7 75.5 72.4 80.6 54.8 74.4 74 71.4 73.7 ...
##  $ Life.Exp.Female  : num [1:206] 52.7 81.2 78.2 78.5 85.1 57.2 78.8 80.4 78.3 79.9 ...
##  $ Continent        : Factor w/ 6 levels "Africa","Asia",..: 2 3 1 5 3 1 4 6 3 6 ...

calling str() again we can see that each variable has the correct datatype and that the column names have been changed correctly.

Tidy & Manipulate Data I

According to Hadley Wickham and Grolemund, the three principles of tidy data are:

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

Comparing each original data set to the tidy data principles, neither dataset could be considered tidy. The GDP dataset contains male population, female population and total population as column headers. These are in fact two different variables, sex and population and should be separated into different columns before the data could be considered tidy. Likewise the life expectancy dataset has the same issue with male life, female life and overall life being combinations of two separate variables as column headers, in order to produce a tidy dataset these will need to be separated into sex and life expectancy.

The combined dataset inherits the untidiness from the original data sets and therefore, is also considered untidy. To tidy the data structure, the gendered columns will need to be separated into 3 columns, Sex, Life.Expectancy and Population. In order to properly separate all 6 columns into rows that share the variable sex using tidyr, a few steps must be taken.

# Combining columns based on sex
tidy1 <- lifeExpectancyGDP %>% unite(col = "Overall", `Population.Total`, `Life.Exp.Overall`, sep = "/")
tidy2 <- tidy1 %>% unite(col = "Male", `Population.Male`, `Life.Exp.Male`, sep = "/")
tidy3 <- tidy2 %>% unite(col = "Female", `Population.Female`, `Life.Exp.Female`, sep = "/")
tidy4 <- tidy3 %>% gather(4:6, key = "Sex", value = "Values")
lifeExpGDPTidy <- tidy4 %>% tidyr::separate(Values, into = c("Population", "Life.Expectancy"), sep = "/")

# Converting back to Numeric
lifeExpGDPTidy$Population <- as.integer(lifeExpGDPTidy$Population)

## Warning: NAs introduced by coercion

lifeExpGDPTidy$`Life.Expectancy` <- as.numeric(lifeExpGDPTidy$`Life.Expectancy`)

dim(lifeExpGDPTidy)

## [1] 618  10

head(lifeExpGDPTidy)

First, the overall population total and life expectancy total were combined into a single column called overall with the values separated by a forward slash. This was repeated for the male columns and the female columns. Then using gather() the three remaining columns were combined into two variables sex and values. For the data to be tidy though each cell needs to contain a single value so using separate() the values column was split into Population and Life.Expectancy by setting the sep argument to “/”. Because using “/” as the sep argument coerced the values into character, they were converted back to numerics using lapply() and the as.numeric() function.

Calling dim() we can see that the dataset now contains 618 rows instead of 206. This makes sense as now each sex value occupies a row which means that each country has three rows. Looking at the head of the dataset and comparing it to the untidy dataset it can be seen that while the data has been converted from wide to long format the information has been preserved and the three principles of tidy data are upheld.

Tidy & Manipulate Data II

Creating GDP per Capita

# Adding GDP.per.Capita variable
lifeExpGDPTidy <- mutate(lifeExpGDPTidy, "GDP.per.Capita" = `GDP.Current.USD` / `Population`)
head(lifeExpGDPTidy)

The mutate function from the tidyr package was used to produce a new variable. This variable titled “GDP.per.Capita” takes the GDP value for a given row and divides it by the Population value. In countries that have either an NA for population value or an NA for GDP value this column will return an NA. Using the head function we can see that the dataset now contains a new column titled “GDP.per.Capita”. This variable was automatically coerced into a double when it was calculated.

Scan I

NA’s

As outlined in the understand step a number of values in the population columns were missing and instead contained “..”. These values were replaced with NA’s in order to convert those columns into numerics. Colsums() combined with the is.na() funtion is used to determine the toal number of NA’s in each variable.

# Checking number of NA's per column
colSums(is.na(lifeExpGDPTidy))

##    Country.Name    Country.Code GDP.Current.USD    Income.Group   Currency.Unit 
##               0               0              93               0               0 
##   Life.Exp.Rank       Continent             Sex      Population Life.Expectancy 
##               0               0               0              43               0 
##  GDP.per.Capita 
##             103

# Checking number of NA's per column filtered by overall
colSums(is.na(filter(lifeExpGDPTidy, `Sex` == "Overall")))

##    Country.Name    Country.Code GDP.Current.USD    Income.Group   Currency.Unit 
##               0               0              31               0               0 
##   Life.Exp.Rank       Continent             Sex      Population Life.Expectancy 
##               0               0               0               1               0 
##  GDP.per.Capita 
##              31

The colsums() function together with the is.na() function shows that there are 93 NA’s in the gdp column, 43 in population and 103 in GDP.per.Capita. Filtering for Sex == “Overall” the number of NA’s in the population Column was reduced to one. This means that in the data taken from the gdp dataset all except 1 row had total population but a large number of rows didnt record a male population or female population. According to https://ourworldindata.org/gender-ratio#gender-ratio-across-the-world the distribution of sex globally is 49.6% female, 50.4% male. one way to handle the missing data for countries that have a known total population would be to divide the total according to that sex ratio. This process becomes more difficult now that the data has been converted to long format, but can still been done with a few steps

# Separating Population information
population <- lifeExpGDPTidy %>% select(`Country.Name`, `Sex`, `Population`)

# Transforming from long to wide format
population <- population %>% spread(key = `Sex`, value = `Population`)

# Replacing the missing values
population$Female[is.na(population$Female)] <- population$Overall[is.na(population$Female)] * .496
population$Male[is.na(population$Male)] <- population$Overall[is.na(population$Male)] * .504

# Transforming from wide to long format
population <- population %>% gather(2:4, key = "Sex", value = "Population")

# Returning data to the dataset
lifeExpGDPTidy <- full_join(lifeExpGDPTidy[-9], population, by = c("Country.Name", "Sex"))

First the population, sex and country columns were separated into a new dataframe called population.
This data frame was then converted to wide format using spread() with sex as the key and population as the value.
Then the NA rows of both the male and female columns were subsetted using is.na() and imputed with .504 * the overall population for males and .496 * the overall population for females.
The dataframe was then converted back to long format using gather and rejoined to the original dataset using “Country.Name” and “Sex” as the by = argument.
The original dataset was also subsetted to remove its original population data that was replaced by the new population data.

# Checking remaining NA
lifeExpGDPTidy %>% filter(is.na(lifeExpGDPTidy$Population))

# Excluding Eritrea
lifeExpGDPTidy <- lifeExpGDPTidy %>% filter(!is.na(lifeExpGDPTidy$Population))

The one country that has no population information at all also has no gdp information meaning it is not particularly helpful in investigating relationships in the dataset. This country was excluded from the dataset entirely using filter(!is.na()).

# Checking number of NA's per column
colSums(is.na(lifeExpGDPTidy))

##    Country.Name    Country.Code GDP.Current.USD    Income.Group   Currency.Unit 
##               0               0              90               0               0 
##   Life.Exp.Rank       Continent             Sex Life.Expectancy  GDP.per.Capita 
##               0               0               0               0             100 
##      Population 
##               0

# Checking number of NA's per column filtered by overall
colSums(is.na(filter(lifeExpGDPTidy, `Sex` == "Overall")))

##    Country.Name    Country.Code GDP.Current.USD    Income.Group   Currency.Unit 
##               0               0              30               0               0 
##   Life.Exp.Rank       Continent             Sex Life.Expectancy  GDP.per.Capita 
##               0               0               0               0              30 
##      Population 
##               0

Checking NA’s again, population now has no NA’s. GDP however has NA’s for 30 countries.

GDP information is harder to deal with. As there is no information on each countries gdp for the previous or following years, the missing value can not be predicted or inferred from those values. Also, given the large variability of the data it may be misleading to impute the mean or median.

Alternatively, as around 14% of the GDP data is made up of missing values excluding the NA’s (using na.omit()) would most likely create a biased dataset as the missing values would typically belong to smaller countries. Furthermore, the reason that the NA’s exist for certain countries is itself worth investigating. For example Monaco which has the highest life expectancy also has no recorded information for GDP.

# Summary of NA's
summary(lifeExpGDPTidy %>% filter(is.na(lifeExpGDPTidy$`GDP.Current.USD`), `Sex` == "Overall"))

##  Country.Name       Country.Code       GDP.Current.USD       Income.Group
##  Length:30          Length:30          Min.   : NA     Low         : 4   
##  Class :character   Class :character   1st Qu.: NA     Lower Middle: 2   
##  Mode  :character   Mode  :character   Median : NA     Upper Middle: 7   
##                                        Mean   :NaN     High        :17   
##                                        3rd Qu.: NA                       
##                                        Max.   : NA                       
##                                        NA's   :30                        
##  Currency.Unit      Life.Exp.Rank            Continent     Sex           
##  Length:30          Min.   :  1.00   Africa       :1   Length:30         
##  Class :character   1st Qu.: 41.25   Asia         :6   Class :character  
##  Mode  :character   Median : 69.50   Europe       :6   Mode  :character  
##                     Mean   : 84.67   North America:7                     
##                     3rd Qu.:134.75   Oceania      :8                     
##                     Max.   :217.00   South America:2                     
##                                                                          
##  Life.Expectancy GDP.per.Capita   Population      
##  Min.   :52.40   Min.   : NA    Min.   :   18008  
##  1st Qu.:73.10   1st Qu.: NA    1st Qu.:   42719  
##  Median :77.45   Median : NA    Median :   94539  
##  Mean   :76.15   Mean   :NaN    Mean   : 7283761  
##  3rd Qu.:79.95   3rd Qu.: NA    3rd Qu.: 4647340  
##  Max.   :89.50   Max.   : NA    Max.   :82913906  
##                  NA's   :30

A brief summary of countries with missing GDP values shows that the majority of missing values belong to the high Income.Group (17 countries out of 30) and are generally less populated countries (population median of 94,539). This means excluding these values would bias the data towards larger population countries

Ultimately the method chosen to handle missing data depends on the type of investigation that is being performed. If the data set was merely being used to check different countries attributes at a glance then the NA’s should not be excluded from the dataset. However if we were investigating the overall relationship between multiple variables based on GDP then imputing with the median may be a better option.

For the purpose of this assignment we will assume that the data is for correlation testing and handle the Missing NA’s by imputing the median. The median was chosen over the mean as the GDP distribution is heavily right skewed and the median is generally a better measurement of central tendency in heavily skewed datasets.

# Imputing the median
lifeExpGDPTidy$`GDP.Current.USD` <- impute(lifeExpGDPTidy$`GDP.Current.USD`, fun = median)

# Converting from impute numeric to numeric
lifeExpGDPTidy$`GDP.Current.USD` <- as.numeric(lifeExpGDPTidy$`GDP.Current.USD`)

# Recalculating GDP.per.Capita
lifeExpGDPTidy <- mutate(lifeExpGDPTidy[-10], "GDP.per.Capita" = `GDP.Current.USD` / `Population`)

# Rechecking NA's
colSums(is.na(lifeExpGDPTidy))

##    Country.Name    Country.Code GDP.Current.USD    Income.Group   Currency.Unit 
##               0               0               0               0               0 
##   Life.Exp.Rank       Continent             Sex Life.Expectancy      Population 
##               0               0               0               0               0 
##  GDP.per.Capita 
##               0

Using the impute() function from the Hmisc package and setting the fun = median the missing values were imputed with the median gdp value. Now that the missing values for GDP and Population had been dealt with the GDP.per.Capita column was recalculated and added to the dataset using the mutate() function. calling colsums() and is.na() again we can see that all na’s for GDP and population have been dealt with. As leaving the class as impute numeric causes problems with some of the other packages used in this assignment the class was converted back to numeric using is.numeric(). Checking NA’s again the total number in the dataset has been reduced to 0.

checking for Special values, NANs and infinite numbers

# Creating is.special function
is.special <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}

# Summing is.special values
sapply(lifeExpGDPTidy, function(x) sum(is.special(x)))

##    Country.Name    Country.Code GDP.Current.USD    Income.Group   Currency.Unit 
##               0               0               0               0               0 
##   Life.Exp.Rank       Continent             Sex Life.Expectancy      Population 
##               0               0               0               0               0 
##  GDP.per.Capita 
##               0

A function was created called is.special() this function checks if a value is numeric, then if the value is numeric it will check whether it is infinite or a NaN and return True if either of those conditions are met. Using sapply() this function was applied to the dataset and the results were summed. There are no NaNs or special characters in the dataset.

Scanning for obvious errors

To check for nonsensical data, for example, negative value for population, the editrules package was utilised. Values that would be considered obvious errors and will be checked for are - GDP values less than 0 (a country cannot have a negative domestic product) - Life expectancy less than 0 or greater than 120 - Life expectancy rank less than 1 or greater than 300 (A country can not be ranked 0 or lower and the number of regions or countries in the data is well below 300 so anything above that would be considered an error) - Population less than 0 or greater than 3 billion (A population can not be negative and any country that has a population of greater than 5 billion would most likely be an error worth investigating)

# Creating Rules
(Rule1 <- editset(c("`GDP.Current.USD` >= 0",
                    "`Life.Expectancy` >= 0", "`Life.Expectancy` <= 150",
                    "`Life.Exp.Rank` >= 1", "`Life.Exp.Rank` <= 500",
                    "`Population` >= 0", "`Population` <= 5000000000", 
                    "GDP.per.Capita >= 0")))

## 
## Edit set:
## num1 : 0 <= GDP.Current.USD
## num2 : 0 <= Life.Expectancy
## num3 : Life.Expectancy <= 150
## num4 : 1 <= Life.Exp.Rank
## num5 : Life.Exp.Rank <= 500
## num6 : 0 <= Population
## num7 : Population <= 5e+09
## num8 : 0 <= GDP.per.Capita

# Checking for violations
colSums(violatedEdits(Rule1, lifeExpGDPTidy))

## num1 num2 num3 num4 num5 num6 num7 num8 
##    0    0    0    0    0    0    0    0

After creating the ruleset according to the rules listed above the violatedEdits() function was applied with colSums() to get the total number of violations for each rule. A 0 was returned for each meaning that no value violated the defined rules.

Scan II

Creating boxplots provides a cursory glance at which numeric variables may potentially have outliers. For the purpose of identifying outliers the data was filtered to only provide the overall values and not values for male and female. If male and female values were included it would skew the dataset to the right for population and “GDP.per.Capita.USD” as each population observation would have two similar observations roughly half the size.

# Creating boxplots
lifeExpGDPTidyOverall <- lifeExpGDPTidy %>% filter(`Sex` == "Overall")
lifeExpGDPTidyOverall$Life.Exp.Rank %>% boxplot(main = "Life Expectancy Rank")
box <- lifeExpGDPTidyOverall$Life.Expectancy %>% boxplot(main = "Life Expectancy")
lifeExpGDPTidyOverall$Population %>% boxplot(main = "Population")
as.numeric(lifeExpGDPTidyOverall$GDP.Current.USD) %>% boxplot(main = "GDP Current USD")
lifeExpGDPTidyOverall$GDP.per.Capita %>% boxplot(main = "GDP per Capita USD")

As expected Life Expectancy Rank has no outliers. Life Expectancy as well seems to be more or less normally distributed with no outliers. Note: A graphical glitch when exporting to html caused two outliers from the population boxplot to clip into the life expectancy boxplot, these did not appear in the r markdown file or when i tested for outliers on life expectancy

GDP and population on the other hand are extremely right skewed and contain a considerable number of outliers which will need to be handled. GDP.per.capita is also incredibly varied but will be worth re-investigating once the outliers for population and GDP have been handled.

# ScatterPlot of GDP and population
lifeExpGDPTidyOverall %>% plot(GDP.Current.USD ~ Population, data = ., ylab = "GDP (Current USD)", xlab = "Population", main = "GDP by Population")

A scatter plot provides a quick way to detect abnormal observations. Plotting GDP against Population highlights 3 extreme outliers belonging to the two most populated countries and the country with the highest GDP. These values are several times greater than their nearest neighbors and will need to be handled. To reduce their impact on statistical analysis these extreme outliers will be removed.

Handling outliers using Capping

# Removing Extreme Outliers
lifeExpGDPCap <- lifeExpGDPTidy %>% filter(`Country.Name` != "China", `Country.Name` != "India", `Country.Name` != "United States")

# checking number of outliers
# Population
populationBox <- lifeExpGDPCap %>% filter(`Sex` == "Overall") %>% select(`Population`) %>% boxplot(main = "Population")
length(populationBox$out)

## [1] 19

# GDP
gdpBox <- lifeExpGDPCap %>% filter(`Sex` == "Overall") %>% select(`GDP.Current.USD`) %>% boxplot(main = "GDP")
length(gdpBox$out)

## [1] 25

By storing the overall population boxplot as populationBox and gdpBox the number of outliers can be counted using length(populationBox$out). After removing the three extreme outliers, there are a total of 19 outliers for population and 25 for GDP. As this is more than 10% of countries in the dataset removing these outliers would result in a lot of lost information. To handle these outliers a function will be created to identify which values fall above 1.5 times the inter-quartile range above the third quartile. The values that fall above this range will be replaced with the 95th percentile. The function will also check whether a value falls 1.5 times the inter-quartile range below the first quartile and cap it at the 5th percentile. This function will be applied to each Sex separately to maintain the relationships between each. If it was applied to only overall population there would be situations where a countries overall population would be less than it’s male or female population population

# Creating Cap function
cap <- function(x){
    quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
    x[ x < quantiles[2] - 1.5 * IQR(x) ] <- quantiles[1]
    x[ x > quantiles[3] + 1.5 * IQR(x) ] <- quantiles[4]
    x
}

# Capping Population
lifeExpGDPCap$Population[lifeExpGDPCap$Sex == "Overall"] <- lifeExpGDPCap$Population[lifeExpGDPCap$Sex == "Overall"] %>% cap()
lifeExpGDPCap$Population[lifeExpGDPCap$Sex == "Female"] <- lifeExpGDPCap$Population[lifeExpGDPCap$Sex == "Female"] %>% cap()
lifeExpGDPCap$Population[lifeExpGDPCap$Sex == "Male"] <- lifeExpGDPCap$Population[lifeExpGDPCap$Sex == "Male"] %>% cap()

# Capping GDP
lifeExpGDPCap$GDP.Current.USD[lifeExpGDPCap$Sex == "Overall"] <- lifeExpGDPCap$GDP.Current.USD[lifeExpGDPCap$Sex == "Overall"] %>% cap()
lifeExpGDPCap$GDP.Current.USD[lifeExpGDPCap$Sex == "Female"] <- lifeExpGDPCap$GDP.Current.USD[lifeExpGDPCap$Sex == "Female"] %>% cap()
lifeExpGDPCap$GDP.Current.USD[lifeExpGDPCap$Sex == "Male"] <- lifeExpGDPCap$GDP.Current.USD[lifeExpGDPCap$Sex == "Male"] %>% cap()


box <- boxplot(lifeExpGDPCap$Population[lifeExpGDPCap$Sex == "Overall"], main = "Capped Population")
length(box$out)

## [1] 19

box <- boxplot(lifeExpGDPCap$GDP.Current.USD[lifeExpGDPCap$Sex == "Overall"], main = "Capped GDP")
length(box$out)

## [1] 26

Rechecking the number of outliers after capping at the 95th percentile, the number of outliers was not reduced. Due to the extreme spread of population and the large number of outliers, the 95th percentile of the data was more than 1.5 * the inter-quartile range from the third quartile. This resulted in all data being capped at a value that would still be considered an outlier. Alternatively imputing with the median or excluding these values would fix this problem, however, as the suggested outliers are most likely not a result of data entry/processing error and are instead due to the natural spread and skew of the variable, Imputation or exclusion would not be appropriate. As the data is not normally distributed the z score approach for outlier detection would also not be suitable.

Given how heavily right-skewed the dataset is, for statistical testing a better approach to reducing outliers would be to transform the offending variables using a logarithmic transformation. By transforming the data to approximate a normal distribution it would also be possible to apply Multivariate outlier detection between GDP and Population using Mahalanobis distance with Q-Q plots. This approach will be explored in the following section before re-investigating outliers. As GDP.per.Capita is a product of 2 heavily skewed variables, it’s outliers would be a result of those variables. Once those variables have been transformed and GDP.per.Capita has been recalculated, it will be reexamined for outliers.

Transform

As normal distributions are a pre-requisite for a number of statistical tests, for example, linear regression, transforming variables is necessary to increase normality. To know which type of transformation to apply to population and GDP, histograms will be used to analyze the shape of the distributions. The extreme outliers, China, India, and the United States will again be excluded from the dataset as they deviate too far from the other values. For the other outliers we will use the uncapped values.

# Removing extreme outliers
lifeExpGDPTidy <- lifeExpGDPTidy %>% filter(`Country.Name` != "United States", `Country.Name` != "China", `Country.Name` != "India")

# Checking distribution of data
hist(lifeExpGDPTidy$Population[lifeExpGDPTidy$Sex == "Overall"], main = "Population", xlab = "Overall Country Population", breaks = 20)
hist(lifeExpGDPTidy$GDP.Current.USD[lifeExpGDPTidy$Sex == "Overall"], main = "GDP", xlab = "GDP (Current USD)", breaks = 20)

The histograms show that both Population and GDP are heavily right-skewed. A log transformation may be effective in reducing spread and increasing symmetry.

Transforming GDP

# Transformation Histograms
hist(sqrt(lifeExpGDPTidy$GDP.Current.USD)[lifeExpGDPTidy$Sex == "Overall"], main = "sqrt(GDP)", xlab = "GDP (Current USD)", breaks = 15)
hist(log10(lifeExpGDPTidy$GDP.Current.USD)[lifeExpGDPTidy$Sex == "Overall"], main = "log10(GDP)", xlab = "GDP (Current USD)", breaks = 15)
hist(log(lifeExpGDPTidy$GDP.Current.USD)[lifeExpGDPTidy$Sex == "Overall"], main = "log(GDP)", xlab = "GDP (Current USD)", breaks = 15)
hist(BoxCox(lifeExpGDPTidy$GDP.Current.USD, lambda = "auto")[lifeExpGDPTidy$Sex == "Overall"], main = "BoxCox(GDP)", xlab = "GDP (Current USD)", breaks = 15)

Plotting 4 logarithmic transformations for GDP side by side. The most effective at increasing normality appears to be the log10 transformation. The square root transformation was not powerful enough and the box cox transformation (with lambda = “auto”) appears to have over-corrected to a slightly left skewed distribution. Another approach for testing Normality would be Q-Q plot. Plotting side be side Q-Q plots using the qqPlot(dist = "norm") function from the car package gives a clear comparison of the transformations normality.

# Testing normality using Q-Q plots
sqrt(lifeExpGDPTidy$GDP.Current.USD)[lifeExpGDPTidy$Sex == "Overall"] %>% qqPlot(dist = "norm", main = "Square Root GDP")

## [1] 91 69

log10(lifeExpGDPTidy$GDP.Current.USD)[lifeExpGDPTidy$Sex == "Overall"] %>% qqPlot(dist = "norm", main = "Log10 GDP")

## [1] 189 128

log(lifeExpGDPTidy$GDP.Current.USD)[lifeExpGDPTidy$Sex == "Overall"] %>% qqPlot(dist = "norm", main = "Log GDP")

## [1] 189 128

BoxCox(lifeExpGDPTidy$GDP.Current.USD, lambda = "auto")[lifeExpGDPTidy$Sex == "Overall"] %>% qqPlot(dist = "norm", main = "BoxCox GDP")

## [1] 189 128

The Q-Q plots confirm that the square root transformation still has a right skew and the Box Cox falls outside of the bounds of normality towards the left of the distribution. Apart from one or two values on the left side of the distribution the log10 and log transformations appear to fall within the bounds of normal distribution. The log10 transformation will be applied to GDP and the column name will be applied to reflect the transformation.

# Transforming variable
lifeExpGDPTidy$GDP.Current.USD <- log10(lifeExpGDPTidy$GDP.Current.USD)
names(lifeExpGDPTidy)[names(lifeExpGDPTidy) == "GDP.Current.USD"] <- "Log10.GDP.Current.USD"

Transforming Population

# Transformation Histograms
hist(sqrt(lifeExpGDPTidy$Population)[lifeExpGDPTidy$Sex == "Overall"], main = "sqrt(GDP)", xlab = "Population", breaks = 20)
hist(log10(lifeExpGDPTidy$Population)[lifeExpGDPTidy$Sex == "Overall"], main = "log10(GDP)", xlab = "Population", breaks = 20)
hist(BoxCox(lifeExpGDPTidy$Population, lambda = "auto")[lifeExpGDPTidy$Sex == "Overall"], main = "BoxCox(GDP)", xlab = "Population", breaks = 20)
hist(BoxCox(lifeExpGDPTidy$Population, lambda = 1/10)[lifeExpGDPTidy$Sex == "Overall"], main = "BoxCox(GDP)", xlab = "Population", breaks = 20)

Applying similar transformations square root was again not a powerful enough transformation. The log and Box Cox (lambda = “auto”) transformations were both moderately successful however both exhibit a left skew. A Box Cox transformation with lambda set as 0.1 appears to approximate normality the most. This transforms the variable according to Y^(0.1) = √(Y).

# Testing normality using Q-Q plots
sqrt(lifeExpGDPTidy$Population)[lifeExpGDPTidy$Sex == "Overall"] %>% qqPlot(dist = "norm", main = "Square Root Population")

## [1]  83 140

log10(lifeExpGDPTidy$Population)[lifeExpGDPTidy$Sex == "Overall"] %>% qqPlot(dist = "norm", main = "Log10 Population")

## [1] 189 128

BoxCox(lifeExpGDPTidy$Population, lambda = "auto")[lifeExpGDPTidy$Sex == "Overall"] %>% qqPlot(dist = "norm", main = "BoxCox Population")

## [1] 189 128

BoxCox(lifeExpGDPTidy$Population, lambda = 1/10)[lifeExpGDPTidy$Sex == "Overall"] %>% qqPlot(dist = "norm", main = "BoxCox Population")

## [1]  83 189

The Q-Q plots confirm that each transformation deviates from normality except for the Box Cox with a lambda of 0.1. This transformation will be applied for Population and the GDP per capita will be calculated to reflect these new values

# Transforming population
lifeExpGDPTidy$Population <- BoxCox(lifeExpGDPTidy$Population, lambda = 1/10)
names(lifeExpGDPTidy)[names(lifeExpGDPTidy) == "Population"] <- "Box.Cox.Population"

# Recalculating GDP per Capita
lifeExpGDPTidy <- mutate(lifeExpGDPTidy[-11], "GDP.per.Population" = `Log10.GDP.Current.USD` / `Box.Cox.Population`)

# Checking for outliers
lifeExpGDPTidy$Log10.GDP.Current.USD[lifeExpGDPTidy$Sex == "Overall"] %>% boxplot(main = "Log10 GDP")
lifeExpGDPTidy$Box.Cox.Population[lifeExpGDPTidy$Sex == "Overall"] %>% boxplot(main = "BoxCox Population")
lifeExpGDPTidy$GDP.per.Population[lifeExpGDPTidy$Sex == "Overall"] %>% boxplot(main = "GDP per Population")

Investigating boxplots it is clear the transforming the data was a more effective method of reducing outliers. GDP now has only 3 outliers past the lower bounds. Population has no outliers at all. GDP per Population still has a number of outliers but as this variable is a product of the other variables and as the outliers for those variables have been handled we can be confident that the outliers in GDP per Population are not due to error. The three lower outliers for GDP will be capped at the 5th percentile using the previously created cap() function.

Multivariate Outliers analysis

Now that Population and GDP approximate a normal distribution a multivariate analysis can be performed to identify any possible multivariate outliers. First a relationship between GDP and population must be confirmed.

# Capping lower outliers
lifeExpGDPTidy$Log10.GDP.Current.USD[lifeExpGDPTidy$Sex == "Overall"] <- lifeExpGDPTidy$Log10.GDP.Current.USD[lifeExpGDPTidy$Sex == "Overall"] %>% cap()
lifeExpGDPTidy$Log10.GDP.Current.USD[lifeExpGDPTidy$Sex == "Female"] <- lifeExpGDPTidy$Log10.GDP.Current.USD[lifeExpGDPTidy$Sex == "Female"] %>% cap()
lifeExpGDPTidy$Log10.GDP.Current.USD[lifeExpGDPTidy$Sex == "Male"] <- lifeExpGDPTidy$Log10.GDP.Current.USD[lifeExpGDPTidy$Sex == "Male"] %>% cap()

# Scatterplot
plot(lifeExpGDPTidy[,c("Log10.GDP.Current.USD", "Box.Cox.Population")])

After capping the lower outliers a scatterplot was created using plot() and subsetting gdp and population. The scatter plot confirms a positive linear relationship between the variables. The spike at the median of GDP corresponds to the imputed NA values. As a relationship has been confirmed it is safe to proceed with a multivariate analysis.

# Multivariate analysis
mvnResults <- mvn(data = lifeExpGDPTidy[,c("Log10.GDP.Current.USD", "Box.Cox.Population")], multivariateOutlierMethod = "quan", showOutliers = TRUE)

Utilising the mvn function from the MVN package, setting the outlier method to “quan” and showOutliers to TRUE, a Chi-Square Q-Q Plot was created to investigate any potential multivariate outliers. According to the Chi-Square Q-Q Plot, there are no multivariate outliers between GDP and Population.

MATH2349 Data Wrangling

Assignment 2

Mitchell Pasmans s3853065

Required packages

Executive Summary

Data

Life Expectancy Data

Variables

GDP and Population

Variables

Reading Data

Understand

Summarising the Dataset

Converting Datatypes

Tidy & Manipulate Data I

Tidy & Manipulate Data II

Creating GDP per Capita

Scan I

NA’s

checking for Special values, NANs and infinite numbers

Scanning for obvious errors

Scan II

Handling outliers using Capping

Transform

Transforming GDP

Transforming Population

Multivariate Outliers analysis