MATH2349 Data Wrangling

Required packages

library(dplyr)
library(tidyr)
library(readxl)
library(ggplot2)
library(car)

Executive Summary

Two datasets reporting world population and GDP in US$ were downloaded from World Bank Open Data. These datasets were loaded into R using readxl. Each variable was converted to the appropriate datatype. Unneeded variables and spurious country names (regional and income-group aggregates) were removed by subsetting the data. The datasets were then tidied by converting them from wide format to long format, correctly rendering Year as a variable. Once tidied, the datasets were combined using Country Name and Year as the key. NA values were scanned for using is.na() and omitted using na.omit(). A new variable, GDP per capita, was created by dividing GDP by population. Outliers were examined for all numerical variables in selected years using Tukey's fences. It was deemed inappropriate to remove outliers, but a scenario where removal is valid was explored by removing all values above the upper fence (Q3 + 1.5 × IQR). The variable GDP was transformed by taking the base 10 logarithm of all GDP values, and the result was visualised using a histogram.

Data

Dataset 1 - World Bank Open Data - Total Population

Retrieved .xls on 15th October 2020 from: https://data.worldbank.org/indicator/SP.POP.TOTL

This World Bank dataset collates data from several sources, including the UN Population Division and various governmental census reports. It reports total population for 264 countries and country groupings, with one column per year from 1960 to 2020.

The variables in this dataset are:

Country Name: The name of the country
Country Code: A three character code indicating the country
Indicator Name: Indicating what the observation is reporting
Indicator Code: A code that represents indicator name
1960 to 2020: one column per year holding the total population for that year

Loading Dataset 1 - Population

This data was loaded into an R dataframe named ‘population’ using read_excel() from the readxl package. The first rows of the population dataframe are displayed using head().

#Load data and skip first 3 rows of blank space
population <-  read_excel("API_SP.POP.TOTL_DS2_en_excel_v2_1495038.xls", skip = 3)
head(population)

Dataset 2 - World Bank Open Data - GDP (current US$)

Retrieved .xls on 15th October 2020 from: https://data.worldbank.org/indicator/NY.GDP.MKTP.CD

This dataset, also from the World Bank, collates data from World Bank national accounts data and OECD National Accounts data files. It reports GDP for 264 countries and country groupings, with one column per year from 1960 to 2020, expressed in US dollars so that GDP can be compared across countries.

The variables in this dataset are:

Country Name: The name of the country
Country Code: A three character code indicating the country
Indicator Name: Indicating what the observation is reporting
Indicator Code: A code that represents indicator name
1960 to 2020: one column per year holding total GDP in US$ for that year

Loading Dataset 2

This data was loaded into an R dataframe named ‘gdp’ using read_excel() from the readxl package. The first rows of the gdp dataframe are displayed using head().

#Load data and skip first 3 rows of blank space
gdp <-  read_excel("API_NY.GDP.MKTP.KD_DS2_en_excel_v2_1495404.xls", skip = 3)
head(gdp)

Further Steps

The aim is to produce observations of GDP per capita for each country and year. To do this, and to fulfil the last requirement of the data section (“merge at least two data sets to create the one you are going to work on.”), the population and gdp dataframes will be merged. This is carried out later in this report, in the section Tidy & Manipulate.

Understand

Understand Dataset 1 - Population

Previously, we looked at the first rows of the population dataset (Loading Dataset 1 - Population). We could see that our first four columns are of the ‘character’ datatype and that the remaining columns, which hold the population for each year, are of the ‘double’ datatype, a numeric datatype. We can illustrate this using str().

str(population, list.len = 6)

## tibble [264 × 65] (S3: tbl_df/tbl/data.frame)
##  $ Country Name  : chr [1:264] "Aruba" "Afghanistan" "Angola" "Albania" ...
##  $ Country Code  : chr [1:264] "ABW" "AFG" "AGO" "ALB" ...
##  $ Indicator Name: chr [1:264] "Population, total" "Population, total" "Population, total" "Population, total" ...
##  $ Indicator Code: chr [1:264] "SP.POP.TOTL" "SP.POP.TOTL" "SP.POP.TOTL" "SP.POP.TOTL" ...
##  $ 1960          : num [1:264] 54211 8996973 5454933 1608800 13411 ...
##  $ 1961          : num [1:264] 55438 9169410 5531472 1659800 14375 ...
##   [list output truncated]

For our numeric data columns ‘double’ is an appropriate datatype. The names of the numeric columns are themselves values of a variable representing year, a categorical variable which should be a factor; this will be handled later in the Tidy & Manipulate section. Our first four columns could all be expressed as factors, although we don’t need the variables ‘Country Code’, ‘Indicator Name’ or ‘Indicator Code’.

To simplify the dataframe, these three variables are removed by subsetting, and the result is displayed using colnames().

population <-  population[, -c(2:4)]
colnames(population)

##  [1] "Country Name" "1960"         "1961"         "1962"         "1963"        
##  [6] "1964"         "1965"         "1966"         "1967"         "1968"        
## [11] "1969"         "1970"         "1971"         "1972"         "1973"        
## [16] "1974"         "1975"         "1976"         "1977"         "1978"        
## [21] "1979"         "1980"         "1981"         "1982"         "1983"        
## [26] "1984"         "1985"         "1986"         "1987"         "1988"        
## [31] "1989"         "1990"         "1991"         "1992"         "1993"        
## [36] "1994"         "1995"         "1996"         "1997"         "1998"        
## [41] "1999"         "2000"         "2001"         "2002"         "2003"        
## [46] "2004"         "2005"         "2006"         "2007"         "2008"        
## [51] "2009"         "2010"         "2011"         "2012"         "2013"        
## [56] "2014"         "2015"         "2016"         "2017"         "2018"        
## [61] "2019"         "2020"
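Positional indices such as -c(2:4) silently drop the wrong columns if the file layout ever changes. As a minimal sketch, assuming a toy data frame with the same column layout (the values are made up for illustration), the same columns can be dropped by name with dplyr's select():

```r
library(dplyr)

# Hypothetical stand-in for the World Bank layout; values are invented
toy <- data.frame(
  check.names = FALSE,
  `Country Name`   = c("A", "B"),
  `Country Code`   = c("AAA", "BBB"),
  `Indicator Name` = rep("Population, total", 2),
  `Indicator Code` = rep("SP.POP.TOTL", 2),
  `1960`           = c(100, 200),
  `1961`           = c(110, 210)
)

# Drop the unwanted columns by name rather than by position
toy_small <- toy %>%
  select(-`Country Code`, -`Indicator Name`, -`Indicator Code`)

colnames(toy_small)
```

Only ‘Country Name’ and the year columns remain, which mirrors the result shown above.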

Next we can convert ‘Country Name’ to a factor variable using as.factor() and examine the results using str().

population$`Country Name` <- population$`Country Name` %>% as.factor
str(population, list.len = 4)

## tibble [264 × 62] (S3: tbl_df/tbl/data.frame)
##  $ Country Name: Factor w/ 264 levels "Afghanistan",..: 11 1 6 2 5 8 250 9 10 4 ...
##  $ 1960        : num [1:264] 54211 8996973 5454933 1608800 13411 ...
##  $ 1961        : num [1:264] 55438 9169410 5531472 1659800 14375 ...
##  $ 1962        : num [1:264] 56225 9351441 5608539 1711319 15370 ...
##   [list output truncated]

On examining the levels of Country Name I could see that some entries aren’t countries at all but aggregates of countries, e.g. [63] “East Asia & Pacific (excluding high income)”. We will remove these by storing the non-countries in a character vector, which is then used to remove all rows containing them.

# Regional, income-group and other aggregate entries that are not individual countries
nonCountries <- c(
  "Arab World", "Central Europe and the Baltics", "Early-demographic dividend",
  "East Asia & Pacific", "East Asia & Pacific (excluding high income)",
  "East Asia & Pacific (IDA & IBRD countries)", "Euro area", "Europe & Central Asia",
  "Europe & Central Asia (excluding high income)", "Europe & Central Asia (IDA & IBRD countries)",
  "European Union", "Fragile and conflict affected situations",
  "Heavily indebted poor countries (HIPC)", "High income", "IBRD only", "IDA & IBRD total",
  "IDA blend", "IDA only", "IDA total", "Late-demographic dividend",
  "Latin America & Caribbean", "Latin America & Caribbean (excluding high income)",
  "Latin America & the Caribbean (IDA & IBRD countries)",
  "Least developed countries: UN classification", "Low income", "Low & middle income",
  "Lower middle income", "Middle East & North Africa",
  "Middle East & North Africa (excluding high income)",
  "Middle East & North Africa (IDA & IBRD countries)", "Middle income", "North America",
  "Not classified", "OECD members", "Other small states", "Post-demographic dividend",
  "Pre-demographic dividend", "South Asia", "South Asia (IDA & IBRD)", "Sub-Saharan Africa",
  "Sub-Saharan Africa (excluding high income)", "Sub-Saharan Africa (IDA & IBRD countries)",
  "Upper middle income", "West Bank and Gaza", "World")

population <- population[!population$`Country Name` %in% nonCountries,]
colnames(population)

##  [1] "Country Name" "1960"         "1961"         "1962"         "1963"        
##  [6] "1964"         "1965"         "1966"         "1967"         "1968"        
## [11] "1969"         "1970"         "1971"         "1972"         "1973"        
## [16] "1974"         "1975"         "1976"         "1977"         "1978"        
## [21] "1979"         "1980"         "1981"         "1982"         "1983"        
## [26] "1984"         "1985"         "1986"         "1987"         "1988"        
## [31] "1989"         "1990"         "1991"         "1992"         "1993"        
## [36] "1994"         "1995"         "1996"         "1997"         "1998"        
## [41] "1999"         "2000"         "2001"         "2002"         "2003"        
## [46] "2004"         "2005"         "2006"         "2007"         "2008"        
## [51] "2009"         "2010"         "2011"         "2012"         "2013"        
## [56] "2014"         "2015"         "2016"         "2017"         "2018"        
## [61] "2019"         "2020"
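Because colnames() only reports the columns, it does not show how many rows the filter removed. The following is a minimal sketch on a made-up four-row data frame (the names `countries` and `aggregates` are hypothetical) showing how nrow() confirms the row filter, and how droplevels() discards factor levels that no longer occur:

```r
# Hypothetical toy data: two real countries and two aggregate entries
countries <- data.frame(
  name = factor(c("Aruba", "World", "Angola", "High income")),
  pop  = c(1, 2, 3, 4)
)
aggregates <- c("World", "High income")

# Keep only rows whose name is not in the aggregate list
kept <- countries[!countries$name %in% aggregates, ]

nrow(kept)                  # number of rows left after filtering
levels_before <- nlevels(kept$name)   # unused levels are still attached

# Drop the levels that no longer appear in the data
kept$name <- droplevels(kept$name)
levels_after <- nlevels(kept$name)
```

Subsetting a factor keeps all of its original levels, which is why str() continues to report the full set of levels after rows have been removed unless droplevels() is applied.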

Understand Dataset 2 - GDP

The ‘gdp’ dataframe has a very similar format to the ‘population’ dataframe, so we can draw the same conclusions as we did for ‘population’ and apply the same manipulation to this dataframe.

Again, we will subset our dataframe to exclude the variables we don’t need and display the result using colnames(). Remember that the names of the numeric columns are also values of a variable representing year, a categorical variable which should be a factor; this will be handled later in the Tidy & Manipulate section.

gdp <-  gdp[, -c(2:4)]
colnames(gdp)

##  [1] "Country Name" "1960"         "1961"         "1962"         "1963"        
##  [6] "1964"         "1965"         "1966"         "1967"         "1968"        
## [11] "1969"         "1970"         "1971"         "1972"         "1973"        
## [16] "1974"         "1975"         "1976"         "1977"         "1978"        
## [21] "1979"         "1980"         "1981"         "1982"         "1983"        
## [26] "1984"         "1985"         "1986"         "1987"         "1988"        
## [31] "1989"         "1990"         "1991"         "1992"         "1993"        
## [36] "1994"         "1995"         "1996"         "1997"         "1998"        
## [41] "1999"         "2000"         "2001"         "2002"         "2003"        
## [46] "2004"         "2005"         "2006"         "2007"         "2008"        
## [51] "2009"         "2010"         "2011"         "2012"         "2013"        
## [56] "2014"         "2015"         "2016"         "2017"         "2018"        
## [61] "2019"         "2020"

Next we can convert ‘Country Name’ to a factor variable using as.factor() and examine the results using str().

gdp$`Country Name` <- gdp$`Country Name` %>% as.factor
str(gdp, list.len = 4)

## tibble [264 × 62] (S3: tbl_df/tbl/data.frame)
##  $ Country Name: Factor w/ 264 levels "Afghanistan",..: 11 1 6 2 5 8 250 9 10 4 ...
##  $ 1960        : num [1:264] NA NA NA NA NA ...
##  $ 1961        : num [1:264] NA NA NA NA NA ...
##  $ 1962        : num [1:264] NA NA NA NA NA ...
##   [list output truncated]

As we did for population, we will remove all non-countries from the GDP dataframe.

gdp <- gdp[!gdp$`Country Name` %in% nonCountries,]
colnames(gdp)

##  [1] "Country Name" "1960"         "1961"         "1962"         "1963"        
##  [6] "1964"         "1965"         "1966"         "1967"         "1968"        
## [11] "1969"         "1970"         "1971"         "1972"         "1973"        
## [16] "1974"         "1975"         "1976"         "1977"         "1978"        
## [21] "1979"         "1980"         "1981"         "1982"         "1983"        
## [26] "1984"         "1985"         "1986"         "1987"         "1988"        
## [31] "1989"         "1990"         "1991"         "1992"         "1993"        
## [36] "1994"         "1995"         "1996"         "1997"         "1998"        
## [41] "1999"         "2000"         "2001"         "2002"         "2003"        
## [46] "2004"         "2005"         "2006"         "2007"         "2008"        
## [51] "2009"         "2010"         "2011"         "2012"         "2013"        
## [56] "2014"         "2015"         "2016"         "2017"         "2018"        
## [61] "2019"         "2020"

Scan I

As we could see from the output of str() there is a lot of missing data from both dataframes. We can take a closer look and deal with this now.

Scanning missing values Dataset 1 - Population

We can look at the missing values for population in each year using colSums() with is.na(), which counts the number of missing values in each column.

population %>% is.na() %>% colSums ()

## Country Name         1960         1961         1962         1963         1964 
##            0            2            2            2            2            2 
##         1965         1966         1967         1968         1969         1970 
##            2            2            2            2            2            2 
##         1971         1972         1973         1974         1975         1976 
##            2            2            2            2            2            2 
##         1977         1978         1979         1980         1981         1982 
##            2            2            2            2            2            2 
##         1983         1984         1985         1986         1987         1988 
##            2            2            2            2            2            2 
##         1989         1990         1991         1992         1993         1994 
##            2            1            1            2            2            2 
##         1995         1996         1997         1998         1999         2000 
##            1            1            1            0            0            0 
##         2001         2002         2003         2004         2005         2006 
##            0            0            0            0            0            0 
##         2007         2008         2009         2010         2011         2012 
##            0            0            0            0            0            1 
##         2013         2014         2015         2016         2017         2018 
##            1            1            1            1            1            1 
##         2019         2020 
##            1          219

As we can see from this output, there are no missing values for Country Name. For the year columns, two values are missing in most years from 1960 to 1994, one in 1995 to 1997, none from 1998 to 2011 and one from 2012 to 2019. The 2020 column is missing for all 219 countries, so there is no population data for the year 2020.
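To make the counting step concrete, here is a minimal sketch on a made-up three-row data frame (the data and the name `toy` are hypothetical). is.na() returns a logical matrix, and colSums() adds up the TRUE values in each column:

```r
# Hypothetical wide-format toy data with some missing values
toy <- data.frame(
  check.names = FALSE,
  Country = c("A", "B", "C"),
  `1960`  = c(1, NA, 3),
  `1961`  = c(NA, NA, 6)
)

# Count missing values per column
na_per_column <- colSums(is.na(toy))
na_per_column

# Identify which countries are missing a value in a given year
missing_1960 <- toy$Country[is.na(toy$`1960`)]
missing_1960
```

The second step is a simple way to find out which countries sit behind a given count.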

On a closer look at the data, population data is missing for a few countries in the earlier years, probably because these records were not available, and there is no reliable data for 2020 yet. We can visualise this by comparing two countries, Serbia and Somalia, converting the data to long format and plotting the results. This is achieved with gather(), which gathers the columns 1960 to 2020 into a new column called Year and their values into a new column called population. The output is plotted using ggplot, with the population converted to millions.

As we can see, the population data for Serbia doesn’t begin until 1989 and both countries stop before 2020. This is somewhat important as it indicates that the data is not missing at random.

# Use %in% (not ==) so that every row for either country is kept
population %>% gather('1960':'2020', key = "Year", value = "population") %>%
  filter(`Country Name` %in% c("Serbia", "Somalia")) %>%
  ggplot(aes(Year, population / 1e6, group = `Country Name`, col = `Country Name`)) +
  geom_line(size = 2) + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) + ylab("Population (Millions)")
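When filtering for several values at once, the choice between == and %in% matters. The sketch below, on a made-up character vector, shows that == compares element by element and recycles the shorter vector, whereas %in% tests membership:

```r
# Hypothetical vector of country names
x <- c("Serbia", "Somalia", "Somalia", "Serbia")

# == recycles c("Serbia", "Somalia") to the length of x and compares pairwise
pairwise <- x == c("Serbia", "Somalia")
pairwise   # TRUE TRUE FALSE FALSE

# %in% asks, for each element of x, whether it appears in the set
member <- x %in% c("Serbia", "Somalia")
member     # TRUE TRUE TRUE TRUE
```

With ==, rows are kept or dropped depending on their position, so a filter on two countries can silently lose about half of the matching rows.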

Since we are limited in the packages we can use in this assignment, and programming a function to compute appropriate values based on regression would be very difficult and outside the scope of this assignment, it is easier just to omit the values for years that are missing. This is the strategy that I will use to deal with missing values in population in the next section, once the data is in a tidy format.

Scanning missing values Dataset 2 - GDP

We follow the same approach for GDP, using colSums() with is.na() to count the number of missing values in each column. As before, there are no missing values for Country Name. The year columns show a similar pattern to population but on a much larger scale: the earliest years have the most missing GDP values (130 in 1960), coverage improves steadily over time (11 missing in 2010), and there are no values at all for 2020.

gdp %>% is.na() %>% colSums ()

## Country Name         1960         1961         1962         1963         1964 
##            0          130          127          127          127          127 
##         1965         1966         1967         1968         1969         1970 
##          123          119          118          116          116          102 
##         1971         1972         1973         1974         1975         1976 
##          102          102          102          100           97           96 
##         1977         1978         1979         1980         1981         1982 
##           91           91           90           80           75           72 
##         1983         1984         1985         1986         1987         1988 
##           72           69           68           64           61           59 
##         1989         1990         1991         1992         1993         1994 
##           58           46           45           42           40           39 
##         1995         1996         1997         1998         1999         2000 
##           33           33           30           30           29           24 
##         2001         2002         2003         2004         2005         2006 
##           23           18           18           17           17           16 
##         2007         2008         2009         2010         2011         2012 
##           16           15           15           11           15           16 
##         2013         2014         2015         2016         2017         2018 
##           16           17           18           19           19           23 
##         2019         2020 
##           38          219

Following a similar approach to before, if we look at the GDP data for Serbia and Somalia we can see that Serbia has data starting at 1995 while Somalia has no data at all.

# Use %in% (not ==) so that every row for either country is kept
gdp %>% gather('1960':'2020', key = "Year", value = "gdp") %>%
  filter(`Country Name` %in% c("Serbia", "Somalia")) %>%
  ggplot(aes(Year, gdp / 1e9, group = `Country Name`, col = `Country Name`)) +
  geom_line(size = 2) + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) + ylab("GDP (Billions)")

To deal with this missing data it doesn’t make sense to impute the missing values with a mean or median, because the values represent growth over time. For each country, imputed values could instead be based on regression or k-nearest neighbours. However, since we are limited in the packages we can use in this assignment, and programming a function to compute appropriate values based on regression would be very difficult and outside the scope of this assignment, it is easier just to omit the values for years that are missing. This is the strategy that I will use to deal with missing values in GDP in the next section, once the data is in a tidy format.
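For illustration only, and not the approach taken in this report, a simple trend-respecting alternative to mean or median imputation is linear interpolation between known years. The sketch below uses base R's approx() on a made-up GDP series (the names `years`, `gdp_toy` and `filled` are hypothetical):

```r
# Hypothetical yearly GDP series for one country, with gaps
years   <- 2000:2005
gdp_toy <- c(10, NA, 14, NA, NA, 20)

# Interpolate linearly between the years that have data
known  <- !is.na(gdp_toy)
filled <- approx(x = years[known], y = gdp_toy[known], xout = years)$y
filled
```

By default approx() does not extrapolate, so years before the first or after the last known value would remain NA. That limitation is relevant here, because most of the missing GDP values are at the start of each country's series.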

At this point we can omit the last column, 2020, from both datasets by subsetting it away.

gdp <-  gdp[,- 62]
population <-  population[,- 62]

Tidy & Manipulate Data I

So far neither of our datasets conforms to tidy data principles. Both are in wide format, which makes them easier to read as a table but does not meet tidy data principles. As mentioned in the Understand section, the year columns are values of the year variable. We will now deal with this.

For dataset 1 we will use gather() to gather the year columns into the variables Year and Population. Also as mentioned in the Understand section, year in this context is a categorical variable and can therefore be converted to a factor. To do this we use as.factor() and store the result in the newly created Year column. The datatypes are displayed using str(), with the results showing that Country Name and Year are factors and Population is a numeric variable.

population <- population %>%  gather( '1960':'2019', key = "Year", value = "Population") 
population$Year <- population$Year %>% as.factor()
str(population)

## tibble [13,140 × 3] (S3: tbl_df/tbl/data.frame)
##  $ Country Name: Factor w/ 264 levels "Afghanistan",..: 11 1 6 2 5 250 9 10 4 7 ...
##  $ Year        : Factor w/ 60 levels "1960","1961",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Population  : num [1:13140] 54211 8996973 5454933 1608800 13411 ...

By calling our resulting dataframe we can see that it is in a tidy format.

population

We will now do the same thing for the GDP data. By applying the same functions we can see that Country Name and Year are factors and GDP is a numeric variable. This completes the requirement for the Understand criteria.

gdp <- gdp %>%  gather( '1960':'2019', key = "Year", value = "GDP") 
gdp$Year <- gdp$Year %>% as.factor()
str(gdp)

## tibble [13,140 × 3] (S3: tbl_df/tbl/data.frame)
##  $ Country Name: Factor w/ 264 levels "Afghanistan",..: 11 1 6 2 5 250 9 10 4 7 ...
##  $ Year        : Factor w/ 60 levels "1960","1961",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ GDP         : num [1:13140] NA NA NA NA NA ...
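gather() still works, but the tidyr documentation marks it as superseded by pivot_longer(). As a minimal sketch, assuming a made-up two-country data frame in the same wide layout (the name `wide` and its values are hypothetical), the equivalent reshaping looks like this:

```r
library(tidyr)

# Hypothetical wide-format toy data: one column per year
wide <- data.frame(
  check.names = FALSE,
  `Country Name` = c("A", "B"),
  `1960` = c(1, 2),
  `1961` = c(3, 4)
)

# Every column except Country Name becomes a (Year, Population) pair
long <- pivot_longer(
  wide,
  cols      = -`Country Name`,
  names_to  = "Year",
  values_to = "Population"
)

# Year is categorical here, so store it as a factor
long$Year <- as.factor(long$Year)
long
```

One difference to be aware of is row order: pivot_longer() lists all years for the first country before moving to the next, whereas gather() stacks the data one year column at a time.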

Now that both datasets are tidy we can combine them using Country Name and Year as the common keys. This is achieved using the left_join() function, storing the result as a new dataframe called Countries. This completes the requirement for Data.

Countries <- left_join(population, gdp, by = c("Country Name" = "Country Name", "Year" = "Year"))
Countries

The result is a tidy dataset that conforms to tidy data principles by each variable being its own column and each row being its own observation. In this case each observation is a country, the year and the population and gdp data for corresponding country and year.
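A left join keeps every row of the left table and fills in NA where the right table has no matching key. The following minimal sketch, on made-up data (`pop_toy` and `gdp_toy` are hypothetical names), shows this behaviour and uses dplyr's anti_join() to list the key combinations that found no match:

```r
library(dplyr)

# Hypothetical tidy inputs sharing the keys Country and Year
pop_toy <- data.frame(
  Country    = c("A", "A", "B"),
  Year       = c("1960", "1961", "1960"),
  Population = c(10, 11, 20)
)
gdp_toy <- data.frame(
  Country = c("A", "B"),
  Year    = c("1960", "1960"),
  GDP     = c(100, 400)
)

# Keep all population rows; GDP is NA where no match exists
joined <- left_join(pop_toy, gdp_toy, by = c("Country", "Year"))
joined

# Rows of pop_toy with no matching Country/Year in gdp_toy
unmatched <- anti_join(pop_toy, gdp_toy, by = c("Country", "Year"))
unmatched
```

Checking the unmatched keys before joining is a cheap way to catch spelling differences in country names between two sources.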

By summarising the data for population we can see that population data is only missing for the following four countries.

Countries %>% subset(is.na(Population)) %>% group_by(`Country Name`) %>% 
  summarise(NA_count=sum(is.na(Population)))

By summarising the data for GDP we can see that a lot of data is missing across many countries, but this is acceptable since we have over 13 thousand observations.

Countries %>% subset(is.na(GDP)) %>% group_by(`Country Name`) %>% 
  summarise(NA_count=sum(is.na(GDP)))

Now that the datasets are together in a tidy format, we can consider the overall goal for this dataset: calculating GDP per capita for each country and year. To do that we need complete data for both population and GDP. Since both values are required in each row, we can succinctly remove all rows containing NA values using na.omit(). This completes the requirements for Scan part 1.

Countries <- na.omit(Countries)
which(is.na(Countries))

## integer(0)

Tidy & Manipulate Data II

Now that we have omitted all NAs we can calculate GDP per capita for each country and year pairing by dividing GDP by the population of each country. This is achieved using the mutate() function, creating a new column called GDP_per_Capita which takes GDP / Population.

Countries <- mutate(Countries, GDP_per_Capita = GDP / Population)
Countries
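The two steps above, dropping incomplete rows and deriving the new variable, can be sketched on a made-up three-row data frame (the name `toy` and its values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data: only country A has both GDP and Population
toy <- data.frame(
  Country    = c("A", "B", "C"),
  GDP        = c(1000, NA, 900),
  Population = c(10, 5, NA)
)

# Remove any row containing an NA, then derive GDP per capita
complete <- na.omit(toy)
complete <- mutate(complete, GDP_per_Capita = GDP / Population)
complete
```

Omitting the NA rows first is not strictly required, since dividing by or into NA simply yields NA, but it guarantees that every remaining row has a usable GDP_per_Capita value.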

Scan II

To decide how to scan for outliers, it is useful first to see the overall distribution of each of our numeric variables. If a variable is approximately normal, we can apply a z-score method. For each numeric variable we will plot a histogram to check the shape of the data and a Q-Q plot, using qqPlot() from the car package, to check how close it is to normal. If the data is normal, the data points mostly fall within the blue threshold lines.

par(mfrow= c(1,3))
Countries$Population %>% hist(main = "Distribution of Population 1960-2019" )
Countries$GDP %>% hist(main = "Distribution of GDP 1960-2019" )
Countries$GDP_per_Capita %>% hist(main = "Distribution of GDPPC 1960-2019" )

par(mfrow= c(1,3))
qqPlot(Countries$Population )
qqPlot(Countries$GDP )
qqPlot(Countries$GDP_per_Capita )

As we can see the data is far from normal. This makes sense: we have a large number of countries, all at different levels of population and growth, and most countries have a comparatively low population, GDP and GDP per capita.

Since the data is not normal we will use Tukey's fences method, where outliers are defined as the values in the dataset that fall outside the range from Q1 − 1.5×IQR to Q3 + 1.5×IQR, with Q1 and Q3 the first and third quartiles and IQR = Q3 − Q1. These two limits are called “outlier fences”, and any values lying outside them are depicted using an “o” or a similar symbol on the box plot.
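As a small worked example of the fences, the sketch below computes them for a made-up vector with one obviously extreme value. The fences are measured from the quartiles, not from zero:

```r
# Hypothetical data with one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

# Quartiles using R's default quantile method
q1  <- quantile(x, 0.25, names = FALSE)   # 3.25
q3  <- quantile(x, 0.75, names = FALSE)   # 7.75
iqr <- q3 - q1                            # 4.5

# Tukey's fences
lower <- q1 - 1.5 * iqr                   # -3.5
upper <- q3 + 1.5 * iqr                   # 14.5

# Values outside the fences are flagged as outliers
outliers <- x[x < lower | x > upper]
outliers                                  # 100
```

For right-skewed, strictly positive data such as population or GDP, the lower fence is often negative, which is why no lower outliers can occur.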

From this point forward, to make things easier, I have defined functions that use Tukey's method to obtain dataframes of sorted upper outliers and lower outliers, as well as box plots and histograms. For this dataset it makes more sense to look for outliers within a particular year rather than across 60 years of growth in one go. In this way we can detect outliers for a particular year.

This function takes our dataframe, the name of the column variable and an optional year filter. It calculates the interquartile range (IQR) of the variable to obtain the upper fence as defined by Tukey's method (Q3 + 1.5 × IQR), filters for everything above this fence and then sorts the output by the variable.

getUpperOutliers <- function(df, column, yearVar) {
  # Optionally restrict the data to a single year
  if (!missing(yearVar)) {
    df <- df %>% filter(Year == yearVar)
  }
  # Tukey's upper fence: Q3 + 1.5 * IQR (named q1/q3 to avoid masking stats::IQR)
  q1 <- quantile(df[[column]], probs = 0.25)
  q3 <- quantile(df[[column]], probs = 0.75)
  upperFence <- q3 + 1.5 * (q3 - q1)
  filtered <- df[df[[column]] > upperFence, ]
  # Sort so that the most extreme outliers come first
  return(arrange(filtered, desc(!!rlang::sym(column))))
}

This does the same but returns everything that falls below the lower fence.

getLowerOutliers <- function(df, column, yearVar) {
  # Optionally restrict the data to a single year
  if (!missing(yearVar)) {
    df <- df %>% filter(Year == yearVar)
  }
  # Tukey's lower fence: Q1 - 1.5 * IQR
  q1 <- quantile(df[[column]], probs = 0.25)
  q3 <- quantile(df[[column]], probs = 0.75)
  lowerFence <- q1 - 1.5 * (q3 - q1)
  filtered <- df[df[[column]] < lowerFence, ]
  # Sort so that the smallest values come first
  return(arrange(filtered, !!rlang::sym(column)))
}

This is a function to get our boxplots

getBoxPlot <- function(df, column, yearVar) {
  if (!missing(yearVar)) {
    df <- df %>% filter(Year == yearVar)
    title <- paste("Boxplot of", column, "in", yearVar)
  } else {
    # The 2020 column was dropped earlier, so the data ends in 2019
    title <- paste("Boxplot of", column, "from 1960 to 2019")
  }
  df[[column]] %>% boxplot(main = title, xlab = column)
}

And a function to get our histogram:

getHist <- function(df, column, yearVar) {
  if (!missing(yearVar)) {
    df <- df %>% filter(Year == yearVar)
    title <- paste("Distribution of", column, "in", yearVar)
  } else {
    # The 2020 column was dropped earlier, so the data ends in 2019
    title <- paste("Distribution of", column, "from 1960 to 2019")
  }
  df[[column]] %>% hist(breaks = 50, main = title, xlab = column)
}

And finally a function that draws both the histogram and the boxplot for a given variable.

showDistribution <- function(df, column, yearVar) {
  # Pass df and column through so the histogram matches the requested variable
  if (!missing(yearVar)) {
    w <- getHist(df, column, yearVar)
    x <- getBoxPlot(df, column, yearVar)
  } else {
    w <- getHist(df, column)
    x <- getBoxPlot(df, column)
  }
  return(list(w, x))
}

With our functions defined we can take a look at outliers in GDP per capita for the year 1960. We can see a right-skewed distribution with a large number of upper outliers and no lower outliers.

par(mfrow= c(1,2))
showDistribution(Countries, "GDP_per_Capita",1960 )

Looking at the dataframe we can see that the outliers are all affluent countries, mostly European, which is to be expected. These countries make a lot more money relative to their population, with Bermuda having the highest GDP_per_Capita.

getUpperOutliers(Countries, "GDP_per_Capita",1960 )

getLowerOutliers(Countries, "GDP_per_Capita",1960 )

Fast forward to 2019 and we can see much of the same

par(mfrow= c(1,2))
showDistribution(Countries, "GDP_per_Capita",2019 )

Now, however, the outliers also include some Asian and Arab nations, which is very interesting.

getUpperOutliers(Countries, "GDP_per_Capita",2019 )

getLowerOutliers(Countries, "GDP_per_Capita",2019 )

Looking at population in 1970 we can see the distribution follows the same pattern: right skewed with a good number of upper outliers and, again, no lower outliers.

par(mfrow= c(1,2))
showDistribution(Countries, "Population", 1970)

Looking at the outliers we see the usual suspects, such as China and India with the highest populations, along with a good number of other countries.

getUpperOutliers(Countries, "Population", 1970)

getLowerOutliers(Countries, "Population", 1970)

Fast forward to 2019 and we can see that there are some huge outliers and some moderate ones too.

par(mfrow= c(1,2))
showDistribution(Countries, "Population", 2019)

Looking at the countries we can see that China and India dwarf everybody else in terms of population

getUpperOutliers(Countries, "Population",2019)

getLowerOutliers(Countries, "Population",2019 )

Looking only at GDP in the year 1960 we can see that the distribution is again strongly right skewed, with one massive outlier and some moderate ones.

par(mfrow= c(1,2))
showDistribution(Countries, "GDP", 1960)

The outlier tables for GDP in 1960 show which countries these are.

getUpperOutliers(Countries, "GDP", 1960)

getLowerOutliers(Countries, "GDP", 1960)

If we look at GDP for 2019 we can see a huge number of outliers, including some massive ones.

par(mfrow= c(1,2))
showDistribution(Countries, "GDP", 2019)

If we look at the countries responsible for this we can see that China, India and the United States are economic powerhouses.

getUpperOutliers(Countries, "GDP", 2019)

getLowerOutliers(Countries, "GDP", 2019)

Dealing with outliers

We can see from our examples that the outliers in this context aren’t a result of any of the following:

Data Entry Errors
Measurement Errors
Experimental Error
Intentional Error
Data Processing Errors
Sampling error

The outliers in our examples are genuinely very large and very rich nations, and they show how poor and small the rest of the world is in comparison. Imputing values for outliers in this example doesn’t make any sense and would ruin any statistical analysis of the world as a whole.

An argument could be made for excluding the rich or large nations in order to produce a rest-of-the-world dataset. If we wanted to do that, we could do it for a particular year by keeping everything at or below Tukey's upper fence. I have defined a function that does this. I have not dealt with lower outliers since we have none.

restOftheWorld <- function(df, column, yearVar) {
  # Optionally restrict the data to a single year
  if (!missing(yearVar)) {
    df <- df %>% filter(Year == yearVar)
  }
  # Tukey's upper fence: Q3 + 1.5 * IQR
  q1 <- quantile(df[[column]], probs = 0.25)
  q3 <- quantile(df[[column]], probs = 0.75)
  upperFence <- q3 + 1.5 * (q3 - q1)
  filtered <- df[df[[column]] <= upperFence, ] # keep everything that isn't an upper outlier
  return(arrange(filtered, desc(!!rlang::sym(column))))
}

Here we can see the population of the rest of the world in 2019, excluding our larger outliers.

restOftheWorld(Countries, "Population", 2019 )

The same for GDP

restOftheWorld(Countries, "GDP", 2019 )

And the same for GDP per capita

restOftheWorld(Countries, "GDP_per_Capita", 2019 )

Transform

As we could see from the outlier exploration, our data is right skewed. We can apply either a natural log or a log10 transformation to decrease the skewness and bring the distribution closer to normal. Once visualised by plotting a histogram, we can see that the transformed GDP looks a lot more normal.

log10(Countries$GDP) %>% hist(breaks = 50, main = "Distribution of all Countries log10 GDP")
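The effect of the transformation can also be checked numerically rather than only by eye. The sketch below is an illustration on simulated data, not on the World Bank values: it generates made-up right-skewed values and compares a simple moment-based skewness (the helper `skew` is defined here and is not from any package) before and after taking log10:

```r
set.seed(1)

# Simulated right-skewed values: log-normal on the base-10 scale
gdp_toy <- 10^rnorm(1000, mean = 10, sd = 1)

# Simple moment-based sample skewness (hypothetical helper)
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skew(gdp_toy)          # strongly positive for the raw values
skew_log <- skew(log10(gdp_toy))   # much closer to zero after log10

c(raw = skew_raw, log10 = skew_log)
```

A skewness near zero indicates a roughly symmetric distribution, which is consistent with the more bell-shaped histogram seen after the transformation.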