library(dplyr)
library(tidyr)
library(readxl)
library(ggplot2)
library(car)
Two datasets reporting world population and gross domestic product (GDP) in US$ were downloaded from World Bank Open Data and loaded into R using readxl. Each variable was converted to an appropriate datatype. Unneeded variables and spurious country names (regional and income aggregates) were removed by subsetting. The datasets were then tidied by converting them from wide to long format, so that Year became a variable. Once tidied, the datasets were joined using Country Name and Year as the key. NA values were scanned for using is.na() and omitted. A new variable, GDP per capita, was created by dividing GDP by population. Outliers were examined for all numeric variables in selected years using Tukey's fences. Removing the outliers was judged inappropriate, but a scenario in which removal is valid was explored by dropping every value above the upper fence (Q3 + 1.5 × IQR). Finally, GDP was transformed by taking the base-10 logarithm of all GDP values, and the result was visualised using a histogram.
Retrieved .xls on 15th October 2020 from: https://data.worldbank.org/indicator/SP.POP.TOTL
This World Bank dataset collates data from several sources, including the UN Population Division and various national census reports. It reports total population for 264 countries and aggregate groupings for each year from 1960 to 2020.
The variables in this dataset are:
Country Name: The name of the country
Country Code: A three character code indicating the country
Indicator Name: Indicating what the observation is reporting
Indicator Code: A code that represents indicator name
Total population values for each year from 1960 to 2020
This data was loaded into an R dataframe named ‘population’ using read_excel() from the readxl package. The first rows of the population dataframe are displayed using head().
#Load data and skip first 3 rows of blank space
population <- read_excel("API_SP.POP.TOTL_DS2_en_excel_v2_1495038.xls", skip = 3)
head(population)
Retrieved .xls on 15th October 2020 from: https://data.worldbank.org/indicator/NY.GDP.MKTP.CD
This dataset, also from the World Bank, collates World Bank national accounts data and OECD National Accounts data files. It reports GDP for 264 countries and aggregate groupings from 1960 to 2020, expressed in current US dollars so that GDP can be compared across countries.
The variables in this dataset are:
Country Name: The name of the country
Country Code: A three character code indicating the country
Indicator Name: Indicating what the observation is reporting
Indicator Code: A code that represents indicator name
Total GDP in current US$ for each year from 1960 to 2020
This data was loaded into an R dataframe named ‘gdp’ using read_excel() from the readxl package. The first rows of the gdp dataframe are displayed using head().
#Load data and skip first 3 rows of blank space
gdp <- read_excel("API_NY.GDP.MKTP.KD_DS2_en_excel_v2_1495404.xls", skip = 3)
head(gdp)
The aim is to produce observations of GDP per capita for each country and year. To do this, and to fulfil the last requirement of the data section, “merge at least two data sets to create the one you are going to work on”, the population and gdp dataframes will be merged. This is carried out later in this report, in the section Tidy & Manipulate.
Previously, we looked at the first rows of the population dataset (Load Dataset 1 - Population). We could see that the first four columns are of the ‘character’ datatype and that the remaining columns, which hold the population for each year, are of the ‘double’ datatype, a numeric datatype. We can illustrate this using str().
str(population, list.len = 6)
## tibble [264 × 65] (S3: tbl_df/tbl/data.frame)
## $ Country Name : chr [1:264] "Aruba" "Afghanistan" "Angola" "Albania" ...
## $ Country Code : chr [1:264] "ABW" "AFG" "AGO" "ALB" ...
## $ Indicator Name: chr [1:264] "Population, total" "Population, total" "Population, total" "Population, total" ...
## $ Indicator Code: chr [1:264] "SP.POP.TOTL" "SP.POP.TOTL" "SP.POP.TOTL" "SP.POP.TOTL" ...
## $ 1960 : num [1:264] 54211 8996973 5454933 1608800 13411 ...
## $ 1961 : num [1:264] 55438 9169410 5531472 1659800 14375 ...
## [list output truncated]
For our numeric data columns, ‘double’ is an appropriate datatype. The names of the numeric columns are themselves values of a variable, year, which is categorical and should be a factor; this will be handled later in the Tidy & Manipulate section. The first four columns could all be expressed as factors, although we don’t need the variables ‘Country Code’, ‘Indicator Name’ or ‘Indicator Code’.
To simplify the dataframe, these three variables are removed by subsetting, and the remaining column names are displayed using colnames().
population <- population[, -c(2:4)]
colnames(population)
## [1] "Country Name" "1960" "1961" "1962" "1963"
## [6] "1964" "1965" "1966" "1967" "1968"
## [11] "1969" "1970" "1971" "1972" "1973"
## [16] "1974" "1975" "1976" "1977" "1978"
## [21] "1979" "1980" "1981" "1982" "1983"
## [26] "1984" "1985" "1986" "1987" "1988"
## [31] "1989" "1990" "1991" "1992" "1993"
## [36] "1994" "1995" "1996" "1997" "1998"
## [41] "1999" "2000" "2001" "2002" "2003"
## [46] "2004" "2005" "2006" "2007" "2008"
## [51] "2009" "2010" "2011" "2012" "2013"
## [56] "2014" "2015" "2016" "2017" "2018"
## [61] "2019" "2020"
Next we can convert ‘Country Name’ to a factor variable using as.factor() and examine the result using str().
population$`Country Name` <- population$`Country Name` %>% as.factor
str(population, list.len = 4)
## tibble [264 × 62] (S3: tbl_df/tbl/data.frame)
## $ Country Name: Factor w/ 264 levels "Afghanistan",..: 11 1 6 2 5 8 250 9 10 4 ...
## $ 1960 : num [1:264] 54211 8996973 5454933 1608800 13411 ...
## $ 1961 : num [1:264] 55438 9169410 5531472 1659800 14375 ...
## $ 1962 : num [1:264] 56225 9351441 5608539 1711319 15370 ...
## [list output truncated]
On examining the levels of Country Name, I could see that some entries are not countries at all but aggregates, e.g. [63] “East Asia & Pacific (excluding high income)”. We will remove these by storing the non-countries in a character vector and dropping every row whose Country Name appears in it.
# Aggregate groupings reported by the World Bank that are not individual countries
nonCountries <- c(
  "Arab World", "Central Europe and the Baltics", "Early-demographic dividend",
  "East Asia & Pacific", "East Asia & Pacific (excluding high income)",
  "East Asia & Pacific (IDA & IBRD countries)", "Euro area", "Europe & Central Asia",
  "Europe & Central Asia (excluding high income)", "Europe & Central Asia (IDA & IBRD countries)",
  "European Union", "Fragile and conflict affected situations",
  "Heavily indebted poor countries (HIPC)", "High income", "IBRD only", "IDA & IBRD total",
  "IDA blend", "IDA only", "IDA total", "Late-demographic dividend",
  "Latin America & Caribbean", "Latin America & Caribbean (excluding high income)",
  "Latin America & the Caribbean (IDA & IBRD countries)",
  "Least developed countries: UN classification", "Low income", "Low & middle income",
  "Lower middle income", "Middle East & North Africa",
  "Middle East & North Africa (excluding high income)",
  "Middle East & North Africa (IDA & IBRD countries)", "Middle income", "North America",
  "Not classified", "OECD members", "Other small states", "Post-demographic dividend",
  "Pre-demographic dividend", "South Asia", "South Asia (IDA & IBRD)", "Sub-Saharan Africa",
  "Sub-Saharan Africa (excluding high income)", "Sub-Saharan Africa (IDA & IBRD countries)",
  "Upper middle income", "West Bank and Gaza", "World")
population <- population[!population$`Country Name` %in% nonCountries,]
colnames(population)
## [1] "Country Name" "1960" "1961" "1962" "1963"
## [6] "1964" "1965" "1966" "1967" "1968"
## [11] "1969" "1970" "1971" "1972" "1973"
## [16] "1974" "1975" "1976" "1977" "1978"
## [21] "1979" "1980" "1981" "1982" "1983"
## [26] "1984" "1985" "1986" "1987" "1988"
## [31] "1989" "1990" "1991" "1992" "1993"
## [36] "1994" "1995" "1996" "1997" "1998"
## [41] "1999" "2000" "2001" "2002" "2003"
## [46] "2004" "2005" "2006" "2007" "2008"
## [51] "2009" "2010" "2011" "2012" "2013"
## [56] "2014" "2015" "2016" "2017" "2018"
## [61] "2019" "2020"
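One detail worth checking after this kind of row subsetting is that a factor keeps all of its original levels even when the rows that used them are removed. A minimal sketch on a made-up three-row data frame (the names `toy` and `aggregates` and the values are illustrative assumptions, not World Bank data) of how droplevels() could be used to discard the unused levels:

```r
# Toy stand-in for the population dataframe (illustrative values only)
toy <- data.frame(
  Name = factor(c("Aruba", "World", "Albania")),
  `1960` = c(54211, 3.03e9, 1608800),
  check.names = FALSE
)
aggregates <- c("World")

# Row subsetting removes the aggregate row but not its factor level
toy <- toy[!toy$Name %in% aggregates, ]
levels_before <- nlevels(toy$Name)

# droplevels() discards levels that no longer occur in the data
toy$Name <- droplevels(toy$Name)
levels_after <- nlevels(toy$Name)

c(before = levels_before, after = levels_after)
```

Whether to drop the unused levels is a design choice; it mainly matters for summaries and plots that iterate over factor levels.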
As the ‘gdp’ dataframe has a very similar format to the ‘population’ dataframe, we can draw the same conclusions as we did for ‘population’ and apply the same manipulation to it.
Again, we subset the dataframe to exclude the variables we don’t need and display the result using colnames(). Remember that the names of the numeric columns are values of a variable, year, which is categorical and should be a factor; this will be handled later in the Tidy & Manipulate section.
gdp <- gdp[, -c(2:4)]
colnames(gdp)
## [1] "Country Name" "1960" "1961" "1962" "1963"
## [6] "1964" "1965" "1966" "1967" "1968"
## [11] "1969" "1970" "1971" "1972" "1973"
## [16] "1974" "1975" "1976" "1977" "1978"
## [21] "1979" "1980" "1981" "1982" "1983"
## [26] "1984" "1985" "1986" "1987" "1988"
## [31] "1989" "1990" "1991" "1992" "1993"
## [36] "1994" "1995" "1996" "1997" "1998"
## [41] "1999" "2000" "2001" "2002" "2003"
## [46] "2004" "2005" "2006" "2007" "2008"
## [51] "2009" "2010" "2011" "2012" "2013"
## [56] "2014" "2015" "2016" "2017" "2018"
## [61] "2019" "2020"
Next we can convert ‘Country Name’ to a factor variable using as.factor() and examine the result using str().
gdp$`Country Name` <- gdp$`Country Name` %>% as.factor
str(gdp, list.len = 4)
## tibble [264 × 62] (S3: tbl_df/tbl/data.frame)
## $ Country Name: Factor w/ 264 levels "Afghanistan",..: 11 1 6 2 5 8 250 9 10 4 ...
## $ 1960 : num [1:264] NA NA NA NA NA ...
## $ 1961 : num [1:264] NA NA NA NA NA ...
## $ 1962 : num [1:264] NA NA NA NA NA ...
## [list output truncated]
As we did for population, we will remove all non-countries from the GDP dataframe.
gdp <- gdp[!gdp$`Country Name` %in% nonCountries,]
colnames(gdp)
## [1] "Country Name" "1960" "1961" "1962" "1963"
## [6] "1964" "1965" "1966" "1967" "1968"
## [11] "1969" "1970" "1971" "1972" "1973"
## [16] "1974" "1975" "1976" "1977" "1978"
## [21] "1979" "1980" "1981" "1982" "1983"
## [26] "1984" "1985" "1986" "1987" "1988"
## [31] "1989" "1990" "1991" "1992" "1993"
## [36] "1994" "1995" "1996" "1997" "1998"
## [41] "1999" "2000" "2001" "2002" "2003"
## [46] "2004" "2005" "2006" "2007" "2008"
## [51] "2009" "2010" "2011" "2012" "2013"
## [56] "2014" "2015" "2016" "2017" "2018"
## [61] "2019" "2020"
As we could see from the output of str() there is a lot of missing data from both dataframes. We can take a closer look and deal with this now.
We can look at the missing values for population in each year using colSums() with is.na(), which counts the number of missing values in each column.
population %>% is.na() %>% colSums ()
## Country Name 1960 1961 1962 1963 1964
## 0 2 2 2 2 2
## 1965 1966 1967 1968 1969 1970
## 2 2 2 2 2 2
## 1971 1972 1973 1974 1975 1976
## 2 2 2 2 2 2
## 1977 1978 1979 1980 1981 1982
## 2 2 2 2 2 2
## 1983 1984 1985 1986 1987 1988
## 2 2 2 2 2 2
## 1989 1990 1991 1992 1993 1994
## 2 1 1 2 2 2
## 1995 1996 1997 1998 1999 2000
## 1 1 1 0 0 0
## 2001 2002 2003 2004 2005 2006
## 0 0 0 0 0 0
## 2007 2008 2009 2010 2011 2012
## 0 0 0 0 0 1
## 2013 2014 2015 2016 2017 2018
## 1 1 1 1 1 1
## 2019 2020
## 1 219
As we can see from this output, there are no missing values for Country Name. For the years, there are 2 missing values per year from 1960, decreasing to 1 by 1995, and there is no population data at all for the year 2020.
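The counting step can be illustrated on a tiny made-up data frame. This is only a sketch: the object name `toy` and its values are assumptions chosen for the example, laid out like the wide World Bank tables.

```r
# Toy wide table: one row per country, one column per year
toy <- data.frame(
  `Country Name` = c("A", "B", "C"),
  `1960` = c(NA, 10, 20),
  `1961` = c(NA, NA, 21),
  check.names = FALSE
)

# is.na() gives a logical matrix; colSums() counts the TRUE values per column
na_per_column <- colSums(is.na(toy))
na_per_column
```

The result is a named numeric vector with one count per column, which is the same shape as the output shown above.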
On closer inspection, population data is missing for a few countries before 1989, probably because these records were not available, and there is no reliable data for 2020 yet. We can visualise this by comparing two countries, Serbia and Somalia, after converting the data to long format. This is achieved with gather(), which collects the columns 1960 to 2020 into a new column called Year and their values into a new column called population. The result is plotted using ggplot, with the population converted to millions.
As the plot shows, the population data for Serbia doesn’t begin until 1989, and both countries stop before 2020. This is important because it indicates that the data is not missing at random.
# %in% keeps every row for both countries; == would recycle the vector and silently drop rows
population %>% gather('1960':'2020', key = "Year", value = "population") %>% filter(`Country Name` %in% c("Serbia", "Somalia")) %>%
  ggplot(aes(Year, population * 1e-6, group = `Country Name`, col = `Country Name`)) +
  geom_line(size = 2) + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) + ylab("Population (Millions)")
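When selecting the rows for several countries, the choice of comparison operator matters. The sketch below uses a made-up character vector (`country` and `wanted` are illustrative names) to show that `==` recycles the right-hand vector position by position, whereas `%in%` tests membership:

```r
country <- c("Serbia", "Somalia", "Somalia", "Serbia")
wanted  <- c("Serbia", "Somalia")

# == compares element-wise against the recycled vector
# ("Serbia", "Somalia", "Serbia", "Somalia"), so matches depend on row order
matches_eq <- country == wanted

# %in% asks, for each element, whether it appears anywhere in `wanted`
matches_in <- country %in% wanted

matches_eq
matches_in
```

Here `matches_eq` is TRUE only for the first two elements, while `matches_in` is TRUE for all four, so `%in%` is the safer choice inside filter().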
It doesn’t make sense to impute these missing values with a mean or median, because the values follow a growth trend over time. A more suitable imputation for each country would be based on regression or k-nearest neighbours, using the trend of that country’s series.
Since we are limited in the packages we can use in this assignment, and programming a function to compute an appropriate value based on regression would be difficult and out of scope, it is easier simply to omit the missing years. This is the strategy I will use to deal with missing values in population in the next section, once the data is in a tidy format.
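For illustration only, a trend-based fill does not need extra packages: base R's approx() performs linear interpolation between known points. The sketch below is not the approach used in this report; the helper name `fill_linear` and the toy series are assumptions. It fills gaps inside a series and leaves values outside the observed range as NA.

```r
# Linearly interpolate missing values of one country's series (hypothetical helper)
fill_linear <- function(year, value) {
  known <- !is.na(value)
  if (sum(known) < 2) return(value)   # need at least two points to interpolate
  # rule = 1 returns NA for years outside the range of the known data
  approx(x = year[known], y = value[known], xout = year, rule = 1)$y
}

year <- 2000:2005
pop  <- c(100, NA, 120, NA, NA, 150)
filled <- fill_linear(year, pop)
filled

# A leading gap is not extrapolated
leading <- fill_linear(1:4, c(NA, 2, NA, 4))
leading
```

Interpolation would only be defensible for short gaps inside a series; it could not recover the long runs of missing early years seen in this data, which is one reason omission is the simpler choice here.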
Following the same approach for GDP, we use colSums() with is.na() to count the number of missing values in each column. Again there are no missing values for Country Name. The pattern across years is similar to population but far more severe: the earliest years have the most missing GDP values, coverage improves from the late 1970s onwards, and there are no values at all for 2020.
gdp %>% is.na() %>% colSums ()
## Country Name 1960 1961 1962 1963 1964
## 0 130 127 127 127 127
## 1965 1966 1967 1968 1969 1970
## 123 119 118 116 116 102
## 1971 1972 1973 1974 1975 1976
## 102 102 102 100 97 96
## 1977 1978 1979 1980 1981 1982
## 91 91 90 80 75 72
## 1983 1984 1985 1986 1987 1988
## 72 69 68 64 61 59
## 1989 1990 1991 1992 1993 1994
## 58 46 45 42 40 39
## 1995 1996 1997 1998 1999 2000
## 33 33 30 30 29 24
## 2001 2002 2003 2004 2005 2006
## 23 18 18 17 17 16
## 2007 2008 2009 2010 2011 2012
## 16 15 15 11 15 16
## 2013 2014 2015 2016 2017 2018
## 16 17 18 19 19 23
## 2019 2020
## 38 219
Following the same approach as before, if we look at the GDP data for Serbia and Somalia we can see that Serbia has data starting in 1995 while Somalia has no data at all.
# %in% keeps every row for both countries; == would recycle the vector and silently drop rows
gdp %>% gather('1960':'2020', key = "Year", value = "gdp") %>% filter(`Country Name` %in% c("Serbia", "Somalia")) %>%
  ggplot(aes(Year, gdp * 1e-9, group = `Country Name`, col = `Country Name`)) +
  geom_line(size = 2) + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) + ylab("GDP (Billions)")
As with population, it doesn’t make sense to impute the missing GDP values with a mean or median, because the values follow a growth trend; regression or k-nearest neighbours would be more suitable but are out of the scope of this assignment. The missing GDP values will therefore also be omitted in the next section, once the data is in a tidy format.
At this point we can drop the last column, 2020, from both datasets by subsetting it away.
gdp <- gdp[,- 62]
population <- population[,- 62]
So far, neither dataset conforms to tidy data principles. Both are in wide format, which is easier to read as a table but is not tidy. As mentioned in the Understand section, the year columns are values of the year variable. We will now deal with this.
For dataset 1, we use gather() to collect the year columns into a Year column and their values into a Population column. As also mentioned in the Understand section, year in this context is a categorical variable and can therefore be converted to a factor. To do this we use as.factor() and store the result in the newly created Year column. The datatypes are displayed using str(), which shows that Country Name and Year are factors and Population is numeric.
population <- population %>% gather( '1960':'2019', key = "Year", value = "Population")
population$Year <- population$Year %>% as.factor()
str(population)
## tibble [13,140 × 3] (S3: tbl_df/tbl/data.frame)
## $ Country Name: Factor w/ 264 levels "Afghanistan",..: 11 1 6 2 5 250 9 10 4 7 ...
## $ Year : Factor w/ 60 levels "1960","1961",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Population : num [1:13140] 54211 8996973 5454933 1608800 13411 ...
By printing the resulting dataframe we can see that it is in a tidy format.
population
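As an aside, the tidyr documentation describes gather() as superseded by pivot_longer(). The sketch below shows how the same wide-to-long step might look with pivot_longer() on a made-up two-country table; the object names `wide` and `long` and the values are illustrative assumptions.

```r
library(tidyr)

# Toy wide table shaped like the World Bank data
wide <- data.frame(
  `Country Name` = c("Aruba", "Albania"),
  `1960` = c(54211, 1608800),
  `1961` = c(55438, 1659800),
  check.names = FALSE
)

# Every column except Country Name becomes a (Year, Population) pair
long <- pivot_longer(wide,
                     cols = -`Country Name`,
                     names_to = "Year",
                     values_to = "Population")
long$Year <- as.factor(long$Year)
long
```

One visible difference is row order: pivot_longer() keeps each country's years together, whereas gather() stacks the data one year column at a time. The content is the same.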
We will now do the same for the GDP data. After applying the same functions we can see that Country Name and Year are factors and GDP is numeric. This completes the requirements for the Understand criteria.
gdp <- gdp %>% gather( '1960':'2019', key = "Year", value = "GDP")
gdp$Year <- gdp$Year %>% as.factor()
str(gdp)
## tibble [13,140 × 3] (S3: tbl_df/tbl/data.frame)
## $ Country Name: Factor w/ 264 levels "Afghanistan",..: 11 1 6 2 5 250 9 10 4 7 ...
## $ Year : Factor w/ 60 levels "1960","1961",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ GDP : num [1:13140] NA NA NA NA NA ...
Now that both datasets are tidy, we can combine them using Country Name and Year as the common keys. This is achieved using the left_join() function, storing the result as a new dataframe called Countries. This completes the requirements for Data.
Countries <- left_join(population, gdp, by = c("Country Name" = "Country Name", "Year" = "Year"))
Countries
The result is a dataset that conforms to tidy data principles: each variable has its own column and each observation has its own row. In this case each observation is a country and year together with the population and GDP for that country and year.
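A quick way to sanity-check a join like this is to look for keys that found no partner. The following is a minimal sketch on made-up tables (`pop_toy`, `gdp_toy` and their values are assumptions for illustration): left_join() keeps every population row, and anti_join() lists the population rows with no matching GDP row.

```r
library(dplyr)

pop_toy <- data.frame(
  `Country Name` = c("A", "A", "B"),
  Year = c("1960", "1961", "1960"),
  Population = c(10, 11, 20),
  check.names = FALSE
)
gdp_toy <- data.frame(
  `Country Name` = c("A", "B"),
  Year = c("1960", "1960"),
  GDP = c(100, 400),
  check.names = FALSE
)

# Every row of pop_toy is kept; GDP is NA where no match exists
joined <- left_join(pop_toy, gdp_toy, by = c("Country Name", "Year"))

# Rows of pop_toy whose (Country Name, Year) key is absent from gdp_toy
unmatched <- anti_join(pop_toy, gdp_toy, by = c("Country Name", "Year"))

joined
unmatched
```

In this report both tables come from the same source and share the same countries and years, so no unmatched keys would be expected; the check is cheap insurance when sources differ.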
By summarising the data for population we can see that population data is only missing for the following four countries.
Countries %>% subset(is.na(Population)) %>% group_by(`Country Name`) %>%
summarise(NA_count=sum(is.na(Population)))
By summarising the data for GDP we can see that many countries have a lot of missing data, but this is acceptable since we have over 13 thousand observations.
Countries %>% subset(is.na(GDP)) %>% group_by(`Country Name`) %>%
summarise(NA_count=sum(is.na(GDP)))
Now that the datasets are combined in a tidy format, recall that the overall goal is to calculate GDP per capita for each year. To do that we need complete data for both population and GDP. Since both values are required in every row, we can succinctly remove all rows containing NA values using na.omit(). This completes the requirements for Scan part 1.
Countries <- na.omit(Countries)
which(is.na(Countries))
## integer(0)
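To make the effect of na.omit() concrete, here is a minimal sketch on a made-up three-row data frame (the name `toy` and its values are assumptions). Any row with a missing value in any column is dropped, which matches the requirement that both Population and GDP be present.

```r
toy <- data.frame(
  Country = c("A", "B", "C"),
  Population = c(10, NA, 30),
  GDP = c(100, 200, NA)
)

# Rows B and C each contain an NA, so only row A survives
complete <- na.omit(toy)

rows_left <- nrow(complete)
any_missing <- anyNA(complete)
complete
```

Because na.omit() drops a row when any column is missing, it is worth confirming beforehand that every column in the dataframe is actually needed, as is the case here.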
Now that we have omitted all NAs, we can calculate GDP per capita for each country and year pairing by dividing GDP by population. This is done with the mutate() function, which creates a new column called GDP_per_Capita equal to GDP / Population.
Countries <- mutate(Countries, GDP_per_Capita = GDP / Population)
Countries
First, to decide how to scan for outliers, it is useful to see the overall distribution of each of our numeric variables. If a variable is normally distributed, we can apply a z-score method. For each numeric variable we therefore plot a histogram to check the shape of the data and a Q-Q plot to check how normal it is; if the data is normal, the points mostly fall within the blue confidence lines.
par(mfrow= c(1,3))
Countries$Population %>% hist(main = "Distribution of Population 1960-2019" )
Countries$GDP %>% hist(main = "Distribution of GDP 1960-2019" )
Countries$GDP_per_Capita %>% hist(main = "Distribution of GDPPC 1960-2019" )
par(mfrow= c(1,3))
qqPlot(Countries$Population )
qqPlot(Countries$GDP )
qqPlot(Countries$GDP_per_Capita )
As we can see, the data is far from normal. This makes sense: we have a large number of countries at very different levels of population and growth, and most countries have a comparatively low population, GDP and GDP per capita.
Since the data is not normal we will use Tukey’s fences, where outliers are defined as the values in the data set that fall outside the range from Q1 − 1.5×IQR to Q3 + 1.5×IQR, with Q1 and Q3 the first and third quartiles and IQR = Q3 − Q1. These two limits are called “outlier fences”, and any values lying outside them are depicted using an “o” or a similar symbol on the box plot.
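As a small worked example of the fence calculation, the sketch below uses a made-up vector of ten values (not World Bank data) and R's default quantile() definition. The variable names are illustrative.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

q1 <- quantile(x, probs = 0.25, names = FALSE)   # 3.25
q3 <- quantile(x, probs = 0.75, names = FALSE)   # 7.75
iqr <- q3 - q1                                   # 4.5

lower_fence <- q1 - 1.5 * iqr                    # -3.5
upper_fence <- q3 + 1.5 * iqr                    # 14.5

# Values outside the fences are flagged as outliers
outliers <- x[x < lower_fence | x > upper_fence]
outliers
```

Only the value 100 lies beyond the upper fence, and nothing falls below the lower fence, which mirrors the one-sided pattern expected for right-skewed data.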
From this point forward, to make things easier, I have defined functions based on Tukey’s method that return a sorted dataframe of upper or lower outliers, along with functions for box plots and histograms. For this dataset it makes more sense to examine outliers for a particular year than across nearly 60 years of growth in one go, so each function accepts an optional year.
The first function takes our dataframe, the name of the column variable and an optional year filter. It calculates the interquartile range (IQR) of the variable to obtain the upper fence as defined by Tukey’s method (Q3 + 1.5 × IQR), keeps every row above this fence and sorts the output by the variable in descending order.
getUpperOutliers<- function(df, column, yearVar){
if(!missing(yearVar)){
df <- df %>% filter(Year == yearVar)
}
IQR <- quantile(df[[column]] , probs = 0.75) - quantile(df[[column]], probs = 0.25)
UpperFence <- quantile(df[[column]],probs = 0.75) + 1.5 * IQR
filtered <- df[df[[column]] > UpperFence ,]
return(arrange(filtered, desc(!! rlang::sym(column))))
}
The next function does the same but returns every row that falls below the lower fence.
getLowerOutliers<- function(df, column, yearVar){
if(!missing(yearVar)){
df <- df %>% filter(Year == yearVar)
}
IQR <- quantile(df[[column]] , probs = 0.75) - quantile(df[[column]], probs = 0.25)
LowerFence <- quantile(df[[column]], probs = 0.25) - 1.5 * IQR
filtered <- df[df[[column]] < LowerFence ,]
return(arrange(filtered, !! rlang::sym(column)))
}
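The two functions differ only in which fence they use and in the sort direction, so they could be folded into one. The following is a sketch of that alternative, not a replacement for the functions above: the name `get_outliers`, the `side` argument and the toy data are all assumptions, and it uses dplyr's `.data[[column]]` pronoun to refer to a column by its name as a string.

```r
library(dplyr)

# Hypothetical combined helper: side = "upper" or "lower"
get_outliers <- function(df, column, side = c("upper", "lower")) {
  side <- match.arg(side)
  q <- quantile(df[[column]], probs = c(0.25, 0.75), names = FALSE)
  spread <- q[2] - q[1]                      # interquartile range
  if (side == "upper") {
    df %>%
      filter(.data[[column]] > q[2] + 1.5 * spread) %>%
      arrange(desc(.data[[column]]))
  } else {
    df %>%
      filter(.data[[column]] < q[1] - 1.5 * spread) %>%
      arrange(.data[[column]])
  }
}

# Toy data: one extreme value among ten
toy <- data.frame(Country = LETTERS[1:10], GDP = c(1:9, 100))
upper <- get_outliers(toy, "GDP", "upper")
lower <- get_outliers(toy, "GDP", "lower")
upper
```

A year filter could be applied before calling the helper, for example by passing in a dataframe that has already been filtered to one year.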
This is a function to get our boxplots
getBoxPlot <- function(df, column, yearVar){
if(!missing(yearVar)){
df <- df %>% filter(Year == yearVar)
title <- paste("Boxplot of", column, "in", yearVar)
} else{
title <- paste("Boxplot of", column, "from 1960 to 2019")
}
df[[column]] %>% boxplot(main=title, xlab=column)
}
and a function to get our histogram
getHist <- function(df, column, yearVar){
if(!missing(yearVar)){
df <- df %>% filter(Year == yearVar)
title <- paste("Distribution of", column, "in", yearVar)
} else{ title <- paste("Distribution of", column, "from 1960 to 2019")}
df[[column]] %>% hist(breaks = 50, main=title, xlab=column)
}
Finally, a function that draws both the histogram and the boxplot for the same variable.
showDistribution <- function(df, column, yearVar){
  # Pass df and column through so both plots show the requested variable
  if(!missing(yearVar)){
    w <- getHist(df, column, yearVar)
    x <- getBoxPlot(df, column, yearVar)
  } else {
    w <- getHist(df, column)
    x <- getBoxPlot(df, column)
  }
  return(list(w,x))
}
With our functions defined, we can take a look at outliers in GDP per capita for the year 1960. We can see a right-skewed distribution with a large number of upper outliers and no lower outliers.
par(mfrow= c(1,2))
showDistribution(Countries, "GDP_per_Capita",1960 )
Looking at the dataframe, we can see that the outliers are all affluent countries, mostly European, which is to be expected. These countries produce far more relative to their population, with Bermuda having the highest GDP_per_Capita.
getUpperOutliers(Countries, "GDP_per_Capita",1960 )
getLowerOutliers(Countries, "GDP_per_Capita",1960 )
Fast forward to 2019 and we can see much the same pattern.
par(mfrow= c(1,2))
showDistribution(Countries, "GDP_per_Capita",2019 )
This time, however, the upper outliers also include some Asian and Arab nations.
getUpperOutliers(Countries, "GDP_per_Capita",2019 )
getLowerOutliers(Countries, "GDP_per_Capita",2019 )
Looking at population in 1970, we can see that the distribution follows the same pattern: right skewed, with a good number of upper outliers and again no lower outliers.
par(mfrow= c(1,2))
showDistribution(Countries, "Population", 1970)
Looking at the outliers, we see the usual suspects, China and India, with the highest populations, along with a good number of other countries.
getUpperOutliers(Countries, "Population", 1970)
getLowerOutliers(Countries, "Population", 1970)
Fast forward to 2019 and we can see that there are some huge outliers and some moderate ones too.
par(mfrow= c(1,2))
showDistribution(Countries, "Population", 2019)
Looking at the countries we can see that China and India dwarf everybody else in terms of population
getUpperOutliers(Countries, "Population",2019)
getLowerOutliers(Countries, "Population",2019 )
Looking only at GDP in the year 1960, we can again see a strongly right-skewed distribution, with one massive outlier and some moderate ones.
par(mfrow= c(1,2))
showDistribution(Countries, "GDP", 1960)
Listing the GDP outliers for 1960 identifies the country behind the massive outlier.
getUpperOutliers(Countries, "GDP", 1960)
getLowerOutliers(Countries, "GDP", 1960)
If we look at GDP for 2019, we can see a huge number of outliers, some of them massive.
par(mfrow= c(1,2))
showDistribution(Countries, "GDP", 2019)
If we look at the countries responsible for this, we can see that China, India and the United States are economic powerhouses.
getUpperOutliers(Countries, "GDP", 2019)
getLowerOutliers(Countries, "GDP", 2019)
We can see from our examples that the outliers in this context are not the result of any of the following:
Data Entry Errors
Measurement Errors
Experimental Error
Intentional Error
Data Processing Errors
Sampling error
The outliers in our examples reflect genuinely large and rich nations, and show how small and poor much of the rest of the world is in comparison. Imputing replacement values for these outliers doesn’t make sense and would distort any statistical analysis of the world as a whole.
An argument could be made for excluding the rich or large nations in order to produce a “rest of the world” dataset. If we wanted to do that, we could do it for a particular year by keeping everything at or below Tukey’s upper fence. I have defined a function that does this; I have not handled lower outliers since we have none.
restOftheWorld <- function(df, column, yearVar){
if(!missing(yearVar)){
df <- df %>% filter(Year == yearVar)
}
IQR <- quantile(df[[column]] , probs = 0.75) - quantile(df[[column]], probs = 0.25)
UpperFence <- quantile(df[[column]],probs = 0.75) + 1.5 * IQR
filtered <- df[df[[column]] <= UpperFence ,] # keep everything that isn't an upper outlier (outliers are > UpperFence)
return(arrange(filtered, desc(!! rlang::sym(column))))
}
Here we can see the 2019 population of the rest of the world, excluding the large outliers.
restOftheWorld(Countries, "Population", 2019 )
The same for GDP
restOftheWorld(Countries, "GDP", 2019 )
And the same for GDP per capita
restOftheWorld(Countries, "GDP_per_Capita", 2019 )
As we could see from the outlier exploration, our data is right skewed. We can apply either a natural log or a log10 transformation to reduce the skewness and bring the distribution closer to normal. Once the transformed values are plotted as a histogram, we can see that the distribution looks a lot more normal.
log10(Countries$GDP) %>% hist(breaks = 50, main = "Distribution of all Countries log10 GDP")
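The visual impression can be backed up with a number. The sketch below computes a simple moment-based sample skewness before and after a log10 transformation. It uses simulated log-normal values as a stand-in for GDP, so the data, the seed and the helper name `skewness` are all assumptions; no extra packages are needed.

```r
# Moment-based sample skewness (hypothetical helper, base R only)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

set.seed(42)
# Simulated right-skewed values standing in for GDP (not World Bank data)
simulated_gdp <- rlnorm(1000, meanlog = 20, sdlog = 2)

skew_raw <- skewness(simulated_gdp)
skew_log <- skewness(log10(simulated_gdp))

c(raw = skew_raw, log10 = skew_log)
```

A skewness close to zero after the transformation indicates a roughly symmetric distribution. Applying the same helper to Countries$GDP and log10(Countries$GDP) would give the corresponding figures for the real data.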