MATH2349 Semester 1, 2020

Required packages

Loading the Required packages.

# This is the R chunk for the required packages
library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(knitr)
library(outliers)

Executive Summary

The important steps of the data pre-processing are as follows:

Get: The datasets were downloaded from the public open-data website Kaggle (kaggle.com) and then imported into R. Three datasets (GDP, Population, Continents) were merged for this analysis.
Understand: The individual datasets and the merged dataset were inspected. The dimensions and structure of the data were checked, along with the meaning and type of each attribute.
Tidy and manipulate: Two of the datasets were untidy, so they were reshaped so that each variable is stored in a column and each observation in a row. One new variable, GDP-PER-CAPITA, was created.
Scan: All variables were checked for missing values, special values and obvious errors. Categorical variables were checked for the plausibility of their values; numerical variables were scanned for possible outliers.
Transform: In the final step, a log transformation was applied to the numerical columns that have a skewed distribution.

Data

As discussed earlier, we import 3 datasets for this analysis. Their descriptions are as follows:

GDP of all Countries-

The file used for the analysis is ‘GDP.csv’. This dataset was taken from Kaggle and can be viewed using the following link: https://www.kaggle.com/greeshmagirish/worldbank-data-on-gdp-population-and-military?select=API_NY.GDP.MKTP.CD_DS2_en_csv_v2_559588.csv

The dataset has 264 observations and 65 variables. The variables are:

‘Country Name’: Name of the Country
‘Country Code’: Code of the Country
‘Indicator Name’: Contains the Indicator- GDP(US$)
‘Indicator Code’: Code for the Indicator
59 columns for the years ‘1960’ to ‘2018’: GDP of each country for each of the 59 years from 1960 to 2018
The last two columns, ‘2019’ and ‘X65’, are empty, so they are omitted from further analysis.

GDP <- read_csv("C:/Users/Daivik/Desktop/GDP.csv")

## Warning: Missing column names filled in: 'X65' [65]

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `Country Name` = col_character(),
##   `Country Code` = col_character(),
##   `Indicator Name` = col_character(),
##   `Indicator Code` = col_character(),
##   `2019` = col_logical(),
##   X65 = col_logical()
## )

## See spec(...) for full column specifications.

head(GDP)
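The two empty trailing columns noted above could also be detected programmatically rather than by inspection. The following is a minimal sketch on a made-up data frame; `toy`, its column names and its values are illustrative assumptions and are not taken from GDP.csv.

```r
# Toy data frame standing in for an imported file (values are invented)
toy <- data.frame(
  country = c("A", "B", "C"),
  y2018   = c(1.5, NA, 3.2),   # partly missing, but not empty
  y2019   = c(NA, NA, NA),     # entirely empty, like `2019`
  X65     = c(NA, NA, NA)      # entirely empty, like `X65`
)

# A column is empty when every one of its values is NA
empty_cols <- names(toy)[colSums(is.na(toy)) == nrow(toy)]
empty_cols

# Keep only the columns that hold at least one value
toy_clean <- toy[, !(names(toy) %in% empty_cols), drop = FALSE]
names(toy_clean)
```

A check like this distinguishes a column that is completely empty from one that merely has some missing values, which matters when deciding whether a column can be dropped.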

Population of all Countries-

The second dataset contains information on population of all countries from 1960 to 2018. This dataset has also been downloaded from Kaggle and can be accessed using the following link: https://www.kaggle.com/greeshmagirish/worldbank-data-on-gdp-population-and-military?select=API_SP.POP.TOTL_DS2_en_csv_v2_511378.csv

The data file has 264 observations and 65 variables. The variables are as follows:

‘Country Name’: Name of the country
‘Country Code’: Code of the country
‘Indicator Name’: Contains the Indicator- Population(Total)
‘Indicator Code’: Code for the Indicator
59 columns for the years ‘1960’ to ‘2018’: Population of each country for each of the 59 years from 1960 to 2018
The last two columns, ‘2019’ and ‘X65’, are again empty.

POPULATION <- read_csv("C:/Users/Daivik/Desktop/population.csv")

## Warning: Missing column names filled in: 'X65' [65]

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `Country Name` = col_character(),
##   `Country Code` = col_character(),
##   `Indicator Name` = col_character(),
##   `Indicator Code` = col_character(),
##   `2019` = col_logical(),
##   X65 = col_logical()
## )

## See spec(...) for full column specifications.

head(POPULATION)

Continent information

The third dataset, named ‘Continent’ has also been taken from Kaggle. It can be accessed using the following link: https://www.kaggle.com/sarques/conticountry .

The data file has 249 observations and 8 variables. The important variables are as follows:

‘Country’: Country name
‘Country_code’: Code of the country
‘Continent’: Name of the continent the country is in
‘Region-1’: Name of the sub-region of the continent the country is in
The other variables are ‘No.’, ‘M49_Code’, ‘Region-2’ and ‘X8’. These variables are not essential to the analysis.

We will merge the 3 datasets after tidying them.

CONTINENT <- read_csv("C:/Users/Daivik/Desktop/Continent.csv")

## Warning: Missing column names filled in: 'X8' [8]

## Parsed with column specification:
## cols(
##   No. = col_double(),
##   Country = col_character(),
##   Country_code = col_character(),
##   M49_Code = col_character(),
##   `Region-1` = col_character(),
##   `Region-2` = col_character(),
##   Continent = col_character(),
##   X8 = col_character()
## )

head(CONTINENT)

Selecting only the required variables from all three data sets:

GDP Data:

Country Name
Country Code
Years from 2000-2017

Only the years 2000 to 2017 are kept, in order to reduce the number of missing values and the uncertainty in the data.

# Keep the two identifier columns and the contiguous year columns 2000 to 2017
GDP<-GDP %>% select(`Country Name`,`Country Code`,`2000`:`2017`)
head(GDP)

POPULATION Data:

Country Name
Country Code
Years from 2000-2017

# Keep the two identifier columns and the contiguous year columns 2000 to 2017
POPULATION<-POPULATION %>% select(`Country Name`,`Country Code`,`2000`:`2017`)
head(POPULATION)

CONTINENT Data:

Country
Country_code
Region-1
Continent

CONTINENT<-CONTINENT %>% select(Country, Country_code,`Region-1`,Continent)
head(CONTINENT)

Tidy & Manipulate 1

The main principles of a tidy dataset are:

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

After inspecting all the original datasets, we find that only the CONTINENT dataset is tidy, whereas the GDP and POPULATION datasets are not, for the following reasons:

In the GDP dataset, the years 2000 to 2017 appear as column headers, although they are values of a variable. These headers should instead be stored as row values in a single column named ‘year’, with the GDP figures in a second column.

#Converting wide format data into Long format
TIDY_GDP <- GDP %>% gather(key = "year",value = TOTAL_GDP,3:20)
head(TIDY_GDP)
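In current versions of tidyr, gather() is superseded by pivot_longer(). As a hedged alternative to the call above, the same wide-to-long reshape can be sketched on a small invented table; `wide` and its values are assumptions used only for illustration.

```r
library(tidyr)

# Toy wide table (invented values): one column per year, as in GDP
wide <- data.frame(
  country = c("A", "B"),
  `2000`  = c(10, 20),
  `2001`  = c(11, 21),
  check.names = FALSE   # keep the numeric-looking column names as they are
)

# Move the year headers into a `year` column and the cells into `TOTAL_GDP`
long <- pivot_longer(
  wide,
  cols      = -country,    # every column except the identifier
  names_to  = "year",
  values_to = "TOTAL_GDP"
)
long
```

Note that pivot_longer() returns the rows grouped by the original row (all years of A, then all years of B), whereas gather() stacks one year column after another. The content is the same; only the row order differs.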

Similarly, in the POPULATION dataset the years 2000 to 2017 appear as column headers. These headers should instead be stored as row values in a single column named ‘year’, with the population figures in a second column.

#Converting wide format data into Long format
TIDY_POPULATION <- POPULATION %>% gather(key = "year",value = "Total_population",3:20)
head(TIDY_POPULATION)

Merging all three datasets -

Three columns (Country Name, Country Code and year) are common to the first two datasets (TIDY_GDP and TIDY_POPULATION).
Therefore, TIDY_GDP is joined with TIDY_POPULATION on these columns using a left join.
The resulting dataset is then joined with the CONTINENT dataset on the common variables “Country Name” and “Country Code” using an inner join, after renaming the CONTINENT columns to match.

# joining the TIDY_GDP & TIDY_POPULATION
POP_GDP<-left_join(TIDY_GDP,TIDY_POPULATION)

## Joining, by = c("Country Name", "Country Code", "year")

# Renaming the variables of CONTINENT Dataset for joining
colnames(CONTINENT)[1] <-"Country Name"
colnames(CONTINENT)[2] <-"Country Code"

# Joining FINAL_DATA
FINAL_DATA<-inner_join(POP_GDP,CONTINENT)

## Joining, by = c("Country Name", "Country Code")

head(FINAL_DATA)
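An inner join keeps only the rows whose keys match in both tables, so entries without a match are dropped silently; the merged data has 183 countries, fewer than the 264 rows of the GDP file. A minimal sketch, on invented tables, of how the dropped rows could be listed with anti_join() and how the join keys can be stated explicitly (`gdp`, `cont` and their values are assumptions for illustration):

```r
library(dplyr)

# Invented stand-ins for the GDP/population table and the continent table
gdp  <- data.frame(code = c("AAA", "BBB", "CCC"), gdp = c(1, 2, 3))
cont <- data.frame(code = c("AAA", "BBB"), continent = c("Asia", "Europe"))

# Naming the key makes the join explicit and suppresses the "Joining, by" message
merged <- inner_join(gdp, cont, by = "code")

# anti_join() returns the rows of `gdp` that have no match in `cont`,
# i.e. exactly the rows the inner join leaves out
dropped <- anti_join(gdp, cont, by = "code")

merged
dropped
```

Inspecting the dropped rows shows whether they are aggregates, spelling mismatches in the country name, or genuine countries that are missing from the lookup table.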

Understand

Once all the datasets were merged, the dimensions were checked using the dim() function.
The new dataset FINAL_DATA has 3294 observations and 7 variables.
A summary of FINAL_DATA was produced using the summary() function.
The structure of the dataset was checked using the str() function.

# This is the R chunk for the Understand Section

dim(FINAL_DATA)

## [1] 3294    7

summary(FINAL_DATA)

##  Country Name       Country Code           year             TOTAL_GDP        
##  Length:3294        Length:3294        Length:3294        Min.   :1.320e+07  
##  Class :character   Class :character   Class :character   1st Qu.:3.542e+09  
##  Mode  :character   Mode  :character   Mode  :character   Median :1.696e+10  
##                                                           Mean   :2.250e+11  
##                                                           3rd Qu.:1.124e+11  
##                                                           Max.   :1.214e+13  
##                                                           NA's   :170        
##  Total_population      Region-1          Continent        
##  Min.   :9.394e+03   Length:3294        Length:3294       
##  1st Qu.:7.455e+05   Class :character   Class :character  
##  Median :5.527e+06   Mode  :character   Mode  :character  
##  Mean   :3.202e+07                                        
##  3rd Qu.:1.780e+07                                        
##  Max.   :1.386e+09                                        
##  NA's   :6

str(FINAL_DATA)

## tibble [3,294 x 7] (S3: tbl_df/tbl/data.frame)
##  $ Country Name    : chr [1:3294] "Aruba" "Afghanistan" "Angola" "Albania" ...
##  $ Country Code    : chr [1:3294] "ABW" "AFG" "AGO" "ALB" ...
##  $ year            : chr [1:3294] "2000" "2000" "2000" "2000" ...
##  $ TOTAL_GDP       : num [1:3294] 1.87e+09 NA 9.13e+09 3.48e+09 1.43e+09 ...
##  $ Total_population: num [1:3294] 90853 20779953 16395473 3089027 65390 ...
##  $ Region-1        : chr [1:3294] "Caribbean" "Southern Asia" "Middle Africa" "Southern Europe" ...
##  $ Continent       : chr [1:3294] "North America" "Asia" "Africa" "Europe" ...

Some of the variables were not stored in an appropriate type and were converted as follows:

‘Country Name’: converted from character to a factor with 183 levels.
‘Country Code’: converted from character to a factor with 183 levels.
‘year’: converted from character to an ordered factor with 18 levels.
‘Region-1’: converted from character to an ordered factor with 26 levels.
‘Continent’: converted from character to an ordered factor with 8 levels.

The structure of the dataset is then printed again to confirm all the data type conversions.

#conversion of variables
FINAL_DATA$`Country Name`<-as.factor(FINAL_DATA$`Country Name`)
class(FINAL_DATA$`Country Name`)

## [1] "factor"

levels(FINAL_DATA$`Country Name`)

##   [1] "Afghanistan"               "Albania"                  
##   [3] "Algeria"                   "American Samoa"           
##   [5] "Andorra"                   "Angola"                   
##   [7] "Antigua and Barbuda"       "Argentina"                
##   [9] "Armenia"                   "Aruba"                    
##  [11] "Australia"                 "Austria"                  
##  [13] "Azerbaijan"                "Bahrain"                  
##  [15] "Bangladesh"                "Barbados"                 
##  [17] "Belarus"                   "Belgium"                  
##  [19] "Belize"                    "Benin"                    
##  [21] "Bermuda"                   "Bhutan"                   
##  [23] "Bosnia and Herzegovina"    "Botswana"                 
##  [25] "Brazil"                    "British Virgin Islands"   
##  [27] "Brunei Darussalam"         "Bulgaria"                 
##  [29] "Burkina Faso"              "Burundi"                  
##  [31] "Cabo Verde"                "Cambodia"                 
##  [33] "Cameroon"                  "Canada"                   
##  [35] "Cayman Islands"            "Central African Republic" 
##  [37] "Chad"                      "Chile"                    
##  [39] "China"                     "Colombia"                 
##  [41] "Comoros"                   "Costa Rica"               
##  [43] "Croatia"                   "Cuba"                     
##  [45] "Cyprus"                    "Denmark"                  
##  [47] "Djibouti"                  "Dominica"                 
##  [49] "Dominican Republic"        "Ecuador"                  
##  [51] "El Salvador"               "Equatorial Guinea"        
##  [53] "Eritrea"                   "Estonia"                  
##  [55] "Eswatini"                  "Ethiopia"                 
##  [57] "Faroe Islands"             "Fiji"                     
##  [59] "Finland"                   "France"                   
##  [61] "French Polynesia"          "Gabon"                    
##  [63] "Georgia"                   "Germany"                  
##  [65] "Ghana"                     "Gibraltar"                
##  [67] "Greece"                    "Greenland"                
##  [69] "Grenada"                   "Guam"                     
##  [71] "Guatemala"                 "Guinea"                   
##  [73] "Guinea-Bissau"             "Guyana"                   
##  [75] "Haiti"                     "Honduras"                 
##  [77] "Hungary"                   "Iceland"                  
##  [79] "India"                     "Indonesia"                
##  [81] "Iraq"                      "Ireland"                  
##  [83] "Isle of Man"               "Israel"                   
##  [85] "Italy"                     "Jamaica"                  
##  [87] "Japan"                     "Jordan"                   
##  [89] "Kazakhstan"                "Kenya"                    
##  [91] "Kiribati"                  "Kuwait"                   
##  [93] "Latvia"                    "Lebanon"                  
##  [95] "Lesotho"                   "Liberia"                  
##  [97] "Libya"                     "Liechtenstein"            
##  [99] "Lithuania"                 "Luxembourg"               
## [101] "Madagascar"                "Malawi"                   
## [103] "Malaysia"                  "Maldives"                 
## [105] "Mali"                      "Malta"                    
## [107] "Marshall Islands"          "Mauritania"               
## [109] "Mauritius"                 "Mexico"                   
## [111] "Monaco"                    "Mongolia"                 
## [113] "Montenegro"                "Morocco"                  
## [115] "Mozambique"                "Myanmar"                  
## [117] "Namibia"                   "Nauru"                    
## [119] "Nepal"                     "Netherlands"              
## [121] "New Caledonia"             "New Zealand"              
## [123] "Nicaragua"                 "Niger"                    
## [125] "Nigeria"                   "North Macedonia"          
## [127] "Northern Mariana Islands"  "Norway"                   
## [129] "Oman"                      "Pakistan"                 
## [131] "Palau"                     "Panama"                   
## [133] "Papua New Guinea"          "Paraguay"                 
## [135] "Peru"                      "Philippines"              
## [137] "Poland"                    "Portugal"                 
## [139] "Puerto Rico"               "Qatar"                    
## [141] "Romania"                   "Russian Federation"       
## [143] "Rwanda"                    "Samoa"                    
## [145] "San Marino"                "Sao Tome and Principe"    
## [147] "Saudi Arabia"              "Senegal"                  
## [149] "Serbia"                    "Seychelles"               
## [151] "Sierra Leone"              "Singapore"                
## [153] "Sint Maarten (Dutch part)" "Slovenia"                 
## [155] "Solomon Islands"           "Somalia"                  
## [157] "South Africa"              "South Sudan"              
## [159] "Spain"                     "Sri Lanka"                
## [161] "Sudan"                     "Suriname"                 
## [163] "Sweden"                    "Switzerland"              
## [165] "Syrian Arab Republic"      "Tajikistan"               
## [167] "Thailand"                  "Timor-Leste"              
## [169] "Togo"                      "Tonga"                    
## [171] "Trinidad and Tobago"       "Tunisia"                  
## [173] "Turkey"                    "Turkmenistan"             
## [175] "Turks and Caicos Islands"  "Tuvalu"                   
## [177] "Uganda"                    "Ukraine"                  
## [179] "United Arab Emirates"      "Uruguay"                  
## [181] "Uzbekistan"                "Vanuatu"                  
## [183] "Zambia"

FINAL_DATA$`Country Code`<-as.factor(FINAL_DATA$`Country Code`)
class(FINAL_DATA$`Country Code`)

## [1] "factor"

levels(FINAL_DATA$`Country Code`)

##   [1] "ABW" "AFG" "AGO" "ALB" "AND" "ARE" "ARG" "ARM" "ASM" "ATG" "AUS" "AUT"
##  [13] "AZE" "BDI" "BEL" "BEN" "BFA" "BGD" "BGR" "BHR" "BIH" "BLR" "BLZ" "BMU"
##  [25] "BRA" "BRB" "BRN" "BTN" "BWA" "CAF" "CAN" "CHE" "CHL" "CHN" "CMR" "COL"
##  [37] "COM" "CPV" "CRI" "CUB" "CYM" "CYP" "DEU" "DJI" "DMA" "DNK" "DOM" "DZA"
##  [49] "ECU" "ERI" "ESP" "EST" "ETH" "FIN" "FJI" "FRA" "FRO" "GAB" "GEO" "GHA"
##  [61] "GIB" "GIN" "GNB" "GNQ" "GRC" "GRD" "GRL" "GTM" "GUM" "GUY" "HND" "HRV"
##  [73] "HTI" "HUN" "IDN" "IMN" "IND" "IRL" "IRQ" "ISL" "ISR" "ITA" "JAM" "JOR"
##  [85] "JPN" "KAZ" "KEN" "KHM" "KIR" "KWT" "LBN" "LBR" "LBY" "LIE" "LKA" "LSO"
##  [97] "LTU" "LUX" "LVA" "MAR" "MCO" "MDG" "MDV" "MEX" "MHL" "MKD" "MLI" "MLT"
## [109] "MMR" "MNE" "MNG" "MNP" "MOZ" "MRT" "MUS" "MWI" "MYS" "NAM" "NCL" "NER"
## [121] "NGA" "NIC" "NLD" "NOR" "NPL" "NRU" "NZL" "OMN" "PAK" "PAN" "PER" "PHL"
## [133] "PLW" "PNG" "POL" "PRI" "PRT" "PRY" "PYF" "QAT" "ROU" "RUS" "RWA" "SAU"
## [145] "SDN" "SEN" "SGP" "SLB" "SLE" "SLV" "SMR" "SOM" "SRB" "SSD" "STP" "SUR"
## [157] "SVN" "SWE" "SWZ" "SXM" "SYC" "SYR" "TCA" "TCD" "TGO" "THA" "TJK" "TKM"
## [169] "TLS" "TON" "TTO" "TUN" "TUR" "TUV" "UGA" "UKR" "URY" "UZB" "VGB" "VUT"
## [181] "WSM" "ZAF" "ZMB"

FINAL_DATA$year<-FINAL_DATA$year %>% factor(levels = c(2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017),ordered = TRUE)
class(FINAL_DATA$year)

## [1] "ordered" "factor"

levels(FINAL_DATA$year)

##  [1] "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009"
## [11] "2010" "2011" "2012" "2013" "2014" "2015" "2016" "2017"

FINAL_DATA$`Region-1`<-FINAL_DATA$`Region-1` %>%  factor(levels = c("344","446","535","Antarctica","Australia and New Zealand","Caribbean","Central America","Central Asia","Eastern Africa","Eastern Asia","Eastern Europe","Melanesia","Micronesia","Middle Africa","Northern Africa","Northern America", "Northern Europe","Polynesia","South-eastern Asia","South America","Southern Africa","Southern Asia", "Southern Europe","Western Africa", "Western Asia","Western Europe"), ordered = TRUE)

class(FINAL_DATA$`Region-1`)

## [1] "ordered" "factor"

levels(FINAL_DATA$`Region-1`)

##  [1] "344"                       "446"                      
##  [3] "535"                       "Antarctica"               
##  [5] "Australia and New Zealand" "Caribbean"                
##  [7] "Central America"           "Central Asia"             
##  [9] "Eastern Africa"            "Eastern Asia"             
## [11] "Eastern Europe"            "Melanesia"                
## [13] "Micronesia"                "Middle Africa"            
## [15] "Northern Africa"           "Northern America"         
## [17] "Northern Europe"           "Polynesia"                
## [19] "South-eastern Asia"        "South America"            
## [21] "Southern Africa"           "Southern Asia"            
## [23] "Southern Europe"           "Western Africa"           
## [25] "Western Asia"              "Western Europe"

FINAL_DATA$Continent<-FINAL_DATA$Continent %>% factor(levels = c("Africa","Antarctica","Asia","Europe", "Latin America and the Caribbean", "North America","Oceania","South America"), ordered =TRUE)

class(FINAL_DATA$Continent)

## [1] "ordered" "factor"

levels(FINAL_DATA$Continent)

## [1] "Africa"                          "Antarctica"                     
## [3] "Asia"                            "Europe"                         
## [5] "Latin America and the Caribbean" "North America"                  
## [7] "Oceania"                         "South America"

str(FINAL_DATA)

## tibble [3,294 x 7] (S3: tbl_df/tbl/data.frame)
##  $ Country Name    : Factor w/ 183 levels "Afghanistan",..: 10 1 6 2 5 179 8 9 4 7 ...
##  $ Country Code    : Factor w/ 183 levels "ABW","AFG","AGO",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ year            : Ord.factor w/ 18 levels "2000"<"2001"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ TOTAL_GDP       : num [1:3294] 1.87e+09 NA 9.13e+09 3.48e+09 1.43e+09 ...
##  $ Total_population: num [1:3294] 90853 20779953 16395473 3089027 65390 ...
##  $ Region-1        : Ord.factor w/ 26 levels "344"<"446"<"535"<..: 6 22 14 23 23 25 20 25 18 6 ...
##  $ Continent       : Ord.factor w/ 8 levels "Africa"<"Antarctica"<..: 6 3 1 4 4 3 8 3 7 6 ...
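The difference between a plain (nominal) factor and an ordered factor matters for how later comparisons and models treat a variable. The sketch below uses invented vectors to illustrate the distinction. It is not a change to the conversions above, only a note that an ordering is meaningful for years, whereas an ordering of region or continent names is purely a labelling choice.

```r
# Years have a natural order, so an ordered factor is meaningful
year <- factor(c("2001", "2000", "2002"),
               levels = c("2000", "2001", "2002"), ordered = TRUE)
year[1] > year[2]          # comparisons are defined for ordered factors

# Continent names have no natural order; a plain factor avoids implying one
continent <- factor(c("Asia", "Africa", "Europe"))

is.ordered(year)
is.ordered(continent)
levels(continent)          # levels default to alphabetical order
```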

Tidy & Manipulate Data II

A new column called GDP-PER-CAPITA was created to store the GDP per capita, i.e.
```
 GDP-PER-CAPITA = TOTAL_GDP / Total_population
```
The mutate() function was used to create the new column in the FINAL_DATA dataset.
The str() function was used again to check the structure of the dataset after adding the new column.

# Creating a new Variable
FINAL_DATA<-FINAL_DATA %>% mutate("GDP-PER-CAPITA"= round((TOTAL_GDP/Total_population),2))
head(FINAL_DATA)

str(FINAL_DATA)

## tibble [3,294 x 8] (S3: tbl_df/tbl/data.frame)
##  $ Country Name    : Factor w/ 183 levels "Afghanistan",..: 10 1 6 2 5 179 8 9 4 7 ...
##  $ Country Code    : Factor w/ 183 levels "ABW","AFG","AGO",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ year            : Ord.factor w/ 18 levels "2000"<"2001"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ TOTAL_GDP       : num [1:3294] 1.87e+09 NA 9.13e+09 3.48e+09 1.43e+09 ...
##  $ Total_population: num [1:3294] 90853 20779953 16395473 3089027 65390 ...
##  $ Region-1        : Ord.factor w/ 26 levels "344"<"446"<"535"<..: 6 22 14 23 23 25 20 25 18 6 ...
##  $ Continent       : Ord.factor w/ 8 levels "Africa"<"Antarctica"<..: 6 3 1 4 4 3 8 3 7 6 ...
##  $ GDP-PER-CAPITA  : num [1:3294] 20621 NA 557 1127 21937 ...
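Because the new column is a ratio, a missing value in either input produces a missing result. A small sketch with invented numbers shows this propagation; the column names mirror FINAL_DATA for readability only, and the values are assumptions.

```r
library(dplyr)

# Invented values: row 2 lacks GDP, row 3 lacks population
toy <- data.frame(
  TOTAL_GDP        = c(2e9, NA, 5e8),
  Total_population = c(1e5, 2e6, NA)
)

# Same calculation as above: GDP divided by population, rounded to 2 decimals
toy <- toy %>%
  mutate(`GDP-PER-CAPITA` = round(TOTAL_GDP / Total_population, 2))

toy
colSums(is.na(toy))   # the derived column is NA wherever an input is NA
```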

Scan I

FINAL_DATA was scanned for missing values using the is.na() function, and the total number of missing values in each column was obtained using the colSums() function.

# This is the R chunk for the Scan I
colSums(is.na(FINAL_DATA))

##     Country Name     Country Code             year        TOTAL_GDP 
##                0                0                0              170 
## Total_population         Region-1        Continent   GDP-PER-CAPITA 
##                6                0                0              170

The missing values of each attribute were then imputed with that attribute's overall mean, a simple and widely used way of dealing with missing values.
Only the numerical variables ‘TOTAL_GDP’, ‘Total_population’ and ‘GDP-PER-CAPITA’ contain missing values, so only these are imputed.

#Imputing the missing values

FINAL_DATA$TOTAL_GDP[is.na(FINAL_DATA$TOTAL_GDP)]<-mean(FINAL_DATA$TOTAL_GDP,na.rm = TRUE)
sum(is.na(FINAL_DATA$TOTAL_GDP))

## [1] 0

FINAL_DATA$Total_population[is.na(FINAL_DATA$Total_population)]<-mean(FINAL_DATA$Total_population,na.rm = TRUE)
sum(is.na(FINAL_DATA$Total_population))

## [1] 0

FINAL_DATA$`GDP-PER-CAPITA`[is.na(FINAL_DATA$`GDP-PER-CAPITA`)]<-mean(FINAL_DATA$`GDP-PER-CAPITA`,na.rm = TRUE)
sum(is.na(FINAL_DATA$`GDP-PER-CAPITA`))

## [1] 0
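Imputing with the overall mean assigns the same value to every country, large or small. One possible refinement, sketched here on invented data and not the approach used in this report, is to impute within each country and to derive the ratio after imputation so that the three columns stay consistent. The names `toy`, `gdp`, `pop` and `gdp_per_capita` are assumptions for illustration.

```r
library(dplyr)

# Invented panel: two countries, three years each, with one gap per country
toy <- data.frame(
  country = c("A", "A", "A", "B", "B", "B"),
  gdp     = c(10, NA, 14, 1000, 1100, NA),
  pop     = c(2, 2, 2, 50, 50, 50)
)

imputed <- toy %>%
  group_by(country) %>%
  # Replace each NA with the mean of the same country's observed values
  mutate(gdp = ifelse(is.na(gdp), mean(gdp, na.rm = TRUE), gdp)) %>%
  ungroup() %>%
  # Derive the ratio after imputation so it agrees with gdp and pop
  mutate(gdp_per_capita = gdp / pop)

imputed
```

A country with no observed values at all would still be left with NaN under this scheme, so such cases would need a fallback such as the overall mean or removal.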

FINAL_DATA was then checked for special values such as Inf and NaN. A function, is.special(), was created to flag special values in numerical variables.

is.special <- function(x){ if(is.numeric(x))  
  (is.infinite(x)|is.nan(x))
}
FINAL_DATA %>% sapply(function(x) sum(is.special(x)))

##     Country Name     Country Code             year        TOTAL_GDP 
##                0                0                0                0 
## Total_population         Region-1        Continent   GDP-PER-CAPITA 
##                0                0                0                0

As the output shows, none of the variables in FINAL_DATA contains special values, and all the missing values have now been dealt with.

head(FINAL_DATA)
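Special values typically arise from arithmetic such as division by zero. The sketch below is a hypothetical variant of the check above, applied to an invented data frame that does contain such values; `count_special` and `toy` are illustrative names, not part of the original analysis.

```r
# Count Inf, -Inf and NaN in a vector; non-numeric columns count as zero
count_special <- function(x) {
  if (!is.numeric(x)) return(0L)
  sum(is.infinite(x) | is.nan(x))
}

# Invented data: dividing by zero yields Inf or -Inf, and 0 / 0 yields NaN
toy <- data.frame(
  name  = c("a", "b", "c", "d"),
  ratio = c(1 / 0, 0 / 0, -1 / 0, 2.5),
  stringsAsFactors = FALSE
)

special_counts <- vapply(toy, count_special, integer(1))
special_counts
```

Returning an explicit zero for non-numeric columns, and using vapply() with a declared result type, keeps the result a plain named integer vector.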

Scan II

The numeric variables of FINAL_DATA (TOTAL_GDP, Total_population, GDP-PER-CAPITA) were scanned for outliers and visualised using boxplots. A boxplot shows every value that falls outside the outlier fences as an individual point.
The outlier fences are Q1 - 1.5 × IQR (lower fence) and Q3 + 1.5 × IQR (upper fence), where Q1 and Q3 are the first and third quartiles and IQR = Q3 - Q1.
The z-score method was used to detect univariate outliers.
This method assumes that each variable is approximately normally distributed. The variables here are right-skewed, so the counts below should be read as a rough screen only.
A standardised z-score was calculated for every observation, and observations with an absolute z-score greater than 3 were counted as outliers.

# This is the R chunk for the Scan II
FINAL_DATA %>% select(TOTAL_GDP,Total_population,`GDP-PER-CAPITA`) %>% summary()

##    TOTAL_GDP         Total_population    GDP-PER-CAPITA    
##  Min.   :1.320e+07   Min.   :9.394e+03   Min.   :   111.9  
##  1st Qu.:3.986e+09   1st Qu.:7.458e+05   1st Qu.:  1455.0  
##  Median :1.969e+10   Median :5.537e+06   Median :  5199.2  
##  Mean   :2.250e+11   Mean   :3.202e+07   Mean   : 14271.3  
##  3rd Qu.:1.728e+11   3rd Qu.:1.799e+07   3rd Qu.: 15691.5  
##  Max.   :1.214e+13   Max.   :1.386e+09   Max.   :189170.9

FINAL_DATA$TOTAL_GDP %>% boxplot(main="Box-Plot of GDP",ylab="Total GDP")

FINAL_DATA$Total_population %>% boxplot(main="Box-Plot of POPULATION",ylab="Total Population")

FINAL_DATA$`GDP-PER-CAPITA` %>% boxplot(main="Box-Plot of GDP PER CAPITA",ylab="GDP/POPULATION")

#z-scores for all the numeric variables
z.score_GDP <-FINAL_DATA$TOTAL_GDP %>% scores(type = "z")
z.score_GDP %>% summary()

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.30443 -0.29905 -0.27779  0.00000 -0.07065 16.12926

length(which(abs(z.score_GDP)>3))

## [1] 60

z.score_POP <-FINAL_DATA$Total_population  %>% scores(type = "z")
z.score_POP %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.2373 -0.2319 -0.1964  0.0000 -0.1040 10.0416

length(which(abs(z.score_POP)>3))

## [1] 36

z.score <-FINAL_DATA$`GDP-PER-CAPITA`  %>% scores(type = "z")
z.score %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.6371 -0.5767 -0.4082  0.0000  0.0639  7.8695

length(which(abs(z.score)>3))

## [1] 73
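The two outlier rules used in this section can be compared on a small invented vector. The sketch below computes quartile-based fences and z-scores by hand in base R. It assumes that scores(type = "z") corresponds to (x - mean) / sd, and note that boxplot() uses hinges that can differ slightly from quantile() in small samples.

```r
x <- c(1:19, 100)   # invented values with one extreme observation

# Quartile-based fences: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR
q1  <- quantile(x, 0.25, names = FALSE)
q3  <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outside_fences <- x[x < lower_fence | x > upper_fence]

# z-scores: distance from the mean, measured in standard deviations
z <- (x - mean(x)) / sd(x)
beyond_3sd <- x[abs(z) > 3]

c(lower = lower_fence, upper = upper_fence)
outside_fences
beyond_3sd
```

In this example both rules flag the same value. On strongly skewed data the two rules can disagree, because extreme values inflate the mean and standard deviation that the z-score relies on.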

The outliers were then capped (winsorised) rather than removed. Observations below the lower fence (Q1 - 1.5 × IQR) were replaced with the value of the 5th percentile, and observations above the upper fence (Q3 + 1.5 × IQR) were replaced with the value of the 95th percentile.
The following user-defined function was used to cap the outliers:

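# Define a function to cap the values outside the limits

cap <- function(x){
    # 5th percentile, Q1, Q3 and 95th percentile, in that order
    quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
    # Compute both fences once, before any value of x is changed
    lower <- quantiles[2] - 1.5*IQR(x)
    upper <- quantiles[3] + 1.5*IQR(x)
    x[ x < lower ] <- quantiles[1]   # below the lower fence: 5th percentile
    x[ x > upper ] <- quantiles[4]   # above the upper fence: 95th percentile
    x
}

# Capping the outliers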

FINAL_DATA$TOTAL_GDP<-cap(FINAL_DATA$TOTAL_GDP)
FINAL_DATA$Total_population<-cap(FINAL_DATA$Total_population)
FINAL_DATA$`GDP-PER-CAPITA`<-cap(FINAL_DATA$`GDP-PER-CAPITA`)
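To see what capping does to individual values, the rule can be applied to a small invented vector. The sketch below restates the capping rule in a self-contained helper (`cap_demo` is a hypothetical name, and the fences are computed once before any value is replaced); the data are made up.

```r
# Capping rule: values outside the quartile fences are replaced by the
# 5th percentile (low side) or the 95th percentile (high side)
cap_demo <- function(x) {
  q     <- quantile(x, c(0.05, 0.25, 0.75, 0.95), names = FALSE)
  iqr   <- q[3] - q[2]
  lower <- q[2] - 1.5 * iqr
  upper <- q[3] + 1.5 * iqr
  x[x < lower] <- q[1]
  x[x > upper] <- q[4]
  x
}

x      <- c(1:19, 100)   # invented values with one extreme observation
capped <- cap_demo(x)

which(capped != x)       # positions that were changed
capped[20]               # the extreme value after capping
```

Only the extreme value is changed; every observation inside the fences is left as it was, so the number of rows is preserved.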

Boxplots of the variables after capping the outliers -

boxplot(FINAL_DATA$TOTAL_GDP,FINAL_DATA$Total_population,FINAL_DATA$`GDP-PER-CAPITA`,main="Box-Plot after  capping outliers",names = c("GDP","POPULATION","GDP PER CAPITA"))

Transform

All three numerical variables (TOTAL_GDP, Total_population, GDP-PER-CAPITA) were transformed.
For each variable, histograms before and after the transformation are shown side by side.
The histograms of the original variables are right-skewed, so a log transformation was applied.
A log transformation is commonly used to reduce right skewness and bring a distribution closer to a normal distribution.
After the transformation, the variables were approximately normally distributed.

# This is the R chunk for the Transform Section
par(mfrow=c(1,2))
hist(FINAL_DATA$TOTAL_GDP,main = "GDP",xlab = "Total GDP",col = "grey")
Log_GDP<-log(FINAL_DATA$TOTAL_GDP)
hist(Log_GDP,main = "Transformed GDP",xlab ="log(Total GDP)",col = "red")

par(mfrow=c(1,2))

hist(FINAL_DATA$Total_population,main = "POPULATION",xlab = "Total Population",col = "grey")

Log_POP <- log(FINAL_DATA$Total_population)
hist(Log_POP,main = "Transformed POPULATION",xlab = "log(Total Population)",col = "red")

par(mfrow=c(1,2))
hist(FINAL_DATA$`GDP-PER-CAPITA`,main = "GDP per CAPITA",xlab = "GDP per CAPITA",col = "grey")
Log_GDPperCAPITA<-log(FINAL_DATA$`GDP-PER-CAPITA`)
hist(Log_GDPperCAPITA,main = "Transformed - GDP per CAPITA",xlab = "log(GDP per CAPITA)",col = "red")
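The effect of the log transformation can also be summarised numerically rather than judged from histograms alone. The sketch below uses simulated data, not the report's variables, and a hand-written sample skewness; `skewness` is a hypothetical helper defined here as the third standardised moment.

```r
set.seed(1)                        # reproducible simulated sample
x <- rlnorm(1000, meanlog = 10)    # right-skewed, strictly positive values

# Sample skewness: close to 0 for a symmetric distribution,
# positive for a long right tail
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(x)
skew_log <- skewness(log(x))

c(raw = skew_raw, log = skew_log)
```

The log transformation is only defined for strictly positive values, which holds for GDP, population and GDP per capita. A skewness near zero after the transformation is consistent with, but does not prove, approximate normality.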

References

Dataset- https://www.kaggle.com/greeshmagirish/worldbank-data-on-gdp-population-and-military?select=API_SP.POP.TOTL_DS2_en_csv_v2_511378.csv
Dolgun, Dr. Anil, ‘MATH2349 Data Wrangling’, lecture notes, RMIT University, http://rare-phoenix-161610.appspot.com/secured/index.html