Required packages

The following packages are loaded for this analysis.

# This is the R chunk for the required packages
library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(knitr)
library(outliers)

Executive Summary

The important steps of the data pre-processing are as follows:

  1. Get: The datasets were downloaded from the public open-data website Kaggle (kaggle.com) and then imported into R. Three datasets (GDP, Population, Continent) were merged for this analysis.

  2. Understand: This step examines the individual datasets as well as the merged dataset. The volume and structure of the data were checked, along with the meaning and type of each attribute.

  3. Tidy and manipulate: Two of the datasets were untidy, so I tidied them to make sure each variable is stored in a column and each observation has one row. One new variable, GDP-PER-CAPITA, was created.

  4. Scan: Missing values, special values and obvious errors were checked for all variables. Different approaches were used for different types of variable: for categorical variables, the plausibility of the values in each column was checked; numerical variables were scanned for possible outliers.

  5. Transform: In the final step, a log transformation was applied to the numeric columns that have a skewed distribution.

Data

As discussed earlier, we import 3 datasets for this analysis. Their descriptions are as follows:

  1. GDP of all countries

The file used for the analysis is ‘GDP.csv’. This dataset was taken from Kaggle and can be viewed using the following link: https://www.kaggle.com/greeshmagirish/worldbank-data-on-gdp-population-and-military?select=API_NY.GDP.MKTP.CD_DS2_en_csv_v2_559588.csv

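The dataset has 264 observations and 65 variables: Country Name, Country Code, Indicator Name, Indicator Code, one column per year (up to 2019), and an unnamed trailing column that readr labels X65.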

GDP <- read_csv("C:/Users/Daivik/Desktop/GDP.csv")
## Warning: Missing column names filled in: 'X65' [65]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `Country Name` = col_character(),
##   `Country Code` = col_character(),
##   `Indicator Name` = col_character(),
##   `Indicator Code` = col_character(),
##   `2019` = col_logical(),
##   X65 = col_logical()
## )
## See spec(...) for full column specifications.
head(GDP)
  2. Population of all countries

The second dataset contains information on population of all countries from 1960 to 2018. This dataset has also been downloaded from Kaggle and can be accessed using the following link: https://www.kaggle.com/greeshmagirish/worldbank-data-on-gdp-population-and-military?select=API_SP.POP.TOTL_DS2_en_csv_v2_511378.csv

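The data file has 264 observations and 65 variables, with the same layout as the GDP file: Country Name, Country Code, Indicator Name, Indicator Code, one column per year, and an unnamed trailing column that readr labels X65.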

POPULATION <- read_csv("C:/Users/Daivik/Desktop/population.csv")
## Warning: Missing column names filled in: 'X65' [65]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `Country Name` = col_character(),
##   `Country Code` = col_character(),
##   `Indicator Name` = col_character(),
##   `Indicator Code` = col_character(),
##   `2019` = col_logical(),
##   X65 = col_logical()
## )
## See spec(...) for full column specifications.
head(POPULATION)
  3. Continent information

The third dataset, named ‘Continent’, has also been taken from Kaggle. It can be accessed using the following link: https://www.kaggle.com/sarques/conticountry .

The data file has 249 observations and 8 variables. The variables used in this analysis are Country, Country_code, Region-1 and Continent.

We will merge the 3 datasets after tidying them.

CONTINENT <- read_csv("C:/Users/Daivik/Desktop/Continent.csv")
## Warning: Missing column names filled in: 'X8' [8]
## Parsed with column specification:
## cols(
##   No. = col_double(),
##   Country = col_character(),
##   Country_code = col_character(),
##   M49_Code = col_character(),
##   `Region-1` = col_character(),
##   `Region-2` = col_character(),
##   Continent = col_character(),
##   X8 = col_character()
## )
head(CONTINENT)
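
The "Missing column names filled in" warnings for the GDP and population files suggest that the header row has one more field than it has names, and the parsed specification shows both X65 and 2019 as logical, which is what readr typically guesses for a column with no values. As a minimal sketch, assuming those columns really are empty, the code below drops columns that contain only missing values. It uses a small made-up data frame rather than the real files, and the helper name drop_empty_columns is hypothetical.

```r
# Toy stand-in for the imported GDP table (all values are made up for illustration)
toy <- data.frame(
  `Country Name` = c("Alpha", "Beta"),
  `Country Code` = c("AAA", "BBB"),
  `2018` = c(1.5e9, 2.5e9),
  `2019` = c(NA, NA),
  X65 = c(NA, NA),
  check.names = FALSE
)

# Hypothetical helper: keep only columns with at least one non-missing value
drop_empty_columns <- function(df) {
  keep <- colSums(!is.na(df)) > 0
  df[, keep, drop = FALSE]
}

cleaned <- drop_empty_columns(toy)
names(cleaned)
```

In this report the select() step that follows has the same effect, because it keeps only the identifier columns and the years 2000 to 2017.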

Selecting only the required variables from all three data sets:

  1. GDP Data:

Only the years 2000 to 2017 are selected, in order to reduce the number of missing values and the uncertainty in the data.

# Keep the two identifier columns and the contiguous year columns 2000 to 2017
GDP <- GDP %>% select(`Country Name`, `Country Code`, `2000`:`2017`)
head(GDP)
  2. POPULATION Data:
# Keep the two identifier columns and the contiguous year columns 2000 to 2017
POPULATION <- POPULATION %>% select(`Country Name`, `Country Code`, `2000`:`2017`)
head(POPULATION)
  3. CONTINENT Data:
CONTINENT<-CONTINENT %>% select(Country, Country_code,`Region-1`,Continent)
head(CONTINENT)

Tidy & Manipulate 1

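The main principles of a tidy dataset are: each variable has its own column, each observation has its own row, and each value has its own cell.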

After inspecting the original datasets, we find that only the CONTINENT dataset is tidy, whereas the GDP and POPULATION datasets are not. This is because of the following reasons:

  1. The GDP values are listed under the years 2000 to 2017 as column headers. Instead, the years should become row values in a single column named ‘year’, with the GDP values in a second column.
#Converting wide format data into Long format
TIDY_GDP <- GDP %>% gather(key = "year",value = TOTAL_GDP,3:20)
head(TIDY_GDP)
  2. Similarly, the POPULATION values are listed under the years 2000 to 2017 as column headers. Instead, the years should become row values in a single column named ‘year’, with the population values in a second column.
#Converting wide format data into Long format
TIDY_POPULATION <- POPULATION %>% gather(key = "year",value = "Total_population",3:20)
head(TIDY_POPULATION)
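
gather() still works, but the tidyr documentation marks it as superseded by pivot_longer(). As a sketch of the equivalent call, the example below reshapes a small made-up table (the country names and values are invented for illustration) and selects the year columns by name rather than by position.

```r
library(tidyr)

# Toy wide table: one column per year (made-up values)
wide <- tibble::tibble(
  `Country Name` = c("Alpha", "Beta"),
  `Country Code` = c("AAA", "BBB"),
  `2000` = c(10, 20),
  `2001` = c(11, 21)
)

# Year headers become values of "year"; cell values go to "TOTAL_GDP"
long <- pivot_longer(
  wide,
  cols = `2000`:`2001`,
  names_to = "year",
  values_to = "TOTAL_GDP"
)
long
```

One difference to be aware of: pivot_longer() returns the rows grouped by input row (all years for the first country, then the next), whereas gather() stacks one year column after another. The content is the same, only the row order differs.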
  3. Merging the three datasets:
# joining the TIDY_GDP & TIDY_POPULATION
POP_GDP<-left_join(TIDY_GDP,TIDY_POPULATION)
## Joining, by = c("Country Name", "Country Code", "year")
# Renaming the variables of CONTINENT Dataset for joining
colnames(CONTINENT)[1] <-"Country Name"
colnames(CONTINENT)[2] <-"Country Code"

# Joining POP_GDP with CONTINENT to create FINAL_DATA
FINAL_DATA<-inner_join(POP_GDP,CONTINENT)
## Joining, by = c("Country Name", "Country Code")
head(FINAL_DATA)
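
Because inner_join() here matches on both Country Name and Country Code, a row is kept only when both fields agree between the two tables. FINAL_DATA has 3,294 rows, which is 183 countries over 18 years, whereas the GDP and population files list 264 entries and the continent file 249, so a number of entries are dropped. Some of these may be regional aggregates that have no continent, but others could be countries whose names are spelled differently in the two sources. The sketch below, on made-up tables, shows one way to list unmatched rows with anti_join() and to join on the code alone; whether joining on the code alone is appropriate for the real data is an assumption that would need checking.

```r
library(dplyr)

# Made-up tables: the second country has the same code but a different name spelling
left <- tibble(
  `Country Name` = c("Alpha", "Beta Republic", "Gamma"),
  `Country Code` = c("AAA", "BBB", "CCC"),
  value = c(1, 2, 3)
)
right <- tibble(
  `Country Name` = c("Alpha", "Beta", "Gamma"),
  `Country Code` = c("AAA", "BBB", "CCC"),
  Continent = c("X", "Y", "Z")
)

# Rows of `left` that find no partner when both name and code must match
unmatched <- anti_join(left, right, by = c("Country Name", "Country Code"))

# Joining on the code only keeps all three rows
by_code <- inner_join(left, select(right, -`Country Name`), by = "Country Code")

unmatched
by_code
```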

Understand

# This is the R chunk for the Understand Section

dim(FINAL_DATA)
## [1] 3294    7
summary(FINAL_DATA)
##  Country Name       Country Code           year             TOTAL_GDP        
##  Length:3294        Length:3294        Length:3294        Min.   :1.320e+07  
##  Class :character   Class :character   Class :character   1st Qu.:3.542e+09  
##  Mode  :character   Mode  :character   Mode  :character   Median :1.696e+10  
##                                                           Mean   :2.250e+11  
##                                                           3rd Qu.:1.124e+11  
##                                                           Max.   :1.214e+13  
##                                                           NA's   :170        
##  Total_population      Region-1          Continent        
##  Min.   :9.394e+03   Length:3294        Length:3294       
##  1st Qu.:7.455e+05   Class :character   Class :character  
##  Median :5.527e+06   Mode  :character   Mode  :character  
##  Mean   :3.202e+07                                        
##  3rd Qu.:1.780e+07                                        
##  Max.   :1.386e+09                                        
##  NA's   :6
str(FINAL_DATA)
## tibble [3,294 x 7] (S3: tbl_df/tbl/data.frame)
##  $ Country Name    : chr [1:3294] "Aruba" "Afghanistan" "Angola" "Albania" ...
##  $ Country Code    : chr [1:3294] "ABW" "AFG" "AGO" "ALB" ...
##  $ year            : chr [1:3294] "2000" "2000" "2000" "2000" ...
##  $ TOTAL_GDP       : num [1:3294] 1.87e+09 NA 9.13e+09 3.48e+09 1.43e+09 ...
##  $ Total_population: num [1:3294] 90853 20779953 16395473 3089027 65390 ...
##  $ Region-1        : chr [1:3294] "Caribbean" "Southern Asia" "Middle Africa" "Southern Europe" ...
##  $ Continent       : chr [1:3294] "North America" "Asia" "Africa" "Europe" ...

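Some of the variables were not in the correct format: Country Name, Country Code, year, Region-1 and Continent were all read in as character. They are converted to factors below (year, Region-1 and Continent as ordered factors).

Furthermore, the structure of the dataset is printed again at the end to confirm all the data type conversions.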

#conversion of variables
FINAL_DATA$`Country Name`<-as.factor(FINAL_DATA$`Country Name`)
class(FINAL_DATA$`Country Name`)
## [1] "factor"
levels(FINAL_DATA$`Country Name`)
##   [1] "Afghanistan"               "Albania"                  
##   [3] "Algeria"                   "American Samoa"           
##   [5] "Andorra"                   "Angola"                   
##   [7] "Antigua and Barbuda"       "Argentina"                
##   [9] "Armenia"                   "Aruba"                    
##  [11] "Australia"                 "Austria"                  
##  [13] "Azerbaijan"                "Bahrain"                  
##  [15] "Bangladesh"                "Barbados"                 
##  [17] "Belarus"                   "Belgium"                  
##  [19] "Belize"                    "Benin"                    
##  [21] "Bermuda"                   "Bhutan"                   
##  [23] "Bosnia and Herzegovina"    "Botswana"                 
##  [25] "Brazil"                    "British Virgin Islands"   
##  [27] "Brunei Darussalam"         "Bulgaria"                 
##  [29] "Burkina Faso"              "Burundi"                  
##  [31] "Cabo Verde"                "Cambodia"                 
##  [33] "Cameroon"                  "Canada"                   
##  [35] "Cayman Islands"            "Central African Republic" 
##  [37] "Chad"                      "Chile"                    
##  [39] "China"                     "Colombia"                 
##  [41] "Comoros"                   "Costa Rica"               
##  [43] "Croatia"                   "Cuba"                     
##  [45] "Cyprus"                    "Denmark"                  
##  [47] "Djibouti"                  "Dominica"                 
##  [49] "Dominican Republic"        "Ecuador"                  
##  [51] "El Salvador"               "Equatorial Guinea"        
##  [53] "Eritrea"                   "Estonia"                  
##  [55] "Eswatini"                  "Ethiopia"                 
##  [57] "Faroe Islands"             "Fiji"                     
##  [59] "Finland"                   "France"                   
##  [61] "French Polynesia"          "Gabon"                    
##  [63] "Georgia"                   "Germany"                  
##  [65] "Ghana"                     "Gibraltar"                
##  [67] "Greece"                    "Greenland"                
##  [69] "Grenada"                   "Guam"                     
##  [71] "Guatemala"                 "Guinea"                   
##  [73] "Guinea-Bissau"             "Guyana"                   
##  [75] "Haiti"                     "Honduras"                 
##  [77] "Hungary"                   "Iceland"                  
##  [79] "India"                     "Indonesia"                
##  [81] "Iraq"                      "Ireland"                  
##  [83] "Isle of Man"               "Israel"                   
##  [85] "Italy"                     "Jamaica"                  
##  [87] "Japan"                     "Jordan"                   
##  [89] "Kazakhstan"                "Kenya"                    
##  [91] "Kiribati"                  "Kuwait"                   
##  [93] "Latvia"                    "Lebanon"                  
##  [95] "Lesotho"                   "Liberia"                  
##  [97] "Libya"                     "Liechtenstein"            
##  [99] "Lithuania"                 "Luxembourg"               
## [101] "Madagascar"                "Malawi"                   
## [103] "Malaysia"                  "Maldives"                 
## [105] "Mali"                      "Malta"                    
## [107] "Marshall Islands"          "Mauritania"               
## [109] "Mauritius"                 "Mexico"                   
## [111] "Monaco"                    "Mongolia"                 
## [113] "Montenegro"                "Morocco"                  
## [115] "Mozambique"                "Myanmar"                  
## [117] "Namibia"                   "Nauru"                    
## [119] "Nepal"                     "Netherlands"              
## [121] "New Caledonia"             "New Zealand"              
## [123] "Nicaragua"                 "Niger"                    
## [125] "Nigeria"                   "North Macedonia"          
## [127] "Northern Mariana Islands"  "Norway"                   
## [129] "Oman"                      "Pakistan"                 
## [131] "Palau"                     "Panama"                   
## [133] "Papua New Guinea"          "Paraguay"                 
## [135] "Peru"                      "Philippines"              
## [137] "Poland"                    "Portugal"                 
## [139] "Puerto Rico"               "Qatar"                    
## [141] "Romania"                   "Russian Federation"       
## [143] "Rwanda"                    "Samoa"                    
## [145] "San Marino"                "Sao Tome and Principe"    
## [147] "Saudi Arabia"              "Senegal"                  
## [149] "Serbia"                    "Seychelles"               
## [151] "Sierra Leone"              "Singapore"                
## [153] "Sint Maarten (Dutch part)" "Slovenia"                 
## [155] "Solomon Islands"           "Somalia"                  
## [157] "South Africa"              "South Sudan"              
## [159] "Spain"                     "Sri Lanka"                
## [161] "Sudan"                     "Suriname"                 
## [163] "Sweden"                    "Switzerland"              
## [165] "Syrian Arab Republic"      "Tajikistan"               
## [167] "Thailand"                  "Timor-Leste"              
## [169] "Togo"                      "Tonga"                    
## [171] "Trinidad and Tobago"       "Tunisia"                  
## [173] "Turkey"                    "Turkmenistan"             
## [175] "Turks and Caicos Islands"  "Tuvalu"                   
## [177] "Uganda"                    "Ukraine"                  
## [179] "United Arab Emirates"      "Uruguay"                  
## [181] "Uzbekistan"                "Vanuatu"                  
## [183] "Zambia"
FINAL_DATA$`Country Code`<-as.factor(FINAL_DATA$`Country Code`)
class(FINAL_DATA$`Country Code`)
## [1] "factor"
levels(FINAL_DATA$`Country Code`)
##   [1] "ABW" "AFG" "AGO" "ALB" "AND" "ARE" "ARG" "ARM" "ASM" "ATG" "AUS" "AUT"
##  [13] "AZE" "BDI" "BEL" "BEN" "BFA" "BGD" "BGR" "BHR" "BIH" "BLR" "BLZ" "BMU"
##  [25] "BRA" "BRB" "BRN" "BTN" "BWA" "CAF" "CAN" "CHE" "CHL" "CHN" "CMR" "COL"
##  [37] "COM" "CPV" "CRI" "CUB" "CYM" "CYP" "DEU" "DJI" "DMA" "DNK" "DOM" "DZA"
##  [49] "ECU" "ERI" "ESP" "EST" "ETH" "FIN" "FJI" "FRA" "FRO" "GAB" "GEO" "GHA"
##  [61] "GIB" "GIN" "GNB" "GNQ" "GRC" "GRD" "GRL" "GTM" "GUM" "GUY" "HND" "HRV"
##  [73] "HTI" "HUN" "IDN" "IMN" "IND" "IRL" "IRQ" "ISL" "ISR" "ITA" "JAM" "JOR"
##  [85] "JPN" "KAZ" "KEN" "KHM" "KIR" "KWT" "LBN" "LBR" "LBY" "LIE" "LKA" "LSO"
##  [97] "LTU" "LUX" "LVA" "MAR" "MCO" "MDG" "MDV" "MEX" "MHL" "MKD" "MLI" "MLT"
## [109] "MMR" "MNE" "MNG" "MNP" "MOZ" "MRT" "MUS" "MWI" "MYS" "NAM" "NCL" "NER"
## [121] "NGA" "NIC" "NLD" "NOR" "NPL" "NRU" "NZL" "OMN" "PAK" "PAN" "PER" "PHL"
## [133] "PLW" "PNG" "POL" "PRI" "PRT" "PRY" "PYF" "QAT" "ROU" "RUS" "RWA" "SAU"
## [145] "SDN" "SEN" "SGP" "SLB" "SLE" "SLV" "SMR" "SOM" "SRB" "SSD" "STP" "SUR"
## [157] "SVN" "SWE" "SWZ" "SXM" "SYC" "SYR" "TCA" "TCD" "TGO" "THA" "TJK" "TKM"
## [169] "TLS" "TON" "TTO" "TUN" "TUR" "TUV" "UGA" "UKR" "URY" "UZB" "VGB" "VUT"
## [181] "WSM" "ZAF" "ZMB"
FINAL_DATA$year<-FINAL_DATA$year %>% factor(levels = c(2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017),ordered = TRUE)
class(FINAL_DATA$year)
## [1] "ordered" "factor"
levels(FINAL_DATA$year)
##  [1] "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009"
## [11] "2010" "2011" "2012" "2013" "2014" "2015" "2016" "2017"
FINAL_DATA$`Region-1`<-FINAL_DATA$`Region-1` %>%  factor(levels = c("344","446","535","Antarctica","Australia and New Zealand","Caribbean","Central America","Central Asia","Eastern Africa","Eastern Asia","Eastern Europe","Melanesia","Micronesia","Middle Africa","Northern Africa","Northern America", "Northern Europe","Polynesia","South-eastern Asia","South America","Southern Africa","Southern Asia", "Southern Europe","Western Africa", "Western Asia","Western Europe"), ordered = TRUE)

class(FINAL_DATA$`Region-1`)
## [1] "ordered" "factor"
levels(FINAL_DATA$`Region-1`)
##  [1] "344"                       "446"                      
##  [3] "535"                       "Antarctica"               
##  [5] "Australia and New Zealand" "Caribbean"                
##  [7] "Central America"           "Central Asia"             
##  [9] "Eastern Africa"            "Eastern Asia"             
## [11] "Eastern Europe"            "Melanesia"                
## [13] "Micronesia"                "Middle Africa"            
## [15] "Northern Africa"           "Northern America"         
## [17] "Northern Europe"           "Polynesia"                
## [19] "South-eastern Asia"        "South America"            
## [21] "Southern Africa"           "Southern Asia"            
## [23] "Southern Europe"           "Western Africa"           
## [25] "Western Asia"              "Western Europe"
FINAL_DATA$Continent<-FINAL_DATA$Continent %>% factor(levels = c("Africa","Antarctica","Asia","Europe", "Latin America and the Caribbean", "North America","Oceania","South America"), ordered =TRUE)

class(FINAL_DATA$Continent)
## [1] "ordered" "factor"
levels(FINAL_DATA$Continent)
## [1] "Africa"                          "Antarctica"                     
## [3] "Asia"                            "Europe"                         
## [5] "Latin America and the Caribbean" "North America"                  
## [7] "Oceania"                         "South America"
str(FINAL_DATA)
## tibble [3,294 x 7] (S3: tbl_df/tbl/data.frame)
##  $ Country Name    : Factor w/ 183 levels "Afghanistan",..: 10 1 6 2 5 179 8 9 4 7 ...
##  $ Country Code    : Factor w/ 183 levels "ABW","AFG","AGO",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ year            : Ord.factor w/ 18 levels "2000"<"2001"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ TOTAL_GDP       : num [1:3294] 1.87e+09 NA 9.13e+09 3.48e+09 1.43e+09 ...
##  $ Total_population: num [1:3294] 90853 20779953 16395473 3089027 65390 ...
##  $ Region-1        : Ord.factor w/ 26 levels "344"<"446"<"535"<..: 6 22 14 23 23 25 20 25 18 6 ...
##  $ Continent       : Ord.factor w/ 8 levels "Africa"<"Antarctica"<..: 6 3 1 4 4 3 8 3 7 6 ...
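
Regions and continents have no natural ranking, so declaring Region-1 and Continent as ordered factors imposes an alphabetical order that some modelling functions treat as meaningful (for example, lm() uses polynomial contrasts for ordered factors by default). The level list above also contains "344", "446" and "535", which look like numeric codes rather than region names. The sketch below, on made-up values, shows one possible alternative: treat purely numeric entries as missing and build an unordered factor. Whether such entries should be set to missing or mapped to a proper region is an assumption that would need checking against the source file.

```r
# Made-up region values, including one entry that looks like a numeric code
region <- c("Caribbean", "Southern Asia", "344", "Middle Africa", "Caribbean")

# Flag entries made up only of digits and treat them as missing
looks_numeric <- grepl("^[0-9]+$", region)
region_clean <- ifelse(looks_numeric, NA_character_, region)

# Unordered (nominal) factor; levels are the distinct remaining names
region_factor <- factor(region_clean)

levels(region_factor)
is.ordered(region_factor)
sum(is.na(region_factor))
```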

Tidy & Manipulate Data II

# Creating a new Variable
FINAL_DATA<-FINAL_DATA %>% mutate("GDP-PER-CAPITA"= round((TOTAL_GDP/Total_population),2))
head(FINAL_DATA)
str(FINAL_DATA)
## tibble [3,294 x 8] (S3: tbl_df/tbl/data.frame)
##  $ Country Name    : Factor w/ 183 levels "Afghanistan",..: 10 1 6 2 5 179 8 9 4 7 ...
##  $ Country Code    : Factor w/ 183 levels "ABW","AFG","AGO",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ year            : Ord.factor w/ 18 levels "2000"<"2001"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ TOTAL_GDP       : num [1:3294] 1.87e+09 NA 9.13e+09 3.48e+09 1.43e+09 ...
##  $ Total_population: num [1:3294] 90853 20779953 16395473 3089027 65390 ...
##  $ Region-1        : Ord.factor w/ 26 levels "344"<"446"<"535"<..: 6 22 14 23 23 25 20 25 18 6 ...
##  $ Continent       : Ord.factor w/ 8 levels "Africa"<"Antarctica"<..: 6 3 1 4 4 3 8 3 7 6 ...
##  $ GDP-PER-CAPITA  : num [1:3294] 20621 NA 557 1127 21937 ...
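
The new variable is simply total GDP divided by total population, rounded to two decimals. As a small sketch with made-up numbers, the code below shows the same calculation with a guard for a zero population, since dividing by zero would produce Inf, which is one of the special values checked for in the Scan step. The guard is an illustrative addition, not part of the original calculation.

```r
# Made-up GDP and population values; the second population is deliberately zero
gdp <- c(2.0e9, 5.0e8, 1.0e6)
pop <- c(100000, 0, 250)

# Without a guard, the second ratio is Inf
unguarded <- round(gdp / pop, 2)

# With a guard, a non-positive population yields NA instead
per_capita <- ifelse(pop > 0, round(gdp / pop, 2), NA_real_)

unguarded
per_capita
```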

Scan I

# This is the R chunk for the Scan I
colSums(is.na(FINAL_DATA))
##     Country Name     Country Code             year        TOTAL_GDP 
##                0                0                0              170 
## Total_population         Region-1        Continent   GDP-PER-CAPITA 
##                6                0                0              170
#Imputing the missing values

FINAL_DATA$TOTAL_GDP[is.na(FINAL_DATA$TOTAL_GDP)]<-mean(FINAL_DATA$TOTAL_GDP,na.rm = TRUE)
sum(is.na(FINAL_DATA$TOTAL_GDP))
## [1] 0
FINAL_DATA$Total_population[is.na(FINAL_DATA$Total_population)]<-mean(FINAL_DATA$Total_population,na.rm = TRUE)
sum(is.na(FINAL_DATA$Total_population))
## [1] 0
FINAL_DATA$`GDP-PER-CAPITA`[is.na(FINAL_DATA$`GDP-PER-CAPITA`)]<-mean(FINAL_DATA$`GDP-PER-CAPITA`,na.rm = TRUE)
sum(is.na(FINAL_DATA$`GDP-PER-CAPITA`))
## [1] 0
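# Flag infinite or NaN entries; non-numeric columns cannot hold special values
is.special <- function(x){
  if (is.numeric(x)) is.infinite(x) | is.nan(x) else rep(FALSE, length(x))
}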
FINAL_DATA %>% sapply(function(x) sum(is.special(x)))
##     Country Name     Country Code             year        TOTAL_GDP 
##                0                0                0                0 
## Total_population         Region-1        Continent   GDP-PER-CAPITA 
##                0                0                0                0
head(FINAL_DATA)
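
The imputation above replaces every missing TOTAL_GDP with the overall mean (2.250e+11), which is more than ten times the median (1.696e+10), and it fills GDP-PER-CAPITA with its own mean, so the imputed rows no longer satisfy GDP-PER-CAPITA = TOTAL_GDP / Total_population. As a hedged alternative, the sketch below imputes within each country and then recomputes the ratio. The data and the helper name impute_group_mean are made up for illustration, and this is only one of several reasonable strategies.

```r
# Made-up panel: two countries, three years each, with one missing GDP per country
df <- data.frame(
  country = c("A", "A", "A", "B", "B", "B"),
  gdp = c(100, NA, 120, 10, 12, NA),
  pop = c(10, 10, 10, 2, 2, 2)
)

# Hypothetical helper: replace NAs by the mean of the same group
# (a group with no observed values at all would stay missing, as NaN)
impute_group_mean <- function(x, g) {
  ave(x, g, FUN = function(v) {
    v[is.na(v)] <- mean(v, na.rm = TRUE)
    v
  })
}

df$gdp <- impute_group_mean(df$gdp, df$country)

# Recompute the derived variable after imputation so it stays consistent
df$gdp_per_capita <- round(df$gdp / df$pop, 2)
df
```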

Scan II

# This is the R chunk for the Scan II
FINAL_DATA %>% select(TOTAL_GDP,Total_population,`GDP-PER-CAPITA`) %>% summary()
##    TOTAL_GDP         Total_population    GDP-PER-CAPITA    
##  Min.   :1.320e+07   Min.   :9.394e+03   Min.   :   111.9  
##  1st Qu.:3.986e+09   1st Qu.:7.458e+05   1st Qu.:  1455.0  
##  Median :1.969e+10   Median :5.537e+06   Median :  5199.2  
##  Mean   :2.250e+11   Mean   :3.202e+07   Mean   : 14271.3  
##  3rd Qu.:1.728e+11   3rd Qu.:1.799e+07   3rd Qu.: 15691.5  
##  Max.   :1.214e+13   Max.   :1.386e+09   Max.   :189170.9
FINAL_DATA$TOTAL_GDP %>% boxplot(main="Box-Plot of GDP",ylab="Total GDP")

FINAL_DATA$Total_population %>% boxplot(main="Box-Plot of POPULATION",ylab="Total Population")

FINAL_DATA$`GDP-PER-CAPITA` %>% boxplot(main="Box-Plot of GDP PER CAPITA",ylab="GDP/POPULATION")

#z-scores for all the numeric variables
z.score_GDP <-FINAL_DATA$TOTAL_GDP %>% scores(type = "z")
z.score_GDP %>% summary()
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.30443 -0.29905 -0.27779  0.00000 -0.07065 16.12926
length(which(abs(z.score_GDP)>3))
## [1] 60
z.score_POP <-FINAL_DATA$Total_population  %>% scores(type = "z")
z.score_POP %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.2373 -0.2319 -0.1964  0.0000 -0.1040 10.0416
length(which(abs(z.score_POP)>3))
## [1] 36
z.score <-FINAL_DATA$`GDP-PER-CAPITA`  %>% scores(type = "z")
z.score %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.6371 -0.5767 -0.4082  0.0000  0.0639  7.8695
length(which(abs(z.score)>3))
## [1] 73
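
For reference, the z-score of a value is its distance from the mean in units of the standard deviation, which is what scores(type = "z") is documented to compute. The sketch below reproduces the calculation by hand on a small made-up vector and counts the values with an absolute z-score above 3, the same rule used above.

```r
# Made-up data: nineteen ordinary values and one extreme value
x <- c(1:19, 100)

# z-score: (value - mean) / standard deviation
z <- (x - mean(x)) / sd(x)

# Positions and count of values flagged by the |z| > 3 rule
flagged <- which(abs(z) > 3)
flagged
length(flagged)
```

Because the mean and standard deviation are themselves pulled by extreme values, this rule can be conservative on strongly skewed data such as GDP, which is one reason the boxplots are used alongside it.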
# Define a function to cap the values outside the limits

cap <- function(x){
    quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
    x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
    x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
    x
}

# capping the outliers

FINAL_DATA$TOTAL_GDP<-cap(FINAL_DATA$TOTAL_GDP)
FINAL_DATA$Total_population<-cap(FINAL_DATA$Total_population)
FINAL_DATA$`GDP-PER-CAPITA`<-cap(FINAL_DATA$`GDP-PER-CAPITA`)
boxplot(FINAL_DATA$TOTAL_GDP,FINAL_DATA$Total_population,FINAL_DATA$`GDP-PER-CAPITA`,main="Box-Plot after capping outliers",names = c("GDP","POPULATION","GDP PER CAPITA"))
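
To make the behaviour of cap() concrete, the sketch below applies the same function to a small made-up vector. Values below Q1 - 1.5*IQR are replaced by the 5th percentile, values above Q3 + 1.5*IQR are replaced by the 95th percentile, and everything inside the fences is left unchanged.

```r
# Same capping function as above, repeated so the example is self-contained
cap <- function(x){
    quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
    x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
    x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
    x
}

# Made-up data: nineteen ordinary values and one extreme value
x <- c(1:19, 100)

# Upper fence is Q3 + 1.5*IQR = 15.25 + 1.5*9.5 = 29.5, so only 100 is capped
capped <- cap(x)
max(capped)   # 23.05, the 95th percentile of x
```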

Transform

# This is the R chunk for the Transform Section
par(mfrow=c(1,2))
hist(FINAL_DATA$TOTAL_GDP,main = "GDP",xlab = "Total GDP",col = "grey")
Log_GDP<-log(FINAL_DATA$TOTAL_GDP)
hist(Log_GDP,main = "Transformed GDP",xlab ="log(Total GDP)",col = "red")

par(mfrow=c(1,2))

hist(FINAL_DATA$Total_population,main = "POPULATION",xlab = "Total Population",col = "grey")

Log_POP <- log(FINAL_DATA$Total_population)
hist(Log_POP,main = "Transformed POPULATION",xlab = "log(Total Population)",col = "red")

par(mfrow=c(1,2))
hist(FINAL_DATA$`GDP-PER-CAPITA`,main = "GDP per CAPITA",xlab = "GDP per CAPITA",col = "grey")
Log_GDPperCAPITA<-log(FINAL_DATA$`GDP-PER-CAPITA`)
hist(Log_GDPperCAPITA,main = "Transformed - GDP per CAPITA",xlab = "log(GDP per CAPITA)",col = "red")
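
The histograms give a visual check of the effect of the log transformation. A numeric measure such as sample skewness can complement them. The sketch below defines a simple moment-based skewness (the helper name skewness is hypothetical, and other definitions with small-sample corrections exist) and compares it before and after taking logs on made-up right-skewed data.

```r
# Hypothetical helper: moment-based sample skewness (third central moment / variance^1.5)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Made-up right-skewed data: exponentials of an evenly spaced grid
x <- exp(seq(0, 5, length.out = 50))

skew_before <- skewness(x)        # positive: long right tail
skew_after  <- skewness(log(x))   # close to zero: the logs are evenly spaced

skew_before
skew_after
```

A value near zero after the transformation indicates a roughly symmetric distribution, while a clearly positive value indicates that a right tail remains.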

REFERENCE