Loading the required packages.
# This is the R chunk for the required packages
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(knitr)
library(outliers)
The important steps of the data pre-processing are as follows:
Get: The datasets were downloaded from the public open-data website Kaggle (kaggle.com) and then imported into R. Three datasets (GDP, Population, Continents) are merged for this analysis.
Understand: This step covers both the individual datasets and the merged dataset. The volume and structure of the data were checked, and the meaning of each attribute was reviewed.
Tidy and manipulate: Two of the datasets were untidy, so I tidied them to make sure each variable is stored in a column and each observation has its own row. One new variable, GDP-PER-CAPITA, was generated.
Scan: All variables were checked for missing values, special values and obvious errors. Different approaches were used for different variable types: for categorical variables, the plausibility of the values in each column was checked; numerical variables were scanned for possible outliers.
Transform: In the final step, a data transformation was applied to the columns with a skewed distribution.
As noted above, three datasets are imported for this analysis. Their descriptions are as follows:
The first file used for the analysis is ‘GDP.csv’. This dataset has been taken from Kaggle and can be viewed using the following link: https://www.kaggle.com/greeshmagirish/worldbank-data-on-gdp-population-and-military?select=API_NY.GDP.MKTP.CD_DS2_en_csv_v2_559588.csv
The dataset has 264 observations and 65 variables. As the column specification below shows, the variables are the country name and code, the indicator name and code, one column per year, and an unnamed trailing column that read_csv labels X65.
GDP <- read_csv("C:/Users/Daivik/Desktop/GDP.csv")
## Warning: Missing column names filled in: 'X65' [65]
## Parsed with column specification:
## cols(
## .default = col_double(),
## `Country Name` = col_character(),
## `Country Code` = col_character(),
## `Indicator Name` = col_character(),
## `Indicator Code` = col_character(),
## `2019` = col_logical(),
## X65 = col_logical()
## )
## See spec(...) for full column specifications.
head(GDP)
The second dataset contains information on population of all countries from 1960 to 2018. This dataset has also been downloaded from Kaggle and can be accessed using the following link: https://www.kaggle.com/greeshmagirish/worldbank-data-on-gdp-population-and-military?select=API_SP.POP.TOTL_DS2_en_csv_v2_511378.csv
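The data file has 264 observations and 65 variables. The variables have the same layout as in the GDP file: the country name and code, the indicator name and code, one column per year, and an unnamed trailing column labelled X65.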
POPULATION <- read_csv("C:/Users/Daivik/Desktop/population.csv")
## Warning: Missing column names filled in: 'X65' [65]
## Parsed with column specification:
## cols(
## .default = col_double(),
## `Country Name` = col_character(),
## `Country Code` = col_character(),
## `Indicator Name` = col_character(),
## `Indicator Code` = col_character(),
## `2019` = col_logical(),
## X65 = col_logical()
## )
## See spec(...) for full column specifications.
head(POPULATION)
The third dataset, named ‘Continent’, has also been taken from Kaggle. It can be accessed using the following link: https://www.kaggle.com/sarques/conticountry .
The data file has 249 observations and 8 variables. The important variables are Country, Country_code, Region-1 and Continent.
The three datasets are merged after tidying them.
CONTINENT <- read_csv("C:/Users/Daivik/Desktop/Continent.csv")
## Warning: Missing column names filled in: 'X8' [8]
## Parsed with column specification:
## cols(
## No. = col_double(),
## Country = col_character(),
## Country_code = col_character(),
## M49_Code = col_character(),
## `Region-1` = col_character(),
## `Region-2` = col_character(),
## Continent = col_character(),
## X8 = col_character()
## )
head(CONTINENT)
Only the years 2000 to 2017 are kept, in order to reduce the number of missing values and the uncertainty in the data.
GDP<-GDP %>% select(`Country Name`,`Country Code`,`2000`,`2001`,`2002`,`2003`,`2004`,`2005`,`2006`,`2007`,
`2008`,`2009`,`2010`,`2011`,`2012`,`2013`,
`2014`,`2015`,`2016`,`2017`)
head(GDP)
POPULATION<-POPULATION %>% select(`Country Name`,`Country Code`,`2000`,`2001`,`2002`,`2003`,`2004`,`2005`,`2006`,`2007`,
`2008`,`2009`,`2010`,`2011`,`2012`,`2013`,
`2014`,`2015`,`2016`,`2017`)
head(POPULATION)
CONTINENT<-CONTINENT %>% select(Country, Country_code,`Region-1`,Continent)
head(CONTINENT)
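The effect of restricting the year range can be checked by counting missing values per year column before the select() calls. The sketch below does this on a small made-up wide table; the `wide` data frame and its values are hypothetical and are not taken from the World Bank files. The same `colSums(is.na(...))` call could be applied to the year columns of the imported data.

```r
# Hypothetical wide table: one row per country, one column per year.
wide <- data.frame(
  country = c("A", "B", "C"),
  `1960` = c(NA, NA, 5),
  `2000` = c(1, 2, 3),
  `2017` = c(4, NA, 6),
  check.names = FALSE
)

# Count missing values in every year column (all columns except the first).
na_per_year <- colSums(is.na(wide[, -1]))
print(na_per_year)

# Share of missing values per year, useful for comparing candidate year ranges.
na_share <- na_per_year / nrow(wide)
print(round(na_share, 2))
```

Years with a high share of missing values are natural candidates for exclusion.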
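The main principles of a tidy dataset are that each variable forms a column, each observation forms a row, and each value has its own cell.
After inspecting all the original datasets, we find that only the CONTINENT dataset is tidy, whereas the GDP and POPULATION datasets are not. In both of them the years (2000 to 2017) are spread across column headers instead of being stored as values of a single year variable, so each row holds 18 observations rather than one.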
#Converting wide format data into Long format
TIDY_GDP <- GDP %>% gather(key = "year",value = TOTAL_GDP,3:20)
head(TIDY_GDP)
#Converting wide format data into Long format
TIDY_POPULATION <- POPULATION %>% gather(key = "year",value = "Total_population",3:20)
head(TIDY_POPULATION)
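gather() still works, but current tidyr releases mark it as superseded in favour of pivot_longer(). A minimal sketch of the same wide-to-long reshaping, assuming tidyr 1.0.0 or later and using a made-up two-country table (`wide_gdp` and its values are hypothetical):

```r
library(tidyr)

# Hypothetical wide table shaped like GDP after select(): years as column names.
wide_gdp <- data.frame(
  `Country Name` = c("A", "B"),
  `Country Code` = c("AAA", "BBB"),
  `2000` = c(10, 20),
  `2001` = c(11, NA),
  check.names = FALSE
)

# Move the year columns into a 'year' key column and a TOTAL_GDP value column.
long_gdp <- pivot_longer(
  wide_gdp,
  cols = c(`2000`, `2001`),
  names_to = "year",
  values_to = "TOTAL_GDP"
)

print(long_gdp)
```

The row order differs from gather(): pivot_longer() keeps all years of one country together, while gather() stacks the table year by year. The content is the same.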
Three columns (Country Name, Country Code and year) are common to the first two datasets (TIDY_GDP and TIDY_POPULATION).
Therefore, the TIDY_GDP dataset is joined with the TIDY_POPULATION dataset using a left join.
The resulting dataset is then joined with the CONTINENT dataset on the common variables “Country Name” and “Country Code” using an inner join.
# joining the TIDY_GDP & TIDY_POPULATION
POP_GDP<-left_join(TIDY_GDP,TIDY_POPULATION)
## Joining, by = c("Country Name", "Country Code", "year")
# Renaming the variables of CONTINENT Dataset for joining
colnames(CONTINENT)[1] <-"Country Name"
colnames(CONTINENT)[2] <-"Country Code"
# Joining FINAL_DATA
FINAL_DATA<-inner_join(POP_GDP,CONTINENT)
## Joining, by = c("Country Name", "Country Code")
head(FINAL_DATA)
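An inner join silently drops every row whose keys have no match in the other table. A sketch of how the dropped rows could be listed with anti_join(), with the join keys written out explicitly; the two small tables are made up for illustration and only stand in for POP_GDP and CONTINENT:

```r
library(dplyr)

# Hypothetical stand-ins for POP_GDP and CONTINENT.
pop_gdp <- data.frame(
  `Country Name` = c("A", "B", "World"),
  `Country Code` = c("AAA", "BBB", "WLD"),
  TOTAL_GDP = c(1, 2, 3),
  check.names = FALSE
)
continent <- data.frame(
  `Country Name` = c("A", "B"),
  `Country Code` = c("AAA", "BBB"),
  Continent = c("Asia", "Europe"),
  check.names = FALSE
)

keys <- c("Country Name", "Country Code")

# Rows kept by the inner join.
matched <- inner_join(pop_gdp, continent, by = keys)

# Rows of pop_gdp with no partner in continent; the inner join drops these.
unmatched <- anti_join(pop_gdp, continent, by = keys)

print(matched)
print(unmatched)
```

Passing `by` explicitly also suppresses the "Joining, by = ..." message and guards against accidental joins on columns that merely share a name.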
# This is the R chunk for the Understand Section
dim(FINAL_DATA)
## [1] 3294 7
summary(FINAL_DATA)
## Country Name Country Code year TOTAL_GDP
## Length:3294 Length:3294 Length:3294 Min. :1.320e+07
## Class :character Class :character Class :character 1st Qu.:3.542e+09
## Mode :character Mode :character Mode :character Median :1.696e+10
## Mean :2.250e+11
## 3rd Qu.:1.124e+11
## Max. :1.214e+13
## NA's :170
## Total_population Region-1 Continent
## Min. :9.394e+03 Length:3294 Length:3294
## 1st Qu.:7.455e+05 Class :character Class :character
## Median :5.527e+06 Mode :character Mode :character
## Mean :3.202e+07
## 3rd Qu.:1.780e+07
## Max. :1.386e+09
## NA's :6
str(FINAL_DATA)
## tibble [3,294 x 7] (S3: tbl_df/tbl/data.frame)
## $ Country Name : chr [1:3294] "Aruba" "Afghanistan" "Angola" "Albania" ...
## $ Country Code : chr [1:3294] "ABW" "AFG" "AGO" "ALB" ...
## $ year : chr [1:3294] "2000" "2000" "2000" "2000" ...
## $ TOTAL_GDP : num [1:3294] 1.87e+09 NA 9.13e+09 3.48e+09 1.43e+09 ...
## $ Total_population: num [1:3294] 90853 20779953 16395473 3089027 65390 ...
## $ Region-1 : chr [1:3294] "Caribbean" "Southern Asia" "Middle Africa" "Southern Europe" ...
## $ Continent : chr [1:3294] "North America" "Asia" "Africa" "Europe" ...
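Some of the variables were not in the correct format: Country Name, Country Code, year, Region-1 and Continent were read as character. Country Name and Country Code are converted to factors, and year, Region-1 and Continent are converted to ordered factors.
The structure of the dataset is then printed again to confirm all the data type conversions.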
#conversion of variables
FINAL_DATA$`Country Name`<-as.factor(FINAL_DATA$`Country Name`)
class(FINAL_DATA$`Country Name`)
## [1] "factor"
levels(FINAL_DATA$`Country Name`)
## [1] "Afghanistan" "Albania"
## [3] "Algeria" "American Samoa"
## [5] "Andorra" "Angola"
## [7] "Antigua and Barbuda" "Argentina"
## [9] "Armenia" "Aruba"
## [11] "Australia" "Austria"
## [13] "Azerbaijan" "Bahrain"
## [15] "Bangladesh" "Barbados"
## [17] "Belarus" "Belgium"
## [19] "Belize" "Benin"
## [21] "Bermuda" "Bhutan"
## [23] "Bosnia and Herzegovina" "Botswana"
## [25] "Brazil" "British Virgin Islands"
## [27] "Brunei Darussalam" "Bulgaria"
## [29] "Burkina Faso" "Burundi"
## [31] "Cabo Verde" "Cambodia"
## [33] "Cameroon" "Canada"
## [35] "Cayman Islands" "Central African Republic"
## [37] "Chad" "Chile"
## [39] "China" "Colombia"
## [41] "Comoros" "Costa Rica"
## [43] "Croatia" "Cuba"
## [45] "Cyprus" "Denmark"
## [47] "Djibouti" "Dominica"
## [49] "Dominican Republic" "Ecuador"
## [51] "El Salvador" "Equatorial Guinea"
## [53] "Eritrea" "Estonia"
## [55] "Eswatini" "Ethiopia"
## [57] "Faroe Islands" "Fiji"
## [59] "Finland" "France"
## [61] "French Polynesia" "Gabon"
## [63] "Georgia" "Germany"
## [65] "Ghana" "Gibraltar"
## [67] "Greece" "Greenland"
## [69] "Grenada" "Guam"
## [71] "Guatemala" "Guinea"
## [73] "Guinea-Bissau" "Guyana"
## [75] "Haiti" "Honduras"
## [77] "Hungary" "Iceland"
## [79] "India" "Indonesia"
## [81] "Iraq" "Ireland"
## [83] "Isle of Man" "Israel"
## [85] "Italy" "Jamaica"
## [87] "Japan" "Jordan"
## [89] "Kazakhstan" "Kenya"
## [91] "Kiribati" "Kuwait"
## [93] "Latvia" "Lebanon"
## [95] "Lesotho" "Liberia"
## [97] "Libya" "Liechtenstein"
## [99] "Lithuania" "Luxembourg"
## [101] "Madagascar" "Malawi"
## [103] "Malaysia" "Maldives"
## [105] "Mali" "Malta"
## [107] "Marshall Islands" "Mauritania"
## [109] "Mauritius" "Mexico"
## [111] "Monaco" "Mongolia"
## [113] "Montenegro" "Morocco"
## [115] "Mozambique" "Myanmar"
## [117] "Namibia" "Nauru"
## [119] "Nepal" "Netherlands"
## [121] "New Caledonia" "New Zealand"
## [123] "Nicaragua" "Niger"
## [125] "Nigeria" "North Macedonia"
## [127] "Northern Mariana Islands" "Norway"
## [129] "Oman" "Pakistan"
## [131] "Palau" "Panama"
## [133] "Papua New Guinea" "Paraguay"
## [135] "Peru" "Philippines"
## [137] "Poland" "Portugal"
## [139] "Puerto Rico" "Qatar"
## [141] "Romania" "Russian Federation"
## [143] "Rwanda" "Samoa"
## [145] "San Marino" "Sao Tome and Principe"
## [147] "Saudi Arabia" "Senegal"
## [149] "Serbia" "Seychelles"
## [151] "Sierra Leone" "Singapore"
## [153] "Sint Maarten (Dutch part)" "Slovenia"
## [155] "Solomon Islands" "Somalia"
## [157] "South Africa" "South Sudan"
## [159] "Spain" "Sri Lanka"
## [161] "Sudan" "Suriname"
## [163] "Sweden" "Switzerland"
## [165] "Syrian Arab Republic" "Tajikistan"
## [167] "Thailand" "Timor-Leste"
## [169] "Togo" "Tonga"
## [171] "Trinidad and Tobago" "Tunisia"
## [173] "Turkey" "Turkmenistan"
## [175] "Turks and Caicos Islands" "Tuvalu"
## [177] "Uganda" "Ukraine"
## [179] "United Arab Emirates" "Uruguay"
## [181] "Uzbekistan" "Vanuatu"
## [183] "Zambia"
FINAL_DATA$`Country Code`<-as.factor(FINAL_DATA$`Country Code`)
class(FINAL_DATA$`Country Code`)
## [1] "factor"
levels(FINAL_DATA$`Country Code`)
## [1] "ABW" "AFG" "AGO" "ALB" "AND" "ARE" "ARG" "ARM" "ASM" "ATG" "AUS" "AUT"
## [13] "AZE" "BDI" "BEL" "BEN" "BFA" "BGD" "BGR" "BHR" "BIH" "BLR" "BLZ" "BMU"
## [25] "BRA" "BRB" "BRN" "BTN" "BWA" "CAF" "CAN" "CHE" "CHL" "CHN" "CMR" "COL"
## [37] "COM" "CPV" "CRI" "CUB" "CYM" "CYP" "DEU" "DJI" "DMA" "DNK" "DOM" "DZA"
## [49] "ECU" "ERI" "ESP" "EST" "ETH" "FIN" "FJI" "FRA" "FRO" "GAB" "GEO" "GHA"
## [61] "GIB" "GIN" "GNB" "GNQ" "GRC" "GRD" "GRL" "GTM" "GUM" "GUY" "HND" "HRV"
## [73] "HTI" "HUN" "IDN" "IMN" "IND" "IRL" "IRQ" "ISL" "ISR" "ITA" "JAM" "JOR"
## [85] "JPN" "KAZ" "KEN" "KHM" "KIR" "KWT" "LBN" "LBR" "LBY" "LIE" "LKA" "LSO"
## [97] "LTU" "LUX" "LVA" "MAR" "MCO" "MDG" "MDV" "MEX" "MHL" "MKD" "MLI" "MLT"
## [109] "MMR" "MNE" "MNG" "MNP" "MOZ" "MRT" "MUS" "MWI" "MYS" "NAM" "NCL" "NER"
## [121] "NGA" "NIC" "NLD" "NOR" "NPL" "NRU" "NZL" "OMN" "PAK" "PAN" "PER" "PHL"
## [133] "PLW" "PNG" "POL" "PRI" "PRT" "PRY" "PYF" "QAT" "ROU" "RUS" "RWA" "SAU"
## [145] "SDN" "SEN" "SGP" "SLB" "SLE" "SLV" "SMR" "SOM" "SRB" "SSD" "STP" "SUR"
## [157] "SVN" "SWE" "SWZ" "SXM" "SYC" "SYR" "TCA" "TCD" "TGO" "THA" "TJK" "TKM"
## [169] "TLS" "TON" "TTO" "TUN" "TUR" "TUV" "UGA" "UKR" "URY" "UZB" "VGB" "VUT"
## [181] "WSM" "ZAF" "ZMB"
FINAL_DATA$year<-FINAL_DATA$year %>% factor(levels = c(2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017),ordered = TRUE)
class(FINAL_DATA$year)
## [1] "ordered" "factor"
levels(FINAL_DATA$year)
## [1] "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009"
## [11] "2010" "2011" "2012" "2013" "2014" "2015" "2016" "2017"
FINAL_DATA$`Region-1`<-FINAL_DATA$`Region-1` %>% factor(levels = c("344","446","535","Antarctica","Australia and New Zealand","Caribbean","Central America","Central Asia","Eastern Africa","Eastern Asia","Eastern Europe","Melanesia","Micronesia","Middle Africa","Northern Africa","Northern America", "Northern Europe","Polynesia","South-eastern Asia","South America","Southern Africa","Southern Asia", "Southern Europe","Western Africa", "Western Asia","Western Europe"), ordered = TRUE)
class(FINAL_DATA$`Region-1`)
## [1] "ordered" "factor"
levels(FINAL_DATA$`Region-1`)
## [1] "344" "446"
## [3] "535" "Antarctica"
## [5] "Australia and New Zealand" "Caribbean"
## [7] "Central America" "Central Asia"
## [9] "Eastern Africa" "Eastern Asia"
## [11] "Eastern Europe" "Melanesia"
## [13] "Micronesia" "Middle Africa"
## [15] "Northern Africa" "Northern America"
## [17] "Northern Europe" "Polynesia"
## [19] "South-eastern Asia" "South America"
## [21] "Southern Africa" "Southern Asia"
## [23] "Southern Europe" "Western Africa"
## [25] "Western Asia" "Western Europe"
FINAL_DATA$Continent<-FINAL_DATA$Continent %>% factor(levels = c("Africa","Antarctica","Asia","Europe", "Latin America and the Caribbean", "North America","Oceania","South America"), ordered =TRUE)
class(FINAL_DATA$Continent)
## [1] "ordered" "factor"
levels(FINAL_DATA$Continent)
## [1] "Africa" "Antarctica"
## [3] "Asia" "Europe"
## [5] "Latin America and the Caribbean" "North America"
## [7] "Oceania" "South America"
str(FINAL_DATA)
## tibble [3,294 x 7] (S3: tbl_df/tbl/data.frame)
## $ Country Name : Factor w/ 183 levels "Afghanistan",..: 10 1 6 2 5 179 8 9 4 7 ...
## $ Country Code : Factor w/ 183 levels "ABW","AFG","AGO",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ year : Ord.factor w/ 18 levels "2000"<"2001"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ TOTAL_GDP : num [1:3294] 1.87e+09 NA 9.13e+09 3.48e+09 1.43e+09 ...
## $ Total_population: num [1:3294] 90853 20779953 16395473 3089027 65390 ...
## $ Region-1 : Ord.factor w/ 26 levels "344"<"446"<"535"<..: 6 22 14 23 23 25 20 25 18 6 ...
## $ Continent : Ord.factor w/ 8 levels "Africa"<"Antarctica"<..: 6 3 1 4 4 3 8 3 7 6 ...
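An ordered factor supports comparisons such as `<` and `>`, whereas an unordered factor treats its levels as plain labels. A small base R sketch of the difference, using made-up vectors (`yr` and `cont` are hypothetical names, not columns of FINAL_DATA):

```r
# Hypothetical year and continent vectors stored as character.
yr <- c("2001", "2000", "2017")
cont <- c("Asia", "Europe", "Asia")

# Ordered factor: levels carry a ranking, so comparisons are meaningful.
yr_f <- factor(yr, levels = as.character(2000:2017), ordered = TRUE)
print(yr_f[1] > yr_f[2])

# Unordered factor: levels are just labels.
cont_f <- factor(cont)
print(levels(cont_f))
print(is.ordered(cont_f))

# Values not listed in 'levels' become NA, so it is worth checking after conversion.
print(sum(is.na(yr_f)))
```

Counting NA values after a conversion with explicit `levels` is a quick way to catch a misspelt or missing level.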
A new column called GDP-PER-CAPITA was created to store the GDP per capita, i.e.
GDP-PER-CAPITA = TOTAL_GDP / Total_population
The mutate() function was used to create the new column in the FINAL_DATA dataset.
The str() function was then used again to check the structure of the dataset after adding the new column.
# Creating a new Variable
FINAL_DATA<-FINAL_DATA %>% mutate("GDP-PER-CAPITA"= round((TOTAL_GDP/Total_population),2))
head(FINAL_DATA)
str(FINAL_DATA)
## tibble [3,294 x 8] (S3: tbl_df/tbl/data.frame)
## $ Country Name : Factor w/ 183 levels "Afghanistan",..: 10 1 6 2 5 179 8 9 4 7 ...
## $ Country Code : Factor w/ 183 levels "ABW","AFG","AGO",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ year : Ord.factor w/ 18 levels "2000"<"2001"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ TOTAL_GDP : num [1:3294] 1.87e+09 NA 9.13e+09 3.48e+09 1.43e+09 ...
## $ Total_population: num [1:3294] 90853 20779953 16395473 3089027 65390 ...
## $ Region-1 : Ord.factor w/ 26 levels "344"<"446"<"535"<..: 6 22 14 23 23 25 20 25 18 6 ...
## $ Continent : Ord.factor w/ 8 levels "Africa"<"Antarctica"<..: 6 3 1 4 4 3 8 3 7 6 ...
## $ GDP-PER-CAPITA : num [1:3294] 20621 NA 557 1127 21937 ...
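As a worked illustration of the ratio, the base R sketch below computes GDP per capita for three made-up rows. The data frame `toy` and its values are hypothetical; the point is only to show the arithmetic and how a missing GDP value propagates to the result.

```r
# Hypothetical GDP (US dollars) and population values; NA marks a missing GDP.
toy <- data.frame(
  TOTAL_GDP = c(2e9, NA, 9e9),
  Total_population = c(1e5, 2e7, 1.5e7)
)

# Per-capita GDP, rounded to two decimals as in the chunk above.
toy$gdp_per_capita <- round(toy$TOTAL_GDP / toy$Total_population, 2)

print(toy)
```

Any row with a missing numerator or denominator yields NA in the new column.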
# This is the R chunk for the Scan I
colSums(is.na(FINAL_DATA))
## Country Name Country Code year TOTAL_GDP
## 0 0 0 170
## Total_population Region-1 Continent GDP-PER-CAPITA
## 6 0 0 170
#Imputing the missing values
FINAL_DATA$TOTAL_GDP[is.na(FINAL_DATA$TOTAL_GDP)]<-mean(FINAL_DATA$TOTAL_GDP,na.rm = TRUE)
sum(is.na(FINAL_DATA$TOTAL_GDP))
## [1] 0
FINAL_DATA$Total_population[is.na(FINAL_DATA$Total_population)]<-mean(FINAL_DATA$Total_population,na.rm = TRUE)
sum(is.na(FINAL_DATA$Total_population))
## [1] 0
FINAL_DATA$`GDP-PER-CAPITA`[is.na(FINAL_DATA$`GDP-PER-CAPITA`)]<-mean(FINAL_DATA$`GDP-PER-CAPITA`,na.rm = TRUE)
sum(is.na(FINAL_DATA$`GDP-PER-CAPITA`))
## [1] 0
# Flag infinite or NaN entries of numeric columns (returns NULL for non-numeric columns)
is.special <- function(x) {
  if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}
FINAL_DATA %>% sapply(function(x) sum(is.special(x)))
## Country Name Country Code year TOTAL_GDP
## 0 0 0 0
## Total_population Region-1 Continent GDP-PER-CAPITA
## 0 0 0 0
head(FINAL_DATA)
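The imputation and special-value check above can be illustrated on toy vectors. The sketch below uses base R only; `impute_mean` and `count_special` are hypothetical helper names, not functions from the packages loaded earlier.

```r
# Replace missing values by the mean of the observed values.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Count infinite or NaN entries; non-numeric vectors are reported as 0.
count_special <- function(x) {
  if (!is.numeric(x)) return(0L)
  sum(is.infinite(x) | is.nan(x))
}

x <- c(1, NA, 3)
x_imputed <- impute_mean(x)
print(x_imputed)

# Division by zero is a typical source of special values.
ratio <- c(1, 0, -1) / 0
print(count_special(ratio))
```

Mean imputation leaves the column mean unchanged but reduces its variance, which is worth keeping in mind when many values are missing.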
The FINAL_DATA dataset is scanned for outliers in the numeric variables (TOTAL_GDP, Total_population, GDP-PER-CAPITA), which are visualised using boxplots. A boxplot shows every value that falls outside the outlier fences as an individual point.
The outlier fences are Q1 - 1.5 * IQR (lower fence) and Q3 + 1.5 * IQR (upper fence), where Q1 and Q3 are the first and third quartiles.
We use the z-score method to detect univariate outliers.
This method assumes that each variable is approximately normally distributed.
The number of outliers is found by calculating a standardised z-score for every observation. Observations whose absolute z-score is greater than 3 are considered outliers.
# This is the R chunk for the Scan II
FINAL_DATA %>% select(TOTAL_GDP,Total_population,`GDP-PER-CAPITA`) %>% summary()
## TOTAL_GDP Total_population GDP-PER-CAPITA
## Min. :1.320e+07 Min. :9.394e+03 Min. : 111.9
## 1st Qu.:3.986e+09 1st Qu.:7.458e+05 1st Qu.: 1455.0
## Median :1.969e+10 Median :5.537e+06 Median : 5199.2
## Mean :2.250e+11 Mean :3.202e+07 Mean : 14271.3
## 3rd Qu.:1.728e+11 3rd Qu.:1.799e+07 3rd Qu.: 15691.5
## Max. :1.214e+13 Max. :1.386e+09 Max. :189170.9
FINAL_DATA$TOTAL_GDP %>% boxplot(main="Box-Plot of GDP",ylab="Total GDP")
FINAL_DATA$Total_population %>% boxplot(main="Box-Plot of POPULATION",ylab="Total Population")
FINAL_DATA$`GDP-PER-CAPITA` %>% boxplot(main="Box-Plot of GDP PER CAPITA",ylab="GDP/POPULATION")
#z-scores for all the numeric variables
z.score_GDP <-FINAL_DATA$TOTAL_GDP %>% scores(type = "z")
z.score_GDP %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.30443 -0.29905 -0.27779 0.00000 -0.07065 16.12926
length(which(abs(z.score_GDP)>3))
## [1] 60
z.score_POP <-FINAL_DATA$Total_population %>% scores(type = "z")
z.score_POP %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2373 -0.2319 -0.1964 0.0000 -0.1040 10.0416
length(which(abs(z.score_POP)>3))
## [1] 36
z.score <-FINAL_DATA$`GDP-PER-CAPITA` %>% scores(type = "z")
z.score %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.6371 -0.5767 -0.4082 0.0000 0.0639 7.8695
length(which(abs(z.score)>3))
## [1] 73
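The two outlier rules described above can be sketched in base R on a toy vector. The data are made up; the z-score line is the usual standardisation (value minus mean, divided by the standard deviation), which should match what scores(x, type = "z") from the outliers package returns.

```r
# Toy data: nineteen ordinary values and one extreme value.
x <- c(1:19, 100)

# Tukey fences: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR.
q <- quantile(x, c(0.25, 0.75))
lower_fence <- q[[1]] - 1.5 * IQR(x)
upper_fence <- q[[2]] + 1.5 * IQR(x)
outside_fences <- which(x < lower_fence | x > upper_fence)

# z-scores: distance from the mean in units of the standard deviation.
z <- (x - mean(x)) / sd(x)
z_outliers <- which(abs(z) > 3)

print(c(lower = lower_fence, upper = upper_fence))
print(outside_fences)
print(z_outliers)
```

On strongly skewed data the two rules can disagree, because extreme values inflate the mean and standard deviation used by the z-score, while the quartiles are more robust.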
The outliers are replaced with nearby values that are not outliers. Observations that lie outside the outlier fences of a boxplot are capped: those below the lower fence are replaced with the value of the 5th percentile, and those above the upper fence with the value of the 95th percentile.
To cap the outliers, we use the following user-defined function:
# Define a function to cap the values outside the limits
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
# Capping the outliers
FINAL_DATA$TOTAL_GDP<-cap(FINAL_DATA$TOTAL_GDP)
FINAL_DATA$Total_population<-cap(FINAL_DATA$Total_population)
FINAL_DATA$`GDP-PER-CAPITA`<-cap(FINAL_DATA$`GDP-PER-CAPITA`)
boxplot(FINAL_DATA$TOTAL_GDP,FINAL_DATA$Total_population,FINAL_DATA$`GDP-PER-CAPITA`,main="Box-Plot after capping outliers",names = c("GDP","POPULATION","GDP PER CAPITA"))
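The cap() function above relies on its input having no missing values, which holds here because of the earlier imputation; quantile() stops with an error if it meets an NA. A sketch of an NA-tolerant variant follows. `cap_na` is a hypothetical name, and the toy vector is made up.

```r
# Hypothetical NA-tolerant variant of the capping function.
cap_na <- function(x) {
  q <- quantile(x, c(0.05, 0.25, 0.75, 0.95), na.rm = TRUE)
  iqr <- IQR(x, na.rm = TRUE)
  # which() drops NA comparisons, so missing values are left untouched.
  x[which(x < q[[2]] - 1.5 * iqr)] <- q[[1]]
  x[which(x > q[[3]] + 1.5 * iqr)] <- q[[4]]
  x
}

x <- c(1:19, 100, NA)
capped <- cap_na(x)
print(capped)
```

Only the extreme value is replaced; the remaining values and the NA are returned unchanged. Note that the 95th percentile is itself computed from data that still contain the outlier.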
Data transformation has been applied to all three numerical variables (TOTAL_GDP, Total_population, GDP-PER-CAPITA).
For each variable, histograms before and after the transformation are shown side by side.
The histograms of the original variables are right-skewed, so a log transformation was used.
A log transformation is generally used to reduce right skewness and bring the distribution closer to a normal distribution.
After the transformation, the variables appear approximately normally distributed.
# This is the R chunk for the Transform Section
par(mfrow=c(1,2))
hist(FINAL_DATA$TOTAL_GDP,main = "GDP",xlab = "Total GDP",col = "grey")
Log_GDP<-log(FINAL_DATA$TOTAL_GDP)
hist(Log_GDP,main = "Transformed GDP",xlab ="log(Total GDP)",col = "red")
par(mfrow=c(1,2))
hist(FINAL_DATA$Total_population,main = "POPULATION",xlab = "Total Population",col = "grey")
Log_POP <- log(FINAL_DATA$Total_population)
hist(Log_POP,main = "Transformed POPULATION",xlab = "log(Total Population)",col = "red")
par(mfrow=c(1,2))
hist(FINAL_DATA$`GDP-PER-CAPITA`,main = "GDP per CAPITA",xlab = "GDP per CAPITA",col = "grey")
Log_GDPperCAPITA<-log(FINAL_DATA$`GDP-PER-CAPITA`)
hist(Log_GDPperCAPITA,main = "Transformed - GDP per CAPITA",xlab = "log(GDP per CAPITA)",col = "red")
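The visual impression from the histograms can be complemented with a numeric skewness measure. The sketch below uses simulated log-normal data rather than FINAL_DATA, and `skewness` is a hypothetical helper (the standardised third moment) written in base R, not a function from the packages loaded earlier.

```r
set.seed(1)

# Simulated right-skewed data (log-normal), standing in for a variable like GDP.
x <- rlnorm(1000, meanlog = 10, sdlog = 1.5)

# Hypothetical helper: sample skewness as the standardised third moment.
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / (mean((v - m)^2))^1.5
}

skew_raw <- skewness(x)
skew_log <- skewness(log(x))

print(skew_raw)
print(skew_log)
```

A skewness close to zero after the transformation is consistent with a roughly symmetric distribution, although it does not by itself establish normality.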
Dolgun, Dr. Anil, ‘MATH2349 Data Wrangling’, lecture notes, RMIT University, http://rare-phoenix-161610.appspot.com/secured/index.html