Required packages

All required packages for the preprocessing steps are loaded.

library(dplyr)
library(readxl)
library(tidyr)
library(ggplot2)

Executive Summary

This report outlines the steps taken to pre-process two related datasets downloaded from the World Bank website. The final dataset contains the GDP and net inflow of Foreign Direct Investment (FDI) for ASEAN countries, and will be used in the next stage of analysis to investigate the relationship between GDP and FDI.

First, each table was reshaped individually according to tidy data principles, and the two tables were then merged. The countries in the GDP table were prioritised during merging because net inflow of FDI is expressed as a percentage of the country’s GDP. Next, data type conversions were applied: Country Name was converted from character to factor, and year from character to ordered factor. The tidy dataset was then filtered to include only observations from the ten ASEAN countries from the year 2000 onwards, as the last member of ASEAN joined in 1999.

The filtered dataset was scanned for missing values and special values (infinite and NaN). One observation contained missing values and was removed, as it accounts for less than 5% of the data. Outliers were also observed. These were left unchanged because they are considered valid points that can be explained by the varying strengths of each country’s economy.

The final step checks the distributions of GDP and net inflow of FDI. Mathematical transformations were applied in an attempt to bring the variables closer to a normal distribution. The transformed variables are not perfectly normal, but the mean and median of each are reasonably close, which suggests that the distributions are roughly symmetric. Furthermore, the sample size is large enough (n > 30) for the Central Limit Theorem to be invoked.

Data

This report uses two datasets, both downloaded from the World Bank website. The first dataset contains each country’s GDP from 1960 to 2018 in US dollars. There are 63 variables in this dataset:
- Country Name
- Country Code (Unique country code given by the World Bank)
- Indicator Name (The name of the indicator, which depends on the dataset downloaded from the World Bank’s website)
- Indicator Code (Unique indicator code)
- Columns 5 to 63 contain the recorded GDP of each country for each year, with the year as the column name.

# Import function
  import <- function(file){
    read_xlsx(file, sheet = 1, skip = 3)
}

# Import FDI and GDP tables
  file <- "Country FDI.xlsx"
  FDI <- import(file)

head(FDI)

The second dataset contains the net inflow of Foreign Direct Investment (FDI) into countries worldwide, expressed as a percentage of GDP. FDI net inflows are the value of inward direct investment made by non-resident investors in the reporting economy. They are calculated as new investment inflows less disinvestment by foreign investors in each economy, and then divided by GDP to give a percentage of GDP. More information on FDI can be found via the links in the References section below.

There are also 63 variables in this dataset, with the same structure outlined above. The two datasets can be connected by Country Name or Country Code. The net inflow of FDI can be a negative number when new foreign investment is less than disinvestment for that year.

file <- "Country GDP.xlsx"
GDP <- import(file)
head(GDP)

Tidy & Manipulate Data I

According to tidy data principles, each variable should have its own column. The current datasets violate this principle because the years are spread across columns 5 to 63 (the column headers are values, not variable names). We will first reshape each dataset individually, gathering columns 5 to 63 into a year column, before merging them.

FDI_clean <- FDI %>% gather(key = "year", value = "FDI_percent", 5:63)
GDP_clean <- GDP %>% gather(key = "year", value = "GDP", 5:63)

head(FDI_clean)
head(GDP_clean)

With both tables in a tidy format, we merge them using the left_join() function, with the GDP_clean table on the left. left_join() is used because we only want to retain the country-year rows that appear in the GDP table. Since net FDI inflow is expressed as a percentage of the country’s GDP, we exclude observations where a country’s net FDI inflow is recorded without a corresponding entry in the GDP table.

The tables are joined by Country Name, Country Code and year so that these columns are not duplicated in the merged dataset. Redundant columns are dropped after merging.

economy <- left_join(GDP_clean, FDI_clean, by = c("Country Name", "Country Code", "year")) 
   
head(economy)
economy <- economy[,-c(2,3,4,7,8)]

head(economy)

Understand

Now that the dataset is in a tidy format, we continue with data type conversions. The following variables were converted:
1. Country Name: Changed from character to factor.
2. year: Changed from character to ordered factor.

# Check the data structure of economy table before conversion
  str(economy)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   15576 obs. of  4 variables:
 $ Country Name: chr  "Aruba" "Afghanistan" "Angola" "Albania" ...
 $ year        : chr  "1960" "1960" "1960" "1960" ...
 $ GDP         : num  NA 5.38e+08 NA NA NA ...
 $ FDI_percent : num  NA NA NA NA NA NA NA NA NA NA ...
# Apply data conversions
  country_level <- unique(economy$`Country Name`)
  economy$`Country Name` <- factor(economy$`Country Name`, 
                                     levels = country_level,
                                     labels = country_level)

  year <- seq(1960,2018, by = 1)
  economy$year <- factor(economy$year,
                         levels = year,
                         labels = year,
                         ordered = TRUE)

# Check the data structure of economy table after conversion
  str(economy)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   15576 obs. of  4 variables:
 $ Country Name: Factor w/ 264 levels "Aruba","Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ year        : Ord.factor w/ 59 levels "1960"<"1961"<..: 1 1 1 1 1 1 1 1 1 1 ...
 $ GDP         : num  NA 5.38e+08 NA NA NA ...
 $ FDI_percent : num  NA NA NA NA NA NA NA NA NA NA ...

Tidy & Manipulate Data II

Since we are only interested in countries that are members of ASEAN, we will subset the merged economy table to include only the observations of the 10 member countries. The dataset is also filtered to the year 2000 onwards, as the last member of ASEAN, Cambodia, joined on 30 April 1999.

# Keep the ten ASEAN members; year is an ordered factor, so compare against the level "2000"
asean_countries <- c("Brunei Darussalam", "Cambodia", "Indonesia", "Lao PDR", "Malaysia",
                     "Myanmar", "Philippines", "Singapore", "Thailand", "Vietnam")
economy_ASEAN <- economy %>%
  filter(`Country Name` %in% asean_countries, year >= "2000")
# Check dimension of subset
  dim(economy_ASEAN)
[1] 190   4

After subsetting the dataset, we will create two new variables:
1. FDI_value_InBillions: Each country’s net inflow of FDI expressed in billions of US$ (FDI_percent / 100 × GDP, since FDI_percent is a percentage of GDP)
2. GDP_value_InBillions: Each country’s GDP expressed in billions of US$ (to make the numbers easier to read)

# Create the new variables and express the values in billions of US$
# FDI_percent is a percentage of GDP, so it is divided by 100 before multiplying by GDP
  economy_ASEAN <- mutate(economy_ASEAN,
                          FDI_value_InBillions = (FDI_percent / 100) * GDP / 1e9,
                          GDP_value_InBillions = GDP / 1e9)
  head(economy_ASEAN)

Scan I

The next step is to scan for obvious errors, missing values and special values within the dataset. The number of missing values was tallied for each column, and one observation with missing values was identified.

# Scan for missing values
  colSums(is.na(economy_ASEAN))
        Country Name                 year                  GDP          FDI_percent FDI_value_InBillions 
                   0                    0                    0                    1                    1 
GDP_value_InBillions 
                   0 
  which(is.na(economy_ASEAN), arr.ind = TRUE)
     row col
[1,]   1   4
[2,]   1   5
  economy_ASEAN[1,]

Since the observation with missing values accounts for less than 5% of the data, we can safely remove it.

  (sum(is.na(economy_ASEAN$FDI_percent)) / length(economy_ASEAN$FDI_percent))*100
[1] 0.5263158
  (sum(is.na(economy_ASEAN$FDI_value_InBillions)) / length(economy_ASEAN$FDI_value_InBillions))*100
[1] 0.5263158
  economy_ASEAN_cleaned <- economy_ASEAN[complete.cases(economy_ASEAN),]
  dim(economy_ASEAN_cleaned)
[1] 189   6

Special values (infinite and NaN) and inconsistencies were also checked, and none were found in the dataset. As mentioned in previous sections, net inflow of FDI can be a negative number, so negative values are not treated as inconsistencies. A GDP or FDI value of exactly zero would be suspect, however, so the numeric columns are also checked for zeros.

# Scan for special values
  is.special <- function(x){ if (is.numeric(x)) (is.infinite(x) | is.nan(x)) }
  sapply(economy_ASEAN_cleaned, function(x) sum( is.special(x) ))
        Country Name                 year                  GDP          FDI_percent FDI_value_InBillions 
                   0                    0                    0                    0                    0 
GDP_value_InBillions 
                   0 

The result shows that there are no zero values in the dataset.

# Check for inconsistencies (zero values) in the dataset
# Note: `==` tests for equality; `x = 0` would assign 0 to x instead
  zero <- function(x){ if (is.numeric(x)) (x == 0) else FALSE }
  sapply(economy_ASEAN_cleaned, function(x) sum(zero(x)))
        Country Name                 year                  GDP          FDI_percent FDI_value_InBillions 
                   0                    0                    0                    0                    0 
GDP_value_InBillions 
                   0 

Scan II

After examining and handling the missing values, special values and inconsistencies, the next step is to check for outliers in the dataset using boxplots.

GDP_A <- economy_ASEAN_cleaned$GDP_value_InBillions
FDI_A <- economy_ASEAN_cleaned$FDI_value_InBillions
FDI_percent_A <- economy_ASEAN_cleaned$FDI_percent
  
boxplot(GDP_A, main = "Boxplot of GDP (In Billions)", xlab = "GDP (In Billions)")

boxplot(FDI_A, main = "Boxplot of FDI (In Billions)", xlab = "FDI (In Billions)")

boxplot(FDI_percent_A, main = "Boxplot of FDI (In Percent)", xlab = "FDI (In Percent)")

The boxplots above show many univariate outliers in the dataset. However, we will leave these values unchanged because they mainly reflect the disparity in GDP and net inflow of FDI between ASEAN countries, where some economies are much larger than others (as shown in the bar charts below). Hence, the outliers are not driven by errors in the data.

  plot1 <- ggplot(economy_ASEAN_cleaned, aes(x = reorder(`Country Name`, GDP_value_InBillions), y = GDP_value_InBillions, fill = `Country Name`))

# geom_col() draws bars directly from the y values (stacked over the years 2000-2018)
plot1 + geom_col() +
  coord_flip() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5)) +
  ggtitle("ASEAN Countries' Total GDP (2000-2018)") +
  xlab("Countries") +
  ylab("GDP (US Billions)")


  plot2 <- ggplot(economy_ASEAN_cleaned, aes(x = reorder(`Country Name`, GDP_value_InBillions), y = FDI_value_InBillions, fill = `Country Name`))

# geom_col() draws bars directly from the y values (stacked over the years 2000-2018)
plot2 + geom_col() +
  coord_flip() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5)) +
  ggtitle("ASEAN Countries' Total FDI net inflow (2000-2018)") +
  xlab("Countries") +
  ylab("Net FDI inflow (US Billions)")

Transform

Since we are pre-processing the data with the intent of performing a simple linear regression between the GDP and FDI, the normality of both variables will be checked using a histogram. The countries’ GDP will be checked first before moving on to the net inflow of FDI.

GDP Transformation

hist(GDP_A, main = "Histogram of GDP (in billions)")

On first inspection, the distribution of GDP for ASEAN countries is right-skewed. A mathematical transformation was therefore applied in an attempt to bring the data closer to a normal distribution.

logGDP <- log10(GDP_A)
hist(logGDP)

summary(logGDP)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.2383  1.1581  1.9962  1.8205  2.4199  3.0179 

After the transformation, the histogram of logGDP is closer to a symmetric shape. Although it is not perfectly symmetric, the mean and median of logGDP are reasonably close, which suggests that the distribution is approximately normal. Furthermore, the sample size for this dataset is greater than 30, so the Central Limit Theorem can be invoked in the analysis.

FDI Transformation

Like GDP, the distribution of FDI is right-skewed, so a transformation was applied to bring it closer to a normal distribution. FDI contains negative values, as explained earlier, and the logarithm is undefined for them. FDI is therefore first raised to the power of 4, which makes every value positive (at the cost of discarding the sign). A log10 transformation is then applied.

hist(FDI_A)

FDI_power4 <- FDI_A^4
logFDI <- log10(FDI_power4)
hist(logFDI)

summary(logFDI)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -1.406   7.163   9.781   9.393  11.803  15.907 

The histogram of logFDI does not show a perfectly symmetrical distribution. However, as with the GDP transformation, FDI can be considered close to a normal distribution, with the mean and median being fairly close to each other and the sample size being large enough for the Central Limit Theorem to be applied.

References

Foreign direct investment, net inflows (% of GDP). (2019). Retrieved from https://data.worldbank.org/indicator/BX.KLT.DINV.WD.GD.ZS?view=chart

GDP (current US$). (2019). Retrieved from https://data.worldbank.org/indicator/NY.GDP.MKTP.CD?name_desc=false&view=chart

Kramer, L. (2019). What is GDP and Why is It So Important to Economists and Investors? Retrieved from https://www.investopedia.com/ask/answers/what-is-gdp-why-its-important-to-economists-investors/

Te Velde, D. W. (2006). Foreign Direct Investment and Development: An Historical Perspective. Retrieved from https://www.odi.org/publications/594-foreign-direct-investment-development-historical-perspective

---
title: "MATH2349 Semester 2, 2019"
author: "Hardi Winata 3794905, Ma. Charriz Anne Rioflorido 3789114, Keavatey Srun 3767615"
subtitle: Assignment 3
output:
  html_notebook: default
---

## Required packages 

All required packages for the preprocessing steps are loaded.

```{r }
library(dplyr)
library(readxl)
library(tidyr)
library(ggplot2)
```


## Executive Summary 

This report outlines the steps taken to pre-process two related datasets downloaded from the World Bank website. The final dataset contains the GDP and net inflow of Foreign Direct Investment (FDI) for ASEAN countries, and will be used in the next stage of analysis to investigate the relationship between GDP and FDI.

First, each table was reshaped individually according to tidy data principles, and the two tables were then merged. The countries in the GDP table were prioritised during merging because net inflow of FDI is expressed as a percentage of the country's GDP. Next, data type conversions were applied: `Country Name` was converted from `character` to `factor`, and `year` from `character` to ordered `factor`. The tidy dataset was then filtered to include only observations from the ten ASEAN countries from the year 2000 onwards, as the last member of ASEAN joined in 1999.

The filtered dataset was scanned for missing values and special values (infinite and NaN). One observation contained missing values and was removed, as it accounts for less than 5% of the data. Outliers were also observed. These were left unchanged because they are considered valid points that can be explained by the varying strengths of each country's economy.

The final step checks the distributions of GDP and net inflow of FDI. Mathematical transformations were applied in an attempt to bring the variables closer to a normal distribution. The transformed variables are not perfectly normal, but the mean and median of each are reasonably close, which suggests that the distributions are roughly symmetric. Furthermore, the sample size is large enough (n > 30) for the Central Limit Theorem to be invoked.

## Data 

This report uses two datasets, both downloaded from the World Bank website. The first dataset contains each country's GDP from 1960 to 2018 in US dollars. There are 63 variables in this dataset:  
  - **Country Name**  
  - **Country Code** (Unique country code given by the World Bank)  
  - **Indicator Name** (The name of the indicator, which depends on the dataset downloaded from the World Bank's website)  
  - **Indicator Code** (Unique indicator code)  
  - **Columns 5 to 63** contain the recorded GDP of each country for each year, with the year as the column name.  
```{r}
# Import function
  import <- function(file){
    read_xlsx(file, sheet = 1, skip = 3)
}

# Import FDI and GDP tables
  file <- "Country FDI.xlsx"
  FDI <- import(file)

head(FDI)
```

The second dataset contains the net inflow of Foreign Direct Investment (FDI) into countries worldwide, expressed as a percentage of GDP. FDI net inflows are the value of inward direct investment made by non-resident investors in the reporting economy. They are calculated as new investment inflows less disinvestment by foreign investors in each economy, and then divided by GDP to give a percentage of GDP. More information on FDI can be found via the links in the References section below.

There are also 63 variables in this dataset, with the same structure outlined above. The two datasets can be connected by Country Name or Country Code. The net inflow of FDI can be a negative number when new foreign investment is less than disinvestment for that year.
```{r}
file <- "Country GDP.xlsx"
GDP <- import(file)
head(GDP)
```

##	Tidy & Manipulate Data I 

According to tidy data principles, each variable should have its own column. The current datasets violate this principle because the years are spread across columns 5 to 63 (the column headers are values, not variable names). We will first reshape each dataset individually, gathering columns 5 to 63 into a `year` column, before merging them.
```{r}
FDI_clean <- FDI %>% gather(key = "year", value = "FDI_percent", 5:63)
GDP_clean <- GDP %>% gather(key = "year", value = "GDP", 5:63)

head(FDI_clean)
head(GDP_clean)
```
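
As a side note, `gather()` has been superseded in newer releases of `tidyr` by `pivot_longer()`. The sketch below shows what an equivalent reshape might look like on a made-up two-country table; `toy_wide` and `toy_long` are hypothetical names, the values are invented, and it assumes a `tidyr` version that provides `pivot_longer()`. The result holds the same rows as `gather()` but in a different row order.

```{r}
library(tidyr)

# Toy wide table shaped like the World Bank export (values are made up)
toy_wide <- data.frame(
  `Country Name` = c("A", "B"),
  `1960` = c(1.0, 2.0),
  `1961` = c(1.5, 2.5),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Select the year columns by name pattern rather than by position
toy_long <- pivot_longer(
  toy_wide,
  cols = matches("^[0-9]{4}$"),
  names_to = "year",
  values_to = "GDP"
)
toy_long
```

Selecting columns by pattern instead of `5:63` keeps the reshape working if the World Bank adds another year to the export.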
With both tables in a tidy format, we merge them using the `left_join()` function, with the `GDP_clean` table on the left. `left_join()` is used because we only want to retain the country-year rows that appear in the GDP table. Since net FDI inflow is expressed as a percentage of the country's GDP, we exclude observations where a country's net FDI inflow is recorded without a corresponding entry in the GDP table.  
  
The tables are joined by `Country Name`, `Country Code` and `year` so that these columns are not duplicated in the merged dataset. Redundant columns are dropped after merging. 
```{r}
economy <- left_join(GDP_clean, FDI_clean, by = c("Country Name", "Country Code", "year")) 
   
head(economy)
economy <- economy[,-c(2,3,4,7,8)]

head(economy)
```
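
To illustrate what the left join keeps and drops, here is a minimal sketch on two made-up tables. The names `toy_gdp`, `toy_fdi`, `toy_joined` and `toy_only_fdi` are hypothetical and the values are invented. `anti_join()` is one way to list the FDI rows that have no match in the GDP table and are therefore excluded.

```{r}
library(dplyr)

# Made-up tables: country "C" has GDP only, country "D" has FDI only
toy_gdp <- data.frame(country = c("A", "B", "C"), year = "2000",
                      GDP = c(10, 20, 30), stringsAsFactors = FALSE)
toy_fdi <- data.frame(country = c("A", "B", "D"), year = "2000",
                      FDI_percent = c(1, 2, 4), stringsAsFactors = FALSE)

# Every GDP row is kept; FDI_percent is NA where no FDI row matches
toy_joined <- left_join(toy_gdp, toy_fdi, by = c("country", "year"))
toy_joined

# FDI rows without a matching GDP row (dropped by the left join)
toy_only_fdi <- anti_join(toy_fdi, toy_gdp, by = c("country", "year"))
toy_only_fdi
```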

## Understand 

Now that the dataset is in a tidy format, we continue with data type conversions. The following variables were converted:  
  1. `Country Name`: Changed from `character` to `factor`.  
  2. `year`: Changed from `character` to ordered `factor`.  
```{r}
# Check the data structure of economy table before conversion
  str(economy)

# Apply data conversions
  country_level <- unique(economy$`Country Name`)
  economy$`Country Name` <- factor(economy$`Country Name`, 
                                     levels = country_level,
                                     labels = country_level)

  year <- seq(1960,2018, by = 1)
  economy$year <- factor(economy$year,
                         levels = year,
                         labels = year,
                         ordered = TRUE)

# Check the data structure of economy table after conversion
  str(economy)
```
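
Because `year` is now an ordered factor, comparisons act on the order of its levels rather than on numbers. The following sketch, using a hypothetical vector `toy_year`, shows how such a comparison behaves and why converting back to a number should go through `as.character()`.

```{r}
# A small ordered factor of years (illustrative only)
toy_year <- factor(c("1998", "1999", "2000", "2001"),
                   levels = as.character(1998:2001),
                   ordered = TRUE)

# Comparison uses the level order, so this keeps 2000 and later
toy_from_2000 <- toy_year >= "2000"
toy_from_2000

# as.integer() returns level positions, not the years themselves
as.integer(toy_year)

# Go through as.character() to recover the numeric years
as.integer(as.character(toy_year))
```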

##	Tidy & Manipulate Data II 

Since we are only interested in countries that are members of ASEAN, we will subset the merged `economy` table to include only the observations of the 10 countries. The dataset is also filtered to the year 2000 onwards as the last member of ASEAN, Cambodia, joined on $30^{th}$ April 1999.
```{r}
# Keep the ten ASEAN members; year is an ordered factor, so compare against the level "2000"
asean_countries <- c("Brunei Darussalam", "Cambodia", "Indonesia", "Lao PDR", "Malaysia",
                     "Myanmar", "Philippines", "Singapore", "Thailand", "Vietnam")
economy_ASEAN <- economy %>%
  filter(`Country Name` %in% asean_countries, year >= "2000")
# Check dimension of subset
  dim(economy_ASEAN)
```

After subsetting the dataset, we will create two new variables:  
  1. `FDI_value_InBillions`: Each country's net inflow of FDI expressed in billions of US$\$$ (`FDI_percent` / 100 $\times$ `GDP`, since `FDI_percent` is a percentage of GDP)  
  2. `GDP_value_InBillions`: Each country's GDP expressed in billions of US$\$$ (to make the numbers easier to read)  
```{r}
# Create the new variables and express the values in billions of US$
# FDI_percent is a percentage of GDP, so it is divided by 100 before multiplying by GDP
  economy_ASEAN <- mutate(economy_ASEAN,
                          FDI_value_InBillions = (FDI_percent / 100) * GDP / 1e9,
                          GDP_value_InBillions = GDP / 1e9)
  head(economy_ASEAN)
```

##	Scan I 

The next step is to scan for obvious errors, missing values and special values within the dataset. The number of missing values was tallied for each column, and one observation with missing values was identified.  
```{r}
# Scan for missing values
  colSums(is.na(economy_ASEAN))
  which(is.na(economy_ASEAN), arr.ind = TRUE)
  economy_ASEAN[1,]
```

Since the observation with missing values accounts for less than 5% of the data, we can safely remove it.
```{r}
  (sum(is.na(economy_ASEAN$FDI_percent)) / length(economy_ASEAN$FDI_percent))*100
  (sum(is.na(economy_ASEAN$FDI_value_InBillions)) / length(economy_ASEAN$FDI_value_InBillions))*100

  economy_ASEAN_cleaned <- economy_ASEAN[complete.cases(economy_ASEAN),]
  dim(economy_ASEAN_cleaned)
```
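
The per-column percentages above can also be computed in one call. The sketch below does this on a made-up data frame (`toy_df` is hypothetical, with 1 missing value in 25 rows) and applies the same 5% rule of thumb before dropping incomplete rows.

```{r}
# Made-up data: one missing FDI value out of 25 rows
toy_df <- data.frame(
  GDP = 1:25,
  FDI_percent = c(NA, seq(0.1, 2.4, length.out = 24))
)

# Percentage of missing values in every column at once
toy_pct_missing <- colMeans(is.na(toy_df)) * 100
toy_pct_missing

# Drop incomplete rows only if every column is under the 5% threshold
if (all(toy_pct_missing < 5)) {
  toy_complete <- toy_df[complete.cases(toy_df), ]
}
dim(toy_complete)
```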

Special values (infinite and NaN) and inconsistencies were also checked, and none were found in the dataset. As mentioned in previous sections, net inflow of FDI can be a negative number, so negative values are not treated as inconsistencies. A GDP or FDI value of exactly zero would be suspect, however, so the numeric columns are also checked for zeros.
```{r}
# Scan for special values
  is.special <- function(x){ if (is.numeric(x)) (is.infinite(x) | is.nan(x)) }
  sapply(economy_ASEAN_cleaned, function(x) sum( is.special(x) ))
```

The result shows that there are no zero values in the dataset.
```{r}
# Check for inconsistencies (zero values) in the dataset
# Note: `==` tests for equality; `x = 0` would assign 0 to x instead
  zero <- function(x){ if (is.numeric(x)) (x == 0) else FALSE }
  sapply(economy_ASEAN_cleaned, function(x) sum(zero(x)))
```

##	Scan II

After examining and handling the missing values, special values and inconsistencies, the next step is to check for outliers in the dataset using boxplots.
```{r}
GDP_A <- economy_ASEAN_cleaned$GDP_value_InBillions
FDI_A <- economy_ASEAN_cleaned$FDI_value_InBillions
FDI_percent_A <- economy_ASEAN_cleaned$FDI_percent
  
boxplot(GDP_A, main = "Boxplot of GDP (In Billions)", xlab = "GDP (In Billions)")
boxplot(FDI_A, main = "Boxplot of FDI (In Billions)", xlab = "FDI (In Billions)")
boxplot(FDI_percent_A, main = "Boxplot of FDI (In Percent)", xlab = "FDI (In Percent)")
```
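
The points drawn individually on a boxplot are those lying more than 1.5 times the interquartile range beyond the hinges. As a small illustration, `boxplot.stats()` returns those points directly, which makes it possible to count them; `toy_x` and `toy_outliers` are hypothetical names and the data is made up.

```{r}
# Made-up values with one obvious extreme point
toy_x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 50)

# $out holds the values beyond 1.5 * IQR from the hinges
toy_outliers <- boxplot.stats(toy_x)$out
toy_outliers
length(toy_outliers)
```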

The boxplots above show many univariate outliers in the dataset. However, we will leave these values unchanged because they mainly reflect the disparity in GDP and net inflow of FDI between ASEAN countries, where some economies are much larger than others (as shown in the bar charts below). Hence, the outliers are not driven by errors in the data.
```{r warning=FALSE}
  plot1 <- ggplot(economy_ASEAN_cleaned, aes(x = reorder(`Country Name`, GDP_value_InBillions), y = GDP_value_InBillions, fill = `Country Name`))

# geom_col() draws bars directly from the y values (stacked over the years 2000-2018)
plot1 + geom_col() +
  coord_flip() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5)) +
  ggtitle("ASEAN Countries' Total GDP (2000-2018)") +
  xlab("Countries") +
  ylab("GDP (US Billions)")

  plot2 <- ggplot(economy_ASEAN_cleaned, aes(x = reorder(`Country Name`, GDP_value_InBillions), y = FDI_value_InBillions, fill = `Country Name`))

# geom_col() draws bars directly from the y values (stacked over the years 2000-2018)
plot2 + geom_col() +
  coord_flip() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5)) +
  ggtitle("ASEAN Countries' Total FDI net inflow (2000-2018)") +
  xlab("Countries") +
  ylab("Net FDI inflow (US Billions)")
```

##	Transform 

Since we are pre-processing the data with the intent of performing a simple linear regression between the GDP and FDI, the normality of both variables will be checked using a histogram. The countries' GDP will be checked first before moving on to the net inflow of FDI.

### GDP Transformation

```{r}
hist(GDP_A, main = "Histogram of GDP (in billions)")
```

On first inspection, the distribution of GDP for ASEAN countries is right-skewed. A mathematical transformation was therefore applied in an attempt to bring the data closer to a normal distribution.
```{r}
logGDP <- log10(GDP_A)
hist(logGDP)
summary(logGDP)
```

After the transformation, the histogram of `logGDP` is closer to a symmetric shape. Although it is not perfectly symmetric, the mean and median of `logGDP` are reasonably close, which suggests that the distribution is approximately normal. Furthermore, the sample size for this dataset is greater than 30, so the Central Limit Theorem can be invoked in the analysis. 
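
Comparing the mean and median is only a rough check of symmetry. A normal Q-Q plot and a Shapiro-Wilk test are two common ways to look at normality more directly. The sketch below is not part of the original analysis: it uses simulated right-skewed data (`toy_skewed` is a hypothetical stand-in for a variable such as GDP) to show how a `log10` transformation could be checked.

```{r}
set.seed(1)

# Simulated right-skewed data standing in for a variable such as GDP
toy_skewed <- rlnorm(189, meanlog = 4, sdlog = 1.5)
toy_logged <- log10(toy_skewed)

# Points close to the reference line indicate approximate normality
qqnorm(toy_logged, main = "Normal Q-Q plot of log10-transformed toy data")
qqline(toy_logged)

# Shapiro-Wilk test: a small p-value is evidence against normality
toy_p_raw <- shapiro.test(toy_skewed)$p.value
toy_p_log <- shapiro.test(toy_logged)$p.value
c(raw = toy_p_raw, logged = toy_p_log)
```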

### FDI Transformation

Like GDP, the distribution of FDI is right-skewed, so a transformation was applied to bring it closer to a normal distribution. FDI contains negative values, as explained earlier, and the logarithm is undefined for them. FDI is therefore first raised to the power of 4, which makes every value positive (at the cost of discarding the sign). A `log10` transformation is then applied.

```{r}
hist(FDI_A)
FDI_power4 <- FDI_A^4
logFDI <- log10(FDI_power4)
hist(logFDI)
summary(logFDI)
```

The histogram of `logFDI` does not show a perfectly symmetrical distribution. However, as with the GDP transformation, FDI can be considered close to a normal distribution, with the mean and median being fairly close to each other and the sample size being large enough for the Central Limit Theorem to be applied.
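
It may help to see what the power-of-4 step does numerically. For any non-zero value, `log10(x^4)` equals `4 * log10(abs(x))`, so the transformation is a rescaled log of the absolute value, and inflows and outflows of the same size map to the same number. The sketch below checks this on made-up values (`toy_fdi_vals` is hypothetical). It also shows a signed log, which is an alternative not used in this report, as one possible way to keep the sign.

```{r}
# Made-up FDI values, including negative ones
toy_fdi_vals <- c(-2, -0.5, 0.5, 2, 10)

# The transformation used in this report
toy_log_pow4 <- log10(toy_fdi_vals^4)

# Equivalent form: the sign of the original value is discarded
toy_log_abs <- 4 * log10(abs(toy_fdi_vals))
all.equal(toy_log_pow4, toy_log_abs)

# Alternative (not used here): a signed log keeps the ordering and the sign
toy_signed_log <- sign(toy_fdi_vals) * log10(1 + abs(toy_fdi_vals))
toy_signed_log
```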

##	References

Foreign direct investment, net inflows (% of GDP). (2019).
    Retrieved from https://data.worldbank.org/indicator/BX.KLT.DINV.WD.GD.ZS?view=chart
    
GDP (current US$). (2019). Retrieved from 
    https://data.worldbank.org/indicator/NY.GDP.MKTP.CD?name_desc=false&view=chart

Kramer, L. (2019). What is GDP and Why is It So Important to Economists and Investors? Retrieved from 
    https://www.investopedia.com/ask/answers/what-is-gdp-why-its-important-to-economists-investors/

Te Velde, D. W. (2006). Foreign Direct Investment and Development: An Historical Perspective. Retrieved from
    https://www.odi.org/publications/594-foreign-direct-investment-development-historical-perspective

