In this analysis, I applied association rule mining with the Apriori algorithm to four economic indicators: GDP, Foreign Direct Investment (FDI), inflation, and unemployment. The objective was to examine what role foreign investment plays in a country's economy.
The data was downloaded from https://databank.worldbank.org/ for all countries, with GDP, FDI, inflation, and unemployment as the variables.
if (!require(arules)) install.packages("arules", dependencies=TRUE)
## Loading required package: arules
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
if (!require(arulesViz)) install.packages("arulesViz", dependencies=TRUE)
## Loading required package: arulesViz
# Load required libraries
library(arules)
library(arulesViz)
Loading and Cleaning Data
To avoid bias and misinterpretation, I kept only complete rows, that is, rows with no missing value in any column.
data <- read.csv("gdpData.csv", stringsAsFactors = FALSE)
# Replace ".." with NA for missing values
data[data == ".."] <- NA
# Convert necessary columns to numeric
data[, 5:8] <- lapply(data[, 5:8], as.numeric)
# Remove rows with missing values
data_complete <- na.omit(data)
# Discretize numeric variables into 3 equal-width bins (cut() with breaks = 3)
data_complete$GDP <- cut(data_complete[,5], breaks=3, labels=c("Low GDP", "Medium GDP", "High GDP"))
data_complete$FDI <- cut(data_complete[,6], breaks=3, labels=c("Low FDI", "Medium FDI", "High FDI"))
data_complete$Inflation <- cut(data_complete[,7], breaks=3, labels=c("Low Inflation", "Medium Inflation", "High Inflation"))
data_complete$Unemployment <- cut(data_complete[,8], breaks=3, labels=c("Low Unemployment", "Medium Unemployment", "High Unemployment"))
head(data_complete)
## Time Time.Code Country.Name Country.Code
## 2 2023 YR2023 Albania ALB
## 7 2023 YR2023 Belarus BLR
## 9 2023 YR2023 Bosnia and Herzegovina BIH
## 10 2023 YR2023 Botswana BWA
## 11 2023 YR2023 Brazil BRA
## 12 2023 YR2023 Burkina Faso BFA
## GDP..current.US....NY.GDP.MKTP.CD.
## 2 2.354718e+10
## 7 7.185738e+10
## 9 2.751478e+10
## 10 1.939608e+10
## 11 2.173666e+12
## 12 2.032462e+10
## Foreign.direct.investment..net.inflows..BoP..current.US....BX.KLT.DINV.CD.WD.
## 2 1620982551
## 7 1992107728
## 9 1035178227
## 10 665417580
## 11 64227330466
## 12 5174270
## Inflation..consumer.prices..annual.....FP.CPI.TOTL.ZG.
## 2 4.7597642
## 7 5.0005990
## 9 6.1059011
## 10 5.0676155
## 11 4.5935628
## 12 0.7429104
## Unemployment..total....of.total.labor.force...national.estimate...SL.UEM.TOTL.NE.ZS.
## 2 10.669
## 7 3.461
## 9 10.668
## 10 23.381
## 11 7.947
## 12 5.348
## GDP FDI Inflation Unemployment
## 2 Low GDP Medium FDI Low Inflation Low Unemployment
## 7 Low GDP Medium FDI Low Inflation Low Unemployment
## 9 Low GDP Medium FDI Low Inflation Low Unemployment
## 10 Low GDP Medium FDI Low Inflation High Unemployment
## 11 Low GDP Medium FDI Low Inflation Low Unemployment
## 12 Low GDP Medium FDI Low Inflation Low Unemployment
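Note that `cut()` with `breaks = 3` splits each variable's range into three equal-width intervals, so a few extreme values can push almost every country into a single bin, which is what the output above suggests. The sketch below uses made-up numbers (not the World Bank data) to contrast equal-width bins with quantile-based bins. The quantile approach is only one possible alternative and is not what the analysis above does.

```r
# Toy, made-up values standing in for a heavily skewed indicator such as GDP
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 1000)

# Equal-width bins: the range of x is split into 3 intervals of equal width
equal_width <- cut(x, breaks = 3, labels = c("Low", "Medium", "High"))
table(equal_width)

# Equal-frequency bins: break points are the terciles of x
q <- quantile(x, probs = c(0, 1/3, 2/3, 1))
equal_freq <- cut(x, breaks = q, labels = c("Low", "Medium", "High"),
                  include.lowest = TRUE)
table(equal_freq)
```

With the toy values, the single outlier leaves eight of nine observations in the "Low" equal-width bin, while the quantile-based bins hold three observations each.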
Converting our Data into Transactions
# Selecting categorical columns for analysis
data_trans <- data_complete[, c("GDP", "FDI", "Inflation", "Unemployment")]
# Convert each row into a character vector
data_list <- split(data_trans, rownames(data_trans))
# Ensure data is in the correct format
data_list <- lapply(data_list, function(x) as.character(unlist(x)))
# Convert to transactions
transactions <- as(data_list, "transactions")
# Display a summary of the transactions
summary(transactions)
## transactions as itemMatrix in sparse format with
## 86 rows (elements/itemsets/transactions) and
## 11 columns (items) and a density of 0.3636364
##
## most frequent items:
## Low GDP Low Inflation Medium FDI Low Unemployment
## 85 82 82 78
## Medium Unemployment (Other)
## 6 11
##
## element (itemset/transaction) length distribution:
## sizes
## 4
## 86
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4 4 4 4 4 4
##
## includes extended item information - examples:
## labels
## 1 High FDI
## 2 High GDP
## 3 High Inflation
##
## includes extended transaction information - examples:
## transactionID
## 1 10
## 2 103
## 3 106
Minimum support is set to 5% and minimum confidence to 80%.
rules <- apriori(transactions, parameter=list(support=0.05, confidence=0.8))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.05 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 4
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[11 item(s), 86 transaction(s)] done [0.00s].
## sorting and recoding items ... [5 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [44 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(rules)
## set of 44 rules
##
## rule length distribution (lhs + rhs):sizes
## 1 2 3 4
## 4 15 18 7
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 2.636 3.000 4.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.05814 Min. :0.8333 Min. :0.05814 Min. :0.8740
## 1st Qu.:0.06977 1st Qu.:0.9144 1st Qu.:0.06977 1st Qu.:0.9981
## Median :0.86047 Median :0.9595 Median :0.90698 Median :1.0049
## Mean :0.66068 Mean :0.9511 Mean :0.69345 Mean :0.9963
## 3rd Qu.:0.90698 3rd Qu.:1.0000 3rd Qu.:0.95349 3rd Qu.:1.0118
## Max. :0.98837 Max. :1.0000 Max. :1.00000 Max. :1.0488
## count
## Min. : 5.00
## 1st Qu.: 6.00
## Median :74.00
## Mean :56.82
## 3rd Qu.:78.00
## Max. :85.00
##
## mining info:
## data ntransactions support confidence
## transactions 86 0.05 0.8
## call
## apriori(data = transactions, parameter = list(support = 0.05, confidence = 0.8))
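The rule length distribution above lists 4 rules of size 1. With the default `minlen = 1`, `apriori()` also returns rules with an empty left-hand side, which only restate how frequent an item is. The sketch below, on made-up toy transactions rather than the World Bank data, shows how `minlen = 2` could be used to exclude them. The toy data and the object names `toy`, `toy_trans`, and `toy_rules` are assumptions for illustration.

```r
library(arules)

# Made-up toy transactions in the same list-of-character-vectors form
toy <- list(
  c("Low GDP", "Medium FDI"),
  c("Low GDP", "Medium FDI"),
  c("Low GDP", "High FDI"),
  c("High GDP", "High FDI")
)
toy_trans <- as(toy, "transactions")

# minlen = 2 requires at least one item on each side of every rule
toy_rules <- apriori(
  toy_trans,
  parameter = list(support = 0.05, confidence = 0.8, minlen = 2),
  control = list(verbose = FALSE)
)

# Rule sizes (lhs + rhs); none should be smaller than 2
table(size(toy_rules))
```

Applied to the real data, the same `minlen = 2` setting would drop the size-1 rules and leave the remaining rules unchanged.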
rules_df <- as(rules, "data.frame")
## Warning: Unknown control parameters: type
## Available control parameters (with default values):
## layout = stress
## circular = FALSE
## ggraphdots = NULL
## edges = <environment>
## nodes = <environment>
## nodetext = <environment>
## colors = c("#EE0000FF", "#EEEEEEFF")
## engine = ggplot2
## max = 100
## verbose = FALSE
Plotting a scatter plot of the Rules
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
From the plot above, we can observe that the most reliable rules for predicting relationships among GDP, FDI, inflation, and unemployment, those with the highest probability of being correct, are clustered in the top right-hand corner of the plot.
Inspecting Rules Sorted by Lift
cat("Top 10 Association Rules:\n")
## Top 10 Association Rules:
inspect(sort(rules, by="lift")[1:10])
## lhs rhs support confidence coverage lift count
## [1] {Medium Unemployment} => {Medium FDI} 0.06976744 1.0000000 0.06976744 1.048780 6
## [2] {Low Inflation,
## Medium Unemployment} => {Medium FDI} 0.05813953 1.0000000 0.05813953 1.048780 5
## [3] {Low GDP,
## Medium Unemployment} => {Medium FDI} 0.06976744 1.0000000 0.06976744 1.048780 6
## [4] {Low GDP,
## Low Inflation,
## Medium Unemployment} => {Medium FDI} 0.05813953 1.0000000 0.05813953 1.048780 5
## [5] {Medium Unemployment} => {Low GDP} 0.06976744 1.0000000 0.06976744 1.011765 6
## [6] {Medium FDI} => {Low GDP} 0.95348837 1.0000000 0.95348837 1.011765 82
## [7] {Low GDP} => {Medium FDI} 0.95348837 0.9647059 0.98837209 1.011765 82
## [8] {Low Inflation,
## Medium Unemployment} => {Low GDP} 0.05813953 1.0000000 0.05813953 1.011765 5
## [9] {Medium FDI,
## Medium Unemployment} => {Low GDP} 0.06976744 1.0000000 0.06976744 1.011765 6
## [10] {Low Unemployment,
## Medium FDI} => {Low GDP} 0.86046512 1.0000000 0.86046512 1.011765 74
Rule 1 links Medium Unemployment to Medium FDI with 100% confidence, which seems reasonable, since employment is driven by capital, which in turn comes from investment. However, the association is weak: the lift is only about 1.049, barely above 1, which confirms a positive relationship but does not indicate a major impact. Medium FDI already appears in 82 of the 86 transactions, so almost any antecedent predicts it with high confidence.
Moreover, rule 6 suggests that FDI is not one of the most important drivers of GDP growth: 95.35% of cases combine Low GDP with Medium FDI, which indicates that, while Medium FDI is most often received by countries with Low GDP, investment does not automatically lead to economic growth.
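To make the lift values easier to interpret, the sketch below recomputes them by hand for rules 1 and 6, using only the counts shown in the outputs above. Lift is the rule's confidence divided by the overall support of its right-hand side, so a value near 1 means the left-hand side adds almost no information. The variable names are my own.

```r
# Counts taken from the transaction summary and the rule table above
n            <- 86  # total number of transactions
n_medium_fdi <- 82  # transactions containing "Medium FDI"
n_low_gdp    <- 85  # transactions containing "Low GDP"
n_med_unemp  <- 6   # transactions containing "Medium Unemployment"
n_rule1      <- 6   # transactions with Medium Unemployment and Medium FDI
n_rule6      <- 82  # transactions with Medium FDI and Low GDP

# Rule 1: {Medium Unemployment} => {Medium FDI}
conf1 <- n_rule1 / n_med_unemp
lift1 <- conf1 / (n_medium_fdi / n)

# Rule 6: {Medium FDI} => {Low GDP}
conf6 <- n_rule6 / n_medium_fdi
lift6 <- conf6 / (n_low_gdp / n)

round(c(rule1 = lift1, rule6 = lift6), 6)
```

Both values agree with the lift column reported by `inspect()` (1.048780 and 1.011765), and both stay close to 1 because the right-hand-side items occur in nearly every transaction.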