Forecasting the foreign exchange market is difficult because exchange rates are affected by several economic factors. This study focuses on forecasting the Malaysian Ringgit (MYR) exchange rate against the U.S. Dollar (USD) using supervised regression and classification machine learning models. The aims of this project are to identify the economic variables that have the greatest impact on the USD/MYR exchange rate and to predict the exchange rate. The models are also compared, and the best one is chosen to predict the USD/MYR exchange rate. Among the regression models, multiple regression, with low Root Mean Square Error (RMSE) and Mean Absolute Error (MAE), was found to perform best. Among the classifiers, Random Forest outperformed K-Nearest Neighbor and Naive Bayes, with accuracy rates of 1, 0.9756, and 0.9756, respectively.
In the early part of the Covid-19 outbreak, the Malaysian Ringgit weakened from 4.02 in December 2020 to 4.20 in March 2022 owing to the uncertain global economic outlook. As economies reopened and geopolitical risk from the Russia-Ukraine war triggered higher inflation, the US Federal Reserve raised its interest rate from 0% in March 2022 to 3.25% in October 2022. The Fed’s monetary tightening led to a significant increase in the value of the US dollar and a direct devaluation of other currencies. As a result, the MYR weakened further, past 4.70, within seven months.
Because exchange rate changes may cause uncertainty and risk aversion, they make managing the country’s reserves more difficult for the central bank. Exchange rate movements also affect the profitability of exporters and importers: a strong domestic currency hinders exports and lowers import prices, whereas a weak domestic currency favours exports whilst raising the cost of imports.
Since 2022 is a post-pandemic year, the pace of monetary policy normalisation differs between countries, primarily to strike a balance between stable economic growth, currency volatility, and inflation objectives. Thus, this project seeks to study how various macroeconomic variables contribute to the performance of the MYR exchange rate, using regression and classification models.
The Cross-Industry Standard Process for Data Mining (CRISP-DM) was used as the method for this project. It was chosen because CRISP-DM could give an overview of the data mining life cycle, which has six steps that will be explained below.
Data analysis and pattern recognition are key tasks in the financial industry. For example, prior to making an investment decision, investors seek to obtain as much information as possible. As with other financial market data, foreign exchange rates can be forecast by analysing historical exchange rate series for patterns.
Recent declines in the ringgit and other currencies can be attributed to the strengthening of the US dollar, which has been spurred by a rise in US interest rates. An economist predicts that the ringgit will continue to decline in the near future due to the rising performance of the US dollar. Following CRISP-DM, the primary variables that influence the USD to MYR exchange rate are identified, such as interest rates, inflation, and global trade developments.
In light of this, the objective of this research is to develop a model that can predict the USD/MYR exchange rate. Such a model can be useful for enterprises that routinely conduct foreign business, for individuals planning to visit Malaysia or make investments, and for the formulation of monetary policy (Jong, 2016; Khan, 2018; Fischer, 2015).
From October 2005 through October 2022, monthly data on the Malaysian currency, i.e., the Ringgit exchange rate against the US Dollar, were gathered from Bloomberg. The year 2005 was chosen because, between 1998 and 2005, the Malaysian government pegged the Ringgit at 3.80 to the US Dollar. Consequently, it is not advisable to incorporate currency data prior to 2005. The variables considered to have a significant relationship with the exchange rate are the Overnight Policy Rate (OPR), net foreign flow of bond investment into Malaysia, Malaysia Foreign Reserve, Brent Crude Oil Price, Malaysia Trade Balance, Malaysia Consumer Price Index (CPI), United States CPI, and three market data series: the 10-year Malaysia Government Bond (MGS) yield, the Chinese Yuan (USD/CNY) exchange rate, and the USD currency index.
The Overnight Policy Rate (OPR) is the benchmark interest rate established as the reference rate for overnight interbank transactions by the central bank of Malaysia, Bank Negara Malaysia (BNM). It serves as a benchmark for other interest rates in the country, such as deposit and lending rates. BNM determines the OPR through its monetary policy operations based on current market conditions and anticipated inflation. It is intended to meet the price stability and financial stability objectives of the central bank’s monetary policy.
Net foreign flow of bond investment into Malaysia refers to the amount of money that flows into or out of the nation as bond investments. A positive net flow implies that more foreign capital is entering Malaysia for bond investments than is leaving the nation. A negative net flow shows the reverse, namely that more foreign funds are leaving Malaysia than are entering.
Malaysia Foreign Reserve refers to the amount of foreign currency assets kept by Bank Negara Malaysia (BNM) to support domestic monetary policy and regulate the exchange rate. Gold, foreign currency deposits, and foreign government securities comprise these assets.
Brent Crude Oil Price is a benchmark for global oil prices and the price of a particular mix of crude oil that is used as a standard by a significant number of oil producing nations.
Malaysia’s Trade Balance represents the difference between its exports and imports. A positive trade balance, often known as a surplus, shows that the value of exports exceeds the value of imports. A negative trade balance, or deficit, means that the value of imports exceeds the value of exports.
A government bond is a financial security issued by the Malaysian government. It resembles a loan made by an investor to the government, with the understanding that the interest and principal will be paid back later. These bonds are generally used by the government to raise money for different programmes and projects, and because of the issuer’s creditworthiness they are frequently seen as reasonably safe investments.
The US Dollar Index (USDX, DXY, DX) is an indicator of how much the US dollar is worth in comparison to a basket of other currencies. The index is frequently used as a benchmark for other financial instruments since it is designed to give a broad indicator of the worth of the US dollar internationally. The index is based on the exchange rates of six important international currencies: the Euro, Japanese Yen, British Pound, Canadian Dollar, Swedish Krona, and Swiss Franc. It is computed by the ICE (Intercontinental Exchange).
# install.packages('tidyverse',repos = "http://cran.us.r-project.org")
# install.packages('parsedate',repos = "http://cran.us.r-project.org")
# install.packages('Hmisc')
library('lubridate')
library('dplyr')
library('parsedate')
library('tidyr')
library('ggplot2')
library('RColorBrewer')
library('Hmisc')
library('caTools')
library('caret')
library('broom')
library('klaR')
library('randomForest')
library('class')
library('mlbench')
library('pls')
library('pROC')
data
glimpse(data)
## Rows: 209
## Columns: 13
## $ Date <chr> "31/10/2005", "30/11/2005", "31/12/2005",…
## $ MalaysiaPolicyRate <dbl> 2.70, 3.00, 3.00, 3.00, 3.00, 3.25, 3.25,…
## $ USPolicyRate <dbl> 3.75, 4.00, 4.25, 4.50, 4.50, 4.50, 4.75,…
## $ MalaysiaNetForeignBondFlow <dbl> -1510.7, -1104.9, 70.6, -84.6, -84.6, 139…
## $ MalaysiaForeignReserve <dbl> 77.07, 73.07, 70.50, 71.29, 71.29, 72.20,…
## $ BrentCrudeOilPrice <dbl> 58.10, 55.05, 58.98, 65.99, 65.99, 61.76,…
## $ MalaysiaExternalTradeBalance <dbl> 10.62, 9.10, 9.77, 9.03, 9.03, 7.68, 9.91…
## $ MalaysiaCPI <dbl> 3.3, 3.5, 3.5, 3.2, 3.2, 3.2, 4.8, 4.6, 3…
## $ USCPI <dbl> 4.3, 3.5, 3.4, 4.0, 4.0, 3.6, 3.4, 3.5, 4…
## $ USDCurrencyIndex <dbl> 90.070, 91.570, 91.170, 88.960, 88.960, 9…
## $ CNYExchangeRate <dbl> 8.0845, 8.0804, 8.0702, 8.0608, 8.0608, 8…
## $ Malaysia10YearGovBondYield <dbl> 4.180, 4.245, 4.189, 4.112, 4.112, 4.120,…
## $ MYRExchangeRate <dbl> 3.7750, 3.7775, 3.7795, 3.7505, 3.7505, 3…
The dimension of the data is 209 x 13.
Date is stored as character data.
The other variables are continuous numerical data.
summary(data)
## Date MalaysiaPolicyRate USPolicyRate
## Length:209 Min. :-3.250 Min. :-2.125
## Class :character 1st Qu.: 2.500 1st Qu.: 0.125
## Mode :character Median : 3.000 Median : 0.125
## Mean : 2.828 Mean : 1.199
## 3rd Qu.: 3.250 3rd Qu.: 1.875
## Max. : 3.500 Max. : 5.250
##
## MalaysiaNetForeignBondFlow MalaysiaForeignReserve BrentCrudeOilPrice
## Min. :-23042.8 Min. : 70.5 Min. : 22.74
## 1st Qu.: -1152.7 1st Qu.: 96.7 1st Qu.: 57.54
## Median : 1393.6 Median : 104.2 Median : 71.46
## Mean : 826.4 Mean : 116.2 Mean : 77.05
## 3rd Qu.: 3240.4 3rd Qu.: 120.3 3rd Qu.:101.01
## Max. : 10259.0 Max. :1000.7 Max. :139.83
##
## MalaysiaExternalTradeBalance MalaysiaCPI USCPI
## Min. :-3.630 Min. :-2.900 Min. :-2.100
## 1st Qu.: 7.853 1st Qu.: 1.400 1st Qu.: 1.300
## Median : 9.510 Median : 2.200 Median : 2.000
## Mean :10.644 Mean : 2.222 Mean : 2.396
## 3rd Qu.:11.812 3rd Qu.: 3.300 3rd Qu.: 3.000
## Max. :31.710 Max. : 8.500 Max. : 9.100
## NA's :3
## USDCurrencyIndex CNYExchangeRate Malaysia10YearGovBondYield MYRExchangeRate
## Min. : 71.80 Min. : 6.054 Min. :2.551 Min. : 2.961
## 1st Qu.: 80.20 1st Qu.: 6.356 1st Qu.:3.560 1st Qu.: 3.260
## Median : 88.66 Median : 6.691 Median :3.873 Median : 3.654
## Mean : 88.16 Mean : 7.413 Mean :3.808 Mean : 3.880
## 3rd Qu.: 95.79 3rd Qu.: 6.890 3rd Qu.:4.110 3rd Qu.: 4.149
## Max. :112.12 Max. :79.527 Max. :5.033 Max. :41.842
## NA's :3
During this phase, the data is cleansed, transformed, and formatted so that it may be utilised in the modelling phase. These include fixing errors, imputing missing values, and eliminating outliers. The ultimate objective of this phase is to guarantee that the data are accurate and consistent so that they may be utilised effectively in subsequent phases of the CRISP-DM process. Given below are the steps.
Duplicate data are records that unintentionally repeat another record.
Duplicates are simple to identify and typically arise during the transfer
of data between systems.
data[duplicated(data) == TRUE,]
Result: Four (4) duplicates are found. Since the dataset was a
monthly dataset, it is appropriate to remove the duplicated data, as
each month’s data should be unique.
Step: Remove the duplicates
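# distinct() drops exact duplicate rows, so a separate unique() call is redundant
data1 <- distinct(data)
# Verify that no duplicates remain (expected: zero rows)
data1[duplicated(data1), ]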
As a result, all duplicates are removed.
The data overview reveals that the ‘Date’ column contains dates written in a variety of forms, all of which need to be standardised.
dates<-unlist(data1['Date'])
dates<-as.Date(parse_date(dates))
data2<-data.frame(data1)
data2['Date']<-dates
As a result, the ‘date’ column was standardised.
Missing data, or missing values, arise when a variable in an
observation contains no data value. Commonly occurring, missing data can
have a substantial impact on the findings that can be derived from the
data.
summary(data2)
## Date MalaysiaPolicyRate USPolicyRate
## Min. :2005-10-31 Min. :-3.250 Min. :-2.125
## 1st Qu.:2010-01-31 1st Qu.: 2.700 1st Qu.: 0.125
## Median :2014-04-30 Median : 3.000 Median : 0.125
## Mean :2014-04-30 Mean : 2.833 Mean : 1.198
## 3rd Qu.:2018-07-31 3rd Qu.: 3.250 3rd Qu.: 1.875
## Max. :2022-10-31 Max. : 3.500 Max. : 5.250
##
## MalaysiaNetForeignBondFlow MalaysiaForeignReserve BrentCrudeOilPrice
## Min. :-23042.8 Min. : 70.50 Min. : 22.74
## 1st Qu.: -1152.7 1st Qu.: 96.71 1st Qu.: 58.10
## Median : 1398.1 Median : 104.20 Median : 71.70
## Mean : 846.2 Mean : 116.51 Mean : 77.26
## 3rd Qu.: 3240.4 3rd Qu.: 120.29 3rd Qu.:101.01
## Max. : 10259.0 Max. :1000.72 Max. :139.83
##
## MalaysiaExternalTradeBalance MalaysiaCPI USCPI
## Min. :-3.630 Min. :-2.900 Min. :-2.100
## 1st Qu.: 7.902 1st Qu.: 1.400 1st Qu.: 1.300
## Median : 9.555 Median : 2.200 Median : 2.000
## Mean :10.632 Mean : 2.232 Mean : 2.409
## 3rd Qu.:11.812 3rd Qu.: 3.300 3rd Qu.: 3.000
## Max. :31.710 Max. : 8.500 Max. : 9.100
## NA's :3
## USDCurrencyIndex CNYExchangeRate Malaysia10YearGovBondYield MYRExchangeRate
## Min. : 71.80 Min. : 6.054 Min. :2.551 Min. : 2.961
## 1st Qu.: 80.18 1st Qu.: 6.356 1st Qu.:3.572 1st Qu.: 3.260
## Median : 88.66 Median : 6.691 Median :3.877 Median : 3.654
## Mean : 88.17 Mean : 7.421 Mean :3.818 Mean : 3.884
## 3rd Qu.: 95.87 3rd Qu.: 6.890 3rd Qu.:4.110 3rd Qu.: 4.149
## Max. :112.12 Max. :79.527 Max. :5.033 Max. :41.842
## NA's :3
The above is a summary of the data after duplicates were removed and the
date column was standardised. The number of observations has been reduced
to 205. Six values are missing in total: three in
MalaysiaExternalTradeBalance and three in USDCurrencyIndex.
The following are the steps to identify the number of rows containing missing values.
any(is.na(data2))
## [1] TRUE
nrow(data2[!complete.cases(data2),])
## [1] 6
which(is.na(data2['MalaysiaExternalTradeBalance']))
## [1] 36 111 139
which(is.na(data2['USDCurrencyIndex']))
## [1] 20 39 82
Six rows contain missing values, which indicates that every missing value
is located in a distinct observation; their row indices are listed above.
The loop below additionally counts empty strings in each column.
for (i in names(data2)) {
  print(paste(i, sum(data2[i] == "")))
}
## [1] "Date NA"
## [1] "MalaysiaPolicyRate 0"
## [1] "USPolicyRate 0"
## [1] "MalaysiaNetForeignBondFlow 0"
## [1] "MalaysiaForeignReserve 0"
## [1] "BrentCrudeOilPrice 0"
## [1] "MalaysiaExternalTradeBalance NA"
## [1] "MalaysiaCPI 0"
## [1] "USCPI 0"
## [1] "USDCurrencyIndex NA"
## [1] "CNYExchangeRate 0"
## [1] "Malaysia10YearGovBondYield 0"
## [1] "MYRExchangeRate 0"
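Result: No values are empty. NA is printed where the comparison with an empty string itself yields NA, namely for the Date column and for the two columns that still contain missing values at this stage.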
Box plots allow rapid identification of the median, the dispersion of the
data, and signs of skewness. Because a box plot divides the data into
sections that each contain around 25% of the observations and summarises
them visually, it can condense large amounts of information.
For USDCurrencyIndex, no outliers are observed in the boxplot and the
distribution appears approximately symmetric.
Thus, the missing values can be imputed using the mean.
data3<-data.frame(data2)
data3['USDCurrencyIndex']<-with(data3, impute(data3['USDCurrencyIndex'], mean))
data3[c(20,39,82),'USDCurrencyIndex']
## [1] 88.16865 88.16865 88.16865
Result: Mean value is imputed to all those 3 missing cells.
For MalaysiaExternalTradeBalance, the boxplot shows that the variable is
positively skewed.
The median handles a skewed distribution more effectively than the mean,
so median imputation is used.
data3['MalaysiaExternalTradeBalance']<-with(data3, impute(data3['MalaysiaExternalTradeBalance'], median))
data3[c(36,111,139),'MalaysiaExternalTradeBalance']
## [1] 9.555 9.555 9.555
Result: Median value is imputed to all missing cells.
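The difference between mean and median imputation can be illustrated with base R alone. The sketch below uses a small made-up vector rather than the project data, and the helper `impute_with()` is hypothetical; it illustrates the idea and is not the `Hmisc::impute` call used above.

```r
# Hypothetical helper: replace NA entries of x with a summary statistic
# computed from the observed (non-missing) values
impute_with <- function(x, stat = mean) {
  x[is.na(x)] <- stat(x, na.rm = TRUE)
  x
}

# Made-up, right-skewed example vector with one missing value
x <- c(1, 2, 2, 3, 50, NA)

mean_filled   <- impute_with(x, mean)    # the mean is pulled up by the extreme value 50
median_filled <- impute_with(x, median)  # the median stays near the bulk of the data

print(mean_filled[6])
print(median_filled[6])
```

In this toy case the mean-imputed value lies far from most observations, which is the reason the median is preferred for the skewed variable.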
Outliers can have a substantial effect on the outcomes of a data
analysis or model, hence it is vital to determine whether the numbers
identified as outliers are real or inaccurate. Real outliers, also known
as valid outliers, are numbers that are truly distinct from the rest of
the data and may represent a significant trend or piece of information.
Incorrect outliers, also known as invalid outliers, are the result of
data collecting or processing errors, such as measurement or data entry
mistakes. If these values are not recognised and rectified, they might
distort the outcomes of the study or model, leading to erroneous
conclusions or subpar performance. To ensure that the analysis or model
is based on accurate and dependable data, it is essential to determine
whether the outliers are genuine or erroneous.
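As a complement to visual inspection, the boxplot rule (points lying more than 1.5 times the interquartile range beyond the hinges) can be applied programmatically with base R's `boxplot.stats()`. The vector below is made up for illustration and the object names are hypothetical.

```r
# Made-up values: most lie near 100, one is implausibly large
x <- c(96.7, 104.2, 101.5, 99.8, 120.3, 1000.7)

# boxplot.stats() returns the points beyond 1.5 * IQR of the hinges in $out
flagged      <- boxplot.stats(x)$out
flagged_rows <- which(x %in% flagged)

print(flagged)
print(flagged_rows)
```

The rule only flags candidates; whether a flagged value is genuine or erroneous still requires domain knowledge, as in the checks that follow.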
Since the beginning of 2004, the Central Bank of Malaysia (Bank Negara Malaysia, BNM) has never set a negative Overnight Policy Rate (OPR). Therefore, only positive values should be recorded in this variable; however, a single negative value has been observed, which needs to be handled.
Source: https://www.bnm.gov.my/monetary-stability/opr-decisions
data4<-data.frame(data3)
data4['MalaysiaPolicyRate']<-abs(data4['MalaysiaPolicyRate'])
In addition, the boxplot shows that USPolicyRate contains a negative value, although the Federal Reserve, the central bank of the United States, has never set a negative rate. This value should therefore be handled as well.
Source: https://www.federalreserve.gov/monetarypolicy/openmarket.htm
data4['USPolicyRate']<-abs(data4['USPolicyRate'])
Result: All values in MalaysiaPolicyRate and USPolicyRate are
converted to absolute (POSITIVE) values.
Since the extreme values of MalaysiaNetForeignBondFlow are large in magnitude, it is necessary to examine whether the maximum and minimum values are typical.
top_n(data4['MalaysiaNetForeignBondFlow'], 5)
## Selecting by MalaysiaNetForeignBondFlow
top_n(data4['MalaysiaNetForeignBondFlow'], -5)
## Selecting by MalaysiaNetForeignBondFlow
Result: The maximum of 10258.963 appears reasonable because the next
five largest values are also close to 10000. However, the next-smallest
value is -12465.554, a gap of roughly 10,000, so a minimum of -23042.756
is rather unusual.
Step: Check the values of 3 rows before and after the minimum.
min_row<-apply(data4['MalaysiaNetForeignBondFlow'],2,which.min)
data4[(min_row-3):(min_row+3),'MalaysiaNetForeignBondFlow']
## [1] -4543.296 -2095.369 -7446.066 -23042.756 5719.764 8911.200 -903.187
Result: The values do not seem to have correlation with date as they
fluctuate every month. We will keep this value as it is for now.
The boxplot of MalaysiaForeignReserve shows value(s) larger than 1000,
whereas all other values are less than 200.
Step: Find the location of the outlier(s).
fr_outlier_row<-which(data4['MalaysiaForeignReserve'] > 200)
data4[fr_outlier_row, 'MalaysiaForeignReserve']
## [1] 1000.21 1000.72
Result: A total of two (2) outliers are found in
‘MalaysiaForeignReserve’ with value > 1000.
Step: Replace outliers with NA.
The figure above shows the boxplot of MalaysiaForeignReserve after
replacing outliers with NA.
Since the values are skewed, median imputation is applied to the NA values.
data4[fr_outlier_row,'MalaysiaForeignReserve']<-median(data4[,'MalaysiaForeignReserve'], na.rm=TRUE)
data4[fr_outlier_row,'MalaysiaForeignReserve']
## [1] 103.9 103.9
Step: Find the location of the outlier(s) in CNYExchangeRate.
fivenum(data4[,'CNYExchangeRate'])
## [1] 6.0543 6.3561 6.6915 6.8904 79.5270
cer_outlier_row<-which(data4['CNYExchangeRate'] > 10)
data4[cer_outlier_row, 'CNYExchangeRate']
## [1] 79.527 73.919
Result: The outliers have values of more than 70.
Step: Check 3 rows before and after each outlier.
data4[(cer_outlier_row[1]-3):(cer_outlier_row[1]+3),'CNYExchangeRate']
## [1] 8.0175 7.9943 7.9690 79.5270 7.9040 7.8794 7.8331
data4[(cer_outlier_row[2]-3):(cer_outlier_row[2]+3),'CNYExchangeRate']
## [1] 7.5450 7.5105 7.4627 73.9190 7.3037 7.1818 7.1108
Result: The decimal point appears to be misplaced in each outlier.
Consequently, the data can be corrected directly by shifting the decimal
point one place to the left. Imputation is not required for this
variable.
data4[cer_outlier_row[1],'CNYExchangeRate']<-7.9527
data4[cer_outlier_row[2],'CNYExchangeRate']<-7.3919
data4[cer_outlier_row,'CNYExchangeRate']
## [1] 7.9527 7.3919
Step: Find the location of the outlier(s) in MYRExchangeRate.
fivenum(data4[,'MYRExchangeRate'])
## [1] 2.9610 3.2595 3.6541 4.1490 41.8420
mer_outlier_row<-which(data4['MYRExchangeRate'] > 10)
data4[mer_outlier_row, 'MYRExchangeRate']
## [1] 41.842
Result: One outlier is detected with a value over 40.
Step: Check 3 rows before and after the outlier.
data4[(mer_outlier_row[1]-3):(mer_outlier_row[1]+3),'MYRExchangeRate']
## [1] 4.0652 4.1090 4.1383 41.8420 4.1842 4.1335 4.0953
Result: As with CNYExchangeRate, the outlier can be corrected directly
by shifting the decimal point one place to the left.
Imputation is not required here either.
data4[mer_outlier_row[1],'MYRExchangeRate']<-4.1842
data4[mer_outlier_row,'MYRExchangeRate']
## [1] 4.1842
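The manual corrections above could also be expressed as a rule. The following is a minimal sketch, assuming that any rate above a plausibility threshold is exactly ten times too large. The threshold of 10 mirrors the `which(... > 10)` filter above, while the helper name `fix_decimal_shift()` is hypothetical and the example vector reuses a few of the MYRExchangeRate values printed earlier.

```r
# Hypothetical helper: divide implausibly large rates by 10
# (assumes the only error is a decimal point shifted one place to the right)
fix_decimal_shift <- function(x, threshold = 10) {
  shifted <- !is.na(x) & x > threshold
  x[shifted] <- x[shifted] / 10
  x
}

rates <- c(4.0652, 4.1090, 4.1383, 41.8420, 4.1335)
fixed <- fix_decimal_shift(rates)
print(fixed)
```

Such a rule is only appropriate when the neighbouring values confirm the decimal-shift explanation, as was checked above for both variables.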
Most, if not all, of the numerical variables are on different scales.
Therefore, it is necessary to normalise them to a common range for more
accurate comparisons.
## Date MalaysiaPolicyRate USPolicyRate MalaysiaNetForeignBondFlow
## 1 2005-10-31 0.5428571 0.7073171 0.6465749
## 2 2005-11-30 0.7142857 0.7560976 0.6587605
## 3 2005-12-31 0.7142857 0.8048780 0.6940589
## 4 2006-01-31 0.7142857 0.8536585 0.6893985
## 5 2006-02-28 0.8571429 0.8536585 0.7339218
## 6 2006-03-31 0.8571429 0.9024390 0.6868191
## MalaysiaForeignReserve BrentCrudeOilPrice MalaysiaExternalTradeBalance
## 1 0.09266573 0.3019899 0.4032258
## 2 0.03624824 0.2759416 0.3602151
## 3 0.00000000 0.3095055 0.3791737
## 4 0.01114245 0.3693740 0.3582343
## 5 0.02397743 0.3332479 0.3200340
## 6 0.04513399 0.3686908 0.3831353
## MalaysiaCPI USCPI USDCurrencyIndex CNYExchangeRate
## 1 0.5438596 0.5714286 0.4531316 1.0000000
## 2 0.5614035 0.5000000 0.4903386 0.9979805
## 3 0.5614035 0.4910714 0.4804167 0.9929564
## 4 0.5350877 0.5446429 0.4255984 0.9883263
## 5 0.5350877 0.5089286 0.4541238 0.9781795
## 6 0.6754386 0.4910714 0.4446980 0.9668506
## Malaysia10YearGovBondYield MYRExchangeRate
## 1 0.6563255 0.4610070
## 2 0.6825141 0.4624228
## 3 0.6599517 0.4635555
## 4 0.6289283 0.4471314
## 5 0.6321515 0.4270261
## 6 0.6365834 0.4089596
Result: All numerical data, including the feature and target variables,
have been normalised; only Date is left unchanged.
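The normalisation code itself is not shown. The output is consistent with min-max scaling to the range [0, 1] (for example, the largest CNYExchangeRate maps to 1 and the smallest MalaysiaForeignReserve maps to 0), so the sketch below assumes that method. The helper `min_max()` and the small data frame are hypothetical.

```r
# Hypothetical helper: rescale a numeric vector to the range [0, 1]
min_max <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Made-up frame with a date column and two numeric columns on different scales
df <- data.frame(
  Date  = as.Date(c("2005-10-31", "2005-11-30", "2005-12-31")),
  Rate  = c(2.70, 3.00, 3.50),
  Index = c(90.07, 91.57, 88.96)
)

# Apply the scaling to every numeric column and leave Date untouched
num_cols <- sapply(df, is.numeric)
df[num_cols] <- lapply(df[num_cols], min_max)
print(df)
```

Because the minimum and maximum are taken from the data at hand, the scaled values depend on the sample period.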
A line graph is frequently regarded as the most effective approach to
depict time series data since it efficiently conveys how the data has
changed over time. The y-axis shows the relevant variable while the
x-axis shows time. It is simple to spot patterns and trends in the data,
such as upward or downward trends, fluctuations, and seasonality,
because the data points are connected by lines.
Because a line graph displays how the data has changed over time and how
various data points relate to one another, it is particularly helpful
for time series data. Line graphs may also be annotated with other
features, such as lines for the mean, median, and standard deviation,
which can be useful for seeing patterns and trends that are not always
obvious. Other chart styles, such as bar charts or scatter plots, do not
depict the continuity of the data over time and so might not be the
ideal choice for time series data.
Line graphs are plotted to visualise the trend of each feature against
MYRExchangeRate.
Note: Black line represents the feature whilst red dotted line
represents the MYRExchangeRate.
From the line charts above, it is observed that USPolicyRate,
USDCurrencyIndex and CNYExchangeRate follow a similar trend to
MYRExchangeRate: when one decreases, the other decreases, and when one
increases, the other increases as well.
On the other hand, BrentCrudeOilPrice, MalaysiaForeignReserve,
MalaysiaCPI and USCPI show somewhat opposite trends to MYRExchangeRate.
In its widest definition, correlation is a measure of the relationship
between variables. When two variables are correlated, a change in the
magnitude of one is usually accompanied by a change in the magnitude of
the other.
Step: Plot a cross-correlation graph to measure the relationships
between the time series variables and rank them.
The bar chart above ranks the correlations between pairs of variables
from highest to lowest. Blue bars indicate positive correlation whereas
pink bars indicate negative correlation.
The dependent variable MYRExchangeRate shows high correlation with
USDCurrencyIndex and BrentCrudeOilPrice.
The remaining high correlations are between independent variables:
CNYExchangeRate with USPolicyRate and MalaysiaForeignReserve, and
MalaysiaForeignReserve with BrentCrudeOilPrice.
From the correlation heatmap, it is observed that USDCurrencyIndex has a
very strong correlation with MYRExchangeRate, followed by
BrentCrudeOilPrice and MalaysiaForeignReserve.
This observation is consistent with the cross-correlation bar chart
plotted earlier.
Preliminary estimation:
1. USDCurrencyIndex and BrentCrudeOilPrice are significant.
2. MalaysiaForeignReserve can be eliminated.
Correlations between variables may change as the market changes, hence correlation graphs are plotted for sub-periods of roughly five years, from left to right: 2005-2010, 2011-2015 and 2016-2022.
2005-2010: In addition to the overall findings, CNYExchangeRate and
MalaysiaNetForeignBondFlow have a higher correlation.
2011-2015: The policy rate and CPI have a high correlation, but this
correlation is regarded as coincidental because the USD/MYR exchange
rate was stable.
2016-2022: In addition to the overall findings, CNYExchangeRate has a
relatively high correlation.
##                        Variable     overall Y2005toY2010 Y2011toY2015 Y2016toy2022
## 1            MalaysiaPolicyRate -0.27705889   0.06146647    0.7587431   -0.1822738
## 2                  USPolicyRate  0.07260433   0.46917321    0.5307193    0.1110615
## 3    MalaysiaNetForeignBondFlow -0.21952723  -0.42486821   -0.1033952   -0.3152069
## 4        MalaysiaForeignReserve -0.47326473  -0.77783658   -0.9322529   -0.1212143
## 5            BrentCrudeOilPrice -0.64499891  -0.69070660   -0.9286770    0.1762821
## 6  MalaysiaExternalTradeBalance  0.41545170  -0.06799965    0.0968636    0.2025899
## 7                   MalaysiaCPI -0.12473945   0.14096291    0.1242865    0.3419343
## 8                         USCPI  0.19788252  -0.00822469   -0.6974354    0.4162212
## 9              USDCurrencyIndex  0.91100701   0.70123283    0.9045400    0.8279759
## 10              CNYExchangeRate  0.15822691   0.67854585    0.2523939    0.5739093
## 11   Malaysia10YearGovBondYield -0.10315798   0.21818305    0.5717199    0.2441488
## 12              MYRExchangeRate  1.00000000   1.00000000    1.0000000    1.0000000
Schober, Boer and Schwarte (2018) stated that a correlation coefficient
of less than 0.3 is considered a weak correlation.
Therefore, from the results above, it is observed that
MalaysiaExternalTradeBalance, MalaysiaCPI, USCPI, USPolicyRate,
MalaysiaNetForeignBondFlow and Malaysia10YearGovBondYield have a weak
correlation with MYRExchangeRate.
These variables may not be important for prediction.
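The 0.3 cut-off can be applied mechanically to a vector of correlation coefficients. The sketch below uses four of the overall coefficients from the table above, rounded to three decimals; the object names `overall` and `weak` are illustrative only.

```r
# Overall correlations with MYRExchangeRate, rounded from the table above
overall <- c(
  USDCurrencyIndex   =  0.911,
  BrentCrudeOilPrice = -0.645,
  MalaysiaCPI        = -0.125,
  USPolicyRate       =  0.073
)

# A variable is treated as weakly correlated when |r| < 0.3
weak <- names(overall)[abs(overall) < 0.3]
print(weak)
```

Taking the absolute value matters here, since a strongly negative coefficient such as that of BrentCrudeOilPrice is not weak.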
In the CRISP-DM technique, modelling refers to the stage in which a model is constructed and validated utilising the data gathered in prior processes. The objective of modelling in CRISP-DM is to construct a model that properly predicts the desired output based on the input variables. This process comprises selecting the right model type, training the model using a sample of the data, and assessing the performance of the model using techniques such as classification accuracy and mean absolute error (MAE). The model with the greatest performance is then used to forecast fresh data.
This project utilised supervised machine learning, which is categorised into regression and classification. The distinction between the two is the outcome they forecast.
In regression, the result is a continuous number; hence, the regression
output of this project is an estimate of the USD/MYR exchange rate for
the following month. The output of classification, on the other hand, is
a discrete value, such as a label identifying the class to which an
input belongs. In this project, classification is used to anticipate the
trend of the USD/MYR exchange rate.
Two regression models are implemented for prediction: simple linear regression and multiple regression.
The models differ in the number of independent variables employed to predict the dependent variable. Simple linear regression uses a single independent variable; in this project, the USD Currency Index is used to forecast the USD to MYR exchange rate.
Multiple regression, in contrast, uses several independent variables to predict the dependent variable, here the USD to MYR exchange rate. It can provide a more accurate model for predicting the Malaysian ringgit exchange rate because it incorporates a greater number of variables that may influence the exchange rate. However, multiple regression is more complicated since it requires more data and processing resources.
The following were the workings:
The dataset is first split into training set and test set in the ratio of 80:20.
data_r = data10
data_r_ori = data9
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
set.seed(123)
split = sample.split(data_r$MYRExchangeRate, SplitRatio = 0.8)
training_set_r = subset(data_r, split == TRUE)
test_set_r = subset(data_r, split == FALSE)
From the EDA result, it is observed that USDCurrencyIndex has a strong correlation with target variable as shown in the diagram below.
data_r %>%
as.data.frame() %>%
GGally::ggpairs()
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
Therefore, a simple linear regression using ‘USDCurrencyIndex’ is
created.
##
## Call:
## lm(formula = MYRExchangeRate ~ USDCurrencyIndex, data = training_set_r)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5695 -0.1019 0.0115 0.1392 0.3514
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.89843 0.03223 89.94 <2e-16 ***
## USDCurrencyIndex 1.97439 0.06873 28.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1907 on 162 degrees of freedom
## Multiple R-squared: 0.8359, Adjusted R-squared: 0.8349
## F-statistic: 825.2 on 1 and 162 DF, p-value: < 2.2e-16
The linear model is then used to predict the MYRExchangeRate.
The performance of the model is depicted in the graph below.
## `geom_smooth()` using formula = 'y ~ x'
Next, the test set is used to check the model’s performance.
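RMSE and MAE are named as the evaluation metrics for the regression models, but the evaluation code is not shown here. The following is a minimal sketch of how the two metrics can be computed for a fitted `lm` on held-out data. It uses simulated data rather than the project dataset, and all object names are hypothetical.

```r
set.seed(123)

# Simulated data: y depends linearly on x plus noise
n <- 100
sim <- data.frame(x = runif(n))
sim$y <- 3 + 2 * sim$x + rnorm(n, sd = 0.2)

# 80:20 split into training and test rows
train_idx <- sample(seq_len(n), size = 0.8 * n)
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Fit on the training rows, predict on the held-out rows
fit  <- lm(y ~ x, data = train)
pred <- predict(fit, newdata = test)

rmse <- sqrt(mean((test$y - pred)^2))  # penalises large errors more heavily
mae  <- mean(abs(test$y - pred))       # average absolute error

print(c(RMSE = rmse, MAE = mae))
```

RMSE is never smaller than MAE on the same predictions, so a large gap between the two points to a few large errors.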
Before performing multiple linear regression, it is necessary to confirm whether there is a linear correlation between the independent variables.
## [1] -0.01809105 0.08821480 0.73681104 -0.18899004 -0.35744013 -0.58294577
## [7] 0.21851721 -0.74850514 -0.33234109 0.03132114
## integer(0)
The result above shows no multicollinearity among the independent
variables.
Thus, a multiple linear regression model can be fitted to the data.
Backward elimination is used to select the predictors while training the
model.
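The code behind the multicollinearity check is not shown. One possible form is to compute the pairwise correlations between the predictors and list the pairs whose absolute value exceeds a cut-off. The sketch below assumes a cut-off of 0.8, which is a common rule of thumb and not necessarily the value used in this project; the data are simulated and the object names are hypothetical.

```r
set.seed(1)

# Simulated predictors: b is nearly a copy of a, c is independent of both
a <- rnorm(200)
predictors <- data.frame(a = a, b = a + rnorm(200, sd = 0.1), c = rnorm(200))

cor_mat <- cor(predictors)
cutoff  <- 0.8  # assumed threshold

# Look only at the upper triangle so that each pair is reported once
high <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
print(high_pairs)
```

An empty result from such a check corresponds to the case where no pair of predictors exceeds the chosen cut-off.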
##
## Call:
## lm(formula = MYRExchangeRate ~ BrentCrudeOilPrice + USDCurrencyIndex,
## data = training_set_r)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46769 -0.09419 0.01396 0.12825 0.47712
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.18703 0.06075 52.458 < 2e-16 ***
## BrentCrudeOilPrice -0.42635 0.07829 -5.446 1.9e-07 ***
## USDCurrencyIndex 1.74985 0.07559 23.149 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1758 on 161 degrees of freedom
## Multiple R-squared: 0.8614, Adjusted R-squared: 0.8597
## F-statistic: 500.4 on 2 and 161 DF, p-value: < 2.2e-16
We can use the image to compare the difference between the predicted
and actual values. Blue represents the predicted value, red represents
the actual value.
Three different classification algorithms are implemented to predict
whether MYRExchangeRate is high or low.
First, MYRExchangeRate is transformed into a binary categorical
variable: a rate above 3.8 is categorised as a high exchange rate,
whilst a rate of 3.8 or below is categorised as a low exchange rate.
The algorithms used for this project were K-Nearest Neighbors (KNN), Naive Bayes and Random Forest. The workings are shown below.
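data_c <- data9
# Label each month: rates of 3.8 or below are 'low', rates above 3.8 are 'high'
# (column 6 holds the numeric MYRExchangeRate)
data_c$MYRExchangeRate_cat <- ifelse(data_c[, 6] <= 3.8, 'low', 'high')
# Drop the original numeric target so that only the class label remains
data_c <- data_c %>% dplyr::select(-c(6))
data_c
# Count the observations in each class
as.data.frame(table(data_c$MYRExchangeRate_cat))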
It is verified that the two classes are balanced after the
transformation.
The dataset is then split again into a training set and a test set.
data_c$MYRExchangeRate_cat = factor(data_c$MYRExchangeRate_cat,levels = c("low","high"))
# Splitting the dataset into the Training set and Test set
set.seed(12771)
split = sample.split(data_c$MYRExchangeRate_cat, SplitRatio = 0.8)
training_set_c = subset(data_c, split == TRUE)
test_set_c = subset(data_c, split == FALSE)
K-Nearest Neighbors (KNN) is an instance-based, non-parametric
learning algorithm. It classifies new cases based on a similarity
measure between the new case and the stored examples. For instance, KNN
may predict the exchange rate class of a new observation from stored
observations with similar attributes.
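The idea of classifying by similarity can be sketched in a few lines of base R. This is an illustrative from-scratch version, not the `caret` implementation used in this study: the function `knn_predict`, the feature names `a` and `b`, and the toy values are all hypothetical, and Euclidean distance is assumed as the similarity measure.

```r
# Classify one new point by majority vote among its k nearest training points
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from new_x to every training row
  diffs <- sweep(as.matrix(train_x), 2, as.numeric(new_x))
  dists <- sqrt(rowSums(diffs^2))
  # Labels of the k closest training rows
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Most frequent label among those neighbours
  names(which.max(table(nearest)))
}

# Toy data (illustrative values only): two well-separated groups
train_x <- data.frame(a = c(1.0, 1.2, 0.9, 3.0, 3.1, 2.9),
                      b = c(1.0, 0.8, 1.1, 3.0, 2.9, 3.2))
train_y <- c("low", "low", "low", "high", "high", "high")

pred_low  <- knn_predict(train_x, train_y, c(1.1, 1.0), k = 3)
pred_high <- knn_predict(train_x, train_y, c(3.0, 3.0), k = 3)
print(c(pred_low, pred_high))
```

Because the vote depends on raw distances, features measured on very different scales are usually standardised before KNN is applied.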
classifier = train(MYRExchangeRate_cat~., data = training_set_c, method = "knn")
x_test_knn <- test_set_c[,1:5]
y_test_knn <- test_set_c[,6]
predictions_knn <- predict(classifier,x_test_knn)
confusionMatrix(predictions_knn,y_test_knn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction low high
## low 22 0
## high 1 18
##
## Accuracy : 0.9756
## 95% CI : (0.8714, 0.9994)
## No Information Rate : 0.561
## P-Value [Acc > NIR] : 1.684e-09
##
## Kappa : 0.9508
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9565
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9474
## Prevalence : 0.5610
## Detection Rate : 0.5366
## Detection Prevalence : 0.5366
## Balanced Accuracy : 0.9783
##
## 'Positive' Class : low
##
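The statistics in the KNN output above follow directly from the four cell counts. The sketch below recomputes them by hand in base R; the counts are copied from the confusion matrix, while the variable names are illustrative.

```r
# Counts copied from the KNN confusion matrix (rows = prediction, columns = reference)
cm <- matrix(c(22, 1, 0, 18), nrow = 2,
             dimnames = list(Prediction = c("low", "high"),
                             Reference  = c("low", "high")))
n <- sum(cm)

# 'low' is the positive class, as in the output above
accuracy    <- sum(diag(cm)) / n                        # (22 + 18) / 41
sensitivity <- cm["low", "low"] / sum(cm[, "low"])      # 22 / 23
specificity <- cm["high", "high"] / sum(cm[, "high"])   # 18 / 18
balanced_accuracy <- (sensitivity + specificity) / 2

# Cohen's kappa: agreement beyond what the row and column totals imply by chance
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity,
              balanced_accuracy = balanced_accuracy,
              kappa = cohen_kappa), 4))
```

The single misclassified observation (a true low rate predicted as high) is what lowers the sensitivity to 22/23 while the specificity stays at 1.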
Naive Bayes is a probabilistic algorithm that classifies data points
based on the likelihood that they belong to a certain class. It is a
straightforward method that assumes the features are independent of one
another given the class. Trained on historical data, it may be used to
anticipate future exchange rate changes.
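The classification rule can be sketched from scratch in base R. This is a simplified illustration rather than the `NaiveBayes` implementation used in this study: it assumes each numeric feature is normally distributed within each class, and the functions `gnb_fit` and `gnb_predict`, the feature names and the toy values are hypothetical.

```r
# Fit: per-class prior, plus per-class mean and standard deviation of each feature
gnb_fit <- function(x, y) {
  x <- as.matrix(x)
  classes <- sort(unique(y))
  list(
    classes = classes,
    prior = sapply(classes, function(cl) mean(y == cl)),
    mu    = sapply(classes, function(cl) colMeans(x[y == cl, , drop = FALSE])),
    sigma = sapply(classes, function(cl) apply(x[y == cl, , drop = FALSE], 2, sd))
  )
}

# Predict: pick the class with the largest log posterior
gnb_predict <- function(model, new_x) {
  new_x <- as.numeric(new_x)
  log_post <- sapply(model$classes, function(cl) {
    # Independence assumption: per-feature log densities are simply summed
    log(model$prior[[cl]]) +
      sum(dnorm(new_x, mean = model$mu[, cl], sd = model$sigma[, cl], log = TRUE))
  })
  model$classes[which.max(log_post)]
}

# Toy data (illustrative values only) with two features
train_x <- data.frame(a = c(1.0, 1.2, 0.9, 3.0, 3.1, 2.9),
                      b = c(1.0, 0.8, 1.1, 3.0, 2.9, 3.2))
train_y <- c("low", "low", "low", "high", "high", "high")

model <- gnb_fit(train_x, train_y)
nb_low  <- gnb_predict(model, c(1.1, 1.0))
nb_high <- gnb_predict(model, c(3.0, 3.0))
print(c(nb_low, nb_high))
```

Working with log densities avoids numerical underflow when many small probabilities are multiplied together.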
classifier = NaiveBayes(MYRExchangeRate_cat~., data = training_set_c)
x_test_nb <- test_set_c[,1:5]
y_test_nb <- test_set_c[,6]
predictions_nb <- predict(classifier,x_test_nb)
confusionMatrix(predictions_nb$class,y_test_nb)
## Confusion Matrix and Statistics
##
## Reference
## Prediction low high
## low 22 0
## high 1 18
##
## Accuracy : 0.9756
## 95% CI : (0.8714, 0.9994)
## No Information Rate : 0.561
## P-Value [Acc > NIR] : 1.684e-09
##
## Kappa : 0.9508
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9565
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9474
## Prevalence : 0.5610
## Detection Rate : 0.5366
## Detection Prevalence : 0.5366
## Balanced Accuracy : 0.9783
##
## 'Positive' Class : low
##
Random Forest is an ensemble learning method for classification, regression and other tasks. It operates by constructing a large number of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random Forest is well suited to cases in which several factors could influence the exchange rate, and its variable importance measures can help to identify the most influential ones.
classifier = randomForest(MYRExchangeRate_cat~., data = training_set_c)
x_test_rf <- test_set_c[,1:5]
y_test_rf <- test_set_c[,6]
predictions_rf <- predict(classifier,x_test_rf)
confusionMatrix(predictions_rf,y_test_rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction low high
## low 23 0
## high 0 18
##
## Accuracy : 1
## 95% CI : (0.914, 1)
## No Information Rate : 0.561
## P-Value [Acc > NIR] : 5.09e-11
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.000
## Specificity : 1.000
## Pos Pred Value : 1.000
## Neg Pred Value : 1.000
## Prevalence : 0.561
## Detection Rate : 0.561
## Detection Prevalence : 0.561
## Balanced Accuracy : 1.000
##
## 'Positive' Class : low
##
The most commonly used evaluation measures in regression are R-squared, prediction error rate, Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE). For this project, Adjusted R2, Akaike's Information Criterion (AIC), and the Bayesian Information Criterion (BIC) were also used as robust metrics for assessing the quality of the regression models and comparing them.
Dividing the residual standard error (RSE) by the mean of the outcome
variable gives the prediction error rate.
The smaller the prediction error rate,
the better the model’s performance.
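As a small worked illustration of these measures, the sketch below computes RMSE and MAE from a handful of made-up actual and predicted values, and the prediction error rate from a model fitted to R's built-in `cars` dataset. None of the numbers come from this study's data.

```r
# Illustrative actual and predicted values (not taken from the study's data)
actual    <- c(4.10, 4.25, 3.95, 4.40, 4.05)
predicted <- c(4.00, 4.30, 4.05, 4.20, 4.05)

errors <- actual - predicted
rmse <- sqrt(mean(errors^2))   # penalises large errors more heavily
mae  <- mean(abs(errors))      # average size of the errors
print(c(RMSE = rmse, MAE = mae))

# Prediction error rate = residual standard error / mean of the outcome,
# shown here on the built-in 'cars' data as a stand-in example
fit <- lm(dist ~ speed, data = cars)
error_rate <- sigma(fit) / mean(cars$dist)
print(error_rate)
```

Because the prediction error rate is scaled by the mean of the outcome, it can be compared across models fitted to the same response variable.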
# Model-level summary statistics (Adjusted R-squared, AIC, BIC) for each regression
glance(regressor_s)
glance(regressor_m_2)
paste('Simple linear regression prediction error rate: ',sigma(regressor_s)/mean(data$MYRExchangeRate))
## [1] "Simple linear regression prediction error rate: 0.049157295344548"
paste('Multiple linear regression prediction error rate: ',sigma(regressor_m_2)/mean(data$MYRExchangeRate))
## [1] "Multiple linear regression prediction error rate: 0.0453127747972776"
set.seed(1457)
train_control <- trainControl(method = "repeatedcv",
number = 10, repeats = 3)
model <- train(MYRExchangeRate ~ BrentCrudeOilPrice + USDCurrencyIndex,data = data_r,method = 'lm',trControl = train_control)
print(model)
## Linear Regression
##
## 205 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 185, 184, 185, 184, 184, 185, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.1818232 0.8542876 0.1436248
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
With a cross-validated RMSE of 0.18 and MAE of 0.14, both small relative
to the level of the exchange rate, and an R-squared of 0.85, the model
fits the data well.
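The resampling scheme reported above (10 folds, repeated 3 times) can be sketched by hand in base R. This is an illustration on synthetic data rather than a reproduction of `caret`'s internals: the function `repeated_cv`, the data frame `toy` and its columns are hypothetical.

```r
# Synthetic data (illustrative only)
set.seed(42)
n <- 120
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(n, sd = 0.3)

repeated_cv <- function(formula, data, k = 10, repeats = 3) {
  response <- all.vars(formula)[1]
  scores <- NULL
  for (r in seq_len(repeats)) {
    # Randomly assign each row to one of k folds (a fresh assignment per repeat)
    fold <- sample(rep(seq_len(k), length.out = nrow(data)))
    for (f in seq_len(k)) {
      # Fit on k - 1 folds, then evaluate on the held-out fold
      fit  <- lm(formula, data = data[fold != f, ])
      held <- data[fold == f, ]
      err  <- held[[response]] - predict(fit, newdata = held)
      scores <- rbind(scores, c(RMSE = sqrt(mean(err^2)), MAE = mean(abs(err))))
    }
  }
  # Average the metrics over all k * repeats held-out folds
  colMeans(scores)
}

cv_result <- repeated_cv(y ~ x1 + x2, toy)
print(cv_result)
```

Every observation is used for evaluation once per repeat, so the averaged RMSE and MAE estimate out-of-sample error rather than in-sample fit.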
Interpretation:
1. The Multiple Linear Regressor has a higher Adjusted R2 and lower AIC and BIC values than the Simple Linear Regressor.
2. The prediction error rate of the Multiple Linear Regressor (0.0453) is lower than that of the Simple Linear Regressor (0.0492).
Result:
The Multiple Linear Regressor is better than the Simple Linear Regressor.
From the above, the accuracy rates of K-Nearest Neighbour, Naive Bayes
and Random Forest are 0.9756, 0.9756 and 1 respectively, which shows that
Random Forest performs better than K-Nearest Neighbour and Naive
Bayes on the test set.
An ROC curve is a graph showing the performance of a classification
model at all classification thresholds.
ROC graphs are plotted for
K-Nearest Neighbour and Random Forest for comparison.
## Setting levels: control = low, case = high
## Setting direction: controls < cases
## Setting levels: control = low, case = high
## Setting direction: controls < cases
The graphs above show an AUC value of 1, which indicates an excellent
test result for both K-Nearest Neighbour and Random Forest.
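The AUC can be read as the probability that a randomly chosen high case receives a higher score than a randomly chosen low case. The sketch below computes it directly from that definition in base R; the function `auc_rank` and the scores are illustrative and are not the `pROC` computation behind the plots.

```r
# AUC as the share of (high, low) pairs in which the high case has the larger score
auc_rank <- function(scores, labels, positive = "high") {
  pos <- scores[labels == positive]
  neg <- scores[labels != positive]
  # Compare every positive/negative pair; ties count as half a win
  wins <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
  mean(wins)
}

labels  <- c("low", "low", "low", "high", "high", "high")
# Scores that separate the classes perfectly
perfect <- c(0.1, 0.2, 0.3, 0.7, 0.8, 0.9)
# Scores in which one low case outranks one high case
overlap <- c(0.1, 0.6, 0.3, 0.5, 0.8, 0.9)

auc_perfect <- auc_rank(perfect, labels)
auc_overlap <- auc_rank(overlap, labels)
print(c(perfect = auc_perfect, overlap = auc_overlap))
```

Since the AUC summarises ranking quality over all thresholds, an AUC of 1 is compatible with an accuracy below 1, which is measured at a single threshold.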
Result: Random Forest is the better classifier overall, as it has the
highest accuracy.
Of the variables retained in the final regression model, the USD Currency Index has the greatest impact on the USD/MYR exchange rate, hence the first aim has been met. A comparison of the two regression models showed that the multiple regression model had the lower error measures. Additionally, the Random Forest model outperformed the other two classification methods (K-Nearest Neighbour and Naive Bayes). The rise and fall of the MYR is influenced by several macroeconomic variables. This study can aid decision makers, particularly investors, in monitoring the anticipated performance of financial markets. For future study, adding more attributes to the dataset would be helpful because they may reveal further detail for broader projections. Future researchers may also consider additional algorithms to improve the analysis and verify the results.