Abstract

Forecasting the foreign exchange market is difficult because exchange rates respond to many economic factors. This study forecasts the Malaysian Ringgit (MYR) exchange rate against the U.S. Dollar (USD) using supervised regression and classification machine learning models. The aims of this project are to identify the economic variables that have the greatest impact on the USD/MYR exchange rate and to predict the exchange rate. The models are also compared, and the best one is chosen to predict the USD/MYR exchange rate. Among the regression models, multiple regression achieved the best performance, with low Root Mean Square Error (RMSE) and Mean Absolute Error (MAE). Among the classification models, RandomForest outperformed K-Nearest Neighbor and NaiveBayes, with accuracy rates of 1, 0.9756, and 0.9756, respectively.

Problem Statement

In the early part of the Covid-19 outbreak, the Malaysian Ringgit weakened from 4.02 in December 2020 to 4.20 in March 2022 due to the uncertain global economic outlook. As economies re-opened and geopolitical risk from the Russia-Ukraine war triggered higher inflation, the US Federal Reserve raised its interest rate from 0% in March 2022 to 3.25% in October 2022. The Fed’s monetary tightening led to a significant increase in the value of the US dollar and a corresponding depreciation of other currencies. As a result, the MYR weakened further, past 4.70, within seven months.

Exchange rate changes may cause uncertainty and risk aversion, which makes managing the country’s reserves more difficult for the central bank. Exchange rate movements also affect the profitability of exporters and importers. A strong domestic currency hinders exports and lowers import prices. Conversely, a weak domestic currency favours exports whilst raising the cost of imports.

Since 2022 is a post-pandemic year, the pace of monetary policy normalisation differs between countries, as each seeks to balance stable economic growth, currency volatility, and inflation objectives. Thus, this project seeks to study how various macroeconomic variables contribute to the performance of the MYR exchange rate using regression and classification models.

Objective

  1. To investigate the economic factors that influence the USD/MYR exchange rate the most
  2. To evaluate the performance of the predicting model
  3. To predict the USD/MYR exchange rate

Methodology

The Cross-Industry Standard Process for Data Mining (CRISP-DM) was used as the method for this project. It was chosen because CRISP-DM could give an overview of the data mining life cycle, which has six steps that will be explained below.

Business Understanding

Data analysis and pattern recognition are key tasks in the financial industry. For example, prior to making an investment decision, investors seek to obtain as much information as possible. Like other financial market data, foreign exchange rates can be forecast by analysing historical exchange rate series for patterns.

Recent declines in the ringgit and other currencies can be attributed to the strengthening of the US dollar, which has been spurred by a rise in US interest rates. An economist predicts that the ringgit will continue to decline in the near future due to the rising performance of the US dollar. In line with CRISP-DM, the primary variables that influence the USD/MYR exchange rate are identified, such as interest rates, inflation, and global trade developments.

In light of this, the objective of this research is to develop a model that can predict the USD/MYR exchange rate. This can be useful for enterprises that routinely conduct foreign business, for individuals planning to visit Malaysia or make investments, and for the formulation of monetary policy (Jong, 2016; Khan, 2018; Fischer, 2015).

Data Understanding

Data Source

Monthly data on the Malaysian Ringgit exchange rate against the US Dollar, from October 2005 through October 2022, was gathered from Bloomberg. October 2005 was chosen as the starting point because, between 1998 and 2005, the Malaysian government pegged the Ringgit at 3.80 to the US Dollar. Consequently, it is not advisable to include currency data prior to 2005. The dataset also covers economic variables that may have a significant relationship with the exchange rate: the Overnight Policy Rate (OPR), net foreign flow of bond investment into Malaysia, Malaysia Foreign Reserve, Brent Crude Oil Price, Malaysia Trade Balance, Malaysia Consumer Price Index (CPI), United States CPI, and three market series (the 10-year Malaysia Government Bond (MGS) yield, the Chinese Yuan (USD/CNY) exchange rate, and the USD currency index).

Variable Definition

1. Malaysia Policy Rate (MPR)

The MPR, i.e. the Overnight Policy Rate (OPR), is the benchmark interest rate established as the reference rate for overnight interbank transactions by the central bank of Malaysia, Bank Negara Malaysia (BNM). It serves as a benchmark for other interest rates in the country, such as deposit and lending rates. BNM determines the MPR through its monetary policy operations based on current market conditions and anticipated inflation. It is intended to meet the price stability and financial stability objectives of the central bank’s monetary policy.

2. Net Foreign Flow of Bond Investment into Malaysia

Net foreign flow of bond investment into Malaysia refers to the amount of money that flows into or out of the nation as bond investments. A positive net flow implies that more foreign capital is entering Malaysia for bond investments than is leaving the nation. A negative net flow shows the reverse, namely that more foreign funds are leaving Malaysia than are entering.

3. Malaysia Foreign Reserve

Malaysia Foreign Reserve refers to the amount of foreign currency assets kept by Bank Negara Malaysia (BNM) to support domestic monetary policy and regulate the exchange rate. Gold, foreign currency deposits, and foreign government securities comprise these assets.

4. Brent Crude Oil Price

Brent Crude Oil Price is a benchmark for global oil prices and the price of a particular mix of crude oil that is used as a standard by a significant number of oil producing nations.

5. Malaysia External Trade Balance

Malaysia’s Trade Balance represents the difference between its exports and imports. A positive trade balance, often known as a surplus, shows that the value of exports exceeds the value of imports. A negative trade balance, or deficit, means that the value of imports exceeds the value of exports.

6. Consumer Price Index (CPI)

The Consumer Price Index (CPI) measures the average change in prices of a basket of household-consumed goods and services. It is used to monitor inflation, which is the rate of increase in the general level of prices for goods and services, and to adjust other economic variables, such as real wages and gross domestic product, for inflation. This project uses both the Malaysian CPI and the United States CPI. There are some differences between the two indices:
  • Method of computation: The CPI calculation methods in Malaysia and the United States may vary somewhat. For instance, the Bureau of Labor Statistics (BLS) in the United States employs a fixed-weighted basket of goods and services, but the Department of Statistics Malaysia utilises a geometric mean to calculate the CPI.
  • Basket of goods and services: The basket of goods and services utilised by the CPI in Malaysia may contain products that are absent from the basket utilised by the BLS in the United States.
  • Inflation rate: The inflation rates in Malaysia and the United States may also be distinct. In 2020, the inflation rate in Malaysia is 1.9% while the inflation rate in the United States is 1.3%.
  • Currency: Malaysia’s currency is the Ringgit, whereas the United States’ currency is the United States Dollar.

7. Malaysia Government Bond (MGS)

A financial security issued by the Malaysian government is known as a government bond. It resembles a loan made by an investor to the government, with the understanding that the interest and principal will be paid back later. These bonds are generally used by the government to raise money for different programmes and projects and, because of the issuer’s creditworthiness, they are frequently seen as reasonably safe investments.

8. USD Currency Index

The US Dollar Index (USDX, DXY, DX) is an indicator of how much the US dollar is worth in comparison to a basket of other currencies. The index is frequently used as a benchmark for other financial instruments since it is designed to give a broad indicator of the worth of the US dollar internationally. The index is based on the exchange rates of six important international currencies: the Euro, Japanese Yen, British Pound, Canadian Dollar, Swedish Krona, and Swiss Franc. It is computed by the ICE (Intercontinental Exchange).

Installing Packages and Libraries

# install.packages('tidyverse',repos = "http://cran.us.r-project.org")
# install.packages('parsedate',repos = "http://cran.us.r-project.org")
# install.packages('Hmisc')
library('lubridate')
library('dplyr')
library('parsedate')
library('tidyr')
library('ggplot2')
library('RColorBrewer')
library('Hmisc')
library('caTools')
library('caret')
library('broom')
library('klaR')
library('randomForest')
library('class')
library('mlbench')
library('pls')
library('pROC')

Data Overview

Viewing data

data


Viewing data structure

glimpse(data)
## Rows: 209
## Columns: 13
## $ Date                         <chr> "31/10/2005", "30/11/2005", "31/12/2005",…
## $ MalaysiaPolicyRate           <dbl> 2.70, 3.00, 3.00, 3.00, 3.00, 3.25, 3.25,…
## $ USPolicyRate                 <dbl> 3.75, 4.00, 4.25, 4.50, 4.50, 4.50, 4.75,…
## $ MalaysiaNetForeignBondFlow   <dbl> -1510.7, -1104.9, 70.6, -84.6, -84.6, 139…
## $ MalaysiaForeignReserve       <dbl> 77.07, 73.07, 70.50, 71.29, 71.29, 72.20,…
## $ BrentCrudeOilPrice           <dbl> 58.10, 55.05, 58.98, 65.99, 65.99, 61.76,…
## $ MalaysiaExternalTradeBalance <dbl> 10.62, 9.10, 9.77, 9.03, 9.03, 7.68, 9.91…
## $ MalaysiaCPI                  <dbl> 3.3, 3.5, 3.5, 3.2, 3.2, 3.2, 4.8, 4.6, 3…
## $ USCPI                        <dbl> 4.3, 3.5, 3.4, 4.0, 4.0, 3.6, 3.4, 3.5, 4…
## $ USDCurrencyIndex             <dbl> 90.070, 91.570, 91.170, 88.960, 88.960, 9…
## $ CNYExchangeRate              <dbl> 8.0845, 8.0804, 8.0702, 8.0608, 8.0608, 8…
## $ Malaysia10YearGovBondYield   <dbl> 4.180, 4.245, 4.189, 4.112, 4.112, 4.120,…
## $ MYRExchangeRate              <dbl> 3.7750, 3.7775, 3.7795, 3.7505, 3.7505, 3…

The dataset has 209 rows and 13 columns.
Date is stored as a character data type.
All other variables are continuous numerical data.

Summary

summary(data)
##      Date           MalaysiaPolicyRate  USPolicyRate   
##  Length:209         Min.   :-3.250     Min.   :-2.125  
##  Class :character   1st Qu.: 2.500     1st Qu.: 0.125  
##  Mode  :character   Median : 3.000     Median : 0.125  
##                     Mean   : 2.828     Mean   : 1.199  
##                     3rd Qu.: 3.250     3rd Qu.: 1.875  
##                     Max.   : 3.500     Max.   : 5.250  
##                                                        
##  MalaysiaNetForeignBondFlow MalaysiaForeignReserve BrentCrudeOilPrice
##  Min.   :-23042.8           Min.   :  70.5         Min.   : 22.74    
##  1st Qu.: -1152.7           1st Qu.:  96.7         1st Qu.: 57.54    
##  Median :  1393.6           Median : 104.2         Median : 71.46    
##  Mean   :   826.4           Mean   : 116.2         Mean   : 77.05    
##  3rd Qu.:  3240.4           3rd Qu.: 120.3         3rd Qu.:101.01    
##  Max.   : 10259.0           Max.   :1000.7         Max.   :139.83    
##                                                                      
##  MalaysiaExternalTradeBalance  MalaysiaCPI         USCPI       
##  Min.   :-3.630               Min.   :-2.900   Min.   :-2.100  
##  1st Qu.: 7.853               1st Qu.: 1.400   1st Qu.: 1.300  
##  Median : 9.510               Median : 2.200   Median : 2.000  
##  Mean   :10.644               Mean   : 2.222   Mean   : 2.396  
##  3rd Qu.:11.812               3rd Qu.: 3.300   3rd Qu.: 3.000  
##  Max.   :31.710               Max.   : 8.500   Max.   : 9.100  
##  NA's   :3                                                     
##  USDCurrencyIndex CNYExchangeRate  Malaysia10YearGovBondYield MYRExchangeRate 
##  Min.   : 71.80   Min.   : 6.054   Min.   :2.551              Min.   : 2.961  
##  1st Qu.: 80.20   1st Qu.: 6.356   1st Qu.:3.560              1st Qu.: 3.260  
##  Median : 88.66   Median : 6.691   Median :3.873              Median : 3.654  
##  Mean   : 88.16   Mean   : 7.413   Mean   :3.808              Mean   : 3.880  
##  3rd Qu.: 95.79   3rd Qu.: 6.890   3rd Qu.:4.110              3rd Qu.: 4.149  
##  Max.   :112.12   Max.   :79.527   Max.   :5.033              Max.   :41.842  
##  NA's   :3

Data Preparation and Exploration

During this phase, the data is cleansed, transformed, and formatted so that it may be utilised in the modelling phase. These include fixing errors, imputing missing values, and eliminating outliers. The ultimate objective of this phase is to guarantee that the data are accurate and consistent so that they may be utilised effectively in subsequent phases of the CRISP-DM process. Given below are the steps.

Handling duplicates

Any record that shares data with another record mistakenly is considered duplicate data. Duplicate data is simple to identify and typically arises during the transfer of data between systems.

data[duplicated(data) == TRUE,]

Result: Four (4) duplicates are found. Since the dataset was a monthly dataset, it is appropriate to remove the duplicated data, as each month’s data should be unique.
Step: Remove the duplicates

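# Drop exact duplicate rows, keeping the first occurrence of each
data1 <- distinct(data)
# Verify that no duplicated rows remain (expected: zero rows returned)
data1[duplicated(data1), ]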

As a result, all duplicates are removed.
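As a minimal sketch of this check-and-remove step on a toy data frame (the `toy` object and its values are illustrative assumptions, not the project data), base R's `duplicated()` flags repeated rows and `unique()` drops them:

```r
# Toy monthly data with one accidental repeat (illustrative values only)
toy <- data.frame(
  Date = c("31/10/2005", "30/11/2005", "30/11/2005", "31/12/2005"),
  Rate = c(3.7750, 3.7775, 3.7775, 3.7795)
)

# duplicated() marks the second and later copies of an identical row
n_dup <- sum(duplicated(toy))

# unique() keeps the first occurrence of each row
toy_clean <- unique(toy)

print(n_dup)
print(nrow(toy_clean))
```

Comparing `nrow()` before and after is a quick way to confirm how many rows were dropped.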

Standardising date format

The data overview reveals that the ‘Date’ column contains dates written in a variety of forms, all of which need to be standardised.

dates<-unlist(data1['Date'])
dates<-as.Date(parse_date(dates))

data2<-data.frame(data1)
data2['Date']<-dates

As a result, the ‘Date’ column is standardised.
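The project relies on `parse_date()` from the parsedate package for this step. As a hedged base-R alternative, the sketch below tries a list of candidate formats element by element. The helper name `parse_mixed_dates` and the two formats are assumptions chosen for illustration.

```r
# Hypothetical helper: try each candidate format in turn, element by element
parse_mixed_dates <- function(x, formats = c("%d/%m/%Y", "%Y-%m-%d")) {
  out <- rep(as.Date(NA), length(x))
  for (fmt in formats) {
    todo <- is.na(out)                       # only retry unparsed entries
    out[todo] <- as.Date(x[todo], format = fmt)
  }
  out
}

# Illustrative input mixing two date styles
raw <- c("31/10/2005", "2005-11-30", "31/12/2005")
parsed <- parse_mixed_dates(raw)
print(parsed)
```

Entries that match none of the formats stay `NA`, so they can be inspected rather than silently misparsed.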

Handling missing values or empty string

Missing data, or missing values, arise when a variable in an observation contains no data value. Commonly occurring, missing data can have a substantial impact on the findings that can be derived from the data.

  1. Identify the Missing Value
summary(data2)
##       Date            MalaysiaPolicyRate  USPolicyRate   
##  Min.   :2005-10-31   Min.   :-3.250     Min.   :-2.125  
##  1st Qu.:2010-01-31   1st Qu.: 2.700     1st Qu.: 0.125  
##  Median :2014-04-30   Median : 3.000     Median : 0.125  
##  Mean   :2014-04-30   Mean   : 2.833     Mean   : 1.198  
##  3rd Qu.:2018-07-31   3rd Qu.: 3.250     3rd Qu.: 1.875  
##  Max.   :2022-10-31   Max.   : 3.500     Max.   : 5.250  
##                                                          
##  MalaysiaNetForeignBondFlow MalaysiaForeignReserve BrentCrudeOilPrice
##  Min.   :-23042.8           Min.   :  70.50        Min.   : 22.74    
##  1st Qu.: -1152.7           1st Qu.:  96.71        1st Qu.: 58.10    
##  Median :  1398.1           Median : 104.20        Median : 71.70    
##  Mean   :   846.2           Mean   : 116.51        Mean   : 77.26    
##  3rd Qu.:  3240.4           3rd Qu.: 120.29        3rd Qu.:101.01    
##  Max.   : 10259.0           Max.   :1000.72        Max.   :139.83    
##                                                                      
##  MalaysiaExternalTradeBalance  MalaysiaCPI         USCPI       
##  Min.   :-3.630               Min.   :-2.900   Min.   :-2.100  
##  1st Qu.: 7.902               1st Qu.: 1.400   1st Qu.: 1.300  
##  Median : 9.555               Median : 2.200   Median : 2.000  
##  Mean   :10.632               Mean   : 2.232   Mean   : 2.409  
##  3rd Qu.:11.812               3rd Qu.: 3.300   3rd Qu.: 3.000  
##  Max.   :31.710               Max.   : 8.500   Max.   : 9.100  
##  NA's   :3                                                     
##  USDCurrencyIndex CNYExchangeRate  Malaysia10YearGovBondYield MYRExchangeRate 
##  Min.   : 71.80   Min.   : 6.054   Min.   :2.551              Min.   : 2.961  
##  1st Qu.: 80.18   1st Qu.: 6.356   1st Qu.:3.572              1st Qu.: 3.260  
##  Median : 88.66   Median : 6.691   Median :3.877              Median : 3.654  
##  Mean   : 88.17   Mean   : 7.421   Mean   :3.818              Mean   : 3.884  
##  3rd Qu.: 95.87   3rd Qu.: 6.890   3rd Qu.:4.110              3rd Qu.: 4.149  
##  Max.   :112.12   Max.   :79.527   Max.   :5.033              Max.   :41.842  
##  NA's   :3
The above is a summary of the data after duplicates were removed and the date column was standardised. The number of rows of observations has been reduced to 205. The following variables are determined to be missing a total of 6 values:
  • MalaysiaExternalTradeBalance
  • USDCurrencyIndex

The following are the steps to identify the number of rows containing missing values.

any(is.na(data2))
## [1] TRUE
nrow(data2[!complete.cases(data2),])
## [1] 6
which(is.na(data2['MalaysiaExternalTradeBalance']))
## [1]  36 111 139
which(is.na(data2['USDCurrencyIndex']))
## [1] 20 39 82
Six rows contain missing values, which indicates that every missing value is located in a distinct observation. The precise locations are listed below.
  • MalaysiaExternalTradeBalance: row 36, 111, 139
  • USDCurrencyIndex: row 20, 39, 82
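The same information can be obtained for every column at once. The following is a small sketch on made-up data (the `toy` data frame is an assumption), using `colSums(is.na(...))` for per-column counts and `complete.cases()` for the affected rows:

```r
# Toy data with one missing value in each of two columns (illustrative only)
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(10, 20, NA, 40),
  c = 1:4
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Row indices that contain at least one missing value
rows_with_na <- which(!complete.cases(toy))

print(na_per_column)
print(rows_with_na)
```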


  2. Identify the Empty String Values
for(i in names(data2)) {
  print(paste(i, sum(data2[i]=="")))
}
## [1] "Date NA"
## [1] "MalaysiaPolicyRate 0"
## [1] "USPolicyRate 0"
## [1] "MalaysiaNetForeignBondFlow 0"
## [1] "MalaysiaForeignReserve 0"
## [1] "BrentCrudeOilPrice 0"
## [1] "MalaysiaExternalTradeBalance NA"
## [1] "MalaysiaCPI 0"
## [1] "USCPI 0"
## [1] "USDCurrencyIndex NA"
## [1] "CNYExchangeRate 0"
## [1] "Malaysia10YearGovBondYield 0"
## [1] "MYRExchangeRate 0"

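Result: No values are empty. The NA results appear for columns where the comparison with an empty string itself yields NA (the converted Date column and the two columns that contain missing values), not because empty strings were found.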
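A possible variant of the loop, sketched on toy data, passes `na.rm = TRUE` so that missing values do not turn the count into NA. The `toy` data frame is an assumption for illustration.

```r
# Toy data: one empty string and some missing values (illustrative only)
toy <- data.frame(
  label = c("x", "", NA, "y"),
  value = c(1.5, NA, 2.5, 3.5),
  stringsAsFactors = FALSE
)

# Count empty strings per column, ignoring NA comparisons
empty_counts <- sapply(toy, function(col) sum(col == "", na.rm = TRUE))
print(empty_counts)
```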

Boxplots of each continuous variable

Box plots allow rapid identification of median values, the dispersion of the data, and signs of skewness. Because a box plot divides the data into four parts, each containing around 25% of the observations, it provides a compact visual summary of large amounts of data.


The boxplots show that there are possible outliers in the variables below:
  • MalaysiaPolicyRate
  • MalaysiaNetForeignBondFlow
  • MalaysiaForeignReserve
  • CNYExchangeRate
  • MYRExchangeRate
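The points a boxplot draws as possible outliers follow the usual 1.5 × IQR whisker rule. As a sketch on illustrative values (the vector `x` is an assumption, not the project data), `boxplot.stats()` returns those points without drawing a plot:

```r
# Illustrative series with one value far from the rest
x <- c(3.65, 3.70, 3.75, 3.80, 4.10, 4.15, 4.18, 41.842)

# boxplot.stats() applies the 1.5 * IQR whisker rule used by boxplot()
stats <- grDevices::boxplot.stats(x, coef = 1.5)

# $out holds the points lying beyond the whiskers
flagged <- stats$out
print(flagged)
```

Flagged points are only candidates. Whether they are genuine or erroneous still has to be judged variable by variable.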

In the case of USDCurrencyIndex, no outliers are observed in the boxplot and the distribution appears approximately normal.
Thus, the missing values can be imputed using the mean.

data3<-data.frame(data2)
data3['USDCurrencyIndex']<-with(data3, impute(data3['USDCurrencyIndex'], mean))
data3[c(20,39,82),'USDCurrencyIndex']
## [1] 88.16865 88.16865 88.16865

Result: Mean value is imputed to all those 3 missing cells.

In the case for MalaysiaExternalTradeBalance, it is observed from the boxplot that this variable is positively skewed.
Median can handle skewed distribution more effectively than Mean.

data3['MalaysiaExternalTradeBalance']<-with(data3, impute(data3['MalaysiaExternalTradeBalance'], median))
data3[c(36,111,139),'MalaysiaExternalTradeBalance']
## [1] 9.555 9.555 9.555

Result: Median value is imputed to all missing cells.
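The choice between the two imputation rules can be sketched in base R without Hmisc. The helper `impute_with` and both vectors are hypothetical and only illustrate why the median is less sensitive to a skewed tail.

```r
# Hypothetical helper: fill NA entries with a summary statistic of the rest
impute_with <- function(x, fun) {
  x[is.na(x)] <- fun(x, na.rm = TRUE)
  x
}

symmetric <- c(80, 85, NA, 90, 95)  # roughly symmetric values
skewed    <- c(8, 9, NA, 10, 30)    # one large value pulls the mean upwards

filled_mean   <- impute_with(symmetric, mean)
filled_median <- impute_with(skewed, median)

print(filled_mean)
print(filled_median)
print(mean(skewed, na.rm = TRUE))   # for comparison with the median fill
```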

Handling outliers

Outliers can have a substantial effect on the outcomes of a data analysis or model, hence it is vital to determine whether the numbers identified as outliers are real or inaccurate. Real outliers, also known as valid outliers, are numbers that are truly distinct from the rest of the data and may represent a significant trend or piece of information. Incorrect outliers, also known as invalid outliers, are the result of data collecting or processing errors, such as measurement or data entry mistakes. If these values are not recognised and rectified, they might distort the outcomes of the study or model, leading to erroneous conclusions or subpar performance. In order to verify that the analysis or model is based on accurate and dependable data, it is essential to determine whether the outliers are genuine or fictitious.

MalaysiaPolicyRate

Since the beginning of 2004, the Central Bank of Malaysia (Bank Negara Malaysia, BNM) has never set a negative Overnight Policy Rate (OPR). Therefore, this variable should contain only positive values. However, a single negative value is observed, which needs to be handled.

Source: https://www.bnm.gov.my/monetary-stability/opr-decisions

data4<-data.frame(data3)
data4['MalaysiaPolicyRate']<-abs(data4['MalaysiaPolicyRate'])

USPolicyRate

The boxplot also shows that USPolicyRate contains a negative value, although the Federal Reserve, the central bank of the United States, has never set a negative policy rate. This value should therefore be handled as well.

Source: https://www.federalreserve.gov/monetarypolicy/openmarket.htm

data4['USPolicyRate']<-abs(data4['USPolicyRate'])

Result: All values in MalaysiaPolicyRate and USPolicyRate are converted to absolute (POSITIVE) values.

MalaysiaNetForeignBondFlow

Since the values span a wide range, it is necessary to examine whether the extreme maximum and minimum values are typical.

top_n(data4['MalaysiaNetForeignBondFlow'], 5)
## Selecting by MalaysiaNetForeignBondFlow
top_n(data4['MalaysiaNetForeignBondFlow'], -5)
## Selecting by MalaysiaNetForeignBondFlow

Result: It appears reasonable for the maximum to be 10258.963 because the other top-five values are also close to 10000. However, the minimum of -23042.756 is unusual, since the next-smallest value is -12465.554, a gap of more than 10,000.
Step: Check the values of 3 rows before and after the minimum.

min_row<-apply(data4['MalaysiaNetForeignBondFlow'],2,which.min)
data4[(min_row-3):(min_row+3),'MalaysiaNetForeignBondFlow']
## [1]  -4543.296  -2095.369  -7446.066 -23042.756   5719.764   8911.200   -903.187

Result: The values fluctuate from month to month and show no clear pattern over time. This value is therefore kept as it is for now.

MalaysiaForeignReserve

The boxplot shows that some values are larger than 1000, whereas all other values are less than 200.
Step: Find the location of outlier(s).

fr_outlier_row<-which(data4['MalaysiaForeignReserve'] > 200)
data4[fr_outlier_row, 'MalaysiaForeignReserve']
## [1] 1000.21 1000.72

Result: A total of two (2) outliers are found in ‘MalaysiaForeignReserve’ with value > 1000.

Step: Replace outliers with NA.

The figure above shows the boxplot of MalaysiaForeignReserve after replacing the outliers with NA.
Since the values are skewed, median imputation is applied to the NA values.

data4[fr_outlier_row,'MalaysiaForeignReserve']<-median(data4[,'MalaysiaForeignReserve'], na.rm=TRUE)
data4[fr_outlier_row,'MalaysiaForeignReserve']
## [1] 103.9 103.9
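The two-step handling described above (replace with NA, then impute the median) can be sketched on illustrative values. The vector `reserve` is an assumption, and the threshold of 200 follows the check used in this section.

```r
# Illustrative reserve values with two implausibly large entries
reserve <- c(96.7, 104.2, 1000.21, 120.3, 101.5, 1000.72)

# Step 1: mark values above the threshold and replace them with NA
is_outlier <- reserve > 200
reserve[is_outlier] <- NA

# Step 2: impute the NA entries with the median of the remaining values
reserve[is.na(reserve)] <- median(reserve, na.rm = TRUE)
print(reserve)
```

Setting the outliers to NA first means they do not influence the median that replaces them.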

CNYExchangeRate

Step: Find the location of outlier(s).

fivenum(data4[,'CNYExchangeRate'])
## [1]  6.0543  6.3561  6.6915  6.8904 79.5270
cer_outlier_row<-which(data4['CNYExchangeRate'] > 10)
data4[cer_outlier_row, 'CNYExchangeRate']
## [1] 79.527 73.919

Result: The outliers have values more than 70.
Step: Check 3 rows before and after each outlier.

data4[(cer_outlier_row[1]-3):(cer_outlier_row[1]+3),'CNYExchangeRate']
## [1]  8.0175  7.9943  7.9690 79.5270  7.9040  7.8794  7.8331
data4[(cer_outlier_row[2]-3):(cer_outlier_row[2]+3),'CNYExchangeRate']
## [1]  7.5450  7.5105  7.4627 73.9190  7.3037  7.1818  7.1108

Result: The decimal point appears to be misplaced in each outlier. Consequently, the data can be corrected directly by shifting the decimal point one place to the left. Imputation is not required for this variable.

data4[cer_outlier_row[1],'CNYExchangeRate']<-7.9527
data4[cer_outlier_row[2],'CNYExchangeRate']<-7.3919
data4[cer_outlier_row,'CNYExchangeRate']
## [1] 7.9527 7.3919

MYRExchangeRate

Step: Find the location of outlier(s).

fivenum(data4[,'MYRExchangeRate'])
## [1]  2.9610  3.2595  3.6541  4.1490 41.8420
mer_outlier_row<-which(data4['MYRExchangeRate'] > 10)
data4[mer_outlier_row, 'MYRExchangeRate']
## [1] 41.842

Result: One outlier is detected with value over 40.
Step: Check 3 rows before and after the outlier.

data4[(mer_outlier_row[1]-3):(mer_outlier_row[1]+3),'MYRExchangeRate']
## [1]  4.0652  4.1090  4.1383 41.8420  4.1842  4.1335  4.0953

Result: As with CNYExchangeRate, the outlier can be corrected directly by shifting the decimal point one place to the left. Imputation is not required here either.

data4[mer_outlier_row[1],'MYRExchangeRate']<-4.1842
data4[mer_outlier_row,'MYRExchangeRate']
## [1] 4.1842
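Instead of typing the corrected values by hand, the decimal shift used for CNYExchangeRate and MYRExchangeRate could also be expressed as a division by 10. This is a sketch on illustrative values, assuming that every value above 10 is a misplaced decimal point, which holds only for series whose true values are all below 10.

```r
# Illustrative exchange-rate series with one misplaced decimal point
cny <- c(8.0175, 7.9943, 7.9690, 79.5270, 7.9040)

# Values above 10 are assumed to be decimal-point errors
misplaced <- cny > 10

# Shift the decimal point one place to the left
cny[misplaced] <- cny[misplaced] / 10
print(cny)
```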

Normalising data

Most, if not all, of the numerical variables are on different scales.
Therefore, it is necessary to normalise these variables to bring them into the same range for more accurate comparisons.

##         Date MalaysiaPolicyRate USPolicyRate MalaysiaNetForeignBondFlow
## 1 2005-10-31          0.5428571    0.7073171                  0.6465749
## 2 2005-11-30          0.7142857    0.7560976                  0.6587605
## 3 2005-12-31          0.7142857    0.8048780                  0.6940589
## 4 2006-01-31          0.7142857    0.8536585                  0.6893985
## 5 2006-02-28          0.8571429    0.8536585                  0.7339218
## 6 2006-03-31          0.8571429    0.9024390                  0.6868191
##   MalaysiaForeignReserve BrentCrudeOilPrice MalaysiaExternalTradeBalance
## 1             0.09266573          0.3019899                    0.4032258
## 2             0.03624824          0.2759416                    0.3602151
## 3             0.00000000          0.3095055                    0.3791737
## 4             0.01114245          0.3693740                    0.3582343
## 5             0.02397743          0.3332479                    0.3200340
## 6             0.04513399          0.3686908                    0.3831353
##   MalaysiaCPI     USCPI USDCurrencyIndex CNYExchangeRate
## 1   0.5438596 0.5714286        0.4531316       1.0000000
## 2   0.5614035 0.5000000        0.4903386       0.9979805
## 3   0.5614035 0.4910714        0.4804167       0.9929564
## 4   0.5350877 0.5446429        0.4255984       0.9883263
## 5   0.5350877 0.5089286        0.4541238       0.9781795
## 6   0.6754386 0.4910714        0.4446980       0.9668506
##   Malaysia10YearGovBondYield MYRExchangeRate
## 1                  0.6563255       0.4610070
## 2                  0.6825141       0.4624228
## 3                  0.6599517       0.4635555
## 4                  0.6289283       0.4471314
## 5                  0.6321515       0.4270261
## 6                  0.6365834       0.4089596

Result: All numerical variables, both features and target, have been normalised. Only Date is left unchanged.
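The normalisation code is not shown in this section. The scaled values lie between 0 and 1, which is consistent with min-max scaling, so the sketch below assumes that method. The helper `min_max` and the three-row `toy` data frame are illustrative.

```r
# Hypothetical helper: rescale a numeric vector to the range [0, 1]
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Small illustrative data frame with a Date column that must stay untouched
toy <- data.frame(
  Date = as.Date(c("2005-10-31", "2005-11-30", "2005-12-31")),
  MalaysiaForeignReserve = c(77.07, 73.07, 70.50),
  MYRExchangeRate = c(3.7750, 3.7775, 3.7795)
)

# Apply the scaling to numeric columns only
num_cols <- sapply(toy, is.numeric)
toy[num_cols] <- lapply(toy[num_cols], min_max)
print(toy)
```

Each scaled column then has a minimum of 0 and a maximum of 1, so variables measured in different units become directly comparable.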

Visualising data

A line graph is frequently regarded as the most effective approach to depict time series data since it efficiently conveys how the data has changed over time. The y-axis shows the relevant variable while the x-axis shows time. It is simple to spot patterns and trends in the data, such as upward or downward trends, fluctuations, and seasonality, because the data points are connected by lines.

Because line graphs display how the data has changed over time and how various data points relate to one another, they are particularly helpful for time series data. Additionally, they may be annotated with other features such as lines for the mean, median, and standard deviation, which can be useful for seeing patterns and trends that are not always obvious. Other chart styles, such as bar charts or scatter plots, do not depict the continuity of the data over time and so might not be the ideal choice for time series data.

Line graphs are plotted to visualise the trend of each feature with MYRExchangeRate.

Note: Black line represents the feature whilst red dotted line represents the MYRExchangeRate.

From the line charts above, it is observed that USPolicyRate, USDCurrencyIndex and CNYExchangeRate follow a similar trend to MYRExchangeRate: when one decreases, the other decreases, and when one increases, the other increases as well.
On the other hand, BrentCrudeOilPrice, MalaysiaForeignReserve, MalaysiaCPI and USCPI show somewhat opposite trends to MYRExchangeRate.

Finding correlation between variables

In its widest definition, correlation is a measurement of the link between variables. When two variables are correlated, a change in one variable’s magnitude is usually accompanied by a change in the other variable’s magnitude.

Step: Plotting cross-correlation graph to measure different time series variables and ranking their relations.
The barchart above shows the correlation between two variables ranking from highest to lowest correlation. Blue bars indicate positive correlation whereas pink bars indicate negative correlation.

Apparently, the dependent variable MYRExchangeRate shows high correlation with USDCurrencyIndex and BrentCrudeOilPrice.

The remaining high correlations are between independent variables: CNYExchangeRate with USPolicyRate, CNYExchangeRate with MalaysiaForeignReserve, and MalaysiaForeignReserve with BrentCrudeOilPrice.

Correlation between all variables except Date


From the correlation heatmap, it is observed that USDCurrencyIndex has a very strong correlation with MYRExchangeRate, followed by BrentCrudeOilPrice and MalaysiaForeignReserve.

This observation is similar to the result from the cross-correlation barchart that has been plotted earlier.

Preliminary estimation:
1. USDCurrencyIndex and BrentCrudeOilPrice are significant.
2. MalaysiaForeignReserve can be eliminated.

Correlations between variables may change as market conditions change. Hence, correlation heatmaps are also plotted for three separate periods: 2005-2010, 2011-2015, and 2016-2022.

Correlation heatmap for year 2005-2010


In addition to the overall findings, CNYExchangeRate and MalaysiaNetForeignBondFlow show a higher correlation in this period.

Correlation heatmap for year 2011-2015


The policy rate and CPI show a high correlation, but this is regarded as a coincidence arising from the stable USD/MYR exchange rate during this period.

Correlation heatmap for year 2016-2022


In addition to the overall findings, CNYExchangeRate shows a relatively high correlation.

Overall cross-correlation with MYRExchangeRate
##                        Variable     overall Y2005toY2010 Y2011toY2015 Y2016toy2022
## 1            MalaysiaPolicyRate -0.27705889   0.06146647    0.7587431   -0.1822738
## 2                  USPolicyRate  0.07260433   0.46917321    0.5307193    0.1110615
## 3    MalaysiaNetForeignBondFlow -0.21952723  -0.42486821   -0.1033952   -0.3152069
## 4        MalaysiaForeignReserve -0.47326473  -0.77783658   -0.9322529   -0.1212143
## 5            BrentCrudeOilPrice -0.64499891  -0.69070660   -0.9286770    0.1762821
## 6  MalaysiaExternalTradeBalance  0.41545170  -0.06799965    0.0968636    0.2025899
## 7                   MalaysiaCPI -0.12473945   0.14096291    0.1242865    0.3419343
## 8                         USCPI  0.19788252  -0.00822469   -0.6974354    0.4162212
## 9              USDCurrencyIndex  0.91100701   0.70123283    0.9045400    0.8279759
## 10              CNYExchangeRate  0.15822691   0.67854585    0.2523939    0.5739093
## 11   Malaysia10YearGovBondYield -0.10315798   0.21818305    0.5717199    0.2441488
## 12              MYRExchangeRate  1.00000000   1.00000000    1.0000000    1.0000000

Cross-correlation with MYRExchangeRate by period: 2005-2010, 2011-2015, 2016-2022

Schober, Boer and Schwarte (2018) stated that a correlation score of less than 0.3 is considered a weak correlation.
Therefore, from the results above, it is observed that MalaysiaExternalTradeBalance, MalaysiaCPI, USCPI, USPolicyRate, MalaysiaNetForeignBondFlow and Malaysia10YearGovBondYield have weak correlation with MYRExchangeRate.
These variables may not be important for prediction.
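
The screening rule above can be sketched in base R. This is only an illustration on synthetic data: the data frame `df` and its column names are hypothetical stand-ins, not the project's dataset, and the 0.3 threshold is the one quoted above.

```r
# Synthetic stand-in data; column names are hypothetical, not the project's variables
set.seed(1)
n <- 200
target <- rnorm(n)
df <- data.frame(
  target  = target,
  strong  = target + rnorm(n, sd = 0.3),   # built to correlate strongly
  inverse = -target + rnorm(n, sd = 0.5),  # built to correlate negatively
  noise   = rnorm(n)                       # unrelated to the target
)

# Pearson correlation of every predictor with the target
predictors <- setdiff(names(df), "target")
r <- sapply(df[predictors], function(x) cor(x, df$target))

# Flag predictors whose absolute correlation falls below the 0.3 threshold
weak <- names(r)[abs(r) < 0.3]
print(round(r, 3))
print(weak)
```

Taking the absolute value matters here: a strongly negative correlation, such as an inverse relationship, should not be discarded as weak.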

Overall insights and inferences:

  • USDCurrencyIndex is the most significant variable as indicated from all analyses.
  • Malaysia has traded more with China in recent years, which makes CNYExchangeRate useful for future prediction if that trade continues.
  • BrentCrudeOilPrice has strong inverse correlation with MYRExchangeRate.

Preparing data with selected features to be used in modelling

  • Date
  • MalaysiaPolicyRate
  • MalaysiaForeignReserve
  • USDCurrencyIndex
  • CNYExchangeRate
  • Malaysia10YearGovBondYield

Modelling

In the CRISP-DM technique, modelling refers to the stage in which a model is constructed and validated utilising the data gathered in prior processes. The objective of modelling in CRISP-DM is to construct a model that properly predicts the desired output based on the input variables. This process comprises selecting the right model type, training the model using a sample of the data, and assessing the performance of the model using techniques such as classification accuracy and mean absolute error (MAE). The model with the greatest performance is then used to forecast fresh data.

This project utilised supervised machine learning, which is categorised into the following:
  • Regression
  • Classification

The distinction between the two is the outcomes they forecast.
In regression, the result is a continuous number; hence, the output of this project is to estimate the USD/MYR exchange rate for the following month. On the other hand, the output of classification is a discrete value, such as a label, identifying the class to which an input belongs. For this project, it is utilised to anticipate the trend of the USD/MYR exchange rate.

Data Modelling (REGRESSION)

Two regression models are implemented for prediction.

Both simple linear regression and multiple regression are used. The two models differ in the number of independent variables employed to predict the dependent variable. Simple linear regression uses a single independent variable; in this project, the USD Currency Index is used to forecast the USD/MYR exchange rate.

Multiple regression, in contrast, uses several independent variables to predict the dependent variable, here the USD/MYR exchange rate. It can provide a more accurate model because it incorporates more of the variables that may influence the exchange rate. However, multiple regression is more complicated since it requires more data and processing resources.

The following were the workings:

Simple Linear Regression

The dataset is first split into training set and test set in the ratio of 80:20.

data_r = data10
data_r_ori = data9
# Split the dataset into a training set and a test set (80:20)
library(caTools)  # provides sample.split(); run install.packages('caTools') once if needed
set.seed(123)
split = sample.split(data_r$MYRExchangeRate, SplitRatio = 0.8)
training_set_r = subset(data_r, split == TRUE)
test_set_r = subset(data_r, split == FALSE)

From the EDA result, it is observed that USDCurrencyIndex has a strong correlation with target variable as shown in the diagram below.

data_r %>%
  as.data.frame() %>%
  GGally::ggpairs()

Therefore, a simple linear regression using ‘USDCurrencyIndex’ is created.

## 
## Call:
## lm(formula = MYRExchangeRate ~ USDCurrencyIndex, data = training_set_r)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5695 -0.1019  0.0115  0.1392  0.3514 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.89843    0.03223   89.94   <2e-16 ***
## USDCurrencyIndex  1.97439    0.06873   28.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1907 on 162 degrees of freedom
## Multiple R-squared:  0.8359, Adjusted R-squared:  0.8349 
## F-statistic: 825.2 on 1 and 162 DF,  p-value: < 2.2e-16

The linear model is then used to predict the MYRExchangeRate.
The performance of the model is depicted in the graph below.

## `geom_smooth()` using formula = 'y ~ x'

Next, the test set is used to check the model’s performance.
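
A hold-out check of this kind can be sketched as follows. The data are synthetic, the names `sim`, `train` and `test` are hypothetical, and a plain random index split stands in for the caTools split used in the project, so the numbers it prints are not the project's results.

```r
set.seed(42)
n <- 200
x <- runif(n, 0, 1)
sim <- data.frame(x = x, y = 3 + 2 * x + rnorm(n, sd = 0.2))  # synthetic data

# 80:20 split by random row index (base R stand-in for caTools::sample.split)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Fit on the training rows only, then predict the unseen test rows
fit  <- lm(y ~ x, data = train)
pred <- predict(fit, newdata = test)

# Error measures on the held-out data
rmse <- sqrt(mean((test$y - pred)^2))
mae  <- mean(abs(test$y - pred))
cat("Test RMSE:", round(rmse, 4), " Test MAE:", round(mae, 4), "\n")
```

Because the test rows play no part in fitting, these errors estimate how the model would perform on new months rather than how well it reproduces the training data.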

Before performing multiple linear regression, it is necessary to check whether the independent variables are strongly correlated with one another (multicollinearity).

##  [1] -0.01809105  0.08821480  0.73681104 -0.18899004 -0.35744013 -0.58294577
##  [7]  0.21851721 -0.74850514 -0.33234109  0.03132114
## integer(0)

The result above shows no multicollinearity among the independent variables.
Thus, a multiple linear regression model can be fitted to the data. Backward elimination is used to select the predictors.
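
One way to run such a check is to list every pair of predictors whose absolute correlation exceeds a cutoff. The sketch below uses synthetic data, and the cutoff of 0.8 is an assumption for illustration; the source does not state which threshold or function produced the output above.

```r
set.seed(7)
n <- 150
a <- rnorm(n)
X <- data.frame(
  a = a,
  b = a + rnorm(n, sd = 0.1),  # deliberately a near-duplicate of a
  c = rnorm(n)                 # independent of a and b
)

cutoff <- 0.8  # hypothetical threshold, not taken from the project
cm <- cor(X)

# List each predictor pair once (upper triangle) with its correlation
idx <- which(upper.tri(cm), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)

# Pairs above the cutoff are candidates for multicollinearity
flagged <- pairs[abs(pairs$r) > cutoff, ]
print(flagged)
```

An empty `flagged` table would correspond to the situation reported above, where no pair of predictors exceeds the chosen cutoff.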

Multiple Linear Regression

## 
## Call:
## lm(formula = MYRExchangeRate ~ BrentCrudeOilPrice + USDCurrencyIndex, 
##     data = training_set_r)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46769 -0.09419  0.01396  0.12825  0.47712 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.18703    0.06075  52.458  < 2e-16 ***
## BrentCrudeOilPrice -0.42635    0.07829  -5.446  1.9e-07 ***
## USDCurrencyIndex    1.74985    0.07559  23.149  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1758 on 161 degrees of freedom
## Multiple R-squared:  0.8614, Adjusted R-squared:  0.8597 
## F-statistic: 500.4 on 2 and 161 DF,  p-value: < 2.2e-16
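
Backward elimination can be sketched with `step()` from base R, which removes terms one at a time for as long as doing so lowers the AIC. This is an assumption about the procedure: the source does not say which criterion was used, and a p-value-based elimination could keep a different set of predictors. The data and variable names below are synthetic.

```r
set.seed(11)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 - 1.5 * sim$x2 + rnorm(n, sd = 0.5)  # x3 plays no role

# Start from the full model with every candidate predictor
full <- lm(y ~ x1 + x2 + x3, data = sim)

# Backward elimination: drop terms while doing so lowers the AIC
reduced <- step(full, direction = "backward", trace = 0)
print(formula(reduced))
```

Setting `trace = 0` suppresses the step-by-step log; leaving it at the default prints the AIC of each candidate model, which is useful when reporting why a predictor was dropped.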

The plot compares the predicted and actual values: blue represents the predicted values and red represents the actual values.

Data Modelling (CLASSIFICATION)

Three different classification algorithms are implemented to predict whether MYRExchangeRate is high or low.

Firstly, MYRExchangeRate is transformed into a binary categorical variable: a rate above 3.8 is categorised as a high exchange rate, whilst a rate of 3.8 or below is categorised as a low exchange rate.

The algorithms used in this project are K-Nearest Neighbors (KNN), Naive Bayes and Random Forest. The workings are shown below.

data_c <- data9
# Label each row: 'low' if the rate (column 6) is 3.8 or below, otherwise 'high'
data_c$MYRExchangeRate_cat <- ifelse(data_c[[6]] <= 3.8, 'low', 'high')
# Drop the original numeric MYRExchangeRate column (column 6)
data_c <- data_c %>% dplyr::select(-c(6))
data_c
# Count the rows in each class to check the balance
as.data.frame(table(data_c$MYRExchangeRate_cat))

It is verified that the two classes are balanced after the transformation.

The dataset is then split again into a training set and a test set.

data_c$MYRExchangeRate_cat = factor(data_c$MYRExchangeRate_cat,levels = c("low","high"))
# Splitting the dataset into the Training set and Test set
set.seed(12771)
split = sample.split(data_c$MYRExchangeRate_cat, SplitRatio = 0.8)
training_set_c = subset(data_c, split == TRUE)
test_set_c = subset(data_c, split == FALSE)

K-Nearest Neighbor

K-Nearest Neighbors (KNN) is an instance-based, non-parametric learning algorithm. It classifies new cases based on a similarity measure between the new case and the stored examples. For instance, KNN may predict the exchange rate class of an observation from the observations with the most similar attributes.

classifier = train(MYRExchangeRate_cat~., data = training_set_c, method = "knn")
x_test_knn <- test_set_c[,1:5]
y_test_knn <- test_set_c[,6]
predictions_knn <- predict(classifier,x_test_knn)
confusionMatrix(predictions_knn,y_test_knn)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction low high
##       low   22    0
##       high   1   18
##                                           
##                Accuracy : 0.9756          
##                  95% CI : (0.8714, 0.9994)
##     No Information Rate : 0.561           
##     P-Value [Acc > NIR] : 1.684e-09       
##                                           
##                   Kappa : 0.9508          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9565          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9474          
##              Prevalence : 0.5610          
##          Detection Rate : 0.5366          
##    Detection Prevalence : 0.5366          
##       Balanced Accuracy : 0.9783          
##                                           
##        'Positive' Class : low             
## 
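
The headline statistics in the output above follow directly from the four cell counts. As a worked check, the sketch below recomputes them in base R from the counts shown in the KNN confusion matrix; the object names are illustrative.

```r
# Counts copied from the KNN confusion matrix (rows = prediction, columns = reference)
cm <- matrix(c(22, 1, 0, 18), nrow = 2,
             dimnames = list(Prediction = c("low", "high"),
                             Reference  = c("low", "high")))

n <- sum(cm)
accuracy    <- sum(diag(cm)) / n                        # (22 + 18) / 41
sensitivity <- cm["low", "low"] / sum(cm[, "low"])      # 22 / 23, 'low' is the positive class
specificity <- cm["high", "high"] / sum(cm[, "high"])   # 18 / 18

# Cohen's kappa: observed agreement corrected for agreement expected by chance
p_chance    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - p_chance) / (1 - p_chance)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, kappa = cohen_kappa), 4))
```

The single misclassified observation, a reference `low` predicted as `high`, is what lowers the sensitivity below 1 while the specificity stays at 1.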

Naive Bayes

Naive Bayes is a probabilistic algorithm that classifies data points based on the likelihood that they belong to a certain class. It is a straightforward method that assumes the features are independent of one another given the class. Trained on sufficient historical data, it may be used to anticipate future exchange rate changes.

classifier = NaiveBayes(MYRExchangeRate_cat~., data = training_set_c)
x_test_nb <- test_set_c[,1:5]
y_test_nb <- test_set_c[,6]
predictions_nb <- predict(classifier,x_test_nb)
confusionMatrix(predictions_nb$class,y_test_nb)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction low high
##       low   22    0
##       high   1   18
##                                           
##                Accuracy : 0.9756          
##                  95% CI : (0.8714, 0.9994)
##     No Information Rate : 0.561           
##     P-Value [Acc > NIR] : 1.684e-09       
##                                           
##                   Kappa : 0.9508          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9565          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9474          
##              Prevalence : 0.5610          
##          Detection Rate : 0.5366          
##    Detection Prevalence : 0.5366          
##       Balanced Accuracy : 0.9783          
##                                           
##        'Positive' Class : low             
## 

Random Forest

Random Forest is an ensemble learning method for classification, regression and other tasks. It operates by constructing a large number of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random Forest is well suited to cases where several factors could influence the exchange rate, and it can also be used to identify the most influential ones.

classifier = randomForest(MYRExchangeRate_cat~., data = training_set_c)
x_test_rf <- test_set_c[,1:5]
y_test_rf <- test_set_c[,6]
predictions_rf <- predict(classifier,x_test_rf)
confusionMatrix(predictions_rf,y_test_rf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction low high
##       low   23    0
##       high   0   18
##                                     
##                Accuracy : 1         
##                  95% CI : (0.914, 1)
##     No Information Rate : 0.561     
##     P-Value [Acc > NIR] : 5.09e-11  
##                                     
##                   Kappa : 1         
##                                     
##  Mcnemar's Test P-Value : NA        
##                                     
##             Sensitivity : 1.000     
##             Specificity : 1.000     
##          Pos Pred Value : 1.000     
##          Neg Pred Value : 1.000     
##              Prevalence : 0.561     
##          Detection Rate : 0.561     
##    Detection Prevalence : 0.561     
##       Balanced Accuracy : 1.000     
##                                     
##        'Positive' Class : low       
## 

Evaluation

Regression Model Evaluation

The most commonly used evaluation measures in regression are R-squared, prediction error rate, Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE). For this project, Adjusted R2, Akaike’s Information Criterion (AIC), and the Bayesian Information Criterion (BIC) were also used as robust metrics for assessing the quality of regression models and comparing models.

Regression Model Evaluation in training data

Dividing the residual standard error (RSE) by the mean of the outcome variable gives the prediction error rate.
The smaller the prediction error rate, the better the model’s performance.

# Model-level fit statistics (adjusted R-squared, AIC, BIC) via broom::glance()
glance(regressor_s)
glance(regressor_m_2)
paste('Simple linear regression prediction error rate: ',sigma(regressor_s)/mean(data$MYRExchangeRate))
## [1] "Simple linear regression prediction error rate:  0.049157295344548"
paste('Multiple linear regression prediction error rate: ',sigma(regressor_m_2)/mean(data$MYRExchangeRate))
## [1] "Multiple linear regression prediction error rate:  0.0453127747972776"
set.seed(1457)
train_control <- trainControl(method = "repeatedcv",
                              number = 10, repeats = 3)
model <- train(MYRExchangeRate ~ BrentCrudeOilPrice + USDCurrencyIndex,
               data = data_r, method = 'lm', trControl = train_control)
print(model)
## Linear Regression 
## 
## 205 samples
##   2 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 185, 184, 185, 184, 184, 185, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.1818232  0.8542876  0.1436248
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

The cross-validated RMSE of 0.18 and MAE of 0.14 are low, which indicates that the model fits the data well.
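
To make the resampling scheme concrete, repeated 10-fold cross-validation can be written out by hand in base R. This is a sketch on synthetic data with hypothetical names (`sim`, `x1`, `x2`); it mirrors the "10 fold, repeated 3 times" setting but is not the caret implementation and does not reproduce the figures above.

```r
set.seed(1457)
n <- 205
sim <- data.frame(x1 = runif(n), x2 = runif(n))
sim$y <- 3 - 0.5 * sim$x1 + 1.5 * sim$x2 + rnorm(n, sd = 0.2)  # synthetic data

k <- 10
repeats <- 3
scores <- NULL
for (r in seq_len(repeats)) {
  # Assign every row a random fold label from 1 to k
  fold <- sample(rep(seq_len(k), length.out = n))
  for (f in seq_len(k)) {
    fit  <- lm(y ~ x1 + x2, data = sim[fold != f, ])  # train on the other k - 1 folds
    held <- sim[fold == f, ]                          # evaluate on the held-out fold
    err  <- held$y - predict(fit, newdata = held)
    scores <- rbind(scores, c(RMSE = sqrt(mean(err^2)), MAE = mean(abs(err))))
  }
}

# Average the error measures over all 30 held-out folds
print(colMeans(scores))
```

Each observation is held out exactly once per repeat, so the averaged errors are less sensitive to one lucky or unlucky split than a single 80:20 partition.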

Interpretation:
1. The Multiple Linear Regressor has a higher Adjusted R2 and lower AIC and BIC values than the Simple Linear Regressor.
2. The prediction error rate of the Multiple Linear Regressor is lower than that of the Simple Linear Regressor.

Result: The Multiple Linear Regressor is better than the Simple Linear Regressor.
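
The comparison behind this interpretation can be sketched with base R's `AIC()`, `BIC()` and `sigma()`. The two models below are fitted to synthetic data, and their names (`m_simple`, `m_multiple`) are hypothetical rather than the project's `regressor_s` and `regressor_m_2`.

```r
set.seed(3)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 3 + 2 * sim$x1 - 0.5 * sim$x2 + rnorm(n, sd = 0.2)  # synthetic data

m_simple   <- lm(y ~ x1, data = sim)        # one predictor
m_multiple <- lm(y ~ x1 + x2, data = sim)   # two predictors

comparison <- data.frame(
  model  = c("simple", "multiple"),
  adj_r2 = c(summary(m_simple)$adj.r.squared, summary(m_multiple)$adj.r.squared),
  AIC    = c(AIC(m_simple), AIC(m_multiple)),
  BIC    = c(BIC(m_simple), BIC(m_multiple)),
  # Prediction error rate: residual standard error relative to the mean outcome
  error_rate = c(sigma(m_simple), sigma(m_multiple)) / mean(sim$y)
)
print(comparison)
```

AIC and BIC both penalise extra parameters, so a lower value for the larger model indicates that the added predictor improves the fit by more than its added complexity costs.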

Classification Model Evaluation

From the above, the accuracy rates of K-Nearest Neighbour, Naive Bayes and Random Forest are 0.9756, 0.9756 and 1 respectively, which shows that Random Forest performs better than K-Nearest Neighbour and Naive Bayes.

An ROC curve is a graph showing the performance of a classification model at all classification thresholds.
ROC graphs are plotted for K-Nearest Neighbour and Random Forest for comparison.

## Setting levels: control = low, case = high
## Setting direction: controls < cases

## Setting levels: control = low, case = high
## Setting direction: controls < cases

The graphs above show an AUC value of 1, which indicates an excellent test result for both K-Nearest Neighbour and Random Forest.
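
For reference, the AUC can be computed without plotting: it equals the probability that a randomly chosen `high` case receives a higher score than a randomly chosen `low` case. The sketch below applies this rank-based (Mann-Whitney) formulation to eight hypothetical scores; they are illustrative values, not the project's predictions.

```r
# Hypothetical predicted probabilities of the 'high' class for eight observations
truth <- factor(c("low", "low", "low", "low", "high", "high", "high", "high"),
                levels = c("low", "high"))
score <- c(0.10, 0.35, 0.20, 0.60, 0.55, 0.80, 0.90, 0.70)

# Rank all scores together; ties would receive average ranks (half credit)
ranks <- rank(score)
n_pos <- sum(truth == "high")
n_neg <- sum(truth == "low")

# Mann-Whitney form of the AUC
auc <- (sum(ranks[truth == "high"]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
print(auc)  # 0.9375: 15 of the 16 high/low pairs are ordered correctly
```

An AUC of 1 therefore means that every `high` case is scored above every `low` case, which can hold even when the default classification threshold misclassifies an observation.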

Result: Random Forest is the best classifier overall, given that it has the highest accuracy.

Conclusion and Recommendation

The USD Currency Index has the greatest impact on the USD/MYR exchange rate, hence the first aim has been met. The comparison between the two regression models revealed that the multiple regression model had the lowest error measures for both the estimation and assessment portions of the data. Additionally, the Random Forest model outperformed the other two classification methods (K-Nearest Neighbor and Naive Bayes). The rise and fall of the MYR is unquestionably influenced by several macroeconomic variables. This study aids decision makers, particularly investors, in monitoring the anticipated performance of financial markets. For future study, adding more attributes to the dataset would be helpful because they can reveal more details for global projection. Future researchers may also consider additional algorithms to improve the analysis and verify the results.

References