Gold Price Prediction

Group Members:

Phua Jia Yi (23051517) - Leader
Tan Kai Qi (23051355)
Wendy Sim Pei Lin (s2038886)
Seh Chia Shin (17202838)
Ng Chee Kang (23051336)

1.0 Introduction

Gold, a common symbol representing wealth, has been used as a form of currency since decades ago. Due to its intrinsic value, this precious metal continues to play a vital role in today’s society and the global economy. It is deemed as a form of guarantee to foreign debts repayment and it represents the economic strength and stability of a country.

Being a commodity product on its own as well as a currency, the gold market is highly influenced by economic factors, geopolitical factors and, among others, investors’ sentiment. Investors would often rely on forecast of gold prices to decide on the best execution of trade, however, several factors would come to play in determining the prediction of gold price. These factors include prices of other precious metals, prices of crude oils, performance of the stock exchange, currency rates etc. This research aims to formulate an algorithm to predict the movement in price of gold, future prices and its exact value while taking into account of the correlating factors.

1.1 Problem Question

The volatile gold market renders it difficult for stakeholders to properly assess the risk and make strategic decisions in seizing the right opportunity for investment. The sophistication in gold price prediction derives from the multifaceted external influences such as economic trends, geopolitical events, oil prices etc.

This research aims to utilize advanced data analytics and machine learning techniques to develop a robust gold price prediction model. The gold price prediction model will provide stakeholders with foreseeable results and valuable insights to better assess the risk and benefits for proper investment strategies.

1.2 Objective

To explore the significant factors influencing gold prices through comprehensive data analysis.
To develop an effective predictive model for gold prices using machine learning algorithms.
To assess the model’s performance to ensure its reliability and general applicability.

2.0 Dataset details

The dataset used in this project was primarily gathered from Yahoo Finance. It is a set of data collected from November 18, 2011, to January 1, 2019, encompassing 1718 rows and 81 columns.

These columns represent various financial attributes including Oil Price, Standard and Poor’s (S&P) 500 index, Dow Jones Index, US Bond rates (10 years), Euro USD exchange rates, prices of precious metals like Silver, Platinum and other metals such as Palladium and Rhodium. Additionally, it includes data on the US Dollar Index, Eldorado Gold Corporation, and Gold Miners ETF.

Within the dataset, the historical data of the Gold ETF comprises 7 columns, including Date, Open, High, Low, Close, Adjusted Close, and Volume. The value of Adjusted Close is particularly significant as it considers factors like dividends, stock splits, and new stock offerings to determine a more accurate representation of the Gold ETF’s closing value.

source: https://www.kaggle.com/datasets/sid321axn/gold-price-prediction-dataset/data

# Import dataset
df<- read.csv("Dataset/FINAL_USO.csv")

## Import library
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(corrplot)

## corrplot 0.92 loaded

library(randomForest)

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

## The following object is masked from 'package:dplyr':
## 
##     combine

library(Metrics)
library(caret)

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following objects are masked from 'package:Metrics':
## 
##     precision, recall

3.0 Data Cleaning

3.1 Exploring raw data

Upon using the ‘glimpse()’ function from the ‘dplyr’ library, it was observed that the ‘Date’ column is in character format, while the remaining 80 columns consist of numeric variables, with values either in double or integer form.

## Exploring dataset
glimpse(df)

## Rows: 1,718
## Columns: 81
## $ Date          <chr> "2011-12-15", "2011-12-16", "2011-12-19", "2011-12-20", …
## $ Open          <dbl> 154.74, 154.31, 155.48, 156.82, 156.98, 156.35, 156.35, …
## $ High          <dbl> 154.95, 155.37, 155.86, 157.43, 157.53, 156.80, 156.49, …
## $ Low           <dbl> 151.71, 153.90, 154.36, 156.58, 156.13, 155.33, 155.82, …
## $ Close         <dbl> 152.33, 155.23, 154.87, 156.98, 157.16, 156.04, 156.31, …
## $ Adj.Close     <dbl> 152.33, 155.23, 154.87, 156.98, 157.16, 156.04, 156.31, …
## $ Volume        <int> 21521900, 18124300, 12547200, 9136300, 11996100, 9888400…
## $ SP_open       <dbl> 123.03, 122.23, 122.06, 122.18, 123.93, 124.63, 125.67, …
## $ SP_high       <dbl> 123.20, 122.95, 122.32, 124.14, 124.36, 125.40, 126.43, …
## $ SP_low        <dbl> 121.99, 121.30, 120.03, 120.37, 122.75, 124.23, 125.41, …
## $ SP_close      <dbl> 122.18, 121.59, 120.29, 123.93, 124.17, 125.27, 126.39, …
## $ SP_Ajclose    <dbl> 105.4412, 105.5975, 104.4685, 107.6298, 107.8382, 108.79…
## $ SP_volume     <int> 199109200, 220481400, 183903000, 225418100, 194230900, 1…
## $ DJ_open       <dbl> 11825.29, 11870.25, 11866.54, 11769.21, 12103.58, 12107.…
## $ DJ_high       <dbl> 11967.84, 11968.18, 11925.88, 12117.13, 12119.70, 12182.…
## $ DJ_low        <dbl> 11825.22, 11819.31, 11735.19, 11768.83, 11999.44, 12107.…
## $ DJ_close      <dbl> 11868.81, 11866.39, 11766.26, 12103.58, 12107.74, 12169.…
## $ DJ_Ajclose    <dbl> 11868.81, 11866.39, 11766.26, 12103.58, 12107.74, 12169.…
## $ DJ_volume     <int> 136930000, 389520000, 135170000, 165180000, 163250000, 1…
## $ EG_open       <dbl> 74.55, 73.60, 69.10, 66.45, 67.10, 68.40, 70.05, 70.80, …
## $ EG_high       <dbl> 76.15, 75.10, 69.80, 68.10, 69.40, 70.00, 71.75, 71.15, …
## $ EG_low        <dbl> 72.15, 73.35, 64.20, 66.00, 66.90, 66.70, 69.90, 67.55, …
## $ EG_close      <dbl> 72.90, 74.90, 64.70, 67.00, 68.50, 69.75, 71.60, 67.65, …
## $ EG_Ajclose    <dbl> 70.43176, 72.36404, 62.50938, 64.73151, 66.18072, 67.388…
## $ EG_volume     <int> 787900, 896600, 2096700, 875300, 837600, 780600, 621200,…
## $ EU_Price      <dbl> 1.3018, 1.3035, 1.2995, 1.3079, 1.3045, 1.3051, 1.3043, …
## $ EU_open       <dbl> 1.2982, 1.3020, 1.3043, 1.3003, 1.3079, 1.3047, 1.3051, …
## $ EU_high       <dbl> 1.3051, 1.3087, 1.3044, 1.3133, 1.3197, 1.3121, 1.3098, …
## $ EU_low        <dbl> 1.2957, 1.2997, 1.2981, 1.2994, 1.3024, 1.3018, 1.3026, …
## $ EU_Trend      <int> 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1,…
## $ OF_Price      <dbl> 105.09, 103.35, 103.64, 106.73, 107.71, 107.89, 107.96, …
## $ OF_Open       <dbl> 104.88, 103.51, 103.63, 104.30, 107.15, 107.92, 107.77, …
## $ OF_High       <dbl> 106.50, 104.56, 104.57, 107.27, 108.17, 108.50, 108.55, …
## $ OF_Low        <dbl> 104.88, 102.46, 102.37, 103.91, 106.16, 107.29, 107.40, …
## $ OF_Volume     <int> 14330, 140080, 147880, 170240, 145090, 111160, 64610, 50…
## $ OF_Trend      <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0,…
## $ OS_Price      <dbl> 93.42, 93.79, 94.09, 95.55, 99.01, 99.47, 99.85, 101.28,…
## $ OS_Open       <dbl> 94.91, 93.43, 93.77, 96.39, 97.54, 99.04, 99.49, 99.94, …
## $ OS_High       <dbl> 96.00, 94.80, 94.43, 99.70, 99.26, 100.06, 100.24, 101.7…
## $ OS_Low        <dbl> 93.33, 92.53, 92.55, 96.39, 96.81, 98.52, 99.25, 99.39, …
## $ OS_Trend      <int> 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0,…
## $ SF_Price      <int> 53604, 53458, 52961, 53487, 53148, 52894, 53067, 52711, …
## $ SF_Open       <int> 54248, 53650, 53400, 52795, 53519, 53000, 53000, 53014, …
## $ SF_High       <int> 54248, 54030, 53400, 53575, 54184, 53460, 53348, 53099, …
## $ SF_Low        <int> 52316, 52890, 52544, 52595, 52937, 52690, 52960, 52588, …
## $ SF_Volume     <int> 119440, 65390, 67280, 55130, 75950, 56530, 27510, 27450,…
## $ SF_Trend      <int> 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1,…
## $ USB_Price     <dbl> 1.911, 1.851, 1.810, 1.927, 1.970, 1.953, 2.021, 2.002, …
## $ USB_Open      <dbl> 1.911, 1.851, 1.810, 1.927, 1.970, 1.953, 2.021, 2.002, …
## $ USB_High      <dbl> 1.911, 1.851, 1.810, 1.927, 1.970, 1.953, 2.021, 2.002, …
## $ USB_Low       <dbl> 1.911, 1.851, 1.810, 1.927, 1.970, 1.953, 2.021, 2.002, …
## $ USB_Trend     <int> 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1,…
## $ PLT_Price     <dbl> 1414.65, 1420.25, 1411.10, 1434.75, 1429.05, 1425.90, 14…
## $ PLT_Open      <dbl> 1420.30, 1414.75, 1422.65, 1408.95, 1434.40, 1429.00, 14…
## $ PLT_High      <dbl> 1423.35, 1431.75, 1427.60, 1436.55, 1453.75, 1435.25, 14…
## $ PLT_Low       <dbl> 1376.85, 1400.70, 1404.60, 1408.15, 1417.65, 1415.90, 14…
## $ PLT_Trend     <int> 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0,…
## $ PLD_Price     <dbl> 618.85, 623.65, 608.80, 626.65, 635.90, 652.45, 665.55, …
## $ PLD_Open      <dbl> 614.70, 622.60, 626.00, 622.45, 625.70, 632.00, 664.95, …
## $ PLD_High      <dbl> 615.00, 623.45, 630.00, 622.45, 641.50, 650.80, 665.05, …
## $ PLD_Low       <dbl> 614.60, 622.30, 608.60, 622.45, 623.80, 632.00, 664.95, …
## $ PLD_Trend     <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0,…
## $ RHO_PRICE     <int> 1425, 1400, 1400, 1400, 1400, 1400, 1400, 1400, 1400, 14…
## $ USDI_Price    <dbl> 80.341, 80.249, 80.207, 80.273, 80.350, 80.306, 80.236, …
## $ USDI_Open     <dbl> 80.565, 80.175, 80.300, 80.890, 80.105, 80.360, 80.210, …
## $ USDI_High     <dbl> 80.630, 80.395, 80.470, 80.940, 80.445, 80.505, 80.375, …
## $ USDI_Low      <dbl> 80.130, 79.935, 80.125, 80.035, 79.550, 79.955, 80.060, …
## $ USDI_Volume   <int> 22850, 13150, 970, 22950, 24140, 13240, 7170, 5660, 2111…
## $ USDI_Trend    <int> 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0,…
## $ GDX_Open      <dbl> 53.01, 52.50, 52.49, 52.38, 53.15, 52.56, 52.59, 52.37, …
## $ GDX_High      <dbl> 53.14, 53.18, 52.55, 53.25, 53.43, 52.62, 52.79, 52.52, …
## $ GDX_Low       <dbl> 51.57, 52.04, 51.03, 52.37, 52.42, 51.77, 52.35, 51.68, …
## $ GDX_Close     <dbl> 51.68, 52.68, 51.17, 52.99, 52.96, 52.40, 52.79, 51.87, …
## $ GDX_Adj.Close <dbl> 48.97388, 49.92151, 48.49058, 50.21528, 50.18685, 49.656…
## $ GDX_Volume    <int> 20605600, 16285400, 15120200, 11644900, 8724300, 1317960…
## $ USO_Open      <dbl> 36.90, 36.18, 36.39, 37.30, 37.67, 38.27, 38.47, 38.49, …
## $ USO_High      <dbl> 36.94, 36.50, 36.45, 37.61, 38.24, 38.55, 38.61, 39.21, …
## $ USO_Low       <dbl> 36.05, 35.73, 35.93, 37.22, 37.52, 38.21, 38.37, 38.46, …
## $ USO_Close     <dbl> 36.13, 36.27, 36.20, 37.56, 38.11, 38.28, 38.49, 39.06, …
## $ USO_Adj.Close <dbl> 36.13, 36.27, 36.20, 37.56, 38.11, 38.28, 38.49, 39.06, …
## $ USO_Volume    <int> 12616700, 12578800, 7418200, 10041600, 10728000, 5376800…

3.2 Remove Redundancy.

The closing price columns for Gold ETF, S&P 500 index, Down Jones index, Eldorado Gold Corporation (EGO) and Gold Miners ETF was removed since the closing prices exhibit a strong correlation with their respective adjusted closing prices. Retaining both sets of columns might introduce redundancy without contributing additional insights.

The adjusted closing price was retained to ensure a more concise representation as it already incorporates all the essential adjustments.

# remove closing gold price 
remove_columns <- c("Close", "SP_close", "DJ_close", "EG_close", "GDX_Close", "USO_Close")
df <- df[, !(names(df) %in% remove_columns)]
head(df)

##         Date   Open   High    Low Adj.Close   Volume SP_open SP_high SP_low
## 1 2011-12-15 154.74 154.95 151.71    152.33 21521900  123.03  123.20 121.99
## 2 2011-12-16 154.31 155.37 153.90    155.23 18124300  122.23  122.95 121.30
## 3 2011-12-19 155.48 155.86 154.36    154.87 12547200  122.06  122.32 120.03
## 4 2011-12-20 156.82 157.43 156.58    156.98  9136300  122.18  124.14 120.37
## 5 2011-12-21 156.98 157.53 156.13    157.16 11996100  123.93  124.36 122.75
## 6 2011-12-22 156.35 156.80 155.33    156.04  9888400  124.63  125.40 124.23
##   SP_Ajclose SP_volume  DJ_open  DJ_high   DJ_low DJ_Ajclose DJ_volume EG_open
## 1   105.4412 199109200 11825.29 11967.84 11825.22   11868.81 136930000   74.55
## 2   105.5975 220481400 11870.25 11968.18 11819.31   11866.39 389520000   73.60
## 3   104.4685 183903000 11866.54 11925.88 11735.19   11766.26 135170000   69.10
## 4   107.6298 225418100 11769.21 12117.13 11768.83   12103.58 165180000   66.45
## 5   107.8382 194230900 12103.58 12119.70 11999.44   12107.74 163250000   67.10
## 6   108.7935 119465400 12107.59 12182.71 12107.37   12169.65 151610000   68.40
##   EG_high EG_low EG_Ajclose EG_volume EU_Price EU_open EU_high EU_low EU_Trend
## 1   76.15  72.15   70.43176    787900   1.3018  1.2982  1.3051 1.2957        1
## 2   75.10  73.35   72.36404    896600   1.3035  1.3020  1.3087 1.2997        1
## 3   69.80  64.20   62.50938   2096700   1.2995  1.3043  1.3044 1.2981        0
## 4   68.10  66.00   64.73151    875300   1.3079  1.3003  1.3133 1.2994        1
## 5   69.40  66.90   66.18072    837600   1.3045  1.3079  1.3197 1.3024        0
## 6   70.00  66.70   67.38840    780600   1.3051  1.3047  1.3121 1.3018        1
##   OF_Price OF_Open OF_High OF_Low OF_Volume OF_Trend OS_Price OS_Open OS_High
## 1   105.09  104.88  106.50 104.88     14330        1    93.42   94.91   96.00
## 2   103.35  103.51  104.56 102.46    140080        0    93.79   93.43   94.80
## 3   103.64  103.63  104.57 102.37    147880        1    94.09   93.77   94.43
## 4   106.73  104.30  107.27 103.91    170240        1    95.55   96.39   99.70
## 5   107.71  107.15  108.17 106.16    145090        1    99.01   97.54   99.26
## 6   107.89  107.92  108.50 107.29    111160        1    99.47   99.04  100.06
##   OS_Low OS_Trend SF_Price SF_Open SF_High SF_Low SF_Volume SF_Trend USB_Price
## 1  93.33        0    53604   54248   54248  52316    119440        1     1.911
## 2  92.53        1    53458   53650   54030  52890     65390        0     1.851
## 3  92.55        1    52961   53400   53400  52544     67280        0     1.810
## 4  96.39        1    53487   52795   53575  52595     55130        1     1.927
## 5  96.81        1    53148   53519   54184  52937     75950        0     1.970
## 6  98.52        1    52894   53000   53460  52690     56530        0     1.953
##   USB_Open USB_High USB_Low USB_Trend PLT_Price PLT_Open PLT_High PLT_Low
## 1    1.911    1.911   1.911         1   1414.65  1420.30  1423.35 1376.85
## 2    1.851    1.851   1.851         0   1420.25  1414.75  1431.75 1400.70
## 3    1.810    1.810   1.810         0   1411.10  1422.65  1427.60 1404.60
## 4    1.927    1.927   1.927         1   1434.75  1408.95  1436.55 1408.15
## 5    1.970    1.970   1.970         1   1429.05  1434.40  1453.75 1417.65
## 6    1.953    1.953   1.953         0   1425.90  1429.00  1435.25 1415.90
##   PLT_Trend PLD_Price PLD_Open PLD_High PLD_Low PLD_Trend RHO_PRICE USDI_Price
## 1         0    618.85   614.70   615.00  614.60         1      1425     80.341
## 2         1    623.65   622.60   623.45  622.30         1      1400     80.249
## 3         0    608.80   626.00   630.00  608.60         0      1400     80.207
## 4         1    626.65   622.45   622.45  622.45         1      1400     80.273
## 5         0    635.90   625.70   641.50  623.80         1      1400     80.350
## 6         0    652.45   632.00   650.80  632.00         1      1400     80.306
##   USDI_Open USDI_High USDI_Low USDI_Volume USDI_Trend GDX_Open GDX_High GDX_Low
## 1    80.565    80.630   80.130       22850          0    53.01    53.14   51.57
## 2    80.175    80.395   79.935       13150          0    52.50    53.18   52.04
## 3    80.300    80.470   80.125         970          0    52.49    52.55   51.03
## 4    80.890    80.940   80.035       22950          1    52.38    53.25   52.37
## 5    80.105    80.445   79.550       24140          1    53.15    53.43   52.42
## 6    80.360    80.505   79.955       13240          0    52.56    52.62   51.77
##   GDX_Adj.Close GDX_Volume USO_Open USO_High USO_Low USO_Adj.Close USO_Volume
## 1      48.97388   20605600    36.90    36.94   36.05         36.13   12616700
## 2      49.92151   16285400    36.18    36.50   35.73         36.27   12578800
## 3      48.49058   15120200    36.39    36.45   35.93         36.20    7418200
## 4      50.21528   11644900    37.30    37.61   37.22         37.56   10041600
## 5      50.18685    8724300    37.67    38.24   37.52         38.11   10728000
## 6      49.65618   13179600    38.27    38.55   38.21         38.28    5376800

3.3 Binary Classification Label

# To identify the gold price direction - Increase/ Remain Unchanged (1) or Decrease (0)
df$difference <- c(NA, diff(df$Adj.Close))
df$boolean <-ifelse(df$difference >= 0, 1, 0)
head(df)

##         Date   Open   High    Low Adj.Close   Volume SP_open SP_high SP_low
## 1 2011-12-15 154.74 154.95 151.71    152.33 21521900  123.03  123.20 121.99
## 2 2011-12-16 154.31 155.37 153.90    155.23 18124300  122.23  122.95 121.30
## 3 2011-12-19 155.48 155.86 154.36    154.87 12547200  122.06  122.32 120.03
## 4 2011-12-20 156.82 157.43 156.58    156.98  9136300  122.18  124.14 120.37
## 5 2011-12-21 156.98 157.53 156.13    157.16 11996100  123.93  124.36 122.75
## 6 2011-12-22 156.35 156.80 155.33    156.04  9888400  124.63  125.40 124.23
##   SP_Ajclose SP_volume  DJ_open  DJ_high   DJ_low DJ_Ajclose DJ_volume EG_open
## 1   105.4412 199109200 11825.29 11967.84 11825.22   11868.81 136930000   74.55
## 2   105.5975 220481400 11870.25 11968.18 11819.31   11866.39 389520000   73.60
## 3   104.4685 183903000 11866.54 11925.88 11735.19   11766.26 135170000   69.10
## 4   107.6298 225418100 11769.21 12117.13 11768.83   12103.58 165180000   66.45
## 5   107.8382 194230900 12103.58 12119.70 11999.44   12107.74 163250000   67.10
## 6   108.7935 119465400 12107.59 12182.71 12107.37   12169.65 151610000   68.40
##   EG_high EG_low EG_Ajclose EG_volume EU_Price EU_open EU_high EU_low EU_Trend
## 1   76.15  72.15   70.43176    787900   1.3018  1.2982  1.3051 1.2957        1
## 2   75.10  73.35   72.36404    896600   1.3035  1.3020  1.3087 1.2997        1
## 3   69.80  64.20   62.50938   2096700   1.2995  1.3043  1.3044 1.2981        0
## 4   68.10  66.00   64.73151    875300   1.3079  1.3003  1.3133 1.2994        1
## 5   69.40  66.90   66.18072    837600   1.3045  1.3079  1.3197 1.3024        0
## 6   70.00  66.70   67.38840    780600   1.3051  1.3047  1.3121 1.3018        1
##   OF_Price OF_Open OF_High OF_Low OF_Volume OF_Trend OS_Price OS_Open OS_High
## 1   105.09  104.88  106.50 104.88     14330        1    93.42   94.91   96.00
## 2   103.35  103.51  104.56 102.46    140080        0    93.79   93.43   94.80
## 3   103.64  103.63  104.57 102.37    147880        1    94.09   93.77   94.43
## 4   106.73  104.30  107.27 103.91    170240        1    95.55   96.39   99.70
## 5   107.71  107.15  108.17 106.16    145090        1    99.01   97.54   99.26
## 6   107.89  107.92  108.50 107.29    111160        1    99.47   99.04  100.06
##   OS_Low OS_Trend SF_Price SF_Open SF_High SF_Low SF_Volume SF_Trend USB_Price
## 1  93.33        0    53604   54248   54248  52316    119440        1     1.911
## 2  92.53        1    53458   53650   54030  52890     65390        0     1.851
## 3  92.55        1    52961   53400   53400  52544     67280        0     1.810
## 4  96.39        1    53487   52795   53575  52595     55130        1     1.927
## 5  96.81        1    53148   53519   54184  52937     75950        0     1.970
## 6  98.52        1    52894   53000   53460  52690     56530        0     1.953
##   USB_Open USB_High USB_Low USB_Trend PLT_Price PLT_Open PLT_High PLT_Low
## 1    1.911    1.911   1.911         1   1414.65  1420.30  1423.35 1376.85
## 2    1.851    1.851   1.851         0   1420.25  1414.75  1431.75 1400.70
## 3    1.810    1.810   1.810         0   1411.10  1422.65  1427.60 1404.60
## 4    1.927    1.927   1.927         1   1434.75  1408.95  1436.55 1408.15
## 5    1.970    1.970   1.970         1   1429.05  1434.40  1453.75 1417.65
## 6    1.953    1.953   1.953         0   1425.90  1429.00  1435.25 1415.90
##   PLT_Trend PLD_Price PLD_Open PLD_High PLD_Low PLD_Trend RHO_PRICE USDI_Price
## 1         0    618.85   614.70   615.00  614.60         1      1425     80.341
## 2         1    623.65   622.60   623.45  622.30         1      1400     80.249
## 3         0    608.80   626.00   630.00  608.60         0      1400     80.207
## 4         1    626.65   622.45   622.45  622.45         1      1400     80.273
## 5         0    635.90   625.70   641.50  623.80         1      1400     80.350
## 6         0    652.45   632.00   650.80  632.00         1      1400     80.306
##   USDI_Open USDI_High USDI_Low USDI_Volume USDI_Trend GDX_Open GDX_High GDX_Low
## 1    80.565    80.630   80.130       22850          0    53.01    53.14   51.57
## 2    80.175    80.395   79.935       13150          0    52.50    53.18   52.04
## 3    80.300    80.470   80.125         970          0    52.49    52.55   51.03
## 4    80.890    80.940   80.035       22950          1    52.38    53.25   52.37
## 5    80.105    80.445   79.550       24140          1    53.15    53.43   52.42
## 6    80.360    80.505   79.955       13240          0    52.56    52.62   51.77
##   GDX_Adj.Close GDX_Volume USO_Open USO_High USO_Low USO_Adj.Close USO_Volume
## 1      48.97388   20605600    36.90    36.94   36.05         36.13   12616700
## 2      49.92151   16285400    36.18    36.50   35.73         36.27   12578800
## 3      48.49058   15120200    36.39    36.45   35.93         36.20    7418200
## 4      50.21528   11644900    37.30    37.61   37.22         37.56   10041600
## 5      50.18685    8724300    37.67    38.24   37.52         38.11   10728000
## 6      49.65618   13179600    38.27    38.55   38.21         38.28    5376800
##   difference boolean
## 1         NA      NA
## 2   2.899994       1
## 3  -0.360001       0
## 4   2.110001       1
## 5   0.180008       1
## 6  -1.120011       0

3.4 Missing Values

The utilization of the ‘any(is.na())’ function has indicated the presence of missing values in the dataset. As a precaution to ensure data integrity, the initial step involved removing the first row that contained any missing value (NA).

# Handling Missing Data 
missing_values <- any(is.na(df))
missing_values

## [1] TRUE

# Display the rows with missing values
rows_with_missing <- df[apply(is.na(df), 1, any), ]
print(rows_with_missing)

##         Date   Open   High    Low Adj.Close   Volume SP_open SP_high SP_low
## 1 2011-12-15 154.74 154.95 151.71    152.33 21521900  123.03   123.2 121.99
##   SP_Ajclose SP_volume  DJ_open  DJ_high   DJ_low DJ_Ajclose DJ_volume EG_open
## 1   105.4412 199109200 11825.29 11967.84 11825.22   11868.81 136930000   74.55
##   EG_high EG_low EG_Ajclose EG_volume EU_Price EU_open EU_high EU_low EU_Trend
## 1   76.15  72.15   70.43176    787900   1.3018  1.2982  1.3051 1.2957        1
##   OF_Price OF_Open OF_High OF_Low OF_Volume OF_Trend OS_Price OS_Open OS_High
## 1   105.09  104.88   106.5 104.88     14330        1    93.42   94.91      96
##   OS_Low OS_Trend SF_Price SF_Open SF_High SF_Low SF_Volume SF_Trend USB_Price
## 1  93.33        0    53604   54248   54248  52316    119440        1     1.911
##   USB_Open USB_High USB_Low USB_Trend PLT_Price PLT_Open PLT_High PLT_Low
## 1    1.911    1.911   1.911         1   1414.65   1420.3  1423.35 1376.85
##   PLT_Trend PLD_Price PLD_Open PLD_High PLD_Low PLD_Trend RHO_PRICE USDI_Price
## 1         0    618.85    614.7      615   614.6         1      1425     80.341
##   USDI_Open USDI_High USDI_Low USDI_Volume USDI_Trend GDX_Open GDX_High GDX_Low
## 1    80.565     80.63    80.13       22850          0    53.01    53.14   51.57
##   GDX_Adj.Close GDX_Volume USO_Open USO_High USO_Low USO_Adj.Close USO_Volume
## 1      48.97388   20605600     36.9    36.94   36.05         36.13   12616700
##   difference boolean
## 1         NA      NA

# Display the resulting data frame without NAs
df <- na.omit(df)
head(df)

##         Date   Open   High    Low Adj.Close   Volume SP_open SP_high SP_low
## 2 2011-12-16 154.31 155.37 153.90    155.23 18124300  122.23  122.95 121.30
## 3 2011-12-19 155.48 155.86 154.36    154.87 12547200  122.06  122.32 120.03
## 4 2011-12-20 156.82 157.43 156.58    156.98  9136300  122.18  124.14 120.37
## 5 2011-12-21 156.98 157.53 156.13    157.16 11996100  123.93  124.36 122.75
## 6 2011-12-22 156.35 156.80 155.33    156.04  9888400  124.63  125.40 124.23
## 7 2011-12-23 156.35 156.49 155.82    156.31  3565100  125.67  126.43 125.41
##   SP_Ajclose SP_volume  DJ_open  DJ_high   DJ_low DJ_Ajclose DJ_volume EG_open
## 2   105.5975 220481400 11870.25 11968.18 11819.31   11866.39 389520000   73.60
## 3   104.4685 183903000 11866.54 11925.88 11735.19   11766.26 135170000   69.10
## 4   107.6298 225418100 11769.21 12117.13 11768.83   12103.58 165180000   66.45
## 5   107.8382 194230900 12103.58 12119.70 11999.44   12107.74 163250000   67.10
## 6   108.7935 119465400 12107.59 12182.71 12107.37   12169.65 151610000   68.40
## 7   109.7662  92187200 12169.88 12297.44 12169.80   12294.00  80420000   70.05
##   EG_high EG_low EG_Ajclose EG_volume EU_Price EU_open EU_high EU_low EU_Trend
## 2   75.10  73.35   72.36404    896600   1.3035  1.3020  1.3087 1.2997        1
## 3   69.80  64.20   62.50938   2096700   1.2995  1.3043  1.3044 1.2981        0
## 4   68.10  66.00   64.73151    875300   1.3079  1.3003  1.3133 1.2994        1
## 5   69.40  66.90   66.18072    837600   1.3045  1.3079  1.3197 1.3024        0
## 6   70.00  66.70   67.38840    780600   1.3051  1.3047  1.3121 1.3018        1
## 7   71.75  69.90   69.17577    621200   1.3043  1.3051  1.3098 1.3026        0
##   OF_Price OF_Open OF_High OF_Low OF_Volume OF_Trend OS_Price OS_Open OS_High
## 2   103.35  103.51  104.56 102.46    140080        0    93.79   93.43   94.80
## 3   103.64  103.63  104.57 102.37    147880        1    94.09   93.77   94.43
## 4   106.73  104.30  107.27 103.91    170240        1    95.55   96.39   99.70
## 5   107.71  107.15  108.17 106.16    145090        1    99.01   97.54   99.26
## 6   107.89  107.92  108.50 107.29    111160        1    99.47   99.04  100.06
## 7   107.96  107.77  108.55 107.40     64610        1    99.85   99.49  100.24
##   OS_Low OS_Trend SF_Price SF_Open SF_High SF_Low SF_Volume SF_Trend USB_Price
## 2  92.53        1    53458   53650   54030  52890     65390        0     1.851
## 3  92.55        1    52961   53400   53400  52544     67280        0     1.810
## 4  96.39        1    53487   52795   53575  52595     55130        1     1.927
## 5  96.81        1    53148   53519   54184  52937     75950        0     1.970
## 6  98.52        1    52894   53000   53460  52690     56530        0     1.953
## 7  99.25        1    53067   53000   53348  52960     27510        1     2.021
##   USB_Open USB_High USB_Low USB_Trend PLT_Price PLT_Open PLT_High PLT_Low
## 2    1.851    1.851   1.851         0   1420.25  1414.75  1431.75 1400.70
## 3    1.810    1.810   1.810         0   1411.10  1422.65  1427.60 1404.60
## 4    1.927    1.927   1.927         1   1434.75  1408.95  1436.55 1408.15
## 5    1.970    1.970   1.970         1   1429.05  1434.40  1453.75 1417.65
## 6    1.953    1.953   1.953         0   1425.90  1429.00  1435.25 1415.90
## 7    2.021    2.021   2.021         1   1426.45  1425.55  1439.30 1424.20
##   PLT_Trend PLD_Price PLD_Open PLD_High PLD_Low PLD_Trend RHO_PRICE USDI_Price
## 2         1    623.65   622.60   623.45  622.30         1      1400     80.249
## 3         0    608.80   626.00   630.00  608.60         0      1400     80.207
## 4         1    626.65   622.45   622.45  622.45         1      1400     80.273
## 5         0    635.90   625.70   641.50  623.80         1      1400     80.350
## 6         0    652.45   632.00   650.80  632.00         1      1400     80.306
## 7         1    665.55   664.95   665.05  664.95         1      1400     80.236
##   USDI_Open USDI_High USDI_Low USDI_Volume USDI_Trend GDX_Open GDX_High GDX_Low
## 2    80.175    80.395   79.935       13150          0    52.50    53.18   52.04
## 3    80.300    80.470   80.125         970          0    52.49    52.55   51.03
## 4    80.890    80.940   80.035       22950          1    52.38    53.25   52.37
## 5    80.105    80.445   79.550       24140          1    53.15    53.43   52.42
## 6    80.360    80.505   79.955       13240          0    52.56    52.62   51.77
## 7    80.210    80.375   80.060        7170          0    52.59    52.79   52.35
##   GDX_Adj.Close GDX_Volume USO_Open USO_High USO_Low USO_Adj.Close USO_Volume
## 2      49.92151   16285400    36.18    36.50   35.73         36.27   12578800
## 3      48.49058   15120200    36.39    36.45   35.93         36.20    7418200
## 4      50.21528   11644900    37.30    37.61   37.22         37.56   10041600
## 5      50.18685    8724300    37.67    38.24   37.52         38.11   10728000
## 6      49.65618   13179600    38.27    38.55   38.21         38.28    5376800
## 7      50.16936    4729000    38.47    38.61   38.37         38.49    2162200
##   difference boolean
## 2   2.899994       1
## 3  -0.360001       0
## 4   2.110001       1
## 5   0.180008       1
## 6  -1.120011       0
## 7   0.270005       1

3.5 Duplicate Validation

The execution of the ‘sum(duplicated())’ code revealed that the dataset contains no duplicates, ensuring the uniqueness of each record.

# Check Duplicates
num_duplicates <- sum(duplicated(df))
num_duplicates

## [1] 0

3.6 Data Types Checking

As indicated in Section 3.1, all variables, except for the ‘Date’ column, are in the correct data types. Therefore, no additional conversion is necessary. This ensures that all numeric variables are appropriately formatted for numerical analysis and modeling.

# Convert Date column from chr to date 
df$Date <- as.Date(df$Date,format="%Y-%m-%d")
class(df$Date)

## [1] "Date"

# Check the class of numeric columns
numeric_cols <- sapply(df, is.numeric)
print(numeric_cols)

##          Date          Open          High           Low     Adj.Close 
##         FALSE          TRUE          TRUE          TRUE          TRUE 
##        Volume       SP_open       SP_high        SP_low    SP_Ajclose 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##     SP_volume       DJ_open       DJ_high        DJ_low    DJ_Ajclose 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##     DJ_volume       EG_open       EG_high        EG_low    EG_Ajclose 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##     EG_volume      EU_Price       EU_open       EU_high        EU_low 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##      EU_Trend      OF_Price       OF_Open       OF_High        OF_Low 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##     OF_Volume      OF_Trend      OS_Price       OS_Open       OS_High 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##        OS_Low      OS_Trend      SF_Price       SF_Open       SF_High 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##        SF_Low     SF_Volume      SF_Trend     USB_Price      USB_Open 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##      USB_High       USB_Low     USB_Trend     PLT_Price      PLT_Open 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##      PLT_High       PLT_Low     PLT_Trend     PLD_Price      PLD_Open 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##      PLD_High       PLD_Low     PLD_Trend     RHO_PRICE    USDI_Price 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##     USDI_Open     USDI_High      USDI_Low   USDI_Volume    USDI_Trend 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##      GDX_Open      GDX_High       GDX_Low GDX_Adj.Close    GDX_Volume 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##      USO_Open      USO_High       USO_Low USO_Adj.Close    USO_Volume 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##    difference       boolean 
##          TRUE          TRUE

4.0 Exploration Data Analysis (EDA)

EDA involves investigating and understanding the main characteristics of the dataset. The purpose of EDA is to gain insights, identify patterns and uncover potential relationships within data.

4.1 Statistical Summary

Displaying a summary of statistical characteristics for the dataset provides an overview of central tendency, minimum and maximum values, and quartiles for each numerical variable.

# Summary statistics
summary(df)

##       Date                 Open            High            Low       
##  Min.   :2011-12-16   Min.   :100.9   Min.   :101.0   Min.   :100.2  
##  1st Qu.:2013-10-04   1st Qu.:116.2   1st Qu.:116.5   1st Qu.:115.7  
##  Median :2015-07-20   Median :121.9   Median :122.3   Median :121.4  
##  Mean   :2015-07-07   Mean   :127.3   Mean   :127.8   Mean   :126.8  
##  3rd Qu.:2017-04-10   3rd Qu.:128.4   3rd Qu.:129.1   3rd Qu.:127.8  
##  Max.   :2018-12-31   Max.   :173.2   Max.   :174.1   Max.   :172.9  
##    Adj.Close         Volume            SP_open         SP_high     
##  Min.   :100.5   Min.   : 1501600   Min.   :122.1   Min.   :122.3  
##  1st Qu.:116.0   1st Qu.: 5408900   1st Qu.:170.5   1st Qu.:171.0  
##  Median :121.8   Median : 7483200   Median :205.5   Median :206.5  
##  Mean   :127.3   Mean   : 8438712   Mean   :204.5   Mean   :205.4  
##  3rd Qu.:128.5   3rd Qu.:10207800   3rd Qu.:237.3   3rd Qu.:237.8  
##  Max.   :173.6   Max.   :93804200   Max.   :293.1   Max.   :293.9  
##      SP_low        SP_Ajclose      SP_volume            DJ_open     
##  Min.   :120.0   Min.   :104.5   Min.   : 27856500   Min.   :11769  
##  1st Qu.:169.9   1st Qu.:153.1   1st Qu.: 73869000   1st Qu.:15497  
##  Median :204.4   Median :191.7   Median : 99714400   Median :17602  
##  Mean   :203.5   Mean   :192.3   Mean   :109750612   Mean   :18165  
##  3rd Qu.:236.2   3rd Qu.:228.8   3rd Qu.:135101400   3rd Qu.:20868  
##  Max.   :291.8   Max.   :290.6   Max.   :507244300   Max.   :26833  
##     DJ_high          DJ_low        DJ_Ajclose      DJ_volume        
##  Min.   :11926   Min.   :11735   Min.   :11766   Min.   :  8410000  
##  1st Qu.:15555   1st Qu.:15422   1st Qu.:15498   1st Qu.: 92290000  
##  Median :17715   Median :17512   Median :17614   Median :120670000  
##  Mean   :18248   Mean   :18078   Mean   :18168   Mean   :177937001  
##  3rd Qu.:20914   3rd Qu.:20786   3rd Qu.:20856   3rd Qu.:263780000  
##  Max.   :26952   Max.   :26789   Max.   :26828   Max.   :900510000  
##     EG_open         EG_high          EG_low        EG_Ajclose   
##  Min.   : 2.77   Min.   : 2.85   Min.   : 2.73   Min.   : 2.80  
##  1st Qu.:14.20   1st Qu.:14.55   1st Qu.:13.75   1st Qu.:14.08  
##  Median :22.75   Median :23.10   Median :21.75   Median :22.38  
##  Mean   :28.25   Mean   :28.79   Mean   :27.63   Mean   :27.76  
##  3rd Qu.:37.15   3rd Qu.:37.85   3rd Qu.:36.45   3rd Qu.:36.77  
##  Max.   :80.20   Max.   :81.00   Max.   :77.90   Max.   :78.00  
##    EG_volume           EU_Price        EU_open         EU_high     
##  Min.   :  164500   Min.   :1.039   Min.   :1.039   Min.   :1.042  
##  1st Qu.:  700600   1st Qu.:1.121   1st Qu.:1.121   1st Qu.:1.126  
##  Median :  969600   Median :1.184   Median :1.184   Median :1.188  
##  Mean   : 1136276   Mean   :1.208   Mean   :1.208   Mean   :1.213  
##  3rd Qu.: 1344900   3rd Qu.:1.306   3rd Qu.:1.306   3rd Qu.:1.310  
##  Max.   :10061200   Max.   :1.393   Max.   :1.393   Max.   :1.399  
##      EU_low         EU_Trend         OF_Price         OF_Open      
##  Min.   :1.034   Min.   :0.0000   Min.   : 27.88   Min.   : 27.99  
##  1st Qu.:1.116   1st Qu.:0.0000   1st Qu.: 52.15   1st Qu.: 52.19  
##  Median :1.179   Median :0.0000   Median : 70.11   Median : 70.05  
##  Mean   :1.204   Mean   :0.4945   Mean   : 77.49   Mean   : 77.51  
##  3rd Qu.:1.300   3rd Qu.:1.0000   3rd Qu.:107.73   3rd Qu.:107.69  
##  Max.   :1.391   Max.   :1.0000   Max.   :126.22   Max.   :126.16  
##     OF_High           OF_Low         OF_Volume         OF_Trend     
##  Min.   : 28.75   Min.   : 27.10   Min.   : 11520   Min.   :0.0000  
##  1st Qu.: 52.86   1st Qu.: 51.36   1st Qu.:176060   1st Qu.:0.0000  
##  Median : 70.64   Median : 69.00   Median :223490   Median :0.0000  
##  Mean   : 78.37   Mean   : 76.58   Mean   :226082   Mean   :0.4985  
##  3rd Qu.:108.46   3rd Qu.:106.87   3rd Qu.:281530   3rd Qu.:1.0000  
##  Max.   :128.40   Max.   :125.00   Max.   :567760   Max.   :1.0000  
##     OS_Price         OS_Open          OS_High           OS_Low      
##  Min.   : 26.55   Min.   : 27.34   Min.   : 27.61   Min.   : 26.18  
##  1st Qu.: 48.92   1st Qu.: 49.03   1st Qu.: 49.69   1st Qu.: 48.20  
##  Median : 64.61   Median : 64.76   Median : 65.56   Median : 63.73  
##  Mean   : 70.14   Mean   : 70.26   Mean   : 71.11   Mean   : 69.32  
##  3rd Qu.: 94.31   3rd Qu.: 94.41   3rd Qu.: 95.36   3rd Qu.: 93.48  
##  Max.   :110.30   Max.   :110.34   Max.   :112.28   Max.   :109.15  
##     OS_Trend         SF_Price        SF_Open         SF_High     
##  Min.   :0.0000   Min.   :33170   Min.   :33146   Min.   :33566  
##  1st Qu.:0.0000   1st Qu.:38016   1st Qu.:38025   1st Qu.:38293  
##  Median :1.0000   Median :40517   Median :40506   Median :40840  
##  Mean   :0.5038   Mean   :43278   Mean   :43302   Mean   :43665  
##  3rd Qu.:1.0000   3rd Qu.:46573   3rd Qu.:46626   3rd Qu.:47071  
##  Max.   :1.0000   Max.   :65292   Max.   :65400   Max.   :65723  
##      SF_Low        SF_Volume         SF_Trend        USB_Price    
##  Min.   :32626   Min.   :    40   Min.   :0.0000   Min.   :1.358  
##  1st Qu.:37687   1st Qu.: 14210   1st Qu.:0.0000   1st Qu.:1.905  
##  Median :40225   Median : 19640   Median :0.0000   Median :2.259  
##  Mean   :42906   Mean   : 26859   Mean   :0.4805   Mean   :2.263  
##  3rd Qu.:46125   3rd Qu.: 29900   3rd Qu.:1.0000   3rd Qu.:2.597  
##  Max.   :64132   Max.   :203730   Max.   :1.0000   Max.   :3.239  
##     USB_Open        USB_High        USB_Low        USB_Trend     
##  Min.   :1.366   Min.   :1.391   Min.   :1.321   Min.   :0.0000  
##  1st Qu.:1.905   1st Qu.:1.921   1st Qu.:1.886   1st Qu.:0.0000  
##  Median :2.259   Median :2.290   Median :2.231   Median :0.0000  
##  Mean   :2.263   Mean   :2.287   Mean   :2.239   Mean   :0.4898  
##  3rd Qu.:2.598   3rd Qu.:2.620   3rd Qu.:2.575   3rd Qu.:1.0000  
##  Max.   :3.237   Max.   :3.261   Max.   :3.231   Max.   :1.0000  
##    PLT_Price         PLT_Open         PLT_High         PLT_Low      
##  Min.   : 775.6   Min.   : 765.3   Min.   : 786.5   Min.   : 756.0  
##  1st Qu.: 944.0   1st Qu.: 944.0   1st Qu.: 952.6   1st Qu.: 935.9  
##  Median :1097.2   Median :1097.6   Median :1107.4   Median :1086.0  
##  Mean   :1183.8   Mean   :1184.3   Mean   :1194.1   Mean   :1173.3  
##  3rd Qu.:1443.0   3rd Qu.:1443.0   3rd Qu.:1454.3   3rd Qu.:1432.2  
##  Max.   :1737.6   Max.   :1737.8   Max.   :1742.9   Max.   :1717.2  
##    PLT_Trend        PLD_Price         PLD_Open         PLD_High     
##  Min.   :0.0000   Min.   : 470.4   Min.   : 458.6   Min.   : 473.1  
##  1st Qu.:0.0000   1st Qu.: 663.5   1st Qu.: 664.1   1st Qu.: 670.2  
##  Median :0.0000   Median : 748.3   Median : 748.0   Median : 753.8  
##  Mean   :0.4846   Mean   : 766.9   Mean   : 766.5   Mean   : 773.6  
##  3rd Qu.:1.0000   3rd Qu.: 848.4   3rd Qu.: 847.0   3rd Qu.: 855.8  
##  Max.   :1.0000   Max.   :1197.5   Max.   :1196.0   Max.   :1208.7  
##     PLD_Low         PLD_Trend        RHO_PRICE      USDI_Price    
##  Min.   : 458.6   Min.   :0.0000   Min.   :   0   Min.   : 78.30  
##  1st Qu.: 657.1   1st Qu.:0.0000   1st Qu.: 785   1st Qu.: 81.38  
##  Median : 742.5   Median :1.0000   Median :1100   Median : 92.88  
##  Mean   : 759.5   Mean   :0.5306   Mean   :1130   Mean   : 89.81  
##  3rd Qu.: 840.1   3rd Qu.:1.0000   3rd Qu.:1300   3rd Qu.: 96.11  
##  Max.   :1183.6   Max.   :1.0000   Max.   :2600   Max.   :103.29  
##    USDI_Open        USDI_High         USDI_Low       USDI_Volume    
##  Min.   : 78.22   Min.   : 78.64   Min.   : 78.12   Min.   :    60  
##  1st Qu.: 81.38   1st Qu.: 81.62   1st Qu.: 81.11   1st Qu.: 18130  
##  Median : 92.91   Median : 93.17   Median : 92.58   Median : 24450  
##  Mean   : 89.81   Mean   : 90.10   Mean   : 89.51   Mean   : 27571  
##  3rd Qu.: 96.11   3rd Qu.: 96.47   3rd Qu.: 95.73   3rd Qu.: 33750  
##  Max.   :103.35   Max.   :103.81   Max.   :102.97   Max.   :142820  
##    USDI_Trend        GDX_Open        GDX_High        GDX_Low     
##  Min.   :0.0000   Min.   :12.70   Min.   :12.92   Min.   :12.40  
##  1st Qu.:0.0000   1st Qu.:20.64   1st Qu.:20.95   1st Qu.:20.35  
##  Median :1.0000   Median :23.11   Median :23.37   Median :22.87  
##  Mean   :0.5131   Mean   :26.73   Mean   :27.06   Mean   :26.37  
##  3rd Qu.:1.0000   3rd Qu.:27.40   3rd Qu.:27.77   3rd Qu.:26.79  
##  Max.   :1.0000   Max.   :57.52   Max.   :57.94   Max.   :56.77  
##  GDX_Adj.Close     GDX_Volume           USO_Open        USO_High    
##  Min.   :12.27   Min.   :  4729000   Min.   : 7.82   Min.   : 8.03  
##  1st Qu.:20.17   1st Qu.: 22679700   1st Qu.:11.42   1st Qu.:11.50  
##  Median :22.68   Median : 37314000   Median :16.44   Median :16.49  
##  Mean   :25.91   Mean   : 43578526   Mean   :22.10   Mean   :22.30  
##  3rd Qu.:26.48   3rd Qu.: 56978800   3rd Qu.:34.42   3rd Qu.:34.66  
##  Max.   :54.62   Max.   :232153600   Max.   :41.60   Max.   :42.30  
##     USO_Low      USO_Adj.Close     USO_Volume          difference      
##  Min.   : 7.67   Min.   : 7.96   Min.   :  1035100   Min.   :-12.6400  
##  1st Qu.:11.30   1st Qu.:11.39   1st Qu.:  6223500   1st Qu.: -0.6300  
##  Median :16.02   Median :16.34   Median : 16144000   Median :  0.0000  
##  Mean   :21.90   Mean   :22.10   Mean   : 19226981   Mean   : -0.0181  
##  3rd Qu.:34.08   3rd Qu.:34.41   3rd Qu.: 26727200   3rd Qu.:  0.5800  
##  Max.   :41.30   Max.   :42.01   Max.   :110265700   Max.   :  9.8000  
##     boolean      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :1.0000  
##  Mean   :0.5038  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

4.2 Gold Price Trends

# To visualize the gold price over time
ggplot(df, aes(x = Date, y = Adj.Close)) + geom_line(color = "blue") + labs(title = "Gold Price Over Time", x = "Date", y = "Gold Close Price") + theme_minimal()

From the time series chart, gold price is dropping from 2012 and the lowest gold price year is 2016.

4.3 Distribution of Gold Prices

#Plot histogram of gold prices
hist(df$Adj.Close, breaks = 30, col='orange', border='black',
     main='Histogram of Gold Prices', xlab='Gold Close Price', ylab='Frequency')

4.4 Distribution of Gold Price Direction

# Summary of boolean
class_counts <- table(df$boolean)

# Convert the table to a data frame for ggplot
class_counts_df <- as.data.frame(class_counts)

# Calculate percentages
class_counts_df$percentage <- prop.table(class_counts) * 100

# Create a bar chart with labels
ggplot(class_counts_df, aes(x = Var1, y = Freq)) +
  geom_bar(stat = "identity", position = "dodge", fill = "blue") +
  geom_text(aes(label = paste0(Freq, " (", round(percentage), "%)")), 
            position = position_dodge(width = 0.9), vjust = -0.5) +
  labs(title = "Distribution of Gold Price Direction", x = "Boolean", y = "Count")

The distribution with approximately 50% in each category suggests a balanced representation between the two classes.

4.5 Correlation

correlation_matrix <- cor(df[, !colnames(df) %in% "Date"])
corrplot(correlation_matrix, 
         method = "color", 
         col = colorRampPalette(c("blue", "white", "red"))(20), 
         tl.col = "black", 
         tl.srt = 90, 
         tl.cex = 0.5)

# Set a correlation threshold
threshold <- 0.8

# Find attributes highly positively correlated with "Gold Close Price"
positively_correlated_vars <- names(which(correlation_matrix["Adj.Close", ] > threshold))
positively_correlated_vars <- setdiff(positively_correlated_vars, "Adj.Close")

# Display the highly correlated attributes
cat("Attributes highly positively correlated with Gold Close Price:\n")

## Attributes highly positively correlated with Gold Close Price:

print(positively_correlated_vars)

##  [1] "Open"          "High"          "Low"           "EG_open"      
##  [5] "EG_high"       "EG_low"        "EG_Ajclose"    "SF_Price"     
##  [9] "SF_Open"       "SF_High"       "SF_Low"        "GDX_Open"     
## [13] "GDX_High"      "GDX_Low"       "GDX_Adj.Close"

Attributes with a correlation coefficient greater than 0.8, indicating a strong positive relationship with the adjusted closing price. The analysis reveals that 15 out of 75 attributes exhibit high correlation with the adjusted closing gold price.

There are “Open”, “High”, “Low” of Gold ETF; “EG_open”, “EG_high”, “EG_low”, “EG_Ajclose” of Eldorado Gold Corporation; “SF_Price”, “SF_Open”, “SF_High”, “SF_Low” of Silver Futures; and “GDX_Open”, “GDX_High”, “GDX_Low”, “GDX_Close”, “GDX_Adj.Close” of Gold Miners ETF.

# Set a correlation threshold
threshold_negative <- -0.8

# Find attributes highly negatively correlated with "Gold Close Price"
negatively_correlated_vars <- names(which(correlation_matrix["Adj.Close", ] < threshold_negative))
negatively_correlated_vars <- setdiff(negatively_correlated_vars, "Adj.Close")

# Display the highly correlated attributes
cat("Attributes highly negatively correlated with Gold Close Price:\n")

## Attributes highly negatively correlated with Gold Close Price:

print(negatively_correlated_vars)

## character(0)

Attributes with a correlation coefficient lower than -0.8, indicating a substantial negative relationship with the adjusted closing price. The result reveal that there are no attributes demonstrating a strong negative correlation with the adjusted closing price.

5.0 Modeling and Evaluation

Machine learning algorithms are employed for the analysis and prediction of outcomes. Random Forest model was utilized due its versatility in handling both regression and classification tasks. In addition, basic models like linear regression and logistic regression are integrated for a comprehensive assessment of predictive performance. This inclusion allows for a comparative analysis of the Random Forest model against more traditional regression approaches. The evaluation of model performance is conducted using standard metrics such as R-squared, Root Mean Squared Error (RMSE), Mean Absolute Error (MAE) for regression models, and accuracy, precision, recall, and F1 score for classification models.

5.1 Regression Model

Split the data into training and testing sets.

set.seed(123)  
train_index <- sample(1:nrow(df), 0.8 * nrow(df))
train_data <- df[train_index, ]
test_data <- df[-train_index, ]

Feature Selection

# Extract features and target
target_column <- "Adj.Close"

features <- train_data[, !colnames(train_data) %in% c("Date", "difference","boolean", target_column)]
target <- train_data[[target_column]]

The “importance = TRUE” argument indicates that feature importance scores should be calculated during the training process.

# Train a Random Forest Model
rf_model <- randomForest(features, y = target, importance = TRUE)

%IncMSE is a metric that reflects the contribution of each feature to the reduction in mean squared error during the building of decision trees within the Random Forest. It provides a useful way to identify and prioritize the features that have the most significant impact on the model’s predictive performance.

# Extract and visualize feature importance
importance_scores <- importance(rf_model)

importance_df <- data.frame(
  Feature = rownames(importance_scores),
  Importance = importance_scores[, "%IncMSE"] 
)

importance_df <- importance_df[order(-importance_df$Importance), ]

# The importance score of each feature.
ggplot(importance_df, aes(x = Importance, y = reorder(Feature, Importance))) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Random Forest Feature Importance", x = "Importance", y = "Feature") +
  theme(axis.text.y = element_text(size = 5, hjust = 0))

# The Top 10 importance features
top_10_importance <- head(importance_df, 10)

ggplot(top_10_importance, aes(x = Importance, y = reorder(Feature, Importance))) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Top 10 Random Forest Feature Importance", x = "Importance", y = "Feature") +
  theme(axis.text.y = element_text(size = 8, hjust = 0))

# Select top 10 important features
top_features <- importance_df$Feature[1:10]
print(top_features)

##  [1] "High"       "Low"        "Open"       "SF_High"    "PLD_Price" 
##  [6] "GDX_Volume" "USB_Price"  "USB_Low"    "SF_Low"     "PLT_Trend"

# Create new dataset
selected_train_data <- train_data[, c("Adj.Close", top_features)]
selected_test_data <- test_data[, c("Adj.Close", top_features)]

(a) Random Forest Regression

# Train the Random Forest Regression Model.
model_target_column <- 1 #Adj_Close
model_features <- selected_train_data[, -model_target_column]
model_target <- selected_train_data[[model_target_column]]

rf_model_regression <- randomForest(model_features, y = model_target)

# Make predictions on the test set.
predictions <- predict(rf_model_regression, newdata = selected_test_data[, -model_target_column])

cat("Predictions on Test Data:\n")

## Predictions on Test Data:

print(predictions)

##        4        8       16       22       23       44       48       51 
## 157.0033 155.5218 156.5340 160.4026 160.7613 169.9824 171.5362 167.0580 
##       54       58       63       66       67       86       98      100 
## 162.0506 165.0898 161.4969 158.7568 161.5459 159.3583 156.5779 155.3955 
##      102      108      110      111      112      114      123      124 
## 151.8649 153.1618 152.2342 152.5923 152.4697 152.3515 157.5037 157.0869 
##      126      129      130      132      134      141      147      148 
## 157.2275 153.1192 152.4355 152.9328 151.9537 154.0473 153.2261 153.7193 
##      170      174      177      182      183      184      189      193 
## 158.6250 161.7843 161.1424 168.4435 168.2818 168.5639 171.3006 170.9323 
##      204      220      221      223      233      234      245      246 
## 171.1903 165.0669 166.0221 167.9466 169.3908 166.0451 164.3795 164.3408 
##      247      248      264      274      275      284      285      289 
## 164.4159 163.3586 161.3208 160.7360 161.2053 159.7396 159.3781 156.2535 
##      296      301      302      303      308      309      314      318 
## 153.4753 153.1671 152.6199 153.4046 155.5503 155.7474 154.6521 152.4035 
##      321      323      325      330      337      339      341      342 
## 151.5171 152.5017 151.1730 141.9454 137.9297 133.3414 131.3833 132.0258 
##      346      352      355      362      363      369      370      371 
## 134.0351 135.5970 133.4120 131.8745 124.9557 120.9571 117.3066 119.4095 
##      372      378      389      396      400      405      409      413 
## 120.7941 124.2562 126.8717 127.9671 132.3265 136.1943 136.0796 134.1764 
##      415      419      440      441      443      444      455      473 
## 132.1586 126.9607 127.3221 126.9811 128.9918 128.8131 124.0693 118.1027 
##      477      484      487      492      496      497      498      502 
## 121.1903 117.9611 120.6592 119.6939 121.3126 120.8439 121.9611 121.4157 
##      506      508      512      513      515      528      533      544 
## 123.9578 125.1160 127.2302 127.6555 128.9257 132.1816 126.5719 126.1455 
##      547      551      557      560      562      563      564      569 
## 126.9906 125.0734 124.9695 126.2499 124.8045 124.2152 124.2211 124.4066 
##      571      572      574      576      586      591      597      601 
## 124.4118 124.0923 124.4386 121.2945 121.3796 122.4930 126.6546 127.8846 
##      605      608      635      645      653      662      664      677 
## 127.7498 125.6909 122.7536 121.9013 118.2101 116.6356 115.0421 118.2553 
##      690      696      705      719      722      728      732      736 
## 111.9344 114.2219 115.8759 112.8013 113.8100 116.3082 118.3702 124.1022 
##      740      749      750      760      771      774      778      791 
## 124.1795 118.9317 118.4979 115.5673 110.8991 110.2903 113.9350 115.7269 
##      796      801      803      806      810      811      812      824 
## 115.4417 113.7058 115.8504 112.5984 113.7304 113.9993 114.0543 113.7556 
##      826      829      838      839      842      846      850      852 
## 114.8147 112.8883 113.4535 115.2191 112.9224 112.6761 112.0852 111.3270 
##      854      859      871      872      876      882      885      907 
## 111.4289 108.5674 104.3233 104.3085 106.2840 108.0104 110.9892 109.9552 
##      909      912      917      922      926      934      956      963 
## 108.6579 107.2844 110.9426 113.0672 111.7731 107.5709 103.2488 101.7840 
##      965      967      968      973      976      979      997     1001 
## 101.7884 102.9589 102.8979 101.6508 102.9612 105.1002 110.6362 119.6274 
##     1004     1006     1009     1013     1021     1028     1036     1040 
## 115.5039 117.5822 118.7790 118.4656 120.2405 119.5488 116.1019 118.3528 
##     1041     1042     1046     1049     1050     1051     1062     1063 
## 119.6009 119.7801 117.9885 119.8103 118.2950 118.4495 120.6253 121.6575 
##     1065     1072     1085     1086     1090     1091     1098     1102 
## 121.2337 117.6252 122.4884 122.6454 122.7164 121.2159 126.0704 129.8504 
##     1107     1111     1113     1115     1117     1127     1137     1148 
## 127.1492 125.7788 126.5814 126.1944 127.6909 128.1722 127.2732 126.2274 
##     1149     1151     1157     1161     1165     1170     1171     1172 
## 126.3342 125.0018 127.7288 126.2877 119.6468 120.1225 119.6583 119.7322 
##     1174     1176     1178     1180     1185     1195     1202     1208 
## 121.0758 120.6705 121.2151 120.9997 124.2556 115.6194 111.4122 111.0372 
##     1211     1219     1223     1240     1253     1255     1261     1270 
## 110.3393 108.3926 110.2245 113.8927 117.9927 117.5434 118.4054 114.6270 
##     1273     1281     1284     1290     1291     1297     1299     1301 
## 116.9718 119.1332 118.9452 120.8141 121.6288 122.1911 120.4954 120.2923 
##     1306     1313     1316     1319     1320     1329     1337     1338 
## 116.8723 117.2545 119.2929 119.6880 119.4118 122.5550 118.5673 118.1780 
##     1340     1341     1346     1349     1350     1351     1357     1359 
## 118.8748 119.5223 118.0777 116.3552 115.3295 115.1482 118.0292 118.0980 
##     1360     1361     1362     1366     1375     1379     1381     1393 
## 119.0394 119.4588 118.9418 120.6561 122.4354 122.9573 122.3813 127.8806 
##     1395     1398     1403     1404     1405     1419     1420     1421 
## 126.1758 125.7438 123.0921 123.4720 123.3431 121.9841 121.6972 122.2442 
##     1423     1424     1425     1428     1442     1448     1458     1459 
## 121.4114 121.3436 121.2023 120.8007 122.4128 122.7994 117.6743 118.3564 
##     1463     1464     1465     1467     1473     1478     1479     1480 
## 119.7625 120.0588 120.1512 121.6252 125.3655 125.5177 126.2849 126.8266 
##     1494     1495     1496     1504     1516     1520     1539     1542 
## 125.9890 125.1523 124.8925 126.0960 125.2962 125.0185 126.9808 127.3823 
##     1544     1550     1551     1563     1565     1566     1567     1571 
## 127.4448 125.4123 124.5971 122.4115 122.3819 122.2937 122.6062 123.0202 
##     1573     1579     1585     1589     1603     1607     1610     1618 
## 123.3740 122.9500 121.2428 120.1591 117.5695 115.6227 116.0630 115.1348 
##     1623     1633     1636     1638     1643     1651     1654     1656 
## 114.8548 114.3849 113.5825 112.9148 112.9659 113.2322 113.2685 112.5505 
##     1663     1669     1680     1684     1686     1702     1710     1715 
## 112.6428 115.8288 116.5279 115.7210 113.7905 117.3106 117.9432 120.4212

Evaluate model performance.

actual_values <- selected_test_data$Adj.Close

rsquared <- cor(predictions, actual_values)^2
rmse <- rmse(predictions, actual_values)
mae <- mae(predictions, actual_values)

cat("R-squared:", rsquared, "\n")

## R-squared: 0.9991521

cat("Root Mean Squared Error (RMSE):", rmse, "\n")

## Root Mean Squared Error (RMSE): 0.4944462

cat("Mean Absolute Error (MAE):", mae, "\n")

## Mean Absolute Error (MAE): 0.3343596

R-squared measures the proportion of the variance in the dependent variable (target) that is explained by the independent variables (features). An R-squared value close to 1 indicates a model that explains a high percentage of the variance in the target variable. In this case, R-squared is very close to 1 (99.9%), suggesting an excellent fit.

RMSE provides a measure of the average magnitude of the errors between predicted and actual values. A lower RMSE indicates better model performance. In this case, an RMSE of 0.4944462 suggests that, on average, the model’s predictions deviate by approximately 0.49 units from the actual values.

MAE is another measure of the average magnitude of errors but without squaring them. Similar to RMSE, a lower MAE indicates better performance. The MAE of 0.3343596 suggests that, on average, the absolute difference between predicted and actual values is approximately 0.33 units.

In summary, the output indicates that the Random Forest regression model has performed exceptionally well on the test dataset, with very high R-squared and low RMSE and MAE values. This suggests that the model’s predictions closely align with the actual values in the test set.

Visualize the evaluation result

prediction_df <- data.frame(Actual = actual_values, Predicted = predictions)

ggplot(prediction_df, aes(x = Actual, y = Predicted)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
  labs(title = "Predicted vs. Actual", x = "Actual", y = "Predicted")

The scatter plot visualizes the relationship between the actual and predicted values. Each point on the plot represents an observation in the test set. The dashed red line represents perfect predictions, where actual values equal predicted values.When points are close to the red dashed line, it indicates that the model is making accurate predictions.

(b) Multivariate Linear Regression

#create the Multivariate linear regression model with top 10 features 
regression_model <- lm(Adj.Close ~., data = selected_test_data)
regression_model

## 
## Call:
## lm(formula = Adj.Close ~ ., data = selected_test_data)
## 
## Coefficients:
## (Intercept)         High          Low         Open      SF_High    PLD_Price  
##   3.717e-01    7.393e-01    7.694e-01   -5.141e-01   -9.468e-07    1.961e-04  
##  GDX_Volume    USB_Price      USB_Low       SF_Low    PLT_Trend  
##  -1.077e-09   -1.233e+00    1.136e+00    1.107e-05    7.449e-02

# Make predictions on the test set
regression_predictions <- predict(regression_model , selected_test_data)
regression_predictions

##        4        8       16       22       23       44       48       51 
## 157.1331 155.0758 156.6639 160.9708 160.7969 170.6705 172.2521 167.0736 
##       54       58       63       66       67       86       98      100 
## 162.3405 164.8640 161.8622 159.2307 161.3769 159.7788 155.3895 155.0576 
##      102      108      110      111      112      114      123      124 
## 151.4745 152.6383 151.2542 152.4960 151.1135 151.4872 156.8183 156.8657 
##      126      129      130      132      134      141      147      148 
## 157.3267 152.6250 151.9061 152.5550 150.7518 152.5990 153.1802 153.6514 
##      170      174      177      182      183      184      189      193 
## 159.0661 161.8532 160.3561 168.4695 167.6531 168.2486 171.5633 171.1576 
##      204      220      221      223      233      234      245      246 
## 171.7453 165.8665 165.6504 168.0333 168.9755 166.4886 164.6116 164.4140 
##      247      248      264      274      275      284      285      289 
## 164.5287 161.7853 160.5613 160.4489 161.0741 159.5484 159.6105 155.1738 
##      296      301      302      303      308      309      314      318 
## 152.8472 152.7795 152.7844 152.9715 155.3512 155.9760 154.6687 152.6157 
##      321      323      325      330      337      339      341      342 
## 152.3157 153.5050 151.4577 142.2099 138.6821 134.0146 134.1949 132.7225 
##      346      352      355      362      363      369      370      371 
## 134.4668 135.8587 133.8570 131.2842 124.5494 121.3549 117.0701 119.4454 
##      372      378      389      396      400      405      409      413 
## 120.7794 123.7060 126.8151 127.7227 132.8827 137.0485 136.1404 133.8767 
##      415      419      440      441      443      444      455      473 
## 131.4923 126.3797 127.5239 127.0583 129.4618 128.7926 123.6513 118.2767 
##      477      484      487      492      496      497      498      502 
## 121.0953 118.0152 120.8396 119.7628 121.1862 120.6552 121.6460 121.0524 
##      506      508      512      513      515      528      533      544 
## 124.3617 125.3587 127.2156 127.6810 129.3193 131.8148 126.3630 126.0391 
##      547      551      557      560      562      563      564      569 
## 127.0417 124.7446 125.0840 126.2600 124.2896 124.1357 123.9765 124.5816 
##      571      572      574      576      586      591      597      601 
## 124.6819 123.9654 124.3197 121.2043 121.3770 122.6206 126.6412 127.9878 
##      605      608      635      645      653      662      664      677 
## 127.8580 125.8824 122.8990 121.8682 117.9692 116.5733 114.7895 118.0684 
##      690      696      705      719      722      728      732      736 
## 112.0990 113.5028 116.4388 112.9566 113.5754 116.5960 118.3128 124.2687 
##      740      749      750      760      771      774      778      791 
## 124.3164 119.1270 118.5930 115.5751 110.4334 110.8785 114.1600 115.8208 
##      796      801      803      806      810      811      812      824 
## 115.6741 113.1498 116.2614 112.9324 113.5750 113.9093 113.3409 113.6175 
##      826      829      838      839      842      846      850      852 
## 114.4672 112.7325 113.4880 115.2502 112.9970 112.9312 112.2711 111.3230 
##      854      859      871      872      876      882      885      907 
## 111.4380 108.5268 104.1059 103.9141 106.0761 108.4322 110.3774 110.4534 
##      909      912      917      922      926      934      956      963 
## 108.6686 106.8566 110.7232 112.7982 112.0287 106.9176 103.7750 101.7431 
##      965      967      968      973      976      979      997     1001 
## 100.3874 103.3002 102.7936 101.4596 103.2809 105.4059 111.4874 119.4025 
##     1004     1006     1009     1013     1021     1028     1036     1040 
## 115.4116 117.5883 118.1716 117.7482 119.7739 119.4417 116.2046 118.5302 
##     1041     1042     1046     1049     1050     1051     1062     1063 
## 119.9923 119.7785 117.8162 119.5262 117.7417 118.5753 120.7660 121.7572 
##     1065     1072     1085     1086     1090     1091     1098     1102 
## 121.4192 117.3722 122.2343 122.4430 122.9884 121.0570 126.1130 129.4702 
##     1107     1111     1113     1115     1117     1127     1137     1148 
## 126.9491 125.7638 126.2401 125.9672 127.5553 128.1815 126.8461 125.7327 
##     1149     1151     1157     1161     1165     1170     1171     1172 
## 126.3984 125.0198 127.6807 125.6425 119.4945 120.1476 119.9251 119.8800 
##     1174     1176     1178     1180     1185     1195     1202     1208 
## 121.1664 120.7745 121.5860 120.9005 124.3002 115.6131 111.3907 110.3706 
##     1211     1219     1223     1240     1253     1255     1261     1270 
## 108.8498 108.4185 110.4922 114.1446 118.2563 117.5475 118.7400 114.4370 
##     1273     1281     1284     1290     1291     1297     1299     1301 
## 117.1736 119.3664 119.2804 121.1320 121.8913 122.1323 120.2515 120.1613 
##     1306     1313     1316     1319     1320     1329     1337     1338 
## 117.0297 117.1748 118.7970 119.2111 119.5124 122.4380 118.4545 118.2117 
##     1340     1341     1346     1349     1350     1351     1357     1359 
## 118.7849 119.4839 118.2538 116.2931 114.9795 115.2475 118.2495 118.3178 
##     1360     1361     1362     1366     1375     1379     1381     1393 
## 119.1870 119.2744 118.9932 120.7604 122.6023 122.4777 122.0008 127.7784 
##     1395     1398     1403     1404     1405     1419     1420     1421 
## 126.3657 125.5450 123.2296 124.1864 123.2111 122.1226 121.7509 122.4839 
##     1423     1424     1425     1428     1442     1448     1458     1459 
## 121.7411 121.3463 121.2151 121.2877 122.8458 122.8327 117.8389 118.9591 
##     1463     1464     1465     1467     1473     1478     1479     1480 
## 119.7580 120.1606 120.3179 121.8348 125.5813 125.5703 126.7400 126.9265 
##     1494     1495     1496     1504     1516     1520     1539     1542 
## 125.2714 124.7154 124.8945 125.8001 125.3808 124.8087 126.5746 127.5899 
##     1544     1550     1551     1563     1565     1566     1567     1571 
## 127.2430 125.5156 124.7316 122.1924 122.4400 122.3371 122.4854 123.3147 
##     1573     1579     1585     1589     1603     1607     1610     1618 
## 123.4722 122.9890 121.1168 120.3241 117.5908 116.0764 116.1725 115.2858 
##     1623     1633     1636     1638     1643     1651     1654     1656 
## 114.9339 114.6270 113.3899 112.7877 113.0251 113.3887 113.0913 112.7529 
##     1663     1669     1680     1684     1686     1702     1710     1715 
## 112.6369 116.0716 116.6800 115.7877 113.6950 117.1782 118.0508 120.0375

Evaluate model performance.

regression_actual <- selected_test_data$Adj.Close

regression_rsquared <- cor(regression_predictions, regression_actual)^2
regression_rmse <- rmse(regression_predictions, regression_actual)
regression_mae <- mae(regression_predictions, regression_actual)

cat("R-squared:", regression_rsquared, "\n")

## R-squared: 0.9997354

cat("Root Mean Squared Error (RMSE):", regression_rmse, "\n")

## Root Mean Squared Error (RMSE): 0.2754923

cat("Mean Absolute Error (MAE):", regression_mae, "\n")

## Mean Absolute Error (MAE): 0.2002636

The results from the linear regression model demonstrate an outstanding performance. The R-squared value of 0.9997354 closeness to 1 suggests an excellent fit, signifying that the independent variables have a nearly perfect explanatory power over the target variable.

A low RMSE of 0.2754923 indicative of superior model accuracy. In this case, the small value suggests that, on average, the model’s predictions deviate by only approximately 0.28 units from the actual values.

A low MAE of 0.2002636 signifies that, on average, the absolute difference between the predicted and actual values is only about 0.20 units.

In summary, the linear regression model exhibits remarkable accuracy, as evidenced by the very high R-squared value and the low RMSE and MAE values.

Visualize the evaluation result

prediction_regression <- data.frame(Actual = regression_actual, Predicted = regression_predictions)

ggplot(prediction_regression, aes(x = Actual, y = Predicted)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
  labs(title = "Predicted vs. Actual", x = "Actual", y = "Predicted")

This scatter plot show that predicted values are very close to actual values. This indicates that the model is making accurate predictions.

5.2 Classification Model

Split the data into training and testing sets.

set.seed(123)  
train_index <- sample(1:nrow(df), 0.8 * nrow(df))
train_data <- df[train_index, ]
test_data <- df[-train_index, ]

Feature Selection

# Extract features and target
target_classification <- "boolean"
features_classification <- train_data[, !colnames(train_data) %in% c("Date", "difference","Adj.Close", target_classification)]
target_classification <- train_data[[target_classification]]

# Assuming boolean is a factor
target_classification <- as.factor(target_classification)

# Train a Random Forest Model
fs_classification <- randomForest(features_classification, y = target_classification, importance = TRUE)

# Extract and visualize feature importance
importance_scores_classification <- importance(fs_classification)

importance_df_classification <- data.frame(
  Feature = rownames(importance_scores_classification),
  Importance = importance_scores_classification[, "MeanDecreaseAccuracy"]
)

importance_df_classification <- importance_df_classification[order(-importance_df_classification$Importance), ]

# The importance score of each feature.
ggplot(importance_df_classification, aes(x = Importance, y = reorder(Feature, Importance))) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title = "Random Forest Feature Importance", x = "Importance", y = "Feature") +
  theme(axis.text.y = element_text(size = 5, hjust = 0))

# The Top 10 importance features
top_10_importance_classification <- head(importance_df_classification, 10)

ggplot(top_10_importance_classification, aes(x = Importance, y = reorder(Feature, Importance))) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title = "Top 10 Random Forest Feature Importance", x = "Importance", y = "Feature") +
  theme(axis.text.y = element_text(size = 8, hjust = 0))

# Select top 10 important features
top_features_classification <- importance_df_classification$Feature[1:10]
print(top_features_classification)

##  [1] "PLT_Trend"     "SF_Trend"      "USDI_Trend"    "EU_Trend"     
##  [5] "USB_Trend"     "PLD_Trend"     "SF_Open"       "GDX_Adj.Close"
##  [9] "DJ_open"       "High"

# Create new dataset
selected_train <- train_data[, c("boolean", top_features_classification)]
selected_test <- test_data[, c("boolean", top_features_classification)]

(a) Random Forest Classificaiton

model_target <- 1 #boolean
train_data$boolean <- as.factor(train_data$boolean)

# Assuming diff_boolean is a factor
# Assuming boolean is a factor
selected_train$boolean <- as.factor(selected_train$boolean)

rf_model_classification = randomForest(x = selected_train[, -model_target], 
                             y = selected_train$boolean, 
                             ntree = 300) 
rf_model_classification

## 
## Call:
##  randomForest(x = selected_train[, -model_target], y = selected_train$boolean,      ntree = 300) 
##                Type of random forest: classification
##                      Number of trees: 300
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 25.13%
## Confusion matrix:
##     0   1 class.error
## 0 501 177   0.2610619
## 1 168 527   0.2417266

# Make predictions on the test set
predictions_classification <- predict(rf_model_classification , selected_test)
predictions_classification

##    4    8   16   22   23   44   48   51   54   58   63   66   67   86   98  100 
##    1    1    1    1    0    1    0    1    0    0    1    0    1    1    0    0 
##  102  108  110  111  112  114  123  124  126  129  130  132  134  141  147  148 
##    0    0    0    1    0    0    1    1    1    0    0    0    0    0    0    1 
##  170  174  177  182  183  184  189  193  204  220  221  223  233  234  245  246 
##    1    1    0    1    0    1    0    0    1    1    0    1    0    0    0    1 
##  247  248  264  274  275  284  285  289  296  301  302  303  308  309  314  318 
##    1    0    0    0    1    0    0    0    0    1    1    1    1    1    1    0 
##  321  323  325  330  337  339  341  342  346  352  355  362  363  369  370  371 
##    1    1    1    1    0    0    1    0    0    1    1    0    0    1    0    1 
##  372  378  389  396  400  405  409  413  415  419  440  441  443  444  455  473 
##    0    0    1    0    1    1    0    0    1    0    1    1    1    0    0    0 
##  477  484  487  492  496  497  498  502  506  508  512  513  515  528  533  544 
##    0    0    1    0    0    0    1    1    1    1    0    1    1    1    0    1 
##  547  551  557  560  562  563  564  569  571  572  574  576  586  591  597  601 
##    1    0    1    1    0    0    0    0    1    0    0    0    1    1    1    0 
##  605  608  635  645  653  662  664  677  690  696  705  719  722  728  732  736 
##    1    0    0    0    0    0    0    0    1    1    1    1    0    0    1    1 
##  740  749  750  760  771  774  778  791  796  801  803  806  810  811  812  824 
##    0    0    0    1    1    0    1    1    1    1    1    0    0    1    0    0 
##  826  829  838  839  842  846  850  852  854  859  871  872  876  882  885  907 
##    0    0    1    1    0    0    0    1    1    0    0    1    1    1    1    1 
##  909  912  917  922  926  934  956  963  965  967  968  973  976  979  997 1001 
##    0    0    1    0    0    0    1    0    0    1    0    1    1    1    1    1 
## 1004 1006 1009 1013 1021 1028 1036 1040 1041 1042 1046 1049 1050 1051 1062 1063 
##    0    0    1    0    0    0    0    1    1    1    0    0    0    1    1    1 
## 1065 1072 1085 1086 1090 1091 1098 1102 1107 1111 1113 1115 1117 1127 1137 1148 
##    1    0    1    1    1    0    1    1    0    0    0    1    1    0    0    0 
## 1149 1151 1157 1161 1165 1170 1171 1172 1174 1176 1178 1180 1185 1195 1202 1208 
##    1    0    0    0    0    0    0    1    0    1    1    0    1    1    0    0 
## 1211 1219 1223 1240 1253 1255 1261 1270 1273 1281 1284 1290 1291 1297 1299 1301 
##    1    1    0    0    1    0    0    0    1    1    1    1    1    0    0    0 
## 1306 1313 1316 1319 1320 1329 1337 1338 1340 1341 1346 1349 1350 1351 1357 1359 
##    1    1    0    0    1    0    0    0    0    1    0    0    0    1    1    1 
## 1360 1361 1362 1366 1375 1379 1381 1393 1395 1398 1403 1404 1405 1419 1420 1421 
##    1    0    0    1    1    1    0    0    1    0    0    1    0    0    0    1 
## 1423 1424 1425 1428 1442 1448 1458 1459 1463 1464 1465 1467 1473 1478 1479 1480 
##    1    0    0    1    1    0    0    1    1    1    0    1    1    1    1    1 
## 1494 1495 1496 1504 1516 1520 1539 1542 1544 1550 1551 1563 1565 1566 1567 1571 
##    0    0    1    0    1    0    0    1    0    1    0    1    1    1    1    1 
## 1573 1579 1585 1589 1603 1607 1610 1618 1623 1633 1636 1638 1643 1651 1654 1656 
##    0    0    0    1    0    0    1    1    0    1    0    0    0    0    0    0 
## 1663 1669 1680 1684 1686 1702 1710 1715 
##    1    0    0    0    0    1    0    1 
## Levels: 0 1

# Evaluate the model
rf_confusion_matrix <- table(predictions_classification, selected_test$boolean)

rf_cm <- confusionMatrix(rf_confusion_matrix,  mode = "everything")
rf_cm

## Confusion Matrix and Statistics
## 
##                           
## predictions_classification   0   1
##                          0 140  40
##                          1  34 130
##                                           
##                Accuracy : 0.7849          
##                  95% CI : (0.7376, 0.8271)
##     No Information Rate : 0.5058          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.5695          
##                                           
##  Mcnemar's Test P-Value : 0.5611          
##                                           
##             Sensitivity : 0.8046          
##             Specificity : 0.7647          
##          Pos Pred Value : 0.7778          
##          Neg Pred Value : 0.7927          
##               Precision : 0.7778          
##                  Recall : 0.8046          
##                      F1 : 0.7910          
##              Prevalence : 0.5058          
##          Detection Rate : 0.4070          
##    Detection Prevalence : 0.5233          
##       Balanced Accuracy : 0.7847          
##                                           
##        'Positive' Class : 0               
##

The random forest classification model achieves an accuracy of 78.49% demonstrating ability to correctly classify instances with a success rate of 78.49%. A low p-value indicates a substantial improvement over a random model, and the Kappa statistic of 0.5695 signifies moderate agreement between predicted and actual classifications.

Examining precision, the model accurately predicts 77.78% of positive instances, showcasing its ability to make reliable positive predictions. Meanwhile, the recall metric evaluates the model’s effectiveness in capturing 80.46% of the positive instances.The F1 score indicating 79.10%, effectively balances precision and recall.

(b) Logistic Regression

#create the logistic regression model with all predictors 
lr_model <- glm(boolean ~. ,selected_train, family = "binomial")
lr_model

## 
## Call:  glm(formula = boolean ~ ., family = "binomial", data = selected_train)
## 
## Coefficients:
##   (Intercept)      PLT_Trend       SF_Trend     USDI_Trend       EU_Trend  
##     2.287e+00      1.318e+00      1.439e+00     -7.490e-01      5.501e-01  
##     USB_Trend      PLD_Trend        SF_Open  GDX_Adj.Close        DJ_open  
##    -9.109e-01      4.042e-01     -5.833e-05      8.557e-02     -1.035e-05  
##          High  
##    -2.157e-02  
## 
## Degrees of Freedom: 1372 Total (i.e. Null);  1362 Residual
## Null Deviance:       1903 
## Residual Deviance: 1312  AIC: 1334

#generate the predictions of the lr model
predictions <- predict(lr_model, newdata = selected_test, type = "response")

# Convert predicted probabilities to binary predictions
lr_pred <- ifelse(predictions > 0.5, 1, 0)
lr_pred

##    4    8   16   22   23   44   48   51   54   58   63   66   67   86   98  100 
##    1    1    1    1    0    1    1    1    1    1    1    0    1    1    0    0 
##  102  108  110  111  112  114  123  124  126  129  130  132  134  141  147  148 
##    0    0    0    1    0    0    1    1    1    0    0    0    0    0    0    0 
##  170  174  177  182  183  184  189  193  204  220  221  223  233  234  245  246 
##    1    1    0    1    0    1    0    0    1    1    0    1    0    0    0    1 
##  247  248  264  274  275  284  285  289  296  301  302  303  308  309  314  318 
##    0    0    0    0    1    0    1    0    0    1    1    0    0    0    0    0 
##  321  323  325  330  337  339  341  342  346  352  355  362  363  369  370  371 
##    1    1    1    0    0    0    1    0    0    1    1    0    0    1    0    1 
##  372  378  389  396  400  405  409  413  415  419  440  441  443  444  455  473 
##    0    0    1    0    1    1    0    0    1    0    1    1    1    0    0    0 
##  477  484  487  492  496  497  498  502  506  508  512  513  515  528  533  544 
##    0    0    1    0    0    0    1    1    1    1    0    1    1    0    0    1 
##  547  551  557  560  562  563  564  569  571  572  574  576  586  591  597  601 
##    0    0    0    1    0    0    0    0    1    0    0    0    1    1    0    1 
##  605  608  635  645  653  662  664  677  690  696  705  719  722  728  732  736 
##    1    0    1    1    0    0    0    0    1    0    1    1    0    0    1    1 
##  740  749  750  760  771  774  778  791  796  801  803  806  810  811  812  824 
##    0    1    0    1    1    0    1    1    1    1    1    0    0    1    0    0 
##  826  829  838  839  842  846  850  852  854  859  871  872  876  882  885  907 
##    0    0    1    1    0    1    0    1    1    0    0    0    1    1    0    1 
##  909  912  917  922  926  934  956  963  965  967  968  973  976  979  997 1001 
##    0    0    1    0    1    0    1    1    0    1    1    0    1    1    0    1 
## 1004 1006 1009 1013 1021 1028 1036 1040 1041 1042 1046 1049 1050 1051 1062 1063 
##    0    0    0    0    0    0    0    1    1    1    0    0    0    1    1    1 
## 1065 1072 1085 1086 1090 1091 1098 1102 1107 1111 1113 1115 1117 1127 1137 1148 
##    1    0    1    0    1    0    1    0    1    0    0    1    1    0    0    0 
## 1149 1151 1157 1161 1165 1170 1171 1172 1174 1176 1178 1180 1185 1195 1202 1208 
##    1    0    0    1    0    0    0    1    0    0    1    1    1    1    1    0 
## 1211 1219 1223 1240 1253 1255 1261 1270 1273 1281 1284 1290 1291 1297 1299 1301 
##    0    1    1    0    1    0    0    0    1    0    1    1    1    0    0    0 
## 1306 1313 1316 1319 1320 1329 1337 1338 1340 1341 1346 1349 1350 1351 1357 1359 
##    1    1    0    0    1    0    0    0    1    1    0    1    0    1    1    1 
## 1360 1361 1362 1366 1375 1379 1381 1393 1395 1398 1403 1404 1405 1419 1420 1421 
##    1    0    0    1    1    1    0    0    1    0    0    1    0    1    0    1 
## 1423 1424 1425 1428 1442 1448 1458 1459 1463 1464 1465 1467 1473 1478 1479 1480 
##    1    0    1    1    1    1    0    1    1    1    1    1    1    1    1    1 
## 1494 1495 1496 1504 1516 1520 1539 1542 1544 1550 1551 1563 1565 1566 1567 1571 
##    0    0    1    0    1    0    0    1    0    1    0    0    0    1    1    1 
## 1573 1579 1585 1589 1603 1607 1610 1618 1623 1633 1636 1638 1643 1651 1654 1656 
##    0    0    0    1    0    0    1    1    0    1    0    0    0    0    0    1 
## 1663 1669 1680 1684 1686 1702 1710 1715 
##    0    0    0    0    0    1    0    1

# Evaluate the model
lr_confusion_matrix <- table(lr_pred, selected_test$boolean)
lr_cm <- confusionMatrix(lr_confusion_matrix,  mode = "everything")
lr_cm

## Confusion Matrix and Statistics
## 
##        
## lr_pred   0   1
##       0 139  42
##       1  35 128
##                                           
##                Accuracy : 0.7762          
##                  95% CI : (0.7284, 0.8191)
##     No Information Rate : 0.5058          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.5521          
##                                           
##  Mcnemar's Test P-Value : 0.4941          
##                                           
##             Sensitivity : 0.7989          
##             Specificity : 0.7529          
##          Pos Pred Value : 0.7680          
##          Neg Pred Value : 0.7853          
##               Precision : 0.7680          
##                  Recall : 0.7989          
##                      F1 : 0.7831          
##              Prevalence : 0.5058          
##          Detection Rate : 0.4041          
##    Detection Prevalence : 0.5262          
##       Balanced Accuracy : 0.7759          
##                                           
##        'Positive' Class : 0               
##

The logistic regression model attains an accuracy of 77.62% demonstrating ability to correctly classify instances with a success rate of 77.62%. A low p-value signifies a significant improvement over a random model, and the Kappa statistic of 0.5521 reflects moderate agreement between predicted and actual classifications.

Delving into precision, the model able to predict 76.8% of positive instances, highlighting its ability in making reliable positive predictions. Simultaneously, the recall metric evaluates the model’s effectiveness in capturing 79.89% of the positive instances. The F1 score, at 78.31%, effectively balances precision and recall.

6.0 Conclusion

In summary, the exploratory data analytics reveals a strong correlation among pricing attributes within Gold ETF, Eldorado Gold Corporation, Silver Futures, and Gold Miners ETF, offering valuable insights into potential factors influencing adjusted closing gold prices.

Based on the analysis and the development of the gold price prediction model, machine learning techniques have produced significant results as to the factors influencing gold prices. To analyse the regression model in predicting gold prices, the Linear Regression model outperform Random Forest regression model with higher R-squared close to 1 indicating precise alignment between predictions and actual values, and low RMSE and MAE values below 0.3 units.

For gold price trend classification, the Random Forest classification model and Logistic Regression model demonstrating significant improvement over random model and ability to predict gold price trends with moderate agreement between actual and predicted classifications. the Random Forest classification model surpasses Logistic Regression model exhibiting higher accuracy, precision, recall, and F1 score.

In conclusion, the Random Forest excels in gold price prediction, while overall, the linear regression models demonstrate strong performance. The effectiveness of this model in predicting gold prices would benefit stakeholders with a more informed investment decision. Stakeholders would gain a comprehensive analysis on the sophisticated gold market and the overall structure affecting the fluctuation of the gold prices. In short, this model would act as a valuable tool to empower stakeholders in navigating the gold market industry with the insights on the various factors affecting the gold prices.

Gold Price Prediction

Group 8

2024-01-12

1.0 Introduction

1.1 Problem Question

1.2 Objective

2.0 Dataset details

3.0 Data Cleaning

3.1 Exploring raw data

3.2 Remove Redundancy.

3.3 Binary Classification Label

3.4 Missing Values

3.5 Duplicate Validation

3.6 Data Types Checking

4.0 Exploration Data Analysis (EDA)

4.1 Statistical Summary

4.2 Gold Price Trends

4.3 Distribution of Gold Prices

4.4 Distribution of Gold Price Direction

4.5 Correlation

5.0 Modeling and Evaluation

5.1 Regression Model

Split the data into training and testing sets.

Feature Selection

(a) Random Forest Regression

Evaluate model performance.

Visualize the evaluation result

(b) Multivariate Linear Regression

Evaluate model performance.

Visualize the evaluation result

5.2 Classification Model

Split the data into training and testing sets.

Feature Selection

(a) Random Forest Classificaiton

(b) Logistic Regression

6.0 Conclusion