Group Members:
Gold, a common symbol representing wealth, has been used as a form of currency since decades ago. Due to its intrinsic value, this precious metal continues to play a vital role in today’s society and the global economy. It is deemed as a form of guarantee to foreign debts repayment and it represents the economic strength and stability of a country.
Being a commodity product on its own as well as a currency, the gold market is highly influenced by economic factors, geopolitical factors and, among others, investors’ sentiment. Investors would often rely on forecast of gold prices to decide on the best execution of trade, however, several factors would come to play in determining the prediction of gold price. These factors include prices of other precious metals, prices of crude oils, performance of the stock exchange, currency rates etc. This research aims to formulate an algorithm to predict the movement in price of gold, future prices and its exact value while taking into account of the correlating factors.
The volatile gold market renders it difficult for stakeholders to properly assess the risk and make strategic decisions in seizing the right opportunity for investment. The sophistication in gold price prediction derives from the multifaceted external influences such as economic trends, geopolitical events, oil prices etc.
This research aims to utilize advanced data analytics and machine learning techniques to develop a robust gold price prediction model. The gold price prediction model will provide stakeholders with foreseeable results and valuable insights to better assess the risk and benefits for proper investment strategies.
The dataset used in this project was primarily gathered from Yahoo Finance. It is a set of data collected from November 18, 2011, to January 1, 2019, encompassing 1718 rows and 81 columns.
These columns represent various financial attributes including Oil Price, Standard and Poor’s (S&P) 500 index, Dow Jones Index, US Bond rates (10 years), Euro USD exchange rates, prices of precious metals like Silver, Platinum and other metals such as Palladium and Rhodium. Additionally, it includes data on the US Dollar Index, Eldorado Gold Corporation, and Gold Miners ETF.
Within the dataset, the historical data of the Gold ETF comprises 7 columns, including Date, Open, High, Low, Close, Adjusted Close, and Volume. The value of Adjusted Close is particularly significant as it considers factors like dividends, stock splits, and new stock offerings to determine a more accurate representation of the Gold ETF’s closing value.
source: https://www.kaggle.com/datasets/sid321axn/gold-price-prediction-dataset/data
# Import dataset
df<- read.csv("Dataset/FINAL_USO.csv")
## Import library
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(corrplot)
## corrplot 0.92 loaded
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(Metrics)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following objects are masked from 'package:Metrics':
##
## precision, recall
Upon using the ‘glimpse()’ function from the ‘dplyr’ library, it was observed that the ‘Date’ column is in character format, while the remaining 80 columns consist of numeric variables, with values either in double or integer form.
## Exploring dataset
glimpse(df)
## Rows: 1,718
## Columns: 81
## $ Date <chr> "2011-12-15", "2011-12-16", "2011-12-19", "2011-12-20", …
## $ Open <dbl> 154.74, 154.31, 155.48, 156.82, 156.98, 156.35, 156.35, …
## $ High <dbl> 154.95, 155.37, 155.86, 157.43, 157.53, 156.80, 156.49, …
## $ Low <dbl> 151.71, 153.90, 154.36, 156.58, 156.13, 155.33, 155.82, …
## $ Close <dbl> 152.33, 155.23, 154.87, 156.98, 157.16, 156.04, 156.31, …
## $ Adj.Close <dbl> 152.33, 155.23, 154.87, 156.98, 157.16, 156.04, 156.31, …
## $ Volume <int> 21521900, 18124300, 12547200, 9136300, 11996100, 9888400…
## $ SP_open <dbl> 123.03, 122.23, 122.06, 122.18, 123.93, 124.63, 125.67, …
## $ SP_high <dbl> 123.20, 122.95, 122.32, 124.14, 124.36, 125.40, 126.43, …
## $ SP_low <dbl> 121.99, 121.30, 120.03, 120.37, 122.75, 124.23, 125.41, …
## $ SP_close <dbl> 122.18, 121.59, 120.29, 123.93, 124.17, 125.27, 126.39, …
## $ SP_Ajclose <dbl> 105.4412, 105.5975, 104.4685, 107.6298, 107.8382, 108.79…
## $ SP_volume <int> 199109200, 220481400, 183903000, 225418100, 194230900, 1…
## $ DJ_open <dbl> 11825.29, 11870.25, 11866.54, 11769.21, 12103.58, 12107.…
## $ DJ_high <dbl> 11967.84, 11968.18, 11925.88, 12117.13, 12119.70, 12182.…
## $ DJ_low <dbl> 11825.22, 11819.31, 11735.19, 11768.83, 11999.44, 12107.…
## $ DJ_close <dbl> 11868.81, 11866.39, 11766.26, 12103.58, 12107.74, 12169.…
## $ DJ_Ajclose <dbl> 11868.81, 11866.39, 11766.26, 12103.58, 12107.74, 12169.…
## $ DJ_volume <int> 136930000, 389520000, 135170000, 165180000, 163250000, 1…
## $ EG_open <dbl> 74.55, 73.60, 69.10, 66.45, 67.10, 68.40, 70.05, 70.80, …
## $ EG_high <dbl> 76.15, 75.10, 69.80, 68.10, 69.40, 70.00, 71.75, 71.15, …
## $ EG_low <dbl> 72.15, 73.35, 64.20, 66.00, 66.90, 66.70, 69.90, 67.55, …
## $ EG_close <dbl> 72.90, 74.90, 64.70, 67.00, 68.50, 69.75, 71.60, 67.65, …
## $ EG_Ajclose <dbl> 70.43176, 72.36404, 62.50938, 64.73151, 66.18072, 67.388…
## $ EG_volume <int> 787900, 896600, 2096700, 875300, 837600, 780600, 621200,…
## $ EU_Price <dbl> 1.3018, 1.3035, 1.2995, 1.3079, 1.3045, 1.3051, 1.3043, …
## $ EU_open <dbl> 1.2982, 1.3020, 1.3043, 1.3003, 1.3079, 1.3047, 1.3051, …
## $ EU_high <dbl> 1.3051, 1.3087, 1.3044, 1.3133, 1.3197, 1.3121, 1.3098, …
## $ EU_low <dbl> 1.2957, 1.2997, 1.2981, 1.2994, 1.3024, 1.3018, 1.3026, …
## $ EU_Trend <int> 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1,…
## $ OF_Price <dbl> 105.09, 103.35, 103.64, 106.73, 107.71, 107.89, 107.96, …
## $ OF_Open <dbl> 104.88, 103.51, 103.63, 104.30, 107.15, 107.92, 107.77, …
## $ OF_High <dbl> 106.50, 104.56, 104.57, 107.27, 108.17, 108.50, 108.55, …
## $ OF_Low <dbl> 104.88, 102.46, 102.37, 103.91, 106.16, 107.29, 107.40, …
## $ OF_Volume <int> 14330, 140080, 147880, 170240, 145090, 111160, 64610, 50…
## $ OF_Trend <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0,…
## $ OS_Price <dbl> 93.42, 93.79, 94.09, 95.55, 99.01, 99.47, 99.85, 101.28,…
## $ OS_Open <dbl> 94.91, 93.43, 93.77, 96.39, 97.54, 99.04, 99.49, 99.94, …
## $ OS_High <dbl> 96.00, 94.80, 94.43, 99.70, 99.26, 100.06, 100.24, 101.7…
## $ OS_Low <dbl> 93.33, 92.53, 92.55, 96.39, 96.81, 98.52, 99.25, 99.39, …
## $ OS_Trend <int> 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0,…
## $ SF_Price <int> 53604, 53458, 52961, 53487, 53148, 52894, 53067, 52711, …
## $ SF_Open <int> 54248, 53650, 53400, 52795, 53519, 53000, 53000, 53014, …
## $ SF_High <int> 54248, 54030, 53400, 53575, 54184, 53460, 53348, 53099, …
## $ SF_Low <int> 52316, 52890, 52544, 52595, 52937, 52690, 52960, 52588, …
## $ SF_Volume <int> 119440, 65390, 67280, 55130, 75950, 56530, 27510, 27450,…
## $ SF_Trend <int> 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1,…
## $ USB_Price <dbl> 1.911, 1.851, 1.810, 1.927, 1.970, 1.953, 2.021, 2.002, …
## $ USB_Open <dbl> 1.911, 1.851, 1.810, 1.927, 1.970, 1.953, 2.021, 2.002, …
## $ USB_High <dbl> 1.911, 1.851, 1.810, 1.927, 1.970, 1.953, 2.021, 2.002, …
## $ USB_Low <dbl> 1.911, 1.851, 1.810, 1.927, 1.970, 1.953, 2.021, 2.002, …
## $ USB_Trend <int> 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1,…
## $ PLT_Price <dbl> 1414.65, 1420.25, 1411.10, 1434.75, 1429.05, 1425.90, 14…
## $ PLT_Open <dbl> 1420.30, 1414.75, 1422.65, 1408.95, 1434.40, 1429.00, 14…
## $ PLT_High <dbl> 1423.35, 1431.75, 1427.60, 1436.55, 1453.75, 1435.25, 14…
## $ PLT_Low <dbl> 1376.85, 1400.70, 1404.60, 1408.15, 1417.65, 1415.90, 14…
## $ PLT_Trend <int> 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0,…
## $ PLD_Price <dbl> 618.85, 623.65, 608.80, 626.65, 635.90, 652.45, 665.55, …
## $ PLD_Open <dbl> 614.70, 622.60, 626.00, 622.45, 625.70, 632.00, 664.95, …
## $ PLD_High <dbl> 615.00, 623.45, 630.00, 622.45, 641.50, 650.80, 665.05, …
## $ PLD_Low <dbl> 614.60, 622.30, 608.60, 622.45, 623.80, 632.00, 664.95, …
## $ PLD_Trend <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0,…
## $ RHO_PRICE <int> 1425, 1400, 1400, 1400, 1400, 1400, 1400, 1400, 1400, 14…
## $ USDI_Price <dbl> 80.341, 80.249, 80.207, 80.273, 80.350, 80.306, 80.236, …
## $ USDI_Open <dbl> 80.565, 80.175, 80.300, 80.890, 80.105, 80.360, 80.210, …
## $ USDI_High <dbl> 80.630, 80.395, 80.470, 80.940, 80.445, 80.505, 80.375, …
## $ USDI_Low <dbl> 80.130, 79.935, 80.125, 80.035, 79.550, 79.955, 80.060, …
## $ USDI_Volume <int> 22850, 13150, 970, 22950, 24140, 13240, 7170, 5660, 2111…
## $ USDI_Trend <int> 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0,…
## $ GDX_Open <dbl> 53.01, 52.50, 52.49, 52.38, 53.15, 52.56, 52.59, 52.37, …
## $ GDX_High <dbl> 53.14, 53.18, 52.55, 53.25, 53.43, 52.62, 52.79, 52.52, …
## $ GDX_Low <dbl> 51.57, 52.04, 51.03, 52.37, 52.42, 51.77, 52.35, 51.68, …
## $ GDX_Close <dbl> 51.68, 52.68, 51.17, 52.99, 52.96, 52.40, 52.79, 51.87, …
## $ GDX_Adj.Close <dbl> 48.97388, 49.92151, 48.49058, 50.21528, 50.18685, 49.656…
## $ GDX_Volume <int> 20605600, 16285400, 15120200, 11644900, 8724300, 1317960…
## $ USO_Open <dbl> 36.90, 36.18, 36.39, 37.30, 37.67, 38.27, 38.47, 38.49, …
## $ USO_High <dbl> 36.94, 36.50, 36.45, 37.61, 38.24, 38.55, 38.61, 39.21, …
## $ USO_Low <dbl> 36.05, 35.73, 35.93, 37.22, 37.52, 38.21, 38.37, 38.46, …
## $ USO_Close <dbl> 36.13, 36.27, 36.20, 37.56, 38.11, 38.28, 38.49, 39.06, …
## $ USO_Adj.Close <dbl> 36.13, 36.27, 36.20, 37.56, 38.11, 38.28, 38.49, 39.06, …
## $ USO_Volume <int> 12616700, 12578800, 7418200, 10041600, 10728000, 5376800…
The closing price columns for Gold ETF, S&P 500 index, Down Jones index, Eldorado Gold Corporation (EGO) and Gold Miners ETF was removed since the closing prices exhibit a strong correlation with their respective adjusted closing prices. Retaining both sets of columns might introduce redundancy without contributing additional insights.
The adjusted closing price was retained to ensure a more concise representation as it already incorporates all the essential adjustments.
# remove closing gold price
remove_columns <- c("Close", "SP_close", "DJ_close", "EG_close", "GDX_Close", "USO_Close")
df <- df[, !(names(df) %in% remove_columns)]
head(df)
## Date Open High Low Adj.Close Volume SP_open SP_high SP_low
## 1 2011-12-15 154.74 154.95 151.71 152.33 21521900 123.03 123.20 121.99
## 2 2011-12-16 154.31 155.37 153.90 155.23 18124300 122.23 122.95 121.30
## 3 2011-12-19 155.48 155.86 154.36 154.87 12547200 122.06 122.32 120.03
## 4 2011-12-20 156.82 157.43 156.58 156.98 9136300 122.18 124.14 120.37
## 5 2011-12-21 156.98 157.53 156.13 157.16 11996100 123.93 124.36 122.75
## 6 2011-12-22 156.35 156.80 155.33 156.04 9888400 124.63 125.40 124.23
## SP_Ajclose SP_volume DJ_open DJ_high DJ_low DJ_Ajclose DJ_volume EG_open
## 1 105.4412 199109200 11825.29 11967.84 11825.22 11868.81 136930000 74.55
## 2 105.5975 220481400 11870.25 11968.18 11819.31 11866.39 389520000 73.60
## 3 104.4685 183903000 11866.54 11925.88 11735.19 11766.26 135170000 69.10
## 4 107.6298 225418100 11769.21 12117.13 11768.83 12103.58 165180000 66.45
## 5 107.8382 194230900 12103.58 12119.70 11999.44 12107.74 163250000 67.10
## 6 108.7935 119465400 12107.59 12182.71 12107.37 12169.65 151610000 68.40
## EG_high EG_low EG_Ajclose EG_volume EU_Price EU_open EU_high EU_low EU_Trend
## 1 76.15 72.15 70.43176 787900 1.3018 1.2982 1.3051 1.2957 1
## 2 75.10 73.35 72.36404 896600 1.3035 1.3020 1.3087 1.2997 1
## 3 69.80 64.20 62.50938 2096700 1.2995 1.3043 1.3044 1.2981 0
## 4 68.10 66.00 64.73151 875300 1.3079 1.3003 1.3133 1.2994 1
## 5 69.40 66.90 66.18072 837600 1.3045 1.3079 1.3197 1.3024 0
## 6 70.00 66.70 67.38840 780600 1.3051 1.3047 1.3121 1.3018 1
## OF_Price OF_Open OF_High OF_Low OF_Volume OF_Trend OS_Price OS_Open OS_High
## 1 105.09 104.88 106.50 104.88 14330 1 93.42 94.91 96.00
## 2 103.35 103.51 104.56 102.46 140080 0 93.79 93.43 94.80
## 3 103.64 103.63 104.57 102.37 147880 1 94.09 93.77 94.43
## 4 106.73 104.30 107.27 103.91 170240 1 95.55 96.39 99.70
## 5 107.71 107.15 108.17 106.16 145090 1 99.01 97.54 99.26
## 6 107.89 107.92 108.50 107.29 111160 1 99.47 99.04 100.06
## OS_Low OS_Trend SF_Price SF_Open SF_High SF_Low SF_Volume SF_Trend USB_Price
## 1 93.33 0 53604 54248 54248 52316 119440 1 1.911
## 2 92.53 1 53458 53650 54030 52890 65390 0 1.851
## 3 92.55 1 52961 53400 53400 52544 67280 0 1.810
## 4 96.39 1 53487 52795 53575 52595 55130 1 1.927
## 5 96.81 1 53148 53519 54184 52937 75950 0 1.970
## 6 98.52 1 52894 53000 53460 52690 56530 0 1.953
## USB_Open USB_High USB_Low USB_Trend PLT_Price PLT_Open PLT_High PLT_Low
## 1 1.911 1.911 1.911 1 1414.65 1420.30 1423.35 1376.85
## 2 1.851 1.851 1.851 0 1420.25 1414.75 1431.75 1400.70
## 3 1.810 1.810 1.810 0 1411.10 1422.65 1427.60 1404.60
## 4 1.927 1.927 1.927 1 1434.75 1408.95 1436.55 1408.15
## 5 1.970 1.970 1.970 1 1429.05 1434.40 1453.75 1417.65
## 6 1.953 1.953 1.953 0 1425.90 1429.00 1435.25 1415.90
## PLT_Trend PLD_Price PLD_Open PLD_High PLD_Low PLD_Trend RHO_PRICE USDI_Price
## 1 0 618.85 614.70 615.00 614.60 1 1425 80.341
## 2 1 623.65 622.60 623.45 622.30 1 1400 80.249
## 3 0 608.80 626.00 630.00 608.60 0 1400 80.207
## 4 1 626.65 622.45 622.45 622.45 1 1400 80.273
## 5 0 635.90 625.70 641.50 623.80 1 1400 80.350
## 6 0 652.45 632.00 650.80 632.00 1 1400 80.306
## USDI_Open USDI_High USDI_Low USDI_Volume USDI_Trend GDX_Open GDX_High GDX_Low
## 1 80.565 80.630 80.130 22850 0 53.01 53.14 51.57
## 2 80.175 80.395 79.935 13150 0 52.50 53.18 52.04
## 3 80.300 80.470 80.125 970 0 52.49 52.55 51.03
## 4 80.890 80.940 80.035 22950 1 52.38 53.25 52.37
## 5 80.105 80.445 79.550 24140 1 53.15 53.43 52.42
## 6 80.360 80.505 79.955 13240 0 52.56 52.62 51.77
## GDX_Adj.Close GDX_Volume USO_Open USO_High USO_Low USO_Adj.Close USO_Volume
## 1 48.97388 20605600 36.90 36.94 36.05 36.13 12616700
## 2 49.92151 16285400 36.18 36.50 35.73 36.27 12578800
## 3 48.49058 15120200 36.39 36.45 35.93 36.20 7418200
## 4 50.21528 11644900 37.30 37.61 37.22 37.56 10041600
## 5 50.18685 8724300 37.67 38.24 37.52 38.11 10728000
## 6 49.65618 13179600 38.27 38.55 38.21 38.28 5376800
# To identify the gold price direction - Increase/ Remain Unchanged (1) or Decrease (0)
df$difference <- c(NA, diff(df$Adj.Close))
df$boolean <-ifelse(df$difference >= 0, 1, 0)
head(df)
## Date Open High Low Adj.Close Volume SP_open SP_high SP_low
## 1 2011-12-15 154.74 154.95 151.71 152.33 21521900 123.03 123.20 121.99
## 2 2011-12-16 154.31 155.37 153.90 155.23 18124300 122.23 122.95 121.30
## 3 2011-12-19 155.48 155.86 154.36 154.87 12547200 122.06 122.32 120.03
## 4 2011-12-20 156.82 157.43 156.58 156.98 9136300 122.18 124.14 120.37
## 5 2011-12-21 156.98 157.53 156.13 157.16 11996100 123.93 124.36 122.75
## 6 2011-12-22 156.35 156.80 155.33 156.04 9888400 124.63 125.40 124.23
## SP_Ajclose SP_volume DJ_open DJ_high DJ_low DJ_Ajclose DJ_volume EG_open
## 1 105.4412 199109200 11825.29 11967.84 11825.22 11868.81 136930000 74.55
## 2 105.5975 220481400 11870.25 11968.18 11819.31 11866.39 389520000 73.60
## 3 104.4685 183903000 11866.54 11925.88 11735.19 11766.26 135170000 69.10
## 4 107.6298 225418100 11769.21 12117.13 11768.83 12103.58 165180000 66.45
## 5 107.8382 194230900 12103.58 12119.70 11999.44 12107.74 163250000 67.10
## 6 108.7935 119465400 12107.59 12182.71 12107.37 12169.65 151610000 68.40
## EG_high EG_low EG_Ajclose EG_volume EU_Price EU_open EU_high EU_low EU_Trend
## 1 76.15 72.15 70.43176 787900 1.3018 1.2982 1.3051 1.2957 1
## 2 75.10 73.35 72.36404 896600 1.3035 1.3020 1.3087 1.2997 1
## 3 69.80 64.20 62.50938 2096700 1.2995 1.3043 1.3044 1.2981 0
## 4 68.10 66.00 64.73151 875300 1.3079 1.3003 1.3133 1.2994 1
## 5 69.40 66.90 66.18072 837600 1.3045 1.3079 1.3197 1.3024 0
## 6 70.00 66.70 67.38840 780600 1.3051 1.3047 1.3121 1.3018 1
## OF_Price OF_Open OF_High OF_Low OF_Volume OF_Trend OS_Price OS_Open OS_High
## 1 105.09 104.88 106.50 104.88 14330 1 93.42 94.91 96.00
## 2 103.35 103.51 104.56 102.46 140080 0 93.79 93.43 94.80
## 3 103.64 103.63 104.57 102.37 147880 1 94.09 93.77 94.43
## 4 106.73 104.30 107.27 103.91 170240 1 95.55 96.39 99.70
## 5 107.71 107.15 108.17 106.16 145090 1 99.01 97.54 99.26
## 6 107.89 107.92 108.50 107.29 111160 1 99.47 99.04 100.06
## OS_Low OS_Trend SF_Price SF_Open SF_High SF_Low SF_Volume SF_Trend USB_Price
## 1 93.33 0 53604 54248 54248 52316 119440 1 1.911
## 2 92.53 1 53458 53650 54030 52890 65390 0 1.851
## 3 92.55 1 52961 53400 53400 52544 67280 0 1.810
## 4 96.39 1 53487 52795 53575 52595 55130 1 1.927
## 5 96.81 1 53148 53519 54184 52937 75950 0 1.970
## 6 98.52 1 52894 53000 53460 52690 56530 0 1.953
## USB_Open USB_High USB_Low USB_Trend PLT_Price PLT_Open PLT_High PLT_Low
## 1 1.911 1.911 1.911 1 1414.65 1420.30 1423.35 1376.85
## 2 1.851 1.851 1.851 0 1420.25 1414.75 1431.75 1400.70
## 3 1.810 1.810 1.810 0 1411.10 1422.65 1427.60 1404.60
## 4 1.927 1.927 1.927 1 1434.75 1408.95 1436.55 1408.15
## 5 1.970 1.970 1.970 1 1429.05 1434.40 1453.75 1417.65
## 6 1.953 1.953 1.953 0 1425.90 1429.00 1435.25 1415.90
## PLT_Trend PLD_Price PLD_Open PLD_High PLD_Low PLD_Trend RHO_PRICE USDI_Price
## 1 0 618.85 614.70 615.00 614.60 1 1425 80.341
## 2 1 623.65 622.60 623.45 622.30 1 1400 80.249
## 3 0 608.80 626.00 630.00 608.60 0 1400 80.207
## 4 1 626.65 622.45 622.45 622.45 1 1400 80.273
## 5 0 635.90 625.70 641.50 623.80 1 1400 80.350
## 6 0 652.45 632.00 650.80 632.00 1 1400 80.306
## USDI_Open USDI_High USDI_Low USDI_Volume USDI_Trend GDX_Open GDX_High GDX_Low
## 1 80.565 80.630 80.130 22850 0 53.01 53.14 51.57
## 2 80.175 80.395 79.935 13150 0 52.50 53.18 52.04
## 3 80.300 80.470 80.125 970 0 52.49 52.55 51.03
## 4 80.890 80.940 80.035 22950 1 52.38 53.25 52.37
## 5 80.105 80.445 79.550 24140 1 53.15 53.43 52.42
## 6 80.360 80.505 79.955 13240 0 52.56 52.62 51.77
## GDX_Adj.Close GDX_Volume USO_Open USO_High USO_Low USO_Adj.Close USO_Volume
## 1 48.97388 20605600 36.90 36.94 36.05 36.13 12616700
## 2 49.92151 16285400 36.18 36.50 35.73 36.27 12578800
## 3 48.49058 15120200 36.39 36.45 35.93 36.20 7418200
## 4 50.21528 11644900 37.30 37.61 37.22 37.56 10041600
## 5 50.18685 8724300 37.67 38.24 37.52 38.11 10728000
## 6 49.65618 13179600 38.27 38.55 38.21 38.28 5376800
## difference boolean
## 1 NA NA
## 2 2.899994 1
## 3 -0.360001 0
## 4 2.110001 1
## 5 0.180008 1
## 6 -1.120011 0
The utilization of the ‘any(is.na())’ function has indicated the presence of missing values in the dataset. As a precaution to ensure data integrity, the initial step involved removing the first row that contained any missing value (NA).
# Handling Missing Data
missing_values <- any(is.na(df))
missing_values
## [1] TRUE
# Display the rows with missing values
rows_with_missing <- df[apply(is.na(df), 1, any), ]
print(rows_with_missing)
## Date Open High Low Adj.Close Volume SP_open SP_high SP_low
## 1 2011-12-15 154.74 154.95 151.71 152.33 21521900 123.03 123.2 121.99
## SP_Ajclose SP_volume DJ_open DJ_high DJ_low DJ_Ajclose DJ_volume EG_open
## 1 105.4412 199109200 11825.29 11967.84 11825.22 11868.81 136930000 74.55
## EG_high EG_low EG_Ajclose EG_volume EU_Price EU_open EU_high EU_low EU_Trend
## 1 76.15 72.15 70.43176 787900 1.3018 1.2982 1.3051 1.2957 1
## OF_Price OF_Open OF_High OF_Low OF_Volume OF_Trend OS_Price OS_Open OS_High
## 1 105.09 104.88 106.5 104.88 14330 1 93.42 94.91 96
## OS_Low OS_Trend SF_Price SF_Open SF_High SF_Low SF_Volume SF_Trend USB_Price
## 1 93.33 0 53604 54248 54248 52316 119440 1 1.911
## USB_Open USB_High USB_Low USB_Trend PLT_Price PLT_Open PLT_High PLT_Low
## 1 1.911 1.911 1.911 1 1414.65 1420.3 1423.35 1376.85
## PLT_Trend PLD_Price PLD_Open PLD_High PLD_Low PLD_Trend RHO_PRICE USDI_Price
## 1 0 618.85 614.7 615 614.6 1 1425 80.341
## USDI_Open USDI_High USDI_Low USDI_Volume USDI_Trend GDX_Open GDX_High GDX_Low
## 1 80.565 80.63 80.13 22850 0 53.01 53.14 51.57
## GDX_Adj.Close GDX_Volume USO_Open USO_High USO_Low USO_Adj.Close USO_Volume
## 1 48.97388 20605600 36.9 36.94 36.05 36.13 12616700
## difference boolean
## 1 NA NA
# Display the resulting data frame without NAs
df <- na.omit(df)
head(df)
## Date Open High Low Adj.Close Volume SP_open SP_high SP_low
## 2 2011-12-16 154.31 155.37 153.90 155.23 18124300 122.23 122.95 121.30
## 3 2011-12-19 155.48 155.86 154.36 154.87 12547200 122.06 122.32 120.03
## 4 2011-12-20 156.82 157.43 156.58 156.98 9136300 122.18 124.14 120.37
## 5 2011-12-21 156.98 157.53 156.13 157.16 11996100 123.93 124.36 122.75
## 6 2011-12-22 156.35 156.80 155.33 156.04 9888400 124.63 125.40 124.23
## 7 2011-12-23 156.35 156.49 155.82 156.31 3565100 125.67 126.43 125.41
## SP_Ajclose SP_volume DJ_open DJ_high DJ_low DJ_Ajclose DJ_volume EG_open
## 2 105.5975 220481400 11870.25 11968.18 11819.31 11866.39 389520000 73.60
## 3 104.4685 183903000 11866.54 11925.88 11735.19 11766.26 135170000 69.10
## 4 107.6298 225418100 11769.21 12117.13 11768.83 12103.58 165180000 66.45
## 5 107.8382 194230900 12103.58 12119.70 11999.44 12107.74 163250000 67.10
## 6 108.7935 119465400 12107.59 12182.71 12107.37 12169.65 151610000 68.40
## 7 109.7662 92187200 12169.88 12297.44 12169.80 12294.00 80420000 70.05
## EG_high EG_low EG_Ajclose EG_volume EU_Price EU_open EU_high EU_low EU_Trend
## 2 75.10 73.35 72.36404 896600 1.3035 1.3020 1.3087 1.2997 1
## 3 69.80 64.20 62.50938 2096700 1.2995 1.3043 1.3044 1.2981 0
## 4 68.10 66.00 64.73151 875300 1.3079 1.3003 1.3133 1.2994 1
## 5 69.40 66.90 66.18072 837600 1.3045 1.3079 1.3197 1.3024 0
## 6 70.00 66.70 67.38840 780600 1.3051 1.3047 1.3121 1.3018 1
## 7 71.75 69.90 69.17577 621200 1.3043 1.3051 1.3098 1.3026 0
## OF_Price OF_Open OF_High OF_Low OF_Volume OF_Trend OS_Price OS_Open OS_High
## 2 103.35 103.51 104.56 102.46 140080 0 93.79 93.43 94.80
## 3 103.64 103.63 104.57 102.37 147880 1 94.09 93.77 94.43
## 4 106.73 104.30 107.27 103.91 170240 1 95.55 96.39 99.70
## 5 107.71 107.15 108.17 106.16 145090 1 99.01 97.54 99.26
## 6 107.89 107.92 108.50 107.29 111160 1 99.47 99.04 100.06
## 7 107.96 107.77 108.55 107.40 64610 1 99.85 99.49 100.24
## OS_Low OS_Trend SF_Price SF_Open SF_High SF_Low SF_Volume SF_Trend USB_Price
## 2 92.53 1 53458 53650 54030 52890 65390 0 1.851
## 3 92.55 1 52961 53400 53400 52544 67280 0 1.810
## 4 96.39 1 53487 52795 53575 52595 55130 1 1.927
## 5 96.81 1 53148 53519 54184 52937 75950 0 1.970
## 6 98.52 1 52894 53000 53460 52690 56530 0 1.953
## 7 99.25 1 53067 53000 53348 52960 27510 1 2.021
## USB_Open USB_High USB_Low USB_Trend PLT_Price PLT_Open PLT_High PLT_Low
## 2 1.851 1.851 1.851 0 1420.25 1414.75 1431.75 1400.70
## 3 1.810 1.810 1.810 0 1411.10 1422.65 1427.60 1404.60
## 4 1.927 1.927 1.927 1 1434.75 1408.95 1436.55 1408.15
## 5 1.970 1.970 1.970 1 1429.05 1434.40 1453.75 1417.65
## 6 1.953 1.953 1.953 0 1425.90 1429.00 1435.25 1415.90
## 7 2.021 2.021 2.021 1 1426.45 1425.55 1439.30 1424.20
## PLT_Trend PLD_Price PLD_Open PLD_High PLD_Low PLD_Trend RHO_PRICE USDI_Price
## 2 1 623.65 622.60 623.45 622.30 1 1400 80.249
## 3 0 608.80 626.00 630.00 608.60 0 1400 80.207
## 4 1 626.65 622.45 622.45 622.45 1 1400 80.273
## 5 0 635.90 625.70 641.50 623.80 1 1400 80.350
## 6 0 652.45 632.00 650.80 632.00 1 1400 80.306
## 7 1 665.55 664.95 665.05 664.95 1 1400 80.236
## USDI_Open USDI_High USDI_Low USDI_Volume USDI_Trend GDX_Open GDX_High GDX_Low
## 2 80.175 80.395 79.935 13150 0 52.50 53.18 52.04
## 3 80.300 80.470 80.125 970 0 52.49 52.55 51.03
## 4 80.890 80.940 80.035 22950 1 52.38 53.25 52.37
## 5 80.105 80.445 79.550 24140 1 53.15 53.43 52.42
## 6 80.360 80.505 79.955 13240 0 52.56 52.62 51.77
## 7 80.210 80.375 80.060 7170 0 52.59 52.79 52.35
## GDX_Adj.Close GDX_Volume USO_Open USO_High USO_Low USO_Adj.Close USO_Volume
## 2 49.92151 16285400 36.18 36.50 35.73 36.27 12578800
## 3 48.49058 15120200 36.39 36.45 35.93 36.20 7418200
## 4 50.21528 11644900 37.30 37.61 37.22 37.56 10041600
## 5 50.18685 8724300 37.67 38.24 37.52 38.11 10728000
## 6 49.65618 13179600 38.27 38.55 38.21 38.28 5376800
## 7 50.16936 4729000 38.47 38.61 38.37 38.49 2162200
## difference boolean
## 2 2.899994 1
## 3 -0.360001 0
## 4 2.110001 1
## 5 0.180008 1
## 6 -1.120011 0
## 7 0.270005 1
The execution of the ‘sum(duplicated())’ code revealed that the dataset contains no duplicates, ensuring the uniqueness of each record.
# Check Duplicates
num_duplicates <- sum(duplicated(df))
num_duplicates
## [1] 0
As indicated in Section 3.1, all variables, except for the ‘Date’ column, are in the correct data types. Therefore, no additional conversion is necessary. This ensures that all numeric variables are appropriately formatted for numerical analysis and modeling.
# Convert Date column from chr to date
df$Date <- as.Date(df$Date,format="%Y-%m-%d")
class(df$Date)
## [1] "Date"
# Check the class of numeric columns
numeric_cols <- sapply(df, is.numeric)
print(numeric_cols)
## Date Open High Low Adj.Close
## FALSE TRUE TRUE TRUE TRUE
## Volume SP_open SP_high SP_low SP_Ajclose
## TRUE TRUE TRUE TRUE TRUE
## SP_volume DJ_open DJ_high DJ_low DJ_Ajclose
## TRUE TRUE TRUE TRUE TRUE
## DJ_volume EG_open EG_high EG_low EG_Ajclose
## TRUE TRUE TRUE TRUE TRUE
## EG_volume EU_Price EU_open EU_high EU_low
## TRUE TRUE TRUE TRUE TRUE
## EU_Trend OF_Price OF_Open OF_High OF_Low
## TRUE TRUE TRUE TRUE TRUE
## OF_Volume OF_Trend OS_Price OS_Open OS_High
## TRUE TRUE TRUE TRUE TRUE
## OS_Low OS_Trend SF_Price SF_Open SF_High
## TRUE TRUE TRUE TRUE TRUE
## SF_Low SF_Volume SF_Trend USB_Price USB_Open
## TRUE TRUE TRUE TRUE TRUE
## USB_High USB_Low USB_Trend PLT_Price PLT_Open
## TRUE TRUE TRUE TRUE TRUE
## PLT_High PLT_Low PLT_Trend PLD_Price PLD_Open
## TRUE TRUE TRUE TRUE TRUE
## PLD_High PLD_Low PLD_Trend RHO_PRICE USDI_Price
## TRUE TRUE TRUE TRUE TRUE
## USDI_Open USDI_High USDI_Low USDI_Volume USDI_Trend
## TRUE TRUE TRUE TRUE TRUE
## GDX_Open GDX_High GDX_Low GDX_Adj.Close GDX_Volume
## TRUE TRUE TRUE TRUE TRUE
## USO_Open USO_High USO_Low USO_Adj.Close USO_Volume
## TRUE TRUE TRUE TRUE TRUE
## difference boolean
## TRUE TRUE
EDA involves investigating and understanding the main characteristics of the dataset. The purpose of EDA is to gain insights, identify patterns and uncover potential relationships within data.
Displaying a summary of statistical characteristics for the dataset provides an overview of central tendency, minimum and maximum values, and quartiles for each numerical variable.
# Summary statistics
summary(df)
## Date Open High Low
## Min. :2011-12-16 Min. :100.9 Min. :101.0 Min. :100.2
## 1st Qu.:2013-10-04 1st Qu.:116.2 1st Qu.:116.5 1st Qu.:115.7
## Median :2015-07-20 Median :121.9 Median :122.3 Median :121.4
## Mean :2015-07-07 Mean :127.3 Mean :127.8 Mean :126.8
## 3rd Qu.:2017-04-10 3rd Qu.:128.4 3rd Qu.:129.1 3rd Qu.:127.8
## Max. :2018-12-31 Max. :173.2 Max. :174.1 Max. :172.9
## Adj.Close Volume SP_open SP_high
## Min. :100.5 Min. : 1501600 Min. :122.1 Min. :122.3
## 1st Qu.:116.0 1st Qu.: 5408900 1st Qu.:170.5 1st Qu.:171.0
## Median :121.8 Median : 7483200 Median :205.5 Median :206.5
## Mean :127.3 Mean : 8438712 Mean :204.5 Mean :205.4
## 3rd Qu.:128.5 3rd Qu.:10207800 3rd Qu.:237.3 3rd Qu.:237.8
## Max. :173.6 Max. :93804200 Max. :293.1 Max. :293.9
## SP_low SP_Ajclose SP_volume DJ_open
## Min. :120.0 Min. :104.5 Min. : 27856500 Min. :11769
## 1st Qu.:169.9 1st Qu.:153.1 1st Qu.: 73869000 1st Qu.:15497
## Median :204.4 Median :191.7 Median : 99714400 Median :17602
## Mean :203.5 Mean :192.3 Mean :109750612 Mean :18165
## 3rd Qu.:236.2 3rd Qu.:228.8 3rd Qu.:135101400 3rd Qu.:20868
## Max. :291.8 Max. :290.6 Max. :507244300 Max. :26833
## DJ_high DJ_low DJ_Ajclose DJ_volume
## Min. :11926 Min. :11735 Min. :11766 Min. : 8410000
## 1st Qu.:15555 1st Qu.:15422 1st Qu.:15498 1st Qu.: 92290000
## Median :17715 Median :17512 Median :17614 Median :120670000
## Mean :18248 Mean :18078 Mean :18168 Mean :177937001
## 3rd Qu.:20914 3rd Qu.:20786 3rd Qu.:20856 3rd Qu.:263780000
## Max. :26952 Max. :26789 Max. :26828 Max. :900510000
## EG_open EG_high EG_low EG_Ajclose
## Min. : 2.77 Min. : 2.85 Min. : 2.73 Min. : 2.80
## 1st Qu.:14.20 1st Qu.:14.55 1st Qu.:13.75 1st Qu.:14.08
## Median :22.75 Median :23.10 Median :21.75 Median :22.38
## Mean :28.25 Mean :28.79 Mean :27.63 Mean :27.76
## 3rd Qu.:37.15 3rd Qu.:37.85 3rd Qu.:36.45 3rd Qu.:36.77
## Max. :80.20 Max. :81.00 Max. :77.90 Max. :78.00
## EG_volume EU_Price EU_open EU_high
## Min. : 164500 Min. :1.039 Min. :1.039 Min. :1.042
## 1st Qu.: 700600 1st Qu.:1.121 1st Qu.:1.121 1st Qu.:1.126
## Median : 969600 Median :1.184 Median :1.184 Median :1.188
## Mean : 1136276 Mean :1.208 Mean :1.208 Mean :1.213
## 3rd Qu.: 1344900 3rd Qu.:1.306 3rd Qu.:1.306 3rd Qu.:1.310
## Max. :10061200 Max. :1.393 Max. :1.393 Max. :1.399
## EU_low EU_Trend OF_Price OF_Open
## Min. :1.034 Min. :0.0000 Min. : 27.88 Min. : 27.99
## 1st Qu.:1.116 1st Qu.:0.0000 1st Qu.: 52.15 1st Qu.: 52.19
## Median :1.179 Median :0.0000 Median : 70.11 Median : 70.05
## Mean :1.204 Mean :0.4945 Mean : 77.49 Mean : 77.51
## 3rd Qu.:1.300 3rd Qu.:1.0000 3rd Qu.:107.73 3rd Qu.:107.69
## Max. :1.391 Max. :1.0000 Max. :126.22 Max. :126.16
## OF_High OF_Low OF_Volume OF_Trend
## Min. : 28.75 Min. : 27.10 Min. : 11520 Min. :0.0000
## 1st Qu.: 52.86 1st Qu.: 51.36 1st Qu.:176060 1st Qu.:0.0000
## Median : 70.64 Median : 69.00 Median :223490 Median :0.0000
## Mean : 78.37 Mean : 76.58 Mean :226082 Mean :0.4985
## 3rd Qu.:108.46 3rd Qu.:106.87 3rd Qu.:281530 3rd Qu.:1.0000
## Max. :128.40 Max. :125.00 Max. :567760 Max. :1.0000
## OS_Price OS_Open OS_High OS_Low
## Min. : 26.55 Min. : 27.34 Min. : 27.61 Min. : 26.18
## 1st Qu.: 48.92 1st Qu.: 49.03 1st Qu.: 49.69 1st Qu.: 48.20
## Median : 64.61 Median : 64.76 Median : 65.56 Median : 63.73
## Mean : 70.14 Mean : 70.26 Mean : 71.11 Mean : 69.32
## 3rd Qu.: 94.31 3rd Qu.: 94.41 3rd Qu.: 95.36 3rd Qu.: 93.48
## Max. :110.30 Max. :110.34 Max. :112.28 Max. :109.15
## OS_Trend SF_Price SF_Open SF_High
## Min. :0.0000 Min. :33170 Min. :33146 Min. :33566
## 1st Qu.:0.0000 1st Qu.:38016 1st Qu.:38025 1st Qu.:38293
## Median :1.0000 Median :40517 Median :40506 Median :40840
## Mean :0.5038 Mean :43278 Mean :43302 Mean :43665
## 3rd Qu.:1.0000 3rd Qu.:46573 3rd Qu.:46626 3rd Qu.:47071
## Max. :1.0000 Max. :65292 Max. :65400 Max. :65723
## SF_Low SF_Volume SF_Trend USB_Price
## Min. :32626 Min. : 40 Min. :0.0000 Min. :1.358
## 1st Qu.:37687 1st Qu.: 14210 1st Qu.:0.0000 1st Qu.:1.905
## Median :40225 Median : 19640 Median :0.0000 Median :2.259
## Mean :42906 Mean : 26859 Mean :0.4805 Mean :2.263
## 3rd Qu.:46125 3rd Qu.: 29900 3rd Qu.:1.0000 3rd Qu.:2.597
## Max. :64132 Max. :203730 Max. :1.0000 Max. :3.239
## USB_Open USB_High USB_Low USB_Trend
## Min. :1.366 Min. :1.391 Min. :1.321 Min. :0.0000
## 1st Qu.:1.905 1st Qu.:1.921 1st Qu.:1.886 1st Qu.:0.0000
## Median :2.259 Median :2.290 Median :2.231 Median :0.0000
## Mean :2.263 Mean :2.287 Mean :2.239 Mean :0.4898
## 3rd Qu.:2.598 3rd Qu.:2.620 3rd Qu.:2.575 3rd Qu.:1.0000
## Max. :3.237 Max. :3.261 Max. :3.231 Max. :1.0000
## PLT_Price PLT_Open PLT_High PLT_Low
## Min. : 775.6 Min. : 765.3 Min. : 786.5 Min. : 756.0
## 1st Qu.: 944.0 1st Qu.: 944.0 1st Qu.: 952.6 1st Qu.: 935.9
## Median :1097.2 Median :1097.6 Median :1107.4 Median :1086.0
## Mean :1183.8 Mean :1184.3 Mean :1194.1 Mean :1173.3
## 3rd Qu.:1443.0 3rd Qu.:1443.0 3rd Qu.:1454.3 3rd Qu.:1432.2
## Max. :1737.6 Max. :1737.8 Max. :1742.9 Max. :1717.2
## PLT_Trend PLD_Price PLD_Open PLD_High
## Min. :0.0000 Min. : 470.4 Min. : 458.6 Min. : 473.1
## 1st Qu.:0.0000 1st Qu.: 663.5 1st Qu.: 664.1 1st Qu.: 670.2
## Median :0.0000 Median : 748.3 Median : 748.0 Median : 753.8
## Mean :0.4846 Mean : 766.9 Mean : 766.5 Mean : 773.6
## 3rd Qu.:1.0000 3rd Qu.: 848.4 3rd Qu.: 847.0 3rd Qu.: 855.8
## Max. :1.0000 Max. :1197.5 Max. :1196.0 Max. :1208.7
## PLD_Low PLD_Trend RHO_PRICE USDI_Price
## Min. : 458.6 Min. :0.0000 Min. : 0 Min. : 78.30
## 1st Qu.: 657.1 1st Qu.:0.0000 1st Qu.: 785 1st Qu.: 81.38
## Median : 742.5 Median :1.0000 Median :1100 Median : 92.88
## Mean : 759.5 Mean :0.5306 Mean :1130 Mean : 89.81
## 3rd Qu.: 840.1 3rd Qu.:1.0000 3rd Qu.:1300 3rd Qu.: 96.11
## Max. :1183.6 Max. :1.0000 Max. :2600 Max. :103.29
## USDI_Open USDI_High USDI_Low USDI_Volume
## Min. : 78.22 Min. : 78.64 Min. : 78.12 Min. : 60
## 1st Qu.: 81.38 1st Qu.: 81.62 1st Qu.: 81.11 1st Qu.: 18130
## Median : 92.91 Median : 93.17 Median : 92.58 Median : 24450
## Mean : 89.81 Mean : 90.10 Mean : 89.51 Mean : 27571
## 3rd Qu.: 96.11 3rd Qu.: 96.47 3rd Qu.: 95.73 3rd Qu.: 33750
## Max. :103.35 Max. :103.81 Max. :102.97 Max. :142820
## USDI_Trend GDX_Open GDX_High GDX_Low
## Min. :0.0000 Min. :12.70 Min. :12.92 Min. :12.40
## 1st Qu.:0.0000 1st Qu.:20.64 1st Qu.:20.95 1st Qu.:20.35
## Median :1.0000 Median :23.11 Median :23.37 Median :22.87
## Mean :0.5131 Mean :26.73 Mean :27.06 Mean :26.37
## 3rd Qu.:1.0000 3rd Qu.:27.40 3rd Qu.:27.77 3rd Qu.:26.79
## Max. :1.0000 Max. :57.52 Max. :57.94 Max. :56.77
## GDX_Adj.Close GDX_Volume USO_Open USO_High
## Min. :12.27 Min. : 4729000 Min. : 7.82 Min. : 8.03
## 1st Qu.:20.17 1st Qu.: 22679700 1st Qu.:11.42 1st Qu.:11.50
## Median :22.68 Median : 37314000 Median :16.44 Median :16.49
## Mean :25.91 Mean : 43578526 Mean :22.10 Mean :22.30
## 3rd Qu.:26.48 3rd Qu.: 56978800 3rd Qu.:34.42 3rd Qu.:34.66
## Max. :54.62 Max. :232153600 Max. :41.60 Max. :42.30
## USO_Low USO_Adj.Close USO_Volume difference
## Min. : 7.67 Min. : 7.96 Min. : 1035100 Min. :-12.6400
## 1st Qu.:11.30 1st Qu.:11.39 1st Qu.: 6223500 1st Qu.: -0.6300
## Median :16.02 Median :16.34 Median : 16144000 Median : 0.0000
## Mean :21.90 Mean :22.10 Mean : 19226981 Mean : -0.0181
## 3rd Qu.:34.08 3rd Qu.:34.41 3rd Qu.: 26727200 3rd Qu.: 0.5800
## Max. :41.30 Max. :42.01 Max. :110265700 Max. : 9.8000
## boolean
## Min. :0.0000
## 1st Qu.:0.0000
## Median :1.0000
## Mean :0.5038
## 3rd Qu.:1.0000
## Max. :1.0000
# To visualize the gold price over time
ggplot(df, aes(x = Date, y = Adj.Close)) + geom_line(color = "blue") + labs(title = "Gold Price Over Time", x = "Date", y = "Gold Close Price") + theme_minimal()
From the time series chart, gold price is dropping from 2012 and the lowest gold price year is 2016.
#Plot histogram of gold prices
hist(df$Adj.Close, breaks = 30, col='orange', border='black',
main='Histogram of Gold Prices', xlab='Gold Close Price', ylab='Frequency')
# Summary of boolean
class_counts <- table(df$boolean)
# Convert the table to a data frame for ggplot
class_counts_df <- as.data.frame(class_counts)
# Calculate percentages
class_counts_df$percentage <- prop.table(class_counts) * 100
# Create a bar chart with labels
ggplot(class_counts_df, aes(x = Var1, y = Freq)) +
geom_bar(stat = "identity", position = "dodge", fill = "blue") +
geom_text(aes(label = paste0(Freq, " (", round(percentage), "%)")),
position = position_dodge(width = 0.9), vjust = -0.5) +
labs(title = "Distribution of Gold Price Direction", x = "Boolean", y = "Count")
The distribution with approximately 50% in each category suggests a balanced representation between the two classes.
correlation_matrix <- cor(df[, !colnames(df) %in% "Date"])
corrplot(correlation_matrix,
method = "color",
col = colorRampPalette(c("blue", "white", "red"))(20),
tl.col = "black",
tl.srt = 90,
tl.cex = 0.5)
# Set a correlation threshold
threshold <- 0.8
# Find attributes highly positively correlated with "Gold Close Price"
positively_correlated_vars <- names(which(correlation_matrix["Adj.Close", ] > threshold))
positively_correlated_vars <- setdiff(positively_correlated_vars, "Adj.Close")
# Display the highly correlated attributes
cat("Attributes highly positively correlated with Gold Close Price:\n")
## Attributes highly positively correlated with Gold Close Price:
print(positively_correlated_vars)
## [1] "Open" "High" "Low" "EG_open"
## [5] "EG_high" "EG_low" "EG_Ajclose" "SF_Price"
## [9] "SF_Open" "SF_High" "SF_Low" "GDX_Open"
## [13] "GDX_High" "GDX_Low" "GDX_Adj.Close"
Attributes with a correlation coefficient greater than 0.8, indicating a strong positive relationship with the adjusted closing price. The analysis reveals that 15 out of 75 attributes exhibit high correlation with the adjusted closing gold price.
There are “Open”, “High”, “Low” of Gold ETF; “EG_open”, “EG_high”, “EG_low”, “EG_Ajclose” of Eldorado Gold Corporation; “SF_Price”, “SF_Open”, “SF_High”, “SF_Low” of Silver Futures; and “GDX_Open”, “GDX_High”, “GDX_Low”, “GDX_Close”, “GDX_Adj.Close” of Gold Miners ETF.
# Set a correlation threshold
threshold_negative <- -0.8
# Find attributes highly negatively correlated with "Gold Close Price"
negatively_correlated_vars <- names(which(correlation_matrix["Adj.Close", ] < threshold_negative))
negatively_correlated_vars <- setdiff(negatively_correlated_vars, "Adj.Close")
# Display the highly correlated attributes
cat("Attributes highly negatively correlated with Gold Close Price:\n")
## Attributes highly negatively correlated with Gold Close Price:
print(negatively_correlated_vars)
## character(0)
Attributes with a correlation coefficient lower than -0.8, indicating a substantial negative relationship with the adjusted closing price. The result reveal that there are no attributes demonstrating a strong negative correlation with the adjusted closing price.
Machine learning algorithms are employed for the analysis and prediction of outcomes. Random Forest model was utilized due its versatility in handling both regression and classification tasks. In addition, basic models like linear regression and logistic regression are integrated for a comprehensive assessment of predictive performance. This inclusion allows for a comparative analysis of the Random Forest model against more traditional regression approaches. The evaluation of model performance is conducted using standard metrics such as R-squared, Root Mean Squared Error (RMSE), Mean Absolute Error (MAE) for regression models, and accuracy, precision, recall, and F1 score for classification models.
set.seed(123)
train_index <- sample(1:nrow(df), 0.8 * nrow(df))
train_data <- df[train_index, ]
test_data <- df[-train_index, ]
# Extract features and target
target_column <- "Adj.Close"
features <- train_data[, !colnames(train_data) %in% c("Date", "difference","boolean", target_column)]
target <- train_data[[target_column]]
The “importance = TRUE” argument indicates that feature importance scores should be calculated during the training process.
# Train a Random Forest Model
rf_model <- randomForest(features, y = target, importance = TRUE)
%IncMSE is a metric that reflects the contribution of each feature to the reduction in mean squared error during the building of decision trees within the Random Forest. It provides a useful way to identify and prioritize the features that have the most significant impact on the model’s predictive performance.
# Extract and visualize feature importance
importance_scores <- importance(rf_model)
importance_df <- data.frame(
Feature = rownames(importance_scores),
Importance = importance_scores[, "%IncMSE"]
)
importance_df <- importance_df[order(-importance_df$Importance), ]
# The importance score of each feature.
ggplot(importance_df, aes(x = Importance, y = reorder(Feature, Importance))) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "Random Forest Feature Importance", x = "Importance", y = "Feature") +
theme(axis.text.y = element_text(size = 5, hjust = 0))
# The Top 10 importance features
top_10_importance <- head(importance_df, 10)
ggplot(top_10_importance, aes(x = Importance, y = reorder(Feature, Importance))) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "Top 10 Random Forest Feature Importance", x = "Importance", y = "Feature") +
theme(axis.text.y = element_text(size = 8, hjust = 0))
# Select top 10 important features
top_features <- importance_df$Feature[1:10]
print(top_features)
## [1] "High" "Low" "Open" "SF_High" "PLD_Price"
## [6] "GDX_Volume" "USB_Price" "USB_Low" "SF_Low" "PLT_Trend"
# Create new dataset
selected_train_data <- train_data[, c("Adj.Close", top_features)]
selected_test_data <- test_data[, c("Adj.Close", top_features)]
# Train the Random Forest Regression Model.
model_target_column <- 1 #Adj_Close
model_features <- selected_train_data[, -model_target_column]
model_target <- selected_train_data[[model_target_column]]
rf_model_regression <- randomForest(model_features, y = model_target)
# Make predictions on the test set.
predictions <- predict(rf_model_regression, newdata = selected_test_data[, -model_target_column])
cat("Predictions on Test Data:\n")
## Predictions on Test Data:
print(predictions)
## 4 8 16 22 23 44 48 51
## 157.0033 155.5218 156.5340 160.4026 160.7613 169.9824 171.5362 167.0580
## 54 58 63 66 67 86 98 100
## 162.0506 165.0898 161.4969 158.7568 161.5459 159.3583 156.5779 155.3955
## 102 108 110 111 112 114 123 124
## 151.8649 153.1618 152.2342 152.5923 152.4697 152.3515 157.5037 157.0869
## 126 129 130 132 134 141 147 148
## 157.2275 153.1192 152.4355 152.9328 151.9537 154.0473 153.2261 153.7193
## 170 174 177 182 183 184 189 193
## 158.6250 161.7843 161.1424 168.4435 168.2818 168.5639 171.3006 170.9323
## 204 220 221 223 233 234 245 246
## 171.1903 165.0669 166.0221 167.9466 169.3908 166.0451 164.3795 164.3408
## 247 248 264 274 275 284 285 289
## 164.4159 163.3586 161.3208 160.7360 161.2053 159.7396 159.3781 156.2535
## 296 301 302 303 308 309 314 318
## 153.4753 153.1671 152.6199 153.4046 155.5503 155.7474 154.6521 152.4035
## 321 323 325 330 337 339 341 342
## 151.5171 152.5017 151.1730 141.9454 137.9297 133.3414 131.3833 132.0258
## 346 352 355 362 363 369 370 371
## 134.0351 135.5970 133.4120 131.8745 124.9557 120.9571 117.3066 119.4095
## 372 378 389 396 400 405 409 413
## 120.7941 124.2562 126.8717 127.9671 132.3265 136.1943 136.0796 134.1764
## 415 419 440 441 443 444 455 473
## 132.1586 126.9607 127.3221 126.9811 128.9918 128.8131 124.0693 118.1027
## 477 484 487 492 496 497 498 502
## 121.1903 117.9611 120.6592 119.6939 121.3126 120.8439 121.9611 121.4157
## 506 508 512 513 515 528 533 544
## 123.9578 125.1160 127.2302 127.6555 128.9257 132.1816 126.5719 126.1455
## 547 551 557 560 562 563 564 569
## 126.9906 125.0734 124.9695 126.2499 124.8045 124.2152 124.2211 124.4066
## 571 572 574 576 586 591 597 601
## 124.4118 124.0923 124.4386 121.2945 121.3796 122.4930 126.6546 127.8846
## 605 608 635 645 653 662 664 677
## 127.7498 125.6909 122.7536 121.9013 118.2101 116.6356 115.0421 118.2553
## 690 696 705 719 722 728 732 736
## 111.9344 114.2219 115.8759 112.8013 113.8100 116.3082 118.3702 124.1022
## 740 749 750 760 771 774 778 791
## 124.1795 118.9317 118.4979 115.5673 110.8991 110.2903 113.9350 115.7269
## 796 801 803 806 810 811 812 824
## 115.4417 113.7058 115.8504 112.5984 113.7304 113.9993 114.0543 113.7556
## 826 829 838 839 842 846 850 852
## 114.8147 112.8883 113.4535 115.2191 112.9224 112.6761 112.0852 111.3270
## 854 859 871 872 876 882 885 907
## 111.4289 108.5674 104.3233 104.3085 106.2840 108.0104 110.9892 109.9552
## 909 912 917 922 926 934 956 963
## 108.6579 107.2844 110.9426 113.0672 111.7731 107.5709 103.2488 101.7840
## 965 967 968 973 976 979 997 1001
## 101.7884 102.9589 102.8979 101.6508 102.9612 105.1002 110.6362 119.6274
## 1004 1006 1009 1013 1021 1028 1036 1040
## 115.5039 117.5822 118.7790 118.4656 120.2405 119.5488 116.1019 118.3528
## 1041 1042 1046 1049 1050 1051 1062 1063
## 119.6009 119.7801 117.9885 119.8103 118.2950 118.4495 120.6253 121.6575
## 1065 1072 1085 1086 1090 1091 1098 1102
## 121.2337 117.6252 122.4884 122.6454 122.7164 121.2159 126.0704 129.8504
## 1107 1111 1113 1115 1117 1127 1137 1148
## 127.1492 125.7788 126.5814 126.1944 127.6909 128.1722 127.2732 126.2274
## 1149 1151 1157 1161 1165 1170 1171 1172
## 126.3342 125.0018 127.7288 126.2877 119.6468 120.1225 119.6583 119.7322
## 1174 1176 1178 1180 1185 1195 1202 1208
## 121.0758 120.6705 121.2151 120.9997 124.2556 115.6194 111.4122 111.0372
## 1211 1219 1223 1240 1253 1255 1261 1270
## 110.3393 108.3926 110.2245 113.8927 117.9927 117.5434 118.4054 114.6270
## 1273 1281 1284 1290 1291 1297 1299 1301
## 116.9718 119.1332 118.9452 120.8141 121.6288 122.1911 120.4954 120.2923
## 1306 1313 1316 1319 1320 1329 1337 1338
## 116.8723 117.2545 119.2929 119.6880 119.4118 122.5550 118.5673 118.1780
## 1340 1341 1346 1349 1350 1351 1357 1359
## 118.8748 119.5223 118.0777 116.3552 115.3295 115.1482 118.0292 118.0980
## 1360 1361 1362 1366 1375 1379 1381 1393
## 119.0394 119.4588 118.9418 120.6561 122.4354 122.9573 122.3813 127.8806
## 1395 1398 1403 1404 1405 1419 1420 1421
## 126.1758 125.7438 123.0921 123.4720 123.3431 121.9841 121.6972 122.2442
## 1423 1424 1425 1428 1442 1448 1458 1459
## 121.4114 121.3436 121.2023 120.8007 122.4128 122.7994 117.6743 118.3564
## 1463 1464 1465 1467 1473 1478 1479 1480
## 119.7625 120.0588 120.1512 121.6252 125.3655 125.5177 126.2849 126.8266
## 1494 1495 1496 1504 1516 1520 1539 1542
## 125.9890 125.1523 124.8925 126.0960 125.2962 125.0185 126.9808 127.3823
## 1544 1550 1551 1563 1565 1566 1567 1571
## 127.4448 125.4123 124.5971 122.4115 122.3819 122.2937 122.6062 123.0202
## 1573 1579 1585 1589 1603 1607 1610 1618
## 123.3740 122.9500 121.2428 120.1591 117.5695 115.6227 116.0630 115.1348
## 1623 1633 1636 1638 1643 1651 1654 1656
## 114.8548 114.3849 113.5825 112.9148 112.9659 113.2322 113.2685 112.5505
## 1663 1669 1680 1684 1686 1702 1710 1715
## 112.6428 115.8288 116.5279 115.7210 113.7905 117.3106 117.9432 120.4212
actual_values <- selected_test_data$Adj.Close
rsquared <- cor(predictions, actual_values)^2
rmse <- rmse(predictions, actual_values)
mae <- mae(predictions, actual_values)
cat("R-squared:", rsquared, "\n")
## R-squared: 0.9991521
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
## Root Mean Squared Error (RMSE): 0.4944462
cat("Mean Absolute Error (MAE):", mae, "\n")
## Mean Absolute Error (MAE): 0.3343596
R-squared measures the proportion of the variance in the dependent variable (target) that is explained by the independent variables (features). An R-squared value close to 1 indicates a model that explains a high percentage of the variance in the target variable. In this case, R-squared is very close to 1 (99.9%), suggesting an excellent fit.
RMSE provides a measure of the average magnitude of the errors between predicted and actual values. A lower RMSE indicates better model performance. In this case, an RMSE of 0.4944462 suggests that, on average, the model’s predictions deviate by approximately 0.49 units from the actual values.
MAE is another measure of the average magnitude of errors but without squaring them. Similar to RMSE, a lower MAE indicates better performance. The MAE of 0.3343596 suggests that, on average, the absolute difference between predicted and actual values is approximately 0.33 units.
In summary, the output indicates that the Random Forest regression model has performed exceptionally well on the test dataset, with very high R-squared and low RMSE and MAE values. This suggests that the model’s predictions closely align with the actual values in the test set.
prediction_df <- data.frame(Actual = actual_values, Predicted = predictions)
ggplot(prediction_df, aes(x = Actual, y = Predicted)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
labs(title = "Predicted vs. Actual", x = "Actual", y = "Predicted")
The scatter plot visualizes the relationship between the actual and predicted values. Each point on the plot represents an observation in the test set. The dashed red line represents perfect predictions, where actual values equal predicted values.When points are close to the red dashed line, it indicates that the model is making accurate predictions.
#create the Multivariate linear regression model with top 10 features
regression_model <- lm(Adj.Close ~., data = selected_test_data)
regression_model
##
## Call:
## lm(formula = Adj.Close ~ ., data = selected_test_data)
##
## Coefficients:
## (Intercept) High Low Open SF_High PLD_Price
## 3.717e-01 7.393e-01 7.694e-01 -5.141e-01 -9.468e-07 1.961e-04
## GDX_Volume USB_Price USB_Low SF_Low PLT_Trend
## -1.077e-09 -1.233e+00 1.136e+00 1.107e-05 7.449e-02
# Make predictions on the test set
regression_predictions <- predict(regression_model , selected_test_data)
regression_predictions
## 4 8 16 22 23 44 48 51
## 157.1331 155.0758 156.6639 160.9708 160.7969 170.6705 172.2521 167.0736
## 54 58 63 66 67 86 98 100
## 162.3405 164.8640 161.8622 159.2307 161.3769 159.7788 155.3895 155.0576
## 102 108 110 111 112 114 123 124
## 151.4745 152.6383 151.2542 152.4960 151.1135 151.4872 156.8183 156.8657
## 126 129 130 132 134 141 147 148
## 157.3267 152.6250 151.9061 152.5550 150.7518 152.5990 153.1802 153.6514
## 170 174 177 182 183 184 189 193
## 159.0661 161.8532 160.3561 168.4695 167.6531 168.2486 171.5633 171.1576
## 204 220 221 223 233 234 245 246
## 171.7453 165.8665 165.6504 168.0333 168.9755 166.4886 164.6116 164.4140
## 247 248 264 274 275 284 285 289
## 164.5287 161.7853 160.5613 160.4489 161.0741 159.5484 159.6105 155.1738
## 296 301 302 303 308 309 314 318
## 152.8472 152.7795 152.7844 152.9715 155.3512 155.9760 154.6687 152.6157
## 321 323 325 330 337 339 341 342
## 152.3157 153.5050 151.4577 142.2099 138.6821 134.0146 134.1949 132.7225
## 346 352 355 362 363 369 370 371
## 134.4668 135.8587 133.8570 131.2842 124.5494 121.3549 117.0701 119.4454
## 372 378 389 396 400 405 409 413
## 120.7794 123.7060 126.8151 127.7227 132.8827 137.0485 136.1404 133.8767
## 415 419 440 441 443 444 455 473
## 131.4923 126.3797 127.5239 127.0583 129.4618 128.7926 123.6513 118.2767
## 477 484 487 492 496 497 498 502
## 121.0953 118.0152 120.8396 119.7628 121.1862 120.6552 121.6460 121.0524
## 506 508 512 513 515 528 533 544
## 124.3617 125.3587 127.2156 127.6810 129.3193 131.8148 126.3630 126.0391
## 547 551 557 560 562 563 564 569
## 127.0417 124.7446 125.0840 126.2600 124.2896 124.1357 123.9765 124.5816
## 571 572 574 576 586 591 597 601
## 124.6819 123.9654 124.3197 121.2043 121.3770 122.6206 126.6412 127.9878
## 605 608 635 645 653 662 664 677
## 127.8580 125.8824 122.8990 121.8682 117.9692 116.5733 114.7895 118.0684
## 690 696 705 719 722 728 732 736
## 112.0990 113.5028 116.4388 112.9566 113.5754 116.5960 118.3128 124.2687
## 740 749 750 760 771 774 778 791
## 124.3164 119.1270 118.5930 115.5751 110.4334 110.8785 114.1600 115.8208
## 796 801 803 806 810 811 812 824
## 115.6741 113.1498 116.2614 112.9324 113.5750 113.9093 113.3409 113.6175
## 826 829 838 839 842 846 850 852
## 114.4672 112.7325 113.4880 115.2502 112.9970 112.9312 112.2711 111.3230
## 854 859 871 872 876 882 885 907
## 111.4380 108.5268 104.1059 103.9141 106.0761 108.4322 110.3774 110.4534
## 909 912 917 922 926 934 956 963
## 108.6686 106.8566 110.7232 112.7982 112.0287 106.9176 103.7750 101.7431
## 965 967 968 973 976 979 997 1001
## 100.3874 103.3002 102.7936 101.4596 103.2809 105.4059 111.4874 119.4025
## 1004 1006 1009 1013 1021 1028 1036 1040
## 115.4116 117.5883 118.1716 117.7482 119.7739 119.4417 116.2046 118.5302
## 1041 1042 1046 1049 1050 1051 1062 1063
## 119.9923 119.7785 117.8162 119.5262 117.7417 118.5753 120.7660 121.7572
## 1065 1072 1085 1086 1090 1091 1098 1102
## 121.4192 117.3722 122.2343 122.4430 122.9884 121.0570 126.1130 129.4702
## 1107 1111 1113 1115 1117 1127 1137 1148
## 126.9491 125.7638 126.2401 125.9672 127.5553 128.1815 126.8461 125.7327
## 1149 1151 1157 1161 1165 1170 1171 1172
## 126.3984 125.0198 127.6807 125.6425 119.4945 120.1476 119.9251 119.8800
## 1174 1176 1178 1180 1185 1195 1202 1208
## 121.1664 120.7745 121.5860 120.9005 124.3002 115.6131 111.3907 110.3706
## 1211 1219 1223 1240 1253 1255 1261 1270
## 108.8498 108.4185 110.4922 114.1446 118.2563 117.5475 118.7400 114.4370
## 1273 1281 1284 1290 1291 1297 1299 1301
## 117.1736 119.3664 119.2804 121.1320 121.8913 122.1323 120.2515 120.1613
## 1306 1313 1316 1319 1320 1329 1337 1338
## 117.0297 117.1748 118.7970 119.2111 119.5124 122.4380 118.4545 118.2117
## 1340 1341 1346 1349 1350 1351 1357 1359
## 118.7849 119.4839 118.2538 116.2931 114.9795 115.2475 118.2495 118.3178
## 1360 1361 1362 1366 1375 1379 1381 1393
## 119.1870 119.2744 118.9932 120.7604 122.6023 122.4777 122.0008 127.7784
## 1395 1398 1403 1404 1405 1419 1420 1421
## 126.3657 125.5450 123.2296 124.1864 123.2111 122.1226 121.7509 122.4839
## 1423 1424 1425 1428 1442 1448 1458 1459
## 121.7411 121.3463 121.2151 121.2877 122.8458 122.8327 117.8389 118.9591
## 1463 1464 1465 1467 1473 1478 1479 1480
## 119.7580 120.1606 120.3179 121.8348 125.5813 125.5703 126.7400 126.9265
## 1494 1495 1496 1504 1516 1520 1539 1542
## 125.2714 124.7154 124.8945 125.8001 125.3808 124.8087 126.5746 127.5899
## 1544 1550 1551 1563 1565 1566 1567 1571
## 127.2430 125.5156 124.7316 122.1924 122.4400 122.3371 122.4854 123.3147
## 1573 1579 1585 1589 1603 1607 1610 1618
## 123.4722 122.9890 121.1168 120.3241 117.5908 116.0764 116.1725 115.2858
## 1623 1633 1636 1638 1643 1651 1654 1656
## 114.9339 114.6270 113.3899 112.7877 113.0251 113.3887 113.0913 112.7529
## 1663 1669 1680 1684 1686 1702 1710 1715
## 112.6369 116.0716 116.6800 115.7877 113.6950 117.1782 118.0508 120.0375
regression_actual <- selected_test_data$Adj.Close
regression_rsquared <- cor(regression_predictions, regression_actual)^2
regression_rmse <- rmse(regression_predictions, regression_actual)
regression_mae <- mae(regression_predictions, regression_actual)
cat("R-squared:", regression_rsquared, "\n")
## R-squared: 0.9997354
cat("Root Mean Squared Error (RMSE):", regression_rmse, "\n")
## Root Mean Squared Error (RMSE): 0.2754923
cat("Mean Absolute Error (MAE):", regression_mae, "\n")
## Mean Absolute Error (MAE): 0.2002636
The results from the linear regression model demonstrate an outstanding performance. The R-squared value of 0.9997354 closeness to 1 suggests an excellent fit, signifying that the independent variables have a nearly perfect explanatory power over the target variable.
A low RMSE of 0.2754923 indicative of superior model accuracy. In this case, the small value suggests that, on average, the model’s predictions deviate by only approximately 0.28 units from the actual values.
A low MAE of 0.2002636 signifies that, on average, the absolute difference between the predicted and actual values is only about 0.20 units.
In summary, the linear regression model exhibits remarkable accuracy, as evidenced by the very high R-squared value and the low RMSE and MAE values.
prediction_regression <- data.frame(Actual = regression_actual, Predicted = regression_predictions)
ggplot(prediction_regression, aes(x = Actual, y = Predicted)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
labs(title = "Predicted vs. Actual", x = "Actual", y = "Predicted")
This scatter plot show that predicted values are very close to actual values. This indicates that the model is making accurate predictions.
set.seed(123)
train_index <- sample(1:nrow(df), 0.8 * nrow(df))
train_data <- df[train_index, ]
test_data <- df[-train_index, ]
# Extract features and target
target_classification <- "boolean"
features_classification <- train_data[, !colnames(train_data) %in% c("Date", "difference","Adj.Close", target_classification)]
target_classification <- train_data[[target_classification]]
# Assuming boolean is a factor
target_classification <- as.factor(target_classification)
# Train a Random Forest Model
fs_classification <- randomForest(features_classification, y = target_classification, importance = TRUE)
# Extract and visualize feature importance
importance_scores_classification <- importance(fs_classification)
importance_df_classification <- data.frame(
Feature = rownames(importance_scores_classification),
Importance = importance_scores_classification[, "MeanDecreaseAccuracy"]
)
importance_df_classification <- importance_df_classification[order(-importance_df_classification$Importance), ]
# The importance score of each feature.
ggplot(importance_df_classification, aes(x = Importance, y = reorder(Feature, Importance))) +
geom_bar(stat = "identity", fill = "blue") +
labs(title = "Random Forest Feature Importance", x = "Importance", y = "Feature") +
theme(axis.text.y = element_text(size = 5, hjust = 0))
# The Top 10 importance features
top_10_importance_classification <- head(importance_df_classification, 10)
ggplot(top_10_importance_classification, aes(x = Importance, y = reorder(Feature, Importance))) +
geom_bar(stat = "identity", fill = "blue") +
labs(title = "Top 10 Random Forest Feature Importance", x = "Importance", y = "Feature") +
theme(axis.text.y = element_text(size = 8, hjust = 0))
# Select top 10 important features
top_features_classification <- importance_df_classification$Feature[1:10]
print(top_features_classification)
## [1] "PLT_Trend" "SF_Trend" "USDI_Trend" "EU_Trend"
## [5] "USB_Trend" "PLD_Trend" "SF_Open" "GDX_Adj.Close"
## [9] "DJ_open" "High"
# Create new dataset
selected_train <- train_data[, c("boolean", top_features_classification)]
selected_test <- test_data[, c("boolean", top_features_classification)]
model_target <- 1 #boolean
train_data$boolean <- as.factor(train_data$boolean)
# Assuming diff_boolean is a factor
# Assuming boolean is a factor
selected_train$boolean <- as.factor(selected_train$boolean)
rf_model_classification = randomForest(x = selected_train[, -model_target],
y = selected_train$boolean,
ntree = 300)
rf_model_classification
##
## Call:
## randomForest(x = selected_train[, -model_target], y = selected_train$boolean, ntree = 300)
## Type of random forest: classification
## Number of trees: 300
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 25.13%
## Confusion matrix:
## 0 1 class.error
## 0 501 177 0.2610619
## 1 168 527 0.2417266
# Make predictions on the test set
predictions_classification <- predict(rf_model_classification , selected_test)
predictions_classification
## 4 8 16 22 23 44 48 51 54 58 63 66 67 86 98 100
## 1 1 1 1 0 1 0 1 0 0 1 0 1 1 0 0
## 102 108 110 111 112 114 123 124 126 129 130 132 134 141 147 148
## 0 0 0 1 0 0 1 1 1 0 0 0 0 0 0 1
## 170 174 177 182 183 184 189 193 204 220 221 223 233 234 245 246
## 1 1 0 1 0 1 0 0 1 1 0 1 0 0 0 1
## 247 248 264 274 275 284 285 289 296 301 302 303 308 309 314 318
## 1 0 0 0 1 0 0 0 0 1 1 1 1 1 1 0
## 321 323 325 330 337 339 341 342 346 352 355 362 363 369 370 371
## 1 1 1 1 0 0 1 0 0 1 1 0 0 1 0 1
## 372 378 389 396 400 405 409 413 415 419 440 441 443 444 455 473
## 0 0 1 0 1 1 0 0 1 0 1 1 1 0 0 0
## 477 484 487 492 496 497 498 502 506 508 512 513 515 528 533 544
## 0 0 1 0 0 0 1 1 1 1 0 1 1 1 0 1
## 547 551 557 560 562 563 564 569 571 572 574 576 586 591 597 601
## 1 0 1 1 0 0 0 0 1 0 0 0 1 1 1 0
## 605 608 635 645 653 662 664 677 690 696 705 719 722 728 732 736
## 1 0 0 0 0 0 0 0 1 1 1 1 0 0 1 1
## 740 749 750 760 771 774 778 791 796 801 803 806 810 811 812 824
## 0 0 0 1 1 0 1 1 1 1 1 0 0 1 0 0
## 826 829 838 839 842 846 850 852 854 859 871 872 876 882 885 907
## 0 0 1 1 0 0 0 1 1 0 0 1 1 1 1 1
## 909 912 917 922 926 934 956 963 965 967 968 973 976 979 997 1001
## 0 0 1 0 0 0 1 0 0 1 0 1 1 1 1 1
## 1004 1006 1009 1013 1021 1028 1036 1040 1041 1042 1046 1049 1050 1051 1062 1063
## 0 0 1 0 0 0 0 1 1 1 0 0 0 1 1 1
## 1065 1072 1085 1086 1090 1091 1098 1102 1107 1111 1113 1115 1117 1127 1137 1148
## 1 0 1 1 1 0 1 1 0 0 0 1 1 0 0 0
## 1149 1151 1157 1161 1165 1170 1171 1172 1174 1176 1178 1180 1185 1195 1202 1208
## 1 0 0 0 0 0 0 1 0 1 1 0 1 1 0 0
## 1211 1219 1223 1240 1253 1255 1261 1270 1273 1281 1284 1290 1291 1297 1299 1301
## 1 1 0 0 1 0 0 0 1 1 1 1 1 0 0 0
## 1306 1313 1316 1319 1320 1329 1337 1338 1340 1341 1346 1349 1350 1351 1357 1359
## 1 1 0 0 1 0 0 0 0 1 0 0 0 1 1 1
## 1360 1361 1362 1366 1375 1379 1381 1393 1395 1398 1403 1404 1405 1419 1420 1421
## 1 0 0 1 1 1 0 0 1 0 0 1 0 0 0 1
## 1423 1424 1425 1428 1442 1448 1458 1459 1463 1464 1465 1467 1473 1478 1479 1480
## 1 0 0 1 1 0 0 1 1 1 0 1 1 1 1 1
## 1494 1495 1496 1504 1516 1520 1539 1542 1544 1550 1551 1563 1565 1566 1567 1571
## 0 0 1 0 1 0 0 1 0 1 0 1 1 1 1 1
## 1573 1579 1585 1589 1603 1607 1610 1618 1623 1633 1636 1638 1643 1651 1654 1656
## 0 0 0 1 0 0 1 1 0 1 0 0 0 0 0 0
## 1663 1669 1680 1684 1686 1702 1710 1715
## 1 0 0 0 0 1 0 1
## Levels: 0 1
# Evaluate the model
rf_confusion_matrix <- table(predictions_classification, selected_test$boolean)
rf_cm <- confusionMatrix(rf_confusion_matrix, mode = "everything")
rf_cm
## Confusion Matrix and Statistics
##
##
## predictions_classification 0 1
## 0 140 40
## 1 34 130
##
## Accuracy : 0.7849
## 95% CI : (0.7376, 0.8271)
## No Information Rate : 0.5058
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5695
##
## Mcnemar's Test P-Value : 0.5611
##
## Sensitivity : 0.8046
## Specificity : 0.7647
## Pos Pred Value : 0.7778
## Neg Pred Value : 0.7927
## Precision : 0.7778
## Recall : 0.8046
## F1 : 0.7910
## Prevalence : 0.5058
## Detection Rate : 0.4070
## Detection Prevalence : 0.5233
## Balanced Accuracy : 0.7847
##
## 'Positive' Class : 0
##
The random forest classification model achieves an accuracy of 78.49% demonstrating ability to correctly classify instances with a success rate of 78.49%. A low p-value indicates a substantial improvement over a random model, and the Kappa statistic of 0.5695 signifies moderate agreement between predicted and actual classifications.
Examining precision, the model accurately predicts 77.78% of positive instances, showcasing its ability to make reliable positive predictions. Meanwhile, the recall metric evaluates the model’s effectiveness in capturing 80.46% of the positive instances.The F1 score indicating 79.10%, effectively balances precision and recall.
#create the logistic regression model with all predictors
lr_model <- glm(boolean ~. ,selected_train, family = "binomial")
lr_model
##
## Call: glm(formula = boolean ~ ., family = "binomial", data = selected_train)
##
## Coefficients:
## (Intercept) PLT_Trend SF_Trend USDI_Trend EU_Trend
## 2.287e+00 1.318e+00 1.439e+00 -7.490e-01 5.501e-01
## USB_Trend PLD_Trend SF_Open GDX_Adj.Close DJ_open
## -9.109e-01 4.042e-01 -5.833e-05 8.557e-02 -1.035e-05
## High
## -2.157e-02
##
## Degrees of Freedom: 1372 Total (i.e. Null); 1362 Residual
## Null Deviance: 1903
## Residual Deviance: 1312 AIC: 1334
#generate the predictions of the lr model
predictions <- predict(lr_model, newdata = selected_test, type = "response")
# Convert predicted probabilities to binary predictions
lr_pred <- ifelse(predictions > 0.5, 1, 0)
lr_pred
## 4 8 16 22 23 44 48 51 54 58 63 66 67 86 98 100
## 1 1 1 1 0 1 1 1 1 1 1 0 1 1 0 0
## 102 108 110 111 112 114 123 124 126 129 130 132 134 141 147 148
## 0 0 0 1 0 0 1 1 1 0 0 0 0 0 0 0
## 170 174 177 182 183 184 189 193 204 220 221 223 233 234 245 246
## 1 1 0 1 0 1 0 0 1 1 0 1 0 0 0 1
## 247 248 264 274 275 284 285 289 296 301 302 303 308 309 314 318
## 0 0 0 0 1 0 1 0 0 1 1 0 0 0 0 0
## 321 323 325 330 337 339 341 342 346 352 355 362 363 369 370 371
## 1 1 1 0 0 0 1 0 0 1 1 0 0 1 0 1
## 372 378 389 396 400 405 409 413 415 419 440 441 443 444 455 473
## 0 0 1 0 1 1 0 0 1 0 1 1 1 0 0 0
## 477 484 487 492 496 497 498 502 506 508 512 513 515 528 533 544
## 0 0 1 0 0 0 1 1 1 1 0 1 1 0 0 1
## 547 551 557 560 562 563 564 569 571 572 574 576 586 591 597 601
## 0 0 0 1 0 0 0 0 1 0 0 0 1 1 0 1
## 605 608 635 645 653 662 664 677 690 696 705 719 722 728 732 736
## 1 0 1 1 0 0 0 0 1 0 1 1 0 0 1 1
## 740 749 750 760 771 774 778 791 796 801 803 806 810 811 812 824
## 0 1 0 1 1 0 1 1 1 1 1 0 0 1 0 0
## 826 829 838 839 842 846 850 852 854 859 871 872 876 882 885 907
## 0 0 1 1 0 1 0 1 1 0 0 0 1 1 0 1
## 909 912 917 922 926 934 956 963 965 967 968 973 976 979 997 1001
## 0 0 1 0 1 0 1 1 0 1 1 0 1 1 0 1
## 1004 1006 1009 1013 1021 1028 1036 1040 1041 1042 1046 1049 1050 1051 1062 1063
## 0 0 0 0 0 0 0 1 1 1 0 0 0 1 1 1
## 1065 1072 1085 1086 1090 1091 1098 1102 1107 1111 1113 1115 1117 1127 1137 1148
## 1 0 1 0 1 0 1 0 1 0 0 1 1 0 0 0
## 1149 1151 1157 1161 1165 1170 1171 1172 1174 1176 1178 1180 1185 1195 1202 1208
## 1 0 0 1 0 0 0 1 0 0 1 1 1 1 1 0
## 1211 1219 1223 1240 1253 1255 1261 1270 1273 1281 1284 1290 1291 1297 1299 1301
## 0 1 1 0 1 0 0 0 1 0 1 1 1 0 0 0
## 1306 1313 1316 1319 1320 1329 1337 1338 1340 1341 1346 1349 1350 1351 1357 1359
## 1 1 0 0 1 0 0 0 1 1 0 1 0 1 1 1
## 1360 1361 1362 1366 1375 1379 1381 1393 1395 1398 1403 1404 1405 1419 1420 1421
## 1 0 0 1 1 1 0 0 1 0 0 1 0 1 0 1
## 1423 1424 1425 1428 1442 1448 1458 1459 1463 1464 1465 1467 1473 1478 1479 1480
## 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1
## 1494 1495 1496 1504 1516 1520 1539 1542 1544 1550 1551 1563 1565 1566 1567 1571
## 0 0 1 0 1 0 0 1 0 1 0 0 0 1 1 1
## 1573 1579 1585 1589 1603 1607 1610 1618 1623 1633 1636 1638 1643 1651 1654 1656
## 0 0 0 1 0 0 1 1 0 1 0 0 0 0 0 1
## 1663 1669 1680 1684 1686 1702 1710 1715
## 0 0 0 0 0 1 0 1
# Evaluate the model
lr_confusion_matrix <- table(lr_pred, selected_test$boolean)
lr_cm <- confusionMatrix(lr_confusion_matrix, mode = "everything")
lr_cm
## Confusion Matrix and Statistics
##
##
## lr_pred 0 1
## 0 139 42
## 1 35 128
##
## Accuracy : 0.7762
## 95% CI : (0.7284, 0.8191)
## No Information Rate : 0.5058
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5521
##
## Mcnemar's Test P-Value : 0.4941
##
## Sensitivity : 0.7989
## Specificity : 0.7529
## Pos Pred Value : 0.7680
## Neg Pred Value : 0.7853
## Precision : 0.7680
## Recall : 0.7989
## F1 : 0.7831
## Prevalence : 0.5058
## Detection Rate : 0.4041
## Detection Prevalence : 0.5262
## Balanced Accuracy : 0.7759
##
## 'Positive' Class : 0
##
The logistic regression model attains an accuracy of 77.62% demonstrating ability to correctly classify instances with a success rate of 77.62%. A low p-value signifies a significant improvement over a random model, and the Kappa statistic of 0.5521 reflects moderate agreement between predicted and actual classifications.
Delving into precision, the model able to predict 76.8% of positive instances, highlighting its ability in making reliable positive predictions. Simultaneously, the recall metric evaluates the model’s effectiveness in capturing 79.89% of the positive instances. The F1 score, at 78.31%, effectively balances precision and recall.
In summary, the exploratory data analytics reveals a strong correlation among pricing attributes within Gold ETF, Eldorado Gold Corporation, Silver Futures, and Gold Miners ETF, offering valuable insights into potential factors influencing adjusted closing gold prices.
Based on the analysis and the development of the gold price prediction model, machine learning techniques have produced significant results as to the factors influencing gold prices. To analyse the regression model in predicting gold prices, the Linear Regression model outperform Random Forest regression model with higher R-squared close to 1 indicating precise alignment between predictions and actual values, and low RMSE and MAE values below 0.3 units.
For gold price trend classification, the Random Forest classification model and Logistic Regression model demonstrating significant improvement over random model and ability to predict gold price trends with moderate agreement between actual and predicted classifications. the Random Forest classification model surpasses Logistic Regression model exhibiting higher accuracy, precision, recall, and F1 score.
In conclusion, the Random Forest excels in gold price prediction, while overall, the linear regression models demonstrate strong performance. The effectiveness of this model in predicting gold prices would benefit stakeholders with a more informed investment decision. Stakeholders would gain a comprehensive analysis on the sophisticated gold market and the overall structure affecting the fluctuation of the gold prices. In short, this model would act as a valuable tool to empower stakeholders in navigating the gold market industry with the insights on the various factors affecting the gold prices.