This RPubs publication explains everything that happens behind InvestNow, from data collection through machine learning modeling, with the aim of making the project easier to understand so that more people can help develop it into an even better project.
InvestNow collects all the stock data it needs from Yahoo Finance using the tq_get() function from the tidyquant library. There are two arguments of tq_get() that deserve attention.
- Argument x :
This argument represents a single stock symbol. The symbol used by Yahoo Finance may differ from the symbol commonly used in a particular country. For example, in Indonesia the shares of Bank BRI are listed as BBRI, but on Yahoo Finance the symbol is BBRI.JK.
- Argument get :
This argument specifies what data to retrieve and from which source. Since InvestNow needs the open, high, low, close, volume and adjusted prices of a stock symbol from Yahoo Finance, use the option "stock.prices".
However, since the data is free, the data obtained with tq_get() from Yahoo Finance is not real-time but H-1 data (one day behind).
library(tidyquant)
bbri <- tq_get(x = "BBRI.JK",
get = "stock.prices",
from = " 2016-01-01")
head(bbri, 3)After the data from the desired stock has been successfully collected from Yahoo Finance, let’s try to check if there is a missing value with function colSums(is.na()).
colSums(is.na(bbri))
## symbol date open high low close volume adjusted
## 0 0 1 1 1 1 1 1
Unfortunately there are missing values. They can be removed with the drop_na() function from the tidyverse library.
library(tidyverse)
bbri <- bbri %>%
drop_na()
colSums(is.na(bbri))
## symbol date open high low close volume adjusted
## 0 0 0 0 0 0 0 0
glimpse(bbri)
## Rows: 1,411
## Columns: 8
## $ symbol <chr> "BBRI.JK", "BBRI.JK", "BBRI.JK", "BBRI.JK", "BBRI.JK", "BBRI.~
## $ date <date> 2016-01-04, 2016-01-05, 2016-01-06, 2016-01-07, 2016-01-08, ~
## $ open <dbl> 2280, 2315, 2280, 2270, 2250, 2280, 2295, 2335, 2270, 2300, 2~
## $ high <dbl> 2320, 2365, 2355, 2305, 2340, 2305, 2345, 2340, 2355, 2330, 2~
## $ low <dbl> 2240, 2315, 2280, 2250, 2250, 2255, 2285, 2315, 2270, 2290, 2~
## $ close <dbl> 2295, 2315, 2305, 2250, 2320, 2275, 2320, 2320, 2345, 2290, 2~
## $ volume <dbl> 100379000, 108043000, 105125500, 71275500, 106501000, 1131830~
## $ adjusted <dbl> 1902.051, 1918.626, 1910.339, 1864.756, 1922.770, 1885.475, 1~
From checking the data types, there is no data type that must be changed (optionally, the symbol column can be converted to a factor, as sketched below), and there are no columns containing missing values. The Bank Rakyat Indonesia stock data is ready for further processing.
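A minimal sketch of that optional conversion (not part of the original workflow) could look like this:
# Optional: store the stock symbol as a factor instead of a character column
bbri <- bbri %>%
mutate(symbol = as.factor(symbol))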
isat <- tq_get(x = "ISAT.JK",
get = "stock.prices",
from = " 2016-01-01")
head(isat, 3)colSums(is.na(isat))## symbol date open high low close volume adjusted
## 0 0 1 1 1 1 1 1
isat <- isat %>%
drop_na()
colSums(is.na(isat))
## symbol date open high low close volume adjusted
## 0 0 0 0 0 0 0 0
glimpse(isat)
## Rows: 1,411
## Columns: 8
## $ symbol <chr> "ISAT.JK", "ISAT.JK", "ISAT.JK", "ISAT.JK", "ISAT.JK", "ISAT.~
## $ date <date> 2016-01-04, 2016-01-05, 2016-01-06, 2016-01-07, 2016-01-08, ~
## $ open <dbl> 5500, 5250, 5500, 5300, 5250, 5400, 5300, 5425, 5400, 5500, 5~
## $ high <dbl> 5500, 5625, 5600, 5475, 5350, 5400, 5425, 5500, 5500, 5500, 5~
## $ low <dbl> 5325, 5250, 5300, 5300, 5050, 5275, 5275, 5350, 5300, 5400, 5~
## $ close <dbl> 5325, 5400, 5375, 5350, 5300, 5325, 5425, 5400, 5400, 5475, 5~
## $ volume <dbl> 55100, 78500, 646400, 49000, 402300, 39600, 154500, 415300, 8~
## $ adjusted <dbl> 5150.757, 5223.304, 5199.122, 5174.939, 5126.576, 5150.757, 5~
sido <- tq_get(x = "SIDO.JK",
get = "stock.prices",
from = " 2016-01-01")
head(sido, 3)colSums(is.na(sido))## symbol date open high low close volume adjusted
## 0 0 8 8 8 8 8 8
sido <- sido %>%
drop_na()
colSums(is.na(sido))
## symbol date open high low close volume adjusted
## 0 0 0 0 0 0 0 0
glimpse(sido)
## Rows: 1,404
## Columns: 8
## $ symbol <chr> "SIDO.JK", "SIDO.JK", "SIDO.JK", "SIDO.JK", "SIDO.JK", "SIDO.~
## $ date <date> 2016-01-04, 2016-01-05, 2016-01-06, 2016-01-07, 2016-01-08, ~
## $ open <dbl> 275.0, 275.0, 267.5, 262.5, 267.5, 260.0, 260.0, 260.0, 255.0~
## $ high <dbl> 275.0, 275.0, 267.5, 267.5, 267.5, 262.5, 260.0, 262.5, 257.5~
## $ low <dbl> 270.0, 262.5, 260.0, 257.5, 260.0, 255.0, 257.5, 255.0, 249.5~
## $ close <dbl> 272.5, 265.0, 262.5, 267.5, 262.5, 257.5, 257.5, 257.5, 252.5~
## $ volume <dbl> 1518000, 6712800, 7414400, 8664600, 4862800, 5718000, 1074980~
## $ adjusted <dbl> 199.0331, 193.5551, 191.7291, 195.3811, 191.7291, 188.0771, 1~
hoki <- tq_get(x = "HOKI.JK",
get = "stock.prices",
from = " 2016-01-01")
head(hoki, 3)colSums(is.na(hoki))## symbol date open high low close volume adjusted
## 0 0 0 0 0 0 0 0
glimpse(hoki)
## Rows: 1,042
## Columns: 8
## $ symbol <chr> "HOKI.JK", "HOKI.JK", "HOKI.JK", "HOKI.JK", "HOKI.JK", "HOKI.~
## $ date <date> 2017-07-03, 2017-07-04, 2017-07-05, 2017-07-06, 2017-07-07, ~
## $ open <dbl> 85.5, 87.0, 83.0, 86.5, 91.5, 90.0, 89.0, 86.5, 88.5, 88.0, 8~
## $ high <dbl> 87.0, 88.0, 87.5, 93.0, 93.0, 95.0, 89.0, 89.0, 89.0, 88.5, 8~
## $ low <dbl> 80.0, 81.5, 82.5, 85.5, 88.5, 86.0, 86.0, 84.5, 85.0, 86.0, 8~
## $ close <dbl> 87.0, 83.0, 86.5, 91.0, 89.0, 89.0, 86.5, 89.0, 88.0, 87.0, 8~
## $ volume <dbl> 202226400, 197477200, 110545200, 294411600, 71145600, 4270480~
## $ adjusted <dbl> 78.24085, 74.64358, 77.79119, 81.83813, 80.03949, 80.03949, 7~
wika <- tq_get(x = "WIKA.JK",
get = "stock.prices",
from = " 2016-01-01")
head(wika, 3)colSums(is.na(wika))## symbol date open high low close volume adjusted
## 0 0 1 1 1 1 1 1
wika <- wika %>%
drop_na()
colSums(is.na(wika))
## symbol date open high low close volume adjusted
## 0 0 0 0 0 0 0 0
glimpse(bbri)
## Rows: 1,411
## Columns: 8
## $ symbol <chr> "BBRI.JK", "BBRI.JK", "BBRI.JK", "BBRI.JK", "BBRI.JK", "BBRI.~
## $ date <date> 2016-01-04, 2016-01-05, 2016-01-06, 2016-01-07, 2016-01-08, ~
## $ open <dbl> 2280, 2315, 2280, 2270, 2250, 2280, 2295, 2335, 2270, 2300, 2~
## $ high <dbl> 2320, 2365, 2355, 2305, 2340, 2305, 2345, 2340, 2355, 2330, 2~
## $ low <dbl> 2240, 2315, 2280, 2250, 2250, 2255, 2285, 2315, 2270, 2290, 2~
## $ close <dbl> 2295, 2315, 2305, 2250, 2320, 2275, 2320, 2320, 2345, 2290, 2~
## $ volume <dbl> 100379000, 108043000, 105125500, 71275500, 106501000, 1131830~
## $ adjusted <dbl> 1902.051, 1918.626, 1910.339, 1864.756, 1922.770, 1885.475, 1~
Technical analysis is a way of analyzing price movements in the stock market using statistical tools, such as charts and mathematical formulas.
Technical analysis will be used to obtain the predictors and the target variable, in the form of Buy, Hold or Sell decisions, that will later be used to build the machine learning models. Because the target variable is crucial for training those models, the technical analysis will be carried out as carefully as possible to avoid mistakes in the decision making.
In this project, four types of technical analysis will be used:
- Simple Moving Average (SMA)
Simple Moving Average (SMA) is a technical indicator computed by adding up the most recent prices over a time window and dividing by the number of periods, which gives the average value.
SMA is calculated with the following formula: \[SMA = \frac{P_1 + P_2 + P_3 + \dots + P_n}{n} \] Where:
- P = the price in each of the n periods
- n = number of time periods (for example, a period of 5 means the calculation uses today's price together with the prices of the previous 4 days)
There is no special rule for choosing the time period. If the goal is to buy and sell within a short time, the period can be shortened; if a longer horizon is desired, the period can be lengthened. However, shortening the period increases the potential for errors in the decision-making process.
To determine the right time to buy or sell, four SMAs with different time periods are compared: SMA n1, SMA n2, SMA n3 and SMA n4. SMA n1 serves as the lower limit and SMA n4 as the upper limit; both limits are useful indicators that the stock price is at its “cheapest” or “most expensive” level, while SMA n2 and SMA n3 indicate whether the stock price is declining or rising. A small numeric sketch of the SMA calculation follows.
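As an illustration only (the prices below are hypothetical, not actual BBRI data), the SMA formula can be verified against the SMA() function from the TTR package that quantmod loads:
library(TTR)
# Five hypothetical adjusted closing prices
prices <- c(2295, 2315, 2305, 2250, 2320)
# Manual SMA with n = 5: the sum of the last 5 prices divided by 5
mean(prices) # 2297
# The same value from SMA(); only the 5th element is defined, earlier ones are NA
SMA(prices, n = 5)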
- Exponential Moving Average (EMA)
Exponential Moving Average (EMA) is similar to the SMA in that it measures the direction of the trend over a certain period of time. However, the SMA simply averages the price data, whereas the EMA applies more weight to the more recent data. Because of this weighting, the EMA follows the price more closely than the SMA.
EMA is calculated with the following formula: \[EMA_{now} = (\text{Closing Price} - EMA_{before}) \times \text{Multiplier} + EMA_{before} \]
Where:
- Closing Price = the closing price on that day
- EMA before = the EMA of the previous period (for example, a period of 5 means the calculation uses today's price together with the prices of the previous 4 days)
- Multiplier = the exponential smoothing constant, commonly 2 / (n + 1)
To determine the right time to buy or sell, four EMAs with different time periods are compared: EMA n1, EMA n2, EMA n3 and EMA n4. EMA n1 serves as the lower limit and EMA n4 as the upper limit; both limits are useful indicators that the stock price is at its “cheapest” or “most expensive” level, while EMA n2 and EMA n3 indicate whether the stock price is declining or rising. A small numeric sketch follows.
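As a minimal sketch (hypothetical prices, standard multiplier 2 / (n + 1)), the recursive EMA formula can be checked against EMA() from the TTR package:
library(TTR)
prices <- c(2295, 2315, 2305, 2250, 2320, 2275, 2320) # hypothetical closing prices
n <- 5
mult <- 2 / (n + 1) # exponential multiplier
ema <- EMA(prices, n = n) # the first defined value is the SMA of the first 5 prices
# Reproduce the 6th value manually: (closing price - previous EMA) * multiplier + previous EMA
(prices[6] - ema[5]) * mult + ema[5] # equals ema[6]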
- Moving Average Convergence Divergence (MACD)
Moving Average Convergence Divergence (MACD) is an indicator in technical analysis that describes the relationship between two moving averages in an asset price trend.
MACD is calculated with the following formula: \[MACD = EMA_{shorter\ period} - EMA_{longer\ period} \]
Where:
- EMA shorter period = the EMA with the smaller time period
- EMA longer period = the EMA with the larger time period
MACD uses the EMA as one of its ingredients: it is usually calculated by subtracting the longer-period Exponential Moving Average (EMA) from the shorter-period EMA. MACD also needs one more EMA, a third-period EMA, which can serve as a trigger for buy and sell signals. A small sketch of the MACD calculation follows.
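As a minimal sketch (using the BBRI adjusted prices collected earlier and the periods 15 and 50 that are used for BBRI's MACD later in this project):
library(TTR)
ema_short <- EMA(bbri$adjusted, n = 15)
ema_long <- EMA(bbri$adjusted, n = 50)
macd <- ema_short - ema_long # positive when the short-term trend is above the long-term trend
tail(macd, 3)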
- Relative Strength Index (RSI)
The Relative Strength Index (RSI) is a technical analysis indicator that is usually used to measure the momentum of an asset's price. It is used to evaluate whether the asset is in an overbought or oversold position.
RSI is calculated with the following formula: \[RSI = 100 - \frac{100}{1 + RS} \]
Where:
- RS = average gain over the period / average loss over the period (for example, a period of 5 means the calculation uses today's price together with the prices of the previous 4 days)
The RSI does not use the SMA or EMA parameters at all; it relies on the values computed by the RSI formula itself. The standard period is 14, as recommended by Welles Wilder. The period may be increased or decreased, since there is no special rule, but this affects the sensitivity of the RSI: for example, a period of 10 reaches overbought or oversold levels faster than a period of 20. A small sketch of the calculation follows.
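As a minimal sketch (using the BBRI adjusted prices collected earlier and the standard period of 14), the RSI() function from the TTR package can be called directly:
library(TTR)
rsi14 <- RSI(bbri$adjusted, n = 14) # values above ~70 suggest overbought, below ~30 oversold
tail(rsi14, 3)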
Last but not least, all four technical analyses will be carried out using the quantmod library.
library(quantmod)
- SMA
To determine the right time to buy or sell, four SMAs with different time periods are compared. After several experiments, SMA 5, SMA 30, SMA 50 and SMA 70 will be used for this BRI stock.
SMA 5 serves as the lower limit and SMA 70 as the upper limit; both limits are useful indicators that the stock price is at its “cheapest” or “most expensive” level, while SMA 30 and SMA 50 indicate whether the stock price is declining or rising.
bbri_sma <- bbri
# SMA 5
bbri_sma$SMA5 <- SMA(Ad(bbri_sma),
n = 5)
# SMA 30
bbri_sma$SMA30 <- SMA(Ad(bbri_sma),
n = 30)
# SMA 50
bbri_sma$SMA50 <- SMA(Ad(bbri_sma),
n = 50)
# SMA 70
bbri_sma$SMA70 <- SMA(Ad(bbri_sma),
n = 70)
“Buy” and “Sell” signals:
- Buy -> when the adjusted close price < H-1 SMA 5 & SMA 30 < H-1 SMA 50
- Sell -> when the adjusted close price > H-1 SMA 70 & SMA 30 > H-1 SMA 50
The reason for comparing against the H-1 price and/or the H-1 SMA is to make sure that the price on that day is cheaper than yesterday's value when deciding to buy, and vice versa when deciding to sell. With these parameters, the shares are expected to be bought near the lowest price and sold near the highest price.
bbri_sma <- bbri_sma %>%
mutate(
signal.SMA = case_when(
adjusted < lag(SMA5, 1) & SMA30 < lag(SMA50, 1) ~ "Buy",
adjusted > lag(SMA70, 1) & SMA30 > lag(SMA50, 1) ~ "Sell",
TRUE ~ "Hold"
),
previous_signal.SMA = lag(signal.SMA, 1),
decision.SMA = case_when(
signal.SMA == previous_signal.SMA ~ "Hold",
TRUE ~ signal.SMA
)
)
bbri_sma %>%
filter(decision.SMA != "Hold" & ! is.na(SMA70))If you pay attention to one by one of the overall “Buy” and “Sell” recommendations from the SMA method that has been filtered above, the results of the recommendations are not wrong or the recommendations provide benefits without the slightest loss. With the results as above, the target variable from SMA can be indicated as perfect because the target variable will be used as one of the predictors in the classification model.
- EMA
To determine the right time to buy or sell, four EMAs with different time periods are compared. After some experiments, EMA 5, EMA 15, EMA 50 and EMA 70 will be used for the BRI stock.
EMA 5 serves as the lower limit and EMA 70 as the upper limit; both limits are useful indicators that the stock price is at its “cheapest” or “most expensive” level, while EMA 15 and EMA 50 indicate whether the stock price is declining or rising.
bbri_ema <- bbri
# EMA 5
bbri_ema$EMA5 <- EMA(Ad(bbri_ema),
n = 5)
# EMA 15
bbri_ema$EMA15 <- EMA(Ad(bbri_ema),
n = 15)
# EMA 50
bbri_ema$EMA50 <- EMA(Ad(bbri_ema),
n = 50)
# EMA 70
bbri_ema$EMA70 <- EMA(Ad(bbri_ema),
n = 70)
“Buy” and “Sell” signals:
- Buy -> when the adjusted close < H-1 EMA 5 & EMA 15 < H-1 EMA 50
- Sell -> when the adjusted close > H-1 EMA 70 & EMA 15 > H-1 EMA 50
The reason for comparing against the H-1 price and/or the H-1 EMA is to make sure that the price on that day is cheaper than yesterday's value when deciding to buy, and vice versa when deciding to sell. With these parameters, the shares are expected to be bought near the lowest price and sold near the highest price.
bbri_ema <- bbri_ema %>%
mutate(
signal.EMA = case_when(
adjusted < lag(EMA5, 1) & EMA15 < lag(EMA50, 1) ~ "Buy",
adjusted > lag(EMA70, 1) & EMA15 > lag(EMA50, 1) ~ "Sell",
TRUE ~ "Hold"# Otherwise sell
),
previous_signal.EMA = lag(signal.EMA, 1),
decision.EMA = case_when(
signal.EMA == previous_signal.EMA ~ "Hold",
TRUE ~ signal.EMA
)
)
bbri_ema %>%
filter(decision.EMA != "Hold" & ! is.na(EMA70))If you pay attention to one by one of the overall “Buy” and “Sell” recommendations from the EMA method that has been filtered above, the results of the recommendations are not wrong or the recommendations provide profits without the slightest loss. With the results as above, the target variable from the EMA can be indicated as perfect because the target variable will be used as one of the predictors in the classification model.
- MACD
After doing several experiments, the EMAs that will be used in MACD analysis are EMA5, EMA15 and EMA50.
bbri_macd <- bbri
# EMA 15
bbri_macd$EMA15 <- bbri_ema$EMA15
# EMA 50
bbri_macd$EMA50 <- bbri_ema$EMA50
# MACD
bbri_macd$MACD <- bbri_ema$EMA15 - bbri_ema$EMA50
“Buy” and “Sell” signals:
- Buy -> when the adjusted close < H-1 EMA 5 & MACD < 0 & MACD is rising (H-1 MACD < MACD)
- Sell -> when the adjusted close > H-1 EMA 5 & MACD > 0 & MACD is falling (H-1 MACD > MACD)
Slightly different from the SMA and EMA, the MACD signals use 0 as a reference. When the difference between EMA15 and EMA50 is below 0 (negative), it is an indication that the stock price is declining, and vice versa. The comparison of the adjusted price with EMA5 works the same way as in the SMA and EMA analysis.
bbri_macd <- bbri_macd %>%
mutate(
signal.MACD = case_when(
adjusted < lag(bbri_ema$EMA5, 1) & MACD < 0 & lag(MACD, 1) < MACD ~ "Buy",
adjusted > lag(bbri_ema$EMA5, 1) & MACD > 0 & lag(MACD, 1) > MACD ~ "Sell",
TRUE ~ "Hold"
),
previous_signal.MACD = lag(signal.MACD, 1),
decision.MACD = case_when(
signal.MACD == previous_signal.MACD ~ "Hold",
TRUE ~ signal.MACD
)
)
bbri_macd %>%
filter(decision.MACD != "Hold" & ! is.na(EMA50))If you pay attention to one by one of the overall “Buy” and “Sell” recommendations from the MACD method that has been filtered above, the results of the recommendations are not wrong or the recommendations provide benefits without the slightest loss. With the results as above, the target variable from MACD can be indicated as perfect because the target variable will be used as one of the predictors in the classification model.
- RSI
RSI will not use the parameters of the SMA or EMA at all. RSI will use the calculated parameters of the RSI formula itself. The standard counting period is 14, as recommended by Welles Wilder. The period may be changed because it does not have special rules, either increasing or decreasing. However, this will affect the sensitivity of the RSI. For example, period 10 reaches overbought or oversold levels faster than period 20. The RSI that will be used for BRI is RSI7, RSI14, RSI30 and RSI70, because after several experiments, these 4 RSIs showed the best results.
bbri_rsi <- bbri
# RSI 7
bbri_rsi$RSI7 <- RSI(Ad(bbri_rsi),
n =7)
# RSI 14
bbri_rsi$RSI14 <- RSI(Ad(bbri_rsi),
n=14)
# RSI 30
bbri_rsi$RSI30 <- RSI(Ad(bbri_rsi),
n=30)
# RSI 70
bbri_rsi$RSI70 <- RSI(Ad(bbri_rsi),
n = 70)
“Buy” and “Sell” signals:
- Buy -> when the adjusted close price < H-1 adjusted close price & RSI7 > RSI14 & RSI7 < RSI70
- Sell -> when the adjusted close price > H-1 adjusted close price & RSI7 < RSI14 & RSI7 > RSI30
In technical analysis using the RSI, the “Buy” and “Sell” signals depend on the movement of RSI 7, which is compared to RSI 14, RSI 30 and RSI 70. If RSI 7 is larger than RSI 14, it is an indication that the price will soon rise, and vice versa. Meanwhile, RSI 70 is used as an “overbought” indicator and RSI 30 as an “oversold” indicator: as long as RSI 7 is larger than RSI 30, it indicates that the stock price is not at its lowest.
bbri_rsi <- bbri_rsi %>%
mutate(
signal.RSI = case_when(
adjusted < lag(adjusted,1) & RSI7 > RSI14 & RSI7 < RSI70 ~ "Buy",
adjusted > lag(adjusted,1) & RSI7 < RSI14 & RSI7 > RSI30 ~ "Sell",
TRUE ~ "Hold"
),
previous_signal.RSI = lag(signal.RSI, 1),
decision.RSI = case_when(
signal.RSI == previous_signal.RSI ~ "Hold",
TRUE ~ signal.RSI
)
)
bbri_rsi %>%
group_by(decision.RSI) %>%
summarise(freq = n())
If you check the filtered “Buy” and “Sell” recommendations from the RSI method one by one, none of the recommendations are wrong: they all generate profit without a single loss.
- SMA
isat_sma <- isat
# SMA 5
isat_sma$SMA5 <- SMA(Ad(isat_sma),
n = 5)
# SMA 20
isat_sma$SMA20 <- SMA(Ad(isat_sma),
n = 20)
# SMA 60
isat_sma$SMA60 <- SMA(Ad(isat_sma),
n = 60)
# SMA 70
isat_sma$SMA70 <- SMA(Ad(isat_sma),
n = 70)
# Create Buy, Sell & Hold signals
isat_sma <- isat_sma %>%
mutate(
signal.SMA = case_when(
adjusted < lag(SMA5, 1) & SMA20 > lag(SMA60, 1) ~ "Sell",
adjusted > lag(SMA60, 1) & SMA20 < lag(SMA60, 1) ~ "Buy",
TRUE ~ "Hold"
),
previous_signal.SMA = lag(signal.SMA, 1),
decision.SMA = case_when(
signal.SMA == previous_signal.SMA ~ "Hold",
TRUE ~ signal.SMA
)
)
isat_sma %>%
filter(decision.SMA != "Hold" & ! is.na(SMA60))- EMA
# Create a new object
isat_ema <- isat
# EMA 10
isat_ema$EMA10 <- EMA(Ad(isat_ema),
n = 10)
# EMA 30
isat_ema$EMA30 <- EMA(Ad(isat_ema),
n = 30)
# EMA 50
isat_ema$EMA50 <- EMA(Ad(isat_ema),
n = 50)
# EMA 70
isat_ema$EMA70 <- EMA(Ad(isat_ema),
n = 70)
isat_ema <- isat_ema %>%
mutate(
signal.EMA = case_when(
adjusted < lag(EMA10, 1) & EMA30 > lag(EMA50, 1) ~ "Sell",
adjusted > lag(EMA70, 1) & EMA30 < lag(EMA50, 1) ~ "Buy",
TRUE ~ "Hold"# Otherwise sell
),
previous_signal.EMA = lag(signal.EMA, 1),
decision.EMA = case_when(
signal.EMA == previous_signal.EMA ~ "Hold",
TRUE ~ signal.EMA
)
)
isat_ema %>%
filter(decision.EMA != "Hold" & ! is.na(EMA70))- MACD
isat_macd <- isat
# EMA 10
isat_macd$EMA10 <- EMA(Ad(isat_ema),
n = 10)
# EMA 18
isat_macd$EMA18 <- EMA(Ad(isat_ema),
n = 18)
# EMA 48
isat_macd$EMA48 <- EMA(Ad(isat_ema),
n = 48)
# MACD
isat_macd$MACD <- isat_macd$EMA18 - isat_macd$EMA48
isat_macd <- isat_macd %>%
mutate(
signal.MACD = case_when(
adjusted < lag(isat_macd$EMA10, 1) & MACD < 0 & lag(MACD, 1) < MACD ~ "Buy",
adjusted > lag(isat_macd$EMA10, 1) & MACD > 0 & lag(MACD, 1) > MACD ~ "Sell",
TRUE ~ "Hold"
),
previous_signal.MACD = lag(signal.MACD, 1),
decision.MACD = case_when(
signal.MACD == previous_signal.MACD ~ "Hold",
TRUE ~ signal.MACD
)
)
isat_macd %>%
filter(decision.MACD != "Hold")- RSI
# Membuat objek baru
isat_rsi <- isat
# RSI 10
isat_rsi$RSI10 <- RSI(Ad(isat_rsi),
n =10)
# RSI 38
isat_rsi$RSI38 <- RSI(Ad(isat_rsi),
n=38)
# RSI 45
isat_rsi$RSI45 <- RSI(Ad(isat_rsi),
n=45)
# RSI 70
isat_rsi$RSI70 <- RSI(Ad(isat_rsi),
n = 70)
isat_rsi <- isat_rsi %>%
mutate(
signal.RSI = case_when(
adjusted < lag(adjusted,1) & RSI10 > RSI38 & RSI10 < RSI70 ~ "Buy",
adjusted > lag(adjusted,1) & RSI10 < RSI38 & RSI10 > RSI45 ~ "Sell",
TRUE ~ "Hold"
),
previous_signal.RSI = lag(signal.RSI, 1),
decision.RSI = case_when(
signal.RSI == previous_signal.RSI ~ "Hold",
TRUE ~ signal.RSI
)
)
isat_rsi %>%
filter(decision.RSI != "Hold" & ! is.na(RSI70))- SMA
sido_sma <- sido
# SMA 5
sido_sma$SMA5 <- SMA(Ad(sido_sma),
n = 5)
# SMA 15
sido_sma$SMA15 <- SMA(Ad(sido_sma),
n = 15)
# SMA 55
sido_sma$SMA55 <- SMA(Ad(sido_sma),
n = 55)
# SMA 80
sido_sma$SMA80 <- SMA(Ad(sido_sma),
n = 80)
sido_sma <- sido_sma %>%
mutate(
signal.SMA = case_when(
adjusted < lag(SMA5, 1) & SMA15 < lag(SMA55, 1) ~ "Buy",
adjusted > lag(SMA80, 1) & SMA15 > lag(SMA55, 1) ~ "Sell",
TRUE ~ "Hold"
),
previous_signal.SMA = lag(signal.SMA, 1),
decision.SMA = case_when(
signal.SMA == previous_signal.SMA ~ "Hold",
TRUE ~ signal.SMA
)
)
sido_sma %>%
filter(decision.SMA != "Hold" & ! is.na(SMA80))From the observation above SMA decision at 03-05-2017, 08-05-2017, 20-04-202 and 23-04-2020 must be changed into Hold in order to prevent trading loss.
sido_sma$decision.SMA[sido_sma$date == "2017-05-03"] <- "Hold"
sido_sma$decision.SMA[sido_sma$date == "2017-05-08"] <- "Hold"
sido_sma$decision.SMA[sido_sma$date == "2020-04-20"] <- "Hold"
sido_sma$decision.SMA[sido_sma$date == "2020-04-23"] <- "Hold"- EMA
# Membuat objek baru
sido_ema <- sido
# EMA 5
sido_ema$EMA5 <- EMA(Ad(sido_ema),
n = 5)
# EMA 30
sido_ema$EMA30 <- EMA(Ad(sido_ema),
n = 30)
# EMA 55
sido_ema$EMA55 <- EMA(Ad(sido_ema),
n = 55)
# EMA 60
sido_ema$EMA60 <- EMA(Ad(sido_ema),
n = 60)
sido_ema <- sido_ema %>%
mutate(
signal.EMA = case_when(
adjusted < lag(EMA5, 1) & EMA30 < lag(EMA55, 1) ~ "Buy",
adjusted > lag(EMA60, 1) & EMA30 > lag(EMA55, 1) ~ "Sell",
TRUE ~ "Hold"# Otherwise sell
),
previous_signal.EMA = lag(signal.EMA, 1),
decision.EMA = case_when(
signal.EMA == previous_signal.EMA ~ "Hold",
TRUE ~ signal.EMA
)
)
sido_ema %>%
filter(decision.EMA != "Hold" & ! is.na(EMA60))
sido_ema$decision.EMA[sido_ema$date == "2017-05-10"] <- "Hold"
- MACD
sido_macd <- sido
# EMA 10
sido_macd$EMA10 <- EMA(Ad(sido_macd),
n = 10)
# EMA 15
sido_macd$EMA15 <- EMA(Ad(sido_macd),
n = 15)
# EMA 20
sido_macd$EMA20 <- EMA(Ad(sido_macd),
n = 20)
# MACD
sido_macd$MACD <- sido_macd$EMA15 - sido_macd$EMA20
sido_macd <- sido_macd %>%
mutate(
signal.MACD = case_when(
adjusted < lag(sido_macd$EMA10, 1) & MACD < 0 & lag(MACD, 1) < MACD ~ "Buy",
adjusted > lag(sido_macd$EMA10, 1) & MACD > 0 & lag(MACD, 1) > MACD ~ "Sell",
TRUE ~ "Hold"
),
previous_signal.MACD = lag(signal.MACD, 1),
decision.MACD = case_when(
signal.MACD == previous_signal.MACD ~ "Hold",
TRUE ~ signal.MACD
)
)
sido_macd %>%
filter(decision.MACD != "Hold")sido_macd$decision.MACD[sido_macd$date == "2017-11-02"] <- "Hold"- RSI
sido_rsi <- sido
# RSI 10
sido_rsi$RSI10 <- RSI(Ad(sido_rsi),
n =10)
# RSI 14
sido_rsi$RSI14 <- RSI(Ad(sido_rsi),
n=14)
# RSI 30
sido_rsi$RSI30 <- RSI(Ad(sido_rsi),
n=30)
# RSI 65
sido_rsi$RSI65 <- RSI(Ad(sido_rsi),
n = 65)
sido_rsi <- sido_rsi %>%
mutate(
signal.RSI = case_when(
close < lag(close,1) & RSI10 > RSI14 & RSI10 < RSI65 ~ "Sell",
close > lag(close,1) & RSI10 < RSI14 & RSI10 > RSI30 ~ "Buy",
TRUE ~ "Hold"
),
previous_signal.RSI = lag(signal.RSI, 1),
decision.RSI = case_when(
signal.RSI == previous_signal.RSI ~ "Hold",
TRUE ~ signal.RSI
)
)
sido_rsi %>%
filter(decision.RSI != "Hold" & ! is.na(RSI65))- SMA
hoki_sma <- hoki
# SMA 5
hoki_sma$SMA5 <- SMA(Ad(hoki_sma),
n = 5)
# SMA 25
hoki_sma$SMA25 <- SMA(Ad(hoki_sma),
n = 25)
# SMA 55
hoki_sma$SMA55 <- SMA(Ad(hoki_sma),
n = 55)
# SMA 70
hoki_sma$SMA70 <- SMA(Ad(hoki_sma),
n = 70)
# Create Buy, Sell & Hold signals
hoki_sma <- hoki_sma %>%
mutate(
signal.SMA = case_when(
close < lag(SMA5, 1) & SMA25 < lag(SMA55, 1) ~ "Buy",
close > lag(SMA70, 1) & SMA25 > lag(SMA55, 1) ~ "Sell",
TRUE ~ "Hold"
),
previous_signal.SMA = lag(signal.SMA, 1),
decision.SMA = case_when(
signal.SMA == previous_signal.SMA ~ "Hold",
TRUE ~ signal.SMA
)
)
hoki_sma %>%
filter(decision.SMA != "Hold" & ! is.na(SMA70))- EMA
hoki_ema <- hoki
# EMA 5
hoki_ema$EMA5 <- EMA(Ad(hoki_ema),
n = 5)
# EMA 30
hoki_ema$EMA30 <- EMA(Ad(hoki_ema),
n = 30)
# EMA 50
hoki_ema$EMA50 <- EMA(Ad(hoki_ema),
n = 50)
# EMA 60
hoki_ema$EMA60 <- EMA(Ad(hoki_ema),
n = 60)
hoki_ema <- hoki_ema %>%
mutate(
signal.EMA = case_when(
close < lag(EMA5, 1) & EMA30 < lag(EMA50, 1) ~ "Buy",
close > lag(EMA60, 1) & EMA30 > lag(EMA50, 1) ~ "Sell",
TRUE ~ "Hold"# Otherwise sell
),
previous_signal.EMA = lag(signal.EMA, 1),
decision.EMA = case_when(
signal.EMA == previous_signal.EMA ~ "Hold",
TRUE ~ signal.EMA
)
)
hoki_ema %>%
filter(decision.EMA != "Hold" & ! is.na(EMA60))- MACD
# Membuat objek baru
hoki_macd <- hoki
# EMA 10
hoki_macd$EMA10 <- EMA(Ad(hoki_macd),
n = 10)
# EMA 15
hoki_macd$EMA15 <- EMA(Ad(hoki_macd),
n = 15)
# EMA 50
hoki_macd$EMA50 <- EMA(Ad(hoki_macd),
n = 50)
# MACD
hoki_macd$MACD <- hoki_macd$EMA15 - hoki_macd$EMA50
hoki_macd <- hoki_macd %>%
mutate(
signal.MACD = case_when(
adjusted < lag(hoki_macd$EMA10, 1) & MACD < 0 & lag(MACD, 1) < MACD ~ "Buy",
adjusted > lag(hoki_macd$EMA10, 1) & MACD > 0 & lag(MACD, 1) > MACD ~ "Sell",
TRUE ~ "Hold"
),
previous_signal.MACD = lag(signal.MACD, 1),
decision.MACD = case_when(
signal.MACD == previous_signal.MACD ~ "Hold",
TRUE ~ signal.MACD
)
)
hoki_macd %>%
filter(decision.MACD != "Hold")- RSI
hoki_rsi <- hoki
# RSI 10
hoki_rsi$RSI10 <- RSI(Ad(hoki_rsi),
n =10)
# RSI 14
hoki_rsi$RSI14 <- RSI(Ad(hoki_rsi),
n=14)
# RSI 40
hoki_rsi$RSI40 <- RSI(Ad(hoki_rsi),
n=40)
# RSI 65
hoki_rsi$RSI65 <- RSI(Ad(hoki_rsi),
n = 65)
hoki_rsi <- hoki_rsi %>%
mutate(
signal.RSI = case_when(
close < lag(close,1) & RSI10 > RSI14 & RSI10 < RSI65 ~ "Sell",
close > lag(close,1) & RSI10 < RSI14 & RSI10 > RSI40 ~ "Buy",
TRUE ~ "Hold"
),
previous_signal.RSI = lag(signal.RSI, 1),
decision.RSI = case_when(
signal.RSI == previous_signal.RSI ~ "Hold",
TRUE ~ signal.RSI
)
)
hoki_rsi %>%
filter(decision.RSI != "Hold" & ! is.na(RSI65))sido_rsi$decision.RSI[sido_rsi$date == "2016-08-03"] <- "Hold"
sido_rsi$decision.RSI[sido_rsi$date == "2016-08-08"] <- "Hold"- SMA
wika_sma <- wika
# SMA 20
wika_sma$SMA20 <- SMA(Ad(wika_sma),
n = 20)
# SMA 30
wika_sma$SMA30 <- SMA(Ad(wika_sma),
n = 30)
# SMA 65
wika_sma$SMA65 <- SMA(Ad(wika_sma),
n =65)
# SMA 80
wika_sma$SMA80 <- SMA(Ad(wika_sma),
n = 80)
wika_sma <- wika_sma %>%
mutate(
signal.SMA = case_when(
close < lag(SMA20, 1) & SMA30 < lag(SMA65, 1) ~ "Buy",
close > lag(SMA80, 1) & SMA30 > lag(SMA65, 1) ~ "Sell",
TRUE ~ "Hold"
),
previous_signal.SMA = lag(signal.SMA, 1),
decision.SMA = case_when(
signal.SMA == previous_signal.SMA ~ "Hold",
TRUE ~ signal.SMA
)
)
wika_sma %>%
filter(decision.SMA != "Hold" & ! is.na(SMA80))wika_sma$decision.SMA[wika_sma$date == "2017-04-10"] <- "Hold"- EMA
wika_ema <- wika
# EMA 20
wika_ema$EMA20 <- EMA(Ad(wika_ema),
n = 20)
# EMA 30
wika_ema$EMA30 <- EMA(Ad(wika_ema),
n = 30)
# EMA 65
wika_ema$EMA65 <- EMA(Ad(wika_ema),
n = 65)
# EMA 80
wika_ema$EMA80 <- EMA(Ad(wika_ema),
n = 80)
wika_ema <- wika_ema %>%
mutate(
signal.EMA = case_when(
adjusted < lag(EMA20, 1) & EMA30 < lag(EMA65, 1) ~ "Buy",
adjusted > lag(EMA80, 1) & EMA30 > lag(EMA65, 1) ~ "Sell",
TRUE ~ "Hold"# Otherwise sell
),
previous_signal.EMA = lag(signal.EMA, 1),
decision.EMA = case_when(
signal.EMA == previous_signal.EMA ~ "Hold",
TRUE ~ signal.EMA
)
)
wika_ema %>%
filter(decision.EMA != "Hold" & ! is.na(EMA80))
wika_ema$decision.EMA[wika_ema$date == "2020-01-29"] <- "Hold"
wika_ema$decision.EMA[wika_ema$date == "2020-02-07"] <- "Hold"
wika_ema$decision.EMA[wika_ema$date == "2020-02-24"] <- "Hold"
- MACD
# Create a new object
wika_macd <- wika
# EMA 6
wika_macd$EMA6 <- EMA(Ad(wika_macd),
n =6)
# EMA 10
wika_macd$EMA10 <- EMA(Ad(wika_macd),
n = 10)
# EMA 25
wika_macd$EMA25 <- EMA(Ad(wika_macd),
n = 25)
# MACD
wika_macd$MACD <- wika_macd$EMA10 - wika_macd$EMA25
wika_macd <- wika_macd %>%
mutate(
signal.MACD = case_when(
adjusted < lag(wika_macd$EMA10, 1) & MACD < 0 & lag(MACD, 1) < MACD ~ "Buy",
adjusted > lag(wika_macd$EMA10, 1) & MACD > 0 & lag(MACD, 1) > MACD ~ "Sell",
TRUE ~ "Hold"
),
previous_signal.MACD = lag(signal.MACD, 1),
decision.MACD = case_when(
signal.MACD == previous_signal.MACD ~ "Hold",
TRUE ~ signal.MACD
)
)
wika_macd %>%
filter(decision.MACD != "Hold")wika_macd$decision.MACD[wika_macd$date == "2017-03-09"] <- "Hold"
wika_macd$decision.MACD[wika_macd$date == "2019-12-26"] <- "Hold"- RSI
wika_rsi <- wika
# RSI 5
wika_rsi$RSI5 <- RSI(Ad(wika_rsi),
n =5)
# RSI 20
wika_rsi$RSI20 <- RSI(Ad(wika_rsi),
n=20)
# RSI 35
wika_rsi$RSI35 <- RSI(Ad(wika_rsi),
n=35)
# RSI 65
wika_rsi$RSI65 <- RSI(Ad(wika_rsi),
n = 65)
wika_rsi <- wika_rsi %>%
mutate(
signal.RSI = case_when(
close < lag(close,1) & RSI5 > RSI20 & RSI5 < RSI65 ~ "Buy",
close > lag(close,1) & RSI5 < RSI20 & RSI5 > RSI35 ~ "Sell",
TRUE ~ "Hold"
),
previous_signal.RSI = lag(signal.RSI, 1),
decision.RSI = case_when(
signal.RSI == previous_signal.RSI ~ "Hold",
TRUE ~ signal.RSI
)
)
wika_rsi %>%
filter(decision.RSI != "Hold" & ! is.na(RSI65))
After getting the required analysis results, combine them into a new data frame. The purpose of combining all the indicators into one is to reach a final decision on whether the stock should be bought or sold that day. The final decision will also be used as the target variable to train the classification model.
# Combine all indicators
bbri_analisa <- cbind(bbri,bbri_sma$SMA5, bbri_sma$SMA30, bbri_sma$SMA50, bbri_sma$SMA70, bbri_ema$EMA5, bbri_ema$EMA15, bbri_ema$EMA50, bbri_ema$EMA70, bbri_macd$MACD, bbri_rsi$RSI7, bbri_rsi$RSI14, bbri_rsi$RSI30, bbri_rsi$RSI70, bbri_sma$decision.SMA, bbri_ema$decision.EMA, bbri_macd$decision.MACD, bbri_rsi$decision.RSI)
# Change column name
colnames(bbri_analisa) <- c("symbol", "date", "open", "high", "low", "close", "volume", "adjusted", "SMA5", "SMA30", "SMA50", "SMA70", "EMA5", "EMA15", "EMA50", "EMA70", "MACD", "RSI7", "RSI14", "RSI30", "RSI70", "decision.SMA", "decision.EMA", "decision.MACD", "decision.RSI")
head(bbri_analisa, 3)
After all the analysis columns are combined into a new data frame, create a new column that will hold the final decision. The final decision is derived from the decision column of each analysis.
bbri_analisa <- bbri_analisa %>%
mutate(
final_decision = case_when(
decision.SMA == "Buy" & decision.EMA == "Buy" & decision.MACD == "Buy" & decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Sell" & decision.EMA == "Buy" & decision.MACD == "Buy" & decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Buy" & decision.EMA == "Sell" & decision.MACD == "Buy" & decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Buy" & decision.EMA == "Buy" & decision.MACD == "Sell" & decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Buy" & decision.EMA == "Buy" & decision.MACD == "Buy" & decision.RSI == "Sell" ~ "Buy",
decision.SMA == "Sell" & decision.EMA == "Sell" & decision.MACD == "Sell" & decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Buy" & decision.EMA == "Sell" & decision.MACD == "Sell" & decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Sell" & decision.EMA == "Buy" & decision.MACD == "Sell" & decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Sell" & decision.EMA == "Sell" & decision.MACD == "Buy" & decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Sell" & decision.EMA == "Sell" & decision.MACD == "Sell" & decision.RSI == "Buy" ~ "Sell",
TRUE ~ "Hold"
)
)
bbri_analisa %>%
filter(final_decision != "Hold")From the results above, it can be seen that there are only 3 out of 4 technical analyzes showing the same decision, from there it can be concluded that the majority of the results from each technical analysis have different results. One alternative to get a final decision from the results of combining the 4 technical analyzes above is to give weight to the technical analysis that is better than the 4 analyzes that have been done.
Each technical analysis has its advantages and disadvantages. In this project, two analyses are given additional weight: MACD and RSI. MACD is chosen for its strength in providing signals or indications as early as possible, which is very useful to avoid missing the momentum to buy or being too late to take profits. RSI is one of the most popular indicators because of its ability to detect when market prices are in the overbought or oversold area, which is what allows the analysis to deliver maximum benefit. It is hoped that by combining these two indicators, they will complement each other.
bbri_analisa2 <- bbri_analisa %>%
mutate(
final_decision = case_when(
decision.SMA == "Buy" & decision.EMA == "Buy" & decision.MACD == "Buy" & decision.RSI == "Buy" | decision.MACD == "Buy" | decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Sell" & decision.EMA == "Buy" & decision.MACD == "Buy" & decision.RSI == "Buy" | decision.MACD == "Buy" | decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Buy" & decision.EMA == "Sell" & decision.MACD == "Buy" & decision.RSI == "Buy" | decision.MACD == "Buy" | decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Buy" & decision.EMA == "Buy" & decision.MACD == "Sell" & decision.RSI == "Buy" | decision.MACD == "Buy" | decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Buy" & decision.EMA == "Buy" & decision.MACD == "Buy" & decision.RSI == "Sell" | decision.MACD == "Buy" | decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Sell" & decision.EMA == "Sell" & decision.MACD == "Sell" & decision.RSI == "Sell" | decision.MACD == "Sell" | decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Buy" & decision.EMA == "Sell" & decision.MACD == "Sell" & decision.RSI == "Sell" | decision.MACD == "Sell" | decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Sell" & decision.EMA == "Buy" & decision.MACD == "Sell" & decision.RSI == "Sell" | decision.MACD == "Sell" | decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Sell" & decision.EMA == "Sell" & decision.MACD == "Buy" & decision.RSI == "Sell" | decision.MACD == "Sell" | decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Sell" & decision.EMA == "Sell" & decision.MACD == "Sell" & decision.RSI == "Buy" | decision.MACD == "Sell" | decision.RSI == "Sell" ~ "Sell",
TRUE ~ "Hold"
)
)
bbri_analisa2 %>%
select(c("date", "open", "close", "final_decision"))# Combine all indicator
isat_analisa <- cbind(isat,isat_sma$SMA5, isat_sma$SMA20, isat_sma$SMA60, isat_sma$SMA70, isat_ema$EMA10, isat_ema$EMA30, isat_ema$EMA50, isat_ema$EMA70, isat_macd$EMA10, isat_macd$EMA18, isat_macd$EMA48, isat_macd$MACD, isat_rsi$RSI10, isat_rsi$RSI38, isat_rsi$RSI45, isat_rsi$RSI70, isat_sma$decision.SMA, isat_ema$decision.EMA, isat_macd$decision.MACD, isat_rsi$decision.RSI)
# Change column name
colnames(isat_analisa) <- c("symbol", "date", "open", "high", "low", "close", "volume", "adjusted", "SMA5", "SMA20", "SMA60", "SMA70", "EMA10", "EMA30", "EMA50", "EMA70", "EMA10.MACD", "EMA18", "EMA48", "MACD", "RSI10", "RSI38", "RSI45", "RSI70", "decision.SMA", "decision.EMA", "decision.MACD", "decision.RSI")
head(isat_analisa, 3)
isat_analisa <- isat_analisa %>%
mutate(
final_decision = case_when(
decision.SMA == "Buy" & decision.EMA == "Buy" & decision.MACD == "Buy" & decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Sell" & decision.EMA == "Buy" & decision.MACD == "Buy" & decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Buy" & decision.EMA == "Sell" & decision.MACD == "Buy" & decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Buy" & decision.EMA == "Buy" & decision.MACD == "Sell" & decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Buy" & decision.EMA == "Buy" & decision.MACD == "Buy" & decision.RSI == "Sell" ~ "Buy",
decision.SMA == "Sell" & decision.EMA == "Sell" & decision.MACD == "Sell" & decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Buy" & decision.EMA == "Sell" & decision.MACD == "Sell" & decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Sell" & decision.EMA == "Buy" & decision.MACD == "Sell" & decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Sell" & decision.EMA == "Sell" & decision.MACD == "Buy" & decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Sell" & decision.EMA == "Sell" & decision.MACD == "Sell" & decision.RSI == "Buy" ~ "Sell",
TRUE ~ "Hold"
)
)
isat_analisa %>%
select(c("date", "open", "close", "final_decision")) %>%
filter(final_decision != "Hold")isat_analisa2 <- isat_analisa %>%
mutate(
final_decision = case_when(
decision.SMA == "Buy" & decision.EMA == "Buy" & decision.MACD == "Buy" & decision.MACD == "Buy" | decision.EMA == "Buy" | decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Sell" & decision.EMA == "Buy" & decision.MACD == "Buy" & decision.MACD == "Buy" | decision.EMA == "Buy" | decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Buy" & decision.EMA == "Sell" & decision.MACD == "Buy" & decision.MACD == "Buy" | decision.EMA == "Buy" | decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Buy" & decision.EMA == "Buy" & decision.MACD == "Sell" & decision.MACD == "Buy" | decision.EMA == "Buy" | decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Buy" & decision.EMA == "Buy" & decision.MACD == "Buy" & decision.MACD == "Sell" | decision.EMA == "Buy" | decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Sell" & decision.EMA == "Sell" & decision.MACD == "Sell" & decision.MACD == "Sell" | decision.EMA == "Sell" | decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Buy" & decision.EMA == "Sell" & decision.MACD == "Sell" & decision.MACD == "Sell" | decision.EMA == "Sell" | decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Sell" & decision.EMA == "Buy" & decision.MACD == "Sell" & decision.MACD == "Sell" | decision.EMA == "Sell" | decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Sell" & decision.EMA == "Sell" & decision.MACD == "Buy" & decision.MACD == "Sell" | decision.EMA == "Sell" | decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Sell" & decision.EMA == "Sell" & decision.MACD == "Sell" & decision.MACD == "Buy" | decision.EMA == "Sell" | decision.RSI == "Sell" ~ "Sell",
TRUE ~ "Hold"
)
)
isat_analisa2 %>%
select(c("date", "open", "close", "final_decision")) %>%
filter(final_decision != "Hold")# Combine all columns
sido_analisa <- cbind(sido, sido_sma$SMA5, sido_sma$SMA15, sido_sma$SMA55, sido_sma$SMA80, sido_ema$EMA5, sido_ema$EMA30, sido_ema$EMA55, sido_ema$EMA60, sido_macd$EMA10, sido_macd$EMA15, sido_macd$MACD, sido_rsi$RSI10, sido_rsi$RSI14, sido_rsi$RSI30, sido_rsi$RSI65, sido_sma$decision.SMA, sido_ema$decision.EMA, sido_macd$decision.MACD, sido_rsi$decision.RSI)
# Change column name
colnames(sido_analisa) <- c("symbol", "date", "open", "high", "low", "close", "volume", "adjusted", "SMA5", "SMA15", "SMA55", "SMA80", "EMA5", "EMA30", "EMA55", "EMA60", "EMA10", "EMA15", "MACD", "RSI10", "RSI14", "RSI30", "RSI65", "decision.SMA", "decision.EMA", "decision.MACD", "decision.RSI")
head(sido_analisa, 3)
sido_analisa <- sido_analisa %>%
mutate(
final_decision = case_when(
decision.SMA == "Buy" & decision.EMA == "Buy" & decision.MACD == "Buy" & decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Sell" & decision.EMA == "Buy" & decision.MACD == "Buy" & decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Buy" & decision.EMA == "Sell" & decision.MACD == "Buy" & decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Buy" & decision.EMA == "Buy" & decision.MACD == "Sell" & decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Buy" & decision.EMA == "Buy" & decision.MACD == "Buy" & decision.RSI == "Sell" ~ "Buy",
decision.SMA == "Sell" & decision.EMA == "Sell" & decision.MACD == "Sell" & decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Buy" & decision.EMA == "Sell" & decision.MACD == "Sell" & decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Sell" & decision.EMA == "Buy" & decision.MACD == "Sell" & decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Sell" & decision.EMA == "Sell" & decision.MACD == "Buy" & decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Sell" & decision.EMA == "Sell" & decision.MACD == "Sell" & decision.RSI == "Buy" ~ "Sell",
TRUE ~ "Hold"
)
)
sido_analisa %>%
select(c("date", "open", "close", "final_decision")) %>%
filter(final_decision != "Hold")sido_analisa2 <- sido_analisa %>%
mutate(
final_decision = case_when(
decision.SMA == "Buy" & decision.EMA == "Buy" & decision.MACD == "Buy" & decision.RSI == "Buy" | decision.MACD == "Buy" | decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Sell" & decision.EMA == "Buy" & decision.MACD == "Buy" & decision.RSI == "Buy" | decision.MACD == "Buy" | decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Buy" & decision.EMA == "Sell" & decision.MACD == "Buy" & decision.RSI == "Buy" | decision.MACD == "Buy" | decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Buy" & decision.EMA == "Buy" & decision.MACD == "Sell" & decision.RSI == "Buy" | decision.MACD == "Buy" | decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Buy" & decision.EMA == "Buy" & decision.MACD == "Buy" & decision.RSI == "Sell" | decision.MACD == "Buy" | decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Sell" & decision.EMA == "Sell" & decision.MACD == "Sell" & decision.RSI == "Sell" | decision.MACD == "Sell" | decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Buy" & decision.EMA == "Sell" & decision.MACD == "Sell" & decision.RSI == "Sell" | decision.MACD == "Sell" | decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Sell" & decision.EMA == "Buy" & decision.MACD == "Sell" & decision.RSI == "Sell" | decision.MACD == "Sell" | decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Sell" & decision.EMA == "Sell" & decision.MACD == "Buy" & decision.RSI == "Sell" | decision.MACD == "Sell" | decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Sell" & decision.EMA == "Sell" & decision.MACD == "Sell" & decision.RSI == "Buy" | decision.MACD == "Sell" | decision.RSI == "Sell" ~ "Sell",
TRUE ~ "Hold"
)
)
sido_analisa2 %>%
select(c("date", "open", "close", "final_decision")) %>%
filter(final_decision != "Hold")sido_analisa2$final_decision[sido_analisa2$date == "2019-01-16"] <- "Hold"
sido_analisa2$final_decision[sido_analisa2$date == "2016-02-09"] <- "Hold"
sido_analisa2$final_decision[sido_analisa2$date == "2016-11-04"] <- "Hold"
sido_analisa2$final_decision[sido_analisa2$date == "2017-01-10"] <- "Buy"
sido_analisa2$final_decision[sido_analisa2$date == "2016-02-09"] <- "Hold"
sido_analisa2 %>%
select(c("date", "open", "close", "final_decision")) %>%
filter(final_decision != "Hold")# Combine all the columns
hoki_analisa <- cbind(hoki,hoki_sma$SMA5, hoki_sma$SMA25, hoki_sma$SMA55, hoki_sma$SMA70, hoki_ema$EMA5, hoki_ema$EMA30, hoki_ema$EMA50, hoki_ema$EMA60, hoki_macd$EMA10, hoki_macd$EMA15, hoki_macd$MACD, hoki_rsi$RSI10, hoki_rsi$RSI14, hoki_rsi$RSI40, hoki_rsi$RSI65, hoki_sma$decision.SMA, hoki_ema$decision.EMA, hoki_macd$decision.MACD, hoki_rsi$decision.RSI)
# Change column name
colnames(hoki_analisa) <- c("symbol", "date", "open", "high", "low", "close", "volume", "adjusted", "SMA5", "SMA25", "SMA55", "SMA70", "EMA5", "EMA30", "EMA50", "EMA60", "EMA10", "EMA15", "MACD", "RSI10", "RSI14", "RSI40", "RSI65", "decision.SMA", "decision.EMA", "decision.MACD", "decision.RSI")
head(hoki_analisa, 3)
hoki_analisa <- hoki_analisa %>%
mutate(
final_decision = case_when(
decision.SMA == "Buy" & decision.EMA == "Buy" & decision.MACD == "Buy" & decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Sell" & decision.EMA == "Buy" & decision.MACD == "Buy" & decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Buy" & decision.EMA == "Sell" & decision.MACD == "Buy" & decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Buy" & decision.EMA == "Buy" & decision.MACD == "Sell" & decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Buy" & decision.EMA == "Buy" & decision.MACD == "Buy" & decision.RSI == "Sell" ~ "Buy",
decision.SMA == "Sell" & decision.EMA == "Sell" & decision.MACD == "Sell" & decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Buy" & decision.EMA == "Sell" & decision.MACD == "Sell" & decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Sell" & decision.EMA == "Buy" & decision.MACD == "Sell" & decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Sell" & decision.EMA == "Sell" & decision.MACD == "Buy" & decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Sell" & decision.EMA == "Sell" & decision.MACD == "Sell" & decision.RSI == "Buy" ~ "Sell",
TRUE ~ "Hold"
)
)
hoki_analisa %>%
select(c("date", "open", "close", "final_decision")) %>%
filter(final_decision != "Hold")hoki_analisa2 <- hoki_analisa %>%
mutate(
final_decision = case_when(
decision.SMA == "Buy" & decision.EMA == "Buy" & decision.MACD == "Buy" & decision.RSI == "Buy" | decision.MACD == "Buy" | decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Sell" & decision.EMA == "Buy" & decision.MACD == "Buy" & decision.RSI == "Buy" | decision.MACD == "Buy" | decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Buy" & decision.EMA == "Sell" & decision.MACD == "Buy" & decision.RSI == "Buy" | decision.MACD == "Buy" | decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Buy" & decision.EMA == "Buy" & decision.MACD == "Sell" & decision.RSI == "Buy" | decision.MACD == "Buy" | decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Buy" & decision.EMA == "Buy" & decision.MACD == "Buy" & decision.RSI == "Sell" | decision.MACD == "Buy" | decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Sell" & decision.EMA == "Sell" & decision.MACD == "Sell" & decision.RSI == "Sell" | decision.MACD == "Sell" | decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Buy" & decision.EMA == "Sell" & decision.MACD == "Sell" & decision.RSI == "Sell" | decision.MACD == "Sell" | decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Sell" & decision.EMA == "Buy" & decision.MACD == "Sell" & decision.RSI == "Sell" | decision.MACD == "Sell" | decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Sell" & decision.EMA == "Sell" & decision.MACD == "Buy" & decision.RSI == "Sell" | decision.MACD == "Sell" | decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Sell" & decision.EMA == "Sell" & decision.MACD == "Sell" & decision.RSI == "Buy" | decision.MACD == "Sell" | decision.RSI == "Sell" ~ "Sell",
TRUE ~ "Hold"
)
)
hoki_analisa2 %>%
select(c("date", "open", "close", "final_decision")) %>%
filter(final_decision != "Hold")hoki_analisa2$final_decision[hoki_analisa2$date == "2018-11-05"] <- "Hold"
hoki_analisa2$final_decision[hoki_analisa2$date == "2019-08-19"] <- "Hold"
hoki_analisa2$final_decision[hoki_analisa2$date == "2020-12-16"] <- "Hold"
hoki_analisa2$final_decision[hoki_analisa2$date == "2021-04-09"] <- "Hold"
hoki_analisa2$final_decision[hoki_analisa2$date == "2021-04-23"] <- "Hold"
hoki_analisa2$final_decision[hoki_analisa2$date == "2021-05-10"] <- "Hold"
hoki_analisa2$final_decision[hoki_analisa2$date == "2021-06-03"] <- "Hold"
hoki_analisa2$final_decision[hoki_analisa2$date == "2021-06-04"] <- "Hold"
hoki_analisa2$final_decision[hoki_analisa2$date == "2021-06-10"] <- "Hold"
hoki_analisa2$final_decision[hoki_analisa2$date == "2021-06-15"] <- "Hold"
hoki_analisa2 %>%
select(c("date", "open", "close", "final_decision")) %>%
filter(final_decision != "Hold")# Combine all the columns
wika_analisa <- cbind(wika, wika_sma$SMA20, wika_sma$SMA30, wika_sma$SMA65, wika_sma$SMA80, wika_ema$EMA20, wika_ema$EMA30, wika_ema$EMA65, wika_ema$EMA80, wika_macd$EMA10, wika_macd$EMA25, wika_macd$MACD, wika_rsi$RSI5, wika_rsi$RSI20, wika_rsi$RSI35, wika_rsi$RSI65, wika_sma$decision.SMA, wika_ema$decision.EMA, wika_macd$decision.MACD, wika_rsi$decision.RSI)
# Change column name
colnames(wika_analisa) <- c("symbol", "date", "open", "high", "low", "close", "volume", "adjusted", "SMA20", "SMA30", "SMA65", "SMA80", "EMA20", "EMA30", "EMA65", "EMA80", "EMA10", "EMA25", "MACD", "RSI5", "RSI20", "RSI35", "RSI65", "decision.SMA", "decision.EMA", "decision.MACD", "decision.RSI")
head(wika_analisa, 3)
wika_analisa <- wika_analisa %>%
mutate(
final_decision = case_when(
decision.SMA == "Buy" & decision.EMA == "Buy" & decision.MACD == "Buy" & decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Sell" & decision.EMA == "Buy" & decision.MACD == "Buy" & decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Buy" & decision.EMA == "Sell" & decision.MACD == "Buy" & decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Buy" & decision.EMA == "Buy" & decision.MACD == "Sell" & decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Buy" & decision.EMA == "Buy" & decision.MACD == "Buy" & decision.RSI == "Sell" ~ "Buy",
decision.SMA == "Sell" & decision.EMA == "Sell" & decision.MACD == "Sell" & decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Buy" & decision.EMA == "Sell" & decision.MACD == "Sell" & decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Sell" & decision.EMA == "Buy" & decision.MACD == "Sell" & decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Sell" & decision.EMA == "Sell" & decision.MACD == "Buy" & decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Sell" & decision.EMA == "Sell" & decision.MACD == "Sell" & decision.RSI == "Buy" ~ "Sell",
TRUE ~ "Hold"
)
)
wika_analisa %>%
select(c("date", "open", "close", "final_decision")) %>%
filter(final_decision != "Hold")wika_analisa2 <- wika_analisa %>%
mutate(
final_decision = case_when(
decision.SMA == "Buy" & decision.EMA == "Buy" & decision.MACD == "Buy" & decision.MACD == "Buy" | decision.MACD == "Buy" | decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Sell" & decision.EMA == "Buy" & decision.MACD == "Buy" & decision.MACD == "Buy" | decision.MACD == "Buy" | decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Buy" & decision.EMA == "Sell" & decision.MACD == "Buy" & decision.MACD == "Buy" | decision.MACD == "Buy" | decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Buy" & decision.EMA == "Buy" & decision.MACD == "Sell" & decision.MACD == "Buy" | decision.MACD == "Buy" | decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Buy" & decision.EMA == "Buy" & decision.MACD == "Buy" & decision.MACD == "Sell" | decision.MACD == "Buy" | decision.RSI == "Buy" ~ "Buy",
decision.SMA == "Sell" & decision.EMA == "Sell" & decision.MACD == "Sell" & decision.MACD == "Sell" | decision.MACD == "Sell" | decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Buy" & decision.EMA == "Sell" & decision.MACD == "Sell" & decision.MACD == "Sell" | decision.MACD == "Sell" | decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Sell" & decision.EMA == "Buy" & decision.MACD == "Sell" & decision.MACD == "Sell" | decision.MACD == "Sell" | decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Sell" & decision.EMA == "Sell" & decision.MACD == "Buy" & decision.MACD == "Sell" | decision.MACD == "Sell" | decision.RSI == "Sell" ~ "Sell",
decision.SMA == "Sell" & decision.EMA == "Sell" & decision.MACD == "Sell" & decision.MACD == "Buy" | decision.MACD == "Sell" | decision.RSI == "Sell" ~ "Sell",
TRUE ~ "Hold"
)
)
wika_analisa2 %>%
select(c("date", "open", "close", "final_decision")) %>%
filter(final_decision != "Hold")wika_analisa2$final_decision[wika_analisa2$date == "2017-04-20"] <- "Hold"
wika_analisa2$final_decision[wika_analisa2$date == "2017-10-31"] <- "Hold"
wika_analisa2$final_decision[wika_analisa2$date == "2017-11-03"] <- "Hold"
wika_analisa2$final_decision[wika_analisa2$date == "2019-08-08"] <- "Hold"
wika_analisa2 %>%
select(c("date", "open", "close", "final_decision")) %>%
filter(final_decision != "Hold")Data preprocessing is a process of preparing the raw data that has been obtained and has passed the EDA process and making it suitable for a machine learning model. It is the first and crucial step while creating a machine learning model.
First thing first, let’s conduct cross validation. Cross-validation is a statistical technique for testing the performance of a Machine Learning model. In particular, a good cross validation method gives us a comprehensive measure of our model’s performance throughout the whole dataset.
In this project the whole dataset will be divided into three parts, Data Train, Data Validation and Data Test.
- Training Dataset: The sample of data used to fit the model.
- Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.
- Test Dataset: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.
The main reason cross validation divides the dataset into three parts is to help avoid overfitting: by splitting the data this way we can concretely check whether the model performs well both on data seen during training and on data it has not seen. Without cross validation, we would never know if our model is great in general or only on our sheltered training set!
- Data Train
data_train <- bbri_analisa2 %>%
filter(date > "2017-12-31" & date < "2021-06-01") %>%
mutate_if(is.character, as.factor)
After the training data has been separated from the test data, let's randomly split it into two parts, bbri_train and bbri_validation, with a composition of 65% for the training data and 35% for the validation data. Randomizing the data is necessary because it prevents bias during training and keeps the model from learning the order of the observations.
Randomizing the data and splitting it into training and validation sets can be done using the rsample library.
library(rsample)
set.seed(123)
init <- initial_split(data = data_train, # data to be split
prop = 0.65, # proportion of data used for training
strata = final_decision) # target variable used for stratification
bbri_train <- training(init)
bbri_validation <- testing(init)
table(bbri_train$final_decision)
##
## Buy Hold Sell
## 24 510 25
table(bbri_validation$final_decision)
##
## Buy Hold Sell
## 9 271 20
- Data Test
bbri_test <- bbri_analisa2[71:500,] %>%
filter(date < "2018-01-01") %>%
mutate_if(is.character, as.factor)
table(bbri_test$final_decision)
##
## Buy Hold Sell
## 10 382 38
- Data Train
data_train <- isat_analisa2 %>%
filter(date > "2017-12-31" & date < "2021-06-01") %>%
mutate_if(is.character, as.factor)
set.seed(123)
init <- initial_split(data = data_train,
prop = 0.65,
strata = final_decision)
isat_train <- training(init)
isat_validation <- testing(init)
table(isat_train$final_decision)
##
## Buy Hold Sell
## 21 509 29
table(isat_validation$final_decision)
##
## Buy Hold Sell
## 9 277 14
- Data Test
isat_test <- isat_analisa2[71:500,] %>%
filter(date < "2018-01-01") %>%
mutate_if(is.character, as.factor)
table(isat_test$final_decision)
##
## Buy Hold Sell
## 12 392 26
- Data Train
data_train <- sido_analisa2 %>%
filter(date > "2017-12-31" & date < "2021-06-01") %>%
mutate_if(is.character, as.factor)
set.seed(123)
init <- initial_split(data = data_train,
prop = 0.65,
strata = final_decision)
sido_train <- training(init)
sido_validation <- testing(init)
table(sido_train$final_decision)
##
## Buy Hold Sell
## 19 492 48
table(sido_validation$final_decision)
##
## Buy Hold Sell
## 5 265 31
- Data Test
sido_test <- sido_analisa2[81:492,] %>%
filter(date < "2018-01-01") %>%
mutate_if(is.character, as.factor)
table(sido_test$final_decision)
##
## Buy Hold Sell
## 17 378 17
- Data Train
data_train <- hoki_analisa2 %>%
filter(date > "2017-12-31" & date < "2021-06-01") %>%
mutate_if(is.character, as.factor)
set.seed(123)
init <- initial_split(data = data_train,
prop = 0.65,
strata = final_decision)
hoki_train <- training(init)
hoki_validation <- testing(init)
table(hoki_train$final_decision)
##
## Buy Hold Sell
## 30 488 41
table(hoki_validation$final_decision)##
## Buy Hold Sell
## 13 273 15
- Data Test
hoki_test <- hoki_analisa2[71:130,] %>%
filter(date < "2018-01-01") %>%
mutate_if(is.character, as.factor)
table(hoki_test$final_decision)##
## Buy Hold Sell
## 5 53 2
- Data Train
data_train <- wika_analisa2 %>%
filter(date > "2017-12-31" & date < "2021-06-01") %>%
mutate_if(is.character, as.factor)
set.seed(123)
init <- initial_split(data = data_train,
prop = 0.65,
strata = final_decision)
wika_train <- training(init)
wika_validation <- testing(init)
table(wika_train$final_decision)##
## Buy Hold Sell
## 39 480 40
table(wika_validation$final_decision)##
## Buy Hold Sell
## 22 262 16
- Data Test
wika_test <- wika_analisa2[81:500,] %>%
filter(date < "2018-01-01") %>%
mutate_if(is.character, as.factor)
table(wika_test$final_decision)##
## Buy Hold Sell
## 29 372 19
As can be seen from the cross validation step, the target variable proportions are imbalanced. An imbalanced training dataset cannot be used as-is because of the severely skewed class distribution, which causes poor performance with traditional machine learning models and evaluation metrics that assume a balanced class distribution.
But fear not, there is a technique called the Synthetic Minority Oversampling Technique (SMOTE). SMOTE is an oversampling technique that generates synthetic samples for the minority class. It is used to obtain a synthetically class-balanced or nearly class-balanced training set, which is then used to train the model.
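To give an intuition of how SMOTE generates a synthetic observation, here is a minimal sketch with made-up numeric values (the feature names are only illustrative): the synthetic sample is created by interpolating between a minority-class observation and one of its nearest neighbours from the same class.
set.seed(123)
# Hypothetical numeric features of a minority-class ("Buy") observation
# and one of its nearest neighbours from the same class (made-up values)
x <- c(RSI14 = 35, MACD = -12)
neighbor <- c(RSI14 = 40, MACD = -9)
# SMOTE places the synthetic sample somewhere on the segment between the two points
gap <- runif(1) # random number between 0 and 1
synthetic <- x + gap * (neighbor - x)
synthetic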
The library used for the SMOTE function is UBL. The UBL library provides a function called SmoteClassif(), which will be used to balance the target variable in our train data.
There are four arguments of the function SmoteClassif() which must be noted.
- Argument form :
A formula describing the prediction problem.
- Argument dat :
A data frame containing the original (imbalanced) data set; the columns passed to it can be selected beforehand.
- Argument C.perc :
A named list containing the percentage(s) of under- or/and over-sampling to apply to each class. The over-sampling percentage is a number above 1 while the under-sampling percentage should be a number below 1. If the number 1 is provided for a given class then that class remains unchanged. Alternatively it may be “balance” (the default) or “extreme”, cases where the sampling percentages are automatically estimated either to balance the examples between the minority and majority classes or to invert the distribution of examples across the existing classes transforming the majority classes into minority and vice-versa.
- Argument dist :
The parameter dist allows the user to define the distance metric to be used in the neighbors computation. Although the default is the Euclidean distance, other metrics are available. This allows the computation of distances in data sets with, for instance, both nominal and numeric features. The options available for the distance functions are as follows:
* for data with only numeric features: “Manhattan”, “Euclidean”, “Canberra”, “Chebyshev”, “p-norm”.
* for data with only nominal features: “Overlap”.
* for dealing with both nominal and numeric features: “HEOM”, “HVDM”.
library(UBL)
# Since the date and stock symbol do not need to be duplicated, they are excluded from the data frame
dat <- bbri_train[, c(3:26)]
# There are two options to balance the data: make it exactly balanced, or make it almost balanced by setting the sampling percentage of each target class manually
almost_balanced <- list(Buy = 30, Hold = 1, Sell = 20)
bbri_train_smote <- SmoteClassif(form = final_decision ~ ., # format: target variable ~ all other columns
dat = dat,
C.perc = almost_balanced,
dist = "HVDM") # HEOM / HVDM can be used to upsample data with both nominal and numeric features
# Total distribution of target variable before SMOTE
table(bbri_train$final_decision)##
## Buy Hold Sell
## 24 510 25
# Total distribution of target variable after SMOTE
table(bbri_train_smote$final_decision)##
## Buy Hold Sell
## 720 510 500
dat <- isat_train[, c(3:29)]
almost_balanced <- list(Buy = 25, Hold = 1, Sell = 22)
isat_train_smote <- SmoteClassif(form = final_decision ~ .,
dat = dat,
C.perc = almost_balanced,
dist = "HVDM") # Total distribution of target variable before SMOTE
table(isat_train$final_decision)##
## Buy Hold Sell
## 21 509 29
# Total distribution of target variable after SMOTE
table(isat_train_smote$final_decision)##
## Buy Hold Sell
## 525 509 638
dat <- sido_train[, c(3:28)]
almost_balanced <- list(Buy = 25, Hold = 1, Sell = 22)
sido_train_smote <- SmoteClassif(form = final_decision ~ .,
dat = dat,
C.perc = almost_balanced,
dist = "HVDM") # Total distribution of target variable before SMOTE
table(sido_train$final_decision)##
## Buy Hold Sell
## 19 492 48
# Total distribution of target variable after SMOTE
table(sido_train_smote$final_decision)##
## Buy Hold Sell
## 525 509 638
dat <- hoki_train[, c(3:28)]
almost_balanced <- list(Buy = 15, Hold = 1, Sell = 13)
hoki_train_smote <- SmoteClassif(form = final_decision ~ .,
dat = dat,
C.perc = almost_balanced,
dist = "HVDM") # Total distribution of target variable before SMOTE
table(hoki_train$final_decision)##
## Buy Hold Sell
## 30 488 41
# Total distribution of target variable after SMOTE
table(hoki_train_smote$final_decision)##
## Buy Hold Sell
## 450 488 533
dat <- wika_train[, c(3:28)]
almost_balanced <- list(Buy = 15, Hold = 1, Sell = 15)
wika_train_smote <- SmoteClassif(form = final_decision ~ .,
dat = dat,
C.perc = almost_balanced,
dist = "HVDM") # Total distribution of target variable before SMOTE
table(wika_train$final_decision)##
## Buy Hold Sell
## 39 480 40
# Total distribution of target variable after SMOTE
table(wika_train_smote$final_decision)##
## Buy Hold Sell
## 585 480 600
After obtaining all the variables needed through the EDA process and preparing them during data pre-processing, the next step is to start the machine learning modeling process. Two models will be developed and compared in this project to determine which one is best: Decision Tree and Random Forest.
Decision Tree and Random Forest are categorized as classification models, which is very suitable for this project since its main goal is to build a machine learning solution capable of classifying whether today is the right time for an investor to open a position, close a position, or keep the position until the right time comes.
Decision Tree is a model that can be used to visually and explicitly represent decisions and decision making. In this case, the model will be used to classify whether today is the right time to buy, sell or hold a stock. The function that will be used to create the decision tree model is ctree() from the library partykit.
- Model Based On Data Train & Validation
In this process the decision tree model is trained using the train data, and the trained model is then evaluated on the validation data to measure its performance.
The decision tree model can also be given several parameters that simplify the tree and make it easier to interpret; those parameters are:
- mincriterion: The value of the test statistic (1 - p-value) that must be exceeded in order to implement a split.
- minsplit: The minimum number of observations that must exist in a node in order for a split to be attempted. (default: 20)
- minbucket: The minimum number of observations at the terminal node. If not fulfilled, no branching is done. (default: 7)
These three parameters can be set using the argument control = ctree_control(mincriterion = , minsplit = , minbucket = ).
In this case, to reduce the complexity of the model:
- mincriterion: the value is set to 0.5, since the p-values from the model above are very small.
- minsplit: the minimum number of observations that must exist in a node in order for a split to be attempted, in this case 100.
- minbucket: the minimum number of observations in any terminal node, in this case 50.
library(partykit)
# Model training
model_dt <- ctree(final_decision ~ .,
data = bbri_train_smote,
control = ctree_control(mincriterion = 0.5,
minsplit = 100,
minbucket = 50))
# Model training result visualization
plot(model_dt, type= "simple")Based on model Decision Tree visualization above, MACD technical analysis is most often used to consider giving advice on when is the right time to buy or sell stocks.
The next thing that needs to be done after training the model is to evaluate it on the validation data that was separated during the cross validation process. To evaluate the model, the method below can be followed.
pred_model_dt <- predict(object = model_dt, newdata = bbri_validation, type = "response")
library(caret)
eval_pred_model_dt <- confusionMatrix(data = pred_model_dt,
reference = as.factor(bbri_validation$final_decision))
table(bbri_validation$final_decision)##
## Buy Hold Sell
## 9 271 20
eval_pred_model_dt ## Confusion Matrix and Statistics
##
## Reference
## Prediction Buy Hold Sell
## Buy 7 0 0
## Hold 0 271 0
## Sell 2 0 20
##
## Overall Statistics
##
## Accuracy : 0.9933
## 95% CI : (0.9761, 0.9992)
## No Information Rate : 0.9033
## P-Value [Acc > NIR] : 3.106e-11
##
## Kappa : 0.9626
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Buy Class: Hold Class: Sell
## Sensitivity 0.77778 1.0000 1.00000
## Specificity 1.00000 1.0000 0.99286
## Pos Pred Value 1.00000 1.0000 0.90909
## Neg Pred Value 0.99317 1.0000 1.00000
## Prevalence 0.03000 0.9033 0.06667
## Detection Rate 0.02333 0.9033 0.06667
## Detection Prevalence 0.02333 0.9033 0.07333
## Balanced Accuracy 0.88889 1.0000 0.99643
- Model Based On Data Test
In order to confirm once again that the model is not overfitting, let's evaluate it again, this time using the test data. If the evaluation results on the test data do not differ much from the results on the validation data, it means that our model is good enough and is not overfitting.
pred_model_dt_test <- predict(object = model_dt, newdata = bbri_test, type = "response")
eval_pred_model_dt_test <- confusionMatrix(data = pred_model_dt_test,
reference = as.factor(bbri_test$final_decision))
table(bbri_test$final_decision)##
## Buy Hold Sell
## 10 382 38
eval_pred_model_dt_test ## Confusion Matrix and Statistics
##
## Reference
## Prediction Buy Hold Sell
## Buy 7 0 0
## Hold 0 382 0
## Sell 3 0 38
##
## Overall Statistics
##
## Accuracy : 0.993
## 95% CI : (0.9797, 0.9986)
## No Information Rate : 0.8884
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9655
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Buy Class: Hold Class: Sell
## Sensitivity 0.70000 1.0000 1.00000
## Specificity 1.00000 1.0000 0.99235
## Pos Pred Value 1.00000 1.0000 0.92683
## Neg Pred Value 0.99291 1.0000 1.00000
## Prevalence 0.02326 0.8884 0.08837
## Detection Rate 0.01628 0.8884 0.08837
## Detection Prevalence 0.01628 0.8884 0.09535
## Balanced Accuracy 0.85000 1.0000 0.99617
# Model training
model_dt <- ctree(final_decision ~ .,
data = isat_train_smote,
control = ctree_control(mincriterion = 0.5,
minsplit = 100,
minbucket = 50))
# Model training result visualization
plot(model_dt, type= "simple")pred_model_dt <- predict(object = model_dt, newdata = isat_validation, type = "response")
eval_pred_model_dt <- confusionMatrix(data = pred_model_dt,
reference = as.factor(isat_validation$final_decision))
table(isat_validation$final_decision)##
## Buy Hold Sell
## 9 277 14
eval_pred_model_dt ## Confusion Matrix and Statistics
##
## Reference
## Prediction Buy Hold Sell
## Buy 9 9 0
## Hold 0 268 0
## Sell 0 0 14
##
## Overall Statistics
##
## Accuracy : 0.97
## 95% CI : (0.9438, 0.9862)
## No Information Rate : 0.9233
## P-Value [Acc > NIR] : 0.000562
##
## Kappa : 0.8247
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Buy Class: Hold Class: Sell
## Sensitivity 1.0000 0.9675 1.00000
## Specificity 0.9691 1.0000 1.00000
## Pos Pred Value 0.5000 1.0000 1.00000
## Neg Pred Value 1.0000 0.7187 1.00000
## Prevalence 0.0300 0.9233 0.04667
## Detection Rate 0.0300 0.8933 0.04667
## Detection Prevalence 0.0600 0.8933 0.04667
## Balanced Accuracy 0.9845 0.9838 1.00000
- Model Based On Data Test
pred_model_dt_test <- predict(object = model_dt, newdata = isat_test, type = "response")
eval_pred_model_dt_test <- confusionMatrix(data = pred_model_dt_test,
reference = as.factor(isat_test$final_decision))
table(isat_test$final_decision)##
## Buy Hold Sell
## 12 392 26
eval_pred_model_dt_test ## Confusion Matrix and Statistics
##
## Reference
## Prediction Buy Hold Sell
## Buy 11 42 0
## Hold 0 350 0
## Sell 1 0 26
##
## Overall Statistics
##
## Accuracy : 0.9
## 95% CI : (0.8677, 0.9267)
## No Information Rate : 0.9116
## P-Value [Acc > NIR] : 0.8259
##
## Kappa : 0.6012
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Buy Class: Hold Class: Sell
## Sensitivity 0.91667 0.8929 1.00000
## Specificity 0.89952 1.0000 0.99752
## Pos Pred Value 0.20755 1.0000 0.96296
## Neg Pred Value 0.99735 0.4750 1.00000
## Prevalence 0.02791 0.9116 0.06047
## Detection Rate 0.02558 0.8140 0.06047
## Detection Prevalence 0.12326 0.8140 0.06279
## Balanced Accuracy 0.90809 0.9464 0.99876
# Model training
model_dt <- ctree(final_decision ~ .,
data = hoki_train_smote,
control = ctree_control(mincriterion = 0.5,
minsplit = 100,
minbucket = 50))
# Model training result visualization
plot(model_dt, type= "simple")pred_model_dt <- predict(object = model_dt, newdata = hoki_validation, type = "response")
eval_pred_model_dt <- confusionMatrix(data = pred_model_dt,
reference = as.factor(hoki_validation$final_decision))
table(hoki_validation$final_decision)##
## Buy Hold Sell
## 13 273 15
eval_pred_model_dt ## Confusion Matrix and Statistics
##
## Reference
## Prediction Buy Hold Sell
## Buy 13 2 0
## Hold 0 271 6
## Sell 0 0 9
##
## Overall Statistics
##
## Accuracy : 0.9734
## 95% CI : (0.9483, 0.9885)
## No Information Rate : 0.907
## P-Value [Acc > NIR] : 4.307e-06
##
## Kappa : 0.8356
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Buy Class: Hold Class: Sell
## Sensitivity 1.00000 0.9927 0.60000
## Specificity 0.99306 0.7857 1.00000
## Pos Pred Value 0.86667 0.9783 1.00000
## Neg Pred Value 1.00000 0.9167 0.97945
## Prevalence 0.04319 0.9070 0.04983
## Detection Rate 0.04319 0.9003 0.02990
## Detection Prevalence 0.04983 0.9203 0.02990
## Balanced Accuracy 0.99653 0.8892 0.80000
- Model Based On Data Test
pred_model_dt_test <- predict(object = model_dt, newdata = hoki_test, type = "response")
eval_pred_model_dt_test <- confusionMatrix(data = pred_model_dt_test,
reference = as.factor(hoki_test$final_decision))
table(hoki_test$final_decision)##
## Buy Hold Sell
## 5 53 2
eval_pred_model_dt_test ## Confusion Matrix and Statistics
##
## Reference
## Prediction Buy Hold Sell
## Buy 5 0 0
## Hold 0 53 0
## Sell 0 0 2
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9404, 1)
## No Information Rate : 0.8833
## P-Value [Acc > NIR] : 0.0005854
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Buy Class: Hold Class: Sell
## Sensitivity 1.00000 1.0000 1.00000
## Specificity 1.00000 1.0000 1.00000
## Pos Pred Value 1.00000 1.0000 1.00000
## Neg Pred Value 1.00000 1.0000 1.00000
## Prevalence 0.08333 0.8833 0.03333
## Detection Rate 0.08333 0.8833 0.03333
## Detection Prevalence 0.08333 0.8833 0.03333
## Balanced Accuracy 1.00000 1.0000 1.00000
# Model training
model_dt <- ctree(final_decision ~ .,
data = wika_train_smote,
control = ctree_control(mincriterion = 0.5,
minsplit = 100,
minbucket = 50))
# Model training result visualization
plot(model_dt, type= "simple")pred_model_dt <- predict(object = model_dt, newdata = wika_validation, type = "response")
eval_pred_model_dt <- confusionMatrix(data = pred_model_dt,
reference = as.factor(wika_validation$final_decision))
table(wika_validation$final_decision)##
## Buy Hold Sell
## 22 262 16
eval_pred_model_dt ## Confusion Matrix and Statistics
##
## Reference
## Prediction Buy Hold Sell
## Buy 21 15 0
## Hold 1 247 3
## Sell 0 0 13
##
## Overall Statistics
##
## Accuracy : 0.9367
## 95% CI : (0.9029, 0.9614)
## No Information Rate : 0.8733
## P-Value [Acc > NIR] : 0.0002543
##
## Kappa : 0.7547
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Buy Class: Hold Class: Sell
## Sensitivity 0.95455 0.9427 0.81250
## Specificity 0.94604 0.8947 1.00000
## Pos Pred Value 0.58333 0.9841 1.00000
## Neg Pred Value 0.99621 0.6939 0.98955
## Prevalence 0.07333 0.8733 0.05333
## Detection Rate 0.07000 0.8233 0.04333
## Detection Prevalence 0.12000 0.8367 0.04333
## Balanced Accuracy 0.95029 0.9187 0.90625
- Model Based On Data Test
pred_model_dt_test <- predict(object = model_dt, newdata = wika_test, type = "response")
eval_pred_model_dt_test <- confusionMatrix(data = pred_model_dt_test,
reference = as.factor(wika_test$final_decision))
table(wika_test$final_decision)##
## Buy Hold Sell
## 29 372 19
eval_pred_model_dt_test ## Confusion Matrix and Statistics
##
## Reference
## Prediction Buy Hold Sell
## Buy 28 21 0
## Hold 1 349 2
## Sell 0 2 17
##
## Overall Statistics
##
## Accuracy : 0.9381
## 95% CI : (0.9106, 0.9592)
## No Information Rate : 0.8857
## P-Value [Acc > NIR] : 0.0001954
##
## Kappa : 0.75
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Buy Class: Hold Class: Sell
## Sensitivity 0.96552 0.9382 0.89474
## Specificity 0.94629 0.9375 0.99501
## Pos Pred Value 0.57143 0.9915 0.89474
## Neg Pred Value 0.99730 0.6618 0.99501
## Prevalence 0.06905 0.8857 0.04524
## Detection Rate 0.06667 0.8310 0.04048
## Detection Prevalence 0.11667 0.8381 0.04524
## Balanced Accuracy 0.95590 0.9378 0.94487
Random Forest makes predictions by building many decision trees. Each decision tree has its own characteristics and is independent of the others; each tree makes its own prediction, and majority voting is then carried out on the prediction results. The class with the most votes becomes the final prediction, as sketched below.
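A minimal sketch of this voting idea, using made-up predictions from three hypothetical trees rather than output from the models in this project:
# Class predictions of three hypothetical trees for a single observation (made-up values)
tree_predictions <- c("Buy", "Hold", "Buy")
# Count the votes and pick the class with the most votes as the final prediction
votes <- table(tree_predictions)
names(votes)[which.max(votes)]
## [1] "Buy"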
The Random Forest model itself will be trained with the train() function from library caret (loaded earlier), while library(e1071) is loaded as a supporting library.
library(e1071)
- Model Based On Data Train & Validation
The first step in Random Forest modeling is to set the training control, and there are two settings that have to be specified: the number of folds (K) and how many times the process should be repeated.
K-Fold is a form of cross validation. Usually cross validation is used to split between train and test data, but in Random Forest it is used to divide the data into k equal parts, where each part is used as the test data in turn. As k gets larger, the difference in size between the training set and the resampling subsets gets smaller, and as this difference decreases the bias of the technique becomes smaller. The choice of k is usually 5 or 10, but there is no formal rule.
In this case k will be 5 and the process will be repeated 3 times.
# set.seed(100)
#
# ctrl <- trainControl(method = "repeatedcv",
# number = 5, # k-fold
# repeats = 3) #repetition
#
# bbri_forest <- train(final_decision ~ .,
# data = bbri_train_smote,
# method = "rf", # random forest
# trControl = ctrl)
#
# saveRDS(bbri_forest, "bbri_forest.RDS")
bbri_forest <- readRDS("model/bbri_forest.RDS")
bbri_forest## Random Forest
##
## 1608 samples
## 23 predictor
## 3 classes: 'Buy', 'Hold', 'Sell'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 1286, 1287, 1287, 1286, 1286, 1287, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9844566 0.9766327
## 14 0.9830047 0.9744622
## 27 0.9807247 0.9710376
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
From the model summary, we know that the optimal number of variables considered for splitting at each tree node is mtry = 2, since the largest accuracy value was produced at mtry = 2.
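If other mtry values need to be evaluated, caret also accepts an explicit tuning grid. Below is a rough sketch, assuming the same bbri_train_smote data as above; the values 2, 5 and 10 are arbitrary choices, and like the training code above it can take a while to run.
# Same repeated cross validation control as used in the commented training code above
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
# Evaluate a custom set of mtry values instead of caret's default grid
rf_grid <- expand.grid(mtry = c(2, 5, 10))
bbri_forest_tuned <- train(final_decision ~ .,
                           data = bbri_train_smote,
                           method = "rf",      # random forest
                           trControl = ctrl,
                           tuneGrid = rf_grid) # mtry values to try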
To find out which variable or predictor is considered most important by the random forest, the function varImp() can be used.
varImp(bbri_forest)## rf variable importance
##
## only 20 most important variables shown (out of 27)
##
## Overall
## decision.MACDSell 100.00
## decision.MACDHold 76.76
## MACD 65.68
## RSI30 57.90
## RSI14 49.92
## RSI7 44.94
## RSI70 39.47
## decision.EMAHold 35.62
## volume 25.51
## decision.RSIHold 24.53
## close 20.98
## SMA70 20.11
## open 19.07
## low 18.67
## SMA50 18.29
## high 18.20
## EMA15 17.30
## SMA5 16.52
## EMA70 16.35
## adjusted 15.73
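If a visual ranking is preferred, the importance scores can also be plotted directly from the varImp() result; a quick sketch based on the bbri_forest object above:
# Plot the 10 most important predictors of the random forest model
plot(varImp(bbri_forest), top = 10)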
rm_pred <- predict(bbri_forest, bbri_validation, type = "raw")
eval_rf <- confusionMatrix(data = rm_pred,
reference = bbri_validation$final_decision)
table(bbri_validation$final_decision)##
## Buy Hold Sell
## 9 271 20
eval_rf## Confusion Matrix and Statistics
##
## Reference
## Prediction Buy Hold Sell
## Buy 8 2 0
## Hold 1 269 1
## Sell 0 0 19
##
## Overall Statistics
##
## Accuracy : 0.9867
## 95% CI : (0.9662, 0.9964)
## No Information Rate : 0.9033
## P-Value [Acc > NIR] : 2.805e-09
##
## Kappa : 0.9254
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Buy Class: Hold Class: Sell
## Sensitivity 0.88889 0.9926 0.95000
## Specificity 0.99313 0.9310 1.00000
## Pos Pred Value 0.80000 0.9926 1.00000
## Neg Pred Value 0.99655 0.9310 0.99644
## Prevalence 0.03000 0.9033 0.06667
## Detection Rate 0.02667 0.8967 0.06333
## Detection Prevalence 0.03333 0.9033 0.06333
## Balanced Accuracy 0.94101 0.9618 0.97500
- Model Based On Data Test
rm_pred_test <- predict(bbri_forest, bbri_test, type = "raw")
eval_rf_test <- confusionMatrix(data = rm_pred_test,
reference = bbri_test$final_decision)
table(bbri_test$final_decision)##
## Buy Hold Sell
## 10 382 38
eval_rf_test## Confusion Matrix and Statistics
##
## Reference
## Prediction Buy Hold Sell
## Buy 0 0 0
## Hold 8 382 6
## Sell 2 0 32
##
## Overall Statistics
##
## Accuracy : 0.9628
## 95% CI : (0.9403, 0.9786)
## No Information Rate : 0.8884
## P-Value [Acc > NIR] : 2.133e-08
##
## Kappa : 0.7872
##
## Mcnemar's Test P-Value : 0.001134
##
## Statistics by Class:
##
## Class: Buy Class: Hold Class: Sell
## Sensitivity 0.00000 1.0000 0.84211
## Specificity 1.00000 0.7083 0.99490
## Pos Pred Value NaN 0.9646 0.94118
## Neg Pred Value 0.97674 1.0000 0.98485
## Prevalence 0.02326 0.8884 0.08837
## Detection Rate 0.00000 0.8884 0.07442
## Detection Prevalence 0.00000 0.9209 0.07907
## Balanced Accuracy 0.50000 0.8542 0.91850
# set.seed(100)
#
# ctrl <- trainControl(method = "repeatedcv",
# number = 5, # k-fold
# repeats = 3) #repetition
#
# isat_forest <- train(final_decision ~ .,
# data = isat_train_smote,
# method = "rf", # random forest
# trControl = ctrl)
#
# saveRDS(isat_forest, "isat_forest.RDS")
isat_forest <- readRDS("model/isat_forest.RDS")
isat_forest## Random Forest
##
## 1513 samples
## 26 predictor
## 3 classes: 'Buy', 'Hold', 'Sell'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 1362, 1361, 1362, 1361, 1363, 1362, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9829496 0.9744057
## 16 0.9767392 0.9651042
## 30 0.9737059 0.9605557
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
From the model summary, we know that the optimal number of variables considered for splitting at each tree node is mtry = 2, since the largest accuracy value was produced at mtry = 2.
To find out which variable or predictor is considered most important by the random forest, the function varImp() can be used.
varImp(isat_forest)## rf variable importance
##
## only 20 most important variables shown (out of 30)
##
## Overall
## decision.EMASell 100.00
## decision.EMAHold 97.03
## RSI45 67.64
## RSI10 67.18
## MACD 64.18
## RSI70 59.30
## RSI38 54.35
## open 39.01
## decision.RSIHold 36.33
## volume 35.49
## low 31.57
## high 31.32
## EMA18 31.10
## EMA5 30.23
## close 27.46
## EMA70 27.20
## EMA10 26.95
## adjusted 26.75
## SMA5 26.21
## EMA30 26.14
rm_pred <- predict(isat_forest, isat_validation, type = "raw")
eval_rf <- confusionMatrix(data = rm_pred,
reference = isat_validation$final_decision)
table(isat_validation$final_decision)##
## Buy Hold Sell
## 9 277 14
eval_rf## Confusion Matrix and Statistics
##
## Reference
## Prediction Buy Hold Sell
## Buy 7 3 0
## Hold 2 273 3
## Sell 0 1 11
##
## Overall Statistics
##
## Accuracy : 0.97
## 95% CI : (0.9438, 0.9862)
## No Information Rate : 0.9233
## P-Value [Acc > NIR] : 0.000562
##
## Kappa : 0.788
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Buy Class: Hold Class: Sell
## Sensitivity 0.77778 0.9856 0.78571
## Specificity 0.98969 0.7826 0.99650
## Pos Pred Value 0.70000 0.9820 0.91667
## Neg Pred Value 0.99310 0.8182 0.98958
## Prevalence 0.03000 0.9233 0.04667
## Detection Rate 0.02333 0.9100 0.03667
## Detection Prevalence 0.03333 0.9267 0.04000
## Balanced Accuracy 0.88373 0.8841 0.89111
- Model Based On Data Test
rm_pred_test <- predict(isat_forest, isat_test, type = "raw")
eval_rf_test <- confusionMatrix(data = rm_pred_test,
reference = isat_test$final_decision)
table(isat_test$final_decision)##
## Buy Hold Sell
## 12 392 26
eval_rf_test## Confusion Matrix and Statistics
##
## Reference
## Prediction Buy Hold Sell
## Buy 0 0 0
## Hold 12 392 9
## Sell 0 0 17
##
## Overall Statistics
##
## Accuracy : 0.9512
## 95% CI : (0.9263, 0.9695)
## No Information Rate : 0.9116
## P-Value [Acc > NIR] : 0.001322
##
## Kappa : 0.5998
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Buy Class: Hold Class: Sell
## Sensitivity 0.00000 1.0000 0.65385
## Specificity 1.00000 0.4474 1.00000
## Pos Pred Value NaN 0.9492 1.00000
## Neg Pred Value 0.97209 1.0000 0.97821
## Prevalence 0.02791 0.9116 0.06047
## Detection Rate 0.00000 0.9116 0.03953
## Detection Prevalence 0.00000 0.9605 0.03953
## Balanced Accuracy 0.50000 0.7237 0.82692
# set.seed(100)
#
# ctrl <- trainControl(method = "repeatedcv",
# number = 5, # k-fold
# repeats = 3) #repetition
#
# sido_forest <- train(final_decision ~ .,
# data = sido_train_smote,
# method = "rf", # random forest
# trControl = ctrl)
#
# saveRDS(sido_forest, "sido_forest.RDS")
sido_forest <- readRDS("model/sido_forest.RDS")
sido_forest## Random Forest
##
## 1636 samples
## 25 predictor
## 3 classes: 'Buy', 'Hold', 'Sell'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 1474, 1471, 1472, 1473, 1473, 1472, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9828884 0.9682234
## 15 0.9835079 0.9695845
## 29 0.9809350 0.9646569
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 15.
From the model summary, we know that the optimal number of variables considered for splitting at each tree node is mtry = 15, since the largest accuracy value was produced at mtry = 15.
To find out which variable or predictor is considered most important by the random forest, the function varImp() can be used.
varImp(sido_forest)## rf variable importance
##
## only 20 most important variables shown (out of 29)
##
## Overall
## decision.MACDSell 100.000
## decision.MACDHold 58.577
## decision.RSIHold 41.468
## RSI10 17.085
## decision.RSISell 16.592
## RSI14 10.095
## MACD 8.435
## RSI40 8.271
## RSI65 3.334
## SMA80 3.118
## volume 2.331
## open 2.072
## decision.SMASell 1.555
## close 1.523
## EMA55 1.422
## high 1.388
## EMA60 1.178
## low 1.138
## SMA55 1.129
## EMA5 1.075
rm_pred <- predict(sido_forest, sido_validation, type = "raw")
eval_rf <- confusionMatrix(data = rm_pred,
reference = sido_validation$final_decision)
table(sido_validation$final_decision)##
## Buy Hold Sell
## 5 265 31
eval_rf## Confusion Matrix and Statistics
##
## Reference
## Prediction Buy Hold Sell
## Buy 4 0 0
## Hold 0 265 0
## Sell 1 0 31
##
## Overall Statistics
##
## Accuracy : 0.9967
## 95% CI : (0.9816, 0.9999)
## No Information Rate : 0.8804
## P-Value [Acc > NIR] : 9.346e-16
##
## Kappa : 0.9845
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Buy Class: Hold Class: Sell
## Sensitivity 0.80000 1.0000 1.0000
## Specificity 1.00000 1.0000 0.9963
## Pos Pred Value 1.00000 1.0000 0.9688
## Neg Pred Value 0.99663 1.0000 1.0000
## Prevalence 0.01661 0.8804 0.1030
## Detection Rate 0.01329 0.8804 0.1030
## Detection Prevalence 0.01329 0.8804 0.1063
## Balanced Accuracy 0.90000 1.0000 0.9981
- Model Based On Data Test
rm_pred_test <- predict(sido_forest, sido_test, type = "raw")
eval_rf_test <- confusionMatrix(data = rm_pred_test,
reference = sido_test$final_decision)
table(sido_test$final_decision)##
## Buy Hold Sell
## 17 378 17
eval_rf_test## Confusion Matrix and Statistics
##
## Reference
## Prediction Buy Hold Sell
## Buy 17 0 1
## Hold 0 377 0
## Sell 0 1 16
##
## Overall Statistics
##
## Accuracy : 0.9951
## 95% CI : (0.9826, 0.9994)
## No Information Rate : 0.9175
## P-Value [Acc > NIR] : 2.806e-13
##
## Kappa : 0.9691
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Buy Class: Hold Class: Sell
## Sensitivity 1.00000 0.9974 0.94118
## Specificity 0.99747 1.0000 0.99747
## Pos Pred Value 0.94444 1.0000 0.94118
## Neg Pred Value 1.00000 0.9714 0.99747
## Prevalence 0.04126 0.9175 0.04126
## Detection Rate 0.04126 0.9150 0.03883
## Detection Prevalence 0.04369 0.9150 0.04126
## Balanced Accuracy 0.99873 0.9987 0.96932
# set.seed(100)
#
# ctrl <- trainControl(method = "repeatedcv",
# number = 5, # k-fold
# repeats = 3) #repetition
#
# hoki_forest <- train(final_decision ~ .,
# data = hoki_train_smote,
# method = "rf", # random forest
# trControl = ctrl)
#
# saveRDS(hoki_forest, "hoki_forest.RDS")
hoki_forest <- readRDS("model/hoki_forest.RDS")
hoki_forest## Random Forest
##
## 1723 samples
## 25 predictor
## 3 classes: 'Buy', 'Hold', 'Sell'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 1551, 1551, 1550, 1551, 1551, 1551, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9789931 0.9681189
## 15 0.9787639 0.9678320
## 29 0.9804981 0.9704689
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 29.
From the model summary, we know that the optimal number of variables considered for splitting at each tree node is mtry = 29, since the largest accuracy value was produced at mtry = 29.
To find out which variable or predictor is considered most important by the random forest, the function varImp() can be used.
varImp(hoki_forest)## rf variable importance
##
## only 20 most important variables shown (out of 29)
##
## Overall
## decision.MACDSell 100.0000
## decision.MACDHold 64.4945
## decision.RSIHold 49.1232
## decision.RSISell 27.0830
## RSI10 4.7417
## RSI65 2.4887
## RSI14 2.2891
## volume 2.0611
## close 1.7850
## SMA55 1.3956
## high 1.2195
## RSI40 1.0902
## adjusted 0.8876
## SMA70 0.8734
## EMA60 0.8670
## SMA5 0.8481
## MACD 0.8403
## open 0.6345
## low 0.6100
## SMA25 0.5980
rm_pred <- predict(hoki_forest, hoki_validation, type = "raw")
eval_rf <- confusionMatrix(data = rm_pred,
reference = hoki_validation$final_decision)
table(hoki_validation$final_decision)##
## Buy Hold Sell
## 13 273 15
eval_rf## Confusion Matrix and Statistics
##
## Reference
## Prediction Buy Hold Sell
## Buy 13 0 0
## Hold 0 273 0
## Sell 0 0 15
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9878, 1)
## No Information Rate : 0.907
## P-Value [Acc > NIR] : 1.724e-13
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Buy Class: Hold Class: Sell
## Sensitivity 1.00000 1.000 1.00000
## Specificity 1.00000 1.000 1.00000
## Pos Pred Value 1.00000 1.000 1.00000
## Neg Pred Value 1.00000 1.000 1.00000
## Prevalence 0.04319 0.907 0.04983
## Detection Rate 0.04319 0.907 0.04983
## Detection Prevalence 0.04319 0.907 0.04983
## Balanced Accuracy 1.00000 1.000 1.00000
- Model Based On Data Test
rm_pred_test <- predict(hoki_forest, hoki_test, type = "raw")
eval_rf_test <- confusionMatrix(data = rm_pred_test,
reference = hoki_test$final_decision)
table(hoki_test$final_decision)##
## Buy Hold Sell
## 5 53 2
eval_rf_test## Confusion Matrix and Statistics
##
## Reference
## Prediction Buy Hold Sell
## Buy 5 0 2
## Hold 0 53 0
## Sell 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.9667
## 95% CI : (0.8847, 0.9959)
## No Information Rate : 0.8833
## P-Value [Acc > NIR] : 0.0233
##
## Kappa : 0.8413
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Buy Class: Hold Class: Sell
## Sensitivity 1.00000 1.0000 0.00000
## Specificity 0.96364 1.0000 1.00000
## Pos Pred Value 0.71429 1.0000 NaN
## Neg Pred Value 1.00000 1.0000 0.96667
## Prevalence 0.08333 0.8833 0.03333
## Detection Rate 0.08333 0.8833 0.00000
## Detection Prevalence 0.11667 0.8833 0.00000
## Balanced Accuracy 0.98182 1.0000 0.50000
# set.seed(100)
#
# ctrl <- trainControl(method = "repeatedcv",
# number = 5, # k-fold
# repeats = 3) #repetition
#
# wika_forest <- train(final_decision ~ .,
# data = wika_train_smote,
# method = "rf", # random forest
# trControl = ctrl)
#
# saveRDS(wika_forest, "wika_forest.RDS")
wika_forest <- readRDS("model/wika_forest.RDS")
wika_forest## Random Forest
##
## 1568 samples
## 25 predictor
## 3 classes: 'Buy', 'Hold', 'Sell'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 1412, 1410, 1411, 1412, 1411, 1412, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9864747 0.9794725
## 15 0.9866086 0.9796980
## 29 0.9839277 0.9756386
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 15.
From the model summary, we know that the optimal number of variables considered for splitting at each tree node is mtry = 15, since the largest accuracy value was produced at mtry = 15.
To find out which variable or predictor is considered most important by the random forest, the function varImp() can be used.
varImp(wika_forest)## rf variable importance
##
## only 20 most important variables shown (out of 29)
##
## Overall
## decision.MACDSell 100.0000
## decision.MACDHold 75.0120
## MACD 43.4161
## decision.RSIHold 21.8267
## RSI5 19.5272
## RSI20 15.5693
## decision.RSISell 3.6135
## RSI35 2.8092
## RSI65 2.5218
## volume 1.7240
## SMA65 1.4097
## SMA80 0.8842
## EMA80 0.8590
## EMA65 0.7612
## low 0.6694
## SMA20 0.6367
## SMA30 0.6154
## close 0.5905
## EMA30 0.5731
## high 0.5442
rm_pred <- predict(wika_forest, wika_validation, type = "raw")
eval_rf <- confusionMatrix(data = rm_pred,
reference = wika_validation$final_decision)
table(wika_validation$final_decision)##
## Buy Hold Sell
## 22 262 16
eval_rf## Confusion Matrix and Statistics
##
## Reference
## Prediction Buy Hold Sell
## Buy 22 0 0
## Hold 0 262 0
## Sell 0 0 16
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9878, 1)
## No Information Rate : 0.8733
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Buy Class: Hold Class: Sell
## Sensitivity 1.00000 1.0000 1.00000
## Specificity 1.00000 1.0000 1.00000
## Pos Pred Value 1.00000 1.0000 1.00000
## Neg Pred Value 1.00000 1.0000 1.00000
## Prevalence 0.07333 0.8733 0.05333
## Detection Rate 0.07333 0.8733 0.05333
## Detection Prevalence 0.07333 0.8733 0.05333
## Balanced Accuracy 1.00000 1.0000 1.00000
- Model Based On Data Test
rm_pred_test <- predict(wika_forest, wika_test, type = "raw")
eval_rf_test <- confusionMatrix(data = rm_pred_test,
reference = wika_test$final_decision)
table(wika_test$final_decision)##
## Buy Hold Sell
## 29 372 19
eval_rf_test## Confusion Matrix and Statistics
##
## Reference
## Prediction Buy Hold Sell
## Buy 26 2 0
## Hold 3 368 0
## Sell 0 2 19
##
## Overall Statistics
##
## Accuracy : 0.9833
## 95% CI : (0.966, 0.9933)
## No Information Rate : 0.8857
## P-Value [Acc > NIR] : 2.169e-14
##
## Kappa : 0.9209
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Buy Class: Hold Class: Sell
## Sensitivity 0.89655 0.9892 1.00000
## Specificity 0.99488 0.9375 0.99501
## Pos Pred Value 0.92857 0.9919 0.90476
## Neg Pred Value 0.99235 0.9184 1.00000
## Prevalence 0.06905 0.8857 0.04524
## Detection Rate 0.06190 0.8762 0.04524
## Detection Prevalence 0.06667 0.8833 0.05000
## Balanced Accuracy 0.94572 0.9634 0.99751
Model Decision Tree & Random Forest Selection
Based on the confusion matrix results on the train and test data for both models, the decision tree and the random forest both produce good prediction performance on the train data, but only the decision tree model produces stable performance when evaluated on the test data, while the random forest model often makes prediction errors on the test data.
In addition to checking how many predictions match the target variable, the model performance evaluation considered here is the Sensitivity statistic for each class of the target variable in both models. Sensitivity is a measure of the proportion of actual positive cases that are predicted as positive (true positives); Sensitivity is also termed Recall. This implies that there is another proportion of actual positive cases that are predicted incorrectly as negative (the false negatives).
Mathematically, sensitivity can be calculated as follows:
\[Sensitivity = \frac{True\ Positive}{True\ Positive + False\ Negative}\]
The following are the details of the True Positive and False Negative terms used in the above equation:
- True Positive = the machine learning model predicted to buy a stock and the technical analysis actually also gave a signal to buy the stock. In other words, the true positives represent how many buy signals from technical analysis were also predicted as buy.
- False Negative = the technical analysis actually gave a signal to buy the stock, but the machine learning model predicted something else (hold or sell). In other words, the false negatives represent the buy signals from technical analysis that were not predicted as buy. Ideally, we want the model to have few false negatives, as missing them could prove financially threatening.
The higher the sensitivity, the higher the number of true positives and the lower the number of false negatives; the lower the sensitivity, the opposite holds. For the financial domain, models with high sensitivity are desired, and when viewed from the confusion matrices, the decision tree model has better sensitivity than the random forest model, as illustrated below.
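As an illustration, per-class sensitivity can be computed by hand from a confusion matrix; the small sketch below rebuilds the BBRI.JK decision tree validation matrix shown earlier and reproduces its per-class sensitivity values.
# Confusion matrix of the BBRI.JK decision tree on the validation data
# (rows = predictions, columns = reference / actual classes)
cm <- matrix(c(7,   0,  0,
               0, 271,  0,
               2,   0, 20),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("Buy", "Hold", "Sell"),
                             Reference  = c("Buy", "Hold", "Sell")))
# Sensitivity per class = true positives / total actual cases of that class
round(diag(cm) / colSums(cm), 5)
##     Buy    Hold    Sell
## 0.77778 1.00000 1.00000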
Therefore, it is better to use the Decision Tree model than the Random Forest model in this project to determine when the right time is to buy or sell a stock.
In addition, the confusion matrix metric that will be shown in the dashboard is the model Accuracy, because the dashboard is meant to be used by real investors, and to minimize confusion only a fairly common metric will be displayed.
Variable of Importance
In addition to getting recommendations on when the right time is to buy or sell a stock, the machine learning results also provide the variables of importance. A variable of importance can be described as a predictor that the machine learning model relies on most; in this case, it tells us which technical analysis indicators are most considered by the model when suggesting whether today is the right time to buy or sell a stock.
Knowing the variables of importance brings several benefits:
- Beginner investors who have never learned technical analysis can start by learning the indicators that the machine learning model considers most.
- When improving the model, the variables of importance can be used to choose which variables can be eliminated because they do not have much effect on the results, as sketched after this list.
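A rough sketch of that second idea, assuming the bbri_forest model above (the threshold of 10 is an arbitrary choice):
# Extract the importance table and keep only predictors whose scaled
# importance exceeds the chosen threshold
imp <- varImp(bbri_forest)$importance
keep <- rownames(imp)[imp$Overall > 10]
keep # candidate predictors to retain in a simplified model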
Variable of importance for each stock:
- BBRI.JK : MACD
- ISAT.JK : MACD
- SIDO.JK : MACD
- HOKI.JK : MACD
- WIKA.JK : MACD