Background
This article is inspired by a paper written in 2020 titled “Forecasting Daily Stock Market Return with Multiple Linear Regression” by Chen Shengxuan. In the paper, a multiple regression model intended to predict the movement of the stock market was formed. The model used the S&P 500 (500 of the largest stocks in the USA) as the dependent variable. Additionally, by using stepwise regression (forward selection), 4 out of 11 predictors were chosen to form the most optimal model in predicting stock movement:
- The daily return of the SPDR S&P 500 ETF (SPY), an index fund that trades stocks listed in the S&P 500.
- The 5-day difference in percentage of the SPY
- The 10-day exponential moving average of the SPY
- The change in trading volume of the S&P500
(Shengxuan, 2020)
Using said paper as base, the author wishes to create a similar report and create a similar model intended to predict the stock movement in Indonesia using different variables that may or may not only apply in Indonesia. Additionally, different countries may have different investing behaviours, therefore a model that is applied in another country may not apply in Indonesia.
The author would like to make a disclaimer that this report is not written within an academic setting, therefore it shouldn’t be treated as such since there may be inaccuracies or mistakes. The author is just someone who is interested in implementing data science to have a better understanding of how the Indonesian stock market works. Regardless, the author will minimize the possibilities of misinformation by citing of previous works done by actual academics or at the very least well-known reliable websites. The author also hopes that readers may learn something useful out of this report.
Definitions
Indonesia Stock Exchange (IDX/IHSG/^JKSE on Yahoo Finance)
The Indonesia Stock Exchange is an index that consists of all of Indonesia’s publicly listed companies. The performance of the IDX can be used as a representation of how well the Indonesian stock market is doing, and by extension can also give an insight of how the Indonesian economy is doing.
Government Bonds
A government bond can be described as a form debt issued by the government. Governments usually issue bonds when they need additional support in fulfilling their plans/needs or just simply trying to incentivise the country to spend more and increase the circulation of money in the country. Entities that have purchased the bond will receive annual payments called “coupon payments” as long as the buyer retains the bond or until the bond has reached maturity.
Gold
Gold has always been regarded as a “safe haven” of investments throughout history as it is always valuable, stable, and liquid regardless of economic condition. Generally, when investors become pessimistic toward the economic outlook, they will sell their stocks and start buying gold. Therefore, we can expect a negative relationship between the IDX and gold prices.
Trading Volume
The amount of transactions performed on an investment. In the context of this report, we will discuss the trading volume of the IDX. The relationship of trading volume towards stock prices is as follows: 1. The amount of time needed for new information to affect the stock market can be measured by determining when the information is released and when there is a drastic change in trading volume. 2. A change in trading volume may represent the effect of new information towards stock prices. 3. Trading volume can be used to measure the changes in variance of stock prices, both in times where there are no new information, and also in times where new information is released. (Karpoff, 1987)
Higher trading volume also indicates that there is higher liquidity within the market, which means that volume (or specifically liquidity) becomes a risk factor in determining stock returns. A study shows that when the market becomes illiquid, it becomes harder to make transactions. This difficulty in transactions can cause investors to lose some, if not all, of their stock returns. At the same time however, when investors are able to time when a sudden drop in liquidity occurs, then they can acquire higher returns compared to investors who did not expect market illiquidity to happen (Amihud, 2002).
Exchange Rate
Exchange rates can be defined as how much a currency is worth when converted into a different currency. The exchange rate is important as changes in it may affect the the country’s capability to pay for its needs (Reserve Bank of Australia, 2023). The exchange rate is also used as a reference for spot transactions (transactions using the current date’s price) between one country with another (Bank Indonesia, n.d.). As this report is intended to analyze the Indonesian stock market, this report will use the USD/IDR exchange rate.
Interest Rate
Interest rates refer the rate that a bank pays for the customers that save their money in the bank (called savings interests), and also the rate that a bank sets to the customer that borrow money from them (called loan interests) (Financial Services Authority, n.d.). When interest rates rise, the people will have to pay more for lending costs. Higher lending costs mean less income remaining, which leads to less money spent on investing. In other words, the rise of interest rates will generally lead to a drop in the stock market except for stocks from companies in the finance industry, where those companies can charge their customers with higher rates and generate more income (Hall et al., 2022).
In this report, the proxy used to represent the interest rate of Indonesia is the IndONIA (Indonesia Overnight Index Average), which is the benchmark used to measure the loan interests for lending betwee
LQ45
The Liquid 45 (commonly referred to as LQ45) is an index that contains information of Indonesia’s 45 companies with:
- The highest market cap in the last 12 months
- The highest transaction value in the last 12 months
- Listed in the Indonesia Stock Exchange for at least 3 months
- Perceived to have good financial conditions, growth prospects, high transaction value and frequency
Source of Data
Data is gathered from January 2018 until December 2022 in the form of daily data. The following list consists of the variables needed to make our model, along with the sources of their historical data:
- IDX (and it’s trading volume): https://finance.yahoo.com/quote/%5EJKSE/history?p=%5EJKSE
- Gold prices: https://id.investing.com/commodities/gold-historical-data
- Bond yield: https://www.investing.com/rates-bonds/indonesia-10-year-bond-yield-historical-data
- LQ45: https://id.investing.com/indices/jakarta-lq45-historical-data
- USD/IDR Exchange Rate: https://id.investing.com/currencies/usd-idr-historical-data
- Interest Rate: https://www.bi.go.id/id/statistik/indikator/indonia.aspx
Read Data
All of the data used for this report is in .csv. Prior to writing this section, the interest rate data is in the form of .xlsx. The original .xlsx contains information not needed for this report, so the author decided to copy the columns that only contain the interest rate data, put it on a new Excel file, and then save it as .csv. The author is aware that the “readxl” package exists, however it runs very slowly on the author’s computer.
#To read .csv files
idx <- read_csv("datainput/idx.csv")
gold <- read_csv("datainput/gold.csv")
bond <- read_csv("datainput/bond.csv")
lq45 <- read_csv("datainput/lq45.csv")
exc_rate <- read_csv("datainput/exc.rate.csv")
int_rate <- read_csv("datainput/int_rate.csv")Data Wrangling
This process is required to ensure that the data is ready to be formed into a reagression model which will then be analysed.
Checking for columns
Note that most of the data frames have their column names in Indonesian. Below is a translation if necessary:
- Tanggal = Date
- Terakhir = Closing Price
- Pembukaan = Open
- Tertinggi = High
- Terendah = Low
- Vol = Volume
- Perubahan% = Change from one day to previous day in percentage
glimpse(idx)## Rows: 1,253
## Columns: 7
## $ Date <date> 2018-01-02, 2018-01-03, 2018-01-04, 2018-01-05, 2018-01-0…
## $ Open <chr> "6366.083008", "6341.270996", "6257.716797", "6279.457031"…
## $ High <chr> "6445.910156", "6342.190918", "6292.320801", "6353.737793"…
## $ Low <chr> "6326.092773", "6220.694824", "6233.937012", "6278.724121"…
## $ Close <chr> "6339.237793", "6251.479004", "6292.320801", "6353.737793"…
## $ `Adj Close` <chr> "6339.237793", "6251.479004", "6292.320801", "6353.737793"…
## $ Volume <chr> "52168200", "48866400", "57864900", "74223700", "66702400"…
glimpse(gold)## Rows: 1,293
## Columns: 7
## $ Tanggal <chr> "30/12/2022", "29/12/2022", "28/12/2022", "27/12/2022", "…
## $ Terakhir <dbl> 1.8262, 1.8260, 1.8158, 1.8231, 1.8097, 1.8042, 1.7953, 1…
## $ Pembukaan <dbl> 1.8218, 1.8123, 1.8224, 1.8082, 1.8058, 1.8010, 1.8238, 1…
## $ Tertinggi <dbl> 1.83240, 1.82730, 1.82280, 1.84190, 1.81195, 1.81220, 1.8…
## $ Terendah <dbl> 1.81980, 1.81120, 1.80420, 1.80800, 1.80555, 1.79890, 1.7…
## $ Vol. <chr> "107,50K", "105,99K", "118,08K", "159,62K", NA, "105,46K"…
## $ `Perubahan%` <chr> "0,01%", "0,56%", "-0,40%", "0,74%", "0,30%", "0,50%", "-…
glimpse(bond)## Rows: 1,233
## Columns: 6
## $ Tanggal <chr> "30/12/2022", "29/12/2022", "28/12/2022", "27/12/2022", "…
## $ Terakhir <dbl> 6925, 6895, 6922, 6909, 6911, 6924, 6918, 6929, 6906, 688…
## $ Pembukaan <dbl> 6892, 6927, 6938, 6923, 6912, 6948, 6957, 6918, 6896, 687…
## $ Tertinggi <dbl> 6935, 6927, 6938, 6927, 6923, 6948, 6957, 6951, 6906, 689…
## $ Terendah <dbl> 6888, 6895, 6922, 6904, 6911, 6918, 6905, 6918, 6881, 686…
## $ `Perubahan%` <chr> "0,44%", "-0,39%", "0,19%", "-0,03%", "-0,19%", "0,09%", …
glimpse(lq45)## Rows: 1,220
## Columns: 7
## $ Tanggal <chr> "30/12/2022", "29/12/2022", "28/12/2022", "27/12/2022", "…
## $ Terakhir <dbl> 93718.00000, 93987.00000, 93369.00000, 94038.00000, 93515…
## $ Pembukaan <dbl> 93988.00000, 93369.00000, 94039.00000, 93515.00000, 93427…
## $ Tertinggi <dbl> 94483.00000, 94006.00000, 94256.00000, 94198.00000, 93891…
## $ Terendah <dbl> 93718, 92680, 93052, 93512, 93122, 93313, 93929, 93522, 9…
## $ Vol. <chr> "3,95B", "5,04B", "2,94B", "3,70B", "2,37B", "2,38B", "2,…
## $ `Perubahan%` <chr> "-0,29%", "0,66%", "-0,71%", "0,56%", "0,09%", "-0,79%", …
glimpse(exc_rate)## Rows: 1,260
## Columns: 7
## $ Tanggal <chr> "30/12/2022", "29/12/2022", "28/12/2022", "27/12/2022", "…
## $ Terakhir <dbl> 15.565, 15.655, 15.700, 15.660, 15.630, 15.590, 15.580, 1…
## $ Pembukaan <dbl> 15.6375, 15.7450, 15.6575, 15.6225, 15.5925, 15.5825, 15.…
## $ Tertinggi <dbl> 15.6550, 15.7650, 15.7225, 15.6725, 15.6475, 15.6125, 15.…
## $ Terendah <dbl> 15.5450, 15.6575, 15.6575, 15.5925, 15.5925, 15.5825, 15.…
## $ Vol. <chr> "0,14K", "0,14K", "0,09K", "0,08K", "0,11K", "0,07K", "0,…
## $ `Perubahan%` <chr> "-0,57%", "-0,29%", "0,26%", "0,19%", "0,26%", "0,06%", "…
glimpse(int_rate)## Rows: 1,496
## Columns: 1
## $ `Tanggal;IndONIA (%)` <chr> "30 Dec 2022;5.02413", "29 Dec 2022;4.97859", "2…
As seen above there seems to be something wrong with the data “int_rate” (the glimpse result with only 1 column), as it only has 1 column. However, it is not too problematic as there is a clear separator between the date and the rate, which is the “;”. To fix this, the author will use the ‘str_split_fixed’ function in order to divide the date and interest rate.
int_rate[c("Date","Rate")] <- str_split_fixed(int_rate$`Tanggal;IndONIA (%)`, ';', 2) # The subsetting is used in order to immediately name the newly formed columns from the separation
int_rate<- int_rate %>%
select(-`Tanggal;IndONIA (%)`) #the wrong column still persists even after the separation, so it is better to just delete it
glimpse(int_rate)## Rows: 1,496
## Columns: 2
## $ Date <chr> "30 Dec 2022", "29 Dec 2022", "28 Dec 2022", "27 Dec 2022", "26 D…
## $ Rate <chr> "5.02413", "4.97859", "4.99196", "4.96943", "4.98980", "4.96515",…
Adjusting data types
The second part we need to do is to change the data types. As seen above some of the data types of the columns are wrong, particularly multiple date columns, multiple change % columns, and the transaction volume in the IDX is considered as the character type. It is a good idea to change all of the data types that are inappropriate.
idx <- idx %>%
arrange(desc(Date)) %>% # We will need to have the data sorted from latest to earliest
mutate_at(.vars = vars("Close", "Volume"), # as.numeric is used to change data type into numbers
as.numeric) %>%
select(-c(Open, High, Low, 'Adj Close')) # select(-c()) is used to exclude unnecessary columns
gold <- gold %>%
mutate("Date" = dmy(Tanggal)) %>% # dmy is used to change data type into date format
mutate(Change = str_replace(gold$`Perubahan%`, ",",".")) %>% # Indonesia uses commas (",") instead of periods (".") to express decimal points, therefore it is necessary to replace it to "." so that R can perceive the data as numbers
mutate(Change = as.numeric(str_replace(Change,"%",""))) %>% # R also can't perceive numbers if "%" is attached to the numbers, so it has to be removed before converting the character to numeric
select(-c(Tanggal, 'Perubahan%',Terakhir, Pembukaan, Tertinggi, Terendah,Vol.))
bond <- bond %>%
mutate("Date" = dmy(Tanggal)) %>%
mutate(Change = str_replace(bond$`Perubahan%`, "%","")) %>%
mutate(Change = as.numeric(str_replace(Change, ",", "."))) %>%
select(-c(Tanggal, 'Perubahan%',Terakhir, Pembukaan, Tertinggi, Terendah))
lq45 <- lq45 %>%
mutate("Date" = dmy(Tanggal)) %>%
mutate(Change = str_replace(lq45$`Perubahan%`, ",",".")) %>%
mutate(Change = as.numeric(str_replace(Change,"%",""))) %>%
select(-c(Tanggal, 'Perubahan%',Terakhir, Pembukaan, Tertinggi, Terendah,Vol.))
exc_rate <- exc_rate %>%
mutate("Date" = dmy(Tanggal)) %>%
mutate(Change = str_replace(exc_rate$`Perubahan%`, ",",".")) %>%
mutate(Change = as.numeric(str_replace(Change,"%",""))) %>%
select(-c(Tanggal, 'Perubahan%',Terakhir, Pembukaan, Tertinggi, Terendah,Vol.))
int_rate <- int_rate %>%
mutate("Date" = dmy(Date)) %>%
mutate(Rate = as.numeric(Rate))Further modifications need to be done for the data frames “idx” and “int_rate” as they do not have a column that represents their daily changes in percentage.
idx <- idx %>%
mutate("idx_%" = round(c(-diff(Close)/Close[-1] * 100, NA),2)) %>% #To find the percentage change in the closing price
mutate("volume_%"= round(c(-diff(Volume)/Volume[-1] * 100, NA),2)) %>% #To find the percentage change of the volume
select(-c(Close, Volume))
int_rate<- int_rate %>%
mutate("rate_%" = round(c(-diff(Rate), NA),2)) %>% #Since the rates are already in expressed in percentage, we don't have to divide it by Rate[-1]
select(-c(Rate))Combining Data Frames
Readers may notice that the number of rows for each data frame seems to vary. This is because the variables don’t always move at the same time. For example, the IDX may have days off during the weekends (Saturday and Sunday) and national holidays (Indonesia’s independence day, Christmas, New Years, etc), but that does not apply for the USD/IDR exchange rate that will always be active regardless of weekends or holidays. The chunk below shows the number of rows each variable has.
nrow(idx)## [1] 1253
nrow(gold)## [1] 1293
nrow(bond)## [1] 1233
nrow(lq45) ## [1] 1220
nrow(exc_rate)## [1] 1260
nrow(int_rate)## [1] 1496
In this report, the inner join function is used so that data frames are adapted to match the data frame with the least amount of observations, which is the lq45 data frame.
our_rm<- purrr::reduce(list(idx,gold,bond,lq45,exc_rate,int_rate),dplyr::inner_join, by = "Date") The author will also use this section to rename the columns that have been changed during the inner join process.
our_rm<- our_rm %>%
rename("gold_%" = "Change.x") %>%
rename("bond_%" = "Change.y") %>%
rename("lq45_%" = "Change.x.x") %>%
rename("exc_%" = "Change.y.y")
head(our_rm)## # A tibble: 6 × 8
## Date `idx_%` `volume_%` `gold_%` `bond_%` `lq45_%` `exc_%` `rate_%`
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2022-12-30 -0.14 -8.31 0.01 0.44 -0.29 -0.57 0.05
## 2 2022-12-29 0.14 17.0 0.56 -0.39 0.66 -0.29 -0.01
## 3 2022-12-28 -1.05 -0.9 -0.4 0.19 -0.71 0.26 0.02
## 4 2022-12-27 1.28 3.51 0.74 -0.03 0.56 0.19 -0.02
## 5 2022-12-26 0.52 -2.4 0.3 -0.19 0.09 0.26 0.02
## 6 2022-12-23 -0.35 -6.65 0.5 0.09 -0.79 0.06 0.06
Check for NA
Below is the amount of NAs available in “our_rm”, along with the amount of rows in the model
nrow(our_rm)## [1] 1219
colSums(is.na(our_rm))## Date idx_% volume_% gold_% bond_% lq45_% exc_% rate_%
## 0 22 22 0 0 0 0 0
All of the NAs seemed to have come from the result of finding the daily percentage change in the IDX along with it’s trading volume. Let’s see the specific entries that have NAs
our_rm[!complete.cases(our_rm),]## # A tibble: 22 × 8
## Date `idx_%` `volume_%` `gold_%` `bond_%` `lq45_%` `exc_%` `rate_%`
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2019-06-10 NA NA -1.25 -3.5 1.95 -0.18 5.88
## 2 2019-05-31 NA NA 1.45 -0.14 2.45 -0.87 6.13
## 3 2019-05-02 NA NA -0.95 0.67 -1.21 0 5.82
## 4 2019-04-22 NA NA 0.13 0.46 -1.71 0.23 5.91
## 5 2019-04-18 NA NA -0.06 -0.62 0.6 -0.3 5.98
## 6 2019-04-04 NA NA -0.08 -0.21 0.61 -0.28 5.88
## 7 2019-03-08 NA NA 1.03 0.93 -1.42 1.2 5.91
## 8 2019-02-06 NA NA -0.36 -0.89 1.13 -0.24 5.81
## 9 2019-01-02 NA NA 0.22 0.18 0.16 0.49 -0.05
## 10 2018-12-26 NA NA 0.09 0.36 -0.92 0.14 5.82
## # … with 12 more rows
If we were to check our calendars, along with the schedule of the IDX, you may notice that some of the dates are after Sunday while the others are after public holidays. This means that the NAs are caused by the days off mentioned earlier, therefore it is impossible to find the daily change of these specific dates in the first place. Therefore the author believes that it may be best to delete those observations.
our_rm <- our_rm[complete.cases(our_rm),]
colSums(is.na(our_rm)) # to prove that our data no longer has NA## Date idx_% volume_% gold_% bond_% lq45_% exc_% rate_%
## 0 0 0 0 0 0 0 0
It might also be a good idea to find out whether our data frame has inf or -inf values. To do this the author will first use the summary() function.
summary(our_rm)## Date idx_% volume_% gold_%
## Min. :2018-01-03 Min. :-6.58000 Min. :-100.00 Min. :-4.99000
## 1st Qu.:2019-04-29 1st Qu.:-0.50000 1st Qu.: -10.05 1st Qu.:-0.41000
## Median :2020-07-21 Median : 0.04000 Median : 0.41 Median : 0.04000
## Mean :2020-07-17 Mean : 0.01038 Mean : Inf Mean : 0.02782
## 3rd Qu.:2021-10-14 3rd Qu.: 0.57000 3rd Qu.: 11.21 3rd Qu.: 0.49000
## Max. :2022-12-30 Max. :10.19000 Max. : Inf Max. : 5.95000
## bond_% lq45_% exc_% rate_%
## Min. :-6.2700 Min. :-8.260000 Min. :-2.75000 Min. :-0.7100
## 1st Qu.:-0.3500 1st Qu.:-0.700000 1st Qu.:-0.14000 1st Qu.:-0.0100
## Median : 0.0000 Median : 0.000000 Median : 0.00000 Median : 0.0000
## Mean : 0.0065 Mean :-0.004812 Mean : 0.01358 Mean : 0.7214
## 3rd Qu.: 0.3800 3rd Qu.: 0.710000 3rd Qu.: 0.17000 3rd Qu.: 0.0400
## Max. : 4.3700 Max. :14.920000 Max. : 4.57000 Max. : 6.0700
Inf values exist in the volume_% column. The author believes that this may be caused by dividing by zero. Let’s see how many rows contain these inf values.
countInf(our_rm$`volume_%`)## [1] 7
Only 7 observations contain inf values, therefore the author believes that removing these observations should not be harmful to our model, considering we still have a lot of observations left.
our_rm <- our_rm[!is.infinite(our_rm$`volume_%`),]
countInf(our_rm$`volume_%`)## [1] 0
Now let’s check our remaining data.
nrow(our_rm)## [1] 1190
summary(our_rm)## Date idx_% volume_% gold_%
## Min. :2018-01-03 Min. :-6.58000 Min. :-100.000 Min. :-4.99000
## 1st Qu.:2019-04-25 1st Qu.:-0.49750 1st Qu.: -10.068 1st Qu.:-0.40000
## Median :2020-07-21 Median : 0.05000 Median : 0.350 Median : 0.04000
## Mean :2020-07-17 Mean : 0.01273 Mean : 1.024 Mean : 0.03487
## 3rd Qu.:2021-10-13 3rd Qu.: 0.57000 3rd Qu.: 10.935 3rd Qu.: 0.49000
## Max. :2022-12-30 Max. :10.19000 Max. : 194.590 Max. : 5.95000
## bond_% lq45_% exc_% rate_%
## Min. :-6.270000 Min. :-8.260000 Min. :-2.75000 Min. :-0.7100
## 1st Qu.:-0.350000 1st Qu.:-0.697500 1st Qu.:-0.14000 1st Qu.:-0.0100
## Median : 0.000000 Median : 0.000000 Median : 0.00000 Median : 0.0000
## Mean : 0.009471 Mean :-0.002235 Mean : 0.01433 Mean : 0.7256
## 3rd Qu.: 0.380000 3rd Qu.: 0.710000 3rd Qu.: 0.17000 3rd Qu.: 0.0400
## Max. : 4.370000 Max. :14.920000 Max. : 4.57000 Max. : 6.0700
Our data is now free of NAs, NaNs, or Inf values, which means it is now ready to be formed into a regression model.
Model Specification
Correlation
Let’s first see the correlations between the variables using a correlation matrix. The result of the correlation matrix can give us an insight of what variables should we expect to be included in determining stock returns.
ggcorr(our_rm, label = T)## Warning in ggcorr(our_rm, label = T): data in column(s) 'Date' are not numeric
## and were ignored
Based on the correlation matrix created from our data frame:
- Volume has a weak positive correlation with IDX
- Gold has a very weak positive correlation with IDX
- Bond has a weak positive correlation with IDX
- LQ45 has a very strong positive correlation with IDX
- Exchange rate has a moderate positive correlation with IDX
- Interest rate has no correlation with IDX
Regression Model
Next, let’s make the regression model. We’ll call it “stock_model”.
our_rm<- our_rm %>%
select(-Date)
stock_model <- lm(formula = `idx_%` ~ .,
data = our_rm)
summary(stock_model)##
## Call:
## lm(formula = `idx_%` ~ ., data = our_rm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.11326 -0.13566 -0.00099 0.12655 1.37726
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0199726 0.0078171 2.555 0.01074 *
## `volume_%` 0.0004129 0.0003636 1.136 0.25638
## `gold_%` 0.0207618 0.0075092 2.765 0.00578 **
## `bond_%` -0.0057530 0.0100099 -0.575 0.56558
## `lq45_%` 0.7284407 0.0056977 127.848 < 2e-16 ***
## `exc_%` -0.0617710 0.0207938 -2.971 0.00303 **
## `rate_%` -0.0080209 0.0043405 -1.848 0.06486 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2462 on 1183 degrees of freedom
## Multiple R-squared: 0.9471, Adjusted R-squared: 0.9469
## F-statistic: 3532 on 6 and 1183 DF, p-value: < 2.2e-16
Let’s take a look at our regression model. There are four significant variables in our model:
- “gold_%” is significant within the 0.01 critical value, where an increase of 1 in “gold_%” will increase the “idx_%” by 0.021 (assuming no changes in the other variables)
- “lq45_%” is significant within the 0.001 critical value, where an increase of 1 in “lq45_%” will increase the “idx_%” by 0.728 (assuming no changes in the other variables)
- “exc_%” is significant within the 0.01 critical value, where an increase of 1 in “exc_%” will decrease the “idx_%” by 0.062 (assuming no changes in the other variables)
- “rate_%” is significant within the 0.1 critical value, where an increase of 1 in “rate_%” will decrease the “idx_%” by 0.008 (assuming no changes in the other variables)
We also get a value of 94.69% as our Adjusted R-Squared. This means that the independent variables in our model are very good at explaining the movement of our dependent variable, the IDX.
Model Evaluation
As a brief explanation, errors, in the context of statistics, is the difference between the values created by a formed model and the value actually found in the raw data. It is important for a model to have low errors, as the lower the error means the better the model is at explaining and predicting values from a different data set.
RMSE stands for Rooted Mean Squared Error, which measures the standard deviation in our data by calculating the sum of the squared difference of the data predicted and the actual observations. The sum will then be divided by the number of observations, and then rooted. Below is the RMSE of our model.
RMSE(y_pred = stock_model$fitted.values, y_true = our_rm$`idx_%`)## [1] 0.2454476
range(our_rm$`idx_%`)## [1] -6.58 10.19
The calculation above shows that our model’s prediction has an average error of 0.245 from the actual values. The idx_% is expressed in percentages, and has a range from -6.58% to 10.19%. Therefore, the author believes that the RMSE created by the model is low enough to be considered acceptable.
Model Selection
In this step, we will perform a stepwise regression, which is meant to find the final, optimal model that can explain our dependent variable based on our independent variables. There are three types of stepwise regression: Backward, Forward, and Both. We will perform them all in this section. Before that however, the author would like to give a short definition of AIC (Akaika Information Criterion), which is a measure of the quality of our regression model based on how much information is lost. The lower the value of AIC, the better our model.
Backwards Elimination
Backwards selection refers to a model selection where the model includes all the variables, and then each variable is removed from the model until the model with the lowest AIC value is formed. The chunk below shows the model formed using backwards elimination.
sm_backward<- step(object = stock_model, direction = "backward", trace = F)
summary(sm_backward)##
## Call:
## lm(formula = `idx_%` ~ `gold_%` + `lq45_%` + `exc_%` + `rate_%`,
## data = our_rm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.11385 -0.13600 -0.00261 0.12630 1.37989
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.020401 0.007801 2.615 0.009033 **
## `gold_%` 0.020740 0.007508 2.763 0.005824 **
## `lq45_%` 0.729854 0.005565 131.144 < 2e-16 ***
## `exc_%` -0.067424 0.019100 -3.530 0.000431 ***
## `rate_%` -0.007987 0.004332 -1.844 0.065459 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2461 on 1185 degrees of freedom
## Multiple R-squared: 0.9471, Adjusted R-squared: 0.9469
## F-statistic: 5300 on 4 and 1185 DF, p-value: < 2.2e-16
Using backward elimination, the optimal model is determined to only include the variables that are significant in our model, which are “gold_%”, “lq45_%”, “exc_%”, and “rate_%”. The ones that are not significant, “volume_%” and “bond_%” are removed from the model
Forward Selection
The next method of selection is called the forward selection, where the model starts without any of it’s predictors and is slowly rebuilt by including independent variables until the lowest AIC is achieved. The chunk below shows the model formed using forward selection. Since the model has to start with no predictors, the formation of the model with no predictors will also included below
stock_model_none <- lm(formula = `idx_%` ~ 1,
data = our_rm)
sm_forward <- step(object = stock_model_none, direction = "forward", scope = list(upper = stock_model), trace = F)
summary(sm_forward)##
## Call:
## lm(formula = `idx_%` ~ `lq45_%` + `exc_%` + `gold_%` + `rate_%`,
## data = our_rm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.11385 -0.13600 -0.00261 0.12630 1.37989
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.020401 0.007801 2.615 0.009033 **
## `lq45_%` 0.729854 0.005565 131.144 < 2e-16 ***
## `exc_%` -0.067424 0.019100 -3.530 0.000431 ***
## `gold_%` 0.020740 0.007508 2.763 0.005824 **
## `rate_%` -0.007987 0.004332 -1.844 0.065459 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2461 on 1185 degrees of freedom
## Multiple R-squared: 0.9471, Adjusted R-squared: 0.9469
## F-statistic: 5300 on 4 and 1185 DF, p-value: < 2.2e-16
This model also included only the significant variables, while dropping the ones that aren’t.
Both
Finally, a method that combines both backward elimination and forward selection. The model can start with either all predictors included or no predictors at all, and then rebuilt until the model has achieved the lowest AIC.
sm_both <- step(object = stock_model_none, direction = "both", scope = list(upper = stock_model), trace = F)
summary(sm_both)##
## Call:
## lm(formula = `idx_%` ~ `lq45_%` + `exc_%` + `gold_%` + `rate_%`,
## data = our_rm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.11385 -0.13600 -0.00261 0.12630 1.37989
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.020401 0.007801 2.615 0.009033 **
## `lq45_%` 0.729854 0.005565 131.144 < 2e-16 ***
## `exc_%` -0.067424 0.019100 -3.530 0.000431 ***
## `gold_%` 0.020740 0.007508 2.763 0.005824 **
## `rate_%` -0.007987 0.004332 -1.844 0.065459 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2461 on 1185 degrees of freedom
## Multiple R-squared: 0.9471, Adjusted R-squared: 0.9469
## F-statistic: 5300 on 4 and 1185 DF, p-value: < 2.2e-16
This method also shows that the most optimal choice is to only include the significant variables. Therefore, all of the models formed from the process of stepwise regression are equally optimal. We will be using is the sm_backward model
Linearity
Linearity is important for a model because it signifies that the prediction made by our model is accurate and the errors are minimized. The chunk below will be used to see whether our model is linear or not.
plot(sm_backward, which = 1)This shows that our data is non-linear. Non-linearity does not necessarily mean that our model is problematic, but it is said that our adjusted R-squared value becomes invalid as a measure of goodness of fit because in a non-linear model. This is because the total variance of our model (or the R-squared value) is no longer within the range of 0 and 100%. Because of this, we will have to rely on the standard deviation of the model instead to evaluate how good our model is, which is represented by the Residual Standard Error. The general rule of RSE is that the lower it is, the better. A value of 0.2461 for our residual standard error means that the average error made by our prediction is 0.2461, as seen below.
summary(sm_backward)##
## Call:
## lm(formula = `idx_%` ~ `gold_%` + `lq45_%` + `exc_%` + `rate_%`,
## data = our_rm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.11385 -0.13600 -0.00261 0.12630 1.37989
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.020401 0.007801 2.615 0.009033 **
## `gold_%` 0.020740 0.007508 2.763 0.005824 **
## `lq45_%` 0.729854 0.005565 131.144 < 2e-16 ***
## `exc_%` -0.067424 0.019100 -3.530 0.000431 ***
## `rate_%` -0.007987 0.004332 -1.844 0.065459 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2461 on 1185 degrees of freedom
## Multiple R-squared: 0.9471, Adjusted R-squared: 0.9469
## F-statistic: 5300 on 4 and 1185 DF, p-value: < 2.2e-16
Normality
Next is to show whether the data used for our model is distributed normally or not. Normal distribution means that most of the value of our data is gathered near the mean. So when our data is “normal”, then we can expect that our model’s predictions will be precise and produce values that are close to the mean.
Before we proceed, the author would like to make a disclaimer for one important thing regarding normality: stock prices are NOT normally distributed. It is possible to get a negative value from normal distribution, however stock prices can’t go below 0. In indonesia, the lowest price a stock can have is Rp.50,- (Fifty Rupiah). Because of this, the author expects that even though the data used is the daily change of the IDX, the output should still show that our model does not follow normal distribution. Additionally, analyzing data that is sorted (in our case, historical data from latest to earliest) may also affect normality.
Let’s prove it by performing the shapiro test. To determine whether our data is normal or not, we will use a critical value of 0.05. Below are the hypotheses:
- H0: Data is distributed normally
- H1: Data is not distributed normally
shapiro.test(x = sm_backward$residuals)##
## Shapiro-Wilk normality test
##
## data: sm_backward$residuals
## W = 0.96484, p-value = 2.267e-16
Our p-value is smaller than the critical value of 0.05, which means our model can’t be considered to be distributed normally, which is within expectations.
Heteroscedasticity
Heteroscedasticity refers to the standard error within a regression model being non-consistent. These patterns causes the estimation of the standard error for our model become unreliable, and the p-values of our predictors may seem smaller than they should be.
Stock prices are generally considered to be heteroscedastic, which makes it impossible to predict the long-term movement of a stock. This is caused by changes in the economic condition, whether in a minor scale (conditions of a specific company industry, like the rising of interest rates towards finance companies) or a major scale (events that affect the economic activities of an entire country, or even the world. The COVID-19 pandemic would be a perfect example of this case).
Because of this, the author also expects that our model does not pass the heteroscedasticity test. Let’s test it using the function bptest. “bp” refers to Breusch-Pagan, the statistical measure used to find out whether heteroscedasticity is present or not. Critical value is 0.05, as per usual.
H0: Model is not heteroscedastic H1: Model is heteroscedastic
bptest(sm_backward)##
## studentized Breusch-Pagan test
##
## data: sm_backward
## BP = 6.8079, df = 4, p-value = 0.1464
Surprisingly, our model actually passes the heteroscedasticity test. The author would like to use this chance to say “mantap jiwa” since it signifies that the standard error of our model is constant from period to period.
Multicollinearity
Multicollinearity refers to the event where our predictors have high correlations with other predictors, which causes interpretation of our model to become unreliable since the predictors may be influenced by the presence of other predictors. Our model becomes less accurate.
We can use the vif() function in order to see whether multicollinearity exists or not. The VIF refers to the term Variance Inflation Factor, which measures the variance of each independent variables within a regression model. The rule is that if the values are less than 10, then multicollinearity is not present in our model.
vif(sm_backward)## `gold_%` `lq45_%` `exc_%` `rate_%`
## 1.006489 1.201691 1.208623 1.002009
Each predictor has VIF values of less than 10, which means our model passes the multicollinearity test.
Interpretation
Unfortunately, since our model does not pass the normality assumption test it cannot be used for predicting the movement of the IDX. This is because the model may give out predictions that are inaccurate. However, it should be sufficient enough in explaining the factors that are significant toward the movement of the Indonesian Stock Market.
We have finally reached the final part of the report, which is interpreting the model itself. There are four variables that are significant in determining the movement of stock prices in Indonesia. Within the critical value of 0.05, we have the performance of gold prices in Indonesia, the performance of stocks listed in the LQ45 index, and the USD/IDR exchange rate. We also have the changes in the Indonesian Interest Rate, represented by the IndONIA, within the critical value of 0.1.
Regarding the critical values, lower critical values will help in improving the legitimacy of an independent variable’s significance towards the dependent variable, as it has lower chances of actually affecting the movement of the dependent variable. Because of this, while acknowledging the possibility of the interest rates may have an effect on the movement of the stock market, the author would still like to drop this variable in particular from this discussion as it’s p-value is higher than 0.05.
We now have three variables left: the movement of gold prices, LQ45 index, and the USD/IDR exchange rate.
Gold
In the definition section, the author mentioned that the stock market and gold prices should have an inverse relationship, where if the prices of stocks rise, the price of gold will fall (and vice versa). However, as seen in the correlation matrix and in our regression model, it is shown that those two variable have a positive relationship, even though the correlation is very weak. The author believes that the reason why this happened is because of the time period that the author used for this report, which is the year 2018 until 2022. Within those 5 years, the COVID-19 pandemic happened from early 2020 and the world has only recovered as of recently. While recovering from the pandemic, there seems to be a “transition” phase where more risk-tolerant investors start investing in stocks again while at the same time risk-adverse investors are still investing in gold. As a result, both of these investments that are supposed to move inversely to each other actually moved in the same direction, contradicting the general rule of thumb.
LQ45
This should come off as no surprise, seeing as the LQ45 is an index that represents the top 45 publicly listed companies and can be seen as a “sample platter” of how well the stock market is doing. So to a certain extent, we can use these 45 companies as a representation of how well the stock market of Indonesia is doing.
Exchange Rate
This is another variable that can be an indicator of how well the Indonesian economic is doing, represented by the strength of our currency. If the USD/IDR rate goes up, that means that it becomes cheaper to buy Indonesian Rupiah using US Dollars. This means that Indonesia will have less buying power while doing international transactions, and could also lead to the country not being able to repay international debts. This will cause a domino effect of bad economic outlook for the country, since it requires the country to print out more money, which will lead to inflation and making the Rupiah less valuable, which leads to prices for goods having to go up to compensate, resulting in less money used for investments.
Conclusion
In this report, we created a regression model that explains the movement of the Indonesian stock market (IDX). To do this we’ve gathered several variables intended to be used as predictors. Using stepwise regression, it is found that the optimal model to be used includes four variables: gold prices, LQ45 index, USD/IDR exchange rate, and the Indonesian interest rate. While it cannot be used to predict future IDX movements as the model is not BLUE, it is sufficient enough in explaining how our stock market works. Out of those four variables, gold prices, LQ45 index, and USD/IDR exchange rate is significant toward the movement of the IDX.
The author would like to suggest two things for future research. Firstly, it may be a wise choice to extend the time period observed. This report used a time frame of five years, from 2018 to 2022. This time frame contains three years where a pandemic led to economic instability, which results in our model contradicting the general rule of gold having an inverse relationship with stock prices. Extending the time frame into 10 years or more may give out a model that can avoid this situation
Secondly, if possible the author would suggest to find a way to improve the normality as well as the linearity of our model so that it actually becomes robust and sufficient to predict future values. The current model we now have may not be able to give out precise and accurate predictions due to those two factors.
The author hopes that this report can be of some use, and would like to apologize if there is anything wrong said in this report. Thank you for your time
References
Amihud, Y. (2002). Illiquidity and stock returns: Cross-section and time-series effects. Journal of Financial Markets, 5(1), 31–56. https://doi.org/10.1016/s1386-4181(01)00024-6
Bank Indonesia. (n.d.). Jisdor, Kurs Acuan non-USD/IDR, Dan Kurs Transaksi // . JISDOR, Kurs Acuan Non-USD/IDR, dan Kurs Transaksi. Retrieved March 10, 2023, from https://www.bi.go.id/id/fungsi-utama/moneter/informasi-kurs/default.aspx#floating-2
Chen, Shengxuan, “Forecasting Daily Stock Market Return with Multiple Linear Regression” (2020). Mathematics Senior Capstone Papers. 19. https://digitalcommons.latech.edu/mathematics-senior-capstone-papers/19
Financial Services Authority. (n.d.). Mengenal Jenis-Jenis suku bunga bank. MENGENAL JENIS-JENIS SUKU BUNGA BANK. Retrieved March 10, 2023, from https://sikapiuangmu.ojk.go.id/FrontEnd/CMS/Article/20649
Hall, M. (2022, July 8). How do interest rates affect the stock market? Investopedia. Retrieved March 10, 2023, from https://www.investopedia.com/investing/how-interest-rates-affect-stock-market/#toc-what-happens-when-interest-rates-rise
Karpoff, J. M. (1987). The relation between price changes and Trading Volume: A survey. The Journal of Financial and Quantitative Analysis, 22(1), 109–126. https://doi.org/10.2307/2330874
Reserve Bank of Australia. (2023, March 9). Exchange rates and their measurement: Explainer: Education. Reserve Bank of Australia. Retrieved March 10, 2023, from https://www.rba.gov.au/education/resources/explainers/exchange-rates-and-their-measurement.html#:~:text=An%20exchange%20rate%20is%20a,is%20an%20important%20economic%20variable.