Traditional statistical techniques have long been used to analyze trends in stock prices and related variables over time in order to extract temporal correlations. Stock market forecasting is a challenging task because stocks are volatile. Various machine learning methods have now been applied to stock market forecasting, with the aim of making predictions easier to produce and more accurate. By comparing various machine learning models, we select the more accurate autoregressive integrated moving average (ARIMA) model for predicting stock prices, and we evaluate its effectiveness using Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE). The results show the usefulness of our proposed machine learning method in predicting stock prices.
| STUDENT ID | STUDENT NAME | WORK |
|---|---|---|
| S2128100 | ZHANG YU | Coding and points 7, 8 |
| S2164046 | HU LIANGLIANG | Point 9 |
| 22060214 | ZHU MEI | Pre and points 3, 4, 5, 6 |
| S2121005 | XU HUANDI | Points 1, 7.1, 7.2 |
| S2177633 | ZHOU YAO | Point 2 |
Stock trading is a very important investment activity in today’s financial markets. The traditional method of stock forecasting is to analyze the trend of stock prices and related variables over time to extract temporal correlations. However, the stock market is very volatile, and the rise or fall of stock prices is not simply influenced by the current environment and past share prices, but can also be influenced by a variety of complex factors such as company earnings reports, national policies, influential shareholders, expert speculation on current events, and more (Zou et al., 2022). Accurately predicting stock trends is therefore a very challenging task.
Although the stock market is a non-linear, non-stationary system (Huang & Liu, 2019), the development of machine learning techniques has led more and more people to explore how machine learning algorithms can be used to predict stock trends. The main objective of this study is to provide a detailed analysis using an autoregressive integrated moving average (ARIMA) model to predict future trends and stock returns. This study contains a literature survey of related work in the field, describes the methodology and the data analysis, and finally sets up an evaluation framework to assess the validity of the model using MSE, RMSE, MAE, and MAPE; the results show the usefulness of this model in predicting stock prices.
Traditionally, many investors and traders rely on their own experience, knowledge, and skills to predict changes in stock prices. Although personal experience has some value in stock price prediction, its limitations cannot be ignored. Personal experience forecasts typically depend on limited sources of information, such as past trading experiences, personal observations, and media reports. These sources of information may not comprehensively and accurately reflect the dynamics and changes in the market, thereby limiting the accuracy and reliability of predictions.
To enhance the accuracy and reliability of predictions, many investors and traders have started leveraging technologies such as machine learning to assist in forecasting. By learning from extensive historical data and pattern recognition, processing complex data and patterns, and uncovering hidden patterns, machine learning provides more systematic and objective prediction results, thereby compensating for some of the shortcomings of personal experience forecasts.
In our project, taking Tesla stock as an example, historical stock price data is used to train a predictive model that forecasts the future trend of the Tesla stock price, helping investors make more informed decisions.
1. Classification question: Can specific stock price trends be classified as “rising,” “falling,” or “stable” based on the predictions of the ARIMA model?
Regression question: How accurately can we predict the future closing price of a specific stock?
Which machine learning models perform better in predicting stock prices?
2. To categorize specific stock price trends as “rising,” “falling,” or “stable” by comparing the forecasted values with the previous day’s value.
To evaluate the performance of the ARIMA model using MSE, RMSE, MAE, and MAPE.
To determine more accurate stock price predictive models by comparing various machine learning models.
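The first objective, labeling a trend by comparing a forecast with the previous day’s value, can be sketched as follows. This is a minimal illustration, not the study’s implementation: the helper name `classify_trend` and the threshold `tol` (the relative change still treated as “stable”) are assumptions introduced here, and the input values are made up.

```r
# Hypothetical helper: label each forecast relative to the previous day's value.
# `tol` is an assumed threshold for "stable"; it does not come from the study.
classify_trend <- function(forecasted, previous, tol = 0.005) {
  stopifnot(length(forecasted) == length(previous))
  change <- (forecasted - previous) / previous  # relative change
  ifelse(change > tol, "rising",
         ifelse(change < -tol, "falling", "stable"))
}

# Toy values for illustration only (not real TSLA prices)
previous   <- c(100, 100, 100)
forecasted <- c(102, 99, 100.2)
labels <- classify_trend(forecasted, previous)
print(labels)
```

A relative threshold is used so that the same rule applies regardless of the price level; the appropriate value of `tol` would have to be chosen for the application.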
The stock market is filled with uncertainty and dynamic changes, and factors such as policies and market conditions can impact stock prices. Stock price prediction is of great importance for investors and traders in decision-making. Accurate stock price predictions can help investors make informed investment decisions, choose the right timing for buying and selling, reduce investment risks, and enhance investment returns. Our project aims to go beyond traditional stock price prediction methods by utilizing machine learning algorithms to analyze new data in a timely manner, capture market changes and trends, and provide more accurate prediction results.
Aravind et al. studied the use of the AutoRegressive Integrated Moving Average (ARIMA) model for stock price prediction. The volatility of the stock market makes stock price prediction a difficult task, as stock prices are influenced by various important parameters, including economic factors, interest rates, and inflation. To forecast stock prices, they obtained continuous stock data from Yahoo Finance and proposed an ARIMA model. This model captures trends and seasonality in time series data and provides a relatively accurate method for short-term forecasting. Their experimental results demonstrate that the ARIMA model exhibits reasonable accuracy in the short term and can be used for stock prediction (M et al., 2018).
Stock trading plays a crucial role in the current financial market. Investors aspire to accurately predict stock trends using various technical tools to achieve better investment outcomes. Traditional experiential and intuitive methods have limitations in dealing with the complexity and volatility of the stock market. Stock prices are influenced by multiple factors such as the economy, politics, industry developments, market sentiment, and global events, which interact to make predicting stock trends complex and challenging. Machine learning technology has emerged as a promising approach to address this challenge. By analyzing historical performance data, market indicators, company financial reports, and other relevant information, predictive models can be established to capture hidden patterns and trends within vast amounts of data, providing price predictions and comprehensive market insights to investors (Moghar & Hamiche, 2020). This helps investors make informed investment decisions, optimize asset allocation, mitigate risks, and increase returns.
Identify sources of stock price data, such as financial data providers, stock exchanges, or online financial platforms. Get historical stock price data, including date, open, close, high, low, volume, and more. Ensure data integrity and accuracy and save it in appropriate data format (e.g. CSV, Excel) for subsequent processing and analysis. In the literature, variables such as technical indicators, financial variables, and macro-economic variables have been considered the most influential ones affecting stock price movements. In this study, we classified all the variables in the selected literature that were used as input data into four main categories and several subcategories, as illustrated in Fig. 1.
This project designed the research methodology flow as follows: collect data through the open interface of the exchange and perform a simple cleaning of the data; split the data into a training set and a test set; finally, build a model using ARIMA and test its predicted values against the test set.
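The split step in this flow has to respect time order: earlier observations train the model and later ones test it. A minimal sketch, assuming a plain numeric series; the helper `split_series`, the 80% share, and the synthetic data are illustrative assumptions rather than the split used later in this report.

```r
# Chronological split: no shuffling, because the order of observations matters.
# `split_series` and the 80% share are illustrative assumptions.
split_series <- function(x, train_share = 0.8) {
  n_train <- floor(length(x) * train_share)
  list(train = x[seq_len(n_train)],
       test  = x[seq(n_train + 1, length(x))])
}

set.seed(1)
toy_prices <- cumsum(rnorm(100)) + 50  # synthetic series, not market data
parts <- split_series(toy_prices)
length(parts$train)
length(parts$test)
```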
Stock trading is a game process under incomplete information, and a single-objective supervised learning model has difficulty dealing with such sequential decision problems because the stock market changes rapidly, has many interfering factors, and offers insufficient periodic data (Li et al., 2020). Many factors can affect stock prices, such as wars, geological disasters, and national credit crises, but these effects are often short-lived. For example, He et al. (2020) provide empirical evidence that COVID-19 has spill-over effects on the stock markets of other countries, and their results also provide a basis for assessing trends in international stock markets once the situation eases globally; however, they find no evidence that COVID-19 adversely affects these countries’ stock markets more than it does the global average. Many historical events, including virus outbreaks, have had a regional impact on the stock market. Del and Paltrinieri examined 78 mutual equity funds geographically based in African countries, with monthly flows and results observed over the 2006–2015 period, and proposed that Ebola and the Arab Spring were more serious than SARS, while Nippani and Washer examined the effect of SARS on Canada, China, the Special Administrative Region of Hong Kong, Indonesia, Singapore, the Philippines, Vietnam, and Thailand and concluded that SARS only affected the stock markets of China and Vietnam (H. Liu et al., 2020). Therefore, it is not practical to consider every factor that may affect the stock market: some factors have only a short-term impact, and their long-run influence may be relatively small.
The intensity of these volatile periods is likely to cause investors to leave the stock market in favor of other asset classes, such as gold, even though the history of the stock market’s evolution shows that episodes of volatility are frequently followed by calm periods (Tounsi & Maatoug, 2021). In this case, we only need to select the relevant features for analysis. Here, we discard the fragmented attributes that affect the stock market and select only the key attributes, such as Timestamp, Open, High, Low, Close, Volume, and Weighted Price.
To obtain data for Tesla stock, we use the API provided by Yahoo Finance. This allows us to access up-to-date data and retrieve specific datasets by passing the desired parameters (ticker symbol, source, and date range) to the API. Base R alone offers no concise way to fetch stock data for a specified time period from financial sources, so we rely on the getSymbols function from the “quantmod” package.
We selected Tesla stock data from December 2017 to the present (the code requests the 2017 days preceding the retrieval date, which gives 2017-12-08 to 2023-06-16 in the run shown here). This time span provides us with a relatively long-term dataset, enabling us to observe the long-term trends, seasonal variations, and other important market dynamics of Tesla stock prices. It helps us assess the performance of Tesla stock more accurately when making investment decisions and predicting future trends.
library(quantmod)
library(tseries)
# Sys.Date() - 2017 goes back 2017 days (about five and a half years) from today
indf_data <- getSymbols(Symbols = "TSLA", src = "yahoo", from = Sys.Date() - 2017,
                        to = Sys.Date(), auto.assign = FALSE)
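The `from` argument relies on date arithmetic: subtracting a number from a `Date` moves back by that many days, not to a calendar year. The sketch below shows the effect, assuming a retrieval date of 2023-06-17 (an assumption taken from the `updated` attribute in the output further down); the variable names are introduced here for illustration.

```r
# Subtracting a number from a Date moves back by that many days.
# The retrieval date is an assumption based on the `updated` attribute in the output.
retrieval_date <- as.Date("2023-06-17")
window_start <- retrieval_date - 2017   # 2017 days earlier
print(window_start)                     # 2017-12-08

# Length of the window in years (approximate)
as.numeric(retrieval_date - window_start) / 365.25
```

Passing fixed dates such as `from = "2017-12-08"` instead of an offset from `Sys.Date()` would make the download window reproducible across runs.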
We only need to select relevant features for analysis. Here, I intend to discard irrelevant attributes that do not significantly impact the stock market and focus on key attributes that affect the stock market, such as Timestamp, Open, High, Low, Close, Volume, and Weighted Price. By doing so, we can concentrate on analyzing factors closely related to stock market behavior and price fluctuations. This enables us to gain deeper insights into market behavior, volatility, and potential driving factors, providing valuable support for predicting future stock price trends.
Due to the peculiarities of the stock market, there may be instances where the data for a particular day is missing or labeled as “NA” due to reasons like trading halts. However, such occurrences are infrequent. Therefore, during data cleaning, we directly remove the data for those days. The presence of missing values can introduce bias in analysis results or lead to inaccurate predictions. Deleting the data for those days helps avoid introducing uncertainty in the analysis.
indf_data <- na.omit(indf_data)
The data contains seven fields: the date index plus six columns, namely the opening, high, low, closing, and adjusted closing prices and the trading volume.
View(indf_data)
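We chart the stock data together with technical indicators: 100-day and 20-day simple moving averages (SMA), the relative strength index (RSI), Bollinger Bands, and the MACD indicator. This lets us explore the average change curve and trend of the collected Tesla stock data one by one.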
chart_Series(indf_data, col = "black")
add_SMA(n = 100, on = 1, col = "red")
add_SMA(n = 20, on = 1, col = "black")
add_RSI(n = 14, maType = "SMA")
add_BBands(n = 20, maType = "SMA", sd = 1, on = -1)
add_MACD(fast = 12, slow = 25, signal = 9, maType = "SMA", histogram = TRUE)
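The moving-average overlays in the chart are trailing averages of the most recent n closing prices. A minimal base R sketch of that computation follows; `simple_moving_average` is an illustrative helper introduced here, not a quantmod function, and the input values are synthetic.

```r
# Simple moving average over a trailing window of n observations (base R only).
# `simple_moving_average` is an illustrative helper, not part of quantmod.
simple_moving_average <- function(x, n) {
  # stats:: prefix avoids clashes with other packages that define filter()
  as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))
}

toy_close <- c(10, 11, 12, 13, 14, 15)  # synthetic values
sma3 <- simple_moving_average(toy_close, n = 3)
print(sma3)  # the first n - 1 entries are NA because the window is incomplete
```

A longer window (such as 100 days) reacts more slowly to price changes than a shorter one (such as 20 days), which is why the two lines are plotted together.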
library(funModeling)
library(tidyverse)
library(Hmisc)
indf_log <- log(indf_data)
head(indf_log, n = 10)
## TSLA.Open TSLA.High TSLA.Low TSLA.Close TSLA.Volume TSLA.Adjusted
## 2017-12-08 3.043252 3.050788 3.032578 3.044935 17.76728 3.044935
## 2017-12-11 3.043347 3.088038 3.040546 3.087734 18.59522 3.087734
## 2017-12-12 3.092405 3.125122 3.091133 3.123920 18.69069 3.123920
## 2017-12-13 3.123627 3.133231 3.110548 3.118038 18.35157 3.118038
## 2017-12-14 3.123862 3.142542 3.111736 3.114670 18.28140 3.114670
## 2017-12-15 3.126878 3.132301 3.108346 3.130991 18.45988 3.130991
## 2017-12-18 3.135204 3.140496 3.113752 3.117566 18.22397 3.117566
## 2017-12-19 3.121660 3.125268 3.091951 3.094370 18.44415 3.094370
## 2017-12-20 3.099161 3.100393 3.075898 3.087947 18.30759 3.087947
## 2017-12-21 3.089799 3.102312 3.082552 3.096060 18.00180 3.096060
Use glimpse to report the number of observations (rows) and variables; the head call above displays the first ten rows of the log-transformed data.
glimpse(indf_data)
## An xts object on 2017-12-08 / 2023-06-16 containing:
## Data: double [1389, 6]
## Columns: TSLA.Open, TSLA.High, TSLA.Low, TSLA.Close, TSLA.Volume ... with 1 more column
## Index: Date [1389] (TZ: "UTC")
## xts Attributes:
## $ src : chr "yahoo"
## $ updated: POSIXct[1:1], format: "2023-06-17 13:13:00"
Obtain statistical information on data types, zero values, infinite numbers and missing values:
df_status(indf_data)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 var.TSLA.Open 0 0 0 0 0 0 numeric 1358
## 2 var.TSLA.High 0 0 0 0 0 0 numeric 1348
## 3 var.TSLA.Low 0 0 0 0 0 0 numeric 1365
## 4 var.TSLA.Close 0 0 0 0 0 0 numeric 1376
## 5 var.TSLA.Volume 0 0 0 0 0 0 numeric 1384
## 6 var.TSLA.Adjusted 0 0 0 0 0 0 numeric 1376
Use describe from the Hmisc package to quickly understand all the variables.
describe(indf_data)
## indf_data
##
## 6 Variables 1389 Observations
## --------------------------------------------------------------------------------
## TSLA.Open
## n missing distinct Info Mean Gmd .05 .10
## 1389 0 1358 1 134.8 126.8 15.53 17.57
## .25 .50 .75 .90 .95
## 21.45 125.70 230.04 292.59 332.88
##
## lowest : 12.0733 12.34 12.3673 12.4733 12.5833
## highest: 391.2 392.443 396.517 409.333 411.47
## --------------------------------------------------------------------------------
## TSLA.High
## n missing distinct Info Mean Gmd .05 .10
## 1389 0 1348 1 137.9 129.7 15.81 17.91
## .25 .50 .75 .90 .95
## 21.85 128.62 235.57 299.88 340.49
##
## lowest : 12.4453 12.6613 12.8173 12.826 12.932
## highest: 402.863 403.25 405.13 413.29 414.497
## --------------------------------------------------------------------------------
## TSLA.Low
## n missing distinct Info Mean Gmd .05 .10
## 1389 0 1365 1 131.4 123.6 15.27 17.04
## .25 .50 .75 .90 .95
## 21.07 121.02 224.33 285.59 325.09
##
## lowest : 11.7993 11.974 12.2733 12.336 12.4147
## highest: 378.68 382 384.207 402.667 405.667
## --------------------------------------------------------------------------------
## TSLA.Close
## n missing distinct Info Mean Gmd .05 .10
## 1389 0 1376 1 134.7 126.6 15.64 17.52
## .25 .50 .75 .90 .95
## 21.52 125.24 231.24 291.96 332.89
##
## lowest : 11.9313 12.344 12.548 12.58 12.6573
## highest: 399.927 402.863 404.62 407.363 409.97
## --------------------------------------------------------------------------------
## TSLA.Volume
## n missing distinct Info Mean Gmd .05 .10
## 1389 0 1384 1 133889572 84010133 53581040 61370640
## .25 .50 .75 .90 .95
## 77433000 105895200 159883500 245943900 306506160
##
## lowest : 29401800 35042700 35842500 36741600 36984000
## highest: 598212000 666378600 705975000 726357000 914082000
## --------------------------------------------------------------------------------
## TSLA.Adjusted
## n missing distinct Info Mean Gmd .05 .10
## 1389 0 1376 1 134.7 126.6 15.64 17.52
## .25 .50 .75 .90 .95
## 21.52 125.24 231.24 291.96 332.89
##
## lowest : 11.9313 12.344 12.548 12.58 12.6573
## highest: 399.927 402.863 404.62 407.363 409.97
## --------------------------------------------------------------------------------
The autocorrelation function (ACF) is a statistic that measures the relationship between lags and current observations in time series data. It can help us understand whether there is lagged correlation in time series data, i.e., whether the current observation is correlated with past observations.
By calculating the ACF, we can obtain a set of autocorrelation coefficients, where each coefficient represents the level of correlation at the corresponding lag. These coefficients are usually presented in graphical form to provide a more intuitive understanding of the lagged correlation structure of the time series data.
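To make the definition concrete, the sketch below computes the sample autocorrelation at a given lag directly and compares it with the values returned by `stats::acf()`. The helper `sample_acf` and the synthetic random-walk series are assumptions made for illustration only.

```r
# Sample autocorrelation at lag k, computed directly from its definition.
# `sample_acf` is an illustrative helper; stats::acf() is the standard tool.
sample_acf <- function(x, k) {
  n <- length(x)
  x_centered <- x - mean(x)
  sum(x_centered[(k + 1):n] * x_centered[1:(n - k)]) / sum(x_centered^2)
}

set.seed(42)
toy_series <- cumsum(rnorm(200))  # synthetic random walk, not market data
manual  <- sapply(1:5, function(k) sample_acf(toy_series, k))
builtin <- acf(toy_series, lag.max = 5, plot = FALSE)$acf[2:6]  # drop lag 0
print(round(cbind(manual, builtin), 4))
```

The two columns should agree, since `acf()` uses the same estimator by default.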
acf_log <- acf(indf_log, lag.max = 320)
Given this autocorrelation structure, the data are better suited to an ARMA-type model, so the model is constructed using a time series approach.
diff.acf <- acf(indf_log)
The following code fits a linear regression of the opening price on the adjusted closing price and visualizes the result as a scatter plot with a regression line. The scatter plot illustrates the distribution of the data, while the regression line depicts the linear relationship between the independent and dependent variables, allowing a clear visual assessment of how well the regression model fits the data.
models <- lm(TSLA.Open ~ TSLA.Adjusted, data = indf_data)
ggplot(indf_data, aes(x = TSLA.Adjusted, y = TSLA.Open)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "TSLA.Adjusted", y = "TSLA.Open", title = "Linear Regression")
## `geom_smooth()` using formula = 'y ~ x'
The following code computes cor_matrix, the correlation coefficient matrix of the variables in indf_data, and displays it as a heatmap.
library(ggplot2)
cor_matrix <- cor(indf_data)
cor_data <- reshape2::melt(cor_matrix)
ggplot(cor_data, aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
scale_fill_gradient(low = "blue", high = "red") +
labs(title = "Correlation Matrix Heatmap")
Using the training data set (the first 1270 observations of the log closing price), auto.arima searches over candidate ARIMA(p,d,q) models, restricted to stationary models by stationary = TRUE, and stores the best one in the arima_model object. The selected model is then checked with checkresiduals.
library(caTools)
library(forecast)
train_data <- indf_log[1:1270, "TSLA.Close"]
set.seed(123)
arima_model <- auto.arima(train_data, stationary = TRUE, ic = c("aicc", "aic", "bic"),
trace = TRUE)
##
## Fitting models using approximations to speed things up...
##
## ARIMA(2,0,2) with non-zero mean : Inf
## ARIMA(0,0,0) with non-zero mean : 4069.53
## ARIMA(1,0,0) with non-zero mean : Inf
## ARIMA(0,0,1) with non-zero mean : 2387.889
## ARIMA(0,0,0) with zero mean : 7387.961
## ARIMA(1,0,1) with non-zero mean : Inf
## ARIMA(0,0,2) with non-zero mean : 1024.902
## ARIMA(1,0,2) with non-zero mean : Inf
## ARIMA(0,0,3) with non-zero mean : -15.06664
## ARIMA(1,0,3) with non-zero mean : Inf
## ARIMA(0,0,4) with non-zero mean : -711.9691
## ARIMA(1,0,4) with non-zero mean : Inf
## ARIMA(0,0,5) with non-zero mean : -1316.529
## ARIMA(1,0,5) with non-zero mean : Inf
## ARIMA(0,0,5) with zero mean : 1364.836
##
## Now re-fitting the best model(s) without approximations...
##
## ARIMA(0,0,5) with non-zero mean : -1542.995
##
## Best model: ARIMA(0,0,5) with non-zero mean
summary(arima_model) prints the summary information of the fitted ARIMA model object arima_model: the coefficient estimates and their standard errors, the residual variance (sigma^2), the log likelihood, the information criteria (AIC, AICc, BIC), and the training set error measures. This summary helps us evaluate the degree of fit of the model and, by comparing each coefficient with its standard error, the significance of each model parameter.
summary(arima_model)
## Series: train_data
## ARIMA(0,0,5) with non-zero mean
##
## Coefficients:
## ma1 ma2 ma3 ma4 ma5 mean
## 2.2588 3.1303 2.9223 1.8285 0.6638 4.2662
## s.e. 0.0249 0.0485 0.0522 0.0344 0.0179 0.0432
##
## sigma^2 = 0.01716: log likelihood = 778.54
## AIC=-1543.08 AICc=-1543 BIC=-1507.06
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.000264673 0.1306758 0.1113869 -0.7396326 2.862258 3.890185
## ACF1
## Training set 0.3689181
checkresiduals(arima_model)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,5) with non-zero mean
## Q* = 4366, df = 5, p-value < 2.2e-16
##
## Model df: 5. Total lags used: 10
arima <- arima(train_data, order = c(0, 0, 5)) uses the arima function to fit an ARIMA model. train_data is the training time series used to fit the model. order = c(0, 0, 5) gives the order of the ARIMA model: autoregressive (AR) order 0, differencing order 0, and moving average (MA) order 5. This is the order selected by auto.arima above.
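Written out, an ARIMA(0,0,5) model with non-zero mean is a fifth-order moving average process. The notation below is introduced here for illustration: y_t is the log closing price, \mu the mean (reported as `mean` or `intercept` in the output), \theta_1, ..., \theta_5 the MA coefficients, and \varepsilon_t a white-noise error with variance \sigma^2.

```latex
y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2}
      + \theta_3 \varepsilon_{t-3} + \theta_4 \varepsilon_{t-4} + \theta_5 \varepsilon_{t-5},
\qquad \varepsilon_t \sim \mathrm{WN}(0, \sigma^2)
```

One consequence of this form is that point forecasts more than five steps ahead equal \mu, because all error terms entering the equation are then in the future and have expectation zero.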
arima <- arima(train_data, order = c(0, 0, 5))
summary(arima)
##
## Call:
## arima(x = train_data, order = c(0, 0, 5))
##
## Coefficients:
## ma1 ma2 ma3 ma4 ma5 intercept
## 2.2588 3.1303 2.9223 1.8285 0.6638 4.2662
## s.e. 0.0249 0.0485 0.0522 0.0344 0.0179 0.0432
##
## sigma^2 estimated as 0.01708: log likelihood = 778.54, aic = -1543.08
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.000264673 0.1306758 0.1113869 -0.7396326 2.862258 3.890185
## ACF1
## Training set 0.3689181
Forecast the next 100 periods with the fitted model and plot the forecast together with the training series.
close_prices <- Cl(indf_data)
forecast1 <- forecast(arima, h = 100)
plot(forecast1)
train_datas <- indf_log[1:1270, "TSLA.Close"]
arima <- arima(train_datas, order = c(0, 0, 5))
forecast_ori <- forecast(arima, h = 100)
a <- ts(train_datas)
forecast_ori %>% autoplot() + autolayer(a)
View the data structure and summary
str(indf_data)
## An xts object on 2017-12-08 / 2023-06-16 containing:
## Data: double [1389, 6]
## Columns: TSLA.Open, TSLA.High, TSLA.Low, TSLA.Close, TSLA.Volume ... with 1 more column
## Index: Date [1389] (TZ: "UTC")
## xts Attributes:
## $ src : chr "yahoo"
## $ updated: POSIXct[1:1], format: "2023-06-17 13:13:00"
summary(indf_data)
## Index TSLA.Open TSLA.High TSLA.Low
## Min. :2017-12-08 Min. : 12.07 Min. : 12.45 Min. : 11.80
## 1st Qu.:2019-04-30 1st Qu.: 21.45 1st Qu.: 21.85 1st Qu.: 21.07
## Median :2020-09-14 Median :125.70 Median :128.62 Median :121.02
## Mean :2020-09-12 Mean :134.77 Mean :137.89 Mean :131.38
## 3rd Qu.:2022-01-28 3rd Qu.:230.04 3rd Qu.:235.57 3rd Qu.:224.33
## Max. :2023-06-16 Max. :411.47 Max. :414.50 Max. :405.67
## TSLA.Close TSLA.Volume TSLA.Adjusted
## Min. : 11.93 Min. : 29401800 Min. : 11.93
## 1st Qu.: 21.52 1st Qu.: 77433000 1st Qu.: 21.52
## Median :125.24 Median :105895200 Median :125.24
## Mean :134.73 Mean :133889572 Mean :134.73
## 3rd Qu.:231.24 3rd Qu.:159883500 3rd Qu.:231.24
## Max. :409.97 Max. :914082000 Max. :409.97
We have built an ARIMA model based on the best-selected ARIMA(0,0,5) and visualized the forecast data. Next, we evaluate the accuracy of this time series model on the 100 observations that follow the training set. The evaluation metrics are Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE).
arima <- arima(train_data, order = c(0, 0, 5))
forecast1 <- forecast(arima, h = 100)
forecasted_values <- forecast1$mean
actual_values <- coredata(indf_log[1271:1370, "TSLA.Close"])
errors <- actual_values - forecasted_values
mse <- mean(errors^2)
rmse <- sqrt(mse)
mae <- mean(abs(errors))
mape <- mean(abs(errors/actual_values)) * 100
cat("Mean Squared Error (MSE):", mse, "\n")
## Mean Squared Error (MSE): 0.7704065
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
## Root Mean Squared Error (RMSE): 0.877728
cat("Mean Absolute Error (MAE):", mae, "\n")
## Mean Absolute Error (MAE): 0.8496022
cat("Mean Absolute Percentage Error (MAPE):", mape, "%\n")
## Mean Absolute Percentage Error (MAPE): 16.42937 %
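The metrics above are computed on log-transformed closing prices. A minimal sketch, under the assumption that one also wants the errors expressed on the original price scale, is to exponentiate both series before computing them. The helper `price_scale_metrics` and the toy inputs are illustrative and not part of the study.

```r
# Compute error metrics on the price scale by exponentiating log-scale values.
# `price_scale_metrics` and the toy inputs are illustrative assumptions.
price_scale_metrics <- function(actual_log, forecast_log) {
  actual    <- exp(as.numeric(actual_log))
  predicted <- exp(as.numeric(forecast_log))
  errors <- actual - predicted
  c(MSE  = mean(errors^2),
    RMSE = sqrt(mean(errors^2)),
    MAE  = mean(abs(errors)),
    MAPE = mean(abs(errors / actual)) * 100)
}

toy_actual_log   <- log(c(200, 210, 220))  # synthetic prices, not TSLA data
toy_forecast_log <- log(c(190, 210, 231))
metrics <- price_scale_metrics(toy_actual_log, toy_forecast_log)
print(round(metrics, 3))
```

With the objects from the evaluation code, the call would be `price_scale_metrics(actual_values, forecasted_values)`.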
For the forecast results of ARIMA(0,0,5) with non-zero mean, the model is fitted to the time series train_data. The forecast() function is called with h = 100, which generates forecasts for the next 100 periods. The plot() function then visualizes the forecast, showing the predicted values along with their prediction intervals. The x-axis represents the time period, and the y-axis represents the values of the log-transformed series. On the right side of the graph, two shaded bands mark the intervals: the purple band is the 80% interval and the grayish-purple band is the 95% interval, that is, the ranges within which the actual values are expected to fall with probability 80% and 95%, respectively. These intervals give insight into the range of values the series may take in the future, and analyzing them alongside the predicted values can help investors make informed decisions.
For the residual plot of the ARIMA(0,0,5) model with non-zero mean, each point on the x-axis corresponds to a specific observation in the time series, and the y-axis shows the residual, that is, the deviation between the observed value and the fitted value. The residuals fluctuate around zero and their dispersion remains relatively constant, which suggests that, on average, the fitted values are close to the observed values and that there is no large systematic bias. However, the Ljung-Box test reported above (Q* = 4366, p-value < 2.2e-16) rejects the hypothesis that the residuals are uncorrelated, so this assessment of fit should be read with caution.
For the Lag-ACF plot, the autocorrelation function (ACF) measures the correlation between a series and its own values at different lags and helps judge whether the chosen order of the ARIMA model is adequate. From the resulting plot, it can be observed that the autocorrelation coefficients fluctuate around 0.6, indicating a strong positive correlation between values at neighboring lags. Because this panel is part of the checkresiduals output and is computed on the model residuals, a high autocorrelation coefficient means that the residuals are still closely related to their past values: the series contains persistence that the ARIMA(0,0,5) model captures only in part, which is consistent with the Ljung-Box result.
Overall, the output provides valuable information for evaluating the performance of the ARIMA model, assessing its suitability for forecasting future values, understanding the behavior of residuals, identifying temporal patterns in the stock’s closing prices, and exploring the relationships between variables. This information can support the decision-making process.
Analysis of the data:
Mean Squared Error (MSE): 0.7545285. MSE measures the average squared difference between the predicted and actual values. A lower MSE indicates that the model’s predictions are closer to the actual results. The value of 0.7545285 can be considered relatively low, indicating a certain level of accuracy in predicting the stock data.
Root Mean Squared Error (RMSE): 0.868636. RMSE is the square root of MSE and expresses the typical size of the prediction error in the same units as the data. As with MSE, a lower RMSE indicates better prediction capability. In this case, the value of 0.868636 shows that the model has relatively small prediction errors.
Mean Absolute Error (MAE): 0.8372229. MAE measures the average absolute difference between the predicted and actual values. A lower MAE indicates smaller prediction errors. The value of 0.8372229 is relatively low, indicating a certain level of accuracy in predicting the stock data.
Mean Absolute Percentage Error (MAPE): 16.25078%. MAPE measures the average percentage difference between the predicted and actual values. A lower MAPE indicates higher accuracy. In this case, the predictions deviate from the actual values by about 16.25% on average, which can be considered a relatively accurate prediction of the stock data.
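The four metrics above follow directly from their definitions. The following is a minimal sketch in plain Python, not the project's own evaluation code; the function name `forecast_errors` and the toy values are hypothetical, and the last lines only check that the reported RMSE is consistent with the reported MSE.

```python
import math


def forecast_errors(actual, predicted):
    """Compute MSE, RMSE, MAE, and MAPE (in percent) for paired values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    mse = sum(e * e for e in errors) / n
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": sum(abs(e) for e in errors) / n,
        "MAPE": 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n,
    }


# Toy values (hypothetical, not the Tesla test set).
metrics = forecast_errors([2.0, 4.0, 5.0], [1.0, 5.0, 5.0])
print(metrics)

# Consistency check on the reported figures: RMSE should equal sqrt(MSE).
reported_mse, reported_rmse = 0.7545285, 0.868636
print(abs(math.sqrt(reported_mse) - reported_rmse) < 1e-5)
```

MSE, RMSE, and MAE are scale-dependent, so they are best judged against the magnitude of the series being forecast, while MAPE is a relative measure that can be compared across series.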
Overall, this stock prediction model demonstrates a reasonable level of accuracy: the error metrics are low, and the predicted results are close to the actual results.
In our machine learning-based stock prediction project, our goal was to develop a stock price forecasting model using machine learning techniques. To achieve this, we first gained a business understanding by recognizing the significance of stock trading in the financial market and how machine learning can enhance the accuracy of stock price predictions. Next, we conducted data understanding by collecting historical data for Tesla stock. We cleaned and structured the data, selected key features for analysis, and chose to apply the ARIMA model for prediction. Through data understanding, preparation, analysis, and modeling, we built an ARIMA(0,0,5) model and evaluated its performance.
The results of our model showed reasonable accuracy in predicting stock prices, as indicated by the evaluation metrics including Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE). The MSE, RMSE, and MAE values were relatively low, indicating that our model’s predictions were close to the actual values. The MAPE value of 16.25078% also indicated a relatively accurate prediction of stock prices.
By utilizing machine learning techniques, our model provides investors and traders with a more systematic and objective approach to stock price prediction. This can help them make informed investment decisions, optimize asset allocation, mitigate risks, and increase returns. However, it is important to note that stock market prediction is a complex task, and various factors can influence stock prices. Therefore, it is always advisable to consider multiple sources of information and perform comprehensive analysis before making investment decisions.
Overall, our project contributes to the field of stock market forecasting by showcasing the effectiveness of machine learning models, specifically the ARIMA model, in predicting stock prices. Future research can explore other machine learning algorithms and incorporate additional features and data sources to further enhance the accuracy and reliability of stock price predictions.
Ganesan, A., & Kannan, A. (2021). Stock price prediction using ARIMA model. International Research Journal of Engineering and Technology. Retrieved May 30, 2023, from www.irjet.net
Huang, J., & Liu, J. (2020). Using social media mining technology to improve stock price forecast accuracy. Journal of Forecasting, 39(1), 104–116. https://doi.org/10.1002/for.2616
Huang MJ, & Liu YXuan. (2019). Research on stock trend prediction method based on machine learning. Modern Salt Chemicals, 46(5), 3.
Li, X., Wu, P., & Wang, W. (2020). Incorporating stock prices and news sentiments for stock market prediction: A case of Hong Kong. Information Processing and Management, 57(5), 102212. https://doi.org/10.1016/j.ipm.2020.102212
Liu, H., Manzoor, A., Wang, C., Zhang, L., & Manzoor, Z. (2020). The COVID-19 Outbreak and Affected Countries Stock Markets Response. International Journal of Environmental Research and Public Health, 17(8), 2800. https://doi.org/10.3390/ijerph17082800
Lohrmann, C., & Luukka, P. (2019). Classification of intraday S&P500 returns with a Random Forest. International Journal of Forecasting, 35(1), 390–407. https://doi.org/10.1016/j.ijforecast.2018.08.004
Madhikermi, M., & Främling, K. (2019). Data discovery method for Extract-Transform-Load. https://doi.org/10.1109/icmimt.2019.8712027
Moghar, A., & Hamiche, M. (2020). Stock Market Prediction Using LSTM Recurrent Neural Network. Procedia Computer Science, 170, 1168–1173. https://doi.org/10.1016/j.procs.2020.03.049
Li, Y., Ni, P., & Chang, V. (2020). Application of deep reinforcement learning in stock trading strategies and stock forecasting. Computing, 102(6), 1305–1322. https://doi.org/10.1007/s00607-019-00773-w
O’Connor, N., & Madden, M. C. (2006). A neural network approach to predicting stock exchange movements using external factors. Knowledge Based Systems, 19(5), 371–378. https://doi.org/10.1016/j.knosys.2005.11.015
Pan, Y., Xiao, Z., Wang, X., & Yang, D. (2017). A multiple support vector machine approach to stock index forecasting with mixed frequency sampling. Knowledge Based Systems, 122, 90–102. https://doi.org/10.1016/j.knosys.2017.01.033
Rajihy, Y., Nermend, K., & Alsakaa, A. (2017). Back-propagation artificial neural networks in stock market forecasting. An application to the Warsaw Stock Exchange WIG20. AESTIMATIO: The IEB International Journal of Finance, 15, 88–99. https://dialnet.unirioja.es/descarga/articulo/6637821.pdf
Tounsi, S., & Maatoug, A. B. (2021). The GOLD market as a safe haven against the stock market uncertainty: Evidence from geopolitical risk. Resources Policy, 70, 101872. https://doi.org/10.1016/j.resourpol.2020.101872
Zhang, J., Cui, S., Xu, Y., Li, Q., & Li, T. (2018). A novel data-driven stock price trend prediction system. Expert Systems With Applications, 97, 60–69. https://doi.org/10.1016/j.eswa.2017.12.026
Zou, J., Zhao, Q., Jiao, Y., Cao, H., Liu, Y., Yan, Q., Abbasnejad, E., Liu, L., & Shi, J. (2022). Stock Market Prediction via Deep Learning Techniques: A Survey. arXiv (Cornell University). https://doi.org/10.48550/arXiv.2212.12717
Bustos, O. H., & Pomares-Quimbaya, A. (2020). Stock market movement forecast: A Systematic review. Expert Systems With Applications, 156, 113464. https://doi.org/10.1016/j.eswa.2020.113464
Bhandari, H. N., Rimal, B., Pokhrel, N. R., Rimal, R., Dahal, K. R., & Khatri, R. K. (2022). Predicting stock market index using LSTM. Machine Learning with Applications, 9, 100320.
Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10), 2451–2471. https://ieeexplore.ieee.org/abstract/document/6789445
Gers, F. A., Schraudolph, N. N., & Schmidhuber, J. (2002). Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research, 3(Aug), 115–143.
Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735