1 Project Abstract

Traditionally, statistical techniques were used to analyze the trends of stock prices over time, together with related variables, to extract temporal correlations. Stock market forecasting is a challenging task due to the volatile nature of stocks. Various machine learning methods have now been applied to stock market forecasting, with the aim of making prediction easier and more accurate. By comparing various machine learning models, we select the ARIMA model as the more accurate one for predicting stock prices. We then evaluate the effectiveness of the ARIMA model using Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE). The results show the usefulness of our proposed method in predicting stock prices.

1.1 Project members

STUDENT ID	STUDENT NAME	WORK
S2128100	ZHANG YU	Coding and points 7, 8
S2164046	HU LIANGLIANG	Point 9
22060214	ZHU MEI	Pre and points 3, 4, 5, 6
S2121005	XU HUANDI	Points 1, 7.1, 7.2
S2177633	ZHOU YAO	Point 2

2 Introduction

Stock trading is a very important investment activity in today’s financial markets. The traditional method of stock forecasting is to analyze the trend of stock prices over time, together with related variables, to extract temporal correlations. However, the stock market is very volatile, and the rise or fall of stock prices is not simply influenced by the current environment and past share prices, but also by a variety of complex factors such as company earnings reports, national policies, influential shareholders, expert speculation on current events, and more (Zou et al., 2022). Accurately predicting stock trends is therefore a very challenging task.

Although the stock market is a non-linear, non-stationary system (Huang & Liu, 2019), the development of machine learning techniques has led more and more people to explore how machine learning algorithms can be used to predict stock trends. The main objective of this study is to provide a detailed analysis using an autoregressive integrated moving average (ARIMA) model to predict future trends and stock returns. This study contains a literature survey of related work in the field, describes the methodology and the data analysis, and finally sets up an evaluation framework to assess the validity of the model using error metrics such as MSE, RMSE, MAE, and MAPE. The results show the usefulness of this model in predicting stock prices.

3 Problem Statement

Traditionally, many investors and traders rely on their own experience, knowledge, and skills to predict changes in stock prices. Although personal experience has some value in stock price prediction, its limitations cannot be ignored. Personal experience forecasts typically depend on limited sources of information, such as past trading experiences, personal observations, and media reports. These sources of information may not comprehensively and accurately reflect the dynamics and changes in the market, thereby limiting the accuracy and reliability of predictions.

To enhance the accuracy and reliability of predictions, many investors and traders have started leveraging technologies such as machine learning to assist in forecasting. By learning from extensive historical data and pattern recognition, processing complex data and patterns, and uncovering hidden patterns, machine learning provides more systematic and objective prediction results, thereby compensating for some of the shortcomings of personal experience forecasts.

In our project, historical stock price data is used to train and establish a predictive model. Taking Tesla stock as an example, we forecast the future trends of Tesla stock prices, aiding investors in making more informed decisions.

4 Project Objective

1. Classification question: Can specific stock price trends be classified as “rising,” “falling,” or “stable” based on the ARIMA model’s predictions?

Regression question: How accurately can we predict the future closing price of a specific stock?

Which machine learning models perform better in predicting stock prices?

2. To categorize specific stock price trends as “rising”, “falling” or “stable” by comparing the forecasted values with the previous day’s value.

To evaluate the performance of the ARIMA model by MSE, RMSE, MAE and MAPE.

To determine more accurate stock price predictive models by comparing various machine learning models.
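
The categorization objective above can be sketched in a few lines of R. This is only an illustration: the helper name classify_trend and the relative threshold tol that defines “stable” are assumptions, since the report does not fix a threshold, and the prices are toy values rather than TSLA data.

```r
# Hypothetical helper: label each forecast relative to the previous day's value.
# `tol` is an assumed relative threshold for "stable"; the report does not specify one.
classify_trend <- function(forecast_value, previous_value, tol = 0.005) {
  change <- (forecast_value - previous_value) / previous_value
  ifelse(change > tol, "rising",
         ifelse(change < -tol, "falling", "stable"))
}

# Toy prices, not real TSLA data.
previous <- c(100, 100, 100)
forecasted <- c(102, 99.9, 97)
labels <- classify_trend(forecasted, previous)
print(labels)
```

A relative threshold is used here so that the same rule applies at any price level; an absolute threshold would be an equally valid choice.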

5 Scope and Domain

The stock market is filled with uncertainty and dynamic changes, and factors such as policies and market conditions can impact stock prices. Stock price prediction is of great importance for investors and traders in decision-making. Accurate stock price predictions can help investors make informed investment decisions, choose the right timing for buying and selling, reduce investment risks, and enhance investment returns. Our project aims to go beyond traditional stock price prediction methods by utilizing machine learning algorithms to analyze new data in a timely manner, capture market changes and trends, and provide more accurate prediction results.

6 Literature Review

ARAVIND et al. studied the use of the AutoRegressive Integrated Moving Average (ARIMA) model for stock price prediction in the stock market. The volatility of the stock market makes stock price prediction a difficult task, as stock prices are influenced by various important parameters, including economic factors, interest rates, and inflation. In order to forecast stock prices, they obtained continuous stock data from Yahoo Finance and proposed the ARIMA model for stock price prediction. This model captures trends and seasonality in time series data and provides a relatively accurate method for short-term forecasting. Experimental results demonstrate that the ARIMA model exhibits reasonable accuracy in the short term and can be used for stock prediction (M et al., 2018).

7 Design & Development

7.1 Business Understanding

Stock trading plays a crucial role in the current financial market. Investors aspire to accurately predict stock trends using various technical tools to achieve better investment outcomes. Traditional experiential and intuitive methods have limitations in dealing with the complexity and volatility of the stock market. Stock prices are influenced by multiple factors such as the economy, politics, industry developments, market sentiment, and global events, which interact to make predicting stock trends complex and challenging. Machine learning technology has emerged as a promising approach to address this challenge. By analyzing historical performance data, market indicators, company financial reports, and other relevant information, predictive models can be established to capture hidden patterns and trends within vast amounts of data, providing price predictions and comprehensive market insights to investors (Moghar & Hamiche, 2020). This helps investors make informed investment decisions, optimize asset allocation, mitigate risks, and increase returns.

7.2 Data Understanding

Identify sources of stock price data, such as financial data providers, stock exchanges, or online financial platforms. Get historical stock price data, including date, open, close, high, low, volume, and more. Ensure data integrity and accuracy and save it in appropriate data format (e.g. CSV, Excel) for subsequent processing and analysis. In the literature, variables such as technical indicators, financial variables, and macro-economic variables have been considered the most influential ones affecting stock price movements. In this study, we classified all the variables in the selected literature that were used as input data into four main categories and several subcategories, as illustrated in Fig. 1.

7.2.1 Research method

This project follows a simple research workflow. First, data is collected through an open data interface (Yahoo Finance in this project) and given a basic cleaning. Next, the data is split into a training set and a test set. Lastly, a model is built using ARIMA, and its predicted values are evaluated against the test set.
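
The workflow can be sketched end to end with base R only. This is a minimal illustration under stated assumptions: the series is synthetic, the 80/20 split ratio is an assumption, and the order c(1, 1, 0) is illustrative rather than the order selected later in this report.

```r
# Synthetic stand-in for a log closing-price series (not real market data).
set.seed(42)
y <- cumsum(rnorm(200, mean = 0.001, sd = 0.02)) + 3

# Chronological split: the first 80% of observations train the model,
# the remaining 20% are held out for testing (the 80/20 ratio is an assumption).
n_train <- floor(0.8 * length(y))
train <- y[1:n_train]
test  <- y[(n_train + 1):length(y)]

# Fit an ARIMA model on the training part only. The order (1, 1, 0) is
# illustrative and is not the order selected in this report.
fit <- arima(train, order = c(1, 1, 0))

# Forecast as many steps as there are test observations and compare.
pred <- predict(fit, n.ahead = length(test))$pred
rmse <- sqrt(mean((test - as.numeric(pred))^2))
cat("Test RMSE:", rmse, "\n")
```

The split is chronological rather than random because a time series model must be evaluated on observations that come after the ones it was trained on.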

7.3 Data Preparation

Stock trading is a game process under incomplete information, and a single-objective supervised learning model has difficulty dealing with such sequential decision problems because the stock market changes rapidly, has many interference factors, and offers insufficient periodic data (Li et al., 2020). Many factors can affect stock prices, such as wars, geological disasters, and national credit crises, but these effects are often short-lived. For example, He et al. (2020) provide empirical evidence that COVID-19 has spill-over effects on the stock markets of other countries, and their results also provide a basis for assessing trends in international stock markets when the situation is alleviated globally. However, there is no evidence that COVID-19 adversely affects these countries’ stock markets more than it does the global average (He et al., 2020). Many historical events, including viral outbreaks, have had a regional impact on the stock market. Del and Paltrinieri examined 78 mutual equity funds geographically based in African countries, using observed monthly flows and results for the 2006–2015 period, and proposed that Ebola and the Arab Spring were more serious than SARS. Nippani and Washer examined the effect of SARS on Canada, China, the special administrative region of Hong Kong, Indonesia, Singapore, the Philippines, Vietnam, and Thailand and concluded that SARS only affected the stock markets of China and Vietnam (H. Liu et al., 2020). Therefore, it is not practical to consider all factors that may affect the stock market: some factors have only a short-term impact, and in the long run their influence may be relatively small.

The intensity of these volatile periods is likely to cause investors to leave the stock market in favor of other asset classes, such as gold, even though the history of the stock market’s evolution shows that episodes of volatility are frequently followed by calm periods (Tounsi & Maatoug, 2021). In this case, we only need to select the relevant features for analysis. Here, we discard the fragmented attributes that affect the stock market and select only the key attributes, such as Timestamp, Open, High, Low, Close, Volume, and Weighted Price.

7.3.1 Select data

7.3.1.1 Collect initial data

To obtain data for Tesla stock, we chose to use the API interface provided by platforms such as Yahoo Finance. This allows us to access up-to-date data and retrieve specific datasets according to our needs. We need to pass the desired parameters to the API to retrieve the relevant data. Base R alone does not offer a convenient way to fetch stock data for a specified time period from financial sources, so we rely on functions from the “quantmod” package.

7.3.1.2 Select data

We selected the stock data for Tesla from December 2017 to the present (June 2023 at the time of the analysis). This time span provides us with a relatively long-term dataset, enabling us to observe the long-term trends, seasonal variations, and other important market dynamics of Tesla stock prices. It helps us assess the performance of Tesla stock more accurately when making investment decisions and predicting future trends.

library(quantmod)

library(tseries)

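# Sys.Date() - 2017 subtracts 2017 days (about 5.5 years), not a calendar year;
# when this report was run, the series therefore started on 2017-12-08.
indf_data <- getSymbols(Symbols = "TSLA", src = "yahoo", from = Sys.Date() - 2017, 
                        to = Sys.Date(), auto.assign = FALSE)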

We only need to select relevant features for analysis. Here, I intend to discard irrelevant attributes that do not significantly impact the stock market and focus on key attributes that affect the stock market, such as Timestamp, Open, High, Low, Close, Volume, and Weighted Price. By doing so, we can concentrate on analyzing factors closely related to stock market behavior and price fluctuations. This enables us to gain deeper insights into market behavior, volatility, and potential driving factors, providing valuable support for predicting future stock price trends.

7.3.2 Clean and Construct data

Due to the peculiarities of the stock market, there may be instances where the data for a particular day is missing or labeled as “NA” due to reasons like trading halts. However, such occurrences are infrequent. Therefore, during data cleaning, we directly remove the data for those days. The presence of missing values can introduce bias in analysis results or lead to inaccurate predictions. Deleting the data for those days helps avoid introducing uncertainty in the analysis.

indf_data <- na.omit(indf_data)

7.4 Data Analysis

The data contains 7 fields: the date plus six variables (opening price, high price, low price, closing price, trading volume, and adjusted closing price).

View(indf_data)

We chart the stock data together with several technical indicators: simple moving averages (SMA), the relative strength index (RSI), Bollinger Bands, and the MACD indicator. We then explore the average change curve and trend of the collected Tesla stock data one by one.

chart_Series(indf_data, col = "black")

add_SMA(n = 100, on = 1, col = "red")

add_SMA(n = 20, on = 1, col = "black")

add_RSI(n = 14, maType = "SMA")

add_BBands(n = 20, maType = "SMA", sd = 1, on = -1)

add_MACD(fast = 12, slow = 25, signal = 9, maType = "SMA", histogram = TRUE)

library(funModeling)

library(tidyverse)

library(Hmisc)

indf_log <- log(indf_data)
head(indf_log, n = 10)

##            TSLA.Open TSLA.High TSLA.Low TSLA.Close TSLA.Volume TSLA.Adjusted
## 2017-12-08  3.043252  3.050788 3.032578   3.044935    17.76728      3.044935
## 2017-12-11  3.043347  3.088038 3.040546   3.087734    18.59522      3.087734
## 2017-12-12  3.092405  3.125122 3.091133   3.123920    18.69069      3.123920
## 2017-12-13  3.123627  3.133231 3.110548   3.118038    18.35157      3.118038
## 2017-12-14  3.123862  3.142542 3.111736   3.114670    18.28140      3.114670
## 2017-12-15  3.126878  3.132301 3.108346   3.130991    18.45988      3.130991
## 2017-12-18  3.135204  3.140496 3.113752   3.117566    18.22397      3.117566
## 2017-12-19  3.121660  3.125268 3.091951   3.094370    18.44415      3.094370
## 2017-12-20  3.099161  3.100393 3.075898   3.087947    18.30759      3.087947
## 2017-12-21  3.089799  3.102312 3.082552   3.096060    18.00180      3.096060

Use glimpse to count the number of observations (rows) and variables and to preview the structure of the data.

glimpse(indf_data)

## An xts object on 2017-12-08 / 2023-06-16 containing: 
##   Data:    double [1389, 6]
##   Columns: TSLA.Open, TSLA.High, TSLA.Low, TSLA.Close, TSLA.Volume ... with 1 more column
##   Index:   Date [1389] (TZ: "UTC")
##   xts Attributes:
##     $ src    : chr "yahoo"
##     $ updated: POSIXct[1:1], format: "2023-06-17 13:13:00"

Obtain statistical information on data types, zero values, infinite numbers and missing values:

df_status(indf_data)

##            variable q_zeros p_zeros q_na p_na q_inf p_inf    type unique
## 1     var.TSLA.Open       0       0    0    0     0     0 numeric   1358
## 2     var.TSLA.High       0       0    0    0     0     0 numeric   1348
## 3      var.TSLA.Low       0       0    0    0     0     0 numeric   1365
## 4    var.TSLA.Close       0       0    0    0     0     0 numeric   1376
## 5   var.TSLA.Volume       0       0    0    0     0     0 numeric   1384
## 6 var.TSLA.Adjusted       0       0    0    0     0     0 numeric   1376

Use the describe function from the Hmisc package to get a quick profile of all the variables.

describe(indf_data)

## indf_data 
## 
##  6  Variables      1389  Observations
## --------------------------------------------------------------------------------
## TSLA.Open 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1389        0     1358        1    134.8    126.8    15.53    17.57 
##      .25      .50      .75      .90      .95 
##    21.45   125.70   230.04   292.59   332.88 
## 
## lowest : 12.0733 12.34   12.3673 12.4733 12.5833
## highest: 391.2   392.443 396.517 409.333 411.47 
## --------------------------------------------------------------------------------
## TSLA.High 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1389        0     1348        1    137.9    129.7    15.81    17.91 
##      .25      .50      .75      .90      .95 
##    21.85   128.62   235.57   299.88   340.49 
## 
## lowest : 12.4453 12.6613 12.8173 12.826  12.932 
## highest: 402.863 403.25  405.13  413.29  414.497
## --------------------------------------------------------------------------------
## TSLA.Low 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1389        0     1365        1    131.4    123.6    15.27    17.04 
##      .25      .50      .75      .90      .95 
##    21.07   121.02   224.33   285.59   325.09 
## 
## lowest : 11.7993 11.974  12.2733 12.336  12.4147
## highest: 378.68  382     384.207 402.667 405.667
## --------------------------------------------------------------------------------
## TSLA.Close 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1389        0     1376        1    134.7    126.6    15.64    17.52 
##      .25      .50      .75      .90      .95 
##    21.52   125.24   231.24   291.96   332.89 
## 
## lowest : 11.9313 12.344  12.548  12.58   12.6573
## highest: 399.927 402.863 404.62  407.363 409.97 
## --------------------------------------------------------------------------------
## TSLA.Volume 
##         n   missing  distinct      Info      Mean       Gmd       .05       .10 
##      1389         0      1384         1 133889572  84010133  53581040  61370640 
##       .25       .50       .75       .90       .95 
##  77433000 105895200 159883500 245943900 306506160 
## 
## lowest :  29401800  35042700  35842500  36741600  36984000
## highest: 598212000 666378600 705975000 726357000 914082000
## --------------------------------------------------------------------------------
## TSLA.Adjusted 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1389        0     1376        1    134.7    126.6    15.64    17.52 
##      .25      .50      .75      .90      .95 
##    21.52   125.24   231.24   291.96   332.89 
## 
## lowest : 11.9313 12.344  12.548  12.58   12.6573
## highest: 399.927 402.863 404.62  407.363 409.97 
## --------------------------------------------------------------------------------

The autocorrelation function (ACF) is a statistic that measures the relationship between lags and current observations in time series data. It can help us understand whether there is lagged correlation in time series data, i.e., whether the current observation is correlated with past observations.

By calculating the ACF, we can obtain a set of autocorrelation coefficients, where each coefficient represents the level of correlation at the corresponding lag. These coefficients are usually presented in graphical form to provide a more intuitive understanding of the lagged correlation structure of the time series data.

acf_log <- acf(indf_log, lag.max = 320)

Based on this analysis, the data is better suited to an ARMA-type model, so we construct the model using a time series approach.

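# Assumption: the name diff.acf suggests the ACF of the differenced series was
# intended, so the log closing price is differenced before calling acf().
diff.acf <- acf(na.omit(diff(indf_log[, "TSLA.Close"])))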
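
The lagged correlation described above can be illustrated on synthetic data. The sketch below is not computed from the TSLA series: it compares a simulated random walk, whose values depend strongly on their past, with its first differences, which are uncorrelated by construction. The variable names are illustrative.

```r
set.seed(1)
noise <- rnorm(500)          # white noise: no dependence on the past
random_walk <- cumsum(noise) # each value is the previous value plus noise

# plot = FALSE returns the coefficients instead of drawing the correlogram.
acf_walk  <- acf(random_walk, lag.max = 20, plot = FALSE)
acf_noise <- acf(diff(random_walk), lag.max = 20, plot = FALSE)

# acf[1] is lag 0 (always 1); acf[2] is the lag-1 autocorrelation.
lag1_walk  <- acf_walk$acf[2]
lag1_noise <- acf_noise$acf[2]
cat("Lag-1 ACF, random walk:", round(lag1_walk, 3), "\n")
cat("Lag-1 ACF, differenced:", round(lag1_noise, 3), "\n")
```

A lag-1 coefficient close to 1 that decays slowly over further lags is typical of a non-stationary series, while coefficients near zero after differencing suggest that differencing removed most of the dependence.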

The code below visualizes the outcome of a linear regression model by creating a scatter plot and adding a regression line. The scatter plot illustrates the distribution of the data, while the regression line depicts the linear relationship between the independent and dependent variables. This graphical representation allows for a clear visual assessment of how well the regression model fits the data.

models <- lm(TSLA.Open ~ TSLA.Adjusted, data = indf_data)
ggplot(indf_data, aes(x = TSLA.Adjusted, y = TSLA.Open)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "TSLA.Adjusted", y = "TSLA.Open", title = "Linear Regression")

After executing the code below, cor_matrix holds the correlation coefficient matrix, showing the correlation between the variables in indf_data, and the heatmap displays it graphically.

library(ggplot2)

cor_matrix <- cor(indf_data)
cor_data <- reshape2::melt(cor_matrix)

ggplot(cor_data, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient(low = "blue", high = "red") +
  labs(title = "Correlation Matrix Heatmap")

7.5 Modelling

7.5.1 Select the modeling technique

The best ARIMA(p,d,q) model is selected with the automatic ARIMA model selection method (auto.arima), fitted on the training dataset with the search restricted to stationary models (stationary = TRUE), and stored in the arima_model object.

library(caTools)

library(forecast)

train_data <- indf_log[1:1270, "TSLA.Close"]  
set.seed(123)
arima_model <- auto.arima(train_data, stationary = TRUE, ic = c("aicc", "aic", "bic"), 
                          trace = TRUE)

## 
##  Fitting models using approximations to speed things up...
## 
##  ARIMA(2,0,2) with non-zero mean : Inf
##  ARIMA(0,0,0) with non-zero mean : 4069.53
##  ARIMA(1,0,0) with non-zero mean : Inf
##  ARIMA(0,0,1) with non-zero mean : 2387.889
##  ARIMA(0,0,0) with zero mean     : 7387.961
##  ARIMA(1,0,1) with non-zero mean : Inf
##  ARIMA(0,0,2) with non-zero mean : 1024.902
##  ARIMA(1,0,2) with non-zero mean : Inf
##  ARIMA(0,0,3) with non-zero mean : -15.06664
##  ARIMA(1,0,3) with non-zero mean : Inf
##  ARIMA(0,0,4) with non-zero mean : -711.9691
##  ARIMA(1,0,4) with non-zero mean : Inf
##  ARIMA(0,0,5) with non-zero mean : -1316.529
##  ARIMA(1,0,5) with non-zero mean : Inf
##  ARIMA(0,0,5) with zero mean     : 1364.836
## 
##  Now re-fitting the best model(s) without approximations...
## 
##  ARIMA(0,0,5) with non-zero mean : -1542.995
## 
##  Best model: ARIMA(0,0,5) with non-zero mean
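
The idea behind the selection trace above, fitting several candidate orders and keeping the one with the lowest information criterion, can be sketched with base R. This is a simplified illustration and not the search that auto.arima performs: the data is a simulated MA(1) series, only MA orders 0 to 3 are tried, and plain AIC is used as the criterion.

```r
set.seed(7)
# Simulate an MA(1) process with a known coefficient (synthetic data).
x <- arima.sim(model = list(ma = 0.7), n = 300)

# Fit MA(q) models for q = 0..3 and record the AIC of each candidate.
candidate_q <- 0:3
aic_values <- sapply(candidate_q, function(q) {
  arima(x, order = c(0, 0, q))$aic
})
names(aic_values) <- paste0("MA(", candidate_q, ")")
print(round(aic_values, 2))

# The candidate with the lowest AIC is preferred.
best <- names(which.min(aic_values))
cat("Lowest AIC:", best, "\n")
```

Lower values are better, which is why the trace above ends on the candidate with the most negative value.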

summary(arima_model): this line of code prints summary information for the fitted ARIMA model object arima_model. The summary reports the parameter estimates and their standard errors, the estimated error variance (sigma^2), the log likelihood, the information criteria (AIC, AICc, BIC), and the training set error measures. This information helps us evaluate the degree of fit of the model. t-statistics and p-values are not printed, but they can be derived from the estimates and their standard errors.

summary(arima_model)

## Series: train_data 
## ARIMA(0,0,5) with non-zero mean 
## 
## Coefficients:
##          ma1     ma2     ma3     ma4     ma5    mean
##       2.2588  3.1303  2.9223  1.8285  0.6638  4.2662
## s.e.  0.0249  0.0485  0.0522  0.0344  0.0179  0.0432
## 
## sigma^2 = 0.01716:  log likelihood = 778.54
## AIC=-1543.08   AICc=-1543   BIC=-1507.06
## 
## Training set error measures:
##                       ME      RMSE       MAE        MPE     MAPE     MASE
## Training set 0.000264673 0.1306758 0.1113869 -0.7396326 2.862258 3.890185
##                   ACF1
## Training set 0.3689181
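
If significance measures for the coefficients are wanted, one common approach is to divide each estimate by its standard error and use a normal approximation. The sketch below does this with base R on a simulated MA(1) series; the data and variable names are illustrative, and the same steps should apply to any model object returned by arima.

```r
set.seed(3)
x <- arima.sim(model = list(ma = 0.6), n = 400) + 4  # synthetic MA(1) series

fit <- arima(x, order = c(0, 0, 1))

# Standard errors are the square roots of the diagonal of the
# estimated coefficient covariance matrix.
estimates <- coef(fit)
std_errors <- sqrt(diag(vcov(fit)))

# Approximate z-statistics and two-sided p-values (normal approximation).
z_stats <- estimates / std_errors
p_values <- 2 * pnorm(-abs(z_stats))
print(round(cbind(estimates, std_errors, z_stats, p_values), 4))
```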

checkresiduals(arima_model)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,5) with non-zero mean
## Q* = 4366, df = 5, p-value < 2.2e-16
## 
## Model df: 5.   Total lags used: 10
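
The Ljung-Box test tests the null hypothesis that the series (here, the residuals) is uncorrelated up to the given number of lags, so a very small p-value such as the one above is normally read as evidence that autocorrelation remains. The sketch below shows how the test behaves on two simulated series using Box.test from base R; the series are synthetic and the choice of 10 lags simply mirrors the output above.

```r
set.seed(11)
white_noise <- rnorm(300)                                  # uncorrelated
correlated  <- arima.sim(model = list(ar = 0.8), n = 300)  # strongly autocorrelated

# Ljung-Box test with 10 lags, matching the "Total lags used: 10" line above.
lb_white <- Box.test(white_noise, lag = 10, type = "Ljung-Box")
lb_corr  <- Box.test(correlated,  lag = 10, type = "Ljung-Box")

cat("p-value, uncorrelated series:", format.pval(lb_white$p.value), "\n")
cat("p-value, autocorrelated series:", format.pval(lb_corr$p.value), "\n")
```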

7.5.2 Build model

arima <- arima(train_data, order = c(0, 0, 5)): this line of code uses the arima function to fit an ARIMA model. train_data is the training data used to fit the model, which is a time series. order = c(0, 0, 5) is the order parameter of the ARIMA model: the first 0 is the autoregressive (AR) order, the second 0 is the number of differences, and 5 is the moving average (MA) order. This specific order is the one selected by auto.arima in the previous step.

arima <- arima(train_data, order = c(0, 0, 5))
summary(arima)

## 
## Call:
## arima(x = train_data, order = c(0, 0, 5))
## 
## Coefficients:
##          ma1     ma2     ma3     ma4     ma5  intercept
##       2.2588  3.1303  2.9223  1.8285  0.6638     4.2662
## s.e.  0.0249  0.0485  0.0522  0.0344  0.0179     0.0432
## 
## sigma^2 estimated as 0.01708:  log likelihood = 778.54,  aic = -1543.08
## 
## Training set error measures:
##                       ME      RMSE       MAE        MPE     MAPE     MASE
## Training set 0.000264673 0.1306758 0.1113869 -0.7396326 2.862258 3.890185
##                   ACF1
## Training set 0.3689181
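
Written out, the ARIMA(0,0,5) specification with non-zero mean fitted above is a pure moving average model. The block below states its general form in the sign convention used by R's arima function and then substitutes the rounded estimates from the output above; y_t denotes the log closing price and the epsilon terms are the model's error terms.

```latex
y_t = \mu + \varepsilon_t + \sum_{i=1}^{5} \theta_i \, \varepsilon_{t-i},
\qquad \varepsilon_t \sim \text{white noise}(0, \sigma^2)

\hat{y}_t = 4.2662 + \varepsilon_t + 2.2588\,\varepsilon_{t-1} + 3.1303\,\varepsilon_{t-2}
          + 2.9223\,\varepsilon_{t-3} + 1.8285\,\varepsilon_{t-4} + 0.6638\,\varepsilon_{t-5},
\qquad \hat{\sigma}^2 = 0.01708
```

Because a pure MA(5) model only remembers the last five error terms, its forecasts return to the estimated mean once the horizon exceeds five steps.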

Plot the time series of stock closing prices with rolling mean and standard deviation.

close_prices <- Cl(indf_data)
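
A rolling mean and standard deviation plot of the kind described above could be produced as follows. This is a sketch on a synthetic price series using base R only; the 20-day window and all variable names are assumptions, and the same steps could be applied to close_prices after converting it to a numeric vector.

```r
set.seed(5)
# Synthetic stand-in for a closing-price series (not real TSLA data).
prices <- 100 + cumsum(rnorm(120))
window <- 20  # assumed window length, in trading days

# Trailing rolling mean: average of the current and previous 19 values.
roll_mean <- stats::filter(prices, rep(1 / window, window), sides = 1)

# Trailing rolling standard deviation over the same window.
roll_sd <- sapply(seq_along(prices), function(i) {
  if (i < window) NA_real_ else sd(prices[(i - window + 1):i])
})

plot(prices, type = "l", xlab = "Day", ylab = "Price",
     main = "Price with 20-day rolling mean and standard deviation")
lines(as.numeric(roll_mean), col = "red")
lines(as.numeric(roll_mean) + as.numeric(roll_sd), col = "blue", lty = 2)
lines(as.numeric(roll_mean) - as.numeric(roll_sd), col = "blue", lty = 2)
```

stats::filter is called with its namespace because dplyr, which is loaded earlier in this report, masks filter.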

8 Results

forecast1 <- forecast(arima, h = 100)
plot(forecast1)

train_datas <- indf_log[1:1270, "TSLA.Close"]
arima <- arima(train_datas, order = c(0, 0, 5))
forecast_ori <- forecast(arima, h = 100)
a <- ts(train_datas)
forecast_ori %>% autoplot() + autolayer(a)

8.1 Exploratory Data Analysis (EDA): visualization

View the data structure and summary

str(indf_data)

## An xts object on 2017-12-08 / 2023-06-16 containing: 
##   Data:    double [1389, 6]
##   Columns: TSLA.Open, TSLA.High, TSLA.Low, TSLA.Close, TSLA.Volume ... with 1 more column
##   Index:   Date [1389] (TZ: "UTC")
##   xts Attributes:
##     $ src    : chr "yahoo"
##     $ updated: POSIXct[1:1], format: "2023-06-17 13:13:00"

summary(indf_data)

##      Index              TSLA.Open        TSLA.High         TSLA.Low     
##  Min.   :2017-12-08   Min.   : 12.07   Min.   : 12.45   Min.   : 11.80  
##  1st Qu.:2019-04-30   1st Qu.: 21.45   1st Qu.: 21.85   1st Qu.: 21.07  
##  Median :2020-09-14   Median :125.70   Median :128.62   Median :121.02  
##  Mean   :2020-09-12   Mean   :134.77   Mean   :137.89   Mean   :131.38  
##  3rd Qu.:2022-01-28   3rd Qu.:230.04   3rd Qu.:235.57   3rd Qu.:224.33  
##  Max.   :2023-06-16   Max.   :411.47   Max.   :414.50   Max.   :405.67  
##    TSLA.Close      TSLA.Volume        TSLA.Adjusted   
##  Min.   : 11.93   Min.   : 29401800   Min.   : 11.93  
##  1st Qu.: 21.52   1st Qu.: 77433000   1st Qu.: 21.52  
##  Median :125.24   Median :105895200   Median :125.24  
##  Mean   :134.73   Mean   :133889572   Mean   :134.73  
##  3rd Qu.:231.24   3rd Qu.:159883500   3rd Qu.:231.24  
##  Max.   :409.97   Max.   :914082000   Max.   :409.97

9 Evaluation of models

9.2 Evaluation of models

We have built an ARIMA model based on the best-selected ARIMA(0,0,5) and visualized the forecast data. Next, I will evaluate the accuracy of this time series model using the remaining test set. The introduced metrics for evaluation include Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE).

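# Refit the selected model and forecast the 100 observations after the training window.
arima <- arima(train_data, order = c(0, 0, 5))
forecast1 <- forecast(arima, h = 100)
# Convert both series to plain numeric vectors so they subtract element by element.
forecasted_values <- as.numeric(forecast1$mean)
actual_values <- as.numeric(coredata(indf_log[1271:1370, "TSLA.Close"]))
errors <- actual_values - forecasted_values
mse <- mean(errors^2)
rmse <- sqrt(mse)
mae <- mean(abs(errors))
mape <- mean(abs(errors/actual_values)) * 100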

cat("Mean Squared Error (MSE):", mse, "\n")

## Mean Squared Error (MSE): 0.7704065

cat("Root Mean Squared Error (RMSE):", rmse, "\n")

## Root Mean Squared Error (RMSE): 0.877728

cat("Mean Absolute Error (MAE):", mae, "\n")

## Mean Absolute Error (MAE): 0.8496022

cat("Mean Absolute Percentage Error (MAPE):", mape, "%\n")

## Mean Absolute Percentage Error (MAPE): 16.42937 %
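
Because the model is fitted to log-transformed prices, the metrics above are on the log scale. The sketch below bundles the four metrics into one helper and shows how they could also be computed on the original price scale by undoing the log with exp(). The helper name forecast_metrics and the toy values are assumptions for illustration and are not taken from the TSLA data.

```r
# Hypothetical helper that bundles the four metrics used in this section.
forecast_metrics <- function(actual, predicted) {
  errors <- actual - predicted
  c(MSE  = mean(errors^2),
    RMSE = sqrt(mean(errors^2)),
    MAE  = mean(abs(errors)),
    MAPE = mean(abs(errors / actual)) * 100)
}

# Toy log-price values (illustrative only, not taken from the TSLA data).
actual_log    <- log(c(200, 210, 220))
predicted_log <- log(c(190, 215, 200))

# Metrics on the log scale, as computed in this report.
log_scale <- forecast_metrics(actual_log, predicted_log)

# Metrics on the original price scale, after undoing the log with exp().
price_scale <- forecast_metrics(exp(actual_log), exp(predicted_log))

print(round(log_scale, 4))
print(round(price_scale, 4))
```

Reporting both scales makes the errors easier to interpret, since an error of a given size in log units corresponds to a multiplicative error in price.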

9.3 Discussion of output

For the forecast results, the ARIMA(0,0,5) model with non-zero mean is fitted to the time series data “train_data”. The forecast() function is called with the parameter “h” set to 100, which generates forecasts for the next 100 periods. The plot() function is then called to visualize the forecast results, showing the predicted values along with their corresponding prediction intervals. The x-axis represents the time period, while the y-axis represents the values of the time series. On the right side of the graph, different colors represent intervals at different levels: the purple band is the 80% interval and the grayish-purple band is the 95% interval. Under the model’s assumptions, these are the ranges within which future observations are expected to fall with probability 80% and 95% respectively. The intervals provide insight into the potential range of values that the time series may take in the future, and analyzing the predicted values together with their intervals can help investors make informed decisions.

For the residual plot of the ARIMA(0,0,5) model with non-zero mean, each point on the x-axis corresponds to a specific observation in the time series, and the values on the y-axis represent the magnitude of the residuals, that is, the deviation between the observed and predicted values. From the plot, it can be observed that the residuals fluctuate around zero and that their dispersion remains relatively constant. Residuals that fluctuate around zero suggest that, on average, the model’s predictions are not systematically too high or too low. Note, however, that the Ljung-Box test reported in Section 7.5.1 (Q* = 4366, p-value < 2.2e-16) rejects the hypothesis that the residuals are uncorrelated, so some structure in the data remains unexplained by the model. Overall, the model captures the level of the series, although the residual autocorrelation leaves room for improvement.

For the Lag-ACF plot, the autocorrelation function (ACF) measures the correlation between a time series and its own values at different lags, helping determine the appropriate order of the ARIMA model. From the resulting plot, it can be observed that the autocorrelation coefficients fluctuate around 0.6, indicating a strong positive correlation between values at successive lags. A high autocorrelation coefficient signifies that the current value of the time series is closely related to its past values. This persistence is the kind of structure that a time series model is designed to exploit, which supports using a time series approach for modeling and forecasting future observations.

Overall, the output provides valuable information for evaluating the performance of the ARIMA model, assessing its suitability for forecasting future values, understanding the behavior of the residuals, identifying temporal patterns in the stock’s closing prices, and exploring the relationships between variables. This information can support the decision-making process.

9.4 Evaluation of models

Analysis of the data:

Mean Squared Error (MSE): 0.7545285. MSE measures the average squared difference between predicted values and actual values. A lower MSE indicates that the model’s predictions are closer to the actual results. Because MSE is expressed in squared units of the series, it is best judged relative to the scale of the data; here the value of 0.7545285 indicates a certain level of accuracy in predicting stock data.

Root Mean Squared Error (RMSE): 0.868636. RMSE is the square root of MSE and expresses the typical prediction error in the same units as the series. As with MSE, a lower RMSE indicates better prediction capability. In this case, the RMSE value of 0.868636 shows that the model has relatively small prediction errors.

Mean Absolute Error (MAE): 0.8372229. MAE measures the average absolute difference between predicted values and actual values, and it is less sensitive to large individual errors than MSE. A lower MAE indicates smaller prediction errors. The value of 0.8372229 is relatively low, indicating a certain level of accuracy in predicting stock data.

Mean Absolute Percentage Error (MAPE): 16.25078%. MAPE measures the average percentage difference between predicted values and actual values. A lower MAPE indicates higher accuracy of the model’s predictions. In this case, the predictions deviate from the actual values by about 16% on average, which can be considered a relatively accurate prediction of stock data.
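
The four metrics above follow directly from their definitions. The sketch below is a Python illustration under the assumption that actual and predicted values are available as equal-length lists; the function name and the sample data are hypothetical, and this is not the code used to produce the reported values.

```python
from math import sqrt


def forecast_errors(actual, predicted):
    """Return MSE, RMSE, MAE and MAPE (in percent) for paired values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # MAPE is undefined when an actual value is zero.
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    return {"MSE": mse, "RMSE": sqrt(mse), "MAE": mae, "MAPE": mape}


# Hypothetical actual and predicted values, for illustration only.
actual = [10.0, 20.0, 40.0]
predicted = [12.0, 18.0, 40.0]
print(forecast_errors(actual, predicted))

# Consistency check on the reported values: RMSE should equal sqrt(MSE).
print(abs(sqrt(0.7545285) - 0.868636) < 1e-6)
```

The last line confirms that the reported RMSE of 0.868636 is the square root of the reported MSE of 0.7545285.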

Overall, this stock prediction model demonstrates a reasonable level of accuracy: the absolute error measures are small, and the predictions deviate from the actual values by about 16% on average.

10 Conclusion

In our machine learning-based stock prediction project, our goal was to develop a stock price forecasting model using machine learning techniques. To achieve this, we first gained a business understanding by recognizing the significance of stock trading in the financial market and how machine learning can enhance the accuracy of stock price predictions. Next, we carried out the data understanding step by collecting historical data for Tesla stock. We cleaned and structured the data, selected key features for analysis, and chose the ARIMA model for prediction. Through data understanding, preparation, analysis, and modeling, we built an ARIMA(0,0,5) model and evaluated its performance.

The results of our model showed reasonable accuracy in predicting stock prices, as indicated by the evaluation metrics including Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE). The MSE, RMSE, and MAE values were relatively low, indicating that our model’s predictions were close to the actual values. The MAPE value of 16.25078% also indicated a relatively accurate prediction of stock prices.

By utilizing machine learning techniques, our model provides investors and traders with a more systematic and objective approach to stock price prediction. This can help them make informed investment decisions, optimize asset allocation, mitigate risks, and increase returns. However, it is important to note that stock market prediction is a complex task, and various factors can influence stock prices. Therefore, it is always advisable to consider multiple sources of information and perform comprehensive analysis before making investment decisions.

Overall, our project contributes to the field of stock market forecasting by showcasing the effectiveness of machine learning models, specifically the ARIMA model, in predicting stock prices. Future research can explore other machine learning algorithms and incorporate additional features and data sources to further enhance the accuracy and reliability of stock price predictions.

References & Appendixes

Ganesan, A., & Kannan, A. (2021). Stock Price Prediction using ARIMA Model. International Research Journal of Engineering and Technology. Retrieved May 30, 2023, from www.irjet.net

Huang, J., & Liu, J. (2020). Using social media mining technology to improve stock price forecast accuracy. Journal of Forecasting, 39(1), 104–116. https://doi.org/10.1002/for.2616

Huang MJ, & Liu YXuan. (2019). Research on stock trend prediction method based on machine learning. Modern Salt Chemicals, 46(5), 3.

Li, X., Wu, P., & Wang, W. (2020). Incorporating stock prices and news sentiments for stock market prediction: A case of Hong Kong. Information Processing and Management, 57(5), 102212. https://doi.org/10.1016/j.ipm.2020.102212

Liu, H., Manzoor, A., Wang, C., Zhang, L., & Manzoor, Z. (2020). The COVID-19 Outbreak and Affected Countries Stock Markets Response. International Journal of Environmental Research and Public Health, 17(8), 2800. https://doi.org/10.3390/ijerph17082800

Lohrmann, C., & Luukka, P. (2019). Classification of intraday S&P500 returns with a Random Forest. International Journal of Forecasting, 35(1), 390–407. https://doi.org/10.1016/j.ijforecast.2018.08.004

Madhikermi, M., & Främling, K. (2019). Data discovery method for Extract-Transform-Load. https://doi.org/10.1109/icmimt.2019.8712027

Moghar, A., & Hamiche, M. (2020). Stock Market Prediction Using LSTM Recurrent Neural Network. Procedia Computer Science, 170, 1168–1173. https://doi.org/10.1016/j.procs.2020.03.049

Li, Y., Ni, P., & Chang, V. (2020). Application of deep reinforcement learning in stock trading strategies and stock forecasting. Computing, 102(6), 1305–1322. https://doi.org/10.1007/s00607-019-00773-w

O’Connor, N., & Madden, M. C. (2006). A neural network approach to predicting stock exchange movements using external factors. Knowledge Based Systems, 19(5), 371–378. https://doi.org/10.1016/j.knosys.2005.11.015

Pan, Y., Xiao, Z., Wang, X., & Yang, D. (2017). A multiple support vector machine approach to stock index forecasting with mixed frequency sampling. Knowledge Based Systems, 122, 90–102. https://doi.org/10.1016/j.knosys.2017.01.033

Rajihy, Y., Nermend, K., & Alsakaa, A. (2017). Back-propagation artificial neural networks in stock market forecasting. An application to the Warsaw Stock Exchange WIG20. AESTIMATIO: The IEB International Journal of Finance, 15, 88–99. https://dialnet.unirioja.es/descarga/articulo/6637821.pdf

Tounsi, S., & Maatoug, A. B. (2021). The GOLD market as a safe haven against the stock market uncertainty: Evidence from geopolitical risk. Resources Policy, 70, 101872. https://doi.org/10.1016/j.resourpol.2020.101872

Zhang, J., Cui, S., Xu, Y., Li, Q., & Li, T. (2018). A novel data-driven stock price trend prediction system. Expert Systems With Applications, 97, 60–69. https://doi.org/10.1016/j.eswa.2017.12.026

Zou, J., Zhao, Q., Jiao, Y., Cao, H., Liu, Y., Yan, Q., Abbasnejad, E., Liu, L., & Shi, J. (2022). Stock Market Prediction via Deep Learning Techniques: A Survey. arXiv (Cornell University). https://doi.org/10.48550/arXiv.2212.12717

Bustos, O. H., & Pomares-Quimbaya, A. (2020). Stock market movement forecast: A Systematic review. Expert Systems With Applications, 156, 113464. https://doi.org/10.1016/j.eswa.2020.113464

Bhandari, H. N., Rimal, B., Pokhrel, N. R., Rimal, R., Dahal, K. R., & Khatri, R. K. (2022). Predicting stock market index using LSTM. Machine Learning with Applications, 9, 100320.

Learning to Forget: Continual Prediction with LSTM. (2000, October 1). MIT Press Journals & Magazine | IEEE Xplore. https://ieeexplore.ieee.org/abstract/document/6789445

Gers, F. A., Schraudolph, N. N., & Schmidhuber, J. (2002). Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research, 3(Aug), 115–143.

Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735

Stock Prediction Model Based on Machine Learning

2023-06-15
