Introduction

Background

In recent years, the living standards of Chinese people have improved tremendously. More and more people hold surplus money and want it to grow, so an increasing amount of funds has been invested in the stock market. According to the Mobdata Institute, as of 2018 there were approximately 30 to 40 million active participants in the Chinese stock market, and they spent about one hour every day researching market trends. Approximately 70% of the participants are male, and a growing number of elderly people are taking part in trading. Many people have made their fortune in the market, but many have also lost considerable wealth.

In the financial market, information is the most valuable asset. The success of an investor largely depends on the quality of the information he obtains when making a decision and on the speed at which that decision can be turned into action. However, as financial trading systems modernize, a massive amount of information emerges, leaving investors unsure how to perform accurate analysis.

Fundamental analysis and technical analysis are the two major methods for analyzing financial products. Fundamental analysis was first proposed by Benjamin Graham in 1928. He argued that some fundamental features of a company, such as scale, liabilities and price-earnings ratio, essentially determine the proper value of its stock. Since then, a number of indices have been developed to help investors measure the fundamentals of a company quantitatively. Generally, fundamental analysis is more concerned with supply-demand equilibrium, macroeconomic factors, government policies and so on. It is more suitable for long-term analysis.

In contrast, technical analysis focuses more on the trading status of the stocks. From a technical analyst’s perspective, all the information about fundamental factors is already reflected in current and past prices, so subjective judgment is avoided. Prices, volumes and other trading information are used to identify patterns that forecast future prices. In this sense, technical analysts use past data to construct predictive models in the belief that history will repeat itself.

Although this method is more flexible, some researchers suggest that technical analysis cannot help investors form profitable strategies. In 1964, Godfrey, Granger and Morgenstern proposed that stock prices follow a random walk^[1]. According to this theory, stock prices are memoryless and thus historical prices cannot be used to forecast the future. The Efficient Market Hypothesis^[2] supports this theory because in an efficient market all the information about a stock is contained in its price and only new information can alter the price.

However, many investors and researchers have also had considerable success in forecasting stock prices using technical analysis^[3]. In 1995, Lui and Mole^[4] surveyed foreign exchange dealers in Hong Kong by questionnaire. The results suggest that 85% of the surveyed dealers use both fundamental and technical analysis, and that short-term investors tend to rely on technical analysis.

Forecasting financial markets has long been a challenging problem in both time series analysis and machine learning^[5]. In the past decades, numerous researchers have tried various methods to predict future market trends^[6], and considerable attention has been paid to constructing automatic trading systems based on the forecasts. The two most commonly used approaches are based on statistical models and on machine learning algorithms, respectively^[7]. Traditional statistical models usually assume that time series data are generated by a linear process^[8]. However, financial data are complex, noisy, nonlinear, dynamic and nonparametric in nature^[9]. Recently, machine learning methods have been applied to forecast financial time series^[10]. Many machine learning algorithms can automatically capture nonlinear relationships in the data without any prespecified statistical assumptions^[11].

The first attempt to forecast stock prices in this way was made in 1991 by Nottola, Condamin and Naim. They used a BP network and decision trees to analyze the fundamentals of thousands of companies and formed a stock-selection strategy based on accounting figures^[12]. Virili and Freisleben also implemented a BP neural network and successfully predicted the FAZ index in 2001^[13]. More recently, Francis concluded that the support vector machine is more accurate and efficient than the BP network in predicting the prices of futures^[14, 15].

Similar academic research began much later in China. In recent studies, Zhong Yuan used machine learning models to analyze the evaluation models in the financialization of copyrights^[16]. In 2016, Zhang Gui Sheng and Zhang Hong constructed the composite models SVM-GARCH and PCA-SVM, respectively, whose results are better than those of the single-method models.

Data Description

The Source of Data

All the data used in this paper are extracted from the CSMAR database. From April 8, 2005 to March 1, 2019, there are 3379 trading days in total. The dataset contains several basic trading figures for each day: the opening price, the closing price, the highest and lowest prices within the day, and the daily return.

From the above figure, it is clear that the CSI300 index has fluctuated considerably since its inception. The maximum the index reached is 5891.723 on 2007-10-17, which is 1.57 times the latest price of 3749.714 on 2019-03-01. During this period, the index ended with a positive return on 1810 days, about 54% of all trading days. The mean value of the index is about 2904.19 and its standard deviation is 989.51, which also suggests that the volatility of CSI300 is quite high. If we simply bought one share of the CSI300 index at the opening price every day since its inception and held all of them, how much could we earn?

From the figure above, it is clear that if we simply buy one share each day at the opening price, the return is above 0 most of the time, with an average return of 25.34%. The highest cumulative return, about 204.3%, was realized on 2007-09-03. As of the last date in the data, this strategy yields a return of 29.25%.
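The cumulative return of this buy-one-share-each-day benchmark can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, and the holdings are valued at each day's closing price, which the text does not specify.

```python
def buy_and_hold_returns(opens, closes):
    """Cumulative return after buying one share at every open and never selling."""
    returns = []
    total_cost = 0.0
    for day, (open_price, close_price) in enumerate(zip(opens, closes), start=1):
        total_cost += open_price            # one more share bought at the open
        market_value = day * close_price    # assumption: value holdings at the close
        returns.append(market_value / total_cost - 1.0)
    return returns


# Two days: buy at 10, then at 20; the index closes at 10, then at 30.
print(buy_and_hold_returns([10.0, 20.0], [10.0, 30.0]))
```

On the second day the two shares cost 30 in total and are worth 60, so the cumulative return is 100%.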

Feature Values Selection and Description

Since the basic dataset only includes 5 different entries about the status of the CSI300 index, there may be further structure hidden in these data that stays entangled unless it is made explicit. In order to capture the features of CSI300 as comprehensively as possible, I use the above-mentioned data to construct 13 feature values which evaluate different aspects of the index. In what follows, I denote the opening price, closing price, highest price, lowest price and daily return at date t as O_t, C_t, H_t, L_t and R_t, respectively.

Dispersion Index

\[ A/D_{t} = \frac {H_{t}-C_{t-1}} {H_{t}-L_{t}} \]

This feature measures the relative volatility of the CSI300 index within one trading day and the trend of price movement. The higher the absolute value of this figure, the more likely it is that the CSI300 index undergoes a relatively significant fluctuation on that day.

From the histogram above, it is clear that the majority of the data are clustered in a small region around the mean 0.48, so the distribution resembles a normal distribution. A mean close to 0.5 suggests that, most of the time, the previous day's closing price lies near the midpoint of the current day's highest and lowest values. It is also worth mentioning that there are some outliers with extremely high A/D values.

Commodity Channel Index

\[ CCI_{t} = \frac {M_{t}-SM_{t-1}} {0.015\times D_{t}} \] where \[ M_{t} = (H_{t}+L_{t}+C_{t})/3 \qquad SM_{t} = \sum_{i=0}^{n-1} M_{t-i}/n \qquad D_{t} = \sum_{i=0}^{n-1} |M_{t-i}-SM_{t}| \]

The commodity channel index measures the variation from the statistical mean. The index was proposed by Donald Lambert, who set the scaling constant to 0.015 for readability, so that the majority of CCI values fall between -100 and +100. CCI values are widely used to identify reversals. Values above +100 may indicate an overbought situation while readings below -100 could imply an oversold condition, which means that prices are likely to correct to more reasonable levels. I set n to 14, which is consistent with the default value on most trading platforms. Setting n to a relatively large value makes this figure smoother and less volatile.

From the figure above, we can observe that the distribution of the data seems to have two peaks and that the data are almost symmetric around the mean 1.07. The majority of the data cluster within the region of -10 to +10, which suggests that, according to the CCI, overbought or oversold situations are rare.

Williams %R

\[ \%R_{t} = \frac {H_{n}-C_{t}} {H_{n}-L_{n}}\times100 \]

where H_n and L_n denote the highest and lowest stock prices within n days and n is set to 14 in accordance with the Commodity Channel Index.

The Williams %R measures where the stock is currently trading within its recent range: near the high, near the low, or somewhere in between. A higher %R implies that today's closing price is very close to the lowest price within the period, and thus may serve as a signal to buy. Meanwhile, a lower value of %R can serve as a signal to sell the overbought stock.
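A minimal sketch of the %R computation is given below. It assumes that the n-day window includes the current day, and the function name and the use of None for days without enough history are choices made only for this illustration.

```python
def williams_r(highs, lows, closes, n=14):
    """Williams %R per day; None until n days of history are available."""
    out = []
    for t in range(len(closes)):
        if t + 1 < n:
            out.append(None)
            continue
        window_high = max(highs[t - n + 1 : t + 1])   # H_n: highest high in window
        window_low = min(lows[t - n + 1 : t + 1])     # L_n: lowest low in window
        span = window_high - window_low
        # Guard against a flat window, where the ratio is undefined.
        out.append(None if span == 0 else (window_high - closes[t]) / span * 100)
    return out


print(williams_r([10, 12, 11], [8, 9, 9], [9, 11, 10], n=3))
```

In the toy example the 3-day range is 8 to 12 and the last close is 10, exactly halfway, so the last value is 50.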

The distribution of Williams %R is almost uniform in the region between 25 and 100, and the most concentrated part of the dataset lies between 0 and 25. Because of the large number of observations with relatively small values, the mean 43.11 is smaller than 50, the midpoint of the range. The data also suggest that, for the CSI300 index, the constituent stocks are more often overbought than oversold.

Relative Strength Index

\[ RSI_{t} = 100-\frac {100} {1+RS_{t}} \]

where \[ RS_{t} =\frac {SMMA(Up_{t},n)} {SMMA(Down_{t},n)} \] Here Up_t and Down_t stand for the absolute value of the price increase or decrease relative to the previous day. SMMA is the exponentially smoothed moving average with smoothing factor α = 1/n, calculated as

\[ SMMA(X_{t},n) = (1-\alpha)\times SMMA(X_{t-1},n) + \alpha \times X_{t} \]

The relative strength index is designed to capture the recent and historical strength or weakness of a stock on the basis of the closing prices in the current trading period. This figure reflects the velocity and magnitude of price movements. A higher RSI suggests that the stock has had more or stronger positive price movements, whereas a lower RSI may indicate more or stronger negative price movements. Here, n is also set to 14, which is recommended by the proposer of this index, J. Welles Wilder, and is in accordance with the other indices.
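The recursion above can be sketched in a few lines. The seeding of the smoothed average with the first observation and the value 100 returned when there are no down moves are assumptions of this sketch, since the text does not state how these cases are handled.

```python
def smma(values, n):
    """Exponentially smoothed moving average with smoothing factor 1/n."""
    alpha = 1.0 / n
    smoothed = [values[0]]                 # assumption: seed with first observation
    for x in values[1:]:
        smoothed.append((1 - alpha) * smoothed[-1] + alpha * x)
    return smoothed


def rsi(closes, n=14):
    """RSI for each day from the second one on (the first day has no change)."""
    ups, downs = [], []
    for prev, cur in zip(closes, closes[1:]):
        change = cur - prev
        ups.append(max(change, 0.0))       # Up_t: size of an increase, else 0
        downs.append(max(-change, 0.0))    # Down_t: size of a decrease, else 0
    out = []
    for up, down in zip(smma(ups, n), smma(downs, n)):
        # With no smoothed down moves RS is infinite, so RSI is taken as 100.
        out.append(100.0 if down == 0 else 100.0 - 100.0 / (1.0 + up / down))
    return out


print(rsi([10.0, 11.0, 10.0], n=2))
```

A series that only rises gives an RSI of 100, one that only falls gives 0, and equal smoothed gains and losses give 50.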

As the distribution reveals, the RSI readings are relatively compact. Approximately 49.6% of the data fall above the average and 91.29% of the data fall between 25 and 75. Thus, the distribution of RSI values is quite symmetric, and the strength measured by this index is well balanced without many strong or weak extremes.

Raw Stochastic Value

\[ RSV_{t} = \frac {C_{t}-L_{n}} {H_{n}-L_{n}}\times100 \]

where L_n and H_n denote the lowest and highest stock prices within n days, and n is also set to 14.

This index measures the trading condition of the market and can signal opportunities to outperform other participants. Normally, if this index approaches 100, the current closing price is quite close to the highest price of recent days, so it is likely that the price will fall back to a more representative level. Meanwhile, an RSV value lower than 20 usually reveals an oversold situation, and the stock price is likely to recover soon.

The distribution of RSV values is highly left-skewed, with 39.3% of the data clustered in the uppermost 25% of the range. This implies that it is more common for the CSI300 index to be overbought and that a high proportion of the closing prices lie in a relatively high interval.

KDJ Index

\[ K_{t} = \frac {2}{3} K_{t-1} + \frac {1}{3}RSV_{t} \qquad D_{t} = \frac {2}{3} D_{t-1} + \frac {1}{3}K_{t} \qquad J_{t} = 3\times K_{t} - 2\times D_{t} \]

where K₁ and D₁ are set to be 50.

The KDJ index was first introduced by George Lane to generate insights into the short-run trading status of a stock. The values of K, D and J can each serve as an indicator of overbought or oversold conditions. Taking J as an example, it is commonly believed that a J value higher than 100 indicates an overbought condition and a J value lower than 0 indicates an oversold condition.
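The K, D and J recursion can be illustrated with the short sketch below. It reads the initial condition as K and D being fixed at 50 on the first day, with the recursion applied from the second day on; the function name is hypothetical.

```python
def kdj(rsv, k_init=50.0, d_init=50.0):
    """Return the K, D and J series for a list of RSV values."""
    ks, ds = [k_init], [d_init]
    js = [3 * k_init - 2 * d_init]
    for value in rsv[1:]:
        k = 2 / 3 * ks[-1] + 1 / 3 * value   # K_t smooths RSV_t
        d = 2 / 3 * ds[-1] + 1 / 3 * k       # D_t smooths K_t
        ks.append(k)
        ds.append(d)
        js.append(3 * k - 2 * d)             # J_t amplifies the gap between K and D
    return ks, ds, js


ks, ds, js = kdj([50.0, 100.0])
print(round(ks[1], 2), round(ds[1], 2), round(js[1], 2))
```

Because J equals K plus twice the gap between K and D, it can leave the 0 to 100 range, which is why the thresholds of 100 and 0 are meaningful for J.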

From the figures above, it is clear that the distributions of all three indices seem to have two humps, with most of the data clustering in the top and bottom quarters of the domain. The three distributions are also very similar to each other, with means that are all approximately 56.9. According to the criteria given above for the J index, the CSI300 index is overbought on 12.57% of the trading days and oversold on 7.4% of them.

Momentum Index

\[ Momentum_{t} = C_{t} - C_{t-n} \]

where n is still set to 14.

As its name suggests, momentum index measures the momentum or the stamina that drives the stock price towards a specific direction. In general, if the trend of momentum index and that of the stock price diverge, then the current tendency that the stock price follows may potentially reverse and the investors should make adjustments according to this situation.

It is clear that Momentum is highly concentrated around the mean 9.92. There are some extreme values on both the negative and positive sides, which suggests that there are times when the market suffers great turbulence. For example, from 2015-06-15 to 2015-07-03, the CSI300 index slumped by 1335.25 points, which accounts for approximately 25.57% of its original value. Conversely, from 2007-07-19 to 2007-08-07, the index jumped 24.1% to 4724.55. However, for the majority of the time, the Momentum, namely the difference between closing prices 14 days apart, lies between -81.73 and 113.75.

N-Day Weighted and Simple Average

\[ WA_{t} = \frac {n\times C_{t} + (n-1)\times C_{t-1} + ... + 1\times C_{t-n+1}} {n + (n-1) + ... + 1} \qquad SA_{t} = \frac {C_{t} + C_{t-1} + ... + C_{t-n+1}} {n} \]

where n is set to 14.

The weighted and simple averages both represent the running average of the stock price over several days. They indicate a relatively reasonable stock price for the time period, and it is plausible that the stock price will continue to fluctuate around this value in the coming days.
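The two averages can be sketched as below, assuming that the window covers the n most recent closing prices (C_t back to C_{t-n+1}) and that the newest price receives the largest weight. Function names are illustrative only.

```python
def simple_average(closes, n=14):
    """Unweighted mean of the n most recent closes; None until n days exist."""
    out = []
    for t in range(len(closes)):
        window = closes[max(0, t - n + 1) : t + 1]
        out.append(sum(window) / n if len(window) == n else None)
    return out


def weighted_average(closes, n=14):
    """Linearly weighted mean: weight n for the newest close down to 1 for the oldest."""
    denominator = n * (n + 1) / 2            # n + (n-1) + ... + 1
    out = []
    for t in range(len(closes)):
        window = closes[max(0, t - n + 1) : t + 1]
        if len(window) < n:
            out.append(None)
            continue
        weighted_sum = sum(w * c for w, c in zip(range(1, n + 1), window))
        out.append(weighted_sum / denominator)
    return out


print(simple_average([1.0, 2.0, 3.0], n=3), weighted_average([1.0, 2.0, 3.0], n=3))
```

In a rising market the weighted average sits above the simple average because recent, higher prices count more, which is the pattern the comparison of the two series picks up.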

As expected, the distributions of the weighted average and the simple average are quite similar, with the weighted average larger than the simple average in 55.91% of the cases. The mean absolute difference between the two indices is 26.45.

CR Index

\[ CR_{t} = \frac {\sum_{i=0}^{n-1} (H_{t-i} - M_{t-i})} {\sum_{i=0}^{n-1} (M_{t-i} - L_{t-i})} \times 100 \]

where \[ M_{t} = (H_{t}+L_{t}+C_{t})/3 \] and n is set to 14.

The CR index measures the energy a stock has to boost itself. It uses the mid-price as the criterion to evaluate the size of the driving forces. It also signals the balance between long and short traders in the market. A lower CR value means that the mid-price is very close to the highest price within the period and relatively far away from the lowest value. Thus, it indicates that the stock may be oversold and that the stock price may increase to a more reasonable value, which means that it is time to buy. By similar reasoning, it is better to sell the stock when the CR value is high.

From the histogram, it is clear that the distribution of CR is also bell-shaped, with most values clustered around the mean 93.53 and few extreme values. According to this criterion, there are not many cases in which the component stocks of the CSI300 index are overbought or oversold.

Bias

\[ Bias_{t} = \frac {C_{t} - \frac{1}{n}\sum_{i=0}^{n-1} C_{t-i}} {\frac{1}{n}\sum_{i=0}^{n-1} C_{t-i}} \times 100 \]

where n is set to 14.

Bias is utilized to capture the deviation of the closing price from the average closing price within days. Normally, when Bias is high, it is probable that the current trend of the price movement will be reversed. For example, when the Bias is too high, it indicates that the current price is significantly higher than the average and the stock is overbought, thus an opportunity to sell.

As the figure reveals, the distribution of Bias is very similar to a normal distribution. For most of the data points, the Bias is relatively small, which implies that in most cases the current closing price does not deviate much from its 14-day average. However, there are several outliers where the closing price diverges from the existing trend. This conclusion is also consistent with the result for Momentum.

Average Return — 5-day and 14-day return

\[ 5\_DR_{t} = \frac {C_{t} - C_{t-5}} {C_{t-5}} \qquad 14\_DR_{t} = \frac {C_{t} - C_{t-14}} {C_{t-14}} \]

For the return features, I use the 5-day return to represent the short-run performance of the stock and the 14-day return to represent the mid-run performance.

As we can see, both the 5-day return and the 14-day return are clustered around 0, and their averages are extremely close to 0. Over all the data investigated, the 14-day return is on average 0.5% higher than the 5-day return, which indicates that CSI300 may perform worse in the short run.

Features Summarization and Further Preprocessing

Feature	Maximum	Minimum	Mean	Standard Deviation
A/D	6.59	-1.3	0.48	0.46
CCI	25.42	-28.25	1.07	8.7
%R	100	0	43.11	32.62
RSI	93.04	16.13	53.04	14.16
RSV	100	0	56.89	32.62
K	99.31	3.43	56.86	28.22
D	98.25	5.53	56.84	26.25
J	123.76	-23.28	56.91	37.93
Momentum	917.55	-1335.25	9.92	219.32
WA	5710.35	840.89	2908.21	982.83
SA	5674.73	842.17	2906.56	983.17
CR	135.08	59.4	93.53	12.06
Bias	15.37	-20.39	0.29	3.76
5-Day Return	0.16	-0.22	0	0.04
14-Day Return	0.28	-0.26	0.01	0.07

From the summary statistics above, it is clear that the distributions of these features are disparate, with the data falling in various ranges. In terms of absolute values, some figures are many times larger than others. If these data are used directly, the machine learning models are prone to be dominated by the features with the largest values, and patterns contained in other features tend to be buried. The solution is to rescale the features by shrinking or expanding their ranges so that each contributes relatively equally to the model. Thus, I apply min-max normalization in the following way.

\[ X_{new} = \frac {X - X_{min}} {X_{max} - X_{min}} \]

Essentially, the process subtracts the minimum of feature X from each value and divides by the range of X. The normalized values can be interpreted as indicating how far, from 0 percent to 100 percent, the original value falls along the range between the original minimum and maximum. All the values are mapped into the range between 0 and 1.
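As a small illustration of the rescaling, the sketch below normalizes one feature column. Mapping a constant column to zeros is an assumption made here to avoid division by zero; the text does not cover that case.

```python
def min_max_normalize(values):
    """Map values linearly onto [0, 1] using the column's own minimum and maximum."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]   # assumption: a constant feature becomes all zeros
    value_range = highest - lowest
    return [(v - lowest) / value_range for v in values]


print(min_max_normalize([2.0, 4.0, 6.0]))
```

Because the map is linear, the mean of a normalized feature equals the normalized original mean. For A/D, for instance, (0.48 + 1.3) / (6.59 + 1.3) is about 0.23, which agrees with the normalized statistics reported for that feature.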

Normalized Feature	Maximum	Mean	Standard Deviation
A/D	1	0.23	0.06
CCI	1	0.55	0.16
%R	1	0.43	0.33
RSI	1	0.48	0.18
RSV	1	0.57	0.33
K	1	0.56	0.29
D	1	0.55	0.28
J	1	0.55	0.26
Momentum	1	0.6	0.1
WA	1	0.42	0.2
SA	1	0.43	0.2
CR	1	0.45	0.16
Bias	1	0.58	0.11
5-Day Return	1	0.58	0.09
14-Day Return	1	0.49	0.12

The table above shows the normalized statistics. In the following chapters, all the data are normalized, unless otherwise specified, before being used to train and test the models.

Apart from these features, the opening value, highest value, lowest value and return rate of the CSI300 index on each day are also included in the dataset, and they all serve as independent variables. The variable to be predicted is the closing price.

Calculating the features and preprocessing the data generates some missing values, because some features require earlier data, normally 13 days back in this paper. After eliminating these missing values, there are 3365 observations in the data set. A quarter of the observations are held out to test the models and the rest are used for training. The training observations are reordered randomly so that the predictive power of the algorithm does not depend on the ordering of the data. The test observations are not reordered, so that a consecutive period of the market can be used to evaluate model performance.
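The split described above might look like the following sketch. It assumes that the held-out quarter is the most recent part of the sample, which matches the consecutive test period used later, and the function name and the fixed seed are illustrative choices.

```python
import random


def chronological_split(rows, test_fraction=0.25, seed=0):
    """Hold out the most recent rows for testing and shuffle only the training rows."""
    n_test = int(len(rows) * test_fraction)
    split_point = len(rows) - n_test
    train = list(rows[:split_point])
    test = list(rows[split_point:])        # kept in chronological order
    random.Random(seed).shuffle(train)     # reorder the training rows only
    return train, test


train, test = chronological_split(list(range(8)))
print(sorted(train), test)
```

Keeping the test rows in time order is what allows the predictions to be plotted against a continuous stretch of the index.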

Training and Evaluating the Models

Regression Tree

Training Models

One distinctive advantage of decision tree models is that the data do not need to be pre-processed, because the independent variables are only used to partition the data and do not enter any numerical calculation. Thus, for regression trees I use the original data without normalization.

The diagram above presents the resulting regression tree. Lowest and Highest at each node stand for the lowest and highest value of CSI300 in a single day. Although there are many other variables, the algorithm determines that these two features are the most useful in partitioning the data and generates a relatively simple strategy to predict the average stock price. For example, if the lowest value in a day is less than 2893 and the highest value is less than 1776, then the predicted CSI300 index is 1123, a case which accounts for 16% of all observations. Meanwhile, if the lowest value falls in the same region but the highest value is greater than or equal to 2453 and lower than 4244, the predicted CSI300 index is 2637, a case which makes up 24% of the dataset.

This strategy seems excessively simple. Let us compare the predicted results with the actual trends in the market.

The red curve above represents the actual values CSI300 index reached from 2015-09-16 to 2019-02-28. The green dashed line is the predicted value by the regression tree.

From this plot, the prediction results seem to be quite reasonable and most of the major trends of the reality can be captured by the model.

Evaluating Performance

Two measures are used for evaluation in this paper. The first is the correlation between the predicted values and the actual values: a higher correlation means that the predicted values track the actual ones more closely. The second is the Mean Absolute Error (MAE), calculated as:

\[ MAE = \frac {1}{n} \sum_{i=1}^{n} |e_{i}| \]

where n indicates the number of predictions and e_i indicates the error for prediction i.
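Both evaluation measures are easy to compute directly. The sketch below uses the Pearson correlation coefficient, which is an assumption since the text does not name the type of correlation, and takes the absolute value of each error before averaging.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of the absolute prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


actual = [1.0, 2.0, 3.0]
print(mean_absolute_error(actual, [2.0, 2.0, 5.0]))
print(pearson_correlation(actual, [2.0, 4.0, 6.0]))
```

The two measures are complementary: a prediction that is a scaled or shifted copy of the actual series has a correlation of 1 but can still have a large MAE.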

To make the evaluation more reliable and objective, I use the technique known as k-fold cross-validation (or k-fold CV), which has become the standard approach for estimating model performance. If we randomly select a single portion of the data to train or test the model, the estimate may be biased by unexpected patterns in the chosen subset. Rather than taking repeated random samples that could potentially use the same observation more than once, k-fold CV randomly divides the data into k partitions called folds.

Although k can be set to any number, the most common convention is to use 10-fold cross-validation, because empirical evidence suggests that adding more folds brings only marginal benefit. In 10-fold CV, for each of the 10 folds (each consisting of 10% of the overall data), a machine learning model is built on the remaining 90% of the data, and the 10% of the data in that fold is then used for model evaluation. After this process has been repeated 10 times, the evaluation results of the folds are averaged to generate the final report.
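The fold construction can be sketched as follows. The function name, the fixed seed and the round-robin assignment of shuffled indices to folds are choices made for this illustration and are not taken from the paper.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]     # k disjoint folds of near-equal size
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


for train_idx, test_idx in k_fold_indices(20, k=10):
    # A model would be fitted on train_idx and scored on test_idx here;
    # the k scores are then averaged.
    assert len(train_idx) == 18 and len(test_idx) == 2
```

Every observation is used for testing exactly once and for training k - 1 times, which is what distinguishes this scheme from repeated random sampling.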

Based on 10-fold CV, the correlation between the predictions and the real values is 0.9763, which is high enough to conclude that the predictions closely follow the actual trends in the market. Meanwhile, the mean absolute error is about 165, which is 5.67% of the mean closing price of the CSI300 index. Considering the simplicity of the model, this result is encouraging, and further improvements can be made.

Model Tree

Training The Model

Because linear regression models are used in the algorithm, the data used in training process are normalized.

In the figure, the red line represents the actual values and the green dashed line represents the predicted values.

As we can see, the two lines almost overlap, which suggests that model trees may possess excellent predictive power in this specific area. To present the result more clearly, separate plots of the actual and predicted values are also shown below.

Another important point is that the model tree algorithm automatically partitions the data into many subgroups and builds a multiple linear regression model within each subgroup, which makes the result easy to interpret. In this process, the algorithm chooses the most relevant factors for each subgroup. Hence, the factors with the largest influence on the closing price of the CSI300 index are selected for prediction. For example, at one specific leaf node, the following multiple linear regression model is used:

\[ Close = -831.52 \times Open + 3221.93 \times Highest + 2654.91 \times Lowest + 335.08 \times Return - 264.83 \times A/D_{t} - 14.18 \times CCI_{t} + 12.69 \times RSI_{t} + 708.28 \]

Evaluating Performance

Based on 10-fold CV, the average correlation over the ten folds is 0.9982 and the mean absolute error is 37.56, which is only 3.82% of the standard deviation of the CSI300 index (983.6).

Compared with regression trees, model trees generate predictions that are slightly more correlated with the real data. Although the improvement in correlation is small, the mean absolute error has been reduced dramatically, to less than 4% of the standard deviation. From my perspective, this level of accuracy is sufficient to construct corresponding trading strategies, which will be discussed in the following chapters.

Neural Network

Training the Model

In the figure, the green line represents the predicted values and the red line depicts the actual CSI300 index values.

As we can see, the neural network does not fully capture the characteristics contained in the training data, and the range of the predicted values (from 3461.75 to 3951.24) is much smaller than that of the actual data (from 2853.756 to 4389.885).

Evaluating Performance

Under 10-fold CV, the average correlation between the predicted values and the actual CSI300 index values is about 0.9981 and the average MAE is about 83.31, which is roughly 8.42% of the standard deviation of the CSI300 index. This result appears much better than the plot shown above suggests. The improvement may be due to the fact that in cross-validation 90% of the data are used to train the model in each fold, while only 75% of the data are used for training in the previous section. Thus, the overall performance of the neural network is also quite good, but its accuracy is worse than that of the model tree.

Support Vector Machine

Training the Model

In the figure, the green dashed line represents the predicted values and the red line denotes the actual values of CSI300.

By trial and error, I find that the simple linear kernel yields the most accurate result. As shown above, the predicted results almost overlap with the actual data, and the SVM algorithm captures almost all the major fluctuations. However, we can also observe from the figure that in some situations when the index is decreasing, the predicted values decrease somewhat more. In fact, about 59.05% of the predicted values fall below the real ones. Hence, the SVM with a linear kernel tends to give more conservative predictions in this situation.

Evaluating Performance

Based on 10-fold cross-validation, the average correlation between the predicted values and the real values is about 0.9982, and the average mean absolute error over the ten folds is 39.68, which is only 4.01% of the standard deviation of the CSI300 index. So far, the support vector machine is the second-best model for predicting the CSI300 index, after the model tree.

Trading Strategies

From the previous chapter, we can clearly observe that the model tree and the support vector machine are the two most suitable models for this particular problem. Thus, in this chapter, I develop trading strategies based on these two models.

Buy and Sell a Fixed Amount Each Day

In this strategy, one share of the CSI300 index is bought if the predicted price is higher than the current price, and all shares are sold if the price is predicted to fall. Short selling is not permitted, in line with the real situation in the Chinese stock market, which means that we cannot sell the stock if we do not hold any. Since only the closing price is predicted here, I assume that all transactions happen at the end of each day. In other words, all trades are settled at the closing price.
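The trading rule can be simulated with a few lines of code. The sketch below makes several assumptions that the text does not state: the account starts with zero cash and may go negative when buying, there are no transaction costs, and the reported profit on each day is cash plus the market value of the shares held. All names are illustrative.

```python
def simulate_fixed_amount(closes, predicted_next):
    """Daily profit of the buy-one / sell-all rule, trading at the close.

    closes[t] is the close of day t; predicted_next[t] is the forecast made
    on day t for the close of day t + 1.
    """
    cash, shares = 0.0, 0
    profit = []
    for price, forecast in zip(closes, predicted_next):
        if forecast > price:
            shares += 1                  # buy one share at today's close
            cash -= price
        elif forecast < price and shares > 0:
            cash += shares * price       # sell everything; no short selling
            shares = 0
        profit.append(cash + shares * price)
    return profit


print(simulate_fixed_amount([10.0, 12.0, 11.0], [12.0, 11.0, 11.0]))
```

In the toy run, a share is bought at 10, sold at 12 when a fall is forecast, and nothing happens on the last day, so the profit ends at 2.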

Based on the prediction of CSI300 given by the model tree, I implement the above trading strategy from 2015-09-16 to 2019-02-28, the period over which the previous models were tested. Within this period, the average profit is about 1.223411 × 10^4. The maximum return is roughly 2.595579 × 10^4, realized on 2019-02-26, and the maximum loss is approximately 0, reached on 2015-09-16. Most importantly, the overall profit is positive about 99.64% of the time and the overall account balance is 0 in about 0.36% of the cases, so none of the 840 days ends with a loss. Because there are no losing days, an average loss is undefined.

The average number of shares held is slightly above 1 and the average amount of money used to buy the shares is about 4300, so the average return is approximately 158.3%. Meanwhile, the average return of the CSI300 index is only 0.03% during this period. Because no transactions are conducted on a large proportion of days under this strategy, the funds can be invested in other liquid assets and earn decent returns on those days. When we only consider the days with transactions, the average return is 122.48% and the Sharpe ratio is about 0.17.

Compared with the average profit, this level of loss is entirely acceptable, let alone that losses occur with negligible probability.

Meanwhile, to verify the reliability of the results, I compare the predicted values with the actual ones in terms of trends, that is, the direction of price movement. The result is summarized in the following table, where “Up” and “Down” stand for upward and downward price movements, respectively.

Price Movement	Up_actual	Down_actual
Up_predicted	408	58
Down_predicted	33	339

It is clear that the predicted trends fit the real cases well, and the total error rate is about 0.11. Among the wrong predictions, the average deviation of the predicted values from the actual ones is about 7.9, which is fairly small compared to the standard deviation of the prices in the same period. Given the strategy above, the more serious error happens when the price is predicted to rise but actually falls. In this case, we buy stocks that lose value on the next day. However, since the average number of shares held on the 58 days when this error arises is only 1.55, the losses are limited to an acceptable range.
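The error rate quoted above follows directly from the table of predicted and actual movements, as the short check below shows. The dictionary layout, keyed by (predicted, actual), is just a convenient representation for this sketch.

```python
# Counts from the table, keyed by (predicted movement, actual movement).
confusion = {
    ("up", "up"): 408,
    ("up", "down"): 58,
    ("down", "up"): 33,
    ("down", "down"): 339,
}

total = sum(confusion.values())
errors = sum(count for (pred, act), count in confusion.items() if pred != act)
error_rate = errors / total

# Share of "up" predictions that turned out wrong, the costly case for this strategy.
false_up_rate = confusion[("up", "down")] / (
    confusion[("up", "up")] + confusion[("up", "down")]
)

print(total, errors, round(error_rate, 2), round(false_up_rate, 3))
```

Of the 91 wrong calls, 58 are predicted rises followed by actual falls, which is the case in which the strategy buys a share just before the price drops.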

In contrast, if we simply buy one share every day at the closing price, how much profit could we obtain?

To make the results comparable, I convert the profit to a per-share basis. It is clear that the simple buy-one-share-each-day strategy yields highly volatile returns. Although the average return is above 0, the return is below 0 39.74% of the time, which poses a serious risk. In contrast, the trading strategy based on the predictions of the model tree yields an average return of 4309.78 per share with very limited downside risk.
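One way to sketch the buy-one-share-each-day baseline on a per-share basis is shown below. The text does not spell out the exact accounting, so this assumes that the position is marked to market at each day's closing price and that the per-share profit is the unrealized profit divided by the number of shares held so far. The function name and the closing prices are hypothetical.

```python
from itertools import accumulate


def buy_one_share_daily(closes):
    """Per-share unrealized profit after each day.

    Assumption: one share is bought at every closing price and the whole
    position is valued at the same day's close.
    """
    cumulative_cost = accumulate(closes)
    per_share = []
    for shares, (price, cost) in enumerate(zip(closes, cumulative_cost), start=1):
        per_share.append((shares * price - cost) / shares)
    return per_share


# Hypothetical closing prices, for illustration only.
closes = [100.0, 102.0, 99.0, 101.0]
profits = buy_one_share_daily(closes)

# Fraction of days on which the per-share profit is negative.
share_below_zero = sum(p < 0 for p in profits) / len(profits)
print(profits, share_below_zero)
```

Applied to the actual closing prices, the last quantity would correspond to the share of days with a negative return discussed above.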

Thus, it is safe to conclude that, using the predictions of the model tree algorithm, we can construct a trading strategy that yields a positive return with almost no risk over the test period.


Draft

Jay
