Copyright © 2017 by Danilo Martinez. Information gathered from public sources to explain concepts related to this project has been credited, but the findings, methods, and ideas discovered are the sole property of Danilo Martinez. All rights reserved. No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the publisher. For permission requests, write to the publisher, addressed "Attention: Danilo Martinez," at djmartinez21@knights.ucf.edu.

BACKGROUND: Predicting the stock market is the act of trying to determine the future value of a company's stock, or other financial instrument, traded on an exchange. The successful prediction of a stock's future price could yield significant profit. The efficient market hypothesis suggests that stock prices reflect all currently available information, and that any price changes not based on newly revealed information are therefore inherently unpredictable. Others disagree, and those with this point of view have a variety of methods and technologies that supposedly allow them to gain future price information. The efficient market hypothesis says that stock prices are a function of information and rational expectations, and that newly revealed information about a company's prospects is almost immediately reflected in the current stock price. This would imply that all publicly known information about a company, which obviously includes its price history, would already be reflected in the current price of the stock. Accordingly, changes in the stock price reflect the release of new information, changes in the market generally, or random movements around the value that reflects the existing information set. Burton Malkiel, in his influential 1973 work A Random Walk Down Wall Street, claimed that stock prices could therefore not be accurately predicted by looking at price history. As a result, Malkiel argued, stock prices are best described by a statistical process called a "random walk," meaning each day's deviations from the central value are random and unpredictable. This led Malkiel to conclude that paying financial services professionals to predict the market actually hurts, rather than helps, net portfolio return. A number of empirical tests support the notion that the theory applies generally, as most portfolios managed by professional stock predictors do not outperform the market average return after accounting for the managers' fees.

GOAL: The goal of this project is to use a mixed method of time series analysis and boosted regression trees to predict the direction of the closing value relative to the opening value for each period. We will attempt to establish a relationship between time intervals and variables related to a stock for this purpose. We will use strictly numerical values in our analysis and no outside influences; our focus is on the end-of-period figures for a stock, and these values alone will be the basis for prediction. The intent is to buy or short the stock for a profit as often as possible, or during the intervals with the highest profitability. Data preparation and visual analysis are the foundation for this study. Due to the uncertainty of the stock market and limited data availability, this study reflects only a specific time period.

SUMMARY: Using boosted regression trees on data collected at equal time intervals, I was able to generate an average trade return of 0.8347704%, with an overall accuracy of 60%. I used the predicted values to conduct a trade in that direction, in effect converting this regression problem into a binary one. At the end of each interval, we aim to collect the highest profit possible. Although we are interested in overall accuracy, the average return figure is even more important for a productive trading strategy. If returns are reinvested or compounded, this average trade return can support a successful strategy. We were able to identify a successful frequent-trading strategy for the stock chosen.
The time series analysis overfit the training data and did not perform as well as the AdaBoost-style model, which used a time series component for the collection of data.
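
As a rough illustration of the compounding remark above (a sketch only; the 130 trades below simply mirror the size of the test set used later and are an assumption, not a reported result), reinvesting the quoted average per-trade return would grow one dollar to roughly 2.9 dollars, an approximately 195% cumulative gain:

#Illustration only: compounding the quoted average per-trade return over a hypothetical run of trades
avg_return <- 0.008347704     #quoted average return per trade (0.8347704%)
n_trades <- 130               #hypothetical number of trades (matches the test set size used later)
(1 + avg_return)^n_trades     #growth of $1 if every trade earned the average return and gains were reinvested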

ANALYSIS: A time series is a set of observations on the values that a variable takes at different times. If we do not know, or do not have, related predictor values x and only have the values of y, we can use time series methods.
Yt = B*Yt-1 + Et
We can use past values of the dependent variable as the independent variable, and the time intervals should be equal. A univariate time series is a time series consisting of a single variable recorded over regular time intervals. Regression uses many variables at one time point; a time series uses one variable at many time points, and depending on the frequency of the data, patterns emerge. The patterns are:
Trend: a long-term, relatively smooth pattern that usually persists for more than a year.
Seasonal: a pattern that appears at a regular interval, with a frequency within a year or even shorter.
Cyclical: a repeated pattern that appears with a frequency beyond one year; cycles are rarely regular.
Random: what is left after all of the above patterns have been extracted; no pattern remains.
White noise is a series that is purely random in nature: it has a mean of zero, a constant variance, and uncorrelated observations. There is no pattern, so no forecasting is possible; the best forecast is the average. A stationary series is a series whose marginal distribution of Y at any time t, P(Yt), is the same as at any other point in time. This implies that the mean, variance, and covariance of the series Yt do not depend on time.
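
To make these definitions concrete, here is a small illustrative sketch on simulated data (not the stock data analysed later), comparing a white noise series with a simple AR(1) series:

#Illustration on simulated data: white noise vs. an AR(1) series
set.seed(123)
wn <- ts(rnorm(200))                               #white noise: mean 0, constant variance, uncorrelated
ar1 <- arima.sim(model = list(ar = 0.7), n = 200)  #AR(1): Yt = 0.7*Yt-1 + Et
par(mfrow = c(2, 2))
plot.ts(wn, main = "White noise")
acf(wn)                                            #no significant autocorrelation at any lag
plot.ts(ar1, main = "AR(1) with coefficient 0.7")
acf(ar1)                                           #autocorrelation decays geometrically
par(mfrow = c(1, 1))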

*Insert pic of stationary time series
All time series should be stationary, but sometimes they are not, and we need to make them stationary by differencing. Differencing computes the differences between consecutive observations. Transformations such as logarithms can help to stabilize the variance of a time series. Differencing can help stabilize the mean of a time series by removing changes in its level, eliminating trend and seasonality. Why stationarity? 1) The results of the theory are derived under the assumption that the variables are stationary. 2) The techniques are valid only when the data are stationary. 3) Sometimes autocorrelation arises simply because the time series is non-stationary. 4) Non-stationary time series may show a significant relationship between two variables when there should not be any.
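
As a quick sketch of differencing on simulated data (an illustration only), a random walk with drift is clearly non-stationary, while its first differences are:

#Illustration on simulated data: removing a trend by differencing
set.seed(42)
trend_series <- ts(cumsum(rnorm(200, mean = 0.5)))   #random walk with drift: non-stationary, trending upward
diffed <- diff(trend_series)                         #first differences: Yt - Yt-1
#log() could be applied first if the variance grew with the level of the series
par(mfrow = c(1, 2))
plot.ts(trend_series, main = "Original (trending)")
plot.ts(diffed, main = "After first differencing")
par(mfrow = c(1, 1))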

One of the ways of identifying a non-stationary time series is the ACF plot. For a stationary time series, the ACF drops to zero relatively quickly, while the ACF of non-stationary data decreases slowly. The autocorrelation function (ACF) describes how the observations in a time series are related to each other. It is measured by the simple correlation between the current observation Yt and the observation p periods before it, Yt-p, where p is the number of lags. The partial autocorrelation function (PACF) measures the degree of association between Yt and Yt-p when the effects of the intermediate time lags 1, 2, 3, ..., (p-1) are removed. For example, the partial autocorrelation between Y5 and Y1 is computed after the effects of Y4, Y3, and Y2 are removed.

We need the ACF and PACF to know which past values are most related to the current value; they tell us how many lags we need for forecasting. Once we determine how many lags we need, we can focus on choosing the right model. There are four basic types of models: AR, MA, ARMA, and ARIMA. 1) AR(p), or autoregressive model, is one in which Yt depends only on its own past values, e.g. Yt = B0 + B1*Yt-1 + B2*Yt-2 + ... + Bp*Yt-p + Et. 2) MA(q), or moving average model, forecasts using past error terms. The error terms Et are assumed to be a white noise process with mean zero and constant variance, Yt = f(Et, Et-1, Et-2, ...); we essentially replace the past values in AR with error terms, e.g. Yt = B0 + Et + O1*Et-1 + O2*Et-2 + ... + Oq*Et-q. 3) ARMA(p,q), or autoregressive moving average model, depends on its own p past values and q past white noise disturbances, e.g. Yt = B0 + B1*Yt-1 + ... + Bp*Yt-p + Et + O1*Et-1 + ... + Oq*Et-q. 4) ARIMA(p,i,q), or autoregressive integrated moving average model, is the ARMA model where the "I" for integrated indicates that the data values have been replaced with the differences between the values and the previous values.
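
For illustration (again on simulated data, not the stock series), we can simulate an ARMA(1,1) process and recover its coefficients with the arima function; the order argument c(p, i, q) maps directly to the models above:

#Illustration on simulated data: simulating and fitting an ARMA(1,1) model
set.seed(7)
arma_sim <- arima.sim(model = list(ar = 0.6, ma = 0.4), n = 300)  #Yt = 0.6*Yt-1 + Et + 0.4*Et-1
fit <- arima(arma_sim, order = c(1, 0, 1))                        #order = c(p, i, q); i = 0 means no differencing
fit$coef                                                          #estimated AR and MA coefficients (should be roughly 0.6 and 0.4)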

To determine which model to use, we can examine the general characteristics of the ACF and PACF:
AR(p): ACF spikes decay towards 0; PACF spikes cut off to 0 after lag p.
MA(q): ACF spikes cut off to 0 after lag q; PACF spikes decay towards 0.
ARMA(p,q): ACF spikes decay towards 0; PACF spikes decay towards 0.
The model with the lowest AIC/BIC is chosen as the best model. We can also check by plotting the ACF of the residuals: if most of the sample autocorrelation coefficients of the residuals lie within the limits of the 95% interval, the residuals are white noise and the model is appropriate. To summarize, the time series process is: 1) Determine whether the series is white noise: if yes, we cannot do time series analysis; if no, we move forward. 2) Determine whether it is stationary: if not, make it stationary; if yes, move forward. 3) Apply a model: select from AR, MA, ARMA, ARIMA.
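
A sketch of this selection process on simulated data (an illustration, not the stock series): fit a few candidate models, compare their AIC values, and check that the residuals of the chosen model look like white noise.

#Illustration on simulated data: comparing candidate models by AIC and checking residuals
set.seed(7)
y <- arima.sim(model = list(ar = 0.6, ma = 0.4), n = 300)
fit_ar <- arima(y, order = c(1, 0, 0))
fit_ma <- arima(y, order = c(0, 0, 1))
fit_arma <- arima(y, order = c(1, 0, 1))
AIC(fit_ar); AIC(fit_ma); AIC(fit_arma)                   #choose the model with the lowest AIC
acf(resid(fit_arma))                                      #residual ACF should show no significant spikes
Box.test(resid(fit_arma), lag = 20, type = "Ljung-Box")   #large p-value: residuals look like white noise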

We prepare the data by ordering it in equal time intervals, removing unnecessary columns, organizing the lag relationships accurately, and converting the data into a time series for analysis.

#Setting working Directory to import files
setwd("C:/Users/Me/OneDriveLatestData/OneDrive - University of Central Florida - UCF/DataMining 2")
#Reading in Data File
#Stock info
stockdata<-read.csv("stockdata.txt", header=TRUE)
#Preparing Data by eliminating date fields
stockdata$DATE=NULL
#Setting up data frames that will hold lagged copies of the stock data
data=stockdata
pdata=stockdata
ppdata=stockdata
pppdata=stockdata
data$HIGH=NULL
data$LOW=NULL
data$VOLUME=NULL
#Removing data frame from memory after use
rm(stockdata)
# #Preparing data for analysis
# shift <- function(x, n){
#   c(x[-(seq(n))], rep(NA, n))
# }
# data$CLOSE <- shift(data$CLOSE, 1)
# data$OPEN <- shift(data$OPEN, 1)
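#Aligning rows: pdata will supply the previous period's values for each row of data
#(ppdata and pppdata would supply 2- and 3-period lags if the commented lines below were enabled)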
pdata <- pdata[-nrow(pdata),]
ppdata <- ppdata[-nrow(ppdata),]
ppdata <- ppdata[-nrow(ppdata),]
pppdata <- pppdata[-nrow(pppdata),]
pppdata <- pppdata[-nrow(pppdata),]
pppdata <- pppdata[-nrow(pppdata),]
data=data[-1,]
# data=data[-1,]
# pdata=pdata[-1,]
#Converting Dataframe to time series
sdata=as.ts(data$CLOSE)
data=cbind(data,pdata)
#data=cbind(data,ppdata)
#data=cbind(data,pppdata)
#Inspecting Time series dataframe
print("Data as Time Series")
## [1] "Data as Time Series"
head(sdata,3)
## [1] 61.345 62.050 61.825
# evap.n <- nrow(data)
# step(lm(data$CLOSE ~ (data$OPEN+pdata$CLOSE+pdata$OPEN+pdata$HIGH+pdata$LOW+pdata$VOLUME+ppdata$CLOSE+ppdata$OPEN+ppdata$HIGH+ppdata$LOW+ppdata$VOLUME+pppdata$CLOSE+pppdata$OPEN+pppdata$HIGH+pppdata$LOW+pppdata$VOLUME)),direction = "both", k=log(evap.n))

We will divide the prepared data into a training and test set. We will take the last 33% of the entries as our test set.

#Number of test cases
n=130
sdata=tail(sdata,390)
data=tail(data,390)
#Train datasets
sdatatrain=head(sdata,-n)
datatrain=head(data,-n)
#Test data sets
sdatatest=tail(sdata,n)
datatest=tail(data,n)
# #adjustment for friday report
# datatest=head(datatest,-26)
# sdatatest=head(sdatatest,-26)
#Removing datasets no longer needed
rm(sdata)
rm(data)

First, let’s visualize the time series.

#Plotting time series 
plot.ts(sdatatrain)
abline(reg=lm(sdatatrain~time(sdatatrain)))

One of the first things we must check before beginning a time series analysis is that our time series is stationary. That is: the mean of the series should not be a function of time, but rather a constant; the variance of the series should not be a function of time (this property is known as homoscedasticity); and the covariance of the ith term and the (i + m)th term should not be a function of time. Unless the time series is stationary, we cannot build a time series model. In cases where the stationarity criteria are violated, the first requirement is to stationarize the time series and then try stochastic models to predict it. There are multiple ways of achieving stationarity, for example detrending and differencing. Testing for stationarity and converting a series into a stationary one are among the most critical steps in time series modelling. Below are three tests related to stationarity. The Ljung-Box test examines whether there is significant evidence for non-zero autocorrelations at lags 1-20; small p-values (less than 0.05) indicate that the series has significant autocorrelation and is not simply white noise.

#Running test
Box.test(sdatatrain,lag = 20, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  sdatatrain
## X-squared = 150.26, df = 20, p-value < 2.2e-16

For the Augmented Dickey-Fuller (ADF) t-statistic test, small p-values suggest the data are stationary and do not need to be differenced.

#Including library for test
library(fpp)
#Running test
adf.test(sdatatrain,alternative="stationary")
## 
##  Augmented Dickey-Fuller Test
## 
## data:  sdatatrain
## Dickey-Fuller = -2.4156, Lag order = 2, p-value = 0.4141
## alternative hypothesis: stationary

In the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test, accepting the null hypothesis means that the series is stationary; small p-values suggest that the series is not stationary and differencing is required.

#Including library for test
library(forecast)
#Running test
kpss.test(sdatatrain)
## 
##  KPSS Test for Level Stationarity
## 
## data:  sdatatrain
## KPSS Level = 1.1394, Truncation lag parameter = 1, p-value = 0.01

Two of our three tests failed for stationarity. Therefore, we must make this time series stationary.

#Number of differences required to achieve stationarity 
ndiffs(sdatatrain)
## [1] 1

Even though the output of the ndiffs function above suggests one difference, we only pass two of the three tests with acceptable results when we apply diff with a parameter of 2. We will move forward with our analysis and pick our model based on the minimum AIC/BIC generated, while also considering the minimum test error.

#Differencing the training series (diff with lag = 2)
dsdatatrain=diff(sdatatrain,2)
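
For reference, diff(x, 2) in R takes a lag-2 difference (Yt - Yt-2); differencing a series twice would instead be diff(x, differences = 2). A toy sketch of the distinction:

#Toy illustration: the lag argument vs. the differences argument of diff()
x <- c(1, 4, 9, 16, 25)
diff(x, lag = 2)           #Yt - Yt-2: returns 8 12 16
diff(x, differences = 2)   #second-order difference, i.e. diff(diff(x)): returns 2 2 2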

Below are the outputs of the same stationarity tests after differencing:

#Running tests again
Box.test(dsdatatrain,lag = 20, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  dsdatatrain
## X-squared = 29.622, df = 20, p-value = 0.07621
adf.test(dsdatatrain,alternative="stationary")
## 
##  Augmented Dickey-Fuller Test
## 
## data:  dsdatatrain
## Dickey-Fuller = -2.9764, Lag order = 2, p-value = 0.2004
## alternative hypothesis: stationary
kpss.test(dsdatatrain)
## 
##  KPSS Test for Level Stationarity
## 
## data:  dsdatatrain
## KPSS Level = 0.053767, Truncation lag parameter = 1, p-value = 0.1

Now we will check stationarity using other tools. We will start with the ACF and PACF graphs and check for significant lags.

Acf(dsdatatrain)

Pacf(dsdatatrain)

Both graphs have a few significant lags, but they die out rather quickly, so we can conclude that the differenced series is stationary.

To fit the best ARIMA model to a univariate time series, we also use the auto.arima function, which returns the best ARIMA model according to either the AIC, AICc, or BIC value. The function conducts a search over possible models within the order constraints provided. When the test set was evaluated, the ARIMA(1,2,5) model discovered manually overfit the training data and did not perform as well on the test data as the model suggested by auto.arima. Below we create the time series models and explore their accuracy.

Creating the ARIMA(1,2,5) model, predicting and plotting the future values, and checking accuracy.

#Creating model
model=arima(sdatatrain,order=c(1,2,5))
#Predicting future values
preds=forecast(model, n)
#Plotting the future values
plot(forecast(model, n))

#Predicted values
pred=preds$mean
#Setting counts to zero
cnt=0
pctgain=0
#Using for loop to count accurate number of predictions
for (i in 1:length(sdatatest))
{  #Count predictions whose direction matches the actual close relative to the open
  if (((sdatatest[i] > datatest$OPEN[i]) &
      (pred[i] > datatest$OPEN[i])) ||
      ((sdatatest[i] < datatest$OPEN[i]) &
      (pred[i] < datatest$OPEN[i])))
  {
    cnt=cnt+1
    pctgain[i]=abs((sdatatest[i]-datatest$OPEN[i])/datatest$OPEN[i])
  }
  else
  {
   pctgain[i]=-abs((sdatatest[i]-datatest$OPEN[i])/datatest$OPEN[i])
  }
}
#Percent accuracy rate for number of trades
print("Accuracy Rate for Correct Number of Trades")
## [1] "Accuracy Rate for Correct Number of Trades"
cnt/length(sdatatest)
## [1] 0.3846154
#Average percent return
print("Average Percentage Gain")
## [1] "Average Percentage Gain"
mean(pctgain)
## [1] -0.002265502

Creating an ARIMA model using the auto.arima function, predicting and plotting the future values, and checking accuracy.

#Creating model
model=auto.arima(sdatatrain)
#Predicting future values
preds=forecast(model, n)
#Plotting the future values
plot(forecast(model, n))

#Predicted values
pred=preds$mean
#Setting counts to zero
cnt=0
pctgain=0
#Using for loop to count accurate number of predictions
for (i in 1:length(sdatatest))
{  #Count predictions whose direction matches the actual close relative to the open
  if (((sdatatest[i] > datatest$OPEN[i]) &
      (pred[i] > datatest$OPEN[i])) ||
      ((sdatatest[i] < datatest$OPEN[i]) & 
      (pred[i] < datatest$OPEN[i])))
  {
    cnt=cnt+1
    pctgain[i]=abs((sdatatest[i]-datatest$OPEN[i])/datatest$OPEN[i])
  }
  else 
  {
   pctgain[i]=-abs((sdatatest[i]-datatest$OPEN[i])/datatest$OPEN[i])
  }
}
#Percent accuracy rate for number of trades
print("Accuracy Rate for Correct Number of Trades")
## [1] "Accuracy Rate for Correct Number of Trades"
cnt/length(sdatatest)
## [1] 0.4846154
#Average percent return
print("Average Percentage Gain")
## [1] "Average Percentage Gain"
mean(pctgain)
## [1] -0.0009722851

Now, we will explore the variables in the data to get a sense of the correlation and relationships that exist between our dependent and independent variables. One common way of plotting multivariate data is to make a “matrix scatterplot”, showing each pair of variables plotted against each other.

#Library for plotting
library(car)
#Plotting all variables 
scatterplotMatrix(datatrain)

Below is a correlation analysis plot showing the correlation strength between all the variables.

#Converting data set into numeric to plot
chronplot<-data.matrix(datatrain)
#Importing library needed for plotting
library(corrplot)
#Correlation Analysis
M <- cor(chronplot)
#Removing data frame from memory after use
rm(chronplot)
#Plotting
corrplot(M, method="circle")

We conduct a correlation analysis to see, numerically, the strongest relationships between variables in the complete and organized data set. Correlation coefficients range from -1 to 1: the closer to -1, the stronger the negative correlation; the closer to 1, the stronger the positive correlation; and values close to 0 indicate little or no correlation. From the plot above we can see the correlation for each pair of variables, but let's find out which pairs are most highly correlated. We will print the linear correlation coefficients for each pair of variables in the data set, ordered by the absolute value of the correlation coefficient. This lets us see easily which pairs of variables are most highly correlated.

mosthighlycorrelated <- function(mydataframe, numtoreport)
{
  # compute the correlation matrix for the supplied data frame
  cormatrix <- cor(data.matrix(mydataframe))
  # set the correlations on the diagonal or lower triangle to zero,
  # so they will not be reported as the highest ones:
  diag(cormatrix) <- 0
  cormatrix[lower.tri(cormatrix)] <- 0
  # flatten the matrix into a dataframe for easy sorting
  fm <- as.data.frame(as.table(cormatrix))
  # assign human-friendly names
  names(fm) <- c("First Variable", "Second Variable", "Correlation")
  # sort and print the top n correlations
  head(fm[order(abs(fm$Correlation), decreasing = TRUE), ], n = numtoreport)
}
mosthighlycorrelated(datatrain, 10)
#Removing data frame from memory after use
rm(M)
rm(mosthighlycorrelated)

Another type of plot that is useful is a “profile plot”, which shows the variation in each of the variables, by plotting the value of each of the variables for each of the samples.

makeProfilePlot <- function(mylist, names)
{
  require(RColorBrewer)
  # find out how many variables we want to include
  numvariables <- length(mylist)
  # choose 'numvariables' colours from the Set1 palette
  colours <- brewer.pal(numvariables, "Set1")
  # find out the minimum and maximum values of the variables:
  mymin <- 1e+20
  mymax <- 1e-20
  for (i in 1:numvariables)
  {
    vectori <- mylist[[i]]
    mini <- min(vectori)
    maxi <- max(vectori)
    if (mini < mymin) { mymin <- mini }
    if (maxi > mymax) { mymax <- maxi }
  }
  # plot the variables
  for (i in 1:numvariables)
  {
    vectori <- mylist[[i]]
    namei <- names[i]
    colouri <- colours[i]
    if (i == 1) { plot(vectori, col = colouri, type = "l", ylim = c(mymin, mymax)) }
    else { points(vectori, col = colouri, type = "l") }
    lastxval <- length(vectori)
    lastyval <- vectori[length(vectori)]
    text((lastxval - 10), (lastyval), namei, col = "black", cex = 0.6)
  }
}
library(RColorBrewer)
#Stock Data Plot
names <- c("CLOSE","HIGH","LOW","OPEN","VOLUME")
mylist <- list(datatrain$CLOSE,datatrain$HIGH,datatrain$LOW,datatrain$OPEN,datatrain$VOLUME)
print("Stock Data Profile Plot")
## [1] "Stock Data Profile Plot"
makeProfilePlot(mylist,names)

#Removing data frame from memory after use
rm(mylist)
rm(names)
rm(makeProfilePlot)

It is clear from the profile plot that the mean and standard deviation of the volume of the stock are much higher than those of the other variables.

Another thing that we will do is to calculate summary statistics such as the mean and standard deviation for each of the variables in the multivariate data set.

#Mean of each variable
print("All means by each variable")
## [1] "All means by each variable"
sapply(datatrain[1:5],mean)
##    CLOSE     OPEN  CLOSE.1     HIGH      LOW 
## 66.01214 65.63973 65.61020 66.04167 64.60055
#Standard Deviation of each variable
print("All standard deviations by each variable")
## [1] "All standard deviations by each variable"
sapply(datatrain[1:5],sd)
##    CLOSE     OPEN  CLOSE.1     HIGH      LOW 
## 3.094942 3.102965 3.144625 3.138938 3.028530

If we were using an ordinary linear regression, it would make sense to standardize in order to compare the variables, because they have very different standard deviations. For example, the standard deviation of CLOSE is about 3.1 (see above), while the standard deviation of VOLUME is several orders of magnitude larger. We would standardize each variable so that it has a sample variance of 1 and a sample mean of 0.
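
If we wanted to standardize (a sketch only; it is not needed for the tree-based model below), the scale function handles this directly:

#Sketch: standardizing each variable to mean 0 and variance 1 (not used by the tree-based model below)
datatrain_std <- as.data.frame(scale(datatrain))
round(sapply(datatrain_std, mean), 10)   #column means are numerically zero
sapply(datatrain_std, sd)                #column standard deviations are all 1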

Computer scientists have developed a wide variety of algorithms particularly suited for prediction, including neural nets, ensembles of trees, and support vector machines. These machine learning methods are used less frequently than regression methods because they are less interpretable. They avoid starting with a data model and instead use an algorithm to learn the relationship between the response and its predictors. Boosted regression trees assume that the data-generating process is complex and unknown, and try to learn the response by observing inputs and responses and finding dominant patterns. This places the emphasis on a model's ability to predict well, and focuses on what is being predicted and how prediction success should be measured. We are able to build a predictive model that minimizes overfitting. Boosted regression trees are one of several techniques that aim to improve the performance of a single model by fitting many models and combining them for prediction. The approach uses regression trees, from the classification and regression tree (decision tree) group of models, and boosting, which builds and combines a collection of models.

Tree-based models partition the predictor space into rectangles, using a series of rules to identify the regions with the most homogeneous responses to the predictors. They then fit a constant to each region, with classification trees fitting the most probable class as the constant and regression trees fitting the mean response for observations in that region, assuming normally distributed errors. Predictors and split points are chosen to minimize prediction error. Growing a tree involves recursive binary splits: a binary split is repeatedly applied to its own output until some stopping criterion is reached. An effective strategy for fitting a single decision tree is to grow a large tree, then prune it by collapsing the weakest links identified through cross-validation. Decision trees are popular because they represent information in a way that is intuitive and easy to visualize, and they have several other advantageous properties. Preparation of candidate predictors is simplified because predictor variables can be of any type (numeric, binary, categorical, etc.). Model outcomes are unaffected by monotone transformations and differing scales of measurement among predictors, and irrelevant predictors are seldom selected. Trees are insensitive to outliers and can accommodate missing data in predictor variables by using surrogates (Breiman et al. 1984). The hierarchical structure of a tree means that the response to one input variable depends on the values of inputs higher in the tree, so interactions between predictors are automatically modelled. Despite these benefits, trees are usually not as accurate as other methods, such as GLM and GAM, and they have difficulty modelling smooth functions. Also, the tree structure depends on the sample of data, and small changes in the training data can result in a very different series of splits (Hastie et al. 2001). These factors detract from the advantages of trees, introducing uncertainty into their interpretation and limiting their predictive performance.
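
To illustrate the recursive partitioning idea on the prepared training data, here is a minimal sketch of a single regression tree using the rpart package (an illustration only; it is not the model used for the results in this project, and it assumes rpart is installed):

#Sketch: a single regression tree on the training data (illustration only, not the project model)
library(rpart)
single_tree <- rpart(CLOSE ~ ., data = datatrain, method = "anova",   #regression tree: mean response per region
                     control = rpart.control(cp = 0.01))
printcp(single_tree)                              #cross-validated error by tree size, used to decide where to prune
plot(single_tree); text(single_tree, cex = 0.7)   #visualize the binary splits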

Boosting is a method for improving model accuracy, based on the idea that it is easier to find and average many rough rules of thumb than to find a single, highly accurate prediction rule (Schapire 2003). Related techniques, including bagging, stacking, and model averaging, also build and then merge results from multiple models, but boosting is unique because it is sequential: it is a forward, stagewise procedure. In boosting, models (e.g. decision trees) are fitted iteratively to the training data, using appropriate methods to gradually increase the emphasis on observations modelled poorly by the existing collection of trees. Boosting algorithms vary in how they quantify lack of fit and select settings for the next iteration. The original boosting algorithms, such as AdaBoost (Freund & Schapire 1996), were developed for two-class classification problems. They apply weights to the observations, emphasizing poorly modelled ones, so that literature tends to discuss boosting in terms of changing weights. In this project we focus on regression trees, and the intuition is different. For regression problems, boosting is a form of functional gradient descent. Consider a loss function, that is, a measure such as deviance that represents the loss in predictive performance due to a suboptimal model. Boosting is a numerical optimization technique for minimizing the loss function by adding, at each step, a new tree that best reduces (steps down the gradient of) the loss function.

In boosted regression trees, the first regression tree is the one that, for the selected tree size, maximally reduces the loss function. At each following step the focus is on the residuals: the variation in the response not yet explained by the model. In ordinary regression with squared-error loss, the standard residuals are used; for more general loss functions, the analogue of the residual vector is the vector of negative gradients. Deviance is often used as the loss function in boosting software. For example, at the second step a tree is fitted to the residuals of the first tree, and that second tree could contain quite different variables and split points than the first. The model is then updated to contain two trees (two terms), the residuals from this two-term model are calculated, and so on. The process is stagewise, not stepwise, meaning that existing trees are left unchanged as the model is enlarged; only the fitted value for each observation is re-estimated at each step to reflect the contribution of the newly added tree. The final boosted regression tree model is a linear combination of usually hundreds to thousands of trees, and can be thought of as a regression model where each term is a tree. Therefore, I decided to try the AdaBoost algorithm to select a model, utilizing the Extreme Gradient Boosting package (xgboost). AdaBoost, short for "Adaptive Boosting", is a machine learning meta-algorithm that can be used in conjunction with many other types of learning algorithms to improve their performance. The output of the other learning algorithms ('weak learners') is combined into a weighted sum that represents the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of the instances misclassified by previous classifiers. AdaBoost can be sensitive to noisy data and outliers, although in some problems it can be less susceptible to overfitting than other learning algorithms.
The individual learners can be weak, but as long as the performance of each one is slightly better than random guessing (e.g., their error rate is smaller than 0.5 for binary classification), the final model can be proven to converge to a strong learner.
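
To make the stagewise idea concrete, here is a toy sketch of boosting with squared-error loss, where small trees are fitted to the current residuals and added to the model with a shrinkage factor (an illustration of the principle only; the actual model below is built with the xgboost package):

#Toy sketch of stagewise boosting: small trees fitted to residuals and added with shrinkage
library(rpart)
boost_fit <- rep(mean(datatrain$CLOSE), nrow(datatrain))   #start from a constant prediction
shrinkage <- 0.1                                           #learning rate: each tree contributes a small step
for (m in 1:50) {
  resid_m <- datatrain$CLOSE - boost_fit                   #residuals = negative gradient for squared-error loss
  tree_m <- rpart(resid_m ~ ., data = datatrain[, -1],     #small tree fitted to the residuals
                  control = rpart.control(maxdepth = 2))
  boost_fit <- boost_fit + shrinkage * predict(tree_m, datatrain[, -1])   #earlier trees are left unchanged
}
mean((datatrain$CLOSE - boost_fit)^2)                      #training mean squared error after 50 stages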

While every learning algorithm tends to suit some problem types better than others, and typically has many parameters and configurations to adjust before achieving optimal performance on a dataset, AdaBoost with decision trees as the weak learners is often referred to as the best out-of-the-box classifier. When used with decision tree learning, information gathered at each stage of the AdaBoost algorithm about the relative 'hardness' of each training sample is fed into the tree-growing algorithm, so that later trees tend to focus on harder-to-classify examples. I tested many ranges for the maximum number of iterations in the training model and selected 200 as the maximum number of iterations for model building, because the range up to 200 contained the iteration with the lowest training error that also yielded the lowest test error. One way to explore this choice is sketched below.
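
As a sketch of how such a search could be carried out (this reconstructs the idea, not the exact search I ran; the parameters mirror the model call below, and the evaluation_log column names may differ slightly across xgboost versions):

#Sketch: cross-validating the number of boosting iterations
library(xgboost)
cv_train <- as.matrix(datatrain)
cv_results <- xgb.cv(data = cv_train[, -1], label = cv_train[, 1],
                     max.depth = 2, eta = 0.4, nthread = 4,
                     nrounds = 200, nfold = 5,
                     objective = "reg:linear", verbose = 0)
which.min(cv_results$evaluation_log$test_rmse_mean)   #iteration with the lowest cross-validated RMSE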

Creating the model for the AdaBoost-style boosting method using the training set, fitting the model to the test set, and checking accuracy.

#Creating adaBoost model
library(xgboost)
train=as.matrix(datatrain)
test=as.matrix(datatest)
#Choosing the number of iterations based on the errors observed in the training model
model=xgboost(data=train[,-1],label=train[,1],max.depth=2,eta=.4,nthread=4,nrounds=200,objective="reg:linear",silent=1,verbose = 0)
#Fitting the trained model to the test set
pred = predict(model,newdata=test)
#Setting counts to zero
cnt=0
pctgain=0
#Using for loop to count accurate number of predictions
for (i in 1:nrow(datatest))
{  #Count predictions whose direction matches the actual close relative to the open
  if (((datatest$CLOSE[i] > datatest$OPEN[i]) &
      (pred[i] > datatest$OPEN[i])) || 
      ((datatest$CLOSE[i] < datatest$OPEN[i]) & 
      (pred[i] < datatest$OPEN[i])))
  {
    cnt=cnt+1
    pctgain[i]=abs((datatest$CLOSE[i]-datatest$OPEN[i])/datatest$OPEN[i])
  }
  else 
  {
   pctgain[i]=-(abs((datatest$CLOSE[i]-datatest$OPEN[i])/datatest$OPEN[i]))
  }
}
#Percent accuracy rate for number of trades
print("Accuracy Rate for Correct Number of Trades")
## [1] "Accuracy Rate for Correct Number of Trades"
cnt/nrow(datatest)
## [1] 0.4461538
#Average percent return
print("Average Percentage Gain")
## [1] "Average Percentage Gain"
mean(pctgain)
## [1] 0.000120602

The total time the program took to run is shown below (ptm was recorded with proc.time() at the start of the script).

proc.time()-ptm
##    user  system elapsed 
##    4.83    0.66    6.27

REFERENCES: ISLR text; Wikipedia; http://little-book-of-r-for-multivariate-analysis.readthedocs.org/; http://avesbiodiv.mncn.csic.es/estadistica/bt1.pdf