This script shows how to grab thousands of available stocks, then select one stock from the set available, and count the number of days in a set time period, if available, that the stock increases and separately decreases based on the stock value a set number of days (lag value) earlier. It then uses that lag value to predict whether the stock will increase by 2% the next day and separately in 30 days. The accuracy could be improved by tuning, but with these default parameters the accuracy was able to predict between 80-95% accuracy the next day, and 70-91% in 30 days whether the stock would increase by two percent based on the time interval specified. To specify the time period, search for this mark ‘%^&%’ in this file and set the dates, lag, lead (amount of days in the future to predict based on the lag set price earlier), and the stock from the list. Kaggle has a large csv file originally made for this script, but the smaller files are at github. Change the read (search for mark ’%^&*’ in of the csv after downloading from Kaggle allStocksGathered1.csv to the file name. Or get the smaller files (yahooStockBasket.csv and StockSwingTradeBotCom.csv) at the github address and run the first part of this script (about 20 minutes more or less).
library(lubridate)
library(dplyr)
library(quantmod)
library(PerformanceAnalytics)
library(tidyr)
library(DT)
library(stringr)
library(e1071)
library(caret)
library(randomForest)
library(MASS)
library(gbm)
setDefaults(getSymbols, src='yahoo')
Enter a date and the number of days to go back for the stock value as the lag value.
date="1999-01-01"
lag = 5
today = today()
lagN = paste('lag',lag,sep='')
newStocks <- read.csv('StockSwingTradeBotCom.csv', header=TRUE, na.strings=c('',' ','NA'),
sep=',')
head(newStocks)
## stockSymbol stockName dailyClose volatility avgVolume
## 1 A Agilent Technolog... 65.76 34.988 2,515,221
## 2 AA Alcoa Inc. 5.83 121.188 7,431,897
## 3 AAC AAC Holdings, Inc. 0.48 99.106 279,165
## 4 AACG ATA Creativity Gl... 0.74 45.638 25,584
## 5 AAL American Airlines... 10.29 97.428 24,309,238
## 6 AAMC Altisource Asset ... 11.12 91.945 7,226
newStocks <- newStocks[,-c(4:5)]
head(newStocks)
## stockSymbol stockName dailyClose
## 1 A Agilent Technolog... 65.76
## 2 AA Alcoa Inc. 5.83
## 3 AAC AAC Holdings, Inc. 0.48
## 4 AACG ATA Creativity Gl... 0.74
## 5 AAL American Airlines... 10.29
## 6 AAMC Altisource Asset ... 11.12
tail(newStocks)
## stockSymbol stockName dailyClose
## 5525 ZUMZ Zumiez Inc. 19.28
## 5526 ZUO Zuora, Inc. 8.40
## 5527 ZVO Zovio Inc. 1.46
## 5528 ZYME Zymeworks Inc. 26.31
## 5529 ZYNE Zynerba Pharmaceu... 3.10
## 5530 ZYXI Zynex, Inc. 10.81
Lets look at the stock symbols or tickers to pull up information on.
dim(newStocks)
## [1] 5530 3
newStocks$stockSymbol[1:25]
## [1] A AA AAC AACG AAL AAMC AAME AAN AAOI AAON AAP AAPL AAT AAU AAWW
## [16] AAXN AB ABB ABBV ABC ABCB ABDC ABEO ABEV ABG
## 5530 Levels: A AA AAC AACG AAL AAMC AAME AAN AAOI AAON AAP AAPL AAT AAU ... ZYXI
newStocks$stockSymbol[5505:5530]
## [1] ZGNX ZGYHU ZION ZIONL ZIONN ZIONO ZIONP ZIOP ZIXI ZKIN ZLAB ZM
## [13] ZN ZNGA ZNH ZOM ZS ZSAN ZTO ZTS ZUMZ ZUO ZVO ZYME
## [25] ZYNE ZYXI
## 5530 Levels: A AA AAC AACG AAL AAMC AAME AAN AAOI AAON AAP AAPL AAT AAU ... ZYXI
These are the stocks that we had built the previous lags, group counts, and other stats for of 65 total.
stockNames <- read.csv('yahooStockBasket.csv', header=TRUE, sep=',',
na.strings=c('',' ','NA'))
Get a list of those names of stocks that have the period in the names, recalling that SCE.PB was changed in this 65 stock list, stockNames, to SCE-PB in a previous script.
a1 <- grep('[.]', stockNames$stock)
a2 <- grep('[.]', newStocks$stockSymbol)
stockNames[a1,2]
## factor(0)
## 65 Levels: AAL AAP ADDYY AMC AMZN ARWR ASCCY C COST CSSEP CVX DLTR EPD ... YELP
There aren’t any periods in the stock names of the 65 stock list. List those stock names that are separated with a period in the 5530 stock list.
newStocks[a2,1]
## [1] AGM.A AKO.A AKO.B BF.A BF.B BH.A BIO.B BRK.A BRK.B BWL.A
## [11] CBS.A CCH.U CCX.U CIG.C CRD.A CRD.B CWEN.A EBR.B GEF.B GIG.U
## [21] GRP.U GTN.A HEI.A HVT.A JIH.U JIH.W JW.A JW.B LEN.B LGF.A
## [31] LGF.B LHC.U MKC.V MOG.A MOG.B OAC.U OIBR.C PBR.A PIC.U RDS.A
## [41] RDS.B RMG.U SBE.U STZ.B TAP.A TPGH.U WSO.B
## 5530 Levels: A AA AAC AACG AAL AAMC AAME AAN AAOI AAON AAP AAPL AAT AAU ... ZYXI
See if those stocks we hand picked and analyzed previously are in this larger set.
stocksBothSets <- merge(stockNames, newStocks, by.x='stock', by.y='stockSymbol')
This combines all the 65 stocks and fills in as NAs for the missing stocks not in the 5530 stock set.
stocksNotBothSets <- merge(stockNames, stocksBothSets, by.x='stock',by.y='stock',
all.x=TRUE)
The stocks not in the 65 stock set and the 5530 other stocks.
stocksNotBoth <- stocksNotBothSets[is.na(stocksNotBothSets$stockInfo.y),]
stocksNotBoth$stock
## [1] ADDYY ASCCY GNBT NSANY S SCE-PB
## 65 Levels: AAL AAP ADDYY AMC AMZN ARWR ASCCY C COST CSSEP CVX DLTR EPD ... YELP
Lets add these stocks to the table, but replace the ‘-’ with ‘.’ in SCE-PB to the newStocks table.
large <- newStocks[,1:2]
small <- stockNames[,c(2,1)]
colnames(large) <- c('stock','company')
colnames(small) <- c('stock','company')
all <- rbind(large,small)
all$stock <- gsub('-','.',all$stock)
allStockTicks <- as.character(all$stock)
length(allStockTicks)
## [1] 5595
There are duplicates in the stocks that need to be removed. This set is the original 5530 plus the six stocks that were not in the set originally but were in the 65 stock sets analyzed already.
allStockTicks1 <- allStockTicks[!duplicated(allStockTicks)]
length(allStockTicks1)
## [1] 5536
These are the stock tickers that aren’t in the list of 5595 stock tickers. They produced errors when ran in a previous script for getSymbols(src=‘yahoo’).
notFound <- as.data.frame(c("AAC","ACIW","AFGB","AGBAR","AGM.A","AKO.A","AKO.B", "ALGRR","ANDAR","AREX","AVDR","BF.A","BF.B","BH.A","BIO.B",
"BRK.A","BRK.B","BROGR","BWL.A","BXRXV","CBS.A","CCH.U","CCX.U",
"CIG.C","CJ","CMFNL","CRD.A","CRD.B","CTRP","CUR","CVRS",
"CWEN.A","DEST","DF","DFBHU","DFPHU","DHCNL","DISHR","DISHV",
"DNJR","EBR.B","ECOLW","FELP","FRAC","GEF.B","GHDX","GIG.U",
"GLACR","GLACU","GOODO","GOODP","GRP.U","GTN.A","HAIR",
"HCP","HEI.A","HES","HESM","HPJ","HVT.A","ISRL","JIH.U","JIH.W",
"JMPB","JW.A","JW.B","KOOL","LEN.B","LGF.A","LHC.U","LTXB",
"MKC.V","MOG.A","MOG.B","NEWM","NRCG","NVTR","NYNY",
"OAC.U","OIBR.C","OTIV","PBR.A","PIC.U","RDS.A","RDS.B",
"RMG.U","RTEC","RVEN","SBE.U","SDR","SDT","SHOS","SSFN",
"STNL","STNLU","STZ.B","SWJ","SYMC","TAP.A","TBLTU",
"TMCX","TMCXU","TPGH","TPGH.U","TTS","UBNK","WSO.B","SCE.PB"))
colnames(notFound) <- 'stock'
notFound$stock <- as.character(paste(notFound$stock))
This combines the 5536 stocks and creates a group to subset those stocks that aren’t found in the getSymbols(src=‘yahoo’) search results.
allStockTicks2 <- as.data.frame(allStockTicks1)
colnames(allStockTicks2) <- 'stock'
allStockTicks2$stock <- as.factor(allStockTicks2$stock)
notFound$avail <- 'notAvail'
allStockTicks2$avail <- 'Avail'
availStocks <- merge(allStockTicks2, notFound, by.x='stock',by.y='stock',
all.x=TRUE)
availStocks1 <- availStocks[with(availStocks, order(avail.y,avail.x)),]
available <- subset(availStocks, is.na(availStocks$avail.y) &
availStocks$avail.x=='Avail')
notavail <- subset(availStocks, availStocks$avail.y=='notAvail')
We are interested in the available stocks, of which there are 5,428 stock tickers that are available on yahoo.
available1 <- as.character(available$stock)
tickers1 <- available1[c(1:700)]
tickers2 <- available1[c(701:1300)]
tickers3 <- available1[c(1301:2000)]
tickers4 <- available1[c(2001:2700)]
tickers5 <- available1[c(2701:3300)]
tickers6<- available1[c(3301:4000)]
tickers7 <- available1[c(4001:4700)]
tickers8 <- available1[c(4701:5428)]
All_portfolioPrices1 <- NULL
for (ticker in tickers1){
All_portfolioPrices1 <- cbind(All_portfolioPrices1, getSymbols(ticker,
from = date,
periodicity='daily', auto.assign=FALSE)[,4])
}
All_portfolioPrices2 <- NULL
for (ticker in tickers2){
All_portfolioPrices2 <- cbind(All_portfolioPrices2, getSymbols(ticker,
from = date,
periodicity='daily', auto.assign=FALSE)[,4])
}
All_portfolioPrices3 <- NULL
for (ticker in tickers3){
All_portfolioPrices3 <- cbind(All_portfolioPrices3, getSymbols(ticker,
from = date,
periodicity='daily', auto.assign=FALSE)[,4])
}
All_portfolioPrices4 <- NULL
for (ticker in tickers4){
All_portfolioPrices4 <- cbind(All_portfolioPrices4, getSymbols(ticker,
from = date,
periodicity='daily', auto.assign=FALSE)[,4])
}
names <- as.data.frame(as.character(index(All_portfolioPrices1)))
colnames(names) <- 'Date'
names2 <- as.data.frame(as.character(index(All_portfolioPrices2)))
colnames(names2) <- 'Date'
p1 <- as.data.frame(All_portfolioPrices1)
p2 <- as.data.frame(All_portfolioPrices2)
p3 <- as.data.frame(All_portfolioPrices3)
p4 <- as.data.frame(All_portfolioPrices4)
val1 <- cbind(names,p1)
val2 <- cbind(names2,p2)
val3 <- cbind(names,p3)
val4 <- cbind(names,p4)
colnames(val1) <- gsub('.Close','',colnames(val1))
colnames(val2) <- gsub('.Close','',colnames(val2))
colnames(val3) <- gsub('.Close','',colnames(val3))
colnames(val4) <- gsub('.Close','',colnames(val4))
close1 <- gather(val1,'stockName','stockValue',2:701)
close2 <- gather(val2,'stockName','stockValue',2:601)
close3 <- gather(val3,'stockName','stockValue',2:701)
close4 <- gather(val4,'stockName','stockValue',2:701)
All_portfolioPrices5 <- NULL
for (ticker in tickers5){
All_portfolioPrices5 <- cbind(All_portfolioPrices5, getSymbols(ticker,
from = date,
periodicity='daily', auto.assign=FALSE)[,4])
}
All_portfolioPrices6 <- NULL
for (ticker in tickers6){
All_portfolioPrices6 <- cbind(All_portfolioPrices6, getSymbols(ticker,
from = date,
periodicity='daily', auto.assign=FALSE)[,4])
}
All_portfolioPrices7 <- NULL
for (ticker in tickers7){
All_portfolioPrices7 <- cbind(All_portfolioPrices7, getSymbols(ticker,
from = date,
periodicity='daily', auto.assign=FALSE)[,4])
}
All_portfolioPrices8 <- NULL
for (ticker in tickers8){
All_portfolioPrices8 <- cbind(All_portfolioPrices8, getSymbols(ticker,
from = date,
periodicity='daily', auto.assign=FALSE)[,4])
}
p5 <- as.data.frame(All_portfolioPrices5)
p6 <- as.data.frame(All_portfolioPrices6)
p7 <- as.data.frame(All_portfolioPrices7)
p8 <- as.data.frame(All_portfolioPrices8)
names1 <- as.data.frame(as.character(index(All_portfolioPrices5)))
colnames(names1) <- 'Date'
names2 <- as.data.frame(as.character(index(All_portfolioPrices6)))
colnames(names2) <- 'Date'
val5 <- cbind(names1,p5)
val6 <- cbind(names2,p6)
val7 <- cbind(names2,p7)
val8 <- cbind(names2,p8)
colnames(val5) <- gsub('.Close','',colnames(val5))
colnames(val6) <- gsub('.Close','',colnames(val6))
colnames(val7) <- gsub('.Close','',colnames(val7))
colnames(val8) <- gsub('.Close','',colnames(val8))
close5 <- gather(val5,'stockName','stockValue',2:601)
close6 <- gather(val6,'stockName','stockValue',2:701)
close7 <- gather(val7,'stockName','stockValue',2:701)
close8 <- gather(val8,'stockName','stockValue',2:729)
allStocksGathered <- rbind(close1,close2,close3,close4,
close5,close6,close7,close8)
Remove the NAs.
allStocksGathered1 <-
allStocksGathered[complete.cases(allStocksGathered),]
Save this file to csv.
write.csv(allStocksGathered1,
paste(paste(paste(paste('allStocksGathered1', date,sep='_'),
today,sep='_'),lag,sep='_'), 'csv', sep='.'),
row.names=FALSE)
%%%%%%%%%%%%%%%
Retrieve the large (approximately 442 mb file size) from Kaggle called allStocksGathered1.csv for the large data processed in the last part of this script and to get the individual stock stats and counts of increasing and decreasing days.This script is in github as newStocksLagsCountsGroups.Rmd that made the csv large data file just mentioned.
%^&*
allStocksGathered1 <- read.csv('allStocksGathered1_1999-01-01_2020-03-25_5.csv',sep=',', header=TRUE,
na.strings=c('',' ','NA'))
Lets create the lag and count/group by counts fields to analyze by each stock, using subsets of each stock.
DF <- allStocksGathered1
DF$Date <- as.Date(DF$Date)
%^&%
What stock ticker are you interested in? And what lag are you interested in? You can keep the lag from the top of this script or these defaults or change them here. You need to enter the stock name in the chunk for Rmarkdown below. Also, pick a start date and end date of the time you want to select for running the counts and groups of counts.
stock_1 <- toupper("yelp")
lag <- 3
startDate <- '2012-12-31'
endDate <- '2020-03-25'
lead <- 1
Lets subset our large table to get the dates requested.
sDF <- subset(DF, DF$Date>=startDate & DF$Date<=endDate)
lagN <- paste('lag',lag,sep='')
cat('The number of days to retrieve the stock value compared to each day value listed as an instance is ',lag,'and the stock to look up this information for is ',stock_1)
## The number of days to retrieve the stock value compared to each day value listed as an instance is 3 and the stock to look up this information for is YELP
stknme <- as.character(newStocks[newStocks$stockSymbol==stock_1,2])
cat('\nThis stock is ',stknme)
##
## This stock is Yelp Inc.
Lstock_1 <- subset(sDF, sDF$stockName==stock_1)
cat('The number of days for trading that this time period will provide counts of increasing and decreasing days is ', length(Lstock_1$Date), ' trading days.')
## The number of days for trading that this time period will provide counts of increasing and decreasing days is 1820 trading days.
Generic automation of above stock to look up and the lag to use for generating the counts, group of counts, and lag values to get those counts.
Lstock_1$startDayValue <-Lstock_1$stockValue[1]
Lstock_1$startDayDate <- Lstock_1$Date[1]
Lstock_1$finalDayValue <-Lstock_1$stockValue[length(Lstock_1$stockValue)]
Lstock_1$finalDayDate <- Lstock_1$Date[length(Lstock_1$Date)]
stock_1LN <- lag(Lstock_1$stockValue, lag)
Lstock_1$lagN <- stock_1LN
Lstock_1$today2_lagN <- Lstock_1$stockValue/Lstock_1$lagN
Lstock_1 <- Lstock_1[complete.cases(Lstock_1),]
Lets look at the data we will be adding counts of increasing and decreasing days, for the time interval dates and stock values at the beginning and end of the time interval available or requested.
cat('\nThe lag for this table was for ',lag,'days.','\nThe stock to look up was ', stock_1,'.\nThe start date of this stock and starting value was ',as.character(paste(Lstock_1$startDayDate[1])),' and ','$',Lstock_1$startDayValue[1],'\nThe end date and end date price of this stock analysis is ',as.character(paste(Lstock_1$finalDayDate[1])),' and ','$',Lstock_1$finalDayValue[1])
##
## The lag for this table was for 3 days.
## The stock to look up was YELP .
## The start date of this stock and starting value was 2012-12-31 and $ 18.85
## The end date and end date price of this stock analysis is 2020-03-24 and $ 20.64
roi <- Lstock_1$finalDayValue[1]/Lstock_1$startDayValue[1]
cat('\nThe return on investment as a percentage of the amount invested for this time period is ',roi)
##
## The return on investment as a percentage of the amount invested for this time period is 1.09496
cat('\n\nIn dollars initially invested your return is $',Lstock_1$finalDayValue[1]-Lstock_1$startDayValue[1], 'for the dates:',
as.character(paste(Lstock_1$startDayDate[1])),'through ',
as.character(paste(Lstock_1$finalDayDate[1])))
##
##
## In dollars initially invested your return is $ 1.79 for the dates: 2012-12-31 through 2020-03-24
Now, lets look at the counts and group counts of increasing and decreasing days for this stock and the time period available.
#assign a 1 to increasing values
Lstock_1$todayGrtrThan_lagN <- ifelse(Lstock_1$today2_lagN>1, 1,0)
Lstock_1$cumulativeSumTodayGrtrThan_lagN <- cumsum(Lstock_1$todayGrtrThan_lagN)
# get the count of how many instances repeat,
# those counts repeating are counts that measure the days cumulatively decreasing
# those cumulative counts that don't repeat, are counting increasing days.
# These are stock values for today's value to 7 days prior value.
countstock_10 <- Lstock_1 %>% group_by(cumulativeSumTodayGrtrThan_lagN) %>% count(n=n())
countstock_10 <- as.data.frame(countstock_10)
countstock_10 <- countstock_10[,-3]
colnames(countstock_10)[2] <- 'nRepeatsTodayGrtrThan_lagN'
# Count the REPEATS of each number (minus the initial start)
countstock_10$decrDaysThisCycle <- countstock_10$n-1
# Count the number of times the cycle count repeats in this time span exactly that many days
countstock_10b <- countstock_10 %>% group_by(decrDaysThisCycle) %>% count(n=n())
countstock_10b <- as.data.frame(countstock_10b)
countstock_10b <- countstock_10b[,-3]
colnames(countstock_10b)[2] <- 'nTimesDecrDayCountsOccurs'
#combine these two count matrices of decreasing days
countsstock_tableDecr <- merge(countstock_10, countstock_10b, by.x='decrDaysThisCycle',
by.y='decrDaysThisCycle')
#combine the counts to the stock subset
stock_3 <- merge(Lstock_1, countsstock_tableDecr, by.x='cumulativeSumTodayGrtrThan_lagN',
by.y='cumulativeSumTodayGrtrThan_lagN')
#assign a 1 to decreasing values
stock_3$todayLessThan_lagN <- ifelse(stock_3$today2_lagN>1, 0,1)
stock_3$cumulativeSumTodayLessThan_lagN <- cumsum(stock_3$todayLessThan_lagN)
# get the count of how many instances repeat,
# those counts repeating are counts that measure the days cumulatively increasing
# those cumulative counts that don't repeat, are counting decreasing days.
# These are stock values for today's value to 7 days prior value.
countstock_11 <- stock_3 %>% group_by(cumulativeSumTodayLessThan_lagN) %>% count(n=n())
countstock_11 <- as.data.frame(countstock_11)
countstock_11 <- countstock_11[,-3]
colnames(countstock_11)[2] <- 'nRepeatsTodayLessThan_lagN'
# Count the REPEATS of each number (minus the initial start)
countstock_11$incrDaysThisCycle <- countstock_11$n-1
# Count the number of times the cycle count repeats in this time span exactly that many days
countstock_11b <- countstock_11 %>% group_by(incrDaysThisCycle) %>% count(n=n())
countstock_11b <- as.data.frame(countstock_11b)
countstock_11b <- countstock_11b[,-3]
colnames(countstock_11b)[2] <- 'nTimesIncrDayCountsOccurs'
#combine these two count matrices of decreasing days
countsstock_tableIncr <- merge(countstock_11, countstock_11b,
by.x='incrDaysThisCycle',
by.y='incrDaysThisCycle')
#combine the counts to the stock subset
stock_4 <- merge(stock_3, countsstock_tableIncr,
by.x='cumulativeSumTodayLessThan_lagN',
by.y='cumulativeSumTodayLessThan_lagN')
colnames(stock_4)
## [1] "cumulativeSumTodayLessThan_lagN" "cumulativeSumTodayGrtrThan_lagN"
## [3] "Date" "stockName"
## [5] "stockValue" "startDayValue"
## [7] "startDayDate" "finalDayValue"
## [9] "finalDayDate" "lagN"
## [11] "today2_lagN" "todayGrtrThan_lagN"
## [13] "decrDaysThisCycle" "nRepeatsTodayGrtrThan_lagN"
## [15] "nTimesDecrDayCountsOccurs" "todayLessThan_lagN"
## [17] "incrDaysThisCycle" "nRepeatsTodayLessThan_lagN"
## [19] "nTimesIncrDayCountsOccurs"
stock_5 <- stock_4[,c(3:11,
12,2,14,13,15,
16,1,18,17,19)]
colnames(stock_5) <- gsub('lagN',lagN,colnames(stock_5))
colnames(stock_5)
## [1] "Date" "stockName"
## [3] "stockValue" "startDayValue"
## [5] "startDayDate" "finalDayValue"
## [7] "finalDayDate" "lag3"
## [9] "today2_lag3" "todayGrtrThan_lag3"
## [11] "cumulativeSumTodayGrtrThan_lag3" "nRepeatsTodayGrtrThan_lag3"
## [13] "decrDaysThisCycle" "nTimesDecrDayCountsOccurs"
## [15] "todayLessThan_lag3" "cumulativeSumTodayLessThan_lag3"
## [17] "nRepeatsTodayLessThan_lag3" "incrDaysThisCycle"
## [19] "nTimesIncrDayCountsOccurs"
pretty_headers <- str_to_title(colnames(stock_5))
stock5 <- datatable(data=stock_5, rownames=FALSE,
colnames=pretty_headers,
filter=list(position='top'),
options=list(
dom='Bfrtip',
buttons=c('colvis','csv','excel'),
language=list(sSearch='Filter:')),
extensions=c('Buttons','Responsive')
)
stock5
Using this information on one stock of the thousands available in our large csv file and table, lets return the count information and the number of times this stock has seen those exact days of counts.
cat('\nThe number of times this stock has decreased in the current cycle from the start of this time period retrieved in price comparison of the number of days in lags retrieved prior to each instance dates\' stock value is ', stock_5$decrDaysThisCycle[length(stock_5$decrDaysThisCycle)],'\n')
##
## The number of times this stock has decreased in the current cycle from the start of this time period retrieved in price comparison of the number of days in lags retrieved prior to each instance dates' stock value is 0
cat('\nThe number of times this stock has increased in the current cycle from the start of this time period retrieved in price comparison of the number of days in lags retrieved prior to each instance dates\' stock value is ', stock_5$incrDaysThisCycle[length(stock_5$incrDaysThisCycle)],'\n')
##
## The number of times this stock has increased in the current cycle from the start of this time period retrieved in price comparison of the number of days in lags retrieved prior to each instance dates' stock value is 3
cat('\nThe number of times this stock has decreased exactly this number of days compared to its price ', lag, ' days ago, is ', stock_5$nTimesDecrDayCountsOccurs[length(stock_5$nTimesDecrDayCountsOccurs)],'\n')
##
## The number of times this stock has decreased exactly this number of days compared to its price 3 days ago, is 702
cat('\nThe number of times this stock has increased exactly this number of days compared to its price ', lag, ' days ago, is ', stock_5$nTimesIncrDayCountsOccurs[length(stock_5$nTimesIncrDayCountsOccurs)],'\n')
##
## The number of times this stock has increased exactly this number of days compared to its price 3 days ago, is 52
The unique number of days this stock selected (for the time period retrieved) decreased is shown in the table below.
stock_5[unique(stock_5$decrDaysThisCycle),c(1:3,8,9,13,14)]
## Date stockName stockValue lag3 today2_lag3 decrDaysThisCycle
## 1 2013-01-04 YELP 21.52 18.85 1.1416446 0
## 5 2013-01-10 YELP 22.19 21.08 1.0526565 0
## 3 2013-01-08 YELP 20.83 20.15 1.0337469 1
## 2 2013-01-07 YELP 21.08 19.70 1.0700508 0
## 8 2013-01-15 YELP 20.61 22.19 0.9287968 5
## 7 2013-01-14 YELP 21.97 20.76 1.0582852 5
## 4 2013-01-09 YELP 20.76 21.52 0.9646840 1
## 10 2013-01-17 YELP 20.27 21.97 0.9226218 5
## 6 2013-01-11 YELP 22.12 20.83 1.0619299 0
## 14 2013-01-24 YELP 20.83 20.30 1.0261084 0
## 9 2013-01-16 YELP 20.36 22.12 0.9204340 5
## 11 2013-01-18 YELP 20.30 20.61 0.9849588 5
## 13 2013-01-23 YELP 20.34 20.27 1.0034534 0
## 15 2013-01-25 YELP 21.17 19.90 1.0638191 0
## nTimesDecrDayCountsOccurs
## 1 702
## 5 702
## 3 87
## 2 702
## 8 16
## 7 16
## 4 87
## 10 16
## 6 702
## 14 702
## 9 16
## 11 16
## 13 702
## 15 702
The unique number of days this stock selected (for the time period retrieved) increased is shown in the table below.
stock_5[unique(stock_5$incrDaysThisCycle),c(1:3,8,9,18,19)]
## Date stockName stockValue lag3 today2_lag3 incrDaysThisCycle
## 2 2013-01-07 YELP 21.08 19.70 1.0700508 2
## 3 2013-01-08 YELP 20.83 20.15 1.0337469 2
## 5 2013-01-10 YELP 22.19 21.08 1.0526565 3
## 1 2013-01-04 YELP 21.52 18.85 1.1416446 2
## 10 2013-01-17 YELP 20.27 21.97 0.9226218 0
## 8 2013-01-15 YELP 20.61 22.19 0.9287968 0
## 4 2013-01-09 YELP 20.76 21.52 0.9646840 3
## 13 2013-01-23 YELP 20.34 20.27 1.0034534 5
## 7 2013-01-14 YELP 21.97 20.76 1.0582852 3
## 6 2013-01-11 YELP 22.12 20.83 1.0619299 3
## 9 2013-01-16 YELP 20.36 22.12 0.9204340 0
## 12 2013-01-22 YELP 19.90 20.36 0.9774067 5
## 11 2013-01-18 YELP 20.30 20.61 0.9849588 0
## 15 2013-01-25 YELP 21.17 19.90 1.0638191 5
## nTimesIncrDayCountsOccurs
## 2 21
## 3 21
## 5 52
## 1 21
## 10 609
## 8 609
## 4 52
## 13 11
## 7 52
## 6 52
## 9 609
## 12 11
## 11 609
## 15 11
%%%%%%%%%%%%%%%%%
Lets do some machine learning to predict the next trading day stock value of being increased by 2% as 1, or not being decreased by 2% as a 0 value.This will use the lag set earlier in the above table, and those counts and other fields already used, but add in a lead price field for 1 day. We can then replay this scenario with the lead price in 2 weeks, 1 month, and so on as needed.
colnames(stock_5)
## [1] "Date" "stockName"
## [3] "stockValue" "startDayValue"
## [5] "startDayDate" "finalDayValue"
## [7] "finalDayDate" "lag3"
## [9] "today2_lag3" "todayGrtrThan_lag3"
## [11] "cumulativeSumTodayGrtrThan_lag3" "nRepeatsTodayGrtrThan_lag3"
## [13] "decrDaysThisCycle" "nTimesDecrDayCountsOccurs"
## [15] "todayLessThan_lag3" "cumulativeSumTodayLessThan_lag3"
## [17] "nRepeatsTodayLessThan_lag3" "incrDaysThisCycle"
## [19] "nTimesIncrDayCountsOccurs"
Lets remove the date fields and stock name. We will name the rows of this matrix by the date before deleting the Date field.
stock_6 <- stock_5
row.names(stock_6) <- stock_6$Date
stock_6 <- stock_6[,-c(1,2,4:7)]#removes the date, stock name, and start and final dates/values
stock_6$NextDayPrice <- lead(stock_6$stockValue,lead)
stock_6 <- stock_6[complete.cases(stock_6$NextDayPrice),]
stock_6$NextDay2PercentIncrease <- ifelse(stock_6$NextDayPrice/stock_6$stockValue >1.02, 1,0)
stock_6$NextDay2PercentIncrease <- as.factor(stock_6$NextDay2PercentIncrease)
head(stock_6)
## stockValue lag3 today2_lag3 todayGrtrThan_lag3
## 2013-01-04 21.52 18.85 1.141645 1
## 2013-01-07 21.08 19.70 1.070051 1
## 2013-01-08 20.83 20.15 1.033747 1
## 2013-01-09 20.76 21.52 0.964684 0
## 2013-01-10 22.19 21.08 1.052657 1
## 2013-01-11 22.12 20.83 1.061930 1
## cumulativeSumTodayGrtrThan_lag3 nRepeatsTodayGrtrThan_lag3
## 2013-01-04 1 1
## 2013-01-07 2 1
## 2013-01-08 3 2
## 2013-01-09 3 2
## 2013-01-10 4 1
## 2013-01-11 5 1
## decrDaysThisCycle nTimesDecrDayCountsOccurs todayLessThan_lag3
## 2013-01-04 0 702 0
## 2013-01-07 0 702 0
## 2013-01-08 1 87 0
## 2013-01-09 1 87 1
## 2013-01-10 0 702 0
## 2013-01-11 0 702 0
## cumulativeSumTodayLessThan_lag3 nRepeatsTodayLessThan_lag3
## 2013-01-04 0 3
## 2013-01-07 0 3
## 2013-01-08 0 3
## 2013-01-09 1 4
## 2013-01-10 1 4
## 2013-01-11 1 4
## incrDaysThisCycle nTimesIncrDayCountsOccurs NextDayPrice
## 2013-01-04 2 21 21.08
## 2013-01-07 2 21 20.83
## 2013-01-08 2 21 20.76
## 2013-01-09 3 52 22.19
## 2013-01-10 3 52 22.12
## 2013-01-11 3 52 21.97
## NextDay2PercentIncrease
## 2013-01-04 0
## 2013-01-07 0
## 2013-01-08 0
## 2013-01-09 1
## 2013-01-10 0
## 2013-01-11 0
Remove the next day price field just added so that it doesn’t have multicollinearity or correlation with the predictions of whether or not the next day will increase by 2% (1 value) or not (0 value).
colnames(stock_6)
## [1] "stockValue" "lag3"
## [3] "today2_lag3" "todayGrtrThan_lag3"
## [5] "cumulativeSumTodayGrtrThan_lag3" "nRepeatsTodayGrtrThan_lag3"
## [7] "decrDaysThisCycle" "nTimesDecrDayCountsOccurs"
## [9] "todayLessThan_lag3" "cumulativeSumTodayLessThan_lag3"
## [11] "nRepeatsTodayLessThan_lag3" "incrDaysThisCycle"
## [13] "nTimesIncrDayCountsOccurs" "NextDayPrice"
## [15] "NextDay2PercentIncrease"
For this set that is column 14 for ‘NextDayPrice’ to remove before running our predictive analytics algorithms.
set.seed(12356789)
NextDayPrice <- as.vector(stock_6$NextDayPrice)
stock_6 <- stock_6[,-14]
inTrain <- createDataPartition(y=stock_6$NextDay2PercentIncrease, p=0.7, list=FALSE)
trainingSet <- stock_6[inTrain,]
testingSet <- stock_6[-inTrain,]
Here are the first few observations in the training set.
head(trainingSet)
## stockValue lag3 today2_lag3 todayGrtrThan_lag3
## 2013-01-04 21.52 18.85 1.1416446 1
## 2013-01-08 20.83 20.15 1.0337469 1
## 2013-01-09 20.76 21.52 0.9646840 0
## 2013-01-10 22.19 21.08 1.0526565 1
## 2013-01-11 22.12 20.83 1.0619299 1
## 2013-01-15 20.61 22.19 0.9287968 0
## cumulativeSumTodayGrtrThan_lag3 nRepeatsTodayGrtrThan_lag3
## 2013-01-04 1 1
## 2013-01-08 3 2
## 2013-01-09 3 2
## 2013-01-10 4 1
## 2013-01-11 5 1
## 2013-01-15 6 6
## decrDaysThisCycle nTimesDecrDayCountsOccurs todayLessThan_lag3
## 2013-01-04 0 702 0
## 2013-01-08 1 87 0
## 2013-01-09 1 87 1
## 2013-01-10 0 702 0
## 2013-01-11 0 702 0
## 2013-01-15 5 16 1
## cumulativeSumTodayLessThan_lag3 nRepeatsTodayLessThan_lag3
## 2013-01-04 0 3
## 2013-01-08 0 3
## 2013-01-09 1 4
## 2013-01-10 1 4
## 2013-01-11 1 4
## 2013-01-15 2 1
## incrDaysThisCycle nTimesIncrDayCountsOccurs NextDay2PercentIncrease
## 2013-01-04 2 21 0
## 2013-01-08 2 21 0
## 2013-01-09 3 52 1
## 2013-01-10 3 52 0
## 2013-01-11 3 52 0
## 2013-01-15 0 609 0
Here are the first few observations in the testing set.
head(testingSet)
## stockValue lag3 today2_lag3 todayGrtrThan_lag3
## 2013-01-07 21.08 19.70 1.0700508 1
## 2013-01-14 21.97 20.76 1.0582852 1
## 2013-01-18 20.30 20.61 0.9849588 0
## 2013-01-22 19.90 20.36 0.9774067 0
## 2013-01-24 20.83 20.30 1.0261084 1
## 2013-01-28 21.03 20.34 1.0339233 1
## cumulativeSumTodayGrtrThan_lag3 nRepeatsTodayGrtrThan_lag3
## 2013-01-07 2 1
## 2013-01-14 6 6
## 2013-01-18 6 6
## 2013-01-22 6 6
## 2013-01-24 8 1
## 2013-01-28 10 1
## decrDaysThisCycle nTimesDecrDayCountsOccurs todayLessThan_lag3
## 2013-01-07 0 702 0
## 2013-01-14 5 16 0
## 2013-01-18 5 16 1
## 2013-01-22 5 16 1
## 2013-01-24 0 702 0
## 2013-01-28 0 702 0
## cumulativeSumTodayLessThan_lag3 nRepeatsTodayLessThan_lag3
## 2013-01-07 0 3
## 2013-01-14 1 4
## 2013-01-18 5 1
## 2013-01-22 6 6
## 2013-01-24 6 6
## 2013-01-28 6 6
## incrDaysThisCycle nTimesIncrDayCountsOccurs NextDay2PercentIncrease
## 2013-01-07 2 21 0
## 2013-01-14 3 52 0
## 2013-01-18 0 609 0
## 2013-01-22 5 11 1
## 2013-01-24 5 11 0
## 2013-01-28 5 11 0
Some modifications to the data as a factor instead of numeric when copy and pasting the code above. The random forest is taking a tree approach to classification of the next day’s value as 1 for increased by 2% or a 0 for didn’t increase by 2%. Otherwise, the random forest would have done regression as done in previous two runs with the next day price value guess. Below is the caret random forest or you could use the random forest package. Make sure not to keep the round(,0) modification or all values will be NA. If you just run this, then no worries.
rfMod <- train(NextDay2PercentIncrease ~ ., method='rf', data=(trainingSet),
trControl=trainControl(method='cv'), number=5)
predRF <- predict(rfMod, testingSet)
predDF <- data.frame(predRF, NextDay2PercentIncrease=testingSet$NextDay2PercentIncrease)
print(rbind(head(predDF,10),tail(predDF,10)))
## predRF NextDay2PercentIncrease
## 1 0 0
## 2 0 0
## 3 0 0
## 4 1 1
## 5 0 0
## 6 0 0
## 7 0 1
## 8 1 0
## 9 0 0
## 10 0 0
## 535 0 0
## 536 0 1
## 537 0 0
## 538 0 0
## 539 0 0
## 540 0 0
## 541 0 1
## 542 0 0
## 543 1 0
## 544 1 1
sum <- sum(predRF==testingSet$NextDay2PercentIncrease)
length <- length(testingSet$NextDay2PercentIncrease)
accuracy_rfMod <- (sum/length)
accuracy_rfMod
## [1] 0.8198529
results <- c(round(accuracy_rfMod,2), round(100,2))
results <- as.factor(results)
results <- t(data.frame(results))
colnames(results) <- colnames(predDF)
Results <- rbind(predDF, results)
print(rbind(head(Results),tail(Results)))
## predRF NextDay2PercentIncrease
## 1 0 0
## 2 0 0
## 3 0 0
## 4 1 1
## 5 0 0
## 6 0 0
## 540 0 0
## 541 0 1
## 542 0 0
## 543 1 0
## 544 1 1
## results 0.82 100
From the above, 80% is not bad. You could run this program just using the random forest for if the next day value will increase by 2% and be right 4 out of 5 days. Lets see the rest using the caret package for the classification. The metric was changed to ‘Accuracy’ or ‘Kappa’ to use the train() of the caret package to classify instead of regress the features on the target.
knnMod <- train(NextDay2PercentIncrease ~ .,
method='knn', preProcess=c('center','scale'),
tuneLength=10, metric='Accuracy',
trControl=trainControl(method='cv'), data=trainingSet)
rpartMod <- train(NextDay2PercentIncrease ~ .,
method='rpart', tuneLength=9, metric='Accuracy',
data=trainingSet)
glmMod <- train(NextDay2PercentIncrease ~ ., metric='Kappa',
method='glm', data=trainingSet)
predKNN <- predict(knnMod, testingSet)
predRPART <- predict(rpartMod, testingSet)
predGLM <- predict(glmMod, testingSet)
length=length(testingSet$NextDay2PercentIncrease)
length
## [1] 544
sumKNN <- sum(predKNN==testingSet$NextDay2PercentIncrease)
sumRPart <- sum(predRPART==testingSet$NextDay2PercentIncrease)
sumGLM <- sum(predGLM==testingSet$NextDay2PercentIncrease)
cat('The number correct for KNN is ',sumKNN, ', the number correct for Rpart is ',sumRPart,
' and the number correct for GLM is ',sumGLM,'.')
## The number correct for KNN is 437 , the number correct for Rpart is 447 and the number correct for GLM is 441 .
accuracy_KNN <- sumKNN/length
accuracy_RPART <- sumRPart/length
accuracy_GLM <- sumGLM/length
cat('The accuracy for KNN, RPart, and GLM respectively is: ',
accuracy_KNN,accuracy_RPART,accuracy_GLM,'.')
## The accuracy for KNN, RPart, and GLM respectively is: 0.8033088 0.8216912 0.8106618 .
predDF2 <- data.frame(predRF,predKNN,predRPART,predGLM,
NextDay2PercentIncrease=testingSet$NextDay2PercentIncrease)
colnames(predDF2) <- c('RandomForest','KNN','Rpart','GLM','NextDay2PercentIncrease')
results <- c(round(accuracy_rfMod,2),
round(accuracy_KNN,2),
round(accuracy_RPART,2),
round(accuracy_GLM,2),
round(100,2))
results <- as.factor(results)
results <- t(data.frame(results))
colnames(results) <- c('RandomForest','KNN','Rpart','GLM',
'NextDay2PercentIncrease')
Results <- rbind(predDF2, results)
Results1 <- datatable(data=Results, rownames=FALSE,
filter=list(position='top'),
options=list(
dom='Bfrtip',
buttons=c('csv'),
language=list(sSearch='Filter:')),
extensions=c('Buttons','Responsive'))
Results1
Lets now align these predicted values to the subset for the stock YELP made earlier.
stock_6$Date <- row.names(stock_6)
testingSet$Date <- row.names(testingSet)
predDF2$Date <- row.names(testingSet)
stock_7 <- merge(testingSet,predDF2, by.x='Date', by.y='Date')
tbl <- rbind(head(stock_7),tail(stock_7))
tbl
## Date stockValue lag3 today2_lag3 todayGrtrThan_lag3
## 1 2013-01-07 21.08 19.70 1.0700508 1
## 2 2013-01-14 21.97 20.76 1.0582852 1
## 3 2013-01-18 20.30 20.61 0.9849588 0
## 4 2013-01-22 19.90 20.36 0.9774067 0
## 5 2013-01-24 20.83 20.30 1.0261084 1
## 6 2013-01-28 21.03 20.34 1.0339233 1
## 539 2020-02-14 35.23 34.76 1.0135213 1
## 540 2020-02-18 34.57 35.36 0.9776584 0
## 541 2020-03-03 30.59 31.40 0.9742038 0
## 542 2020-03-11 24.00 28.94 0.8293020 0
## 543 2020-03-16 17.68 24.00 0.7366667 0
## 544 2020-03-20 17.80 17.53 1.0154022 1
## cumulativeSumTodayGrtrThan_lag3 nRepeatsTodayGrtrThan_lag3
## 1 2 1
## 2 6 6
## 3 6 6
## 4 6 6
## 5 8 1
## 6 10 1
## 539 951 12
## 540 951 12
## 541 951 12
## 542 952 12
## 543 952 12
## 544 953 1
## decrDaysThisCycle nTimesDecrDayCountsOccurs todayLessThan_lag3
## 1 0 702 0
## 2 5 16 0
## 3 5 16 1
## 4 5 16 1
## 5 0 702 0
## 6 0 702 0
## 539 11 5 0
## 540 11 5 1
## 541 11 5 1
## 542 11 5 1
## 543 11 5 1
## 544 0 702 0
## cumulativeSumTodayLessThan_lag3 nRepeatsTodayLessThan_lag3
## 1 0 3
## 2 1 4
## 3 5 1
## 4 6 6
## 5 6 6
## 6 6 6
## 539 840 5
## 540 841 1
## 541 851 2
## 542 856 1
## 543 859 1
## 544 862 4
## incrDaysThisCycle nTimesIncrDayCountsOccurs NextDay2PercentIncrease.x
## 1 2 21 0
## 2 3 52 0
## 3 0 609 0
## 4 5 11 1
## 5 5 11 0
## 6 5 11 0
## 539 4 42 0
## 540 0 609 0
## 541 1 73 1
## 542 0 609 0
## 543 0 609 0
## 544 3 52 1
## RandomForest KNN Rpart GLM NextDay2PercentIncrease.y
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 1 1 1 1 1
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## 539 0 0 0 0 0
## 540 0 0 0 0 0
## 541 0 0 0 0 1
## 542 0 0 0 0 0
## 543 1 0 0 0 0
## 544 1 0 0 0 1
stock_7$RandomForest <- as.numeric(paste(stock_7$RandomForest))
stock_7$Rpart <- as.numeric(paste(stock_7$Rpart))
stock_7$GLM <- as.numeric(paste(stock_7$GLM))
stock_7$NextDay2PercentIncrease.y <-
as.numeric(paste(stock_7$NextDay2PercentIncrease.y))
stock_7$KNN <- as.numeric(paste(stock_7$KNN))
stock_7$mostVoted <- ifelse(rowSums(stock_7[,16:19])>=2,1,0)
sMost <- sum(stock_7$mostVoted==stock_7$NextDay2PercentIncrease.y)
lMost <- length(stock_7$NextDay2PercentIncrease.y)
accMost <- sMost/lMost
results <- as.data.frame(c(round(accuracy_rfMod,2),
round(accuracy_KNN,2),
round(accuracy_RPART,2),
round(accuracy_GLM,2),
round(accMost,2),
round(100,2)))
colnames(results) <- 'Results'
row.names(results) <- c('RandomForest','KNN','Rpart','GLM','mostVoted',
'NextDay2PercentIncrease')
results
## Results
## RandomForest 0.82
## KNN 0.80
## Rpart 0.82
## GLM 0.81
## mostVoted 0.82
## NextDay2PercentIncrease 100.00
The above accuracy measures for random forest, recursive partitioning trees, k-nearest neighbors, and generalized linear models and the most voted algorithm for this classification were shown. The following is the table of the first 30 predicted results on the testing set of samples. Also, add back in the NextDayPrice extracted earlier so the results weren’t dependent on the next day price when predicting whether the next day would be increased by 2%. Note that because not all stocks have the same time series start dates, the size of each stock time series being selected will vary with some stocks.
stock_6$NextDayPrice <- NextDayPrice #add back in to original complete dataset
stock_6b <- stock_6[,15:16]
stock_8 <- merge(stock_7,stock_6b,by.x='Date',by.y='Date')
print(rbind(head(stock_8,30),tail(stock_8,30)))
## Date stockValue lag3 today2_lag3 todayGrtrThan_lag3
## 1 2013-01-07 21.08 19.70 1.0700508 1
## 2 2013-01-14 21.97 20.76 1.0582852 1
## 3 2013-01-18 20.30 20.61 0.9849588 0
## 4 2013-01-22 19.90 20.36 0.9774067 0
## 5 2013-01-24 20.83 20.30 1.0261084 1
## 6 2013-01-28 21.03 20.34 1.0339233 1
## 7 2013-02-04 20.34 20.46 0.9941349 0
## 8 2013-02-11 21.63 22.38 0.9664879 0
## 9 2013-02-20 21.89 21.44 1.0209888 1
## 10 2013-02-25 21.62 21.89 0.9876656 0
## 11 2013-03-05 23.02 22.19 1.0374042 1
## 12 2013-03-11 25.13 23.02 1.0916594 1
## 13 2013-03-13 25.02 25.01 1.0003998 1
## 14 2013-03-14 25.05 25.13 0.9968166 0
## 15 2013-03-21 24.00 24.50 0.9795918 0
## 16 2013-03-25 23.49 24.12 0.9738806 0
## 17 2013-03-27 23.72 23.79 0.9970576 0
## 18 2013-03-28 23.71 23.49 1.0093657 1
## 19 2013-04-02 22.77 23.72 0.9599494 0
## 20 2013-04-05 25.70 22.77 1.1286781 1
## 21 2013-04-16 25.71 26.35 0.9757116 0
## 22 2013-04-24 24.97 25.50 0.9792157 0
## 23 2013-04-26 25.45 25.29 1.0063266 1
## 24 2013-05-03 31.12 26.03 1.1955436 1
## 25 2013-05-07 30.67 32.22 0.9518932 0
## 26 2013-05-08 29.91 31.12 0.9611183 0
## 27 2013-05-09 30.67 30.69 0.9993483 0
## 28 2013-05-10 30.93 30.67 1.0084773 1
## 29 2013-05-14 30.97 30.67 1.0097815 1
## 30 2013-05-20 30.97 30.01 1.0319893 1
## 515 2019-10-07 33.28 33.33 0.9984998 0
## 516 2019-10-11 32.72 32.75 0.9990840 0
## 517 2019-10-29 34.32 33.51 1.0241719 1
## 518 2019-10-30 34.85 34.69 1.0046123 1
## 519 2019-11-04 34.24 34.85 0.9824964 0
## 520 2019-11-06 32.71 35.06 0.9329720 0
## 521 2019-11-12 35.77 30.12 1.1875830 1
## 522 2019-11-15 34.61 35.77 0.9675706 0
## 523 2019-12-02 33.47 35.13 0.9527469 0
## 524 2019-12-04 34.18 34.68 0.9855825 0
## 525 2019-12-12 33.39 33.08 1.0093712 1
## 526 2019-12-16 33.78 32.71 1.0327117 1
## 527 2019-12-18 34.14 33.55 1.0175857 1
## 528 2019-12-19 34.11 33.78 1.0097691 1
## 529 2019-12-23 34.21 34.14 1.0020504 1
## 530 2020-01-03 34.72 34.58 1.0040486 1
## 531 2020-01-07 34.57 34.77 0.9942479 0
## 532 2020-01-08 35.54 34.72 1.0236175 1
## 533 2020-01-14 36.59 35.16 1.0406712 1
## 534 2020-01-17 36.51 36.59 0.9978136 0
## 535 2020-01-28 35.41 35.93 0.9855274 0
## 536 2020-02-03 32.94 35.39 0.9307714 0
## 537 2020-02-05 34.66 32.60 1.0631902 1
## 538 2020-02-11 34.76 34.12 1.0187573 1
## 539 2020-02-14 35.23 34.76 1.0135213 1
## 540 2020-02-18 34.57 35.36 0.9776584 0
## 541 2020-03-03 30.59 31.40 0.9742038 0
## 542 2020-03-11 24.00 28.94 0.8293020 0
## 543 2020-03-16 17.68 24.00 0.7366667 0
## 544 2020-03-20 17.80 17.53 1.0154022 1
## cumulativeSumTodayGrtrThan_lag3 nRepeatsTodayGrtrThan_lag3
## 1 2 1
## 2 6 6
## 3 6 6
## 4 6 6
## 5 8 1
## 6 10 1
## 7 12 4
## 8 15 2
## 9 19 1
## 10 21 4
## 11 25 1
## 12 29 1
## 13 31 2
## 14 31 2
## 15 32 9
## 16 32 9
## 17 32 9
## 18 33 4
## 19 33 4
## 20 35 1
## 21 41 8
## 22 41 8
## 23 43 1
## 24 47 1
## 25 48 4
## 26 48 4
## 27 48 4
## 28 49 1
## 29 51 4
## 30 52 1
## 515 903 11
## 516 903 11
## 517 913 1
## 518 914 2
## 519 915 5
## 520 915 5
## 521 918 1
## 522 919 4
## 523 923 11
## 524 923 11
## 525 924 1
## 526 926 1
## 527 928 1
## 528 929 1
## 529 931 2
## 530 936 3
## 531 936 3
## 532 937 1
## 533 941 1
## 534 943 5
## 535 944 8
## 536 944 8
## 537 945 1
## 538 948 1
## 539 951 12
## 540 951 12
## 541 951 12
## 542 952 12
## 543 952 12
## 544 953 1
## decrDaysThisCycle nTimesDecrDayCountsOccurs todayLessThan_lag3
## 1 0 702 0
## 2 5 16 0
## 3 5 16 1
## 4 5 16 1
## 5 0 702 0
## 6 0 702 0
## 7 3 48 1
## 8 1 87 1
## 9 0 702 0
## 10 3 48 1
## 11 0 702 0
## 12 0 702 0
## 13 1 87 0
## 14 1 87 1
## 15 8 3 1
## 16 8 3 1
## 17 8 3 1
## 18 3 48 0
## 19 3 48 1
## 20 0 702 0
## 21 7 9 1
## 22 7 9 1
## 23 0 702 0
## 24 0 702 0
## 25 3 48 1
## 26 3 48 1
## 27 3 48 1
## 28 0 702 0
## 29 3 48 0
## 30 0 702 0
## 515 10 5 1
## 516 10 5 1
## 517 0 702 0
## 518 1 87 0
## 519 4 30 1
## 520 4 30 1
## 521 0 702 0
## 522 3 48 1
## 523 10 5 1
## 524 10 5 1
## 525 0 702 0
## 526 0 702 0
## 527 0 702 0
## 528 0 702 0
## 529 1 87 0
## 530 2 30 0
## 531 2 30 1
## 532 0 702 0
## 533 0 702 0
## 534 4 30 1
## 535 7 9 1
## 536 7 9 1
## 537 0 702 0
## 538 0 702 0
## 539 11 5 0
## 540 11 5 1
## 541 11 5 1
## 542 11 5 1
## 543 11 5 1
## 544 0 702 0
## cumulativeSumTodayLessThan_lag3 nRepeatsTodayLessThan_lag3
## 1 0 3
## 2 1 4
## 3 5 1
## 4 6 6
## 5 6 6
## 6 6 6
## 7 9 1
## 8 11 2
## 9 13 6
## 10 14 1
## 11 16 11
## 12 16 11
## 13 16 11
## 14 17 2
## 15 21 1
## 16 23 1
## 17 25 2
## 18 25 2
## 19 27 1
## 20 28 9
## 21 29 1
## 22 35 5
## 23 35 5
## 24 36 4
## 25 37 1
## 26 38 1
## 27 39 4
## 28 39 4
## 29 39 4
## 30 42 3
## 515 798 1
## 516 802 5
## 517 804 8
## 518 804 8
## 519 806 1
## 520 808 1
## 521 809 5
## 522 811 1
## 523 817 1
## 524 819 1
## 525 824 9
## 526 824 9
## 527 824 9
## 528 824 9
## 529 824 9
## 530 826 2
## 531 828 8
## 532 828 8
## 533 828 8
## 534 829 1
## 535 834 1
## 536 838 1
## 537 839 4
## 538 840 5
## 539 840 5
## 540 841 1
## 541 851 2
## 542 856 1
## 543 859 1
## 544 862 4
## incrDaysThisCycle nTimesIncrDayCountsOccurs NextDay2PercentIncrease.x
## 1 2 21 0
## 2 3 52 0
## 3 0 609 0
## 4 5 11 1
## 5 5 11 0
## 6 5 11 0
## 7 0 609 1
## 8 1 73 0
## 9 5 11 0
## 10 0 609 0
## 11 10 6 0
## 12 10 6 0
## 13 10 6 0
## 14 1 73 0
## 15 0 609 0
## 16 0 609 0
## 17 1 73 0
## 18 1 73 0
## 19 0 609 0
## 20 8 13 0
## 21 0 609 0
## 22 4 42 0
## 23 4 42 1
## 24 3 52 0
## 25 0 609 0
## 26 0 609 1
## 27 3 52 0
## 28 3 52 0
## 29 3 52 0
## 30 2 21 0
## 515 0 609 0
## 516 4 42 0
## 517 7 14 0
## 518 7 14 0
## 519 0 609 0
## 520 0 609 0
## 521 4 42 0
## 522 0 609 0
## 523 0 609 0
## 524 0 609 0
## 525 8 13 0
## 526 8 13 0
## 527 8 13 0
## 528 8 13 0
## 529 8 13 0
## 530 1 73 0
## 531 7 14 1
## 532 7 14 0
## 533 7 14 0
## 534 0 609 0
## 535 0 609 0
## 536 0 609 1
## 537 3 52 0
## 538 4 42 0
## 539 4 42 0
## 540 0 609 0
## 541 1 73 1
## 542 0 609 0
## 543 0 609 0
## 544 3 52 1
## RandomForest KNN Rpart GLM NextDay2PercentIncrease.y mostVoted NextDayPrice
## 1 0 0 0 0 0 0 20.83
## 2 0 0 0 0 0 0 20.61
## 3 0 0 0 0 0 0 19.90
## 4 1 1 1 1 1 1 20.34
## 5 0 0 0 0 0 0 21.17
## 6 0 0 0 0 0 0 20.84
## 7 0 0 0 0 1 0 21.18
## 8 1 1 0 1 0 1 21.66
## 9 0 0 0 0 0 0 22.00
## 10 0 0 0 0 0 0 21.29
## 11 0 1 0 0 0 0 23.02
## 12 0 0 0 0 0 0 23.73
## 13 0 0 0 0 0 0 25.05
## 14 0 1 0 1 0 1 25.10
## 15 0 0 0 0 0 0 23.79
## 16 0 0 0 0 0 0 23.68
## 17 0 0 0 0 0 0 23.71
## 18 0 0 0 0 0 0 22.89
## 19 0 0 0 0 0 0 23.05
## 20 0 0 0 0 0 0 25.00
## 21 0 0 0 0 0 0 26.10
## 22 1 1 1 1 0 1 25.42
## 23 0 0 0 0 1 0 26.62
## 24 0 0 0 0 0 0 30.69
## 25 0 0 0 0 0 0 29.91
## 26 0 0 0 0 1 0 30.67
## 27 1 1 1 1 0 1 30.93
## 28 0 0 0 0 0 0 30.92
## 29 0 0 0 0 0 0 30.01
## 30 0 0 0 0 0 0 31.58
## 515 0 0 0 0 0 0 32.75
## 516 1 1 1 0 0 1 32.81
## 517 0 0 0 0 0 0 34.85
## 518 0 0 0 0 0 0 34.51
## 519 0 0 0 0 0 0 33.16
## 520 0 0 0 0 0 0 30.12
## 521 0 0 0 0 0 0 35.56
## 522 0 0 0 0 0 0 34.78
## 523 0 0 0 0 0 0 33.66
## 524 0 0 0 0 0 0 33.42
## 525 0 0 0 0 0 0 33.55
## 526 0 0 0 0 0 0 34.00
## 527 0 0 0 0 0 0 34.11
## 528 0 0 0 0 0 0 34.03
## 529 0 0 0 0 0 0 34.06
## 530 0 0 0 0 0 0 34.44
## 531 0 0 1 0 1 0 35.54
## 532 0 0 0 0 0 0 35.16
## 533 0 0 0 0 0 0 36.95
## 534 0 0 0 0 0 0 35.33
## 535 0 0 0 0 0 0 35.39
## 536 0 0 0 0 1 0 34.02
## 537 0 0 0 0 0 0 34.12
## 538 0 0 0 0 0 0 35.36
## 539 0 0 0 0 0 0 34.57
## 540 0 0 0 0 0 0 34.04
## 541 0 0 0 0 1 0 31.39
## 542 0 0 0 0 0 0 21.46
## 543 1 0 0 0 0 0 17.53
## 544 1 0 0 0 1 0 18.51
The above is the first and last 30 of the testing set samples with the next day predictions in predicting whether the value would be 2% more than the previous day.The stock you asked for when running the script at the counts section is the stock that was analyzed. you can rerun this script for any of the stocks in that dataset where you search for these markers ‘%^&%’ in this script within RStudio.
%%%%%%%%%%%%%%%%%%%%%%%%%% ***
The above script can be run with a different lead value to see how well the algorithms use this time series with the same or varying lag prices to also predict the value of the stock increasing more than 2% in the future if the stock has enough information for how far into the future you want to forecast or predict. Lets do that but with a 1 day lag and a prediction for whether or not the stock value will increase in 30 days by 2%. We just change the lag and lead values to 1 and 30 respectively. There will be missing information, that will shrink our data for the last 30 days, if there aren’t more dates available after the end date specified.
stock_1 <- toupper("yelp")
lag <- 1
startDate <- '2012-12-31'
endDate <- '2020-03-25'
lead <- 30
Lets subset our large table to get the dates requested.
sDF <- subset(DF, DF$Date>=startDate & DF$Date<=endDate)
lagN <- paste('lag',lag,sep='')
cat('The number of days to retrieve the stock value compared to each day value listed as an instance is ',lag,'and the stock to look up this information for is ',stock_1)
## The number of days to retrieve the stock value compared to each day value listed as an instance is 1 and the stock to look up this information for is YELP
stknme <- as.character(newStocks[newStocks$stockSymbol==stock_1,2])
cat('\nThis stock is ',stknme)
##
## This stock is Yelp Inc.
Lstock_1 <- subset(sDF, sDF$stockName==stock_1)
cat('The number of days for trading that this time period will provide counts of increasing and decreasing days is ', length(Lstock_1$Date), ' trading days.')
## The number of days for trading that this time period will provide counts of increasing and decreasing days is 1820 trading days.
Generic automation of above stock to look up and the lag to use for generating the counts, group of counts, and lag values to get those counts.
Lstock_1$startDayValue <-Lstock_1$stockValue[1]
Lstock_1$startDayDate <- Lstock_1$Date[1]
Lstock_1$finalDayValue <-Lstock_1$stockValue[length(Lstock_1$stockValue)]
Lstock_1$finalDayDate <- Lstock_1$Date[length(Lstock_1$Date)]
stock_1LN <- lag(Lstock_1$stockValue, lag)
Lstock_1$lagN <- stock_1LN
Lstock_1$today2_lagN <- Lstock_1$stockValue/Lstock_1$lagN
Lstock_1 <- Lstock_1[complete.cases(Lstock_1),]
Lets look at the data we will be adding counts of increasing and decreasing days, for the time interval dates and stock values at the beginning and end of the time interval available or requested.
cat('\nThe lag for this table was for ',lag,'days.','\nThe stock to look up was ', stock_1,'.\nThe start date of this stock and starting value was ',as.character(paste(Lstock_1$startDayDate[1])),' and ','$',Lstock_1$startDayValue[1],'\nThe end date and end date price of this stock analysis is ',as.character(paste(Lstock_1$finalDayDate[1])),' and ','$',Lstock_1$finalDayValue[1])
##
## The lag for this table was for 1 days.
## The stock to look up was YELP .
## The start date of this stock and starting value was 2012-12-31 and $ 18.85
## The end date and end date price of this stock analysis is 2020-03-24 and $ 20.64
roi <- Lstock_1$finalDayValue[1]/Lstock_1$startDayValue[1]
cat('\nThe return on investment as a percentage of the amount invested for this time period is ',roi)
##
## The return on investment as a percentage of the amount invested for this time period is 1.09496
cat('\n\nIn dollars initially invested your return is $',Lstock_1$finalDayValue[1]-Lstock_1$startDayValue[1], 'for the dates:',
as.character(paste(Lstock_1$startDayDate[1])),'through ',
as.character(paste(Lstock_1$finalDayDate[1])))
##
##
## In dollars initially invested your return is $ 1.79 for the dates: 2012-12-31 through 2020-03-24
Now, lets look at the counts and group counts of increasing and decreasing days for this stock and the time period available.
#assign a 1 to increasing values
Lstock_1$todayGrtrThan_lagN <- ifelse(Lstock_1$today2_lagN>1, 1,0)
Lstock_1$cumulativeSumTodayGrtrThan_lagN <- cumsum(Lstock_1$todayGrtrThan_lagN)
# get the count of how many instances repeat,
# those counts repeating are counts that measure the days cumulatively decreasing
# those cumulative counts that don't repeat, are counting increasing days.
# These are stock values for today's value to 7 days prior value.
countstock_10 <- Lstock_1 %>% group_by(cumulativeSumTodayGrtrThan_lagN) %>% count(n=n())
countstock_10 <- as.data.frame(countstock_10)
countstock_10 <- countstock_10[,-3]
colnames(countstock_10)[2] <- 'nRepeatsTodayGrtrThan_lagN'
# Count the REPEATS of each number (minus the initial start)
countstock_10$decrDaysThisCycle <- countstock_10$n-1
# Count the number of times the cycle count repeats in this time span exactly that many days
countstock_10b <- countstock_10 %>% group_by(decrDaysThisCycle) %>% count(n=n())
countstock_10b <- as.data.frame(countstock_10b)
countstock_10b <- countstock_10b[,-3]
colnames(countstock_10b)[2] <- 'nTimesDecrDayCountsOccurs'
#combine these two count matrices of decreasing days
countsstock_tableDecr <- merge(countstock_10, countstock_10b, by.x='decrDaysThisCycle',
by.y='decrDaysThisCycle')
#combine the counts to the stock subset
stock_3 <- merge(Lstock_1, countsstock_tableDecr, by.x='cumulativeSumTodayGrtrThan_lagN',
by.y='cumulativeSumTodayGrtrThan_lagN')
#assign a 1 to decreasing values
stock_3$todayLessThan_lagN <- ifelse(stock_3$today2_lagN>1, 0,1)
stock_3$cumulativeSumTodayLessThan_lagN <- cumsum(stock_3$todayLessThan_lagN)
# get the count of how many instances repeat,
# those counts repeating are counts that measure the days cumulatively increasing
# those cumulative counts that don't repeat, are counting decreasing days.
# These are stock values for today's value to 7 days prior value.
countstock_11 <- stock_3 %>% group_by(cumulativeSumTodayLessThan_lagN) %>% count(n=n())
countstock_11 <- as.data.frame(countstock_11)
countstock_11 <- countstock_11[,-3]
colnames(countstock_11)[2] <- 'nRepeatsTodayLessThan_lagN'
# Count the REPEATS of each number (minus the initial start)
countstock_11$incrDaysThisCycle <- countstock_11$n-1
# Count the number of times the cycle count repeats in this time span exactly that many days
countstock_11b <- countstock_11 %>% group_by(incrDaysThisCycle) %>% count(n=n())
countstock_11b <- as.data.frame(countstock_11b)
countstock_11b <- countstock_11b[,-3]
colnames(countstock_11b)[2] <- 'nTimesIncrDayCountsOccurs'
#combine these two count matrices of decreasing days
countsstock_tableIncr <- merge(countstock_11, countstock_11b,
by.x='incrDaysThisCycle',
by.y='incrDaysThisCycle')
#combine the counts to the stock subset
stock_4 <- merge(stock_3, countsstock_tableIncr,
by.x='cumulativeSumTodayLessThan_lagN',
by.y='cumulativeSumTodayLessThan_lagN')
colnames(stock_4)
## [1] "cumulativeSumTodayLessThan_lagN" "cumulativeSumTodayGrtrThan_lagN"
## [3] "Date" "stockName"
## [5] "stockValue" "startDayValue"
## [7] "startDayDate" "finalDayValue"
## [9] "finalDayDate" "lagN"
## [11] "today2_lagN" "todayGrtrThan_lagN"
## [13] "decrDaysThisCycle" "nRepeatsTodayGrtrThan_lagN"
## [15] "nTimesDecrDayCountsOccurs" "todayLessThan_lagN"
## [17] "incrDaysThisCycle" "nRepeatsTodayLessThan_lagN"
## [19] "nTimesIncrDayCountsOccurs"
stock_5 <- stock_4[,c(3:11,
12,2,14,13,15,
16,1,18,17,19)]
colnames(stock_5) <- gsub('lagN',lagN,colnames(stock_5))
colnames(stock_5)
## [1] "Date" "stockName"
## [3] "stockValue" "startDayValue"
## [5] "startDayDate" "finalDayValue"
## [7] "finalDayDate" "lag1"
## [9] "today2_lag1" "todayGrtrThan_lag1"
## [11] "cumulativeSumTodayGrtrThan_lag1" "nRepeatsTodayGrtrThan_lag1"
## [13] "decrDaysThisCycle" "nTimesDecrDayCountsOccurs"
## [15] "todayLessThan_lag1" "cumulativeSumTodayLessThan_lag1"
## [17] "nRepeatsTodayLessThan_lag1" "incrDaysThisCycle"
## [19] "nTimesIncrDayCountsOccurs"
pretty_headers <- str_to_title(colnames(stock_5))
stock5 <- datatable(data=stock_5, rownames=FALSE,
colnames=pretty_headers,
filter=list(position='top'),
options=list(
dom='Bfrtip',
buttons=c('colvis','csv','excel'),
language=list(sSearch='Filter:')),
extensions=c('Buttons','Responsive')
)
stock5
Using this information on one stock of the thousands available in our large csv file and table, lets return the count information and the number of times this stock has seen those exact days of counts.
cat('\nThe number of times this stock has decreased in the current cycle from the start of this time period retrieved in price comparison of the number of days in lags retrieved prior to each instance dates\' stock value is ', stock_5$decrDaysThisCycle[length(stock_5$decrDaysThisCycle)],'\n')
##
## The number of times this stock has decreased in the current cycle from the start of this time period retrieved in price comparison of the number of days in lags retrieved prior to each instance dates' stock value is 0
cat('\nThe number of times this stock has increased in the current cycle from the start of this time period retrieved in price comparison of the number of days in lags retrieved prior to each instance dates\' stock value is ', stock_5$incrDaysThisCycle[length(stock_5$incrDaysThisCycle)],'\n')
##
## The number of times this stock has increased in the current cycle from the start of this time period retrieved in price comparison of the number of days in lags retrieved prior to each instance dates' stock value is 4
cat('\nThe number of times this stock has decreased exactly this number of days compared to its price ', lag, ' days ago, is ', stock_5$nTimesDecrDayCountsOccurs[length(stock_5$nTimesDecrDayCountsOccurs)],'\n')
##
## The number of times this stock has decreased exactly this number of days compared to its price 1 days ago, is 438
cat('\nThe number of times this stock has increased exactly this number of days compared to its price ', lag, ' days ago, is ', stock_5$nTimesIncrDayCountsOccurs[length(stock_5$nTimesIncrDayCountsOccurs)],'\n')
##
## The number of times this stock has increased exactly this number of days compared to its price 1 days ago, is 26
The unique number of days this stock selected (for the time period retrieved) decreased is shown in the table below.
stock_5[unique(stock_5$decrDaysThisCycle),c(1:3,8,9,13,14)]
## Date stockName stockValue lag1 today2_lag1 decrDaysThisCycle
## 3 2013-01-04 YELP 21.52 20.15 1.0679901 3
## 5 2013-01-08 YELP 20.83 21.08 0.9881404 3
## 1 2013-01-02 YELP 19.70 18.85 1.0450928 0
## 2 2013-01-03 YELP 20.15 19.70 1.0228426 0
## 6 2013-01-09 YELP 20.76 20.83 0.9966395 3
## 4 2013-01-07 YELP 21.08 21.52 0.9795539 3
## 7 2013-01-10 YELP 22.19 20.76 1.0688825 5
## 9 2013-01-14 YELP 21.97 22.12 0.9932188 5
## 11 2013-01-16 YELP 20.36 20.61 0.9878700 5
## 8 2013-01-11 YELP 22.12 22.19 0.9968454 5
## 10 2013-01-15 YELP 20.61 21.97 0.9380974 5
## nTimesDecrDayCountsOccurs
## 3 67
## 5 67
## 1 438
## 2 438
## 6 67
## 4 67
## 7 9
## 9 9
## 11 9
## 8 9
## 10 9
The unique number of days this stock selected (for the time period retrieved) increased is shown in the table below.
stock_5[unique(stock_5$incrDaysThisCycle),c(1:3,8,9,18,19)]
## Date stockName stockValue lag1 today2_lag1 incrDaysThisCycle
## 2 2013-01-03 YELP 20.15 19.70 1.0228426 2
## 1 2013-01-02 YELP 19.70 18.85 1.0450928 2
## 3 2013-01-04 YELP 21.52 20.15 1.0679901 2
## 4 2013-01-07 YELP 21.08 21.52 0.9795539 0
## 6 2013-01-09 YELP 20.76 20.83 0.9966395 1
## 9 2013-01-14 YELP 21.97 22.12 0.9932188 0
## 7 2013-01-10 YELP 22.19 20.76 1.0688825 1
## 5 2013-01-08 YELP 20.83 21.08 0.9881404 0
## 12 2013-01-17 YELP 20.27 20.36 0.9955796 1
## nTimesIncrDayCountsOccurs
## 2 121
## 1 121
## 3 121
## 4 435
## 6 241
## 9 435
## 7 241
## 5 435
## 12 241
%%%%%%%%%%%%%%%%%
Lets do some machine learning to predict the next trading day stock value of being increased by 2% as 1, or not being decreased by 2% as a 0 value.This will use the lag set earlier in the above table, and those counts and other fields already used, but add in a lead price field for 1 day. We can then replay this scenario with the lead price in 2 weeks, 1 month, and so on as needed.
colnames(stock_5)
## [1] "Date" "stockName"
## [3] "stockValue" "startDayValue"
## [5] "startDayDate" "finalDayValue"
## [7] "finalDayDate" "lag1"
## [9] "today2_lag1" "todayGrtrThan_lag1"
## [11] "cumulativeSumTodayGrtrThan_lag1" "nRepeatsTodayGrtrThan_lag1"
## [13] "decrDaysThisCycle" "nTimesDecrDayCountsOccurs"
## [15] "todayLessThan_lag1" "cumulativeSumTodayLessThan_lag1"
## [17] "nRepeatsTodayLessThan_lag1" "incrDaysThisCycle"
## [19] "nTimesIncrDayCountsOccurs"
Lets remove the date fields and stock name. We will name the rows of this matrix by the date before deleting the Date field.
stock_6 <- stock_5
row.names(stock_6) <- stock_6$Date
stock_6 <- stock_6[,-c(1,2,4:7)]#removes the date, stock name, and start and final dates/values
stock_6$NextDayPrice <- lead(stock_6$stockValue,lead)
stock_6 <- stock_6[complete.cases(stock_6$NextDayPrice),]
stock_6$NextDay2PercentIncrease <- ifelse(stock_6$NextDayPrice/stock_6$stockValue >1.02, 1,0)
stock_6$NextDay2PercentIncrease <- as.factor(stock_6$NextDay2PercentIncrease)
head(stock_6)
## stockValue lag1 today2_lag1 todayGrtrThan_lag1
## 2013-01-02 19.70 18.85 1.0450928 1
## 2013-01-03 20.15 19.70 1.0228426 1
## 2013-01-04 21.52 20.15 1.0679901 1
## 2013-01-07 21.08 21.52 0.9795539 0
## 2013-01-08 20.83 21.08 0.9881404 0
## 2013-01-09 20.76 20.83 0.9966395 0
## cumulativeSumTodayGrtrThan_lag1 nRepeatsTodayGrtrThan_lag1
## 2013-01-02 1 1
## 2013-01-03 2 1
## 2013-01-04 3 4
## 2013-01-07 3 4
## 2013-01-08 3 4
## 2013-01-09 3 4
## decrDaysThisCycle nTimesDecrDayCountsOccurs todayLessThan_lag1
## 2013-01-02 0 438 0
## 2013-01-03 0 438 0
## 2013-01-04 3 67 0
## 2013-01-07 3 67 1
## 2013-01-08 3 67 1
## 2013-01-09 3 67 1
## cumulativeSumTodayLessThan_lag1 nRepeatsTodayLessThan_lag1
## 2013-01-02 0 3
## 2013-01-03 0 3
## 2013-01-04 0 3
## 2013-01-07 1 1
## 2013-01-08 2 1
## 2013-01-09 3 2
## incrDaysThisCycle nTimesIncrDayCountsOccurs NextDayPrice
## 2013-01-02 2 121 21.44
## 2013-01-03 2 121 21.96
## 2013-01-04 2 121 21.88
## 2013-01-07 0 435 21.89
## 2013-01-08 0 435 22.00
## 2013-01-09 1 241 22.18
## NextDay2PercentIncrease
## 2013-01-02 1
## 2013-01-03 1
## 2013-01-04 0
## 2013-01-07 1
## 2013-01-08 1
## 2013-01-09 1
Remove the next day price field just added so that it doesn’t have multicollinearity or correlation with the predictions of whether or not the next day will increase by 2% (1 value) or not (0 value).
colnames(stock_6)
## [1] "stockValue" "lag1"
## [3] "today2_lag1" "todayGrtrThan_lag1"
## [5] "cumulativeSumTodayGrtrThan_lag1" "nRepeatsTodayGrtrThan_lag1"
## [7] "decrDaysThisCycle" "nTimesDecrDayCountsOccurs"
## [9] "todayLessThan_lag1" "cumulativeSumTodayLessThan_lag1"
## [11] "nRepeatsTodayLessThan_lag1" "incrDaysThisCycle"
## [13] "nTimesIncrDayCountsOccurs" "NextDayPrice"
## [15] "NextDay2PercentIncrease"
For this set that is column 14 for ‘NextDayPrice’ to remove before running our predictive analytics algorithms.
set.seed(12356789)
NextDayPrice <- as.vector(stock_6$NextDayPrice)
stock_6 <- stock_6[,-14]
inTrain <- createDataPartition(y=stock_6$NextDay2PercentIncrease, p=0.7, list=FALSE)
trainingSet <- stock_6[inTrain,]
testingSet <- stock_6[-inTrain,]
Here are the first few observations in the training set.
head(trainingSet)
## stockValue lag1 today2_lag1 todayGrtrThan_lag1
## 2013-01-04 21.52 20.15 1.0679901 1
## 2013-01-07 21.08 21.52 0.9795539 0
## 2013-01-08 20.83 21.08 0.9881404 0
## 2013-01-09 20.76 20.83 0.9966395 0
## 2013-01-10 22.19 20.76 1.0688825 1
## 2013-01-14 21.97 22.12 0.9932188 0
## cumulativeSumTodayGrtrThan_lag1 nRepeatsTodayGrtrThan_lag1
## 2013-01-04 3 4
## 2013-01-07 3 4
## 2013-01-08 3 4
## 2013-01-09 3 4
## 2013-01-10 4 6
## 2013-01-14 4 6
## decrDaysThisCycle nTimesDecrDayCountsOccurs todayLessThan_lag1
## 2013-01-04 3 67 0
## 2013-01-07 3 67 1
## 2013-01-08 3 67 1
## 2013-01-09 3 67 1
## 2013-01-10 5 9 0
## 2013-01-14 5 9 1
## cumulativeSumTodayLessThan_lag1 nRepeatsTodayLessThan_lag1
## 2013-01-04 0 3
## 2013-01-07 1 1
## 2013-01-08 2 1
## 2013-01-09 3 2
## 2013-01-10 3 2
## 2013-01-14 5 1
## incrDaysThisCycle nTimesIncrDayCountsOccurs NextDay2PercentIncrease
## 2013-01-04 2 121 0
## 2013-01-07 0 435 1
## 2013-01-08 0 435 1
## 2013-01-09 1 241 1
## 2013-01-10 1 241 0
## 2013-01-14 0 435 0
Here are the first few observations in the testing set.
head(testingSet)
## stockValue lag1 today2_lag1 todayGrtrThan_lag1
## 2013-01-02 19.70 18.85 1.0450928 1
## 2013-01-03 20.15 19.70 1.0228426 1
## 2013-01-11 22.12 22.19 0.9968454 0
## 2013-01-23 20.34 19.90 1.0221106 1
## 2013-01-25 21.17 20.83 1.0163226 1
## 2013-02-11 21.63 21.85 0.9899314 0
## cumulativeSumTodayGrtrThan_lag1 nRepeatsTodayGrtrThan_lag1
## 2013-01-02 1 1
## 2013-01-03 2 1
## 2013-01-11 4 6
## 2013-01-23 6 1
## 2013-01-25 8 4
## 2013-02-11 12 2
## decrDaysThisCycle nTimesDecrDayCountsOccurs todayLessThan_lag1
## 2013-01-02 0 438 0
## 2013-01-03 0 438 0
## 2013-01-11 5 9 1
## 2013-01-23 0 438 0
## 2013-01-25 3 67 0
## 2013-02-11 1 251 1
## cumulativeSumTodayLessThan_lag1 nRepeatsTodayLessThan_lag1
## 2013-01-02 0 3
## 2013-01-03 0 3
## 2013-01-11 4 1
## 2013-01-23 9 4
## 2013-01-25 9 4
## 2013-02-11 16 3
## incrDaysThisCycle nTimesIncrDayCountsOccurs NextDay2PercentIncrease
## 2013-01-02 2 121 1
## 2013-01-03 2 121 1
## 2013-01-11 0 435 0
## 2013-01-23 3 65 1
## 2013-01-25 3 65 1
## 2013-02-11 2 121 1
Some modifications to the data as a factor instead of numeric when copy and pasting the code above. The random forest is taking a tree approach to classification of the next day’s value as 1 for increased by 2% or a 0 for didn’t increase by 2%. Otherwise, the random forest would have done regression as done in previous two runs with the next day price value guess. Below is the caret random forest or you could use the random forest package. Make sure not to keep the round(,0) modification or all values will be NA. If you just run this, then no worries.
rfMod <- train(NextDay2PercentIncrease ~ ., method='rf', data=(trainingSet),
trControl=trainControl(method='cv'), number=5)
predRF <- predict(rfMod, testingSet)
predDF <- data.frame(predRF, NextDay2PercentIncrease=testingSet$NextDay2PercentIncrease)
print(rbind(head(predDF,10),tail(predDF,10)))
## predRF NextDay2PercentIncrease
## 1 1 1
## 2 1 1
## 3 0 0
## 4 1 1
## 5 1 1
## 6 1 1
## 7 1 1
## 8 1 1
## 9 1 1
## 10 1 1
## 526 0 0
## 527 0 0
## 528 0 0
## 529 0 0
## 530 0 1
## 531 1 1
## 532 0 1
## 533 0 0
## 534 0 0
## 535 0 0
sum <- sum(predRF==testingSet$NextDay2PercentIncrease)
length <- length(testingSet$NextDay2PercentIncrease)
accuracy_rfMod <- (sum/length)
accuracy_rfMod
## [1] 0.9121495
results <- c(round(accuracy_rfMod,2), round(100,2))
results <- as.factor(results)
results <- t(data.frame(results))
colnames(results) <- colnames(predDF)
Results <- rbind(predDF, results)
print(rbind(head(Results),tail(Results)))
## predRF NextDay2PercentIncrease
## 1 1 1
## 2 1 1
## 3 0 0
## 4 1 1
## 5 1 1
## 6 1 1
## 531 1 1
## 532 0 1
## 533 0 0
## 534 0 0
## 535 0 0
## results 0.91 100
From the above, 80% is not bad. You could run this program just using the random forest for if the next day value will increase by 2% and be right 4 out of 5 days. Lets see the rest using the caret package for the classification. The metric was changed to ‘Accuracy’ or ‘Kappa’ to use the train() of the caret package to classify instead of regress the features on the target.
knnMod <- train(NextDay2PercentIncrease ~ .,
method='knn', preProcess=c('center','scale'),
tuneLength=10, metric='Accuracy',
trControl=trainControl(method='cv'), data=trainingSet)
rpartMod <- train(NextDay2PercentIncrease ~ .,
method='rpart', tuneLength=9, metric='Accuracy',
data=trainingSet)
glmMod <- train(NextDay2PercentIncrease ~ ., metric='Kappa',
method='glm', data=trainingSet)
predKNN <- predict(knnMod, testingSet)
predRPART <- predict(rpartMod, testingSet)
predGLM <- predict(glmMod, testingSet)
length=length(testingSet$NextDay2PercentIncrease)
length
## [1] 535
sumKNN <- sum(predKNN==testingSet$NextDay2PercentIncrease)
sumRPart <- sum(predRPART==testingSet$NextDay2PercentIncrease)
sumGLM <- sum(predGLM==testingSet$NextDay2PercentIncrease)
cat('The number correct for KNN is ',sumKNN, ', the number correct for Rpart is ',sumRPart,
' and the number correct for GLM is ',sumGLM,'.')
## The number correct for KNN is 363 , the number correct for Rpart is 464 and the number correct for GLM is 349 .
accuracy_KNN <- sumKNN/length
accuracy_RPART <- sumRPart/length
accuracy_GLM <- sumGLM/length
cat('The accuracy for KNN, RPart, and GLM respectively is: ',
accuracy_KNN,accuracy_RPART,accuracy_GLM,'.')
## The accuracy for KNN, RPart, and GLM respectively is: 0.6785047 0.8672897 0.6523364 .
predDF2 <- data.frame(predRF,predKNN,predRPART,predGLM,
NextDay2PercentIncrease=testingSet$NextDay2PercentIncrease)
colnames(predDF2) <- c('RandomForest','KNN','Rpart','GLM','NextDay2PercentIncrease')
results <- c(round(accuracy_rfMod,2),
round(accuracy_KNN,2),
round(accuracy_RPART,2),
round(accuracy_GLM,2),
round(100,2))
results <- as.factor(results)
results <- t(data.frame(results))
colnames(results) <- c('RandomForest','KNN','Rpart','GLM',
'NextDay2PercentIncrease')
Results <- rbind(predDF2, results)
Results1 <- datatable(data=Results, rownames=FALSE,
filter=list(position='top'),
options=list(
dom='Bfrtip',
buttons=c('csv'),
language=list(sSearch='Filter:')),
extensions=c('Buttons','Responsive'))
Results1
Lets now align these predicted values to the subset for this stock made earlier.
stock_6$Date <- row.names(stock_6)
testingSet$Date <- row.names(testingSet)
predDF2$Date <- row.names(testingSet)
stock_7 <- merge(testingSet,predDF2, by.x='Date', by.y='Date')
tbl <- rbind(head(stock_7),tail(stock_7))
tbl
## Date stockValue lag1 today2_lag1 todayGrtrThan_lag1
## 1 2013-01-02 19.70 18.85 1.0450928 1
## 2 2013-01-03 20.15 19.70 1.0228426 1
## 3 2013-01-11 22.12 22.19 0.9968454 0
## 4 2013-01-23 20.34 19.90 1.0221106 1
## 5 2013-01-25 21.17 20.83 1.0163226 1
## 6 2013-02-11 21.63 21.85 0.9899314 0
## 530 2019-11-27 35.21 35.13 1.0022773 1
## 531 2019-12-10 33.03 33.08 0.9984885 0
## 532 2019-12-16 33.78 33.55 1.0068554 1
## 533 2019-12-26 34.66 34.06 1.0176160 1
## 534 2019-12-27 34.85 34.66 1.0054818 1
## 535 2020-01-03 34.72 34.77 0.9985620 0
## cumulativeSumTodayGrtrThan_lag1 nRepeatsTodayGrtrThan_lag1
## 1 1 1
## 2 2 1
## 3 4 6
## 4 6 1
## 5 8 4
## 6 12 2
## 530 872 3
## 531 874 6
## 532 877 1
## 533 881 1
## 534 882 2
## 535 883 4
## decrDaysThisCycle nTimesDecrDayCountsOccurs todayLessThan_lag1
## 1 0 438 0
## 2 0 438 0
## 3 5 9 1
## 4 0 438 0
## 5 3 67 0
## 6 1 251 1
## 530 2 111 0
## 531 5 9 1
## 532 0 438 0
## 533 0 438 0
## 534 1 251 0
## 535 3 67 1
## cumulativeSumTodayLessThan_lag1 nRepeatsTodayLessThan_lag1
## 1 0 3
## 2 0 3
## 3 4 1
## 4 9 4
## 5 9 4
## 6 16 3
## 530 868 3
## 531 874 1
## 532 875 6
## 533 878 3
## 534 878 3
## 535 881 1
## incrDaysThisCycle nTimesIncrDayCountsOccurs NextDay2PercentIncrease.x
## 1 2 121 1
## 2 2 121 1
## 3 0 435 0
## 4 3 65 1
## 5 3 65 1
## 6 2 121 1
## 530 2 121 1
## 531 0 435 1
## 532 5 9 1
## 533 2 121 0
## 534 2 121 0
## 535 0 435 0
## RandomForest KNN Rpart GLM NextDay2PercentIncrease.y
## 1 1 1 1 1 1
## 2 1 1 1 1 1
## 3 0 1 1 1 0
## 4 1 1 1 1 1
## 5 1 1 1 1 1
## 6 1 1 1 1 1
## 530 0 0 0 0 1
## 531 1 1 1 0 1
## 532 0 0 0 0 1
## 533 0 0 0 0 0
## 534 0 0 0 0 0
## 535 0 0 0 0 0
stock_7$RandomForest <- as.numeric(paste(stock_7$RandomForest))
stock_7$Rpart <- as.numeric(paste(stock_7$Rpart))
stock_7$GLM <- as.numeric(paste(stock_7$GLM))
stock_7$NextDay2PercentIncrease.y <-
as.numeric(paste(stock_7$NextDay2PercentIncrease.y))
stock_7$KNN <- as.numeric(paste(stock_7$KNN))
stock_7$mostVoted <- ifelse(rowSums(stock_7[,16:19])>=2,1,0)
sMost <- sum(stock_7$mostVoted==stock_7$NextDay2PercentIncrease.y)
lMost <- length(stock_7$NextDay2PercentIncrease.y)
accMost <- sMost/lMost
results <- as.data.frame(c(round(accuracy_rfMod,2),
round(accuracy_KNN,2),
round(accuracy_RPART,2),
round(accuracy_GLM,2),
round(accMost,2),
round(100,2)))
colnames(results) <- 'Results'
row.names(results) <- c('RandomForest','KNN','Rpart','GLM','mostVoted',
'NextDay2PercentIncrease')
results
## Results
## RandomForest 0.91
## KNN 0.68
## Rpart 0.87
## GLM 0.65
## mostVoted 0.84
## NextDay2PercentIncrease 100.00
The above accuracy measures for random forest, recursive partitioning trees, k-nearest neighbors, and generalized linear models and the most voted algorithm for this classification were shown. When analyzing YELP for predicting whether the stock would increase by more than 2% in the next 30 days, the highest accuracy was with Random Forest with 91% accuracy, then Rpart with 87% accuracy. The time interval was also 2012-12-31 through 2020-03-25, with missing information the last 30 days of the time span to predict, as this script was made 3/25/2020, and 30 days in the future isn’t available. The lag to compare and make counts was 1 day.
The following is the table of the first 30 predicted results on the testing set of samples. Also, add back in the NextDayPrice extracted earlier so the results weren’t dependent on the next day price when predicting whether the next day would be increased by 2%. Note that because not all stocks have the same time series start dates, the size of each stock time series being selected will vary with some stocks.
stock_6$NextDayPrice <- NextDayPrice #add back in to original complete dataset
stock_6b <- stock_6[,15:16]
stock_8 <- merge(stock_7,stock_6b,by.x='Date',by.y='Date')
print(rbind(head(stock_8,30),tail(stock_8,30)))
## Date stockValue lag1 today2_lag1 todayGrtrThan_lag1
## 1 2013-01-02 19.70 18.85 1.0450928 1
## 2 2013-01-03 20.15 19.70 1.0228426 1
## 3 2013-01-11 22.12 22.19 0.9968454 0
## 4 2013-01-23 20.34 19.90 1.0221106 1
## 5 2013-01-25 21.17 20.83 1.0163226 1
## 6 2013-02-11 21.63 21.85 0.9899314 0
## 7 2013-02-14 21.44 21.79 0.9839376 0
## 8 2013-02-19 21.88 21.96 0.9963570 0
## 9 2013-02-21 22.00 21.89 1.0050251 1
## 10 2013-02-25 21.62 22.18 0.9747520 0
## 11 2013-02-28 22.19 21.86 1.0150961 1
## 12 2013-03-06 23.02 23.02 1.0000000 0
## 13 2013-03-07 23.46 23.02 1.0191138 1
## 14 2013-03-15 25.10 25.05 1.0019960 1
## 15 2013-03-18 24.50 25.10 0.9760956 0
## 16 2013-03-19 24.28 24.50 0.9910204 0
## 17 2013-03-20 24.12 24.28 0.9934102 0
## 18 2013-03-22 23.79 24.00 0.9912500 0
## 19 2013-03-25 23.49 23.79 0.9873897 0
## 20 2013-04-01 22.89 23.71 0.9654154 0
## 21 2013-04-09 24.62 25.00 0.9848000 0
## 22 2013-04-10 25.75 24.62 1.0458976 1
## 23 2013-04-11 26.35 25.75 1.0233010 1
## 24 2013-04-16 25.71 25.83 0.9953542 0
## 25 2013-04-22 25.32 25.50 0.9929412 0
## 26 2013-04-24 24.97 25.29 0.9873468 0
## 27 2013-04-30 26.03 26.62 0.9778362 0
## 28 2013-05-08 29.91 30.67 0.9752201 0
## 29 2013-05-09 30.67 29.91 1.0254096 1
## 30 2013-05-14 30.97 30.92 1.0016171 1
## 506 2019-07-24 35.44 34.48 1.0278422 1
## 507 2019-07-26 35.42 34.89 1.0151906 1
## 508 2019-07-29 35.43 35.42 1.0002823 1
## 509 2019-08-02 35.34 34.93 1.0117378 1
## 510 2019-08-05 33.53 35.34 0.9487832 0
## 511 2019-08-19 32.92 33.17 0.9924631 0
## 512 2019-08-22 32.89 33.23 0.9897683 0
## 513 2019-08-23 31.77 32.89 0.9659471 0
## 514 2019-08-26 32.03 31.77 1.0081838 1
## 515 2019-09-04 32.84 33.21 0.9888588 0
## 516 2019-09-13 37.44 36.52 1.0251917 1
## 517 2019-09-26 35.08 34.96 1.0034325 1
## 518 2019-09-27 34.42 35.08 0.9811859 0
## 519 2019-09-30 34.75 34.42 1.0095874 1
## 520 2019-10-02 33.33 34.86 0.9561102 0
## 521 2019-10-23 33.53 33.84 0.9908392 0
## 522 2019-11-01 35.06 34.51 1.0159374 1
## 523 2019-11-04 34.24 35.06 0.9766115 0
## 524 2019-11-05 33.16 34.24 0.9684579 0
## 525 2019-11-06 32.71 33.16 0.9864294 0
## 526 2019-11-18 34.78 34.61 1.0049119 1
## 527 2019-11-20 34.23 34.35 0.9965066 0
## 528 2019-11-21 34.14 34.23 0.9973707 0
## 529 2019-11-26 35.13 35.01 1.0034276 1
## 530 2019-11-27 35.21 35.13 1.0022773 1
## 531 2019-12-10 33.03 33.08 0.9984885 0
## 532 2019-12-16 33.78 33.55 1.0068554 1
## 533 2019-12-26 34.66 34.06 1.0176160 1
## 534 2019-12-27 34.85 34.66 1.0054818 1
## 535 2020-01-03 34.72 34.77 0.9985620 0
## cumulativeSumTodayGrtrThan_lag1 nRepeatsTodayGrtrThan_lag1
## 1 1 1
## 2 2 1
## 3 4 6
## 4 6 1
## 5 8 4
## 6 12 2
## 7 14 2
## 8 15 2
## 9 17 1
## 10 18 3
## 11 20 1
## 12 22 3
## 13 23 1
## 14 28 7
## 15 28 7
## 16 28 7
## 17 28 7
## 18 28 7
## 19 28 7
## 20 30 4
## 21 33 3
## 22 34 1
## 23 35 1
## 24 36 3
## 25 38 4
## 26 38 4
## 27 41 3
## 28 42 5
## 29 43 1
## 30 45 2
## 506 828 2
## 507 829 1
## 508 830 1
## 509 832 2
## 510 832 2
## 511 838 3
## 512 839 3
## 513 839 3
## 514 840 1
## 515 843 4
## 516 849 1
## 517 852 2
## 518 852 2
## 519 853 1
## 520 854 3
## 521 861 3
## 522 865 5
## 523 865 5
## 524 865 5
## 525 865 5
## 526 869 4
## 527 869 4
## 528 869 4
## 529 871 1
## 530 872 3
## 531 874 6
## 532 877 1
## 533 881 1
## 534 882 2
## 535 883 4
## decrDaysThisCycle nTimesDecrDayCountsOccurs todayLessThan_lag1
## 1 0 438 0
## 2 0 438 0
## 3 5 9 1
## 4 0 438 0
## 5 3 67 0
## 6 1 251 1
## 7 1 251 1
## 8 1 251 1
## 9 0 438 0
## 10 2 111 1
## 11 0 438 0
## 12 2 111 1
## 13 0 438 0
## 14 6 6 0
## 15 6 6 1
## 16 6 6 1
## 17 6 6 1
## 18 6 6 1
## 19 6 6 1
## 20 3 67 1
## 21 2 111 1
## 22 0 438 0
## 23 0 438 0
## 24 2 111 1
## 25 3 67 1
## 26 3 67 1
## 27 2 111 1
## 28 4 20 1
## 29 0 438 0
## 30 1 251 0
## 506 1 251 0
## 507 0 438 0
## 508 0 438 0
## 509 1 251 0
## 510 1 251 1
## 511 2 111 1
## 512 2 111 1
## 513 2 111 1
## 514 0 438 0
## 515 3 67 1
## 516 0 438 0
## 517 1 251 0
## 518 1 251 1
## 519 0 438 0
## 520 2 111 1
## 521 2 111 1
## 522 4 20 0
## 523 4 20 1
## 524 4 20 1
## 525 4 20 1
## 526 3 67 0
## 527 3 67 1
## 528 3 67 1
## 529 0 438 0
## 530 2 111 0
## 531 5 9 1
## 532 0 438 0
## 533 0 438 0
## 534 1 251 0
## 535 3 67 1
## cumulativeSumTodayLessThan_lag1 nRepeatsTodayLessThan_lag1
## 1 0 3
## 2 0 3
## 3 4 1
## 4 9 4
## 5 9 4
## 6 16 3
## 7 17 2
## 8 18 4
## 9 18 4
## 10 19 1
## 11 20 5
## 12 22 4
## 13 22 4
## 14 23 4
## 15 24 1
## 16 25 1
## 17 26 1
## 18 28 1
## 19 29 3
## 20 31 1
## 21 34 4
## 22 34 4
## 23 34 4
## 24 36 2
## 25 38 1
## 26 40 4
## 27 41 1
## 28 46 3
## 29 46 3
## 30 47 2
## 506 823 3
## 507 824 4
## 508 824 4
## 509 826 2
## 510 827 5
## 511 831 1
## 512 833 1
## 513 834 5
## 514 834 5
## 515 837 2
## 516 838 7
## 517 844 3
## 518 845 3
## 519 845 3
## 520 846 1
## 521 854 1
## 522 857 2
## 523 858 1
## 524 859 1
## 525 860 1
## 526 864 3
## 527 866 1
## 528 867 2
## 529 868 3
## 530 868 3
## 531 874 1
## 532 875 6
## 533 878 3
## 534 878 3
## 535 881 1
## incrDaysThisCycle nTimesIncrDayCountsOccurs NextDay2PercentIncrease.x
## 1 2 121 1
## 2 2 121 1
## 3 0 435 0
## 4 3 65 1
## 5 3 65 1
## 6 2 121 1
## 7 1 241 1
## 8 3 65 1
## 9 3 65 1
## 10 0 435 1
## 11 4 26 1
## 12 3 65 1
## 13 3 65 1
## 14 3 65 1
## 15 0 435 1
## 16 0 435 1
## 17 0 435 1
## 18 0 435 1
## 19 2 121 1
## 20 0 435 1
## 21 3 65 1
## 22 3 65 1
## 23 3 65 1
## 24 1 241 1
## 25 0 435 1
## 26 3 65 1
## 27 0 435 1
## 28 2 121 1
## 29 2 121 0
## 30 1 241 1
## 506 2 121 0
## 507 3 65 0
## 508 3 65 0
## 509 1 241 1
## 510 4 26 1
## 511 0 435 1
## 512 0 435 0
## 513 4 26 1
## 514 4 26 1
## 515 1 241 1
## 516 6 8 0
## 517 2 121 0
## 518 2 121 0
## 519 2 121 1
## 520 0 435 1
## 521 0 435 0
## 522 1 241 0
## 523 0 435 0
## 524 0 435 1
## 525 0 435 1
## 526 2 121 0
## 527 0 435 0
## 528 1 241 0
## 529 2 121 0
## 530 2 121 1
## 531 0 435 1
## 532 5 9 1
## 533 2 121 0
## 534 2 121 0
## 535 0 435 0
## RandomForest KNN Rpart GLM NextDay2PercentIncrease.y mostVoted NextDayPrice
## 1 1 1 1 1 1 1 21.44
## 2 1 1 1 1 1 1 21.96
## 3 0 1 1 1 0 1 21.29
## 4 1 1 1 1 1 1 23.46
## 5 1 1 1 1 1 1 25.13
## 6 1 1 1 1 1 1 23.68
## 7 1 1 1 1 1 1 22.89
## 8 1 1 1 1 1 1 23.05
## 9 1 1 1 1 1 1 25.70
## 10 1 1 1 1 1 1 24.62
## 11 1 1 1 1 1 1 26.96
## 12 1 1 1 1 1 1 25.33
## 13 1 1 1 1 1 1 25.50
## 14 1 1 1 1 1 1 26.62
## 15 1 1 1 1 1 1 26.03
## 16 1 1 1 1 1 1 25.30
## 17 1 1 1 1 1 1 32.22
## 18 1 1 1 1 1 1 30.69
## 19 1 1 1 1 1 1 30.67
## 20 1 1 1 1 1 1 30.92
## 21 1 1 1 1 1 1 31.58
## 22 1 1 1 1 1 1 29.40
## 23 1 1 1 1 1 1 29.46
## 24 1 1 1 1 1 1 30.09
## 25 1 1 1 1 1 1 27.46
## 26 1 1 1 1 1 1 29.09
## 27 1 1 1 1 1 1 29.39
## 28 1 1 1 1 1 1 31.00
## 29 1 1 1 1 0 1 31.24
## 30 0 1 1 1 1 1 34.87
## 506 0 0 0 0 0 0 33.24
## 507 0 0 0 0 0 0 33.23
## 508 0 0 0 0 0 0 33.96
## 509 0 0 0 0 1 0 38.21
## 510 0 0 0 0 1 0 36.98
## 511 1 1 1 0 1 1 34.86
## 512 1 1 1 0 0 1 33.40
## 513 1 0 1 0 1 1 33.28
## 514 1 1 1 0 1 1 32.75
## 515 1 1 1 0 1 1 33.61
## 516 0 0 0 0 0 0 34.69
## 517 0 0 0 0 0 0 30.12
## 518 0 0 0 0 0 0 34.77
## 519 0 1 0 0 1 0 36.06
## 520 1 1 0 0 1 1 35.56
## 521 0 1 0 0 0 0 33.42
## 522 0 0 0 0 0 0 33.78
## 523 0 0 0 0 0 0 34.00
## 524 1 0 1 0 1 1 34.14
## 525 1 0 1 0 1 1 34.11
## 526 0 0 0 0 0 0 34.77
## 527 0 0 0 0 0 0 34.44
## 528 0 1 0 0 0 0 34.57
## 529 0 1 0 0 0 0 35.28
## 530 0 0 0 0 1 0 36.23
## 531 1 1 1 0 1 1 36.12
## 532 0 0 0 0 1 0 34.77
## 533 0 0 0 0 0 0 34.20
## 534 0 0 0 0 0 0 34.76
## 535 0 0 0 0 0 0 34.57
%%%%%%%%%%%%%%%%%%%%%%