Predicting Stock Price Movement from Companies’ Investor Relations Articles

In this project I test whether articles from the investor relations pages of Advanced Micro Devices, Intel, Apple and Nvidia can predict whether each company’s stock price will move up or down on the day of publishing.

Data Storage

All data is stored in a MySQL database. I created a function that takes a list of company symbols and, for each company, creates a database and a table to store the articles. Each table comprises four fields: Links, Title and Article, all stored as VARCHAR, and Published, stored as DATETIME. The primary key is the combination of Links, Title and Published.
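As a rough illustration of the storage step, the Python sketch below only builds the SQL statements that such a function might send to MySQL. The function name, the table name `press_releases` and the VARCHAR lengths are assumptions made for the example; the original function may differ.

```python
import re

# Assumed column sizes; the text only says the text fields are VARCHARs.
LINK_LEN, TITLE_LEN, ARTICLE_LEN = 255, 255, 15000


def build_schema_statements(symbols):
    """Return the SQL needed to create one database and one table per symbol."""
    statements = []
    for symbol in symbols:
        # Identifiers cannot be bound as query parameters, so validate them.
        if not re.fullmatch(r"[A-Za-z0-9_]+", symbol):
            raise ValueError(f"unexpected company symbol: {symbol!r}")
        db = symbol.lower()
        statements.append(f"CREATE DATABASE IF NOT EXISTS `{db}`")
        statements.append(
            f"CREATE TABLE IF NOT EXISTS `{db}`.`press_releases` ("
            f"Links VARCHAR({LINK_LEN}) NOT NULL, "
            f"Title VARCHAR({TITLE_LEN}) NOT NULL, "
            f"Article VARCHAR({ARTICLE_LEN}), "
            "Published DATETIME NOT NULL, "
            "PRIMARY KEY (Links, Title, Published))"
        )
    return statements


for statement in build_schema_statements(["AMD", "INTC"]):
    print(statement)
```

Each statement could then be executed through whichever MySQL connector the project uses.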

Data Acquisition

To gather the articles, I scrape each company’s investor relations page. Because there are many articles and therefore many requests, I route them through a proxy IP address and port to avoid 403 errors.

To keep the scraping running, I created a recursive status-checking function that checks whether the website accepts a request from the current IP. If the request is rejected, the function picks a different IP address and port and checks the status code returned for that one. It repeats this until the server returns a 200 status code, at which point it returns the page’s HTML. If no proxy produces a 200 status code, the recursion stops once the last entry in the dataframe of IPs and ports has been tried.

This function still does not handle timeouts properly and is a work in progress.
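The retry logic described above could look roughly like the following Python sketch. It is not the project’s function. The names `fetch_html` and `requests_fetch` are hypothetical, the proxies are passed as a list of (ip, port) pairs rather than a dataframe, and treating any network exception (including a timeout) as a failed attempt is one possible choice, not necessarily the right one for this project.

```python
def fetch_html(url, proxies, fetch, index=0):
    """Try each (ip, port) pair in turn until one returns HTTP 200.

    `fetch(url, ip, port)` must return (status_code, html) and may raise
    on network errors such as timeouts.
    """
    if index >= len(proxies):
        return None  # every proxy was tried without success
    ip, port = proxies[index]
    try:
        status, html = fetch(url, ip, port)
    except Exception:
        # Assumption: a timeout or connection error counts as a rejection.
        status, html = None, None
    if status == 200:
        return html
    return fetch_html(url, proxies, fetch, index + 1)


def requests_fetch(url, ip, port, timeout=10):
    """One possible `fetch`, built on the third-party requests library."""
    import requests  # imported lazily so the sketch loads without it

    proxy = f"http://{ip}:{port}"
    response = requests.get(
        url, proxies={"http": proxy, "https": proxy}, timeout=timeout
    )
    return response.status_code, response.text
```

Passing `fetch` as a parameter keeps the recursion testable without network access. For a very long proxy list, a plain loop would avoid Python’s recursion limit.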

All of the scraping functions and the recursive function take in a database connection, and a dataframe containing proxy ips and ports.

All the pages have a similar structure, so I created a function called get_pr_data that takes the company symbol, the link, the page’s HTML, the title node, the release date node, the release date format for the database and the article node.

For the Apple page I use essentially the same function, with one minor modification: an if statement that the page does not need is removed.

This function loads the link, title, article and publishing date into the press releases table.
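The loading step can be sketched in Python as follows. To keep the example self-contained it uses the standard library’s sqlite3 as a stand-in for MySQL, so the SQL dialect differs slightly. The function name `store_press_release` and the table name `press_releases` are assumptions, and the parsing of the HTML nodes is left out.

```python
import sqlite3
from datetime import datetime


def store_press_release(conn, link, title, article, published_text, date_format):
    """Insert one article; return True if a new row was written."""
    # Convert the page's date text into the database's datetime layout.
    published = datetime.strptime(published_text, date_format)
    cursor = conn.execute(
        # MySQL would use INSERT IGNORE and %s placeholders instead.
        "INSERT OR IGNORE INTO press_releases (Links, Title, Article, Published) "
        "VALUES (?, ?, ?, ?)",
        (link, title, article, published.strftime("%Y-%m-%d %H:%M:%S")),
    )
    conn.commit()
    return cursor.rowcount == 1


# In-memory stand-in for the real table, with the same composite primary key.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE press_releases (Links TEXT, Title TEXT, Article TEXT, "
    "Published TEXT, PRIMARY KEY (Links, Title, Published))"
)
```

Because the primary key covers Links, Title and Published, scraping the same article a second time leaves the table unchanged instead of raising a duplicate-key error.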

The hyperlink structure is the main aspect that separates the different websites. AMD and Intel share the same structure for their article archives, so a single function can scrape both of their pages.

However, Intel does not keep its newest articles in the archive, so separate code is needed to scrape those articles.

Nvidia’s investor relations page is split across many pages and uses pagination. To deal with this, I first scrape the web page for the number of pages. I then use that number to cycle through each page and retrieve all of the hyperlinks. Finally, I cycle through each hyperlink, retrieve that page’s data and insert it into the database.

Apple’s investor relations page has the same setup, so I follow essentially the same process.
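A minimal Python sketch of the pagination loop is shown below, using only the standard library. The names `LinkCollector` and `collect_article_links` are hypothetical, `fetch_page` stands in for whatever function returns the HTML of page n, and the sketch collects every anchor on a page, whereas a real scraper would keep only the article links.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def collect_article_links(fetch_page, page_count):
    """Visit pages 1..page_count and return the unique links in order."""
    seen, ordered = set(), []
    for page in range(1, page_count + 1):
        parser = LinkCollector()
        parser.feed(fetch_page(page))
        for link in parser.links:
            if link not in seen:  # a link may repeat across pages
                seen.add(link)
                ordered.append(link)
    return ordered
```

Each returned link would then be fetched and passed to the article-parsing function.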

Data Retrieval and Processing

The historical daily price data for AMD, Apple, Intel and Nvidia was retrieved from Yahoo Finance. I placed the CSV files in my GitHub repository and pulled the data from there.

To see whether the articles could predict whether a company’s stock price would move up or down, I had to create a few new columns. First I created a column called Diff, which is the difference in Close from one row to the next. Next I created a column called lag, which shifts every element of the Close column down one row, so that each row holds the previous day’s close. Then I created a Percent Change column, which divides Diff by lag to give the percentage change for that day.

I then create a new dataframe consisting of Date, Close and Percent_Change from the company’s daily dataframe, and add a classification column called Case that states whether the percent change is positive or negative.
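Written out, the percent change for day t is (Close_t - Close_{t-1}) / Close_{t-1}. The Python sketch below works through these columns on plain lists of dictionaries rather than a dataframe. The function name is hypothetical, and labelling a day with no change as "neg" is an assumption, since the text does not say how a zero change is handled.

```python
def add_change_columns(rows):
    """rows: list of {"Date": ..., "Close": ...} sorted by date, oldest first."""
    result = []
    previous_close = None  # plays the role of the lag column
    for row in rows:
        if previous_close is not None:  # the first row has no lag, so skip it
            diff = row["Close"] - previous_close
            percent_change = diff / previous_close
            result.append({
                "Date": row["Date"],
                "Close": row["Close"],
                "Percent_Change": percent_change,
                # Assumption: a day with no change is labelled "neg".
                "Case": "pos" if percent_change > 0 else "neg",
            })
        previous_close = row["Close"]
    return result


prices = [
    {"Date": "2020-03-02", "Close": 100.0},
    {"Date": "2020-03-03", "Close": 110.0},
    {"Date": "2020-03-04", "Close": 99.0},
]
for row in add_change_columns(prices):
    print(row["Date"], round(row["Percent_Change"], 4), row["Case"])
# prints:
# 2020-03-03 0.1 pos
# 2020-03-04 -0.1 neg
```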

After preparing the historical daily dataframe for each company, I retrieve all the articles and their publishing dates from the database. I then create a new dataframe by doing an inner join of the company’s changes dataframe with its articles dataframe. The inner join matches each article to the closing price for the day it was published.
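The effect of the inner join can be illustrated with a small Python sketch. The function name is hypothetical, and the sketch assumes that Date is stored as a YYYY-MM-DD string and Published as a YYYY-MM-DD HH:MM:SS string.

```python
def inner_join_on_date(changes, articles):
    """Attach each article to the price row for its publishing day."""
    by_date = {row["Date"]: row for row in changes}
    joined = []
    for article in articles:
        day = article["Published"][:10]  # keep only the YYYY-MM-DD part
        if day in by_date:  # inner join: unmatched articles are dropped
            joined.append({**by_date[day], "Article": article["Article"]})
    return joined


changes = [
    {"Date": "2020-03-03", "Close": 110.0, "Case": "pos"},
    {"Date": "2020-03-04", "Close": 99.0, "Case": "neg"},
]
articles = [
    {"Published": "2020-03-03 09:00:00", "Article": "New product announced"},
    {"Published": "2020-03-07 12:00:00", "Article": "Weekend notice"},
]
print(inner_join_on_date(changes, articles))
```

One consequence of using an inner join is that articles published on days without a price row, such as weekends and market holidays, drop out of the joined dataframe.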

Classification Analysis

Naive Bayes Classification

Naive Bayes is a classification method that estimates, for each word, the probability of it occurring in each class. In this model the classes are positive and negative. To classify an article, it combines the probabilities of the article’s words under each class and predicts the class with the higher overall probability.
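To make the idea concrete, here is a small multinomial naive Bayes classifier in plain Python. It is only an illustration of the technique, not the model fitted in the R code that follows. The function names, the whitespace tokenisation and the add-one (Laplace) smoothing are choices made for this sketch.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents):
    """documents: list of (text, label). Returns the fitted model as a dict."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()
    vocabulary = set()
    for text, label in documents:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocabulary.update(words)
    return {
        "word_counts": word_counts,
        "label_counts": label_counts,
        "vocabulary": vocabulary,
    }


def predict_naive_bayes(model, text):
    """Return the label with the highest log posterior for `text`."""
    total_docs = sum(model["label_counts"].values())
    vocab_size = len(model["vocabulary"])
    best_label, best_score = None, -math.inf
    for label, doc_count in model["label_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(doc_count / total_docs)  # log prior of the class
        for word in text.lower().split():
            if word in model["vocabulary"]:  # skip words unseen in training
                # Add-one smoothing avoids zero probabilities.
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


toy_articles = [
    ("record revenue growth", "pos"),
    ("strong growth outlook", "pos"),
    ("revenue decline warning", "neg"),
    ("weak outlook warning", "neg"),
]
model = train_naive_bayes(toy_articles)
print(predict_naive_bayes(model, "strong revenue growth"))  # prints "pos"
```

Skipping words that never appeared in training mirrors the step in the R code where the test features are matched to the training features.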

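As the summaries below show, the naive Bayes model performs poorly on this set of articles. Accuracy ranges from about 52% to 56%, and the 95% confidence intervals span roughly 44% to 61%. For AMD, Apple and Nvidia the accuracy is below the reported No Information Rate, and for Intel it exceeds it only marginally (0.5559 versus 0.5471, p = 0.39). This level of accuracy is far too low to justify a wager on stock price movement. Note that the tables below place the actual classes in rows and the predicted classes in columns, whereas confusionMatrix treats rows as predictions and columns as the reference. Accuracy and kappa are unaffected by this, but the class-specific statistics (sensitivity, specificity, prevalence and the No Information Rate) are computed with the two roles swapped.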

amd_nb_model_matched <- dfm_match(amd_testing,features=featnames(amd_training))
amd_actual_class <- as.factor(amd_nb_model_matched$Case)
amd_predicted_class <- predict(amd_nb_model,newdata=amd_nb_model_matched)
amd_tab_class <- table(amd_actual_class,amd_predicted_class)
amd_tab_class
##                 amd_predicted_class
## amd_actual_class neg pos
##              neg  59  66
##              pos  59  84
confusionMatrix(amd_tab_class,model='everything',positive='pos')
## Confusion Matrix and Statistics
## 
##                 amd_predicted_class
## amd_actual_class neg pos
##              neg  59  66
##              pos  59  84
##                                           
##                Accuracy : 0.5336          
##                  95% CI : (0.4719, 0.5945)
##     No Information Rate : 0.5597          
##     P-Value [Acc > NIR] : 0.8220          
##                                           
##                   Kappa : 0.0596          
##                                           
##  Mcnemar's Test P-Value : 0.5915          
##                                           
##             Sensitivity : 0.5600          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.5874          
##          Neg Pred Value : 0.4720          
##              Prevalence : 0.5597          
##          Detection Rate : 0.3134          
##    Detection Prevalence : 0.5336          
##       Balanced Accuracy : 0.5300          
##                                           
##        'Positive' Class : pos             
## 
aapl_nb_model_matched <- dfm_match(aapl_testing,features=featnames(aapl_training))
aapl_actual_class <- as.factor(aapl_nb_model_matched$Case)
aapl_predicted_class <- predict(aapl_nb_model,newdata=aapl_nb_model_matched)
aapl_tab_class <- table(aapl_actual_class,aapl_predicted_class)
aapl_tab_class
##                  aapl_predicted_class
## aapl_actual_class neg pos
##               neg 101 118
##               pos  87 131
confusionMatrix(aapl_tab_class,model='everything',positive='pos')
## Confusion Matrix and Statistics
## 
##                  aapl_predicted_class
## aapl_actual_class neg pos
##               neg 101 118
##               pos  87 131
##                                           
##                Accuracy : 0.5309          
##                  95% CI : (0.4829, 0.5785)
##     No Information Rate : 0.5698          
##     P-Value [Acc > NIR] : 0.95423         
##                                           
##                   Kappa : 0.0621          
##                                           
##  Mcnemar's Test P-Value : 0.03615         
##                                           
##             Sensitivity : 0.5261          
##             Specificity : 0.5372          
##          Pos Pred Value : 0.6009          
##          Neg Pred Value : 0.4612          
##              Prevalence : 0.5698          
##          Detection Rate : 0.2998          
##    Detection Prevalence : 0.4989          
##       Balanced Accuracy : 0.5317          
##                                           
##        'Positive' Class : pos             
## 
intc_nb_model_matched <- dfm_match(intc_testing,features=featnames(intc_training))
intc_actual_class <- as.factor(intc_nb_model_matched$Case)
intc_predicted_class <- predict(intc_nb_model,newdata=intc_nb_model_matched)
intc_tab_class <- table(intc_actual_class,intc_predicted_class)
intc_tab_class
##                  intc_predicted_class
## intc_actual_class neg pos
##               neg 100  65
##               pos  86  89
confusionMatrix(intc_tab_class,model='everything',positive='pos')
## Confusion Matrix and Statistics
## 
##                  intc_predicted_class
## intc_actual_class neg pos
##               neg 100  65
##               pos  86  89
##                                           
##                Accuracy : 0.5559          
##                  95% CI : (0.5013, 0.6095)
##     No Information Rate : 0.5471          
##     P-Value [Acc > NIR] : 0.3933          
##                                           
##                   Kappa : 0.1142          
##                                           
##  Mcnemar's Test P-Value : 0.1036          
##                                           
##             Sensitivity : 0.5779          
##             Specificity : 0.5376          
##          Pos Pred Value : 0.5086          
##          Neg Pred Value : 0.6061          
##              Prevalence : 0.4529          
##          Detection Rate : 0.2618          
##    Detection Prevalence : 0.5147          
##       Balanced Accuracy : 0.5578          
##                                           
##        'Positive' Class : pos             
## 
nvda_nb_model_matched <- dfm_match(nvda_testing,features=featnames(nvda_training))
nvda_actual_class <- as.factor(nvda_nb_model_matched$Case)
nvda_predicted_class <- predict(nvda_nb_model,newdata=nvda_nb_model_matched)
nvda_tab_class <- table(nvda_actual_class,nvda_predicted_class)
nvda_tab_class
##                  nvda_predicted_class
## nvda_actual_class neg pos
##               neg  55  26
##               pos  44  22
confusionMatrix(nvda_tab_class,model='everything',positive='pos')
## Confusion Matrix and Statistics
## 
##                  nvda_predicted_class
## nvda_actual_class neg pos
##               neg  55  26
##               pos  44  22
##                                           
##                Accuracy : 0.5238          
##                  95% CI : (0.4399, 0.6067)
##     No Information Rate : 0.6735          
##     P-Value [Acc > NIR] : 0.99994         
##                                           
##                   Kappa : 0.0127          
##                                           
##  Mcnemar's Test P-Value : 0.04216         
##                                           
##             Sensitivity : 0.4583          
##             Specificity : 0.5556          
##          Pos Pred Value : 0.3333          
##          Neg Pred Value : 0.6790          
##              Prevalence : 0.3265          
##          Detection Rate : 0.1497          
##    Detection Prevalence : 0.4490          
##       Balanced Accuracy : 0.5069          
##                                           
##        'Positive' Class : pos             
## 

Random Forest Classification

A random forest is a supervised machine learning algorithm made up of many decision trees, each trained on a different bootstrap sample of the data and considering a random subset of the features at each split. The predictions of the individual trees are then aggregated, by majority vote in the case of classification, to produce the final result. The algorithm can be used for both classification and regression.

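The results below show an accuracy of 100% for three of the four companies and about 98% for the fourth (Intel). However, each confusion matrix is computed by predicting on the same training set the model was fitted to (for example, predict(amd_rf_model, amd_train)), so these figures show how well the forest reproduces data it has already seen, not how well it predicts the movement for unseen articles. A fair comparison with the naive Bayes results would require scoring the held-out test sets (amd_test, aapl_test, intc_test and nvda_test), which are created below but not used.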

The No Information Rate is the accuracy that can be achieved without any information from the model, by always predicting the most common class. In other words, it is the share of the largest class among all articles, which equals the share of positive articles only when positive is the majority class.

This is the minimum bar for judging whether the model is good at predicting positive or negative stock price movement. If the model’s accuracy is higher than the No Information Rate, it is at least better than always guessing the majority class.
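As a worked check, the Python sketch below recomputes the accuracy and the No Information Rate from the counts in the Intel confusion matrix shown later in this section. The function name is hypothetical. It assumes rows are predictions and columns are the reference classes, and takes the No Information Rate as the share of the largest reference class.

```python
def accuracy_and_nir(matrix):
    """matrix[i][j]: count with predicted class i and reference class j."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(size))
    reference_totals = [sum(row[j] for row in matrix) for j in range(size)]
    return correct / total, max(reference_totals) / total


# Counts from the Intel random forest confusion matrix (rows: neg, pos).
intc = [[434, 4], [14, 502]]
accuracy, nir = accuracy_and_nir(intc)
print(round(accuracy, 4), round(nir, 4))  # prints 0.9811 0.5304
```

Both values agree with the Accuracy and No Information Rate lines of that output.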

library(randomForest)

amd_data <- select(amd_total,c(4,5))
amd_ind <-sample(2,nrow(amd_data),replace=T,prob=c(.7,.3))
amd_train <- amd_data[amd_ind==1,]
amd_test <- amd_data[amd_ind==2,]
amd_rf_model <- randomForest(as.factor(Case)~.,data=na.omit(amd_train))
pred<- predict(amd_rf_model,amd_train)
confusionMatrix(pred,as.factor(amd_train$Case),positive='pos')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction neg pos
##        neg 346   0
##        pos   0 371
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9949, 1)
##     No Information Rate : 0.5174     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.5174     
##          Detection Rate : 0.5174     
##    Detection Prevalence : 0.5174     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : pos        
## 
aapl_data <- select(aapl_total,c(4,5))
aapl_ind <-sample(2,nrow(aapl_data),replace=T,prob=c(.75,.25))
aapl_train <- aapl_data[aapl_ind==1,]
aapl_test <- aapl_data[aapl_ind==2,]
aapl_rf_model <- randomForest(as.factor(Case)~.,data=aapl_train)
pred<- predict(aapl_rf_model,aapl_train)
confusionMatrix(pred,as.factor(aapl_train$Case),positive='pos')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction neg pos
##        neg 706   0
##        pos   0 576
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9971, 1)
##     No Information Rate : 0.5507     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.4493     
##          Detection Rate : 0.4493     
##    Detection Prevalence : 0.4493     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : pos        
## 
intc_data <- select(intc_total,c(4,5))
intc_ind <-sample(2,nrow(intc_data),replace=T,prob=c(.75,.25))
intc_train <- intc_data[intc_ind==1,]
intc_test <- intc_data[intc_ind==2,]
intc_rf_model <- randomForest(as.factor(Case)~.,data=intc_train)
pred<- predict(intc_rf_model,intc_train)
confusionMatrix(pred,as.factor(intc_train$Case),positive='pos')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction neg pos
##        neg 434   4
##        pos  14 502
##                                           
##                Accuracy : 0.9811          
##                  95% CI : (0.9703, 0.9888)
##     No Information Rate : 0.5304          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9621          
##                                           
##  Mcnemar's Test P-Value : 0.03389         
##                                           
##             Sensitivity : 0.9921          
##             Specificity : 0.9688          
##          Pos Pred Value : 0.9729          
##          Neg Pred Value : 0.9909          
##              Prevalence : 0.5304          
##          Detection Rate : 0.5262          
##    Detection Prevalence : 0.5409          
##       Balanced Accuracy : 0.9804          
##                                           
##        'Positive' Class : pos             
## 
nvda_data <- select(nvda_total,c(4,5))
nvda_ind <-sample(2,nrow(nvda_data),replace=T,prob=c(.75,.25))
nvda_train <- nvda_data[nvda_ind==1,]
nvda_test <- nvda_data[nvda_ind==2,]
nvda_rf_model <- randomForest(as.factor(Case)~.,data=nvda_train)
pred<- predict(nvda_rf_model,nvda_train)
confusionMatrix(pred,as.factor(nvda_train$Case), positive='pos')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction neg pos
##        neg 189   0
##        pos   0 139
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9888, 1)
##     No Information Rate : 0.5762     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.4238     
##          Detection Rate : 0.4238     
##    Detection Prevalence : 0.4238     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : pos        
##