Load the libraries + functions

Load all the libraries or functions

##r chunk

library(reticulate)

library(car)
## Loading required package: carData
library(carData)
library(reticulate)
library(Rling)
library(psych)
## 
## Attaching package: 'psych'
## The following object is masked from 'package:car':
## 
##     logit
library(ca)
##r chunk

#py_install('scikit-learn')
#py_install('matplotlib')
#py_install('keras')
#py_install('wordcloud')


#use_python("C:\\Users\\rohan\\AppData\\Local\\CONTIN~1\\ANACON~1\\python")

Objective

This project should allow you to apply the information you’ve learned in the course to a new dataset. While the structure of the final project will be more of a research project, you can use this knowledge to appropriately answer questions in all fields, along with the practical skills of writing a report that others can read. The dataset must be related to language or language processing in some way. You must use an analysis we learned in class.

This project is about getting information on stock price movement from news and analyst reviews, captured from Reddit news, for the DJIA index. This is cross-verified by recognizing patterns of words associated with down and up price movements in stocks, which will then be used to predict stock price movement from the words that appeared in past news.

Instructions

The final document should be a knitted HTML/PDF/Word document from a Markdown file. You will turn in the knitted document along with your .Rmd. Be sure to spell and grammar check your work! The following sections should be included:

Introduction

Introduce your research topic. What is the background knowledge that someone would need to understand the field or area that you have decided to investigate? In this section, you should include sources that help explain the background area and cite them in APA style. 5-10 articles across the paper would be appropriate - be sure to include these! They are part of the grade!

Stock prices are something almost everyone is interested in understanding, and many people try their luck at least once to make a profit from them. Making a good profit requires a good prediction of the future stock price before betting on it. Stock prices are generally determined by the behavior of human investors, who set prices using publicly available information. This information comes in the form of analyst views, speculation, political news, expert opinion, and sometimes news of natural disasters. There is almost always a time lag between the news and the resulting price fluctuation, and investors use this lag between the news becoming public and the stock price movement to make a profit. This research project focuses on news articles and stock price movement prediction. For prediction, various algorithms were tested and compared, and gradient boosting showed the highest accuracy among them. Several articles describe the use of different machine learning algorithms to predict stock prices using text mining of news articles (Timmons, n.d.; Kalyani Joshi, n.d.; Gidófalvi, 2001).

Hypothesis / Problem Statement

What is the data that you are using for your project? What is your hypothesis as to the outcome of the analysis? Why is the problem important for us to study or answer?

The hypothesis is that the stock price changes with negative and positive news.

Statistical Analysis Plan

Explain the statistical analysis that you are using - you can assume some statistical background, but not to the specific design you are mentioning. For example, the person would know what a mean is, but not the more complex analyses.

These are the steps followed to prepare the data and produce the exploratory and statistical results:

Data preparation for evaluation purposes, checking the quality of the data to understand it, and feature inspection and filtering.

Method - Data - Variables

Explain the data you have selected to study. You can find data through many available corpora or other datasets online (ask for help here for sure!). How was the data collected? Who/what is in the data? Identify what the independent and dependent variables are for the analysis. How do these independent and dependent variables fit into the analyses you selected?

For this project I used Reddit news about stocks and the DJIA index for the years 2008 to 2016. The data set can be found on Kaggle at the following location: https://www.kaggle.com/aaron7sun/stocknews

This data set consists of three files: Combined_News_DJIA, RedditNews, and upload_DJIA_table.

Combined_News_DJIA has 27 columns: 25 columns hold the top 25 hot news headlines from Reddit, and two columns hold the date and a binary outcome indicating whether the stock price increased or decreased from the previous day's closing price. This binary outcome takes the values 0 and 1 and was predefined according to the stock return relative to the previous day's stock return. Stock return = (current day stock price - previous day stock price) / previous day stock price. If the difference between the current stock return and the previous day's stock return is positive, the binary outcome is 1; otherwise it is 0.
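The return formula and labelling rule described above can be sketched in Python. This is only an illustration under stated assumptions, not the code used to build the Kaggle labels: the prices are made-up numbers, `daily_returns` and `labels_from_returns` are hypothetical helper names, and the rule implemented (1 when the day's return is higher than the previous day's return, 0 otherwise) is one reading of the description above.

```python
def daily_returns(prices):
    """(current - previous) / previous for each consecutive pair of prices."""
    return [(cur - prev) / prev for prev, cur in zip(prices, prices[1:])]


def labels_from_returns(returns):
    """1 when a return exceeds the previous day's return, else 0 (assumed rule)."""
    return [1 if cur - prev > 0 else 0 for prev, cur in zip(returns, returns[1:])]


# Made-up closing prices, for illustration only
prices = [100.0, 102.0, 101.0, 104.0]

returns = daily_returns(prices)
labels = labels_from_returns(returns)

print(returns)  # three returns for four prices
print(labels)   # one label per pair of consecutive returns
```

With four prices there are three returns and two labels, because each step compares a day with the one before it.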

Here is the data that I gathered from the Kaggle site:

##r chunk

data= read.csv("C:\\Users\\rohan\\OneDrive\\540\\Project\\stocknews\\Combined_News_DJIA.csv")

## Combining the 25 headlines into one text column, later used for predicting stock loss or gain

data=r.data

All= data.copy()

data['All']=data.iloc[:,2:27].apply(lambda row: ''.join(str(row.values)), axis=1)

Data Cleaning and Processing


## Checking for NAN 

data.isnull().sum()
## Date     0
## Label    0
## Top1     0
## Top2     0
## Top3     0
## Top4     0
## Top5     0
## Top6     0
## Top7     0
## Top8     0
## Top9     0
## Top10    0
## Top11    0
## Top12    0
## Top13    0
## Top14    0
## Top15    0
## Top16    0
## Top17    0
## Top18    0
## Top19    0
## Top20    0
## Top21    0
## Top22    0
## Top23    0
## Top24    0
## Top25    0
## All      0
## dtype: int64

No headline has missing values.


## Removing byte-string prefixes (b" and b'), backslashes, and double quotes

data = data.replace('b\"|b\'|\\\\|\\\"', '', regex=True)
data.head(4)
##          Date  ...                                                All
## 0  2008-08-08  ...  ['Georgia 'downs two Russian warplanes' as cou...
## 1  2008-08-11  ...  [Why wont America and Nato help us? If they wo...
## 2  2008-08-12  ...  [Remember that adorable 9-year-old who sang at...
## 3  2008-08-13  ...  [ U.S. refuses Israel weapons to attack Iran: ...
## 
## [4 rows x 28 columns]
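To see what the cleaning pattern does, it can be applied to a single string. The sketch below uses `re.sub` on one made-up raw headline as a stand-in for the pandas `replace(..., regex=True)` call, which applies the same pattern to each string cell. The sample strings are assumptions for illustration.

```python
import re

# Same pattern as in the chunk above: b" and b' prefixes, backslashes, double quotes
pattern = 'b\"|b\'|\\\\|\\\"'

# Made-up raw headline in the byte-string style seen in the data set
raw = 'b"Georgia \'downs two Russian warplanes\' as countries move to brink of war"'
cleaned = re.sub(pattern, '', raw)
print(cleaned)  # Georgia 'downs two Russian warplanes' as countries move to brink of war

# Caveat: the b' alternative is not anchored to the start of the headline,
# so it also matches inside words that contain b followed by an apostrophe
print(re.sub(pattern, '', "Arab's"))  # Aras
```

If that side effect matters, one option is to anchor the prefix alternatives to the start of the string, at the cost of a slightly longer pattern.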

Statistical Analysis Results

Analyze the data given your statistical plan. Report the appropriate statistics for that analysis (see lecture notes). Include figures! Include the R-chunks so we can see the analyses you ran and output from the study. Note what you are doing in each step.


## Exploratory data analysis 

nodown = data[data['Label']==1]
down = data[data['Label']==0]
print(len(nodown)/len(data))
## 0.5354449472096531
print(len(down)/len(data))
## 0.4645550527903469

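From the proportions above, about 54% of the days are labeled 1 (the price did not go down) and about 46% are labeled 0 (the price went down).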


### Clean up the data
import re
import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud,STOPWORDS
import matplotlib
import matplotlib.pyplot as plt



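def clean_words(content):
    # Keep letters only; everything else becomes a space
    only_words = re.sub("[^a-zA-Z]", " ", content)
    # Lowercase and split into tokens
    words = only_words.lower().split()
    # English stop words from NLTK (needs nltk.download("stopwords") once)
    stop = set(stopwords.words("english"))
    words_useful = [w for w in words if w not in stop]
    return " ".join(words_useful)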
    

Exploratory analysis of the words in the news articles shows which words are more prominent when the stock price moves Up or Down. Up here means an increase in stock return from the previous day, and Down means a decrease.

## Exploratory data analysis 

nodownprice=[]
downprice=[]
for each in nodown['All']:
    nodownprice.append(clean_words(each))

for each in down['All']:
    downprice.append(clean_words(each))


# Word plot 

#wordoccurenceupprice = WordCloud(background_color='grey',
#                                 width=3200,
#                                 height=2700
#                                 ).generate(nodownprice[0])

#plt.figure(1,figsize=(8,8))
#plt.imshow(wordoccurenceupprice)
#plt.axis('off')
#plt.show()

#wordoccurencedownprice = WordCloud(background_color='blue',
#                                   width=3200,
#                                   height=2700
#                                   ).generate(downprice[0])

#plt.figure(1,figsize=(8,8))
#plt.imshow(wordoccurencedownprice)
#plt.axis('off')
#plt.show()
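As a text-based complement to the word clouds, the most frequent words in each group can also be counted directly. This is a minimal sketch, assuming two small made-up lists of already-cleaned headlines standing in for `nodownprice` and `downprice`; `top_words` is a hypothetical helper, not part of the original analysis.

```python
from collections import Counter


def top_words(cleaned_docs, n=2):
    """Return the n most frequent words across a list of cleaned strings."""
    counts = Counter()
    for doc in cleaned_docs:
        counts.update(doc.split())
    return counts.most_common(n)


# Made-up cleaned headlines, for illustration only
nodownprice_demo = ["markets rally growth", "growth jobs rise", "rally continues growth"]
downprice_demo = ["crisis war fears", "war sanctions crisis", "crisis deepens"]

print(top_words(nodownprice_demo))  # [('growth', 3), ('rally', 2)]
print(top_words(downprice_demo))    # [('crisis', 3), ('war', 2)]
```

Comparing the two lists side by side gives the same kind of information as the word clouds, in a form that is easy to include in a table.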

Model training and testing


## Splitting data set into training and testing

import sklearn

from sklearn.model_selection import train_test_split

train,test = train_test_split(data,test_size=0.2,random_state=42)


print("Length of train is",len(train))
## Length of train is 1591
print("Length of test is", len(test))
## Length of test is 398
trainheadline= []
for row in range(0,len(train.index)):
    trainheadline.append(' '.join(str(x) for x in train.iloc[row,2:27]))

testheadline=[]

for row in range(0,len(test.index)):
    testheadline.append(' '.join(str(x) for x in test.iloc[row,2:27]))
    

# Generate word-count features from the train and test headlines; each row combines that day's 25 headlines.

from sklearn.feature_extraction.text import CountVectorizer

vectorizerbasic = CountVectorizer()
trainbasic = vectorizerbasic.fit_transform(trainheadline)
print(trainbasic.shape)
## (1591, 31525)
testbasic = vectorizerbasic.transform(testheadline)
print(testbasic.shape)

    
## (398, 31525)
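The train and test matrices above have the same number of columns because the vocabulary is learned from the training headlines only and then reused for the test headlines. The sketch below is a simplified pure-Python stand-in for that idea, not scikit-learn's implementation: `tokenize`, `fit_vocabulary`, and `transform` are hypothetical helpers, the headlines are made up, and the token rule (lowercased words of two or more characters) is chosen to resemble the `CountVectorizer` default.

```python
import re

# Tokens of two or more word characters, similar to CountVectorizer's default
TOKEN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN.findall(text.lower())


def fit_vocabulary(docs):
    """Map each distinct training word to a column index (alphabetical order)."""
    words = sorted({w for d in docs for w in tokenize(d)})
    return {w: i for i, w in enumerate(words)}


def transform(docs, vocab):
    """Count vocabulary words per document; words not in the vocabulary are ignored."""
    rows = []
    for d in docs:
        row = [0] * len(vocab)
        for w in tokenize(d):
            if w in vocab:
                row[vocab[w]] += 1
        rows.append(row)
    return rows


# Made-up headlines, for illustration only
train_docs = ["Oil prices rise", "Oil exports fall as prices fall"]
test_docs = ["Gold prices rise"]

vocab = fit_vocabulary(train_docs)
train_counts = transform(train_docs, vocab)
test_counts = transform(test_docs, vocab)

print(vocab)
print(train_counts)
print(test_counts)  # "gold" was never seen in training, so it has no column
```

Words that appear only in the test headlines are dropped, which is why a model trained on these counts cannot use them.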

The model is trained on the training data set, and then examined to see how well it predicts the new testing data.

Models that fit well have high accuracy scores (the share of all predictions that are correct), high precision scores (of the cases predicted as a class, the share that truly belong to it, which false positives lower), high recall scores (of the cases that truly belong to a class, the share that were predicted as it, which false negatives lower), and high F1 scores (the harmonic mean of precision and recall).
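The four scores can be computed by hand from the counts of true positives, false positives, and false negatives. The sketch below does this for one class on made-up labels; `binary_metrics` is a hypothetical helper and the label lists are assumptions for illustration, not results from this project.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels, for illustration only
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

scores = binary_metrics(y_true, y_pred)
print(scores)
```

Here four of the five actual 1s are found (recall 0.8) and four of the five predicted 1s are right (precision 0.8), so F1 is also 0.8, while accuracy is 6 correct out of 8.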

Different Algorithm check


from sklearn import linear_model
from sklearn.linear_model import SGDClassifier, SGDRegressor,LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

basicmodel2 = LogisticRegression()
#basicmodel2 = basicmodel2.fit(trainbasic, train["Label"])

#define your outcomes
Stock_Price_Updown = ["0", "1"]

# Build logistic regression model

logreg = LogisticRegression(solver='lbfgs', multi_class='ovr', max_iter=10000)

# Fit the data to the log model
logreg = logreg.fit(trainbasic, train["Label"])

    

##python chunk


from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

#predict new data using logistic regression 
prediction_test = logreg.predict(testbasic)

#print out results
print('accuracy %s' % accuracy_score(test["Label"], prediction_test))
## accuracy 0.49246231155778897
print(classification_report(test["Label"], prediction_test,target_names=Stock_Price_Updown))
##               precision    recall  f1-score   support
## 
##            0       0.42      0.47      0.44       171
##            1       0.56      0.51      0.53       227
## 
##     accuracy                           0.49       398
##    macro avg       0.49      0.49      0.49       398
## weighted avg       0.50      0.49      0.49       398
## Random Forest classification

Randforestclf=RandomForestClassifier(n_estimators=200)

Randforestclf.fit(trainbasic, train["Label"])
## RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
##                        criterion='gini', max_depth=None, max_features='auto',
##                        max_leaf_nodes=None, max_samples=None,
##                        min_impurity_decrease=0.0, min_impurity_split=None,
##                        min_samples_leaf=1, min_samples_split=2,
##                        min_weight_fraction_leaf=0.0, n_estimators=200,
##                        n_jobs=None, oob_score=False, random_state=None,
##                        verbose=0, warm_start=False)
predrand=Randforestclf.predict(testbasic)

#print out results
print('accuracy %s' % accuracy_score(test["Label"], predrand))
## accuracy 0.5376884422110553
print(classification_report(test["Label"], predrand,target_names=Stock_Price_Updown))

 
##               precision    recall  f1-score   support
## 
##            0       0.44      0.26      0.32       171
##            1       0.57      0.75      0.65       227
## 
##     accuracy                           0.54       398
##    macro avg       0.50      0.50      0.49       398
## weighted avg       0.51      0.54      0.51       398

# Create Decision Tree classifier object

decisiontreeclf= DecisionTreeClassifier()

# Train Decision Tree classifier
decisiontreeclf=decisiontreeclf.fit(trainbasic, train["Label"])

#Predict the response for test dataset
preddectree = decisiontreeclf.predict(testbasic)


#print out results
print('accuracy %s' % accuracy_score(test["Label"], preddectree))
## accuracy 0.5301507537688442
print(classification_report(test["Label"], preddectree,target_names=Stock_Price_Updown))
##               precision    recall  f1-score   support
## 
##            0       0.46      0.54      0.50       171
##            1       0.60      0.52      0.56       227
## 
##     accuracy                           0.53       398
##    macro avg       0.53      0.53      0.53       398
## weighted avg       0.54      0.53      0.53       398

# Create GradientBoostingClassifier object

GradientBoostclf = sklearn.ensemble.GradientBoostingClassifier(learning_rate=0.001,
                            max_depth = 1, 
                            n_estimators = 100)

GradientBoostclf.fit(trainbasic, train["Label"])
## GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
##                            learning_rate=0.001, loss='deviance', max_depth=1,
##                            max_features=None, max_leaf_nodes=None,
##                            min_impurity_decrease=0.0, min_impurity_split=None,
##                            min_samples_leaf=1, min_samples_split=2,
##                            min_weight_fraction_leaf=0.0, n_estimators=100,
##                            n_iter_no_change=None, presort='deprecated',
##                            random_state=None, subsample=1.0, tol=0.0001,
##                            validation_fraction=0.1, verbose=0,
##                            warm_start=False)
predgradboostclf=GradientBoostclf.predict(testbasic)

print(GradientBoostclf.score(testbasic, test["Label"]))
## 0.5703517587939698
print(classification_report(test["Label"], predgradboostclf,target_names=Stock_Price_Updown))

    
##               precision    recall  f1-score   support
## 
##            0       0.00      0.00      0.00       171
##            1       0.57      1.00      0.73       227
## 
##     accuracy                           0.57       398
##    macro avg       0.29      0.50      0.36       398
## weighted avg       0.33      0.57      0.41       398
## 
## 
## C:\Users\rohan\AppData\Local\r-miniconda\envs\r-reticulate\lib\site-packages\sklearn\metrics\_classification.py:1272: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
##   _warn_prf(average, modifier, msg_start, len(result))

Interpret and Discuss

Summarize the results from your study in as plain of language as possible. How does this relate to previous literature? Were the results supportive of your hypotheses? What have we learned from you doing this analysis/study?

This research project tested the hypothesis that news makes a difference to stock price fluctuations. Word cloud visualization was used to see whether frequently used words relate to the direction of the stock price movement; from the word plots, some negative words appeared, to some extent, to be associated with downward price movement. To cross-verify this finding, four algorithms were tested to see how a predictive model performs when trained on the news headlines and the stock return movement. For all four models (logistic regression, random forest, decision tree, and gradient boosting classifier), accuracy, precision, recall, and F1 score were checked, and all of the models performed only about average, with accuracies between 0.49 and 0.57. Gradient boosting showed the highest accuracy of the four (0.57), but it predicted class 1 for every test day (recall of 1.00 for class 1 and 0.00 for class 0), so its accuracy equals the share of class 1 days in the test set (227 of 398). Because this research is based on the top 25 headlines per day, CNNs and other deep learning techniques were not tested; they would require all the headlines to be combined into one text, and predictions may suffer if all the top news items are treated as a single article. In future, this project calls for more detailed research using deep learning techniques, although capturing time-series stock movement through a machine learning approach is somewhat time consuming and less accurate compared with standard time-series regression approaches such as ARIMA and autoregression.
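One way to put the accuracies reported above in context is a majority-class baseline, which always predicts the more frequent label. The sketch below is an illustration, not part of the original analysis: it rebuilds the test labels from the support counts in the classification reports (171 days labeled 0 and 227 labeled 1) and, as a simplifying assumption, picks the majority class from those same counts rather than from the training set.

```python
# Test-set class counts taken from the "support" column of the reports above
n_down, n_up = 171, 227
y_test = [0] * n_down + [1] * n_up

# Majority-class baseline: always predict the more frequent label
majority = 1 if n_up >= n_down else 0
y_pred = [majority] * len(y_test)

accuracy = sum(1 for t, p in zip(y_test, y_pred) if t == p) / len(y_test)
recall_down = sum(1 for t, p in zip(y_test, y_pred) if t == 0 and p == 0) / n_down
recall_up = sum(1 for t, p in zip(y_test, y_pred) if t == 1 and p == 1) / n_up

print(accuracy)      # about 0.57
print(recall_down)   # 0.0
print(recall_up)     # 1.0
```

A model is only adding information from the headlines if it beats this baseline on the same test set.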

References

Adam Atkins, M. N. (2018). Financial news predicts stock market volatility better than close price. Science Direct.

Gidófalvi, G. (2001). Using News Articles to Predict Stock Price Movements. Semantic Scholar.

Kalyani Joshi, P. B. (n.d.). Stock Trend Prediction Using News Sentiment Analysis. 1-11.

Ma, Q. (2008). CS224n Final Project: Stock Price Prediction Using News Articles.

Tahir M. Nisar, M. Y. (2018). Twitter as a tool for forecasting stock market movements: A short-window event study. Science Direct.

Timmons, K. L. (n.d.). Predicting the Stock Market with News Articles.