Load all the libraries or functions
##r chunk
library(reticulate)
library(car)
## Loading required package: carData
library(carData)
library(Rling)
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:car':
##
## logit
library(ca)
##r chunk
#py_install('sklearn')
#py_install('matplotlib')
#py_install('keras')
#py_install('wordcloud')
#use_python("C:\\Users\\rohan\\AppData\\Local\\CONTIN~1\\ANACON~1\\python")
This project should allow you to apply the information you’ve learned in the course to a new dataset. While the structure of the final project will be more of a research project, you can use this knowledge to appropriately answer questions in all fields, along with the practical skills of writing a report that others can read. The dataset must be related to language or language processing in some way. You must use an analysis we learned in class.
This project is about extracting information on stock price movement from news and analyst reviews, captured here from Reddit news headlines paired with the DJIA index. This is cross-verified by recognizing patterns of words associated with down-price and up-price days, which are then used to predict stock price movement from the words that appeared in the past.
The final document should be a knitted HTML/PDF/Word document from a Markdown file. You will turn in the knitted document along with your .Rmd. Be sure to spell and grammar check your work! The following sections should be included:
Introduce your research topic. What is the background knowledge that someone would need to understand the field or area that you have decided to investigate? In this section, you should include sources that help explain the background area and cite them in APA style. 5-10 articles across the paper would be appropriate - be sure to include these! They are part of the grade!
Stock price is something that almost everyone wants insight into, and many people try their luck at least once to profit from it. Making a good profit requires a good prediction of the future stock price before betting on it. Stock prices are generally determined by the behavior of human investors, who set prices based on publicly available information. This information comes in the form of analyst views, speculation, political news, expert opinion, and sometimes news of natural disasters. There is almost always a lag between the news and the resulting price fluctuation, and investors use this lag between news reaching the public and the stock price movement to make a profit. This research project focuses on news articles and the prediction of stock price movement. For prediction, several algorithms were tested and compared, and gradient boosting showed the highest accuracy. Several articles describe the use of different machine learning algorithms to predict stock prices using text mining of news articles (Timmons, n.d.; Kalyani Joshi, n.d.; Gidófalvi, 2001).
What is the data that you are using for your project? What is your hypothesis as to the outcome of the analysis? Why is the problem important for us to study or answer?
The hypothesis is that stock prices change in response to negative and positive news.
Explain the statistical analysis that you are using - you can assume some statistical background, but not to the specific design you are mentioning. For example, the person would know what a mean is, but not the more complex analyses.
The following steps were followed to prepare the data and produce the exploratory and statistical results:
data preparation for evaluation, a check of data quality to understand the data, and feature inspection and filtering.
Explain the data you have selected to study. You can find data through many available corpora or other datasets online (ask for help here for sure!). How was the data collected? Who/what is in the data? Identify what the independent and dependent variables are for the analysis. How do these independent and dependent variables fit into the analyses you selected?
For this project I used Reddit news headlines and DJIA index data for the years 2008 to 2016. The data set can be found on Kaggle at the following location: https://www.kaggle.com/aaron7sun/stocknews
This data set consists of three files: Combined_News_DJIA, RedditNews, and upload_DJIA_table.
Combined_News_DJIA has 27 columns: 25 columns (Top1 to Top25) hold the top 25 news headlines from Reddit for each day, and the remaining two hold the date and a binary label that records whether the stock price increased or decreased relative to the previous day's closing price. The label takes the values 0 and 1 and was predefined in the data set from the day-over-day stock return, where Stock return = (current day stock price - previous day stock price) / previous day stock price. If the current day's stock return minus the previous day's stock return is positive, the label is 1; otherwise it is 0.
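The return formula above can be illustrated with a minimal sketch. The prices are hypothetical (not taken from the DJIA table), and the labeling rule used here, 1 when the day's return is positive and 0 otherwise, is a simplifying assumption for illustration only; the Kaggle file ships its own precomputed Label column, which is what the analysis below uses.

```python
def daily_returns(prices):
    """Return (p_t - p_{t-1}) / p_{t-1} for each day after the first."""
    return [(cur - prev) / prev for prev, cur in zip(prices, prices[1:])]


def labels_from_returns(returns):
    """Illustrative assumption: label 1 when the return is positive, else 0."""
    return [1 if r > 0 else 0 for r in returns]


# Hypothetical closing prices, not taken from the data set.
closing_prices = [100.0, 102.0, 101.0, 101.0, 105.0]
returns = daily_returns(closing_prices)
labels = labels_from_returns(returns)
print([round(r, 4) for r in returns])
print(labels)
```

The first day has no previous close, so five prices yield four returns and four labels.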
Here is the data that I gathered from the Kaggle site:
##r chunk
data= read.csv("C:\\Users\\rohan\\OneDrive\\540\\Project\\stocknews\\Combined_News_DJIA.csv")
## Transforming the news into word counts, which will later be used for predicting stock loss or gain
data=r.data
All= data.copy()
data['All']=data.iloc[:,2:27].apply(lambda row: ''.join(str(row.values)), axis=1)
## Checking for NAN
data.isnull().sum()
## Date 0
## Label 0
## Top1 0
## Top2 0
## Top3 0
## Top4 0
## Top5 0
## Top6 0
## Top7 0
## Top8 0
## Top9 0
## Top10 0
## Top11 0
## Top12 0
## Top13 0
## Top14 0
## Top15 0
## Top16 0
## Top17 0
## Top18 0
## Top19 0
## Top20 0
## Top21 0
## Top22 0
## Top23 0
## Top24 0
## Top25 0
## All 0
## dtype: int64
## Removing byte-string prefixes (b" and b'), backslashes, and double quotes
data = data.replace('b\"|b\'|\\\\|\\\"', '', regex=True)
data.head(4)
## Date ... All
## 0 2008-08-08 ... ['Georgia 'downs two Russian warplanes' as cou...
## 1 2008-08-11 ... [Why wont America and Nato help us? If they wo...
## 2 2008-08-12 ... [Remember that adorable 9-year-old who sang at...
## 3 2008-08-13 ... [ U.S. refuses Israel weapons to attack Iran: ...
##
## [4 rows x 28 columns]
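To make the effect of the cleanup pattern concrete, here is a minimal sketch that applies the same alternation with the standard `re` module to single strings. The raw headline format (a Python bytes representation stored as text) is an assumption inferred from the pattern, and `strip_edges` is a hypothetical alternative of my own, not part of the analysis.

```python
import re

# Same alternation as the pandas call above: b", b', a backslash, or a double quote.
ARTIFACT_PATTERN = re.compile('b\"|b\'|\\\\|\\\"')


def strip_artifacts(headline):
    """Remove every match of the pattern, wherever it occurs in the text."""
    return ARTIFACT_PATTERN.sub('', headline)


def strip_edges(headline):
    """Hypothetical alternative: only drop a leading b" or b' and one trailing quote."""
    return re.sub(r'^b["\']|["\']$', '', headline)


# Assumed raw format, modeled on the first row shown above.
raw = 'b"Georgia \'downs two Russian warplanes\'"'
print(strip_artifacts(raw))
print(strip_artifacts("the club's owner"))
print(strip_edges("the club's owner"))
```

Because the pattern is not anchored, it also deletes `b'` inside ordinary words such as "club's". The anchored variant avoids that, at the cost of removing a legitimate quote character at the very end of a headline.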
Analyze the data given your statistical plan. Report the appropriate statistics for that analysis (see lecture notes). Include figures! Include the R-chunks so we can see the analyses you ran and output from the study. Note what you are doing in each step.
## Exploratory data analysis
nodown = data[data['Label']==1]
down = data[data['Label']==0]
print(len(nodown)/len(data))
## 0.5354449472096531
print(len(down)/len(data))
## 0.4645550527903469
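From the proportions above, about 54% of the days are labeled 1 and about 46% are labeled 0, so the two classes are roughly balanced.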
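The class shares printed above also define a simple reference point: a model that always predicts the more frequent label. The sketch below reproduces the shares from class counts. The counts 1065 and 924 are inferred from the printed proportions and the 1,989 rows in the data (1,591 training plus 398 testing), so they should be read as an assumption rather than a value printed by the analysis.

```python
from collections import Counter


def class_proportions(labels):
    """Share of each label in the list."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Assumed counts, inferred from the proportions printed above.
all_labels = [1] * 1065 + [0] * 924
proportions = class_proportions(all_labels)

# Accuracy of always predicting the most frequent label.
majority_baseline = max(proportions.values())
print(proportions)
print(majority_baseline)
```

Any classifier trained later is only informative if its accuracy is clearly above this baseline.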
### Clean up the data
import re
import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud,STOPWORDS
import matplotlib
import matplotlib.pyplot as plt
def clean_words(content):
    only_words = re.sub("[^a-zA-Z]", " ", content)
    words = only_words.lower().split()
    stop = set(stopwords.words("english"))
    words_useful = [w for w in words if w not in stop]
    return " ".join(words_useful)
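The behavior of `clean_words` can be seen on a single string with a minimal sketch. To keep it runnable without downloading the NLTK corpus, it swaps in a tiny, hypothetical stop-word set; the function name `clean_words_sketch` and the example headline are likewise made up for illustration.

```python
import re

# Hypothetical, tiny stop-word set; the report uses NLTK's full English list.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "as", "is", "s"}


def clean_words_sketch(content, stop_words=STOP_WORDS):
    """Keep letters only, lowercase, split on whitespace, drop stop words."""
    only_letters = re.sub("[^a-zA-Z]", " ", content)
    words = only_letters.lower().split()
    return " ".join(w for w in words if w not in stop_words)


# Hypothetical headline, not taken from the data set.
example = "U.S. stocks fall 3% as the bank's profits drop!"
cleaned_example = clean_words_sketch(example)
print(cleaned_example)
```

Note the side effects of replacing every non-letter with a space: numbers such as "3%" disappear, and abbreviations such as "U.S." break into single letters.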
Exploratory analysis of the words in the news headlines shows which words are more prominent when the stock price moves in either the up or the down direction. Up here means an increase in stock return from the previous day, and down means a decrease.
## Exploratory data analysis
nodownprice = []
downprice = []
for each in nodown['All']:
    nodownprice.append(clean_words(each))
for each in down['All']:
    downprice.append(clean_words(each))
# Word plot
#wordoccurenceupprice = WordCloud(background_color='grey',
#                                 width=3200,
#                                 height=2700
#                                 ).generate(nodownprice[0])
#plt.figure(1,figsize=(8,8))
#plt.imshow(wordoccurenceupprice)
#plt.axis('off')
#plt.show()
#wordoccurencedownprice = WordCloud(background_color='blue',
#                                   width=3200,
#                                   height=2700
#                                   ).generate(downprice[0])
#plt.figure(1,figsize=(8,8))
#plt.imshow(wordoccurencedownprice)
#plt.axis('off')
#plt.show()
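A word cloud is a picture of word frequencies, so the same comparison between up days and down days can be made numerically. The following is a minimal sketch using `collections.Counter`; the two lists of cleaned headlines are hypothetical stand-ins for `nodownprice` and `downprice`, and `top_words` is a helper name introduced here for illustration.

```python
from collections import Counter


def top_words(cleaned_docs, n=3):
    """Count words across all documents and return the n most frequent."""
    counts = Counter()
    for doc in cleaned_docs:
        counts.update(doc.split())
    return counts.most_common(n)


# Hypothetical cleaned headlines standing in for nodownprice and downprice.
up_docs = ["markets rally trade deal", "trade talks rally"]
down_docs = ["war fears markets fall", "war sanctions fall"]

top_up = top_words(up_docs, n=2)
top_down = top_words(down_docs, n=2)
print(top_up)
print(top_down)
```

Comparing the two lists side by side shows which words are frequent in one class but not the other, which is the information the word clouds are meant to convey.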
## Splitting data set into training and testing
import sklearn
from sklearn.model_selection import train_test_split
train,test = train_test_split(data,test_size=0.2,random_state=42)
print("Length of train is",len(train))
## Length of train is 1591
print("Length of test is", len(test))
## Length of test is 398
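The two lengths printed above follow from the 20% test share. As a minimal sketch, assuming scikit-learn rounds the test count up when `test_size` is a fraction and assigns the remaining rows to training, the sizes can be reproduced as follows (`split_sizes` is a hypothetical helper, not a scikit-learn function):

```python
import math


def split_sizes(n_samples, test_size):
    """Assumed rounding rule: test count rounded up, the rest used for training."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# 1989 rows in total (1591 + 398, as printed above).
n_train, n_test = split_sizes(1989, 0.2)
print(n_train, n_test)
```

Note that this split is random across days (fixed by `random_state=42`), so training and testing days are interleaved in time.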
trainheadline = []
for row in range(0, len(train.index)):
    trainheadline.append(' '.join(str(x) for x in train.iloc[row, 2:27]))
testheadline = []
for row in range(0, len(test.index)):
    testheadline.append(' '.join(str(x) for x in test.iloc[row, 2:27]))
# Generate word vector features from trainheadlines and test headlines which have 25 topics.
from sklearn.feature_extraction.text import CountVectorizer
vectorizerbasic = CountVectorizer()
trainbasic = vectorizerbasic.fit_transform(trainheadline)
print(trainbasic.shape)
## (1591, 31525)
testbasic = vectorizerbasic.transform(testheadline)
print(testbasic.shape)
## (398, 31525)
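The shapes above show that both matrices share the same 31,525 columns, because the vocabulary is learned from the training headlines only and then reused for the test headlines. A minimal sketch of that idea in plain Python follows. It is a simplification: it splits on whitespace, whereas `CountVectorizer` uses its own tokenization rules, and the function names and example sentences are hypothetical.

```python
def fit_vocabulary(docs):
    """Map each distinct word in the training documents to a column index."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    return {word: idx for idx, word in enumerate(vocab)}


def transform(docs, vocabulary):
    """Turn each document into a row of word counts over the fitted vocabulary."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for word in doc.lower().split():
            idx = vocabulary.get(word)
            if idx is not None:  # words unseen during fitting are dropped
                row[idx] += 1
        rows.append(row)
    return rows


# Hypothetical documents, not taken from the data set.
train_docs = ["stocks rise on trade deal", "stocks fall on war fears"]
test_docs = ["oil stocks fall"]

vocab = fit_vocabulary(train_docs)
train_matrix = transform(train_docs, vocab)
test_matrix = transform(test_docs, vocab)
print(len(vocab), test_matrix)
```

The word "oil" never appears in training, so it contributes nothing to the test row. With 1,591 rows and 31,525 columns, the real feature matrix has far more columns than rows, which makes overfitting easy.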
Each model is trained on the training data set and then evaluated on how well it predicts the new testing data.
Models that fit well have:
High accuracy scores: the share of all predictions that are correct
High precision scores: of the days predicted as a class, the share that truly belong to it (lowered by false positives)
High recall scores: of the days that truly belong to a class, the share that were predicted as it (lowered by false negatives)
High F1 scores: the harmonic mean of precision and recall
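These four scores can be written out from the counts of true positives (tp), true negatives (tn), false positives (fp), and false negatives (fn) for one class. The sketch below uses hypothetical counts and guards against division by zero by returning 0.0, which is a choice made here for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def accuracy(tp, tn, fp, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts for one class.
tp, tn, fp, fn = 60, 20, 30, 10
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(round(precision, 3), round(recall, 3), round(f1, 3))
print(round(accuracy(tp, tn, fp, fn), 3))
```

In the classification reports below, these scores are listed once per class, with each class in turn treated as the positive one.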
from sklearn import linear_model
from sklearn.linear_model import SGDClassifier, SGDRegressor,LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
basicmodel2 = LogisticRegression()
#basicmodel2 = basicmodel2.fit(trainbasic, train["Label"])
#define your outcomes
Stock_Price_Updown = ["0", "1"]
# Build log model
logreg = LogisticRegression(solver='lbfgs', multi_class='ovr', max_iter=10000)
# Fit the data to the log model
logreg = logreg.fit(trainbasic, train["Label"])
##python chunk
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
#predict new data using logistic regression
prediction_test = logreg.predict(testbasic)
#print out results
print('accuracy %s' % accuracy_score(test["Label"], prediction_test))
## accuracy 0.49246231155778897
print(classification_report(test["Label"], prediction_test,target_names=Stock_Price_Updown))
## precision recall f1-score support
##
## 0 0.42 0.47 0.44 171
## 1 0.56 0.51 0.53 227
##
## accuracy 0.49 398
## macro avg 0.49 0.49 0.49 398
## weighted avg 0.50 0.49 0.49 398
## Random Forest classification
Randforestclf=RandomForestClassifier(n_estimators=200)
Randforestclf.fit(trainbasic, train["Label"])
## RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
## criterion='gini', max_depth=None, max_features='auto',
## max_leaf_nodes=None, max_samples=None,
## min_impurity_decrease=0.0, min_impurity_split=None,
## min_samples_leaf=1, min_samples_split=2,
## min_weight_fraction_leaf=0.0, n_estimators=200,
## n_jobs=None, oob_score=False, random_state=None,
## verbose=0, warm_start=False)
predrand=Randforestclf.predict(testbasic)
#print out results
print('accuracy %s' % accuracy_score(test["Label"], predrand))
## accuracy 0.5376884422110553
print(classification_report(test["Label"], predrand,target_names=Stock_Price_Updown))
## precision recall f1-score support
##
## 0 0.44 0.26 0.32 171
## 1 0.57 0.75 0.65 227
##
## accuracy 0.54 398
## macro avg 0.50 0.50 0.49 398
## weighted avg 0.51 0.54 0.51 398
# Create Decision Tree classifier object
decisiontreeclf = DecisionTreeClassifier()
# Train Decision Tree classifier
decisiontreeclf = decisiontreeclf.fit(trainbasic, train["Label"])
#Predict the response for test dataset
preddectree = decisiontreeclf.predict(testbasic)
#print out results
print('accuracy %s' % accuracy_score(test["Label"], preddectree))
## accuracy 0.5301507537688442
print(classification_report(test["Label"], preddectree,target_names=Stock_Price_Updown))
## precision recall f1-score support
##
## 0 0.46 0.54 0.50 171
## 1 0.60 0.52 0.56 227
##
## accuracy 0.53 398
## macro avg 0.53 0.53 0.53 398
## weighted avg 0.54 0.53 0.53 398
# Create GradientBoostingClassifier object
GradientBoostclf = GradientBoostingClassifier(learning_rate=0.001,
                                              max_depth=1,
                                              n_estimators=100)
GradientBoostclf.fit(trainbasic, train["Label"])
## GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
## learning_rate=0.001, loss='deviance', max_depth=1,
## max_features=None, max_leaf_nodes=None,
## min_impurity_decrease=0.0, min_impurity_split=None,
## min_samples_leaf=1, min_samples_split=2,
## min_weight_fraction_leaf=0.0, n_estimators=100,
## n_iter_no_change=None, presort='deprecated',
## random_state=None, subsample=1.0, tol=0.0001,
## validation_fraction=0.1, verbose=0,
## warm_start=False)
predgradboostclf=GradientBoostclf.predict(testbasic)
print(GradientBoostclf.score(testbasic, test["Label"]))
## 0.5703517587939698
print(classification_report(test["Label"], predgradboostclf,target_names=Stock_Price_Updown))
## precision recall f1-score support
##
## 0 0.00 0.00 0.00 171
## 1 0.57 1.00 0.73 227
##
## accuracy 0.57 398
## macro avg 0.29 0.50 0.36 398
## weighted avg 0.33 0.57 0.41 398
##
##
## C:\Users\rohan\AppData\Local\r-miniconda\envs\r-reticulate\lib\site-packages\sklearn\metrics\_classification.py:1272: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
## _warn_prf(average, modifier, msg_start, len(result))
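The gradient boosting report and the warning above are what one would expect from a model that predicts class 1 for every test day. As a minimal sketch, the same numbers can be reproduced from the support column alone (171 days of class 0 and 227 days of class 1), with no model involved:

```python
# Class counts taken from the support column of the report above.
y_true = [0] * 171 + [1] * 227
# A "model" that always predicts class 1.
y_pred = [1] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
baseline_accuracy = correct / len(y_true)

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
print(round(baseline_accuracy, 2), round(precision_1, 2), recall_1, round(f1_1, 2))
```

The rounded values match the class 1 row and the accuracy in the report. Class 0 is never predicted, so its precision has a zero denominator, which is what the `UndefinedMetricWarning` refers to.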
Summarize the results from your study in as plain of language as possible. How does this relate to previous literature? Were the results supportive of your hypotheses? What have we learned from you doing this analysis/study?
This research project tried to verify the hypothesis that news makes a large difference in stock price fluctuations. For this purpose, word cloud visualization was used to see whether frequently used words really make a difference to stock price movement; from the word plots, it was visible to some extent that some negative words were associated with downward price movement. To cross-verify this finding, four algorithms were tested to see how predictive models perform when news headlines are used to predict the movement of stock returns. For the four models (logistic regression, random forest, decision tree, and gradient boosting classifier), accuracy, precision, recall, and F1 score were checked, and all of the models performed about average, with test accuracy between 0.49 and 0.57. Gradient boosting showed the highest accuracy of the four (0.57), but it predicted class 1 for every test day (recall of 1.00 for class 1 and 0.00 for class 0), so its accuracy simply equals the share of class 1 days in the test set (227 of 398). The model results therefore do not clearly support the hypothesis.
Because this research is based on the 25 top headlines per day, CNNs and other deep learning techniques were not tested: they would require all headlines to be combined into one text, and they will not predict well if all the top news items are treated as a single news article. In the future, this project requires more detailed research on deep learning techniques. However, capturing time series stock movement through a machine learning approach is somewhat time consuming and less accurate compared to standard time series regression approaches such as ARIMA and autoregression.
Adam Atkins, M. N. (2018). Financial news predicts stock market volatility better than close price. Science Direct.
Gidófalvi, G. (2001). Using News Articles to Predict Stock Price Movements. Semantic Scholar.
Kalyani Joshi, P. B. (n.d.). Stock Trend Prediction Using News Sentiment Analysis. 1-11.
Ma, Q. (2008). CS224n Final Project: Stock Price Prediction Using News Articles.
Tahir M. Nisar, M. Y. (2018). Twitter as a tool for forecasting stock market movements: A short-window event study. Science Direct.
Timmons, K. L. (n.d.). Predicting the Stock Market with News Articles.