Introduction

Does social media sentiment make a difference in Kroger’s stock price? Specifically, is the closing stock price influenced by the sentiment contained in tweets that mention Kroger? As an employee and stockholder of Kroger, I am always interested in what drives both the stock and the chatter on social media.

To try to answer this question, 2 databases were used. The unstructured database contains 61,206 individual tweets with 93 descriptive variables, gathered with rtweet. Tweets were collected for 11/8-11/25 using the search keywords Kroger and @Kroger; retweets were not collected. The structured database was collected from Yahoo Finance and includes Kroger as well as 4 of its closest competitors according to Yahoo Finance: CVS, Costco, Target, and Walmart. The competitors were not used in this analysis but would make an interesting future analysis. The financial database includes 9 descriptive variables, including opening and closing stock price.

A data dictionary and datatable for each database are included below. The Twitter datatable has a reduced number of variables because of the size of the database; it includes created_at, screen_name, text, and date. The full Twitter and stock databases are available in the download section of the document.

Sections

  • Introduction
  • The Data
  • Twitter Data Descriptive Analysis
  • Stock Data Descriptive Analysis
  • Correlation and Regression Analysis
  • Prediction
  • Summary

Required Packages

The packages required for this markdown are:

Package               Summary
tidyverse             The tidyverse collection of packages
lubridate             Tools for manipulating dates
tidytext              Functionality to tokenize by commonly used units of text
DT                    Datatables
knitr                 Kable tables
PerformanceAnalytics  Tools for statistical analysis
stargazer             Tools for well formatted regression tables

The Data

Twitter Data Dictionary

Variable Name   Description
created_at      Time and date the tweet was created
screen_name     Twitter screen name of tweet creator
text            The text that the tweet contained
query           Twitter search used to find tweets
date            lubridate in year-month-day-hour-minute-second format
tweet_id        Unique id per tweet to enable unnesting of text

Twitter Data Table

Stock Data Dictionary

Variable Name   Description
Ticker Symbol   Company’s stock market symbol
Date            Date the stock information pertains to
Open            Stock price at stock market open
High            Stock’s highest price during that day’s session
Low             Stock’s lowest price during that day’s session
Close           Stock price at stock market close
Adj Close       Price of stock after adjustments are made; a more accurate price than Close
Volume          Number of shares traded that day
date            lubridate in year-month-day-hour-minute-second format

Stock Datatable

Twitter Data Descriptive Analysis

This section explores the Twitter data: the volume of tweets and the sentiment analysis tools available in tidytext, using the Bing and NRC lexicons.

How many tweets were collected daily? The visualization below shows the count of Kroger tweets collected each day from 11/8 to 11/25. Volume ramped up ahead of Thanksgiving week, peaking at 7,214 tweets on November 18th.

Top 50 Words in Kroger Tweets from 11/8-11/25

tidytext was used to unnest the words of each tweet’s text. Stop words were then removed. Stop words are common words that are not particularly interesting, such as “and” and “the”. In addition, 25 terms that reached the top 50 were filtered out, such as http, t.co, and various number codes that were looked up and found not relevant to sentiment. The datatable below shows the top 50 words used in Kroger tweets. The dataframe that contains these words has 608,564 observations, or words, that were in the Kroger tweets. In the top 50, “Walmart” is the second highest, behind “store” and followed by “time”. Interestingly enough, an unacceptable word made the top 50 as well. Overall, the words are what one would expect to see for a grocery store.
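The tokenizing and filtering steps described above can be sketched as follows. This is a minimal illustration rather than the code behind the report: the three tweets are made up, and the names tweets, custom_drop, and top_words are assumptions introduced only for this example.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the collected tweets (made-up text, not real data).
tweets <- data.frame(
  tweet_id = 1:3,
  text = c(
    "The Kroger store near me is always busy",
    "Great prices at the store today https://t.co/abc",
    "Waited a long time in line at the store"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical list of extra tokens to drop, mirroring the manual filter.
custom_drop <- c("http", "https", "t.co", "abc")

top_words <- tweets %>%
  unnest_tokens(word, text) %>%            # one lowercase word per row
  anti_join(stop_words, by = "word") %>%   # remove common stop words
  filter(!word %in% custom_drop) %>%       # remove link fragments and codes
  count(word, sort = TRUE)                 # most frequent words first

head(top_words, 50)
```

Keeping the manual filter in a named vector makes it easy to extend as more irrelevant tokens surface in the top 50.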

The unnested words from above are reviewed and placed into positive and negative categories. The Bing sentiment lexicon is used to compare the words and determine whether they are negative or positive. The first table shows the number of negative words versus the number of positive words. Negative words are about 90% higher than positive words. The visualization below shows the positive and negative words that appeared more than 400 times. Negative words are given negative counts and positive words positive counts. Again, choice words show up. The words are all in line with the dealings in a grocery store. “Strike” is interesting because Kroger is not in contract negotiations; however, Amazon and other retail and restaurant workers have considered strikes during the holidays.
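A minimal sketch of the Bing classification, assuming the words have already been unnested into one row per word. The word list here is invented for illustration, and the names words, scored, and signed_n are assumptions, not names from the original analysis.

```r
library(dplyr)
library(tidytext)

# Hypothetical tokenized words (one row per word), standing in for the real data.
words <- data.frame(
  word = c("fresh", "love", "rude", "slow", "rude", "store", "strike"),
  stringsAsFactors = FALSE
)

bing <- get_sentiments("bing")   # lexicon with columns: word, sentiment

scored <- words %>%
  inner_join(bing, by = "word") %>%   # keep only words found in the lexicon
  count(word, sentiment) %>%          # occurrences of each scored word
  # sign the counts so negative words plot below zero
  mutate(signed_n = ifelse(sentiment == "negative", -n, n))

# Totals of negative versus positive words
count(scored, sentiment, wt = n)
```

Words that are absent from the lexicon, such as "store" here, drop out at the join, so the totals describe only the scored part of the vocabulary.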

The Bing sentiment lexicon is used again to see how Kroger fares in sentiment score for each day of the week. The scores for the words are added up for each day. Kroger sentiment for 11/8-11/25 is negative every day of the week. Tuesday is the worst day, followed by Thursday and Sunday. In the stores, Tuesday is a day of merchandise changes, which takes the focus off the customer. Thursday is the day with the fewest labor hours allowed because it is the slowest sales day. Sunday is typically the busiest sales day of the week. Many other factors could go into sentiment scores, including corporate news and earnings; store operations are just part of the equation.
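One way to add up scores by day of the week is sketched below. It assumes each scored word counts +1 if positive and -1 if negative, which may differ from the scoring used in the report. The data is invented, the year in the dates is assumed purely for illustration, and scored_words and weekday_scores are hypothetical names.

```r
library(dplyr)
library(lubridate)

# Hypothetical per-word sentiment labels with their tweet dates.
# The year is an assumption made only so the dates parse.
scored_words <- data.frame(
  date = as.Date(c("2021-11-15", "2021-11-15", "2021-11-16",
                   "2021-11-16", "2021-11-16", "2021-11-18")),
  sentiment = c("positive", "positive", "negative",
                "negative", "positive", "negative"),
  stringsAsFactors = FALSE
)

weekday_scores <- scored_words %>%
  mutate(
    score   = ifelse(sentiment == "positive", 1, -1),  # +1 / -1 per word
    weekday = wday(date, label = TRUE)                 # ordered factor, Sunday first
  ) %>%
  group_by(weekday) %>%
  summarise(net_score = sum(score), .groups = "drop")

weekday_scores
```

Grouping by date instead of weekday gives the per-day breakdown with the same pipeline.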

The previous chart led to the question: does Kroger have a positive sentiment score on any given day? The Bing sentiment scores were broken out for each day between 11/8 and 11/25. The results show that on Monday, November 15th there was a positive score! It was followed by the worst daily scores, leading into the Thanksgiving holiday week.

The NRC emotion lexicon is a list of English words and their associations with 8 basic emotions and 2 sentiments, negative and positive. The bar plot below shows how Kroger fared in the analysis. The positive sentiment count reached 50,000, which is close to double the negative count. The highest emotion was trust, with anticipation close behind.

Stock Data Descriptive Analysis

Now let’s take a closer look at the stock data. How does the Kroger price fluctuate daily? The line chart below shows the daily difference between the Open and Close stock prices for 11/8-11/25. The fluctuations stayed within $2.00 and averaged -$0.03; the averages in the table imply that Difference is Open minus Close. The table below the chart shows the averages for key indicators in the stock data.

Averages for Key Indicators of Kroger Stock Data
Open Price   Close Price   Volume    Difference
42.01        42.04         5317215   -0.03
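The Difference variable and the averages in the table can be computed along these lines. The prices are made-up numbers, not Kroger data, and the direction of the subtraction (Open minus Close) is an assumption inferred from the averages above; stock and averages are hypothetical names.

```r
library(dplyr)

# Hypothetical prices for three sessions (made-up numbers, not Kroger data).
stock <- data.frame(
  Date   = as.Date(c("2021-11-08", "2021-11-09", "2021-11-10")),
  Open   = c(42.00, 42.50, 41.80),
  Close  = c(42.30, 42.10, 41.90),
  Volume = c(5.0e6, 5.5e6, 4.8e6)
)

stock <- stock %>%
  mutate(Difference = Open - Close)   # assumed direction: Open minus Close

averages <- stock %>%
  summarise(across(c(Open, Close, Volume, Difference), mean))

averages
```

Because the mean is linear, the average Difference always equals the average Open minus the average Close, which is a quick consistency check on a table like the one above.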

Correlation and Regression Analysis

To answer the question of whether Kroger sentiment in tweets has an effect on Kroger stock prices, a multiple regression model will be created. Before fitting the model, the correlation between the continuous independent variables must be reviewed. If the variables are highly correlated (above .70), regression results such as adjusted \(R^2\) can be overstated, and the coefficients of the variables may be over- or undervalued. All continuous variables were included in the correlation chart below with the exception of Open, High, and Low. These 3 variables were included initially but correlated at 100% with each other. Close was left in to include a stock price variable. The NRC sentiment variables are likely to be more highly correlated because emotions are often connected. For example, anger and disgust can both be felt in a single tweet.

The dataframes are conformed so the data can be used together. The NRC emotion lexicon data is merged with the stock and Twitter volume dataframe (variable n) to create the kroger_sentiments dataframe, in which every variable is a daily total. Daily totals allow comparison with the daily stock price at market close. The kroger_sentiments dataframe is analyzed with the chart.Correlation function in the PerformanceAnalytics package. The results are shown in the correlation chart below: 11 variables show high correlation. In addition to the possible causes of high correlation noted previously, the data covers only 13 business days, a short period that could also contribute to the high correlation. Variables will need to be removed.
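Alongside the visual chart, the .70 threshold can be applied programmatically. The sketch below uses base R on simulated daily totals (random numbers, not the Kroger data), with joy deliberately built to track trust; d, cors, and flagged are hypothetical names.

```r
# Simulated daily totals (random numbers, not the Kroger data).
set.seed(1)
n_days <- 13
trust  <- rnorm(n_days, mean = 1500, sd = 200)
d <- data.frame(
  trust      = trust,
  joy        = trust * 0.8 + rnorm(n_days, sd = 20),  # built to track trust
  Difference = rnorm(n_days, sd = 0.5)
)

cors <- cor(d)                             # Pearson correlation matrix
cors[upper.tri(cors, diag = TRUE)] <- NA   # keep each pair only once

# Positions of pairs whose absolute correlation exceeds .70
high <- which(abs(cors) > 0.70, arr.ind = TRUE)

flagged <- data.frame(
  var1 = rownames(cors)[high[, "row"]],
  var2 = colnames(cors)[high[, "col"]],
  r    = cors[high]
)

flagged
```

A listing like this makes it easier to decide which variable of each highly correlated pair to drop before fitting the regression.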

The regression results show that none of the variables are significant, yet \(R^2\) is 1, the highest value it can take, meaning the independent variables account for all of the variance in the dependent variable. This perfect fit is not a sign of a good model. The model estimates 13 coefficients (12 slopes plus the constant) from only 13 observations, which leaves no residual degrees of freedom, so the fit is exact by construction and significance cannot be assessed. Two variables, positivity and n, receive no estimate at all. As mentioned previously, the high correlation among the predictors adds to the problem. The next step is to use regsubsets from the leaps package to find the best model. This function and other analytic tools will allow the best model to come forward. The multiple high correlations are a serious problem that will need to be addressed, and this function will guide that.

## 
## Regression Results
## ========================================
##                  Dependent variable:    
##              ---------------------------
##                         Close           
## ----------------------------------------
## anger                   0.004           
## anticipation           -0.005           
## disgust                -0.004           
## fear                   -0.004           
## joy                     0.002           
## negative               -0.002           
## positive                0.002           
## sadness                 0.008           
## surprise               -0.002           
## trust                  -0.001           
## positivity                              
## Volume                 0.00000          
## Difference              0.401           
## n                                       
## Constant               42.870           
## ----------------------------------------
## Observations             13             
## R2                      1.000           
## ========================================
## Note:        *p<0.1; **p<0.05; ***p<0.01
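The \(R^2\) of 1 in the output above is what any model produces when it estimates as many coefficients as there are observations. A small simulation, using pure random noise rather than the Kroger data, illustrates this; X, fit, and r2 are hypothetical names.

```r
# Pure noise: 13 observations and 12 unrelated predictors (simulated data).
set.seed(42)
n_obs <- 13
X <- as.data.frame(matrix(rnorm(n_obs * 12), nrow = n_obs))
X$y <- rnorm(n_obs)   # response is random noise, unrelated to the predictors

fit <- lm(y ~ ., data = X)   # 12 slopes plus an intercept = 13 coefficients

# R^2 computed directly from the residuals
r2 <- 1 - sum(resid(fit)^2) / sum((X$y - mean(X$y))^2)

r2                 # equal to 1 up to floating-point error
df.residual(fit)   # 0 residual degrees of freedom
```

With no residual degrees of freedom there is nothing left to estimate the error variance from, which is why the regression table shows no standard errors or significance stars.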

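In the regsubsets results there are 14 variables and 9 possible “best models”, one for each model size from 1 to 9. The last two lines of the output give a fit statistic for each model; because the values fall for models 8 and 9, they appear to be adjusted \(R^2\), since plain \(R^2\) cannot decrease as variables are added to the best subset. It improves greatly up to model 4, rises only slightly for models 5-7, and dips for models 8 and 9. The scree plots below will help evaluate how many variables should be used in the regression.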

## Reordering variables and trying again:
## Subset selection object
## Call: regsubsets.formula(Close ~ ., data = d_kroger_reg_2)
## 14 Variables  (and intercept)
##              Forced in Forced out
## anger            FALSE      FALSE
## anticipation     FALSE      FALSE
## disgust          FALSE      FALSE
## fear             FALSE      FALSE
## joy              FALSE      FALSE
## negative         FALSE      FALSE
## positive         FALSE      FALSE
## sadness          FALSE      FALSE
## surprise         FALSE      FALSE
## trust            FALSE      FALSE
## Volume           FALSE      FALSE
## positivity       FALSE      FALSE
## Difference       FALSE      FALSE
## n                FALSE      FALSE
## 1 subsets of each size up to 9
## Selection Algorithm: exhaustive
##          anger anticipation disgust fear joy negative positive sadness surprise
## 1  ( 1 ) " "   " "          " "     " "  " " " "      " "      " "     " "     
## 2  ( 1 ) " "   " "          " "     " "  " " " "      " "      " "     " "     
## 3  ( 1 ) " "   " "          " "     " "  "*" " "      " "      " "     " "     
## 4  ( 1 ) " "   "*"          " "     " "  "*" " "      " "      " "     " "     
## 5  ( 1 ) " "   "*"          " "     "*"  " " " "      "*"      "*"     " "     
## 6  ( 1 ) " "   "*"          " "     "*"  "*" " "      " "      " "     " "     
## 7  ( 1 ) " "   "*"          "*"     "*"  " " " "      " "      " "     "*"     
## 8  ( 1 ) " "   "*"          "*"     "*"  " " "*"      " "      " "     "*"     
## 9  ( 1 ) "*"   "*"          "*"     "*"  " " " "      "*"      " "     "*"     
##          trust positivity Volume Difference n  
## 1  ( 1 ) " "   " "        " "    "*"        " "
## 2  ( 1 ) "*"   " "        " "    "*"        " "
## 3  ( 1 ) "*"   " "        " "    "*"        " "
## 4  ( 1 ) " "   " "        " "    "*"        "*"
## 5  ( 1 ) " "   " "        " "    "*"        " "
## 6  ( 1 ) " "   "*"        " "    "*"        "*"
## 7  ( 1 ) " "   "*"        " "    "*"        "*"
## 8  ( 1 ) " "   "*"        " "    "*"        "*"
## 9  ( 1 ) " "   "*"        " "    "*"        "*"
## [1] 0.2895700 0.5709598 0.8040240 0.8590595 0.8822763 0.8982516 0.9013613
## [8] 0.8811162 0.8418339
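For readers who want to reproduce this kind of output, a minimal regsubsets sketch follows. It runs on simulated data (random numbers, not the Kroger data) in which y depends only on x1 and x2; d, subsets, and s are hypothetical names, and nvmax is set to 4 here only because the toy data has 4 predictors.

```r
library(leaps)

# Simulated data (random numbers): y depends on x1 and x2 only.
set.seed(7)
n_obs <- 40
d <- data.frame(x1 = rnorm(n_obs), x2 = rnorm(n_obs),
                x3 = rnorm(n_obs), x4 = rnorm(n_obs))
d$y <- 1 + 2 * d$x1 - 1.5 * d$x2 + rnorm(n_obs, sd = 0.5)

# Exhaustive search for the best model of each size, up to 4 variables
subsets <- regsubsets(y ~ ., data = d, nvmax = 4)
s <- summary(subsets)

s$rsq     # plain R^2: never decreases as variables are added
s$adjr2   # adjusted R^2: can fall when an added variable contributes little
s$rss     # residual sum of squares, the quantity shown on an RSS scree plot
s$which   # logical matrix: which variables enter each best model
```

Comparing s$rsq with s$adjr2 side by side is a quick way to tell which statistic a printed vector of values represents.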

Also from the leaps package, a plot of the regsubsets summary shows that the bend in the RSS scree plot below falls between 2 and 4 variables. The next step is to check the \(R^2\) plot and compare the two plots.

The \(R^2\) plot also bends between 2 and 4 variables, which suggests that 3 or 4 variables would be ideal. A correlation chart was run on the variables that regsubsets identified as the “best model” at 3 and at 4 variables. High correlation was a problem for trust and joy, and the 4th variable added, anticipation, was also highly correlated. The 2 variable model did not have high correlation, and all of its variables were significant to highly significant. The 2 variable model was chosen for this reason.

The best 2 variable model is as follows:

\[ \widehat{Close}_i = 42.689 - 0.0004\,Trust_i + 0.6111\,Difference_i \]

We can evaluate it with the following correlation chart and regression results:

## 
## Regression Results for Final Model
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                Close           
## -----------------------------------------------
## Difference               0.611*** (0.192)      
## trust                   -0.0004** (0.0001)     
## Constant                 42.689*** (0.247)     
## -----------------------------------------------
## Observations                    13             
## R2                             0.667           
## Adjusted R2                    0.600           
## Residual Std. Error       0.471 (df = 10)      
## F Statistic           10.013*** (df = 2; 10)   
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

The correlation chart shows a correlation of .61 between Close and Difference and -.57 between trust and Close. These are far less extreme than the near-1 correlations in the first chart, and because they are correlations between each predictor and Close rather than between the two predictors, they are not a multicollinearity concern. The F statistic is highly significant (p < 0.01). Each of the variables is significant, with Difference the most significant (p < 0.01, against p < 0.05 for trust). Adjusted \(R^2\) is .60, which means about 60% of the variability in Close is explained by these variables after adjusting for the number of predictors. 60% is a good number for a model, and although other models provided higher \(R^2\) numbers, some of the variables they included were highly correlated, which artificially inflates \(R^2\).
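As a check on the regression table, the adjusted \(R^2\) and the F statistic can be recovered from the reported \(R^2\) of 0.667 with \(n = 13\) observations and \(p = 2\) predictors. The symbols \(n\), \(p\), and \(\bar{R}^2\) are notation introduced here, and small differences from the table come from rounding \(R^2\) to three decimals.

\[ \bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} = 1 - (1 - 0.667)\,\frac{12}{10} \approx 0.600 \]

\[ F = \frac{R^2 / p}{(1 - R^2)/(n - p - 1)} = \frac{0.667 / 2}{0.333 / 10} \approx 10.0 \]

Both values agree with the table (0.600 and 10.013), and the denominator \(n - p - 1 = 10\) matches the residual degrees of freedom it reports.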

Prediction

The model is now ready to predict Kroger’s closing price from projected values of Difference and trust. In the prediction below, the means of the Difference and trust variables were used.

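Using the average of each variable, the linear model predicts that Kroger stock will close at $42.04. This equals the average close in the stock data, as it must: a least-squares model with an intercept always predicts the mean of the dependent variable at the means of the predictors. Reading the coefficients, each one-dollar increase in Difference raises the predicted close by about $0.61 with trust held constant, and each additional unit in the daily trust count lowers the predicted close by about $0.0004 with Difference held constant.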

##        1 
## 42.04231
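A prediction at the predictor means can be sketched as below. The data is simulated (random numbers, not the Kroger data), with coefficients chosen only to resemble the final model; d, fit, at_means, and pred are hypothetical names.

```r
# Simulated data (random numbers, not the Kroger data).
set.seed(3)
n_obs <- 13
d <- data.frame(Difference = rnorm(n_obs, sd = 0.5),
                trust      = rnorm(n_obs, mean = 1500, sd = 200))
d$Close <- 42.7 + 0.6 * d$Difference - 0.0004 * d$trust + rnorm(n_obs, sd = 0.4)

fit <- lm(Close ~ Difference + trust, data = d)

# Predict at the mean of each predictor
at_means <- data.frame(Difference = mean(d$Difference), trust = mean(d$trust))
pred <- predict(fit, newdata = at_means)

all.equal(unname(pred), mean(d$Close))   # TRUE: the prediction is the mean close
```

To obtain a forecast that differs from the sample average, at_means would be replaced with projected values of Difference and trust for the day of interest.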

Summary

In this sample, sentiment has a statistically significant relationship with Kroger’s closing price. Trust is the sentiment that has the most weight in the close price, and it is the only sentiment retained in the final model. Joy and anticipation were the next sentiments the “best” models added. Those variables also had the highest sentiment scores in the NRC sentiment analysis conducted in the Twitter data descriptive section.

The next steps in this analysis would be to continue collecting Twitter and stock data to add to the dataframe. The analysis would benefit from more data over a longer period of time. Adding the competitors’ data for comparison is also recommended.