library(tidyverse)
library(tidytext)
library(readr)
library(textdata)
tidy_dcf_holdings <- read.csv("https://myxavier-my.sharepoint.com/:x:/g/personal/kellnerm3_xavier_edu/EefIRM4I0BlNiB8TE6aCUkYBLtPVK5lPl5x1Qfw7HmJaNA?download=1")
DCF Sentiment Analysis
The Reasons for Analysis
After scraping Google Finance for information on the companies that the D’Artagnan Capital Fund (DCF) holds, I had three questions about the “company history” that Google Finance provides:
Google Finance takes its company histories from Wikipedia, and Wikipedia can be edited by the public, so is there any bias in the company histories? By bias I mean emotional, positive, or negative words that are used more heavily in some company histories than in others.
I also wanted to look at the basic word count of each company’s history and see whether there is any correlation between the age of a company and the length of its history on Google Finance, that is, whether older companies have more extensive history descriptions.
Finally, I wanted to look at the correlation between the positivity score and the age of a company, to check whether older companies’ histories have higher positivity scores.
The Data I Used
The DCF currently has holdings in 41 publicly traded companies on the New York Stock Exchange (NYSE) and the Nasdaq. I gathered this data by using RStudio to scrape Google Finance for these 41 companies. Because stock prices and financial data change often, note that all of this data was taken on April 18th, 2024 at 10:30 am EST. The following is a list of the columns I gathered from Google Finance, with a description of what each column contains. The data can be downloaded here: https://myxavier-my.sharepoint.com/:x:/g/personal/kellnerm3_xavier_edu/EdFnsRsdYdVDjj32a8NUKS8BpHrI2cCizU0DylmluDkEJQ?download=1
Data Dictionary
(All prices are in USD.)
Company name - Name of the publicly traded company and/or stock.
Exchange - The public exchange that the stock is traded on, either NYSE or Nasdaq.
Climate Score - A score provided by CDP (formerly the Carbon Disclosure Project) that rates a company on its climate transparency and performance. Not all companies had a score available in Google Finance, so some are null. Here is a link to their website if you want more information: https://www.cdp.net/en/
CEO - The company’s current Chief Executive Officer
Founded - The date the company was founded. Some dates only include the year while others are specified down to the day
Headquarters - The location of the company’s headquarters. Google Finance did not list a headquarters for some companies, so some values may be null.
Stock Price - The current stock price as of April 18th, 2024 at 10:30 am EST.
Previous close - The stock price at the previous day’s close.
Day range - The high and low of the stock price from the previous day (April 17th, 2024).
Year range - The high and low of the stock price from the previous year (2023).
Market capitalization - The total dollar value of the company’s shares outstanding, that is, the price of the stock times the total shares outstanding. Shares outstanding is not included in this data but can be found by dividing market capitalization by share price.
Average volume - The average number of shares traded each day over the past 30 days.
P/E ratio - The ratio of the current share price to the trailing twelve-month earnings per share (EPS). This indicates how expensive the stock is relative to its earnings, which makes it comparable with other stocks.
Dividend Yield Percentage - The ratio of annual dividends to current share price that estimates the dividend return of stock.
Employees - The total number of employees who work for the company.
Company History - This is a paragraph that describes the history of the company provided by Google which they gathered from Wikipedia.
Website - A website link that can be used to research the company or stock.
Page ID - The link that was used to scrape each company’s page from Google Finance.
One more note before getting into the analysis: a null, NA, or missing value does not necessarily mean that the company lacks that attribute. For instance, some companies are shown with an NA company history simply because it was not listed on Google Finance. For some of the stock information, however, a null may be meaningful. Tesla, for example, does not issue dividends and therefore has no dividend yield in this data.
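As a rough illustration of two points from the data dictionary, the sketch below derives shares outstanding from market capitalization and share price, and shows one way to summarise a column that contains NA values. The data frame and its column names (stock_price, market_cap, dividend_yield) are made up for this example and are not the names in the scraped file.

```r
# Toy data standing in for the scraped holdings (values are invented)
toy_quotes <- data.frame(
  company_name   = c("Alpha Corp", "Beta Inc"),
  stock_price    = c(50, 200),     # USD per share
  market_cap     = c(5e9, 1e11),   # USD
  dividend_yield = c(1.8, NA)      # percent; NA here means no dividend listed
)

# Shares outstanding = market capitalization / share price
toy_quotes$shares_outstanding <- toy_quotes$market_cap / toy_quotes$stock_price

# Skip missing values explicitly instead of letting them propagate
mean_yield <- mean(toy_quotes$dividend_yield, na.rm = TRUE)

toy_quotes
```

Whether an NA should be skipped or treated as zero depends on why it is missing, which is the distinction the note above draws.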
The Lexicon I Used
I used the NRC lexicon, which has 13,872 observations (rows) and 2 variables (columns). Each row pairs a word (column 1) with a sentiment (column 2), and a word can appear in more than one row if it carries more than one sentiment. The sentiments are trust, fear, anger, anticipation, disgust, joy, sadness, surprise, negative, and positive. Much of this analysis works by matching the words in each company history against the words in the lexicon.
Loading the Data
First I load the packages required for this analysis along with the data I scraped, which is hosted on my Xavier OneDrive. You can copy the link in the code to get the entire data set as a CSV.
Next I create a data frame containing all of the words we want to analyze in each company history, with one word per row. Stop words, such as “the” and “and”, are removed.
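tidy_history <-
  tidy_dcf_holdings %>%
  # Split each company history into one lowercase word per row
  unnest_tokens(word, company_history) %>%
  # Drop common stop words such as "the" and "and"
  anti_join(stop_words, by = "word")
Analysis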
Question 1
First I made a new data frame by joining the words in the company histories to the NRC lexicon. With this I can see the count of each emotion in each company’s history, along with the positive and negative counts. From those counts I made a new variable called “positivity”, which is the positive count minus the negative count, giving an overall positivity score. I put this data into a graph to show which companies have the highest positivity scores.
This graph suggests that Wikipedia may be showing bias in its company histories through its use of positive or negative words. AbbVie Inc stands out as the only company with a negative score, and there is also a lot of variance in positivity scores. Looking deeper, however, at least in AbbVie’s case, it’s clear why the NRC lexicon scores the history the way it does. The history contains words such as injection, diseases, cancer, and a couple of others. These words tend to be scored as negative, and because they are repeated multiple times they pull the positivity score below zero, even though they describe the company’s business rather than express an opinion about it.
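The positivity calculation described above could be sketched roughly as follows. This is not the original code: it uses a tiny hand-made lexicon in place of NRC, a toy word table in place of tidy_history, and an assumed column name company_name. The relationship argument assumes dplyr 1.1.0 or later.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for tidy_history: one row per company per word
toy_history <- tibble(
  company_name = c("Alpha Corp", "Alpha Corp", "Alpha Corp",
                   "Beta Inc", "Beta Inc", "Beta Inc"),
  word = c("growth", "success", "cancer",
           "disease", "cancer", "innovation")
)

# Toy stand-in for the NRC lexicon; these labels are illustrative only
toy_lexicon <- tibble(
  word = c("growth", "success", "success",
           "cancer", "cancer", "disease", "disease"),
  sentiment = c("positive", "positive", "joy",
                "negative", "fear", "negative", "fear")
)

positivity_scores <- toy_history %>%
  # Keep only words found in the lexicon; a word can match several sentiments
  inner_join(toy_lexicon, by = "word", relationship = "many-to-many") %>%
  # Count each sentiment per company
  count(company_name, sentiment) %>%
  # One column per sentiment, with 0 where a company has no such words
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  # Positivity = positive count minus negative count
  mutate(positivity = positive - negative) %>%
  arrange(desc(positivity))

positivity_scores
```

Words that are not in the lexicon (here, "innovation") drop out at the join, so they do not affect the score in either direction.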
Question 2
This scatter plot shows the relationship between the age of a company and the word count of its history. Only companies younger than 75 years old are included.
Looking at this graph, there doesn’t seem to be much of a correlation between age and word count. The points are pretty scattered, so the length of a company’s history does not appear to be biased by the company’s age.
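One way the age and word-count comparison might be set up is sketched below. It assumes a founded column that holds either a bare year or a full date, measures age relative to the 2024 snapshot, and counts every word including stop words. All of these are assumptions for illustration, and the toy rows are invented.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the holdings data (invented rows)
toy_holdings <- tibble(
  company_name = c("Alpha Corp", "Beta Inc", "Gamma Ltd", "Delta Co"),
  founded = c("1950", "Apr 1, 1976", "Jul 5, 1994", "1888"),
  company_history = c(
    "Alpha Corp is an American manufacturer founded in Ohio",
    "Beta Inc develops software",
    "Gamma Ltd sells medical devices worldwide",
    "Delta Co makes steel"
  )
)

snapshot_year <- 2024  # year the data was scraped

age_and_length <- toy_holdings %>%
  mutate(
    # Pull the four-digit year out of either "1950" or "Apr 1, 1976"
    founded_year = as.integer(sub(".*([0-9]{4}).*", "\\1", founded)),
    age = snapshot_year - founded_year
  ) %>%
  # One row per word, then count the rows per company
  unnest_tokens(word, company_history) %>%
  count(company_name, age, name = "word_count") %>%
  # Keep only companies younger than 75 years
  filter(age < 75)

# Pearson correlation between age and history length
age_length_cor <- cor(age_and_length$age, age_and_length$word_count)
```

A correlation near zero on the real data would match a scattered plot with no visible trend.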
Question 3
This graph contains a regression line that shows the relationship between a company’s positivity score and its age. Only companies younger than 75 years old are included.
This graph shows that the average positivity score for older companies is slightly higher than that of younger companies. This suggests that Wikipedia shows some bias in terms of age and the sentiment of a company’s history.
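A regression line of this kind can be fitted with a simple linear model. The sketch below uses invented ages and positivity scores, so the numbers say nothing about the real holdings. It only shows how the slope behind such a line could be obtained.

```r
# Toy data: one row per company (values are invented)
toy_scores <- data.frame(
  age        = c(10, 20, 30, 40, 50, 60),
  positivity = c(2, 3, 3, 5, 4, 6)
)

# Ordinary least squares fit of positivity on age
fit <- lm(positivity ~ age, data = toy_scores)

# Slope: expected change in positivity per additional year of age
slope <- unname(coef(fit)["age"])

# Standard errors and p-values, to judge whether the slope
# is distinguishable from zero
coef_table <- summary(fit)$coefficients
```

In ggplot2, geom_smooth(method = "lm") draws the same fitted line over a scatter plot. With only 41 companies, the standard error of the slope is worth checking before reading much into a small positive value.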
Conclusion
Based on the analysis above, Wikipedia shows some bias when writing a company’s history. This bias is based on the sentiment of the words in the history: older companies tend to have slightly higher positivity scores than younger companies. Emotional words appear very often in company histories, which is surprising considering that the history of a company should not be that emotional. This analysis may be skewed, however, by the small sample size, since I only looked at the 41 holdings of the DCF. An analysis of the entire NYSE and Nasdaq would likely produce different outcomes.