For my final project I conducted sentiment analysis on a company’s annual reports and examined the corresponding stock price performance over the following period. I wanted to see whether an annual report with more positive management commentary corresponded to stronger stock price performance over the following year (until the next annual report).
I decided to conduct this analysis on Nvidia given its market dominance recently.
Specifically, I pulled Form 10-K, an annual filing required of publicly traded companies that discusses the company’s performance for the year. Form 10-K has a required section of management commentary (Item 7) that covers both past performance and outlook and can be analyzed for sentiment. I pulled 10-Ks for several years from the SEC EDGAR database via its API and web scraping.
Next, I pulled stock price performance from Yahoo Finance. The challenge was to extract and clean the text data, get it into tidytext format to perform sentiment analysis, and add in the percent change for the stock price for the corresponding forward period.
Below I load the libraries necessary for this analysis
library(httr2)
library(tidyverse)
library(data.table)
library(RCurl)
library(rvest)
library(tidytext)
library(XML)
The first step in this analysis is to obtain the URLs that lead to each filing of Form 10-K for Nvidia from the SEC EDGAR database via its API.
As detailed in the SEC EDGAR API documentation, a company’s filings are available from the submissions endpoint at the following URL, which incorporates a unique CIK number for each company:
Nvidia’s CIK is 1045810
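In the URL, the CIK is zero-padded to ten digits. As a minimal sketch (not part of the original workflow), the URL could also be built programmatically from the raw CIK:
#hypothetical sketch: pad the CIK to ten digits and build the submissions URL
cik <- 1045810
builtURL <- sprintf('https://data.sec.gov/submissions/CIK%010d.json', cik)
#builtURL is 'https://data.sec.gov/submissions/CIK0001045810.json', matching the hard-coded URL below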
Below I send a request to the API and receive a JSON object
request_url <- 'https://data.sec.gov/submissions/CIK0001045810.json'
apiResponse <- request(request_url) |> req_user_agent('semyon.toybis43@spsmail.cuny.edu') |>
req_perform()
summary(apiResponse)
## Length Class Mode
## method 1 -none- character
## url 1 -none- character
## status_code 1 -none- numeric
## headers 13 httr2_headers list
## body 152709 -none- raw
## request 7 httr2_request list
## cache 0 -none- environment
I extract the body from this JSON response and look at the list for filings
nvdaList <- apiResponse |> resp_body_json()
summary(nvdaList$filings$recent)
## Length Class Mode
## accessionNumber 1004 -none- list
## filingDate 1004 -none- list
## reportDate 1004 -none- list
## acceptanceDateTime 1004 -none- list
## act 1004 -none- list
## form 1004 -none- list
## fileNumber 1004 -none- list
## filmNumber 1004 -none- list
## items 1004 -none- list
## size 1004 -none- list
## isXBRL 1004 -none- list
## isInlineXBRL 1004 -none- list
## primaryDocument 1004 -none- list
## primaryDocDescription 1004 -none- list
The accessionNumber and primaryDocument can be used to build a URL to access a specific Form 10-K, which will be demonstrated later. For now, I save this information into a data frame.
nvdaForms <- nvdaList$filings$recent
formsDF <- do.call(rbind.data.frame, nvdaForms)
formsDFTranspose <- transpose(formsDF)
colnames(formsDFTranspose) <- rownames(formsDF)
remove(formsDF)
formsDFTranspose <- formsDFTranspose |> filter(form == '10-K')
head(formsDFTranspose)
## accessionNumber filingDate reportDate acceptanceDateTime act form
## 1 0001045810-24-000029 2024-02-21 2024-01-28 2024-02-21T16:36:57.000Z 34 10-K
## 2 0001045810-23-000017 2023-02-24 2023-01-29 2023-02-24T17:23:43.000Z 34 10-K
## 3 0001045810-22-000036 2022-03-18 2022-01-30 2022-03-17T20:33:34.000Z 34 10-K
## 4 0001045810-21-000010 2021-02-26 2021-01-31 2021-02-26T17:03:14.000Z 34 10-K
## 5 0001045810-20-000010 2020-02-20 2020-01-26 2020-02-20T16:38:18.000Z 34 10-K
## 6 0001045810-19-000023 2019-02-21 2019-01-27 2019-02-21T16:37:18.000Z 34 10-K
## fileNumber filmNumber items size isXBRL isInlineXBRL primaryDocument
## 1 000-23985 24660316 11813809 1 1 nvda-20240128.htm
## 2 000-23985 23668751 13525180 1 1 nvda-20230129.htm
## 3 000-23985 22750748 12667394 1 1 nvda-20220130.htm
## 4 000-23985 21690665 11996719 1 1 nvda-20210131.htm
## 5 000-23985 20635743 12031473 1 1 nvda-2020x10k.htm
## 6 000-23985 19622362 11434705 1 0 nvda-2019x10k.htm
## primaryDocDescription
## 1 10-K
## 2 10-K
## 3 10-K
## 4 FY2021 10-K
## 5 FY2020 10-K
## 6 FY2019 10-K
This data frame is in tidy format: each row is an observation, which in this case is a form filed by Nvidia, and each column is a variable related to the forms. However, this is not the final data frame I need.
By viewing the 10-Ks online I saw that each 10-K for Nvidia had the following base URL:
https://www.sec.gov/Archives/edgar/data/1045810/
This base URL is then followed by the accessionNumber (without dashes), a slash, and the primaryDocument.
For example, the 10-K filed in 2024 is at the following URL:
https://www.sec.gov/Archives/edgar/data/1045810/000104581024000029/nvda-20240128.htm
Thus, I add a column to the data frame mentioned above that combines the base URL with the other fields to create the URL for each Form 10-K.
baseURL <- 'https://www.sec.gov/Archives/edgar/data/1045810/'
formsDFTranspose$nvda10KURLs <- paste(baseURL, str_replace_all(formsDFTranspose$accessionNumber,'-',''), '/',formsDFTranspose$primaryDocument, sep = '')
I then create a subset data frame that has the columns relevant to my analysis and convert the date field from string to date format
finalDF <- formsDFTranspose |> select(reportDate, nvda10KURLs)
finalDF$reportDate <- as.Date(finalDF$reportDate, '%Y-%m-%d')
finalDF
## reportDate
## 1 2024-01-28
## 2 2023-01-29
## 3 2022-01-30
## 4 2021-01-31
## 5 2020-01-26
## 6 2019-01-27
## 7 2018-01-28
## 8 2017-01-29
## 9 2016-01-31
## 10 2015-01-25
## nvda10KURLs
## 1 https://www.sec.gov/Archives/edgar/data/1045810/000104581024000029/nvda-20240128.htm
## 2 https://www.sec.gov/Archives/edgar/data/1045810/000104581023000017/nvda-20230129.htm
## 3 https://www.sec.gov/Archives/edgar/data/1045810/000104581022000036/nvda-20220130.htm
## 4 https://www.sec.gov/Archives/edgar/data/1045810/000104581021000010/nvda-20210131.htm
## 5 https://www.sec.gov/Archives/edgar/data/1045810/000104581020000010/nvda-2020x10k.htm
## 6 https://www.sec.gov/Archives/edgar/data/1045810/000104581019000023/nvda-2019x10k.htm
## 7 https://www.sec.gov/Archives/edgar/data/1045810/000104581018000010/nvda-2018x10k.htm
## 8 https://www.sec.gov/Archives/edgar/data/1045810/000104581017000027/nvda-2017x10k.htm
## 9 https://www.sec.gov/Archives/edgar/data/1045810/000104581016000205/nvda-2016x10k.htm
## 10 https://www.sec.gov/Archives/edgar/data/1045810/000104581015000036/nvda-2015x10k.htm
The URLs above lead to an HTML page containing the Form 10-K for Nvidia for each year it was filed.
Because I have ten different 10-Ks, I created a function that scrapes each one for the Item 7 commentary and returns a sentiment score, rather than scraping each one individually. Below is the code for the function, with comments explaining each step.
sentimentFromURL <- function(urlFor10K){
#I want to return two values: the date of the filing and the sentiment score
#thus I create the returnDF which will be populated with these two values
returnDF <- data.frame(reportDate = character(), sentiment = integer())
#use httr2 package to request the URL from the API
nvda10kResponse <- request(urlFor10K) |>
req_user_agent('semyon.toybis43@spsmail.cuny.edu') |>
req_perform()
#this returned a httr2 response
#I use the resp_body_string to extract the body from the response
nvda10KContent <- resp_body_string(nvda10kResponse)
#I now have a string that has html formatting
#I use the read_html function from the rvest package to convert it to html
page_content <- read_html(nvda10KContent)
#next I extract the body from the html content
body_content <- html_element(page_content, "body")
#I then extract the text from the body content
body_text <- html_text(body_content)
#the body text has the text information from the filing
#I use the stringr package to pull certain information
#First I extract the reportDate
#which I know starts with the phrase below and has a consistent length
year <- str_sub(body_text, str_locate_all(body_text, 'For the fiscal year ended')[[1]][1,2]+1, str_locate_all(body_text, 'For the fiscal year ended')[[1]][1,2]+17)
year <- str_squish(year)
#Next I search for the location of the phrase Item 7 in the body
item7Appearances <- str_locate_all(body_text, 'Item 7')
#Item 7 appears twice in the table of contents - Item 7 and 7A
#The third appearance should be the start of the actual commentary
#I grab the starting position for the third appearance
subsetStartLocation <- item7Appearances[[1]][3]
#now I need to find where item 8 starts. I will use this as the endpoint for my extraction
#For some reason, Item 8 was only being picked up once (table of contents).
#Instead, I searched the content of item 8, which was consistent year over year
item8ContentAppearances <- str_locate_all(body_text, 'The information required by this Item is set forth in our Consolidated Financial Statements and Notes thereto included in this Annual Report on Form 10-K')
subsetEndLocation <- item8ContentAppearances[[1]][1]
#Now I have the start and end point for Item 7
#Below I extract the Item 7 commentary from the whole body of the 10k
item7Commentary <- str_sub(body_text, subsetStartLocation, subsetEndLocation)
#Now that I have the Item 7 text, I need to get it into tidy text format
#first, I put the text into a tibble
item7Tibble <- tibble(line = 1, text = item7Commentary)
#then I use the tidytext library to unnest the tokens
item7TidyText <- item7Tibble |> unnest_tokens(word, text)
#I then drop the last 7 rows because this is the portion of commentary from item 8
#this is because my end point included the beginning of Item 8
item7TidyText <- item7TidyText[1:(nrow(item7TidyText)-7), ]
#I now have a dataframe that is in tidy-text format
#Now I can begin the sentiment analysis
#First I remove stop words
item7TidyText <- item7TidyText |> anti_join(stop_words)
#I use the Loughran sentiment lexicon which is meant specifically for financial reports
#I perform an inner join of the item 7 commentary with words in the lexicon
item7Sentiment <- item7TidyText |> inner_join(get_sentiments('loughran'))
#The lexicon categorizes words as: positive, negative, litigious, uncertainty,
#constraining, and superfluous
#I will perform a simple sentiment analysis that evaluates the amount of positive
#and negative words
#I filter the data frame for words that fall into these two categories
item7Sentiment <- item7Sentiment |> filter(sentiment == 'negative' | sentiment == 'positive')
#below I have the count of the amount of positive and negative words
item7SentimentTable <- item7Sentiment |> group_by(sentiment) |> count()
#I create a sentiment ratio that subtracts the count of negative words from the
#count of positive words and divides it by the sum of the count of negative
#and positive words
item7SentimentScore <- (item7SentimentTable[2,2] - item7SentimentTable[1,2])/ (item7SentimentTable[2,2] + item7SentimentTable[1,2])
#I populate the dataframe with the reportDate and sentiment score and return the dataframe
returnDF[1, ] <-c(year, item7SentimentScore)
return(returnDF)
}
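For intuition, this score ranges from -1 (all matched words negative) to +1 (all matched words positive); for example, 40 positive and 60 negative matched words would give (40 - 60) / (40 + 60) = -0.2.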
Now that I have the function, below I use it to obtain the sentiment scores for the ten Form 10-Ks.
I use the lapply function, which returns a list of ten data frames, each containing the reportDate and sentiment score for one 10-K.
nvidia10KSentimentScores <- lapply(finalDF$nvda10KURLs, sentimentFromURL)
## Joining with `by = join_by(word)`
(This message repeats twenty times in total: one anti-join against the stop words and one inner join against the lexicon for each of the ten 10-Ks.)
I convert this list to a data frame and convert the dates from string format to date
nvidia10KSentimentScoresDF <- do.call(rbind.data.frame, nvidia10KSentimentScores)
nvidia10KSentimentScoresDF$reportDate <- as.Date(nvidia10KSentimentScoresDF$reportDate, '%B %d, %Y')
head(nvidia10KSentimentScoresDF, n = 10)
## reportDate sentiment
## 1 2024-01-28 -0.17171717
## 2 2023-01-29 -0.23404255
## 3 2022-01-30 0.02564103
## 4 2021-01-31 -0.17241379
## 5 2020-01-26 -0.22222222
## 6 2019-01-27 -0.32142857
## 7 2018-01-28 -0.21875000
## 8 2017-01-29 -0.43037975
## 9 2016-01-31 -0.46464646
## 10 2015-01-25 -0.40659341
I join the above data frame values to my finalDF on which I will conduct analysis.
finalDF <- left_join(finalDF, nvidia10KSentimentScoresDF, by = 'reportDate')
head(finalDF, n = 10)
## reportDate
## 1 2024-01-28
## 2 2023-01-29
## 3 2022-01-30
## 4 2021-01-31
## 5 2020-01-26
## 6 2019-01-27
## 7 2018-01-28
## 8 2017-01-29
## 9 2016-01-31
## 10 2015-01-25
## nvda10KURLs
## 1 https://www.sec.gov/Archives/edgar/data/1045810/000104581024000029/nvda-20240128.htm
## 2 https://www.sec.gov/Archives/edgar/data/1045810/000104581023000017/nvda-20230129.htm
## 3 https://www.sec.gov/Archives/edgar/data/1045810/000104581022000036/nvda-20220130.htm
## 4 https://www.sec.gov/Archives/edgar/data/1045810/000104581021000010/nvda-20210131.htm
## 5 https://www.sec.gov/Archives/edgar/data/1045810/000104581020000010/nvda-2020x10k.htm
## 6 https://www.sec.gov/Archives/edgar/data/1045810/000104581019000023/nvda-2019x10k.htm
## 7 https://www.sec.gov/Archives/edgar/data/1045810/000104581018000010/nvda-2018x10k.htm
## 8 https://www.sec.gov/Archives/edgar/data/1045810/000104581017000027/nvda-2017x10k.htm
## 9 https://www.sec.gov/Archives/edgar/data/1045810/000104581016000205/nvda-2016x10k.htm
## 10 https://www.sec.gov/Archives/edgar/data/1045810/000104581015000036/nvda-2015x10k.htm
## sentiment
## 1 -0.17171717
## 2 -0.23404255
## 3 0.02564103
## 4 -0.17241379
## 5 -0.22222222
## 6 -0.32142857
## 7 -0.21875000
## 8 -0.43037975
## 9 -0.46464646
## 10 -0.40659341
First, I check which day of the week the reportDate is:
weekdays(finalDF$reportDate)
## [1] "Sunday" "Sunday" "Sunday" "Sunday" "Sunday" "Sunday" "Sunday" "Sunday"
## [9] "Sunday" "Sunday"
Because these are all weekends, there will not be any stock price information for these days. Thus, I take the Friday before the reportDate as the period start date for the analysis of change in stock price.
finalDF$periodStartDate <- finalDF$reportDate - days(2)
I want to see the change in stock price for the following year, so I need the end date that is a year from the start date.
finalDF$startPlusOneYear <- finalDF$periodStartDate + years(1)
weekdays(finalDF$startPlusOneYear)
## [1] "Sunday" "Saturday" "Saturday" "Saturday" "Sunday" "Saturday"
## [7] "Saturday" "Saturday" "Sunday" "Saturday"
Because these are all weekends, I want to take the stock price from the Friday before this date. I do this below:
finalDF <- finalDF |> mutate(forwardPeriodEndDate = case_when(
weekdays(finalDF$startPlusOneYear) == 'Sunday' ~ finalDF$periodStartDate + years(1) - days(2),
weekdays(finalDF$startPlusOneYear) == 'Saturday' ~ finalDF$periodStartDate + years(1) - days(1),
))
finalDF$startPlusOneYear <- NULL
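The case_when above handles only Saturdays and Sundays, which is sufficient here because every end date fell on a weekend. A more general sketch (not used in this analysis) would roll any weekend date back to the preceding Friday and leave weekday dates unchanged:
#hypothetical helper: roll weekend dates back to the preceding Friday, leave weekdays unchanged
previousWeekday <- function(dates) {
  offset <- case_when(
    weekdays(dates) == 'Saturday' ~ 1,
    weekdays(dates) == 'Sunday' ~ 2,
    TRUE ~ 0
  )
  dates - days(offset)
}
#e.g. finalDF$forwardPeriodEndDate <- previousWeekday(finalDF$periodStartDate + years(1))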
Next, I import a CSV file from my GitHub repository that has the daily stock price values for Nvidia, and I convert the date column from string to date format.
nvdaFileUrl <- 'https://raw.githubusercontent.com/stoybis/DATA607Repo/main/FinalProject/NVDA.csv'
nvdaPrices <- read_csv(nvdaFileUrl)
## Rows: 2285 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): Open, High, Low, Close, Adj Close, Volume
## date (1): Date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nvdaPrices$Date <- as.Date(nvdaPrices$Date, '%Y-%m-%d')
head(nvdaPrices)
## # A tibble: 6 × 7
## Date Open High Low Close `Adj Close` Volume
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2014-12-31 5.1 5.13 5.00 5.01 4.81 16630000
## 2 2015-01-02 5.03 5.07 4.95 5.03 4.83 11368000
## 3 2015-01-05 5.03 5.05 4.92 4.95 4.75 19795200
## 4 2015-01-06 4.96 4.96 4.79 4.80 4.61 19776400
## 5 2015-01-07 4.83 4.88 4.77 4.78 4.59 32180800
## 6 2015-01-08 4.84 5.00 4.84 4.96 4.77 28378000
Next, I join the adjusted closing price from the above file to my final data frame for the period start date.
finalDF <- left_join(finalDF, select(nvdaPrices, Date, 'Adj Close'), join_by(periodStartDate == Date))
colnames(finalDF)[colnames(finalDF) == 'Adj Close'] <- 'startPrice'
head(finalDF, n = 10)
## reportDate
## 1 2024-01-28
## 2 2023-01-29
## 3 2022-01-30
## 4 2021-01-31
## 5 2020-01-26
## 6 2019-01-27
## 7 2018-01-28
## 8 2017-01-29
## 9 2016-01-31
## 10 2015-01-25
## nvda10KURLs
## 1 https://www.sec.gov/Archives/edgar/data/1045810/000104581024000029/nvda-20240128.htm
## 2 https://www.sec.gov/Archives/edgar/data/1045810/000104581023000017/nvda-20230129.htm
## 3 https://www.sec.gov/Archives/edgar/data/1045810/000104581022000036/nvda-20220130.htm
## 4 https://www.sec.gov/Archives/edgar/data/1045810/000104581021000010/nvda-20210131.htm
## 5 https://www.sec.gov/Archives/edgar/data/1045810/000104581020000010/nvda-2020x10k.htm
## 6 https://www.sec.gov/Archives/edgar/data/1045810/000104581019000023/nvda-2019x10k.htm
## 7 https://www.sec.gov/Archives/edgar/data/1045810/000104581018000010/nvda-2018x10k.htm
## 8 https://www.sec.gov/Archives/edgar/data/1045810/000104581017000027/nvda-2017x10k.htm
## 9 https://www.sec.gov/Archives/edgar/data/1045810/000104581016000205/nvda-2016x10k.htm
## 10 https://www.sec.gov/Archives/edgar/data/1045810/000104581015000036/nvda-2015x10k.htm
## sentiment periodStartDate forwardPeriodEndDate startPrice
## 1 -0.17171717 2024-01-26 2025-01-24 610.281372
## 2 -0.23404255 2023-01-27 2024-01-26 203.550140
## 3 0.02564103 2022-01-28 2023-01-27 228.074722
## 4 -0.17241379 2021-01-29 2022-01-28 129.599289
## 5 -0.22222222 2020-01-24 2021-01-22 62.373329
## 6 -0.32142857 2019-01-25 2020-01-24 39.724636
## 7 -0.21875000 2018-01-26 2019-01-25 60.189217
## 8 -0.43037975 2017-01-27 2018-01-26 27.539711
## 9 -0.46464646 2016-01-29 2017-01-27 7.148232
## 10 -0.40659341 2015-01-23 2016-01-22 4.971818
Next, I do the same step as above but for the period ending price.
finalDF <- left_join(finalDF, select(nvdaPrices, Date, 'Adj Close'), join_by(forwardPeriodEndDate == Date))
colnames(finalDF)[colnames(finalDF) == 'Adj Close'] <- 'endPrice'
head(finalDF, n = 10)
## reportDate
## 1 2024-01-28
## 2 2023-01-29
## 3 2022-01-30
## 4 2021-01-31
## 5 2020-01-26
## 6 2019-01-27
## 7 2018-01-28
## 8 2017-01-29
## 9 2016-01-31
## 10 2015-01-25
## nvda10KURLs
## 1 https://www.sec.gov/Archives/edgar/data/1045810/000104581024000029/nvda-20240128.htm
## 2 https://www.sec.gov/Archives/edgar/data/1045810/000104581023000017/nvda-20230129.htm
## 3 https://www.sec.gov/Archives/edgar/data/1045810/000104581022000036/nvda-20220130.htm
## 4 https://www.sec.gov/Archives/edgar/data/1045810/000104581021000010/nvda-20210131.htm
## 5 https://www.sec.gov/Archives/edgar/data/1045810/000104581020000010/nvda-2020x10k.htm
## 6 https://www.sec.gov/Archives/edgar/data/1045810/000104581019000023/nvda-2019x10k.htm
## 7 https://www.sec.gov/Archives/edgar/data/1045810/000104581018000010/nvda-2018x10k.htm
## 8 https://www.sec.gov/Archives/edgar/data/1045810/000104581017000027/nvda-2017x10k.htm
## 9 https://www.sec.gov/Archives/edgar/data/1045810/000104581016000205/nvda-2016x10k.htm
## 10 https://www.sec.gov/Archives/edgar/data/1045810/000104581015000036/nvda-2015x10k.htm
## sentiment periodStartDate forwardPeriodEndDate startPrice endPrice
## 1 -0.17171717 2024-01-26 2025-01-24 610.281372 NA
## 2 -0.23404255 2023-01-27 2024-01-26 203.550140 610.281372
## 3 0.02564103 2022-01-28 2023-01-27 228.074722 203.550140
## 4 -0.17241379 2021-01-29 2022-01-28 129.599289 228.074722
## 5 -0.22222222 2020-01-24 2021-01-22 62.373329 136.810226
## 6 -0.32142857 2019-01-25 2020-01-24 39.724636 62.373329
## 7 -0.21875000 2018-01-26 2019-01-25 60.189217 39.724636
## 8 -0.43037975 2017-01-27 2018-01-26 27.539711 60.189217
## 9 -0.46464646 2016-01-29 2017-01-27 7.148232 27.539711
## 10 -0.40659341 2015-01-23 2016-01-22 4.971818 6.943229
Note that the ending price for the most recent filing’s forward period is not yet available, since that date is in the future.
Last, I calculate the percent change for the period.
finalDF$stockPricePercentChangeOneYearFwd <- (finalDF$endPrice - finalDF$startPrice)/finalDF$startPrice
head(finalDF, n = 10)
## reportDate
## 1 2024-01-28
## 2 2023-01-29
## 3 2022-01-30
## 4 2021-01-31
## 5 2020-01-26
## 6 2019-01-27
## 7 2018-01-28
## 8 2017-01-29
## 9 2016-01-31
## 10 2015-01-25
## nvda10KURLs
## 1 https://www.sec.gov/Archives/edgar/data/1045810/000104581024000029/nvda-20240128.htm
## 2 https://www.sec.gov/Archives/edgar/data/1045810/000104581023000017/nvda-20230129.htm
## 3 https://www.sec.gov/Archives/edgar/data/1045810/000104581022000036/nvda-20220130.htm
## 4 https://www.sec.gov/Archives/edgar/data/1045810/000104581021000010/nvda-20210131.htm
## 5 https://www.sec.gov/Archives/edgar/data/1045810/000104581020000010/nvda-2020x10k.htm
## 6 https://www.sec.gov/Archives/edgar/data/1045810/000104581019000023/nvda-2019x10k.htm
## 7 https://www.sec.gov/Archives/edgar/data/1045810/000104581018000010/nvda-2018x10k.htm
## 8 https://www.sec.gov/Archives/edgar/data/1045810/000104581017000027/nvda-2017x10k.htm
## 9 https://www.sec.gov/Archives/edgar/data/1045810/000104581016000205/nvda-2016x10k.htm
## 10 https://www.sec.gov/Archives/edgar/data/1045810/000104581015000036/nvda-2015x10k.htm
## sentiment periodStartDate forwardPeriodEndDate startPrice endPrice
## 1 -0.17171717 2024-01-26 2025-01-24 610.281372 NA
## 2 -0.23404255 2023-01-27 2024-01-26 203.550140 610.281372
## 3 0.02564103 2022-01-28 2023-01-27 228.074722 203.550140
## 4 -0.17241379 2021-01-29 2022-01-28 129.599289 228.074722
## 5 -0.22222222 2020-01-24 2021-01-22 62.373329 136.810226
## 6 -0.32142857 2019-01-25 2020-01-24 39.724636 62.373329
## 7 -0.21875000 2018-01-26 2019-01-25 60.189217 39.724636
## 8 -0.43037975 2017-01-27 2018-01-26 27.539711 60.189217
## 9 -0.46464646 2016-01-29 2017-01-27 7.148232 27.539711
## 10 -0.40659341 2015-01-23 2016-01-22 4.971818 6.943229
## stockPricePercentChangeOneYearFwd
## 1 NA
## 2 1.9981869
## 3 -0.1075287
## 4 0.7598455
## 5 1.1934091
## 6 0.5701422
## 7 -0.3400041
## 8 1.1855428
## 9 2.8526605
## 10 0.3965171
I now have a tidy data frame where each row is an observation and each column is an attribute of the observation:
the observations (rows) are Nvidia at different points in time (reporting dates)
the attributes (columns) are the report date of the 10-K, the URL for the 10-K, the sentiment score for that 10-K, the start and end date for the forward periods, the stock prices for each period, and the percent change in stock price
Below I visualize the relationship between the sentiment score and the stock price change for the following year using a scatter plot. I add in a trend line via a linear model.
plotTitle <- 'Relationship between sentiment score of 10-K Item 7 Commentary & Change in Stock Price One Year Forward for Nvidia'
ggplot(finalDF, aes(x = sentiment, y = stockPricePercentChangeOneYearFwd)) + geom_point() + geom_smooth(method = 'lm', se = FALSE) +
labs(title = str_wrap(plotTitle, 60)) +
theme(plot.title = element_text(size=12)) +
scale_y_continuous(labels = scales::percent_format()) +
scale_x_continuous(labels = scales::percent_format())
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
It seems like there is a negative relationship between the sentiment score of management commentary and forward stock price performance. Below, I conduct a correlation test.
cor.test(finalDF$sentiment, finalDF$stockPricePercentChangeOneYearFwd)
##
## Pearson's product-moment correlation
##
## data: finalDF$sentiment and finalDF$stockPricePercentChangeOneYearFwd
## t = -1.678, df = 7, p-value = 0.1373
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8849380 0.1995023
## sample estimates:
## cor
## -0.5355805
The correlation is negative; however, based on the p-value, we fail to reject the null hypothesis that the correlation between the two variables is zero. It is also important to note that the sample size is small, one point is an outlier, and the relationship between sentiment score and stock price performance may not be linear.
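As a robustness check (a sketch beyond the original analysis), a rank-based correlation such as Spearman’s is less sensitive to the outlier and does not assume a linear relationship:
#hypothetical robustness check: Spearman's rank correlation on the same two columns
cor.test(finalDF$sentiment, finalDF$stockPricePercentChangeOneYearFwd, method = 'spearman')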
Below I examine how the sentiment of the 10-Ks has evolved over time.
ggplot(finalDF, aes(x = reportDate, y = sentiment)) + geom_line() +
geom_hline(yintercept=0) +
scale_y_continuous(labels = scales::percent_format()) +
ggtitle('10-K Sentiment score over time')
Interestingly, while sentiment is still negative overall, it has become less negative over time.
I was surprised to see that the sentiment scores of the 10-Ks in the sample were mostly negative. It is possible that management tends to take a cautious approach in Item 7 commentary: being overly optimistic about the outlook and then failing to deliver would reflect poorly on management, while outperforming a conservative outlook reflects well on them. This type of bias could naturally skew the commentary toward a more negative sentiment.
Regarding whether there is a positive relationship between sentiment score and forward stock price performance, while the chart above suggests the relationship is negative, there is not enough data to say whether this is truly the case. Furthermore, Nvidia has seen mostly positive stock price performance over this period whereas a stock that had mixed performance might have a different relationship with the sentiment score of commentary. Additionally, there are many other factors that drive stock price performance so it is important to remember that correlation does not equal causation.
For future analysis, it would be interesting to expand the sample size to include more Forms 10-K. However, because many variables can drive stock price performance over a one-year period, it could also be interesting to analyze performance over a shorter window, such as between the day before the 10-K was released and the day of the release. Additionally, Form 10-Q is filed quarterly and could be included in the analysis as well. Lastly, analyzing daily data such as social media posts or news articles about the company for sentiment would provide more sentiment observations that could be paired with daily stock price data for even more granular analysis.
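As a sketch of the 10-Q extension mentioned above (hypothetical, not run here), the same EDGAR submissions data could be wrangled for quarterly filings by keeping form 10-Q instead of 10-K; the scraping function would then need start and end markers appropriate to the 10-Q's management commentary section:
#hypothetical extension: rebuild the filings data frame and keep 10-Q filings instead of 10-Ks
quarterlyForms <- do.call(rbind.data.frame, nvdaForms) |>
  transpose() |>
  setNames(names(nvdaForms)) |>
  filter(form == '10-Q')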