library(tidyverse)
library(dplyr)
library(tidytext)
library(lubridate)
library(flextable)
library(wordcloud)
library(vader)
library(RColorBrewer)
library(scales)
library(ggplot2)
data(stop_words)
DATA607_Final_Proj_Fox
1. Background
In recent years, Boeing has experienced a series of major negative events, beginning with two fatal crashes of its new 737 Max 8 jets in 2018 and 2019, the subsequent grounding of those aircraft until the end of 2020, and related legal battles that stretched into 2023. In January 2024, a new round of serious safety events and violations began when a door plug blew out and violently depressurized a Max 9 jet. The fallout from this and subsequent events and investigations continues to this day.
To better understand the impact of these events on media coverage and public sentiment, I obtained 407 mass media articles from a balanced set of nine major news sources covering twelve months (May 2023 to April 2024). I derived their mean sentiment scores using both a lexicon-based tool (AFINN) and a rule-based tool (VADER), and correlated those findings with Boeing stock prices.
2. Data Sources & Methodology
Articles Data:
News articles (n=407) about Boeing were selected from nine major publications by querying the Perigon API as follows:
- Keyword “Boeing” in headlines only
- Publication dates 5/1/2023 - 4/30/2024
- Selected the top 10 sources, then removed articles from bloomberg.com, which had a large number of short, low-content articles
- Included articles behind paywalls and reprints
- Excluded “roundups,” paid articles, opinion, non-news, and press releases
Note that the articles themselves were truncated, but they were accompanied by long “summary” fields that often appeared to be straight copies of shorter articles. To approximate the full text for analysis, I concatenated the truncated article with the summary and title. While this was not ideal, I did not find any APIs that provided full articles except for very short timeframes (e.g. seven days). This source had the benefit of allowing a search over many years as well as compiling a set of balanced sources representative of what the public would consume.
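For reference, the headline-restricted query described above could be reproduced roughly as follows. This is an illustrative sketch only, not the exact export used here: the endpoint path and the parameter names (title, from, to, apiKey) are assumptions that should be checked against the Perigon documentation, and an API key is required.
# Illustrative sketch of the Perigon query (endpoint and parameter names are assumptions)
library(httr)
library(jsonlite)
resp <- GET("https://api.goperigon.com/v1/all",
            query = list(title  = "Boeing",                        # keyword in headlines only
                         from   = "2023-05-01",
                         to     = "2024-04-30",
                         apiKey = Sys.getenv("PERIGON_API_KEY")))  # hypothetical env var
page1 <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))  # one page of article metadata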
Tidying: The articles data was tidy for article-level analysis, with one row equaling one observation (article). The tokenized dataframes created later were also tidy with one observation (word or bigram) per row, and the stock market data was tidy with one observation (week) per row.
Normalizing: For this assignment, a tidy dataset was to be normalized in MySQL, then denormalized and cleaned in R. Despite numerous attempts, the articles data could not be loaded into MySQL due to problematic characters throughout the long text fields. See Section 3 for more detail.
However, the dataframe was already normalized, with no repeating values requiring separate lookup tables for efficient management. The stock market dataframe was also already normalized.
Cleaning: All of the datasets were cleaned and transformed multiple times during the analysis, and all dataframes created were tidy (one row per observation, which could be an article, a token, a week or month, etc. depending on analysis.)
Sentiment Scores
Sentiment scores were added using AFINN (Nielsen) and VADER (Hutto & Gilbert).
Boeing (BA) Stock Prices
Historical stock prices were downloaded from Yahoo Finance. There were no stock splits during this time period, so prices were consistent. The closing price on the Monday following each week of articles was selected as the data point of interest.
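The CSV used later came from a manual Yahoo Finance download. As an alternative sketch (not used in this analysis, and assuming the quantmod package), the same daily series could be pulled programmatically and the Monday closes extracted:
# Sketch: pull BA daily prices from Yahoo Finance via quantmod (not run for this report)
library(quantmod)
ba_xts <- getSymbols("BA", src = "yahoo",
                     from = "2023-05-01", to = "2024-05-01",
                     auto.assign = FALSE)
mondays <- Cl(ba_xts)[weekdays(index(ba_xts)) == "Monday"]  # Monday closing prices (English locale)
head(mondays)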
Analysis
All of the above data was analyzed in multiple ways, including but not limited to comparative analysis of the publications themselves; trends of article frequency and sentiment; word clouds comparing common tokens in 2023 vs. 2024; and Pearson correlation matrix of weekly article volume, sentiment, and stock price.
3. Loading Libraries and Articles Dataset
First, I loaded the libraries:
Then I loaded the 407 articles in four text files due to file size limitations of the API. As discussed above, this data was already tidy and normalized, with one row per observation (tidy) and no repeating values (normalized).
However, for the exercise, I did export the data and attempt to import it into MySQL in multiple ways (comma-delimited, tab-delimited, alternate encodings, cleaning the file of special characters, etc.). While I came close, I was never able to parse the file 100% correctly and could not use it for accurate analysis.
In the future, I would use a different database if the data did indeed require normalization.
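For completeness, below is a minimal sketch of how the load could be retried, assuming a local MySQL instance and the DBI + RMariaDB packages, and using the combined df_all dataframe created in the next chunk. dbWriteTable() handles quoting and escaping itself, which sidesteps the delimiter and special-character problems of file-based imports; the connection details are placeholders.
# Sketch only (not run): write the articles dataframe straight to MySQL via DBI
library(DBI)
library(RMariaDB)
con <- dbConnect(RMariaDB::MariaDB(),
                 dbname   = "data607",              # placeholder database
                 host     = "localhost",
                 user     = "user",
                 password = Sys.getenv("MYSQL_PWD"))
dbWriteTable(con, "boeing_articles", df_all, overwrite = TRUE)
dbDisconnect(con)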
#---------- Read in csv files and create one df
df_raw1 <- read_csv("https://raw.githubusercontent.com/AmandaSFox/DATA607/main/Project_Final/export%20(1).csv")
df_raw2 <- read_csv("https://raw.githubusercontent.com/AmandaSFox/DATA607/main/Project_Final/export%20(2).csv")
df_raw3 <- read_csv("https://raw.githubusercontent.com/AmandaSFox/DATA607/main/Project_Final/export%20(3).csv")
df_raw4 <- read_csv("https://raw.githubusercontent.com/AmandaSFox/DATA607/main/Project_Final/export.csv")
df_all <- union_all(df_raw1, df_raw2)
df_all <- union_all(df_all, df_raw3)
df_all <- union_all(df_all, df_raw4)
#---------- Code to send to SQL for normalization. Includes sample troubleshooting code to find erroneous tab characters.
#---------- MySQL data loads were unsuccessful and abandoned.
#---------- Dataframe is already TIDY with one row per observation (article)
#---------- and already normalized: no repeating values requiring lookups for optimal management
# look for columns with tabs
test <- apply(df_all,1,function(row) any(grepl("\t",row)))
print(df_all[test,1])
# A tibble: 0 × 1
# ℹ 1 variable: url <chr>
df_export <- df_all %>%
mutate(content = str_replace_all(content, "\t",""),
summary = str_replace_all(summary, "\t",""),
description = str_replace_all(description, "\t",""),
title = str_replace_all(title, "\t",""))
df_export %>%
write.table(file = "export_all.txt",
sep = "\t",
            row.names = FALSE)
4. Summary Analysis: Publication Bias and Reliability Scores
To ensure that the data sample was reasonably representative of mass media coverage during this time period, the nine data sources were reviewed for bias and reliability.
Ad Fontes Media publishes bias and reliability scores for media outlets of all types. The nine publications selected for this analysis were evenly distributed in terms of bias: three sources are classified as “Middle/Centrist”, three “Skew Left”, and three “Skew Right”.
Further, six were considered reliable (reliability score 40+) and none were unreliable (score <24). Three sources had scores between these two thresholds, which Ad Fontes Media states may be due to a high proportion of opinion pieces or wide variation in the reliability of individual articles.
# Plot bias and reliability data (collected manually from Ad Fontes site)
sources <-
read_csv("https://raw.githubusercontent.com/AmandaSFox/DATA607/main/Project_Final/sources_info.csv")
sources <- sources %>%
mutate(skew2 = if_else(Bias<0,"Left","Right"))
sources %>%
ggplot(aes(x = reorder(Source, -Bias), y=Bias, fill = skew2)) +
geom_col(show.legend = FALSE) +
scale_fill_manual(values = c("deepskyblue3", "darkred")) +
coord_flip() +
geom_hline(yintercept = 5, color = "darkred") +   # reference lines: colors set directly rather than mapped in aes()
geom_hline(yintercept = 0) +
geom_hline(yintercept = -5, color = "green") +
scale_y_continuous(labels = label_number(accuracy = 0.01),
breaks = pretty_breaks(n=8)
) +
labs(title = "Publications: Left vs. Right Bias Scores",
x = "Publication",
y = "Ad Fontes Media Bias Score (Range = -42 to +42)"
)
sources %>%
ggplot(aes(x = reorder(Source, -Bias), y=Reliability)) +
geom_col(fill = "seagreen") +
coord_flip() +
geom_hline(aes(yintercept = 40), color = "green", show.legend = FALSE) +
geom_hline(aes(yintercept = 24), color = "red", show.legend = FALSE) +
scale_y_continuous(breaks = breaks_width(5,0),
limits=c(0, 64)) +
labs(title = "Publications: Reliability Score",
x = "Publication",
y = "Ad Fontes Media Reliability Score (Range = 0-64)")5. Data Cleaning and Transformation: Articles Data
To prepare for trending and analysis, the articles dataframe was summarized by year, month, and week; the publication name was extracted from each article URL and standardized; and the title, summary, and abbreviated “content” fields were concatenated to provide sufficient text for sentiment analysis.
#-----------------------------------------------
# CLEANING DATA
#-----------------------------------------------
# Summarize dates by year, month, week. Pull publication from URL. Remove Bloomberg.
df_all <- df_all %>%
mutate(dt = date(pubDate),
yr = format(pubDate, format = "%Y"),
yr_month = format(pubDate, format="%Y%m"),
wk_begin = floor_date(pubDate,unit = "week")) %>%
mutate(pub = sub("^https://www\\.(.*?)\\..*", "\\1", url)) %>%
mutate(pub = if_else(substring(pub,9,14)=="nypost","NY Post",pub)) %>%
filter(pub != "bloomberg")
# Capitalize publications
df_all <- df_all %>%
mutate(pub = if_else(pub =="cnn","CNN",
if_else(pub =="forbes","Forbes",
if_else(pub =="foxnews","Fox News",
if_else(pub =="latimes","LA Times",
if_else(pub =="nytimes","NY Times",
if_else(pub =="usatoday","USA Today",
if_else(pub =="washingtontimes","Wash Times",
if_else(pub =="wsj","Wall St Journal",
pub
)))))))))
# remove unneeded fields
df_clean <- df_all %>%
subset(select = c(pub, yr, yr_month, wk_begin, dt, title, summary, content))
head(df_clean)
# A tibble: 6 × 8
pub yr yr_month wk_begin dt title summary content
<chr> <chr> <chr> <dttm> <date> <chr> <chr> <chr>
1 CNN 2024 202401 2024-01-07 00:00:00 2024-01-11 The FA… "The F… "The F…
2 NY Times 2024 202401 2024-01-21 00:00:00 2024-01-24 Boeing… "Boein… "Boein…
3 Forbes 2024 202401 2024-01-07 00:00:00 2024-01-11 Boeing… "The F… "The F…
4 Forbes 2024 202401 2024-01-28 00:00:00 2024-01-31 Besieg… "Boein… "Boein…
5 CNN 2024 202401 2024-01-28 00:00:00 2024-01-31 Boeing… "Boein… "Boein…
6 CNN 2024 202402 2024-02-04 00:00:00 2024-02-05 New pr… "A new… "A new…
# Summary counts
df_clean %>%
group_by(pub) %>%
summarise(articles = n()) %>%
arrange(desc(articles))
# A tibble: 9 × 2
pub articles
<chr> <int>
1 Forbes 79
2 CNN 76
3 NY Post 72
4 Wash Times 34
5 NY Times 33
6 USA Today 30
7 Wall St Journal 26
8 Fox News 12
9 LA Times 10
# Add concatenated text field for analysis. Mutate POSIXct to Date.
df_analyze <- df_clean %>%
mutate(text = paste(title,summary,content,sep = ' ')) %>%
subset(select = c(pub, yr, yr_month, wk_begin, dt, text)) %>%
mutate(wk_begin = as.Date(wk_begin))
The resulting clean, tidy table for article-level analysis contains one observation per article (sample row below):
# Flextable to display an example with full text
flex_clean <- flextable(head(df_clean,1)) %>%
theme_box() %>%
fontsize(i = 1, j = 1:8, size = 8, part = "all") %>%
align(align = "center", i=1, j = 1:8, part = "header") %>%
valign(valign = "top", i=1, j = 1:8, part = "body")
flex_clean

| pub | yr | yr_month | wk_begin | dt | title | summary | content |
|---|---|---|---|---|---|---|---|
| CNN | 2024 | 202401 | 2024-01-07 00:00:00 | 2024-01-11 | The FAA is formally investigating Boeing over Alaska Airlines Boeing 737 Max incident | The Federal Aviation Administration (FAA) is investigating Boeing's quality control following a violent in-flight failure of a door plug on a nearly new 737 Max 9. The FAA stated that the incident "should have never happened and it cannot happen again." The investigation will focus on whether Boeing failed to ensure completed products conformed to its approved design and were in a condition for safe operation in compliance with FAA regulations. This comes after airlines Alaska Airlines and United Airlines found loose hardware or bolts in the assembly of door plugs on their aircraft, which remain grounded pending FAA-mandated inspections. Boeing CEO Dave Calhoun described the incident as a "horrible escape" of its manufacturing and quality control processes. | The Federal Aviation Administration says it is opening an investigation into Boeing’s quality control after the violent in-flight failure of a door plug on a nearly new 737 Max 9. |
6. Summary Analysis: Frequency Trends
Article Frequency Trends
Article volume corresponded with major events, beginning with a ruling in May 2023 that victims of the 2018/19 crashes could sue for pain and suffering. Volume spiked multiple times in 2024 beginning with the dramatic mid-air loss of a door plug on January 5, followed by updates of investigations by airlines and the FAA, multiple resignation announcements in February and March, the death of a whistleblower in March, and frequent media reports of other incidents and findings.
# ------------------------------
# Summary Trends: Articles
# ------------------------------
articles <- df_analyze %>%
group_by(pub, yr_month, wk_begin, dt) %>%
summarise(articles = n())
articles %>%
ggplot(aes(x = wk_begin, y = articles)) +
geom_col(fill = "seagreen") +
scale_x_date(date_breaks = "2 week", date_labels = "%b %d %Y") +
scale_y_continuous(breaks = pretty_breaks(n=10)) +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
labs(title = "Count of Articles by Week",
x = "Week Begin Date",
y = "Count of Articles")articles %>%
ggplot(aes(x = wk_begin, y = articles, fill = pub)) +
geom_col(show.legend = FALSE) +
facet_wrap(~pub, scales = "free_y") +
scale_x_date(date_breaks = "4 week", date_labels = "%b %d %Y") +
scale_y_continuous(breaks = breaks_width(width=2)) +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
labs(title = "Count of Articles by Week and Publication",
x = "Week Begin Date",
y = "Count of Articles")Word Frequency Trend: 2023 vs. 2024
To identify the most common terms in 2023 vs. 2024, the text data was tokenized into words as well as bigrams, creating new tidy dataframes where one row = one observation (word or bigram, depending on analysis). This tidy data was used to create word clouds.
While the besieged “737 Max” was a common term in both years, 2023 featured words related to the court case settled in May 2023 regarding the earlier crashes, as well as a number of bigrams describing the company, planes, or airlines themselves. 2024 saw increased mentions of the FAA, door plugs, whistleblowers, and other terms related to safety incidents and investigations. Bigrams provided more nuanced insights into common topics than individual words.
# --------------------------------------------------
# Summary Trends: Comparative Frequencies of Words and Bigrams
# --------------------------------------------------#
# tokenize words in text fields, remove stop words and NA, add AFINN scores (keeps only words present in the AFINN lexicon)
token <- df_analyze %>%
unnest_tokens(word, text) %>%
select(pub, yr, yr_month, wk_begin, dt, word) %>%
anti_join(stop_words) %>%
filter(!is.na(word)) %>%
inner_join(get_sentiments("afinn")) %>%
mutate(lexicon = "AFINN", n = 1)
head(token)
# A tibble: 6 × 9
pub yr yr_month wk_begin dt word value lexicon n
<chr> <chr> <chr> <date> <date> <chr> <dbl> <chr> <dbl>
1 CNN 2024 202401 2024-01-07 2024-01-11 violent -3 AFINN 1
2 CNN 2024 202401 2024-01-07 2024-01-11 failure -2 AFINN 1
3 CNN 2024 202401 2024-01-07 2024-01-11 failed -2 AFINN 1
4 CNN 2024 202401 2024-01-07 2024-01-11 ensure 1 AFINN 1
5 CNN 2024 202401 2024-01-07 2024-01-11 approved 2 AFINN 1
6 CNN 2024 202401 2024-01-07 2024-01-11 safe 1 AFINN 1
# Summarize by Year
set.seed(248)
word_cloud_2023 <- token %>%
filter(yr == 2023) %>%
group_by(word) %>%
summarise(frq = n())
word_cloud_2024 <- token %>%
filter(yr == 2024) %>%
group_by(word) %>%
summarise(frq = n())
cloud2023_token <- wordcloud(
words = word_cloud_2023$word,
freq = word_cloud_2023$frq,
min.freq = 5,
max.words=100,
scale=c(3.5, .75),
random.order=FALSE,
rot.per=0.5,
colors=brewer.pal(8, "Set1"),
title = "Word Frequency 2023") cloud2024_token <- wordcloud(
words = word_cloud_2024$word,
freq = word_cloud_2024$frq,
min.freq = 5,
max.words=50,
scale=c(3.5, .75),
random.order=FALSE,
rot.per=0.5,
colors=brewer.pal(8, "Set1"),
title = "Word Frequency 2024") # Create bigrams
bigrams <- df_analyze %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
filter(!is.na(bigram)) %>%
select(pub, yr, yr_month, wk_begin, dt, bigram) %>%
mutate(n = 1)
# separate words and remove bigrams with stop words
bigrams <- bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
# put bigrams back together
bigrams_clean <- bigrams %>%
unite(bigram, word1, word2, sep=" ")
head(bigrams_clean)
# A tibble: 6 × 7
pub yr yr_month wk_begin dt bigram n
<chr> <chr> <chr> <date> <date> <chr> <dbl>
1 CNN 2024 202401 2024-01-07 2024-01-11 formally investigating 1
2 CNN 2024 202401 2024-01-07 2024-01-11 investigating boeing 1
3 CNN 2024 202401 2024-01-07 2024-01-11 alaska airlines 1
4 CNN 2024 202401 2024-01-07 2024-01-11 airlines boeing 1
5 CNN 2024 202401 2024-01-07 2024-01-11 boeing 737 1
6 CNN 2024 202401 2024-01-07 2024-01-11 737 max 1
# Summarize by Year
bigrams_cloud_2023 <- bigrams_clean %>%
filter(yr == 2023) %>%
group_by(bigram) %>%
summarise(frq = sum(n))
bigrams_cloud_2024 <- bigrams_clean %>%
filter(yr == 2024) %>%
group_by(bigram) %>%
summarise(frq = sum(n))
cloud2023_bigram <- wordcloud(
words = bigrams_cloud_2023$bigram,
freq = bigrams_cloud_2023$frq,
scale=c(3, .5),
min.freq = 3,
max.words=50,
random.order=FALSE,
rot.per=0.5,
colors=brewer.pal(8, "Set1"),
title = "Bigram Frequency 2023") cloud2024_bigram <- wordcloud(
words = bigrams_cloud_2024$bigram,
freq = bigrams_cloud_2024$frq,
scale=c(3, .5),
min.freq = 3,
max.words=50,
random.order=FALSE,
rot.per=0.5,
colors=brewer.pal(8, "Set1"),
title = "Bigram Frequency 2024") 7. Sentiment Analysis Trends
AFINN Scoring
AFINN is an easy-to-use lexicon that assigns sentiment scores from -5 to +5 to individual words based on manual ratings. Because it scores only single words, it does not handle negation or context, but it is a good first step.
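A quick illustration of that limitation, using the AFINN lexicon already loaded via tidytext (the phrases are hypothetical): a negated phrase keeps the same positive score because only the single word "good" carries an AFINN value.
# Illustrative only: AFINN scores single words, so negation is not captured
tibble(id = 1:2,
       text = c("The flight was good",
                "The flight was not good")) %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%   # keeps only AFINN words
  group_by(id) %>%
  summarise(afinn_score = sum(value))                    # both phrases score the same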
#-----------------------------------------------------
# Sentiment Analysis: AFINN Scoring
#-------------=---------------------------------------
# Average of AFINN score (per token) by week
afinn_weekly <- token %>%
group_by(wk_begin) %>%
summarise(tokens = sum(n), total_AFINN = sum(value), avg_AFINN_token = total_AFINN/tokens) %>%
mutate(wk_begin = as.Date(wk_begin))
afinn_weekly
# A tibble: 41 × 4
wk_begin tokens total_AFINN avg_AFINN_token
<date> <dbl> <dbl> <dbl>
1 2023-05-07 13 18 1.38
2 2023-05-28 119 -82 -0.689
3 2023-06-04 7 -13 -1.86
4 2023-06-11 9 12 1.33
5 2023-06-18 26 -12 -0.462
6 2023-06-25 22 11 0.5
7 2023-07-23 32 12 0.375
8 2023-07-30 12 16 1.33
9 2023-08-13 20 -5 -0.25
10 2023-08-20 10 5 0.5
# ℹ 31 more rows
afinn_weekly %>%
ggplot(aes(x = wk_begin, y = avg_AFINN_token)) +
geom_col(fill = "deepskyblue3") +
scale_x_date(date_breaks = "2 week", date_labels = "%b %d %Y") +
scale_y_continuous(breaks=pretty_breaks(n=10)) +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
labs(title = "Average AFINN Score by Week",
x = "Week Begin Date",
y = "Mean AFINN Score Per Token")# Average of AFINN score (per token) by week and publication
afinn_weekly_pub <- token %>%
group_by(pub, wk_begin) %>%
summarise(tokens = n(), total_AFINN = sum(value), avg_AFINN_token = total_AFINN/tokens) %>%
mutate(wk_begin = as.Date(wk_begin))
afinn_weekly_pub
# A tibble: 155 × 5
# Groups: pub [9]
pub wk_begin tokens total_AFINN avg_AFINN_token
<chr> <date> <int> <dbl> <dbl>
1 CNN 2023-05-07 4 6 1.5
2 CNN 2023-05-28 38 -44 -1.16
3 CNN 2023-06-18 14 3 0.214
4 CNN 2023-06-25 21 9 0.429
5 CNN 2023-09-24 4 -2 -0.5
6 CNN 2023-10-22 14 -16 -1.14
7 CNN 2023-11-12 3 -5 -1.67
8 CNN 2023-12-24 8 -9 -1.12
9 CNN 2024-01-07 32 -20 -0.625
10 CNN 2024-01-14 30 -12 -0.4
# ℹ 145 more rows
afinn_weekly_pub %>%
ggplot(aes(x = wk_begin, y = avg_AFINN_token, fill = pub, label = avg_AFINN_token)) +
geom_bar(stat = "identity", show.legend = FALSE) +
facet_wrap(~pub, nrow=3, scales = "fixed") +
ylab("Mean AFINN Score Per Token") +
xlab("Week Begin Date") +
ggtitle("Average AFINN Score by Week and Publication") +
labs(fill = "Publication") +
scale_x_date(date_breaks = "4 week", date_labels = "%b %d %Y") +
theme(axis.text.x=element_text(size=8,angle=60, hjust=1)) +
scale_y_continuous(breaks=pretty_breaks(n=6),labels = label_number(accuracy = 0.1))
VADER Scoring
VADER is another sentiment analysis tool that is easy to use and interpret, but it is more robust than AFINN. Its lexicon is flexible across many domains, and rather than a trained ML model, VADER is a rule-based model: instead of assigning a score to each word in isolation, it contextualizes words and handles negation, amplifiers, diminishers, punctuation, and more, making its scores more nuanced.
VADER provides compound, positive, negative and neutral scores as well as a count of the word “but”. Below is a trend of the compound score only for simplicity:
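As a small illustration (hypothetical phrases), get_vader() from the vader package scores a single string and returns the word-level scores along with the compound, pos, neu, neg, and but_count fields; unlike AFINN, the negated phrase is pushed toward a negative compound score.
# Illustrative only: VADER handles negation at the phrase level
get_vader("The flight was good")
get_vader("The flight was not good")   # negation lowers the compound score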
#-----------------------------------------------------
# Sentiment Analysis: VADER Scoring
#-------------=---------------------------------------
# calculate sentiment
vader <- vader_df(df_analyze$text, incl_nt = T, neu_set = T, rm_qm = F)
# add to df_analyze
df_vader <- df_analyze %>%
inner_join(vader,join_by(x$text == y$text)) %>%
mutate(articles = 1)
head(df_vader,1)
# A tibble: 1 × 13
pub yr yr_month wk_begin dt text word_scores compound pos
<chr> <chr> <chr> <date> <date> <chr> <chr> <dbl> <dbl>
1 CNN 2024 202401 2024-01-07 2024-01-11 "The FA… {0, 0, 0, … -0.914 0.064
# ℹ 4 more variables: neu <dbl>, neg <dbl>, but_count <dbl>, articles <dbl>
str(df_vader)
tibble [372 × 13] (S3: tbl_df/tbl/data.frame)
$ pub : chr [1:372] "CNN" "NY Times" "Forbes" "Forbes" ...
$ yr : chr [1:372] "2024" "2024" "2024" "2024" ...
$ yr_month : chr [1:372] "202401" "202401" "202401" "202401" ...
$ wk_begin : Date[1:372], format: "2024-01-07" "2024-01-21" ...
$ dt : Date[1:372], format: "2024-01-11" "2024-01-24" ...
$ text : chr [1:372] "The FAA is formally investigating Boeing over Alaska Airlines Boeing 737 Max incident The Federal Aviation Admi"| __truncated__ "Boeing Faces Backlash From Airline Chiefs Boeing's customers are expressing frustration over the company's hand"| __truncated__ "Boeing Under Investigation For 737 Max 9 The Federal Aviation Administration (FAA) is investigating Boeing for "| __truncated__ "Besieged Boeing Declines To Share Financial Guidance Boeing has reported stronger than expected Q4 sales and a "| __truncated__ ...
$ word_scores: chr [1:372] "{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -2.9, 0, -2.3, 0, 0, 0, 0, 0, 0, 0,"| __truncated__ "{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1.05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.65, 0, 0, 0, 0, 0.9, 0, 0, 0, 0"| __truncated__ "{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -2.3, 0, 1.6, 0, 0, 0, 0, 0, 0, 1.9, 0, 0, 0, 0, 0, 0, 0, 0, 0"| __truncated__ "{0, 0, 0, 0, 0.6, 0, 0, 0, 0, 0, 0.8, 0, 0, 0, 0, 0, 0, 0, -0.65, 0, 0.6, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, "| __truncated__ ...
$ compound : num [1:372] -0.914 0.219 0.493 0.329 0.588 -0.844 0.913 -0.25 -0.542 0.751 ...
$ pos : num [1:372] 0.064 0.133 0.09 0.089 0.071 0.044 0.117 0.038 0.037 0.092 ...
$ neu : num [1:372] 0.819 0.732 0.842 0.855 0.87 0.869 0.86 0.919 0.903 0.847 ...
$ neg : num [1:372] 0.117 0.134 0.068 0.056 0.059 0.087 0.023 0.043 0.06 0.06 ...
$ but_count : num [1:372] 0 1 0 1 0 0 1 0 0 0 ...
$ articles : num [1:372] 1 1 1 1 1 1 1 1 1 1 ...
#------------- trend by week
# Average of VADER score (per ARTICLE) by week, mutate POSIXct to Date
vader_weekly <- df_vader %>%
group_by(wk_begin) %>%
summarise(articles = sum(articles),
total_compound = sum(compound), avg_vader = total_compound/articles) %>%
mutate(wk_begin = as.Date(wk_begin))
# plot
vader_weekly %>%
ggplot(aes(x = wk_begin, y = avg_vader)) +
geom_col(fill = "deepskyblue3") +
scale_x_date(date_breaks = "2 week", date_labels = "%b %d %Y") +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
scale_y_continuous(label = comma, breaks = pretty_breaks(n=10)) +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
ylab("Average VADER Score") +
xlab("Week Begin Date") +
ggtitle("Mean VADER Score Per Article") #-------- trend by week and publication
# Average of VADER score by week and pub, mutate POSIXct to Date
vader_weekly_pub <- df_vader %>%
group_by(pub, wk_begin) %>%
summarise(articles = sum(articles),
total_compound = sum(compound), avg_vader = total_compound/articles) %>%
mutate(wk_begin = as.Date(wk_begin))
str(vader_weekly_pub)
gropd_df [156 × 5] (S3: grouped_df/tbl_df/tbl/data.frame)
$ pub : chr [1:156] "CNN" "CNN" "CNN" "CNN" ...
$ wk_begin : Date[1:156], format: "2023-05-07" "2023-05-28" ...
$ articles : num [1:156] 1 4 1 1 1 2 1 1 3 6 ...
$ total_compound: num [1:156] 0.872 0.398 0.077 0.807 -0.631 ...
$ avg_vader : num [1:156] 0.872 0.0995 0.077 0.807 -0.631 ...
- attr(*, "groups")= tibble [9 × 2] (S3: tbl_df/tbl/data.frame)
..$ pub : chr [1:9] "CNN" "Forbes" "Fox News" "LA Times" ...
..$ .rows: list<int> [1:9]
.. ..$ : int [1:24] 1 2 3 4 5 6 7 8 9 10 ...
.. ..$ : int [1:25] 25 26 27 28 29 30 31 32 33 34 ...
.. ..$ : int [1:8] 50 51 52 53 54 55 56 57
.. ..$ : int [1:9] 58 59 60 61 62 63 64 65 66
.. ..$ : int [1:21] 67 68 69 70 71 72 73 74 75 76 ...
.. ..$ : int [1:17] 88 89 90 91 92 93 94 95 96 97 ...
.. ..$ : int [1:15] 105 106 107 108 109 110 111 112 113 114 ...
.. ..$ : int [1:17] 120 121 122 123 124 125 126 127 128 129 ...
.. ..$ : int [1:20] 137 138 139 140 141 142 143 144 145 146 ...
.. ..@ ptype: int(0)
..- attr(*, ".drop")= logi TRUE
# plot
vader_weekly_pub %>%
ggplot(aes(x = wk_begin, y = avg_vader, fill = pub, label = avg_vader)) +
geom_bar(stat = "identity", show.legend = FALSE) +
facet_wrap(~pub, nrow=2, scales = "fixed") +
scale_x_date(date_breaks = "4 week", date_labels = "%b %d %Y") +
theme(axis.text.x=element_text(size=7, angle=60, hjust=1)) +
scale_y_continuous(label = comma, breaks = pretty_breaks(n=10)) +
ylab("Average VADER Score") +
xlab("Week Begin Date") +
ggtitle("Mean VADER Score Per Article") +
labs(fill = "Publication") 8. Analysis: Correlation of Sentiment Scores to Boeing (BA) Stock Price
Finally, a weekly trend of Boeing (BA) stock prices was obtained and loaded to analyze the correlation between the sentiment scores of the news articles and the value of BA stock.
set.seed(508)
#-----------------------------------------------------
# Stock Price Trend
#-----------------------------------------------------
# load weekly close prices
raw_stock <- read_csv("https://raw.githubusercontent.com/AmandaSFox/DATA607/main/Project_Final/BA.csv")
head(raw_stock)
# A tibble: 6 × 7
Date Open High Low Close `Adj Close` Volume
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2023-05-01 206. 209. 193. 198. 198. 23089700
2 2023-05-08 199. 205. 196. 201. 201. 24420600
3 2023-05-15 200. 209. 199. 205. 205. 20737900
4 2023-05-22 206. 207. 197. 204. 204. 19206500
5 2023-05-29 204. 215. 201. 213. 213. 22047200
6 2023-06-05 213. 220 201. 217. 217. 37073500
# create combined final df of stocks, AFINN, and VADER
# join VADER and AFINN, add a "next Monday" column to link to stock prices dataframe
final_weekly <- vader_weekly %>%
left_join(afinn_weekly) %>%
mutate(nxt_monday = wk_begin + days(8))
# join in stock prices
final_weekly <- final_weekly %>%
left_join(raw_stock, join_by(x$nxt_monday == y$Date))
str(final_weekly)
tibble [41 × 14] (S3: tbl_df/tbl/data.frame)
$ wk_begin : Date[1:41], format: "2023-05-07" "2023-05-28" ...
$ articles : num [1:41] 3 13 1 1 2 2 4 1 4 2 ...
$ total_compound : num [1:41] 2.552 1.723 -0.92 0.989 0.537 ...
$ avg_vader : num [1:41] 0.851 0.133 -0.92 0.989 0.269 ...
$ tokens : num [1:41] 13 119 7 9 26 22 32 12 20 10 ...
$ total_AFINN : num [1:41] 18 -82 -13 12 -12 11 12 16 -5 5 ...
$ avg_AFINN_token: num [1:41] 1.385 -0.689 -1.857 1.333 -0.462 ...
$ nxt_monday : Date[1:41], format: "2023-05-15" "2023-06-05" ...
$ Open : num [1:41] 200 213 217 219 206 ...
$ High : num [1:41] 209 220 224 219 213 ...
$ Low : num [1:41] 199 201 215 203 205 ...
$ Close : num [1:41] 205 217 220 205 211 ...
$ Adj Close : num [1:41] 205 217 220 205 211 ...
$ Volume : num [1:41] 20737900 37073500 32376400 26974600 21114200 ...
# CORRELATIONS AND TRENDS
df_cor <- final_weekly %>%
subset(select = c(articles,avg_vader,avg_AFINN_token,Close))
colnames(df_cor) <- c("Articles Count","Mean Weekly VADER","Mean Weekly AFINN","BA Closing Price (Following Monday)")
cor_matrix <- as.data.frame(cor(df_cor, use = "pairwise.complete.obs"))
cor_matrix <- cor_matrix %>%
rownames_to_column(var = "Variable")
flex_cor <- flextable(cor_matrix) %>%
theme_box() %>%
align(align = "center", i=1, j = 1:5, part = "header")
flex_cor

| Variable | Articles Count | Mean Weekly VADER | Mean Weekly AFINN | BA Closing Price (Following Monday) |
|---|---|---|---|---|
| Articles Count | 1.0000000 | -0.2819906 | -0.3392632 | -0.3319448 |
| Mean Weekly VADER | -0.2819906 | 1.0000000 | 0.8756847 | 0.2017327 |
| Mean Weekly AFINN | -0.3392632 | 0.8756847 | 1.0000000 | 0.2131817 |
| BA Closing Price (Following Monday) | -0.3319448 | 0.2017327 | 0.2131817 | 1.0000000 |
# correlation matrix 2024
df_cor_2024 <- final_weekly %>%
filter(year(wk_begin)==2024) %>%
subset(select = c(articles,avg_vader,avg_AFINN_token,Close))
colnames(df_cor_2024) <- c("Articles Count","Mean Weekly VADER","Mean Weekly AFINN","BA Closing Price (Following Monday)")
cor_matrix_2024 <- as.data.frame(cor(df_cor_2024, use = "pairwise.complete.obs"))
cor_matrix_2024 <- cor_matrix_2024 %>%
rownames_to_column(var = "Variable")
flex_cor_2024 <- flextable(cor_matrix_2024) %>%
theme_box() %>%
align(align = "center", i=1, j = 1:5, part = "header")
flex_cor_2024

| Variable | Articles Count | Mean Weekly VADER | Mean Weekly AFINN | BA Closing Price (Following Monday) |
|---|---|---|---|---|
| Articles Count | 1.00000000 | 0.1547550 | 0.02089022 | 0.1249197 |
| Mean Weekly VADER | 0.15475500 | 1.0000000 | 0.91751755 | 0.3662835 |
| Mean Weekly AFINN | 0.02089022 | 0.9175175 | 1.00000000 | 0.1605992 |
| BA Closing Price (Following Monday) | 0.12491966 | 0.3662835 | 0.16059925 | 1.0000000 |
# plot of two lexicons (correlated)
final_weekly %>%
ggplot(aes(x=wk_begin))+
geom_line(aes(y=avg_AFINN_token, color="AFINN")) +
geom_line(aes(y=avg_vader, color="VADER")) +
theme(axis.text.x=element_text(size=8, angle=60, hjust=1)) +
scale_color_manual(
name = "Lexicon",
values = c("AFINN" = "red", "VADER" = "blue")) +
scale_x_date(date_breaks = "2 week", date_labels = "%b %d %Y") +
scale_y_continuous(label = comma, breaks = pretty_breaks(n=15)) +
ylab("Average Value") +
xlab("Week Begin Date") +
labs(title = "Sentiment Scores: Mean AFINN and VADER Scores by Week") #plot of stock price and lexicons (not correlated)
final_weekly %>%
ggplot(aes(x=wk_begin))+
geom_line(aes(y=avg_AFINN_token, color="AFINN")) +
geom_line(aes(y=avg_vader, color="VADER")) +
geom_line(aes(y=Close * .02, color="BA Price")) +
geom_vline(xintercept = as.Date("2023-12-31"), color = "darkred", show.legend = FALSE) +  # year boundary
scale_y_continuous(name = "Mean Weekly Sentiment Score",
label = comma,
breaks = pretty_breaks(n=15),
sec.axis = sec_axis(~.*50, name="Boeing (BA) Closing Stock Price",
label=comma,
breaks=pretty_breaks(n=10))) +
theme(axis.text.x=element_text(size=8, angle=60, hjust=1)) +
scale_color_manual(
  name = "Series",
  values = c("AFINN" = "red", "VADER" = "blue", "BA Price" = "black")) +
scale_x_date(date_breaks = "2 week", date_labels = "%b %d %Y") +
ylab("Average Value") +
xlab("Week Begin Date") +
labs(title = "Correlation of Sentiment Scores and Stock Price",
subtitle = "Mean AFINN and VADER Scores by Week vs. BA Closing Stock Price on Following Monday") 8. Conclusion
In conclusion, both the volume and the sentiment of news articles did trend with the increasing number of serious safety events and investigations in 2024.
The two major sentiment scoring tools were highly correlated with one another (0.88) over this corpus, despite the simplicity of AFINN’s word-based scoring and the greater sophistication of VADER.
The weekly average sentiment of news articles was not correlated with weekly stock prices in this analysis. However, it is worth noting that as of 5/10/2024, BA is down 29.1% YTD and down 49.7% over the past five years (Yahoo). It may be that BA’s stock is not sensitive to weekly fluctuations in news coverage, but it is suffering nonetheless from the company’s safety issues.
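With only about 41 weekly observations, correlations near 0.2 are unlikely to be statistically significant; a quick follow-up check one could run (assuming final_weekly from Section 8 is still in memory):
# Significance check on the weekly sentiment vs. stock price correlations
cor.test(final_weekly$avg_vader, final_weekly$Close)
cor.test(final_weekly$avg_AFINN_token, final_weekly$Close)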
10. Works Cited
Ad Fontes Media. “Individual News Source Ratings.” Ad Fontes Media, adfontesmedia.com/rankings-by-individual-news-source/. Accessed 5 May 2024.
Finn Ärup Nielsen (2011), “A new ANEW: Evaluation of a word list for sentiment analysis in microblogs”, Proceedings of the ESWC2011 Workshop on ‘Making Sense of Microposts’: Big things come in small packages (2011) 93-98.
Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
Perigon. “Perigon - Real-Time AI-Powered Contextual Intelligence Solutions.” Www.goperigon.com, www.goperigon.com/. Accessed 5 May 2024.
Yahoo! Finance. “The Boeing Company (BA) Stock Historical Prices & Data - Yahoo Finance.” Finance.yahoo.com, 2024, finance.yahoo.com/quote/BA/history. Accessed 5 May 2024.