library(tidyverse)
library(dplyr)
library(tidytext)
library(lubridate)
library(flextable)
library(wordcloud)
library(vader)
library(RColorBrewer)
library(scales)
library(ggplot2)
data(stop_words)
DATA607_Final_Proj_Fox
1. Background
In recent years, Boeing has experienced a series of major negative events, beginning with two fatal crashes of its new 737 Max 8 jets in 2018 and 2019, the subsequent grounding of those aircraft until the end of 2020, and related legal battles that stretched into 2023. In January 2024, a new round of serious safety events and violations began when a door plug blew out and violently depressurized a Max 9 jet. The fallout from this and subsequent events and investigations continues to this day.
To better understand the impact of these events on media coverage and public sentiment, I obtained 407 mass media articles from a balanced set of nine major news sources covering twelve months (May 2023 to April 2024). I derived their mean sentiment scores using both a lexicon-based tool (AFINN) and a rule-based tool (VADER), and correlated those findings with Boeing stock prices.
2. Data Sources & Methodology
Articles Data:
News articles (n=407) about Boeing were selected from nine major publications by querying the Perigon API as follows:
- Keyword “Boeing” in headlines only
- Publication dates 5/1/2023 - 4/30/2024
- Selected the top 10 sources, then removed articles from bloomberg.com, which had a large number of short, low-content articles
- Included articles behind paywalls and reprints
- Excluded “roundups,” paid articles, opinion, non-news, and press releases
Note that the articles themselves were truncated, but they were accompanied by long “summary” fields that often appeared to be straight copies of shorter articles. To approximate the full text for analysis, I concatenated the truncated article with the summary and title. While this was not ideal, I did not find any APIs that provided full articles except for very short timeframes (e.g. seven days). This source had the benefit of allowing a search over many years as well as compiling a set of balanced sources representative of what the public would consume.
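For reference, the headline-restricted query described above could be reproduced roughly as follows. This is an illustrative sketch only, not the exact export used here: the endpoint path and the parameter names (title, from, to, apiKey) are assumptions that should be checked against the Perigon documentation, and an API key is required.
# Illustrative sketch of the Perigon query (endpoint and parameter names are assumptions)
library(httr)
library(jsonlite)
resp <- GET("https://api.goperigon.com/v1/all",
            query = list(title  = "Boeing",                        # keyword in headlines only
                         from   = "2023-05-01",
                         to     = "2024-04-30",
                         apiKey = Sys.getenv("PERIGON_API_KEY")))  # hypothetical env var
page1 <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))  # one page of article metadata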
Tidying: The articles data was tidy for article-level analysis, with one row equaling one observation (article). The tokenized dataframes created later were also tidy with one observation (word or bigram) per row, and the stock market data was tidy with one observation (week) per row.
Normalizing: For this assignment, a tidy dataset was to be normalized in MySQL, then denormalized and cleaned in R. Despite numerous attempts, the articles data could not be loaded into MySQL due to problematic characters throughout the long text fields. See Section 3 for more detail.
However, the dataframe was already normalized, with no repeating values requiring separate lookup tables for efficient management. The stock market dataframe was also already normalized.
Cleaning: All of the datasets were cleaned and transformed multiple times during the analysis, and all dataframes created were tidy (one row per observation, which could be an article, a token, a week or month, etc. depending on analysis.)
Sentiment Scores
Sentiment scores were added using AFINN (Nielsen) and VADER (Hutto & Gilbert).
Boeing (BA) Stock Prices
Historical stock prices were downloaded from Yahoo Finance. There were no stock splits during this time period, so prices were consistent. The closing price on the Monday following each week of articles was selected as the data point of interest.
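The CSV used later came from a manual Yahoo Finance download. As an alternative sketch (not used in this analysis, and assuming the quantmod package), the same daily series could be pulled programmatically and the Monday closes extracted:
# Sketch: pull BA daily prices from Yahoo Finance via quantmod (not run for this report)
library(quantmod)
ba_xts <- getSymbols("BA", src = "yahoo",
                     from = "2023-05-01", to = "2024-05-01",
                     auto.assign = FALSE)
mondays <- Cl(ba_xts)[weekdays(index(ba_xts)) == "Monday"]  # Monday closing prices (English locale)
head(mondays)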
Analysis
All of the above data was analyzed in multiple ways, including but not limited to comparative analysis of the publications themselves; trends of article frequency and sentiment; word clouds comparing common tokens in 2023 vs. 2024; and Pearson correlation matrix of weekly article volume, sentiment, and stock price.
3. Loading Libraries and Articles Dataset
First, I loaded the libraries:
Then I loaded the 407 articles in four text files due to file size limitations of the API. As discussed above, this data was already tidy and normalized, with one row per observation (tidy) and no repeating values (normalized).
However, for the exercise, I did export the data and attempt to import it into MySQL in multiple ways (comma-delimited, tab-delimited, alternate encodings, cleaning the file of special characters, etc.). While I came close, I was never able to parse the file 100% correctly and could not use it for accurate analysis.
In the future, I would use a different database if the data did indeed require normalization.
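For completeness, below is a minimal sketch of how the load could be retried, assuming a local MySQL instance and the DBI + RMariaDB packages, and using the combined df_all dataframe created in the next chunk. dbWriteTable() handles quoting and escaping itself, which sidesteps the delimiter and special-character problems of file-based imports; the connection details are placeholders.
# Sketch only (not run): write the articles dataframe straight to MySQL via DBI
library(DBI)
library(RMariaDB)
con <- dbConnect(RMariaDB::MariaDB(),
                 dbname   = "data607",              # placeholder database
                 host     = "localhost",
                 user     = "user",
                 password = Sys.getenv("MYSQL_PWD"))
dbWriteTable(con, "boeing_articles", df_all, overwrite = TRUE)
dbDisconnect(con)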
#---------- Read in csv files and create one df
df_raw1 <- read_csv("https://raw.githubusercontent.com/AmandaSFox/DATA607/main/Project_Final/export%20(1).csv")
df_raw2 <- read_csv("https://raw.githubusercontent.com/AmandaSFox/DATA607/main/Project_Final/export%20(2).csv")
df_raw3 <- read_csv("https://raw.githubusercontent.com/AmandaSFox/DATA607/main/Project_Final/export%20(3).csv")
df_raw4 <- read_csv("https://raw.githubusercontent.com/AmandaSFox/DATA607/main/Project_Final/export.csv")
df_all <- union_all(df_raw1, df_raw2)
df_all <- union_all(df_all, df_raw3)
df_all <- union_all(df_all, df_raw4)
#---------- Code to send to SQL for normalization. Includes sample troubleshooting code to find erroneous tab characters.
#---------- MySQL data loads were unsuccessful and abandoned.
#---------- Dataframe is already TIDY with one row per observation (article)
#---------- and already normalized: no repeating values requiring lookups for optimal management
# look for columns with tabs
test <- apply(df_all,1,function(row) any(grepl("\t",row)))
print(df_all[test,1])
# A tibble: 0 × 1
# ℹ 1 variable: url <chr>
df_export <- df_all %>%
mutate(content = str_replace_all(content, "\t",""),
summary = str_replace_all(summary, "\t",""),
description = str_replace_all(description, "\t",""),
title = str_replace_all(title, "\t",""))
df_export %>%
write.table(file = "export_all.txt",
sep = "\t",
            row.names = FALSE)
4. Summary Analysis: Publication Bias and Reliability Scores
To ensure that the data sample was reasonably representative of mass media coverage during this time period, the nine data sources were reviewed for bias and reliability.
Ad Fontes Media publishes bias and reliability scores for media outlets of all types. The nine publications selected for this analysis were evenly distributed in terms of bias: three sources are classified as “Middle/Centrist”, three “Skew Left”, and three “Skew Right”.
Further, six were considered reliable (reliability score 40+) and none were unreliable (score <24). Three sources had scores between these two thresholds, which Ad Fontes Media states may be due to a high proportion of opinion pieces or wide variation in the reliability of individual articles.
# Plot bias and reliability data (collected manually from Ad Fontes site)
sources <-
read_csv("https://raw.githubusercontent.com/AmandaSFox/DATA607/main/Project_Final/sources_info.csv")
sources <- sources %>%
mutate(skew2 = if_else(Bias<0,"Left","Right"))
sources %>%
ggplot(aes(x = reorder(Source, -Bias), y=Bias, fill = skew2)) +
geom_col(show.legend = FALSE) +
scale_fill_manual(values = c("deepskyblue3", "darkred")) +
coord_flip() +
geom_hline(yintercept = 5, color = "darkred") +   # reference lines: colors set directly rather than mapped in aes()
geom_hline(yintercept = 0) +
geom_hline(yintercept = -5, color = "green") +
scale_y_continuous(labels = label_number(accuracy = 0.01),
breaks = pretty_breaks(n=8)
) +
labs(title = "Publications: Left vs. Right Bias Scores",
x = "Publication",
y = "Ad Fontes Media Bias Score (Range = -42 to +42)"
)
sources %>%
ggplot(aes(x = reorder(Source, -Bias), y=Reliability)) +
geom_col(fill = "seagreen") +
coord_flip() +
geom_hline(aes(yintercept = 40), color = "green", show.legend = FALSE) +
geom_hline(aes(yintercept = 24), color = "red", show.legend = FALSE) +
scale_y_continuous(breaks = breaks_width(5,0),
limits=c(0, 64)) +
labs(title = "Publications: Reliability Score",
x = "Publication",
y = "Ad Fontes Media Reliability Score (Range = 0-64)")5. Data Cleaning and Transformation: Articles Data
To prepare for trending and analysis, the articles dataframe was summarized by year, month, and week; the publication name was extracted from each article URL and standardized; and the title, summary, and abbreviated “content” fields were concatenated to provide sufficient text for sentiment analysis.
#-----------------------------------------------
# CLEANING DATA
#-----------------------------------------------
# Summarize dates by year, month, week. Pull publication from URL. Remove Bloomberg.
df_all <- df_all %>%
mutate(dt = date(pubDate),
yr = format(pubDate, format = "%Y"),
yr_month = format(pubDate, format="%Y%m"),
wk_begin = floor_date(pubDate,unit = "week")) %>%
mutate(pub = sub("^https://www\\.(.*?)\\..*", "\\1", url)) %>%
mutate(pub = if_else(substring(pub,9,14)=="nypost","NY Post",pub)) %>%
filter(pub != "bloomberg")
# Capitalize publications
df_all <- df_all %>%
mutate(pub = if_else(pub =="cnn","CNN",
if_else(pub =="forbes","Forbes",
if_else(pub =="foxnews","Fox News",
if_else(pub =="latimes","LA Times",
if_else(pub =="nytimes","NY Times",
if_else(pub =="usatoday","USA Today",
if_else(pub =="washingtontimes","Wash Times",
if_else(pub =="wsj","Wall St Journal",
pub
)))))))))
# remove unneeded fields
df_clean <- df_all %>%
subset(select = c(pub, yr, yr_month, wk_begin, dt, title, summary, content))
head(df_clean)
# A tibble: 6 × 8
pub yr yr_month wk_begin dt title summary content
<chr> <chr> <chr> <dttm> <date> <chr> <chr> <chr>
1 CNN 2024 202401 2024-01-07 00:00:00 2024-01-11 The FA… "The F… "The F…
2 NY Times 2024 202401 2024-01-21 00:00:00 2024-01-24 Boeing… "Boein… "Boein…
3 Forbes 2024 202401 2024-01-07 00:00:00 2024-01-11 Boeing… "The F… "The F…
4 Forbes 2024 202401 2024-01-28 00:00:00 2024-01-31 Besieg… "Boein… "Boein…
5 CNN 2024 202401 2024-01-28 00:00:00 2024-01-31 Boeing… "Boein… "Boein…
6 CNN 2024 202402 2024-02-04 00:00:00 2024-02-05 New pr… "A new… "A new…
# Summary counts
df_clean %>%
group_by(pub) %>%
summarise(articles = n()) %>%
arrange(desc(articles))
# A tibble: 9 × 2
pub articles
<chr> <int>
1 Forbes 79
2 CNN 76
3 NY Post 72
4 Wash Times 34
5 NY Times 33
6 USA Today 30
7 Wall St Journal 26
8 Fox News 12
9 LA Times 10
# Add concatenated text field for analysis. Mutate POSIXct to Date.
df_analyze <- df_clean %>%
mutate(text = paste(title,summary,content,sep = ' ')) %>%
subset(select = c(pub, yr, yr_month, wk_begin, dt, text)) %>%
mutate(wk_begin = as.Date(wk_begin))
The resulting clean, tidy table for article-level analysis contains one observation per article (sample row below):
# Flextable to display an example with full text
flex_clean <- flextable(head(df_clean,1)) %>%
theme_box() %>%
fontsize(i = 1, j = 1:8, size = 8, part = "all") %>%
align(align = "center", i=1, j = 1:8, part = "header") %>%
valign(valign = "top", i=1, j = 1:8, part = "body")
flex_clean

| pub | yr | yr_month | wk_begin | dt | title | summary | content |
|---|---|---|---|---|---|---|---|
| CNN | 2024 | 202401 | 2024-01-07 00:00:00 | 2024-01-11 | The FAA is formally investigating Boeing over Alaska Airlines Boeing 737 Max incident | The Federal Aviation Administration (FAA) is investigating Boeing's quality control following a violent in-flight failure of a door plug on a nearly new 737 Max 9. The FAA stated that the incident "should have never happened and it cannot happen again." The investigation will focus on whether Boeing failed to ensure completed products conformed to its approved design and were in a condition for safe operation in compliance with FAA regulations. This comes after airlines Alaska Airlines and United Airlines found loose hardware or bolts in the assembly of door plugs on their aircraft, which remain grounded pending FAA-mandated inspections. Boeing CEO Dave Calhoun described the incident as a "horrible escape" of its manufacturing and quality control processes. | The Federal Aviation Administration says it is opening an investigation into Boeing’s quality control after the violent in-flight failure of a door plug on a nearly new 737 Max 9. |
6. Summary Analysis: Frequency Trends
Article Frequency Trends
Article volume corresponded with major events, beginning with a ruling in May 2023 that victims of the 2018/19 crashes could sue for pain and suffering. Volume spiked multiple times in 2024 beginning with the dramatic mid-air loss of a door plug on January 5, followed by updates of investigations by airlines and the FAA, multiple resignation announcements in February and March, the death of a whistleblower in March, and frequent media reports of other incidents and findings.
# ------------------------------
# Summary Trends: Articles
# ------------------------------
articles <- df_analyze %>%
group_by(pub, yr_month, wk_begin, dt) %>%
summarise(articles = n())
articles %>%
ggplot(aes(x = wk_begin, y = articles)) +
geom_col(fill = "seagreen") +
scale_x_date(date_breaks = "2 week", date_labels = "%b %d %Y") +
scale_y_continuous(breaks = pretty_breaks(n=10)) +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
labs(title = "Count of Articles by Week",
x = "Week Begin Date",
y = "Count of Articles")articles %>%
ggplot(aes(x = wk_begin, y = articles, fill = pub)) +
geom_col(show.legend = FALSE) +
facet_wrap(~pub, scales = "free_y") +
scale_x_date(date_breaks = "4 week", date_labels = "%b %d %Y") +
scale_y_continuous(breaks = breaks_width(width=2)) +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
labs(title = "Count of Articles by Week and Publication",
x = "Week Begin Date",
y = "Count of Articles")Word Frequency Trend: 2023 vs. 2024
To identify the most common terms in 2023 vs. 2024, the text data was tokenized into words as well as bigrams, creating new tidy dataframes where one row = one observation (word or bigram, depending on analysis). This tidy data was used to create word clouds.
While the besieged “737 Max” was a common term in both years, 2023 featured words related to the court case settled in May 2023 regarding the earlier crashes, as well as a number of bigrams describing the company, planes, or airlines themselves. 2024 saw increased mentions of the FAA, door plugs, whistleblowers, and other terms related to safety incidents and investigations. Bigrams provided more nuanced insights into common topics than individual words.
# --------------------------------------------------
# Summary Trends: Comparative Frequencies of Words and Bigrams
# --------------------------------------------------#
# tokenize words in text fields, remove stop words and NA, add AFINN scores (keeps only words present in the AFINN lexicon)
token <- df_analyze %>%
unnest_tokens(word, text) %>%
select(pub, yr, yr_month, wk_begin, dt, word) %>%
anti_join(stop_words) %>%
filter(!is.na(word)) %>%
inner_join(get_sentiments("afinn")) %>%
mutate(lexicon = "AFINN", n = 1)
head(token)
# A tibble: 6 × 9
pub yr yr_month wk_begin dt word value lexicon n
<chr> <chr> <chr> <date> <date> <chr> <dbl> <chr> <dbl>
1 CNN 2024 202401 2024-01-07 2024-01-11 violent -3 AFINN 1
2 CNN 2024 202401 2024-01-07 2024-01-11 failure -2 AFINN 1
3 CNN 2024 202401 2024-01-07 2024-01-11 failed -2 AFINN 1
4 CNN 2024 202401 2024-01-07 2024-01-11 ensure 1 AFINN 1
5 CNN 2024 202401 2024-01-07 2024-01-11 approved 2 AFINN 1
6 CNN 2024 202401 2024-01-07 2024-01-11 safe 1 AFINN 1
# Summarize by Year
set.seed(248)
word_cloud_2023 <- token %>%
filter(yr == 2023) %>%
group_by(word) %>%
summarise(frq = n())
word_cloud_2024 <- token %>%
filter(yr == 2024) %>%
group_by(word) %>%
summarise(frq = n())
cloud2023_token <- wordcloud(
words = word_cloud_2023$word,
freq = word_cloud_2023$frq,
min.freq = 5,
max.words=100,
scale=c(3.5, .75),
random.order=FALSE,
rot.per=0.5,
colors=brewer.pal(8, "Set1"),
title = "Word Frequency 2023") cloud2024_token <- wordcloud(
words = word_cloud_2024$word,
freq = word_cloud_2024$frq,
min.freq = 5,
max.words=50,
scale=c(3.5, .75),
random.order=FALSE,
rot.per=0.5,
colors=brewer.pal(8, "Set1"),
title = "Word Frequency 2024") # Create bigrams
bigrams <- df_analyze %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
filter(!is.na(bigram)) %>%
select(pub, yr, yr_month, wk_begin, dt, bigram) %>%
mutate(n = 1)
# separate words and remove bigrams with stop words
bigrams <- bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
# put bigrams back together
bigrams_clean <- bigrams %>%
unite(bigram, word1, word2, sep=" ")
head(bigrams_clean)
# A tibble: 6 × 7
pub yr yr_month wk_begin dt bigram n
<chr> <chr> <chr> <date> <date> <chr> <dbl>
1 CNN 2024 202401 2024-01-07 2024-01-11 formally investigating 1
2 CNN 2024 202401 2024-01-07 2024-01-11 investigating boeing 1
3 CNN 2024 202401 2024-01-07 2024-01-11 alaska airlines 1
4 CNN 2024 202401 2024-01-07 2024-01-11 airlines boeing 1
5 CNN 2024 202401 2024-01-07 2024-01-11 boeing 737 1
6 CNN 2024 202401 2024-01-07 2024-01-11 737 max 1
# Summarize by Year
bigrams_cloud_2023 <- bigrams_clean %>%
filter(yr == 2023) %>%
group_by(bigram) %>%
summarise(frq = sum(n))
bigrams_cloud_2024 <- bigrams_clean %>%
filter(yr == 2024) %>%
group_by(bigram) %>%
summarise(frq = sum(n))
cloud2023_bigram <- wordcloud(
words = bigrams_cloud_2023$bigram,
freq = bigrams_cloud_2023$frq,
scale=c(3, .5),
min.freq = 3,
max.words=50,
random.order=FALSE,
rot.per=0.5,
colors=brewer.pal(8, "Set1"),
title = "Bigram Frequency 2023") cloud2024_bigram <- wordcloud(
words = bigrams_cloud_2024$bigram,
freq = bigrams_cloud_2024$frq,
scale=c(3, .5),
min.freq = 3,
max.words=50,
random.order=FALSE,
rot.per=0.5,
colors=brewer.pal(8, "Set1"),
title = "Bigram Frequency 2024") 7. Sentiment Analysis Trends
AFINN Scoring
AFINN is an easy-to-use lexicon that assigns sentiment scores from -5 to +5 to individual words based on manual ratings. Because it scores only single words, it does not handle negation or context, but it is a good first step.
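A quick illustration of that limitation, using the AFINN lexicon already loaded via tidytext (the phrases are hypothetical): a negated phrase keeps the same positive score because only the single word "good" carries an AFINN value.
# Illustrative only: AFINN scores single words, so negation is not captured
tibble(id = 1:2,
       text = c("The flight was good",
                "The flight was not good")) %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%   # keeps only AFINN words
  group_by(id) %>%
  summarise(afinn_score = sum(value))                    # both phrases score the same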
#-----------------------------------------------------
# Sentiment Analysis: AFINN Scoring
#-------------=---------------------------------------
# Average of AFINN score (per token) by week
afinn_weekly <- token %>%
group_by(wk_begin) %>%
summarise(tokens = sum(n), total_AFINN = sum(value), avg_AFINN_token = total_AFINN/tokens) %>%
mutate(wk_begin = as.Date(wk_begin))
afinn_weekly
# A tibble: 41 × 4
wk_begin tokens total_AFINN avg_AFINN_token
<date> <dbl> <dbl> <dbl>
1 2023-05-07 13 18 1.38
2 2023-05-28 119 -82 -0.689
3 2023-06-04 7 -13 -1.86
4 2023-06-11 9 12 1.33
5 2023-06-18 26 -12 -0.462
6 2023-06-25 22 11 0.5
7 2023-07-23 32 12 0.375
8 2023-07-30 12 16 1.33
9 2023-08-13 20 -5 -0.25
10 2023-08-20 10 5 0.5
# ℹ 31 more rows
afinn_weekly %>%
ggplot(aes(x = wk_begin, y = avg_AFINN_token)) +
geom_col(fill = "deepskyblue3") +
scale_x_date(date_breaks = "2 week", date_labels = "%b %d %Y") +
scale_y_continuous(breaks=pretty_breaks(n=10)) +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
labs(title = "Average AFINN Score by Week",
x = "Week Begin Date",
y = "Mean AFINN Score Per Token")# Average of AFINN score (per token) by week and publication
afinn_weekly_pub <- token %>%
group_by(pub, wk_begin) %>%
summarise(tokens = n(), total_AFINN = sum(value), avg_AFINN_token = total_AFINN/tokens) %>%
mutate(wk_begin = as.Date(wk_begin))
afinn_weekly_pub
# A tibble: 155 × 5
# Groups: pub [9]
pub wk_begin tokens total_AFINN avg_AFINN_token
<chr> <date> <int> <dbl> <dbl>
1 CNN 2023-05-07 4 6 1.5
2 CNN 2023-05-28 38 -44 -1.16
3 CNN 2023-06-18 14 3 0.214
4 CNN 2023-06-25 21 9 0.429
5 CNN 2023-09-24 4 -2 -0.5
6 CNN 2023-10-22 14 -16 -1.14
7 CNN 2023-11-12 3 -5 -1.67
8 CNN 2023-12-24 8 -9 -1.12
9 CNN 2024-01-07 32 -20 -0.625
10 CNN 2024-01-14 30 -12 -0.4
# ℹ 145 more rows
afinn_weekly_pub %>%
ggplot(aes(x = wk_begin, y = avg_AFINN_token, fill = pub, label = avg_AFINN_token)) +
geom_bar(stat = "identity", show.legend = FALSE) +
facet_wrap(~pub, nrow=3, scales = "fixed") +
ylab("Mean AFINN Score Per Token") +
xlab("Week Begin Date") +
ggtitle("Average AFINN Score by Week and Publication") +
labs(fill = "Publication") +
scale_x_date(date_breaks = "4 week", date_labels = "%b %d %Y") +
theme(axis.text.x=element_text(size=8,angle=60, hjust=1)) +
scale_y_continuous(breaks=pretty_breaks(n=6),labels = label_number(accuracy = 0.1))
VADER Scoring
VADER is another sentiment analysis tool that is easy to use and interpret, but it is more robust than AFINN. Its lexicon is flexible across many domains, and rather than a trained ML model, VADER is a rule-based model: instead of assigning a score to each word in isolation, it contextualizes words and handles negation, amplifiers, diminishers, punctuation, and more, making its scores more nuanced.
VADER provides compound, positive, negative and neutral scores as well as a count of the word “but”. Below is a trend of the compound score only for simplicity:
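As a small illustration (hypothetical phrases), get_vader() from the vader package scores a single string and returns the word-level scores along with the compound, pos, neu, neg, and but_count fields; unlike AFINN, the negated phrase is pushed toward a negative compound score.
# Illustrative only: VADER handles negation at the phrase level
get_vader("The flight was good")
get_vader("The flight was not good")   # negation lowers the compound score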
#-----------------------------------------------------
# Sentiment Analysis: VADER Scoring
#-------------=---------------------------------------
# calculate sentiment
vader <- vader_df(df_analyze$text, incl_nt = T, neu_set = T, rm_qm = F)
# add to df_analyze
df_vader <- df_analyze %>%
inner_join(vader,join_by(x$text == y$text)) %>%
mutate(articles = 1)
head(df_vader,1)
# A tibble: 1 × 13
pub yr yr_month wk_begin dt text word_scores compound pos
<chr> <chr> <chr> <date> <date> <chr> <chr> <dbl> <dbl>
1 CNN 2024 202401 2024-01-07 2024-01-11 "The FA… {0, 0, 0, … -0.914 0.064
# ℹ 4 more variables: neu <dbl>, neg <dbl>, but_count <dbl>, articles <dbl>
str(df_vader)
tibble [372 × 13] (S3: tbl_df/tbl/data.frame)
$ pub : chr [1:372] "CNN" "NY Times" "Forbes" "Forbes" ...
$ yr : chr [1:372] "2024" "2024" "2024" "2024" ...
$ yr_month : chr [1:372] "202401" "202401" "202401" "202401" ...
$ wk_begin : Date[1:372], format: "2024-01-07" "2024-01-21" ...
$ dt : Date[1:372], format: "2024-01-11" "2024-01-24" ...
$ text : chr [1:372] "The FAA is formally investigating Boeing over Alaska Airlines Boeing 737 Max incident The Federal Aviation Admi"| __truncated__ "Boeing Faces Backlash From Airline Chiefs Boeing's customers are expressing frustration over the company's hand"| __truncated__ "Boeing Under Investigation For 737 Max 9 The Federal Aviation Administration (FAA) is investigating Boeing for "| __truncated__ "Besieged Boeing Declines To Share Financial Guidance Boeing has reported stronger than expected Q4 sales and a "| __truncated__ ...
$ word_scores: chr [1:372] "{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -2.9, 0, -2.3, 0, 0, 0, 0, 0, 0, 0,"| __truncated__ "{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1.05, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.65, 0, 0, 0, 0, 0.9, 0, 0, 0, 0"| __truncated__ "{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -2.3, 0, 1.6, 0, 0, 0, 0, 0, 0, 1.9, 0, 0, 0, 0, 0, 0, 0, 0, 0"| __truncated__ "{0, 0, 0, 0, 0.6, 0, 0, 0, 0, 0, 0.8, 0, 0, 0, 0, 0, 0, 0, -0.65, 0, 0.6, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, "| __truncated__ ...
$ compound : num [1:372] -0.914 0.219 0.493 0.329 0.588 -0.844 0.913 -0.25 -0.542 0.751 ...
$ pos : num [1:372] 0.064 0.133 0.09 0.089 0.071 0.044 0.117 0.038 0.037 0.092 ...
$ neu : num [1:372] 0.819 0.732 0.842 0.855 0.87 0.869 0.86 0.919 0.903 0.847 ...
$ neg : num [1:372] 0.117 0.134 0.068 0.056 0.059 0.087 0.023 0.043 0.06 0.06 ...
$ but_count : num [1:372] 0 1 0 1 0 0 1 0 0 0 ...
$ articles : num [1:372] 1 1 1 1 1 1 1 1 1 1 ...
#------------- trend by week
# Average of VADER score (per ARTICLE) by week, mutate POSIXct to Date
vader_weekly <- df_vader %>%
group_by(wk_begin) %>%
summarise(articles = sum(articles),
total_compound = sum(compound), avg_vader = total_compound/articles) %>%
mutate(wk_begin = as.Date(wk_begin))
# plot
vader_weekly %>%
ggplot(aes(x = wk_begin, y = avg_vader)) +
geom_col(fill = "deepskyblue3") +
scale_x_date(date_breaks = "2 week", date_labels = "%b %d %Y") +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
scale_y_continuous(label = comma, breaks = pretty_breaks(n=10)) +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
ylab("Average VADER Score") +
xlab("Week Begin Date") +
ggtitle("Mean VADER Score Per Article") #-------- trend by week and publication
# Average of VADER score by week and pub, mutate POSIXct to Date
vader_weekly_pub <- df_vader %>%
group_by(pub, wk_begin) %>%
summarise(articles = sum(articles),
total_compound = sum(compound), avg_vader = total_compound/articles) %>%
mutate(wk_begin = as.Date(wk_begin))
str(vader_weekly_pub)
gropd_df [156 × 5] (S3: grouped_df/tbl_df/tbl/data.frame)
$ pub : chr [1:156] "CNN" "CNN" "CNN" "CNN" ...
$ wk_begin : Date[1:156], format: "2023-05-07" "2023-05-28" ...
$ articles : num [1:156] 1 4 1 1 1 2 1 1 3 6 ...
$ total_compound: num [1:156] 0.872 0.398 0.077 0.807 -0.631 ...
$ avg_vader : num [1:156] 0.872 0.0995 0.077 0.807 -0.631 ...
- attr(*, "groups")= tibble [9 × 2] (S3: tbl_df/tbl/data.frame)
..$ pub : chr [1:9] "CNN" "Forbes" "Fox News" "LA Times" ...
..$ .rows: list<int> [1:9]
.. ..$ : int [1:24] 1 2 3 4 5 6 7 8 9 10 ...
.. ..$ : int [1:25] 25 26 27 28 29 30 31 32 33 34 ...
.. ..$ : int [1:8] 50 51 52 53 54 55 56 57
.. ..$ : int [1:9] 58 59 60 61 62 63 64 65 66
.. ..$ : int [1:21] 67 68 69 70 71 72 73 74 75 76 ...
.. ..$ : int [1:17] 88 89 90 91 92 93 94 95 96 97 ...
.. ..$ : int [1:15] 105 106 107 108 109 110 111 112 113 114 ...
.. ..$ : int [1:17] 120 121 122 123 124 125 126 127 128 129 ...
.. ..$ : int [1:20] 137 138 139 140 141 142 143 144 145 146 ...
.. ..@ ptype: int(0)
..- attr(*, ".drop")= logi TRUE
# plot
vader_weekly_pub %>%
ggplot(aes(x = wk_begin, y = avg_vader, fill = pub, label = avg_vader)) +
geom_bar(stat = "identity", show.legend = FALSE) +
facet_wrap(~pub, nrow=2, scales = "fixed") +
scale_x_date(date_breaks = "4 week", date_labels = "%b %d %Y") +
theme(axis.text.x=element_text(size=7, angle=60, hjust=1)) +
scale_y_continuous(label = comma, breaks = pretty_breaks(n=10)) +
ylab("Average VADER Score") +
xlab("Week Begin Date") +
ggtitle("Mean VADER Score Per Article") +
labs(fill = "Publication") 8. Analysis: Correlation of Sentiment Scores to Boeing (BA) Stock Price
Finally, a weekly trend of Boeing (BA) stock prices was obtained and loaded to analyze the correlation between the sentiment scores of the news articles and the value of BA stock.
set.seed(508)
#-----------------------------------------------------
# Stock Price Trend
#-----------------------------------------------------
# load weekly close prices
raw_stock <- read_csv("https://raw.githubusercontent.com/AmandaSFox/DATA607/main/Project_Final/BA.csv")
head(raw_stock)
# A tibble: 6 × 7
Date Open High Low Close `Adj Close` Volume
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2023-05-01 206. 209. 193. 198. 198. 23089700
2 2023-05-08 199. 205. 196. 201. 201. 24420600
3 2023-05-15 200. 209. 199. 205. 205. 20737900
4 2023-05-22 206. 207. 197. 204. 204. 19206500
5 2023-05-29 204. 215. 201. 213. 213. 22047200
6 2023-06-05 213. 220 201. 217. 217. 37073500
# create combined final df of stocks, AFINN, and VADER
# join VADER and AFINN, add a "next Monday" column to link to stock prices dataframe
final_weekly <- vader_weekly %>%
left_join(afinn_weekly) %>%
mutate(nxt_monday = wk_begin + days(8))
# join in stock prices
final_weekly <- final_weekly %>%
left_join(raw_stock, join_by(x$nxt_monday == y$Date))
str(final_weekly)
tibble [41 × 14] (S3: tbl_df/tbl/data.frame)
$ wk_begin : Date[1:41], format: "2023-05-07" "2023-05-28" ...
$ articles : num [1:41] 3 13 1 1 2 2 4 1 4 2 ...
$ total_compound : num [1:41] 2.552 1.723 -0.92 0.989 0.537 ...
$ avg_vader : num [1:41] 0.851 0.133 -0.92 0.989 0.269 ...
$ tokens : num [1:41] 13 119 7 9 26 22 32 12 20 10 ...
$ total_AFINN : num [1:41] 18 -82 -13 12 -12 11 12 16 -5 5 ...
$ avg_AFINN_token: num [1:41] 1.385 -0.689 -1.857 1.333 -0.462 ...
$ nxt_monday : Date[1:41], format: "2023-05-15" "2023-06-05" ...
$ Open : num [1:41] 200 213 217 219 206 ...
$ High : num [1:41] 209 220 224 219 213 ...
$ Low : num [1:41] 199 201 215 203 205 ...
$ Close : num [1:41] 205 217 220 205 211 ...
$ Adj Close : num [1:41] 205 217 220 205 211 ...
$ Volume : num [1:41] 20737900 37073500 32376400 26974600 21114200 ...
# CORRELATIONS AND TRENDS
df_cor <- final_weekly %>%
subset(select = c(articles,avg_vader,avg_AFINN_token,Close))
colnames(df_cor) <- c("Articles Count","Mean Weekly VADER","Mean Weekly AFINN","BA Closing Price (Following Monday)")
cor_matrix <- as.data.frame(cor(df_cor, use = "pairwise.complete.obs"))
cor_matrix <- cor_matrix %>%
rownames_to_column(var = "Variable")
flex_cor <- flextable(cor_matrix) %>%
theme_box() %>%
align(align = "center", i=1, j = 1:5, part = "header")
flex_cor

| Variable | Articles Count | Mean Weekly VADER | Mean Weekly AFINN | BA Closing Price (Following Monday) |
|---|---|---|---|---|
| Articles Count | 1.0000000 | -0.2819906 | -0.3392632 | -0.3319448 |
| Mean Weekly VADER | -0.2819906 | 1.0000000 | 0.8756847 | 0.2017327 |
| Mean Weekly AFINN | -0.3392632 | 0.8756847 | 1.0000000 | 0.2131817 |
| BA Closing Price (Following Monday) | -0.3319448 | 0.2017327 | 0.2131817 | 1.0000000 |
# correlation matrix 2024
df_cor_2024 <- final_weekly %>%
filter(year(wk_begin)==2024) %>%
subset(select = c(articles,avg_vader,avg_AFINN_token,Close))
colnames(df_cor_2024) <- c("Articles Count","Mean Weekly VADER","Mean Weekly AFINN","BA Closing Price (Following Monday)")
cor_matrix_2024 <- as.data.frame(cor(df_cor_2024, use = "pairwise.complete.obs"))
cor_matrix_2024 <- cor_matrix_2024 %>%
rownames_to_column(var = "Variable")
flex_cor_2024 <- flextable(cor_matrix_2024) %>%
theme_box() %>%
align(align = "center", i=1, j = 1:5, part = "header")
flex_cor_2024

| Variable | Articles Count | Mean Weekly VADER | Mean Weekly AFINN | BA Closing Price (Following Monday) |
|---|---|---|---|---|
| Articles Count | 1.00000000 | 0.1547550 | 0.02089022 | 0.1249197 |
| Mean Weekly VADER | 0.15475500 | 1.0000000 | 0.91751755 | 0.3662835 |
| Mean Weekly AFINN | 0.02089022 | 0.9175175 | 1.00000000 | 0.1605992 |
| BA Closing Price (Following Monday) | 0.12491966 | 0.3662835 | 0.16059925 | 1.0000000 |
# plot of two lexicons (correlated)
final_weekly %>%
ggplot(aes(x=wk_begin))+
geom_line(aes(y=avg_AFINN_token, color="AFINN")) +
geom_line(aes(y=avg_vader, color="VADER")) +
theme(axis.text.x=element_text(size=8, angle=60, hjust=1)) +
scale_color_manual(
name = "Lexicon",
values = c("AFINN" = "red", "VADER" = "blue")) +
scale_x_date(date_breaks = "2 week", date_labels = "%b %d %Y") +
scale_y_continuous(label = comma, breaks = pretty_breaks(n=15)) +
ylab("Average Value") +
xlab("Week Begin Date") +
labs(title = "Sentiment Scores: Mean AFINN and VADER Scores by Week") #plot of stock price and lexicons (not correlated)
final_weekly %>%
ggplot(aes(x=wk_begin))+
geom_line(aes(y=avg_AFINN_token, color="AFINN")) +
geom_line(aes(y=avg_vader, color="VADER")) +
geom_line(aes(y=Close * .02, color="BA Price")) +
geom_vline(xintercept = as.Date("2023-12-31"), color = "darkred", show.legend = FALSE) +  # year boundary
scale_y_continuous(name = "Mean Weekly Sentiment Score",
label = comma,
breaks = pretty_breaks(n=15),
sec.axis = sec_axis(~.*50, name="Boeing (BA) Closing Stock Price",
label=comma,
breaks=pretty_breaks(n=10))) +
theme(axis.text.x=element_text(size=8, angle=60, hjust=1)) +
scale_color_manual(
  name = "Series",
  values = c("AFINN" = "red", "VADER" = "blue", "BA Price" = "black")) +
scale_x_date(date_breaks = "2 week", date_labels = "%b %d %Y") +
ylab("Average Value") +
xlab("Week Begin Date") +
labs(title = "Correlation of Sentiment Scores and Stock Price",
subtitle = "Mean AFINN and VADER Scores by Week vs. BA Closing Stock Price on Following Monday") 8. Conclusion
In conclusion, both the volume and the sentiment of news articles did trend with the increasing number of serious safety events and investigations in 2024.
The two major sentiment scoring tools were highly correlated with one another (0.88) over this corpus, despite the simplicity of AFINN’s word-based scoring and the greater sophistication of VADER.
The weekly average sentiment of news articles was not correlated with weekly stock prices in this analysis. However, it is worth noting that as of 5/10/2024, BA is down 29.1% YTD and down 49.7% over the past five years (Yahoo). It may be that BA’s stock is not sensitive to weekly fluctuations in news coverage, but it is suffering nonetheless from the company’s safety issues.
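With only about 41 weekly observations, correlations near 0.2 are unlikely to be statistically significant; a quick follow-up check one could run (assuming final_weekly from Section 8 is still in memory):
# Significance check on the weekly sentiment vs. stock price correlations
cor.test(final_weekly$avg_vader, final_weekly$Close)
cor.test(final_weekly$avg_AFINN_token, final_weekly$Close)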
10. Works Cited
Ad Fontes Media. “Individual News Source Ratings.” Ad Fontes Media, adfontesmedia.com/rankings-by-individual-news-source/. Accessed 5 May 2024.
Finn Ärup Nielsen (2011), “A new ANEW: Evaluation of a word list for sentiment analysis in microblogs”, Proceedings of the ESWC2011 Workshop on ‘Making Sense of Microposts’: Big things come in small packages (2011) 93-98.
Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
Perigon. “Perigon - Real-Time AI-Powered Contextual Intelligence Solutions.” Www.goperigon.com, www.goperigon.com/. Accessed 5 May 2024.
Yahoo! Finance. “The Boeing Company (BA) Stock Historical Prices & Data - Yahoo Finance.” Finance.yahoo.com, 2024, finance.yahoo.com/quote/BA/history. Accessed 5 May 2024.