DATA607_Final_Proj_Fox

Author

Amanda Fox

Published

May 8, 2024

1. Background

In recent years, Boeing has suffered a series of major negative events, beginning with two fatal crashes of its new 737 Max 8 jets in 2018 and 2019, the subsequent grounding of those aircraft until the end of 2020, and related legal battles that stretched into 2023. In January 2024, a new round of serious safety events and violations began when a door plug blew out and violently depressurized a 737 Max 9 jet. The fallout from this and subsequent events and investigations continues to this day.

To better understand the impact of these events on media coverage and public sentiment, I obtained 407 mass media articles from a balanced set of nine major news sources spanning twelve months (May 2023 to April 2024). I derived mean sentiment scores using two tools, the AFINN lexicon and the rule-based VADER model, and correlated those findings with Boeing stock prices.

2. Data Sources & Methodology

Articles Data:

News articles (n=407) about Boeing were selected from nine major publications by querying the Perigon API with the following filters (a sketch of an equivalent request appears after the list):

  • Keyword “Boeing” in headlines only
  • Publication dates 5/1/2023 - 4/30/2024
  • Selected the top 10 sources, then removed articles from bloomberg.com, which had a large number of short, low-content articles
  • Included articles behind paywalls and reprints
  • Excluded “roundups,” paid articles, opinion, non-news, and press releases
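
A minimal R sketch of an equivalent request is below. It assumes Perigon's /v1/all endpoint, the httr2 package, and an API key stored in the PERIGON_API_KEY environment variable; the parameter names are illustrative, not a verbatim record of the original pull.

library(httr2)

# Hypothetical reconstruction of the article pull; endpoint and parameter
# names are assumptions based on Perigon's public documentation
resp <- request("https://api.goperigon.com/v1/all") %>% 
  req_url_query(
    title = "Boeing",               # keyword in headlines only
    from = "2023-05-01",
    to = "2024-04-30",
    apiKey = Sys.getenv("PERIGON_API_KEY")
  ) %>% 
  req_perform()

articles <- resp_body_json(resp)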

Note that the article bodies themselves were truncated, but each was accompanied by a long “summary” field that often appeared to be a straight copy of a shorter article. To approximate the full text for analysis, I concatenated the truncated article with the summary and title. While this was not ideal, I did not find any API that provided full articles except over very short timeframes (e.g., seven days). This source had the benefit of allowing a search over many years as well as providing a balanced set of sources representative of what the public would consume.

Tidying: The articles data was tidy for article-level analysis, with one row equaling one observation (article). The tokenized dataframes created later were also tidy with one observation (word or bigram) per row, and the stock market data was tidy with one observation (week) per row.
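
As an illustration, the one-word-per-row form can be produced with tidytext's unnest_tokens; a minimal sketch, assuming the df_analyze frame and its concatenated text column built in Section 5 (df_tokens is an illustrative name):

# Tokenize article text into one word per row and drop common stop words
df_tokens <- df_analyze %>% 
  unnest_tokens(word, text) %>% 
  anti_join(stop_words, by = "word")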

Normalizing: For this assignment, a tidy dataset was to be normalized in MySQL, then denormalized and cleaned in R. Despite numerous attempts, the articles data could not be loaded into MySQL because of special characters scattered throughout the long text fields. See Section 3 for details.

However, the dataframe was already normalized, with no repeating values requiring separate lookup tables for efficient management. The stock market dataframe was also already normalized.

Cleaning: All of the datasets were cleaned and transformed multiple times during the analysis, and every dataframe created was tidy (one row per observation, which could be an article, a token, a week, or a month, depending on the analysis).

Sentiment Scores

Sentiment scores were added using the AFINN lexicon (Nielsen 2011) and VADER (Hutto & Gilbert 2014).
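
The scoring code itself falls in sections omitted from this excerpt, but the weekly tables used in Section 8 (vader_weekly and afinn_weekly) could be built roughly as follows. This is a sketch under assumptions: it uses the df_analyze frame from Section 5 and matches the column names referenced later.

# VADER: one compound score per article, then weekly counts, sums, and means
vader_weekly <- df_analyze %>% 
  mutate(compound = vader_df(text)$compound) %>% 
  group_by(wk_begin) %>% 
  summarise(articles = n(),
            total_compound = sum(compound),
            avg_vader = mean(compound))

# AFINN: one row per scored token, then weekly totals and per-token means
# (get_sentiments("afinn") fetches the lexicon via the textdata package)
afinn_weekly <- df_analyze %>% 
  unnest_tokens(word, text) %>% 
  inner_join(get_sentiments("afinn"), by = "word") %>% 
  group_by(wk_begin) %>% 
  summarise(tokens = n(),
            total_AFINN = sum(value),
            avg_AFINN_token = mean(value))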

Boeing (BA) Stock Prices

Historical stock prices were downloaded from Yahoo Finance. There were no stock splits during this time period, so prices were consistent. The closing price on the Monday following each week of articles was selected as the data point of interest.

Analysis

All of the above data was analyzed in multiple ways, including a comparative analysis of the publications themselves; trends in article frequency and sentiment; word clouds comparing common tokens in 2023 vs. 2024; and a Pearson correlation matrix of weekly article volume, sentiment, and stock price.
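
As an illustration of the word cloud step (its code is not shown in this excerpt), a minimal sketch using the hypothetical df_tokens table from the earlier example:

# Word cloud of the most common 2024 tokens (one cloud per year for comparison)
df_tokens %>% 
  filter(yr == "2024") %>% 
  count(word) %>% 
  with(wordcloud(word, n, max.words = 75, colors = brewer.pal(8, "Dark2")))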

3. Loading Libraries and Articles Dataset

First, I loaded the libraries:

library(tidyverse)     # core data manipulation and plotting
library(dplyr)         # attached by tidyverse; kept for clarity
library(tidytext)      # tokenization and tidy sentiment lexicons
library(lubridate)     # date handling
library(flextable)     # formatted display tables
library(wordcloud)     # word cloud plots
library(vader)         # VADER sentiment scoring
library(RColorBrewer)  # color palettes
library(scales)        # axis breaks and label formatting
library(ggplot2)       # attached by tidyverse; kept for clarity

data(stop_words)       # tidytext stop word list

Then I loaded the 407 articles, delivered in four CSV files due to file size limitations of the API. As discussed above, this data was already tidy and normalized, with one row per observation (tidy) and no repeating values (normalized).

However, for the exercise, I did export the data and attempted to import it into MySQL in multiple ways (comma delimited, tab delimited, different encodings, cleaning the file of special characters, etc.). While I came close, I was not able to parse the file 100% correctly and could not use it for accurate analysis.

In the future, I would use a different database, or load the data programmatically, if it did indeed require normalization.
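
Alternatively, a driver-based load avoids flat-file delimiter and escaping problems entirely, since values never pass through a text file. A minimal sketch, assuming the RMariaDB driver and placeholder credentials:

library(DBI)

# Hypothetical direct load: the driver escapes embedded tabs, commas, and
# quotes itself, sidestepping the parsing failures described above
con <- dbConnect(RMariaDB::MariaDB(),
                 dbname = "data607",                   # placeholder database
                 user = "user", password = "password") # placeholder credentials
dbWriteTable(con, "articles", df_all, overwrite = TRUE)
dbDisconnect(con)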

#---------- Read in csv files and create one df
df_raw1 <- read_csv("https://raw.githubusercontent.com/AmandaSFox/DATA607/main/Project_Final/export%20(1).csv")
df_raw2 <- read_csv("https://raw.githubusercontent.com/AmandaSFox/DATA607/main/Project_Final/export%20(2).csv")
df_raw3 <- read_csv("https://raw.githubusercontent.com/AmandaSFox/DATA607/main/Project_Final/export%20(3).csv")
df_raw4 <- read_csv("https://raw.githubusercontent.com/AmandaSFox/DATA607/main/Project_Final/export.csv")

df_all <- bind_rows(df_raw1, df_raw2, df_raw3, df_raw4)

#---------- Code to send to SQL for normalization. Includes sample troubleshooting code to find erroneous tab characters.
#---------- MySQL data loads were unsuccessful and abandoned.

#---------- Dataframe is already TIDY with one row per observation (article) 
#---------- and already normalized: no repeating values requiring lookups for optimal management

# look for rows containing tab characters (they would break a tab-delimited export)
test <- apply(df_all, 1, function(row) any(grepl("\t", row)))
print(df_all[test, 1])
# A tibble: 0 × 1
# ℹ 1 variable: url <chr>

# strip any tabs from the long text fields before exporting
df_export <- df_all %>% 
  mutate(content = str_replace_all(content, "\t", ""),
         summary = str_replace_all(summary, "\t", ""),
         description = str_replace_all(description, "\t", ""),
         title = str_replace_all(title, "\t", ""))

# write a tab-delimited file for the attempted MySQL import
df_export %>% 
  write.table(file = "export_all.txt",
              sep = "\t",
              row.names = FALSE)

4. Summary Analysis: Publication Bias and Reliability Scores

To ensure that the data sample was reasonably representative of mass media coverage during this time period, the nine data sources were reviewed for bias and reliability.

Ad Fontes Media publishes bias and reliability scores for media outlets of all types. The nine publications selected for this analysis were evenly distributed in terms of bias: three sources are classified as “Middle/Centrist,” three as “Skew Left,” and three as “Skew Right.”

Further, six were considered reliable (reliability score 40+) and none were unreliable (score <24). Three sources had scores between these two thresholds, which Ad Fontes Media states may be due to a high proportion of opinion pieces or wide variation in the reliability of individual articles.

# Plot bias and reliability data (collected manually from Ad Fontes site)

sources <- 
  read_csv("https://raw.githubusercontent.com/AmandaSFox/DATA607/main/Project_Final/sources_info.csv")

sources <- sources %>% 
  mutate(skew2 = if_else(Bias<0,"Left","Right"))

sources %>% 
  ggplot(aes(x = reorder(Source, -Bias), y=Bias, fill = skew2)) +
  geom_col(show.legend = FALSE) + 
  scale_fill_manual(values = c("deepskyblue3", "darkred")) +
  coord_flip() +
  geom_hline(yintercept = 5, color = "darkred") +   # color set outside aes() so the line is literally dark red
  geom_hline(yintercept = 0) + 
  geom_hline(yintercept = -5, color = "green") +
  scale_y_continuous(labels = label_number(accuracy = 0.01), 
                     breaks = pretty_breaks(n=8)
                     ) +
  labs(title = "Publications: Left vs. Right Bias Scores",
     x = "Publication",
     y = "Ad Fontes Media Bias Score (Range = -42 to +42)"
     )

sources %>% 
  ggplot(aes(x = reorder(Source, -Bias), y=Reliability)) +
  geom_col(fill = "seagreen") + 
  coord_flip() +
  geom_hline(aes(yintercept = 40), color = "green", show.legend = FALSE) + 
  geom_hline(aes(yintercept = 24), color = "red", show.legend = FALSE) + 
  scale_y_continuous(breaks = breaks_width(5,0),
                     limits=c(0, 64)) +
  labs(title = "Publications: Reliability Score",
     x = "Publication",
     y = "Ad Fontes Media Reliability Score (Range = 0-64)")

5. Data Cleaning and Transformation: Articles Data

To prepare for trending and analysis, the articles dataframe was summarized by year, month, and week; the publication name was extracted from the article URL and standardized; and the title, summary, and abbreviated “content” fields were concatenated to provide sufficient text for sentiment analysis.

#-----------------------------------------------
# CLEANING DATA
#-----------------------------------------------

# Summarize dates by year, month, week. Pull publication from URL. Remove Bloomberg.
df_all <- df_all %>% 
  mutate(dt = date(pubDate), 
         yr = format(pubDate, format = "%Y"),
         yr_month = format(pubDate, format="%Y%m"),
         wk_begin = floor_date(pubDate,unit = "week")) %>% 
  mutate(pub = sub("^https://www\\.(.*?)\\..*", "\\1", url)) %>% 
  mutate(pub = if_else(substring(pub,9,14)=="nypost","NY Post",pub)) %>% 
  filter(pub != "bloomberg")

# Standardize publication names
df_all <- df_all %>% 
  mutate(pub = case_when(
    pub == "cnn" ~ "CNN",
    pub == "forbes" ~ "Forbes",
    pub == "foxnews" ~ "Fox News",
    pub == "latimes" ~ "LA Times",
    pub == "nytimes" ~ "NY Times",
    pub == "usatoday" ~ "USA Today",
    pub == "washingtontimes" ~ "Wash Times",
    pub == "wsj" ~ "Wall St Journal",
    .default = pub
  ))

# remove unneeded fields 
df_clean <- df_all %>% 
  subset(select = c(pub, yr, yr_month, wk_begin, dt, title, summary, content))
head(df_clean)
# A tibble: 6 × 8
  pub      yr    yr_month wk_begin            dt         title   summary content
  <chr>    <chr> <chr>    <dttm>              <date>     <chr>   <chr>   <chr>  
1 CNN      2024  202401   2024-01-07 00:00:00 2024-01-11 The FA… "The F… "The F…
2 NY Times 2024  202401   2024-01-21 00:00:00 2024-01-24 Boeing… "Boein… "Boein…
3 Forbes   2024  202401   2024-01-07 00:00:00 2024-01-11 Boeing… "The F… "The F…
4 Forbes   2024  202401   2024-01-28 00:00:00 2024-01-31 Besieg… "Boein… "Boein…
5 CNN      2024  202401   2024-01-28 00:00:00 2024-01-31 Boeing… "Boein… "Boein…
6 CNN      2024  202402   2024-02-04 00:00:00 2024-02-05 New pr… "A new… "A new…
# Summary counts
df_clean %>% 
  group_by(pub) %>% 
  summarise(articles = n()) %>% 
  arrange (desc(articles))
# A tibble: 9 × 2
  pub             articles
  <chr>              <int>
1 Forbes                79
2 CNN                   76
3 NY Post               72
4 Wash Times            34
5 NY Times              33
6 USA Today             30
7 Wall St Journal       26
8 Fox News              12
9 LA Times              10
# Add concatenated text field for analysis; convert wk_begin from POSIXct to Date
df_analyze <- df_clean %>% 
  mutate(text = paste(title,summary,content,sep = ' ')) %>% 
  subset(select = c(pub, yr, yr_month, wk_begin, dt, text)) %>% 
  mutate(wk_begin = as.Date(wk_begin))

The resulting clean, tidy table for article-level analysis contains one observation per article (sample row below):

# Flextable to display an example with full text

flex_clean <- flextable(head(df_clean,1)) %>%  
  theme_box() %>% 
  fontsize(i = 1, j = 1:8, size = 8, part = "all") %>% 
  align(align = "center", i=1, j = 1:8, part = "header") %>% 
  valign(valign = "top", i=1, j = 1:8, part = "body")

flex_clean

pub:      CNN
yr:       2024
yr_month: 202401
wk_begin: 2024-01-07 00:00:00
dt:       2024-01-11
title:    The FAA is formally investigating Boeing over Alaska Airlines Boeing 737 Max incident
summary:  The Federal Aviation Administration (FAA) is investigating Boeing's quality control following a violent in-flight failure of a door plug on a nearly new 737 Max 9. The FAA stated that the incident "should have never happened and it cannot happen again." The investigation will focus on whether Boeing failed to ensure completed products conformed to its approved design and were in a condition for safe operation in compliance with FAA regulations. This comes after airlines Alaska Airlines and United Airlines found loose hardware or bolts in the assembly of door plugs on their aircraft, which remain grounded pending FAA-mandated inspections. Boeing CEO Dave Calhoun described the incident as a "horrible escape" of its manufacturing and quality control processes.
content:  The Federal Aviation Administration says it is opening an investigation into Boeing’s quality control after the violent in-flight failure of a door plug on a nearly new 737 Max 9. In a new statement, the FAA says last Friday’s dramatic in-flight blowout ... [1591 symbols]

8. Analysis: Correlation of Sentiment Scores to Boeing (BA) Stock Price

Finally, a weekly trend of Boeing (BA) stock prices was obtained and loaded to analyze the correlation between the sentiment scores of the 407 news articles and the value of BA stock.

set.seed(508)

#-----------------------------------------------------
# Stock Price Trend
#-----------------------------------------------------

# load weekly close prices (one row per week)

raw_stock <- read_csv("https://raw.githubusercontent.com/AmandaSFox/DATA607/main/Project_Final/BA.csv")

head(raw_stock)
# A tibble: 6 × 7
  Date        Open  High   Low Close `Adj Close`   Volume
  <date>     <dbl> <dbl> <dbl> <dbl>       <dbl>    <dbl>
1 2023-05-01  206.  209.  193.  198.        198. 23089700
2 2023-05-08  199.  205.  196.  201.        201. 24420600
3 2023-05-15  200.  209.  199.  205.        205. 20737900
4 2023-05-22  206.  207.  197.  204.        204. 19206500
5 2023-05-29  204.  215.  201.  213.        213. 22047200
6 2023-06-05  213.  220   201.  217.        217. 37073500
# create combined final df of stocks, AFINN, and VADER
# join VADER and AFINN, then add a "next Monday" column to link to the stock prices dataframe

final_weekly <- vader_weekly %>% 
  left_join(afinn_weekly, by = "wk_begin") %>% 
  mutate(nxt_monday = wk_begin + days(8))  # wk_begin is a Sunday, so +8 days lands on the Monday after the week ends

# join in stock prices
final_weekly <- final_weekly %>% 
  left_join(raw_stock, join_by(x$nxt_monday == y$Date))

str(final_weekly)
tibble [41 × 14] (S3: tbl_df/tbl/data.frame)
 $ wk_begin       : Date[1:41], format: "2023-05-07" "2023-05-28" ...
 $ articles       : num [1:41] 3 13 1 1 2 2 4 1 4 2 ...
 $ total_compound : num [1:41] 2.552 1.723 -0.92 0.989 0.537 ...
 $ avg_vader      : num [1:41] 0.851 0.133 -0.92 0.989 0.269 ...
 $ tokens         : num [1:41] 13 119 7 9 26 22 32 12 20 10 ...
 $ total_AFINN    : num [1:41] 18 -82 -13 12 -12 11 12 16 -5 5 ...
 $ avg_AFINN_token: num [1:41] 1.385 -0.689 -1.857 1.333 -0.462 ...
 $ nxt_monday     : Date[1:41], format: "2023-05-15" "2023-06-05" ...
 $ Open           : num [1:41] 200 213 217 219 206 ...
 $ High           : num [1:41] 209 220 224 219 213 ...
 $ Low            : num [1:41] 199 201 215 203 205 ...
 $ Close          : num [1:41] 205 217 220 205 211 ...
 $ Adj Close      : num [1:41] 205 217 220 205 211 ...
 $ Volume         : num [1:41] 20737900 37073500 32376400 26974600 21114200 ...
# CORRELATIONS AND TRENDS

df_cor <- final_weekly %>% 
  subset(select = c(articles,avg_vader,avg_AFINN_token,Close))

colnames(df_cor) <- c("Articles Count","Mean Weekly VADER","Mean Weekly AFINN","BA Closing Price (Following Monday)")

cor_matrix <- as.data.frame(cor(df_cor, use = "pairwise.complete.obs"))

cor_matrix <- cor_matrix %>% 
  rownames_to_column(var = "Variable")

flex_cor <- flextable(cor_matrix)  %>% 
  theme_box() %>% 
  align(align = "center", i=1, j = 1:5, part = "header")

flex_cor

Variable                               Articles Count   Mean Weekly VADER   Mean Weekly AFINN   BA Closing Price (Following Monday)
Articles Count                             1.0000000          -0.2819906          -0.3392632          -0.3319448
Mean Weekly VADER                         -0.2819906           1.0000000           0.8756847           0.2017327
Mean Weekly AFINN                         -0.3392632           0.8756847           1.0000000           0.2131817
BA Closing Price (Following Monday)       -0.3319448           0.2017327           0.2131817           1.0000000

# correlation matrix 2024

df_cor_2024 <- final_weekly %>% 
  filter(year(wk_begin)==2024) %>% 
  subset(select = c(articles,avg_vader,avg_AFINN_token,Close))

colnames(df_cor_2024) <- c("Articles Count","Mean Weekly VADER","Mean Weekly AFINN","BA Closing Price (Following Monday)")

cor_matrix_2024 <- as.data.frame(cor(df_cor_2024, use = "pairwise.complete.obs"))

cor_matrix_2024 <- cor_matrix_2024 %>% 
  rownames_to_column(var = "Variable")

flex_cor_2024 <- flextable(cor_matrix_2024)  %>% 
  theme_box() %>% 
  align(align = "center", i=1, j = 1:5, part = "header")

flex_cor_2024

Variable                               Articles Count   Mean Weekly VADER   Mean Weekly AFINN   BA Closing Price (Following Monday)
Articles Count                             1.00000000          0.1547550          0.02089022           0.1249197
Mean Weekly VADER                          0.15475500          1.0000000          0.91751755           0.3662835
Mean Weekly AFINN                          0.02089022          0.9175175          1.00000000           0.1605992
BA Closing Price (Following Monday)        0.12491966          0.3662835          0.16059925           1.0000000

# plot of two lexicons (correlated)
final_weekly %>% 
  ggplot(aes(x=wk_begin))+
    geom_line(aes(y=avg_AFINN_token, color="AFINN")) +
    geom_line(aes(y=avg_vader, color="VADER")) +
  theme(axis.text.x=element_text(size=8, angle=60, hjust=1)) +
  scale_color_manual(
    name = "Lexicon",
    values = c("AFINN" = "red", "VADER" = "blue")) +
  scale_x_date(date_breaks = "2 week", date_labels =  "%b %d %Y") +
  scale_y_continuous(labels = comma, breaks = pretty_breaks(n=15)) +
  ylab("Average Value") +
  xlab("Week Begin Date") +
  labs(title = "Sentiment Scores: Mean AFINN and VADER Scores by Week") 

#plot of stock price and lexicons (not correlated)

final_weekly %>% 
  ggplot(aes(x=wk_begin))+
    geom_line(aes(y=avg_AFINN_token, color="AFINN")) +
    geom_line(aes(y=avg_vader, color="VADER")) +
    geom_line(aes(y=Close * .02, color="BA Price")) +  # rescaled (1/50) to fit the sentiment axis
    geom_vline(xintercept = as.Date("2023-12-31"), color = "darkred") +  # year boundary marker
  scale_y_continuous(name = "Mean Weekly Sentiment Score",
                     labels = comma, 
                     breaks = pretty_breaks(n=15),
                     sec.axis = sec_axis(~.*50, name="Boeing (BA) Closing Stock Price", 
                                         labels=comma,
                                         breaks=pretty_breaks(n=10))) +
  theme(axis.text.x=element_text(size=8, angle=60, hjust=1)) +
  scale_color_manual(
    name = "Series",
    values = c("AFINN" = "red", "VADER" = "blue", "BA Price" = "black")) +  # value for "BA Price" added (required by the manual scale)
  scale_x_date(date_breaks = "2 week", date_labels =  "%b %d %Y") +
  xlab("Week Begin Date") +
  labs(title = "Correlation of Sentiment Scores and Stock Price",
       subtitle = "Mean AFINN and VADER Scores by Week vs. BA Closing Stock Price on Following Monday") 

9. Conclusion

In conclusion, both the volume and the sentiment of news articles tracked the increasing number of serious safety events and investigations in 2024: coverage grew more frequent and more negative.

The two sentiment scoring tools were highly correlated with one another (0.88) over this corpus, despite the simplicity of AFINN's word-level scoring relative to the greater sophistication of VADER.

The weekly average sentiment of news articles was not correlated with weekly stock prices in this analysis. However, it is worth noting that as of 5/10/2024, BA was down 29.1% YTD and down 49.7% over the past five years (Yahoo). BA's stock may not be sensitive to weekly fluctuations in news coverage, but it is suffering nonetheless from the company's safety issues.

10. Works Cited

Ad Fontes Media. “Individual News Source Ratings.” Ad Fontes Media, adfontesmedia.com/rankings-by-individual-news-source/. Accessed 5 May 2024.

Nielsen, Finn Ärup. “A New ANEW: Evaluation of a Word List for Sentiment Analysis in Microblogs.” Proceedings of the ESWC2011 Workshop on ‘Making Sense of Microposts’: Big Things Come in Small Packages, 2011, pp. 93-98.

Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

Perigon. “Perigon - Real-Time AI-Powered Contextual Intelligence Solutions.” www.goperigon.com, www.goperigon.com/. Accessed 5 May 2024.

Yahoo! Finance. “The Boeing Company (BA) Stock Historical Prices & Data - Yahoo Finance.” Finance.yahoo.com, 2024, finance.yahoo.com/quote/BA/history. Accessed 5 May 2024.