guiding question + lab set-up

We defined our guiding question for the lab as: how does sentiment toward data science vary by region (east coast, west coast, midwest)? To answer it, we set out to run sentiment analysis, using R text-mining libraries, on regional newspaper articles published from 2018 to the present that contained the keywords “data science”.

Our research centered on two to three newspapers from each region, selected with an eye to geographic variety and circulation range:

  1. west coast: The Los Angeles Times, The EurAsia, The Spokesman
  2. midwest: The Chicago Daily Herald, The St. Louis Post-Dispatch (Missouri)
  3. east coast: The New York Times, The Atlanta Journal-Constitution, The Associated Press

The number of articles matching the keywords “data science” varied widely within a region (e.g., The Atlanta Journal-Constitution returned 17 files while The New York Times returned 332); this imbalance limited the quality, quantity, and diversity of the data. Once the data was processed and joined to the AFINN, NRC, and bing sentiment lexicons, we could begin the data visualization process.
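
The join step can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the code used in the lab: the two example "articles" and the four-word `toy_lexicon` are made-up stand-ins for the newspaper files and for a binary lexicon such as bing, and the helper name `tokenize_words` is hypothetical.

```r
# Toy stand-ins (assumptions): two short "articles" and a four-word lexicon
articles <- data.frame(
  paper = c("paper_a", "paper_b"),
  text  = c("Data science is a promising and innovative field.",
            "Critics fear biased models and risky data science hype."),
  stringsAsFactors = FALSE
)
toy_lexicon <- data.frame(
  word      = c("promising", "innovative", "fear", "risky"),
  sentiment = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lowercase, replace non-letters with spaces, split on whitespace
tokenize_words <- function(text) {
  cleaned <- gsub("[^a-z' ]", " ", tolower(text))
  words <- strsplit(cleaned, "\\s+")[[1]]
  words[nzchar(words)]
}

# One row per (paper, word)
tokens <- do.call(rbind, lapply(seq_len(nrow(articles)), function(i) {
  data.frame(paper = articles$paper[i],
             word  = tokenize_words(articles$text[i]),
             stringsAsFactors = FALSE)
}))

# Inner join: keep only the words that appear in the lexicon
scored <- merge(tokens, toy_lexicon, by = "word")

# Counts of negative and positive words per paper
sentiment_counts <- table(scored$paper, scored$sentiment)
print(sentiment_counts)
```

Words missing from the lexicon (most of each article) drop out of the join, which is why the counts in a sentiment table are much smaller than the total word count of the articles.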

binary analysis (positive or negative)

The bing sentiment tables describe each newspaper in binary terms: a word counts as simply positive or simply negative. The west coast tables (LA Times, Spokesman, EurAsia, and combined datasets, respectively) show a slightly more negative view of the data science field.

## 
## negative positive 
##      124      109
## 
## negative positive 
##       39       27
## 
## negative positive 
##      508      368
## 
## negative positive 
##      575      413

The midwest tables (St. Louis Post-Dispatch, Chicago Daily Herald, and combined datasets, respectively) show a clear negative skew, with the Chicago Daily Herald the more balanced of the two.

## 
## negative positive 
##      708      556
## 
## negative positive 
##      469      445
## 
## negative positive 
##      915      680

The east coast tables (the NYT, the AJC, the AP, and combined datasets, respectively) show a strong positive skew from the New York Times that only partly offsets the negative skew of the AJC and the AP; the combined table remains net negative.

## 
## negative positive 
##      207      360
## 
## negative positive 
##      196      128
## 
## negative positive 
##      320      188
## 
## negative positive 
##      531      450
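
To compare regions on a common scale, the combined counts can be converted into a negative share and a net score. The sketch below uses only the combined counts from the tables above; the column names `negative_share` and `net` are our own labels for illustration, not output from the lab code.

```r
# Combined bing counts per region, copied from the tables above
bing_combined <- data.frame(
  region   = c("west coast", "midwest", "east coast"),
  negative = c(575, 915, 531),
  positive = c(413, 680, 450)
)

# Share of sentiment-bearing words that are negative
bing_combined$negative_share <-
  with(bing_combined, negative / (negative + positive))

# Net sentiment: positive word count minus negative word count
bing_combined$net <- with(bing_combined, positive - negative)

# Most negative region first
print(bing_combined[order(-bing_combined$negative_share), ])
```

With these counts the negative share comes to roughly 58% for the west coast, 57% for the midwest, and 54% for the east coast.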

range analysis

AFINN scores words from -5 (most negative) to 5 (most positive); graphing the distribution of scores can reveal more specific trends. The west coast visualizations (LA Times, Spokesman, EurAsia, and combined datasets, respectively) support the bing analysis, showing a negative skew.

The midwest visualizations (St. Louis Post-Dispatch, Chicago Daily Herald, and combined datasets, respectively) support the conclusion reached with bing: a balanced view from the Chicago Daily Herald set against an overall negative trend.

The east coast visualizations (the NYT, the AJC, the AP, and combined datasets, respectively) mirror the bing results as well: a dramatic positive skew from the New York Times does not quite balance the negative trends from the AJC and the AP.
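
The score distributions behind such graphs can be built as in the sketch below. It is an illustration under stated assumptions: `toy_afinn` is a made-up four-word lexicon in the AFINN style (integer scores between -5 and 5), and `words` is a hypothetical token vector, not data from the lab.

```r
# Toy AFINN-style lexicon (assumption): integer scores in [-5, 5]
toy_afinn <- c(breakthrough = 3, useful = 2, concern = -2, fraud = -4)

# Hypothetical tokens already extracted from one newspaper
words <- c("data", "breakthrough", "concern", "fraud", "useful", "concern")

# Look up each word; words missing from the lexicon come back as NA and are dropped
scores <- unname(toy_afinn[words])
scores <- scores[!is.na(scores)]

# Tally how often each score from -5 to 5 occurs (zero counts included)
score_counts <- table(factor(scores, levels = -5:5))

# Bar chart of the score distribution, plus the mean score as a one-number summary
barplot(score_counts, xlab = "AFINN score", ylab = "word count")
mean_score <- mean(scores)
print(mean_score)
```

A distribution with more mass left of zero, or a negative mean score, corresponds to the negative skew described above.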

data contextualization

Word cloud analysis can help contextualize the keywords; by examining which topics are most prevalent, we can sanity-check the quantitative outputs from AFINN and bing. West coast newspapers (LA Times, EurAsia, Spokesman, combined) tend to analyze data science through an industry- and research-oriented lens, with the keywords “ai”, “amazon”, and “future” at the forefront.

Midwest newspapers (Chicago Daily Herald, St. Louis Post-Dispatch, combined) explore data science through an educational lens, highlighting academic words (“college”, “students”, “learning”).

East coast newspapers (NYT, AJC, AP, combined) politicize data science and examine its topical uses; the New York Times touches on contemporary political debates (highlighted by election-heavy keywords), while the Atlanta Journal-Constitution examines the role of data science in a COVID-centric world (as shown by public health and COVID politicization keywords).
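
A word cloud is a picture of a term-frequency table: each word is sized by how often it appears once common filler words are removed. The sketch below shows that underlying count in base R; the token vector and the three-word stop list are made up for illustration.

```r
# Hypothetical tokens and a tiny stop-word list (both assumptions)
words <- c("the", "ai", "future", "of", "ai", "and", "amazon", "ai", "future")
stop_words <- c("the", "of", "and")

# Drop stop words, then count and sort the remaining terms
kept <- words[!words %in% stop_words]
freq <- sort(table(kept), decreasing = TRUE)

# A word cloud would size each term by this count; here we list the top terms
top_terms <- head(freq, 3)
print(top_terms)
```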

term frequency + relevance analysis

TF-IDF is a statistical measure of how relevant a word is to a document within a corpus; widely used for keyword extraction, it can identify the most distinctive terms in a given corpus. Given the regional corpus, we can isolate the keywords that are most significant both within their document and across the collection.

quick data dictionary:

  1. TF (term frequency) -> frequency of the word within its document
  2. IDF (inverse document frequency) -> how common (close to 0) or rare (larger values) a word is across the documents in its corpus
  3. TF-IDF -> the higher the score, the more relevant the word is within its document
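
The three quantities can be computed by hand as in the sketch below. It assumes one common definition (TF as a within-document proportion, IDF as the natural log of the number of documents divided by the number of documents containing the word); the lab code may have used a library with a different variant, and the three toy "documents" are made up for illustration.

```r
# Toy corpus (assumption): one already-tokenized "document" per region
docs <- list(
  west    = c("ai", "amazon", "data", "ai"),
  midwest = c("college", "students", "data"),
  east    = c("election", "covid", "data", "covid")
)

# Term counts per document in long format: one row per (doc, word)
counts <- do.call(rbind, lapply(names(docs), function(d) {
  tab <- table(docs[[d]])
  data.frame(doc = d, word = names(tab), n = as.vector(tab),
             stringsAsFactors = FALSE)
}))

# TF: count divided by the total number of words in the document
doc_totals <- tapply(counts$n, counts$doc, sum)
counts$tf <- counts$n / as.vector(doc_totals[counts$doc])

# IDF: natural log of (number of documents / documents containing the word)
docs_with_word <- table(counts$word)
counts$idf <- log(length(docs) / as.vector(docs_with_word[counts$word]))

# TF-IDF: product of the two
counts$tf_idf <- counts$tf * counts$idf

# Highest-scoring terms first
print(counts[order(-counts$tf_idf), ])
```

In this toy corpus “data” appears in every document, so its IDF and TF-IDF are 0, while a word found in only one of the three documents has an IDF of ln 3, about 1.10.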

Exploratory, interactive data tables allow us to isolate statistically high- or low-ranking terms through a chosen lens (high TF, low IDF, high TF-IDF, etc.).

conclusion

Newspaper sentiment toward data science keywords varies by region, and, when contextualized, that variation provides insight into the topicality of data science. The West Coast, heavily influenced by the presence of tech centers and rapidly advancing industry, shows a strong association between data science and system technologies (AI, Amazon, etc.); the Midwest, seeing an acceleration of economic and technological growth, views data science through an educational lens; the East Coast, home to crucial political centers (New York, DC, etc.), concentrates on the connection between data science and politicization, examining election cycle keywords, ethics discussions, and COVID impact analysis.

Further research could augment data quantity, quality, and diversity, while expanding the data science keyword library (e.g., by including ethics and systems-specific keywords).