We defined our guiding question for the lab as: how does sentiment toward data science vary by region (East Coast, West Coast, Midwest)? To answer it, we set out to run sentiment analysis with R text libraries on regional newspaper articles, published from 2018 to the present, that contained the keyword phrase “data science”.
Our research centered on three newspapers from each region, chosen with an eye to geographic variability and circulation spread.
The number of articles matching the keyword phrase “data science” varied widely, even within a region (e.g., The Atlanta Journal-Constitution returned 17 files while The New York Times returned 332); this imbalance limited the quality, quantity, and diversity of the data. Once the data was processed and joined to the AFINN, NRC, and bing sentiment lexicons, we could begin the data visualization process.
The bing lexicon classifies words in binary terms, as simply positive or simply negative, and the tables below count the matches for each newspaper. West Coast newspaper tables (LA Times, Spokesman, EurAsia, and combined datasets, respectively) show a slightly more negative view of the data science field:
##           negative positive
## LA Times       124      109
## Spokesman       39       27
## EurAsia        508      368
## Combined       575      413
Midwest newspaper tables (St. Louis Dispatch, Chicago Daily Herald, and combined datasets, respectively) show a pronounced negative skew overall, with the Chicago Daily Herald more balanced:
##                      negative positive
## St. Louis Dispatch        708      556
## Chicago Daily Herald      469      445
## Combined                  915      680
East Coast newspaper tables (the NYT, the AJC, the AP, and combined datasets, respectively) show a strong positive skew from the New York Times that only partly offsets the negative skews of the AJC and the AP, leaving the combined counts net negative:
##          negative positive
## NYT           207      360
## AJC           196      128
## AP            320      188
## Combined      531      450
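The analysis in this report was run with R text libraries. As a language-neutral illustration of how binary counts like those above can be produced, here is a minimal Python sketch. It assumes a tiny toy lexicon (`TOY_BING`, hypothetical entries rather than the real bing word list) and made-up sample sentences; the helper names `bing_counts` and `net_share` are also assumptions introduced for this example.

```python
from collections import Counter
import re

# Toy stand-in for the bing lexicon (hypothetical entries, not the real word list).
TOY_BING = {
    "innovative": "positive",
    "promising": "positive",
    "breakthrough": "positive",
    "risk": "negative",
    "bias": "negative",
    "breach": "negative",
}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def bing_counts(articles, lexicon=TOY_BING):
    """Count lexicon-matched tokens by polarity across a list of article strings."""
    counts = Counter({"negative": 0, "positive": 0})
    for article in articles:
        for token in tokenize(article):
            label = lexicon.get(token)
            if label is not None:  # words outside the lexicon are ignored
                counts[label] += 1
    return counts

def net_share(counts):
    """(positive - negative) / matched words; 0.0 when nothing matched."""
    total = counts["positive"] + counts["negative"]
    return (counts["positive"] - counts["negative"]) / total if total else 0.0

# Made-up sentences standing in for newspaper articles.
sample_articles = [
    "A promising and innovative program, despite the risk of bias.",
    "The data breach exposed a serious risk.",
]
counts = bing_counts(sample_articles)
print(dict(counts))
print(round(net_share(counts), 3))
```

Dividing the positive-minus-negative difference by the number of matched words gives a figure that can be compared across newspapers whose article counts differ widely, which matters here given the uneven sample sizes.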
The AFINN lexicon scores words from -5 (most negative) to 5 (most positive); the distribution of scores can be graphed to reveal specific trends. West Coast newspaper visualizations (LA Times, Spokesman, EurAsia, and combined datasets, respectively) support the bing sentiment analysis, highlighting a negative skew.
Midwest newspaper visualizations (St. Louis Dispatch, Chicago Daily Herald, and combined datasets, respectively) support the conclusion reached through the bing analysis: a balanced view from the Chicago Daily Herald set against an overall negative trend.
East Coast newspaper visualizations (the NYT, the AJC, the AP, and combined datasets, respectively) mirror the bing results as well, with a dramatic positive skew from the New York Times not quite balancing the negative trends from the AJC and the AP.
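To make the AFINN scoring concrete, the sketch below tallies how many matched words fall at each integer score and prints a rough text histogram. It is an illustrative Python stand-in for the R workflow, not the code behind the report's figures; `TOY_AFINN` holds hypothetical scores (not the real AFINN values), and the sample sentences and function names are assumptions.

```python
from collections import Counter
import re

# Toy stand-in for AFINN: integer scores in [-5, 5] (hypothetical values).
TOY_AFINN = {
    "breakthrough": 3,
    "promising": 2,
    "useful": 1,
    "bias": -1,
    "risk": -2,
    "breach": -3,
}

def afinn_distribution(articles, lexicon=TOY_AFINN):
    """Tally how many matched tokens fall at each integer score."""
    dist = Counter()
    for article in articles:
        for token in re.findall(r"[a-z']+", article.lower()):
            if token in lexicon:
                dist[lexicon[token]] += 1
    return dist

def mean_score(dist):
    """Average score over all matched tokens; 0.0 when nothing matched."""
    total = sum(dist.values())
    return sum(score * n for score, n in dist.items()) / total if total else 0.0

# Made-up sentences standing in for newspaper articles.
sample = [
    "A promising breakthrough, but the risk of bias remains.",
    "The breach was a risk.",
]
dist = afinn_distribution(sample)
for score in range(-5, 6):
    print(f"{score:+d} {'#' * dist[score]}")  # one '#' per matched word
print(mean_score(dist))
```

A histogram over the full -5 to 5 range shows the shape of the skew, while the mean collapses it to a single number that can be checked against the direction of the bing counts.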
Word cloud analysis can help contextualize the keywords; by examining topic prevalence, we can audit the quantitative outputs from AFINN and bing. West Coast newspapers (LA Times, EurAsia, Spokesman, combined) tend to analyze data science through an industry- and research-oriented lens, with keywords “ai”, “amazon”, and “future” at the forefront.
Midwest newspapers (Chicago Daily Herald, St. Louis Dispatch, combined) explore data science through an educational lens, highlighting academically oriented words (“college”, “students”, “learning”).
East Coast newspapers (NYT, AJC, AP, combined) politicize data science and examine its topical usage; New York newspapers touch on contemporary political debates (highlighted by election-heavy keywords), while Georgia examines the role of data science in a COVID-centric world (as reflected in public health and COVID politicization keywords).
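A word cloud is a visual rendering of term frequencies, so the ranking behind it can be sketched in a few lines. The Python example below is illustrative only: the stop-word list is a small hypothetical one, dropping the search terms "data" and "science" is an assumption (they appear in every article by construction), and the sample sentences are made up.

```python
from collections import Counter
import re

# Small hypothetical stop-word list; a real analysis would use a fuller one.
# "data" and "science" are dropped because every article matched on them.
STOP_WORDS = {
    "the", "a", "an", "and", "of", "in", "to", "is", "are", "for", "on",
    "data", "science",
}

def top_terms(articles, n=3, stop_words=STOP_WORDS):
    """Return the n most frequent non-stop-word terms as (term, count) pairs."""
    counts = Counter()
    for article in articles:
        tokens = re.findall(r"[a-z]+", article.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts.most_common(n)

# Made-up sentences standing in for newspaper articles.
sample = [
    "Students in the college program study data science.",
    "The college is hiring for data science; students are learning fast.",
]
print(top_terms(sample, n=2))
```

The resulting counts are what a word cloud maps to font size, so the same table can be inspected directly when the visual ranking is ambiguous.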
TF-IDF (term frequency-inverse document frequency) is a statistical measure of how relevant a word is to a document within a corpus. It is widely used for keyword extraction because it surfaces terms that are frequent in one document but rare across the rest of the collection. Applied to the regional corpus, it lets us isolate the keywords that are most distinctive of each document relative to the collection as a whole.
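Quick data dictionary: TF (term frequency) is how often a term appears in a document relative to that document's length; IDF (inverse document frequency) is low for terms that appear in many documents and high for terms that appear in few; TF-IDF is the product of the two.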
Exploratory, interactive data tables allow us to isolate statistically high- or low-ranking terms through a chosen lens (high TF, low IDF, high TF-IDF, etc.).
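The following Python sketch shows one common way to compute TF-IDF, with TF as a term's share of the document's tokens and IDF as the natural log of the number of documents divided by the number of documents containing the term. It is an illustration under those assumptions rather than the code used for the report's tables; the `regions` strings are invented, reusing a few keywords mentioned in the text.

```python
import math
import re
from collections import Counter

def tf_idf(documents):
    """Return {doc_name: {term: tf-idf}}, with tf = count / doc length
    and idf = ln(number of docs / number of docs containing the term)."""
    tokenized = {
        name: re.findall(r"[a-z]+", text.lower())
        for name, text in documents.items()
    }
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    scores = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in counts.items()
        }
    return scores

# Invented one-line "documents", one per region (hypothetical text).
regions = {
    "west": "ai amazon future ai data",
    "midwest": "college students learning data",
    "east": "election covid health data",
}
scores = tf_idf(regions)
top_west = max(scores["west"], key=scores["west"].get)
print(top_west, round(scores["west"][top_west], 3))
print(scores["west"]["data"])  # a term present in every document scores 0
```

Because a term that occurs in every document has an IDF of zero, the shared search terms drop out automatically, which is why TF-IDF highlights what distinguishes one region's coverage from another's.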
Newspaper sentiment toward data science keywords varies by region, and, when contextualized, the differences provide insight into the topicality of data science. The West Coast, heavily influenced by the presence of tech centers and rapidly advancing industry, strongly associates data science with system technologies (AI, Amazon, etc.); the Midwest, seeing an acceleration of economic and technological growth, views data science through an educational lens; the East Coast, home to crucial political centers (New York, DC, etc.), concentrates on the connection between data science and politicization, examining election cycle keywords, ethics discussions, and COVID impact analysis.
Further research could augment data quantity, quality, and diversity while expanding the data science keyword library (e.g., by including ethics- and systems-specific keywords).