Introduction

We decided to perform sentiment analysis on the term “Text Mining”. We felt that a more specific term would yield results distinct from those for the broader term “Data Science”, and it was a term we were interested in learning more about.

Narrowing Down Source Material

Part of our approach to gathering an appropriate corpus to analyze came from the use of Google Ngrams, an online viewer that charts word frequencies across a large corpus of books printed since 1500. The tool generates its charts by dividing the number of a word’s yearly appearances by the total number of words in the corpus for that year. By searching the term “Text Mining” in Google Ngrams, we were able to gauge the prominence of the term in general print media and its relative growth in frequency after 2000. As shown in the image below, the curve for “Text Mining” changes concavity around 2010, indicating a positive “acceleration” of its use and a potential turning point for the phrase.

Google Ngrams graphic for the term “Text Mining”
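
The normalization behind the chart is straightforward to illustrate. The sketch below uses entirely made-up counts (we did not download the underlying Ngrams data) to show the frequency calculation and how a change in concavity around 2010 shows up in the second differences of the series.

```r
# Toy illustration of the normalization Google Ngrams performs: the phrase's
# yearly count divided by the total number of words in that year's corpus.
# All counts below are hypothetical placeholders, not real Ngrams data.
library(ggplot2)

ngram_demo <- data.frame(
  year         = 2000:2020,
  # increments shrink before 2010 and grow after, mimicking the concavity change
  phrase_count = c(100, 160, 210, 250, 280, 300, 315, 325, 332, 337,
                   340, 345, 355, 375, 405, 450, 510, 590, 690, 810,
                   950),
  total_words  = 5e8  # hypothetical corpus size per year
)
ngram_demo$frequency <- ngram_demo$phrase_count / ngram_demo$total_words

# The inflection point appears as a sign change in the second differences
second_diffs <- diff(ngram_demo$frequency, differences = 2)

ggplot(ngram_demo, aes(year, frequency)) +
  geom_line() +
  labs(title = "Relative frequency of \"text mining\" (illustrative data)",
       x = "Year", y = "Frequency")
```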

Newspaper List

After taking this into consideration, we filtered the timeline of newspaper articles on Nexis Uni to those dated from January 1st, 2010 to March 2021. We arrived at our five main sources: the five newspapers that contained the highest number of articles within that period. They are the following (in no particular order):

  1. The Guardian
  2. The New York Times
  3. CE NOTICIAS FINANCIERAS (ENGLISH)
  4. The Chronicle of Higher Education
  5. University Wire

Wordclouds

The following graphics display the 50 words most frequently mentioned in articles where “Text Mining” appears across these five corpora. Many of the clouds are dominated by neutral terms, and we think this is likely because people do not yet know enough about text mining for a clear, discernible emotional response to the term to have formed. Even so, there are emotional sentiments we can gather from these graphics. It is also clear that the differing audiences and purposes of each newspaper introduce their own biases and tones, which sets the word clouds apart from one another considerably. For the University Wire, terms like “research” and “student” are prominent, a far cry from “Shakespeare”, which was among the most common terms in The New York Times. Yet both of these newspapers show a relatively high use of “humanities”, which suggests that text mining technology is often applied to research in the humanities.

The Guardian

The New York Times

CE Noticias Financieras English

The Chronicle of Higher Education

University Wire
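
For reference, the sketch below shows one way these clouds can be built with tidytext and the wordcloud package. The `articles` data frame (with `newspaper` and `text` columns) is a hypothetical stand-in for however the Nexis Uni articles are actually stored; the pipeline shown for The Guardian applies equally to each of the five newspapers.

```r
# A sketch of how each 50-word cloud can be built, assuming a data frame named
# `articles` (a hypothetical name) with one row per article and columns
# `newspaper` and `text`.
library(dplyr)
library(tidytext)
library(wordcloud)

guardian_counts <- articles %>%
  filter(newspaper == "The Guardian") %>%
  unnest_tokens(word, text) %>%            # one word per row
  anti_join(stop_words, by = "word") %>%   # drop common stop words
  count(word, sort = TRUE)

# Plot the 50 most frequent words for this newspaper
with(guardian_counts, wordcloud(word, n, max.words = 50))
```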

AFINN Histograms

The following histograms show the distribution of word score frequencies, covering positive, negative, and neutral connotations, for each newspaper. Word scores range from -5 (most negative) to +5 (most positive). Each of the five selected newspapers displays a broadly similar distribution, with low frequencies at the sentiment extremes and high frequencies for mildly positive and mildly negative scores. This may be because the term “Text Mining” is technical and more specific than a term like “Data Science”, and is therefore less subject to emotional interpretation or generalization in the media.

The Guardian

The New York Times

CE Noticias Financieras English

The Chronicle of Higher Education

University Wire
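
The histograms above can be produced along the lines of the sketch below, which assumes the same hypothetical `articles` data frame and uses tidytext’s AFINN lexicon; it illustrates the approach rather than reproducing our exact code.

```r
# A sketch of the AFINN scoring step, again assuming the hypothetical
# `articles` data frame with `newspaper` and `text` columns.
library(dplyr)
library(tidytext)
library(ggplot2)

# AFINN lexicon: columns `word` and `value` (scores from -5 to +5); the first
# call may prompt to download the lexicon via the textdata package.
afinn <- get_sentiments("afinn")

afinn_scores <- articles %>%
  unnest_tokens(word, text) %>%
  inner_join(afinn, by = "word")   # keep only words with an AFINN score

ggplot(afinn_scores, aes(value)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(~ newspaper) +
  labs(x = "AFINN score", y = "Word count")
```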


TF-IDF of the Five Newspapers

Analysis & Next Steps

As shown by the rather low TF-IDF values, measuring the sentiment trend for the term “Text Mining” would benefit from further steps, and these results point to a clear recommendation for what those steps should be. The inverse document frequency measures how much information a word provides, that is, how rare or common it is across all documents. Given this, our somewhat low TF-IDF values suggest a sourcing issue, which could be remedied by focusing our inputs on media more concentrated on the topic, by decreasing the number of articles we use in aggregate, or both. At the same time, our group often ran into a shortage of content after removing stop words and filtering down to the substantive text, so making our selection process for inputs more specific may be the better route. After all, our selected articles were simply articles that contained the phrase “Text Mining”, rather than articles with the phrase in the title or in some broader context, which caused some of the articles to misrepresent the term: it was not the true subject of the piece and was only mentioned once. For example, one article from The Guardian titled “Beatles did not revolutionise music, study claims” mentions text mining only once, as the method the researchers used. That article likely carries a negative sentiment toward the Beatles, yet our sentiment analysis will pick that sentiment up and interpret it as a negative view of text mining.
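
As a point of reference, TF-IDF values like those shown above can be computed with tidytext’s bind_tf_idf(). The sketch below again assumes the hypothetical `articles` data frame and is meant to illustrate the calculation, not to reproduce our exact pipeline.

```r
# A sketch of the TF-IDF step, assuming the same hypothetical `articles`
# data frame; bind_tf_idf() treats each newspaper as one "document".
library(dplyr)
library(tidytext)

newspaper_tfidf <- articles %>%
  unnest_tokens(word, text) %>%
  count(newspaper, word, sort = TRUE) %>%
  bind_tf_idf(word, newspaper, n) %>%   # adds tf, idf, and tf_idf columns
  arrange(desc(tf_idf))

# idf = ln(number of documents / number of documents containing the word),
# so a word appearing in all five newspapers gets idf = 0 and tf_idf = 0.
head(newspaper_tfidf)
```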

Performing this analysis properly requires that we reconsider the question we are actually trying to answer and, in turn, what resources we allocate to these methods. Given the “poor data in, poor results out” principle, our choice of newspapers and texts is crucial to the validity and quality of these insights, and our choices of inputs rested on assumptions that could be sharpened further. For example, a crucial assumption we made during our lab was that the five newspapers with the most articles containing the term “text mining” would be the best sources. However, we do not know whether these five newspapers accurately capture all demographic segments of the American population, nor do we have any sense of the opinions of those who do not read newspapers. Using R, one can implement these tools and discern whether a particular term carries a systematic pattern of sentiment in writing, but that does not account for how readers absorb the information or the opinions they form (it tells us more about writing trends than about the true thoughts of people). In the future, we could specify exactly whom we are trying to learn about by analyzing different media types and searching by group, rather than by the rubber-stamp method of taking whichever corpora have the most content. Overall, though, our results indicate that the general sentiment toward “Text Mining” is positive.