We chose to conduct a sentiment analysis on articles containing the term “text mining”. We felt a more specific term would give us more differentiated and specific results, and we also wanted to learn more about the term and gain a new perspective on it.
To choose an appropriate timeline and set of newspapers to analyze, we used Google Ngrams. By searching for the term “text mining” in Google Ngrams, we were able to identify the prominence of the term across general media resources and its relative growth in frequency from 2000 to 2021. As shown in the image below, the curve for “Text Mining” changes concavity around 2010, indicating an increasing use of the term in popular culture.
Google NGrams Graphic for term “Text Mining”
Taking this into consideration, we restricted the timeline in the Nexis Uni database to the period from January 1st, 2010 to the present. We arrived at our five publications, which were the five newspapers with the highest number of articles within that period. They are the following:
The following graphics display the top 50 words mentioned in the articles of each publication. Many of the words are neutral terms, likely because people do not know enough about text mining to form any kind of emotional response to it. However, there are still sentiments toward the term that we can note from these word clouds. For the University Wire and the Chronicle of Higher Education, terms like “research” and “student” are prominent, in contrast to the New York Times, where words like “Shakespeare” were among the most common terms in the corpus. However, most of these newspapers share a rather high usage of “humanities”, which allows us to make inferences about how text mining is being applied in different fields. The words seen in CE Noticias Financieras also point to the humanities, as shown by words such as “humanities” and “culture”. From this, we find that text mining can serve valuable anthropological and humanities-oriented purposes outside of the technological world.
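The word counts behind a top-50 word cloud can be sketched as follows. This is a minimal Python illustration, not the code used for this report: the stop-word list, the function name `top_words`, and the sample sentences are all hypothetical placeholders.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real analysis would use a fuller one).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "by"}

def top_words(texts, n=50, stopwords=STOPWORDS):
    """Return the n most frequent non-stop-words across a list of article texts."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts.most_common(n)

# Hypothetical sample text standing in for one publication's articles.
sample_articles = [
    "Research by a student on text mining in the humanities.",
    "Humanities research is changing.",
]
top = top_words(sample_articles, n=2)
print(top)
```

Running the same function once per publication gives one ranked list per newspaper, which is the input a word cloud needs.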
The following histograms show the frequency distribution of AFINN word scores, covering positive, negative, and neutral connotations. Each of the five selected newspapers displays a broadly similar distribution, with low frequencies at the sentiment extremes and high frequencies at mildly positive and mildly negative scores. Some newspapers are more positive and some are more negative, but overall we see a similar pattern. This may be because the term “Text Mining” is technical and more specific than a term like “Data Science”, so there is little emotional response to it. There is little reason to be for or against such a term.
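The counting behind such a histogram can be sketched in a few lines. This is a minimal Python illustration under stated assumptions: `TOY_LEXICON` is a made-up stand-in for the AFINN lexicon (which assigns integer scores from -5 to +5), and its words and scores are illustrative rather than the real AFINN values.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon; NOT the real AFINN word list or scores.
TOY_LEXICON = {"good": 3, "useful": 2, "help": 2, "problem": -2, "bad": -3, "fraud": -4}

def score_distribution(text, lexicon):
    """Count how many scored words fall at each sentiment score.

    Words missing from the lexicon are dropped, as in an inner join on the word.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(lexicon[t] for t in tokens if t in lexicon)

sample = "Text mining is a useful tool, good for research, but bad data is a problem."
dist = score_distribution(sample, TOY_LEXICON)

# Text histogram over the full -5..+5 score range.
for score in range(-5, 6):
    print(f"{score:+d} {'#' * dist.get(score, 0)}")
```

Because only words present in the lexicon are counted, an article dominated by neutral vocabulary contributes few bars, which is consistent with the mild, centered distributions described above.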
As the rather low tf-idf values show, our sentiment analysis for the term “Text Mining” would benefit greatly from further steps, and the results give a clear guide to what those steps should be. The relatively low tf-idf values indicate that there may be a sourcing issue. Many of the sources we used cover a vast range of topics, so it is difficult to know whether text mining or data science is the actual subject of a given text. It would likely be better to use more focused publications, as this would lead to more accurate and appropriate results. However, because the database contained only a limited number of articles that include “text mining”, we had to use whatever we could get. This does not remove the caution with which we should approach these articles. At a glance, many of them mention “text mining” only once or a few times. Because of these passing mentions, our sentiment analysis covers a range of issues that is not specific enough to support firm conclusions. For example, an article from The Guardian titled “Beatles did not revolutionise music, study claims” mentions text mining once, as the method the researchers used. This article is likely to carry more negative sentiment toward the Beatles, which could push negative words and sentiments into our analysis of text mining.
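To make the tf-idf values concrete, here is a minimal Python sketch using one common definition: term frequency is a word's count divided by the document's total word count, and inverse document frequency is the natural log of the number of documents divided by the number of documents containing the word. The function name `tf_idf` and the two toy documents are hypothetical, and each newspaper is assumed to be treated as one document.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs maps a document name to its list of tokens.

    Returns {name: {term: tf-idf}} with tf = count / total tokens
    and idf = ln(number of docs / number of docs containing the term).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document

    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores

# Hypothetical toy corpus: two "newspapers" that share half of their vocabulary.
toy_docs = {
    "Paper A": ["text", "mining", "research", "student"],
    "Paper B": ["text", "mining", "shakespeare", "culture"],
}
scores = tf_idf(toy_docs)
print(scores["Paper A"])
```

Under this definition, a word that appears in every document has an idf of zero, so publications that share most of their vocabulary produce uniformly low scores.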
Performing this analysis appropriately requires that we reconsider what resources we allocate to these methods of sentiment analysis. We were forced to make many assumptions in this lab about the validity and appropriateness of these articles and newspapers, which may not be the most rigorous way to proceed. For example, before the analysis we assumed that the five newspapers with the greatest number of articles would be the best choice because they would be more reliable and representative. However, we have no way of knowing whether these newspapers accurately capture the different demographic groups in our society. We also do not know what biases or opinions some of the newspapers may hold. We can study the text and sentiments present using technology, but that will never give us the full picture, or even the bigger picture. In the future, we could specify who and what we are trying to learn about through the sentiment analysis. We could also choose our publications more carefully and even filter out certain articles. We should only be interested in articles that are primarily focused on text mining, although it would certainly be hard to draw that line.
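One possible way to draw that line is to keep only articles that mention the phrase several times. The Python sketch below is an illustration under stated assumptions, not a method used in this report: the `min_mentions` threshold of 3, the function names, and the `body` field are all hypothetical, and the right cutoff would need to be checked against the actual articles.

```python
import re

# Matches "text mining" and "text-mining", ignoring case.
PHRASE = re.compile(r"\btext[\s-]+mining\b", re.IGNORECASE)

def mention_count(text):
    """Number of times the phrase appears in one article."""
    return len(PHRASE.findall(text))

def focused_articles(articles, min_mentions=3):
    """Keep articles whose body mentions the phrase at least min_mentions times."""
    return [a for a in articles if mention_count(a["body"]) >= min_mentions]

# Hypothetical records standing in for downloaded articles.
sample = [
    {"title": "Passing mention", "body": "The study used text mining as its method."},
    {"title": "Focused piece", "body": "Text mining is growing. Text-mining tools vary. Why text mining matters."},
]
kept = focused_articles(sample)
print([a["title"] for a in kept])
```

A raw count favors long articles, so dividing the count by article length may be a fairer measure of focus.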