Lab 8

Introduction

We decided to perform sentiment analysis on the term “Text Mining”. We expected a more specific term to yield results that differ from those for the broader term “Data Science”, and it was a term we wanted to learn more about.

Narrowing Down Source Material

Part of our approach to gathering an appropriate corpus came from Google Ngrams. Google Ngram is a search engine that charts word frequencies from a large corpus of books printed between 1500 and 2008. The tool generates its charts by dividing the number of times a word appears in a given year by the total number of words in the corpus for that year. By searching for “Text Mining” in Google Ngrams, we were able to gauge the prominence of the term across general media resources and its relative growth in frequency from 2000 to 2021. As shown by the image below, the curve for “Text Mining” changes concavity at 2010, indicating a positive “acceleration” in its use and a potential turning point for the phrase.
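
The normalization and the concavity check described above can be sketched in a few lines of Python. This is only an illustration, not the Ngram tool's own code: the function names are hypothetical and the yearly counts are made-up numbers, chosen so that the curve bends upward at 2010.

```python
def relative_frequency(term_counts, total_counts):
    """Share of the corpus taken up by the term in each year: count / total words."""
    return {year: term_counts[year] / total_counts[year] for year in sorted(term_counts)}


def concavity_changes(series):
    """Return the years at which the discrete second difference changes sign."""
    years = sorted(series)
    second_diff = {}
    for prev, cur, nxt in zip(years, years[1:], years[2:]):
        second_diff[cur] = series[nxt] - 2 * series[cur] + series[prev]
    inner = sorted(second_diff)
    return [b for a, b in zip(inner, inner[1:]) if second_diff[a] * second_diff[b] < 0]


# Made-up yearly counts (NOT real Ngram data), with a constant corpus size for simplicity.
term_counts = {2006: 10, 2007: 18, 2008: 24, 2009: 28, 2010: 30, 2011: 36, 2012: 46, 2013: 60}
total_counts = {year: 1_000_000 for year in term_counts}

freq = relative_frequency(term_counts, total_counts)
print(concavity_changes(freq))
```

With these toy numbers, growth slows until 2009 and speeds up from 2010 onward, which is the kind of turning point read off the chart.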

## 
## negative positive 
##      134      156

## 
##        anger anticipation      disgust         fear          joy     negative 
##           62          131           33           80           91          139 
##     positive      sadness     surprise        trust 
##          335           71           57          188

Histograms

The following histogram shows the distribution of word scores across the aggregate corpus, covering positive, negative, and neutral connotations. Word scores range from minus five (negative) to plus five (positive). The distribution appears roughly neutral, with low frequencies at the sentiment extremes and high frequencies for mildly positive and mildly negative scores. This may be because the term “Text Mining” is technical and more specific than a term like “Data Science” and is thus less subject to emotional interpretation or generalization in the media.
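
As a rough illustration of how such a score distribution is built, the Python sketch below looks up each word in a scored lexicon and tallies how often each score occurs. The lexicon here is a toy one with invented values on the same minus five to plus five scale; it is not the lexicon used in this lab, and `score_histogram` is a hypothetical helper name.

```python
from collections import Counter

# Toy lexicon with invented scores (NOT taken from any published lexicon).
TOY_LEXICON = {
    "breakthrough": 3,
    "useful": 2,
    "improve": 2,
    "risk": -2,
    "fail": -2,
    "disaster": -4,
}


def score_histogram(tokens, lexicon):
    """Count how many scored words fall on each score value; unscored words are skipped."""
    scores = [lexicon[token] for token in tokens if token in lexicon]
    return Counter(scores)


sentence = "text mining can improve research but models fail and carry risk yet remain useful"
histogram = score_histogram(sentence.lower().split(), TOY_LEXICON)

# Print the counts from the most negative score to the most positive one.
for score in sorted(histogram):
    print(score, histogram[score])
```

Words that are missing from the lexicon contribute nothing, which is one reason a technical vocabulary can leave the histogram concentrated around the mild scores.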

Wordclouds

The following graphics display the 50 most frequent words in the articles in this corpus that mention “Text Mining”. Many of these words are neutral and non-normative, such as “research”. We think this is likely because people in American universities, outside the deep research academic communities, do not yet know enough about Text Mining to have a clear and discernible emotional response to the term. However, there are still emotional sentiments about the term that we can gather from these graphics. The NRC word cloud below shows that people are mostly in “positive” spirits about, or in anticipation of, the technology.

Conclusions

As the rather low TF-IDF values show, measuring the sentiment trend for the term “Text Mining” would benefit from further steps, and these results point to what those steps should be. The inverse document frequency measures how much information a word provides: it is high for words that appear in few documents and low for words that are common across all documents. Given this, our somewhat low TF-IDF values suggest a sourcing issue, which could be remedied by focusing our inputs on media that concentrate on the topic, by reducing the total number of articles we use, or both. At the same time, we often ran short of content after removing stop words and filtering down to the substantive text, so making our selection process for inputs more specific would probably be the better option. After all, we selected articles simply because they contained the phrase “Text Mining”, not because the phrase appeared in the title or was central to the piece. As a result, some articles represent the term poorly: it was not the true subject of the piece and was mentioned only once.
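
To make the TF-IDF reasoning concrete, here is a minimal Python sketch using one common definition: term frequency is a word's count divided by the document's length, and inverse document frequency is the natural log of the number of documents divided by the number of documents containing the word. The two tiny documents are invented for illustration, and the lab's own computation may differ in its details.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute tf-idf per document, with tf = count / doc length and idf = ln(N / doc frequency)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each word once per document

    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


# Invented example documents (NOT from the lab corpus).
docs = {
    "a": ["text", "mining", "research"],
    "b": ["text", "mining", "survey", "survey"],
}
scores = tf_idf(docs)
print(scores["a"]["text"])    # appears in every document, so its idf is ln(1) = 0
print(scores["b"]["survey"])  # appears in only one document, so it scores higher
```

A word that occurs in every document always scores zero under this definition, so a corpus of articles that all use the same general vocabulary will tend to produce low values overall.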

Performing this analysis properly requires that we reconsider the question we are actually trying to answer and, in turn, what resources we allocate to these methods. Given the “poor data in, poor results out” principle, our choice of articles and text is crucial to the validity and quality of these insights, and those choices rested on some assumptions that could be sharpened further. In the future, we could specify exactly whom we are trying to learn about by analyzing different media types within the corpus and searching by group. Overall, though, our results indicate that the general sentiment toward “Text Mining” is positive.


bill cull

3/29/2021
