Intro to Data Science - HW 9

Attribution statement:

# I did this homework by myself, with help from the book and the professor.

Text mining plays an important role in many industries because of the prevalence of text in interactions between customers and company representatives. Even when the customer interaction happens by speech rather than by chat or email, speech-to-text algorithms have become good enough that transcriptions of these spoken interactions are often available. To an increasing extent, a data scientist needs to be able to wield tools that turn a body of text into actionable insights. In this homework, we explore a real City of Syracuse dataset using the quanteda, quanteda.textplots, and quanteda.textstats packages. Make sure to install these packages before following the steps below.
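If any of these packages are missing, a one-time install along these lines should work:

install.packages("quanteda")           # run once; skip if already installed
install.packages("quanteda.textplots")
install.packages("quanteda.textstats")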

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(quanteda)
## Package version: 3.2.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 16 of 16 threads used.
## See https://quanteda.io for tutorials and examples.
library(quanteda.textplots)
library(quanteda.textstats)

Part 1: Load and visualize the data file

  A. Take a look at this article: https://samedelstein.medium.com/snowplow-naming-contest-data-2dcd38272caf and write a comment in your R script, briefly describing what it is about.
# The City of Syracuse ran a write-in campaign asking residents to name the 10 snowplows it had recently purchased. 1,948 unique submissions were entered into the campaign.
  B. Read the data from the following URL into a dataframe called df: https://intro-datascience.s3.us-east-2.amazonaws.com/snowplownames.csv
df <- read_csv("https://intro-datascience.s3.us-east-2.amazonaws.com/snowplownames.csv")
## Rows: 1907 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): submitter_name_anonymized, snowplow_name, meaning
## dbl (1): submission_number
## lgl (1): winning_name
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  C. Inspect the df dataframe – which column contains an explanation of the meaning of each submitted snowplow name? Transform that column into a document-feature matrix, using the corpus(), tokens(), tokens_select(), and dfm() functions. Do not forget to remove stop words.

Hint: Make sure you have loaded quanteda with library().

# Inspecting the df dataframe
glimpse(df) # the "meaning" column provides an explanation of each submitted snowplow name
## Rows: 1,907
## Columns: 5
## $ submission_number         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ submitter_name_anonymized <chr> "kjlt9cua", "KXKaabXN", "kjlt9cua", "Rv9sODq…
## $ snowplow_name             <chr> "rudolph", "salt life", "blizzard", "butter"…
## $ meaning                   <chr> "The red nose cuts through any storm.", "We …
## $ winning_name              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
# Transform "meaning" attribute into a document-feature matrix using corpus(), tokens(), tokens_select() and dfm() functions. Also removing stop words
df_corpus <- corpus(df$meaning, docnames = df$submission_number)
## Warning: NA is replaced by empty string
toks <- tokens(df_corpus, remove_punct = TRUE)
toks_nostop_words <- tokens_select(toks, pattern = stopwords("en"), selection = "remove")

df_dfm <- dfm(toks_nostop_words)
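As an aside, tokens_remove() is quanteda's shorthand for tokens_select(..., selection = "remove"), so the stop-word step above could equivalently be written as:

toks_nostop_words <- tokens_remove(toks, stopwords("en"))  # same result as tokens_select() above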
  D. Plot a word cloud, where a word is only represented if it appears at least 2 times. Hint: use textplot_wordcloud():

Hint: Make sure you have installed (if needed) and loaded quanteda.textplots.

textplot_wordcloud(df_dfm, min_count = 2)

  E. Next, increase the minimum count to 10. What happens to the word cloud? Explain in a comment.
textplot_wordcloud(df_dfm, min_count = 10)

# The word cloud shrinks when min_count is raised from 2 to 10: every word that appears between 2 and 9 times is dropped, so far fewer words remain in the cloud.
  F. What are the top words in the word cloud? Explain in a brief comment.
# The top words in the word cloud include "snow", "syracuse", "city", "plow", and "½" (a mis-encoded token). They can be identified as top words because textplot_wordcloud() scales each word's font size in proportion to how frequently it appears in the snowplow naming submissions.
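textplot_wordcloud() also accepts cosmetic arguments if the cloud needs tuning; a small sketch, with argument values chosen arbitrarily for illustration:

textplot_wordcloud(df_dfm, min_count = 10, max_words = 100,
                   color = c("steelblue", "firebrick"))  # cap the cloud at 100 words, two colors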

Part 2: Analyze the sentiment of the descriptions

  A. Create a named list of word counts by frequency.

Output the 10 most frequent words (the word and its count).

Hint: use textstat_frequency() from the quanteda.textstats package.

textstat_frequency(df_dfm, n = 10)
##     feature frequency rank docfreq group
## 1         ½       432    1     143   all
## 2         ï       336    2     147   all
## 3      snow       321    3     292   all
## 4  syracuse       174    4     164   all
## 5      name       143    5     137   all
## 6      plow       140    6     130   all
## 7      salt       104    7      83   all
## 8     plows       100    8      98   all
## 9  columbus       100    8      96   all
## 10     city        96   10      94   all
  B. Explain in a comment what you observed in the sorted list of word counts.
# The sorted list of word counts reveals some remaining noise: the top two entries, "½" and "ï", are encoding artifacts rather than real words. It also shows that the 6th-ranked "plow" (140 appearances) and the 8th-ranked "plows" (100 appearances) are essentially the same word and should probably be counted together.
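Both observations suggest a quick cleanup of the dfm before further analysis: dfm_remove() can drop the mis-encoded tokens, and dfm_wordstem() collapses "plow" and "plows" onto one stem. A sketch, assuming the two artifact tokens shown above are the only ones:

df_dfm_clean <- dfm_remove(df_dfm, pattern = c("½", "ï"))  # drop the encoding artifacts
df_dfm_clean <- dfm_wordstem(df_dfm_clean)                 # "plow", "plows" -> "plow"
textstat_frequency(df_dfm_clean, n = 10)                   # re-check the top words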

Part 3: Match the words with positive and negative words

  A. Read in the list of positive words, using the scan() function, and output the first 5 words in the list. Do the same for the negative words list:

    https://intro-datascience.s3.us-east-2.amazonaws.com/positive-words.txt
    https://intro-datascience.s3.us-east-2.amazonaws.com/negative-words.txt

There should be 2006 positive words and 4783 negative words, so you may need to clean up these lists a bit.

pos_url <- "https://intro-datascience.s3.us-east-2.amazonaws.com/positive-words.txt"
pos_words <- scan(pos_url, character(0), sep = "\n")
pos_words <- pos_words[-1:-34]  # drop the header lines (first 34 entries) at the top of the file

length(pos_words)
## [1] 2006
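The task also asks for the first five entries, which head() would display (output omitted here):

head(pos_words, 5)  # first 5 positive words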
  B. Use dfm_match() to match the words in the dfm with the words in pos_words. Note that dfm_match() creates a new dfm.

Then pass this new dfm to the textstat_frequency() function to see the positive words in our corpus, and how many times each word was mentioned.

pos_dfm <- dfm_match(df_dfm, pos_words)

pos_freq <- textstat_frequency(pos_dfm)
nrow(pos_freq)
## [1] 211

C. Sum all the positive words

sum(pos_freq$frequency)
## [1] 866

D. Do a similar analysis for the negative words - show the 10 most frequent negative words and then sum the negative words in the document.

neg_url <- "https://intro-datascience.s3.us-east-2.amazonaws.com/negative-words.txt"
neg_words <- scan(neg_url, character(0), sep = "\n")
neg_words <- neg_words[-1:-34]  # drop the header lines (first 34 entries) at the top of the file

length(neg_words)
## [1] 4783
neg_dfm <- dfm_match(df_dfm, neg_words)

neg_freq <- textstat_frequency(neg_dfm)
nrow(neg_freq)
## [1] 148
sum(neg_freq$frequency)
## [1] 255
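The prompt also asks to display the ten most frequent negative words; since textstat_frequency() returns features sorted by descending frequency, head() would show them (output omitted here):

head(neg_freq, 10)  # ten most frequent negative words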
  E. Write a comment describing what you found after matching positive and negative words. Which group is more common in this dataset? Might some of the negative words not actually be used in a negative way? What about the positive words?
# Positive words are much more common in this dataset: 866 positive matches versus 255 negative matches. Context matters for both lists, though. Some matched negative words likely describe the snow or weather itself rather than express negative sentiment, and some positive matches may be similarly incidental, so both counts should be read as approximations.
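As a natural follow-up, the two matched dfms can be combined into a net sentiment score per submission; rowSums() works on a dfm, so a minimal sketch would be:

net_sentiment <- rowSums(pos_dfm) - rowSums(neg_dfm)  # positive minus negative hits per document
summary(net_sentiment)                                # distribution of net sentiment scores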