# Enter your name here: Joshua Gaze
# 1. I did this homework by myself, with help from the book and the professor.
Text mining plays an important role in many industries because of the prevalence of text in the interactions between customers and company representatives. Even when the customer interaction is by speech, rather than by chat or email, speech-to-text algorithms have gotten so good that transcriptions of these spoken interactions are often available. To an increasing extent, a data scientist needs to be able to wield tools that turn a body of text into actionable insights. In this homework, we explore a real City of Syracuse dataset using the quanteda and quanteda.textplots packages. Make sure to install the quanteda and quanteda.textplots packages before following the steps below:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.5 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(quanteda)
## Package version: 3.2.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 16 of 16 threads used.
## See https://quanteda.io for tutorials and examples.
library(quanteda.textplots)
library(quanteda.textstats)
# The City of Syracuse ran a write-in campaign asking citizens to name the 10 snowplows it had recently purchased. 1,948 unique submissions were entered into the campaign (the CSV loaded here contains 1,907 rows).
df <- read_csv("https://intro-datascience.s3.us-east-2.amazonaws.com/snowplownames.csv")
## Rows: 1907 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): submitter_name_anonymized, snowplow_name, meaning
## dbl (1): submission_number
## lgl (1): winning_name
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Hint: Make sure you have loaded quanteda with library()
# Inspecting the df dataframe
glimpse(df) # the "meaning" column holds the submitter's explanation of each proposed snowplow name
## Rows: 1,907
## Columns: 5
## $ submission_number <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ submitter_name_anonymized <chr> "kjlt9cua", "KXKaabXN", "kjlt9cua", "Rv9sODq…
## $ snowplow_name <chr> "rudolph", "salt life", "blizzard", "butter"…
## $ meaning <chr> "The red nose cuts through any storm.", "We …
## $ winning_name <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
# Transform the "meaning" column into a document-feature matrix using the corpus(), tokens(), tokens_select(), and dfm() functions, removing stop words along the way.
df_corpus <- corpus(df$meaning, docnames = df$submission_number)
## Warning: NA is replaced by empty string
toks <- tokens(df_corpus, remove_punct = TRUE)
toks_nostop_words <- tokens_select(toks, pattern = stopwords("en"), selection = "remove")
df_dfm <- dfm(toks_nostop_words)
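The same corpus-to-dfm pipeline can be traced on a tiny input to see what each step produces. This is only a sketch: the three "meaning" strings are a made-up stand-in for df$meaning, and replacing the NA by hand is an optional step that makes explicit what the warning above reports.

```r
library(quanteda)

# Hypothetical stand-in for df$meaning, including one missing value
meaning <- c("The red nose cuts through any storm.",
             NA,
             "Snow is no match for this plow!")

# Replace NA explicitly instead of relying on corpus() to do it with a warning
meaning[is.na(meaning)] <- ""

toy_corpus <- corpus(meaning, docnames = c("1", "2", "3"))
toy_toks <- tokens(toy_corpus, remove_punct = TRUE)

# tokens_remove() is shorthand for tokens_select(..., selection = "remove")
toy_toks <- tokens_remove(toy_toks, pattern = stopwords("en"))

toy_dfm <- dfm(toy_toks)  # dfm() lowercases features by default

ndoc(toy_dfm)       # one row per document, including the empty one
featnames(toy_dfm)  # remaining words after punctuation and stop words are dropped
```

The empty document stays in the matrix as a row of zeros, so row counts still line up with the original data frame.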
Hint: Make sure you have installed (if needed) and loaded quanteda.textplots with library()
textplot_wordcloud(df_dfm, min_count = 2)
textplot_wordcloud(df_dfm, min_count = 10)
# Raising min_count from 2 to 10 shrinks the word cloud. This makes sense: words that appear between 2 and 9 times are plotted at min_count = 2 but dropped at min_count = 10.
# The top words in the word cloud above include "snow", "syracuse", "city", "plow", and "½". They stand out because each word's font size scales with how often it appears in the snowplow naming submissions.
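The effect of the threshold can also be checked numerically rather than by eye. The sketch below assumes that min_count in textplot_wordcloud() acts as a minimum total term frequency, and it uses dfm_trim() on a made-up two-document dfm to count how many features survive each cutoff.

```r
library(quanteda)

# Hypothetical toy documents: snow appears 4 times, plow 2, salt 2, city 1
toy_dfm <- dfm(tokens(c("snow snow snow plow plow salt",
                        "snow salt city")))

# Number of distinct words that would remain at each frequency cutoff
n_at_2 <- nfeat(dfm_trim(toy_dfm, min_termfreq = 2))
n_at_3 <- nfeat(dfm_trim(toy_dfm, min_termfreq = 3))

n_at_2  # snow, plow, salt
n_at_3  # only snow
```

Running the same two dfm_trim() calls on df_dfm with cutoffs 2 and 10 would give the word counts behind the two clouds.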
Output the 10 most frequent words (the word and its count).
Hint: use textstat_frequency() from the quanteda.textstats package.
textstat_frequency(df_dfm, n=10)
## feature frequency rank docfreq group
## 1 ½ 432 1 143 all
## 2 ï 336 2 147 all
## 3 snow 321 3 292 all
## 4 syracuse 174 4 164 all
## 5 name 143 5 137 all
## 6 plow 140 6 130 all
## 7 salt 104 7 83 all
## 8 plows 100 8 98 all
## 9 columbus 100 8 96 all
## 10 city 96 10 94 all
# This sorted list of word counts makes it easier to see that some irrelevant entries remain, such as "½" and "ï". These are probably fragments of mis-encoded characters rather than real words. Also, "plow" (rank 6, 140 appearances) and "plows" (rank 8, 100 appearances) are essentially the same word and could be counted together.
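Both issues could be handled before building the dfm. The sketch below is not part of the assignment's required steps; it shows one possible cleanup on hypothetical tokens. It assumes the stray "ï" and "½" come from the sequence "ï¿½" (the UTF-8 bytes of the Unicode replacement character read as Latin-1, with "¿" already dropped by remove_punct = TRUE), which has not been verified against the raw file. Stemming with tokens_wordstem() is a swapped-in technique for merging "plow" and "plows".

```r
library(quanteda)

# Hypothetical tokens that mimic the problem: encoding debris plus plow/plows
toy_toks <- as.tokens(list(
  d1 = c("\u00ef", "\u00bd", "snow", "plow"),
  d2 = c("plows", "clear", "snow", "\u00bd")
))

# Drop the two debris tokens ("ï" and "½"), matched literally
debris <- c("\u00ef", "\u00bd")
clean_toks <- tokens_remove(toy_toks, pattern = debris, valuetype = "fixed")

# Reduce words to their stems so that "plows" is counted as "plow"
stem_toks <- tokens_wordstem(clean_toks, language = "english")

stem_dfm <- dfm(stem_toks)
colSums(stem_dfm)  # total count per stemmed feature
```

Stemming changes every word, not just "plows", so the stemmed dfm would no longer match an unstemmed word list without stemming that list too.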
There should be 2006 positive words and 4783 negative words, so you may need to clean up these lists a bit.
pos_url <- "https://intro-datascience.s3.us-east-2.amazonaws.com/positive-words.txt"
pos_words <- scan(pos_url, character(0), sep = "\n")
pos_words <- pos_words[-(1:34)] # drop the first 34 lines, which are not part of the word list
length(pos_words)
## [1] 2006
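Dropping a fixed number of lines works here, but it depends on the file never changing. A pattern-based filter is one alternative. The sketch below runs on made-up lines and assumes the non-word lines start with a semicolon; that assumption should be checked against the downloaded file before relying on it.

```r
# Hypothetical excerpt of a word-list file: comment lines first, then words
raw_lines <- c("; header line", ";", "a+", "abound", "abounds")

# Keep only non-empty lines that do not start with ";" (assumed comment marker)
words <- raw_lines[!grepl("^;", raw_lines) & nzchar(raw_lines)]

words
length(words)
```

If the filter is applied to the real list, length() should still return 2006.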
Then pass this new dfm to the textstat_frequency() function to see the positive words in our corpus, and how many times each word was mentioned.
pos_dfm <- dfm_match(df_dfm, pos_words)
pos_freq <- textstat_frequency(pos_dfm)
nrow(pos_freq)
## [1] 211
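To see what dfm_match() does to the matrix, it helps to try it on a small case. In this sketch the documents and the three-word lexicon are made up for illustration. dfm_match() returns a dfm whose columns are exactly the lexicon words, in lexicon order, with zero columns for words that never occur in the text.

```r
library(quanteda)

# Hypothetical documents and a tiny stand-in for pos_words
toy_dfm <- dfm(tokens(c("great snow great plow", "bad ice")))
lexicon <- c("great", "good", "love")

matched <- dfm_match(toy_dfm, lexicon)

featnames(matched)          # all three lexicon words, found or not
colSums(matched)            # per-word totals; unseen words are 0
sum(colSums(matched) > 0)   # number of lexicon words that actually occur
sum(matched)                # total lexicon-word occurrences
```

The last two lines correspond to the two quantities reported for the real data: the number of distinct matched words and the total number of matches.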
C. Sum all the positive words
sum(pos_freq$frequency)
## [1] 866
D. Do a similar analysis for the negative words - show the 10 most frequent negative words and then sum the negative words in the document.
neg_url <- "https://intro-datascience.s3.us-east-2.amazonaws.com/negative-words.txt"
neg_words <- scan(neg_url, character(0), sep = "\n")
neg_words <- neg_words[-(1:34)] # drop the first 34 lines, which are not part of the word list
length(neg_words)
## [1] 4783
neg_dfm <- dfm_match(df_dfm, neg_words)
neg_freq <- textstat_frequency(neg_dfm)
nrow(neg_freq)
## [1] 148
sum(neg_freq$frequency)
## [1] 255
# Positive words had far more matches (866) than negative words (255), more than three times as many. These counts ignore context, though: a word-list match cannot tell whether a word is negated or used in a different sense, so the true positive and negative totals may differ from the numbers above.
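The comparison can be put on a common scale with a little arithmetic. The sketch below simply reuses the two totals reported above; the variable names are illustrative.

```r
# Totals taken from the sums reported above
pos_total <- 866
neg_total <- 255

# Share of all sentiment-word matches that are positive
pos_share <- pos_total / (pos_total + neg_total)

# How many positive matches there are per negative match
pos_to_neg <- pos_total / neg_total

round(pos_share, 3)
round(pos_to_neg, 2)
```

So roughly 77% of the matched sentiment words are positive, about 3.4 positive matches for every negative one.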