# Name: Joseph Silvestri
# 1. I did this homework by myself, with help from the book and the professor.
Text mining plays an important role in many industries because of the prevalence of text in interactions between customers and company representatives. Even when the customer interaction happens by speech rather than by chat or email, speech-to-text algorithms have become good enough that transcriptions of these spoken-word interactions are often available. To an increasing extent, a data scientist needs to be able to wield tools that turn a body of text into actionable insights. In this homework, we explore a real City of Syracuse dataset using the quanteda and quanteda.textplots packages. Make sure to install the quanteda and quanteda.textplots packages before following the steps below:
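These packages only need to be installed once; a minimal setup sketch (quanteda.textstats and syuzhet are included here as well, since they are used later in this homework):
#install.packages("quanteda")
#install.packages("quanteda.textplots")
#install.packages("quanteda.textstats")
#install.packages("syuzhet")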
#The article is about a City of Syracuse contest to name the city's ten new snowplows; there were 1,948 unique submissions, with the #1 entry being Santa Maria.
#library(tidyverse)
df<-read.csv("https://intro-datascience.s3.us-east-2.amazonaws.com/snowplownames.csv")
#view(df)
#The column called "meaning" contains each submitter's explanation of the proposed name
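A quick spot-check of the raw column before building the corpus (commented out after a first run, in the same way as view(df) above):
#head(df$meaning)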
D. Transform that column into a document-feature matrix, using the corpus(), tokens(), tokens_select(), and dfm() functions from the quanteda package. Do not forget to remove stop words.
library(quanteda)
## Package version: 3.1.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
dfCorpus<-corpus(df$meaning,docnames = df$submission_number)
## Warning: NA is replaced by empty string
toks<-tokens(dfCorpus, remove_punct = TRUE)
toks_nostop<- tokens_select(toks, pattern = stopwords("en"), selection = "remove")
dfDFM<-dfm(toks_nostop)
#install.packages("quanteda.textplots")
library(quanteda.textplots)
textplot_wordcloud(dfDFM,min_count = 2)
textplot_wordcloud(dfDFM, min_count = 10)
#The breadth of the word cloud shrinks significantly, since raising the minimum count removes many infrequently used words
#The largest words are snow, Syracuse, and the replacement character left where apostrophes were badly encoded
Hint: use textstat_frequency in the quanteda.textstats package
#install.packages("quanteda.textstats")
library(quanteda.textstats)
tstats<-textstat_frequency(dfDFM)
head(tstats,14)
## feature frequency rank docfreq group
## 1 â 914 1 150 all
## 2 ¯ 478 2 150 all
## 3 ½ 432 3 143 all
## 4 ã 341 4 147 all
## 5 snow 320 5 291 all
## 6 syracuse 174 6 164 all
## 7 name 143 7 137 all
## 8 plow 140 8 130 all
## 9 salt 104 9 83 all
## 10 plows 100 10 98 all
## 11 columbus 100 10 96 all
## 12 city 96 12 94 all
## 13 like 88 13 85 all
## 14 one 75 14 75 all
#Eliminating the first 4 features, which are encoding artifacts rather than real words, the ten most frequent words are:
#Snow, Syracuse, Name, Plow, Salt, Plows, Columbus, City, Like, and One
#In the sorted data set we can see that the data contains stray symbols (encoding artifacts) as well as filler words that are not relevant to the analysis, such as "like", "name", and "just"
#There are also singular/plural variants of the same word, such as "plow" vs. "plows"
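Both issues could be cleaned up before re-running the frequency analysis; a minimal sketch, assuming the four mojibake tokens in the table above are the only stray symbols, and using quanteda's tokens_wordstem() to merge plural variants:
toksClean <- tokens_remove(toks_nostop, pattern = c("â", "¯", "½", "ã"))
#stem so that "plow" and "plows" collapse into a single feature
toksClean <- tokens_wordstem(toksClean)
dfmClean <- dfm(toksClean)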
### Match the words in the dataset against positive and negative word lists
https://intro-datascience.s3.us-east-2.amazonaws.com/positive-words.txt
There should be 2006 positive words, so you may need to clean up these lists a bit.
URL<-"https://intro-datascience.s3.us-east-2.amazonaws.com/positive-words.txt"
posWords <- scan(URL, character(0), sep = "\n")
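#the first 34 lines of the file are header text rather than words, so drop them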
posWords <- posWords[-1:-34]
length(posWords)
## [1] 2006
J. Do the same for the negative words list (there are 4783 negative words):
https://intro-datascience.s3.us-east-2.amazonaws.com/negative-words.txt
URL1<-"https://intro-datascience.s3.us-east-2.amazonaws.com/negative-words.txt"
negWords <- scan(URL1, character(0), sep = "\n")
negWords <- negWords[-1:-34]
length(negWords)
## [1] 4783
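#dfm_match() keeps only the features of the dfm that also appear in the supplied word list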
posDFM <- dfm_match(dfDFM, posWords)
posFreq <- textstat_frequency(posDFM)
nrow(posFreq)
## [1] 211
#211
negDFM <- dfm_match(dfDFM, negWords)
negFreq <- textstat_frequency(negDFM)
head(negFreq, 10)
## feature frequency rank docfreq group
## 1 funny 24 1 24 all
## 2 cold 8 2 8 all
## 3 twist 8 2 8 all
## 4 hard 7 4 7 all
## 5 abominable 6 5 6 all
## 6 problem 6 5 6 all
## 7 bad 5 7 5 all
## 8 destroy 5 7 5 all
## 9 died 5 7 5 all
## 10 bust 4 10 4 all
nrow(negFreq)
## [1] 147
#147
sum(posFreq$frequency)
## [1] 865
sum(negFreq$frequency)
## [1] 253
#Positive words have more matches than negative words, and their total frequency is much higher: 865 positive occurrences vs. 253 negative, a difference of 612
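The gap can be computed directly from the two totals above:
sum(posFreq$frequency) - sum(negFreq$frequency)
## [1] 612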
X. Complete the function below, so that it returns a sentiment score (number of positive words - number of negative words)
doMySentiment <- function(posWords, negWords, stringToAnalyze) {
  #split the string into lowercase words, then count the matches in each list
  words <- tolower(unlist(strsplit(stringToAnalyze, "[^[:alpha:]']+")))
  sentimentScore <- sum(words %in% posWords) - sum(words %in% negWords)
  return(sentimentScore)
}
X. Test your function with the string “This book is horrible”
doMySentiment(posWords, negWords, "This book is horrible")
## [1] -1
#-1, since "horrible" matches the negative list and no word matches the positive list
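As a sanity check in the other direction, a positive word should flip the sign (assuming "wonderful" appears on the positive list, as it does in the Hu and Liu lexicon these files are based on):
doMySentiment(posWords, negWords, "This book is wonderful")
#expected to return 1: one positive match, no negative matches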
Use the syuzhet package to calculate the sentiment of the same phrase (“This book is horrible”) with syuzhet’s get_sentiment() function and the afinn method. In AFINN, words are scored as integers from -5 to +5:
#install.packages("syuzhet")
library(syuzhet)
get_sentiment("This book is horrible", method="afinn")
## [1] -3
#-3
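For comparison, get_sentiment() also supports other lexicons on the same phrase; a quick sketch (the bing method scores each matched word as +1 or -1, so "horrible" alone should yield -1):
get_sentiment("This book is horrible", method="bing")
get_sentiment("This book is horrible", method="nrc")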