Intro to Data Science - HW 11

Attribution statement:

1. I did this homework by myself, with help from the book and the professor.

Text mining plays an important role in many industries because of the prevalence of text in the interactions between customers and company representatives. Even when the customer interaction is by speech, rather than by chat or email, speech to text algorithms have gotten so good that transcriptions of these spoken word interactions are often available. To an increasing extent, a data scientist needs to be able to wield tools that turn a body of text into actionable insights. In this homework, we explore a real City of Syracuse dataset using the quanteda and quanteda.textplots packages. Make sure to install the quanteda and quanteda.textplots packages before following the steps below:
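The quanteda.textstats and syuzhet packages are also used in later steps, so it may be convenient to install everything up front (one-time setup, commented out here so the script can be re-run):

#install.packages(c("quanteda", "quanteda.textplots", "quanteda.textstats", "syuzhet"))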

Part 1: Load and visualize the data file

  A. Take a look at this article: https://samedelstein.medium.com/snowplow-naming-contest-data-2dcd38272caf and write a comment in your R script, briefly describing what it is about.
#The article is about a City of Syracuse contest to name the city's ten new snowplows; there were 1,948 unique submissions, and the most popular name was Santa Maria.
  B. Read the data from the following URL into a dataframe called df: https://intro-datascience.s3.us-east-2.amazonaws.com/snowplownames.csv
#library(tidyverse)
df <- read.csv("https://intro-datascience.s3.us-east-2.amazonaws.com/snowplownames.csv")
  C. Inspect the df dataframe – which column contains an explanation of the meaning of each submitted snowplow name?
#View(df)
#The column called "meaning" contains the explanation of each submitted name
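For reference, a few base-R ways to inspect the columns (output omitted):

#str(df)              # column names, types, and sample values
#names(df)            # just the column names
#head(df$meaning, 3)  # first few name explanations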

D. Transform that column into a document-feature matrix, using the corpus(), tokens(), tokens_select(), and dfm() functions from the quanteda package. Do not forget to remove stop words.

library(quanteda)
## Package version: 3.1.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
# build a corpus from the meaning column, with submission numbers as document names
dfCorpus <- corpus(df$meaning, docnames = df$submission_number)
## Warning: NA is replaced by empty string
# tokenize, drop punctuation, then remove English stop words
toks <- tokens(dfCorpus, remove_punct = TRUE)
toks_nostop <- tokens_select(toks, pattern = stopwords("en"), selection = "remove")
# convert the tokens object into a document-feature matrix
dfDFM <- dfm(toks_nostop)
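As a quick sanity check before plotting, quanteda's topfeatures() lists the most frequent features in the dfm (output omitted):

#topfeatures(dfDFM, 10)  # ten most frequent features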
  E. Plot a word cloud where a word is only represented if it appears at least 2 times in the corpus. Hint: use textplot_wordcloud() from the quanteda.textplots package:
#install.packages("quanteda.textplots")
library(quanteda.textplots)
textplot_wordcloud(dfDFM, min_count = 2)

  F. Next, increase the minimum count to 10. What happens to the word cloud? Explain in a comment.
textplot_wordcloud(dfDFM, min_count = 10)

#The word cloud shrinks significantly: infrequent words are dropped, leaving only
#the most common terms such as snow and syracuse, plus the mis-encoded replacement
#characters for apostrophes
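One way to quantify the shrinkage is to count how many features clear each threshold; colSums() works on a dfm because it is stored as a sparse matrix (a quick check, output omitted):

#sum(colSums(dfDFM) >= 2)   # features appearing at least 2 times
#sum(colSums(dfDFM) >= 10)  # features appearing at least 10 times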
  G. What are the top 10 words in the word cloud?

Hint: use textstat_frequency() from the quanteda.textstats package.

#install.packages("quanteda.textstats")
library(quanteda.textstats)

tstats <- textstat_frequency(dfDFM)

# show 14 rows: the first 4 "features" are mis-encoded characters, not words
head(tstats, 14)
##     feature frequency rank docfreq group
## 1         â       914    1     150   all
## 2         ¯       478    2     150   all
## 3         ½       432    3     143   all
## 4         ã       341    4     147   all
## 5      snow       320    5     291   all
## 6  syracuse       174    6     164   all
## 7      name       143    7     137   all
## 8      plow       140    8     130   all
## 9      salt       104    9      83   all
## 10    plows       100   10      98   all
## 11 columbus       100   10      96   all
## 12     city        96   12      94   all
## 13     like        88   13      85   all
## 14      one        75   14      75   all
#Excluding the first 4 features, which are mis-encoded characters rather than words,
#the top 10 words are: snow, syracuse, name, plow, salt, plows, columbus, city, like, and one
  H. Explain in a comment what you observed in the sorted list of word counts.
#The sorted list shows mis-encoded symbols in the data, along with common filler
#words that carry little meaning here, such as "like", "name", and "just"
#It also counts singular and plural forms separately, e.g., plow vs. plows
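One way to collapse singular/plural pairs like plow and plows, as noted above, is stemming; a minimal sketch using quanteda's tokens_wordstem() (requires the SnowballC package to be installed; output omitted):

#toksStemmed <- tokens_wordstem(toks_nostop)  # plow and plows both become "plow"
#dfmStemmed <- dfm(toksStemmed)
#head(textstat_frequency(dfmStemmed), 10)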

Part 2: Analyze the sentiment of the descriptions

### Match the description words with positive and negative words

  I. Read in the list of positive words (using the scan() function), and output the first 5 words in the list.

https://intro-datascience.s3.us-east-2.amazonaws.com/positive-words.txt

There should be 2,006 positive words, so you may need to clean up the list a bit.

URL <- "https://intro-datascience.s3.us-east-2.amazonaws.com/positive-words.txt"
posWords <- scan(URL, character(0), sep = "\n")
# the file begins with 34 header/comment lines; drop them
posWords <- posWords[-1:-34]
head(posWords, 5)  # output the first 5 positive words
length(posWords)
## [1] 2006

J. Do the same for the negative words list (there are 4,783 negative words):

https://intro-datascience.s3.us-east-2.amazonaws.com/negative-words.txt

URL1 <- "https://intro-datascience.s3.us-east-2.amazonaws.com/negative-words.txt"
negWords <- scan(URL1, character(0), sep = "\n")
# again, drop the 34 header/comment lines
negWords <- negWords[-1:-34]
length(negWords)
## [1] 4783
  K. Using dfm_match() with the dfm and the positive word file you read in, and then textstat_frequency(), output the 10 most frequent positive words.
# keep only the features that appear in the positive-word list
posDFM <- dfm_match(dfDFM, posWords)
posFreq <- textstat_frequency(posDFM)
head(posFreq, 10)  # the 10 most frequent positive words
  L. Use R to print out the total number of positive words in the name explanation.
nrow(posFreq)
## [1] 211
#211
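Note that nrow() counts distinct matched words; the total number of occurrences of positive words is the sum of the frequency column (computed with output in a later step):

#sum(posFreq$frequency)  # total occurrences of positive words across all descriptions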
  M. Repeat that process for the negative words you matched. Which negative words were in the name explanation variable, and what is their total number?
negDFM  <- dfm_match(dfDFM, negWords)
negFreq <- textstat_frequency(negDFM)
head(negFreq, 10)
##       feature frequency rank docfreq group
## 1       funny        24    1      24   all
## 2        cold         8    2       8   all
## 3       twist         8    2       8   all
## 4        hard         7    4       7   all
## 5  abominable         6    5       6   all
## 6     problem         6    5       6   all
## 7         bad         5    7       5   all
## 8     destroy         5    7       5   all
## 9        died         5    7       5   all
## 10       bust         4   10       4   all
nrow(negFreq)  # number of distinct negative words matched
## [1] 147
#147
  N. Write a comment describing what you found after exploring the positive and negative word lists. Which group is more common in this dataset?
sum(posFreq$frequency)
## [1] 865
sum(negFreq$frequency)
## [1] 253
#Positive words are the more common group: more distinct positive words matched
#(211 vs. 147), and their total frequency is higher (865 vs. 253, a difference of 612)

X. Complete the function below so that it returns a sentiment score (number of positive words - number of negative words):

doMySentiment <- function(posWords, negWords, stringToAnalyze) {
  # split the string into lowercase words so each can be matched against the lists
  words <- unlist(strsplit(tolower(stringToAnalyze), "[^a-z']+"))
  # sentiment score = number of positive words - number of negative words
  posCount <- sum(words %in% posWords)
  negCount <- sum(words %in% negWords)
  return(posCount - negCount)
}

X. Test your function with the string “This book is horrible”

doMySentiment(posWords, negWords, "This book is horrible")
## [1] -1
#"horrible" is in the negative list and no positive words match, so the score is -1

X. Use the syuzhet package to calculate the sentiment of the same phrase (“This book is horrible”) with syuzhet’s get_sentiment() function and the afinn method. In AFINN, words are scored as integers from -5 to +5:

#install.packages("syuzhet")
library(syuzhet)

get_sentiment("This book is horrible", method="afinn")
## [1] -3
#-3
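To see which word drives the score, get_sentiment() also accepts a vector of tokens, and syuzhet's get_tokens() splits a phrase into lowercase words; a small sketch (output omitted):

#get_sentiment(get_tokens("This book is horrible"), method = "afinn")
#one score per word; only "horrible" should be nonzero (AFINN scores it -3)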