Learning Outcomes measured in this assignment: LO1 to LO5
Content knowledge you’ll gain from doing this assignment: Tokenization, word counts, visualization of frequent words, wordclouds, sentiment analysis, and pairwise correlations.
For this assignment, you can work in groups of up to 3 people.
For Assignment 1, the grading criteria are 70% correctness of the code and 30% communication of results.
Submission: You have two options; choose whichever you prefer.
For this assignment, we will be using a much simplified version of the Movie Reviews data. The entire dataset is available here: https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/overview
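The analysis below uses several packages; a setup chunk along these lines is assumed (it is not shown in the rendered output):

library(tidyverse)    # read_csv, dplyr, ggplot2, tidyr, forcats
library(tidytext)     # unnest_tokens, stop_words, get_sentiments
library(wordcloud)    # wordcloud()
library(RColorBrewer) # brewer.pal()
library(igraph)       # graph_from_data_frame()
library(ggraph)       # network plots
library(knitr)        # kable()
library(kableExtra)   # kable_styling()
# Note: get_sentiments("loughran") fetches that lexicon via the textdata package on first use.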
The following R chunk reads the data:
movie=read_csv("https://unh.box.com/shared/static/3sd0exk43cz04mk9ftt3r4jomhaus6m0.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## PhraseId = col_double(),
## SentenceId = col_double(),
## Phrase = col_character(),
## Sentiment = col_double()
## )
# keep only the first phrase (the full sentence) for each SentenceId
movie=distinct(movie, SentenceId, .keep_all=TRUE)
movie
## # A tibble: 150 x 4
## PhraseId SentenceId Phrase Sentiment
## <dbl> <dbl> <chr> <dbl>
## 1 1 1 A series of escapades demonstrating the adage … 1
## 2 64 2 This quiet , introspective and entertaining in… 4
## 3 82 3 Even fans of Ismail Merchant 's work , I suspe… 1
## 4 117 4 A positively thrilling combination of ethnogra… 3
## 5 157 5 Aggressive self-glorification and a manipulati… 1
## 6 167 6 A comedy-drama of nearly epic proportions root… 4
## 7 199 7 Narratively , Trouble Every Day is a plodding … 1
## 8 214 8 The Importance of Being Earnest , so thick wit… 3
## 9 248 9 But it does n't leave you with much . 1
## 10 260 10 You could hate it for the same reason . 1
## # … with 140 more rows
moviePhrase=movie[,"Phrase"]
moviePhrase
## # A tibble: 150 x 1
## Phrase
## <chr>
## 1 A series of escapades demonstrating the adage that what is good for the goos…
## 2 This quiet , introspective and entertaining independent is worth seeking .
## 3 Even fans of Ismail Merchant 's work , I suspect , would have a hard time si…
## 4 A positively thrilling combination of ethnography and all the intrigue , bet…
## 5 Aggressive self-glorification and a manipulative whitewash .
## 6 A comedy-drama of nearly epic proportions rooted in a sincere performance by…
## 7 Narratively , Trouble Every Day is a plodding mess .
## 8 The Importance of Being Earnest , so thick with wit it plays like a reading …
## 9 But it does n't leave you with much .
## 10 You could hate it for the same reason .
## # … with 140 more rows
# tokenize each phrase into one word per row
text_tidy_movie=moviePhrase %>%
unnest_tokens(word, Phrase)
stop_words
## # A tibble: 1,149 x 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # … with 1,139 more rows
text_tidy_movie = text_tidy_movie %>%
anti_join(stop_words)
## Joining, by = "word"
text_tidy_movie
## # A tibble: 1,192 x 1
## word
## <chr>
## 1 series
## 2 escapades
## 3 demonstrating
## 4 adage
## 5 goose
## 6 gander
## 7 occasionally
## 8 amuses
## 9 amounts
## 10 story
## # … with 1,182 more rows
text_tidy_movie%>%
count(word, sort=TRUE)
## # A tibble: 960 x 2
## word n
## <chr> <int>
## 1 film 17
## 2 movie 16
## 3 n't 12
## 4 lrb 9
## 5 rrb 9
## 6 story 9
## 7 character 5
## 8 comedy 5
## 9 director 5
## 10 makes 5
## # … with 950 more rows
Based on the above output, we add film, movie, and movies to the stop words, since these terms name the subject matter itself and occur very frequently in movie reviews.
custom_stop_words <- tribble(
~word, ~lexicon,
"movie", "CUSTOM",
"film","CUSTOM",
"movies", "CUSTOM"
)
text_tidy_movie=text_tidy_movie %>%
anti_join(custom_stop_words)
## Joining, by = "word"
text_tidy_movie%>%
count(word, sort=TRUE)
## # A tibble: 957 x 2
## word n
## <chr> <int>
## 1 n't 12
## 2 lrb 9
## 3 rrb 9
## 4 story 9
## 5 character 5
## 6 comedy 5
## 7 director 5
## 8 makes 5
## 9 narrative 5
## 10 action 4
## # … with 947 more rows
word_counts <- text_tidy_movie %>%
count(word) %>%
filter(n > 2 ) %>%
mutate(word2 = fct_reorder(word, n))
word_counts
## # A tibble: 42 x 3
## word n word2
## <chr> <int> <fct>
## 1 action 4 action
## 2 character 5 character
## 3 characters 4 characters
## 4 cinematic 3 cinematic
## 5 comedy 5 comedy
## 6 dark 3 dark
## 7 director 5 director
## 8 earnest 3 earnest
## 9 enjoy 3 enjoy
## 10 entertaining 3 entertaining
## # … with 32 more rows
ggplot(word_counts, aes(x=word2, y=n)) +
geom_col() +
coord_flip() +
ggtitle("Word Counts for Music Category")
We filtered to words with n > 2 because the vast majority of words appear only once or twice, which made the unfiltered plot unreadable.
The plot shows that, across the 150 reviews, the vocabulary is largely unique to each review, with most words occurring only once or twice.
w_count_movie = text_tidy_movie %>%
count(word)
wordcloud(words=w_count_movie$word,
freq=w_count_movie$n,
random.order=FALSE,            # place the most frequent words in the center
colors=brewer.pal(8, "Dark2"), # color words by frequency band
max.words=500)
We chose a wordcloud because it gives a strong visual summary of word frequencies. Here n't is the most used word, followed by lrb, rrb, and story; lrb and rrb appear to be tokenization artifacts, standing for the left and right round brackets (parentheses) in the source corpus. We also used the colors parameter rather than a single color, since it displays each frequency band in a different color.
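If these bracket artifacts are unwanted, the custom stop-word list could be extended before recounting. A minimal sketch (this step was not part of the original analysis, and the name text_tidy_movie_nobrackets is ours):

# Hypothetical extension: drop the bracket tokens as well
bracket_tokens <- tribble(
  ~word, ~lexicon,
  "lrb", "CUSTOM",
  "rrb", "CUSTOM"
)
text_tidy_movie_nobrackets = text_tidy_movie %>%
  anti_join(bracket_tokens, by = "word")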
sentiment_title = text_tidy_movie %>%
inner_join(get_sentiments("loughran"))
## Joining, by = "word"
sentiment_title2 = sentiment_title %>%
filter(sentiment %in% c("positive", "negative"))
sentiment_title %>%
count(sentiment) %>%
arrange(desc(n))
## # A tibble: 5 x 2
## sentiment n
## <chr> <int>
## 1 negative 47
## 2 positive 25
## 3 uncertainty 4
## 4 constraining 3
## 5 litigious 1
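The loughran lexicon is finance-oriented; as a quick cross-check (a sketch, not part of the original analysis), the same join can be repeated with tidytext's bundled bing lexicon:

# Sketch: repeat the sentiment join with the bing lexicon for comparison
sentiment_bing = text_tidy_movie %>%
  inner_join(get_sentiments("bing"), by = "word")
sentiment_bing %>%
  count(sentiment) %>%
  arrange(desc(n))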
sent_count = sentiment_title2 %>%
count(word, sentiment) %>%
group_by(sentiment) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(word2=fct_reorder(word, n))
ggplot(sent_count, aes(x=word2, y=n, fill=sentiment)) +
geom_col(show.legend=FALSE)+
facet_wrap(~sentiment, scales="free")+
coord_flip()+
labs(title = "Sentiment Word Counts", x="Words")
positive = sentiment_title %>%
filter(sentiment == "positive")
unique_positive = unique(positive["word"])
# Collect every review whose text contains at least one positive word
# (note: a review matching several words is added once per word)
for(i in 1:nrow(unique_positive)) {
word_i <- unique_positive$word[i]   # grep needs a character pattern, not a tibble row
reviews_positive_line = grep(word_i, movie$Phrase, perl=TRUE)
if (i==1) {
positive_reviews = movie[reviews_positive_line,]
} else {
positive_reviews = rbind(movie[reviews_positive_line,], positive_reviews)
}
}
negative = sentiment_title %>%
filter(sentiment == "negative")
unique_negative = unique(negative["word"])
# Same approach for the negative words
for(i in 1:nrow(unique_negative)) {
word_i <- unique_negative$word[i]
reviews_negative_line = grep(word_i, movie$Phrase, perl=TRUE)
if (i==1) {
negative_reviews = movie[reviews_negative_line,]
} else {
negative_reviews = rbind(movie[reviews_negative_line,], negative_reviews)
}
}
# A review containing both positive and negative words is treated as positive
negative_reviews = negative_reviews %>%
filter(!(PhraseId %in% positive_reviews$PhraseId))
count_positive = count(positive_reviews)   # reviews collected via positive words
count_negative = count(negative_reviews)   # reviews collected via negative words only
inputmoviepositive <- movie %>%
filter(Sentiment == 4) %>%   # 4 = positive in the input labels
count()
inputmovienegative <- movie %>%
filter(Sentiment == 0) %>%   # 0 = negative in the input labels
count()
Method.List=c("Positive Reviews","Negative Reviews","Input positive Reviews","Input negative Reviews")
count=c(count_positive$n,count_negative$n,inputmoviepositive$n,inputmovienegative$n)
tbl <- data.frame(Method.List, count)
tbll<-kable(tbl, format = "html")
kable_styling(tbll, bootstrap_options = c("striped", "hover"))
| Method.List | count |
|---|---|
| Positive Reviews | 24 |
| Negative Reviews | 38 |
| Input positive Reviews | 24 |
| Input negative Reviews | 12 |
The above analysis shows almost twice as many negative words as positive words: 47 negative versus 25 positive.
Tying the words back to the reviews that contain them gives 24 positive and 38 negative reviews. Comparing against the labels in the input data (0 = negative, 4 = positive), the positive counts match exactly (24 vs. 24), but there is a large gap for negative reviews (38 vs. 12). This might be because the input labels also include 1 (somewhat negative) and 3 (somewhat positive), which we did not count, or because the Loughran-McDonald lexicon used here is finance-oriented and labels words differently than the original annotators did.
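As an aside, the grep loops above can be replaced by a join that keeps SentenceId through tokenization. A minimal sketch, assuming the same tidytext/tidyverse stack (this is an alternative approach, not the method used to produce the table above, and names like review_sentiment are ours):

# Sketch: classify each review by the balance of matched sentiment words
review_sentiment = movie %>%
  unnest_tokens(word, Phrase) %>%                         # keeps SentenceId alongside each word
  inner_join(get_sentiments("loughran"), by = "word") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(SentenceId, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(label = if_else(positive >= negative, "positive", "negative"))  # ties counted as positive
count(review_sentiment, label)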
ngram_Phrases = movie %>%
unnest_tokens(bigram, Phrase, token="ngrams", n=2)
ngram_Phrases
## # A tibble: 2,540 x 4
## PhraseId SentenceId Sentiment bigram
## <dbl> <dbl> <dbl> <chr>
## 1 1 1 1 a series
## 2 1 1 1 series of
## 3 1 1 1 of escapades
## 4 1 1 1 escapades demonstrating
## 5 1 1 1 demonstrating the
## 6 1 1 1 the adage
## 7 1 1 1 adage that
## 8 1 1 1 that what
## 9 1 1 1 what is
## 10 1 1 1 is good
## # … with 2,530 more rows
ngram_Phrases %>%
count(bigram, sort=TRUE)
## # A tibble: 2,277 x 2
## bigram n
## <chr> <int>
## 1 it s 16
## 2 of the 13
## 3 of a 12
## 4 the film 9
## 5 the movie 8
## 6 is a 6
## 7 on the 6
## 8 but it 5
## 9 for a 5
## 10 like a 5
## # … with 2,267 more rows
filtered_Phrases = ngram_Phrases %>%
separate(bigram, c("word1", "word2"), sep=" ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
filtered_Phrases
## # A tibble: 444 x 5
## PhraseId SentenceId Sentiment word1 word2
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 1 1 1 escapades demonstrating
## 2 1 1 1 occasionally amuses
## 3 64 2 4 quiet introspective
## 4 64 2 4 entertaining independent
## 5 64 2 4 worth seeking
## 6 82 3 1 ismail merchant
## 7 82 3 1 hard time
## 8 82 3 1 time sitting
## 9 117 4 3 positively thrilling
## 10 117 4 3 thrilling combination
## # … with 434 more rows
filtered_Phrases_united = filtered_Phrases %>%
unite(bigram, c("word1", "word2"), sep=" ")
filtered_Phrases_united %>%
count(bigram, sort=TRUE)
## # A tibble: 443 x 2
## bigram n
## <chr> <int>
## 1 romantic comedy 2
## 2 100 minute 1
## 3 13 rating 1
## 4 25 minutes 1
## 5 3000 guys 1
## 6 60 minutes 1
## 7 7 times 1
## 8 abel ferrara 1
## 9 aborted attempts 1
## 10 absolute joy 1
## # … with 433 more rows
bigram_count=filtered_Phrases %>%
count(word1, word2, sort=TRUE)
bigram_network = bigram_count %>%
filter( n > 1) %>%
graph_from_data_frame()
bigram_network
## IGRAPH e524527 DN-- 2 1 --
## + attr: name (v/c), n (e/n)
## + edge from e524527 (vertex names):
## [1] romantic->comedy
set.seed(1234)
ggraph(bigram_network, layout = "fr") +
geom_edge_link() +
geom_node_point() +
geom_node_text(aes(label = name), vjust = 1, hjust = 1)
The above output shows that repeated word pairs are almost entirely absent from this sample: "romantic comedy" is the only bigram that occurs more than once. Strictly speaking, this is a bigram co-occurrence network rather than a pairwise-correlation analysis.
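For true pairwise correlations between words (as named in the learning outcomes), the widyr package offers pairwise_cor. A minimal sketch, assuming widyr is installed (it was not loaded in the original analysis):

library(widyr)
# Sketch: phi coefficient between words co-occurring in the same review
word_cors = movie %>%
  unnest_tokens(word, Phrase) %>%
  anti_join(stop_words, by = "word") %>%
  group_by(word) %>%
  filter(n() >= 3) %>%   # correlations are unstable for very rare words
  ungroup() %>%
  pairwise_cor(word, SentenceId, sort = TRUE)
word_cors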
Based on the above analysis, the reviews are quite distinctive, with little shared vocabulary or repeated patterning between them; the same analysis would be more informative on a larger dataset.
Note also that in the sentiment analysis we compared the sentiments of the reviews identified by our analysis with the sentiment labels provided in the input data and found discrepancies for the negative reviews, which could likewise be due to the different lexicon and matching method used.