Learning Outcomes measured in this assignment: LO1 to LO5
Content knowledge you’ll gain from doing this assignment: Tokenization, word counts, visualization of frequent words, wordclouds, sentiment analysis, and pairwise correlations.
For this assignment, you can work in groups of up to 3 people.
For Assignment 1, the grading criteria are 70% correctness of the code and 30% communication of results.
Submission: You have two options; choose whichever you prefer.
For this assignment, we will be using a much simplified version of the Movie Reviews data. The entire dataset is available here: https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/overview
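The analysis below uses several packages; a setup chunk along these lines is assumed (it is not shown in the rendered output):

library(tidyverse)    # read_csv, dplyr, ggplot2, tidyr, forcats
library(tidytext)     # unnest_tokens, stop_words, get_sentiments
library(wordcloud)    # wordcloud()
library(RColorBrewer) # brewer.pal()
library(igraph)       # graph_from_data_frame()
library(ggraph)       # network plots
library(knitr)        # kable()
library(kableExtra)   # kable_styling()
# Note: get_sentiments("loughran") fetches that lexicon via the textdata package on first use.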
The following R chunk reads the data:
movie=read_csv("https://unh.box.com/shared/static/3sd0exk43cz04mk9ftt3r4jomhaus6m0.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## PhraseId = col_double(),
## SentenceId = col_double(),
## Phrase = col_character(),
## Sentiment = col_double()
## )
# keep only the first phrase (the full sentence) for each SentenceId
movie=distinct(movie, SentenceId, .keep_all=TRUE)
movie
## # A tibble: 150 x 4
## PhraseId SentenceId Phrase Sentiment
## <dbl> <dbl> <chr> <dbl>
## 1 1 1 A series of escapades demonstrating the adage … 1
## 2 64 2 This quiet , introspective and entertaining in… 4
## 3 82 3 Even fans of Ismail Merchant 's work , I suspe… 1
## 4 117 4 A positively thrilling combination of ethnogra… 3
## 5 157 5 Aggressive self-glorification and a manipulati… 1
## 6 167 6 A comedy-drama of nearly epic proportions root… 4
## 7 199 7 Narratively , Trouble Every Day is a plodding … 1
## 8 214 8 The Importance of Being Earnest , so thick wit… 3
## 9 248 9 But it does n't leave you with much . 1
## 10 260 10 You could hate it for the same reason . 1
## # … with 140 more rows
moviePhrase=movie[,"Phrase"]
moviePhrase
## # A tibble: 150 x 1
## Phrase
## <chr>
## 1 A series of escapades demonstrating the adage that what is good for the goos…
## 2 This quiet , introspective and entertaining independent is worth seeking .
## 3 Even fans of Ismail Merchant 's work , I suspect , would have a hard time si…
## 4 A positively thrilling combination of ethnography and all the intrigue , bet…
## 5 Aggressive self-glorification and a manipulative whitewash .
## 6 A comedy-drama of nearly epic proportions rooted in a sincere performance by…
## 7 Narratively , Trouble Every Day is a plodding mess .
## 8 The Importance of Being Earnest , so thick with wit it plays like a reading …
## 9 But it does n't leave you with much .
## 10 You could hate it for the same reason .
## # … with 140 more rows
# tokenize each phrase into one word per row
text_tidy_movie=moviePhrase %>%
unnest_tokens(word, Phrase)
stop_words
## # A tibble: 1,149 x 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # … with 1,139 more rows
text_tidy_movie = text_tidy_movie %>%
anti_join(stop_words)
## Joining, by = "word"
text_tidy_movie
## # A tibble: 1,192 x 1
## word
## <chr>
## 1 series
## 2 escapades
## 3 demonstrating
## 4 adage
## 5 goose
## 6 gander
## 7 occasionally
## 8 amuses
## 9 amounts
## 10 story
## # … with 1,182 more rows
text_tidy_movie%>%
count(word, sort=TRUE)
## # A tibble: 960 x 2
## word n
## <chr> <int>
## 1 film 17
## 2 movie 16
## 3 n't 12
## 4 lrb 9
## 5 rrb 9
## 6 story 9
## 7 character 5
## 8 comedy 5
## 9 director 5
## 10 makes 5
## # … with 950 more rows
Based on the above output, we add film, movie, and movies to the stop words, since these terms name the subject matter itself and occur very frequently in movie reviews.
custom_stop_words <- tribble(
~word, ~lexicon,
"movie", "CUSTOM",
"film","CUSTOM",
"movies", "CUSTOM"
)
text_tidy_movie=text_tidy_movie %>%
anti_join(custom_stop_words)
## Joining, by = "word"
text_tidy_movie%>%
count(word, sort=TRUE)
## # A tibble: 957 x 2
## word n
## <chr> <int>
## 1 n't 12
## 2 lrb 9
## 3 rrb 9
## 4 story 9
## 5 character 5
## 6 comedy 5
## 7 director 5
## 8 makes 5
## 9 narrative 5
## 10 action 4
## # … with 947 more rows
word_counts <- text_tidy_movie %>%
count(word) %>%
filter(n > 2 ) %>%
mutate(word2 = fct_reorder(word, n))
word_counts
## # A tibble: 42 x 3
## word n word2
## <chr> <int> <fct>
## 1 action 4 action
## 2 character 5 character
## 3 characters 4 characters
## 4 cinematic 3 cinematic
## 5 comedy 5 comedy
## 6 dark 3 dark
## 7 director 5 director
## 8 earnest 3 earnest
## 9 enjoy 3 enjoy
## 10 entertaining 3 entertaining
## # … with 32 more rows
ggplot(word_counts, aes(x=word2, y=n)) +
geom_col() +
coord_flip() +
ggtitle("Word Counts for Music Category")
We filtered to words with n > 2 because the vast majority of words appear only once or twice, which made the unfiltered plot unreadable.
The plot shows that, across the 150 reviews, the vocabulary is largely unique to each review, with most words occurring only once or twice.
w_count_movie = text_tidy_movie %>%
count(word)
wordcloud(words=w_count_movie$word,
freq=w_count_movie$n,
random.order=FALSE,            # place the most frequent words in the center
colors=brewer.pal(8, "Dark2"), # color words by frequency band
max.words=500)
We chose a wordcloud because it gives a strong visual summary of word frequencies. Here n't is the most used word, followed by lrb, rrb, and story; lrb and rrb appear to be tokenization artifacts, standing for the left and right round brackets (parentheses) in the source corpus. We also used the colors parameter rather than a single color, since it displays each frequency band in a different color.
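If these bracket artifacts are unwanted, the custom stop-word list could be extended before recounting. A minimal sketch (this step was not part of the original analysis, and the name text_tidy_movie_nobrackets is ours):

# Hypothetical extension: drop the bracket tokens as well
bracket_tokens <- tribble(
  ~word, ~lexicon,
  "lrb", "CUSTOM",
  "rrb", "CUSTOM"
)
text_tidy_movie_nobrackets = text_tidy_movie %>%
  anti_join(bracket_tokens, by = "word")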
sentiment_title = text_tidy_movie %>%
inner_join(get_sentiments("loughran"))
## Joining, by = "word"
sentiment_title2 = sentiment_title %>%
filter(sentiment %in% c("positive", "negative"))
sentiment_title %>%
count(sentiment) %>%
arrange(desc(n))
## # A tibble: 5 x 2
## sentiment n
## <chr> <int>
## 1 negative 47
## 2 positive 25
## 3 uncertainty 4
## 4 constraining 3
## 5 litigious 1
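The loughran lexicon is finance-oriented; as a quick cross-check (a sketch, not part of the original analysis), the same join can be repeated with tidytext's bundled bing lexicon:

# Sketch: repeat the sentiment join with the bing lexicon for comparison
sentiment_bing = text_tidy_movie %>%
  inner_join(get_sentiments("bing"), by = "word")
sentiment_bing %>%
  count(sentiment) %>%
  arrange(desc(n))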
sent_count = sentiment_title2 %>%
count(word, sentiment) %>%
group_by(sentiment) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(word2=fct_reorder(word, n))
ggplot(sent_count, aes(x=word2, y=n, fill=sentiment)) +
geom_col(show.legend=FALSE)+
facet_wrap(~sentiment, scales="free")+
coord_flip()+
labs(title = "Sentiment Word Counts", x="Words")
positive = sentiment_title %>%
filter(sentiment == "positive")
unique_positive = unique(positive["word"])
# Collect every review whose text contains at least one positive word
# (note: a review matching several words is added once per word)
for(i in 1:nrow(unique_positive)) {
word_i <- unique_positive$word[i]   # grep needs a character pattern, not a tibble row
reviews_positive_line = grep(word_i, movie$Phrase, perl=TRUE)
if (i==1) {
positive_reviews = movie[reviews_positive_line,]
} else {
positive_reviews = rbind(movie[reviews_positive_line,], positive_reviews)
}
}
negative = sentiment_title %>%
filter(sentiment == "negative")
unique_negative = unique(negative["word"])
# Same approach for the negative words
for(i in 1:nrow(unique_negative)) {
word_i <- unique_negative$word[i]
reviews_negative_line = grep(word_i, movie$Phrase, perl=TRUE)
if (i==1) {
negative_reviews = movie[reviews_negative_line,]
} else {
negative_reviews = rbind(movie[reviews_negative_line,], negative_reviews)
}
}
# A review containing both positive and negative words is treated as positive
negative_reviews = negative_reviews %>%
filter(!(PhraseId %in% positive_reviews$PhraseId))
count_positive = count(positive_reviews)   # reviews collected via positive words
count_negative = count(negative_reviews)   # reviews collected via negative words only
inputmoviepositive <- movie %>%
filter(Sentiment == 4) %>%   # 4 = positive in the input labels
count()
inputmovienegative <- movie %>%
filter(Sentiment == 0) %>%   # 0 = negative in the input labels
count()
Method.List=c("Positive Reviews","Negative Reviews","Input positive Reviews","Input negative Reviews")
count=c(count_positive$n,count_negative$n,inputmoviepositive$n,inputmovienegative$n)
tbl <- data.frame(Method.List, count)
tbll<-kable(tbl, format = "html")
kable_styling(tbll, bootstrap_options = c("striped", "hover"))
| Method.List | count |
|---|---|
| Positive Reviews | 24 |
| Negative Reviews | 38 |
| Input positive Reviews | 24 |
| Input negative Reviews | 12 |
The above analysis shows almost twice as many negative words as positive words: 47 negative versus 25 positive.
Tying the words back to the reviews that contain them gives 24 positive and 38 negative reviews. Comparing against the labels in the input data (0 = negative, 4 = positive), the positive counts match exactly (24 vs. 24), but there is a large gap for negative reviews (38 vs. 12). This might be because the input labels also include 1 (somewhat negative) and 3 (somewhat positive), which we did not count, or because the Loughran-McDonald lexicon used here is finance-oriented and labels words differently than the original annotators did.
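As an aside, the grep loops above can be replaced by a join that keeps SentenceId through tokenization. A minimal sketch, assuming the same tidytext/tidyverse stack (this is an alternative approach, not the method used to produce the table above, and names like review_sentiment are ours):

# Sketch: classify each review by the balance of matched sentiment words
review_sentiment = movie %>%
  unnest_tokens(word, Phrase) %>%                         # keeps SentenceId alongside each word
  inner_join(get_sentiments("loughran"), by = "word") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(SentenceId, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(label = if_else(positive >= negative, "positive", "negative"))  # ties counted as positive
count(review_sentiment, label)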
ngram_Phrases = movie %>%
unnest_tokens(bigram, Phrase, token="ngrams", n=2)
ngram_Phrases
## # A tibble: 2,540 x 4
## PhraseId SentenceId Sentiment bigram
## <dbl> <dbl> <dbl> <chr>
## 1 1 1 1 a series
## 2 1 1 1 series of
## 3 1 1 1 of escapades
## 4 1 1 1 escapades demonstrating
## 5 1 1 1 demonstrating the
## 6 1 1 1 the adage
## 7 1 1 1 adage that
## 8 1 1 1 that what
## 9 1 1 1 what is
## 10 1 1 1 is good
## # … with 2,530 more rows
ngram_Phrases %>%
count(bigram, sort=TRUE)
## # A tibble: 2,277 x 2
## bigram n
## <chr> <int>
## 1 it s 16
## 2 of the 13
## 3 of a 12
## 4 the film 9
## 5 the movie 8
## 6 is a 6
## 7 on the 6
## 8 but it 5
## 9 for a 5
## 10 like a 5
## # … with 2,267 more rows
filtered_Phrases = ngram_Phrases %>%
separate(bigram, c("word1", "word2"), sep=" ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
filtered_Phrases
## # A tibble: 444 x 5
## PhraseId SentenceId Sentiment word1 word2
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 1 1 1 escapades demonstrating
## 2 1 1 1 occasionally amuses
## 3 64 2 4 quiet introspective
## 4 64 2 4 entertaining independent
## 5 64 2 4 worth seeking
## 6 82 3 1 ismail merchant
## 7 82 3 1 hard time
## 8 82 3 1 time sitting
## 9 117 4 3 positively thrilling
## 10 117 4 3 thrilling combination
## # … with 434 more rows
filtered_Phrases_united = filtered_Phrases %>%
unite(bigram, c("word1", "word2"), sep=" ")
filtered_Phrases_united %>%
count(bigram, sort=TRUE)
## # A tibble: 443 x 2
## bigram n
## <chr> <int>
## 1 romantic comedy 2
## 2 100 minute 1
## 3 13 rating 1
## 4 25 minutes 1
## 5 3000 guys 1
## 6 60 minutes 1
## 7 7 times 1
## 8 abel ferrara 1
## 9 aborted attempts 1
## 10 absolute joy 1
## # … with 433 more rows
bigram_count=filtered_Phrases %>%
count(word1, word2, sort=TRUE)
bigram_network = bigram_count %>%
filter( n > 1) %>%
graph_from_data_frame()
bigram_network
## IGRAPH e524527 DN-- 2 1 --
## + attr: name (v/c), n (e/n)
## + edge from e524527 (vertex names):
## [1] romantic->comedy
set.seed(1234)
ggraph(bigram_network, layout = "fr") +
geom_edge_link() +
geom_node_point() +
geom_node_text(aes(label = name), vjust = 1, hjust = 1)
The above output shows that repeated word pairs are almost entirely absent from this sample: "romantic comedy" is the only bigram that occurs more than once. Strictly speaking, this is a bigram co-occurrence network rather than a pairwise-correlation analysis.
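For true pairwise correlations between words (as named in the learning outcomes), the widyr package offers pairwise_cor. A minimal sketch, assuming widyr is installed (it was not loaded in the original analysis):

library(widyr)
# Sketch: phi coefficient between words co-occurring in the same review
word_cors = movie %>%
  unnest_tokens(word, Phrase) %>%
  anti_join(stop_words, by = "word") %>%
  group_by(word) %>%
  filter(n() >= 3) %>%   # correlations are unstable for very rare words
  ungroup() %>%
  pairwise_cor(word, SentenceId, sort = TRUE)
word_cors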
Based on the above analysis, the reviews are quite distinctive, with little shared vocabulary or repeated patterning between them; the same analysis would be more informative on a larger dataset.
Note also that in the sentiment analysis we compared the sentiments of the reviews identified by our analysis with the sentiment labels provided in the input data and found discrepancies for the negative reviews, which could likewise be due to the different lexicon and matching method used.