Title-

Reddit Comments

Team members

Business context

Reddit is a social network, web content rating, and discussion website. Registered members submit content to the site such as links, text posts, and images, which are then voted up or down by other members. Posts are organized by subject into user-created boards called “subreddits”, which cover a variety of topics such as news, politics, science, movies, video games, music, books, sports, fitness, cooking, pets, and image sharing.

Problem description

Reddit members submit content about various topics such as news, politics, videos, and games. Over the past years Reddit has become the 17th most visited website in the world, and the volume of content on the site has been growing exponentially. The content posted by members reflects both positive and negative sentiments. Our goal is to perform sentiment analysis and determine whether a post is positive, negative, or neutral. This is interesting because Reddit is often perceived as a platform where bullying is common, and we want to explore whether the data supports that perception.

URL to download the dataset: https://drive.google.com/file/d/1FIKQcSUMX5oKe8m-2MBOMs2pqYzpF22P/view?usp=sharing

Describe the data source

The data was fetched from the Reddit API using the RedditExtractoR package, with the search terms “virus” and “news”.
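A minimal sketch of how such a pull could look with the older RedditExtractoR (v2.x) get_reddit() interface; the argument values, the two-step combine, and the output file name are illustrative assumptions rather than the exact script used to build the CSV.

library(RedditExtractoR)  # assumes the v2.x interface that exposes get_reddit()
library(data.table)

# Pull comment threads matching each search term; thresholds and wait times are illustrative
virus_comments <- get_reddit(search_terms = "virus", page_threshold = 2, wait_time = 2)
news_comments  <- get_reddit(search_terms = "news",  page_threshold = 2, wait_time = 2)

# Combine both pulls and save them as the CSV read later in this report
reddit_sentiments <- rbindlist(list(virus_comments, news_comments))
fwrite(reddit_sentiments, "reddit_sentiments.csv")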

Data summary-

Import Libraries

# Data import and wrangling
library(tidyverse)
library(tibble)
library(readr)
library(dplyr)
library(tidyr)
library(stringr)
library(data.table)
library(lubridate)
library(reshape2)

# Text mining
library(tidytext)
library(tm)
library(ngram)
library(widyr)

# Visualization
library(ggplot2)
library(RColorBrewer)
library(wordcloud)
library(wordcloud2)
library(igraph)
library(ggraph)

# Reporting
library(knitr)
library(kableExtra)

Set the Working Directory

getwd()
## [1] "C:/Users/harsh/Downloads/R project"

Read the data file

file = "reddit_sentiments.csv"
df <- fread(file)

Sample Data

initialrow <- head(df, 1)
initialrow
##    id structure  post_date  comm_date num_comments subreddit upvote_prop
## 1:  1         1 0013-03-15 0013-03-15         3345 worldnews        0.91
##    post_score              author           user comment_score controversiality
## 1:      54849 bendertheoffender22 AwesomeAxolotl          3281                0
##                                                                                                                       comment
## 1: As we say in Germany: Stupidity must be punished.                                   Apparently, sometimes, by court order.
##                                                                                                                                                                                                                     title
## 1: German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him articles proving the virus' existance and takes him to court where he's ordered to pay the full amount.
##    post_text                                          link  domain
## 1:           http://www.bbc.com/news/world-europe-31864218 bbc.com
##                                                                                                                      URL
## 1: http://www.reddit.com/r/worldnews/comments/2ywb5b/german_antivaccer_wagers_100000_to_anyone_proving/?ref=search_posts

Data Summary

1. Total number of rows and columns in the dataset

nrow(df)
## [1] 11276
ncol(df)
## [1] 18

2. Data types of the columns in the dataset

str(df)
## Classes 'data.table' and 'data.frame':   11276 obs. of  18 variables:
##  $ id              : chr  "1" "2" "3" "4" ...
##  $ structure       : chr  "1" "1_1" "1_1_1" "1_1_1_1" ...
##  $ post_date       : IDate, format: "0013-03-15" "0013-03-15" ...
##  $ comm_date       : IDate, format: "0013-03-15" "0013-03-15" ...
##  $ num_comments    : int  3345 3345 3345 3345 3345 3345 3345 3345 3345 3345 ...
##  $ subreddit       : chr  "worldnews" "worldnews" "worldnews" "worldnews" ...
##  $ upvote_prop     : num  0.91 0.91 0.91 0.91 0.91 0.91 0.91 0.91 0.91 0.91 ...
##  $ post_score      : int  54849 54849 54849 54849 54849 54849 54849 54849 54849 54849 ...
##  $ author          : chr  "bendertheoffender22" "bendertheoffender22" "bendertheoffender22" "bendertheoffender22" ...
##  $ user            : chr  "AwesomeAxolotl" "weewolf" "[deleted]" "octatoan" ...
##  $ comment_score   : int  3281 1960 1002 122 126 58 64 3 3 4 ...
##  $ controversiality: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ comment         : chr  "As we say in Germany: Stupidity must be punished.                                   Apparently, sometimes, by court order." "Across the pond we say \"\"You can't fix stupid\"\".  I think I like the german attitude better. " "[deleted]" "Not *erhoehen? Just asking." ...
##  $ title           : chr  "German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him "| __truncated__ "German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him "| __truncated__ "German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him "| __truncated__ "German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him "| __truncated__ ...
##  $ post_text       : chr  "" "" "" "" ...
##  $ link            : chr  "http://www.bbc.com/news/world-europe-31864218" "http://www.bbc.com/news/world-europe-31864218" "http://www.bbc.com/news/world-europe-31864218" "http://www.bbc.com/news/world-europe-31864218" ...
##  $ domain          : chr  "bbc.com" "bbc.com" "bbc.com" "bbc.com" ...
##  $ URL             : chr  "http://www.reddit.com/r/worldnews/comments/2ywb5b/german_antivaccer_wagers_100000_to_anyone_proving/?ref=search_posts" "http://www.reddit.com/r/worldnews/comments/2ywb5b/german_antivaccer_wagers_100000_to_anyone_proving/?ref=search_posts" "http://www.reddit.com/r/worldnews/comments/2ywb5b/german_antivaccer_wagers_100000_to_anyone_proving/?ref=search_posts" "http://www.reddit.com/r/worldnews/comments/2ywb5b/german_antivaccer_wagers_100000_to_anyone_proving/?ref=search_posts" ...
##  - attr(*, ".internal.selfref")=<externalptr>

Data Cleaning

Creating separate columns for the post’s day of week, month, and year, which will be used later in the visualizations

df$dayofpost <- wday(df$post_date,label = T)

df$monthofpost <- month(df$post_date,abbr = T,label = T)

df$yearofpost <- year(df$post_date)

(head(df,1))
##    id  post_date  comm_date num_comments subreddit upvote_prop post_score
## 1:  1 0013-03-15 0013-03-15         3345 worldnews        0.91      54849
##                 author           user comment_score controversiality
## 1: bendertheoffender22 AwesomeAxolotl          3281                0
##                                                                                                                       comment
## 1: As we say in Germany: Stupidity must be punished.                                   Apparently, sometimes, by court order.
##                                                                                                                                                                                                                     title
## 1: German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him articles proving the virus' existance and takes him to court where he's ordered to pay the full amount.
##    post_text dayofpost monthofpost yearofpost
## 1:                 Fri         Mar         13
df$upvote_prop <- df$upvote_prop*100
names(df)[names(df) == 'upvote_prop'] <- 'upvote_pct'

Checking for null values

colSums(is.na(df))
##               id        post_date        comm_date     num_comments 
##                0                0                0                0 
##        subreddit       upvote_pct       post_score           author 
##                0                0                0                0 
##             user    comment_score controversiality          comment 
##                0                0                0                0 
##            title        post_text        dayofpost      monthofpost 
##                0                0                0                0 
##       yearofpost 
##                0

Dropping rows where the comment is “[deleted]”, since they contain no usable text

df <- df[df$comment != "[deleted]"]

Summary of Dataset

summary(df)
##       id              post_date            comm_date           num_comments  
##  Length:10917       Min.   :0001-08-16   Min.   :0001-01-18   Min.   : 3345  
##  Class :character   1st Qu.:0011-04-11   1st Qu.:0011-04-11   1st Qu.: 3557  
##  Mode  :character   Median :0017-01-20   Median :0018-01-20   Median : 4452  
##                     Mean   :0018-01-22   Mean   :0017-10-14   Mean   : 5408  
##                     3rd Qu.:0028-01-20   3rd Qu.:0027-01-16   3rd Qu.: 6405  
##                     Max.   :0031-12-17   Max.   :0031-12-17   Max.   :11016  
##                                                                              
##   subreddit           upvote_pct       post_score       author         
##  Length:10917       Min.   : 80.00   Min.   :    1   Length:10917      
##  Class :character   1st Qu.: 88.00   1st Qu.:  892   Class :character  
##  Mode  :character   Median : 91.00   Median : 1714   Mode  :character  
##                     Mean   : 91.49   Mean   :18289                     
##                     3rd Qu.: 94.00   3rd Qu.:40524                     
##                     Max.   :100.00   Max.   :83359                     
##                                                                        
##      user           comment_score     controversiality     comment         
##  Length:10917       Min.   :  -73.0   Min.   :0.000000   Length:10917      
##  Class :character   1st Qu.:    4.0   1st Qu.:0.000000   Class :character  
##  Mode  :character   Median :   14.0   Median :0.000000   Mode  :character  
##                     Mean   :  161.4   Mean   :0.006962                     
##                     3rd Qu.:   68.0   3rd Qu.:0.000000                     
##                     Max.   :30026.0   Max.   :1.000000                     
##                                                                            
##     title            post_text         dayofpost   monthofpost  
##  Length:10917       Length:10917       Sun:3315   Jan    :4250  
##  Class :character   Class :character   Mon:1426   Jun    : 948  
##  Mode  :character   Mode  :character   Tue: 941   Apr    : 940  
##                                        Wed:2360   May    : 933  
##                                        Thu:1445   Feb    : 915  
##                                        Fri:1430   Oct    : 530  
##                                        Sat:   0   (Other):2401  
##    yearofpost   
##  Min.   : 1.00  
##  1st Qu.:11.00  
##  Median :17.00  
##  Mean   :17.75  
##  3rd Qu.:28.00  
##  Max.   :31.00  
## 

Data Visualization

1. Posts by weekday

ggplot(df[, .N, by = dayofpost], aes(x = dayofpost, y = N, fill = N, label = round(N, 2))) +
  geom_bar(stat = "identity") +
  labs(title = "Posts by the weekday", subtitle = "Number of Posts Per Week Day") +
  xlab("Day") +
  ylab(NULL) +
  geom_text(size = 5, vjust = 1, color = "white")

Observation-

  • Interestingly, the highest number of posts falls on a weekend day, Sunday, followed by Wednesday.

2. Posts by month

ggplot(df[, .N, by = monthofpost], aes(x = monthofpost, y = N, fill = N, label = round(N, 2))) +
  geom_bar(stat = "identity") +
  labs(title = "Posts by the month", subtitle = "Number of Posts Per Month") +
  xlab("Month") +
  ylab(NULL) +
  geom_text(size = 5, vjust = 1, color = "white")

Inference -

As is evident, January dominates the number of posts. It is also interesting that the monthly count stays around ~900 for February, April, May, and June, while for the remaining months it is closer to ~500. From a business perspective this is useful: knowing that January sees the highest posting activity, companies can time product launches and advertising to that peak.

3. Posts per day of the month

# yearofpost ranges from 1 to 31 here (the source dates appear to be stored day-first),
# so it effectively represents the day of the month
ggplot(df[, .N, by = yearofpost], aes(x = yearofpost, y = N, fill = N, label = round(N, 3))) +
  geom_bar(stat = "identity") +
  labs(title = "Posts per day by month", subtitle = "Number of given Posts") +
  xlab("The day of month") +
  ylab(NULL) +
  geom_text(size = 4, vjust = 1, color = "White")

Top 20 authors by number of posts

top_author<-df[,.N,by=author][order(-N)][1:20]
top_author
##                   author   N
##  1:          maxwellhill 491
##  2:       HugeDetective0 490
##  3:            accessirl 488
##  4:           saurabh24_ 488
##  5:         dorkprincess 487
##  6: Clavis_Apocalypticae 482
##  7:         deletetables 481
##  8:              fmaasnt 480
##  9:            muchonada 480
## 10:   CondescendingBench 479
## 11:            Pilgram94 479
## 12:         areascontrol 477
## 13:      tryin2immigrate 475
## 14:          shutdown924 470
## 15:   NintendoSwitchMods 469
## 16:      trixiethesalmon 469
## 17:             jdpatric 464
## 18:       Fun_With_Forks 464
## 19:  bendertheoffender22 460
## 20:   Network_operations 459

Top 20 authors by total post score

df[,.("Post_Score"=sum(post_score,na.rm=T)),by=author][order(-Post_Score)][1:20]
##                   author Post_Score
##  1:          maxwellhill   40929269
##  2:      tryin2immigrate   31593200
##  3:  bendertheoffender22   25230540
##  4:            accessirl   23420584
##  5:       HugeDetective0   23405340
##  6:           saurabh24_   19775712
##  7:            Pilgram94   13080532
##  8:            muchonada   12375360
##  9:   CondescendingBench    1714341
## 10: Clavis_Apocalypticae    1319716
## 11:          shutdown924    1098390
## 12:             jdpatric     795296
## 13:         deletetables     703703
## 14:   NintendoSwitchMods     618142
## 15:            hoosakiwi     604758
## 16:      trixiethesalmon     575463
## 17:              fmaasnt     527040
## 18:         areascontrol     425484
## 19:          newjackruby     351002
## 20:           maxwasson2     343211

Data Cleaning to get clean text

#Replacing the URLs (both explicit http(s) links and bare .com domains)
df_data <- df %>% mutate(comment = str_replace_all(comment, "(?:https?://\\S+)|(?:\\S+\\.com\\S*)", ""))

#Replacing HTML tags
df_data <- df_data %>% mutate(comment = str_replace_all(comment, "(<br />)+", ""))

#Remove # in hashtags (keep the hashtag text itself)
df_data <- df_data %>% mutate(comment = str_replace_all(comment, "#([^\\s]+)", "\\1"))

#Remove punctuation and special characters, keeping letters, digits and whitespace
df_data <- df_data %>% mutate(comment = str_replace_all(comment, "[^a-zA-Z0-9\\s]", " "))
head(df_data)
##    id  post_date  comm_date num_comments subreddit upvote_pct post_score
## 1:  1 0013-03-15 0013-03-15         3345 worldnews         91      54849
## 2:  2 0013-03-15 0013-03-15         3345 worldnews         91      54849
## 3:  4 0013-03-15 0013-03-15         3345 worldnews         91      54849
## 4:  6 0013-03-15 0013-03-15         3345 worldnews         91      54849
## 5:  8 0013-03-15 0013-03-15         3345 worldnews         91      54849
## 6:  9 0013-03-15 0013-03-15         3345 worldnews         91      54849
##                 author             user comment_score controversiality
## 1: bendertheoffender22   AwesomeAxolotl          3281                0
## 2: bendertheoffender22          weewolf          1960                0
## 3: bendertheoffender22         octatoan           122                0
## 4: bendertheoffender22         octatoan            58                0
## 5: bendertheoffender22 piss_in_a_bottle             3                0
## 6: bendertheoffender22     LinkFixerBot             3                0
##                                                                                                                                                   comment
## 1:                             As we say in Germany: Stupidity must be punished.                                   Apparently, sometimes, by court order.
## 2:                                                          Across the pond we say ""You can't fix stupid"".  I think I like the german attitude better. 
## 3:                                                                                                                            Not *erhoehen? Just asking.
## 4:                     I didn't know, thanks. I don't know the language as well as I'd like, and colloquial everyday usage is a big part of that . . . ;)
## 5: Wobei ich sagen würde, dass man selbst ""film'n"" durchgehen lassen kann. \r\n""Kannst'e das mal film'n?!"" hört sich für mich z.B. recht normal an ;)
## 6:                                           Pretty sure it's the combination of m and n to m'n\r\n\r\nKomm'n wouldn't really work, neither would stemm'n
##                                                                                                                                                                                                                     title
## 1: German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him articles proving the virus' existance and takes him to court where he's ordered to pay the full amount.
## 2: German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him articles proving the virus' existance and takes him to court where he's ordered to pay the full amount.
## 3: German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him articles proving the virus' existance and takes him to court where he's ordered to pay the full amount.
## 4: German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him articles proving the virus' existance and takes him to court where he's ordered to pay the full amount.
## 5: German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him articles proving the virus' existance and takes him to court where he's ordered to pay the full amount.
## 6: German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him articles proving the virus' existance and takes him to court where he's ordered to pay the full amount.
##    post_text dayofpost monthofpost yearofpost
## 1:                 Fri         Mar         13
## 2:                 Fri         Mar         13
## 3:                 Fri         Mar         13
## 4:                 Fri         Mar         13
## 5:                 Fri         Mar         13
## 6:                 Fri         Mar         13
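As a quick sanity check on the cleaning steps above, here is what they do to a single made-up comment (the string below is purely illustrative):

sample_comment <- "Check https://www.bbc.com/news <br /> and #breaking, it's 100% true!"

sample_comment %>%
  str_replace_all("(?:https?://\\S+)|(?:\\S+\\.com\\S*)", "") %>%   # strip URLs and bare .com domains
  str_replace_all("(<br />)+", "") %>%                              # strip HTML line breaks
  str_replace_all("#([^\\s]+)", "\\1") %>%                          # keep hashtag text, drop the '#'
  str_replace_all("[^a-zA-Z0-9\\s]", " ")                           # replace punctuation/special characters with spaces
# Result (extra whitespace aside): "Check and breaking it s 100 true"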

NLP procedure -

get_cleaned_tokens <- function(df_data, redditname) {
  # Optionally restrict to a single subreddit; 'all' keeps every comment
  if (redditname != 'all') {
    df_data <- subset(df_data, subreddit == redditname)
  }

  # Tokenize the comments into one word per row
  tokens <- df_data %>% unnest_tokens(output = word, input = comment)

  # Remove stop words
  sw <- get_stopwords()
  cleaned_tokens <- tokens %>% filter(!word %in% sw$word)

  # Remove tokens that start with a digit
  nums <- cleaned_tokens %>%
    filter(str_detect(word, "^[0-9]")) %>%
    select(word) %>% unique()
  cleaned_tokens <- cleaned_tokens %>% anti_join(nums, by = "word")

  # Exploratory: distribution of word frequencies (the object is built but not printed)
  cleaned_tokens %>%
    count(word, sort = T) %>%
    rename(word_freq = n) %>%
    ggplot(aes(x = word_freq)) +
    geom_histogram(aes(y = ..count..), color = "black", fill = "blue", alpha = 0.3) +
    scale_x_continuous(breaks = c(0:5, 10, 100, 500, 10e3), trans = "log1p", expand = c(0, 0)) +
    scale_y_continuous(breaks = c(0, 100, 1000, 5e3, 10e3, 5e4, 10e4, 4e4), expand = c(0, 0)) +
    theme_bw()

  # Remove rare words (appearing fewer than 10 times)
  rare <- cleaned_tokens %>%
    count(word) %>%
    filter(n < 10) %>%
    select(word) %>% unique()
  cleaned_tokens <- cleaned_tokens %>% filter(!word %in% rare$word)

  # Remove leftover single letters and markup fragments
  alpha_remove <- cleaned_tokens %>%
    filter(str_detect(word, "^(â|s|t|r|gt|http)$")) %>%
    select(word) %>% unique()
  cleaned_tokens <- cleaned_tokens %>% filter(!word %in% alpha_remove$word)

  return(cleaned_tokens)
}

Visualizing word cloud for world news

cleaned_tokens = get_cleaned_tokens(df_data,'worldnews')
pal <- brewer.pal(8,"Dark2")

Plotting the 100 most common words

cleaned_tokens %>%
  count(word) %>%
  with(wordcloud(word, n, random.order = FALSE, max.words = 100, colors=pal))

Visualizing word cloud for olympics

cleaned_tokens = get_cleaned_tokens(df_data,'olympics')
pal <- brewer.pal(8,"Dark2")

Plotting the 100 most common words

cleaned_tokens %>%
  count(word) %>%
  with(wordcloud(word, n, random.order = FALSE, max.words = 100, colors=pal))

Visualizing Word cloud for books

cleaned_tokens = get_cleaned_tokens(df_data,'books')
pal <- brewer.pal(8,"Dark2")

Plot the 100 most common words

cleaned_tokens %>%
  count(word) %>%
  with(wordcloud(word, n, random.order = FALSE, max.words = 100, colors=pal))

Visualizing Word cloud for NintendoSwitch

cleaned_tokens = get_cleaned_tokens(df_data,'NintendoSwitch')
pal <- brewer.pal(8,"Dark2")

Plotting the 100 most common words

cleaned_tokens %>%
  count(word) %>%
  with(wordcloud(word, n, random.order = FALSE, max.words = 100, colors=pal))

Building the word cloud for ‘all’ subreddits, so that the sentiment analysis below runs on the full dataset

cleaned_tokens = get_cleaned_tokens(df_data,'all')
pal <- brewer.pal(8,"Dark2")

Plotting the 100 most common words

cleaned_tokens %>%
  count(word) %>%
  with(wordcloud(word, n, random.order = FALSE, max.words = 100, colors=pal))

Sentiment analysis -

get_sentiments("nrc")
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,891 more rows
get_sentiments("afinn")
## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,467 more rows
sent_reviews = cleaned_tokens %>%   
  left_join(get_sentiments("nrc")) %>%  
  rename(nrc = sentiment) %>%  
  left_join(get_sentiments("bing")) %>%  
  rename(bing = sentiment) %>%  
  left_join(get_sentiments("afinn")) %>%  
  rename(afinn = value)

Most positive and negative words

bing_word_counts <- sent_reviews %>%  
  filter(!is.na(bing)) %>%  
  count(word, bing, sort = TRUE)
head(bing_word_counts,5)
##      word     bing    n
## 1:   good positive 2080
## 2:   like positive 1335
## 3:    bad negative 1105
## 4: pretty positive 1052
## 5:   shit negative  768

Plotting a graph for positive and negative words

bing_word_counts %>%
  filter(n > 700) %>%
  mutate(n = ifelse(bing == "negative", -n, n)) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = bing)) +
  geom_col() +
  coord_flip() +
  labs(y = "Sentiment Contribution")

Analysis

  • From the above graph we see that words such as “good”, “like”, and “pretty” contribute most to positive sentiment, while words such as “death”, “shit”, and “bad” contribute most to negative sentiment.

Tokenizing by n-grams

df_data$comment<-as.character(df_data$comment)

bigrams <- df_data %>%  unnest_tokens(bigram, comment, token = "ngrams", n = 2)

Most common bigrams

bigrams %>%  count(bigram, sort = TRUE)
##                           bigram    n
##      1:                   of the 1068
##      2:                   in the  998
##      3:                    to be  550
##      4:                   on the  481
##      5:                   to the  481
##     ---                              
## 143327:       zuck's livejournal    1
## 143328:       zuckerbergs online    1
## 143329: zx3p203y7a3ytjk botw.jpg    1
## 143330:                    zy 1g    1
## 143331:               zygote and    1

Filtering stop words out of the bigrams

bigrams_separated <- bigrams %>%  
  separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%  
  filter(!word1 %in% stop_words$word) %>%  
  filter(!word2 %in% stop_words$word)

Bi-gram counts

bigrams_filtered %>%   count(word1, word2, sort = TRUE)
##                  word1           word2   n
##     1:            <NA>            <NA> 231
##     2: www.youtube.com           watch  90
##     3:           https www.youtube.com  71
##     4:        01f3c63c              ip  62
##     5:             ent        01f3c63c  62
##    ---                                    
## 31256:            žu1e               1   1
## 31257:          zuck's     livejournal   1
## 31258:     zuckerbergs          online   1
## 31259: zx3p203y7a3ytjk        botw.jpg   1
## 31260:              zy              1g   1
bigram_united <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ")
bigram_counts <- bigram_united %>% 
  count(bigram, sort = TRUE)

bigram_counts <- bigram_counts[-1, ]  # drop the first row, which holds the <NA> <NA> bigrams

Visualizing top 10 bigram words-

bigram_counts %>%
  arrange(desc(n)) %>%
  head(10) %>%
  ggplot(aes(x = factor(bigram, levels = bigram), y = n)) +
  geom_bar(stat = "identity", fill = "#003E45") +
  labs(title = "Top 10 bigram words") +
  coord_flip()

Document term matrix

word_counts_by_doc_id <- cleaned_tokens %>%  
  group_by(id) %>%  
  count(word, sort = TRUE)

review_dtm <- word_counts_by_doc_id %>%  
  cast_dtm(id, word, n)

review_dtm
## <<DocumentTermMatrix (documents: 498, terms: 2671)>>
## Non-/sparse entries: 93014/1237144
## Sparsity           : 93%
## Maximal term length: 26
## Weighting          : term frequency (tf)
library(topicmodels)
lda8 <- LDA(review_dtm, k = 8, control = list(seed = 1234))
terms(lda8, 20)
##       Topic 1   Topic 2     Topic 3  Topic 4    Topic 5   Topic 6    
##  [1,] "people"  "one"       "just"   "just"     "like"    "one"      
##  [2,] "time"    "like"      "like"   "like"     "time"    "people"   
##  [3,] "like"    "just"      "people" "people"   "good"    "get"      
##  [4,] "can"     "can"       "one"    "https"    "much"    "think"    
##  [5,] "every"   "get"       "can"    "get"      "get"     "know"     
##  [6,] "get"     "even"      "name"   "now"      "got"     "really"   
##  [7,] "think"   "amp"       "get"    "time"     "see"     "gt"       
##  [8,] "really"  "know"      "time"   "read"     "every"   "good"     
##  [9,] "one"     "read"      "game"   "really"   "just"    "even"     
## [10,] "still"   "still"     "also"   "say"      "say"     "virus"    
## [11,] "much"    "time"      "now"    "way"      "new"     "also"     
## [12,] "even"    "https"     "ip"     "bad"      "right"   "something"
## [13,] "virus"   "someone"   "know"   "make"     "go"      "now"      
## [14,] "got"     "people"    "every"  "got"      "way"     "said"     
## [15,] "man"     "game"      "lr"     "even"     "can"     "still"    
## [16,] "now"     "probably"  "aff"    "actually" "know"    "china"    
## [17,] "back"    "also"      "ent"    "going"    "many"    "see"      
## [18,] "said"    "year"      "pri"    "better"   "game"    "world"    
## [19,] "already" "right"     "really" "game"     "believe" "can"      
## [20,] "find"    "something" "https"  "china"    "also"    "thing"    
##       Topic 7    Topic 8  
##  [1,] "just"     "just"   
##  [2,] "people"   "people" 
##  [3,] "can"      "can"    
##  [4,] "like"     "know"   
##  [5,] "know"     "like"   
##  [6,] "much"     "got"    
##  [7,] "actually" "right"  
##  [8,] "well"     "well"   
##  [9,] "never"    "china"  
## [10,] "really"   "get"    
## [11,] "make"     "game"   
## [12,] "go"       "someone"
## [13,] "virus"    "see"    
## [14,] "even"     "us"     
## [15,] "world"    "think"  
## [16,] "time"     "gt"     
## [17,] "everyone" "shit"   
## [18,] "probably" "make"   
## [19,] "bad"      "way"    
## [20,] "thing"    "going"
lda8_betas <- broom::tidy(lda8)
lda8_betas
## # A tibble: 21,368 x 3
##    topic term       beta
##    <int> <chr>     <dbl>
##  1     1 u     0.00201  
##  2     2 u     0.000668 
##  3     3 u     0.0000653
##  4     4 u     0.000394 
##  5     5 u     0.00253  
##  6     6 u     0.000136 
##  7     7 u     0.00206  
##  8     8 u     0.000248 
##  9     1 amp   0.00115  
## 10     2 amp   0.00725  
## # ... with 21,358 more rows
library(ggrepel)
terms_in_comments <- lda8_betas %>%  
  group_by(topic) %>%  
  top_n(5, beta) %>%  
  ungroup() %>%  
  arrange(topic, -beta)

terms_per_comments <- function(lda_model, num_words) {
  # Tidy the beta matrix (per-topic word probabilities) and keep the top terms per topic
  topics_tidy <- broom::tidy(lda_model, matrix = "beta")

  top_terms <- topics_tidy %>%
    group_by(topic) %>%
    arrange(topic, desc(beta)) %>%
    slice(seq_len(num_words)) %>%
    arrange(topic, beta) %>%
    mutate(row = row_number()) %>%
    ungroup() %>%
    mutate(topic = paste("Comment_Topic", topic, sep = " "))

  title <- paste("Top Terms for", lda_model@k, "Comment Topics")
  comments_wordchart(top_terms, top_terms$term, title)
}

comments_wordchart <- function(data, input, title) {
  data %>%
    ggplot(aes(as.factor(row), 1, label = input, fill = factor(topic))) +
    # Invisible points anchor the label positions
    geom_point(color = "transparent") +
    geom_label_repel(nudge_x = .2,
                     direction = "y",
                     box.padding = 0.1,
                     segment.color = "transparent",
                     size = 3) +
    facet_grid(~topic) +
    theme_comments() +
    theme(axis.text.y = element_blank(), axis.text.x = element_blank(),
          axis.title.x = element_text(size = 9),
          panel.grid = element_blank(), panel.background = element_blank(),
          panel.border = element_rect("lightgray", fill = NA),
          strip.text.x = element_text(size = 9)) +
    labs(x = NULL, y = NULL, title = title) +
    coord_flip()
}

theme_comments <- function() 
{
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_blank(), 
        axis.ticks = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = "none")
}

terms_per_comments(lda8, 15)

tf-idf

tfidf <- word_counts_by_doc_id %>%  
  bind_tf_idf(word, id, n) 

head(tfidf,5)
## # A tibble: 5 x 6
## # Groups:   id [4]
##   id    word      n     tf   idf tf_idf
##   <chr> <chr> <int>  <dbl> <dbl>  <dbl>
## 1 38    u        55 0.188   2.32  0.437
## 2 191   amp      50 0.0949  1.71  0.162
## 3 243   ip       38 0.0683  4.01  0.274
## 4 243   lr       38 0.0683  5.52  0.377
## 5 164   aff      37 0.0673  5.52  0.371

From tf-idf we can identify the words that are important to a particular document but relatively rare across the rest of the corpus
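As a sanity check on what bind_tf_idf() computes: tf is a word’s share of the tokens in its document, idf is the log of (number of documents / number of documents containing the word), and tf-idf is their product. A minimal sketch on a toy corpus (the tiny tibble below is made up for illustration):

# Toy corpus: three documents, a handful of word counts
toy_counts <- tibble(
  id   = c("d1", "d1", "d2", "d2", "d3"),
  word = c("virus", "news", "news", "games", "games"),
  n    = c(3, 1, 2, 2, 4)
)

toy_counts %>%
  bind_tf_idf(word, id, n) %>%
  arrange(desc(tf_idf))
# "virus" appears in only one of the three documents, so its idf = log(3/1) is the
# highest and it gets the largest tf-idf; "news" and "games" each appear in two of
# the three documents, so their idf = log(3/2) is much smaller.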

top_tfidf = tfidf %>%  
  group_by(id) %>%  
  arrange(desc(tf_idf)) %>%  
  top_n(3) %>% ungroup() %>%  
  arrange(id, -tf_idf)

head(top_tfidf,5)
## # A tibble: 5 x 6
##   id    word       n      tf   idf tf_idf
##   <chr> <chr>  <int>   <dbl> <dbl>  <dbl>
## 1 1     buffs     12 0.0203   2.88 0.0584
## 2 1     click      8 0.0135   2.38 0.0322
## 3 1     clicks     5 0.00846  3.73 0.0315
## 4 10    cell       2 0.0105   3.03 0.0318
## 5 10    front      2 0.0105   2.26 0.0237
df_data %>%
  select(comment) %>%
  sentimentr::get_sentences() %>%
  sentimentr::sentiment() %>%                               # sentence-level sentiment scores
  mutate(characters = nchar(stripWhitespace(comment))) %>%  # drop empty / whitespace-only sentences
  filter(characters > 1) -> bounded_sentences

summary(bounded_sentences$sentiment)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -2.461737 -0.138675  0.000000 -0.006522  0.137500  2.044070
tfidf
## # A tibble: 93,014 x 6
## # Groups:   id [498]
##    id    word      n     tf   idf tf_idf
##    <chr> <chr> <int>  <dbl> <dbl>  <dbl>
##  1 38    u        55 0.188   2.32  0.437
##  2 191   amp      50 0.0949  1.71  0.162
##  3 243   ip       38 0.0683  4.01  0.274
##  4 243   lr       38 0.0683  5.52  0.377
##  5 164   aff      37 0.0673  5.52  0.371
##  6 164   ent      37 0.0673  5.52  0.371
##  7 164   ip       37 0.0673  4.01  0.270
##  8 164   lr       37 0.0673  5.52  0.371
##  9 164   name     37 0.0673  1.93  0.130
## 10 164   pri      37 0.0673  5.52  0.371
## # ... with 93,004 more rows

Restricting sentiment scores to the range [-1, 1] to drop extreme outliers

bounded_sentences %>% filter(between(sentiment,-1,1)) ->  bounded_sentences
sentiment_densities <- with(density(bounded_sentences$sentiment), data.frame(x, y))

Visualizing the sentiment density for the Reddit comments

ggplot(sentiment_densities, aes(x = x, y = y)) +
  geom_line() +
  geom_area(mapping = aes(x = ifelse(x >=0 & x<=1 , x, 0)), fill = "green") +
  geom_area(mapping = aes(x = ifelse(x <=0 & x>=-1 , x, 0)), fill = "red") +
  scale_y_continuous(limits = c(0,2.5)) +
  theme_minimal(base_size = 16) +
  labs(x = "Sentiment", 
       y = "", 
       title = "Distribution of Sentiment Across Reddit comments") +
  theme(plot.title = element_text(hjust = 0.5), 
        axis.text.y=element_blank()) -> gg

plot(gg)

Observations-

  • In the above graph we can see that the sentiment scores are distributed roughly symmetrically around zero, with most comments close to neutral.
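The problem statement set out to label each post as positive, negative, or neutral. One way to get there from the continuous sentimentr scores is to average the sentence-level sentiment per comment and bucket the average with a small dead zone around zero; the ±0.05 thresholds below are illustrative assumptions, not values used elsewhere in this report.

comment_labels <- bounded_sentences %>%
  group_by(element_id) %>%                     # element_id identifies the original comment in sentimentr's output
  summarise(avg_sentiment = mean(sentiment, na.rm = TRUE)) %>%
  mutate(label = case_when(
    avg_sentiment >  0.05 ~ "positive",        # illustrative cut-off
    avg_sentiment < -0.05 ~ "negative",        # illustrative cut-off
    TRUE                  ~ "neutral"
  ))

table(comment_labels$label)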

NLP procedure summary-

Natural Language Processing allows machines to make sense of raw text. We applied the following steps:

  1. Tokenization cuts the corpus into smaller units (tokens), here individual words.

  2. Stop-word removal filters out the most frequent function words of the language, which carry little meaning for the analysis.

  3. Numbers were removed because they add no value to this text analysis.

  4. Bi-grams are sequences of two adjacent tokens (n = 2) and are used to analyse which word pairs occur together.

  5. Word correlations identify words that tend to co-occur, which can also help in pruning uncommon words; a minimal sketch follows after this list.
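Word correlations were not computed above, so here is a minimal sketch of how they could be obtained from the cleaned tokens with widyr::pairwise_cor(); the minimum word count of 50 and the use of the comment id as the co-occurrence unit are illustrative assumptions.

# Keep reasonably frequent words, then correlate words that co-occur within the same comment id
word_cors <- cleaned_tokens %>%
  group_by(word) %>%
  filter(n() >= 50) %>%
  ungroup() %>%
  pairwise_cor(word, id, sort = TRUE)   # phi coefficient of co-occurrence

head(word_cors, 10)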