Title-

Reddit Comments

Team members

Raja Atreya BVS
Harsh Desai
Parasuram Viswanath Jasty
Karim Budhwani

Business context

Reddit is a social network, web content rating and discussion website. Registered members submit content to the site such as links, text posts, and images, which are then voted up or down by other members. Posts are organized by subject into user-created boards called “subreddits”, which cover a variety of topics such as news, politics, science, movies, video games, music, books, sports, fitness, cooking, pets, and image-sharing

Problem description

Reddit members submit content about various topics such news politics, videos and games. Over the past years Reedit has become 17th most visited website in the world. The data in Reddit has been exponentially growing. The content posted by members reflects both positive and negative sentiments. Our goal is to do sentiment analysis and determine whether the post is positive, negative or neutral. This is interesting because reddit is usually perceived as a platform where people get bullied a lot. Hence we want to explore in this regard.

URL to download te dataset: https://drive.google.com/file/d/1FIKQcSUMX5oKe8m-2MBOMs2pqYzpF22P/view?usp=sharing

Describe the data source

The data will be fetched from the reddit API using RedditExtractoR package to fetch data from reddit API using the search terms “virus” and “news”.

Data summary-

The data contains comments on posts from various subreddits like worldnews, AskReddit, cemu, news, etc Number of columns - 18 Number of rows - 11277 We have dropped a few columns since they do not contribute in the sentiment analysis

Import Libraries

library(tidyverse)
library(tibble)
library(ggplot2)
library(readr)
library(dplyr)
library(tidyr)
library(tidytext)
library(RColorBrewer)
library(reshape2)
library(wordcloud)
library(igraph)
library(widyr)
library(ggraph)
library(ngram)
library(wordcloud2)
library(stringr)
library(ggplot2)
library(tm)
library(wordcloud)
library(knitr)
library(kableExtra)
library(tidytext)
library(tidyverse)
library(tibble)
library(ggplot2)
library(readr)
library(dplyr)
library(tidyr)
library(tidytext)
library(RColorBrewer)
library(reshape2)
library(wordcloud)
library(igraph)
library(widyr)
library(ggraph)
library(ngram)
library(wordcloud2)
library(stringr)
library(ggplot2)
library(tm)
library(wordcloud)
library(knitr)
library(kableExtra)
library(data.table)
library(lubridate)
library(wordcloud)

Set the Working Directory

getwd()

## [1] "C:/Users/harsh/Downloads/R project"

Read the data file

file = "reddit_sentiments.csv"
df <- fread(file)

Sample Data

initialrow <- head(df, 1)
initialrow

##    id structure  post_date  comm_date num_comments subreddit upvote_prop
## 1:  1         1 0013-03-15 0013-03-15         3345 worldnews        0.91
##    post_score              author           user comment_score controversiality
## 1:      54849 bendertheoffender22 AwesomeAxolotl          3281                0
##                                                                                                                       comment
## 1: As we say in Germany: Stupidity must be punished.                                   Apparently, sometimes, by court order.
##                                                                                                                                                                                                                     title
## 1: German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him articles proving the virus' existance and takes him to court where he's ordered to pay the full amount.
##    post_text                                          link  domain
## 1:           http://www.bbc.com/news/world-europe-31864218 bbc.com
##                                                                                                                      URL
## 1: http://www.reddit.com/r/worldnews/comments/2ywb5b/german_antivaccer_wagers_100000_to_anyone_proving/?ref=search_posts

Data Summary

1. Total number of rows and columns in the dataset

nrow(df)

## [1] 11276

ncol(df)

## [1] 18

2.Datatypes of columns in dataset

str(df)

## Classes 'data.table' and 'data.frame':   11276 obs. of  18 variables:
##  $ id              : chr  "1" "2" "3" "4" ...
##  $ structure       : chr  "1" "1_1" "1_1_1" "1_1_1_1" ...
##  $ post_date       : IDate, format: "0013-03-15" "0013-03-15" ...
##  $ comm_date       : IDate, format: "0013-03-15" "0013-03-15" ...
##  $ num_comments    : int  3345 3345 3345 3345 3345 3345 3345 3345 3345 3345 ...
##  $ subreddit       : chr  "worldnews" "worldnews" "worldnews" "worldnews" ...
##  $ upvote_prop     : num  0.91 0.91 0.91 0.91 0.91 0.91 0.91 0.91 0.91 0.91 ...
##  $ post_score      : int  54849 54849 54849 54849 54849 54849 54849 54849 54849 54849 ...
##  $ author          : chr  "bendertheoffender22" "bendertheoffender22" "bendertheoffender22" "bendertheoffender22" ...
##  $ user            : chr  "AwesomeAxolotl" "weewolf" "[deleted]" "octatoan" ...
##  $ comment_score   : int  3281 1960 1002 122 126 58 64 3 3 4 ...
##  $ controversiality: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ comment         : chr  "As we say in Germany: Stupidity must be punished.                                   Apparently, sometimes, by court order." "Across the pond we say \"\"You can't fix stupid\"\".  I think I like the german attitude better. " "[deleted]" "Not *erhoehen? Just asking." ...
##  $ title           : chr  "German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him "| __truncated__ "German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him "| __truncated__ "German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him "| __truncated__ "German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him "| __truncated__ ...
##  $ post_text       : chr  "" "" "" "" ...
##  $ link            : chr  "http://www.bbc.com/news/world-europe-31864218" "http://www.bbc.com/news/world-europe-31864218" "http://www.bbc.com/news/world-europe-31864218" "http://www.bbc.com/news/world-europe-31864218" ...
##  $ domain          : chr  "bbc.com" "bbc.com" "bbc.com" "bbc.com" ...
##  $ URL             : chr  "http://www.reddit.com/r/worldnews/comments/2ywb5b/german_antivaccer_wagers_100000_to_anyone_proving/?ref=search_posts" "http://www.reddit.com/r/worldnews/comments/2ywb5b/german_antivaccer_wagers_100000_to_anyone_proving/?ref=search_posts" "http://www.reddit.com/r/worldnews/comments/2ywb5b/german_antivaccer_wagers_100000_to_anyone_proving/?ref=search_posts" "http://www.reddit.com/r/worldnews/comments/2ywb5b/german_antivaccer_wagers_100000_to_anyone_proving/?ref=search_posts" ...
##  - attr(*, ".internal.selfref")=<externalptr>

Data Cleaning

1. We are Dropping Link, domain, URL and structure as these columns does not play any role in analyzing comments

df$link <- NULL
df$domain <- NULL
df$URL <- NULL
df$structure <- NULL

Creating seperate columns for POST’s Year, Months and Date which will be used later in visualizations

df$dayofpost <- wday(df$post_date,label = T)

df$monthofpost <- month(df$post_date,abbr = T,label = T)

df$yearofpost <- year(df$post_date)

(head(df,1))

##    id  post_date  comm_date num_comments subreddit upvote_prop post_score
## 1:  1 0013-03-15 0013-03-15         3345 worldnews        0.91      54849
##                 author           user comment_score controversiality
## 1: bendertheoffender22 AwesomeAxolotl          3281                0
##                                                                                                                       comment
## 1: As we say in Germany: Stupidity must be punished.                                   Apparently, sometimes, by court order.
##                                                                                                                                                                                                                     title
## 1: German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him articles proving the virus' existance and takes him to court where he's ordered to pay the full amount.
##    post_text dayofpost monthofpost yearofpost
## 1:                 Fri         Mar         13

df$upvote_prop <- df$upvote_prop*100

names(df)[names(df) == 'upvote_prop'] <- 'upvote_pct'

Checking the Null values

colSums(is.na(df))

##               id        post_date        comm_date     num_comments 
##                0                0                0                0 
##        subreddit       upvote_pct       post_score           author 
##                0                0                0                0 
##             user    comment_score controversiality          comment 
##                0                0                0                0 
##            title        post_text        dayofpost      monthofpost 
##                0                0                0                0 
##       yearofpost 
##                0

dropping the comment column as it has inappropriate data

df <- df[df$comment != "[deleted]"]

Summary of Dataset

summary(df)

##       id              post_date            comm_date           num_comments  
##  Length:10917       Min.   :0001-08-16   Min.   :0001-01-18   Min.   : 3345  
##  Class :character   1st Qu.:0011-04-11   1st Qu.:0011-04-11   1st Qu.: 3557  
##  Mode  :character   Median :0017-01-20   Median :0018-01-20   Median : 4452  
##                     Mean   :0018-01-22   Mean   :0017-10-14   Mean   : 5408  
##                     3rd Qu.:0028-01-20   3rd Qu.:0027-01-16   3rd Qu.: 6405  
##                     Max.   :0031-12-17   Max.   :0031-12-17   Max.   :11016  
##                                                                              
##   subreddit           upvote_pct       post_score       author         
##  Length:10917       Min.   : 80.00   Min.   :    1   Length:10917      
##  Class :character   1st Qu.: 88.00   1st Qu.:  892   Class :character  
##  Mode  :character   Median : 91.00   Median : 1714   Mode  :character  
##                     Mean   : 91.49   Mean   :18289                     
##                     3rd Qu.: 94.00   3rd Qu.:40524                     
##                     Max.   :100.00   Max.   :83359                     
##                                                                        
##      user           comment_score     controversiality     comment         
##  Length:10917       Min.   :  -73.0   Min.   :0.000000   Length:10917      
##  Class :character   1st Qu.:    4.0   1st Qu.:0.000000   Class :character  
##  Mode  :character   Median :   14.0   Median :0.000000   Mode  :character  
##                     Mean   :  161.4   Mean   :0.006962                     
##                     3rd Qu.:   68.0   3rd Qu.:0.000000                     
##                     Max.   :30026.0   Max.   :1.000000                     
##                                                                            
##     title            post_text         dayofpost   monthofpost  
##  Length:10917       Length:10917       Sun:3315   Jan    :4250  
##  Class :character   Class :character   Mon:1426   Jun    : 948  
##  Mode  :character   Mode  :character   Tue: 941   Apr    : 940  
##                                        Wed:2360   May    : 933  
##                                        Thu:1445   Feb    : 915  
##                                        Fri:1430   Oct    : 530  
##                                        Sat:   0   (Other):2401  
##    yearofpost   
##  Min.   : 1.00  
##  1st Qu.:11.00  
##  Median :17.00  
##  Mean   :17.75  
##  3rd Qu.:28.00  
##  Max.   :31.00  
##

Data Visualization

1. Posts by week days

ggplot(df[,.N,by=dayofpost],aes(x = dayofpost,y = N,fill=N,label=round(N,2)))+
  geom_bar(stat = "identity")+labs(title="Posts by the weekday",subtitle="Number of Posts Per Week Day ")+xlab("Day")+ylab(NULL)+geom_text(size=5, vjust=1, color="white")

Observation-

It’s interesting to see the highest number of posts that has been posted is on a weekend i.e Sunday followed by Wednesday

2. Post by Month

ggplot(df[,.N,by=monthofpost],aes(x = monthofpost,y = N,fill=N, label=round(N,2)))+
  geom_bar(stat = "identity")+labs(title="Posts by the month",subtitle="Number of Posts Per Month")+xlab("Month")+ylab(NULL)+geom_text(size=5, vjust=1, color="white")

Inference -

As is evident, January month dominates the number of posts.Its interesting to see the pattern i.e Number of posts is maintained ~900 for the months of February, April, May and June while for the rest of the months its ~500. This will be really useful from business perspective, as now companies knows that the first month of new year i.e January is the month for highest number of posts, hence companies can launch and advertise their products during this peak time.

3. Post per day in month

ggplot(df[,.N,by=yearofpost],aes(x = yearofpost,y = N,fill=N, label=round(N,3)))+
  geom_bar(stat = "identity")+labs(title="Posts per day by month",subtitle="Number of given Posts")+xlab("The day of month")+ylab(NULL)+geom_text(size=4, vjust=1, color="White")

TOP 20 Authors by most number of posts

top_author<-df[,.N,by=author][order(-N)][1:20]
top_author

##                   author   N
##  1:          maxwellhill 491
##  2:       HugeDetective0 490
##  3:            accessirl 488
##  4:           saurabh24_ 488
##  5:         dorkprincess 487
##  6: Clavis_Apocalypticae 482
##  7:         deletetables 481
##  8:              fmaasnt 480
##  9:            muchonada 480
## 10:   CondescendingBench 479
## 11:            Pilgram94 479
## 12:         areascontrol 477
## 13:      tryin2immigrate 475
## 14:          shutdown924 470
## 15:   NintendoSwitchMods 469
## 16:      trixiethesalmon 469
## 17:             jdpatric 464
## 18:       Fun_With_Forks 464
## 19:  bendertheoffender22 460
## 20:   Network_operations 459

Data Cleaning to get clean text

#Replacing the URLS

df_data <- df %>% mutate(comment = str_replace_all(comment, "//^(?:http(?:s)?:\\//\\//)?(?:[^\\.]+\\.)?[a-zA-Z0-9]\\.com(\\//.*)?$", ""))


#Replacing HTML tags
df_data <- df_data %>% mutate(comment = str_replace_all(comment, "(<br />)+", ""))


#Remove # in hashtags
df_data <- df_data %>% mutate(comment = str_replace_all(comment, "#([^\\s]+)", "\1"))

#Remove punctuations,numbers and special characters
df_data <- df_data %>% mutate(comment = str_replace_all(comment, "^[a-zA-Z0-9]*$", ""))

head(df_data)

##    id  post_date  comm_date num_comments subreddit upvote_pct post_score
## 1:  1 0013-03-15 0013-03-15         3345 worldnews         91      54849
## 2:  2 0013-03-15 0013-03-15         3345 worldnews         91      54849
## 3:  4 0013-03-15 0013-03-15         3345 worldnews         91      54849
## 4:  6 0013-03-15 0013-03-15         3345 worldnews         91      54849
## 5:  8 0013-03-15 0013-03-15         3345 worldnews         91      54849
## 6:  9 0013-03-15 0013-03-15         3345 worldnews         91      54849
##                 author             user comment_score controversiality
## 1: bendertheoffender22   AwesomeAxolotl          3281                0
## 2: bendertheoffender22          weewolf          1960                0
## 3: bendertheoffender22         octatoan           122                0
## 4: bendertheoffender22         octatoan            58                0
## 5: bendertheoffender22 piss_in_a_bottle             3                0
## 6: bendertheoffender22     LinkFixerBot             3                0
##                                                                                                                                                   comment
## 1:                             As we say in Germany: Stupidity must be punished.                                   Apparently, sometimes, by court order.
## 2:                                                          Across the pond we say ""You can't fix stupid"".  I think I like the german attitude better. 
## 3:                                                                                                                            Not *erhoehen? Just asking.
## 4:                     I didn't know, thanks. I don't know the language as well as I'd like, and colloquial everyday usage is a big part of that . . . ;)
## 5: Wobei ich sagen würde, dass man selbst ""film'n"" durchgehen lassen kann. \r\n""Kannst'e das mal film'n?!"" hört sich für mich z.B. recht normal an ;)
## 6:                                           Pretty sure it's the combination of m and n to m'n\r\n\r\nKomm'n wouldn't really work, neither would stemm'n
##                                                                                                                                                                                                                     title
## 1: German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him articles proving the virus' existance and takes him to court where he's ordered to pay the full amount.
## 2: German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him articles proving the virus' existance and takes him to court where he's ordered to pay the full amount.
## 3: German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him articles proving the virus' existance and takes him to court where he's ordered to pay the full amount.
## 4: German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him articles proving the virus' existance and takes him to court where he's ordered to pay the full amount.
## 5: German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him articles proving the virus' existance and takes him to court where he's ordered to pay the full amount.
## 6: German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him articles proving the virus' existance and takes him to court where he's ordered to pay the full amount.
##    post_text dayofpost monthofpost yearofpost
## 1:                 Fri         Mar         13
## 2:                 Fri         Mar         13
## 3:                 Fri         Mar         13
## 4:                 Fri         Mar         13
## 5:                 Fri         Mar         13
## 6:                 Fri         Mar         13

NLP procedure -

get_cleaned_tokens <- function(df_data,redditname) {
  if (redditname == 'all') {
    df_data <- df_data
  } else {
    df_data <- subset(df_data,subreddit == redditname)
  }

  tokens <- df_data %>% unnest_tokens(output = word, input = comment)
  tokens %>%  count(word,sort = TRUE)
  
  #get stop words
  sw = get_stopwords()
  cleaned_tokens <- tokens %>%  filter(!word %in% sw$word)

    nums <- cleaned_tokens %>% filter(str_detect(word, "^[0-9]")) %>% select(word) %>% unique()
  #head(nums)
  cleaned_tokens <- cleaned_tokens %>%   anti_join(nums, by = "word")
  #head(cleaned_tokens)

  cleaned_tokens %>%
    count(word, sort = T) %>%
    rename(word_freq = n) %>%
    ggplot(aes(x=word_freq)) +
    geom_histogram(aes(y=..count..), color="black", fill="blue", alpha=0.3) +
    scale_x_continuous(breaks=c(0:5,10,100,500,10e3), trans="log1p", expand=c(0,0)) +
    scale_y_continuous(breaks=c(0,100,1000,5e3,10e3,5e4,10e4,4e4), expand=c(0,0)) +
    theme_bw()
  
  rare <- cleaned_tokens %>%   count(word) %>%  filter(n<10) %>%  select(word) %>% unique()
  head(rare)
  
  rare <- cleaned_tokens %>%
    count(word) %>%
    filter(n<10) %>%
    select(word) %>% unique()

  alpha_remove <- cleaned_tokens %>% filter(str_detect(word, "^[Ã¢|s|t|r|gt|http]$")) %>%   select(word) %>% unique()
  
  
  cleaned_tokens <- cleaned_tokens %>%
    filter(!word %in% rare$word)
  length(unique(cleaned_tokens$word))
  
  cleaned_tokens <- cleaned_tokens %>% filter(!word %in% alpha_remove$word)
  return(cleaned_tokens)
  }

Visualizing word cloud for world news

cleaned_tokens = get_cleaned_tokens(df_data,'worldnews')

pal <- brewer.pal(8,"Dark2")

ploting the 100 most common words

cleaned_tokens %>%
  count(word) %>%
  with(wordcloud(word, n, random.order = FALSE, max.words = 100, colors=pal))

Visualizing word cloud for olympics

cleaned_tokens = get_cleaned_tokens(df_data,'olympics')

pal <- brewer.pal(8,"Dark2")

plot the 100 most common words

cleaned_tokens %>%
  count(word) %>%
  with(wordcloud(word, n, random.order = FALSE, max.words = 100, colors=pal))

Visualizing Word cloud for books

cleaned_tokens = get_cleaned_tokens(df_data,'books')

pal <- brewer.pal(8,"Dark2")

Plot the 100 most common words

cleaned_tokens %>%
  count(word) %>%
  with(wordcloud(word, n, random.order = FALSE, max.words = 100, colors=pal))

Visualizing Word cloud for NintendoSwitch

cleaned_tokens = get_cleaned_tokens(df_data,'NintendoSwitch')

pal <- brewer.pal(8,"Dark2")

plot the 100 most common words

cleaned_tokens %>%
  count(word) %>%
  with(wordcloud(word, n, random.order = FALSE, max.words = 100, colors=pal))

Getting word cloud for ‘all’ so that sentiment analysis will work on everything

cleaned_tokens = get_cleaned_tokens(df_data,'all')

pal <- brewer.pal(8,"Dark2")

plot the 100 most common words

cleaned_tokens %>%
  count(word) %>%
  with(wordcloud(word, n, random.order = FALSE, max.words = 100, colors=pal))

Sentiment analysis -

get_sentiments("nrc")

## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,891 more rows

get_sentiments("afinn")

## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,467 more rows

sent_reviews = cleaned_tokens %>%   
  left_join(get_sentiments("nrc")) %>%  
  rename(nrc = sentiment) %>%  
  left_join(get_sentiments("bing")) %>%  
  rename(bing = sentiment) %>%  
  left_join(get_sentiments("afinn")) %>%  
  rename(afinn = value)

Most positive and negative words

bing_word_counts <- sent_reviews %>%  
  filter(!is.na(bing)) %>%  
  count(word, bing, sort = TRUE)
head(bing_word_counts,5)

##      word     bing    n
## 1:   good positive 2080
## 2:   like positive 1335
## 3:    bad negative 1105
## 4: pretty positive 1052
## 5:   shit negative  768

Plotting a graph for positive and negative words

bing_word_counts %>%  
  filter(n > 700) %>% 
  mutate(n = ifelse(bing == "negative", -n, n)) %>%  
  mutate(word = reorder(word, n)) %>%  
  ggplot(aes(word, n, fill = bing)) +  geom_col() +  coord_flip() +  labs(y = "Sentiment Contribution")

Analysis

From the above graph we see that the words like good, like, pretty are positive words and the words like death, shit and bad are negative words.

Tokenizing by n-grams

df_data$comment<-as.character(df_data$comment)

bigrams <- df_data %>%  unnest_tokens(bigram, comment, token = "ngrams", n = 2)

common bigrams

bigrams %>%  count(bigram, sort = TRUE)

##                           bigram    n
##      1:                   of the 1068
##      2:                   in the  998
##      3:                    to be  550
##      4:                   on the  481
##      5:                   to the  481
##     ---                              
## 143327:       zuck's livejournal    1
## 143328:       zuckerbergs online    1
## 143329: zx3p203y7a3ytjk botw.jpg    1
## 143330:                    zy 1g    1
## 143331:               zygote and    1

Filtering by n-grams

bigrams_separated <- bigrams %>%  
  separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%  
  filter(!word1 %in% stop_words$word) %>%  
  filter(!word2 %in% stop_words$word)

Bi-gram counts

bigrams_filtered %>%   count(word1, word2, sort = TRUE)

##                  word1           word2   n
##     1:            <NA>            <NA> 231
##     2: www.youtube.com           watch  90
##     3:           https www.youtube.com  71
##     4:        01f3c63c              ip  62
##     5:             ent        01f3c63c  62
##    ---                                    
## 31256:            žu1e               1   1
## 31257:          zuck's     livejournal   1
## 31258:     zuckerbergs          online   1
## 31259: zx3p203y7a3ytjk        botw.jpg   1
## 31260:              zy              1g   1

bigram_united <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ")
bigram_counts <- bigram_united %>% 
  count(bigram, sort = TRUE)

bigram_counts=bigram_counts[-1,]

Visualizing top 10 bigram words-

bigram_counts %>% arrange(desc(n))%>% head(10)%>%ggplot(aes(x=factor(bigram,levels=bigram),y=n))+geom_bar(stat="identity",fill="#003E45")+labs(title="Top 10 bigram words")+coord_flip()

Document term matrix

word_counts_by_doc_id <- cleaned_tokens %>%  
  group_by(id) %>%  
  count(word, sort = TRUE)

review_dtm <- word_counts_by_doc_id %>%  
  cast_dtm(id, word, n)

review_dtm

## <<DocumentTermMatrix (documents: 498, terms: 2671)>>
## Non-/sparse entries: 93014/1237144
## Sparsity           : 93%
## Maximal term length: 26
## Weighting          : term frequency (tf)

library(topicmodels)
lda8 <- LDA(review_dtm, k = 8, control = list(seed = 1234))
terms(lda8, 20)

##       Topic 1   Topic 2     Topic 3  Topic 4    Topic 5   Topic 6    
##  [1,] "people"  "one"       "just"   "just"     "like"    "one"      
##  [2,] "time"    "like"      "like"   "like"     "time"    "people"   
##  [3,] "like"    "just"      "people" "people"   "good"    "get"      
##  [4,] "can"     "can"       "one"    "https"    "much"    "think"    
##  [5,] "every"   "get"       "can"    "get"      "get"     "know"     
##  [6,] "get"     "even"      "name"   "now"      "got"     "really"   
##  [7,] "think"   "amp"       "get"    "time"     "see"     "gt"       
##  [8,] "really"  "know"      "time"   "read"     "every"   "good"     
##  [9,] "one"     "read"      "game"   "really"   "just"    "even"     
## [10,] "still"   "still"     "also"   "say"      "say"     "virus"    
## [11,] "much"    "time"      "now"    "way"      "new"     "also"     
## [12,] "even"    "https"     "ip"     "bad"      "right"   "something"
## [13,] "virus"   "someone"   "know"   "make"     "go"      "now"      
## [14,] "got"     "people"    "every"  "got"      "way"     "said"     
## [15,] "man"     "game"      "lr"     "even"     "can"     "still"    
## [16,] "now"     "probably"  "aff"    "actually" "know"    "china"    
## [17,] "back"    "also"      "ent"    "going"    "many"    "see"      
## [18,] "said"    "year"      "pri"    "better"   "game"    "world"    
## [19,] "already" "right"     "really" "game"     "believe" "can"      
## [20,] "find"    "something" "https"  "china"    "also"    "thing"    
##       Topic 7    Topic 8  
##  [1,] "just"     "just"   
##  [2,] "people"   "people" 
##  [3,] "can"      "can"    
##  [4,] "like"     "know"   
##  [5,] "know"     "like"   
##  [6,] "much"     "got"    
##  [7,] "actually" "right"  
##  [8,] "well"     "well"   
##  [9,] "never"    "china"  
## [10,] "really"   "get"    
## [11,] "make"     "game"   
## [12,] "go"       "someone"
## [13,] "virus"    "see"    
## [14,] "even"     "us"     
## [15,] "world"    "think"  
## [16,] "time"     "gt"     
## [17,] "everyone" "shit"   
## [18,] "probably" "make"   
## [19,] "bad"      "way"    
## [20,] "thing"    "going"

lda8_betas <- broom::tidy(lda8)
lda8_betas

## # A tibble: 21,368 x 3
##    topic term       beta
##    <int> <chr>     <dbl>
##  1     1 u     0.00201  
##  2     2 u     0.000668 
##  3     3 u     0.0000653
##  4     4 u     0.000394 
##  5     5 u     0.00253  
##  6     6 u     0.000136 
##  7     7 u     0.00206  
##  8     8 u     0.000248 
##  9     1 amp   0.00115  
## 10     2 amp   0.00725  
## # ... with 21,358 more rows

library(ggrepel)
terms_in_comments <- lda8_betas %>%  
  group_by(topic) %>%  
  top_n(5, beta) %>%  
  ungroup() %>%  
  arrange(topic, -beta)

terms_per_comments <- function(lda_model, num_words) {


  topics_tidy <- tidy(lda_model, matrix = "beta")
  top_terms <- topics_tidy %>%
  group_by(topic) %>%
  arrange(topic, desc(beta)) %>%
  slice(seq_len(num_words)) %>%
  arrange(topic, beta) %>%
  mutate(row = row_number()) %>%
  ungroup() %>%
  mutate(topic = paste("Comment_Topic", topic, sep = " "))
  title <- paste("Top Terms for", k, "Comment Topics")
  comments_wordchart(top_terms, top_terms$term, title)
}

comments_wordchart <- function(data, input, title) {
  data %>%

  ggplot(aes(as.factor(row), 1, label = input, fill = factor(topic) )) +

  geom_point(color = "transparent") +

  geom_label_repel(nudge_x = .2,  
                   direction = "y",
                   box.padding = 0.1,
                   segment.color = "transparent",
                   size = 3) +
  facet_grid(~topic) +
  theme_comments() +
  theme(axis.text.y = element_blank(), axis.text.x = element_blank(),
        axis.title.x = element_text(size = 9),
        panel.grid = element_blank(), panel.background = element_blank(),
        panel.border = element_rect("lightgray", fill = NA),
        strip.text.x = element_text(size = 9)) +
  labs(x = NULL, y = NULL, title = title) +
    
  ggtitle(title) +
  coord_flip()
}

theme_comments <- function() 
{
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_blank(), 
        axis.ticks = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = "none")
}

k <-8
terms_per_comments(lda8,15)

tf-idf

tfidf <- word_counts_by_doc_id %>%  
  bind_tf_idf(word, id, n) 

head(tfidf,5)

## # A tibble: 5 x 6
## # Groups:   id [4]
##   id    word      n     tf   idf tf_idf
##   <chr> <chr> <int>  <dbl> <dbl>  <dbl>
## 1 38    u        55 0.188   2.32  0.437
## 2 191   amp      50 0.0949  1.71  0.162
## 3 243   ip       38 0.0683  4.01  0.274
## 4 243   lr       38 0.0683  5.52  0.377
## 5 164   aff      37 0.0673  5.52  0.371

From tf-idf we can understand the words that are important for the particular document but are not in the corpus

top_tfidf = tfidf %>%  
  group_by(id) %>%  
  arrange(desc(tf_idf)) %>%  
  top_n(3) %>% ungroup() %>%  
  arrange(id, -tf_idf)

head(top_tfidf,5)

## # A tibble: 5 x 6
##   id    word       n      tf   idf tf_idf
##   <chr> <chr>  <int>   <dbl> <dbl>  <dbl>
## 1 1     buffs     12 0.0203   2.88 0.0584
## 2 1     click      8 0.0135   2.38 0.0322
## 3 1     clicks     5 0.00846  3.73 0.0315
## 4 10    cell       2 0.0105   3.03 0.0318
## 5 10    front      2 0.0105   2.26 0.0237

library(dplyr)
df_data %>% 
  select(comment) %>% 
  sentimentr::get_sentences() %>% 
  sentimentr::sentiment() %>% 
  mutate(characters = nchar(stripWhitespace(comment))) %>% 
  filter(characters >1 ) -> bounded_sentences 

summary(bounded_sentences$sentiment)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -2.461737 -0.138675  0.000000 -0.006522  0.137500  2.044070

tfidf

## # A tibble: 93,014 x 6
## # Groups:   id [498]
##    id    word      n     tf   idf tf_idf
##    <chr> <chr> <int>  <dbl> <dbl>  <dbl>
##  1 38    u        55 0.188   2.32  0.437
##  2 191   amp      50 0.0949  1.71  0.162
##  3 243   ip       38 0.0683  4.01  0.274
##  4 243   lr       38 0.0683  5.52  0.377
##  5 164   aff      37 0.0673  5.52  0.371
##  6 164   ent      37 0.0673  5.52  0.371
##  7 164   ip       37 0.0673  4.01  0.270
##  8 164   lr       37 0.0673  5.52  0.371
##  9 164   name     37 0.0673  1.93  0.130
## 10 164   pri      37 0.0673  5.52  0.371
## # ... with 93,004 more rows

Removing positive and negative sentiments

bounded_sentences %>% filter(between(sentiment,-1,1)) ->  bounded_sentences
sentiment_densities <- with(density(bounded_sentences$sentiment), data.frame(x, y))

Visualizing sentiment densities for the movies

ggplot(sentiment_densities, aes(x = x, y = y)) +
  geom_line() +
  geom_area(mapping = aes(x = ifelse(x >=0 & x<=1 , x, 0)), fill = "green") +
  geom_area(mapping = aes(x = ifelse(x <=0 & x>=-1 , x, 0)), fill = "red") +
  scale_y_continuous(limits = c(0,2.5)) +
  theme_minimal(base_size = 16) +
  labs(x = "Sentiment", 
       y = "", 
       title = "Distribution of Sentiment Across Reddit comments") +
  theme(plot.title = element_text(hjust = 0.5), 
        axis.text.y=element_blank()) -> gg

plot(gg)

Observations-

In the above graph we can see that the sentiments are evenly distributed

NLP procedure summary-

Natural Language Processing allows machine to understand the text. We have implemented following in NLP:

Tokenization helps us to cut the corpus into smaller chunks known as tokens.
Removing stop words is necessary as the most frequent language words are stop words that must be filtered out in order to get useful meaning from data processing.
Removing numbers is done as it is text analysis, we removed numbers because the analysis does not add value.
Bi-grams, are a sequence of 2 adjacent elements of a string of tokens( n=2) and are used for text analysis.
Word correlations are used to find the correlated words and this will help in removing uncommon words.

Project-RedditComments

Raja Atreya BVS, Harsh Desai, Parasuram Viswanath Jasty, Karim Budhwani

12/9/2020

Title-

Team members

Business context

Problem description

URL to download te dataset: https://drive.google.com/file/d/1FIKQcSUMX5oKe8m-2MBOMs2pqYzpF22P/view?usp=sharing

Describe the data source

Data summary-

Import Libraries

Set the Working Directory

Read the data file

Sample Data

Data Summary

1. Total number of rows and columns in the dataset

2.Datatypes of columns in dataset

Data Cleaning

1. We are Dropping Link, domain, URL and structure as these columns does not play any role in analyzing comments

Creating seperate columns for POST’s Year, Months and Date which will be used later in visualizations

Checking the Null values

dropping the comment column as it has inappropriate data

Summary of Dataset

Data Visualization

1. Posts by week days

Observation-

2. Post by Month

Inference -

3. Post per day in month

TOP 20 Authors by most number of posts

TOP 20 Authors by post score

Data Cleaning to get clean text

NLP procedure -

Visualizing word cloud for world news

ploting the 100 most common words

Visualizing word cloud for olympics

plot the 100 most common words

Visualizing Word cloud for books

Plot the 100 most common words

Visualizing Word cloud for NintendoSwitch

plot the 100 most common words

Getting word cloud for ‘all’ so that sentiment analysis will work on everything

plot the 100 most common words

Sentiment analysis -

Most positive and negative words

Plotting a graph for positive and negative words

Analysis

Tokenizing by n-grams

common bigrams

Filtering by n-grams

Bi-gram counts

Visualizing top 10 bigram words-

Document term matrix

tf-idf

From tf-idf we can understand the words that are important for the particular document but are not in the corpus

Removing positive and negative sentiments

Visualizing sentiment densities for the movies

Observations-

NLP procedure summary-

Natural Language Processing allows machine to understand the text. We have implemented following in NLP: