Reddit Comments
Reddit is a social news, web content rating, and discussion website. Registered members submit content to the site such as links, text posts, and images, which other members then vote up or down. Posts are organized by subject into user-created boards called “subreddits”, which cover a wide variety of topics such as news, politics, science, movies, video games, music, books, sports, fitness, cooking, pets, and image sharing.
Members submit content on a wide range of topics, from news and politics to videos and games. Over the past years Reddit has become the 17th most visited website in the world, and the volume of content posted on it has been growing exponentially. The content posted by members reflects both positive and negative sentiments. Our goal is to perform sentiment analysis and determine whether a post is positive, negative, or neutral. This is interesting because Reddit is often perceived as a platform where bullying is common, so we want to explore the data from that angle.
The data is fetched from the Reddit API using the RedditExtractoR package, with the search terms “virus” and “news”; a sketch of such a pull is shown below.
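A minimal sketch of how such a pull could look, assuming the older RedditExtractoR interface (get_reddit(), which returns one row per comment with the post and comment fields used below); newer package versions replace this with find_thread_urls() and get_thread_content(), so the exact call depends on the installed version, and the argument values here are illustrative only.
library(RedditExtractoR)
# Hypothetical pull; page_threshold and wait_time are illustrative choices
virus_comments <- get_reddit(search_terms   = "virus",
                             page_threshold = 2,    # number of search result pages to crawl
                             sort_by        = "comments",
                             wait_time      = 2)    # polite delay between requests (seconds)
news_comments  <- get_reddit(search_terms = "news", page_threshold = 2)
# Combine both pulls and save them so the analysis below can read a single CSV
reddit_sentiments <- rbind(virus_comments, news_comments)
write.csv(reddit_sentiments, "reddit_sentiments.csv", row.names = FALSE)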
library(tidyverse)
library(tibble)
library(readr)
library(dplyr)
library(tidyr)
library(stringr)
library(data.table)
library(lubridate)
library(ggplot2)
library(tidytext)
library(tm)
library(ngram)
library(widyr)
library(igraph)
library(ggraph)
library(reshape2)
library(RColorBrewer)
library(wordcloud)
library(wordcloud2)
library(knitr)
library(kableExtra)
getwd()
## [1] "C:/Users/harsh/Downloads/R project"
file = "reddit_sentiments.csv"
df <- fread(file)
initialrow <- head(df, 1)
initialrow
## id structure post_date comm_date num_comments subreddit upvote_prop
## 1: 1 1 0013-03-15 0013-03-15 3345 worldnews 0.91
## post_score author user comment_score controversiality
## 1: 54849 bendertheoffender22 AwesomeAxolotl 3281 0
## comment
## 1: As we say in Germany: Stupidity must be punished. Apparently, sometimes, by court order.
## title
## 1: German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him articles proving the virus' existance and takes him to court where he's ordered to pay the full amount.
## post_text link domain
## 1: http://www.bbc.com/news/world-europe-31864218 bbc.com
## URL
## 1: http://www.reddit.com/r/worldnews/comments/2ywb5b/german_antivaccer_wagers_100000_to_anyone_proving/?ref=search_posts
nrow(df)
## [1] 11276
ncol(df)
## [1] 18
str(df)
## Classes 'data.table' and 'data.frame': 11276 obs. of 18 variables:
## $ id : chr "1" "2" "3" "4" ...
## $ structure : chr "1" "1_1" "1_1_1" "1_1_1_1" ...
## $ post_date : IDate, format: "0013-03-15" "0013-03-15" ...
## $ comm_date : IDate, format: "0013-03-15" "0013-03-15" ...
## $ num_comments : int 3345 3345 3345 3345 3345 3345 3345 3345 3345 3345 ...
## $ subreddit : chr "worldnews" "worldnews" "worldnews" "worldnews" ...
## $ upvote_prop : num 0.91 0.91 0.91 0.91 0.91 0.91 0.91 0.91 0.91 0.91 ...
## $ post_score : int 54849 54849 54849 54849 54849 54849 54849 54849 54849 54849 ...
## $ author : chr "bendertheoffender22" "bendertheoffender22" "bendertheoffender22" "bendertheoffender22" ...
## $ user : chr "AwesomeAxolotl" "weewolf" "[deleted]" "octatoan" ...
## $ comment_score : int 3281 1960 1002 122 126 58 64 3 3 4 ...
## $ controversiality: int 0 0 0 0 0 0 0 0 0 0 ...
## $ comment : chr "As we say in Germany: Stupidity must be punished. Apparently, sometimes, by court order." "Across the pond we say \"\"You can't fix stupid\"\". I think I like the german attitude better. " "[deleted]" "Not *erhoehen? Just asking." ...
## $ title : chr "German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him "| __truncated__ "German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him "| __truncated__ "German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him "| __truncated__ "German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him "| __truncated__ ...
## $ post_text : chr "" "" "" "" ...
## $ link : chr "http://www.bbc.com/news/world-europe-31864218" "http://www.bbc.com/news/world-europe-31864218" "http://www.bbc.com/news/world-europe-31864218" "http://www.bbc.com/news/world-europe-31864218" ...
## $ domain : chr "bbc.com" "bbc.com" "bbc.com" "bbc.com" ...
## $ URL : chr "http://www.reddit.com/r/worldnews/comments/2ywb5b/german_antivaccer_wagers_100000_to_anyone_proving/?ref=search_posts" "http://www.reddit.com/r/worldnews/comments/2ywb5b/german_antivaccer_wagers_100000_to_anyone_proving/?ref=search_posts" "http://www.reddit.com/r/worldnews/comments/2ywb5b/german_antivaccer_wagers_100000_to_anyone_proving/?ref=search_posts" "http://www.reddit.com/r/worldnews/comments/2ywb5b/german_antivaccer_wagers_100000_to_anyone_proving/?ref=search_posts" ...
## - attr(*, ".internal.selfref")=<externalptr>
# Drop link/identifier columns that are not needed for the text analysis
df$link <- NULL
df$domain <- NULL
df$URL <- NULL
df$structure <- NULL
# Derive day of week, month and year from the post date
df$dayofpost <- wday(df$post_date, label = T)
df$monthofpost <- month(df$post_date, abbr = T, label = T)
df$yearofpost <- year(df$post_date)
(head(df,1))
## id post_date comm_date num_comments subreddit upvote_prop post_score
## 1: 1 0013-03-15 0013-03-15 3345 worldnews 0.91 54849
## author user comment_score controversiality
## 1: bendertheoffender22 AwesomeAxolotl 3281 0
## comment
## 1: As we say in Germany: Stupidity must be punished. Apparently, sometimes, by court order.
## title
## 1: German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him articles proving the virus' existance and takes him to court where he's ordered to pay the full amount.
## post_text dayofpost monthofpost yearofpost
## 1: Fri Mar 13
# Convert the upvote proportion to a percentage and rename the column accordingly
df$upvote_prop <- df$upvote_prop * 100
names(df)[names(df) == 'upvote_prop'] <- 'upvote_pct'
colSums(is.na(df))
## id post_date comm_date num_comments
## 0 0 0 0
## subreddit upvote_pct post_score author
## 0 0 0 0
## user comment_score controversiality comment
## 0 0 0 0
## title post_text dayofpost monthofpost
## 0 0 0 0
## yearofpost
## 0
# Drop comments that were deleted
df <- df[df$comment != "[deleted]"]
summary(df)
## id post_date comm_date num_comments
## Length:10917 Min. :0001-08-16 Min. :0001-01-18 Min. : 3345
## Class :character 1st Qu.:0011-04-11 1st Qu.:0011-04-11 1st Qu.: 3557
## Mode :character Median :0017-01-20 Median :0018-01-20 Median : 4452
## Mean :0018-01-22 Mean :0017-10-14 Mean : 5408
## 3rd Qu.:0028-01-20 3rd Qu.:0027-01-16 3rd Qu.: 6405
## Max. :0031-12-17 Max. :0031-12-17 Max. :11016
##
## subreddit upvote_pct post_score author
## Length:10917 Min. : 80.00 Min. : 1 Length:10917
## Class :character 1st Qu.: 88.00 1st Qu.: 892 Class :character
## Mode :character Median : 91.00 Median : 1714 Mode :character
## Mean : 91.49 Mean :18289
## 3rd Qu.: 94.00 3rd Qu.:40524
## Max. :100.00 Max. :83359
##
## user comment_score controversiality comment
## Length:10917 Min. : -73.0 Min. :0.000000 Length:10917
## Class :character 1st Qu.: 4.0 1st Qu.:0.000000 Class :character
## Mode :character Median : 14.0 Median :0.000000 Mode :character
## Mean : 161.4 Mean :0.006962
## 3rd Qu.: 68.0 3rd Qu.:0.000000
## Max. :30026.0 Max. :1.000000
##
## title post_text dayofpost monthofpost
## Length:10917 Length:10917 Sun:3315 Jan :4250
## Class :character Class :character Mon:1426 Jun : 948
## Mode :character Mode :character Tue: 941 Apr : 940
## Wed:2360 May : 933
## Thu:1445 Feb : 915
## Fri:1430 Oct : 530
## Sat: 0 (Other):2401
## yearofpost
## Min. : 1.00
## 1st Qu.:11.00
## Median :17.00
## Mean :17.75
## 3rd Qu.:28.00
## Max. :31.00
##
ggplot(df[,.N,by=dayofpost],aes(x = dayofpost,y = N,fill=N,label=round(N,2)))+
geom_bar(stat = "identity")+labs(title="Posts by the weekday",subtitle="Number of Posts Per Week Day ")+xlab("Day")+ylab(NULL)+geom_text(size=5, vjust=1, color="white")
ggplot(df[,.N,by=monthofpost],aes(x = monthofpost,y = N,fill=N, label=round(N,2)))+
geom_bar(stat = "identity")+labs(title="Posts by the month",subtitle="Number of Posts Per Month")+xlab("Month")+ylab(NULL)+geom_text(size=5, vjust=1, color="white")
As is evident, January dominates the number of posts. It is also interesting to see the pattern: the number of posts stays at roughly 900 for February, April, May and June, while for the remaining months it is closer to 500. This is useful from a business perspective: knowing that January is the month with the highest number of posts, companies can plan product launches and advertising around this peak.
ggplot(df[,.N,by=yearofpost],aes(x = yearofpost,y = N,fill=N, label=round(N,3)))+
geom_bar(stat = "identity")+labs(title="Posts by year",subtitle="Number of Posts Per Year")+xlab("Year")+ylab(NULL)+geom_text(size=4, vjust=1, color="white")
#Remove URLs from comments
df_data <- df %>% mutate(comment = str_replace_all(comment, "http[s]?://\\S+", ""))
#Remove HTML line-break tags
df_data <- df_data %>% mutate(comment = str_replace_all(comment, "(<br />)+", ""))
#Remove the # in hashtags but keep the tag text
df_data <- df_data %>% mutate(comment = str_replace_all(comment, "#([^\\s]+)", "\\1"))
#Remove punctuation, numbers and special characters
df_data <- df_data %>% mutate(comment = str_replace_all(comment, "[^a-zA-Z\\s]", " "))
head(df_data)
## id post_date comm_date num_comments subreddit upvote_pct post_score
## 1: 1 0013-03-15 0013-03-15 3345 worldnews 91 54849
## 2: 2 0013-03-15 0013-03-15 3345 worldnews 91 54849
## 3: 4 0013-03-15 0013-03-15 3345 worldnews 91 54849
## 4: 6 0013-03-15 0013-03-15 3345 worldnews 91 54849
## 5: 8 0013-03-15 0013-03-15 3345 worldnews 91 54849
## 6: 9 0013-03-15 0013-03-15 3345 worldnews 91 54849
## author user comment_score controversiality
## 1: bendertheoffender22 AwesomeAxolotl 3281 0
## 2: bendertheoffender22 weewolf 1960 0
## 3: bendertheoffender22 octatoan 122 0
## 4: bendertheoffender22 octatoan 58 0
## 5: bendertheoffender22 piss_in_a_bottle 3 0
## 6: bendertheoffender22 LinkFixerBot 3 0
## comment
## 1: As we say in Germany: Stupidity must be punished. Apparently, sometimes, by court order.
## 2: Across the pond we say ""You can't fix stupid"". I think I like the german attitude better.
## 3: Not *erhoehen? Just asking.
## 4: I didn't know, thanks. I don't know the language as well as I'd like, and colloquial everyday usage is a big part of that . . . ;)
## 5: Wobei ich sagen würde, dass man selbst ""film'n"" durchgehen lassen kann. \r\n""Kannst'e das mal film'n?!"" hört sich für mich z.B. recht normal an ;)
## 6: Pretty sure it's the combination of m and n to m'n\r\n\r\nKomm'n wouldn't really work, neither would stemm'n
## title
## 1: German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him articles proving the virus' existance and takes him to court where he's ordered to pay the full amount.
## 2: German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him articles proving the virus' existance and takes him to court where he's ordered to pay the full amount.
## 3: German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him articles proving the virus' existance and takes him to court where he's ordered to pay the full amount.
## 4: German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him articles proving the virus' existance and takes him to court where he's ordered to pay the full amount.
## 5: German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him articles proving the virus' existance and takes him to court where he's ordered to pay the full amount.
## 6: German anti-vaccer wagers ¬ 100,000 to anyone proving the existance of the measles virus. Researcher sends him articles proving the virus' existance and takes him to court where he's ordered to pay the full amount.
## post_text dayofpost monthofpost yearofpost
## 1: Fri Mar 13
## 2: Fri Mar 13
## 3: Fri Mar 13
## 4: Fri Mar 13
## 5: Fri Mar 13
## 6: Fri Mar 13
get_cleaned_tokens <- function(df_data, redditname) {
  # Optionally restrict to a single subreddit; 'all' keeps every comment
  if (redditname != 'all') {
    df_data <- subset(df_data, subreddit == redditname)
  }
  # Tokenize comments into one word per row
  tokens <- df_data %>% unnest_tokens(output = word, input = comment)
  # Remove stop words
  sw <- get_stopwords()
  cleaned_tokens <- tokens %>% filter(!word %in% sw$word)
  # Remove tokens that start with a digit
  nums <- cleaned_tokens %>% filter(str_detect(word, "^[0-9]")) %>% select(word) %>% unique()
  cleaned_tokens <- cleaned_tokens %>% anti_join(nums, by = "word")
  # Show the word-frequency distribution on a log scale
  print(cleaned_tokens %>%
    count(word, sort = T) %>%
    rename(word_freq = n) %>%
    ggplot(aes(x = word_freq)) +
    geom_histogram(aes(y = ..count..), color = "black", fill = "blue", alpha = 0.3) +
    scale_x_continuous(breaks = c(0:5, 10, 100, 500, 10e3), trans = "log1p", expand = c(0, 0)) +
    scale_y_continuous(breaks = c(0, 100, 1000, 5e3, 10e3, 5e4, 10e4, 4e4), expand = c(0, 0)) +
    theme_bw())
  # Remove rare words (fewer than 10 occurrences)
  rare <- cleaned_tokens %>% count(word) %>% filter(n < 10) %>% select(word) %>% unique()
  cleaned_tokens <- cleaned_tokens %>% filter(!word %in% rare$word)
  # Remove stray single-letter and markup tokens left over after cleaning
  alpha_remove <- cleaned_tokens %>% filter(str_detect(word, "^(â|s|t|r|gt|http)$")) %>% select(word) %>% unique()
  cleaned_tokens <- cleaned_tokens %>% filter(!word %in% alpha_remove$word)
  return(cleaned_tokens)
}
# Use one color palette for all word clouds
pal <- brewer.pal(8, "Dark2")
# Word cloud for r/worldnews comments
cleaned_tokens <- get_cleaned_tokens(df_data, 'worldnews')
cleaned_tokens %>%
  count(word) %>%
  with(wordcloud(word, n, random.order = FALSE, max.words = 100, colors = pal))
# Word cloud for r/olympics comments
cleaned_tokens <- get_cleaned_tokens(df_data, 'olympics')
cleaned_tokens %>%
  count(word) %>%
  with(wordcloud(word, n, random.order = FALSE, max.words = 100, colors = pal))
# Word cloud for r/books comments
cleaned_tokens <- get_cleaned_tokens(df_data, 'books')
cleaned_tokens %>%
  count(word) %>%
  with(wordcloud(word, n, random.order = FALSE, max.words = 100, colors = pal))
# Word cloud for r/NintendoSwitch comments
cleaned_tokens <- get_cleaned_tokens(df_data, 'NintendoSwitch')
cleaned_tokens %>%
  count(word) %>%
  with(wordcloud(word, n, random.order = FALSE, max.words = 100, colors = pal))
# Word cloud for all comments combined
cleaned_tokens <- get_cleaned_tokens(df_data, 'all')
cleaned_tokens %>%
  count(word) %>%
  with(wordcloud(word, n, random.order = FALSE, max.words = 100, colors = pal))
get_sentiments("nrc")
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,891 more rows
get_sentiments("afinn")
## # A tibble: 2,477 x 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ... with 2,467 more rows
# Attach the NRC, Bing, and AFINN sentiment lexicons to each token
sent_reviews = cleaned_tokens %>%
  left_join(get_sentiments("nrc")) %>%
  rename(nrc = sentiment) %>%
  left_join(get_sentiments("bing")) %>%
  rename(bing = sentiment) %>%
  left_join(get_sentiments("afinn")) %>%
  rename(afinn = value)
bing_word_counts <- sent_reviews %>%
filter(!is.na(bing)) %>%
count(word, bing, sort = TRUE)
head(bing_word_counts,5)
## word bing n
## 1: good positive 2080
## 2: like positive 1335
## 3: bad negative 1105
## 4: pretty positive 1052
## 5: shit negative 768
bing_word_counts %>%
filter(n > 700) %>%
mutate(n = ifelse(bing == "negative", -n, n)) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = bing)) + geom_col() + coord_flip() + labs(y = "Sentiment Contribution")
df_data$comment<-as.character(df_data$comment)
bigrams <- df_data %>% unnest_tokens(bigram, comment, token = "ngrams", n = 2)
bigrams %>% count(bigram, sort = TRUE)
## bigram n
## 1: of the 1068
## 2: in the 998
## 3: to be 550
## 4: on the 481
## 5: to the 481
## ---
## 143327: zuck's livejournal 1
## 143328: zuckerbergs online 1
## 143329: zx3p203y7a3ytjk botw.jpg 1
## 143330: zy 1g 1
## 143331: zygote and 1
bigrams_separated <- bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
bigrams_filtered %>% count(word1, word2, sort = TRUE)
## word1 word2 n
## 1: <NA> <NA> 231
## 2: www.youtube.com watch 90
## 3: https www.youtube.com 71
## 4: 01f3c63c ip 62
## 5: ent 01f3c63c 62
## ---
## 31256: žu1e 1 1
## 31257: zuck's livejournal 1
## 31258: zuckerbergs online 1
## 31259: zx3p203y7a3ytjk botw.jpg 1
## 31260: zy 1g 1
bigram_united <- bigrams_filtered %>%
unite(bigram, word1, word2, sep = " ")
bigram_counts <- bigram_united %>%
count(bigram, sort = TRUE)
# Drop the first row, which is the "NA NA" bigram
bigram_counts <- bigram_counts[-1,]
# Plot the ten most frequent bigrams
bigram_counts %>% arrange(desc(n)) %>% head(10) %>%
  ggplot(aes(x = factor(bigram, levels = bigram), y = n)) +
  geom_bar(stat = "identity", fill = "#003E45") +
  labs(title = "Top 10 bigram words") +
  coord_flip()
word_counts_by_doc_id <- cleaned_tokens %>%
group_by(id) %>%
count(word, sort = TRUE)
review_dtm <- word_counts_by_doc_id %>%
cast_dtm(id, word, n)
review_dtm
## <<DocumentTermMatrix (documents: 498, terms: 2671)>>
## Non-/sparse entries: 93014/1237144
## Sparsity : 93%
## Maximal term length: 26
## Weighting : term frequency (tf)
library(topicmodels)
# Fit an 8-topic LDA model on the document-term matrix (fixed seed for reproducibility)
lda8 <- LDA(review_dtm, k = 8, control = list(seed = 1234))
terms(lda8, 20)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6
## [1,] "people" "one" "just" "just" "like" "one"
## [2,] "time" "like" "like" "like" "time" "people"
## [3,] "like" "just" "people" "people" "good" "get"
## [4,] "can" "can" "one" "https" "much" "think"
## [5,] "every" "get" "can" "get" "get" "know"
## [6,] "get" "even" "name" "now" "got" "really"
## [7,] "think" "amp" "get" "time" "see" "gt"
## [8,] "really" "know" "time" "read" "every" "good"
## [9,] "one" "read" "game" "really" "just" "even"
## [10,] "still" "still" "also" "say" "say" "virus"
## [11,] "much" "time" "now" "way" "new" "also"
## [12,] "even" "https" "ip" "bad" "right" "something"
## [13,] "virus" "someone" "know" "make" "go" "now"
## [14,] "got" "people" "every" "got" "way" "said"
## [15,] "man" "game" "lr" "even" "can" "still"
## [16,] "now" "probably" "aff" "actually" "know" "china"
## [17,] "back" "also" "ent" "going" "many" "see"
## [18,] "said" "year" "pri" "better" "game" "world"
## [19,] "already" "right" "really" "game" "believe" "can"
## [20,] "find" "something" "https" "china" "also" "thing"
## Topic 7 Topic 8
## [1,] "just" "just"
## [2,] "people" "people"
## [3,] "can" "can"
## [4,] "like" "know"
## [5,] "know" "like"
## [6,] "much" "got"
## [7,] "actually" "right"
## [8,] "well" "well"
## [9,] "never" "china"
## [10,] "really" "get"
## [11,] "make" "game"
## [12,] "go" "someone"
## [13,] "virus" "see"
## [14,] "even" "us"
## [15,] "world" "think"
## [16,] "time" "gt"
## [17,] "everyone" "shit"
## [18,] "probably" "make"
## [19,] "bad" "way"
## [20,] "thing" "going"
lda8_betas <- broom::tidy(lda8)
lda8_betas
## # A tibble: 21,368 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 u 0.00201
## 2 2 u 0.000668
## 3 3 u 0.0000653
## 4 4 u 0.000394
## 5 5 u 0.00253
## 6 6 u 0.000136
## 7 7 u 0.00206
## 8 8 u 0.000248
## 9 1 amp 0.00115
## 10 2 amp 0.00725
## # ... with 21,358 more rows
library(ggrepel)
terms_in_comments <- lda8_betas %>%
group_by(topic) %>%
top_n(5, beta) %>%
ungroup() %>%
arrange(topic, -beta)
terms_per_comments <- function(lda_model, num_words) {
topics_tidy <- tidy(lda_model, matrix = "beta")
top_terms <- topics_tidy %>%
group_by(topic) %>%
arrange(topic, desc(beta)) %>%
slice(seq_len(num_words)) %>%
arrange(topic, beta) %>%
mutate(row = row_number()) %>%
ungroup() %>%
mutate(topic = paste("Comment_Topic", topic, sep = " "))
title <- paste("Top Terms for", k, "Comment Topics")
comments_wordchart(top_terms, top_terms$term, title)
}
comments_wordchart <- function(data, input, title) {
data %>%
ggplot(aes(as.factor(row), 1, label = input, fill = factor(topic) )) +
geom_point(color = "transparent") +
geom_label_repel(nudge_x = .2,
direction = "y",
box.padding = 0.1,
segment.color = "transparent",
size = 3) +
facet_grid(~topic) +
theme_comments() +
theme(axis.text.y = element_blank(), axis.text.x = element_blank(),
axis.title.x = element_text(size = 9),
panel.grid = element_blank(), panel.background = element_blank(),
panel.border = element_rect("lightgray", fill = NA),
strip.text.x = element_text(size = 9)) +
labs(x = NULL, y = NULL, title = title) +
ggtitle(title) +
coord_flip()
}
theme_comments <- function()
{
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_blank(),
axis.ticks = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "none")
}
# Number of LDA topics, used in the chart title
k <- 8
terms_per_comments(lda8, 15)
tfidf <- word_counts_by_doc_id %>%
bind_tf_idf(word, id, n)
head(tfidf,5)
## # A tibble: 5 x 6
## # Groups: id [4]
## id word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 38 u 55 0.188 2.32 0.437
## 2 191 amp 50 0.0949 1.71 0.162
## 3 243 ip 38 0.0683 4.01 0.274
## 4 243 lr 38 0.0683 5.52 0.377
## 5 164 aff 37 0.0673 5.52 0.371
top_tfidf = tfidf %>%
group_by(id) %>%
arrange(desc(tf_idf)) %>%
top_n(3) %>% ungroup() %>%
arrange(id, -tf_idf)
head(top_tfidf,5)
## # A tibble: 5 x 6
## id word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 1 buffs 12 0.0203 2.88 0.0584
## 2 1 click 8 0.0135 2.38 0.0322
## 3 1 clicks 5 0.00846 3.73 0.0315
## 4 10 cell 2 0.0105 3.03 0.0318
## 5 10 front 2 0.0105 2.26 0.0237
library(dplyr)
# Split comments into sentences, score each with sentimentr,
# and keep sentences longer than one character
df_data %>%
  select(comment) %>%
  sentimentr::get_sentences() %>%
  sentimentr::sentiment() %>%
  mutate(characters = nchar(stripWhitespace(comment))) %>%
  filter(characters > 1) -> bounded_sentences
summary(bounded_sentences$sentiment)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.461737 -0.138675 0.000000 -0.006522 0.137500 2.044070
tfidf
## # A tibble: 93,014 x 6
## # Groups: id [498]
## id word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 38 u 55 0.188 2.32 0.437
## 2 191 amp 50 0.0949 1.71 0.162
## 3 243 ip 38 0.0683 4.01 0.274
## 4 243 lr 38 0.0683 5.52 0.377
## 5 164 aff 37 0.0673 5.52 0.371
## 6 164 ent 37 0.0673 5.52 0.371
## 7 164 ip 37 0.0673 4.01 0.270
## 8 164 lr 37 0.0673 5.52 0.371
## 9 164 name 37 0.0673 1.93 0.130
## 10 164 pri 37 0.0673 5.52 0.371
## # ... with 93,004 more rows
# Keep sentiment scores within [-1, 1] before computing the density
bounded_sentences %>% filter(between(sentiment, -1, 1)) -> bounded_sentences
sentiment_densities <- with(density(bounded_sentences$sentiment), data.frame(x, y))
ggplot(sentiment_densities, aes(x = x, y = y)) +
geom_line() +
geom_area(mapping = aes(x = ifelse(x >=0 & x<=1 , x, 0)), fill = "green") +
geom_area(mapping = aes(x = ifelse(x <=0 & x>=-1 , x, 0)), fill = "red") +
scale_y_continuous(limits = c(0,2.5)) +
theme_minimal(base_size = 16) +
labs(x = "Sentiment",
y = "",
title = "Distribution of Sentiment Across Reddit comments") +
theme(plot.title = element_text(hjust = 0.5),
axis.text.y=element_blank()) -> gg
plot(gg)
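To tie this back to the stated goal of labelling content as positive, negative or neutral, a minimal sketch is shown below. It works at the sentence level produced by sentimentr, and the zero cutoffs are an illustrative assumption rather than part of the analysis above.
# Label each scored sentence as positive, negative or neutral (illustrative cutoffs)
bounded_sentences %>%
  mutate(label = case_when(
    sentiment > 0 ~ "positive",
    sentiment < 0 ~ "negative",
    TRUE          ~ "neutral"
  )) %>%
  count(label)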
Tokenization cuts the corpus into smaller chunks known as tokens.
Removing stop words is necessary because the most frequent words in a language are stop words, which carry little meaning and must be filtered out to get useful results from the analysis.
Numbers are removed because this is a text analysis and they add no value here.
Bi-grams are sequences of two adjacent tokens (n = 2) and are used to analyse pairs of words that occur together.
Word correlations identify words that tend to appear together, which helps in finding related terms and in filtering out uncommon words; a sketch of this step is given below.
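Word correlations are not computed in the pipeline above; a minimal sketch of how they could be obtained with the already-loaded widyr package follows. It assumes the cleaned_tokens table produced by get_cleaned_tokens() and uses the comment thread id as the co-occurrence unit; the frequency and correlation cutoffs (50 and 0.15) are illustrative choices.
# Phi correlation between words that co-occur within the same thread (id)
word_cors <- cleaned_tokens %>%
  group_by(word) %>%
  filter(n() >= 50) %>%            # keep reasonably common words
  pairwise_cor(word, id, sort = TRUE)
# Words most strongly correlated with "virus"
word_cors %>%
  filter(item1 == "virus", correlation > 0.15)
These correlations could also be turned into a network plot with igraph and ggraph, complementing the word clouds shown earlier.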