As a millennial who grew up in NYC and has lived in London for over a year, I know social media is a huge part of our lives, and YouTube is one of its largest platforms.
My project looks at a wide sample of the most popular videos in five countries: Great Britain, the USA, France, Germany, and Canada. If you have ever gone on YouTube to look up a popular video a friend told you about, chances are you have seen a negative comment. But why is that? Are there actually more negative comments than positive ones? Are people really seeking out the most-liked videos to leave insensitive, mean, or degrading comments, or is this just a coincidence?
I believe the vast majority of comments left on popular YouTube videos are negative, and I think my research will show this. As part of the demonstration, I will attempt to train a "troll": a text generator for YouTube comments built with recurrent neural networks (RNNs). I believe this generator will in fact be forced to produce negative comments as a result of the training data I feed into the algorithm.
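To make that plan concrete, here is the rough skeleton I have in mind: a character-level generator with one LSTM layer, written against the keras R package. This is a sketch only; the placeholder corpus stands in for the real comment data loaded below, and none of it is run yet.
# Skeleton of the comment generator (sketch; assumes keras + a TensorFlow backend)
library(keras)

comments <- c("nice video", "worst video ever", "unsubscribed")  # placeholder corpus
corpus   <- tolower(paste(comments, collapse = " "))
chars    <- sort(unique(strsplit(corpus, "")[[1]]))
maxlen   <- 40  # characters of context the model sees per training example

# One LSTM over one-hot encoded character windows, predicting the next character
model <- keras_model_sequential() %>%
  layer_lstm(units = 128, input_shape = c(maxlen, length(chars))) %>%
  layer_dense(units = length(chars), activation = "softmax")

model %>% compile(loss = "categorical_crossentropy", optimizer = "adam")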
#Loading the packages I believe I will need
library(data.table)
library(dplyr)
library(DT)
library(lubridate)
library(ggplot2)
library(plotrix)
library(corrplot)
library(ggdendro)
library(ggrepel)
library(wordcloud)
library(tidytext)
library(stringr)
library(tm)
library(sentimentr)
library(RSentiment)
library(rjson)
library(SnowballC)
library(RColorBrewer)
library(syuzhet)
library(plotly)
library(tidyverse)
library(readr)
library(wordcloud2)
# Note: plotly masks dplyr::filter() and lubridate masks several
# data.table/base date helpers, so namespace-qualify those calls if needed.
#Loading the datasets needed for analysis.
us = read.csv(file = "C:/Users/Alex O/Documents/USvideos.csv/USvideos.csv")
gb = read.csv(file = "C:/Users/Alex O/Documents/GBvideos.csv/GBvideos.csv")
fr = read.csv(file = "C:/Users/Alex O/Documents/FRvideos.csv/FRvideos.csv")
de = read.csv(file = "C:/Users/Alex O/Documents/DEvideos.csv/DEvideos.csv")
ca = read.csv(file = "C:/Users/Alex O/Documents/CAvideos.csv/CAvideos.csv")
uscom = read.csv(file = "C:/Users/Alex O/Downloads/UScomments.csv/UScomments.csv")
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
## dec, : embedded nul(s) found in input
gbcom = read.csv(file = "C:/Users/Alex O/Downloads/GBcomments.csv/GBcomments.csv")
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
## dec, : EOF within quoted string
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
## dec, : embedded nul(s) found in input
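Those warnings flag raw NUL bytes (and one unterminated quote) in the comment files; read.csv pushes past them, but a cleaner approach would be to strip the NULs before parsing. A minimal sketch of that workaround, using only base R and not applied here:
# Read the file as raw bytes, drop NUL characters, then parse the cleaned text
path  <- "C:/Users/Alex O/Downloads/UScomments.csv/UScomments.csv"
bytes <- readBin(path, what = "raw", n = file.info(path)$size)
uscom <- read.csv(text = rawToChar(bytes[bytes != as.raw(0)]),
                  stringsAsFactors = FALSE)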
#Create a combined table with all the datasets
ytvids <- as.data.table(rbind(gb,fr,ca,us,de))
ytvids$trending_date <- ydm(ytvids$trending_date)
ytvids$publish_time <- ymd(substr(ytvids$publish_time,start = 1,stop = 10))
ytvids$dif_days <- ytvids$trending_date-ytvids$publish_time
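# Quick sanity check on the parsing above (an aside): ydm() implies
# trending_date is stored in year-day-month order, while publish_time is an
# ISO 8601 timestamp whose first ten characters are the calendar date.
ydm("17.14.11")                       # hypothetical sample value -> "2017-11-14"
summary(as.numeric(ytvids$dif_days))  # days between publishing and trending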
corrplot.mixed(corr = cor(ytvids[,c("category_id","views","likes","dislikes","comment_count"),with=F]))
There is a strong, direct correlation between views and likes. I expected a similar correlation between views and dislikes, but the relationship looks almost inverse. My assumption isn't looking so good.
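To put a number on that impression, the pairwise coefficient can be pulled straight out of the same columns (a quick aside, separate from the plot above):
cor(ytvids$views, ytvids$dislikes, use = "complete.obs")  # views vs. dislikes only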
#Create a table to see the most viewed videos
mostviewdvid <- ytvids[, .(Total_Views = round(max(views, na.rm = TRUE), digits = 2)),
                       by = .(title, thumbnail_link)][order(-Total_Views)]
mostviewdvid %>%
  arrange(-Total_Views) %>%
  top_n(10, wt = Total_Views) %>%
  select(title, Total_Views) %>%
  datatable(class = "nowrap hover row-border", escape = FALSE,
            options = list(dom = 't', scrollX = TRUE, autoWidth = TRUE))
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## http://rstudio.github.io/DT/server.html
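The warning above is DT being cautious about rendering the table client-side. Since only ten rows are wanted anyway, one workaround (a sketch, not necessarily the only fix) is to deduplicate and take exactly ten rows before handing the result to datatable():
# Guarantee a ten-row payload: collapse duplicate titles, then keep the top ten
mostviewdvid %>%
  distinct(title, .keep_all = TRUE) %>%
  arrange(-Total_Views) %>%
  head(10) %>%
  select(title, Total_Views) %>%
  datatable(class = "nowrap hover row-border", options = list(dom = 't'))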
#Create table that has videos with the most comments
mostcmtdvid <- ytvids[, .(Total_comments = round(max(comment_count, na.rm = TRUE), digits = 2)),
                      by = .(title, thumbnail_link)][order(-Total_comments)]
mostcmtdvid %>%
  arrange(-Total_comments) %>%
  top_n(10, wt = Total_comments) %>%
  select(title, Total_comments) %>%
  datatable(class = "nowrap hover row-border", escape = FALSE,
            options = list(dom = 't', scrollX = TRUE, autoWidth = TRUE))
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## http://rstudio.github.io/DT/server.html