As a millennial who grew up in NYC and has lived in London for over a year, I know social media is a huge part of our lives, and YouTube is one of its largest platforms.
My project looks at a wide sample of the most popular videos in five countries: Great Britain, the USA, France, Germany, and Canada. If you have ever gone on YouTube to look up a popular video a friend told you about, chances are you have seen a negative comment. But why is that? Are there actually more negative comments than positive ones? Are people really seeking out the most-liked videos to leave insensitive, mean, or degrading comments, or is this just a coincidence?
I believe the vast majority of comments left on popular YouTube videos are negative, and I think my research will show this. As part of the demonstration, I will attempt to train a "troll": a text generator for YouTube comments built with recurrent neural networks (RNNs). I believe this generator will in fact be forced to produce negative comments as a result of the training data I feed into the algorithm.
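To make that plan concrete, here is the rough skeleton I have in mind: a character-level generator with one LSTM layer, written against the keras R package. This is a sketch only; the placeholder corpus stands in for the real comment data loaded below, and none of it is run yet.
# Skeleton of the comment generator (sketch; assumes keras + a TensorFlow backend)
library(keras)

comments <- c("nice video", "worst video ever", "unsubscribed")  # placeholder corpus
corpus   <- tolower(paste(comments, collapse = " "))
chars    <- sort(unique(strsplit(corpus, "")[[1]]))
maxlen   <- 40  # characters of context the model sees per training example

# One LSTM over one-hot encoded character windows, predicting the next character
model <- keras_model_sequential() %>%
  layer_lstm(units = 128, input_shape = c(maxlen, length(chars))) %>%
  layer_dense(units = length(chars), activation = "softmax")

model %>% compile(loss = "categorical_crossentropy", optimizer = "adam")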
#Loading the packages I believe I will need
library(data.table)
library(dplyr)
library(DT)
library(lubridate)
library(ggplot2)
library(plotrix)
library(corrplot)
library(ggdendro)
library(ggrepel)
library(wordcloud)
library(tidytext)
library(stringr)
library(tm)
library(sentimentr)
library(RSentiment)
library(rjson)
library(SnowballC)
library(RColorBrewer)
library(syuzhet)
library(plotly)
library(tidyverse)
library(readr)
library(wordcloud2)
# Note: plotly masks dplyr::filter() and lubridate masks several
# data.table/base date helpers, so namespace-qualify those calls if needed.
#Loading the datasets needed for analysis.
us = read.csv(file = "C:/Users/Alex O/Documents/USvideos.csv/USvideos.csv")
gb = read.csv(file = "C:/Users/Alex O/Documents/GBvideos.csv/GBvideos.csv")
fr = read.csv(file = "C:/Users/Alex O/Documents/FRvideos.csv/FRvideos.csv")
de = read.csv(file = "C:/Users/Alex O/Documents/DEvideos.csv/DEvideos.csv")
ca = read.csv(file = "C:/Users/Alex O/Documents/CAvideos.csv/CAvideos.csv")
uscom = read.csv(file = "C:/Users/Alex O/Downloads/UScomments.csv/UScomments.csv")
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
## dec, : embedded nul(s) found in input
gbcom = read.csv(file = "C:/Users/Alex O/Downloads/GBcomments.csv/GBcomments.csv")
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
## dec, : EOF within quoted string
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
## dec, : embedded nul(s) found in input
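Those warnings flag raw NUL bytes (and one unterminated quote) in the comment files; read.csv pushes past them, but a cleaner approach would be to strip the NULs before parsing. A minimal sketch of that workaround, using only base R and not applied here:
# Read the file as raw bytes, drop NUL characters, then parse the cleaned text
path  <- "C:/Users/Alex O/Downloads/UScomments.csv/UScomments.csv"
bytes <- readBin(path, what = "raw", n = file.info(path)$size)
uscom <- read.csv(text = rawToChar(bytes[bytes != as.raw(0)]),
                  stringsAsFactors = FALSE)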
#Create a combined table with all the datasets
ytvids <- as.data.table(rbind(gb,fr,ca,us,de))
ytvids$trending_date <- ydm(ytvids$trending_date)
ytvids$publish_time <- ymd(substr(ytvids$publish_time,start = 1,stop = 10))
ytvids$dif_days <- ytvids$trending_date-ytvids$publish_time
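# Quick sanity check on the parsing above (an aside): ydm() implies
# trending_date is stored in year-day-month order, while publish_time is an
# ISO 8601 timestamp whose first ten characters are the calendar date.
ydm("17.14.11")                       # hypothetical sample value -> "2017-11-14"
summary(as.numeric(ytvids$dif_days))  # days between publishing and trending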
corrplot.mixed(corr = cor(ytvids[,c("category_id","views","likes","dislikes","comment_count"),with=F]))
There is a strong, direct correlation between views and likes. I expected a similar correlation between views and dislikes, but the relationship looks almost inverse. My assumption isn't looking so good.
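To put a number on that impression, the pairwise coefficient can be pulled straight out of the same columns (a quick aside, separate from the plot above):
cor(ytvids$views, ytvids$dislikes, use = "complete.obs")  # views vs. dislikes only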
#Create a table to see the most viewed videos
mostviewdvid <- ytvids[, .(Total_Views = round(max(views, na.rm = TRUE), digits = 2)),
                       by = .(title, thumbnail_link)][order(-Total_Views)]
mostviewdvid %>%
  arrange(-Total_Views) %>%
  top_n(10, wt = Total_Views) %>%
  select(title, Total_Views) %>%
  datatable(class = "nowrap hover row-border", escape = FALSE,
            options = list(dom = 't', scrollX = TRUE, autoWidth = TRUE))
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## http://rstudio.github.io/DT/server.html
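The warning above is DT being cautious about rendering the table client-side. Since only ten rows are wanted anyway, one workaround (a sketch, not necessarily the only fix) is to deduplicate and take exactly ten rows before handing the result to datatable():
# Guarantee a ten-row payload: collapse duplicate titles, then keep the top ten
mostviewdvid %>%
  distinct(title, .keep_all = TRUE) %>%
  arrange(-Total_Views) %>%
  head(10) %>%
  select(title, Total_Views) %>%
  datatable(class = "nowrap hover row-border", options = list(dom = 't'))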
#Create table that has videos with the most comments
mostcmtdvid <- ytvids[, .(Total_comments = round(max(comment_count, na.rm = TRUE), digits = 2)),
                      by = .(title, thumbnail_link)][order(-Total_comments)]
mostcmtdvid %>%
  arrange(-Total_comments) %>%
  top_n(10, wt = Total_comments) %>%
  select(title, Total_comments) %>%
  datatable(class = "nowrap hover row-border", escape = FALSE,
            options = list(dom = 't', scrollX = TRUE, autoWidth = TRUE))
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## http://rstudio.github.io/DT/server.html