n this project, we have chosen the topic of YouTube videos. Specifically, what makes a viral/trending YouTube video? YouTube is single-handedly the largest video sharing website in the world, and every day it generates a list of the top trending videos based on a combination of factors measuring user interactions. In this project, we will be exploring aspects such as number of views, comments, likes, video category, channel subscribers, etc. on videos that YouTube has identified to be ‘trending’. We have analyzed whether these aspects contribute more or less to the popularity of a video, and identify the key factors that make a YouTube video viral. We have identified some key factors that are associated with the popularity of a YouTube video: number of views and comments, number of channel subscribers, number of likes vs dislikes, publish time, and category. These factors are supported by two studies by Professor Hema Yoganarasiham and Michiko Izawa.
Professor Hema Yoganarasiham from UC Davis suggests that one determining factor of the popularity of a video is the content creator. Videos are more likely to be viral if they are created by Influential people than people in a close-knit community, as they have more second- or third-degree followers, who play a big role in creating a “viral spread” (Yoganarasimhan 2012). Yoganarasiham also mentioned that video rating is another important factor - whether the rating is good or bad, or whether there is a rating at all.
Additionally, Michiko Izawa from the John Hopkins University suggests that the number of views, comments, and shares are also important factors that contributes to video popularity. She took an example of an incident that happened to a United Airlines passenger in 2008, who had his guitar broken by the airline. The video generated approximately 3 million views and 10,000 comments just days after it was posted. It was shared and viewed by so many people that United Airlines had to apologize to the content creator. This shows how the number of views, comments, and shares affect the level of impact and popularity of a video (Izawa 2010).
Our initial hypothesis states that a video would be trending if it is uploaded by a famous channel, has content related to new release of a movie/tv show, music, or sports. We used several visualizations to predict if our hypothesis is correct or not. We used various libraries for manipulating our data and creating the visualizations. We used “ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics” (H. Wickham 2016). The ‘lubridate’ package has a consistent and memorable syntax that makes working with dates easy (Grolemund and Wickham 2013). Word cloud shows the frequency of words in a document by varying the size of words in a visualization. Word clouds are great for a quick, qualitative view of your open-ended survey responses, collection of tweets, or website content (Banks et al. 2018). Flexibly restructure and aggregate data using just two functions: melt and ‘dcast’ (or ‘acast’)(H. Wickham and Wickham 2017).
library(ggplot2) #Plots
library(ggthemes) #Plots
library(dplyr) #Data manipulation
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate) #Date manipulation
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(ggcorrplot) #Correlation Plot
library(corpus) #Wordcloud
library(wordcloud) #Wordcloud
## Loading required package: RColorBrewer
library(tm) #Text mining for wordcloud
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
library(SnowballC) #Wordcloud
library(stringr) #Data manipulation
library(reshape2) #Data manipulation for correlation plot
library(bookdown) #include references
We included the following data sources for our analysis: * Trending YouTube Video Statistics * YouTube API category id * Channel Subscribers
Trending YouTube Video Statistics contains several months of data on daily trending YouTube videos for the USA, Great Britain, Germany, Canada, and France. We only used the data for trending videos in US to perform our analysis.
YouTube API category id contains the name corresponding to each of the category id. We used the category_id as the key to merge into our Trending YouTube Video Statistics data and create another column for category name.
Channel Subscribers contains names of top 5000 channels (sorted by subscriber count), number of uploads by each, subscriber count, and total number of views. We used this dataset to get subscriber count information for the channels in our data and create a new column ‘subscriber count’ in our dataset.
During the merging of datasets, some columns were duplicated which were then dropped to ensure the final dataset does not include unnecessary columns.
Video <-read.csv("C:/Users/nd911/OneDrive/Documents/USvideos.csv")
str(Video)
## 'data.frame': 23362 obs. of 16 variables:
## $ video_id : Factor w/ 4712 levels "-_jlqATo9eo",..: 328 286 565 3302 1349 1857 383 2885 2345 3887 ...
## $ trending_date : Factor w/ 117 levels "17.01.12","17.02.12",..: 14 14 14 14 14 14 14 14 14 14 ...
## $ title : Factor w/ 4774 levels "'Bachelor' Finale: Worst Breakup Ever? | The View",..: 4480 4100 3307 3015 1931 111 3451 188 3963 4670 ...
## $ channel_title : Factor w/ 1946 levels "12 News","1MILLION Dance Studio",..: 282 961 1450 662 1243 768 1477 403 4 1874 ...
## $ category_id : int 22 24 23 24 24 28 24 28 1 25 ...
## $ publish_time : Factor w/ 4645 levels "2006-07-23T08:24:11.000Z",..: 318 287 271 291 269 323 256 274 297 295 ...
## $ tags : Factor w/ 4484 levels "#MeToo|Grammys 2018|Janelle Monáe|Kesha",..: 3443 2215 3157 3281 3335 1780 3534 49 4102 4283 ...
## $ views : int 748374 2418783 3191434 343168 2095731 119180 2103417 817732 826059 256426 ...
## $ likes : int 57527 97185 146033 10172 132235 9763 15993 23663 3543 12654 ...
## $ dislikes : int 2966 6146 5339 666 1989 511 2445 778 119 1363 ...
## $ comment_count : int 15954 12703 8181 2146 17518 1434 1970 3432 340 2368 ...
## $ thumbnail_link : Factor w/ 4712 levels "https://i.ytimg.com/vi/-_jlqATo9eo/default.jpg",..: 328 286 565 3302 1349 1857 383 2885 2345 3887 ...
## $ comments_disabled : Factor w/ 2 levels "False","True": 1 1 1 1 1 1 1 1 1 1 ...
## $ ratings_disabled : Factor w/ 2 levels "False","True": 1 1 1 1 1 1 1 1 1 1 ...
## $ video_error_or_removed: Factor w/ 2 levels "False","True": 1 1 1 1 1 1 1 1 1 1 ...
## $ description : Factor w/ 4917 levels "","'A curious cat helps his owner with home improvements.'\\nWe're releasing a NEW BLACK & WHITE episode every wee"| __truncated__,..: 3480 3072 4564 4396 1850 4489 1128 1919 2098 1305 ...
Video$category_id <-factor(Video$category_id)
library(readxl)
youtube_category_id_list <- read_excel("~/youtube_category_id_list.xlsx")
#Merging the category dataset by category_id
Video1 <-merge(Video, youtube_category_id_list, by="category_id")
#Merging the subscribers dataset with origin Youtube trending data
subscribers <-read.csv("~/socialblade.csv")
#renaming variables to allow merging
colnames(subscribers)[colnames(subscribers)=="name"] <- "channel_title"
Video2 <-merge(subscribers, Video1, by="channel_title")
#dropping variables that are not needed
Video3 = subset(Video2, select = -c(views.x,link, uploads, rank, grade, video_id))
colnames(Video3)[colnames(Video3)=="views.y"] <- "views"
str(Video3)
## 'data.frame': 9714 obs. of 17 variables:
## $ channel_title : Factor w/ 4990 levels "#0","#LoMasPopular",..: 23 23 23 23 23 23 23 23 23 23 ...
## $ subscribers : Factor w/ 4712 levels "--","1,000,397",..: 4259 4259 4259 4259 4259 4259 4259 4259 4259 4259 ...
## $ category_id : Factor w/ 16 levels "1","2","10","15",..: 10 10 10 10 10 10 10 10 10 10 ...
## $ trending_date : Factor w/ 117 levels "17.01.12","17.02.12",..: 51 77 79 69 108 66 75 54 72 57 ...
## $ title : Factor w/ 4774 levels "'Bachelor' Finale: Worst Breakup Ever? | The View",..: 1455 4160 4160 1455 800 1455 4160 1455 4160 1455 ...
## $ publish_time : Factor w/ 4645 levels "2006-07-23T08:24:11.000Z",..: 4388 4573 4573 4388 4099 4388 4573 4388 4573 4388 ...
## $ tags : Factor w/ 4484 levels "#MeToo|Grammys 2018|Janelle Monáe|Kesha",..: 754 754 754 754 754 754 754 754 754 754 ...
## $ views : int 384249 352080 383303 992294 601159 896401 307249 513455 210802 607740 ...
## $ likes : int 26271 23638 25171 48246 27962 45386 21782 31505 17499 35180 ...
## $ dislikes : int 238 240 254 453 336 419 214 279 134 314 ...
## $ comment_count : int 540 850 857 854 444 821 762 623 644 672 ...
## $ thumbnail_link : Factor w/ 4712 levels "https://i.ytimg.com/vi/-_jlqATo9eo/default.jpg",..: 133 4241 4241 133 3440 133 4241 133 4241 133 ...
## $ comments_disabled : Factor w/ 2 levels "False","True": 1 1 1 1 1 1 1 1 1 1 ...
## $ ratings_disabled : Factor w/ 2 levels "False","True": 1 1 1 1 1 1 1 1 1 1 ...
## $ video_error_or_removed: Factor w/ 2 levels "False","True": 1 1 1 1 1 1 1 1 1 1 ...
## $ description : Factor w/ 4917 levels "","'A curious cat helps his owner with home improvements.'\\nWe're releasing a NEW BLACK & WHITE episode every wee"| __truncated__,..: 4349 2306 2306 4349 4348 4349 2306 4349 2306 4349 ...
## $ category_name : chr "Entertainment" "Entertainment" "Entertainment" "Entertainment" ...
We only used the US dataset which contains 23362 observations and 16 variables. Our final dataset includes following variables:
#converting variables to numeric
Video3$views <-as.numeric(Video3$views)
Video3$likes <-as.numeric(Video3$likes)
Video3$dislikes<-as.numeric(Video3$dislikes)
Video3$comment_count <-as.numeric(Video3$comment_count)
#converting variables to factor
Video3$channel_title <- as.factor(Video3$channel_title)
Video3$category_name <- as.factor(Video3$category_name)
Using the category ID in our dataset and the YouTube API categories data, a new column was created to match the category IDs to the category name. A bar chart was then created to determine the number of trending videos in each category. Our results indicate that entertainment and music are the top categories with most trending videos.
Video3.cat <- as.data.frame(sort(table(Video3$category_name), decreasing = TRUE))
names(Video3.cat) <- c("category_name", "count")
ggplot(Video3.cat[1:10,], aes(x = category_name, y = count, fill = factor(category_name))) + geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45,hjust = 1), legend.position = "none") +
scale_x_discrete(name = "category_name ",label = function(x) str_wrap(x, width = 15)) +
scale_y_continuous(name = "Number of videos") + labs(title = "Top Categories with Trending Videos")
We used ggplot bar graph to visualize the channels with most top trending videos. Our results indicate that ESPN, NBA, NFL and WWE are the channels with most top trending videos.
Video3.ch <- as.data.frame(sort(table(Video3$channel_title), decreasing = TRUE))
names(Video3.ch) <- c("channel_title", "count")
ggplot(Video3.ch[1:10,], aes(x = channel_title, y = count, fill = factor(channel_title))) + geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45,hjust = 1), legend.position = "none") + scale_x_discrete(name = "channel_title",label = function(x) str_wrap(x, width = 15)) +
scale_y_continuous(name = "Number of videos") + labs(title = "Top Channels with Trending Videos")
We used the wordcloud to determine the top 200 famous words (repeated atleast 30 times) for the title of trending videos. Our results indicate that Official, trailer, Video, new, game, show, espn, season are some of the most famous keywords that make a video trending.
vid_corpus1 = Corpus(VectorSource(Video3$title))
vid_corpus1 = tm_map(vid_corpus1, PlainTextDocument)
vid_corpus1 = tm_map(vid_corpus1, removePunctuation)
vid_corpus1 = tm_map(vid_corpus1, content_transformer(tolower))
vid_corpus1 = tm_map(vid_corpus1, removeNumbers)
vid_corpus1 = tm_map(vid_corpus1, stripWhitespace)
vid_corpus1 = tm_map(vid_corpus1, removeWords, stopwords("english"))
vid_corpus1 = tm_map(vid_corpus1, stemDocument)
wordcloud(vid_corpus1, max.words = 200, min.freq = 30, rot.per=0.35, colors=brewer.pal(8, "Dark2"), random.order = FALSE)
We used the wordcloud to determine the top 170 famous words (repeated atleast 30 times) for the description of trending videos. Our results indicate that Video, facebook, music, twitter, instagram, music, shows,new watch, channel are some of the most famous keywords in the description that make a video trending.
vid_corpus2 = Corpus(VectorSource(Video3$description))
vid_corpus2 = tm_map(vid_corpus2, PlainTextDocument)
vid_corpus2 = tm_map(vid_corpus2, removePunctuation)
vid_corpus2 = tm_map(vid_corpus2, content_transformer(tolower))
vid_corpus2 = tm_map(vid_corpus2, removeNumbers)
vid_corpus2 = tm_map(vid_corpus2, stripWhitespace)
vid_corpus2 = tm_map(vid_corpus2, removeWords, stopwords("english"))
vid_corpus2 = tm_map(vid_corpus2, stemDocument)
wordcloud(vid_corpus2, max.words = 170, min.freq = 30, rot.per=0.35, colors=brewer.pal(8, "Dark2"), random.order = FALSE)
We constructed a correlation plot to determine how our variables such as likes,comment_count, dislikes and views are related to each other. Using reshape library and ggplot we arrived at our corrgram. Our results indicate that comment_count and dislikes (0.83), comment_count and likes(0.67) have a high correlation coefficient. It could mean that people who are adding a comment to the video are also more inclined hit like or dislike icon. Also, the no. of views a video gets and the likes (0.83) are highly correlated which makes sense as the no. of views would always drive whether a viewer is going to like a video after viewing it.
corr_vid <- Video3[, c("views","likes","dislikes","comment_count")]
corr_vid$views <-as.numeric(corr_vid$views)
corr_vid$likes <-as.numeric(corr_vid$likes)
corr_vid$dislikes <-as.numeric(corr_vid$dislikes)
corr_vid$comment_count <-as.numeric(corr_vid$comment_count)
corr_vid1 <- round(cor(corr_vid),2)
reorder_corr_vid1 <- function(corr_vid1){
#Use correlation between variables as distance
dd <- as.dist((1-corr_vid1)/2)
hc <- hclust(dd)
corr_vid1 <-corr_vid1[hc$order, hc$order]
}
#Get upper triangle of the correlation matrix
get_upper_tri <- function(corr_vid1){
corr_vid1[lower.tri(corr_vid1)]<- NA
return(corr_vid1)}
#Get lower triangle of the correlation matrix
get_lower_tri<-function(corr_vid1){
corr_vid1[upper.tri(corr_vid1)] <- NA
return(corr_vid1)}
corr_vid1 <- reorder_corr_vid1(corr_vid1)
upper_tri <- get_upper_tri(corr_vid1)
# Melt the correlation matrix
corr_vid1<- melt(upper_tri, na.rm = TRUE)
ggplot(corr_vid1, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") + geom_text(aes(Var2, Var1, label = value), color = "black", size = 5) +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.major = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks = element_blank(),
legend.justification = c(1, 0),
legend.position = c(0.5, 0.8),
legend.direction = "horizontal")+
guides(fill = guide_colorbar(barwidth = 7, barheight = 1,title.position = "top", title.hjust = 0.5))
Our correlation heatmap indicated a high corelation between the no. of views a trending video get and the comments on the video. So we decided to view the relationship through a scatterplot. Our results confirm that there is a linear relationship between no. of times a video is viewed and commented on.
ggplot(data = Video3, aes(x = views, y = comment_count, color = views), na.rm=TRUE) +
geom_point(shape = 16, size = 4, show.legend = FALSE, alpha = .4) + xlab('No. of Views') + ylab('Comment count')+ scale_color_gradient(low = "#0091ff", high = "#f0650e")+ggtitle('No. of Views vs Comments')+ coord_cartesian(xlim =c(0, 1e+06), ylim = (c(0,25000)))+theme_minimal()+geom_smooth()
## `geom_smooth()` using method = 'gam'
## No. of dislikes vs Comment Count Similar to the views and comment count relationship, we ran a scatterplot for dislikes and comment count and found that the relationship is linear.
ggplot(data = Video3, aes(x = dislikes, y = comment_count, color = views), na.rm=TRUE) +
geom_point(shape = 8, size = 2, show.legend = FALSE) + xlab('No. of Dislikes') + ylab('Comment Count')+ scale_color_gradient(low = "#0091ff", high = "#f0650e")+ggtitle('No. of dislikes vs Comments')+coord_cartesian(xlim =c(0,50000 ), ylim = (c(0,50000)))+theme_classic()
In conclusion, the visualizations that we created using the dataset proved our hypothesis correct. Indeed, videos that are related to music, sports, and TV/movie trailers are more likely to trend on YouTube, as compared to other video categories like lifestyle, automobiles, non-profit, etc.
In addition our original hypothesis, we also uncovered new insights from the data analysis we conducted. Firstly, it appears that sharing the video on various other social media platforms really help a video to become trending. For example, some of the most frequently used words in video descriptions are Instagram, Facebook, Twitter. This means that these videos have been widely circulated on other popular platforms besides YouTube itself. Also, judging by the correlation analysis conducted, videos that garner any sort of interactions (comments, likes, dislikes), be it positive or negative, are more likely to become trending. Lastly, based on our time series analysis on the trending videos in January, videos that are published towards the end of the month seem to get more views and is more likely to trend.
These factors together contribute to creating a trending video. Of course, the factors that we have investigated in this study is by no means a comprehensive list. But given the allotted time and data available, we managed to identify several of the key factors.
Banks, George C, Haley M Woznyj, Ryan S Wesslen, and Roxanne L Ross. 2018. “A Review of Best Practice Recommendations for Text Analysis in R (and a User-Friendly App).” Journal of Business and Psychology. Springer, 1–15.
Grolemund, Maintainer Garrett, and Hadley Wickham. 2013. “Package ‘Lubridate’.” Citeseer.
Izawa, Michiko. 2010. “What Makes Viral Videos Viral?: Roles of Emotion, Impression, Utility, and Social Ties in Online Sharing Behavior.” PhD thesis, Citeseer.
Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer.
Wickham, Hadley, and Maintainer Hadley Wickham. 2017. “Package ‘Reshape’.”
Yoganarasimhan, Hema. 2012. “Impact of Social Network Structure on Content Propagation: A Study Using Youtube Data.” Quantitative Marketing and Economics 10 (1). Springer: 111–50.