Founded 14 years ago, YouTube, a video-sharing platform, is currently the second-most popular site in the world, only to its parent company Google. It is estimated that 300 hours of videos are uploaded to the site every minute and almost 5 billion videos are watched on YouTube every day of those videos posted, only a small fraction achieve “virality”.
We found our data set on kaggle https://www.kaggle.com/datasnaek/youtube-new.
Data: USvideos.csv contains 40949 observations with 16 variables. The observations are videos published between 2006 and 2018, which have all “trended” at some point. When a video is trending, it means that Youtube’s algorithm has deemed the video ‘relevant’ and thus promotes the video on its trending feed located on its home screen menu.
Note: The company does not disclose their algorithm on how it defines ‘trending’.
We will add a new categorical variable Viral as our response variable. Having gone Viral is defined where the video achieves more than 5 million views.
The predictor variables in the original data set are as follows:
video_id: unique id assigned to the video (will remove from analysis)
trending_date: the date that YouTube started promoting the video on its ‘trending’ feed.
title: video title
channel_title: author or publisher of the material
category_id: 16 levels of video category ID. They are described as follows:
| ID | Category Name | ID | Category Name |
|---|---|---|---|
| 1 | Film & Animation | 23 | Comedy |
| 2 | Autos & Vehicles | 24 | Entertainment |
| 10 | Music | 25 | News & Politics |
| 15 | Pets & Animals | 26 | Howto & Style |
| 17 | Sports | 27 | Education |
| 19 | Travel & Events | 28 | Science & Technology |
| 20 | Gaming | 29 | Nonprofits & Activism |
| 22 | People & Blogs | 43 | Shows |
publish_time: date and time when the video was published
tags: user-generated tags to improve SEO
views: total number of views the video received as of the last time it trended (will remove from analysis)
likes: total number of likes the video received as of the last time it trended
dislikes: total number of dislikes the video received as of the last time it trended
comments: total number of comments the video received as of the last time it trended
thumbnail_link: link to outside material (will remove from analysis)
comments_disabled: whether or not the uploader disabled comments
ratings_disabled: whether or not the uploader disabled ratings
video_error_or_removed: whether or not the content was removed or had an error
description: user-generated video description
Depending on the analysis requirements, we may add, delete, or modify attributes as necessary. We will call it out in our analysis.
Upon opening up the data, we quickly noticed that many videos had “duplicate” entries depending on which day the video had “trended”. So if the video was trending on multiple days, it would then show up on the dataset multiple times. Because we wanted to study what categories or key words would otherwise be correlated with having gone viral, we felt it necessary to remove the duplicate rows from our dataset and only keeping the row where the video last was trending and use the reported last accumulated view numbers.
## [1] 40949 16
n.unique <- length(unique(data.all$video_id))
n.unique # of the 40949 videos, only 6351 unique values## [1] 6351
data <- data.all %>% dplyr:: arrange(desc(views)) %>%
group_by(video_id) %>%
slice(1) %>%
ungroup(video_id)
dim(data)## [1] 6351 16
We need to remove that as well as categorical variables with no predictive power (video_id, thumbnail_link). We are also removing description because key words should have been put into tags and title.
Now we will need to define viral as amassing more than 5 million views; this will be our response variable (categorical).
| Code | Description |
|---|---|
| 0 | Less than 5 million views |
| 1 | More than 5 million views |
We then remove views as a predictor.
data$viral <- c(0)
data$viral[data$views >= 5000000] <- 1
data$viral <- as.factor(data$viral)
data <- data %>% dplyr::select(-views)
str(data)## Classes 'tbl_df', 'tbl' and 'data.frame': 6351 obs. of 13 variables:
## $ trending_date : chr "18.22.02" "18.11.06" "18.01.02" "18.01.05" ...
## $ title : chr "Padma Lakshmi On A #TopChefâ\200\231s Cancer Diagnosis | WWHL" "Mindy Kaling's Daughter Had the Perfect Reaction to Entering Oprah's House" "Megan Mullally Didn't Notice the Interesting Pattern with Ellen's Roommates" "Cast of Avengers: Infinity War Draws Their Characters" ...
## $ channel_title : chr "Watch What Happens Live with Andy Cohen" "TheEllenShow" "TheEllenShow" "Jimmy Kimmel Live" ...
## $ category_id : int 24 24 24 23 22 10 25 27 17 10 ...
## $ publish_time : chr "2018-02-15T04:30:12.000Z" "2018-06-04T13:00:00.000Z" "2018-01-29T14:00:39.000Z" "2018-04-27T07:30:02.000Z" ...
## $ tags : chr "What What Happens live|reality|interview|fun|celebrity|Andy Cohen|talk|show|program|Bravo|Watch What Happens Li"| __truncated__ "ellen|ellen degeneres|the ellen show|ellentube|ellen audience|season 15 episode 165|mindy kaling|mindy kaling b"| __truncated__ "megan mullally|megan|mullally|will and grace|karen on will and grace|actress|nick offerman|Ellen|degeneres|elle"| __truncated__ "jimmy|jimmy kimmel|jimmy kimmel live|late night|talk show|funny|comedic|comedy|clip|comedian|mean tweets|Benedi"| __truncated__ ...
## $ likes : int 136 9773 4429 41248 7734 41016 3788 460 12984 129381 ...
## $ dislikes : int 33 332 54 580 212 1642 603 27 383 1522 ...
## $ comment_count : int 24 423 94 1484 846 977 3093 20 714 8757 ...
## $ comments_disabled : chr "False" "False" "False" "False" ...
## $ ratings_disabled : chr "False" "False" "False" "False" ...
## $ video_error_or_removed: chr "False" "False" "False" "False" ...
## $ viral : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
Proportion of viral videos:
##
## 0 1
## 0.92883011 0.07116989
Only 7.1% of the videos in our population of 6351 trending videos had gone viral (having more than 5 million views). Now we need to clean up the data in order to figure out potential variables that could predict going ‘viral’.
We see that we have to change the vector category of several attributes into something we can analyze.
trending_date and publish_time)##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
data$trending_date <- format(as.Date(data$trending_date, format = "%Y.%d.%m"), "20%y-%m-%d")
data$trending_date <- strptime(data$trending_date, format = "%Y-%m-%d")
data$publish_time <- strptime(data$publish_time, format = "%Y-%m-%d")
data$trending_lag <- as.numeric(data$trending_date - data$publish_time)/(60*60*24) #create new predictor variable "lag"; turn lag into days
data$weekdays <- weekdays(as.Date(data$publish_time)) # get weekdays for each upload
data$months <- months(as.Date(data$publish_time)) # get months Comment: It would appear the the majority of trending videos are rarely published during the summer and early fall months.
prop.table(table(data$viral, data$weekdays), 2) # prop of the columns
prop.table(table(data$viral, data$weekdays), 1) # prop of the rows##
## Friday Monday Saturday Sunday Thursday Tuesday
## 0 0.90353391 0.93502538 0.93601463 0.92307692 0.91477273 0.94245283
## 1 0.09646609 0.06497462 0.06398537 0.07692308 0.08522727 0.05754717
##
## Wednesday
## 0 0.94712853
## 1 0.05287147
##
## Friday Monday Saturday Sunday Thursday Tuesday
## 0 0.16036616 0.15612816 0.08679437 0.08747245 0.16375657 0.16935074
## 1 0.22345133 0.14159292 0.07743363 0.09513274 0.19911504 0.13495575
##
## Wednesday
## 0 0.17613155
## 1 0.12831858
prop.table(table(data$viral, data$months), 2) # prop of the columns
prop.table(table(data$viral, data$months), 1) # prop of the rows##
## April August December February January July
## 0 0.86753731 1.00000000 0.95626072 0.94070352 0.95744681 1.00000000
## 1 0.13246269 0.00000000 0.04373928 0.05929648 0.04255319 0.00000000
##
## June March May November October September
## 0 0.85034014 0.91860465 0.81146026 0.96193416 1.00000000 1.00000000
## 1 0.14965986 0.08139535 0.18853974 0.03806584 0.00000000 0.00000000
##
## April August December February January
## 0 0.078826920 0.001017122 0.189015087 0.158670961 0.205967113
## 1 0.157079646 0.000000000 0.112831858 0.130530973 0.119469027
##
## July June March May November
## 0 0.000678081 0.021190032 0.107136803 0.074419393 0.158501441
## 1 0.000000000 0.048672566 0.123893805 0.225663717 0.081858407
##
## October September
## 0 0.002881844 0.001695203
## 1 0.000000000 0.000000000
category_id, channel_title, comments_disabled, ratings_disabled, video_error_or_removed, weekdays, and months)data$category_id <- as.factor(data$category_id)
data$channel_title <- as.factor(data$channel_title)
data$comments_disabled <- as.factor(data$comments_disabled)
data$ratings_disabled <- as.factor(data$ratings_disabled)
data$video_error_or_removed <- as.factor(data$video_error_or_removed)
data$weekdays <- as.factor(data$weekdays)
data$months <- as.factor(data$months)
str(data)## Classes 'tbl_df', 'tbl' and 'data.frame': 6351 obs. of 16 variables:
## $ trending_date : POSIXlt, format: "2018-02-22" "2018-06-11" ...
## $ title : chr "Padma Lakshmi On A #TopChefâ\200\231s Cancer Diagnosis | WWHL" "Mindy Kaling's Daughter Had the Perfect Reaction to Entering Oprah's House" "Megan Mullally Didn't Notice the Interesting Pattern with Ellen's Roommates" "Cast of Avengers: Infinity War Draws Their Characters" ...
## $ channel_title : Factor w/ 2199 levels "12 News","1MILLION Dance Studio",..: 2129 1955 1955 980 1316 171 424 1663 1376 53 ...
## $ category_id : Factor w/ 16 levels "1","2","10","15",..: 10 10 10 9 8 3 11 13 5 3 ...
## $ publish_time : POSIXlt, format: "2018-02-15" "2018-06-04" ...
## $ tags : chr "What What Happens live|reality|interview|fun|celebrity|Andy Cohen|talk|show|program|Bravo|Watch What Happens Li"| __truncated__ "ellen|ellen degeneres|the ellen show|ellentube|ellen audience|season 15 episode 165|mindy kaling|mindy kaling b"| __truncated__ "megan mullally|megan|mullally|will and grace|karen on will and grace|actress|nick offerman|Ellen|degeneres|elle"| __truncated__ "jimmy|jimmy kimmel|jimmy kimmel live|late night|talk show|funny|comedic|comedy|clip|comedian|mean tweets|Benedi"| __truncated__ ...
## $ likes : int 136 9773 4429 41248 7734 41016 3788 460 12984 129381 ...
## $ dislikes : int 33 332 54 580 212 1642 603 27 383 1522 ...
## $ comment_count : int 24 423 94 1484 846 977 3093 20 714 8757 ...
## $ comments_disabled : Factor w/ 2 levels "False","True": 1 1 1 1 1 1 1 1 1 1 ...
## $ ratings_disabled : Factor w/ 2 levels "False","True": 1 1 1 1 1 1 1 1 1 1 ...
## $ video_error_or_removed: Factor w/ 2 levels "False","True": 1 1 1 1 1 1 1 1 1 1 ...
## $ viral : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ trending_lag : num 7 7 3 4 4 5 2 4 4 8 ...
## $ weekdays : Factor w/ 7 levels "Friday","Monday",..: 5 2 2 1 6 7 6 4 6 7 ...
## $ months : Factor w/ 12 levels "April","August",..: 4 7 5 1 10 1 3 10 4 4 ...
likes, dislikes and comment_count)We can see that the distribution of likes, dislikes and comment_count are heavily skewed to the left so we should apply the log function to improve its predictive power.
data$likes <- log(data$likes)
data$dislikes <- log(data$dislikes)
data$comment_count <- log(data$comment_count)
par(mfrow=c(1, 3))
hist(data$likes)
hist(data$dislikes)
hist(data$comment_count)Given that the number of likes, dislikes and comment_count depend heavily on whether or not the video was viewed, thus we have to wonder whether these 3 variables are not a good predictor of whether a video goes viral. A more interesting variable we can explore in a later study could be a sentiment_index, where we compare the ratio of likes to dislikes multiplied by a factor of sd( log(comment counts)). Perhaps people like to hate-watch certain programs or some channels produce particularly triggering material to get more clicks.
tm_map()mycorpus.title <- VCorpus(VectorSource(data$title.cleaned))
mycorpus.title_clean <- tm_map(mycorpus.title, content_transformer(tolower))
mycorpus.title_clean <- tm_map(mycorpus.title_clean, removeWords, stopwords("english"))
mycorpus.title_clean <- tm_map(mycorpus.title_clean, removePunctuation)
mycorpus.title_clean <- tm_map(mycorpus.title_clean, removeNumbers)
mycorpus.title_clean <- tm_map(mycorpus.title_clean, stemDocument, lazy = TRUE) # Extract document term matrix for title
dtm.title.full <- DocumentTermMatrix(mycorpus.title_clean)
str(dtm.title.full)## List of 6
## $ i : int [1:36384] 1 1 1 1 1 1 2 2 2 2 ...
## $ j : int [1:36384] 1126 1996 4200 5430 7527 8152 1849 2468 3506 3973 ...
## $ v : num [1:36384] 1 1 1 1 1 1 1 1 1 1 ...
## $ nrow : int 6351
## $ ncol : int 8242
## $ dimnames:List of 2
## ..$ Docs : chr [1:6351] "1" "2" "3" "4" ...
## ..$ Terms: chr [1:8242] "ã–rs" "ã–zil" "â—\220" "世畜ã\201§ä¸\200番å\210‡ã‚œã‚‹ãƒ‘スタã\201®åœ…ä¸\201を作゚ã\201ÿã\201„ï¼\201" ...
## - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
## - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
# Get words appearing in at least 0.25% of all titles
threshold <- .0025*length(mycorpus.title_clean) # 0.25% of all titles
title.words <- findFreqTerms(dtm.title.full, lowfreq=threshold) # words appearing at least among 0.25% of the titles
length(title.words)## [1] 443
# Extract document term matrix for words appearing in at least .25% of all titles
dtm.title <- DocumentTermMatrix(mycorpus.title_clean, control = list(dictionary = title.words))
dim(as.matrix(dtm.title))## [1] 6351 443
## List of 6
## $ i : int [1:15904] 1 2 2 2 3 4 4 4 4 5 ...
## $ j : int [1:15904] 439 178 293 311 96 18 52 183 418 406 ...
## $ v : num [1:15904] 1 1 1 1 1 1 1 1 1 1 ...
## $ nrow : int 6351
## $ ncol : int 443
## $ dimnames:List of 2
## ..$ Docs : chr [1:6351] "1" "2" "3" "4" ...
## ..$ Terms: chr [1:443] "â\200“" "â\200”" "abc" "actual" ...
## - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
## - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
# Combine original data with the title matrix
data.temp <- data.frame(data, as.matrix(dtm.title))
dim(data.temp)## [1] 6351 461
## [1] "trending_date" "title"
## [3] "channel_title" "category_id"
## [5] "publish_time" "tags"
## [7] "likes" "dislikes"
## [9] "comment_count" "comments_disabled"
## [11] "ratings_disabled" "video_error_or_removed"
## [13] "viral" "trending_lag"
## [15] "weekdays" "months"
## [17] "title.cleaned" "tags.cleaned"
## [19] "â.." "â...1"
#data2 combined viral with the .025% of top words in title
data2 <- data.temp[, c(13, 19:ncol(data.temp))]
dim(data2)## [1] 6351 444
## [1] "viral" "â.." "â...1" "abc" "actual" "adam"
## [7] "amazon" "america" "american" "anim"