I. Executive summary.

Founded 14 years ago, YouTube, a video-sharing platform, is currently the second-most popular site in the world, behind only its parent company Google. An estimated 300 hours of video are uploaded to the site every minute, and almost 5 billion videos are watched on YouTube every day. Of all the videos posted, only a small fraction achieve “virality”.

We found our data set on Kaggle: https://www.kaggle.com/datasnaek/youtube-new.

II. Response and predictor variables

Data: USvideos.csv contains 40949 observations of 16 variables. The observations are videos published between 2006 and 2018, all of which “trended” at some point. When a video is trending, YouTube’s algorithm has deemed the video ‘relevant’ and thus promotes it on the trending feed located on the home screen menu.

Note: YouTube does not disclose the algorithm it uses to decide what is ‘trending’.

We will add a new categorical variable, Viral, as our response variable. A video is defined as having gone viral if it achieves more than 5 million views.

The predictor variables in the original data set are as follows:

  • video_id: unique id assigned to the video (will remove from analysis)

  • trending_date: the date that YouTube started promoting the video on its ‘trending’ feed.

  • title: video title

  • channel_title: author or publisher of the material

  • category_id: 16 levels of video category ID. They are described as follows:

ID  Category Name           ID  Category Name
 1  Film & Animation        23  Comedy
 2  Autos & Vehicles        24  Entertainment
10  Music                   25  News & Politics
15  Pets & Animals          26  Howto & Style
17  Sports                  27  Education
19  Travel & Events         28  Science & Technology
20  Gaming                  29  Nonprofits & Activism
22  People & Blogs          43  Shows
  • publish_time: date and time when the video was published

  • tags: user-generated tags to improve SEO

  • views: total number of views the video received as of the last time it trended (will remove from analysis)

  • likes: total number of likes the video received as of the last time it trended

  • dislikes: total number of dislikes the video received as of the last time it trended

  • comment_count: total number of comments the video received as of the last time it trended

  • thumbnail_link: link to outside material (will remove from analysis)

  • comments_disabled: whether or not the uploader disabled comments

  • ratings_disabled: whether or not the uploader disabled ratings

  • video_error_or_removed: whether or not the content was removed or had an error

  • description: user-generated video description

Depending on the analysis requirements, we may add, delete, or modify attributes as necessary; we will call out any such changes in our analysis.

III. Detailed process of the analysis

a) EDA

Data exploration and summary

Upon opening the data, we quickly noticed that many videos had “duplicate” entries, one for each day the video “trended”: if a video trended on multiple days, it appears in the dataset multiple times. Because we want to study which categories or key words are correlated with having gone viral, we removed the duplicate rows, keeping only the row from the last day each video trended and using the accumulated view count reported there.
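
A minimal sketch of this de-duplication step (assuming dplyr and readr are available; the object names us and us.dedup are ours, and the report's actual code may differ):

library(dplyr)
library(readr)

us <- read_csv("USvideos.csv")
dim(us)                                   # 40949 rows, 16 columns

# Views accumulate over time, so a video's last trending snapshot is the row
# with its highest reported view count.
us.dedup <- us %>%
  group_by(video_id) %>%
  slice_max(views, n = 1, with_ties = FALSE) %>%
  ungroup()

n_distinct(us$video_id)                   # number of unique videos
dim(us.dedup)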

## [1] 40949    16
## [1] 6351
## [1] 6351   16

We also need to remove categorical variables with no predictive power (video_id, thumbnail_link). We remove description as well, because its key words should already be captured in tags and title.

Now we define viral as amassing more than 5 million views; this will be our categorical response variable.

Code  Description
0     5 million views or fewer
1     More than 5 million views

We then remove views as a predictor.
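
A minimal sketch of these steps, continuing from the de-duplicated data above (the object name data1 is an assumption; the column names are from the data set):

data1 <- us.dedup %>%
  select(-video_id, -thumbnail_link, -description) %>%    # no predictive power / redundant
  mutate(viral = factor(ifelse(views > 5e6, 1, 0))) %>%    # response: more than 5 million views
  select(-views)                                           # drop views as a predictor

str(data1)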

## Classes 'tbl_df', 'tbl' and 'data.frame':    6351 obs. of  13 variables:
##  $ trending_date         : chr  "18.22.02" "18.11.06" "18.01.02" "18.01.05" ...
##  $ title                 : chr  "Padma Lakshmi On A #TopChefâ\200\231s Cancer Diagnosis | WWHL" "Mindy Kaling's Daughter Had the Perfect Reaction to Entering Oprah's House" "Megan Mullally Didn't Notice the Interesting Pattern with Ellen's Roommates" "Cast of Avengers: Infinity War Draws Their Characters" ...
##  $ channel_title         : chr  "Watch What Happens Live with Andy Cohen" "TheEllenShow" "TheEllenShow" "Jimmy Kimmel Live" ...
##  $ category_id           : int  24 24 24 23 22 10 25 27 17 10 ...
##  $ publish_time          : chr  "2018-02-15T04:30:12.000Z" "2018-06-04T13:00:00.000Z" "2018-01-29T14:00:39.000Z" "2018-04-27T07:30:02.000Z" ...
##  $ tags                  : chr  "What What Happens live|reality|interview|fun|celebrity|Andy Cohen|talk|show|program|Bravo|Watch What Happens Li"| __truncated__ "ellen|ellen degeneres|the ellen show|ellentube|ellen audience|season 15 episode 165|mindy kaling|mindy kaling b"| __truncated__ "megan mullally|megan|mullally|will and grace|karen on will and grace|actress|nick offerman|Ellen|degeneres|elle"| __truncated__ "jimmy|jimmy kimmel|jimmy kimmel live|late night|talk show|funny|comedic|comedy|clip|comedian|mean tweets|Benedi"| __truncated__ ...
##  $ likes                 : int  136 9773 4429 41248 7734 41016 3788 460 12984 129381 ...
##  $ dislikes              : int  33 332 54 580 212 1642 603 27 383 1522 ...
##  $ comment_count         : int  24 423 94 1484 846 977 3093 20 714 8757 ...
##  $ comments_disabled     : chr  "False" "False" "False" "False" ...
##  $ ratings_disabled      : chr  "False" "False" "False" "False" ...
##  $ video_error_or_removed: chr  "False" "False" "False" "False" ...
##  $ viral                 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

Proportion of viral videos:

## 
##          0          1 
## 0.92883011 0.07116989

Only 7.1% of the 6351 trending videos in our data set had gone viral (more than 5 million views). We now clean up the data to identify potential variables that could predict going ‘viral’.

Data clean-up

We need to convert several attributes into data types we can analyze.

i) Date variables (trending_date and publish_time): convert to date format and derive trending_lag (the number of days between publishing and trending), as well as weekdays and months (the day of week and month in which the video was published).

ii) Categorical variables (category_id, channel_title, comments_disabled, ratings_disabled, video_error_or_removed, weekdays, and months): convert to factors.

## Classes 'tbl_df', 'tbl' and 'data.frame':    6351 obs. of  16 variables:
##  $ trending_date         : POSIXlt, format: "2018-02-22" "2018-06-11" ...
##  $ title                 : chr  "Padma Lakshmi On A #TopChefâ\200\231s Cancer Diagnosis | WWHL" "Mindy Kaling's Daughter Had the Perfect Reaction to Entering Oprah's House" "Megan Mullally Didn't Notice the Interesting Pattern with Ellen's Roommates" "Cast of Avengers: Infinity War Draws Their Characters" ...
##  $ channel_title         : Factor w/ 2199 levels "12 News","1MILLION Dance Studio",..: 2129 1955 1955 980 1316 171 424 1663 1376 53 ...
##  $ category_id           : Factor w/ 16 levels "1","2","10","15",..: 10 10 10 9 8 3 11 13 5 3 ...
##  $ publish_time          : POSIXlt, format: "2018-02-15" "2018-06-04" ...
##  $ tags                  : chr  "What What Happens live|reality|interview|fun|celebrity|Andy Cohen|talk|show|program|Bravo|Watch What Happens Li"| __truncated__ "ellen|ellen degeneres|the ellen show|ellentube|ellen audience|season 15 episode 165|mindy kaling|mindy kaling b"| __truncated__ "megan mullally|megan|mullally|will and grace|karen on will and grace|actress|nick offerman|Ellen|degeneres|elle"| __truncated__ "jimmy|jimmy kimmel|jimmy kimmel live|late night|talk show|funny|comedic|comedy|clip|comedian|mean tweets|Benedi"| __truncated__ ...
##  $ likes                 : int  136 9773 4429 41248 7734 41016 3788 460 12984 129381 ...
##  $ dislikes              : int  33 332 54 580 212 1642 603 27 383 1522 ...
##  $ comment_count         : int  24 423 94 1484 846 977 3093 20 714 8757 ...
##  $ comments_disabled     : Factor w/ 2 levels "False","True": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ratings_disabled      : Factor w/ 2 levels "False","True": 1 1 1 1 1 1 1 1 1 1 ...
##  $ video_error_or_removed: Factor w/ 2 levels "False","True": 1 1 1 1 1 1 1 1 1 1 ...
##  $ viral                 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ trending_lag          : num  7 7 3 4 4 5 2 4 4 8 ...
##  $ weekdays              : Factor w/ 7 levels "Friday","Monday",..: 5 2 2 1 6 7 6 4 6 7 ...
##  $ months                : Factor w/ 12 levels "April","August",..: 4 7 5 1 10 1 3 10 4 4 ...
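
A sketch of how the conversions shown above might be produced (assuming lubridate and dplyr; the report's exact code may differ):

library(lubridate)

data1 <- data1 %>%
  mutate(
    trending_date = ydm(trending_date),                        # stored as yy.dd.mm
    publish_time  = as_date(ymd_hms(publish_time)),
    trending_lag  = as.numeric(trending_date - publish_time),  # days from publish to trend
    weekdays      = factor(weekdays(publish_time)),
    months        = factor(months(publish_time)),
    across(c(channel_title, category_id, comments_disabled,
             ratings_disabled, video_error_or_removed), factor)
  )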

iii) Applying a log transform to attributes with skewed distributions (likes, dislikes, and comment_count)

We can see that the distributions of likes, dislikes, and comment_count are heavily right-skewed, so we apply a log transform to improve their usefulness as predictors.
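
A sketch of the transform (log1p is used here so that zero counts stay defined; the report's exact transform may differ):

data1 <- data1 %>%
  mutate(
    likes         = log1p(likes),
    dislikes      = log1p(dislikes),
    comment_count = log1p(comment_count)
  )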

Given that likes, dislikes, and comment_count depend heavily on how much the video was viewed, we have to question whether these three variables are good predictors of whether a video goes viral. A more interesting variable to explore in a later study could be a sentiment_index, comparing the ratio of likes to dislikes multiplied by a factor of sd(log(comment_count)). Perhaps people like to hate-watch certain programs, or some channels produce particularly triggering material to get more clicks.

b) Text mining

i) Preparing text for analysis

Titles
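
The structure printed below is a DocumentTermMatrix. A minimal sketch of how such a matrix can be built with the tm package (the cleaning steps and the sparsity cutoff are assumptions; SnowballC is needed for stemming):

library(tm)

title.corpus <- VCorpus(VectorSource(data1$title))
title.corpus <- tm_map(title.corpus, content_transformer(tolower))
title.corpus <- tm_map(title.corpus, removePunctuation)
title.corpus <- tm_map(title.corpus, removeNumbers)
title.corpus <- tm_map(title.corpus, removeWords, stopwords("english"))
title.corpus <- tm_map(title.corpus, stemDocument)

title.dtm <- DocumentTermMatrix(title.corpus)      # full matrix over 6351 documents
title.dtm <- removeSparseTerms(title.dtm, 0.995)   # drop very rare terms
dim(title.dtm)

The tags column is prepared in the same way (see the Tags output further below).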

## List of 6
##  $ i       : int [1:36384] 1 1 1 1 1 1 2 2 2 2 ...
##  $ j       : int [1:36384] 1126 1996 4200 5430 7527 8152 1849 2468 3506 3973 ...
##  $ v       : num [1:36384] 1 1 1 1 1 1 1 1 1 1 ...
##  $ nrow    : int 6351
##  $ ncol    : int 8242
##  $ dimnames:List of 2
##   ..$ Docs : chr [1:6351] "1" "2" "3" "4" ...
##   ..$ Terms: chr [1:8242] "ã–rs" "ã–zil" "â—\220" "世畜ã\201§ä¸\200番å\210‡ã‚œã‚‹ãƒ‘スタã\201®åœ…ä¸\201を作゚ã\201ÿã\201„ï¼\201" ...
##  - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
##  - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
## [1] 443
## [1] 6351  443
## List of 6
##  $ i       : int [1:15904] 1 2 2 2 3 4 4 4 4 5 ...
##  $ j       : int [1:15904] 439 178 293 311 96 18 52 183 418 406 ...
##  $ v       : num [1:15904] 1 1 1 1 1 1 1 1 1 1 ...
##  $ nrow    : int 6351
##  $ ncol    : int 443
##  $ dimnames:List of 2
##   ..$ Docs : chr [1:6351] "1" "2" "3" "4" ...
##   ..$ Terms: chr [1:443] "â\200“" "â\200”" "abc" "actual" ...
##  - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
##  - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
## [1] 6351  461
##  [1] "trending_date"          "title"                 
##  [3] "channel_title"          "category_id"           
##  [5] "publish_time"           "tags"                  
##  [7] "likes"                  "dislikes"              
##  [9] "comment_count"          "comments_disabled"     
## [11] "ratings_disabled"       "video_error_or_removed"
## [13] "viral"                  "trending_lag"          
## [15] "weekdays"               "months"                
## [17] "title.cleaned"          "tags.cleaned"          
## [19] "â.."                    "â...1"
## [1] 6351  444
##  [1] "viral"    "â.."      "â...1"    "abc"      "actual"   "adam"    
##  [7] "amazon"   "america"  "american" "anim"

Tags

## List of 6
##  $ i       : int [1:128784] 1 1 1 1 1 1 1 1 1 1 ...
##  $ j       : int [1:128784] 513 702 2151 2593 2837 3366 3445 3599 4506 4507 ...
##  $ v       : num [1:128784] 1 2 2 2 1 1 1 1 1 1 ...
##  $ nrow    : int 6351
##  $ ncol    : int 19206
##  $ dimnames:List of 2
##   ..$ Docs : chr [1:6351] "1" "2" "3" "4" ...
##   ..$ Terms: chr [1:19206] "â\210†" "如何过你自己的ç”ÿæ´»" "妻" "寿å\217¸" ...
##  - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
##  - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
## [1] 2402
## [1] 6351 2402
## List of 6
##  $ i       : int [1:90831] 1 1 1 1 1 1 1 1 1 1 ...
##  $ j       : int [1:90831] 44 67 254 310 337 423 437 456 660 826 ...
##  $ v       : num [1:90831] 1 2 2 2 1 1 1 1 1 2 ...
##  $ nrow    : int 6351
##  $ ncol    : int 2402
##  $ dimnames:List of 2
##   ..$ Docs : chr [1:6351] "1" "2" "3" "4" ...
##   ..$ Terms: chr [1:2402] "à¤\210" "डबà¥\215लू" "सà¥\201परसà¥\215à¤ÿार" "मà¥\210च" ...
##  - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
##  - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
##  [1] "viral"                          "à.."                           
##  [3] "à..à..à..à.²à.."                "à..à..à.ªà..à..à..à.ÿà..à.."   
##  [5] "à..à..à.š"                      "à.µà..à..à..à.µà..à..à..à..à.."
##  [7] "à..à..à..à..à..à.."             "à.ªà.¹à.²à.µà..à.."            
##  [9] "aaron"                          "abc"

Splitting data

data3 (viral and the tag words that appear in at least 0.25% of the videos)

## [1] 6351 2403
## [1] 1270 2403
## [1] 4065 2403
## [1] 1016 2403
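
A sketch of the three-way split (the row counts follow the dimensions above; the seed and the name data3.validate are assumptions):

set.seed(2019)
n         <- nrow(data3)
idx.train <- sample(n, 4065)
idx.rest  <- setdiff(seq_len(n), idx.train)
idx.test  <- sample(idx.rest, 1270)
idx.val   <- setdiff(idx.rest, idx.test)

data3.train    <- data3[idx.train, ]
data3.test     <- data3[idx.test, ]
data3.validate <- data3[idx.val, ]

The title data (data2) is split the same way.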

ii) Lasso | WordCloud

Title Analysis

First, we run LASSO to identify which words in the title often appear in viral videos.

We load the LASSO results here and plot them to determine the best lambda.

lambda.min is used to reduce the dimensionality of the text through the term-frequency table.
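
A minimal sketch of this LASSO step with glmnet (object names other than data2.train are assumptions):

library(glmnet)

X.title <- as.matrix(data2.train[, setdiff(names(data2.train), "viral")])
y.title <- data2.train$viral

set.seed(2019)
title.lasso <- cv.glmnet(X.title, y.title, family = "binomial", alpha = 1)
plot(title.lasso)                                          # CV curve used to pick lambda

coef.min <- as.matrix(coef(title.lasso, s = "lambda.min"))
rownames(coef.min)[coef.min != 0]                          # words retained at lambda.min

The same procedure is applied to the tag matrix (data3) in the Tags Analysis below.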

##   [1] "(Intercept)" "amazon"      "america"     "anthem"      "artist"     
##   [6] "audit"       "aveng"       "babi"        "back"        "bad"        
##  [11] "ball"        "bbc"         "beat"        "best"        "big"        
##  [16] "black"       "bowl"        "bts"         "cabello"     "cake"       
##  [21] "call"        "cardi"       "cat"         "celebr"      "chang"      
##  [26] "christma"    "come"        "cook"        "cover"       "cri"        
##  [31] "david"       "day"         "die"         "diy"         "dog"        
##  [36] "dress"       "eagl"        "easi"        "emot"        "espn"       
##  [41] "everyth"     "explain"     "face"        "fair"        "fake"       
##  [46] "famili"      "fan"         "feat"        "featur"      "fight"      
##  [51] "film"        "find"        "fire"        "fox"         "fri"        
##  [56] "full"        "futur"       "gadget"      "golden"      "got"        
##  [61] "grace"       "graham"      "grammi"      "guy"         "hair"       
##  [66] "high"        "holiday"     "honest"      "ice"         "impress"    
##  [71] "interview"   "jame"        "japanes"     "jedi"        "jennif"     
##  [76] "jimmi"       "jordan"      "just"        "kelli"       "kendrick"   
##  [81] "kim"         "king"        "know"        "kyli"        "laugh"      
##  [86] "leagu"       "let"         "light"       "line"        "live"       
##  [91] "lost"        "love"        "machin"      "magic"       "michael"    
##  [96] "moment"      "money"       "motion"      "move"        "movi"       
## [101] "music"       "name"        "need"        "netflix"     "new"        
## [106] "news"        "nick"        "night"       "nintendo"    "now"        
## [111] "offici"      "part"        "paul"        "peopl"       "perfect"    
## [116] "pictur"      "power"       "presid"      "princ"       "react"      
## [121] "reaction"    "real"        "reason"      "refineri"    "reveal"     
## [126] "review"      "routin"      "save"        "scene"       "scott"      
## [131] "season"      "secret"      "see"         "shawn"       "sheeran"    
## [136] "shoot"       "shop"        "show"        "sing"        "smith"      
## [141] "special"     "speech"      "stephen"     "stock"       "stop"       
## [146] "stranger"    "studio"      "super"       "take"        "talk"       
## [151] "taylor"      "teaser"      "tell"        "theater"     "theori"     
## [156] "thing"       "timberlak"   "time"        "today"       "top"        
## [161] "trailer"     "train"       "tri"         "trump"       "use"        
## [166] "video"       "voic"        "water"       "wed"         "wild"       
## [171] "win"         "wish"        "world"       "wwhl"        "year"

We then pull out all the positive coefficients and the corresponding words, rank the coefficients in decreasing order, and report the two leading words, their coefficients, and a brief interpretation.

##   anthem    cardi 
## 3.712773 3.154864

The two leading positive words are “anthem” with a coefficient of 3.71 and “cardi” with a coefficient of 3.15. Each coefficient is the estimated change in the log odds of a video being viral for a one-unit increase in the frequency of that word in the title.

We then create a word cloud with the top positive words associated with viral videos, ranked according to their coefficients.
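
A sketch of the word cloud step (wordcloud package; coef.min refers to the coefficient matrix from the sketch above):

library(wordcloud)

cc  <- coef.min[, 1]
pos <- sort(cc[cc > 0 & names(cc) != "(Intercept)"], decreasing = TRUE)
head(names(pos), 10)                                 # leading positive words

wordcloud(words = names(pos), freq = pos,
          min.freq = 0, random.order = FALSE, scale = c(3, 0.5))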

##  [1] "anthem"    "cardi"     "motion"    "perfect"   "paul"     
##  [6] "timberlak" "kiss"      "bts"       "sheeran"   "shawn"

Tags Analysis

First, we run LASSO to identify which tags often appear in viral videos.

We load the LASSO results here.

lambda.min is used to reduce the dimensionality of the text through the term-frequency table.

##  [1] "bangtan"         "bowl"            "bubbl"          
##  [4] "cardi"           "cup"             "deadpool"       
##  [7] "dobr"            "drake"           "dude"           
## [10] "ë..ë¹."          "ë..íƒ.ì.œë..ë.." "feat"           
## [13] "foil"            "got"             "halsey"         
## [16] "khale"           "label"           "latin"          
## [19] "lizzza"          "logan"           "lopez"          
## [22] "marvel"          "movi"            "offici"         
## [25] "perfect"         "pictur"          "pon"            
## [28] "pop"             "rca"             "reason"         
## [31] "record"          "remix"           "road"           
## [34] "spi"             "spiderman"       "ultra"          
## [37] "zedd"

Next, we pull out all the positive coefficients and the corresponding words and rank them to determine which ones have the biggest positive association with virality.

## ë..íƒ.ì.œë..ë..          ë..ë¹.          halsey           label 
##        3.758664        3.291212        2.293280        2.278958
##   halsey    label 
## 2.293280 2.278958

The two leading positive words that we are able to interpret are “halsey” with a coefficient of 2.29 and “label” with a coefficient of 2.28. Each coefficient is the estimated change in the log odds of a video being viral for a one-unit increase in the frequency of that tag.

A word cloud with the top 100 positive words according to their coefficients.

##  [1] "halsey" "label"  "latin"  "lizzza" "pon"    "bubbl"  "ultra" 
##  [8] "spi"    "reason" "remix"

From the LASSO fit for the title analysis, we see that the testing misclassification error is 0.074.

## [1] 0.07401575

Using majority vote, the testing misclassification error for the LASSO fit for the tag analysis is 0.075.

## [1] 0.07480315

From the glm fit for the title analysis, we see that the testing misclassification error is 0.060.

## [1] 0.06021204
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## 
## Call:
## roc.default(response = data2.test$viral, predictor = predict.glm,     plot = T)
## 
## Data: predict.glm in 1179 controls (data2.test$viral 0) < 91 cases (data2.test$viral 1).
## Area under the curve: 0.8252
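
A sketch of the glm fit and ROC evaluation (pROC package; the names predict.glm and data2.test match the output above, while the exact predictor set is an assumption and may be restricted to the LASSO-selected words):

title.glm   <- glm(viral ~ ., family = binomial, data = data2.train)
predict.glm <- predict(title.glm, newdata = data2.test, type = "response")

pred.class <- ifelse(predict.glm > 0.5, "1", "0")      # majority vote at 0.5
mean(pred.class != data2.test$viral)                   # testing misclassification error

library(pROC)
roc(response = data2.test$viral, predictor = predict.glm, plot = TRUE)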

Using majority vote, the testing misclassification error for the glm fit for the tag analysis is 0.084.

## [1] 0.08361672
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## 
## Call:
## roc.default(response = data3.test$viral, predictor = predict.tag.glm,     plot = T)
## 
## Data: predict.tag.glm in 1179 controls (data3.test$viral 0) < 91 cases (data3.test$viral 1).
## Area under the curve: 0.6646

iii) Random forest

We use ranger() from the ranger package, a faster implementation of random forests.

Title Analysis

We run a loop with the number of trees varying from 20 to 200 (in increments of 20) and record the testing errors.
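
A sketch of this tuning loop (the seed is an assumption; ranger reports the OOB error as fit$prediction.error):

library(ranger)

ntrees   <- seq(20, 200, by = 20)
test.err <- oob.err <- numeric(length(ntrees))

for (i in seq_along(ntrees)) {
  set.seed(2019)
  fit <- ranger(viral ~ ., data = data2.train,
                num.trees = ntrees[i], splitrule = "gini")
  pred        <- predict(fit, data = data2.test)$predictions
  test.err[i] <- mean(pred != data2.test$viral)          # testing error
  oob.err[i]  <- fit$prediction.error                     # OOB error
}
test.err
oob.err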

##  [1] 0.06687898 0.06687898 0.06767516 0.06847134 0.06767516 0.06767516
##  [7] 0.06767516 0.06767516 0.06767516 0.06767516
##  [1] 0.07261875 0.07187267 0.07162397 0.07187267 0.07137528 0.07187267
##  [7] 0.07237006 0.07137528 0.07162397 0.07137528

As we observe from these vectors, both the testing misclassification error and the OOB error are stable after 100 trees. Therefore, we picked 100 trees to balance accuracy and computing power.

Next, we run a loop with 100 trees and varying mtry (14-32, in increments of 2) and record the testing errors.

##  [1] 0.06847134 0.06767516 0.06767516 0.06847134 0.06847134 0.06926752
##  [7] 0.06687898 0.06767516 0.06847134 0.06847134
##  [1] 0.07261875 0.07162397 0.07087789 0.07212136 0.07187267 0.07187267
##  [7] 0.07237006 0.07212136 0.07286745 0.07261875

We observe that the testing misclassification error is minimized at mtry = 26, while the OOB error swings slightly across the different mtry values. Therefore, we picked mtry = 26 instead of the default mtry of about sqrt(443) ≈ 21.

## Ranger result
## 
## Call:
##  ranger::ranger(viral ~ ., data2.train, num.trees = 100, mtry = 26,      splitrule = "gini", importance = "impurity") 
## 
## Type:                             Classification 
## Number of trees:                  100 
## Sample size:                      4065 
## Number of independent variables:  443 
## Mtry:                             26 
## Target node size:                 1 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error:             7.31 %
##    offici     video   perfect     cardi   trailer   sheeran    teaser 
## 20.715256 10.184915 10.011829  4.810809  4.179687  3.064753  2.881592 
##      fake     audio       bts    pictur    justin     world  kendrick 
##  2.693733  2.359388  2.195621  2.184556  2.067891  2.003746  1.955107 
##       new       fri     david      love timberlak       bad 
##  1.894719  1.774509  1.725552  1.674613  1.622483  1.613559

The top 5 most important words are offici[al], video, perfect, cardi, and trailer.

##     predicted
## true    0    1
##    0 3758   13
##    1  284   10
## [1] 0.07401575

The testing misclassification error for our final random forest model is 7.40%.

Tags Analysis

We run a loop with the number of trees varying from 20 to 200 (in increments of 20) and record the testing errors.

##  [1] 0.06210191 0.06210191 0.05971338 0.05812102 0.06050955 0.05812102
##  [7] 0.06050955 0.05971338 0.05971338 0.05971338
##  [1] 0.07187267 0.06739617 0.06938572 0.06863964 0.06789356 0.06789356
##  [7] 0.06764486 0.06490923 0.06689878 0.06764486

We observe that we need 160 trees to stabilize testing misclassification error and minimize OOB error.

Next, we run a loop with 160 trees and varying mtry (24-78, in increments of 6) and record the testing errors.

##  [1] 0.06210191 0.06130573 0.05891720 0.05891720 0.05891720 0.06130573
##  [7] 0.06050955 0.06050955 0.06130573 0.06050955
##  [1] 0.06839095 0.06863964 0.06714748 0.06764486 0.06540662 0.06615270
##  [7] 0.06689878 0.06739617 0.06789356 0.06739617

We observe that mtry = 48 minimizes the OOB error and ties for the lowest testing misclassification error. Therefore, we picked mtry = 48, which is also roughly the default of floor(sqrt(2380)) = 48.

## Ranger result
## 
## Call:
##  ranger::ranger(viral ~ ., data3.train, num.trees = 160, mtry = 48,      splitrule = "gini", importance = "impurity") 
## 
## Type:                             Classification 
## Number of trees:                  160 
## Sample size:                      4021 
## Number of independent variables:  2380 
## Mtry:                             48 
## Target node size:                 1 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error:             6.76 %
##          offici            paul           remix          record 
##        6.008956        3.992303        3.736670        3.431084 
##             pop           cardi            movi         bangtan 
##        3.306166        3.073816        2.955152        2.880538 
##           logan             pon            feat           elder 
##        2.856512        2.603727        2.154104        2.018227 
##             rap           twice             lil          ë..ë¹. 
##        1.983825        1.982894        1.958532        1.913647 
## ë..íƒ.ì.œë..ë..           video             new          spider 
##        1.891761        1.814045        1.769265        1.678810

The top 5 most important keywords in tags are offici[al], paul, remix, record, and pop.

##     predicted
## true    0    1
##    0 3718   17
##    1  255   31
## [1] 0.04724409

The testing misclassification error for our final random forest model is approximately 4.7%.

Final model

Based on minimizing testing misclassification error, we chose the glm fit for the title model and random forest for the tags model. We then used our validation data to determine the misclassification error of the final model.

Tags final model

## Ranger result
## 
## Call:
##  ranger::ranger(viral ~ ., data3.train, num.trees = 160, mtry = 48,      splitrule = "gini", importance = "impurity") 
## 
## Type:                             Classification 
## Number of trees:                  160 
## Sample size:                      4021 
## Number of independent variables:  2380 
## Mtry:                             48 
## Target node size:                 1 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error:             6.76 %
##          offici            paul           remix          record 
##        6.008956        3.992303        3.736670        3.431084 
##             pop           cardi            movi         bangtan 
##        3.306166        3.073816        2.955152        2.880538 
##           logan             pon            feat           elder 
##        2.856512        2.603727        2.154104        2.018227 
##             rap           twice             lil          ë..ë¹. 
##        1.983825        1.982894        1.958532        1.913647 
## ë..íƒ.ì.œë..ë..           video             new          spider 
##        1.891761        1.814045        1.769265        1.678810
## [1] 0.09606299

According to the final model, using the validation data, we get a validation error of approximately 9.6%.

c) Logistic regression using non-text factors

Based on the data that we have, we also want to see what factors other than the text-related ones (title and tags) contribute to a video being viral. We run a logistic regression on factors such as category_id and on features such as comments_disabled and ratings_disabled. We do not include likes, dislikes, or comment_count since, as discussed above, they largely reflect how much a video has already been viewed.

## [1] 1270  461
## [1] 4065  461
## [1] 1016  461
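
A sketch of this logistic regression and the Type II tests shown below (the Anova() call is from the car package; per the printed Call, the model here is fit on data4.test):

library(car)

fit.nontext <- glm(viral ~ category_id + weekdays + months +
                     comments_disabled + ratings_disabled,
                   family = binomial, data = data4.test)
summary(fit.nontext)
Anova(fit.nontext)      # Type II likelihood-ratio chi-square tests
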
## 
## Call:
## glm(formula = viral ~ category_id + weekdays + months + comments_disabled + 
##     ratings_disabled, family = binomial, data = data4.test)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5551  -0.3707  -0.2546  -0.1075   2.9521  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)   
## (Intercept)           -5.889e-01  5.288e-01  -1.114  0.26538   
## category_id2          -2.199e-01  8.565e-01  -0.257  0.79739   
## category_id10          1.116e-01  4.413e-01   0.253  0.80032   
## category_id15         -1.779e+01  2.172e+03  -0.008  0.99347   
## category_id17         -1.675e+00  6.472e-01  -2.589  0.00963 **
## category_id19         -1.780e+01  3.099e+03  -0.006  0.99542   
## category_id20         -3.225e-01  7.878e-01  -0.409  0.68228   
## category_id22         -1.201e+00  5.350e-01  -2.246  0.02473 * 
## category_id23         -1.364e+00  5.576e-01  -2.446  0.01446 * 
## category_id24         -1.215e+00  4.470e-01  -2.719  0.00656 **
## category_id25         -1.801e+01  9.881e+02  -0.018  0.98546   
## category_id26         -1.823e+00  6.465e-01  -2.820  0.00480 **
## category_id27         -1.800e+01  1.326e+03  -0.014  0.98917   
## category_id28         -2.729e+00  1.101e+00  -2.479  0.01316 * 
## category_id29          1.548e+00  1.012e+00   1.529  0.12620   
## weekdaysMonday        -1.172e-01  4.360e-01  -0.269  0.78807   
## weekdaysSaturday      -9.790e-03  4.974e-01  -0.020  0.98430   
## weekdaysSunday        -2.057e-02  4.965e-01  -0.041  0.96696   
## weekdaysThursday       3.141e-01  3.623e-01   0.867  0.38600   
## weekdaysTuesday       -6.199e-01  4.529e-01  -1.369  0.17102   
## weekdaysWednesday      1.037e-01  3.786e-01   0.274  0.78412   
## monthsAugust          -1.707e+01  7.201e+03  -0.002  0.99811   
## monthsDecember        -1.216e+00  4.244e-01  -2.867  0.00415 **
## monthsFebruary        -1.208e+00  4.538e-01  -2.661  0.00779 **
## monthsJanuary         -1.214e+00  4.408e-01  -2.755  0.00587 **
## monthsJuly            -1.726e+01  1.075e+04  -0.002  0.99872   
## monthsJune            -2.182e+00  1.106e+00  -1.972  0.04865 * 
## monthsMarch           -1.369e+00  5.328e-01  -2.569  0.01021 * 
## monthsMay              5.022e-01  3.959e-01   1.269  0.20460   
## monthsNovember        -1.670e+00  5.076e-01  -3.290  0.00100 **
## monthsOctober         -1.761e+01  7.593e+03  -0.002  0.99815   
## monthsSeptember       -1.841e+01  7.436e+03  -0.002  0.99802   
## comments_disabledTrue  9.621e-01  9.373e-01   1.027  0.30465   
## ratings_disabledTrue  -1.728e+01  4.366e+03  -0.004  0.99684   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 655.05  on 1269  degrees of freedom
## Residual deviance: 526.06  on 1236  degrees of freedom
## AIC: 594.06
## 
## Number of Fisher Scoring iterations: 18
## Analysis of Deviance Table (Type II tests)
## 
## Response: viral
##                   LR Chisq Df Pr(>Chisq)    
## category_id         74.566 14  2.842e-10 ***
## weekdays             5.128  6     0.5275    
## months              46.269 11  2.898e-06 ***
## comments_disabled    0.926  1     0.3359    
## ratings_disabled     0.998  1     0.3179    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Using backward elimination, we first remove weekdays, as it is not significant at the 0.05 level.

## 
## Call:
## glm(formula = viral ~ category_id + months + comments_disabled + 
##     ratings_disabled, family = binomial, data = data4.test)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5357  -0.4000  -0.2828  -0.1214   2.7130  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -0.6803     0.4596  -1.480 0.138833    
## category_id2             -0.1453     0.8498  -0.171 0.864271    
## category_id10             0.2541     0.4306   0.590 0.555067    
## category_id15           -17.7116  2155.6437  -0.008 0.993444    
## category_id17            -1.5965     0.6368  -2.507 0.012174 *  
## category_id19           -17.6476  3104.0194  -0.006 0.995464    
## category_id20            -0.1461     0.7784  -0.188 0.851131    
## category_id22            -1.1240     0.5296  -2.122 0.033805 *  
## category_id23            -1.2601     0.5511  -2.286 0.022226 *  
## category_id24            -1.1334     0.4387  -2.584 0.009773 ** 
## category_id25           -17.9443   992.6782  -0.018 0.985578    
## category_id26            -1.7309     0.6378  -2.714 0.006649 ** 
## category_id27           -17.9649  1333.7422  -0.013 0.989253    
## category_id28            -2.6280     1.0948  -2.401 0.016371 *  
## category_id29             1.5116     0.9906   1.526 0.127001    
## monthsAugust            -17.2021  7381.1316  -0.002 0.998140    
## monthsDecember           -1.2427     0.4219  -2.946 0.003221 ** 
## monthsFebruary           -1.2117     0.4498  -2.694 0.007065 ** 
## monthsJanuary            -1.2591     0.4373  -2.879 0.003990 ** 
## monthsJuly              -17.1549 10754.0130  -0.002 0.998727    
## monthsJune               -2.2356     1.1082  -2.017 0.043662 *  
## monthsMarch              -1.3645     0.5276  -2.586 0.009700 ** 
## monthsMay                 0.4740     0.3929   1.206 0.227713    
## monthsNovember           -1.6824     0.5047  -3.333 0.000859 ***
## monthsOctober           -17.4856  7558.6978  -0.002 0.998154    
## monthsSeptember         -18.4240  7425.0868  -0.002 0.998020    
## comments_disabledTrue     1.0180     0.9258   1.100 0.271505    
## ratings_disabledTrue    -17.2300  4337.6938  -0.004 0.996831    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 655.05  on 1269  degrees of freedom
## Residual deviance: 531.19  on 1242  degrees of freedom
## AIC: 587.19
## 
## Number of Fisher Scoring iterations: 18
## Analysis of Deviance Table (Type II tests)
## 
## Response: viral
##                   LR Chisq Df Pr(>Chisq)    
## category_id         77.446 14  8.399e-11 ***
## months              46.799 11  2.334e-06 ***
## comments_disabled    1.053  1     0.3049    
## ratings_disabled     0.976  1     0.3231    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We then remove ratings_disabled, since it is still not significant at the 0.05 level.

## 
## Call:
## glm(formula = viral ~ category_id + months + comments_disabled, 
##     family = binomial, data = data4.test)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4376  -0.4092  -0.2821  -0.1375   2.7898  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -0.6731     0.4588  -1.467 0.142348    
## category_id2             -0.1492     0.8496  -0.176 0.860584    
## category_id10             0.2337     0.4295   0.544 0.586399    
## category_id15           -17.7139  2154.7725  -0.008 0.993441    
## category_id17            -1.6148     0.6359  -2.539 0.011102 *  
## category_id19           -17.6550  3103.9395  -0.006 0.995462    
## category_id20            -0.1479     0.7777  -0.190 0.849179    
## category_id22            -1.1253     0.5291  -2.127 0.033434 *  
## category_id23            -1.2672     0.5506  -2.301 0.021364 *  
## category_id24            -1.1333     0.4378  -2.589 0.009639 ** 
## category_id25           -17.9345   993.2446  -0.018 0.985594    
## category_id26            -1.7365     0.6373  -2.725 0.006433 ** 
## category_id27           -17.9868  1337.6118  -0.013 0.989271    
## category_id28            -2.6017     1.0874  -2.393 0.016728 *  
## category_id29             1.5051     0.9902   1.520 0.128524    
## monthsAugust            -17.2160  7387.6359  -0.002 0.998141    
## monthsDecember           -1.2392     0.4216  -2.940 0.003287 ** 
## monthsFebruary           -1.2103     0.4496  -2.692 0.007105 ** 
## monthsJanuary            -1.2639     0.4370  -2.892 0.003827 ** 
## monthsJuly              -17.1564 10754.0130  -0.002 0.998727    
## monthsJune               -2.2120     1.1040  -2.004 0.045111 *  
## monthsMarch              -1.3823     0.5271  -2.622 0.008730 ** 
## monthsMay                 0.4803     0.3924   1.224 0.220995    
## monthsNovember           -1.7055     0.5065  -3.367 0.000759 ***
## monthsOctober           -17.4907  7557.9895  -0.002 0.998154    
## monthsSeptember         -18.4312  7424.7936  -0.002 0.998019    
## comments_disabledTrue     0.7864     0.8888   0.885 0.376228    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 655.05  on 1269  degrees of freedom
## Residual deviance: 532.16  on 1243  degrees of freedom
## AIC: 586.16
## 
## Number of Fisher Scoring iterations: 18
## Analysis of Deviance Table (Type II tests)
## 
## Response: viral
##                   LR Chisq Df Pr(>Chisq)    
## category_id         77.029 14  1.002e-10 ***
## months              47.421 11  1.811e-06 ***
## comments_disabled    0.694  1     0.4049    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that comments_disabled is also not significant at the 0.05 level, so we remove it to obtain our final glm model.

## 
## Call:
## glm(formula = viral ~ category_id + months, family = binomial, 
##     data = data4.test)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2020  -0.3916  -0.2805  -0.1448   3.0208  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -0.6563     0.4565  -1.438 0.150525    
## category_id2       -0.1731     0.8477  -0.204 0.838156    
## category_id10       0.2084     0.4261   0.489 0.624769    
## category_id15     -17.7426  2154.3531  -0.008 0.993429    
## category_id17      -1.6249     0.6344  -2.561 0.010434 *  
## category_id19     -17.6778  3104.0035  -0.006 0.995456    
## category_id20      -0.1616     0.7757  -0.208 0.834987    
## category_id22      -1.1452     0.5270  -2.173 0.029768 *  
## category_id23      -1.2988     0.5477  -2.371 0.017732 *  
## category_id24      -1.1479     0.4351  -2.638 0.008335 ** 
## category_id25     -17.9251   993.3093  -0.018 0.985602    
## category_id26      -1.7668     0.6349  -2.783 0.005388 ** 
## category_id27     -18.0168  1337.4465  -0.013 0.989252    
## category_id28      -2.5307     1.0763  -2.351 0.018706 *  
## category_id29       1.4838     0.9889   1.500 0.133491    
## monthsAugust      -17.2404  7408.6977  -0.002 0.998143    
## monthsDecember     -1.2340     0.4215  -2.928 0.003415 ** 
## monthsFebruary     -1.2048     0.4495  -2.680 0.007356 ** 
## monthsJanuary      -1.2606     0.4370  -2.885 0.003919 ** 
## monthsJuly        -17.1429 10754.0130  -0.002 0.998728    
## monthsJune         -2.1580     1.0956  -1.970 0.048885 *  
## monthsMarch        -1.3652     0.5266  -2.592 0.009529 ** 
## monthsMay           0.5056     0.3909   1.293 0.195850    
## monthsNovember     -1.6749     0.5039  -3.323 0.000889 ***
## monthsOctober     -17.4867  7555.7302  -0.002 0.998153    
## monthsSeptember   -18.4434  7421.0651  -0.002 0.998017    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 655.05  on 1269  degrees of freedom
## Residual deviance: 532.86  on 1244  degrees of freedom
## AIC: 584.86
## 
## Number of Fisher Scoring iterations: 18
## Analysis of Deviance Table (Type II tests)
## 
## Response: viral
##             LR Chisq Df Pr(>Chisq)    
## category_id   76.560 14  1.223e-10 ***
## months        47.534 11  1.728e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We have our final model, which takes into account category_id and months. We now compute the testing error.

## [1] 0.06961392

The testing error based on these predictors is approximately 0.070.

We also use ROC curves below to evaluate the performance of our final model on the testing data.

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Area under the curve: 0.8054
## Area under the curve: 0.7994
## Area under the curve: 0.7971
## Area under the curve: 0.7972