I. Executive summary.

Founded 14 years ago, YouTube, a video-sharing platform, is currently the second-most popular site in the world, behind only its parent company Google. An estimated 300 hours of video are uploaded to the site every minute, and almost 5 billion videos are watched on YouTube every day. Of all the videos posted, only a small fraction achieve “virality”.

We found our data set on Kaggle: https://www.kaggle.com/datasnaek/youtube-new.

II. Response and predictor variables

Data: USvideos.csv contains 40949 observations of 16 variables. The observations are videos published between 2006 and 2018, all of which “trended” at some point. When a video is trending, YouTube’s algorithm has deemed the video ‘relevant’ and thus promotes it on the trending feed located on the home screen menu.

Note: YouTube does not disclose the algorithm it uses to decide what is ‘trending’.

We will add a new categorical variable, Viral, as our response variable. A video is defined as having gone viral if it achieves more than 5 million views.

The predictor variables in the original data set are as follows:

  • video_id: unique id assigned to the video (will remove from analysis)

  • trending_date: the date that YouTube started promoting the video on its ‘trending’ feed.

  • title: video title

  • channel_title: author or publisher of the material

  • category_id: 16 levels of video category ID. They are described as follows:

ID  Category Name           ID  Category Name
 1  Film & Animation        23  Comedy
 2  Autos & Vehicles        24  Entertainment
10  Music                   25  News & Politics
15  Pets & Animals          26  Howto & Style
17  Sports                  27  Education
19  Travel & Events         28  Science & Technology
20  Gaming                  29  Nonprofits & Activism
22  People & Blogs          43  Shows
  • publish_time: date and time when the video was published

  • tags: user-generated tags to improve SEO

  • views: total number of views the video received as of the last time it trended (will remove from analysis)

  • likes: total number of likes the video received as of the last time it trended

  • dislikes: total number of dislikes the video received as of the last time it trended

  • comment_count: total number of comments the video received as of the last time it trended

  • thumbnail_link: link to outside material (will remove from analysis)

  • comments_disabled: whether or not the uploader disabled comments

  • ratings_disabled: whether or not the uploader disabled ratings

  • video_error_or_removed: whether or not the content was removed or had an error

  • description: user-generated video description

Depending on the analysis requirements, we may add, delete, or modify attributes as necessary; we will call out any such changes in our analysis.

III. Detailed process of the analysis

a) EDA

Data exploration and summary

Upon opening the data, we quickly noticed that many videos had “duplicate” entries, one for each day the video “trended”: if a video trended on multiple days, it appears in the dataset multiple times. Because we want to study which categories or key words are correlated with having gone viral, we removed the duplicate rows, keeping only the row from the last day each video trended and using the accumulated view count reported there.
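
A minimal sketch of this de-duplication step (assuming dplyr and readr are available; the object names us and us.dedup are ours, and the report's actual code may differ):

library(dplyr)
library(readr)

us <- read_csv("USvideos.csv")
dim(us)                                   # 40949 rows, 16 columns

# Views accumulate over time, so a video's last trending snapshot is the row
# with its highest reported view count.
us.dedup <- us %>%
  group_by(video_id) %>%
  slice_max(views, n = 1, with_ties = FALSE) %>%
  ungroup()

n_distinct(us$video_id)                   # number of unique videos
dim(us.dedup)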

## [1] 40949    16
## [1] 6351
## [1] 6351   16

We also need to remove categorical variables with no predictive power (video_id, thumbnail_link). We remove description as well, because its key words should already be captured in tags and title.

Now we define viral as amassing more than 5 million views; this will be our categorical response variable.

Code  Description
0     5 million views or fewer
1     More than 5 million views

We then remove views as a predictor.
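
A minimal sketch of these steps, continuing from the de-duplicated data above (the object name data1 is an assumption; the column names are from the data set):

data1 <- us.dedup %>%
  select(-video_id, -thumbnail_link, -description) %>%    # no predictive power / redundant
  mutate(viral = factor(ifelse(views > 5e6, 1, 0))) %>%    # response: more than 5 million views
  select(-views)                                           # drop views as a predictor

str(data1)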

## Classes 'tbl_df', 'tbl' and 'data.frame':    6351 obs. of  13 variables:
##  $ trending_date         : chr  "18.22.02" "18.11.06" "18.01.02" "18.01.05" ...
##  $ title                 : chr  "Padma Lakshmi On A #TopChefâ\200\231s Cancer Diagnosis | WWHL" "Mindy Kaling's Daughter Had the Perfect Reaction to Entering Oprah's House" "Megan Mullally Didn't Notice the Interesting Pattern with Ellen's Roommates" "Cast of Avengers: Infinity War Draws Their Characters" ...
##  $ channel_title         : chr  "Watch What Happens Live with Andy Cohen" "TheEllenShow" "TheEllenShow" "Jimmy Kimmel Live" ...
##  $ category_id           : int  24 24 24 23 22 10 25 27 17 10 ...
##  $ publish_time          : chr  "2018-02-15T04:30:12.000Z" "2018-06-04T13:00:00.000Z" "2018-01-29T14:00:39.000Z" "2018-04-27T07:30:02.000Z" ...
##  $ tags                  : chr  "What What Happens live|reality|interview|fun|celebrity|Andy Cohen|talk|show|program|Bravo|Watch What Happens Li"| __truncated__ "ellen|ellen degeneres|the ellen show|ellentube|ellen audience|season 15 episode 165|mindy kaling|mindy kaling b"| __truncated__ "megan mullally|megan|mullally|will and grace|karen on will and grace|actress|nick offerman|Ellen|degeneres|elle"| __truncated__ "jimmy|jimmy kimmel|jimmy kimmel live|late night|talk show|funny|comedic|comedy|clip|comedian|mean tweets|Benedi"| __truncated__ ...
##  $ likes                 : int  136 9773 4429 41248 7734 41016 3788 460 12984 129381 ...
##  $ dislikes              : int  33 332 54 580 212 1642 603 27 383 1522 ...
##  $ comment_count         : int  24 423 94 1484 846 977 3093 20 714 8757 ...
##  $ comments_disabled     : chr  "False" "False" "False" "False" ...
##  $ ratings_disabled      : chr  "False" "False" "False" "False" ...
##  $ video_error_or_removed: chr  "False" "False" "False" "False" ...
##  $ viral                 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

Proportion of viral videos:

## 
##          0          1 
## 0.92883011 0.07116989

Only 7.1% of the 6351 trending videos in our data set had gone viral (more than 5 million views). We now clean up the data to identify potential variables that could predict going ‘viral’.

Data clean-up

We need to convert several attributes into data types we can analyze.

i) Date variables (trending_date and publish_time): convert to date format and derive trending_lag (the number of days between publishing and trending), as well as weekdays and months (the day of week and month in which the video was published).

ii) Categorical variables (category_id, channel_title, comments_disabled, ratings_disabled, video_error_or_removed, weekdays, and months): convert to factors.

## Classes 'tbl_df', 'tbl' and 'data.frame':    6351 obs. of  16 variables:
##  $ trending_date         : POSIXlt, format: "2018-02-22" "2018-06-11" ...
##  $ title                 : chr  "Padma Lakshmi On A #TopChefâ\200\231s Cancer Diagnosis | WWHL" "Mindy Kaling's Daughter Had the Perfect Reaction to Entering Oprah's House" "Megan Mullally Didn't Notice the Interesting Pattern with Ellen's Roommates" "Cast of Avengers: Infinity War Draws Their Characters" ...
##  $ channel_title         : Factor w/ 2199 levels "12 News","1MILLION Dance Studio",..: 2129 1955 1955 980 1316 171 424 1663 1376 53 ...
##  $ category_id           : Factor w/ 16 levels "1","2","10","15",..: 10 10 10 9 8 3 11 13 5 3 ...
##  $ publish_time          : POSIXlt, format: "2018-02-15" "2018-06-04" ...
##  $ tags                  : chr  "What What Happens live|reality|interview|fun|celebrity|Andy Cohen|talk|show|program|Bravo|Watch What Happens Li"| __truncated__ "ellen|ellen degeneres|the ellen show|ellentube|ellen audience|season 15 episode 165|mindy kaling|mindy kaling b"| __truncated__ "megan mullally|megan|mullally|will and grace|karen on will and grace|actress|nick offerman|Ellen|degeneres|elle"| __truncated__ "jimmy|jimmy kimmel|jimmy kimmel live|late night|talk show|funny|comedic|comedy|clip|comedian|mean tweets|Benedi"| __truncated__ ...
##  $ likes                 : int  136 9773 4429 41248 7734 41016 3788 460 12984 129381 ...
##  $ dislikes              : int  33 332 54 580 212 1642 603 27 383 1522 ...
##  $ comment_count         : int  24 423 94 1484 846 977 3093 20 714 8757 ...
##  $ comments_disabled     : Factor w/ 2 levels "False","True": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ratings_disabled      : Factor w/ 2 levels "False","True": 1 1 1 1 1 1 1 1 1 1 ...
##  $ video_error_or_removed: Factor w/ 2 levels "False","True": 1 1 1 1 1 1 1 1 1 1 ...
##  $ viral                 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ trending_lag          : num  7 7 3 4 4 5 2 4 4 8 ...
##  $ weekdays              : Factor w/ 7 levels "Friday","Monday",..: 5 2 2 1 6 7 6 4 6 7 ...
##  $ months                : Factor w/ 12 levels "April","August",..: 4 7 5 1 10 1 3 10 4 4 ...
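
A sketch of how the conversions shown above might be produced (assuming lubridate and dplyr; the report's exact code may differ):

library(lubridate)

data1 <- data1 %>%
  mutate(
    trending_date = ydm(trending_date),                        # stored as yy.dd.mm
    publish_time  = as_date(ymd_hms(publish_time)),
    trending_lag  = as.numeric(trending_date - publish_time),  # days from publish to trend
    weekdays      = factor(weekdays(publish_time)),
    months        = factor(months(publish_time)),
    across(c(channel_title, category_id, comments_disabled,
             ratings_disabled, video_error_or_removed), factor)
  )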

iii) Applying a log transform to attributes with skewed distributions (likes, dislikes, and comment_count)

We can see that the distributions of likes, dislikes, and comment_count are heavily right-skewed, so we apply a log transform to improve their usefulness as predictors.
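
A sketch of the transform (log1p is used here so that zero counts stay defined; the report's exact transform may differ):

data1 <- data1 %>%
  mutate(
    likes         = log1p(likes),
    dislikes      = log1p(dislikes),
    comment_count = log1p(comment_count)
  )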

Given that likes, dislikes, and comment_count depend heavily on how much the video was viewed, we have to question whether these three variables are good predictors of whether a video goes viral. A more interesting variable to explore in a later study could be a sentiment_index, comparing the ratio of likes to dislikes multiplied by a factor of sd(log(comment_count)). Perhaps people like to hate-watch certain programs, or some channels produce particularly triggering material to get more clicks.

b) Text mining

i) Preparing text for analysis

Titles
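
The structure printed below is a DocumentTermMatrix. A minimal sketch of how such a matrix can be built with the tm package (the cleaning steps and the sparsity cutoff are assumptions; SnowballC is needed for stemming):

library(tm)

title.corpus <- VCorpus(VectorSource(data1$title))
title.corpus <- tm_map(title.corpus, content_transformer(tolower))
title.corpus <- tm_map(title.corpus, removePunctuation)
title.corpus <- tm_map(title.corpus, removeNumbers)
title.corpus <- tm_map(title.corpus, removeWords, stopwords("english"))
title.corpus <- tm_map(title.corpus, stemDocument)

title.dtm <- DocumentTermMatrix(title.corpus)      # full matrix over 6351 documents
title.dtm <- removeSparseTerms(title.dtm, 0.995)   # drop very rare terms
dim(title.dtm)

The tags column is prepared in the same way (see the Tags output further below).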

## List of 6
##  $ i       : int [1:36384] 1 1 1 1 1 1 2 2 2 2 ...
##  $ j       : int [1:36384] 1126 1996 4200 5430 7527 8152 1849 2468 3506 3973 ...
##  $ v       : num [1:36384] 1 1 1 1 1 1 1 1 1 1 ...
##  $ nrow    : int 6351
##  $ ncol    : int 8242
##  $ dimnames:List of 2
##   ..$ Docs : chr [1:6351] "1" "2" "3" "4" ...
##   ..$ Terms: chr [1:8242] "ã–rs" "ã–zil" "â—\220" "世畜ã\201§ä¸\200番å\210‡ã‚œã‚‹ãƒ‘スタã\201®åœ…ä¸\201を作゚ã\201ÿã\201„ï¼\201" ...
##  - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
##  - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
## [1] 443
## [1] 6351  443
## List of 6
##  $ i       : int [1:15904] 1 2 2 2 3 4 4 4 4 5 ...
##  $ j       : int [1:15904] 439 178 293 311 96 18 52 183 418 406 ...
##  $ v       : num [1:15904] 1 1 1 1 1 1 1 1 1 1 ...
##  $ nrow    : int 6351
##  $ ncol    : int 443
##  $ dimnames:List of 2
##   ..$ Docs : chr [1:6351] "1" "2" "3" "4" ...
##   ..$ Terms: chr [1:443] "â\200“" "â\200”" "abc" "actual" ...
##  - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
##  - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
## [1] 6351  461
##  [1] "trending_date"          "title"                 
##  [3] "channel_title"          "category_id"           
##  [5] "publish_time"           "tags"                  
##  [7] "likes"                  "dislikes"              
##  [9] "comment_count"          "comments_disabled"     
## [11] "ratings_disabled"       "video_error_or_removed"
## [13] "viral"                  "trending_lag"          
## [15] "weekdays"               "months"                
## [17] "title.cleaned"          "tags.cleaned"          
## [19] "â.."                    "â...1"
## [1] 6351  444
##  [1] "viral"    "â.."      "â...1"    "abc"      "actual"   "adam"    
##  [7] "amazon"   "america"  "american" "anim"

Tags

## List of 6
##  $ i       : int [1:128784] 1 1 1 1 1 1 1 1 1 1 ...
##  $ j       : int [1:128784] 513 702 2151 2593 2837 3366 3445 3599 4506 4507 ...
##  $ v       : num [1:128784] 1 2 2 2 1 1 1 1 1 1 ...
##  $ nrow    : int 6351
##  $ ncol    : int 19206
##  $ dimnames:List of 2
##   ..$ Docs : chr [1:6351] "1" "2" "3" "4" ...
##   ..$ Terms: chr [1:19206] "â\210†" "如何过你自己的ç”ÿæ´»" "妻" "寿å\217¸" ...
##  - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
##  - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
## [1] 2402
## [1] 6351 2402
## List of 6
##  $ i       : int [1:90831] 1 1 1 1 1 1 1 1 1 1 ...
##  $ j       : int [1:90831] 44 67 254 310 337 423 437 456 660 826 ...
##  $ v       : num [1:90831] 1 2 2 2 1 1 1 1 1 2 ...
##  $ nrow    : int 6351
##  $ ncol    : int 2402
##  $ dimnames:List of 2
##   ..$ Docs : chr [1:6351] "1" "2" "3" "4" ...
##   ..$ Terms: chr [1:2402] "à¤\210" "डबà¥\215लू" "सà¥\201परसà¥\215à¤ÿार" "मà¥\210च" ...
##  - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
##  - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
##  [1] "viral"                          "à.."                           
##  [3] "à..à..à..à.²à.."                "à..à..à.ªà..à..à..à.ÿà..à.."   
##  [5] "à..à..à.š"                      "à.µà..à..à..à.µà..à..à..à..à.."
##  [7] "à..à..à..à..à..à.."             "à.ªà.¹à.²à.µà..à.."            
##  [9] "aaron"                          "abc"

Splitting data

data3 (viral and the tag words that appear in at least 0.25% of the videos)

## [1] 6351 2403
## [1] 1270 2403
## [1] 4065 2403
## [1] 1016 2403
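
A sketch of the three-way split (the row counts follow the dimensions above; the seed and the name data3.validate are assumptions):

set.seed(2019)
n         <- nrow(data3)
idx.train <- sample(n, 4065)
idx.rest  <- setdiff(seq_len(n), idx.train)
idx.test  <- sample(idx.rest, 1270)
idx.val   <- setdiff(idx.rest, idx.test)

data3.train    <- data3[idx.train, ]
data3.test     <- data3[idx.test, ]
data3.validate <- data3[idx.val, ]

The title data (data2) is split the same way.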

ii) Lasso | WordCloud

Title Analysis

First, we run LASSO to identify which words in the title often appear in viral videos.

We load the LASSO results here and plot them to determine the best lambda.

lambda.min is used to reduce the dimensionality of the text through the term-frequency table.
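
A minimal sketch of this LASSO step with glmnet (object names other than data2.train are assumptions):

library(glmnet)

X.title <- as.matrix(data2.train[, setdiff(names(data2.train), "viral")])
y.title <- data2.train$viral

set.seed(2019)
title.lasso <- cv.glmnet(X.title, y.title, family = "binomial", alpha = 1)
plot(title.lasso)                                          # CV curve used to pick lambda

coef.min <- as.matrix(coef(title.lasso, s = "lambda.min"))
rownames(coef.min)[coef.min != 0]                          # words retained at lambda.min

The same procedure is applied to the tag matrix (data3) in the Tags Analysis below.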

##   [1] "(Intercept)" "amazon"      "america"     "anthem"      "artist"     
##   [6] "audit"       "aveng"       "babi"        "back"        "bad"        
##  [11] "ball"        "bbc"         "beat"        "best"        "big"        
##  [16] "black"       "bowl"        "bts"         "cabello"     "cake"       
##  [21] "call"        "cardi"       "cat"         "celebr"      "chang"      
##  [26] "christma"    "come"        "cook"        "cover"       "cri"        
##  [31] "david"       "day"         "die"         "diy"         "dog"        
##  [36] "dress"       "eagl"        "easi"        "emot"        "espn"       
##  [41] "everyth"     "explain"     "face"        "fair"        "fake"       
##  [46] "famili"      "fan"         "feat"        "featur"      "fight"      
##  [51] "film"        "find"        "fire"        "fox"         "fri"        
##  [56] "full"        "futur"       "gadget"      "golden"      "got"        
##  [61] "grace"       "graham"      "grammi"      "guy"         "hair"       
##  [66] "high"        "holiday"     "honest"      "ice"         "impress"    
##  [71] "interview"   "jame"        "japanes"     "jedi"        "jennif"     
##  [76] "jimmi"       "jordan"      "just"        "kelli"       "kendrick"   
##  [81] "kim"         "king"        "know"        "kyli"        "laugh"      
##  [86] "leagu"       "let"         "light"       "line"        "live"       
##  [91] "lost"        "love"        "machin"      "magic"       "michael"    
##  [96] "moment"      "money"       "motion"      "move"        "movi"       
## [101] "music"       "name"        "need"        "netflix"     "new"        
## [106] "news"        "nick"        "night"       "nintendo"    "now"        
## [111] "offici"      "part"        "paul"        "peopl"       "perfect"    
## [116] "pictur"      "power"       "presid"      "princ"       "react"      
## [121] "reaction"    "real"        "reason"      "refineri"    "reveal"     
## [126] "review"      "routin"      "save"        "scene"       "scott"      
## [131] "season"      "secret"      "see"         "shawn"       "sheeran"    
## [136] "shoot"       "shop"        "show"        "sing"        "smith"      
## [141] "special"     "speech"      "stephen"     "stock"       "stop"       
## [146] "stranger"    "studio"      "super"       "take"        "talk"       
## [151] "taylor"      "teaser"      "tell"        "theater"     "theori"     
## [156] "thing"       "timberlak"   "time"        "today"       "top"        
## [161] "trailer"     "train"       "tri"         "trump"       "use"        
## [166] "video"       "voic"        "water"       "wed"         "wild"       
## [171] "win"         "wish"        "world"       "wwhl"        "year"

We then pull out all the positive coefficients and the corresponding words, rank the coefficients in decreasing order, and report the two leading words, their coefficients, and a brief interpretation.

##   anthem    cardi 
## 3.712773 3.154864

The two leading positive words are “anthem” with a coefficient of 3.71 and “cardi” with a coefficient of 3.15. Each coefficient is the estimated change in the log odds of a video being viral for a one-unit increase in the frequency of that word in the title.

We then create a word cloud with the top positive words associated with viral videos, ranked according to their coefficients.
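
A sketch of the word cloud step (wordcloud package; coef.min refers to the coefficient matrix from the sketch above):

library(wordcloud)

cc  <- coef.min[, 1]
pos <- sort(cc[cc > 0 & names(cc) != "(Intercept)"], decreasing = TRUE)
head(names(pos), 10)                                 # leading positive words

wordcloud(words = names(pos), freq = pos,
          min.freq = 0, random.order = FALSE, scale = c(3, 0.5))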

##  [1] "anthem"    "cardi"     "motion"    "perfect"   "paul"     
##  [6] "timberlak" "kiss"      "bts"       "sheeran"   "shawn"

Tags Analysis

First, we run LASSO to identify which tags often appear in viral videos.

We load the LASSO results here.

lambda.min is used to reduce the dimensionality of the text through the term-frequency table.

##  [1] "bangtan"         "bowl"            "bubbl"          
##  [4] "cardi"           "cup"             "deadpool"       
##  [7] "dobr"            "drake"           "dude"           
## [10] "ë..ë¹."          "ë..íƒ.ì.œë..ë.." "feat"           
## [13] "foil"            "got"             "halsey"         
## [16] "khale"           "label"           "latin"          
## [19] "lizzza"          "logan"           "lopez"          
## [22] "marvel"          "movi"            "offici"         
## [25] "perfect"         "pictur"          "pon"            
## [28] "pop"             "rca"             "reason"         
## [31] "record"          "remix"           "road"           
## [34] "spi"             "spiderman"       "ultra"          
## [37] "zedd"

Next, we pull out all the positive coefficients and the corresponding words and rank them to determine which ones have the biggest positive association with virality.

## ë..íƒ.ì.œë..ë..          ë..ë¹.          halsey           label 
##        3.758664        3.291212        2.293280        2.278958
##   halsey    label 
## 2.293280 2.278958

The two leading positive words that we are able to interpret are “halsey” with a coefficient of 2.29 and “label” with a coefficient of 2.28. Each coefficient is the estimated change in the log odds of a video being viral for a one-unit increase in the frequency of that tag.

A word cloud with the top 100 positive words according to their coefficients.

##  [1] "halsey" "label"  "latin"  "lizzza" "pon"    "bubbl"  "ultra" 
##  [8] "spi"    "reason" "remix"

From the LASSO fit for the title analysis, we see that the testing misclassification error is 0.074.

## [1] 0.07401575

Using majority vote, the testing misclassification error for the LASSO fit for the tag analysis is 0.075.

## [1] 0.07480315

From the glm fit for the title analysis, we see that the testing misclassification error is 0.060.

## [1] 0.06021204
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## 
## Call:
## roc.default(response = data2.test$viral, predictor = predict.glm,     plot = T)
## 
## Data: predict.glm in 1179 controls (data2.test$viral 0) < 91 cases (data2.test$viral 1).
## Area under the curve: 0.8252
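
A sketch of the glm fit and ROC evaluation (pROC package; the names predict.glm and data2.test match the output above, while the exact predictor set is an assumption and may be restricted to the LASSO-selected words):

title.glm   <- glm(viral ~ ., family = binomial, data = data2.train)
predict.glm <- predict(title.glm, newdata = data2.test, type = "response")

pred.class <- ifelse(predict.glm > 0.5, "1", "0")      # majority vote at 0.5
mean(pred.class != data2.test$viral)                   # testing misclassification error

library(pROC)
roc(response = data2.test$viral, predictor = predict.glm, plot = TRUE)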

Using majority vote, the testing misclassification error for the glm fit for the tag analysis is 0.084.

## [1] 0.08361672
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## 
## Call:
## roc.default(response = data3.test$viral, predictor = predict.tag.glm,     plot = T)
## 
## Data: predict.tag.glm in 1179 controls (data3.test$viral 0) < 91 cases (data3.test$viral 1).
## Area under the curve: 0.6646

iii) Random forest

We use ranger() from the ranger package, a faster implementation of random forests.

Title Analysis

We run a loop with the number of trees varying from 20 to 200 (in increments of 20) and record the testing errors.
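
A sketch of this tuning loop (the seed is an assumption; ranger reports the OOB error as fit$prediction.error):

library(ranger)

ntrees   <- seq(20, 200, by = 20)
test.err <- oob.err <- numeric(length(ntrees))

for (i in seq_along(ntrees)) {
  set.seed(2019)
  fit <- ranger(viral ~ ., data = data2.train,
                num.trees = ntrees[i], splitrule = "gini")
  pred        <- predict(fit, data = data2.test)$predictions
  test.err[i] <- mean(pred != data2.test$viral)          # testing error
  oob.err[i]  <- fit$prediction.error                     # OOB error
}
test.err
oob.err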

##  [1] 0.06687898 0.06687898 0.06767516 0.06847134 0.06767516 0.06767516
##  [7] 0.06767516 0.06767516 0.06767516 0.06767516
##  [1] 0.07261875 0.07187267 0.07162397 0.07187267 0.07137528 0.07187267
##  [7] 0.07237006 0.07137528 0.07162397 0.07137528

As we observe from these vectors, both the testing misclassification error and the OOB error are stable after 100 trees. Therefore, we picked 100 trees to balance accuracy and computing power.

Next, we run a loop with 100 trees and varying mtry (14-32, in increments of 2) and record the testing errors.

##  [1] 0.06847134 0.06767516 0.06767516 0.06847134 0.06847134 0.06926752
##  [7] 0.06687898 0.06767516 0.06847134 0.06847134
##  [1] 0.07261875 0.07162397 0.07087789 0.07212136 0.07187267 0.07187267
##  [7] 0.07237006 0.07212136 0.07286745 0.07261875

We observe that the testing misclassification error is minimized at mtry = 26, while the OOB error swings slightly across the different mtry values. Therefore, we picked mtry = 26 instead of the default mtry of about sqrt(443) ≈ 21.

## Ranger result
## 
## Call:
##  ranger::ranger(viral ~ ., data2.train, num.trees = 100, mtry = 26,      splitrule = "gini", importance = "impurity") 
## 
## Type:                             Classification 
## Number of trees:                  100 
## Sample size:                      4065 
## Number of independent variables:  443 
## Mtry:                             26 
## Target node size:                 1 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error:             7.31 %
##    offici     video   perfect     cardi   trailer   sheeran    teaser 
## 20.715256 10.184915 10.011829  4.810809  4.179687  3.064753  2.881592 
##      fake     audio       bts    pictur    justin     world  kendrick 
##  2.693733  2.359388  2.195621  2.184556  2.067891  2.003746  1.955107 
##       new       fri     david      love timberlak       bad 
##  1.894719  1.774509  1.725552  1.674613  1.622483  1.613559

The top 5 most important words are offici[al], video, perfect, cardi, and trailer.

##     predicted
## true    0    1
##    0 3758   13
##    1  284   10
## [1] 0.07401575

The testing misclassification error for our final random forest model is 7.40%.

Tags Analysis

We run a loop with the number of trees varying from 20 to 200 (in increments of 20) and record the testing errors.

##  [1] 0.06210191 0.06210191 0.05971338 0.05812102 0.06050955 0.05812102
##  [7] 0.06050955 0.05971338 0.05971338 0.05971338
##  [1] 0.07187267 0.06739617 0.06938572 0.06863964 0.06789356 0.06789356
##  [7] 0.06764486 0.06490923 0.06689878 0.06764486

We observe that we need 160 trees to stabilize testing misclassification error and minimize OOB error.

Next, we run a loop with 160 trees and varying mtry (24-78, in increments of 6) and record the testing errors.

##  [1] 0.06210191 0.06130573 0.05891720 0.05891720 0.05891720 0.06130573
##  [7] 0.06050955 0.06050955 0.06130573 0.06050955
##  [1] 0.06839095 0.06863964 0.06714748 0.06764486 0.06540662 0.06615270
##  [7] 0.06689878 0.06739617 0.06789356 0.06739617

We observe that mtry = 48 minimizes the OOB error and ties for the lowest testing misclassification error. Therefore, we picked mtry = 48, which is also roughly the default of floor(sqrt(2380)) = 48.

## Ranger result
## 
## Call:
##  ranger::ranger(viral ~ ., data3.train, num.trees = 160, mtry = 48,      splitrule = "gini", importance = "impurity") 
## 
## Type:                             Classification 
## Number of trees:                  160 
## Sample size:                      4021 
## Number of independent variables:  2380 
## Mtry:                             48 
## Target node size:                 1 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error:             6.76 %
##          offici            paul           remix          record 
##        6.008956        3.992303        3.736670        3.431084 
##             pop           cardi            movi         bangtan 
##        3.306166        3.073816        2.955152        2.880538 
##           logan             pon            feat           elder 
##        2.856512        2.603727        2.154104        2.018227 
##             rap           twice             lil          ë..ë¹. 
##        1.983825        1.982894        1.958532        1.913647 
## ë..íƒ.ì.œë..ë..           video             new          spider 
##        1.891761        1.814045        1.769265        1.678810

The top 5 most important keywords in tags are offici[al], paul, remix, record, and pop.

##     predicted
## true    0    1
##    0 3718   17
##    1  255   31
## [1] 0.04724409

The testing misclassification error for our final random forest model is approximately 4.7%.

Final model

Based on minimizing testing misclassification error, we chose the glm fit for the title model and random forest for the tags model. We then used our validation data to determine the misclassification error of the final model.

Tags final model

## Ranger result
## 
## Call:
##  ranger::ranger(viral ~ ., data3.train, num.trees = 160, mtry = 48,      splitrule = "gini", importance = "impurity") 
## 
## Type:                             Classification 
## Number of trees:                  160 
## Sample size:                      4021 
## Number of independent variables:  2380 
## Mtry:                             48 
## Target node size:                 1 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error:             6.76 %
##          offici            paul           remix          record 
##        6.008956        3.992303        3.736670        3.431084 
##             pop           cardi            movi         bangtan 
##        3.306166        3.073816        2.955152        2.880538 
##           logan             pon            feat           elder 
##        2.856512        2.603727        2.154104        2.018227 
##             rap           twice             lil          ë..ë¹. 
##        1.983825        1.982894        1.958532        1.913647 
## ë..íƒ.ì.œë..ë..           video             new          spider 
##        1.891761        1.814045        1.769265        1.678810
## [1] 0.09606299

According to the final model, using the validation data, we get a validation error of approximately 9.6%.

c) Logistic regression using non-text factors

Based on the data that we have, we also want to see what factors other than the text-related ones (title and tags) contribute to a video being viral. We run a logistic regression on factors such as category_id and on features such as comments_disabled and ratings_disabled. We do not include likes, dislikes, or comment_count since, as discussed above, they largely reflect how much a video has already been viewed.

## [1] 1270  461
## [1] 4065  461
## [1] 1016  461
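
A sketch of this logistic regression and the Type II tests shown below (the Anova() call is from the car package; per the printed Call, the model here is fit on data4.test):

library(car)

fit.nontext <- glm(viral ~ category_id + weekdays + months +
                     comments_disabled + ratings_disabled,
                   family = binomial, data = data4.test)
summary(fit.nontext)
Anova(fit.nontext)      # Type II likelihood-ratio chi-square tests
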
## 
## Call:
## glm(formula = viral ~ category_id + weekdays + months + comments_disabled + 
##     ratings_disabled, family = binomial, data = data4.test)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5551  -0.3707  -0.2546  -0.1075   2.9521  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)   
## (Intercept)           -5.889e-01  5.288e-01  -1.114  0.26538   
## category_id2          -2.199e-01  8.565e-01  -0.257  0.79739   
## category_id10          1.116e-01  4.413e-01   0.253  0.80032   
## category_id15         -1.779e+01  2.172e+03  -0.008  0.99347   
## category_id17         -1.675e+00  6.472e-01  -2.589  0.00963 **
## category_id19         -1.780e+01  3.099e+03  -0.006  0.99542   
## category_id20         -3.225e-01  7.878e-01  -0.409  0.68228   
## category_id22         -1.201e+00  5.350e-01  -2.246  0.02473 * 
## category_id23         -1.364e+00  5.576e-01  -2.446  0.01446 * 
## category_id24         -1.215e+00  4.470e-01  -2.719  0.00656 **
## category_id25         -1.801e+01  9.881e+02  -0.018  0.98546   
## category_id26         -1.823e+00  6.465e-01  -2.820  0.00480 **
## category_id27         -1.800e+01  1.326e+03  -0.014  0.98917   
## category_id28         -2.729e+00  1.101e+00  -2.479  0.01316 * 
## category_id29          1.548e+00  1.012e+00   1.529  0.12620   
## weekdaysMonday        -1.172e-01  4.360e-01  -0.269  0.78807   
## weekdaysSaturday      -9.790e-03  4.974e-01  -0.020  0.98430   
## weekdaysSunday        -2.057e-02  4.965e-01  -0.041  0.96696   
## weekdaysThursday       3.141e-01  3.623e-01   0.867  0.38600   
## weekdaysTuesday       -6.199e-01  4.529e-01  -1.369  0.17102   
## weekdaysWednesday      1.037e-01  3.786e-01   0.274  0.78412   
## monthsAugust          -1.707e+01  7.201e+03  -0.002  0.99811   
## monthsDecember        -1.216e+00  4.244e-01  -2.867  0.00415 **
## monthsFebruary        -1.208e+00  4.538e-01  -2.661  0.00779 **
## monthsJanuary         -1.214e+00  4.408e-01  -2.755  0.00587 **
## monthsJuly            -1.726e+01  1.075e+04  -0.002  0.99872   
## monthsJune            -2.182e+00  1.106e+00  -1.972  0.04865 * 
## monthsMarch           -1.369e+00  5.328e-01  -2.569  0.01021 * 
## monthsMay              5.022e-01  3.959e-01   1.269  0.20460   
## monthsNovember        -1.670e+00  5.076e-01  -3.290  0.00100 **
## monthsOctober         -1.761e+01  7.593e+03  -0.002  0.99815   
## monthsSeptember       -1.841e+01  7.436e+03  -0.002  0.99802   
## comments_disabledTrue  9.621e-01  9.373e-01   1.027  0.30465   
## ratings_disabledTrue  -1.728e+01  4.366e+03  -0.004  0.99684   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 655.05  on 1269  degrees of freedom
## Residual deviance: 526.06  on 1236  degrees of freedom
## AIC: 594.06
## 
## Number of Fisher Scoring iterations: 18
## Analysis of Deviance Table (Type II tests)
## 
## Response: viral
##                   LR Chisq Df Pr(>Chisq)    
## category_id         74.566 14  2.842e-10 ***
## weekdays             5.128  6     0.5275    
## months              46.269 11  2.898e-06 ***
## comments_disabled    0.926  1     0.3359    
## ratings_disabled     0.998  1     0.3179    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Using backward elimination, we first remove weekdays, as it is not significant at the 0.05 level.

## 
## Call:
## glm(formula = viral ~ category_id + months + comments_disabled + 
##     ratings_disabled, family = binomial, data = data4.test)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5357  -0.4000  -0.2828  -0.1214   2.7130  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -0.6803     0.4596  -1.480 0.138833    
## category_id2             -0.1453     0.8498  -0.171 0.864271    
## category_id10             0.2541     0.4306   0.590 0.555067    
## category_id15           -17.7116  2155.6437  -0.008 0.993444    
## category_id17            -1.5965     0.6368  -2.507 0.012174 *  
## category_id19           -17.6476  3104.0194  -0.006 0.995464    
## category_id20            -0.1461     0.7784  -0.188 0.851131    
## category_id22            -1.1240     0.5296  -2.122 0.033805 *  
## category_id23            -1.2601     0.5511  -2.286 0.022226 *  
## category_id24            -1.1334     0.4387  -2.584 0.009773 ** 
## category_id25           -17.9443   992.6782  -0.018 0.985578    
## category_id26            -1.7309     0.6378  -2.714 0.006649 ** 
## category_id27           -17.9649  1333.7422  -0.013 0.989253    
## category_id28            -2.6280     1.0948  -2.401 0.016371 *  
## category_id29             1.5116     0.9906   1.526 0.127001    
## monthsAugust            -17.2021  7381.1316  -0.002 0.998140    
## monthsDecember           -1.2427     0.4219  -2.946 0.003221 ** 
## monthsFebruary           -1.2117     0.4498  -2.694 0.007065 ** 
## monthsJanuary            -1.2591     0.4373  -2.879 0.003990 ** 
## monthsJuly              -17.1549 10754.0130  -0.002 0.998727    
## monthsJune               -2.2356     1.1082  -2.017 0.043662 *  
## monthsMarch              -1.3645     0.5276  -2.586 0.009700 ** 
## monthsMay                 0.4740     0.3929   1.206 0.227713    
## monthsNovember           -1.6824     0.5047  -3.333 0.000859 ***
## monthsOctober           -17.4856  7558.6978  -0.002 0.998154    
## monthsSeptember         -18.4240  7425.0868  -0.002 0.998020    
## comments_disabledTrue     1.0180     0.9258   1.100 0.271505    
## ratings_disabledTrue    -17.2300  4337.6938  -0.004 0.996831    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 655.05  on 1269  degrees of freedom
## Residual deviance: 531.19  on 1242  degrees of freedom
## AIC: 587.19
## 
## Number of Fisher Scoring iterations: 18
## Analysis of Deviance Table (Type II tests)
## 
## Response: viral
##                   LR Chisq Df Pr(>Chisq)    
## category_id         77.446 14  8.399e-11 ***
## months              46.799 11  2.334e-06 ***
## comments_disabled    1.053  1     0.3049    
## ratings_disabled     0.976  1     0.3231    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We then remove ratings_disabled, since it is still not significant at the 0.05 level.

## 
## Call:
## glm(formula = viral ~ category_id + months + comments_disabled, 
##     family = binomial, data = data4.test)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4376  -0.4092  -0.2821  -0.1375   2.7898  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -0.6731     0.4588  -1.467 0.142348    
## category_id2             -0.1492     0.8496  -0.176 0.860584    
## category_id10             0.2337     0.4295   0.544 0.586399    
## category_id15           -17.7139  2154.7725  -0.008 0.993441    
## category_id17            -1.6148     0.6359  -2.539 0.011102 *  
## category_id19           -17.6550  3103.9395  -0.006 0.995462    
## category_id20            -0.1479     0.7777  -0.190 0.849179    
## category_id22            -1.1253     0.5291  -2.127 0.033434 *  
## category_id23            -1.2672     0.5506  -2.301 0.021364 *  
## category_id24            -1.1333     0.4378  -2.589 0.009639 ** 
## category_id25           -17.9345   993.2446  -0.018 0.985594    
## category_id26            -1.7365     0.6373  -2.725 0.006433 ** 
## category_id27           -17.9868  1337.6118  -0.013 0.989271    
## category_id28            -2.6017     1.0874  -2.393 0.016728 *  
## category_id29             1.5051     0.9902   1.520 0.128524    
## monthsAugust            -17.2160  7387.6359  -0.002 0.998141    
## monthsDecember           -1.2392     0.4216  -2.940 0.003287 ** 
## monthsFebruary           -1.2103     0.4496  -2.692 0.007105 ** 
## monthsJanuary            -1.2639     0.4370  -2.892 0.003827 ** 
## monthsJuly              -17.1564 10754.0130  -0.002 0.998727    
## monthsJune               -2.2120     1.1040  -2.004 0.045111 *  
## monthsMarch              -1.3823     0.5271  -2.622 0.008730 ** 
## monthsMay                 0.4803     0.3924   1.224 0.220995    
## monthsNovember           -1.7055     0.5065  -3.367 0.000759 ***
## monthsOctober           -17.4907  7557.9895  -0.002 0.998154    
## monthsSeptember         -18.4312  7424.7936  -0.002 0.998019    
## comments_disabledTrue     0.7864     0.8888   0.885 0.376228    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 655.05  on 1269  degrees of freedom
## Residual deviance: 532.16  on 1243  degrees of freedom
## AIC: 586.16
## 
## Number of Fisher Scoring iterations: 18
## Analysis of Deviance Table (Type II tests)
## 
## Response: viral
##                   LR Chisq Df Pr(>Chisq)    
## category_id         77.029 14  1.002e-10 ***
## months              47.421 11  1.811e-06 ***
## comments_disabled    0.694  1     0.4049    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that comments_disabled is also not significant at the 0.05 level, so we remove it to obtain our final glm model.

## 
## Call:
## glm(formula = viral ~ category_id + months, family = binomial, 
##     data = data4.test)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2020  -0.3916  -0.2805  -0.1448   3.0208  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -0.6563     0.4565  -1.438 0.150525    
## category_id2       -0.1731     0.8477  -0.204 0.838156    
## category_id10       0.2084     0.4261   0.489 0.624769    
## category_id15     -17.7426  2154.3531  -0.008 0.993429    
## category_id17      -1.6249     0.6344  -2.561 0.010434 *  
## category_id19     -17.6778  3104.0035  -0.006 0.995456    
## category_id20      -0.1616     0.7757  -0.208 0.834987    
## category_id22      -1.1452     0.5270  -2.173 0.029768 *  
## category_id23      -1.2988     0.5477  -2.371 0.017732 *  
## category_id24      -1.1479     0.4351  -2.638 0.008335 ** 
## category_id25     -17.9251   993.3093  -0.018 0.985602    
## category_id26      -1.7668     0.6349  -2.783 0.005388 ** 
## category_id27     -18.0168  1337.4465  -0.013 0.989252    
## category_id28      -2.5307     1.0763  -2.351 0.018706 *  
## category_id29       1.4838     0.9889   1.500 0.133491    
## monthsAugust      -17.2404  7408.6977  -0.002 0.998143    
## monthsDecember     -1.2340     0.4215  -2.928 0.003415 ** 
## monthsFebruary     -1.2048     0.4495  -2.680 0.007356 ** 
## monthsJanuary      -1.2606     0.4370  -2.885 0.003919 ** 
## monthsJuly        -17.1429 10754.0130  -0.002 0.998728    
## monthsJune         -2.1580     1.0956  -1.970 0.048885 *  
## monthsMarch        -1.3652     0.5266  -2.592 0.009529 ** 
## monthsMay           0.5056     0.3909   1.293 0.195850    
## monthsNovember     -1.6749     0.5039  -3.323 0.000889 ***
## monthsOctober     -17.4867  7555.7302  -0.002 0.998153    
## monthsSeptember   -18.4434  7421.0651  -0.002 0.998017    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 655.05  on 1269  degrees of freedom
## Residual deviance: 532.86  on 1244  degrees of freedom
## AIC: 584.86
## 
## Number of Fisher Scoring iterations: 18
## Analysis of Deviance Table (Type II tests)
## 
## Response: viral
##             LR Chisq Df Pr(>Chisq)    
## category_id   76.560 14  1.223e-10 ***
## months        47.534 11  1.728e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We have our final model, which takes into account category_id and months. We now compute the testing error.

## [1] 0.06961392

The testing error based on these predictors is approximately 0.070.

We also use ROC curves below to evaluate the performance of our final model on the testing data.

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Area under the curve: 0.8054
## Area under the curve: 0.7994
## Area under the curve: 0.7971
## Area under the curve: 0.7972