CIS 8392: FinalProject

YouTube Video Comments Sentiment Analysis using YouTube API

Team members

Rohit Shekhar
Darshika Kesarwani
Poonam Angne

Business context

Youtube is biggest Video streaming Platform, with millions of videos uploaded by its users and billions of comments for all of these videos. Youtube unlike other social Media platform like facebook and twitter it allows user to express negative emotions through dislike button. We can evaluate the public opinion by comparing number of likes vs dislikes. However, if we want to know detailed analysis of public emotions and sentiments for a certain video, we need to dig deep into the comment sections. Sentiment Analysis can be performed on comments for troll detection, polarity detection and can be used for improvement in any future video release.

Problem description

The online users express their positive, negative or neutral opinions or sentiments on the videos that they watch on YouTube. But, do these comments differ for different categories? Do sentiments of the users change as per the category of the video? This is interesting to explore because this will provide us deeper insights into what the users like and dislike, what the users prefer to watch on YouTube and more. The main aim of our project is to use text analysis to explore YouTube comments, classify the comments as positive, negative and neutral alongwith exploring whether the user sentiments differ among different categories.

Data Collection from YouTube API

To work with the YouTube API, we need to get an authorized key first. YouTube offers a unique access key to each user. With this key, we can connect to the interface of the application and from that point, we can use the data and protocols of API.

To create our YouTube API credentials, we performed the following steps:
1. Go to https://console.developers.google.com/project and login using your Google Credentials.
2. Click on ‘Project’ and select ‘Create a project’ from the drop-down menu. A new project screen is displayed.
3. Enter a name for your project in the ‘Project name’ field.
4. Click on ‘Create’. Your project is being created and will take a few moments.
5. Click on the ‘Project’ menu and select the project you recently created from the drop-down list. Your project library page is displayed.
6. Click on ‘YouTube Data API’. The YouTube data API overview screen is displayed.
7. Click on ‘ENABLE’. The API is enabled but cannot be used until you create credentials.
8. Click on ‘Go to Credentials’. The credentials screen is displayed.
9. Click on âOAuth client IDâ and select âOtherâ. Enter a client name and then you will get the client key and secret key. Copy and paste them on your local computer.

We extracted the YouTube comments data using the below R Code. We picked four YouTube categories and extracted 50k comments for videos in each of these categories and saved the results in four different CSV files.

library(tuber)
client_key <- '-- our client key --'
client_secret <- '-- our secret key --'
yt_oauth(app_id = client_key, app_secret = client_secret, token = '')

# Category IDâs
# 1 -> Film & Animation 
# 10 -> Music
# 17 -> Sports
# 26 -> Howto & Style

for(j in c(1, 10, 17, 26)){

  # Get list of top 50 videos in US region in the selected category
  video_list <- list_videos(max_results = 50, region_code = "US", video_category_id = j)
  
  # Get a list of Video ID's in the Category which have less than 10000 comments each
  video_id <- c()
  for(i in 1:length(video_list$items)){
    video_stats <- get_stats(video_id = video_list$items[[i]]$id)
    count <- video_stats$commentCount
    if(!is.null(count)){
      if(as.numeric(count) < 10000){
        video_id <- c(video_id, video_list$items[[i]]$id)
      }
    }
  }
  
  # Create a data frame with all the comments combined for the selected Video ID's in the Category
  comments <- c()
  for(i in 1:length(video_id)){
    comments <- rbind(comments
                      , get_comment_threads(c(video_id = video_id[i]), max_results = 101)[, 1:12]
                      , method="common")
  }
  
  # Pick 50k rows
  comments <- comments[1:50000,]
  
# Save the results in CSV files
  if(j == 1)
    write.csv(comments, "animation_video_comments.csv")
  if(j == 10)
    write.csv(comments, "music_video_comments.csv")
  if(j == 17)
    write.csv(comments, "sports_video_comments.csv")
  if(j == 26)
    write.csv(comments, "howto_video_comments.csv")
}

Data Summary

Data Files: https://drive.google.com/open?id=1rzdC7DNtsmJoywZeW9jqyBKBwEOIY3Q1

Our dataset consists of 4 CSVs one for each category per below

Sports
Animation
Music
How to

We will first load all the 4 csv into different variables and then observe the basic stats for each category.Each category has 50000 rows and 13 columns. Each csv has columns like VideoId, likes counts, Text etc

library(tm)
library(tidyverse)
library(stringr)
library(wordcloud)
library(sentimentr)
library(h2o)
library(caret)
library(tidytext)
library(caTools)
# Load Dataset
sports_df <- read_csv("sports_video_comments.csv")
animation_df <- read_csv("animation_video_comments.csv")
music_df <- read_csv("music_video_comments.csv")
howto_df <- read_csv("howto_video_comments.csv")

#Helper Function to find statistics
summ <- function(df){
cat("Class is ",class(df),"\n")
cat("Columnnames are ",colnames(df),"\n")
cat("Type is ", typeof(df),"\n")    
#dimension 
cat("Numbers of rows and columns are ", dim(df),"\n")
}

#Summary Statistics for Sports Category
summ(sports_df)

## Class is  spec_tbl_df tbl_df tbl data.frame 
## Columnnames are  X1 authorDisplayName authorProfileImageUrl authorChannelUrl videoId textDisplay textOriginal canRate viewerRating likeCount publishedAt updatedAt authorChannelId.value 
## Type is  list 
## Numbers of rows and columns are  50000 13

#Summary Statistics for Animation Category
summ(animation_df)

## Class is  spec_tbl_df tbl_df tbl data.frame 
## Columnnames are  X1 authorDisplayName authorProfileImageUrl authorChannelUrl authorChannelId.value videoId textDisplay textOriginal canRate viewerRating likeCount publishedAt updatedAt 
## Type is  list 
## Numbers of rows and columns are  50000 13

#Summary Statistics for Music Category
summ(music_df)

## Class is  spec_tbl_df tbl_df tbl data.frame 
## Columnnames are  X1 authorDisplayName authorProfileImageUrl authorChannelUrl authorChannelId.value videoId textDisplay textOriginal canRate viewerRating likeCount publishedAt updatedAt 
## Type is  list 
## Numbers of rows and columns are  50000 13

#Summary Statistics for How to Category
summ(howto_df)

## Class is  spec_tbl_df tbl_df tbl data.frame 
## Columnnames are  X1 authorDisplayName authorProfileImageUrl authorChannelUrl authorChannelId.value videoId textDisplay textOriginal canRate viewerRating likeCount publishedAt updatedAt 
## Type is  list 
## Numbers of rows and columns are  50000 13

Data Exploration and Discussion

Visualization of Top 4 author with most likes across every category

library(gridExtra)
library(ggplot2)
expl <- function(df, title){
  df %>% group_by(authorDisplayName) %>%
  summarise(likes = sum(likeCount)) %>%  arrange(desc(likes)) %>% head(4) %>%
  ggplot(aes(x = authorDisplayName, y = likes)) + 
  xlab("Author") +
  ylab("LikesCount") +
  ggtitle(title) +
  geom_bar(stat = "identity", position = "stack",color="black", fill="coral", alpha=0.3) +
  coord_flip()
}
plot1 <- expl(sports_df, 'sports_category')
plot2 <- expl(animation_df, 'animation_category')
plot3 <- expl(music_df, 'music_category')
plot4 <- expl(howto_df, 'howto_category')
grid.arrange(plot1, plot2,plot3,plot4, ncol=2)

Pre Processing

Before doing actual Sentiment Analysis of Youtube comments, we will preprocees the data. We will perform preprocessing on column name textOriginal which stores the comment of youtube videos,it will done for all the CSV files of 4 categories.

comm <- function(df){
  # Filter comment column
df_comments <- df %>%
  select(textOriginal)
return(df_comments)
}
#Save dataframe with only 'textOriginal' Column
comm_sports <- comm(sports_df)
comm_animation <- comm(animation_df)
comm_music <- comm(music_df)
comm_howto <- comm(howto_df)
# Turn text to tokens
tok<- function(df_comments){
 tokens <- df_comments %>%
 tidytext::unnest_tokens(output = word, input = textOriginal)

#Display common words
tokens %>%
  count(word, sort = TRUE)

# Remove stopwords
cleaned_tokens <- tokens %>%
  anti_join(tidytext::get_stopwords())

# Find stopwords
nums <- cleaned_tokens %>%
  filter(str_detect(word, "^[0-9]")) %>%
  select(word) %>% unique()

# Remove numbers
cleaned_tokens <- cleaned_tokens %>%
  anti_join(nums, by = "word")
return(cleaned_tokens)
# Find unique tokens
length(unique(cleaned_tokens$word))
}

#Save Processed cleaned tokens for all 4 categories
tok_sports <- tok(comm_sports)
tok_animation <- tok(comm_animation)
tok_music <- tok(comm_music)
tok_howto <- tok(comm_howto)

In the following code we will find the Word frequency across every category, which will be further used to remove rare words. We can analyse from the graph that words used in ‘howto’ category comments has less number of repetition when compared to other categories of our dataset

#install.packages("gridExtra")
library(gridExtra)
library(ggplot2)
vis1 <- function(cleaned_tokens, title){
  cleaned_tokens %>%
  count(word, sort = T) %>%
  rename(word_freq = n) %>%
  ggplot(aes(x=word_freq)) +
  geom_histogram(aes(y=..count..), color="black", fill="blue", alpha=0.3) +
  ggtitle(title) +
  scale_x_continuous(breaks=c(0:5,10,100,500,10e3), trans="log1p", expand=c(0,0)) +
  scale_y_continuous(breaks=c(0,100,1000,5e3,10e3,5e4,10e4,4e4), expand=c(0,0)) +
    
  theme_bw()
}
plot1 <- vis1(tok_sports, 'sports_category')
plot2 <- vis1(tok_animation, 'animation_category')
plot3 <- vis1(tok_music, 'music_category')
plot4 <- vis1(tok_howto, 'howto_category')
grid.arrange(plot1, plot2,plot3,plot4, ncol=2)

We will remove rare words that have less than 10 appearances in our collection to improve the performance of text analytics. We will then save those tokens category wise for futher analysis.

#helper function to remove rare words
rare_df <- function(cleaned_tokens){
# Find rare words
rare <- cleaned_tokens %>%
  count(word) %>%
  filter(n<10) %>%
  select(word) %>% unique()

# Remove rare words
cleaned_tokens <- cleaned_tokens %>%
  anti_join(rare, by = "word")
#return cleaned token after removing rare word
return(cleaned_tokens)
}


#Cleaned_token after removing rare words in Sports category
ct_sports <- rare_df(tok_sports)
#Cleaned_token after removing rare words in animation category
ct_animation <-rare_df(tok_animation)
#Cleaned_token after removing rare words in music category
ct_music<-rare_df(tok_music)
#Cleaned_token after removing rare words in howto category
ct_howto<-rare_df(tok_howto)

Wordcloud

We will visualize wordcloud for 100 most common words across each 4 category

wc <- function(cleaned_tokens){
  # Find unique tokens
length(unique(cleaned_tokens$word))

# Define a nice color palette
pal <- brewer.pal(8,"Dark2")

# Plot the 100 most common words
cleaned_tokens %>%
  count(word) %>%
  with(wordcloud(word, n, random.order = FALSE, max.words = 100, colors=pal))
}


#Word Cloud sports Category
wc(ct_sports)

#Word Cloud animation Category
wc(ct_animation)

#Word Cloud music Category
wc(ct_music)

#Word Cloud howto Category
wc(ct_howto)

Sentiment Analysis

We will use following 3 types of lexicon from tidytext to perform Sentiment Analysis:

nrc lexicon: word and their sentiment category
The bing lexicon: word and their polarity (negative or positive)
The afinn lexicon: word and their numeric sentiment score

For every word we will save values of all 3 lexicons and store it in a variable across all 4 categories

ML <- function(cleaned_tokens){
  sent_comments = cleaned_tokens %>%
  left_join(tidytext::get_sentiments("nrc")) %>%
  rename(nrc = sentiment) %>%
  left_join(tidytext::get_sentiments("bing")) %>%
  rename(bing = sentiment) %>%
  left_join(tidytext::get_sentiments("afinn")) %>%
  rename(afinn = score)

# Most common positive and negative words
bing_word_counts <- sent_comments %>%
  filter(!is.na(bing)) %>%
  count(word, bing, sort = TRUE)
return(bing_word_counts)
}
#Save Processed cleaned tokens for all 4 categories
bing_sports <- ML(ct_sports)
bing_animation <- ML(ct_animation)
bing_music <- ML(ct_music)
bing_howto <- ML(ct_howto)

NLP Procedure Summary

We have used following steps to perform NLP on our dataset:

Loaded libraries like dplyr, tidytext, ggplot, wordcloud, gridExtra etc
We have read all 4 csv file using read_csv and loaded into dataframe
We further examined the dataframe using basic functions like dim()
We have extracted the columnn ‘text display’ for sentiments analysis
Basic Cleaning after creating tokens like removal of number, stop words and rare words
ggplot and wordcloud is used to analyse frequency of words across each category
3 types of sentiment lexicons nrc, bing and afinn are calculated for each word
Final Visualization will be done using ggplot to compare number of positive and negative reviews across each category

NLP Summary Visualization

Visualization 1

# Contribution to sentiments
visual1 <- function(bing_word_counts, title){
  bing_word_counts %>%
  filter(n > 800) %>%
  mutate(n = ifelse(bing == "negative", -n, n)) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = bing)) +
  ggtitle(title) +
  geom_col() +
  coord_flip() +
  labs(y = "Contribution to sentiment")
}

#Visualize
plot1 <- visual1(bing_sports, 'sports_category')
plot2 <- visual1(bing_animation, 'animation_category')
plot3 <- visual1(bing_music, 'music_category')
plot4 <- visual1(bing_howto, 'howto_category')
grid.arrange(plot1, plot2,plot3,plot4, ncol=2)

Visualization 2

# Top 10 words for each sentiment.
visual2 <- function(bing_word_counts,title)
  {bing_word_counts %>%
  group_by(bing) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = bing)) +
  ggtitle(title) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~bing, scales = "free") +
  labs(y = "Contribution to sentiment", x = NULL) +
  coord_flip() 
}
#Words that contribute to positive and negative sentiment in the reviews
plot1 <- visual2(bing_sports, 'sports_category')
plot2 <- visual2(bing_animation, 'animation_category')
plot3 <- visual2(bing_music, 'music_category')
plot4 <- visual2(bing_howto, 'howto_category')
grid.arrange(plot1, plot2,plot3,plot4, ncol=2)

NLP Result Summary and Discussion

Analysis 1: We can observe from first summary Visualization that howto category has less number of frequent words when compared to other categories, it might be because every howto category can be further sub divided into different categories. So, detailed sentiments for howto category is tough to analyze. If we observe sentiments of music and animation category we can get detailed negative emotions using this visualization.

Analysis 2: We can observe from above plot the count of most used negative and positive words across each category. This words can be further analyzed to find detailed emotion for each category.
**When we observe above graphs for sport category people have negative emotions like video is fake, and for howto category many people think videos a waste. If we see positive emotions then many people think videos in animations are powerful, and may people think music video are reach.

**We can futher observe sports category videos has the maximum number of negative emotions when compared to other categories. Music category video has highest number of positive emotions when compared to other categories

Applying Classification Model to our dataset

Creating a classification label based on sentiment score

classify <- function(df_comments){
# Create new column to store sentiment score
df_comments$sentiment_score <- 0

# Calculate Sentiment for entire comment box
for(i in 1:nrow(df_comments)){
  a <- sentiment(as.character(df_comments[i, 1]))
  a <- a %>%
    group_by(element_id) %>%
    summarise(sentiment = sum(sentiment))
  df_comments[i, 2] <- as.numeric(as.character(a[1, 2]))
}

# Create new sentiment label based on sentiment score
df_comments <- df_comments %>%
  mutate(sentiment_label = ifelse(sentiment_score == 0, "neutral"
                                  , ifelse(sentiment_score > 0, "positive", "negative"))
         , video_category = "Sports")
return(df_comments)
}
classify(comm_sports) %>% write.csv(., "sports.csv")
classify(comm_animation) %>% write.csv(., "animation.csv")
classify(comm_music) %>% write.csv(., "music.csv")
classify(comm_howto) %>% write.csv(., "howto.csv")

Comment Sentiment Percentages by Video Category

  library(ggthemes)
# Load the labeled CSV files
df_sports <- read_csv("sports.csv")
df_music <- read_csv("music.csv")
df_animation <- read_csv("animation.csv")
df_howto <- read_csv("howto.csv")

# Combine into a single dataframe
df <- rbind(df_sports, df_music, df_animation, df_howto)

# Plot Sentiments
df %>%
  ggplot(aes(video_category, fill = sentiment_label)) + 
  geom_bar(position="fill") + 
  geom_text(aes(label=scales::percent(..count../50000)),
            stat='count',position=position_fill(vjust=0.5)) + 
  ggtitle("Comment Sentiment Percentages by Video Category") + 
  theme_economist_white()

Explanation - We can analyze neutral, negative and positive sentiment percentage across each category.

** Music has maximum neutral sentiment comments and Animation has least neutral snetiment comments

** Sports and Howto category has highest positive sentiment comments and Animation has least positive sentiment comments

** Animation category has highest negative sentiment comments and Howto has least negative sentiment comments

We will use Sports csv file with sentiment data to run Random Forest model using H20 for classification of comments into Positive, Neutral and Negative

h2o.init()

##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         5 hours 16 minutes 
##     H2O cluster timezone:       America/New_York 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.22.1.1 
##     H2O cluster version age:    4 months and 3 days !!! 
##     H2O cluster name:           H2O_started_from_R_rohit_pxu032 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.03 GB 
##     H2O cluster total cores:    12 
##     H2O cluster allowed cores:  12 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.5.2 (2018-12-20)

sports <- read.csv("sports.csv",  header = TRUE)
#sports <- sports[sports$sentiment_label=="positive"|sports$sentiment_label=="negative",]
corpus <- Corpus(VectorSource(sports$textOriginal)) 
corpus <- tm_map(corpus, content_transformer(tolower)) 
corpus <- tm_map(corpus, removeNumbers) 
corpus <- tm_map(corpus, removePunctuation) 
corpus <- tm_map(corpus, removeWords, stopwords("english")) 

# tokenize the corpus 
corpusTokenized <- lapply(corpus, scan_tokenizer)
cleaned_reviews <- data.frame(text = sapply(corpusTokenized, paste, collapse = " "), stringsAsFactors = FALSE)  
sports$desc <- cleaned_reviews$text 

#Create dtm 
dtm <- DocumentTermMatrix(corpus)  
dtm <- removeSparseTerms(dtm, 0.97)

# Convert to a data frame
comments_df = as.data.frame(as.matrix(dtm))
# Make all variable names R-friendly
colnames(comments_df) = make.names(colnames(comments_df))
# Add sentiment variable
comments_df$sentiment = factor(sports$sentiment_label)
colnames(comments_df)=iconv(colnames(comments_df), to='ASCII', sub='')
# Split the data to test and train dataset

set.seed(100)
split = sample.split(comments_df$sentiment,0.7)
train = subset(comments_df, split==TRUE)
test = subset(comments_df, split==FALSE)
#Using h2o library for fast randomforest 
x_var=colnames(comments_df)[-172]
#Covert to h2o frames
htrain=as.h2o(train)

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

htest=as.h2o(test)

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

comment_sentiment = h2o.randomForest(x=x_var,y="sentiment",ntree=200,max_depth = 70, training_frame=htrain)

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=                                                                |   1%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |===========                                                      |  18%
  |                                                                       
  |================                                                 |  24%
  |                                                                       
  |====================                                             |  32%
  |                                                                       
  |=========================                                        |  39%
  |                                                                       
  |==============================                                   |  46%
  |                                                                       
  |===================================                              |  54%
  |                                                                       
  |=======================================                          |  60%
  |                                                                       
  |============================================                     |  68%
  |                                                                       
  |=================================================                |  76%
  |                                                                       
  |======================================================           |  83%
  |                                                                       
  |==========================================================       |  90%
  |                                                                       
  |===============================================================  |  98%
  |                                                                       
  |=================================================================| 100%

comment_sentiment

## Model Details:
## ==============
## 
## H2OMultinomialModel: drf
## Model ID:  DRF_model_R_1556724724677_272 
## Model Summary: 
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1             200                      600              999937        15
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1        22   16.66167         82        157   125.08334
## 
## 
## H2OMultinomialMetrics: drf
## ** Reported on training data. **
## ** Metrics reported on Out-Of-Bag training samples **
## 
## Training Set Metrics: 
## =====================
## 
## Extract training frame with `h2o.getFrame("train_sid_a021_1")`
## MSE: (Extract with `h2o.mse`) 0.3657415
## RMSE: (Extract with `h2o.rmse`) 0.6047657
## Logloss: (Extract with `h2o.logloss`) 0.9356441
## Mean Per-Class Error: 0.4799886
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
##          negative neutral positive  Error              Rate
## negative     3133    6459     1310 0.7126 =  7,769 / 10,902
## neutral       747    9547      688 0.1307 =  1,435 / 10,982
## positive     1170    6656     5290 0.5967 =  7,826 / 13,116
## Totals       5050   22662     7288 0.4866 = 17,030 / 35,000
## 
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
## =======================================================================
## Top-3 Hit Ratios: 
##   k hit_ratio
## 1 1  0.513429
## 2 2  0.775286
## 3 3  1.000000

test_sentiment = as.data.frame(h2o.predict(comment_sentiment,htest)[[,1]][1])

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

#Check how accurate the predictions are
Cm <-confusionMatrix(test_sentiment$predict, test$sentiment)
Cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction negative neutral positive
##   negative     1364     331      481
##   neutral      2737    4061     2821
##   positive      572     314     2319
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5163          
##                  95% CI : (0.5082, 0.5243)
##     No Information Rate : 0.3747          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2818          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: negative Class: neutral Class: positive
## Sensitivity                  0.29189         0.8629          0.4126
## Specificity                  0.92137         0.4601          0.9055
## Pos Pred Value               0.62684         0.4222          0.7236
## Neg Pred Value               0.74197         0.8801          0.7201
## Prevalence                   0.31153         0.3137          0.3747
## Detection Rate               0.09093         0.2707          0.1546
## Detection Prevalence         0.14507         0.6413          0.2137
## Balanced Accuracy            0.60663         0.6615          0.6590

Accuracy of Model is approx 52% . Our Data is biased towards neutral sentiment, this could be reason for not having higher accuracy . we can further try different models and by making dataset less biased to achieved higher accuracy.

CIS 8392: FinalProject_Group10

Rohit Shekhar / Darshika Kesarwani / Poonam Angne

5/1/2019