Introduction
The aim of this report is to answer the following questions using data analysis techniques:
How has the company's content strategy shifted over time?
Are all kinds of engagement beneficial for video popularity? Naturally, a more popular video will have more reactions of all kinds, but does a higher fraction of, say, “Angry” reactions have a negative effect on video performance?
Are there any topics or word combinations which always perform higher than average, or have been successful recently?
We will use the dataset vice_data_for_test_task. This dataset contains Facebook video data from the past three years. The data concerns posts from four pages belonging to VICE.
All calculations in this project were run on the following system:
print('Operating System:')
## [1] "Operating System:"
version
## _
## platform x86_64-w64-mingw32
## arch x86_64
## os mingw32
## system x86_64, mingw32
## status
## major 4
## minor 1.2
## year 2021
## month 11
## day 01
## svn rev 81115
## language R
## version.string R version 4.1.2 (2021-11-01)
## nickname Bird Hippie
Data preparation
Importing data
data_path <- here("data", "vice_data_for_test_task.csv")
vice_data <- read_csv(data_path)
A first glimpse
First, we check whether our data is indeed a data frame:
# Check format
class(vice_data)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
We see that the vice_data data frame has 18,497 rows and 37 variables.
Now let’s check the structure of the vice_data data frame:
# Check structure
glimpse(vice_data)
## Rows: 18,497
## Columns: 37
## $ `Page Name` <chr> "VICE News", "VICE News", "VICE", "VI~
## $ `User Name` <chr> "vicenews", "vicenews", "VICE", "vice~
## $ `Facebook Id` <dbl> 236000000000000, 236000000000000, 167~
## $ `Page Category` <chr> "MEDIA_NEWS_COMPANY", "MEDIA_NEWS_COM~
## $ `Page Admin Top Country` <chr> "US", "US", "US", "US", "US", "US", "~
## $ `Page Description` <chr> "VICE News Tonight airs Monday–Thursd~
## $ `Page Created` <chr> "2014-02-23 19:00:02 EST", "2014-02-2~
## $ `Likes at Posting` <dbl> 3339049, 3339049, 8312112, 3339023, 8~
## $ `Followers at Posting` <chr> "4342864", "4342864", "9754669", "434~
## $ `Post Created` <chr> "2021-05-26 04:00:18 EDT", "2021-05-2~
## $ Type <chr> "Native Video", "Native Video", "Nati~
## $ `Total Interactions` <dbl> 54, 41, 66, 351, 24, 132, 358, 139, 7~
## $ Likes <dbl> 34, 23, 19, 77, 12, 35, 151, 36, 15, ~
## $ Comments <dbl> 4, 5, 5, 126, 6, 54, 79, 44, 21, 53, ~
## $ Shares <dbl> 8, 8, 8, 60, 1, 21, 48, 21, 15, 12, 2~
## $ Love <dbl> 6, 1, 6, 5, 0, 1, 1, 1, 1, 13, 1, 7, ~
## $ Wow <dbl> 2, 0, 0, 8, 0, 2, 9, 1, 0, 1, 1, 1, 0~
## $ Haha <dbl> 0, 2, 3, 19, 2, 15, 58, 22, 5, 23, 40~
## $ Sad <dbl> 0, 1, 22, 3, 0, 2, 5, 11, 10, 0, 1, 1~
## $ Angry <dbl> 0, 1, 1, 52, 0, 2, 5, 1, 0, 1, 2, 0, ~
## $ Care <dbl> 0, 0, 2, 1, 3, 0, 2, 2, 8, 2, 1, 0, 1~
## $ `Video Share Status` <chr> "crosspost", "crosspost", "crosspost"~
## $ `Is Video Owner?` <chr> "Yes", "No", "No", "No", "No", "No", ~
## $ `Post Views` <dbl> 3213, 1745, 7268, 8294, 2761, 25601, ~
## $ `Total Views` <dbl> 3214, 1752, 7273, 8375, 2761, 25672, ~
## $ `Total Views For All Crossposts` <dbl> 1793907, 13838, 81146, 10240, 129914,~
## $ `Video Length` <chr> "0:17:38", "0:09:21", "0:24:57", "0:0~
## $ URL <chr> "https://www.facebook.com/23585288990~
## $ Message <chr> "Tattoos are stigmatized in Japan bec~
## $ Link <chr> "https://www.facebook.com/vicenews/vi~
## $ `Final Link` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ `Image Text` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ `Link Text` <chr> "Inside the Underground Pilgrimage Th~
## $ Description <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ `Sponsor Id` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ `Sponsor Name` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ `Sponsor Category` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
It is a good idea to check for duplicate rows to get an idea of the real amount of data.
# Count distinct rows
nrow(vice_data %>% distinct())
## [1] 18497
Let’s clean up the variable names:
# Name repair
vice_data_cl <- janitor::clean_names(vice_data)
Now let’s check the dataset for problems before turning to data analysis:
diagnose(vice_data_cl)
## # A tibble: 37 x 6
## variables types missing_count missing_percent unique_count unique_rate
## <chr> <chr> <int> <dbl> <int> <dbl>
## 1 page_name char~ 0 0 3 0.000162
## 2 user_name char~ 0 0 4 0.000216
## 3 facebook_id nume~ 0 0 4 0.000216
## 4 page_category char~ 0 0 2 0.000108
## 5 page_admin_top_~ char~ 0 0 1 0.0000541
## 6 page_description char~ 0 0 4 0.000216
## 7 page_created char~ 0 0 4 0.000216
## 8 likes_at_posting nume~ 0 0 3906 0.211
## 9 followers_at_po~ char~ 0 0 3899 0.211
## 10 post_created char~ 0 0 18320 0.990
## # ... with 27 more rows
Data Wrangling
When we diagnosed the vice_data_cl data frame, we noticed that the final_link, image_text, description, sponsor_id, sponsor_name and sponsor_category variables have more than \(90\%\) missing data. We can also see that page_admin_top_country has a single value (US), so it carries no information and will not be used in the analysis. Let’s remove the mostly-missing variables:
vice_data_cl <- vice_data_cl %>% select(-c('final_link', 'image_text', 'description', 'sponsor_id', 'sponsor_name', 'sponsor_category'))
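One more detail from the glimpse output: followers_at_posting was imported as character. A minimal fix (a sketch; the rest of the report does not depend on it, and it assumes the column holds plain digit strings) is to convert it to numeric:
# followers_at_posting was read as character (see the glimpse above);
# convert it to numeric so it can be used in calculations if needed.
vice_data_cl <- vice_data_cl %>%
  mutate(followers_at_posting = as.numeric(followers_at_posting))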
The next step is to convert the two variables page_created and post_created to a proper date-time format. We will use the Europe/Vilnius timezone, where the company is located.
vice_data_cl$page_created <- as.POSIXct(vice_data_cl$page_created, tz = 'Europe/Vilnius')
vice_data_cl$post_created <- as.POSIXct(vice_data_cl$post_created, tz = 'Europe/Vilnius')
vice_data_cl$video_length <- lubridate::period_to_seconds(lubridate::hms(vice_data_cl$video_length))
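Note that the raw timestamps carry US Eastern suffixes (“EST”/“EDT”), which as.POSIXct silently ignores. A more explicit alternative, shown here only as a sketch under the assumption that the timestamps really are recorded in US Eastern time, is to parse them in America/New_York and then convert to Vilnius time:
# Sketch: strip the textual timezone suffix, parse as US Eastern time,
# then convert to the Europe/Vilnius reporting timezone.
parse_eastern <- function(x) {
  x %>%
    stringr::str_remove(" E[DS]T$") %>%
    lubridate::ymd_hms(tz = "America/New_York") %>%
    lubridate::with_tz(tzone = "Europe/Vilnius")
}
# Example: parse_eastern("2021-05-26 04:00:18 EDT")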
Information Extraction
Post Creation
Posting by year
vice_data_cl %>%
mutate(year = lubridate::year(post_created)) %>%
group_by(year) %>% summarise(freq = n()) -> year_freqs
ggplot(year_freqs, aes(x=year, y=freq)) +
geom_bar(fill = 'green', stat='identity')
Posting by month
vice_data_cl %>%
mutate(year = lubridate::year(post_created)) %>%
mutate(month = lubridate::month(post_created, label=TRUE)) %>%
group_by(year, month) %>%
summarise(freq = n()) -> month_freqs
# Monthly posting counts, faceted by year
month_freqs %>%
ggplot(aes(x = month, y = freq)) +
geom_bar(stat = "identity", fill = "darkorchid4") +
facet_wrap(~ year, ncol = 1) +
labs(title = "Monthly Video Postings")
Posting by day
vice_data_cl %>%
mutate(year = lubridate::year(post_created)) %>%
mutate(day = lubridate::date(post_created)) %>%
group_by(year, day) %>%
summarise(freq = n()) -> day_freqs
ggplot(day_freqs, aes(x = day, y = freq)) +
geom_line(aes(color = factor(year)))
Frequency of Daily Posting
source("https://raw.githubusercontent.com/iascchen/VisHealth/master/R/calendarHeat.R")
vcl <- vice_data_cl %>%
select(post_created) %>%
group_by(post_created) %>%
summarise(freq = n())
r2g <- c("#D61818", "#FFAE63", "#FFFFBD", "#B5E384")
calendarHeat(vcl$post_created, vcl$freq, ncolors = 99, color = "r2g", varname="Frequency of Daily Posting")
Monthly Average of Daily Posts
vlc <- vice_data_cl %>%
select(post_created) %>%
count(post_created) %>%
mutate(Start.Month = lubridate::floor_date(post_created, unit = "month"))
by_month <- vlc %>%
group_by(Start.Month) %>%
summarise(av_posts = mean(n))
ggplot( data = by_month,
aes(x = Start.Month, y = av_posts, fill=as.factor(lubridate::year(Start.Month)))) +
geom_col() +
scale_fill_brewer(palette = "Paired") +
labs(title="Monthly Average of Daily Posts", x=NULL, y="Number of Posts") +
theme_minimal() +
theme(legend.position = "none")
Weekly Average of Daily Posts
vlc <- vice_data_cl %>%
select(post_created) %>%
count(post_created) %>%
mutate(Start.Week = lubridate::floor_date(post_created, unit = "week"))
by_week <- vlc %>%
group_by(Start.Week) %>%
summarise(av_posts = mean(n))
ggplot( data = by_week,
aes(x = Start.Week, y = av_posts, fill=as.factor(lubridate::year(Start.Week)))) +
geom_col() +
scale_fill_brewer(palette = "Paired") +
labs(title="Weekly Average of Daily Posts", x=NULL, y="Number of Posts") +
theme_minimal() +
theme(legend.position = "none")
Page Posting Over Time
vice_data_cl %>% select(post_created, page_name) %>%
group_by(post_created, page_name) %>%
summarise(freq = n()) %>%
spread(key=page_name, value=freq) %>%
select(-post_created) %>%
ts_plot( title = "Page Posting over Time",
Xtitle = "Time",
Ytitle = "Number of Posts")
Monthly Average of Page Posting Over Time
vlc<- vice_data_cl %>%
select(post_created, page_name) %>%
group_by(post_created, page_name) %>%
count(post_created) %>%
mutate(Start.Month = lubridate::floor_date(post_created, unit = "month"))
vlc %>%
group_by(Start.Month, page_name) %>%
summarise(av_posts = mean(n)) %>%
spread(key=page_name, value=av_posts) %>%
ts_plot( title = "Page Posting over Time",
Xtitle = "Time",
Ytitle = "Number of Posts")
Weekly Average of Page Posting Over Time
vlc<- vice_data_cl %>%
select(post_created, page_name) %>%
group_by(post_created, page_name) %>%
count(post_created) %>%
mutate(Start.Week = lubridate::floor_date(post_created, unit = "week"))
vlc %>%
group_by(Start.Week, page_name) %>%
summarise(av_posts = mean(n)) %>%
spread(key=page_name, value=av_posts) %>%
ts_plot( title = "Page Posting over Time",
Xtitle = "Time",
Ytitle = "Number of Posts")
Post Views
Daily Post Views Over Time
View count is the total number of people who have viewed your video.
Facebook counts a view when someone watches a video for at least 3 seconds (the same applies to Live videos).
View count can be considered more of a vanity metric, as the number of views does not really affect your bottom line if no other action is taken. However, it still shows that the first 3-30 seconds need to be hyper-engaging in order to reel a viewer in.
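Before plotting, a quick numeric summary gives a sense of the scale of these metrics (a minimal sketch; nothing downstream depends on it):
# Quick sense of scale for the view metrics.
summary(vice_data_cl$post_views)
summary(vice_data_cl$total_views)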
don <- xts(x = vice_data_cl$post_views, order.by = vice_data_cl$post_created)
# Finally the plot
p <- dygraph(don, main = "Post Views Over Time",
ylab = "Number of Views") %>%
dyOptions(labelsUTC = TRUE, fillGraph=TRUE, fillAlpha=0.1, drawGrid = FALSE, colors="#D8AE5A") %>%
dyRangeSelector() %>%
dyCrosshair(direction = "vertical") %>%
dyHighlight(highlightCircleSize = 5, highlightSeriesBackgroundAlpha = 0.2, hideOnMouseOut = FALSE) %>%
dyRoller(rollPeriod = 1)
p
Daily Total Views Over Time
don <- xts(x = vice_data_cl$total_views, order.by = vice_data_cl$post_created)
# Finally the plot
p <- dygraph(don, main = "Total Views Over Time",
ylab = "Number of Views") %>%
dyOptions(labelsUTC = TRUE, fillGraph=TRUE, fillAlpha=0.1, drawGrid = FALSE, colors="#D8AE5A") %>%
dyRangeSelector() %>%
dyCrosshair(direction = "vertical") %>%
dyHighlight(highlightCircleSize = 5, highlightSeriesBackgroundAlpha = 0.2, hideOnMouseOut = FALSE) %>%
dyRoller(rollPeriod = 1)
p
Monthly Average of Views Over Time
vlc <- vice_data_cl %>%
select(post_created, post_views) %>%
group_by(post_created) %>%
mutate(Start.Month = lubridate::floor_date(post_created, unit = "month"))
by_month <- vlc %>%
group_by(Start.Month) %>%
summarise(av_views = mean(post_views))
ggplot( data = by_month,
aes(x = Start.Month, y = av_views, fill=as.factor(lubridate::year(Start.Month)))) +
geom_col() +
scale_fill_brewer(palette = "Paired") +
labs(title="Monthly Average of Views Over Time", x=NULL, y="Number of Views") +
theme_minimal() +
theme(legend.position = "none")
Weekly Average of Views Over Time
vlc <- vice_data_cl %>%
select(post_created, post_views) %>%
group_by(post_created) %>%
mutate(Start.Week = lubridate::floor_date(post_created, unit = "week"))
by_week <- vlc %>%
group_by(Start.Week) %>%
summarise(av_views = mean(post_views))
ggplot( data = by_week,
aes(x = Start.Week, y = av_views, fill=as.factor(lubridate::year(Start.Week)))) +
geom_col() +
scale_fill_brewer(palette = "Paired") +
labs(title="Weekly Average of Views Over Time", x=NULL, y="Number of Views") +
theme_minimal() +
theme(legend.position = "none")
Total Views for all Crossposts Over Time
don <- xts(x = vice_data_cl$total_views_for_all_crossposts, order.by = vice_data_cl$post_created)
# Finally the plot
p <- dygraph(don, main = "Total Views for all Crossposts Over Time",
ylab = "Number of Views") %>%
dyOptions(labelsUTC = TRUE, fillGraph=TRUE, fillAlpha=0.1, drawGrid = FALSE, colors="#D8AE5A") %>%
dyRangeSelector() %>%
dyCrosshair(direction = "vertical") %>%
dyHighlight(highlightCircleSize = 5, highlightSeriesBackgroundAlpha = 0.2, hideOnMouseOut = FALSE) %>%
dyRoller(rollPeriod = 1)
p
Monthly Average of Total Views for all Crossposts Over Time
vlc <- vice_data_cl %>%
select(post_created, total_views_for_all_crossposts) %>%
group_by(post_created) %>%
mutate(Start.Month = lubridate::floor_date(post_created, unit = "month"))
by_month <- vlc %>%
group_by(Start.Month) %>%
summarise(av_views = mean(total_views_for_all_crossposts))
ggplot( data = by_month,
aes(x = Start.Month, y = av_views, fill=as.factor(lubridate::year(Start.Month)))) +
geom_col() +
scale_fill_brewer(palette = "Paired") +
labs(title="Monthly Average of Total Views for all Crossposts Over Time", x=NULL, y="Number of Views") +
theme_minimal() +
theme(legend.position = "none")
Weekly Average of Total Views for all Crossposts Over Time
vlc <- vice_data_cl %>%
select(post_created, total_views_for_all_crossposts) %>%
group_by(post_created) %>%
mutate(Start.Week = lubridate::floor_date(post_created, unit = "week"))
by_week <- vlc %>%
group_by(Start.Week) %>%
summarise(av_views = mean(total_views_for_all_crossposts))
ggplot( data = by_week,
aes(x = Start.Week, y = av_views, fill=as.factor(lubridate::year(Start.Week)))) +
geom_col() +
scale_fill_brewer(palette = "Paired") +
labs(title="Weekly Average of Total Views for all Crossposts Over Time", x=NULL, y="Number of Views") +
theme_minimal() +
theme(legend.position = "none")
Total Views vs Total Views for All Crossposts Over Time
vice_data_cl %>% select(post_created, total_views, total_views_for_all_crossposts) %>%
group_by(post_created) %>%
ts_plot(title = " Total views vs Total Views for All crossposts Over Time",
Xtitle = "Time",
Ytitle = "Frequency")
Monthly Average of Total Views vs Total Views for All Crossposts Over Time
vlc <- vice_data_cl %>%
select(post_created, total_views, total_views_for_all_crossposts) %>%
group_by(post_created) %>%
mutate(Start.Month = lubridate::floor_date(post_created, unit = "month"))
vlc %>%
group_by(Start.Month) %>%
summarise(av_tot_views = mean(total_views), av_tot_cviews = mean(total_views_for_all_crossposts)) %>%
ts_plot(title = " Monthly Average of Total Views vs Total Views for All Crossposts Overtime",
Xtitle = "Time",
Ytitle = "Frequency")
Weekly Average of Total Views vs Total Views for All Crossposts Over Time
vlc <- vice_data_cl %>%
select(post_created, total_views, total_views_for_all_crossposts) %>%
group_by(post_created) %>%
mutate(Start.Week = lubridate::floor_date(post_created, unit = "week"))
vlc %>%
group_by(Start.Week) %>%
summarise(av_tot_views = mean(total_views), av_tot_cviews = mean(total_views_for_all_crossposts)) %>%
ts_plot(title = " Weekly Average of Total Views vs Total Views for All Crossposts Overtime",
Xtitle = "Time",
Ytitle = "Frequency")
Video Length
Posted Video Length
vice_data_cl %>%
filter( video_length < 1200 ) %>%
ggplot( aes(x= video_length)) +
geom_histogram( binwidth=10, fill="#69b3a2", color="#e9ecef", alpha=0.9) +
ggtitle("Histogram of Posted Video Length ") +
theme_ipsum() +
theme(
plot.title = element_text(size=15)
) +
scale_y_continuous(breaks=seq(0,1000,50)) +
scale_x_continuous(breaks=seq(0,1200,100))
Length of Video Posts in Time
vice_data_cl %>% select(post_created, video_length) %>%
filter( video_length < 1200 ) %>%
mutate(year = lubridate::year(post_created)) %>%
select(year, video_length) %>%
ggplot(aes(x=video_length, fill = as.factor(year)))+
geom_histogram( color='#e9ecef', alpha=0.6) +
labs(title = "Posted Video Lengths in Years") +
xlab('Video Length') +
ylab('Frequency of Video Posts') +
guides(fill=guide_legend(title="Years"))
Monthly Average of Video Length
vlc <- vice_data_cl %>%
select(post_created, video_length) %>%
group_by(post_created) %>%
mutate(Start.Month = lubridate::floor_date(post_created, unit = "month"))
by_month <- vlc %>%
group_by(Start.Month) %>%
summarise(av_length = mean(video_length))
ggplot( data = by_month,
aes(x = Start.Month, y = av_length, fill=as.factor(lubridate::year(Start.Month)))) +
geom_col() +
scale_fill_brewer(palette = "Paired") +
labs(title="Monthly Average of Video Length", x=NULL, y="Video Length") +
theme_minimal() +
theme(legend.position = "none") +
scale_y_continuous(breaks=seq(0, 1000,100), limits=c(0,1000))
Weekly Average of Video Length
vlc <- vice_data_cl %>%
select(post_created, video_length) %>%
group_by(post_created) %>%
mutate(Start.Week = lubridate::floor_date(post_created, unit = "week"))
by_week <- vlc %>%
group_by(Start.Week) %>%
summarise(av_length = mean(video_length))
ggplot( data = by_week,
aes(x = Start.Week, y = av_length, fill=as.factor(lubridate::year(Start.Week)))) +
geom_col() +
scale_fill_brewer(palette = "Paired") +
labs(title="Weekly Average of Video Length", x=NULL, y="Video Length") +
theme_minimal() +
theme(legend.position = "none") +
scale_y_continuous(breaks=seq(0, 1000,100), limits=c(0,1000))
Engagement
Video engagement includes the comments and likes that video content generates.
It’s a good idea to see how many people are actually taking action on your videos, but beyond that, the company should pay attention to the types of comments it is getting.
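One way to make this concrete is a simple engagement rate per view (a sketch that is not used elsewhere in this report; the column name engagement_rate is ours):
# Sketch: interactions per view, guarding against zero-view posts.
engagement <- vice_data_cl %>%
  mutate(engagement_rate = ifelse(post_views > 0,
                                  total_interactions / post_views,
                                  NA_real_))
summary(engagement$engagement_rate)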
Daily User Activity
don <- xts(x = vice_data_cl$total_interactions, order.by = vice_data_cl$post_created)
# Finally the plot
p <- dygraph(don, main = "Total Interactions Over Time",
ylab = "Number of Views") %>%
dyOptions(labelsUTC = TRUE, fillGraph=TRUE, fillAlpha=0.1, drawGrid = FALSE, colors="#D8AE5A") %>%
dyRangeSelector() %>%
dyCrosshair(direction = "vertical") %>%
dyHighlight(highlightCircleSize = 5, highlightSeriesBackgroundAlpha = 0.2, hideOnMouseOut = FALSE) %>%
dyRoller(rollPeriod = 1)
p
Distribution of Total Interactions by Year
vice_data_cl %>% select(post_created, total_interactions) %>%
filter( total_interactions < 5000 ) %>%
mutate(year = lubridate::year(post_created)) %>%
select(year, total_interactions) %>%
ggplot(aes(x=total_interactions, fill = as.factor(year)))+
geom_histogram( binwidth=200,color="#e9ecef", alpha=0.9) +
ggtitle("Histogram of Total Interactions During Years ") +
theme_ipsum() +
theme(
plot.title = element_text(size=15)
) +
xlab('Total Interactions') +
ylab('Frequency of Total Interactions') +
guides(fill=guide_legend(title="Years"))
Relationship between different user reactions
vice_data_cl %>%
select(likes, comments, shares, love, wow, haha, sad, angry, care) %>%
ggpairs()
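As a numeric complement to the pairs plot (a sketch), Spearman rank correlations are robust to the heavy right tails of these count variables:
# Spearman rank correlations between reaction counts.
vice_data_cl %>%
  select(likes, comments, shares, love, wow, haha, sad, angry, care) %>%
  cor(method = "spearman", use = "pairwise.complete.obs") %>%
  round(2)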
Comparison of weekly user interaction rates
vlc <- vice_data_cl %>%
select(post_created, likes, comments, shares, love, wow, haha, sad, angry, care, total_interactions) %>%
group_by(post_created) %>%
mutate(Start.Week = lubridate::floor_date(post_created, unit = "week"),
like_ratio = likes/total_interactions,
comments_ratio = comments/total_interactions,
shares_ratio = shares/total_interactions,
love_ratio = love/total_interactions,
wow_ratio = wow/total_interactions,
haha_ratio = haha/total_interactions,
sad_ratio = sad/total_interactions,
angry_ratio = angry/total_interactions,
care_ratio = care/total_interactions) %>%
select(post_created, like_ratio, comments_ratio, shares_ratio, love_ratio, wow_ratio, haha_ratio, sad_ratio, angry_ratio, care_ratio, Start.Week)
vlc %>%
group_by(Start.Week) %>%
summarise(
av_like_ratio = mean(like_ratio),
av_comments_ratio = mean(comments_ratio),
av_shares_ratio = mean(shares_ratio),
av_love_ratio = mean(love_ratio),
av_wow_ratio = mean(wow_ratio),
av_haha_ratio = mean(haha_ratio),
av_sad_ratio = mean(sad_ratio),
av_angry_ratio = mean(angry_ratio),
av_care_ratio = mean(care_ratio)) %>%
ts_plot(title = " Comparision of weekly user interaction rates Over Time",
Xtitle = "Time",
Ytitle = "")
Comparison of monthly user interaction rates
vlc <- vice_data_cl %>%
select(post_created, likes, comments, shares, love, wow, haha, sad, angry, care, total_interactions) %>%
group_by(post_created) %>%
mutate(Start.Month = lubridate::floor_date(post_created, unit = "month"),
like_ratio = likes/total_interactions,
comments_ratio = comments/total_interactions,
shares_ratio = shares/total_interactions,
love_ratio = love/total_interactions,
wow_ratio = wow/total_interactions,
haha_ratio = haha/total_interactions,
sad_ratio = sad/total_interactions,
angry_ratio = angry/total_interactions,
care_ratio = care/total_interactions) %>%
select(post_created, like_ratio, comments_ratio, shares_ratio, love_ratio, wow_ratio, haha_ratio, sad_ratio, angry_ratio, care_ratio, Start.Month)
vlc %>%
group_by(Start.Month) %>%
summarise(
av_like_ratio = mean(like_ratio),
av_comments_ratio = mean(comments_ratio),
av_shares_ratio = mean(shares_ratio),
av_love_ratio = mean(love_ratio),
av_wow_ratio = mean(wow_ratio),
av_haha_ratio = mean(haha_ratio),
av_sad_ratio = mean(sad_ratio),
av_angry_ratio = mean(angry_ratio),
av_care_ratio = mean(care_ratio)) %>%
ts_plot(title = " Comparision of monthly user interaction rates Over Time",
Xtitle = "Time",
Ytitle = "")
Monthly Effect of the Angry Reaction on Video Performance
library(ggpubr)
vlc <- vice_data_cl %>%
select(post_created, angry, total_interactions, likes_at_posting, followers_at_posting, total_views_for_all_crossposts) %>%
group_by(post_created) %>%
mutate(Start.Month = lubridate::floor_date(post_created, unit = "month"),
angry_ratio = angry/total_interactions) %>%
select(post_created, angry_ratio, Start.Month, total_interactions, likes_at_posting, followers_at_posting, total_views_for_all_crossposts)
vlc <- vlc %>%
group_by(Start.Month) %>%
summarise(
av_angry_ratio = mean(angry_ratio),
av_total_interactions = mean(total_interactions),
av_likes_at_posting = mean(likes_at_posting),
av_total_views_for_all_crossposts = mean(total_views_for_all_crossposts) )
vlc1 <- vlc %>% select(Start.Month, av_angry_ratio)
vlc2 <- vlc %>% select(Start.Month, av_total_interactions)
vlc3 <- vlc %>% select(Start.Month, av_likes_at_posting)
vlc4 <- vlc %>% select(Start.Month, av_total_views_for_all_crossposts)
p1 <- ggplot(vlc1, aes(x=Start.Month, av_angry_ratio)) +
geom_line( color="steelblue") +
geom_point() +
xlab("") +
theme_ipsum() +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
ylab("Angry Ratio by Month") +
xlab("Time")
p2 <- ggplot(vlc2, aes(x=Start.Month, av_total_interactions)) +
geom_line( color="steelblue") +
geom_point() +
xlab("") +
theme_ipsum() +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
ylab("Monthly Average of Total Interactions") +
xlab("Time")
p3 <- ggplot(vlc3, aes(x=Start.Month, av_likes_at_posting)) +
geom_line( color="steelblue") +
geom_point() +
xlab("") +
theme_ipsum() +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
ylab("Monthly Average of Likes at Posting") +
xlab("Time")
p4 <- ggplot(vlc4, aes(x=Start.Month, av_total_views_for_all_crossposts)) +
geom_line( color="steelblue") +
geom_point() +
xlab("") +
theme_ipsum() +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
ylab("Monthly Average of Total Crossposts") +
xlab("Time")
ggarrange(p1,p2,p3,p4)
Weekly Effect of the Angry Reaction on Video Performance
library(ggpubr)
vlc <- vice_data_cl %>%
select(post_created, angry, total_interactions, likes_at_posting, followers_at_posting, total_views_for_all_crossposts) %>%
group_by(post_created) %>%
mutate(Start.Week = lubridate::floor_date(post_created, unit = "week"),
angry_ratio = angry/total_interactions) %>%
select(post_created, angry_ratio, Start.Week, total_interactions, likes_at_posting, followers_at_posting, total_views_for_all_crossposts)
vlc <- vlc %>%
group_by(Start.Week) %>%
summarise(
av_angry_ratio = mean(angry_ratio),
av_total_interactions = mean(total_interactions),
av_likes_at_posting = mean(likes_at_posting),
av_total_views_for_all_crossposts = mean(total_views_for_all_crossposts) )
vlc1 <- vlc %>% select(Start.Week, av_angry_ratio)
vlc2 <- vlc %>% select(Start.Week, av_total_interactions)
vlc3 <- vlc %>% select(Start.Week, av_likes_at_posting)
vlc4 <- vlc %>% select(Start.Week, av_total_views_for_all_crossposts)
p1 <- ggplot(vlc1, aes(x=Start.Week, av_angry_ratio)) +
geom_line( color="steelblue") +
geom_point() +
xlab("") +
theme_ipsum() +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
ylab("Angry Ratio by Week") +
xlab("Time")
p2 <- ggplot(vlc2, aes(x=Start.Week, av_total_interactions)) +
geom_line( color="steelblue") +
geom_point() +
xlab("") +
theme_ipsum() +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
ylab("Weekly Average of Total Interactions") +
xlab("Time")
p3 <- ggplot(vlc3, aes(x=Start.Week, av_likes_at_posting)) +
geom_line( color="steelblue") +
geom_point() +
xlab("") +
theme_ipsum() +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
ylab("Weekly Average of Likes at Posting") +
xlab("Time")
p4 <- ggplot(vlc4, aes(x=Start.Week, av_total_views_for_all_crossposts)) +
geom_line( color="steelblue") +
geom_point() +
xlab("") +
theme_ipsum() +
theme(axis.text.x=element_text(angle=60, hjust=1)) +
ylab("Weekly Average of Total Crossposts") +
xlab("Time")
ggarrange(p1,p2,p3,p4)
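As a quantitative complement to the panels above (a sketch, not part of the original analysis), the weekly summary vlc built in this chunk can be used to correlate the Angry ratio with the performance metrics directly:
# Rank correlation between the weekly Angry ratio and weekly performance metrics.
# cor.test drops incomplete pairs, so weeks with undefined ratios are ignored.
cor.test(vlc$av_angry_ratio, vlc$av_total_interactions, method = "spearman")
cor.test(vlc$av_angry_ratio, vlc$av_total_views_for_all_crossposts, method = "spearman")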
Natural Language Processing
Data Cleaning
Let’s start with data cleaning for the variables message and link_text. Because of the specifics of the text, cleaning is done in two stages.
For message:
text <- as.character(vice_data_cl$message) %>%
tolower() %>%
# remove non-word characters
str_replace_all("[^[:alpha:][:space:]]*", "") %>%
tm::removePunctuation() %>%
stringr::str_squish() %>%
stringr::str_split(" ") %>%
textclean::replace_non_ascii(replacement = "") %>%
unlist()
message_corpus <- Corpus(VectorSource(na.omit(text)))
english_stopwords <- readLines("https://slcladal.github.io/resources/stopwords_en.txt", encoding = "UTF-8")
message_clean_corpus <- tm_map(message_corpus, content_transformer(tolower))
message_clean_corpus <- tm_map(message_clean_corpus, content_transformer(removeWords), english_stopwords)
message_clean_corpus <- tm_map(message_clean_corpus, content_transformer(removePunctuation), preserve_intra_word_dashes = TRUE)
message_clean_corpus <- tm_map(message_clean_corpus, content_transformer(removeNumbers))
message_clean_corpus <- tm_map(message_clean_corpus, content_transformer(stemDocument), language = "en")
message_clean_corpus <- tm_map(message_clean_corpus, content_transformer(stripWhitespace))
message_clean_corpus <- tm_map(message_clean_corpus, content_transformer(stemDocument))
clean_fun <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
message_clean_corpus <- tm_map(message_clean_corpus, clean_fun, "/")
message_clean_corpus <- tm_map(message_clean_corpus, clean_fun , "@")
message_clean_corpus <- tm_map(message_clean_corpus, clean_fun , "\\|")
Now the same for link_text
link_text <- as.character(vice_data_cl$link_text) %>%
tolower() %>%
# remove non-word characters
str_replace_all("[^[:alpha:][:space:]]*", "") %>%
tm::removePunctuation() %>%
stringr::str_squish() %>%
stringr::str_split(" ") %>%
textclean::replace_non_ascii(replacement = "") %>%
unlist()
ltext_corpus <- Corpus(VectorSource(na.omit(link_text)))
english_stopwords <- readLines("https://slcladal.github.io/resources/stopwords_en.txt", encoding = "UTF-8")
ltext_clean_corpus <- tm_map(ltext_corpus, content_transformer(tolower))
ltext_clean_corpus <- tm_map(ltext_clean_corpus, content_transformer(removeWords), english_stopwords)
ltext_clean_corpus <- tm_map(ltext_clean_corpus, content_transformer(removePunctuation), preserve_intra_word_dashes = TRUE)
ltext_clean_corpus <- tm_map(ltext_clean_corpus, content_transformer(removeNumbers))
ltext_clean_corpus <- tm_map(ltext_clean_corpus, content_transformer(stemDocument), language = "en")
ltext_clean_corpus <- tm_map(ltext_clean_corpus, content_transformer(stripWhitespace))
ltext_clean_corpus <- tm_map(ltext_clean_corpus, content_transformer(stemDocument))
clean_fun <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
ltext_clean_corpus <- tm_map(ltext_clean_corpus, clean_fun, "/")
ltext_clean_corpus <- tm_map(ltext_clean_corpus, clean_fun , "@")
ltext_clean_corpus <- tm_map(ltext_clean_corpus, clean_fun , "\\|")
Topic Model
Now we will fit a topic model to each of the two corpora. First we vectorize:
dtm_message <- DocumentTermMatrix(message_clean_corpus, control = list(wordLengths = c(2, Inf)))
dtm_ltext <- DocumentTermMatrix(ltext_clean_corpus, control = list(wordLengths = c(2, Inf)))
Next, we determine the optimal number of topics from dtm_message:
# Remove zero elements to perform LDA
raw.sum=apply(dtm_message,1,FUN=sum)
dtm_message=dtm_message[raw.sum!=0,]
# For dtm_message
message_result <- FindTopicsNumber(
dtm_message,
topics = seq(from = 2, to = 15, by = 1),
metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
method = "Gibbs",
control = list(seed = 77),
mc.cores = 2L,
verbose = TRUE
)
## fit models... done.
## calculate metrics:
## Griffiths2004... done.
## CaoJuan2009... done.
## Arun2010... done.
## Deveaud2014... done.
FindTopicsNumber_plot(message_result)
Basically, we look for a number of topics that minimizes the Arun2010 and CaoJuan2009 metrics and maximizes the Griffiths2004 and Deveaud2014 metrics. For our purpose we take \(k = 7\).
raw.sum=apply(dtm_message,1,FUN=sum)
dtm_message=dtm_message[raw.sum!=0,]
K <- 7
# compute the LDA model, inference via 500 iterations of Gibbs sampling
message_topicModel <- LDA(dtm_message, K, method="Gibbs", control=list(iter = 500, verbose = 25, alpha = 0.2))
## K = 7; V = 13988; M = 18435
## Sampling 500 iterations!
## Iteration 25 ...
## Iteration 50 ...
## Iteration 75 ...
## Iteration 100 ...
## Iteration 125 ...
## Iteration 150 ...
## Iteration 175 ...
## Iteration 200 ...
## Iteration 225 ...
## Iteration 250 ...
## Iteration 275 ...
## Iteration 300 ...
## Iteration 325 ...
## Iteration 350 ...
## Iteration 375 ...
## Iteration 400 ...
## Iteration 425 ...
## Iteration 450 ...
## Iteration 475 ...
## Iteration 500 ...
## Gibbs sampling completed!
Below are the 20 most likely terms, based on the term probabilities beta of the inferred topics:
terms(message_topicModel, 20)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6
## [1,] "video" "peopl" "watch" "make" "vice" "trump"
## [2,] "peopl" "year" "tonight" "test" "news" "presid"
## [3,] "life" "citi" "world" "recip" "pm" "vice"
## [4,] "time" "live" "tattoo" "food" "tonight" "vote"
## [5,] "find" "countri" "meet" "check" "hbo" "elect"
## [6,] "nation" "meet" "episod" "show" "peopl" "impeach"
## [7,] "sex" "make" "season" "munchi" "report" "news"
## [8,] "dont" "million" "catch" "eat" "year" "donald"
## [9,] "addict" "world" "king" "chef" "war" "state"
## [10,] "kentucki" "hous" "free" "action" "tv" "democrat"
## [11,] "make" "black" "live" "bronson" "watch" "protest"
## [12,] "turn" "legal" "demand" "delici" "edt" "happen"
## [13,] "ayahuasca" "state" "onlin" "chicken" "kill" "watch"
## [14,] "game" "weed" "dark" "pizza" "countri" "live"
## [15,] "watch" "bank" "stori" "cook" "covid" "call"
## [16,] "meet" "york" "vice" "restaur" "children" "hes"
## [17,] "question" "find" "side" "made" "famili" "ask"
## [18,] "year" "thing" "time" "full" "world" "hous"
## [19,] "industri" "back" "artist" "chee" "drug" "make"
## [20,] "love" "man" "wrestl" "perfect" "full" "voter"
## Topic 7
## [1,] "chang"
## [2,] "school"
## [3,] "coronavirus"
## [4,] "student"
## [5,] "climat"
## [6,] "world"
## [7,] "american"
## [8,] "fight"
## [9,] "black"
## [10,] "peopl"
## [11,] "year"
## [12,] "communiti"
## [13,] "meet"
## [14,] "teacher"
## [15,] "home"
## [16,] "kill"
## [17,] "pay"
## [18,] "creat"
## [19,] "protest"
## [20,] "doesnt"
Let’s repeat the same procedure for link_text:
# Remove zero elements to perform LDA
raw.sum=apply(dtm_ltext,1,FUN=sum)
dtm_ltext=dtm_ltext[raw.sum!=0,]
# For dtm_ltext
ltext_result <- FindTopicsNumber(
dtm_ltext,
topics = seq(from = 2, to = 15, by = 1),
metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
method = "Gibbs",
control = list(seed = 77),
mc.cores = 2L,
verbose = TRUE
)
## fit models... done.
## calculate metrics:
## Griffiths2004... done.
## CaoJuan2009... done.
## Arun2010... done.
## Deveaud2014... done.
FindTopicsNumber_plot(ltext_result)
For our purpose we take \(k = 7\)
raw.sum=apply(dtm_ltext,1,FUN=sum)
dtm_ltext=dtm_ltext[raw.sum!=0,]
K <- 7
# compute the LDA model, inference via 500 iterations of Gibbs sampling
ltext_topicModel <- LDA(dtm_ltext, K, method="Gibbs", control=list(iter = 500, verbose = 25, alpha = 0.2))
## K = 7; V = 7041; M = 17744
## Sampling 500 iterations!
## Iteration 25 ...
## Iteration 50 ...
## Iteration 75 ...
## Iteration 100 ...
## Iteration 125 ...
## Iteration 150 ...
## Iteration 175 ...
## Iteration 200 ...
## Iteration 225 ...
## Iteration 250 ...
## Iteration 275 ...
## Iteration 300 ...
## Iteration 325 ...
## Iteration 350 ...
## Iteration 375 ...
## Iteration 400 ...
## Iteration 425 ...
## Iteration 450 ...
## Iteration 475 ...
## Iteration 500 ...
## Gibbs sampling completed!
Below are the 20 most likely terms, based on the term probabilities beta of the inferred topics:
terms(ltext_topicModel, 20)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
## [1,] "trump" "vice" "insid" "meet" "nation"
## [2,] "protest" "news" "meet" "weed" "peopl"
## [3,] "insid" "tonight" "world" "man" "tattoo"
## [4,] "kill" "hbo" "crisi" "time" "king"
## [5,] "world" "updat" "dark" "vote" "question"
## [6,] "polic" "live" "california" "expensivest" "stori"
## [7,] "death" "black" "side" "chainz" "game"
## [8,] "food" "futur" "hous" "women" "drug"
## [9,] "covid" "weeknight" "town" "save" "road"
## [10,] "run" "report" "border" "action" "video"
## [11,] "make" "war" "made" "peopl" "back"
## [12,] "eat" "space" "covid" "world" "want"
## [13,] "man" "brown" "ring" "america" "dopesick"
## [14,] "hong" "kid" "live" "die" "war"
## [15,] "work" "januari" "coronavirus" "opioid" "black"
## [16,] "hunt" "showtim" "prison" "voter" "coronavirus"
## [17,] "street" "novemb" "water" "yearold" "love"
## [18,] "fake" "impeach" "year" "tuesday" "covid"
## [19,] "kong" "sunday" "industri" "jasper" "guy"
## [20,] "secret" "trump" "migrant" "errol" "america"
## Topic 6 Topic 7
## [1,] "trump" "chang"
## [2,] "make" "fight"
## [3,] "show" "climat"
## [4,] "impeach" "live"
## [5,] "donald" "american"
## [6,] "call" "make"
## [7,] "ayahuasca" "delici"
## [8,] "chicken" "polit"
## [9,] "sport" "artist"
## [10,] "cook" "kid"
## [11,] "kentucki" "stop"
## [12,] "mueller" "school"
## [13,] "presid" "street"
## [14,] "life" "trump"
## [15,] "congress" "high"
## [16,] "pizza" "take"
## [17,] "howto" "peopl"
## [18,] "stick" "cultur"
## [19,] "wednesday" "gun"
## [20,] "ice" "famili"
Let’s give each topic a name:
message_topic_name <- terms(message_topicModel, 7)
message_topicNames <- apply(message_topic_name, 2, paste, collapse=" ")
ltext_topic_name <- terms(ltext_topicModel, 7)
ltext_topicNames <- apply(ltext_topic_name, 2, paste, collapse=" ")
Information Extraction
Let’s visualize the topic models:
message_lda <- tidy(message_topicModel)
ltext_lda <- tidy(ltext_topicModel)
We take the top 20 terms for each topic and then look at this information visually:
message_top20_terms <- message_lda %>%
group_by(topic) %>%
slice_max(beta, n = 20, with_ties = FALSE) %>%
ungroup() %>%
arrange(topic, -beta) %>%
mutate(term = reorder_within(term, beta, topic)) %>%
group_by(topic, term) %>%
arrange(desc(beta)) %>%
ungroup() %>%
ggplot(aes(beta, term, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
scale_y_reordered() +
labs(title = "Top 20 terms in each message LDA topic",
x = expression(beta), y = NULL) +
facet_wrap(~ topic, ncol = 4, scales = "free")
message_top20_terms
For link_text
ltext_top20_terms <- ltext_lda %>%
group_by(topic) %>%
slice_max(beta, n = 20, with_ties = FALSE) %>%
ungroup() %>%
arrange(topic, -beta) %>%
mutate(term = reorder_within(term, beta, topic)) %>%
group_by(topic, term) %>%
arrange(desc(beta)) %>%
ungroup() %>%
ggplot(aes(beta, term, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
scale_y_reordered() +
labs(title = "Top 20 terms in each link_text LDA topic",
x = expression(beta), y = NULL) +
facet_wrap(~ topic, ncol = 4, scales = "free")
ltext_top20_terms
Now we will use wordclouds to visualize our selected topics:
message_post <- posterior(message_topicModel)
# our topic of interest
int_topic_message <- 1
## 100 terms
top100terms <- sort(message_post$terms[int_topic_message,], decreasing=TRUE)[1:100]
words <- names(top100terms)
probabilities <- sort(message_post$terms[int_topic_message,], decreasing=TRUE)[1:100]
mycolors <- brewer.pal(7, "Dark2")
wordcloud(words, probabilities, random.order = FALSE, color = mycolors)
Same for link_text
ltext_post <- posterior(ltext_topicModel)
# our topic of interest
int_topic_message <- 1
## 100 terms
top100terms <- sort(ltext_post$terms[int_topic_message,], decreasing=TRUE)[1:100]
words <- names(top100terms)
probabilities <- sort(ltext_post$terms[int_topic_message,], decreasing=TRUE)[1:100]
mycolors <- brewer.pal(7, "Dark2")
wordcloud(words, probabilities, random.order = FALSE, color = mycolors)
The message topic model includes two posterior probability distributions: a distribution \(\theta\) over K topics within each document and a distribution \(\beta\) over V terms within each topic, where V represents the length of the vocabulary of the collection. Let’s check V in dtm_message:
nTerms(dtm_message) ## Our V value
## [1] 13988
message_post_result <- posterior(message_topicModel)
message_beta <- message_post_result$terms ## Our beta distribution
dim(message_beta)
## [1] 7 13988
message_theta <- message_post_result$topics ## Our theta distribution
dim(message_theta)
## [1] 18435 7
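As a quick sanity check (a small sketch), each row of beta and theta is a probability distribution, so the rows should sum to 1 up to floating-point error:
# Rows of beta (topics over terms) and theta (documents over topics) sum to 1.
range(rowSums(message_beta))
range(rowSums(message_theta))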
For link_text
nTerms(dtm_ltext) ## Our V value
## [1] 7041
ltext_post_result <- posterior(ltext_topicModel)
ltext_beta <- ltext_post_result$terms ## Our beta distribution
dim(ltext_beta)
## [1] 7 7041
ltext_theta <- ltext_post_result$topics ## Our theta distribution
dim(ltext_theta)
## [1] 17744 7
Let’s check the distribution of topics for a few sample documents:
sample_id<- c(5, 50, 150, 250)
lapply(message_clean_corpus[sample_id], as.character)
## [[1]]
## [1] "protest anger lot solac happi joy lynsey weatherspoon tell photograph georg floyd protest"
##
## [[2]]
## [1] "massiv fatberg wet wipe block london thame smell bad coveringclimatenow"
##
## [[3]]
## [1] "water toxic georgia town home america largest coal plant"
##
## [[4]]
## [1] "raw physic bareknuckl box madagascar afghanistan brutal nation sport buzkashi photojournalist finbarr oreilli captur uniqu moment human pageantri despair celebr"
message_topic_top_5 <- terms(message_topicModel, 5)
message_topic_name <- apply(message_topic_top_5, 2, paste, collapse=" ")
N <- length(sample_id)
# Proportion for each sample
sample_prop_message <- message_theta[sample_id,]
colnames(sample_prop_message) <- message_topic_name
message_plot <- reshape::melt(cbind(data.frame(sample_prop_message), document = factor(1:N)), variable.name = "topic", id.vars = "document")
ggplot(data = message_plot, aes(variable, value, fill = document), ylab = "proportion") +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
coord_flip() +
facet_wrap(~ document, ncol = N)
Now we repeat the above steps for link_text
# Give each topic a name from its top terms
ltext_topic_top_5 <- terms(ltext_topicModel, 5)
ltext_topic_name <- apply(ltext_topic_top_5, 2, paste, collapse=" ")
N <- length(sample_id)
# Proportion for each sample
ltext_prop_message <- ltext_theta[sample_id,]
colnames(ltext_prop_message) <- ltext_topic_name
ltext_plot <- reshape::melt(cbind(data.frame(ltext_prop_message), document = factor(1:N)), variable.name = "topic", id.vars = "document")
ggplot(data = ltext_plot, aes(variable, value, fill = document), ylab = "proportion") +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
coord_flip() +
facet_wrap(~ document, ncol = N)
Topic ranking
Now we try to get a more meaningful ordering of the top terms per topic by re-ranking them with a specific score.
message_topic_name <- apply(lda::top.topic.words(message_beta, 5, by.score = T), 2, paste, collapse = " ")
Rank-1 Methods
topic_count <- rep(0, K)
names(topic_count) <- message_topic_name
for (i in 1:nDocs(dtm_message)) {
doc_topic <- message_theta[i, ]
high_topic <- order(doc_topic, decreasing = TRUE)[1]
topic_count[high_topic] <- topic_count[high_topic] + 1
}
sort(topic_count, decreasing = TRUE)
## trump presid vote elect impeach vice news pm tonight hbo
## 2937 2918
## watch tattoo season catch dark video sex addict ayahuasca kentucki
## 2883 2682
## citi countri million bank hous student climat school coronavirus chang
## 2444 2358
## test recip make check food
## 2213
topic_sort <- sort(topic_count, decreasing = TRUE)
paste(topic_sort, ":", names(topic_sort))
## [1] "2937 : trump presid vote elect impeach"
## [2] "2918 : vice news pm tonight hbo"
## [3] "2883 : watch tattoo season catch dark"
## [4] "2682 : video sex addict ayahuasca kentucki"
## [5] "2444 : citi countri million bank hous"
## [6] "2358 : student climat school coronavirus chang"
## [7] "2213 : test recip make check food"
Now we repeat the steps for link_text:
ltext_topic_name <- apply(lda::top.topic.words(ltext_beta, 5, by.score = T), 2, paste, collapse = " ")
topic_count <- rep(0, K)
names(topic_count) <- ltext_topic_name
for (i in 1:nDocs(dtm_ltext)) {
doc_topic <- ltext_theta[i, ]
high_topic <- order(doc_topic, decreasing = TRUE)[1]
topic_count[high_topic] <- topic_count[high_topic] + 1
}
sort(topic_count, decreasing = TRUE)
## vice news tonight hbo updat crisi dark meet side california
## 2908 2677
## protest kill death polic eat meet weed time man expensivest
## 2651 2594
## trump make donald ayahuasca show nation tattoo question stori game
## 2415 2397
## chang climat fight delici live
## 2102
topic_sort <- sort(topic_count, decreasing = TRUE)
paste(topic_sort, ":", names(topic_sort))
## [1] "2908 : vice news tonight hbo updat"
## [2] "2677 : crisi dark meet side california"
## [3] "2651 : protest kill death polic eat"
## [4] "2594 : meet weed time man expensivest"
## [5] "2415 : trump make donald ayahuasca show"
## [6] "2397 : nation tattoo question stori game"
## [7] "2102 : chang climat fight delici live"
Conclusions
Question 1: How has the company's content strategy shifted over time?
In this study, several metrics are taken into account, such as the number of posts, video views, user engagement, and page growth.
- Number of posts
We see that during 2018 the monthly averages of company postings are very similar, except for a slight increase in November and December of that year. We see the same pattern during 2019. There is a drop in the monthly average of postings during 2020 that becomes significant in March and persists for the rest of the year, probably because of the pandemic. Monthly posting averages increase again in February 2021, but at a lower rate compared to 2019.
When it comes to daily postings, 2018 is characterized by active posting from Monday through Friday, with a lower frequency during weekends. An increase in weekend posting frequency is noticeable in November - December 2018, reaching weekday levels, and this pattern holds throughout 2019, with a posting peak in the third week of November, likely due to the US elections or the sales season. Again, 2020 shows a decreased posting frequency that remains at that level for the whole year, with a slight increase in 2021.
VICE News is the page that receives the most posts, with a continuous increase from 2018 to the start of 2020, followed by a sharp decrease in the number of posts.
VICE is more stable, with a rather smooth increase from 2018 to the start of 2020, followed by a less sharp decrease than VICE News, and it has been steadily recovering toward 2019 levels.
VICE TV experienced an increase in the number of posts only in the last months of 2018 and immediately afterwards began a gradual decline in postings.
- Video Views
When it comes to video views, the company seems to have performed quite well with its policy of slowing down the number of postings and focusing more on user engagement and content. 2018 shows an increase in the monthly average of total views, with some slowing during September - October. 2019 was a very good year in terms of total views, with a slower rate of increase from April to August. The increasing trend remains roughly constant for the first half of 2020, with a noticeable decrease in the second half that recovers in the first months of 2021. The company appears very focused on crossposts and the views a video has amassed over all the times it was posted. The monthly average of crosspost views shows a constant, high-rate increase from the start of the period to July 2019, keeping a more constant rate during 2020 and 2021. The characteristic pattern for the monthly average of crosspost views is turbulence (frequent ups and downs), probably connected to video content.
- Video Length
The company keeps posting videos of roughly 3.5-4 minutes in length, and while posting frequency (number of videos per day) can vary, it maintains a rather constant weekly average of posted video length, with rare spikes probably connected to live streaming. A small increase in video length is noticeable from the second half of 2020, after some disruptions, and continues into 2021.
- User Interaction
Users appear to have been faithful to the pages during 2019. The last months of 2018 are characterized by mixed feelings from users, showing a decrease in the number of likes, an increase in the number of comments, and a constant number of shares. Stronger reactions such as Sad, Love, and Angry remain at constant, very low rates, which suggests that users are not strongly negatively affected by the content. The company seems to have reacted successfully to user feedback: 2019 attracts more comments while likes stay constant, and an increasing Haha rate points to more engaging content. The end of 2019 and the start of 2020 suggest that the company has a good understanding of what content users want, reflected in higher like rates, lower levels of comments, and constant rates for the other reactions.
Question 2: Are all kinds of engagement beneficial for video popularity?
Not all kinds of engagement appear to be beneficial for video popularity, although this should be studied over different time periods. For example, the plots show that the Angry reaction has affected page popularity as measured by average interactions and average crosspost views. However, this effect seems to become unimportant, and even beneficial, from 2019 onwards, once the page already has a constant flow of user interactions. Some reactions, such as Care or Love, occur at rates too low to judge their importance for page performance.
Question 3: Are there any topics or word combinations which always perform higher than average, or have been successful recently?
Unsurprisingly, COVID-19, the Trump impeachment, and Black Lives Matter are the hottest topics during the study period, with COVID as the main topic. Beyond that, people seem to continue following their usual interests, such as cooking or entertainment series. The combinations that seem to perform best are “world year coronavirus countries”, “news pm tonight trump”, and “tonight demand episode stories”.
Social shares
One of the main goals for video content should be social shares. Sharing widens the audience considerably, increasing brand awareness and potentially bringing in new leads.
Video Share Status – owned vs crosspost
Monthly Average Comparison of Video Share Status – owned vs crosspost
Weekly Average Comparison of Video Share Status – owned vs crosspost
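As a starting point for these comparisons, a minimal sketch following the same pattern as the earlier monthly charts could look like this (the exact levels of video_share_status, e.g. "crosspost" vs owned posts, are whatever appear in the data):
# Sketch: monthly post counts by video_share_status (owned vs crosspost).
vice_data_cl %>%
  mutate(Start.Month = lubridate::floor_date(post_created, unit = "month")) %>%
  count(Start.Month, video_share_status) %>%
  ggplot(aes(x = Start.Month, y = n, fill = video_share_status)) +
  geom_col(position = "dodge") +
  labs(title = "Monthly Comparison of Video Share Status: Owned vs Crosspost",
       x = NULL, y = "Number of Posts", fill = "Share Status") +
  theme_minimal()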