Introduction

The aim of this report is to answer the following questions using data analysis techniques:

  1. How has the company's content strategy shifted over time?

  2. Are all kinds of engagement beneficial for video popularity? Naturally, a more popular video will have more reactions of all kinds, but does a higher fraction of, say, “Angry” reactions, have a negative effect on video performance?

  3. Are there any topics or word combinations that consistently perform above average, or that have been successful recently?

We will use the dataset vice_data_for_test_task. It contains Facebook video data from the past three years. The data concerns posts from four pages belonging to VICE.

All calculations in this project were run on the following machine:

print("Operating System:")
## [1] "Operating System:"
version
##                _                           
## platform       x86_64-w64-mingw32          
## arch           x86_64                      
## os             mingw32                     
## system         x86_64, mingw32             
## status                                     
## major          4                           
## minor          1.2                         
## year           2021                        
## month          11                          
## day            01                          
## svn rev        81115                       
## language       R                           
## version.string R version 4.1.2 (2021-11-01)
## nickname       Bird Hippie

 

Data preparation

Importing data

library(here)   # project-relative file paths
library(readr)  # read_csv() and column specifications

data_path <- here("data/vice_data_for_test_task.csv")
# Read categorical columns as factors with fixed levels and the video length as a time
vice_data <- read_csv(data_path, col_types = cols(
    `Page Name` = col_factor(levels = c("VICE", "VICE News", "VICE TV")),
    `User Name` = col_factor(levels = c("VICE", "vicenews", "vicetv", "viceuk")),
    `Page Category` = col_factor(levels = c("MEDIA_NEWS_COMPANY", "TV_CHANNEL")),
    Type = col_factor(levels = c("Live Video", "Live Video Complete",
        "Live Video Scheduled", "Native Video")),
    `Video Share Status` = col_factor(levels = c("crosspost", "owned", "share")),
    `Is Video Owner?` = col_factor(levels = c("NA", "No", "Yes")),
    `Video Length` = col_time(format = "%H:%M:%S")
))

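Because the factor columns are read with fixed level sets, a value outside those sets should come back as NA rather than as a new level. The following is a minimal sketch of how one might check for this on a toy CSV. The page name "VICE UK" is a hypothetical value used only for illustration and is not taken from the dataset, and I() for literal data assumes readr 2.0 or later.

```r
library(readr)

# Toy CSV; "VICE UK" is a hypothetical value outside the declared levels
toy_csv <- I("Page Name,Likes\nVICE,10\nVICE UK,5\n")

toy <- suppressWarnings(read_csv(
    toy_csv,
    col_types = cols(
        `Page Name` = col_factor(levels = c("VICE", "VICE News", "VICE TV")),
        Likes = col_double()
    )
))

# Values outside the level set are expected to be parsed as NA
n_unmatched <- sum(is.na(toy$`Page Name`))
n_unmatched
```

A non-zero count on the real data would suggest that the level list in the import call is incomplete.
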
A first glimpse

First, we check that the data is indeed stored as a data frame:

 

# Check format
class(vice_data)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

The class output confirms that vice_data is a data frame (a tibble). It has 18,497 rows and 37 variables.

 

Now let’s check the structure of the vice_data data frame:

# Check structure
glimpse(vice_data)
## Rows: 18,497
## Columns: 37
## $ `Page Name`                      <fct> VICE News, VICE News, VICE, VICE News~
## $ `User Name`                      <fct> vicenews, vicenews, VICE, vicenews, V~
## $ `Facebook Id`                    <dbl> 236000000000000, 236000000000000, 167~
## $ `Page Category`                  <fct> MEDIA_NEWS_COMPANY, MEDIA_NEWS_COMPAN~
## $ `Page Admin Top Country`         <chr> "US", "US", "US", "US", "US", "US", "~
## $ `Page Description`               <chr> "VICE News Tonight airs Monday–Thursd~
## $ `Page Created`                   <chr> "2014-02-23 19:00:02 EST", "2014-02-2~
## $ `Likes at Posting`               <dbl> 3339049, 3339049, 8312112, 3339023, 8~
## $ `Followers at Posting`           <chr> "4342864", "4342864", "9754669", "434~
## $ `Post Created`                   <chr> "2021-05-26 04:00:18 EDT", "2021-05-2~
## $ Type                             <fct> Native Video, Native Video, Native Vi~
## $ `Total Interactions`             <dbl> 54, 41, 66, 351, 24, 132, 358, 139, 7~
## $ Likes                            <dbl> 34, 23, 19, 77, 12, 35, 151, 36, 15, ~
## $ Comments                         <dbl> 4, 5, 5, 126, 6, 54, 79, 44, 21, 53, ~
## $ Shares                           <dbl> 8, 8, 8, 60, 1, 21, 48, 21, 15, 12, 2~
## $ Love                             <dbl> 6, 1, 6, 5, 0, 1, 1, 1, 1, 13, 1, 7, ~
## $ Wow                              <dbl> 2, 0, 0, 8, 0, 2, 9, 1, 0, 1, 1, 1, 0~
## $ Haha                             <dbl> 0, 2, 3, 19, 2, 15, 58, 22, 5, 23, 40~
## $ Sad                              <dbl> 0, 1, 22, 3, 0, 2, 5, 11, 10, 0, 1, 1~
## $ Angry                            <dbl> 0, 1, 1, 52, 0, 2, 5, 1, 0, 1, 2, 0, ~
## $ Care                             <dbl> 0, 0, 2, 1, 3, 0, 2, 2, 8, 2, 1, 0, 1~
## $ `Video Share Status`             <fct> crosspost, crosspost, crosspost, cros~
## $ `Is Video Owner?`                <fct> Yes, No, No, No, No, No, Yes, No, Yes~
## $ `Post Views`                     <dbl> 3213, 1745, 7268, 8294, 2761, 25601, ~
## $ `Total Views`                    <dbl> 3214, 1752, 7273, 8375, 2761, 25672, ~
## $ `Total Views For All Crossposts` <dbl> 1793907, 13838, 81146, 10240, 129914,~
## $ `Video Length`                   <time> 00:17:38, 00:09:21, 00:24:57, 00:05:~
## $ URL                              <chr> "https://www.facebook.com/23585288990~
## $ Message                          <chr> "Tattoos are stigmatized in Japan bec~
## $ Link                             <chr> "https://www.facebook.com/vicenews/vi~
## $ `Final Link`                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ `Image Text`                     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ `Link Text`                      <chr> "Inside the Underground Pilgrimage Th~
## $ Description                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ `Sponsor Id`                     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ `Sponsor Name`                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ `Sponsor Category`               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N~

It is a good idea to check for duplicate rows to get a sense of the real amount of data.

 

# Count distinct rows
nrow(vice_data %>%
    distinct())
## [1] 18497

The number of distinct rows equals the total number of rows, so the dataset contains no duplicate rows.
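
As a complement to the distinct-row count, the check can also be sketched in base R, which makes it easy to inspect the repeated rows themselves. This is a minimal illustration on a toy data frame with made-up values, not the report's data.

```r
# Toy data frame with one repeated row (values are illustrative only)
toy <- data.frame(
    page = c("VICE", "VICE News", "VICE"),
    likes = c(19, 34, 19)
)

# duplicated() flags every row that repeats an earlier one
n_dupes <- sum(duplicated(toy))
n_dupes

# Show the repeated rows for inspection
toy[duplicated(toy), ]
```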

 

Let’s repair the names of variables:

# Name repair
vice_data_cl <- janitor::clean_names(vice_data)

Before turning to the analysis, let’s check the dataset for problems:

diagnose(vice_data_cl)

Data Wrangling

When we diagnosed the vice_data_cl data frame, we noticed that the final_link, image_text, description, sponsor_id, sponsor_name and sponsor_category variables have more than 90% missing data. We can also see that the page_admin_top_country variable has a single value, US, so it will not be included in the analysis. Let’s remove these variables:

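# Drop the mostly-missing columns and the constant page_admin_top_country column
vice_data_cl <- vice_data_cl %>%
    select(-c("final_link", "image_text", "description", "sponsor_id",
        "sponsor_name", "sponsor_category", "page_admin_top_country"))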

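The two removal criteria used above (mostly missing, or only one distinct value) can also be computed programmatically. The following base R sketch runs on a toy data frame; the column names and the 90% threshold mirror the text, while the values are illustrative.

```r
# Toy data frame: one fully missing column and one constant column
toy <- data.frame(
    likes = c(34, 23, 19, 77),
    sponsor_name = c(NA, NA, NA, NA),
    country = c("US", "US", "US", "US"),
    stringsAsFactors = FALSE
)

# Share of missing values per column, in percent
pct_missing <- colMeans(is.na(toy)) * 100

# Number of distinct non-missing values per column
n_unique <- vapply(
    toy,
    function(col) length(unique(col[!is.na(col)])),
    integer(1)
)

# Drop columns that are mostly missing or carry at most one distinct value
drop_cols <- names(toy)[pct_missing > 90 | n_unique <= 1]
toy_clean <- toy[, setdiff(names(toy), drop_cols), drop = FALSE]
names(toy_clean)
```
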
The next step is to convert the two variables page_created and post_created to a proper date-time format. We will use the Vilnius time zone, where the company is located.

vice_data_cl$page_created <- as.POSIXct(vice_data_cl$page_created,
    tz = "Europe/Vilnius")
vice_data_cl$post_created <- as.POSIXct(vice_data_cl$post_created,
    tz = "Europe/Vilnius")
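
One caveat: the raw strings end in a zone abbreviation (EST or EDT), which the default parsing in as.POSIXct() does not use, so the clock time is read directly in the zone passed as tz. If the intent is to convert the instants to Vilnius time, a sketch along the following lines may be closer. It assumes that EST/EDT denote US Eastern time (America/New_York), which the source does not state explicitly.

```r
# Two timestamps in the format shown in the structure output
raw <- c("2021-05-26 04:00:18 EDT", "2014-02-23 19:00:02 EST")

# Strip the trailing zone abbreviation before parsing
stamp <- sub(" [A-Z]+$", "", raw)

# Parse in the zone the abbreviation is assumed to refer to
parsed <- as.POSIXct(stamp, tz = "America/New_York",
    format = "%Y-%m-%d %H:%M:%S")

# Changing the tzone attribute keeps the instant and changes only the display zone
vilnius <- parsed
attr(vilnius, "tzone") <- "Europe/Vilnius"
format(vilnius, "%Y-%m-%d %H:%M:%S %Z")
```

With this approach the first post is shown at 11:00:18 Vilnius time instead of 04:00:18.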

Analytics

Question 1. Based on the data, comment on how VICE’s content strategy has shifted over time. You are free to focus on just a few aspects of your choice.

We’ll walk through several video metrics to answer question 1.

Post Creation

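# calendarHeat() is defined in a third-party script
source("https://raw.githubusercontent.com/iascchen/VisHealth/master/R/calendarHeat.R")
# Count posts per calendar day; grouping by the full timestamp would give
# a frequency of 1 for almost every post
vcl <- vice_data_cl %>%
    mutate(post_date = as.Date(post_created, tz = "Europe/Vilnius")) %>%
    group_by(post_date) %>%
    summarise(freq = n())

r2g <- c("#D61818", "#FFAE63", "#FFFFBD", "#B5E384")
calendarHeat(vcl$post_date, vcl$freq, ncolors = 99, color = "r2g",
    varname = "Number of Posts per Day")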
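
To compare posting activity across pages over time, the same counting idea can be sketched at monthly resolution in base R. The data frame below is a toy example with made-up dates; only the page names come from the dataset.

```r
# Toy posts with illustrative creation times
toy <- data.frame(
    page = c("VICE", "VICE News", "VICE", "VICE News", "VICE News"),
    post_created = as.POSIXct(
        c("2021-04-03 10:00:00", "2021-04-17 12:30:00",
          "2021-05-01 09:15:00", "2021-05-20 18:45:00",
          "2021-05-26 11:00:18"),
        tz = "Europe/Vilnius"
    )
)

# Month label such as "2021-05"
toy$month <- format(toy$post_created, "%Y-%m")

# Cross-tabulate the number of posts per month and page
posts_per_month <- table(toy$month, toy$page)
posts_per_month
```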

View count

View count is the total number of people who have viewed your video.

Facebook counts a view when someone watches a video for at least 3 seconds (the same applies to Live videos).

View count can be considered more of a vanity metric, as the number of views doesn’t really affect the bottom line if no other action is taken. However, it still shows that the first 3-30 seconds need to be hyper-engaging in order to reel a viewer in.

library(xts)
library(dygraphs)

# Plot one view metric as an interactive time series over post creation time
plot_views <- function(values, title) {
    don <- xts(x = values, order.by = vice_data_cl$post_created)
    dygraph(don, main = title, ylab = "Number of Views") %>%
        dyOptions(labelsUTC = TRUE, fillGraph = TRUE, fillAlpha = 0.1,
            drawGrid = FALSE, colors = "#D8AE5A") %>%
        dyRangeSelector() %>%
        dyCrosshair(direction = "vertical") %>%
        dyHighlight(highlightCircleSize = 5, highlightSeriesBackgroundAlpha = 0.2,
            hideOnMouseOut = FALSE) %>%
        dyRoller(rollPeriod = 1)
}

plot_views(vice_data_cl$post_views, "Post Views Over Time")
plot_views(vice_data_cl$total_views, "Total Views Over Time")
plot_views(vice_data_cl$total_views_for_all_crossposts,
    "Total Views for all Crossposts Over Time")
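
Per-post series can be hard to read when a few posts dominate the scale, so it may help to summarise views per month before looking for a trend. The following is a minimal base R sketch; the view counts are the first values shown in the structure output, while the dates are made up for illustration.

```r
# Toy posts: illustrative dates, view counts taken from the structure output
toy <- data.frame(
    post_created = as.POSIXct(
        c("2021-04-03 10:00:00", "2021-04-17 12:30:00",
          "2021-05-01 09:15:00", "2021-05-20 18:45:00",
          "2021-05-26 11:00:18"),
        tz = "Europe/Vilnius"
    ),
    post_views = c(3213, 1745, 7268, 8294, 2761)
)

# Month label such as "2021-05"
toy$month <- format(toy$post_created, "%Y-%m")

# Median views per month; the median is less sensitive to outliers than the mean
monthly <- aggregate(post_views ~ month, data = toy, FUN = median)
monthly
```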

Engagement

Video engagement includes the comments and likes that video content generates.

It’s a good idea to see how many people are actually taking action on a video, but more than that, a company should pay attention to the types of comments it is getting.
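
To make engagement comparable across posts with very different reach, it can be expressed as a ratio. The sketch below uses two posts from the structure output (the first and the fourth). The definitions of engagement_rate (total interactions per post view) and angry_share (Angry reactions as a fraction of all reactions) are assumptions chosen for illustration, not definitions given by the dataset.

```r
# Two posts with values taken from the structure output
posts <- data.frame(
    total_interactions = c(54, 351),
    likes = c(34, 77), love = c(6, 5), wow = c(2, 8), haha = c(0, 19),
    sad = c(0, 3), angry = c(0, 52), care = c(0, 1),
    post_views = c(3213, 8294)
)

# Reactions exclude comments and shares
reaction_cols <- c("likes", "love", "wow", "haha", "sad", "angry", "care")
posts$reactions <- rowSums(posts[, reaction_cols])

# Assumed definition: interactions per view
posts$engagement_rate <- posts$total_interactions / posts$post_views

# Assumed definition: Angry reactions as a fraction of all reactions
# (a post with zero reactions would yield NaN here)
posts$angry_share <- posts$angry / posts$reactions

round(posts[, c("engagement_rate", "angry_share")], 3)
```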

Social shares

One of the main goals for video content should be social shares. Each share exposes the video to a new network of viewers, which widens the audience, increases brand awareness and potentially brings in new leads.

Negative feedback