Abstract

Providers of video content generally want to understand which of their brands are strongest and how consumption varies across platforms. The primary goal of this exercise is to determine how usage differs by brand from one platform to another. This will help business owners understand where to invest in future products and content. The platforms were divided into two sets - desktop/browser and app/devices. Future research would involve more granular segments such as operating system and individual devices (Roku, Xbox, Apple TV, etc.).

Methodology

At a high level, the process of securing data involved retrieving CONFIDENTIAL internal video usage data from the content owner, posting it to GitHub, loading it into R, and then normalizing string values. The data munging and normalization plan was to load the data, use base R functions and possibly regex to clean it up, and then use the dplyr library to group_by and join the disparate data sets. Due to unexpected results with the dplyr package, the aggregation was performed with base R functions instead.

Two attributes and one numeric metric were selected from each source - Network, Program and Video Start counts. Although the data sets are structured, the content management systems (CMS) that generate the metadata for these attributes vary. Because of this, some normalization must occur before any analysis can begin. The first challenge was loading the data into R: the fill = TRUE argument to read.table was needed for the read to complete, and the iconv function was needed to rid the data of illegal characters that are common when dealing with data generated on the internet.
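The loading step described above can be sketched as follows. This is a minimal illustration, not the original code: the sample lines, the tab delimiter, the column names and the choice of "latin1" as the source encoding are all assumptions, since the real data is confidential.

```r
# Hypothetical stand-in for the confidential usage export (assumed tab-delimited).
raw_lines <- c(
  "Network\tProgram\tVideoStarts",
  "NBC\tThe Voice\t120",
  "Bravo\tTop Chef",                 # ragged row: the count field is missing
  "Telemundo\tSe\xf1ora Acero\t45"   # \xf1 is a single byte that is not valid UTF-8
)

# Drop bytes that cannot be represented in ASCII. Treating the input as
# latin1 is an assumption; sub = "" removes each non-convertible byte.
clean_lines <- iconv(raw_lines, from = "latin1", to = "ASCII", sub = "")

# fill = TRUE pads short rows with NA instead of stopping with an error.
usage <- read.table(text = clean_lines, sep = "\t", header = TRUE,
                    fill = TRUE, quote = "", stringsAsFactors = FALSE)

print(usage)
```

Removing bytes outright loses accented characters, so converting to UTF-8 instead of ASCII may be preferable when the source encoding is known.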

A levels check was run on the Network variables across both sources (commented out in the code), and from there a manual find/replace operation was performed. A more dynamic solution using regex might have been preferable, but the universe of values was 15 or fewer (the number of networks), so a manual approach was workable at this scale. From there a data frame was created, and the values were converted from factors to characters and then to all lowercase.
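A minimal sketch of this normalization, under stated assumptions: the raw labels ("Bravo TV", "USA Network") and the lookup vector fixes are hypothetical, and lowercasing is done before the find/replace here only to keep the lookup short.

```r
# Hypothetical raw labels, as one CMS might emit them.
browser <- data.frame(
  Network = factor(c("NBC", "Bravo TV", "USA Network", "Syfy")),
  VideoStarts = c(10, 5, 3, 2)
)

# Levels check: list the distinct labels to see which ones need fixing.
print(levels(browser$Network))

# Convert factor -> character -> lowercase.
browser$Network <- tolower(as.character(browser$Network))

# Manual find/replace through a small lookup vector (hypothetical variants).
fixes <- c("bravo tv" = "bravo", "usa network" = "usa")
hit <- browser$Network %in% names(fixes)
browser$Network[hit] <- unname(fixes[browser$Network[hit]])

print(browser$Network)
```

With only a handful of networks, a hand-written lookup like this stays readable; a regex-based rule would matter more if the label variants were numerous or unpredictable.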

The primary purpose for trying to normalize the values in the network variables of each table was so that they could be joined to get an aggregate view.
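The aggregate-then-join step could look roughly like the base R sketch below. The row-level values are invented for illustration. The inner join (the default for merge) is an assumption, though it is consistent with the combined table listing only networks that appear in both sources.

```r
# Hypothetical normalized rows (one row per program), not the real data.
browser <- data.frame(Network = c("nbc", "nbc", "bravo", "nbc news"),
                      VideoStarts = c(7, 3, 4, 1), stringsAsFactors = FALSE)
app <- data.frame(Network = c("nbc", "bravo", "bravo", "sprout"),
                  VideoStarts = c(6, 2, 3, 9), stringsAsFactors = FALSE)

# Sum video starts per network within each source.
browser_totals <- aggregate(VideoStarts ~ Network, data = browser, FUN = sum)
names(browser_totals) <- c("Network", "BrowserVideoStarts")
app_totals <- aggregate(VideoStarts ~ Network, data = app, FUN = sum)
names(app_totals) <- c("Network", "AppVideoStarts")

# Inner join on the normalized network name: unmatched networks are dropped.
all_totals <- merge(browser_totals, app_totals, by = "Network")
all_totals$AllVideoStarts <-
  all_totals$BrowserVideoStarts + all_totals$AppVideoStarts

# Sort descending by combined starts.
all_totals <- all_totals[order(-all_totals$AllVideoStarts), ]
print(all_totals)
```

A join like this silently drops any network whose label was not normalized identically in both sources, which is why the string cleanup has to come first.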

The tables below are aggregated video views/starts by network for PC/Browser & Apps/Connected Devices.

##    BrowserVideoNetwork BrowserVideoStarts
## 11                 nbc             127307
## 1                bravo              44823
## 18                 usa              13936
## 15                syfy              10507
## 14              oxygen               6269
## 17           telemundo               1584
## 2                 cnbc               1512
## 9              esquire               1431
## 10               msnbc                609
## 12            nbc news                252
## 3         csn bay area                  0
## 4       csn california                  0
## 5          csn chicago                  0
## 6     csn mid-atlantic                  0
## 7      csn new england                  0
## 8           csn philly                  0
## 13          nbc sports                  0
## 16                 tcn                  0

##    AppVideoNetwork AppVideoStarts
## 7              nbc          96661
## 2            bravo          54671
## 13             usa          31276
## 12       telemundo          28184
## 11            syfy          12561
## 9           oxygen          10742
## 4               e!           4991
## 10          sprout           2087
## 3             cnbc            353
## 5          esquire            164
## 8     nbc universo             25
## 6            msnbc             13
## 1              awe              0

##   AllVideoNetworks AllVideoStarts
## 5              nbc         223968
## 1            bravo          99494
## 9              usa          45212
## 8        telemundo          29768
## 7             syfy          23068
## 6           oxygen          17011
## 2             cnbc           1865
## 3          esquire           1595
## 4            msnbc            622

Results are also displayed below in the horizontal bar graph. Notice that the network string values in the joined set match across both charts and tables.
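One way to draw such a horizontal bar graph in base R is sketched below, using the per-platform totals from the tables above for the nine networks in the combined table. The grouping, labels and title are assumptions and may differ from the original chart.

```r
# Totals taken from the browser and app tables for the nine joined networks.
networks <- c("nbc", "bravo", "usa", "telemundo", "syfy",
              "oxygen", "cnbc", "esquire", "msnbc")
starts <- rbind(
  Browser = c(127307, 44823, 13936, 1584, 10507, 6269, 1512, 1431, 609),
  App     = c(96661, 54671, 31276, 28184, 12561, 10742, 353, 164, 13)
)
colnames(starts) <- networks

# Reverse the columns so the largest network is drawn at the top.
# barplot returns the bar midpoints, one row per platform.
mids <- barplot(starts[, rev(networks)], beside = TRUE, horiz = TRUE, las = 1,
                legend.text = TRUE, xlab = "Video starts",
                main = "Video starts by network and platform")
```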

Executive Summary

The data indicates that the broadcast network, NBC, does in fact generate the most usage of any brand across both platforms. However, more of its usage happens on PC/Browser than on mobile apps and connected devices (127K vs. 97K). For cable brands such as Bravo, Telemundo, USA & Syfy, by contrast, more of the usage is on mobile apps & devices, which may indicate a younger audience. The recommendation would be for cable brands to continue to invest more in mobile apps & devices, while more traditional broadcast brands like NBC might need to continue to invest in older PC technology in order to serve their older audience segments.
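The platform split behind this summary can be checked with a short calculation on the table totals. The 50% threshold used to label a network as app-majority is an illustrative choice, not something taken from the original analysis.

```r
# Per-platform totals from the tables above for the five networks named here.
browser <- c(nbc = 127307, bravo = 44823, usa = 13936,
             telemundo = 1584, syfy = 10507)
app <- c(nbc = 96661, bravo = 54671, usa = 31276,
         telemundo = 28184, syfy = 12561)

# Share of each network's combined starts that came from apps and devices.
app_share <- app / (browser + app)
print(round(100 * app_share, 1))

# Networks where apps and devices account for more than half of starts.
app_majority <- names(app_share)[app_share > 0.5]
print(app_majority)
```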

Function Reference Section:

fill argument to help read.table execute
Source: Stack Overflow Posting on Error in Reading in Data Set

iconv function to handle bad characters
Source: Stack Overflow Posting on Invalid Multibyte String

Notes:

dplyr functions didn’t deliver expected results
Had difficulty using regex to parse and replace values as expected
Base order functions failed on data frames and tables