Abstract
Providers of video content generally want to understand which of their brands are strongest and how consumption varies across platforms. The primary goal of this exercise is to determine how usage differs by brand from one platform to another. This will help business owners decide where to invest in future products and content. The platforms were divided into two sets: desktop/browser and app/devices. Future research would involve more granular segments such as operating system and individual devices (Roku, Xbox, Apple TV, etc.).
Methodology
At a high level, the process of securing the data involved retrieving CONFIDENTIAL internal video usage data from the content owner, posting it to GitHub, loading it into R, and then normalizing the string values. The data munging and normalization plan was to load the data, clean it with base R functions and possibly regular expressions, and then use the dplyr library to group_by and join the disparate data sets. Due to unexpected results with the dplyr package, the aggregation was performed with base R functions instead.
Two attributes and one numeric metric were selected from each source: Network, Program, and Video Start counts. Although the data sets are structured, the content management systems (CMS) that generate the metadata for these attributes vary. Because of this, some normalization must occur before any analysis can begin. The first challenge was loading the data into R: the fill = TRUE argument to read.table was needed for the read to complete, and the iconv function was needed to rid the data of illegal characters, which are common in data generated on the internet.
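The loading step described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the report: the sample rows, the column names, and the choice to strip (rather than transliterate) unconvertible bytes are all assumptions made for the example.

```r
# Hypothetical tab-delimited extract (not the confidential source data).
# Row 3 is missing its last field; row 4 contains a stray byte (0xE9)
# that is not valid UTF-8.
raw_lines <- c(
  "Network\tProgram\tVideoStarts",
  "NBC\tShow A\t120",
  "Bravo\tShow B",
  "Telemundo\tCaf\xe9 Show\t45"
)

# Drop any bytes that cannot be converted, so that later string
# functions do not stop with an "invalid multibyte string" error.
clean_lines <- iconv(raw_lines, from = "UTF-8", to = "ASCII", sub = "")

# fill = TRUE pads short rows with NA instead of raising an error.
usage <- read.table(
  text = clean_lines, header = TRUE, sep = "\t",
  fill = TRUE, quote = "", stringsAsFactors = FALSE
)
print(usage)
```

Stripping bytes with sub = "" loses the accented character; if the source encoding is known (for example latin1), converting from that encoding to UTF-8 would preserve it instead.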
The levels of the Network variable were checked across both sources (commented out in the code), and a manual find/replace operation was then performed. A more dynamic solution using regular expressions might have been preferable, but the universe of values was 15 or fewer (the number of networks), so the manual approach was manageable at this scale. From there a data frame was created, and the values were converted from factors to characters and then to all lowercase.
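A minimal sketch of that normalization, using base R only. The network spellings and counts below are hypothetical stand-ins for the kinds of mismatches a levels check might reveal; they are not taken from the source data.

```r
# Hypothetical browser-side data with inconsistent network labels.
browser <- data.frame(
  Network     = factor(c("NBC", "Bravo TV", "USA Network", "NBC")),
  VideoStarts = c(10, 5, 3, 7)
)

# Levels check: list the distinct labels before deciding what to replace.
print(levels(browser$Network))

# Convert factor -> character so values can be edited freely.
browser$Network <- as.character(browser$Network)

# Manual find/replace, one line per mismatched label.
browser$Network[browser$Network == "Bravo TV"]    <- "Bravo"
browser$Network[browser$Network == "USA Network"] <- "USA"

# Lowercase everything so both sources share one spelling.
browser$Network <- tolower(browser$Network)
print(unique(browser$Network))
```

With only a handful of networks, explicit replacements like these are easy to audit; a lookup table or regular expression would be the natural next step if the list grew.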
The primary purpose of normalizing the values in the Network variable of each table was so that the tables could be joined to get an aggregate view.
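One way to do the aggregation and join with base R is sketched here. The data frames and their values are made up for illustration, and the use of aggregate, merge, and order is an assumption about how such a step could be written, not a reproduction of the report's code.

```r
# Hypothetical, already-normalized inputs (lowercase network names).
browser <- data.frame(
  Network = c("nbc", "bravo", "nbc", "cnbc"),
  VideoStarts = c(100, 40, 27, 5),
  stringsAsFactors = FALSE
)
app <- data.frame(
  Network = c("nbc", "bravo", "e!"),
  VideoStarts = c(90, 50, 4),
  stringsAsFactors = FALSE
)

# Sum video starts per network within each source.
browser_totals <- aggregate(VideoStarts ~ Network, data = browser, FUN = sum)
names(browser_totals) <- c("Network", "BrowserVideoStarts")
app_totals <- aggregate(VideoStarts ~ Network, data = app, FUN = sum)
names(app_totals) <- c("Network", "AppVideoStarts")

# merge() defaults to an inner join: only networks present in both
# sources survive, which is why the join key must match exactly.
all_totals <- merge(browser_totals, app_totals, by = "Network")
all_totals$AllVideoStarts <-
  all_totals$BrowserVideoStarts + all_totals$AppVideoStarts

# Sort from most to least total starts.
all_totals <- all_totals[order(all_totals$AllVideoStarts, decreasing = TRUE), ]
print(all_totals)
```

In this sketch "cnbc" and "e!" drop out of the joined result because each appears in only one source; passing all = TRUE to merge would keep them with NA on the missing side.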
The tables below are aggregated video views/starts by network for PC/Browser & Apps/Connected Devices.
##    BrowserVideoNetwork BrowserVideoStarts
## 11                 nbc             127307
##  1               bravo              44823
## 18                 usa              13936
## 15                syfy              10507
## 14              oxygen               6269
## 17           telemundo               1584
##  2                cnbc               1512
##  9             esquire               1431
## 10               msnbc                609
## 12            nbc news                252
##  3        csn bay area                  0
##  4      csn california                  0
##  5         csn chicago                  0
##  6    csn mid-atlantic                  0
##  7     csn new england                  0
##  8          csn philly                  0
## 13          nbc sports                  0
## 16                 tcn                  0
##    AppVideoNetwork AppVideoStarts
##  7             nbc          96661
##  2           bravo          54671
## 13             usa          31276
## 12       telemundo          28184
## 11            syfy          12561
##  9          oxygen          10742
##  4              e!           4991
## 10          sprout           2087
##  3            cnbc            353
##  5         esquire            164
##  8    nbc universo             25
##  6           msnbc             13
##  1             awe              0
##   AllVideoNetworks AllVideoStarts
## 5              nbc         223968
## 1            bravo          99494
## 9              usa          45212
## 8        telemundo          29768
## 7             syfy          23068
## 6           oxygen          17011
## 2             cnbc           1865
## 3          esquire           1595
## 4            msnbc            622
The results are also displayed below in a horizontal bar graph. Note that the network string values in the combined set match across both the charts and the tables.
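A horizontal bar graph of the combined totals could be drawn with base graphics along these lines. The totals are copied from the combined table above; the title, axis label, and sort direction are assumptions, and the original chart may have been produced differently.

```r
# Combined totals as reported in the AllVideoStarts table.
all_totals <- data.frame(
  Network = c("nbc", "bravo", "usa", "telemundo", "syfy",
              "oxygen", "cnbc", "esquire", "msnbc"),
  AllVideoStarts = c(223968, 99494, 45212, 29768, 23068,
                     17011, 1865, 1595, 622),
  stringsAsFactors = FALSE
)

# barplot() draws horizontal bars from bottom to top, so sort ascending
# to place the largest network at the top of the chart.
plot_data <- all_totals[order(all_totals$AllVideoStarts), ]

# las = 1 keeps the network labels horizontal and readable.
bar_mid <- barplot(
  plot_data$AllVideoStarts,
  names.arg = plot_data$Network,
  horiz = TRUE, las = 1,
  xlab = "Video starts",
  main = "All video starts by network"
)
```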
Executive Summary
The data indicates that the broadcast network, NBC, does in fact generate the most usage of any brand across both platforms. However, more of its usage happens on PC/Browser than on mobile apps and connected devices (127K vs. 97K). For cable brands such as Bravo, Telemundo, USA & Syfy, by contrast, more of the usage is on mobile apps & devices, which may indicate a younger audience. The recommendation would be for cable brands to continue to invest more in mobile apps & devices, while more traditional broadcast brands like NBC might need to continue to invest in older PC technology in order to service its older audience segments.
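The platform split behind this summary can be checked directly from the reported tables. The sketch below recomputes each brand's app/device share of total starts; the counts come from the tables above, while the column name AppShare is introduced here purely for illustration.

```r
# Browser and app/device starts for the brands named in the summary,
# taken from the tables in the Methodology section.
shares <- data.frame(
  Network       = c("nbc", "bravo", "usa", "telemundo", "syfy"),
  BrowserStarts = c(127307, 44823, 13936, 1584, 10507),
  AppStarts     = c(96661, 54671, 31276, 28184, 12561),
  stringsAsFactors = FALSE
)

# Share of each brand's starts that came from apps and connected devices.
shares$AppShare <- shares$AppStarts /
  (shares$BrowserStarts + shares$AppStarts)

# A share above 0.5 means the brand leans toward apps/devices.
print(shares[order(shares$AppShare, decreasing = TRUE),
             c("Network", "AppShare")])
```

Under these numbers NBC is the only brand in the list with an app/device share below one half, which is consistent with the summary's contrast between broadcast and cable brands.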
Function Reference Section:
fill argument to help read.table execute
Source: Stack Overflow Posting on Error in Reading in Data Set
iconv function to handle bad characters
Source: Stack Overflow Posting on Invalid Multibyte String
Notes:
dplyr functions didn’t deliver expected results
Had difficulty using regular expressions to parse and replace values as expected
Base order functions failed on data frames and tables.