607presentation

Alvaro Bueno
09/13/2017

On Predictive Modeling

Predictive modeling tries to estimate the value of a target variable from the other variables that are currently available.

  • We can find those variables that correlate with or give us information about another variable of interest.
  • One basic measure of attribute information is called information gain, based on entropy.
  • Selecting informative attributes forms the basis of a common modeling technique called tree induction.
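As a minimal sketch of the information gain idea mentioned above, the snippet below computes Shannon entropy and the gain from splitting on a single attribute. The toy labels and attribute values are hypothetical and are not taken from any Netflix data; tree induction repeatedly picks the attribute with the highest gain at each node.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(attribute_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: two failing and two passing assets.
labels = ["fail", "fail", "pass", "pass"]
informative = ["old", "old", "new", "new"]  # separates the classes cleanly
uninformative = ["a", "b", "a", "b"]        # each group is still half pass, half fail

gain_informative = information_gain(informative, labels)
gain_uninformative = information_gain(uninformative, labels)
```

Here the attribute that separates the classes cleanly recovers the full one bit of entropy, while the other attribute yields no gain at all.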

Predictive Modeling on Netflix

  • A viewer's experience is affected by factors that include the quality of the Internet connection, device characteristics, the CDN, algorithms on the device, and the quality of the content.
  • The streaming experience depends heavily on the quality of the video, audio, and text (subtitles, closed captions) assets that are used.
  • The Netflix QC process consists of automated and manual inspections to identify and replace assets that do not meet its specified quality standards.

Manual QC on Netflix

Manual QC is then done to check for issues that are easily detected by the human eye, such as those listed below. Depending on the content, either selected points of the asset or its entire duration are checked. Only a small percentage of assets fail this check.

  • video interlacing artifacts
  • audio-video sync issues
  • missing subtitles
  • poorly placed subtitles

Predictive QC on Netflix

The data on manual QC failures showed that certain factors affected the likelihood of an asset failing QC.

  • Some combinations of content and fulfillment partners had a higher rate of defects for certain types of assets.
  • Metadata, such as release year, showed patterns of failure. Older content had higher defect rates.
  • The genre of the content also exhibited certain patterns of failure.

Predictive QC on Netflix(2)

Predictive QC model on Netflix.

Observations

A key goal of the model is to identify all defective assets, even if this results in extra manual checks. Hence, we tuned the model for a low false-negative rate (i.e., fewer uncaught defects) at the cost of an increased false-positive rate.
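The trade-off described above can be illustrated by moving the decision threshold of a scoring model. This is only a sketch under stated assumptions: the scores, labels, and threshold values are hypothetical, and the source does not say how the actual model was tuned.

```python
def error_rates(scores, is_defective, threshold):
    """Return (false_negative_rate, false_positive_rate) when assets with
    score >= threshold are flagged for manual QC."""
    pairs = list(zip(scores, is_defective))
    false_negatives = sum(1 for s, d in pairs if d and s < threshold)
    false_positives = sum(1 for s, d in pairs if not d and s >= threshold)
    positives = sum(1 for d in is_defective if d)
    negatives = len(is_defective) - positives
    return false_negatives / positives, false_positives / negatives

# Hypothetical model scores (higher means "more likely defective").
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
is_defective = [True, True, True, False, False, False, False, False]

fnr_high, fpr_high = error_rates(scores, is_defective, threshold=0.5)
fnr_low, fpr_low = error_rates(scores, is_defective, threshold=0.25)
```

With this toy data, lowering the threshold catches the defect that was previously missed, and the price is that more good assets are sent to manual checks.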

We have a lot more data on “pass” assets than on “fail” assets. We tackled this imbalance by using cost-sensitive training that heavily penalizes misclassification of the minority class.
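One common way to make training cost-sensitive is to weight each example's loss by the inverse frequency of its class. The sketch below assumes that weighting scheme and a weighted log loss; the source does not state which weights or loss were actually used, and the helper names and data are hypothetical.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

def weighted_log_loss(prob_fail, labels, weights):
    """Mean log loss with each example scaled by the weight of its true class."""
    total = 0.0
    for p, label in zip(prob_fail, labels):
        p_true = p if label == "fail" else 1.0 - p
        total += -weights[label] * math.log(p_true)
    return total / len(labels)

# Hypothetical imbalanced data: nine passing assets and one failing asset.
labels = ["pass"] * 9 + ["fail"]
weights = inverse_frequency_weights(labels)  # "fail" gets weight 5.0

# Model A misses the failing asset; model B raises one false alarm instead.
loss_missed_defect = weighted_log_loss([0.1] * 10, labels, weights)
loss_false_alarm = weighted_log_loss([0.9] + [0.1] * 8 + [0.9], labels, weights)
```

Under these weights, missing the single failing asset costs far more than raising one false alarm, which pushes a learner toward catching the minority class.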

Observations(2)

Video assets from episodes within the same season of a show tend to be either mostly defective or mostly non-defective. It is likely that assets in a batch were created or packaged around the same time and/or with the same equipment, and hence share similar defects.

To fine-tune the model, offline validation was performed by passively making predictions on incoming assets and comparing them with the actual results from manual QC.
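Offline validation of this kind can be sketched as tallying predicted labels against the manual QC outcomes. The example assumes both are available as lists of "pass"/"fail" strings; the data and function names are hypothetical.

```python
from collections import Counter

def confusion_counts(predicted, actual):
    """Count each (predicted, actual) pair of labels."""
    return Counter(zip(predicted, actual))

def recall_and_precision(counts, positive="fail"):
    """Recall and precision for the positive (defective) class."""
    tp = sum(n for (p, a), n in counts.items() if p == positive and a == positive)
    fn = sum(n for (p, a), n in counts.items() if p != positive and a == positive)
    fp = sum(n for (p, a), n in counts.items() if p == positive and a != positive)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

# Hypothetical passive predictions and the manual QC results for the same assets.
predicted = ["fail", "fail", "pass", "pass", "fail", "pass"]
actual = ["fail", "pass", "pass", "pass", "fail", "fail"]

counts = confusion_counts(predicted, actual)
recall, precision = recall_and_precision(counts)
```

Recall on the "fail" class is the number to watch here, since an uncaught defect is the costlier error for this model.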
