Base Rate

In the commercials dataset, about 64 percent of the total observations are true commercials. Any model generated needs to classify commercials better than simply assigning every observation the label “commercial,” which by itself would yield an accuracy of about 64 percent.
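As a quick check, the base rate can be computed directly from the class labels. This is a minimal sketch that assumes the data have been read into a data frame named `commercials` with the class stored in a `Label` column coded 1 for commercial and -1 for non-commercial; both names are placeholders rather than the actual column names.

```r
# Proportion of each class; the share labeled 1 is the accuracy of the
# "always predict commercial" baseline (about 0.64 here).
prop.table(table(commercials$Label))
```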

Correlation Matrix

|                 | Shot Length | Motion Dist. | Frame Diff. | STE        | ZCR        | Centroid   | Roll Off   | Flux       | Frequency  | Motion Dist. 2 | Label      |
|-----------------|-------------|--------------|-------------|------------|------------|------------|------------|------------|------------|----------------|------------|
| Shot Length     |  1.0000000  | -0.1476257   | -0.1492126  |  0.0264850 |  0.1909047 |  0.3644193 |  0.3801847 |  0.1023112 |  0.2926857 |  0.2121442     | -0.2721069 |
| Motion Dist.    | -0.1476257  |  1.0000000   |  0.7157031  | -0.0071601 | -0.0522060 | -0.1790648 | -0.2215247 | -0.0191231 | -0.0969806 | -0.7576453     |  0.0539380 |
| Frame Diff.     | -0.1492126  |  0.7157031   |  1.0000000  | -0.0239682 | -0.0420741 | -0.2989128 | -0.3848395 |  0.0061967 | -0.0906722 | -0.6451789     | -0.0474565 |
| STE             |  0.0264850  | -0.0071601   | -0.0239682  |  1.0000000 | -0.1250579 |  0.3087839 |  0.1603135 |  0.8234632 |  0.0225120 |  0.0314128     |  0.1088148 |
| ZCR             |  0.1909047  | -0.0522060   | -0.0420741  | -0.1250579 |  1.0000000 |  0.3089015 |  0.0330808 | -0.0533694 |  0.5335548 |  0.0674151     | -0.2537635 |
| Centroid        |  0.3644193  | -0.1790648   | -0.2989128  |  0.3087839 |  0.3089015 |  1.0000000 |  0.8092628 |  0.2834196 |  0.4190101 |  0.3139645     | -0.2734260 |
| Roll Off        |  0.3801847  | -0.2215247   | -0.3848395  |  0.1603135 |  0.0330808 |  0.8092628 |  1.0000000 |  0.1684059 |  0.3249171 |  0.3897625     | -0.2445148 |
| Flux            |  0.1023112  | -0.0191231   |  0.0061967  |  0.8234632 | -0.0533694 |  0.2834196 |  0.1684059 |  1.0000000 |  0.2361593 |  0.0364320     | -0.1400250 |
| Frequency       |  0.2926857  | -0.0969806   | -0.0906722  |  0.0225120 |  0.5335548 |  0.4190101 |  0.3249171 |  0.2361593 |  1.0000000 |  0.1282513     | -0.3949560 |
| Motion Dist. 2  |  0.2121442  | -0.7576453   | -0.6451789  |  0.0314128 |  0.0674151 |  0.3139645 |  0.3897625 |  0.0364320 |  0.1282513 |  1.0000000     | -0.0452032 |
| Label           | -0.2721069  |  0.0539380   | -0.0474565  |  0.1088148 | -0.2537635 | -0.2734260 | -0.2445148 | -0.1400250 | -0.3949560 | -0.0452032     |  1.0000000 |

Variables that appear to be highly correlated:

1. Motion distribution and frame difference distribution (0.72)
2. Motion distribution and motion distribution 2 (-0.76)
3. Frame difference distribution and motion distribution 2 (-0.65)
4. Short time energy and spectral flux (0.82)
5. ZCR and fundamental frequency (0.53)
6. Spectral centroid and spectral roll off (0.81)

To keep the k-nearest neighbors model from being distorted by redundant predictors, it makes sense to remove variables that are highly correlated with more than one other variable. In this case, text area distribution was the only variable removed. Other variables also show high correlations, but removing every variable with at least one seemingly high correlation would have eliminated nine of the eleven total variables.
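A sketch of this screening step is shown below. It assumes the predictors live in the same `commercials` data frame used above; `Text_Area_Dist` and `keepcol` are placeholder names for the dropped column and the vector of retained columns.

```r
library(dplyr)

# Pairwise correlations among the numeric predictors (Label excluded)
cor_mat <- round(cor(select(commercials, -Label)), 3)
cor_mat

# Drop the one variable that is highly correlated with more than one other
keepcol     <- setdiff(names(commercials), "Text_Area_Dist")
commercials <- select(commercials, all_of(keepcol))
```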

Train and Test Sets

In order to create a k-nearest neighbors model, the data needs to be split into training and testing datasets. Typically, 80 percent of the data is used for training and the remaining 20 percent is held out for testing. If the training data were also used for testing, the results would overstate how effective the model is, because the model would be evaluated on observations it has already seen. In this case, 18,036 observations were included in the training set and 4,509 observations were included in the test set.
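A minimal sketch of the 80/20 split, assuming the cleaned `commercials` data frame from above; the seed is arbitrary, and `createDataPartition()` from caret could be used instead if the class proportions should be preserved exactly in both sets.

```r
set.seed(2021)  # arbitrary seed for reproducibility

# 80 percent of rows go to training, the rest to testing
train_idx <- sample(seq_len(nrow(commercials)),
                    size = floor(0.8 * nrow(commercials)))

train <- commercials[train_idx, ]
test  <- commercials[-train_idx, ]
```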

Confusion Matrix

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction   -1    1
##         -1 1143  335
##         1   520 2511
##                                           
##                Accuracy : 0.8104          
##                  95% CI : (0.7986, 0.8217)
##     No Information Rate : 0.6312          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5831          
##                                           
##  Mcnemar's Test P-Value : 3.121e-10       
##                                           
##             Sensitivity : 0.8823          
##             Specificity : 0.6873          
##          Pos Pred Value : 0.8284          
##          Neg Pred Value : 0.7733          
##              Prevalence : 0.6312          
##          Detection Rate : 0.5569          
##    Detection Prevalence : 0.6722          
##       Balanced Accuracy : 0.7848          
##                                           
##        'Positive' Class : 1               
## 

After fitting a k-nearest neighbors model with k = 3, the resulting confusion matrix shows the model is approximately 81 percent accurate with a sensitivity of about 88 percent. The accuracy is well above the 64 percent base rate, meaning the model assigns observations to classes better than the naive baseline of labeling everything a commercial. A sensitivity of 88 percent means the model correctly identifies about 88 percent of the clips that are in fact commercials.
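These headline numbers can be read directly off the table above, with the commercial class (1) treated as positive: accuracy = (2511 + 1143) / 4509 ≈ 0.8104, sensitivity = 2511 / (2511 + 335) ≈ 0.8823, and specificity = 1143 / (1143 + 520) ≈ 0.6873.

For reference, a fit like this could be produced with something along the lines of the sketch below. It assumes the `train` and `test` data frames from the split above and a `Label` column holding the -1/1 classes, and uses `knn()` from the class package and `confusionMatrix()` from caret. Scaling the predictors first is common for k-nearest neighbors and is omitted here only for brevity.

```r
library(class)   # knn()
library(caret)   # confusionMatrix()

features <- setdiff(names(train), "Label")

# k-nearest neighbors with k = 3
pred_k3 <- knn(train = train[, features],
               test  = test[, features],
               cl    = factor(train$Label),
               k     = 3)

# Confusion matrix with the commercial class (1) as the positive class
confusionMatrix(data = pred_k3,
                reference = factor(test$Label),
                positive = "1")
```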

Comparing K Values

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction   -1    1
##         -1 1143  304
##         1   520 2542
##                                           
##                Accuracy : 0.8173          
##                  95% CI : (0.8057, 0.8284)
##     No Information Rate : 0.6312          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5966          
##                                           
##  Mcnemar's Test P-Value : 6.894e-14       
##                                           
##             Sensitivity : 0.8932          
##             Specificity : 0.6873          
##          Pos Pred Value : 0.8302          
##          Neg Pred Value : 0.7899          
##              Prevalence : 0.6312          
##          Detection Rate : 0.5638          
##    Detection Prevalence : 0.6791          
##       Balanced Accuracy : 0.7902          
##                                           
##        'Positive' Class : 1               
## 
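A comparison across values of k, like the one discussed in the Summary below, can be generated with a loop such as the following sketch. It reuses the `train`, `test`, and `features` objects assumed above; the set of k values is illustrative.

```r
ks <- c(1, 3, 5, 7)

results <- sapply(ks, function(k) {
  pred <- knn(train = train[, features],
              test  = test[, features],
              cl    = factor(train$Label),
              k     = k)
  cm <- confusionMatrix(pred, factor(test$Label), positive = "1")
  c(Accuracy    = unname(cm$overall["Accuracy"]),
    Sensitivity = unname(cm$byClass["Sensitivity"]))
})

colnames(results) <- paste0("k = ", ks)
round(results, 4)  # accuracy and sensitivity for each k, as in the plot below
```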

Summary

From the plot above, the difference in accuracy between k = 1 and k = 3 is over two percentage points, a large enough gain to justify choosing three nearest neighbors over one. Moving to k = 5 adds just over half a percentage point of accuracy and about one percentage point of sensitivity, which is likely not enough of an improvement to justify the larger neighborhood. These small gains suggest that three nearest neighbors is the best choice for this model. With k = 3, the model is just over 81 percent accurate and about 88 percent sensitive, so it does a good job of determining whether a video clip is a commercial and clearly beats the 64 percent base rate.

The model appears to be sufficiently accurate at determining whether a given video clip is a commercial. However, its purpose is somewhat unclear. All of the features used in the model are directly under MEH’s control when producing a commercial, so MEH could make commercials that look more like TV shows without the assistance of a model that labels clips as commercials. Finally, rather than testing whether TV-show-like commercials hold viewers’ attention better than standard commercials, the model only labels whether a given video is a commercial. Mr. Rooney may want another model that includes a variable measuring viewer attention.