kNN ML Model: Commercial Classification

Objective

You left your job as a lobbyist because the political environment was becoming just too toxic to handle. Luckily, you landed a job in advertising!

Your company, Marketing Enterprises of Halifax, or “MEH,” is being beaten by the competition and wants a new way to determine the quality of its commercials. Your boss, Mr. Ed Rooney, would like the company’s commercials to seem more like actual TV shows, so he wants you to develop a machine learning model that uses the company’s internal data to classify whether something is a commercial or not. Mr. Rooney believes the company will then know how to trick potential customers into thinking its commercials are still part of the show; as a result, they will pay more attention and buy more of the terrible products “MEH” is supporting.

Given that MEH is producing commercials more or less continuously, you know the model will need to be updated quite frequently, so you decide to use an accessible approach that you might be able to explain to Mr. Rooney: k-nearest neighbors.

kNN Analysis

Split between commercials and non-commercials

Tabulating the frequency of each label gives the baseline rate: 63.92% of the observations are commercials, so a model with no other information that always guesses “commercial” would be correct 63.92% of the time.
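As an illustration of the tabulation, here is a minimal Python sketch. The labels are a hypothetical toy list, not the company data, and the report's own figures were not produced with this code.

```python
from collections import Counter

# Hypothetical toy labels: 1 = commercial, -1 = non-commercial.
labels = [1, 1, -1, 1, -1, 1, 1, -1, 1, 1]

counts = Counter(labels)  # frequency of each label
majority_label, majority_count = counts.most_common(1)[0]

# Accuracy obtained by always guessing the most frequent label.
baseline_rate = majority_count / len(labels)

print(counts)
print(f"always predict {majority_label}: {baseline_rate:.2%} correct")
```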

Removing Unnecessary Data

Variance

When first analyzing the company’s internal data, which gives us information about the audio and visual components, there are some variables we can remove from the start. First, the columns holding the variance of each component are removed: we already have columns containing the means of those components, so we treat the variance statistics as unnecessary for our model.

Correlation

Highly correlated variables should be removed from our kNN analysis, because they effectively count the same information more than once in the distance calculation and so exert a biased pull on the labeling. Based on the correlation data, we can deduce which variables to remove:

motion_distr_mn - correlations beyond ±0.7 with two other variables and a very low (~0.05) correlation with the label, the desired classification.
spectral_centroid_mn - correlation over 0.8 with one variable and generally high correlations with several others. It is primarily correlated with spectral_roll_off_mn, but we’ll keep that variable because its mean correlation is lower than that of spectral_centroid_mn.
spectral_flux_mn - extremely high correlation of over 0.8 with short_time_energy_mn. We’ll keep short_time_energy_mn because its mean correlation is lower than that of spectral_flux_mn.

Columns removed due to high correlations:

motion_distr_mn
spectral_centroid_mn
spectral_flux_mn
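
The correlation screen described above can be sketched in Python as follows. The feature names and values are hypothetical toy data, and the 0.7 cutoff simply mirrors the threshold mentioned for motion_distr_mn; this is an illustration under those assumptions, not the code behind the report.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical toy features: one list of values per column.
features = {
    "feat_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feat_b": [2.1, 3.9, 6.2, 8.1, 9.8],  # close to 2 * feat_a
    "feat_c": [5.0, 1.0, 4.0, 2.0, 3.0],  # only weakly related to the others
}

THRESHOLD = 0.7  # flag pairs whose absolute correlation exceeds this

flagged = []
for a, b in combinations(features, 2):
    r = pearson(features[a], features[b])
    if abs(r) > THRESHOLD:
        flagged.append((a, b, r))

for a, b, r in flagged:
    print(f"{a} ~ {b}: r = {r:.3f}")
```

Only one variable from each flagged pair would then be dropped, for example the one with the higher mean correlation across all other columns.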

Training and Executing the Model

For training and testing our model, we’ll use a 70/30 split. Creating an index that divides the rows in those proportions gives us our training and test sets.
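
One way to build such an index is sketched below in Python. The function name, the row count of 1000, and the fixed seed are all assumptions made for illustration.

```python
import random

def train_test_split_indices(n_rows, train_fraction=0.7, seed=42):
    """Return (train_idx, test_idx): two disjoint lists of row indices."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # shuffle so the split is random
    cut = round(n_rows * train_fraction)  # number of rows in the training set
    return indices[:cut], indices[cut:]

# Hypothetical dataset of 1000 rows.
train_idx, test_idx = train_test_split_indices(1000)
print(len(train_idx), len(test_idx))
```

Fixing the seed keeps the split reproducible, which matters when the model is retrained and compared over time.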

As we can see, the split is indeed 70/30.

After training the model with k=3 nearest neighbors to get a baseline output, we create a “confusion matrix.” A confusion matrix summarizes the basic performance of the model by counting the true positive, true negative, false positive, and false negative instances. Since a “1” indicates a commercial and a “-1” indicates a non-commercial, the top left of the table holds the correctly classified non-commercials (true negatives) and the bottom right holds the correctly classified commercials (true positives).
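
To make the mechanics concrete, here is a minimal Python sketch of a 3-nearest-neighbor vote followed by a confusion-matrix tally. The points and labels are hypothetical toy data, and `knn_predict` is an illustrative helper rather than the function used to produce this report.

```python
from collections import Counter
from math import dist  # Euclidean distance, available since Python 3.8

def knn_predict(train_X, train_y, point, k=3):
    """Label `point` by majority vote among its k nearest training points."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: dist(pair[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: 1 = commercial, -1 = non-commercial.
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
           (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = [-1, -1, -1, 1, 1, 1]
test_X = [(0.05, 0.1), (0.95, 1.0), (0.6, 0.6), (0.3, 0.3)]
test_y = [-1, 1, -1, 1]

predictions = [knn_predict(train_X, train_y, p, k=3) for p in test_X]

# Confusion matrix as counts keyed by (predicted, actual).
confusion = Counter(zip(predictions, test_y))

# Rows are predictions, columns are actual labels, in the order -1, 1.
for predicted in (-1, 1):
    print(predicted, [confusion[(predicted, actual)] for actual in (-1, 1)])
```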

To help calculate the accuracy, we extract the true positives and true negatives:

By dividing the number of correct classifications (true positives plus true negatives) by the total number of classifications in the confusion matrix, we can assess the accuracy of the model with k=3 neighbors.

The accuracy comes to 73.17%. Compared to the baseline rate of 63.92%, our machine learning model improves accuracy by about 9 percentage points.
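
The arithmetic can be checked with a few lines of Python, using the counts from the k=3 confusion matrix reported in this section. The variable names are illustrative.

```python
# Counts from the k=3 confusion matrix, keyed by (predicted, actual).
confusion = {(-1, -1): 1314, (-1, 1): 743, (1, -1): 1072, (1, 1): 3635}

true_negatives = confusion[(-1, -1)]
true_positives = confusion[(1, 1)]
total = sum(confusion.values())

accuracy = (true_positives + true_negatives) / total
baseline = 0.6392  # baseline rate reported earlier in this analysis

print(f"accuracy: {accuracy:.2%}")
print(f"gain over baseline: {(accuracy - baseline) * 100:.2f} percentage points")
```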

We can also use a built-in function that creates a confusion matrix and also calculates other useful metrics:

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction   -1    1
##         -1 1314  743
##         1  1072 3635
##                                           
##                Accuracy : 0.7317          
##                  95% CI : (0.7209, 0.7422)
##     No Information Rate : 0.6473          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3933          
##                                           
##  Mcnemar's Test P-Value : 1.371e-14       
##                                           
##             Sensitivity : 0.8303          
##             Specificity : 0.5507          
##          Pos Pred Value : 0.7723          
##          Neg Pred Value : 0.6388          
##              Prevalence : 0.6473          
##          Detection Rate : 0.5374          
##    Detection Prevalence : 0.6959          
##       Balanced Accuracy : 0.6905          
##                                           
##        'Positive' Class : 1               
##

Sensitivity, or the true positive rate, is the share of actual commercials that the model correctly labels as commercials. The 3-nearest neighbor model produces a sensitivity of 83.03%. We want to maximize this value (as opposed to the true negative rate, which is the specificity), because we want to apply this model to our own commercials and check whether they are detected correctly.
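
For reference, both rates follow directly from the k=3 counts above. This short Python check uses illustrative variable names.

```python
# k=3 counts from the confusion matrix, with "1" (commercial) as the positive class.
tp, fn = 3635, 743   # actual commercials: predicted 1 / predicted -1
tn, fp = 1314, 1072  # actual non-commercials: predicted -1 / predicted 1

sensitivity = tp / (tp + fn)  # share of actual commercials that were detected
specificity = tn / (tn + fp)  # share of actual non-commercials that were detected

print(f"sensitivity: {sensitivity:.4f}")
print(f"specificity: {specificity:.4f}")
```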

By creating a function that calculates the accuracy of the model for any value of k, we can search for the optimal number of neighbors, “k.” This tunes the model to give the most accurate classifications.
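
A search of this kind might look like the following Python sketch. The two overlapping clusters are hypothetical toy data, and `knn_predict` and `knn_accuracy` are illustrative helpers, so the best k found here says nothing about the company data.

```python
import random
from collections import Counter
from math import dist

def knn_predict(train_X, train_y, point, k):
    """Label `point` by majority vote among its k nearest training points."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: dist(pair[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def knn_accuracy(train_X, train_y, test_X, test_y, k):
    """Share of test points whose predicted label matches the true label."""
    hits = sum(knn_predict(train_X, train_y, p, k) == y
               for p, y in zip(test_X, test_y))
    return hits / len(test_y)

# Hypothetical toy data: two overlapping Gaussian clusters in two dimensions.
rng = random.Random(0)

def sample(center, label, n):
    return [((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label)
            for _ in range(n)]

data = sample(0.0, -1, 100) + sample(1.5, 1, 100)
rng.shuffle(data)
train, test = data[:140], data[140:]  # 70/30 split
train_X, train_y = zip(*train)
test_X, test_y = zip(*test)

# Odd values of k avoid tied votes between the two classes.
accuracy_by_k = {k: knn_accuracy(train_X, train_y, test_X, test_y, k)
                 for k in range(1, 22, 2)}
best_k = max(accuracy_by_k, key=accuracy_by_k.get)

for k, acc in accuracy_by_k.items():
    print(f"k={k:2d}  accuracy={acc:.3f}")
print("best k:", best_k)
```

Selecting k on the test set, as done here for brevity, slightly flatters the final accuracy; a separate validation set or cross-validation would give a cleaner estimate.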

We can also plot this table to visually pick which k is the optimal choice.

Although the gains in accuracy start to level off at k=5, the curve peaks at k=11, which gives the highest accuracy; therefore, the optimal choice for our model is k=11.

After running the model with k=11, a new confusion matrix can be produced.

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction   -1    1
##         -1 1283  497
##         1  1103 3881
##                                           
##                Accuracy : 0.7635          
##                  95% CI : (0.7531, 0.7735)
##     No Information Rate : 0.6473          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4502          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8865          
##             Specificity : 0.5377          
##          Pos Pred Value : 0.7787          
##          Neg Pred Value : 0.7208          
##              Prevalence : 0.6473          
##          Detection Rate : 0.5738          
##    Detection Prevalence : 0.7368          
##       Balanced Accuracy : 0.7121          
##                                           
##        'Positive' Class : 1               
##

The new accuracy is 76.35%, an increase of about 3 percentage points over the original k=3 model. A more important metric, the true positive rate or sensitivity, is 88.65%, which is excellent. The positive and negative predictive values are also fairly similar, differing by about 6 percentage points (77.87% vs. 72.08%), meaning the model’s positive and negative predictions are about equally reliable. Finally, another important metric is the balanced accuracy, which is the average of the sensitivity and specificity: for k=11, the balanced accuracy is 71.21%, which is a solid (though not excellent) value for our model.
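
The predictive values and the balanced accuracy can be reproduced from the k=11 counts with a short Python check. The variable names are illustrative.

```python
# k=11 counts from the confusion matrix, with "1" (commercial) as the positive class.
tp, fn = 3881, 497   # actual commercials: predicted 1 / predicted -1
tn, fp = 1283, 1103  # actual non-commercials: predicted -1 / predicted 1

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)  # positive predictive value: how often a "1" prediction is right
npv = tn / (tn + fn)  # negative predictive value: how often a "-1" prediction is right
balanced_accuracy = (sensitivity + specificity) / 2

print(f"PPV: {ppv:.4f}  NPV: {npv:.4f}  gap: {(ppv - npv) * 100:.2f} points")
print(f"balanced accuracy: {balanced_accuracy:.4f}")
```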

Summary

Mr. Rooney, after building your desired machine learning model, I was able to increase the rate of correctly telling a commercial from a non-commercial by around 12 percentage points. Using a “k-nearest neighbor” algorithm, and choosing the input data and parameters to give us optimal results, the model made the correct classification 76% of the time. The model does this by calculating something called a “Euclidean distance,” which is basically the straight-line distance between a point we wish to label and other known points. By optimizing the number of known points the model consults to classify an unknown point, the model reached a prediction accuracy of 76% with a sensitivity of 89%. Compared to the 64% rate of correct predictions with no given information (the baseline rate), my model is a meaningful improvement and could be applied in the field. Another advantage of this model is that it can be retrained when given new data and continuously updated to reflect changes or trends in modern TV commercials.