kNN: Commercial Prediction

Introduction

In this lab, I built a supervised machine learning model that classifies TV shots as commercial or non-commercial using the k-nearest neighbors (kNN) algorithm. To begin, I loaded the CNN TV commercial dataset, which describes the audio and visual components of various TV and commercial clips. The model's predictions will help the company, Marketing Enterprises of Halifax (“MEH”), and its boss, Mr. Rooney, optimize their advertising campaign. The kNN model can be updated and retrained as new data and trends in TV commercials emerge.

kNN Analysis

The base rate for this dataset is 63.92%, which is the share of shots that are commercials. This means that if I simply labeled every shot a commercial, I would be correct about 64% of the time. This is decent, but kNN should be able to improve on it.
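As a rough illustration of what the base rate means, the sketch below computes the accuracy of always guessing the majority class. It uses the test-set class counts that can be read off the confusion matrices later in this report (1626 non-commercial and 2883 commercial shots), so it gives the test-set share of 63.94% rather than the full-dataset figure of 63.92%. The variable names are hypothetical.

```r
# Test-set class counts, taken from the column totals of the
# confusion matrices in this report (-1 = non-commercial, 1 = commercial).
class_counts <- c("-1" = 1626, "1" = 2883)

# Share of each class in the test set.
class_share <- class_counts / sum(class_counts)

# Always predicting the majority class is right this often.
majority_class <- names(which.max(class_share))
majority_accuracy <- max(class_share)

print(majority_class)
print(round(majority_accuracy, 4))
```

Any model worth keeping has to beat this number, which is why the confusion matrix output reports it as the No Information Rate.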

Data Cleaning

The dataset contains columns for the mean and variance of each variable. I removed all of the variance columns and conducted the analysis solely on the mean values.

In addition to removing the variance columns, I checked the correlations between the remaining columns and removed highly correlated ones (correlation magnitude greater than 0.7).

Based on the correlation table above, I removed four columns that were highly correlated with one or more other variables. The columns I dropped were:

motion_distr_mn
shot_time_energy_mn
spectral_flux_mn
spectral_roll_off_mn
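The correlation check can be sketched in base R as follows. This is only an illustration under stated assumptions: the data frame toy and its columns a to d are synthetic stand-ins, and the rule for choosing which member of a correlated pair to drop (the later column) is my own simplification. In the lab the four columns were chosen by reading the correlation table.

```r
set.seed(1)
n <- 200
a <- rnorm(n)

# Synthetic stand-in data: column b is almost a copy of column a.
toy <- data.frame(
  a = a,
  b = a + rnorm(n, sd = 0.1),
  c = rnorm(n),
  d = rnorm(n)
)

# Pairwise correlations; zero out the upper triangle and diagonal
# so that each pair of columns is considered only once.
cor_mat <- cor(toy)
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- 0

# Locate pairs whose correlation magnitude exceeds 0.7 and
# drop the later column of each such pair.
high <- which(abs(cor_mat) > 0.7, arr.ind = TRUE)
drop_cols <- unique(rownames(cor_mat)[high[, "row"]])
reduced <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]

print(drop_cols)
print(names(reduced))
```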

Training and Testing Sets

After removing these columns, I split the data into training and testing sets. I used an 80:20 split, so 80% of the data is used for training and the remaining 20% is held out for testing the model.
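A minimal sketch of an 80:20 split is shown below. It assumes a simple random split drawn with sample(); the report does not state how the split was drawn, and the data frame toy is a synthetic placeholder.

```r
set.seed(42)

# Synthetic placeholder data: one feature and a -1/1 label.
toy <- data.frame(
  x = rnorm(100),
  label = factor(sample(c(-1, 1), 100, replace = TRUE))
)

# Draw 80% of the row indices at random for training.
n <- nrow(toy)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))

# The remaining 20% of rows form the test set.
train <- toy[train_idx, ]
test <- toy[-train_idx, ]

print(c(train = nrow(train), test = nrow(test)))
```

Fixing the seed makes the split reproducible, which matters because every accuracy figure reported afterwards depends on which rows land in the test set.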

Model with k = 3

I ran the model with 3 nearest neighbors to generate a baseline output. Based on the confusion matrix, the model did fairly well, and much better than the base rate. Sensitivity, or the true positive rate, is the proportion of actual positives (commercials) that are correctly identified. The sensitivity of this model is 0.8356, so out of 10 commercials, it correctly flags about 8. Accuracy is the proportion of all shots, commercial and non-commercial, that are classified correctly. The accuracy of this model is 0.7507, which means it correctly identifies a TV shot as commercial or non-commercial about 75% of the time. Accuracy is lower than sensitivity because specificity is only 0.6002: the model is better at recognizing commercials than non-commercials, which may reflect the slightly imbalanced dataset.

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction   -1    1
##         -1  976  474
##         1   650 2409
##                                           
##                Accuracy : 0.7507          
##                  95% CI : (0.7378, 0.7633)
##     No Information Rate : 0.6394          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4464          
##                                           
##  Mcnemar's Test P-Value : 1.791e-07       
##                                           
##             Sensitivity : 0.8356          
##             Specificity : 0.6002          
##          Pos Pred Value : 0.7875          
##          Neg Pred Value : 0.6731          
##              Prevalence : 0.6394          
##          Detection Rate : 0.5343          
##    Detection Prevalence : 0.6784          
##       Balanced Accuracy : 0.7179          
##                                           
##        'Positive' Class : 1               
##
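To make the definitions concrete, the headline statistics can be recomputed by hand from the four cells of the k = 3 confusion matrix above. The sketch below uses base R only; the variable names are hypothetical.

```r
# k = 3 confusion matrix: rows are predictions, columns are actual classes.
cm <- matrix(
  c(976, 650, 474, 2409),
  nrow = 2,
  dimnames = list(Prediction = c("-1", "1"), Actual = c("-1", "1"))
)

# The positive class is 1 (commercial).
tp <- cm["1", "1"]    # commercials predicted as commercials
tn <- cm["-1", "-1"]  # non-commercials predicted as non-commercials
fp <- cm["1", "-1"]   # non-commercials predicted as commercials
fn <- cm["-1", "1"]   # commercials predicted as non-commercials

accuracy <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
balanced_accuracy <- (sensitivity + specificity) / 2

print(round(accuracy, 4))           # 0.7507
print(round(sensitivity, 4))        # 0.8356
print(round(specificity, 4))        # 0.6002
print(round(balanced_accuracy, 4))  # 0.7179
```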

Finding the optimal k

The next step in the analysis is to determine the optimal k value. There is no single pre-defined method for choosing k, so I defined a function that computes the accuracy for a given k and used the sapply function to run it over a range of k values. The table and plot below show the accuracy for each k value when applied to the train and test datasets. The accuracy increases sharply from k = 1 to k = 11 but then begins to decrease. I selected k = 11, the value that corresponds to the greatest accuracy score of 0.7955.
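The search described above can be sketched as follows. This is a minimal illustration rather than the code from the lab: it runs on synthetic two-feature data, it assumes the knn function from the class package (the report does not say which kNN implementation was used), and names such as knn_accuracy and k_values are hypothetical. For brevity it scores only the test split.

```r
library(class)  # provides knn(); assumed implementation

set.seed(123)
n <- 300

# Synthetic data: the label depends on x1 + x2 plus some noise.
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$label <- factor(ifelse(toy$x1 + toy$x2 + rnorm(n, sd = 0.5) > 0, 1, -1))

# 80:20 train/test split.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train_x <- toy[train_idx, c("x1", "x2")]
test_x <- toy[-train_idx, c("x1", "x2")]
train_y <- toy$label[train_idx]
test_y <- toy$label[-train_idx]

# Test-set accuracy of a kNN classifier for a given k.
knn_accuracy <- function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  mean(as.character(pred) == as.character(test_y))
}

# Evaluate odd k values (odd k avoids tied votes with two classes).
k_values <- seq(1, 21, by = 2)
accuracies <- sapply(k_values, knn_accuracy)

results <- data.frame(k = k_values, accuracy = accuracies)
best_k <- results$k[which.max(results$accuracy)]

print(results)
print(best_k)
```

Note that which.max returns the first maximum, so when several k values tie, the smallest one is selected.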

Model with k = 11

I reran the model with k = 11 and created a new confusion matrix. As shown below, both sensitivity and accuracy improved over the model with k = 3: sensitivity rose by about 5 percentage points (from 0.8356 to 0.8869) and accuracy by about 2.5 percentage points (from 0.7507 to 0.7762). With a sensitivity of 0.8869, the model is close to correctly identifying a commercial 9 out of 10 times. The higher accuracy means it classifies non-commercial and commercial clips correctly a bit more often overall, although specificity slipped from 0.6002 to 0.5800.

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction   -1    1
##         -1  943  326
##         1   683 2557
##                                           
##                Accuracy : 0.7762          
##                  95% CI : (0.7638, 0.7883)
##     No Information Rate : 0.6394          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4903          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8869          
##             Specificity : 0.5800          
##          Pos Pred Value : 0.7892          
##          Neg Pred Value : 0.7431          
##              Prevalence : 0.6394          
##          Detection Rate : 0.5671          
##    Detection Prevalence : 0.7186          
##       Balanced Accuracy : 0.7334          
##                                           
##        'Positive' Class : 1               
##
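The change between the two models can be checked directly from the two confusion matrices in this report. The sketch below reports the differences in percentage points; the helper metrics is a hypothetical name introduced for illustration.

```r
# Accuracy, sensitivity and specificity from a 2x2 confusion matrix
# whose rows are predictions and columns are actual classes,
# with 1 (commercial) as the positive class.
metrics <- function(cm) {
  c(
    accuracy = (cm["1", "1"] + cm["-1", "-1"]) / sum(cm),
    sensitivity = cm["1", "1"] / sum(cm[, "1"]),
    specificity = cm["-1", "-1"] / sum(cm[, "-1"])
  )
}

labels <- list(Prediction = c("-1", "1"), Actual = c("-1", "1"))
cm_k3 <- matrix(c(976, 650, 474, 2409), nrow = 2, dimnames = labels)
cm_k11 <- matrix(c(943, 683, 326, 2557), nrow = 2, dimnames = labels)

# Change from k = 3 to k = 11, in percentage points.
change_pts <- round(100 * (metrics(cm_k11) - metrics(cm_k3)), 2)
print(change_pts)
```

The output shows accuracy up by 2.55 points and sensitivity up by 5.13 points, while specificity falls by 2.03 points, so the larger k trades a few more false positives for fewer missed commercials.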

Conclusion

By using the kNN algorithm, I increased the accuracy of predicting non-commercials and commercials by about 14 percentage points, from a 63.92% base rate to 77.62% with k = 11. I tested a range of k values to find the one that gave the highest accuracy. This model could be applied to the problem at Marketing Enterprises of Halifax (“MEH”) and will help them generate better commercials that seem more like actual TV shows. By using my model, the boss, Mr. Rooney, will correctly identify more commercials than he would just by guessing. Overall, the kNN model improved on the base rate, tuning k improved both sensitivity and accuracy, and kNN proved to be a useful machine learning model for this problem.