Methodology: Use k-nearest neighbors (kNN) to classify whether a piece of media is a commercial, so that future commercials can be made more like TV shows to increase marketability.
64% are commercials
kNN does not work well with highly correlated variables
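One common explanation is that correlated features count nearly the same information more than once in the distance calculation. A minimal sketch of one way to handle this, assuming we simply drop any feature whose absolute Pearson correlation with an already-kept feature exceeds a threshold. The feature names, the toy values, and the 0.9 threshold are all hypothetical and are not taken from our data set.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: correlation is undefined, treat as 0
    return cov / math.sqrt(var_x * var_y)

def drop_correlated(columns, threshold=0.9):
    """Greedily keep a column only if |r| with every kept column is <= threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy features, NOT the real data set.
columns = {
    "shot_length": [1.0, 2.0, 3.0, 4.0, 5.0],
    "shot_length_x2": [2.0, 4.0, 6.0, 8.0, 10.0],  # exact multiple of shot_length
    "audio_energy": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept = drop_correlated(columns)
print(kept)
```

Here the duplicated feature is dropped and the other two are kept. Which feature of a correlated pair survives depends on column order, so this is only a starting point.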
Test subset is 80% of the data while train is the remaining 20%
Run kNN on the train and test sets with k = 3 neighbors
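For reference, the classification step can be sketched in plain Python as below. This is an illustrative implementation that assumes Euclidean distance and a simple majority vote, not necessarily the library routine used for the results here. The function name `knn_predict` and the toy points are made up for the example.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training points closest to `query` (Euclidean)."""
    by_distance = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: 1 = commercial, -1 = not a commercial.
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = [-1, -1, -1, 1, 1, 1]
test_X = [(0.05, 0.1), (1.0, 0.95)]

predictions = [knn_predict(train_X, train_y, x, k=3) for x in test_X]
print(predictions)
```

With two classes, an odd k such as 3 avoids tied votes.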
Probability of False Positive (tv_CNN = 1 but label = -1): 17%
Probability of False Negative (tv_CNN = -1 but label = 1): 10%
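These probabilities are shares of all test points, so accuracy, P(False Positive), and P(False Negative) add up to 100%. A small sketch of that calculation, assuming (as the definitions above imply) that one column holds the prediction and the other the true class. The names `actual` and `predicted` and the ten toy values are hypothetical.

```python
def error_rates(actual, predicted, positive=1):
    """Accuracy and false positive / false negative shares over all points."""
    n = len(actual)
    pairs = list(zip(actual, predicted))
    fp = sum(1 for a, p in pairs if p == positive and a != positive)
    fn = sum(1 for a, p in pairs if p != positive and a == positive)
    correct = sum(1 for a, p in pairs if a == p)
    return {
        "accuracy": correct / n,
        "p_false_positive": fp / n,
        "p_false_negative": fn / n,
    }

# Hypothetical toy labels: 1 = commercial, -1 = not a commercial.
actual = [1, 1, 1, 1, 1, 1, -1, -1, -1, -1]
predicted = [1, 1, 1, 1, 1, -1, 1, 1, -1, -1]

rates = error_rates(actual, predicted)
print(rates)
```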
As calculated above, the k = 3 model has 73% accuracy, with a 10% probability of a false negative and a 17% probability of a false positive. This is higher than the 64% we would score by always guessing "commercial", but there is room for improvement.
When choosing k, we want to balance accuracy against computational cost. In our results, accuracy increases with k, but so does computation time. In the elbow plot, accuracy rises sharply at low k values and then levels off, showing diminishing returns as k increases. Based on this plot, a good k value would be 9, because accuracy is already high there and is still improving noticeably. Once k is greater than 11, the increase in accuracy is negligible compared to the increase in computational cost.
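The same reading of the elbow plot can be made mechanical. A rough sketch, assuming we pick the smallest k after which the next step adds less than some minimum accuracy gain. The function `pick_elbow`, the 0.01 threshold, and the accuracy values are hypothetical and are not the numbers behind our plot.

```python
def pick_elbow(accuracy_by_k, min_gain=0.01):
    """Return the smallest k after which the next step adds less than min_gain."""
    ks = sorted(accuracy_by_k)
    for current, nxt in zip(ks, ks[1:]):
        if accuracy_by_k[nxt] - accuracy_by_k[current] < min_gain:
            return current
    return ks[-1]  # accuracy never flattened within the range tested

# Hypothetical accuracies for illustration only, NOT our measured results.
accuracy_by_k = {1: 0.60, 3: 0.70, 5: 0.74, 7: 0.76, 9: 0.765, 11: 0.767}

best_k = pick_elbow(accuracy_by_k)
print(best_k)
```

The threshold plays the role of "negligible compared to the increase in computational cost", so the chosen k moves with it.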
When k=9:
Accuracy = 75%
P(False Positive) = 19%
P(False Negative) = 6%
Given specific data, this algorithm correctly identifies whether a piece of media is a commercial about 7 times out of 10. We could use it to test whether our commercials will be interpreted as commercials or as TV shows, and to guide changes that give our commercials characteristics more similar to those of TV shows.
An important decision in this model is the number of neighbors to use. As seen in the elbow plot, accuracy grows roughly logarithmically with the number of neighbors. Thus, the benefit of each additional neighbor shrinks as k increases.
| | k=3 | k=9 |
|---|---|---|
| Accuracy | 73% | 75% |
| P(False Positive) | 17% | 19% |
| P(False Negative) | 10% | 6% |
Comparing the models with 3 and 9 neighbors, using 9 neighbors increases accuracy slightly (73% to 75%) and decreases the probability of a false negative (10% to 6%). However, it also increases the probability of a false positive (17% to 19%). In this scenario, a commercial is the positive class and a non-commercial is the negative class. We consider false positives worse than false negatives, so we would want to use the model with 3 neighbors. This decreases computational cost while also limiting the probability of incorrectly identifying a non-commercial as a commercial.
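How much worse a false positive has to be for this choice to hold can be checked with the numbers in the table. A short sketch, assuming a simple linear cost in which each error type has a fixed weight. The weights and the helper names are hypothetical and serve only to illustrate the trade-off.

```python
def expected_cost(p_fp, p_fn, cost_fp, cost_fn):
    """Expected cost per prediction under fixed per-error weights."""
    return cost_fp * p_fp + cost_fn * p_fn

# Error probabilities taken from the comparison table.
models = {
    3: {"p_fp": 0.17, "p_fn": 0.10},
    9: {"p_fp": 0.19, "p_fn": 0.06},
}

def cheaper_k(cost_fp, cost_fn):
    """Return the k whose model has the lower expected cost."""
    return min(
        models,
        key=lambda k: expected_cost(
            models[k]["p_fp"], models[k]["p_fn"], cost_fp, cost_fn
        ),
    )

print(cheaper_k(cost_fp=1, cost_fn=1))  # equal weights
print(cheaper_k(cost_fp=3, cost_fn=1))  # false positives weighted 3x
```

Under this cost model, k = 3 comes out ahead only when a false positive is weighted more than about twice as heavily as a false negative, since 0.17c + 0.10 < 0.19c + 0.06 requires c > 2.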
This algorithm does have limitations, however. It was trained only on data from one TV network, so it may not be representative of all media or, more importantly, of the networks that we run our commercials on. Further training could be done on data from TV channels that we frequently air commercials on. Also, even with 9 neighbors the model is only 75% accurate, which is not much better than always guessing "commercial" for every item in our data set (63%).
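What counts as a "guess" matters for this comparison. A brief sketch of two naive baselines, assuming the 63% figure is the share of commercials in the data: always predicting the majority class, and guessing each class at random in proportion to how often it occurs. The function names are made up for the example.

```python
def majority_baseline(p_positive):
    """Accuracy from always predicting the more common class."""
    return max(p_positive, 1 - p_positive)

def proportional_guess_baseline(p_positive):
    """Expected accuracy from guessing each class with its own frequency."""
    return p_positive ** 2 + (1 - p_positive) ** 2

p_commercial = 0.63  # share of commercials quoted above

print(round(majority_baseline(p_commercial), 4))
print(round(proportional_guess_baseline(p_commercial), 4))
```

Always predicting "commercial" scores 63%, while guessing in proportion to the class frequencies scores about 53%, so the majority-class baseline is the stricter one to beat.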