KNN Lab

Overview:

Methodology: Use k-nearest neighbors (kNN) to classify whether a piece of media is a commercial or not, so that future commercials can be made more like TV shows to increase marketability.

Steps:

1. Read and Clean Data from CNN commercial dataset

2. Determine split between commercials and non-commercials in data set

64% are commercials
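The class split can be checked with a simple count. The following is a minimal sketch, assuming the labels are already loaded as a list where 1 marks a commercial and -1 a non-commercial. The toy `labels` list is made up for illustration and is not the CNN data.

```python
from collections import Counter

def class_shares(labels):
    """Return the fraction of rows belonging to each class label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Hypothetical toy labels: 1 = commercial, -1 = non-commercial
labels = [1, 1, -1, 1, -1, 1, 1, -1, 1, 1]
shares = class_shares(labels)
print(shares)
```

The share of the larger class is also a useful baseline: a model that always guesses that class would reach that accuracy without learning anything.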

3. Remove redundancies in data

4. Calculate correlation of variables

kNN does not work well with highly correlated variables

5. Remove highly correlated variables

According to the above chart, motion_distr_mn is highly correlated with two other variables, so we remove it from the data set.
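One way to find a variable like this programmatically is to list every pair of columns whose correlation exceeds a cutoff and then drop the column that appears in the most pairs. This is only a sketch under assumptions: the 0.7 threshold, the helper names, the column names `feature_a` and `feature_b`, and all values are hypothetical. Only `motion_distr_mn` comes from the report.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlated_pairs(columns, threshold=0.7):
    """Column pairs whose absolute correlation meets the (assumed) threshold."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) >= threshold
    ]

def most_entangled(pairs):
    """Column that appears in the largest number of correlated pairs."""
    counts = Counter(name for pair in pairs for name in pair)
    return counts.most_common(1)[0][0]

# Toy columns (made-up values): motion_distr_mn tracks both other features,
# while feature_a and feature_b are uncorrelated with each other.
columns = {
    "motion_distr_mn": [8.0, 6.0, 6.0, 4.0],
    "feature_a": [3.0, 1.0, 3.0, 1.0],
    "feature_b": [5.0, 5.0, 3.0, 3.0],
}

pairs = correlated_pairs(columns)
to_drop = most_entangled(pairs)
reduced = {name: vals for name, vals in columns.items() if name != to_drop}
print(pairs, "-> drop", to_drop)
```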

6. Create train and test subsets of the data for the model

The train subset is 80% of the data, while the test subset is the remaining 20%.
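A random split like this can be sketched as follows. The 0.8 training fraction is an assumption (the conventional choice of giving the larger share to training), and `train_fraction` can be set to any other ratio. The function name, seed, and toy `rows` are illustrative.

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))  # stand-in for 100 data rows
train, test = train_test_split(rows)
print(len(train), len(test))
```

Shuffling before cutting matters when the rows are ordered (for example by broadcast time), since an unshuffled cut would give the two subsets different class mixes.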

7. Train the classifier

Run kNN on the train and test sets with k = 3 neighbors
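For intuition, the core of kNN fits in a few lines: find the k training points closest to a query point and take a majority vote over their labels. This is a from-scratch sketch using Euclidean distance, not the library call used in the lab. The function name and the toy points are assumptions.

```python
from collections import Counter
from math import dist

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest neighbors."""
    # Sort training points by Euclidean distance to the query and keep the closest k
    nearest = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D points: 1 = commercial, -1 = non-commercial (made-up values)
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_y = [1, 1, 1, -1, -1, -1]

print(knn_predict(train_X, train_y, (1.1, 1.0)))  # 1
print(knn_predict(train_X, train_y, (4.0, 4.0)))  # -1
```

Because the prediction depends directly on distances, features measured on very different scales can dominate the vote.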

8. Create a confusion matrix to understand the probability of a false positive or false negative classification

Probability of False Positive (tv_CNN = 1 but label = -1): 17%

Probability of False Negative (tv_CNN = -1 but label = 1): 10%
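The two percentages above appear to be shares of all test points, since together with the 73% accuracy reported in the next step they add up to 100%. A minimal sketch of that calculation, assuming 1 is the positive class (commercial) and that a false positive means "predicted commercial, actually non-commercial". The function name and the toy labels are hypothetical.

```python
def confusion_rates(actual, predicted, positive=1):
    """Accuracy, false-positive and false-negative counts as shares of all points."""
    pairs = list(zip(actual, predicted))
    n = len(pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)  # wrongly flagged
    fn = sum(a == positive and p != positive for a, p in pairs)  # missed positives
    correct = sum(a == p for a, p in pairs)
    return {
        "accuracy": correct / n,
        "false_positive": fp / n,
        "false_negative": fn / n,
    }

# Toy labels (made up): 6 commercials and 4 non-commercials
actual    = [1, 1, 1, 1, 1, 1, -1, -1, -1, -1]
predicted = [1, 1, 1, 1, 1, -1, 1, 1, -1, -1]

rates = confusion_rates(actual, predicted)
print(rates)
```

With this definition, accuracy, the false-positive share, and the false-negative share always sum to 1.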

9. Get more information on model output

As calculated above, the model has 73% accuracy, with a 10% probability of a false negative and a 17% probability of a false positive. This is better than picking a data point at random, but there is room for improvement.

10-12. Determine optimal number of neighbors to improve performance

When choosing k, you want to balance accuracy against computational cost. As k increases, accuracy increases, but so does computation time. In the elbow plot, accuracy rises sharply at low k values and then levels off, showing diminishing returns as k increases. Based on this graph, a good k value is 9 because accuracy is already high there and is still improving noticeably. Once k is greater than 11, the increase in accuracy is negligible compared to the increase in computational cost.
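The data behind an elbow plot is just the test accuracy for each candidate k. The sketch below shows one way to produce it and to pick a k automatically. Everything here is an assumption for illustration: the synthetic two-cluster data, the helper names, and the selection rule (smallest k within 0.01 of the best accuracy), which is not necessarily how the k in this lab was chosen.

```python
import random
from collections import Counter
from math import dist

def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(train, key=lambda row: dist(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, test, k):
    """Share of test points whose predicted label matches the true label."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

def make_toy_data(n, rng):
    """Two overlapping 2-D Gaussian clusters labelled 1 and -1 (synthetic)."""
    data = []
    for _ in range(n):
        label = rng.choice([1, -1])
        center = 1.0 if label == 1 else -1.0
        point = (rng.gauss(center, 1.5), rng.gauss(center, 1.5))
        data.append((point, label))
    return data

def smallest_good_k(scores, tolerance=0.01):
    """Smallest k whose accuracy is within `tolerance` of the best score."""
    best = max(scores.values())
    return min(k for k, acc in scores.items() if acc >= best - tolerance)

rng = random.Random(0)
train, test = make_toy_data(200, rng), make_toy_data(100, rng)

# Odd k values avoid tied votes in a two-class problem
scores = {k: accuracy(train, test, k) for k in range(1, 16, 2)}
for k, acc in scores.items():
    print(f"k={k:2d}  accuracy={acc:.2f}")
print("chosen k:", smallest_good_k(scores))
```

Preferring the smallest k that is close to the best score encodes the trade-off described above: extra neighbors are only worth their computational cost if they buy a meaningful gain in accuracy.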

13. Rerun the model with k=9

14. Get more information on model output

When k=9:

Accuracy = 75%

P(False Positive) = 19%

P(False Negative) = 6%

Conclusions:

Summarize Findings:

Given the features in this data set, the algorithm correctly identifies whether a piece of media is a commercial about 7 times out of 10. It could be used to test whether our commercials would be interpreted as commercials or as TV shows, and to guide changes that give our commercials characteristics similar to TV shows.

Choosing k:

An important decision in this model is the number of neighbors to use. As seen in the elbow chart, accuracy rises quickly at first and then flattens as the number of neighbors grows, roughly like a logarithmic curve. Thus, the benefit of increasing k diminishes as k increases.

	k=3	k=9
Accuracy	73%	75%
P(False Positive)	17%	19%
P(False Negative)	10%	6%

Comparing the models with 3 and 9 neighbors, using 9 neighbors increases accuracy slightly and decreases the probability of a false negative, but it also increases the probability of a false positive. In this scenario, a commercial is the positive class and a non-commercial is the negative class. We consider false positives worse than false negatives, so we would use the model with 3 neighbors. This lowers computational cost while also limiting the probability of incorrectly identifying a non-commercial as a commercial.
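How much worse a false positive must be for k=3 to win can be worked out from the table. The sketch below weights each error type by a cost and picks the model with the lower total. The error rates come from the table above; the cost weights and function names are hypothetical and would need to be set from the actual business impact.

```python
# Error rates in percent of the test set, taken from the comparison table
models = {
    3: {"fp": 17, "fn": 10},
    9: {"fp": 19, "fn": 6},
}

def expected_cost(rates, fp_cost, fn_cost=1):
    """Weighted error: each error type counted at its (assumed) cost."""
    return fp_cost * rates["fp"] + fn_cost * rates["fn"]

def preferred_k(models, fp_cost, fn_cost=1):
    """k of the model with the lowest weighted error."""
    return min(models, key=lambda k: expected_cost(models[k], fp_cost, fn_cost))

# Cost ratio at which both models are equally good:
# extra false negatives avoided by k=9, divided by extra false positives it adds
break_even = (models[3]["fn"] - models[9]["fn"]) / (models[9]["fp"] - models[3]["fp"])

print(preferred_k(models, fp_cost=1))  # 9: equal costs favor the more accurate model
print(preferred_k(models, fp_cost=3))  # 3: costly false positives favor k=3
print(break_even)                      # 2.0
```

Under these numbers, k=3 is the better choice only if a false positive is judged more than twice as costly as a false negative; below that ratio, k=9 has the lower weighted error.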

Limitations:

This algorithm does have limitations, however. It was trained only on data from one TV network, so it may not be representative of all media or, more importantly, of the networks that we run our commercials on. Further training could be done on data from the TV channels where we frequently air commercials. Also, the model is only 75% accurate, which is not much better than taking a random guess from our data set (63%).