You left your job as a lobbyist because the political environment was becoming just too toxic to handle. Luckily, you landed a job in advertising! Unfortunately, you have a demanding and totally clueless boss. Clueless meaning that he doesn’t understand data science, but he knows he wants it used to fix all the company’s problems, and you are just the data scientist to do it!

Your company, Marketing Enterprises of Halifax, or “MEH,” is being beaten out by the competition and wants a new way to determine the quality of its commercials. Your boss, Mr. Ed Rooney, would like the company’s commercials to seem more like actual TV shows. So he wants you to develop a “machine learning thing” that uses the company’s internal data to classify when something is a commercial and when it is not. Mr. Rooney believes the company will then know how to trick potential future customers into thinking its commercials are actually still part of the show; as a result, they will pay more attention and thus buy more of the terrible products “MEH” is supporting (it’s a terrible plan, but you have to make a living).

Given that MEH produces commercials more or less continuously, you know the model will need to be updated quite frequently. Also, being a newish data scientist with a clueless boss, you decide to use an accessible approach that you might be able to explain to Mr. Rooney (given several months of dedicated one-on-one time): k-nearest neighbors.

You’ll also need to document your work extensively, because Mr. Rooney doesn’t know he’s clueless, so he will ask lots of “insightful” questions and require lots of detail that he won’t understand; you’ll need an easy-to-use reference document. Before you get started, you hearken back to the excellent education you received at UVA and use this knowledge to outline roughly 20 steps needed to build this algorithm for MEH and Ed; they are documented below. Good luck. As always, the most important part is translating your work into actionable insights, so please be verbose in the explanation required for step 20.

As with the clustering lab, please be prepared to present a five minute overview of your findings.

1: Load in the data, both the commercial dataset and the labels
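The loading code isn’t echoed in this knit, so here is a minimal sketch of what it might look like. The file names, and the object names commercial_data and col_names, are assumptions for illustration:

library(tidyverse)

# Hypothetical file names: the single-column file holds the variable names,
# the larger file holds the features plus a label column
col_names <- read_csv("tv_commercial_labels.csv")
commercial_data <- read_csv("tv_commercial_datasets.csv")

# Class balance: 1 = commercial, -1 = non-commercial
table(commercial_data$label)

# Base rate: the share of rows that are commercials
table(commercial_data$label)[2] / sum(table(commercial_data$label))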

## Parsed with column specification:
## cols(
##   X1 = col_character()
## )
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   label = col_character()
## )
## See spec(...) for full column specifications.
## 
##    -1     1 
##  8134 14411
##         1 
## 0.6392105

There are 14,411 commercials and 8,134 non-commercials. If we simply guessed “commercial” every time, we would be correct 63.9% of the time; this is the base rate our model needs to beat.

2: Drop columns that contain different metrics for the same variable
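Most metrics in the raw data appear in two versions, a mean (_mn) column and a variance (_var) column, so one defensible choice is to keep only the means. A hedged dplyr sketch, assuming that naming pattern:

# Keep shot_length, the mean version of each metric, and the label;
# this drops the variance duplicates of the same variables
kdata <- commercial_data %>%
  select(shot_length, ends_with("_mn"), label) %>%
  mutate(label = as.numeric(label))  # label was parsed as character above

head(kdata)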

## # A tibble: 6 x 11
##   shot_length motion_distr_mn frame_diff_dist… short_time_ener… zcr_mn
##         <dbl>           <dbl>            <dbl>            <dbl>  <dbl>
## 1          29            3.82            13.5            0.0199 0.0672
## 2          25            3.05            22.3            0.0230 0.077 
## 3          82            1.60             5.86           0.0259 0.0823
## 4          25            4.82            41.4            0.0144 0.0699
## 5          29            2.77            13.3            0.0115 0.101 
## 6          25            3.68            27.2            0.0130 0.0681
## # … with 6 more variables: spectral_centroid_mn <dbl>,
## #   spectral_roll_off_mn <dbl>, spectral_flux_mn <dbl>,
## #   fundamental_freq_mn <dbl>, motion_dist_mn <dbl>, label <dbl>

3: Check to make sure that our variables are not highly correlated
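The matrix below comes from base R’s cor(); a one-line sketch using the kdata object from step 2:

# Pairwise Pearson correlations among the remaining variables
cor(kdata)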

##                      shot_length motion_distr_mn frame_diff_dist_mn
## shot_length           1.00000000    -0.147625726        -0.14921262
## motion_distr_mn      -0.14762573     1.000000000         0.71570313
## frame_diff_dist_mn   -0.14921262     0.715703132         1.00000000
## short_time_energy_mn  0.02648501    -0.007160132        -0.02396823
## zcr_mn                0.19090475    -0.052205970        -0.04207414
## spectral_centroid_mn  0.36441928    -0.179064774        -0.29891277
##                      short_time_energy_mn      zcr_mn spectral_centroid_mn
## shot_length                   0.026485006  0.19090475            0.3644193
## motion_distr_mn              -0.007160132 -0.05220597           -0.1790648
## frame_diff_dist_mn           -0.023968229 -0.04207414           -0.2989128
## short_time_energy_mn          1.000000000 -0.12505793            0.3087839
## zcr_mn                       -0.125057928  1.00000000            0.3089015
## spectral_centroid_mn          0.308783942  0.30890154            1.0000000
##                      spectral_roll_off_mn spectral_flux_mn fundamental_freq_mn
## shot_length                    0.38018472      0.102311184          0.29268568
## motion_distr_mn               -0.22152474     -0.019123107         -0.09698064
## frame_diff_dist_mn            -0.38483945      0.006196663         -0.09067221
## short_time_energy_mn           0.16031349      0.823463249          0.02251202
## zcr_mn                         0.03308083     -0.053369373          0.53355483
## spectral_centroid_mn           0.80926285      0.283419648          0.41901010
##                      motion_dist_mn       label
## shot_length              0.21214421 -0.27210692
## motion_distr_mn         -0.75764527  0.05393804
## frame_diff_dist_mn      -0.64517894 -0.04745652
## short_time_energy_mn     0.03141285  0.10881483
## zcr_mn                   0.06741515 -0.25376348
## spectral_centroid_mn     0.31396448 -0.27342595

4: Determine which variables to remove

High correlations start around 0.7 (or below -0.7). Especially consider removing variables that appear to be highly correlated with more than one other variable. List your rationale here:

Motion distribution (motion_distr_mn) is highly correlated with both frame difference distribution (0.716) and motion distance (-0.758). Short time energy is highly correlated with spectral flux (0.823), and spectral centroid is highly correlated with spectral roll-off (0.809).

5: Subset the dataframe based on above
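A sketch of the subsetting, keeping the three weakly correlated predictors plus the label:

# Drop everything flagged in step 4; keep the three survivors and the label
kdata <- kdata %>%
  select(shot_length, zcr_mn, fundamental_freq_mn, label)

head(kdata)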

## # A tibble: 6 x 4
##   shot_length zcr_mn fundamental_freq_mn label
##         <dbl>  <dbl>               <dbl> <dbl>
## 1          29 0.0672               103.      1
## 2          25 0.077                 84.7     1
## 3          82 0.0823                92.7     1
## 4          25 0.0699                96.6     1
## 5          29 0.101                 95.0     1
## 6          25 0.0681               110.      1

6: Create an index that will divide the data into a 70/30 split

set.seed(13)  # fix the random seed so the 70/30 split is reproducible

# Randomly sample 70% of the row indices for the training set;
# the remaining 30% of rows will become the test set
kdata_train_rows = sample(1:nrow(kdata),
                          round(0.7 * nrow(kdata), 0),
                          replace = FALSE)

7: Use the index above to generate train and test sets
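A sketch of how the index can be applied (kdata_train and kdata_test are assumed names, reused in the sketches below):

# Rows whose indices were sampled go to training; the rest go to test
kdata_train <- kdata[kdata_train_rows, ]
kdata_test  <- kdata[-kdata_train_rows, ]

nrow(kdata_train)
nrow(kdata_test)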

## [1] 15781
## [1] 6764

Train the classifier using k = 3
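The classifier itself most likely comes from the class package; a hedged sketch that would produce output shaped like the below:

library(class)

# Predict each test row's label from its 3 nearest training neighbors;
# prob = TRUE attaches the winning vote share to each prediction
kdata_3NN <- knn(train = kdata_train[, c("shot_length", "zcr_mn", "fundamental_freq_mn")],
                 test  = kdata_test[, c("shot_length", "zcr_mn", "fundamental_freq_mn")],
                 cl    = kdata_train$label,
                 k     = 3,
                 prob  = TRUE)

# Inspect the predictions, then cross-tabulate them against the truth
str(kdata_3NN)
table(kdata_3NN, kdata_test$label)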

##  Factor w/ 2 levels "-1","1": 2 2 2 2 2 2 2 2 2 2 ...
##  - attr(*, "prob")= num [1:6764] 1 0.667 0.667 0.667 1 ...
## [1] 6764
##          
## kdata_3NN   -1    1
##        -1 1266  831
##        1  1178 3489
## [1] 1266 3489

8: Calculate the accuracy rate
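A sketch of the calculation, reusing the objects above:

# Accuracy: the share of test predictions that match the true labels
kdata_acc <- mean(kdata_3NN == kdata_test$label)
kdata_acc

# Improvement over always guessing "commercial"
kdata_acc - table(commercial_data$label)[2] / sum(table(commercial_data$label))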

## [1] 0.7029864
##          1 
## 0.06377593

The accuracy rate was 70.30%, which is 6.38 percentage points better than the 63.92% base rate.

9: Run the confusion matrix function and comment on the model output
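A sketch of the caret call, treating 1 (commercial) as the positive class to match the output below:

library(caret)

# confusionMatrix wants factors with matching levels
confusionMatrix(as.factor(kdata_3NN),
                as.factor(kdata_test$label),
                positive = "1",
                dnn = c("Prediction", "Actual"))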

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction   -1    1
##         -1 1266  831
##         1  1178 3489
##                                           
##                Accuracy : 0.703           
##                  95% CI : (0.6919, 0.7139)
##     No Information Rate : 0.6387          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.336           
##                                           
##  Mcnemar's Test P-Value : 1.168e-14       
##                                           
##             Sensitivity : 0.8076          
##             Specificity : 0.5180          
##          Pos Pred Value : 0.7476          
##          Neg Pred Value : 0.6037          
##              Prevalence : 0.6387          
##          Detection Rate : 0.5158          
##    Detection Prevalence : 0.6900          
##       Balanced Accuracy : 0.6628          
##                                           
##        'Positive' Class : 1               
## 

The confusion matrix tells us that our model has an accuracy of 70.3%. The model was also much better at correctly classifying commercials than non-commercials: the sensitivity (the true positive rate) was 80.76%, while the specificity (the true negative rate) was only 51.80%. The average of these two rates is the balanced accuracy, reported as 66.28%.

10: Run the “chooseK” function to find the perfect K
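The chooseK helper isn’t shown in the knit; the body below is an assumption about how it works, refitting the classifier for each candidate k and returning the test accuracy:

chooseK <- function(k, train_set, val_set, train_class, val_class) {
  set.seed(13)
  pred <- class::knn(train = train_set, test = val_set,
                     cl = train_class, k = k)
  mean(pred == val_class)  # accuracy on the held-out set for this k
}

# Try every odd k from 1 to 21; sapply returns a 2-row matrix (k, accuracy)
knn_diff <- sapply(seq(1, 21, by = 2),
                   function(x) c(k = x,
                                 accuracy = chooseK(x,
                                                    train_set = kdata_train[, 1:3],
                                                    val_set = kdata_test[, 1:3],
                                                    train_class = kdata_train$label,
                                                    val_class = kdata_test$label)))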

11: Create a dataframe so we can visualize the difference in accuracy based on K
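A sketch of the conversion (knn_diff_df is an assumed name):

# Transpose the chooseK matrix into a data frame with one row per k
class(knn_diff)
knn_diff_df <- data.frame(t(knn_diff))
knn_diff_df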

## [1] "matrix" "array"
##     k  accuracy
## 1   1 0.6723832
## 2   3 0.7029864
## 3   5 0.7201360
## 4   7 0.7238321
## 5   9 0.7270846
## 6  11 0.7270846
## 7  13 0.7290065
## 8  15 0.7284151
## 9  17 0.7315198
## 10 19 0.7294500
## 11 21 0.7298936

12: Use ggplot to show the output and comment on the k to select
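A sketch of the plot:

library(ggplot2)

# Accuracy as a function of k, looking for the point of diminishing returns
ggplot(knn_diff_df, aes(x = k, y = accuracy)) +
  geom_line() +
  geom_point() +
  labs(title = "kNN test accuracy by choice of k",
       x = "k (number of neighbors)",
       y = "Accuracy")

Based on the table in step 11, accuracy jumps from 67.2% at k = 1 to about 72.0% at k = 5 and then gains less than a point all the way out to k = 21, so k = 5 is a reasonable “optimal” choice: it captures nearly all of the improvement while keeping the model as simple, and as fast to explain to Mr. Rooney, as possible.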

Rerun the model with “optimal” k
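A sketch of the refit with the k = 5 selected above:

# Same features and training labels as before, now with k = 5
kdata_5NN <- knn(train = kdata_train[, c("shot_length", "zcr_mn", "fundamental_freq_mn")],
                 test  = kdata_test[, c("shot_length", "zcr_mn", "fundamental_freq_mn")],
                 cl    = kdata_train$label,
                 k     = 5,
                 prob  = TRUE)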

13: Use the confusion matrix function to measure the quality of the new model
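Scored the same way as in step 9 (sketch):

confusionMatrix(as.factor(kdata_5NN),
                as.factor(kdata_test$label),
                positive = "1",
                dnn = c("Prediction", "Actual"))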

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction   -1    1
##         -1 1231  683
##         1  1213 3637
##                                           
##                Accuracy : 0.7197          
##                  95% CI : (0.7088, 0.7304)
##     No Information Rate : 0.6387          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3627          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8419          
##             Specificity : 0.5037          
##          Pos Pred Value : 0.7499          
##          Neg Pred Value : 0.6432          
##              Prevalence : 0.6387          
##          Detection Rate : 0.5377          
##    Detection Prevalence : 0.7170          
##       Balanced Accuracy : 0.6728          
##                                           
##        'Positive' Class : 1               
## 

20: Summarize the differences in language Mr. Rooney may actually understand. Include a discussion of which approach, k = 3 or k = “optimal”, is the better method moving forward for “MEH”. Most importantly, draft comments about the overall approach and model quality as they relate to addressing the problem proposed by Ed.

The first classifier, with k = 3, used each observation’s 3 nearest neighbors in the training data to decide whether it should be labeled a commercial or a non-commercial. The accuracy rate was 70.30%, which is 6.38 percentage points better than the 63.92% base rate. The model was also much better at correctly classifying commercials than non-commercials: the sensitivity (true positive rate) was 80.76%, while the specificity (true negative rate) was only 51.80%. The average of these rates, the balanced accuracy, was 66.28%.

As seen in steps 10 through 13, running the “chooseK” function showed accuracy leveling off around k = 5, so we reran the model with k = 5 and obtained an accuracy rate of 71.97%, higher than the k = 3 classifier’s 70.30%. The sensitivity and specificity rates were 84.19% and 50.37%, respectively, which tells us this model is also better at correctly classifying commercials than non-commercials. The rate of correctly classifying commercials for the k = 5 model (84.19%) was 3.43 percentage points higher than that of the k = 3 classifier; on the other hand, the rate of correctly classifying non-commercials (50.37%) was 1.43 percentage points lower. However, the balanced accuracy of the k = 5 model, 67.28%, is 1.0 percentage point higher than that of the k = 3 model.

Since the goal of this project is to use the company’s internal data to classify when something is a commercial and when it is not, the model that identifies commercials more accurately is the one MEH needs. Both the overall accuracy (71.97%) and the balanced accuracy (67.28%) were higher for the k = 5 model than for the k = 3 model, so we recommend the company move forward with k = 5. The company should keep in mind, however, that the model does a much better job of classifying commercials than non-commercials: it correctly identifies only about half of the non-commercial segments, so predictions of “not a commercial” deserve extra scrutiny.