You left your job as a lobbyist because the political environment had become just too toxic to handle. Luckily, you landed a job in advertising! Unfortunately, you have a demanding and totally clueless boss. Clueless meaning that he doesn’t understand data science, but he knows he wants it to be used to fix all the company’s problems, and you are just the data scientist to do it!
Your company, Marketing Enterprises of Halifax or “MEH”, is being beaten by the competition and wants a new way to determine the quality of its commercials. Your boss, Mr. Ed Rooney, would like the company’s commercials to seem more like actual TV shows. So he wants you to develop a “machine learning thing” using the company’s internal data to classify when something is a commercial and when it is not. Mr. Rooney believes the company will then know how to trick potential future customers into thinking its commercials are actually still part of the show, so that they will pay more attention and thus buy more of the terrible products “MEH” is supporting (it’s a terrible plan, but you have to make a living).
Given that MEH produces commercials more or less continuously, you know the model will need to be updated quite frequently. Being a newish data scientist with a clueless boss, you also decide to use an accessible approach that you might be able to explain to Mr. Rooney (given several months of dedicated one-on-one time). That approach is k-nearest neighbors.
You’ll also need to document your work extensively. Mr. Rooney doesn’t know he’s clueless, so he will ask lots of “insightful” questions and require lots of detail that he won’t understand, and you’ll need an easy-to-use reference document. Before you get started, you hearken back to the excellent education you received at UVA and, using this knowledge, outline roughly 20 steps that need to be completed to build this algo for MEH and Ed. They are documented below…good luck. As always, the most important part is translating your work into actionable insights, so please make sure to be verbose in the explanation required for step 20.
As with the clustering lab, please be prepared to present a five minute overview of your findings.
##
## -1 1
## 8134 14411
## 1
## 0.6392105
There are 14,411 commercials and 8,134 non-commercials. Labeling every segment as a commercial would therefore be correct 63.9% of the time, which is the base rate any classifier has to beat.
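The base rate follows directly from the class counts. As a minimal sketch, the counts can be re-entered by hand from the table above (the variable names here are illustrative, not taken from the original script):

```r
# Class counts copied from the table above: -1 = non-commercial, 1 = commercial
label_counts <- c("-1" = 8134, "1" = 14411)

# Base rate: the share of the majority class (commercials) in the full data set
base_rate <- label_counts[["1"]] / sum(label_counts)
round(base_rate, 4)
```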
## # A tibble: 6 x 11
## shot_length motion_distr_mn frame_diff_dist… short_time_ener… zcr_mn
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 29 3.82 13.5 0.0199 0.0672
## 2 25 3.05 22.3 0.0230 0.077
## 3 82 1.60 5.86 0.0259 0.0823
## 4 25 4.82 41.4 0.0144 0.0699
## 5 29 2.77 13.3 0.0115 0.101
## 6 25 3.68 27.2 0.0130 0.0681
## # … with 6 more variables: spectral_centroid_mn <dbl>,
## # spectral_roll_off_mn <dbl>, spectral_flux_mn <dbl>,
## # fundamental_freq_mn <dbl>, motion_dist_mn <dbl>, label <dbl>
## # A tibble: 6 x 4
## shot_length zcr_mn fundamental_freq_mn label
## <dbl> <dbl> <dbl> <dbl>
## 1 29 0.0672 103. 1
## 2 25 0.077 84.7 1
## 3 82 0.0823 92.7 1
## 4 25 0.0699 96.6 1
## 5 29 0.101 95.0 1
## 6 25 0.0681 110. 1
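set.seed(13)  # fix the random seed so the split is reproducible
# Sample 70% of the row indices, without replacement, for the training set
kdata_train_rows = sample(1:nrow(kdata),
                          round(0.7 * nrow(kdata), 0),
                          replace = FALSE)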
## [1] 15781
## [1] 6764
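The sampled row indices are what separate the training rows from the test rows. The sketch below shows one way this might look. It runs on synthetic stand-in data because `kdata` itself is not reproduced here, and `kdata_demo`, `demo_train`, and `demo_test` are illustrative names rather than objects from the original script.

```r
set.seed(13)

# Stand-in data with the same columns as the reduced table above
# (values are synthetic, for illustration only)
n <- 1000
kdata_demo <- data.frame(
  shot_length         = rpois(n, 30),
  zcr_mn              = runif(n, 0.05, 0.12),
  fundamental_freq_mn = rnorm(n, 95, 10),
  label               = sample(c(-1, 1), n, replace = TRUE)
)

# Draw 70% of the row indices without replacement for training
train_rows <- sample(1:nrow(kdata_demo),
                     round(0.7 * nrow(kdata_demo), 0),
                     replace = FALSE)

demo_train <- kdata_demo[train_rows, ]   # rows selected for training
demo_test  <- kdata_demo[-train_rows, ]  # all remaining rows, held out for testing

nrow(demo_train)
nrow(demo_test)
```

Negative indexing (`-train_rows`) keeps every row that was not sampled, so the two sets cannot overlap and together cover the whole data set.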
## Factor w/ 2 levels "-1","1": 2 2 2 2 2 2 2 2 2 2 ...
## - attr(*, "prob")= num [1:6764] 1 0.667 0.667 0.667 1 ...
## [1] 6764
##
## kdata_3NN -1 1
## -1 1266 831
## 1 1178 3489
## [1] 1266 3489
## [1] 0.7029864
## 1
## 0.06377593
The k = 3 classifier reached an accuracy of 70.299% on the test set, which is 6.378 percentage points above the base rate.
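Both figures can be recovered from the confusion table printed above. The following is a small sketch that re-enters the counts by hand; `conf_3nn` and the other names are illustrative.

```r
# Confusion table for k = 3, copied from the output above
# (rows = predicted class, columns = actual class; matrix() fills column by column)
conf_3nn <- matrix(c(1266, 1178, 831, 3489), nrow = 2,
                   dimnames = list(Prediction = c("-1", "1"),
                                   Actual     = c("-1", "1")))

# Accuracy: correctly classified cases (the diagonal) over all test cases
accuracy <- sum(diag(conf_3nn)) / sum(conf_3nn)

# Base rate from the full data set: 14,411 commercials out of 22,545 segments
base_rate <- 14411 / (14411 + 8134)

# Improvement over the base rate, as a difference in proportions
improvement <- accuracy - base_rate

round(c(accuracy = accuracy, improvement = improvement), 4)
```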
## Confusion Matrix and Statistics
##
## Actual
## Prediction -1 1
## -1 1266 831
## 1 1178 3489
##
## Accuracy : 0.703
## 95% CI : (0.6919, 0.7139)
## No Information Rate : 0.6387
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.336
##
## Mcnemar's Test P-Value : 1.168e-14
##
## Sensitivity : 0.8076
## Specificity : 0.5180
## Pos Pred Value : 0.7476
## Neg Pred Value : 0.6037
## Prevalence : 0.6387
## Detection Rate : 0.5158
## Detection Prevalence : 0.6900
## Balanced Accuracy : 0.6628
##
## 'Positive' Class : 1
##
The confusion matrix tells us that our model has an accuracy of 70.3%. The model was also much better at correctly classifying commercials than non-commercials: sensitivity (the true positive rate) was 80.76%, while specificity (the true negative rate) was only 51.8%. The average of these two rates is the balanced accuracy, which was reported to be 66.28%.
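To make the definitions concrete, the three rates can be recomputed from the four cells of the k = 3 confusion matrix. This is a hand calculation for illustration, with "1" (commercial) as the positive class as in the output above; the variable names are assumptions.

```r
# k = 3 confusion matrix (rows = predicted, columns = actual), copied from above
conf_3nn <- matrix(c(1266, 1178, 831, 3489), nrow = 2,
                   dimnames = list(Prediction = c("-1", "1"),
                                   Actual     = c("-1", "1")))

tp <- conf_3nn["1", "1"]    # commercials predicted as commercials
fn <- conf_3nn["-1", "1"]   # commercials predicted as non-commercials
tn <- conf_3nn["-1", "-1"]  # non-commercials predicted as non-commercials
fp <- conf_3nn["1", "-1"]   # non-commercials predicted as commercials

sensitivity       <- tp / (tp + fn)                    # true positive rate
specificity       <- tn / (tn + fp)                    # true negative rate
balanced_accuracy <- (sensitivity + specificity) / 2   # mean of the two rates

round(c(sensitivity = sensitivity,
        specificity = specificity,
        balanced_accuracy = balanced_accuracy), 4)
```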
## [1] "matrix" "array"
## k accuracy
## 1 1 0.6723832
## 2 3 0.7029864
## 3 5 0.7201360
## 4 7 0.7238321
## 5 9 0.7270846
## 6 11 0.7270846
## 7 13 0.7290065
## 8 15 0.7284151
## 9 17 0.7315198
## 10 19 0.7294500
## 11 21 0.7298936
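A table like the one above can be produced by fitting the classifier once per candidate k and recording the test accuracy. The sketch below is a hypothetical reconstruction of such a helper, not the original `chooseK` code. It assumes `knn()` from the `class` package and uses synthetic two-class data, so its accuracies will not match the table above.

```r
library(class)  # provides knn()

# Hypothetical helper: test-set accuracy of a k-NN classifier for a single k
chooseK <- function(k, train_set, val_set, train_class, val_class) {
  set.seed(1)  # knn() breaks voting ties at random, so fix the seed
  pred <- knn(train = train_set, test = val_set, cl = train_class, k = k)
  c(k = k, accuracy = mean(pred == val_class))
}

# Synthetic two-class data (illustration only)
set.seed(13)
n <- 400
x <- data.frame(a = rnorm(n), b = rnorm(n))
y <- factor(ifelse(x$a + x$b + rnorm(n, sd = 0.5) > 0, 1, -1))
train_rows <- sample(1:n, round(0.7 * n, 0), replace = FALSE)

# Evaluate odd values of k from 1 to 21; sapply() returns a 2 x 11 matrix
k_results <- sapply(seq(1, 21, by = 2), function(k) {
  chooseK(k, x[train_rows, ], x[-train_rows, ], y[train_rows], y[-train_rows])
})

# Transpose so each row is one k, then convert to a data frame
k_results <- data.frame(t(k_results))
k_results
```

Restricting the search to odd k avoids tied votes between the two classes.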
## Confusion Matrix and Statistics
##
## Actual
## Prediction -1 1
## -1 1231 683
## 1 1213 3637
##
## Accuracy : 0.7197
## 95% CI : (0.7088, 0.7304)
## No Information Rate : 0.6387
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3627
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8419
## Specificity : 0.5037
## Pos Pred Value : 0.7499
## Neg Pred Value : 0.6432
## Prevalence : 0.6387
## Detection Rate : 0.5377
## Detection Prevalence : 0.7170
## Balanced Accuracy : 0.6728
##
## 'Positive' Class : 1
##
The first classifier, using k = 3, looked at the 3 nearest neighbors of each input to decide whether it should be classified as a commercial or a non-commercial. Its accuracy was 70.299%, which is 6.378 percentage points above the base rate. The model was also much better at correctly classifying commercials than non-commercials: sensitivity (the true positive rate) was 80.76%, while specificity (the true negative rate) was 51.8%. The average of these two rates is the balanced accuracy, which was reported to be 66.28%.
As seen in steps 16 and 17, we ran the “chooseK” function to compare values of k and selected k = 5. In the chooseK table, k = 5 reaches an accuracy of 72.01% and larger values of k improve on this only marginally (the maximum, 73.15%, occurs at k = 17). The confusion matrix for the k = 5 model reports an accuracy of 71.97%, higher than the k = 3 classifier (70.299%). The sensitivity and specificity were 84.19% and 50.37% respectively, so this model is also better at correctly classifying commercials than non-commercials. The rate of correctly classifying commercials for the k = 5 model (84.19%) was 3.43 percentage points higher than that of the k = 3 classifier. On the other hand, the rate of correctly classifying non-commercials for the k = 5 model (50.37%) was 1.43 percentage points lower than that of the k = 3 model.
However, the balanced accuracy of the k = 5 model, reported to be 67.28%, is 1.0 percentage point higher than that of the k = 3 model (66.28%).
Since the goal of this project is to use the company’s internal data to classify when something is a commercial and when it is not, the model with the higher accuracy for classifying commercials is what is needed. Both the accuracy (71.97% versus 70.30%) and the balanced accuracy (67.28% versus 66.28%) were higher for the k = 5 model than for the k = 3 model, so we recommend that the company use the k = 5 model. The company should, however, keep in mind that the model does a better job of classifying commercials than non-commercials.