You left your job tracking unstructured text because you wanted to expand your skills into predictive models. Luckily, you landed a job in advertising! Unfortunately, you have a demanding and totally clueless boss. Clueless meaning that he doesn’t understand data science, but he knows he wants it used to fix all the company’s problems, and you are just the data scientist to do it!
Your company, Marketing Enterprises of Halifax, or “MEH”, is being beaten out by the competition and wants a new way to determine the quality of its commercials. Your boss, Mr. Ed Rooney, would like the company’s commercials to seem more like actual TV shows, so he wants you to develop a “machine learning thing” using the company’s internal data to classify when something is a commercial and when it is not. Mr. Rooney believes commercials that are more like TV shows will hold audiences’ attention better, so customers will pay more attention and buy more of the terrible products “MEH” is supporting (it’s a terrible plan, but you have to make a living).
Given that MEH is producing commercials more or less continuously, you know the model will need to be updated quite frequently. Also, being a newish data scientist with a clueless boss, you decide to use an accessible approach that you might be able to explain to Mr. Rooney (given several months of dedicated one-on-one time). That approach is k-nearest neighbors.
You’ll also need to document your work extensively, because Mr. Rooney doesn’t know he’s clueless, so he will ask lots of “insightful” questions and require lots of detail that he won’t understand. You’ll need an easy-to-use reference document. Before you get started, you hearken back to the excellent education you received at UVA and, using this knowledge, outline roughly 15 steps that need to be completed to build this algo for MEH and Ed. They are documented below… good luck. As always, the most important part is translating your work into actionable insights, so please make sure to be verbose in the explanation required for step 15. Think about this question carefully: what are you really delivering to Mr. Rooney?
As with the clustering lab, please be prepared to present a five minute overview of your findings.
#4. Before we run kNN, sometimes it's good to check to make sure that our variables are not highly correlated. Use the cor() function on 'your_dataframe', label it 'commercial_correlations', and view the data, because remember kNN doesn't work well in high dimensions.
commercial_cors <- cor(commercial_data[, 1:(ncol(commercial_data) - 1)])  # every column except the last
View(commercial_cors)
# frame_diff_dist_mn and motion_distr_mn are highly correlated for sure, among others
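Scanning a full correlation matrix by eye gets tedious, so here is a minimal sketch of pulling out only the strongly correlated pairs. The data frame `toy_features`, its column names, and the 0.7 cutoff are all assumptions made up for illustration; they are not part of the lab data or the assignment.

```r
# Toy stand-in for the feature columns; the real column names and values differ.
set.seed(42)
n <- 200
base_signal <- rnorm(n)
toy_features <- data.frame(
  feat_a = base_signal,
  feat_b = base_signal + rnorm(n, sd = 0.1),  # built to track feat_a closely
  feat_c = rnorm(n)                           # independent noise
)

cors <- cor(toy_features)

# Keep each pair once by blanking the diagonal and the lower triangle.
cors[lower.tri(cors, diag = TRUE)] <- NA

# The 0.7 cutoff is an arbitrary illustration, not a rule from the lab.
high <- which(abs(cors) > 0.7, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cors)[high[, "row"]],
  var2 = colnames(cors)[high[, "col"]],
  correlation = cors[high]
)
high_pairs
```

The same pattern should carry over to 'commercial_cors' by swapping it in for `cors`, though the cutoff is a judgment call.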
#9 Run the confusion matrix function and comment on the model output
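library(caret)  # confusionMatrix() comes from caret
library(e1071)
# Predictions are compared against com_train$label, so this measures in-sample performance
cm <- confusionMatrix(
  comKNN3,
  com_train$`label`,
  positive = "1",
  dnn = c("Prediction", "Actual"),
  mode = "sens_spec"
)
cm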
## Confusion Matrix and Statistics
##
##           Actual
## Prediction   -1    1
##         -1 4392  849
##         1  1312 9228
##
##                Accuracy : 0.8631
##                  95% CI : (0.8576, 0.8684)
##     No Information Rate : 0.6386
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.698
##
##  Mcnemar's Test P-Value : < 2.2e-16
##
##             Sensitivity : 0.9157
##             Specificity : 0.7700
##          Pos Pred Value : 0.8755
##          Neg Pred Value : 0.8380
##              Prevalence : 0.6386
##          Detection Rate : 0.5848
##    Detection Prevalence : 0.6679
##       Balanced Accuracy : 0.8429
##
##        'Positive' Class : 1
##
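With k = 3, the output above shows the model is approximately 86.3% accurate with a sensitivity of 91.6%. Its ability to classify negative cases is considerably weaker, with specificity at 77.0%. Because the predictions are compared against com_train$label, these figures describe performance on the training data, which tends to overstate how the model will do on unseen media.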
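The headline statistics in the confusionMatrix() output can be recomputed by hand from the four cell counts, which helps when explaining what each number means. The sketch below copies the counts from the k = 3 matrix printed above; the variable names are my own and nothing here depends on caret.

```r
# Counts copied from the k = 3 confusion matrix (rows = prediction, columns = actual).
conf_counts <- matrix(
  c(4392, 1312,   # first column: actual -1
     849, 9228),  # second column: actual 1
  nrow = 2,
  dimnames = list(Prediction = c("-1", "1"), Actual = c("-1", "1"))
)

true_pos  <- conf_counts["1", "1"]    # predicted 1, actually 1
true_neg  <- conf_counts["-1", "-1"]  # predicted -1, actually -1
false_pos <- conf_counts["1", "-1"]   # predicted 1, actually -1
false_neg <- conf_counts["-1", "1"]   # predicted -1, actually 1

accuracy    <- (true_pos + true_neg) / sum(conf_counts)
sensitivity <- true_pos / (true_pos + false_neg)  # share of actual 1s that were caught
specificity <- true_neg / (true_neg + false_pos)  # share of actual -1s that were caught

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
```

Rounded to four places, these should match the Accuracy, Sensitivity, and Specificity lines in the printed output.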
#10 Run the "chooseK" function to find the perfect K, while using sapply() function on chooseK() to test k from 1 to 21 (only selecting the odd numbers), and set the train_set argument to 'commercial_train', val_set to 'commercial_test', train_class to the "label" column of 'commercial_train', and val_class to the "label" column of 'commercial_test'. Label this "knn_diff_k_com"
library(class)  # provides knn()
chooseK = function(k, train_set, val_set, train_class, val_class){
  # Build knn with k neighbors considered.
  set.seed(1)
  class_knn = knn(train = train_set,    #<- training set cases
                  test = val_set,       #<- test set cases
                  cl = train_class,     #<- category for classification
                  k = k,                #<- number of neighbors considered
                  use.all = TRUE)       #<- control ties between class assignments
                                        #   If TRUE, all distances equal to the kth largest are included
  conf_mat = table(class_knn, val_class)
  # Calculate the accuracy (could change this to sensitivity)
  accu = sum(conf_mat[row(conf_mat) == col(conf_mat)]) / sum(conf_mat)
  cbind(k = k, accuracy = accu)
}
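# Pass each candidate k through to chooseK() and keep the label column out of the features
knn_diff_k_com <- sapply(seq(1, 21, by = 2), function(x) chooseK(k = x,
                         train_set = com_train[, names(com_train) != "label"],
                         val_set = com_test[, names(com_test) != "label"],
                         train_class = com_train$`label`,
                         val_class = com_test$`label`))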
knn_diff_k_com
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## [1,] 3 3 3 3 3 3 3 3 3 3 3
## [2,] 1 1 1 1 1 1 1 1 1 1 1
#11 Create a dataframe so we can visualize the difference in accuracy based on K, convert the matrix to a dataframe
knn_diff_k_comdf <- data.frame(k_value = knn_diff_k_com[1, ], accuracy = knn_diff_k_com[2, ])
knn_diff_k_comdf
#12 Use ggplot to show the output and comment on the k to select.
library(ggplot2)
Kplot <- ggplot(knn_diff_k_comdf, aes(x = k_value, y = accuracy)) +
  geom_line() +
  geom_point(size = 3) +
  labs(title = "K Value Accuracy", x = "K Value", y = "Model Accuracy")
Kplot
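Steps 10 through 12 amount to a sweep over odd k values followed by picking the k with the best validation accuracy. Here is a minimal, self-contained sketch of that pattern. It assumes the built-in iris data as a stand-in for the commercial data, a 100-row training split, and accuracy as the selection metric; all variable names are hypothetical. Two details matter in a sweep like this: the loop variable has to be the value passed as k, and the label column has to stay out of the feature columns, otherwise every run returns the same (or a suspiciously perfect) accuracy.

```r
library(class)  # knn(); ships with standard R installations

# Stand-in data: iris replaces the commercial data, which is not reproduced here.
set.seed(1)
train_rows <- sample(nrow(iris), 100)
feature_cols <- names(iris) != "Species"   # keep the label out of the features

train_x <- iris[train_rows, feature_cols]
val_x   <- iris[-train_rows, feature_cols]
train_y <- iris$Species[train_rows]
val_y   <- iris$Species[-train_rows]

k_values <- seq(1, 21, by = 2)             # odd k only

# The loop variable k is what gets passed to knn(), so each run differs.
accuracy_by_k <- sapply(k_values, function(k) {
  predicted <- knn(train = train_x, test = val_x, cl = train_y, k = k)
  mean(predicted == val_y)
})

k_results <- data.frame(k_value = k_values, accuracy = accuracy_by_k)
best_k <- k_results$k_value[which.max(k_results$accuracy)]  # smallest k on ties
k_results
best_k
```

Picking the single highest accuracy is only one possible rule; when several k values score within noise of each other, choosing by sensitivity or by the elbow of the plot may be just as defensible.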
#14 Use the confusion matrix function to measure the quality of the new model.
cm7 <- confusionMatrix(
  comKNN7,
  com_test$label,
  positive = "1",
  dnn = c("Prediction", "Actual"),
  mode = "sens_spec"
)
cm7
## Confusion Matrix and Statistics
##
##           Actual
## Prediction   -1    1
##         -1 1410  630
##         1  1020 3704
##
##                Accuracy : 0.7561
##                  95% CI : (0.7456, 0.7663)
##     No Information Rate : 0.6407
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.4508
##
##  Mcnemar's Test P-Value : < 2.2e-16
##
##             Sensitivity : 0.8546
##             Specificity : 0.5802
##          Pos Pred Value : 0.7841
##          Neg Pred Value : 0.6912
##              Prevalence : 0.6407
##          Detection Rate : 0.5476
##    Detection Prevalence : 0.6984
##       Balanced Accuracy : 0.7174
##
##        'Positive' Class : 1
##
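To judge whether the k = 7 model beats guessing, it helps to recompute the No Information Rate and Kappa lines from the counts above. This is a rough sketch using only base R; the counts are copied from the printed matrix and the variable names are my own.

```r
# Counts copied from the k = 7 confusion matrix (rows = prediction, columns = actual).
conf_counts <- matrix(
  c(1410, 1020,   # first column: actual -1
     630, 3704),  # second column: actual 1
  nrow = 2,
  dimnames = list(Prediction = c("-1", "1"), Actual = c("-1", "1"))
)
n <- sum(conf_counts)

# Share of predictions that match the actual label.
observed_accuracy <- sum(diag(conf_counts)) / n

# Accuracy from always guessing the most common actual label.
no_information_rate <- max(colSums(conf_counts)) / n

# Agreement expected by chance, given the row and column totals.
expected_agreement <- sum(rowSums(conf_counts) * colSums(conf_counts)) / n^2

# Cohen's kappa: how far observed accuracy sits above chance agreement.
cohen_kappa <- (observed_accuracy - expected_agreement) / (1 - expected_agreement)

round(c(accuracy = observed_accuracy,
        nir = no_information_rate,
        kappa = cohen_kappa), 4)
```

The gap between accuracy and the no-information rate is the part of the model's performance that cannot be explained by the class imbalance alone.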
#15 Summarize the differences in language Mr. Rooney may actually understand. Include a discussion on which approach k=3 or k="optimal" is the better method moving forward for "MEH". Most importantly draft comments about the overall approach and model quality as it relates to addressing the problem proposed by Ed.
The difference in accuracy between k = 3 and k = 7 nearest neighbors is over 2 percentage points. This is a small but noticeable change, and adding neighbors beyond the optimal number will likely not make the predictions much more accurate. The model with the higher k value is also slightly better at minimizing false positives (as shown by the specificity value). In plain English, this model is better than random chance: on the test data, the k = 7 model is right about 75.6% of the time, while always guessing the most common label would be right about 64.1% of the time.
However, it must be recognized that this model only predicts whether a piece of media is a commercial or not. It does not explain what makes a commercial feel like a TV show, or which characteristics are favorable. To create commercials that seem like TV shows, one needs to actually understand what those differences mean. That is where humans come in. I would advise building another model that incorporates a metric rating media by its effect on the audience. This can yield the progress you need to retool your commercials.