You left your job as a lobbyist because the political environment was becoming just too toxic to handle. Luckily, you landed a job in advertising! Unfortunately, you have a demanding and totally clueless boss. Clueless meaning that he doesn't understand data science, but he knows he wants it used to fix all the company's problems, and you are just the data scientist to do it!
Your company, Marketing Enterprises of Halifax or "MEH", is being beaten out by the competition and wants a new way to determine the quality of its commercials. Your boss, Mr. Ed Rooney, would like the company's commercials to seem more like actual TV shows. So he wants you to develop a "machine learning thing" using the company's internal data to classify when something is a commercial and when it is not. Mr. Rooney believes the company will then know how to trick potential future customers into thinking its commercials are actually still part of the show; as a result, viewers will pay more attention and buy more of the terrible products "MEH" is promoting (it's a terrible plan, but you have to make a living).
Given that MEH produces commercials more or less continuously, you know the model will need to be updated quite frequently. Also, being a newish data scientist with a clueless boss, you decide to use an accessible approach that you might be able to explain to Mr. Rooney (given several months of dedicated one-on-one time): k-nearest neighbors.
You'll also need to document your work extensively, because Mr. Rooney doesn't know he's clueless, so he will ask lots of "insightful" questions and require lots of detail that he won't understand; you'll need an easy-to-use reference document. Before you get started, you hearken back to the excellent education you received at UVA and, using this knowledge, outline roughly 20 steps that need to be completed to build this algo for MEH and Ed. They are documented below…good luck. As always, the most important part is translating your work into actionable insights, so please be verbose in the explanation required for step 20.
As with the clustering lab, please be prepared to present a five minute overview of your findings.
Since Mr. Rooney has no experience with machine learning, here we have an oversimplified visual of how KNN functions. We can see that our data consist of commercial and non-commercial inputs. Given an unknown input, our algorithm should be able to accurately determine whether the unknown belongs to the commercial or the non-commercial grouping. As the value of k increases, the circle in the image grows to include more neighbors. As we implement the KNN algorithm step by step, this illustration will become clearer to Mr. Rooney.
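For completeness, the intuition in the picture can be reproduced with a few lines of R. This is a toy illustration only (the points and the unknown input below are made up for demonstration, not MEH data), but it uses the same knn() function from the class package that we rely on later:
library(class)
# Toy training data: two features, with 1 = commercial and -1 = non-commercial
toy_train  <- rbind(c(1, 1), c(1.2, 0.8), c(0.9, 1.1),   #<- commercial cluster
                    c(3, 3), c(2.8, 3.2), c(3.1, 2.9))   #<- non-commercial cluster
toy_labels <- c(1, 1, 1, -1, -1, -1)
# An unknown input near the commercial cluster: with k = 3, its three nearest
# neighbors are all commercials, so knn() labels it "1"
unknown <- rbind(c(1.1, 1.0))
knn(train = toy_train, test = unknown, cl = toy_labels, k = 3)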
#1
#Load in the data, both the commercial dataset and the labels. You'll need to place the labels on the columns. The dataset "tv_commercial_datasets_CNN_Cleaned.csv" contains data collected about the features of commercials aired on CNN. We can try to predict which segments of video are commercials based on their audio and video components. More information on the datasets can be found at data.world:
# https://data.world/kramea/tv-commercial-detection/workspace/file?filename=tv_commercial_datasets%2FBBC_Cleaned.csv
#You can use the function colnames() to apply the labels (hint: you might need to reshape the labels to make this work)
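Before loading the files, it helps to load the packages the rest of the chunks lean on. These are simply the libraries whose functions appear later in this document: class for knn(), caret for confusionMatrix(), dplyr for the %>% pipelines, formattable for the step 5 table, and ggplot2 for the step 17 plot.
library(class)        #<- knn()
library(caret)        #<- confusionMatrix(), createDataPartition()
library(dplyr)        #<- %>%, select(), filter(), arrange()
library(formattable)  #<- formattable()
library(ggplot2)      #<- ggplot()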
setwd("/cloud/project/KNN")
tv_comm <- read.csv('tv_commercial_datasets_CNN_Cleaned.csv', stringsAsFactors = FALSE)
cnn <- t(read.csv('cnn_commmercial_label.csv', header = FALSE)) #<- reshape the label file so the labels can be applied as column names
colnames(tv_comm) <- cnn
head(tv_comm)
## shot_length motion_distr_mn motion_distr_var frame_diff_dist_mn
## 1 29 3.821209 1.567568 13.547628
## 2 25 3.052969 1.641484 22.334589
## 3 82 1.601274 1.508805 5.860583
## 4 25 4.819368 2.879584 41.382828
## 5 29 2.768753 1.797319 13.338054
## 6 25 3.679925 2.145892 27.195999
## frame_diff_dist_var short_time_energy_mn short_time_energy_var zcr_mn
## 1 7.242389 0.019883 0.012195 0.067241
## 2 15.734018 0.023027 0.010731 0.077000
## 3 3.301121 0.025948 0.006956 0.082317
## 4 24.448074 0.014387 0.007596 0.069875
## 5 9.980667 0.011506 0.007269 0.100647
## 6 21.484812 0.012956 0.007638 0.068125
## zcr_var spectral_centroid_mn spectral_centroid_var spectral_roll_off_mn
## 1 0.049107 3406.866 1363.9906 6796.552
## 2 0.045884 3324.158 1452.0208 6610.000
## 3 0.044845 3771.984 855.7665 7488.112
## 4 0.046916 3301.686 1441.9977 6606.001
## 5 0.067401 3266.021 1432.1149 6688.795
## 6 0.036726 3324.525 1451.8527 6565.001
## spectral_roll_off_var spectral_flux_mn spectral_flux_var fundamental_freq_mn
## 1 2719.627 1021.3592 940.7424 102.60780
## 2 2885.445 1199.3644 821.1235 84.68770
## 3 1697.351 1544.1475 807.1373 92.69461
## 4 2883.915 599.5552 535.1520 96.59550
## 5 2678.394 446.4524 411.4647 95.03695
## 6 2866.239 637.6216 591.3987 109.96493
## fundamental_freq_var motion_dist_mn motion_dist_var label
## 1 60.17829 0.538928 0.038365 1
## 2 46.89688 0.446286 0.066858 1
## 3 41.18265 0.591414 0.141610 1
## 4 61.86105 0.246416 0.093782 1
## 5 48.36439 0.314103 0.193568 1
## 6 60.04241 0.270214 0.136905 1
#2. Determine the split between commercial and non-commercial then calculate the base rate, assume 1 is the commercial label and -1 is the non-commercial label
split <- table(tv_comm$`label `)
non_comm <- split[1] / sum(split)
comm <- split[2] / sum(split)
data.frame(c('Comm' = comm,'Non-Comm' = non_comm))
## c.Comm...comm...Non.Comm....non_comm.
## Comm.1 0.6392105
## Non-Comm.-1 0.3607895
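Since step 13 compares the model's accuracy against this number, it is worth spelling the base rate out: a naive rule that always guesses "commercial" (the majority class) would already be right about 64% of the time, and that is the figure our KNN model has to beat. A one-line sketch (base_rate is a new name introduced here for convenience and is not used elsewhere):
base_rate <- max(split) / sum(split)  #<- ~0.639, the accuracy of always predicting "commercial"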
#3. Since there are columns that contain different metrics for the same variable (i.e. any column that ends in 'mn' is the mean of that variable, while any column that ends in 'var' is the variance of that variable), we don't need to keep both; drop all the columns that end in 'var'.
tv_comm2 <- tv_comm %>%
select(-ends_with('var'))
scale_comm <- scale(tv_comm2[ ,1:(ncol(tv_comm2)-1)]) #<- standardize every predictor column (KNN is distance-based)
tv_comm2 <- cbind(scale_comm, Commercial = tv_comm2[ ,ncol(tv_comm2)]) #<- reattach the label column, renamed to Commercial
head(tv_comm2)
## shot_length motion_distr_mn frame_diff_dist_mn short_time_energy_mn
## [1,] -0.3257596 0.57845494 0.08964043 0.8929471
## [2,] -0.3396870 0.21609933 1.02213946 1.5786124
## [3,] -0.1412216 -0.46862137 -0.72613217 2.2156442
## [4,] -0.3396870 1.04925640 3.04359652 -0.3056585
## [5,] -0.3257596 0.08204322 0.06739980 -0.9339668
## [6,] -0.3396870 0.51181554 1.53804708 -0.6177408
## zcr_mn spectral_centroid_mn spectral_roll_off_mn spectral_flux_mn
## [1,] -1.2377040 -0.8460014 -0.7181542 -0.2173862
## [2,] -0.9332126 -1.1929107 -1.1892760 0.1153078
## [3,] -0.7673164 0.6854444 1.0283266 0.7597118
## [4,] -1.1555203 -1.2871677 -1.1993765 -1.0057433
## [5,] -0.1954005 -1.4367586 -0.9902869 -1.2918942
## [6,] -1.2101222 -1.1913706 -1.3029188 -0.9345967
## fundamental_freq_mn motion_dist_mn Commercial
## [1,] -0.8728803 0.03034897 1
## [2,] -1.8192935 -0.45323571 1
## [3,] -1.3964250 0.30432217 1
## [4,] -1.1904078 -1.49654296 1
## [5,] -1.2727193 -1.14322161 1
## [6,] -0.4843287 -1.37231909 1
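Beyond dropping the variance columns, the chunk above also standardizes the predictors with scale(). KNN is distance-based, so without this step variables measured in the thousands (e.g. spectral_centroid_mn) would dominate variables like shot_length. A quick sanity check that each scaled column now has mean roughly 0 and standard deviation roughly 1:
round(colMeans(scale_comm), 3)       #<- every column mean should be ~0
round(apply(scale_comm, 2, sd), 3)   #<- every column sd should be ~1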
#4. Before we run knn, sometimes it's good to check to make sure that our variables are not highly correlated. Use the cor() function on 'your_dataframe', label it 'commercial_correlations', and view the data.
commercial_correlations <- cor(tv_comm2)
commercial_correlations
## shot_length motion_distr_mn frame_diff_dist_mn
## shot_length 1.00000000 -0.147625726 -0.149212625
## motion_distr_mn -0.14762573 1.000000000 0.715703132
## frame_diff_dist_mn -0.14921262 0.715703132 1.000000000
## short_time_energy_mn 0.02648501 -0.007160132 -0.023968229
## zcr_mn 0.19090475 -0.052205970 -0.042074142
## spectral_centroid_mn 0.36441928 -0.179064774 -0.298912770
## spectral_roll_off_mn 0.38018472 -0.221524744 -0.384839454
## spectral_flux_mn 0.10231118 -0.019123107 0.006196663
## fundamental_freq_mn 0.29268568 -0.096980636 -0.090672211
## motion_dist_mn 0.21214421 -0.757645267 -0.645178936
## Commercial -0.27210692 0.053938038 -0.047456520
## short_time_energy_mn zcr_mn spectral_centroid_mn
## shot_length 0.026485006 0.19090475 0.3644193
## motion_distr_mn -0.007160132 -0.05220597 -0.1790648
## frame_diff_dist_mn -0.023968229 -0.04207414 -0.2989128
## short_time_energy_mn 1.000000000 -0.12505793 0.3087839
## zcr_mn -0.125057928 1.00000000 0.3089015
## spectral_centroid_mn 0.308783942 0.30890154 1.0000000
## spectral_roll_off_mn 0.160313488 0.03308083 0.8092628
## spectral_flux_mn 0.823463249 -0.05336937 0.2834196
## fundamental_freq_mn 0.022512016 0.53355483 0.4190101
## motion_dist_mn 0.031412848 0.06741515 0.3139645
## Commercial 0.108814835 -0.25376348 -0.2734260
## spectral_roll_off_mn spectral_flux_mn fundamental_freq_mn
## shot_length 0.38018472 0.102311184 0.29268568
## motion_distr_mn -0.22152474 -0.019123107 -0.09698064
## frame_diff_dist_mn -0.38483945 0.006196663 -0.09067221
## short_time_energy_mn 0.16031349 0.823463249 0.02251202
## zcr_mn 0.03308083 -0.053369373 0.53355483
## spectral_centroid_mn 0.80926285 0.283419648 0.41901010
## spectral_roll_off_mn 1.00000000 0.168405909 0.32491709
## spectral_flux_mn 0.16840591 1.000000000 0.23615929
## fundamental_freq_mn 0.32491709 0.236159291 1.00000000
## motion_dist_mn 0.38976247 0.036431956 0.12825129
## Commercial -0.24451480 -0.140024993 -0.39495599
## motion_dist_mn Commercial
## shot_length 0.21214421 -0.27210692
## motion_distr_mn -0.75764527 0.05393804
## frame_diff_dist_mn -0.64517894 -0.04745652
## short_time_energy_mn 0.03141285 0.10881483
## zcr_mn 0.06741515 -0.25376348
## spectral_centroid_mn 0.31396448 -0.27342595
## spectral_roll_off_mn 0.38976247 -0.24451480
## spectral_flux_mn 0.03643196 -0.14002499
## fundamental_freq_mn 0.12825129 -0.39495599
## motion_dist_mn 1.00000000 -0.04520322
## Commercial -0.04520322 1.00000000
cormat <- signif(commercial_correlations, 2) #<- round the correlation values to 2 significant figures for display
col <- colorRampPalette(c("white", "#00203FFF"))(10)
heatmap(cormat, col = col, symm = TRUE)
In order to analyze the correlation values, we can create a heatmap to get a visual understanding. This can be extremely useful for individuals, like Mr. Rooney, who may want to initially refer to a visual to get a good understanding of the data before referring to a chart with all of the correlation values. Here, we can see that the highest correlations have a dark blue color, and the lowest correlations have a white color. Although this heatmap does not show us the specific correlation value, it can help us get a sense of which variables might have to be removed before running KNN.
#5. Determine which variables to remove; high correlations start around .7 and above (or -.7 and below). I would especially remove variables that appear to be correlated with more than one variable. List your rationale here:
commercial_correlations %>% # start with the correlation matrix (from step four)
as.table() %>% as.data.frame() %>%
subset(Var1 != Var2 & abs(Freq)>=0.7) %>% # omit the diagonal and keep only strong correlations (|r| >= 0.7)
filter(!duplicated(paste0(pmax(as.character(Var1), as.character(Var2)), pmin(as.character(Var1), as.character(Var2))))) %>%
# keep only unique occurrences, as.character because Var1 and Var2 are factors
arrange(desc(Freq)) %>% # sort by Freq (aka correlation value)
formattable() #create a visually more appealing chart to refer to
| Var1 | Var2 | Freq |
|---|---|---|
| spectral_flux_mn | short_time_energy_mn | 0.8234632 |
| spectral_roll_off_mn | spectral_centroid_mn | 0.8092628 |
| frame_diff_dist_mn | motion_distr_mn | 0.7157031 |
| motion_dist_mn | motion_distr_mn | -0.7576453 |
We took the matrix from step four and filtered it to show only the unique pairs whose correlation met or exceeded the abs(0.7) threshold. We omitted the diagonal (perfect correlations of abs(1.0)), since those entries simply show the correlation of each variable with itself (e.g. shot_length vs. shot_length) and are not relevant to our analysis.
When looking at the correlation matrix, using abs(0.7) as the threshold for high correlation, we decided to remove the following variables: motion_distr_mn, frame_diff_dist_mn, spectral_centroid_mn, spectral_roll_off_mn, and spectral_flux_mn.
Each of these variables exhibited a high correlation (>= abs(0.7)) with at least one other variable. We used those high correlations as a cue to inspect each variable's relationships with the rest of the predictors, and found that, even where the remaining correlations did not surpass the abs(0.7) threshold, these variables still showed non-trivial correlations with several other variables. Because of these strong relationships with other predictors, we judged them the most viable candidates to drop from the model.
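As a cross-check on this hand-picked list, caret's findCorrelation() helper automates the same idea: given a correlation matrix and a cutoff, it suggests columns to drop. A sketch is below (the label column is excluded from the matrix, and the suggested set will not necessarily match ours exactly, since findCorrelation() breaks ties using mean absolute correlation):
# Automated suggestion of correlated predictors to drop, using |r| >= 0.7
findCorrelation(commercial_correlations[1:10, 1:10], cutoff = 0.7, names = TRUE)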
#6. Subset the dataframe based on above.
subset_knn <- tv_comm2[ ,-c(2,3,6,7,8)] #<- drop the five variables flagged in step 5
head(subset_knn)
## shot_length short_time_energy_mn zcr_mn fundamental_freq_mn
## [1,] -0.3257596 0.8929471 -1.2377040 -0.8728803
## [2,] -0.3396870 1.5786124 -0.9332126 -1.8192935
## [3,] -0.1412216 2.2156442 -0.7673164 -1.3964250
## [4,] -0.3396870 -0.3056585 -1.1555203 -1.1904078
## [5,] -0.3257596 -0.9339668 -0.1954005 -1.2727193
## [6,] -0.3396870 -0.6177408 -1.2101222 -0.4843287
## motion_dist_mn Commercial
## [1,] 0.03034897 1
## [2,] -0.45323571 1
## [3,] 0.30432217 1
## [4,] -1.49654296 1
## [5,] -1.14322161 1
## [6,] -1.37231909 1
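For documentation purposes, and so the subset does not silently break if the column order ever changes, the same step 6 subset can be written by name instead of by position. An equivalent sketch, dropping the five variables identified in step 5 (drop_cols is introduced here just for readability):
drop_cols  <- c("motion_distr_mn", "frame_diff_dist_mn", "spectral_centroid_mn",
                "spectral_roll_off_mn", "spectral_flux_mn")
subset_knn <- tv_comm2[ , !colnames(tv_comm2) %in% drop_cols]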
#7. Now we have our data and are ready to run the KNN, but we need to split into test and train. Create an index that will divide the data into a 70/30 split
set.seed(03092000)
sample.data <- sample(1:nrow(subset_knn),
round(0.7 * nrow(subset_knn), 0), #<- multiply the number of rows by 0.7 and round the decimals
replace = FALSE)
#8. Use the index above to generate train and test sets, then check the row counts to be safe and show Mr. Rooney.
knn_train <- subset_knn[sample.data, ]
knn_test <- subset_knn[-sample.data, ]
size_of_training <- nrow(knn_train)
size_of_total <- nrow(subset_knn)
size_of_test <- nrow(knn_test)
#Verification
paste("The Training Set contains", toString(round(size_of_training/size_of_total,2)*100), "% of the total data")
## [1] "The Training Set contains 70 % of the total data"
paste("The Testing Set contains", toString(round(size_of_test/size_of_total,2)*100), "% of the total data")
## [1] "The Testing Set contains 30 % of the total data"
After indexing the data to generate a train and test set, we can now check the row counts. We see that approximately 70% of the rows are dedicated to the training set, and 30% of the rows are dedicated to the testing set. We can proceed.
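A simple random split like the one above can, by chance, shift the commercial/non-commercial mix between the two sets. If Mr. Rooney asks about that, one alternative (sketched below, not what we ran above) is caret's createDataPartition(), which stratifies the split on the label; the strat_* names are introduced only for this illustration:
set.seed(03092000)
strat_idx       <- createDataPartition(factor(subset_knn[ , "Commercial"]),
                                       p = 0.7, list = FALSE)
knn_train_strat <- subset_knn[ strat_idx, ]
knn_test_strat  <- subset_knn[-strat_idx, ]
prop.table(table(knn_train_strat[ , "Commercial"]))  #<- mix should stay close to 64/36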
#9 Train the classifier using k = 3; remember to set.seed so you can repeat the output, and use the labels as a vector for the class (not an index of the dataframe)
set.seed(03092000)
tv_3NN <- knn(train = knn_train[ ,-ncol(knn_train)],
test = knn_test[ ,-ncol(knn_train)],
cl = knn_train[, "Commercial"],
k = 3,
use.all = TRUE,
prob = TRUE)
#10 Check the output using str and length just to be sure it worked
str(tv_3NN)
## Factor w/ 2 levels "-1","1": 2 1 2 1 2 2 2 2 1 2 ...
## - attr(*, "prob")= num [1:6764] 0.667 0.667 0.667 0.667 1 ...
length(tv_3NN)
## [1] 6764
#11 Create an initial confusion matrix using the table function and pass it to an object. (xx <- your confusion matrix)
conf_mat <- table(tv_3NN, knn_test[ ,"Commercial"])
conf_mat
##
## tv_3NN -1 1
## -1 1390 717
## 1 1054 3603
Above, we created our initial confusion matrix using the table function; the rows are the model's predictions and the columns are the actual labels.
# Confusion matrix with labels
#Using the caret library
lvs <- c("non commercial","commercial")
truth <- factor(rep(lvs, times = c(2444, 4320)), #<- 2444 true non-commercials, 4320 true commercials (the column totals above)
levels = rev(lvs))
pred <- factor(
c(rep(lvs, times = c(1390, 1054)), rep(lvs, times = c(717, 3603))), levels = rev(lvs))
xtab <- table(pred, truth)
cm <- confusionMatrix(pred, truth)
cm$table
## Reference
## Prediction commercial non commercial
## commercial 3603 1054
## non commercial 717 1390
This shows another way to create our initial confusion matrix. This has labels which would help Mr. Rooney better understand what we are looking at (rather than the -1 and 1 labels we had earlier).
fourfoldplot(cm$table)
Since Mr. Rooney is not familiar with machine learning concepts, we may want to create a visual for the confusion matrix that is easier to understand. This visual is another way to display the performance of the algorithm. Here, we can see that the True Positives (predicting a commercial, and having it be a commercial) and the True Negatives (predicting a non-commercial, and having it be a non-commercial) occupy the largest quarters of the circle. This means that most of the time, the algorithm was able to correctly distinguish commercials from non-commercials.
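To make that statement concrete for Mr. Rooney, the headline rates can also be pulled straight out of conf_mat by hand (rows are predictions, columns are actual labels, and 1 = commercial is treated as the positive class). The TP/TN/FP/FN names below are introduced just for this illustration; the resulting figures are the same ones caret reports in step 14:
TP <- conf_mat["1", "1"]    #<- predicted commercial, truly commercial (3603)
TN <- conf_mat["-1", "-1"]  #<- predicted non-commercial, truly non-commercial (1390)
FP <- conf_mat["1", "-1"]   #<- predicted commercial, truly non-commercial (1054)
FN <- conf_mat["-1", "1"]   #<- predicted non-commercial, truly commercial (717)
c(accuracy    = (TP + TN) / sum(conf_mat),  #<- ~0.738
  sensitivity = TP / (TP + FN),             #<- true positive rate, ~0.834
  specificity = TN / (TN + FP))             #<- true negative rate, ~0.569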
#12 Select the true positives and true negatives by selecting only the cells where the row and column names are the same.
conf_mat[row(conf_mat) == col(conf_mat)]
## [1] 1390 3603
#13 Calculate the accuracy rate by dividing the correct classifications by the total number of classifications. Label the data 'kNN_acc_com', and view it. Comment on how this compares to the base rate.
kNN_acc_com <- sum(conf_mat[row(conf_mat) == col(conf_mat)]) / sum(conf_mat)
kNN_acc_com
## [1] 0.7381727
Comparing our accuracy to the base rate, the model performs better, though only by about 10 percentage points (73.8% versus the 63.9% base rate of always guessing "commercial"). This is an indication that the model can likely be improved with parameter tuning; to maximize our accuracy, we will want to find the optimal k value.
#14 Run the confusion matrix function and comment on the model output
confusionMatrix(as.factor(tv_3NN), as.factor(knn_test[ ,'Commercial']), positive = "1", dnn=c("Prediction", "Actual"), mode = "sens_spec")
## Confusion Matrix and Statistics
##
## Actual
## Prediction -1 1
## -1 1390 717
## 1 1054 3603
##
## Accuracy : 0.7382
## 95% CI : (0.7275, 0.7486)
## No Information Rate : 0.6387
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4152
##
## Mcnemar's Test P-Value : 1.415e-15
##
## Sensitivity : 0.8340
## Specificity : 0.5687
## Pos Pred Value : 0.7737
## Neg Pred Value : 0.6597
## Prevalence : 0.6387
## Detection Rate : 0.5327
## Detection Prevalence : 0.6885
## Balanced Accuracy : 0.7014
##
## 'Positive' Class : 1
##
Looking initially at the resulting confusion matrix, we find a moderate overall accuracy of 73.82%. Diving deeper, we now want to look at our sensitivity (true positive rate) and specificity (true negative rate). Looking at the sensitivity, we find that, given that a segment was a commercial (Commercial = 1), 83.40% were correctly classified as commercials. Conversely, given that a segment was not a commercial (Commercial = -1), only 56.87% were correctly classified as non-commercial. Although this is not poor performance, we can certainly do better. To do this, we can find the value of k that maximizes our overall accuracy.
#15 Run the "chooseK" function to find the perfect K, while using sapply() function on chooseK() to test k from 1 to 21 (only selecting the odd numbers), and set the train_set argument to 'commercial_train', val_set to 'commercial_test', train_class to the "label" column of 'commercial_train', and val_class to the "label" column of 'commercial_test'. Label this "knn_diff_k_com"
chooseK = function(k, train_set, val_set, train_class, val_class){
set.seed(03092000)
class_knn = knn(train = train_set,
test = val_set,
cl = train_class,
k = k,
use.all = TRUE)
conf_mat2 = table(class_knn, val_class)
accu = sum(conf_mat2[row(conf_mat2) == col(conf_mat2)]) / sum(conf_mat2)
cbind(k = k, accuracy = accu)
}
knn_different_k_com <- sapply(seq(1, 21, by = 2), #<- set k to be odd number from 1 to 21
function(x) chooseK(x,
train_set = knn_train[ ,-ncol(subset_knn)],
val_set = knn_test[ ,-ncol(subset_knn)],
train_class = knn_train[ ,'Commercial'],
val_class = knn_test[ ,'Commercial']))
knn_different_k_com
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 1.0000000 3.0000000 5.0000000 7.000000 9.0000000 11.0000000 13.000000
## [2,] 0.7251626 0.7381727 0.7485216 0.756505 0.7596097 0.7636014 0.765819
## [,8] [,9] [,10] [,11]
## [1,] 15.0000000 17.0000000 19.0000000 21.0000000
## [2,] 0.7647842 0.7662626 0.7640449 0.7643406
#16 Create a dataframe so we can visualize the difference in accuracy based on K, convert the matrix to a dataframe
knn_different_k_comdf <- data.frame(k_value = knn_different_k_com[1, ], accuracy = knn_different_k_com[2, ])
knn_different_k_comdf
## k_value accuracy
## 1 1 0.7251626
## 2 3 0.7381727
## 3 5 0.7485216
## 4 7 0.7565050
## 5 9 0.7596097
## 6 11 0.7636014
## 7 13 0.7658190
## 8 15 0.7647842
## 9 17 0.7662626
## 10 19 0.7640449
## 11 21 0.7643406
#17 Use ggplot to show the output and comment on the k to select
ggplot(knn_different_k_comdf, aes(x = k_value, y = accuracy)) +
geom_line(color = "blue", size = 1.5) +
geom_point(size = 3) +
labs(title = 'K Value Versus Overall Model Accuracy',
x = "K Value",
y= "Model Accuracy")
Looking at the resulting output, we find a large increase in overall accuracy as k rises to 7; beyond k = 7, further increases in k exhibit diminishing marginal returns, with accuracy improving only slightly (and fluctuating) through k = 21. Since larger k values buy little additional accuracy, we use k = 7, the elbow of the curve, as our working k value for the rest of the analysis.
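For the presentation, it may help to mark the chosen k directly on the chart; a small variant of the plot above with a dashed reference line at k = 7:
ggplot(knn_different_k_comdf, aes(x = k_value, y = accuracy)) +
  geom_line(color = "blue", size = 1.5) +
  geom_point(size = 3) +
  geom_vline(xintercept = 7, linetype = "dashed") +  #<- the k we carry forward
  labs(title = 'K Value Versus Overall Model Accuracy',
       x = "K Value",
       y = "Model Accuracy")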
#18 Rerun the model with "optimal" k
tv_7NN <- knn(train = knn_train[ ,-ncol(subset_knn)],
test = knn_test[ ,-ncol(subset_knn)],
cl = knn_train[, "Commercial"],
k = 7,
use.all = TRUE,
prob = TRUE)
#19 Use the confusion matrix function to measure the quality of the new model
confusionMatrix(as.factor(tv_7NN), as.factor(knn_test[ ,'Commercial']), positive = "1", dnn=c("Prediction", "Actual"), mode = "sens_spec")
## Confusion Matrix and Statistics
##
## Actual
## Prediction -1 1
## -1 1367 571
## 1 1077 3749
##
## Accuracy : 0.7564
## 95% CI : (0.7459, 0.7665)
## No Information Rate : 0.6387
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4473
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8678
## Specificity : 0.5593
## Pos Pred Value : 0.7768
## Neg Pred Value : 0.7054
## Prevalence : 0.6387
## Detection Rate : 0.5543
## Detection Prevalence : 0.7135
## Balanced Accuracy : 0.7136
##
## 'Positive' Class : 1
##
# Confusion matrix with labels
#Using the Caret Library
truth2 <- factor(rep(lvs, times = c(2444, 4320)), #<- 2444 true non-commercials, 4320 true commercials (the column totals above)
levels = rev(lvs))
pred2 <- factor(
c(rep(lvs, times = c(1367, 1077)), rep(lvs, times = c(571, 3749))), levels = rev(lvs))
xtab2 <- table(pred2, truth2)
cm2 <- confusionMatrix(pred2, truth2)
cm2$table
## Reference
## Prediction commercial non commercial
## commercial 3749 1077
## non commercial 571 1367
#Confusion Matrix Image
fourfoldplot(cm2$table)
Again, here we are displaying the confusion matrix in a more visually appealing format. This is meant to help Mr. Rooney better understand the significance of the numbers he is looking at. This has the same information that a normal confusion matrix would show. Here, similar to our confusion matrix from step eleven, we can see that the True Positives and True Negatives make up most of our data, meaning that most of the time, the algorithm was able to predict a commercial vs non commercial correctly.
#20 Summarize the differences in language Mr. Rooney may actually understand. Include a discussion on which approach, k = 3 or k = "optimal", is the better method moving forward for "MEH". Most importantly, draft comments about the overall approach and model quality as it relates to addressing the problem proposed by Ed.
When deciding which model to use, it helps to be explicit about what matters most. Overall accuracy is important, but it is often more important to target specific error rates. For instance, suppose we were using machine learning to predict whether someone has a deadly disease. Although we would like to maximize accuracy, if we were choosing between models we would pick the one that minimizes the false negative rate: it would be extremely detrimental to classify someone as not having the disease (class 0) when they actually do have it (class 1). In that case we would accept some loss in overall accuracy in exchange for higher sensitivity, the true positive rate, because sensitivity measures how many of the truly positive cases we catch. More generally, whether we value high sensitivity or high specificity (the true negative rate) should influence which model we choose.

Applying these concepts to our problem: the premise of our analysis is to use machine learning to accurately classify video segments as commercials or non-commercials so that MEH can make its commercials feel more like the surrounding programming. With this in mind, we may be less interested in overall accuracy, although it is still important to report, and more interested in the sensitivity, i.e. how reliably we recognize true commercials.

Comparing our two models, we fit the data using two different k values, 3 and 7, where k = 7 was identified as the optimal value in the step 17 sub-analysis. Running KNN with both values, we find that k = 7 yields both the higher sensitivity, at 86.78%, and the higher overall accuracy, at 75.64%, so it is the better method moving forward for MEH. One caveat that applies to either model, regardless of the chosen k, is poor specificity: only about 56-57% of true non-commercials are correctly classified as non-commercials (56.87% for k = 3 and 55.93% for k = 7).

That said, the non-commercial class can be just as informative given our marketing goals. If the premise is to blur the line between a commercial and a non-commercial, then what we are really seeking is more ambiguity between the two classes. We can analyze which features drive an observation to be classified as a non-commercial and use that profile as an archetype for creating commercials with non-commercial characteristics. In this instance, the model's interpretability may provide us with more value than its predictive power, and, counter-intuitively, we would eventually want our commercials to be misclassified. We will know our marketing efforts are succeeding if the false negative rate rises over time: if our new commercials take on features comparable to non-commercials while really being commercials, we would expect, and hope, that the model misclassifies them as non-commercials. That mirrors our goal for the viewer's experience: blurring the line between commercials and television.
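For the five-minute overview, a compact side-by-side of the two models, with the figures taken from the confusion matrix output in steps 14 and 19, is likely easier for Mr. Rooney to digest than the full caret printouts:
model_comparison <- data.frame(
  Metric = c("Overall accuracy",
             "Sensitivity (commercials correctly flagged)",
             "Specificity (non-commercials correctly flagged)"),
  k_3 = c(0.7382, 0.8340, 0.5687),
  k_7 = c(0.7564, 0.8678, 0.5593))
formattable(model_comparison)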