Problem

You left your job as a lobbyist because the political environment had become just too toxic to handle. Luckily you landed a job in advertising! Unfortunately, you have a demanding and totally clueless boss. Clueless meaning that he doesn’t understand data science, but he knows he wants it to be used to fix all the company’s problems and you are just the data scientist to do it!

Your company, Marketing Enterprises of Halifax, or “MEH”, is being beaten out by the competition and wants a new way to determine the quality of its commercials. Your boss, Mr. Ed Rooney, would like the company’s commercials to seem more like actual TV shows. So he wants you to develop a “machine learning thing” using the company’s internal data to classify when something is a commercial and when it is not. Mr. Rooney believes the company will then know how to trick potential future customers into thinking their commercials are actually still part of the show, and as a result they will pay more attention and thus buy more of the terrible products “MEH” is supporting (it’s a terrible plan, but you have to make a living).

Given that MEH is producing commercials more or less continuously, you know there will be a need to update the model quite frequently. Also, being a newish data scientist with a clueless boss, you decide to use an accessible approach that you might be able to explain to Mr. Rooney (given several months of dedicated one-on-one time): that approach is k-nearest neighbors.

You’ll also need to document your work extensively, because Mr. Rooney doesn’t know he’s clueless, so he will ask lots of “insightful” questions and require lots of detail that he won’t understand, meaning you’ll need an easy-to-use reference document. Before you get started, you hearken back to the excellent education you received at UVA and use this knowledge to outline roughly 20 steps that need to be completed to build this algo for MEH and Ed; they are documented below…good luck. As always, the most important part is translating your work to actionable insights, so please make sure to be verbose in the explanation required for step 20.

As with the clustering lab, please be prepared to present a five minute overview of your findings.

Step 1: Load in the Data & Apply the Labels

#Load in the data, both the commercial dataset and the labels. You'll need to place the labels on the columns. The dataset "tv_commercial_datasets_CNN_Cleaned.csv" is data collected about the features of commercials on CNN. We can try to predict which segments of video are commercials based on their audio and video components. More information on the datasets can be found on data.world:
# https://data.world/kramea/tv-commercial-detection/workspace/file?filename=tv_commercial_datasets%2FBBC_Cleaned.csv
#You can use the function colnames() to apply the labels (hint: you might need to reshape the labels to make this work)

# Packages used throughout: class for knn(), caret for confusionMatrix(),
# and ggplot2 for the plot in Step 17
library(class)
library(caret)
library(ggplot2)

setwd("/cloud/project/KNN")
labels.cnn <- read.csv("cnn_commercial_label.csv")
# Build the column-name vector from the labels file, then name the final column "label"
labels.vector <- c("shot_length",labels.cnn[[1]][1:19])
labels.vector[20] <- "label"

cnn.data <- read.csv("tv_commercial_datasets_CNN_Cleaned.csv")
colnames(cnn.data) <- labels.vector
colnames(cnn.data)
##  [1] "shot_length"           "motion_distr_mn"       "motion_distr_var"     
##  [4] "frame_diff_dist_mn"    "frame_diff_dist_var"   "short_time_energy_mn" 
##  [7] "short_time_energy_var" "zcr_mn"                "zcr_var"              
## [10] "spectral_centroid_mn"  "spectral_centroid_var" "spectral_roll_off_mn" 
## [13] "spectral_roll_off_var" "spectral_flux_mn"      "spectral_flux_var"    
## [16] "fundamental_freq_mn"   "fundamental_freq_var"  "motion_dist_mn"       
## [19] "motion_dist_var"       "label"

The above code reads in the two data sets and inserts the labels from the labels CSV as the column names of the data set of interest, printing the result to confirm the labels were applied correctly.
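
As a quick sanity check on the load, the dimensions can be confirmed directly (a sketch; the 20 columns come from the printout above, and the row total is implied by the train/test counts reported in Step 8):

dim(cnn.data)   # expect 22545 rows and 20 columns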

Step 2: Determine the split between Commercial & Non-Commercial

# Determine the split between commercial and non-commercial then calculate the base rate, assume 1 is the commercial label and -1 is the non-commercial label 

total.obs <- nrow(cnn.data)
n_commercials <- length(which(cnn.data$label==1))
n_commercials/total.obs
## [1] 0.6392105

Approximately 63.92% of the observations in the label column have values equal to one. In the context of the problem, this informs us that around 64% of the observations contained in the data set are commercials. Thus, if we define a commercial as a “success,” we can establish the base rate, or the prior probability of a “success” in the data, to be .6392.
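
As a quick cross-check (a sketch, assuming cnn.data is loaded and labeled as in Step 1), prop.table() shows both sides of the split at once:

# Class split as proportions; the "1" entry should match the base rate above
prop.table(table(cnn.data$label))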

Step 3: Remove Columns with Var

# Since there are columns that contain different metrics for the same variable (i.e. any column that ends in 'mn' is the mean of that variable, while any column that ends in 'var' is the variance of that variable), we don't need to keep both, drop all the columns that include var

columns.with.var <- which(grepl("var", colnames(cnn.data)))
cnn.cleaned <- cnn.data[,-columns.with.var]
ncol(cnn.cleaned)
## [1] 11

The above code uses the grepl() function to identify which column names contain “var” and removes those columns from the data, since we know that these variance columns measure the same underlying features as their mean counterparts and would only add redundant information to the model.
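
One hedge worth noting: grepl("var") matches the substring anywhere in a name. With this data set that is safe, but a stricter pattern anchored to the end of the name is cheap insurance (a sketch):

# Drop only columns whose names end in "_var"
columns.with.var.strict <- grep("_var$", colnames(cnn.data))
cnn.cleaned.alt <- cnn.data[, -columns.with.var.strict]
ncol(cnn.cleaned.alt)   # should also be 11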

Step 4: Finding Correlations Among Variables

# Before we run knn, sometimes it's good to check to make sure that our variables are not highly correlated. Use the cor() function on 'your_dataframe', label it 'commercial_correlations', and view the data.

commercial_correlations <- cor(cnn.cleaned)
commercial_correlations
##                      shot_length motion_distr_mn frame_diff_dist_mn
## shot_length           1.00000000    -0.147625726       -0.149212625
## motion_distr_mn      -0.14762573     1.000000000        0.715703132
## frame_diff_dist_mn   -0.14921262     0.715703132        1.000000000
## short_time_energy_mn  0.02648501    -0.007160132       -0.023968229
## zcr_mn                0.19090475    -0.052205970       -0.042074142
## spectral_centroid_mn  0.36441928    -0.179064774       -0.298912770
## spectral_roll_off_mn  0.38018472    -0.221524744       -0.384839454
## spectral_flux_mn      0.10231118    -0.019123107        0.006196663
## fundamental_freq_mn   0.29268568    -0.096980636       -0.090672211
## motion_dist_mn        0.21214421    -0.757645267       -0.645178936
## label                -0.27210692     0.053938038       -0.047456520
##                      short_time_energy_mn      zcr_mn spectral_centroid_mn
## shot_length                   0.026485006  0.19090475            0.3644193
## motion_distr_mn              -0.007160132 -0.05220597           -0.1790648
## frame_diff_dist_mn           -0.023968229 -0.04207414           -0.2989128
## short_time_energy_mn          1.000000000 -0.12505793            0.3087839
## zcr_mn                       -0.125057928  1.00000000            0.3089015
## spectral_centroid_mn          0.308783942  0.30890154            1.0000000
## spectral_roll_off_mn          0.160313488  0.03308083            0.8092628
## spectral_flux_mn              0.823463249 -0.05336937            0.2834196
## fundamental_freq_mn           0.022512016  0.53355483            0.4190101
## motion_dist_mn                0.031412848  0.06741515            0.3139645
## label                         0.108814835 -0.25376348           -0.2734260
##                      spectral_roll_off_mn spectral_flux_mn fundamental_freq_mn
## shot_length                    0.38018472      0.102311184          0.29268568
## motion_distr_mn               -0.22152474     -0.019123107         -0.09698064
## frame_diff_dist_mn            -0.38483945      0.006196663         -0.09067221
## short_time_energy_mn           0.16031349      0.823463249          0.02251202
## zcr_mn                         0.03308083     -0.053369373          0.53355483
## spectral_centroid_mn           0.80926285      0.283419648          0.41901010
## spectral_roll_off_mn           1.00000000      0.168405909          0.32491709
## spectral_flux_mn               0.16840591      1.000000000          0.23615929
## fundamental_freq_mn            0.32491709      0.236159291          1.00000000
## motion_dist_mn                 0.38976247      0.036431956          0.12825129
## label                         -0.24451480     -0.140024993         -0.39495599
##                      motion_dist_mn       label
## shot_length              0.21214421 -0.27210692
## motion_distr_mn         -0.75764527  0.05393804
## frame_diff_dist_mn      -0.64517894 -0.04745652
## short_time_energy_mn     0.03141285  0.10881483
## zcr_mn                   0.06741515 -0.25376348
## spectral_centroid_mn     0.31396448 -0.27342595
## spectral_roll_off_mn     0.38976247 -0.24451480
## spectral_flux_mn         0.03643196 -0.14002499
## fundamental_freq_mn      0.12825129 -0.39495599
## motion_dist_mn           1.00000000 -0.04520322
## label                   -0.04520322  1.00000000

In this step, we created a correlation matrix to check the data for multicollinearity, which occurs when predictor variables are highly correlated with one another. If we see high correlation between our explanatory variables, we will want to consider removing some of them from the model: correlated predictors effectively double-count the same information in KNN’s distance calculations, giving that information extra weight without adding new signal.
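
Rather than scanning the matrix by eye, a small helper can list every pair exceeding the threshold directly (a sketch, using the |r| > .7 cutoff we adopt in Step 5; note that it surfaces every such pair, including ones where each variable has only a single partner):

# List all variable pairs with |r| > .7 (upper triangle only, to avoid duplicates)
high.cor <- which(abs(commercial_correlations) > .7 &
                    upper.tri(commercial_correlations), arr.ind = TRUE)
data.frame(var1 = rownames(commercial_correlations)[high.cor[, 1]],
           var2 = colnames(commercial_correlations)[high.cor[, 2]],
           r = round(commercial_correlations[high.cor], 2))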

Step 5: Discuss Which Variables to Remove

The only variable that is highly correlated (as defined by a linear correlation coefficient of > .7 in absolute value) with more than one other variable is motion_distr_mn, which is highly correlated with frame_diff_dist_mn (r = .72) and motion_dist_mn (r = -.76).

Step 6: Subset the Dataframe to Remove Correlated Variables

# Subset the dataframe based on above.

cnn.knn <- cnn.cleaned[,-which(colnames(cnn.cleaned)=="motion_distr_mn")]
cnn.knn$label <- as.factor(cnn.knn$label)

Since, as we discussed above, the variable motion_distr_mn is highly correlated with more than one other explanatory variable, we have chosen to remove it from the data set to avoid the problem of multicollinearity in our KNN procedure. The above code subsets the previous data frame to exclude the column motion_distr_mn. Additionally, before dividing our data into training and test sets, we have recoded our binary variable of interest, label, as a factor. Since 1 and -1 are somewhat unusual values for a binary variable to take on, R had previously recognized the column as an integer variable, so we are simply altering the way R views the variable, not making any substantive change to the data.
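
A quick check confirms both changes took effect (a sketch):

levels(cnn.knn$label)   # should print "-1" "1"
ncol(cnn.knn)           # 10 columns: nine predictors plus label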

Step 7: Creating the Training/Test Split Index

# Now we have our data and are ready to run the KNN, but we need to split into test and train. Create an index that will divide the data into a 70/30 split

set.seed(1982)
sample.data.cnn <- sample.int(nrow(cnn.cleaned), 
                        floor(.7*nrow(cnn.cleaned)), replace = F)

The above code randomly selects integers in a range from 1 to the number of rows in the data set. We have specified that the number of integers selected equal the floor of 70% of the total number of observations, that is, 70% rounded down to the nearest whole number. Since this is a randomized procedure, we have used the set.seed() function to allow ourselves to replicate this result.
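
Two quick properties of the index are worth confirming for Mr. Rooney’s benefit (a sketch; the 15,781 figure matches the training-row count reported in Step 8):

length(sample.data.cnn)                                      # floor(.7 * 22545) = 15781
length(unique(sample.data.cnn)) == length(sample.data.cnn)   # TRUE: sampling without replacement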

Step 8: Generating the Training & Test Sets Using Split Index

# Use the index above to generate a train and test sets, then check the row counts to be safe and show Mr. Rooney.

# Split the predictor columns by the index, then standardize each column
train.cnn <- cnn.knn[sample.data.cnn, -ncol(cnn.knn)]
train.cnn <- apply(train.cnn, 2, function(x) scale(x))
test.cnn <- cnn.knn[-sample.data.cnn, -ncol(cnn.knn)]
test.cnn <- apply(test.cnn, 2, function(x) scale(x))

# Keep the class labels in separate vectors, as knn() expects
train.cnn.class <- cnn.knn[sample.data.cnn, ncol(cnn.knn)]
test.cnn.class <- cnn.knn[-sample.data.cnn, ncol(cnn.knn)]

display.mat.split <- rbind(c(nrow(train.cnn),nrow(test.cnn)),
                           c(nrow(train.cnn)/nrow(cnn.knn),
                             nrow(test.cnn)/nrow(cnn.knn)))
display.mat.split[2,] <- round(display.mat.split[2,],1)
rownames(display.mat.split) <- c("Number of Obs", "Percent of Total Obs")
colnames(display.mat.split) <- c("Training Data", "Test Data")
display.mat.split
##                      Training Data Test Data
## Number of Obs              15781.0    6764.0
## Percent of Total Obs           0.7       0.3

The above code creates 4 new sets of data. The first two are the training and test sets of predictor variables: since in the previous stage we created an index of integers corresponding to 70% of the rows in the data, we make those rows the training set and the remaining 30% of rows the test set. Additionally, we scale each variable to a common range of values, since having variables on the same scale is an important requirement for the KNN procedure to work properly (otherwise, variables measured in larger units would dominate the distance calculations). Finally, since the knn() function expects the predictor variables and the response variable as separate objects, we have removed the label column from the training and test predictor sets and created two new vectors, train.cnn.class and test.cnn.class, which contain the values of the response variable for the training and test sets respectively.
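
One methodological aside: the code above standardizes the training and test sets independently, each with its own means and standard deviations. With samples this large the difference is negligible, but the stricter convention is to scale the test set with the training set’s statistics, so no information from the test data leaks into the preprocessing. A sketch of that alternative:

# Scale the test predictors using the *training* means and standard deviations
train.raw <- cnn.knn[sample.data.cnn, -ncol(cnn.knn)]
test.raw  <- cnn.knn[-sample.data.cnn, -ncol(cnn.knn)]
test.cnn.alt <- scale(test.raw,
                      center = colMeans(train.raw),
                      scale  = apply(train.raw, 2, sd))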

Step 9: Train the Classifier using k = 3

# Train the classifier using k = 3, remember to set.seed so you can repeat the output and to use the labels as a vector for the class (not an index of the dataframe)

set.seed(1982)
cnn.3nn <-  knn(train = train.cnn,
               test = test.cnn,
               cl = train.cnn.class,
               k = 3, 
               use.all = TRUE,
               prob = TRUE)

After doing all the requisite data cleaning and exploratory data analysis, it is time for us to run the KNN procedure. The code above trains a KNN classifier on our training data and uses it to make classification predictions on our test data. As when we were creating our training/test split, it is important to use set.seed() so that we can replicate our results when needed.

Step 10: Check the Output

# Check the output using str and length just to be sure it worked

str(cnn.3nn)
##  Factor w/ 2 levels "-1","1": 2 2 2 2 2 2 2 2 2 2 ...
##  - attr(*, "prob")= num [1:6764] 1 0.667 1 0.667 1 ...
length(cnn.3nn)
## [1] 6764
length(cnn.3nn)==nrow(test.cnn)
## [1] TRUE

As a sanity check, we want to make sure that the procedure ran properly, so we have printed out two different ways to view the output. In the str() printout, we hope to see a vector of binary classification predictions for the test data, and in the length() printout, we should see that the length of the KNN object is equal to the number of rows in the test data set. Thus, we have included a third line of code to verify that these two lengths are equal. As we can see in the output, the procedure appears to have worked properly, since each of the three things we were looking for appears to be there.
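
As a side note, the prob = TRUE argument from Step 9 is what produced the “prob” attribute visible in the str() output: it records the proportion of neighbor votes behind each winning prediction, and can be pulled out directly (a sketch):

# Vote share for the winning class; with two classes and k = 3 this is typically 2/3 or 1
summary(attr(cnn.3nn, "prob"))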

Step 11: Create an Initial Confusion Matrix

# Create an initial confusion matrix using the table function and pass it to an object. (xx <- your confusion matrix)
cnn.confusion.matrix <- table(test.cnn.class,cnn.3nn)
cnn.confusion.matrix
##               cnn.3nn
## test.cnn.class   -1    1
##             -1 1622  802
##             1   500 3840

The above code creates a confusion matrix, the purpose of which is to allow us to view, in a simple and easy-to-understand output, how our KNN procedure performed in classifying the data. The columns of the table represent how our procedure classified each observation in the test data set, whereas the rows correspond to the actual values of label for each observation in the test set.

Step 12: Select the True Positives and True Negatives

# Select the true positives and true negatives by selecting only the cells where the row and column names are the same.

true.negatives.cnn <- cnn.confusion.matrix[1,1]
true.positives.cnn <- cnn.confusion.matrix[2,2]
output.tp.tn <- c(true.negatives.cnn, true.positives.cnn)
names(output.tp.tn) <- c("True Negatives", "True Positives")
output.tp.tn
## True Negatives True Positives 
##           1622           3840

As discussed above, the rows correspond to the actual values of label from the test data, and the columns represent our model’s predicted classification for each observation in the test set. Thus, we will define a true negative as any observation that had a -1 value for label, and for which our model predicted a -1 value. In the confusion matrix, the number of true negatives will be the 1,622 number in the first row and first column of the data frame. Similarly, a true positive will be defined as any observation for which the value of label was 1, and for which our model predicted a value of 1. In the confusion matrix, the number of true positives will be the 3,840 number in the bottom right of the data frame (second row, second column).
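
Since Step 13 also leans on the two off-diagonal cells, it is convenient to name them now as well (a sketch following the same row/column logic):

false.positives.cnn <- cnn.confusion.matrix[1, 2]   # actual -1, predicted 1 (802)
false.negatives.cnn <- cnn.confusion.matrix[2, 1]   # actual 1, predicted -1 (500)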

Step 13: Calculate the Overall Accuracy of Predictions

# Calculate the accuracy rate by dividing the correct classifications by the total number of classifications. Label the data 'kNN_acc_com', and view it. Comment on how this compares to the base rate.

kNN_acc_cnn <- (true.negatives.cnn+true.positives.cnn)/
  sum(cnn.confusion.matrix)
kNN_acc_cnn
## [1] 0.8075103

We calculate the accuracy rate by adding the number of true positives and true negatives and dividing by the total number of observations tested. From the confusion matrix, the total is the sum of the true positives and true negatives plus the false negatives (observations that the model classified as -1 despite truly having a value of 1) and the false positives (observations that the model classified as 1 despite truly having a value of -1). The code above performs that relatively simple task, and we can see that our model returns an accuracy rate of around 80.75%. Since the base rate was 63.92%, this model significantly improves our chances of successfully identifying an observation as a commercial, as compared to randomly assigning values based upon the known prior probabilities.
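
The same figure can be verified by hand straight from the confusion matrix (a sketch):

# (true negatives + true positives) / all test observations
(1622 + 3840) / (1622 + 802 + 500 + 3840)   # 0.8075, matching kNN_acc_cnn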

Step 14: Run the Confusion Matrix Function

# Run the confusion matrix function and comment on the model output
confusionMatrix(as.factor(cnn.3nn), as.factor(test.cnn.class), positive = "1", 
                dnn=c("Prediction", "Actual"), mode = "sens_spec")
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction   -1    1
##         -1 1622  500
##         1   802 3840
##                                           
##                Accuracy : 0.8075          
##                  95% CI : (0.7979, 0.8168)
##     No Information Rate : 0.6416          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5696          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8848          
##             Specificity : 0.6691          
##          Pos Pred Value : 0.8272          
##          Neg Pred Value : 0.7644          
##              Prevalence : 0.6416          
##          Detection Rate : 0.5677          
##    Detection Prevalence : 0.6863          
##       Balanced Accuracy : 0.7770          
##                                           
##        'Positive' Class : 1               
## 

The sensitivity of this model, also known as the true positive rate, or in the context of the problem, the proportion of commercials that are successfully identified as commercials, is actually pretty good at .8848, or about 88.48%. For the problem we are trying to solve at MEH, this seems to be the single most important measure of accuracy: to accomplish Mr. Rooney’s goal, we need to successfully identify commercials so that we know which ones to alter to become more like the TV shows. Of course, we also would like our model to be as accurate as possible in identifying non-commercials, but if an item is already a non-commercial, we are less concerned with knowing everything about it, since our focus is on altering our marketing campaign. Additionally, a hypothesis test indicates that there is significant evidence that the classification predictions of our model represent a significant improvement over the “no information rate,” which is simply the base rate observed in the test data. In this case, that rate is in line with the base rate from the overall data set: the function reports a No Information Rate of .6416, compared to our overall base rate of .6392.
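
For Mr. Rooney’s benefit, the two headline rates can be recomputed by hand from the counts in Step 11 (a sketch):

# Sensitivity: correctly identified commercials over all actual commercials
3840 / (3840 + 500)    # 0.8848
# Specificity: correctly identified non-commercials over all actual non-commercials
1622 / (1622 + 802)    # 0.6691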

Step 15: Run ChooseK to test Overall Accuracy at Different K

# Run the "chooseK" function to find the perfect K, while using sapply() function on chooseK() to test k from 1 to 21 (only selecting the odd numbers), and set the train_set argument to 'commercial_train', val_set to 'commercial_test', train_class to the "label"   column of 'commercial_train', and val_class to the "label" column of 'commercial_test'. Label this  "knn_diff_k_com"

chooseK <- function(k, train_set, val_set, train_class, val_class){
  # Reset the seed for every k so each run is comparable with the Step 9 model
  set.seed(1982)
  class_knn <- knn(train = train_set,
                   test = val_set,
                   cl = train_class,
                   k = k,
                   use.all = TRUE)
  conf_mat <- table(class_knn, val_class)
  # Overall accuracy: the diagonal (correct classifications) over all predictions
  accu <- sum(conf_mat[row(conf_mat) == col(conf_mat)]) / sum(conf_mat)
  cbind(k = k, accuracy = accu)
}

knn_different_k.cnn <- sapply(seq(1, 21, by = 2),
                              function(x) 
                                chooseK(x,
                                        train_set=train.cnn, val_set = test.cnn,
                                        train_class=train.cnn.class,
                                        val_class = test.cnn.class))
knn_different_k.cnn
##           [,1]      [,2]     [,3]      [,4]      [,5]       [,6]       [,7]
## [1,] 1.0000000 3.0000000 5.000000 7.0000000 9.0000000 11.0000000 13.0000000
## [2,] 0.7893258 0.8075103 0.816233 0.8177114 0.8172679  0.8191898  0.8212596
##           [,8]       [,9]      [,10]      [,11]
## [1,] 15.000000 17.0000000 19.0000000 21.0000000
## [2,]  0.819042  0.8178593  0.8180071  0.8188941

The above code chunk defines a function given to us by our data science professor, Brian Wright. The function takes a single value of K (where K is the number of nearest neighbors in the training data used to make a prediction for each observation in the test data) and outputs the overall accuracy of a KNN model fitted on the data of interest with that K; the sapply() call then applies it across the odd values of K from 1 to 21. This will allow us to find an ideal K. As a note, it is important to set.seed() within our function to the same seed value we had used before in our KNN procedure, so as to ensure that we get the same results as our previous model. We want our procedure to be as replicable and consistent as possible, especially to help Mr. Rooney understand and accept our results.
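
To see the shape of what sapply() collects, here is a single call of the function on its own (a sketch; the accuracy should match the k = 5 entry in the matrix above):

# One evaluation at k = 5; returns a 1 x 2 matrix of (k, accuracy)
chooseK(5, train_set = train.cnn, val_set = test.cnn,
        train_class = train.cnn.class, val_class = test.cnn.class)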

Step 16: Use Data Frame to Visualize Results of ChooseK Procedure

# Create a dataframe so we can visualize the difference in accuracy based on K, convert the matrix to a dataframe

knn_different_k.cnn.df <- data.frame(k = knn_different_k.cnn[1,],
                             accuracy = knn_different_k.cnn[2,])

knn_different_k.cnn.df
##     k  accuracy
## 1   1 0.7893258
## 2   3 0.8075103
## 3   5 0.8162330
## 4   7 0.8177114
## 5   9 0.8172679
## 6  11 0.8191898
## 7  13 0.8212596
## 8  15 0.8190420
## 9  17 0.8178593
## 10 19 0.8180071
## 11 21 0.8188941

Here we are simply taking the data on the accuracy rates for each K from Step 15 and making it easier to digest and view. As we can see in the data frame, and will be able to visualize more easily in the graph we will create momentarily, the overall classification accuracy rises sharply as K increases from 1, then levels off and fluctuates slightly around .82. This is because a very small K (such as K = 1) allows individual noisy training points to drive each prediction, effectively overfitting the training data. Increasing K averages each prediction over more neighbors, making it more stable; however, if K grows too large, the procedure oversmooths, pulling predictions toward the majority class and eventually costing accuracy on the test observations.

Step 17: Use ggplot to Visualize the Ideal K

# Use ggplot to show the output and comment on the k to select

ggplot(knn_different_k.cnn.df,
       aes(x = k, y = accuracy)) +
  geom_line(color = "orange", size = 1.5) +
  geom_point(size = 3)

This plot provides an even easier way to visualize the results of the chooseK procedure. As we can clearly see, there is one K for which the overall accuracy of our model is higher than all others. From the graph, this appears to be K = 13, which is the only K for which the overall model accuracy is above .82. A quick sanity check against the data frame produced in Step 16 confirms that K = 13 does indeed maximize the model accuracy, so we will identify K = 13 as the “optimal” K.
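
That sanity check can also be done programmatically (a sketch):

# Row of the data frame with the highest accuracy; should return k = 13
knn_different_k.cnn.df[which.max(knn_different_k.cnn.df$accuracy), ]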

Step 18: Rerun the model with the Optimal K

# Rerun the model  with "optimal" k

set.seed(1982)
cnn.13nn <-  knn(train = train.cnn,
               test = test.cnn,
               cl = train.cnn.class,
               k = 13, 
               use.all = TRUE,
               prob = TRUE)

Here we rerun the exact same procedure from Step 9, only with K = 13, which we established as the “optimal” K for which the classification accuracy is the highest. Note again that we set the seed equal to 1982 to ensure that our results are replicable and consistent with the prior procedures.

Step 19: Create a Confusion Matrix & Calculate Accuracy with Optimal K

# Use the confusion matrix function to measure the quality of the new model

cnn.confusion.matrix.13 <- table(test.cnn.class,cnn.13nn)
cnn.confusion.matrix.13
##               cnn.13nn
## test.cnn.class   -1    1
##             -1 1599  825
##             1   384 3956
kNN_acc_cnn.13 <- (cnn.confusion.matrix.13[1,1]+cnn.confusion.matrix.13[2,2])/
  sum(cnn.confusion.matrix.13)
kNN_acc_cnn.13
## [1] 0.8212596

Here we create a confusion matrix for the new KNN model with K = 13 and calculate the overall accuracy of the model. As we can see, the overall model accuracy calculated here is the same as was calculated in the chooseK procedure, which would not be the case had we not consistently set the seed to the same value throughout (a quick check would also reveal that the overall accuracy for K = 3 as calculated by the chooseK function is identical to the accuracy of the original model as calculated in Step 13). This new KNN procedure has an overall accuracy rate of .8213, or 82.13%, a meaningful improvement over the 80.75% accuracy of the procedure with K = 3.
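
To put the improvement in concrete terms for Mr. Rooney (a sketch, using the counts from the two confusion matrices):

kNN_acc_cnn.13 - kNN_acc_cnn        # about .0137
(3956 - 3840) + (1599 - 1622)       # 116 more true positives, 23 fewer true negatives: 93 net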

Step 20: Summarize the Procedure for Mr. Rooney and Give Recommendations

# Summarize the differences in language Mr. Rooney may actually understand. Include a discussion on which approach, k=3 or k="optimal", is the better method moving forward for "MEH". Most importantly, draft comments about the overall approach and model quality as it relates to addressing the problem proposed by Ed.

confusionMatrix(as.factor(cnn.13nn), as.factor(test.cnn.class), positive = "1", 
                dnn=c("Prediction", "Actual"), mode = "sens_spec")
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction   -1    1
##         -1 1599  384
##         1   825 3956
##                                           
##                Accuracy : 0.8213          
##                  95% CI : (0.8119, 0.8303)
##     No Information Rate : 0.6416          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5951          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9115          
##             Specificity : 0.6597          
##          Pos Pred Value : 0.8274          
##          Neg Pred Value : 0.8064          
##              Prevalence : 0.6416          
##          Detection Rate : 0.5849          
##    Detection Prevalence : 0.7068          
##       Balanced Accuracy : 0.7856          
##                                           
##        'Positive' Class : 1               
## 

The above code simply outputs the confusion matrix function from Step 14, which we feel is helpful in comparing the two KNN procedures. As we can see from the output, the procedure with K = 13 performed better overall. Specifically, this optimal model had better sensitivity, or true positive rate, than the previous model (91.15% as compared to 88.48% with K = 3).

As we discussed before, since we are a marketing agency and not a production company, our focus is on the commercials rather than the non-commercials. In this vein, Mr. Rooney would likely chalk up false positives as poorly made TV shows. Thus, we recommend focusing on the True Positives and False Negatives (that is, the correctly classified commercials and the commercials classified as TV shows). Increasing our model’s ability to distinguish between these two will be important, since our recommendation to Mr. Rooney is to examine the False Negatives, which in the case of an accurate model actually attain Mr. Rooney’s goal, and try to alter our commercials to be more like those.

Analyzing TV shows themselves could be useful, but on a basic level TV shows and commercials are fundamentally different, so we feel this could be futile. Instead, our team recommends fitting as accurate a model as possible and then identifying the common characteristics among the (hopefully few) instances of False Negative classifications, to try to make our commercials as similar to non-commercials as possible. If this is the strategy, having a high overall accuracy rate and a high sensitivity are critically important, so we would recommend using a KNN procedure with K = 13, as this produces a higher overall accuracy and sensitivity as compared to the same procedure with K = 3.
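
As a concrete starting point for that recommendation, the test-set commercials that the K = 13 model mistook for shows can be pulled out for closer study (a sketch, reusing the objects from Steps 7 and 18):

# Commercials (label 1) that the model classified as non-commercials (-1)
test.rows <- cnn.knn[-sample.data.cnn, ]
show.like.commercials <- test.rows[test.cnn.class == 1 & cnn.13nn == -1, ]
nrow(show.like.commercials)     # should be 384, per the confusion matrix above
summary(show.like.commercials)  # feature profile of the commercials Ed wants more of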