You left your job as a lobbyist because the political environment was becoming just too toxic to handle. Luckily, you landed a job in advertising! Unfortunately, you have a demanding and totally clueless boss. Clueless meaning that he doesn't understand data science, but he knows he wants it used to fix all the company's problems, and you are just the data scientist to do it!
Your company, Marketing Enterprises of Halifax or "MEH", is being beaten out by the competition and wants a new way to determine the quality of its commercials. Your boss, Mr. Ed Rooney, would like the company's commercials to seem more like actual TV shows. So he wants you to develop a "machine learning thing" using the company's internal data to classify when something is a commercial and when it is not. Mr. Rooney believes the company will then know how to trick potential future customers into thinking its commercials are actually still part of the show; as a result, viewers will pay more attention and buy more of the terrible products "MEH" is promoting (it's a terrible plan, but you have to make a living).
Given that MEH produces commercials more or less continuously, you know the model will need to be updated quite frequently. Also, being a newish data scientist with a clueless boss, you decide to use an accessible approach that you might be able to explain to Mr. Rooney (given several months of dedicated one-on-one time): k-nearest neighbors.
You'll also need to document your work extensively, because Mr. Rooney doesn't know he's clueless, so he will ask lots of "insightful" questions and require lots of detail that he won't understand; you'll need an easy-to-use reference document. Before you get started, you hearken back to the excellent education you received at UVA and, using this knowledge, outline roughly 20 steps that need to be completed to build this algo for MEH and Ed. They are documented below…good luck. As always, the most important part is translating your work into actionable insights, so please be verbose in the explanation required for step 20.
As with the clustering lab, please be prepared to present a five minute overview of your findings.
Since Mr. Rooney has no experience with machine learning, here we have an oversimplified visual of how KNN functions. We can see that our data consist of commercial and non-commercial inputs. Given an unknown input, our algorithm should be able to accurately determine whether the unknown belongs to the commercial or the non-commercial grouping. As the value of k increases, the circle in the image grows to include more neighbors. As we implement the KNN algorithm step by step, this illustration will become clearer to Mr. Rooney.
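For completeness, the intuition in the picture can be reproduced with a few lines of R. This is a toy illustration only (the points and the unknown input below are made up for demonstration, not MEH data), but it uses the same knn() function from the class package that we rely on later:
library(class)
# Toy training data: two features, with 1 = commercial and -1 = non-commercial
toy_train  <- rbind(c(1, 1), c(1.2, 0.8), c(0.9, 1.1),   #<- commercial cluster
                    c(3, 3), c(2.8, 3.2), c(3.1, 2.9))   #<- non-commercial cluster
toy_labels <- c(1, 1, 1, -1, -1, -1)
# An unknown input near the commercial cluster: with k = 3, its three nearest
# neighbors are all commercials, so knn() labels it "1"
unknown <- rbind(c(1.1, 1.0))
knn(train = toy_train, test = unknown, cl = toy_labels, k = 3)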
#1
#Load in the data, both the commercial dataset and the labels. You'll need to place the labels on the columns. The dataset "tv_commercial_datasets_CNN_Cleaned.csv" contains data collected about the features of commercials aired on CNN. We can try to predict which segments of video are commercials based on their audio and video components. More information on the datasets can be found at data.world:
# https://data.world/kramea/tv-commercial-detection/workspace/file?filename=tv_commercial_datasets%2FBBC_Cleaned.csv
#You can use the function colnames() to apply the labels (hint: you might need to reshape the labels to make this work)
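Before loading the files, it helps to load the packages the rest of the chunks lean on. These are simply the libraries whose functions appear later in this document: class for knn(), caret for confusionMatrix(), dplyr for the %>% pipelines, formattable for the step 5 table, and ggplot2 for the step 17 plot.
library(class)        #<- knn()
library(caret)        #<- confusionMatrix(), createDataPartition()
library(dplyr)        #<- %>%, select(), filter(), arrange()
library(formattable)  #<- formattable()
library(ggplot2)      #<- ggplot()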
setwd("/cloud/project/KNN")
tv_comm <- read.csv('tv_commercial_datasets_CNN_Cleaned.csv', stringsAsFactors = FALSE)
cnn <- t(read.csv('cnn_commmercial_label.csv', header = FALSE)) #<- reshape the label file so the labels can be applied as column names
colnames(tv_comm) <- cnn
head(tv_comm)
## shot_length motion_distr_mn motion_distr_var frame_diff_dist_mn
## 1 29 3.821209 1.567568 13.547628
## 2 25 3.052969 1.641484 22.334589
## 3 82 1.601274 1.508805 5.860583
## 4 25 4.819368 2.879584 41.382828
## 5 29 2.768753 1.797319 13.338054
## 6 25 3.679925 2.145892 27.195999
## frame_diff_dist_var short_time_energy_mn short_time_energy_var zcr_mn
## 1 7.242389 0.019883 0.012195 0.067241
## 2 15.734018 0.023027 0.010731 0.077000
## 3 3.301121 0.025948 0.006956 0.082317
## 4 24.448074 0.014387 0.007596 0.069875
## 5 9.980667 0.011506 0.007269 0.100647
## 6 21.484812 0.012956 0.007638 0.068125
## zcr_var spectral_centroid_mn spectral_centroid_var spectral_roll_off_mn
## 1 0.049107 3406.866 1363.9906 6796.552
## 2 0.045884 3324.158 1452.0208 6610.000
## 3 0.044845 3771.984 855.7665 7488.112
## 4 0.046916 3301.686 1441.9977 6606.001
## 5 0.067401 3266.021 1432.1149 6688.795
## 6 0.036726 3324.525 1451.8527 6565.001
## spectral_roll_off_var spectral_flux_mn spectral_flux_var fundamental_freq_mn
## 1 2719.627 1021.3592 940.7424 102.60780
## 2 2885.445 1199.3644 821.1235 84.68770
## 3 1697.351 1544.1475 807.1373 92.69461
## 4 2883.915 599.5552 535.1520 96.59550
## 5 2678.394 446.4524 411.4647 95.03695
## 6 2866.239 637.6216 591.3987 109.96493
## fundamental_freq_var motion_dist_mn motion_dist_var label
## 1 60.17829 0.538928 0.038365 1
## 2 46.89688 0.446286 0.066858 1
## 3 41.18265 0.591414 0.141610 1
## 4 61.86105 0.246416 0.093782 1
## 5 48.36439 0.314103 0.193568 1
## 6 60.04241 0.270214 0.136905 1
#2. Determine the split between commercial and non-commercial then calculate the base rate, assume 1 is the commercial label and -1 is the non-commercial label
split <- table(tv_comm$`label `)
non_comm <- split[1] / sum(split)
comm <- split[2] / sum(split)
data.frame(c('Comm' = comm,'Non-Comm' = non_comm))
## c.Comm...comm...Non.Comm....non_comm.
## Comm.1 0.6392105
## Non-Comm.-1 0.3607895
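Since step 13 compares the model's accuracy against this number, it is worth spelling the base rate out: a naive rule that always guesses "commercial" (the majority class) would already be right about 64% of the time, and that is the figure our KNN model has to beat. A one-line sketch (base_rate is a new name introduced here for convenience and is not used elsewhere):
base_rate <- max(split) / sum(split)  #<- ~0.639, the accuracy of always predicting "commercial"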
#3. Since there are columns that contain different metrics for the same variable (i.e. any column that ends in 'mn' is the mean of that variable, while any column that ends in 'var' is the variance of that variable), we don't need to keep both; drop all the columns that end in 'var'.
tv_comm2 <- tv_comm %>%
select(-ends_with('var'))
scale_comm <- scale(tv_comm2[ ,1:(ncol(tv_comm2)-1)]) #<- standardize every predictor column (KNN is distance-based)
tv_comm2 <- cbind(scale_comm, Commercial = tv_comm2[ ,ncol(tv_comm2)]) #<- reattach the label column, renamed to Commercial
head(tv_comm2)
## shot_length motion_distr_mn frame_diff_dist_mn short_time_energy_mn
## [1,] -0.3257596 0.57845494 0.08964043 0.8929471
## [2,] -0.3396870 0.21609933 1.02213946 1.5786124
## [3,] -0.1412216 -0.46862137 -0.72613217 2.2156442
## [4,] -0.3396870 1.04925640 3.04359652 -0.3056585
## [5,] -0.3257596 0.08204322 0.06739980 -0.9339668
## [6,] -0.3396870 0.51181554 1.53804708 -0.6177408
## zcr_mn spectral_centroid_mn spectral_roll_off_mn spectral_flux_mn
## [1,] -1.2377040 -0.8460014 -0.7181542 -0.2173862
## [2,] -0.9332126 -1.1929107 -1.1892760 0.1153078
## [3,] -0.7673164 0.6854444 1.0283266 0.7597118
## [4,] -1.1555203 -1.2871677 -1.1993765 -1.0057433
## [5,] -0.1954005 -1.4367586 -0.9902869 -1.2918942
## [6,] -1.2101222 -1.1913706 -1.3029188 -0.9345967
## fundamental_freq_mn motion_dist_mn Commercial
## [1,] -0.8728803 0.03034897 1
## [2,] -1.8192935 -0.45323571 1
## [3,] -1.3964250 0.30432217 1
## [4,] -1.1904078 -1.49654296 1
## [5,] -1.2727193 -1.14322161 1
## [6,] -0.4843287 -1.37231909 1
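Beyond dropping the variance columns, the chunk above also standardizes the predictors with scale(). KNN is distance-based, so without this step variables measured in the thousands (e.g. spectral_centroid_mn) would dominate variables like shot_length. A quick sanity check that each scaled column now has mean roughly 0 and standard deviation roughly 1:
round(colMeans(scale_comm), 3)       #<- every column mean should be ~0
round(apply(scale_comm, 2, sd), 3)   #<- every column sd should be ~1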
#4. Before we run knn, sometimes it's good to check to make sure that our variables are not highly correlated. Use the cor() function on 'your_dataframe', label it 'commercial_correlations', and view the data.
commercial_correlations <- cor(tv_comm2)
commercial_correlations
## shot_length motion_distr_mn frame_diff_dist_mn
## shot_length 1.00000000 -0.147625726 -0.149212625
## motion_distr_mn -0.14762573 1.000000000 0.715703132
## frame_diff_dist_mn -0.14921262 0.715703132 1.000000000
## short_time_energy_mn 0.02648501 -0.007160132 -0.023968229
## zcr_mn 0.19090475 -0.052205970 -0.042074142
## spectral_centroid_mn 0.36441928 -0.179064774 -0.298912770
## spectral_roll_off_mn 0.38018472 -0.221524744 -0.384839454
## spectral_flux_mn 0.10231118 -0.019123107 0.006196663
## fundamental_freq_mn 0.29268568 -0.096980636 -0.090672211
## motion_dist_mn 0.21214421 -0.757645267 -0.645178936
## Commercial -0.27210692 0.053938038 -0.047456520
## short_time_energy_mn zcr_mn spectral_centroid_mn
## shot_length 0.026485006 0.19090475 0.3644193
## motion_distr_mn -0.007160132 -0.05220597 -0.1790648
## frame_diff_dist_mn -0.023968229 -0.04207414 -0.2989128
## short_time_energy_mn 1.000000000 -0.12505793 0.3087839
## zcr_mn -0.125057928 1.00000000 0.3089015
## spectral_centroid_mn 0.308783942 0.30890154 1.0000000
## spectral_roll_off_mn 0.160313488 0.03308083 0.8092628
## spectral_flux_mn 0.823463249 -0.05336937 0.2834196
## fundamental_freq_mn 0.022512016 0.53355483 0.4190101
## motion_dist_mn 0.031412848 0.06741515 0.3139645
## Commercial 0.108814835 -0.25376348 -0.2734260
## spectral_roll_off_mn spectral_flux_mn fundamental_freq_mn
## shot_length 0.38018472 0.102311184 0.29268568
## motion_distr_mn -0.22152474 -0.019123107 -0.09698064
## frame_diff_dist_mn -0.38483945 0.006196663 -0.09067221
## short_time_energy_mn 0.16031349 0.823463249 0.02251202
## zcr_mn 0.03308083 -0.053369373 0.53355483
## spectral_centroid_mn 0.80926285 0.283419648 0.41901010
## spectral_roll_off_mn 1.00000000 0.168405909 0.32491709
## spectral_flux_mn 0.16840591 1.000000000 0.23615929
## fundamental_freq_mn 0.32491709 0.236159291 1.00000000
## motion_dist_mn 0.38976247 0.036431956 0.12825129
## Commercial -0.24451480 -0.140024993 -0.39495599
## motion_dist_mn Commercial
## shot_length 0.21214421 -0.27210692
## motion_distr_mn -0.75764527 0.05393804
## frame_diff_dist_mn -0.64517894 -0.04745652
## short_time_energy_mn 0.03141285 0.10881483
## zcr_mn 0.06741515 -0.25376348
## spectral_centroid_mn 0.31396448 -0.27342595
## spectral_roll_off_mn 0.38976247 -0.24451480
## spectral_flux_mn 0.03643196 -0.14002499
## fundamental_freq_mn 0.12825129 -0.39495599
## motion_dist_mn 1.00000000 -0.04520322
## Commercial -0.04520322 1.00000000
cormat <- signif(commercial_correlations, 2) #<- round the correlation values to 2 significant figures for display
col <- colorRampPalette(c("white", "#00203FFF"))(10)
heatmap(cormat, col = col, symm = TRUE)
In order to analyze the correlation values, we can create a heatmap to get a visual understanding. This can be extremely useful for individuals, like Mr. Rooney, who may want to initially refer to a visual to get a good understanding of the data before referring to a chart with all of the correlation values. Here, we can see that the highest correlations have a dark blue color, and the lowest correlations have a white color. Although this heatmap does not show us the specific correlation value, it can help us get a sense of which variables might have to be removed before running KNN.
#5. Determine which variables to remove; high correlations start around .7 and above (or -.7 and below). I would especially remove variables that appear to be correlated with more than one variable. List your rationale here:
commercial_correlations %>% # start with the correlation matrix (from step four)
as.table() %>% as.data.frame() %>%
subset(Var1 != Var2 & abs(Freq)>=0.7) %>% # omit the diagonal and keep only strong correlations (|r| >= 0.7)
filter(!duplicated(paste0(pmax(as.character(Var1), as.character(Var2)), pmin(as.character(Var1), as.character(Var2))))) %>%
# keep only unique occurrences, as.character because Var1 and Var2 are factors
arrange(desc(Freq)) %>% # sort by Freq (aka correlation value)
formattable() #create a visually more appealing chart to refer to
| Var1 | Var2 | Freq |
|---|---|---|
| spectral_flux_mn | short_time_energy_mn | 0.8234632 |
| spectral_roll_off_mn | spectral_centroid_mn | 0.8092628 |
| frame_diff_dist_mn | motion_distr_mn | 0.7157031 |
| motion_dist_mn | motion_distr_mn | -0.7576453 |
We took the matrix from step four and filtered it to show only the unique pairs whose correlation met or exceeded the abs(0.7) threshold. We omitted the diagonal (perfect correlations of abs(1.0)), since those entries simply show the correlation of each variable with itself (e.g. shot_length vs. shot_length) and are not relevant to our analysis.
When looking at the correlation matrix, using abs(0.7) as the threshold for high correlation, we decided to remove the following variables: motion_distr_mn, frame_diff_dist_mn, spectral_centroid_mn, spectral_roll_off_mn, and spectral_flux_mn.
Each of these variables exhibited a high correlation (>= abs(0.7)) with at least one other variable. We used those high correlations as a cue to inspect each variable's relationships with the rest of the predictors, and found that, even where the remaining correlations did not surpass the abs(0.7) threshold, these variables still showed non-trivial correlations with several other variables. Because of these strong relationships with other predictors, we judged them the most viable candidates to drop from the model.
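As a cross-check on this hand-picked list, caret's findCorrelation() helper automates the same idea: given a correlation matrix and a cutoff, it suggests columns to drop. A sketch is below (the label column is excluded from the matrix, and the suggested set will not necessarily match ours exactly, since findCorrelation() breaks ties using mean absolute correlation):
# Automated suggestion of correlated predictors to drop, using |r| >= 0.7
findCorrelation(commercial_correlations[1:10, 1:10], cutoff = 0.7, names = TRUE)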
#6. Subset the dataframe based on above.
subset_knn <- tv_comm2[ ,-c(2,3,6,7,8)] #<- drop the five variables flagged in step 5
head(subset_knn)
## shot_length short_time_energy_mn zcr_mn fundamental_freq_mn
## [1,] -0.3257596 0.8929471 -1.2377040 -0.8728803
## [2,] -0.3396870 1.5786124 -0.9332126 -1.8192935
## [3,] -0.1412216 2.2156442 -0.7673164 -1.3964250
## [4,] -0.3396870 -0.3056585 -1.1555203 -1.1904078
## [5,] -0.3257596 -0.9339668 -0.1954005 -1.2727193
## [6,] -0.3396870 -0.6177408 -1.2101222 -0.4843287
## motion_dist_mn Commercial
## [1,] 0.03034897 1
## [2,] -0.45323571 1
## [3,] 0.30432217 1
## [4,] -1.49654296 1
## [5,] -1.14322161 1
## [6,] -1.37231909 1
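For documentation purposes, and so the subset does not silently break if the column order ever changes, the same step 6 subset can be written by name instead of by position. An equivalent sketch, dropping the five variables identified in step 5 (drop_cols is introduced here just for readability):
drop_cols  <- c("motion_distr_mn", "frame_diff_dist_mn", "spectral_centroid_mn",
                "spectral_roll_off_mn", "spectral_flux_mn")
subset_knn <- tv_comm2[ , !colnames(tv_comm2) %in% drop_cols]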
#7. Now we have our data and are ready to run the KNN, but we need to split into test and train. Create an index that will divide the data into a 70/30 split
set.seed(03092000)
sample.data <- sample(1:nrow(subset_knn),
round(0.7 * nrow(subset_knn), 0), #<- multiply the number of rows by 0.7 and round the decimals
replace = FALSE)
#8. Use the index above to generate train and test sets, then check the row counts to be safe and show Mr. Rooney.
knn_train <- subset_knn[sample.data, ]
knn_test <- subset_knn[-sample.data, ]
size_of_training <- nrow(knn_train)
size_of_total <- nrow(subset_knn)
size_of_test <- nrow(knn_test)
#Verification
paste("The Training Set contains", toString(round(size_of_training/size_of_total,2)*100), "% of the total data")
## [1] "The Training Set contains 70 % of the total data"
paste("The Testing Set contains", toString(round(size_of_test/size_of_total,2)*100), "% of the total data")
## [1] "The Testing Set contains 30 % of the total data"
After indexing the data to generate a train and test set, we can now check the row counts. We see that approximately 70% of the rows are dedicated to the training set, and 30% of the rows are dedicated to the testing set. We can proceed.
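A simple random split like the one above can, by chance, shift the commercial/non-commercial mix between the two sets. If Mr. Rooney asks about that, one alternative (sketched below, not what we ran above) is caret's createDataPartition(), which stratifies the split on the label; the strat_* names are introduced only for this illustration:
set.seed(03092000)
strat_idx       <- createDataPartition(factor(subset_knn[ , "Commercial"]),
                                       p = 0.7, list = FALSE)
knn_train_strat <- subset_knn[ strat_idx, ]
knn_test_strat  <- subset_knn[-strat_idx, ]
prop.table(table(knn_train_strat[ , "Commercial"]))  #<- mix should stay close to 64/36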
#9 Train the classifier using k = 3; remember to set.seed so you can repeat the output, and use the labels as a vector for the class (not an index of the dataframe)
set.seed(03092000)
tv_3NN <- knn(train = knn_train[ ,-ncol(knn_train)],
test = knn_test[ ,-ncol(knn_train)],
cl = knn_train[, "Commercial"],
k = 3,
use.all = TRUE,
prob = TRUE)
#10 Check the output using str and length just to be sure it worked
str(tv_3NN)
## Factor w/ 2 levels "-1","1": 2 1 2 1 2 2 2 2 1 2 ...
## - attr(*, "prob")= num [1:6764] 0.667 0.667 0.667 0.667 1 ...
length(tv_3NN)
## [1] 6764
#11 Create an initial confusion matrix using the table function and pass it to an object. (xx <- your confusion matrix)
conf_mat <- table(tv_3NN, knn_test[ ,"Commercial"])
conf_mat
##
## tv_3NN -1 1
## -1 1390 717
## 1 1054 3603
Above, we created our initial confusion matrix using the table function; the rows are the model's predictions and the columns are the actual labels.
# Confusion matrix with labels
#Using the caret library
lvs <- c("non commercial","commercial")
truth <- factor(rep(lvs, times = c(2444, 4320)), #<- 2444 true non-commercials, 4320 true commercials (the column totals above)
levels = rev(lvs))
pred <- factor(
c(rep(lvs, times = c(1390, 1054)), rep(lvs, times = c(717, 3603))), levels = rev(lvs))
xtab <- table(pred, truth)
cm <- confusionMatrix(pred, truth)
cm$table
## Reference
## Prediction commercial non commercial
## commercial 3603 1054
## non commercial 717 1390
This shows another way to create our initial confusion matrix. This has labels which would help Mr. Rooney better understand what we are looking at (rather than the -1 and 1 labels we had earlier).
fourfoldplot(cm$table)
Since Mr. Rooney is not familiar with machine learning concepts, we may want to create a visual for the confusion matrix that is easier to understand. This visual is another way to display the performance of the algorithm. Here, we can see that the True Positives (predicting a commercial, and having it be a commercial) and the True Negatives (predicting a non-commercial, and having it be a non-commercial) occupy the largest quarters of the circle. This means that most of the time, the algorithm was able to correctly distinguish commercials from non-commercials.
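To make that statement concrete for Mr. Rooney, the headline rates can also be pulled straight out of conf_mat by hand (rows are predictions, columns are actual labels, and 1 = commercial is treated as the positive class). The TP/TN/FP/FN names below are introduced just for this illustration; the resulting figures are the same ones caret reports in step 14:
TP <- conf_mat["1", "1"]    #<- predicted commercial, truly commercial (3603)
TN <- conf_mat["-1", "-1"]  #<- predicted non-commercial, truly non-commercial (1390)
FP <- conf_mat["1", "-1"]   #<- predicted commercial, truly non-commercial (1054)
FN <- conf_mat["-1", "1"]   #<- predicted non-commercial, truly commercial (717)
c(accuracy    = (TP + TN) / sum(conf_mat),  #<- ~0.738
  sensitivity = TP / (TP + FN),             #<- true positive rate, ~0.834
  specificity = TN / (TN + FP))             #<- true negative rate, ~0.569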
#12 Select the true positives and true negatives by selecting only the cells where the row and column names are the same.
conf_mat[row(conf_mat) == col(conf_mat)]
## [1] 1390 3603
#13 Calculate the accuracy rate by dividing the correct classifications by the total number of classifications. Label the data 'kNN_acc_com', and view it. Comment on how this compares to the base rate.
kNN_acc_com <- sum(conf_mat[row(conf_mat) == col(conf_mat)]) / sum(conf_mat)
kNN_acc_com
## [1] 0.7381727
Comparing our accuracy to the base rate, the model performs better, though only by about 10 percentage points (73.8% versus the 63.9% base rate of always guessing "commercial"). This is an indication that the model can likely be improved with parameter tuning; to maximize our accuracy, we will want to find the optimal k value.
#14 Run the confusion matrix function and comment on the model output
confusionMatrix(as.factor(tv_3NN), as.factor(knn_test[ ,'Commercial']), positive = "1", dnn=c("Prediction", "Actual"), mode = "sens_spec")
## Confusion Matrix and Statistics
##
## Actual
## Prediction -1 1
## -1 1390 717
## 1 1054 3603
##
## Accuracy : 0.7382
## 95% CI : (0.7275, 0.7486)
## No Information Rate : 0.6387
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4152
##
## Mcnemar's Test P-Value : 1.415e-15
##
## Sensitivity : 0.8340
## Specificity : 0.5687
## Pos Pred Value : 0.7737
## Neg Pred Value : 0.6597
## Prevalence : 0.6387
## Detection Rate : 0.5327
## Detection Prevalence : 0.6885
## Balanced Accuracy : 0.7014
##
## 'Positive' Class : 1
##
Looking initially at the resulting confusion matrix, we find a moderate overall accuracy of 73.82%. Diving deeper, we now want to look at our sensitivity (true positive rate) and specificity (true negative rate). Looking at the sensitivity, we find that, given that a segment was a commercial (Commercial = 1), 83.40% were correctly classified as commercials. Conversely, given that a segment was not a commercial (Commercial = -1), only 56.87% were correctly classified as non-commercial. Although this is not poor performance, we can certainly do better. To do this, we can find the value of k that maximizes our overall accuracy.
#15 Run the "chooseK" function to find the perfect K, while using sapply() function on chooseK() to test k from 1 to 21 (only selecting the odd numbers), and set the train_set argument to 'commercial_train', val_set to 'commercial_test', train_class to the "label" column of 'commercial_train', and val_class to the "label" column of 'commercial_test'. Label this "knn_diff_k_com"
chooseK = function(k, train_set, val_set, train_class, val_class){
set.seed(03092000)
class_knn = knn(train = train_set,
test = val_set,
cl = train_class,
k = k,
use.all = TRUE)
conf_mat2 = table(class_knn, val_class)
accu = sum(conf_mat2[row(conf_mat2) == col(conf_mat2)]) / sum(conf_mat2)
cbind(k = k, accuracy = accu)
}
knn_different_k_com <- sapply(seq(1, 21, by = 2), #<- set k to be odd number from 1 to 21
function(x) chooseK(x,
train_set = knn_train[ ,-ncol(subset_knn)],
val_set = knn_test[ ,-ncol(subset_knn)],
train_class = knn_train[ ,'Commercial'],
val_class = knn_test[ ,'Commercial']))
knn_different_k_com
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 1.0000000 3.0000000 5.0000000 7.000000 9.0000000 11.0000000 13.000000
## [2,] 0.7251626 0.7381727 0.7485216 0.756505 0.7596097 0.7636014 0.765819
## [,8] [,9] [,10] [,11]
## [1,] 15.0000000 17.0000000 19.0000000 21.0000000
## [2,] 0.7647842 0.7662626 0.7640449 0.7643406
#16 Create a dataframe so we can visualize the difference in accuracy based on K, convert the matrix to a dataframe
knn_different_k_comdf <- data.frame(k_value = knn_different_k_com[1, ], accuracy = knn_different_k_com[2, ])
knn_different_k_comdf
## k_value accuracy
## 1 1 0.7251626
## 2 3 0.7381727
## 3 5 0.7485216
## 4 7 0.7565050
## 5 9 0.7596097
## 6 11 0.7636014
## 7 13 0.7658190
## 8 15 0.7647842
## 9 17 0.7662626
## 10 19 0.7640449
## 11 21 0.7643406
#17 Use ggplot to show the output and comment on the k to select
ggplot(knn_different_k_comdf, aes(x = k_value, y = accuracy)) +
geom_line(color = "blue", size = 1.5) +
geom_point(size = 3) +
labs(title = 'K Value Versus Overall Model Accuracy',
x = "K Value",
y= "Model Accuracy")
Looking at the resulting output, we find a large increase in overall accuracy as k rises to 7; beyond k = 7, further increases in k exhibit diminishing marginal returns, with accuracy improving only slightly (and fluctuating) through k = 21. Since larger k values buy little additional accuracy, we use k = 7, the elbow of the curve, as our working k value for the rest of the analysis.
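For the presentation, it may help to mark the chosen k directly on the chart; a small variant of the plot above with a dashed reference line at k = 7:
ggplot(knn_different_k_comdf, aes(x = k_value, y = accuracy)) +
  geom_line(color = "blue", size = 1.5) +
  geom_point(size = 3) +
  geom_vline(xintercept = 7, linetype = "dashed") +  #<- the k we carry forward
  labs(title = 'K Value Versus Overall Model Accuracy',
       x = "K Value",
       y = "Model Accuracy")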
#18 Rerun the model with "optimal" k
tv_7NN <- knn(train = knn_train[ ,-ncol(subset_knn)],
test = knn_test[ ,-ncol(subset_knn)],
cl = knn_train[, "Commercial"],
k = 7,
use.all = TRUE,
prob = TRUE)
#19 Use the confusion matrix function to measure the quality of the new model
confusionMatrix(as.factor(tv_7NN), as.factor(knn_test[ ,'Commercial']), positive = "1", dnn=c("Prediction", "Actual"), mode = "sens_spec")
## Confusion Matrix and Statistics
##
## Actual
## Prediction -1 1
## -1 1367 571
## 1 1077 3749
##
## Accuracy : 0.7564
## 95% CI : (0.7459, 0.7665)
## No Information Rate : 0.6387
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4473
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8678
## Specificity : 0.5593
## Pos Pred Value : 0.7768
## Neg Pred Value : 0.7054
## Prevalence : 0.6387
## Detection Rate : 0.5543
## Detection Prevalence : 0.7135
## Balanced Accuracy : 0.7136
##
## 'Positive' Class : 1
##
# Confusion matrix with labels
#Using the Caret Library
truth2 <- factor(rep(lvs, times = c(2444, 4320)), #<- 2444 true non-commercials, 4320 true commercials (the column totals above)
levels = rev(lvs))
pred2 <- factor(
c(rep(lvs, times = c(1367, 1077)), rep(lvs, times = c(571, 3749))), levels = rev(lvs))
xtab2 <- table(pred2, truth2)
cm2 <- confusionMatrix(pred2, truth2)
cm2$table
## Reference
## Prediction commercial non commercial
## commercial 3749 1077
## non commercial 571 1367
#Confusion Matrix Image
fourfoldplot(cm2$table)
Again, here we are displaying the confusion matrix in a more visually appealing format. This is meant to help Mr. Rooney better understand the significance of the numbers he is looking at. This has the same information that a normal confusion matrix would show. Here, similar to our confusion matrix from step eleven, we can see that the True Positives and True Negatives make up most of our data, meaning that most of the time, the algorithm was able to predict a commercial vs non commercial correctly.
#20 Summarize the differences in language Mr. Rooney may actually understand. Include a discussion on which approach, k = 3 or k = "optimal", is the better method moving forward for "MEH". Most importantly, draft comments about the overall approach and model quality as it relates to addressing the problem proposed by Ed.
When deciding which model to use, it helps to be explicit about what matters most. Overall accuracy is important, but it is often more important to target specific error rates. For instance, suppose we were using machine learning to predict whether someone has a deadly disease. Although we would like to maximize accuracy, if we were choosing between models we would pick the one that minimizes the false negative rate: it would be extremely detrimental to classify someone as not having the disease (class 0) when they actually do have it (class 1). In that case we would accept some loss in overall accuracy in exchange for higher sensitivity, the true positive rate, because sensitivity measures how many of the truly positive cases we catch. More generally, whether we value high sensitivity or high specificity (the true negative rate) should influence which model we choose.

Applying these concepts to our problem: the premise of our analysis is to use machine learning to accurately classify video segments as commercials or non-commercials so that MEH can make its commercials feel more like the surrounding programming. With this in mind, we may be less interested in overall accuracy, although it is still important to report, and more interested in the sensitivity, i.e. how reliably we recognize true commercials.

Comparing our two models, we fit the data using two different k values, 3 and 7, where k = 7 was identified as the optimal value in the step 17 sub-analysis. Running KNN with both values, we find that k = 7 yields both the higher sensitivity, at 86.78%, and the higher overall accuracy, at 75.64%, so it is the better method moving forward for MEH. One caveat that applies to either model, regardless of the chosen k, is poor specificity: only about 56-57% of true non-commercials are correctly classified as non-commercials (56.87% for k = 3 and 55.93% for k = 7).

That said, the non-commercial class can be just as informative given our marketing goals. If the premise is to blur the line between a commercial and a non-commercial, then what we are really seeking is more ambiguity between the two classes. We can analyze which features drive an observation to be classified as a non-commercial and use that profile as an archetype for creating commercials with non-commercial characteristics. In this instance, the model's interpretability may provide us with more value than its predictive power, and, counter-intuitively, we would eventually want our commercials to be misclassified. We will know our marketing efforts are succeeding if the false negative rate rises over time: if our new commercials take on features comparable to non-commercials while really being commercials, we would expect, and hope, that the model misclassifies them as non-commercials. That mirrors our goal for the viewer's experience: blurring the line between commercials and television.
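For the five-minute overview, a compact side-by-side of the two models, with the figures taken from the confusion matrix output in steps 14 and 19, is likely easier for Mr. Rooney to digest than the full caret printouts:
model_comparison <- data.frame(
  Metric = c("Overall accuracy",
             "Sensitivity (commercials correctly flagged)",
             "Specificity (non-commercials correctly flagged)"),
  k_3 = c(0.7382, 0.8340, 0.5687),
  k_7 = c(0.7564, 0.8678, 0.5593))
formattable(model_comparison)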