We began our analysis by loading the data, applying the commercial labels, and splitting the observations into commercial and non-commercial groups. From this, we calculated a base rate, defined as the number of commercials divided by the size of the dataset (essentially, the percentage of commercials in the data).
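A minimal sketch of this step is below; the file name, the data frame name (`tv_data`), and the `label` column coded as 1 (commercial) / -1 (non-commercial) are assumptions for illustration, not the original code.

```r
# Illustrative sketch: object, file, and column names are assumptions
tv_data <- read.csv("tv_commercial_data.csv")   # hypothetical file name

# Split into commercial (label == 1) and non-commercial (label == -1)
commercials     <- tv_data[tv_data$label == 1, ]
non_commercials <- tv_data[tv_data$label == -1, ]

# Base rate: proportion of commercials in the full dataset
base_rate <- nrow(commercials) / nrow(tv_data)
base_rate
```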

We filtered out the columns that captured the variance of each variable and chose to do our analysis and build our machine learning model on the means only.
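Assuming the variance columns share a common suffix (here `_var`, which is an assumption about the column naming), this filtering could look like:

```r
# Keep only the mean columns by dropping anything suffixed "_var" (suffix is assumed)
mean_data <- tv_data[ , !grepl("_var$", names(tv_data))]
```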

This machine learning algorithm does not handle highly correlated variables well. We examined the pairwise correlations between the variables and kept only those whose correlations fell between -0.7 and 0.7. We also removed variables that were strongly correlated with more than one other variable. This meant dropping motion_distr, short_time_energy, and spectral_centroid.
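A hedged sketch of this correlation screen, reusing the assumed `mean_data` frame from above:

```r
# Pairwise correlations among the predictors (label column excluded)
predictors <- setdiff(names(mean_data), "label")
cors <- cor(mean_data[ , predictors])

# List pairs with |r| >= 0.7 (upper triangle only, to avoid duplicates)
which(abs(cors) >= 0.7 & upper.tri(cors), arr.ind = TRUE)

# Drop the variables that were correlated with more than one other variable
mean_data <- mean_data[ , !(names(mean_data) %in%
                            c("motion_distr", "short_time_energy", "spectral_centroid"))]
```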

From this point, we built our train and test datasets. The training set is what the algorithm uses to learn the patterns, and the test set is used to verify the model's accuracy. For our first model, we chose the 3 nearest neighbors and evaluated how well that works.
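A sketch of the split and the first k = 3 model, using `class::knn()` and `caret::confusionMatrix()`; the 80/20 split proportion and the seed are assumptions, not the original settings.

```r
library(class)   # knn()
library(caret)   # confusionMatrix()

set.seed(1)  # seed chosen for illustration

# 80/20 train/test split (the proportion is an assumption)
train_idx <- sample(seq_len(nrow(mean_data)), size = floor(0.8 * nrow(mean_data)))
train <- mean_data[train_idx, ]
test  <- mean_data[-train_idx, ]

predictors <- setdiff(names(mean_data), "label")

# k-nearest neighbors with k = 3
knn_pred_3 <- knn(train = train[ , predictors],
                  test  = test[ , predictors],
                  cl    = factor(train$label),
                  k     = 3)

# Compare predictions against the true test labels
confusionMatrix(knn_pred_3, factor(test$label), positive = "1")
```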

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction   -1    1
##         -1  848  475
##         1   774 2412
##                                          
##                Accuracy : 0.723          
##                  95% CI : (0.7097, 0.736)
##     No Information Rate : 0.6403         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.3734         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.8355         
##             Specificity : 0.5228         
##          Pos Pred Value : 0.7571         
##          Neg Pred Value : 0.6410         
##              Prevalence : 0.6403         
##          Detection Rate : 0.5349         
##    Detection Prevalence : 0.7066         
##       Balanced Accuracy : 0.6791         
##                                          
##        'Positive' Class : 1              
## 

This model has an accuracy of about 72%, which is a clear improvement over the base rate of 63.9%. In other words, the model does meaningfully better than simply predicting the majority class for every observation.

This plot shows accuracy plotted against the number of neighbors, k. To select the best k value, we looked for where the curve makes a sharp turn (the elbow, if you will). For our second iteration, we decided to use 5 neighbors; a sketch of the sweep and the rerun follows.
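One way such a sweep could be produced, along with the second model at k = 5 (reusing the assumed objects from the earlier sketches):

```r
# Sweep over odd k values and record test accuracy to locate the elbow
k_values   <- seq(1, 21, by = 2)
accuracies <- sapply(k_values, function(k) {
  pred <- knn(train = train[ , predictors],
              test  = test[ , predictors],
              cl    = factor(train$label),
              k     = k)
  mean(as.character(pred) == as.character(test$label))
})

plot(k_values, accuracies, type = "b",
     xlab = "k (number of neighbors)", ylab = "Test accuracy")

# Second iteration: k = 5
knn_pred_5 <- knn(train = train[ , predictors],
                  test  = test[ , predictors],
                  cl    = factor(train$label),
                  k     = 5)
confusionMatrix(knn_pred_5, factor(test$label), positive = "1")
```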

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction   -1    1
##         -1  814  419
##         1   808 2468
##                                           
##                Accuracy : 0.7279          
##                  95% CI : (0.7146, 0.7408)
##     No Information Rate : 0.6403          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3765          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8549          
##             Specificity : 0.5018          
##          Pos Pred Value : 0.7534          
##          Neg Pred Value : 0.6602          
##              Prevalence : 0.6403          
##          Detection Rate : 0.5473          
##    Detection Prevalence : 0.7265          
##       Balanced Accuracy : 0.6784          
##                                           
##        'Positive' Class : 1               
## 

Rerunning the model with 5 neighbors again gives an accuracy of approximately 73%.

We took two passes at identifying commercials without knowing anything about their content, both using a k-nearest neighbors algorithm. KNN establishes classes based on the proximity of each observation's features, which lets us take a new segment (commercial or non-commercial) and place it into one of those classes. We first tried a small neighborhood (3 neighbors, to be exact) and obtained an accuracy of about 72%, meaning our model could correctly distinguish commercial from non-commercial roughly 72% of the time, which is much better than the base rate of ~64% (in essence, the probability that a randomly chosen observation is a commercial). To try to optimize the algorithm, we plotted the number of neighbors against the accuracy of the model, and from this graph we decided that 5 neighbors would offer the best balance of accuracy and computational efficiency. We reran the model with this new value and again obtained an accuracy of about 73%. This tells us there is no strong difference between the two settings, though a larger neighbor value might yield different results. On balance, we would recommend the 3-nearest-neighbors model: it is less computationally expensive with essentially no loss in accuracy, which makes it easier to identify commercials on the fly.