In this analysis, we will use the k-nearest neighbors (kNN) method on our company's internal data to classify whether something is a commercial or not. k-nearest neighbors is a classification method with a simple goal: label a data point by looking at its 'k' closest data points, known as neighbors, and taking a majority vote among their labels. Essentially, we are trying to predict how a data point should be labeled; in this situation, is the data point a commercial or not?
To begin our kNN analysis, there is some initial data reading and cleaning to be done.
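The exact steps depend on the source file, but a minimal sketch of this stage might look like the following (the file name and the data frame name `tv` are placeholders, not the actual internal source):

```r
# Minimal read/clean sketch -- file name is a placeholder
tv <- read.csv("tv_commercial_data.csv")

# Drop rows with missing values and treat the label as a factor
tv <- na.omit(tv)
tv$Label <- as.factor(tv$Label)  # -1 = non-commercial, 1 = commercial
```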
After this, we can get the initial split between commercial and non-commercial data.
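Assuming the cleaned data frame is called `tv` and the class column is `Label`, the split and base rate can be computed as:

```r
# Class counts and the proportion labeled as commercial (1)
table(tv$Label)
prop.table(table(tv$Label))["1"]
```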
##
##    -1     1
##  8134 14411

##         1
## 0.6392105
From the table above, we can see that 8,134 rows carry the non-commercial label (-1), while 14,411 rows carry the commercial label (1). Therefore, the base rate of the commercial class is 14,411 / 22,545 ≈ 0.639, or roughly 64%.
Since there are columns in the data that contain different metrics for the same variable (any column ending in 'mn' is the mean of that variable, while any column ending in 'var' is its variance), keeping both is redundant, so we will drop all columns ending in "var".
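One way to drop these is to match on the column-name suffix:

```r
# Keep only the columns whose names do not end in "var"
tv <- tv[, !grepl("var$", names(tv))]
```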
Next, we will run a correlation analysis on the data to check that the predictors are not highly correlated with one another. Variables with high correlations will have to be removed.
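A sketch of the correlation step, assuming all remaining columns can be coerced to numeric:

```r
# Pairwise correlations, rounded to two decimals for display.
# data.matrix() converts the factor Label to numeric codes, which
# leaves its correlations unchanged (correlation is scale-invariant).
round(cor(data.matrix(tv)), 2)
```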
## shot_length motion_distr_mn frame_diff_dist_mn
## shot_length 1.00 -0.15 -0.15
## motion_distr_mn -0.15 1.00 0.72
## frame_diff_dist_mn -0.15 0.72 1.00
## short_time_energy_mn 0.03 -0.01 -0.02
## zcr_mn 0.19 -0.05 -0.04
## spectral_centroid_mn 0.36 -0.18 -0.30
## spectral_roll_off_mn 0.38 -0.22 -0.38
## spectral_flux_mn 0.10 -0.02 0.01
## fundamental_freq_mn 0.29 -0.10 -0.09
## motion_dist_mn 0.21 -0.76 -0.65
## Label -0.27 0.05 -0.05
## short_time_energy_mn zcr_mn spectral_centroid_mn
## shot_length 0.03 0.19 0.36
## motion_distr_mn -0.01 -0.05 -0.18
## frame_diff_dist_mn -0.02 -0.04 -0.30
## short_time_energy_mn 1.00 -0.13 0.31
## zcr_mn -0.13 1.00 0.31
## spectral_centroid_mn 0.31 0.31 1.00
## spectral_roll_off_mn 0.16 0.03 0.81
## spectral_flux_mn 0.82 -0.05 0.28
## fundamental_freq_mn 0.02 0.53 0.42
## motion_dist_mn 0.03 0.07 0.31
## Label 0.11 -0.25 -0.27
## spectral_roll_off_mn spectral_flux_mn fundamental_freq_mn
## shot_length 0.38 0.10 0.29
## motion_distr_mn -0.22 -0.02 -0.10
## frame_diff_dist_mn -0.38 0.01 -0.09
## short_time_energy_mn 0.16 0.82 0.02
## zcr_mn 0.03 -0.05 0.53
## spectral_centroid_mn 0.81 0.28 0.42
## spectral_roll_off_mn 1.00 0.17 0.32
## spectral_flux_mn 0.17 1.00 0.24
## fundamental_freq_mn 0.32 0.24 1.00
## motion_dist_mn 0.39 0.04 0.13
## Label -0.24 -0.14 -0.39
## motion_dist_mn Label
## shot_length 0.21 -0.27
## motion_distr_mn -0.76 0.05
## frame_diff_dist_mn -0.65 -0.05
## short_time_energy_mn 0.03 0.11
## zcr_mn 0.07 -0.25
## spectral_centroid_mn 0.31 -0.27
## spectral_roll_off_mn 0.39 -0.24
## spectral_flux_mn 0.04 -0.14
## fundamental_freq_mn 0.13 -0.39
## motion_dist_mn 1.00 -0.05
## Label -0.05 1.00
Based on these correlations, we will remove motion_distr_mn, which has correlations above 0.7 in magnitude with two other variables (0.72 with frame_diff_dist_mn and -0.76 with motion_dist_mn). We will also remove spectral_centroid_mn (0.81 with spectral_roll_off_mn) and spectral_flux_mn (0.82 with short_time_energy_mn), as these variables likewise have strong correlations with other predictors.
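Dropping the three variables by name:

```r
# Remove the highly correlated predictors identified above
drop_cols <- c("motion_distr_mn", "spectral_centroid_mn", "spectral_flux_mn")
tv <- tv[, !(names(tv) %in% drop_cols)]
```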
Now we will begin to train and test the model. We will split the data such that 80% goes towards training and 20% goes towards testing.
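A sketch of the split, using a simple random sample of row indices (the seed is an arbitrary choice, assumed for reproducibility):

```r
set.seed(1)  # arbitrary seed, assumed for reproducibility

train_idx <- sample(nrow(tv), size = floor(0.8 * nrow(tv)))
train <- tv[train_idx, ]
test  <- tv[-train_idx, ]

nrow(train)
nrow(test)
```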
## [1] 18036
## [1] 4509
As we can see, 80% of the data, or 18,036 rows, will be used for training, while the other 20%, or 4,509 rows, will be used for testing.
We will first use k = 3 as our number of nearest neighbors. All variables except the label will be used as predictors in both training and testing.
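A sketch of the fit using the `knn()` function from the `class` package:

```r
library(class)

# Every column except the label serves as a predictor
pred_cols <- setdiff(names(tv), "Label")

knn_pred3 <- knn(train = train[, pred_cols],
                 test  = test[, pred_cols],
                 cl    = train$Label,
                 k     = 3)
```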
After running the classifier, a confusion matrix can be created to summarize the predictions and their accuracy.
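One way to produce this summary is `confusionMatrix()` from the `caret` package, applied to a prediction-by-actual table:

```r
library(caret)

# Cross-tabulate predictions against actual labels, treating 1 as positive
confusionMatrix(table(Prediction = knn_pred3, Actual = test$Label),
                positive = "1")
```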
## Confusion Matrix and Statistics
##
## Actual
## Prediction -1 1
## -1 958 381
## 1 664 2506
##
## Accuracy : 0.7682
## 95% CI : (0.7556, 0.7805)
## No Information Rate : 0.6403
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4769
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8680
## Specificity : 0.5906
## Pos Pred Value : 0.7905
## Neg Pred Value : 0.7155
## Prevalence : 0.6403
## Detection Rate : 0.5558
## Detection Prevalence : 0.7030
## Balanced Accuracy : 0.7293
##
## 'Positive' Class : 1
##
The matrix at the top shows the model's predictions against the actual labels, with -1 being non-commercial and 1 being commercial. Below that, we can see the accuracy of the model, calculated to be 0.7682, or 76.82%, which is fairly accurate. We also see that the accuracy is significantly greater than the no-information rate of 0.6403 (the accuracy of simply predicting the majority class every time), with a p-value below 2.2e-16.
Although this model is fairly accurate, k = 3 is only one choice, and a different k may perform better. Next, we will evaluate a range of k values and compile their accuracies into a data frame so they can be displayed visually.
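A sketch of that search; the range of k values tested here (1 through 20) is an assumption:

```r
# Test-set accuracy for each candidate k
ks  <- 1:20  # assumed range of k values
acc <- sapply(ks, function(k) {
  pred <- knn(train[, pred_cols], test[, pred_cols], train$Label, k = k)
  mean(pred == test$Label)
})

k_results <- data.frame(k = ks, accuracy = acc)
plot(k_results, type = "b", xlab = "k", ylab = "Accuracy")
```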
The graph above plots each k value against the accuracy of the model trained with that k. Accuracy rises from k = 1 to k = 7, then the pattern suddenly changes. Since k = 7 marks this "kink" (the elbow in the curve) and has a rather high accuracy compared to the other values, we will select 7 as our k value.
After retraining the classifier with k = 7, we can create another confusion matrix and gather information about the new model.
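Retraining and summarizing with k = 7 follows the same pattern as before:

```r
# Refit with k = 7 and summarize against the test labels
knn_pred7 <- knn(train[, pred_cols], test[, pred_cols], train$Label, k = 7)
confusionMatrix(table(Prediction = knn_pred7, Actual = test$Label),
                positive = "1")
```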
## Confusion Matrix and Statistics
##
## Actual
## Prediction -1 1
## -1 934 308
## 1 688 2579
##
## Accuracy : 0.7791
## 95% CI : (0.7667, 0.7911)
## No Information Rate : 0.6403
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4945
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8933
## Specificity : 0.5758
## Pos Pred Value : 0.7894
## Neg Pred Value : 0.7520
## Prevalence : 0.6403
## Detection Rate : 0.5720
## Detection Prevalence : 0.7246
## Balanced Accuracy : 0.7346
##
## 'Positive' Class : 1
##
As can be seen above, the accuracy of our model is now 0.7791, or 77.91%, a slight increase over our k = 3 model.
Given all of the information collected here, we can draw some conclusions about our models and how they relate to the problem at hand. Between the k = 3 and k = 7 models, the k = 7 model shows a slight increase in accuracy, so it is the better choice for our methods moving forward in "MEH". As a whole, our approach and model are fairly accurate and can be very useful in addressing our problem. At 77.91% accuracy in distinguishing commercial from non-commercial data, there is clearly room to improve, but this model is appropriate and decently suited to our needs right now. Given the proposal to identify commercials versus non-commercials in order to improve our company's commercials, and in turn our company's profits in the long run, the k-nearest neighbors approach seems most appropriate. Though the model will have to be continuously updated and improved as we go deeper into this work, the model crafted and the method being taken seem well suited to the circumstances of the problem at hand.