============================================================================================================
About: This document is also available at http://rpubs.com/sherloconan/562805
Data source: https://www.kaggle.com/sherloconan/anly-53053b  

1. k-Means Clustering

 

Question 1: Given that this is an imperfect real-world dataset, you need to determine what you believe is the best value for “k” and write up this portion of your lab report.  

You should include a brief discussion of your k-Means analysis as well as the best value of “k” that you determine. You should include what mixture of variables within the clusters this value of “k” results in. That is, you need to interpret your k-Means analysis and discuss what it means.  

Answer 1: According to the plot of the average total within sum of squares by number of clusters (k) in Fig. 1, also known as the elbow method, the optimal k is 3. This method looks at how much of the variability is explained as k increases, and the optimal k is chosen where the rate of improvement starts to drop. Other commonly preferred techniques are the silhouette method and the Bayesian information criterion (a brief silhouette sketch follows the elbow code in Step 3). However, the instructions for Part 1 of Lab 3 suggest k=5.  

When k=5,
Cluster 1 appears to consist of customers who heavily consume Milk and Grocery products, followed by Cluster 4.
Cluster 5 appears to consist of customers who heavily consume Fresh products, followed by Cluster 2.
Cluster 3 appears to represent the low (or regular) consumption customers.

 

Question 2: How many points do you see in each cluster?  

Answer 2: The clustering algorithm used is k-means, i.e., the function kmeans() from the default package “stats”. The random seed is set to 76964057. There are 421 observations on six variables, namely “Fresh”, “Milk”, “Grocery”, “Frozen”, “Detergents_Paper”, and “Delicassen” (“Channel” and “Region” are therefore not taken into consideration). The number of observations in each of the five clusters is shown below.

Cluster    One    Two   Three   Four   Five
Count       42     99     166     83     31

 

Step 1: Exploring and preparing the data

wholesale <- read.csv("~/Documents/HU/ANLY 530-53-B/Module 10- Naive Bayes classifier/Lab3/Wholesale_customers_data.csv")
pastecs::stat.desc(wholesale[-c(1,2)]) %>% kable(digits=0) %>% kable_styling(full_width=F)
                  Fresh      Milk   Grocery    Frozen  Detergents_Paper  Delicassen
nbr.val             440       440       440       440               440         440
nbr.null              0         0         0         0                 0           0
nbr.na                0         0         0         0                 0           0
min                   3        55         3        25                 3           3
max              112151     73498     92780     60869             40827       47943
range            112148     73443     92777     60844             40824       47940
sum             5280131   2550357   3498562   1351650           1267857      670943
median             8504      3627      4756      1526               816         966
mean              12000      5796      7951      3072              2881        1525
SE.mean             603       352       453       231               227         134
CI.mean.0.95       1185       692       890       455               447         264
var           159954927  54469967  90310104  23567853          22732436     7952997
std.dev           12647      7380      9503      4855              4768        2820
coef.var              1         1         1         2                 2           2
#Note: In this case, scaling / normalizing the dataset will not necessarily remove the outliers; a logarithmic transformation might (or might not) help.
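As a quick illustration of that note, a shifted logarithm is one possible transformation. This is only a sketch, not part of the original analysis; whether it actually helps would still need to be checked against the clustering results.

#hypothetical log transformation: log1p() compresses the long right tails without dropping zero values
wholesaleLog <- wholesale
wholesaleLog[-c(1,2)] <- log1p(wholesale[-c(1,2)])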

top.n.custs <- function(data, cols, n) {
  #' Given a data frame, find the row indices for the top of each specified column
  #' Input: data - the data frame
  #'        cols - the vector of the specified columns, e.g., c(3:8)
  #'        n - top number, e.g., 5
  #' Output: return a vector of such row indices without duplicates

  index <- integer(0)
  for (i in cols) {
    top.n <- head(order(data[,i], decreasing=T), n)
    index <- union(index, top.n)
  }
  return(index)
}

wholesaleR <- wholesale[-top.n.custs(wholesale, 3:8, 5),]
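As a quick sanity check (the counts follow from the 440 original rows and the 421 observations clustered below), the helper flags 19 top-spending rows for removal:

length(top.n.custs(wholesale, 3:8, 5)) #19 unique row indices
nrow(wholesale) - nrow(wholesaleR)     #440 - 421 = 19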

 

Step 2: Building a k-Means Clustering model

set.seed(76964057)
k <- kmeans(wholesaleR[-c(1,2)], centers=5) #create 5 clusters; drop the "Channel" and "Region" variables
k$centers %>% kable() %>% kable_styling(full_width=F)
Cluster      Fresh       Milk    Grocery    Frozen  Detergents_Paper  Delicassen
1         5830.214  15295.048  23449.167  1936.452        10361.6429    1912.738
2        18649.606   3335.586   4497.848  3301.747         1046.5859    1450.566
3         5845.392   2337.319   2878.205  2766.596          660.2952     858.994
4         4238.892   7725.289  11011.747  1336.566         4733.3614    1400.530
5        35922.387   4851.806   5862.581  3730.677         1004.6129    1552.161
table(k$cluster)
## 
##   1   2   3   4   5 
##  42  99 166  83  31

 

Step 3: Finding the optimal number of clusters (elbow method)

avg.totw.ss <- numeric(19)
for (num in 2:20) {
  totw.ss <- numeric(100)
  for (trial in 1:100) {
    runs <- kmeans(wholesaleR[-c(1,2)], centers=num) #Warning: with the default iter.max=10, some runs may not converge; consider setting, e.g., iter.max=15
    totw.ss[trial] <- runs$tot.withinss
  }
  avg.totw.ss[num-1] <- mean(totw.ss)
}

#(avg.totw.ss[-19]-avg.totw.ss[-1])/avg.totw.ss[-19]
ggplot(aes(x,y), data=data.frame("x"=c(2:20), "y"=avg.totw.ss/1000000000))+geom_line()+geom_point(size=2.5)+labs(x="Number of Clusters (starting with k=2)", y="Average Total Within Sum of Squares (×10^9)")+ggtitle("Fig. 1. Plot of TotW.SS by k in \"Wholesale\" Clustering")+theme_classic()
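The silhouette method mentioned in Answer 1 could be sketched as follows. This is a supplementary sketch, not part of the original analysis; it assumes the cluster package (already used later for clusplot) and picks the k with the largest average silhouette width.

#average silhouette width for one k-means run at each candidate k (a sketch; multiple restarts would be more robust)
d <- dist(wholesaleR[-c(1,2)])
avg.sil <- sapply(2:20, function(num) {
  km <- kmeans(wholesaleR[-c(1,2)], centers=num, iter.max=15)
  mean(cluster::silhouette(km$cluster, d)[, "sil_width"])
})
which.max(avg.sil) + 1 #silhouette-based choice of k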

 

 

2. Cluster Analysis

 

Summary: These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars.
As before, according to the plot of total within sum of squares by number of clusters (k) in Fig. 2 (the elbow method), the optimal k is 3. The function NbClust from the package “NbClust” also proposes k=3 as the best number of clusters, by 19 out of 26 criteria. The random seed is set to 1234. There are 178 observations on the 13 scaled attributes. The cluster sizes are shown below.

Cluster    One    Two   Three
Count       62     65      51

It is worthwhile to note that the indices do not necessarily mean that Cluster 1 is Wine Type 1, etc., but this is the case here because the first column (Wine Type) is in increasing order. In that regard, the confusion matrix reflects an accuracy of 96.63%.  

Cluster 1 has relatively high values in Alcohol, Ash, Magnesium, Total Phenols, Flavanoids, Proanthocyanins, OD280/OD315 of Diluted Wines, and Proline, and relatively low values in Alcalinity of Ash and Nonflavanoid Phenols. Cluster 2 has relatively low values in Alcohol, Ash, Magnesium, Color Intensity, and Proline, which are almost the opposite of Cluster 1. Cluster 3 has relatively high values in Malic Acid, Alcalinity of Ash, Nonflavanoid Phenols, and Color Intensity, and relatively low values in Total Phenols, Flavanoids, Proanthocyanins, Hue, and OD280/OD315 of Diluted Wines, which are also opposite to Cluster 1.

 

Step 1: Exploring and preparing the data

#Note: this wine dataset is different from the one used in Lab 1.
wine <- read.csv("~/Documents/HU/ANLY 530-53-B/Module 10- Naive Bayes classifier/Lab3/wine.csv")
wineR <- wine
wineR[-1] <- scale(wineR[-1])
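A quick check of what scale() does here (default centering and scaling, i.e., z-scores): every scaled attribute should now have mean approximately 0 and standard deviation 1.

round(colMeans(wineR[-1]), 10) #all ~0
apply(wineR[-1], 2, sd)        #all 1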

 

Step 2: Building k-Means Clustering models (elbow method)

# #OR
# totwss <- sapply(1:15, function(x) {set.seed(1234); kmeans(wineR[-1], x)$tot.withinss})

wssplot <- function(data, nc=15, seed=1234, title) {
  #' Given a data frame, plot total within sum of squares by number of clusters from 1 to nc
  #' Input: data - the data frame
  #'        nc - number of clusters, default nc=15
  #'        seed - random seed, default seed=1234
  #'        title - the figure title
  #' Output: ggplot
  
  totwss <- numeric(nc)
  for (num in 1:nc) {
    set.seed(seed)
    totwss[num] <- kmeans(data, centers=num)$tot.withinss
  }
  #totwss[1] == (nrow(data)-1)*sum(apply(data,2,var)), i.e., (178-1)*13 here, since each of the 13 scaled wine variables has variance 1
  
  fig <- ggplot(aes(x,y), data=data.frame("x"=c(1:nc), "y"=totwss))+geom_line()+geom_point(size=2.5)+labs(x="Number of Clusters (starting with k=1)", y="Total Within Sum of Squares")+ggtitle(title)+theme_classic()
  return(fig)
}

wssplot(wineR[-1], title="Fig. 2. Plot of TotW.SS by k in \"Wine\" Clustering")

 

Step 3: Cluster analysis

set.seed(1234)
cluster.analysis <- NbClust(wineR[-1], min.nc=2, max.nc=15, method="kmeans")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 2 proposed 2 as the best number of clusters 
## * 19 proposed 3 as the best number of clusters 
## * 1 proposed 14 as the best number of clusters 
## * 1 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
#Also note that length(unique(wineR$Wine)) == 3
#Number of clusters chosen by 26 criteria
table(cluster.analysis$Best.n[1,]) #e.g., 19 proposed 3 as the best number of clusters
## 
##  0  1  2  3 14 15 
##  2  1  2 19  1  1
#n <- readline(prompt="Enter the best number of clusters: ")
set.seed(1234)
k2 <- kmeans(wineR[-1], 3, nstart=25)
table(k2$cluster)
## 
##  1  2  3 
## 62 65 51
k2$centers %>% kable(digits=3) %>% kable_styling(font_size=12)
Cluster  Alcohol  Malic.acid     Ash     Acl      Mg  Phenols  Flavanoids  Nonflavanoid.phenols  Proanth  Color.int     Hue      OD  Proline
1          0.833      -0.303   0.364  -0.608   0.576    0.883       0.975                -0.561    0.579      0.171   0.473   0.777    1.122
2         -0.923      -0.393  -0.493   0.170  -0.490   -0.076       0.021                -0.033    0.058     -0.899   0.461   0.270   -0.752
3          0.164       0.869   0.186   0.523  -0.075   -0.977      -1.212                 0.724   -0.778      0.939  -1.162  -1.289   -0.406
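The high/low profiles described in the Summary can be read directly from these scaled centers; for instance, a rough screen (the ±0.5 cut-off is an arbitrary illustrative choice) lists the attributes on which each cluster deviates noticeably from the overall average.

#attributes whose scaled center is noticeably above or below the overall mean, per cluster
apply(k2$centers, 1, function(ctr) names(which(abs(ctr) > 0.5)))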
aggregate(wine[-1], by=list(Cluster=k2$cluster), mean) %>% kable(digits=3) %>% kable_styling(font_size=12)
Cluster  Alcohol  Malic.acid    Ash     Acl       Mg  Phenols  Flavanoids  Nonflavanoid.phenols  Proanth  Color.int    Hue     OD   Proline
1         13.677       1.998  2.466  17.463  107.968    2.848       3.003                 0.292    1.922      5.454  1.065  3.163  1100.226
2         12.251       1.897  2.231  20.063   92.738    2.248       2.050                 0.358    1.624      2.973  1.063  2.803   510.169
3         13.134       3.307  2.418  21.241   98.667    1.684       0.819                 0.452    1.146      7.235  0.692  1.697   619.059
(t <- table(wineR$Wine, k2$cluster)) #Note: the cluster indices do not necessarily correspond to the wine types, but they do here because the first column (Wine Type) is in increasing order.
##    
##      1  2  3
##   1 59  0  0
##   2  3 65  3
##   3  0  0 48
(sum(diag(t))/sum(t))*100
## [1] 96.62921
par(mfrow=c(1,1))
cluster::clusplot(wineR[-1], k2$cluster, main="2D Representation of the Cluster Solution", lines=0, labels=4)

#About PCA: https://stats.stackexchange.com/questions/274754/how-to-interpret-the-clusplot-in-r
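Since clusplot projects the 13 attributes onto their first two principal components (hence the PCA reference above), the share of variability those two components capture can be checked with a short supplementary sketch (not part of the original lab):

pca <- prcomp(wineR[-1]) #the attributes are already scaled, so no further scaling is needed
summary(pca)$importance["Cumulative Proportion", 1:2] #cumulative variance explained by PC1 and PC2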

 

 

3. Classification Model

 

Summary: The scaled predictors are “Alcohol,” “Malic.acid,” “Ash,” “Acl,” “Mg,” “Phenols,” “Flavanoids,” “Nonflavanoid.phenols,” “Proanth,” “Color.int,” “Hue,” “OD,” and “Proline,” with 141 observations in the training set and 37 observations in the test set. The random seed is set to 1234 for sampling purposes. The regression trees model, i.e., the function rpart() from the package “rpart” with method="class", is used to predict the clusters. The accuracies on the training and test sets are 96.45% and 91.89%, respectively. The three most important features in this model are “Flavanoids,” “Phenols,” and “OD.”  

Previously, Part 2 identified six mismatches: three instances of Type 2 wine were grouped into Cluster 1 and three into Cluster 3. Inspecting the row indices where the cluster labels disagree with the wine types shows that five of these instances fall in the training set and one in the test set, i.e., \(\frac{5}{141}\approx3.5461\%\) of the training observations. Hence, the training process may indeed be slightly compromised.

Wine Type   Cluster   Prediction
        2         3            3
        2         1            2
        2         1            1
        2         3            3
        2         3            2

 

Step 1: Exploring and preparing the data

wineR$Cluster <- k2$cluster
set.seed(1234)
wineR2 <- wineR[sample(1:nrow(wineR)),]
mis <- which(wineR2$Wine!=wineR2$Cluster) #19 25 52 100 119 166
wineTraining <- wineR2[1:(as.integer(0.8*nrow(wineR2))-1),] #141 obs.
wineTest <- wineR2[as.integer(0.8*nrow(wineR2)):nrow(wineR2),] #37 obs.
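A quick check with the objects defined above confirms how the six mismatched rows from Part 2 split across the two sets, matching the 5-vs-1 split and the \(\frac{5}{141}\) figure quoted in the Summary.

sum(mis <= nrow(wineTraining)) #5 mismatches fall in the training set
sum(mis >  nrow(wineTraining)) #1 mismatch falls in the test set
sum(mis <= nrow(wineTraining))/nrow(wineTraining) #5/141 ≈ 0.0355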

 

Step 2: Training a model on the data

(wineModelR <- rpart(Cluster~., data=wineTraining[-1], method="class"))
## n= 141 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 141 91 2 (0.32624113 0.35460993 0.31914894)  
##    2) Proline>=0.02574325 52  7 1 (0.86538462 0.01923077 0.11538462)  
##      4) Phenols>=-0.2718097 45  0 1 (1.00000000 0.00000000 0.00000000) *
##      5) Phenols< -0.2718097 7  1 3 (0.00000000 0.14285714 0.85714286) *
##    3) Proline< 0.02574325 89 40 2 (0.01123596 0.55056180 0.43820225)  
##      6) OD>=-0.5939311 46  2 2 (0.02173913 0.95652174 0.02173913) *
##      7) OD< -0.5939311 43  5 3 (0.00000000 0.11627907 0.88372093)  
##       14) Color.int< -0.4671943 7  2 2 (0.00000000 0.71428571 0.28571429) *
##       15) Color.int>=-0.4671943 36  0 3 (0.00000000 0.00000000 1.00000000) *
# #OR
# rpart.plot(wineModelR, digits=4, type=3, extra=101)
fancyRpartPlot(wineModelR)

#accuracy on the training set
wineTrPredR <- predict(wineModelR, wineTraining[-1], type="class")
CrossTable(wineTraining$Cluster, wineTrPredR, prop.chisq=F, prop.c=F, prop.r=F, dnn=c("Actual", "Predicted (RT)"))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  141 
## 
##  
##              | Predicted (RT) 
##       Actual |         1 |         2 |         3 | Row Total | 
## -------------|-----------|-----------|-----------|-----------|
##            1 |        45 |         1 |         0 |        46 | 
##              |     0.319 |     0.007 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|
##            2 |         0 |        49 |         1 |        50 | 
##              |     0.000 |     0.348 |     0.007 |           | 
## -------------|-----------|-----------|-----------|-----------|
##            3 |         0 |         3 |        42 |        45 | 
##              |     0.000 |     0.021 |     0.298 |           | 
## -------------|-----------|-----------|-----------|-----------|
## Column Total |        45 |        53 |        43 |       141 | 
## -------------|-----------|-----------|-----------|-----------|
## 
## 
comparable <- data.frame(wineTraining[c("Wine","Cluster")], "Pred"=wineTrPredR)
comparable[mis[mis<=nrow(wineTraining)],]
##     Wine Cluster Pred
## 62     2       3    3
## 122    2       1    2
## 96     2       1    1
## 84     2       3    3
## 119    2       3    2

 

Step 3: Evaluating model performance

#accuracy on the test set
winePredR <- predict(wineModelR, wineTest[-1], type="class")
CrossTable(wineTest$Cluster, winePredR, prop.chisq=F, prop.c=F, prop.r=F, dnn=c("Actual", "Predicted (RT)"))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  37 
## 
##  
##              | Predicted (RT) 
##       Actual |         1 |         2 |         3 | Row Total | 
## -------------|-----------|-----------|-----------|-----------|
##            1 |        14 |         2 |         0 |        16 | 
##              |     0.378 |     0.054 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|
##            2 |         1 |        14 |         0 |        15 | 
##              |     0.027 |     0.378 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|
##            3 |         0 |         0 |         6 |         6 | 
##              |     0.000 |     0.000 |     0.162 |           | 
## -------------|-----------|-----------|-----------|-----------|
## Column Total |        15 |        16 |         6 |        37 | 
## -------------|-----------|-----------|-----------|-----------|
## 
## 
wineModelR$variable.importance %>% kable(digits=4) %>% kable_styling(full_width=F)
Variable              Importance
Flavanoids               58.1216
Phenols                  44.8599
OD                       42.8464
Alcohol                  39.9385
Proline                  36.6332
Color.int                29.2045
Proanth                  20.2278
Acl                      15.3934
Hue                      12.5302
Mg                        9.8628
Nonflavanoid.phenols      4.5565
Ash                       2.5629
Malic.acid                2.5629
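For reference, the 96.45% (training) and 91.89% (test) accuracies quoted in the Summary can be recovered from the confusion tables above in the same way as in Part 2:

(tTr <- table(wineTraining$Cluster, wineTrPredR)); sum(diag(tTr))/sum(tTr) #136/141 ≈ 0.9645
(tTe <- table(wineTest$Cluster, winePredR)); sum(diag(tTe))/sum(tTe)       #34/37 ≈ 0.9189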

 

 

4. k-Nearest Neighbors Model

 

Question 3: Load the breast cancer dataset. Do the preliminary analysis and implement a KNN (k-nearest neighbors) model for this dataset, and don’t forget that whenever it is required you should use: set.seed(12345).  

Answer 3: There are 30 scaled predictors, with 469 observations in the training set and 100 observations in the test set. There is no randomization in the sampling. The k-nearest neighbors (KNN) model, i.e., the function knn() from the default package “class”, is used to classify “diagnosis”. k is set to roughly the square root of the number of rows in the training set (sqrt(469) ≈ 21.7), i.e., k=21. The accuracy on the test set is 98%.
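A minimal sketch of how such a k could be derived from the training-set size (the floor-style rounding and the helper name are assumptions for illustration; an odd k is often preferred to avoid ties):

k.from.n <- function(n) floor(sqrt(n)) #hypothetical helper: floor of the square root of the row count
k.from.n(469)   #21 for the breast-cancer training set
k.from.n(29733) #172 for the news training set, assuming 29,733 training rows (the 9,911 test rows being the remaining 25%)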

 

Question 4: Now let’s get back to our problem of news popularity and see if we can apply KNN (k-nearest neighbors) to improve the accuracy of the model. Use the same training and testing strategy that we did in the first two labs, and don’t forget that whenever it is required you should use: set.seed(12345).  

Answer 4: The random seed is set to 12345 for sampling. k is set to roughly the square root of the number of rows in the training set, i.e., k=172. The accuracy on the test set is 57.05%. The eight models are compared below.

Model                    Test-set accuracy
Decision Tree                       59.11%
Random Forest                       60.25%
Regression Trees                    56.37%
Naïve Bayes Classifier              47.22%
Linear SVM                          57.51%
Polynomial SVM                      57.51%
Radial Basis SVM                    59.34%
k-Nearest Neighbors                 57.05%

 

Step 1: Exploring and preparing the data

cancer <- read.csv("~/Documents/HU/ANLY 530-53-B/Module 10- Naive Bayes classifier/Lab3/wisc_bc_data.csv")
cancer <- cancer[-1]
cancer[-1] <- lapply(cancer[2:31], function(x) {(x-min(x))/(max(x)-min(x))}) #min-max normalization to [0, 1]
summary(cancer) %>% kable(digits=4) %>% kable_styling(font_size=11)
Variable              Min.  1st Qu.   Median     Mean  3rd Qu.    Max.
diagnosis           Length: 569, Class: character
radius_mean         0.0000   0.2233   0.3024   0.3382   0.4164  1.0000
texture_mean        0.0000   0.2185   0.3088   0.3240   0.4089  1.0000
perimeter_mean      0.0000   0.2168   0.2933   0.3329   0.4168  1.0000
area_mean           0.0000   0.1174   0.1729   0.2169   0.2711  1.0000
smoothness_mean     0.0000   0.3046   0.3904   0.3948   0.4755  1.0000
compactness_mean    0.0000   0.1397   0.2247   0.2606   0.3405  1.0000
concavity_mean      0.0000  0.06926  0.14419  0.20806  0.30623  1.0000
points_mean         0.0000   0.1009   0.1665   0.2431   0.3678  1.0000
symmetry_mean       0.0000   0.2823   0.3697   0.3796   0.4530  1.0000
dimension_mean      0.0000   0.1630   0.2439   0.2704   0.3404  1.0000
radius_se           0.0000  0.04378  0.07702  0.10635  0.13304  1.0000
texture_se          0.0000   0.1047   0.1653   0.1893   0.2462  1.0000
perimeter_se        0.0000  0.04000  0.07209  0.09938  0.12251  1.0000
area_se             0.0000  0.02064  0.03311  0.06264  0.07170  1.0000
smoothness_se       0.0000   0.1175   0.1586   0.1811   0.2187  1.0000
compactness_se      0.0000  0.08132  0.13667  0.17444  0.22680  1.0000
concavity_se        0.0000  0.03811  0.06538  0.08054  0.10619  1.0000
points_se           0.0000   0.1447   0.2070   0.2235   0.2787  1.0000
symmetry_se         0.0000   0.1024   0.1526   0.1781   0.2195  1.0000
dimension_se        0.0000  0.04675  0.07919  0.10019  0.12656  1.0000
radius_worst        0.0000   0.1807   0.2504   0.2967   0.3863  1.0000
texture_worst       0.0000   0.2415   0.3569   0.3640   0.4717  1.0000
perimeter_worst     0.0000   0.1678   0.2353   0.2831   0.3735  1.0000
area_worst          0.0000  0.08113  0.12321  0.17091  0.22090  1.0000
smoothness_worst    0.0000   0.3000   0.3971   0.4041   0.4942  1.0000
compactness_worst   0.0000   0.1163   0.1791   0.2202   0.3025  1.0000
concavity_worst     0.0000  0.09145  0.18107  0.21740  0.30583  1.0000
points_worst        0.0000   0.2231   0.3434   0.3938   0.5546  1.0000
symmetry_worst      0.0000   0.1851   0.2478   0.2633   0.3182  1.0000
dimension_worst     0.0000   0.1077   0.1640   0.1896   0.2429  1.0000

 

Step 2: Building a KNN model

#no randomization in sampling
cancerTraining <- cancer[1:469,] #82%
cancerTest <- cancer[470:569,]
cancerPred <- knn(train=cancerTraining[-1], test=cancerTest[-1], cl=cancerTraining[,1], k=21)
CrossTable(cancerTest$diagnosis, cancerPred, prop.chisq=F, prop.c=F, prop.r=F, dnn=c("Actual", "Predicted (KNN)"))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##              | Predicted (KNN) 
##       Actual |         B |         M | Row Total | 
## -------------|-----------|-----------|-----------|
##            B |        61 |         0 |        61 | 
##              |     0.610 |     0.000 |           | 
## -------------|-----------|-----------|-----------|
##            M |         2 |        37 |        39 | 
##              |     0.020 |     0.370 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |        63 |        37 |       100 | 
## -------------|-----------|-----------|-----------|
## 
## 

 

Step 3: Evaluating News Popularity
The first step is the same as Lab 1’s Tab 4 Step 1.

news17[-1] <- lapply(news17[2:17], function(x) {(x-min(x))/(max(x)-min(x))}) #min-max normalization to [0, 1]

set.seed(12345)
training <- sample(nrow(news17), nrow(news17)*0.75)
newsTraining <- news17[training,]
newsTest <- news17[-training,]

newsPredKNN <- knn(train=newsTraining[-1], test=newsTest[-1], cl=newsTraining[,1], k=172)
CrossTable(newsTest$popular, newsPredKNN, prop.chisq=F, prop.c=F, prop.r=F, dnn=c("Actual", "Predicted (KNN)"))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  9911 
## 
##  
##              | Predicted (KNN) 
##       Actual |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |      1997 |      2585 |      4582 | 
##              |     0.201 |     0.261 |           | 
## -------------|-----------|-----------|-----------|
##            1 |      1672 |      3657 |      5329 | 
##              |     0.169 |     0.369 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      3669 |      6242 |      9911 | 
## -------------|-----------|-----------|-----------|
## 
##