============================================================================================================
About: This document is also available at http://rpubs.com/sherloconan/562805
Data source: https://www.kaggle.com/sherloconan/anly-53053b
Question 1: Given this is an imperfect real-world dataset, you need to determine what you believe is the best value for “k” and write up this portion of your lab report. You should include a brief discussion of your k-Means analysis as well as the best value of “k” that you determine. You should include what mixture of variables within the clusters this value of “k” results in. That is, you need to interpret your k-Means analysis and discuss what it means.
Answer 1: According to the plot of the average total within sum of squares by number of clusters (k) in Fig. 1, also known as the elbow method, the optimal k is 3. This method looks at how much additional variance each extra cluster explains, and the optimal k is chosen where that rate of improvement starts to drop off. Other preferable techniques are the silhouette method (see the sketch at the end of Step 3 below) and the Bayesian information criterion. However, the instructions for Part 1 of Lab 3 suggest k=5.
When k=5,

- Cluster 1 seems to be those customers who heavily consume milk and grocery, followed by Cluster 4;
- Cluster 5 seems to be those customers who heavily consume fresh, followed by Cluster 2;
- Cluster 3 seems to represent the low (or regular) consumption customers.
Question 2: How many points do you see in each cluster?
Answer 2: The clustering algorithm used is k-means, i.e., the function kmeans() from the default package “stats”. The random seed is set to 76964057. There are 421 observations on six variables, namely “Fresh”, “Milk”, “Grocery”, “Frozen”, “Detergents_Paper”, and “Delicassen” (“Channel” and “Region” are not taken into consideration). The number of observations in each of the five clusters is shown below.
| Cluster | One | Two | Three | Four | Five |
|---|---|---|---|---|---|
| Count | 42 | 99 | 166 | 83 | 31 |
Step 1: Exploring and preparing the data
wholesale <- read.csv("~/Documents/HU/ANLY 530-53-B/Module 10- Naive Bayes classifier/Lab3/Wholesale_customers_data.csv")
pastecs::stat.desc(wholesale[-c(1,2)]) %>% kable(digits=0) %>% kable_styling(full_width=F)
|  | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen |
|---|---|---|---|---|---|---|
| nbr.val | 440 | 440 | 440 | 440 | 440 | 440 |
| nbr.null | 0 | 0 | 0 | 0 | 0 | 0 |
| nbr.na | 0 | 0 | 0 | 0 | 0 | 0 |
| min | 3 | 55 | 3 | 25 | 3 | 3 |
| max | 112151 | 73498 | 92780 | 60869 | 40827 | 47943 |
| range | 112148 | 73443 | 92777 | 60844 | 40824 | 47940 |
| sum | 5280131 | 2550357 | 3498562 | 1351650 | 1267857 | 670943 |
| median | 8504 | 3627 | 4756 | 1526 | 816 | 966 |
| mean | 12000 | 5796 | 7951 | 3072 | 2881 | 1525 |
| SE.mean | 603 | 352 | 453 | 231 | 227 | 134 |
| CI.mean.0.95 | 1185 | 692 | 890 | 455 | 447 | 264 |
| var | 159954927 | 54469967 | 90310104 | 23567853 | 22732436 | 7952997 |
| std.dev | 12647 | 7380 | 9503 | 4855 | 4768 | 2820 |
| coef.var | 1 | 1 | 1 | 2 | 2 | 2 |
#Note: In this case, scaling / normalizing the dataset will not necessarily remove the outliers; a logarithmic transformation might (or might not) help.
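#Illustrative sketch of that note (not a lab step): a log transformation of the six spending columns.
#The name "wholesaleLog" is introduced here only for illustration and is not used further below.
wholesaleLog <- log1p(wholesale[-c(1,2)])  #log(1+x) also handles any zero spend safely
pastecs::stat.desc(wholesaleLog)["coef.var",]  #compare with the coef.var row in the table above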
top.n.custs <- function(data, cols, n) {
#' Given a data frame, find the row indices of the top n values in each specified column
#' Input: data - the data frame
#' cols - the vector of the specified columns, e.g., c(3:8)
#' n - top number, e.g., 5
#' Output: return a vector of such row indices without duplicates
index <- integer(0)
for (i in cols) {
top.n <- head(order(data[,i], decreasing=T), n)
index <- union(index, top.n)
}
return(index)
}
wholesaleR <- wholesale[-top.n.custs(wholesale, 3:8, 5),]
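As a quick illustrative check (not required by the lab), the number of customers removed by this step can be confirmed directly; the helper object top.custs below is introduced only for this sketch.

top.custs <- top.n.custs(wholesale, 3:8, 5)
length(top.custs)  #19 distinct top-spending customers are dropped
nrow(wholesaleR)   #440 - 19 = 421 observations remain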
Step 2: Building a k-Means Clustering model
set.seed(76964057)
k <- kmeans(wholesaleR[-c(1,2)], centers=5) #create 5 clusters; drop the "Channel" and "Region" variables
k$centers %>% kable() %>% kable_styling(full_width=F)
| Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen |
|---|---|---|---|---|---|
| 5830.214 | 15295.048 | 23449.167 | 1936.452 | 10361.6429 | 1912.738 |
| 18649.606 | 3335.586 | 4497.848 | 3301.747 | 1046.5859 | 1450.566 |
| 5845.392 | 2337.319 | 2878.205 | 2766.596 | 660.2952 | 858.994 |
| 4238.892 | 7725.289 | 11011.747 | 1336.566 | 4733.3614 | 1400.530 |
| 35922.387 | 4851.806 | 5862.581 | 3730.677 | 1004.6129 | 1552.161 |
table(k$cluster)
##
## 1 2 3 4 5
## 42 99 166 83 31
Step 3: Finding the optimal number of clusters (elbow method)
avg.totw.ss <- numeric(19)
for (num in 2:20) {
totw.ss <- numeric(100)
for (trial in 1:100) {
runs <- kmeans(wholesaleR[-c(1,2)], centers=num) #Warning: kmeans() may not converge within the default iter.max of 10 in some trials; consider assigning, e.g., iter.max=15
totw.ss[trial] <- runs$tot.withinss
}
avg.totw.ss[num-1] <- mean(totw.ss)
}
#(avg.totw.ss[-19]-avg.totw.ss[-1])/avg.totw.ss[-19]
ggplot(aes(x,y), data=data.frame("x"=c(2:20), "y"=avg.totw.ss/1000000000))+geom_line()+geom_point(size=2.5)+labs(x="Number of Clusters (starting with k=2)", y="Average Total Within Sum of Squares (×10^9)")+ggtitle("Fig. 1. Plot of TotW.SS by k in \"Wholesale\" Clustering")+theme_classic()
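Answer 1 also mentions the silhouette method as an alternative to the elbow heuristic. The lines below are a minimal, illustrative sketch of that idea on the same trimmed data, assuming the “cluster” package is available; the object name avg.sil is introduced only for this sketch, and its result is not part of the lab deliverables.

#Sketch: average silhouette width for k = 2..10; larger values indicate better-separated clusters
library(cluster)
set.seed(76964057)
avg.sil <- sapply(2:10, function(num) {
  fit <- kmeans(wholesaleR[-c(1,2)], centers=num, nstart=25, iter.max=15)
  mean(silhouette(fit$cluster, dist(wholesaleR[-c(1,2)]))[, "sil_width"])
})
which.max(avg.sil) + 1  #candidate k (the grid starts at 2, hence the offset)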
Summary: “These data are the
results of a chemical analysis of wines grown in the same region in
Italy but derived from three different cultivars.”
As usual, according to the plot of total within sum of squares by
number of clusters (k) in Fig. 2 (the elbow method), the optimal k is 3. The function
NbClust from the package “NbClust” also proposes k=3 as the
best number of clusters by 19 out of 26 criteria. The random seed is set
to 1234. There are 178 observations on the 13 scaled attributes. The
cluster sizes are shown below.
| Cluster | One | Two | Three |
|---|---|---|---|
| Count | 62 | 65 | 51 |
It is worthwhile to note that the indices do not necessarily mean that Cluster 1 is Wine Type 1, etc., but this is the case here because the first column (Wine Type) is in increasing order. In that regard, the confusion matrix reflects an accuracy of 96.63%.
Cluster 1 has relatively high values in Alcohol, Ash, Magnesium, Total Phenols, Flavanoids, Proanthocyanins, OD280/OD315 of Diluted Wines, and Proline, and relatively low values in Alcalinity of Ash and Nonflavanoid Phenols. Cluster 2 has relatively low values in Alcohol, Ash, Magnesium, Color Intensity, and Proline, which are almost the opposite of Cluster 1. Cluster 3 has relatively high values in Malic Acid, Alcalinity of Ash, Nonflavanoid Phenols, and Color Intensity, and relatively low values in Total Phenols, Flavanoids, Proanthocyanins, Hue, and OD280/OD315 of Diluted Wines, which are also opposite to Cluster 1.
Step 1: Exploring and preparing the data
#Note: this wine dataset is different from the one used in Lab 1.
wine <- read.csv("~/Documents/HU/ANLY 530-53-B/Module 10- Naive Bayes classifier/Lab3/wine.csv")
wineR <- wine
wineR[-1] <- scale(wineR[-1])
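A brief sanity check (illustrative only) confirms what scale() does here: each of the 13 attributes now has mean 0 and standard deviation 1.

round(colMeans(wineR[-1]), 3)      #means are all (numerically) 0 after scaling
round(apply(wineR[-1], 2, sd), 3)  #standard deviations are all 1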
Step 2: Building k-Means Clustering models (elbow method)
# #OR
# totwss <- sapply(1:15, function(x) {set.seed(1234); kmeans(wineR[-1], x)$tot.withinss})
wssplot <- function(data, nc=15, seed=1234, title) {
#' Given a data frame, plot total within sum of squares by number of clusters from 1 to nc
#' Input: data - the data frame
#' nc - maximum number of clusters to try, default nc=15
#' seed - random seed, default seed=1234
#' title - the figure title
#' Output: ggplot
totwss <- numeric(nc)
for (num in 1:nc) {
set.seed(seed)
totwss[num] <- kmeans(data, centers=num)$tot.withinss
}
#totwss[1] == (nrow(data)-1)*sum(apply(data,2,var)), i.e., (178-1)*13 for the scaled wine data, since each of its 13 standardized columns has variance 1
fig <- ggplot(aes(x,y), data=data.frame("x"=c(1:nc), "y"=totwss))+geom_line()+geom_point(size=2.5)+labs(x="Number of Clusters (starting with k=1)", y="Total Within Sum of Squares")+ggtitle(title)+theme_classic()
return(fig)
}
wssplot(wineR[-1], title="Fig. 2. Plot of TotW.SS by k in \"Wine\" Clustering")
Step 3: Cluster analysis
set.seed(1234)
cluster.analysis <- NbClust(wineR[-1], min.nc=2, max.nc=15, method="kmeans")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 2 proposed 2 as the best number of clusters
## * 19 proposed 3 as the best number of clusters
## * 1 proposed 14 as the best number of clusters
## * 1 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
#Also note that length(unique(wineR$Wine)) == 3
#Number of clusters chosen by 26 criteria
table(cluster.analysis$Best.n[1,]) #e.g., 19 proposed 3 as the best number of clusters
##
## 0 1 2 3 14 15
## 2 1 2 19 1 1
#n <- readline(prompt="Enter the best number of clusters: ")
set.seed(1234)
k2 <- kmeans(wineR[-1], 3, nstart=25)
table(k2$cluster)
##
## 1 2 3
## 62 65 51
k2$centers %>% kable(digits=3) %>% kable_styling(font_size=12)
| Alcohol | Malic.acid | Ash | Acl | Mg | Phenols | Flavanoids | Nonflavanoid.phenols | Proanth | Color.int | Hue | OD | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.833 | -0.303 | 0.364 | -0.608 | 0.576 | 0.883 | 0.975 | -0.561 | 0.579 | 0.171 | 0.473 | 0.777 | 1.122 |
| -0.923 | -0.393 | -0.493 | 0.170 | -0.490 | -0.076 | 0.021 | -0.033 | 0.058 | -0.899 | 0.461 | 0.270 | -0.752 |
| 0.164 | 0.869 | 0.186 | 0.523 | -0.075 | -0.977 | -1.212 | 0.724 | -0.778 | 0.939 | -1.162 | -1.289 | -0.406 |
aggregate(wine[-1], by=list(Cluster=k2$cluster), mean) %>% kable(digits=3) %>% kable_styling(font_size=12)
| Cluster | Alcohol | Malic.acid | Ash | Acl | Mg | Phenols | Flavanoids | Nonflavanoid.phenols | Proanth | Color.int | Hue | OD | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 13.677 | 1.998 | 2.466 | 17.463 | 107.968 | 2.848 | 3.003 | 0.292 | 1.922 | 5.454 | 1.065 | 3.163 | 1100.226 |
| 2 | 12.251 | 1.897 | 2.231 | 20.063 | 92.738 | 2.248 | 2.050 | 0.358 | 1.624 | 2.973 | 1.063 | 2.803 | 510.169 |
| 3 | 13.134 | 3.307 | 2.418 | 21.241 | 98.667 | 1.684 | 0.819 | 0.452 | 1.146 | 7.235 | 0.692 | 1.697 | 619.059 |
(t <- table(wineR$Wine, k2$cluster)) #Note: indices do not necessarily mean that Cluster 1 is Wine Type 1, etc., but this case is so because the first column (Wine Type) is in increasing order.
##
## 1 2 3
## 1 59 0 0
## 2 3 65 3
## 3 0 0 48
(sum(diag(t))/sum(t))*100
## [1] 96.62921
par(mfrow=c(1,1))
cluster::clusplot(wineR[-1], k2$cluster, main="2D Representation of the Cluster Solution", lines=0, labels=4)
#About PCA: https://stats.stackexchange.com/questions/274754/how-to-interpret-the-clusplot-in-r
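As an aside on reading the plot: clusplot() projects the observations onto the first two principal components (see the link above). A minimal sketch with prcomp() shows how much of the total variance that 2D view captures; the object pca is introduced only for this sketch.

pca <- prcomp(wineR[-1])  #attributes are already standardized, so no extra scaling is applied
summary(pca)$importance["Cumulative Proportion", 1:2]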
Summary: The scaled predictors are “Alcohol,”
“Malic.acid,” “Ash,” “Acl,” “Mg,” “Phenols,” “Flavanoids,”
“Nonflavanoid.phenols,” “Proanth,” “Color.int,” “Hue,” “OD,” and
“Proline,” with 141 observations in the training set and 37 observations
in the test set. The random seed is set to 1234 for sampling purposes.
The regression trees model, i.e., the function
rpart() from the package “rpart”, is used for
classification in predicting clusters. The accuracy for the training set
and the test set is 96.45% and 91.89%, respectively. The three most important
features in this model are “Flavanoids,” “Phenols,” and “OD”.
Previously, Part 2 identified six mismatches: three instances of Type 2 wine are grouped into Cluster 1, and three instances of Type 2 wine are grouped into Cluster 3. Investigating the row indices where the cluster labels disagree with the wine types shows that five of these instances fall in the training set and one in the test set. \(\frac{5}{141}\approx3.5461\%\). Hence, the training process may indeed be compromised. The five training-set mismatches and the model's predictions for them are listed below.
| Wine Type | Cluster | Prediction |
|---|---|---|
| 2 | 3 | 3 |
| 2 | 1 | 2 |
| 2 | 1 | 1 |
| 2 | 3 | 3 |
| 2 | 3 | 2 |
Step 1: Exploring and preparing the data
wineR$Cluster <- k2$cluster
set.seed(1234)
wineR2 <- wineR[sample(1:nrow(wineR)),]
mis <- which(wineR2$Wine!=wineR2$Cluster) #19 25 52 100 119 166
wineTraining <- wineR2[1:(as.integer(0.8*nrow(wineR2))-1),] #141 obs.
wineTest <- wineR2[as.integer(0.8*nrow(wineR2)):nrow(wineR2),] #37 obs.
Step 2: Training a model on the data
(wineModelR <- rpart(Cluster~., data=wineTraining[-1], method="class"))
## n= 141
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 141 91 2 (0.32624113 0.35460993 0.31914894)
## 2) Proline>=0.02574325 52 7 1 (0.86538462 0.01923077 0.11538462)
## 4) Phenols>=-0.2718097 45 0 1 (1.00000000 0.00000000 0.00000000) *
## 5) Phenols< -0.2718097 7 1 3 (0.00000000 0.14285714 0.85714286) *
## 3) Proline< 0.02574325 89 40 2 (0.01123596 0.55056180 0.43820225)
## 6) OD>=-0.5939311 46 2 2 (0.02173913 0.95652174 0.02173913) *
## 7) OD< -0.5939311 43 5 3 (0.00000000 0.11627907 0.88372093)
## 14) Color.int< -0.4671943 7 2 2 (0.00000000 0.71428571 0.28571429) *
## 15) Color.int>=-0.4671943 36 0 3 (0.00000000 0.00000000 1.00000000) *
# #OR
# rpart.plot(wineModelR, digits=4, type=3, extra=101)
fancyRpartPlot(wineModelR)
#accuracy on the training set
wineTrPredR <- predict(wineModelR, wineTraining[-1], type="class")
CrossTable(wineTraining$Cluster, wineTrPredR, prop.chisq=F, prop.c=F, prop.r=F, dnn=c("Actual", "Predicted (RT)"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 141
##
##
## | Predicted (RT)
## Actual | 1 | 2 | 3 | Row Total |
## -------------|-----------|-----------|-----------|-----------|
## 1 | 45 | 1 | 0 | 46 |
## | 0.319 | 0.007 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|
## 2 | 0 | 49 | 1 | 50 |
## | 0.000 | 0.348 | 0.007 | |
## -------------|-----------|-----------|-----------|-----------|
## 3 | 0 | 3 | 42 | 45 |
## | 0.000 | 0.021 | 0.298 | |
## -------------|-----------|-----------|-----------|-----------|
## Column Total | 45 | 53 | 43 | 141 |
## -------------|-----------|-----------|-----------|-----------|
##
##
comparable <- data.frame(wineTraining[c("Wine","Cluster")], "Pred"=wineTrPredR)
comparable[mis[mis<=nrow(wineTraining)],]
## Wine Cluster Pred
## 62 2 3 3
## 122 2 1 2
## 96 2 1 1
## 84 2 3 3
## 119 2 3 2
Step 3: Evaluating model performance
#accuracy on the test set
winePredR <- predict(wineModelR, wineTest[-1], type="class")
CrossTable(wineTest$Cluster, winePredR, prop.chisq=F, prop.c=F, prop.r=F, dnn=c("Actual", "Predicted (RT)"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 37
##
##
## | Predicted (RT)
## Actual | 1 | 2 | 3 | Row Total |
## -------------|-----------|-----------|-----------|-----------|
## 1 | 14 | 2 | 0 | 16 |
## | 0.378 | 0.054 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|
## 2 | 1 | 14 | 0 | 15 |
## | 0.027 | 0.378 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|
## 3 | 0 | 0 | 6 | 6 |
## | 0.000 | 0.000 | 0.162 | |
## -------------|-----------|-----------|-----------|-----------|
## Column Total | 15 | 16 | 6 | 37 |
## -------------|-----------|-----------|-----------|-----------|
##
##
wineModelR$variable.importance %>% kable(digits=4) %>% kable_styling(full_width=F)
|  | x |
|---|---|
| Flavanoids | 58.1216 |
| Phenols | 44.8599 |
| OD | 42.8464 |
| Alcohol | 39.9385 |
| Proline | 36.6332 |
| Color.int | 29.2045 |
| Proanth | 20.2278 |
| Acl | 15.3934 |
| Hue | 12.5302 |
| Mg | 9.8628 |
| Nonflavanoid.phenols | 4.5565 |
| Ash | 2.5629 |
| Malic.acid | 2.5629 |
Question 3: Load the breast cancer dataset. Do the preliminary analysis and implement a KNN (k-nearest neighbors) model for this dataset, and don’t forget that whenever it is required you should use: set.seed(12345).
Answer 3: There are 30 scaled predictors, with 469 observations in the training set and 100 observations in the test set. There is no randomization in sampling: the first 469 rows form the training set and the remaining 100 rows form the test set. The k-nearest neighbors (KNN) model, i.e., the function knn() from the package “class”, is used for classification in predicting “diagnosis”. k is set to roughly the square root of the number of rows in the training set (sqrt(469) ≈ 21.7), i.e., k=21. The accuracy for the test set is 98%.
Question 4: Now let’s get back to our problem of news popularity and see if we can apply KNN (k-nearest neighbors) to improve the accuracy of the model. Use the same strategy of training and testing that we did in the first two labs, and don’t forget that whenever it is required you should use: set.seed(12345).
Answer 4: The random seed is set to 12345 for sampling. k is set to roughly the square root of the number of rows in the training set, i.e., k=172. The accuracy for the test set is 57.05%. The test-set accuracies of all eight models are compared below.
|  | Decision Tree | Random Forest | Regression Trees | Naïve Bayes Classifier | Linear SVM | Polynomial SVM | Radial Basis SVM | k-Nearest Neighbors |
|---|---|---|---|---|---|---|---|---|
| Test set | 59.11% | 60.25% | 56.37% | 47.22% | 57.51% | 57.51% | 59.34% | 57.05% |
Step 1: Exploring and preparing the data
cancer <- read.csv("~/Documents/HU/ANLY 530-53-B/Module 10- Naive Bayes classifier/Lab3/wisc_bc_data.csv")
cancer <- cancer[-1]
cancer[-1] <- lapply(cancer[2:31], function(x) {(x-min(x))/(max(x)-min(x))}) #min-max normalization to [0, 1]
summary(cancer) %>% kable(digits=4) %>% kable_styling(font_size=11)
| diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | points_mean | symmetry_mean | dimension_mean | radius_se | texture_se | perimeter_se | area_se | smoothness_se | compactness_se | concavity_se | points_se | symmetry_se | dimension_se | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | points_worst | symmetry_worst | dimension_worst | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Length:569 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.0000 | Min. :0.00000 | Min. :0.00000 | Min. :0.0000 | Min. :0.00000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | |
| Class :character | 1st Qu.:0.2233 | 1st Qu.:0.2185 | 1st Qu.:0.2168 | 1st Qu.:0.1174 | 1st Qu.:0.3046 | 1st Qu.:0.1397 | 1st Qu.:0.06926 | 1st Qu.:0.1009 | 1st Qu.:0.2823 | 1st Qu.:0.1630 | 1st Qu.:0.04378 | 1st Qu.:0.1047 | 1st Qu.:0.04000 | 1st Qu.:0.02064 | 1st Qu.:0.1175 | 1st Qu.:0.08132 | 1st Qu.:0.03811 | 1st Qu.:0.1447 | 1st Qu.:0.1024 | 1st Qu.:0.04675 | 1st Qu.:0.1807 | 1st Qu.:0.2415 | 1st Qu.:0.1678 | 1st Qu.:0.08113 | 1st Qu.:0.3000 | 1st Qu.:0.1163 | 1st Qu.:0.09145 | 1st Qu.:0.2231 | 1st Qu.:0.1851 | 1st Qu.:0.1077 | |
| Mode :character | Median :0.3024 | Median :0.3088 | Median :0.2933 | Median :0.1729 | Median :0.3904 | Median :0.2247 | Median :0.14419 | Median :0.1665 | Median :0.3697 | Median :0.2439 | Median :0.07702 | Median :0.1653 | Median :0.07209 | Median :0.03311 | Median :0.1586 | Median :0.13667 | Median :0.06538 | Median :0.2070 | Median :0.1526 | Median :0.07919 | Median :0.2504 | Median :0.3569 | Median :0.2353 | Median :0.12321 | Median :0.3971 | Median :0.1791 | Median :0.18107 | Median :0.3434 | Median :0.2478 | Median :0.1640 | |
| NA | Mean :0.3382 | Mean :0.3240 | Mean :0.3329 | Mean :0.2169 | Mean :0.3948 | Mean :0.2606 | Mean :0.20806 | Mean :0.2431 | Mean :0.3796 | Mean :0.2704 | Mean :0.10635 | Mean :0.1893 | Mean :0.09938 | Mean :0.06264 | Mean :0.1811 | Mean :0.17444 | Mean :0.08054 | Mean :0.2235 | Mean :0.1781 | Mean :0.10019 | Mean :0.2967 | Mean :0.3640 | Mean :0.2831 | Mean :0.17091 | Mean :0.4041 | Mean :0.2202 | Mean :0.21740 | Mean :0.3938 | Mean :0.2633 | Mean :0.1896 | |
| NA | 3rd Qu.:0.4164 | 3rd Qu.:0.4089 | 3rd Qu.:0.4168 | 3rd Qu.:0.2711 | 3rd Qu.:0.4755 | 3rd Qu.:0.3405 | 3rd Qu.:0.30623 | 3rd Qu.:0.3678 | 3rd Qu.:0.4530 | 3rd Qu.:0.3404 | 3rd Qu.:0.13304 | 3rd Qu.:0.2462 | 3rd Qu.:0.12251 | 3rd Qu.:0.07170 | 3rd Qu.:0.2187 | 3rd Qu.:0.22680 | 3rd Qu.:0.10619 | 3rd Qu.:0.2787 | 3rd Qu.:0.2195 | 3rd Qu.:0.12656 | 3rd Qu.:0.3863 | 3rd Qu.:0.4717 | 3rd Qu.:0.3735 | 3rd Qu.:0.22090 | 3rd Qu.:0.4942 | 3rd Qu.:0.3025 | 3rd Qu.:0.30583 | 3rd Qu.:0.5546 | 3rd Qu.:0.3182 | 3rd Qu.:0.2429 | |
| NA | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.0000 | Max. :1.00000 | Max. :1.00000 | Max. :1.0000 | Max. :1.00000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 |
Step 2: Building a KNN model
#no randomization in sampling
cancerTraining <- cancer[1:469,] #82%
cancerTest <- cancer[470:569,]
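#k = 21 is roughly the square root of the training-set size: sqrt(469) ≈ 21.7, rounded down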
cancerPred <- knn(train=cancerTraining[-1], test=cancerTest[-1], cl=cancerTraining[,1], k=21)
CrossTable(cancerTest$diagnosis, cancerPred, prop.chisq=F, prop.c=F, prop.r=F, dnn=c("Actual", "Predicted (KNN)"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | Predicted (KNN)
## Actual | B | M | Row Total |
## -------------|-----------|-----------|-----------|
## B | 61 | 0 | 61 |
## | 0.610 | 0.000 | |
## -------------|-----------|-----------|-----------|
## M | 2 | 37 | 39 |
## | 0.020 | 0.370 | |
## -------------|-----------|-----------|-----------|
## Column Total | 63 | 37 | 100 |
## -------------|-----------|-----------|-----------|
##
##
Step 3: Evaluating News Popularity
The first step is the same as Lab 1’s Tab 4 Step 1.
news17[-1] <- lapply(news17[2:17], function(x) {(x-min(x))/(max(x)-min(x))}) #min-max normalization to [0, 1]
set.seed(12345)
training <- sample(nrow(news17), nrow(news17)*0.75)
newsTraining <- news17[training,]
newsTest <- news17[-training,]
newsPredKNN <- knn(train=newsTraining[-1], test=newsTest[-1], cl=newsTraining[,1], k=172)
CrossTable(newsTest$popular, newsPredKNN, prop.chisq=F, prop.c=F, prop.r=F, dnn=c("Actual", "Predicted (KNN)"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 9911
##
##
## | Predicted (KNN)
## Actual | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 1997 | 2585 | 4582 |
## | 0.201 | 0.261 | |
## -------------|-----------|-----------|-----------|
## 1 | 1672 | 3657 | 5329 |
## | 0.169 | 0.369 | |
## -------------|-----------|-----------|-----------|
## Column Total | 3669 | 6242 | 9911 |
## -------------|-----------|-----------|-----------|
##
##
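For completeness, the 57.05% figure quoted in Answer 4 can be recovered from the table above with the same diagonal-over-total check used in Part 2; the temporary object t2 below is introduced only for this sketch.

t2 <- table(newsTest$popular, newsPredKNN)
(sum(diag(t2))/sum(t2))*100  #(1997+3657)/9911 ≈ 57.05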