============================================================================================================
About: This document is also available at http://rpubs.com/sherloconan/562805
Data source: https://www.kaggle.com/sherloconan/anly-53053b
Question 1: Given this is an imperfect real-world dataset, you need to determine what you believe is the best value for “k” and write up this portion of your lab report. You should include a brief discussion of your k-Means analysis as well as the best value of “k” that you determine. You should include what mixture of variables within the clusters this value of “k” results in. That is, you need to interpret your k-Means analysis and discuss what it means.
Answer 1: According to the plot of the average total within sum of squares by number of clusters (k) in Fig. 1, also known as the elbow method, the optimal k is 3. This method looks at how much additional variance each extra cluster explains, and the optimal k is chosen where that rate of improvement starts to drop off. Other preferable techniques are the silhouette method (see the sketch at the end of Step 3 below) and the Bayesian information criterion. However, the instructions for Part 1 of Lab 3 suggest k=5.
When k=5,

- Cluster 1 seems to be those customers who heavily consume milk and grocery, followed by Cluster 4;
- Cluster 5 seems to be those customers who heavily consume fresh, followed by Cluster 2;
- Cluster 3 seems to represent the low (or regular) consumption customers.
Question 2: How many points do you see in each cluster?
Answer 2: The clustering algorithm used is k-means, i.e., the function kmeans() from the default package “stats”. The random seed is set to 76964057. There are 421 observations on six variables, namely “Fresh”, “Milk”, “Grocery”, “Frozen”, “Detergents_Paper”, and “Delicassen” (“Channel” and “Region” are not taken into consideration). The number of observations in each of the five clusters is shown below.
| Cluster | One | Two | Three | Four | Five |
|---|---|---|---|---|---|
| Count | 42 | 99 | 166 | 83 | 31 |
Step 1: Exploring and preparing the data
wholesale <- read.csv("~/Documents/HU/ANLY 530-53-B/Module 10- Naive Bayes classifier/Lab3/Wholesale_customers_data.csv")
pastecs::stat.desc(wholesale[-c(1,2)]) %>% kable(digits=0) %>% kable_styling(full_width=F)
|  | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen |
|---|---|---|---|---|---|---|
| nbr.val | 440 | 440 | 440 | 440 | 440 | 440 |
| nbr.null | 0 | 0 | 0 | 0 | 0 | 0 |
| nbr.na | 0 | 0 | 0 | 0 | 0 | 0 |
| min | 3 | 55 | 3 | 25 | 3 | 3 |
| max | 112151 | 73498 | 92780 | 60869 | 40827 | 47943 |
| range | 112148 | 73443 | 92777 | 60844 | 40824 | 47940 |
| sum | 5280131 | 2550357 | 3498562 | 1351650 | 1267857 | 670943 |
| median | 8504 | 3627 | 4756 | 1526 | 816 | 966 |
| mean | 12000 | 5796 | 7951 | 3072 | 2881 | 1525 |
| SE.mean | 603 | 352 | 453 | 231 | 227 | 134 |
| CI.mean.0.95 | 1185 | 692 | 890 | 455 | 447 | 264 |
| var | 159954927 | 54469967 | 90310104 | 23567853 | 22732436 | 7952997 |
| std.dev | 12647 | 7380 | 9503 | 4855 | 4768 | 2820 |
| coef.var | 1 | 1 | 1 | 2 | 2 | 2 |
#Note: In this case, scaling / normalizing the dataset will not necessarily remove the outliers; a logarithmic transformation might (or might not) help.
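#Illustrative sketch of that note (not a lab step): a log transformation of the six spending columns.
#The name "wholesaleLog" is introduced here only for illustration and is not used further below.
wholesaleLog <- log1p(wholesale[-c(1,2)])  #log(1+x) also handles any zero spend safely
pastecs::stat.desc(wholesaleLog)["coef.var",]  #compare with the coef.var row in the table above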
top.n.custs <- function(data, cols, n) {
#' Given a data frame, find the row indices of the top n values in each specified column
#' Input: data - the data frame
#' cols - the vector of the specified columns, e.g., c(3:8)
#' n - top number, e.g., 5
#' Output: return a vector of such row indices without duplicates
index <- integer(0)
for (i in cols) {
top.n <- head(order(data[,i], decreasing=T), n)
index <- union(index, top.n)
}
return(index)
}
wholesaleR <- wholesale[-top.n.custs(wholesale, 3:8, 5),]
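As a quick illustrative check (not required by the lab), the number of customers removed by this step can be confirmed directly; the helper object top.custs below is introduced only for this sketch.

top.custs <- top.n.custs(wholesale, 3:8, 5)
length(top.custs)  #19 distinct top-spending customers are dropped
nrow(wholesaleR)   #440 - 19 = 421 observations remain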
Step 2: Building a k-Means Clustering model
set.seed(76964057)
k <- kmeans(wholesaleR[-c(1,2)], centers=5) #create 5 clusters; drop the "Channel" and "Region" variables
k$centers %>% kable() %>% kable_styling(full_width=F)
| Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen |
|---|---|---|---|---|---|
| 5830.214 | 15295.048 | 23449.167 | 1936.452 | 10361.6429 | 1912.738 |
| 18649.606 | 3335.586 | 4497.848 | 3301.747 | 1046.5859 | 1450.566 |
| 5845.392 | 2337.319 | 2878.205 | 2766.596 | 660.2952 | 858.994 |
| 4238.892 | 7725.289 | 11011.747 | 1336.566 | 4733.3614 | 1400.530 |
| 35922.387 | 4851.806 | 5862.581 | 3730.677 | 1004.6129 | 1552.161 |
table(k$cluster)
##
## 1 2 3 4 5
## 42 99 166 83 31
Step 3: Finding the optimal number of clusters (elbow method)
avg.totw.ss <- numeric(19)
for (num in 2:20) {
totw.ss <- numeric(100)
for (trial in 1:100) {
runs <- kmeans(wholesaleR[-c(1,2)], centers=num) #Warning: kmeans() may not converge within the default iter.max of 10 in some trials; consider assigning, e.g., iter.max=15
totw.ss[trial] <- runs$tot.withinss
}
avg.totw.ss[num-1] <- mean(totw.ss)
}
#(avg.totw.ss[-19]-avg.totw.ss[-1])/avg.totw.ss[-19]
ggplot(aes(x,y), data=data.frame("x"=c(2:20), "y"=avg.totw.ss/1000000000))+geom_line()+geom_point(size=2.5)+labs(x="Number of Clusters (starting with k=2)", y="Average Total Within Sum of Squares (×10^9)")+ggtitle("Fig. 1. Plot of TotW.SS by k in \"Wholesale\" Clustering")+theme_classic()
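Answer 1 also mentions the silhouette method as an alternative to the elbow heuristic. The lines below are a minimal, illustrative sketch of that idea on the same trimmed data, assuming the “cluster” package is available; the object name avg.sil is introduced only for this sketch, and its result is not part of the lab deliverables.

#Sketch: average silhouette width for k = 2..10; larger values indicate better-separated clusters
library(cluster)
set.seed(76964057)
avg.sil <- sapply(2:10, function(num) {
  fit <- kmeans(wholesaleR[-c(1,2)], centers=num, nstart=25, iter.max=15)
  mean(silhouette(fit$cluster, dist(wholesaleR[-c(1,2)]))[, "sil_width"])
})
which.max(avg.sil) + 1  #candidate k (the grid starts at 2, hence the offset)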
Summary: “These data are the
results of a chemical analysis of wines grown in the same region in
Italy but derived from three different cultivars.”
As usual, according to the plot of total within sum of squares by
number of clusters (k) in Fig. 2 (the elbow method), the optimal k is 3. The function
NbClust from the package “NbClust” also proposes k=3 as the
best number of clusters by 19 out of 26 criteria. The random seed is set
to 1234. There are 178 observations on the 13 scaled attributes. The
cluster sizes are shown below.
| Cluster | One | Two | Three |
|---|---|---|---|
| Count | 62 | 65 | 51 |
It is worthwhile to note that the indices do not necessarily mean that Cluster 1 is Wine Type 1, etc., but this is the case here because the first column (Wine Type) is in increasing order. In that regard, the confusion matrix reflects an accuracy of 96.63%.
Cluster 1 has relatively high values in Alcohol, Ash, Magnesium, Total Phenols, Flavanoids, Proanthocyanins, OD280/OD315 of Diluted Wines, and Proline, and relatively low values in Alcalinity of Ash and Nonflavanoid Phenols. Cluster 2 has relatively low values in Alcohol, Ash, Magnesium, Color Intensity, and Proline, which are almost the opposite of Cluster 1. Cluster 3 has relatively high values in Malic Acid, Alcalinity of Ash, Nonflavanoid Phenols, and Color Intensity, and relatively low values in Total Phenols, Flavanoids, Proanthocyanins, Hue, and OD280/OD315 of Diluted Wines, which are also opposite to Cluster 1.
Step 1: Exploring and preparing the data
#Note: this wine dataset is different from the one used in Lab 1.
wine <- read.csv("~/Documents/HU/ANLY 530-53-B/Module 10- Naive Bayes classifier/Lab3/wine.csv")
wineR <- wine
wineR[-1] <- scale(wineR[-1])
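A brief sanity check (illustrative only) confirms what scale() does here: each of the 13 attributes now has mean 0 and standard deviation 1.

round(colMeans(wineR[-1]), 3)      #means are all (numerically) 0 after scaling
round(apply(wineR[-1], 2, sd), 3)  #standard deviations are all 1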
Step 2: Building k-Means Clustering models (elbow method)
# #OR
# totwss <- sapply(1:15, function(x) {set.seed(1234); kmeans(wineR[-1], x)$tot.withinss})
wssplot <- function(data, nc=15, seed=1234, title) {
#' Given a data frame, plot total within sum of squares by number of clusters from 1 to nc
#' Input: data - the data frame
#' nc - maximum number of clusters to try, default nc=15
#' seed - random seed, default seed=1234
#' title - the figure title
#' Output: ggplot
totwss <- numeric(nc)
for (num in 1:nc) {
set.seed(seed)
totwss[num] <- kmeans(data, centers=num)$tot.withinss
}
#totwss[1] == (nrow(data)-1)*sum(apply(data,2,var)), i.e., (178-1)*13 for the scaled wine data, since each of its 13 standardized columns has variance 1
fig <- ggplot(aes(x,y), data=data.frame("x"=c(1:nc), "y"=totwss))+geom_line()+geom_point(size=2.5)+labs(x="Number of Clusters (starting with k=1)", y="Total Within Sum of Squares")+ggtitle(title)+theme_classic()
return(fig)
}
wssplot(wineR[-1], title="Fig. 2. Plot of TotW.SS by k in \"Wine\" Clustering")
Step 3: Cluster analysis
set.seed(1234)
cluster.analysis <- NbClust(wineR[-1], min.nc=2, max.nc=15, method="kmeans")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 2 proposed 2 as the best number of clusters
## * 19 proposed 3 as the best number of clusters
## * 1 proposed 14 as the best number of clusters
## * 1 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
#Also note that length(unique(wineR$Wine)) == 3
#Number of clusters chosen by 26 criteria
table(cluster.analysis$Best.n[1,]) #e.g., 19 proposed 3 as the best number of clusters
##
## 0 1 2 3 14 15
## 2 1 2 19 1 1
#n <- readline(prompt="Enter the best number of clusters: ")
set.seed(1234)
k2 <- kmeans(wineR[-1], 3, nstart=25)
table(k2$cluster)
##
## 1 2 3
## 62 65 51
k2$centers %>% kable(digits=3) %>% kable_styling(font_size=12)
| Alcohol | Malic.acid | Ash | Acl | Mg | Phenols | Flavanoids | Nonflavanoid.phenols | Proanth | Color.int | Hue | OD | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.833 | -0.303 | 0.364 | -0.608 | 0.576 | 0.883 | 0.975 | -0.561 | 0.579 | 0.171 | 0.473 | 0.777 | 1.122 |
| -0.923 | -0.393 | -0.493 | 0.170 | -0.490 | -0.076 | 0.021 | -0.033 | 0.058 | -0.899 | 0.461 | 0.270 | -0.752 |
| 0.164 | 0.869 | 0.186 | 0.523 | -0.075 | -0.977 | -1.212 | 0.724 | -0.778 | 0.939 | -1.162 | -1.289 | -0.406 |
aggregate(wine[-1], by=list(Cluster=k2$cluster), mean) %>% kable(digits=3) %>% kable_styling(font_size=12)
| Cluster | Alcohol | Malic.acid | Ash | Acl | Mg | Phenols | Flavanoids | Nonflavanoid.phenols | Proanth | Color.int | Hue | OD | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 13.677 | 1.998 | 2.466 | 17.463 | 107.968 | 2.848 | 3.003 | 0.292 | 1.922 | 5.454 | 1.065 | 3.163 | 1100.226 |
| 2 | 12.251 | 1.897 | 2.231 | 20.063 | 92.738 | 2.248 | 2.050 | 0.358 | 1.624 | 2.973 | 1.063 | 2.803 | 510.169 |
| 3 | 13.134 | 3.307 | 2.418 | 21.241 | 98.667 | 1.684 | 0.819 | 0.452 | 1.146 | 7.235 | 0.692 | 1.697 | 619.059 |
(t <- table(wineR$Wine, k2$cluster)) #Note: indices do not necessarily mean that Cluster 1 is Wine Type 1, etc., but this case is so because the first column (Wine Type) is in increasing order.
##
## 1 2 3
## 1 59 0 0
## 2 3 65 3
## 3 0 0 48
(sum(diag(t))/sum(t))*100
## [1] 96.62921
par(mfrow=c(1,1))
cluster::clusplot(wineR[-1], k2$cluster, main="2D Representation of the Cluster Solution", lines=0, labels=4)
#About PCA: https://stats.stackexchange.com/questions/274754/how-to-interpret-the-clusplot-in-r
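As an aside on reading the plot: clusplot() projects the observations onto the first two principal components (see the link above). A minimal sketch with prcomp() shows how much of the total variance that 2D view captures; the object pca is introduced only for this sketch.

pca <- prcomp(wineR[-1])  #attributes are already standardized, so no extra scaling is applied
summary(pca)$importance["Cumulative Proportion", 1:2]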
Summary: The scaled predictors are “Alcohol,”
“Malic.acid,” “Ash,” “Acl,” “Mg,” “Phenols,” “Flavanoids,”
“Nonflavanoid.phenols,” “Proanth,” “Color.int,” “Hue,” “OD,” and
“Proline,” with 141 observations in the training set and 37 observations
in the test set. The random seed is set to 1234 for sampling purposes.
The regression trees model, i.e., the function
rpart() from the package “rpart”, is used for
classification in predicting clusters. The accuracy for the training set
and the test set is 96.45% and 91.89%, respectively. The three most important
features in this model are “Flavanoids,” “Phenols,” and “OD”.
Previously, Part 2 identified six mismatches: three instances of Type 2 wine are grouped into Cluster 1, and three instances of Type 2 wine are grouped into Cluster 3. Investigating the row indices where the cluster labels disagree with the wine types shows that five of these instances fall in the training set and one in the test set. \(\frac{5}{141}\approx3.5461\%\). Hence, the training process may indeed be compromised. The five training-set mismatches and the model's predictions for them are listed below.
| Wine Type | Cluster | Prediction |
|---|---|---|
| 2 | 3 | 3 |
| 2 | 1 | 2 |
| 2 | 1 | 1 |
| 2 | 3 | 3 |
| 2 | 3 | 2 |
Step 1: Exploring and preparing the data
wineR$Cluster <- k2$cluster
set.seed(1234)
wineR2 <- wineR[sample(1:nrow(wineR)),]
mis <- which(wineR2$Wine!=wineR2$Cluster) #19 25 52 100 119 166
wineTraining <- wineR2[1:(as.integer(0.8*nrow(wineR2))-1),] #141 obs.
wineTest <- wineR2[as.integer(0.8*nrow(wineR2)):nrow(wineR2),] #37 obs.
Step 2: Training a model on the data
(wineModelR <- rpart(Cluster~., data=wineTraining[-1], method="class"))
## n= 141
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 141 91 2 (0.32624113 0.35460993 0.31914894)
## 2) Proline>=0.02574325 52 7 1 (0.86538462 0.01923077 0.11538462)
## 4) Phenols>=-0.2718097 45 0 1 (1.00000000 0.00000000 0.00000000) *
## 5) Phenols< -0.2718097 7 1 3 (0.00000000 0.14285714 0.85714286) *
## 3) Proline< 0.02574325 89 40 2 (0.01123596 0.55056180 0.43820225)
## 6) OD>=-0.5939311 46 2 2 (0.02173913 0.95652174 0.02173913) *
## 7) OD< -0.5939311 43 5 3 (0.00000000 0.11627907 0.88372093)
## 14) Color.int< -0.4671943 7 2 2 (0.00000000 0.71428571 0.28571429) *
## 15) Color.int>=-0.4671943 36 0 3 (0.00000000 0.00000000 1.00000000) *
# #OR
# rpart.plot(wineModelR, digits=4, type=3, extra=101)
fancyRpartPlot(wineModelR)
#accuracy on the training set
wineTrPredR <- predict(wineModelR, wineTraining[-1], type="class")
CrossTable(wineTraining$Cluster, wineTrPredR, prop.chisq=F, prop.c=F, prop.r=F, dnn=c("Actual", "Predicted (RT)"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 141
##
##
## | Predicted (RT)
## Actual | 1 | 2 | 3 | Row Total |
## -------------|-----------|-----------|-----------|-----------|
## 1 | 45 | 1 | 0 | 46 |
## | 0.319 | 0.007 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|
## 2 | 0 | 49 | 1 | 50 |
## | 0.000 | 0.348 | 0.007 | |
## -------------|-----------|-----------|-----------|-----------|
## 3 | 0 | 3 | 42 | 45 |
## | 0.000 | 0.021 | 0.298 | |
## -------------|-----------|-----------|-----------|-----------|
## Column Total | 45 | 53 | 43 | 141 |
## -------------|-----------|-----------|-----------|-----------|
##
##
comparable <- data.frame(wineTraining[c("Wine","Cluster")], "Pred"=wineTrPredR)
comparable[mis[mis<=nrow(wineTraining)],]
## Wine Cluster Pred
## 62 2 3 3
## 122 2 1 2
## 96 2 1 1
## 84 2 3 3
## 119 2 3 2
Step 3: Evaluating model performance
#accuracy on the test set
winePredR <- predict(wineModelR, wineTest[-1], type="class")
CrossTable(wineTest$Cluster, winePredR, prop.chisq=F, prop.c=F, prop.r=F, dnn=c("Actual", "Predicted (RT)"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 37
##
##
## | Predicted (RT)
## Actual | 1 | 2 | 3 | Row Total |
## -------------|-----------|-----------|-----------|-----------|
## 1 | 14 | 2 | 0 | 16 |
## | 0.378 | 0.054 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|
## 2 | 1 | 14 | 0 | 15 |
## | 0.027 | 0.378 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|
## 3 | 0 | 0 | 6 | 6 |
## | 0.000 | 0.000 | 0.162 | |
## -------------|-----------|-----------|-----------|-----------|
## Column Total | 15 | 16 | 6 | 37 |
## -------------|-----------|-----------|-----------|-----------|
##
##
wineModelR$variable.importance %>% kable(digits=4) %>% kable_styling(full_width=F)
|  | x |
|---|---|
| Flavanoids | 58.1216 |
| Phenols | 44.8599 |
| OD | 42.8464 |
| Alcohol | 39.9385 |
| Proline | 36.6332 |
| Color.int | 29.2045 |
| Proanth | 20.2278 |
| Acl | 15.3934 |
| Hue | 12.5302 |
| Mg | 9.8628 |
| Nonflavanoid.phenols | 4.5565 |
| Ash | 2.5629 |
| Malic.acid | 2.5629 |
Question 3: Load the breast cancer dataset. Do the preliminary analysis and implement a KNN (k-nearest neighbors) model for this dataset, and don’t forget that whenever it is required you should use: set.seed(12345).
Answer 3: There are 30 scaled predictors, with 469 observations in the training set and 100 observations in the test set. There is no randomization in sampling: the first 469 rows form the training set and the remaining 100 rows form the test set. The k-nearest neighbors (KNN) model, i.e., the function knn() from the package “class”, is used for classification in predicting “diagnosis”. k is set to roughly the square root of the number of rows in the training set (sqrt(469) ≈ 21.7), i.e., k=21. The accuracy for the test set is 98%.
Question 4: Now let’s get back to our problem of news popularity and see if we can apply KNN (k-nearest neighbors) to improve the accuracy of the model. Use the same strategy of training and testing that we did in the first two labs, and don’t forget that whenever it is required you should use: set.seed(12345).
Answer 4: The random seed is set to 12345 for sampling. k is set to roughly the square root of the number of rows in the training set, i.e., k=172. The accuracy for the test set is 57.05%. The test-set accuracies of all eight models are compared below.
|  | Decision Tree | Random Forest | Regression Trees | Naïve Bayes Classifier | Linear SVM | Polynomial SVM | Radial Basis SVM | k-Nearest Neighbors |
|---|---|---|---|---|---|---|---|---|
| Test set | 59.11% | 60.25% | 56.37% | 47.22% | 57.51% | 57.51% | 59.34% | 57.05% |
Step 1: Exploring and preparing the data
cancer <- read.csv("~/Documents/HU/ANLY 530-53-B/Module 10- Naive Bayes classifier/Lab3/wisc_bc_data.csv")
cancer <- cancer[-1]
cancer[-1] <- lapply(cancer[2:31], function(x) {(x-min(x))/(max(x)-min(x))}) #min-max normalization to [0, 1]
summary(cancer) %>% kable(digits=4) %>% kable_styling(font_size=11)
| diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | points_mean | symmetry_mean | dimension_mean | radius_se | texture_se | perimeter_se | area_se | smoothness_se | compactness_se | concavity_se | points_se | symmetry_se | dimension_se | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | points_worst | symmetry_worst | dimension_worst | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Length:569 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.0000 | Min. :0.00000 | Min. :0.00000 | Min. :0.0000 | Min. :0.00000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | |
| Class :character | 1st Qu.:0.2233 | 1st Qu.:0.2185 | 1st Qu.:0.2168 | 1st Qu.:0.1174 | 1st Qu.:0.3046 | 1st Qu.:0.1397 | 1st Qu.:0.06926 | 1st Qu.:0.1009 | 1st Qu.:0.2823 | 1st Qu.:0.1630 | 1st Qu.:0.04378 | 1st Qu.:0.1047 | 1st Qu.:0.04000 | 1st Qu.:0.02064 | 1st Qu.:0.1175 | 1st Qu.:0.08132 | 1st Qu.:0.03811 | 1st Qu.:0.1447 | 1st Qu.:0.1024 | 1st Qu.:0.04675 | 1st Qu.:0.1807 | 1st Qu.:0.2415 | 1st Qu.:0.1678 | 1st Qu.:0.08113 | 1st Qu.:0.3000 | 1st Qu.:0.1163 | 1st Qu.:0.09145 | 1st Qu.:0.2231 | 1st Qu.:0.1851 | 1st Qu.:0.1077 | |
| Mode :character | Median :0.3024 | Median :0.3088 | Median :0.2933 | Median :0.1729 | Median :0.3904 | Median :0.2247 | Median :0.14419 | Median :0.1665 | Median :0.3697 | Median :0.2439 | Median :0.07702 | Median :0.1653 | Median :0.07209 | Median :0.03311 | Median :0.1586 | Median :0.13667 | Median :0.06538 | Median :0.2070 | Median :0.1526 | Median :0.07919 | Median :0.2504 | Median :0.3569 | Median :0.2353 | Median :0.12321 | Median :0.3971 | Median :0.1791 | Median :0.18107 | Median :0.3434 | Median :0.2478 | Median :0.1640 | |
| NA | Mean :0.3382 | Mean :0.3240 | Mean :0.3329 | Mean :0.2169 | Mean :0.3948 | Mean :0.2606 | Mean :0.20806 | Mean :0.2431 | Mean :0.3796 | Mean :0.2704 | Mean :0.10635 | Mean :0.1893 | Mean :0.09938 | Mean :0.06264 | Mean :0.1811 | Mean :0.17444 | Mean :0.08054 | Mean :0.2235 | Mean :0.1781 | Mean :0.10019 | Mean :0.2967 | Mean :0.3640 | Mean :0.2831 | Mean :0.17091 | Mean :0.4041 | Mean :0.2202 | Mean :0.21740 | Mean :0.3938 | Mean :0.2633 | Mean :0.1896 | |
| NA | 3rd Qu.:0.4164 | 3rd Qu.:0.4089 | 3rd Qu.:0.4168 | 3rd Qu.:0.2711 | 3rd Qu.:0.4755 | 3rd Qu.:0.3405 | 3rd Qu.:0.30623 | 3rd Qu.:0.3678 | 3rd Qu.:0.4530 | 3rd Qu.:0.3404 | 3rd Qu.:0.13304 | 3rd Qu.:0.2462 | 3rd Qu.:0.12251 | 3rd Qu.:0.07170 | 3rd Qu.:0.2187 | 3rd Qu.:0.22680 | 3rd Qu.:0.10619 | 3rd Qu.:0.2787 | 3rd Qu.:0.2195 | 3rd Qu.:0.12656 | 3rd Qu.:0.3863 | 3rd Qu.:0.4717 | 3rd Qu.:0.3735 | 3rd Qu.:0.22090 | 3rd Qu.:0.4942 | 3rd Qu.:0.3025 | 3rd Qu.:0.30583 | 3rd Qu.:0.5546 | 3rd Qu.:0.3182 | 3rd Qu.:0.2429 | |
| NA | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.0000 | Max. :1.00000 | Max. :1.00000 | Max. :1.0000 | Max. :1.00000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 |
Step 2: Building a KNN model
#no randomization in sampling
cancerTraining <- cancer[1:469,] #82%
cancerTest <- cancer[470:569,]
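#k = 21 is roughly the square root of the training-set size: sqrt(469) ≈ 21.7, rounded down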
cancerPred <- knn(train=cancerTraining[-1], test=cancerTest[-1], cl=cancerTraining[,1], k=21)
CrossTable(cancerTest$diagnosis, cancerPred, prop.chisq=F, prop.c=F, prop.r=F, dnn=c("Actual", "Predicted (KNN)"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | Predicted (KNN)
## Actual | B | M | Row Total |
## -------------|-----------|-----------|-----------|
## B | 61 | 0 | 61 |
## | 0.610 | 0.000 | |
## -------------|-----------|-----------|-----------|
## M | 2 | 37 | 39 |
## | 0.020 | 0.370 | |
## -------------|-----------|-----------|-----------|
## Column Total | 63 | 37 | 100 |
## -------------|-----------|-----------|-----------|
##
##
Step 3: Evaluating News Popularity
The first step is the same as Lab 1’s Tab 4 Step 1.
news17[-1] <- lapply(news17[2:17], function(x) {(x-min(x))/(max(x)-min(x))}) #min-max normalization to [0, 1]
set.seed(12345)
training <- sample(nrow(news17), nrow(news17)*0.75)
newsTraining <- news17[training,]
newsTest <- news17[-training,]
newsPredKNN <- knn(train=newsTraining[-1], test=newsTest[-1], cl=newsTraining[,1], k=172)
CrossTable(newsTest$popular, newsPredKNN, prop.chisq=F, prop.c=F, prop.r=F, dnn=c("Actual", "Predicted (KNN)"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 9911
##
##
## | Predicted (KNN)
## Actual | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 1997 | 2585 | 4582 |
## | 0.201 | 0.261 | |
## -------------|-----------|-----------|-----------|
## 1 | 1672 | 3657 | 5329 |
## | 0.169 | 0.369 | |
## -------------|-----------|-----------|-----------|
## Column Total | 3669 | 6242 | 9911 |
## -------------|-----------|-----------|-----------|
##
##
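For completeness, the 57.05% figure quoted in Answer 4 can be recovered from the table above with the same diagonal-over-total check used in Part 2; the temporary object t2 below is introduced only for this sketch.

t2 <- table(newsTest$popular, newsPredKNN)
(sum(diag(t2))/sum(t2))*100  #(1997+3657)/9911 ≈ 57.05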