Mushroom records are drawn from The Audubon Society Field Guide to North American Mushrooms (1981), G. H. Lincoff (Pres.), New York: Alfred A. Knopf.
Research question: Which attribute yields the most accurate classifier for predicting whether a mushroom is poisonous or edible?
Data is available online here: https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data
#load required packages
suppressWarnings(suppressMessages(library(RCurl))) #read URLs into R
suppressWarnings(suppressMessages(library(plyr))) #split-apply-combine paradigm
suppressWarnings(suppressMessages(library(psych))) #summary statistics by group
suppressWarnings(suppressMessages(library(stringr))) #string manipulation
suppressWarnings(suppressMessages(library(dplyr))) #data manipulation
suppressWarnings(suppressMessages(library(tidyr))) #spread and gather functions
suppressWarnings(suppressMessages(library(ggplot2))) #plotting
suppressWarnings(suppressMessages(library(rpart))) #decision trees
suppressWarnings(suppressMessages(library(mlbench))) #machine learning benchmarks
suppressWarnings(suppressMessages(library(e1071))) #naiveBayes classifier
In this data set there are 8124 observations and 22 attributes. Each row represents a gilled mushroom sample from the Agaricus or Lepiota family in the United States, described by characteristics such as shape, color, odor, population, ring, and habitat.
The response variable is V1, "classes", which is categorical with levels poisonous (p) and edible (e). The explanatory variables are the remaining attributes, all of which are also categorical.
#get raw data from a url
url_mushroom <- getURL('https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data')
#store raw data in a data frame
#note: the stray sep="'" argument is received by data.frame(), not read.csv(),
#so it appends a constant column named "sep"; that column is dropped below
rawData <- data.frame(read.csv(text=url_mushroom, header=FALSE),sep="'")
summary(rawData) #review raw data
## V1 V2 V3 V4 V5 V6
## e:4208 b: 452 f:2320 n :2284 f:4748 n :3528
## p:3916 c: 4 g: 4 g :1840 t:3376 f :2160
## f:3152 s:2556 e :1500 s : 576
## k: 828 y:3244 y :1072 y : 576
## s: 32 w :1040 a : 400
## x:3656 b : 168 l : 400
## (Other): 220 (Other): 484
## V7 V8 V9 V10 V11 V12 V13
## a: 210 c:6812 b:5612 b :1728 e:3516 ?:2480 f: 552
## f:7914 w:1312 n:2512 p :1492 t:4608 b:3776 k:2372
## w :1202 c: 556 s:5176
## n :1048 e:1120 y: 24
## g : 752 r: 192
## h : 732
## (Other):1170
## V14 V15 V16 V17 V18 V19
## f: 600 w :4464 w :4384 p:8124 n: 96 n: 36
## k:2304 p :1872 p :1872 o: 96 o:7488
## s:4936 g : 576 g : 576 w:7924 t: 600
## y: 284 n : 448 n : 512 y: 8
## b : 432 b : 432
## o : 192 o : 192
## (Other): 140 (Other): 156
## V20 V21 V22 V23 sep
## e:2776 w :2388 a: 384 d:3148 ':8124
## f: 48 n :1968 c: 340 g:2148
## l:1296 k :1872 n: 400 l: 832
## n: 36 h :1632 s:1248 m: 292
## p:3968 r : 72 v:4040 p:1144
## b : 48 y:1712 u: 368
## (Other): 144 w: 192
Here I rename the columns and remove the last column (the spurious "sep" column, whose name becomes NA after renaming), then store the cleaned data set as df.
#change column names for the attributes
names(rawData) <- c("classes","cap_shape","cap_surface","cap_color","bruises","odor","gill_attachment","gill_spacing",
                    "gill_size","gill_color","stalk_shape","stalk_root","stalk_surface_above_ring",
                    "stalk_surface_below_ring","stalk_color_above_ring","stalk_color_below_ring",
                    "veil_type","veil_color","ring_number","ring_type","spore_print_color","population","habitat")
#remove the last (spurious) column
df <- rawData[,1:(dim(rawData)[2]-1)]
#recode the class levels to descriptive labels
levels(df$classes)[levels(df$classes)=="p"] <- "poisonous"
levels(df$classes)[levels(df$classes)=="e"] <- "edible"
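As a quick sanity check (the counts below are taken from the raw-data summary above, where V1 showed e:4208 and p:3916), the recode can be verified:
#verify the class recode; counts should match the summary above
table(df$classes)
## 
##    edible poisonous 
##      4208      3916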
There are a few well-known disjunctive rules for identifying poisonous mushrooms, from most general to most specific. The most general uses odor alone: a mushroom is flagged as poisonous unless its odor is almond (a), anise (l), or none (n).
#cross-tabulate odor against class over the full data set
odor_cyfmps <- df %>% select(classes, odor)
ftable(odor_cyfmps)
## odor a c f l m n p s y
## classes
## edible 400 0 0 400 0 3408 0 0 0
## poisonous 0 192 2160 0 36 120 256 576 576
#calculate the accuracy of the odor rule
total <- nrow(odor_cyfmps) #sample size: all 8124 observations
print( 1- 120/total) #the rule's only misses are the 120 poisonous mushrooms with odor "n"
## [1] 0.985229
qplot(odor, data = odor_cyfmps, fill = classes)
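To avoid repeating this bookkeeping for each rule, a small helper function, my own sketch rather than part of the original analysis, can score any rule that flags mushrooms as poisonous:
#hypothetical helper: flag_poisonous is a logical vector marking "predicted poisonous"
rule_accuracy <- function(data, flag_poisonous) {
  pred <- ifelse(flag_poisonous, "poisonous", "edible")
  mean(pred == data$classes) #proportion of correct labels
}
#the odor rule: poisonous unless odor is almond (a), anise (l), or none (n)
rule_accuracy(df, !(df$odor %in% c("a", "l", "n"))) #matches the 0.985229 computed above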
A more specific rule labels a mushroom edible when it has no odor (n), a smooth stalk surface below the ring (s), and a stalk color above the ring other than buff (b):
odor_stalkFaceBelow_stalkColAbove <- df %>%
  filter(odor=="n" & stalk_surface_below_ring=="s" & stalk_color_above_ring!="b") %>%
  select(classes, odor, stalk_surface_below_ring, stalk_color_above_ring)
ftable(odor_stalkFaceBelow_stalkColAbove)
## stalk_color_above_ring b c e g n o p w y
## classes odor stalk_surface_below_ring
## edible a f 0 0 0 0 0 0 0 0 0
## k 0 0 0 0 0 0 0 0 0
## s 0 0 0 0 0 0 0 0 0
## y 0 0 0 0 0 0 0 0 0
## c f 0 0 0 0 0 0 0 0 0
## k 0 0 0 0 0 0 0 0 0
## s 0 0 0 0 0 0 0 0 0
## y 0 0 0 0 0 0 0 0 0
## f f 0 0 0 0 0 0 0 0 0
## k 0 0 0 0 0 0 0 0 0
## s 0 0 0 0 0 0 0 0 0
## y 0 0 0 0 0 0 0 0 0
## l f 0 0 0 0 0 0 0 0 0
## k 0 0 0 0 0 0 0 0 0
## s 0 0 0 0 0 0 0 0 0
## y 0 0 0 0 0 0 0 0 0
## m f 0 0 0 0 0 0 0 0 0
## k 0 0 0 0 0 0 0 0 0
## s 0 0 0 0 0 0 0 0 0
## y 0 0 0 0 0 0 0 0 0
## n f 0 0 0 0 0 0 0 0 0
## k 0 0 0 0 0 0 0 0 0
## s 0 0 96 576 0 192 576 1352 0
## y 0 0 0 0 0 0 0 0 0
## p f 0 0 0 0 0 0 0 0 0
## k 0 0 0 0 0 0 0 0 0
## s 0 0 0 0 0 0 0 0 0
## y 0 0 0 0 0 0 0 0 0
## s f 0 0 0 0 0 0 0 0 0
## k 0 0 0 0 0 0 0 0 0
## s 0 0 0 0 0 0 0 0 0
## y 0 0 0 0 0 0 0 0 0
## y f 0 0 0 0 0 0 0 0 0
## k 0 0 0 0 0 0 0 0 0
## s 0 0 0 0 0 0 0 0 0
## y 0 0 0 0 0 0 0 0 0
## poisonous a f 0 0 0 0 0 0 0 0 0
## k 0 0 0 0 0 0 0 0 0
## s 0 0 0 0 0 0 0 0 0
## y 0 0 0 0 0 0 0 0 0
## c f 0 0 0 0 0 0 0 0 0
## k 0 0 0 0 0 0 0 0 0
## s 0 0 0 0 0 0 0 0 0
## y 0 0 0 0 0 0 0 0 0
## f f 0 0 0 0 0 0 0 0 0
## k 0 0 0 0 0 0 0 0 0
## s 0 0 0 0 0 0 0 0 0
## y 0 0 0 0 0 0 0 0 0
## l f 0 0 0 0 0 0 0 0 0
## k 0 0 0 0 0 0 0 0 0
## s 0 0 0 0 0 0 0 0 0
## y 0 0 0 0 0 0 0 0 0
## m f 0 0 0 0 0 0 0 0 0
## k 0 0 0 0 0 0 0 0 0
## s 0 0 0 0 0 0 0 0 0
## y 0 0 0 0 0 0 0 0 0
## n f 0 0 0 0 0 0 0 0 0
## k 0 0 0 0 0 0 0 0 0
## s 0 0 0 0 0 0 0 80 0
## y 0 0 0 0 0 0 0 0 0
## p f 0 0 0 0 0 0 0 0 0
## k 0 0 0 0 0 0 0 0 0
## s 0 0 0 0 0 0 0 0 0
## y 0 0 0 0 0 0 0 0 0
## s f 0 0 0 0 0 0 0 0 0
## k 0 0 0 0 0 0 0 0 0
## s 0 0 0 0 0 0 0 0 0
## y 0 0 0 0 0 0 0 0 0
## y f 0 0 0 0 0 0 0 0 0
## k 0 0 0 0 0 0 0 0 0
## s 0 0 0 0 0 0 0 0 0
## y 0 0 0 0 0 0 0 0 0
#calculate the accuracy of this rule; it misses 80 poisonous cases
total <- nrow(odor_stalkFaceBelow_stalkColAbove) #number of observations matching the rule
print( 1- 80/total) #proportion of matching mushrooms correctly labeled edible
## [1] 0.9721448
The most specific rule flags as poisonous any mushroom found in leaves (habitat l) that has a white cap (cap_color w):
hab_capCol <- df %>% filter(habitat=="l" & cap_color=="w") %>% select(classes,habitat,cap_color)
ftable(hab_capCol)
## cap_color b c e g n p r u w y
## classes habitat
## edible d 0 0 0 0 0 0 0 0 0 0
## g 0 0 0 0 0 0 0 0 0 0
## l 0 0 0 0 0 0 0 0 0 0
## m 0 0 0 0 0 0 0 0 0 0
## p 0 0 0 0 0 0 0 0 0 0
## u 0 0 0 0 0 0 0 0 0 0
## w 0 0 0 0 0 0 0 0 0 0
## poisonous d 0 0 0 0 0 0 0 0 0 0
## g 0 0 0 0 0 0 0 0 0 0
## l 0 0 0 0 0 0 0 0 8 0
## m 0 0 0 0 0 0 0 0 0 0
## p 0 0 0 0 0 0 0 0 0 0
## u 0 0 0 0 0 0 0 0 0 0
## w 0 0 0 0 0 0 0 0 0 0
All 8 mushrooms matching this rule are poisonous, so the rule is 100% accurate on the cases it covers.
qplot(habitat, cap_color, data = hab_capCol, color = classes)
Now I want to know whether a computer can classify a mushroom as edible or poisonous based on what it learns from the data set, and which kind of model best predicts edible mushrooms.
Here I use a decision tree classifier and a Naive Bayes classifier to make the predictions, then concentrate on the edible predictions and compare the accuracy of the two models.
#generate 100 random seeds for the train/test splits
#(a set.seed() call before this line would be needed for fully reproducible splits)
seeds<-sample(1:9999999, 100, replace=FALSE)
#both models use all 22 attributes to predict the classes variable
#x: a data frame of categorical predictors
#y: the class vector
#set up the model formula: y ~ x
frmla = classes~ cap_shape+cap_surface+cap_color+bruises+odor+gill_attachment+gill_spacing+ gill_size+gill_color+stalk_shape+stalk_root+stalk_surface_above_ring+stalk_surface_below_ring+stalk_color_above_ring+stalk_color_below_ring+veil_type+veil_color+ring_number+ring_type+spore_print_color+population+habitat
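Because this formula simply lists every column of df other than the response, an equivalent shorthand is available (equivalent here only because df currently holds exactly classes plus the 22 attributes):
#equivalent shorthand: "." expands to all columns except the response
frmla_compact <- classes ~ .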
“Decision tree models allow you to develop classification systems that predict or classify future observations based on a set of decision rules.” - IBM Knowledge Center
#Conditional Inference Tree via PARTY package
suppressWarnings(suppressMessages(library(party)))
#fit the tree; the warning below is triggered by veil_type, which has only one level ("p")
ct = ctree(frmla, data = df)
## Warning in factor_trafo(x): factors at only one level may lead to problems
plot(ct, main="Conditional Inference Tree")
#in-sample confusion matrix: the tree separates all 8124 training cases perfectly
table(predict(ct), df$classes)
##
## edible poisonous
## edible 4208 0
## poisonous 0 3916
# Estimated class probabilities
tr.pred = predict(ct, newdata=df, type="prob")
## Warning in factor_trafo(x): factors at only one level may lead to problems
“Naïve Bayes (NB) is based on applying Bayes’ theorem (from probability theory) with strong (naive) independence assumptions. It is particularly suited when the dimensionality of the inputs is high. Naive Bayes classifiers can handle an arbitrary number of independent variables, whether continuous or categorical.”
In this analysis, X is the set of predictors and C is the set of class levels of the dependent variable: X = {x1, x2, ..., x22} for the 22 attributes, and the possible outcomes are C = {c1, c2}, edible and poisonous.
Bayes’ rule gives the posterior probability p(C | x1, x2, ..., x22) = p(C) p(x1, x2, ..., x22 | C) / p(x1, x2, ..., x22), and the naive independence assumption factors the likelihood: p(x1, x2, ..., x22 | C) = p(x1 | C) p(x2 | C) ... p(x22 | C).
A new case in the test set with feature values F1, ..., F22 is labeled with the class that achieves the highest posterior probability: classify(F1, ..., F22) = argmax over c in C of p(C = c) p(x1 = F1 | C = c) p(x2 = F2 | C = c) ... p(x22 = F22 | C = c).
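To make the rule concrete, here is a minimal sketch (my illustration, not part of the original code) applying Bayes' rule with a single predictor, odor, using empirical frequencies from df:
#prior p(C) and class-conditional likelihood p(odor | C) estimated from the data
prior <- prop.table(table(df$classes))
likelihood <- prop.table(table(df$classes, df$odor), margin = 1)
#unnormalized posterior for a mushroom with foul odor: p(C) * p(odor = "f" | C)
post <- prior * likelihood[, "f"]
post / sum(post) #all posterior mass on "poisonous": no edible mushroom has odor "f"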
Naïve Bayes algorithm and code references: https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Classification/Na%C3%AFve_Bayes
#train on all 22 predictors (columns 2:23), with classes (column 1) as the response
classifier<-naiveBayes(df[,2:23], df[,1])
#in-sample confusion matrix; note that df[,-23] drops habitat rather than the classes
#column (df[,-1] was presumably intended; predict.naiveBayes ignores columns it does
#not recognize, so classes is harmless, but habitat is left out of this prediction)
table(predict(classifier, df[,-23]), df[,1])
##
## edible poisonous
## edible 4197 455
## poisonous 11 3461
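A couple of figures can be read off this matrix (computed below on the printed counts): overall in-sample accuracy is about 94.3%, and about 9.8% of the mushrooms predicted edible are actually poisonous, the costly kind of error that accuracy_edible_nb tracks later.
#summary figures from the Naive Bayes confusion matrix above
cm <- matrix(c(4197, 11, 455, 3461), nrow = 2,
             dimnames = list(predicted = c("edible","poisonous"),
                             actual = c("edible","poisonous")))
sum(diag(cm)) / sum(cm) #overall accuracy: (4197 + 3461) / 8124 = 0.9426
cm["edible","poisonous"] / sum(cm["edible", ]) #455 / 4652 = 0.0978 poisonous among predicted edible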
Both models use conditional probability and are evaluated on the same test sets, each drawn at random as 20% of the cleaned data set, repeated 100 times. To add some challenge for the two models, I sample 100 different seeds to generate 100 distinct pairs of training and testing sets.
For both predictive models, I randomly split the cleaned data set by the "80-20" rule: 80% of the data as the training set and 20% as the testing set.
accuracy_edible_tree <- c() #per-iteration share of correct edible predictions on the test set, tree model
accuracy_edible_nb <- c() #per-iteration share of correct edible predictions on the test set, NB model
conf_matrix_tree_list <-list() #confusion matrices of the tree model
conf_matrix_nb_list <-list() #confusion matrices of Naive Bayes
for(i in 1:length(seeds) ){
#generate a random test set and training set from the cleaned data set
set.seed(seeds[i]) #get reproducible test set for both models
#add flag: 20% data with "1" as testing set, 80% with "0" as training set
df[,'test'] <- ifelse(runif(nrow(df)) <= 0.2,1,0)
#split testing set and training set
testSet <- df %>% filter(test=='1')
trainSet <- df %>% filter(test=='0')
#subset the edible mushrooms in each set (recorded for reference; not used below)
test_eat <-testSet %>% filter(classes=="edible")
train_eat <-trainSet %>% filter(classes=="edible")
#get column index of the test flag (for reference)
train_ColNum <- grep('test',names(trainSet))
#remove the flag column from testSet and trainSet
testSet <-testSet[,1:(dim(testSet)[2]-1)]
trainSet <-trainSet[,1:(dim(trainSet)[2]-1)]
#get column index of the response variable in the cleaned data set (for reference)
class_ColNum <- grep('classes', names(df))
#Decision Tree model
#complexity parameter cp set to 0.001
tree_model = rpart(frmla,method="class",data=trainSet, control = rpart.control(cp = .001))
#predict on the testSet data
tree_pred = predict(tree_model, testSet, type='class')
#Decision Tree confusion matrix
conf_matrix_tree<-table(tree_pred, testSet$classes)
conf_matrix_tree_list[[i]] <-conf_matrix_tree
#edible accuracy = 1 - (poisonous among predicted edible)/(all predicted edible)
accuracy_edible_tree[i] =1 - conf_matrix_tree[1,2]/sum(conf_matrix_tree[1,])
#Naive Bayes classification
nb_model <- naiveBayes(frmla, data = trainSet)
#predict on the testSet data
nb_pred<-predict(nb_model,testSet)
#Naive Bayes confusion matrix
conf_matrix_nb<- table(nb_pred, testSet$classes)
conf_matrix_nb_list[[i]] <-conf_matrix_nb
#edible accuracy = 1 - (poisonous among predicted edible)/(all predicted edible)
accuracy_edible_nb[i] =1 - conf_matrix_nb[1,2]/sum(conf_matrix_nb[1,])
}
#write each iteration's confusion matrices to disk
write.csv(conf_matrix_tree_list,file="D:/CUNY_SPS_DA/606_Statistic_in_R/606 Project/confMatrix_Tree.txt" )
write.csv(conf_matrix_nb_list,file="D:/CUNY_SPS_DA/606_Statistic_in_R/606 Project/confMatrix_nb.txt" )
Now I have 100 accuracy values for predicting edible mushrooms from each model, and I want to determine which model predicts better.
The mean accuracy of the tree model (accuracy_edible_tree) is close to 100% and higher than that of the Naive Bayes model (accuracy_edible_nb). The minimum accuracy of the tree model is also higher than the maximum accuracy of the Naive Bayes model.
#collect the 100 accuracy values per model into a data frame
test_result <- data.frame(accuracy_edible_nb, accuracy_edible_tree)
#summary(test_result)
describe(test_result)
## vars n mean sd median trimmed mad min max
## accuracy_edible_nb 1 100 0.91 0.01 0.91 0.91 0.01 0.89 0.93
## accuracy_edible_tree 2 100 1.00 0.00 1.00 1.00 0.00 0.99 1.00
## range skew kurtosis se
## accuracy_edible_nb 0.04 0.45 1.31 0
## accuracy_edible_tree 0.01 -0.58 -0.89 0
#overlaid histograms of the accuracy distributions (NB in yellow, tree in cyan)
hist(accuracy_edible_nb,xlim=c(0.85,1),ylim=c(0,35),breaks=10,labels="nb",col=rgb(1,1,0,0.7),main="Accuracy of Edible Prediction",xlab="accuracy_edible_prediction")
par(new=TRUE)
hist(accuracy_edible_tree,xlim=c(0.85,1),ylim=c(0,35),breaks=5,labels="tr",col=rgb(0,1,1,0.4),main="",xlab="",ylab="")
#boxplot data set
long_test_result <- gather(test_result, model, accuracy, accuracy_edible_nb:accuracy_edible_tree, factor_key=TRUE)
boxplot(long_test_result$accuracy ~ long_test_result$model, main="Edible Prediction Accuracy: Decision Tree vs Naive Bayes")
t-test: difference of two means
Since the t-test is sensitive to non-normality, I first evaluate the accuracy data from the tree and Naive Bayes models. The accuracies from both predictive models are nearly normal: the points follow the reference lines, with some errant points toward the tails.
#tree data
qqnorm(accuracy_edible_tree)
qqline(accuracy_edible_tree)
#naive bayes data
qqnorm(accuracy_edible_nb)
qqline(accuracy_edible_nb)
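As a complement to eyeballing the QQ plots, a formal check such as the Shapiro-Wilk test could be run (not part of the original analysis; with the tree accuracies clustered so tightly near 1, the test may well reject normality even though the t-test below is fairly robust at n = 100):
#formal normality tests to back up the QQ plots
shapiro.test(accuracy_edible_nb)
shapiro.test(accuracy_edible_tree)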
Assume the mean accuracies of the two models are not different. To test whether the predictions differ significantly, I use a t-test, since the population variances are unknown, at the 5% significance level.
Ho: the mean accuracy of Naive Bayes is equal to that of the decision tree. Ha: the mean accuracy of Naive Bayes differs from that of the decision tree.
The two-tailed p-value from Student's t distribution with (100 + 100 - 2) = 198 degrees of freedom is approximately 0, so I reject Ho: the mean accuracies of the tree model and the Naive Bayes model differ significantly.
tree_mean <- mean(accuracy_edible_tree)
nb_mean <- mean(accuracy_edible_nb)
tree_var <- var(accuracy_edible_tree)
nb_var <- var(accuracy_edible_nb)
tree_num <- 100 #number of accuracy values per model
nb_num <- 100
#two-sample t statistic under Ho (zero mean difference), unpooled variances
t = (tree_mean - nb_mean)/sqrt(tree_var/tree_num + nb_var/nb_num)
#two-tailed p-value
2*pt(-abs(t),df=tree_num+nb_num-2)
## [1] 1.36109e-185
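As a cross-check, the same comparison can be run with R's built-in t.test(), which defaults to Welch's two-sample test (unequal variances, adjusted degrees of freedom) rather than the 198 degrees of freedom used above; it should likewise report a vanishingly small p-value:
#built-in two-sample t-test (Welch) comparing the two accuracy vectors
t.test(accuracy_edible_tree, accuracy_edible_nb)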