# Applied Predictive Modeling Chapter 3: Exercise 3.2
#
# This web page provides answers to Exercise 3.2 of Applied Predictive Modeling.
# That exercise says:
#
# "The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to
# predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the
# environmental condititions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold
# growth). The outcome labels consist of 19 distinct classes." The data can be loaded via mlbench.
#
#
# EXERCISE 3.2(a)
# Investigate the frequency distributions for the categorical predictors. Are any of the distributions
# degenerate in the ways dicussed earlier in this chapter?
#
# To address 3.2(a), I load the some programs and data and then create graphics.
#
# Step 1: I load the libraries whose programs are used in the analysis
suppressMessages(library(mlbench)) # contains data set used in the analysis
suppressMessages(library(caret)) # APM pre-procoessing program and others
suppressMessages(library(dplyr)) # used for some data manipulation
suppressMessages(library(e1071)) # for skewness computation
suppressMessages(library(tidyr)) # reshape a correlation matrix for use in heat map
suppressMessages(library(knitr)) # used to create output for web
suppressMessages(library(GGally)) # used for ggpairs correlation graph graph
#
#
data(Soybean)
#
# Degenerate distribution: takes on only a single value. Related: variables with insufficient
# variability to provide much information. (See p. 44.)
#
# A look at the dataframe shows that the variables are all factors or ordered factors. Since the Y variable
# also is a factor, one might use multinomial logit or LDA.
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
#
# Look at variability (or lack thereof) of variables by Class
# The idea behind the following anaysis is to compute the number of times (and the %) that a particular
# factor value arises. We do this for each explanatory variable by Class.
# The first steop is to compute the total count of observations for each of the Class diseases.
# This is used to compute the proportion of observed factor counts.
total.count<-Soybean%>%
group_by(Class)%>%
summarise(total=n()) # You can also use do(count(.)) in this line.
# The next step is, for each explanatory variable, create a cross-tabluation of the count
# (by factor number) by Class.
temp<-sapply(Soybean[,2:36],function(fact.num){table(Soybean$Class,fact.num)})
# The loop below converts the results of the sapply into a long-form dataframe called count.byclass.
# Possibly, I could have used a do.call, but I wanted to rbind the individual list results from
# the table together into a single long-form df.
for(i in 1:35){
temp2<-data.frame(temp[[i]]) # This step converts item [[i]] of the list into a df.
temp2$name<-names(temp)[i] # Here, we add names to the colummns.
ifelse(i==1,count.byclass<-temp2,count.byclass<-rbind(count.byclass,temp2)) # This step rbinds the df's.
}
names(count.byclass)<-c("Class","factor.value","count","factor") #Add names to the dataframe count.byclass.
count.byclass<-left_join(count.byclass,total.count,by="Class") #Append the totals so that we can compute proportions if we wish to.
count.byclass$pct<-count.byclass$count/count.byclass$total # Compute the the proportions.
count.byclass<-tbl_df(count.byclass)
rm(temp,temp2,i)
#
# Now we can see whether a particular independent variable has sufficient variability to explain a disease.
# For example, if an independent variable has a value of zero for all diseases (i.e., by Class) then it
# produces no explanatory information. (I interpret the data to being that a plant has the disease and that
# the researcher looks for one or more identifying characteristic (i.e., a non-zero on the independent variable).
# For example, in the charts created below:
# Chart 12 [leaves] has a factor value of 1 100% of the time for 14 of the 19 diseases. Hence, a value
# of zero would mean that it it is one of 5 diseases.
# Chart 18 [leaf.mild] has zeros for 12 of 19 diseases. It has 1 100% of the time for powdery mildew and
# 2 100% of the time for downy mildew.
# Chart 20 [lodging] has zeros 100% of the time for 7 of the diseases, and 1s a small proportion of the time
# for 8 of the diseases. It seems this would not have good explanatory power.
# Chart 25 [mycelium] is zero across the board except (1) where it is NA and (2) thizoctonia-root-rot,
# where it takes on a value of 1 about 25% of the time. This is an example of low explanatory power and
# insufficient variability.
# Chart 26 [int.discolor] may be an indicator for brown-step-rot and charcoal-rot, but not anything else.
# Chart 27 [sclorotia] may be an indicator of charcoal rot. Perhaps 26 and 27 together.
#
# One could contemplate a rule that if a non-zero arises in 75% or more of the instances for a particular disease,
# that factor is an indicator for the disease.
# As an aside: I am not sure how NAs are handled except in instances where the entire series is NA.
for(i in 1:35){
readline(prompt="Hit return to see next plot>> ")
fact.nam<-names(Soybean)[i+1]
print(ggplot(subset(count.byclass,(factor==fact.nam)),aes(x=Class,y=pct,fill=factor.value))+
geom_bar(stat="identity",position="stack")+
theme(axis.text.x=element_text(angle=90))+
labs(ylab("Percent of Total (by Class)")) +
ggtitle(paste("Series Chart ",i,": Percent of Observations by Factor Value for Explanatory Variable [",
fact.nam,"] \n (by Class)",sep=""))+
theme(plot.title=element_text(size=10))+
annotate(geom="text",label="Failure to sum to 100% is due to NAs",x=15,y=.1,size=3,colour="grey25")
)
}
## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

rm(fact.nam,i)
#
# This next exercise is designed to identify the number of NAs in the independent variables.
# EXERCISE 3.2(b)
# Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing?
# Is the pattern of missing data related to the classes?
#
isna.byxvar<-data.frame(sapply(Soybean[,2:36],function(isna){sum(is.na(isna))}))
isna.byxvar$x.var<-rownames(isna.byxvar)
rownames(isna.byxvar)<-NULL
names(isna.byxvar)<-c("na.count","x.var")
isna.byxvar
## na.count x.var
## 1 1 date
## 2 36 plant.stand
## 3 38 precip
## 4 30 temp
## 5 121 hail
## 6 16 crop.hist
## 7 1 area.dam
## 8 121 sever
## 9 121 seed.tmt
## 10 112 germ
## 11 16 plant.growth
## 12 0 leaves
## 13 84 leaf.halo
## 14 84 leaf.marg
## 15 84 leaf.size
## 16 100 leaf.shread
## 17 84 leaf.malf
## 18 108 leaf.mild
## 19 16 stem
## 20 121 lodging
## 21 38 stem.cankers
## 22 38 canker.lesion
## 23 106 fruiting.bodies
## 24 38 ext.decay
## 25 38 mycelium
## 26 38 int.discolor
## 27 38 sclerotia
## 28 84 fruit.pods
## 29 106 fruit.spots
## 30 92 seed
## 31 92 mold.growth
## 32 106 seed.discolor
## 33 92 seed.size
## 34 106 shriveling
## 35 31 roots
#
ggplot(isna.byxvar,aes(x=reorder(x.var,-na.count),y=na.count))+
geom_bar(stat="identity",fill="light blue",colour="black")+
theme(axis.text.x=element_text(angle=90))+
xlab("Explanatory Variable")+
ylab("Number of NAs")+
ggtitle("Number of NAs by Explanatory Variable")+
theme(plot.title=element_text(size=10))

#
# The above shows that (out of 683), the max NAs are 121 (17%).
# I also re-run this analysis by Class
isna.byclass<-Soybean%>%
group_by(Class)%>%
do(data.frame(sum(is.na(.))))
names(isna.byclass)<-c("Class","na.count")
isna.byclass
## Source: local data frame [19 x 2]
## Groups: Class [19]
##
## Class na.count
## (fctr) (int)
## 1 2-4-d-injury 450
## 2 alternarialeaf-spot 0
## 3 anthracnose 0
## 4 bacterial-blight 0
## 5 bacterial-pustule 0
## 6 brown-spot 0
## 7 brown-stem-rot 0
## 8 charcoal-rot 0
## 9 cyst-nematode 336
## 10 diaporthe-pod-&-stem-blight 177
## 11 diaporthe-stem-canker 0
## 12 downy-mildew 0
## 13 frog-eye-leaf-spot 0
## 14 herbicide-injury 160
## 15 phyllosticta-leaf-spot 0
## 16 phytophthora-rot 1214
## 17 powdery-mildew 0
## 18 purple-seed-stain 0
## 19 rhizoctonia-root-rot 0
#
#
ggplot(isna.byclass,aes(x=reorder(Class,-na.count),y=na.count))+
geom_bar(stat="identity",fill="light blue",colour="black")+
theme(axis.text.x=element_text(angle=90))+
xlab("Class of Disease")+
ylab("Number of NAs")+
ggtitle("Number of NAs by Class of Disease")+
theme(plot.title=element_text(size=10))

#
# Here is a full table of NA counts
isna.all<-Soybean%>%
group_by(Class)%>%
do(data.frame(apply(.,2,function(x){length(which(is.na(x)))})))
isna.all$x.var<-names(Soybean)
names(isna.all)<-c("Class","na.count","var")
isna.all<-filter(isna.all,var!="Class")
#
isna.table<-spread(isna.all,Class,na.count)
#
#
# And an NA heatmap
ggplot(isna.all,aes(x=Class,y=var,size=na.count,colour=na.count))+
geom_point()+
theme_bw()+
theme(axis.text.x=element_text(angle=90))+
ggtitle("Number of NAs \n (by Class and Variable)")+
theme(plot.title=element_text(size=8))+
ylab("Explanatory Variable")+
scale_colour_gradientn(colours=topo.colors(50))

#
# The heatmap shows that most of the NAs are with regard to:
# 1. phytophtora rot
# 2. 2-4-d-injury
# 3. cyst nematode
# 4. diaporthe pod & stem blight
#
# EXERCISE 3.2(c)
# Develop a strategy for handling missing data, either by eliminating predictors or imputation.
#
# I will focus on phytophthora rot, where 1,214 of the NAs are concentrated.
# 1. The reason for the NAs *may* be that the particular explanatory variable is irrelevant for that Class (although
# one would think this would be coded as a 0 and not as an NA.)
# 2.
fact.nam<-as.character(unique(Soybean$Class)[4])
nlv<-sapply(Soybean,function(x){nlevels(x)})
nlv<-data.frame(nlv[2:36])
nlv$x.var<-rownames(nlv)
names(nlv)[1]<-"lev.num"
nlv[,1]<-as.character(nlv[,1])
rownames(nlv)<-NULL
nlv<-arrange(nlv,x.var)
nlv
## lev.num x.var
## 1 4 area.dam
## 2 4 canker.lesion
## 3 4 crop.hist
## 4 7 date
## 5 3 ext.decay
## 6 4 fruit.pods
## 7 4 fruit.spots
## 8 2 fruiting.bodies
## 9 3 germ
## 10 2 hail
## 11 3 int.discolor
## 12 3 leaf.halo
## 13 2 leaf.malf
## 14 3 leaf.marg
## 15 3 leaf.mild
## 16 2 leaf.shread
## 17 3 leaf.size
## 18 2 leaves
## 19 2 lodging
## 20 2 mold.growth
## 21 2 mycelium
## 22 2 plant.growth
## 23 2 plant.stand
## 24 3 precip
## 25 3 roots
## 26 2 sclerotia
## 27 2 seed
## 28 2 seed.discolor
## 29 2 seed.size
## 30 3 seed.tmt
## 31 3 sever
## 32 2 shriveling
## 33 2 stem
## 34 4 stem.cankers
## 35 3 temp
ggplot(subset(count.byclass,(Class==fact.nam)),aes(x=factor,y=count,fill=factor.value))+
geom_bar(stat="identity",position="stack")+
theme(axis.text.x=element_text(angle=90))+
ylab("Number of Observations by Factor Value")+
theme(axis.title.y = element_text(size = rel(.8)))+
xlab("Explanatory Variable")+
annotate("text",y=5,x=1:35,label=nlv[,1],color="gray20",size=3)+
ggtitle(paste("Number of Observations by Factor Value for Each Explanatory Variable \n (for ",fact.nam,
" and for a total of 88 possible observations)",sep=""))+
theme(plot.title=element_text(size=10))

rm(fact.nam,nlv,i)
## Warning in rm(fact.nam, nlv, i): object 'i' not found
#
# The chart shows, for example, that fruit.pods has 4 levels. However, all of the observations are of level 3.
# Moreover, there are only 20 observations, leaving 68 NAs (88 - 20). Here's a look at the fruit.pods data showing
# no observations for factor values 0,1,or 2 and 20 observations for factor value 3:
#
subset(count.byclass,factor=="fruit.pods" & Class=="phytophthora-rot")
## Source: local data frame [4 x 6]
##
## Class factor.value count factor total pct
## (fctr) (fctr) (int) (chr) (int) (dbl)
## 1 phytophthora-rot 0 0 fruit.pods 88 0.0000000
## 2 phytophthora-rot 1 0 fruit.pods 88 0.0000000
## 3 phytophthora-rot 2 0 fruit.pods 88 0.0000000
## 4 phytophthora-rot 3 20 fruit.pods 88 0.2272727
#
# The above chart also shows that:
# 1. There are 19 Explanatory Variables with fewer than 40 observations.
# 2. 10 of these had factor values of zero for all observations.
# 3. It's not clear that these Explanatory Variables have any explaining power for phytophthora-rot.
# 4. For those Explanatory variables with > 75 observations, 6 have a single non-zero factor value (and area.dam almost
# has a single factor value). If we were sure that those Explanatory variables would be zero in the absence
# of the disease, we would have markers for the disease.
#
# It is not clear that for items 1 and 2 above that interpolation would be useful.
# (possibly return to this and do kNN)