Ch3Exercise3-02.R

# Applied Predictive Modeling Chapter 3: Exercise 3.2
#
# This web page provides answers to Exercise 3.2 of Applied Predictive Modeling.
# That exercise says:
#
# "The soybean data can also be found at the UC Irvine Machine Learning Repository.  Data were collected to
# predict disease in 683 soybeans.  The 35 predictors are mostly categorical and include information on the 
# environmental condititions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold
# growth).  The outcome labels consist of 19 distinct classes."  The data can be loaded via mlbench.
#
#
# EXERCISE 3.2(a) 
# Investigate the frequency distributions for the categorical predictors.  Are any of the distributions
# degenerate in the ways dicussed earlier in this chapter?
#
# To address 3.2(a), I load the some programs and data and then create graphics.  
#
# Step 1: I load the libraries whose programs are used in the analysis
suppressMessages(library(mlbench))      # contains data set used in the analysis
suppressMessages(library(caret))        # APM pre-procoessing program and others
suppressMessages(library(dplyr))        # used for some data manipulation
suppressMessages(library(e1071))        # for skewness computation
suppressMessages(library(tidyr))        # reshape a correlation matrix for use in heat map
suppressMessages(library(knitr))        # used to create output for web
suppressMessages(library(GGally))       # used for ggpairs correlation graph graph
#
# 
data(Soybean)
#
# Degenerate distribution: takes on only a single value.  Related: variables with insufficient 
# variability to provide much information.  (See p. 44.)
#
# A look at the dataframe shows that the variables are all factors or ordered factors. Since the Y variable
# also is a factor, one might use multinomial logit or LDA.
str(Soybean)

## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

#
# Look at variability (or lack thereof) of variables by Class
# The idea behind the following anaysis is to compute the number of times (and the %) that a particular
# factor value arises.  We do this for each explanatory variable by Class.  
#     The first steop is to compute the total count of observations for each of the Class diseases.
#     This is used to compute the proportion of observed factor counts.
total.count<-Soybean%>%
  group_by(Class)%>%
  summarise(total=n()) # You can also use do(count(.)) in this line.
#     The next step is, for each explanatory variable, create a cross-tabluation of the count 
#     (by factor number) by Class.
temp<-sapply(Soybean[,2:36],function(fact.num){table(Soybean$Class,fact.num)})
#     The loop below converts the results of the sapply into a long-form dataframe called count.byclass.
#     Possibly, I could have used a do.call, but I wanted to rbind the individual list results from 
#     the table together into a single long-form df.
for(i in 1:35){
  temp2<-data.frame(temp[[i]]) # This step converts item [[i]] of the list into a df.
  temp2$name<-names(temp)[i]   # Here, we add names to the colummns.
  ifelse(i==1,count.byclass<-temp2,count.byclass<-rbind(count.byclass,temp2)) # This step rbinds the df's.
}
names(count.byclass)<-c("Class","factor.value","count","factor") #Add names to the dataframe count.byclass.
count.byclass<-left_join(count.byclass,total.count,by="Class") #Append the totals so that we can compute proportions if we wish to.
count.byclass$pct<-count.byclass$count/count.byclass$total # Compute the the proportions.
count.byclass<-tbl_df(count.byclass)
rm(temp,temp2,i)
#
# Now we can see whether a particular independent variable has sufficient variability to explain a disease.
# For example, if an independent variable has a value of zero for all diseases (i.e., by Class) then it
# produces no explanatory information. (I interpret the data to being that a plant has the disease and that
# the researcher looks for one or more identifying characteristic (i.e., a non-zero on the independent variable).
# For example, in the charts created below:
#     Chart 12 [leaves] has a factor value of 1 100% of the time for 14 of the 19 diseases.  Hence, a value
#     of zero would mean that it it is one of 5 diseases. 
#     Chart 18 [leaf.mild] has zeros for 12 of 19 diseases.  It has 1 100% of the time for powdery mildew and
#     2 100% of the time for downy mildew.  
#     Chart 20 [lodging] has zeros 100% of the time for 7 of the diseases, and 1s a small proportion of the time
#     for 8 of the diseases.  It seems this would not have good explanatory power.
#     Chart 25 [mycelium] is zero across the board except (1) where it is NA and (2) thizoctonia-root-rot,
#     where it takes on a value of 1 about 25% of the time.   This is an example of low explanatory power and
#     insufficient variability.
#     Chart 26 [int.discolor] may be an indicator for brown-step-rot and charcoal-rot, but not anything else.
#     Chart 27 [sclorotia] may be an indicator of charcoal rot.  Perhaps 26 and 27 together.
#
# One could contemplate a rule that if a non-zero arises in 75% or more of the instances for a particular disease,
# that factor is an indicator for the disease.
#   As an aside: I am not sure how NAs are handled except in instances where the entire series is NA.
for(i in 1:35){
  readline(prompt="Hit return to see next plot>> ")
  fact.nam<-names(Soybean)[i+1]
    print(ggplot(subset(count.byclass,(factor==fact.nam)),aes(x=Class,y=pct,fill=factor.value))+
    geom_bar(stat="identity",position="stack")+
    theme(axis.text.x=element_text(angle=90))+
    labs(ylab("Percent of Total (by Class)")) +
    ggtitle(paste("Series Chart ",i,": Percent of Observations by Factor Value for Explanatory Variable [",
    fact.nam,"] \n (by Class)",sep=""))+
    theme(plot.title=element_text(size=10))+
    annotate(geom="text",label="Failure to sum to 100% is due to NAs",x=15,y=.1,size=3,colour="grey25")
  )
}

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

## Hit return to see next plot>>

rm(fact.nam,i)
# 
# This next exercise is designed to identify the number of NAs in the independent variables.
# EXERCISE 3.2(b)
# Roughly 18% of the data are missing.  Are there particular predictors that are more likely to be missing?
# Is the pattern of missing data related to the classes?
#
isna.byxvar<-data.frame(sapply(Soybean[,2:36],function(isna){sum(is.na(isna))}))
isna.byxvar$x.var<-rownames(isna.byxvar)
rownames(isna.byxvar)<-NULL
names(isna.byxvar)<-c("na.count","x.var")
isna.byxvar

##    na.count           x.var
## 1         1            date
## 2        36     plant.stand
## 3        38          precip
## 4        30            temp
## 5       121            hail
## 6        16       crop.hist
## 7         1        area.dam
## 8       121           sever
## 9       121        seed.tmt
## 10      112            germ
## 11       16    plant.growth
## 12        0          leaves
## 13       84       leaf.halo
## 14       84       leaf.marg
## 15       84       leaf.size
## 16      100     leaf.shread
## 17       84       leaf.malf
## 18      108       leaf.mild
## 19       16            stem
## 20      121         lodging
## 21       38    stem.cankers
## 22       38   canker.lesion
## 23      106 fruiting.bodies
## 24       38       ext.decay
## 25       38        mycelium
## 26       38    int.discolor
## 27       38       sclerotia
## 28       84      fruit.pods
## 29      106     fruit.spots
## 30       92            seed
## 31       92     mold.growth
## 32      106   seed.discolor
## 33       92       seed.size
## 34      106      shriveling
## 35       31           roots

#
ggplot(isna.byxvar,aes(x=reorder(x.var,-na.count),y=na.count))+
  geom_bar(stat="identity",fill="light blue",colour="black")+
  theme(axis.text.x=element_text(angle=90))+
  xlab("Explanatory Variable")+
  ylab("Number of NAs")+
  ggtitle("Number of NAs by Explanatory Variable")+
  theme(plot.title=element_text(size=10))

# 
# The above shows that (out of 683), the max NAs are 121 (17%).  
# I also re-run this analysis by Class
isna.byclass<-Soybean%>%
  group_by(Class)%>%
  do(data.frame(sum(is.na(.))))
names(isna.byclass)<-c("Class","na.count")
isna.byclass

## Source: local data frame [19 x 2]
## Groups: Class [19]
## 
##                          Class na.count
##                         (fctr)    (int)
## 1                 2-4-d-injury      450
## 2          alternarialeaf-spot        0
## 3                  anthracnose        0
## 4             bacterial-blight        0
## 5            bacterial-pustule        0
## 6                   brown-spot        0
## 7               brown-stem-rot        0
## 8                 charcoal-rot        0
## 9                cyst-nematode      336
## 10 diaporthe-pod-&-stem-blight      177
## 11       diaporthe-stem-canker        0
## 12                downy-mildew        0
## 13          frog-eye-leaf-spot        0
## 14            herbicide-injury      160
## 15      phyllosticta-leaf-spot        0
## 16            phytophthora-rot     1214
## 17              powdery-mildew        0
## 18           purple-seed-stain        0
## 19        rhizoctonia-root-rot        0

#
#
ggplot(isna.byclass,aes(x=reorder(Class,-na.count),y=na.count))+
  geom_bar(stat="identity",fill="light blue",colour="black")+
  theme(axis.text.x=element_text(angle=90))+
  xlab("Class of Disease")+
  ylab("Number of NAs")+
  ggtitle("Number of NAs by Class of Disease")+
  theme(plot.title=element_text(size=10))

#
# Here is a full table of NA counts
isna.all<-Soybean%>%
  group_by(Class)%>%
  do(data.frame(apply(.,2,function(x){length(which(is.na(x)))})))
isna.all$x.var<-names(Soybean)
names(isna.all)<-c("Class","na.count","var")
isna.all<-filter(isna.all,var!="Class")
#
isna.table<-spread(isna.all,Class,na.count)
#
#
# And an NA heatmap
ggplot(isna.all,aes(x=Class,y=var,size=na.count,colour=na.count))+
  geom_point()+
  theme_bw()+
  theme(axis.text.x=element_text(angle=90))+
  ggtitle("Number of NAs \n (by Class and Variable)")+
  theme(plot.title=element_text(size=8))+
  ylab("Explanatory Variable")+
  scale_colour_gradientn(colours=topo.colors(50))

#
# The heatmap shows that most of the NAs are with regard to:
#   1.  phytophtora rot
#   2.  2-4-d-injury
#   3.  cyst nematode
#   4.  diaporthe pod & stem blight
#
# EXERCISE 3.2(c)
# Develop a strategy for handling missing data, either by eliminating predictors or imputation.
#
# I will focus on phytophthora rot, where 1,214 of the NAs are concentrated.
# 1.  The reason for the NAs *may* be that the particular explanatory variable is irrelevant for that Class (although
#     one would think this would be coded as a 0 and not as an NA.)
# 2. 
fact.nam<-as.character(unique(Soybean$Class)[4])
nlv<-sapply(Soybean,function(x){nlevels(x)})
nlv<-data.frame(nlv[2:36])
nlv$x.var<-rownames(nlv)
names(nlv)[1]<-"lev.num"
nlv[,1]<-as.character(nlv[,1])
rownames(nlv)<-NULL
nlv<-arrange(nlv,x.var)
nlv

##    lev.num           x.var
## 1        4        area.dam
## 2        4   canker.lesion
## 3        4       crop.hist
## 4        7            date
## 5        3       ext.decay
## 6        4      fruit.pods
## 7        4     fruit.spots
## 8        2 fruiting.bodies
## 9        3            germ
## 10       2            hail
## 11       3    int.discolor
## 12       3       leaf.halo
## 13       2       leaf.malf
## 14       3       leaf.marg
## 15       3       leaf.mild
## 16       2     leaf.shread
## 17       3       leaf.size
## 18       2          leaves
## 19       2         lodging
## 20       2     mold.growth
## 21       2        mycelium
## 22       2    plant.growth
## 23       2     plant.stand
## 24       3          precip
## 25       3           roots
## 26       2       sclerotia
## 27       2            seed
## 28       2   seed.discolor
## 29       2       seed.size
## 30       3        seed.tmt
## 31       3           sever
## 32       2      shriveling
## 33       2            stem
## 34       4    stem.cankers
## 35       3            temp

ggplot(subset(count.byclass,(Class==fact.nam)),aes(x=factor,y=count,fill=factor.value))+
  geom_bar(stat="identity",position="stack")+
  theme(axis.text.x=element_text(angle=90))+
  ylab("Number of Observations by Factor Value")+
  theme(axis.title.y = element_text(size = rel(.8)))+
  xlab("Explanatory Variable")+
  annotate("text",y=5,x=1:35,label=nlv[,1],color="gray20",size=3)+
  ggtitle(paste("Number of Observations by Factor Value for Each Explanatory Variable \n (for ",fact.nam,
                " and for a total of 88 possible observations)",sep=""))+
  theme(plot.title=element_text(size=10))

rm(fact.nam,nlv,i)

## Warning in rm(fact.nam, nlv, i): object 'i' not found

#
# The chart shows, for example, that fruit.pods has 4 levels.  However, all of the observations are of level 3.
# Moreover, there are only 20 observations, leaving 68 NAs (88 - 20).  Here's a look at the fruit.pods data showing
# no observations for factor values 0,1,or 2 and 20 observations for factor value 3:
#
subset(count.byclass,factor=="fruit.pods" & Class=="phytophthora-rot")

## Source: local data frame [4 x 6]
## 
##              Class factor.value count     factor total       pct
##             (fctr)       (fctr) (int)      (chr) (int)     (dbl)
## 1 phytophthora-rot            0     0 fruit.pods    88 0.0000000
## 2 phytophthora-rot            1     0 fruit.pods    88 0.0000000
## 3 phytophthora-rot            2     0 fruit.pods    88 0.0000000
## 4 phytophthora-rot            3    20 fruit.pods    88 0.2272727

#
# The above chart also shows that:
#   1.    There are 19 Explanatory Variables with fewer than 40 observations.
#   2.    10 of these had factor values of zero for all observations.
#   3.    It's not clear that these Explanatory Variables have any explaining power for phytophthora-rot.
#   4.    For those Explanatory variables with > 75 observations, 6 have a single non-zero factor value (and area.dam almost
#         has a single factor value).  If we were sure that those Explanatory variables would be zero in the absence
#         of the disease, we would have markers for the disease.  
#
# It is not clear that for items 1 and 2 above that interpolation would be useful.  
# (possibly return to this and do kNN)

Ch3Exercise3-02.R

FPAMPUSH

Thu Oct 29 15:45:01 2015