
Assignment:

Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling. Please submit your Rpubs link along with your .rmd code.

Exercises

3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

a- Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

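library(dplyr)    # %>% and select()
library(tidyr)    # gather()
library(ggplot2)  # density plots
library(knitr)    # kable()
library(mlbench)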
## Warning: package 'mlbench' was built under R version 4.0.5
library(purrr)
## 
## Attaching package: 'purrr'
## The following object is masked from 'package:car':
## 
##     some
## The following object is masked from 'package:caret':
## 
##     lift
data(Glass)
#str(Glass)
#view(Glass)
sum(is.na(Glass))
## [1] 0
sapply(Glass, class)
##        RI        Na        Mg        Al        Si         K        Ca        Ba 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##        Fe      Type 
## "numeric"  "factor"
Glass1 <-  dplyr::select(Glass, -Type)

Glass %>%
  head(5) %>%
  kable()
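| RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | Type |
|---|---|---|---|---|---|---|---|---|---|
| 1.52101 | 13.64 | 4.49 | 1.10 | 71.78 | 0.06 | 8.75 | 0 | 0 | 1 |
| 1.51761 | 13.89 | 3.60 | 1.36 | 72.73 | 0.48 | 7.83 | 0 | 0 | 1 |
| 1.51618 | 13.53 | 3.55 | 1.54 | 72.99 | 0.39 | 7.78 | 0 | 0 | 1 |
| 1.51766 | 13.21 | 3.69 | 1.29 | 72.61 | 0.57 | 8.22 | 0 | 0 | 1 |
| 1.51742 | 13.27 | 3.62 | 1.24 | 73.08 | 0.55 | 8.07 | 0 | 0 | 1 |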
summary(Glass)
##        RI              Na              Mg              Al       
##  Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
##  1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
##  Median :1.518   Median :13.30   Median :3.480   Median :1.360  
##  Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445  
##  3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
##  Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
##        Si              K                Ca               Ba       
##  Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
##  1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
##  Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
##  Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
##  3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
##  Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
##        Fe          Type  
##  Min.   :0.00000   1:70  
##  1st Qu.:0.00000   2:76  
##  Median :0.00000   3:17  
##  Mean   :0.05701   5:13  
##  3rd Qu.:0.10000   6: 9  
##  Max.   :0.51000   7:29
Glass1 %>%
  keep(is.numeric) %>%                     # Keep only numeric columns
  gather() %>%                             # Convert to key-value pairs
  ggplot(aes(value)) +                     # Plot the values
    facet_wrap(~ key, scales = "free") +   # In separate panels
    geom_density() +                        # as density
  theme_dark()

cat("\nAnother way of looking at the distributions of all 9 predictors (RI, Na, Mg, Al, Si, K, Ca, Ba, and Fe).")
## 
## Another way of looking at the distributions of all 9 predictors (RI, Na, Mg, Al, Si, K, Ca, Ba, and Fe).
par(mfrow = c(3, 3))                 # 3 x 3 grid, one panel per predictor
for (i in seq_along(Glass1)) {       # Glass1 already excludes Type
  d <- density(Glass1[, i])          # kernel density estimate for column i
  plot(d, main = names(Glass1)[i])
  polygon(d, col = "blue")           # fill the area under the curve
}

# "Type" is a factor, so hist() does not apply to it; a bar plot of the class counts would be used instead
#barplot(table(Glass$Type), main = "Distribution of glass types", xlab = "Type", col = 'blue')

out <- boxplot.stats(Glass1$Na)$out
boxplot(Glass1$Na,
  ylab = "Value of element",
  main = "Boxplot of Na"
)
mtext(paste("Outliers: ", paste(out, collapse = ", ")))

# One box per predictor: stack() returns the columns "values" and "ind"
Outlier <- boxplot(values ~ ind, data = stack(Glass1), plot = TRUE)$out

library(outliers)
# Loop version of the calls below (columns are indexed by name, not with $i):
# for (nm in names(Glass1)) {
#   print(outlier(Glass1[[nm]]))
# }
# 
outlier(Glass1$Na)
## [1] 17.38
outlier(Glass1$Mg)
## [1] 0
outlier(Glass1$Al)
## [1] 3.5
outlier(Glass1$Si)
## [1] 69.81
outlier(Glass1$K)
## [1] 6.21
outlier(Glass1$Ca)
## [1] 16.19
outlier(Glass1$Ba)
## [1] 3.15
outlier(Glass1$Fe)
## [1] 0.51
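
The column-by-column checks above can also be run in one pass. The following is a minimal sketch in base R, not the code used for the output above: `predictors`, `n_flagged` and `most_extreme` are names introduced here for illustration. It counts the points beyond the box-plot whiskers (the 1.5 * IQR rule) for each predictor, and it picks the value furthest from the column mean, which is the same quantity that outlier() reports.

```r
library(mlbench)   # provides the Glass data set
data(Glass)

# The nine numeric predictors (everything except the class label)
predictors <- Glass[, names(Glass) != "Type"]

# Number of points beyond the box-plot whiskers (1.5 * IQR rule) per predictor
n_flagged <- sapply(predictors, function(x) length(boxplot.stats(x)$out))

# Most extreme value per predictor: the observation furthest from the column mean
most_extreme <- sapply(predictors, function(x) x[which.max(abs(x - mean(x)))])

print(n_flagged)
print(most_extreme)
```

Note that the second quantity always returns a value, even for a column with no unusual points, so the whisker counts are the more informative of the two when deciding whether outliers are present.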

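b- Do there appear to be any outliers in the data? Are any predictors skewed?

Yes, there are outliers in the data: the box plots show points beyond the whiskers, and the most extreme values (for example K = 6.21, Ca = 16.19, and Na = 17.38) lie far from the bulk of the samples. Keep in mind that outlier() simply returns the value furthest from the mean, so it reports a value for every column whether or not that value is unusual. Yes, several predictors are skewed. Based on the density plots and the summary, K, Ca, Ba, and Fe have long right tails (K has a median of 0.555 but a maximum of 6.21, and more than three quarters of the Ba values are 0), while Mg is left-skewed (its mean, 2.685, is well below its median, 3.480).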
c- Are there any relevant transformations of one or more predictors that might improve the classification model?

In this analysis, the classification target is the variable “Type” (I did not find a definition of its levels, so it may need to be recoded). Some of the skewed predictors could be transformed, for example with log() or a Box-Cox transformation, to reduce the skewness and pull in the extreme values. Because Mg, K, Ba, and Fe contain zeros, log() and Box-Cox cannot be applied to them directly; a shifted log or a Yeo-Johnson transformation would be needed for those columns. For the classification model, the predictors other than “RI” could also be grouped into value ranges, which would make it easier to define the input variables when fitting a machine learning model.
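
As a rough illustration of the skewness and transformation points, the sketch below measures the skewness of each Glass predictor before and after a transformation. It is written under a few assumptions: `skew` is a small helper defined here (the moment-based sample skewness), not a function from the original analysis, and caret's preProcess() with method = "YeoJohnson" stands in for Box-Cox because Mg, K, Ba, and Fe contain zeros.

```r
library(caret)     # preProcess()
library(mlbench)   # Glass data
data(Glass)

predictors <- Glass[, names(Glass) != "Type"]

# Helper (assumption, defined for this sketch): moment-based sample skewness,
# the third central moment divided by the second central moment to the power 1.5
skew <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skew_before <- sapply(predictors, skew)

# Yeo-Johnson handles zeros, unlike Box-Cox, which needs strictly positive data
pp <- preProcess(predictors, method = "YeoJohnson")
transformed <- predict(pp, predictors)

skew_after <- sapply(transformed, skew)

# Side-by-side comparison, one row per predictor
print(round(cbind(before = skew_before, after = skew_after), 2))
```

A positive value indicates a right tail and a negative value a left tail; comparing the two columns shows which predictors the transformation helps and which it leaves largely unchanged.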

Exercise 3.2.

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
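| Class | date | plant.stand | precip | temp | hail | crop.hist | area.dam | sever | seed.tmt | germ | plant.growth | leaves | leaf.halo | leaf.marg | leaf.size | leaf.shread | leaf.malf | leaf.mild | stem | lodging | stem.cankers | canker.lesion | fruiting.bodies | ext.decay | mycelium | int.discolor | sclerotia | fruit.pods | fruit.spots | seed | mold.growth | seed.discolor | seed.size | shriveling | roots |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| diaporthe-stem-canker | 6 | 0 | 2 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 1 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| diaporthe-stem-canker | 4 | 0 | 2 | 1 | 0 | 2 | 0 | 2 | 1 | 1 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 0 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| diaporthe-stem-canker | 3 | 0 | 2 | 1 | 0 | 1 | 0 | 2 | 1 | 2 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| diaporthe-stem-canker | 3 | 0 | 2 | 1 | 0 | 1 | 0 | 2 | 0 | 1 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| diaporthe-stem-canker | 6 | 0 | 2 | 1 | 0 | 2 | 0 | 1 | 0 | 2 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 0 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |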
##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112               
##             NA's: 16   NA's:  1                                                
##                                                                                
##                                                                                
##  leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
##  0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
##  1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
##          2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
##          NA's: 84   NA's: 84   NA's: 84                          NA's:108  
##                                                                            
##                                                                            
##                                                                            
##    stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
##  0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
##  1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
##  NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
##                        3   :191     3   : 65                      NA's: 38  
##                        NA's: 38     NA's: 38                                
##                                                                             
##                                                                             
##  mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
##  0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
##  1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
##  NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
##             NA's: 38                3   : 48   4   :100              
##                                     NA's: 84   NA's:106              
##                                                                      
##                                                                      
##  mold.growth seed.discolor seed.size  shriveling  roots    
##  0   :524    0   :513      0   :532   0   :539   0   :551  
##  1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
##  NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
##                                                  NA's: 31  
##                                                            
##                                                            
## 
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## Warning: Removed 2336 rows containing non-finite values (stat_density).

## [1] 17 24 26
## Warning: package 'corrplot' was built under R version 4.0.5
## corrplot 0.84 loaded
## [1] 0

Based on the frequency tables, no predictor is strictly degenerate: none takes a single constant value, which would give zero variance (Var(X) = 0). Three predictors come close, however. mycelium (639 vs. 6), sclerotia (625 vs. 20), and leaf.mild (535 vs. 20 vs. 20) are each dominated by one level, with ratios of the most common to the second most common value of roughly 107, 31, and 27. These are above the rule-of-thumb threshold of about 20 discussed in the chapter, and the "[1] 17 24 26" output above appears to be nearZeroVar() flagging these three columns. We could also use the correlation function to find predictors that are highly correlated (meaning they explain the same underlying response). However, the many missing values make the correlation plot difficult to read. findCorrelation() with a cutoff of 0.75 on the Soybean data frame returned zero, so no predictor needs to be removed for high correlation.
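
The near-zero-variance check can be reproduced along the following lines. This is a sketch rather than the code behind the output above: it uses caret's nearZeroVar() with its default thresholds and saveMetrics = TRUE so that the frequency ratio and percentage of unique values are visible for each predictor. `soy_predictors`, `nzv_metrics` and `flagged` are names chosen here for illustration.

```r
library(caret)     # nearZeroVar()
library(mlbench)   # Soybean data
data(Soybean)

# Drop the outcome so only the 35 predictors are screened
soy_predictors <- Soybean[, names(Soybean) != "Class"]

# saveMetrics = TRUE returns one row per predictor with
# freqRatio, percentUnique, zeroVar and nzv columns
nzv_metrics <- nearZeroVar(soy_predictors, saveMetrics = TRUE)

# Names of the predictors flagged as near-zero variance
flagged <- rownames(nzv_metrics)[nzv_metrics$nzv]

print(nzv_metrics[flagged, ])
```

Working with predictor names instead of column positions avoids confusion when columns have been dropped before the call, since the positions then no longer match the original data frame.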

  2. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
## [1] 683  36
## Total missing values from Soybean data is :  2337
## The percentage of missing values from Soybean data is:  9.504636
## [1] 9.5
## Warning: attributes are not identical across measure variables;
## they will be dropped
## `summarise()` has grouped output by 'variables'. You can override using the `.groups` argument.
| variables | number.missing |
|---|---|
| hail | 121 |
| lodging | 121 |
| seed.tmt | 121 |
| sever | 121 |
| germ | 112 |
| leaf.mild | 108 |
| fruit.spots | 106 |
| fruiting.bodies | 106 |
| seed.discolor | 106 |
| shriveling | 106 |
| leaf.shread | 100 |
| mold.growth | 92 |
| seed | 92 |
| seed.size | 92 |
| fruit.pods | 84 |
| leaf.halo | 84 |
| leaf.malf | 84 |
| leaf.marg | 84 |
| leaf.size | 84 |
| canker.lesion | 38 |
| ext.decay | 38 |
| int.discolor | 38 |
| mycelium | 38 |
| precip | 38 |
| sclerotia | 38 |
| stem.cankers | 38 |
| plant.stand | 36 |
| roots | 31 |
| temp | 30 |
| crop.hist | 16 |
| plant.growth | 16 |
| stem | 16 |
| area.dam | 1 |
| date | 1 |

##           Class            date     plant.stand          precip            temp 
##               0               1              36              38              30 
##            hail       crop.hist        area.dam           sever        seed.tmt 
##             121              16               1             121             121 
##            germ    plant.growth          leaves       leaf.halo       leaf.marg 
##             112              16               0              84              84 
##       leaf.size     leaf.shread       leaf.malf       leaf.mild            stem 
##              84             100              84             108              16 
##         lodging    stem.cankers   canker.lesion fruiting.bodies       ext.decay 
##             121              38              38             106              38 
##        mycelium    int.discolor       sclerotia      fruit.pods     fruit.spots 
##              38              38              38              84             106 
##            seed     mold.growth   seed.discolor       seed.size      shriveling 
##              92              92             106              92             106 
##           roots 
##              31

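Many predictors have missing values. hail, lodging, seed.tmt, and sever have the most, at 121 each (about 17.7% of the 683 rows, which is close to the "roughly 18%" quoted in the question), followed by germ (112) and leaf.mild (108). Overall, 9.5% of the cells are missing. Several groups of predictors share exactly the same count (121, 106, 92, 84, 38), which suggests that the values tend to be missing together in the same rows. These counts alone do not show whether the pattern is related to the classes; that requires breaking the missing values down by Class. The table and plot of missing values do show where to look when dealing with the NAs.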
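
One way to check whether the missing values are tied to the classes is to count the incomplete rows per class. The sketch below does this in base R; it is an illustration, not part of the original analysis, and the names `incomplete`, `incomplete_by_class` and `missing_summary` are introduced here.

```r
library(mlbench)   # Soybean data
data(Soybean)

# Overall percentage of missing cells (rows x columns)
pct_missing <- 100 * mean(is.na(Soybean))

# TRUE for every row that has at least one missing value
incomplete <- !complete.cases(Soybean)

# Number of incomplete rows within each class
incomplete_by_class <- tapply(incomplete, Soybean$Class, sum)
class_sizes <- table(Soybean$Class)

# One row per class: class size and how many of its rows are incomplete
missing_summary <- data.frame(
  class        = names(incomplete_by_class),
  n            = as.integer(class_sizes[names(incomplete_by_class)]),
  n_incomplete = as.integer(incomplete_by_class)
)
missing_summary <- missing_summary[order(-missing_summary$n_incomplete), ]

print(pct_missing)
print(missing_summary, row.names = FALSE)
```

If the incomplete rows concentrate in a handful of classes, the missingness is informative about the outcome, and dropping incomplete rows would remove those classes disproportionately.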

  3. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

The missing values look structural, meaning no value was recorded when the data were generated. Eliminating predictors is not attractive here: nearZeroVar() flags at most a few near-constant columns, while the missing values are spread across almost all of the predictors, so we will use imputation instead. There are several imputation techniques. One we have used in the past is imputation by the mean; because the Soybean predictors are factors, the analogue here is the most frequent level (the mode). Another strategy discussed in the book is the function impute.knn, which uses a K-nearest neighbor model: a new sample is imputed by finding the samples in the training set “closest” to it and averaging these nearby points to fill in the value.
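
As a simple baseline for the imputation step, the sketch below fills each missing value with the most frequent level of its column. It is a minimal illustration under stated assumptions: `impute_mode` is a helper written for this example (not a function from caret or the impute package), and mode imputation is used in place of the K-nearest-neighbor approach, which would take the other predictors into account.

```r
library(mlbench)   # Soybean data
data(Soybean)

# Helper (assumption, defined for this sketch): replace the NAs in a factor
# with that factor's most frequent level
impute_mode <- function(x) {
  if (anyNA(x)) {
    mode_level <- names(which.max(table(x)))   # table() ignores NA by default
    x[is.na(x)] <- mode_level
  }
  x
}

# Apply column by column; assigning into soy_imputed[] keeps the data frame structure
soy_imputed <- Soybean
soy_imputed[] <- lapply(Soybean, impute_mode)

# No missing values should remain
print(sum(is.na(soy_imputed)))
```

Mode imputation is quick, but it pushes every imputed value toward the majority level, which makes skewed predictors even more unbalanced; that is one reason to compare it with a neighbor-based method before settling on a strategy.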