Data624

Assignment:

Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling. Please submit your Rpubs link along with your .rmd code.

Exercises

3.1. The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

a- Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

library(mlbench)

## Warning: package 'mlbench' was built under R version 4.0.5

library(purrr)

## 
## Attaching package: 'purrr'

## The following object is masked from 'package:car':
## 
##     some

## The following object is masked from 'package:caret':
## 
##     lift

data(Glass)
#str(Glass)
#view(Glass)
sum(is.na(Glass))

## [1] 0

sapply(Glass, class)

##        RI        Na        Mg        Al        Si         K        Ca        Ba 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##        Fe      Type 
## "numeric"  "factor"

Glass1 <-  dplyr::select(Glass, -Type)

Glass %>%
  head(5) %>%
  kable()

RI	Na	Mg	Al	Si	K	Ca	Type
1.52101	13.64	4.49	1.10	71.78	0.06	8.75	1
1.51761	13.89	3.60	1.36	72.73	0.48	7.83	1
1.51618	13.53	3.55	1.54	72.99	0.39	7.78	1
1.51766	13.21	3.69	1.29	72.61	0.57	8.22	1
1.51742	13.27	3.62	1.24	73.08	0.55	8.07	1

summary(Glass)

##        RI              Na              Mg              Al       
##  Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
##  1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
##  Median :1.518   Median :13.30   Median :3.480   Median :1.360  
##  Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445  
##  3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
##  Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
##        Si              K                Ca               Ba       
##  Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
##  1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
##  Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
##  Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
##  3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
##  Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
##        Fe          Type  
##  Min.   :0.00000   1:70  
##  1st Qu.:0.00000   2:76  
##  Median :0.00000   3:17  
##  Mean   :0.05701   5:13  
##  3rd Qu.:0.10000   6: 9  
##  Max.   :0.51000   7:29

Glass1 %>%
  keep(is.numeric) %>%                     # Keep only numeric columns
  gather() %>%                             # Convert to key-value pairs
  ggplot(aes(value)) +                     # Plot the values
    facet_wrap(~ key, scales = "free") +   # In separate panels
    geom_density() +                        # as density
  theme_dark()

cat("\nAnother way of looking at the data distribution for all 9 predictors (RI, Na, Mg, Al, Si, K, Ca, Ba, and Fe).")

## 
## Another way of looking at the data distribution for all 9 predictors (RI, Na, Mg, Al, Si, K, Ca, Ba, and Fe).

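df2 <- Glass1                        # Type was already dropped, so all 9 columns are predictors
par(mfrow = c(3, 3))                 # 3 x 3 grid, one panel per predictor
for (i in seq_along(df2)) {
  d <- density(df2[, i])             # kernel density estimate of predictor i
  plot(d, main = names(df2)[i])
  polygon(d, col = "blue")           # fill the area under the curve
}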
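
The exercise also asks about relationships between predictors, which the density plots do not show. A minimal sketch, assuming pairwise Pearson correlation is an acceptable summary, lists the most strongly correlated predictor pairs. `top_correlations` is a hypothetical helper name, not part of the original analysis.

```r
# Sketch: list predictor pairs sorted by absolute Pearson correlation.
# `top_correlations` is a hypothetical helper, not from the original report.
top_correlations <- function(df, n = 5) {
  num <- df[sapply(df, is.numeric)]              # keep numeric columns only
  cm  <- cor(num, use = "pairwise.complete.obs") # correlation matrix
  idx <- which(upper.tri(cm), arr.ind = TRUE)    # visit each pair once
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
  head(out[order(-abs(out$r)), ], n)             # strongest pairs first
}

# Usage on the Glass data, if mlbench is installed (Type is a factor and is skipped).
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  print(top_correlations(Glass))
}
```

Pairs with a large absolute correlation are candidates for closer inspection with a scatter plot.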

# Not sure if I should convert the variable "Type" in numeric datatype
#hist(Glass$Type,main = " Distribution of predicted Elements Type ",xlab = "Probability of Element", col = 'blue')

out <- boxplot.stats(Glass1$Na)$out   # values beyond the whiskers for Na
boxplot(Glass1$Na,
  ylab = "Value of element",
  main = "Boxplot of Na"
)
mtext(paste("Outliers: ", paste(out, collapse = ", ")))

Outlier <- boxplot(values ~ ind, data = stack(Glass1), plot = TRUE)$out   # one box per predictor

library(outliers)
# df1a <- c(1:9)
# df2a <- Glass1[ , -10]
# for (i in df1a) {
# outlier(df2a$i)
# }
# 
outlier(Glass1$Na)

## [1] 17.38

outlier(Glass1$Mg)

## [1] 0

outlier(Glass1$Al)

## [1] 3.5

outlier(Glass1$Si)

## [1] 69.81

outlier(Glass1$K)

## [1] 6.21

outlier(Glass1$Ca)

## [1] 16.19

outlier(Glass1$Ba)

## [1] 3.15

outlier(Glass1$Fe)

## [1] 0.51
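
Note that outlier() returns the single value farthest from the sample mean, so it returns a value for every variable whether or not that value is unusual. As a complement, the sketch below counts how many points the boxplot rule (beyond 1.5 times the IQR from the hinges) flags in each predictor. It uses only base R, and `count_boxplot_outliers` is a hypothetical helper name.

```r
# Sketch: count points flagged by the boxplot rule in each numeric column.
# `count_boxplot_outliers` is a hypothetical helper, not from the original report.
count_boxplot_outliers <- function(df) {
  num <- df[sapply(df, is.numeric)]                      # numeric columns only
  sapply(num, function(x) length(boxplot.stats(x)$out))  # flagged points per column
}

# Usage on the Glass data, if mlbench is installed.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  print(count_boxplot_outliers(Glass))
}
```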

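Do there appear to be any outliers in the data? Yes. The boxplots flag points beyond the whiskers, and the most extreme value of each predictor is listed above (for example, K = 6.21 against a third quartile of 0.61, and Ba = 3.15 against a third quartile of 0).

Are any predictors skewed? Yes. Based on the density plots, several predictors are skewed. Ba and Fe are right-skewed, with a median of 0 and a long upper tail; K and Ca also have long upper tails; Mg is left-skewed, with a mean (2.685) well below its median (3.480).

Are there any relevant transformations of one or more predictors that might improve the classification model? The classification model predicts the variable “Type” (I did not find a definition of its levels). Skewed predictors could be transformed, for example with log() or a Box-Cox transformation, which also makes the remaining outliers easier to see. Both require strictly positive values, and Mg, K, Ba and Fe contain zeros, so those predictors would need a constant offset or a transformation that accepts zeros. Another option is to group every predictor except “RI” into value ranges, which would make it easier to define the x variables when applying machine learning.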
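
To put numbers on the skewness discussed above, the following sketch computes the sample skewness of each predictor before and after a log1p() transformation. log1p(x) = log(1 + x) is used here only as one illustrative way to handle the zeros; it is an assumption of this sketch, not the method chosen in the report. `sample_skewness` and `compare_skew` are hypothetical helper names.

```r
# Sample skewness: third central moment divided by the variance to the power 1.5.
# Hypothetical helper written in base R.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Compare skewness of each numeric column before and after log1p().
# Assumes all values are greater than -1, which holds for non-negative data.
compare_skew <- function(df) {
  num <- df[sapply(df, is.numeric)]
  data.frame(
    raw   = sapply(num, sample_skewness),
    log1p = sapply(num, function(x) sample_skewness(log1p(x)))
  )
}

# Usage on the Glass data, if mlbench is installed.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  print(round(compare_skew(Glass), 2))
}
```

Values near zero indicate a roughly symmetric distribution; positive values indicate a long right tail and negative values a long left tail.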

Exercise 3.2.

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Class	date	precip	temp	crop.hist	area.dam	sever	seed.tmt	germ	plant.growth	leaves	leaf.marg	leaf.size	stem	lodging	stem.cankers	canker.lesion	fruiting.bodies	ext.decay	fruit.spots
diaporthe-stem-canker	6	2	1	1	1	1	0	0	1	1	2	2	1	1	3	1	1	1	4
diaporthe-stem-canker	4	2	1	2	0	2	1	1	1	1	2	2	1	0	3	1	1	1	4
diaporthe-stem-canker	3	2	1	1	0	2	1	2	1	1	2	2	1	0	3	0	1	1	4
diaporthe-stem-canker	3	2	1	1	0	2	0	1	1	1	2	2	1	0	3	0	1	1	4
diaporthe-stem-canker	6	2	1	2	0	1	0	2	1	1	2	2	1	0	3	1	1	1	4

##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112               
##             NA's: 16   NA's:  1                                                
##                                                                                
##                                                                                
##  leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
##  0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
##  1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
##          2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
##          NA's: 84   NA's: 84   NA's: 84                          NA's:108  
##                                                                            
##                                                                            
##                                                                            
##    stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
##  0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
##  1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
##  NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
##                        3   :191     3   : 65                      NA's: 38  
##                        NA's: 38     NA's: 38                                
##                                                                             
##                                                                             
##  mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
##  0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
##  1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
##  NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
##             NA's: 38                3   : 48   4   :100              
##                                     NA's: 84   NA's:106              
##                                                                      
##                                                                      
##  mold.growth seed.discolor seed.size  shriveling  roots    
##  0   :524    0   :513      0   :532   0   :539   0   :551  
##  1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
##  NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
##                                                  NA's: 31  
##                                                            
##                                                            
##

## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

## Warning: Removed 2336 rows containing non-finite values (stat_density).

## [1] 17 24 26

## Warning: package 'corrplot' was built under R version 4.0.5

## corrplot 0.84 loaded

## [1] 0

Based on the frequency plots, no predictor has a strictly degenerate distribution: a degenerate variable takes a single constant value, so its variance is zero (Var(X) = 0), and every predictor here takes at least two values. Several predictors are nevertheless heavily unbalanced. In the summary above, mycelium splits 639 to 6, sclerotia 625 to 20, and leaf.mild 535 to 20 to 20, which is the near-zero-variance pattern the chapter describes (few unique values and a large ratio between the most frequent and second most frequent value). We could also use correlation to find predictors that explain the same underlying information, although the many missing values make the plot hard to read. findCorrelation() with a cutoff of 0.75 on the Soybean data frame flags zero predictors, so no variable needs to be deleted because of high correlation.
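
The two quantities behind a near-zero-variance check can be computed directly from the frequency tables. The sketch below is a base R approximation, not caret's implementation: for each predictor it reports the ratio of the most frequent to the second most frequent value and the percentage of unique values. `freq_summary` is a hypothetical helper name, and missing values are ignored when counting levels.

```r
# Sketch: frequency ratio and percent unique values for each column.
# `freq_summary` is a hypothetical helper; NA values are dropped before counting.
freq_summary <- function(df) {
  stats <- lapply(df, function(x) {
    counts <- sort(table(x[!is.na(x)]), decreasing = TRUE)
    counts <- counts[counts > 0]                       # drop unused factor levels
    ratio  <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
    c(freq_ratio = ratio,
      pct_unique = 100 * length(counts) / length(x))
  })
  as.data.frame(do.call(rbind, stats))                 # one row per column
}

# Usage on the Soybean predictors (column 1 is Class), if mlbench is installed.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  res <- freq_summary(Soybean[, -1])
  print(res[order(-res$freq_ratio), ][1:5, ])
}
```

A large freq_ratio combined with a small pct_unique marks a predictor whose distribution is close to degenerate.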

Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

## [1] 683  36

## Total missing values from Soybean data is :  2337

## The percentage of missing values from Soybean data is:  9.504636

## [1] 9.5
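
The 9.5 % figure counts missing cells over all 683 x 36 cells, while the exercise quotes roughly 18 %. One plausible reading, offered here as an assumption, is that the exercise counts samples with at least one missing value: hail alone is missing in 121 rows, so at least 121 / 683 (about 17.7 %) of the rows are incomplete. The sketch below computes both percentages; `missing_overview` is a hypothetical helper name.

```r
# Sketch: percent of missing cells versus percent of rows with any missing value.
# `missing_overview` is a hypothetical helper, not from the original report.
missing_overview <- function(df) {
  c(
    pct_cells_missing   = 100 * mean(is.na(df)),           # NA cells / all cells
    pct_rows_incomplete = 100 * mean(!complete.cases(df))  # rows with >= 1 NA
  )
}

# Usage on the Soybean data, if mlbench is installed.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  print(round(missing_overview(Soybean), 1))
}
```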

## Warning: attributes are not identical across measure variables;
## they will be dropped

## `summarise()` has grouped output by 'variables'. You can override using the `.groups` argument.

variables	number.missing
hail	121
lodging	121
seed.tmt	121
sever	121
germ	112
leaf.mild	108
fruit.spots	106
fruiting.bodies	106
seed.discolor	106
shriveling	106
leaf.shread	100
mold.growth	92
seed	92
seed.size	92
fruit.pods	84
leaf.halo	84
leaf.malf	84
leaf.marg	84
leaf.size	84
canker.lesion	38
ext.decay	38
int.discolor	38
mycelium	38
precip	38
sclerotia	38
stem.cankers	38
plant.stand	36
roots	31
temp	30
crop.hist	16
plant.growth	16
stem	16
area.dam	1
date	1

##           Class            date     plant.stand          precip            temp 
##               0               1              36              38              30 
##            hail       crop.hist        area.dam           sever        seed.tmt 
##             121              16               1             121             121 
##            germ    plant.growth          leaves       leaf.halo       leaf.marg 
##             112              16               0              84              84 
##       leaf.size     leaf.shread       leaf.malf       leaf.mild            stem 
##              84             100              84             108              16 
##         lodging    stem.cankers   canker.lesion fruiting.bodies       ext.decay 
##             121              38              38             106              38 
##        mycelium    int.discolor       sclerotia      fruit.pods     fruit.spots 
##              38              38              38              84             106 
##            seed     mold.growth   seed.discolor       seed.size      shriveling 
##              92              92             106              92             106 
##           roots 
##              31

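Many variables have missing values, and the counts are not spread evenly: hail, lodging, seed.tmt and sever are each missing 121 times, and other groups of predictors share identical counts (106, 92, 84 and 38), which suggests that the same samples are missing several predictors at once. From these totals alone we cannot tell whether the pattern is related to the classes, but the plot of missing values shows where to look when dealing with the NAs.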
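
Whether the missing values are related to the classes can be checked by counting incomplete rows per class. The following is a minimal sketch in base R; `missing_by_class` is a hypothetical helper name, and the outcome column is assumed to be called "Class" as in the Soybean data.

```r
# Sketch: number and percentage of rows with any missing predictor, per class.
# `missing_by_class` is a hypothetical helper, not from the original report.
missing_by_class <- function(df, class_col = "Class") {
  predictors <- df[setdiff(names(df), class_col)]
  has_na     <- !complete.cases(predictors)       # TRUE if the row has any NA
  groups     <- split(has_na, df[[class_col]])    # one logical vector per class
  n_rows       <- vapply(groups, length, integer(1))
  n_incomplete <- vapply(groups, sum, integer(1))
  data.frame(
    n_rows,
    n_incomplete,
    pct_incomplete = round(100 * n_incomplete / n_rows, 1)
  )
}

# Usage on the Soybean data, if mlbench is installed.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  res <- missing_by_class(Soybean)
  print(res[res$n_incomplete > 0, ])
}
```

If the incomplete rows are concentrated in a few classes, the missingness is informative about the outcome and should be handled with that in mind.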

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

The missing values look structural, meaning no value was recorded when the data were generated. Based on the nearZeroVar() output we chose not to delete any variable, so we will use imputation to handle the missing data. There are several imputation techniques. One we have used in the past is imputation by the mean; because most of these predictors are categorical, the most frequent level (the mode) is the corresponding choice here. Another strategy discussed in the book is the function impute.knn, which uses a K-nearest neighbor model: a new sample is imputed by finding the samples in the training set “closest” to it and averaging these nearby points to fill in the value.
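
As a simple baseline for the imputation strategy described above, the sketch below replaces each missing value in a factor column with that column's most frequent level. This is mode imputation, not the K-nearest neighbor approach, and it ignores any relationship between missingness and class. `impute_mode` and `impute_all` are hypothetical helper names.

```r
# Sketch: mode imputation for factor columns (a simple baseline, not KNN).
# `impute_mode` and `impute_all` are hypothetical helpers.
impute_mode <- function(x) {
  stopifnot(is.factor(x))
  counts <- table(x)                                # NA is excluded by default
  x[is.na(x)] <- names(counts)[which.max(counts)]   # fill with the modal level
  x
}

# Apply mode imputation to every factor column except the outcome.
impute_all <- function(df, skip = "Class") {
  cols <- setdiff(names(df)[sapply(df, is.factor)], skip)
  df[cols] <- lapply(df[cols], impute_mode)
  df
}

# Usage on the Soybean data, if mlbench is installed.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  Soybean_imputed <- impute_all(Soybean)
  print(sum(is.na(Soybean_imputed)))
}
```

Because every missing value receives the same level, this approach makes unbalanced predictors even more unbalanced, which is one reason to compare it against a neighbor-based method.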

Data624_HW4

Alexis Mekueko

10/3/2021
