DATA 624 HOMEWORK4

Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling. Please submit your Rpubs link along with your .pdf for your run code.

3.1 a

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

ANSWER:

I have used histograms and the correlation plots to show relationships between predictors.For instance RI has a strong positive correlation with Ca. Multicollinearity may have an impact during model build and we want to reduce where possible to appropriate features so that we don’t have redundant features in the model. Here’s we can use GGlally , pairsplot to view the correlation between variables.This is a good tool to understand multicollinearity.

library(ggplot2)
library(ggfortify)
#3.1 UC Irvine Glass Analysis

data(Glass) 

str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

df <- Glass %>% 
  dplyr::select(-Type)

df1 <- df %>% 
  gather(key = 'variable', value = 'value')

# Histogram plots of each variable
ggplot(df1) + 
  geom_histogram(aes(x=value, y = ..density..), bins=30) + 
  geom_density(aes(x=value), color='blue') +
  facet_wrap(. ~variable, scales='free', ncol=4)

# Select numeric columns only
M<-cor(df)
 
ggcorrplot(M, type = "upper", outline.color = "white",
           ggtheme = theme_classic,
           #colors = c("#6D9EC1", "white", "#E46726"),
           lab = TRUE, show.legend = FALSE, tl.cex = 8, lab_size = 3)

ggpairs(Glass[,1:2])

ggpairs(Glass[,3:4])

ggpairs(Glass[,5:6])

3.1b

Do there appear to be any outliers in the data? Are any predictors skewed?

ANSWER:

Yes, several variables have outliers see boxplot (Ca, FE, K all have outliers), see histogram of RI below, shows outliers on the right. Additionally, right skew variables have large values on the left side of the distribution. As the distribution become right skewed, the skewness becomes larger. Here we can see from the skewness calculation that MG and SI have negative skewness, whereas the rest of the variables have a positive skewness.

From the book - “A general rule of thumb to consider is that skewed data whose ratio of the highest value to the lowest value is greater than 20 have significant skewness. Also, the skewness statistic can be used as a diagnostic. If the predictor distribution is roughly symmetric, the skewness values will be close to zero. As the distribution becomes more right skewed, the skewness statistic becomes larger. Similarly, as the distribution becomes more left skewed, the value becomes negative.”

#3.1 UC Irvine Glass Analysis

# Identify missing data by Feature and display percent breakout
missing <- colSums(Glass %>% sapply(is.na))
missing_pct <- round(missing / nrow(Glass) * 100, 2)
stack(sort(missing_pct, decreasing = TRUE))

# Skewness Predictor
(skewV <- apply(df, 2, skewness))

## $RI
## Skewness |    SE
## ----------------
##    1.625 | 0.165
## 
## $Na
## Skewness |    SE
## ----------------
##    0.454 | 0.165
## 
## $Mg
## Skewness |    SE
## ----------------
##   -1.153 | 0.165
## 
## $Al
## Skewness |    SE
## ----------------
##    0.907 | 0.165
## 
## $Si
## Skewness |    SE
## ----------------
##   -0.730 | 0.165
## 
## $K
## Skewness |    SE
## ----------------
##    6.552 | 0.165
## 
## $Ca
## Skewness |    SE
## ----------------
##    2.047 | 0.165
## 
## $Ba
## Skewness |    SE
## ----------------
##    3.416 | 0.165
## 
## $Fe
## Skewness |    SE
## ----------------
##    1.754 | 0.165

# Boxplots for each variable
ggplot(df1, aes(variable, value)) + 
  geom_boxplot() + 
  facet_wrap(. ~variable, scales='free', ncol=6)

3.1c

Are there any relevant transformations of one or more predictors that might improve the classification model?

ANSWER:

Yes, to reduce skewness, we can apply centering and scaling. We can subtract values from the mean to center. Also divide by SD to normalize the data. Here I have tried boxcox and log transformations and to check and see if we can minimize the skewness where possible for few variables on the histogram.

hist(Glass$RI)

hist(Glass$Mg)

hist(Glass$Si)

#Transform using boxcox. 
Glass$GlassMgt_t <- BoxCox(Glass$Mg, BoxCox.lambda(Glass$Mg))
Glass$GlassRI_t <- BoxCox(Glass$RI, BoxCox.lambda(Glass$RI))
Glass$GlassSI_t <- BoxCox(Glass$Si, BoxCox.lambda(Glass$Si))
Glass$GlassMgt_lt <- log(Glass$Mg)
Glass$Glasssi_lt <- log(Glass$Si)
hist(Glass$GlassMgt_t)

hist(Glass$GlassMgt_lt)

hist(Glass$GlassRI_t)

hist(Glass$Glasssi_lt)

hist(Glass$GlassSI_t)

3.2a

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

ANSWER below:

MYCELIUM , ROOTS, SCLEROTIA, INT.DISCOLOR seem to take on 1 unique value. These are zero variance predictor therefore, these can cause imbalance and affect our model have examples of degenerate distribution. We could potentially remove these variables as one option if they are not informational and value added when building our model. There are many variables with NA’S and missing which could be structurally missing or missing from certain classes that contribute to informative missingness.

Degenerate distributions :We can see from the textbook the following.A predictor variable that has a single unique value can be a zero variance predictor. We can remove a variable here as these may be uninformative and not impact calculations.Similarly, some predictors might have only a handful of unique values that occur with very low frequencies. These “near-zero variance.The fraction of unique values over the sample size is low (say 10%). The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say around 20). If both of these criteria are true and the model in question is susceptible to a predictor that can cause imbalance and degenerate distribution.

data(Soybean)

# summary statistics
summary(Soybean)

##                  Class          date     plant.stand  precip      temp    
##  brown-spot         : 92   5      :149   0   :354    0   : 74   0   : 80  
##  alternarialeaf-spot: 91   4      :131   1   :293    1   :112   1   :374  
##  frog-eye-leaf-spot : 91   3      :118   NA's: 36    2   :459   2   :199  
##  phytophthora-rot   : 88   2      : 93               NA's: 38   NA's: 30  
##  anthracnose        : 44   6      : 90                                    
##  brown-stem-rot     : 44   (Other):101                                    
##  (Other)            :233   NA's   :  1                                    
##    hail     crop.hist  area.dam    sever     seed.tmt     germ     plant.growth
##  0   :435   0   : 65   0   :123   0   :195   0   :305   0   :165   0   :441    
##  1   :127   1   :165   1   :227   1   :322   1   :222   1   :213   1   :226    
##  NA's:121   2   :219   2   :145   2   : 45   2   : 35   2   :193   NA's: 16    
##             3   :218   3   :187   NA's:121   NA's:121   NA's:112               
##             NA's: 16   NA's:  1                                                
##                                                                                
##                                                                                
##  leaves  leaf.halo  leaf.marg  leaf.size  leaf.shread leaf.malf  leaf.mild 
##  0: 77   0   :221   0   :357   0   : 51   0   :487    0   :554   0   :535  
##  1:606   1   : 36   1   : 21   1   :327   1   : 96    1   : 45   1   : 20  
##          2   :342   2   :221   2   :221   NA's:100    NA's: 84   2   : 20  
##          NA's: 84   NA's: 84   NA's: 84                          NA's:108  
##                                                                            
##                                                                            
##                                                                            
##    stem     lodging    stem.cankers canker.lesion fruiting.bodies ext.decay 
##  0   :296   0   :520   0   :379     0   :320      0   :473        0   :497  
##  1   :371   1   : 42   1   : 39     1   : 83      1   :104        1   :135  
##  NA's: 16   NA's:121   2   : 36     2   :177      NA's:106        2   : 13  
##                        3   :191     3   : 65                      NA's: 38  
##                        NA's: 38     NA's: 38                                
##                                                                             
##                                                                             
##  mycelium   int.discolor sclerotia  fruit.pods fruit.spots   seed    
##  0   :639   0   :581     0   :625   0   :407   0   :345    0   :476  
##  1   :  6   1   : 44     1   : 20   1   :130   1   : 75    1   :115  
##  NA's: 38   2   : 20     NA's: 38   2   : 14   2   : 57    NA's: 92  
##             NA's: 38                3   : 48   4   :100              
##                                     NA's: 84   NA's:106              
##                                                                      
##                                                                      
##  mold.growth seed.discolor seed.size  shriveling  roots    
##  0   :524    0   :513      0   :532   0   :539   0   :551  
##  1   : 67    1   : 64      1   : 59   1   : 38   1   : 86  
##  NA's: 92    NA's:106      NA's: 92   NA's:106   2   : 15  
##                                                  NA's: 31  
##                                                            
##                                                            
##

Soybean %>%
  ggplot(aes(x=Class)) + 
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90))

columns <- colnames(Soybean)

#Example histograms
lapply('sever',
  function(col) {
    ggplot(Soybean, 
           aes_string(col)) + geom_bar() + coord_flip() + ggtitle(col)})

## [[1]]

lapply('mycelium',
  function(col) {
    ggplot(Soybean, 
           aes_string(col)) + geom_bar() + coord_flip() + ggtitle(col)})

## [[1]]

3.2b

Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

ANSWER 1 :

There are many variables with NA’S and missing which could be structurally missing.Missing can also be related to the outcome. For instance, customer rating values can be called “informative missingness” since only strong opinions push them to rate a product or movie for instance. Such type of missing info can bias a bodel.

ANSWER 2:

We can see that few classes represent all of the missing data. These classes are 2-4-d injury, cyst-nematode,diaporthe-pod-&-stem-blight, herbicide-injury,phytopythora-rot. Looking at the count of the classes and the class frequency, these class type of soybean plant are less frequent types of classes that have missingness in data.

vis_miss(Soybean)

Soybean %>% 
  dplyr::select(-Class) %>% 
  gg_miss_var()

gg_miss_upset(Soybean, 
              nsets = 15,
              nintersects = NA)

Soybean %>% 
  gather(variable, value) %>%
  filter(is.na(value)) %>%
  group_by(variable) %>%
  tally() %>%
  mutate(percent = n / nrow(df) * 100) %>%
  mutate(percent = paste0(round(percent, ifelse(percent < 10, 1, 0)), "%")) %>%
  arrange(desc(n)) %>%
  rename(`Variable Missing Data` = variable,
         `Number of Records` = n,
         `Share of Total` = percent) %>%
  kable() %>%
  kable_styling()

Variable Missing Data	Number of Records	Share of Total
hail	121	57%
lodging	121	57%
seed.tmt	121	57%
sever	121	57%
germ	112	52%
leaf.mild	108	50%
fruit.spots	106	50%
fruiting.bodies	106	50%
seed.discolor	106	50%
shriveling	106	50%
leaf.shread	100	47%
mold.growth	92	43%
seed	92	43%
seed.size	92	43%
fruit.pods	84	39%
leaf.halo	84	39%
leaf.malf	84	39%
leaf.marg	84	39%
leaf.size	84	39%
canker.lesion	38	18%
ext.decay	38	18%
int.discolor	38	18%
mycelium	38	18%
precip	38	18%
sclerotia	38	18%
stem.cankers	38	18%
plant.stand	36	17%
roots	31	14%
temp	30	14%
crop.hist	16	7.5%
plant.growth	16	7.5%
stem	16	7.5%
area.dam	1	0.5%
date	1	0.5%

missing <- Soybean %>%
  group_by(Class) %>%
  miss_var_summary() 


ggplot(missing, aes(Class, variable,fill=n_miss)) + 
  geom_tile() +
  theme(axis.text.x = element_text(angle = 90)) + coord_flip()

3.2C

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

ANSWER BELOW:

The book talks about using KNN imputation. Here, I’ve used mice imputation and repeated the visualizations after imputations to show some of the density plots.”MICE stands for Multivariate Imputation by Chained Equations. It is a statistical method used to handle missing data in a dataset. The process uses multiple imputation techniques to fill in the missing data and then combines the results from multiple imputations to produce a final imputed dataset. The goal of MICE is to preserve the relationship between variables in the original data and to reduce the bias introduced by imputing missing values.” It looks like after imputation the NA’s have been transferred to 0 or 1 for Hail for instance.

Imputation can also be performed by looking at closely correlated values and imputed similar values for the correlated variables as mentioned in the book. However, we would have to be careful with correlated values, as these can cause unstable models, numerical errors and inaccurate predictive performance due to multicollinearity.

We can remove predictors as well and reduce predictors. If we opted to go down this path, we could remove variables such as MYCELIUM, Roots etc.. that may take on only 1 unique value, additionally we could look at the 5 classes mentioned above that have a large number of missing info which may be meaningful in its missing data. We can look at that pattern to see any predictors can be dropped.

set.seed(12345)

# Perform Multiple Imputation
imputed_data_df <- mice(Soybean, m=5, method='pmm', print=FALSE)
completedData <- complete(imputed_data_df,1)
summary(completedData) %>%
    kable() %>%
    kable_styling()

Class	date	plant.stand	precip	temp	hail	crop.hist	area.dam	sever	seed.tmt	germ	plant.growth	leaves	leaf.halo	leaf.marg	leaf.size	leaf.shread	leaf.malf	leaf.mild	stem	lodging	stem.cankers	canker.lesion	fruiting.bodies	ext.decay	mycelium	int.discolor	sclerotia	fruit.pods	fruit.spots	seed	mold.growth	seed.discolor	seed.size	shriveling	roots
brown-spot : 92	0: 26	0:390	0: 82	0: 80	0:444	0: 65	0:124	0:315	0:410	0:169	0:450	0: 77	0:276	0:386	0: 51	0:562	0:638	0:643	0:312	0:520	0:409	0:350	0:489	0:508	0:674	0:619	0:625	0:423	0:348	0:499	0:616	0:535	0:556	0:559	0:551
alternarialeaf-spot: 91	1: 76	1:293	1:112	1:382	1:239	1:166	1:227	1:323	1:228	1:219	1:233	1:606	1: 36	1: 21	1:351	1:121	1: 45	1: 20	1:371	1:163	1: 39	1: 83	1:194	1:161	1: 9	1: 44	1: 58	1:130	1: 86	1:184	1: 67	1:148	1:127	1:124	1: 86
frog-eye-leaf-spot : 91	2: 93	NA	2:489	2:221	NA	2:219	2:145	2: 45	2: 45	2:295	NA	NA	2:371	2:276	2:281	NA	NA	2: 20	NA	NA	2: 36	2:177	NA	2: 14	NA	2: 20	NA	2: 14	2: 58	NA	NA	NA	NA	NA	2: 46
phytophthora-rot : 88	3:118	NA	NA	NA	NA	3:233	3:187	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	3:199	3: 73	NA	NA	NA	NA	NA	3:116	4:191	NA	NA	NA	NA	NA	NA
anthracnose : 44	4:131	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
brown-stem-rot : 44	5:149	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
(Other) :233	6: 90	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

#Analysis of Missing values on the imputed dataset
missing2 <- colSums(completedData %>% sapply(is.na))
missing_pct2 <- round(missing2 / nrow(completedData) * 100, 2)
stack(sort(missing_pct2, decreasing = TRUE))

summary(completedData)

##                  Class     date    plant.stand precip  temp    hail   
##  brown-spot         : 92   0: 26   0:390       0: 82   0: 80   0:444  
##  alternarialeaf-spot: 91   1: 76   1:293       1:112   1:382   1:239  
##  frog-eye-leaf-spot : 91   2: 93               2:489   2:221          
##  phytophthora-rot   : 88   3:118                                      
##  anthracnose        : 44   4:131                                      
##  brown-stem-rot     : 44   5:149                                      
##  (Other)            :233   6: 90                                      
##  crop.hist area.dam sever   seed.tmt germ    plant.growth leaves  leaf.halo
##  0: 65     0:124    0:315   0:410    0:169   0:450        0: 77   0:276    
##  1:166     1:227    1:323   1:228    1:219   1:233        1:606   1: 36    
##  2:219     2:145    2: 45   2: 45    2:295                        2:371    
##  3:233     3:187                                                           
##                                                                            
##                                                                            
##                                                                            
##  leaf.marg leaf.size leaf.shread leaf.malf leaf.mild stem    lodging
##  0:386     0: 51     0:562       0:638     0:643     0:312   0:520  
##  1: 21     1:351     1:121       1: 45     1: 20     1:371   1:163  
##  2:276     2:281                           2: 20                    
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##  stem.cankers canker.lesion fruiting.bodies ext.decay mycelium int.discolor
##  0:409        0:350         0:489           0:508     0:674    0:619       
##  1: 39        1: 83         1:194           1:161     1:  9    1: 44       
##  2: 36        2:177                         2: 14              2: 20       
##  3:199        3: 73                                                        
##                                                                            
##                                                                            
##                                                                            
##  sclerotia fruit.pods fruit.spots seed    mold.growth seed.discolor seed.size
##  0:625     0:423      0:348       0:499   0:616       0:535         0:556    
##  1: 58     1:130      1: 86       1:184   1: 67       1:148         1:127    
##            2: 14      2: 58                                                  
##            3:116      4:191                                                  
##                                                                              
##                                                                              
##                                                                              
##  shriveling roots  
##  0:559      0:551  
##  1:124      1: 86  
##             2: 46  
##                    
##                    
##                    
##

vis_miss(completedData)

completedData %>% 
  dplyr::select(-Class) %>% 
  gg_miss_var()

# Density plots 
ggplot(Soybean, aes(x=sever, fill="Original")) +
  geom_density(alpha=0.5) +
  geom_density(data=completedData, aes(x=sever, fill="imputed"), alpha=0.5) +
  labs(title="Density Plot of Sever: Original vs. Imputed")

# Density plots 
ggplot(Soybean, aes(x=fruit.spots, fill="Original")) +
  geom_density(alpha=0.5) +
  geom_density(data=completedData, aes(x=fruit.spots, fill="imputed"), alpha=0.5) +
  labs(title="Density Plot of leaf.shread: Original vs. Imputed")

columns <- colnames(Soybean)
#After MICE IMPUTATION let's take a look at the histograms
lapply('sever',
  function(col) {
    ggplot(completedData, 
           aes_string(col)) + geom_bar() + coord_flip() + ggtitle(col)})

## [[1]]

lapply('mycelium',
  function(col) {
    ggplot(completedData, 
           aes_string(col)) + geom_bar() + coord_flip() + ggtitle(col)})

## [[1]]

DATA 624 HOMEWORK4

Banu Boopalan

2024-09-28

3.1 a

ANSWER:

3.1b

ANSWER:

3.1c

ANSWER:

3.2a

ANSWER below:

3.2b

ANSWER 1 :

ANSWER 2:

3.2C

ANSWER BELOW: