The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
From an initial look at the data we see that there is one target variable (Type) with 6 observed levels (one of the seven glass categories has no samples in the data) and 9 numeric predictor variables.
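For example, a quick frequency table of the target confirms the six observed levels and their counts:
# Class frequencies for the target variable
table(Glass$Type)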
library(dplyr)                  # for %>% and select()
library(PerformanceAnalytics)   # for chart.Correlation()
glass <- Glass %>% select(-Type)  # keep only the nine numeric predictors
chart.Correlation(glass, bg = c("blue", "red", "yellow"), pch = 21)
library(tidyr); library(ggplot2)
# Convert to long format and encode the predictor name as a factor
m <- Glass %>%
  pivot_longer(-Type, names_to = "Predictor", values_to = "Value", values_drop_na = TRUE) %>%
  mutate(Predictor = as.factor(Predictor)) %>% arrange(Predictor)
# Explore the distribution of each predictor
m %>%
  ggplot(aes(Value, fill = Predictor)) + geom_histogram(bins = 20) +
  facet_wrap(~Predictor, scales = "free") +
  labs(title = "The Distribution of the Predictors") + theme_minimal()
Of all the predictors, Si appears to have the most normal-looking distribution, in addition to being present at the highest concentration of any element. This seems rational, since silicon is the fundamental ingredient of glass. RI, Na, Al, and Ca also look roughly symmetric, although apart from Si I would not call them truly normal. In addition to the quantity of silicon, we can assume that the refractive index (RI) should be an important factor. In general, most of the element distributions are right skewed, meaning that only very small trace amounts of the element are usually present, with notable exceptions. In the correlation chart above, the color indicates the strength and polarity of each pairwise correlation; for example, Mg and Al have a strong negative correlation.
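To put a number on this skewness, one option is to compute the sample skewness of each predictor, for example with the skewness() function from the e1071 package (a quick sketch, assuming that package is installed):
library(e1071)
# Sample skewness per predictor; values well above zero indicate right skew
sapply(glass, skewness)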
library(corrplot)
corrplot(cor(Glass[, 1:9]), method = 'square')  # correlation matrix of the nine predictors
We can see that the variables differ quite a bit. Some are closer to normally distributed, such as Na and Al, while others do not look normal at all, such as Ba, Fe, and K. The correlation plot summarizes the relationship between each pair of variables: there are some clear positive relationships, such as RI with Ca and Al with Ba, as well as some clear negative ones, for example RI with Si, RI with Al, and Mg with Ba.
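As a rough way to confirm which pairs stand out, a small sketch like the following lists the pairwise correlations sorted by absolute value:
cors <- cor(Glass[, 1:9])
cors[lower.tri(cors, diag = TRUE)] <- NA       # keep each pair only once
cor_pairs <- as.data.frame(as.table(cors))     # long format: Var1, Var2, Freq
cor_pairs <- cor_pairs[!is.na(cor_pairs$Freq), ]
head(cor_pairs[order(-abs(cor_pairs$Freq)), ], 5)  # five strongest correlations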
We conclude that there are outliers in the data. K, Fe, and Ba contain a lot of zeros, which leaves their distributions heavily skewed to the right: the left side is bounded at 0 while outliers sit far out on the right tail. K has a very obvious outlier, Ba has outliers above 2.0, and Fe has an outlier above 0.5. Most of the other variables, including RI, Na, Al, Si, and Ca, have peaks in the center of their distributions and appear closer to normal, although RI, Al, Ca, Ba, and Fe still show plenty of outliers. The correlation table also suggests that most of the variables are only weakly related to each other. I would expect outliers here because of impurities introduced in the glass manufacturing process. In addition, other than Si (silicon, the main ingredient in glass), most of the element distributions are right skewed, meaning very small trace amounts of the element are normally present.
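A simple sketch for counting potential outliers per predictor, using the usual 1.5 × IQR boxplot rule:
# Count values falling outside 1.5 * IQR beyond the quartiles for each predictor
sapply(glass, function(x) {
  q <- quantile(x, c(0.25, 0.75))
  sum(x < q[1] - 1.5 * IQR(x) | x > q[2] + 1.5 * IQR(x))
})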
m %>%
  ggplot(aes(x = Type, y = Value, color = Predictor)) +
  geom_jitter() +
  ylim(0, 20) +                          # zooms in on the lower range; Si values (~70-75) are dropped
  scale_color_brewer(palette = "Set3") + # Set3 has enough colors for all nine predictors
  theme_dark()
In my opinion, the relevant transformations to consider are the Box-Cox or log transformations, which might improve a classification model by reducing the skewness seen above. Removing outliers is another choice that can improve performance, and centering and scaling the predictors can be important for any model. Finally, it is worth checking whether any columns contain missing values, since those would need to be addressed by removal, imputation, or other means.
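As a sketch of how these steps could be combined, assuming the caret package is available (a Yeo-Johnson transformation is used here rather than Box-Cox, because several predictors contain zeros):
library(caret)
# Yeo-Johnson (a Box-Cox-style transformation that tolerates zeros), then center and scale
pp <- preProcess(glass, method = c("YeoJohnson", "center", "scale"))
glass_trans <- predict(pp, glass)
summary(glass_trans)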
The Soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
data(Soybean)
?Soybean
summary(Soybean)
## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ plant.growth
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild
## 0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :535
## 1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 : 20
## 2 :342 2 :221 2 :221 NA's:100 NA's: 84 2 : 20
## NA's: 84 NA's: 84 NA's: 84 NA's:108
##
##
##
## stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 0 :296 0 :520 0 :379 0 :320 0 :473 0 :497
## 1 :371 1 : 42 1 : 39 1 : 83 1 :104 1 :135
## NA's: 16 NA's:121 2 : 36 2 :177 NA's:106 2 : 13
## 3 :191 3 : 65 NA's: 38
## NA's: 38 NA's: 38
##
##
## mycelium int.discolor sclerotia fruit.pods fruit.spots seed
## 0 :639 0 :581 0 :625 0 :407 0 :345 0 :476
## 1 : 6 1 : 44 1 : 20 1 :130 1 : 75 1 :115
## NA's: 38 2 : 20 NA's: 38 2 : 14 2 : 57 NA's: 92
## NA's: 38 3 : 48 4 :100
## NA's: 84 NA's:106
##
##
## mold.growth seed.discolor seed.size shriveling roots
## 0 :524 0 :513 0 :532 0 :539 0 :551
## 1 : 67 1 : 64 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 NA's: 92 NA's:106 2 : 15
## NA's: 31
##
##
##
Soybean %>% gather() %>%   # reshape to long format: one row per (variable, value) pair
  ggplot(aes(value)) + facet_wrap(~key, scales = "free") + geom_bar()
The plot above shows the distribution of data points across the categorical variables. Keep in mind it also shows that missing values exist in almost all of the variables. mycelium, sclerotia, and leaf.mild are strongly imbalanced, with nearly all observations falling in a single level, so it might be favorable to remove these variables from the model. int.discolor is a borderline case, with arguments both for keeping and removing it; given that, the variable is kept unless there is another indication that it is affecting the model.
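One way to flag such degenerate predictors is caret's nearZeroVar(); a minimal sketch, assuming caret is installed:
library(caret)
# Flag predictors with near-zero variance (one dominant level, few unique values)
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv[nzv$nzv, ]   # the columns identified as near-zero variance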
# Bar plot of the level counts for each predictor (column 1 is the Class label, so start at 2)
par(mfrow = c(3, 3))
for (i in 2:ncol(Soybean)) {
  plot(Soybean[i], main = colnames(Soybean[i]))
}
library(knitr)  # for kable()
# Count missing values per column, sorted from most to least
sorted <- order(-colSums(is.na(Soybean)))
kable(colSums(is.na(Soybean))[sorted])
| Variable | NA count |
|---|---|
| hail | 121 |
| sever | 121 |
| seed.tmt | 121 |
| lodging | 121 |
| germ | 112 |
| leaf.mild | 108 |
| fruiting.bodies | 106 |
| fruit.spots | 106 |
| seed.discolor | 106 |
| shriveling | 106 |
| leaf.shread | 100 |
| seed | 92 |
| mold.growth | 92 |
| seed.size | 92 |
| leaf.halo | 84 |
| leaf.marg | 84 |
| leaf.size | 84 |
| leaf.malf | 84 |
| fruit.pods | 84 |
| precip | 38 |
| stem.cankers | 38 |
| canker.lesion | 38 |
| ext.decay | 38 |
| mycelium | 38 |
| int.discolor | 38 |
| sclerotia | 38 |
| plant.stand | 36 |
| roots | 31 |
| temp | 30 |
| crop.hist | 16 |
| plant.growth | 16 |
| stem | 16 |
| date | 1 |
| area.dam | 1 |
| Class | 0 |
| leaves | 0 |
We can see that there are a lot of missing values in the dataset, as shown in the summary and the first plot. The second plot shows whether there is any pattern in the missing values: germ, hail, sever, seed.tmt, and lodging tend to be missing together, so there is a clear pattern of joint missingness among these variables.
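A possible way to visualize that joint-missingness pattern is the aggr() function from the VIM package (the same package used below for kNN imputation); for example:
library(VIM)
# Left panel: proportion missing per variable; right panel: which variables are missing together
aggr(Soybean, numbers = TRUE, sortVars = TRUE, cex.axis = 0.6)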
# Tabulate the number of NAs per column as a data frame for plotting
soybean_missing_counts <- sapply(Soybean, function(x) sum(is.na(x))) %>%
  sort(decreasing = TRUE) %>%
  as.data.frame() %>%
  rename('NA_Count' = '.')
soybean_missing_counts <- soybean_missing_counts %>%
  mutate('Feature' = rownames(soybean_missing_counts))
ggplot(soybean_missing_counts, aes(x = NA_Count, y = reorder(Feature, NA_Count))) +
geom_bar(stat = 'identity', fill = 'blue') +
labs(title = 'Soybean Missing Counts') +
theme(plot.title = element_text(hjust = 0.5))
As you can see, the graphs above are very helpful in indicating how much missing data the Soybean set contains. The first plot highlights that lodging, hail, sever, and seed.tmt are each missing for nearly 18% of the observations. The second plot shows the pattern of the missing data as it relates to the other variables: about 82% of the rows are complete, and the Class and leaves variables are never missing. There are quite a few distinct missingness patterns, but their overall proportion is not extreme. In addition, the first set of variables, from hail to fruit.pods, accounts for about 8% of the missing data in the cases where the other variables are complete (note this does not indicate within-variable missingness). In conclusion, for some imputation methods, such as certain types of multiple imputation, having fewer missingness patterns is helpful, as it requires fitting fewer models.
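The share of complete rows can be checked directly, for example:
mean(complete.cases(Soybean))  # proportion of rows with no missing values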
library(VIM)  # for kNN()
Soybean2 <- Soybean[3:36]  # predictors only (Class and date removed), kept for reference
# kNN imputation on the full data set; the *_imp columns flag which values were imputed
knn_method <- kNN(Soybean, k = 5)
colSums(is.na(knn_method))
## Class date plant.stand precip
## 0 0 0 0
## temp hail crop.hist area.dam
## 0 0 0 0
## sever seed.tmt germ plant.growth
## 0 0 0 0
## leaves leaf.halo leaf.marg leaf.size
## 0 0 0 0
## leaf.shread leaf.malf leaf.mild stem
## 0 0 0 0
## lodging stem.cankers canker.lesion fruiting.bodies
## 0 0 0 0
## ext.decay mycelium int.discolor sclerotia
## 0 0 0 0
## fruit.pods fruit.spots seed mold.growth
## 0 0 0 0
## seed.discolor seed.size shriveling roots
## 0 0 0 0
## Class_imp date_imp plant.stand_imp precip_imp
## 0 0 0 0
## temp_imp hail_imp crop.hist_imp area.dam_imp
## 0 0 0 0
## sever_imp seed.tmt_imp germ_imp plant.growth_imp
## 0 0 0 0
## leaves_imp leaf.halo_imp leaf.marg_imp leaf.size_imp
## 0 0 0 0
## leaf.shread_imp leaf.malf_imp leaf.mild_imp stem_imp
## 0 0 0 0
## lodging_imp stem.cankers_imp canker.lesion_imp fruiting.bodies_imp
## 0 0 0 0
## ext.decay_imp mycelium_imp int.discolor_imp sclerotia_imp
## 0 0 0 0
## fruit.pods_imp fruit.spots_imp seed_imp mold.growth_imp
## 0 0 0 0
## seed.discolor_imp seed.size_imp shriveling_imp roots_imp
## 0 0 0 0
One option to consider for predictors that are entirely NA within a whole class is to create a dummy variable indicating whether the predictor was filled in, or to remove the predictor entirely. Simply filling in such values may be an issue, because the missingness is likely related to how the data were collected and may not hold up over time. On the other hand, for predictors that have some data within a class, I would impute the most common value of that predictor for the given class. Many sources suggest that the wisest strategy is to start by checking the correlation between variables; here, due to the high percentage of missing values, we were not able to get reliable correlations between the variables. If there were a strong correlation between two predictors, we would remove the one with the higher percentage of missing values. As a rule of thumb, predictors with more than about 5% missing values are often suggested for removal, since with more missing values the predictor might not provide reliable information to the model. In the end we used k-nearest neighbours (kNN) to impute the missing values in our dataset.
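As a sketch of how the class-related missingness could be checked (the object name na_by_class is illustrative), the proportion of NAs per predictor can be computed within each class:
library(dplyr)
# Proportion of missing values for each predictor, broken out by Class
na_by_class <- Soybean %>%
  group_by(Class) %>%
  summarise(across(everything(), ~ mean(is.na(.x))))
# Classes in which at least one predictor is 100% missing
na_by_class %>% filter(if_any(-Class, ~ .x == 1))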