
Exercise 3.2: Soybean Data

3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

First, let's plot the frequency distribution for our target (Class) and predictor variables:
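The plots come from a chunk along these lines; this is a minimal sketch, assuming the Soybean data shipped with the mlbench package and using tidyr (>= 1.2, for the single-function values_transform) and ggplot2 for reshaping and faceted bar charts, rather than whatever plotting helper was used originally:

library(mlbench)
library(tidyr)
library(ggplot2)

data(Soybean)

# Reshape to long format so every variable can be faceted into one figure;
# values_transform coerces the differing factor levels to a common type
soy_long <- pivot_longer(Soybean, cols = everything(),
                         names_to = "variable", values_to = "level",
                         values_transform = as.character)

# One bar chart of level frequencies per variable (NA appears as its own bar)
ggplot(soy_long, aes(x = level)) +
  geom_bar() +
  facet_wrap(~ variable, scales = "free")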

As the plots show, the frequencies across the levels of the categorical variables are unbalanced in almost half of the predictors. Some predictors, such as "date", "germ" and "crop.hist", have observations that would resemble a normal distribution if there were more factor levels.

Other predictors, such as "sclerotia", have only two levels, and those levels are highly unbalanced. These are examples of degenerate distributions: constant or almost-constant predictors whose frequency distributions have zero or near-zero variance, respectively. For example, consider the "sclerotia" predictor.
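A one-line sketch of how its level frequencies can be tabulated (assuming the Soybean data loaded above):

# Counts per level; missing values are excluded by default
table(Soybean$sclerotia)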

## 
##   0   1 
## 625  20

Only 20 of the 645 non-missing observations belong to the level "1"; the remaining 96.9% belong to the level "0".

We can use the nearZeroVar function from the caret package to identify predictors that "have both 1) few unique values relative to the number of samples and 2) large ratio of the frequency of the most common value to the frequency of the second most common value (near-zero variance predictors)".
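A sketch of the call that produces the metrics below, filtered to the flagged predictors:

library(caret)

# saveMetrics = TRUE returns the diagnostics for every predictor
# instead of just the indices of the offending columns
nzv_metrics <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv_metrics[nzv_metrics$nzv, ]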

##           freqRatio percentUnique zeroVar  nzv
## leaf.mild     26.75     0.4392387   FALSE TRUE
## mycelium     106.50     0.2928258   FALSE TRUE
## sclerotia     31.25     0.2928258   FALSE TRUE

Three of the predictors, "leaf.mild", "mycelium" and "sclerotia", qualify as having near-zero variance (nzv). If we wanted to simplify the model by reducing the number of predictor variables, these nzv predictors are natural candidates for removal, as they would not add significantly to the specificity of the model.
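Dropping them can be sketched with the index form of the same function:

# Without saveMetrics, nearZeroVar returns the column indices to remove
soy_filtered <- Soybean[ , -nearZeroVar(Soybean)]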


(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

With 35 predictors, we have a wealth of information for generating a predictive model, but using all of them would lead to an overly complicated one. We therefore need to select the features (predictor variables) with the most predictive potential. How do we choose which variables to keep?

First, we can mark for elimination the variables with zero or near-zero variance that we identified earlier.

We can also eliminate variables that show collinearity. We can see in the correlation matrix plot above that some variables are strongly correlated with each other; for example, "leaf.marg" is strongly negatively correlated with "leaf.halo".
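One way such a correlation matrix can be computed is sketched below. The predictors are factors, so a rough approach (it is only an assumption that the plot above was built this way) is to treat the integer level codes as ordinal scores and correlate them pairwise; corrplot is likewise an assumed choice for the visualization:

library(corrplot)

# Rough ordinal encoding: replace each factor with its integer level codes
soy_num <- sapply(Soybean[ , -1], as.numeric)  # drop the Class outcome

# Pairwise-complete correlations tolerate the many missing values
corr_mat <- cor(soy_num, use = "pairwise.complete.obs")
corrplot(corr_mat, tl.cex = 0.6)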

After we have simplified our set of predictors, we need to deal with the missing observations.
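The per-predictor percentages of missing values shown below can be computed with a one-liner:

# Percentage of missing observations in each column
colMeans(is.na(Soybean)) * 100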

##           Class            date     plant.stand          precip 
##       0.0000000       0.1464129       5.2708638       5.5636896 
##            temp            hail       crop.hist        area.dam 
##       4.3923865      17.7159590       2.3426061       0.1464129 
##           sever        seed.tmt            germ    plant.growth 
##      17.7159590      17.7159590      16.3982430       2.3426061 
##          leaves       leaf.halo       leaf.marg       leaf.size 
##       0.0000000      12.2986823      12.2986823      12.2986823 
##     leaf.shread       leaf.malf       leaf.mild            stem 
##      14.6412884      12.2986823      15.8125915       2.3426061 
##         lodging    stem.cankers   canker.lesion fruiting.bodies 
##      17.7159590       5.5636896       5.5636896      15.5197657 
##       ext.decay        mycelium    int.discolor       sclerotia 
##       5.5636896       5.5636896       5.5636896       5.5636896 
##      fruit.pods     fruit.spots            seed     mold.growth 
##      12.2986823      15.5197657      13.4699854      13.4699854 
##   seed.discolor       seed.size      shriveling           roots 
##      15.5197657      13.4699854      15.5197657       4.5387994

We can see above that some predictors (e.g., "hail", "lodging", and others) have more than 15% of their observations missing. We could target these variables for elimination. The remaining predictors can have their missing values filled in using one of the imputation techniques.

In summary, the strategy is first to reduce the number of variables and then to deal with the missing data, either by eliminating the variables with more than 15% of observations missing or by imputing the rest. The 15% cutoff is arbitrary, but I believe it is a viable approach given the large number of predictor variables.
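A sketch of that strategy, with the mice package as an assumed imputation tool (its defaults fit logistic/polytomous regression models for factor columns):

library(mice)

# Drop predictors with more than 15% of observations missing
missing_pct <- colMeans(is.na(Soybean)) * 100
soy_reduced <- Soybean[ , missing_pct <= 15]

# Impute the rest; mice chooses a method per column type by default
imputed <- mice(soy_reduced, m = 5, seed = 42, printFlag = FALSE)
soy_complete <- complete(imputed)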

Reference: "Near-zero variance predictors. Should we remove them?" R-bloggers, https://www.r-bloggers.com/2014/03/near-zero-variance-predictors-should-we-remove-them/