3.1.

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

#install.packages('mlbench')
library(mlbench)
data(Glass)
head(Glass)
##        RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1

3.1.a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

library('ggcorrplot')
## Loading required package: ggplot2
cols <- colnames(Glass)

par(mfrow = c(3,3))
for(i in 1:9) {
  # histogram of the i-th predictor
  hist(Glass[, i],
       breaks = seq(min(Glass[, i]), max(Glass[, i]), length = 22),
       prob = TRUE, col = "lightgray", main = cols[i])
  # overlay a smoothed density estimate
  lines(density(Glass[, i], adjust = 3), col = "blue")
}

# pairwise correlations between the nine predictors
corr <- cor(Glass[, 1:9])

ggcorrplot(corr, type = "lower", lab = TRUE)

Above are the histograms of each predictor variable (RI, Na, Mg, Al, Si, K, Ca, Ba, Fe). Looking at these visuals, we see that not all predictor variables are normally distributed.

RI, Na, Al, and Si show only slight skewness, while Ca has a heavy right skew. The rest of the variables are not normally distributed.
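
To back this up numerically, the sample skewness of each predictor can be computed; a minimal sketch, assuming the e1071 package is available for its skewness() function:

library(e1071)
# sample skewness of each predictor; values near 0 suggest symmetry,
# large positive values indicate a long right tail
round(sapply(Glass[, 1:9], skewness), 2)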

3.1.b) Do there appear to be any outliers in the data? Are any predictors skewed?

par(mfrow = c(3,3))

for(i in 1:9) {
  boxplot(Glass[, i], main = cols[i])
}

The box plots show that there are outliers in all predictors except Mg.
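
As a rough numerical check on the box plots, we can count how many points fall outside the usual 1.5 * IQR whiskers for each predictor (this is only one heuristic definition of an outlier):

# count of points beyond the 1.5 * IQR whiskers for each predictor
sapply(Glass[, 1:9], function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
})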

3.1.c) Are there any relevant transformations of one or more predictors that might improve the classification model?

We will apply the Box-Cox transformation and see whether the skewness is removed from any of the distributions.

library(caret)
## Loading required package: lattice
# estimate Box-Cox, centering and scaling transformations on the nine
# predictors (column 10, Type, is the class label and is excluded)
bxCx <- preProcess(Glass[-10], method = c('BoxCox', 'center', 'scale'))
Glass_bxCx <- predict(bxCx, Glass[-10])
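
Note that the Box-Cox family is only defined for strictly positive values, so preProcess leaves predictors containing zeros (Mg, K, Ba, Fe) untransformed. As a quick illustrative check (not a required step), the estimated lambdas for the strictly positive predictors can be inspected with caret's BoxCoxTrans():

# estimated Box-Cox lambda for the strictly positive predictors
# (predictors containing zeros cannot be Box-Cox transformed)
sapply(Glass[, c("RI", "Na", "Al", "Si", "Ca")],
       function(x) BoxCoxTrans(x)$lambda)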

par(mfrow = c(3, 3))

for(i in 1:9) {
  boxplot(Glass_bxCx[, i], main = cols[i])
}

par(mfrow = c(3,3))
for(i in 1:9) {
  # histogram of the transformed predictor
  hist(Glass_bxCx[, i],
       breaks = seq(min(Glass_bxCx[, i]), max(Glass_bxCx[, i]), length = 22),
       prob = TRUE, col = "lightgray", main = cols[i])
  # overlay a smoothed density estimate
  lines(density(Glass_bxCx[, i], adjust = 3), col = "blue")
}

We can see that the skewness is reduced for Na, Si, and Ca, and their distributions are now closer to normal.
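
To quantify the improvement, the skewness of the transformed predictors can be recomputed and compared with the values obtained before transformation (again assuming e1071 is available):

library(e1071)
# skewness after Box-Cox / centering / scaling; values closer to 0
# indicate more symmetric distributions than before
round(sapply(Glass_bxCx, skewness), 2)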

3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

library(mlbench)
data(Soybean)

3.2.a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

cols <- colnames(Soybean)

par(mfrow = c(3,3))
for(i in 2:36) {
  # bar plot of level frequencies for each categorical predictor
  # (column 1 is the Class outcome; columns 2-36 are the predictors)
  barplot(table(Soybean[, i]), main = cols[i])
}

Yes, there are many predictors that take only a few values and have low variance. Some of them are mycelium, sclerotia, lodging, stem, and leaf.malf.
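
One way to confirm this, rather than judging by eye, is caret's nearZeroVar() function, which flags predictors dominated by a single value with very few distinct levels; a minimal sketch:

library(caret)
# indices (and names) of predictors with degenerate,
# near-zero-variance distributions
nzv <- nearZeroVar(Soybean)
colnames(Soybean)[nzv]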

3.2.c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Dropping the missing records

The easiest solution is to drop the rows containing NA values from the dataset. This is effective only if there are very few missing records; if there are many, we end up discarding too much data and degrading model performance.

If a particular column has a large percentage of missing values, it is reasonable to drop that column from the dataset. Sometimes we cannot drop NA rows or columns; in such scenarios, we use imputation.

library(knitr)

# drop every row that contains at least one missing value
data_rs1 <- na.omit(Soybean)
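
Before deciding between dropping rows, dropping columns, or imputing, it helps to look at how much is actually missing in each predictor; a small sketch:

# fraction of missing values per column, largest first
na_frac <- sort(colMeans(is.na(Soybean)), decreasing = TRUE)
round(head(na_frac, 10), 3)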

Imputers

Some time back, I created an R notebook explaining how to handle missing values with different types of imputers. It covers mean imputation and regression imputation (deterministic regression imputation and stochastic regression imputation). Of these, stochastic regression imputation does a better job of imputing missing values.

Please refer to https://rpubs.com/charlsjoseph/missing_values
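
As a simple baseline (and only a sketch, not the stochastic regression approach recommended above), the categorical Soybean predictors can be imputed with the most frequent level of each column:

# mode imputation for factor columns: replace each NA with the most
# frequent level of that column (impute_mode is a helper defined here,
# not a library function)
impute_mode <- function(x) {
  if (!anyNA(x)) return(x)
  x[is.na(x)] <- names(which.max(table(x)))
  x
}

Soybean_imp <- Soybean
Soybean_imp[, -1] <- lapply(Soybean_imp[, -1], impute_mode)
sum(is.na(Soybean_imp))  # should be 0 after imputation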