The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and the percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
#install.packages('mlbench')
library(mlbench)
data(Glass)
head(Glass)
## RI Na Mg Al Si K Ca Ba Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
library('ggcorrplot')
## Loading required package: ggplot2
cols <- colnames(Glass)
par(mfrow = c(3, 3))
for (i in 1:9) {
  # histogram of the i-th predictor
  hist(Glass[, i],
       breaks = seq(min(Glass[, i]), max(Glass[, i]), length = 22),
       prob = TRUE, col = "lightgray", main = cols[i])
  # overlay a smoothed density estimate
  lines(density(Glass[, i], adjust = 3), col = "blue")
}
corr <- cor(Glass[, 1:9])
ggcorrplot(corr, type = "lower", lab = TRUE)
Above are the histograms of each predictor variable (RI, Na, Mg, Al, Si, K, Ca, Ba, Fe). Looking at these visuals, we see that not all of the predictors are normally distributed.
RI, Na, Al, and Si show slight skewness, while Ca has a heavy right skew. The remaining variables (Mg, K, Ba, Fe) are clearly not normally distributed.
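To quantify the skewness seen in the histograms, here is a minimal sketch (assuming the e1071 package is installed) that computes the sample skewness of each predictor:

library(e1071)
# sample skewness of each predictor; values far from 0 indicate skewed distributions
apply(Glass[, 1:9], 2, skewness)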
par(mfrow = c(3, 3))
for (i in 1:9) {
  boxplot(Glass[, i], main = cols[i])
}
The box plots show that there are outliers in all predictors except Mg; the counts below back this up.
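A quick way to confirm this is to count, per predictor, the points falling outside the 1.5 * IQR whiskers that boxplot() uses (a minimal sketch with base R's boxplot.stats):

# number of points flagged as outliers by the 1.5 * IQR rule
sapply(Glass[, 1:9], function(x) length(boxplot.stats(x)$out))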
We will apply the Box-Cox transformation and see whether the skewness is removed from any of the distributions.
library(caret)
## Loading required package: lattice
# estimate Box-Cox, centering and scaling transformations on the nine predictors (column 10 is the class)
bxCx <- preProcess(Glass[-10], method = c('BoxCox', 'center', 'scale'))
# apply the estimated transformations
Glass_bxCx <- predict(bxCx, Glass[-10])
par(mfrow = c(3, 3))
for (i in 1:9) {
  boxplot(Glass_bxCx[, i], main = cols[i])
}
par(mfrow = c(3, 3))
for (i in 1:9) {
  # histogram of the transformed predictor
  hist(Glass_bxCx[, i],
       breaks = seq(min(Glass_bxCx[, i]), max(Glass_bxCx[, i]), length = 22),
       prob = TRUE, col = "lightgray", main = cols[i])
  # overlay a smoothed density estimate
  lines(density(Glass_bxCx[, i], adjust = 3), col = "blue")
}
We can see that the skewness is reduced for Na, Si, and Ca, and their distributions are now approximately normal.
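As a sanity check, the skewness values can be recomputed on the transformed predictors and set beside the originals (this reuses the skewness function from e1071 loaded above):

# skewness before and after the Box-Cox transformation
round(data.frame(before = apply(Glass[, 1:9], 2, skewness),
                 after  = apply(Glass_bxCx, 2, skewness)), 2)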
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
library(mlbench)
data(Soybean)
cols <- colnames(Soybean)
par(mfrow = c(3, 3))
# columns 2 to 36 hold the 35 categorical predictors (column 1 is the class)
for (i in 2:36) {
  # frequency distribution of each categorical predictor
  hist(as.numeric(Soybean[, i]), main = cols[i])
}
Yes, there are many predictors that take on only a few distinct values and have low variance. Among them are mycelium, sclerotia, lodging, stem, and leaf.malf.
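This can be confirmed with caret's nearZeroVar function; a minimal sketch (caret is already loaded from the previous exercise):

# list the predictors flagged as having near-zero variance
nzv <- nearZeroVar(Soybean, saveMetrics = TRUE)
rownames(nzv[nzv$nzv, ])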
The easiest solution is to drop the records with NA values from the dataset. This is effective only when very few records have missing values; if many do, dropping them degrades the data and hurts model performance.
If a particular column has a large percentage of missing values, it is reasonable to drop that column from the dataset. Sometimes we cannot drop rows or columns; in those scenarios we use imputers. The per-column missing percentages are sketched after the code below.
library(knitr)
data_rs1 <- na.omit(Soybean)  # drop every row that contains a missing value
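Before choosing between dropping and imputing, it helps to see how much each predictor is actually missing; a minimal sketch of the per-column missing percentages:

# percentage of missing values per column, highest first
miss_pct <- sort(colMeans(is.na(Soybean)) * 100, decreasing = TRUE)
round(head(miss_pct, 10), 1)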
Some time back, I created an R notebook explaining how to handle missing values with different types of imputers. It covers mean imputation and regression imputation (both deterministic and stochastic regression imputation). Of these, stochastic regression imputation does the better job of imputing missing values.
Please refer to https://rpubs.com/charlsjoseph/missing_values
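For completeness, here is a minimal imputation sketch using the mice package (an assumption; mice is not used elsewhere in this notebook). By default mice picks a suitable model for each factor predictor and imputes stochastically, so it covers the stochastic case discussed above; it may take a little while on this data.

# install.packages('mice')
library(mice)
imp <- mice(Soybean, m = 1, seed = 42, printFlag = FALSE)  # single stochastic imputation
Soybean_imputed <- complete(imp, 1)
sum(is.na(Soybean_imputed))  # 0 if every missing value was filled in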