Exercise KJ 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

library(mlbench)   # provides the Glass and Soybean data sets
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
Histograms
par(mfrow = c(3, 3))
for (col in 1:(ncol(Glass) - 1)) {   # the nine predictors; the last column is Type
  hist(Glass[, col], main = colnames(Glass)[col])
}

From the histograms, RI, Na, Al, and Si appear approximately normally distributed, while Mg, K, Ca, Ba, and Fe clearly do not.

Corrplots
library(corrplot)
corr_matrix <- cor(Glass[, 1:(ncol(Glass) - 1)])
corrplot(corr_matrix)

There is a very strong positive correlation between RI and Ca, and a weaker negative correlation between RI and Si. We could remove Ca, and possibly Si, from the model to avoid the redundancy caused by these correlations.
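
One way to make this programmatic: caret's findCorrelation() flags the predictors that would be dropped at a given pairwise-correlation cutoff. The sketch below is only illustrative; the 0.75 cutoff is our assumption, not something the exercise prescribes.

library(caret)
# flag predictors whose pairwise correlation exceeds the chosen cutoff (assumed 0.75)
high_corr <- findCorrelation(cor(Glass[, 1:(ncol(Glass) - 1)]), cutoff = 0.75)
colnames(Glass)[high_corr]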

(b) Do there appear to be any outliers in the data? Are any predictors skewed?
library(e1071)   # for skewness()
for (col in 1:(ncol(Glass) - 1)) {
  outlier_values <- boxplot.stats(Glass[, col])$out
  boxplot(Glass[, col], main = colnames(Glass)[col], boxwex = 0.1)
  mtext(paste("Outliers: ", paste(outlier_values, collapse = ", ")), cex = 0.6)
  print(colnames(Glass)[col])
  print("Skewness")
  print(skewness(Glass[, col]))
}

## [1] "RI"
## [1] "Skewness"
## [1] 1.602715

## [1] "Na"
## [1] "Skewness"
## [1] 0.4478343

## [1] "Mg"
## [1] "Skewness"
## [1] -1.136452

## [1] "Al"
## [1] "Skewness"
## [1] 0.8946104

## [1] "Si"
## [1] "Skewness"
## [1] -0.7202392

## [1] "K"
## [1] "Skewness"
## [1] 6.460089

## [1] "Ca"
## [1] "Skewness"
## [1] 2.018446

## [1] "Ba"
## [1] "Skewness"
## [1] 3.36868

## [1] "Fe"
## [1] "Skewness"
## [1] 1.729811

There are outliers in most predictors, the one exception being Mg. The following variables have skewness greater than 1 or less than -1: RI, Mg, K, Ca, Ba, and Fe.
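
Kuhn and Johnson also suggest the spatial sign transformation as a way to blunt the influence of outliers. A minimal sketch of that idea is below; centering and scaling the predictors first, and restricting to the nine numeric columns, are our assumptions about how it would be applied here.

library(caret)
# the spatial sign projects each (centered and scaled) sample onto a unit sphere,
# limiting how far any single outlying sample can sit from the rest
predictors <- scale(Glass[, 1:(ncol(Glass) - 1)])
ss <- spatialSign(predictors)
summary(as.data.frame(ss))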

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

We will use preProcess to apply Box-Cox, centering, scaling, and PCA transformations to the data and see whether that helps the classification model.

library(caret)
trans <- preProcess(Glass, 
  method = c("BoxCox", "center", "scale", "pca"))
trans
## Created from 214 samples and 10 variables
## 
## Pre-processing:
##   - Box-Cox transformation (5)
##   - centered (9)
##   - ignored (1)
##   - principal component signal extraction (9)
##   - scaled (9)
## 
## Lambda estimates for Box-Cox transformation:
## -2, -0.1, 0.5, 2, -1.1
## PCA needed 7 components to capture 95 percent of the variance

Transforming the data and reviewing the histograms of the new predictors (the principal components):

transformed <- predict(trans, Glass)
par(mfrow = c(3, 3))
for (col in 2:ncol(transformed)) {
  hist(transformed[,col], main = colnames(transformed[col]))
}

The data appears to be more normally distributed, less skewed, and centered around 0 as a result of the transformations.
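
As a rough numeric check of that impression, the skewness of each transformed column can be computed directly (assuming, as in the loop above, that the first column of transformed is the Type outcome):

library(e1071)
# skewness of each principal component after the Box-Cox/center/scale/PCA steps
sapply(transformed[, 2:ncol(transformed)], skewness)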

Outliers:

for (col in 2:ncol(transformed)) {
  outlier_values <- boxplot.stats(transformed[,col])$out
  boxplot(transformed[,col], main=colnames(transformed[col]), boxwex=0.1)
  mtext(paste("Outliers: ", paste(outlier_values, collapse=", ")), cex=0.6)
}

The transformation decreased the number of outliers.
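
To back up the visual impression, a small helper (our own, not from the text) can count the boxplot outliers per column before and after the transformations:

# count boxplot outliers in each of the given columns of a data frame
count_outliers <- function(df, cols) {
  sapply(cols, function(col) length(boxplot.stats(df[, col])$out))
}
count_outliers(Glass, 1:(ncol(Glass) - 1))         # original predictors
count_outliers(transformed, 2:ncol(transformed))   # principal components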

Exercise KJ 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

data(Soybean)
(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
par(mfrow = c(3, 3))
for (col in 1:ncol(Soybean)) {
  barplot(table(Soybean[,col]), main = colnames(Soybean[col]))
}

Let’s search for any “near-zero variance predictors”:

x <- nearZeroVar(Soybean)
colnames(Soybean)[x]
## [1] "leaf.mild" "mycelium"  "sclerotia"
par(mfrow = c(1, 3))
for (i in x) {
  barplot(table(Soybean[, i]), main = colnames(Soybean)[i])
}

Looking at the barplots above, these three variables (leaf.mild, mycelium, and sclerotia) appear to be degenerate and should be removed from the model to improve performance.
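
Dropping them is a one-liner; the object name Soybean.filtered below is ours, not from the text:

# remove the near-zero variance predictors identified above
Soybean.filtered <- Soybean[, -x]
dim(Soybean.filtered)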

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

We will use the mice package to deal with missing data by imputation.

library(mice)     # imputation
library(naniar)   # gg_miss_var() for visualizing missingness
Soybean2 <- mice(Soybean, method = "pmm", printFlag = FALSE)
## Warning: Number of logged events: 1670
Soybean2 <- as.data.frame(complete(Soybean2))
gg_miss_var(Soybean2)
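
As a final sanity check (a sketch of our own, in addition to the gg_miss_var plot), we can confirm that no missing values remain after imputation:

# total number of missing values left in the imputed data (expected to be 0)
sum(is.na(Soybean2))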