library(readxl, quietly = TRUE, warn.conflicts = FALSE, verbose = FALSE)
library(fpp2, quietly = TRUE, warn.conflicts = FALSE, verbose = FALSE)
library(ggplot2)
library(gridExtra)
library(mlbench)
library(caret)
library(corrplot)
library(dplyr)
library(kableExtra)
library(e1071)

Q1 Exercise 3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

(a)

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Based on the plots below, the predictors Na, Al, Si, and Ca look approximately normally distributed, while the other predictors are skewed. The correlation plot shows the relationships between the predictors: there is a high correlation between Ca and RI, and we also see both positive and negative correlations among other variable pairs.
library(mlbench)
data(Glass)
glassData <- Glass[-10]  # drop the Type column, keeping the nine predictors

# Histogram of each predictor
par(mfrow = c(3, 3))
for (col in 1:ncol(glassData)) {
    hist(glassData[, col], main = colnames(glassData)[col],
         xlab = colnames(glassData)[col])
}

par(mfrow = c(1, 1))

corGlass <- cor(glassData)
corrplot(corGlass, method = 'number')
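To quantify the "high correlation" claim with numbers, a small sketch using caret's findCorrelation is shown below; the 0.75 cutoff is an assumed threshold, not one given in the text.

# Flag predictors involved in pairwise correlations above an assumed 0.75 cutoff
findCorrelation(corGlass, cutoff = 0.75, names = TRUE)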

(b)

Do there appear to be any outliers in the data? Are any predictors skewed?

As in part (a), the predictors Na, Al, Si, and Ca look approximately normally distributed, while the other predictors are skewed.
The table below displays the skewness of each variable.
The boxplots show that Mg is the only predictor without outliers.
# Histograms of each predictor
par(mfrow = c(3, 3))
for (col in 1:ncol(glassData)) {
    hist(glassData[, col], main = colnames(glassData)[col],
         xlab = colnames(glassData)[col])
}

# Boxplots of each predictor
par(mfrow = c(3, 3))
for (col in 1:ncol(glassData)) {
    boxplot(glassData[, col], main = colnames(glassData)[col],
            xlab = colnames(glassData)[col])
}

kable(apply(glassData, 2, skewness)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Predictor   Skewness
RI 1.6027151
Na 0.4478343
Mg -1.1364523
Al 0.8946104
Si -0.7202392
K 6.4600889
Ca 2.0184463
Ba 3.3686800
Fe 1.7298107

The histograms show the skewness, and the boxplots show that outliers are present in most predictors.
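As a numeric companion to the boxplots, the sketch below counts, for each predictor, the points flagged by the 1.5 × IQR rule that boxplot itself uses.

# Count boxplot outliers per predictor (1.5 * IQR rule via boxplot.stats)
sapply(glassData, function(x) length(boxplot.stats(x)$out))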

(c)

Are there any relevant transformations of one or more predictors that might improve the classification model?

The plots below compare the actual data with the Box-Cox-transformed data. Based on these plots, we do not see a dramatic reduction in skewness for every predictor; the transformation is still useful for putting all variables on a similar scale and as preparation for PCA.

# Before/after histograms for each predictor under a Box-Cox transformation.
# BoxCoxTrans requires strictly positive data, so predictors containing zeros
# (K, Ba, Fe) are returned unchanged.
par(mfrow = c(3, 2))
for (col in 1:ncol(glassData)) {
    hist(glassData[, col], main = colnames(glassData)[col],
         xlab = colnames(glassData)[col])
    t <- BoxCoxTrans(glassData[, col])
    transformed <- predict(t, glassData[, col])
    hist(transformed, main = paste("after transform", colnames(glassData)[col]),
         xlab = colnames(glassData)[col])
}

trans <- preProcess(glassData, method = c("BoxCox", "center", "scale"))
head(predict(trans, glassData))
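Because Box-Cox skips the zero-containing predictors, one alternative worth sketching is the Yeo-Johnson transformation, which is defined at zero and below and is supported by caret's preProcess. This is an assumption on our part, not something done in the analysis above.

# Sketch: Yeo-Johnson handles the zero-valued predictors that Box-Cox skips
transYJ <- preProcess(glassData, method = c("YeoJohnson", "center", "scale"))
glassYJ <- predict(transYJ, glassData)
apply(glassYJ, 2, skewness)  # compare against the skewness table in part (b)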

Q2 Exercise 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

data("Soybean")
Soybeandata <- Soybean[-1]

par(mfrow=c(4,5))
for (col in 2:ncol(Soybean)) {
    hist( as.numeric(Soybean[,col]),main =   colnames(Soybean)[col], xlab = colnames(Soybean)[col])
}

(a)

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Degenerate distributions are those where the predictor variable has a single unique value or only a handful of unique values that occur with very low frequencies.

The "leaf.mild", "mycelium", and "sclerotia" predictors have near-zero variance.

# Checking for degenerate distributions
nearZeroVar(Soybeandata, names = TRUE)
## [1] "leaf.mild" "mycelium"  "sclerotia"

hist(as.numeric(Soybeandata[, "leaf.mild"]), main = "leaf.mild", xlab = "leaf.mild")

hist(as.numeric(Soybeandata[, "mycelium"]), main = "mycelium", xlab = "mycelium")

hist(as.numeric(Soybeandata[, "sclerotia"]), main = "sclerotia", xlab = "sclerotia")
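To see why these three predictors are flagged, the sketch below retrieves the frequency-ratio and percent-unique metrics that nearZeroVar computes internally.

# Inspect the metrics behind the near-zero-variance flags
nzv <- nearZeroVar(Soybeandata, saveMetrics = TRUE)
nzv[nzv$nzv, ]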

(b)

Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

The hail, sever, seed.tmt, and lodging variables have the most missing values; leaves, date, and area.dam have the fewest.

The table below shows the number of missing values per predictor variable.

missingCols <- sort(colSums(is.na(Soybean)))
kable(missingCols) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Variable   Missing values
Class 0
leaves 0
date 1
area.dam 1
crop.hist 16
plant.growth 16
stem 16
temp 30
roots 31
plant.stand 36
precip 38
stem.cankers 38
canker.lesion 38
ext.decay 38
mycelium 38
int.discolor 38
sclerotia 38
leaf.halo 84
leaf.marg 84
leaf.size 84
leaf.malf 84
fruit.pods 84
seed 92
mold.growth 92
seed.size 92
leaf.shread 100
fruiting.bodies 106
fruit.spots 106
seed.discolor 106
shriveling 106
leaf.mild 108
germ 112
hail 121
sever 121
seed.tmt 121
lodging 121
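As a sanity check on the question's "roughly 18%" figure, the snippet below computes the share of samples with at least one missing value; reading the 18% as row-wise (rather than cell-wise) missingness is an interpretation on our part.

# Share of samples (rows) with at least one missing value
round(100 * mean(!complete.cases(Soybean)), 1)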

The class phytophthora-rot has the most missing data. The table below displays the number of incomplete samples by class.

Soybean %>%
  filter(!complete.cases(.)) %>%
  count(Class, name = "Missing") %>%
  arrange(desc(Missing))
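Because the classes differ in size, raw counts can mislead; the sketch below (an extension of the chain above, not part of the original) also reports the within-class share of incomplete samples.

# Within-class share of incomplete samples
Soybean %>%
  mutate(incomplete = !complete.cases(.)) %>%
  group_by(Class) %>%
  summarise(n = n(), missing = sum(incomplete),
            pct_missing = round(100 * missing / n, 1)) %>%
  arrange(desc(pct_missing))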

(c)

Develop a strategy for handling missing data, either by eliminating predictors or by imputation.

The following methods could be used for imputation; a mode-imputation sketch for the categorical Soybean predictors follows this list.

Mean value imputation

Each missing value is replaced with an imputed value equal to the mean of the observed data.

Non-stochastic regression imputation

This approach also replaces missing data with imputed values, but uses predicted values from a regression model.

Stochastic regression imputation

This approach extends regression imputation by adding a random component to the predictions so that the imputed values have the same variance as the observed values.
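For the mostly categorical Soybean predictors, mean imputation does not apply directly; below is a minimal sketch of the analogous mode imputation, where modeImpute is a hypothetical helper that replaces each NA with the column's most frequent level.

# Sketch: mode imputation for categorical predictors
# (modeImpute is a hypothetical helper introduced for illustration)
modeImpute <- function(x) {
  mostFrequent <- names(which.max(table(x)))  # most frequent observed level
  x[is.na(x)] <- mostFrequent
  x
}

SoybeanImputed <- Soybean
SoybeanImputed[-1] <- lapply(SoybeanImputed[-1], modeImpute)
sum(is.na(SoybeanImputed))  # 0 after imputation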