library(readxl, quietly = TRUE, warn.conflicts = FALSE, verbose = FALSE)
library(fpp2, quietly = TRUE, warn.conflicts = FALSE, verbose = FALSE)
library(ggplot2)
library(gridExtra)
library(mlbench)
library(caret)
library(corrplot)
library(dplyr)
library(kableExtra)
library(e1071)
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
Based on the plots below, the predictors Na, Al, Si, and Ca look approximately normally distributed, while the other predictors are skewed. The correlation plot shows the relationships between predictors: Ca and RI are highly correlated, and there are both positive and negative correlations among the other variables.
library(mlbench)
data(Glass)

# Drop the class label (column 10) so only the nine predictors remain
glassData <- Glass[-10]

# Histogram of each predictor
par(mfrow = c(3, 3))
for (col in 1:ncol(glassData)) {
  hist(glassData[, col], main = colnames(glassData)[col],
       xlab = colnames(glassData)[col])
}
par(mfrow = c(1, 1))

# Correlation matrix of the predictors
corGlass <- cor(glassData)
corrplot(corGlass, method = 'number')
Do there appear to be any outliers in the data? Are any predictors skewed?
Based on the plots below, the predictors Na, Al, Si, and Ca look approximately normally distributed; the other predictors are skewed.
The table below displays the skewness of each variable.
The boxplots show that Mg is the only predictor without outliers.
# Histograms (skewness) for every predictor
par(mfrow = c(3, 3))
for (col in 1:ncol(glassData)) {
  hist(glassData[, col], main = colnames(glassData)[col],
       xlab = colnames(glassData)[col])
}

# Boxplots (outliers) for every predictor
par(mfrow = c(3, 3))
for (col in 1:ncol(glassData)) {
  boxplot(glassData[, col], main = colnames(glassData)[col],
          xlab = colnames(glassData)[col])
}
kable(apply(Glass[-10], 2, skewness)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| Predictor | Skewness |
|---|---|
| RI | 1.6027151 |
| Na | 0.4478343 |
| Mg | -1.1364523 |
| Al | 0.8946104 |
| Si | -0.7202392 |
| K | 6.4600889 |
| Ca | 2.0184463 |
| Ba | 3.3686800 |
| Fe | 1.7298107 |
The histograms show the skewness of each predictor, and the boxplots show that outliers are present in most of them; a quick numeric check of the per-predictor outlier counts follows.
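As a cross-check on the boxplots, here is a minimal sketch (not part of the original analysis) that counts, for each predictor, how many points fall outside the default 1.5 × IQR whiskers:

# Number of points flagged as outliers by the standard boxplot rule
sapply(glassData, function(x) length(boxplot.stats(x)$out))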
Are there any relevant transformations of one or more predictors that might improve the classification model?
The plots below display the original data next to the Box-Cox transformed data. Based on these plots, we do not see a large reduction in skewness; the transformations are mainly useful for putting the variables on a similar scale, for example before PCA (a short PCA sketch follows the transformation code).
# Original vs. Box-Cox transformed histograms for each predictor.
# Box-Cox requires strictly positive data, so predictors containing zeros
# (e.g., Mg, K, Ba, Fe) are left untransformed by BoxCoxTrans().
par(mfrow = c(4, 2))
for (col in 1:ncol(glassData)) {
  hist(glassData[, col], main = colnames(glassData)[col],
       xlab = colnames(glassData)[col])
  t <- BoxCoxTrans(as.numeric(glassData[, col]))
  transformed <- predict(t, glassData[, col])
  hist(transformed, main = paste("after transform", colnames(glassData)[col]),
       xlab = colnames(glassData)[col])
}
# Center, scale, and Box-Cox transform all predictors at once
trans <- preProcess(Glass[-10], method = c("BoxCox", "center", "scale"))
transformedGlass <- predict(trans, Glass[-10])
head(transformedGlass)
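Following up on the note above about PCA, a minimal sketch (an illustration, not part of the original exercise) that runs PCA on the centered, scaled, Box-Cox transformed predictors produced above:

# PCA on the transformed predictors; the data are already centered and scaled
pcaGlass <- prcomp(transformedGlass)
summary(pcaGlass)   # proportion of variance explained by each component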
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
data("Soybean")
Soybeandata <- Soybean[-1]
par(mfrow=c(4,5))
for (col in 2:ncol(Soybean)) {
hist( as.numeric(Soybean[,col]),main = colnames(Soybean)[col], xlab = colnames(Soybean)[col])
}
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
Degenerate distributions are those where the predictor variable has a single unique value, or only a handful of unique values that occur with very low frequencies.
The "leaf.mild", "mycelium", and "sclerotia" predictors have near-zero variance; these could be dropped before modeling (see the sketch after the plots below).
# Checking for degenerate distributions
nearZeroVar(Soybeandata, names = TRUE)
## [1] "leaf.mild" "mycelium"  "sclerotia"
hist(as.numeric(Soybeandata[, "leaf.mild"]), main = "leaf.mild", xlab = "leaf.mild")
hist(as.numeric(Soybeandata[, "mycelium"]), main = "mycelium", xlab = "mycelium")
hist(as.numeric(Soybeandata[, "sclerotia"]), main = "sclerotia", xlab = "sclerotia")
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
The hail, sever, seed.tmt, and lodging variables have the most missing values, while leaves, date, and area.dam have the fewest.
The table below shows the number of missing values for each variable.
# Count missing values per column, sorted in ascending order
missingCols <- sort(colSums(is.na(Soybean)))
kable(missingCols) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| Variable | Missing values |
|---|---|
| Class | 0 |
| leaves | 0 |
| date | 1 |
| area.dam | 1 |
| crop.hist | 16 |
| plant.growth | 16 |
| stem | 16 |
| temp | 30 |
| roots | 31 |
| plant.stand | 36 |
| precip | 38 |
| stem.cankers | 38 |
| canker.lesion | 38 |
| ext.decay | 38 |
| mycelium | 38 |
| int.discolor | 38 |
| sclerotia | 38 |
| leaf.halo | 84 |
| leaf.marg | 84 |
| leaf.size | 84 |
| leaf.malf | 84 |
| fruit.pods | 84 |
| seed | 92 |
| mold.growth | 92 |
| seed.size | 92 |
| leaf.shread | 100 |
| fruiting.bodies | 106 |
| fruit.spots | 106 |
| seed.discolor | 106 |
| shriveling | 106 |
| leaf.mild | 108 |
| germ | 112 |
| hail | 121 |
| sever | 121 |
| seed.tmt | 121 |
| lodging | 121 |
The class phytophthora-rot has the most missing data. The code below counts the number of incomplete samples in each class; a sketch that normalizes these counts by class size follows it.
Soybean %>%
mutate(Total = n()) %>%
filter(!complete.cases(.)) %>%
group_by(Class) %>%
mutate(Missing = n() ) %>%
select(Class, Missing ) %>%
unique()
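To check whether the pattern is really class-related rather than simply a reflection of class size, a minimal sketch (an extra check, not part of the original code) that computes the share of incomplete samples within each class:

# Share of incomplete samples within each class
Soybean %>%
  mutate(incomplete = !complete.cases(.)) %>%   # flag rows with any NA
  group_by(Class) %>%
  summarise(Missing = sum(incomplete),
            Total   = n(),
            Share   = round(Missing / Total, 2)) %>%
  arrange(desc(Share))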
Develop a strategy for handling missing data, either by eliminating predictors or imputation.

The following methods could be used for imputation (a sketch of the first approach, adapted to categorical predictors, follows the list):

- Mean value imputation: each missing value is replaced with an imputed value equal to the mean of the observed data.
- Non-stochastic regression imputation: missing data are again replaced with imputed values, but the imputed values are predictions from a regression model fit to the observed data.
- Stochastic regression imputation: this extends regression imputation by adding a random component to the predictions so that the imputed values have the same variance as the observed values.
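Since the Soybean predictors are categorical, mean imputation translates into imputing the most frequent level (the mode). A minimal sketch, using a hypothetical imputeMode() helper introduced here for illustration:

# Replace NAs in a factor with its most frequent observed level
imputeMode <- function(x) {
  mode_level <- names(which.max(table(x)))  # table() drops NAs by default
  x[is.na(x)] <- mode_level
  x
}

SoybeanImputed <- Soybean
SoybeanImputed[-1] <- lapply(Soybean[-1], imputeMode)

sum(is.na(SoybeanImputed))   # should be 0 after imputation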