3.1

A

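library(mlbench)     # provides the Glass and Soybean data sets
library(tidyverse)   # gather(), ggplot2, and dplyr helpers
library(ggcorrplot)  # correlation heat map

data(Glass)
glass <- data.frame(Glass)

# Drop column 10 (Type, the class label) to keep the nine numeric predictors
numeric_cols <- glass[-c(10)]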

numeric_cols %>% 
  gather() %>%
  ggplot(aes(value,color = key))+
  geom_histogram(bins=25)+
  facet_wrap(~key,scales='free')

numeric_cols %>% 
  gather() %>%
  ggplot(aes(value,color = key))+
  geom_boxplot()+
  facet_wrap(~key,scales='free')

corrmat <- round(cor(numeric_cols),1)
corrmat %>%
  ggcorrplot(lab=TRUE)

B

Iron content is heavily right-skewed. Silicon is the most nearly normal of the elements, and aluminum, calcium, and sodium are also roughly normal. The refractive index is right-skewed. Barium and potassium are both right-skewed, while magnesium is left-skewed.

The boxplots show quite a few outliers in the distribution of nearly every predictor. Magnesium is the only variable without a considerable number of outliers, though a sizable group of the 214 observations contain no magnesium at all. Many of the flagged outliers would disappear if each element were examined only in the samples that actually contain it. Barium illustrates this: every observation with any barium appears as an outlier because most samples contain none. Even so, the samples with no trace of an element must still be considered when drawing conclusions. It is simply worthwhile to also establish a typical range among the samples that do contain an element, so that the true outliers can be isolated.
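The point about zero-valued samples can be checked directly. The following is a minimal sketch, assuming the mlbench package is installed; the names `predictors`, `zero_counts`, and `ba_nonzero` are illustrative and not part of the original analysis. It counts the exact zeros in each predictor and then looks for outliers among only the samples that contain barium.

```r
library(mlbench)

data(Glass)

# Keep the nine numeric predictors (everything except the class label)
predictors <- Glass[, names(Glass) != "Type"]

# Number of observations with an exact zero for each predictor
zero_counts <- colSums(predictors == 0)
zero_counts

# Restrict barium to the samples that actually contain it
ba_nonzero <- Glass$Ba[Glass$Ba > 0]

# Values still flagged as outliers by the usual 1.5 * IQR boxplot rule
boxplot.stats(ba_nonzero)$out
```

Comparing the flagged values before and after the restriction gives a rough sense of how much of the apparent outlier problem is driven by the zeros alone.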

C

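I would recommend starting with a Box-Cox transformation on the excessively skewed predictors. Box-Cox transformations help regression and classification models that are sensitive to skewed numeric predictors. Because Box-Cox requires strictly positive values, predictors that contain zeros (such as barium and iron) would first need a small offset, or a Yeo-Johnson transformation could be used in its place. A log transformation, which is the Box-Cox special case of lambda = 0, may suit the most highly skewed predictors. Further, it is worth noting the magnitude of the percentages of each predictor when performing any transformation, since the elements sit on very different scales.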
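One way to try this is with `preProcess()` from the caret package. This is a sketch rather than the method used in the analysis above: it assumes caret and mlbench are installed, and it swaps in the Yeo-Johnson transformation (a Box-Cox relative that accepts zeros) because several Glass predictors contain zeros. The names `pp` and `glass_trans` are illustrative.

```r
library(mlbench)
library(caret)

data(Glass)

# Keep the nine numeric predictors (everything except the class label)
predictors <- Glass[, names(Glass) != "Type"]

# Estimate a Yeo-Johnson transformation per predictor, then center and scale
pp <- preProcess(predictors, method = c("YeoJohnson", "center", "scale"))

# Apply the estimated transformations to the data
glass_trans <- predict(pp, predictors)

summary(glass_trans)
```

Replacing `"YeoJohnson"` with `"BoxCox"` would request the Box-Cox transformation instead, which can only be estimated for strictly positive predictors.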

3.2

data(Soybean)
soybean <- data.frame(Soybean)

A

numeric_cols <- soybean[-c(1)]

numeric_cols %>% 
  gather() %>%
  ggplot(aes(value,color = key))+
  geom_bar()+
  facet_wrap(~key,scales='free')+
  theme(legend.position = "none")
## Warning: attributes are not identical across measure variables;
## they will be dropped

mycelium, sclerotia, roots, shriveling, seed.discolor, seed.size, and leaf.mild all appear to be potentially degenerate, because nearly all of their observations fall into a single level. They may still be major predictors of specific classes, so we shouldn’t cast aside any variable that takes more than one value without first determining whether it is truly degenerate.
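The visual impression can be backed up with a numeric check. A minimal sketch, assuming the caret and mlbench packages are installed: `nearZeroVar()` reports, for each predictor, the ratio of the most common value to the second most common and the percentage of unique values, and flags predictors that cross its default thresholds. The name `nzv` is illustrative.

```r
library(mlbench)
library(caret)

data(Soybean)

# Frequency-ratio and uniqueness metrics for every predictor (Class excluded)
nzv <- nearZeroVar(Soybean[, -1], saveMetrics = TRUE)

# Predictors flagged as zero or near-zero variance under the default cutoffs
nzv[nzv$nzv | nzv$zeroVar, ]
```

The default cutoffs are a convention rather than a rule, so a flagged predictor is a candidate for closer inspection, not automatic removal.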

B

No predictor is missing in an overwhelming share of the observations, though germ, leaf.shread, hail, lodging, leaf.size, seed.tmt, and sever have some of the highest NA counts.

# Number of missing values in each column
na_dist <- data.frame(count = colSums(is.na(soybean)))

na_dist

Given how incomplete the dataset is overall, no single predictor stands out as having a disproportionate number of NAs.

nas <- filter_all(soybean, any_vars(is.na(.)))

na_counts <- data.frame(table(nas$Class))

colnames(na_counts) <- c("Class","Total.NA")

na_counts <- subset(na_counts, Total.NA >= 1)

na_counts

The NAs are limited to only five classes, with the most incomplete data coming from the phytophthora-rot class.

C

Missing values should be imputed within a class only. Imputing with the median value of a variable from the dataset at large could shift an observation toward a different class. The purpose of the dataset matters, however. If the data is used to build a classification model, it will be impossible to impute missing values for new observations relative to their class, because the class will be unknown.
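Within-class imputation on the training data could look like the sketch below. It assumes the mlbench package is installed and uses base R only. Because the Soybean predictors are categorical, it fills each gap with the most common level within the class rather than a median, which is a substitution made for this illustration. The helpers `mode_level()` and `impute_within_class()` are hypothetical names, not functions from any package.

```r
library(mlbench)

data(Soybean)

# Most frequent non-missing level of x; NA when every value is missing
mode_level <- function(x) {
  counts <- table(x)  # table() ignores NA by default
  if (sum(counts) == 0) return(NA_character_)
  names(counts)[which.max(counts)]
}

# Fill NAs in each predictor with the most common level inside the same class
impute_within_class <- function(df, class_col = "Class") {
  pieces <- lapply(split(df, df[[class_col]]), function(grp) {
    for (col in setdiff(names(grp), class_col)) {
      fill <- mode_level(grp[[col]])
      # Leave the column untouched when the class has no observed value
      if (!is.na(fill)) grp[[col]][is.na(grp[[col]])] <- fill
    }
    grp
  })
  # Reassemble the groups in the original row order
  unsplit(pieces, df[[class_col]])
}

soybean_imputed <- impute_within_class(Soybean)

# Missing values remaining before and after
c(before = sum(is.na(Soybean)), after = sum(is.na(soybean_imputed)))
```

Any predictor that is missing for every observation in a class is left as NA, since there is nothing within the class to impute from.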

It may be best to add a calculated field that records the percentage of missing data for each observation. The fact that data is missing at all indicates that the observation belongs to one of the five problematic classes, so the missing values are themselves informative. From there, a separate route should be taken: we can run a classification model that takes into account the presence of NAs, and perhaps even the percentage of NAs in each observation.
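The calculated field described above can be sketched in a few lines of base R, assuming the mlbench package is installed. The names `soybean_flagged` and `pct_missing` are illustrative.

```r
library(mlbench)

data(Soybean)

soybean_flagged <- Soybean

# Percentage of predictors (Class excluded) that are missing in each row
soybean_flagged$pct_missing <- rowMeans(is.na(Soybean[, -1])) * 100

# Average percentage of missing predictors per class
aggregate(pct_missing ~ Class, data = soybean_flagged, FUN = mean)
```

Classes with an average of zero have complete data, so a nonzero `pct_missing` on a new observation narrows the candidates to the remaining classes.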

We can also use the NA values to our advantage by creating a sparse indicator matrix that holds a true or false value for whether an NA was present in each column of each observation.
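A minimal sketch of such an indicator matrix, assuming the mlbench package is installed. It builds an ordinary logical matrix in base R; converting it to a true sparse format (for example with the Matrix package) would be an optional further step. The name `na_flags` and the `_missing` suffix are illustrative.

```r
library(mlbench)

data(Soybean)

# TRUE where a predictor value is missing, FALSE otherwise (Class excluded)
na_flags <- is.na(Soybean[, -1])
colnames(na_flags) <- paste0(colnames(na_flags), "_missing")

# Keep only the indicator columns that contain at least one missing value
na_flags <- na_flags[, colSums(na_flags) > 0, drop = FALSE]

dim(na_flags)
```

These indicator columns can be bound to the predictors with `cbind()` before any imputation, so the model still sees where the gaps were.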

Overall, because the presence of NAs points to only a select group of classes, it is best to flag NAs in a calculated column before imputing any data.