This homework assignment includes problems from KJ and HA; it accompanies readings from KJ chapters 1, 2, and 3 and HA chapters 1, 2, 6, 7, and 8.
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
library(mlbench)
data(Glass)
str(Glass)
library(mlbench)
library(GGally)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
for (i in 1:(ncol(Glass) - 1)) {
  p <- ggplot(Glass, aes(x = .data[[colnames(Glass)[i]]])) +
    geom_histogram(bins = 30) +
    ggtitle(colnames(Glass)[i]) +
    xlab(colnames(Glass)[i])
  print(p)
}
ggpairs(Glass, title="correlogram with ggpairs()", columns = 1:9) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
ggcorr(Glass[1:9], method = c("all.obs", "pearson"))
In terms of distributions, some predictors are roughly normal, some are skewed or heavily skewed, and several contain outliers. The distributions closest to normal appear to be Na, Al, and Si. The outliers and skew are addressed in part (b). The relationships between the predictors are explored using the GGally package. Based on the pairwise plots and correlations, the strongest correlation is between Ca and RI (0.81), which suggests these two predictors may be collinear. Aside from that pair, there are several moderate correlations between 0.4 and 0.6 in absolute value, for example -0.542 between Si and RI, -0.482 between Al and Mg, -0.444 between Ca and Mg, and -0.479 between Ba and Al. The remaining correlations are lower, and the relationships between the remaining predictors would not be described as collinear.
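As a numerical check on these observations, a minimal sketch using base R's cor() can list every predictor pair whose absolute correlation exceeds 0.4 (the 0.4 cutoff here is chosen only for illustration):
# correlation matrix of the nine numeric predictors
cors <- cor(Glass[, 1:9])
# report the pairs in the upper triangle with |correlation| > 0.4
high <- which(abs(cors) > 0.4 & upper.tri(cors), arr.ind = TRUE)
data.frame(var1 = rownames(cors)[high[, 1]],
           var2 = colnames(cors)[high[, 2]],
           correlation = round(cors[high], 3))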
Based on the histograms in part (a), you can see visually which predictors are skewed. Another way to check is to use the skewness() function from the e1071 package.
library(e1071)
skewValues <- apply(Glass[1:9],2, skewness)
(skewValues)
## RI Na Mg Al Si K Ca
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889 2.0184463
## Ba Fe
## 3.3686800 1.7298107
For skewness, a negative value indicates left skew and a positive value indicates right skew. According to the histograms and the skewness values, K is heavily right skewed; Ba, Ca, Fe, and RI are also right skewed; and Mg is left skewed. The other predictors have low absolute skewness and could be treated as approximately normal.
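As a quick illustration of how a transformation can reduce skew (a sketch only; the choice of log1p here is ad hoc, used because K contains zeros), we can compare the skewness of K before and after transforming it. The transformed skewness should be much smaller in magnitude than the original value of about 6.46.
# skewness of K before and after a log(1 + x) transform
skewness(Glass$K)
skewness(log1p(Glass$K))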
To help identify outliers, we can make a boxplot.
library(tidyverse)
library(reshape2)
melteddf <- melt(Glass, id.vars = "Type")
melteddf_1 <- melteddf |> filter(variable %in% c("RI", "Al", "K", "Ba", "Fe"))
melteddf_2 <- melteddf |> filter(variable %in% c("Na", "Ca", "Mg"))
melteddf_3 <- melteddf |> filter(variable == "Si")
# boxplots of all predictors together, then grouped by scale for readability
p <- melteddf |> ggplot(aes(x = variable, y = value)) +
  geom_boxplot() +
  ggtitle("All predictors")
print(p)
p <- melteddf_1 |> ggplot(aes(x = variable, y = value)) +
  geom_boxplot() +
  ggtitle("RI, Al, K, Ba, Fe")
print(p)
p <- melteddf_2 |> ggplot(aes(x = variable, y = value)) +
  geom_boxplot() +
  ggtitle("Na, Ca, Mg")
print(p)
p <- melteddf_3 |> ggplot(aes(x = variable, y = value)) +
  geom_boxplot() +
  ggtitle("Si")
print(p)
There is at least one outlier in each of the variables. The most outliers appear to occur in Ca.
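As a rough quantitative check (a sketch using the conventional 1.5 × IQR boxplot rule as the definition of an outlier, which is an assumption on our part), we can count the points falling outside the whiskers for each predictor:
# count points beyond 1.5 * IQR from the quartiles for each numeric predictor
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
}
sapply(Glass[, 1:9], count_outliers)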
Yes, there are transformations that can be applied to some of the predictors that may improve the classification model; a sketch applying several of them follows the list below.
OUTLIERS: According to the book, the first step for outliers is to check whether they are valid or whether an error could have occurred. You have to be careful not to remove an outlier that may be genuine, especially if the sample is small. If there are real outliers, one option is the spatial sign transformation, which projects the predictors onto a multidimensional sphere so that all samples are the same distance from the center.
CENTERING & SCALING: According to the book, it can be useful to center and scale the predictor variables. To center, the average predictor value is subtracted from all of the values, giving the predictor a mean of zero; to scale, each value is divided by the predictor's standard deviation, giving it a standard deviation of one. This can improve numerical stability, and some models benefit from predictors being on a common scale.
SKEWNESS: Replacing the data with its log, square root, or inverse can help remove skew; after the transformation the distribution should be closer to symmetric. The Box-Cox procedure uses the data themselves to estimate which transformation should be applied.
MISSING VALUES: Missing data can be removed or imputed (i.e., the missing values are estimated from the rest of the data).
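As a sketch of how these transformations could be applied (this uses caret's preProcess(); the particular combination of methods shown is our assumption rather than a prescribed recipe), the Box-Cox, centering, scaling, and spatial sign steps can be estimated and applied in one pass:
library(caret)
# estimate Box-Cox (where the data are strictly positive), centering, scaling,
# and the spatial sign transformation from the nine numeric predictors
glass_pp <- preProcess(Glass[, 1:9],
                       method = c("BoxCox", "center", "scale", "spatialSign"))
glass_pp
# apply the estimated transformations and recheck skewness
glass_trans <- predict(glass_pp, Glass[, 1:9])
apply(glass_trans, 2, skewness)
Predictors containing zeros (e.g., K, Ba, Fe) cannot be Box-Cox transformed, so they are expected to be skipped by that step; a Yeo-Johnson transformation ("YeoJohnson" in preProcess) could be substituted for those if needed.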
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
library(mlbench)
data(Soybean)
?Soybean
Below are the plots of the frequency distributions for the categorical predictors. According to the book, degenerate data can come in a variety of forms, and removing it can significantly improve model performance. One issue is a predictor with a single unique value; it carries no information and can be discarded. Another is a "near-zero variance predictor", where only a handful of unique values occur and a single value accounts for most of the samples. In summary, if the fraction of unique values relative to the sample size is low (under 10%) and the ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (around 20), it may be appropriate to remove the predictor. Another example of degenerate data is data that is collinear or multicollinear.
library(mlbench)
data(Soybean)
# ?Soybean
for (i in 2:ncol(Soybean)) {
  p <- ggplot(Soybean, aes(x = .data[[colnames(Soybean)[i]]])) +
    geom_bar() +
    ggtitle(colnames(Soybean)[i]) +
    xlab(colnames(Soybean)[i])
  print(p)
}
Looking at the plots, there are multiple cases where the ratio of the most common value to the second most common value is over 20. The most dramatic case appears to be the mycelium predictor, where over 600 samples take the value 0 and fewer than 10 take the value 1, with the rest being NA. In addition, the number of unique values relative to the sample size is generally low.
The book demonstrates the caret package's nearZeroVar() function, which returns the column numbers of predictors that are degenerate.
library(caret)
degen <- nearZeroVar(Soybean)
colnames(Soybean[degen])
## [1] "leaf.mild" "mycelium" "sclerotia"
If we refer back to the plots, we can see that in each of these cases the ratio of the most prevalent value to the second most prevalent value is very high.
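To quantify this, a short sketch using the saveMetrics = TRUE option of the same nearZeroVar() function reports the frequency ratio and percentage of unique values for each predictor, which can be compared against the rough thresholds (ratio near 20, unique fraction under 10%) mentioned above:
# frequency ratio and percent-unique metrics for every column
nzv_metrics <- nearZeroVar(Soybean, saveMetrics = TRUE)
# show only the columns flagged as near-zero variance
nzv_metrics[nzv_metrics$nzv, ]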
First, calculate the number of missing values for each predictor.
library(DT)
soybean_nas <- Soybean |> summarise(across(everything(), ~ sum(is.na(.))))
soybean_nas <- soybean_nas[2:length(soybean_nas)]
soybean_nas_long <- soybean_nas |> pivot_longer(everything(),
names_to = "predictor",
values_to = "na_count")
datatable(soybean_nas_long |> arrange(desc(na_count)))
soybean_nas_long |>
  ggplot(aes(x = fct_reorder(predictor, na_count), y = na_count)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  labs(x = "predictor", title = "Count of NA for each Predictor")
The predictors with the most missing values are "hail", "sever", "seed.tmt", and "lodging".
Next, we need to see if any specific class is more likely to have NAs than other classes.
df_soybean <- Soybean
df_soybean$row_sum_na <- rowSums(is.na(Soybean))
class_nas_all <- df_soybean |>
group_by(Class) |>
summarise(class_nas = sum(row_sum_na)) |>
arrange(desc(class_nas))
head(class_nas_all)
Yes, specific classes are much more likely to contain NAs than others, as shown above: five classes account for all of the NAs, and the remaining classes have none.
Based on the analysis in part (a), "leaf.mild", "mycelium", and "sclerotia" are degenerate predictors and can be removed.
# drop the degenerate columns identified by nearZeroVar()
df_soybean <- df_soybean[-degen]
Next, for imputation, one option is K-nearest neighbors (KNN). According to the book, the "advantage of this approach is that the imputed data are confined to be within the range of the training set values... one disadvantage is that the entire training set is required every time a missing value needs to be imputed". In this case the dataset is not large, so we can use KNN to impute.
library(DMwR2)
# impute
# https://www.rdocumentation.org/packages/DMwR2/versions/0.0.2/topics/knnImputation
soybean_imputed <- knnImputation(df_soybean,k=3)
# check for any NAs
# https://discuss.analyticsvidhya.com/t/how-can-i-check-whether-my-data-frame-contains-na-inf-values-in-some-column-or-not-in-r/1647
apply(soybean_imputed, 2, function(x) any(is.na(x)))
## Class date plant.stand precip temp
## FALSE FALSE FALSE FALSE FALSE
## hail crop.hist area.dam sever seed.tmt
## FALSE FALSE FALSE FALSE FALSE
## germ plant.growth leaves leaf.halo leaf.marg
## FALSE FALSE FALSE FALSE FALSE
## leaf.size leaf.shread leaf.malf stem lodging
## FALSE FALSE FALSE FALSE FALSE
## stem.cankers canker.lesion fruiting.bodies ext.decay int.discolor
## FALSE FALSE FALSE FALSE FALSE
## fruit.pods fruit.spots seed mold.growth seed.discolor
## FALSE FALSE FALSE FALSE FALSE
## seed.size shriveling roots row_sum_na
## FALSE FALSE FALSE FALSE
All NA values were replaced.
Consider the pigs series — the number of pigs slaughtered in Victoria each month.
First, according to the book, simple exponential smoothing "is suitable for forecasting data with no clear trend or seasonal pattern", so let's take a look at the series before calling ses().
library(fpp2)
data(pigs)
p <- autoplot(pigs) + ggtitle("Pigs slaughtered in Victoria each month")
print(p)
There does not appear to be a seasonal pattern; there may be a slight trend, but we will proceed with ses().
pigs_forecast <- ses(pigs, h = 4)
summary(pigs_forecast)
##
## Forecast method: Simple exponential smoothing
##
## Model Information:
## Simple exponential smoothing
##
## Call:
## ses(y = pigs, h = 4)
##
## Smoothing parameters:
## alpha = 0.2971
##
## Initial states:
## l = 77260.0561
##
## sigma: 10308.58
##
## AIC AICc BIC
## 4462.955 4463.086 4472.665
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 385.8721 10253.6 7961.383 -0.922652 9.274016 0.7966249 0.01282239
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Sep 1995 98816.41 85605.43 112027.4 78611.97 119020.8
## Oct 1995 98816.41 85034.52 112598.3 77738.83 119894.0
## Nov 1995 98816.41 84486.34 113146.5 76900.46 120732.4
## Dec 1995 98816.41 83958.37 113674.4 76092.99 121539.8
We will use the point forecast given by ses() as yhat and compute the 95% interval as yhat ± 1.96 s, where s is the standard deviation of the residuals.
pu <- pigs_forecast$mean[1] + 1.96 * sd(pigs_forecast$residuals)
pl <- pigs_forecast$mean[1] - 1.96 * sd(pigs_forecast$residuals)
paste0("95% Upper: ", pu, " 95% Lower: ", pl)
## [1] "95% Upper: 118952.844969765 95% Lower: 78679.9672534162"
The upper limit of the 95% prediction interval from ses() was 119,020.8, whereas the one computed from the standard deviation of the residuals was 118,952.8; the lower limits were 78,611.97 and 78,679.97, respectively. The interval from ses() therefore has a slightly higher upper limit and a slightly lower lower limit, i.e., it is a little wider than the interval calculated from the residuals. However, the difference is only about 68 pigs against values of roughly 100,000, so in practice the two intervals are nearly the same.
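For a direct side-by-side comparison, the bounds can also be pulled from the forecast object itself (assuming the usual structure of objects returned by the forecast package, with lower and upper matrices whose columns are named by confidence level):
# 95% interval from ses() for the first forecast horizon
c(lower = pigs_forecast$lower[1, "95%"],
  upper = pigs_forecast$upper[1, "95%"])
# interval computed from the residual standard deviation
c(lower = pl, upper = pu)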
Write your own function to implement simple exponential smoothing. The function should take arguments y (the time series), alpha (the smoothing parameter α) and level (the initial level ℓ0 ). It should return the forecast of the next observation in the series.
Does it give the same forecast as ses()?
First we will write the simple exponential smoothing.
# y = time series
# alpha = smoothing parameter
# (this first version does not use the initial level l0; see the rewrite below)
# weighted average form:
# yhat[T+1|T] = alpha*y[T] + alpha*(1-alpha)*y[T-1] + alpha*(1-alpha)^2*y[T-2] + ...
ses2 <- function(y, alph) {
y.hat <- alph*y[length(y)] # declare first term of yhat
for (i in 1:(length(y)-1)){
next.weighted <- alph*((1-alph)^i)*y[length(y)-i]
y.hat <- y.hat + next.weighted
}
forecast.next.obs <- y.hat
print(forecast.next.obs)
return(forecast.next.obs)
}
ses2(pigs, 0.2971)
## [1] 98816.45
## [1] 98816.45
This first version did not actually use the initial level. We now rewrite the function so that the level is used, following the weighted average form section of the book.
# y = time series
# alpha = smoothing parameter
# level = initial level ("The process has to start somewhere, so we let the first
#   fitted value at time 1 be denoted by l0")
# weighted average form:
# yhat[T+1|T] = sum_{j=0}^{T-1} alpha*(1-alpha)^j * y[T-j] + (1-alpha)^T * l0
# (implemented below via the equivalent recursion l[t] = alpha*y[t] + (1-alpha)*l[t-1])
ses3 <- function(y, alph, lev) {
y.hat <- lev # declare first term of yhat
for (i in 1:(length(y))){
y.hat <- alph*y[i] + (1-alph)*y.hat
}
forecast.next.obs <- y.hat
print(forecast.next.obs)
return(forecast.next.obs)
}
ses3(pigs, 0.2971, 77260.0561)
## [1] 98816.45
## [1] 98816.45
Now we can compare these forecasts with the one from ses(). The ses() point forecast was 98,816.41, while both ses2() and ses3() (using the alpha and initial level values reported by ses()) returned 98,816.45, so the functions give essentially the same result.
Modify your function from the previous exercise to return the sum of squared errors rather than the forecast of the next observation. Then use the optim() function to find the optimal values of α and ℓ0. Do you get the same values as the ses() function?
First we will adapt the function from 7.2 to return the sum of squared errors. For optim(), the initial values of the parameters to be optimized must be supplied through the par argument, so rather than taking alph and lev as separate arguments, they are combined into a single vector.
# SSE = sum over t of (y[t] - yhat[t|t-1])^2
sse_fun <- function(par, y) {
alph <- par[1]
lev <- par[2]
y.hat <- lev # declare first term of yhat
sse <- 0
for (i in 1:(length(y))){
err <- y[i] - y.hat # the current error is the correct value - predicted
err.2 <- err^2 # square it
sse <- sse + err.2 # add it up
y.hat <- alph*y[i] + (1-alph)*y.hat # next prediction
}
return(sse)
}
Next we run optim() to minimize the sum of squared errors. We chose a starting value of alpha = 0.8 to put more emphasis on the recent past, and mean(pigs) as the starting level, since there is no strong trend in the data.
find_opt <- optim(par = c(0.8, mean(pigs)), y = pigs, fn=sse_fun)
find_opt
## $par
## [1] 2.971474e-01 7.727386e+04
##
## $value
## [1] 19765613340
##
## $counts
## function gradient
## 145 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
The convergence is 0, indicating successful completion. The outputs are alpha = 0.2971474 and level = 77,273.86.
Finally, we compare these values to those from the ses() function, which reported alpha = 0.2971 and an initial level of 77,260.0561. These are very close to the values found by optimizing the sse_fun() that we wrote.
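As a final side-by-side check (assuming the fitted model stores its parameters in model$par with names "alpha" and "l", as forecast-package objects typically do), the two sets of estimates can be printed together:
# parameters estimated by ses()
pigs_forecast$model$par[c("alpha", "l")]
# parameters found by optim() on the hand-written SSE function
find_opt$par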