This homework assignment includes problems from KJ and HA; it accompanies readings from KJ chapters 1, 2, and 3 and HA chapters 1, 2, 6, 7, and 8.
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
library(mlbench)
data(Glass)
str(Glass)
library(mlbench)
library(GGally)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
for (i in 1:(ncol(Glass) - 1)) {
  p <- ggplot(Glass, aes(x = .data[[colnames(Glass)[i]]])) +
    geom_histogram(bins = 30) +
    ggtitle(colnames(Glass)[i]) +
    xlab(colnames(Glass)[i])
  print(p)
}
ggpairs(Glass, title="correlogram with ggpairs()", columns = 1:9) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
ggcorr(Glass[1:9], method = c("all.obs", "pearson"))
In terms of distributions, some predictors are roughly normal, some are skewed or heavily skewed, and several contain outliers. The distributions closest to normal appear to be Na, Al, and Si. The outliers and skew are addressed in part (b). The relationships between the predictors are explored using the GGally package. Based on the pairwise plots and correlations, the strongest correlation is between Ca and RI (0.81), which suggests these two predictors may be collinear. Aside from that pair, there are several moderate correlations between 0.4 and 0.6 in absolute value, for example -0.542 between Si and RI, -0.482 between Al and Mg, -0.444 between Ca and Mg, and -0.479 between Ba and Al. The remaining correlations are lower, and the relationships between the remaining predictors would not be described as collinear.
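As a numerical check on these observations, a minimal sketch using base R's cor() can list every predictor pair whose absolute correlation exceeds 0.4 (the 0.4 cutoff here is chosen only for illustration):
# correlation matrix of the nine numeric predictors
cors <- cor(Glass[, 1:9])
# report the pairs in the upper triangle with |correlation| > 0.4
high <- which(abs(cors) > 0.4 & upper.tri(cors), arr.ind = TRUE)
data.frame(var1 = rownames(cors)[high[, 1]],
           var2 = colnames(cors)[high[, 2]],
           correlation = round(cors[high], 3))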
Based on the histograms in part (a), you can see visually which predictors are skewed. Another way to check is to use the skewness() function from the e1071 package.
library(e1071)
skewValues <- apply(Glass[1:9],2, skewness)
(skewValues)
## RI Na Mg Al Si K Ca
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889 2.0184463
## Ba Fe
## 3.3686800 1.7298107
For skewness, a negative value indicates left skew and a positive value indicates right skew. According to the histograms and the skewness values, K is heavily right skewed; Ba, Ca, Fe, and RI are also right skewed; and Mg is left skewed. The other predictors have low absolute skewness and could be treated as approximately normal.
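As a quick illustration of how a transformation can reduce skew (a sketch only; the choice of log1p here is ad hoc, used because K contains zeros), we can compare the skewness of K before and after transforming it. The transformed skewness should be much smaller in magnitude than the original value of about 6.46.
# skewness of K before and after a log(1 + x) transform
skewness(Glass$K)
skewness(log1p(Glass$K))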
To help identify outliers, we can make a boxplot.
library(tidyverse)
library(reshape2)
melteddf <- melt(Glass, id.vars = "Type")
melteddf_1 <- melteddf |> filter(variable %in% c("RI", "Al", "K", "Ba", "Fe"))
melteddf_2 <- melteddf |> filter(variable %in% c("Na", "Ca", "Mg"))
melteddf_3 <- melteddf |> filter(variable == "Si")
# boxplots of all predictors together, then grouped by scale for readability
p <- melteddf |> ggplot(aes(x = variable, y = value)) +
  geom_boxplot() +
  ggtitle("All predictors")
print(p)
p <- melteddf_1 |> ggplot(aes(x = variable, y = value)) +
  geom_boxplot() +
  ggtitle("RI, Al, K, Ba, Fe")
print(p)
p <- melteddf_2 |> ggplot(aes(x = variable, y = value)) +
  geom_boxplot() +
  ggtitle("Na, Ca, Mg")
print(p)
p <- melteddf_3 |> ggplot(aes(x = variable, y = value)) +
  geom_boxplot() +
  ggtitle("Si")
print(p)
There is at least one outlier in each of the variables. The most outliers appear to occur in Ca.
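As a rough quantitative check (a sketch using the conventional 1.5 × IQR boxplot rule as the definition of an outlier, which is an assumption on our part), we can count the points falling outside the whiskers for each predictor:
# count points beyond 1.5 * IQR from the quartiles for each numeric predictor
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
}
sapply(Glass[, 1:9], count_outliers)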
Yes, there are transformations that can be applied to some of the predictors that may improve the classification model; a sketch applying several of them follows the list below.
OUTLIERS: According to the book, the first step for outliers is to check whether they are valid or whether an error could have occurred. You have to be careful not to remove an outlier that may be genuine, especially if the sample is small. If there are real outliers, one option is the spatial sign transformation, which projects the predictors onto a multidimensional sphere so that all samples are the same distance from the center.
CENTERING & SCALING: According to the book, it can be useful to center and scale the predictor variables. To center, the average predictor value is subtracted from all of the values, giving the predictor a mean of zero; to scale, each value is divided by the predictor's standard deviation, giving it a standard deviation of one. This can improve numerical stability, and some models benefit from predictors being on a common scale.
SKEWNESS: Replacing the data with its log, square root, or inverse can help remove skew; after the transformation the distribution should be closer to symmetric. The Box-Cox procedure uses the data themselves to estimate which transformation should be applied.
MISSING VALUES: Missing data can be removed or imputed (i.e., the missing values are estimated from the rest of the data).
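As a sketch of how these transformations could be applied (this uses caret's preProcess(); the particular combination of methods shown is our assumption rather than a prescribed recipe), the Box-Cox, centering, scaling, and spatial sign steps can be estimated and applied in one pass:
library(caret)
# estimate Box-Cox (where the data are strictly positive), centering, scaling,
# and the spatial sign transformation from the nine numeric predictors
glass_pp <- preProcess(Glass[, 1:9],
                       method = c("BoxCox", "center", "scale", "spatialSign"))
glass_pp
# apply the estimated transformations and recheck skewness
glass_trans <- predict(glass_pp, Glass[, 1:9])
apply(glass_trans, 2, skewness)
Predictors containing zeros (e.g., K, Ba, Fe) cannot be Box-Cox transformed, so they are expected to be skipped by that step; a Yeo-Johnson transformation ("YeoJohnson" in preProcess) could be substituted for those if needed.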
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
library(mlbench)
data(Soybean)
?Soybean
Below are the plots of the frequency distributions for the categorical predictors. According to the book, degenerate data can come in a variety of forms, and removing it can significantly improve model performance. One issue is a predictor with a single unique value; it carries no information and can be discarded. Another is a "near-zero variance predictor", where only a handful of unique values occur and a single value accounts for most of the samples. In summary, if the fraction of unique values relative to the sample size is low (under 10%) and the ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (around 20), it may be appropriate to remove the predictor. Another example of degenerate data is data that is collinear or multicollinear.
library(mlbench)
data(Soybean)
# ?Soybean
for (i in 2:ncol(Soybean)) {
  p <- ggplot(Soybean, aes(x = .data[[colnames(Soybean)[i]]])) +
    geom_bar() +
    ggtitle(colnames(Soybean)[i]) +
    xlab(colnames(Soybean)[i])
  print(p)
}
Looking at the plots, there are multiple cases where the ratio of the most common value to the second most common value is over 20. The most dramatic case appears to be the mycelium predictor, where over 600 samples take the value 0 and fewer than 10 take the value 1, with the rest being NA. In addition, the number of unique values relative to the sample size is generally low.
The book demonstrates the caret package's nearZeroVar() function, which returns the column numbers of predictors that are degenerate.
library(caret)
degen <- nearZeroVar(Soybean)
colnames(Soybean[degen])
## [1] "leaf.mild" "mycelium" "sclerotia"
If we refer back to the plots, we can see that in each of these cases the ratio of the most prevalent value to the second most prevalent value is very high.
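To quantify this, a short sketch using the saveMetrics = TRUE option of the same nearZeroVar() function reports the frequency ratio and percentage of unique values for each predictor, which can be compared against the rough thresholds (ratio near 20, unique fraction under 10%) mentioned above:
# frequency ratio and percent-unique metrics for every column
nzv_metrics <- nearZeroVar(Soybean, saveMetrics = TRUE)
# show only the columns flagged as near-zero variance
nzv_metrics[nzv_metrics$nzv, ]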
First, calculate the number of missing values for each predictor.
library(DT)
soybean_nas <- Soybean |> summarise(across(everything(), ~ sum(is.na(.))))
soybean_nas <- soybean_nas[2:length(soybean_nas)]
soybean_nas_long <- soybean_nas |> pivot_longer(everything(),
names_to = "predictor",
values_to = "na_count")
datatable(soybean_nas_long |> arrange(desc(na_count)))
soybean_nas_long |>
  ggplot(aes(x = fct_reorder(predictor, na_count), y = na_count)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  labs(x = "predictor", title = "Count of NA for each Predictor")
The predictors with the most missing values are "hail", "sever", "seed.tmt", and "lodging".
Next, we need to see if any specific class is more likely to have NAs than other classes.
df_soybean <- Soybean
df_soybean$row_sum_na <- rowSums(is.na(Soybean))
class_nas_all <- df_soybean |>
group_by(Class) |>
summarise(class_nas = sum(row_sum_na)) |>
arrange(desc(class_nas))
head(class_nas_all)
Yes, specific classes are much more likely to contain NAs than others, as shown above: five classes account for all of the NAs, and the remaining classes have none.
Based on the analysis in part (a), "leaf.mild", "mycelium", and "sclerotia" are degenerate predictors and can be removed.
# drop the degenerate columns identified by nearZeroVar()
df_soybean <- df_soybean[-degen]
Next, for imputation, one option is K-nearest neighbors (KNN). According to the book, the "advantage of this approach is that the imputed data are confined to be within the range of the training set values... one disadvantage is that the entire training set is required every time a missing value needs to be imputed". In this case the dataset is not large, so we can use KNN to impute.
library(DMwR2)
# impute
# https://www.rdocumentation.org/packages/DMwR2/versions/0.0.2/topics/knnImputation
soybean_imputed <- knnImputation(df_soybean,k=3)
# check for any NAs
# https://discuss.analyticsvidhya.com/t/how-can-i-check-whether-my-data-frame-contains-na-inf-values-in-some-column-or-not-in-r/1647
apply(soybean_imputed, 2, function(x) any(is.na(x)))
## Class date plant.stand precip temp
## FALSE FALSE FALSE FALSE FALSE
## hail crop.hist area.dam sever seed.tmt
## FALSE FALSE FALSE FALSE FALSE
## germ plant.growth leaves leaf.halo leaf.marg
## FALSE FALSE FALSE FALSE FALSE
## leaf.size leaf.shread leaf.malf stem lodging
## FALSE FALSE FALSE FALSE FALSE
## stem.cankers canker.lesion fruiting.bodies ext.decay int.discolor
## FALSE FALSE FALSE FALSE FALSE
## fruit.pods fruit.spots seed mold.growth seed.discolor
## FALSE FALSE FALSE FALSE FALSE
## seed.size shriveling roots row_sum_na
## FALSE FALSE FALSE FALSE
All NA values were replaced.
Consider the pigs series — the number of pigs slaughtered in Victoria each month.
First, according to the book, simple exponential smoothing "is suitable for forecasting data with no clear trend or seasonal pattern", so let's take a look at the series before calling ses().
library(fpp2)
data(pigs)
p <- autoplot(pigs) + ggtitle("Pigs slaughtered in Victoria each month")
print(p)
There does not appear to be a seasonal pattern; there may be a slight trend, but we will proceed with ses().
pigs_forecast <- ses(pigs, h = 4)
summary(pigs_forecast)
##
## Forecast method: Simple exponential smoothing
##
## Model Information:
## Simple exponential smoothing
##
## Call:
## ses(y = pigs, h = 4)
##
## Smoothing parameters:
## alpha = 0.2971
##
## Initial states:
## l = 77260.0561
##
## sigma: 10308.58
##
## AIC AICc BIC
## 4462.955 4463.086 4472.665
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 385.8721 10253.6 7961.383 -0.922652 9.274016 0.7966249 0.01282239
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Sep 1995 98816.41 85605.43 112027.4 78611.97 119020.8
## Oct 1995 98816.41 85034.52 112598.3 77738.83 119894.0
## Nov 1995 98816.41 84486.34 113146.5 76900.46 120732.4
## Dec 1995 98816.41 83958.37 113674.4 76092.99 121539.8
We will use the point forecast given by ses() as yhat and compute the 95% interval as yhat ± 1.96 s, where s is the standard deviation of the residuals.
pu <- pigs_forecast$mean[1] + 1.96 * sd(pigs_forecast$residuals)
pl <- pigs_forecast$mean[1] - 1.96 * sd(pigs_forecast$residuals)
paste0("95% Upper: ", pu, " 95% Lower: ", pl)
## [1] "95% Upper: 118952.844969765 95% Lower: 78679.9672534162"
The upper limit of the 95% prediction interval from ses() was 119,020.8, whereas the one computed from the standard deviation of the residuals was 118,952.8; the lower limits were 78,611.97 and 78,679.97, respectively. The interval from ses() therefore has a slightly higher upper limit and a slightly lower lower limit, i.e., it is a little wider than the interval calculated from the residuals. However, the difference is only about 68 pigs against values of roughly 100,000, so in practice the two intervals are nearly the same.
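For a direct side-by-side comparison, the bounds can also be pulled from the forecast object itself (assuming the usual structure of objects returned by the forecast package, with lower and upper matrices whose columns are named by confidence level):
# 95% interval from ses() for the first forecast horizon
c(lower = pigs_forecast$lower[1, "95%"],
  upper = pigs_forecast$upper[1, "95%"])
# interval computed from the residual standard deviation
c(lower = pl, upper = pu)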
Write your own function to implement simple exponential smoothing. The function should take arguments y (the time series), alpha (the smoothing parameter α) and level (the initial level ℓ0 ). It should return the forecast of the next observation in the series.
Does it give the same forecast as ses()?
First we will write the simple exponential smoothing.
# y = time series
# alpha = smoothing parameter
# (this first version does not use the initial level l0; see the rewrite below)
# weighted average form:
# yhat[T+1|T] = alpha*y[T] + alpha*(1-alpha)*y[T-1] + alpha*(1-alpha)^2*y[T-2] + ...
ses2 <- function(y, alph) {
y.hat <- alph*y[length(y)] # declare first term of yhat
for (i in 1:(length(y)-1)){
next.weighted <- alph*((1-alph)^i)*y[length(y)-i]
y.hat <- y.hat + next.weighted
}
forecast.next.obs <- y.hat
print(forecast.next.obs)
return(forecast.next.obs)
}
ses2(pigs, 0.2971)
## [1] 98816.45
## [1] 98816.45
This first version did not actually use the initial level. We now rewrite the function so that the level is used, following the weighted average form section of the book.
# y = time series
# alpha = smoothing parameter
# level = initial level ("The process has to start somewhere, so we let the first
#   fitted value at time 1 be denoted by l0")
# weighted average form:
# yhat[T+1|T] = sum_{j=0}^{T-1} alpha*(1-alpha)^j * y[T-j] + (1-alpha)^T * l0
# (implemented below via the equivalent recursion l[t] = alpha*y[t] + (1-alpha)*l[t-1])
ses3 <- function(y, alph, lev) {
y.hat <- lev # declare first term of yhat
for (i in 1:(length(y))){
y.hat <- alph*y[i] + (1-alph)*y.hat
}
forecast.next.obs <- y.hat
print(forecast.next.obs)
return(forecast.next.obs)
}
ses3(pigs, 0.2971, 77260.0561)
## [1] 98816.45
## [1] 98816.45
Now we can compare these forecasts with the one from ses(). The ses() point forecast was 98,816.41, while both ses2() and ses3() (using the alpha and initial level values reported by ses()) returned 98,816.45, so the functions give essentially the same result.
Modify your function from the previous exercise to return the sum of squared errors rather than the forecast of the next observation. Then use the optim() function to find the optimal values of α and ℓ0. Do you get the same values as the ses() function?
First we will adapt the function from 7.2 to return the sum of squared errors. For optim(), the initial values of the parameters to be optimized must be supplied through the par argument, so rather than taking alph and lev as separate arguments, they are combined into a single vector.
# SSE = sum over t of (y[t] - yhat[t|t-1])^2
sse_fun <- function(par, y) {
alph <- par[1]
lev <- par[2]
y.hat <- lev # declare first term of yhat
sse <- 0
for (i in 1:(length(y))){
err <- y[i] - y.hat # the current error is the correct value - predicted
err.2 <- err^2 # square it
sse <- sse + err.2 # add it up
y.hat <- alph*y[i] + (1-alph)*y.hat # next prediction
}
return(sse)
}
Next we run optim() to minimize the sum of squared errors. We chose a starting value of alpha = 0.8 to put more emphasis on the recent past, and mean(pigs) as the starting level, since there is no strong trend in the data.
find_opt <- optim(par = c(0.8, mean(pigs)), y = pigs, fn=sse_fun)
find_opt
## $par
## [1] 2.971474e-01 7.727386e+04
##
## $value
## [1] 19765613340
##
## $counts
## function gradient
## 145 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
The convergence is 0, indicating successful completion. The outputs are alpha = 0.2971474 and level = 77,273.86.
Finally, we compare these values to those from the ses() function, which reported alpha = 0.2971 and an initial level of 77,260.0561. These are very close to the values found by optimizing the sse_fun() that we wrote.
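As a final side-by-side check (assuming the fitted model stores its parameters in model$par with names "alpha" and "l", as forecast-package objects typically do), the two sets of estimates can be printed together:
# parameters estimated by ses()
pigs_forecast$model$par[c("alpha", "l")]
# parameters found by optim() on the hand-written SSE function
find_opt$par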