The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
Refractive index (RI) is a numerical variable with a slightly right-skewed distribution and a moderately high correlation (0.81) with the calcium variable (Ca). Sodium (Na), aluminum (Al), and silicon (Si) exhibit almost no skewness, nor are they highly correlated with other variables. The distribution of magnesium (Mg) is bathtub-shaped, and it is only slightly correlated with other variables (aluminum, calcium, and barium). Potassium (K), barium (Ba), and iron (Fe) are highly right-skewed but only slightly correlated with other variables. The RI-Ca correlation is the only one to take particular note of, as it could affect models that are sensitive to collinearity (e.g., linear models). Using the heuristic algorithm proposed by K&J (via the findCorrelation() function), calcium is a candidate for removal due to high correlation. All variables except Type are quantitative; Type is a categorical variable with 6 levels (1, 2, 3, 5, 6, and 7).
# Correlation of quantitative vars
# Load data
data(Glass)
str(Glass)## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
# Filter for just quantitative vars
glassQuant <- Glass %>% select (-Type)
# Corr chart
chart.Correlation(glassQuant)
# Corr matrix
corr1 <- cor(glassQuant)
corr1## RI Na Mg Al Si K
## RI 1.0000000000 -0.19188538 -0.122274039 -0.40732603 -0.54205220 -0.289832711
## Na -0.1918853790 1.00000000 -0.273731961 0.15679367 -0.06980881 -0.266086504
## Mg -0.1222740393 -0.27373196 1.000000000 -0.48179851 -0.16592672 0.005395667
## Al -0.4073260341 0.15679367 -0.481798509 1.00000000 -0.00552372 0.325958446
## Si -0.5420521997 -0.06980881 -0.165926723 -0.00552372 1.00000000 -0.193330854
## K -0.2898327111 -0.26608650 0.005395667 0.32595845 -0.19333085 1.000000000
## Ca 0.8104026963 -0.27544249 -0.443750026 -0.25959201 -0.20873215 -0.317836155
## Ba -0.0003860189 0.32660288 -0.492262118 0.47940390 -0.10215131 -0.042618059
## Fe 0.1430096093 -0.24134641 0.083059529 -0.07440215 -0.09420073 -0.007719049
## Ca Ba Fe
## RI 0.8104027 -0.0003860189 0.143009609
## Na -0.2754425 0.3266028795 -0.241346411
## Mg -0.4437500 -0.4922621178 0.083059529
## Al -0.2595920 0.4794039017 -0.074402151
## Si -0.2087322 -0.1021513105 -0.094200731
## K -0.3178362 -0.0426180594 -0.007719049
## Ca 1.0000000 -0.1128409671 0.124968219
## Ba -0.1128410 1.0000000000 -0.058691755
## Fe 0.1249682 -0.0586917554 1.000000000
# Corr plot
corrplot(corr1, order='hclust', type='full')
# Find correlations past cutoff
print(paste0("Candidate for removal due to high correlation: ", findCorrelation(corr1, cutoff=0.8, exact=T, verbose=T, names=T)))## Compare row 7 and column 1 with corr 0.81
## Means: 0.319 vs 0.217 so flagging column 7
## All correlations <= 0.8
## [1] "Candidate for removal due to high correlation: Ca"
Do there appear to be any outliers in the data? Are any predictors skewed?
Based on the histograms from the correlation chart, larger histograms were generated to look for outliers in the suspect variables (K, Ca, Ba, and Fe). Outliers were flagged as values lying more than three standard deviations from the mean. Of the variables examined, the two potassium values of 6.21 stand out; other values fell outside the three-standard-deviation range, but visually they do not appear that unusual.
Several columns exhibit some skewness, although none surpasses the threshold of 20 that K&J associate with high skewness. Other sources suggest that skewness values greater than 1 indicate a highly skewed distribution; by that criterion, RI, Mg, K, Ca, Ba, and Fe qualify.
# Look for outliers
par(mfrow=c(2, 2))
hist(Glass$K, main='Histogram of K', xlab='K')
hist(Glass$Ca, main='Histogram of Ca', xlab='Ca')
hist(Glass$Ba, main='Histogram of Ba', xlab='Ba')
hist(Glass$Fe, main='Histogram of Fe', xlab='Fe')
# Examine K outliers beyond 3 standard deviations of mean
print("K outliers beyond 3 standard deviations of mean:")## [1] "K outliers beyond 3 standard deviations of mean:"
Glass[Glass$K > mean(Glass$K) + 3 * sd(Glass$K) | Glass$K < mean(Glass$K) - 3 * sd(Glass$K),]$K## [1] 6.21 6.21 2.70
# Examine Ca outliers beyond 3 standard deviations of mean
print("Ca outliers beyond 3 standard deviations of mean:")## [1] "Ca outliers beyond 3 standard deviations of mean:"
Glass[Glass$Ca > mean(Glass$Ca) + 3 * sd(Glass$Ca) | Glass$Ca < mean(Glass$Ca) - 3 * sd(Glass$Ca),]$Ca## [1] 13.24 13.30 16.19 14.68 14.96 14.40 13.44
# Examine Ba outliers beyond 3 standard deviations of mean
print("Ba outliers beyond 3 standard deviations of mean:")## [1] "Ba outliers beyond 3 standard deviations of mean:"
Glass[Glass$Ba > mean(Glass$Ba) + 3 * sd(Glass$Ba) | Glass$Ba < mean(Glass$Ba) - 3 * sd(Glass$Ba),]$Ba## [1] 3.15 2.20 1.68 1.71 2.88 1.67
# Examine Fe outliers beyond 3 standard deviations of mean
print("Fe outliers beyond 3 standard deviations of mean:")## [1] "Fe outliers beyond 3 standard deviations of mean:"
Glass[Glass$Fe > mean(Glass$Fe) + 3 * sd(Glass$Fe) | Glass$Fe < mean(Glass$Fe) - 3 * sd(Glass$Fe),]$Fe## [1] 0.35 0.37 0.51
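The four nearly identical filters above could be wrapped in a small helper. A minimal sketch (the helper name outliers_3sd is hypothetical, not part of the original code):
# Return the values of x lying more than 3 standard deviations from the mean
outliers_3sd <- function(x) x[abs(x - mean(x)) > 3 * sd(x)]
# Apply to the suspect variables; the result is a list because the counts differ
sapply(Glass[, c('K', 'Ca', 'Ba', 'Fe')], outliers_3sd)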
# Look for skewness
# As a general rule of thumb: If skewness is less than -1 or greater than 1, the distribution is highly skewed.
# If skewness is between -1 and -0.5 or between 0.5 and 1, the distribution is moderately skewed.
# If skewness is between -0.5 and 0.5, the distribution is approximately symmetric.
print("Skewness of columns:")## [1] "Skewness of columns:"
colskewness <- apply(glassQuant, 2, e1071::skewness, type=1)
colskewness## RI Na Mg Al Si K Ca
## 1.6140150 0.4509917 -1.1444648 0.9009179 -0.7253173 6.5056358 2.0326774
## Ba Fe
## 3.3924309 1.7420068
# Show skewed columns
print("Skewed columns (skewness < -1 or > 1):")## [1] "Skewed columns (skewness < -1 or > 1):"
skewedcols <- glassQuant[colskewness < -1 | colskewness > 1]
colnames(skewedcols)## [1] "RI" "Mg" "K" "Ca" "Ba" "Fe"
Are there any relevant transformations of one or more predictors that might improve the classification model?
Six of the predictors could benefit from transformation. The skewness values and histograms of untransformed, log-transformed, and Box-Cox-transformed data were compared to evaluate which transformation would work best. Note that Mg, K, Ba, and Fe contain zeros, so the log transform is undefined for some observations (hence the NaN skewness values below) and the Box-Cox procedure leaves those columns untransformed; only RI and Ca receive an estimated lambda. A transformation that accommodates zeros, such as Yeo-Johnson, is sketched after the comparison below.
# Possible transformations of skewed columns;
# look at log histograms first
par(mfrow=c(2, 3))
hist(log(Glass$RI), main='Log Transform of RI', xlab='RI')
hist(log(Glass$Mg), main='Log Transform of Mg', xlab='Mg')
hist(log(Glass$K), main='Log Transform of K', xlab='K')
hist(log(Glass$Ca), main='Log Transform of Ca', xlab='Ca')
hist(log(Glass$Ba), main='Log Transform of Ba', xlab='Ba')
hist(log(Glass$Fe), main='Log Transform of Fe', xlab='Fe')
# Box-Cox transformation - single var
bctrans <- BoxCoxTrans(Glass$Ca)
bccol <- predict(bctrans, Glass$Ca)
skewness(bccol)## [1] -0.1939557
# Box-Cox transformation - use "preProcess" to transform multiple cols
bctrans <- preProcess(skewedcols, method=c('BoxCox'))
bccols <- predict(bctrans, skewedcols)
bctrans$bc## $RI
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.511 1.517 1.518 1.518 1.519 1.534
##
## Largest/Smallest: 1.02
## Sample Skewness: 1.6
##
## Estimated Lambda: -2
##
##
## $Ca
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.430 8.240 8.600 8.957 9.172 16.190
##
## Largest/Smallest: 2.98
## Sample Skewness: 2.02
##
## Estimated Lambda: -1.1
# Show histograms of Box-Cox-transformed columns
par(mfrow=c(2, 3))
hist(bccols$RI, main='Box-Cox Transform of RI', xlab='RI')
hist(bccols$Ca, main='Box-Cox Transform of Ca', xlab='Ca')
# Compare transforms
par(mfrow=c(2,3))
hist(Glass$RI, main='RI (Untransformed)', xlab='RI')
hist(log(Glass$RI), main='RI (Log Transformation)', xlab='RI')
hist(bccols$RI, main='RI (Box-Cox Transformation)', xlab='RI')
hist(Glass$Mg, main='Mg (Untransformed)', xlab='Mg')
hist(log(Glass$Mg), main='Mg (Log Transformation)', xlab='Mg')
plot.new()
par(mfrow=c(2,3))
hist(Glass$K, main='K (Untransformed)', xlab='K')
hist(log(Glass$K), main='K (Log Transformation)', xlab='K')
plot.new()
hist(Glass$Ca, main='Ca (Untransformed)', xlab='Ca')
hist(log(Glass$Ca), main='Ca (Log Transformation)', xlab='Ca')
hist(bccols$Ca, main='Ca (Box-Cox Transformation)', xlab='Ca')
par(mfrow=c(2,3))
hist(Glass$Ba, main='Ba (Untransformed)', xlab='Ba')
hist(log(Glass$Ba), main='Ba (Log Transformation)', xlab='Ba')
plot.new()
hist(Glass$Fe, main='Fe (Untransformed)', xlab='Fe')
hist(log(Glass$Fe), main='Fe (Log Transformation)', xlab='Fe')
plot.new()
# Check skewness
tmpdf <- as.data.frame(list(Non.Transformed.RI=Glass$RI, Log.RI=log(Glass$RI), Box.Cox.RI=bccols$RI))
apply(tmpdf, 2, e1071::skewness, type=1)## Non.Transformed.RI Log.RI Box.Cox.RI
## 1.614015 1.601538 1.576699
tmpdf <- as.data.frame(list(Non.Transformed.Mg=Glass$Mg, Log.Mg=log(Glass$Mg), Box.Cox.Mg=bccols$Mg))
apply(tmpdf, 2, e1071::skewness, type=1)## Non.Transformed.Mg Log.Mg Box.Cox.Mg
## -1.144465 NaN -1.144465
tmpdf <- as.data.frame(list(Non.Transformed.K=Glass$K, Log.K=log(Glass$K), Box.Cox.K=bccols$K))
apply(tmpdf, 2, e1071::skewness, type=1)## Non.Transformed.K Log.K Box.Cox.K
## 6.505636 NaN 6.505636
tmpdf <- as.data.frame(list(Non.Transformed.Ca=Glass$Ca, Log.Ca=log(Glass$Ca), Box.Cox.Ca=bccols$Ca))
apply(tmpdf, 2, e1071::skewness, type=1)## Non.Transformed.Ca Log.Ca Box.Cox.Ca
## 2.0326774 1.0588744 -0.1953232
tmpdf <- as.data.frame(list(Non.Transformed.Ba=Glass$Ba, Log.Ba=log(Glass$Ba), Box.Cox.Ba=bccols$Ba))
apply(tmpdf, 2, e1071::skewness, type=1)## Non.Transformed.Ba Log.Ba Box.Cox.Ba
## 3.392431 NaN 3.392431
tmpdf <- as.data.frame(list(Non.Transformed.Fe=Glass$Fe, Log.Fe=log(Glass$Fe), Box.Cox.Fe=bccols$Fe))
apply(tmpdf, 2, e1071::skewness, type=1)## Non.Transformed.Fe Log.Fe Box.Cox.Fe
## 1.742007 NaN 1.742007
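Because Mg, K, Ba, and Fe contain zeros, neither the log nor the Box-Cox transform changes them above. A Yeo-Johnson transformation accommodates zero (and negative) values; a minimal sketch using caret's preProcess, not part of the original analysis:
# Yeo-Johnson transformation of the skewed columns (handles zeros)
yjtrans <- preProcess(skewedcols, method='YeoJohnson')
yjcols <- predict(yjtrans, skewedcols)
# Compare skewness after the transformation
apply(yjcols, 2, e1071::skewness, type=1)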
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
Using the nearZeroVar() function, the following variables appear to be degenerate in the sense described by K&J, i.e., they have near-zero variance: leaf.mild, mycelium, and sclerotia.
As the summary output shows, each of these columns has a dominant class that far exceeds its next most populous class.
# Load data
data(Soybean)
str(Soybean)## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
# Generate frequency tables to look for degenerate distributions
multifunc2 <- function(x) {
return(prop.table(table(x)))
}
#apply(Soybean, 2, table) # freq table - produces long output; use nzv instead
#apply(Soybean, 2, multifunc2) # proportion table - produces long output; use nzv instead
# Degenerate vars: if both of these are true:
# 1. fraction of unique vals over the sample size < 10%
# 2. ratio of freq of the most prevalent to the freq of the second most prevalent is > 20
nzv <- nearZeroVar(Soybean)
print(colnames(Soybean[,nzv]))## [1] "leaf.mild" "mycelium" "sclerotia"
# Examine NZV columns
summary(Soybean[,nzv])## leaf.mild mycelium sclerotia
## 0 :535 0 :639 0 :625
## 1 : 20 1 : 6 1 : 20
## 2 : 20 NA's: 38 NA's: 38
## NA's:108
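The two criteria noted in the comments above (percent of unique values and the frequency ratio of the two most common values) can be inspected directly with nearZeroVar()'s saveMetrics argument; a short sketch, assuming caret is loaded:
# Show the frequency-ratio and percent-unique metrics for the flagged columns
nzvMetrics <- nearZeroVar(Soybean, saveMetrics=TRUE)
nzvMetrics[nzvMetrics$nzv, ]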
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
As shown in the md.pattern plot below, there are 562 complete cases (out of 683), or roughly 82% of the data. Among the variables with more than 15% missing values, most are related to plant pathology.
We verified that the missing values weren’t due to the absence of the specific pathology, as negative cases were also reported. So the values do appear to be missing as opposed to being intentionally or structurally absent.
# Look for missing values
summary(Soybean)## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ plant.growth
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild
## 0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :535
## 1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 : 20
## 2 :342 2 :221 2 :221 NA's:100 NA's: 84 2 : 20
## NA's: 84 NA's: 84 NA's: 84 NA's:108
##
##
##
## stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 0 :296 0 :520 0 :379 0 :320 0 :473 0 :497
## 1 :371 1 : 42 1 : 39 1 : 83 1 :104 1 :135
## NA's: 16 NA's:121 2 : 36 2 :177 NA's:106 2 : 13
## 3 :191 3 : 65 NA's: 38
## NA's: 38 NA's: 38
##
##
## mycelium int.discolor sclerotia fruit.pods fruit.spots seed
## 0 :639 0 :581 0 :625 0 :407 0 :345 0 :476
## 1 : 6 1 : 44 1 : 20 1 :130 1 : 75 1 :115
## NA's: 38 2 : 20 NA's: 38 2 : 14 2 : 57 NA's: 92
## NA's: 38 3 : 48 4 :100
## NA's: 84 NA's:106
##
##
## mold.growth seed.discolor seed.size shriveling roots
## 0 :524 0 :513 0 :532 0 :539 0 :551
## 1 : 67 1 : 64 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 NA's: 92 NA's:106 2 : 15
## NA's: 31
##
##
##
# Select columns where % of missing values > 15%
over15missing <- (colMeans(is.na(Soybean)) * 100) > 15
print("Vars with over 15% of data missing:")## [1] "Vars with over 15% of data missing:"
colMeans(is.na(Soybean[,over15missing])) * 100## hail sever seed.tmt germ leaf.mild
## 17.71596 17.71596 17.71596 16.39824 15.81259
## lodging fruiting.bodies fruit.spots seed.discolor shriveling
## 17.71596 15.51977 15.51977 15.51977 15.51977
# Generate plots of soybean class vs variables with missing values > 15%
par(mfrow=c(3,4))
for (i in seq(1, length(over15missing))) {
if (over15missing[i] == T) {
plot(x=Soybean[,i], xlab=colnames(Soybean[i]))
}
}
mtext("Soybean class vs variables with proportion of missing values > 15%", outer=T, line=-2)# Generate ggcorrplot correlating missing values and soybean class:
# Make a new vector containing column names of variables with missing values > 15%;
# also include the outcome variable, soybean class.
class_and_over15 <- over15missing
class_and_over15[1] <- T
# Convert to a model matrix, then generate correlations and the correlation plot
model.matrix(~.+0, data=Soybean[,class_and_over15]) %>%
cor(use="pairwise.complete.obs") %>%
ggcorrplot(show.diag=FALSE, type="lower", lab=TRUE, lab_size=2)
# Generate missing data pattern plot; blue=observed, red=missing.
# Row labels at left give the number of times that pattern occurs.
# Row labels at right give the number of columns with missing values (i.e., the number of red cells).
# Totals along the bottom are the number of missing values in each variable.
md.pattern(Soybean, rotate.names=T, plot=T)## Class leaves date area.dam crop.hist plant.growth stem temp roots
## 562 1 1 1 1 1 1 1 1 1
## 13 1 1 1 1 1 1 1 1 1
## 55 1 1 1 1 1 1 1 1 1
## 8 1 1 1 1 1 1 1 1 1
## 9 1 1 1 1 1 1 1 1 0
## 6 1 1 1 1 1 1 1 1 0
## 14 1 1 1 1 1 1 1 0 1
## 15 1 1 1 1 0 0 0 0 0
## 1 1 1 0 0 0 0 0 0 0
## 0 0 1 1 16 16 16 30 31
## plant.stand precip stem.cankers canker.lesion ext.decay mycelium
## 562 1 1 1 1 1 1
## 13 1 1 1 1 1 1
## 55 1 1 1 1 1 1
## 8 1 0 0 0 0 0
## 9 1 1 1 1 1 1
## 6 0 1 1 1 1 1
## 14 0 0 0 0 0 0
## 15 0 0 0 0 0 0
## 1 0 0 0 0 0 0
## 36 38 38 38 38 38
## int.discolor sclerotia leaf.halo leaf.marg leaf.size leaf.malf fruit.pods
## 562 1 1 1 1 1 1 1
## 13 1 1 1 1 1 1 0
## 55 1 1 0 0 0 0 0
## 8 0 0 1 1 1 1 1
## 9 1 1 0 0 0 0 1
## 6 1 1 0 0 0 0 1
## 14 0 0 0 0 0 0 1
## 15 0 0 1 1 1 1 0
## 1 0 0 1 1 1 1 0
## 38 38 84 84 84 84 84
## seed mold.growth seed.size leaf.shread fruiting.bodies fruit.spots
## 562 1 1 1 1 1 1
## 13 0 0 0 1 0 0
## 55 0 0 0 0 0 0
## 8 0 0 0 1 0 0
## 9 1 1 1 0 1 1
## 6 1 1 1 0 1 1
## 14 1 1 1 0 0 0
## 15 0 0 0 0 0 0
## 1 0 0 0 0 0 0
## 92 92 92 100 106 106
## seed.discolor shriveling leaf.mild germ hail sever seed.tmt lodging
## 562 1 1 1 1 1 1 1 1 0
## 13 0 0 1 0 0 0 0 0 13
## 55 0 0 0 0 0 0 0 0 19
## 8 0 0 0 0 0 0 0 0 20
## 9 1 1 0 1 0 0 0 0 11
## 6 1 1 0 0 0 0 0 0 13
## 14 0 0 0 0 0 0 0 0 24
## 15 0 0 0 0 0 0 0 0 28
## 1 0 0 0 0 0 0 0 0 30
## 106 106 108 112 121 121 121 121 2337
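To make the relationship between missingness and class explicit, the proportion of incomplete rows can also be tabulated by class. A minimal sketch in base R, not part of the original output:
# Proportion of rows with at least one missing predictor value, by soybean class
sort(tapply(rowSums(is.na(Soybean[, -1])) > 0, Soybean$Class, mean), decreasing=TRUE)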
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
As shown in the md.pattern plot above, several missingness patterns are of immediate concern because more than half of the variables are missing in them; the corresponding observations are candidates for removal.
Using this strategy, 86.4% of the original observations are retained, and 4.7% of the remaining rows have at least one missing value.
# Create copy of the original data set
sb2 <- Soybean
# Remove the observations in which 19 or more variables have missing values
most_complete <- rowSums(is.na(sb2)) < 19
sb2 <- sb2[most_complete,]
# Recalculate number of rows containing missing values
some_missing <- rowSums(is.na(sb2)) > 0
print(paste0("Proportion of observations with missing values after removing those with high proportion of missing variables: ",
round(nrow(sb2[some_missing,]) / nrow(sb2), 3)))## [1] "Proportion of observations with missing values after removing those with high proportion of missing variables: 0.047"
print(paste0("Proportion of observations retained from original data set: ",
round(nrow(sb2) / nrow(Soybean), 3)))## [1] "Proportion of observations retained from original data set: 0.864"
An alternative approach would be to remove the three variables with degenerate distributions (leaf.mild, mycelium, and sclerotia) and, additionally, most of the variables flagged above as having a high proportion of missing data (hail, sever, seed.tmt, germ, lodging, fruiting.bodies, fruit.spots, and seed.discolor). However, removing variables alone still leaves 17.7% of observations with at least one missing value, so combining the removal of some observations with the removal of these variables is more effective. The disadvantage is that 11 variables would be dropped, nearly a third of all predictors.
# Create copy of original data set
sb3 <- Soybean
# Remove the observations in which 24 or more variables have missing values
most_complete <- rowSums(is.na(sb3)) < 24
sb3 <- sb3[most_complete,]
# Remove variables with degenerate classes or having high proportion of missing values (11 all together)
sb3 <- sb3 %>%
select(-leaf.mild, -mycelium, -sclerotia, -hail, -sever, -seed.tmt, -germ, -lodging, -fruiting.bodies, -fruit.spots, -seed.discolor)
# Recalculate number of rows containing missing values
some_missing <- rowSums(is.na(sb3)) > 0
print(paste0("Proportion of observations with missing values after removing degenerate variables: ",
round(nrow(sb3[some_missing,]) / nrow(sb3), 3)))## [1] "Proportion of observations with missing values after removing degenerate variables: 0.139"
print(paste0("Proportion of observations retained from original data set: ",
round(nrow(sb3) / nrow(Soybean), 3)))## [1] "Proportion of observations retained from original data set: 0.956"
Another approach would be to impute the missing data. The mice package (multivariate imputation by chained equations) is an appropriate tool for this. Even with imputation, the observations in which 24 or more variables are missing would remain problematic, so those are removed before imputing. As shown below, the result is a data set with no missing values, and over 95% of the original observations are retained.
# Create copy of original data set
sb4 <- Soybean
# Remove the observations in which 24 or more variables have missing values
most_complete <- rowSums(is.na(sb4)) < 24
sb4 <- sb4[most_complete,]
# Recalculate number of rows containing missing values
some_missing <- rowSums(is.na(sb4)) > 0
print(paste0("Proportion of observations with missing values after removing degenerate variables: ",
round(nrow(sb4[some_missing,]) / nrow(sb4), 3)))## [1] "Proportion of observations with missing values after removing degenerate variables: 0.139"
print(paste0("Proportion of observations retained from original data set: ",
round(nrow(sb4) / nrow(Soybean), 3)))## [1] "Proportion of observations retained from original data set: 0.956"
# Impute missing data
imp <- mice(sb4, maxit=5, m=5, seed=777)##
## iter imp variable
## 1 1 plant.stand precip hail sever seed.tmt germ leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild lodging stem.cankers canker.lesion fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## (the same list of imputed variables is printed for each of the remaining iterations and imputations; output truncated)
# Complete the data set using imputed values
sb4 <- complete(imp)
# Recalculate number of rows containing missing values
some_missing <- rowSums(is.na(sb4)) > 0
print(paste0("Proportion of observations with missing values after removing degenerate variables: ",
round(nrow(sb4[some_missing,]) / nrow(sb4), 3)))## [1] "Proportion of observations with missing values after removing degenerate variables: 0"
print(paste0("Proportion of observations retained from original data set: ",
round(nrow(sb4) / nrow(Soybean), 3)))## [1] "Proportion of observations retained from original data set: 0.956"
Consider the pigs series — the number of pigs slaughtered in Victoria each month.
Use the ses() function in R to find the optimal values of α and \(ℓ_0\), and generate forecasts for the next four months.
Using SES, 98816.41 pigs are forecast to be slaughtered during each of the four months following August 1995. The initial state, \(ℓ_0\), was estimated to be 77260.0561, with an α of 0.2971. The low value of α means that relatively more weight is given to observations further in the past than would be the case with an α closer to 1. The 80% and 95% prediction intervals are wide, indicating that the forecasts are quite uncertain.
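To illustrate how an α of 0.2971 spreads weight over past observations, the weights in the weighted-average form of SES, \(\alpha(1-\alpha)^j\) for an observation j periods in the past, can be computed directly. A small illustrative sketch, not part of the original output:
# Weights placed on the most recent six observations under SES with alpha = 0.2971
alpha <- 0.2971
round(alpha * (1 - alpha)^(0:5), 4)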
# Load data
data(pigs)
summary(pigs)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 33873 79080 91662 90640 101493 120184
pigs## Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 1980 76378 71947 33873 96428 105084 95741 110647 100331 94133 103055
## 1981 76889 81291 91643 96228 102736 100264 103491 97027 95240 91680
## 1982 76892 85773 95210 93771 98202 97906 100306 94089 102680 77919
## 1983 81225 88357 106175 91922 104114 109959 97880 105386 96479 97580
## 1984 90974 98981 107188 94177 115097 113696 114532 120110 93607 110925
## 1985 103069 103351 111331 106161 111590 99447 101987 85333 86970 100561
## 1986 82719 79498 74846 73819 77029 78446 86978 75878 69571 75722
## 1987 63292 59380 78332 72381 55971 69750 85472 70133 79125 85805
## 1988 69069 79556 88174 66698 72258 73445 76131 86082 75443 73969
## 1989 66269 73776 80034 70694 81823 75640 75540 82229 75345 77034
## 1990 75982 78074 77588 84100 97966 89051 93503 84747 74531 91900
## 1991 81022 78265 77271 85043 95418 79568 103283 95770 91297 101244
## 1992 93866 95171 100183 103926 102643 108387 97077 90901 90336 88732
## 1993 73292 78943 94399 92937 90130 91055 106062 103560 104075 101783
## 1994 82413 83534 109011 96499 102430 103002 91815 99067 110067 101599
## 1995 88905 89936 106723 84307 114896 106749 87892 100506
## Nov Dec
## 1980 90595 101457
## 1981 101259 109564
## 1982 93561 117062
## 1983 109490 110191
## 1984 103312 120184
## 1985 89543 89265
## 1986 64182 77357
## 1987 81778 86852
## 1988 78139 78646
## 1989 78589 79769
## 1990 81635 89797
## 1991 114525 101139
## 1992 83759 99267
## 1993 93791 102313
## 1994 97646 104930
## 1995
# Plot
pigs %>% autoplot()
# Forecast next 4 months using SES
fit <- ses(pigs, h=4)
summary(fit)##
## Forecast method: Simple exponential smoothing
##
## Model Information:
## Simple exponential smoothing
##
## Call:
## ses(y = pigs, h = 4)
##
## Smoothing parameters:
## alpha = 0.2971
##
## Initial states:
## l = 77260.0561
##
## sigma: 10308.58
##
## AIC AICc BIC
## 4462.955 4463.086 4472.665
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 385.8721 10253.6 7961.383 -0.922652 9.274016 0.7966249 0.01282239
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Sep 1995 98816.41 85605.43 112027.4 78611.97 119020.8
## Oct 1995 98816.41 85034.52 112598.3 77738.83 119894.0
## Nov 1995 98816.41 84486.34 113146.5 76900.46 120732.4
## Dec 1995 98816.41 83958.37 113674.4 76092.99 121539.8
# Plot fitted values
autoplot(fit) +
autolayer(fitted(fit), series="Fitted") +
ylab("Pigs slaughtered") + xlab("Year")Compute a 95% prediction interval for the first forecast using \(\hat{y}\)±1.96s where s is the standard deviation of the residuals. Compare your interval with the interval produced by R.
The prediction interval calculated using \(\hat{y} \pm 1.96s\) yields:
98816.41 ± 20136.44 = (78679.97, 118952.84)
This is close to, but not identical to, the interval produced by the model. Even at h = 1 the two differ slightly because ses() uses its own estimate of the residual standard deviation (sigma = 10308.58) rather than the sample standard deviation of the residuals. For longer horizons the model also widens the interval as a function of α and h, as shown in Table 7.8 of the text: the further into the future the forecast, the more uncertain the prediction and, therefore, the larger the prediction interval.
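As a check, the model's 95% intervals can be reproduced from the σ and α reported above using the SES forecast-variance formula \(\sigma_h = \hat{\sigma}\sqrt{1 + \alpha^2 (h - 1)}\). A sketch, not part of the original output:
# Reproduce the ses() 95% prediction intervals for h = 1..4
sigma_hat <- 10308.58
alpha <- 0.2971
h <- 1:4
sigma_h <- sigma_hat * sqrt(1 + alpha^2 * (h - 1))
point <- as.numeric(fit$mean[1])  # 98816.41; the SES point forecast is constant across horizons
data.frame(h=h, lo95=point - 1.96 * sigma_h, hi95=point + 1.96 * sigma_h)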
# Calculate standard deviaion of the residuals from previous ses fit
sd_res <- sd(fit$residuals)
# Calculate margin of error for 95% prediction interval
me <- sd_res * 1.96
# Calculate lo and hi 95
lo_95 <- fit$mean[1] - me
hi_95 <- fit$mean[1] + me
print(paste0("Prediction interval: ", round(fit$mean[1], 2), " ± ", me, " = (", round(lo_95, 2), ", ", round(hi_95, 2), ")"))## [1] "Prediction interval: 98816.41 ± 20136.4388581745 = (78679.97, 118952.84)"
print("Compare to interval predicted by SES fit:")## [1] "Compare to interval predicted by SES fit:"
for (i in seq(1, 4)) {
print(paste0(" ", round(fit$mean[i], 2), " ± ", me,
" = (", round(fit$lower[i, 2], 2), ", ", round(fit$upper[i, 2], 2), ")"))
}## [1] " 98816.41 ± 20136.4388581745 = (78611.97, 119020.84)"
## [1] " 98816.41 ± 20136.4388581745 = (77738.83, 119893.98)"
## [1] " 98816.41 ± 20136.4388581745 = (76900.46, 120732.35)"
## [1] " 98816.41 ± 20136.4388581745 = (76092.99, 121539.82)"
Write your own function to implement simple exponential smoothing. The function should take arguments y (the time series), alpha (the smoothing parameter α) and level (the initial level \(ℓ_0\)). It should return the forecast of the next observation in the series. Does it give the same forecast as ses()?
Using the weighted-average form of the SES equation:
\(\hat{y}_{t+1|t} = \alpha y_t + (1 - \alpha)\,\hat{y}_{t|t-1}\)
Using the values produced by the ses() function for α and \(ℓ_0\), the custom function returns the same forecast as ses(): 98816 (rounded) in both cases. Various values of \(ℓ_0\) were tried and produced essentially the same result; with 188 observations, the weight the forecast places on the initial level, \((1-\alpha)^T\), is vanishingly small. Varying alpha, however, changed the forecast noticeably, indicating that the outcome depends heavily on the chosen value of alpha.
# Function to implement SES
my_ses <- function(y, alpha, level) {
print(paste('before:', level))
# Initialize vector to store forecast values
fitted <- c()
# Set T to be length of time series
T <- length(y)
# Iterate over terms
for (j in seq(1, T)) {
if (j == 1) {
fitted[j] <- level
} else {
fitted[j] <- alpha * (y[j - 1]) + (1 - alpha) * fitted[j - 1]
}
}
# Find the next fitted value in the series
fitted[T + 1] <- alpha * (y[T]) + (1 - alpha) * fitted[T]
print(paste('after:', level))
# Return fitted values
return(fitted[T + 1])
}
# Calculate SES using custom function
print(paste('my_ses() result with initial level of 77260.0561:', my_ses(pigs, 0.2971, 77260.0561)))## [1] "before: 77260.0561"
## [1] "after: 77260.0561"
## [1] "my_ses() result with initial level of 77260.0561: 98816.4543875089"
print(paste('my_ses() result with initial level of 55:', my_ses(pigs, 0.2971, 55)))## [1] "before: 55"
## [1] "after: 55"
## [1] "my_ses() result with initial level of 55: 98816.4543875089"
Modify your function from the previous exercise to return the sum of squared errors rather than the forecast of the next observation. Then use the optim() function to find the optimal values of α and \(ℓ_0\). Do you get the same values as the ses() function?
The optim() function with its default method (Nelder-Mead) and starting values of c(0, 0) produced different values for α and \(ℓ_0\) (0.5179980 and 0.1307012, respectively); the starting point is far from the scale of the data, so the search stops short of the optimum. Changing the method to BFGS (Broyden, Fletcher, Goldfarb and Shanno) made the results very close to those produced by the ses() function (0.297142 and 77273.160331, respectively). Restarting Nelder-Mead from values on the scale of the series is sketched at the end of this section.
# Function to implement SES, modified to return the sum of squared errors instead of the forecast of the next observation
my_ses_mod <- function(y, params) {
# Split out parameters for optim() function
alpha <- params[1]
level <- params[2]
# Initialize vector to store forecast values
fitted <- c()
# Set T to be length of time series
T <- length(y)
# Iterate over terms
for (j in seq(1, T)) {
if (j == 1) {
fitted[j] <- level
} else {
fitted[j] <- alpha * (y[j - 1]) + (1 - alpha) * fitted[j - 1]
}
}
# Convert fitted values to time series
fitted <- ts(fitted, frequency=12, start=c(1980, 1))
# Calculate residuals
residuals <- y - fitted
# Calculate sum of squared errors
sse=sum(residuals ** 2)
# Return fitted values, residuals, and sum of squared errors
# return(list(fitted=fitted, residuals=residuals, sse=sse))
# Return sse
return(sse)
}
# Calculate SSE using modified custom function
fit2 <- my_ses_mod(pigs, c(0.2971, 77260.0561))
# Compare this SSE to SSE from original fit using sse()
fit2## [1] 19765613579
sum(fit$residuals ** 2)## [1] 19765613407
# Optimize α and ℓ_0 using optim() function
fit3 <- optim(par=c(0, 0), fn=my_ses_mod, y=pigs)
fit3 # produces alpha=0.5179980, l_0=0.1307012## $par
## [1] 0.5179980 0.1307012
##
## $value
## [1] 27848581225
##
## $counts
## function gradient
## 35 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
# Try optimize again using method='BFGS'
fit4 <- optim(par=c(0, 0), fn=my_ses_mod, y=pigs, method='BFGS')
fit4## $par
## [1] 0.297142 77273.160331
##
## $value
## [1] 19765613325
##
## $counts
## function gradient
## 84 13
##
## $convergence
## [1] 0
##
## $message
## NULL
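As noted above, the poor Nelder-Mead result likely stems from starting values far from the scale of the data. Restarting the default method from values on the scale of the series may bring it close to the ses() estimates as well; a sketch, not run in the original analysis:
# Retry Nelder-Mead with starting values on the scale of the data (alpha = 0.5, level = mean of series)
fit5 <- optim(par=c(0.5, mean(pigs)), fn=my_ses_mod, y=pigs)
fit5$par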