The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
Refractive index (RI) is a numerical variable with a slightly right-skewed distribution and a moderately high correlation (0.81) with the calcium variable (Ca). Sodium (Na), aluminum (Al), and silicon (Si) exhibit almost no skewness, nor are they highly correlated with other variables. The distribution of magnesium (Mg) is bathtub-shaped, and it is only slightly correlated with other variables (aluminum, calcium, and barium). Potassium (K), barium (Ba), and iron (Fe) are highly right-skewed but only slightly correlated with other variables. The RI-Ca correlation is the only one to take particular note of, as it could affect models that are sensitive to collinearity (e.g., linear models). Using the heuristic algorithm proposed by K&J (via the findCorrelation() function), calcium is a candidate for removal due to high correlation. All variables except Type are quantitative; Type is a categorical variable with 6 levels (1, 2, 3, 5, 6, and 7).
# Correlation of quantitative vars
# Load data
data(Glass)
str(Glass)## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
# Filter for just quantitative vars
glassQuant <- Glass %>% select (-Type)
# Corr chart
chart.Correlation(glassQuant)
# Corr matrix
corr1 <- cor(glassQuant)
corr1## RI Na Mg Al Si K
## RI 1.0000000000 -0.19188538 -0.122274039 -0.40732603 -0.54205220 -0.289832711
## Na -0.1918853790 1.00000000 -0.273731961 0.15679367 -0.06980881 -0.266086504
## Mg -0.1222740393 -0.27373196 1.000000000 -0.48179851 -0.16592672 0.005395667
## Al -0.4073260341 0.15679367 -0.481798509 1.00000000 -0.00552372 0.325958446
## Si -0.5420521997 -0.06980881 -0.165926723 -0.00552372 1.00000000 -0.193330854
## K -0.2898327111 -0.26608650 0.005395667 0.32595845 -0.19333085 1.000000000
## Ca 0.8104026963 -0.27544249 -0.443750026 -0.25959201 -0.20873215 -0.317836155
## Ba -0.0003860189 0.32660288 -0.492262118 0.47940390 -0.10215131 -0.042618059
## Fe 0.1430096093 -0.24134641 0.083059529 -0.07440215 -0.09420073 -0.007719049
## Ca Ba Fe
## RI 0.8104027 -0.0003860189 0.143009609
## Na -0.2754425 0.3266028795 -0.241346411
## Mg -0.4437500 -0.4922621178 0.083059529
## Al -0.2595920 0.4794039017 -0.074402151
## Si -0.2087322 -0.1021513105 -0.094200731
## K -0.3178362 -0.0426180594 -0.007719049
## Ca 1.0000000 -0.1128409671 0.124968219
## Ba -0.1128410 1.0000000000 -0.058691755
## Fe 0.1249682 -0.0586917554 1.000000000
# Corr plot
corrplot(corr1, order='hclust', type='full')
# Find correlations past cutoff
print(paste0("Candidate for removal due to high correlation: ", findCorrelation(corr1, cutoff=0.8, exact=T, verbose=T, names=T)))## Compare row 7 and column 1 with corr 0.81
## Means: 0.319 vs 0.217 so flagging column 7
## All correlations <= 0.8
## [1] "Candidate for removal due to high correlation: Ca"
Do there appear to be any outliers in the data? Are any predictors skewed?
Based on the histograms from the correlation chart, larger histograms were generated to look for outliers in the suspect variables (K, Ca, Ba, and Fe). Outliers were flagged as values lying more than three standard deviations from the mean. Of the variables examined, the two potassium values of 6.21 stand out; other values fell outside the three-standard-deviation range, but visually they do not appear that unusual.
Several columns exhibit some skewness, although none surpasses the threshold of 20 that K&J associate with high skewness. Other sources suggest that skewness values greater than 1 indicate a highly skewed distribution; by that criterion, RI, Mg, K, Ca, Ba, and Fe qualify.
# Look for outliers
par(mfrow=c(2, 2))
hist(Glass$K, main='Histogram of K', xlab='K')
hist(Glass$Ca, main='Histogram of Ca', xlab='Ca')
hist(Glass$Ba, main='Histogram of Ba', xlab='Ba')
hist(Glass$Fe, main='Histogram of Fe', xlab='Fe')
# Examine K outliers beyond 3 standard deviations of mean
print("K outliers beyond 3 standard deviations of mean:")## [1] "K outliers beyond 3 standard deviations of mean:"
Glass[Glass$K > mean(Glass$K) + 3 * sd(Glass$K) | Glass$K < mean(Glass$K) - 3 * sd(Glass$K),]$K## [1] 6.21 6.21 2.70
# Examine Ca outliers beyond 3 standard deviations of mean
print("Ca outliers beyond 3 standard deviations of mean:")## [1] "Ca outliers beyond 3 standard deviations of mean:"
Glass[Glass$Ca > mean(Glass$Ca) + 3 * sd(Glass$Ca) | Glass$Ca < mean(Glass$Ca) - 3 * sd(Glass$Ca),]$Ca## [1] 13.24 13.30 16.19 14.68 14.96 14.40 13.44
# Examine Ba outliers beyond 3 standard deviations of mean
print("Ba outliers beyond 3 standard deviations of mean:")## [1] "Ba outliers beyond 3 standard deviations of mean:"
Glass[Glass$Ba > mean(Glass$Ba) + 3 * sd(Glass$Ba) | Glass$Ba < mean(Glass$Ba) - 3 * sd(Glass$Ba),]$Ba## [1] 3.15 2.20 1.68 1.71 2.88 1.67
# Examine Fe outliers beyond 3 standard deviations of mean
print("Fe outliers beyond 3 standard deviations of mean:")## [1] "Fe outliers beyond 3 standard deviations of mean:"
Glass[Glass$Fe > mean(Glass$Fe) + 3 * sd(Glass$Fe) | Glass$Fe < mean(Glass$Fe) - 3 * sd(Glass$Fe),]$Fe## [1] 0.35 0.37 0.51
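The four nearly identical filters above could be wrapped in a small helper. A minimal sketch (the helper name outliers_3sd is hypothetical, not part of the original code):
# Return the values of x lying more than 3 standard deviations from the mean
outliers_3sd <- function(x) x[abs(x - mean(x)) > 3 * sd(x)]
# Apply to the suspect variables; the result is a list because the counts differ
sapply(Glass[, c('K', 'Ca', 'Ba', 'Fe')], outliers_3sd)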
# Look for skewness
# As a general rule of thumb: If skewness is less than -1 or greater than 1, the distribution is highly skewed.
# If skewness is between -1 and -0.5 or between 0.5 and 1, the distribution is moderately skewed.
# If skewness is between -0.5 and 0.5, the distribution is approximately symmetric.
print("Skewness of columns:")## [1] "Skewness of columns:"
colskewness <- apply(glassQuant, 2, e1071::skewness, type=1)
colskewness## RI Na Mg Al Si K Ca
## 1.6140150 0.4509917 -1.1444648 0.9009179 -0.7253173 6.5056358 2.0326774
## Ba Fe
## 3.3924309 1.7420068
# Show skewed columns
print("Skewed columns (skewness < -1 or > 1):")## [1] "Skewed columns (skewness < -1 or > 1):"
skewedcols <- glassQuant[colskewness < -1 | colskewness > 1]
colnames(skewedcols)## [1] "RI" "Mg" "K" "Ca" "Ba" "Fe"
Are there any relevant transformations of one or more predictors that might improve the classification model?
Six of the predictors could benefit from transformation. The skewness values and histograms of untransformed, log-transformed, and Box-Cox-transformed data were compared to evaluate which transformation would work best. Note that Mg, K, Ba, and Fe contain zeros, so the log transform is undefined for some observations (hence the NaN skewness values below) and the Box-Cox procedure leaves those columns untransformed; only RI and Ca receive an estimated lambda. A transformation that accommodates zeros, such as Yeo-Johnson, is sketched after the comparison below.
# Possible transformations of skewed columns;
# look at log histograms first
par(mfrow=c(2, 3))
hist(log(Glass$RI), main='Log Transform of RI', xlab='RI')
hist(log(Glass$Mg), main='Log Transform of Mg', xlab='Mg')
hist(log(Glass$K), main='Log Transform of K', xlab='K')
hist(log(Glass$Ca), main='Log Transform of Ca', xlab='Ca')
hist(log(Glass$Ba), main='Log Transform of Ba', xlab='Ba')
hist(log(Glass$Fe), main='Log Transform of Fe', xlab='Fe')
# Box-Cox transformation - single var
bctrans <- BoxCoxTrans(Glass$Ca)
bccol <- predict(bctrans, Glass$Ca)
skewness(bccol)## [1] -0.1939557
# Box-Cox transformation - use "preProcess" to transform multiple cols
bctrans <- preProcess(skewedcols, method=c('BoxCox'))
bccols <- predict(bctrans, skewedcols)
bctrans$bc## $RI
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.511 1.517 1.518 1.518 1.519 1.534
##
## Largest/Smallest: 1.02
## Sample Skewness: 1.6
##
## Estimated Lambda: -2
##
##
## $Ca
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.430 8.240 8.600 8.957 9.172 16.190
##
## Largest/Smallest: 2.98
## Sample Skewness: 2.02
##
## Estimated Lambda: -1.1
# Show histograms of Box-Cox-transformed columns
par(mfrow=c(2, 3))
hist(bccols$RI, main='Box-Cox Transform of RI', xlab='RI')
hist(bccols$Ca, main='Box-Cox Transform of Ca', xlab='Ca')
# Compare transforms
par(mfrow=c(2,3))
hist(Glass$RI, main='RI (Untransformed)', xlab='RI')
hist(log(Glass$RI), main='RI (Log Transformation)', xlab='RI')
hist(bccols$RI, main='RI (Box-Cox Transformation)', xlab='RI')
hist(Glass$Mg, main='Mg (Untransformed)', xlab='Mg')
hist(log(Glass$Mg), main='Mg (Log Transformation)', xlab='Mg')
plot.new()
par(mfrow=c(2,3))
hist(Glass$K, main='K (Untransformed)', xlab='K')
hist(log(Glass$K), main='K (Log Transformation)', xlab='K')
plot.new()
hist(Glass$Ca, main='Ca (Untransformed)', xlab='Ca')
hist(log(Glass$Ca), main='Ca (Log Transformation)', xlab='Ca')
hist(bccols$Ca, main='Ca (Box-Cox Transformation)', xlab='Ca')
par(mfrow=c(2,3))
hist(Glass$Ba, main='Ba (Untransformed)', xlab='Ba')
hist(log(Glass$Ba), main='Ba (Log Transformation)', xlab='Ba')
plot.new()
hist(Glass$Fe, main='Fe (Untransformed)', xlab='Fe')
hist(log(Glass$Fe), main='Fe (Log Transformation)', xlab='Fe')
plot.new()
# Check skewness
tmpdf <- as.data.frame(list(Non.Transformed.RI=Glass$RI, Log.RI=log(Glass$RI), Box.Cox.RI=bccols$RI))
apply(tmpdf, 2, e1071::skewness, type=1)## Non.Transformed.RI Log.RI Box.Cox.RI
## 1.614015 1.601538 1.576699
tmpdf <- as.data.frame(list(Non.Transformed.Mg=Glass$Mg, Log.Mg=log(Glass$Mg), Box.Cox.Mg=bccols$Mg))
apply(tmpdf, 2, e1071::skewness, type=1)## Non.Transformed.Mg Log.Mg Box.Cox.Mg
## -1.144465 NaN -1.144465
tmpdf <- as.data.frame(list(Non.Transformed.K=Glass$K, Log.K=log(Glass$K), Box.Cox.K=bccols$K))
apply(tmpdf, 2, e1071::skewness, type=1)## Non.Transformed.K Log.K Box.Cox.K
## 6.505636 NaN 6.505636
tmpdf <- as.data.frame(list(Non.Transformed.Ca=Glass$Ca, Log.Ca=log(Glass$Ca), Box.Cox.Ca=bccols$Ca))
apply(tmpdf, 2, e1071::skewness, type=1)## Non.Transformed.Ca Log.Ca Box.Cox.Ca
## 2.0326774 1.0588744 -0.1953232
tmpdf <- as.data.frame(list(Non.Transformed.Ba=Glass$Ba, Log.Ba=log(Glass$Ba), Box.Cox.Ba=bccols$Ba))
apply(tmpdf, 2, e1071::skewness, type=1)## Non.Transformed.Ba Log.Ba Box.Cox.Ba
## 3.392431 NaN 3.392431
tmpdf <- as.data.frame(list(Non.Transformed.Fe=Glass$Fe, Log.Fe=log(Glass$Fe), Box.Cox.Fe=bccols$Fe))
apply(tmpdf, 2, e1071::skewness, type=1)## Non.Transformed.Fe Log.Fe Box.Cox.Fe
## 1.742007 NaN 1.742007
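Because Mg, K, Ba, and Fe contain zeros, neither the log nor the Box-Cox transform changes them above. A Yeo-Johnson transformation accommodates zero (and negative) values; a minimal sketch using caret's preProcess, not part of the original analysis:
# Yeo-Johnson transformation of the skewed columns (handles zeros)
yjtrans <- preProcess(skewedcols, method='YeoJohnson')
yjcols <- predict(yjtrans, skewedcols)
# Compare skewness after the transformation
apply(yjcols, 2, e1071::skewness, type=1)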
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
Using the nearZeroVar() function, the following variables appear to be degenerate in the sense described by K&J, i.e., they have near-zero variance: leaf.mild, mycelium, and sclerotia.
As the summary output shows, each of these columns has a dominant class that far exceeds its next most populous class.
# Load data
data(Soybean)
str(Soybean)## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
# Generate frequency tables to look for degenerate distributions
multifunc2 <- function(x) {
return(prop.table(table(x)))
}
#apply(Soybean, 2, table) # freq table - produces long output; use nzv instead
#apply(Soybean, 2, multifunc2) # proportion table - produces long output; use nzv instead
# Degenerate vars: if both of these are true:
# 1. fraction of unique vals over the sample size < 10%
# 2. ratio of freq of the most prevalent to the freq of the second most prevalent is > 20
nzv <- nearZeroVar(Soybean)
print(colnames(Soybean[,nzv]))## [1] "leaf.mild" "mycelium" "sclerotia"
# Examine NZV columns
summary(Soybean[,nzv])## leaf.mild mycelium sclerotia
## 0 :535 0 :639 0 :625
## 1 : 20 1 : 6 1 : 20
## 2 : 20 NA's: 38 NA's: 38
## NA's:108
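The two criteria noted in the comments above (percent of unique values and the frequency ratio of the two most common values) can be inspected directly with nearZeroVar()'s saveMetrics argument; a short sketch, assuming caret is loaded:
# Show the frequency-ratio and percent-unique metrics for the flagged columns
nzvMetrics <- nearZeroVar(Soybean, saveMetrics=TRUE)
nzvMetrics[nzvMetrics$nzv, ]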
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
As shown in the md.pattern plot below, there are 562 complete cases (out of 683), or roughly 82% of the data. Among the variables with more than 15% missing values, most are related to plant pathology.
We verified that the missing values weren’t due to the absence of the specific pathology, as negative cases were also reported. So the values do appear to be missing as opposed to being intentionally or structurally absent.
# Look for missing values
summary(Soybean)## Class date plant.stand precip temp
## brown-spot : 92 5 :149 0 :354 0 : 74 0 : 80
## alternarialeaf-spot: 91 4 :131 1 :293 1 :112 1 :374
## frog-eye-leaf-spot : 91 3 :118 NA's: 36 2 :459 2 :199
## phytophthora-rot : 88 2 : 93 NA's: 38 NA's: 30
## anthracnose : 44 6 : 90
## brown-stem-rot : 44 (Other):101
## (Other) :233 NA's : 1
## hail crop.hist area.dam sever seed.tmt germ plant.growth
## 0 :435 0 : 65 0 :123 0 :195 0 :305 0 :165 0 :441
## 1 :127 1 :165 1 :227 1 :322 1 :222 1 :213 1 :226
## NA's:121 2 :219 2 :145 2 : 45 2 : 35 2 :193 NA's: 16
## 3 :218 3 :187 NA's:121 NA's:121 NA's:112
## NA's: 16 NA's: 1
##
##
## leaves leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild
## 0: 77 0 :221 0 :357 0 : 51 0 :487 0 :554 0 :535
## 1:606 1 : 36 1 : 21 1 :327 1 : 96 1 : 45 1 : 20
## 2 :342 2 :221 2 :221 NA's:100 NA's: 84 2 : 20
## NA's: 84 NA's: 84 NA's: 84 NA's:108
##
##
##
## stem lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 0 :296 0 :520 0 :379 0 :320 0 :473 0 :497
## 1 :371 1 : 42 1 : 39 1 : 83 1 :104 1 :135
## NA's: 16 NA's:121 2 : 36 2 :177 NA's:106 2 : 13
## 3 :191 3 : 65 NA's: 38
## NA's: 38 NA's: 38
##
##
## mycelium int.discolor sclerotia fruit.pods fruit.spots seed
## 0 :639 0 :581 0 :625 0 :407 0 :345 0 :476
## 1 : 6 1 : 44 1 : 20 1 :130 1 : 75 1 :115
## NA's: 38 2 : 20 NA's: 38 2 : 14 2 : 57 NA's: 92
## NA's: 38 3 : 48 4 :100
## NA's: 84 NA's:106
##
##
## mold.growth seed.discolor seed.size shriveling roots
## 0 :524 0 :513 0 :532 0 :539 0 :551
## 1 : 67 1 : 64 1 : 59 1 : 38 1 : 86
## NA's: 92 NA's:106 NA's: 92 NA's:106 2 : 15
## NA's: 31
##
##
##
# Select columns where % of missing values > 15%
over15missing <- (colMeans(is.na(Soybean)) * 100) > 15
print("Vars with over 15% of data missing:")## [1] "Vars with over 15% of data missing:"
colMeans(is.na(Soybean[,over15missing])) * 100## hail sever seed.tmt germ leaf.mild
## 17.71596 17.71596 17.71596 16.39824 15.81259
## lodging fruiting.bodies fruit.spots seed.discolor shriveling
## 17.71596 15.51977 15.51977 15.51977 15.51977
# Generate plots of soybean class vs variables with missing values > 15%
par(mfrow=c(3,4))
for (i in seq(1, length(over15missing))) {
if (over15missing[i] == T) {
plot(x=Soybean[,i], xlab=colnames(Soybean[i]))
}
}
mtext("Soybean class vs variables with proportion of missing values > 15%", outer=T, line=-2)# Generate ggcorrplot correlating missing values and soybean class:
# Make a new vector containing column names of variables with missing values > 15%;
# also include the outcome variable, soybean class.
class_and_over15 <- over15missing
class_and_over15[1] <- T
# Convert to a model matrix, then generate correlations and the correlation plot
model.matrix(~.+0, data=Soybean[,class_and_over15]) %>%
cor(use="pairwise.complete.obs") %>%
ggcorrplot(show.diag=FALSE, type="lower", lab=TRUE, lab_size=2)
# Generate missing data pattern plot; blue=observed, red=missing.
# Row labels at left give the number of times that pattern occurs.
# Row labels at right give the number of columns with missing values (i.e., the number of red cells).
# Totals along the bottom are the number of missing values in each variable.
md.pattern(Soybean, rotate.names=T, plot=T)## Class leaves date area.dam crop.hist plant.growth stem temp roots
## 562 1 1 1 1 1 1 1 1 1
## 13 1 1 1 1 1 1 1 1 1
## 55 1 1 1 1 1 1 1 1 1
## 8 1 1 1 1 1 1 1 1 1
## 9 1 1 1 1 1 1 1 1 0
## 6 1 1 1 1 1 1 1 1 0
## 14 1 1 1 1 1 1 1 0 1
## 15 1 1 1 1 0 0 0 0 0
## 1 1 1 0 0 0 0 0 0 0
## 0 0 1 1 16 16 16 30 31
## plant.stand precip stem.cankers canker.lesion ext.decay mycelium
## 562 1 1 1 1 1 1
## 13 1 1 1 1 1 1
## 55 1 1 1 1 1 1
## 8 1 0 0 0 0 0
## 9 1 1 1 1 1 1
## 6 0 1 1 1 1 1
## 14 0 0 0 0 0 0
## 15 0 0 0 0 0 0
## 1 0 0 0 0 0 0
## 36 38 38 38 38 38
## int.discolor sclerotia leaf.halo leaf.marg leaf.size leaf.malf fruit.pods
## 562 1 1 1 1 1 1 1
## 13 1 1 1 1 1 1 0
## 55 1 1 0 0 0 0 0
## 8 0 0 1 1 1 1 1
## 9 1 1 0 0 0 0 1
## 6 1 1 0 0 0 0 1
## 14 0 0 0 0 0 0 1
## 15 0 0 1 1 1 1 0
## 1 0 0 1 1 1 1 0
## 38 38 84 84 84 84 84
## seed mold.growth seed.size leaf.shread fruiting.bodies fruit.spots
## 562 1 1 1 1 1 1
## 13 0 0 0 1 0 0
## 55 0 0 0 0 0 0
## 8 0 0 0 1 0 0
## 9 1 1 1 0 1 1
## 6 1 1 1 0 1 1
## 14 1 1 1 0 0 0
## 15 0 0 0 0 0 0
## 1 0 0 0 0 0 0
## 92 92 92 100 106 106
## seed.discolor shriveling leaf.mild germ hail sever seed.tmt lodging
## 562 1 1 1 1 1 1 1 1 0
## 13 0 0 1 0 0 0 0 0 13
## 55 0 0 0 0 0 0 0 0 19
## 8 0 0 0 0 0 0 0 0 20
## 9 1 1 0 1 0 0 0 0 11
## 6 1 1 0 0 0 0 0 0 13
## 14 0 0 0 0 0 0 0 0 24
## 15 0 0 0 0 0 0 0 0 28
## 1 0 0 0 0 0 0 0 0 30
## 106 106 108 112 121 121 121 121 2337
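To make the relationship between missingness and class explicit, the proportion of incomplete rows can also be tabulated by class. A minimal sketch in base R, not part of the original output:
# Proportion of rows with at least one missing predictor value, by soybean class
sort(tapply(rowSums(is.na(Soybean[, -1])) > 0, Soybean$Class, mean), decreasing=TRUE)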
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
As shown in the md.pattern plot above, several missingness patterns are of immediate concern because more than half of the variables are missing in them; the corresponding observations are candidates for removal.
Using this strategy, 86.4% of the original observations are retained, and 4.7% of the remaining rows have at least one missing value.
# Create copy of the original data set
sb2 <- Soybean
# Remove the observations in which 19 or more variables have missing values
most_complete <- rowSums(is.na(sb2)) < 19
sb2 <- sb2[most_complete,]
# Recalculate number of rows containing missing values
some_missing <- rowSums(is.na(sb2)) > 0
print(paste0("Proportion of observations with missing values after removing those with high proportion of missing variables: ",
round(nrow(sb2[some_missing,]) / nrow(sb2), 3)))## [1] "Proportion of observations with missing values after removing those with high proportion of missing variables: 0.047"
print(paste0("Proportion of observations retained from original data set: ",
round(nrow(sb2) / nrow(Soybean), 3)))## [1] "Proportion of observations retained from original data set: 0.864"
An alternative approach would be to remove the three variables with degenerate distributions (leaf.mild, mycelium, and sclerotia) and, additionally, most of the variables flagged above as having a high proportion of missing data (hail, sever, seed.tmt, germ, lodging, fruiting.bodies, fruit.spots, and seed.discolor). However, removing variables alone still leaves 17.7% of observations with at least one missing value, so combining the removal of some observations with the removal of these variables is more effective. The disadvantage is that 11 variables would be dropped, nearly a third of all predictors.
# Create copy of original data set
sb3 <- Soybean
# Remove the observations in which 24 or more variables have missing values
most_complete <- rowSums(is.na(sb3)) < 24
sb3 <- sb3[most_complete,]
# Remove variables with degenerate classes or having high proportion of missing values (11 all together)
sb3 <- sb3 %>%
select(-leaf.mild, -mycelium, -sclerotia, -hail, -sever, -seed.tmt, -germ, -lodging, -fruiting.bodies, -fruit.spots, -seed.discolor)
# Recalculate number of rows containing missing values
some_missing <- rowSums(is.na(sb3)) > 0
print(paste0("Proportion of observations with missing values after removing degenerate variables: ",
round(nrow(sb3[some_missing,]) / nrow(sb3), 3)))## [1] "Proportion of observations with missing values after removing degenerate variables: 0.139"
print(paste0("Proportion of observations retained from original data set: ",
round(nrow(sb3) / nrow(Soybean), 3)))## [1] "Proportion of observations retained from original data set: 0.956"
Another approach would be to impute the missing data. The mice package (multivariate imputation by chained equations) is an appropriate tool for this. Even with imputation, the observations in which 24 or more variables are missing would remain problematic, so those are removed before imputing. As shown below, the result is a data set with no missing values, and over 95% of the original observations are retained.
# Create copy of original data set
sb4 <- Soybean
# Remove the observations in which 24 or more variables have missing values
most_complete <- rowSums(is.na(sb4)) < 24
sb4 <- sb4[most_complete,]
# Recalculate number of rows containing missing values
some_missing <- rowSums(is.na(sb4)) > 0
print(paste0("Proportion of observations with missing values after removing degenerate variables: ",
round(nrow(sb4[some_missing,]) / nrow(sb4), 3)))## [1] "Proportion of observations with missing values after removing degenerate variables: 0.139"
print(paste0("Proportion of observations retained from original data set: ",
round(nrow(sb4) / nrow(Soybean), 3)))## [1] "Proportion of observations retained from original data set: 0.956"
# Impute missing data
imp <- mice(sb4, maxit=5, m=5, seed=777)##
## iter imp variable
## 1 1 plant.stand precip hail sever seed.tmt germ leaf.halo leaf.marg leaf.size leaf.shread leaf.malf leaf.mild lodging stem.cankers canker.lesion fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## (the same list of imputed variables is printed for each of the remaining iterations and imputations; output truncated)
# Complete the data set using imputed values
sb4 <- complete(imp)
# Recalculate number of rows containing missing values
some_missing <- rowSums(is.na(sb4)) > 0
print(paste0("Proportion of observations with missing values after removing degenerate variables: ",
round(nrow(sb4[some_missing,]) / nrow(sb4), 3)))## [1] "Proportion of observations with missing values after removing degenerate variables: 0"
print(paste0("Proportion of observations retained from original data set: ",
round(nrow(sb4) / nrow(Soybean), 3)))## [1] "Proportion of observations retained from original data set: 0.956"
Consider the pigs series — the number of pigs slaughtered in Victoria each month.
Use the ses() function in R to find the optimal values of α and \(ℓ_0\), and generate forecasts for the next four months.
Using SES, 98816.41 pigs are forecast to be slaughtered during each of the four months following August 1995. The initial state, \(ℓ_0\), was estimated to be 77260.0561, with an α of 0.2971. The low value of α means that relatively more weight is given to observations further in the past than would be the case with an α closer to 1. The 80% and 95% prediction intervals are wide, indicating that the forecasts are quite uncertain.
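To illustrate how an α of 0.2971 spreads weight over past observations, the weights in the weighted-average form of SES, \(\alpha(1-\alpha)^j\) for an observation j periods in the past, can be computed directly. A small illustrative sketch, not part of the original output:
# Weights placed on the most recent six observations under SES with alpha = 0.2971
alpha <- 0.2971
round(alpha * (1 - alpha)^(0:5), 4)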
# Load data
data(pigs)
summary(pigs)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 33873 79080 91662 90640 101493 120184
pigs## Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 1980 76378 71947 33873 96428 105084 95741 110647 100331 94133 103055
## 1981 76889 81291 91643 96228 102736 100264 103491 97027 95240 91680
## 1982 76892 85773 95210 93771 98202 97906 100306 94089 102680 77919
## 1983 81225 88357 106175 91922 104114 109959 97880 105386 96479 97580
## 1984 90974 98981 107188 94177 115097 113696 114532 120110 93607 110925
## 1985 103069 103351 111331 106161 111590 99447 101987 85333 86970 100561
## 1986 82719 79498 74846 73819 77029 78446 86978 75878 69571 75722
## 1987 63292 59380 78332 72381 55971 69750 85472 70133 79125 85805
## 1988 69069 79556 88174 66698 72258 73445 76131 86082 75443 73969
## 1989 66269 73776 80034 70694 81823 75640 75540 82229 75345 77034
## 1990 75982 78074 77588 84100 97966 89051 93503 84747 74531 91900
## 1991 81022 78265 77271 85043 95418 79568 103283 95770 91297 101244
## 1992 93866 95171 100183 103926 102643 108387 97077 90901 90336 88732
## 1993 73292 78943 94399 92937 90130 91055 106062 103560 104075 101783
## 1994 82413 83534 109011 96499 102430 103002 91815 99067 110067 101599
## 1995 88905 89936 106723 84307 114896 106749 87892 100506
## Nov Dec
## 1980 90595 101457
## 1981 101259 109564
## 1982 93561 117062
## 1983 109490 110191
## 1984 103312 120184
## 1985 89543 89265
## 1986 64182 77357
## 1987 81778 86852
## 1988 78139 78646
## 1989 78589 79769
## 1990 81635 89797
## 1991 114525 101139
## 1992 83759 99267
## 1993 93791 102313
## 1994 97646 104930
## 1995
# Plot
pigs %>% autoplot()
# Forecast next 4 months using SES
fit <- ses(pigs, h=4)
summary(fit)##
## Forecast method: Simple exponential smoothing
##
## Model Information:
## Simple exponential smoothing
##
## Call:
## ses(y = pigs, h = 4)
##
## Smoothing parameters:
## alpha = 0.2971
##
## Initial states:
## l = 77260.0561
##
## sigma: 10308.58
##
## AIC AICc BIC
## 4462.955 4463.086 4472.665
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 385.8721 10253.6 7961.383 -0.922652 9.274016 0.7966249 0.01282239
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Sep 1995 98816.41 85605.43 112027.4 78611.97 119020.8
## Oct 1995 98816.41 85034.52 112598.3 77738.83 119894.0
## Nov 1995 98816.41 84486.34 113146.5 76900.46 120732.4
## Dec 1995 98816.41 83958.37 113674.4 76092.99 121539.8
# Plot fitted values
autoplot(fit) +
autolayer(fitted(fit), series="Fitted") +
ylab("Pigs slaughtered") + xlab("Year")Compute a 95% prediction interval for the first forecast using \(\hat{y}\)±1.96s where s is the standard deviation of the residuals. Compare your interval with the interval produced by R.
The prediction interval calculated using \(\hat{y} \pm 1.96s\) yields:
98816.41 ± 20136.44 = (78679.97, 118952.84)
This is close to, but not identical to, the interval produced by the model. Even at h = 1 the two differ slightly because ses() uses its own estimate of the residual standard deviation (sigma = 10308.58) rather than the sample standard deviation of the residuals. For longer horizons the model also widens the interval as a function of α and h, as shown in Table 7.8 of the text: the further into the future the forecast, the more uncertain the prediction and, therefore, the larger the prediction interval.
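As a check, the model's 95% intervals can be reproduced from the σ and α reported above using the SES forecast-variance formula \(\sigma_h = \hat{\sigma}\sqrt{1 + \alpha^2 (h - 1)}\). A sketch, not part of the original output:
# Reproduce the ses() 95% prediction intervals for h = 1..4
sigma_hat <- 10308.58
alpha <- 0.2971
h <- 1:4
sigma_h <- sigma_hat * sqrt(1 + alpha^2 * (h - 1))
point <- as.numeric(fit$mean[1])  # 98816.41; the SES point forecast is constant across horizons
data.frame(h=h, lo95=point - 1.96 * sigma_h, hi95=point + 1.96 * sigma_h)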
# Calculate standard deviaion of the residuals from previous ses fit
sd_res <- sd(fit$residuals)
# Calculate margin of error for 95% prediction interval
me <- sd_res * 1.96
# Calculate lo and hi 95
lo_95 <- fit$mean[1] - me
hi_95 <- fit$mean[1] + me
print(paste0("Prediction interval: ", round(fit$mean[1], 2), " ± ", me, " = (", round(lo_95, 2), ", ", round(hi_95, 2), ")"))## [1] "Prediction interval: 98816.41 ± 20136.4388581745 = (78679.97, 118952.84)"
print("Compare to interval predicted by SES fit:")## [1] "Compare to interval predicted by SES fit:"
for (i in seq(1, 4)) {
print(paste0(" ", round(fit$mean[i], 2), " ± ", me,
" = (", round(fit$lower[i, 2], 2), ", ", round(fit$upper[i, 2], 2), ")"))
}## [1] " 98816.41 ± 20136.4388581745 = (78611.97, 119020.84)"
## [1] " 98816.41 ± 20136.4388581745 = (77738.83, 119893.98)"
## [1] " 98816.41 ± 20136.4388581745 = (76900.46, 120732.35)"
## [1] " 98816.41 ± 20136.4388581745 = (76092.99, 121539.82)"
Write your own function to implement simple exponential smoothing. The function should take arguments y (the time series), alpha (the smoothing parameter α) and level (the initial level \(ℓ_0\)). It should return the forecast of the next observation in the series. Does it give the same forecast as ses()?
Using the weighted-average form of the SES equation:
\(\hat{y}_{t+1|t} = \alpha y_t + (1 - \alpha)\,\hat{y}_{t|t-1}\)
Using the values produced by the ses() function for α and \(ℓ_0\), the custom function returns the same forecast as ses(): 98816 (rounded) in both cases. Various values of \(ℓ_0\) were tried and produced essentially the same result; with 188 observations, the weight the forecast places on the initial level, \((1-\alpha)^T\), is vanishingly small. Varying alpha, however, changed the forecast noticeably, indicating that the outcome depends heavily on the chosen value of alpha.
# Function to implement SES
my_ses <- function(y, alpha, level) {
print(paste('before:', level))
# Initialize vector to store forecast values
fitted <- c()
# Set T to be length of time series
T <- length(y)
# Iterate over terms
for (j in seq(1, T)) {
if (j == 1) {
fitted[j] <- level
} else {
fitted[j] <- alpha * (y[j - 1]) + (1 - alpha) * fitted[j - 1]
}
}
# Find the next fitted value in the series
fitted[T + 1] <- alpha * (y[T]) + (1 - alpha) * fitted[T]
print(paste('after:', level))
# Return fitted values
return(fitted[T + 1])
}
# Calculate SES using custom function
print(paste('my_ses() result with initial level of 77260.0561:', my_ses(pigs, 0.2971, 77260.0561)))## [1] "before: 77260.0561"
## [1] "after: 77260.0561"
## [1] "my_ses() result with initial level of 77260.0561: 98816.4543875089"
print(paste('my_ses() result with initial level of 55:', my_ses(pigs, 0.2971, 55)))## [1] "before: 55"
## [1] "after: 55"
## [1] "my_ses() result with initial level of 55: 98816.4543875089"
Modify your function from the previous exercise to return the sum of squared errors rather than the forecast of the next observation. Then use the optim() function to find the optimal values of α and \(ℓ_0\). Do you get the same values as the ses() function?
The optim() function with its default method (Nelder-Mead) and starting values of c(0, 0) produced different values for α and \(ℓ_0\) (0.5179980 and 0.1307012, respectively); the starting point is far from the scale of the data, so the search stops short of the optimum. Changing the method to BFGS (Broyden, Fletcher, Goldfarb and Shanno) made the results very close to those produced by the ses() function (0.297142 and 77273.160331, respectively). Restarting Nelder-Mead from values on the scale of the series is sketched at the end of this section.
# Function to implement SES, modified to return the sum of squared errors instead of the forecast of the next observation
my_ses_mod <- function(y, params) {
# Split out parameters for optim() function
alpha <- params[1]
level <- params[2]
# Initialize vector to store forecast values
fitted <- c()
# Set T to be length of time series
T <- length(y)
# Iterate over terms
for (j in seq(1, T)) {
if (j == 1) {
fitted[j] <- level
} else {
fitted[j] <- alpha * (y[j - 1]) + (1 - alpha) * fitted[j - 1]
}
}
# Convert fitted values to time series
fitted <- ts(fitted, frequency=12, start=c(1980, 1))
# Calculate residuals
residuals <- y - fitted
# Calculate sum of squared errors
sse=sum(residuals ** 2)
# Return fitted values, residuals, and sum of squared errors
# return(list(fitted=fitted, residuals=residuals, sse=sse))
# Return sse
return(sse)
}
# Calculate SSE using modified custom function
fit2 <- my_ses_mod(pigs, c(0.2971, 77260.0561))
# Compare this SSE to SSE from original fit using sse()
fit2## [1] 19765613579
sum(fit$residuals ** 2)## [1] 19765613407
# Optimize α and ℓ_0 using optim() function
fit3 <- optim(par=c(0, 0), fn=my_ses_mod, y=pigs)
fit3 # produces alpha=0.5179980, l_0=0.1307012## $par
## [1] 0.5179980 0.1307012
##
## $value
## [1] 27848581225
##
## $counts
## function gradient
## 35 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
# Try optimize again using method='BFGS'
fit4 <- optim(par=c(0, 0), fn=my_ses_mod, y=pigs, method='BFGS')
fit4## $par
## [1] 0.297142 77273.160331
##
## $value
## [1] 19765613325
##
## $counts
## function gradient
## 84 13
##
## $convergence
## [1] 0
##
## $message
## NULL
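As noted above, the poor Nelder-Mead result likely stems from starting values far from the scale of the data. Restarting the default method from values on the scale of the series may bring it close to the ses() estimates as well; a sketch, not run in the original analysis:
# Retry Nelder-Mead with starting values on the scale of the data (alpha = 0.5, level = mean of series)
fit5 <- optim(par=c(0.5, mean(pigs)), fn=my_ses_mod, y=pigs)
fit5$par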