Data 624 - Homework 1

Questions from HA Text

Some initialization by loading the fpp2 library and erasing any variables in memory

library(fpp2)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

## -- Attaching packages ---------------------------------------------- fpp2 2.4 --

## v ggplot2   3.3.6     v fma       2.4  
## v forecast  8.16      v expsmooth 2.3

##

rm(list = ls())

2.1

Use the help function to explore what the series gold, woolyrnq and gas represent.

?gold tells us that is daily gold prices in US Dollars between 1/1/1985 and 03/31/1989

?woolyrnq tells us quarterly production of woollen yan in AU between March 1965 and Sep 1994

?gas tell us Australian monthly gas production between 1956 and 1995

2.1.a

Use autoplot() to plot each of these in separate plots.

ANSWER An inspection of all 3 datasets tells un something important about them. gas and woolyrnq are structured as regular dataset in monthly and quarterly frequencies. But gold is not. The values are a sequence from 1 to 1,108.

We could do a simple autoplot(gold) and we woiuld get a plot, but the X axis would shows the 1:108 sequence which is not very helpful.

autoplot(gold)

gold has 1,108 rows in the dataset. But if you check the number of days between 1/1/1985 and 03/31/1989, the number of days is 1551, which tells us the time series is not based on a 365 days/year.

length(seq(as.Date('1985-01-01'),as.Date('1989-03-31'),by = 1))

## [1] 1551

We checked the number of weekdays between those dates and the number was 1,108 so we know now that gold is daily prices, but includes only weekdays

gold_series <- seq(as.Date('1985-01-01'),as.Date('1989-03-31'),by = 1)
gold_series <- gold_series[!weekdays(gold_series) %in% c('Saturday','Sunday')]
length(gold_series)-1

## [1] 1108

Unfortunatel the ts object we are using under fpp2 allows only for regular frequencies. To get a nicer and more informative graph we would make use of tsiblle and plot it showing dates in the x-axis

gold2 <- tsibble::tsibble(date=gold_series[-length(gold_series)], price=gold, index = date)
plot(gold2, type ="l")

2.1. b

What is the frequency of each series? Hint: apply the frequency() function.

ANSWER We used both the ? function and the frequency function to get a complete description of thr frequencies used.

Gold is daily, but only weekdays
Gas is monthly
Wooltrnq is quarterly (3 months)

frequency(gold)

## [1] 1

frequency(gas)

## [1] 12

frequency(woolyrnq)

## [1] 4

2.1.c

Use which.max() to spot the outlier in the gold series. Which observation was it?

ANSWER We made use of the weekday series we generate before for the plot. So we could get more valuable data for the outlier.

# This gives us the row number for the MAX prices in the Gold dataset
which.max(gold)

## [1] 770

# This give is the DATE index for the max value in the series
gold_series[which.max(gold)]

## [1] "1987-12-14"

# This gives us the MAX price in the series
gold[which.max(gold)]

## [1] 593.7

The outlier was row 770, which was for the date: 1987-12-14 and the price was on that date was: $ 593.70

2.3

Download some monthly Australian retail data from the book website. These represent retail sales in various categories for different Australian states, and are stored in a MS-Excel file.

2.3.a

You can read the data into R with the following script

retaildata <- readxl::read_excel("data/retail.xlsx", skip=1)

2.3.b

Select one of the time series as follows (but replace the column name with your own chosen column):

myts <- ts(retaildata[,"A3349627V"],
           frequency=12, start=c(1982,4))

2.3.c

Explore your chosen retail time series using the following functions:

autoplot(), ggseasonplot(), ggsubseriesplot(), gglagplot(), ggAcf()

Can you spot any seasonality, cyclicity and trend? What do you learn about the series?

autoplot(myts) +
ggtitle("Turnover ;  New South Wales ;  Liquor retailing") +
  xlab("Year") +
  ylab("AU$ Millions")

ggseasonplot(myts, year.labels=TRUE, year.labels.left=TRUE) +
  ylab("AU$ Million") +
  ggtitle("Seasonal Plot - Turnover ;  New South Wales ;  Liquor retailing")

myts2 <- window(myts, start=2004)
gglagplot(myts2,12, do.lines = FALSE)

This plot is hard to read because of the very large trend in the data. If the take one difference to remove trend then we will see better the impact of seasonality.

Dmyts <-diff(myts2)
gglagplot(Dmyts,12, do.lines = FALSE)

ggAcf(myts)

ggAcf(myts2)

6.2

The plastics data set consists of the monthly sales (in thousands) of product A for a plastics manufacturer for five years.

6.2.a

Plot the time series of sales of product A. Can you identify seasonal fluctuations and/or a trend-cycle?

plastics %>% autoplot()

6.2.b

Use a classical multiplicative decomposition to calculate the trend-cycle and seasonal indices.

plastics %>% decompose(type="multiplicative") %>%
  autoplot() + xlab("Year") +
  ggtitle("Classical multiplicative decomposition Plastic Prod A Sales")

6.2.c

Do the results support the graphical interpretation from part a?

ANSWER Yes, the decomposition shows a clear 12 seasonal pattern in the data

6.2.d

Compute and plot the seasonally adjusted data.

ANSWER Using classical decomposition, we decompose the plastics dataset using the decompose function with multiplicative parameter.

fit <- decompose(plastics, type = "multiplicative")
#fit

season_adj <- fit$x / fit$seasonal 
season_adj %>% autoplot()

Plot Data, Trend, Seasonally Adusted Data

autoplot(plastics, series="Data") +
  autolayer(fit$trend, series="Trend") +
  autolayer(fit$x / fit$seasonal, series="Seasonally Adjusted") +
  xlab("Year") + ylab("Sales") +
  ggtitle("Plastic Sales Product A") +
  scale_colour_manual(values=c("gray","blue","red"),
             breaks=c("Data","Seasonally Adjusted","Trend"))

## Warning: Removed 12 row(s) containing missing values (geom_path).

autoplot(plastics, series="Data") +
  autolayer(trendcycle(fit), series="Trend") +
  autolayer(seasadj(fit), series="Seasonally Adjusted") +
  xlab("Year") + ylab("Sales") +
  ggtitle("Plastic Sales Product A") +
  scale_colour_manual(values=c("gray","blue","red"),
             breaks=c("Data","Seasonally Adjusted","Trend"))

## Warning: Removed 12 row(s) containing missing values (geom_path).

6.2.e

Change one observation to be an outlier (e.g., add 500 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?

#plastics2 <- plastics
#plastics2[30] <- plastics[30] + 500
plastics2 <- plastics
plastics2[30] <- plastics2[30] + 500
plastics2 %>% autoplot()

plastics2 %>% decompose(type="multiplicative") -> fit2

autoplot(plastics2, series="Data") +
  autolayer(trendcycle(fit2), series="Trend") +
  autolayer(seasadj(fit2), series="Seasonally Adjusted") +
  xlab("Year") + ylab("Sales") +
  ggtitle("Plastic Sales Product A") +
  scale_colour_manual(values=c("gray","blue","red"),
             breaks=c("Data","Seasonally Adjusted","Trend"))

## Warning: Removed 12 row(s) containing missing values (geom_path).

6.2.f

Does it make any difference if the outlier is near the end rather than in the middle of the time series?

Near end

plastics3 <- plastics
plastics3[53] <- plastics3[53] + 500
plastics3 %>% decompose(type="multiplicative") -> fit3

autoplot(plastics3, series="Data") +
  autolayer(trendcycle(fit3), series="Trend") +
  autolayer(seasadj(fit3), series="Seasonally Adjusted") +
  xlab("Year") + ylab("Sales") +
  ggtitle("Plastic Sales Product A") +
  scale_colour_manual(values=c("gray","blue","red"),
             breaks=c("Data","Seasonally Adjusted","Trend"))

## Warning: Removed 12 row(s) containing missing values (geom_path).

Plot all trends

autoplot(plastics, series="Data") +
  autolayer(trendcycle(fit), series="Trend 0 Original") +
  autolayer(trendcycle(fit2), series="Trend 1 Middle") +
  autolayer(trendcycle(fit3), series="Trend 2 End") +
    xlab("Year") + ylab("Sales") +
  ggtitle("Plastic Sales Product A") +
  scale_colour_manual(values=c("gray","blue","red","green"),
             breaks=c("Data","Trend 0 Original","Trend 1 Middle","Trend 2 End"))

## Warning: Removed 12 row(s) containing missing values (geom_path).
## Removed 12 row(s) containing missing values (geom_path).
## Removed 12 row(s) containing missing values (geom_path).

Plot both Seasonaly adjusted

autoplot(plastics, series="Data") +
  autolayer(seasadj(fit2), series="Seasonally Adjusted Middle") +
  autolayer(seasadj(fit3), series="Seasonally Adjusted End") +
    xlab("Year") + ylab("Sales") +
  ggtitle("Plastic Sales Product A") +
  scale_colour_manual(values=c("gray","blue","red"),
             breaks=c("Data","Seasonally Adjusted Middle","Seasonally Adjusted End"))

plot all seasonalities only

  autoplot(seasonal(fit), series="Seasonal 0 Original") +
  autolayer(seasonal(fit2), series="Seasonal 1 Middle") +
  autolayer(seasonal(fit3), series="Seasonal 2 End") +
    xlab("Year") + ylab("Sales") +
  ggtitle("Plastic Sales Product A") +
  scale_colour_manual(values=c("blue","red","green"),
             breaks=c("Seasonal 0 Original","Seasonal 1 Middle","Seasonal 2 End"))

Check and compara using X11

library(seasonal)
 ts(plastics,start=2001, frequency = 12) %>% seas(x11="") -> X11fit
 ts(plastics2,start=2001, frequency = 12) %>% seas(x11="") -> X11fit2
 ts(plastics3,start=2001, frequency = 12) %>% seas(x11="") -> X11fit3

Plot the trends

autoplot(trendcycle(X11fit), series="Original") +
  autolayer(trendcycle(X11fit2), series="Middle") +
  autolayer(trendcycle(X11fit3), series="Towards End") +
  xlab("Year") + ylab("Sales") +
  ggtitle("Plot of all 3 Trends") +
  scale_colour_manual(values=c("green","blue","red"),
             breaks=c("Original","Middle","Towards End"))

Plot Seasonalities

autoplot(seasonal(X11fit), series="Original") +
  autolayer(seasonal(X11fit2), series="Middle") +
  autolayer(seasonal(X11fit3), series="Towards End") +
  xlab("Year") + ylab("Sales") +
  ggtitle("Plot of all 3 Seasonalities") +
  scale_colour_manual(values=c("green","blue","red"),
             breaks=c("Original","Middle","Towards End"))

There is of course some differences since the MA averages values, but I say the bumps you see in the middle and end don’t dramatically change the overal plots. Furthermore when we use X11 the plots are almost identical.

7.1

Consider the pigs series — the number of pigs slaughtered in Victoria each month.

pigs2 <- fma::pigs
str(pigs2)

##  Time-Series [1:188] from 1980 to 1996: 76378 71947 33873 96428 105084 ...

frequency(pigs2)

## [1] 12

7.1.1

Use the ses() function in R to find the optimal values of
α and ℓ0, and generate forecasts for the next four months.

fc <- ses(pigs2, h=4)
summary(fc)

## 
## Forecast method: Simple exponential smoothing
## 
## Model Information:
## Simple exponential smoothing 
## 
## Call:
##  ses(y = pigs2, h = 4) 
## 
##   Smoothing parameters:
##     alpha = 0.2971 
## 
##   Initial states:
##     l = 77260.0561 
## 
##   sigma:  10308.58
## 
##      AIC     AICc      BIC 
## 4462.955 4463.086 4472.665 
## 
## Error measures:
##                    ME    RMSE      MAE       MPE     MAPE      MASE       ACF1
## Training set 385.8721 10253.6 7961.383 -0.922652 9.274016 0.7966249 0.01282239
## 
## Forecasts:
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## Sep 1995       98816.41 85605.43 112027.4 78611.97 119020.8
## Oct 1995       98816.41 85034.52 112598.3 77738.83 119894.0
## Nov 1995       98816.41 84486.34 113146.5 76900.46 120732.4
## Dec 1995       98816.41 83958.37 113674.4 76092.99 121539.8

alpha = 0.2971

ℓ0 = 77,260.06

7.1.2

Compute a 95% prediction interval for the first forecast using y± 1.96s where is the standard deviation of the residuals. Compare your interval with the interval produced by R.

sigma = 10,308.58

# Standard deviation of Residuals
sd_resid <- sd(fc$residuals)

my_lower <- fc$mean[1] - (1.96*sd_resid)
my_upper <- fc$mean[1] + (1.96*sd_resid)
print(my_lower)

## [1] 78679.97

print(my_upper)

## [1] 118952.8

print("Diff from lower")

## [1] "Diff from lower"

print(my_lower-fc$lower[1,2])

##     95% 
## 67.9988

print("Diff from upper")

## [1] "Diff from upper"

print(my_upper-fc$upper[1,2])

##      95% 
## -67.9988

7.2

Write your own function to implement simple exponential smoothing. The function should take arguments y (the time series), alpha (the smoothing parameter α) and level (the initial level ℓ It should return the forecast of the next observation in the series. Does it give the same forecast as ses()?

my_ses <- function(y, alpha, level){
  y_hat <- level

    for(i in 1:length(y)){
   y_hat <- alpha * y[i] + (1 - alpha)* y_hat 
    }
return(y_hat)
  }

my_ses(pigs2,0.2971,77260.0561)

## [1] 98816.45

fc$mean[1]

## [1] 98816.41

7.3

Modify your function from the previous exercise to return the sum of squared errors rather than the forecast of the next observation. Then use the optim() function to find the optimal values of
α and ℓ. Do you get the same values as the ses() function ?

Let’s try this

my_sse <- function(data, par){
  y_hat <- par[2]
  local_sse <- 0

    for(i in 1:length(data)){
      # print(y_hat - data[i])
      local_sse <- local_sse + (y_hat - data[i])^2 
      y_hat <- par[1] * data[i] + (1 - par[1])* y_hat
    }
return(local_sse)
}

my_optim <- optim(par= c(0.2,50000), fn=my_sse,  data=pigs2)
my_optim$par

## [1] 2.971173e-01 7.726697e+04

fc$model$par

##        alpha            l 
## 2.971488e-01 7.726006e+04

8.1

ANSWER The all seem to be white noise. No stron correlation (beyond confidence interval level) at any of the lags observed.

8.2

ibmclose %>%
  autoplot()

Box.test(diff(ibmclose),lag=24,type="Ljung-Box")

## 
##  Box-Ljung test
## 
## data:  diff(ibmclose)
## X-squared = 42.156, df = 24, p-value = 0.0124

library(urca)
ibmclose %>% 
  ur.kpss() %>%
  summary()

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 3.6236 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

ggAcf(ibmclose)

ggPacf(ibmclose)

8.6

Use R to simulate and plot some data from simple ARIMA models.

8.6.a

Use the following R code to generate data from an AR(1) model with
ϕ1 = 0.6 and σ2 = 1. The process starts with y1 = 0.

y <- ts(numeric(100))
e <- rnorm(100)

a <- y
b <- y
c <- y

for(i in 2:100)
  y[i] <- 0.6*y[i-1] + e[i]

8.6.b

Produce a time plot for the series. How does the plot change as you change ϕ1 ?

y %>%
  autoplot()

for(i in 2:100) {
  a[i] <- 0.1*a[i-1] + e[i]
  b[i] <- 0.6*b[i-1] + e[i]
  c[i] <- 0.9*c[i-1] + e[i]
}

autoplot(a, series="=0.1") +
  autolayer(b, series="=0.6") +
  autolayer(c, series="=0.9") +
  ggtitle("Plot of all 3 AR") +
  scale_colour_manual(values=c("orange","blue","red"),
             breaks=c("=0.1","=0.6","=0.9"))

8.6.c

Write your own code to generate data from an MA(1) model with θ1 = 0.6 and σ2 = 1.

y <- ts(numeric(100))
e <- rnorm(100, mean=0, sd=1)

a <- y
b <- y
c <- y

for(i in 2:100)
  y[i] <- 0.6*e[i-1] + e[i]

y %>%
  autoplot()

8.6.d

Produce a time plot for the series. How does the plot change as you change
θ1?

for(i in 2:100) {
  a[i] <- 0.1*e[i-1] + e[i]
  b[i] <- 0.6*e[i-1] + e[i]
  c[i] <- 0.9*e[i-1] + e[i]
}

autoplot(a, series="=0.1") +
  autolayer(b, series="=0.6") +
  autolayer(c, series="=0.9") +
  ggtitle("Plot of all 3 MA") +
  scale_colour_manual(values=c("orange","blue","red"),
             breaks=c("=0.1","=0.6","=0.9"))

8.6.e

Generate data from an ARMA(1,1) model with
ϕ1 = 0.6, θ1 = 0.6 and σ2 = 1.

y1 <- ts(numeric(100))
e <- rnorm(100, mean = 0, sd=1)

for(i in 2:100)
  y1[i] <- 0.6*y1[i-1] + 0.6*e[i-1] + e[i]

8.6.f

Generate data from an AR(2) model with ϕ1 = −0.8, ϕ2 = 0.3 and σ2=1.(Note that these parameters will give a non-stationary series.)

y2 <- ts(numeric(100))
e <- rnorm(100, mean = 0, sd=1)

for(i in 3:100)
  y2[i] <- -0.8*y2[i-1] + 0.3*y2[i-2] + e[i]

8.6.g

Graph the latter two series and compare them.

y1 %>%
  autoplot()

y2 %>%
  autoplot()

8.8

Consider austa, the total international visitors to Australia (in millions) for the period 1980-2015.

str(austa)

##  Time-Series [1:36] from 1980 to 2015: 0.83 0.86 0.877 0.867 0.932 ...

frequency(austa)

## [1] 1

austa %>%
  autoplot()

8.8.a

Use auto.arima() to find an appropriate ARIMA model. What model was selected. Check that the residuals look like white noise. Plot forecasts for the next 10 periods.

fit <- auto.arima(austa)

checkresiduals(fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1) with drift
## Q* = 2.297, df = 5, p-value = 0.8067
## 
## Model df: 2.   Total lags used: 7

fit %>%
  forecast(10) %>%
  autoplot()

8.8.b

Plot forecasts from an ARIMA(0,1,1) model with no drift and compare these to part a. Remove the MA term and plot again.

austa %>%
  Arima(order = c(0,1,1), include.drift = FALSE) %>%
  forecast(10) %>%
  autoplot()

austa %>%
  Arima(order = c(0,1,0), include.drift = FALSE) %>%
  forecast(10) %>%
  autoplot()

8.8.c

Plot forecasts from an ARIMA(2,1,3) model with drift. Remove the constant and see what happens.

austa %>%
  Arima(order = c(2,1,3), include.drift = TRUE, method="ML") %>%
  forecast(10) %>%
  autoplot()

austa %>%
  Arima(order = c(2,1,3), include.drift = TRUE, include.constant = FALSE, method="ML") %>%
  forecast(10) %>%
  autoplot()

8.8.d

Plot forecasts from an ARIMA(0,0,1) model with a constant. Remove the MA term and plot again.

austa %>%
  Arima(order = c(0,0,1), include.constant = TRUE, method="ML") %>%
  forecast(10) %>%
  autoplot()

austa %>%
  Arima(order = c(0,0,0), include.constant = TRUE, method="ML") %>%
  forecast(10) %>%
  autoplot()

8.8.e

Plot forecasts from an ARIMA(0,2,1) model with no constant.

austa %>%
  Arima(order = c(0,2,1), include.constant = FALSE, method="ML") %>%
  forecast(10) %>%
  autoplot()

Questions from KJ Text

3.1

The UC Irvine Machine Learning contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

library(mlbench)
data(Glass)
str(Glass)

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

3.1.a

Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

# library(ggplot2)
library(GGally)
meltData <- reshape2::melt(Glass)

## Using Type as id variables

p <- ggplot(meltData, aes(factor(variable), value)) 
p + geom_boxplot() + facet_wrap(~variable, scale="free")

data <- subset(Glass, select=RI:Fe)
ggpairs(data, title="correlogram with ggpairs()")

3.1.b

Do there appear to be any outliers in the data? Are any predictors skewed?

ANSWER From the Boxplots yes, most of them except Mg show outliers. Most except Fe and Ba show outliers in both low and high sides of their distributions.

3.1.c

Are there any relevant transformations of one or more predictors that might improve the classification model?

ANSWER Yes :) some will benefit from fixing skewness, specifically RI, Mg, Al, K, Ca,Ba,Fe

3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

# library(mlbench)
data(Soybean)

3.2.a

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

data <- subset(Soybean, select=date:roots)
for (i in colnames(data)){
  freq <- table(data[[i]])
  print(paste0("Relative Frequency Table:",i))
  prob <- prop.table(freq)
  print(prob)
  print("******************************************************************")
}

## [1] "Relative Frequency Table:date"
## 
##          0          1          2          3          4          5          6 
## 0.03812317 0.10997067 0.13636364 0.17302053 0.19208211 0.21847507 0.13196481 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:plant.stand"
## 
##         0         1 
## 0.5471406 0.4528594 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:precip"
## 
##         0         1         2 
## 0.1147287 0.1736434 0.7116279 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:temp"
## 
##         0         1         2 
## 0.1225115 0.5727412 0.3047473 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:hail"
## 
##         0         1 
## 0.7740214 0.2259786 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:crop.hist"
## 
##          0          1          2          3 
## 0.09745127 0.24737631 0.32833583 0.32683658 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:area.dam"
## 
##         0         1         2         3 
## 0.1803519 0.3328446 0.2126100 0.2741935 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:sever"
## 
##          0          1          2 
## 0.34697509 0.57295374 0.08007117 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:seed.tmt"
## 
##          0          1          2 
## 0.54270463 0.39501779 0.06227758 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:germ"
## 
##         0         1         2 
## 0.2889667 0.3730298 0.3380035 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:plant.growth"
## 
##         0         1 
## 0.6611694 0.3388306 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:leaves"
## 
##         0         1 
## 0.1127379 0.8872621 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:leaf.halo"
## 
##          0          1          2 
## 0.36894825 0.06010017 0.57095159 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:leaf.marg"
## 
##          0          1          2 
## 0.59599332 0.03505843 0.36894825 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:leaf.size"
## 
##         0         1         2 
## 0.0851419 0.5459098 0.3689482 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:leaf.shread"
## 
##         0         1 
## 0.8353345 0.1646655 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:leaf.malf"
## 
##          0          1 
## 0.92487479 0.07512521 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:leaf.mild"
## 
##          0          1          2 
## 0.93043478 0.03478261 0.03478261 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:stem"
## 
##         0         1 
## 0.4437781 0.5562219 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:lodging"
## 
##         0         1 
## 0.9252669 0.0747331 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:stem.cankers"
## 
##          0          1          2          3 
## 0.58759690 0.06046512 0.05581395 0.29612403 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:canker.lesion"
## 
##         0         1         2         3 
## 0.4961240 0.1286822 0.2744186 0.1007752 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:fruiting.bodies"
## 
##         0         1 
## 0.8197574 0.1802426 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:ext.decay"
## 
##          0          1          2 
## 0.77054264 0.20930233 0.02015504 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:mycelium"
## 
##           0           1 
## 0.990697674 0.009302326 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:int.discolor"
## 
##          0          1          2 
## 0.90077519 0.06821705 0.03100775 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:sclerotia"
## 
##          0          1 
## 0.96899225 0.03100775 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:fruit.pods"
## 
##          0          1          2          3 
## 0.67946578 0.21702838 0.02337229 0.08013356 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:fruit.spots"
## 
##          0          1          2          4 
## 0.59792028 0.12998267 0.09878683 0.17331023 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:seed"
## 
##         0         1 
## 0.8054146 0.1945854 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:mold.growth"
## 
##         0         1 
## 0.8866328 0.1133672 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:seed.discolor"
## 
##         0         1 
## 0.8890815 0.1109185 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:seed.size"
## 
##         0         1 
## 0.9001692 0.0998308 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:shriveling"
## 
##          0          1 
## 0.93414211 0.06585789 
## [1] "******************************************************************"
## [1] "Relative Frequency Table:roots"
## 
##          0          1          2 
## 0.84509202 0.13190184 0.02300613 
## [1] "******************************************************************"

3.2.b

Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

# colSums(is.na(Soybean))

Soybean %>%
  dplyr::group_by(Class) %>%
  sapply(function(x) sum(is.na(x)))

##           Class            date     plant.stand          precip            temp 
##               0               1              36              38              30 
##            hail       crop.hist        area.dam           sever        seed.tmt 
##             121              16               1             121             121 
##            germ    plant.growth          leaves       leaf.halo       leaf.marg 
##             112              16               0              84              84 
##       leaf.size     leaf.shread       leaf.malf       leaf.mild            stem 
##              84             100              84             108              16 
##         lodging    stem.cankers   canker.lesion fruiting.bodies       ext.decay 
##             121              38              38             106              38 
##        mycelium    int.discolor       sclerotia      fruit.pods     fruit.spots 
##              38              38              38              84             106 
##            seed     mold.growth   seed.discolor       seed.size      shriveling 
##              92              92             106              92             106 
##           roots 
##              31

By Class

Soybean %>%
  dplyr::group_by(Class) %>%
  dplyr::summarise(across(.fns = ~sum(is.na(.)))) %>%
  dplyr::transmute(Class=Class, totNA = rowSums(dplyr::across(where(is.numeric))))

## # A tibble: 19 x 2
##    Class                       totNA
##    <fct>                       <dbl>
##  1 2-4-d-injury                  450
##  2 alternarialeaf-spot             0
##  3 anthracnose                     0
##  4 bacterial-blight                0
##  5 bacterial-pustule               0
##  6 brown-spot                      0
##  7 brown-stem-rot                  0
##  8 charcoal-rot                    0
##  9 cyst-nematode                 336
## 10 diaporthe-pod-&-stem-blight   177
## 11 diaporthe-stem-canker           0
## 12 downy-mildew                    0
## 13 frog-eye-leaf-spot              0
## 14 herbicide-injury              160
## 15 phyllosticta-leaf-spot          0
## 16 phytophthora-rot             1214
## 17 powdery-mildew                  0
## 18 purple-seed-stain               0
## 19 rhizoctonia-root-rot            0

3.2.c

Develop a strategy for handling missing data, either by eliminating predictors or imputation.

ANSWER We could either start only with the intercept and add one predictor at a time if it improves performance. Since RI is a continuous variable, we are talking most likely about a regression problem so measures like RMSE or MSE may be proper. If our goal is to predict Class, then it is a classification problem and measure like accuracy, precision, recall, and ROC-AUC would be proper.

The strategy is to add a predictor if it improves the chosen metric, if it doesn’t choose a different one to add. Keep repeating until you exhaust all variables (predictors)