Data 624 - Homework 1
Questions from HA Text
Some initialization by loading the fpp2 library and erasing any variables in memory
library(fpp2)## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## -- Attaching packages ---------------------------------------------- fpp2 2.4 --
## v ggplot2 3.3.6 v fma 2.4
## v forecast 8.16 v expsmooth 2.3
##
rm(list = ls())2.1
Use the help function to explore what the series gold, woolyrnq and gas represent.
?gold tells us that is daily gold prices in US Dollars between 1/1/1985 and 03/31/1989
?woolyrnq tells us quarterly production of woollen yan in AU between March 1965 and Sep 1994
?gas tell us Australian monthly gas production between 1956 and 1995
2.1.a
Use autoplot() to plot each of these in separate plots.
ANSWER An inspection of all 3 datasets tells un something important about them. gas and woolyrnq are structured as regular dataset in monthly and quarterly frequencies. But gold is not. The values are a sequence from 1 to 1,108.
We could do a simple autoplot(gold) and we woiuld get a plot, but the X axis would shows the 1:108 sequence which is not very helpful.
autoplot(gold)gold has 1,108 rows in the dataset. But if you check the number of days between 1/1/1985 and 03/31/1989, the number of days is 1551, which tells us the time series is not based on a 365 days/year.
length(seq(as.Date('1985-01-01'),as.Date('1989-03-31'),by = 1))## [1] 1551
We checked the number of weekdays between those dates and the number was 1,108 so we know now that gold is daily prices, but includes only weekdays
gold_series <- seq(as.Date('1985-01-01'),as.Date('1989-03-31'),by = 1)
gold_series <- gold_series[!weekdays(gold_series) %in% c('Saturday','Sunday')]
length(gold_series)-1## [1] 1108
Unfortunatel the ts object we are using under fpp2 allows only for regular frequencies. To get a nicer and more informative graph we would make use of tsiblle and plot it showing dates in the x-axis
gold2 <- tsibble::tsibble(date=gold_series[-length(gold_series)], price=gold, index = date)
plot(gold2, type ="l")2.1. b
What is the frequency of each series? Hint: apply the frequency() function.
ANSWER We used both the ? function and the frequency function to get a complete description of thr frequencies used.
- Gold is daily, but only weekdays
- Gas is monthly
- Wooltrnq is quarterly (3 months)
frequency(gold)## [1] 1
frequency(gas)## [1] 12
frequency(woolyrnq)## [1] 4
2.1.c
Use which.max() to spot the outlier in the gold series. Which observation was it?
ANSWER We made use of the weekday series we generate before for the plot. So we could get more valuable data for the outlier.
# This gives us the row number for the MAX prices in the Gold dataset
which.max(gold)## [1] 770
# This give is the DATE index for the max value in the series
gold_series[which.max(gold)]## [1] "1987-12-14"
# This gives us the MAX price in the series
gold[which.max(gold)]## [1] 593.7
The outlier was row 770, which was for the date: 1987-12-14 and the price was on that date was: $ 593.70
2.3
Download some monthly Australian retail data from the book website. These represent retail sales in various categories for different Australian states, and are stored in a MS-Excel file.
2.3.a
You can read the data into R with the following script
retaildata <- readxl::read_excel("data/retail.xlsx", skip=1)2.3.b
Select one of the time series as follows (but replace the column name with your own chosen column):
myts <- ts(retaildata[,"A3349627V"],
frequency=12, start=c(1982,4))2.3.c
Explore your chosen retail time series using the following functions:
autoplot(), ggseasonplot(), ggsubseriesplot(), gglagplot(), ggAcf()
Can you spot any seasonality, cyclicity and trend? What do you learn about the series?
autoplot(myts) +
ggtitle("Turnover ; New South Wales ; Liquor retailing") +
xlab("Year") +
ylab("AU$ Millions")ggseasonplot(myts, year.labels=TRUE, year.labels.left=TRUE) +
ylab("AU$ Million") +
ggtitle("Seasonal Plot - Turnover ; New South Wales ; Liquor retailing")myts2 <- window(myts, start=2004)
gglagplot(myts2,12, do.lines = FALSE)This plot is hard to read because of the very large trend in the data. If the take one difference to remove trend then we will see better the impact of seasonality.
Dmyts <-diff(myts2)
gglagplot(Dmyts,12, do.lines = FALSE)ggAcf(myts)ggAcf(myts2)6.2
The plastics data set consists of the monthly sales (in thousands) of product A for a plastics manufacturer for five years.
6.2.a
Plot the time series of sales of product A. Can you identify seasonal fluctuations and/or a trend-cycle?
plastics %>% autoplot()6.2.b
Use a classical multiplicative decomposition to calculate the trend-cycle and seasonal indices.
plastics %>% decompose(type="multiplicative") %>%
autoplot() + xlab("Year") +
ggtitle("Classical multiplicative decomposition Plastic Prod A Sales")6.2.c
Do the results support the graphical interpretation from part a?
ANSWER Yes, the decomposition shows a clear 12 seasonal pattern in the data
6.2.d
Compute and plot the seasonally adjusted data.
ANSWER Using classical decomposition, we decompose the plastics dataset using the decompose function with multiplicative parameter.
fit <- decompose(plastics, type = "multiplicative")
#fitseason_adj <- fit$x / fit$seasonal
season_adj %>% autoplot()Plot Data, Trend, Seasonally Adusted Data
autoplot(plastics, series="Data") +
autolayer(fit$trend, series="Trend") +
autolayer(fit$x / fit$seasonal, series="Seasonally Adjusted") +
xlab("Year") + ylab("Sales") +
ggtitle("Plastic Sales Product A") +
scale_colour_manual(values=c("gray","blue","red"),
breaks=c("Data","Seasonally Adjusted","Trend"))## Warning: Removed 12 row(s) containing missing values (geom_path).
OR
autoplot(plastics, series="Data") +
autolayer(trendcycle(fit), series="Trend") +
autolayer(seasadj(fit), series="Seasonally Adjusted") +
xlab("Year") + ylab("Sales") +
ggtitle("Plastic Sales Product A") +
scale_colour_manual(values=c("gray","blue","red"),
breaks=c("Data","Seasonally Adjusted","Trend"))## Warning: Removed 12 row(s) containing missing values (geom_path).
6.2.e
Change one observation to be an outlier (e.g., add 500 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?
#plastics2 <- plastics
#plastics2[30] <- plastics[30] + 500
plastics2 <- plastics
plastics2[30] <- plastics2[30] + 500
plastics2 %>% autoplot()plastics2 %>% decompose(type="multiplicative") -> fit2
autoplot(plastics2, series="Data") +
autolayer(trendcycle(fit2), series="Trend") +
autolayer(seasadj(fit2), series="Seasonally Adjusted") +
xlab("Year") + ylab("Sales") +
ggtitle("Plastic Sales Product A") +
scale_colour_manual(values=c("gray","blue","red"),
breaks=c("Data","Seasonally Adjusted","Trend"))## Warning: Removed 12 row(s) containing missing values (geom_path).
6.2.f
Does it make any difference if the outlier is near the end rather than in the middle of the time series?
Near end
plastics3 <- plastics
plastics3[53] <- plastics3[53] + 500
plastics3 %>% decompose(type="multiplicative") -> fit3
autoplot(plastics3, series="Data") +
autolayer(trendcycle(fit3), series="Trend") +
autolayer(seasadj(fit3), series="Seasonally Adjusted") +
xlab("Year") + ylab("Sales") +
ggtitle("Plastic Sales Product A") +
scale_colour_manual(values=c("gray","blue","red"),
breaks=c("Data","Seasonally Adjusted","Trend"))## Warning: Removed 12 row(s) containing missing values (geom_path).
Plot all trends
autoplot(plastics, series="Data") +
autolayer(trendcycle(fit), series="Trend 0 Original") +
autolayer(trendcycle(fit2), series="Trend 1 Middle") +
autolayer(trendcycle(fit3), series="Trend 2 End") +
xlab("Year") + ylab("Sales") +
ggtitle("Plastic Sales Product A") +
scale_colour_manual(values=c("gray","blue","red","green"),
breaks=c("Data","Trend 0 Original","Trend 1 Middle","Trend 2 End"))## Warning: Removed 12 row(s) containing missing values (geom_path).
## Removed 12 row(s) containing missing values (geom_path).
## Removed 12 row(s) containing missing values (geom_path).
Plot both Seasonaly adjusted
autoplot(plastics, series="Data") +
autolayer(seasadj(fit2), series="Seasonally Adjusted Middle") +
autolayer(seasadj(fit3), series="Seasonally Adjusted End") +
xlab("Year") + ylab("Sales") +
ggtitle("Plastic Sales Product A") +
scale_colour_manual(values=c("gray","blue","red"),
breaks=c("Data","Seasonally Adjusted Middle","Seasonally Adjusted End"))plot all seasonalities only
autoplot(seasonal(fit), series="Seasonal 0 Original") +
autolayer(seasonal(fit2), series="Seasonal 1 Middle") +
autolayer(seasonal(fit3), series="Seasonal 2 End") +
xlab("Year") + ylab("Sales") +
ggtitle("Plastic Sales Product A") +
scale_colour_manual(values=c("blue","red","green"),
breaks=c("Seasonal 0 Original","Seasonal 1 Middle","Seasonal 2 End"))Check and compara using X11
library(seasonal)
ts(plastics,start=2001, frequency = 12) %>% seas(x11="") -> X11fit
ts(plastics2,start=2001, frequency = 12) %>% seas(x11="") -> X11fit2
ts(plastics3,start=2001, frequency = 12) %>% seas(x11="") -> X11fit3Plot the trends
autoplot(trendcycle(X11fit), series="Original") +
autolayer(trendcycle(X11fit2), series="Middle") +
autolayer(trendcycle(X11fit3), series="Towards End") +
xlab("Year") + ylab("Sales") +
ggtitle("Plot of all 3 Trends") +
scale_colour_manual(values=c("green","blue","red"),
breaks=c("Original","Middle","Towards End"))Plot Seasonalities
autoplot(seasonal(X11fit), series="Original") +
autolayer(seasonal(X11fit2), series="Middle") +
autolayer(seasonal(X11fit3), series="Towards End") +
xlab("Year") + ylab("Sales") +
ggtitle("Plot of all 3 Seasonalities") +
scale_colour_manual(values=c("green","blue","red"),
breaks=c("Original","Middle","Towards End"))There is of course some differences since the MA averages values, but I say the bumps you see in the middle and end don’t dramatically change the overal plots. Furthermore when we use X11 the plots are almost identical.
7.1
Consider the pigs series — the number of pigs slaughtered in Victoria each month.
pigs2 <- fma::pigs
str(pigs2)## Time-Series [1:188] from 1980 to 1996: 76378 71947 33873 96428 105084 ...
frequency(pigs2)## [1] 12
7.1.1
Use the ses() function in R to find the optimal values of
α and ℓ0, and generate forecasts for the next four months.
fc <- ses(pigs2, h=4)
summary(fc)##
## Forecast method: Simple exponential smoothing
##
## Model Information:
## Simple exponential smoothing
##
## Call:
## ses(y = pigs2, h = 4)
##
## Smoothing parameters:
## alpha = 0.2971
##
## Initial states:
## l = 77260.0561
##
## sigma: 10308.58
##
## AIC AICc BIC
## 4462.955 4463.086 4472.665
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 385.8721 10253.6 7961.383 -0.922652 9.274016 0.7966249 0.01282239
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Sep 1995 98816.41 85605.43 112027.4 78611.97 119020.8
## Oct 1995 98816.41 85034.52 112598.3 77738.83 119894.0
## Nov 1995 98816.41 84486.34 113146.5 76900.46 120732.4
## Dec 1995 98816.41 83958.37 113674.4 76092.99 121539.8
alpha = 0.2971
ℓ0 = 77,260.06
7.1.2
Compute a 95% prediction interval for the first forecast using y± 1.96s where is the standard deviation of the residuals. Compare your interval with the interval produced by R.
sigma = 10,308.58
# Standard deviation of Residuals
sd_resid <- sd(fc$residuals)
my_lower <- fc$mean[1] - (1.96*sd_resid)
my_upper <- fc$mean[1] + (1.96*sd_resid)
print(my_lower)## [1] 78679.97
print(my_upper)## [1] 118952.8
print("Diff from lower")## [1] "Diff from lower"
print(my_lower-fc$lower[1,2])## 95%
## 67.9988
print("Diff from upper")## [1] "Diff from upper"
print(my_upper-fc$upper[1,2])## 95%
## -67.9988
7.2
Write your own function to implement simple exponential smoothing. The function should take arguments y (the time series), alpha (the smoothing parameter α) and level (the initial level ℓ It should return the forecast of the next observation in the series. Does it give the same forecast as ses()?
my_ses <- function(y, alpha, level){
y_hat <- level
for(i in 1:length(y)){
y_hat <- alpha * y[i] + (1 - alpha)* y_hat
}
return(y_hat)
}my_ses(pigs2,0.2971,77260.0561)## [1] 98816.45
fc$mean[1]## [1] 98816.41
7.3
Modify your function from the previous exercise to return the sum of
squared errors rather than the forecast of the next observation. Then
use the optim() function to find the optimal values of
α and ℓ. Do you get the same values as the ses() function ?
Let’s try this
my_sse <- function(data, par){
y_hat <- par[2]
local_sse <- 0
for(i in 1:length(data)){
# print(y_hat - data[i])
local_sse <- local_sse + (y_hat - data[i])^2
y_hat <- par[1] * data[i] + (1 - par[1])* y_hat
}
return(local_sse)
}my_optim <- optim(par= c(0.2,50000), fn=my_sse, data=pigs2)
my_optim$par## [1] 2.971173e-01 7.726697e+04
fc$model$par## alpha l
## 2.971488e-01 7.726006e+04
8.1
ANSWER The all seem to be white noise. No stron correlation (beyond confidence interval level) at any of the lags observed.
8.2
ibmclose %>%
autoplot()Box.test(diff(ibmclose),lag=24,type="Ljung-Box")##
## Box-Ljung test
##
## data: diff(ibmclose)
## X-squared = 42.156, df = 24, p-value = 0.0124
library(urca)
ibmclose %>%
ur.kpss() %>%
summary()##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 3.6236
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
ggAcf(ibmclose)ggPacf(ibmclose)8.6
Use R to simulate and plot some data from simple ARIMA models.
8.6.a
Use the following R code to generate data from an AR(1) model
with
ϕ1 = 0.6 and σ2 = 1. The process starts with y1 = 0.
y <- ts(numeric(100))
e <- rnorm(100)
a <- y
b <- y
c <- y
for(i in 2:100)
y[i] <- 0.6*y[i-1] + e[i]8.6.b
Produce a time plot for the series. How does the plot change as you change ϕ1 ?
y %>%
autoplot()for(i in 2:100) {
a[i] <- 0.1*a[i-1] + e[i]
b[i] <- 0.6*b[i-1] + e[i]
c[i] <- 0.9*c[i-1] + e[i]
}autoplot(a, series="=0.1") +
autolayer(b, series="=0.6") +
autolayer(c, series="=0.9") +
ggtitle("Plot of all 3 AR") +
scale_colour_manual(values=c("orange","blue","red"),
breaks=c("=0.1","=0.6","=0.9"))8.6.c
Write your own code to generate data from an MA(1) model with θ1 = 0.6 and σ2 = 1.
y <- ts(numeric(100))
e <- rnorm(100, mean=0, sd=1)
a <- y
b <- y
c <- y
for(i in 2:100)
y[i] <- 0.6*e[i-1] + e[i]
y %>%
autoplot()8.6.d
Produce a time plot for the series. How does the plot change as you
change
θ1?
for(i in 2:100) {
a[i] <- 0.1*e[i-1] + e[i]
b[i] <- 0.6*e[i-1] + e[i]
c[i] <- 0.9*e[i-1] + e[i]
}
autoplot(a, series="=0.1") +
autolayer(b, series="=0.6") +
autolayer(c, series="=0.9") +
ggtitle("Plot of all 3 MA") +
scale_colour_manual(values=c("orange","blue","red"),
breaks=c("=0.1","=0.6","=0.9"))8.6.e
Generate data from an ARMA(1,1) model with
ϕ1 = 0.6, θ1 = 0.6 and σ2 = 1.
y1 <- ts(numeric(100))
e <- rnorm(100, mean = 0, sd=1)
for(i in 2:100)
y1[i] <- 0.6*y1[i-1] + 0.6*e[i-1] + e[i]8.6.f
Generate data from an AR(2) model with ϕ1 = −0.8, ϕ2 = 0.3 and σ2=1.(Note that these parameters will give a non-stationary series.)
y2 <- ts(numeric(100))
e <- rnorm(100, mean = 0, sd=1)
for(i in 3:100)
y2[i] <- -0.8*y2[i-1] + 0.3*y2[i-2] + e[i]8.6.g
Graph the latter two series and compare them.
y1 %>%
autoplot()y2 %>%
autoplot()8.8
Consider austa, the total international visitors to Australia (in millions) for the period 1980-2015.
str(austa)## Time-Series [1:36] from 1980 to 2015: 0.83 0.86 0.877 0.867 0.932 ...
frequency(austa)## [1] 1
austa %>%
autoplot()8.8.a
Use auto.arima() to find an appropriate ARIMA model. What model was selected. Check that the residuals look like white noise. Plot forecasts for the next 10 periods.
fit <- auto.arima(austa)checkresiduals(fit)##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1) with drift
## Q* = 2.297, df = 5, p-value = 0.8067
##
## Model df: 2. Total lags used: 7
fit %>%
forecast(10) %>%
autoplot()8.8.b
Plot forecasts from an ARIMA(0,1,1) model with no drift and compare these to part a. Remove the MA term and plot again.
austa %>%
Arima(order = c(0,1,1), include.drift = FALSE) %>%
forecast(10) %>%
autoplot()austa %>%
Arima(order = c(0,1,0), include.drift = FALSE) %>%
forecast(10) %>%
autoplot()8.8.c
Plot forecasts from an ARIMA(2,1,3) model with drift. Remove the constant and see what happens.
austa %>%
Arima(order = c(2,1,3), include.drift = TRUE, method="ML") %>%
forecast(10) %>%
autoplot()austa %>%
Arima(order = c(2,1,3), include.drift = TRUE, include.constant = FALSE, method="ML") %>%
forecast(10) %>%
autoplot()8.8.d
Plot forecasts from an ARIMA(0,0,1) model with a constant. Remove the MA term and plot again.
austa %>%
Arima(order = c(0,0,1), include.constant = TRUE, method="ML") %>%
forecast(10) %>%
autoplot()austa %>%
Arima(order = c(0,0,0), include.constant = TRUE, method="ML") %>%
forecast(10) %>%
autoplot()8.8.e
Plot forecasts from an ARIMA(0,2,1) model with no constant.
austa %>%
Arima(order = c(0,2,1), include.constant = FALSE, method="ML") %>%
forecast(10) %>%
autoplot()Questions from KJ Text
3.1
The UC Irvine Machine Learning contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
library(mlbench)
data(Glass)
str(Glass)## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
3.1.a
Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
# library(ggplot2)
library(GGally)
meltData <- reshape2::melt(Glass)## Using Type as id variables
p <- ggplot(meltData, aes(factor(variable), value))
p + geom_boxplot() + facet_wrap(~variable, scale="free")data <- subset(Glass, select=RI:Fe)
ggpairs(data, title="correlogram with ggpairs()") 3.1.b
Do there appear to be any outliers in the data? Are any predictors skewed?
ANSWER From the Boxplots yes, most of them except Mg show outliers. Most except Fe and Ba show outliers in both low and high sides of their distributions.
3.1.c
Are there any relevant transformations of one or more predictors that might improve the classification model?
ANSWER Yes :) some will benefit from fixing skewness, specifically RI, Mg, Al, K, Ca,Ba,Fe
3.2
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.
# library(mlbench)
data(Soybean)3.2.a
Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
data <- subset(Soybean, select=date:roots)
for (i in colnames(data)){
freq <- table(data[[i]])
print(paste0("Relative Frequency Table:",i))
prob <- prop.table(freq)
print(prob)
print("******************************************************************")
}## [1] "Relative Frequency Table:date"
##
## 0 1 2 3 4 5 6
## 0.03812317 0.10997067 0.13636364 0.17302053 0.19208211 0.21847507 0.13196481
## [1] "******************************************************************"
## [1] "Relative Frequency Table:plant.stand"
##
## 0 1
## 0.5471406 0.4528594
## [1] "******************************************************************"
## [1] "Relative Frequency Table:precip"
##
## 0 1 2
## 0.1147287 0.1736434 0.7116279
## [1] "******************************************************************"
## [1] "Relative Frequency Table:temp"
##
## 0 1 2
## 0.1225115 0.5727412 0.3047473
## [1] "******************************************************************"
## [1] "Relative Frequency Table:hail"
##
## 0 1
## 0.7740214 0.2259786
## [1] "******************************************************************"
## [1] "Relative Frequency Table:crop.hist"
##
## 0 1 2 3
## 0.09745127 0.24737631 0.32833583 0.32683658
## [1] "******************************************************************"
## [1] "Relative Frequency Table:area.dam"
##
## 0 1 2 3
## 0.1803519 0.3328446 0.2126100 0.2741935
## [1] "******************************************************************"
## [1] "Relative Frequency Table:sever"
##
## 0 1 2
## 0.34697509 0.57295374 0.08007117
## [1] "******************************************************************"
## [1] "Relative Frequency Table:seed.tmt"
##
## 0 1 2
## 0.54270463 0.39501779 0.06227758
## [1] "******************************************************************"
## [1] "Relative Frequency Table:germ"
##
## 0 1 2
## 0.2889667 0.3730298 0.3380035
## [1] "******************************************************************"
## [1] "Relative Frequency Table:plant.growth"
##
## 0 1
## 0.6611694 0.3388306
## [1] "******************************************************************"
## [1] "Relative Frequency Table:leaves"
##
## 0 1
## 0.1127379 0.8872621
## [1] "******************************************************************"
## [1] "Relative Frequency Table:leaf.halo"
##
## 0 1 2
## 0.36894825 0.06010017 0.57095159
## [1] "******************************************************************"
## [1] "Relative Frequency Table:leaf.marg"
##
## 0 1 2
## 0.59599332 0.03505843 0.36894825
## [1] "******************************************************************"
## [1] "Relative Frequency Table:leaf.size"
##
## 0 1 2
## 0.0851419 0.5459098 0.3689482
## [1] "******************************************************************"
## [1] "Relative Frequency Table:leaf.shread"
##
## 0 1
## 0.8353345 0.1646655
## [1] "******************************************************************"
## [1] "Relative Frequency Table:leaf.malf"
##
## 0 1
## 0.92487479 0.07512521
## [1] "******************************************************************"
## [1] "Relative Frequency Table:leaf.mild"
##
## 0 1 2
## 0.93043478 0.03478261 0.03478261
## [1] "******************************************************************"
## [1] "Relative Frequency Table:stem"
##
## 0 1
## 0.4437781 0.5562219
## [1] "******************************************************************"
## [1] "Relative Frequency Table:lodging"
##
## 0 1
## 0.9252669 0.0747331
## [1] "******************************************************************"
## [1] "Relative Frequency Table:stem.cankers"
##
## 0 1 2 3
## 0.58759690 0.06046512 0.05581395 0.29612403
## [1] "******************************************************************"
## [1] "Relative Frequency Table:canker.lesion"
##
## 0 1 2 3
## 0.4961240 0.1286822 0.2744186 0.1007752
## [1] "******************************************************************"
## [1] "Relative Frequency Table:fruiting.bodies"
##
## 0 1
## 0.8197574 0.1802426
## [1] "******************************************************************"
## [1] "Relative Frequency Table:ext.decay"
##
## 0 1 2
## 0.77054264 0.20930233 0.02015504
## [1] "******************************************************************"
## [1] "Relative Frequency Table:mycelium"
##
## 0 1
## 0.990697674 0.009302326
## [1] "******************************************************************"
## [1] "Relative Frequency Table:int.discolor"
##
## 0 1 2
## 0.90077519 0.06821705 0.03100775
## [1] "******************************************************************"
## [1] "Relative Frequency Table:sclerotia"
##
## 0 1
## 0.96899225 0.03100775
## [1] "******************************************************************"
## [1] "Relative Frequency Table:fruit.pods"
##
## 0 1 2 3
## 0.67946578 0.21702838 0.02337229 0.08013356
## [1] "******************************************************************"
## [1] "Relative Frequency Table:fruit.spots"
##
## 0 1 2 4
## 0.59792028 0.12998267 0.09878683 0.17331023
## [1] "******************************************************************"
## [1] "Relative Frequency Table:seed"
##
## 0 1
## 0.8054146 0.1945854
## [1] "******************************************************************"
## [1] "Relative Frequency Table:mold.growth"
##
## 0 1
## 0.8866328 0.1133672
## [1] "******************************************************************"
## [1] "Relative Frequency Table:seed.discolor"
##
## 0 1
## 0.8890815 0.1109185
## [1] "******************************************************************"
## [1] "Relative Frequency Table:seed.size"
##
## 0 1
## 0.9001692 0.0998308
## [1] "******************************************************************"
## [1] "Relative Frequency Table:shriveling"
##
## 0 1
## 0.93414211 0.06585789
## [1] "******************************************************************"
## [1] "Relative Frequency Table:roots"
##
## 0 1 2
## 0.84509202 0.13190184 0.02300613
## [1] "******************************************************************"
3.2.b
Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
# colSums(is.na(Soybean))Soybean %>%
dplyr::group_by(Class) %>%
sapply(function(x) sum(is.na(x)))## Class date plant.stand precip temp
## 0 1 36 38 30
## hail crop.hist area.dam sever seed.tmt
## 121 16 1 121 121
## germ plant.growth leaves leaf.halo leaf.marg
## 112 16 0 84 84
## leaf.size leaf.shread leaf.malf leaf.mild stem
## 84 100 84 108 16
## lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 121 38 38 106 38
## mycelium int.discolor sclerotia fruit.pods fruit.spots
## 38 38 38 84 106
## seed mold.growth seed.discolor seed.size shriveling
## 92 92 106 92 106
## roots
## 31
By Class
Soybean %>%
dplyr::group_by(Class) %>%
dplyr::summarise(across(.fns = ~sum(is.na(.)))) %>%
dplyr::transmute(Class=Class, totNA = rowSums(dplyr::across(where(is.numeric))))## # A tibble: 19 x 2
## Class totNA
## <fct> <dbl>
## 1 2-4-d-injury 450
## 2 alternarialeaf-spot 0
## 3 anthracnose 0
## 4 bacterial-blight 0
## 5 bacterial-pustule 0
## 6 brown-spot 0
## 7 brown-stem-rot 0
## 8 charcoal-rot 0
## 9 cyst-nematode 336
## 10 diaporthe-pod-&-stem-blight 177
## 11 diaporthe-stem-canker 0
## 12 downy-mildew 0
## 13 frog-eye-leaf-spot 0
## 14 herbicide-injury 160
## 15 phyllosticta-leaf-spot 0
## 16 phytophthora-rot 1214
## 17 powdery-mildew 0
## 18 purple-seed-stain 0
## 19 rhizoctonia-root-rot 0
3.2.c
Develop a strategy for handling missing data, either by eliminating predictors or imputation.
ANSWER We could either start only with the intercept and add one predictor at a time if it improves performance. Since RI is a continuous variable, we are talking most likely about a regression problem so measures like RMSE or MSE may be proper. If our goal is to predict Class, then it is a classification problem and measure like accuracy, precision, recall, and ROC-AUC would be proper.
The strategy is to add a predictor if it improves the chosen metric, if it doesn’t choose a different one to add. Keep repeating until you exhaust all variables (predictors)