DATA 624 - Predictive Analytics HW #1

Use the help function to explore what the series gold, woolyrnq and gas represent

1. Use autoplot() to plot each of these in separate plots.
1. What is the frequency of each series? Hint: apply the frequency() function.
1. Use which.max() to spot the outlier in the gold series. Which observation was it?

library(fpp2)
help(gold)

tsdisplay(gold)

The help function tells us that gold holds the “daily morning gold prices” in US dollars for the period from January 1, 1985 through March 31, 1989. The help page includes an example call, tsdisplay(gold), so I tried it out and included the output. I’m not quite sure what the bottom two plots are, but I’m sure I’ll find out before I finish this assignment. My next step is to use the autoplot() function on the gold ts data.

autoplot(gold) + ggtitle("Plot for Gold Time Series") + xlab("Count of Day") + ylab("Price in US Dollars")

There are two things that my eyes immediately jump to on this plot: the enormous spike in the 700s, and the dip shortly after the start, at about day 30. The dip looks sustained, as opposed to the sharp spikes that the other valleys show. The next task is to use the frequency() function to determine the frequency of the gold series.

gf <- frequency(gold)
print(paste0("The frequency for the gold time series data is: ", gf, "."))

## [1] "The frequency for the gold time series data is: 1."

The frequency is shown to be 1, but we know from the description of the gold time series data that it is daily information. A frequency of 1 means that no seasonal period is attached to the series, so each observation is simply indexed by its day count; it does not mean the data are annual. Now to “spot the outlier” in this series.

go <- which.max(gold)

print(paste0("The outlier is: ", go, "."))

## [1] "The outlier is: 770."
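
Note that which.max() returns the position of the largest value, not the value itself, so 770 is the index of the outlying observation. A minimal sketch of the distinction, using a made-up toy series rather than the gold data (all numbers below are assumptions for illustration only):

```r
# Toy series with one obvious spike (made-up values, not the gold data)
prices <- ts(c(300, 305, 310, 590, 308, 302))

idx  <- which.max(prices)  # position of the largest observation
peak <- prices[idx]        # the value stored at that position

print(paste0("Observation ", idx, " has the largest value: ", peak, "."))
```

Applied to the real data, gold[which.max(gold)] would return the price recorded at observation 770.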

help(woolyrnq)

We can see that woolyrnq provides data on the “quarterly production of woollen yarn in Australia” for the period March 1965 through September 1994. Production is measured in tonnes.

tsdisplay(woolyrnq)

Now to use the autoplot function, which shows a decreasing overall trend in the production of wool. I can’t help but wonder what happened in 1975 to cause the huge dip in production.

autoplot(woolyrnq) + ggtitle("Plot for Wool Timeseries") + xlab("Year") + ylab("Weight Produced in Tonnes")

wf <- frequency(woolyrnq)
print(paste0("The frequency for the woolyrnq time series data is: ", wf, " or quarterly"))

## [1] "The frequency for the woolyrnq time series data is: 4 or quarterly"

Lastly, we’ll take a look at the gas series, using the help and autoplot functions to see what we can gather.

help(gas)

We see that the gas time series holds Australia’s monthly gas production for 1956 through 1995. The autoplot() output shows what I think is a dramatic overall increase in the production of gas in Australia that began in 1970.

autoplot(gas) + ggtitle("Plot for Gas Time Series") + xlab("Year") + ylab("Amount Produced")

gasf <- frequency(gas)
print(paste0("The frequency for the gas time series data is: ", gasf, " or monthly"))

## [1] "The frequency for the gas time series data is: 12 or monthly"
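
The three frequencies reported above (1, 4 and 12) are the number of observations per time unit recorded when each series was created. A small sketch with made-up data (the series below are hypothetical stand-ins, not the fpp2 data) shows how the frequency argument of ts() determines what frequency() reports:

```r
# Hypothetical stand-in series; the values are arbitrary
x <- seq_len(24)

daily_like <- ts(x)                                     # no frequency given, defaults to 1
quarterly  <- ts(x, start = c(1965, 1), frequency = 4)  # four observations per year
monthly    <- ts(x, start = c(1956, 1), frequency = 12) # twelve observations per year

freqs <- c(daily_like = frequency(daily_like),
           quarterly  = frequency(quarterly),
           monthly    = frequency(monthly))
print(freqs)

# With frequency 1 the time index is simply the observation count
head(time(daily_like))
```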

Download the file tute1.csv from the book website, open it in Excel (or some other spreadsheet application), and review its contents. You should find four columns of information. Columns B through D each contain a quarterly series, labelled Sales, AdBudget and GDP. Sales contains the quarterly sales for a small company over the period 1981-2005. AdBudget is the advertising budget and GDP is the gross domestic product. All series have been adjusted for inflation.

1. You can read the data in R with the following script (I changed it a little bit, as I prefer using GitHub for my data):

tute1 <- read.csv("tute1.csv", header = TRUE)
View(tute1)

1. Convert the data to time series

myts <- ts(tute1[, -1], start = 1981, frequency = 4)
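
To see what this conversion does, here is a sketch on a small made-up data frame with the layout described above (a date-like first column followed by Sales, AdBudget and GDP). The column name X and all values are assumptions for illustration, not the contents of tute1.csv:

```r
# Made-up data frame mimicking the described layout (values are arbitrary)
toy <- data.frame(
  X        = c("Q1", "Q2", "Q3", "Q4", "Q1", "Q2"),
  Sales    = c(100, 110, 120, 130, 140, 150),
  AdBudget = c(50, 55, 60, 65, 70, 75),
  GDP      = c(200, 202, 204, 206, 208, 210)
)

# Drop the first (date) column; ts() supplies the time index instead
toy_ts <- ts(toy[, -1], start = 1981, frequency = 4)

print(toy_ts)
```

The printed result is indexed by year and quarter, starting at the first quarter of 1981, with one column per series.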

1. Construct time series plots of each of the three series

autoplot(myts, facets = TRUE)

Looking at this output, I would guess that omitting facets = TRUE will generate a single plot that is not broken down by Sales, AdBudget, and GDP. Let’s try and see.

autoplot(myts)

I was correct. It appears that without facets = TRUE, the output is one plot showing the three time series together, rather than the three distinct panels produced when it is included. A legend is then required to show which line goes with which time series.

Download some monthly Australian retail data from the book website. These represent retail sales in various categories for different Australian states, and are stored in a MS-Excel file.

1. You can read the data into R with the following script:

retaildata <- readxl::read_excel("retail.xlsx", skip = 1)
View(retaildata)

myts_2 <- ts(retaildata[, "A3349336V"], frequency = 12, start = c(1982, 4))

Now to start exploring my chosen time series.

autoplot(myts_2) + ggtitle("Electrical|Electronics Goods Retailing Time Series") + xlab("Year") + ylab("sales")

From what I can gather, the time series I chose is for electrical and electronic goods retailing in New South Wales. There has been an overall upward trend, with a spike in 2009. I am seeing seasonality rather than cyclicity, because the changes recur at a regular frequency.

ggseasonplot(myts_2, polar = TRUE)

ggsubseriesplot(myts_2)

gglagplot(myts_2)

ggAcf(myts_2, lag.max = 48)

The plots show that there is definite seasonality. The monthly averages are about the same, with a jump in December. Lag 12 jumps out at me in the lag plot, as it looks very different from the others. All the lags show a positive relationship. At first glance, the ACF plot seems to blow my “overall trend” theory, as I do not see the trend in it, although it does show the seasonality.
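
To check how trend and seasonality each show up in an ACF, here is a sketch on a synthetic monthly series (made-up data, not the retail series) built from a linear trend plus a 12-month cycle. It uses base R's acf() rather than ggAcf(), purely so the numbers can be inspected:

```r
# Synthetic monthly series: linear trend plus a 12-month seasonal cycle
# (made-up data, not the retail series)
t <- 1:240
y <- ts(0.5 * t + 40 * sin(2 * pi * t / 12), start = c(1982, 4), frequency = 12)

# Autocorrelations for lags 0 to 48; element k + 1 holds lag k
r <- as.numeric(acf(y, lag.max = 48, plot = FALSE)$acf)

round(r[c(12, 13, 14)], 3)      # lags 11, 12, 13: the seasonal peak sits at lag 12
round(r[c(13, 25, 37, 49)], 3)  # lags 12, 24, 36, 48: the peaks shrink as the lag grows
```

In this toy case the seasonality appears as peaks at multiples of 12, while the trend appears as a slow decline in those peaks as the lag grows, so a trend can be present in an ACF plot without being obvious.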

Use the following graphics functions: autoplot(), ggseasonplot(), ggsubseriesplot(), gglagplot(), ggAcf() and explore features from the following time series: hsales, usdeaths, bricksq, sunspotarea, gasoline.

autoplot(hsales)

ggseasonplot(hsales)

ggsubseriesplot(hsales)

gglagplot(hsales)

ggAcf(hsales, lag.max = 48)

This is the time series of monthly sales of new one-family houses in the US since 1973. I’m not sure if my analysis is correct, but initially I did not see any signs of trend, cycle, or seasonality. I’m wondering, though, whether it is in fact cyclicity and seasonality that I’m seeing. Setting the maximum lag to 48 allowed me to see that there is a downward trend in the ACF, although there is an anomaly at lag 24.

autoplot(usdeaths)

ggseasonplot(usdeaths, polar = TRUE)

ggsubseriesplot(usdeaths)

gglagplot(usdeaths)

ggAcf(usdeaths, lag.max = 48)

Looking at this time series, I’m immediately seeing some seasonality, which is very clear in the ACF plot. It shows no trend, but rather seasonality. The closer I look, the more I see that the positive peaks are decreasing, while the negative ones are increasing (becoming less negative).

autoplot(bricksq)

ggseasonplot(bricksq, polar = TRUE)

ggsubseriesplot(bricksq)

gglagplot(bricksq)

ggAcf(bricksq, lag.max = 48)

The bricksq time series shows the quarterly production of clay bricks in Australia from 1956 to 1994. The ACF plot shows an overall downward slope across the lags. The lag plots show very clearly that the relationship is strongly positive for all the lags. I do not see any seasonality, but I do see some cyclicity. I can’t help but wonder what occurred around 1983 to cause the big dip.

autoplot(sunspotarea)

#ggseasonplot(sunspotarea)
#ggsubseriesplot(sunspotarea)
gglagplot(sunspotarea)

ggAcf(sunspotarea, lag.max = 48)

I had to comment out the ggseasonplot line, as I got an error saying that the time series data for the annual average sunspot area was not seasonal. The function throws that error when a series has a frequency of 1, which makes it easy to spot data that cannot have a seasonal period. The earlier series all have a frequency greater than 1, which is why their seasonal plots ran, although that alone does not prove that they are seasonal. I would say that there is cyclicity in the sunspot data.
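
The error depends only on the frequency attribute of the series, not on any statistical test for seasonality. A minimal sketch of that check, using two data sets that ship with base R (sunspot.year and AirPassengers, chosen here as stand-ins) and a helper name, has_seasonal_period(), that is my own invention:

```r
# Hypothetical helper: TRUE when the series has more than one observation per cycle
has_seasonal_period <- function(x) frequency(x) > 1

# Both data sets come with base R's datasets package
has_seasonal_period(sunspot.year)   # yearly sunspot numbers, frequency 1
has_seasonal_period(AirPassengers)  # monthly airline passengers, frequency 12
```

A frequency above 1 only means the seasonal plots can be drawn; whether the pattern within each cycle is really seasonal still has to be judged from the plots themselves.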

autoplot(gasoline)

ggseasonplot(gasoline)

#ggsubseriesplot(gasoline)
gglagplot(gasoline)

ggAcf(gasoline, lag.max = 48)

frequency(gasoline)

## [1] 52.17857

I got the following error message: “Error in ggsubseriesplot(gasoline) : Each season requires at least 2 observations. This may be caused from specifying a time-series with non-integer frequency.” Naturally, I checked the frequency of the time series and saw that it is indeed not an integer. I’m going to attempt to change the frequency to 52, as suggested by the text.

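# as.ts() returns an existing ts object unchanged, so frequency = 52 would be ignored.
# Rebuild the series with ts() instead; rounding to 52 makes the time axis approximate.
gasoline_changed <- ts(as.numeric(gasoline), start = tsp(gasoline)[1], frequency = 52)

autoplot(gasoline_changed)
ggseasonplot(gasoline_changed)
ggsubseriesplot(gasoline_changed)
gglagplot(gasoline_changed)
ggAcf(gasoline_changed, lag.max = 48)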
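
The odd-looking frequency of 52.17857 matches the average number of weeks in a year, 365.25 / 7. A short sketch, using a made-up weekly series and a helper name of my own (is_whole()), shows the arithmetic and why a whole-number check on the frequency fails:

```r
weeks_per_year <- 365.25 / 7
print(weeks_per_year)

# Made-up weekly series with the same kind of non-integer frequency
weekly <- ts(seq_len(120), start = 1991, frequency = weeks_per_year)

# Hypothetical helper: is the frequency a whole number?
is_whole <- function(f) abs(f - round(f)) < 1e-8

is_whole(frequency(weekly))  # FALSE for 365.25 / 7
is_whole(52)                 # TRUE once the frequency is rounded down to 52
```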

Oluwakemi Omotunde

February 3, 2019