Main takeaway from homework

  • Need to check for outliers and data misentries
  • Extrapolating trends from starting points, without a historical knowledge of where your starting point really exists can lead to dangerous analysis and predictions
    • Australian Wool production and clay production likely have very similar seasonality, cycles, and trends. However, because of a difference in starting points in the observed datasets, different conclusions can be drawn from each.
    • A similar trend likely exists in Australian gas versus US gasoline datasets.
  • stlplus package can be useful for decomposition and visualization, I explored the package rpubs links

2.1

A Use autoplot() to plot each of these in separate plots.

  • The data represents 3 time series dataframes
  • Below autoplot reveals
    • Gold series- is cyclic, and has trends. It doesn’t appear to have seasonality beyond the single day fluctuations and its frequency is 1 day. The spike in the dataset around day 800, looks like an outlier
    • The Wool dataset- appears to be cyclical and seasonal. There was a large downward movement from 1972-75. From 75-92 it appears to not be cyclical and show only seasonal trends, but then something is happening at the end of the dataset in 1995 that causes it to spike. Its frequency is 4 representing quarters. Clearly wool sales reach bottom beginning of every year(summer), and peak in what looks like 2nd quarter(winter). Austraillia is in the southern hemisphere
      • it appears in 1974 a Reserve Price Scheme (RPS) was introduced to in Australia, perhaps that caused the decline. Setting a floor for price could potentially lower production
      • please see link for info on wool prices in Australia
      • This practice ended in 1992 and perhaps lead to the upward trend we may be seeing in 95
    • The gas data set- is seasonal and has a clear upward trend, and appears to be cyclical with increased movement in it’s price range as time increases
devtools::install_github("strboul/caseconverter")
# Comment out exploration
#?gold
# ?woolyrnq 
# ?gas
#tsdisplay(gold)
# tsdisplay(woolyrnq)
# tsdisplay(gas)
#str(gold)
plot_grid(autoplot(gold),autoplot(woolyrnq),autoplot(gas))

plot_grid(tsdisplay(gold)
,tsdisplay(woolyrnq),tsdisplay(gas))

B What is the frequency of each series?

  • Gold- 1 daily
  • wool- 4- quarterly
  • gas-12- monthly
print(list(frequency(gold),frequency(woolyrnq),frequency(gas)))
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 12
#time(gas)

#?frequency

C Use which.max() to spot the outlier in the gold series. Which observation was it?

  • Time series point 770 is the outlier.

Is observation 770 a real datapoint?

  • First problem, our dataset isn’t complete. 1108 days total. Jan 1st+ 1108 days should fall sometime in early Feb, dataset info says it runs from Jan1- march 31st.
  • Missing entries aside, looking at historical gold prices it doesn’t appear gold closed that high in any point in 1987(770 days from 1985). link
    • I thought Black Monday could be the cause, but even there gold peaked at 505 and closed the day down. Gold never approached 600 in 1985-1989
  • datapoint 770 is a data misentry
#?gold
which.max(gold)
## [1] 770
plot(gold)

#length(gold)

2.2

  • When facets equals true isn’t included, the graphs share a common scale. This masks some of the trends in smaller data, however, I believe in this case the graphs are both rather readable. All 3 charts share seasonality,lack trends, and aren’t cyclical. They are inflation adjusted.
  • There does appear to be a large outlier in GDP in the early 90’s.
  • It also seems that as GDP rises seasonally, budget and sales tend to fall seasonally.
tute1 <- read.csv("tute1.csv", header=TRUE)
#View(tute1)
mytimeseries <- ts(tute1[,-1], start=1981, frequency=4)
autoplot(mytimeseries, facets=TRUE)

autoplot(mytimeseries)

frequency(mytimeseries)
## [1] 4

2.3

  • Column I chose was New South Wales ; Other recreational goods retailing
  • Looking below at the first grid plot, we can see there appears to be seasonality in the data, with an upward trend and the data appears to be cyclical.
  • We can see the seasonality on the ggacf chart with the scalloped shape. We also see the trend as the acf decreases as the lag increases. All of these points are above threshold of significance which implies the data is auto-correlated
  • The seasonal subseries plots show that drug sales are higher in (Nov,Dec,January) with a dramatic increase in December
  • The seasonal plots reveal once again much higher sales in December with increased sales in Nov and January as well.
  • The lag plot seems to indicate there is some annual seasonality occurring at lag 12
retaildata <- readxl::read_excel("retail.xlsx", skip=1)
myts <- ts(retaildata[,15],
  frequency=12, start=c(1982,4))
#myts

autoplot(myts)

plot_grid(autoplot(myts),ggAcf(myts,lag=48))

ggsubseriesplot(myts) +
  ggtitle("Seasonal subseries plot: antidiabetic drug sales")

ggseasonplot(myts)

ggseasonplot(myts,polar=TRUE)

gglagplot(myts) 

checkresiduals(snaive(myts))

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 532.41, df = 24, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 24

2.6

Deeper look at hsales

  • Frequency = monthly, 12
  • Data doesn’t appear to have a trend
  • The data appears to be seasonal. Data bottoms out at beginning of year
  • ggacf with lag-12 shows clear seasonality with a trough shape
  • Monthly data means show seasonality as well
    • Sales of homes are higher in spring and lower in winter
    • Auto correlation decreases at 6 months out and comes back at 12 months.(trough behavior)
  • Appears to have some sort of cyclical multi year behavior. 3-4 years up, 3-4 years down
  • Knowing that the economy started to suffer from stagflation in the early 70’s, it’s difficult to tell what the true trend is in the home sales data.
#?hsales
autoplot(hsales)

frequency(hsales)
## [1] 12
plot_grid(autoplot(hsales),ggAcf(hsales,lag=12))

ggsubseriesplot(hsales) +
  ggtitle("Seasonal subseries plot: h sales")

ggseasonplot(hsales)

ggseasonplot(hsales,polar=TRUE)

gglagplot(hsales) 

plot(stl(hsales,"periodic"),main="Sales of new one-family houses, USA")

#?stl

Deeper look at usdeaths

  • Frequency is monthly
  • Clear seasonality is present
  • There is no trend in the data
    • stl package indicates a significant proportion of variance in data, is due to seasonality
  • There doesn’t appear to be a trend in the data, and it doesn’t look cyclic.
  • Summer appears to be much more deadly
  • Lag 12 shows yearly seasonality and high auto correlation. Our Acf plot tells us the same thing as we can see that month 12 is highly correlated with initial value.
  • Strangely, it seems warmer weather leads to more deaths, i would associate colder weather with higher deaths intuitively. Winter brings viruses like the Flu, along side issues with heating and winter storms.
    • I thought perhaps the Vietnam war could have something to do with the data, yet there were few casualties annually by 1975
usdeaths_1 <- as_data_frame(usdeaths)
usdeaths
##        Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov
## 1973  9007  8106  8928  9137 10017 10826 11317 10744  9713  9938  9161
## 1974  7750  6981  8038  8422  8714  9512 10120  9823  8743  9129  8710
## 1975  8162  7306  8124  7870  9387  9556 10093  9620  8285  8433  8160
## 1976  7717  7461  7776  7925  8634  8945 10078  9179  8037  8488  7874
## 1977  7792  6957  7726  8106  8890  9299 10625  9302  8314  8850  8265
## 1978  7836  6892  7791  8129  9115  9434 10484  9827  9110  9070  8633
##        Dec
## 1973  8927
## 1974  8680
## 1975  8034
## 1976  8647
## 1977  8796
## 1978  9240
autoplot(usdeaths)

plot_grid(autoplot(usdeaths),ggAcf(usdeaths,lag=12))

ggsubseriesplot(usdeaths) +
  ggtitle("Seasonal subseries plot: us deaths")

ggseasonplot(usdeaths)

ggseasonplot(usdeaths,polar=TRUE)

gglagplot(usdeaths) 

plot(stl(usdeaths,na.action = na.approx, s.window ="periodic"),main="Frequency by month with stl")

Deeper look at bricksq

  • Frequency is quarterly
  • The data shows a clear upward trend
  • Some large cyclical downfalls around what appears to be 1982 and 1975
  • Clear seasonality, production is lower in quarter 1 and 4 and higher in quarter 2 and 3
  • High autocorrelation according to Acf and the lag plots.
  • Trend explains most of the movement in the data.
#?bricksq
bricksq
##      Qtr1 Qtr2 Qtr3 Qtr4
## 1956  189  204  208  197
## 1957  187  214  227  223
## 1958  199  229  249  234
## 1959  208  253  267  255
## 1960  242  268  290  277
## 1961  241  253  265  236
## 1962  229  265  275  258
## 1963  231  263  308  313
## 1964  293  328  349  340
## 1965  309  349  366  340
## 1966  302  350  362  337
## 1967  326  358  359  357
## 1968  341  380  404  409
## 1969  383  417  454  428
## 1970  386  428  434  417
## 1971  385  433  453  436
## 1972  399  461  476  477
## 1973  452  461  534  516
## 1974  478  526  518  417
## 1975  340  437  459  449
## 1976  424  501  540  533
## 1977  457  513  522  478
## 1978  421  487  470  482
## 1979  458  526  573  563
## 1980  513  551  589  564
## 1981  519  581  581  578
## 1982  500  560  512  412
## 1983  303  409  420  413
## 1984  400  469  482  484
## 1985  447  507  533  503
## 1986  443  503  505  443
## 1987  415  485  495  458
## 1988  427  519  555  539
## 1989  511  572  570  526
## 1990  472  524  497  460
## 1991  373  436  424  430
## 1992  387  413  451  420
## 1993  394  462  476  443
## 1994  421  472  494
# usdeaths_1 <- as_data_frame(usdeaths)
# usdeaths

autoplot(bricksq)

plot_grid(autoplot(bricksq),ggAcf(bricksq,lag=12))

ggsubseriesplot(bricksq) +
  ggtitle("Seasonal subseries plot: Austrailian Quarterly clay production")

ggseasonplot(bricksq)

ggseasonplot(bricksq,polar=TRUE)

gglagplot(bricksq) 

plot(stl(bricksq,na.action = na.approx, s.window ="periodic"),main="Frequency by month with stl")

Compare Bricks Production to Wool Production

  • Both the clay dataset and the wool dataset deal with Australian production
  • Let’s compare them:
plot_grid(autoplot(bricksq),autoplot(woolyrnq))

  • The data trends seem to be rather similar
  • I think these graphs highlight very well, the problem with extrapolating trends from data. Not knowing where starting point is on a historical scale, can clearly throw off analysis and prediction.
    • If we had graphed Wool from 1960 on, we would likely assume, that it had an upward trend. This is the observation we made in the wool dataset
    • I also assumed that the large downslides in wool production, were industry specific. Looking at brick production, it appears that the downfall might have more to do with Australian economy as a whole. More data would need to be explored

Deeper look at sunspotarea

  • Frequency is yearly and 1
  • Seasonality is difficult to determine because our unit of measure is in years
  • It looks like there is some sort of cyclic behavior, with peaks and dips which seam to get larger from 1900-1950 and then seem to get smaller from 1950- present
    • looking at a lag plot of the entire time period, It appears after 1950, the data is no longer auto correlating with itself. Something must have happened in 1950, to alter the way we estimate this data, or perhaps some sort of regulations must have been put in place to prevent sunspots?
#?sunspotarea
sunspotarea
## Time Series:
## Start = 1875 
## End = 2015 
## Frequency = 1 
##       1875       1876       1877       1878       1879       1880 
##  213.13333  109.28333   92.85833   22.21667   36.33333  446.75000 
##       1881       1882       1883       1884       1885       1886 
##  679.49167  968.02500 1148.90833 1034.12500  810.20833  379.40833 
##       1887       1888       1889       1890       1891       1892 
##  177.25833   87.92500   76.69167   98.76667  566.44167 1211.27500 
##       1893       1894       1895       1896       1897       1898 
## 1460.62500 1281.54167  973.43333  547.39167  511.65000  374.65833 
##       1899       1900       1901       1902       1903       1904 
##  109.98333   74.30833   27.85833   59.47500  338.60000  488.17500 
##       1905       1906       1907       1908       1909       1910 
## 1195.88333  774.98333 1092.10833  697.48333  691.52500  266.00833 
##       1911       1912       1913       1914       1915       1916 
##   64.40833   37.33333    7.52500  152.38333  697.76667  725.45833 
##       1917       1918       1919       1920       1921       1922 
## 1533.92500 1112.56667 1054.68333  617.25833  419.61667  251.95833 
##       1923       1924       1925       1926       1927       1928 
##   54.65833  278.00833  825.07500 1263.15000 1060.80000 1388.91667 
##       1929       1930       1931       1932       1933       1934 
## 1238.91667  516.60833  279.06667  163.15000   91.25000  118.22500 
##       1935       1936       1937       1938       1939       1940 
##  622.07500 1140.84167 2072.77500 2015.23333 1576.83333 1037.16667 
##       1941       1942       1943       1944       1945       1946 
##  658.12500  427.33333  296.76667  124.65000  426.50000 1823.90000 
##       1947       1948       1949       1950       1951       1952 
## 2634.05000 1974.60000 2139.99167 1227.34167 1135.25833  402.85000 
##       1953       1954       1955       1956       1957       1958 
##  145.10000   34.64167  552.35000 2394.69167 3048.47500 3011.26667 
##       1959       1960       1961       1962       1963       1964 
## 2872.94167 1641.21667  613.50000  463.51667  287.83333   53.90833 
##       1965       1966       1967       1968       1969       1970 
##  113.30833  592.55833 1519.14167 1569.75833 1450.13333 1601.25833 
##       1971       1972       1973       1974       1975       1976 
##  990.20000  916.70000  457.55000  398.85833  166.40833  169.77500 
##       1977       1978       1979       1980       1981       1982 
##  347.04167 1368.51667 2194.52500 2160.71667 2270.24167 2283.30000 
##       1983       1984       1985       1986       1987       1988 
##  944.00000  811.31667  180.15000  124.70000  296.81667 1345.26667 
##       1989       1990       1991       1992       1993       1994 
## 2579.22500 2048.68333 2470.18333 1349.20833  696.20000  340.39167 
##       1995       1996       1997       1998       1999       2000 
##  159.63333   81.90000  210.24167  763.07500 1162.00833 1614.21667 
##       2001       2002       2003       2004       2005       2006 
## 1704.09167 1828.73333 1099.24167  683.77500  542.57500  245.10833 
##       2007       2008       2009       2010       2011       2012 
##  133.34167   22.80833   26.55833  214.29167  749.56667  796.88333 
##       2013       2014       2015 
##  860.77500 1252.15000  618.80833
autoplot(sunspotarea)

plot_grid(autoplot(sunspotarea),ggAcf(sunspotarea,lag=140))

#ggsubseriesplot(sunspotarea) +
 # ggtitle("Seasonal subseries plot: Austrailian Quarterly clay production")
#ggseasonplot(sunspotarea)
#ggseasonplot(sunspotarea,polar=TRUE)
gglagplot(sunspotarea) 

#plot(stl(sunspotarea,na.action = na.approx, s.window =25),main="Frequency by month with stl")

Deeper look at gasoline

  • Frequency is weekly, and 52
  • Clear seasonality at the quarterly level
  • Clear upward trend in the data
  • very high auto correlations
  • It appears that production peaked in 2005
    • It’s hard to extrapolate out the trend in production. More data is needed. We know the gas dataset earlier, Australia had a large increase in production from 1970 on. Our Gasoline dataset here, Presumably starts near the end of that large spike in global production.
#?gasoline
autoplot(gasoline)

plot_grid(autoplot(gasoline),ggAcf(gasoline,lag=52))

#ggsubseriesplot(gasoline) +
#  ggtitle("Seasonal subseries plot: weekly gasoline supply")
ggseasonplot(gasoline)

ggseasonplot(gasoline,polar=TRUE)

gglagplot(gasoline) 

plot(stl(gasoline,na.action = na.approx, s.window =25),main="Frequency by month with stl")