Main takeaway from homework
- Need to check for outliers and data misentries
- Extrapolating trends from starting points, without a historical knowledge of where your starting point really exists can lead to dangerous analysis and predictions
- Australian Wool production and clay production likely have very similar seasonality, cycles, and trends. However, because of a difference in starting points in the observed datasets, different conclusions can be drawn from each.
- A similar trend likely exists in Australian gas versus US gasoline datasets.
- stlplus package can be useful for decomposition and visualization, I explored the package rpubs links
2.1
A Use autoplot() to plot each of these in separate plots.
- The data represents 3 time series dataframes
- Below autoplot reveals
- Gold series- is cyclic, and has trends. It doesn’t appear to have seasonality beyond the single day fluctuations and its frequency is 1 day. The spike in the dataset around day 800, looks like an outlier
- The Wool dataset- appears to be cyclical and seasonal. There was a large downward movement from 1972-75. From 75-92 it appears to not be cyclical and show only seasonal trends, but then something is happening at the end of the dataset in 1995 that causes it to spike. Its frequency is 4 representing quarters. Clearly wool sales reach bottom beginning of every year(summer), and peak in what looks like 2nd quarter(winter). Austraillia is in the southern hemisphere
- it appears in 1974 a Reserve Price Scheme (RPS) was introduced to in Australia, perhaps that caused the decline. Setting a floor for price could potentially lower production
- please see link for info on wool prices in Australia
- This practice ended in 1992 and perhaps lead to the upward trend we may be seeing in 95
- The gas data set- is seasonal and has a clear upward trend, and appears to be cyclical with increased movement in it’s price range as time increases
devtools::install_github("strboul/caseconverter")
# Comment out exploration
#?gold
# ?woolyrnq
# ?gas
#tsdisplay(gold)
# tsdisplay(woolyrnq)
# tsdisplay(gas)
#str(gold)
plot_grid(autoplot(gold),autoplot(woolyrnq),autoplot(gas))

plot_grid(tsdisplay(gold)
,tsdisplay(woolyrnq),tsdisplay(gas))




B What is the frequency of each series?
- Gold- 1 daily
- wool- 4- quarterly
- gas-12- monthly
print(list(frequency(gold),frequency(woolyrnq),frequency(gas)))
## [[1]]
## [1] 1
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 12
C Use which.max() to spot the outlier in the gold series. Which observation was it?
- Time series point 770 is the outlier.
Is observation 770 a real datapoint?
- First problem, our dataset isn’t complete. 1108 days total. Jan 1st+ 1108 days should fall sometime in early Feb, dataset info says it runs from Jan1- march 31st.
- Missing entries aside, looking at historical gold prices it doesn’t appear gold closed that high in any point in 1987(770 days from 1985). link
- I thought Black Monday could be the cause, but even there gold peaked at 505 and closed the day down. Gold never approached 600 in 1985-1989
- datapoint 770 is a data misentry
## [1] 770

2.2
- When facets equals true isn’t included, the graphs share a common scale. This masks some of the trends in smaller data, however, I believe in this case the graphs are both rather readable. All 3 charts share seasonality,lack trends, and aren’t cyclical. They are inflation adjusted.
- There does appear to be a large outlier in GDP in the early 90’s.
- It also seems that as GDP rises seasonally, budget and sales tend to fall seasonally.
tute1 <- read.csv("tute1.csv", header=TRUE)
#View(tute1)
mytimeseries <- ts(tute1[,-1], start=1981, frequency=4)
autoplot(mytimeseries, facets=TRUE)


## [1] 4
2.3
- Column I chose was New South Wales ; Other recreational goods retailing
- Looking below at the first grid plot, we can see there appears to be seasonality in the data, with an upward trend and the data appears to be cyclical.
- We can see the seasonality on the ggacf chart with the scalloped shape. We also see the trend as the acf decreases as the lag increases. All of these points are above threshold of significance which implies the data is auto-correlated
- The seasonal subseries plots show that drug sales are higher in (Nov,Dec,January) with a dramatic increase in December
- The seasonal plots reveal once again much higher sales in December with increased sales in Nov and January as well.
- The lag plot seems to indicate there is some annual seasonality occurring at lag 12
retaildata <- readxl::read_excel("retail.xlsx", skip=1)
myts <- ts(retaildata[,15],
frequency=12, start=c(1982,4))
#myts
autoplot(myts)

plot_grid(autoplot(myts),ggAcf(myts,lag=48))

ggsubseriesplot(myts) +
ggtitle("Seasonal subseries plot: antidiabetic drug sales")


ggseasonplot(myts,polar=TRUE)


checkresiduals(snaive(myts))

##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 532.41, df = 24, p-value < 2.2e-16
##
## Model df: 0. Total lags used: 24
2.6
Deeper look at hsales
- Frequency = monthly, 12
- Data doesn’t appear to have a trend
- The data appears to be seasonal. Data bottoms out at beginning of year
- ggacf with lag-12 shows clear seasonality with a trough shape
- Monthly data means show seasonality as well
- Sales of homes are higher in spring and lower in winter
- Auto correlation decreases at 6 months out and comes back at 12 months.(trough behavior)
- Appears to have some sort of cyclical multi year behavior. 3-4 years up, 3-4 years down
- Knowing that the economy started to suffer from stagflation in the early 70’s, it’s difficult to tell what the true trend is in the home sales data.
#?hsales
autoplot(hsales)

## [1] 12
plot_grid(autoplot(hsales),ggAcf(hsales,lag=12))

ggsubseriesplot(hsales) +
ggtitle("Seasonal subseries plot: h sales")


ggseasonplot(hsales,polar=TRUE)


plot(stl(hsales,"periodic"),main="Sales of new one-family houses, USA")

Deeper look at usdeaths
- Frequency is monthly
- Clear seasonality is present
- There is no trend in the data
- stl package indicates a significant proportion of variance in data, is due to seasonality
- There doesn’t appear to be a trend in the data, and it doesn’t look cyclic.
- Summer appears to be much more deadly
- Lag 12 shows yearly seasonality and high auto correlation. Our Acf plot tells us the same thing as we can see that month 12 is highly correlated with initial value.
- Strangely, it seems warmer weather leads to more deaths, i would associate colder weather with higher deaths intuitively. Winter brings viruses like the Flu, along side issues with heating and winter storms.
- I thought perhaps the Vietnam war could have something to do with the data, yet there were few casualties annually by 1975
usdeaths_1 <- as_data_frame(usdeaths)
usdeaths
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov
## 1973 9007 8106 8928 9137 10017 10826 11317 10744 9713 9938 9161
## 1974 7750 6981 8038 8422 8714 9512 10120 9823 8743 9129 8710
## 1975 8162 7306 8124 7870 9387 9556 10093 9620 8285 8433 8160
## 1976 7717 7461 7776 7925 8634 8945 10078 9179 8037 8488 7874
## 1977 7792 6957 7726 8106 8890 9299 10625 9302 8314 8850 8265
## 1978 7836 6892 7791 8129 9115 9434 10484 9827 9110 9070 8633
## Dec
## 1973 8927
## 1974 8680
## 1975 8034
## 1976 8647
## 1977 8796
## 1978 9240

plot_grid(autoplot(usdeaths),ggAcf(usdeaths,lag=12))

ggsubseriesplot(usdeaths) +
ggtitle("Seasonal subseries plot: us deaths")


ggseasonplot(usdeaths,polar=TRUE)


plot(stl(usdeaths,na.action = na.approx, s.window ="periodic"),main="Frequency by month with stl")

Deeper look at bricksq
- Frequency is quarterly
- The data shows a clear upward trend
- Some large cyclical downfalls around what appears to be 1982 and 1975
- Clear seasonality, production is lower in quarter 1 and 4 and higher in quarter 2 and 3
- High autocorrelation according to Acf and the lag plots.
- Trend explains most of the movement in the data.
## Qtr1 Qtr2 Qtr3 Qtr4
## 1956 189 204 208 197
## 1957 187 214 227 223
## 1958 199 229 249 234
## 1959 208 253 267 255
## 1960 242 268 290 277
## 1961 241 253 265 236
## 1962 229 265 275 258
## 1963 231 263 308 313
## 1964 293 328 349 340
## 1965 309 349 366 340
## 1966 302 350 362 337
## 1967 326 358 359 357
## 1968 341 380 404 409
## 1969 383 417 454 428
## 1970 386 428 434 417
## 1971 385 433 453 436
## 1972 399 461 476 477
## 1973 452 461 534 516
## 1974 478 526 518 417
## 1975 340 437 459 449
## 1976 424 501 540 533
## 1977 457 513 522 478
## 1978 421 487 470 482
## 1979 458 526 573 563
## 1980 513 551 589 564
## 1981 519 581 581 578
## 1982 500 560 512 412
## 1983 303 409 420 413
## 1984 400 469 482 484
## 1985 447 507 533 503
## 1986 443 503 505 443
## 1987 415 485 495 458
## 1988 427 519 555 539
## 1989 511 572 570 526
## 1990 472 524 497 460
## 1991 373 436 424 430
## 1992 387 413 451 420
## 1993 394 462 476 443
## 1994 421 472 494
# usdeaths_1 <- as_data_frame(usdeaths)
# usdeaths
autoplot(bricksq)

plot_grid(autoplot(bricksq),ggAcf(bricksq,lag=12))

ggsubseriesplot(bricksq) +
ggtitle("Seasonal subseries plot: Austrailian Quarterly clay production")


ggseasonplot(bricksq,polar=TRUE)


plot(stl(bricksq,na.action = na.approx, s.window ="periodic"),main="Frequency by month with stl")

Compare Bricks Production to Wool Production
- Both the clay dataset and the wool dataset deal with Australian production
- Let’s compare them:
plot_grid(autoplot(bricksq),autoplot(woolyrnq))

- The data trends seem to be rather similar
- I think these graphs highlight very well, the problem with extrapolating trends from data. Not knowing where starting point is on a historical scale, can clearly throw off analysis and prediction.
- If we had graphed Wool from 1960 on, we would likely assume, that it had an upward trend. This is the observation we made in the wool dataset
- I also assumed that the large downslides in wool production, were industry specific. Looking at brick production, it appears that the downfall might have more to do with Australian economy as a whole. More data would need to be explored
Deeper look at sunspotarea
- Frequency is yearly and 1
- Seasonality is difficult to determine because our unit of measure is in years
- It looks like there is some sort of cyclic behavior, with peaks and dips which seam to get larger from 1900-1950 and then seem to get smaller from 1950- present
- looking at a lag plot of the entire time period, It appears after 1950, the data is no longer auto correlating with itself. Something must have happened in 1950, to alter the way we estimate this data, or perhaps some sort of regulations must have been put in place to prevent sunspots?
#?sunspotarea
sunspotarea
## Time Series:
## Start = 1875
## End = 2015
## Frequency = 1
## 1875 1876 1877 1878 1879 1880
## 213.13333 109.28333 92.85833 22.21667 36.33333 446.75000
## 1881 1882 1883 1884 1885 1886
## 679.49167 968.02500 1148.90833 1034.12500 810.20833 379.40833
## 1887 1888 1889 1890 1891 1892
## 177.25833 87.92500 76.69167 98.76667 566.44167 1211.27500
## 1893 1894 1895 1896 1897 1898
## 1460.62500 1281.54167 973.43333 547.39167 511.65000 374.65833
## 1899 1900 1901 1902 1903 1904
## 109.98333 74.30833 27.85833 59.47500 338.60000 488.17500
## 1905 1906 1907 1908 1909 1910
## 1195.88333 774.98333 1092.10833 697.48333 691.52500 266.00833
## 1911 1912 1913 1914 1915 1916
## 64.40833 37.33333 7.52500 152.38333 697.76667 725.45833
## 1917 1918 1919 1920 1921 1922
## 1533.92500 1112.56667 1054.68333 617.25833 419.61667 251.95833
## 1923 1924 1925 1926 1927 1928
## 54.65833 278.00833 825.07500 1263.15000 1060.80000 1388.91667
## 1929 1930 1931 1932 1933 1934
## 1238.91667 516.60833 279.06667 163.15000 91.25000 118.22500
## 1935 1936 1937 1938 1939 1940
## 622.07500 1140.84167 2072.77500 2015.23333 1576.83333 1037.16667
## 1941 1942 1943 1944 1945 1946
## 658.12500 427.33333 296.76667 124.65000 426.50000 1823.90000
## 1947 1948 1949 1950 1951 1952
## 2634.05000 1974.60000 2139.99167 1227.34167 1135.25833 402.85000
## 1953 1954 1955 1956 1957 1958
## 145.10000 34.64167 552.35000 2394.69167 3048.47500 3011.26667
## 1959 1960 1961 1962 1963 1964
## 2872.94167 1641.21667 613.50000 463.51667 287.83333 53.90833
## 1965 1966 1967 1968 1969 1970
## 113.30833 592.55833 1519.14167 1569.75833 1450.13333 1601.25833
## 1971 1972 1973 1974 1975 1976
## 990.20000 916.70000 457.55000 398.85833 166.40833 169.77500
## 1977 1978 1979 1980 1981 1982
## 347.04167 1368.51667 2194.52500 2160.71667 2270.24167 2283.30000
## 1983 1984 1985 1986 1987 1988
## 944.00000 811.31667 180.15000 124.70000 296.81667 1345.26667
## 1989 1990 1991 1992 1993 1994
## 2579.22500 2048.68333 2470.18333 1349.20833 696.20000 340.39167
## 1995 1996 1997 1998 1999 2000
## 159.63333 81.90000 210.24167 763.07500 1162.00833 1614.21667
## 2001 2002 2003 2004 2005 2006
## 1704.09167 1828.73333 1099.24167 683.77500 542.57500 245.10833
## 2007 2008 2009 2010 2011 2012
## 133.34167 22.80833 26.55833 214.29167 749.56667 796.88333
## 2013 2014 2015
## 860.77500 1252.15000 618.80833

plot_grid(autoplot(sunspotarea),ggAcf(sunspotarea,lag=140))

#ggsubseriesplot(sunspotarea) +
# ggtitle("Seasonal subseries plot: Austrailian Quarterly clay production")
#ggseasonplot(sunspotarea)
#ggseasonplot(sunspotarea,polar=TRUE)
gglagplot(sunspotarea)

#plot(stl(sunspotarea,na.action = na.approx, s.window =25),main="Frequency by month with stl")
Deeper look at gasoline
- Frequency is weekly, and 52
- Clear seasonality at the quarterly level
- Clear upward trend in the data
- very high auto correlations
- It appears that production peaked in 2005
- It’s hard to extrapolate out the trend in production. More data is needed. We know the gas dataset earlier, Australia had a large increase in production from 1970 on. Our Gasoline dataset here, Presumably starts near the end of that large spike in global production.
#?gasoline
autoplot(gasoline)

plot_grid(autoplot(gasoline),ggAcf(gasoline,lag=52))

#ggsubseriesplot(gasoline) +
# ggtitle("Seasonal subseries plot: weekly gasoline supply")
ggseasonplot(gasoline)

ggseasonplot(gasoline,polar=TRUE)


plot(stl(gasoline,na.action = na.approx, s.window =25),main="Frequency by month with stl")
