The data below (data set fancy) concern the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff.
In the time plot we see the expected seasonality: a large spike at Christmas and a smaller spike in March for the surfing festival. Over time, sales volume increases, which makes sense given that the shop has expanded.
Logarithms of the data should be taken so that the seasonality can be seen clearly without the increasing variation in the data over time.
Use R to fit a regression model to the logarithms of these sales data with a linear trend, seasonal dummies and a “surfing festival” dummy variable.
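A sketch of how such a model could be fit is below. The object names (fancy_log, festival_dummy, fancy_data, fit_fancy) are chosen to match the output that follows; the fancy data set is assumed to come with the fpp/fma packages.

```r
library(fpp)   # loads forecast and, via fma, the 'fancy' data set (assumption)

# Log-transform the monthly sales
fancy_log <- log(fancy)

# Surfing-festival dummy: 1 every March from 1988 onwards, 0 otherwise
festival_dummy <- rep(0, length(fancy_log))
festival_dummy[seq(3, length(fancy_log), by = 12)] <- 1
festival_dummy[3] <- 0                      # no festival in March 1987
festival_dummy <- ts(festival_dummy, frequency = 12, start = c(1987, 1))

fancy_data <- data.frame(fancy_log, festival_dummy)

# Linear trend + monthly seasonal dummies + festival dummy
fit_fancy <- tslm(fancy_log ~ trend + season + festival_dummy, data = fancy_data)
summary(fit_fancy)
```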
##
## Call:
## tslm(formula = fancy_log ~ trend + season + festival_dummy, data = fancy_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33673 -0.12757 0.00257 0.10911 0.37671
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.6196670 0.0742471 102.626 < 2e-16 ***
## trend 0.0220198 0.0008268 26.634 < 2e-16 ***
## season2 0.2514168 0.0956790 2.628 0.010555 *
## season3 0.2660828 0.1934044 1.376 0.173275
## season4 0.3840535 0.0957075 4.013 0.000148 ***
## season5 0.4094870 0.0957325 4.277 5.88e-05 ***
## season6 0.4488283 0.0957647 4.687 1.33e-05 ***
## season7 0.6104545 0.0958039 6.372 1.71e-08 ***
## season8 0.5879644 0.0958503 6.134 4.53e-08 ***
## season9 0.6693299 0.0959037 6.979 1.36e-09 ***
## season10 0.7473919 0.0959643 7.788 4.48e-11 ***
## season11 1.2067479 0.0960319 12.566 < 2e-16 ***
## season12 1.9622412 0.0961066 20.417 < 2e-16 ***
## festival_dummy 0.5015151 0.1964273 2.553 0.012856 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.179 on 70 degrees of freedom
## Multiple R-squared: 0.9567, Adjusted R-squared: 0.9487
## F-statistic: 119 on 13 and 70 DF, p-value: < 2.2e-16
Plot the residuals against time and against the fitted values. Do these plots reveal any problems with the model?
Both plots show the residuals to be random and do not point to any problems with the model.
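For reference, the two plots could be produced along these lines (assuming the fit_fancy object from the sketch above):

```r
res <- residuals(fit_fancy)

plot(res, xlab = "Year", ylab = "Residuals")        # residuals against time
plot(fitted(fit_fancy), res,
     xlab = "Fitted values", ylab = "Residuals")    # residuals against fitted values
```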
The boxplots show some wider variance towards the end of the summer and start of the fall. This could point to the model missing out on expressing some seasonality at this time.
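A sketch of the monthly boxplots, assuming the residuals retain the monthly ts attributes of the response (otherwise cycle() of the original series could be used to group them):

```r
# Boxplots of residuals by month of the year, using res from the sketch above
boxplot(res ~ cycle(res), xlab = "Month", ylab = "Residuals")
```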
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02202 0.39041 0.54474 1.12051 0.72788 7.61967
Looking at the coefficients we can see how much each variable contributes to the model. All are highly significant except for season 3, which corresponds to March, when the surfing festival takes place. Because that same period is also captured by the festival dummy variable, season 3 carries less weight in the model.
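Whether any autocorrelation remains in the residuals can be checked with a Durbin-Watson test, for example via the lmtest package (a sketch):

```r
library(lmtest)

# Two-sided Durbin-Watson test on the regression residuals
dwtest(fit_fancy, alternative = "two.sided")
```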
##
## Durbin-Watson test
##
## data: fit_fancy
## DW = 0.88889, p-value = 1.956e-07
## alternative hypothesis: true autocorrelation is not 0
The Durbin-Watson test rejects the null hypothesis, indicating that there is still some autocorrelation remaining in the residuals that could be exploited in order to obtain a better forecast.
Regardless of your answers to the above questions, use your regression model to predict the monthly sales for 1994, 1995, and 1996. Produce prediction intervals for each of your forecasts.
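A sketch of how the forecasts could be produced: forecast() needs the festival dummy supplied for the 36 future months as well (1 every March, 0 otherwise), and the column name in newdata must match the regressor name used in the model.

```r
# Festival dummy for Jan 1994 - Dec 1996
future_festival <- rep(c(0, 0, 1, rep(0, 9)), 3)

fc <- forecast(fit_fancy, newdata = data.frame(festival_dummy = future_festival))
fc
```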
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 1994 9.491352 9.238522 9.744183 9.101594 9.88111
## Feb 1994 9.764789 9.511959 10.017620 9.375031 10.15455
## Mar 1994 9.801475 9.461879 10.141071 9.277961 10.32499
## Apr 1994 9.941465 9.688635 10.194296 9.551707 10.33122
## May 1994 9.988919 9.736088 10.241749 9.599161 10.37868
## Jun 1994 10.050280 9.797449 10.303110 9.660522 10.44004
## Jul 1994 10.233926 9.981095 10.486756 9.844168 10.62368
## Aug 1994 10.233456 9.980625 10.486286 9.843698 10.62321
## Sep 1994 10.336841 10.084010 10.589671 9.947083 10.72660
## Oct 1994 10.436923 10.184092 10.689753 10.047165 10.82668
## Nov 1994 10.918299 10.665468 11.171129 10.528541 11.30806
## Dec 1994 11.695812 11.442981 11.948642 11.306054 12.08557
## Jan 1995 9.755590 9.499844 10.011336 9.361338 10.14984
## Feb 1995 10.029027 9.773281 10.284773 9.634775 10.42328
## Mar 1995 10.065713 9.722498 10.408928 9.536620 10.59481
## Apr 1995 10.205703 9.949957 10.461449 9.811451 10.59996
## May 1995 10.253157 9.997411 10.508903 9.858904 10.64741
## Jun 1995 10.314518 10.058772 10.570264 9.920265 10.70877
## Jul 1995 10.498164 10.242418 10.753910 10.103911 10.89242
## Aug 1995 10.497694 10.241948 10.753440 10.103441 10.89195
## Sep 1995 10.601079 10.345333 10.856825 10.206826 10.99533
## Oct 1995 10.701161 10.445415 10.956907 10.306908 11.09541
## Nov 1995 11.182537 10.926791 11.438282 10.788284 11.57679
## Dec 1995 11.960050 11.704304 12.215796 11.565797 12.35430
## Jan 1996 10.019828 9.760564 10.279093 9.620151 10.41951
## Feb 1996 10.293265 10.034000 10.552530 9.893588 10.69294
## Mar 1996 10.329951 9.982679 10.677222 9.794605 10.86530
## Apr 1996 10.469941 10.210677 10.729206 10.070264 10.86962
## May 1996 10.517395 10.258130 10.776659 10.117718 10.91707
## Jun 1996 10.578756 10.319491 10.838021 10.179079 10.97843
## Jul 1996 10.762402 10.503137 11.021667 10.362725 11.16208
## Aug 1996 10.761932 10.502667 11.021196 10.362254 11.16161
## Sep 1996 10.865317 10.606052 11.124582 10.465640 11.26499
## Oct 1996 10.965399 10.706134 11.224664 10.565722 11.36508
## Nov 1996 11.446774 11.187510 11.706039 11.047097 11.84645
## Dec 1996 12.224288 11.965023 12.483552 11.824611 12.62396
Transform your predictions and intervals to obtain predictions and intervals for the raw data.
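Since the model was fitted to the logarithms, the point forecasts and interval limits can be back-transformed with exp() (a sketch, using the fc object from above):

```r
# Back-transform the log-scale forecasts and intervals to the original scale
fc_raw <- cbind(point = exp(fc$mean),
                lo80  = exp(fc$lower[, "80%"]), hi80 = exp(fc$upper[, "80%"]),
                lo95  = exp(fc$lower[, "95%"]), hi95 = exp(fc$upper[, "95%"]))
fc_raw
```

Note that back-transformed point forecasts are medians, not means, on the original scale.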
As the book notes, the Durbin-Watson test shows that some autocorrelation remains in the residuals, and therefore information remains that could be exploited to obtain better forecasts. A dynamic-regression model might therefore suit this data better: the forecasts from the current model are unbiased, but their prediction intervals are wider than they need to be.
The data below (data set texasgas) show the demand for natural gas and the price of natural gas for 20 towns in Texas in 1969.
## price consumption
## 1 30 134
## 2 31 112
## 3 37 136
## 4 42 109
## 5 43 105
## 6 45 87
Do a scatterplot of consumption against price. The data are clearly not linear. Three possible nonlinear models for the data are given. The second model divides the data into two sections, depending on whether the price is above or below 60 cents per 1,000 cubic feet.
The scatterplot shows a clear negative relationship between price (P) and consumption (C), but it is not linear: consumption falls steeply at lower prices and flattens out at higher prices, so the slope of a fitted line changes with P.
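A scatterplot sketch (the texasgas data set is assumed to come from the fpp package):

```r
library(fpp)   # texasgas data set (assumption)

plot(texasgas$price, texasgas$consumption,
     xlab = "Price (cents per 1,000 cubic feet)", ylab = "Consumption")
```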
For the second model, the parameters \(a_1\), \(a_2\), \(b_1\), \(b_2\) can be estimated by simply fitting a regression with four regressors but no constant: (i) a dummy taking value 1 when \(P \le 60\) and 0 otherwise; (ii) \(P_1 = P\) when \(P \le 60\) and 0 otherwise; (iii) a dummy taking value 0 when \(P \le 60\) and 1 otherwise; (iv) \(P_2 = P\) when \(P > 60\) and 0 otherwise.
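A sketch of how the three models could be set up. The regressor names mirror the lm() calls shown in the output below; their definitions are inferred from the fitted coefficients (the piecewise model was fitted with an intercept rather than the no-constant parameterisation described above), and exp_price is assumed to be exp(price), which would explain the extremely small coefficient in the first fit.

```r
P <- texasgas$price

exp_price <- exp(P)                 # model 1: exponential regressor (assumption)
dummy_i   <- as.numeric(P <= 60)    # indicator: price <= 60
dummy_ii  <- ifelse(P <= 60, P, 0)  # price when <= 60, else 0
dummy_iii <- ifelse(P > 60, P, 0)   # price when > 60, else 0
sq_price  <- P^2                    # model 3: quadratic term

fit1 <- lm(texasgas$consumption ~ exp_price)
fit2 <- lm(texasgas$consumption ~ dummy_i + dummy_ii + dummy_iii)
fit3 <- lm(texasgas$consumption ~ texasgas$price + sq_price)
```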
##
## Call:
## lm(formula = texasgas$consumption ~ exp_price)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.86 -25.09 -13.86 20.64 65.14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.086e+01 7.670e+00 9.238 2.98e-08 ***
## exp_price -1.642e-43 1.711e-43 -0.959 0.35
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.19 on 18 degrees of freedom
## Multiple R-squared: 0.04864, Adjusted R-squared: -0.004214
## F-statistic: 0.9203 on 1 and 18 DF, p-value: 0.3501
## [1] 1101.359
##
## Call:
## lm(formula = texasgas$consumption ~ dummy_i + dummy_ii + dummy_iii)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.987 -6.421 2.823 9.324 22.617
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.7861 51.8428 1.635 0.1215
## dummy_i 136.1068 54.9412 2.477 0.0248 *
## dummy_ii -2.9057 0.3738 -7.773 8.05e-07 ***
## dummy_iii -0.4470 0.5634 -0.793 0.4392
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.49 on 16 degrees of freedom
## Multiple R-squared: 0.8602, Adjusted R-squared: 0.834
## F-statistic: 32.81 on 3 and 16 DF, p-value: 4.565e-07
## [1] 182.1078
##
## Call:
## lm(formula = texasgas$consumption ~ texasgas$price + sq_price)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.5601 -5.4693 0.7502 11.0252 25.6619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 273.930628 31.031614 8.827 9.32e-08 ***
## texasgas$price -5.675863 1.009086 -5.625 3.03e-05 ***
## sq_price 0.033904 0.007412 4.574 0.000269 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.37 on 17 degrees of freedom
## Multiple R-squared: 0.8315, Adjusted R-squared: 0.8117
## F-statistic: 41.95 on 2 and 17 DF, p-value: 2.666e-07
## [1] 206.5276
## [1] -0.004214286
## [1] 200.7363
## [1] 0.811689
## [1] 168.1158
Looking at the graphs it is clear that Model 1 differs drastically from the other two and is likely not a very good model, as its residual variance is much larger. Models 2 and 3 both show more acceptable residual variance. Comparing R-squared and AIC, Model 2 has a higher R-squared and a smaller AIC, so it is likely the best model of the three.
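A compact side-by-side comparison, assuming the fit objects from the sketch above:

```r
# Adjusted R-squared and AIC for the three candidate models
sapply(list(model1 = fit1, model2 = fit2, model3 = fit3),
       function(m) c(adj.R.squared = summary(m)$adj.r.squared, AIC = AIC(m)))
```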
For prices 40, 60, 80, 100, and 120 cents per 1,000 cubic feet, compute the forecasted per capita demand using the best model of the three above.
## 1 2 3 4 5 6 7 8
## 133.72290 130.81724 113.38323 98.85490 95.94923 90.13790 75.60956 63.98689
## 9 10 11 12 13 14 15 16
## 63.98689 55.26989 52.36423 52.36423 46.55289 52.15787 45.45344 45.00647
## 17 18 19 20
## 43.66559 41.43078 40.08989 39.19597
These predictions come out for the 20 original observations instead of the 5 requested prices. This happens because of how the variables were named when the model was fitted; a possible fix is sketched below.
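When predict() cannot find the model's regressors in newdata, it falls back on the full-length vectors used at fitting time, which is why 20 values appear. A sketch of a fix for the piecewise model (the best of the three): supply a newdata frame whose column names match the regressor names used in the formula.

```r
p <- c(40, 60, 80, 100, 120)

new_prices <- data.frame(dummy_i   = as.numeric(p <= 60),
                         dummy_ii  = ifelse(p <= 60, p, 0),
                         dummy_iii = ifelse(p > 60, p, 0))

# Forecasted per capita demand with prediction intervals at the five prices
predict(fit2, newdata = new_prices, interval = "prediction")
```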
## fit lwr upr
## 1 133.72290 100.916974 166.52883
## 2 130.81724 98.340621 163.29385
## 3 113.38323 82.526833 144.23964
## 4 98.85490 68.835752 128.87405
## 5 95.94923 66.037298 125.86117
## 6 90.13790 60.378172 119.89762
## 7 75.60956 45.862015 105.35711
## 8 63.98689 33.871341 94.10245
## 9 63.98689 33.871341 94.10245
## 10 55.26989 24.665025 85.87476
## 11 52.36423 21.557183 83.17127
## 12 52.36423 21.557183 83.17127
## 13 46.55289 15.285111 77.82067
## 14 52.15787 14.378306 89.93743
## 15 45.45344 14.574639 76.33223
## 16 45.00647 14.269892 75.74306
## 17 43.66559 13.078544 74.25263
## 18 41.43078 10.168295 72.69326
## 19 40.08989 7.892941 72.28684
## 20 39.19597 6.174121 72.21781
The same naming issue as before prevents the graph from being produced (a plotting sketch is given below). If the prediction intervals on such a graph were narrow, we would conclude the model is predicting accurately; if they were wide, the model would not be predicting very accurately.
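A sketch of how the intervals could be plotted against the data, using the five-price predictions from the fix above:

```r
pred <- predict(fit2, newdata = new_prices, interval = "prediction")

plot(texasgas$price, texasgas$consumption,
     xlab = "Price (cents per 1,000 cubic feet)", ylab = "Consumption",
     ylim = range(pred, texasgas$consumption))
lines(p, pred[, "fit"])              # point forecasts
lines(p, pred[, "lwr"], lty = 2)     # lower prediction limits
lines(p, pred[, "upr"], lty = 2)     # upper prediction limits
```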
## [1] 0.9904481
The correlation between \(P\) and \(P^2\) is about 0.99. Because a regressor and its square are so highly correlated, they carry nearly the same information, which is a general problem in polynomial regressions: the individual coefficients become unstable and hard to interpret, even though the fitted curve may still describe how consumption changes as price increases.
Show that a 3×5 MA is equivalent to a 7-term weighted moving average with weights of 0.067, 0.133, 0.200, 0.200, 0.200, 0.133, and 0.067.
\[
\begin{aligned}
3\times 5\text{ MA} &= \tfrac{1}{3}\left[\tfrac{1}{5}(y_1+y_2+y_3+y_4+y_5) + \tfrac{1}{5}(y_2+y_3+y_4+y_5+y_6) + \tfrac{1}{5}(y_3+y_4+y_5+y_6+y_7)\right] \\
&= \tfrac{1}{3}\left(\tfrac{1}{5}y_1 + \tfrac{2}{5}y_2 + \tfrac{3}{5}y_3 + \tfrac{3}{5}y_4 + \tfrac{3}{5}y_5 + \tfrac{2}{5}y_6 + \tfrac{1}{5}y_7\right) \\
&= \tfrac{1}{15}y_1 + \tfrac{2}{15}y_2 + \tfrac{3}{15}y_3 + \tfrac{3}{15}y_4 + \tfrac{3}{15}y_5 + \tfrac{2}{15}y_6 + \tfrac{1}{15}y_7 \\
&\approx 0.067y_1 + 0.133y_2 + 0.200y_3 + 0.200y_4 + 0.200y_5 + 0.133y_6 + 0.067y_7
\end{aligned}
\]
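The equivalence can also be checked numerically (a small sketch):

```r
set.seed(1)
y <- rnorm(7)

ma5 <- sapply(1:3, function(i) mean(y[i:(i + 4)]))   # the three 5-term MAs
mean(ma5)                                            # 3x5 MA at the centre point
sum(c(1, 2, 3, 3, 3, 2, 1) / 15 * y)                 # 7-term weighted MA: same value
```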
The data below represent the monthly sales (in thousands) of product A for a plastics manufacturer for years 1 through 5 (data set plastics).
## Jan Feb Mar Apr May Jun
## 1 742 697 776 898 1030 1107
It appears that there is a seasonal fluctuation with a peak towards the end of summer each year. There is a positive trend over time.
Use a classical multiplicative decomposition to calculate the trend-cycle and seasonal indices.
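A sketch of the decomposition (the plastics data set is assumed to come with the fma/fpp packages); the output below shows summary() of the decomposition, the trend-cycle and the seasonal indices:

```r
library(fpp)   # plastics data set (assumption)

fit_dec <- decompose(plastics, type = "multiplicative")
summary(fit_dec)     # components of the decomposition
fit_dec$trend        # trend-cycle
fit_dec$seasonal     # seasonal indices
```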
## Length Class Mode
## x 60 ts numeric
## seasonal 60 ts numeric
## trend 60 ts numeric
## random 60 ts numeric
## figure 12 -none- numeric
## type 1 -none- character
## Jan Feb Mar Apr May Jun Jul
## 1 NA NA NA NA NA NA 976.9583
## 2 1000.4583 1011.2083 1022.2917 1034.7083 1045.5417 1054.4167 1065.7917
## 3 1117.3750 1121.5417 1130.6667 1142.7083 1153.5833 1163.0000 1170.3750
## 4 1208.7083 1221.2917 1231.7083 1243.2917 1259.1250 1276.5833 1287.6250
## 5 1374.7917 1382.2083 1381.2500 1370.5833 1351.2500 1331.2500 NA
## Aug Sep Oct Nov Dec
## 1 977.0417 977.0833 978.4167 982.7083 990.4167
## 2 1076.1250 1084.6250 1094.3750 1103.8750 1112.5417
## 3 1175.5000 1180.5417 1185.0000 1190.1667 1197.0833
## 4 1298.0417 1313.0000 1328.1667 1343.5833 1360.6250
## 5 NA NA NA NA NA
## Jan Feb Mar Apr May Jun Jul
## 1 0.7670466 0.7103357 0.7765294 0.9103112 1.0447386 1.1570026 1.1636317
## 2 0.7670466 0.7103357 0.7765294 0.9103112 1.0447386 1.1570026 1.1636317
## 3 0.7670466 0.7103357 0.7765294 0.9103112 1.0447386 1.1570026 1.1636317
## 4 0.7670466 0.7103357 0.7765294 0.9103112 1.0447386 1.1570026 1.1636317
## 5 0.7670466 0.7103357 0.7765294 0.9103112 1.0447386 1.1570026 1.1636317
## Aug Sep Oct Nov Dec
## 1 1.2252952 1.2313635 1.1887444 0.9919176 0.8330834
## 2 1.2252952 1.2313635 1.1887444 0.9919176 0.8330834
## 3 1.2252952 1.2313635 1.1887444 0.9919176 0.8330834
## 4 1.2252952 1.2313635 1.1887444 0.9919176 0.8330834
## 5 1.2252952 1.2313635 1.1887444 0.9919176 0.8330834
Yes, the results support the graphical interpretation that there is a peak during the summer: May through October have the highest seasonal indices.
Compute and plot the seasonally adjusted data.
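For a multiplicative decomposition, the seasonally adjusted series is the data divided by the seasonal component (a sketch):

```r
plastics_sa <- plastics / fit_dec$seasonal
plot(plastics_sa, ylab = "Seasonally adjusted sales (thousands)")
```

The forecast package's seasadj() helper should give the same series.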
The outlier causes a spike and drop during the summer.
If the outlier is towards the end of the time series, it has less of an impact on the earlier time points and more of an impact on the later ones.
Use a random walk with drift to produce forecasts of the seasonally adjusted data.
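A sketch of the drift forecasts of the seasonally adjusted series (the 10-month horizon matches the output below):

```r
# Random walk with drift on the seasonally adjusted data
fc_sa <- rwf(plastics_sa, drift = TRUE, h = 10)
fc_sa
```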
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 6 1220.179 1167.802 1272.555 1140.0757 1300.281
## Feb 6 1224.392 1149.706 1299.079 1110.1697 1338.615
## Mar 6 1228.606 1136.388 1320.825 1087.5706 1369.642
## Apr 6 1232.820 1125.480 1340.160 1068.6581 1396.982
## May 6 1237.034 1116.076 1357.992 1052.0443 1422.024
## Jun 6 1241.248 1107.714 1374.782 1037.0248 1445.471
## Jul 6 1245.462 1100.123 1390.800 1023.1853 1467.738
## Aug 6 1249.676 1093.129 1406.222 1010.2586 1489.093
## Sep 6 1253.889 1086.612 1421.166 998.0614 1509.718
## Oct 6 1258.103 1080.486 1435.721 986.4612 1529.745
Reseasonalise the results to give forecasts on the original scale.
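A sketch of one way to reseasonalise: multiply the drift forecasts (and their interval limits) by the corresponding monthly seasonal indices. The two-year horizon and object names are assumptions.

```r
fc_sa24  <- rwf(plastics_sa, drift = TRUE, h = 24)
seas_idx <- rep(fit_dec$figure, 2)    # Jan-Dec seasonal indices, repeated for two years

cbind(point = fc_sa24$mean * seas_idx,
      lo80  = fc_sa24$lower[, "80%"] * seas_idx,
      hi80  = fc_sa24$upper[, "80%"] * seas_idx,
      lo95  = fc_sa24$lower[, "95%"] * seas_idx,
      hi95  = fc_sa24$upper[, "95%"] * seas_idx)
```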
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 6 936.2531 883.8662 988.6400 856.1342 1016.3720
## Feb 6 863.6074 789.5211 937.6937 750.3022 976.9126
## Mar 6 942.2625 851.5257 1032.9993 803.4925 1081.0325
## Apr 6 1095.6329 990.8590 1200.4067 935.3951 1255.8707
## May 6 1234.3115 1117.1708 1351.4523 1055.1602 1413.4628
## Jun 6 1344.5774 1216.2562 1472.8987 1148.3270 1540.8278
## Jul 6 1390.0138 1251.4111 1528.6166 1178.0392 1601.9885
## Aug 6 1447.9805 1299.8079 1596.1531 1221.3700 1674.5910
## Sep 6 1463.6068 1306.4460 1620.7676 1223.2501 1703.9635
## Oct 6 1415.5186 1249.8565 1581.1806 1162.1604 1668.8768
## Nov 6 1159.9252 986.1774 1333.6730 894.2009 1425.6495
## Dec 6 1013.0000 831.5264 1194.4736 735.4600 1290.5400
## Jan 7 936.2531 747.3693 1125.1369 647.3803 1225.1259
## Feb 7 863.6074 667.5935 1059.6214 563.8300 1163.3849
## Mar 7 942.2625 739.3688 1145.1562 631.9633 1252.5616
## Apr 7 1095.6329 886.0852 1305.1806 775.1573 1416.1085
## May 7 1234.3115 1018.3147 1450.3084 903.9729 1564.6502
## Jun 7 1344.5774 1122.3185 1566.8363 1004.6617 1684.4931
## Jul 7 1390.0138 1161.6645 1618.3632 1040.7837 1739.2440
## Aug 7 1447.9805 1213.6990 1682.2620 1089.6779 1806.2831
## Sep 7 1463.6068 1223.5397 1703.6739 1096.4559 1830.7577
## Oct 7 1415.5186 1169.8021 1661.2350 1039.7276 1791.3095
## Nov 7 1159.9252 908.6863 1411.1641 775.6885 1544.1620
## Dec 7 1013.0000 756.3575 1269.6425 620.4992 1405.5008