Q1) Consider the data given by the file named KoreanImports.txt. The data is quarterly time-series of 67 observations (starting with the first quarter of 1987) of the following variables (in the given order).

APP= US Apparel Imports from Korea. NONAPP= US NonApparel Imports from Korea. KOREAGDP=Korean GDP. USPROD= US Production Index. USLCOST= US Labor Cost Index.

KoreanImport = read.table('/Users/shuaibismail/Documents copy/809/midterm/KoreanImports.txt', header = FALSE)
head(KoreanImport)
KoreanImport$APP = KoreanImport$V1
KoreanImport$NONAPP = KoreanImport$V2
KoreanImport$KOREAGDP = KoreanImport$V3
KoreanImport$USPROD = KoreanImport$V4
KoreanImport$USLCOST = KoreanImport$V5

KoreanImport = KoreanImport[6:10]
head(KoreanImport)
  1. Obtain the correlation estimates between all five variables (are they significant?). Also obtain pairwise scatter plots of all four variables. Do you initially suspect an issue of multicollinearity?

Yes, there appears to be an issue of multicollinearity between KOREAGDP and USPROD: the scatter plot and the correlation of about 0.97 show a very strong linear relationship between them.

pairs(KoreanImport)

cor(KoreanImport)
##                  APP     NONAPP   KOREAGDP    USPROD     USLCOST
## APP       1.00000000 0.52473708 0.01110722 0.1531741 -0.22466635
## NONAPP    0.52473708 1.00000000 0.52417788 0.6420864  0.09698359
## KOREAGDP  0.01110722 0.52417788 1.00000000 0.9723125  0.42414357
## USPROD    0.15317410 0.64208635 0.97231253 1.0000000  0.39132073
## USLCOST  -0.22466635 0.09698359 0.42414357 0.3913207  1.00000000
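
The question also asks whether the correlations are significant. A minimal sketch (my addition) testing one pair with stats::cor.test; the same call can be repeated for the other pairs.

# Significance test for a single correlation (repeat for other pairs as needed)
cor.test(KoreanImport$KOREAGDP, KoreanImport$USPROD)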
  1. Estimate a regression model as APP_t = 128.398 + 19.856 NONAPP_t - 82.701 KOREAGDP_t + 78.819 USPROD_t - 8.930 USLCOST_t + ε_t

Scaling the predictors is useful here because their values range from roughly 86 to 30,000; standardizing puts the estimated coefficients on a comparable scale.

KoreanImport1 = KoreanImport
KoreanImport1[, 2:5] = scale(KoreanImport[, 2:5])
m_regressor = lm(formula = APP~.,
                           data = KoreanImport1)
summary(m_regressor)
## 
## Call:
## lm(formula = APP ~ ., data = KoreanImport1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -76.905 -25.096  -6.814  19.726 132.285 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  128.398      4.584  28.009  < 2e-16 ***
## NONAPP        19.856      7.327   2.710  0.00869 ** 
## KOREAGDP     -82.701     23.955  -3.452  0.00101 ** 
## USPROD        78.819     26.479   2.977  0.00415 ** 
## USLCOST       -8.930      5.169  -1.728  0.08906 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.52 on 62 degrees of freedom
## Multiple R-squared:  0.4714, Adjusted R-squared:  0.4373 
## F-statistic: 13.82 on 4 and 62 DF,  p-value: 4.079e-08
  1. NONAPP, KOREAGDP, and USPROD are statistically significant at the 5% level, while USLCOST is significant only at the 10% level (p = 0.089). Each of these predictors has a linear relationship with APP, so the model can be useful for prediction.

128.398 is the intercept: the expected value of APP when all the (standardized) predictors equal zero, i.e., when each predictor is at its sample mean.

For NONAPP, the β1 coefficient is 19.856: holding the other predictors constant, a one-standard-deviation increase in US Non-Apparel Imports from Korea (the predictors are standardized) is associated with a 19.856 unit increase in APP. This is evidence of a positive linear relationship.

For KOREAGDP, the β2 coefficient is -82.701: holding the other predictors constant, a one-standard-deviation increase in Korean GDP is associated with an 82.701 unit decrease in APP. This is evidence of a negative linear relationship between Korean GDP and US apparel imports.

For USPROD, the β3 coefficient is 78.819: holding the other predictors constant, a one-standard-deviation increase in the US Production Index is associated with a 78.819 unit increase in APP. This is evidence of a positive linear relationship.

For USLCOST, the β4 coefficient is -8.930: holding the other predictors constant, a one-standard-deviation increase in the US Labor Cost Index is associated with an 8.930 unit decrease in APP. This is evidence of a negative linear relationship between these two variables.
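
Because the predictors were standardized, each slope above is per one-standard-deviation change in the predictor. A short sketch (my addition; pred_sds is a hypothetical name) of converting the standardized slopes back to the original units of each predictor:

# Slope in original units = standardized slope / sd of the original predictor
pred_sds <- apply(KoreanImport[, 2:5], 2, sd)
coef(m_regressor)[-1] / pred_sds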

confint(m_regressor, level=.90)
##                     5 %        95 %
## (Intercept)  120.743133 136.0523893
## NONAPP         7.620534  32.0908346
## KOREAGDP    -122.701972 -42.7009837
## USPROD        34.603811 123.0343131
## USLCOST      -17.561719  -0.2982608
confint(m_regressor)
##                   2.5 %     97.5 %
## (Intercept)  119.234182 137.561341
## NONAPP         5.208627  34.502741
## KOREAGDP    -130.587242 -34.815714
## USPROD        25.887689 131.750435
## USLCOST      -19.263286   1.403306
confint(m_regressor, level=.99)
##                    0.5 %     99.5 %
## (Intercept)  116.2154883 140.580034
## NONAPP         0.3835501  39.327818
## KOREAGDP    -146.3619103 -19.041045
## USPROD         8.4508812 149.187243
## USLCOST      -22.6673108   4.807331
  1. Formally check if all the residuals assumptions are satisfied (normality, autocorrelation, constant variance). In doing so, for each case write down your hypothesis properly and state your conclusion.

From the histogram, the residuals look fairly normally distributed. However, the Shapiro-Wilk test says otherwise. H0: the residuals are normally distributed; H1: they are not. Since the p-value = 0.02215 < 0.05, we reject H0: the normality assumption is violated.

shapiro.test(m_regressor$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  m_regressor$residuals
## W = 0.95746, p-value = 0.02215
hist(m_regressor$residuals,breaks=20, col="steelblue")

For constant variance, the White test. H0: the residuals have constant variance (homoskedasticity); H1: the variance is not constant. The p-value is 0.191421, so we fail to reject H0: the constant-variance assumption is satisfied.

library(skedastic)
white_lm(m_regressor, interactions = TRUE)

For autocorrelation, the Ljung-Box test. H0: the residuals are not autocorrelated (they are white noise); H1: the residuals are autocorrelated. The p-value is 4.319e-14, so we reject H0: the residuals are autocorrelated and this assumption is violated.

Box.test(m_regressor$residuals,lag = 20, type = 'Ljung')
## 
##  Box-Ljung test
## 
## data:  m_regressor$residuals
## X-squared = 108.12, df = 20, p-value = 4.319e-14
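
As a complementary check for first-order autocorrelation in the residuals (my addition; assumes the lmtest package is installed), the Durbin-Watson test could also be run:

library(lmtest)
dwtest(m_regressor)  # H0: no first-order autocorrelation in the regression residuals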
  1. Obtain the variance inflation factors (VIF’s) of the four independent variables and identify any problematic ones. If a potential problem exists, investigate whether eliminating any of the independent variable(s) will alleviate the problem. Re-estimate a model that does not have multicollinearity.

The VIF values of KOREAGDP and USPROD are greater than 10. This indicates multicollinearity between these two predictors.

library(car)
## Loading required package: carData
vif(m_regressor)
##    NONAPP  KOREAGDP    USPROD   USLCOST 
##  2.516747 26.899961 32.867384  1.252612
m2_regressor = lm(formula = APP~.-USPROD,
                           data = KoreanImport1)
summary(m2_regressor)
## 
## Call:
## lm(formula = APP ~ . - USPROD, data = KoreanImport1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -94.043 -23.760  -7.633  23.045 135.865 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  128.398      4.862  26.410  < 2e-16 ***
## NONAPP        34.278      5.829   5.880  1.7e-07 ***
## KOREAGDP     -13.700      6.407  -2.138   0.0364 *  
## USLCOST       -8.752      5.482  -1.596   0.1154    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.79 on 63 degrees of freedom
## Multiple R-squared:  0.3959, Adjusted R-squared:  0.3671 
## F-statistic: 13.76 on 3 and 63 DF,  p-value: 5.265e-07

After dropping USPROD, all VIFs are well below 10, so there is no more multicollinearity.

vif(m2_regressor)
##   NONAPP KOREAGDP  USLCOST 
## 1.416269 1.710699 1.252444

Compare this model to your model from part 2 in terms of model fit. Which one provides a better fit to data based on the residual standard error, Adjusted R-squared, AIC, and BIC measures?

Based on residual standard error, the first model provides a better fit (37.52 vs. 39.79).

Based on Adjusted R-squared, the first model provides a better fit (0.4373 vs. 0.3671).

Based on AIC, the first model provides a better fit (682.68 vs. 689.63).

Based on BIC, the first model provides a better fit (695.91 vs. 700.66).
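
For convenience, the four measures can be collected side by side (a sketch of my own; the individual summary and AIC/BIC calls are shown below):

# Side-by-side fit comparison of the full model and the model without USPROD
data.frame(model  = c("full (m_regressor)", "no USPROD (m2_regressor)"),
           sigma  = c(summary(m_regressor)$sigma, summary(m2_regressor)$sigma),
           adj_r2 = c(summary(m_regressor)$adj.r.squared, summary(m2_regressor)$adj.r.squared),
           AIC    = c(AIC(m_regressor), AIC(m2_regressor)),
           BIC    = c(BIC(m_regressor), BIC(m2_regressor)))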

summary(m_regressor)
## 
## Call:
## lm(formula = APP ~ ., data = KoreanImport1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -76.905 -25.096  -6.814  19.726 132.285 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  128.398      4.584  28.009  < 2e-16 ***
## NONAPP        19.856      7.327   2.710  0.00869 ** 
## KOREAGDP     -82.701     23.955  -3.452  0.00101 ** 
## USPROD        78.819     26.479   2.977  0.00415 ** 
## USLCOST       -8.930      5.169  -1.728  0.08906 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.52 on 62 degrees of freedom
## Multiple R-squared:  0.4714, Adjusted R-squared:  0.4373 
## F-statistic: 13.82 on 4 and 62 DF,  p-value: 4.079e-08
summary(m2_regressor)
## 
## Call:
## lm(formula = APP ~ . - USPROD, data = KoreanImport1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -94.043 -23.760  -7.633  23.045 135.865 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  128.398      4.862  26.410  < 2e-16 ***
## NONAPP        34.278      5.829   5.880  1.7e-07 ***
## KOREAGDP     -13.700      6.407  -2.138   0.0364 *  
## USLCOST       -8.752      5.482  -1.596   0.1154    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.79 on 63 degrees of freedom
## Multiple R-squared:  0.3959, Adjusted R-squared:  0.3671 
## F-statistic: 13.76 on 3 and 63 DF,  p-value: 5.265e-07
AIC(m_regressor)
## [1] 682.6847
AIC(m2_regressor)
## [1] 689.6343
BIC(m_regressor)
## [1] 695.9129
BIC(m2_regressor)
## [1] 700.6578
  1. Obtain a time-series overlay plot of actual values of APP versus your predicted values and the respective 95% prediction intervals.
# Plotting actual values, predicted values, and the 95% prediction intervals
predint =  predict(m2_regressor,interval="prediction")
## Warning in predict.lm(m2_regressor, interval = "prediction"): predictions on current data refer to _future_ responses
predlower = predint[,2]
predupper = predint[,3]

plot(KoreanImport$APP, col="lightgray", ylim=c(10,400))
lines(predict(m2_regressor), col="red")
lines(predlower,col="orange")
lines(predupper,col="orange")

  1. Estimate a new regression model with quarterly seasonal indicators, a trend (if needed), and your optimal set of independent variables from part 5. Does the new model improve the fit to data?
timeS<-seq(1, length(KoreanImport1$APP))
m3_regressor<-lm(formula = APP~timeS+NONAPP+KOREAGDP+USLCOST,
         data = KoreanImport1)

summary(m3_regressor)
## 
## Call:
## lm(formula = APP ~ timeS + NONAPP + KOREAGDP + USLCOST, data = KoreanImport1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -92.487 -22.952  -7.278  22.530 135.305 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  111.806     66.603   1.679   0.0982 .  
## timeS          0.488      1.954   0.250   0.8036    
## NONAPP        33.553      6.552   5.121 3.18e-06 ***
## KOREAGDP     -22.422     35.508  -0.631   0.5301    
## USLCOST       -9.440      6.173  -1.529   0.1313    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40.09 on 62 degrees of freedom
## Multiple R-squared:  0.3965, Adjusted R-squared:  0.3575 
## F-statistic: 10.18 on 4 and 62 DF,  p-value: 2.116e-06

No. Adding a linear trend does not improve the fit: the adjusted R-squared falls from 0.3671 to 0.3575 and the trend coefficient is not significant (p = 0.80). A sketch of adding the quarterly seasonal indicators the question asks for is given after the plot below.

plot.ts(KoreanImport1$APP, col="gray")
lines(predict(m3_regressor),col="red")
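
The question also asks for quarterly seasonal indicators. A minimal sketch of how they could be added (my addition; quarter and m3_seasonal are hypothetical names, and the quarter index simply cycles 1-4 because the series starts in the first quarter of 1987):

# Hypothetical seasonal-dummy model: the quarter label cycles 1,2,3,4 starting in Q1 1987
quarter <- factor(((seq_along(KoreanImport1$APP) - 1) %% 4) + 1)
m3_seasonal <- lm(APP ~ timeS + quarter + NONAPP + KOREAGDP + USLCOST,
                  data = KoreanImport1)
summary(m3_seasonal)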

Q2) Consider the time series data given by the file β€œdataQ2.txt”. The data has 300 monthly observations for a financial series. In all your analysis below, use only the first 290 observations of data as your training set and use the last 10 observations as your test set.

dataQ2 = read.table('/Users/shuaibismail/Documents copy/809/midterm/dataQ2.txt', header = FALSE)
head(dataQ2)

Train-test Split

dataQ2["Month"]<-rep(1:12, times=25)
train_data<- dataQ2[1:290,]
test_data<-dataQ2[291:300,]

#c(nrow(dataQ2), nrow(train_data), nrow(test_data))
  1. Obtain the time series and the autocorrelation plots of the data. What do you observe? Is the series stationary? Why or why not?

This is not a stationary series; the plots show seasonality in the data that repeats itself at a regular interval (s = 4).
plot.ts(dataQ2$V1, col='steelblue')

acf(dataQ2$V1, col='steelblue')

Box plot

boxplot(dataQ2$V1~dataQ2$Month, col="steelblue")
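
The visual inspection could be complemented with a formal unit-root test (my addition; assumes the tseries package is installed), where the null hypothesis is that the series is non-stationary:

library(tseries)
adf.test(train_data$V1)  # H0: the series has a unit root (non-stationary)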

2. Obtain a polynomial time series model. In doing so, please write down the estimated model. Check if the residuals are white noise. Obtain forecasts for the last 10 observations.

Estimating a Polynomial model

timeQ<-seq(1, length(train_data$V1))
summary(poly_model<- lm(formula = V1~poly(timeQ, 3, raw=TRUE), 
                            data = train_data))
## 
## Call:
## lm(formula = V1 ~ poly(timeQ, 3, raw = TRUE), data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9132 -1.0078 -0.0357  1.1122  3.1055 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                  8.192e-01  3.539e-01   2.315  0.02134 * 
## poly(timeQ, 3, raw = TRUE)1 -2.967e-02  1.051e-02  -2.821  0.00512 **
## poly(timeQ, 3, raw = TRUE)2  2.534e-04  8.388e-05   3.021  0.00274 **
## poly(timeQ, 3, raw = TRUE)3 -5.962e-07  1.895e-07  -3.146  0.00183 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.487 on 286 degrees of freedom
## Multiple R-squared:  0.03926,    Adjusted R-squared:  0.02918 
## F-statistic: 3.896 on 3 and 286 DF,  p-value: 0.00942

The estimated cubic trend model is Y_t = 0.8192 - 0.02967 t + 0.0002534 t^2 - 0.0000005962 t^3. The in-sample fit is poor (R-squared is only about 0.039).

plot.ts(train_data$V1, col="gray")
lines(predict(poly_model),col="red")

The residuals are not white-noise

acf(poly_model$residuals, lag = 20)

The forecasts for the last 10 observations do not track the test data well.

plot(291:300, test_data$V1, type = "b", col = "gray")
lines(291:300, predict(poly_model, data.frame(timeQ = 291:300)), col = "blue")

3. Obtain the periodogram of the time series data (de-trend it if needed) and identify the sine-cosine pairs which have the highest amplitudes.

Harmonics 7, 13, 14, and 15 have the highest amplitudes (they appear last when the periodogram ordinates are sorted in increasing order below).

library(TSA)
## 
## Attaching package: 'TSA'
## The following objects are masked from 'package:stats':
## 
##     acf, arima
## The following object is masked from 'package:utils':
## 
##     tar
order(periodogram(train_data$V1)$spec)

##   [1] 110 112  91  78  46  50  95  59  45  51 131 129 146 138  57 102  84 116
##  [19]  96 128 111 141 143 149  69  68 121  64 105  29  67 120  74  48 145 122
##  [37] 109   2 140 148 117 101  76  41  43  72  75   9 130  70 103  90  88 107
##  [55]  62 134  82 135 114 106  92  93  33 100  60 124 125 133 150 115  18  26
##  [73]  89  36 126   3  34 132  24  22  23  83  19  21  63 147  25  98  73 144
##  [91]  49  38  77  31  81  65 142 137  30  86  99  85 113  42   5  53  97  58
## [109] 123 136 118  66 139  35  79  55  71  39  94 104   1 108  44  11 119  40
## [127]  27  16  56   4  10  87  54  47  80  52  37 127  61   6  32  28  20  12
## [145]   8  17  15  14  13   7
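
A sketch (my addition, assuming TSA's periodogram() accepts plot = FALSE and returns freq and spec components, as it typically does) of extracting the dominant harmonics and converting their frequencies to periods in months:

# Pull out the top harmonics and convert frequency to period (in months)
pg  <- periodogram(train_data$V1, plot = FALSE)
top <- order(pg$spec, decreasing = TRUE)[1:5]
data.frame(harmonic = top, frequency = pg$freq[top], period = 1 / pg$freq[top])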
  1. Based on your conclusion in part 3, obtain a best fit harmonic sine/cosine model and report the model. Discuss the significance of the model parameters.

We can see that the sine/cosine pairs at the 7th, 13th and 14th harmonics are statistically significant; these are the hidden periodicities. These harmonics correspond to frequencies 7/290, 13/290 and 14/290, i.e., cycles that repeat roughly every 41, 22 and 21 months over the training period. The 15th and 17th harmonics are not statistically significant.

n<-290
sin1<-sin(2*pi*timeQ*7/n)
cos1<-cos(2*pi*timeQ*7/n)
sin2<-sin(2*pi*timeQ*13/n)
cos2<-cos(2*pi*timeQ*13/n)  
sin3<-sin(2*pi*timeQ*14/n)
cos3<-cos(2*pi*timeQ*14/n)
sin4<-sin(2*pi*timeQ*15/n)
cos4<-cos(2*pi*timeQ*15/n)
sin5<-sin(2*pi*timeQ*17/n)
cos5<-cos(2*pi*timeQ*17/n)


summary(trig_model<-lm(V1~timeQ+sin1+cos1+sin2+cos2+sin3+cos3+sin4+cos4+sin5+cos5, 
                   data=train_data))
## 
## Call:
## lm(formula = V1 ~ timeQ + sin1 + cos1 + sin2 + cos2 + sin3 + 
##     cos3 + sin4 + cos4 + sin5 + cos5, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.99477 -0.41458  0.00067  0.40287  1.66130 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.1196099  0.0757419  -1.579   0.1154    
## timeQ        0.0007139  0.0004526   1.577   0.1158    
## sin1         1.2645589  0.0532659  23.740  < 2e-16 ***
## cos1         0.4445201  0.0529338   8.398 2.36e-15 ***
## sin2         1.3770328  0.0530280  25.968  < 2e-16 ***
## cos2         0.1295426  0.0529338   2.447   0.0150 *  
## sin3         0.2718404  0.0530146   5.128 5.51e-07 ***
## cos3         0.0280577  0.0529338   0.530   0.5965    
## sin4         0.0881655  0.0530038   1.663   0.0974 .  
## cos4         0.0424693  0.0529338   0.802   0.4231    
## sin5         0.0288706  0.0529876   0.545   0.5863    
## cos5        -0.0291144  0.0529338  -0.550   0.5828    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6374 on 278 degrees of freedom
## Multiple R-squared:  0.8285, Adjusted R-squared:  0.8217 
## F-statistic: 122.1 on 11 and 278 DF,  p-value: < 2.2e-16
trig_model2<-lm(V1~timeQ+sin1+cos1+sin2+cos2+sin3+cos3, 
                   data=train_data)
summary(trig_model2)
## 
## Call:
## lm(formula = V1 ~ timeQ + sin1 + cos1 + sin2 + cos2 + sin3 + 
##     cos3, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.97443 -0.44488  0.01976  0.41648  1.74848 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.1124192  0.0756010  -1.487    0.138    
## timeQ        0.0006645  0.0004514   1.472    0.142    
## sin1         1.2639084  0.0532649  23.729  < 2e-16 ***
## cos1         0.4445695  0.0529343   8.399 2.24e-15 ***
## sin2         1.3766842  0.0530281  25.961  < 2e-16 ***
## cos2         0.1295920  0.0529343   2.448    0.015 *  
## sin3         0.2715170  0.0530148   5.122 5.62e-07 ***
## cos3         0.0281071  0.0529343   0.531    0.596    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6374 on 282 degrees of freedom
## Multiple R-squared:  0.826,  Adjusted R-squared:  0.8217 
## F-statistic: 191.3 on 7 and 282 DF,  p-value: < 2.2e-16
  1. For the model from part 4, check if the residuals are white noise. Also, obtain predictions for the last 10 observations.

The residuals are clearly not white noise.
acf(trig_model2$residuals, lag = 20)

Fitted values from the harmonic model overlaid on the training series. The out-of-sample predictions for the last 10 observations are computed and compared with the test data in part 6.

plot.ts(train_data$V1, col="gray")
lines(predict(trig_model2),col="steelblue")

  1. For both models, compute the respective predictive performance measures (MAE, MSE, and MAPE). Plot your predictions for each model against the actual data.
pred_poly = predict(poly_model, data.frame(timeQ=c(291:300)))

# Name the newdata columns so predict() picks up the harmonic terms for t = 291:300
pred_trig = predict(trig_model2, data.frame(timeQ = c(291:300),
                                      sin1 = sin(2*pi*(7/n)*c(291:300)),
                                      cos1 = cos(2*pi*(7/n)*c(291:300)),
                                      sin2 = sin(2*pi*(13/n)*c(291:300)),
                                      cos2 = cos(2*pi*(13/n)*c(291:300)),
                                      sin3 = sin(2*pi*(14/n)*c(291:300)),
                                      cos3 = cos(2*pi*(14/n)*c(291:300))))

The trig model has a better (lower) MAE.

mae_poly<- mean(abs(test_data$V1 - pred_poly))
mae_poly
## [1] 2.953709
mae_trig<- mean(abs(test_data$V1 - pred_trig))
mae_trig
## [1] 0.8939527

The trig model gives much better predictions based on the mean squared error.

mse_poly<- mean((test_data$V1 - pred_poly)^2)
mse_poly
## [1] 9.515094
mse_trig<- mean((test_data$V1 - pred_trig)^2)
mse_trig
## [1] 1.048849

Trig model has a better MAPE.

mape_poly<- mean(abs(test_data$V1 - pred_poly)/test_data$V1)
mape_poly
## [1] 1.957668
mape_trig<- mean(abs(test_data$V1 - pred_trig)/test_data$V1)
mape_trig
## [1] 0.9221996

Q3)

GNP = read.table('/Users/shuaibismail/Documents copy/809/midterm/GNP_Quarterly.txt', header = FALSE)
attach(GNP)
head(GNP)
  1. The time series plot looks stationary.
plot.ts(GNP)

There is no slow decay in the ACF, so this appears to be a stationary series.

acf(GNP)

2. This appears to be an AR(1) process, as the PACF cuts off after the first lag.

pacf(GNP)

The fitted model can be written as Y_t = 0.0048 + 0.3787 Y_{t-1} + ε_t. (The "intercept" reported by arima() is the series mean, 0.0077; the implied constant is C = μ(1 - φ1) = 0.0077 × (1 - 0.3787) ≈ 0.0048.)

With a t-value of about 0.3787/0.0698 ≈ 5.43, the AR(1) coefficient is statistically significant.

ar1_model = arima(GNP, 
                  order = c(1, 0, 0))
ar1_model
## 
## Call:
## arima(x = GNP, order = c(1, 0, 0))
## 
## Coefficients:
##          ar1  intercept
##       0.3787     0.0077
## s.e.  0.0698     0.0012
## 
## sigma^2 estimated as 9.801e-05:  log likelihood = 562.47,  aic = -1120.94
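
A quick computation (my addition) of the constant implied by the reported mean and AR coefficient:

# arima() reports the process mean as "intercept"; the implied constant is mu * (1 - phi1)
coef(ar1_model)["intercept"] * (1 - coef(ar1_model)["ar1"])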
plot(ar1_model$aic)

  1. Test of residuals: the ACF plot of the residuals suggests they are white noise, with the exception of a spike at lag 2.
acf(ar1_model$residuals, main = 'ACF Plot of Residuals', na.action = na.pass)

Ljung-Box test of the residuals. The null hypothesis is that the residuals are white noise. Since the p-value = 0.7104, we fail to reject the null hypothesis; the residuals can be treated as white noise.

Box.test(ar1_model$residuals, lag = 20, type = 'Ljung')
## 
##  Box-Ljung test
## 
## data:  ar1_model$residuals
## X-squared = 16.099, df = 20, p-value = 0.7104

Random walk model via the first-order difference. The ACF of the first difference is not white noise; hence, this is not a random walk process.

diff_GNP = diff(GNP$V1)
acf(diff_GNP)

The random-walk (drift) estimate is essentially zero: the intercept of the differenced series is -1e-04 with a standard error of 9e-04, not significantly different from zero.

rwModel = arima(diff_GNP, 
                order = c(0,0,0))
rwModel
## 
## Call:
## arima(x = diff_GNP, order = c(0, 0, 0))
## 
## Coefficients:
##       intercept
##          -1e-04
## s.e.      9e-04
## 
## sigma^2 estimated as 0.0001422:  log likelihood = 526.76,  aic = -1051.52
acf(rwModel$residuals, main = 'ACF Plot of the Random Walk Model (First Difference)')

pred_ar<-GNP$V1-ar1_model$residuals
plot.ts(GNP$V1)
lines(pred_ar,col="blue", type="b")

#pred_rW<-GNP$V1-rwModel$residuals
#plot.ts(GNP$V1)
#lines(pred_rW,col="blue", type="b")
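
Not required by the question, but a short sketch (my addition) of producing forecasts from the fitted AR(1):

# Point forecasts and standard errors for the next 4 quarters
predict(ar1_model, n.ahead = 4)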
  1. The first differences are not white noise. Hence it does not make sense to model the data as a random walk process, which is why the random walk model could not explain the data.

Q4)

set.seed(1)
n=100
variance = 2
Y = rep(0, n)
error_term = rnorm(n, 0, sqrt(variance))
# Simulate the AR(2) process Y_t = 2 + 0.75*Y_{t-1} - 0.125*Y_{t-2} + e_t
for (i in 3:n){
  Y[i] = 2 + .75 * Y[i-1] - .125 * Y[i-2] + error_term[i]
}
plot.ts(Y)

ACF Plot of the AR(2) series

acf(Y)

PACF plot of the series: it's evident that the AR series is of the second order.

pacf(Y)

ar2_model = arima(Y, 
                  order = c(2,0,0))
ar2_model
## 
## Call:
## arima(x = Y, order = c(2, 0, 0))
## 
## Coefficients:
##          ar1      ar2  intercept
##       0.8618  -0.2676     5.5497
## s.e.  0.0970   0.1037     0.3271
## 
## sigma^2 estimated as 1.757:  log likelihood = -170.45,  aic = 346.89

Investigating the residuals: It’s evident from the plot that the residuals are white noise.

acf(ar2_model$residuals, main = 'Plot of the AR(2) Model Residuals')

The Ljung-Box test also indicates white noise: p-value = 0.9494, so we fail to reject the null hypothesis that the residuals are white noise.

Box.test(ar2_model$residuals, lag = 20, type = 'Ljung')
## 
##  Box-Ljung test
## 
## data:  ar2_model$residuals
## X-squared = 10.873, df = 20, p-value = 0.9494

Q5) Consider a time-series Y_t. The sample autocorrelation at lag 1 for Y_t is estimated as 0.709. Also, the estimate of the mean (level) of the Y series is μ = 40.2. It is decided that Y_t can be modeled adequately by a first order stationary autoregressive process, that is, Y_t = C + φ1 Y_{t-1} + ε_t, where the ε_t are white-noise terms. Based on the given information above and using properties of an AR(1) process: 1. Estimate the constant term C, the coefficient φ1, and write down the estimated AR(1) model.

From E(Y_t) = E(C + φ1 Y_{t-1} + ε_t), we get μ = C + φ1 μ + 0, i.e., μ = C / (1 - φ1).

Since μ = 40.2 and, for an AR(1), φ1 equals the lag-1 autocorrelation, φ1 = 0.709:

40.2 = C / (1 - 0.709), so C = 40.2 × 0.291 = 11.698.

Hence C ≈ 11.70, φ1 = 0.709, and the estimated AR(1) model is Y_t = 11.70 + 0.709 Y_{t-1} + ε_t.

  1. Estimate the auto-correlations at lag 2 and 3.

For a stationary AR(1) process, the autocorrelation function decays geometrically: ρ(k) = φ1^k.

ρ(2) = φ1^2 = 0.709^2 ≈ 0.503

ρ(3) = φ1^3 = 0.709^3 ≈ 0.356

So the estimated autocorrelations at lags 2 and 3 are about 0.503 and 0.356, respectively.

  1. Estimate the partial autocorrelation at lag 1.

The PACF at lag k is the correlation between Y_t and Y_{t-k} after removing the effect of the intermediate lags; at lag 1 there is nothing to remove, so the partial autocorrelation at lag 1 equals the ordinary lag-1 autocorrelation, φ1 = 0.709. For an AR(1), the PACF is zero at lags 2 and beyond, which is why the PACF plot cuts off after lag 1.
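
A quick numerical check of these answers (my addition) using stats::ARMAacf:

# Theoretical ACF of an AR(1) with phi1 = 0.709 (lags 0-3): 1, 0.709, 0.503, 0.356
ARMAacf(ar = 0.709, lag.max = 3)
# Theoretical PACF: 0.709 at lag 1, 0 at higher lags
ARMAacf(ar = 0.709, lag.max = 3, pacf = TRUE)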