Lecture 1

Covers exploratory analysis and bivariate analysis. Assignment 1 is given: determine which independent variables affect the birth weight variable.

Lecture 2

Directed to construct cross tables and chi-square tests of association; Assignment 2 is given.

Lecture 3

Basics of Time Series. The related pdf is here.

Lecture 4

Time Series Basics

Step 1: Read the births data.

Step 2: Store the data in a time series object using the ts() function in R.

library(readr)
births <- read_csv("G:/My Drive/Turin/4th Year/AST 431/1st_inc/Time_series/Dataset/births.txt")
head(births)
births.ts <- ts(births)
plot(births.ts)

Step 3: Set a monthly time series. In start = c(1946, 1), the 1 indicates that the series starts in January.

Step 4: Make a plot of the time series data.

births1.ts <- ts(births, frequency = 12, start = c(1946, 1)) 
head(births1.ts)
##         Jan    Feb    Mar    Apr    May    Jun
## 1946 23.598 26.931 24.740 25.806 24.364 24.477
plot(births1.ts)

Decomposing a time series means separating it into three components: a trend component, a seasonal component and an irregular component.
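By default decompose() uses the additive model

\[ Y_t = T_t + S_t + I_t \]

where \(T_t\) is the trend, \(S_t\) the seasonal and \(I_t\) the irregular component.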

Step 5: Use the decompose() function in R.

d_births <- decompose(births1.ts)
plot(d_births)

Stationarity check

A stationary time series has a mean, variance and autocovariance that are not functions of time. We will check stationarity using two tools: the ACF and the PACF.
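In symbols, weak stationarity requires, for all \(t\) and every lag \(k\):

\[ E(Y_t) = \mu, \qquad \text{Var}(Y_t) = \sigma^2, \qquad \text{Cov}(Y_t, Y_{t+k}) = \gamma_k \]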

To check stationarity, we can use the autocorrelation function (ACF) via the acf() function in base R.

  • Shows correlation between current values and lags
  • If bars cross the blue dashed lines (95% confidence interval) → significant autocorrelation.

A non-stationary series shows:

  • ACF decays slowly
  • Bars stay above blue lines → long-term correlation

A stationary series shows:

  • ACF drops quickly (exponential decay)

  • Bars mostly within blue lines → short-term dependence
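As a quick sketch, applied to the births series from above:

acf(births1.ts)   # autocorrelation function with 95% confidence bands
pacf(births1.ts)  # partial autocorrelation function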

Lecture 4

After cleaning the rainfall data, we proceed with the time series analysis. The assignment is worked out below…

library(readxl)
library(tidyverse) 
library(pander)
library(tinytex)
library(dplyr)

Reading and Cleaning the Rainfall in Dhaka data

Three day variables were found to be of character data type; they are converted to numeric below.

rainfall <- read_csv("G:/My Drive/Turin/4th Year/AST 431/1st_inc/Time_series/Dataset/Rainfall_dhaka.csv", show_col_types = FALSE)

Question 01

Make two variables: average rainfall per month and date.

Solution

rainfall <- rainfall %>% 
  select(!Stati) %>%
  mutate_if(is.character, as.numeric) %>% 
  pivot_longer(
    cols = !c(Year, Month),
    names_to = "day",
    values_to = "rain"
  ) %>%
  mutate(day = as.integer(day)) %>%
  group_by(Year, Month) %>%
  mutate(meanrain = mean(rain, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(date = make_date(Year, Month, day))

rainfall

  • The column named Stati is removed from the dataset to clean up unnecessary variables.

  • All character columns are converted to numeric.

  • The dataset is reshaped from wide to long format: all columns except Year and Month are gathered into two new columns:

      day: the old column names (1–31)

      rain: the rainfall amounts for those days

  • The data are grouped by Year and Month to compute the average monthly rainfall for each year, stored in a new column meanrain.

  • A new date column is created by combining Year, Month and day into a proper date using make_date().

Question 03

Draw a basic line plot of rainfall data

Solution

rainfall <- rainfall %>% arrange(Year, Month, day)

# Remove duplicates (keep one meanrain per month)
monthly_data <- rainfall %>% distinct(Year, Month, .keep_all = TRUE)

ggplot(data = monthly_data, aes(x = date, y = meanrain))+
  geom_line(color = "#00AFBB") 

# Create time series object
t <- ts(monthly_data$meanrain, frequency = 12, start = c(1976, 1))

# Plot
plot(t, main = "Average Rainfall per Month (1976–2019)",
     xlab = "Year", ylab = "Average Rainfall")

Question 04

Draw a line plot of average rainfall after year 2000

# Use monthly_data (can't use t because it's a ts object, not a data frame)
r1 <- monthly_data %>% 
  filter(Year >= 2000)

plot(ts(r1$meanrain, start = c(2000, 1), frequency = 12),
     main = "Average Rainfall per Month after 2000",
     xlab = "Year", ylab = "Average Rainfall")

Question 05 & 06

Decompose the time series and Plot the decomposed time series.

Solution

decomposed_ts <- decompose(t, type = "additive")

plot(decomposed_ts)

# Time Series Graph

library(ggplot2)
#theme_set(theme_minimal())
# Demo dataset
head(economics)
# Basic line plot
ggplot(data = economics, aes(x = date, y = pop))+
  geom_line(color = "#00AFBB")

# Plot a subset of the data
ss <- subset(economics, date > as.Date("2006-1-1"))
ggplot(data = ss, aes(x = date, y = pop)) + 
  geom_line(color = "#FC4E07", size = 2)

#Control line size by the value of a continuous variable:
ggplot(data = economics, aes(x = date, y = pop)) +
  geom_line(aes(size = unemploy/pop), color = "#FC4E07")

Plot multiple time series data

library(tidyr)
library(dplyr)
df <- economics %>%
  select(date, psavert, uempmed) %>%
  gather(key = "variable", value = "value", -date)
head(df, 3)
# Multiple line plot
ggplot(df, aes(x = date, y = value)) + 
  geom_line(aes(color = variable), size = 1) +
  scale_color_manual(values = c("#00AFBB", "#E7B800")) +
  theme_minimal()

# Area plot
ggplot(df, aes(x = date, y = value)) + 
  geom_area(aes(color = variable, fill = variable), 
            alpha = 0.5, position = position_dodge(0.8)) +
  scale_color_manual(values = c("#00AFBB", "#E7B800")) +
  scale_fill_manual(values = c("#00AFBB", "#E7B800"))

Base plot with date axis

p <- ggplot(data = economics, aes(x = date, y = psavert)) + 
  geom_line(color = "#00AFBB", size = 1)
p

# Set axis limits c(min, max)
min <- as.Date("2002-1-1")
max <- NA
p + scale_x_date(limits = c(min, max))

# Format : month/year
p + scale_x_date(date_labels = "%b/%Y")

#Add trend smoothed line
p + stat_smooth(
  color = "#FC4E07", fill = "#FC4E07",
  method = "loess"
)

Lecture 4

Econometrics

For this topic, Slide 1 is not necessary, but Slide 2 must be followed.

Load data Journals

#install.packages('AER')
library(AER)
data(Journals)
class(Journals)
## [1] "data.frame"
head(Journals)

Create a variable: citeprice=price/citations

Journals$citeprice <- Journals$price/Journals$citations

Goal: Estimate the effect of the price per citation on the number of library subscriptions.

\[ \log(\text{subs}) = \beta_0 + \beta_1 \log(\text{citeprice}) + \epsilon \]

Model fit (without the log transformation the relationship is not linear, so we could not fit a linear model or draw a regression line):

jour_lm <- lm(log(subs) ~ log(citeprice), data = Journals)
#abline(jour_lm)  # run only after the plot below exists
summary(jour_lm)
## 
## Call:
## lm(formula = log(subs) ~ log(citeprice), data = Journals)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.72478 -0.53609  0.03721  0.46619  1.84808 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4.76621    0.05591   85.25   <2e-16 ***
## log(citeprice) -0.53305    0.03561  -14.97   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7497 on 178 degrees of freedom
## Multiple R-squared:  0.5573, Adjusted R-squared:  0.5548 
## F-statistic:   224 on 1 and 178 DF,  p-value: < 2.2e-16

Plot regression line

plot(log(subs) ~ log(citeprice), data = Journals)
abline(jour_lm)  # add the fitted regression line

Model with polynomial terms (2nd and 3rd degree)

fitted_val <- jour_lm$fitted.values

new_model <- lm(log(Journals$subs) ~ (log(Journals$citeprice)) + I(fitted_val^2) + I(fitted_val^3))

summary(new_model)
## 
## Call:
## lm(formula = log(Journals$subs) ~ (log(Journals$citeprice)) + 
##     I(fitted_val^2) + I(fitted_val^3))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5698 -0.4940  0.0444  0.4518  1.8269 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)              3.37868   11.86254   0.285    0.776
## log(Journals$citeprice) -0.43929    2.02420  -0.217    0.828
## I(fitted_val^2)          0.13885    0.75270   0.184    0.854
## I(fitted_val^3)         -0.01570    0.04881  -0.322    0.748
## 
## Residual standard error: 0.7479 on 176 degrees of freedom
## Multiple R-squared:  0.5644, Adjusted R-squared:  0.557 
## F-statistic: 76.01 on 3 and 176 DF,  p-value: < 2.2e-16

Predict log(subs) for citeprice = 2.11 (given on the original scale, not in logs) with a prediction interval.

predict(jour_lm, newdata = data.frame(citeprice = 2.11),
         interval = "prediction")
##        fit      lwr      upr
## 1 4.368188 2.883746 5.852629

Back-transforming the point prediction, exp(4.368188) ≈ 79 subscriptions.

Predict for some new data value (multiple values)

  • While making predictions, the input must be on the original scale, not the logarithmic scale; hence exp(l_citeprice) is used below.

l_citeprice <- seq(from = -6, to = 4, by = 0.25)

jour_pred <- predict(jour_lm, interval = "prediction",
                      newdata = data.frame(citeprice = exp(l_citeprice)))

Plot Prediction Interval

plot(log(subs) ~ log(citeprice), data = Journals) # plot the observed data on the log scale

lines(jour_pred[, 1] ~ l_citeprice, col = 1) #fit
lines(jour_pred[, 2] ~ l_citeprice, col = 1, lty = 2) #lower
lines(jour_pred[, 3] ~ l_citeprice, col = 1, lty = 2) #upper

Diagnostic plots

# Presumably for reference: Q-Q plot of simulated heavy-tailed t(2) data
qqnorm(rt(n = 1000, df = 2))

plot(jour_lm)

# In the console, hit <Return> to cycle through all four diagnostic plots

For details of the tests, follow the 404 slides.

Ramsey Reset Test

  • Tests for omitted variables / model misspecification.

  • Powers of the fitted values are added to the model (here the 2nd and 3rd powers, i.e., a test for quadratic and cubic response).

  • \(H_0\): there is no misspecification.

  • Draw the conclusion from the p-value.

library(lmtest)
resettest(jour_lm,power = 2:3)
## 
##  RESET test
## 
## data:  jour_lm
## RESET = 1.4409, df1 = 2, df2 = 176, p-value = 0.2395

Since p = 0.2395 > 0.05, we fail to reject \(H_0\): no evidence of misspecification.

Lagrange Multiplier test for adding variables

  • Auxiliary model: regress the residuals (u) on x, \(x^2\), \(x^3\) to check for non-linearity.

  • dim(Journals)[1] = 180 is the sample size n.

  • The LM statistic is n\(R^2\) = 180 × 0.01611 ≈ 2.90, where \(R^2\) comes from the auxiliary regression summary below.

  • Since LM is less than the table value, we fail to reject the null. Null: the additional variables do not improve the model, i.e., their coefficients are zero.

u <- resid(jour_lm)
x <- log(Journals$citeprice)

model_u <- lm(u~x + I(x^2) + I(x^3), data = Journals)
summary(model_u)
## 
## Call:
## lm(formula = u ~ x + I(x^2) + I(x^3), data = Journals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5698 -0.4940  0.0444  0.4518  1.8269 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.067327   0.073136   0.921    0.359
## x           -0.041605   0.057375  -0.725    0.469
## I(x^2)      -0.024315   0.022912  -1.061    0.290
## I(x^3)       0.002377   0.007393   0.322    0.748
## 
## Residual standard error: 0.7479 on 176 degrees of freedom
## Multiple R-squared:  0.01611,    Adjusted R-squared:  -0.0006605 
## F-statistic: 0.9606 on 3 and 176 DF,  p-value: 0.4127
dim(Journals)[1]*0.01611
## [1] 2.8998
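As a check, the 5% chi-squared critical value (df = 2, one for each added regressor \(x^2\) and \(x^3\)) can be computed directly:

qchisq(0.95, df = 2)
## [1] 5.991465

Since 2.8998 < 5.99, we fail to reject the null.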

F test for non-nested model

model1 <- lm(log(subs) ~ citations, data = Journals)
model2 <- lm(log(subs) ~ pages, data = Journals)
encomptest(model1, model2)

  • 1st null (M1 vs. ME): adding the pages variable to model 1 does not improve the model.

  • 2nd null (M2 vs. ME): adding the citations variable to model 2 does not improve the model.

Davidson Mackinnon J test

Compare two non nested models introducing 3rd one that combines predictors from both models.

jtest(model1, model2)

  • NULL: Model 1 + fitted(Model 2) →
    H0: Model 1 is correctly specified; Model 2’s predictions add no additional explanatory power.

  • NULL: Model 2 + fitted(Model 1) →
    H0: Model 2 is correctly specified; Model 1’s predictions add no additional explanatory power.

AIC(model1)
## [1] 515.3615
AIC(model2)
## [1] 532.7627

Model 1 (citations) has the lower AIC and is therefore preferred.

Econometrics (Remaining Part)

Simultaneous equation modelling (SEM). The truffles data are needed for this.

truffles <- read.table("G:/My Drive/Turin/4th Year/AST 431/After 2nd inc/data/truffles.txt", header = TRUE) # header assumed, so columns are named p, q, ps, di, pf

model <- lm(p ~ pf + ps + di,
               data = truffles)
summary(model)
## 
## Call:
## lm(formula = p ~ pf + ps + di, data = truffles)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.4825  -3.5927   0.2801   4.5326  12.9210 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -32.5124     7.9842  -4.072 0.000387 ***
## pf            1.3539     0.2985   4.536 0.000115 ***
## ps            1.7081     0.3509   4.868 4.76e-05 ***
## di            7.6025     1.7243   4.409 0.000160 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.597 on 26 degrees of freedom
## Multiple R-squared:  0.8887, Adjusted R-squared:  0.8758 
## F-statistic: 69.19 on 3 and 26 DF,  p-value: 1.597e-12
p_hat<- model$fitted.values
model2 <- lm(q ~ p_hat + ps + di,
               data = truffles)
summary(model2)
## 
## Call:
## lm(formula = q ~ p_hat + ps + di, data = truffles)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1814 -1.1390  0.2765  1.4595  4.4318 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.27947    3.01383  -1.420 0.167505    
## p_hat       -0.37446    0.08956  -4.181 0.000291 ***
## ps           1.29603    0.19309   6.712 4.03e-07 ***
## di           5.01398    1.24141   4.039 0.000422 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.68 on 26 degrees of freedom
## Multiple R-squared:  0.6974, Adjusted R-squared:  0.6625 
## F-statistic: 19.97 on 3 and 26 DF,  p-value: 6.332e-07

According to the hand note, we transformed the equations so that only exogenous variables appear on the right-hand side, which allows OLS (the reduced form). Regressing p on the exogenous predictors gives the first-stage parameter estimates. Since p_hat holds the fitted values from that first OLS model, we can safely return to the structural equation and regress q on p_hat (in place of p), ps and di; this is two-stage least squares.
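For reference, the same two-stage procedure can be run in a single call with ivreg() from the AER package (a sketch: the coefficient estimates should match model2, and ivreg also reports correct second-stage standard errors, which the manual approach does not):

# 2SLS in one step: structural equation | all exogenous variables (instruments)
iv_model <- ivreg(q ~ p + ps + di | pf + ps + di, data = truffles)
summary(iv_model)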

Dynamic models using built-in data

library(dynlm)
library(AER)

data("USMacroG")

plot(USMacroG[, c("dpi", "consumption")], 
     lty = c(3, 1), #line type
     plot.type = "single", 
     ylab = "")
legend("topleft",
       legend = c("income", "consumption"),
       lty = c(3, 1), bty = "n")

## Note: plot() and legend() must be run together.

Distributed Lag Model

The first distributed lag model is:
\(consumption_t = \beta_0 + \beta_1 \cdot dpi_t + \beta_2 \cdot dpi_{t-1} + u_t\)

The second distributed lag model is:
\(consumption_t = \beta_0 + \beta_1 \cdot dpi_t + \beta_2 \cdot dpi_{t-1} + \beta_3 \cdot dpi_{t-2} + \beta_4 \cdot dpi_{t-3} + u_t\)

cons_lm1 <- dynlm(consumption ~ dpi + L(dpi), data = USMacroG)
# L() creates lags in dynlm

cons_lm2 <- dynlm(consumption ~ dpi + L(dpi, 1:3), data = USMacroG)

summary(cons_lm1)
## 
## Time series regression with "ts" data:
## Start = 1950(2), End = 2000(4)
## 
## Call:
## dynlm(formula = consumption ~ dpi + L(dpi), data = USMacroG)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -190.02  -56.68    1.58   49.91  323.94 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -81.07959   14.50814  -5.589 7.43e-08 ***
## dpi           0.89117    0.20625   4.321 2.45e-05 ***
## L(dpi)        0.03091    0.20754   0.149    0.882    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 87.58 on 200 degrees of freedom
## Multiple R-squared:  0.9964, Adjusted R-squared:  0.9964 
## F-statistic: 2.785e+04 on 2 and 200 DF,  p-value: < 2.2e-16
summary(cons_lm2)
## 
## Time series regression with "ts" data:
## Start = 1950(4), End = 2000(4)
## 
## Call:
## dynlm(formula = consumption ~ dpi + L(dpi, 1:3), data = USMacroG)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -199.37  -58.09    1.60   52.32  323.20 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -85.73584   14.86920  -5.766 3.12e-08 ***
## dpi            0.89301    0.20925   4.268 3.07e-05 ***
## L(dpi, 1:3)1   0.13563    0.28941   0.469    0.640    
## L(dpi, 1:3)2  -0.03442    0.28937  -0.119    0.905    
## L(dpi, 1:3)3  -0.07246    0.21053  -0.344    0.731    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 87.88 on 196 degrees of freedom
## Multiple R-squared:  0.9964, Adjusted R-squared:  0.9963 
## F-statistic: 1.359e+04 on 4 and 196 DF,  p-value: < 2.2e-16

Autoregressive Model

The autoregressive model is:
\(consumption_t = \beta_0 + \beta_1 \cdot dpi_t + \beta_2 \cdot consumption_{t-1} + u_t\)

cons_lm3 <- dynlm(consumption ~ dpi + L(consumption), data = USMacroG)
summary(cons_lm3)
## 
## Time series regression with "ts" data:
## Start = 1950(2), End = 2000(4)
## 
## Call:
## dynlm(formula = consumption ~ dpi + L(consumption), data = USMacroG)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -101.303   -9.674    1.141   12.691   45.322 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.535216   3.845170   0.139    0.889    
## dpi            -0.004064   0.016626  -0.244    0.807    
## L(consumption)  1.013111   0.018161  55.785   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.52 on 200 degrees of freedom
## Multiple R-squared:  0.9998, Adjusted R-squared:  0.9998 
## F-statistic: 4.627e+05 on 2 and 200 DF,  p-value: < 2.2e-16
  • The association must go both ways for there to be causality.

# Restricted model: consumption on its own lag (restricted sum of squares)
g1model <- dynlm(consumption ~ L(consumption), data = USMacroG)
anova(g1model)

# Unrestricted model: adds lagged dpi (unrestricted sum of squares)
g1model_un <- dynlm(consumption ~ L(consumption) + L(dpi), data = USMacroG)
anova(g1model_un)

Granger causality test

library(lmtest)
grangertest(consumption ~ dpi, data = USMacroG)
grangertest(dpi ~ consumption, data = USMacroG)

First, we manually calculate the Residual Sum of Squares (RSS) from the restricted and unrestricted models. We compare the raw RSS values to see whether they are different. However, to test whether the difference is statistically significant, we use Granger’s causality test.
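A minimal sketch of that manual comparison, assuming the g1model and g1model_un fits above (deviance() returns the residual sum of squares of an lm/dynlm fit):

rss_r <- deviance(g1model)     # restricted RSS
rss_u <- deviance(g1model_un)  # unrestricted RSS
# F statistic for the single restriction (coefficient on L(dpi) = 0)
F_stat <- (rss_r - rss_u) / (rss_u / df.residual(g1model_un))
F_stat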

Granger causality tests whether adding lagged disposable personal income (\(dpi_{t-1}\)) to the model significantly improves its predictive power.

  • The restricted model excludes lagged \(dpi\).
  • The unrestricted model includes lagged \(dpi\).

Does past \(dpi\) (i.e., \(dpi_{t-1}\)) help explain current consumption beyond what current \(dpi\) already explains?

If the test statistic is significant, we reject the null hypothesis that lagged \(dpi\) has no effect — implying that \(dpi\) Granger-causes consumption.

Econometrics: Vector autoregressive model (VAR)

Var_1 <- dynlm(consumption ~ L(consumption, 1:3) + L(dpi, 1:3), data = USMacroG)
summary(Var_1)
## 
## Time series regression with "ts" data:
## Start = 1950(4), End = 2000(4)
## 
## Call:
## dynlm(formula = consumption ~ L(consumption, 1:3) + L(dpi, 1:3), 
##     data = USMacroG)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -92.724 -10.287   0.822  12.092  44.413 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           0.69704    3.71059   0.188   0.8512    
## L(consumption, 1:3)1  1.12289    0.07753  14.484   <2e-16 ***
## L(consumption, 1:3)2  0.10399    0.11124   0.935   0.3510    
## L(consumption, 1:3)3 -0.20018    0.08322  -2.405   0.0171 *  
## L(dpi, 1:3)1          0.03491    0.05693   0.613   0.5404    
## L(dpi, 1:3)2         -0.04252    0.06992  -0.608   0.5438    
## L(dpi, 1:3)3         -0.01162    0.05711  -0.203   0.8390    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.51 on 194 degrees of freedom
## Multiple R-squared:  0.9998, Adjusted R-squared:  0.9998 
## F-statistic: 1.67e+05 on 6 and 194 DF,  p-value: < 2.2e-16
Var_2 <- dynlm(dpi ~ L(consumption, 1:3) + L(dpi, 1:3), data = USMacroG)
summary(Var_2)
## 
## Time series regression with "ts" data:
## Start = 1950(4), End = 2000(4)
## 
## Call:
## dynlm(formula = dpi ~ L(consumption, 1:3) + L(dpi, 1:3), data = USMacroG)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -121.462  -14.712   -2.078   13.750  115.550 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          10.85925    5.19356   2.091 0.037839 *  
## L(consumption, 1:3)1  0.43039    0.10851   3.966 0.000103 ***
## L(consumption, 1:3)2 -0.37263    0.15570  -2.393 0.017653 *  
## L(consumption, 1:3)3 -0.02034    0.11648  -0.175 0.861534    
## L(dpi, 1:3)1          0.79448    0.07968   9.971  < 2e-16 ***
## L(dpi, 1:3)2          0.24302    0.09786   2.483 0.013862 *  
## L(dpi, 1:3)3         -0.06858    0.07994  -0.858 0.391976    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.71 on 194 degrees of freedom
## Multiple R-squared:  0.9997, Adjusted R-squared:  0.9997 
## F-statistic: 9.973e+04 on 6 and 194 DF,  p-value: < 2.2e-16
coeftest(Var_1, vcov. = sandwich)
## 
## t test of coefficients:
## 
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           0.697044   2.890480  0.2412  0.80969    
## L(consumption, 1:3)1  1.122891   0.091202 12.3122  < 2e-16 ***
## L(consumption, 1:3)2  0.103990   0.111881  0.9295  0.35380    
## L(consumption, 1:3)3 -0.200184   0.089856 -2.2278  0.02704 *  
## L(dpi, 1:3)1          0.034914   0.056934  0.6132  0.54043    
## L(dpi, 1:3)2         -0.042518   0.066909 -0.6355  0.52587    
## L(dpi, 1:3)3         -0.011617   0.062566 -0.1857  0.85290    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(vars)

var_est <- VAR(as.matrix(cbind(USMacroG[,"consumption"], USMacroG[,"dpi"])), p= 3)
summary(var_est)
## 
## VAR Estimation Results:
## ========================= 
## Endogenous variables: USMacroG....consumption.., USMacroG....dpi.. 
## Deterministic variables: const 
## Sample size: 201 
## Log Likelihood: -1823.062 
## Roots of the characteristic polynomial:
## 1.013 0.9893 0.5221 0.4362 0.3434 0.1721
## Call:
## VAR(y = as.matrix(cbind(USMacroG[, "consumption"], USMacroG[, 
##     "dpi"])), p = 3)
## 
## 
## Estimation results for equation USMacroG....consumption..: 
## ========================================================== 
## USMacroG....consumption.. = USMacroG....consumption...l1 + USMacroG....dpi...l1 + USMacroG....consumption...l2 + USMacroG....dpi...l2 + USMacroG....consumption...l3 + USMacroG....dpi...l3 + const 
## 
##                              Estimate Std. Error t value Pr(>|t|)    
## USMacroG....consumption...l1  1.12289    0.07753  14.484   <2e-16 ***
## USMacroG....dpi...l1          0.03491    0.05693   0.613   0.5404    
## USMacroG....consumption...l2  0.10399    0.11124   0.935   0.3510    
## USMacroG....dpi...l2         -0.04252    0.06992  -0.608   0.5438    
## USMacroG....consumption...l3 -0.20018    0.08322  -2.405   0.0171 *  
## USMacroG....dpi...l3         -0.01162    0.05711  -0.203   0.8390    
## const                         0.69704    3.71059   0.188   0.8512    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 20.51 on 194 degrees of freedom
## Multiple R-Squared: 0.9998,  Adjusted R-squared: 0.9998 
## F-statistic: 1.67e+05 on 6 and 194 DF,  p-value: < 2.2e-16 
## 
## 
## Estimation results for equation USMacroG....dpi..: 
## ================================================== 
## USMacroG....dpi.. = USMacroG....consumption...l1 + USMacroG....dpi...l1 + USMacroG....consumption...l2 + USMacroG....dpi...l2 + USMacroG....consumption...l3 + USMacroG....dpi...l3 + const 
## 
##                              Estimate Std. Error t value Pr(>|t|)    
## USMacroG....consumption...l1  0.43039    0.10851   3.966 0.000103 ***
## USMacroG....dpi...l1          0.79448    0.07968   9.971  < 2e-16 ***
## USMacroG....consumption...l2 -0.37263    0.15570  -2.393 0.017653 *  
## USMacroG....dpi...l2          0.24302    0.09786   2.483 0.013862 *  
## USMacroG....consumption...l3 -0.02034    0.11648  -0.175 0.861534    
## USMacroG....dpi...l3         -0.06858    0.07994  -0.858 0.391976    
## const                        10.85925    5.19356   2.091 0.037839 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 28.71 on 194 degrees of freedom
## Multiple R-Squared: 0.9997,  Adjusted R-squared: 0.9997 
## F-statistic: 9.973e+04 on 6 and 194 DF,  p-value: < 2.2e-16 
## 
## 
## 
## Covariance matrix of residuals:
##                           USMacroG....consumption.. USMacroG....dpi..
## USMacroG....consumption..                     420.6             262.1
## USMacroG....dpi..                             262.1             824.0
## 
## Correlation matrix of residuals:
##                           USMacroG....consumption.. USMacroG....dpi..
## USMacroG....consumption..                    1.0000            0.4451
## USMacroG....dpi..                            0.4451            1.0000
(forecast <- predict(var_est))
## $USMacroG....consumption..
##           fcst    lower    upper        CI
##  [1,] 6405.643 6365.446 6445.839  40.19672
##  [2,] 6468.504 6407.383 6529.626  61.12183
##  [3,] 6535.172 6452.212 6618.131  82.95946
##  [4,] 6602.435 6500.487 6704.383 101.94817
##  [5,] 6671.212 6551.583 6790.842 119.62967
##  [6,] 6740.885 6604.990 6876.781 135.89515
##  [7,] 6811.610 6660.420 6962.800 151.18976
##  [8,] 6883.271 6717.595 7048.947 165.67580
##  [9,] 6955.902 6776.349 7135.454 179.55241
## [10,] 7029.487 6836.543 7222.432 192.94461
## 
## $USMacroG....dpi..
##           fcst    lower    upper        CI
##  [1,] 6688.442 6632.180 6744.704  56.26160
##  [2,] 6752.849 6674.420 6831.278  78.42899
##  [3,] 6814.348 6715.375 6913.320  98.97210
##  [4,] 6879.143 6762.435 6995.852 116.70877
##  [5,] 6943.980 6811.522 7076.438 132.45778
##  [6,] 7010.201 6863.404 7156.997 146.79648
##  [7,] 7077.115 6917.173 7237.057 159.94192
##  [8,] 7145.002 6972.805 7317.199 172.19694
##  [9,] 7213.728 7030.014 7397.441 183.71392
## [10,] 7283.356 7088.706 7478.005 194.64928
# Impulse response function (IRF)

imp <- irf(var_est)
plot(imp)

1st Incourse

See the Question and the script.

Question 01

#1a
library(readxl)
ChikenDemand <- read_excel("ChikenDemand.xlsx")
View(ChikenDemand)

model<- lm(log(Y)~log(X2)+log(X3)+log(X6), data = ChikenDemand)
summary(model)

#1b
# I prefer model B and the Ramsey RESET test, because there is an omitted variable and B is nested in A.

#1c
model2<- lm(log(Y)~log(X2)+log(X3), data = ChikenDemand)
u<- resid(model2)

model_u<- lm(u~log(X2)+log(X3)+log(X6), data = ChikenDemand)
summary(model_u)

# n = 23
# R2 = 0.011
# n * R2 = 0.253

qchisq((1-.05), df=1)

#1d
model3<- lm(log(Y)~log(X4)+log(X5), data = ChikenDemand)

encomptest(model2,model3)

#1e
jtest(model2,model3)

Question 02

library(dplyr)
library(tidyverse)
library(ggplot2)

#2a
AveTemp <- read_excel("G:/My Drive/Turin/4th Year/AST 431/1st_inc/1st_inc_solve/AveTemp.xlsx")

AveTemp <- AveTemp %>% 
 mutate(date= make_date(Year,Mo))
#head(AveTemp)

p<- AveTemp %>% 
  filter(Station== "Dhaka") 

plot(ts(data = p$AveTemp, frequency = 12, start = c(2008, 1)), main = "Monthly Average Temperature in Dhaka (2008–2018)",
     ylab = "Temperature (°C)", xlab = "Year")

#2b
q<- AveTemp %>% 
  filter(Station %in% c("Dhaka", "Sylhet")) 
#head(q)

ggplot(q, aes(x = date, y = AveTemp, color = Station)) +
  geom_line() +
  labs(title = "Monthly Average Temperature (Dhaka vs. Sylhet)",
       x = "Date", y = "Temperature (°C)") +
  theme_minimal()

#2c
ggplot(p, aes(x = date, y = AveTemp)) +
  geom_line(color = "blue") +
  geom_smooth(method = "loess", se = FALSE, color = "red") +
  labs(title = "Dhaka Monthly Temperature with Trend Line",
       x = "Date", y = "Temperature (°C)") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

#2d
new<-AveTemp %>% filter(Station == "Sylhet") 
#head(new)
t<-ts(new$AveTemp, frequency = 12, start = c(2008, 1))
acf(t, main = "ACF of Sylhet Monthly Temperature")

Panel Data

The main lecture is here.

Operations for wages data…

#install.packages("plm")
library(plm)

Load data

mydata<- read.csv("G:/My Drive/Turin/4th Year/AST 431/2nd_inc/Class/panel_wage.csv")
head(mydata)

Set data as panel data

pdata <- pdata.frame(mydata, index=c("id","t"))
head(pdata)

Descriptive statistics

summary(pdata)
##       exp             wks             occ              ind        
##  Min.   : 1.00   Min.   : 5.00   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:11.00   1st Qu.:46.00   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :18.00   Median :48.00   Median :1.0000   Median :0.0000  
##  Mean   :19.85   Mean   :46.81   Mean   :0.5112   Mean   :0.3954  
##  3rd Qu.:29.00   3rd Qu.:50.00   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :51.00   Max.   :52.00   Max.   :1.0000   Max.   :1.0000  
##                                                                   
##      south             smsa              ms              fem        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :1.0000   Median :1.0000   Median :0.0000  
##  Mean   :0.2903   Mean   :0.6538   Mean   :0.8144   Mean   :0.1126  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##                                                                     
##      union             ed             blk              lwage      
##  Min.   :0.000   Min.   : 4.00   Min.   :0.00000   Min.   :4.605  
##  1st Qu.:0.000   1st Qu.:12.00   1st Qu.:0.00000   1st Qu.:6.395  
##  Median :0.000   Median :12.00   Median :0.00000   Median :6.685  
##  Mean   :0.364   Mean   :12.85   Mean   :0.07227   Mean   :6.676  
##  3rd Qu.:1.000   3rd Qu.:16.00   3rd Qu.:0.00000   3rd Qu.:6.953  
##  Max.   :1.000   Max.   :17.00   Max.   :1.00000   Max.   :8.537  
##                                                                   
##        id       t           tdum1            tdum2            tdum3       
##  1      :   7   1:595   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  2      :   7   2:595   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  3      :   7   3:595   Median :0.0000   Median :0.0000   Median :0.0000  
##  4      :   7   4:595   Mean   :0.1429   Mean   :0.1429   Mean   :0.1429  
##  5      :   7   5:595   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  6      :   7   6:595   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  (Other):4123   7:595                                                     
##      tdum4            tdum5            tdum6            tdum7       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.1429   Mean   :0.1429   Mean   :0.1429   Mean   :0.1429  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##                                                                     
##       exp2       
##  Min.   :   1.0  
##  1st Qu.: 121.0  
##  Median : 324.0  
##  Mean   : 514.4  
##  3rd Qu.: 841.0  
##  Max.   :2601.0  
## 

Pooled OLS estimator

OLS without considering panel effects

pooling <- plm(lwage ~ exp + exp2 + wks + ed, data=pdata, model= "pooling")
summary(pooling)
## Pooling Model
## 
## Call:
## plm(formula = lwage ~ exp + exp2 + wks + ed, data = pdata, model = "pooling")
## 
## Balanced Panel: n = 595, T = 7, N = 4165
## 
## Residuals:
##        Min.     1st Qu.      Median     3rd Qu.        Max. 
## -2.16057670 -0.25034526  0.00027256  0.26792139  2.12969386 
## 
## Coefficients:
##                Estimate  Std. Error  t-value  Pr(>|t|)    
## (Intercept)  4.9080e+00  6.7330e-02  72.8945 < 2.2e-16 ***
## exp          4.4675e-02  2.3929e-03  18.6701 < 2.2e-16 ***
## exp2        -7.1563e-04  5.2794e-05 -13.5552 < 2.2e-16 ***
## wks          5.8270e-03  1.1826e-03   4.9271 8.673e-07 ***
## ed           7.6041e-02  2.2266e-03  34.1511 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    886.9
## Residual Sum of Squares: 635.41
## R-Squared:      0.28356
## Adj. R-Squared: 0.28287
## F-statistic: 411.624 on 4 and 4160 DF, p-value: < 2.22e-16

Between estimator

Captures variation between individuals
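In symbols, the between estimator runs OLS on the time-averaged data for each individual:

\[ \bar{y}_i = \alpha + \bar{x}_i'\beta + \bar{u}_i \]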

between <- plm(lwage ~ exp + exp2 + wks + ed, data=pdata, model= "between")
summary(between)
## Oneway (individual) effect Between Model
## 
## Call:
## plm(formula = lwage ~ exp + exp2 + wks + ed, data = pdata, model = "between")
## 
## Balanced Panel: n = 595, T = 7, N = 4165
## Observations used in estimation: 595
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -0.978153 -0.220264  0.036574  0.250118  0.985629 
## 
## Coefficients:
##                Estimate  Std. Error t-value  Pr(>|t|)    
## (Intercept)  4.68303917  0.21009890 22.2897 < 2.2e-16 ***
## exp          0.03815295  0.00569666  6.6974 4.953e-11 ***
## exp2        -0.00063127  0.00012568 -5.0228 6.757e-07 ***
## wks          0.01309028  0.00406592  3.2195  0.001355 ** 
## ed           0.07378378  0.00489848 15.0626 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    92.322
## Residual Sum of Squares: 62.187
## R-Squared:      0.32641
## Adj. R-Squared: 0.32185
## F-statistic: 71.4768 on 4 and 590 DF, p-value: < 2.22e-16

First differences estimator

Accounts for individual fixed effects by differencing
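Differencing removes the time-invariant individual effect \(\alpha_i\) from \( y_{it} = \alpha_i + x_{it}'\beta + u_{it} \):

\[ \Delta y_{it} = \Delta x_{it}'\beta + \Delta u_{it} \]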

firstdiff <- plm(lwage ~ exp + exp2 + wks + ed, data=pdata, model= "fd")
summary(firstdiff)
## Oneway (individual) effect First-Difference Model
## 
## Call:
## plm(formula = lwage ~ exp + exp2 + wks + ed, data = pdata, model = "fd")
## 
## Balanced Panel: n = 595, T = 7, N = 4165
## Observations used in estimation: 3570
## 
## Residuals:
##       Min.    1st Qu.     Median    3rd Qu.       Max. 
## -2.1131555 -0.0654718 -0.0095751  0.0483881  2.3295637 
## 
## Coefficients:
##                Estimate  Std. Error t-value  Pr(>|t|)    
## (Intercept)  0.11706540  0.00631057 18.5507 < 2.2e-16 ***
## exp2        -0.00053212  0.00013927 -3.8207 0.0001354 ***
## wks         -0.00026826  0.00056483 -0.4749 0.6348525    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    118.06
## Residual Sum of Squares: 117.58
## R-Squared:      0.004108
## Adj. R-Squared: 0.0035496
## F-statistic: 7.35691 on 2 and 3567 DF, p-value: 0.0006479

Note that exp and ed drop out of the first-difference model: ed is time-invariant (its difference is zero) and the difference of exp is a constant, absorbed by the intercept.

Fixed effects (within estimator)

Removes time-invariant characteristics
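The within transformation subtracts individual means, which eliminates \(\alpha_i\):

\[ y_{it} - \bar{y}_i = (x_{it} - \bar{x}_i)'\beta + (u_{it} - \bar{u}_i) \]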

fixed <- plm(lwage ~ exp + exp2 + wks + ed, data=pdata, model= "within")
summary(fixed)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = lwage ~ exp + exp2 + wks + ed, data = pdata, model = "within")
## 
## Balanced Panel: n = 595, T = 7, N = 4165
## 
## Residuals:
##       Min.    1st Qu.     Median    3rd Qu.       Max. 
## -1.8120879 -0.0511128  0.0037112  0.0614250  1.9434065 
## 
## Coefficients:
##         Estimate  Std. Error t-value  Pr(>|t|)    
## exp   1.1379e-01  2.4689e-03 46.0888 < 2.2e-16 ***
## exp2 -4.2437e-04  5.4632e-05 -7.7678 1.036e-14 ***
## wks   8.3588e-04  5.9967e-04  1.3939    0.1634    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    240.65
## Residual Sum of Squares: 82.632
## R-Squared:      0.65663
## Adj. R-Squared: 0.59916
## F-statistic: 2273.74 on 3 and 3567 DF, p-value: < 2.22e-16

Random effects estimator

Assumes unobserved heterogeneity is uncorrelated with regressors
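That is, in \( y_{it} = x_{it}'\beta + \alpha_i + u_{it} \) the individual effect \(\alpha_i\) is treated as random with \( \text{Cov}(\alpha_i, x_{it}) = 0 \).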

random <- plm(lwage ~ exp + exp2 + wks + ed, data=pdata, model= "random")
summary(random)
## Oneway (individual) effect Random Effect Model 
##    (Swamy-Arora's transformation)
## 
## Call:
## plm(formula = lwage ~ exp + exp2 + wks + ed, data = pdata, model = "random")
## 
## Balanced Panel: n = 595, T = 7, N = 4165
## 
## Effects:
##                   var std.dev share
## idiosyncratic 0.02317 0.15220 0.185
## individual    0.10209 0.31952 0.815
## theta: 0.8228
## 
## Residuals:
##       Min.    1st Qu.     Median    3rd Qu.       Max. 
## -2.0439676 -0.1057048  0.0070992  0.1147499  2.0875839 
## 
## Coefficients:
##                Estimate  Std. Error  z-value Pr(>|z|)    
## (Intercept)  3.8294e+00  9.3634e-02  40.8974   <2e-16 ***
## exp          8.8861e-02  2.8178e-03  31.5360   <2e-16 ***
## exp2        -7.7257e-04  6.2262e-05 -12.4083   <2e-16 ***
## wks          9.6577e-04  7.4329e-04   1.2993   0.1938    
## ed           1.1171e-01  6.0572e-03  18.4426   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    260.94
## Residual Sum of Squares: 151.35
## R-Squared:      0.42
## Adj. R-Squared: 0.41945
## Chisq: 3012.45 on 4 DF, p-value: < 2.22e-16

LM test for random effects versus OLS

plmtest(pooling)
## 
##  Lagrange Multiplier Test - (Honda)
## 
## data:  lwage ~ exp + exp2 + wks + ed
## normal = 72.056, p-value < 2.2e-16
## alternative hypothesis: significant effects

F test for fixed effects versus OLS

pFtest(fixed, pooling)
## 
##  F test for individual effects
## 
## data:  lwage ~ exp + exp2 + wks + ed
## F = 40.239, df1 = 593, df2 = 3567, p-value < 2.2e-16
## alternative hypothesis: significant effects

Hausman test for fixed versus random effects model

phtest(random, fixed)
## 
##  Hausman Test
## 
## data:  lwage ~ exp + exp2 + wks + ed
## chisq = 6191.4, df = 3, p-value < 2.2e-16
## alternative hypothesis: one model is inconsistent

Analysis with “Panel101.dta” Data

The dataset contains variables such as y (response variable), year (time variable) and country (panel identifier). We first load and inspect the dataset.

library(foreign)
library(plm)
Panel <- read.dta("G:/My Drive/Turin/4th Year/AST 431/2nd_inc/Class/Panel101.dta")
head(Panel)

Pooled OLS estimator

ols<-lm(y~x1, data = Panel)
summary(ols)
## 
## Call:
## lm(formula = y ~ x1, data = Panel)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -9.546e+09 -1.578e+09  1.554e+08  1.422e+09  7.183e+09 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 1.524e+09  6.211e+08   2.454   0.0167 *
## x1          4.950e+08  7.789e+08   0.636   0.5272  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.028e+09 on 68 degrees of freedom
## Multiple R-squared:  0.005905,   Adjusted R-squared:  -0.008714 
## F-statistic: 0.4039 on 1 and 68 DF,  p-value: 0.5272

Regular OLS regression does not consider heterogeneity across groups or time

yhat<-ols$fitted.values
plot(Panel$x1, Panel$y, pch=19, xlab="x1", ylab="y")
abline(lm(Panel$y~Panel$x1),lwd=3, col="red")

Adjusting for countries (least square dummy variables, LSDV)

lsdv <- lm(y ~ x1 + factor(country), data = Panel)
# strictly, the intercept should be removed (y ~ 0 + x1 + factor(country)) for pure LSDV
summary(lsdv)
## 
## Call:
## lm(formula = y ~ x1 + factor(country), data = Panel)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -8.634e+09 -9.697e+08  5.405e+08  1.386e+09  5.612e+09 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)  
## (Intercept)       8.805e+08  9.618e+08   0.916   0.3635  
## x1                2.476e+09  1.107e+09   2.237   0.0289 *
## factor(country)B -1.938e+09  1.265e+09  -1.533   0.1304  
## factor(country)C -2.603e+09  1.596e+09  -1.631   0.1080  
## factor(country)D  2.282e+09  1.261e+09   1.810   0.0752 .
## factor(country)E -1.483e+09  1.268e+09  -1.169   0.2467  
## factor(country)F  1.130e+09  1.289e+09   0.877   0.3839  
## factor(country)G -1.865e+09  1.497e+09  -1.246   0.2175  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.796e+09 on 62 degrees of freedom
## Multiple R-squared:  0.2276, Adjusted R-squared:  0.1404 
## F-statistic:  2.61 on 7 and 62 DF,  p-value: 0.01991
# x1 is now significant at the 5% level

Fixed effects (within estimator)

Fixed effects: n entity-specific intercepts (using plm)

fixed<-plm(y~x1, data = Panel, index = c("country", "year"), model = "within")
summary(fixed)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = y ~ x1, data = Panel, model = "within", index = c("country", 
##     "year"))
## 
## Balanced Panel: n = 7, T = 10, N = 70
## 
## Residuals:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -8.63e+09 -9.70e+08  5.40e+08  0.00e+00  1.39e+09  5.61e+09 
## 
## Coefficients:
##      Estimate Std. Error t-value Pr(>|t|)  
## x1 2475617827 1106675594   2.237  0.02889 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    5.2364e+20
## Residual Sum of Squares: 4.8454e+20
## R-Squared:      0.074684
## Adj. R-Squared: -0.029788
## F-statistic: 5.00411 on 1 and 62 DF, p-value: 0.028892

In the output, the coefficient on x1 indicates how much y changes over time, on average per country, when x1 increases by one unit. The two-tail p-value Pr(>|t|) = 0.02889 tests the hypothesis that the coefficient differs from 0; to reject this, the p-value has to be lower than 0.05 (an alpha of 0.10 could also be chosen), in which case the variable has a significant influence on the dependent variable (y). The overall p-value (0.028892) comes from the F test of whether all the coefficients in the model are jointly different from zero; if it is below 0.05, the model is acceptable.

Display the fixed effects (constants for each country)

fixef(fixed)
##           A           B           C           D           E           F 
##   880542404 -1057858363 -1722810755  3162826897  -602622000  2010731793 
##           G 
##  -984717493

F test for fixed effects versus OLS; the null is that OLS is better than fixed effects

pFtest(fixed, ols)
## 
##  F test for individual effects
## 
## data:  y ~ x1
## F = 2.9655, df1 = 6, df2 = 62, p-value = 0.01307
## alternative hypothesis: significant effects

In the output, p-value = 0.01307 < 0.05, so the fixed effects model is the better choice.

Random effects estimator

random <- plm(y ~ x1, data=Panel, index=c("country", "year"), model="random")
summary(random)
## Oneway (individual) effect Random Effect Model 
##    (Swamy-Arora's transformation)
## 
## Call:
## plm(formula = y ~ x1, data = Panel, model = "random", index = c("country", 
##     "year"))
## 
## Balanced Panel: n = 7, T = 10, N = 70
## 
## Effects:
##                     var   std.dev share
## idiosyncratic 7.815e+18 2.796e+09 0.873
## individual    1.133e+18 1.065e+09 0.127
## theta: 0.3611
## 
## Residuals:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -8.94e+09 -1.51e+09  2.82e+08  0.00e+00  1.56e+09  6.63e+09 
## 
## Coefficients:
##               Estimate Std. Error z-value Pr(>|z|)
## (Intercept) 1037014284  790626206  1.3116   0.1896
## x1          1247001782  902145601  1.3823   0.1669
## 
## Total Sum of Squares:    5.6595e+20
## Residual Sum of Squares: 5.5048e+20
## R-Squared:      0.02733
## Adj. R-Squared: 0.013026
## Chisq: 1.91065 on 1 DF, p-value: 0.16689

In the output, Balanced Panel: n = 7 is the number of groups/panels, T = 10 the number of years and N = 70 the total number of observations. Interpretation of the coefficients is tricky, since they include both the within-entity and between-entity effects; in the case of TSCS data, the coefficient represents the average effect of x1 on y when x1 changes across time and between countries by one unit. The two-tail p-value Pr(>|z|) = 0.1669 tests the hypothesis that the coefficient differs from 0; since it is above 0.05, x1 has no significant influence on y here. The overall p-value (0.16689) tests whether all the coefficients in the model are jointly different from zero; a value below 0.05 would indicate an acceptable model.

Fixed or Random: Hausman test

To decide between fixed and random effects, run a Hausman test, where the null hypothesis is that the preferred model is random effects and the alternative is fixed effects. It essentially tests whether the unique errors (\(u_i\)) are correlated with the regressors; the null hypothesis is that they are not.

phtest(fixed, random)
## 
##  Hausman Test
## 
## data:  y ~ x1
## chisq = 3.674, df = 1, p-value = 0.05527
## alternative hypothesis: one model is inconsistent

Here p = 0.05527 is just above 0.05, so we (narrowly) fail to reject the null and prefer the random effects model at the 5% level.

LM test for random effects versus OLS

  • The LM test helps decide between a random effects regression and a simple OLS regression. The null hypothesis in the LM test is that the variance across entities is zero, i.e., there is no significant difference across units (no panel effect).

  • Breusch-Pagan [bp] Lagrange Multiplier for random effects. Null is no panel effect (i.e. OLS better).

pool <- plm(y ~ x1, data=Panel, index=c("country", "year"), model="pooling")
plmtest(pool,  type=c("bp"))
## 
##  Lagrange Multiplier Test - (Breusch-Pagan)
## 
## data:  y ~ x1
## chisq = 2.6692, df = 1, p-value = 0.1023
## alternative hypothesis: significant effects

Conclusion: here we fail to reject the null and conclude that random effects are not needed. That is, there is no evidence of significant differences across countries, so a simple OLS regression can be run.

Fixed effects estimator (refitted for the time-effects tests below)

fixed <- plm(y ~ x1, data=Panel, index=c("country", "year"), 
               model="within")
summary(fixed)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = y ~ x1, data = Panel, model = "within", index = c("country", 
##     "year"))
## 
## Balanced Panel: n = 7, T = 10, N = 70
## 
## Residuals:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -8.63e+09 -9.70e+08  5.40e+08  0.00e+00  1.39e+09  5.61e+09 
## 
## Coefficients:
##      Estimate Std. Error t-value Pr(>|t|)  
## x1 2475617827 1106675594   2.237  0.02889 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    5.2364e+20
## Residual Sum of Squares: 4.8454e+20
## R-Squared:      0.074684
## Adj. R-Squared: -0.029788
## F-statistic: 5.00411 on 1 and 62 DF, p-value: 0.028892

Testing for time-fixed effects

fixed.time <- plm(y ~ x1 + factor(year), data=Panel, index=c("country", 
              "year"), model="within")
summary(fixed.time)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = y ~ x1 + factor(year), data = Panel, model = "within", 
##     index = c("country", "year"))
## 
## Balanced Panel: n = 7, T = 10, N = 70
## 
## Residuals:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -7.92e+09 -1.05e+09 -1.40e+08  0.00e+00  1.63e+09  5.49e+09 
## 
## Coefficients:
##                    Estimate Std. Error t-value Pr(>|t|)  
## x1               1389050354 1319849567  1.0524  0.29738  
## factor(year)1991  296381559 1503368528  0.1971  0.84447  
## factor(year)1992  145369666 1547226548  0.0940  0.92550  
## factor(year)1993 2874386795 1503862554  1.9113  0.06138 .
## factor(year)1994 2848156288 1661498927  1.7142  0.09233 .
## factor(year)1995  973941306 1567245748  0.6214  0.53698  
## factor(year)1996 1672812557 1631539254  1.0253  0.30988  
## factor(year)1997 2991770063 1627062032  1.8388  0.07156 .
## factor(year)1998  367463593 1587924445  0.2314  0.81789  
## factor(year)1999 1258751933 1512397632  0.8323  0.40898  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    5.2364e+20
## Residual Sum of Squares: 4.0201e+20
## R-Squared:      0.23229
## Adj. R-Squared: 0.00052851
## F-statistic: 1.60365 on 10 and 53 DF, p-value: 0.13113

Testing for time-fixed effects: the null is that no time-fixed effects are needed.

pFtest(fixed.time, fixed)
## 
##  F test for individual effects
## 
## data:  y ~ x1 + factor(year)
## F = 1.209, df1 = 9, df2 = 53, p-value = 0.3094
## alternative hypothesis: significant effects
plmtest(fixed, c("time"), type=("bp"))
## 
##  Lagrange Multiplier Test - time effects (Breusch-Pagan)
## 
## data:  y ~ x1
## chisq = 0.16532, df = 1, p-value = 0.6843
## alternative hypothesis: significant effects

Both p-values are above 0.05, so we fail to reject the null: no time-fixed effects are needed.

Panel Data Visualization

Introduction

Panel data consists of observations collected over multiple time periods for the same entities (e.g., countries, firms, individuals). This document demonstrates different visualization techniques to analyze trends, heterogeneity and relationships in panel data.

Loading the Packages

library(foreign) # read.dta (Stata files)
library(car)     # scatterplot
library(gplots)  # plotmeans
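The coplot interpreted below is not shown in the notes; a sketch of the likely command (coplot() is in base graphics, and Panel is the data frame loaded earlier):

coplot(y ~ year | country, type = "l", data = Panel) # one time-series panel per country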

Interpretation:

  • Such patterns are useful in econometric modeling to determine whether fixed effects or random effects models are appropriate for capturing country-specific influences.

  • Observing common trends across countries can suggest shared economic, social, or policy-related influences.

  • The x-axis corresponds to time (year), and the y-axis represents y.

  • Each subplot represents a different country.
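The smoothed scatterplot interpreted next is likewise missing its command; a sketch using scatterplot() from car (assuming the same Panel data):

scatterplot(y ~ year | country, boxplots = FALSE, smooth = TRUE, data = Panel)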

Interpretation:

  • The individual points represent data observations for each country.

  • The smooth curves indicate the underlying trends of y over time.

  • If trends for different countries follow similar patterns, it may indicate shared influences across panels.

  • Large fluctuations in trend lines suggest higher volatility or structural shifts in specific countries.

  • Identifying these trends is crucial in dynamic panel data models (e.g., Arellano-Bond estimators) that capture lagged effects and autocorrelation over time.

Mean Values of y Across Countries

This plot compares the mean values of y for each country, providing insights into cross-sectional differences.

plotmeans(y ~ country, main = "Heterogeneity across countries", data = Panel)

plotmeans(y~year, main="Heterogeneity across years", data = Panel)

Interpretation:

  • The presence of heterogeneity supports the use of fixed effects models, which account for unobserved, time-invariant country-specific characteristics.

  • Large variations suggest heterogeneity.

  • Countries with non-overlapping confidence intervals have statistically significant differences in y.

  • The vertical error bars indicate the confidence intervals, showing variability within each country.

  • The dots represent the mean values of y for each country.

Conclusion

  • Coplot helps analyze time-series patterns in each country and supports the selection of fixed vs. random effects models.

  • Scatterplots with smooth trends reveal the general direction and fluctuations of y, guiding the choice of dynamic panel models.

  • Mean plots with confidence intervals highlight heterogeneity across countries, reinforcing the use of fixed effects estimators to control for unobserved heterogeneity.

Regression lines for all countries

# Ensure 'country' is a factor
Panel$country <- as.factor(Panel$country)

# Compute regression lines for each country
regression_lines <- Panel %>%
  group_by(country) %>%
  summarize(intercept = coef(lm(y ~ x1, data = pick(everything())))[1],
            slope = coef(lm(y ~ x1, data = pick(everything())))[2]) %>%
  expand_grid(x1 = seq(min(Panel$x1), max(Panel$x1), length.out = 100)) %>%
  mutate(y_pred = intercept + slope * x1)
# Plot scatter points and regression lines
ggplot(data = Panel,
       aes(x = x1, y = y, color = country))+
  geom_point(alpha = 0.6) +  # Scatter points
  geom_line(data = regression_lines,
            aes(x = x1, y = y_pred, color = country),
            size = 1) +
  labs(title = "Regression Lines for Each Country",
       x = "X1",
       y = "Y") +
  theme_minimal()

Another Example

Data

indiv <- rep(1:3, each = 3)
year <- rep(2000:2002, times = 3)
y <- c(6, 4.6, 9.4, 9.1, 8.3, 0.6, 9.1, 4.8, 9.1)
x1 <- c(7.8, 0.6, 2.1, 1.3, 0.9, 9.8, 0.2, 5.9, 5.2)
x2 <- c(5.8, 7.9, 5.4, 6.7, 6.6, 0.4, 2.6, 3.2, 6.9)
dat <- data.frame(country = factor(indiv), year, y, x1, x2)
dat

Plot

plot(x1, y, col = rep(c("red", "blue", "green"), each = 3))
mod1 <- lm(y ~ x1, data = dat)
abline(mod1)
summary(mod1)
## 
## Call:
## lm(formula = y ~ x1, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9975 -0.7411  0.2719  1.5545  3.1552 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9435     1.2010   7.447 0.000144 ***
## x1           -0.5767     0.2396  -2.406 0.047026 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.386 on 7 degrees of freedom
## Multiple R-squared:  0.4527, Adjusted R-squared:  0.3745 
## F-statistic:  5.79 on 1 and 7 DF,  p-value: 0.04703
mod2 <- lm(y ~ x1 + country, data = dat) # LSDV
summary(mod2)
## 
## Call:
## lm(formula = y ~ x1 + country, data = dat)
## 
## Residuals:
##       1       2       3       4       5       6       7       8       9 
##  1.8026 -3.7320  1.9294  1.5495  0.5198 -2.0694 -0.6148 -1.6416  2.2564 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   8.6765     1.8152   4.780  0.00497 **
## x1           -0.5742     0.2701  -2.126  0.08685 . 
## country2     -0.3795     2.1956  -0.173  0.86954   
## country3      1.1531     2.1926   0.526  0.62143   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.684 on 5 degrees of freedom
## Multiple R-squared:  0.5052, Adjusted R-squared:  0.2084 
## F-statistic: 1.702 on 3 and 5 DF,  p-value: 0.2812
abline(a = mod2$coefficients[1], b = mod2$coefficients[2], col = "red")   # country 1
abline(a = mod2$coefficients[1] + mod2$coefficients[3], b = mod2$coefficients[2], col = "blue")  # country 2
abline(a = mod2$coefficients[1] + mod2$coefficients[4], b = mod2$coefficients[2], col = "green") # country 3

Previous Incourse (Panel Data)

2022(pic1)

library(haven)
data <- read_dta("G:/My Drive/Turin/4th Year/AST 431/2nd_inc/2022/data/Panel_data.dta")

#a) Fit a suitable model to find the relationship between y and x1.
ols<-lm(y~x1, data = data)
summary(ols)
## 
## Call:
## lm(formula = y ~ x1, data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -9.546e+09 -1.578e+09  1.554e+08  1.422e+09  7.183e+09 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 1.524e+09  6.211e+08   2.454   0.0167 *
## x1          4.950e+08  7.789e+08   0.636   0.5272  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.028e+09 on 68 degrees of freedom
## Multiple R-squared:  0.005905,   Adjusted R-squared:  -0.008714 
## F-statistic: 0.4039 on 1 and 68 DF,  p-value: 0.5272
#(b) Fit a fixed effect model considering country as a dummy variable. Compare the results with the results obtained in (a). Produce a graph showing the regression line for each country in the same graph. Interpret your results and graph.

lsdv<-lm(y~x1+factor(country), data = data)
summary(lsdv)
## 
## Call:
## lm(formula = y ~ x1 + factor(country), data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -8.634e+09 -9.697e+08  5.405e+08  1.386e+09  5.612e+09 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)  
## (Intercept)       8.805e+08  9.618e+08   0.916   0.3635  
## x1                2.476e+09  1.107e+09   2.237   0.0289 *
## factor(country)2 -1.938e+09  1.265e+09  -1.533   0.1304  
## factor(country)3 -2.603e+09  1.596e+09  -1.631   0.1080  
## factor(country)4  2.282e+09  1.261e+09   1.810   0.0752 .
## factor(country)5 -1.483e+09  1.268e+09  -1.169   0.2467  
## factor(country)6  1.130e+09  1.289e+09   0.877   0.3839  
## factor(country)7 -1.865e+09  1.497e+09  -1.246   0.2175  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.796e+09 on 62 degrees of freedom
## Multiple R-squared:  0.2276, Adjusted R-squared:  0.1404 
## F-statistic:  2.61 on 7 and 62 DF,  p-value: 0.01991
plot(data$y~data$x1, pch=data$country)
abline(ols, col=1)

library(car)
scatterplot(fitted(lsdv)~x1|country, boxplot= FALSE ,smooth=TRUE, data = data)

#(c) Fit a random effects model considering country as a panel variable and Interpret your results.

library(plm)
random<-plm(y~x1, data = data, index = c("country", "year"), model = "random")
summary(random)
## Oneway (individual) effect Random Effect Model 
##    (Swamy-Arora's transformation)
## 
## Call:
## plm(formula = y ~ x1, data = data, model = "random", index = c("country", 
##     "year"))
## 
## Balanced Panel: n = 7, T = 10, N = 70
## 
## Effects:
##                     var   std.dev share
## idiosyncratic 7.815e+18 2.796e+09 0.873
## individual    1.133e+18 1.065e+09 0.127
## theta: 0.3611
## 
## Residuals:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -8.94e+09 -1.51e+09  2.82e+08  0.00e+00  1.56e+09  6.63e+09 
## 
## Coefficients:
##               Estimate Std. Error z-value Pr(>|z|)
## (Intercept) 1037014284  790626206  1.3116   0.1896
## x1          1247001782  902145601  1.3823   0.1669
## 
## Total Sum of Squares:    5.6595e+20
## Residual Sum of Squares: 5.5048e+20
## R-Squared:      0.02733
## Adj. R-Squared: 0.013026
## Chisq: 1.91065 on 1 DF, p-value: 0.16689
#(d) Carry out the Hausman test for the random effects against the fixed effects and write your conclusion.

fixed<-plm(y~x1, data=data, index = c("country", "year"), model = "within")
summary(fixed)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = y ~ x1, data = data, model = "within", index = c("country", 
##     "year"))
## 
## Balanced Panel: n = 7, T = 10, N = 70
## 
## Residuals:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -8.63e+09 -9.70e+08  5.40e+08  0.00e+00  1.39e+09  5.61e+09 
## 
## Coefficients:
##      Estimate Std. Error t-value Pr(>|t|)  
## x1 2475617827 1106675594   2.237  0.02889 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    5.2364e+20
## Residual Sum of Squares: 4.8454e+20
## R-Squared:      0.074684
## Adj. R-Squared: -0.029788
## F-statistic: 5.00411 on 1 and 62 DF, p-value: 0.028892
phtest(random, fixed)
## 
##  Hausman Test
## 
## data:  y ~ x1
## chisq = 3.674, df = 1, p-value = 0.05527
## alternative hypothesis: one model is inconsistent
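
Conclusion: the p-value (0.055) is just above 0.05, so at the 5% level we fail to reject the null hypothesis that both estimators are consistent; the random effects model is (marginally) preferred here.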

2022 (pic2)

library(readr)
dat <- read.table("G:/My Drive/Turin/4th Year/AST 431/2nd_inc/2022/data/Eggs.txt")
#View(dat)

#(a)Estimate the model for the years 1990 and 1991 separately.

m1<-lm(Y~X, data = dat[dat$t==1990,])
summary(m1)
## 
## Call:
## lm(formula = Y ~ X, data = dat[dat$t == 1990, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1903.6  -949.7  -515.6   294.2  5779.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3118.48     873.08   3.572 0.000819 ***
## X             -22.50      10.77  -2.089 0.041989 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1607 on 48 degrees of freedom
## Multiple R-squared:  0.08337,    Adjusted R-squared:  0.06428 
## F-statistic: 4.366 on 1 and 48 DF,  p-value: 0.04199
m2<-lm(Y~X, data = dat[dat$t==1991,])
summary(m2)
## 
## Call:
## lm(formula = Y ~ X, data = dat[dat$t == 1991, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1983.6  -944.8  -494.0   333.9  5658.2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3149.36     810.91   3.884 0.000314 ***
## X             -23.35      10.27  -2.274 0.027461 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1594 on 48 degrees of freedom
## Multiple R-squared:  0.09727,    Adjusted R-squared:  0.07847 
## F-statistic: 5.172 on 1 and 48 DF,  p-value: 0.02746
#(b) Pool the observations for the two years and estimate the pooled regression. What assumptions are you making in pooling the data?
#normal OLS

m<-lm(Y~X, data = dat)
summary(m)
## 
## Call:
## lm(formula = Y ~ X, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1988.3  -960.6  -499.0   321.5  5791.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3132.258    587.868   5.328 6.34e-07 ***
## X            -22.895      7.345  -3.117   0.0024 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1584 on 98 degrees of freedom
## Multiple R-squared:  0.09021,    Adjusted R-squared:  0.08093 
## F-statistic: 9.718 on 1 and 98 DF,  p-value: 0.002395
#(c) Use the fixed effects model, distinguishing the two years, and present the regression results.

library(plm)
dat$t<-as.factor(dat$t)
fixed<-plm(Y~X, data = dat, index = c("id", "t"), model = "within")
summary(fixed)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = Y ~ X, data = dat, model = "within", index = c("id", 
##     "t"))
## 
## Balanced Panel: n = 50, T = 2, N = 100
## 
## Residuals:
##    Min. 1st Qu.  Median 3rd Qu.    Max. 
## -97.466 -11.476   0.000  11.476  97.466 
## 
## Coefficients:
##   Estimate Std. Error t-value Pr(>|t|)
## X  -2.1920     2.1806 -1.0052   0.3197
## 
## Total Sum of Squares:    108920
## Residual Sum of Squares: 106720
## R-Squared:      0.020205
## Adj. R-Squared: -0.97959
## F-statistic: 1.01048 on 1 and 49 DF, p-value: 0.31973
#(d) Can you use the fixed effects model, distinguishing the 50 states? Why or why not?

fixed1<-plm(Y~X, data = dat, index = c("id", "t"), model = "within", effect = "individual")
summary(fixed1)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = Y ~ X, data = dat, effect = "individual", model = "within", 
##     index = c("id", "t"))
## 
## Balanced Panel: n = 50, T = 2, N = 100
## 
## Residuals:
##    Min. 1st Qu.  Median 3rd Qu.    Max. 
## -97.466 -11.476   0.000  11.476  97.466 
## 
## Coefficients:
##   Estimate Std. Error t-value Pr(>|t|)
## X  -2.1920     2.1806 -1.0052   0.3197
## 
## Total Sum of Squares:    108920
## Residual Sum of Squares: 106720
## R-Squared:      0.020205
## Adj. R-Squared: -0.97959
## F-statistic: 1.01048 on 1 and 49 DF, p-value: 0.31973
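
With only T = 2 years per state, distinguishing the 50 states spends 50 degrees of freedom on the state intercepts and leaves little variation to estimate the slope; the within estimator is then numerically equivalent to first-differencing the two years, which is why the fit is so poor (negative adjusted R-squared).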
#(e) Estimate a random components model using the same data. From this analysis which model best describes the egg prices?

random<-plm(Y~X, data = dat, index = c("id", "t"), model = "random")
summary(random)
## Oneway (individual) effect Random Effect Model 
##    (Swamy-Arora's transformation)
## 
## Call:
## plm(formula = Y ~ X, data = dat, model = "random", index = c("id", 
##     "t"))
## 
## Balanced Panel: n = 50, T = 2, N = 100
## 
## Effects:
##                     var   std.dev share
## idiosyncratic 2.178e+03 4.667e+01 0.001
## individual    2.557e+06 1.599e+03 0.999
## theta: 0.9794
## 
## Residuals:
##    Min. 1st Qu.  Median 3rd Qu.    Max. 
## -99.965 -27.679 -12.074  13.040 163.003 
## 
## Coefficients:
##              Estimate Std. Error z-value  Pr(>|z|)    
## (Intercept) 1602.5840   283.6850  5.6492 1.612e-08 ***
## X             -3.0502     2.1654 -1.4086    0.1589    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    223930
## Residual Sum of Squares: 219480
## R-Squared:      0.019846
## Adj. R-Squared: 0.0098442
## Chisq: 1.98427 on 1 DF, p-value: 0.15894
phtest(random, fixed)
## 
##  Hausman Test
## 
## data:  Y ~ X
## chisq = 11.143, df = 1, p-value = 0.0008437
## alternative hypothesis: one model is inconsistent
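
Here the Hausman test rejects the null hypothesis (p = 0.0008), so the random effects estimator is inconsistent and the fixed effects model best describes the egg prices.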

2022 (pic4)

## (a)
# Load necessary libraries
library(plm)

# Load the dataset (replace with your file path)
vote_data <- read.csv("vote.csv")

# Convert to panel data format
panel_data <- pdata.frame(vote_data, index = c("district", "year"))

# Fixed Effects Model
fe_model <- plm(vote ~ d90 + log(inexp) + log(chexp) + ln_eshr, data = panel_data, model = "within")
summary(fe_model)

# Random Effects Model
re_model <- plm(vote ~ d90 + log(inexp) + log(chexp) + ln_eshr, data = panel_data, model = "random")
summary(re_model)

# Extract district-specific effects from the fixed effects model
fixef(fe_model)


## (b)
# Perform Hausman Test
hausman_test <- phtest(fe_model, re_model)
print(hausman_test)


## (c)
# Check significance of variables in the chosen model (e.g., fixed effects)
summary(fe_model)

## (d)
# Fit a restricted model (excluding log(inexp) and log(chexp))
restricted_model <- plm(vote ~ d90 + ln_eshr, data = panel_data, model = "within")

# Perform an F test of the restriction (anova() has no method for plm objects;
# lmtest::waldtest() can compare the two nested plm fits instead)
library(lmtest)
waldtest(fe_model, restricted_model)


## (e)
# Re-estimate the model with ln_eshr as the only independent variable
simple_model <- plm(vote ~ ln_eshr, data = panel_data, model = "within")
summary(simple_model)

MyXm

Packages

library(plm)
library(foreign)
library(car)
library(haven)
library(gplots)

#Q1

data<-read.table("G:/My Drive/Turin/4th Year/AST 431/2nd_inc/2022/data/Eggs.txt")
#View(data)

#a
m_90<-lm(Y~X, data = data[data$t==1990,])
summary(m_90)
## 
## Call:
## lm(formula = Y ~ X, data = data[data$t == 1990, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1903.6  -949.7  -515.6   294.2  5779.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3118.48     873.08   3.572 0.000819 ***
## X             -22.50      10.77  -2.089 0.041989 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1607 on 48 degrees of freedom
## Multiple R-squared:  0.08337,    Adjusted R-squared:  0.06428 
## F-statistic: 4.366 on 1 and 48 DF,  p-value: 0.04199
m_91<-lm(Y~X, data = data[data$t==1991,])
summary(m_91)
## 
## Call:
## lm(formula = Y ~ X, data = data[data$t == 1991, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1983.6  -944.8  -494.0   333.9  5658.2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3149.36     810.91   3.884 0.000314 ***
## X             -23.35      10.27  -2.274 0.027461 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1594 on 48 degrees of freedom
## Multiple R-squared:  0.09727,    Adjusted R-squared:  0.07847 
## F-statistic: 5.172 on 1 and 48 DF,  p-value: 0.02746
#b
m_p<-lm(Y~X, data = data)
summary(m_p)
## 
## Call:
## lm(formula = Y ~ X, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1988.3  -960.6  -499.0   321.5  5791.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3132.258    587.868   5.328 6.34e-07 ***
## X            -22.895      7.345  -3.117   0.0024 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1584 on 98 degrees of freedom
## Multiple R-squared:  0.09021,    Adjusted R-squared:  0.08093 
## F-statistic: 9.718 on 1 and 98 DF,  p-value: 0.002395
pooling<-plm(Y~X,index = c("id", "t"), data = data, model = "pooling")
summary(pooling)
## Pooling Model
## 
## Call:
## plm(formula = Y ~ X, data = data, model = "pooling", index = c("id", 
##     "t"))
## 
## Balanced Panel: n = 50, T = 2, N = 100
## 
## Residuals:
##     Min.  1st Qu.   Median  3rd Qu.     Max. 
## -1988.29  -960.61  -498.99   321.48  5791.30 
## 
## Coefficients:
##              Estimate Std. Error t-value  Pr(>|t|)    
## (Intercept) 3132.2582   587.8677  5.3282 6.337e-07 ***
## X            -22.8953     7.3445 -3.1173  0.002395 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    270310000
## Residual Sum of Squares: 245920000
## R-Squared:      0.090215
## Adj. R-Squared: 0.080931
## F-statistic: 9.71774 on 1 and 98 DF, p-value: 0.0023954
#c
data$t<-as.factor(data$t)
fe_m<-plm(Y~X ,index = c("id", "t"), data = data, model = "within", effect = "time")
summary(fe_m)
## Oneway (time) effect Within Model
## 
## Call:
## plm(formula = Y ~ X, data = data, effect = "time", model = "within", 
##     index = c("id", "t"))
## 
## Balanced Panel: n = 50, T = 2, N = 100
## 
## Residuals:
##     Min.  1st Qu.   Median  3rd Qu.     Max. 
## -1972.25  -955.86  -500.01   308.60  5773.34 
## 
## Coefficients:
##   Estimate Std. Error t-value Pr(>|t|)   
## X -22.9404     7.3934 -3.1028 0.002512 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    270290000
## Residual Sum of Squares: 245890000
## R-Squared:      0.09029
## Adj. R-Squared: 0.071533
## F-statistic: 9.62734 on 1 and 97 DF, p-value: 0.0025115
#d

fi_m<-plm(Y~X, index = c("id", "t"), data = data, model = "within")
summary(fi_m)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = Y ~ X, data = data, model = "within", index = c("id", 
##     "t"))
## 
## Balanced Panel: n = 50, T = 2, N = 100
## 
## Residuals:
##    Min. 1st Qu.  Median 3rd Qu.    Max. 
## -97.466 -11.476   0.000  11.476  97.466 
## 
## Coefficients:
##   Estimate Std. Error t-value Pr(>|t|)
## X  -2.1920     2.1806 -1.0052   0.3197
## 
## Total Sum of Squares:    108920
## Residual Sum of Squares: 106720
## R-Squared:      0.020205
## Adj. R-Squared: -0.97959
## F-statistic: 1.01048 on 1 and 49 DF, p-value: 0.31973
fi_m1<-plm(Y~X, index = c("id", "t"), data = data, model = "within", effect = "individual" )
summary(fi_m1)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = Y ~ X, data = data, effect = "individual", model = "within", 
##     index = c("id", "t"))
## 
## Balanced Panel: n = 50, T = 2, N = 100
## 
## Residuals:
##    Min. 1st Qu.  Median 3rd Qu.    Max. 
## -97.466 -11.476   0.000  11.476  97.466 
## 
## Coefficients:
##   Estimate Std. Error t-value Pr(>|t|)
## X  -2.1920     2.1806 -1.0052   0.3197
## 
## Total Sum of Squares:    108920
## Residual Sum of Squares: 106720
## R-Squared:      0.020205
## Adj. R-Squared: -0.97959
## F-statistic: 1.01048 on 1 and 49 DF, p-value: 0.31973
#e

ra_m<-plm(Y~X ,index = c("id", "t"), data = data, model = "random")
summary(ra_m)
## Oneway (individual) effect Random Effect Model 
##    (Swamy-Arora's transformation)
## 
## Call:
## plm(formula = Y ~ X, data = data, model = "random", index = c("id", 
##     "t"))
## 
## Balanced Panel: n = 50, T = 2, N = 100
## 
## Effects:
##                     var   std.dev share
## idiosyncratic 2.178e+03 4.667e+01 0.001
## individual    2.557e+06 1.599e+03 0.999
## theta: 0.9794
## 
## Residuals:
##    Min. 1st Qu.  Median 3rd Qu.    Max. 
## -99.965 -27.679 -12.074  13.040 163.003 
## 
## Coefficients:
##              Estimate Std. Error z-value  Pr(>|z|)    
## (Intercept) 1602.5840   283.6850  5.6492 1.612e-08 ***
## X             -3.0502     2.1654 -1.4086    0.1589    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    223930
## Residual Sum of Squares: 219480
## R-Squared:      0.019846
## Adj. R-Squared: 0.0098442
## Chisq: 1.98427 on 1 DF, p-value: 0.15894
phtest(ra_m, fe_m)
## 
##  Hausman Test
## 
## data:  Y ~ X
## chisq = 7.9164, df = 1, p-value = 0.004899
## alternative hypothesis: one model is inconsistent

Q2

data1<- read_dta("G:/My Drive/Turin/4th Year/AST 431/2nd_inc/2022/data/Panel101.dta")
#View(data1)

panal<-pdata.frame(data1, index = c("country", "year"))

## Graphical Exploration

coplot(y~year|country,type="l", data = panal)

scatterplot(y~year|country, boxplot=F, smooth=T, data = panal)

## [1] "11" "54" "45" "55" "46" "36" "59"
## Better model for the Dataset

pool<-plm(y~x1+x2+x3, data = panal, model = "pooling")
summary(pool)
## Pooling Model
## 
## Call:
## plm(formula = y ~ x1 + x2 + x3, data = panal, model = "pooling")
## 
## Balanced Panel: n = 7, T = 10, N = 70
## 
## Residuals:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -9.50e+09 -1.62e+09  1.74e+08  0.00e+00  1.50e+09  7.36e+09 
## 
## Coefficients:
##               Estimate Std. Error t-value Pr(>|t|)  
## (Intercept) 1400527665  762362572  1.8371   0.0707 .
## x1           559063957  915594368  0.6106   0.5436  
## x2            87445761  349461177  0.2502   0.8032  
## x3            92622347  293654989  0.3154   0.7534  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    6.2729e+20
## Residual Sum of Squares: 6.2251e+20
## R-Squared:      0.0076238
## Adj. R-Squared: -0.037484
## F-statistic: 0.169012 on 3 and 66 DF, p-value: 0.91693
random<-plm(y~x1+x2+x3, data = panal, model = "random")
summary(random)
## Oneway (individual) effect Random Effect Model 
##    (Swamy-Arora's transformation)
## 
## Call:
## plm(formula = y ~ x1 + x2 + x3, data = panal, model = "random")
## 
## Balanced Panel: n = 7, T = 10, N = 70
## 
## Effects:
##                     var   std.dev share
## idiosyncratic 7.847e+18 2.801e+09 0.815
## individual    1.780e+18 1.334e+09 0.185
## theta: 0.4469
## 
## Residuals:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -8.67e+09 -1.39e+09  1.14e+08  0.00e+00  1.69e+09  7.01e+09 
## 
## Coefficients:
##               Estimate Std. Error z-value Pr(>|z|)  
## (Intercept)  492524201  962576196  0.5117  0.60888  
## x1          1734069418  999303736  1.7353  0.08269 .
## x2           420030518  497228766  0.8447  0.39825  
## x3           226606952  324458564  0.6984  0.48492  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    5.5536e+20
## Residual Sum of Squares: 5.2872e+20
## R-Squared:      0.047967
## Adj. R-Squared: 0.0046933
## Chisq: 3.32536 on 3 DF, p-value: 0.34413
fixed<-plm(y~x1+x2+x3, data = panal, model = "within")
summary(fixed)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = y ~ x1 + x2 + x3, data = panal, model = "within")
## 
## Balanced Panel: n = 7, T = 10, N = 70
## 
## Residuals:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -8.31e+09 -1.12e+09  5.02e+08  0.00e+00  1.37e+09  6.31e+09 
## 
## Coefficients:
##      Estimate Std. Error t-value Pr(>|t|)  
## x1 2424529174 1156516240  2.0964  0.04027 *
## x2 1822699670 2028055946  0.8987  0.37238  
## x3  309718343  368552481  0.8404  0.40404  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    5.2364e+20
## Residual Sum of Squares: 4.708e+20
## R-Squared:      0.10092
## Adj. R-Squared: -0.033937
## F-statistic: 2.24507 on 3 and 60 DF, p-value: 0.092235
plmtest(pool)
## 
##  Lagrange Multiplier Test - (Honda)
## 
## data:  y ~ x1 + x2 + x3
## normal = 1.7534, p-value = 0.03976
## alternative hypothesis: significant effects
pFtest(fixed, pool)
## 
##  F test for individual effects
## 
## data:  y ~ x1 + x2 + x3
## F = 3.2225, df1 = 6, df2 = 60, p-value = 0.008242
## alternative hypothesis: significant effects
phtest(random, fixed)
## 
##  Hausman Test
## 
## data:  y ~ x1 + x2 + x3
## chisq = 4.7863, df = 3, p-value = 0.1881
## alternative hypothesis: one model is inconsistent
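
Taken together: the LM test (p = 0.040) and the F test (p = 0.008) both indicate significant individual effects, so pooled OLS is inadequate; and since the Hausman test (p = 0.188) does not reject the random effects model, the random effects specification is preferred for this dataset.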

For Scripts: Dyuti Turin

Time Series Analysis

Simulate from an ARIMA Model

MA(1) process with positive coefficient

par(mfrow=c(2,2))

ma_sim<-arima.sim(list(order=c(0,0,1), ma=0.9), n=100)
plot(ma_sim, ylab=expression(Y[t]),
     xlab="Time",
     type="o",
     main="MA(1) simulation")

acf(ma_sim,main="Sample ACF")
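
As a quick theoretical check (a small sketch using base R's ARMAacf()), an MA(1) process with θ = 0.9 has a single non-zero autocorrelation at lag 1, ρ(1) = θ/(1 + θ²) ≈ 0.497, which the sample ACF above should roughly match.

# theoretical ACF of MA(1) with theta = 0.9: only lag 1 is non-zero
ARMAacf(ma = 0.9, lag.max = 5)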

MA(1) process with negative coefficient

ma_sim1<-arima.sim(list(order=c(0,0,1), ma=-0.9), n=100)
plot(ma_sim1, ylab=expression(Y[t]),
     xlab="Time",
     type="o",
     main="MA(1) simulation")

acf(ma_sim1,main="Sample ACF")

MA(2) process

ma_sim2<-arima.sim(list(order=c(0,0,2), ma=c(-0.9, 0.7)), n=100)
plot(ma_sim2, ylab=expression(Y[t]), xlab="Time",
     main="MA(2) simulation", type="o")

acf(ma_sim2, main="Sample ACF")

acf(ma_sim2, main="Sample ACF")[1:30] # print the autocorrelation values (lags beyond the default lag.max are NA)
## 
## Autocorrelations of series 'ma_sim2', by lag
## 
##      1      2      3      4      5      6      7      8      9     10     11 
## -0.639  0.239  0.066 -0.040 -0.037  0.019  0.051 -0.142  0.186 -0.157  0.094 
##     12     13     14     15     16     17     18     19     20     NA     NA 
## -0.056  0.007  0.006 -0.015 -0.048  0.189 -0.249  0.249 -0.151     NA     NA 
##     NA     NA     NA     NA     NA     NA     NA     NA 
##     NA     NA     NA     NA     NA     NA     NA     NA
plot(acf(ma_sim2, main="Sample ACF")[1:30])

AR(1) process with positive coefficient

ma_sim3<-arima.sim(list(order=c(1,0,0), ar=0.9), n=100)
plot(ma_sim3, ylab=expression(Y[t]),
     xlab="Time",
     type="o",
     main="AR(1) simulation")

acf(ma_sim3,main="Sample ACF")

AR(1) process with negative coefficient

ma_sim4<-arima.sim(list(order=c(1,0,0), ar=-0.9), n=100)
plot(ma_sim4, ylab=expression(Y[t]),
     xlab="Time",
     type="o",
     main="AR(1) simulation")

acf(ma_sim4, main="Sample ACF")

AR(2) process

ma_sim5<-arima.sim(list(order=c(2,0,0), ar=c(0.7, -0.9)), n=100)
plot(ma_sim5, ylab=expression(Y[t]), xlab="Time",
     main="AR(2) simulation", type="o")

acf(ma_sim5, main="Sample ACF")

acf(ma_sim5, main="Sample ACF")[1:20] # print the autocorrelation values
## 
## Autocorrelations of series 'ma_sim5', by lag
## 
##      1      2      3      4      5      6      7      8      9     10     11 
##  0.365 -0.669 -0.818  0.059  0.806  0.511 -0.411 -0.775 -0.145  0.623  0.563 
##     12     13     14     15     16     17     18     19     20 
## -0.199 -0.661 -0.276  0.427  0.547 -0.024 -0.531 -0.337  0.270
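
The defining feature of an AR(2) process is that its PACF cuts off after lag 2. A small sketch using base R's ARMAacf() with pacf = TRUE confirms this for the coefficients used above:

# theoretical PACF of AR(2) with ar = c(0.7, -0.9): zero beyond lag 2
ARMAacf(ar = c(0.7, -0.9), lag.max = 10, pacf = TRUE)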

ARMA(1,1) process

x<-arima.sim(n=1000, list(ar=c(0.5), ma=c(0.2)))
acf(x)

pacf(x)

ARMA(1,1) process (different coefficients)

x1<-arima.sim(n=100, list(order=c(1,0,1), ar=0.9,ma=0.2))
acf(x1)

pacf(x1)

Fitting AR(1) on raw data

data <- c(14.3, 14.6, 13.5, 14.2, 12.1, 14.19, 14.55, 13.6, 14.59, 16.6, 15.4, 12.89, 12.2, 12.15, 10.89, 11.11, 11.98, 13.44, 14.05, 13.91, 14.1, 12.66, 14.6, 16.7, 15.4, 17.1, 15.45, 16.76, 15.7, 14.1, 15.3, 16.4, 17.09, 16.8, 16.98, 16.39, 15.8, 15.1, 13.7, 13.79, 15.7)
acf(data)

pacf(data)

a<-arima(data, c(1,0,0))
names(a)
##  [1] "coef"      "sigma2"    "var.coef"  "mask"      "loglik"    "aic"      
##  [7] "arma"      "residuals" "call"      "series"    "code"      "n.cond"   
## [13] "nobs"      "model"
acf(a$residuals)

pacf(a$residuals)
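
Beyond eyeballing the residual ACF and PACF, a Ljung-Box test gives a formal check (a small sketch using base R's Box.test(); a large p-value means the residuals are indistinguishable from white noise, i.e. the AR(1) fit is adequate):

# H0: the residuals are independently distributed (white noise)
Box.test(a$residuals, lag = 10, type = "Ljung-Box")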

Kings Data

data1<-read.table("G:/My Drive/Turin/4th Year/AST 431/After 2nd inc/data/Kings.txt")
t<-ts(data1)

acf(t)

pacf(t)

a1<-arima(data1, c(1,0,0))
names(a1)
##  [1] "coef"      "sigma2"    "var.coef"  "mask"      "loglik"    "aic"      
##  [7] "arma"      "residuals" "call"      "series"    "code"      "n.cond"   
## [13] "nobs"      "model"
acf(a1$residuals)

pacf(a1$residuals)

White Noise

Generate a white noise process

WN <- arima.sim(model = list(order = c(0, 0, 0)), n = 200)
plot(WN,col=5, main="White Noise Series")
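
Since white noise is uncorrelated by construction, its sample ACF should stay within the blue confidence bands at every lag beyond 0; a quick check:

acf(WN, main = "Sample ACF of White Noise")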

For the co2_csv data.

library(timeSeries)
library(readr)
co2_csv <- read_csv("G:/My Drive/Turin/4th Year/AST 431/After 2nd inc/data/co2_mm_mlo.csv")

co2 <- co2_csv$average
date_co2 <- timeSequence(from="1958-03-01",
                         to="2021-07-30", by="month")
co2.ts <- timeSeries(data=co2,date_co2, units="co2")

Plotting various components of the Time Series

plot(co2.ts, main = "Time Series Plot of co2")

co2.stl<-stl(as.ts(co2.ts) ,s.window = 12,
             t.window=0.75*length(as.ts(co2.ts)),
             t.degree=0) 

plot(co2.stl$time.series[,"trend"], main = "Trend")

plot(co2.stl$time.series[,"seasonal"], main = "Seasonal")

plot(co2.stl$time.series[,"remainder"], main = "Remainder")

co2d <- diff(co2.stl$time.series[,"remainder"])
plot(co2d)

Checking PACF and ACF

pacf(as.vector(co2d),main="PACF Plot")

acf(co2d, main="ACF Plot")

Fitting an autoregressive model

co2dar<-ar(co2d)  
summary(co2dar)
##              Length Class  Mode     
## order          1    -none- numeric  
## ar            15    -none- numeric  
## var.pred       1    -none- numeric  
## x.mean         1    -none- numeric  
## aic           29    -none- numeric  
## n.used         1    -none- numeric  
## n.obs          1    -none- numeric  
## order.max      1    -none- numeric  
## partialacf    28    -none- numeric  
## resid        760    ts     numeric  
## method         1    -none- character
## series         1    -none- character
## frequency      1    -none- numeric  
## call           2    -none- call     
## asy.var.coef 225    -none- numeric

Fitting ARIMA(1,0,1) to this data

fit<-arima(co2d,order=c(1,0,1))
tsdiag(fit)

qqnorm(fit$residuals)

ASSIGNMENT

Fit an appropriate ARIMA model for the monthly rainfall data

The pdf is here.

Packages

library(readr)
library(timeSeries)
library(tidyverse)

Data: Rainfall

Rainfall_dhaka <- read_csv("G:/My Drive/Turin/4th Year/AST 431/1st_inc/Time_series/Dataset/Rainfall_dhaka.csv")
#view(Rainfall_dhaka)

Make variables - average rainfall per month & Date

rainfall <- Rainfall_dhaka[, 2:34] %>%
  mutate_if(is.character, as.numeric) %>% 
  pivot_longer(
    cols = !c(Year, Month),
    names_to = "day",
    values_to = "rain"
  ) %>%
  mutate(day = as.integer(day)) %>%
  group_by(Year, Month) %>%
  mutate(meanrain = mean(rain, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(date = make_date(Year, Month, day))

rainfall

Construct the time series data

rainfall <- rainfall %>% arrange(Year, Month, day)

# Step 2: Remove duplicates (keep one meanrain per month)
monthly_data <- rainfall %>% distinct(Year, Month, .keep_all = TRUE)

# Create time series object
rain.ts <- ts(monthly_data$meanrain, frequency = 12, start = c(1976, 1))

plot(rain.ts, main="Time series of rainfall")

Seasonal Decomposition of Time Series by Loess

rain.stl<-stl(as.ts(rain.ts) ,s.window = 12,
             t.window=0.75*length(as.ts(rain.ts)),
             t.degree=0) 
plot(rain.stl$time.series[, "trend"], main="Trend")

plot(rain.stl$time.series[, "seasonal"], main="Seasonal")

rain_o<-rain.stl$time.series[, "remainder"]
plot(rain_o, main="Remainder")

rain_d<-diff(rain.stl$time.series[, "remainder"])
plot(rain_d)

pacf(as.vector(rain_d), main="PACF plot")

acf(rain_d, main="ACF plot")

Fitting ARIMA Models

The remainder component is used as the series to model. We fit several closely related ARIMA models and compare which one best describes this time series.

m1 <- arima(rain_o,order = c(0,0,1))
m2 <- arima(rain_o,order = c(1,0,0))
m3 <- arima(rain_o,order = c(1,1,1))
m4 <- arima(rain_o,order = c(0,0,2))
m5 <- arima(rain_o,order = c(1,0,2))
m6 <- arima(rain_o,order = c(1,1,2))
m7 <- arima(rain_o,order = c(2,0,2))
m8 <- arima(rain_o,order = c(2,1,2))
m9 <- arima(rain_o,order = c(0,0,3))

Comparing AIC values for different models.

AIC(m1)
## [1] 2689.181
AIC(m2)
## [1] 2686.034
AIC(m3)
## [1] 2687.775
AIC(m4)
## [1] 2684.678
AIC(m5)
## [1] 2686.675
AIC(m6)
## [1] 2687.325
AIC(m7)
## [1] 2688.624
AIC(m8)
## [1] 2688.275
AIC(m9)
## [1] 2686.675

The ARIMA(0,0,2) model, i.e. MA(2), has the lowest AIC, so we select it as the comparatively best-fitting model.
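
The same comparison can be written more compactly by collecting the fits in a list (a small sketch over the models fitted above):

# compare the candidate models' AICs in one sorted table
models <- list(m1 = m1, m2 = m2, m3 = m3, m4 = m4, m5 = m5,
               m6 = m6, m7 = m7, m8 = m8, m9 = m9)
sort(sapply(models, AIC))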

# cross-check: auto.arima() on the monthly series picks the same non-seasonal order
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
auto.arima(rain.ts)
## Series: rain.ts 
## ARIMA(0,0,2)(2,1,0)[12] 
## 
## Coefficients:
##          ma1     ma2     sar1     sar2
##       0.1532  0.1038  -0.7145  -0.2734
## s.e.  0.0440  0.0444   0.0424   0.0430
## 
## sigma^2 = 14.99:  log likelihood = -1426.41
## AIC=2862.82   AICc=2862.94   BIC=2884.03

Summary estimates of ARIMA(0,0,2)

broom::tidy(m4)

Diagnostic Plots

tsdiag(m4)

qqnorm(m4$residuals)

If instead we difference the remainder to ensure stationarity…

M1 <- arima(rain_d,order = c(0,0,1))
M2 <- arima(rain_d,order = c(1,0,0))
M3 <- arima(rain_d,order = c(1,1,1))
M4 <- arima(rain_d,order = c(0,0,2))
M5 <- arima(rain_d,order = c(1,0,2))
M6 <- arima(rain_d,order = c(1,1,2))
M7 <- arima(rain_d,order = c(2,0,2))
M8 <- arima(rain_d,order = c(2,1,2))
M9 <- arima(rain_d,order = c(0,0,3))

Comparing AIC values for different models.

AIC(M1)
## [1] 2707.506
AIC(M2)
## [1] 2835.506
AIC(M3)
## [1] 2838.118
AIC(M4)
## [1] 2692.726
AIC(M5)
## [1] 2689.091
AIC(M6)
## [1] 2704.044
AIC(M7)
## [1] 2690.033
AIC(M8)
## [1] 2702.281
AIC(M9)
## [1] 2688.091

The ARIMA(0,0,3) model has the lowest AIC here, so we select it as the comparatively best-fitting model for the differenced remainder.

Seasonal ARIMA

Provided lecture

ARIMA(0,0,0)(0,0,1)_{12}

with θ = 0.5:
set.seed(123)
n <- 200

# base R's arima.sim() ignores a 'seasonal' component, so the seasonal MA(1)
# at lag 12 is written as an MA(12) with a single non-zero coefficient
y <- arima.sim(n = n, model = list(order = c(0, 0, 12),
                                   ma = c(rep(0, 11), 0.5)))

# Visualize the time series data
plot.ts(y, main = expression("Generated Time Series " ~ ARIMA(0, 0, 0)(0, 0, 1)[12]))

acf(y, main="ACF")

pacf(y, main="PACF")
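
For this ARIMA(0,0,0)(0,0,1)_{12} process the theoretical ACF is zero at every lag except lag 12 (where ρ(12) = θ/(1 + θ²) = 0.4), so the sample ACF should show a single significant spike around lag 12.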

ARIMA(0,0,0)(1,0,0)_{12}

with φ = 0.5:
set.seed(123)
n <- 200

# again, the seasonal AR(1) at lag 12 is written as an AR(12) with a single
# non-zero coefficient, since arima.sim() has no seasonal argument
y1 <- arima.sim(n = n, model = list(order = c(12, 0, 0),
                                    ar = c(rep(0, 11), 0.5)))

# Visualize the time series data
plot.ts(y1, main=expression("Generated Time Series" ~ ARIMA(0, 0, 0)(1, 0, 0)[12]))

pacf(y1, main="PACF")

acf(y1, main="ACF")
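
The pattern is reversed for the seasonal AR(1)_{12} process: the PACF should show a single spike at lag 12, while the ACF decays gradually across the seasonal lags 12, 24, 36, … (here ρ(12k) = 0.5^k).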

For the airline_passengers data.

library(readr)
airline_passengers <- read_csv("G:/My Drive/Turin/4th Year/AST 431/After 2nd inc/data/airline-passengers.csv")
## Rows: 144 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Month
## dbl (1): Passengers
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#view(airline_passengers)

ap_t <- ts(airline_passengers$Passengers, start = c(1949, 1), frequency = 12) # numeric column only; Month is character

plot.ts(ap_t, main = "Monthly Airline Passengers")

pacf(ap_t) 

acf(ap_t)
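
The slow decay of the ACF reflects the strong trend and seasonality in the airline data: the raw series is non-stationary, so regular and seasonal differencing would be needed before a non-seasonal ARIMA could be fitted.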

Extra…

library(zoo)
library(forecast)
library(Metrics)

Seasonal MA(12) process

arima.m<-arima.sim(list(order = c(0,0,12), ma = c(0.7,rep(0,10),0.9)), n = 200)
acf(arima.m)

A custom seasonal AR(13) process

ar_pacf<-ARMAacf (ar = c(.6,rep(0,10),.5,-.30),lag.max=30,pacf=T)
plot(ar_pacf, type='h')

library(readr)
airline_passengers1 <- read_csv("G:/My Drive/Turin/4th Year/AST 431/After 2nd inc/data/airline-passengers.csv", show_col_types = FALSE)
#view(airline_passengers)

airline_passengers1$Month <- as.yearmon(airline_passengers1$Month)
max(airline_passengers1$Month)
## [1] "Dec 1960"
head(airline_passengers1)

Divides the data frame into train and test sets and then tries to predict the test set from the training data. It is a naive approach: the overall training-set average is used as the prediction for every future month.

test_set_start_date <- as.yearmon("Jan 1959")

train_set <- subset(airline_passengers1, Month < test_set_start_date)
test_set <- subset(airline_passengers1, Month >= test_set_start_date)

dim(train_set)
## [1] 120   2
dim(test_set)
## [1] 24  2
train_set_avg <- mean(train_set$Passengers)

simple_avg_predictions <- data.frame(
  Month = test_set$Month,
  Passengers = rep(train_set_avg, nrow(test_set))
)

ggplot() +
  geom_line(data = train_set, aes(x = Month, y = Passengers, color = "Training"), size = 1) +
  geom_line(data = test_set, aes(x = Month, y = Passengers, color = "Testing"), size = 1) +
  labs(
    title = "Airline Passengers - Training and Testing Sets",
    x = "Date",
    y = "Passengers"
  ) +
  scale_color_manual(values = c("Training" = "#12355B", "Testing" = "#D72638"), name = "Airline passengers") +
  theme_minimal() +
  theme(plot.title = element_text(size = 20))
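
Using the Metrics package loaded above, we can quantify how far off this naive baseline is (a small sketch over the objects created above):

# RMSE of the naive overall-average forecast over the 24 test months
rmse(test_set$Passengers, simple_avg_predictions$Passengers)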

Converts the training set to a time series object, then fits a seasonal ARIMA(1,0,0)(2,1,0)_{12}.

series_ts <- ts(train_set$Passengers, frequency = 12)
#series_ts

mod2<-Arima(series_ts,order=c(1, 0, 0),
            seasonal=list(order=c(2, 1, 0), period=12))
mod2
## Series: series_ts 
## ARIMA(1,0,0)(2,1,0)[12] 
## 
## Coefficients:
##          ar1     sar1    sar2
##       0.9418  -0.0637  0.0781
## s.e.  0.0313   0.1079  0.1107
## 
## sigma^2 = 107.1:  log likelihood = -405.29
## AIC=818.58   AICc=818.97   BIC=829.31
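
To see how much this seasonal ARIMA improves on the naive average baseline, we can forecast the held-out months and compute the same error metric (a sketch using forecast::forecast() and Metrics::rmse()):

# forecast the 24 held-out months and score against the test set
fc <- forecast(mod2, h = nrow(test_set))
rmse(test_set$Passengers, as.numeric(fc$mean))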

Generates hypothetical monthly sales data with trend, seasonality and random white noise, starting from 1 January 2020.

library(forecast)
library(tseries)

set.seed(123)  # For reproducibility
months <- seq(as.Date("2020-01-01"),
              by = "month",
              length.out = 36)

sales <- 200 + (1:36) * 3 + 20 * sin(2 * pi * (1:36) / 12) + rnorm(36, mean = 0, sd = 10)

data <- data.frame(Date = months, Sales = sales)
head(data)
  • A sequence of time steps from 1 to 36 (e.g., months over 3 years).

  • Starts at 200 and increases by 3 each time step. [Trend Component]

  • A sinusoidal wave with a period of 12, completing one full cycle every 12 time units. The amplitude is 20, so sales oscillate ±20 above and below the trend each year. [Seasonal Component]

  • Adds random variation from a normal distribution. [Random Noise Component]

  • Sales = Trend + Seasonality + Random Noise

Automatic plotting of the hypothetical sales data

ts_data <- ts(data$Sales, start = c(2020, 1), frequency = 12)

autoplot(ts_data, series = "Sales") + 
  ggtitle("Synthetic Monthly Sales Data") + 
  xlab("Time") + 
  ylab("Sales") +
  scale_color_manual(values = "blue") +  
  theme_minimal(base_size = 15) +  # Set base font size for better visibility
  theme(legend.position = "bottom")

Check for stationarity

adf_test <- adf.test(ts_data)
## Warning in adf.test(ts_data): p-value smaller than printed p-value
print(adf_test)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  ts_data
## Dickey-Fuller = -5.3005, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary
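
Since the reported p-value (0.01, truncated) is below 0.05, we reject the null hypothesis of a unit root and treat the series as stationary.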

auto.arima() automatically selects and fits the seasonal ARIMA it judges best.

auto_model <- auto.arima(ts_data)
summary(auto_model)
## Series: ts_data 
## ARIMA(0,0,0)(1,1,0)[12] with drift 
## 
## Coefficients:
##          sar1   drift
##       -0.8392  2.9958
## s.e.   0.0854  0.1095
## 
## sigma^2 = 83.1:  log likelihood = -93.36
## AIC=192.71   AICc=193.91   BIC=196.25
## 
## Training set error measures:
##                      ME     RMSE      MAE        MPE     MAPE      MASE
## Training set -0.4441953 7.126158 4.467821 -0.2139526 1.667343 0.1240441
##                   ACF1
## Training set 0.1815494

We fit SARIMA(1,1,1)(1,1,1)_{12} manually.

sarima_model <- Arima(ts_data, order=c(1,1,1), seasonal=c(1,1,1))
summary(sarima_model)
## Series: ts_data 
## ARIMA(1,1,1)(1,1,1)[12] 
## 
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
##          ar1      ma1     sar1     sma1
##       0.0267  -0.7219  -0.8417  -0.0275
## s.e.  0.3068   0.2199      NaN      NaN
## 
## sigma^2 = 97.05:  log likelihood = -91.09
## AIC=192.18   AICc=195.71   BIC=197.86
## 
## Training set error measures:
##                     ME     RMSE      MAE        MPE     MAPE      MASE
## Training set 0.3929868 7.156714 4.692756 0.07424912 1.739757 0.1302892
##                     ACF1
## Training set -0.02246882
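
The NaN standard errors on the seasonal terms suggest this manual SARIMA is over-parameterized for only 36 observations; the simpler model chosen by auto.arima() above achieves a comparable AIC with fewer parameters.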

Forecast future values of SARIMA(1,1,1)(1,1,1)_{12}.

forecasted_values <- forecast(sarima_model, h=12)

autoplot(forecasted_values) + 
  ggtitle("Sales Forecast for Next 12 Months") + 
  xlab("Time") + 
  ylab("Sales") +
  theme_minimal()