Covers exploratory analysis and bivariate analysis. Assignment 1 (also see this) asks which independent variables affect the birth weight variable.
We were directed to produce a cross table and a chi-square test of association, and Assignment 2 was given.
Basics of Time Series. The related pdf is here.
Step 1: Read the birth data.
Step 2: Store the data in a time series object using the ts() function in R.
library(readr)
births <- read_csv("G:/My Drive/Turin/4th Year/AST 431/1st_inc/Time_series/Dataset/births.txt")
head(births)
births.ts <- ts(births)
plot(births.ts)
Step 3: Set up a monthly time series.
Here the 1 in start = c(1946, 1) indicates that the series starts in January.
Step 4: Make a plot of the time series data.
births1.ts <- ts(births, frequency = 12, start = c(1946, 1))
head(births1.ts)
## Jan Feb Mar Apr May Jun
## 1946 23.598 26.931 24.740 25.806 24.364 24.477
plot(births1.ts)
Decomposing the time series means separating it into three components: a trend component, a seasonal component and an irregular component.
Step 5: Use the decompose() function in R.
d_births <- decompose(births1.ts)
plot(d_births)
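The components are stored in the returned object; as a small sketch (using d_births from above), a seasonally adjusted series can be obtained by subtracting the seasonal component:
births_adj <- births1.ts - d_births$seasonal # remove the estimated seasonal component
plot(births_adj)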
A stationary time series has the property that its mean, variance and covariance are not functions of time. We will use two tools to assess stationarity (the ACF & PACF).
To check stationarity, we can use the autocorrelation function (ACF) via the acf() function in base R.
A non-stationary series shows:
ACF decaying very slowly, with bars outside the blue significance bands for many lags
A stationary series shows:
ACF drops quickly (exponential decay)
Bars mostly within the blue lines → short-term dependence
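A minimal sketch of this check on the births series (assuming births1.ts from above):
acf(births1.ts)       # slow decay here would indicate non-stationarity
acf(diff(births1.ts)) # ACF of the first difference: a quick drop suggests stationarity
pacf(births1.ts)      # partial autocorrelations of the raw series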
After cleaning the Rainfall data, we proceed with the time series steps. The assignment is worked below…
library(readxl)
library(tidyverse)
library(pander)
library(tinytex)
library(dplyr)
Three day variables were found to be of character type; they are converted to numeric in the pipeline below.
rainfall <- read_csv("G:/My Drive/Turin/4th Year/AST 431/1st_inc/Time_series/Dataset/Rainfall_dhaka.csv", show_col_types = FALSE)
Make variables: average rainfall per month & a date column
rainfall <- rainfall %>%
select(!Stati) %>%
mutate_if(is.character, as.numeric) %>%
pivot_longer(
cols = !c(Year, Month),
names_to = "day",
values_to = "rain"
) %>%
mutate(day = as.integer(day)) %>%
group_by(Year, Month) %>%
mutate(meanrain = mean(rain, na.rm = TRUE)) %>%
ungroup() %>%
mutate(date = make_date(Year, Month, day))
rainfall
The column named Stati is removed
from the dataset to clean up unnecessary variables.
All character columns are converted to numeric format.
The dataset is reshaped from wide to long format. All columns except Year and Month are converted from columns into two new columns:
**day**: contains the old column names (1–31)
**rain**: contains the rainfall amounts for those days
The data is grouped with group_by(Year, Month) to calculate the monthly average rainfall for each year, which is added as a new column meanrain.
A new date column is created by combining Year, Month and day into a proper date format using make_date().
Draw a basic line plot of rainfall data
rainfall <- rainfall %>% arrange(Year, Month, day)
# Step 2: Remove duplicates (keep one meanrain per month)
monthly_data <- rainfall %>% distinct(Year, Month, .keep_all = TRUE)
ggplot(data = monthly_data, aes(x = date, y = meanrain))+
geom_line(color = "#00AFBB")
# Create time series object
t <- ts(monthly_data$meanrain, frequency = 12, start = c(1976, 1))
# Plot
plot(t, main = "Average Rainfall per Month (1976–2019)",
xlab = "Year", ylab = "Average Rainfall")
Draw a line plot of average rainfall after year 2000
# Considering monthly data (can't use t because that's not a data frame)
r1 <- monthly_data %>%
filter(Year >= 2000)
plot(ts(r1$meanrain, start = c(2000, 1), frequency = 12), main = "Average Rainfall per Month after 2000",
xlab = "Year",
ylab = "Average Rainfall")
Decompose the time series and Plot the decomposed time series.
decomposed_ts <- decompose(t, type = "additive")
plot(decomposed_ts)
# Time Series Graph
library(ggplot2)
#theme_set(theme_minimal())
# Demo dataset
head(economics)
# Basic line plot
ggplot(data = economics, aes(x = date, y = pop))+
geom_line(color = "#00AFBB")
# Plot a subset of the data
ss <- subset(economics, date > as.Date("2006-1-1"))
ggplot(data = ss, aes(x = date, y = pop)) +
geom_line(color = "#FC4E07", size = 2)
#Control line size by the value of a continuous variable:
ggplot(data = economics, aes(x = date, y = pop)) +
geom_line(aes(size = unemploy/pop), color = "#FC4E07")
library(tidyr)
library(dplyr)
df <- economics %>%
select(date, psavert, uempmed) %>%
gather(key = "variable", value = "value", -date)
head(df, 3)
# Multiple line plot
ggplot(df, aes(x = date, y = value)) +
geom_line(aes(color = variable), size = 1) +
scale_color_manual(values = c("#00AFBB", "#E7B800")) +
theme_minimal()
# Area plot
ggplot(df, aes(x = date, y = value)) +
geom_area(aes(color = variable, fill = variable),
alpha = 0.5, position = position_dodge(0.8)) +
scale_color_manual(values = c("#00AFBB", "#E7B800")) +
scale_fill_manual(values = c("#00AFBB", "#E7B800"))
p <- ggplot(data = economics, aes(x = date, y = psavert)) +
geom_line(color = "#00AFBB", size = 1)
p
# Set axis limits c(min, max)
min <- as.Date("2002-1-1")
max <- NA
p + scale_x_date(limits = c(min, max))
# Format : month/year
p + scale_x_date(date_labels = "%b/%Y")
#Add trend smoothed line
p + stat_smooth(
color = "#FC4E07", fill = "#FC4E07",
method = "loess"
)
For this topic, Slide 1 is not necessary, but Slide 2 must be followed.
Load data Journals
#install.packages('AER')
library(AER)
data(Journals)
class(Journals)
## [1] "data.frame"
head(Journals)
Create a variable: citeprice=price/citations
Journals$citeprice <- Journals$price/Journals$citations
Goal: Estimate the effect of the price per citation on the number of library subscriptions.
\[ \log(\text{subs}) = \beta_0 + \beta_1 \log(\text{citeprice}) + \epsilon \] Model fit (without the log transformation the relationship is not linear, so we could not fit a linear model or draw a regression line)
jour_lm <- lm(log(subs) ~ log(citeprice), data = Journals)
#abline(jour_lm)
summary(jour_lm)
##
## Call:
## lm(formula = log(subs) ~ log(citeprice), data = Journals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.72478 -0.53609 0.03721 0.46619 1.84808
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.76621 0.05591 85.25 <2e-16 ***
## log(citeprice) -0.53305 0.03561 -14.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7497 on 178 degrees of freedom
## Multiple R-squared: 0.5573, Adjusted R-squared: 0.5548
## F-statistic: 224 on 1 and 178 DF, p-value: < 2.2e-16
plot(log(subs) ~ log(citeprice), data = Journals)
fitted_val<-jour_lm$fitted.values
new_model<-lm(log(Journals$subs) ~ (log(Journals$citeprice))+I(fitted_val^2)+I(fitted_val^3))
summary(new_model)
##
## Call:
## lm(formula = log(Journals$subs) ~ (log(Journals$citeprice)) +
## I(fitted_val^2) + I(fitted_val^3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5698 -0.4940 0.0444 0.4518 1.8269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.37868 11.86254 0.285 0.776
## log(Journals$citeprice) -0.43929 2.02420 -0.217 0.828
## I(fitted_val^2) 0.13885 0.75270 0.184 0.854
## I(fitted_val^3) -0.01570 0.04881 -0.322 0.748
##
## Residual standard error: 0.7479 on 176 degrees of freedom
## Multiple R-squared: 0.5644, Adjusted R-squared: 0.557
## F-statistic: 76.01 on 3 and 176 DF, p-value: < 2.2e-16
This is essentially a manual RESET check: the added powers of the fitted values are individually insignificant, suggesting no evidence of misspecification.
Predict log(subs) for citeprice = 2.11 (on the original scale, not in logs), with a prediction interval
predict(jour_lm, newdata = data.frame(citeprice = 2.11),
interval = "prediction")
## fit lwr upr
## 1 4.368188 2.883746 5.852629
l_citeprice <- seq(from = -6, to = 4, by = 0.25)
jour_pred <- predict(jour_lm, interval = "prediction",
newdata = data.frame(citeprice = exp(l_citeprice)))
plot(log(subs) ~ log(citeprice), data = Journals)#need the real one
lines(jour_pred[, 1] ~ l_citeprice, col = 1) #fit
lines(jour_pred[, 2] ~ l_citeprice, col = 1, lty = 2) #lower
lines(jour_pred[, 3] ~ l_citeprice, col = 1, lty = 2) #upper
### Diagnostic plot
#Don't know why! (possibly a reference Q-Q plot of heavy-tailed t(2) draws, for comparison)
qqnorm(rt(n=1000,df=2))
plot(jour_lm)
# In the console, press 'Enter' at the 'Hit <Return> to see next plot:' prompt to step through all the diagnostic plots
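To see all four diagnostic plots at once instead of stepping through them, the plotting region can be split (a small sketch):
par(mfrow = c(2, 2)) # 2 x 2 grid of panels
plot(jour_lm)
par(mfrow = c(1, 1)) # reset the plotting region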
For the details of the tests, follow the 404 slides.
Omitted variable or model misspecification
Powers of the variables should be included (e.g., test for quadratic and cubic response)
\(H_0\): There is no misspecification
Draw the conclusion from the p-value
library(lmtest)
resettest(jour_lm,power = 2:3)
##
## RESET test
##
## data: jour_lm
## RESET = 1.4409, df1 = 2, df2 = 176, p-value = 0.2395
Model: regress the residuals (u) on x, \(x^2\), \(x^3\) to check for non-linearity
dim(Journals)[1] = 180
The LM statistic is \(nR^2\), where n is the number of rows (180) and \(R^2 = 0.01611\) comes from the auxiliary regression summary
Since LM < the table (critical) value, we fail to reject the null. NULL: the additional variables do not improve the model, i.e. their betas = 0.
u <- resid(jour_lm)
x <- log(Journals$citeprice)
model_u <- lm(u~x + I(x^2) + I(x^3), data = Journals)
summary(model_u)
##
## Call:
## lm(formula = u ~ x + I(x^2) + I(x^3), data = Journals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5698 -0.4940 0.0444 0.4518 1.8269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.067327 0.073136 0.921 0.359
## x -0.041605 0.057375 -0.725 0.469
## I(x^2) -0.024315 0.022912 -1.061 0.290
## I(x^3) 0.002377 0.007393 0.322 0.748
##
## Residual standard error: 0.7479 on 176 degrees of freedom
## Multiple R-squared: 0.01611, Adjusted R-squared: -0.0006605
## F-statistic: 0.9606 on 3 and 176 DF, p-value: 0.4127
dim(Journals)[1]*0.01611
## [1] 2.8998
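For the reference (table) value, with df = 2 for the two added powers (a quick check):
qchisq(0.95, df = 2)
## [1] 5.991465
Since 2.90 < 5.99, we fail to reject the null, as stated above.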
model1 <- lm(log(subs) ~ citations, data = Journals)
model2 <- lm(log(subs) ~ pages, data = Journals)
encomptest(model1,model2)
Second NULL: adding the pages variable does not improve the model.
To compare two non-nested models, the encompassing test introduces a third model that combines the predictors from both; the J test below instead adds each model's fitted values to the other.
jtest(model1, model2)
NULL: Model 1 + fitted(Model 2) →
H0: Model 1 is correctly specified; Model 2’s predictions add no
additional explanatory power.
NULL: Model 2 + fitted(Model 1) →
H0: Model 2 is correctly specified; Model 1’s predictions add no
additional explanatory power.
AIC(model1)
## [1] 515.3615
AIC(model2)
## [1] 532.7627
Model 1 (citations) has the smaller AIC and is therefore preferred.
Simultaneous equation modelling (SEM). The truffles data are needed for this.
truffles <- read.table("G:/My Drive/Turin/4th Year/AST 431/After 2nd inc/data/truffles.txt",
                       header = TRUE) # header = TRUE assumed, so the columns p, q, pf, ps, di are named
model <- lm(p ~ pf + ps + di,
data = truffles)
summary(model)
##
## Call:
## lm(formula = p ~ pf + ps + di, data = truffles)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.4825 -3.5927 0.2801 4.5326 12.9210
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -32.5124 7.9842 -4.072 0.000387 ***
## pf 1.3539 0.2985 4.536 0.000115 ***
## ps 1.7081 0.3509 4.868 4.76e-05 ***
## di 7.6025 1.7243 4.409 0.000160 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.597 on 26 degrees of freedom
## Multiple R-squared: 0.8887, Adjusted R-squared: 0.8758
## F-statistic: 69.19 on 3 and 26 DF, p-value: 1.597e-12
p_hat<- model$fitted.values
model2 <- lm(q ~ p_hat + ps + di,
data = truffles)
summary(model2)
##
## Call:
## lm(formula = q ~ p_hat + ps + di, data = truffles)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1814 -1.1390 0.2765 1.4595 4.4318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.27947 3.01383 -1.420 0.167505
## p_hat -0.37446 0.08956 -4.181 0.000291 ***
## ps 1.29603 0.19309 6.712 4.03e-07 ***
## di 5.01398 1.24141 4.039 0.000422 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.68 on 26 degrees of freedom
## Multiple R-squared: 0.6974, Adjusted R-squared: 0.6625
## F-statistic: 19.97 on 3 and 26 DF, p-value: 6.332e-07
According to the hand note, we transformed the structural equations so that only exogenous variables appear on the right-hand side, allowing us to use OLS. Regressing p on the exogenous predictors gave the first-stage parameter estimates. Since p_hat holds the fitted values from that first OLS model, we could safely return to the original equation and regress q on p_hat (in place of p), together with ps and di.
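The same two-stage procedure can be run in a single call with ivreg() from the AER package (a sketch; unlike the manual second stage, ivreg() also reports standard errors that account for the generated regressor p_hat):
# 2SLS: regress q on p, ps, di, instrumenting p with pf (ps and di instrument themselves)
iv_model <- ivreg(q ~ p + ps + di | pf + ps + di, data = truffles)
summary(iv_model)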
library(dynlm)
library(AER)
data("USMacroG")
plot(USMacroG[, c("dpi", "consumption")],
lty = c(3, 1), #line type
plot.type = "single",
ylab = "")
legend("topleft",
legend = c("income", "consumption"),
lty = c(3, 1), bty = "n")
## Note: plot() and legend() have to be run together.
The first distributed lag model is:
\(consumption_t = \beta_0 + \beta_1 \cdot dpi_t + \beta_2 \cdot dpi_{t-1} + u_t\)
The second distributed lag model is:
\(consumption_t = \beta_0 + \beta_1 \cdot dpi_t + \beta_2 \cdot dpi_{t-1} + \beta_3 \cdot dpi_{t-2} + \beta_4 \cdot dpi_{t-3} + u_t\)
cons_lm1 <- dynlm(consumption ~ dpi + L(dpi), data = USMacroG)
# L for lags!!
cons_lm2 <- dynlm(consumption ~ dpi + L(dpi, 1:3), data = USMacroG)
summary(cons_lm1)
##
## Time series regression with "ts" data:
## Start = 1950(2), End = 2000(4)
##
## Call:
## dynlm(formula = consumption ~ dpi + L(dpi), data = USMacroG)
##
## Residuals:
## Min 1Q Median 3Q Max
## -190.02 -56.68 1.58 49.91 323.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -81.07959 14.50814 -5.589 7.43e-08 ***
## dpi 0.89117 0.20625 4.321 2.45e-05 ***
## L(dpi) 0.03091 0.20754 0.149 0.882
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 87.58 on 200 degrees of freedom
## Multiple R-squared: 0.9964, Adjusted R-squared: 0.9964
## F-statistic: 2.785e+04 on 2 and 200 DF, p-value: < 2.2e-16
summary(cons_lm2)
##
## Time series regression with "ts" data:
## Start = 1950(4), End = 2000(4)
##
## Call:
## dynlm(formula = consumption ~ dpi + L(dpi, 1:3), data = USMacroG)
##
## Residuals:
## Min 1Q Median 3Q Max
## -199.37 -58.09 1.60 52.32 323.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -85.73584 14.86920 -5.766 3.12e-08 ***
## dpi 0.89301 0.20925 4.268 3.07e-05 ***
## L(dpi, 1:3)1 0.13563 0.28941 0.469 0.640
## L(dpi, 1:3)2 -0.03442 0.28937 -0.119 0.905
## L(dpi, 1:3)3 -0.07246 0.21053 -0.344 0.731
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 87.88 on 196 degrees of freedom
## Multiple R-squared: 0.9964, Adjusted R-squared: 0.9963
## F-statistic: 1.359e+04 on 4 and 196 DF, p-value: < 2.2e-16
The autoregressive model is:
\(consumption_t = \beta_0 + \beta_1 \cdot dpi_t + \beta_2 \cdot consumption_{t-1} + u_t\)
cons_lm3<-dynlm(consumption~dpi+L(consumption), data=USMacroG)
summary(cons_lm3)
##
## Time series regression with "ts" data:
## Start = 1950(2), End = 2000(4)
##
## Call:
## dynlm(formula = consumption ~ dpi + L(consumption), data = USMacroG)
##
## Residuals:
## Min 1Q Median 3Q Max
## -101.303 -9.674 1.141 12.691 45.322
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.535216 3.845170 0.139 0.889
## dpi -0.004064 0.016626 -0.244 0.807
## L(consumption) 1.013111 0.018161 55.785 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.52 on 200 degrees of freedom
## Multiple R-squared: 0.9998, Adjusted R-squared: 0.9998
## F-statistic: 4.627e+05 on 2 and 200 DF, p-value: < 2.2e-16
# Restricted sum of squares
g1model <- dynlm(consumption ~ L(consumption), data =USMacroG)
anova(g1model)
# Unrestricted sum of squares
g1model_un <- dynlm(consumption ~ L(consumption) + L(dpi), data =USMacroG)
anova(g1model_un)
library(lmtest)
grangertest(consumption ~ dpi, data = USMacroG)
grangertest(dpi ~ consumption, data = USMacroG)
First, we manually calculate the Residual Sum of Squares (RSS) from the restricted and unrestricted models. We compare the raw RSS values to see whether they are different. However, to test whether the difference is statistically significant, we use Granger’s causality test.
Granger causality tests whether adding lagged disposable personal income (\(dpi_{t-1}\)) to the model significantly improves its predictive power.
Does past \(dpi\) (i.e., \(dpi_{t-1}\)) help explain current consumption beyond what current \(dpi\) already explains?
If the test statistic is significant, we reject the null hypothesis that lagged \(dpi\) has no effect — implying that \(dpi\) Granger-causes consumption.
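A minimal sketch of the manual comparison described above, using g1model and g1model_un (one restriction: the single lag of dpi):
rss_r <- deviance(g1model)    # restricted RSS
rss_u <- deviance(g1model_un) # unrestricted RSS
F_stat <- ((rss_r - rss_u) / 1) / (rss_u / df.residual(g1model_un))
F_stat # should match the F reported by grangertest(consumption ~ dpi)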
Var_1 <- dynlm(consumption ~ L(consumption, 1:3) + L(dpi, 1:3), data = USMacroG)
summary(Var_1)
##
## Time series regression with "ts" data:
## Start = 1950(4), End = 2000(4)
##
## Call:
## dynlm(formula = consumption ~ L(consumption, 1:3) + L(dpi, 1:3),
## data = USMacroG)
##
## Residuals:
## Min 1Q Median 3Q Max
## -92.724 -10.287 0.822 12.092 44.413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.69704 3.71059 0.188 0.8512
## L(consumption, 1:3)1 1.12289 0.07753 14.484 <2e-16 ***
## L(consumption, 1:3)2 0.10399 0.11124 0.935 0.3510
## L(consumption, 1:3)3 -0.20018 0.08322 -2.405 0.0171 *
## L(dpi, 1:3)1 0.03491 0.05693 0.613 0.5404
## L(dpi, 1:3)2 -0.04252 0.06992 -0.608 0.5438
## L(dpi, 1:3)3 -0.01162 0.05711 -0.203 0.8390
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.51 on 194 degrees of freedom
## Multiple R-squared: 0.9998, Adjusted R-squared: 0.9998
## F-statistic: 1.67e+05 on 6 and 194 DF, p-value: < 2.2e-16
Var_2 <- dynlm(dpi ~ L(consumption, 1:3) + L(dpi, 1:3), data = USMacroG)
summary(Var_2)
##
## Time series regression with "ts" data:
## Start = 1950(4), End = 2000(4)
##
## Call:
## dynlm(formula = dpi ~ L(consumption, 1:3) + L(dpi, 1:3), data = USMacroG)
##
## Residuals:
## Min 1Q Median 3Q Max
## -121.462 -14.712 -2.078 13.750 115.550
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.85925 5.19356 2.091 0.037839 *
## L(consumption, 1:3)1 0.43039 0.10851 3.966 0.000103 ***
## L(consumption, 1:3)2 -0.37263 0.15570 -2.393 0.017653 *
## L(consumption, 1:3)3 -0.02034 0.11648 -0.175 0.861534
## L(dpi, 1:3)1 0.79448 0.07968 9.971 < 2e-16 ***
## L(dpi, 1:3)2 0.24302 0.09786 2.483 0.013862 *
## L(dpi, 1:3)3 -0.06858 0.07994 -0.858 0.391976
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.71 on 194 degrees of freedom
## Multiple R-squared: 0.9997, Adjusted R-squared: 0.9997
## F-statistic: 9.973e+04 on 6 and 194 DF, p-value: < 2.2e-16
coeftest(Var_1, vcov. = sandwich)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.697044 2.890480 0.2412 0.80969
## L(consumption, 1:3)1 1.122891 0.091202 12.3122 < 2e-16 ***
## L(consumption, 1:3)2 0.103990 0.111881 0.9295 0.35380
## L(consumption, 1:3)3 -0.200184 0.089856 -2.2278 0.02704 *
## L(dpi, 1:3)1 0.034914 0.056934 0.6132 0.54043
## L(dpi, 1:3)2 -0.042518 0.066909 -0.6355 0.52587
## L(dpi, 1:3)3 -0.011617 0.062566 -0.1857 0.85290
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(vars)
var_est <- VAR(as.matrix(cbind(USMacroG[,"consumption"], USMacroG[,"dpi"])), p= 3)
summary(var_est)
##
## VAR Estimation Results:
## =========================
## Endogenous variables: USMacroG....consumption.., USMacroG....dpi..
## Deterministic variables: const
## Sample size: 201
## Log Likelihood: -1823.062
## Roots of the characteristic polynomial:
## 1.013 0.9893 0.5221 0.4362 0.3434 0.1721
## Call:
## VAR(y = as.matrix(cbind(USMacroG[, "consumption"], USMacroG[,
## "dpi"])), p = 3)
##
##
## Estimation results for equation USMacroG....consumption..:
## ==========================================================
## USMacroG....consumption.. = USMacroG....consumption...l1 + USMacroG....dpi...l1 + USMacroG....consumption...l2 + USMacroG....dpi...l2 + USMacroG....consumption...l3 + USMacroG....dpi...l3 + const
##
## Estimate Std. Error t value Pr(>|t|)
## USMacroG....consumption...l1 1.12289 0.07753 14.484 <2e-16 ***
## USMacroG....dpi...l1 0.03491 0.05693 0.613 0.5404
## USMacroG....consumption...l2 0.10399 0.11124 0.935 0.3510
## USMacroG....dpi...l2 -0.04252 0.06992 -0.608 0.5438
## USMacroG....consumption...l3 -0.20018 0.08322 -2.405 0.0171 *
## USMacroG....dpi...l3 -0.01162 0.05711 -0.203 0.8390
## const 0.69704 3.71059 0.188 0.8512
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 20.51 on 194 degrees of freedom
## Multiple R-Squared: 0.9998, Adjusted R-squared: 0.9998
## F-statistic: 1.67e+05 on 6 and 194 DF, p-value: < 2.2e-16
##
##
## Estimation results for equation USMacroG....dpi..:
## ==================================================
## USMacroG....dpi.. = USMacroG....consumption...l1 + USMacroG....dpi...l1 + USMacroG....consumption...l2 + USMacroG....dpi...l2 + USMacroG....consumption...l3 + USMacroG....dpi...l3 + const
##
## Estimate Std. Error t value Pr(>|t|)
## USMacroG....consumption...l1 0.43039 0.10851 3.966 0.000103 ***
## USMacroG....dpi...l1 0.79448 0.07968 9.971 < 2e-16 ***
## USMacroG....consumption...l2 -0.37263 0.15570 -2.393 0.017653 *
## USMacroG....dpi...l2 0.24302 0.09786 2.483 0.013862 *
## USMacroG....consumption...l3 -0.02034 0.11648 -0.175 0.861534
## USMacroG....dpi...l3 -0.06858 0.07994 -0.858 0.391976
## const 10.85925 5.19356 2.091 0.037839 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 28.71 on 194 degrees of freedom
## Multiple R-Squared: 0.9997, Adjusted R-squared: 0.9997
## F-statistic: 9.973e+04 on 6 and 194 DF, p-value: < 2.2e-16
##
##
##
## Covariance matrix of residuals:
## USMacroG....consumption.. USMacroG....dpi..
## USMacroG....consumption.. 420.6 262.1
## USMacroG....dpi.. 262.1 824.0
##
## Correlation matrix of residuals:
## USMacroG....consumption.. USMacroG....dpi..
## USMacroG....consumption.. 1.0000 0.4451
## USMacroG....dpi.. 0.4451 1.0000
(forecast <- predict(var_est))
## $USMacroG....consumption..
## fcst lower upper CI
## [1,] 6405.643 6365.446 6445.839 40.19672
## [2,] 6468.504 6407.383 6529.626 61.12183
## [3,] 6535.172 6452.212 6618.131 82.95946
## [4,] 6602.435 6500.487 6704.383 101.94817
## [5,] 6671.212 6551.583 6790.842 119.62967
## [6,] 6740.885 6604.990 6876.781 135.89515
## [7,] 6811.610 6660.420 6962.800 151.18976
## [8,] 6883.271 6717.595 7048.947 165.67580
## [9,] 6955.902 6776.349 7135.454 179.55241
## [10,] 7029.487 6836.543 7222.432 192.94461
##
## $USMacroG....dpi..
## fcst lower upper CI
## [1,] 6688.442 6632.180 6744.704 56.26160
## [2,] 6752.849 6674.420 6831.278 78.42899
## [3,] 6814.348 6715.375 6913.320 98.97210
## [4,] 6879.143 6762.435 6995.852 116.70877
## [5,] 6943.980 6811.522 7076.438 132.45778
## [6,] 7010.201 6863.404 7156.997 146.79648
## [7,] 7077.115 6917.173 7237.057 159.94192
## [8,] 7145.002 6972.805 7317.199 172.19694
## [9,] 7213.728 7030.014 7397.441 183.71392
## [10,] 7283.356 7088.706 7478.005 194.64928
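The forecast object can also be plotted directly (the vars package supplies a plot method for predict() output):
plot(forecast)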
####### Impulse response function (irf)
imp <- irf(var_est)
plot(imp)
See the Question and the script.
Question 01
#1a
library(readxl)
ChikenDemand <- read_excel("ChikenDemand.xlsx")
View(ChikenDemand)
model<- lm(log(Y)~log(X2)+log(X3)+log(X6), data = ChikenDemand)
summary(model)
#1b
#I prefer B and the Ramsey RESET test because there is an omission & B is nested in A.
#1c
model2<- lm(log(Y)~log(X2)+log(X3), data = ChikenDemand)
u<- resid(model2)
model_u<- lm(u~log(X2)+log(X3)+log(X6), data = ChikenDemand)
summary(model_u)
#n= 23
#R2 = 0.011
#n*r2=0.253
qchisq((1-.05), df=1)
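# qchisq(0.95, 1) = 3.84; since LM = n*R2 = 0.253 < 3.84, we fail to reject the
# null: adding log(X6) does not significantly improve the model.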
#1d
model3<- lm(log(Y)~log(X4)+log(X5), data = ChikenDemand)
encomptest(model2,model3)
#1e
jtest(model2,model3)
Question 02
library(dplyr)
library(tidyverse)
library(ggplot2)
#2a
AveTemp <- read_excel("G:/My Drive/Turin/4th Year/AST 431/1st_inc/1st_inc_solve/AveTemp.xlsx")
AveTemp <- AveTemp %>%
mutate(date= make_date(Year,Mo))
#head(AveTemp)
p<- AveTemp %>%
filter(Station== "Dhaka")
plot(ts(data = p$AveTemp, frequency = 12, start = c(2008, 1)), main = "Monthly Average Temperature in Dhaka (2008–2018)",
ylab = "Temperature (°C)", xlab = "Year")
#2b
q<- AveTemp %>%
filter(Station %in% c("Dhaka", "Sylhet"))
#head(q)
ggplot(q, aes(x = date, y = AveTemp, color = Station)) +
geom_line() +
labs(title = "Monthly Average Temperature (Dhaka vs. Sylhet)",
x = "Date", y = "Temperature (°C)") +
theme_minimal()
# aes() should reference AveTemp directly; q$AveTemp triggers a dplyr warning
#2c
ggplot(p, aes(x = date, y = AveTemp)) +
geom_line(color = "blue") +
geom_smooth(method = "loess", se = FALSE, color = "red") +
labs(title = "Dhaka Monthly Temperature with Trend Line",
x = "Date", y = "Temperature (°C)") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
#2d
new<-AveTemp %>% filter(Station == "Sylhet")
#head(new)
t<-ts(new$AveTemp, frequency = 12, start = c(2008, 1))
acf(t, main = "ACF of Sylhet Monthly Temperature")
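# Note: ACFs of monthly temperature series typically oscillate with a 12-month
# cycle, reflecting seasonality rather than a quickly decaying stationary pattern.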
The main lecture is here.
wages data…
#install.packages("plm")
library(plm)
mydata<- read.csv("G:/My Drive/Turin/4th Year/AST 431/2nd_inc/Class/panel_wage.csv")
head(mydata)
pdata <- pdata.frame(mydata, index=c("id","t"))
head(pdata)
summary(pdata)
## exp wks occ ind
## Min. : 1.00 Min. : 5.00 Min. :0.0000 Min. :0.0000
## 1st Qu.:11.00 1st Qu.:46.00 1st Qu.:0.0000 1st Qu.:0.0000
## Median :18.00 Median :48.00 Median :1.0000 Median :0.0000
## Mean :19.85 Mean :46.81 Mean :0.5112 Mean :0.3954
## 3rd Qu.:29.00 3rd Qu.:50.00 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :51.00 Max. :52.00 Max. :1.0000 Max. :1.0000
##
## south smsa ms fem
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000
## Median :0.0000 Median :1.0000 Median :1.0000 Median :0.0000
## Mean :0.2903 Mean :0.6538 Mean :0.8144 Mean :0.1126
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## union ed blk lwage
## Min. :0.000 Min. : 4.00 Min. :0.00000 Min. :4.605
## 1st Qu.:0.000 1st Qu.:12.00 1st Qu.:0.00000 1st Qu.:6.395
## Median :0.000 Median :12.00 Median :0.00000 Median :6.685
## Mean :0.364 Mean :12.85 Mean :0.07227 Mean :6.676
## 3rd Qu.:1.000 3rd Qu.:16.00 3rd Qu.:0.00000 3rd Qu.:6.953
## Max. :1.000 Max. :17.00 Max. :1.00000 Max. :8.537
##
## id t tdum1 tdum2 tdum3
## 1 : 7 1:595 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 2 : 7 2:595 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## 3 : 7 3:595 Median :0.0000 Median :0.0000 Median :0.0000
## 4 : 7 4:595 Mean :0.1429 Mean :0.1429 Mean :0.1429
## 5 : 7 5:595 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## 6 : 7 6:595 Max. :1.0000 Max. :1.0000 Max. :1.0000
## (Other):4123 7:595
## tdum4 tdum5 tdum6 tdum7
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.1429 Mean :0.1429 Mean :0.1429 Mean :0.1429
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## exp2
## Min. : 1.0
## 1st Qu.: 121.0
## Median : 324.0
## Mean : 514.4
## 3rd Qu.: 841.0
## Max. :2601.0
##
OLS without considering panel effects
pooling <- plm(lwage ~ exp + exp2 + wks + ed, data=pdata, model= "pooling")
summary(pooling)
## Pooling Model
##
## Call:
## plm(formula = lwage ~ exp + exp2 + wks + ed, data = pdata, model = "pooling")
##
## Balanced Panel: n = 595, T = 7, N = 4165
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -2.16057670 -0.25034526 0.00027256 0.26792139 2.12969386
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## (Intercept) 4.9080e+00 6.7330e-02 72.8945 < 2.2e-16 ***
## exp 4.4675e-02 2.3929e-03 18.6701 < 2.2e-16 ***
## exp2 -7.1563e-04 5.2794e-05 -13.5552 < 2.2e-16 ***
## wks 5.8270e-03 1.1826e-03 4.9271 8.673e-07 ***
## ed 7.6041e-02 2.2266e-03 34.1511 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 886.9
## Residual Sum of Squares: 635.41
## R-Squared: 0.28356
## Adj. R-Squared: 0.28287
## F-statistic: 411.624 on 4 and 4160 DF, p-value: < 2.22e-16
Captures variation between individuals
between <- plm(lwage ~ exp + exp2 + wks + ed, data=pdata, model= "between")
summary(between)
## Oneway (individual) effect Between Model
##
## Call:
## plm(formula = lwage ~ exp + exp2 + wks + ed, data = pdata, model = "between")
##
## Balanced Panel: n = 595, T = 7, N = 4165
## Observations used in estimation: 595
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -0.978153 -0.220264 0.036574 0.250118 0.985629
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## (Intercept) 4.68303917 0.21009890 22.2897 < 2.2e-16 ***
## exp 0.03815295 0.00569666 6.6974 4.953e-11 ***
## exp2 -0.00063127 0.00012568 -5.0228 6.757e-07 ***
## wks 0.01309028 0.00406592 3.2195 0.001355 **
## ed 0.07378378 0.00489848 15.0626 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 92.322
## Residual Sum of Squares: 62.187
## R-Squared: 0.32641
## Adj. R-Squared: 0.32185
## F-statistic: 71.4768 on 4 and 590 DF, p-value: < 2.22e-16
Accounts for individual fixed effects by differencing
firstdiff <- plm(lwage ~ exp + exp2 + wks + ed, data=pdata, model= "fd")
summary(firstdiff)
## Oneway (individual) effect First-Difference Model
##
## Call:
## plm(formula = lwage ~ exp + exp2 + wks + ed, data = pdata, model = "fd")
##
## Balanced Panel: n = 595, T = 7, N = 4165
## Observations used in estimation: 3570
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -2.1131555 -0.0654718 -0.0095751 0.0483881 2.3295637
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## (Intercept) 0.11706540 0.00631057 18.5507 < 2.2e-16 ***
## exp2 -0.00053212 0.00013927 -3.8207 0.0001354 ***
## wks -0.00026826 0.00056483 -0.4749 0.6348525
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 118.06
## Residual Sum of Squares: 117.58
## R-Squared: 0.004108
## Adj. R-Squared: 0.0035496
## F-statistic: 7.35691 on 2 and 3567 DF, p-value: 0.0006479
Removes time-invariant characteristics
fixed <- plm(lwage ~ exp + exp2 + wks + ed, data=pdata, model= "within")
summary(fixed)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = lwage ~ exp + exp2 + wks + ed, data = pdata, model = "within")
##
## Balanced Panel: n = 595, T = 7, N = 4165
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -1.8120879 -0.0511128 0.0037112 0.0614250 1.9434065
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## exp 1.1379e-01 2.4689e-03 46.0888 < 2.2e-16 ***
## exp2 -4.2437e-04 5.4632e-05 -7.7678 1.036e-14 ***
## wks 8.3588e-04 5.9967e-04 1.3939 0.1634
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 240.65
## Residual Sum of Squares: 82.632
## R-Squared: 0.65663
## Adj. R-Squared: 0.59916
## F-statistic: 2273.74 on 3 and 3567 DF, p-value: < 2.22e-16
Assumes unobserved heterogeneity is uncorrelated with regressors
random <- plm(lwage ~ exp + exp2 + wks + ed, data=pdata, model= "random")
summary(random)
## Oneway (individual) effect Random Effect Model
## (Swamy-Arora's transformation)
##
## Call:
## plm(formula = lwage ~ exp + exp2 + wks + ed, data = pdata, model = "random")
##
## Balanced Panel: n = 595, T = 7, N = 4165
##
## Effects:
## var std.dev share
## idiosyncratic 0.02317 0.15220 0.185
## individual 0.10209 0.31952 0.815
## theta: 0.8228
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -2.0439676 -0.1057048 0.0070992 0.1147499 2.0875839
##
## Coefficients:
## Estimate Std. Error z-value Pr(>|z|)
## (Intercept) 3.8294e+00 9.3634e-02 40.8974 <2e-16 ***
## exp 8.8861e-02 2.8178e-03 31.5360 <2e-16 ***
## exp2 -7.7257e-04 6.2262e-05 -12.4083 <2e-16 ***
## wks 9.6577e-04 7.4329e-04 1.2993 0.1938
## ed 1.1171e-01 6.0572e-03 18.4426 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 260.94
## Residual Sum of Squares: 151.35
## R-Squared: 0.42
## Adj. R-Squared: 0.41945
## Chisq: 3012.45 on 4 DF, p-value: < 2.22e-16
plmtest(pooling)
##
## Lagrange Multiplier Test - (Honda)
##
## data: lwage ~ exp + exp2 + wks + ed
## normal = 72.056, p-value < 2.2e-16
## alternative hypothesis: significant effects
pFtest(fixed, pooling)
##
## F test for individual effects
##
## data: lwage ~ exp + exp2 + wks + ed
## F = 40.239, df1 = 593, df2 = 3567, p-value < 2.2e-16
## alternative hypothesis: significant effects
phtest(random, fixed)
##
## Hausman Test
##
## data: lwage ~ exp + exp2 + wks + ed
## chisq = 6191.4, df = 3, p-value < 2.2e-16
## alternative hypothesis: one model is inconsistent
Since the p-value is far below 0.05, we reject the null; the fixed effects model is preferred over random effects here.
The dataset contains variables such as y (response variable), year (time variable) and country (panel identifier). We first load and inspect the dataset.
library(foreign)
library(plm)
Panel <- read.dta("G:/My Drive/Turin/4th Year/AST 431/2nd_inc/Class/Panel101.dta")
head(Panel)
ols<-lm(y~x1, data = Panel)
summary(ols)
##
## Call:
## lm(formula = y ~ x1, data = Panel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.546e+09 -1.578e+09 1.554e+08 1.422e+09 7.183e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.524e+09 6.211e+08 2.454 0.0167 *
## x1 4.950e+08 7.789e+08 0.636 0.5272
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.028e+09 on 68 degrees of freedom
## Multiple R-squared: 0.005905, Adjusted R-squared: -0.008714
## F-statistic: 0.4039 on 1 and 68 DF, p-value: 0.5272
Regular OLS regression does not consider heterogeneity across groups or time
yhat<-ols$fitted.values
plot(Panel$x1, Panel$y, pch=19, xlab="x1", ylab="y")
abline(lm(Panel$y~Panel$x1),lwd=3, col="red")
Adjusting for countries (least squares dummy variables, LSDV)
lsdv <- lm(y ~ x1 + factor(country), data = Panel)
# but the constant term would need to be removed for the full LSDV parameterisation (one dummy per country)
summary(lsdv)
##
## Call:
## lm(formula = y ~ x1 + factor(country), data = Panel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.634e+09 -9.697e+08 5.405e+08 1.386e+09 5.612e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.805e+08 9.618e+08 0.916 0.3635
## x1 2.476e+09 1.107e+09 2.237 0.0289 *
## factor(country)B -1.938e+09 1.265e+09 -1.533 0.1304
## factor(country)C -2.603e+09 1.596e+09 -1.631 0.1080
## factor(country)D 2.282e+09 1.261e+09 1.810 0.0752 .
## factor(country)E -1.483e+09 1.268e+09 -1.169 0.2467
## factor(country)F 1.130e+09 1.289e+09 0.877 0.3839
## factor(country)G -1.865e+09 1.497e+09 -1.246 0.2175
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.796e+09 on 62 degrees of freedom
## Multiple R-squared: 0.2276, Adjusted R-squared: 0.1404
## F-statistic: 2.61 on 7 and 62 DF, p-value: 0.01991
# x1 is now significant at the 5% significance level
Fixed effects: n entity-specific intercepts (using plm)
fixed<-plm(y~x1, data = Panel, index = c("country", "year"), model = "within")
summary(fixed)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = y ~ x1, data = Panel, model = "within", index = c("country",
## "year"))
##
## Balanced Panel: n = 7, T = 10, N = 70
##
## Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8.63e+09 -9.70e+08 5.40e+08 0.00e+00 1.39e+09 5.61e+09
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## x1 2475617827 1106675594 2.237 0.02889 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 5.2364e+20
## Residual Sum of Squares: 4.8454e+20
## R-Squared: 0.074684
## Adj. R-Squared: -0.029788
## F-statistic: 5.00411 on 1 and 62 DF, p-value: 0.028892
In the output: the coefficient of x1 indicates how much y changes over time, on average per country, when x1 increases by one unit. Pr(>|t|) = 0.02889 is the two-tailed p-value for the hypothesis that the coefficient differs from 0; if it is below 0.05 (or 0.10, if you choose that alpha), the variable has a significant influence on the dependent variable (y). The overall p-value of 0.028892 belongs to the F test of whether all the coefficients in the model are jointly different from zero; if it is < 0.05 the model is ok.
fixef(fixed)
## A B C D E F
## 880542404 -1057858363 -1722810755 3162826897 -602622000 2010731793
## G
## -984717493
pFtest(fixed, ols)
##
## F test for individual effects
##
## data: y ~ x1
## F = 2.9655, df1 = 6, df2 = 62, p-value = 0.01307
## alternative hypothesis: significant effects
In the output: the p-value = 0.01307; since it is < 0.05, the fixed effects model is the better choice.
random <- plm(y ~ x1, data=Panel, index=c("country", "year"), model="random")
summary(random)
## Oneway (individual) effect Random Effect Model
## (Swamy-Arora's transformation)
##
## Call:
## plm(formula = y ~ x1, data = Panel, model = "random", index = c("country",
## "year"))
##
## Balanced Panel: n = 7, T = 10, N = 70
##
## Effects:
## var std.dev share
## idiosyncratic 7.815e+18 2.796e+09 0.873
## individual 1.133e+18 1.065e+09 0.127
## theta: 0.3611
##
## Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8.94e+09 -1.51e+09 2.82e+08 0.00e+00 1.56e+09 6.63e+09
##
## Coefficients:
## Estimate Std. Error z-value Pr(>|z|)
## (Intercept) 1037014284 790626206 1.3116 0.1896
## x1 1247001782 902145601 1.3823 0.1669
##
## Total Sum of Squares: 5.6595e+20
## Residual Sum of Squares: 5.5048e+20
## R-Squared: 0.02733
## Adj. R-Squared: 0.013026
## Chisq: 1.91065 on 1 DF, p-value: 0.16689
In the output: Balanced Panel: n = 7 is the number of groups/panels, T = 10 the number of years, and N = 70 the total number of observations. Interpretation of the coefficients is tricky, since they include both the within-entity and between-entity effects; for TSCS data, the coefficient represents the average effect of x1 on y when x1 changes across time and between countries by one unit. Pr(>|z|) = 0.1669 is the two-tailed p-value for the hypothesis that the coefficient differs from 0; to reject it, the p-value must be below 0.05 (or 0.10, if you choose that alpha). The overall p-value of 0.16689 tests whether all the coefficients in the model are jointly different from zero; if it is < 0.05 the model is ok.
To decide between fixed and random effects you can run a Hausman test, where the null hypothesis is that the preferred model is random effects and the alternative is fixed effects. It basically tests whether the unique errors (u_i) are correlated with the regressors; the null hypothesis is that they are not.
phtest(fixed, random)
##
## Hausman Test
##
## data: y ~ x1
## chisq = 3.674, df = 1, p-value = 0.05527
## alternative hypothesis: one model is inconsistent
The LM test helps you decide between a random effects regression and a simple OLS regression. The null hypothesis in the LM test is that the variance across entities is zero. That is, there is no significant difference across units (i.e., no panel effect).
Breusch-Pagan [bp] Lagrange Multiplier for random effects. Null is no panel effect (i.e. OLS better).
pool <- plm(y ~ x1, data=Panel, index=c("country", "year"), model="pooling")
plmtest(pool, type=c("bp"))
##
## Lagrange Multiplier Test - (Breusch-Pagan)
##
## data: y ~ x1
## chisq = 2.6692, df = 1, p-value = 0.1023
## alternative hypothesis: significant effects
Conclusion: here we fail to reject the null and conclude that random effects is not appropriate. That is, there is no evidence of significant differences across countries, so you can run a simple OLS regression.
fixed <- plm(y ~ x1, data=Panel, index=c("country", "year"),
model="within")
summary(fixed)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = y ~ x1, data = Panel, model = "within", index = c("country",
## "year"))
##
## Balanced Panel: n = 7, T = 10, N = 70
##
## Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8.63e+09 -9.70e+08 5.40e+08 0.00e+00 1.39e+09 5.61e+09
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## x1 2475617827 1106675594 2.237 0.02889 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 5.2364e+20
## Residual Sum of Squares: 4.8454e+20
## R-Squared: 0.074684
## Adj. R-Squared: -0.029788
## F-statistic: 5.00411 on 1 and 62 DF, p-value: 0.028892
fixed.time <- plm(y ~ x1 + factor(year), data=Panel, index=c("country",
"year"), model="within")
summary(fixed.time)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = y ~ x1 + factor(year), data = Panel, model = "within",
## index = c("country", "year"))
##
## Balanced Panel: n = 7, T = 10, N = 70
##
## Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -7.92e+09 -1.05e+09 -1.40e+08 0.00e+00 1.63e+09 5.49e+09
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## x1 1389050354 1319849567 1.0524 0.29738
## factor(year)1991 296381559 1503368528 0.1971 0.84447
## factor(year)1992 145369666 1547226548 0.0940 0.92550
## factor(year)1993 2874386795 1503862554 1.9113 0.06138 .
## factor(year)1994 2848156288 1661498927 1.7142 0.09233 .
## factor(year)1995 973941306 1567245748 0.6214 0.53698
## factor(year)1996 1672812557 1631539254 1.0253 0.30988
## factor(year)1997 2991770063 1627062032 1.8388 0.07156 .
## factor(year)1998 367463593 1587924445 0.2314 0.81789
## factor(year)1999 1258751933 1512397632 0.8323 0.40898
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 5.2364e+20
## Residual Sum of Squares: 4.0201e+20
## R-Squared: 0.23229
## Adj. R-Squared: 0.00052851
## F-statistic: 1.60365 on 10 and 53 DF, p-value: 0.13113
pFtest(fixed.time, fixed)
##
## F test for individual effects
##
## data: y ~ x1 + factor(year)
## F = 1.209, df1 = 9, df2 = 53, p-value = 0.3094
## alternative hypothesis: significant effects
plmtest(fixed, c("time"), type=("bp"))
##
## Lagrange Multiplier Test - time effects (Breusch-Pagan)
##
## data: y ~ x1
## chisq = 0.16532, df = 1, p-value = 0.6843
## alternative hypothesis: significant effects
Panel data consists of observations collected over multiple time periods for the same entities (e.g., countries, firms, individuals). This document demonstrates different visualization techniques to analyze trends, heterogeneity and relationships in panel data.
library(foreign) # coplot
library(car) # scatterplot
library(gplots) # plotmeans
The coplot function generates small multiple plots for each country, showing how y evolves over year.
coplot(y ~ year|country, type = "l", data = Panel) # Lines
#coplot(y ~ year|country, type = "b", data = Panel) # points and Lines
• Such patterns are useful in econometric modeling to determine whether fixed effects or random effects models are appropriate for capturing country-specific influences.
• Observing common trends across countries can suggest shared economic, social, or policy-related influences.
• The x-axis corresponds to time (year), and the y-axis represents y.
• Each subplot represents a different country.
A scatterplot with trend lines for each country highlights patterns and potential correlations over time.
scatterplot(y ~ year | country, boxplots = FALSE, smooth = TRUE, data = Panel)
• The individual points represent data observations for each country.
• The smooth curves indicate the underlying trends of y over time.
• If trends for different countries follow similar patterns, it may indicate shared influences across panels.
• Large fluctuations in trend lines suggest higher volatility or structural shifts in specific countries.
• Identifying these trends is crucial in dynamic panel data models (e.g., Arellano-Bond estimators) that capture lagged effects and autocorrelation over time.
This plot compares the mean values of y for each country, providing insights into cross-sectional differences.
plotmeans(y ~ country, main = "Heterogeneity across countries", data = Panel)
plotmeans(y~year, main="Heterogeneity across years", data = Panel)
• The presence of heterogeneity supports the use of fixed effects models, which account for unobserved, time-invariant country-specific characteristics.
• Large variations suggest heterogeneity.
• Countries with non-overlapping confidence intervals have statistically significant differences in y.
• The vertical error bars indicate the confidence intervals, showing variability within each country.
• The dots represent the mean values of y for each country.
• Coplot helps analyze time-series patterns in each country and supports the selection of fixed vs. random effects models.
• Scatterplots with smooth trends reveal the general direction and fluctuations of y, guiding the choice of dynamic panel models.
• Mean plots with confidence intervals highlight heterogeneity across countries, reinforcing the use of fixed effects estimators to control for unobserved heterogeneity.
# Ensure 'country' is a factor
Panel$country <- as.factor(Panel$country)
# Compute regression lines for each country
regression_lines <- Panel %>%
group_by(country) %>%
summarize(intercept = coef(lm(y ~ x1, data = pick(everything())))[1],
slope = coef(lm(y ~ x1, data = pick(everything())))[2]) %>%
expand_grid(x1 = seq(min(Panel$x1), max(Panel$x1), length.out = 100)) %>%
mutate(y_pred = intercept + slope * x1)
# pick() replaces cur_data(), which was deprecated in dplyr 1.1.0
# Plot scatter points and regression lines
ggplot(data = Panel,
aes(x = x1, y = y, color = country))+
geom_point(alpha = 0.6) + # Scatter points
geom_line(data = regression_lines,
aes(x = x1, y = y_pred, color = country),
size = 1) +
labs(title = "Regression Lines for Each Country",
x = "X1",
y = "Y") +
theme_minimal()
Data
indiv <- rep(1:3, each = 3)
year <- rep(2000:2002, times = 3)
y <- c(6, 4.6, 9.4, 9.1, 8.3, 0.6, 9.1, 4.8, 9.1)
x1 <- c(7.8, 0.6, 2.1, 1.3, 0.9, 9.8, 0.2, 5.9, 5.2)
x2 <- c(5.8, 7.9, 5.4, 6.7, 6.6, 0.4, 2.6, 3.2, 6.9)
dat <- data.frame(country = factor(indiv), year, y, x1, x2)
dat
Plot
plot(x1, y, col = rep(c("red", "blue", "green"), each = 3))
mod1 <- lm(y ~ x1, data = dat)
abline(mod1)
summary(mod1)
##
## Call:
## lm(formula = y ~ x1, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9975 -0.7411 0.2719 1.5545 3.1552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9435 1.2010 7.447 0.000144 ***
## x1 -0.5767 0.2396 -2.406 0.047026 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.386 on 7 degrees of freedom
## Multiple R-squared: 0.4527, Adjusted R-squared: 0.3745
## F-statistic: 5.79 on 1 and 7 DF, p-value: 0.04703
mod2 <- lm(y ~ x1 + country, data = dat) # LSDV
summary(mod2)
##
## Call:
## lm(formula = y ~ x1 + country, data = dat)
##
## Residuals:
## 1 2 3 4 5 6 7 8 9
## 1.8026 -3.7320 1.9294 1.5495 0.5198 -2.0694 -0.6148 -1.6416 2.2564
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.6765 1.8152 4.780 0.00497 **
## x1 -0.5742 0.2701 -2.126 0.08685 .
## country2 -0.3795 2.1956 -0.173 0.86954
## country3 1.1531 2.1926 0.526 0.62143
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.684 on 5 degrees of freedom
## Multiple R-squared: 0.5052, Adjusted R-squared: 0.2084
## F-statistic: 1.702 on 3 and 5 DF, p-value: 0.2812
abline(a = mod2$coefficients[1], b = mod2$coefficients[2], col = "red") # country‐1
abline(a = mod2$coefficients[1] + mod2$coefficients[3], b = mod2$coefficients[2], col = "blue")
abline(a = mod2$coefficients[1] + mod2$coefficients[4], b = mod2$coefficients[2], col = "green")
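# (Sketch) add a legend mapping the colours to the three countries:
legend("topright", legend = paste("country", 1:3),
       col = c("red", "blue", "green"), lty = 1)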
library(haven)
data <- read_dta("G:/My Drive/Turin/4th Year/AST 431/2nd_inc/2022/data/Panel_data.dta")
#a) Fit a suitable model to find the the relationship between y and x1.
ols<-lm(y~x1, data = data)
summary(ols)
##
## Call:
## lm(formula = y ~ x1, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.546e+09 -1.578e+09 1.554e+08 1.422e+09 7.183e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.524e+09 6.211e+08 2.454 0.0167 *
## x1 4.950e+08 7.789e+08 0.636 0.5272
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.028e+09 on 68 degrees of freedom
## Multiple R-squared: 0.005905, Adjusted R-squared: -0.008714
## F-statistic: 0.4039 on 1 and 68 DF, p-value: 0.5272
#(b) Fit a fixed effect model considering country as a dummy variable. Compare the results with the results obtained in (i). Produce a graph showing the regression line for each country in the same graph. Interpret your results and graph.
lsdv<-lm(y~x1+factor(country), data = data)
summary(lsdv)
##
## Call:
## lm(formula = y ~ x1 + factor(country), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.634e+09 -9.697e+08 5.405e+08 1.386e+09 5.612e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.805e+08 9.618e+08 0.916 0.3635
## x1 2.476e+09 1.107e+09 2.237 0.0289 *
## factor(country)2 -1.938e+09 1.265e+09 -1.533 0.1304
## factor(country)3 -2.603e+09 1.596e+09 -1.631 0.1080
## factor(country)4 2.282e+09 1.261e+09 1.810 0.0752 .
## factor(country)5 -1.483e+09 1.268e+09 -1.169 0.2467
## factor(country)6 1.130e+09 1.289e+09 0.877 0.3839
## factor(country)7 -1.865e+09 1.497e+09 -1.246 0.2175
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.796e+09 on 62 degrees of freedom
## Multiple R-squared: 0.2276, Adjusted R-squared: 0.1404
## F-statistic: 2.61 on 7 and 62 DF, p-value: 0.01991
plot(data$y~data$x1, pch=data$country)
abline(ols, col=1)
library(car)
scatterplot(fitted(lsdv)~x1|country, boxplot= FALSE ,smooth=TRUE, data = data)
#(c) Fit a random effects model considering country as a panel variable and Interpret your results.
library(plm)
random<-plm(y~x1, data = data, index = c("country", "year"), model = "random")
summary(random)
## Oneway (individual) effect Random Effect Model
## (Swamy-Arora's transformation)
##
## Call:
## plm(formula = y ~ x1, data = data, model = "random", index = c("country",
## "year"))
##
## Balanced Panel: n = 7, T = 10, N = 70
##
## Effects:
## var std.dev share
## idiosyncratic 7.815e+18 2.796e+09 0.873
## individual 1.133e+18 1.065e+09 0.127
## theta: 0.3611
##
## Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8.94e+09 -1.51e+09 2.82e+08 0.00e+00 1.56e+09 6.63e+09
##
## Coefficients:
## Estimate Std. Error z-value Pr(>|z|)
## (Intercept) 1037014284 790626206 1.3116 0.1896
## x1 1247001782 902145601 1.3823 0.1669
##
## Total Sum of Squares: 5.6595e+20
## Residual Sum of Squares: 5.5048e+20
## R-Squared: 0.02733
## Adj. R-Squared: 0.013026
## Chisq: 1.91065 on 1 DF, p-value: 0.16689
#(d) Carry out the Hausman test for the random effects against the fixed effects and write your conclusion.
fixed<-plm(y~x1, data=data, index = c("country", "year"), model = "within")
summary(fixed)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = y ~ x1, data = data, model = "within", index = c("country",
## "year"))
##
## Balanced Panel: n = 7, T = 10, N = 70
##
## Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8.63e+09 -9.70e+08 5.40e+08 0.00e+00 1.39e+09 5.61e+09
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## x1 2475617827 1106675594 2.237 0.02889 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 5.2364e+20
## Residual Sum of Squares: 4.8454e+20
## R-Squared: 0.074684
## Adj. R-Squared: -0.029788
## F-statistic: 5.00411 on 1 and 62 DF, p-value: 0.028892
phtest(random, fixed)
##
## Hausman Test
##
## data: y ~ x1
## chisq = 3.674, df = 1, p-value = 0.05527
## alternative hypothesis: one model is inconsistent
library(readr)
dat <- read.table("G:/My Drive/Turin/4th Year/AST 431/2nd_inc/2022/data/Eggs.txt",
                  header = TRUE) # header = TRUE assumed: the code below uses the column names Y, X, t, id
View(dat)
#(a)Estimate the model for the years 1990 and 1991 separately.
m1<-lm(Y~X, data = dat[dat$t==1990,])
summary(m1)
##
## Call:
## lm(formula = Y ~ X, data = dat[dat$t == 1990, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -1903.6 -949.7 -515.6 294.2 5779.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3118.48 873.08 3.572 0.000819 ***
## X -22.50 10.77 -2.089 0.041989 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1607 on 48 degrees of freedom
## Multiple R-squared: 0.08337, Adjusted R-squared: 0.06428
## F-statistic: 4.366 on 1 and 48 DF, p-value: 0.04199
m2<-lm(Y~X, data = dat[dat$t==1991,])
summary(m2)
##
## Call:
## lm(formula = Y ~ X, data = dat[dat$t == 1991, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -1983.6 -944.8 -494.0 333.9 5658.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3149.36 810.91 3.884 0.000314 ***
## X -23.35 10.27 -2.274 0.027461 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1594 on 48 degrees of freedom
## Multiple R-squared: 0.09727, Adjusted R-squared: 0.07847
## F-statistic: 5.172 on 1 and 48 DF, p-value: 0.02746
#(b) Pool the observations for the two years and estimate the pooled regression. What assumptions are you making in pooling the data?
#normal OLS
m<-lm(Y~X, data = dat)
summary(m)
##
## Call:
## lm(formula = Y ~ X, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1988.3 -960.6 -499.0 321.5 5791.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3132.258 587.868 5.328 6.34e-07 ***
## X -22.895 7.345 -3.117 0.0024 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1584 on 98 degrees of freedom
## Multiple R-squared: 0.09021, Adjusted R-squared: 0.08093
## F-statistic: 9.718 on 1 and 98 DF, p-value: 0.002395
#(c) Use the fixed effects model, distinguishing the two years, and present the regression results.
library(plm)
dat$t<-as.factor(dat$t)
fixed<-plm(Y~X, data = dat, index = c("id", "t"), model = "within")
summary(fixed)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = Y ~ X, data = dat, model = "within", index = c("id",
## "t"))
##
## Balanced Panel: n = 50, T = 2, N = 100
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -97.466 -11.476 0.000 11.476 97.466
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## X -2.1920 2.1806 -1.0052 0.3197
##
## Total Sum of Squares: 108920
## Residual Sum of Squares: 106720
## R-Squared: 0.020205
## Adj. R-Squared: -0.97959
## F-statistic: 1.01048 on 1 and 49 DF, p-value: 0.31973
#(d) Can you use the fixed effects model, distinguishing the 50 states? Why or why not?
fixed1<-plm(Y~X, data = dat, index = c("id", "t"), model = "within", effect = "individual")
summary(fixed1)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = Y ~ X, data = dat, effect = "individual", model = "within",
## index = c("id", "t"))
##
## Balanced Panel: n = 50, T = 2, N = 100
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -97.466 -11.476 0.000 11.476 97.466
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## X -2.1920 2.1806 -1.0052 0.3197
##
## Total Sum of Squares: 108920
## Residual Sum of Squares: 106720
## R-Squared: 0.020205
## Adj. R-Squared: -0.97959
## F-statistic: 1.01048 on 1 and 49 DF, p-value: 0.31973
#(e) Estimate a random components model using the same data. From this analysis which model best describes the egg prices?
random<-plm(Y~X, data = dat, index = c("id", "t"), model = "random")
summary(random)
## Oneway (individual) effect Random Effect Model
## (Swamy-Arora's transformation)
##
## Call:
## plm(formula = Y ~ X, data = dat, model = "random", index = c("id",
## "t"))
##
## Balanced Panel: n = 50, T = 2, N = 100
##
## Effects:
## var std.dev share
## idiosyncratic 2.178e+03 4.667e+01 0.001
## individual 2.557e+06 1.599e+03 0.999
## theta: 0.9794
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -99.965 -27.679 -12.074 13.040 163.003
##
## Coefficients:
## Estimate Std. Error z-value Pr(>|z|)
## (Intercept) 1602.5840 283.6850 5.6492 1.612e-08 ***
## X -3.0502 2.1654 -1.4086 0.1589
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 223930
## Residual Sum of Squares: 219480
## R-Squared: 0.019846
## Adj. R-Squared: 0.0098442
## Chisq: 1.98427 on 1 DF, p-value: 0.15894
phtest(random, fixed)
##
## Hausman Test
##
## data: Y ~ X
## chisq = 11.143, df = 1, p-value = 0.0008437
## alternative hypothesis: one model is inconsistent
With p = 0.0008 < 0.05, the Hausman test rejects the random effects specification, so the fixed effects model best describes the egg prices.
## (a)
# Load necessary libraries
library(plm)
# Load the dataset (replace with your file path)
vote_data <- read.csv("vote.csv")
# Convert to panel data format
panel_data <- pdata.frame(vote_data, index = c("district", "year"))
# Fixed Effects Model
fe_model <- plm(vote ~ d90 + log(inexp) + log(chexp) + ln_eshr, data = panel_data, model = "within")
summary(fe_model)
# Random Effects Model
re_model <- plm(vote ~ d90 + log(inexp) + log(chexp) + ln_eshr, data = panel_data, model = "random")
summary(re_model)
# Extract district-specific effects from the fixed effects model
fixef(fe_model)
## (b)
# Perform Hausman Test
hausman_test <- phtest(fe_model, re_model)
print(hausman_test)
## (c)
# Check significance of variables in the chosen model (e.g., fixed effects)
summary(fe_model)
## (d)
# Fit a restricted model (excluding log(inexp) and log(chexp))
restricted_model <- plm(vote ~ d90 + ln_eshr, data = panel_data, model = "within")
# Perform an F-test of the exclusion restrictions
# (plm objects have no anova() method; waldtest() from lmtest compares nested plm fits)
lmtest::waldtest(fe_model, restricted_model)
## (e)
# Re-estimate the model with ln_eshr as the only independent variable
simple_model <- plm(vote ~ ln_eshr, data = panel_data, model = "within")
summary(simple_model)
library(plm)
library(foreign)
library(car)
library(haven)
library(gplots)
#Q1
data<-read.table("G:/My Drive/Turin/4th Year/AST 431/2nd_inc/2022/data/Eggs.txt")
#View(data)
#a
m_90<-lm(Y~X, data = data[data$t==1990,])
summary(m_90)
##
## Call:
## lm(formula = Y ~ X, data = data[data$t == 1990, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -1903.6 -949.7 -515.6 294.2 5779.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3118.48 873.08 3.572 0.000819 ***
## X -22.50 10.77 -2.089 0.041989 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1607 on 48 degrees of freedom
## Multiple R-squared: 0.08337, Adjusted R-squared: 0.06428
## F-statistic: 4.366 on 1 and 48 DF, p-value: 0.04199
m_91<-lm(Y~X, data = data[data$t==1991,])
summary(m_91)
##
## Call:
## lm(formula = Y ~ X, data = data[data$t == 1991, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -1983.6 -944.8 -494.0 333.9 5658.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3149.36 810.91 3.884 0.000314 ***
## X -23.35 10.27 -2.274 0.027461 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1594 on 48 degrees of freedom
## Multiple R-squared: 0.09727, Adjusted R-squared: 0.07847
## F-statistic: 5.172 on 1 and 48 DF, p-value: 0.02746
#b
m_p<-lm(Y~X, data = data)
summary(m_p)
##
## Call:
## lm(formula = Y ~ X, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1988.3 -960.6 -499.0 321.5 5791.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3132.258 587.868 5.328 6.34e-07 ***
## X -22.895 7.345 -3.117 0.0024 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1584 on 98 degrees of freedom
## Multiple R-squared: 0.09021, Adjusted R-squared: 0.08093
## F-statistic: 9.718 on 1 and 98 DF, p-value: 0.002395
pooling<-plm(Y~X,index = c("id", "t"), data = data, model = "pooling")
summary(pooling)
## Pooling Model
##
## Call:
## plm(formula = Y ~ X, data = data, model = "pooling", index = c("id",
## "t"))
##
## Balanced Panel: n = 50, T = 2, N = 100
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -1988.29 -960.61 -498.99 321.48 5791.30
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## (Intercept) 3132.2582 587.8677 5.3282 6.337e-07 ***
## X -22.8953 7.3445 -3.1173 0.002395 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 270310000
## Residual Sum of Squares: 245920000
## R-Squared: 0.090215
## Adj. R-Squared: 0.080931
## F-statistic: 9.71774 on 1 and 98 DF, p-value: 0.0023954
#c
data$t<-as.factor(data$t)
fe_m<-plm(Y~X ,index = c("id", "t"), data = data, model = "within", effect = "time")
summary(fe_m)
## Oneway (time) effect Within Model
##
## Call:
## plm(formula = Y ~ X, data = data, effect = "time", model = "within",
## index = c("id", "t"))
##
## Balanced Panel: n = 50, T = 2, N = 100
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -1972.25 -955.86 -500.01 308.60 5773.34
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## X -22.9404 7.3934 -3.1028 0.002512 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 270290000
## Residual Sum of Squares: 245890000
## R-Squared: 0.09029
## Adj. R-Squared: 0.071533
## F-statistic: 9.62734 on 1 and 97 DF, p-value: 0.0025115
#d
fi_m<-plm(Y~X, index = c("id", "t"), data = data, model = "within")
summary(fi_m)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = Y ~ X, data = data, model = "within", index = c("id",
## "t"))
##
## Balanced Panel: n = 50, T = 2, N = 100
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -97.466 -11.476 0.000 11.476 97.466
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## X -2.1920 2.1806 -1.0052 0.3197
##
## Total Sum of Squares: 108920
## Residual Sum of Squares: 106720
## R-Squared: 0.020205
## Adj. R-Squared: -0.97959
## F-statistic: 1.01048 on 1 and 49 DF, p-value: 0.31973
fi_m1<-plm(Y~X, index = c("id", "t"), data = data, model = "within", effect = "individual" )
summary(fi_m1)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = Y ~ X, data = data, effect = "individual", model = "within",
## index = c("id", "t"))
##
## Balanced Panel: n = 50, T = 2, N = 100
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -97.466 -11.476 0.000 11.476 97.466
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## X -2.1920 2.1806 -1.0052 0.3197
##
## Total Sum of Squares: 108920
## Residual Sum of Squares: 106720
## R-Squared: 0.020205
## Adj. R-Squared: -0.97959
## F-statistic: 1.01048 on 1 and 49 DF, p-value: 0.31973
#e
ra_m<-plm(Y~X ,index = c("id", "t"), data = data, model = "random")
summary(ra_m)
## Oneway (individual) effect Random Effect Model
## (Swamy-Arora's transformation)
##
## Call:
## plm(formula = Y ~ X, data = data, model = "random", index = c("id",
## "t"))
##
## Balanced Panel: n = 50, T = 2, N = 100
##
## Effects:
## var std.dev share
## idiosyncratic 2.178e+03 4.667e+01 0.001
## individual 2.557e+06 1.599e+03 0.999
## theta: 0.9794
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -99.965 -27.679 -12.074 13.040 163.003
##
## Coefficients:
## Estimate Std. Error z-value Pr(>|z|)
## (Intercept) 1602.5840 283.6850 5.6492 1.612e-08 ***
## X -3.0502 2.1654 -1.4086 0.1589
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 223930
## Residual Sum of Squares: 219480
## R-Squared: 0.019846
## Adj. R-Squared: 0.0098442
## Chisq: 1.98427 on 1 DF, p-value: 0.15894
phtest(ra_m, fe_m)
##
## Hausman Test
##
## data: Y ~ X
## chisq = 7.9164, df = 1, p-value = 0.004899
## alternative hypothesis: one model is inconsistent
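Again the Hausman test rejects the random effects specification (p = 0.0049), so the fixed effects model is preferred.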
data1<- read_dta("G:/My Drive/Turin/4th Year/AST 431/2nd_inc/2022/data/Panel101.dta")
#View(data1)
panal<-pdata.frame(data1, index = c("country", "year"))
## Graphical Exploration
coplot(y~year|country,type="l", data = panal)
scatterplot(y~year|country, boxplot=F, smooth=T, data = panal)
## [1] "11" "54" "45" "55" "46" "36" "59"
## Better model for the Dataset
pool<-plm(y~x1+x2+x3, data = panal, model = "pooling")
summary(pool)
## Pooling Model
##
## Call:
## plm(formula = y ~ x1 + x2 + x3, data = panal, model = "pooling")
##
## Balanced Panel: n = 7, T = 10, N = 70
##
## Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -9.50e+09 -1.62e+09 1.74e+08 0.00e+00 1.50e+09 7.36e+09
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## (Intercept) 1400527665 762362572 1.8371 0.0707 .
## x1 559063957 915594368 0.6106 0.5436
## x2 87445761 349461177 0.2502 0.8032
## x3 92622347 293654989 0.3154 0.7534
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 6.2729e+20
## Residual Sum of Squares: 6.2251e+20
## R-Squared: 0.0076238
## Adj. R-Squared: -0.037484
## F-statistic: 0.169012 on 3 and 66 DF, p-value: 0.91693
random<-plm(y~x1+x2+x3, data = panal, model = "random")
summary(random)
## Oneway (individual) effect Random Effect Model
## (Swamy-Arora's transformation)
##
## Call:
## plm(formula = y ~ x1 + x2 + x3, data = panal, model = "random")
##
## Balanced Panel: n = 7, T = 10, N = 70
##
## Effects:
## var std.dev share
## idiosyncratic 7.847e+18 2.801e+09 0.815
## individual 1.780e+18 1.334e+09 0.185
## theta: 0.4469
##
## Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8.67e+09 -1.39e+09 1.14e+08 0.00e+00 1.69e+09 7.01e+09
##
## Coefficients:
## Estimate Std. Error z-value Pr(>|z|)
## (Intercept) 492524201 962576196 0.5117 0.60888
## x1 1734069418 999303736 1.7353 0.08269 .
## x2 420030518 497228766 0.8447 0.39825
## x3 226606952 324458564 0.6984 0.48492
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 5.5536e+20
## Residual Sum of Squares: 5.2872e+20
## R-Squared: 0.047967
## Adj. R-Squared: 0.0046933
## Chisq: 3.32536 on 3 DF, p-value: 0.34413
fixed<-plm(y~x1+x2+x3, data = panal, model = "within")
summary(fixed)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = y ~ x1 + x2 + x3, data = panal, model = "within")
##
## Balanced Panel: n = 7, T = 10, N = 70
##
## Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8.31e+09 -1.12e+09 5.02e+08 0.00e+00 1.37e+09 6.31e+09
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## x1 2424529174 1156516240 2.0964 0.04027 *
## x2 1822699670 2028055946 0.8987 0.37238
## x3 309718343 368552481 0.8404 0.40404
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 5.2364e+20
## Residual Sum of Squares: 4.708e+20
## R-Squared: 0.10092
## Adj. R-Squared: -0.033937
## F-statistic: 2.24507 on 3 and 60 DF, p-value: 0.092235
plmtest(pool)
##
## Lagrange Multiplier Test - (Honda)
##
## data: y ~ x1 + x2 + x3
## normal = 1.7534, p-value = 0.03976
## alternative hypothesis: significant effects
pFtest(fixed, pool)
##
## F test for individual effects
##
## data: y ~ x1 + x2 + x3
## F = 3.2225, df1 = 6, df2 = 60, p-value = 0.008242
## alternative hypothesis: significant effects
phtest(random, fixed)
##
## Hausman Test
##
## data: y ~ x1 + x2 + x3
## chisq = 4.7863, df = 3, p-value = 0.1881
## alternative hypothesis: one model is inconsistent
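Taken together: the LM test (p = 0.040) and the F test (p = 0.008) both indicate significant individual effects, so the pooling model is rejected, while the Hausman test (p = 0.188) does not reject the consistency of the random effects estimator. The random effects model is therefore the better choice for this dataset.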
MA(1) process with a positive coefficient
par(mfrow=c(2,2))
ma_sim<-arima.sim(list(order=c(0,0,1), ma=0.9), n=100)
plot(ma_sim, ylab=expression(Y[t]),
xlab="Time",
type="o",
main="MA(1) simulation")
acf(ma_sim,main="Sample ACF")
MA(1) process with a negative coefficient
ma_sim1<-arima.sim(list(order=c(0,0,1), ma=-0.9), n=100)
plot(ma_sim1, ylab=expression(Y[t]),
xlab="Time",
type="o",
main="MA(1) simulation")
acf(ma_sim1,main="Sample ACF")
MA(2) process
ma_sim2<-arima.sim(list(order=c(0,0,2), ma=c(-0.9, 0.7)), n=100)
plot(ma_sim2, ylab=expression(Y[t]), xlab="Time",
main="MA(2) simulation", type="o")
acf(ma_sim2, main="Sample ACF")
acf(ma_sim2, main="Sample ACF")[1:30] # print the autocorrelation values (lags beyond the default lag.max of ~20 are NA)
##
## Autocorrelations of series 'ma_sim2', by lag
##
## 1 2 3 4 5 6 7 8 9 10 11
## -0.639 0.239 0.066 -0.040 -0.037 0.019 0.051 -0.142 0.186 -0.157 0.094
## 12 13 14 15 16 17 18 19 20 NA NA
## -0.056 0.007 0.006 -0.015 -0.048 0.189 -0.249 0.249 -0.151 NA NA
## NA NA NA NA NA NA NA NA
## NA NA NA NA NA NA NA NA
plot(acf(ma_sim2, main="Sample ACF")[1:30])
AR(1) process with a positive coefficient
ma_sim3<-arima.sim(list(order=c(1,0,0), ar=0.9), n=100)
plot(ma_sim3, ylab=expression(Y[t]),
xlab="Time",
type="o",
main="AR(1) simulation")
acf(ma_sim3,main="Sample ACF")
AR(1) process with a negative coefficient
ma_sim4<-arima.sim(list(order=c(1,0,0), ar=-0.9), n=100)
plot(ma_sim4, ylab=expression(Y[t]),
xlab="Time",
type="o",
main="AR(1) simulation")
acf(ma_sim4, main="Sample ACF")
AR(2) process
ma_sim5<-arima.sim(list(order=c(2,0,0), ar=c(0.7, -0.9)), n=100)
plot(ma_sim5, ylab=expression(Y[t]), xlab="Time",
main="AR(2) simulation", type="o")
acf(ma_sim5, main="Sample ACF")
acf(ma_sim5, main="Sample ACF")[1:20] # print the autocorrelation values
##
## Autocorrelations of series 'ma_sim5', by lag
##
## 1 2 3 4 5 6 7 8 9 10 11
## 0.365 -0.669 -0.818 0.059 0.806 0.511 -0.411 -0.775 -0.145 0.623 0.563
## 12 13 14 15 16 17 18 19 20
## -0.199 -0.661 -0.276 0.427 0.547 -0.024 -0.531 -0.337 0.270
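For comparison, the theoretical ACF of this AR(2) can be computed with ARMAacf(), using the same coefficients (a minimal sketch):
# Theoretical ACF of the AR(2) with ar = c(0.7, -0.9)
ar2_theo <- ARMAacf(ar = c(0.7, -0.9), lag.max = 20)
plot(ar2_theo, type = "h", main = "Theoretical ACF of AR(2)")
abline(h = 0)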
ARMA(1,1) process
x<-arima.sim(n=1000, list(ar=c(0.5), ma=c(0.2)))
acf(x)
pacf(x)
ARMA(1,1) process (different coefficients)
x1<-arima.sim(n=100, list(order=c(1,0,1), ar=0.9,ma=0.2))
acf(x1)
pacf(x1)
Fitting AR(1) on raw data
data <- c(14.3, 14.6, 13.5, 14.2, 12.1, 14.19, 14.55, 13.6, 14.59, 16.6, 15.4, 12.89, 12.2, 12.15, 10.89, 11.11, 11.98, 13.44, 14.05, 13.91, 14.1, 12.66, 14.6, 16.7, 15.4, 17.1, 15.45, 16.76, 15.7, 14.1, 15.3, 16.4, 17.09, 16.8, 16.98, 16.39, 15.8, 15.1, 13.7, 13.79, 15.7)
acf(data)
pacf(data)
a<-arima(data, c(1,0,0))
names(a)
## [1] "coef" "sigma2" "var.coef" "mask" "loglik" "aic"
## [7] "arma" "residuals" "call" "series" "code" "n.cond"
## [13] "nobs" "model"
acf(a$residuals)
pacf(a$residuals)
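A Ljung-Box test gives a formal check of the residuals (a minimal sketch; the choice of 10 lags is arbitrary):
# A large p-value is consistent with white-noise residuals,
# i.e. the AR(1) fit is adequate
Box.test(a$residuals, lag = 10, type = "Ljung-Box")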
Kings Data
data1<-read.table("G:/My Drive/Turin/4th Year/AST 431/After 2nd inc/data/Kings.txt")
t<-ts(data1)
acf(t)
pacf(t)
a1<-arima(data1, c(1,0,0))
names(a1)
## [1] "coef" "sigma2" "var.coef" "mask" "loglik" "aic"
## [7] "arma" "residuals" "call" "series" "code" "n.cond"
## [13] "nobs" "model"
acf(a1$residuals)
pacf(a1$residuals)
Generate a white noise process
WN <- arima.sim(model = list(order = c(0, 0, 0)), n = 200)
plot(WN,col=5, main="White Noise Series")
For the co2_csv data.
library(timeSeries)
library(readr)
co2_csv <- read_csv("G:/My Drive/Turin/4th Year/AST 431/After 2nd inc/data/co2_mm_mlo.csv")
co2 <- co2_csv$average
date_co2 <- timeSequence(from="1958-03-01",
to="2021-07-30", by="month")
co2.ts <- timeSeries(data=co2,date_co2, units="co2")
Plotting various components of the Time Series
plot(co2.ts, main = "Time Series Plot of co2")
co2.stl<-stl(as.ts(co2.ts) ,s.window = 12,
t.window=0.75*length(as.ts(co2.ts)),
t.degree=0)
plot(co2.stl$time.series[,"trend"], main = "Trend")
plot(co2.stl$time.series[,"seasonal"], main = "Seasonal")
plot(co2.stl$time.series[,"remainder"], main = "Remainder")
co2d <- diff(co2.stl$time.series[,"remainder"])
plot(co2d)
Checking PACF and ACF
pacf(as.vector(co2d),main="PACF Plot")
acf(co2d, main="ACF Plot")
Fitting an autoregressive model
co2dar<-ar(co2d)
summary(co2dar)
## Length Class Mode
## order 1 -none- numeric
## ar 15 -none- numeric
## var.pred 1 -none- numeric
## x.mean 1 -none- numeric
## aic 29 -none- numeric
## n.used 1 -none- numeric
## n.obs 1 -none- numeric
## order.max 1 -none- numeric
## partialacf 28 -none- numeric
## resid 760 ts numeric
## method 1 -none- character
## series 1 -none- character
## frequency 1 -none- numeric
## call 2 -none- call
## asy.var.coef 225 -none- numeric
Fitting ARIMA(1,0,1) to this data
fit<-arima(co2d,order=c(1,0,1))
tsdiag(fit)
qqnorm(fit$residuals)
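The fitted ARIMA(1,0,1) can also be used for forecasting; a minimal sketch with predict(), using an arbitrary 12-step horizon:
p <- predict(fit, n.ahead = 12) # point forecasts and standard errors
ts.plot(co2d, p$pred, col = c("black", "red"),
        main = "Remainder (differenced) with 12-step forecast")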
# ASSIGNMENT
The related PDF is here.
library(readr)
library(timeSeries)
library(tidyverse)
Rainfall_dhaka <- read_csv("G:/My Drive/Turin/4th Year/AST 431/1st_inc/Time_series/Dataset/Rainfall_dhaka.csv")
#view(Rainfall_dhaka)
rainfall <- Rainfall_dhaka[, 2:34] %>%
mutate_if(is.character, as.numeric) %>%
pivot_longer(
cols = !c(Year, Month),
names_to = "day",
values_to = "rain"
) %>%
mutate(day = as.integer(day)) %>%
group_by(Year, Month) %>%
mutate(meanrain = mean(rain, na.rm = TRUE)) %>%
ungroup() %>%
mutate(date = make_date(Year, Month, day))
rainfall
rainfall <- rainfall %>% arrange(Year, Month, day)
# Step 2: Remove duplicates (keep one meanrain per month)
monthly_data <- rainfall %>% distinct(Year, Month, .keep_all = TRUE)
# Create time series object
rain.ts <- ts(monthly_data$meanrain, frequency = 12, start = c(1976, 1))
plot(rain.ts, main="Time series of rainfall")
rain.stl<-stl(as.ts(rain.ts) ,s.window = 12,
t.window=0.75*length(as.ts(rain.ts)),
t.degree=0)
plot(rain.stl$time.series[, "trend"], main="Trend")
plot(rain.stl$time.series[, "seasonal"], main="Seasonal")
rain_o<-rain.stl$time.series[, "remainder"]
plot(rain_o, main="Remainder")
rain_d<-diff(rain.stl$time.series[, "remainder"])
plot(rain_d)
pacf(as.vector(rain_d), main="PACF plot")
acf(rain_d, main="ACF plot")
The remainder component is the series we model. We fit several closely related ARIMA models and compare which one best describes it.
m1 <- arima(rain_o,order = c(0,0,1))
m2 <- arima(rain_o,order = c(1,0,0))
m3 <- arima(rain_o,order = c(1,1,1))
m4 <- arima(rain_o,order = c(0,0,2))
m5 <- arima(rain_o,order = c(1,0,2))
m6 <- arima(rain_o,order = c(1,1,2))
m7 <- arima(rain_o,order = c(2,0,2))
m8 <- arima(rain_o,order = c(2,1,2))
m9 <- arima(rain_o,order = c(0,0,3))
Comparing AIC values for different models.
AIC(m1)
## [1] 2689.181
AIC(m2)
## [1] 2686.034
AIC(m3)
## [1] 2687.775
AIC(m4)
## [1] 2684.678
AIC(m5)
## [1] 2686.675
AIC(m6)
## [1] 2687.325
AIC(m7)
## [1] 2688.624
AIC(m8)
## [1] 2688.275
AIC(m9)
## [1] 2686.675
The ARIMA(0,0,2) model, i.e. MA(2), has the lowest AIC (2684.678), so we select it as the comparatively best-fitting model.
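The same comparison can be scripted instead of printed model by model; a minimal sketch, reusing rain_o and the candidate orders above:
orders <- list(c(0,0,1), c(1,0,0), c(1,1,1), c(0,0,2), c(1,0,2),
               c(1,1,2), c(2,0,2), c(2,1,2), c(0,0,3))
aics <- sapply(orders, function(o) AIC(arima(rain_o, order = o)))
names(aics) <- sapply(orders, paste, collapse = ",")
sort(aics) # smallest AIC first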
# can check by this
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
auto.arima(rain.ts)
## Series: rain.ts
## ARIMA(0,0,2)(2,1,0)[12]
##
## Coefficients:
## ma1 ma2 sar1 sar2
## 0.1532 0.1038 -0.7145 -0.2734
## s.e. 0.0440 0.0444 0.0424 0.0430
##
## sigma^2 = 14.99: log likelihood = -1426.41
## AIC=2862.82 AICc=2862.94 BIC=2884.03
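Note that auto.arima() applied to the raw monthly series also selects an MA(2) component (combined with a seasonal AR part), consistent with the MA(2) chosen above for the deseasonalized remainder.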
broom::tidy(m4)
tsdiag(m4)
qqnorm(m4$residuals)
M1 <- arima(rain_d,order = c(0,0,1))
M2 <- arima(rain_d,order = c(1,0,0))
M3 <- arima(rain_d,order = c(1,1,1))
M4 <- arima(rain_d,order = c(0,0,2))
M5 <- arima(rain_d,order = c(1,0,2))
M6 <- arima(rain_d,order = c(1,1,2))
M7 <- arima(rain_d,order = c(2,0,2))
M8 <- arima(rain_d,order = c(2,1,2))
M9 <- arima(rain_d,order = c(0,0,3))
Comparing AIC values for different models.
AIC(M1)
## [1] 2707.506
AIC(M2)
## [1] 2835.506
AIC(M3)
## [1] 2838.118
AIC(M4)
## [1] 2692.726
AIC(M5)
## [1] 2689.091
AIC(M6)
## [1] 2704.044
AIC(M7)
## [1] 2690.033
AIC(M8)
## [1] 2702.281
AIC(M9)
## [1] 2688.091
For the differenced remainder, ARIMA(0,0,3) has the lowest AIC (2688.091), so we select it as the comparatively best-fitting model for that series.
From the provided lecture.
Seasonal MA(1) with θ = 0.5
set.seed(123)
n <- 200
# NOTE: base arima.sim() ignores a `seasonal` component in `model`, so the
# seasonal MA(1) at lag 12 is written as an MA(12) whose only nonzero
# coefficient is at lag 12
y <- arima.sim(n = n, model = list(order = c(0, 0, 12),
                                   ma = c(rep(0, 11), 0.5)))
# Visualize the time series data
plot.ts(y, main = expression("Generated Time Series " ~ ARIMA(0, 0, 0)(0, 0, 1)[12]))
acf(y, main="ACF")
pacf(y, main="PACF")
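In theory this seasonal MA(1) has a single nonzero autocorrelation at lag 12, equal to 0.5/(1 + 0.5^2) = 0.4, which ARMAacf() can confirm (a sketch):
# Theoretical ACF of the seasonal MA(1), written as an MA(12)
round(ARMAacf(ma = c(rep(0, 11), 0.5), lag.max = 24), 3)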
Seasonal AR(1) with φ = 0.5
set.seed(123)
n <- 200
# Same workaround: the seasonal AR(1) at lag 12 is written as an AR(12)
# whose only nonzero coefficient is at lag 12
y1 <- arima.sim(n = n, model = list(order = c(12, 0, 0),
                                    ar = c(rep(0, 11), 0.5)))
# Visualize the time series data
plot.ts(y1, main=expression("Generated Time Series" ~ ARIMA(0, 0, 0)(1, 0, 0)[12]))
pacf(y1, main="PACF")
acf(y1, main="ACF")
airline_passengers data.
library(readr)
airline_passengers <- read_csv("G:/My Drive/Turin/4th Year/AST 431/After 2nd inc/data/airline-passengers.csv")
## Rows: 144 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Month
## dbl (1): Passengers
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#view(airline_passengers)
# use the Passengers column only; passing the whole data frame would also
# include the character Month column
ap_t <- ts(airline_passengers$Passengers, start = c(1949, 1), frequency = 12)
plot.ts(ap_t, main = "Airline Passengers")
pacf(ap_t)
acf(ap_t)
library(zoo)
library(forecast)
library(Metrics)
A seasonal MA process, written as an MA(12) with nonzero coefficients at lags 1 and 12
arima.m<-arima.sim(list(order = c(0,0,12), ma = c(0.7,rep(0,10),0.9)), n = 200)
acf(arima.m)
PACF of a multiplicative seasonal AR process: (1 - 0.6B)(1 - 0.5B^12) expands to an AR(13) with coefficients 0.6 at lag 1, 0.5 at lag 12 and -0.3 at lag 13.
ar_pacf <- ARMAacf(ar = c(0.6, rep(0, 10), 0.5, -0.30), lag.max = 30, pacf = TRUE)
plot(ar_pacf, type='h')
library(readr)
airline_passengers1 <- read_csv("G:/My Drive/Turin/4th Year/AST 431/After 2nd inc/data/airline-passengers.csv", show_col_types = FALSE)
#view(airline_passengers)
airline_passengers1$Month <- as.yearmon(airline_passengers1$Month)
max(airline_passengers1$Month)
## [1] "Dec 1960"
head(airline_passengers1)
Split the data frame into training and test sets, then predict the test set from the training data. This is a naive baseline: the overall mean of the training set is used as the forecast for every future month.
test_set_start_date <- as.yearmon("Jan 1959")
train_set <- subset(airline_passengers1, Month < test_set_start_date)
test_set <- subset(airline_passengers1, Month >= test_set_start_date)
dim(train_set)
## [1] 120 2
dim(test_set)
## [1] 24 2
train_set_avg <- mean(train_set$Passengers)
simple_avg_predictions <- data.frame(
Month = test_set$Month,
Passengers = rep(train_set_avg, nrow(test_set))
)
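As a quick accuracy baseline (a sketch; the Metrics package was loaded earlier), compute the RMSE of this naive mean forecast on the test set:
rmse(test_set$Passengers, simple_avg_predictions$Passengers)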
ggplot() +
geom_line(data = train_set, aes(x = Month, y = Passengers, color = "Training"), size = 1) +
geom_line(data = test_set, aes(x = Month, y = Passengers, color = "Testing"), size = 1) +
labs(
title = "Airline Passengers - Training and Testing Sets",
x = "Date",
y = "Passengers"
) +
scale_color_manual(values = c("Training" = "#12355B", "Testing" = "#D72638"), name = "Airline passengers") +
theme_minimal() +
theme(plot.title = element_text(size = 20))
Convert the training set to a time series, then fit a seasonal ARIMA(1,0,0)(2,1,0)_{12}.
series_ts <- ts(train_set$Passengers, frequency = 12)
#series_ts
mod2<-Arima(series_ts,order=c(1, 0, 0),
seasonal=list(order=c(2, 1, 0), period=12))
mod2
## Series: series_ts
## ARIMA(1,0,0)(2,1,0)[12]
##
## Coefficients:
## ar1 sar1 sar2
## 0.9418 -0.0637 0.0781
## s.e. 0.0313 0.1079 0.1107
##
## sigma^2 = 107.1: log likelihood = -405.29
## AIC=818.58 AICc=818.97 BIC=829.31
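To compare against the naive baseline, forecast the 24 test months from mod2 and compute the same RMSE (a sketch, assuming the objects above):
fc <- forecast(mod2, h = nrow(test_set))
rmse(test_set$Passengers, as.numeric(fc$mean))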
Generate hypothetical monthly sales data with trend, seasonality and random white noise, starting from 1 January 2020.
library(forecast)
library(tseries)
set.seed(123) # For reproducibility
months <- seq(as.Date("2020-01-01"),
by = "month",
length.out = 36)
sales <- 200 + (1:36) * 3 + 20 * sin(2 * pi * (1:36) / 12) + rnorm(36, mean = 0, sd = 10)
data <- data.frame(Date = months, Sales = sales)
head(data)
The simulated series combines:
**Time index**: a sequence of time steps from 1 to 36 (monthly over 3 years).
**Trend**: starts at 200 and increases by 3 each time step.
**Seasonal**: a sinusoid with period 12 (one full cycle per year) and amplitude 20, so sales oscillate up to 20 above and below the trend each year.
**Noise**: random variation drawn from a normal distribution (mean 0, sd 10).
Sales = Trend + Seasonality + Random Noise
ts_data <- ts(data$Sales, start = c(2020, 1), frequency = 12)
autoplot(ts_data, series = "Sales") +
ggtitle("Synthetic Monthly Sales Data") +
xlab("Time") +
ylab("Sales") +
scale_color_manual(values = "blue") +
theme_minimal(base_size = 15) + # Set base font size for better visibility
theme(legend.position = "bottom")
Check for stationarity with the augmented Dickey-Fuller test.
adf_test <- adf.test(ts_data)
## Warning in adf.test(ts_data): p-value smaller than printed p-value
print(adf_test)
##
## Augmented Dickey-Fuller Test
##
## data: ts_data
## Dickey-Fuller = -5.3005, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary
The small p-value rejects the unit-root null hypothesis, so the series can be treated as stationary. auto.arima() then automatically selects and fits the seasonal ARIMA it judges best.
auto_model <- auto.arima(ts_data)
summary(auto_model)
## Series: ts_data
## ARIMA(0,0,0)(1,1,0)[12] with drift
##
## Coefficients:
## sar1 drift
## -0.8392 2.9958
## s.e. 0.0854 0.1095
##
## sigma^2 = 83.1: log likelihood = -93.36
## AIC=192.71 AICc=193.91 BIC=196.25
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.4441953 7.126158 4.467821 -0.2139526 1.667343 0.1240441
## ACF1
## Training set 0.1815494
We fit SARIMA(1,1,1)(1,1,1)_{12} manually.
sarima_model <- Arima(ts_data, order=c(1,1,1), seasonal=c(1,1,1))
summary(sarima_model)
## Series: ts_data
## ARIMA(1,1,1)(1,1,1)[12]
##
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
## ar1 ma1 sar1 sma1
## 0.0267 -0.7219 -0.8417 -0.0275
## s.e. 0.3068 0.2199 NaN NaN
##
## sigma^2 = 97.05: log likelihood = -91.09
## AIC=192.18 AICc=195.71 BIC=197.86
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.3929868 7.156714 4.692756 0.07424912 1.739757 0.1302892
## ACF1
## Training set -0.02246882
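Note the NaN standard errors for the seasonal terms (see the warning above): with only 36 observations, the manually specified SARIMA(1,1,1)(1,1,1)_{12} is likely over-parameterized compared with the auto.arima() choice.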
Forecast future values of SARIMA(1,1,1)(1,1,1)_{12}.
forecasted_values <- forecast(sarima_model, h=12)
autoplot(forecasted_values) +
ggtitle("Sales Forecast for Next 12 Months") +
xlab("Time") +
ylab("Sales") +
theme_minimal()