library(AER)
## Loading required package: car
## Loading required package: carData
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Loading required package: survival
library(gplots)
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
library(lmtest)
library(HH)
## Loading required package: lattice
## Loading required package: grid
## Loading required package: latticeExtra
## Loading required package: multcomp
## Loading required package: mvtnorm
## Loading required package: TH.data
## Loading required package: MASS
##
## Attaching package: 'TH.data'
## The following object is masked from 'package:MASS':
##
## geyser
## Loading required package: gridExtra
##
## Attaching package: 'HH'
## The following object is masked from 'package:gplots':
##
## residplot
## The following objects are masked from 'package:car':
##
## logit, vif
library("leaps")
library(rgl)
library(sandwich)
data(HousePrices)
head(HousePrices)
## price lotsize bedrooms bathrooms stories driveway recreation fullbase gasheat
## 1 42000 5850 3 1 2 yes no yes no
## 2 38500 4000 2 1 1 yes no no no
## 3 49500 3060 3 1 1 yes no no no
## 4 60500 6650 3 1 2 yes yes no no
## 5 61000 6360 2 1 1 yes no no no
## 6 66000 4160 3 1 1 yes yes yes no
## aircon garage prefer
## 1 no 1 no
## 2 no 0 no
## 3 no 0 no
## 4 no 0 no
## 5 no 0 no
## 6 yes 0 no
any(is.na(HousePrices))
## [1] FALSE
lotsize <- HousePrices$lotsize
bedrooms <- HousePrices$bedrooms
driveway <- HousePrices$driveway
driveway <- ifelse(driveway == "yes", 1, 0)
boxplot(lotsize, main = "boxplot of lotsize")
hist(lotsize)
boxplot(bedrooms, main = "boxplot of bedrooms")
hist(bedrooms)
boxplot(driveway, main = "boxplot of driveway")
hist(driveway)
Analysis of lotsize: The median appears to be around 5,000, and there are some statistical outliers above the upper whisker of the boxplot. The lotsize histogram is skewed to the right, with lot sizes of 3000-4000 being the most common.
Analysis of bedrooms: Again, the median number of bedrooms is around 3, with some outliers above the upper whisker. The histogram shows that the most common number of bedrooms is 3.
Analysis of driveway (yes or no): Most of the houses have driveways, and those that do not show up as outliers in the boxplot.
extracted_variables <- data.frame(lotsize = lotsize, bedrooms = bedrooms, driveway = driveway)
cor(extracted_variables)
## lotsize bedrooms driveway
## lotsize 1.0000000 0.15185149 0.28877775
## bedrooms 0.1518515 1.00000000 -0.01199629
## driveway 0.2887778 -0.01199629 1.00000000
The most noticeable correlation is between lotsize and driveway (0.29): houses on larger lots are somewhat more likely to have a driveway, and vice versa. The correlation between bedrooms and driveway is essentially zero (-0.01), so more bedrooms is associated with, at most, a very slight decrease in the chance of having a driveway.
price <- HousePrices$price
price_lm <- lm(price ~ lotsize)
price_size_relationship <- summary(price_lm)
price_size_relationship
##
## Call:
## lm(formula = price ~ lotsize)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69551 -14626 -2858 9752 106901
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.414e+04 2.491e+03 13.7 <2e-16 ***
## lotsize 6.599e+00 4.458e-01 14.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22570 on 544 degrees of freedom
## Multiple R-squared: 0.2871, Adjusted R-squared: 0.2858
## F-statistic: 219.1 on 1 and 544 DF, p-value: < 2.2e-16
This gives the linear relationship between lotsize and price. Based on the p-values of the coefficients and of the overall F-test, the model is significant. Lotsize appears to have a positive effect on house prices (about 6.6 dollars per additional unit of lot size). The raw coefficient alone can be misleading, but the adjusted R^2 of 0.286 tells us lotsize explains roughly 29% of the variation in price.
slope <- coef(price_size_relationship)[2]
elasticity <- slope * (mean(price))/(mean(lotsize))
elasticity
## [1] 87.28066
Note that the formula above scales the slope by mean(price)/mean(lotsize), which is the inverse of the usual elasticity at the means, so 87.28 cannot be read as a percentage response. The conventional elasticity, slope * mean(lotsize)/mean(price), works out to roughly 6.6^2 / 87.3, or about 0.5: a 1% increase in lotsize is associated with roughly a 0.5% increase in price on average. This is consistent with the slope of about 6.6, which says each additional unit of lot size raises the predicted price by about 6.6 dollars; the elasticity simply re-expresses that slope in proportional terms. Both measures say larger lots go with higher prices.
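As a minimal sketch (not part of the original run), the conventional calculation would be:
# conventional elasticity at the means: slope * mean(x) / mean(y)
elasticity_at_means <- coef(price_lm)["lotsize"] * mean(lotsize) / mean(price)
elasticity_at_means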
confint(price_lm, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 29242.909795 39029.473335
## lotsize 5.722976 7.474559
The 95% confidence interval for the slope of lotsize on price runs from about 5.7 to 7.5. Since the entire interval lies above zero, lotsize clearly has a positive impact on price.
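As a sketch of what confint() computes (output not shown), the interval can be reproduced from the coefficient table as estimate ± t critical value × standard error:
# reproduce the 95% CI for the lotsize slope by hand
est <- coef(summary(price_lm))["lotsize", "Estimate"]
se  <- coef(summary(price_lm))["lotsize", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = df.residual(price_lm)) * se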
mutliple_reg <- lm(price ~ lotsize + bedrooms + driveway)
summary(mutliple_reg)
##
## Call:
## lm(formula = price ~ lotsize + bedrooms + driveway)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62594 -12374 -1890 9239 96399
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3617.4498 4422.2588 -0.818 0.414
## lotsize 5.4164 0.4345 12.465 < 2e-16 ***
## bedrooms 10927.1076 1223.2994 8.932 < 2e-16 ***
## driveway 13320.8107 2673.2163 4.983 8.43e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20780 on 542 degrees of freedom
## Multiple R-squared: 0.3979, Adjusted R-squared: 0.3945
## F-statistic: 119.4 on 3 and 542 DF, p-value: < 2.2e-16
Compared to the simple linear model, the multiple regression with three explanatory variables has a higher R^2, and all of its coefficients are significant. This does not automatically make the multiple regression better: adding explanatory variables invites multicollinearity and increases variance and complexity.
Let's quickly test for Granger causality:
grangertest(price ~ lotsize, data = HousePrices)
## Granger causality test
##
## Model 1: price ~ Lags(price, 1:1) + Lags(lotsize, 1:1)
## Model 2: price ~ Lags(price, 1:1)
## Res.Df Df F Pr(>F)
## 1 542
## 2 543 -1 5.8014 0.01635 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
grangertest(lotsize ~ price, data = HousePrices)
## Granger causality test
##
## Model 1: lotsize ~ Lags(lotsize, 1:1) + Lags(price, 1:1)
## Model 2: lotsize ~ Lags(lotsize, 1:1)
## Res.Df Df F Pr(>F)
## 1 542
## 2 543 -1 39.072 8.277e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both directions of the Granger causality test have p-values below 0.05, with the second model (price as the explanatory variable) having the smaller one. We will still work with the first direction because it makes more intuitive sense. (Note that HousePrices is cross-sectional, so the "lags" here are simply the preceding rows of the data frame; the test is more of an illustration than a meaningful causality check.)
Let's first use the RESET test to examine the functional form of the model.
resettest(mutliple_reg, power = 2, type = "fitted", data = HousePrices)
##
## RESET test
##
## data: mutliple_reg
## RESET = 2.2334, df1 = 1, df2 = 541, p-value = 0.1356
resettest(mutliple_reg, power = 3, type = "fitted", data = HousePrices)
##
## RESET test
##
## data: mutliple_reg
## RESET = 6.1289, df1 = 1, df2 = 541, p-value = 0.0136
The fitted argument tests the functional form of Y rather than of the independent variables. Because the power-of-2 test has a p-value of 0.1356, we fail to reject H0 there, while the power-of-3 test does reject at the 5% level. To be safe, we will still compare a model in Y against one in Y^2 below.
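For intuition, the fitted-type RESET is roughly equivalent to adding powers of the fitted values to the regression and F-testing them, as in this sketch (not part of the original output):
# manual analogue of resettest(type = "fitted", power = 2):
# add the squared fitted values and test whether they improve the fit
aug <- lm(price ~ lotsize + bedrooms + driveway + I(fitted(mutliple_reg)^2))
anova(mutliple_reg, aug)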
resettest(mutliple_reg, power = 2, type = "regressor", data = HousePrices)
##
## RESET test
##
## data: mutliple_reg
## RESET = 9.8321, df1 = 3, df2 = 539, p-value = 2.535e-06
resettest(mutliple_reg, power = 3, type = "regressor", data = HousePrices)
##
## RESET test
##
## data: mutliple_reg
## RESET = 9.2712, df1 = 3, df2 = 539, p-value = 5.49e-06
For the regressor type argument, we are testing the functional form of the independent variables. The p-values are very small for both powers, so we reject the null of correct specification; this suggests the linear form of the regressors can be improved, for example by the transformations considered below.
summaryHH(regsubsets(price ~ lotsize + bedrooms + driveway, method =
c("forward"), data = HousePrices))
## model p rsq rss adjr2 cp bic stderr
## 1 l 2 0.287 2.77e+11 0.286 99.7 -172 22567
## 2 l-b 3 0.370 2.45e+11 0.368 26.8 -234 21229
## 3 l-b-d 4 0.398 2.34e+11 0.395 4.0 -252 20778
##
## Model variables with abbreviations
## model
## l lotsize
## l-b lotsize-bedrooms
## l-b-d lotsize-bedrooms-drivewayyes
##
## model with largest adjr2
## 3
##
## Number of observations
## 546
summaryHH(regsubsets(I(price^2) ~ lotsize + bedrooms + driveway, method =
c("forward"), data = HousePrices))
## model p rsq rss adjr2 cp bic stderr
## 1 l 2 0.243 8.86e+21 0.242 70.8 -139 4.04e+09
## 2 l-b 3 0.315 8.03e+21 0.312 14.9 -187 3.84e+09
## 3 l-b-d 4 0.330 7.84e+21 0.327 4.0 -194 3.80e+09
##
## Model variables with abbreviations
## model
## l lotsize
## l-b lotsize-bedrooms
## l-b-d lotsize-bedrooms-drivewayyes
##
## model with largest adjr2
## 3
##
## Number of observations
## 546
Based on these two comparisons, Y works better than Y^2 as the response, with all three explanatory variables included in the model. We have not directly tried quadratic or cubic forms of the explanatory variables; the regressor-type RESET tests above suggest their linear form can be improved, so we should still consider interaction variables, logged models (log-linear and linear-log), and a test for multicollinearity.
Log-linear model:
summaryHH(regsubsets(log(price) ~ lotsize + bedrooms + driveway, method =
c("forward"), data = HousePrices))
## model p rsq rss adjr2 cp bic stderr
## 1 l 2 0.295 53.2 0.293 115.7 -178 0.313
## 2 l-b 3 0.379 46.8 0.377 38.9 -242 0.294
## 3 l-b-d 4 0.419 43.8 0.416 4.0 -271 0.284
##
## Model variables with abbreviations
## model
## l lotsize
## l-b lotsize-bedrooms
## l-b-d lotsize-bedrooms-drivewayyes
##
## model with largest adjr2
## 3
##
## Number of observations
## 546
The best model here (model 3) has Cp = 4.0 and BIC = -271, compared with BIC = -252 for the corresponding linear specification, so the log-linear model looks more appropriate than the linear multiple regression. (Strictly, information criteria for models with different transformations of the response are not directly comparable, but we follow the smaller-is-better heuristic here.)
Linear-log model:
summaryHH(regsubsets(price ~ log(lotsize) + log(bedrooms) + driveway, method =
c("forward"), data = HousePrices))
## model p rsq rss adjr2 cp bic stderr
## 1 lg(l) 2 0.315 2.66e+11 0.314 99.5 -194 22121
## 2 lg(l)-lg(b) 3 0.403 2.32e+11 0.401 18.8 -263 20666
## 3 lg(l)-lg(b)-d 4 0.421 2.25e+11 0.418 4.0 -273 20372
##
## Model variables with abbreviations
## model
## lg(l) log(lotsize)
## lg(l)-lg(b) log(lotsize)-log(bedrooms)
## lg(l)-lg(b)-d log(lotsize)-log(bedrooms)-drivewayyes
##
## model with largest adjr2
## 3
##
## Number of observations
## 546
The Cp/BIC values are close, with the linear-log model slightly better (BIC = -273 versus -271), so we prefer the linear-log over the log-linear model. Note that the driveway variable is not logged since it is binary.
Finally, let's consider logging both the dependent and the independent variables (the log-log model).
summaryHH(regsubsets(log(price) ~ log(lotsize) + log(bedrooms) + driveway, method =
c("forward"), data = HousePrices))
## model p rsq rss adjr2 cp bic stderr
## 1 lg(l) 2 0.336 50.0 0.335 117.8 -211 0.303
## 2 lg(l)-lg(b) 3 0.428 43.1 0.426 28.4 -286 0.282
## 3 lg(l)-lg(b)-d 4 0.455 41.1 0.452 4.0 -306 0.275
##
## Model variables with abbreviations
## model
## lg(l) log(lotsize)
## lg(l)-lg(b) log(lotsize)-log(bedrooms)
## lg(l)-lg(b)-d log(lotsize)-log(bedrooms)-drivewayyes
##
## model with largest adjr2
## 3
##
## Number of observations
## 546
Surprisingly, the best BIC here (-306) is lower than both the log-linear (-271) and linear-log (-273) models, so we should use the log-log specification over the other two.
Now, let's test for multicollinearity among the explanatory variables. Multicollinearity arises when two or more explanatory variables are highly correlated because they carry much the same information. Here we have three quite different explanatory variables, so strong correlation among them is unlikely, but it is still good to check.
loglog_model <- lm(log(price) ~ log(lotsize) + log(bedrooms) + driveway)
vif(loglog_model)
## log(lotsize) log(bedrooms) driveway
## 1.155539 1.028450 1.126830
A VIF of 5 or more is commonly taken as a sign of problematic multicollinearity. Since all of the values are close to 1, we do not need to drop any variables on multicollinearity grounds.
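As a sketch of where these numbers come from, each VIF equals 1/(1 - R^2) from an auxiliary regression of one regressor on the others, e.g. for log(lotsize) (output not shown):
# VIF of log(lotsize) by hand via its auxiliary regression
aux <- lm(log(lotsize) ~ log(bedrooms) + driveway)
1 / (1 - summary(aux)$r.squared)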
Finally, let's consider interaction variables. An interaction may be needed when the combined effect of two variables differs from the sum of their separate effects; for example, a house with many bedrooms AND a driveway may be more expensive than the two effects alone would suggest. Instead of adding the variables, an interaction term multiplies them. With three explanatory variables, there are three possible pairwise interactions in our model.
summaryHH(regsubsets(log(price) ~ log(lotsize) + log(bedrooms) + driveway + log(lotsize) * log(bedrooms) + log(lotsize) * driveway + log(bedrooms) * driveway, method =
c("forward"), data = HousePrices))
## model p rsq rss adjr2 cp bic stderr
## 1 lg(l) 2 0.336 50.0 0.335 128.16 -211 0.303
## 2 lg(l)-l():( 3 0.430 43.0 0.428 36.01 -288 0.281
## 3 lg(l)-l():(-lg(b): 4 0.459 40.8 0.456 8.67 -310 0.274
## 4 lg(l)-lg(b)-l():(-lg(b): 5 0.459 40.8 0.455 9.95 -304 0.275
## 5 lg(l)-lg(b)-l():(-lg(l):-lg(b): 6 0.460 40.8 0.455 11.76 -298 0.275
## 6 lg(l)-lg(b)-d-l():(-lg(l):-lg(b): 7 0.466 40.3 0.460 7.00 -299 0.273
##
## Model variables with abbreviations
## model
## lg(l) log(lotsize)
## lg(l)-l():( log(lotsize)-log(lotsize):log(bedrooms)
## lg(l)-l():(-lg(b): log(lotsize)-log(lotsize):log(bedrooms)-log(bedrooms):drivewayyes
## lg(l)-lg(b)-l():(-lg(b): log(lotsize)-log(bedrooms)-log(lotsize):log(bedrooms)-log(bedrooms):drivewayyes
## lg(l)-lg(b)-l():(-lg(l):-lg(b): log(lotsize)-log(bedrooms)-log(lotsize):log(bedrooms)-log(lotsize):drivewayyes-log(bedrooms):drivewayyes
## lg(l)-lg(b)-d-l():(-lg(l):-lg(b): log(lotsize)-log(bedrooms)-drivewayyes-log(lotsize):log(bedrooms)-log(lotsize):drivewayyes-log(bedrooms):drivewayyes
##
## model with largest adjr2
## 6
##
## Number of observations
## 546
Abbreviations: lg(l) = log(lotsize), lg(b) = log(bedrooms), d = driveway, l():( = log(lotsize):log(bedrooms), lg(l): = log(lotsize):driveway, lg(b): = log(bedrooms):driveway.
With interaction terms included, regsubsets has more candidate models to consider. Again we look for the lowest BIC, but we need to be careful: the model with the lowest BIC (model 3) does not have the lowest Cp (model 6). Therefore, we will compare two candidates: 1) price ~ log(lotsize) + log(lotsize) * log(bedrooms) + log(bedrooms) * driveway and 2) price ~ log(lotsize) + log(bedrooms) + driveway + log(lotsize) * log(bedrooms) + log(bedrooms) * driveway + log(lotsize) * driveway.
summary(lm(price ~ log(lotsize) + log(lotsize) * log(bedrooms) + log(bedrooms) * driveway))
##
## Call:
## lm(formula = price ~ log(lotsize) + log(lotsize) * log(bedrooms) +
## log(bedrooms) * driveway)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52393 -12599 -1001 9129 91405
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14023 85761 0.164 0.87018
## log(lotsize) 2532 10454 0.242 0.80870
## log(bedrooms) -201945 77980 -2.590 0.00987 **
## driveway -4285 10077 -0.425 0.67085
## log(lotsize):log(bedrooms) 26410 9493 2.782 0.00559 **
## log(bedrooms):driveway 14398 9287 1.550 0.12164
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20140 on 540 degrees of freedom
## Multiple R-squared: 0.4363, Adjusted R-squared: 0.431
## F-statistic: 83.58 on 5 and 540 DF, p-value: < 2.2e-16
summary(lm(price ~ log(lotsize) + log(bedrooms) + driveway + log(lotsize) * log(bedrooms) + log(bedrooms) * driveway + log(lotsize) * driveway))
##
## Call:
## lm(formula = price ~ log(lotsize) + log(bedrooms) + driveway +
## log(lotsize) * log(bedrooms) + log(bedrooms) * driveway +
## log(lotsize) * driveway)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54863 -12430 -1083 8907 90339
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 226560 105564 2.146 0.032303 *
## log(lotsize) -23765 12937 -1.837 0.066752 .
## log(bedrooms) -220796 77433 -2.851 0.004519 **
## driveway -215646 63128 -3.416 0.000683 ***
## log(lotsize):log(bedrooms) 28906 9431 3.065 0.002286 **
## log(bedrooms):driveway 11315 9243 1.224 0.221402
## log(lotsize):driveway 26238 7738 3.391 0.000748 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19950 on 539 degrees of freedom
## Multiple R-squared: 0.448, Adjusted R-squared: 0.4419
## F-statistic: 72.92 on 6 and 539 DF, p-value: < 2.2e-16
Not only is the adjusted R^2 higher for the second model, but most of its coefficients are also individually more significant. Therefore, we conclude that the preferred regression model is:
price ~ log(lotsize) + log(bedrooms) + driveway + log(lotsize) * log(bedrooms) + log(bedrooms) * driveway + log(lotsize) * driveway
This model will not necessarily have the highest R^2 of all candidates, but it offers the best combination of BIC, Cp, and adjusted R^2.
First, let's check visually whether our regression model shows any heteroskedasticity:
ideal_model <- lm(price ~ log(lotsize) + log(bedrooms) + driveway + log(lotsize) * log(bedrooms) + log(bedrooms) * driveway + log(lotsize) * driveway)
plot3d(predict(ideal_model), residuals(ideal_model), size = 3, type = "p")
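A conventional 2-D residuals-versus-fitted plot is usually easier to read; here is a sketch (not part of the original output):
# residuals vs. fitted values; a funnel/fan shape would suggest heteroskedasticity
plot(fitted(ideal_model), residuals(ideal_model),
     xlab = "fitted values", ylab = "residuals", main = "Residuals vs. fitted")
abline(h = 0, lty = 2)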
It is hard to judge heteroskedasticity from visual inspection alone, so let's perform a couple of formal tests.
bptest(ideal_model)
##
## studentized Breusch-Pagan test
##
## data: ideal_model
## BP = 69.279, df = 6, p-value = 5.746e-13
gqtest(ideal_model)
##
## Goldfeld-Quandt test
##
## data: ideal_model
## GQ = 1.3451, df1 = 266, df2 = 266, p-value = 0.007953
## alternative hypothesis: variance increases from segment 1 to 2
Both formal tests for heteroskedasticity reject their null hypotheses, meaning there is heteroskedasticity in our model and we need to correct for it.
Robust standard errors:
robust_se <- sqrt(diag(vcovHC(ideal_model, type = "HC1")))
corrected_ideal_model <- coeftest(ideal_model, vcov. = vcovHC)
corrected_ideal_model
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 226560.3 91717.4 2.4702 0.013812 *
## log(lotsize) -23765.4 11232.3 -2.1158 0.034819 *
## log(bedrooms) -220796.4 77148.5 -2.8620 0.004374 **
## driveway -215645.8 42043.6 -5.1291 4.067e-07 ***
## log(lotsize):log(bedrooms) 28905.5 9388.9 3.0787 0.002185 **
## log(bedrooms):driveway 11315.4 6523.8 1.7345 0.083403 .
## log(lotsize):driveway 26237.8 5227.8 5.0189 7.072e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The coefficient estimates themselves are unchanged (OLS remains unbiased under heteroskedasticity), but the standard errors are now heteroskedasticity-consistent, so confidence intervals and hypothesis tests can be trusted. Another noticeable feature of the coefficients is that some are negative. At first this may seem odd, but the same variables enter several terms (main effects and interactions), so the overall effect of a variable is most likely still positive once those terms are combined. Finally, these results are descriptive rather than predictive: the coefficients tell us how one variable is associated with another in this sample, not how well the model would forecast future prices.
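To make the point about the net effect concrete, here is an illustrative sketch (not part of the original output) that combines the log(lotsize) main coefficient with its interaction terms at representative values (the mean of log(bedrooms) and driveway = 1, chosen only for illustration):
# net effect of log(lotsize): main term plus interactions at representative values
b <- coef(ideal_model)
b["log(lotsize)"] +
  b["log(lotsize):log(bedrooms)"] * mean(log(bedrooms)) +
  b["log(lotsize):driveway"] * 1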
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(tseries)
library(dynlm)
data(TechChange)
head(TechChange)
## Time Series:
## Start = 1909
## End = 1914
## Frequency = 1
## output clr technology
## 1909 0.623 2.06 1.000
## 1910 0.616 2.10 0.983
## 1911 0.647 2.17 1.021
## 1912 0.652 2.21 1.023
## 1913 0.680 2.23 1.064
## 1914 0.682 2.20 1.071
clr = capital-to-labor ratio. 1) Once again, check the consistency of the data (any NAs). Then, analyze the individual variables with boxplots, histograms, and a correlation matrix.
any(is.na(TechChange))
## [1] FALSE
output <- TechChange[, "output"]
clr <- TechChange[, "clr"]
tech <- TechChange[, "technology"]
hist(output, probability = TRUE)
lines(density(output))
boxplot(output, main = "boxplot of output")
hist(clr, probability = TRUE)
lines(density(clr))
boxplot(clr, main = "boxplot of clr")
hist(tech, probability = TRUE)
lines(density(tech))
boxplot(tech, main = "boxplot of tech")
Output is slightly right-skewed, with a median around 0.9. The capital-to-labor ratio looks roughly normal, with a median around 2.6. Tech is also heavily right-skewed, with a median around 1.2; its median sits closer to the 25th percentile, meaning the right side of the distribution holds the more extreme values. Unlike the previous set of variables, none of these series show many outliers.
cor(TechChange)
## output clr technology
## output 1.0000000 0.3639961 0.9879093
## clr 0.3639961 1.0000000 0.2190149
## technology 0.9879093 0.2190149 1.0000000
There is an extremely high correlation between technology and output (0.99) and a moderate correlation between clr and output (0.36).
is.ts(TechChange)
## [1] TRUE
ggtsdisplay(output)
ggtsdisplay(clr)
ggtsdisplay(tech)
Based on the time series plots, none of the variables look stationary: each shows a clear upward trend. Visual inspection is not the most reliable guide, though, so let's use a formal test, the (augmented Dickey-Fuller) unit root test.
adf.test(output)
##
## Augmented Dickey-Fuller Test
##
## data: output
## Dickey-Fuller = -1.6873, Lag order = 3, p-value = 0.6965
## alternative hypothesis: stationary
adf.test(clr)
##
## Augmented Dickey-Fuller Test
##
## data: clr
## Dickey-Fuller = -1.433, Lag order = 3, p-value = 0.7966
## alternative hypothesis: stationary
adf.test(tech)
##
## Augmented Dickey-Fuller Test
##
## data: tech
## Dickey-Fuller = -1.4595, Lag order = 3, p-value = 0.7861
## alternative hypothesis: stationary
To conclude stationarity we would need to reject H0. Unfortunately, none of the variables do (which is common for trending level series), so we need to correct for non-stationarity by differencing.
diffoutput <- diff(output)
diffclr <- diff(clr)
difftech <- diff(tech)
ggtsdisplay(diffoutput)
ggtsdisplay(diffclr)
ggtsdisplay(difftech)
The differenced data look much more stationary. The variance appears larger only because the y-axis scale of the plots has shrunk. Let's still run the ADF tests:
adf.test(diffoutput)
##
## Augmented Dickey-Fuller Test
##
## data: diffoutput
## Dickey-Fuller = -3.5831, Lag order = 3, p-value = 0.04696
## alternative hypothesis: stationary
adf.test(diffclr)
##
## Augmented Dickey-Fuller Test
##
## data: diffclr
## Dickey-Fuller = -2.0806, Lag order = 3, p-value = 0.5419
## alternative hypothesis: stationary
adf.test(difftech)
##
## Augmented Dickey-Fuller Test
##
## data: difftech
## Dickey-Fuller = -2.8293, Lag order = 3, p-value = 0.248
## alternative hypothesis: stationary
Apart from output, the p-values did not fall below 0.05. Differencing again will not necessarily push them lower and risks over-differencing, which changes the character of the series.
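As a quick cross-check on how much differencing each series needs, forecast::ndiffs() estimates the required number of differences (a sketch; output not shown):
# suggested number of differences for stationarity (KPSS-based by default)
ndiffs(output)
ndiffs(clr)
ndiffs(tech)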
Based on the results of the PACF, we should try an AR(1) model for each series.
auto.arima(diffoutput)
## Series: diffoutput
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 0.0163
## s.e. 0.0049
##
## sigma^2 = 0.0009667: log likelihood = 82.58
## AIC=-161.16 AICc=-160.84 BIC=-157.78
auto.arima(diffclr)
## Series: diffclr
## ARIMA(0,0,1) with zero mean
##
## Coefficients:
## ma1
## 0.3739
## s.e. 0.1377
##
## sigma^2 = 0.008665: log likelihood = 38.64
## AIC=-73.28 AICc=-72.96 BIC=-69.91
auto.arima(difftech)
## Series: difftech
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 0.0202
## s.e. 0.0070
##
## sigma^2 = 0.001998: log likelihood = 68.06
## AIC=-132.12 AICc=-131.79 BIC=-128.74
testlag <- dynlm(diffoutput ~ L(diffoutput, 1))
bgtest(testlag)
##
## Breusch-Godfrey test for serial correlation of order up to 1
##
## data: testlag
## LM test = 0.37123, df = 1, p-value = 0.5423
Since the p-value is high, we do not reject H0, confirming that there is no serial correlation left in the differenced output series. auto.arima found no autoregressive terms for any of the series, so let's try a distributed-lag (DL) model instead.
We cannot use auto.arima or the ADF test to choose a DL specification, so we will use BIC to pick the lag structure. For simplicity, let output be the dependent variable and tech and clr the explanatory variables.
BIC(dynlm(output ~ tech))
## [1] -157.7489
BIC(dynlm(output ~ clr))
## [1] -10.71857
BIC(dynlm(output ~ clr + tech))
## [1] -278.9281
Our lowest BIC value is when we regress output on both clr and tech.
BIC(dynlm(output ~ tech))
## [1] -157.7489
BIC(dynlm(output ~ L(tech, 0:1)))
## [1] -151.6847
BIC(dynlm(output ~ L(tech, 0:2)))
## [1] -146.0387
BIC(dynlm(output ~ L(tech, 0:3)))
## [1] -140.9306
Notice how the BIC rises as we add more lags of tech to the DL model. For output, the model with no lagged values of tech is therefore preferred.
BIC(dynlm(output ~ clr))
## [1] -10.71857
BIC(dynlm(output ~ L(clr, 0:1)))
## [1] -7.476218
BIC(dynlm(output ~ L(clr, 0:2)))
## [1] -6.323461
BIC(dynlm(output ~ L(clr, 0:3)))
## [1] -5.401046
Similarly, the BIC rises as we add lagged values of clr.
We conclude that no lagged values of output, tech, or clr are needed in our regression model for output: past values of these series do not show a noticeable relationship with future values. Although this conclusion follows from our statistical results, the failure to find any dynamic relationship may also reflect how heavily we transformed the data by differencing. Stationarity is an assumption of our time-series models, but the change between the original and differenced tsdisplay plots suggests the differencing may have removed much of the interesting variation.
Panel Data Model: 1) Analyze the panel data by determining whether the pooled, fixed effects, or random effects model is most appropriate.
library("AER")
library(plm)
data(Cigar)
head(Cigar)
## state year price pop pop16 cpi ndi sales pimin
## 1 1 63 28.6 3383 2236.5 30.6 1558.305 93.9 26.1
## 2 1 64 29.8 3431 2276.7 31.0 1684.073 95.4 27.5
## 3 1 65 29.8 3486 2327.5 31.5 1809.842 98.5 28.9
## 4 1 66 31.5 3524 2369.7 32.4 1915.160 96.4 29.5
## 5 1 67 31.6 3533 2393.7 33.4 2023.546 95.5 29.6
## 6 1 68 35.6 3522 2405.2 34.8 2202.486 88.4 32.0
View(Cigar)
p_model <- plm(sales ~ price + pop + ndi, data = Cigar, model = "pooling")
f_i_model <- plm(sales ~ price + pop + ndi, data = Cigar, model = "within", effect = "individual")
f_t_model <- plm(sales ~ price + pop + ndi, data = Cigar, model = "within", effect = "time")
r_model <- plm(sales ~ price + pop + ndi, data = Cigar, model = "random")
Tests to determine which model is ideal (see ?plm for details):
Test for heterogeneity with the LM test:
plmtest(p_model, effect = "individual")
##
## Lagrange Multiplier Test - (Honda)
##
## data: sales ~ price + pop + ndi
## normal = 90.776, p-value < 2.2e-16
## alternative hypothesis: significant effects
The LM test checks for unobserved heterogeneity, i.e. systematic differences across individuals. Since we reject the null hypothesis, we conclude that there is unobserved heterogeneity in the model, so a random or fixed effects model (REM/FEM) should be preferred to the pooled model, which ignores differences across individuals.
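The same LM test can also be run for time effects; here is a sketch using the same syntax with effect = "time" (output not shown):
# LM test for time effects in the pooled model
plmtest(p_model, effect = "time")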
summary(f_i_model)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = sales ~ price + pop + ndi, data = Cigar, effect = "individual",
## model = "within")
##
## Balanced Panel: n = 46, T = 30, N = 1380
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -73.9055 -6.6190 0.7536 6.7863 125.6954
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## price -0.41338427 0.03753098 -11.0145 < 2.2e-16 ***
## pop 0.00064924 0.00055109 1.1781 0.239
## ndi 0.00184651 0.00034185 5.4015 7.818e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 412530
## Residual Sum of Squares: 299580
## R-Squared: 0.27378
## Adj. R-Squared: 0.24759
## F-statistic: 167.261 on 3 and 1331 DF, p-value: < 2.22e-16
summary(f_t_model)
## Oneway (time) effect Within Model
##
## Call:
## plm(formula = sales ~ price + pop + ndi, data = Cigar, effect = "time",
## model = "within")
##
## Balanced Panel: n = 46, T = 30, N = 1380
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -67.7909 -14.5711 -2.8440 9.3824 160.3132
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## price -1.76577191 0.11377566 -15.5198 < 2.2e-16 ***
## pop -0.00066471 0.00015660 -4.2446 2.339e-05 ***
## ndi 0.00649901 0.00058114 11.1831 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 1175700
## Residual Sum of Squares: 961780
## R-Squared: 0.18194
## Adj. R-Squared: 0.16251
## F-statistic: 99.8606 on 3 and 1347 DF, p-value: < 2.22e-16
Both the individual-effect and time-effect within models are strongly significant, and testing the fixed effects against the pooled specification (an explicit F test is sketched after the next code line) points to differences across individuals and over time that the pooled model cannot account for. The fixed effects (within) estimator removes these constant individual and time effects by demeaning each variable (subtracting its individual and time means), so the remaining variation identifies the marginal effects within individuals and within time periods. Therefore, we use the two-way effects model:
f_model <- plm(sales ~ price + pop + ndi, data = Cigar, model = "within", effect = "twoways")
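The F test mentioned above can be made explicit with pFtest(), which compares the within (fixed effects) model against the pooled model; a sketch (output not shown):
# F test of the two-way fixed effects model against the pooled model
pFtest(f_model, p_model)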
Test for endogeneity with the Hausman test:
phtest(f_model, r_model)
##
## Hausman Test
##
## data: sales ~ price + pop + ndi
## chisq = 743.93, df = 3, p-value < 2.2e-16
## alternative hypothesis: one model is inconsistent
The Hausman test checks whether the individual effects are correlated with the regressors, a form of endogeneity. We reject the null of no correlation, which means the random effects estimates would be inconsistent. We therefore conclude that the FEM should be used over the REM, since the fixed effects estimator stays consistent under this correlation while still accounting for the heterogeneity in the panel data.
Binary Dependent Variable Model:
data("SwissLabor")
head(SwissLabor)
## participation income age education youngkids oldkids foreign
## 1 no 10.78750 3.0 8 1 1 no
## 2 yes 10.52425 4.5 8 0 1 no
## 3 no 10.96858 4.6 9 0 0 no
## 4 no 11.10500 3.1 11 2 0 no
## 5 no 11.10847 4.4 12 0 2 no
## 6 yes 11.02825 4.2 12 0 1 no
First, we must convert "yes" and "no" to numeric values of 1 and 0.
participation <- SwissLabor$participation
education <- SwissLabor$education
income <- SwissLabor$income
participation <- ifelse(participation == "yes", 1, 0)
Now, let's fit the Linear Probability Model (LPM):
lpm <- lm(participation ~ education + income)
summary(lpm)
##
## Call:
## lm(formula = participation ~ education + income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7611 -0.4606 -0.3092 0.5088 0.7998
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.554313 0.441827 5.781 1.03e-08 ***
## education -0.008454 0.005800 -1.457 0.145
## income -0.188644 0.042696 -4.418 1.12e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4911 on 869 degrees of freedom
## Multiple R-squared: 0.03226, Adjusted R-squared: 0.03003
## F-statistic: 14.48 on 2 and 869 DF, p-value: 6.498e-07
AIC(lpm)
## [1] 1239.551
cov1 <- hccm(lpm, type = "hc1")
coeftest(lpm, vcov. = cov1)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.5543132 0.4418269 5.7813 1.033e-08 ***
## education -0.0084537 0.0058004 -1.4574 0.1454
## income -0.1886441 0.0426958 -4.4183 1.120e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
With the robust covariance matrix, our standard errors are heteroskedasticity-consistent. A couple of things to notice: the p-value for education is relatively high, so education is not statistically significant here. Income appears to have a negative effect on participation, which seems a bit unusual. The income coefficient can be read as: a one-unit increase in income lowers the predicted probability of labor-force participation by about 0.19 (19 percentage points). Let's try the other two models as well:
probit_model <- glm(lpm, data = SwissLabor, family = binomial(link = "probit"))
probit_model
##
## Call: glm(formula = lpm, family = binomial(link = "probit"), data = SwissLabor)
##
## Coefficients:
## (Intercept) education income
## 5.58082 -0.02058 -0.51417
##
## Degrees of Freedom: 871 Total (i.e. Null); 869 Residual
## Null Deviance: 1203
## Residual Deviance: 1174 AIC: 1180
A one-unit increase in income lowers the probit index (the z-score inside the normal CDF) by about 0.51; the implied change in the participation probability depends on where the index is evaluated. Additionally, the AIC of the probit model (1180) is lower than that of the LPM (about 1240), which also favors the probit specification, keeping in mind that AICs from a Gaussian lm and a binomial glm are only roughly comparable.
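Since the probit coefficient lives on the index scale, its average marginal effect on the participation probability can be sketched as the coefficient scaled by the average normal density at the fitted index (illustrative; output not shown):
# average marginal effect of income in the probit model
mean(dnorm(predict(probit_model, type = "link"))) * coef(probit_model)["income"]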
logit_model <- glm(lpm, data = SwissLabor, family = binomial(link = "logit"))
logit_model
##
## Call: glm(formula = lpm, family = binomial(link = "logit"), data = SwissLabor)
##
## Coefficients:
## (Intercept) education income
## 8.99645 -0.03355 -0.82833
##
## Degrees of Freedom: 871 Total (i.e. Null); 869 Residual
## Null Deviance: 1203
## Residual Deviance: 1174 AIC: 1180
For the logit model, a one-unit increase in income lowers the log-odds of participation by about 0.83. Both the probit and logit coefficients are harder to interpret directly than the LPM. Because the AICs of the two non-linear models are essentially identical, we conclude that either the probit or the logit model is appropriate for describing the relationship between labor-force participation and education and income. In all three models, income has a much larger impact than education. Income clearly has a negative relationship with participation, which makes sense if people with higher incomes have less need to participate in the labor force, while those with lower incomes are more likely to be working or searching for work. The weak relationship between education and participation is more questionable, since we would usually expect people with more education to be employed more quickly. There may be some bias in the data: for example, if many respondents had only recently finished their education, a larger share of highly educated people could have been surveyed while still looking for a job.
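One convenient way to read the logit coefficients is as odds ratios: exp(-0.83) is roughly 0.44, i.e. each one-unit increase in income multiplies the odds of participation by about 0.44. A sketch (output not shown):
# logit coefficients expressed as odds ratios
exp(coef(logit_model))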