library(AER)
## Loading required package: car
## Loading required package: carData
## Loading required package: lmtest
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## Loading required package: survival
library(gplots)
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
library(lmtest)
library(HH)
## Loading required package: lattice
## Loading required package: grid
## Loading required package: latticeExtra
## Loading required package: multcomp
## Loading required package: mvtnorm
## Loading required package: TH.data
## Loading required package: MASS
## 
## Attaching package: 'TH.data'
## The following object is masked from 'package:MASS':
## 
##     geyser
## Loading required package: gridExtra
## 
## Attaching package: 'HH'
## The following object is masked from 'package:gplots':
## 
##     residplot
## The following objects are masked from 'package:car':
## 
##     logit, vif
library("leaps")
library(rgl)
library(sandwich)
data(HousePrices)
head(HousePrices)
##   price lotsize bedrooms bathrooms stories driveway recreation fullbase gasheat
## 1 42000    5850        3         1       2      yes         no      yes      no
## 2 38500    4000        2         1       1      yes         no       no      no
## 3 49500    3060        3         1       1      yes         no       no      no
## 4 60500    6650        3         1       2      yes        yes       no      no
## 5 61000    6360        2         1       1      yes         no       no      no
## 6 66000    4160        3         1       1      yes        yes      yes      no
##   aircon garage prefer
## 1     no      1     no
## 2     no      0     no
## 3     no      0     no
## 4     no      0     no
## 5     no      0     no
## 6    yes      0     no

1) Before analyzing the variables, check the consistency of the data by looking for any NAs or missing values in the dataset. If there are any, replace or delete the missing observations to make the data consistent. After that, use boxplots, histograms, and the correlation matrix between the variables to analyze each individual variable's characteristics.

any(is.na(HousePrices))
## [1] FALSE
lotsize <- HousePrices$lotsize
bedrooms <- HousePrices$bedrooms
driveway <- HousePrices$driveway
driveway <- ifelse(driveway == "yes", 1, 0)


boxplot(lotsize, main = "boxplot of lotsize")

hist(lotsize)

boxplot(bedrooms, main = "boxplot of bedrooms")

hist(bedrooms)

boxplot(driveway, main = "boxplot of driveway")

hist(driveway)

Analysis of lotsize: The median appears to be roughly 4,000-5,000 square feet, and there are some statistical outliers with lot sizes above the upper whisker. The lotsize histogram is skewed to the right, with lot sizes of 3000-4000 being the most common.

Analysis of bedrooms: The median number of bedrooms is around 3, with a few outliers above the upper whisker. The histogram shows that 3 bedrooms is the most common value.

Analysis of driveway (yes or no): It seems like most of the houses have driveways, and those that do not are considered outliers.

extracted_variables <- data.frame(lotsize = lotsize, bedrooms = bedrooms, driveway = driveway)
cor(extracted_variables)
##            lotsize    bedrooms    driveway
## lotsize  1.0000000  0.15185149  0.28877775
## bedrooms 0.1518515  1.00000000 -0.01199629
## driveway 0.2887778 -0.01199629  1.00000000

The most noticeable correlation is between lotsize and driveway (0.29): larger lots are somewhat more likely to come with a driveway, and vice versa. The correlation between bedrooms and driveway is essentially zero (-0.01), so the number of bedrooms tells us little about whether a house has a driveway.

2) Fit and analyze the linear regression between the price (y) and lotsize (x), including the p-value, t-value, and R^2. Calculate the elasticity as well and compare that to the slope.

price <- HousePrices$price
price_lm  <- lm(price ~ lotsize)
price_size_relationship <- summary(price_lm)
price_size_relationship
## 
## Call:
## lm(formula = price ~ lotsize)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -69551 -14626  -2858   9752 106901 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.414e+04  2.491e+03    13.7   <2e-16 ***
## lotsize     6.599e+00  4.458e-01    14.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22570 on 544 degrees of freedom
## Multiple R-squared:  0.2871, Adjusted R-squared:  0.2858 
## F-statistic: 219.1 on 1 and 544 DF,  p-value: < 2.2e-16

This shows the linear relationship between lotsize and price. Based on the p-values of the coefficients and of the overall F-test, the model is significant. Each additional square foot of lot size is associated with an increase of about $6.60 in price (coefficient 6.599). The raw coefficient alone says little about fit, but the R^2 tells us that lotsize by itself explains roughly 29% of the variation in price (adjusted R^2 = 0.286).

slope <- coef(price_size_relationship)[2]
elasticity <- slope * mean(lotsize) / mean(price)

The elasticity at the sample means is the slope multiplied by mean(lotsize)/mean(price), which works out to roughly 0.5: a 1% increase in lotsize is associated with about a 0.5% increase in price, on average. The slope of 6.6 says the same thing in level terms: one additional square foot of lot is associated with roughly $6.60 more in price. The elasticity is simply the unit-free (proportional) version of the slope, and both measures indicate that a larger lot raises the price.
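As a rough cross-check (a sketch, not part of the original analysis), the slope of a log-log regression is itself an elasticity estimate and should land in the same neighborhood:

# Sketch: in a log-log regression the slope is the (constant) elasticity of price with respect to lotsize
coef(lm(log(price) ~ log(lotsize)))["log(lotsize)"]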

3) Calculate the 95% CI of the estimator of the slope and use that value to determine if lotsize has a sufficient impact on price. Create a 95% CI for the price range based on the expected value of lotsize.

confint(price_lm, level = 0.95)
##                    2.5 %       97.5 %
## (Intercept) 29242.909795 39029.473335
## lotsize         5.722976     7.474559

The 95% confidence interval for the slope of lotsize on price runs from about 5.7 to 7.5. Since the entire interval lies above zero, lotsize clearly has an impact on price.
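The second part of the task, a 95% interval for the expected price at a given lot size, can be obtained with predict(); a minimal sketch evaluated at the sample mean of lotsize (use interval = "prediction" instead for the range of an individual house):

predict(price_lm, newdata = data.frame(lotsize = mean(lotsize)),
        interval = "confidence", level = 0.95)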

4) Fit the multiple regression model of price on lotsize and the other x-variables, and analyze the explanatory variables of the multiple regression using the same criteria as for the simple linear regression.

mutliple_reg <- lm(price ~ lotsize + bedrooms + driveway)
summary(mutliple_reg)
## 
## Call:
## lm(formula = price ~ lotsize + bedrooms + driveway)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -62594 -12374  -1890   9239  96399 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3617.4498  4422.2588  -0.818    0.414    
## lotsize         5.4164     0.4345  12.465  < 2e-16 ***
## bedrooms    10927.1076  1223.2994   8.932  < 2e-16 ***
## driveway    13320.8107  2673.2163   4.983 8.43e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20780 on 542 degrees of freedom
## Multiple R-squared:  0.3979, Adjusted R-squared:  0.3945 
## F-statistic: 119.4 on 3 and 542 DF,  p-value: < 2.2e-16

Compared to the simple linear model, the multiple regression has a higher R^2, and all of its coefficients are significant. This does not necessarily mean the multiple regression is better: adding explanatory variables makes the model susceptible to multicollinearity and increases variance and complexity.
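To supplement the R^2 comparison with the information criteria used in the next task, AIC() and BIC() accept several fitted models at once; a quick sketch:

AIC(price_lm, mutliple_reg)
BIC(price_lm, mutliple_reg)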

5) Use the RESET test and intuition about interaction variables to determine the ideal functional form for the regression model. Use AIC, BIC, and R^2 to help decide on the ideal model. Also make sure there is no multicollinearity between the explanatory variables of the multiple regression model by using formal tests. Also consider Granger causality, and make sure that price is the most appropriate choice of dependent variable relative to the other explanatory variables.

Let's quickly test for Granger causality:

grangertest(price ~ lotsize, data = HousePrices)
## Granger causality test
## 
## Model 1: price ~ Lags(price, 1:1) + Lags(lotsize, 1:1)
## Model 2: price ~ Lags(price, 1:1)
##   Res.Df Df      F  Pr(>F)  
## 1    542                    
## 2    543 -1 5.8014 0.01635 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
grangertest(lotsize ~ price, data = HousePrices)
## Granger causality test
## 
## Model 1: lotsize ~ Lags(lotsize, 1:1) + Lags(price, 1:1)
## Model 2: lotsize ~ Lags(lotsize, 1:1)
##   Res.Df Df      F    Pr(>F)    
## 1    542                        
## 2    543 -1 39.072 8.277e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Both directions of the Granger causality test have low p-values, and the second model, with price as the explanatory variable, has the lower one. However, we will still use the first specification (price as the dependent variable) because it makes more intuitive sense.

Let's first use the RESET test to determine the functional form of the model.

resettest(mutliple_reg, power = 2, type = "fitted", data = HousePrices)
## 
##  RESET test
## 
## data:  mutliple_reg
## RESET = 2.2334, df1 = 1, df2 = 541, p-value = 0.1356
resettest(mutliple_reg, power = 3, type = "fitted", data = HousePrices)
## 
##  RESET test
## 
## data:  mutliple_reg
## RESET = 6.1289, df1 = 1, df2 = 541, p-value = 0.0136

The fitted argument tests the functional form in terms of the fitted values (i.e., of Y-hat), not the individual regressors. With power 2 the p-value is 0.1356, so we fail to reject H0: there is no evidence that adding squared fitted values improves the model. With power 3 the test does reject (p = 0.0136), so we will still compare a specification in Y with one in Y^2 below.

resettest(mutliple_reg, power = 2, type = "regressor", data = HousePrices)
## 
##  RESET test
## 
## data:  mutliple_reg
## RESET = 9.8321, df1 = 3, df2 = 539, p-value = 2.535e-06
resettest(mutliple_reg, power = 3, type = "regressor", data = HousePrices)
## 
##  RESET test
## 
## data:  mutliple_reg
## RESET = 9.2712, df1 = 3, df2 = 539, p-value = 5.49e-06

With type = "regressor" we are testing the functional form of the independent variables. Since both p-values are very small, we reject H0, which indicates that the purely linear specification of the regressors is not adequate and that nonlinear transformations of the explanatory variables are worth exploring.

summaryHH(regsubsets(price ~ lotsize + bedrooms + driveway, method = 
                       c("forward"), data = HousePrices))
##   model p   rsq      rss adjr2   cp  bic stderr
## 1     l 2 0.287 2.77e+11 0.286 99.7 -172  22567
## 2   l-b 3 0.370 2.45e+11 0.368 26.8 -234  21229
## 3 l-b-d 4 0.398 2.34e+11 0.395  4.0 -252  20778
## 
## Model variables with abbreviations
##                              model
## l                          lotsize
## l-b               lotsize-bedrooms
## l-b-d lotsize-bedrooms-drivewayyes
## 
## model with largest adjr2
## 3 
## 
## Number of observations
## 546
summaryHH(regsubsets(I(price^2) ~ lotsize + bedrooms + driveway, method = 
                       c("forward"), data = HousePrices))
##   model p   rsq      rss adjr2   cp  bic   stderr
## 1     l 2 0.243 8.86e+21 0.242 70.8 -139 4.04e+09
## 2   l-b 3 0.315 8.03e+21 0.312 14.9 -187 3.84e+09
## 3 l-b-d 4 0.330 7.84e+21 0.327  4.0 -194 3.80e+09
## 
## Model variables with abbreviations
##                              model
## l                          lotsize
## l-b               lotsize-bedrooms
## l-b-d lotsize-bedrooms-drivewayyes
## 
## model with largest adjr2
## 3 
## 
## Number of observations
## 546

Based on these two fits, the untransformed Y is preferable to Y^2, with all three explanatory variables included in the model. Rather than adding explicit quadratic or cubic powers of the explanatory variables, which the RESET test above already probed, we address the suggested nonlinearity with log transformations and interactions below.

We should still consider interaction variables and logged models (log-linear and linear-log), as well as test for multicollinearity.

Log-linear model:

summaryHH(regsubsets(log(price) ~ lotsize + bedrooms + driveway, method = 
                       c("forward"), data = HousePrices))
##   model p   rsq  rss adjr2    cp  bic stderr
## 1     l 2 0.295 53.2 0.293 115.7 -178  0.313
## 2   l-b 3 0.379 46.8 0.377  38.9 -242  0.294
## 3 l-b-d 4 0.419 43.8 0.416   4.0 -271  0.284
## 
## Model variables with abbreviations
##                              model
## l                          lotsize
## l-b               lotsize-bedrooms
## l-b-d lotsize-bedrooms-drivewayyes
## 
## model with largest adjr2
## 3 
## 
## Number of observations
## 546

The best subset here reaches Cp = 4.0 and BIC = -271, better than the linear specification above, which suggests a log-linear model is more appropriate than the linear multiple regression model.

Linear-log model

summaryHH(regsubsets(price ~ log(lotsize) + log(bedrooms) + driveway, method = 
                       c("forward"), data = HousePrices))
##           model p   rsq      rss adjr2   cp  bic stderr
## 1         lg(l) 2 0.315 2.66e+11 0.314 99.5 -194  22121
## 2   lg(l)-lg(b) 3 0.403 2.32e+11 0.401 18.8 -263  20666
## 3 lg(l)-lg(b)-d 4 0.421 2.25e+11 0.418  4.0 -273  20372
## 
## Model variables with abbreviations
##                                                model
## lg(l)                                   log(lotsize)
## lg(l)-lg(b)               log(lotsize)-log(bedrooms)
## lg(l)-lg(b)-d log(lotsize)-log(bedrooms)-drivewayyes
## 
## model with largest adjr2
## 3 
## 
## Number of observations
## 546

The Cp/BIC values are similar to the log-linear model's, with the linear-log BIC (-273) slightly lower, so of the two we prefer the linear-log model. Remember that the driveway variable is not logged, since it is a binary variable.

Finally, let's consider logging both the dependent and the independent variables.

summaryHH(regsubsets(log(price) ~ log(lotsize) + log(bedrooms) + driveway, method = 
                       c("forward"), data = HousePrices))
##           model p   rsq  rss adjr2    cp  bic stderr
## 1         lg(l) 2 0.336 50.0 0.335 117.8 -211  0.303
## 2   lg(l)-lg(b) 3 0.428 43.1 0.426  28.4 -286  0.282
## 3 lg(l)-lg(b)-d 4 0.455 41.1 0.452   4.0 -306  0.275
## 
## Model variables with abbreviations
##                                                model
## lg(l)                                   log(lotsize)
## lg(l)-lg(b)               log(lotsize)-log(bedrooms)
## lg(l)-lg(b)-d log(lotsize)-log(bedrooms)-drivewayyes
## 
## model with largest adjr2
## 3 
## 
## Number of observations
## 546

The log-log specification has the lowest BIC yet (-306), lower than both the log-linear and linear-log models, so we prefer the log-log model over the other two.

Now, let's test for multicollinearity between the explanatory variables. Remember, multicollinearity occurs when two or more explanatory variables are highly correlated because they carry overlapping information. Here we have three quite different explanatory variables, so it is unlikely that they are strongly correlated, but it is still worth checking.

loglog_model <- lm(log(price) ~ log(lotsize) + log(bedrooms) + driveway)
vif(loglog_model)
##  log(lotsize) log(bedrooms)      driveway 
##      1.155539      1.028450      1.126830

A VIF of 5 or more is commonly treated as a sign of problematic multicollinearity. Since all of the values are close to 1, no variables need to be removed on account of multicollinearity.
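As a reminder of what VIF measures (a sketch, not part of the original analysis): VIF_j = 1/(1 - R_j^2), where R_j^2 comes from regressing explanatory variable j on the other explanatory variables. Reproducing the VIF for log(lotsize) by hand:

# Auxiliary regression of log(lotsize) on the other regressors
aux_r2 <- summary(lm(log(lotsize) ~ log(bedrooms) + driveway))$r.squared
1 / (1 - aux_r2)  # compare with the vif() output above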

Finally, let's consider interaction variables. An interaction term may be needed when the effect of one variable depends on the level of another, beyond what the two variables contribute additively. For example, if a house has many bedrooms AND a driveway, its price may be higher than the sum of the two separate effects would suggest. Instead of only adding the variables, an interaction term multiplies them.
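In R's formula syntax, a:b denotes the product (interaction) term and a*b expands to a + b + a:b, so the specifications below can be written either way; a small sketch:

# These two specifications are equivalent:
lm(log(price) ~ log(lotsize) * driveway)
lm(log(price) ~ log(lotsize) + driveway + log(lotsize):driveway)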

There are three possible interaction variables in our multiple regression model.

summaryHH(regsubsets(log(price) ~ log(lotsize) + log(bedrooms) + driveway + log(lotsize) * log(bedrooms) + log(lotsize) * driveway + log(bedrooms) * driveway, method = 
                       c("forward"), data = HousePrices))
##                               model p   rsq  rss adjr2     cp  bic stderr
## 1                             lg(l) 2 0.336 50.0 0.335 128.16 -211  0.303
## 2                       lg(l)-l():( 3 0.430 43.0 0.428  36.01 -288  0.281
## 3                lg(l)-l():(-lg(b): 4 0.459 40.8 0.456   8.67 -310  0.274
## 4          lg(l)-lg(b)-l():(-lg(b): 5 0.459 40.8 0.455   9.95 -304  0.275
## 5   lg(l)-lg(b)-l():(-lg(l):-lg(b): 6 0.460 40.8 0.455  11.76 -298  0.275
## 6 lg(l)-lg(b)-d-l():(-lg(l):-lg(b): 7 0.466 40.3 0.460   7.00 -299  0.273
## 
## Model variables with abbreviations
##                                                                                                                                                  model
## lg(l)                                                                                                                                     log(lotsize)
## lg(l)-l():(                                                                                                    log(lotsize)-log(lotsize):log(bedrooms)
## lg(l)-l():(-lg(b):                                                                   log(lotsize)-log(lotsize):log(bedrooms)-log(bedrooms):drivewayyes
## lg(l)-lg(b)-l():(-lg(b):                                               log(lotsize)-log(bedrooms)-log(lotsize):log(bedrooms)-log(bedrooms):drivewayyes
## lg(l)-lg(b)-l():(-lg(l):-lg(b):               log(lotsize)-log(bedrooms)-log(lotsize):log(bedrooms)-log(lotsize):drivewayyes-log(bedrooms):drivewayyes
## lg(l)-lg(b)-d-l():(-lg(l):-lg(b): log(lotsize)-log(bedrooms)-drivewayyes-log(lotsize):log(bedrooms)-log(lotsize):drivewayyes-log(bedrooms):drivewayyes
## 
## model with largest adjr2
## 6 
## 
## Number of observations
## 546

lg(l) = log(lotsize), lg(b) = log(bedrooms), d = driveway, l():( = log(lotsize) * log(bedrooms), lg(l): = log(lotsize) * driveway, lg(b): = log(bedrooms) * driveway

There are more rows now because the regsubsets function also considers the interaction variables. Again, we look for the lowest BIC. We need to be careful, since the model with the lowest BIC does not have the lowest Cp. Therefore, we will consider two models: 1) price ~ log(lotsize) + log(lotsize) * log(bedrooms) + log(bedrooms) * driveway, and 2) price ~ log(lotsize) + log(bedrooms) + driveway + log(lotsize) * log(bedrooms) + log(bedrooms) * driveway + log(lotsize) * driveway.

summary(lm(price ~ log(lotsize) + log(lotsize) * log(bedrooms) + log(bedrooms) * driveway))
## 
## Call:
## lm(formula = price ~ log(lotsize) + log(lotsize) * log(bedrooms) + 
##     log(bedrooms) * driveway)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -52393 -12599  -1001   9129  91405 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                   14023      85761   0.164  0.87018   
## log(lotsize)                   2532      10454   0.242  0.80870   
## log(bedrooms)               -201945      77980  -2.590  0.00987 **
## driveway                      -4285      10077  -0.425  0.67085   
## log(lotsize):log(bedrooms)    26410       9493   2.782  0.00559 **
## log(bedrooms):driveway        14398       9287   1.550  0.12164   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20140 on 540 degrees of freedom
## Multiple R-squared:  0.4363, Adjusted R-squared:  0.431 
## F-statistic: 83.58 on 5 and 540 DF,  p-value: < 2.2e-16
summary(lm(price ~ log(lotsize) + log(bedrooms) + driveway + log(lotsize) * log(bedrooms) + log(bedrooms) * driveway + log(lotsize) * driveway))
## 
## Call:
## lm(formula = price ~ log(lotsize) + log(bedrooms) + driveway + 
##     log(lotsize) * log(bedrooms) + log(bedrooms) * driveway + 
##     log(lotsize) * driveway)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -54863 -12430  -1083   8907  90339 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  226560     105564   2.146 0.032303 *  
## log(lotsize)                 -23765      12937  -1.837 0.066752 .  
## log(bedrooms)               -220796      77433  -2.851 0.004519 ** 
## driveway                    -215646      63128  -3.416 0.000683 ***
## log(lotsize):log(bedrooms)    28906       9431   3.065 0.002286 ** 
## log(bedrooms):driveway        11315       9243   1.224 0.221402    
## log(lotsize):driveway         26238       7738   3.391 0.000748 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19950 on 539 degrees of freedom
## Multiple R-squared:  0.448,  Adjusted R-squared:  0.4419 
## F-statistic: 72.92 on 6 and 539 DF,  p-value: < 2.2e-16

Not only is the adjusted R^2 larger for the second model, but most of its coefficients also have smaller p-values. Therefore, we conclude that the ideal regression model is:

price ~ log(lotsize) + log(bedrooms) + driveway + log(lotsize) * log(bedrooms) + log(bedrooms) * driveway + log(lotsize) * driveway

This model does not necessarily have the highest raw R^2, but it offers the best combination of BIC, Cp, and adjusted R^2.
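Since both candidate models use price as the response, their information criteria are directly comparable; a sketch refitting the two models named above (the object names cand1/cand2 are just for illustration):

cand1 <- lm(price ~ log(lotsize) + log(lotsize) * log(bedrooms) + log(bedrooms) * driveway)
cand2 <- lm(price ~ log(lotsize) + log(bedrooms) + driveway + log(lotsize) * log(bedrooms) +
              log(bedrooms) * driveway + log(lotsize) * driveway)
AIC(cand1, cand2)
BIC(cand1, cand2)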

6) Finally, test the six assumptions of the multiple linear regression of price on the other explanatory variables. Correct for heteroskedasticity, if present, using formal tests. Once heteroskedasticity is corrected for, plot the residuals versus the fitted values and comment on the results.

First, let's check visually whether our regression model shows any heteroskedasticity:

ideal_model <- lm(price ~ log(lotsize) + log(bedrooms) + driveway + log(lotsize) * log(bedrooms) + log(bedrooms) * driveway + log(lotsize) * driveway)
plot3d(predict(ideal_model), residuals(ideal_model), size = 3, type = "p")

It is hard to judge heteroskedasticity from a 3D display; a plain 2D residuals-versus-fitted plot (sketched below) is easier to read. Let's also perform a couple of formal tests.
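A sketch of the 2D residuals-versus-fitted plot mentioned above, using base graphics (this also covers the residuals-versus-fitted plot requested in the task, since robust standard errors do not change the residuals):

plot(fitted(ideal_model), residuals(ideal_model),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. fitted values")
abline(h = 0, lty = 2)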

bptest(ideal_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  ideal_model
## BP = 69.279, df = 6, p-value = 5.746e-13
gqtest(ideal_model)
## 
##  Goldfeld-Quandt test
## 
## data:  ideal_model
## GQ = 1.3451, df1 = 266, df2 = 266, p-value = 0.007953
## alternative hypothesis: variance increases from segment 1 to 2

Both formal tests for heteroskedasticity reject their null hypotheses, meaning there is heteroskedasticity in our model and we need to correct for it.

Robust standard errors:

# HC1 (White) heteroskedasticity-robust standard errors
robust_se <- sqrt(diag(vcovHC(ideal_model, type = "HC1")))
# Passing the vcovHC function itself uses its default weighting ("HC3")
corrected_ideal_model <- coeftest(ideal_model, vcov. = vcovHC)
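With a robust variance matrix, heteroskedasticity-consistent confidence intervals follow directly from lmtest::coefci; a sketch using the same HC1 weighting as robust_se:

coefci(ideal_model, vcov. = vcovHC(ideal_model, type = "HC1"), level = 0.95)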

7) Write the conclusion of the important findings from the regression analysis of this dataset, and acknowledge potential errors/mistakes.

corrected_ideal_model
## 
## t test of coefficients:
## 
##                             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)                 226560.3    91717.4  2.4702  0.013812 *  
## log(lotsize)                -23765.4    11232.3 -2.1158  0.034819 *  
## log(bedrooms)              -220796.4    77148.5 -2.8620  0.004374 ** 
## driveway                   -215645.8    42043.6 -5.1291 4.067e-07 ***
## log(lotsize):log(bedrooms)   28905.5     9388.9  3.0787  0.002185 ** 
## log(bedrooms):driveway       11315.4     6523.8  1.7345  0.083403 .  
## log(lotsize):driveway        26237.8     5227.8  5.0189 7.072e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The coefficient estimates are unchanged (OLS coefficients remain unbiased under heteroskedasticity), but the standard errors are now valid, which matters because it lets us build trustworthy confidence intervals and perform hypothesis tests. Another noticeable feature is that some coefficients are negative. At first this might not make sense, but the same variables enter the model through several terms (main effects and interactions), so the overall effect of each variable is most likely still positive. Finally, we must remember that these results are descriptive rather than predictive: they tell us how strongly one variable is associated with another in this sample, not how prices will respond in the future.

Time-series Data Analysis:

library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(tseries)
library(dynlm)
library(tseries)
data(TechChange)
head(TechChange)
## Time Series:
## Start = 1909 
## End = 1914 
## Frequency = 1 
##      output  clr technology
## 1909  0.623 2.06      1.000
## 1910  0.616 2.10      0.983
## 1911  0.647 2.17      1.021
## 1912  0.652 2.21      1.023
## 1913  0.680 2.23      1.064
## 1914  0.682 2.20      1.071

clr = capital to labor ratio

1) Once again, check the consistency of the data (any NAs). Then, analyze the individual variables with boxplots, histograms, and a correlation matrix.

any(is.na(TechChange))
## [1] FALSE
output <- TechChange[, "output"]
clr <- TechChange[, "clr"]
tech <- TechChange[, "technology"]
hist(output, probability = TRUE)
lines(density(output))

boxplot(output, main = "boxplot of output")

hist(clr, probability = TRUE)
lines(density(clr))

boxplot(clr, main = "boxplot of clr")

hist(tech, probability = TRUE)
lines(density(tech))

boxplot(tech, main = "boxplot of tech")

Output is slightly right-skewed, with a median of around 0.9. The capital-to-labor ratio looks roughly normally distributed, with a median around 2.6. Tech is also heavily right-skewed, with a median around 1.2; its median sits closer to the 25th percentile, which means the right side of the distribution contains the more extreme values. There do not seem to be many outliers in any of these series, in contrast to the previous set of variables.

cor(TechChange)
##               output       clr technology
## output     1.0000000 0.3639961  0.9879093
## clr        0.3639961 1.0000000  0.2190149
## technology 0.9879093 0.2190149  1.0000000

There is an extremely high correlation between technology and output (0.99) and a moderate correlation between clr and output (0.36).

2) Make sure the data is a time series, then use visual inspection followed by formal testing to see whether each variable is stationary. If a variable is not stationary, correct for it using log or differencing transformations, then test for stationarity again.

is.ts(TechChange)
## [1] TRUE
ggtsdisplay(output)

ggtsdisplay(clr)

ggtsdisplay(tech)

Based on the time-series plots, none of the variables look stationary, since each shows a clear upward trend. However, visual inspection is not the most reliable, so let's use a formal test: the augmented Dickey-Fuller (unit root) test.

adf.test(output)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  output
## Dickey-Fuller = -1.6873, Lag order = 3, p-value = 0.6965
## alternative hypothesis: stationary
adf.test(clr)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  clr
## Dickey-Fuller = -1.433, Lag order = 3, p-value = 0.7966
## alternative hypothesis: stationary
adf.test(tech)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  tech
## Dickey-Fuller = -1.4595, Lag order = 3, p-value = 0.7861
## alternative hypothesis: stationary

The ADF test's null hypothesis is a unit root, so for stationarity we want to reject H0. Unfortunately, none of the variables reject (which is typical for trending series), so we need to correct for it by differencing.

diffoutput <- diff(output)
diffclr <- diff(clr)
difftech <- diff(tech)
ggtsdisplay(diffoutput)

ggtsdisplay(diffclr)

ggtsdisplay(difftech)

The differenced data look much more stationary. Although the variance appears to have increased, that is just an effect of the smaller scale on the y-axis. Let's still run the ADF test:

adf.test(diffoutput)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diffoutput
## Dickey-Fuller = -3.5831, Lag order = 3, p-value = 0.04696
## alternative hypothesis: stationary
adf.test(diffclr)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diffclr
## Dickey-Fuller = -2.0806, Lag order = 3, p-value = 0.5419
## alternative hypothesis: stationary
adf.test(difftech)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  difftech
## Dickey-Fuller = -2.8293, Lag order = 3, p-value = 0.248
## alternative hypothesis: stationary

Only the differenced output series rejects the unit root at the 5% level; the p-values for clr and tech did not fall enough. Differencing again will not necessarily keep lowering the p-values and risks over-differencing, which changes the functional form of the series.
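As a cross-check (a sketch, not required by the task), the KPSS test from the tseries package reverses the hypotheses, with stationarity as the null, so the two tests together give a fuller picture:

kpss.test(diffoutput)
kpss.test(diffclr)
kpss.test(difftech)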

3) Using the results of the ACF, PACF, and AIC/BIC, fit an AR(n) model. Extract the residuals and regress them on the lagged variables to check for any remaining autocorrelation. Carry out a formal test to conclude the ideal AR(n) model.

Based on the PACF plots, we should try an AR(1) model for each of the differenced series.
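For comparison with the automatic order selection below, an AR(1) can also be fit explicitly (a sketch using base R's arima):

arima(diffoutput, order = c(1, 0, 0))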

auto.arima(diffoutput)
## Series: diffoutput 
## ARIMA(0,0,0) with non-zero mean 
## 
## Coefficients:
##         mean
##       0.0163
## s.e.  0.0049
## 
## sigma^2 = 0.0009667:  log likelihood = 82.58
## AIC=-161.16   AICc=-160.84   BIC=-157.78
auto.arima(diffclr)
## Series: diffclr 
## ARIMA(0,0,1) with zero mean 
## 
## Coefficients:
##          ma1
##       0.3739
## s.e.  0.1377
## 
## sigma^2 = 0.008665:  log likelihood = 38.64
## AIC=-73.28   AICc=-72.96   BIC=-69.91
auto.arima(difftech)
## Series: difftech 
## ARIMA(0,0,0) with non-zero mean 
## 
## Coefficients:
##         mean
##       0.0202
## s.e.  0.0070
## 
## sigma^2 = 0.001998:  log likelihood = 68.06
## AIC=-132.12   AICc=-131.79   BIC=-128.74
testlag <- dynlm(diffoutput ~ L(diffoutput, 1))
bgtest(testlag)
## 
##  Breusch-Godfrey test for serial correlation of order up to 1
## 
## data:  testlag
## LM test = 0.37123, df = 1, p-value = 0.5423

Since the p-value is high, we do not reject H0, confirming that no serial correlation remains in the residuals (an equivalent Ljung-Box check is sketched below). auto.arima also selected no AR terms for any of the differenced series (only an MA(1) for diffclr), so let's move on to a distributed lag (DL) model.
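The Ljung-Box check mentioned above, applied to the same residuals (a sketch using base R):

Box.test(residuals(testlag), lag = 1, type = "Ljung-Box")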

4) Using the same criteria as in #3, fit a DL(q) model and repeat the process of #3 for the explanatory variables. Once the ideal lag length for either or both explanatory variables is found, compare the ARDL model to the AR or DL model to determine which type of model is most appropriate for the time-series data.

auto.arima cannot choose lags of an explanatory variable for us, so we will use BIC to determine the ideal DL model. For simplicity, let output be the Y and let tech and clr be the explanatory variables.

BIC(dynlm(output ~ tech))
## [1] -157.7489
BIC(dynlm(output ~ clr))
## [1] -10.71857
BIC(dynlm(output ~ clr + tech))
## [1] -278.9281

Our lowest BIC value is when we regress output on both clr and tech.

BIC(dynlm(output ~ tech))
## [1] -157.7489
BIC(dynlm(output ~ L(tech, 0:1)))
## [1] -151.6847
BIC(dynlm(output ~ L(tech, 0:2)))
## [1] -146.0387
BIC(dynlm(output ~ L(tech, 0:3)))
## [1] -140.9306

Notice how the BIC rises (worsens) as we add more lags of tech to the DL model. This means the contemporaneous model, with no lagged values of tech, is preferred for explaining output.

BIC(dynlm(output ~ clr))
## [1] -10.71857
BIC(dynlm(output ~ L(clr, 0:1)))
## [1] -7.476218
BIC(dynlm(output ~ L(clr, 0:2)))
## [1] -6.323461
BIC(dynlm(output ~ L(clr, 0:3)))
## [1] -5.401046

Similarly, the BIC rises as we add lags of clr as regressors.
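The task also asks for an ARDL comparison; a minimal sketch (the lag structure here is an assumption, chosen only for illustration) adds one lag of the dependent variable to the best DL specification and compares BIC:

BIC(dynlm(output ~ L(output, 1) + tech + clr))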

5) Give a summary of the important findings of the time-series regression analysis, and give acknowledgements to potential errors/mistakes.

We conclude that no lagged values of output, tech, or clr are needed in the regression model for output. This means that past values of output, tech, and clr do not show a noticeable relationship with future values. Although this conclusion is based on our statistical results, our failure to find any time-related relationship may also stem from smoothing the data too much through differencing. Stationarity is an assumption of our time-series models, but the change between the original and differenced tsdisplay plots may have been too drastic.

Panel Data Model:

1) Analyze the panel data by determining whether the pooled, fixed effects, or random effects model is most appropriate.

library("AER")
library(plm)
data(Cigar) 
head(Cigar)
##   state year price  pop  pop16  cpi      ndi sales pimin
## 1     1   63  28.6 3383 2236.5 30.6 1558.305  93.9  26.1
## 2     1   64  29.8 3431 2276.7 31.0 1684.073  95.4  27.5
## 3     1   65  29.8 3486 2327.5 31.5 1809.842  98.5  28.9
## 4     1   66  31.5 3524 2369.7 32.4 1915.160  96.4  29.5
## 5     1   67  31.6 3533 2393.7 33.4 2023.546  95.5  29.6
## 6     1   68  35.6 3522 2405.2 34.8 2202.486  88.4  32.0

View(Cigar)

p_model <- plm(sales ~ price + pop + ndi, data = Cigar, model = "pooling")
f_i_model <- plm(sales ~ price + pop + ndi, data = Cigar, model = "within", effect = "individual")
f_t_model <- plm(sales ~ price + pop + ndi, data = Cigar, model = "within", effect = "time")
r_model <- plm(sales ~ price + pop + ndi, data = Cigar, model = "random")

Tests to determine which model is ideal (see ?plm for details):

Test for heterogeneity with the LM test:

plmtest(p_model, effect = "individual")
## 
##  Lagrange Multiplier Test - (Honda)
## 
## data:  sales ~ price + pop + ndi
## normal = 90.776, p-value < 2.2e-16
## alternative hypothesis: significant effects

The LM test checks for unobserved heterogeneity, i.e., individual-specific variation across the panel units. Since we reject the null hypothesis, we conclude there is unobserved heterogeneity in the model, so a random or fixed effects model (REM/FEM) should be used instead of the pooled model, because those models control for differences across individuals.
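The same LM test can be run for time effects (a sketch), which is relevant for the two-way specification considered below:

plmtest(p_model, effect = "time")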

summary(f_i_model)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = sales ~ price + pop + ndi, data = Cigar, effect = "individual", 
##     model = "within")
## 
## Balanced Panel: n = 46, T = 30, N = 1380
## 
## Residuals:
##     Min.  1st Qu.   Median  3rd Qu.     Max. 
## -73.9055  -6.6190   0.7536   6.7863 125.6954 
## 
## Coefficients:
##          Estimate  Std. Error  t-value  Pr(>|t|)    
## price -0.41338427  0.03753098 -11.0145 < 2.2e-16 ***
## pop    0.00064924  0.00055109   1.1781     0.239    
## ndi    0.00184651  0.00034185   5.4015 7.818e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    412530
## Residual Sum of Squares: 299580
## R-Squared:      0.27378
## Adj. R-Squared: 0.24759
## F-statistic: 167.261 on 3 and 1331 DF, p-value: < 2.22e-16
summary(f_t_model)
## Oneway (time) effect Within Model
## 
## Call:
## plm(formula = sales ~ price + pop + ndi, data = Cigar, effect = "time", 
##     model = "within")
## 
## Balanced Panel: n = 46, T = 30, N = 1380
## 
## Residuals:
##     Min.  1st Qu.   Median  3rd Qu.     Max. 
## -67.7909 -14.5711  -2.8440   9.3824 160.3132 
## 
## Coefficients:
##          Estimate  Std. Error  t-value  Pr(>|t|)    
## price -1.76577191  0.11377566 -15.5198 < 2.2e-16 ***
## pop   -0.00066471  0.00015660  -4.2446 2.339e-05 ***
## ndi    0.00649901  0.00058114  11.1831 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    1175700
## Residual Sum of Squares: 961780
## R-Squared:      0.18194
## Adj. R-Squared: 0.16251
## F-statistic: 99.8606 on 3 and 1347 DF, p-value: < 2.22e-16

Both the individual-effect and time-effect within models are strongly significant in their F-tests, which means there are differences across individuals and across time that the pooled model cannot account for. The fixed effects (within) estimator controls for these by demeaning each variable over individuals and over time periods, removing the constant individual- and time-specific components, so the remaining variation identifies the marginal effects within both individuals and time. Therefore, we use the two-way effects model:

f_model <- plm(sales ~ price + pop + ndi, data = Cigar, model = "within", effect = "twoways")
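As a formal check that the fixed effects are jointly significant relative to pooling, plm provides an F test comparing the within and pooled fits (a sketch):

pFtest(f_model, p_model)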

Test for endogeneity with the Hausman test:

phtest(f_model, r_model)
## 
##  Hausman Test
## 
## data:  sales ~ price + pop + ndi
## chisq = 743.93, df = 3, p-value < 2.2e-16
## alternative hypothesis: one model is inconsistent

In the Hausman test we are testing for endogeneity of the individual effects, i.e., whether they are correlated with the regressors. We reject the null hypothesis of no such correlation, which means the random effects coefficients would be biased and their standard errors incorrect. Therefore, we conclude that the fixed effects model should be used over the random effects model, since the FEM remains consistent in the presence of both heterogeneity and this kind of endogeneity in the panel data.

Binary Dependent Model:

data("SwissLabor")
head(SwissLabor)
##   participation   income age education youngkids oldkids foreign
## 1            no 10.78750 3.0         8         1       1      no
## 2           yes 10.52425 4.5         8         0       1      no
## 3            no 10.96858 4.6         9         0       0      no
## 4            no 11.10500 3.1        11         2       0      no
## 5            no 11.10847 4.4        12         0       2      no
## 6           yes 11.02825 4.2        12         0       1      no

1) Fit the LPM, probit model, and logit model for the probability of participation, and determine which model is preferred among the three:

First, we must convert “Yes” and “No” to numeric values of 1 and 0.

participation <- SwissLabor$participation 
education <- SwissLabor$education
income <- SwissLabor$income
participation <- ifelse(participation == "yes", 1, 0)

Now, let's fit the linear probability model (LPM):

lpm <- lm(participation ~ education + income)
summary(lpm)
## 
## Call:
## lm(formula = participation ~ education + income)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7611 -0.4606 -0.3092  0.5088  0.7998 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.554313   0.441827   5.781 1.03e-08 ***
## education   -0.008454   0.005800  -1.457    0.145    
## income      -0.188644   0.042696  -4.418 1.12e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4911 on 869 degrees of freedom
## Multiple R-squared:  0.03226,    Adjusted R-squared:  0.03003 
## F-statistic: 14.48 on 2 and 869 DF,  p-value: 6.498e-07
AIC(lpm)
## [1] 1239.551
cov1 <- hccm(lpm, type = "hc1")
coeftest(lpm, vcov. = cov1) 
## 
## t test of coefficients:
## 
##               Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)  2.5543132  0.4418269  5.7813 1.033e-08 ***
## education   -0.0084537  0.0058004 -1.4574    0.1454    
## income      -0.1886441  0.0426958 -4.4183 1.120e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Our standard errors are now robust to the heteroskedasticity that is built into a linear probability model. A couple of things to notice: the p-value for education is relatively high, indicating that education is not statistically significant in this specification. Income appears to have a negative effect on labor-force participation, which seems a bit unusual. The income coefficient can be read as: a one-unit increase in income is associated with an 18.9 percentage point decrease in the probability of participation. Let's try the other two models as well:

probit_model <- glm(lpm, data = SwissLabor, family = binomial(link = "probit"))
probit_model 
## 
## Call:  glm(formula = lpm, family = binomial(link = "probit"), data = SwissLabor)
## 
## Coefficients:
## (Intercept)    education       income  
##     5.58082     -0.02058     -0.51417  
## 
## Degrees of Freedom: 871 Total (i.e. Null);  869 Residual
## Null Deviance:       1203 
## Residual Deviance: 1174  AIC: 1180

In the probit model, a one-unit increase in income lowers the latent index (the z-score of participation) by about 0.51; the implied change in the participation probability depends on the starting point. Additionally, the AIC for the probit model (1180) is lower than the AIC for the LPM (1239.6), so the probit model is preferred.
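To translate the probit coefficient into something comparable to the LPM coefficient, an average marginal effect scales it by the average standard normal density evaluated at the linear predictor (a sketch, not part of the original analysis):

# Average marginal effect of income in the probit model
mean(dnorm(predict(probit_model, type = "link"))) * coef(probit_model)["income"]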

logit_model <- glm(lpm, data = SwissLabor, family = binomial(link = "logit"))
logit_model
## 
## Call:  glm(formula = lpm, family = binomial(link = "logit"), data = SwissLabor)
## 
## Coefficients:
## (Intercept)    education       income  
##     8.99645     -0.03355     -0.82833  
## 
## Degrees of Freedom: 871 Total (i.e. Null);  869 Residual
## Null Deviance:       1203 
## Residual Deviance: 1174  AIC: 1180

For the logit model, a one-unit increase in income means a decrease of about 0.83 in the log-odds of participation. Both the probit and logit coefficients are harder to interpret and to compare with the LPM. Because the AICs of the two non-linear models are essentially identical, we conclude that both the probit and logit models are appropriate for relating labor-force participation to education and income. In all three models, income has a much greater impact than education. Income clearly has a negative relationship with participation, and this makes some sense because people with lower wages are more likely to quit their job and look for another one. The weak relationship between education and participation is definitely questionable, since we would usually expect people with more education to find work more quickly. However, there may be some bias in the data: if people who had just finished their education were surveyed, many highly educated respondents would still have been looking for a job.