Ge Chen

May 6, 2015

RPI

Outline


1. Data

(1) Data Selection

The US International Goods and Services Trade dataset combines two datasets, one from the United States Census Bureau and one from the Federal Reserve. They can be accessed at:

URL: http://catalog.data.gov/dataset/us-international-trade-in-goods-and-services

URL: http://catalog.data.gov/dataset/federal-reserve-data-download-program

Read in the dataset

# clear the workspace, then read in the data
rm(list=ls())
Trade.data<-read.csv("~/Desktop/Applied_Regression/Logistic/International_Export_Import.csv")
# show the first 14 rows
head(Trade.data,n=14L)
##       Time Export Import TradeBalance CapacityUtilization EuroDollarRate
## 1  1992-03  16054  11365            1             80.2999           4.43
## 2  1992-04  14347  10863            1             80.6967           4.19
## 3  1992-05  13956  10407            1             80.7975           3.99
## 4  1992-06  15698  11533            1             80.5974           4.00
## 5  1992-07  13979  11485            1             81.1304           3.54
## 6  1992-08  13547  11306            1             80.5548           3.43
## 7  1992-09  14606  11680            1             80.5531           3.22
## 8  1992-10  15625  12177            1             80.9928           3.32
## 9  1992-11  14165  11581            1             81.1599           3.70
## 10 1992-12  15794  12419            1             81.0541           3.60
## 11 1993-01  13903  10521            1             81.2884           3.37
## 12 1993-02  13667  10870            1             81.4514           3.24
## 13 1993-03  16619  13334            1             81.2911           3.21
## 14 1993-04  15222  12367            1             81.4254           3.21

(2) Dataset Description

The US International Goods and Services Trade dataset contains 6 variables and 277 monthly observations covering March 1992 to March 2015.

str(Trade.data)
## 'data.frame':    277 obs. of  6 variables:
##  $ Time               : Factor w/ 277 levels "1992-03","1992-04",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Export             : num  16054 14347 13956 15698 13979 ...
##  $ Import             : num  11365 10863 10407 11533 11485 ...
##  $ TradeBalance       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ CapacityUtilization: num  80.3 80.7 80.8 80.6 81.1 ...
##  $ EuroDollarRate     : num  4.43 4.19 3.99 4 3.54 3.43 3.22 3.32 3.7 3.6 ...

Time: month of the observation, in the format yyyy-mm

Export: total capital goods (the materials used to produce final goods) export value at Time(i), i = 1, 2, 3, …, 277, in millions

Import: total capital goods import value at Time(i), i = 1, 2, 3, …, 277, in millions

TradeBalance: if the international trade account is in deficit (Import > Export), TradeBalance = 0; if the account is in surplus (Export > Import), TradeBalance = 1

CapacityUtilization: the ratio of actual output to installed productive capacity (Wikipedia), in percent

EuroDollarRate: the interest rate on Eurodollar deposits, i.e. U.S.-dollar denominated deposits at foreign banks or foreign branches of American banks (Investopedia), in percent
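The coding rule for TradeBalance can be checked directly against Export and Import. The sketch below is a minimal illustration that rebuilds the indicator for the first five rows shown in the head() output above; the data frame `first.rows` and the column `Recomputed` are hypothetical names introduced only for this example.

```r
# First five rows of the dataset, copied from the head() output above
first.rows <- data.frame(
  Time         = c("1992-03", "1992-04", "1992-05", "1992-06", "1992-07"),
  Export       = c(16054, 14347, 13956, 15698, 13979),
  Import       = c(11365, 10863, 10407, 11533, 11485),
  TradeBalance = c(1L, 1L, 1L, 1L, 1L)
)

# Surplus (Export > Import) is coded 1, deficit is coded 0
first.rows$Recomputed <- as.integer(first.rows$Export > first.rows$Import)

# TRUE if the recomputed indicator agrees with the stored one
all(first.rows$Recomputed == first.rows$TradeBalance)
```

Running the same comparison on the full Trade.data would confirm the coding for all 277 months.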

Variables of Interest

This analysis requires two continuous independent variables and one categorical dependent variable. I am interested in how Capacity Utilization and the Euro Dollar Rate affect the US International Trade Balance.

Hypothesis

Alternative hypothesis: the US International Trade Balance can be explained by US Capacity Utilization and the Euro Dollar Rate. This is the claim I want to test.

Null hypothesis: the US International Trade Balance cannot be explained by US Capacity Utilization and the Euro Dollar Rate; it is explained by other variables or is simply the result of randomness.

2. Model

Model Goals

Explain the International Trade Balance using Capacity Utilization and Euro Dollar Rate
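Written out, this goal corresponds to a logistic regression of the following form. The symbols \beta_0, \beta_1, \beta_2 and p_i are notation introduced here for illustration and do not appear in the original analysis.

```latex
\[
\log\frac{p_i}{1-p_i} = \beta_0 + \beta_1\,\mathrm{EuroDollarRate}_i + \beta_2\,\mathrm{CapacityUtilization}_i,
\qquad p_i = P(\mathrm{TradeBalance}_i = 1)
\]
\[
H_0:\ \beta_1 = \beta_2 = 0 \qquad \text{versus} \qquad H_1:\ \beta_1 \neq 0 \ \text{or}\ \beta_2 \neq 0
\]
```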

Model Construction

# build a new dataset containing only the variables used in the model
attach(Trade.data)
Trade.subdata <- subset(Trade.data,select = c(TradeBalance,CapacityUtilization,EuroDollarRate))

Step-wise

cor(Trade.subdata)
##                     TradeBalance CapacityUtilization EuroDollarRate
## TradeBalance           1.0000000           0.1636181      0.3659614
## CapacityUtilization    0.1636181           1.0000000      0.7056917
## EuroDollarRate         0.3659614           0.7056917      1.0000000

Single Independent Variable Model

attach(Trade.subdata)
## The following objects are masked from Trade.data:
## 
##     CapacityUtilization, EuroDollarRate, TradeBalance
TradeBalance<-factor(TradeBalance)
model.1IV<-glm(TradeBalance~EuroDollarRate,data = Trade.subdata,family = "binomial")
summary(model.1IV)
## 
## Call:
## glm(formula = TradeBalance ~ EuroDollarRate, family = "binomial", 
##     data = Trade.subdata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9888  -0.9737   0.6593   0.9553   1.4344  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -0.74037    0.23088  -3.207  0.00134 ** 
## EuroDollarRate  0.37559    0.06413   5.857 4.72e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 371.34  on 276  degrees of freedom
## Residual deviance: 332.92  on 275  degrees of freedom
## AIC: 336.92
## 
## Number of Fisher Scoring iterations: 4

Chi Square test for Single IV model

library(aod)
wald.test(b = coef(model.1IV),Sigma = vcov(model.1IV),Terms =2)
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 34.3, df = 1, P(> X2) = 4.7e-09
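For a single coefficient, the Wald chi-square statistic is the square of the z value reported by summary(). As a rough check, the sketch below recomputes it from the rounded estimate and standard error printed in summary(model.1IV), so small rounding differences are expected; the variable names are chosen for this example only.

```r
# Estimate and standard error of EuroDollarRate, copied from summary(model.1IV)
estimate  <- 0.37559
std.error <- 0.06413

# Wald statistic for one coefficient: (estimate / standard error)^2
z  <- estimate / std.error
X2 <- z^2

# Upper-tail chi-square probability with 1 degree of freedom
p.value <- pchisq(X2, df = 1, lower.tail = FALSE)

round(X2, 1)
signif(p.value, 2)
```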

Two Independent Variables Model

model.2IV<-glm(TradeBalance~EuroDollarRate+CapacityUtilization,data = Trade.subdata,family = "binomial")
summary(model.2IV)
## 
## Call:
## glm(formula = TradeBalance ~ EuroDollarRate + CapacityUtilization, 
##     family = "binomial", data = Trade.subdata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0897  -0.9395   0.6805   0.8985   1.6211  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          7.57677    3.74835   2.021   0.0432 *  
## EuroDollarRate       0.51834    0.09228   5.617 1.94e-08 ***
## CapacityUtilization -0.11114    0.05000  -2.223   0.0262 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 371.34  on 276  degrees of freedom
## Residual deviance: 327.67  on 274  degrees of freedom
## AIC: 333.67
## 
## Number of Fisher Scoring iterations: 4

Chi Square test for Two Independent Variables Model

wald.test(b = coef(model.2IV),Sigma = vcov(model.2IV),Terms =2:3)
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 37.9, df = 2, P(> X2) = 5.8e-09

Describe the Model

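The model with two continuous independent variables has the higher Wald chi-square statistic (37.9 versus 34.3), as well as the lower AIC (333.67 versus 336.92) and the lower residual deviance (327.67 versus 332.92). Hence, I will use the two-variable model as my final model.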
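Because the single-variable model is nested in the two-variable model, the two can also be compared with a likelihood-ratio (deviance) test. This is an additional check, not part of the original analysis; the sketch uses the residual deviances printed in the two summaries above, and the variable names are hypothetical.

```r
# Residual deviances and degrees of freedom from the two summary() outputs
dev.1IV <- 332.92; df.1IV <- 275
dev.2IV <- 327.67; df.2IV <- 274

# Likelihood-ratio statistic: drop in deviance from adding CapacityUtilization
lr.stat <- dev.1IV - dev.2IV
lr.df   <- df.1IV - df.2IV

# Upper-tail chi-square probability
lr.p <- pchisq(lr.stat, df = lr.df, lower.tail = FALSE)

c(statistic = lr.stat, df = lr.df, p.value = lr.p)
```

With the fitted objects available, anova(model.1IV, model.2IV, test = "Chisq") performs the same comparison.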

FinalModel<-glm(TradeBalance~EuroDollarRate+CapacityUtilization,data = Trade.subdata, family = "binomial")
summary(FinalModel)
## 
## Call:
## glm(formula = TradeBalance ~ EuroDollarRate + CapacityUtilization, 
##     family = "binomial", data = Trade.subdata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0897  -0.9395   0.6805   0.8985   1.6211  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          7.57677    3.74835   2.021   0.0432 *  
## EuroDollarRate       0.51834    0.09228   5.617 1.94e-08 ***
## CapacityUtilization -0.11114    0.05000  -2.223   0.0262 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 371.34  on 276  degrees of freedom
## Residual deviance: 327.67  on 274  degrees of freedom
## AIC: 333.67
## 
## Number of Fisher Scoring iterations: 4

3. Plot

Residuals Plot

par(mfrow = c(1,1))
FinalModel.res<-residuals(FinalModel,type = "deviance")
plot(fitted(FinalModel),FinalModel.res,pch=21, cex=1, bg='blue',main="Plot of Fitted Values vs. Residuals ", xlab = "Fitted Values of Model", ylab = "Residuals")
abline(0,0)

Diagnostic Plots

Residual Plot

I want to check whether the residuals are correlated with the independent variables. The plots suggest some correlation between the residuals and the Euro Dollar Rate.

Residual Vs. Capacity Utilization

attach(Trade.subdata)
## The following object is masked _by_ .GlobalEnv:
## 
##     TradeBalance
## 
## The following objects are masked from Trade.subdata (pos = 4):
## 
##     CapacityUtilization, EuroDollarRate, TradeBalance
## 
## The following objects are masked from Trade.data:
## 
##     CapacityUtilization, EuroDollarRate, TradeBalance
plot(CapacityUtilization,FinalModel.res,pch=21, cex=1, bg='blue',main="Plot of Capacity Utilization vs. Residuals ", xlab = "Capacity Utilization", ylab = "Residuals")
abline(1,0)
abline(-1,0)

Residual Vs. Euro Dollar Rate

plot(EuroDollarRate,FinalModel.res,pch=21, cex=1, bg='blue',main="Plot of EuroDollarRate vs. Residuals ", xlab = "EuroDollarRate", ylab = "Residuals")
abline(1,0)
abline(-1,0)

Histogram

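(Calculate the residuals separately for TradeBalance = 1 and TradeBalance = 0, because I assume the residuals scatter around 1 and -1 rather than around 0.)

# predicted probabilities (type = "response"; the default would return log-odds)
fit1<-predict(FinalModel,subset(Trade.subdata,Trade.subdata$TradeBalance==1),type = "response")
fit0<-predict(FinalModel,subset(Trade.subdata,Trade.subdata$TradeBalance==0),type = "response")
# residuals (fitted minus observed) around the observed outcomes 1 and 0
resid1<-fit1-1
resid0<-fit0-0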
hist(resid1, main = "Histogram of Residual When Dependent Variable = 1",xlab = "Residual")

hist(resid0, main = "Histogram of Residual When Dependent Variable = 0",xlab = "Residual")

Boxplot

boxplot(FinalModel.res,main="Box Plot of the Residual")

QQPlot

qqnorm(resid1,main="QQplot of the Residual, When Trade Balance =1")
qqline(resid1)

qqnorm(resid0,main = "QQplot of the Residual, When Trade Balance =0")
qqline(resid0)

4. Interpretation

Statistical Analysis

summary(FinalModel)
## 
## Call:
## glm(formula = TradeBalance ~ EuroDollarRate + CapacityUtilization, 
##     family = "binomial", data = Trade.subdata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0897  -0.9395   0.6805   0.8985   1.6211  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          7.57677    3.74835   2.021   0.0432 *  
## EuroDollarRate       0.51834    0.09228   5.617 1.94e-08 ***
## CapacityUtilization -0.11114    0.05000  -2.223   0.0262 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 371.34  on 276  degrees of freedom
## Residual deviance: 327.67  on 274  degrees of freedom
## AIC: 333.67
## 
## Number of Fisher Scoring iterations: 4
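Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios. The sketch below does this with the rounded estimates printed in summary(FinalModel) and also computes the fitted probability of a surplus for the first observation (March 1992); the object names are introduced here for illustration.

```r
# Coefficient estimates copied from summary(FinalModel)
beta <- c("(Intercept)"       = 7.57677,
          EuroDollarRate      = 0.51834,
          CapacityUtilization = -0.11114)

# Odds ratios: multiplicative change in the odds of a surplus
# for a one-unit (one percentage point) increase in each predictor
odds.ratio <- exp(beta[-1])

# Linear predictor (log-odds) for March 1992, using values from head(Trade.data)
eta <- beta[["(Intercept)"]] +
  beta[["EuroDollarRate"]] * 4.43 +
  beta[["CapacityUtilization"]] * 80.2999

# Inverse logit converts log-odds to a probability
p.surplus <- plogis(eta)

round(odds.ratio, 3)
round(p.surplus, 3)
```

An odds ratio above 1 for EuroDollarRate and below 1 for CapacityUtilization matches the signs of the two coefficients.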

LINE analysis

Linear

par(mfrow = c(1,1))
FinalModel.res<-residuals(FinalModel,type = "deviance")
plot(fitted(FinalModel),FinalModel.res,pch=21, cex=1, bg='blue',main="Plot of Fitted Values vs. Residuals ", xlab = "Fitted Values of Model", ylab = "Residuals")
abline(0,0)

The plot of fitted values vs. residuals does not show a linear pattern, which suggests that the expected value of the residuals is not zero.

Independent

par(mfrow = c(1,1))
plot(FinalModel.res,pch=21,cex=1,bg="blue",xlab = "index",ylab = "Residual", main="Residual Value")

The plot does not rule out serial correlation. Autocorrelation appears to be present around 2005 to 2009 (roughly index 150 to 200).
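Serial correlation can also be checked numerically rather than by eye. The sketch below is an illustration on simulated data, not on the trade dataset: `sim.res` is a hypothetical stand-in for FinalModel.res, generated as an AR(1) series of the same length, and the Ljung-Box test from base R is one possible choice of test.

```r
set.seed(1)

# Simulated stand-in for the residuals: 277 values with AR(1) dependence
sim.res <- as.numeric(arima.sim(model = list(ar = 0.7), n = 277))

# Sample autocorrelations up to lag 12 (element 1 is lag 0, element 2 is lag 1)
res.acf <- acf(sim.res, lag.max = 12, plot = FALSE)
lag1 <- res.acf$acf[2]

# Ljung-Box test: the null hypothesis is no autocorrelation up to lag 12
lb <- Box.test(sim.res, lag = 12, type = "Ljung-Box")

lag1
lb$p.value
```

Replacing sim.res with FinalModel.res applies the same check to the model residuals.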

Normal Distributed

The residuals are not normally distributed, as the histogram shows. I also look at the histograms separately for the cases where the dependent variable equals 1 and 0. Those two histograms also do not appear normally distributed and show some skewness.

hist(FinalModel.res, main = "Residual Histogram")

#find the residual around 1 and 0
hist(resid1, main = "Histogram of Residual When Dependent Variable = 1",xlab = "Residual")

hist(resid0, main = "Histogram of Residual When Dependent Variable = 0",xlab ="Residual")

Equal Variance

The residuals of the model appear to be heteroskedastic. Next, I use the Breusch-Pagan test to check this formally.

plot(FinalModel.res,main = "Residual plot")

Breusch - Pagan Test

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
attach(Trade.subdata)
## The following object is masked _by_ .GlobalEnv:
## 
##     TradeBalance
## 
## The following objects are masked from Trade.subdata (pos = 5):
## 
##     CapacityUtilization, EuroDollarRate, TradeBalance
## 
## The following objects are masked from Trade.subdata (pos = 7):
## 
##     CapacityUtilization, EuroDollarRate, TradeBalance
## 
## The following objects are masked from Trade.data:
## 
##     CapacityUtilization, EuroDollarRate, TradeBalance
bptest(FinalModel)
## 
##  studentized Breusch-Pagan test
## 
## data:  FinalModel
## BP = 4.9134, df = 2, p-value = 0.08572

The hypothesis that the residuals are homoskedastic is rejected at the 10% significance level (p-value = 0.08572) but not at the 5% level, which gives weak evidence that the residuals are heteroskedastic.
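The reported p-value follows from the BP statistic, which is compared with a chi-square distribution whose degrees of freedom equal the number of predictors. A quick check using the values printed by bptest(), with variable names chosen for this example:

```r
# BP statistic and degrees of freedom copied from the bptest() output
bp.stat <- 4.9134
bp.df   <- 2

# Upper-tail chi-square probability
bp.p <- pchisq(bp.stat, df = bp.df, lower.tail = FALSE)

# Compare with the 10% and 5% significance levels
reject.at.10 <- bp.p < 0.10
reject.at.05 <- bp.p < 0.05

round(bp.p, 5)
c(reject.at.10 = reject.at.10, reject.at.05 = reject.at.05)
```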

Four Issues

1. Causality

2. Sample Size

I used G*Power with an effect size of 0.219, an alpha of 0.05, and a power of 0.95 to find the required sample size. The result is 322, which is larger than the 277 observations in the dataset I used.

3. Collinearity

attach(Trade.subdata)
## The following object is masked _by_ .GlobalEnv:
## 
##     TradeBalance
## 
## The following objects are masked from Trade.subdata (pos = 3):
## 
##     CapacityUtilization, EuroDollarRate, TradeBalance
## 
## The following objects are masked from Trade.subdata (pos = 6):
## 
##     CapacityUtilization, EuroDollarRate, TradeBalance
## 
## The following objects are masked from Trade.subdata (pos = 8):
## 
##     CapacityUtilization, EuroDollarRate, TradeBalance
## 
## The following objects are masked from Trade.data:
## 
##     CapacityUtilization, EuroDollarRate, TradeBalance
Colinear<-lm(EuroDollarRate~CapacityUtilization)
summary(Colinear)
## 
## Call:
## lm(formula = EuroDollarRate ~ CapacityUtilization)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2343 -1.0346  0.1970  0.9844  3.6099 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -28.26228    1.91317  -14.77   <2e-16 ***
## CapacityUtilization   0.39916    0.02417   16.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.526 on 275 degrees of freedom
## Multiple R-squared:  0.498,  Adjusted R-squared:  0.4962 
## F-statistic: 272.8 on 1 and 275 DF,  p-value: < 2.2e-16
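With only two predictors, the R-squared of this auxiliary regression equals the squared correlation between them, and the variance inflation factor (VIF) follows as 1 / (1 - R^2). The VIF is not computed in the original analysis; the sketch below derives it from the correlation printed by cor(Trade.subdata) as one common way to summarize collinearity, using names introduced for this example.

```r
# Correlation between CapacityUtilization and EuroDollarRate, from cor(Trade.subdata)
r <- 0.7056917

# R-squared of the auxiliary regression of one predictor on the other
r.squared <- r^2

# Variance inflation factor shared by both predictors in a two-predictor model
vif.value <- 1 / (1 - r.squared)

round(r.squared, 3)
round(vif.value, 2)
```

A VIF near 2 means the variance of each coefficient estimate is roughly doubled relative to the case of uncorrelated predictors.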

4. Measurement error

5. Conclusion