The US International Goods and Services Trade dataset combines two datasets collected from the United States Census Bureau and the Federal Reserve. They can be accessed from:
URL:http://catalog.data.gov/dataset/us-international-trade-in-goods-and-services
URL:http://catalog.data.gov/dataset/federal-reserve-data-download-program
#data read in
rm(list=ls())
Trade.data<-read.csv("~/Desktop/Applied_Regression/Logistic/International_Export_Import.csv")
head(Trade.data,n=14L);
##       Time Export Import TradeBalance CapacityUtilization EuroDollarRate
## 1  1992-03  16054  11365            1             80.2999           4.43
## 2  1992-04  14347  10863            1             80.6967           4.19
## 3  1992-05  13956  10407            1             80.7975           3.99
## 4  1992-06  15698  11533            1             80.5974           4.00
## 5  1992-07  13979  11485            1             81.1304           3.54
## 6  1992-08  13547  11306            1             80.5548           3.43
## 7  1992-09  14606  11680            1             80.5531           3.22
## 8  1992-10  15625  12177            1             80.9928           3.32
## 9  1992-11  14165  11581            1             81.1599           3.70
## 10 1992-12  15794  12419            1             81.0541           3.60
## 11 1993-01  13903  10521            1             81.2884           3.37
## 12 1993-02  13667  10870            1             81.4514           3.24
## 13 1993-03  16619  13334            1             81.2911           3.21
## 14 1993-04  15222  12367            1             81.4254           3.21
The US International Goods and Services Trade dataset contains 6 variables and 277 monthly observations covering March 1992 to March 2015.
str(Trade.data)
## 'data.frame': 277 obs. of 6 variables:
## $ Time : Factor w/ 277 levels "1992-03","1992-04",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Export : num 16054 14347 13956 15698 13979 ...
## $ Import : num 11365 10863 10407 11533 11485 ...
## $ TradeBalance : int 1 1 1 1 1 1 1 1 1 1 ...
## $ CapacityUtilization: num 80.3 80.7 80.8 80.6 81.1 ...
## $ EuroDollarRate : num 4.43 4.19 3.99 4 3.54 3.43 3.22 3.32 3.7 3.6 ...
Time: month of the observation, in the format yyyy-mm
Export: total capital goods (the materials used to produce final goods) export value at Time(i), i = 1, 2, 3, …, 277, in millions
Import: total capital goods import value at Time(i), i = 1, 2, 3, …, 277, in millions
TradeBalance: 0 if the international trade account is in deficit (Import > Export); 1 if the account is in surplus (Export > Import)
CapacityUtilization: the ratio of actual output to installed productive capacity (Wikipedia), in percent
EuroDollarRate: the interest rate on Eurodollar deposits, that is, U.S.-dollar-denominated deposits at foreign banks or foreign branches of American banks (Investopedia), in percent
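The coding of TradeBalance can be illustrated with a minimal sketch. It assumes the indicator is derived directly from the Export and Import columns, which the source does not state explicitly. The data frame `toy` and the column `Month` are hypothetical names; the values are the first three rows of the head() output.

```r
# First three rows copied from the head() output; `toy` is a hypothetical name.
toy <- data.frame(
  Time   = c("1992-03", "1992-04", "1992-05"),
  Export = c(16054, 14347, 13956),
  Import = c(11365, 10863, 10407)
)

# Assumed coding rule: 1 = surplus (Export > Import), 0 = deficit (Import > Export).
toy$TradeBalance <- as.integer(toy$Export > toy$Import)

# Parse the yyyy-mm label as a Date by assuming the first day of each month.
toy$Month <- as.Date(paste0(toy$Time, "-01"))

print(toy)
```

Storing the month as a Date rather than a factor with 277 levels makes it easier to plot the series or to map a row index back to a calendar month.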
This analysis requires two continuous independent variables and one categorical dependent variable. I am interested in how Capacity Utilization and the Euro Dollar Rate affect the US International Trade Balance.
In this case, I want to test whether the US International Trade Balance can be explained by US Capacity Utilization and the Euro Dollar Rate (the alternative hypothesis).
The null hypothesis is that the US International Trade Balance cannot be explained by US Capacity Utilization and the Euro Dollar Rate; it is explained by other variables or is simply due to random chance.
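For a binomial model, these two hypotheses can be written in terms of the regression coefficients. The symbols \(p_i\), \(\beta_0\), \(\beta_1\), and \(\beta_2\) are notation introduced here for illustration and do not appear in the original analysis.

```latex
\[
\log\frac{p_i}{1-p_i}
  = \beta_0 + \beta_1\,\mathrm{EuroDollarRate}_i + \beta_2\,\mathrm{CapacityUtilization}_i,
\qquad p_i = P(\mathrm{TradeBalance}_i = 1)
\]
\[
H_0:\ \beta_1 = \beta_2 = 0
\qquad\text{versus}\qquad
H_a:\ \beta_1 \neq 0 \ \text{or}\ \beta_2 \neq 0
\]
```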
Explain the International Trade Balance using Capacity Utilization and Euro Dollar Rate
#construct new dataset containing the data I used
attach(Trade.data)
Trade.subdata <- subset(Trade.data,select = c(TradeBalance,CapacityUtilization,EuroDollarRate))
cor(Trade.subdata)
##                     TradeBalance CapacityUtilization EuroDollarRate
## TradeBalance           1.0000000           0.1636181      0.3659614
## CapacityUtilization    0.1636181           1.0000000      0.7056917
## EuroDollarRate         0.3659614           0.7056917      1.0000000
attach(Trade.subdata)
## The following objects are masked from Trade.data:
##
## CapacityUtilization, EuroDollarRate, TradeBalance
TradeBalance<-factor(TradeBalance)
model.1IV<-glm(TradeBalance~EuroDollarRate,data = Trade.subdata,family = "binomial")
summary(model.1IV)
##
## Call:
## glm(formula = TradeBalance ~ EuroDollarRate, family = "binomial",
## data = Trade.subdata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9888 -0.9737 0.6593 0.9553 1.4344
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.74037 0.23088 -3.207 0.00134 **
## EuroDollarRate 0.37559 0.06413 5.857 4.72e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 371.34 on 276 degrees of freedom
## Residual deviance: 332.92 on 275 degrees of freedom
## AIC: 336.92
##
## Number of Fisher Scoring iterations: 4
library(aod)
wald.test(b = coef(model.1IV),Sigma = vcov(model.1IV),Terms =2)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 34.3, df = 1, P(> X2) = 4.7e-09
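As a rough check on how the one-variable output fits together, the sketch below recomputes the odds ratio and the Wald statistic from the printed estimate and standard error. The numbers are copied from the summary output rather than taken from the fitted object, so small rounding differences are expected. The variable names are hypothetical.

```r
# Values copied from the summary(model.1IV) output.
est <- 0.37559   # EuroDollarRate estimate, in log-odds per percentage point
se  <- 0.06413   # its standard error

odds_ratio <- exp(est)      # multiplicative change in the odds per one-point increase
z          <- est / se      # Wald z statistic
wald_chisq <- z^2           # a single-term Wald chi-squared equals z squared
p_value    <- pchisq(wald_chisq, df = 1, lower.tail = FALSE)

print(round(c(odds_ratio = odds_ratio, z = z, wald_chisq = wald_chisq), 3))
print(signif(p_value, 2))
```

For a single coefficient, the chi-squared statistic reported by wald.test is the square of the z value in the coefficient table, so the two tests carry the same information.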
model.2IV<-glm(TradeBalance~EuroDollarRate+CapacityUtilization,data = Trade.subdata,family = "binomial")
summary(model.2IV)
##
## Call:
## glm(formula = TradeBalance ~ EuroDollarRate + CapacityUtilization,
## family = "binomial", data = Trade.subdata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0897 -0.9395 0.6805 0.8985 1.6211
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.57677 3.74835 2.021 0.0432 *
## EuroDollarRate 0.51834 0.09228 5.617 1.94e-08 ***
## CapacityUtilization -0.11114 0.05000 -2.223 0.0262 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 371.34 on 276 degrees of freedom
## Residual deviance: 327.67 on 274 degrees of freedom
## AIC: 333.67
##
## Number of Fisher Scoring iterations: 4
wald.test(b = coef(model.2IV),Sigma = vcov(model.2IV),Terms =2:3)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 37.9, df = 2, P(> X2) = 5.8e-09
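The model with two continuous independent variables has the larger Wald chi-squared statistic (37.9 on 2 degrees of freedom versus 34.3 on 1). It also has the lower AIC (333.67 versus 336.92) and the lower residual deviance (327.67 versus 332.92). Hence, I will use the model with two continuous independent variables as my final model.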
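One way to compare the two nested models directly is a likelihood-ratio test on their residual deviances. The sketch below uses the deviances and degrees of freedom printed in the two summaries. It is an illustration added here, not part of the original analysis, and the variable names are hypothetical.

```r
# Residual deviances and degrees of freedom copied from the two summary() outputs.
dev_1iv <- 332.92; df_1iv <- 275   # TradeBalance ~ EuroDollarRate
dev_2iv <- 327.67; df_2iv <- 274   # TradeBalance ~ EuroDollarRate + CapacityUtilization

# Likelihood-ratio test: the drop in deviance is compared with a chi-squared distribution.
lr_stat <- dev_1iv - dev_2iv
lr_df   <- df_1iv - df_2iv
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# For ungrouped binary data, AIC = residual deviance + 2 * (number of coefficients).
aic_1iv <- dev_1iv + 2 * 2
aic_2iv <- dev_2iv + 2 * 3

print(c(lr_stat = lr_stat, lr_df = lr_df, lr_p = lr_p))
print(c(aic_1iv = aic_1iv, aic_2iv = aic_2iv))
```

With the fitted objects available, `anova(model.1IV, model.2IV, test = "Chisq")` should perform the same comparison.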
FinalModel<-glm(TradeBalance~EuroDollarRate+CapacityUtilization,data = Trade.subdata, family = "binomial")
summary(FinalModel)
##
## Call:
## glm(formula = TradeBalance ~ EuroDollarRate + CapacityUtilization,
## family = "binomial", data = Trade.subdata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0897 -0.9395 0.6805 0.8985 1.6211
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.57677 3.74835 2.021 0.0432 *
## EuroDollarRate 0.51834 0.09228 5.617 1.94e-08 ***
## CapacityUtilization -0.11114 0.05000 -2.223 0.0262 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 371.34 on 276 degrees of freedom
## Residual deviance: 327.67 on 274 degrees of freedom
## AIC: 333.67
##
## Number of Fisher Scoring iterations: 4
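To make the final-model coefficients more concrete, the sketch below converts them to odds ratios and computes the fitted probability of a surplus for the first observation (1992-03). The coefficients are copied from the printed summary and the predictor values from the head() output, so the result is approximate. The variable names are hypothetical.

```r
# Coefficients copied from the summary(FinalModel) output.
b <- c(intercept = 7.57677, EuroDollarRate = 0.51834, CapacityUtilization = -0.11114)

# Predictor values for the first row of the data (1992-03); the leading 1 is the intercept.
x <- c(1, 4.43, 80.2999)

eta   <- sum(b * x)    # linear predictor, on the log-odds scale
p_hat <- plogis(eta)   # inverse logit: fitted probability that TradeBalance = 1

# Odds ratios: multiplicative change in the odds per one-unit increase in each predictor.
odds_ratios <- exp(b[-1])

print(round(c(eta = eta, p_hat = p_hat), 4))
print(round(odds_ratios, 4))
```

An odds ratio above 1 (EuroDollarRate) means the odds of a surplus rise with the predictor, and one below 1 (CapacityUtilization) means they fall, holding the other predictor fixed.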
par(mfrow = c(1,1))
FinalModel.res<-residuals(FinalModel,type = "deviance")
plot(fitted(FinalModel),FinalModel.res,pch=21, cex=1, bg='blue',main="Plot of Fitted Values vs. Residuals ", xlab = "Fitted Values of Model", ylab = "Residuals")
abline(0,0)
Next, I check whether the residuals are correlated with the independent variables. The plots suggest some correlation between the residuals and the Euro Dollar Rate.
attach(Trade.subdata)
## The following object is masked _by_ .GlobalEnv:
##
## TradeBalance
##
## The following objects are masked from Trade.subdata (pos = 4):
##
## CapacityUtilization, EuroDollarRate, TradeBalance
##
## The following objects are masked from Trade.data:
##
## CapacityUtilization, EuroDollarRate, TradeBalance
plot(CapacityUtilization,FinalModel.res,pch=21, cex=1, bg='blue',main="Plot of Capacity Utilization vs. Residuals ", xlab = "Capacity Utilization", ylab = "Residuals")
abline(1,0)
abline(-1,0)
plot(EuroDollarRate,FinalModel.res,pch=21, cex=1, bg='blue',main="Plot of EuroDollarRate vs. Residuals ", xlab = "EuroDollarRate", ylab = "Residuals")
abline(1,0)
abline(-1,0)
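(Calculate the residuals separately for TradeBalance = 1 and TradeBalance = 0, because I assume the residuals scatter around 1 and -1 rather than around 0.)
#type = "response" returns fitted probabilities; the default scale is the log-odds
fit1<-predict(FinalModel,subset(Trade.subdata,Trade.subdata$TradeBalance==1),type = "response")
fit0<-predict(FinalModel,subset(Trade.subdata,Trade.subdata$TradeBalance==0),type = "response")
#find the residual around 1 and 0: fitted probability minus observed outcome
resid1<-fit1-1
resid0<-fit0-0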
hist(resid1, main = "Histogram of Residual When Dependent Variable = 1",xlab = "Residual")
hist(resid0, main = "Histogram of Residual When Dependent Variable = 0",xlab = "Residual")
boxplot(FinalModel.res,main="Box Plot of the Residual")
qqnorm(resid1,main="QQplot of the Residual, When Trade Balance =1")
qqline(resid1)
qqnorm(resid0,main = "QQplot of the Residual, When Trade Balance =0")
qqline(resid0)
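The two bands in the residual plots follow from how deviance residuals are defined for a binary outcome: they are positive whenever the observed value is 1 and negative whenever it is 0. The sketch below applies the standard formula to a few made-up fitted probabilities. The function name `dev_resid` and the probabilities in `p` are hypothetical and not taken from the data.

```r
# Deviance residual for a binary outcome y (0 or 1) with fitted probability p.
dev_resid <- function(y, p) {
  sign_y <- ifelse(y == 1, 1, -1)
  sign_y * sqrt(-2 * (y * log(p) + (1 - y) * log(1 - p)))
}

p <- c(0.2, 0.5, 0.8)     # hypothetical fitted probabilities

res_y1 <- dev_resid(1, p) # always positive; large when p is far below 1
res_y0 <- dev_resid(0, p) # always negative; large in magnitude when p is far above 0

print(round(rbind(p = p, res_y1 = res_y1, res_y0 = res_y0), 3))
```

Because the sign is fixed by the outcome, the residuals of a logistic model are not expected to scatter symmetrically around zero or to look normal the way residuals from a linear model would.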
par(mfrow = c(1,1))
FinalModel.res<-residuals(FinalModel,type = "deviance")
plot(fitted(FinalModel),FinalModel.res,pch=21, cex=1, bg='blue',main="Plot of Fitted Values vs. Residuals ", xlab = "Fitted Values of Model", ylab = "Residuals")
abline(0,0)
The plot of fitted values vs. residuals does not show a flat, linear pattern around zero, so the expected value of the residuals is not zero.
par(mfrow = c(1,1))
plot(FinalModel.res,pch=21,cex=1,bg="blue",xlab = "index",ylab = "Residual", main="Residual Value")
The plot does not rule out serial correlation. Autocorrelation appears to exist between index 150 and index 200, which corresponds to roughly August 2004 to October 2008.
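Two small checks can support this reading of the index plot. The first maps row indices back to calendar months, assuming the 277 rows are consecutive months starting in 1992-03 as in the data. The second shows how serial correlation could be quantified with `acf` and a Ljung-Box test. It runs on a simulated AR(1) series, `res_demo`, which is a hypothetical stand-in for FinalModel.res and is not the actual residual series.

```r
# Map row indices to months: 277 consecutive months starting in March 1992.
months <- seq(as.Date("1992-03-01"), by = "month", length.out = 277)
idx_months <- format(months[c(150, 200)], "%Y-%m")
print(idx_months)

# Hypothetical residual series with built-in serial correlation (AR(1), coefficient 0.8).
set.seed(1)
res_demo <- as.numeric(arima.sim(list(ar = 0.8), n = 277))

# Lag-1 sample autocorrelation and a Ljung-Box test over the first 12 lags.
lag1 <- acf(res_demo, plot = FALSE)$acf[2]
lb   <- Box.test(res_demo, lag = 12, type = "Ljung-Box")

print(lag1)
print(lb$p.value)
```

Replacing `res_demo` with the model's own residual vector would give a formal test to accompany the visual impression from the plot.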
The residuals are not normally distributed, as the histogram shows. I also look at the histograms separately for the cases where the dependent variable equals 1 and 0. Both histograms also appear non-normal, with some skewness.
hist(FinalModel.res, main = "Residual Histogram")
#find the residual around 1 and 0
hist(resid1, main = "Histogram of Residual When Dependent Variable = 1",xlab = "Residual")
hist(resid0, main = "Histogram of Residual When Dependent Variable = 0",xlab ="Residual")
The residuals of the model appear to be heteroskedastic. Next, I use the Breusch-Pagan test to check whether they are.
plot(FinalModel.res,main = "Residual plot")
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
attach(Trade.subdata)
## The following object is masked _by_ .GlobalEnv:
##
## TradeBalance
##
## The following objects are masked from Trade.subdata (pos = 5):
##
## CapacityUtilization, EuroDollarRate, TradeBalance
##
## The following objects are masked from Trade.subdata (pos = 7):
##
## CapacityUtilization, EuroDollarRate, TradeBalance
##
## The following objects are masked from Trade.data:
##
## CapacityUtilization, EuroDollarRate, TradeBalance
bptest(FinalModel)
##
## studentized Breusch-Pagan test
##
## data: FinalModel
## BP = 4.9134, df = 2, p-value = 0.08572
The hypothesis that the residuals are homoskedastic is rejected at the 10% significance level (p = 0.08572) but not at the 5% level, which gives weak evidence that the residuals are heteroskedastic.
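The reported p-value can be reproduced from the test statistic, since the studentized Breusch-Pagan statistic is compared with a chi-squared distribution on the stated degrees of freedom. The sketch below uses the printed values. The variable names are hypothetical.

```r
# Statistic and degrees of freedom copied from the bptest() output.
bp_stat <- 4.9134
bp_df   <- 2

# Upper-tail chi-squared probability.
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(round(bp_p, 5))
print(c(reject_at_10pct = bp_p < 0.10, reject_at_5pct = bp_p < 0.05))
```

The conclusion therefore depends on the chosen significance level, which is why the evidence for heteroskedasticity is best described as weak.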
I used G*Power with an effect size of 0.219, an alpha of 0.05, and a power of 0.95 to find the required sample size. The result is 322, which is larger than the 277 observations in the dataset I used.
attach(Trade.subdata)
## The following object is masked _by_ .GlobalEnv:
##
## TradeBalance
##
## The following objects are masked from Trade.subdata (pos = 3):
##
## CapacityUtilization, EuroDollarRate, TradeBalance
##
## The following objects are masked from Trade.subdata (pos = 6):
##
## CapacityUtilization, EuroDollarRate, TradeBalance
##
## The following objects are masked from Trade.subdata (pos = 8):
##
## CapacityUtilization, EuroDollarRate, TradeBalance
##
## The following objects are masked from Trade.data:
##
## CapacityUtilization, EuroDollarRate, TradeBalance
Colinear<-lm(EuroDollarRate~CapacityUtilization)
summary(Colinear)
##
## Call:
## lm(formula = EuroDollarRate ~ CapacityUtilization)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2343 -1.0346 0.1970 0.9844 3.6099
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -28.26228 1.91317 -14.77 <2e-16 ***
## CapacityUtilization 0.39916 0.02417 16.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.526 on 275 degrees of freedom
## Multiple R-squared: 0.498, Adjusted R-squared: 0.4962
## F-statistic: 272.8 on 1 and 275 DF, p-value: < 2.2e-16
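The R-squared from this auxiliary regression can be turned into a variance inflation factor (VIF), a common summary of collinearity between two predictors. The sketch below is an illustration added here, not part of the original analysis. It uses the correlation printed in the cor() output, and the variable names are hypothetical.

```r
# Correlation between CapacityUtilization and EuroDollarRate, from the cor() output.
r <- 0.7056917

# With a single regressor, the auxiliary R-squared is the squared correlation.
r2 <- r^2

# Variance inflation factor for either predictor in the two-variable model.
vif <- 1 / (1 - r2)

print(round(c(r_squared = r2, vif = vif), 4))
```

The squared correlation agrees with the Multiple R-squared of 0.498 reported in the regression output.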