1.Data

Data Source

The dataset is downloaded from: http://catalog.data.gov/dataset/annual-1990-2011-u-s-electric-power-industry-estimated-emissions-by-state. Provided by department of energy.

The data is annual emissions of Carbon Dioxide(CO2), Sulfur Dioxide(SO2), and Nitrogen Oxides (NOx) by type of electric power producer, by energy source, by state. I combine CO2 and SO2 into one variable CS, then we can include all three variables in the model with only two independent variables.

data.table<-read.csv("C:/Users/Echo/Downloads/emission.csv")
head(data.table)
##   Year Producer     CS    N
## 1 1990        0 837202 3011
## 2 1990        0 835120 3009
## 3 1990        0   2081    2
## 4 1990        0  23989  138
## 5 1990        0  21142   92
## 6 1990        0      0   43

Independent variable: Carbon Dioxide and Sulfur Dioxide(CS), Nitrogen Oxides(N).

Dependent variable: the producer(commercial cogen defined as 0, commercial non-cogen defined as 1).

Hypothesis:

Null Hypothesis: the producer type cannot be explained by the value of carbon dioxide, sulfur dioxide and nitrogen oxides.

Alternative Hypothesis: the producer type can be explained by the value of carbon dioxide, sulfur dioxide and nitrogen oxides.

explanatory model: The meaning of testing the relaitonship between dv and iv is to determine the producer type basing on the conposition of carbon dioxide, sulfur dioxide and nitrogen oxides.. The model will be an explanatory loistic regression model. Because based on this model, we can annalyze the component factors of emission and determine the producer type, based on which we can deicde wether this kind of producer has a bad effect on environment by producing too much emission gas.

Shrink the data size

According to power analysis, we shrink the size of the dataset.

N<-568
V<-nrow(data.table)
set.seed(99)
indicies<-sample(V,N,replace=FALSE)
data.table<-data.table[indicies,]
attach(data.table)
## The following object is masked _by_ .GlobalEnv:
## 
##     N
str(data.table)
## 'data.frame':    568 obs. of  4 variables:
##  $ Year    : int  2009 1994 2012 2012 2007 2011 2012 2000 2001 1996 ...
##  $ Producer: int  0 0 0 1 0 1 0 0 0 0 ...
##  $ CS      : int  112 13 55462 2252 1684 414 0 15129 23251 133723 ...
##  $ N       : int  3 0 152 13 1 10 0 355 173 340 ...

2. Model

In my opinion, both gases Carbon Dioxide(CO2 and Sulfur Dioxide(SO2) are important composition of emission, then I will use entry-wise method to establish the model by inputting two independent variables at the same time.

Logistic regression will be used to generalize the linear model.

model1<-glm(Producer~CS+N,data=data.table,family="binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model1)
## 
## Call:
## glm(formula = Producer ~ CS + N, family = "binomial", data = data.table)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1631  -0.9868  -0.7039   1.3536   2.5527  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.216e-01  1.037e-01  -4.064 4.82e-05 ***
## CS          -7.363e-06  1.447e-06  -5.088 3.63e-07 ***
## N            6.144e-04  1.945e-04   3.158  0.00159 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 710.97  on 567  degrees of freedom
## Residual deviance: 648.13  on 565  degrees of freedom
## AIC: 654.13
## 
## Number of Fisher Scoring iterations: 7

From the summary, we get both IV are significant.

For every one unit change in CS, the log odds of producer decreases by -7.363-06. For a one unit change in N, the log odds of producer increases by 6.144e-04. When both the independent variable is zero, the value of producer is -4.216e-01. The residual deviance is 648.13, which is much higher than zero, illustrates this model is not a very good fit.

Check the significance of the variable CS:

library(aod)
wald.test(b=coef(model1),Sigma = vcov(model1), Terms = 2)
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 25.9, df = 1, P(> X2) = 3.6e-07

The chi sq is 25.9 and p is 3.6e-07, indicating cS is statistically significance.

Check the significance of the variable N:

wald.test(b=coef(model1),Sigma = vcov(model1), Terms = 3)
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 10.0, df = 1, P(> X2) = 0.0016

The chi sq is 10 and p is 0.0016, indicating N is also statistically significance which is in accordance with our generalized linear model.

Check the significance of both variables:

wald.test(b = coef(model1), Sigma = vcov(model1), Terms = 2:3)
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 27.3, df = 2, P(> X2) = 1.2e-06

The chi sq is 27.3 and p is 1.2e-06. Then it is better to include both variables in the model.

3. Plots

standardized residuals vs. fitted

y.rst<-rstandard(model1)
y.fit<-fitted(model1)
plot(y.fit,y.rst,pch=21,cex=1,main="standardized residuals vs. fitted value",xlab="Fitted value of model",ylab="standardized residuals")
abline(0,0,col='blue')

standard residuals distribution

par(mfrow=c(1,2))
hist(y.rst,xlab="standardized residuals",main="standardized residuals distribution")
plot(y.rst,main="standardized residuals plot")

boxplot

boxplot(y.rst,xlab="producer~CS+N",ylab="standardized residual",main="boxplot of std residuals")

QQ plot

qqnorm(y.rst,ylab="std residuals",main="std residuals qq plot")

4.Interpretation

Line Analysis

  1. Linear

Based on the plot standardized residuals vs. fitted value, there is an obvious downward trend in the plot. Then non-linearity exists.

  1. independence of errors

The assumption suggests that there is a lack of autocorrelation between residuals. Accoding to the sccaterplot of residuals, no obvious pattern appers. Then we can believe the data is dependent among each other.

  1. normally distributed

I don’t think my std residuals distribution is normally distributed. The average value is smaller than 0 and there is a gap in the middle of the distribution around zero. There is kurtosis present as is evident in the histogram with two peaks.

  1. equal variance

Breusch-Pagen test:

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bp1<-bptest(Producer~CS,data=data.table)
bp1
## 
##  studentized Breusch-Pagan test
## 
## data:  Producer ~ CS
## BP = 23.2687, df = 1, p-value = 1.409e-06
library(lmtest)
bp2<-bptest(Producer~N,data=data.table)
bp2
## 
##  studentized Breusch-Pagan test
## 
## data:  Producer ~ N
## BP = 17.7819, df = 1, p-value = 2.477e-05
library(lmtest)
bp3<-bptest(Producer~CS+N,data=data.table)
bp3
## 
##  studentized Breusch-Pagan test
## 
## data:  Producer ~ CS + N
## BP = 27.287, df = 2, p-value = 1.188e-06

The null hypothesis of Breush-Pagen test is the data is homoskedastic. The alternative hypothesis is the data is heteroskedasitc. From all three p values which are significant smaller than 0.05, then we have to reject the null hypothesis. The data is heteroskedastic. Then the assumption of equal variances dos not hold.

Assumptions Interpretation

In summary, it was found that the independent variables and dependent variable were not linearly related, the errors are independent, the errors are not normally distributed, and the errors do not have equal variances.

As most of the assumptions do not apply for the logistic regression, then we test the assumptions for logistic regression.

  1. IVs are linearly related to log odds ratio. This assumption does not hold deduced from the plots of standardized residuals vs. fitted value.

  2. No extraneous variables are included. This assumption holds.

  3. Each observation must be independent with little or no multicollinearity. This assumption holds because there is no correlation between IV.

  4. Observations are independent. From the plot of residuals, we can say there is lack of correlation between error terms. Then this assumption holds.

Four Issues

  1. Causality

Both variables are factors leading to the dependent variable. They are component of emission and because of the difference in bussiness, the component of emission of different producer varies.

  1. Sample size

The original data set is too large and we use power analysis to select the proper size of dataset. Choosing z tests and logistic regression, alpha level is 0.05 and power level is 0.8. 568 sample size is determined.

  1. collinearity
a<-lm(CS~N,data=data.table)
summary(a)
## 
## Call:
## lm(formula = CS ~ N, data = data.table)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2126248   -16790    -6333    12672  2813189 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6336.703  12186.279    0.52    0.603    
## N             240.247      3.297   72.87   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 279900 on 566 degrees of freedom
## Multiple R-squared:  0.9037, Adjusted R-squared:  0.9035 
## F-statistic:  5310 on 1 and 566 DF,  p-value: < 2.2e-16

From the summary, more than 90% of CS can be explained by N, we can say there is obvious collinearity.

  1. measurement error

The data is from 1990 to 2012. As time passed, measurement technology develops which may lead to the difference in the standard of testing gas. And because of chemical reaction, the gas may transformed to other materials hard to be tested.

Then there should by measurement error to some extent.

Interaction effects

interaction<-glm(Producer~CS*N,family=binomial,data=data.table)
summary(interaction)
## 
## Call:
## glm(formula = Producer ~ CS * N, family = binomial, data = data.table)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1224  -0.9952  -0.6795   1.3428   2.5937  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.936e-01  1.046e-01  -3.764 0.000167 ***
## CS          -7.848e-06  1.509e-06  -5.202 1.97e-07 ***
## N            5.585e-04  1.841e-04   3.034 0.002417 ** 
## CS:N         1.212e-10  3.000e-11   4.042 5.31e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 710.97  on 567  degrees of freedom
## Residual deviance: 645.36  on 564  degrees of freedom
## AIC: 653.36
## 
## Number of Fisher Scoring iterations: 7

All p value is smaller than 0.05, illustrating there is interation effects.

5. Conclusions

From the entire logistic regression analysis we can conclude that the difference in the producer type can be explained by carbon dioxide, sulfur dioxide and nitrogen oxides.