Dataset selection

Dataset description

data<-read.delim("ResourceAssessmentSummaryData032011.txt", header=T, sep="\t")
dim(data)
## [1] 191  15
summary(data)
##  Region  Resource.Assessment.Site.ID            Site.Name.Facility
##  GP:73   GP-10  :  1                 Agate Dam           :  1     
##  LC: 5   GP-102 :  1                 Agency Valley       :  1     
##  MP:14   GP-103 :  1                 Anchor Dam          :  1     
##  PN:34   GP-107 :  1                 Anderson-Rose Dam   :  1     
##  UC:65   GP-108 :  1                 Angostura Dam       :  1     
##          GP-114 :  1                 Angostura Diversion :  1     
##          (Other):185                 (Other)             :185     
##  Design.Head..feet. Design.Flow..cfs. Installed.Capacity..kW.
##  8      :  6        17     :  5       1,362  :  2            
##  17     :  5        537    :  5       102    :  2            
##  3      :  5        65     :  4       20     :  2            
##  46     :  5        773    :  4       267    :  2            
##  15     :  4        110    :  3       276    :  2            
##  22     :  4        200    :  3       287    :  2            
##  (Other):162        (Other):167       (Other):179            
##  Annual.Production..MWh.  Plant.Factor   
##  720    :  3             Min.   :0.1400  
##  1,904  :  2             1st Qu.:0.3800  
##  126    :  2             Median :0.4700  
##  148    :  2             Mean   :0.4876  
##  566    :  2             3rd Qu.:0.5700  
##  1,001  :  1             Max.   :0.8700  
##  (Other):179                             
##  Total.Construction.Cost..1.000.... Annual.O.M.Cost..1.000....
##  $1,047.70 :  1                     $130.00 :  2              
##  $1,069.40 :  1                     $264.00 :  2              
##  $1,100.00 :  1                     $48.90  :  2              
##  $1,106.90 :  1                     $65.90  :  2              
##  $1,166.50 :  1                     $66.00  :  2              
##  $1,176.00 :  1                     $72.00  :  2              
##  (Other)   :185                     (Other) :179              
##  Cost.per.Installed.Capacity....kW.
##  $1,455 :  1                       
##  $1,482 :  1                       
##  $1,620 :  1                       
##  $1,704 :  1                       
##  $1,806 :  1                       
##  $1,889 :  1                       
##  (Other):185                       
##  Benefit.Cost.Ratio.with.Green.Incentives IRR.with.Green.Incentives
##  Min.   :0.0200                            < 0   :85               
##  1st Qu.:0.2100                           < 0    :26               
##  Median :0.4900                           3.00%  : 4               
##  Mean   :0.6849                           0.10%  : 2               
##  3rd Qu.:0.9500                           2.30%  : 2               
##  Max.   :3.5000                           2.80%  : 2               
##                                           (Other):70               
##  Benefit.Cost.Ratio.without.Green.Incentives IRR.without.Green.Incentives
##  Min.   :0.0200                               < 0   :69                  
##  1st Qu.:0.1950                              < 0    :46                  
##  Median :0.4400                              2.40%  : 3                  
##  Mean   :0.6093                              2.80%  : 3                  
##  3rd Qu.:0.8700                              3.30%  : 3                  
##  Max.   :2.8600                              3.70%  : 3                  
##                                              (Other):64
str(data)
## 'data.frame':    191 obs. of  15 variables:
##  $ Region                                     : Factor w/ 5 levels "GP","LC","MP",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Resource.Assessment.Site.ID                : Factor w/ 191 levels "GP-10","GP-102 ",..: 46 52 66 1 11 25 33 34 35 36 ...
##  $ Site.Name.Facility                         : Factor w/ 191 levels "Agate Dam ","Agency Valley ",..: 3 5 14 16 21 24 25 29 31 32 ...
##  $ Design.Head..feet.                         : Factor w/ 113 levels "1,149","10","100",..: 90 14 32 82 99 103 82 29 100 107 ...
##  $ Design.Flow..cfs.                          : Factor w/ 141 levels "1,157","1,213",..: 30 14 12 26 128 105 65 130 84 101 ...
##  $ Installed.Capacity..kW.                    : Factor w/ 183 levels "1,008","1,042",..: 146 178 27 132 113 77 177 166 52 96 ...
##  $ Annual.Production..MWh.                    : Factor w/ 185 levels "1,001","1,003",..: 45 94 139 12 88 38 74 72 7 48 ...
##  $ Plant.Factor                               : num  0.23 0.4 0.62 0.31 0.77 0.54 0.29 0.31 0.72 0.52 ...
##  $ Total.Construction.Cost..1.000....         : Factor w/ 191 levels "$1,047.70 ","$1,069.40 ",..: 143 96 13 79 16 170 137 107 18 169 ...
##  $ Annual.O.M.Cost..1.000....                 : Factor w/ 184 levels "$1,031.90 ","$1,206.20 ",..: 23 18 124 177 126 104 47 21 137 82 ...
##  $ Cost.per.Installed.Capacity....kW.         : Factor w/ 191 levels "$1,455 ","$1,482 ",..: 189 74 25 125 130 67 144 116 172 43 ...
##  $ Benefit.Cost.Ratio.with.Green.Incentives   : num  0.02 0.9 0.35 0.49 0.15 0.12 0.41 0.56 0.69 1.52 ...
##  $ IRR.with.Green.Incentives                  : Factor w/ 64 levels " < 0 ","< 0 ",..: 1 35 1 1 1 1 1 1 6 60 ...
##  $ Benefit.Cost.Ratio.without.Green.Incentives: num  0.02 0.84 0.33 0.46 0.14 0.11 0.39 0.53 0.65 1.42 ...
##  $ IRR.without.Green.Incentives               : Factor w/ 63 levels " < 0 ","< 0 ",..: 1 30 1 1 1 1 1 1 5 57 ...

Selection of independent and dependent variables

Dependent variable:

Independent variables:

row.names(data) <- data$Site.Name.Facility; data<-data[,-1]
data <- data[,c(5,7,11)]
colnames(data)<-c("capacity", "plant.factor","benefit.cost.ratio")
data$benefit.cost.ratio<-ifelse(data$benefit.cost.ratio < 1, 0, 1)
data$capacity<-as.numeric(data$capacity)
data$benefit.cost.ratio<-as.factor(data$benefit.cost.ratio)
plot(data)

Null and alternate hypothesis

Model

attach(data)
model.one<-glm(benefit.cost.ratio~plant.factor, family="binomial")
summary(model.one)
## 
## Call:
## glm(formula = benefit.cost.ratio ~ plant.factor, family = "binomial")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5778  -0.6690  -0.4704  -0.2999   2.2961  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -4.850      0.774  -6.266 3.71e-10 ***
## plant.factor    6.933      1.377   5.037 4.73e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 203.73  on 190  degrees of freedom
## Residual deviance: 172.94  on 189  degrees of freedom
## AIC: 176.94
## 
## Number of Fisher Scoring iterations: 4

The IV plant factor is a significant predictor of the DV benefit cost ratio based on the p-value of 4.73e-07.

Next we add the IV capacity to the model and look at the performance.

model.two<-glm(benefit.cost.ratio~plant.factor + capacity, family="binomial")
summary(model.two)
## 
## Call:
## glm(formula = benefit.cost.ratio ~ plant.factor + capacity, family = "binomial")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5886  -0.6633  -0.4756  -0.2679   2.4142  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -4.375979   0.855126  -5.117 3.10e-07 ***
## plant.factor  6.757097   1.382013   4.889 1.01e-06 ***
## capacity     -0.004397   0.003754  -1.171    0.242    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 203.73  on 190  degrees of freedom
## Residual deviance: 171.56  on 188  degrees of freedom
## AIC: 177.56
## 
## Number of Fisher Scoring iterations: 4

Plant factor is once again a significant predictor of benefit cost ratio (P val of 1.01e-06), while the IV capacity is not significant in the model (P val of 0.242).

Next we convert the coefficients in the model to odds ratios for interpretation in the below sections.

exp(coef(model.two))
##  (Intercept) plant.factor     capacity 
##   0.01257582 860.14150345   0.99561269

Plots

Standardized residuals as a function of the fitted model

st.resid<-rstandard(model.two)
plot(fitted(model.two),st.resid, xlab="fitted values", ylab="std. residuals")
abline(c(0,0),lwd=2, col="red")

Distribution of residuals

We will plot the standardised residuals. And we will also check the normality of residuals via histogram, boxplot and QQ plot.

plot(st.resid, pch=16, col="blue", main="Standardized Residuals")
abline(0,0)

hist(st.resid, breaks = 50, main="Distribution of std. residuals")

boxplot(st.resid, main="Boxplot of std. residuals")

qqnorm(st.resid, ylab="Std. residuals", xlab="Normal scores", main="QQplot of std. residuals")
qqline(st.resid, col="red")

Interpretation

Regression model and stastical analysis

LINE analysis

Since we are performing logistic regression based on maximum likelihood ratios, assumptions of linear regression may not be indicative of poor model fit in this context.

A linear relationship between the DV and IV would mean that at every value of the outcome the expected (mean) value of the residuals is zero. Looking at the plot of the std. residuals against the fitted values we see that this is not the case, as the std. residuals are not randomly distributed around 0, but have a downward pattern.

This is saying that the expected correlation between residuals, for any two cases, is 0 and that every case is independent of other cases. Based on both the plot of the standardised residuals and the std. residuals against fitted values there is no obvious clustering of points into groups that could indicate dependence or the need for a time series. Based on this information, we may still be satisfying this assumption.

This is talking about normality of residuals which we can assess by the boxplot, histogram and QQplot. Clearly we can see that the residuals are not normally distributed, violating this assumption. The plots show a skew on both the left and the right hand side of the distribution.

The spread of residuals is constant in the plot of std residuals against fitted values, i.e., there is homoscedasticity. From our plot we see this is not the case as there are more points below 0 than above 0, suggesting a heteroscedastic relationship. To confirm we can perform the Breusch-Pagan test for heteroscedasticity

Breusch-Pagan test for heteroscedasticity

library(lmtest)
bptest(model.two)
## 
##  studentized Breusch-Pagan test
## 
## data:  model.two
## BP = 21.8067, df = 2, p-value = 1.84e-05

The null hypothesis states that the data is homoscedastic and the alternate states that the data is heteroscedastic. Our result showing a p-value of 1.84e-05 means that we reject the null and conclude that the data is heteroscadastic, confirming what we saw from the diagnostic plots.

Four issues of regression

We will only consider the IV plant factor here since the IV capacity was not statistically significant in the model and had an OR of close to 1.

cor(plant.factor,capacity)
## [1] -0.1473668

Interaction effects between the IVs

model.int<-glm(benefit.cost.ratio~plant.factor*capacity, family="binomial")

Based on the p value for the interaction term (0.4125, which is not significant), we can say that the interaction between the two IVs does not contribute additional information to the model, after taking plant factor into account.

Overall interpretation on predictive capacity of model

Our model shows that a high installed capacity of a dam alone is not sufficient to predict the benefit to cost ratio of that dam being developed into a hydropower site. In general when the installed capacity of a dam is high and it performs at that capacity a large proportion of the time (i.e., high plant factor), our model suggests that the site is likely to be feasible. Hence overall it seems that the plant factor contributes significant information regarding the benefit to cost ratio. Finally, an important caveat to consider is that there were instances where the outcome benefit to cost ratios were close to 1 but lesser than 1, and so got binned into the ‘0’ category. This arbitrary cutoff could have implications for sites having ratios such as 0.99 for example, being binned into the same category as other sites with much lower ratios. Related to this issue, we see that there is unequal distribution of outcome variable data between 0 and 1, with 148 sites having a ratio of benefit to cost of 0 and 43 sites with a ratio of 1. In our analysis we ignored this unequal distribution, which may or may not contribute to the fit of the model.