Dataset selection

I selected the Hydropower potential dataset from data.gov: http://catalog.data.gov/dataset/hydropower-potential-in-the-western-us
Data source: http://catalog.data.gov/dataset/hydropower-potential-in-the-western-us
This purpose of this dataset is to asses hydropower capability at the US Bureau of Reclamation’s existing facilities where hydropower has not been developed.

Dataset description

The dataset includes design elements, installed capacity, production capability, associated costs and cost -to-benefit ratios for 191 water storing and conveying structures currently maintained by the Bureau of Reclamation.

data<-read.delim("ResourceAssessmentSummaryData032011.txt", header=T, sep="\t")
dim(data)

## [1] 191  15

summary(data)

##  Region  Resource.Assessment.Site.ID            Site.Name.Facility
##  GP:73   GP-10  :  1                 Agate Dam           :  1     
##  LC: 5   GP-102 :  1                 Agency Valley       :  1     
##  MP:14   GP-103 :  1                 Anchor Dam          :  1     
##  PN:34   GP-107 :  1                 Anderson-Rose Dam   :  1     
##  UC:65   GP-108 :  1                 Angostura Dam       :  1     
##          GP-114 :  1                 Angostura Diversion :  1     
##          (Other):185                 (Other)             :185     
##  Design.Head..feet. Design.Flow..cfs. Installed.Capacity..kW.
##  8      :  6        17     :  5       1,362  :  2            
##  17     :  5        537    :  5       102    :  2            
##  3      :  5        65     :  4       20     :  2            
##  46     :  5        773    :  4       267    :  2            
##  15     :  4        110    :  3       276    :  2            
##  22     :  4        200    :  3       287    :  2            
##  (Other):162        (Other):167       (Other):179            
##  Annual.Production..MWh.  Plant.Factor   
##  720    :  3             Min.   :0.1400  
##  1,904  :  2             1st Qu.:0.3800  
##  126    :  2             Median :0.4700  
##  148    :  2             Mean   :0.4876  
##  566    :  2             3rd Qu.:0.5700  
##  1,001  :  1             Max.   :0.8700  
##  (Other):179                             
##  Total.Construction.Cost..1.000.... Annual.O.M.Cost..1.000....
##  $1,047.70 :  1                     $130.00 :  2              
##  $1,069.40 :  1                     $264.00 :  2              
##  $1,100.00 :  1                     $48.90  :  2              
##  $1,106.90 :  1                     $65.90  :  2              
##  $1,166.50 :  1                     $66.00  :  2              
##  $1,176.00 :  1                     $72.00  :  2              
##  (Other)   :185                     (Other) :179              
##  Cost.per.Installed.Capacity....kW.
##  $1,455 :  1                       
##  $1,482 :  1                       
##  $1,620 :  1                       
##  $1,704 :  1                       
##  $1,806 :  1                       
##  $1,889 :  1                       
##  (Other):185                       
##  Benefit.Cost.Ratio.with.Green.Incentives IRR.with.Green.Incentives
##  Min.   :0.0200                            < 0   :85               
##  1st Qu.:0.2100                           < 0    :26               
##  Median :0.4900                           3.00%  : 4               
##  Mean   :0.6849                           0.10%  : 2               
##  3rd Qu.:0.9500                           2.30%  : 2               
##  Max.   :3.5000                           2.80%  : 2               
##                                           (Other):70               
##  Benefit.Cost.Ratio.without.Green.Incentives IRR.without.Green.Incentives
##  Min.   :0.0200                               < 0   :69                  
##  1st Qu.:0.1950                              < 0    :46                  
##  Median :0.4400                              2.40%  : 3                  
##  Mean   :0.6093                              2.80%  : 3                  
##  3rd Qu.:0.8700                              3.30%  : 3                  
##  Max.   :2.8600                              3.70%  : 3                  
##                                              (Other):64

str(data)

## 'data.frame':    191 obs. of  15 variables:
##  $ Region                                     : Factor w/ 5 levels "GP","LC","MP",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Resource.Assessment.Site.ID                : Factor w/ 191 levels "GP-10","GP-102 ",..: 46 52 66 1 11 25 33 34 35 36 ...
##  $ Site.Name.Facility                         : Factor w/ 191 levels "Agate Dam ","Agency Valley ",..: 3 5 14 16 21 24 25 29 31 32 ...
##  $ Design.Head..feet.                         : Factor w/ 113 levels "1,149","10","100",..: 90 14 32 82 99 103 82 29 100 107 ...
##  $ Design.Flow..cfs.                          : Factor w/ 141 levels "1,157","1,213",..: 30 14 12 26 128 105 65 130 84 101 ...
##  $ Installed.Capacity..kW.                    : Factor w/ 183 levels "1,008","1,042",..: 146 178 27 132 113 77 177 166 52 96 ...
##  $ Annual.Production..MWh.                    : Factor w/ 185 levels "1,001","1,003",..: 45 94 139 12 88 38 74 72 7 48 ...
##  $ Plant.Factor                               : num  0.23 0.4 0.62 0.31 0.77 0.54 0.29 0.31 0.72 0.52 ...
##  $ Total.Construction.Cost..1.000....         : Factor w/ 191 levels "$1,047.70 ","$1,069.40 ",..: 143 96 13 79 16 170 137 107 18 169 ...
##  $ Annual.O.M.Cost..1.000....                 : Factor w/ 184 levels "$1,031.90 ","$1,206.20 ",..: 23 18 124 177 126 104 47 21 137 82 ...
##  $ Cost.per.Installed.Capacity....kW.         : Factor w/ 191 levels "$1,455 ","$1,482 ",..: 189 74 25 125 130 67 144 116 172 43 ...
##  $ Benefit.Cost.Ratio.with.Green.Incentives   : num  0.02 0.9 0.35 0.49 0.15 0.12 0.41 0.56 0.69 1.52 ...
##  $ IRR.with.Green.Incentives                  : Factor w/ 64 levels " < 0 ","< 0 ",..: 1 35 1 1 1 1 1 1 6 60 ...
##  $ Benefit.Cost.Ratio.without.Green.Incentives: num  0.02 0.84 0.33 0.46 0.14 0.11 0.39 0.53 0.65 1.42 ...
##  $ IRR.without.Green.Incentives               : Factor w/ 63 levels " < 0 ","< 0 ",..: 1 30 1 1 1 1 1 1 5 57 ...

Selection of independent and dependent variables

Dependent variable:

The parameter ‘benefit to cost ratio’ compares benefits and costs of potential hydropower development at a site. It can be used as a proxy to determine if it will be economically viable for a site to be developed for hydropower capability.
A benefit cost ratio greater than 1.0 indicates benefits are greater than costs. Hence these ratios can be converted into categorical format for a logistic regression model, where ratios below 1 can be designated a 0 and ratios above 1 can be designated a 1.

Independent variables:

The variable ‘installed capacity’ was generated by the Bureau of Reclamation to estimate the power potential for a particular site based on design parameters of the site, and it is measured in units of kilowatts (kW).
The variable ‘plant factor’ indicates how often the hydropower plant operates at the installed capacity and is expressed as a percentage.

row.names(data) <- data$Site.Name.Facility; data<-data[,-1]
data <- data[,c(5,7,11)]
colnames(data)<-c("capacity", "plant.factor","benefit.cost.ratio")
data$benefit.cost.ratio<-ifelse(data$benefit.cost.ratio < 1, 0, 1)
data$capacity<-as.numeric(data$capacity)
data$benefit.cost.ratio<-as.factor(data$benefit.cost.ratio)
plot(data)

Null and alternate hypothesis

The logistic regression model is testing whether the variation in the continuous IVs capacity and plant factor can predict the variation in the categorical DV, benefit cost ratio.
The null hypothesis is that there is no association between the IVs (capacity and plant factor) and the dependent variable (benefit cost ratio).

Model

In general it is likely that higher capacity and plant factor would both lead to a more feasible site resulting in higher benefit cost ratios.
Based on crude theory, it is likely that a higher plant factor would better predict whether a site is more feasible for hydropower because it describes the frequency at which the site actually operates at its capacity. So we add this to the model first in the form of a hierarchical building.
The model can be best described as forecasting rather than as explanatory, since we want to assess how well the parameters ‘plant factor’ and ‘capacity’ can predict the benefit to cost ratio of a site.

attach(data)
model.one<-glm(benefit.cost.ratio~plant.factor, family="binomial")
summary(model.one)

## 
## Call:
## glm(formula = benefit.cost.ratio ~ plant.factor, family = "binomial")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5778  -0.6690  -0.4704  -0.2999   2.2961  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -4.850      0.774  -6.266 3.71e-10 ***
## plant.factor    6.933      1.377   5.037 4.73e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 203.73  on 190  degrees of freedom
## Residual deviance: 172.94  on 189  degrees of freedom
## AIC: 176.94
## 
## Number of Fisher Scoring iterations: 4

The IV plant factor is a significant predictor of the DV benefit cost ratio based on the p-value of 4.73e-07.

Next we add the IV capacity to the model and look at the performance.

model.two<-glm(benefit.cost.ratio~plant.factor + capacity, family="binomial")
summary(model.two)

## 
## Call:
## glm(formula = benefit.cost.ratio ~ plant.factor + capacity, family = "binomial")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5886  -0.6633  -0.4756  -0.2679   2.4142  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -4.375979   0.855126  -5.117 3.10e-07 ***
## plant.factor  6.757097   1.382013   4.889 1.01e-06 ***
## capacity     -0.004397   0.003754  -1.171    0.242    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 203.73  on 190  degrees of freedom
## Residual deviance: 171.56  on 188  degrees of freedom
## AIC: 177.56
## 
## Number of Fisher Scoring iterations: 4

Plant factor is once again a significant predictor of benefit cost ratio (P val of 1.01e-06), while the IV capacity is not significant in the model (P val of 0.242).

Next we convert the coefficients in the model to odds ratios for interpretation in the below sections.

exp(coef(model.two))

##  (Intercept) plant.factor     capacity 
##   0.01257582 860.14150345   0.99561269

Plots

Standardized residuals as a function of the fitted model

st.resid<-rstandard(model.two)
plot(fitted(model.two),st.resid, xlab="fitted values", ylab="std. residuals")
abline(c(0,0),lwd=2, col="red")

Distribution of residuals

We will plot the standardised residuals. And we will also check the normality of residuals via histogram, boxplot and QQ plot.

plot(st.resid, pch=16, col="blue", main="Standardized Residuals")
abline(0,0)

hist(st.resid, breaks = 50, main="Distribution of std. residuals")

boxplot(st.resid, main="Boxplot of std. residuals")

qqnorm(st.resid, ylab="Std. residuals", xlab="Normal scores", main="QQplot of std. residuals")
qqline(st.resid, col="red")

Interpretation

Regression model and stastical analysis

We can reject the null hypothesis for the IV plant factor and conclude that there is a significant association between plant factor and benefit to cost ratio for the sites. However, we cannot reject the null hypothesis for the IV capacity, since the model found this to be insignificant. So there is no significant association between the capacity of a site and its benefit to cost ratio based on the model above.
The logistic regression coefficients give the change in the log odds of the outcome for one unit increase in the predictor variable. For easier interpretation we converted these coefficients into odds ratios (ORs). The OR for plant factor is 860. For one unit increase in plant factor (i.e., an increase of 1% in plant factor) the odds of having a good benefit to cost ratio (of 1, as opposed to 0 benefit to cost ratio) increases by a factor of 860.
The OR for capacity is near 1. This means that the predictor capacity does not affect the odds of benefits to cost ratio (odds of getting a ratio of 1 or a ratio of 0 is unaffected by capacity of the site).

LINE analysis

Since we are performing logistic regression based on maximum likelihood ratios, assumptions of linear regression may not be indicative of poor model fit in this context.

Linear: The mean of the response at each set of values of the predictor is a Linear function of the predictors.

A linear relationship between the DV and IV would mean that at every value of the outcome the expected (mean) value of the residuals is zero. Looking at the plot of the std. residuals against the fitted values we see that this is not the case, as the std. residuals are not randomly distributed around 0, but have a downward pattern.

Independent: Errors are independent.

This is saying that the expected correlation between residuals, for any two cases, is 0 and that every case is independent of other cases. Based on both the plot of the standardised residuals and the std. residuals against fitted values there is no obvious clustering of points into groups that could indicate dependence or the need for a time series. Based on this information, we may still be satisfying this assumption.

Normally distributed: The errors at each set of values of the predictors are Normally distributed.

This is talking about normality of residuals which we can assess by the boxplot, histogram and QQplot. Clearly we can see that the residuals are not normally distributed, violating this assumption. The plots show a skew on both the left and the right hand side of the distribution.

Equal variances: The errors at each set of values of the predictor have Equal variances.

The spread of residuals is constant in the plot of std residuals against fitted values, i.e., there is homoscedasticity. From our plot we see this is not the case as there are more points below 0 than above 0, suggesting a heteroscedastic relationship. To confirm we can perform the Breusch-Pagan test for heteroscedasticity

Breusch-Pagan test for heteroscedasticity

library(lmtest)
bptest(model.two)

## 
##  studentized Breusch-Pagan test
## 
## data:  model.two
## BP = 21.8067, df = 2, p-value = 1.84e-05

The null hypothesis states that the data is homoscedastic and the alternate states that the data is heteroscedastic. Our result showing a p-value of 1.84e-05 means that we reject the null and conclude that the data is heteroscadastic, confirming what we saw from the diagnostic plots.

Four issues of regression

We will only consider the IV plant factor here since the IV capacity was not statistically significant in the model and had an OR of close to 1.

Causality: Does plant factor, i.e., the proportion of time that the site functions at its installed capacity causally affect the benefit cost ratio for that site? From our overall interpretation it certainly seems that a higher plant factor indicates a more feasible site, and that the relationship could be likely causal since sites with low plant factors may not be considered viable for hydroelectric power. Proximal vs. Ultimate – the plant factor cannot be ultimately causal because if the installed capacity of a site is low to begin with, even if the plant factor is very high (the site always works at the installed capacity) - it is not going to yield a good benefit to cost ratio. Hence the plant factor could be proximally causal. For the same reason, plant factor is likely to be probabalistic rather than determinate.
Sample size: The sample size of 191 could be slightly underpowered.
Collinearity: We see below that the correlation of the two IVs is weak, so this may not be an issue.

cor(plant.factor,capacity)

## [1] -0.1473668

Measurement error: The hydropower assessment tool developed by the Bureau was used to calculate some of the variables in this dataset and limitations in the tool could certainly have lead to ‘errors’ in the dataset. For example, the tool enables a user to change some assumptions regarding the performance of a turbine or flow of water at a site, which is subject to inherent limitations. Assumptions were also made at several stages along the development of the assessment tool.

Interaction effects between the IVs

model.int<-glm(benefit.cost.ratio~plant.factor*capacity, family="binomial")

Based on the p value for the interaction term (0.4125, which is not significant), we can say that the interaction between the two IVs does not contribute additional information to the model, after taking plant factor into account.

Overall interpretation on predictive capacity of model

Our model shows that a high installed capacity of a dam alone is not sufficient to predict the benefit to cost ratio of that dam being developed into a hydropower site. In general when the installed capacity of a dam is high and it performs at that capacity a large proportion of the time (i.e., high plant factor), our model suggests that the site is likely to be feasible. Hence overall it seems that the plant factor contributes significant information regarding the benefit to cost ratio. Finally, an important caveat to consider is that there were instances where the outcome benefit to cost ratios were close to 1 but lesser than 1, and so got binned into the ‘0’ category. This arbitrary cutoff could have implications for sites having ratios such as 0.99 for example, being binned into the same category as other sites with much lower ratios. Related to this issue, we see that there is unequal distribution of outcome variable data between 0 and 1, with 148 sites having a ratio of benefit to cost of 0 and 43 sites with a ratio of 1. In our analysis we ignored this unequal distribution, which may or may not contribute to the fit of the model.

Applied Linear Regression Project 4

Kalpana Gopalakrishnan