By: Maria Isabel Olivera
The NY Energy Star program is a New York State program which helps certain NYS homeowners pay for home improvements which make their homes more energy efficient. Eligibility for a subsidy is based on a homeowners income. For this weeks analysis I will be utizing data on the NY Energy Star program which I obtained from Open Data NY. The data for this dataset was provided by NYS contractors who implemented energy-efficiency projects for homeowners throughout the state.
For this analysis, my dependent variable will be “savings”, which is the dollar amount that homeowners saved on their utility bills during the first year after the completion of their energy-efficiency home renovation project.
My independent variable is “Customer” which is binary. Since not all homeowners were eligible to receive a subsidy through the NY Energy Star program due to their income, some homeowners received assistance while other homeowners paid market price for their projects. Homeowner who received assistance are classified as “Assisted” while those who paid market price where classified as “Market”. I will see if subsized or “Assisted” homeowners saw any difference in their first years savings compared to those homeowners who paid market price.
Finally, because contacting costs and homeowner incomes vary from county to county, I will group my data by County and see if different counties saw differences in savings rates.
library(nlme)
library(dplyr)
library(magrittr)
library(tidyr)
library(lmerTest)
library(ggplot2)
library(texreg)
library(lme4)
library(merTools)
cpooling <- lm(Savings ~ Customer, data = nyestar)
summary(cpooling)
Call:
lm(formula = Savings ~ Customer, data = nyestar)
Residuals:
Min 1Q Median 3Q Max
-1335.3 -471.1 -235.3 182.9 23648.7
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 613.062 5.308 115.49 <2e-16 ***
CustomerMarket 134.190 6.938 19.34 <2e-16 ***
---
Signif. codes:
0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 755.7 on 48869 degrees of freedom
Multiple R-squared: 0.007596, Adjusted R-squared: 0.007576
F-statistic: 374 on 1 and 48869 DF, p-value: < 2.2e-16
When we pool all homeowners together, regardless of county, it appears that homeowners who paid market price saved on average $134.00 compared to those homeowners who received subsidies.
For the next analysis we have run models for each individual county to see if there are differences in the intercept between counties.
dcoef <- nyestar %>%
group_by(County) %>%
do(mod = lm(Savings ~ Customer, data = .))
coef <- dcoef %>% do(data.frame(intc = coef(.$mod)[1]))
ggplot(coef, aes(x = intc)) + geom_histogram()
As you can see there is a large amount of variation between the 62 counties of New York with some counties seeing no increase in savings amongst those customers who paid market price for their home improvements.
The below graph shows the changes in slope between the counties:
dcoef <- nyestar %>%
group_by(County) %>%
do(mod = lm(Savings ~ Customer, data = .))
coef <- dcoef %>% do(data.frame(Customerc = coef(.$mod)[2]))
ggplot(coef, aes(x = Customerc)) + geom_histogram()
Once again we see a significant amount of variation between counties with regards to changes in the slopes of this models.
m1_lme <- lme(Savings ~ Customer, data = nyestar, random = ~1|County, method = "ML")
summary(m1_lme)
Linear mixed-effects model fit by maximum likelihood
Data: nyestar
AIC BIC logLik
776100 776135.2 -388046
Random effects:
Formula: ~1 | County
(Intercept) Residual
StdDev: 248.4022 677.8631
Fixed effects: Savings ~ Customer
Value Std.Error DF t-value p-value
(Intercept) 833.6524 32.65232 48808 25.531181 0.0000
CustomerMarket 1.9269 6.65599 48808 0.289498 0.7722
Correlation:
(Intr)
CustomerMarket -0.121
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-2.5436993 -0.5153256 -0.1947324 0.2347675 34.5550366
Number of Observations: 48871
Number of Groups: 62
m2_lme <- lme(Savings ~ Customer, data = nyestar, random = ~ Customer|County, method = "ML")
summary(m2_lme)
Linear mixed-effects model fit by maximum likelihood
Data: nyestar
AIC BIC logLik
775835.7 775888.5 -387911.9
Random effects:
Formula: ~Customer | County
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
(Intercept) 245.4908 (Intr)
CustomerMarket 179.7925 -0.289
Residual 675.2749
Fixed effects: Savings ~ Customer
Value Std.Error DF t-value p-value
(Intercept) 818.8808 32.99093 48808 24.821394 0.0000
CustomerMarket 38.6675 26.44093 48808 1.462411 0.1436
Correlation:
(Intr)
CustomerMarket -0.354
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-2.6601575 -0.5087687 -0.1861220 0.2284160 34.1485973
Number of Observations: 48871
Number of Groups: 62
AIC(cpooling, m1_lme, m2_lme)
htmlreg(list(m1_lme, m2_lme))
| Model 1 | Model 2 | ||
|---|---|---|---|
| (Intercept) | 833.65*** | 818.88*** | |
| (32.65) | (32.99) | ||
| CustomerMarket | 1.93 | 38.67 | |
| (6.66) | (26.44) | ||
| AIC | 776100.00 | 775835.70 | |
| BIC | 776135.19 | 775888.48 | |
| Log Likelihood | -388046.00 | -387911.85 | |
| Num. obs. | 48871 | 48871 | |
| Num. groups | 62 | 62 | |
| p < 0.001, p < 0.01, p < 0.05 | |||
Looking at the random intercept and random slope models, while we see variations in both models, the random slope model shows a larger degree of variation between counties and appears to be a slightly better fit.
fm1 <- lmer(Savings ~ Customer + (1|County), data=nyestar)
library(merTools)
fastdisp(fm1)
lme4::lmer(formula = Savings ~ Customer + (1 | County), data = nyestar)
coef.est coef.se
(Intercept) 833.74 32.91
CustomerMarket 1.92 6.66
Error terms:
Groups Name Std.Dev.
County (Intercept) 250.50
Residual 677.87
---
number of obs: 48871, groups: County, 62
AIC = 776086
Customer1 <- draw(fm1, type = 'random')
head(Customer1)
Preds <- predictInterval(fm1, newdata = Customer1,
level = 0.95, n.sims = 1000,
stat = "mean", type = "linear.prediction", include.resid.var = TRUE)
summary(Preds)
fit upr lwr
Min. :1200 Min. :2460 Min. :-119.5
1st Qu.:1200 1st Qu.:2460 1st Qu.:-119.5
Median :1200 Median :2460 Median :-119.5
Mean :1200 Mean :2460 Mean :-119.5
3rd Qu.:1200 3rd Qu.:2460 3rd Qu.:-119.5
Max. :1200 Max. :2460 Max. :-119.5
Based on the upper and lower values of our confidence interval, it appears that the effect of the variable “Customer” may not be significant. This is most likely due to the fact that there are other variables (such as Fuel type), which have a greater effect on savings.