John Grando
September 27th, 2017
So why use it?
Fitting Generalized Linear Models
The glm() is used to fit generalized linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution.
Image taken from http://www.statmethods.net/advstats/glm.html
Example: Greenhouse Gas Emission intensity (GHG / square foot) in New York City
Data Source: Local Law 84 Data Disclosure and Reports (http://www.nyc.gov/html/gbee/html/plan/ll84_scores.shtml)
Local Law 84 is the NYC Benchmarking Law which requires owners of large buildings to annually measure their energy and water consumption in a process called benchmarking and disclose the results publicly.
options(width = 150)
head(energy_data_red_df[,c("PropertyUseTypeAdj", "borough", "YearCat", "GHGCat")])
PropertyUseTypeAdj borough YearCat GHGCat
1 Multifamily Housing Manhattan >Q3 0
2 Multifamily Housing Manhattan <=Q3 0
3 Multifamily Housing Manhattan <=Q3 0
4 Multifamily Housing Manhattan <=Q3 0
5 Other Manhattan >Q3 0
6 Office Manhattan >Q3 1
Year summary (YearCat) - One Category per quartile
YearSummary
Min. 1st Qu. Median Mean 3rd Qu. Max.
1600 1927 1941 1949 1966 2019
GHGUI summary (GHGCat) - Target Variable set to equal one for the highest quartile
GHGUISummary
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 4.31 5.45 31.27 7.00 39190.31
energy_data_glm <- glm(GHGCat ~ PropertyUseTypeAdj + borough + YearCat, data = energy_data_red_df)
summary(energy_data_glm)
Call:
glm(formula = GHGCat ~ PropertyUseTypeAdj + borough + YearCat,
data = energy_data_red_df)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.93065 -0.25535 -0.16609 0.06935 1.03303
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.401688 0.012512 32.105 < 2e-16 ***
PropertyUseTypeAdjCollege/University 0.121609 0.053116 2.290 0.02207 *
PropertyUseTypeAdjDistribution Center -0.186384 0.054442 -3.424 0.00062 ***
PropertyUseTypeAdjHotel 0.275779 0.029374 9.389 < 2e-16 ***
PropertyUseTypeAdjK-12 School -0.184949 0.045828 -4.036 5.48e-05 ***
PropertyUseTypeAdjMultifamily Housing -0.167128 0.013073 -12.785 < 2e-16 ***
PropertyUseTypeAdjNon-Refrigerated Warehouse -0.265270 0.032450 -8.175 3.28e-16 ***
PropertyUseTypeAdjOther 0.069889 0.022129 3.158 0.00159 **
PropertyUseTypeAdjResidence Hall/Dormitory -0.020755 0.042978 -0.483 0.62915
PropertyUseTypeAdjRetail Store -0.009896 0.045885 -0.216 0.82925
PropertyUseTypeAdjSelf-Storage Facility -0.348063 0.046224 -7.530 5.46e-14 ***
PropertyUseTypeAdjSenior Care Community 0.484201 0.041603 11.639 < 2e-16 ***
boroughBronx 0.067353 0.011713 5.750 9.14e-09 ***
boroughBrooklyn -0.146858 0.011174 -13.142 < 2e-16 ***
boroughQueens -0.089253 0.011926 -7.484 7.74e-14 ***
boroughStaten Island -0.053494 0.036330 -1.472 0.14093
YearCat<=Q2 0.014483 0.011586 1.250 0.21128
YearCat<=Q3 0.020787 0.011876 1.750 0.08007 .
YearCat>Q3 -0.022589 0.011366 -1.987 0.04690 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.169193)
Null deviance: 2111.1 on 11257 degrees of freedom
Residual deviance: 1901.6 on 11239 degrees of freedom
AIC: 11968
Number of Fisher Scoring iterations: 2
#Convert logit to probability https://sebastiansauer.github.io/convert_logit2prob/
logit2prob <- function(logit){
odds <- exp(logit)
prob <- odds / (1 + odds)
return(prob)
}
Base example - Office building in Manhattan built before 1927
coef(energy_data_glm)[[1]]
[1] 0.4016883
logit2prob(coef(energy_data_glm)[[1]])
[1] 0.5990932
High-emission example - Senior Care community building in the Bronx built after 1927 and before 1941
energy_prediction_high <- predict.glm(energy_data_glm, data.frame(PropertyUseTypeAdj="Senior Care Community", borough="Bronx", YearCat="<=Q2"), type="response", se.fit = TRUE)
energy_prediction_high$fit[[1]]
[1] 0.9677254
logit2prob(energy_prediction_high$fit)[[1]]
[1] 0.7246659
Low-emission example - Self-Storage Facility building in Brooklyn built after 1966.
energy_prediction_low <- predict.glm(energy_data_glm, data.frame(PropertyUseTypeAdj="Self-Storage Facility", borough="Brooklyn", YearCat=">Q3"), type="response", se.fit = TRUE)
energy_prediction_low$fit[[1]]
[1] -0.1158211
logit2prob(energy_prediction_low$fit)[[1]]
[1] 0.471077