Data Science in Context Presentation - DATA 607

John Grando
September 27th, 2017

Fitting A Model to Data - Summary

  • Discriminant Functions
  • Create decision lines / hyperplanes to separate data into regions with different values of the target variable.
  • These hyperplanes can be as simple as a linear function (y=mx+b) in a 2d space or very complex, consisting of many variables.

Fitting A Model to Data - Summary

  • Scoring and Ranking
  • If we want to know the likelyhood of a specific instance having a positive outcome in the target variable, we would have to use logistic regression or support vector machines.
  • More specifically, this week's reading refers to the log-odds linear functon

Logistic Regression

So why use it?

  • Gurantees the probability will be bound between 0 and 1
  • Allows for predictor variables to be expressed in a linear equation
  • Fairly easy to interpret

Logistic Regression - glm() Example

Fitting Generalized Linear Models

The glm() is used to fit generalized linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution.

Image taken from http://www.statmethods.net/advstats/glm.html

Logistic Regression - glm() Example

Example: Greenhouse Gas Emission intensity (GHG / square foot) in New York City

Data Source: Local Law 84 Data Disclosure and Reports (http://www.nyc.gov/html/gbee/html/plan/ll84_scores.shtml)

Local Law 84 is the NYC Benchmarking Law which requires owners of large buildings to annually measure their energy and water consumption in a process called benchmarking and disclose the results publicly.

Logistic Regression - glm() Example - Data Editing

  • Categorizing the year built and GHG emissions
  • Grouping unique buildings (less than 50) into one category (Other)
options(width = 150)
head(energy_data_red_df[,c("PropertyUseTypeAdj", "borough", "YearCat", "GHGCat")])
   PropertyUseTypeAdj   borough YearCat GHGCat
1 Multifamily Housing Manhattan     >Q3      0
2 Multifamily Housing Manhattan    <=Q3      0
3 Multifamily Housing Manhattan    <=Q3      0
4 Multifamily Housing Manhattan    <=Q3      0
5               Other Manhattan     >Q3      0
6              Office Manhattan     >Q3      1

Year summary (YearCat) - One Category per quartile

YearSummary
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1600    1927    1941    1949    1966    2019 

GHGUI summary (GHGCat) - Target Variable set to equal one for the highest quartile

GHGUISummary
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
    0.00     4.31     5.45    31.27     7.00 39190.31 

Logistic Regression - glm() Example

energy_data_glm <- glm(GHGCat ~ PropertyUseTypeAdj + borough + YearCat, data = energy_data_red_df)
summary(energy_data_glm)

Call:
glm(formula = GHGCat ~ PropertyUseTypeAdj + borough + YearCat, 
    data = energy_data_red_df)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-0.93065  -0.25535  -0.16609   0.06935   1.03303  

Coefficients:
                                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                                   0.401688   0.012512  32.105  < 2e-16 ***
PropertyUseTypeAdjCollege/University          0.121609   0.053116   2.290  0.02207 *  
PropertyUseTypeAdjDistribution Center        -0.186384   0.054442  -3.424  0.00062 ***
PropertyUseTypeAdjHotel                       0.275779   0.029374   9.389  < 2e-16 ***
PropertyUseTypeAdjK-12 School                -0.184949   0.045828  -4.036 5.48e-05 ***
PropertyUseTypeAdjMultifamily Housing        -0.167128   0.013073 -12.785  < 2e-16 ***
PropertyUseTypeAdjNon-Refrigerated Warehouse -0.265270   0.032450  -8.175 3.28e-16 ***
PropertyUseTypeAdjOther                       0.069889   0.022129   3.158  0.00159 ** 
PropertyUseTypeAdjResidence Hall/Dormitory   -0.020755   0.042978  -0.483  0.62915    
PropertyUseTypeAdjRetail Store               -0.009896   0.045885  -0.216  0.82925    
PropertyUseTypeAdjSelf-Storage Facility      -0.348063   0.046224  -7.530 5.46e-14 ***
PropertyUseTypeAdjSenior Care Community       0.484201   0.041603  11.639  < 2e-16 ***
boroughBronx                                  0.067353   0.011713   5.750 9.14e-09 ***
boroughBrooklyn                              -0.146858   0.011174 -13.142  < 2e-16 ***
boroughQueens                                -0.089253   0.011926  -7.484 7.74e-14 ***
boroughStaten Island                         -0.053494   0.036330  -1.472  0.14093    
YearCat<=Q2                                   0.014483   0.011586   1.250  0.21128    
YearCat<=Q3                                   0.020787   0.011876   1.750  0.08007 .  
YearCat>Q3                                   -0.022589   0.011366  -1.987  0.04690 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.169193)

    Null deviance: 2111.1  on 11257  degrees of freedom
Residual deviance: 1901.6  on 11239  degrees of freedom
AIC: 11968

Number of Fisher Scoring iterations: 2

Logistic Regression - glm() Example

#Convert logit to probability https://sebastiansauer.github.io/convert_logit2prob/
logit2prob <- function(logit){
  odds <- exp(logit)
  prob <- odds / (1 + odds)
  return(prob)
}

Base example - Office building in Manhattan built before 1927

coef(energy_data_glm)[[1]]
[1] 0.4016883
logit2prob(coef(energy_data_glm)[[1]])
[1] 0.5990932

Logistic Regression - glm() Example

High-emission example - Senior Care community building in the Bronx built after 1927 and before 1941

energy_prediction_high <- predict.glm(energy_data_glm, data.frame(PropertyUseTypeAdj="Senior Care Community", borough="Bronx", YearCat="<=Q2"), type="response", se.fit = TRUE)
energy_prediction_high$fit[[1]]
[1] 0.9677254
logit2prob(energy_prediction_high$fit)[[1]]
[1] 0.7246659

Low-emission example - Self-Storage Facility building in Brooklyn built after 1966.

energy_prediction_low <- predict.glm(energy_data_glm, data.frame(PropertyUseTypeAdj="Self-Storage Facility", borough="Brooklyn", YearCat=">Q3"), type="response", se.fit = TRUE)
energy_prediction_low$fit[[1]]
[1] -0.1158211
logit2prob(energy_prediction_low$fit)[[1]]
[1] 0.471077

Conclusion

  • glm() function can create logistic regression models
  • Business Case - Targeted high-use buildings can reap the largest benefits (low-hanging fruit)
  • Using Logistic Regression (or more generally, function fitting) models and predicting the likelyhood of target variables can help identify sub-classes within the population.
  • This methodology can be scalable to more variables as long as they improve the model, fortunately there are methods to measure the fit!

    Side Notes:
  • Create a *.css file (same name as the Rpresentation) to automatically apply css to your presentation.
  • Use '.reveal' to override the base values: .reveal p {font-size: 0.75em;}
  • Use <br> to create linebreaks (html code can be directly inserted)