Assignment #04: Logistic Regression [Final Version]

Logistic Regression Project - Analysis of Building Type by Energy Usage

Brendan Howell

Rensselaer Polytechnic Institute

05/14/15 - Version 1.0

1. Data

Dataset of Energy Usage in Chicago in 2010.

Description

This dataset, which reports several measures of energy consumption for households, businesses, and industries in the city of Chicago in 2010, was aggregated from ComEd and Peoples Natural Gas by Accenture. It contains 66,974 observations and 73 variables and accounts for approximately 88% of Chicago buildings’ electrical and gas usage in 2010, representing 68% of Chicago’s overall electrical usage and 81% of Chicago’s gas consumption. For this analysis, only three variables are used to test our null hypothesis: ‘BUILDING_TYPE’ (the type of building, which is residential, commercial, or industrial), ‘TOTAL_POPULATION’ (the population capacity of a given building), and ‘TOTAL_KWH’ (the total electrical energy consumed in 2010, in kilowatt-hours).

[Reference: http://catalog.data.gov/dataset/energy-usage-2010-24a67]
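
As a rough illustration of how these three variables could be isolated and prepared for a logistic regression, the sketch below builds a six-row toy data frame from the values in the first six rows of the dataset. The object name "energy_small", the column "IS_RESIDENTIAL", and the choice to code the response as residential versus non-residential are assumptions made for illustration only, not necessarily the coding used later in this analysis.

```r
# Toy data frame holding the first six rows of the three variables of interest.
# The name `energy_small` is hypothetical and used only for this sketch.
energy_small <- data.frame(
  BUILDING_TYPE    = c("Residential", "Residential", "Residential",
                       "Residential", "Residential", "Commercial"),
  TOTAL_KWH        = c(201032, 20608, 61020, 16752, 244341, 45330),
  TOTAL_POPULATION = c(132, 132, 132, 228, 228, 231),
  stringsAsFactors = FALSE
)

# Keep only the three variables of interest. On the full 73-column data frame
# this drops every other column; on the toy frame it changes nothing.
vars <- c("BUILDING_TYPE", "TOTAL_KWH", "TOTAL_POPULATION")
energy_small <- energy_small[, vars]

# ASSUMPTION: a binary response coded 1 = Residential, 0 = any other type.
energy_small$IS_RESIDENTIAL <- as.integer(energy_small$BUILDING_TYPE == "Residential")

# Count the buildings of each type in the toy frame.
table(energy_small$BUILDING_TYPE)
```

The same subsetting step could be applied to the full data frame once it is loaded; rows with missing ‘TOTAL_KWH’ values would need to be handled before fitting any model.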

Data Organization

Below, the “Energy Usage 2010” dataset is loaded into R, and its summary statistics and structure are displayed (along with the “head” and the “tail” of the dataset).

#Load the "Energy Usage 2010" dataset into R, assigning the complete data frame to the variable "energy_raw".
rm(list=ls())
energy_raw <- read.csv("~/Academics (RPI)/10. Spring 2015/Applied Regression Analysis/Assignments/Assignment #4/Energy_Usage_2010.csv", header=TRUE, stringsAsFactors = FALSE)
#Then, display the "head" and "tail" of the dataset, "energy_raw".
head(energy_raw)

##   COMMUNITY.AREA.NAME CENSUS.BLOCK BUILDING_TYPE BUILDING_SUBTYPE
## 1         Albany Park      1.7e+14   Residential         Multi 7+
## 2         Albany Park      1.7e+14   Residential        Multi < 7
## 3         Albany Park      1.7e+14   Residential    Single Family
## 4         Albany Park      1.7e+14   Residential         Multi 7+
## 5         Albany Park      1.7e+14   Residential        Multi < 7
## 6         Albany Park      1.7e+14    Commercial        Multi < 7
##   KWH.JANUARY.2010 KWH.FEBRUARY.2010 KWH.MARCH.2010 KWH.APRIL.2010
## 1            11921             12145           9759          11542
## 2             1233              1645            994           1055
## 3             4141              3798           2939           4727
## 4             1230              1333           1260           1405
## 5            12977             14639          12718          14973
## 6             2878              3755           4571           2984
##   KWH.MAY.2010 KWH.JUNE.2010 KWH.JULY.2010 KWH.AUGUST.2010
## 1        14348         26617         24210           20383
## 2         1284          3527          3099            2527
## 3         5324          9676          7591            6287
## 4         1699          2094           732            1312
## 5        16384         32940         24454           23926
## 6         3111          4808          4132            3564
##   KWH.SEPTEMBER.2010 KWH.OCTOBER.2010 KWH.NOVEMBER.2010 KWH.DECEMBER.2010
## 1              11983            10335             25327             22462
## 2                904              626              2092              1622
## 3               2920             2565              5979              5073
## 4               1462             1358              1372              1495
## 5              15012            13679             31979             30660
## 6               2174             1985              5968              5400
##   TOTAL_KWH ELECTRICITY.ACCOUNTS ZERO.KWH.ACCOUNTS THERM.JANUARY.2010
## 1    201032                   48                22               7247
## 2     20608          Less than 4                 1                321
## 3     61020                    6                 2               1222
## 4     16752          Less than 4                 2               2961
## 5    244341                   49                32              11508
## 6     45330                    7                 0               1793
##   THERM.FEBRUARY.2010 THERM.MARCH.2010 TERM.APRIL.2010 THERM.MAY.2010
## 1                5904             5180            3113           1822
## 2                 130               86              49             19
## 3                1016              860             543            346
## 4                2664             1616             798            344
## 5                9057             8000            4529           2809
## 6                1573             1352             890            853
##   THERM.JUNE.2010 THERM.JULY.2010 THERM.AUGUST.2010 THERM.SEPTEMBER.2010
## 1            1272            1234               952                 1780
## 2              13               7                10                   12
## 3             247             203               179                  170
## 4             404             320               272                  368
## 5            1507            1179               991                  994
## 6             541             448               438                  439
##   THERM.OCTOBER.2010 THERM.NOVEMBER.2010 THERM.DECEMBER.2010 TOTAL_THERMS
## 1               1472                1961                4885        36822
## 2                  9                  21                  78          755
## 3                190                 298                 791         6065
## 4                745                1260                2901        14653
## 5               1254                2595                7167        51590
## 6                565                 787                1538        11217
##   GAS.ACCOUNTS KWH.TOTAL.SQFT THERMS.TOTAL.SQFT KWH.MEAN.2010
## 1           21          48825             48825      20103.20
## 2  Less than 4           3306              3306      20608.00
## 3            6           9472              9472      10170.00
## 4            6          14407             14407      16752.00
## 5           54          58835             58835      15271.31
## 6            6           8240              8240      22665.00
##   KWH.STANDARD.DEVIATION.2010 KWH.MINIMUM.2010 KWH.1ST.QUARTILE.2010
## 1                     8609.69             9414               12563.0
## 2                          NA            20608               20608.0
## 3                     4410.10             5619                6746.0
## 4                          NA            16752               16752.0
## 5                     8089.70             5462               10343.5
## 6                     9526.14            15929               15929.0
##   KWH.2ND.QUARTILE.2010 KWH.3RD.QUARTILE.2010 KWH.MAXIMUM.2010
## 1               19072.5               22177.0            36781
## 2               20608.0               20608.0            20608
## 3                9055.5               13014.0            17530
## 4               16752.0               16752.0            16752
## 5               12427.0               17495.5            34236
## 6               22665.0               29401.0            29401
##   KWH.SQFT.MEAN.2010 KWH.SQFT.STANDARD.DEVIATION.2010
## 1           24412.50                          5698.57
## 2            3306.00                               NA
## 3            1578.67                           863.85
## 4           14407.00                               NA
## 5            3677.19                          1061.65
## 6            8240.00                               NA
##   KWH.SQFT.MINIMUM.2010 KWH.SQFT.1ST.QUARTILE.2010
## 1                 20383                      20383
## 2                  3306                       3306
## 3                  1226                       1226
## 4                 14407                      14407
## 5                  2414                       2546
## 6                  8240                       8240
##   KWH.SQFT.2ND.QUARTILE.2010 KWH.SQFT.3RD.QUARTILE.2010
## 1                    24412.5                      28442
## 2                     3306.0                       3306
## 3                     1226.0                       1226
## 4                    14407.0                      14407
## 5                     3553.5                       4692
## 6                     8240.0                       8240
##   KWH.SQFT.MAXIMUM.2010 THERM.MEAN.2010 THERM.STANDARD.DEVIATION.2010
## 1                 28442         5260.29                       8435.63
## 2                  3306          755.00                            NA
## 3                  3342         1010.83                        620.53
## 4                 14407        14653.00                            NA
## 5                  5530         3224.38                       1079.13
## 6                  8240         5608.50                       5620.79
##   THERM.MINIMUM.2010 THERM.1ST.QUARTILE.2010 THERM.2ND.QUARTILE.2010
## 1                882                     957                  1102.0
## 2                755                     755                   755.0
## 3                496                     514                   835.5
## 4              14653                   14653                 14653.0
## 5               2071                    2499                  2933.5
## 6               1634                    1634                  5608.5
##   THERM.3RD.QUARTILE.2010 THERM.MAXIMUM.2010 THERMS.SQFT.MEAN.2010
## 1                  8024.0              23460              24412.50
## 2                   755.0                755               3306.00
## 3                  1240.0               2144               1578.67
## 4                 14653.0              14653              14407.00
## 5                  3593.5               5754               3677.19
## 6                  9583.0               9583               8240.00
##   THERMS.SQFT.STANDARD.DEVIATION.2010 THERMS.SQFT.MINIMUM.2010
## 1                             5698.57                    20383
## 2                                  NA                     3306
## 3                              863.85                     1226
## 4                                  NA                    14407
## 5                             1061.65                     2414
## 6                                  NA                     8240
##   THERMS.SQFT.1ST.QUARTILE.2010 THERMS.SQFT.2ND.QUARTILE.2010
## 1                         20383                       24412.5
## 2                          3306                        3306.0
## 3                          1226                        1226.0
## 4                         14407                       14407.0
## 5                          2546                        3553.5
## 6                          8240                        8240.0
##   THERMS.SQFT.3RD.QUARTILE.2010 THERMS.SQFT.MAXIMUM.2010 TOTAL_POPULATION
## 1                         28442                    28442              132
## 2                          3306                     3306              132
## 3                          1226                     3342              132
## 4                         14407                    14407              228
## 5                          4692                     5530              228
## 6                          8240                     8240              231
##   TOTAL.UNITS AVERAGE.STORIES AVERAGE.BUILDING.AGE AVERAGE.HOUSESIZE
## 1          64            3.00                65.50              2.20
## 2          64            2.00                86.00              2.20
## 3          64            1.17                14.33              2.20
## 4          79            3.00                86.00              3.51
## 5          79            2.50                87.69              3.51
## 6          70            1.00                 0.00              3.73
##   OCCUPIED.UNITS OCCUPIED.UNITS.PERCENTAGE RENTER.OCCUPIED.HOUSING.UNITS
## 1             60                    0.9375                            33
## 2             60                    0.9375                            33
## 3             60                    0.9375                            33
## 4             65                    0.8228                            49
## 5             65                    0.8228                            49
## 6             62                    0.8856                            49
##   RENTER.OCCUPIED.HOUSING.PERCENTAGE OCCUPIED.HOUSING.UNITS
## 1                              0.550                     60
## 2                              0.550                     60
## 3                              0.550                     60
## 4                              0.754                     65
## 5                              0.754                     65
## 6                              0.790                     62

tail(energy_raw)

##       COMMUNITY.AREA.NAME CENSUS.BLOCK BUILDING_TYPE BUILDING_SUBTYPE
## 66969            Woodlawn      1.7e+14   Residential        Multi < 7
## 66970            Woodlawn      1.7e+14   Residential    Single Family
## 66971            Woodlawn      1.7e+14    Commercial        Multi < 7
## 66972            Woodlawn      1.7e+14   Residential        Multi < 7
## 66973            Woodlawn      1.7e+14   Residential    Single Family
## 66974            Woodlawn      1.7e+14   Residential        Multi < 7
##       KWH.JANUARY.2010 KWH.FEBRUARY.2010 KWH.MARCH.2010 KWH.APRIL.2010
## 66969             9572              9104           8525           7756
## 66970             2705              1318           1582           1465
## 66971             1005              1760           1521           1832
## 66972             3567              3031           2582           2295
## 66973             1208              1055           1008           1109
## 66974             2717              3057           2695           3793
##       KWH.MAY.2010 KWH.JUNE.2010 KWH.JULY.2010 KWH.AUGUST.2010
## 66969        11256         11669         12099           13200
## 66970         1494          2990          2449            2351
## 66971         2272          2361          3018            3030
## 66972         7902          4987          5773            3996
## 66973         1591          1367          1569            1551
## 66974         4237          5383          5544            6929
##       KWH.SEPTEMBER.2010 KWH.OCTOBER.2010 KWH.NOVEMBER.2010
## 66969               9694             8419             19077
## 66970               1213             2174              2888
## 66971               2886             3833              6290
## 66972               3050             3103              3880
## 66973               1376             1236              2108
## 66974               5280             5971              6986
##       KWH.DECEMBER.2010 TOTAL_KWH ELECTRICITY.ACCOUNTS ZERO.KWH.ACCOUNTS
## 66969             18869    139240                   21                18
## 66970              5025     27654                    6                 7
## 66971             12169     41977                    9                 5
## 66972              4684     48850                    7                 2
## 66973              2529     17707                    7                 9
## 66974              5144     57736                   12                17
##       THERM.JANUARY.2010 THERM.FEBRUARY.2010 THERM.MARCH.2010
## 66969               6914                5433             5054
## 66970               2166                1681             1858
## 66971                985                1152             1238
## 66972               2202                1874             1647
## 66973                 95                  11               47
## 66974               2372                1787             1449
##       TERM.APRIL.2010 THERM.MAY.2010 THERM.JUNE.2010 THERM.JULY.2010
## 66969            2967           2241            1107             770
## 66970            1172            708             360              72
## 66971             630            475             192             141
## 66972             906            645             346              84
## 66973               9             45              18              22
## 66974             718            572             286             155
##       THERM.AUGUST.2010 THERM.SEPTEMBER.2010 THERM.OCTOBER.2010
## 66969               674                  788                954
## 66970                67                   77                185
## 66971               162                  144                210
## 66972               150                  150                260
## 66973                 9                   17                 11
## 66974               134                  161                303
##       THERM.NOVEMBER.2010 THERM.DECEMBER.2010 TOTAL_THERMS GAS.ACCOUNTS
## 66969                2423                4619        33944           25
## 66970                 623                1800        10769            9
## 66971                 653                1744         7726            8
## 66972                 694                1335        10293            5
## 66973                  18                  13          315            5
## 66974                 588                1469         9994           13
##       KWH.TOTAL.SQFT THERMS.TOTAL.SQFT KWH.MEAN.2010
## 66969          48349             48349      12658.18
## 66970           7801              7801       6913.50
## 66971          11838             11838      13992.33
## 66972          11028             11028      16283.33
## 66973           4653              4653       4426.75
## 66974          17812             13776       9622.67
##       KWH.STANDARD.DEVIATION.2010 KWH.MINIMUM.2010 KWH.1ST.QUARTILE.2010
## 66969                     7948.06             2691                7635.0
## 66970                     5695.82             2444                2872.5
## 66971                     2989.54            10754               10754.0
## 66972                    15000.83             7010                7010.0
## 66973                     2297.29             1878                2635.0
## 66974                     5625.23             1312                6288.0
##       KWH.2ND.QUARTILE.2010 KWH.3RD.QUARTILE.2010 KWH.MAXIMUM.2010
## 66969               11370.0               19168.0            30287
## 66970                5139.0               10954.5            14932
## 66971               14576.0               16647.0            16647
## 66972                8250.0               33590.0            33590
## 66973                4325.0                6218.5             7179
## 66974                9586.5               15290.0            15673
##       KWH.SQFT.MEAN.2010 KWH.SQFT.STANDARD.DEVIATION.2010
## 66969             4834.9                          2180.96
## 66970             3900.5                          1429.06
## 66971             5919.0                           725.49
## 66972             3676.0                          1022.80
## 66973             4653.0                               NA
## 66974             3562.4                          2911.56
##       KWH.SQFT.MINIMUM.2010 KWH.SQFT.1ST.QUARTILE.2010
## 66969                  2810                       3166
## 66970                  2890                       2890
## 66971                  5406                       5406
## 66972                  2800                       2800
## 66973                  4653                       4653
## 66974                  1866                       2170
##       KWH.SQFT.2ND.QUARTILE.2010 KWH.SQFT.3RD.QUARTILE.2010
## 66969                     3771.0                       7232
## 66970                     3900.5                       4911
## 66971                     5919.0                       6432
## 66972                     3428.0                       4800
## 66973                     4653.0                       4653
## 66974                     2472.0                       2556
##       KWH.SQFT.MAXIMUM.2010 THERM.MEAN.2010 THERM.STANDARD.DEVIATION.2010
## 66969                  8016         3085.82                       1542.64
## 66970                  4911         2692.25                       3661.92
## 66971                  6432         2575.33                       3492.97
## 66972                  4800         3431.00                       1155.32
## 66973                  4653          105.00                         80.30
## 66974                  8748         2498.50                       2372.88
##       THERM.MINIMUM.2010 THERM.1ST.QUARTILE.2010 THERM.2ND.QUARTILE.2010
## 66969                621                    2300                  2669.0
## 66970                272                     464                  1195.5
## 66971                 42                      42                  1124.0
## 66972               2449                    2449                  3140.0
## 66973                 49                      49                    69.0
## 66974                487                     578                  2029.0
##       THERM.3RD.QUARTILE.2010 THERM.MAXIMUM.2010 THERMS.SQFT.MEAN.2010
## 66969                  4408.0               6246                4834.9
## 66970                  4920.5               8106                3900.5
## 66971                  6560.0               6560                5919.0
## 66972                  4704.0               4704                3676.0
## 66973                   197.0                197                4653.0
## 66974                  4419.0               5449                4592.0
##       THERMS.SQFT.STANDARD.DEVIATION.2010 THERMS.SQFT.MINIMUM.2010
## 66969                             2180.96                     2810
## 66970                             1429.06                     2890
## 66971                              725.49                     5406
## 66972                             1022.80                     2800
## 66973                                  NA                     4653
## 66974                             3599.45                     2472
##       THERMS.SQFT.1ST.QUARTILE.2010 THERMS.SQFT.2ND.QUARTILE.2010
## 66969                          3166                        3771.0
## 66970                          2890                        3900.5
## 66971                          5406                        5919.0
## 66972                          2800                        3428.0
## 66973                          4653                        4653.0
## 66974                          2472                        2556.0
##       THERMS.SQFT.3RD.QUARTILE.2010 THERMS.SQFT.MAXIMUM.2010
## 66969                          7232                     8016
## 66970                          4911                     4911
## 66971                          6432                     6432
## 66972                          4800                     4800
## 66973                          4653                     4653
## 66974                          8748                     8748
##       TOTAL_POPULATION TOTAL.UNITS AVERAGE.STORIES AVERAGE.BUILDING.AGE
## 66969              116          55            2.00                51.90
## 66970              116          55            1.00                 0.00
## 66971               31          24            3.00               104.50
## 66972               31          24            2.33               100.67
## 66973                0           0            1.00                 0.00
## 66974               77          49            2.00                79.40
##       AVERAGE.HOUSESIZE OCCUPIED.UNITS OCCUPIED.UNITS.PERCENTAGE
## 66969              3.14             37                    0.6727
## 66970              3.14             37                    0.6727
## 66971              2.07             15                    0.6250
## 66972              2.07             15                    0.6250
## 66973              0.00              0                        NA
## 66974              2.57             30                    0.6122
##       RENTER.OCCUPIED.HOUSING.UNITS RENTER.OCCUPIED.HOUSING.PERCENTAGE
## 66969                            26                             0.7030
## 66970                            26                             0.7030
## 66971                            13                             0.8670
## 66972                            13                             0.8670
## 66973                             0                                 NA
## 66974                            28                             0.9329
##       OCCUPIED.HOUSING.UNITS
## 66969                     37
## 66970                     37
## 66971                     15
## 66972                     15
## 66973                      0
## 66974                     30

#Display the summary statistics and the structure of the data
summary(energy_raw)

##  COMMUNITY.AREA.NAME  CENSUS.BLOCK     BUILDING_TYPE     
##  Length:66974        Min.   :1.7e+14   Length:66974      
##  Class :character    1st Qu.:1.7e+14   Class :character  
##  Mode  :character    Median :1.7e+14   Mode  :character  
##                      Mean   :1.7e+14                     
##                      3rd Qu.:1.7e+14                     
##                      Max.   :1.7e+14                     
##                                                          
##  BUILDING_SUBTYPE   KWH.JANUARY.2010   KWH.FEBRUARY.2010 
##  Length:66974       Min.   :       0   Min.   :       0  
##  Class :character   1st Qu.:    1369   1st Qu.:    1612  
##  Mode  :character   Median :    3476   Median :    3806  
##                     Mean   :   12810   Mean   :   12582  
##                     3rd Qu.:    7138   3rd Qu.:    7396  
##                     Max.   :21214017   Max.   :21065500  
##                     NA's   :871        NA's   :871       
##  KWH.MARCH.2010     KWH.APRIL.2010      KWH.MAY.2010     
##  Min.   :       0   Min.   :       0   Min.   :       0  
##  1st Qu.:    1585   1st Qu.:    1578   1st Qu.:    1955  
##  Median :    3676   Median :    3636   Median :    4522  
##  Mean   :   11707   Mean   :   11463   Mean   :   13853  
##  3rd Qu.:    7042   3rd Qu.:    6989   3rd Qu.:    8922  
##  Max.   :18503691   Max.   :17310058   Max.   :21344049  
##  NA's   :871        NA's   :871        NA's   :871       
##  KWH.JUNE.2010      KWH.JULY.2010      KWH.AUGUST.2010   
##  Min.   :       0   Min.   :       0   Min.   :       0  
##  1st Qu.:    2695   1st Qu.:    3199   1st Qu.:    2834  
##  Median :    6283   Median :    7375   Median :    6404  
##  Mean   :   17213   Mean   :   18845   Mean   :   16989  
##  3rd Qu.:   12793   3rd Qu.:   14624   3rd Qu.:   12274  
##  Max.   :20209197   Max.   :21478035   Max.   :18586958  
##  NA's   :871        NA's   :871        NA's   :871       
##  KWH.SEPTEMBER.2010 KWH.OCTOBER.2010   KWH.NOVEMBER.2010 
##  Min.   :       0   Min.   :       0   Min.   :       0  
##  1st Qu.:    2024   1st Qu.:    1951   1st Qu.:    2639  
##  Median :    4566   Median :    4354   Median :    5851  
##  Mean   :   13595   Mean   :   12595   Mean   :   15705  
##  3rd Qu.:    8612   3rd Qu.:    8154   3rd Qu.:   11044  
##  Max.   :19280342   Max.   :18423025   Max.   :20670698  
##  NA's   :871        NA's   :871        NA's   :871       
##  KWH.DECEMBER.2010    TOTAL_KWH         ELECTRICITY.ACCOUNTS
##  Min.   :       0   Min.   :      102   Length:66974        
##  1st Qu.:    3076   1st Qu.:    28188   Class :character    
##  Median :    6813   Median :    62272   Mode  :character    
##  Mean   :   18315   Mean   :   175672                       
##  3rd Qu.:   12602   3rd Qu.:   118172                       
##  Max.   :25060008   Max.   :231280522                       
##  NA's   :871        NA's   :871                             
##  ZERO.KWH.ACCOUNTS THERM.JANUARY.2010 THERM.FEBRUARY.2010 THERM.MARCH.2010
##  Min.   :  0.000   Min.   :     1     Min.   :     1      Min.   :     1  
##  1st Qu.:  1.000   1st Qu.:  1022     1st Qu.:   897      1st Qu.:   736  
##  Median :  2.000   Median :  2141     Median :  1901      Median :  1558  
##  Mean   :  4.771   Mean   :  3306     Mean   :  2893      Mean   :  2406  
##  3rd Qu.:  5.000   3rd Qu.:  3866     3rd Qu.:  3418      3rd Qu.:  2808  
##  Max.   :601.000   Max.   :566238     Max.   :511323      Max.   :557509  
##                    NA's   :2230       NA's   :4232        NA's   :1482    
##  TERM.APRIL.2010  THERM.MAY.2010     THERM.JUNE.2010    THERM.JULY.2010   
##  Min.   :     1   Min.   :     1.0   Min.   :     1.0   Min.   :     1.0  
##  1st Qu.:   354   1st Qu.:   209.0   1st Qu.:   113.0   1st Qu.:    87.0  
##  Median :   779   Median :   469.0   Median :   256.0   Median :   197.0  
##  Mean   :  1261   Mean   :   807.2   Mean   :   498.3   Mean   :   418.4  
##  3rd Qu.:  1440   3rd Qu.:   875.0   3rd Qu.:   486.0   3rd Qu.:   369.0  
##  Max.   :624882   Max.   :651226.0   Max.   :631383.0   Max.   :680201.0  
##  NA's   :1575     NA's   :1857       NA's   :1767       NA's   :1820      
##  THERM.AUGUST.2010  THERM.SEPTEMBER.2010 THERM.OCTOBER.2010
##  Min.   :     1.0   Min.   :     1.0     Min.   :     1.0  
##  1st Qu.:    79.0   1st Qu.:    82.0     1st Qu.:   122.0  
##  Median :   180.0   Median :   187.0     Median :   276.0  
##  Mean   :   399.7   Mean   :   401.2     Mean   :   568.2  
##  3rd Qu.:   340.0   3rd Qu.:   347.0     3rd Qu.:   509.2  
##  Max.   :693230.0   Max.   :634051.0     Max.   :593026.0  
##  NA's   :1908       NA's   :2282         NA's   :1722      
##  THERM.NOVEMBER.2010 THERM.DECEMBER.2010  TOTAL_THERMS    
##  Min.   :     1      Min.   :     1      Min.   :     25  
##  1st Qu.:   282      1st Qu.:   774      1st Qu.:   4879  
##  Median :   629      Median :  1631      Median :  10340  
##  Mean   :  1150      Mean   :  2645      Mean   :  16524  
##  3rd Qu.:  1167      3rd Qu.:  2965      3rd Qu.:  18570  
##  Max.   :539356      Max.   :566326      Max.   :7035940  
##  NA's   :1559        NA's   :1544        NA's   :1296     
##  GAS.ACCOUNTS       KWH.TOTAL.SQFT    THERMS.TOTAL.SQFT
##  Length:66974       Min.   :    300   Min.   :    300  
##  Class :character   1st Qu.:   5385   1st Qu.:   5368  
##  Mode  :character   Median :  10858   Median :  10844  
##                     Mean   :  21093   Mean   :  20347  
##                     3rd Qu.:  18721   3rd Qu.:  18844  
##                     Max.   :6548217   Max.   :6548217  
##                     NA's   :1150      NA's   :1673     
##  KWH.MEAN.2010       KWH.STANDARD.DEVIATION.2010 KWH.MINIMUM.2010   
##  Min.   :      102   Min.   :        0           Min.   :      100  
##  1st Qu.:     8229   1st Qu.:     3630           1st Qu.:     2164  
##  Median :    10515   Median :     5148           Median :     4377  
##  Mean   :    62493   Mean   :    40323           Mean   :    36852  
##  3rd Qu.:    15645   3rd Qu.:     8065           3rd Qu.:     8774  
##  Max.   :227750000   Max.   :162851049           Max.   :227752064  
##  NA's   :871         NA's   :9956                NA's   :871        
##  KWH.1ST.QUARTILE.2010 KWH.2ND.QUARTILE.2010 KWH.3RD.QUARTILE.2010
##  Min.   :      100     Min.   :      102     Min.   :      102    
##  1st Qu.:     4766     1st Qu.:     7636     1st Qu.:    10477    
##  Median :     6746     Median :     9944     Median :    13623    
##  Mean   :    39158     Mean   :    55773     Mean   :    85608    
##  3rd Qu.:    10374     3rd Qu.:    14603     3rd Qu.:    20018    
##  Max.   :227752064     Max.   :227752064     Max.   :230793342    
##  NA's   :871           NA's   :871           NA's   :871          
##  KWH.MAXIMUM.2010    KWH.SQFT.MEAN.2010 KWH.SQFT.STANDARD.DEVIATION.2010
##  Min.   :      102   Min.   :    300    Min.   :      0                 
##  1st Qu.:    13281   1st Qu.:   1326    1st Qu.:    240                 
##  Median :    18033   Median :   2214    Median :    471                 
##  Mean   :   103512   Mean   :   7665    Mean   :   3446                 
##  3rd Qu.:    26276   3rd Qu.:   3790    3rd Qu.:   1048                 
##  Max.   :230793342   Max.   :6548217    Max.   :3840818                 
##  NA's   :871         NA's   :1150       NA's   :15385                   
##  KWH.SQFT.MINIMUM.2010 KWH.SQFT.1ST.QUARTILE.2010
##  Min.   :    100       Min.   :    100           
##  1st Qu.:    954       1st Qu.:   1078           
##  Median :   1534       Median :   1760           
##  Mean   :   5604       Mean   :   5792           
##  3rd Qu.:   2684       3rd Qu.:   2854           
##  Max.   :6548217       Max.   :6548217           
##  NA's   :1150          NA's   :1150              
##  KWH.SQFT.2ND.QUARTILE.2010 KWH.SQFT.3RD.QUARTILE.2010
##  Min.   :    300            Min.   :    300           
##  1st Qu.:   1250            1st Qu.:   1490           
##  Median :   2132            Median :   2470           
##  Mean   :   7268            Mean   :   9534           
##  3rd Qu.:   3612            3rd Qu.:   4491           
##  Max.   :6548217            Max.   :6548217           
##  NA's   :1150               NA's   :1150              
##  KWH.SQFT.MAXIMUM.2010 THERM.MEAN.2010   THERM.STANDARD.DEVIATION.2010
##  Min.   :    300       Min.   :     25   Min.   :      0              
##  1st Qu.:   1890       1st Qu.:   1365   1st Qu.:    351              
##  Median :   2810       Median :   1842   Median :    577              
##  Mean   :  10581       Mean   :   4062   Mean   :   2649              
##  3rd Qu.:   5254       3rd Qu.:   2707   3rd Qu.:   1183              
##  Max.   :6548217       Max.   :6600274   Max.   :4941759              
##  NA's   :1150          NA's   :1296      NA's   :10230                
##  THERM.MINIMUM.2010 THERM.1ST.QUARTILE.2010 THERM.2ND.QUARTILE.2010
##  Min.   :     25    Min.   :     25         Min.   :     25        
##  1st Qu.:    592    1st Qu.:    957         1st Qu.:   1286        
##  Median :    990    Median :   1290         Median :   1724        
##  Mean   :   2267    Mean   :   2545         Mean   :   3634        
##  3rd Qu.:   1643    3rd Qu.:   1878         3rd Qu.:   2474        
##  Max.   :6600274    Max.   :6600274         Max.   :6600274        
##  NA's   :1296       NA's   :1296            NA's   :1296           
##  THERM.3RD.QUARTILE.2010 THERM.MAXIMUM.2010 THERMS.SQFT.MEAN.2010
##  Min.   :     25         Min.   :     25    Min.   :    300      
##  1st Qu.:   1595         1st Qu.:   1934    1st Qu.:   1318      
##  Median :   2182         Median :   2603    Median :   2200      
##  Mean   :   5490         Mean   :   6955    Mean   :   7175      
##  3rd Qu.:   3241         3rd Qu.:   4069    3rd Qu.:   3736      
##  Max.   :7012321         Max.   :7012321    Max.   :6548217      
##  NA's   :1296            NA's   :1296       NA's   :1673         
##  THERMS.SQFT.STANDARD.DEVIATION.2010 THERMS.SQFT.MINIMUM.2010
##  Min.   :      0                     Min.   :    100         
##  1st Qu.:    239                     1st Qu.:    950         
##  Median :    467                     Median :   1520         
##  Mean   :   3140                     Mean   :   5282         
##  3rd Qu.:   1034                     3rd Qu.:   2651         
##  Max.   :3840818                     Max.   :6548217         
##  NA's   :15684                       NA's   :1673            
##  THERMS.SQFT.1ST.QUARTILE.2010 THERMS.SQFT.2ND.QUARTILE.2010
##  Min.   :    132               Min.   :    300              
##  1st Qu.:   1075               1st Qu.:   1244              
##  Median :   1756               Median :   2116              
##  Mean   :   5462               Mean   :   6799              
##  3rd Qu.:   2820               3rd Qu.:   3564              
##  Max.   :6548217               Max.   :6548217              
##  NA's   :1673                  NA's   :1673                 
##  THERMS.SQFT.3RD.QUARTILE.2010 THERMS.SQFT.MAXIMUM.2010 TOTAL_POPULATION 
##  Min.   :    300               Min.   :    300          Min.   :   0.00  
##  1st Qu.:   1479               1st Qu.:   1888          1st Qu.:  37.00  
##  Median :   2450               Median :   2796          Median :  64.00  
##  Mean   :   8897               Mean   :   9851          Mean   :  83.85  
##  3rd Qu.:   4410               3rd Qu.:   5191          3rd Qu.: 104.00  
##  Max.   :6548217               Max.   :6548217          Max.   :1590.00  
##  NA's   :1673                  NA's   :1673             NA's   :14       
##   TOTAL.UNITS      AVERAGE.STORIES   AVERAGE.BUILDING.AGE
##  Min.   :   0.00   Min.   :  1.000   Min.   :  0.00      
##  1st Qu.:  15.00   1st Qu.:  1.140   1st Qu.: 53.00      
##  Median :  25.00   Median :  1.750   Median : 80.00      
##  Mean   :  38.11   Mean   :  1.887   Mean   : 71.61      
##  3rd Qu.:  42.00   3rd Qu.:  2.000   3rd Qu.: 96.50      
##  Max.   :1365.00   Max.   :110.000   Max.   :158.00      
##  NA's   :14                                              
##  AVERAGE.HOUSESIZE OCCUPIED.UNITS   OCCUPIED.UNITS.PERCENTAGE
##  Min.   : 0.000    Min.   :   0.0   Min.   :0.0000           
##  1st Qu.: 2.140    1st Qu.:  13.0   1st Qu.:0.8332           
##  Median : 2.700    Median :  22.0   Median :0.9148           
##  Mean   : 2.722    Mean   :  33.5   Mean   :0.8804           
##  3rd Qu.: 3.310    3rd Qu.:  37.0   3rd Qu.:0.9677           
##  Max.   :12.000    Max.   :1034.0   Max.   :1.0000           
##  NA's   :14        NA's   :14       NA's   :2445             
##  RENTER.OCCUPIED.HOUSING.UNITS RENTER.OCCUPIED.HOUSING.PERCENTAGE
##  Min.   :   0.00               Min.   :0.0000                    
##  1st Qu.:   3.00               1st Qu.:0.2860                    
##  Median :  11.00               Median :0.5379                    
##  Mean   :  19.78               Mean   :0.5116                    
##  3rd Qu.:  23.00               3rd Qu.:0.7330                    
##  Max.   :1009.00               Max.   :1.0000                    
##  NA's   :14                    NA's   :2618                      
##  OCCUPIED.HOUSING.UNITS
##  Min.   :   0.0        
##  1st Qu.:  13.0        
##  Median :  22.0        
##  Mean   :  33.5        
##  3rd Qu.:  37.0        
##  Max.   :1034.0        
##  NA's   :14

str(energy_raw)

## 'data.frame':    66974 obs. of  73 variables:
##  $ COMMUNITY.AREA.NAME                : chr  "Albany Park" "Albany Park" "Albany Park" "Albany Park" ...
##  $ CENSUS.BLOCK                       : num  1.7e+14 1.7e+14 1.7e+14 1.7e+14 1.7e+14 ...
##  $ BUILDING_TYPE                      : chr  "Residential" "Residential" "Residential" "Residential" ...
##  $ BUILDING_SUBTYPE                   : chr  "Multi 7+" "Multi < 7" "Single Family" "Multi 7+" ...
##  $ KWH.JANUARY.2010                   : int  11921 1233 4141 1230 12977 2878 1478 4985 4926 16639 ...
##  $ KWH.FEBRUARY.2010                  : int  12145 1645 3798 1333 14639 3755 1890 2636 6413 23502 ...
##  $ KWH.MARCH.2010                     : int  9759 994 2939 1260 12718 4571 1364 2353 5586 19587 ...
##  $ KWH.APRIL.2010                     : int  11542 1055 4727 1405 14973 2984 1271 4761 5606 23327 ...
##  $ KWH.MAY.2010                       : int  14348 1284 5324 1699 16384 3111 1464 4391 6271 26537 ...
##  $ KWH.JUNE.2010                      : int  26617 3527 9676 2094 32940 4808 2118 7362 11549 40725 ...
##  $ KWH.JULY.2010                      : int  24210 3099 7591 732 24454 4132 2384 6462 8549 41430 ...
##  $ KWH.AUGUST.2010                    : int  20383 2527 6287 1312 23926 3564 3767 8015 6709 41268 ...
##  $ KWH.SEPTEMBER.2010                 : int  11983 904 2920 1462 15012 2174 2059 7314 3963 26208 ...
##  $ KWH.OCTOBER.2010                   : int  10335 626 2565 1358 13679 1985 1387 3816 3480 23230 ...
##  $ KWH.NOVEMBER.2010                  : int  25327 2092 5979 1372 31979 5968 2874 7496 7998 43196 ...
##  $ KWH.DECEMBER.2010                  : int  22462 1622 5073 1495 30660 5400 3244 6391 8613 43582 ...
##  $ TOTAL_KWH                          : int  201032 20608 61020 16752 244341 45330 25300 65982 79663 369231 ...
##  $ ELECTRICITY.ACCOUNTS               : chr  "48" "Less than 4" "6" "Less than 4" ...
##  $ ZERO.KWH.ACCOUNTS                  : int  22 1 2 2 32 0 2 3 2 106 ...
##  $ THERM.JANUARY.2010                 : int  7247 321 1222 2961 11508 1793 1554 3107 3371 22813 ...
##  $ THERM.FEBRUARY.2010                : int  5904 130 1016 2664 9057 1573 1195 2749 2647 18905 ...
##  $ THERM.MARCH.2010                   : int  5180 86 860 1616 8000 1352 1280 2228 2396 16890 ...
##  $ TERM.APRIL.2010                    : int  3113 49 543 798 4529 890 821 1331 1407 10504 ...
##  $ THERM.MAY.2010                     : int  1822 19 346 344 2809 853 663 738 833 6981 ...
##  $ THERM.JUNE.2010                    : int  1272 13 247 404 1507 541 607 443 460 4455 ...
##  $ THERM.JULY.2010                    : int  1234 7 203 320 1179 448 487 329 286 3456 ...
##  $ THERM.AUGUST.2010                  : int  952 10 179 272 991 438 476 284 260 3232 ...
##  $ THERM.SEPTEMBER.2010               : int  1780 12 170 368 994 439 382 288 246 3306 ...
##  $ THERM.OCTOBER.2010                 : int  1472 9 190 745 1254 565 459 301 323 3477 ...
##  $ THERM.NOVEMBER.2010                : int  1961 21 298 1260 2595 787 590 520 632 5898 ...
##  $ THERM.DECEMBER.2010                : int  4885 78 791 2901 7167 1538 971 1821 1919 14630 ...
##  $ TOTAL_THERMS                       : int  36822 755 6065 14653 51590 11217 9485 14139 14780 114547 ...
##  $ GAS.ACCOUNTS                       : chr  "21" "Less than 4" "6" "6" ...
##  $ KWH.TOTAL.SQFT                     : int  48825 3306 9472 14407 58835 8240 13305 16654 9690 127916 ...
##  $ THERMS.TOTAL.SQFT                  : int  48825 3306 9472 14407 58835 8240 13305 16654 10840 127916 ...
##  $ KWH.MEAN.2010                      : num  20103 20608 10170 16752 15271 ...
##  $ KWH.STANDARD.DEVIATION.2010        : num  8610 NA 4410 NA 8090 ...
##  $ KWH.MINIMUM.2010                   : int  9414 20608 5619 16752 5462 15929 7285 8496 5388 4397 ...
##  $ KWH.1ST.QUARTILE.2010              : num  12563 20608 6746 16752 10344 ...
##  $ KWH.2ND.QUARTILE.2010              : num  19073 20608 9056 16752 12427 ...
##  $ KWH.3RD.QUARTILE.2010              : num  22177 20608 13014 16752 17496 ...
##  $ KWH.MAXIMUM.2010                   : int  36781 20608 17530 16752 34236 29401 18015 16794 19735 39809 ...
##  $ KWH.SQFT.MEAN.2010                 : num  24413 3306 1579 14407 3677 ...
##  $ KWH.SQFT.STANDARD.DEVIATION.2010   : num  5699 NA 864 NA 1062 ...
##  $ KWH.SQFT.MINIMUM.2010              : int  20383 3306 1226 14407 2414 8240 13305 2448 1116 24751 ...
##  $ KWH.SQFT.1ST.QUARTILE.2010         : num  20383 3306 1226 14407 2546 ...
##  $ KWH.SQFT.2ND.QUARTILE.2010         : num  24413 3306 1226 14407 3554 ...
##  $ KWH.SQFT.3RD.QUARTILE.2010         : num  28442 3306 1226 14407 4692 ...
##  $ KWH.SQFT.MAXIMUM.2010              : int  28442 3306 3342 14407 5530 8240 13305 4554 1334 27975 ...
##  $ THERM.MEAN.2010                    : num  5260 755 1011 14653 3224 ...
##  $ THERM.STANDARD.DEVIATION.2010      : num  8436 NA 621 NA 1079 ...
##  $ THERM.MINIMUM.2010                 : int  882 755 496 14653 2071 1634 1866 2689 835 114 ...
##  $ THERM.1ST.QUARTILE.2010            : num  957 755 514 14653 2499 ...
##  $ THERM.2ND.QUARTILE.2010            : num  1102 755 836 14653 2934 ...
##  $ THERM.3RD.QUARTILE.2010            : num  8024 755 1240 14653 3594 ...
##  $ THERM.MAXIMUM.2010                 : int  23460 755 2144 14653 5754 9583 7619 2956 2372 28459 ...
##  $ THERMS.SQFT.MEAN.2010              : num  24413 3306 1579 14407 3677 ...
##  $ THERMS.SQFT.STANDARD.DEVIATION.2010: num  5699 NA 864 NA 1062 ...
##  $ THERMS.SQFT.MINIMUM.2010           : int  20383 3306 1226 14407 2414 8240 13305 2448 1116 24751 ...
##  $ THERMS.SQFT.1ST.QUARTILE.2010      : num  20383 3306 1226 14407 2546 ...
##  $ THERMS.SQFT.2ND.QUARTILE.2010      : num  24413 3306 1226 14407 3554 ...
##  $ THERMS.SQFT.3RD.QUARTILE.2010      : num  28442 3306 1226 14407 4692 ...
##  $ THERMS.SQFT.MAXIMUM.2010           : int  28442 3306 3342 14407 5530 8240 13305 4554 1334 27975 ...
##  $ TOTAL_POPULATION                   : int  132 132 132 228 228 231 231 231 231 456 ...
##  $ TOTAL.UNITS                        : int  64 64 64 79 79 70 70 70 70 180 ...
##  $ AVERAGE.STORIES                    : num  3 2 1.17 3 2.5 1 3 2.2 1 3 ...
##  $ AVERAGE.BUILDING.AGE               : num  65.5 86 14.3 86 87.7 ...
##  $ AVERAGE.HOUSESIZE                  : num  2.2 2.2 2.2 3.51 3.51 3.73 3.73 3.73 3.73 2.73 ...
##  $ OCCUPIED.UNITS                     : int  60 60 60 65 65 62 62 62 62 167 ...
##  $ OCCUPIED.UNITS.PERCENTAGE          : num  0.938 0.938 0.938 0.823 0.823 ...
##  $ RENTER.OCCUPIED.HOUSING.UNITS      : int  33 33 33 49 49 49 49 49 49 167 ...
##  $ RENTER.OCCUPIED.HOUSING.PERCENTAGE : num  0.55 0.55 0.55 0.754 0.754 0.79 0.79 0.79 0.79 1 ...
##  $ OCCUPIED.HOUSING.UNITS             : int  60 60 60 65 65 62 62 62 62 167 ...

#Create a subset of "energy_raw" that contains only the three variables of interest ('BUILDING_TYPE', 'TOTAL_KWH', and 'TOTAL_POPULATION')
energy_data0 <- subset(energy_raw, select = c(BUILDING_TYPE, TOTAL_KWH, TOTAL_POPULATION))
energy_data1 <- na.omit(energy_data0)
#Display the "head" and "tail" of the dataset, "energy_data1"
head(energy_data1)

##   BUILDING_TYPE TOTAL_KWH TOTAL_POPULATION
## 1   Residential    201032              132
## 2   Residential     20608              132
## 3   Residential     61020              132
## 4   Residential     16752              228
## 5   Residential    244341              228
## 6    Commercial     45330              231

tail(energy_data1)

##       BUILDING_TYPE TOTAL_KWH TOTAL_POPULATION
## 66969   Residential    139240              116
## 66970   Residential     27654              116
## 66971    Commercial     41977               31
## 66972   Residential     48850               31
## 66973   Residential     17707                0
## 66974   Residential     57736               77

#Display the summary statistics and the structure of the data
summary(energy_data1)

##  BUILDING_TYPE        TOTAL_KWH         TOTAL_POPULATION 
##  Length:66089       Min.   :      102   Min.   :   0.00  
##  Class :character   1st Qu.:    28189   1st Qu.:  37.00  
##  Mode  :character   Median :    62271   Median :  64.00  
##                     Mean   :   175675   Mean   :  83.81  
##                     3rd Qu.:   118156   3rd Qu.: 104.00  
##                     Max.   :231280522   Max.   :1590.00

str(energy_data1)

## 'data.frame':    66089 obs. of  3 variables:
##  $ BUILDING_TYPE   : chr  "Residential" "Residential" "Residential" "Residential" ...
##  $ TOTAL_KWH       : int  201032 20608 61020 16752 244341 45330 25300 65982 79663 369231 ...
##  $ TOTAL_POPULATION: int  132 132 132 228 228 231 231 231 231 456 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:885] 67 85 104 128 328 415 494 522 804 853 ...
##   .. ..- attr(*, "names")= chr [1:885] "67" "85" "104" "128" ...

#Transform 'BUILDING_TYPE' into a binary categorical variable (where 1 represents residential buildings and 0 represents non-residential buildings, which corresponds to both commercial and industrial buildings)
energy_data1$BUILDING_TYPE = as.character(energy_data1$BUILDING_TYPE)
energy_data1$BUILDING_TYPE[energy_data1$BUILDING_TYPE != "Residential"] = 0
energy_data1$BUILDING_TYPE[energy_data1$BUILDING_TYPE == "Residential"] = 1
#Categorize 'BUILDING_TYPE' as a factor and display its resulting levels
energy_data1$BUILDING_TYPE = as.factor(energy_data1$BUILDING_TYPE)
levels(energy_data1$BUILDING_TYPE)

## [1] "0" "1"
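The two-step recode above can also be written as a single expression. The following is a minimal sketch on a small made-up vector (`building_type` and `is_residential` are hypothetical names, not objects from this analysis); it keeps the coding that the code above actually produces, where 1 marks residential buildings and 0 marks everything else.

```r
# Hypothetical toy vector standing in for energy_data1$BUILDING_TYPE
building_type <- c("Residential", "Commercial", "Residential", "Industrial")

# 1 = residential, 0 = non-residential, matching what the recode above produces
is_residential <- factor(ifelse(building_type == "Residential", 1, 0), levels = c(0, 1))

# Cross-tabulate the original labels against the new codes as a sanity check
print(table(building_type, is_residential))
print(levels(is_residential))
```

Cross-tabulating the original labels against the new codes is a quick way to confirm which category ended up as 1 before fitting a model.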

Data Selection for Hierarchical Multiple Linear Logistic Regression Model

Following this initial review of the summary statistics, a hierarchical approach is used to develop a multiple linear logistic regression model. A U.S. Department of Energy document entitled “Energy Efficiency Trends in Residential and Commercial Buildings” [reference: http://apps1.eere.energy.gov/buildings/publications/pdfs/corporate/bt_stateindustry.pdf] indicates that a relationship exists between energy consumption, building type (residential, commercial, etc.), and building population. We therefore aim to determine (using the “Energy Usage 2010” dataset) whether building type can be predicted from energy consumption (in kilowatt-hours) and/or building population. Building type is treated as a dichotomous dependent variable, and both building population and energy consumption (in kilowatt-hours) are treated as continuous independent variables.

Description of the null hypothesis (H_0) and the alternate hypothesis (H_1)

Using this hierarchical approach, we aim to determine whether the variation observed in the dependent variable (‘BUILDING_TYPE’) can be explained by the variation in either of the independent variables (‘TOTAL_KWH’ and ‘TOTAL_POPULATION’). The null hypothesis states that total energy consumption (in kilowatt-hours) and building population do not have a significant effect on the determination of building type (i.e., either residential or non-residential). Conversely, the alternate hypothesis states that total energy consumption (in kilowatt-hours) and building population do have a significant effect on the determination of building type. In our analysis, we aim to create a predictive model that uses these independent variables to determine our dichotomous dependent variable.

2. The Linear Model (A Hierarchical Multiple Linear Logistic Regression Model)

Description of independent variables and dependent variable

In this experiment, a hierarchical multiple linear logistic regression model is generated, which will offer some insight into whether building type can be explained by each of the independent variables considered in this analysis, and whether suppression is likely to be present in a multiple linear logistic regression model built on this data. The independent variables are total energy consumption (in kilowatt-hours) and building population, and the dependent variable is building type, characterized as either residential or non-residential.

#Generate an initial Hierarchical Multiple Linear Logistic Regression Model that uses all 66,089 complete observations in "energy_data1"
energy_model <- glm(energy_data1$BUILDING_TYPE~energy_data1$TOTAL_KWH+energy_data1$TOTAL_POPULATION, family = "binomial")

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

#Display summary of the initial Hierarchical Multiple Linear Logistic Regression Model
summary(energy_model)

## 
## Call:
## glm(formula = energy_data1$BUILDING_TYPE ~ energy_data1$TOTAL_KWH + 
##     energy_data1$TOTAL_POPULATION, family = "binomial")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8184  -0.0002   0.7041   0.7457   5.0715  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    1.441e+00  1.389e-02  103.76   <2e-16 ***
## energy_data1$TOTAL_KWH        -2.062e-06  6.351e-08  -32.47   <2e-16 ***
## energy_data1$TOTAL_POPULATION -1.265e-03  1.073e-04  -11.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 74606  on 66088  degrees of freedom
## Residual deviance: 71640  on 66086  degrees of freedom
## AIC: 71646
## 
## Number of Fisher Scoring iterations: 8

Power Analysis for Multiple Linear Logistic Regression Modeling

Originally, the “Energy Usage 2010” dataset contains 66,974 observations (66,089 after rows with missing values are removed). However, with this many observations even very small effects will register as statistically significant, so a power analysis is performed in this experiment to determine the most appropriate sample size for our final multiple linear logistic regression model (where our desired alpha-level equals 0.05, our desired power-level equals 0.95, our effect size equals 0.02, and the considered number of predictors equals 2). The software G*Power is used to determine the most appropriate sample size for this hierarchical multiple linear logistic regression analysis, and it returned a sample size of 1,188. So, a random sample of this size is drawn from the dataset “energy_data1”, creating a new dataset for this hierarchical multiple linear logistic regression model, which will then be used to determine if building type can be explained by the variation in both energy consumption and building population.

#Randomly take a sample of 1,188 observations from "energy_data1", creating "energy_final".
S <- 1188
set.seed(45)
energy.index <- sample(1:nrow(energy_data1),S,replace=FALSE)
energy_final <- energy_data1[energy.index,]
#Generate a new Hierarchical Multiple Linear Logistic Regression Model that uses 1,188 observations
energy_model_final <- glm(energy_final$BUILDING_TYPE~energy_final$TOTAL_KWH+energy_final$TOTAL_POPULATION, family = "binomial")

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

#Display summary of the final Hierarchical Multiple Linear Logistic Regression Model
summary(energy_model_final)

## 
## Call:
## glm(formula = energy_final$BUILDING_TYPE ~ energy_final$TOTAL_KWH + 
##     energy_final$TOTAL_POPULATION, family = "binomial")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8160  -0.8836   0.7224   0.7768   1.7534  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    1.446e+00  1.027e-01  14.074  < 2e-16 ***
## energy_final$TOTAL_KWH        -2.023e-06  4.626e-07  -4.374 1.22e-05 ***
## energy_final$TOTAL_POPULATION -2.660e-03  7.966e-04  -3.340 0.000839 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1392.2  on 1187  degrees of freedom
## Residual deviance: 1319.7  on 1185  degrees of freedom
## AIC: 1325.7
## 
## Number of Fisher Scoring iterations: 8
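The model calls above reference columns with `$` inside the formula. A common alternative is to pass the data frame through the `data` argument, which also lets `predict()` score new observations. The following is a minimal sketch on simulated data: the data frame `toy`, its values, and the `new_building` row are made up for illustration and are not drawn from the Chicago dataset.

```r
# Simulated stand-in for energy_final; all values are made up for illustration
set.seed(45)
n <- 300
toy <- data.frame(
  TOTAL_KWH        = rexp(n, rate = 1 / 100000),
  TOTAL_POPULATION = rpois(n, lambda = 80)
)
toy$BUILDING_TYPE <- factor(
  rbinom(n, 1, plogis(1.4 - 2e-06 * toy$TOTAL_KWH - 2e-03 * toy$TOTAL_POPULATION)),
  levels = c(0, 1)
)

# Same model form as above, with the variables looked up in `data`
toy_model <- glm(BUILDING_TYPE ~ TOTAL_KWH + TOTAL_POPULATION,
                 data = toy, family = binomial)

# Predicted probability of level "1" for a hypothetical new building
new_building <- data.frame(TOTAL_KWH = 50000, TOTAL_POPULATION = 60)
p_hat <- predict(toy_model, newdata = new_building, type = "response")
print(p_hat)
```

With `type = "response"`, the prediction is returned on the probability scale rather than the log-odds scale.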

#Calculate the p-value associated with goodness-of-fit of entire model
null_deviance = 1392.2
residual_deviance = 1319.7
null_degrees_of_freedom = 1187
residual_degrees_of_freedom = 1185
p_value = 1 - pchisq((null_deviance - residual_deviance), (null_degrees_of_freedom - residual_degrees_of_freedom))
p_value

## [1] 2.220446e-16
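The same test can be computed from the fitted model object instead of retyping the deviances, and the upper tail can be requested directly from `pchisq()`. The sketch below uses simulated data (`x1`, `x2`, `y`, and `fit` are hypothetical stand-ins), so its numbers will differ from those reported above.

```r
# Simulated data; x1, x2, and y are hypothetical stand-ins, not the Chicago data
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(0.5 - 1.0 * x1 + 0.5 * x2))
fit <- glm(y ~ x1 + x2, family = binomial)

# Drop in deviance and in degrees of freedom, read from the fitted object
lr_stat <- fit$null.deviance - fit$deviance
lr_df   <- fit$df.null - fit$df.residual

# Upper-tail probability computed directly, avoiding 1 - (a value very close to 1)
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(c(statistic = lr_stat, df = lr_df, p_value = p_value))
```

For the deviances reported above (1392.2 - 1319.7 = 72.5 on 2 degrees of freedom), the chi-squared upper tail equals exp(-72.5/2), which is about 1.8e-16. The printed value of 2.220446e-16 is limited by floating-point precision in the subtraction from 1, so the conclusion is unchanged.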

#Collinearity Check
col.test <- lm(energy_final$TOTAL_KWH~energy_final$TOTAL_POPULATION)
summary(col.test)

## 
## Call:
## lm(formula = energy_final$TOTAL_KWH ~ energy_final$TOTAL_POPULATION)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -6244293  -318212    28178   231683 29928036 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   -435050.1    68996.6  -6.305 4.05e-10 ***
## energy_final$TOTAL_POPULATION    8014.5      533.4  15.026  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1802000 on 1186 degrees of freedom
## Multiple R-squared:  0.1599, Adjusted R-squared:  0.1592 
## F-statistic: 225.8 on 1 and 1186 DF,  p-value: < 2.2e-16

For the hierarchical multiple linear logistic regression analysis in which ‘TOTAL_KWH’ and ‘TOTAL_POPULATION’ are both analyzed against the response variable ‘BUILDING_TYPE’, p-values of 1.22e-05 and 0.000839 [respectively] are returned for these independent variables. These values indicate that, if a given independent variable had no true effect on building type, there would be a probability of roughly 1.22e-05 and 0.000839 (respectively) of observing an association at least this strong through randomization alone. Therefore, based on this model’s results (and the p-values reported in the model summary above), we reject the null hypothesis for the independent variables ‘TOTAL_KWH’ and ‘TOTAL_POPULATION’, leading us to believe that the variation observed in building type can be explained by the variation in total energy consumption and in building population and, as such, is not likely caused solely by randomization. [See above results for corresponding p-values.]

In further analyzing the results of the logistic regression analysis, it’s important to note that the value of b_0 (the model’s intercept) is 1.446, the value of b_1 (the coefficient associated with the variable ‘TOTAL_KWH’) is -2.023e-06, and the value of b_2 (the coefficient associated with the variable ‘TOTAL_POPULATION’) is -2.660e-03. These values describe the relationship between the independent variables and the dependent variable on the log-odds scale. Because glm() models the probability of the second factor level (here “1”, which the recoding step assigns to residential buildings), an increase of one kilowatt-hour in total energy consumption corresponds to a decrease of 2.023e-06 in the log-odds of a building being residential, and an increase of one unit in total building population corresponds to a decrease of 2.660e-03 in those log-odds (each with the other variable held constant). Furthermore, with a chi-squared p-value of 2.220446e-16, our model does seem to exhibit a significant goodness of fit.
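Log-odds coefficients are often easier to read as odds ratios, obtained by exponentiating them. The sketch below copies the two slope estimates from the model summary above; the variable names and the 100,000 kWh rescaling are illustrative choices, not part of the original analysis.

```r
# Coefficients copied from the summary of energy_model_final above
b_kwh <- -2.023e-06
b_pop <- -2.660e-03

# Odds ratio for a one-unit increase in each predictor
or_kwh <- exp(b_kwh)
or_pop <- exp(b_pop)

# One kWh is a tiny step, so also rescale to 100,000 kWh (an arbitrary illustrative step)
or_kwh_100k <- exp(b_kwh * 1e5)

print(c(or_kwh = or_kwh, or_pop = or_pop, or_kwh_100k = or_kwh_100k))
```

Under the coding produced by the recode step (1 = residential), these work out to an odds ratio of about 0.997 per additional unit of building population and about 0.82 per additional 100,000 kWh, meaning the odds of a building being residential fall as either predictor rises.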

Additionally, it’s important to note the metrics used here to measure the correlation between the independent variables corresponding to total energy consumption and total building population, which are multiple R-squared and adjusted R-squared (since we want to take into account any bias that might be associated with the number of explanatory variables included in the model, this analysis emphasizes the value of adjusted R-squared rather than the value of multiple R-squared). Since the value of adjusted R-squared is 0.1592, it can be inferred that the variation in total building population can explain approximately 15.92% of the variation in total energy consumption. Given this low adjusted R-squared value, one can likely assert that the independent variables considered in this analysis (‘TOTAL_KWH’ and ‘TOTAL_POPULATION’) do not exhibit much collinearity in our model.
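The R-squared from the collinearity check can also be expressed as a variance inflation factor (VIF), a conventional collinearity summary that this report does not itself compute. The following is a small worked sketch using the multiple R-squared printed by `summary(col.test)` above; the variable names are illustrative.

```r
# Multiple R-squared copied from summary(col.test) above
r_squared <- 0.1599

# With only two predictors, both share the same variance inflation factor
vif <- 1 / (1 - r_squared)
print(round(vif, 2))

# Equivalent pairwise correlation (positive, since the fitted slope above is positive)
r <- sqrt(r_squared)
print(round(r, 2))
```

This gives a VIF of about 1.19 and a correlation of about 0.40. Commonly cited rule-of-thumb thresholds for problematic collinearity are much higher (VIF values of 5 or 10 are often quoted), which is consistent with the conclusion drawn above.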

3. Diagnostic Plots

Before beginning to check the model against the four “LINE” assumptions associated with linear regression modeling, histograms, boxplots, scatterplots, and a “Quality of Fit” plot (via a fitted vs. residual values determination) are generated, which will be used for their graphical nature in our interpretations.

#Generate histograms for all of the different independent variables being considered in our sampled data ('TOTAL_KWH' and 'TOTAL_POPULATION')
hist(energy_final$TOTAL_KWH, xlab = "Total Energy Consumption [in kilowatt-hours]", main = "Histogram of Total Energy Consumption")

hist(energy_final$TOTAL_POPULATION, xlab = "Total Building Population", main = "Histogram of Total Building Population")

#Generate a boxplot of the data (Independent Variable = Energy Consumption)
boxplot(x = energy_final$TOTAL_KWH, pch=21, bg="darkviolet", main="Total Energy Consumption", xlab = "Total Energy Consumption [in kilowatt-hours]")

#Generate a boxplot of the data (Independent Variable = Population)
boxplot(x = energy_final$TOTAL_POPULATION, pch=21, bg="darkviolet", main="Total Building Population", xlab = "Building Population")

#Generate a scatterplot of the data: "Building Type" vs. "Energy Consumption"
plot(y = energy_final$BUILDING_TYPE,x = energy_final$TOTAL_KWH, pch=21, bg="darkviolet", main="Total Energy Consumption vs. Building Type", ylab = "Building Type", xlab = "Energy Consumption (in kilowatt-hours)")

#Generate a scatterplot of the data: "Building Type" vs. "Building Population"
plot(y = energy_final$BUILDING_TYPE,x = energy_final$TOTAL_POPULATION, pch=21, bg="darkviolet", main="Total Building Population vs. Building Type", ylab = "Building Type", xlab = "Building Population")

#Create a "Quality of Fit Model" that plots the residuals of "energy_model_final" against its fitted model.
par(mfrow=c(1,1))
plot(fitted(energy_model_final),residuals(energy_model_final), main = "Residuals of 'energy_model_final' Against Fitted", font.main = 4, cex.main = 1.2)
mtext("Model 'energy_model_final' [Not Standardized]", font = 4, cex = 1.2)
abline(0,0, col='darkviolet', lwd=2.5)

#Create a "Quality of Fit Model" that plots the standardized residuals of "energy_model_final" against its fitted model.
par(mfrow=c(1,1))
standardized_energy_model <- rstandard(energy_model_final)
plot(fitted(energy_model_final),standardized_energy_model, main = "Standardized Residuals of 'energy_model_final'", font.main = 4, cex.main = 1.2)
mtext("Against Fitted Model 'energy_model_final'", font = 4, cex = 1.2)
abline(0,0, col='darkviolet', lwd=2.5)

4. Interpretation via LINE Assumptions

In interpreting our hierarchical multiple linear logistic regression model and the statistical significance of the results that were generated therein, it is important to test the model against the four “LINE” assumptions corresponding to linear regression.

1. The mean of the response at each set of values of the predictor is a linear function of the predictors (‘L’).

In order to meet this assumption, we can try to determine whether or not the expected (mean) value of the residuals is zero at every value of the predictor by generating a standardized residual plot for this model against a fitted version of the model. Upon generating this plot, it appears that the residuals located across the dynamic range are not uniformly distributed along the “y=0” axis, indicating that a non-linear kind of effect likely exists within the data that the model is comprised of. Therefore, it is likely evident that our model does not satisfy this assumption.

#Create a "Quality of Fit Model" that plots the standardized residuals of "energy_model_final" against its fitted model.
par(mfrow=c(1,1))
standardized_energy_model <- rstandard(energy_model_final)
plot(fitted(energy_model_final),standardized_energy_model, main = "Standardized Residuals of 'energy_model_final'", font.main = 4, cex.main = 1.2)
mtext("Against Fitted Model 'energy_model_final'", font = 4, cex = 1.2)
abline(0,0, col='darkviolet', lwd=2.5)

2. The errors are independent (‘I’).

In order to determine if the errors are independent, we can generate a residuals plot of the model and discern whether or not the residuals located across the dynamic range are uniformly distributed and exhibit no auto-correlation. Upon generating this plot, it appears that the residuals themselves seem to be generally uniformly distributed and lack auto-correlation here [exhibiting homoscedasticity and independence].

#Generate a residuals plot for "energy_model_final"
plot(energy_model_final$residuals, pch=21, bg="darkviolet", main = "Residuals Plot for 'energy_model_final'")

3. The errors at each set of values of the predictor are normally distributed (‘N’).

In order to determine if the distribution of the residuals is normal, we can generate histograms and boxplots for the residuals of the model and analyze them for skewness and kurtosis. Upon observing both the boxplot and the histogram of the residuals, it appears that the model’s residuals do exhibit some significant skewness, as the residuals seem to be skewed severely to the right (indicating that some bias is likely existent in the model). Additionally, upon observing the histogram of the residuals, it appears that there is also some kurtosis and bimodality in the residuals.

#Generate histograms for the residuals of our model
hist(residuals(energy_model_final), xlab = "Residuals", main = "Histogram of Residuals of 'energy_model_final'")

#Generate a boxplot for the residuals of our model
boxplot(x = residuals(energy_model_final), pch=21, bg="darkviolet", main="Boxplot of Residuals of 'energy_model_final'", xlab = "Residuals")
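Skewness and kurtosis can also be quantified rather than judged by eye. The sketch below uses base R only and a simulated, right-skewed vector of residuals; the helper functions "sample_skewness" and "sample_excess_kurtosis" are hypothetical names written for this illustration and are not part of the original analysis.

```r
# Hypothetical right-skewed residuals, centered near zero (simulated data).
set.seed(2)
toy_resid <- rexp(1000) - 1

# Moment-based sample skewness: 0 for a symmetric distribution.
sample_skewness <- function(r) {
  z <- r - mean(r)
  mean(z^3) / mean(z^2)^1.5
}

# Moment-based excess kurtosis: 0 for a normal distribution.
sample_excess_kurtosis <- function(r) {
  z <- r - mean(r)
  mean(z^4) / mean(z^2)^2 - 3
}

c(skewness = sample_skewness(toy_resid),
  excess_kurtosis = sample_excess_kurtosis(toy_resid))
```

Assuming "energy_model_final" is available in the workspace, the same helpers could be applied to residuals(energy_model_final); a clearly positive skewness would agree with the right skew seen in the plots.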

We can further assess normality by generating a Normal Quantile-Quantile (Q-Q) Plot for the residuals of the model. The points in the Normal Q-Q Plot do not align closely with the Normal Q-Q Line, so the residuals do not appear to be normally distributed. Therefore, our model does not appear to satisfy this assumption.

#Create a Normal Q-Q Plot for the residuals of "energy_model_final".
qqnorm(residuals(energy_model_final), main = "Normal Q-Q Plot for Residuals of 'energy_model_final'")
qqline(residuals(energy_model_final))

4. The errors at each set of values of the predictor have equal variances (‘E’).

To determine whether the errors at each set of values of the predictor have equal variances (in other words, whether the variance of the residuals is the same for every set of predictor values), we can generate a residuals plot of the model and check whether the residuals are spread evenly about the “y=0” line across the range of the data. In this plot, the spread of the residuals appears to be generally even, with no visible pattern [exhibiting equal variance].

#Generate a residuals plot for "energy_model_final"
plot(energy_model_final$residuals, pch=21, bg="darkviolet", main = "Residuals Plot for 'energy_model_final'")

In the next section, the Breusch-Pagan Test against Heteroscedasticity is performed and analyzed, which offers more insight into whether or not the errors at each set of values of the predictor have equal variances.

5. Interpretation of the Breusch-Pagan Test against Heteroscedasticity

To determine whether the errors at each set of values of the predictor have equal variances (in other words, whether the variance of the residuals is the same for every set of predictor values), the Breusch-Pagan Test against Heteroscedasticity can be performed. The Breusch-Pagan test fits a linear regression model to the squared residuals of a linear regression model (by default, using the same explanatory variables as the main regression model) and rejects the null hypothesis (that the model exhibits homoscedasticity) if too much of their variance is explained by those explanatory variables. The test was carried out separately for each of the independent variables (‘TOTAL_KWH’ and ‘TOTAL_POPULATION’) against the dependent variable ‘BUILDING_TYPE’, returning p-values of 0.6159 and 0.6823 (respectively). At an alpha-level of 0.05, we therefore fail to reject the null hypothesis of homoscedasticity for each of our independent variables, and our model’s residuals appear to exhibit homoscedasticity.

#install.packages("lmtest") #[needs to be installed before use]
library(lmtest)

## Warning: package 'lmtest' was built under R version 3.1.3

## Loading required package: zoo

## Warning: package 'zoo' was built under R version 3.1.3

## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

#Generate a model that considers only the independent variable 'TOTAL_KWH'
KWH_model <- glm(energy_final$BUILDING_TYPE~energy_final$TOTAL_KWH, family = "binomial")

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

#Perform the Breusch-Pagan test for 'KWH_model'
bptest(KWH_model)

## 
##  studentized Breusch-Pagan test
## 
## data:  KWH_model
## BP = 0.2517, df = 1, p-value = 0.6159

#Generate a model that considers only the independent variable 'TOTAL_POPULATION'
POPULATION_model <- glm(energy_final$BUILDING_TYPE~energy_final$TOTAL_POPULATION, family = "binomial")
#Perform the Breusch-Pagan test for 'POPULATION_model'
bptest(POPULATION_model)

## 
##  studentized Breusch-Pagan test
## 
## data:  POPULATION_model
## BP = 0.1675, df = 1, p-value = 0.6823
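To make the mechanics of the test concrete, the following minimal sketch reproduces the studentized Breusch-Pagan statistic by hand in base R. It uses simulated data and an ordinary linear model (the setting for which the test was derived), not the models fitted above; all object names ("toy_lm", "aux", "bp_stat", "bp_p") are hypothetical. The statistic is the sample size multiplied by the R-squared of an auxiliary regression of the squared residuals on the predictor, which is what bptest() reports under its default setting of studentize = TRUE.

```r
# Simulated data in which the error spread grows with x (heteroscedastic).
set.seed(3)
n <- 400
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)
toy_lm <- lm(y ~ x)

# Auxiliary regression: squared residuals on the same predictor.
aux <- lm(residuals(toy_lm)^2 ~ x)

# Studentized (Koenker) Breusch-Pagan statistic: n * R-squared.
bp_stat <- n * summary(aux)$r.squared
bp_df <- 1   # one explanatory variable in the auxiliary regression
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p_value = bp_p)
```

A small p-value here would lead us to reject homoscedasticity for the simulated data, which is the opposite of the outcome reported for our own models.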

6. Interpretation via Issues

In interpreting our hierarchical multiple linear logistic regression model and the statistical significance of the results that were generated therein, it is important to check the model against the four main issues surrounding linear regression.

1. Causality

As far as causality is concerned here, we can intuitively reason that “total building population” and “total energy consumption” act as proximal, probabilistic predictors of a given building’s type. These independent variables are considered to be proximal and probabilistic causes (and not ultimate or determinate causes) because they do not “perfectly” or “directly” identify a given building’s type in every situation. That said, establishing causality would require a stronger analysis of the “flipped” scenario, in which an explanatory model uses “building type” to determine “total energy consumption” and/or “building population”.

2. Sample Sizes

As far as sample sizes are concerned here, our original, raw dataset is massive (66,974 observations). If this dataset had been used to generate our hierarchical multiple linear logistic regression model, the significance of our results could have been misinterpreted, because very large samples can make even negligible effects appear statistically significant. To alleviate this concern, an appropriate sample size was determined using the software G*Power for logistic regression modeling. The use of G*Power resulted in a sample size of 1,188. With this sample size, the dataset “energy_data” was sampled, and a new dataset “energy_final” was created. (See the section “Power Analysis for Multiple Linear Logistic Regression Modeling” within Chapter #2 for more information.)

3. Collinearity

To determine whether any predictor is a perfect (or near-perfect) linear function of the other predictors (i.e., whether multicollinearity is present), we can generate a correlation matrix and a linear regression model that regresses one independent variable on the other. The correlation matrix shows that “TOTAL_KWH” and “TOTAL_POPULATION” are only moderately correlated with each other (at a value of 0.3998996, which is not close to 1.00), suggesting a lack of strong collinearity between these independent variables. Additionally, since the adjusted R-squared of the “collinearity-checking” model is 0.1592, the variation in total building population explains approximately 15.92% of the variation in total energy consumption. Given this low adjusted R-squared value, the independent variables being considered in this analysis (‘TOTAL_KWH’ and ‘TOTAL_POPULATION’) do not appear to exhibit much collinearity in our model. Therefore, collinearity does not seem to be a serious concern for our model.

#Generate a correlation matrix for "TOTAL_KWH" and "TOTAL_POPULATION"
cor(energy_final[c("TOTAL_KWH","TOTAL_POPULATION")])

##                  TOTAL_KWH TOTAL_POPULATION
## TOTAL_KWH        1.0000000        0.3998996
## TOTAL_POPULATION 0.3998996        1.0000000

#Collinearity Check
col.test <- lm(energy_final$TOTAL_KWH~energy_final$TOTAL_POPULATION)
summary(col.test)

## 
## Call:
## lm(formula = energy_final$TOTAL_KWH ~ energy_final$TOTAL_POPULATION)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -6244293  -318212    28178   231683 29928036 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   -435050.1    68996.6  -6.305 4.05e-10 ***
## energy_final$TOTAL_POPULATION    8014.5      533.4  15.026  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1802000 on 1186 degrees of freedom
## Multiple R-squared:  0.1599, Adjusted R-squared:  0.1592 
## F-statistic: 225.8 on 1 and 1186 DF,  p-value: < 2.2e-16
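The two collinearity checks above are two views of the same quantity: in a regression with a single predictor, the multiple R-squared equals the squared correlation. The short sketch below verifies this with the values reported above. The variance inflation factor (VIF) at the end is an added illustration under the usual definition 1 / (1 - R-squared) and is not part of the original analysis.

```r
r <- 0.3998996   # correlation reported in the correlation matrix above
n <- 1188        # observations in "energy_final"
p <- 1           # number of predictors in the collinearity-checking model

r_squared <- r^2
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# Added illustration (not in the original analysis): variance inflation factor.
vif <- 1 / (1 - r_squared)

round(c(r_squared = r_squared, adj_r_squared = adj_r_squared, vif = vif), 4)
```

The first two values round to 0.1599 and 0.1592, matching the multiple and adjusted R-squared in the model summary.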

4. Measurement Error

In this study, the existence of measurement error is not an immediate concern. Since this energy consumption data for Chicago is regularly and routinely collected by the government and verified by experts in the field, it is not very likely (though technically possible) that measurement error affected the accuracy of the data collection.

7. Testing for Interaction Effects within Hierarchical Multiple Linear Logistic Regression

In carrying out this analysis, it is important to check to see if any interaction effects are present among the independent variables being tested against the dependent variable.

#Generate a new Hierarchical Multiple Linear Logistic Regression Model that uses 1,188 observations to test for interaction effects
energy_interaction_model <- glm(energy_final$BUILDING_TYPE~energy_final$TOTAL_KWH*energy_final$TOTAL_POPULATION, family = "binomial")

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

#Display summary of the interaction Hierarchical Multiple Linear Logistic Regression Model
summary(energy_interaction_model)

## 
## Call:
## glm(formula = energy_final$BUILDING_TYPE ~ energy_final$TOTAL_KWH * 
##     energy_final$TOTAL_POPULATION, family = "binomial")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8275  -0.8692   0.7202   0.7783   1.6235  
## 
## Coefficients:
##                                                        Estimate Std. Error
## (Intercept)                                           1.473e+00  1.046e-01
## energy_final$TOTAL_KWH                               -2.252e-06  5.164e-07
## energy_final$TOTAL_POPULATION                        -2.904e-03  7.919e-04
## energy_final$TOTAL_KWH:energy_final$TOTAL_POPULATION  1.627e-09  5.132e-10
##                                                      z value Pr(>|z|)    
## (Intercept)                                           14.074  < 2e-16 ***
## energy_final$TOTAL_KWH                                -4.362 1.29e-05 ***
## energy_final$TOTAL_POPULATION                         -3.667 0.000246 ***
## energy_final$TOTAL_KWH:energy_final$TOTAL_POPULATION   3.170 0.001525 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1392.2  on 1187  degrees of freedom
## Residual deviance: 1318.8  on 1184  degrees of freedom
## AIC: 1326.8
## 
## Number of Fisher Scoring iterations: 8

For the hierarchical multiple linear logistic regression analysis in which the paired interaction of ‘TOTAL_KWH’ and ‘TOTAL_POPULATION’ is analyzed against the dependent variable ‘BUILDING_TYPE’, a p-value of 0.001525 is returned for the interaction of ‘TOTAL_KWH’ and ‘TOTAL_POPULATION’ in the model, a p-value of 1.29e-05 is returned for the variable ‘TOTAL_KWH’, and a p-value of 0.000246 is returned for the variable ‘TOTAL_POPULATION’. These values indicate that, if the corresponding coefficient were truly zero, a test statistic at least as extreme as the one observed would occur with a probability of roughly 0.001525, 1.29e-05, and 0.000246 [respectively]. Therefore, based on this model’s results (and the p-values in the model summary above), we would reject the null hypothesis for the independent variables ‘TOTAL_KWH’ and ‘TOTAL_POPULATION’ and for their interaction at an alpha-level of 0.05. This leads us to believe that the variation observed in building type can be explained by the variation in total energy consumption, the variation in building population, and the interaction of these two independent variables, and is not likely caused solely by randomization.
Although our original model does not take into account the interaction effect that likely exists within this dataset (as per this interaction analysis), both independent variables remain statistically significant when the interaction term is included. The interaction therefore does not change the conclusion drawn from our original model that the independent variables are likely to be statistically significant with relation to the dependent variable in our predictive model. [See above results for corresponding p-values.]
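The Wald p-value for the interaction term can be cross-checked with a likelihood-ratio test that compares the additive model with the interaction model. The sketch below shows the mechanics on simulated data only; the variable and model names are hypothetical, and it does not reproduce the results for “energy_final”.

```r
# Simulated data with a genuine interaction between two predictors.
set.seed(4)
n <- 600
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, size = 1,
            prob = plogis(0.3 + 0.8 * x1 - 0.5 * x2 + 0.7 * x1 * x2))

additive_model <- glm(y ~ x1 + x2, family = binomial)
interaction_model <- glm(y ~ x1 * x2, family = binomial)

# Likelihood-ratio (deviance) test: does adding x1:x2 improve the fit?
lr_table <- anova(additive_model, interaction_model, test = "Chisq")
lr_table
```

The second row of the table reports the drop in deviance from adding the interaction term, on one degree of freedom, together with its chi-square p-value.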

8. Final Interpretation of Predictive Hierarchical Multiple Linear Logistic Regression Model

After carrying out the “LINE” analysis, it appears that our model does not exhibit normally-distributed errors (‘N’) or maintain a linear relationship between the independent variables and the dependent variable (‘L’). Despite this, it does appear that the errors of the model are independent (‘I’) and that the errors at each set of values of the predictor have equal variances (‘E’). However, because our model is a logistic regression model at its core, meeting these assumptions is not entirely necessary.

The assumptions that do directly apply to logistic regression modeling include the following: (1) The true conditional probabilities are a logistic function of the independent variables, with the independent variables being linearly related to the log odds. (2) No important variables are omitted and no extraneous variables are included. (3) There is little or no multicollinearity, meaning that the independent variables are not linear combinations of each other, and the independent variables are measured without error. (4) The observations are independent. (5) Large sample sizes are included.
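As an illustration of the first assumption, the sketch below converts a linear predictor (the log odds) into a probability with the inverse logit. The coefficients are copied from the interaction model summary above; the building’s values for “kwh” and “pop” are invented for illustration only, and the result is the fitted probability of whichever building type is coded as 1.

```r
# Coefficients copied from the interaction model summary above.
b0    <-  1.473e+00
b_kwh <- -2.252e-06
b_pop <- -2.904e-03
b_int <-  1.627e-09

# Hypothetical building (values invented for illustration only).
kwh <- 100000
pop <- 100

# The log odds are linear in the predictors (and in their product).
log_odds <- b0 + b_kwh * kwh + b_pop * pop + b_int * kwh * pop

# Inverse logit: 1 / (1 + exp(-log_odds)).
prob <- plogis(log_odds)
c(log_odds = log_odds, probability = prob)
```

Applying qlogis() to the probability recovers the log odds, which is the sense in which the probabilities are a logistic function of the independent variables.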

Through our “LINE” analysis, it is determined that our observations exhibit both independence and a lack of multicollinearity. Additionally, since our analysis uses G*Power to determine an effective sample size for logistic regression, the only assumption for logistic regression that has not yet been addressed is the second one, which states that “no important variables are omitted and no extraneous variables are included.” While this is generally a tough assumption to address in any form of regression modeling, we can attempt to address one particular concern that arose during this analysis.

In our original model, a dichotomous dependent variable is created that groups commercial and industrial building types together (as a single non-residential level) and contrasts them with residential building types. While the research carried out in constructing this hierarchical multiple linear logistic regression model led us to make this modeling decision, it might be an inherently flawed one. To address this concern, a new model is created that compares only commercial building types with industrial building types, excluding all residential building types found in the dataset.

#Create a subset of "energy_raw" that contains only the three variables needed for this check
logistic_test_data0 <- subset(energy_raw, select = c(BUILDING_TYPE, TOTAL_KWH, TOTAL_POPULATION))
logistic_test_data1 <- na.omit(logistic_test_data0)
#Transform 'BUILDING_TYPE' into a categorical variable (where 0 represents industrial buildings and 1 represents commercial buildings), dropping all residential buildings
logistic_test_data1$BUILDING_TYPE = as.character(logistic_test_data1$BUILDING_TYPE)
logistic_test_data1$BUILDING_TYPE[logistic_test_data1$BUILDING_TYPE == "Residential"] = NA
logistic_test_data2 <- na.omit(logistic_test_data1)
logistic_test_data2$BUILDING_TYPE[logistic_test_data2$BUILDING_TYPE != "Commercial"] = 0
logistic_test_data2$BUILDING_TYPE[logistic_test_data2$BUILDING_TYPE == "Commercial"] = 1
#Categorize 'BUILDING_TYPE' as a factor and display its resulting levels
logistic_test_data2$BUILDING_TYPE = as.factor(logistic_test_data2$BUILDING_TYPE)
levels(logistic_test_data2$BUILDING_TYPE)

## [1] "0" "1"
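The recoding steps above can be written more compactly. The following is a minimal sketch on a small, invented data frame (“toy”), not on “energy_raw”; it drops residential rows, codes commercial buildings as 1 and all remaining (industrial) buildings as 0, and then tabulates the two classes so that any imbalance between them is visible before modeling.

```r
# Invented example rows (not taken from the Chicago dataset).
toy <- data.frame(
  BUILDING_TYPE = c("Commercial", "Industrial", "Residential", "Commercial"),
  TOTAL_KWH = c(1200, 56000, 800, 3400),
  stringsAsFactors = FALSE
)

# Keep only non-residential buildings.
toy_nonres <- toy[toy$BUILDING_TYPE != "Residential", ]

# Code commercial as 1 and everything else (industrial) as 0.
toy_nonres$BUILDING_TYPE <- factor(
  ifelse(toy_nonres$BUILDING_TYPE == "Commercial", 1, 0),
  levels = c(0, 1)
)

levels(toy_nonres$BUILDING_TYPE)
table(toy_nonres$BUILDING_TYPE)
```

For this toy data frame, the table shows one building coded 0 and two coded 1.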

#Generate an initial Hierarchical Multiple Linear Logistic Regression Model that uses all 16,649 observations contained within this new dataset
logistic_test_model <- glm(logistic_test_data2$BUILDING_TYPE~logistic_test_data2$TOTAL_KWH+logistic_test_data2$TOTAL_POPULATION, family = "binomial")
#Display summary of this new initial Hierarchical Multiple Linear Logistic Regression Model
summary(logistic_test_model)

## 
## Call:
## glm(formula = logistic_test_data2$BUILDING_TYPE ~ logistic_test_data2$TOTAL_KWH + 
##     logistic_test_data2$TOTAL_POPULATION, family = "binomial")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.7038   0.0517   0.0556   0.0588   1.0447  
## 
## Coefficients:
##                                        Estimate Std. Error z value
## (Intercept)                           6.289e+00  2.719e-01  23.135
## logistic_test_data2$TOTAL_KWH        -2.685e-08  6.026e-09  -4.456
## logistic_test_data2$TOTAL_POPULATION  2.672e-03  2.509e-03   1.065
##                                      Pr(>|z|)    
## (Intercept)                           < 2e-16 ***
## logistic_test_data2$TOTAL_KWH        8.37e-06 ***
## logistic_test_data2$TOTAL_POPULATION    0.287    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 387.98  on 16648  degrees of freedom
## Residual deviance: 376.98  on 16646  degrees of freedom
## AIC: 382.98
## 
## Number of Fisher Scoring iterations: 10

#Randomly take a sample of 1,188 observations from "logistic_test_data2", creating "logistic_final".
S <- 1188
set.seed(28)
energy.index.log <- sample(1:nrow(logistic_test_data2),S,replace=FALSE)
logistic_final <- logistic_test_data2[energy.index.log,]
#Generate a new Hierarchical Multiple Linear Logistic Regression Model that uses 1,188 observations
logistic_model_final <- glm(logistic_final$BUILDING_TYPE~logistic_final$TOTAL_KWH+logistic_final$TOTAL_POPULATION, family = "binomial")

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

#Display summary of the final Hierarchical Multiple Linear Logistic Regression Model
summary(logistic_model_final)

## 
## Call:
## glm(formula = logistic_final$BUILDING_TYPE ~ logistic_final$TOTAL_KWH + 
##     logistic_final$TOTAL_POPULATION, family = "binomial")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3850   0.0296   0.0641   0.0911   0.1381  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     4.626e+00  1.070e+00   4.322 1.55e-05 ***
## logistic_final$TOTAL_KWH        1.243e-05  1.543e-05   0.805    0.421    
## logistic_final$TOTAL_POPULATION 8.827e-03  1.199e-02   0.736    0.462    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 41.881  on 1187  degrees of freedom
## Residual deviance: 39.296  on 1185  degrees of freedom
## AIC: 45.296
## 
## Number of Fisher Scoring iterations: 14

#Calculate the p-value associated with goodness-of-fit of this new model
null_deviance_log = 41.881
residual_deviance_log = 39.296
null_degrees_of_freedom_log = 1187
residual_degrees_of_freedom_log = 1185
p_value_log = 1 - pchisq((null_deviance_log - residual_deviance_log), (null_degrees_of_freedom_log - residual_degrees_of_freedom_log))
p_value_log

## [1] 0.2745835
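As a cross-check of the p-value above, the sketch below recomputes it in two ways from the deviances reported in the model summary: with the upper tail of pchisq() directly, and with the closed form exp(-x / 2) that holds for a chi-square distribution with 2 degrees of freedom. The helper “lr_p_value” is a hypothetical convenience function that reads the same quantities from a fitted glm object instead of retyping them.

```r
# Deviances and degrees of freedom copied from the model summary above.
null_dev  <- 41.881
resid_dev <- 39.296
df_diff   <- 1187 - 1185

# Upper-tail chi-square probability of the drop in deviance.
p_upper <- pchisq(null_dev - resid_dev, df = df_diff, lower.tail = FALSE)

# Closed form of the same probability, valid only when df = 2.
p_closed <- exp(-(null_dev - resid_dev) / 2)

c(p_upper = p_upper, p_closed = p_closed)

# Hypothetical helper: the same test taken from a fitted glm object.
lr_p_value <- function(model) {
  pchisq(model$null.deviance - model$deviance,
         df = model$df.null - model$df.residual,
         lower.tail = FALSE)
}
```

Assuming “logistic_model_final” is available in the workspace, lr_p_value(logistic_model_final) would avoid copying rounded values from the printed summary.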

#Collinearity Check
col.test.log <- lm(logistic_final$TOTAL_KWH~logistic_final$TOTAL_POPULATION)
summary(col.test.log)

## 
## Call:
## lm(formula = logistic_final$TOTAL_KWH ~ logistic_final$TOTAL_POPULATION)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1613634  -345555  -264468  -149554 60957766 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                     213955.7    83746.3   2.555  0.01075 * 
## logistic_final$TOTAL_POPULATION   1492.5      480.2   3.108  0.00193 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2284000 on 1186 degrees of freedom
## Multiple R-squared:  0.008079,   Adjusted R-squared:  0.007243 
## F-statistic:  9.66 on 1 and 1186 DF,  p-value: 0.001928

This new model (which considers only commercial and industrial building types) is only a slight variation of the previously-developed model of residential vs. non-residential building types, so we do not thoroughly re-examine whether it meets all of the assumptions surrounding linear regression modeling and logistic regression modeling. Even so, it is apparent that while ‘TOTAL_KWH’ and ‘TOTAL_POPULATION’ lack multicollinearity within this model, the variation within these independent variables [with consideration to commercial and industrial building types alone] cannot significantly account for the variation within the dependent variable ‘BUILDING_TYPE’. Therefore, based on this model’s results (and the p-values in the model summary above), we would fail to reject the null hypothesis for the independent variables ‘TOTAL_KWH’ and ‘TOTAL_POPULATION’ when residential building types are excluded. This leads us to believe that, for a comparison of commercial and industrial building types alone, the variation observed in building type is not explained by the variation in total energy consumption or in building population, and that the observed differences may be the result of randomization alone.

For this reason, we conclude that our original modeling approach (residential vs. non-residential building types) adequately addresses the assumptions of logistic regression modeling, leaving us satisfied with the results of the original modeling approach that was carried out.