This dataset reports several units of energy consumption for households, businesses, and industries in the city of Chicago in 2010, and was aggregated from ComEd and Peoples Natural Gas by Accenture. It contains 66,974 observations and 73 variables, and covers the electrical and gas usage of approximately 88% of Chicago’s buildings in 2010, representing 68% of Chicago’s overall electrical usage and 81% of Chicago’s gas consumption. For this analysis, only three variables are used to test our null hypothesis: ‘BUILDING_TYPE’ (the type of building, which is either residential, commercial, or industrial), ‘TOTAL_POPULATION’ (the population capacity of a given building), and ‘TOTAL_KWH’ (the total energy consumed in 2010, in kilowatt-hours).
[Reference: http://catalog.data.gov/dataset/energy-usage-2010-24a67]
Below, the “Energy Usage 2010” dataset is loaded into R, and its summary statistics and structure are displayed (along with the “head” and the “tail” of the dataset).
#Load the "Energy Usage 2010" dataset into R, assigning the complete data frame to the variable "energy_raw".
rm(list=ls())
energy_raw <- read.csv("~/Academics (RPI)/10. Spring 2015/Applied Regression Analysis/Assignments/Assignment #4/Energy_Usage_2010.csv", header=TRUE, stringsAsFactors = FALSE)
#Then, display the "head" and "tail" of the dataset, "energy_raw".
head(energy_raw)
## COMMUNITY.AREA.NAME CENSUS.BLOCK BUILDING_TYPE BUILDING_SUBTYPE
## 1 Albany Park 1.7e+14 Residential Multi 7+
## 2 Albany Park 1.7e+14 Residential Multi < 7
## 3 Albany Park 1.7e+14 Residential Single Family
## 4 Albany Park 1.7e+14 Residential Multi 7+
## 5 Albany Park 1.7e+14 Residential Multi < 7
## 6 Albany Park 1.7e+14 Commercial Multi < 7
## KWH.JANUARY.2010 KWH.FEBRUARY.2010 KWH.MARCH.2010 KWH.APRIL.2010
## 1 11921 12145 9759 11542
## 2 1233 1645 994 1055
## 3 4141 3798 2939 4727
## 4 1230 1333 1260 1405
## 5 12977 14639 12718 14973
## 6 2878 3755 4571 2984
## KWH.MAY.2010 KWH.JUNE.2010 KWH.JULY.2010 KWH.AUGUST.2010
## 1 14348 26617 24210 20383
## 2 1284 3527 3099 2527
## 3 5324 9676 7591 6287
## 4 1699 2094 732 1312
## 5 16384 32940 24454 23926
## 6 3111 4808 4132 3564
## KWH.SEPTEMBER.2010 KWH.OCTOBER.2010 KWH.NOVEMBER.2010 KWH.DECEMBER.2010
## 1 11983 10335 25327 22462
## 2 904 626 2092 1622
## 3 2920 2565 5979 5073
## 4 1462 1358 1372 1495
## 5 15012 13679 31979 30660
## 6 2174 1985 5968 5400
## TOTAL_KWH ELECTRICITY.ACCOUNTS ZERO.KWH.ACCOUNTS THERM.JANUARY.2010
## 1 201032 48 22 7247
## 2 20608 Less than 4 1 321
## 3 61020 6 2 1222
## 4 16752 Less than 4 2 2961
## 5 244341 49 32 11508
## 6 45330 7 0 1793
## THERM.FEBRUARY.2010 THERM.MARCH.2010 TERM.APRIL.2010 THERM.MAY.2010
## 1 5904 5180 3113 1822
## 2 130 86 49 19
## 3 1016 860 543 346
## 4 2664 1616 798 344
## 5 9057 8000 4529 2809
## 6 1573 1352 890 853
## THERM.JUNE.2010 THERM.JULY.2010 THERM.AUGUST.2010 THERM.SEPTEMBER.2010
## 1 1272 1234 952 1780
## 2 13 7 10 12
## 3 247 203 179 170
## 4 404 320 272 368
## 5 1507 1179 991 994
## 6 541 448 438 439
## THERM.OCTOBER.2010 THERM.NOVEMBER.2010 THERM.DECEMBER.2010 TOTAL_THERMS
## 1 1472 1961 4885 36822
## 2 9 21 78 755
## 3 190 298 791 6065
## 4 745 1260 2901 14653
## 5 1254 2595 7167 51590
## 6 565 787 1538 11217
## GAS.ACCOUNTS KWH.TOTAL.SQFT THERMS.TOTAL.SQFT KWH.MEAN.2010
## 1 21 48825 48825 20103.20
## 2 Less than 4 3306 3306 20608.00
## 3 6 9472 9472 10170.00
## 4 6 14407 14407 16752.00
## 5 54 58835 58835 15271.31
## 6 6 8240 8240 22665.00
## KWH.STANDARD.DEVIATION.2010 KWH.MINIMUM.2010 KWH.1ST.QUARTILE.2010
## 1 8609.69 9414 12563.0
## 2 NA 20608 20608.0
## 3 4410.10 5619 6746.0
## 4 NA 16752 16752.0
## 5 8089.70 5462 10343.5
## 6 9526.14 15929 15929.0
## KWH.2ND.QUARTILE.2010 KWH.3RD.QUARTILE.2010 KWH.MAXIMUM.2010
## 1 19072.5 22177.0 36781
## 2 20608.0 20608.0 20608
## 3 9055.5 13014.0 17530
## 4 16752.0 16752.0 16752
## 5 12427.0 17495.5 34236
## 6 22665.0 29401.0 29401
## KWH.SQFT.MEAN.2010 KWH.SQFT.STANDARD.DEVIATION.2010
## 1 24412.50 5698.57
## 2 3306.00 NA
## 3 1578.67 863.85
## 4 14407.00 NA
## 5 3677.19 1061.65
## 6 8240.00 NA
## KWH.SQFT.MINIMUM.2010 KWH.SQFT.1ST.QUARTILE.2010
## 1 20383 20383
## 2 3306 3306
## 3 1226 1226
## 4 14407 14407
## 5 2414 2546
## 6 8240 8240
## KWH.SQFT.2ND.QUARTILE.2010 KWH.SQFT.3RD.QUARTILE.2010
## 1 24412.5 28442
## 2 3306.0 3306
## 3 1226.0 1226
## 4 14407.0 14407
## 5 3553.5 4692
## 6 8240.0 8240
## KWH.SQFT.MAXIMUM.2010 THERM.MEAN.2010 THERM.STANDARD.DEVIATION.2010
## 1 28442 5260.29 8435.63
## 2 3306 755.00 NA
## 3 3342 1010.83 620.53
## 4 14407 14653.00 NA
## 5 5530 3224.38 1079.13
## 6 8240 5608.50 5620.79
## THERM.MINIMUM.2010 THERM.1ST.QUARTILE.2010 THERM.2ND.QUARTILE.2010
## 1 882 957 1102.0
## 2 755 755 755.0
## 3 496 514 835.5
## 4 14653 14653 14653.0
## 5 2071 2499 2933.5
## 6 1634 1634 5608.5
## THERM.3RD.QUARTILE.2010 THERM.MAXIMUM.2010 THERMS.SQFT.MEAN.2010
## 1 8024.0 23460 24412.50
## 2 755.0 755 3306.00
## 3 1240.0 2144 1578.67
## 4 14653.0 14653 14407.00
## 5 3593.5 5754 3677.19
## 6 9583.0 9583 8240.00
## THERMS.SQFT.STANDARD.DEVIATION.2010 THERMS.SQFT.MINIMUM.2010
## 1 5698.57 20383
## 2 NA 3306
## 3 863.85 1226
## 4 NA 14407
## 5 1061.65 2414
## 6 NA 8240
## THERMS.SQFT.1ST.QUARTILE.2010 THERMS.SQFT.2ND.QUARTILE.2010
## 1 20383 24412.5
## 2 3306 3306.0
## 3 1226 1226.0
## 4 14407 14407.0
## 5 2546 3553.5
## 6 8240 8240.0
## THERMS.SQFT.3RD.QUARTILE.2010 THERMS.SQFT.MAXIMUM.2010 TOTAL_POPULATION
## 1 28442 28442 132
## 2 3306 3306 132
## 3 1226 3342 132
## 4 14407 14407 228
## 5 4692 5530 228
## 6 8240 8240 231
## TOTAL.UNITS AVERAGE.STORIES AVERAGE.BUILDING.AGE AVERAGE.HOUSESIZE
## 1 64 3.00 65.50 2.20
## 2 64 2.00 86.00 2.20
## 3 64 1.17 14.33 2.20
## 4 79 3.00 86.00 3.51
## 5 79 2.50 87.69 3.51
## 6 70 1.00 0.00 3.73
## OCCUPIED.UNITS OCCUPIED.UNITS.PERCENTAGE RENTER.OCCUPIED.HOUSING.UNITS
## 1 60 0.9375 33
## 2 60 0.9375 33
## 3 60 0.9375 33
## 4 65 0.8228 49
## 5 65 0.8228 49
## 6 62 0.8856 49
## RENTER.OCCUPIED.HOUSING.PERCENTAGE OCCUPIED.HOUSING.UNITS
## 1 0.550 60
## 2 0.550 60
## 3 0.550 60
## 4 0.754 65
## 5 0.754 65
## 6 0.790 62
tail(energy_raw)
## COMMUNITY.AREA.NAME CENSUS.BLOCK BUILDING_TYPE BUILDING_SUBTYPE
## 66969 Woodlawn 1.7e+14 Residential Multi < 7
## 66970 Woodlawn 1.7e+14 Residential Single Family
## 66971 Woodlawn 1.7e+14 Commercial Multi < 7
## 66972 Woodlawn 1.7e+14 Residential Multi < 7
## 66973 Woodlawn 1.7e+14 Residential Single Family
## 66974 Woodlawn 1.7e+14 Residential Multi < 7
## KWH.JANUARY.2010 KWH.FEBRUARY.2010 KWH.MARCH.2010 KWH.APRIL.2010
## 66969 9572 9104 8525 7756
## 66970 2705 1318 1582 1465
## 66971 1005 1760 1521 1832
## 66972 3567 3031 2582 2295
## 66973 1208 1055 1008 1109
## 66974 2717 3057 2695 3793
## KWH.MAY.2010 KWH.JUNE.2010 KWH.JULY.2010 KWH.AUGUST.2010
## 66969 11256 11669 12099 13200
## 66970 1494 2990 2449 2351
## 66971 2272 2361 3018 3030
## 66972 7902 4987 5773 3996
## 66973 1591 1367 1569 1551
## 66974 4237 5383 5544 6929
## KWH.SEPTEMBER.2010 KWH.OCTOBER.2010 KWH.NOVEMBER.2010
## 66969 9694 8419 19077
## 66970 1213 2174 2888
## 66971 2886 3833 6290
## 66972 3050 3103 3880
## 66973 1376 1236 2108
## 66974 5280 5971 6986
## KWH.DECEMBER.2010 TOTAL_KWH ELECTRICITY.ACCOUNTS ZERO.KWH.ACCOUNTS
## 66969 18869 139240 21 18
## 66970 5025 27654 6 7
## 66971 12169 41977 9 5
## 66972 4684 48850 7 2
## 66973 2529 17707 7 9
## 66974 5144 57736 12 17
## THERM.JANUARY.2010 THERM.FEBRUARY.2010 THERM.MARCH.2010
## 66969 6914 5433 5054
## 66970 2166 1681 1858
## 66971 985 1152 1238
## 66972 2202 1874 1647
## 66973 95 11 47
## 66974 2372 1787 1449
## TERM.APRIL.2010 THERM.MAY.2010 THERM.JUNE.2010 THERM.JULY.2010
## 66969 2967 2241 1107 770
## 66970 1172 708 360 72
## 66971 630 475 192 141
## 66972 906 645 346 84
## 66973 9 45 18 22
## 66974 718 572 286 155
## THERM.AUGUST.2010 THERM.SEPTEMBER.2010 THERM.OCTOBER.2010
## 66969 674 788 954
## 66970 67 77 185
## 66971 162 144 210
## 66972 150 150 260
## 66973 9 17 11
## 66974 134 161 303
## THERM.NOVEMBER.2010 THERM.DECEMBER.2010 TOTAL_THERMS GAS.ACCOUNTS
## 66969 2423 4619 33944 25
## 66970 623 1800 10769 9
## 66971 653 1744 7726 8
## 66972 694 1335 10293 5
## 66973 18 13 315 5
## 66974 588 1469 9994 13
## KWH.TOTAL.SQFT THERMS.TOTAL.SQFT KWH.MEAN.2010
## 66969 48349 48349 12658.18
## 66970 7801 7801 6913.50
## 66971 11838 11838 13992.33
## 66972 11028 11028 16283.33
## 66973 4653 4653 4426.75
## 66974 17812 13776 9622.67
## KWH.STANDARD.DEVIATION.2010 KWH.MINIMUM.2010 KWH.1ST.QUARTILE.2010
## 66969 7948.06 2691 7635.0
## 66970 5695.82 2444 2872.5
## 66971 2989.54 10754 10754.0
## 66972 15000.83 7010 7010.0
## 66973 2297.29 1878 2635.0
## 66974 5625.23 1312 6288.0
## KWH.2ND.QUARTILE.2010 KWH.3RD.QUARTILE.2010 KWH.MAXIMUM.2010
## 66969 11370.0 19168.0 30287
## 66970 5139.0 10954.5 14932
## 66971 14576.0 16647.0 16647
## 66972 8250.0 33590.0 33590
## 66973 4325.0 6218.5 7179
## 66974 9586.5 15290.0 15673
## KWH.SQFT.MEAN.2010 KWH.SQFT.STANDARD.DEVIATION.2010
## 66969 4834.9 2180.96
## 66970 3900.5 1429.06
## 66971 5919.0 725.49
## 66972 3676.0 1022.80
## 66973 4653.0 NA
## 66974 3562.4 2911.56
## KWH.SQFT.MINIMUM.2010 KWH.SQFT.1ST.QUARTILE.2010
## 66969 2810 3166
## 66970 2890 2890
## 66971 5406 5406
## 66972 2800 2800
## 66973 4653 4653
## 66974 1866 2170
## KWH.SQFT.2ND.QUARTILE.2010 KWH.SQFT.3RD.QUARTILE.2010
## 66969 3771.0 7232
## 66970 3900.5 4911
## 66971 5919.0 6432
## 66972 3428.0 4800
## 66973 4653.0 4653
## 66974 2472.0 2556
## KWH.SQFT.MAXIMUM.2010 THERM.MEAN.2010 THERM.STANDARD.DEVIATION.2010
## 66969 8016 3085.82 1542.64
## 66970 4911 2692.25 3661.92
## 66971 6432 2575.33 3492.97
## 66972 4800 3431.00 1155.32
## 66973 4653 105.00 80.30
## 66974 8748 2498.50 2372.88
## THERM.MINIMUM.2010 THERM.1ST.QUARTILE.2010 THERM.2ND.QUARTILE.2010
## 66969 621 2300 2669.0
## 66970 272 464 1195.5
## 66971 42 42 1124.0
## 66972 2449 2449 3140.0
## 66973 49 49 69.0
## 66974 487 578 2029.0
## THERM.3RD.QUARTILE.2010 THERM.MAXIMUM.2010 THERMS.SQFT.MEAN.2010
## 66969 4408.0 6246 4834.9
## 66970 4920.5 8106 3900.5
## 66971 6560.0 6560 5919.0
## 66972 4704.0 4704 3676.0
## 66973 197.0 197 4653.0
## 66974 4419.0 5449 4592.0
## THERMS.SQFT.STANDARD.DEVIATION.2010 THERMS.SQFT.MINIMUM.2010
## 66969 2180.96 2810
## 66970 1429.06 2890
## 66971 725.49 5406
## 66972 1022.80 2800
## 66973 NA 4653
## 66974 3599.45 2472
## THERMS.SQFT.1ST.QUARTILE.2010 THERMS.SQFT.2ND.QUARTILE.2010
## 66969 3166 3771.0
## 66970 2890 3900.5
## 66971 5406 5919.0
## 66972 2800 3428.0
## 66973 4653 4653.0
## 66974 2472 2556.0
## THERMS.SQFT.3RD.QUARTILE.2010 THERMS.SQFT.MAXIMUM.2010
## 66969 7232 8016
## 66970 4911 4911
## 66971 6432 6432
## 66972 4800 4800
## 66973 4653 4653
## 66974 8748 8748
## TOTAL_POPULATION TOTAL.UNITS AVERAGE.STORIES AVERAGE.BUILDING.AGE
## 66969 116 55 2.00 51.90
## 66970 116 55 1.00 0.00
## 66971 31 24 3.00 104.50
## 66972 31 24 2.33 100.67
## 66973 0 0 1.00 0.00
## 66974 77 49 2.00 79.40
## AVERAGE.HOUSESIZE OCCUPIED.UNITS OCCUPIED.UNITS.PERCENTAGE
## 66969 3.14 37 0.6727
## 66970 3.14 37 0.6727
## 66971 2.07 15 0.6250
## 66972 2.07 15 0.6250
## 66973 0.00 0 NA
## 66974 2.57 30 0.6122
## RENTER.OCCUPIED.HOUSING.UNITS RENTER.OCCUPIED.HOUSING.PERCENTAGE
## 66969 26 0.7030
## 66970 26 0.7030
## 66971 13 0.8670
## 66972 13 0.8670
## 66973 0 NA
## 66974 28 0.9329
## OCCUPIED.HOUSING.UNITS
## 66969 37
## 66970 37
## 66971 15
## 66972 15
## 66973 0
## 66974 30
#Display the summary statistics and the structure of the data
summary(energy_raw)
## COMMUNITY.AREA.NAME CENSUS.BLOCK BUILDING_TYPE
## Length:66974 Min. :1.7e+14 Length:66974
## Class :character 1st Qu.:1.7e+14 Class :character
## Mode :character Median :1.7e+14 Mode :character
## Mean :1.7e+14
## 3rd Qu.:1.7e+14
## Max. :1.7e+14
##
## BUILDING_SUBTYPE KWH.JANUARY.2010 KWH.FEBRUARY.2010
## Length:66974 Min. : 0 Min. : 0
## Class :character 1st Qu.: 1369 1st Qu.: 1612
## Mode :character Median : 3476 Median : 3806
## Mean : 12810 Mean : 12582
## 3rd Qu.: 7138 3rd Qu.: 7396
## Max. :21214017 Max. :21065500
## NA's :871 NA's :871
## KWH.MARCH.2010 KWH.APRIL.2010 KWH.MAY.2010
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 1585 1st Qu.: 1578 1st Qu.: 1955
## Median : 3676 Median : 3636 Median : 4522
## Mean : 11707 Mean : 11463 Mean : 13853
## 3rd Qu.: 7042 3rd Qu.: 6989 3rd Qu.: 8922
## Max. :18503691 Max. :17310058 Max. :21344049
## NA's :871 NA's :871 NA's :871
## KWH.JUNE.2010 KWH.JULY.2010 KWH.AUGUST.2010
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 2695 1st Qu.: 3199 1st Qu.: 2834
## Median : 6283 Median : 7375 Median : 6404
## Mean : 17213 Mean : 18845 Mean : 16989
## 3rd Qu.: 12793 3rd Qu.: 14624 3rd Qu.: 12274
## Max. :20209197 Max. :21478035 Max. :18586958
## NA's :871 NA's :871 NA's :871
## KWH.SEPTEMBER.2010 KWH.OCTOBER.2010 KWH.NOVEMBER.2010
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 2024 1st Qu.: 1951 1st Qu.: 2639
## Median : 4566 Median : 4354 Median : 5851
## Mean : 13595 Mean : 12595 Mean : 15705
## 3rd Qu.: 8612 3rd Qu.: 8154 3rd Qu.: 11044
## Max. :19280342 Max. :18423025 Max. :20670698
## NA's :871 NA's :871 NA's :871
## KWH.DECEMBER.2010 TOTAL_KWH ELECTRICITY.ACCOUNTS
## Min. : 0 Min. : 102 Length:66974
## 1st Qu.: 3076 1st Qu.: 28188 Class :character
## Median : 6813 Median : 62272 Mode :character
## Mean : 18315 Mean : 175672
## 3rd Qu.: 12602 3rd Qu.: 118172
## Max. :25060008 Max. :231280522
## NA's :871 NA's :871
## ZERO.KWH.ACCOUNTS THERM.JANUARY.2010 THERM.FEBRUARY.2010 THERM.MARCH.2010
## Min. : 0.000 Min. : 1 Min. : 1 Min. : 1
## 1st Qu.: 1.000 1st Qu.: 1022 1st Qu.: 897 1st Qu.: 736
## Median : 2.000 Median : 2141 Median : 1901 Median : 1558
## Mean : 4.771 Mean : 3306 Mean : 2893 Mean : 2406
## 3rd Qu.: 5.000 3rd Qu.: 3866 3rd Qu.: 3418 3rd Qu.: 2808
## Max. :601.000 Max. :566238 Max. :511323 Max. :557509
## NA's :2230 NA's :4232 NA's :1482
## TERM.APRIL.2010 THERM.MAY.2010 THERM.JUNE.2010 THERM.JULY.2010
## Min. : 1 Min. : 1.0 Min. : 1.0 Min. : 1.0
## 1st Qu.: 354 1st Qu.: 209.0 1st Qu.: 113.0 1st Qu.: 87.0
## Median : 779 Median : 469.0 Median : 256.0 Median : 197.0
## Mean : 1261 Mean : 807.2 Mean : 498.3 Mean : 418.4
## 3rd Qu.: 1440 3rd Qu.: 875.0 3rd Qu.: 486.0 3rd Qu.: 369.0
## Max. :624882 Max. :651226.0 Max. :631383.0 Max. :680201.0
## NA's :1575 NA's :1857 NA's :1767 NA's :1820
## THERM.AUGUST.2010 THERM.SEPTEMBER.2010 THERM.OCTOBER.2010
## Min. : 1.0 Min. : 1.0 Min. : 1.0
## 1st Qu.: 79.0 1st Qu.: 82.0 1st Qu.: 122.0
## Median : 180.0 Median : 187.0 Median : 276.0
## Mean : 399.7 Mean : 401.2 Mean : 568.2
## 3rd Qu.: 340.0 3rd Qu.: 347.0 3rd Qu.: 509.2
## Max. :693230.0 Max. :634051.0 Max. :593026.0
## NA's :1908 NA's :2282 NA's :1722
## THERM.NOVEMBER.2010 THERM.DECEMBER.2010 TOTAL_THERMS
## Min. : 1 Min. : 1 Min. : 25
## 1st Qu.: 282 1st Qu.: 774 1st Qu.: 4879
## Median : 629 Median : 1631 Median : 10340
## Mean : 1150 Mean : 2645 Mean : 16524
## 3rd Qu.: 1167 3rd Qu.: 2965 3rd Qu.: 18570
## Max. :539356 Max. :566326 Max. :7035940
## NA's :1559 NA's :1544 NA's :1296
## GAS.ACCOUNTS KWH.TOTAL.SQFT THERMS.TOTAL.SQFT
## Length:66974 Min. : 300 Min. : 300
## Class :character 1st Qu.: 5385 1st Qu.: 5368
## Mode :character Median : 10858 Median : 10844
## Mean : 21093 Mean : 20347
## 3rd Qu.: 18721 3rd Qu.: 18844
## Max. :6548217 Max. :6548217
## NA's :1150 NA's :1673
## KWH.MEAN.2010 KWH.STANDARD.DEVIATION.2010 KWH.MINIMUM.2010
## Min. : 102 Min. : 0 Min. : 100
## 1st Qu.: 8229 1st Qu.: 3630 1st Qu.: 2164
## Median : 10515 Median : 5148 Median : 4377
## Mean : 62493 Mean : 40323 Mean : 36852
## 3rd Qu.: 15645 3rd Qu.: 8065 3rd Qu.: 8774
## Max. :227750000 Max. :162851049 Max. :227752064
## NA's :871 NA's :9956 NA's :871
## KWH.1ST.QUARTILE.2010 KWH.2ND.QUARTILE.2010 KWH.3RD.QUARTILE.2010
## Min. : 100 Min. : 102 Min. : 102
## 1st Qu.: 4766 1st Qu.: 7636 1st Qu.: 10477
## Median : 6746 Median : 9944 Median : 13623
## Mean : 39158 Mean : 55773 Mean : 85608
## 3rd Qu.: 10374 3rd Qu.: 14603 3rd Qu.: 20018
## Max. :227752064 Max. :227752064 Max. :230793342
## NA's :871 NA's :871 NA's :871
## KWH.MAXIMUM.2010 KWH.SQFT.MEAN.2010 KWH.SQFT.STANDARD.DEVIATION.2010
## Min. : 102 Min. : 300 Min. : 0
## 1st Qu.: 13281 1st Qu.: 1326 1st Qu.: 240
## Median : 18033 Median : 2214 Median : 471
## Mean : 103512 Mean : 7665 Mean : 3446
## 3rd Qu.: 26276 3rd Qu.: 3790 3rd Qu.: 1048
## Max. :230793342 Max. :6548217 Max. :3840818
## NA's :871 NA's :1150 NA's :15385
## KWH.SQFT.MINIMUM.2010 KWH.SQFT.1ST.QUARTILE.2010
## Min. : 100 Min. : 100
## 1st Qu.: 954 1st Qu.: 1078
## Median : 1534 Median : 1760
## Mean : 5604 Mean : 5792
## 3rd Qu.: 2684 3rd Qu.: 2854
## Max. :6548217 Max. :6548217
## NA's :1150 NA's :1150
## KWH.SQFT.2ND.QUARTILE.2010 KWH.SQFT.3RD.QUARTILE.2010
## Min. : 300 Min. : 300
## 1st Qu.: 1250 1st Qu.: 1490
## Median : 2132 Median : 2470
## Mean : 7268 Mean : 9534
## 3rd Qu.: 3612 3rd Qu.: 4491
## Max. :6548217 Max. :6548217
## NA's :1150 NA's :1150
## KWH.SQFT.MAXIMUM.2010 THERM.MEAN.2010 THERM.STANDARD.DEVIATION.2010
## Min. : 300 Min. : 25 Min. : 0
## 1st Qu.: 1890 1st Qu.: 1365 1st Qu.: 351
## Median : 2810 Median : 1842 Median : 577
## Mean : 10581 Mean : 4062 Mean : 2649
## 3rd Qu.: 5254 3rd Qu.: 2707 3rd Qu.: 1183
## Max. :6548217 Max. :6600274 Max. :4941759
## NA's :1150 NA's :1296 NA's :10230
## THERM.MINIMUM.2010 THERM.1ST.QUARTILE.2010 THERM.2ND.QUARTILE.2010
## Min. : 25 Min. : 25 Min. : 25
## 1st Qu.: 592 1st Qu.: 957 1st Qu.: 1286
## Median : 990 Median : 1290 Median : 1724
## Mean : 2267 Mean : 2545 Mean : 3634
## 3rd Qu.: 1643 3rd Qu.: 1878 3rd Qu.: 2474
## Max. :6600274 Max. :6600274 Max. :6600274
## NA's :1296 NA's :1296 NA's :1296
## THERM.3RD.QUARTILE.2010 THERM.MAXIMUM.2010 THERMS.SQFT.MEAN.2010
## Min. : 25 Min. : 25 Min. : 300
## 1st Qu.: 1595 1st Qu.: 1934 1st Qu.: 1318
## Median : 2182 Median : 2603 Median : 2200
## Mean : 5490 Mean : 6955 Mean : 7175
## 3rd Qu.: 3241 3rd Qu.: 4069 3rd Qu.: 3736
## Max. :7012321 Max. :7012321 Max. :6548217
## NA's :1296 NA's :1296 NA's :1673
## THERMS.SQFT.STANDARD.DEVIATION.2010 THERMS.SQFT.MINIMUM.2010
## Min. : 0 Min. : 100
## 1st Qu.: 239 1st Qu.: 950
## Median : 467 Median : 1520
## Mean : 3140 Mean : 5282
## 3rd Qu.: 1034 3rd Qu.: 2651
## Max. :3840818 Max. :6548217
## NA's :15684 NA's :1673
## THERMS.SQFT.1ST.QUARTILE.2010 THERMS.SQFT.2ND.QUARTILE.2010
## Min. : 132 Min. : 300
## 1st Qu.: 1075 1st Qu.: 1244
## Median : 1756 Median : 2116
## Mean : 5462 Mean : 6799
## 3rd Qu.: 2820 3rd Qu.: 3564
## Max. :6548217 Max. :6548217
## NA's :1673 NA's :1673
## THERMS.SQFT.3RD.QUARTILE.2010 THERMS.SQFT.MAXIMUM.2010 TOTAL_POPULATION
## Min. : 300 Min. : 300 Min. : 0.00
## 1st Qu.: 1479 1st Qu.: 1888 1st Qu.: 37.00
## Median : 2450 Median : 2796 Median : 64.00
## Mean : 8897 Mean : 9851 Mean : 83.85
## 3rd Qu.: 4410 3rd Qu.: 5191 3rd Qu.: 104.00
## Max. :6548217 Max. :6548217 Max. :1590.00
## NA's :1673 NA's :1673 NA's :14
## TOTAL.UNITS AVERAGE.STORIES AVERAGE.BUILDING.AGE
## Min. : 0.00 Min. : 1.000 Min. : 0.00
## 1st Qu.: 15.00 1st Qu.: 1.140 1st Qu.: 53.00
## Median : 25.00 Median : 1.750 Median : 80.00
## Mean : 38.11 Mean : 1.887 Mean : 71.61
## 3rd Qu.: 42.00 3rd Qu.: 2.000 3rd Qu.: 96.50
## Max. :1365.00 Max. :110.000 Max. :158.00
## NA's :14
## AVERAGE.HOUSESIZE OCCUPIED.UNITS OCCUPIED.UNITS.PERCENTAGE
## Min. : 0.000 Min. : 0.0 Min. :0.0000
## 1st Qu.: 2.140 1st Qu.: 13.0 1st Qu.:0.8332
## Median : 2.700 Median : 22.0 Median :0.9148
## Mean : 2.722 Mean : 33.5 Mean :0.8804
## 3rd Qu.: 3.310 3rd Qu.: 37.0 3rd Qu.:0.9677
## Max. :12.000 Max. :1034.0 Max. :1.0000
## NA's :14 NA's :14 NA's :2445
## RENTER.OCCUPIED.HOUSING.UNITS RENTER.OCCUPIED.HOUSING.PERCENTAGE
## Min. : 0.00 Min. :0.0000
## 1st Qu.: 3.00 1st Qu.:0.2860
## Median : 11.00 Median :0.5379
## Mean : 19.78 Mean :0.5116
## 3rd Qu.: 23.00 3rd Qu.:0.7330
## Max. :1009.00 Max. :1.0000
## NA's :14 NA's :2618
## OCCUPIED.HOUSING.UNITS
## Min. : 0.0
## 1st Qu.: 13.0
## Median : 22.0
## Mean : 33.5
## 3rd Qu.: 37.0
## Max. :1034.0
## NA's :14
str(energy_raw)
## 'data.frame': 66974 obs. of 73 variables:
## $ COMMUNITY.AREA.NAME : chr "Albany Park" "Albany Park" "Albany Park" "Albany Park" ...
## $ CENSUS.BLOCK : num 1.7e+14 1.7e+14 1.7e+14 1.7e+14 1.7e+14 ...
## $ BUILDING_TYPE : chr "Residential" "Residential" "Residential" "Residential" ...
## $ BUILDING_SUBTYPE : chr "Multi 7+" "Multi < 7" "Single Family" "Multi 7+" ...
## $ KWH.JANUARY.2010 : int 11921 1233 4141 1230 12977 2878 1478 4985 4926 16639 ...
## $ KWH.FEBRUARY.2010 : int 12145 1645 3798 1333 14639 3755 1890 2636 6413 23502 ...
## $ KWH.MARCH.2010 : int 9759 994 2939 1260 12718 4571 1364 2353 5586 19587 ...
## $ KWH.APRIL.2010 : int 11542 1055 4727 1405 14973 2984 1271 4761 5606 23327 ...
## $ KWH.MAY.2010 : int 14348 1284 5324 1699 16384 3111 1464 4391 6271 26537 ...
## $ KWH.JUNE.2010 : int 26617 3527 9676 2094 32940 4808 2118 7362 11549 40725 ...
## $ KWH.JULY.2010 : int 24210 3099 7591 732 24454 4132 2384 6462 8549 41430 ...
## $ KWH.AUGUST.2010 : int 20383 2527 6287 1312 23926 3564 3767 8015 6709 41268 ...
## $ KWH.SEPTEMBER.2010 : int 11983 904 2920 1462 15012 2174 2059 7314 3963 26208 ...
## $ KWH.OCTOBER.2010 : int 10335 626 2565 1358 13679 1985 1387 3816 3480 23230 ...
## $ KWH.NOVEMBER.2010 : int 25327 2092 5979 1372 31979 5968 2874 7496 7998 43196 ...
## $ KWH.DECEMBER.2010 : int 22462 1622 5073 1495 30660 5400 3244 6391 8613 43582 ...
## $ TOTAL_KWH : int 201032 20608 61020 16752 244341 45330 25300 65982 79663 369231 ...
## $ ELECTRICITY.ACCOUNTS : chr "48" "Less than 4" "6" "Less than 4" ...
## $ ZERO.KWH.ACCOUNTS : int 22 1 2 2 32 0 2 3 2 106 ...
## $ THERM.JANUARY.2010 : int 7247 321 1222 2961 11508 1793 1554 3107 3371 22813 ...
## $ THERM.FEBRUARY.2010 : int 5904 130 1016 2664 9057 1573 1195 2749 2647 18905 ...
## $ THERM.MARCH.2010 : int 5180 86 860 1616 8000 1352 1280 2228 2396 16890 ...
## $ TERM.APRIL.2010 : int 3113 49 543 798 4529 890 821 1331 1407 10504 ...
## $ THERM.MAY.2010 : int 1822 19 346 344 2809 853 663 738 833 6981 ...
## $ THERM.JUNE.2010 : int 1272 13 247 404 1507 541 607 443 460 4455 ...
## $ THERM.JULY.2010 : int 1234 7 203 320 1179 448 487 329 286 3456 ...
## $ THERM.AUGUST.2010 : int 952 10 179 272 991 438 476 284 260 3232 ...
## $ THERM.SEPTEMBER.2010 : int 1780 12 170 368 994 439 382 288 246 3306 ...
## $ THERM.OCTOBER.2010 : int 1472 9 190 745 1254 565 459 301 323 3477 ...
## $ THERM.NOVEMBER.2010 : int 1961 21 298 1260 2595 787 590 520 632 5898 ...
## $ THERM.DECEMBER.2010 : int 4885 78 791 2901 7167 1538 971 1821 1919 14630 ...
## $ TOTAL_THERMS : int 36822 755 6065 14653 51590 11217 9485 14139 14780 114547 ...
## $ GAS.ACCOUNTS : chr "21" "Less than 4" "6" "6" ...
## $ KWH.TOTAL.SQFT : int 48825 3306 9472 14407 58835 8240 13305 16654 9690 127916 ...
## $ THERMS.TOTAL.SQFT : int 48825 3306 9472 14407 58835 8240 13305 16654 10840 127916 ...
## $ KWH.MEAN.2010 : num 20103 20608 10170 16752 15271 ...
## $ KWH.STANDARD.DEVIATION.2010 : num 8610 NA 4410 NA 8090 ...
## $ KWH.MINIMUM.2010 : int 9414 20608 5619 16752 5462 15929 7285 8496 5388 4397 ...
## $ KWH.1ST.QUARTILE.2010 : num 12563 20608 6746 16752 10344 ...
## $ KWH.2ND.QUARTILE.2010 : num 19073 20608 9056 16752 12427 ...
## $ KWH.3RD.QUARTILE.2010 : num 22177 20608 13014 16752 17496 ...
## $ KWH.MAXIMUM.2010 : int 36781 20608 17530 16752 34236 29401 18015 16794 19735 39809 ...
## $ KWH.SQFT.MEAN.2010 : num 24413 3306 1579 14407 3677 ...
## $ KWH.SQFT.STANDARD.DEVIATION.2010 : num 5699 NA 864 NA 1062 ...
## $ KWH.SQFT.MINIMUM.2010 : int 20383 3306 1226 14407 2414 8240 13305 2448 1116 24751 ...
## $ KWH.SQFT.1ST.QUARTILE.2010 : num 20383 3306 1226 14407 2546 ...
## $ KWH.SQFT.2ND.QUARTILE.2010 : num 24413 3306 1226 14407 3554 ...
## $ KWH.SQFT.3RD.QUARTILE.2010 : num 28442 3306 1226 14407 4692 ...
## $ KWH.SQFT.MAXIMUM.2010 : int 28442 3306 3342 14407 5530 8240 13305 4554 1334 27975 ...
## $ THERM.MEAN.2010 : num 5260 755 1011 14653 3224 ...
## $ THERM.STANDARD.DEVIATION.2010 : num 8436 NA 621 NA 1079 ...
## $ THERM.MINIMUM.2010 : int 882 755 496 14653 2071 1634 1866 2689 835 114 ...
## $ THERM.1ST.QUARTILE.2010 : num 957 755 514 14653 2499 ...
## $ THERM.2ND.QUARTILE.2010 : num 1102 755 836 14653 2934 ...
## $ THERM.3RD.QUARTILE.2010 : num 8024 755 1240 14653 3594 ...
## $ THERM.MAXIMUM.2010 : int 23460 755 2144 14653 5754 9583 7619 2956 2372 28459 ...
## $ THERMS.SQFT.MEAN.2010 : num 24413 3306 1579 14407 3677 ...
## $ THERMS.SQFT.STANDARD.DEVIATION.2010: num 5699 NA 864 NA 1062 ...
## $ THERMS.SQFT.MINIMUM.2010 : int 20383 3306 1226 14407 2414 8240 13305 2448 1116 24751 ...
## $ THERMS.SQFT.1ST.QUARTILE.2010 : num 20383 3306 1226 14407 2546 ...
## $ THERMS.SQFT.2ND.QUARTILE.2010 : num 24413 3306 1226 14407 3554 ...
## $ THERMS.SQFT.3RD.QUARTILE.2010 : num 28442 3306 1226 14407 4692 ...
## $ THERMS.SQFT.MAXIMUM.2010 : int 28442 3306 3342 14407 5530 8240 13305 4554 1334 27975 ...
## $ TOTAL_POPULATION : int 132 132 132 228 228 231 231 231 231 456 ...
## $ TOTAL.UNITS : int 64 64 64 79 79 70 70 70 70 180 ...
## $ AVERAGE.STORIES : num 3 2 1.17 3 2.5 1 3 2.2 1 3 ...
## $ AVERAGE.BUILDING.AGE : num 65.5 86 14.3 86 87.7 ...
## $ AVERAGE.HOUSESIZE : num 2.2 2.2 2.2 3.51 3.51 3.73 3.73 3.73 3.73 2.73 ...
## $ OCCUPIED.UNITS : int 60 60 60 65 65 62 62 62 62 167 ...
## $ OCCUPIED.UNITS.PERCENTAGE : num 0.938 0.938 0.938 0.823 0.823 ...
## $ RENTER.OCCUPIED.HOUSING.UNITS : int 33 33 33 49 49 49 49 49 49 167 ...
## $ RENTER.OCCUPIED.HOUSING.PERCENTAGE : num 0.55 0.55 0.55 0.754 0.754 0.79 0.79 0.79 0.79 1 ...
## $ OCCUPIED.HOUSING.UNITS : int 60 60 60 65 65 62 62 62 62 167 ...
#Create a subset of "energy_raw" that contains only the three variables of interest, then remove rows with missing values
energy_data0 <- subset(energy_raw, select = c(BUILDING_TYPE, TOTAL_KWH, TOTAL_POPULATION))
energy_data1 <- na.omit(energy_data0)
#Display the "head" and "tail" of the dataset, "energy_data1"
head(energy_data1)
## BUILDING_TYPE TOTAL_KWH TOTAL_POPULATION
## 1 Residential 201032 132
## 2 Residential 20608 132
## 3 Residential 61020 132
## 4 Residential 16752 228
## 5 Residential 244341 228
## 6 Commercial 45330 231
tail(energy_data1)
## BUILDING_TYPE TOTAL_KWH TOTAL_POPULATION
## 66969 Residential 139240 116
## 66970 Residential 27654 116
## 66971 Commercial 41977 31
## 66972 Residential 48850 31
## 66973 Residential 17707 0
## 66974 Residential 57736 77
#Display the summary statistics and the structure of the data
summary(energy_data1)
## BUILDING_TYPE TOTAL_KWH TOTAL_POPULATION
## Length:66089 Min. : 102 Min. : 0.00
## Class :character 1st Qu.: 28189 1st Qu.: 37.00
## Mode :character Median : 62271 Median : 64.00
## Mean : 175675 Mean : 83.81
## 3rd Qu.: 118156 3rd Qu.: 104.00
## Max. :231280522 Max. :1590.00
str(energy_data1)
## 'data.frame': 66089 obs. of 3 variables:
## $ BUILDING_TYPE : chr "Residential" "Residential" "Residential" "Residential" ...
## $ TOTAL_KWH : int 201032 20608 61020 16752 244341 45330 25300 65982 79663 369231 ...
## $ TOTAL_POPULATION: int 132 132 132 228 228 231 231 231 231 456 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:885] 67 85 104 128 328 415 494 522 804 853 ...
## .. ..- attr(*, "names")= chr [1:885] "67" "85" "104" "128" ...
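The drop from 66,974 to 66,089 rows is the work of `na.omit`, which removed the 885 rows listed in the `na.action` attribute. This matches the 871 missing values in ‘TOTAL_KWH’ plus the 14 in ‘TOTAL_POPULATION’ reported in the summary of "energy_raw". A minimal sketch of how such a check could be done is given below; the data frame `toy` and its values are hypothetical and are not taken from the dataset.

```r
# Hypothetical toy data standing in for the three selected columns
toy <- data.frame(
  BUILDING_TYPE    = c("Residential", "Commercial", "Residential", "Industrial"),
  TOTAL_KWH        = c(20608L, NA, 61020L, 45330L),
  TOTAL_POPULATION = c(132L, 231L, NA, 77L),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# na.omit() drops every row that has at least one missing value
toy_complete <- na.omit(toy)
rows_dropped <- nrow(toy) - nrow(toy_complete)

print(na_per_column)
print(rows_dropped)
```

Running the same two steps on "energy_data0" would show which of the three variables is responsible for the removed rows.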
#Recode 'BUILDING_TYPE' as a binary variable (where 1 represents residential buildings and 0 represents non-residential buildings, which corresponds to both commercial and industrial buildings)
energy_data1$BUILDING_TYPE = as.character(energy_data1$BUILDING_TYPE)
energy_data1$BUILDING_TYPE[energy_data1$BUILDING_TYPE != "Residential"] = 0
energy_data1$BUILDING_TYPE[energy_data1$BUILDING_TYPE == "Residential"] = 1
#Categorize 'BUILDING_TYPE' as a factor and display its resulting levels
energy_data1$BUILDING_TYPE = as.factor(energy_data1$BUILDING_TYPE)
levels(energy_data1$BUILDING_TYPE)
## [1] "0" "1"
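The recoding above assigns 1 to residential buildings and 0 to all other building types. The same result can be sketched in a single step, which avoids overwriting a character column in place. The vector `building_type` and the name `is_residential` below are hypothetical and only illustrate the idea.

```r
# Hypothetical toy vector of building types
building_type <- c("Residential", "Commercial", "Industrial", "Residential")

# 1 = residential, 0 = non-residential (commercial or industrial)
is_residential <- factor(
  ifelse(building_type == "Residential", 1, 0),
  levels = c(0, 1)
)

# Check the levels and cross-tabulate against the original labels
print(levels(is_residential))
print(table(building_type, is_residential))
```

Because the first factor level is "0", `glm` with a binomial family would model the probability of the level "1", that is, of a building being residential.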
Following this initial summary of the data, a hierarchical approach is used to develop a multiple linear logistic regression model. A U.S. Department of Energy document entitled “Energy Efficiency Trends in Residential and Commercial Buildings” [reference: http://apps1.eere.energy.gov/buildings/publications/pdfs/corporate/bt_stateindustry.pdf] indicates that a relationship exists between energy consumption, building type (residential, commercial, etc.), and building population. Using the “Energy Usage 2010” dataset, we therefore aim to determine whether building type can be predicted from energy consumption (in kilowatt-hours) and/or building population. Building type is treated as a dichotomous dependent variable, and both building population and energy consumption (in kilowatt-hours) are treated as continuous independent variables.
With this hierarchical approach, we test whether the variation observed in the dependent variable (‘BUILDING_TYPE’) can be explained by the variation in either of the independent variables (‘TOTAL_KWH’ and ‘TOTAL_POPULATION’). The null hypothesis states that total energy consumption (in kilowatt-hours) and building population do not have a significant effect on the determination of building type (i.e., either residential or non-residential). Conversely, the alternative hypothesis states that total energy consumption (in kilowatt-hours) and building population do have a significant effect on the determination of building type. In our analysis, we aim to create a predictive model that uses these independent variables to determine our dichotomous dependent variable.
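In symbols, and under the coding produced by the R code (1 for residential, 0 for non-residential), the model and the two hypotheses can be written as follows. The symbols p_i and the beta coefficients are notation introduced here for illustration and do not appear in the original analysis.

```latex
\operatorname{logit}(p_i) = \ln\!\left(\frac{p_i}{1 - p_i}\right)
  = \beta_0 + \beta_1\,\mathrm{TOTAL\_KWH}_i + \beta_2\,\mathrm{TOTAL\_POPULATION}_i,
\qquad p_i = P(\mathrm{BUILDING\_TYPE}_i = 1)

H_0:\ \beta_1 = \beta_2 = 0
\qquad\text{versus}\qquad
H_A:\ \beta_1 \neq 0 \ \text{or}\ \beta_2 \neq 0
```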
In this experiment, a hierarchical multiple linear logistic regression model is generated to determine whether building type can be explained by each of the independent variables considered in this analysis, and whether suppression is likely to be present in a multiple linear logistic regression model built from this data. The independent variables are total energy consumption (in kilowatt-hours) and building population, and the dependent variable is building type, characterized as either residential or non-residential.
#Generate an initial Hierarchical Multiple Linear Logistic Regression Model that uses all 66,089 complete observations in "energy_data1"
energy_model <- glm(energy_data1$BUILDING_TYPE~energy_data1$TOTAL_KWH+energy_data1$TOTAL_POPULATION, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#Display summary of the initial Hierarchical Multiple Linear Logistic Regression Model
summary(energy_model)
##
## Call:
## glm(formula = energy_data1$BUILDING_TYPE ~ energy_data1$TOTAL_KWH +
## energy_data1$TOTAL_POPULATION, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8184 -0.0002 0.7041 0.7457 5.0715
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.441e+00 1.389e-02 103.76 <2e-16 ***
## energy_data1$TOTAL_KWH -2.062e-06 6.351e-08 -32.47 <2e-16 ***
## energy_data1$TOTAL_POPULATION -1.265e-03 1.073e-04 -11.79 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 74606 on 66088 degrees of freedom
## Residual deviance: 71640 on 66086 degrees of freedom
## AIC: 71646
##
## Number of Fisher Scoring iterations: 8
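The model above references its variables as `energy_data1$...` inside the formula, which is why the coefficient names in the summary are so long. A common alternative is to pass the data frame through the `data` argument, which also makes `predict` usable on new observations. The sketch below illustrates this on simulated data; the data frame `sim`, the column names `kwh_100k`, `pop_100`, and `residential`, and the coefficients used to simulate the outcome are all hypothetical.

```r
set.seed(1)  # arbitrary seed for the simulated example

# Simulated stand-in data (hypothetical; not the Chicago dataset)
n <- 500
sim <- data.frame(
  kwh_100k = rexp(n, rate = 1),  # consumption in units of 100,000 kWh
  pop_100  = rexp(n, rate = 1)   # population in units of 100 people
)

# Outcome simulated from an assumed logistic model
eta <- 1.4 - 0.8 * sim$kwh_100k - 0.3 * sim$pop_100
sim$residential <- rbinom(n, size = 1, prob = plogis(eta))

# Passing data = keeps coefficient names short and supports predict()
fit <- glm(residential ~ kwh_100k + pop_100, data = sim, family = binomial)

# Predicted probability of "residential" for a new, hypothetical building
new_building <- data.frame(kwh_100k = 2, pop_100 = 1)
p_hat <- predict(fit, newdata = new_building, type = "response")

print(coef(fit))
print(p_hat)
```

The warning printed when the original model was fitted ("fitted probabilities numerically 0 or 1 occurred") is consistent with a few extremely large ‘TOTAL_KWH’ values (the maximum is 231,280,522) pushing the fitted probability of a residential building to essentially 0 for those rows.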
The original “Energy Usage 2010” dataset contains 66,974 observations. With a sample this large, even negligible effects can appear statistically significant, so a power analysis is performed to determine an appropriate sample size for our final multiple linear logistic regression model (with a desired alpha level of 0.05, a desired power of 0.95, an effect size of 0.02, and 2 predictors). The software G*Power is used for this calculation, and it returns a sample size of 1,188. A random sample of this size is therefore drawn from “energy_data1”, creating a new dataset for the hierarchical multiple linear logistic regression model, which is then used to determine whether building type can be explained by the variation in both energy consumption and building population.
#Randomly take a sample of 1,188 observations from "energy_data1", creating "energy_final".
S <- 1188
set.seed(45)
energy.index <- sample(1:nrow(energy_data1),S,replace=FALSE)
energy_final <- energy_data1[energy.index,]
#Generate a new Hierarchical Multiple Linear Logistic Regression Model that uses 1,188 observations
energy_model_final <- glm(energy_final$BUILDING_TYPE~energy_final$TOTAL_KWH+energy_final$TOTAL_POPULATION, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#Display summary of the final Hierarchical Multiple Linear Logistic Regression Model
summary(energy_model_final)
##
## Call:
## glm(formula = energy_final$BUILDING_TYPE ~ energy_final$TOTAL_KWH +
## energy_final$TOTAL_POPULATION, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8160 -0.8836 0.7224 0.7768 1.7534
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.446e+00 1.027e-01 14.074 < 2e-16 ***
## energy_final$TOTAL_KWH -2.023e-06 4.626e-07 -4.374 1.22e-05 ***
## energy_final$TOTAL_POPULATION -2.660e-03 7.966e-04 -3.340 0.000839 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1392.2 on 1187 degrees of freedom
## Residual deviance: 1319.7 on 1185 degrees of freedom
## AIC: 1325.7
##
## Number of Fisher Scoring iterations: 8
#Calculate the p-value associated with goodness-of-fit of entire model
null_deviance = 1392.2
residual_deviance = 1319.7
null_degrees_of_freedom = 1187
residual_degrees_of_freedom = 1185
p_value = 1 - pchisq((null_deviance - residual_deviance), (null_degrees_of_freedom - residual_degrees_of_freedom))
p_value
## [1] 2.220446e-16
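The reported value of 2.220446e-16 coincides with double-precision machine epsilon, which suggests that the subtraction in `1 - pchisq(...)` has run out of precision rather than returned the tail probability itself. A minimal sketch of the same test, assuming only the rounded deviances printed in the summary above, asks `pchisq` for the upper tail directly (the variable names here are illustrative):

```r
# Deviances and degrees of freedom copied (rounded) from summary(energy_model_final)
null_deviance <- 1392.2
residual_deviance <- 1319.7
df_difference <- 1187 - 1185

# Deviance (likelihood-ratio) statistic of the model against the intercept-only model
lr_statistic <- null_deviance - residual_deviance

# Upper-tail probability computed directly, avoiding cancellation in 1 - pchisq(...)
p_value_upper <- pchisq(lr_statistic, df = df_difference, lower.tail = FALSE)
p_value_upper
```

With 2 degrees of freedom the upper tail equals exp(-statistic / 2), so the result is on the order of 1.8e-16, and the conclusion drawn from the test is unchanged.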
#Collinearity Check
col.test <- lm(energy_final$TOTAL_KWH~energy_final$TOTAL_POPULATION)
summary(col.test)
##
## Call:
## lm(formula = energy_final$TOTAL_KWH ~ energy_final$TOTAL_POPULATION)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6244293 -318212 28178 231683 29928036
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -435050.1 68996.6 -6.305 4.05e-10 ***
## energy_final$TOTAL_POPULATION 8014.5 533.4 15.026 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1802000 on 1186 degrees of freedom
## Multiple R-squared: 0.1599, Adjusted R-squared: 0.1592
## F-statistic: 225.8 on 1 and 1186 DF, p-value: < 2.2e-16
For the hierarchical multiple linear logistic regression analysis in which ‘TOTAL_KWH’ and ‘TOTAL_POPULATION’ are analyzed against the response variable ‘BUILDING_TYPE’, p-values equal to 1.22e-05 and 0.000839 (respectively) are returned for these independent variables. In other words, if the true coefficient of each independent variable were zero, there would be roughly a probability of 1.22e-05 and 0.000839 (respectively) of observing an association with the dependent variable at least this strong as the result of solely randomization. Therefore, based on this hierarchical multiple linear logistic regression model’s yielded results (and its respective p-values outputted in the model summary above), we would reject the null hypothesis for the independent variables ‘TOTAL_KWH’ and ‘TOTAL_POPULATION’, leading us to believe that the variation that is observed in the determination of building type can be explained by the variation existent in total energy consumption and in building population and, as such, is not likely solely caused by randomization. [See above results for corresponding p-values.]
In further analyzing the results of the logistic regression analysis, it’s important to note that the value of b_0 (the model’s intercept on the log-odds scale) is 1.446, the value of b_1 (the coefficient associated with the variable ‘TOTAL_KWH’) is -2.023e-06, and the value of b_2 (the coefficient associated with the variable ‘TOTAL_POPULATION’) is -2.660e-03. These values indicate the relationship between the independent variables corresponding to total energy consumption and total building population and the dependent variable corresponding to building type: holding the other predictor constant, an increase of one unit of “total energy consumption” (in kWh) corresponds to a decrease of 2.023e-06 in the log-odds of the level of “building type” coded as 1, and an increase of one unit of “total building population” corresponds to a decrease of 2.660e-03 in those log-odds. Furthermore, with a chi-squared p-value of 2.220446e-16, our model does seem to fit significantly better than the intercept-only model.
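Because these coefficients are on the log-odds scale and the per-unit changes are tiny, they can be easier to read as odds ratios over a larger increment. The sketch below exponentiates the coefficients copied from the summary above; the increments of 100,000 kWh and 100 occupants are assumptions chosen only for readability.

```r
# Coefficients copied from summary(energy_model_final); both are on the log-odds scale
b_kwh <- -2.023e-06
b_population <- -2.660e-03

# Illustrative increments (assumptions, not taken from the analysis)
kwh_step <- 100000
population_step <- 100

# Multiplicative change in the odds per increment, holding the other predictor fixed
odds_ratio_kwh <- exp(b_kwh * kwh_step)
odds_ratio_population <- exp(b_population * population_step)
round(c(kwh = odds_ratio_kwh, population = odds_ratio_population), 3)
```

Under these assumptions, the odds of the level coded as 1 are multiplied by roughly 0.82 for each additional 100,000 kWh and by roughly 0.77 for each additional 100 occupants.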
Additionally, it’s important to note the metrics that are used here to measure the correlation between the independent variables corresponding to total energy consumption and total building population, which are multiple R-squared and adjusted R-squared (since we want to take into account any bias that might be associated with the number of explanatory variables being included in the model, this analysis emphasizes the value of adjusted R-squared rather than the value of multiple R-squared). Since the value of adjusted R-squared is 0.1592, it can be inferred that the variation that exists in the independent variable corresponding to total building population can explain approximately 15.92% of the variation existent in the independent variable corresponding to total energy consumption. As a result of this low adjusted R-squared value here, one can likely assert that the independent variables being considered in this analysis (‘TOTAL_KWH’ and ‘TOTAL_POPULATION’) do not exhibit much collinearity in our model.
Before beginning to check the model against the four “LINE” assumptions associated with linear regression modeling, histograms, boxplots, scatterplots, and a “Quality of Fit” plot (via a fitted vs. residual values determination) are generated, which will be used for their graphical nature in our interpretations.
#Generate histograms for all of the different independent variables being considered in our sampled data ('TOTAL_KWH' and 'TOTAL_POPULATION')
hist(energy_final$TOTAL_KWH, xlab = "Total Energy Consumption [in kilowatt-hours]", main = "Histogram of Total Energy Consumption")
hist(energy_final$TOTAL_POPULATION, xlab = "Total Building Population", main = "Histogram of Total Building Population")
#Generate a boxplot of the data (Independent Variable = Energy Consumption)
boxplot(x = energy_final$TOTAL_KWH, pch=21, bg="darkviolet", main="Total Energy Consumption", xlab = "Total Energy Consumption [in kilowatt-hours]")
#Generate a boxplot of the data (Independent Variable = Population)
boxplot(x = energy_final$TOTAL_POPULATION, pch=21, bg="darkviolet", main="Total Building Population", xlab = "Building Population")
#Generate a scatterplot of the data: "Building Type" vs. "Energy Consumption"
plot(y = energy_final$BUILDING_TYPE,x = energy_final$TOTAL_KWH, pch=21, bg="darkviolet", main="Total Energy Consumption vs. Building Type", ylab = "Building Type", xlab = "Energy Consumption (in kilowatt-hours)")
#Generate a scatterplot of the data: "Building Type" vs. "Building Population"
plot(y = energy_final$BUILDING_TYPE,x = energy_final$TOTAL_POPULATION, pch=21, bg="darkviolet", main="Total Building Population vs. Building Type", ylab = "Building Type", xlab = "Building Population")
#Create a "Quality of Fit Model" that plots the residuals of "energy_model_final" against its fitted model.
par(mfrow=c(1,1))
plot(fitted(energy_model_final),residuals(energy_model_final), main = "Residuals of 'energy_model_final' Against Fitted", font.main = 4, cex.main = 1.2)
mtext("Model 'energy_model_final' [Not Standardized]", font = 4, cex = 1.2)
abline(0,0, col='darkviolet', lwd=2.5)
#Create a "Quality of Fit Model" that plots the standardized residuals of "energy_model_final" against its fitted model.
par(mfrow=c(1,1))
standardized_energy_model <- rstandard(energy_model_final)
plot(fitted(energy_model_final),standardized_energy_model, main = "Standardized Residuals of 'energy_model_final'", font.main = 4, cex.main = 1.2)
mtext("Against Fitted Model 'energy_model_final'", font = 4, cex = 1.2)
abline(0,0, col='darkviolet', lwd=2.5)
In interpreting our hierarchical multiple linear logistic regression model and the statistical significance of the results that were generated therein, it is important to test the model against the four “LINE” assumptions corresponding to linear regression.
In order to assess the linearity assumption (‘L’), we can try to determine whether or not the expected (mean) value of the residuals is zero at every value of the predictor by generating a standardized residual plot for this model against its fitted values. Upon generating this plot, it appears that the residuals located across the dynamic range are not uniformly distributed along the “y=0” axis, indicating that a non-linear kind of effect likely exists within the data on which the model is built. Therefore, it is likely evident that our model does not satisfy this assumption.
#Create a "Quality of Fit Model" that plots the standardized residuals of "energy_model_final" against its fitted model.
par(mfrow=c(1,1))
standardized_energy_model <- rstandard(energy_model_final)
plot(fitted(energy_model_final),standardized_energy_model, main = "Standardized Residuals of 'energy_model_final'", font.main = 4, cex.main = 1.2)
mtext("Against Fitted Model 'energy_model_final'", font = 4, cex = 1.2)
abline(0,0, col='darkviolet', lwd=2.5)
In order to determine if the errors are independent, we can generate a residuals plot of the model and discern whether or not the residuals located across the dynamic range are uniformly distributed and exhibit no auto-correlation. Upon generating this plot, it appears that the residuals themselves seem to be generally uniformly distributed and lack auto-correlation here [exhibiting homoscedasticity and independence].
#Generate a residuals plot for "energy_model_final"
plot(energy_model_final$residuals, pch=21, bg="darkviolet", main = "Residuals Plot for 'energy_model_final'")
In order to determine if the distribution of the residuals is normal, we can generate histograms and boxplots for the residuals of the model and analyze them for skewness and kurtosis. Upon observing both the boxplot and the histogram of the residuals, it appears that the model’s residuals do exhibit some significant skewness, as the residuals seem to be skewed severely to the right (indicating that some bias is likely existent in the model). Additionally, upon observing the histogram of the residuals, it appears that there is also some kurtosis and bimodality existent in the residuals.
#Generate histograms for the residuals of our model
hist(residuals(energy_model_final), xlab = "Residuals", main = "Histogram of Residuals of 'energy_model_final'")
#Generate a boxplot for the residuals of our model
boxplot(x = residuals(energy_model_final), pch=21, bg="darkviolet", main="Boxplot of Residuals of 'energy_model_final'", xlab = "Residuals")
We can further determine whether or not the distribution of the residuals exhibits normality by generating a Normal Quantile-Quantile (QQ) Plot for the residuals of the model. Upon doing so, it’s likely that we can readily assume our data does not exhibit normality, since the constructed Normal Q-Q Plot did not seem to display a trend of data that aligned closely with the Normal Q-Q Line. Therefore, it is likely evident that our model does not satisfy this assumption.
#Create a Normal Q-Q Plot for the data pertaining to Building Type.
qqnorm(residuals(energy_model_final), main = "Normal Q-Q Plot for Residuals of 'energy_model_final'")
qqline(residuals(energy_model_final))
In order to determine if the errors at each set of values of the predictor have equal variances (or, in other words, if the variance of the residuals for every set of values for the predictor are equal), we can generate a residuals plot of the model and discern whether or not the residuals located across the dynamic range are uniformly distributed along the “y=0” axis. Upon generating this plot, it appears that the residuals themselves seem to be generally uniformly distributed and lack auto-correlation here [exhibiting equal variance].
#Generate a residuals plot for "energy_model_final"
plot(energy_model_final$residuals, pch=21, bg="darkviolet", main = "Residuals Plot for 'energy_model_final'")
In the next section, the Breusch-Pagan Test against Heteroscedasticity is performed and analyzed, which offers more insight into the determination of whether or not the errors at each set of values of the predictor have equal variances.
In order to determine if the errors at each set of values of the predictor have equal variances (or, in other words, if the variance of the residuals is equal for every set of values of the predictor), the Breusch-Pagan Test against Heteroscedasticity can be performed. The Breusch-Pagan test fits a linear regression model to the squared residuals of a linear regression model (by default, the same explanatory variables are taken as in the main regression model) and rejects the null hypothesis (where the model exhibits homoscedasticity) if too much of the variance is explained by these explanatory variables. In carrying out the Breusch-Pagan Test against Heteroscedasticity for each of the independent variables (‘TOTAL_KWH’ and ‘TOTAL_POPULATION’) against the dependent variable ‘BUILDING_TYPE’ (considered individually), p-values of 0.6159 and 0.6823 (respectively) were returned, indicating that we would fail to reject the null hypothesis of homoscedasticity for each of our independent variables at an alpha-level of 0.05. Therefore, we find no evidence of heteroscedasticity in our model’s residuals.
#install.packages("lmtest") #[needs to be installed before use]
library(lmtest)
## Warning: package 'lmtest' was built under R version 3.1.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.1.3
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
#Generate a model which considered the independent variable 'TOTAL_KWH'
KWH_model <- glm(energy_final$BUILDING_TYPE~energy_final$TOTAL_KWH, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#Perform the Breusch-Pagan test for 'KWH_model'
bptest(KWH_model)
##
## studentized Breusch-Pagan test
##
## data: KWH_model
## BP = 0.2517, df = 1, p-value = 0.6159
#Generate a model which considered the independent variable 'TOTAL_POPULATION'
POPULATION_model <- glm(energy_final$BUILDING_TYPE~energy_final$TOTAL_POPULATION, family = "binomial")
#Perform the Breusch-Pagan test for 'POPULATION_model'
bptest(POPULATION_model)
##
## studentized Breusch-Pagan test
##
## data: POPULATION_model
## BP = 0.1675, df = 1, p-value = 0.6823
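The mechanics of the studentized Breusch-Pagan test described above can be sketched in base R: the squared residuals of an ordinary least-squares fit are regressed on the same predictor, and the statistic is n times the R-squared of that auxiliary regression. The data below are simulated purely for illustration (they are not the energy data), and every object name is an assumption.

```r
# Simulated data with constant error variance (illustration only)
set.seed(1)
n <- 200
x <- runif(n)
y <- 1 + 2 * x + rnorm(n)

# Main regression, then auxiliary regression of the squared residuals on the predictor
main_fit <- lm(y ~ x)
aux_fit <- lm(residuals(main_fit)^2 ~ x)

# Studentized (Koenker) statistic: n times the auxiliary R-squared,
# compared with a chi-squared distribution on one degree of freedom per predictor
bp_statistic <- n * summary(aux_fit)$r.squared
bp_p_value <- pchisq(bp_statistic, df = 1, lower.tail = FALSE)
c(BP = bp_statistic, p_value = bp_p_value)
```

A large p-value here means the predictor explains little of the variation in the squared residuals, which is the sense in which the test fails to reject homoscedasticity.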
In interpreting our hierarchical multiple linear logistic regression model and the statistical significance of the results that were generated therein, it is important to check the model against the four main issues surrounding linear regression.
As far as causality is concerned here, we can intuitively discern that “total building population” and “total energy consumption” probabilistically and proximally cause one to predict a given building’s type. These independent variables are considered to be proximal and probabilistic causes (and not ultimate or determinate causes) because they do not “perfectly” or “directly” lead one to discern a given building’s type in every situation. However, with that being said, a stronger analysis would need to be performed for a “flipped” scenario where an explanatory model that uses “building type” to determine “total energy consumption” and/or “building population” is being considered in this regard in order to determine the existence of causality.
As far as sample sizes are concerned here, our original, raw dataset is massive (66,974 observations). If this dataset were used to generate our hierarchical multiple linear logistic regression model, the significance of our results could have been misinterpreted, since with massive datasets even negligibly small effects tend to appear statistically significant. To alleviate this concern, an appropriate sample size was determined using the software G*Power for logistic regression modeling. The use of G*Power resulted in a generated sample size of 1,188. So, with this sample size, the dataset “energy_data1” was sampled, and a new dataset “energy_final” was created. (See the section “Power Analysis for Multiple Linear Logistic Regression Modeling” within Chapter #2 for more information.)
In order to determine whether or not the predictors are a perfect linear function of other predictors (i.e., no perfect multicollinearity), a correlation matrix and a linear regression model that analyzes the independent variables against each other can be generated. Upon observing the correlation matrix, it appears that “TOTAL_KWH” and “TOTAL_POPULATION” are not very strongly correlated with each other (at a value of 0.3998996, which is not so close to 1.00), which does seem to exhibit the lack of collinearity between these independent variables. Additionally, since the value of the “collinearity-checking” model’s adjusted R-squared is 0.1592, it can be inferred that the variation that exists in the independent variable corresponding to total building population can explain approximately 15.92% of the variation existent in the independent variable corresponding to total energy consumption. As a result of this low adjusted R-squared value here, one can likely assert that the independent variables being considered in this analysis (‘TOTAL_KWH’ and ‘TOTAL_POPULATION’) do not exhibit much collinearity in our model. Therefore, it is likely evident that our model does not seem too concerning with regard to this issue.
#Generate a correlation matrix for "TOTAL_KWH" and "TOTAL_POPULATION"
cor(energy_final[c("TOTAL_KWH","TOTAL_POPULATION")])
## TOTAL_KWH TOTAL_POPULATION
## TOTAL_KWH 1.0000000 0.3998996
## TOTAL_POPULATION 0.3998996 1.0000000
#Collinearity Check
col.test <- lm(energy_final$TOTAL_KWH~energy_final$TOTAL_POPULATION)
summary(col.test)
##
## Call:
## lm(formula = energy_final$TOTAL_KWH ~ energy_final$TOTAL_POPULATION)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6244293 -318212 28178 231683 29928036
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -435050.1 68996.6 -6.305 4.05e-10 ***
## energy_final$TOTAL_POPULATION 8014.5 533.4 15.026 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1802000 on 1186 degrees of freedom
## Multiple R-squared: 0.1599, Adjusted R-squared: 0.1592
## F-statistic: 225.8 on 1 and 1186 DF, p-value: < 2.2e-16
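One way to put the collinearity check on a familiar scale is the variance inflation factor, VIF = 1 / (1 - R^2), where R^2 comes from regressing one predictor on the other. The sketch below uses only the figures printed above (the variable names are illustrative) and also confirms that, with two predictors, this R^2 is simply the squared pairwise correlation.

```r
# Figures copied from the correlation matrix and from summary(col.test)
pairwise_r <- 0.3998996
r_squared <- 0.1599

# Variance inflation factor shared by both predictors in a two-predictor model
vif <- 1 / (1 - r_squared)

# With a single regressor, R-squared equals the squared correlation
c(vif = vif, r_squared_from_cor = pairwise_r^2)
```

The resulting value of about 1.19 is far below the thresholds of 5 or 10 that are often quoted as rules of thumb, which agrees with the reading that collinearity is not a serious concern here.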
In this study, the existence of measurement error is not an immediate concern. Since this energy consumption data for Chicago is regularly and routinely collected by the government and verified by those who are experts in the field, it is not very likely (though technically possible) that measurement error would play a role in affecting accurate data collection.
In carrying out this analysis, it is important to check to see if any interaction effects are present among the independent variables being tested against the dependent variable.
#Generate a new Hierarchical Multiple Linear Logistic Regression Model that uses 1,188 observations to test for interaction effects
energy_interaction_model <- glm(energy_final$BUILDING_TYPE~energy_final$TOTAL_KWH*energy_final$TOTAL_POPULATION, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#Display summary of the Hierarchical Multiple Linear Logistic Regression Model that includes the interaction term
summary(energy_interaction_model)
##
## Call:
## glm(formula = energy_final$BUILDING_TYPE ~ energy_final$TOTAL_KWH *
## energy_final$TOTAL_POPULATION, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8275 -0.8692 0.7202 0.7783 1.6235
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 1.473e+00 1.046e-01
## energy_final$TOTAL_KWH -2.252e-06 5.164e-07
## energy_final$TOTAL_POPULATION -2.904e-03 7.919e-04
## energy_final$TOTAL_KWH:energy_final$TOTAL_POPULATION 1.627e-09 5.132e-10
## z value Pr(>|z|)
## (Intercept) 14.074 < 2e-16 ***
## energy_final$TOTAL_KWH -4.362 1.29e-05 ***
## energy_final$TOTAL_POPULATION -3.667 0.000246 ***
## energy_final$TOTAL_KWH:energy_final$TOTAL_POPULATION 3.170 0.001525 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1392.2 on 1187 degrees of freedom
## Residual deviance: 1318.8 on 1184 degrees of freedom
## AIC: 1326.8
##
## Number of Fisher Scoring iterations: 8
For the hierarchical multiple linear logistic regression analysis in which the paired interaction of ‘TOTAL_KWH’ and ‘TOTAL_POPULATION’ is analyzed against the dependent variable ‘BUILDING_TYPE’, a p-value equal to 0.001525 is returned for the interaction of ‘TOTAL_KWH’ and ‘TOTAL_POPULATION’ in the model, a p-value equal to 1.29e-05 is returned for the variable ‘TOTAL_KWH’, and a p-value equal to 0.000246 is returned for the variable ‘TOTAL_POPULATION’. In other words, if the true coefficient of each term were zero, there would be roughly a probability of 0.001525, 1.29e-05, and 0.000246 (respectively) of observing an association with the dependent variable at least this strong as the result of solely randomization. Therefore, based on this hierarchical multiple linear logistic regression model’s yielded results (and its respective p-values outputted in the model summary above), we would reject the null hypothesis for the independent variables ‘TOTAL_KWH’ and ‘TOTAL_POPULATION’ and their corresponding interaction, leading us to believe that the variation that is observed in the determination of building type can be explained by the variation existent in total energy consumption, the variation existent in building population, and the variation existent in the interaction of these two independent variables and, as such, is not likely solely caused by randomization.
Despite the fact that our original model does not take into account the interaction effects that are likely to exist within this dataset (as per this interaction analysis), the exhibited interaction effects resulting from this test model did not keep us from being able to discern in our original model that the independent variables are likely to be statistically significant with relation to the dependent variable in our predictive model. [See above results for corresponding p-values.]
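A complementary check on the interaction term is a likelihood-ratio comparison of the two nested models, which uses the drop in residual deviance rather than the Wald z-value. The sketch below works from the rounded deviances printed in the two summaries, so the statistic is only approximate; the variable names are illustrative.

```r
# Residual deviances and degrees of freedom copied (rounded) from the two summaries
deviance_additive <- 1319.7     # energy_model_final
deviance_interaction <- 1318.8  # energy_interaction_model
df_additive <- 1185
df_interaction <- 1184

# Drop in deviance attributable to the interaction term, with its degrees of freedom
lr_statistic <- deviance_additive - deviance_interaction
df_difference <- df_additive - df_interaction

# Upper-tail chi-squared probability for the nested-model comparison
p_value_interaction <- pchisq(lr_statistic, df = df_difference, lower.tail = FALSE)
p_value_interaction
```

On these rounded figures the p-value is roughly 0.34, and the AIC of the interaction model (1326.8) is slightly higher than that of the additive model (1325.7). The Wald test and the deviance comparison need not agree, particularly when fitted probabilities of 0 or 1 are reported, so the interaction is best treated with some caution. With the fitted objects in hand, `anova(energy_model_final, energy_interaction_model, test = "Chisq")` performs the same comparison without rounding.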
After carrying out the “LINE” analysis, it appears that our model does not exhibit normally-distributed errors (‘N’) or maintain a linear relationship between the independent variables and the dependent variable (‘L’). Despite this, it does appear that the errors of the model are independent (‘I’) and that the errors at each set of values of the predictor have equal variances (‘E’). However, because our model is a logistic regression model at its core, meeting these assumptions is not entirely necessary.
The assumptions that do directly apply to logistic regression modeling include the following: (1) The true conditional probabilities are a logistic function of the independent variables, with the independent variables being linearly related to the log odds. (2) No important variables are omitted and no extraneous variables are included. (3) There is little or no multicollinearity, where the independent variables are not linear combinations of each other and are measured without error. (4) The observations are independent. (5) Large sample sizes are included.
Through our “LINE” analysis, it is determined that our observations exhibit both independence and a lack of multicollinearity. Additionally, since our analysis uses G*Power in the determination of an effective sample size for logistic regression, the only assumption for logistic regression that has not yet been addressed is the second one, which makes the claim that “no important variables are omitted and no extraneous variables are included.” While this is generally a tough assumption to address in any form of regression modeling, we can attempt to address one particular concern that arose during this analysis.
In our original model, a dichotomous dependent variable is created which jointly considers commercial and industrial building types (designated with a single-level characteristic of being non-residential) against residential building types. While the research carried out in constructing this hierarchical multiple linear logistic regression model led us to making this modeling decision, it might potentially be an inherently flawed modeling decision. So, to address this concern, a new model is created which solely considers commercial building types against industrial building types, excluding all residential building types found in the dataset.
#Create a subset of "energy_raw" that contains only numeric data for this assumption
logistic_test_data0 <- subset(energy_raw, select = c(BUILDING_TYPE, TOTAL_KWH, TOTAL_POPULATION))
logistic_test_data1 <- na.omit(logistic_test_data0)
#Transform 'BUILDING_TYPE' into a categorical variable (where 0 represents industrial buildings and 1 represents commercial buildings), after dropping residential buildings
logistic_test_data1$BUILDING_TYPE = as.character(logistic_test_data1$BUILDING_TYPE)
logistic_test_data1$BUILDING_TYPE[logistic_test_data1$BUILDING_TYPE == "Residential"] = NA
logistic_test_data2 <- na.omit(logistic_test_data1)
logistic_test_data2$BUILDING_TYPE[logistic_test_data2$BUILDING_TYPE != "Commercial"] = 0
logistic_test_data2$BUILDING_TYPE[logistic_test_data2$BUILDING_TYPE == "Commercial"] = 1
#Categorize 'BUILDING_TYPE' as a factor and display its resulting levels
logistic_test_data2$BUILDING_TYPE = as.factor(logistic_test_data2$BUILDING_TYPE)
levels(logistic_test_data2$BUILDING_TYPE)
## [1] "0" "1"
#Generate an initial Hierarchical Multiple Linear Logistic Regression Model that uses all 16,649 observations contained within this new dataset
logistic_test_model <- glm(logistic_test_data2$BUILDING_TYPE~logistic_test_data2$TOTAL_KWH+logistic_test_data2$TOTAL_POPULATION, family = "binomial")
#Display summary of this new initial Hierarchical Multiple Linear Logistic Regression Model
summary(logistic_test_model)
##
## Call:
## glm(formula = logistic_test_data2$BUILDING_TYPE ~ logistic_test_data2$TOTAL_KWH +
## logistic_test_data2$TOTAL_POPULATION, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.7038 0.0517 0.0556 0.0588 1.0447
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 6.289e+00 2.719e-01 23.135
## logistic_test_data2$TOTAL_KWH -2.685e-08 6.026e-09 -4.456
## logistic_test_data2$TOTAL_POPULATION 2.672e-03 2.509e-03 1.065
## Pr(>|z|)
## (Intercept) < 2e-16 ***
## logistic_test_data2$TOTAL_KWH 8.37e-06 ***
## logistic_test_data2$TOTAL_POPULATION 0.287
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 387.98 on 16648 degrees of freedom
## Residual deviance: 376.98 on 16646 degrees of freedom
## AIC: 382.98
##
## Number of Fisher Scoring iterations: 10
#Randomly take a sample of 1,188 observations from "logistic_test_data2", creating "logistic_final".
S <- 1188
set.seed(28)
energy.index.log <- sample(1:nrow(logistic_test_data2),S,replace=FALSE)
logistic_final <- logistic_test_data2[energy.index.log,]
#Generate a new Hierarchical Multiple Linear Logistic Regression Model that uses 1,188 observations
logistic_model_final <- glm(logistic_final$BUILDING_TYPE~logistic_final$TOTAL_KWH+logistic_final$TOTAL_POPULATION, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#Display summary of the final Hierarchical Multiple Linear Logistic Regression Model
summary(logistic_model_final)
##
## Call:
## glm(formula = logistic_final$BUILDING_TYPE ~ logistic_final$TOTAL_KWH +
## logistic_final$TOTAL_POPULATION, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3850 0.0296 0.0641 0.0911 0.1381
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.626e+00 1.070e+00 4.322 1.55e-05 ***
## logistic_final$TOTAL_KWH 1.243e-05 1.543e-05 0.805 0.421
## logistic_final$TOTAL_POPULATION 8.827e-03 1.199e-02 0.736 0.462
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 41.881 on 1187 degrees of freedom
## Residual deviance: 39.296 on 1185 degrees of freedom
## AIC: 45.296
##
## Number of Fisher Scoring iterations: 14
#Calculate the p-value associated with goodness-of-fit of this new model
null_deviance_log = 41.881
residual_deviance_log = 39.296
null_degrees_of_freedom_log = 1187
residual_degrees_of_freedom_log = 1185
p_value_log = 1 - pchisq((null_deviance_log - residual_deviance_log), (null_degrees_of_freedom_log - residual_degrees_of_freedom_log))
p_value_log
## [1] 0.2745835
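The deviances above were copied by hand from the printed summary; the same quantities are stored on the fitted object, so the test can be wrapped in a small helper that avoids transcription and rounding. The function name `deviance_test` is hypothetical, and the demonstration uses R's built-in `mtcars` data rather than the energy data.

```r
# Hypothetical helper: deviance (likelihood-ratio) test of a fitted glm
# against its intercept-only model, read directly from the fitted object
deviance_test <- function(fit) {
  statistic <- fit$null.deviance - fit$deviance
  df <- fit$df.null - fit$df.residual
  c(statistic = statistic, df = df,
    p_value = pchisq(statistic, df = df, lower.tail = FALSE))
}

# Demonstration on built-in data (transmission type explained by car weight)
demo_fit <- glm(am ~ wt, data = mtcars, family = binomial)
demo_result <- deviance_test(demo_fit)
demo_result
```

Applied to the models in this analysis, a call such as `deviance_test(logistic_model_final)` would be expected to reproduce the p-value computed above, up to the rounding of the hand-copied deviances.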
#Collinearity Check
col.test.log <- lm(logistic_final$TOTAL_KWH~logistic_final$TOTAL_POPULATION)
summary(col.test.log)
##
## Call:
## lm(formula = logistic_final$TOTAL_KWH ~ logistic_final$TOTAL_POPULATION)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1613634 -345555 -264468 -149554 60957766
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 213955.7 83746.3 2.555 0.01075 *
## logistic_final$TOTAL_POPULATION 1492.5 480.2 3.108 0.00193 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2284000 on 1186 degrees of freedom
## Multiple R-squared: 0.008079, Adjusted R-squared: 0.007243
## F-statistic: 9.66 on 1 and 1186 DF, p-value: 0.001928
Without thoroughly addressing whether or not this new model (which solely considers commercial building types and industrial building types) meets all of the assumptions surrounding linear regression modeling and logistic regression modeling as we did for our previously-developed model (given that this model is only a slight variation of the previously-developed model of residential vs. non-residential building types), it’s apparent that while ‘TOTAL_KWH’ and ‘TOTAL_POPULATION’ lack multicollinearity within this model (adjusted R-squared of 0.007243), the variation existent within the independent variables ‘TOTAL_KWH’ and ‘TOTAL_POPULATION’ [with consideration to commercial and industrial building types alone] cannot significantly account for the variation existent within the dependent variable ‘BUILDING_TYPE’ in the sampled data (p-values of 0.421 and 0.462, with an overall chi-squared p-value of 0.2745835), although ‘TOTAL_KWH’ was significant in the model fitted to all 16,649 observations. Therefore, based on this hierarchical multiple linear logistic regression model’s yielded results (and its respective p-values outputted in the model summary above), we would fail to reject the null hypothesis for the independent variables ‘TOTAL_KWH’ and ‘TOTAL_POPULATION’ for a consideration of building types that does not include residential building types. We therefore have no evidence, in this sample, that the variation observed in the determination of building type (for solely a commercial and industrial building type consideration) can be explained by the variation existent in total energy consumption and in building population, so the observed association may be the result of solely randomization.
For this reason, we can discern that our original modeling approach (residential versus non-residential building types) was a reasonable one with respect to the assumptions of logistic regression modeling, leaving us satisfied with the results of the original modeling approach that is carried out.