Can we use floor area to determine energy use requirements in office buildings?

Linear Regression Analysis

Penelope Lynch

Sunday 6th June

Introduction

Can the floor area (m2) of a commercial office space be used to predict energy use (MJ/hr)?

The Commercial Building Disclosure (CBD) is an Australian Government program that requires all commercial office spaces greater than 2000 m2 to undertake an energy efficiency assessment before going on the market for sale or lease. The information is valuable for new occupants, and it is hoped that it will encourage owners to improve the efficiency rating of their premises. More information can be found at www.cbd.gov.au.

The data is collected on an as-needs basis as office spaces are prepared to be put on the market. As the assessment is compulsory, it provides a large, random sample of the population of office spaces greater than 2000 m2. There is no perceived bias based on location, size, energy efficiency or occupied hours.

Independent assessors collect a basic set of data and use it to calculate a Star Rating. The Star Ratings, from 0 to 6, indicate a building's energy efficiency. Unfortunately, the algorithm used is not easily available and, as such, the Star Ratings will not be used for this research.

Problem Statement

The question to be explored is:

Can the floor area (m2) of a commercial office space be used to predict energy use (MJ/hr)?

Intuitively this makes sense: large floor areas should require more energy to power per hour than comparatively small office areas. But is the relationship statistically significant, and strong enough to be used to predict energy use?

If a relationship between energy use and floor area is found, it could be useful for providing quick, preliminary energy assessments for owners, independent of the compulsory requirement, to gauge whether their building uses more or less energy than expected for a building of that size.

Data

The program releases data on a regular basis. For the purpose of this research, the most recent data set, “2016 CBD Downloadable Data Set”, is used (www.cbd.gov.au/registers/cbd-downloadable-data-set).

library(readr)
library(dplyr)  # provides %>%, filter() and summarise() used below
CBD <- read_csv("~/Documents/Master (GIS)/Statistics/Assignment 4/2016_CBD_short version.csv")

Variables

* B_Street Address: business street address
* B_State: factor (ACT, VIC, QLD, NSW, TAS, NT, WA, SA)
* CRT_Nabers_RatedHours: number of hours per week the building operates
* CRT_Nabers_RatedArea: the floor area (m2) that was assessed
* CRT_Nabers_AnnualConsumption: annual megajoules, as calculated using the most recent energy bills
* FS_Name: individual levels within a building block

Data Cont.

The geographic location of an office building has the potential to impact energy use because of climate differences. As such, this research focuses on the state of Victoria.

Filter Victorian buildings

Vic<-CBD %>% filter(B_State=="VIC")

The filtered data provide 2459 observations for analysis.

Creating a standardised variable

The data reveals a large discrepancy in the number of operating hours per week (CRT_Nabers_RatedHours). To compare building energy use fairly, a common measure is needed. To create one, the total hours of operation per week were extrapolated across the year (multiplied by 52) to give the total hours each building operates per year. The total annual energy consumption was then divided by this figure to give an energy (MJ) per hour (hr) variable for each building (MjHr).

Vic$MjHr<-Vic$CRT_Nabers_AnnualConsumption/(Vic$CRT_Nabers_RatedHours*52)
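As a quick illustration of the standardisation above, the sketch below applies the same arithmetic to a single hypothetical building. The consumption and hours figures are invented for the example and do not come from the CBD data set.

```r
# Hypothetical building: these two values are invented for illustration
annual_mj  <- 5200000   # annual consumption (MJ)
hours_week <- 50        # rated operating hours per week

# Extrapolate weekly hours across 52 weeks, then divide annual energy by them
hours_year  <- hours_week * 52
mj_per_hour <- annual_mj / hours_year

mj_per_hour
```

A building that runs for fewer hours per week with the same annual consumption would receive a higher MJ per hour value, which is the effect the standardised variable is meant to capture.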

Descriptive Statistics and Visualisation

plot(MjHr~CRT_Nabers_RatedArea, data=Vic, xlab="Floor Area (m2)", ylab="Energy Use (MJ/Hour)", main="Dotplot of Energy Use vs Floor Area", col="orange")

The scatter plot illustrates a positive linear relationship, supporting the research hypothesis that as floor area increases, so does the amount of energy used per operating hour. Because the relationship appears linear and positive, a linear regression model can be used to test the relationship between the two variables.

The scatter plot also shows a few easily identified outliers, including observations with floor areas greater than 65,000 m2.

The histograms below appear similar in pattern and loosely resemble a normal distribution, although both are skewed. An attempt to correct the skew did not have a beneficial result. The decision to press on with the research despite the skewed histograms was based on the large sample size, the decision to remove prominent outliers, and a wish to see whether a conclusion could still be formed.

hist(Vic$CRT_Nabers_RatedArea, col='red', xlab = "m2", main="Histogram of Floor Area (m2)", breaks=20)

hist(Vic$MjHr, col='lightblue', xlab = "MJ", main="Histogram Energy Use (MJ/hr)", breaks=20)

Descriptive Statistics Cont.

# Summary statistics for floor area (m2)
Vic %>% summarise(
  Min = min(CRT_Nabers_RatedArea, na.rm = TRUE),
  Q1 = quantile(CRT_Nabers_RatedArea, probs = .25, na.rm = TRUE),
  Median = median(CRT_Nabers_RatedArea, na.rm = TRUE),
  Q3 = quantile(CRT_Nabers_RatedArea, probs = .75, na.rm = TRUE),
  Max = max(CRT_Nabers_RatedArea, na.rm = TRUE),
  Mean = mean(CRT_Nabers_RatedArea, na.rm = TRUE),
  SD = sd(CRT_Nabers_RatedArea, na.rm = TRUE),
  n = n(),
  Missing = sum(is.na(CRT_Nabers_RatedArea)))

# Summary statistics for energy use (MJ per hour)
Vic %>% summarise(
  Min = min(MjHr, na.rm = TRUE),
  Q1 = quantile(MjHr, probs = .25, na.rm = TRUE),
  Median = median(MjHr, na.rm = TRUE),
  Q3 = quantile(MjHr, probs = .75, na.rm = TRUE),
  Max = max(MjHr, na.rm = TRUE),
  Mean = mean(MjHr, na.rm = TRUE),
  SD = sd(MjHr, na.rm = TRUE),
  n = n(),
  Missing = sum(is.na(MjHr)))

Before continuing, a basic filter was applied to remove the main outliers. It removes observations with floor areas greater than 70,000 m2 and those using more than 10,000 MJ per hour, leaving the 2320 observations used in the model.

Vic_Clean<-Vic %>% filter(CRT_Nabers_RatedArea<70000, MjHr<10000)
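A filter of this kind also drops any row where either variable is missing, because the comparison evaluates to NA. The sketch below shows one way to tally what such a filter removes. It uses an invented five-row data frame and base R subsetting so that it runs on its own; dplyr's filter() likewise drops rows where the condition is NA.

```r
# Invented toy data: not taken from the CBD data set
toy <- data.frame(
  area = c(3000, 12000, 85000, 5000, NA),
  mjhr = c(1500, 2600, 9000, 12000, 1800)
)

# Same thresholds as the outlier filter above
keep <- toy$area < 70000 & toy$mjhr < 10000

# which() keeps only rows where the test is TRUE, so NA rows are dropped
kept <- toy[which(keep), ]

# Tally how each row was treated
tally <- c(
  kept        = nrow(kept),
  failed_test = sum(!keep, na.rm = TRUE),
  missing     = sum(is.na(keep))
)
tally
```

Running a tally like this on the real data would show how much of the drop in observations is due to outliers and how much is due to missing values.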

Hypothesis Testing

Linear regression model

A linear regression model will be used to understand the relationship between two quantitative variables: floor area (m2) and energy use (MJ/hr).

The lm() function in R enables the relationship between the two variables to be analysed. This function calculates a line of best fit using the ordinary least squares (OLS) method. For any candidate line, the method takes the vertical distance from each observation (x, y) to the line and squares it. The ‘best line’ is the one that minimises the sum of these squared distances.
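To make the OLS idea concrete, the sketch below computes the slope and intercept from the textbook closed-form formulas and compares them with lm(). The data are simulated purely for illustration; the variable names and the numbers used to generate them are assumptions, not values from the CBD data set.

```r
# Simulated data, for illustration only
set.seed(1)
x <- runif(50, 2000, 60000)                  # pretend floor areas (m2)
y <- 1200 + 0.12 * x + rnorm(50, sd = 1500)  # pretend energy use (MJ/hr)

# Closed-form OLS estimates for a single predictor
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
a <- mean(y) - b * mean(x)                                      # intercept

# The same estimates from lm()
fit <- lm(y ~ x)

# Sum of squared vertical distances for any candidate line
sse <- function(intercept, slope) sum((y - (intercept + slope * x))^2)

all.equal(unname(coef(fit)), c(a, b))
sse(a, b) < sse(a, b * 1.05)  # nudging the slope makes the fit worse
```

Any other line through the same points gives a larger sum of squared distances, which is what makes the OLS line the ‘best’ one under this criterion.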

FloorArea2MjHr<-lm(MjHr~CRT_Nabers_RatedArea, data=Vic_Clean)
FloorArea2MjHr %>% summary()
## 
## Call:
## lm(formula = MjHr ~ CRT_Nabers_RatedArea, data = Vic_Clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3142.7 -1203.2  -178.8  1190.5  4918.2 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.213e+03  6.277e+01   19.33   <2e-16 ***
## CRT_Nabers_RatedArea 1.175e-01  2.107e-03   55.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1638 on 2318 degrees of freedom
## Multiple R-squared:  0.573,  Adjusted R-squared:  0.5728 
## F-statistic:  3111 on 1 and 2318 DF,  p-value: < 2.2e-16

The summary from the linear regression model reports the Multiple R-squared statistic as 0.573. R-squared ranges between 0 and 1, with 0 indicating that the model explains none of the variability in the dependent variable and 1 indicating that it explains all of it; it is an indicator of goodness of fit. The statistic communicates the proportion of variability in the dependent variable, energy use (MJ/hr), that can be explained by its linear relationship with the predictor variable, floor area (m2). In this instance, floor area explains 57.3% of the variability in a building's energy consumption per operating hour.
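The proportion-of-variability interpretation can be checked by hand. The sketch below, again on simulated data rather than the CBD data, computes R-squared as one minus the ratio of residual to total sum of squares and compares it with the value reported by summary().

```r
# Simulated data, for illustration only
set.seed(42)
x <- runif(100, 2000, 60000)
y <- 1200 + 0.12 * x + rnorm(100, sd = 1500)
fit <- lm(y ~ x)

ss_res <- sum(residuals(fit)^2)   # variability left unexplained by the line
ss_tot <- sum((y - mean(y))^2)    # total variability in y

r2_manual <- 1 - ss_res / ss_tot
all.equal(r2_manual, summary(fit)$r.squared)
```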

Hypotheses for the overall linear regression model

The F statistic is used to test the overall regression model.

The hypotheses to test the statistical significance of the F statistic are:

H0: The data do not fit the linear regression model

Ha: The data fit the linear regression model

The F statistic reported in the linear regression model is F(1,2318) = 3111, p-value<0.001.

The p-value is less than the 0.05 level of significance, and we therefore reject H0. This is interpreted to mean that there is statistically significant evidence that the data fit a linear regression model.
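With a single predictor, the overall F statistic is the square of the slope's t statistic, so the two tests carry the same information. The sketch below checks this using only the figures printed in the model summary; the variable names are illustrative.

```r
# Figures copied from the model summary above
t_slope <- 55.772   # t value for CRT_Nabers_RatedArea (from the rounded table)
f_stat  <- 3111     # reported F statistic
df1     <- 1
df2     <- 2318

# In simple linear regression, F equals t squared (up to rounding)
f_from_t <- t_slope^2
round(f_from_t)

# Upper-tail probability of the F distribution at the reported statistic
p_value <- pf(f_stat, df1 = df1, df2 = df2, lower.tail = FALSE)
p_value < 0.05
```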

Hypothesis Testing Cont.

Addressing Assumptions

There are four assumptions associated with linear regression that must be validated before the results can be considered: independence, linearity, normality of residuals and homoscedasticity.

Independence

The two variables selected are based on the raw observations reported by the assessor. The Rated Area is the floor area that was assessed. The energy (MJ) per hour was calculated using the Annual Consumption (MJ) and Rated Hours data (the average number of hours the building is ‘active’ during a given week). Some building addresses have had multiple floors assessed during this period, each of which is entered as a separate observation and treated as an independent office space.

Linearity

The scatter plot of Energy Use vs Floor Area was used to illustrate and confirm a positive linear relationship.

Normality of residuals

The residuals are the errors of the model: for each observation, the residual is the observed y value minus the y value predicted by the linear regression model. The assumption is that these residuals are approximately normally distributed.

To check the fit of the regression model, the model diagnostics have been plotted in the graphs below.

* The Scale-Location plot is used to gauge homoscedasticity. The results are acceptable but trend towards heteroscedasticity, with more points plotted to the left and a trend in the fitted line.

Hypothesis Testing Cont.

FloorArea2MjHr %>% plot(which=1)  # Residuals vs Fitted

FloorArea2MjHr %>% plot(which=2)  # Normal Q-Q

FloorArea2MjHr %>% plot(which=3)  # Scale-Location

FloorArea2MjHr %>% plot(which=5)  # Residuals vs Leverage
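The Scale-Location plot can be complemented with a rough numeric check. The sketch below is one possible approach, not part of the original analysis: it simulates data whose spread grows with floor area, then measures whether the square root of the absolute standardised residuals (the quantity on the Scale-Location y axis) rises with the fitted values. All names and simulation settings are assumptions made for the example.

```r
# Simulated heteroscedastic data, for illustration only
set.seed(123)
area   <- runif(500, 2000, 60000)
energy <- 1200 + 0.12 * area + rnorm(500, sd = 0.03 * area)  # spread grows with area

fit <- lm(energy ~ area)

# Quantity plotted on the y axis of the Scale-Location plot
scale_loc <- sqrt(abs(rstandard(fit)))

# A rank correlation well above zero suggests spread increasing with fitted values
rho <- cor(fitted(fit), scale_loc, method = "spearman")
rho
```

A value near zero would be consistent with homoscedasticity; a clearly positive value matches the upward trend line seen when the variance increases with building size.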

Hypothesis Testing Cont.

Intercept and Slope

The coefficient table provides details regarding the intercept/constant and slope of the line of best fit calculated through the linear regression model.

FloorArea2MjHr %>% summary() %>% coef() %>% round(3)
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)          1213.368     62.769  19.331        0
## CRT_Nabers_RatedArea    0.118      0.002  55.772        0

The intercept/constant (a) is the average value of y when x = 0. The intercept in these results is a = 1213.368; that is, when the floor area (m2) of a building is 0, the average energy used per hour is 1213.368 MJ. Obviously, a building with a floor area of 0 does not exist and would not require any energy, so this particular result is not required to have any practical meaning.

The hypotheses to test the statistical significance of the intercept/constant are:

H0: a = 0

Ha: a ≠ 0

This test utilises the t statistic. In this instance the t statistic is reported as t = 19.331, p < 0.001. As the p-value is less than the 0.05 significance level, there is statistically significant evidence that the intercept/constant is not 0.

This is supported by the confidence interval for the intercept/constant. R reports the 95% CI as [1090.279, 1336.456]. The null value, a = 0, is not captured within the confidence interval; we therefore reject H0.

FloorArea2MjHr %>% confint() %>% round(3)
##                         2.5 %   97.5 %
## (Intercept)          1090.279 1336.456
## CRT_Nabers_RatedArea    0.113    0.122
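The interval that confint() reports for the intercept can be reproduced from the coefficient table as estimate plus or minus a t critical value times the standard error. The sketch below uses the rounded figures printed above, so small differences in the last decimal place are expected.

```r
# Values copied from the coefficient table above
estimate  <- 1213.368   # intercept estimate
std_error <- 62.769     # its standard error
df_resid  <- 2318       # residual degrees of freedom

# Two-sided 95% critical value from the t distribution
t_crit <- qt(0.975, df = df_resid)

ci_intercept <- estimate + c(-1, 1) * t_crit * std_error
round(ci_intercept, 2)
```

The same calculation with the slope's estimate and standard error gives the second row of the confint() output.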

The hypotheses to test the statistical significance of the slope are:

H0: b = 0

Ha: b ≠ 0

The slope of the linear regression line is reported in R as b = 0.118. This represents the average change in y as x increases by one unit. So, when floor area increases by 1 m2, energy use increases by an average of 0.118 MJ per hour. The accompanying t statistic is reported by R as t = 55.772, p < 0.001. As the p-value is less than 0.05, H0 is rejected. There is statistically significant evidence that energy use has a positive relationship with floor area.

In addition, R reports the 95% CI for the slope, b, as [0.113, 0.122]. The 95% CI does not capture the null value of 0, so we reject H0. This further supports that there is a statistically significant positive relationship between floor area (m2) and energy use (MJ per hour).
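Putting the intercept and slope together gives a simple prediction rule. The sketch below wraps the rounded coefficients in a small helper; the function name predict_mjhr and the example building are hypothetical and only illustrate how a quick benchmark might work.

```r
# Rounded coefficients from the table above; predict_mjhr is a hypothetical helper
predict_mjhr <- function(floor_area_m2, intercept = 1213.368, slope = 0.118) {
  intercept + slope * floor_area_m2
}

# Expected energy use (MJ per hour) for a few floor areas (m2)
predict_mjhr(c(5000, 10000, 20000))

# Hypothetical benchmark check: an invented 10,000 m2 office using 2100 MJ per hour
actual_mjhr <- 2100
gap <- actual_mjhr - predict_mjhr(10000)
gap  # negative means the office uses less than the fitted line suggests
```

The residual standard error of 1638 MJ per hour reported in the model summary is a reminder that individual buildings can sit well above or below the line, so a result like this is a rough guide rather than a precise estimate.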

Results

plot(MjHr~CRT_Nabers_RatedArea, data=Vic_Clean, xlab="Floor Area (m2)", ylab="Energy consumption (MJ per hour)", main="Dotplot of Energy Use vs Floor Area", col="orange")
abline(FloorArea2MjHr, col="red")  # draw the fitted line from the model object rather than rounded coefficients

A linear regression model was fitted to determine whether the floor area of office buildings could be used to predict energy use. The dependent variable was energy use (MJ per hour) and the predictor variable was floor area (m2).

An initial scatter plot of the two variables was viewed to assess the relationship between energy use and floor area. The scatter plot presented a positive linear relationship that could be further investigated. The regression model was overall statistically significant, F(1,2318) = 3111, p<0.001, and can be used to explain 57.3% of the variability in energy use (MJ per hour), R2 = 0.573.

Results Cont.

Correlation

To test the strength of the linear relationship between Floor Area (m2) and Energy Use (MJ/Hr), the Pearson correlation coefficient, r, is calculated.

A Pearson correlation can range from r = -1 (a perfect negative linear relationship) to r = 1 (a perfect positive linear relationship). Zero indicates no linear correlation.

library(Hmisc)  # provides rcorr()
bivariate<-as.matrix(dplyr::select(Vic_Clean, CRT_Nabers_RatedArea, MjHr))
rcorr(bivariate, type = "pearson")
##                      CRT_Nabers_RatedArea MjHr
## CRT_Nabers_RatedArea                 1.00 0.76
## MjHr                                 0.76 1.00
## 
## n= 2320 
## 
## 
## P
##                      CRT_Nabers_RatedArea MjHr
## CRT_Nabers_RatedArea                       0  
## MjHr                  0

R reports the correlation between Floor Area (m2) and Energy Use (MJ/hour) to be r = 0.76, with p < 0.001.

The statistical hypotheses for r are:

H0: r = 0

Ha: r ≠ 0

As p is less than the 0.05 significance level, we reject H0.
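In a regression with one predictor, the test of r = 0 is equivalent to the test of a zero slope. The sketch below illustrates this using reported figures only. It assumes r can be recovered as the positive square root of the reported R-squared of 0.573, which is an approximation because that figure is rounded.

```r
# Reported figures; r is approximated from the rounded R-squared
r_squared <- 0.573
n         <- 2320
r         <- sqrt(r_squared)   # positive root, because the fitted slope is positive

# t statistic for testing that the correlation is zero
t_r <- r * sqrt(n - 2) / sqrt(1 - r^2)
t_r   # compare with the slope's reported t value of 55.772

# Two-sided p-value on n - 2 degrees of freedom
p_r <- 2 * pt(abs(t_r), df = n - 2, lower.tail = FALSE)
p_r < 0.05
```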

The confidence interval results:

library(psychometric)
r <- cor(Vic_Clean$CRT_Nabers_RatedArea, Vic_Clean$MjHr)
CIr(r = r, n = nrow(Vic_Clean), level = 0.95)
## [1] 0.7390351 0.7738224

The 95% CI [0.739, 0.774] does not capture the null value of 0, which supports the decision to reject the null hypothesis. The CI shows there is a statistically significant positive correlation between Floor Area (m2) and Energy Use (MJ/hour).

In summary, Pearson's correlation was calculated to determine the strength of the linear relationship between Floor Area (m2) and Energy Use (MJ/Hr). The results show a positive, statistically significant relationship, r = 0.76, p < 0.001, 95% CI [0.739, 0.774].
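A confidence interval for a correlation is commonly built with Fisher's z transformation, and the sketch below shows that approach in base R. It is offered as an illustration of the standard method, on the assumption that this is what sits behind the interval reported above; r is approximated from the rounded R-squared of 0.573, so the last decimals may differ slightly.

```r
# Approximate r from the reported R-squared; n as reported by rcorr()
r <- sqrt(0.573)
n <- 2320

# Fisher's z transformation makes the sampling distribution roughly normal
z    <- atanh(r)
se_z <- 1 / sqrt(n - 3)

# 95% interval on the z scale, transformed back to the correlation scale
ci_r <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z)
round(ci_r, 3)
```

Because the interval is built on the z scale and then back-transformed, it is not symmetric around r, which is why the lower and upper limits sit at slightly different distances from the point estimate.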

Discussion

Through this investigation, the research hypothesis that the floor area (m2) of office buildings in Victoria can be used to predict energy use (MJ/hr) was supported. There is statistically significant evidence of a positive relationship between the two variables, and the correlation test found the relationship to be strong and positive.

This investigation has been successful in creating a current means of predicting energy use in Victorian office spaces. The results could also provide a benchmark for any office to see how it rates on energy use without being formally assessed. Similar to the residential water consumption campaign Target 155, this data could encourage offices to be ‘better than average’ in their energy use.

With regard to improvements, the sample distribution could have been explored further and, with more time, there would be scope to trial different methods to correct the skewed data. As it was, the results were still strong enough to draw a conclusion. In addition, the Star Ratings algorithm is available on request and, if sourced earlier, would have enabled the Star Rating data to be used within the investigation. Without it, it was hard to determine the dependent variables.

Further research could compare linear regressions for individual Star Ratings; for example, how does the relationship change across 0 Star to 6 Star office spaces? Further research could also compare states, or regions within states; for example, how does Melbourne CBD compare to Sydney CBD?

The linear regression model, as implemented in R, is a methodical process that quickly and effectively determines whether there is a relationship between two quantitative variables, and it has plenty of scope for further application.

References

Australian Government 2017, Commercial Building Disclosure, viewed 4 June 2017, www.cbd.gov.au