Can commercial space area(m2) be used to predict the consumption of energy(MJ/hr)?

Linear Regression Analysis

Chaitali Chaudhari S3687120

Last updated: 21 October, 2017

Introduction

The Commercial Building Disclosure (CBD) Program is an initiative of the Council of Australian Governments under the Building Energy Efficiency Disclosure Act 2010. It is managed by the Australian Department of the Environment and Energy and requires commercial office space of 1000 m2 and above to have a Building Energy Efficiency Certificate (BEEC).

The Act is mandatory for buildings where more than 75% of the space is used as commercial office space, because commercial office space accounts for 10% of greenhouse gas emissions and this share has been rising over the years.

The BEEC includes a National Australian Built Environment Rating System (NABERS) rating on a scale of 0 to 6 stars. Ratings are assigned by comparing building performance data against set benchmarks. More details can be found at this [link].

Problem Statement

The analysis is based on the following questions:

Can commercial space area (m2) be used to predict the consumption of energy (MJ/hr)?

Will a large office be more efficient in its use of energy than a small one?

If the relationship is found to be statistically significant, it will help owners and tenants to estimate, from the area of the office, what rating to expect and what improvements may be needed, even before the assessment.

The two variables used are Floor Area (m2) and Energy Consumption (MJ/hr).

As both variables are quantitative, a linear regression model can be used for the prediction, and a correlation coefficient can measure the strength and direction of the linear relationship between the two variables.

Data

The data are collected weekly, compiled from the Building Energy Efficiency Register, and have been provided regularly as downloadable .csv and .zip files since 2011.

For the analysis purpose, the data set “[2017 CBD Downloadable Data Set]” was downloaded as .csv on 16/10/2017.

library(readr)  # provides read_csv()
library(dplyr)  # provides %>%, filter() and summarise()
CBD <- read_csv("E:/RMIT/Introduction to Statistics/Data Repository/Assignment 4/2017extract.csv")

View(CBD)

The following columns from the .csv file are used in the analysis:

- B_State : State

- CRT_Nabers_RatedHours : Number of hours the building operates per week

- CRT_Nabers_AnnualConsumption : Annual consumption (in MJ)

- CRT_Nabers_RatedArea : Floor Area (in m2)

The file contains data for the whole of Australia; only records for the state of Victoria from the 2017 extract are considered in the analysis.

CBD_VIC <- CBD %>% filter(B_State == "VIC")

The filtered dataset for Victoria contains 3165 observations.

The number of hours each building operates per week is not fixed. For a fair comparison of energy usage, the weekly operating hours are scaled to annual hours (multiplied by 52), and the annual consumption is divided by the annual hours to give consumption per hour (MJ/hr).

CBD_VIC$Consumption_PerHour <- CBD_VIC$CRT_Nabers_AnnualConsumption/(CBD_VIC$CRT_Nabers_RatedHours*52)
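As a quick check of the conversion above, the sketch below applies the same formula to a single made-up building. The figures (50 rated hours per week and 5,200,000 MJ per year) are hypothetical and are not taken from the CBD data set.

```r
# Hypothetical building, not a record from the CBD data set
annual_consumption_mj <- 5200000  # annual consumption in MJ (assumed value)
rated_hours_per_week  <- 50       # weekly operating hours (assumed value)

# Extend weekly hours to a year (52 weeks), then divide annual consumption by them
annual_hours <- rated_hours_per_week * 52
consumption_per_hour <- annual_consumption_mj / annual_hours
consumption_per_hour  # 2000 MJ/hr
```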

Descriptive Statistics and Visualisation

plot(Consumption_PerHour ~ CRT_Nabers_RatedArea, data = CBD_VIC, main="Scatter Plot")

plot(sqrt(Consumption_PerHour) ~ sqrt(CRT_Nabers_RatedArea), data = CBD_VIC, main="Scatter Plot (Square-root Transformed)")

The scatter plot shows a positive relationship between Consumption per hour and Rated Area, so a linear regression model is a reasonable way to test the relationship between the two variables.

After the square-root transformation, the plot shows a positive linear relationship.

Descriptive Statistics Cont.

The histograms of both variables are right-skewed. The square-root transformation does not remove the skewness completely, but it reduces it, and the transformed data appear closer to a normal distribution. The regression is therefore run on the transformed data to check whether valid inferences can be drawn from them.
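The effect of a square-root transformation on right-skewed data can be illustrated numerically. The sketch below uses simulated log-normal values, not the CBD data, and a hand-written `skewness` helper (a hypothetical name introduced here) to compare skewness before and after the transformation.

```r
# Simulated right-skewed values, used only for illustration (not the CBD data)
set.seed(1)
x <- rlnorm(1000, meanlog = 8, sdlog = 0.7)

# Hypothetical helper: approximate sample skewness (third standardised moment)
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw  <- skewness(x)        # positive for right-skewed data
skew_sqrt <- skewness(sqrt(x))  # smaller after the square-root transformation
c(raw = skew_raw, sqrt = skew_sqrt)
```

A skewness closer to zero after the transformation indicates a more symmetric distribution, which is what the histograms are used to judge visually.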

CBD_VIC$Consumption_PerHour %>% hist(xlab="Energy Consumption (in MJ/hr)", main="Consumption of Energy (MJ/hr)")

sqrt(CBD_VIC$Consumption_PerHour) %>% hist(xlab="Square root of Energy Consumption (MJ/hr)", main="Consumption of Energy (MJ/hr) (Square-root Transformed)")

CBD_VIC %>% summarise(Min = min(Consumption_PerHour, na.rm = TRUE),
                      Q1 = quantile(Consumption_PerHour, probs = .25, na.rm = TRUE),
                      Median = median(Consumption_PerHour, na.rm = TRUE),
                      Q3 = quantile(Consumption_PerHour, probs = .75, na.rm = TRUE),
                      Max = max(Consumption_PerHour, na.rm = TRUE),
                      Mean = mean(Consumption_PerHour, na.rm = TRUE),
                      SD = sd(Consumption_PerHour, na.rm = TRUE),
                      n = n(),
                      Missing = sum(is.na(Consumption_PerHour))) -> table1
knitr::kable(table1)

| Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|
| 209.1622 | 1947.822 | 2975.273 | 5385.326 | 16596.11 | 3874.345 | 2807.782 | 3165 | 0 |

CBD_VIC$CRT_Nabers_RatedArea %>% hist(xlab="Area (in m2)",main="Area (m2)")

sqrt(CBD_VIC$CRT_Nabers_RatedArea) %>% hist(xlab="Square root of Area (m2)",main="Area (m2) (Square-root Transformed)")

CBD_VIC %>% summarise(Min = min(CRT_Nabers_RatedArea, na.rm = TRUE),
                      Q1 = quantile(CRT_Nabers_RatedArea, probs = .25, na.rm = TRUE),
                      Median = median(CRT_Nabers_RatedArea, na.rm = TRUE),
                      Q3 = quantile(CRT_Nabers_RatedArea, probs = .75, na.rm = TRUE),
                      Max = max(CRT_Nabers_RatedArea, na.rm = TRUE),
                      Mean = mean(CRT_Nabers_RatedArea, na.rm = TRUE),
                      SD = sd(CRT_Nabers_RatedArea, na.rm = TRUE),
                      n = n(),
                      Missing = sum(is.na(CRT_Nabers_RatedArea))) -> table1
knitr::kable(table1)

| Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|
| 469.7 | 8137.6 | 19015.8 | 33836 | 76696.4 | 24361.53 | 19324.8 | 3165 | 0 |

CBD_VIC_Filtered <- CBD_VIC %>% filter(Consumption_PerHour < 10000 , CRT_Nabers_RatedArea < 60000)

The data are filtered to remove outliers, that is, observations with an Area of 60000 m2 or more or a Consumption per hour of 10000 MJ/hr or more. This leaves 2734 observations for the regression.

Hypothesis Testing

Hypothesis for the Overall Linear Regression Model

\(H_0\) : The data does not fit the linear regression model.

\(H_A\) : The data fits the linear regression model.

The F-test is used to test the overall linear regression model.

Assumptions:

Independence: Observations are assumed to be independent of one another. Some buildings appear more than once, but those records refer to different levels or areas of the building, that is, to separate spaces, so they are treated as independent.

Linearity: The scatter plot shows a positive linear relationship. (This can be confirmed with the Residuals vs Fitted plot.)

Normality of residuals: The normality of the residuals can be checked with a Normal Q-Q plot of the residuals.

Homoscedasticity: The homogeneity of variance can be confirmed with the Residuals vs Fitted or Scale-Location plots.

model1 <- lm(sqrt(Consumption_PerHour) ~ sqrt(CRT_Nabers_RatedArea), data = CBD_VIC_Filtered)
model1 %>% summary()
## 
## Call:
## lm(formula = sqrt(Consumption_PerHour) ~ sqrt(CRT_Nabers_RatedArea), 
##     data = CBD_VIC_Filtered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.942  -7.134   0.148   7.077  42.464 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                18.703446   0.525928   35.56   <2e-16 ***
## sqrt(CRT_Nabers_RatedArea)  0.268986   0.003843   69.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.24 on 2732 degrees of freedom
## Multiple R-squared:  0.642,  Adjusted R-squared:  0.6418 
## F-statistic:  4899 on 1 and 2732 DF,  p-value: < 2.2e-16

The p-value for the F-test is very small, F(1,2732)=4899, p<0.001.

The model is therefore statistically significant, and the model coefficients can be interpreted.
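In a simple linear regression with a single predictor, the overall F-statistic equals the square of the slope's t-statistic, and R-squared can be recovered from F and the residual degrees of freedom. The sketch below checks the reported output against these two identities, using only numbers printed in the summary; small differences are due to rounding in the printed values.

```r
# Values copied from the summary(model1) output
t_slope <- 69.99   # t value for sqrt(CRT_Nabers_RatedArea)
f_stat  <- 4899    # overall F-statistic
df_res  <- 2732    # residual degrees of freedom

# With one predictor, F = t^2
f_from_t <- t_slope^2
f_from_t                    # approx. 4899

# With one predictor, R^2 = F / (F + residual df)
r_squared <- f_stat / (f_stat + df_res)
r_squared                   # approx. 0.642
```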

Hypothesis Testing Cont.

model1 %>% summary() %>% coef()
##                              Estimate  Std. Error  t value      Pr(>|t|)
## (Intercept)                18.7034455 0.525927710 35.56277 5.494441e-228
## sqrt(CRT_Nabers_RatedArea)  0.2689858 0.003843183 69.99037  0.000000e+00
model1 %>% confint()
##                               2.5 %     97.5 %
## (Intercept)                17.67219 19.7347018
## sqrt(CRT_Nabers_RatedArea)  0.26145  0.2765217

The intercept is \(\alpha\) = 18.703 and the slope is \(\beta\) = 0.269.

The hypotheses for the intercept and slope are:

Intercept: \[H_0: \alpha = 0\]

\[H_A: \alpha \neq 0\]

Slope: \[H_0: \beta = 0\]

\[H_A: \beta \neq 0\]

Both null hypotheses are rejected, as the p-values for the intercept and the slope are very small, p<0.001.

In addition, the 95% CIs for the intercept [17.672, 19.735] and the slope [0.261, 0.277] do not capture the \(H_0\) value of 0, which again leads to rejecting \(H_0\).

Thus the intercept and the slope are both statistically significantly different from zero.
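The confidence interval for the slope can be reproduced by hand as the estimate plus or minus a t critical value times the standard error. The sketch below does this with the figures from the coefficient table; the variable names are introduced here for illustration only.

```r
# Slope estimate and standard error copied from the coefficient table
estimate  <- 0.2689858
std_error <- 0.003843183
df_res    <- 2732          # residual degrees of freedom

# Two-sided 95% interval: estimate +/- t critical value * standard error
t_crit   <- qt(0.975, df = df_res)
ci_slope <- estimate + c(-1, 1) * t_crit * std_error
ci_slope  # approx. 0.2615 and 0.2765, in line with confint(model1)
```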

Hypothesis Testing Cont.

#Residuals vs Fitted
model1 %>% plot(which=1)

The Residuals vs Fitted plot shows that the red line is almost flat, supporting the linearity assumption, which is also visible in the scatter plot.

The variability of the residuals also appears to be constant across the range of fitted values, which suggests that the data are homoscedastic.

Hypothesis Testing Cont.

#Normal Q-Q Plot
model1 %>% plot(which=2)

The Normal Q-Q plot shows that the residuals fall close to the 45° line, supporting the normality of the residuals.

Hypothesis Testing Cont.

#Scale-Location
model1 %>% plot(which=3)

The Scale-Location plot shows a nearly flat red line, and the variability of the square roots of the standardised residuals is consistent across the fitted values.

Hypothesis Testing Cont.

#Residuals vs Leverage
model1 %>% plot(which=5)

The Residuals vs Leverage plot is used to locate influential cases. In the above plot no values fall beyond or close to the red bands (the Cook's distance contours); in fact, the bands are not even visible in the plot, so no influential cases are identified.

Linear Regression - R2

- R2 = 0.642

- 64.2% of the variability in the square root of Consumption per hour can be explained by a linear relationship with the square root of Rated Area.

Hypothesis Testing Cont.

Correlation Coefficient r

- Measure the strength and direction of Linear Relationship between Consumption per hour and Rated Area.

- Ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation) and 0 means no correlation

\[H_0 : r = 0\]

\[H_A:r \neq 0\]

library(Hmisc)  # provides rcorr()
bivariate <- as.matrix(dplyr::select(CBD_VIC_Filtered, Consumption_PerHour, CRT_Nabers_RatedArea)) # Create a matrix of the variables to be correlated
rcorr(sqrt(bivariate), type = "pearson")
##                      Consumption_PerHour CRT_Nabers_RatedArea
## Consumption_PerHour                  1.0                  0.8
## CRT_Nabers_RatedArea                 0.8                  1.0
## 
## n= 2734 
## 
## 
## P
##                      Consumption_PerHour CRT_Nabers_RatedArea
## Consumption_PerHour                       0                  
## CRT_Nabers_RatedArea  0
r <- cor(sqrt(CBD_VIC_Filtered$Consumption_PerHour), sqrt(CBD_VIC_Filtered$CRT_Nabers_RatedArea))
r
## [1] 0.8012305
library(psychometric)  # provides CIr()
CIr(r = r, n = 2734, level = .95)
## [1] 0.7873933 0.8142607
detach("package:psychometric", unload=TRUE)

The correlation coefficient is r = 0.801. The p-value is reported as 0, which is written as p < 0.001. Hence, \(H_0\) is rejected.

The 95% CI [0.7874, 0.8143] does not capture the \(H_0\) value of 0, so \(H_0\) is rejected. There is a statistically significant positive correlation between the square-root transformed Consumption per hour and Rated Area.
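The reported interval for r is consistent with a confidence interval built on the Fisher z-transformation. The sketch below reproduces it in base R under that assumption; it is an illustration of the calculation, not a copy of the `CIr` source code.

```r
r <- 0.8012305  # correlation reported in the output
n <- 2734       # number of observations after filtering

z      <- atanh(r)          # Fisher z-transformation of r
se_z   <- 1 / sqrt(n - 3)   # approximate standard error of z
z_crit <- qnorm(0.975)      # critical value for a 95% interval

# Build the interval on the z scale, then back-transform to the r scale
ci_r <- tanh(z + c(-1, 1) * z_crit * se_z)
ci_r  # approx. 0.787 and 0.814
```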

Discussion

plot(sqrt(Consumption_PerHour) ~ sqrt(CRT_Nabers_RatedArea), data = CBD_VIC_Filtered, main = "Scatter Plot with the line of best fit")
abline(model1, col="red")

A linear regression model is fitted and the line of best fit is drawn on the scatter plot, supporting the conclusion that floor area can be used to predict energy consumption.
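To illustrate how the fitted model could be used for prediction, the sketch below plugs a floor area into the estimated equation sqrt(Consumption_PerHour) = 18.703 + 0.269 × sqrt(CRT_Nabers_RatedArea) and squares the result to return to MJ/hr. The helper name `predict_consumption` and the example area of 10,000 m2 are illustrative assumptions. Squaring the fitted value is a simple back-transformation that gives only a rough point estimate, and the model was fitted on areas below 60,000 m2, so it should not be applied beyond that range.

```r
# Coefficients copied from the summary(model1) output
intercept <- 18.7034455
slope     <- 0.2689858

# Hypothetical helper: predicted consumption (MJ/hr) for a floor area (m2)
predict_consumption <- function(area_m2) {
  sqrt_pred <- intercept + slope * sqrt(area_m2)  # prediction on the square-root scale
  sqrt_pred^2                                     # back-transform to MJ/hr
}

predict_consumption(10000)  # roughly 2080 MJ/hr for a 10,000 m2 office
```

With the fitted `model1` object available, `predict(model1, newdata = data.frame(CRT_Nabers_RatedArea = 10000))^2` should give the same figure.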

Simple Linear Regression Summary:

Decision: Overall, reject \(H_0\).

Thus we conclude that there is a statistically significant positive linear relationship between Consumption of Energy per hour and Area of the office.

Discussion Cont.

The analysis shows a strong positive relationship between office area and energy consumption.

The square-root transformation brought the right-skewed data closer to a normal distribution, although a more appropriate transformation may be found in future analysis. The Star Rating Algorithm could give a more apt analysis of energy usage, but it has to be requested, which can be done as part of a further detailed analysis.

Statistical analyses based on Star Ratings, by state and by city, can be carried out in future.

The statistical investigation can help offices in Victoria to predict their energy consumption before being assessed under the CBD Program, and can persuade them to reduce their current consumption and so improve their Star Ratings.

The linear regression model is an effective way to assess the linear relationship between two quantitative variables.

References

[link] http://www.cbd.gov.au/

[Full CBD Downloadable Data Set] http://www.cbd.gov.au/registers/cbd-downloadable-data-set