Chaitali Chaudhari S3687120
Last updated: 21 October, 2017
The Commercial Building Disclosure (CBD) Program is an initiative of the Council of Australian Governments under the Building Energy Efficiency Disclosure Act 2010, managed by the Australian Department of Energy and Environment. It requires commercial office space of 1000 m2 and above to have a Building Energy Efficiency Certificate (BEEC).
Disclosure is mandatory for buildings in which more than 75% of the space is used as commercial space, because commercial office space accounts for 10% of greenhouse gas emissions and these emissions have been rising over the years.
The BEEC includes a National Australian Built Environment Rating System (NABERS) rating, which is on a scale of 0 to 6 stars. Ratings are assigned by comparing the building's performance data against set benchmarks. More details can be found at this [link].
The analysis is based on the following questions:
Can commercial space area (m2) be used to predict the consumption of energy (MJ/hr)?
Will a large office be more efficient in its use of energy than a small one?
If the relationship is found to be statistically significant, it will help owners and tenants estimate the likely rating from the area of the office, and what improvements to expect, even before the assessment.
The two variables used are Floor Area (m2) and Energy Consumption (MJ/hr).
As both variables are quantitative, a linear regression model can be used for prediction, and a correlation coefficient can measure the strength and direction of the linear relationship between the two variables.
The data have been collected weekly since 2011, compiled from the Building Energy Efficiency Register, and provided regularly for download as .csv and .zip files.
For this analysis, the data set “[2017 CBD Downloadable Data Set]” was downloaded as a .csv file on 16/10/2017.
library(readr)  # read_csv()
library(dplyr)  # %>%, filter(), summarise()
CBD <- read_csv("E:/RMIT/Introduction to Statistics/Data Repository/Assignment 4/2017extract.csv")
View(CBD)
The following columns from the .csv file are considered for the analysis:
- B_State : State
- CRT_Nabers_RatedHours : Number of hours the building operates per week
- CRT_Nabers_AnnualConsumption : Annual consumption (in MJ)
- CRT_Nabers_RatedArea : Floor Area (in m2)
The file contains data for the whole of Australia; only the state of Victoria and the year 2017 are considered for the analysis.
CBD_VIC <- CBD %>% filter(B_State == "VIC")
The filtered dataset for Victoria contains 3165 observations.
The number of hours each building operates per week is not fixed. For a fair comparison of energy usage, the weekly rated hours are scaled to annual hours (multiplied by 52), and the annual consumption is divided by the annual hours to give consumption per hour (MJ/hr).
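The conversion described above can be written as a formula. The symbols \(E_{hr}\), \(E_{annual}\) and \(h_{week}\) are introduced here for illustration only, and the 52-week year is the approximation used in this report:
\[E_{hr} = \frac{E_{annual}}{52 \times h_{week}}\]
For example, a hypothetical office (not taken from the data set) that consumes 5200000 MJ per year and is rated for 50 hours per week would average \(5200000 / (52 \times 50) = 2000\) MJ/hr.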
CBD_VIC$Consumption_PerHour <- CBD_VIC$CRT_Nabers_AnnualConsumption/(CBD_VIC$CRT_Nabers_RatedHours*52)
plot(Consumption_PerHour ~ CRT_Nabers_RatedArea, data = CBD_VIC, main="Scatter Plot")
plot(sqrt(Consumption_PerHour) ~ sqrt(CRT_Nabers_RatedArea), data = CBD_VIC, main="Scatter Plot (Square-root Transformed)")
The scatter plot depicts a positive relationship between the two variables, Consumption per hour and Rated Area, so a linear regression model can be used to test the relationship.
After the square-root transformation the plot shows a positive linear relationship.
The histograms of both variables are right-skewed. A square-root transformation does not remove the skew completely, but it reduces it, and the transformed data appear closer to a normal distribution. The test is performed to check whether conclusions can be drawn from the transformed data.
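The effect of the square-root transformation on skewness can also be expressed as a number rather than judged from the histograms alone. The sketch below is an illustration only: `sample_skewness` is a hypothetical helper using the moment-based skewness coefficient, which the original analysis does not use, and the values are simulated, not the CBD data.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2.
# (Hypothetical helper, not part of the original analysis.)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3 / 2)
}

# Simulated right-skewed values standing in for Consumption_PerHour;
# these are NOT the CBD data.
set.seed(1)
simulated <- rlnorm(1000, meanlog = 8, sdlog = 0.7)

skew_raw  <- sample_skewness(simulated)
skew_sqrt <- sample_skewness(sqrt(simulated))

cat("Skewness (raw):        ", round(skew_raw, 2), "\n")
cat("Skewness (square root):", round(skew_sqrt, 2), "\n")
```

To apply the same check to the report's data, `simulated` would be replaced with `CBD_VIC$Consumption_PerHour` or `CBD_VIC$CRT_Nabers_RatedArea`. A value closer to 0 after the transformation indicates reduced skew.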
CBD_VIC$Consumption_PerHour %>% hist(xlab="Energy Consumption (in MJ/hr)", main="Consumption of Energy (MJ/hr)")
sqrt(CBD_VIC$Consumption_PerHour) %>% hist(xlab="Square root of Energy Consumption (MJ/hr)", main="Consumption of Energy (MJ/hr) (Square-root Transformed)")
CBD_VIC %>% summarise(Min = min(Consumption_PerHour,na.rm = TRUE),
                      Q1 = quantile(Consumption_PerHour,probs = .25,na.rm = TRUE),
                      Median = median(Consumption_PerHour, na.rm = TRUE),
                      Q3 = quantile(Consumption_PerHour,probs = .75,na.rm = TRUE),
                      Max = max(Consumption_PerHour,na.rm = TRUE),
                      Mean = mean(Consumption_PerHour, na.rm = TRUE),
                      SD = sd(Consumption_PerHour, na.rm = TRUE),
                      n = n(),
                      Missing = sum(is.na(Consumption_PerHour))) -> table1
knitr::kable(table1)
| Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|
| 209.1622 | 1947.822 | 2975.273 | 5385.326 | 16596.11 | 3874.345 | 2807.782 | 3165 | 0 |
CBD_VIC$CRT_Nabers_RatedArea %>% hist(xlab="Area (in m2)",main="Area (m2)")
sqrt(CBD_VIC$CRT_Nabers_RatedArea) %>% hist(xlab="Square root of Area (m2)",main="Area (m2) (Square-root Transformed)")
CBD_VIC %>% summarise(Min = min(CRT_Nabers_RatedArea,na.rm = TRUE),
                      Q1 = quantile(CRT_Nabers_RatedArea,probs = .25,na.rm = TRUE),
                      Median = median(CRT_Nabers_RatedArea, na.rm = TRUE),
                      Q3 = quantile(CRT_Nabers_RatedArea,probs = .75,na.rm = TRUE),
                      Max = max(CRT_Nabers_RatedArea,na.rm = TRUE),
                      Mean = mean(CRT_Nabers_RatedArea, na.rm = TRUE),
                      SD = sd(CRT_Nabers_RatedArea, na.rm = TRUE),
                      n = n(),
                      Missing = sum(is.na(CRT_Nabers_RatedArea))) -> table1
knitr::kable(table1)
| Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|
| 469.7 | 8137.6 | 19015.8 | 33836 | 76696.4 | 24361.53 | 19324.8 | 3165 | 0 |
CBD_VIC_Filtered <- CBD_VIC %>% filter(Consumption_PerHour < 10000 , CRT_Nabers_RatedArea < 60000)
The data are filtered to remove outliers, that is, observations with an area of 60000 m2 or more or a consumption per hour of 10000 MJ/hr or more. This leaves 2734 observations for the model.
Hypothesis for the Overall Linear Regression Model
\(H_0\) : The data do not fit the linear regression model.
\(H_A\) : The data fit the linear regression model.
The F-test is used to test the overall linear regression model.
Assumptions:
Independence: Observations should not be repeated measurements of the same unit, and one value should not depend on another. Some buildings have multiple observations, but these refer to different levels or areas, that is, to independent spaces.
Linearity: The scatter plot shows a positive linear relationship (this can be confirmed with the Residuals vs Fitted plot).
Normality of residuals: This can be checked with a Normal Q-Q plot of the residuals.
Homoscedasticity: The homogeneity of variance can be confirmed with the Residuals vs Fitted or Scale-Location plots.
model1 <- lm(sqrt(Consumption_PerHour) ~ sqrt(CRT_Nabers_RatedArea), data = CBD_VIC_Filtered)
model1 %>% summary()
##
## Call:
## lm(formula = sqrt(Consumption_PerHour) ~ sqrt(CRT_Nabers_RatedArea),
## data = CBD_VIC_Filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.942 -7.134 0.148 7.077 42.464
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.703446 0.525928 35.56 <2e-16 ***
## sqrt(CRT_Nabers_RatedArea) 0.268986 0.003843 69.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.24 on 2732 degrees of freedom
## Multiple R-squared: 0.642, Adjusted R-squared: 0.6418
## F-statistic: 4899 on 1 and 2732 DF, p-value: < 2.2e-16
The p-value for the F-test is very small, F(1, 2732) = 4899, p < 0.001.
The model is therefore statistically significant, and the model coefficients can be interpreted.
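In a simple linear regression with a single predictor, the overall F statistic equals the square of the slope's t statistic, and \(R^2\) can be recovered from F and its degrees of freedom. The following check uses only the values printed in the output above; it is a consistency check on the reported numbers, not an additional analysis:
\[F = t_{\beta}^2 = 69.99^2 \approx 4899\]
\[R^2 = \frac{F}{F + df_2} = \frac{4899}{4899 + 2732} \approx 0.642\]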
model1 %>% summary() %>% coef()
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.7034455 0.525927710 35.56277 5.494441e-228
## sqrt(CRT_Nabers_RatedArea) 0.2689858 0.003843183 69.99037 0.000000e+00
model1 %>% confint()
## 2.5 % 97.5 %
## (Intercept) 17.67219 19.7347018
## sqrt(CRT_Nabers_RatedArea) 0.26145 0.2765217
The intercept (\(\alpha\)) = 18.703 and the slope (\(\beta\)) = 0.269.
The null hypotheses for the intercept and slope are:
Intercept: \[H_0: \alpha = 0\]
\[H_A: \alpha \neq 0\]
Slope: \[H_0: \beta = 0\]
\[H_A: \beta \neq 0\]
Both null hypotheses are rejected, as the p-values for the intercept and the slope are very small, p < 0.001.
In addition, the null value of 0 is not captured by the 95% CI for the intercept [17.672, 19.735] or the slope [0.261, 0.277], which again leads to rejecting \(H_0\).
Thus the slope and intercept are both statistically significantly different from zero.
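The confidence intervals reported by `confint()` can be reproduced by hand as estimate ± critical t value × standard error. A minimal sketch, using the slope estimate, standard error and residual degrees of freedom copied from the output above (the variable names are illustrative):

```r
# Values copied from the coefficient table printed in this report.
slope_estimate <- 0.2689858
slope_se       <- 0.003843183
residual_df    <- 2732

# Two-sided 95% critical value from the t distribution.
t_crit <- qt(0.975, df = residual_df)

# Estimate plus/minus the margin of error.
slope_ci <- slope_estimate + c(-1, 1) * t_crit * slope_se
print(round(slope_ci, 3))
```

The result should agree with the slope interval printed by `confint()` above. The same calculation with the intercept's estimate and standard error gives the intercept interval.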
#Residuals vs Fitted
model1 %>% plot(which=1)
In the Residuals vs Fitted plot the red line is almost flat, which supports the linearity assumption already suggested by the scatter plot.
The variability of the residuals also appears roughly constant across the range of fitted values, which supports homoscedasticity.
#Normal Q-Q Plot
model1 %>% plot(which=2)
The Normal Q-Q plot shows that the residuals fall close to the 45° line, which supports the normality of the residuals.
#Scale-Location
model1 %>% plot(which=3)
The Scale-Location plot shows a nearly flat red line, and the variability of the square roots of the standardised residuals is consistent across the fitted values.
#Residuals vs Leverage
model1 %>% plot(which=5)
The Residuals vs Leverage plot is used to locate influential cases. In this plot no values lie beyond or close to the red bands; in fact the red bands are not even visible in the plot.
Linear Regression - R2
- R2 = 0.642
- 64.2% of the variability in square-root Consumption per hour can be explained by a linear relationship with square-root Rated Area.
Correlation Coefficient r
- Measures the strength and direction of the linear relationship between Consumption per hour and Rated Area.
- Ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation); 0 means no correlation.
\[H_0 : r = 0\]
\[H_A:r \neq 0\]
library(Hmisc)  # rcorr()
bivariate <- as.matrix(dplyr::select(CBD_VIC_Filtered, Consumption_PerHour, CRT_Nabers_RatedArea)) # Create a matrix of the variables to be correlated
rcorr(sqrt(bivariate), type = "pearson")
## Consumption_PerHour CRT_Nabers_RatedArea
## Consumption_PerHour 1.0 0.8
## CRT_Nabers_RatedArea 0.8 1.0
##
## n= 2734
##
##
## P
## Consumption_PerHour CRT_Nabers_RatedArea
## Consumption_PerHour 0
## CRT_Nabers_RatedArea 0
r <- cor(sqrt(CBD_VIC_Filtered$Consumption_PerHour), sqrt(CBD_VIC_Filtered$CRT_Nabers_RatedArea))
r
## [1] 0.8012305
library(psychometric)  # CIr()
CIr(r = r, n = 2734, level = .95)
## [1] 0.7873933 0.8142607
detach("package:psychometric", unload=TRUE)
The correlation coefficient is r = 0.801 and the p-value is reported as 0, which is written as p < 0.001. Hence \(H_0\) is rejected.
The 95% CI [0.7874, 0.8143] does not capture the null value of 0, so \(H_0\) is rejected. There is a statistically significant positive correlation between the square-root-transformed Consumption per hour and Rated Area.
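A confidence interval for a correlation coefficient is commonly built with the Fisher z-transformation, and the interval printed above is consistent with that approach. The sketch below reproduces it in base R under that assumption; it is not a copy of any package's internals, and the variable names are illustrative.

```r
r <- 0.8012305   # correlation reported above
n <- 2734        # number of observations after filtering

z      <- atanh(r)          # Fisher z-transformation of r
se_z   <- 1 / sqrt(n - 3)   # approximate standard error of z
z_crit <- qnorm(0.975)      # two-sided 95% normal critical value

# Build the interval on the z scale, then back-transform with tanh().
r_ci <- tanh(z + c(-1, 1) * z_crit * se_z)
print(round(r_ci, 4))
```

Because the interval is built on the z scale and then back-transformed, it is not symmetric around r, and it always stays within -1 and +1.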
plot(sqrt(Consumption_PerHour) ~ sqrt(CRT_Nabers_RatedArea), data = CBD_VIC_Filtered, main = "Scatter Plot with the line of best fit")
abline(model1, col="red")
The line of best fit from the fitted linear regression model is overlaid on the scatter plot, supporting the conclusion that floor area can be used to predict energy consumption.
Simple Linear Regression Summary:
Linearity was assumed, Normality of Residuals OK, Homoscedasticity OK, No Influential Cases.
r = 0.801, R2 = 0.642
Model ANOVA, F(1, 2732) = 4899, p < 0.001
a = 18.703, p < 0.001, 95% CI [17.672, 19.735]
b = 0.269, p < 0.001, 95% CI [0.261, 0.277]
Decision:
- Overall model: Reject \(H_0\)
- Intercept: Reject \(H_0\)
- Slope: Reject \(H_0\)
- Correlation Coefficient: Reject \(H_0\)
sqrt(Consumption_PerHour) = 18.703 + 0.269 × sqrt(CRT_Nabers_RatedArea)
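To use the fitted equation for prediction, take the square root of the floor area, apply the equation, and square the result to return to MJ/hr. The sketch below uses coefficients rounded from the regression output (18.703446 and 0.268986); the function name and the 10000 m2 example area are hypothetical. Squaring a prediction made on the square-root scale is a simple back-transformation and should be read as an approximate typical value rather than an exact mean.

```r
# Hypothetical helper: predicts consumption (MJ/hr) from floor area (m2)
# using the square-root model reported above.
predict_consumption_per_hour <- function(area_m2,
                                         intercept = 18.703,
                                         slope = 0.269) {
  # Prediction on the square-root scale.
  sqrt_prediction <- intercept + slope * sqrt(area_m2)
  # Back-transform to MJ/hr by squaring.
  sqrt_prediction^2
}

example_area <- 10000  # hypothetical office area in m2
predicted <- predict_consumption_per_hour(example_area)
print(round(predicted))
```

For a 10000 m2 office this gives about 2080 MJ/hr. The model was fitted only to areas below 60000 m2 and consumption below 10000 MJ/hr, so predictions outside that range would be extrapolations.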
We therefore conclude that there is a statistically significant positive linear relationship between energy consumption per hour and the area of the office (both square-root transformed).
The analysis shows a strong positive relationship between office area and energy consumption.
The square-root transformation reduced the right skew and brought the data closer to a normal distribution; a more appropriate transformation may be used in future analysis. The Star Rating Algorithm could give a more apt analysis of energy usage, but it has to be requested, which can be done in a further, more detailed analysis.
Statistical analyses by Star Rating, state and city can be carried out in future.
The statistical investigation can help offices in Victoria predict their energy consumption before being assessed under the CBD Program, and encourage them to reduce their current consumption and so improve their Star Ratings.
The linear regression model proved effective for quantifying the relationship between these two quantitative variables.
[link]: http://www.cbd.gov.au/
[Full CBD Downloadable Data Set]: http://www.cbd.gov.au/registers/cbd-downloadable-data-set