This is the project for Data 606, Fall 2018. Its purpose is to conduct a reproducible analysis of the data set chosen in the project proposal submitted earlier. We explore the data set and look for relationships among the variables, in particular between the response variables and the independent variables.
For this project, we work with the Energy Efficiency data set, which is available at this web link: http://archive.ics.uci.edu/ml/datasets/Energy+efficiency#
Data link: http://archive.ics.uci.edu/ml/machine-learning-databases/00242/ENB2012_data.xlsx
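The loading step itself is not shown in this report. A minimal sketch of how the spreadsheet could be read and its columns renamed is given below. It assumes the `readxl` package is installed; the helper names `rename.energy.columns` and `load.energy.data` are hypothetical, and the column names are the ones that appear in the later output. If the spreadsheet contains extra empty columns, they would have to be dropped before renaming.

```r
# Descriptive names for the ten columns X1..X8, Y1, Y2 (in that order),
# matching the names used in the rest of this report.
energy.col.names <- c("relative.compactness", "surface.area", "wall.area",
                      "roof.area", "overall.height", "orientation",
                      "glazing.area", "glazing.area.distribution",
                      "heating.load", "cooling.load")

# Hypothetical helper: rename the columns, failing loudly if the
# column count is not the expected ten.
rename.energy.columns <- function(df) {
  stopifnot(ncol(df) == length(energy.col.names))
  names(df) <- energy.col.names
  df
}

# Hypothetical helper: download the .xlsx file and return a renamed data frame.
# Requires the readxl package; nothing is downloaded until this is called.
load.energy.data <- function(url) {
  tmp <- tempfile(fileext = ".xlsx")
  download.file(url, tmp, mode = "wb")  # binary mode matters on Windows
  rename.energy.columns(as.data.frame(readxl::read_excel(tmp)))
}

# Offline check using the first row printed in this report.
first.row <- data.frame(X1 = 0.98, X2 = 514.5, X3 = 294.0, X4 = 110.25,
                        X5 = 7, X6 = 2, X7 = 0, X8 = 0,
                        Y1 = 15.55, Y2 = 21.33)
renamed <- rename.energy.columns(first.row)
print(renamed)
```

With a network connection, `energy.efficiency.df1 <- load.energy.data(<data link above>)` would then produce the data frame used in the remaining sections.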
## X1 X2 X3 X4 X5 X6 X7 X8 Y1 Y2
## 1 0.98 514.5 294.0 110.25 7 2 0 0 15.55 21.33
## 2 0.98 514.5 294.0 110.25 7 3 0 0 15.55 21.33
## 3 0.98 514.5 294.0 110.25 7 4 0 0 15.55 21.33
## 4 0.98 514.5 294.0 110.25 7 5 0 0 15.55 21.33
## 5 0.90 563.5 318.5 122.50 7 2 0 0 20.84 28.28
## 6 0.90 563.5 318.5 122.50 7 3 0 0 21.46 25.38
## relative.compactness surface.area wall.area roof.area
## Min. :0.6200 Min. :514.5 Min. :245.0 Min. :110.2
## 1st Qu.:0.6825 1st Qu.:606.4 1st Qu.:294.0 1st Qu.:140.9
## Median :0.7500 Median :673.8 Median :318.5 Median :183.8
## Mean :0.7642 Mean :671.7 Mean :318.5 Mean :176.6
## 3rd Qu.:0.8300 3rd Qu.:741.1 3rd Qu.:343.0 3rd Qu.:220.5
## Max. :0.9800 Max. :808.5 Max. :416.5 Max. :220.5
## overall.height orientation glazing.area glazing.area.distribution
## Min. :3.50 Min. :2.00 Min. :0.0000 Min. :0.000
## 1st Qu.:3.50 1st Qu.:2.75 1st Qu.:0.1000 1st Qu.:1.750
## Median :5.25 Median :3.50 Median :0.2500 Median :3.000
## Mean :5.25 Mean :3.50 Mean :0.2344 Mean :2.812
## 3rd Qu.:7.00 3rd Qu.:4.25 3rd Qu.:0.4000 3rd Qu.:4.000
## Max. :7.00 Max. :5.00 Max. :0.4000 Max. :5.000
## heating.load cooling.load
## Min. : 6.01 Min. :10.90
## 1st Qu.:12.99 1st Qu.:15.62
## Median :18.95 Median :22.08
## Mean :22.31 Mean :24.59
## 3rd Qu.:31.67 3rd Qu.:33.13
## Max. :43.10 Max. :48.03
## relative.compactness surface.area wall.area roof.area overall.height
## 1 0.98 514.5 294.0 110.25 7
## 2 0.98 514.5 294.0 110.25 7
## 3 0.98 514.5 294.0 110.25 7
## 4 0.98 514.5 294.0 110.25 7
## 5 0.90 563.5 318.5 122.50 7
## 6 0.90 563.5 318.5 122.50 7
## orientation glazing.area glazing.area.distribution heating.load
## 1 2 0 0 15.55
## 2 3 0 0 15.55
## 3 4 0 0 15.55
## 4 5 0 0 15.55
## 5 2 0 0 20.84
## 6 3 0 0 21.46
## cooling.load
## 1 21.33
## 2 21.33
## 3 21.33
## 4 21.33
## 5 28.28
## 6 25.38
Attribute Details:
Relative Compactness - Ratio
Surface Area - sq. meters
Wall Area - sq. meters
Roof Area - sq. meters
Overall Height - meters
Orientation - 2:North, 3:East, 4:South, 5:West
Glazing area (ratio) - 0.00, 0.10, 0.25, 0.40
Glazing area distribution - 1:Uniform, 2:North, 3:East, 4:South, 5:West
Heating Load - kWh/sq. meters
Cooling Load - kWh/sq. meters
As we see above, the data set contains the independent variables measured for each building, along with the two response variables.
The data is hosted by UCI at http://archive.ics.uci.edu/ml/datasets/Energy+efficiency# and, according to that web page, was created by Angeliki Xifara (angxifara ‘@’ gmail.com, Civil/Structural Engineer) and processed by Athanasios Tsanas (tsanasthanasis ‘@’ gmail.com, Oxford Centre for Industrial and Applied Mathematics, University of Oxford, UK).
Each row or case represents a building for which the attributes are measured. There are a total of 768 buildings or cases.
The variables are listed above in the introduction section. The 2 response variables are Heating Load and Cooling Load. The first 8 variables are independent variables.
As the data is already collected, and no experiment will be performed, this is an observational study. The research question is: do the heating load and cooling load of a building depend on the independent variables?
Population of interest - All buildings in general.
Here, we have a sample of 768 buildings. We want to check whether the results from these observations can be generalized to the whole population of buildings.
Generally, the following 3 conditions are checked:
Random: The sample must be collected randomly. Here we assume that the buildings in the study were selected randomly; under that assumption this condition is met.
Independent: Generally, if the sample size is less than 10% of the total population, the observations can be treated as independent. In this case 768 is a very small number compared to the total number of buildings overall. Hence we can say that these sample observations are independent.
Normal distribution:
hist(energy.efficiency.df1$heating.load)
summary(energy.efficiency.df1)
## relative.compactness surface.area wall.area roof.area
## Min. :0.6200 Min. :514.5 Min. :245.0 Min. :110.2
## 1st Qu.:0.6825 1st Qu.:606.4 1st Qu.:294.0 1st Qu.:140.9
## Median :0.7500 Median :673.8 Median :318.5 Median :183.8
## Mean :0.7642 Mean :671.7 Mean :318.5 Mean :176.6
## 3rd Qu.:0.8300 3rd Qu.:741.1 3rd Qu.:343.0 3rd Qu.:220.5
## Max. :0.9800 Max. :808.5 Max. :416.5 Max. :220.5
## overall.height orientation glazing.area glazing.area.distribution
## Min. :3.50 Min. :2.00 Min. :0.0000 Min. :0.000
## 1st Qu.:3.50 1st Qu.:2.75 1st Qu.:0.1000 1st Qu.:1.750
## Median :5.25 Median :3.50 Median :0.2500 Median :3.000
## Mean :5.25 Mean :3.50 Mean :0.2344 Mean :2.812
## 3rd Qu.:7.00 3rd Qu.:4.25 3rd Qu.:0.4000 3rd Qu.:4.000
## Max. :7.00 Max. :5.00 Max. :0.4000 Max. :5.000
## heating.load cooling.load
## Min. : 6.01 Min. :10.90
## 1st Qu.:12.99 1st Qu.:15.62
## Median :18.95 Median :22.08
## Mean :22.31 Mean :24.59
## 3rd Qu.:31.67 3rd Qu.:33.13
## Max. :43.10 Max. :48.03
str(energy.efficiency.df1)
## 'data.frame': 768 obs. of 10 variables:
## $ relative.compactness : num 0.98 0.98 0.98 0.98 0.9 0.9 0.9 0.9 0.86 0.86 ...
## $ surface.area : num 514 514 514 514 564 ...
## $ wall.area : num 294 294 294 294 318 ...
## $ roof.area : num 110 110 110 110 122 ...
## $ overall.height : num 7 7 7 7 7 7 7 7 7 7 ...
## $ orientation : num 2 3 4 5 2 3 4 5 2 3 ...
## $ glazing.area : num 0 0 0 0 0 0 0 0 0 0 ...
## $ glazing.area.distribution: num 0 0 0 0 0 0 0 0 0 0 ...
## $ heating.load : num 15.6 15.6 15.6 15.6 20.8 ...
## $ cooling.load : num 21.3 21.3 21.3 21.3 28.3 ...
### Drawing the histograms of all the variables including the response variables
par(mfrow=c(2,2))
hist(energy.efficiency.df1$relative.compactness)
hist(energy.efficiency.df1$surface.area)
hist(energy.efficiency.df1$wall.area)
hist(energy.efficiency.df1$roof.area)
hist(energy.efficiency.df1$overall.height)
hist(energy.efficiency.df1$orientation)
hist(energy.efficiency.df1$glazing.area)
hist(energy.efficiency.df1$glazing.area.distribution)
hist(energy.efficiency.df1$heating.load)
hist(energy.efficiency.df1$cooling.load)
### Plotting the boxplots of all the variables including the response variables
par(mfrow=c(2,2))
boxplot(energy.efficiency.df1$relative.compactness, xlab="relative.compactness")
boxplot(energy.efficiency.df1$surface.area, xlab="surface.area")
boxplot(energy.efficiency.df1$wall.area, xlab="wall.area")
boxplot(energy.efficiency.df1$roof.area, xlab="roof.area")
boxplot(energy.efficiency.df1$overall.height, xlab="overall.height")
boxplot(energy.efficiency.df1$orientation, xlab="orientation")
boxplot(energy.efficiency.df1$glazing.area, xlab="glazing.area")
boxplot(energy.efficiency.df1$glazing.area.distribution, xlab="glazing.area.distribution")
boxplot(energy.efficiency.df1$heating.load, xlab="heating.load")
boxplot(energy.efficiency.df1$cooling.load, xlab="cooling.load")
Now let us explore the relationship between a response variable and each of the independent variables individually.
heating.load.m1 <- lm(heating.load ~ relative.compactness, energy.efficiency.df1)
summary(heating.load.m1)
##
## Call:
## lm(formula = heating.load ~ relative.compactness, data = energy.efficiency.df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.569 -6.332 -1.028 3.393 19.259
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -23.053 2.081 -11.08 <2e-16 ***
## relative.compactness 59.359 2.698 22.00 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.904 on 766 degrees of freedom
## Multiple R-squared: 0.3872, Adjusted R-squared: 0.3864
## F-statistic: 484 on 1 and 766 DF, p-value: < 2.2e-16
library(ggplot2)  # needed for the ggplot() calls in this report
ggplot(energy.efficiency.df1, aes(x=relative.compactness, y=heating.load)) +
  geom_point(color = "red") +
  geom_line(aes(x=relative.compactness, y=predict(heating.load.m1, newdata = energy.efficiency.df1)), color = "blue")
As we see above from the summary of the model, the relative.compactness row of the coefficient table is marked with 3 stars, i.e. its p-value is below 0.001. That means that the response variable, heating load, has a statistically significant linear relationship with the relative.compactness of the building. The R-squared value of 0.3872 means that 38.72 per cent of the variability in the heating load is explained by this model, that is, by relative compactness alone.
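To make the R-squared interpretation concrete, the sketch below computes it by hand as 1 - SS_residual / SS_total and compares it with the value reported by `summary()`. For a model with a single predictor it also equals the squared correlation between the two variables (for the model above, an R-squared of 0.3872 corresponds to a correlation of about 0.62 in absolute value). The data here are simulated purely for illustration and are not the energy efficiency data.

```r
# Simulated data (an assumption for illustration only): one predictor and
# a noisy linear response.
set.seed(1)
x <- runif(100, min = 0.62, max = 0.98)
y <- -23 + 59 * x + rnorm(100, sd = 8)

fit <- lm(y ~ x)

# R-squared by hand: share of the total variability not left in the residuals.
ss.res <- sum(residuals(fit)^2)
ss.tot <- sum((y - mean(y))^2)
r2.manual <- 1 - ss.res / ss.tot

# The same quantity as reported by summary(), and as a squared correlation.
r2.summary <- summary(fit)$r.squared
r2.cor <- cor(x, y)^2

print(c(manual = r2.manual, summary = r2.summary, cor.squared = r2.cor))
```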
heating.load.m2 <- lm(heating.load ~ surface.area, energy.efficiency.df1)
summary(heating.load.m2)
##
## Call:
## lm(formula = heating.load ~ surface.area, data = energy.efficiency.df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.609 -5.524 -1.300 3.529 18.176
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 72.945395 2.111064 34.55 <2e-16 ***
## surface.area -0.075387 0.003116 -24.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.602 on 766 degrees of freedom
## Multiple R-squared: 0.4331, Adjusted R-squared: 0.4324
## F-statistic: 585.3 on 1 and 766 DF, p-value: < 2.2e-16
ggplot(energy.efficiency.df1, aes(x=surface.area, y=heating.load)) +
  geom_point(color = "red") +
  geom_line(aes(x=surface.area, y=predict(heating.load.m2, newdata = energy.efficiency.df1)), color = "blue")
heating.load.m3 <- lm(heating.load ~ wall.area, energy.efficiency.df1)
summary(heating.load.m3)
##
## Call:
## lm(formula = heating.load ~ wall.area, data = energy.efficiency.df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.0213 -7.3937 -0.4882 7.5728 18.2107
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.259681 2.391323 -4.709 2.96e-06 ***
## wall.area 0.105391 0.007439 14.168 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.988 on 766 degrees of freedom
## Multiple R-squared: 0.2076, Adjusted R-squared: 0.2066
## F-statistic: 200.7 on 1 and 766 DF, p-value: < 2.2e-16
ggplot(energy.efficiency.df1, aes(x=wall.area, y=heating.load)) +
  geom_point(color = "red") +
  geom_line(aes(x=wall.area, y=predict(heating.load.m3, newdata = energy.efficiency.df1)), color = "blue")
heating.load.m4 <- lm(heating.load ~ roof.area, energy.efficiency.df1)
summary(heating.load.m4)
##
## Call:
## lm(formula = heating.load ~ roof.area, data = energy.efficiency.df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.5327 -2.6392 -0.3191 2.4997 15.0930
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.309657 0.746269 75.45 <2e-16 ***
## roof.area -0.192535 0.004094 -47.03 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.121 on 766 degrees of freedom
## Multiple R-squared: 0.7427, Adjusted R-squared: 0.7424
## F-statistic: 2212 on 1 and 766 DF, p-value: < 2.2e-16
ggplot(energy.efficiency.df1, aes(x=roof.area, y=heating.load)) +
  geom_point(color = "red") +
  geom_line(aes(x=roof.area, y=predict(heating.load.m4, newdata = energy.efficiency.df1)), color = "blue")
heating.load.m5 <- lm(heating.load ~ overall.height, energy.efficiency.df1)
summary(heating.load.m5)
##
## Call:
## lm(formula = heating.load ~ overall.height, data = energy.efficiency.df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.7259 -2.5929 -0.3085 2.0015 11.8241
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.59887 0.52661 -8.733 <2e-16 ***
## overall.height 5.12497 0.09516 53.857 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.615 on 766 degrees of freedom
## Multiple R-squared: 0.7911, Adjusted R-squared: 0.7908
## F-statistic: 2901 on 1 and 766 DF, p-value: < 2.2e-16
ggplot(energy.efficiency.df1, aes(x=overall.height, y=heating.load)) +
  geom_point(color = "red") +
  geom_line(aes(x=overall.height, y=predict(heating.load.m5, newdata = energy.efficiency.df1)), color = "blue")
heating.load.m6 <- lm(heating.load ~ glazing.area, energy.efficiency.df1)
summary(heating.load.m6)
##
## Call:
## lm(formula = heating.load ~ glazing.area, data = energy.efficiency.df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.272 -9.193 -3.054 7.253 17.699
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.5170 0.7103 24.662 < 2e-16 ***
## glazing.area 20.4380 2.6351 7.756 2.8e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.722 on 766 degrees of freedom
## Multiple R-squared: 0.07281, Adjusted R-squared: 0.0716
## F-statistic: 60.16 on 1 and 766 DF, p-value: 2.796e-14
ggplot(energy.efficiency.df1, aes(x=glazing.area, y=heating.load)) +
  geom_point(color = "red") +
  geom_line(aes(x=glazing.area, y=predict(heating.load.m6, newdata = energy.efficiency.df1)), color = "blue")
So, from the above regression models between the heating load and 6 of the independent variables, each of these variables shows a statistically significant association with the heating load (every p-value is below 0.001), although the R-squared values range widely, from 0.07 for glazing.area to 0.79 for overall.height. Hence, it would be better to build a model that uses these variables together. So, we will start with all the independent variables and build a multiple linear regression model.
Building a multiple-linear model with all the variables:
heating.load.ml1 <- lm(formula = heating.load ~ relative.compactness + surface.area + wall.area + roof.area + overall.height +
                         orientation + glazing.area + glazing.area.distribution,
                       data = energy.efficiency.df1)
summary(heating.load.ml1)
##
## Call:
## lm(formula = heating.load ~ relative.compactness + surface.area +
## wall.area + roof.area + overall.height + orientation + glazing.area +
## glazing.area.distribution, data = energy.efficiency.df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.8965 -1.3196 -0.0252 1.3532 7.7052
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.013418 19.033613 4.414 1.16e-05 ***
## relative.compactness -64.773432 10.289448 -6.295 5.19e-10 ***
## surface.area -0.087289 0.017075 -5.112 4.04e-07 ***
## wall.area 0.060813 0.006648 9.148 < 2e-16 ***
## roof.area NA NA NA NA
## overall.height 4.169954 0.337990 12.338 < 2e-16 ***
## orientation -0.023330 0.094705 -0.246 0.80548
## glazing.area 19.932736 0.813986 24.488 < 2e-16 ***
## glazing.area.distribution 0.203777 0.069918 2.915 0.00367 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.934 on 760 degrees of freedom
## Multiple R-squared: 0.9162, Adjusted R-squared: 0.9154
## F-statistic: 1187 on 7 and 760 DF, p-value: < 2.2e-16
As we see above in the summary of this multiple linear model, orientation has a very high p-value (0.805), so there is no evidence of a relationship between orientation and the heating load once the other variables are in the model. So, we can remove this variable. Also, the roof.area coefficient is NA, which means roof.area is an exact linear combination of other independent variables in the model; it adds no information, and hence roof.area can also be removed. The variable glazing.area.distribution has a larger p-value (0.00367) than the remaining variables, but this is still well below the significance level of 0.05. Hence we will continue with glazing.area.distribution.
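The NA coefficient can be traced to the data itself. In the six rows printed earlier in this report, surface.area equals wall.area plus twice roof.area, and the means in the summary output agree with the same identity (318.5 + 2 * 176.6 = 671.7). The sketch below checks this on those six rows only; the small data frame `shown` is typed in from the output above. Whether the identity holds for all 768 rows would have to be confirmed on the full data.

```r
# The three area columns for the six rows printed earlier in this report.
shown <- data.frame(
  surface.area = c(514.5, 514.5, 514.5, 514.5, 563.5, 563.5),
  wall.area    = c(294.0, 294.0, 294.0, 294.0, 318.5, 318.5),
  roof.area    = c(110.25, 110.25, 110.25, 110.25, 122.50, 122.50)
)

# If surface.area = wall.area + 2 * roof.area, every gap is zero and the
# three columns cannot all get their own coefficient in a linear model.
gap <- shown$surface.area - (shown$wall.area + 2 * shown$roof.area)
all.zero <- all(abs(gap) < 1e-9)
print(gap)
print(all.zero)

# On the full data, the same check could be run on energy.efficiency.df1,
# and alias(heating.load.ml1) lists the exact dependency that lm() detected.
```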
So, from the above, we remove the 2 variables roof.area and orientation.
Building a new multiple linear model:
heating.load.ml2 <- lm(formula = heating.load ~ relative.compactness + surface.area + wall.area + overall.height +
                         glazing.area + glazing.area.distribution,
                       data = energy.efficiency.df1)
summary(heating.load.ml2)
##
## Call:
## lm(formula = heating.load ~ relative.compactness + surface.area +
## wall.area + overall.height + glazing.area + glazing.area.distribution,
## data = energy.efficiency.df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9315 -1.3189 -0.0262 1.3587 7.7169
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.931762 19.018978 4.413 1.17e-05 ***
## relative.compactness -64.773432 10.283096 -6.299 5.06e-10 ***
## surface.area -0.087289 0.017065 -5.115 3.97e-07 ***
## wall.area 0.060813 0.006644 9.153 < 2e-16 ***
## overall.height 4.169954 0.337781 12.345 < 2e-16 ***
## glazing.area 19.932736 0.813484 24.503 < 2e-16 ***
## glazing.area.distribution 0.203777 0.069875 2.916 0.00365 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.933 on 761 degrees of freedom
## Multiple R-squared: 0.9162, Adjusted R-squared: 0.9155
## F-statistic: 1387 on 6 and 761 DF, p-value: < 2.2e-16
In this new model summary, none of the variables has a p-value greater than the significance level of 0.05, and the adjusted R-squared (0.9155) is essentially unchanged from the previous model (0.9154). Hence we take this as our final linear regression model for the heating.load. Of course, some non-linear regression models may describe the relationship between the heating.load and the independent variables better. But for this project, non-linear models are out of scope, and can be dealt with in a later project.
So, as per this linear model, the fitted equation is: heating.load = 83.93 - 64.77 * relative.compactness - 0.087 * surface.area + 0.06 * wall.area + 4.17 * overall.height + 19.93 * glazing.area + 0.2 * glazing.area.distribution
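As a worked example of this equation, the sketch below plugs in the first building from the data preview (relative.compactness 0.98, surface.area 514.5, wall.area 294.0, overall.height 7, no glazing) using the rounded coefficients from the equation. Because the coefficients are rounded, the result will differ slightly from what `predict(heating.load.ml2)` would return on the full-precision model.

```r
# Rounded coefficients taken from the fitted equation.
coefs <- c(intercept = 83.93,
           relative.compactness = -64.77,
           surface.area = -0.087,
           wall.area = 0.06,
           overall.height = 4.17,
           glazing.area = 19.93,
           glazing.area.distribution = 0.2)

# First building from the data preview; the intercept term is multiplied by 1.
building1 <- c(intercept = 1,
               relative.compactness = 0.98,
               surface.area = 514.5,
               wall.area = 294.0,
               overall.height = 7,
               glazing.area = 0,
               glazing.area.distribution = 0)

# Prediction = sum of coefficient * value over all terms.
predicted <- sum(coefs * building1[names(coefs)])
observed <- 15.55                 # heating.load of that building in the data
residual <- observed - predicted

print(c(predicted = predicted, observed = observed, residual = residual))
```

The prediction is about 22.5 against an observed 15.55, a residual of roughly -7, which lies inside the residual range reported in the model summary (-9.93 to 7.72).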
The intercept, 83.93, is the predicted value of heating.load when all the independent variables are 0. That is not a valid scenario for a building, so the intercept has no physical meaning here; it only fixes the position of the fitted plane.
Let us see what the coefficients signify. The relative.compactness coefficient is negative (-64.77), which means that, holding the other variables fixed, a higher relative.compactness is associated with a lower heating load. This looks plausible: a more compact building should need less heating, and a less compact one more. Note that the simple regression above gave a positive coefficient (59.36) for the same variable; the sign changes once the other, correlated independent variables are in the model.
For the surface area, the heating load also has a negative relation, but here the coefficient is very small per unit (-0.087 per sq. meter).
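So, as we see above, relative compactness, overall height and glazing area have the largest coefficients in magnitude, which suggests a substantial association with the heating load of a building. Because the variables are measured on different scales, coefficient size alone does not rank their importance.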
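One rough way to compare variables measured on different scales is to multiply each coefficient by the observed range (max minus min) of its variable, which gives the change in predicted heating load when that variable alone moves across its whole range. The sketch below does this with the rounded coefficients from the equation and the minimum and maximum values from the summary output. It is only an illustration: the independent variables are correlated with each other, so they cannot really be varied one at a time.

```r
# Rounded coefficients from the fitted equation (intercept not needed here).
coefs <- c(relative.compactness = -64.77,
           surface.area = -0.087,
           wall.area = 0.06,
           overall.height = 4.17,
           glazing.area = 19.93,
           glazing.area.distribution = 0.2)

# Minimum and maximum of each variable, read off the summary() output,
# in the same order as the coefficients.
mins <- c(0.62, 514.5, 245.0, 3.5, 0.0, 0)
maxs <- c(0.98, 808.5, 416.5, 7.0, 0.4, 5)

# Change in predicted heating load across each variable's full range.
range.effect <- coefs * (maxs - mins)

# Variables ordered by the absolute size of that change.
ranked <- sort(abs(range.effect), decreasing = TRUE)
print(round(range.effect, 2))
print(names(ranked))
```

On this scale surface.area (about -25.6) and relative.compactness (about -23.3) come out largest, followed by overall.height (about 14.6), wall.area (about 10.3) and glazing.area (about 8.0), so a small per-unit coefficient does not by itself mean a small effect.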
Plotting the residuals from this model:
plot(heating.load.ml2$residuals ~ heating.load.ml2$fitted.values)
hist(heating.load.ml2$residuals)
qqnorm(heating.load.ml2$residuals)
qqline(heating.load.ml2$residuals)
Based on the model above, the heating load of a building is strongly associated with these 3 variables: relative.compactness, overall.height and glazing.area. The cooling load was not modeled separately in this analysis.
Non-linear models might describe the relationship better, but a few trends already come out well from the multiple linear model fitted here.
heating.load = 83.93 - 64.77 * relative.compactness - 0.087 * surface.area + 0.06 * wall.area + 4.17 * overall.height + 19.93 * glazing.area + 0.2 * glazing.area.distribution