Deepak Mongia Data 606 Project

Part 1 - Introduction

This is the project for Data 606 for Fall 2018. The purpose of this project is to conduct a reproducible analysis. We will explore the data set chosen in the project proposal submitted earlier and look for relationships among the variables, in particular between the response variables and the independent variables.

Part 2 - Data

Energy efficiency Data Set

For this project, we are going to work with the Energy efficiency data set, which is available at this web link: http://archive.ics.uci.edu/ml/datasets/Energy+efficiency#

Data link: http://archive.ics.uci.edu/ml/machine-learning-databases/00242/ENB2012_data.xlsx

##     X1    X2    X3     X4 X5 X6 X7 X8    Y1    Y2
## 1 0.98 514.5 294.0 110.25  7  2  0  0 15.55 21.33
## 2 0.98 514.5 294.0 110.25  7  3  0  0 15.55 21.33
## 3 0.98 514.5 294.0 110.25  7  4  0  0 15.55 21.33
## 4 0.98 514.5 294.0 110.25  7  5  0  0 15.55 21.33
## 5 0.90 563.5 318.5 122.50  7  2  0  0 20.84 28.28
## 6 0.90 563.5 318.5 122.50  7  3  0  0 21.46 25.38

##  relative.compactness  surface.area     wall.area       roof.area    
##  Min.   :0.6200       Min.   :514.5   Min.   :245.0   Min.   :110.2  
##  1st Qu.:0.6825       1st Qu.:606.4   1st Qu.:294.0   1st Qu.:140.9  
##  Median :0.7500       Median :673.8   Median :318.5   Median :183.8  
##  Mean   :0.7642       Mean   :671.7   Mean   :318.5   Mean   :176.6  
##  3rd Qu.:0.8300       3rd Qu.:741.1   3rd Qu.:343.0   3rd Qu.:220.5  
##  Max.   :0.9800       Max.   :808.5   Max.   :416.5   Max.   :220.5  
##  overall.height  orientation    glazing.area    glazing.area.distribution
##  Min.   :3.50   Min.   :2.00   Min.   :0.0000   Min.   :0.000            
##  1st Qu.:3.50   1st Qu.:2.75   1st Qu.:0.1000   1st Qu.:1.750            
##  Median :5.25   Median :3.50   Median :0.2500   Median :3.000            
##  Mean   :5.25   Mean   :3.50   Mean   :0.2344   Mean   :2.812            
##  3rd Qu.:7.00   3rd Qu.:4.25   3rd Qu.:0.4000   3rd Qu.:4.000            
##  Max.   :7.00   Max.   :5.00   Max.   :0.4000   Max.   :5.000            
##   heating.load    cooling.load  
##  Min.   : 6.01   Min.   :10.90  
##  1st Qu.:12.99   1st Qu.:15.62  
##  Median :18.95   Median :22.08  
##  Mean   :22.31   Mean   :24.59  
##  3rd Qu.:31.67   3rd Qu.:33.13  
##  Max.   :43.10   Max.   :48.03

##   relative.compactness surface.area wall.area roof.area overall.height
## 1                 0.98        514.5     294.0    110.25              7
## 2                 0.98        514.5     294.0    110.25              7
## 3                 0.98        514.5     294.0    110.25              7
## 4                 0.98        514.5     294.0    110.25              7
## 5                 0.90        563.5     318.5    122.50              7
## 6                 0.90        563.5     318.5    122.50              7
##   orientation glazing.area glazing.area.distribution heating.load
## 1           2            0                         0        15.55
## 2           3            0                         0        15.55
## 3           4            0                         0        15.55
## 4           5            0                         0        15.55
## 5           2            0                         0        20.84
## 6           3            0                         0        21.46
##   cooling.load
## 1        21.33
## 2        21.33
## 3        21.33
## 4        21.33
## 5        28.28
## 6        25.38
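The listings above show the raw headers X1 to Y2 first and descriptive column names afterwards, but the renaming step itself is not shown in this report. A minimal sketch of how it could be done follows. The helper name rename.energy.columns is hypothetical, and the commented-out loading step assumes the readxl package and a local copy of the spreadsheet; neither is confirmed by the report.

```r
# Hypothetical helper (not from the original report): map the raw
# X1..Y2 headers of the spreadsheet to the descriptive column names.
rename.energy.columns <- function(df) {
  new.names <- c(
    X1 = "relative.compactness",
    X2 = "surface.area",
    X3 = "wall.area",
    X4 = "roof.area",
    X5 = "overall.height",
    X6 = "orientation",
    X7 = "glazing.area",
    X8 = "glazing.area.distribution",
    Y1 = "heating.load",
    Y2 = "cooling.load"
  )
  # Fail early if an unexpected column is present.
  stopifnot(all(names(df) %in% names(new.names)))
  names(df) <- unname(new.names[names(df)])
  df
}

# First raw row from the listing, used only to exercise the helper.
raw.row <- data.frame(X1 = 0.98, X2 = 514.5, X3 = 294.0, X4 = 110.25,
                      X5 = 7, X6 = 2, X7 = 0, X8 = 0,
                      Y1 = 15.55, Y2 = 21.33)
renamed.row <- rename.energy.columns(raw.row)
print(names(renamed.row))

# With the full file (assumption: readxl installed, file downloaded):
# raw <- as.data.frame(readxl::read_excel("ENB2012_data.xlsx"))
# energy.efficiency.df1 <- rename.energy.columns(raw)
```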

Attribute Details:

Relative Compactness - Ratio

Surface Area - sq. meters

Wall Area - sq. meters

Roof Area - sq. meters

Overall Height - meters

Orientation - 2:North, 3:East, 4:South, 5:West

Glazing area (ratio) - 0.00, 0.10, 0.25, 0.40

Glazing area distribution - 0:No glazing, 1:Uniform, 2:North, 3:East, 4:South, 5:West

Heating Load - kWh/sq. meters

Cooling Load - kWh/sq. meters

As we see above, the data set contains eight independent variables describing each building, along with two response variables.

Data Collection

The data is available on the UCI Machine Learning Repository page (http://archive.ics.uci.edu/ml/datasets/Energy+efficiency#). According to that page, the dataset was created by Angeliki Xifara (angxifara ‘@’ gmail.com, Civil/Structural Engineer) and was processed by Athanasios Tsanas (tsanasthanasis ‘@’ gmail.com, Oxford Centre for Industrial and Applied Mathematics, University of Oxford, UK).

Cases

Each row or case represents a building for which the attributes are measured. There are a total of 768 buildings or cases.

Variables

The variables are listed above under Attribute Details. The 2 response variables are Heating Load and Cooling Load. The first 8 variables are independent variables.

Type of study

As the data is already collected, and no experiment will be performed, this is an observational study. The question we want to answer is: do the heating load and cooling load of a building depend on the independent variables?

Scope of inference

Population of interest - All buildings, wherever they are built.

Here, we have a sample of 768 buildings. We want to check whether the results from these observations can be inferred for the whole population of buildings.

Generally, the following 3 conditions are checked:

Random: The sample collected must be random. In this analysis the buildings are taken to have been selected randomly, so this condition is treated as met.
Independent: Generally, if the sample size is less than 10% of the total population, the observations can be treated as independent. In this case 768 is a very small number compared to the total number of buildings overall. Hence we can say that these sample observations are independent.
Normal distribution:

hist(energy.efficiency.df1$heating.load)

Exploratory data analysis

summary(energy.efficiency.df1)

##  relative.compactness  surface.area     wall.area       roof.area    
##  Min.   :0.6200       Min.   :514.5   Min.   :245.0   Min.   :110.2  
##  1st Qu.:0.6825       1st Qu.:606.4   1st Qu.:294.0   1st Qu.:140.9  
##  Median :0.7500       Median :673.8   Median :318.5   Median :183.8  
##  Mean   :0.7642       Mean   :671.7   Mean   :318.5   Mean   :176.6  
##  3rd Qu.:0.8300       3rd Qu.:741.1   3rd Qu.:343.0   3rd Qu.:220.5  
##  Max.   :0.9800       Max.   :808.5   Max.   :416.5   Max.   :220.5  
##  overall.height  orientation    glazing.area    glazing.area.distribution
##  Min.   :3.50   Min.   :2.00   Min.   :0.0000   Min.   :0.000            
##  1st Qu.:3.50   1st Qu.:2.75   1st Qu.:0.1000   1st Qu.:1.750            
##  Median :5.25   Median :3.50   Median :0.2500   Median :3.000            
##  Mean   :5.25   Mean   :3.50   Mean   :0.2344   Mean   :2.812            
##  3rd Qu.:7.00   3rd Qu.:4.25   3rd Qu.:0.4000   3rd Qu.:4.000            
##  Max.   :7.00   Max.   :5.00   Max.   :0.4000   Max.   :5.000            
##   heating.load    cooling.load  
##  Min.   : 6.01   Min.   :10.90  
##  1st Qu.:12.99   1st Qu.:15.62  
##  Median :18.95   Median :22.08  
##  Mean   :22.31   Mean   :24.59  
##  3rd Qu.:31.67   3rd Qu.:33.13  
##  Max.   :43.10   Max.   :48.03

str(energy.efficiency.df1)

## 'data.frame':    768 obs. of  10 variables:
##  $ relative.compactness     : num  0.98 0.98 0.98 0.98 0.9 0.9 0.9 0.9 0.86 0.86 ...
##  $ surface.area             : num  514 514 514 514 564 ...
##  $ wall.area                : num  294 294 294 294 318 ...
##  $ roof.area                : num  110 110 110 110 122 ...
##  $ overall.height           : num  7 7 7 7 7 7 7 7 7 7 ...
##  $ orientation              : num  2 3 4 5 2 3 4 5 2 3 ...
##  $ glazing.area             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ glazing.area.distribution: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ heating.load             : num  15.6 15.6 15.6 15.6 20.8 ...
##  $ cooling.load             : num  21.3 21.3 21.3 21.3 28.3 ...

### Drawing the histograms of all the variables including the response variables
par(mfrow=c(2,2))
hist(energy.efficiency.df1$relative.compactness)
hist(energy.efficiency.df1$surface.area)
hist(energy.efficiency.df1$wall.area)
hist(energy.efficiency.df1$roof.area)

hist(energy.efficiency.df1$overall.height)
hist(energy.efficiency.df1$orientation)
hist(energy.efficiency.df1$glazing.area)
hist(energy.efficiency.df1$glazing.area.distribution)

hist(energy.efficiency.df1$heating.load)
hist(energy.efficiency.df1$cooling.load)

### Plotting the boxplots of all the variables including the response variables
par(mfrow=c(2,2))
boxplot(energy.efficiency.df1$relative.compactness, xlab="relative.compactness")
boxplot(energy.efficiency.df1$surface.area, xlab="surface.area")
boxplot(energy.efficiency.df1$wall.area, xlab="wall.area")
boxplot(energy.efficiency.df1$roof.area, xlab="roof.area")

boxplot(energy.efficiency.df1$overall.height, xlab="overall.height")
boxplot(energy.efficiency.df1$orientation, xlab="orientation")
boxplot(energy.efficiency.df1$glazing.area, xlab="glazing.area")
boxplot(energy.efficiency.df1$glazing.area.distribution, xlab="glazing.area.distribution")

boxplot(energy.efficiency.df1$heating.load, xlab="heating.load")
boxplot(energy.efficiency.df1$cooling.load, xlab="cooling.load")

Inference

Simple Linear Regression for heating load vs each of the independent variables

Now let us explore the relationship between a response variable and each of the independent variables individually.

Heating Load vs Relative compactness: We check how well relative.compactness predicts the response variable heating.load, starting with a simple linear regression.

heating.load.m1 <- lm(heating.load ~ relative.compactness, energy.efficiency.df1)
summary(heating.load.m1)

## 
## Call:
## lm(formula = heating.load ~ relative.compactness, data = energy.efficiency.df1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.569  -6.332  -1.028   3.393  19.259 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -23.053      2.081  -11.08   <2e-16 ***
## relative.compactness   59.359      2.698   22.00   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.904 on 766 degrees of freedom
## Multiple R-squared:  0.3872, Adjusted R-squared:  0.3864 
## F-statistic:   484 on 1 and 766 DF,  p-value: < 2.2e-16

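library(ggplot2)  # needed for ggplot(), geom_point() and geom_line()

### Scatter plot of the data (red) with the fitted regression line (blue)
ggplot(energy.efficiency.df1, aes(x=relative.compactness, y=heating.load)) +
  geom_point(color = "red") +
  geom_line(aes(x=relative.compactness, y=predict(heating.load.m1, newdata = energy.efficiency.df1)), color = "blue")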

As we see from the summary of the above model, the relative.compactness row of the coefficient table shows 3 stars, which corresponds to a p-value below 0.001. That means the relationship between the response variable heating load and the relative.compactness of the building is statistically significant. The R-squared value of 0.3872 means that 38.72 per cent of the variability in the heating load is explained by the relative compactness.
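The stars and the R-squared value can also be read directly from the summary object. The sketch below is only an illustration: it uses the six rows printed by head() earlier rather than the full 768-row data frame, so its numbers will not match the output above (even the sign of the slope can differ on such a small subset). The names head.rows and toy.model are made up for this example.

```r
# Six rows copied from the head() listing earlier in the report.
# Assumption: a toy subset, not representative of the full data.
head.rows <- data.frame(
  relative.compactness = c(0.98, 0.98, 0.98, 0.98, 0.90, 0.90),
  heating.load         = c(15.55, 15.55, 15.55, 15.55, 20.84, 21.46)
)

toy.model   <- lm(heating.load ~ relative.compactness, data = head.rows)
toy.summary <- summary(toy.model)

# R-squared: share of the variance in the response explained by the model.
r.squared <- toy.summary$r.squared

# p-value of the slope, the number behind the significance stars.
slope.p <- coef(toy.summary)["relative.compactness", "Pr(>|t|)"]

# With a single predictor, R-squared equals the squared correlation.
squared.cor <- cor(head.rows$relative.compactness,
                   head.rows$heating.load)^2

print(c(r.squared = r.squared, squared.cor = squared.cor, slope.p = slope.p))
```

On the full data, the same two lines applied to heating.load.m1 should return the 0.3872 shown in the summary above.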

Heating Load vs surface area:

heating.load.m2 <- lm(heating.load ~ surface.area, energy.efficiency.df1)
summary(heating.load.m2)

## 
## Call:
## lm(formula = heating.load ~ surface.area, data = energy.efficiency.df1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.609  -5.524  -1.300   3.529  18.176 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  72.945395   2.111064   34.55   <2e-16 ***
## surface.area -0.075387   0.003116  -24.19   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.602 on 766 degrees of freedom
## Multiple R-squared:  0.4331, Adjusted R-squared:  0.4324 
## F-statistic: 585.3 on 1 and 766 DF,  p-value: < 2.2e-16

ggplot(energy.efficiency.df1, aes(x=surface.area, y=heating.load)) +
  geom_point(color = "red") +
  geom_line(aes(x=surface.area, y=predict(heating.load.m2, newdata = energy.efficiency.df1)), color = "blue")

Heating Load vs wall area:

heating.load.m3 <- lm(heating.load ~ wall.area, energy.efficiency.df1)
summary(heating.load.m3)

## 
## Call:
## lm(formula = heating.load ~ wall.area, data = energy.efficiency.df1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.0213  -7.3937  -0.4882   7.5728  18.2107 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11.259681   2.391323  -4.709 2.96e-06 ***
## wall.area     0.105391   0.007439  14.168  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.988 on 766 degrees of freedom
## Multiple R-squared:  0.2076, Adjusted R-squared:  0.2066 
## F-statistic: 200.7 on 1 and 766 DF,  p-value: < 2.2e-16

ggplot(energy.efficiency.df1, aes(x=wall.area, y=heating.load)) +
  geom_point(color = "red") +
  geom_line(aes(x=wall.area, y=predict(heating.load.m3, newdata = energy.efficiency.df1)), color = "blue")

Heating Load vs roof area:

heating.load.m4 <- lm(heating.load ~ roof.area, energy.efficiency.df1)
summary(heating.load.m4)

## 
## Call:
## lm(formula = heating.load ~ roof.area, data = energy.efficiency.df1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.5327  -2.6392  -0.3191   2.4997  15.0930 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 56.309657   0.746269   75.45   <2e-16 ***
## roof.area   -0.192535   0.004094  -47.03   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.121 on 766 degrees of freedom
## Multiple R-squared:  0.7427, Adjusted R-squared:  0.7424 
## F-statistic:  2212 on 1 and 766 DF,  p-value: < 2.2e-16

ggplot(energy.efficiency.df1, aes(x=roof.area, y=heating.load)) +
  geom_point(color = "red") +
  geom_line(aes(x=roof.area, y=predict(heating.load.m4, newdata = energy.efficiency.df1)), color = "blue")

Heating Load vs overall height:

heating.load.m5 <- lm(heating.load ~ overall.height, energy.efficiency.df1)
summary(heating.load.m5)

## 
## Call:
## lm(formula = heating.load ~ overall.height, data = energy.efficiency.df1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.7259  -2.5929  -0.3085   2.0015  11.8241 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -4.59887    0.52661  -8.733   <2e-16 ***
## overall.height  5.12497    0.09516  53.857   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.615 on 766 degrees of freedom
## Multiple R-squared:  0.7911, Adjusted R-squared:  0.7908 
## F-statistic:  2901 on 1 and 766 DF,  p-value: < 2.2e-16

ggplot(energy.efficiency.df1, aes(x=overall.height, y=heating.load)) +
  geom_point(color = "red") +
  geom_line(aes(x=overall.height, y=predict(heating.load.m5, newdata = energy.efficiency.df1)), color = "blue")

Heating Load vs Glazing area:

heating.load.m6 <- lm(heating.load ~ glazing.area, energy.efficiency.df1)
summary(heating.load.m6)

## 
## Call:
## lm(formula = heating.load ~ glazing.area, data = energy.efficiency.df1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.272  -9.193  -3.054   7.253  17.699 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.5170     0.7103  24.662  < 2e-16 ***
## glazing.area  20.4380     2.6351   7.756  2.8e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.722 on 766 degrees of freedom
## Multiple R-squared:  0.07281,    Adjusted R-squared:  0.0716 
## F-statistic: 60.16 on 1 and 766 DF,  p-value: 2.796e-14

ggplot(energy.efficiency.df1, aes(x=glazing.area, y=heating.load)) +
  geom_point(color = "red") +
  geom_line(aes(x=glazing.area, y=predict(heating.load.m6, newdata = energy.efficiency.df1)), color = "blue")
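The six simple regressions above report R-squared values of 0.3872 (relative.compactness), 0.4331 (surface.area), 0.2076 (wall.area), 0.7427 (roof.area), 0.7911 (overall.height) and 0.07281 (glazing.area). A compact way to collect such values in one step is sketched below. The helper simple.r.squared is hypothetical, and the toy data are again the six head() rows, so the printed values are illustrative only.

```r
# Hypothetical helper: fit response ~ predictor for each predictor
# name and return the R-squared values as a named numeric vector.
simple.r.squared <- function(df, response, predictors) {
  sapply(predictors, function(p) {
    fit <- lm(reformulate(p, response = response), data = df)
    summary(fit)$r.squared
  })
}

# Toy data: six rows from the head() listing (assumption: not the
# full data frame, so values differ from the report's output).
head.rows <- data.frame(
  relative.compactness = c(0.98, 0.98, 0.98, 0.98, 0.90, 0.90),
  surface.area         = c(514.5, 514.5, 514.5, 514.5, 563.5, 563.5),
  wall.area            = c(294.0, 294.0, 294.0, 294.0, 318.5, 318.5),
  heating.load         = c(15.55, 15.55, 15.55, 15.55, 20.84, 21.46)
)

toy.r2 <- simple.r.squared(
  head.rows, "heating.load",
  c("relative.compactness", "surface.area", "wall.area")
)
print(round(toy.r2, 4))
```

Called on energy.efficiency.df1 with the six predictor names, the same helper should reproduce the R-squared values listed above.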

Step-by-step multiple linear model approach

So, from the above regression models between the heating load and 6 of the independent variables, it is clear that the heating load has a statistically significant relationship with each of them. Hence, it would be better to build a single model that fits all these variables together. We will start by using all the independent variables to build a multiple linear regression model.

Building a multiple-linear model with all the variables:

heating.load.ml1 <- lm(formula = heating.load  ~ relative.compactness + surface.area + wall.area + roof.area + overall.height +
                              orientation + glazing.area + glazing.area.distribution,
                             data = energy.efficiency.df1)

summary(heating.load.ml1)

## 
## Call:
## lm(formula = heating.load ~ relative.compactness + surface.area + 
##     wall.area + roof.area + overall.height + orientation + glazing.area + 
##     glazing.area.distribution, data = energy.efficiency.df1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.8965 -1.3196 -0.0252  1.3532  7.7052 
## 
## Coefficients: (1 not defined because of singularities)
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                84.013418  19.033613   4.414 1.16e-05 ***
## relative.compactness      -64.773432  10.289448  -6.295 5.19e-10 ***
## surface.area               -0.087289   0.017075  -5.112 4.04e-07 ***
## wall.area                   0.060813   0.006648   9.148  < 2e-16 ***
## roof.area                         NA         NA      NA       NA    
## overall.height              4.169954   0.337990  12.338  < 2e-16 ***
## orientation                -0.023330   0.094705  -0.246  0.80548    
## glazing.area               19.932736   0.813986  24.488  < 2e-16 ***
## glazing.area.distribution   0.203777   0.069918   2.915  0.00367 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.934 on 760 degrees of freedom
## Multiple R-squared:  0.9162, Adjusted R-squared:  0.9154 
## F-statistic:  1187 on 7 and 760 DF,  p-value: < 2.2e-16

As the summary of this multiple linear model shows, orientation has a very high p-value (0.805), so there is no evidence of a relationship between orientation and the heating load once the other variables are included. We can therefore remove orientation. The roof.area coefficient is NA ("1 not defined because of singularities"), which means roof.area is an exact linear combination of other independent variables and adds no information, so it can also be removed. The variable glazing.area.distribution has a larger p-value (0.00367) than the remaining variables, but this is still well below the significance level of 0.05. Hence we will continue with glazing.area.distribution.
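The NA for roof.area can be made concrete. In the six rows printed by head() earlier, surface.area equals wall.area plus twice roof.area, which is the kind of exact linear dependency that produces a singularity. The sketch below checks this on those six rows only; that the same identity holds for all 768 rows is an assumption here, not something verified in this report.

```r
# Six rows copied from the head() listing earlier in the report.
head.rows <- data.frame(
  surface.area = c(514.5, 514.5, 514.5, 514.5, 563.5, 563.5),
  wall.area    = c(294.0, 294.0, 294.0, 294.0, 318.5, 318.5),
  roof.area    = c(110.25, 110.25, 110.25, 110.25, 122.50, 122.50)
)

# Candidate dependency: surface.area = wall.area + 2 * roof.area
reconstructed    <- head.rows$wall.area + 2 * head.rows$roof.area
dependency.holds <- isTRUE(all.equal(head.rows$surface.area, reconstructed))
print(dependency.holds)

# On the full model, stats::alias() reports such dependencies directly:
# alias(heating.load.ml1)
```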

So, based on the above, we remove the 2 variables roof.area and orientation.

Building a new multiple linear model:

heating.load.ml2 <- lm(formula = heating.load  ~ relative.compactness + surface.area + wall.area + overall.height +
                              glazing.area + glazing.area.distribution,
                             data = energy.efficiency.df1)

summary(heating.load.ml2)

## 
## Call:
## lm(formula = heating.load ~ relative.compactness + surface.area + 
##     wall.area + overall.height + glazing.area + glazing.area.distribution, 
##     data = energy.efficiency.df1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9315 -1.3189 -0.0262  1.3587  7.7169 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                83.931762  19.018978   4.413 1.17e-05 ***
## relative.compactness      -64.773432  10.283096  -6.299 5.06e-10 ***
## surface.area               -0.087289   0.017065  -5.115 3.97e-07 ***
## wall.area                   0.060813   0.006644   9.153  < 2e-16 ***
## overall.height              4.169954   0.337781  12.345  < 2e-16 ***
## glazing.area               19.932736   0.813484  24.503  < 2e-16 ***
## glazing.area.distribution   0.203777   0.069875   2.916  0.00365 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.933 on 761 degrees of freedom
## Multiple R-squared:  0.9162, Adjusted R-squared:  0.9155 
## F-statistic:  1387 on 6 and 761 DF,  p-value: < 2.2e-16

In this new model summary, none of the variables has a p-value greater than the significance level of 0.05. Hence we take this as the best linear regression model that we can fit for heating.load with these variables. Of course, some non-linear regression models may describe the relationship between heating.load and the independent variables better. But for this project, non-linear models are out of scope and can be dealt with in a later project.
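The check that no p-value exceeds 0.05 can be automated instead of being read off the printed table. The sketch below uses simulated data, since the full data frame is not reproduced in this report, and the helper name weak.predictors is hypothetical.

```r
# Hypothetical helper: names of the predictors whose p-value in the
# coefficient table exceeds the significance level alpha.
weak.predictors <- function(model, alpha = 0.05) {
  p.values <- coef(summary(model))[, "Pr(>|t|)"]
  p.values <- p.values[names(p.values) != "(Intercept)"]
  names(p.values)[p.values > alpha]
}

# Simulated toy data (assumption, not the energy data):
# y depends strongly on x1 and not at all on x2.
set.seed(606)
n   <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 3 * toy$x1 + rnorm(n, sd = 0.5)

toy.model <- lm(y ~ x1 + x2, data = toy)
print(weak.predictors(toy.model))
```

Applied to heating.load.ml1, this should flag orientation; applied to heating.load.ml2, it should return an empty character vector. Coefficients reported as NA because of singularities do not appear in the coefficient table, so they are not flagged by this helper.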

So, as per this linear model, the fitted equation is:

heating.load = 83.93 - 64.77 * relative.compactness - 0.087 * surface.area + 0.06 * wall.area + 4.17 * overall.height + 19.93 * glazing.area + 0.2 * glazing.area.distribution
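To make the fitted equation concrete, the sketch below applies the coefficients from the heating.load.ml2 summary to the first building in the head() listing (relative.compactness 0.98, surface.area 514.5, wall.area 294.0, overall.height 7, no glazing, observed heating.load 15.55). The variable names coefs and first.row are made up for this example.

```r
# Coefficients copied from the summary of heating.load.ml2.
coefs <- c(
  "(Intercept)"             = 83.931762,
  relative.compactness      = -64.773432,
  surface.area              = -0.087289,
  wall.area                 = 0.060813,
  overall.height            = 4.169954,
  glazing.area              = 19.932736,
  glazing.area.distribution = 0.203777
)

# First row of the head() listing; the intercept term gets a 1.
first.row <- c(
  "(Intercept)"             = 1,
  relative.compactness      = 0.98,
  surface.area              = 514.5,
  wall.area                 = 294.0,
  overall.height            = 7,
  glazing.area              = 0,
  glazing.area.distribution = 0
)

# Prediction is the sum of coefficient * value over all terms.
predicted <- sum(coefs * first.row[names(coefs)])
residual  <- 15.55 - predicted   # observed minus predicted

print(round(c(predicted = predicted, residual = residual), 2))
```

The prediction comes out at about 22.61 against an observed 15.55, a residual of about -7.06, which lies within the residual range of -9.93 to 7.72 reported in the model summary.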

The intercept of 83.93 is, in principle, the value of heating.load when all the independent variables are 0. That is not a valid scenario for a building, so the intercept has no physical meaning here; it only fixes the position of the fitted line.

Let us see what the coefficients signify. The negative coefficient for relative.compactness (-64.77) means that, with the other variables held constant, the heating load decreases as relative.compactness increases. This looks plausible: a more compact building should need less heating, and a less compact one more. Note, however, that the simple regression above gave a positive slope (59.36) for the same variable, so the sign depends on which other predictors are in the model and should be interpreted with care.

For the surface area, the relationship with the heating load is also negative, but the coefficient is very small (-0.087 per sq. meter).

So, as we see above, relative compactness, overall height and glazing area have the largest coefficients in the model and, by that measure, the biggest impact on the heating load of a building.

Plotting the residuals from this model:

plot(heating.load.ml2$residuals ~ heating.load.ml2$fitted.values)

hist(heating.load.ml2$residuals)

qqnorm(heating.load.ml2$residuals)
qqline(heating.load.ml2$residuals)

Conclusion:

The heating load of a building is highly dependent on these 3 variables: relative.compactness, overall.height and glazing.area. Only the heating load was modelled above; the cooling load was not analysed.

A non-linear model may describe the relationship better, but a few trends already come out clearly from the multiple linear model fitted above.

heating.load = 83.93 - 64.77 * relative.compactness - 0.087 * surface.area + 0.06 * wall.area + 4.17 * overall.height + 19.93 * glazing.area + 0.2 * glazing.area.distribution