The aim of this coursework is to perform univariate analysis with the dplyr package. Furthermore, the ggplot2 packages will be used for incremental plots and the other packages for extra support.The dataset contains ten attributes and two responses denoted by HeatingLoad and CoolingLoad). The purpose is to use the eight features to predict each of the two responses.
# use package "here"
library(here)
## here() starts at /Users/kelvinosuagwu/Desktop/parent/datavisualisation_&Analysis_CW
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
# Setting the working directory to the folder where this file is.
setwd(here::here())
# Loading the data files
energy <- read.csv("data/energy.csv", header = T, stringsAsFactors = T)
#check the class, if it is a data.frame
class(energy)
## [1] "data.frame"
#print few rows of the dataset
head(energy)
# let us check the structure of the dataset to view the datatype
str(energy)
## 'data.frame': 795 obs. of 10 variables:
## $ Instance : int 1 2 3 4 5 6 7 8 9 10 ...
## $ AproxArea : num 556 463 463 540 514 ...
## $ WallArea : num 279 283 296 294 297 ...
## $ RoofArea : num 104 117 108 104 105 ...
## $ GlassArea : num 0 0 0 0 30 ...
## $ Height : Factor w/ 2 levels "high","low": 1 1 1 1 1 1 1 1 1 1 ...
## $ Condition : Factor w/ 5 levels "A","B","C","D",..: 1 3 2 2 1 3 3 3 3 3 ...
## $ Orientation: Factor w/ 4 levels "E","N","S","W": 1 2 3 4 1 1 1 1 1 2 ...
## $ HeatingLoad: num 15.6 15.6 15.6 15.6 24.6 ...
## $ CoolingLoad: num 21.3 21.3 21.3 21.3 26.3 ...
# view dataset if NA is present
summary(energy) #heating load variable has 4NA'S
## Instance AproxArea WallArea RoofArea
## Min. : 1.0 Min. :463.1 Min. :227.8 Min. :103.6
## 1st Qu.:199.5 1st Qu.:602.0 1st Qu.:285.2 1st Qu.:138.2
## Median :398.0 Median :673.8 Median :314.6 Median :207.3
## Mean :398.0 Mean :674.1 Mean :319.1 Mean :179.0
## 3rd Qu.:596.5 3rd Qu.:746.7 3rd Qu.:344.2 3rd Qu.:220.5
## Max. :795.0 Max. :889.4 Max. :550.8 Max. :238.1
##
## GlassArea Height Condition Orientation HeatingLoad
## Min. : 0.00 high:384 A:117 E:202 Min. : 6.01
## 1st Qu.: 32.49 low :411 B:135 N:197 1st Qu.:12.93
## Median : 75.71 C:517 S:195 Median :17.37
## Mean : 75.00 D: 4 W:201 Mean :21.98
## 3rd Qu.:112.90 E: 22 3rd Qu.:31.20
## Max. :174.93 Max. :43.10
## NA's :4
## CoolingLoad
## Min. :10.90
## 1st Qu.:15.49
## Median :21.33
## Mean :24.28
## 3rd Qu.:32.92
## Max. :48.03
##
# this is a dependent variable, check if the datatype is numeric
is.numeric(energy$HeatingLoad) #TRUE
## [1] TRUE
# drop the instance variable using subset function
energy <- subset(energy, select = -c(Instance) )
# check for NA
energy.isNa <- is.na(energy) #check for missing values
sum(energy.isNa) #4 missing values double checked
## [1] 4
# This function replaces missing(NA'S) Heating load data with Mean
# returns a mean with decimals up to 5 places
energy$HeatingLoad <- ifelse(is.na(energy$HeatingLoad),
ave(energy$HeatingLoad,
FUN = function(x) mean(x, na.rm = TRUE) )
,energy$HeatingLoad)
str(energy$HeatingLoad)
## num [1:795] 15.6 15.6 15.6 15.6 24.6 ...
class(energy$HeatingLoad)
## [1] "numeric"
#format and round the variable to the nearest decimal number of 2
# to be at par with the other features/variables
energy$HeatingLoad <- format(round(energy$HeatingLoad, digits = 2), nsmall = 2)
# convert to numeric datatype after formatting
energy$HeatingLoad <- as.numeric(energy$HeatingLoad)
summary(energy)
## AproxArea WallArea RoofArea GlassArea Height
## Min. :463.1 Min. :227.8 Min. :103.6 Min. : 0.00 high:384
## 1st Qu.:602.0 1st Qu.:285.2 1st Qu.:138.2 1st Qu.: 32.49 low :411
## Median :673.8 Median :314.6 Median :207.3 Median : 75.71
## Mean :674.1 Mean :319.1 Mean :179.0 Mean : 75.00
## 3rd Qu.:746.7 3rd Qu.:344.2 3rd Qu.:220.5 3rd Qu.:112.90
## Max. :889.4 Max. :550.8 Max. :238.1 Max. :174.93
## Condition Orientation HeatingLoad CoolingLoad
## A:117 E:202 Min. : 6.01 Min. :10.90
## B:135 N:197 1st Qu.:12.93 1st Qu.:15.49
## C:517 S:195 Median :17.50 Median :21.33
## D: 4 W:201 Mean :21.98 Mean :24.28
## E: 22 3rd Qu.:30.59 3rd Qu.:32.92
## Max. :43.10 Max. :48.03
# Check the summary again to see if the values where HeatingLoad values were rounded.
head(energy)
Central tendency and Spread for APROX AREA variable: This is going to be considered as a continuous scale
#check the summary statistics for this variable
summary(energy$AproxArea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 463.1 602.0 673.8 674.1 746.7 889.4
sd(energy$AproxArea)
## [1] 96.66033
Median value using a vertical red line
# let us use dot plot to describe the data
#using line to check if the median is a typical value of this dataset
aproxArea.plot <- ggplot(energy, aes(x= AproxArea)) +
geom_dotplot(col="black", fill="gold" , binwidth= 7) +
labs(x="Approximate area of home", y="") +
theme_classic() +
geom_vline(xintercept = 673.8, color = "red", size=0.5)
aproxArea.plot
aproxArea.plot<- ggplot(energy, aes(y= AproxArea)) +
geom_boxplot(col="blue", fill="lightblue") +
labs(title="The Approximate area of the Home per energy usage in 2016", x="",y="AproxArea")+
theme_classic()
aproxArea.plot
Histograms
aproxArea.plot <- ggplot(energy, aes(x= AproxArea)) +
geom_histogram(aes(y=..density..),col="red", fill="grey" , binwidth=20) +
geom_vline(xintercept = median(energy$AproxArea), lwd = 2, size=0.01) +
labs(x="Distribution of Approximate area of the Home per energy usage in 2016", y="Density") +
geom_density(col="blue") + theme_classic()
## Warning: Duplicated aesthetics after name standardisation: size
aproxArea.plot
Description
Central tendency and Spread for WALL AREA variable: This is going to be considered as a continuous scale
#check the summary statistics for this variable
summary(energy$WallArea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 227.8 285.2 314.6 319.1 344.2 550.8
Histograms plus the spread(density)
# The line is the median
wallArea.plot <- ggplot(energy, aes(x= WallArea)) +
geom_histogram(aes(y=..density..),col="red", fill="grey" , binwidth=20) +
geom_vline(xintercept = median(energy$WallArea), lwd = 2, size=0.01) +
labs(x="Distribution of Approximate area of the Home per energy usage in 2016", y="Density") +
geom_density(col="blue") + theme_classic()
## Warning: Duplicated aesthetics after name standardisation: size
wallArea.plot
# let us use dot plot to describe the data
#using line to check if the median is a typical value of this dataset
library(cowplot) #cowplot package for grid
wallA.plotMedian <- ggplot(energy, aes(x= WallArea)) +
geom_dotplot(col="black", fill="gold" , binwidth= 7) +
labs(x="Distribution of wall Area", y="") +
theme_classic() +
geom_vline(xintercept = 314.6, color = "red", size=0.7)
#using line to check if the mean is a typical value of this dataset
wallA.plotMean <- ggplot(energy, aes(x= WallArea)) +
geom_dotplot(col="black", fill="gold" , binwidth= 7) +
labs(x="Distribution of wall Area", y="") +
theme_classic() +
geom_vline(xintercept = 319.1, color = "red", size=0.7)
plot_grid(wallA.plotMedian, wallA.plotMean, labels = "AUTO") #grid of two rows
#box plot
wallArea.plot<- ggplot(energy, aes(y= WallArea)) +
geom_boxplot(col="blue", fill="lightblue") +
labs(title="The Wall area in (sqft) for Home energy usage in 2016", x="",y="Average Wall Area in sqft") +
theme_classic()
wallArea.plot
Description
Central tendency and Spread for ROOF AREA variable: This is going to be considered as a continuous scale
#check the summary statistics for this variable
summary(energy$RoofArea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 103.6 138.2 207.3 179.0 220.5 238.1
roofArea.plot <- ggplot(energy, aes(y= RoofArea)) +
geom_boxplot(col="blue", fill="lightblue") +
labs(title="The Roof area in (sqft) for Home energy usage in 2016", x="",y="Average roofArea sqft") +
theme_classic()
roofArea.plot
# The line is the mean
roofArea.plotMean <- ggplot(energy, aes(x= RoofArea)) +
geom_histogram(aes(y=..density..),col="red", fill="grey" , binwidth=20) +
geom_vline(xintercept = mean(energy$RoofArea), lwd = 2, size=0.01) +
labs(x="Distribution of Roof area ", y="Density") +
theme_classic()
## Warning: Duplicated aesthetics after name standardisation: size
# The line is the median
roofArea.plotMedian <- ggplot(energy, aes(x= RoofArea)) +
geom_histogram(aes(y=..density..),col="red", fill="grey" , binwidth=20) +
geom_vline(xintercept = median(energy$RoofArea), lwd = 2, size=0.01) +
labs(x="Distribution of Roof area ", y="Density") +
theme_classic()
## Warning: Duplicated aesthetics after name standardisation: size
plot_grid(roofArea.plotMean , roofArea.plotMedian, labels = "AUTO") #grid of two rows
#using line to check if the median is representative of this dataset
roofA.plotMedian <- ggplot(energy, aes(x= RoofArea)) +
geom_dotplot(col="black", fill="gold" , binwidth= 4) +
labs(x="Distribution of roof Area", y="") +
theme_classic() +
geom_vline(xintercept = 207.3 , color = "red", size=0.2)
#using line to check if the mean is representative of this dataset
roofB.plotMean <- ggplot(energy, aes(x= RoofArea)) +
geom_dotplot(col="black", fill="gold" , binwidth= 4) +
labs(x="Distribution of roof Area", y="") +
theme_classic() +
geom_vline(xintercept = 179.0, color = "red", size=0.2)
plot_grid(roofA.plotMedian, roofB.plotMean, labels = "AUTO") #grid of two rows
Description
Central tendency and Spread for GLASS AREA variable: This is going to be considered as a continuous variable
#check the summary statistics for this variable
summary(energy$GlassArea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 32.49 75.71 75.00 112.90 174.93
#boxplot
glassArea.plot <- ggplot(energy, aes(y= GlassArea)) +
geom_boxplot(col="red", fill="lightblue") +
labs(title="The glass area in (sqft) for Home energy usage in 2016", x="glass area ",y="Average glassArea sqft") +
theme_classic()
glassArea.plot
Description
NA: No missing values
Outliers: No outliers
Distribution: close to a normal the distribution
Typical Values: The values in the dataset are quite centered around the Median than The mean is representative of the dataset.
Spread: The data is compact
Central tendency and Spread for Height variable: This is going to be considered as a discrete /categorical variable
#check the summary statistics for this variable
summary(energy$Height)
## high low
## 384 411
# plot the bars in ascending order
ggplot(energy, aes(x = Height)) +
geom_bar(fill = "light blue",
color="black") +
labs(x = "Height of building",
y = "Frequency",
title = "The Height of building for Home energy usage in 2016")
# plot the distribution as percentages
ggplot(energy,
aes(x = Height,
y = ..count.. / sum(..count..))) +
geom_bar() +
labs(x = "Height of building",
y = "Percent",
title = "The Height of building for Home energy usage in 2016") +
scale_y_continuous(labels = scales::percent)
Description
Central tendency and Spread for Condition variable: This is going to be considered as a discrete /categorical variable
#check the summary statistics for this variable
summary(energy$Condition)
## A B C D E
## 117 135 517 4 22
# plot the bars in ascending order
ggplot(energy, aes(x = Condition)) +
geom_bar(fill = "light blue",
color="black") +
labs(x = "Condition of building",
y = "Frequency",
title = "The Condition of building that affects Home energy usage in 2016")
# plot the distribution as percentages
ggplot(energy,
aes(x = Condition,
y = ..count.. / sum(..count..))) +
geom_bar() +
labs(x = "Condition of building",
y = "Percent",
title = "The Condition of building for Home energy usage in 2016") +
scale_y_continuous(labels = scales::percent)
# Basic piechart for the Condition dataset
library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:here':
##
## here
class(energy$Condition)
## [1] "factor"
# counting the number of staff per site.
pieData <- count(energy$Condition)
#rename the column
names(pieData) <- c("Condition", "frequency")
pieData
# order data according to the site (important for placing labels later on)
pieData <- arrange(pieData,desc(Condition))
# create new column with position for label
pieData <- mutate(pieData, positionLabel = cumsum(pieData$frequency) - 0.5*pieData$frequency)
# creating plot
pieData.plot <- ggplot(pieData, aes(x="", y= frequency, fill = Condition)) +
geom_bar(width = 1, stat="identity") +
coord_polar("y", start=0) +
geom_text(aes(y = positionLabel, label = frequency)) +
scale_fill_manual(values = c("red","blue", "lightblue", "purple", "grey"))
pieData.plot
Description
Central tendency and Spread for Orientation variable: This is going to be considered as a discrete /categorical variable
#check the summary statistics for this variable
summary(energy$Orientation)
## E N S W
## 202 197 195 201
# Basic piechart for the Condition dataset
# counting the number of staff per site.
pieData <- count(energy$Orientation)
#rename the column
names(pieData) <- c("Orientation", "frequency")
pieData
# order data according to the site (important for placing labels later on)
pieData <- arrange(pieData,desc(Orientation))
# create new column with position for label
pieData <- mutate(pieData, positionLabel = cumsum(pieData$frequency) - 0.5*pieData$frequency)
# creating plot
pieData.plot <- ggplot(pieData, aes(x="", y= frequency, fill = Orientation)) +
geom_bar(width = 1, stat="identity") +
coord_polar("y", start=0) +
geom_text(aes(y = positionLabel, label = frequency)) +
scale_fill_manual(values = c("red","blue", "lightblue", "purple", "grey")) +
# removing outer "ring"
theme_void()
pieData.plot
Description
Central tendency and Spread for Heating Load dependent variable: This is going to be considered as a continuous variable
#check the summary statistics for this variable
summary(energy$HeatingLoad)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.01 12.93 17.50 21.98 30.59 43.10
#using line to check if the median is representative of this dataset
heatLoadA.plotMedian <- ggplot(energy, aes(x= HeatingLoad)) +
geom_dotplot(col="black", fill="gold" , binwidth= 1) +
labs(x="Distribution of HeatingLoad data", y="") +
theme_classic() +
geom_vline(xintercept = 17.50 , color = "red", size=0.7)
#using line to check if the mean is representative of this dataset
heatLoadB.plotMean <- ggplot(energy, aes(x= HeatingLoad)) +
geom_dotplot(col="black", fill="gold" , binwidth= 1) +
labs(x="Distribution of HeatingLoad data", y="") +
theme_classic() +
geom_vline(xintercept = 21.98, color = "red", size=0.7)
plot_grid(heatLoadA.plotMedian, heatLoadB.plotMean, labels = "AUTO") #grid of two rows
#boxplot on heating load object
hl.plot <- ggplot(energy, aes(y= HeatingLoad)) +
geom_boxplot(col="black", fill="lightblue") +
labs(title="The heating Load for Homes in 2016", x="heating ",y="Average heating Load in British Thermal Unit") +
theme_classic()
hl.plot
Description
Central tendency and Spread for Cooling Load dependent variable: This is going to be considered as a continuous variable
#check the summary statistics for this variable
summary(energy$CoolingLoad)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.90 15.49 21.33 24.28 32.92 48.03
#boxplot on cooling load object
cl.plot <- ggplot(energy, aes(y= CoolingLoad)) +
geom_boxplot(col="black", fill="lightblue") +
labs(title="The Cooling Load for Homes in 2016", x="cooling ",y="Average cooling Load in British Thermal Unit") +
theme_classic()
cl.plot
#using line to check if the mean is representative of this dataset
cLoad.plotMean <- ggplot(energy, aes(x= CoolingLoad)) +
geom_dotplot(col="black", fill="gold" , binwidth= 0.5) +
labs(x="Distribution of Cooling Load data", y="") +
theme_classic() +
geom_vline(xintercept = 24.28, color = "red", size=0.7)
cLoad.plotMean
Description
#Drop levels filter with D
noDCondition = droplevels(energy[!energy$Condition == 'D',])
#show a table of Condition object
table(noDCondition$Condition)
##
## A B C E
## 117 135 517 22
# Basic piechart for the Condition dataset of A, B , C, E
# counting the number of Condition per site.
pieData <- count(noDCondition$Condition)
#rename the column
names(pieData) <- c("Condition", "frequency")
pieData
# order data according to the site (important for placing labels later on)
pieData <- arrange(pieData,desc(Condition))
# create new column with position for label
pieData <- mutate(pieData, positionLabel = cumsum(pieData$frequency) - 0.5*pieData$frequency)
# creating plot
pieData.plot <- ggplot(pieData, aes(x="", y= frequency, fill = Condition)) +
geom_bar(width = 1, stat="identity") +
coord_polar("y", start=0) +
geom_text(aes(y = positionLabel, label = frequency)) +
scale_fill_manual(values = c("red","blue", "lightblue", "purple", "grey")) +
# removing outer "ring"
theme_void()
pieData.plot
Comments
#heating load as hl and condition as con
#Drop levels filter with Orientation is “W” and Height is “high”
hlCon.data <- droplevels(energy[energy$Orientation == 'W' & energy$Height == 'high',])
#print dataset
head(hlCon.data)
# plot the Heating load distribution by condition
pl <- ggplot(hlCon.data ,
aes(x = Condition,
y = HeatingLoad,
group=Condition,
color=Condition)) +
geom_point(size = 2) +
labs(title = "Heating load distribution by condition")
pl
Comments
#Drop levels filter with Orientation is “W” and Height is “high”
hlCon.data <- droplevels(energy[energy$Orientation == 'N' & energy$Condition == 'B' & energy$AproxArea >= 650,])
#print dataset
head(hlCon.data)
Different plot to compare
# plot the data set with Heating Load and cooling load values where AproxArea is at least 650
#filter by Orientation and Condition
plot <- ggplot(hlCon.data, aes(AproxArea)) +
geom_line(aes(y = HeatingLoad, colour = "HeatingLoad")) +
geom_line(aes(y = CoolingLoad, colour = "CoolingLoad")) +
scale_colour_hue("channels") +
labs(x= "AproxArea", y= "Heating Load / Cooling Load", title="Energy consumed in an Aproximate area")
plot
# get the summary to begin with
summary(energy)
## AproxArea WallArea RoofArea GlassArea Height
## Min. :463.1 Min. :227.8 Min. :103.6 Min. : 0.00 high:384
## 1st Qu.:602.0 1st Qu.:285.2 1st Qu.:138.2 1st Qu.: 32.49 low :411
## Median :673.8 Median :314.6 Median :207.3 Median : 75.71
## Mean :674.1 Mean :319.1 Mean :179.0 Mean : 75.00
## 3rd Qu.:746.7 3rd Qu.:344.2 3rd Qu.:220.5 3rd Qu.:112.90
## Max. :889.4 Max. :550.8 Max. :238.1 Max. :174.93
## Condition Orientation HeatingLoad CoolingLoad
## A:117 E:202 Min. : 6.01 Min. :10.90
## B:135 N:197 1st Qu.:12.93 1st Qu.:15.49
## C:517 S:195 Median :17.50 Median :21.33
## D: 4 W:201 Mean :21.98 Mean :24.28
## E: 22 3rd Qu.:30.59 3rd Qu.:32.92
## Max. :43.10 Max. :48.03
# calculate the covariance (this gives the sample covariance) that determine if Heating load and
# coolingLoad covary.The default method is pearsons.
cov(energy$CoolingLoad,energy$HeatingLoad) #The Heating Load is the response variable and the Cooling load is the predictor
## [1] 93.02044
# calculate the correlation that determines if Heating load and coolingLoad correlate.
# The default method is pearsons.
cor(energy$CoolingLoad,energy$HeatingLoad)
## [1] 0.9752675
# Create the data frame from a subset of energy data
energy.data <- data.frame(
HeatingLoad = energy$HeatingLoad,
CoolingLoad = energy$CoolingLoad,
stringsAsFactors = FALSE
)
# Print the data frame.
head(energy.data)
# check datatype
class(energy.data)
## [1] "data.frame"
# import Library corplot to visualize if there is a linear relationship between variables
library(corrplot)
## corrplot 0.84 loaded
cor.matrix <- corrplot(cor(energy.data)) #correlation matrices
cor.matrix
## HeatingLoad CoolingLoad
## HeatingLoad 1.0000000 0.9752675
## CoolingLoad 0.9752675 1.0000000
# scatter plotting the heating load and cooling to see if there is a linear relationship
ggplot(energy.data, aes(x=CoolingLoad ,y=HeatingLoad)) +
geom_point() +
labs(title = "Cooling Load vs Heating Load",
x="Cooling load (KBTU)",
y = "Heating Load (KBTU)")
Covariance: From the results, covariance is 93.02044 ~ 93%. The two variable covary
Correlation: correlation coefficient is 0.9752675 very close to 1
Correlation Matrix: Shows strong positive correlation
Scatter plot: The variables have a strong linear correlation and the graph looks reasonably linear
A linear model is appropriate. So we shall proceed to a linear regression.
Analysis of variance table
# HeatingLoad is to be regressed on CoolingLoad,
# model
regressor <- lm(formula=HeatingLoad~CoolingLoad,data=energy.data)
#Analysis of variance table
summary(regressor)
##
## Call:
## lm(formula = HeatingLoad ~ CoolingLoad, data = energy.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.2435 -0.9171 -0.1044 1.3788 6.2989
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.045746 0.216277 -14.08 <2e-16 ***
## CoolingLoad 1.030965 0.008297 124.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.221 on 793 degrees of freedom
## Multiple R-squared: 0.9511, Adjusted R-squared: 0.9511
## F-statistic: 1.544e+04 on 1 and 793 DF, p-value: < 2.2e-16
# let us visualise the result graph using ggplot2
# let us design the visualization in layers that we trained our model on using the simple linear equation ^y = B0 + B1X
# 1. the ggplot function()
# 2. the point function through the channel
# 3. the liner model using the y predicted value of the energy.data set
p <- ggplot() +
geom_point(aes(x= energy.data$CoolingLoad, y= energy.data$HeatingLoad),
color= "red") +
geom_line(aes(x= energy.data$CoolingLoad,, y =predict(regressor, newdata= energy.data)),
color=" blue") +
ggtitle('Cooling vs Heating load(Energy observation points)') +
xlab('Cooling load') +
ylab('Heating load')
p
# create a dataframe from the new data
newdata <- data.frame(CoolingLoad=32)
# let us predict the Heating load result after we have trained our model with the subset data (energy.data)
hl.predict <- predict(regressor, newdata = newdata) # heating load as hl
hl.predict
## 1
## 29.94512
# scatter plotting the heating load and cooling to see if there is a linear relationship
ggplot(energy.data, aes(x=HeatingLoad ,y=CoolingLoad)) +
geom_point() +
labs(title = "Heating Load vs Cooling Load",
x="Heating load (KBTU)",
y = "Cooling Load (KBTU)")
# CoolingLoad, is to be regressed on HeatingLoad.
# model
regressorNew <- lm(formula=CoolingLoad~HeatingLoad, data=energy.data)
#Analysis of variance table
summary(regressorNew)
##
## Call:
## lm(formula = CoolingLoad ~ HeatingLoad, data = energy.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1246 -1.1918 -0.1625 0.6572 8.9658
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.995919 0.179418 22.27 <2e-16 ***
## HeatingLoad 0.922579 0.007425 124.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.101 on 793 degrees of freedom
## Multiple R-squared: 0.9511, Adjusted R-squared: 0.9511
## F-statistic: 1.544e+04 on 1 and 793 DF, p-value: < 2.2e-16
From the summary statistics, The p-VALUE shows the cooling load is statistically significant and has a strong influence on the Heating load. The heating load increases as the cooling load increases.Also the coefficient of detrmination is 0.9511. A linear model is ideal because it is close to 1
# let us visualise the result graph using ggplot2
# let us design the visualization in layers that we trained our model on using the simple linear equation ^y = B0 + B1X
# 1. the ggplot function()
# 2. the point function through the channel
# 3. the liner model using the y predicted value of the energy.data set
p <- ggplot() +
geom_point(aes(x= energy.data$HeatingLoad, y= energy.data$CoolingLoad),
color= "red") +
geom_line(aes(x= energy.data$HeatingLoad,, y =predict(regressorNew, newdata= energy.data)),
color=" blue") +
ggtitle('Heating vs Cooling load(Energy observation points)') +
xlab('Heating load') +
ylab('Cooling load')
p
# create a dataframe from the new data
data.new <- data.frame(HeatingLoad =32)
# let us predict the Heating load result after we have trained our model with the subset data (energy.data)
cl.predict <- predict(regressorNew, newdata = data.new ) # Cooling load as cl
cl.predict
## 1
## 33.51846
Results:
Based on the regression model and the data, we expect or estimate to have a cooling load of 33.5 KBTU approximately for a heating load of 41 KBTU. The model is close to 1 and is an ideal model for prediction.
# calculate the mean for the cooling load, cl as cooling load
cl.mean <- mean(energy$CoolingLoad)
# approximate the value to whole number
cl.mean <- round(cl.mean)
cl.mean
## [1] 24
# check the datatype
class(cl.mean)
## [1] "numeric"
# create a vector of datasetNumber
datasetNumber <- c(1:18)
# create a vector of cooling Load Average
coolingLoadAvg <- c(23,24,23,25,24,23,26,24,23,25,24,23,26,22,25,25,22,cl.mean)
coolLoads <- data.frame(datasetNumber, coolingLoadAvg)
print(coolLoads)
## datasetNumber coolingLoadAvg
## 1 1 23
## 2 2 24
## 3 3 23
## 4 4 25
## 5 5 24
## 6 6 23
## 7 7 26
## 8 8 24
## 9 9 23
## 10 10 25
## 11 11 24
## 12 12 23
## 13 13 26
## 14 14 22
## 15 15 25
## 16 16 25
## 17 17 22
## 18 18 24
class(coolLoads)
## [1] "data.frame"
# Check for normality in distribution since the sample size( 18) is not greater than 30
shapiro.test(coolLoads$coolingLoadAvg)
##
## Shapiro-Wilk normality test
##
## data: coolLoads$coolingLoadAvg
## W = 0.92863, p-value = 0.1837
# Using : Q-Q plot
p <- ggplot(coolLoads, aes(sample = coolingLoadAvg))
p <- p + stat_qq()
p <- p + stat_qq_line( )
p
#Using Dot plot of distribution
p <- ggplot(coolLoads, aes(x = coolingLoadAvg ))
p <- p+ geom_dotplot( binwidth=0.4)
p <- p + labs( x="",y="")
p <- p + xlim(20,27)
p
# parametric t-test to check for alternative hypothesis
t.test(x=coolLoads$coolingLoadAvg, alternative="two.sided", paired=F, mu=23.5, conf.level = 0.99)
##
## One Sample t-test
##
## data: coolLoads$coolingLoadAvg
## t = 1.5567, df = 17, p-value = 0.138
## alternative hypothesis: true mean is not equal to 23.5
## 99 percent confidence interval:
## 23.11696 24.77193
## sample estimates:
## mean of x
## 23.94444
# parametric t-test to check for alternative hypothesis using confidence level of 90%
t.test(x=coolLoads$coolingLoadAvg, alternative="two.sided", paired=F, mu=23.5, conf.level = 0.80)
##
## One Sample t-test
##
## data: coolLoads$coolingLoadAvg
## t = 1.5567, df = 17, p-value = 0.138
## alternative hypothesis: true mean is not equal to 23.5
## 80 percent confidence interval:
## 23.56375 24.32514
## sample estimates:
## mean of x
## 23.94444
#let us explore the relationship using boxplot with ggplot2
# Boxplot showcasing the distribution of RoofArea by Height
colors <- c(rgb(0.1,0.1,0.7,0.5), rgb(0.8,0.1,0.3,0.6))
ggplot(energy, aes(Height, RoofArea, fill = Height)) + geom_boxplot()+
ggtitle('Roof area by Height') + xlab('') + ylab('Roof Area') + scale_fill_manual(values=colors) + theme_classic()
No outliers.
close to normal distributions.
It is assumed that the value of the height influences the Roof area. Let test this using T-test
Ho: No difference in variance and correlation Mean Roof Area = Mean of height
H1: two-sided test Assume variance and correlation.
Independent two sample parametric test.
# parametric t-test to check for alternative hypothesis using confidence level of 95%
t.test(energy$RoofArea~energy$Height, alternative="two.sided", paired=F, mu=0, var.equal=T, conf.level = 0.95)
##
## Two Sample t-test
##
## data: energy$RoofArea by energy$Height
## t = -95.38, df = 793, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -88.98369 -85.39492
## sample estimates:
## mean in group high mean in group low
## 133.9513 221.1406
#let us explore the relationship using boxplot with ggplot2
# Boxplot showcasing the distribution of GlassArea by Height
colors <- c(rgb(0.4,0.1,0.5,0.5), rgb(0.8,0.1,0.3,0.6))
ggplot(energy, aes(Height, GlassArea, fill = Height)) + geom_boxplot()+
ggtitle('Glass area by Height') + xlab('') + ylab('Glass Area') + scale_fill_manual(values=colors) + theme_classic()
It is assumed that the value of the height influences the Glass area. Let test this using T-test
Ho: No difference in variance and correlation Mean Glass Area = Mean of height
H1: two-sided test Assume variance and correlation.
Independent two sample parametric test. Since sample size (energy) > 30, Neglect check for normality in distribution of data
# parametric t-test to check for alternative hypothesis using confidence level of 95%
t.test(energy$GlassArea~energy$Height, alternative="two.sided", paired=F, mu=0, var.equal=T, conf.level = 0.95)
##
## Two Sample t-test
##
## data: energy$GlassArea by energy$Height
## t = 1.6481, df = 793, p-value = 0.09973
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.986432 11.312884
## sample estimates:
## mean in group high mean in group low
## 77.67133 72.50810