Introduction

The aim of this coursework is to perform univariate analysis with the dplyr package. The ggplot2 package is used to build the plots incrementally, and the other packages give extra support. The dataset contains ten variables, two of which are the responses HeatingLoad and CoolingLoad. Of the other eight, Instance is only a row identifier, so seven features remain to predict each of the two responses.

1. Univariate analysis will be done for both categorical and numerical variables.

Importing dataset and libraries for this analysis

library(here)

## here() starts at /Users/kelvinosuagwu/Desktop/parent/datavisualisation_&Analysis_CW

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
# build the path from the project root with here(), so no setwd() is needed
energy <- read.csv(here::here("data", "energy.csv"), header = TRUE, stringsAsFactors = TRUE)

Data Exploration

#check the class, if it is a data.frame
class(energy)

## [1] "data.frame"

#print few rows of the dataset
head(energy)

# let us check the structure of the dataset to view the datatype
str(energy)

## 'data.frame':    795 obs. of  10 variables:
##  $ Instance   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ AproxArea  : num  556 463 463 540 514 ...
##  $ WallArea   : num  279 283 296 294 297 ...
##  $ RoofArea   : num  104 117 108 104 105 ...
##  $ GlassArea  : num  0 0 0 0 30 ...
##  $ Height     : Factor w/ 2 levels "high","low": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Condition  : Factor w/ 5 levels "A","B","C","D",..: 1 3 2 2 1 3 3 3 3 3 ...
##  $ Orientation: Factor w/ 4 levels "E","N","S","W": 1 2 3 4 1 1 1 1 1 2 ...
##  $ HeatingLoad: num  15.6 15.6 15.6 15.6 24.6 ...
##  $ CoolingLoad: num  21.3 21.3 21.3 21.3 26.3 ...

# summarise each variable and check whether NAs are present
summary(energy)    # HeatingLoad has 4 NAs

##     Instance       AproxArea        WallArea        RoofArea    
##  Min.   :  1.0   Min.   :463.1   Min.   :227.8   Min.   :103.6  
##  1st Qu.:199.5   1st Qu.:602.0   1st Qu.:285.2   1st Qu.:138.2  
##  Median :398.0   Median :673.8   Median :314.6   Median :207.3  
##  Mean   :398.0   Mean   :674.1   Mean   :319.1   Mean   :179.0  
##  3rd Qu.:596.5   3rd Qu.:746.7   3rd Qu.:344.2   3rd Qu.:220.5  
##  Max.   :795.0   Max.   :889.4   Max.   :550.8   Max.   :238.1  
##                                                                 
##    GlassArea       Height    Condition Orientation  HeatingLoad   
##  Min.   :  0.00   high:384   A:117     E:202       Min.   : 6.01  
##  1st Qu.: 32.49   low :411   B:135     N:197       1st Qu.:12.93  
##  Median : 75.71              C:517     S:195       Median :17.37  
##  Mean   : 75.00              D:  4     W:201       Mean   :21.98  
##  3rd Qu.:112.90              E: 22                 3rd Qu.:31.20  
##  Max.   :174.93                                    Max.   :43.10  
##                                                    NA's   :4      
##   CoolingLoad   
##  Min.   :10.90  
##  1st Qu.:15.49  
##  Median :21.33  
##  Mean   :24.28  
##  3rd Qu.:32.92  
##  Max.   :48.03  
##

# this is a dependent variable, check if the datatype is numeric
is.numeric(energy$HeatingLoad)   #TRUE

## [1] TRUE

Data Preparation

The Instance variable is only a row identifier and carries no information, so it is dropped.
The HeatingLoad variable has 4 missing values (NA), which are replaced with the mean.

# drop the instance variable using subset function
energy <- subset(energy, select = -c(Instance) )

# check for NA
energy.isNa <- is.na(energy) #check for missing values
sum(energy.isNa)   #4 missing values double checked

## [1] 4
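
The total above does not show which column holds the missing values. A per-column count makes that explicit. The sketch below is only an illustration on a small made-up data frame (`toy` is hypothetical, not the coursework data), so it runs without `energy.csv`:

```r
# Hypothetical toy data standing in for the energy data frame
toy <- data.frame(
  WallArea    = c(279, 283, 296, 294),
  HeatingLoad = c(15.6, NA, 15.6, NA)
)

# is.na() gives a logical matrix; colSums() counts the TRUE values per column
na_per_column <- colSums(is.na(toy))
print(na_per_column)
```

On the coursework data the equivalent call would be `colSums(is.na(energy))`.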

# Replace the missing (NA) HeatingLoad values with the mean of the observed values.
# The mean is not rounded at this point.
energy$HeatingLoad[is.na(energy$HeatingLoad)] <- mean(energy$HeatingLoad, na.rm = TRUE)


str(energy$HeatingLoad)

##  num [1:795] 15.6 15.6 15.6 15.6 24.6 ...

class(energy$HeatingLoad)

## [1] "numeric"
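
Mean imputation leaves the mean of the variable unchanged but reduces its variance, because every filled value sits exactly at the centre. The following is a minimal sketch on a made-up vector (`x` and its values are hypothetical, chosen only to make the effect visible):

```r
# Hypothetical toy vector with two missing values
x <- c(10, 20, NA, 30, NA)

# Mean of the observed values only
x_mean <- mean(x, na.rm = TRUE)

# Replace each NA with that mean
x_filled <- ifelse(is.na(x), x_mean, x)

print(x_filled)
# The mean is preserved, the variance shrinks
print(c(mean_before = x_mean, mean_after = mean(x_filled)))
print(c(var_before = var(x, na.rm = TRUE), var_after = var(x_filled)))
```

With only 4 of 795 values imputed in HeatingLoad, the effect on the spread should be small, but it is worth keeping in mind when reading the later summaries.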

# round the variable to 2 decimal places
# to be on a par with the other features/variables
# (round() already returns a numeric vector, so no conversion is needed)
energy$HeatingLoad <- round(energy$HeatingLoad, digits = 2)

summary(energy)

##    AproxArea        WallArea        RoofArea       GlassArea       Height   
##  Min.   :463.1   Min.   :227.8   Min.   :103.6   Min.   :  0.00   high:384  
##  1st Qu.:602.0   1st Qu.:285.2   1st Qu.:138.2   1st Qu.: 32.49   low :411  
##  Median :673.8   Median :314.6   Median :207.3   Median : 75.71             
##  Mean   :674.1   Mean   :319.1   Mean   :179.0   Mean   : 75.00             
##  3rd Qu.:746.7   3rd Qu.:344.2   3rd Qu.:220.5   3rd Qu.:112.90             
##  Max.   :889.4   Max.   :550.8   Max.   :238.1   Max.   :174.93             
##  Condition Orientation  HeatingLoad     CoolingLoad   
##  A:117     E:202       Min.   : 6.01   Min.   :10.90  
##  B:135     N:197       1st Qu.:12.93   1st Qu.:15.49  
##  C:517     S:195       Median :17.50   Median :21.33  
##  D:  4     W:201       Mean   :21.98   Mean   :24.28  
##  E: 22                 3rd Qu.:30.59   3rd Qu.:32.92  
##                        Max.   :43.10   Max.   :48.03

# Print the first rows again to check that the HeatingLoad values were rounded.
head(energy)

Data Exploration and Analysis

Univariate analysis of continuous (numerical) and categorical (discrete) variables

Central tendency and Spread for APROX AREA variable: This is going to be considered as a continuous scale

#check the summary statistics for this variable
summary(energy$AproxArea)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   463.1   602.0   673.8   674.1   746.7   889.4

sd(energy$AproxArea)

## [1] 96.66033

Visualization

Median value using a vertical red line

# Dot plot of AproxArea; the red line marks the median,
# to check whether the median is a typical value of this variable
aproxArea.plot <- ggplot(energy, aes(x = AproxArea)) +
  geom_dotplot(col = "black", fill = "gold", binwidth = 7) +
  labs(x = "Approximate area of home", y = "") +
  theme_classic() +
  geom_vline(xintercept = median(energy$AproxArea), color = "red", size = 0.5)

aproxArea.plot

# Box plot of AproxArea
aproxArea.plot <- ggplot(energy, aes(y = AproxArea)) +
  geom_boxplot(col = "blue", fill = "lightblue") +
  labs(title = "The Approximate area of the Home per energy usage in 2016",
       x = "", y = "AproxArea") +
  theme_classic()
aproxArea.plot

Histograms

# Histogram on the density scale with a density curve; the vertical line marks the median.
# lwd is an alias of size, so only size is set to avoid a duplicated aesthetic.
aproxArea.plot <- ggplot(energy, aes(x = AproxArea)) +
  geom_histogram(aes(y = ..density..), col = "red", fill = "grey", binwidth = 20) +
  geom_vline(xintercept = median(energy$AproxArea), size = 2) +
  labs(x = "Distribution of Approximate area of the Home per energy usage in 2016",
       y = "Density") +
  geom_density(col = "blue") + theme_classic()

aproxArea.plot

Description

NA: No missing values (the 4 missing values were in HeatingLoad, not in this variable)
Outliers: None
Distribution: Close to a normal distribution. The mean (674.1) is close to the median (673.8), and close to 68% of observations fall within 1 standard deviation (96.66) of the mean.
Typical Values: The values are centered around the median and the mean, so the mean is representative of this variable.
Spread: The values are spread almost symmetrically on both sides of the median.
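
The 68% figure can be checked directly instead of being read off the plot. The sketch below defines a hypothetical helper, `share_within_sd()`, and tries it on simulated values drawn with the mean and standard deviation reported above. The simulated data are an assumption used only so the example runs on its own; they are not the AproxArea column.

```r
# Hypothetical helper: proportion of values within k standard deviations of the mean
share_within_sd <- function(x, k = 1) {
  x <- x[!is.na(x)]
  mean(abs(x - mean(x)) <= k * sd(x))
}

# Simulated stand-in for AproxArea, using the mean and SD reported above
set.seed(1)
simulated_area <- rnorm(795, mean = 674.1, sd = 96.66)

# For normally distributed data this should come out near 0.68
print(share_within_sd(simulated_area))
```

On the coursework data the call would be `share_within_sd(energy$AproxArea)`; a result far from 0.68 would be a reason to question the normality claim.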

Central tendency and Spread for WALL AREA variable: This is going to be considered as a continuous scale

#check the summary statistics for this variable
summary(energy$WallArea)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   227.8   285.2   314.6   319.1   344.2   550.8

Visualization

Histograms plus the spread(density)

# The line is the median.
# lwd is an alias of size, so only size is set to avoid a duplicated aesthetic.
wallArea.plot <- ggplot(energy, aes(x = WallArea)) +
  geom_histogram(aes(y = ..density..), col = "red", fill = "grey", binwidth = 20) +
  geom_vline(xintercept = median(energy$WallArea), size = 2) +
  labs(x = "Distribution of wall area", y = "Density") +
  geom_density(col = "blue") + theme_classic()

wallArea.plot

# Dot plots of WallArea with a reference line,
# to check whether the median or the mean is a typical value of this variable
library(cowplot)  # cowplot package for plot_grid()
wallA.plotMedian <- ggplot(energy, aes(x = WallArea)) +
  geom_dotplot(col = "black", fill = "gold", binwidth = 7) +
  labs(x = "Distribution of wall Area", y = "") +
  theme_classic() +
  geom_vline(xintercept = median(energy$WallArea), color = "red", size = 0.7)

# Same plot with the line at the mean
wallA.plotMean <- ggplot(energy, aes(x = WallArea)) +
  geom_dotplot(col = "black", fill = "gold", binwidth = 7) +
  labs(x = "Distribution of wall Area", y = "") +
  theme_classic() +
  geom_vline(xintercept = mean(energy$WallArea), color = "red", size = 0.7)
plot_grid(wallA.plotMedian, wallA.plotMean, labels = "AUTO")  # two panels side by side

#box plot
wallArea.plot<- ggplot(energy, aes(y= WallArea)) + 
  geom_boxplot(col="blue", fill="lightblue") +   
  labs(title="The Wall area in (sqft) for Home energy usage in 2016", x="",y="Average Wall Area in sqft") + 
  theme_classic() 
wallArea.plot

Description

NA: No missing values
Outliers: 5 outliers above about 420 on the y-axis, all of them in the top 25% of the data.
Distribution: Skewed to the right (positive skew); the mean (319.1) lies above the median (314.6).
Typical Values: The values in the dataset are quite centered around the mean, so the mean is representative of the dataset.
Spread: The data is not widely spread and stays close to the mean. It is compact except for the outliers.
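
Box plot outliers are conventionally the points beyond 1.5 times the interquartile range from the quartiles. With the quartiles reported above (285.2 and 344.2), the upper fence is 344.2 + 1.5 × 59.0 = 432.7. The sketch below applies the same rule with a hypothetical helper, `iqr_outliers()`, to made-up values. Counts on real data may differ slightly from the plot, depending on how the quartiles are computed.

```r
# Hypothetical helper: values outside the 1.5 * IQR fences
iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  x[!is.na(x) & (x < q[1] - fence | x > q[2] + fence)]
}

# Made-up wall areas with one clearly extreme value
toy_wall <- c(280, 290, 300, 310, 320, 330, 340, 550)
print(iqr_outliers(toy_wall))
```

On the coursework data, `length(iqr_outliers(energy$WallArea))` would give a count to compare against the 5 points seen in the box plot.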

Central tendency and Spread for ROOF AREA variable: This is going to be considered as a continuous scale

#check the summary statistics for this variable
summary(energy$RoofArea)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   103.6   138.2   207.3   179.0   220.5   238.1

Visualization

roofArea.plot <- ggplot(energy, aes(y= RoofArea)) + 
  geom_boxplot(col="blue", fill="lightblue") +   
  labs(title="The Roof area in (sqft) for Home energy usage in 2016", x="",y="Average roofArea sqft") + 
  theme_classic() 
roofArea.plot

Histogram

# The line is the mean.
# lwd is an alias of size, so only size is set to avoid a duplicated aesthetic.
roofArea.plotMean <- ggplot(energy, aes(x = RoofArea)) +
  geom_histogram(aes(y = ..density..), col = "red", fill = "grey", binwidth = 20) +
  geom_vline(xintercept = mean(energy$RoofArea), size = 2) +
  labs(x = "Distribution of Roof area", y = "Density") +
  geom_density(col = "blue") + theme_classic()

# The line is the median
roofArea.plotMedian <- ggplot(energy, aes(x = RoofArea)) +
  geom_histogram(aes(y = ..density..), col = "red", fill = "grey", binwidth = 20) +
  geom_vline(xintercept = median(energy$RoofArea), size = 2) +
  labs(x = "Distribution of Roof area", y = "Density") +
  geom_density(col = "blue") + theme_classic()

plot_grid(roofArea.plotMean, roofArea.plotMedian, labels = "AUTO")  # two panels side by side

# using a line to check if the median is representative of this variable
roofA.plotMedian <- ggplot(energy, aes(x = RoofArea)) +
  geom_dotplot(col = "black", fill = "gold", binwidth = 4) +
  labs(x = "Distribution of roof Area", y = "") +
  theme_classic() +
  geom_vline(xintercept = median(energy$RoofArea), color = "red", size = 0.2)

# using a line to check if the mean is representative of this variable
roofB.plotMean <- ggplot(energy, aes(x = RoofArea)) +
  geom_dotplot(col = "black", fill = "gold", binwidth = 4) +
  labs(x = "Distribution of roof Area", y = "") +
  theme_classic() +
  geom_vline(xintercept = mean(energy$RoofArea), color = "red", size = 0.2)
plot_grid(roofA.plotMedian, roofB.plotMean, labels = "AUTO")  # two panels side by side

Description

NA: No missing values
Outliers: No outliers
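Distribution: Not bell-shaped; the values are clustered unevenly. The mean (179.0) lies well below the median (207.3), which points to a left (negative) skew.
Typical Values: The values appear to be centered more around the mean than the median, so the mean is taken as representative of the dataset, although the two differ by about 28.
Spread: The interquartile range (138.2 to 220.5) is wide, and most of it lies below the median: Q1 to the median spans about 69, against about 13 from the median to Q3.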
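
The shape of the distribution can be read from the summary statistics alone. As a rough sketch, the code below takes the RoofArea values printed by `summary()` above and compares the two halves of the interquartile range and the position of the mean relative to the median. The variable names are illustrative only.

```r
# Five-number summary and mean of RoofArea, as reported by summary() above
roof <- c(min = 103.6, q1 = 138.2, median = 207.3, q3 = 220.5, max = 238.1)
roof_mean <- 179.0

# Width of each half of the interquartile range
lower_half <- roof[["median"]] - roof[["q1"]]   # Q1 to median
upper_half <- roof[["q3"]] - roof[["median"]]   # median to Q3

# A mean below the median suggests a left skew, a mean above it a right skew
skew_direction <- if (roof_mean < roof[["median"]]) {
  "left"
} else if (roof_mean > roof[["median"]]) {
  "right"
} else {
  "none"
}

print(c(lower_half = lower_half, upper_half = upper_half))
print(skew_direction)
```

The comparison of mean and median is only a rule of thumb; the histogram and dot plots remain the better guide to the actual shape.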

Central tendency and Spread for GLASS AREA variable: This is going to be considered as a continuous variable

#check the summary statistics for this variable
summary(energy$GlassArea)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   32.49   75.71   75.00  112.90  174.93

Visualization

#boxplot
glassArea.plot <- ggplot(energy, aes(y= GlassArea)) + 
  geom_boxplot(col="red", fill="lightblue") +   
  labs(title="The glass area in (sqft) for Home energy usage in 2016", x="glass area ",y="Average glassArea sqft") + 
  theme_classic() 
glassArea.plot

Description

NA: No missing values
Outliers: No outliers
Distribution: Close to a normal distribution
Typical Values: The values are centered around the median (75.71), which is close to the mean (75.00), so the mean is representative of the dataset.
Spread: The data is compact

Central tendency and Spread for Height variable: This is going to be considered as a discrete /categorical variable

#check the summary statistics for this variable
summary(energy$Height)

## high  low 
##  384  411

Visualization

# bar chart of the number of buildings per Height level
ggplot(energy, aes(x = Height)) + 
  geom_bar(fill = "light blue", 
           color="black") +
  labs(x = "Height of building", 
       y = "Frequency", 
       title = "The Height of building for Home energy usage in 2016")

Plot By Percentage

# plot the distribution as percentages
ggplot(energy, 
       aes(x = Height, 
           y = ..count.. / sum(..count..))) + 
  geom_bar() +
  labs(x = "Height of building", 
       y = "Percent", 
       title  = "The Height of building for Home energy usage in 2016") +
  scale_y_continuous(labels = scales::percent)

Description

NA: No missing values
Outliers: No outliers
Common Value: "low" is the most frequent value in the dataset, with 411 of 795 buildings (about 52%), so the two categories are almost evenly split. The split seems plausible, on the assumption that heating will be higher for a low building than for a high building.
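
The share of each category can be computed from the counts rather than estimated from the bar chart. A minimal sketch, using the Height counts printed by `summary(energy$Height)` above:

```r
# Height counts as reported by summary(energy$Height) above
height_counts <- c(high = 384, low = 411)

# Share of each level, as a percentage rounded to one decimal place
height_pct <- round(100 * prop.table(height_counts), 1)
print(height_pct)
```

On the data frame itself, `prop.table(table(energy$Height))` would give the same proportions without typing the counts by hand.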

Central tendency and Spread for Condition variable: This is going to be considered as a discrete /categorical variable

#check the summary statistics for this variable
summary(energy$Condition)

##   A   B   C   D   E 
## 117 135 517   4  22

Visualization

# bar chart of the number of buildings per Condition level
ggplot(energy, aes(x = Condition)) + 
  geom_bar(fill = "light blue", 
           color="black") +
  labs(x = "Condition of building", 
       y = "Frequency", 
       title = "The Condition of building that affects Home energy usage in 2016")

# plot the distribution as percentages
ggplot(energy, 
       aes(x = Condition, 
           y = ..count.. / sum(..count..))) + 
  geom_bar() +
  labs(x = "Condition of building", 
       y = "Percent", 
       title  = "The Condition of building for Home energy usage in 2016") +
  scale_y_continuous(labels = scales::percent)

# Basic piechart for the Condition dataset
library(plyr)

## ------------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## ------------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

## The following object is masked from 'package:here':
## 
##     here

# count the number of buildings per condition (plyr's count() returns columns x and freq)
pieData <- count(energy$Condition)
# rename the columns
names(pieData) <- c("Condition", "frequency")
pieData

# order data by condition (important for placing labels later on)
pieData <- arrange(pieData, desc(Condition))
# create new column with position for label
pieData <- mutate(pieData, positionLabel = cumsum(frequency) - 0.5 * frequency)
# creating plot
pieData.plot <- ggplot(pieData,  aes(x="", y= frequency, fill = Condition)) +
      geom_bar(width = 1, stat="identity") +
      coord_polar("y", start=0) +
      geom_text(aes(y = positionLabel, label = frequency)) + 
     scale_fill_manual(values = c("red","blue", "lightblue", "purple", "grey")) 

pieData.plot

Description

NA: No missing values
Outliers: No outliers
Common Value using Mode: Condition C is the most frequent value in the dataset, with 517 of 795 buildings (about 65%); the other four conditions share the remaining 35%. One possible explanation is that condition C (average condition) is the house type that middle-class households can afford, so more heating is recorded for it.

Central tendency and Spread for Orientation variable: This is going to be considered as a discrete /categorical variable

#check the summary statistics for this variable
summary(energy$Orientation)

##   E   N   S   W 
## 202 197 195 201

Visualization

# Basic pie chart for the Orientation variable

# count the number of buildings per orientation (plyr's count() returns columns x and freq)
pieData <- count(energy$Orientation)
# rename the columns
names(pieData) <- c("Orientation", "frequency")
pieData

# order data by orientation (important for placing labels later on)
pieData <- arrange(pieData, desc(Orientation))
# create new column with position for label
pieData <- mutate(pieData, positionLabel = cumsum(frequency) - 0.5 * frequency)
# creating plot
pieData.plot <- ggplot(pieData,  aes(x="", y= frequency, fill = Orientation)) +
      geom_bar(width = 1, stat="identity") +
      coord_polar("y", start=0) +
      geom_text(aes(y = positionLabel, label = frequency)) + 
     scale_fill_manual(values = c("red","blue", "lightblue", "purple", "grey")) +
    # removing outer "ring"
      theme_void()


pieData.plot

Description

NA: No missing values
Outliers: No outliers
Common Value using Mode: East (E) is the most frequent value with 202 buildings, but the other categories are almost as frequent (W 201, N 197, S 195). The variable is therefore well balanced, not biased or imbalanced. Whether orientation affects the heating load or cooling load cannot be read from these counts alone; it would have to be checked against the responses.
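
R has no built-in function for the statistical mode, so the most frequent level is usually taken from a frequency table. The helper below, `mode_level()`, is a hypothetical name introduced for this sketch; the Orientation column is rebuilt from the counts printed above so that the example runs without the data file.

```r
# Hypothetical helper: most frequent level(s) of a categorical vector
mode_level <- function(x) {
  counts <- table(x)
  names(counts)[as.vector(counts == max(counts))]
}

# Rebuild the Orientation column from the counts reported above
orientation <- factor(rep(c("E", "N", "S", "W"), times = c(202, 197, 195, 201)))
print(mode_level(orientation))
```

The helper returns every level that ties for the highest count, which matters for a nearly balanced variable like this one.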

Central tendency and Spread for Heating Load dependent variable: This is going to be considered as a continuous variable

#check the summary statistics for this variable
summary(energy$HeatingLoad)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.01   12.93   17.50   21.98   30.59   43.10

Visualization

# using a line to check if the median is representative of this variable
heatLoadA.plotMedian <- ggplot(energy, aes(x = HeatingLoad)) +
  geom_dotplot(col = "black", fill = "gold", binwidth = 1) +
  labs(x = "Distribution of HeatingLoad data", y = "") +
  theme_classic() +
  geom_vline(xintercept = median(energy$HeatingLoad), color = "red", size = 0.7)

# using a line to check if the mean is representative of this variable
heatLoadB.plotMean <- ggplot(energy, aes(x = HeatingLoad)) +
  geom_dotplot(col = "black", fill = "gold", binwidth = 1) +
  labs(x = "Distribution of HeatingLoad data", y = "") +
  theme_classic() +
  geom_vline(xintercept = mean(energy$HeatingLoad), color = "red", size = 0.7)
plot_grid(heatLoadA.plotMedian, heatLoadB.plotMean, labels = "AUTO")  # two panels side by side

#boxplot on heating load object
hl.plot <- ggplot(energy, aes(y= HeatingLoad)) + 
  geom_boxplot(col="black", fill="lightblue") +   
  labs(title="The heating Load for Homes in 2016", x="heating ",y="Average heating Load in British Thermal Unit") + 
  theme_classic() 
hl.plot

Description

NA: 4 missing values, replaced with the mean during data preparation
Outliers: No outliers
Distribution: Skewed to the right (positive skew); the mean (21.98) lies above the median (17.50).
Typical Values: The values appear to be centered more around the mean than the median, so the mean is taken as representative of the dataset.
Spread: The data is more widely spread above the median. Within the interquartile range, the median to Q3 spans about 13.1 (17.50 to 30.59), against about 4.6 from Q1 (12.93) to the median.

Central tendency and Spread for Cooling Load dependent variable: This is going to be considered as a continuous variable

#check the summary statistics for this variable
summary(energy$CoolingLoad)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.90   15.49   21.33   24.28   32.92   48.03

Visualization

#boxplot on cooling load object
cl.plot <- ggplot(energy, aes(y= CoolingLoad)) + 
  geom_boxplot(col="black", fill="lightblue") +   
  labs(title="The Cooling Load for Homes in 2016", x="cooling ",y="Average cooling Load in British Thermal Unit") + 
  theme_classic() 
cl.plot

#using line to check if the mean is representative of this dataset
cLoad.plotMean <- ggplot(energy, aes(x= CoolingLoad)) + 
       geom_dotplot(col="black", fill="gold" , binwidth= 0.5) +  
       labs(x="Distribution of Cooling Load data", y="") + 
       theme_classic() +
       geom_vline(xintercept = 24.28, color = "red", size=0.7) 

cLoad.plotMean

Description

NA: No missing values
Outliers: No outliers
Distribution: The data is skewed to the right (positive skew).
Typical Values: The values appear to be centered more around the mean (24.28) than the median (21.33).
Spread: The data is most widely spread in the top 25%, which runs from 32.92 to 48.03. Within the interquartile range, the spread is also larger above the median (about 11.6 up to Q3) than below it (about 5.8 down to Q1).

# drop the rows with Condition D and remove the now-empty factor level
noDCondition <- droplevels(energy[energy$Condition != "D", ])
table(noDCondition$Condition)

## 
##   A   B   C   E 
## 117 135 517  22

From the results, condition C appears to be the one most people choose; the assumption is that people who live in condition C can afford the heating load. Few people are in condition A, on the assumption that the heating load will be excessive because the building is new and will have more, and newer, heating technologies. So from the data, D is dropped because it is not consistent with the data that appears to be a

datasetNumber <- 1:18
coolingLoad <- c(23, 24, 23, 25, 24, 23, 26, 24, 23, 25, 24, 23, 26, 22, 25, 25, 22, 24)
combine <- data.frame(datasetNumber, coolingLoad)
print(combine)

##    datasetNumber coolingLoad
## 1              1          23
## 2              2          24
## 3              3          23
## 4              4          25
## 5              5          24
## 6              6          23
## 7              7          26
## 8              8          24
## 9              9          23
## 10            10          25
## 11            11          24
## 12            12          23
## 13            13          26
## 14            14          22
## 15            15          25
## 16            16          25
## 17            17          22
## 18            18          24

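# one-sample, two-sided t-test of the mean cooling load against 23.5 at the 99% level
# (paired is dropped because it has no meaning for a one-sample test)
t.test(x = combine$coolingLoad, alternative = "two.sided", mu = 23.5, conf.level = 0.99)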

## 
##  One Sample t-test
## 
## data:  combine$coolingLoad
## t = 1.5567, df = 17, p-value = 0.138
## alternative hypothesis: true mean is not equal to 23.5
## 99 percent confidence interval:
##  23.11696 24.77193
## sample estimates:
## mean of x 
##  23.94444

# shapiro.test(combine$coolingLoad)
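
The figures in the t-test output above can be reproduced by hand, which shows where each one comes from. The sketch below recomputes the t statistic, the two-sided p-value and the 99% confidence interval from the same 18 readings; the variable names are illustrative only.

```r
# Same 18 cooling-load readings as above
cooling <- c(23, 24, 23, 25, 24, 23, 26, 24, 23, 25, 24, 23, 26, 22, 25, 25, 22, 24)
mu0 <- 23.5                 # hypothesised mean
n <- length(cooling)

# t = (sample mean - mu0) / standard error of the mean
std_error <- sd(cooling) / sqrt(n)
t_stat <- (mean(cooling) - mu0) / std_error

# Two-sided p-value and 99% confidence interval with n - 1 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 1)
ci_99 <- mean(cooling) + c(-1, 1) * qt(0.995, df = n - 1) * std_error

print(round(c(t = t_stat, p = p_value), 4))
print(round(ci_99, 5))
```

Since the p-value of 0.138 is above 0.01 and 23.5 lies inside the 99% confidence interval, the test does not reject the hypothesis that the true mean cooling load is 23.5.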

DATA VISUALISATION AND ANALYSIS COURSEWORK

Nnamdi Osuagwu

3/29/2021
