We are living in an unprecedented time of global collaboration, which gives us the ability to address common concerns that affect all of us. This publication explores these concerns using scientific evidence, exploratory data analysis, and data science to recognize opportunities, along with incentives and policies to guide solutions.
This document was initially produced as a series of smaller segments of analysis, then compiled to create a holistic understanding of, and a solution for, a much more complex concern. Some of these sections reference previously researched topics, which can now be connected using data science and software programming. The objectives of this publication are to address:
This publication provides an overview of industry trends that should be considered for future energy initiatives, as well as historic trends in greenhouse gas emissions, the carbon budget, and its relationship to carbon dioxide emissions from fuel sources. There are opportunities to enhance our environmental conditions by increasing the efficiency and performance of existing energy systems, as well as by introducing new solutions for the future. The environment and energy are complex systems, and this publication provides an overview of some of the important aspects while recognizing that there are many more details to explore.
Data science, analytics, data engineering, machine learning, and big data, combined with software applications, programming libraries, and languages, provide the ability to showcase the data, transform it, and discover innovative opportunities. A variety of mathematical, statistical, and scientific methods have been applied to the referenced data sources to connect information logically: considering the environment and existing energy use, identifying opportunities for improvement, then implementing a proposed solution to determine the outcome. The result seeks to recognize economic and environmental savings over the lifecycle of these investments.
Exploratory data analysis (EDA) has been performed on sections of the referenced data sources, while others are analyzed in more depth within this document using data science methods known as supervised and unsupervised learning. Examples include time-series analysis, correlation, clustering, and machine learning to discover insights.
The first step begins with downloading the data and the appropriate data science, analysis, and statistical packages, libraries, and configuration settings. (If your intention is to run the code against the data, you will need to install additional programs such as R from CRAN, and the download location for the data files will need to be configured; a setup sketch follows the package list below.)
Install R data science and analysis packages as required
Load Packages
library("bigmemory")
library("car")
library("caret")
library("circlize")
library("cluster")
library("corrplot")
library("cowplot")
library("data.table")
library("dendextend")
library("dplyr")
library("dslabs")
library("dtwclust")
library("dygraphs")
library("e1071")
library("factoextra")
library("FactoMineR")
library("formatR")
library("GGally")
library("gganimate")
library("ggcorrplot")
library("ggeasy")
library("ggplot2")
library("ggraph")
library("ggrepel")
library("gplots")
library("grid")
library("gridExtra")
library("Hmisc")
library("hrbrthemes")
library("htmltools")
library("igraph")
library("kableExtra")
library("lubridate")
library("magrittr")
library("openair")
library("PerformanceAnalytics")
library("plotly")
library("png")
library("randomForest")
library("RColorBrewer")
library("reshape")
library("scales")
library("tidyr")
library("tidyverse")
library("TSclust")
library("xts")
library("mlbench")
library(dplyr, warn.conflicts = FALSE)
options(dplyr.summarise.inform = FALSE)
options(digits=3)
Access Data Files and Configure the Working Directory
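Since no two environments are identical, the sketch below shows one way to perform this setup. It assumes the data files live in a local 01Data/ folder; the package subset and paths are illustrative examples rather than required values.
# One-time setup sketch (paths and package list are examples; adjust per machine)
required_pkgs <- c("tidyverse", "data.table", "corrplot", "cluster", "openair")
new_pkgs <- required_pkgs[!required_pkgs %in% installed.packages()[, "Package"]]
if (length(new_pkgs) > 0) install.packages(new_pkgs)
# Point R at the folder containing the 01Data directory
setwd("~/energy-analysis")                      # example location, not a fixed path
file.exists("01Data/01-greenhouse_gases.txt")   # should return TRUE when configured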
We spend nearly 90% of our time indoors, though this varies by geography and occupation.1 Although we enjoy the benefits of the buildings we occupy, they have the potential to operate more efficiently and provide us with a higher level of comfort. According to the United States Department of Energy, 40% of all primary energy use and 76% of electricity use is consumed by the building sector.2 Although there are other demands for energy use, this introduction provides an overview of historical conditions that have influenced existing infrastructure, as well as opportunities for future energy efficiencies.
The decisions we are making to improve our global living conditions are guided by past, present, and forecasted data. The trends within this data are important to understand so that we can be proactive with our decisions today and into the future. International scientists have recognized that the planet’s surface temperature is increasing as greenhouse gas (GHG) emissions increase, alongside the rise of carbon dioxide (CO2) emissions, which will be referenced in subsequent data and visualizations.3,4,5 The increase of CO2 emissions has an impact on environmental systems, human health, socioeconomic conditions, politics, and other factors.
Although this publication does not address specific human health conditions, health professionals, world organizations, and many others have recognized the human health impact from greenhouse gases and climate change.6,7,8
The World Health Organization has indicated that climate change is the single biggest health threat facing humanity, citing catastrophic health impacts such as threats to clean air, safe drinking water, and sufficient food, as well as heat stress and others.6 Physicians have recognized concerns such as population displacement and the increase of infectious disease.7 The National Institute of Environmental Health Sciences (NIEHS) indicates that climate-related hazards include biological, chemical, and physical stresses that can vary by location, time, population, and severity, which are referred to as exposure pathways.8 These exposure pathways range from extreme heat, air quality, flooding, vector-borne infection, and water-related infection to mental health and others.
Image: Simplified diagram of the ecological effects caused by nitrogen and sulfur air pollution9
These health repercussions are important to recognize, and they are the motivation for stimulating change within our environment to improve human health conditions.
Although the intent of this report is focused on energy use and consumption, it is important to recognize that a number of risks can damage the reliability of energy sources. These risks may include natural or human-created threats, such as earthquakes and flooding, some of which can be exacerbated by climate change.
According to the U.S. Geological Survey, increased global surface temperatures have the ability to increase droughts and storm intensity, as well as cause the sea level to rise.10
The following datasets provide historic to current climate data to better understand the significance of global trends.
The National Oceanic and Atmospheric Administration (NOAA) and Carbon Dioxide Information Analysis Center (CDIAC) provide valuable data to help us better understand these global trends: 3,4,5
Import and Review NOAA Temperature and Emissions Data
# Import data
# Law Dome Ice Core 2000-Year CO2, CH4, and N2O Data
greenhouse_gases <- read.table("01Data/01-greenhouse_gases.txt",header=TRUE,sep=" ")
# Trends in Atmospheric Carbon Dioxide. Internal CRAN library DSLABS
historic_co2 <- read.table("01Data/02-historic_co2.txt",header=TRUE,sep=" ")
# Antarctic Ice Cores Revised 800KYr CO2 Data
temp_carbon <- read.table("01Data/03-temp_carbon.txt",header=TRUE,sep=" ")
Summary statistics showing the range of values for the data sets:
# Summary Statistics
summary(greenhouse_gases)
## year gas concentration
## Min. : 20 Length:300 Min. : 260
## 1st Qu.: 515 Class :character 1st Qu.: 270
## Median :1010 Mode :character Median : 280
## Mean :1010 Mean : 416
## 3rd Qu.:1505 3rd Qu.: 641
## Max. :2000 Max. :1703
summary(historic_co2)
## year co2 source
## Min. :-803182 Min. :178 Length:694
## 1st Qu.:-470498 1st Qu.:207 Class :character
## Median : -43278 Median :237 Mode :character
## Mean :-219753 Mean :246
## 3rd Qu.: -8924 3rd Qu.:272
## Max. : 2018 Max. :409
summary(temp_carbon)
## year temp_anomaly land_anomaly
## Min. :1751 Min. :-0.4 Min. :-0.7
## 1st Qu.:1818 1st Qu.:-0.2 1st Qu.:-0.3
## Median :1884 Median : 0.0 Median : 0.0
## Mean :1884 Mean : 0.1 Mean : 0.1
## 3rd Qu.:1951 3rd Qu.: 0.3 3rd Qu.: 0.3
## Max. :2018 Max. : 1.0 Max. : 1.5
## NA's :129 NA's :129
## ocean_anomaly carbon_emissions
## Min. :-0.5 Min. : 3
## 1st Qu.:-0.2 1st Qu.: 14
## Median : 0.0 Median : 264
## Mean : 0.1 Mean :1523
## 3rd Qu.: 0.3 3rd Qu.:1432
## Max. : 0.8 Max. :9855
## NA's :129 NA's :4
The temp_carbon data includes annual mean global temperature anomalies since the year 1880, as well as annual global carbon emissions since 1751; both are ongoing measurements.5
# Data plots
temp <- temp_carbon %>%
  filter(year > 1880) %>%
  ggplot(aes(x = year, y = temp_anomaly, color = year)) +
  geom_line() +
  scale_color_gradient(low = "blue", high = "red") +
  ggtitle("Temperature Anomaly") +
  ylab("Temperature Anomaly") +
  theme_light() +
  # theme() tweaks come after theme_light(), which would otherwise reset them
  theme(plot.title = element_text(size = 12))
# Land Plot
land <- temp_carbon %>%
  filter(year > 1880) %>%
  ggplot(aes(x = year, y = land_anomaly, color = year)) +
  geom_line() +
  scale_color_gradient(low = "blue", high = "red") +
  ggtitle("Land Anomaly") +
  ylab("Land Anomaly") +
  theme_light() +
  theme(plot.title = element_text(size = 12))
# Ocean Plot
ocean <- temp_carbon %>%
  filter(year > 1880) %>%
  ggplot(aes(x = year, y = ocean_anomaly, color = year)) +
  geom_line() +
  scale_color_gradient(low = "blue", high = "red") +
  ggtitle("Ocean Anomaly") +
  ylab("Ocean Anomaly") +
  theme_light() +
  theme(plot.title = element_text(size = 12))
# Carbon
carbon <- temp_carbon %>%
  ggplot(aes(x = year, y = carbon_emissions, color = year)) +
  geom_line() +
  scale_color_gradient(low = "blue", high = "red") +
  ggtitle("Carbon Emissions") +
  ylab("Carbon Emissions") +
  theme_light() +
  theme(plot.title = element_text(size = 12))
# Use 'top' to display the panel title (grid.arrange's 'name' is not a title)
grid.arrange(temp, land, ocean, carbon, ncol = 2, nrow = 2,
             top = "Anomalies and Emissions")
These anomalies show a positive trend, with marked increases since 1960, while global carbon emissions have continually increased with only slight year-to-year variation.
This NOAA dataset indicates the concentrations of the three main greenhouse gases: carbon dioxide (CO2), methane (CH4), and nitrous oxide (N2O). Measurements are from the Law Dome Ice Core in Antarctica, with selected measurements provided every 20 years from 1-2000 CE.3 This data visualization helps us understand how these concentrations have changed over time:
greenhouse_gases %>%
ggplot(aes(year, concentration)) +
geom_line(color = "red") +
facet_grid(gas ~ ., scales = "free") +
geom_vline(xintercept = 1850) +
ylab("Concentration (CH4, CO2, & N2O ppm)") +
ggtitle("Atmospheric Greenhouse Gas Concentration, 0-2000") +
theme_light()
Greenhouse gases have also increased over time, rising exponentially since approximately 1880. As we continue to investigate this data, it is important to recognize the challenges that this information reveals.
This dataset, obtained from NOAA, includes the concentration of carbon dioxide in ppm by volume from direct measurements at Mauna Loa, Hawaii (1959-2021) and indirect measurements from a series of Antarctic ice cores (approx. 800,000 BCE - 2001 CE).4 Global carbon dioxide trends are also available and reflect similar patterns; however, the record presented here has been collected over a much longer duration.
co2_time <- historic_co2 %>%
  ggplot(aes(year, co2)) +
  geom_line(color = "red") +
  ggtitle("Atmospheric CO2 concentration, -800,000 BC to today") +
  xlab("Year") +
  ylab("CO2 (ppmv)") +
  theme_light() +
  # applied after theme_light() so the axis-label rotation is not reset
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
co2_time
This dataset indicates long cyclical fluctuations over geologic time; however, we have not experienced a sustained downward trend since the year 272, indicating that CO2 follows an upward trend and has increased exponentially in the modern era.
Utilizing the Global Carbon Budget data set, we can review emissions generated by various sources, which are measured in million tons of carbon per year (MtC/yr).
The full global carbon budget is calculated by the following formula:11 \[E_{FOSSIL} + E_{LUC} = G_{ATM} + S_{OCEAN} + S_{LAND} + B_{IM}\] Where the variables that are reported annually include:
The growth rate for emissions is calculated by:11
\[\frac{E_{FOSSIL}(t_{0}+1) - E_{FOSSIL}(t_{0})}{E_{FOSSIL}(t_{0})} \times 100\%\]
Global Carbon Budget Dataset12
# Import Data
global_carbon <-
read.table("01Data/04-Global_carbon_budget_2021.txt",header=TRUE,sep=",")
#Reviewing the data set
# str(global_carbon)
# Summary Statistics
prettyNum(summary(global_carbon))
## year fossil coal
## Min. :1959 Min. : 2417 Min. :1345
## 1st Qu.:1974 1st Qu.: 4655 1st Qu.:1611
## Median :1990 Median : 6137 Median :2348
## Mean :1990 Mean : 6197 Mean :2451
## 3rd Qu.:2005 3rd Qu.: 8012 3rd Qu.:3112
## Max. :2020 Max. :10016 Max. :4111
## oil gas cement
## Min. : 793 Min. : 207 Min. : 40
## 1st Qu.:2238 1st Qu.: 597 1st Qu.: 91
## Median :2508 Median :1013 Median :135
## Mean :2409 Mean :1039 Mean :182
## 3rd Qu.:3004 3rd Qu.:1464 3rd Qu.:258
## Max. :3337 Max. :2062 Max. :444
## flaring other per_capita
## Min. : 23.4 Min. : 2.3 Min. :0.81
## 1st Qu.: 56.2 1st Qu.:12.9 1st Qu.:1.11
## Median : 73.6 Median :41.1 Median :1.14
## Mean : 74.6 Mean :40.4 Mean :1.14
## 3rd Qu.: 99.0 3rd Qu.:65.1 3rd Qu.:1.22
## Max. :118.5 Max. :83.0 Max. :1.34
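As a worked illustration of the growth-rate formula above, the following sketch computes the year-over-year fossil emissions growth rate from the global_carbon data frame, using the year and fossil columns shown in the summary:
# Year-over-year fossil emissions growth rate (%), per the formula above
fossil_growth <- global_carbon %>%
  arrange(year) %>%
  mutate(growth_pct = (fossil - lag(fossil)) / lag(fossil) * 100)
tail(fossil_growth[, c("year", "fossil", "growth_pct")])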
Carbon Dioxide Emissions by Fuel Type
# Pass the data frame and aes() directly; ggplot() has no 'order' argument,
# which previously triggered the "cannot xtfrm data frames" warning
gct <- ggplot(global_carbon, aes(x = year)) +
  geom_area(aes(y = fossil, fill = "Fossil"), colour = "black",
            size = 0.7, alpha = 0.7) +
  geom_area(aes(y = coal, fill = "Coal"), colour = "black",
            size = 0.7, alpha = 0.7) +
  geom_area(aes(y = oil, fill = "Oil"), colour = "black",
            size = 0.7, alpha = 0.7) +
  geom_area(aes(y = gas, fill = "Gas"), colour = "black",
            size = 0.7, alpha = 0.7) +
  geom_area(aes(y = cement, fill = "Cement"), colour = "black",
            size = 0.7, alpha = 0.7) +
  geom_area(aes(y = flaring, fill = "Flaring"), colour = "black",
            size = 0.7, alpha = 0.7) +
  scale_fill_brewer(palette = "Spectral", name = "Fuel Type") +
  xlab("Year") + ylab("Megatonnes of Carbon") +
  ggtitle("Carbon Dioxide Emissions by Fuel Type") +
  theme_light()
gct
This time-series visualization indicates an upward trend for CO2 emissions from each type of fuel source since 1960, with total fossil emissions producing the largest amount and flaring emissions the lowest.
Each country owns a unique energy portfolio, consisting of coal, oil, gas, nuclear, hydropower, wind, solar, and other renewables.
Many sources of energy are used around the world, with fossil fuels supplying approximately 84% of global energy in 2021.13 This referenced dataset will help us gain a better understanding of global energy use.
World Energy Distribution (1965 - 2021)14
# Import Data
world_energy <- read.table("01Data/world_energy.csv",header=TRUE,sep=",")
#Reviewing the data set
# str(world_energy)
# Summary Statistics
prettyNum(summary(world_energy))
## year coal oil
## Min. :1965 Min. :4367 Min. :5387
## 1st Qu.:1979 1st Qu.:4553 1st Qu.:6878
## Median :1993 Median :4738 Median :6978
## Mean :1993 Mean :5025 Mean :7068
## 3rd Qu.:2007 3rd Qu.:5647 3rd Qu.:7134
## Max. :2021 Max. :6251 Max. :8482
## gas nuclear hydropower wind
## Min. :1888 Min. : 22 Min. : 817 Min. : 0
## 1st Qu.:3129 1st Qu.: 422 1st Qu.:1144 1st Qu.: 0
## Median :3657 Median : 906 Median :1206 Median : 3
## Mean :3667 Mean : 774 Mean :1204 Mean : 83
## 3rd Qu.:4278 3rd Qu.:1111 3rd Qu.:1298 3rd Qu.: 72
## Max. :5127 Max. :1202 Max. :1464 Max. :619
## solar otherrenewables
## Min. : 0 Min. : 16.8
## 1st Qu.: 0 1st Qu.: 33.0
## Median : 0 Median : 75.0
## Mean : 29 Mean : 98.3
## 3rd Qu.: 3 3rd Qu.:136.0
## Max. :343 Max. :301.3
Global Energy Consumption by Source 202013,14
# As above, pass the data frame and aes() directly instead of the invalid 'order' argument
EnergyDist <- ggplot(world_energy, aes(x = year)) +
  geom_area(aes(y = coal, fill = "Coal"), colour = "black",
            size = 0.7, alpha = 0.7) +
  geom_area(aes(y = oil, fill = "Oil"), colour = "black",
            size = 0.7, alpha = 0.7) +
  geom_area(aes(y = gas, fill = "Gas"), colour = "black",
            size = 0.7, alpha = 0.7) +
  geom_area(aes(y = nuclear, fill = "Nuclear"), colour = "black",
            size = 0.7, alpha = 0.7) +
  geom_area(aes(y = hydropower, fill = "Hydro"), colour = "black",
            size = 0.7, alpha = 0.7) +
  geom_area(aes(y = wind, fill = "Wind"), colour = "black",
            size = 0.7, alpha = 0.7) +
  geom_area(aes(y = solar, fill = "Solar"), colour = "black",
            size = 0.7, alpha = 0.7) +
  geom_area(aes(y = otherrenewables, fill = "Other renewables"), colour = "black",
            size = 0.7, alpha = 0.7) +
  scale_fill_brewer(palette = "Spectral", name = "Fuel Type") +
  xlab("Year") + ylab("Energy per capita (kWh equivalent)") +
  ggtitle("World Energy Consumption by Type") +
  theme_light()
EnergyDist
This same dataset is also shown in the visualization below, which quantifies total energy use while indicating the diversity of energy consumption by country in 2021.13 A complete list of countries and their energy distribution is available for review in the cited data source.
Within the United States, the following diagram indicates the pathways
for energy mix by source and the end-use sector used in
2021.11
Many initiatives have taken place across business lines and sectors to create efficiency in systems. The industrial sector is one of the most energy intensive of all, using nearly 35% of total energy consumption.15 Buildings require materials, systems, components, and more to be constructed. Improvements have been made to increase the efficiency of manufacturing physical plant as well as many mechanical systems, including engineered solutions for machining, pumping systems, compressed air, motors, fuel and steam-based process heating systems, waste recovery, and other innovations.16
A tremendous amount of effort by organizations, governments, and individuals has influenced and developed policies, research, and technology innovations. Collaboration within the industry has sought to improve energy resilience and environmental conditions through the diversification of sources and the creation of efficiency, while improving infrastructure and reducing energy demands and greenhouse emissions.
Innovations within the renewable industry have supported energy targets to reduce emissions while pursuing this balanced approach. Organizations have been formed, from the global to the local level, to influence the effective development of energy infrastructure, facilities, and site development. Global standards have been established that prescribe minimum performance requirements, such as building codes and criteria, as well as European Standards.17,18 Additional programs include the International Renewable Energy Agency (IRENA), the United Nations sustainable development goals, and initiatives such as the 2030 Challenge, which are integrated efforts pursued between public and private entities.19,20,21
Within the United States, there are many federal, state, and commercially operated programs which guide energy initiatives. Within the building and community planning and construction industry, the United States Green Building Council’s (USGBC) Leadership in Energy and Environmental Design (LEED) and additional programs seek to guide an integrative process to reduce energy consumption and environmental impacts.22 These programs provide guidelines which prescribe and evaluate performance metrics, as well as provide financial incentives and certification for building efficiency, operations, and other characteristics. This program works in alignment with other industry standards, such as ANSI/ASHRAE/IES Standard 90.1-2019 - Energy Standard for Buildings.23
Another initiative is the Energy Star® program, which has improved the efficiency of technologies across a broad range of industries, from electronics to building products and many other innovative solutions.24
Organizations approach efficiency opportunities from many different perspectives. The vehicle manufacturer Toyota introduced the initial “Toyota Production System”, followed by “The Toyota Way”, which has streamlined efficiency improvements with a focus on continual improvement.25 This model is also associated with lean six sigma management, lean manufacturing, and just-in-time (JIT) production or JIT manufacturing. These systems seek to address wastes originating from overproduction, waiting time, transportation, processing, excess inventory, movement, product defects, and underutilization.
The biopharmaceutical company Pfizer has made significant organizational impacts on GHG reduction goals since 2000, implementing more than 4,000 GHG reduction projects. The company reduced GHG emissions by 16% between 2000 and 2007, while developing target metrics for additional reductions. Their trajectory is to reduce emissions by 60% to 80% by 2050 (stated in 2000).26 This is a significant commitment from an organization that understands the science and is applying it, while setting standards for organizations around the world.
Another example comes from Caterpillar, which designs, develops, engineers, manufactures, markets, and sells machinery (amongst other products). Along with making commitments to reduce GHG emissions for facilities, they have focused on reengineering some of their product lines to reduce emissions.27
The energy consumed by the supply chain is the energy input from all suppliers to produce a product. Industrial management decisions weigh whether it is more affordable to produce locally or to outsource products. There are several general strategies and models for supply chains, promoting various factors such as efficiency, speed, continuous flow, agility, customer configuration, and/or flexibility.28 In general, the objective is to deliver products from facilities, using various types of transportation along routes to their destination.
The approach for a more sustainable supply chain has been influenced by sourcing suppliers that adhere to social, ethical, and environmental standards, and which request the same from their suppliers.29 This in turn creates a cascading effect that promotes these practices. The Responsible Business Alliance (RBA) was established to promote this sphere of influence for continual improvement along supply chains.30
There are a multitude of success stories and organizations that are
focused on improving quality, creating efficiency, conserving resources,
while decreasing energy consumption to produce value.
The American Institute of Architects performed a study in 2013 of over 1,100 projects, which reflected that the use of energy modeling has the ability to reduce energy consumption by 44% in comparison to the 2003 building stock.31 Collaboration between government entities and industries has produced investments in various initiatives for both open-source and commercial modeling programs such as EnergyPro, EnergyPlus, IES, and others to simulate energy use for short- and long-term energy savings.32,33,34
Supplementing these collaborative efforts to analyze energy,
additional open-source and commercial software such as R, Python, and
others can strengthen research and solutions.35,36
Image: Building Simulation Modeling, Reverse Solar Envelope Method37
Energy consumption is associated with a variety of sources, which provides the ability to explore efficiency opportunities. The Energy Information Administration (EIA) has provided the following information on projected trends in energy use for a variety of commercial products:38
# Import Data
electric <-
read.table("01Data/05-Comm_purchased_elec_intensity.txt",
header=TRUE,sep=",")
Anticipated Changes in Energy Consumption by Device
ec <- ggparcoord(electric, columns = c(3, 8, 13, 18, 23, 28, 33), groupColumn = 1,
                 order = "allClass", showPoints = TRUE,
                 title = "Commercial Purchased Electricity Intensity",
                 alphaLines = 0.3) +
  scale_color_viridis_d() +  # discrete viridis scale from ggplot2 itself
  theme(plot.title = element_text(size = 12)) +
  theme(axis.text.x = element_text(angle = 90)) +
  xlab("Year") +
  ylab("Change in kW per sq ft")
ec
Technology innovations continually introduce efficiencies to systems, and this energy intensity visualization suggests that certain types of building systems are expected to offer greater efficiency opportunities in the future than others.
Industries seek to implement innovative energy and technology solutions into projects, while research and development within various sectors continue to push the boundaries with a variety of technological advancements. Significant improvements have been made in energy technologies, material sciences, natural sciences, construction methods, environmental innovations, and others.
Materials science has matured for technologies such as photovoltaics and energy and hydrogen storage, with solutions driven by material costs, innovations in nanotechnology, and a better understanding of the practical implications of using these materials.39
The International Renewable Energy Agency (IRENA) has focused on a multitude of solutions including matching renewable energy generation and demand over large distances using supergrids, optimizing distribution systems, utility-scale battery solutions and more.19
Researchers at the Massachusetts Institute of Technology (MIT) have also been focusing on innovations such as harnessing energy from waves, new solar cell materials, battery storage, and others.40 In addition, Georgia Tech researchers have focused on producing electricity from nuclear waste, novel generators, and radio wave recycling (collecting kinetic, solar, electromagnetic, and vibration energy from ambient sources).41
Governments, legislators, private industry, energy companies, and many of the cooperatives that own power grids have historically worked, and continue to work, collaboratively to provide a resilient and reliable power grid. Organizations such as the U.S. Government Accountability Office have identified opportunities to deploy energy storage and solar technologies, implement cybersecurity standards, and support local utilities following a disaster.42 Additional opportunities have been identified for future development across a spectrum of energy storage, smart grids, and electricity generation.43
Although some of these may not become mainstream for decades or more, many existing technologies have been integrated into energy system solutions, with the ability to be replaced by more efficient system components as they are developed. There is an abundance of opportunity for increasing efficiency within energy systems and infrastructure.
This section includes a technical overview of systems associated with energy infrastructure, which is necessary to understand for the concluding data analysis sections.
Buildings serve a variety of purposes to facilitate business functions and operations. The development of building solutions requires management, engineering, scientific, architectural design, legal processes, standards, policies, best practices, and many others to bring solutions into the built environment using an integrative framework.
This section briefly introduces the concepts of information systems, including industrial control systems (ICS) and intelligent building systems (IBS), which are most commonly used within energy infrastructure. Both collectively and independently, these systems are increasingly interoperating, creating both solutions and challenges within the industry. Each unique category of buildings serves one or more business functions which require information systems, including ICS, IBS, and others. Consider the healthcare business function: a large metropolitan hospital may require both an ICS and IBS, whereas a smaller clinic may only require an IBS for simpler functions to serve its requirements.
The smart grid is modernizing the 20th-century electrical grid for the 21st century, iteratively making progress through cycles of innovation. The National Institute of Standards and Technology (NIST) is a primary leader in bringing manufacturers, consumers, energy providers, and regulators together to accelerate the development of secure interoperability standards. Building automation and management systems are connected to smart grids, which can influence the way in which they operate.
This diagram shows a high-level overview of the Smart Grid Architecture Model (SGAM), which maps the relationships between conceptual, logical, and physical architecture.44
The function of the energy source (category, sub-category) will guide the type of industrial control system requirements to operate the facility, which are often shaped by stakeholders or owner(s), regulations, policies, and other factors. The process of integrating ICS results in many different types that uniquely serve the needs of their business function(s).
Information systems are integrated into our infrastructure, including buildings, sites, transportation, and many others. These systems can be used independently or used in combination with one another, and are increasingly reliant on centralized monitoring and control.
Organizations have a variety of business lines, functions, and requirements that support their operational objectives. These objectives guide the types of information systems that are integrated into their sites, facilities, and may be locally or remotely managed. These systems, or systems of systems, may include one or more development lifecycles that generally occur as a combination of internally performed work, through technology vendors, or outsourced.
The development lifecycle may include a systems or software development lifecycle (SDLC) and product lifecycles that allow organizations to manage their technology and infrastructure. Some organizations have developed their own system that may be more or less complex to suit the needs of their organization. The following diagram shows phases within these respective lifecycles:
It is important to understand that not every building has the same operational requirements, and although these technologies and systems exist, they may not have a practical use for the owner. Aligning the requirements for the facility or site with the minimum required systems provides the most efficient and cost-effective approach, as well as reduced costs for long-term operations and maintenance.
These systems are applied within the building sector with a variety of both independent and networked sensors, which are metered and operate on computer networks, servers, databases, and computing infrastructure. There are many types of sensors that are used to detect temperature, light, occupancy, energy use, liquid flow and leaks, air quality, gas concentration levels of variables (such as humidity, carbon monoxide and others), security and access control, and others that may be more specialized.
Information systems examples:45
The attributes associated with these systems include autonomy, controllability for complicated dynamics, human-machine interaction, and bio-inspired behavior. A systems diagram of a mesh network illustrates how some of these types of systems can be both local as well as geographically dispersed and interconnected using a variety of technologies.
Mesh Network Architecture46
Industrial control systems (ICS) are used to control industrial processes, which may include manufacturing, product handling, production, distribution, and others. These can also be referred to as Operational Technology (OT) systems and are broadly used across a variety of industries, including healthcare, manufacturing, automotive, defense, and others. ICS may be categorized differently, depending upon a unique organization’s use of the ICS, or from a general systems approach.
The maturity of the existing infrastructure relative to the long-term objective can be considered the level of intelligence of the ICS. Technology programs and initiatives can work towards those goals, integrating new systems and retiring/decommissioning old ones. Within the Department of Defense alone, there are over thirty unique types of ICS used in over 300,000 buildings.47
ICS categories include:47
The customer domain from this architecture is where the building
automation and management systems exist, which is illustrated in the
following diagram:44
There are a variety of building systems that are used to serve organizational functions. They may include one or more of the following systems, which are often guided by regulations, policies, standards, organizational requirements, and others.
IBS categories include:48
The following image shows an applied information system used by NASA to communicate with the International Space Station.
NASA, Flight Control Room49
Infrastructure utilizes a variety of communications mediums, which may include physical wires (communication and/or electrical) and wireless links (transmitted over a variety of frequency bands) to meet their objective. Some of these characteristics include:
Wired systems:
Wireless systems:
Network communications have been established using the traditional seven-layer Open Systems Interconnection (OSI) model, of which five layers are used for networked building-system devices (i.e. controllers, sensors, and others previously mentioned). The OSI model’s seven layers are application, presentation, session, transport, network, data link, and physical.50 The wireless sensor network (WSN) stack has five layers: application, transport, network, data link, and physical.
There are a multitude of additional protocols that are used for process automation, industrial control systems, building automation, power system automation, automatic meter reading, and automobile/vehicles. Each of these protocols provide various levels of systems interoperability to communicate with each other and use networks with different topologies (types of network configurations).
The intelligence that manages building systems has been introduced through the use of IoT and Big Data technologies that use analytics and automated learning processes. Three levels include:51
At the service level, the building and systems owner has the ability to control a variety of factors through an application. The formula for this “ecosystem” can be demonstrated with a general equation such as, Sensors + Networks + Big Data + Analytics = User Application.51
IoT may also provide insight for these systems; the cited author identifies three primary visions: object-oriented, internet, and semantic.51 In brief, the object-oriented vision addresses the identification, detection, networking, and processing capabilities objects need to exchange and share information with each other, while developing advanced services on the internet.
Interconnectivity increases the complexity of the system-of-systems; however, when operating effectively it provides the ability to make decisions quickly. For example, if a healthcare provider owned two hundred buildings across the nation, and had a system in place to measure the performance of their buildings and/or inventoried systems at both a broad and a more granular level, they could quickly extract unified insights to make more accurate and timely decisions. When a new system or building product is released, they could more easily identify, evaluate, and make effective decisions about the short- and long-term impacts on both energy and financial costs.
It is important to understand that these systems are vulnerable to a multitude of risks, with historical incidents causing billions of dollars in damages. Many of these systems have conflicting requirements for operations, performance, security, reliability, and safety, which can unintentionally impose risks.
A few examples include malware such as Stuxnet, DuQu, Flame, and Shamoon.48 Various organizations have worked diligently to mitigate threats; however, this will be an ongoing process at various levels, with collaboration between policy makers, guidelines, standards, engineering and information technology solutions, operators, and many others to maintain systems integrity.
NIST CPS Reference Architecture44
The concept of smart cities includes many components, some of which include public safety and emergency response, traffic, environmental and energy management that can be integrated and combined to existing capabilities. Global efforts have been made to increase interoperability, which refers to making systems work together, as well as composability, which focuses on the ability to add functions and maintain continuous improvement and integration, and harmonization, which refers to achieving compatibility between technologies and systems.
The IoT-Enabled Smart City Framework (IES-City Framework) is an international public working group that seeks to reduce the cost of systems integrations and overcome barriers while promoting modern communities and infrastructure.
While engineers, scientists, technologists, business leaders and many others continue to develop technologies and integrations, the legal systems and security associated with providing interoperability requires an extensive amount of oversight.
Legal professionals attempt to stay up to date with technology policies and regulations, implementing them as required to sustain the lifecycle of systems. Internal staff within organizations seek to leverage information technology governance as it relates to their environments while maintaining industry best practices. Unfortunately, they commonly find themselves reacting to issues after they have occurred, rather than implementing legal controls and countermeasures in advance. Fortunately, these incidents can stimulate new laws and policies that are intended to improve the interoperability of systems and the people that use them.
Much of the responsibility for securing these systems falls to those internal to an organization. Cybersecurity specialists, analysts, network engineers, telecommunications specialists, electrical system operators, and many others seek to implement a variety of controls to manage business objectives on protected networks. Unfortunately, organizations with great intentions are often targeted for intellectual property, monetary, or other types of advantage by private entities and governments alike. It is often difficult for organizations to determine the source of these threats while they depend on their own governments to support mutual agendas for systems governance and security.
Fortunately, enterprise governance standards have been created to guide an environment that can achieve organizational objectives and legal and technical standards. Organizations such as the Information Systems Audit and Control Association (ISACA) have integrated frameworks such as Control Objectives for Information and Related Technologies (COBIT) that can guide organizations to achieve both business and technical objectives.52 Healthcare environments are governed by the Health Insurance Portability and Accountability Act (HIPAA) and others to protect systems, health records, and patient and provider information.53 A number of other organizations provide resources for organizations to control their enterprises, which require life cycle management until they are decommissioned. Although these standards are in place, it requires a collaborative team and concerted effort to maintain information security using the confidentiality, integrity, and availability triad model.
Following the energy infrastructure overview section, we now have a better understanding of how this information has been collected, as well as how to develop useful meaning from it.
The American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) collected over 20 million points of training data, sourced from 2,380 energy meters across 1,448 buildings and 16 sites, over a three-year period in different parts of the United States.54 This data was provided to Kaggle to host a data science competition and to better understand energy usage. The data includes a multitude of variables, ranging from electricity use and meter readings to weather conditions, building types, and more.
The types of analysis that can be performed on this dataset are substantial. Examples include forecasting energy trends while considering how outdoor air temperature can require a mechanical system to work harder to maintain a comfortable indoor environment. Another is calculating the return on investment (ROI) for architectural, mechanical, or electrical improvements, as sketched below.
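As a simple illustration of the ROI idea, the sketch below uses hypothetical cost and savings figures (not values drawn from the ASHRAE data) to compute a payback period and lifetime ROI:
# Hypothetical example: simple payback and ROI for an efficiency upgrade
upgrade_cost   <- 50000   # installed cost ($), hypothetical
annual_savings <- 8000    # energy cost savings per year ($), hypothetical
lifespan_years <- 15      # expected service life (years), hypothetical
simple_payback <- upgrade_cost / annual_savings                                   # 6.25 years
roi_pct <- (annual_savings * lifespan_years - upgrade_cost) / upgrade_cost * 100  # 140%
c(payback_years = simple_payback, roi_percent = roi_pct)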
The objectives of this analysis include performing exploratory data analysis while observing energy consumption throughout the day and year using data science. These insights have the ability to guide organizations to conserve energy more effectively, as well as to determine the economic and health value of implementing incentives to reduce emissions that are detrimental to human health. Understanding these principles can provide economic and environmental benefits across a spectrum of opportunities ranging from finance, resources, and healthcare costs to a multitude of environmental conditions.
Load Datasets and Explore the Variables
The ASHRAE data sets available for analysis include building energy consumption, weather conditions, and building metadata. Each dataset is shown independently, followed by joining the variables and analyzing the information.
There were five files within the original sourced dataset; however, only three have been used for the purposes of this publication, isolating the variables used for analysis.
Energy Consumption:
bldg_data_train <- read.csv("01Data/06-ASHRAE_Energy_training_data.csv")
prettyNum(summary(bldg_data_train$meter_reading))
## Min. 1st Qu. Median Mean 3rd Qu.
## "0" "18.3" "78.8" "2117" "268"
## Max.
## "21904700"
Building Metadata:
# Building Information Metadata
bldg_meta <- read.csv("01Data/07-ASHRAE_building_metadata.csv")
str(bldg_meta)
## 'data.frame': 1449 obs. of 6 variables:
## $ site_id : int 0 0 0 0 0 0 0 0 0 0 ...
## $ building_id: int 0 1 2 3 4 5 6 7 8 9 ...
## $ primary_use: chr "Education" "Education" "Education" "Education" ...
## $ square_feet: int 7432 2720 5376 23685 116607 8000 27926 121074 60809 27000 ...
## $ year_built : int 2008 2004 1991 2002 1975 2000 1981 1989 2003 2010 ...
## $ floor_count: int NA NA NA NA NA NA NA NA NA NA ...
The final ASHRAE dataset that will be explored is for the weather data, indicating a variety of weather conditions associated with each site at hourly intervals, followed by a data visualization of the mean hour temperature throughout the year:
# Weather Data
weather_data <- read.csv("01Data/08-ASHRAE_weather_train.csv")
weather_temp <- weather_data %>%
  mutate(
    date = as.POSIXct(strptime(timestamp, "%Y-%m-%d %H:%M:%S")),
    site_id = as.factor(site_id),
    year = year(date),
    wday = wday(date),
    hour = hour(date)) %>%
  select(-c(timestamp)) %>%
  as_tibble()
options(repr.plot.width=50, repr.plot.height=50)
calendarPlot(weather_temp, pollutant = "air_temperature",
par.settings=list(fontsize=list(text=11)),
main = "Weather Data Air Temperature Mean",
statistic = 'mean')
The next step includes joining data for analysis:
# Load data and combine files
bldg_train_data <- data.table::fread("01Data/06-ASHRAE_Energy_training_data.csv")
bldg_meta <- read.csv("01Data/07-ASHRAE_building_metadata.csv")
weather_data <- read.csv("01Data/08-ASHRAE_weather_train.csv")
# Convert variable types for join
bldg_meta$building_id <- as.integer(bldg_meta$building_id)
# Timestamps are ISO 8601 ("YYYY-MM-DD HH:MM:SS"), so parse with %Y-%m-%d
weather_data$timestamp <- as.Date(weather_data$timestamp, format = "%Y-%m-%d %H:%M:%S")
weather_data$site_id <- as.integer(weather_data$site_id)
# Join variables
building_data <- bldg_train_data %>%
left_join(bldg_meta, by = "building_id") %>%
left_join(weather_data, by = c("site_id", "timestamp"))
# Assign ISO 8601 format YYYY-MM-DD HH:MM:SS
building_data <- building_data %>%
  mutate(timestamp_date = ymd(gsub(" .*$", "", timestamp)),
         timestamp_month = month(timestamp_date),
         timestamp_day = wday(timestamp_date, label = T, abbr = T),
         timestamp_day_number = day(timestamp_date),
         time_ymd_hms = ymd_hms(timestamp),
         time_hour = hour(time_ymd_hms))
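Because a silent mismatch in the join keys (types or formats) can leave the weather columns entirely NA, a quick sanity check on the joined data is worthwhile:
# Sanity-check the join: share of rows that received weather data,
# and confirm the join did not duplicate any rows
mean(!is.na(building_data$air_temperature))
nrow(building_data) == nrow(bldg_train_data)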
The following data visualization indicates the energy consumption throughout a 24-hour period that is associated with the classification of the building or site. This data reflects that certain types of buildings have a higher demand for electricity during times of the day, whereas others show consistent usage, which is directly related to organizational functions and operations.
energyuse <- building_data %>%
  group_by(time_hour, primary_use) %>%
  summarise(median_reading = median(meter_reading, na.rm = T)) %>%
  ggplot(aes(x = time_hour, y = median_reading)) +
  geom_area(fill = "yellow", color = "black") +
  theme(text = element_text(size = 8),
        axis.text.x = element_text(angle = 45, hjust = 1, size = 6),
        axis.text.y = element_text(angle = 0, hjust = 1, size = 6)) +
  ggtitle("Daily Energy Consumption by Building Type") +
  xlab("Hour of Day") +
  ylab("Median Meter Reading") +
  facet_wrap(~ primary_use, scales = "free")
energyuse
Annual Energy Consumption Trends
This data visualization is similar to the previous one, but over an annual cycle.
energyuse2 <- building_data %>%
  group_by(timestamp_date, primary_use) %>%
  summarise(median_reading = median(meter_reading, na.rm = T)) %>%
  ggplot(aes(x = timestamp_date, y = median_reading)) +
  geom_line(color = "blue") +
  geom_smooth(se = F, color = "black") +
  ggtitle("Annual Energy Consumption by Building Type") +
  theme(text = element_text(size = 8),
        axis.text.x = element_text(angle = 30, hjust = 1, size = 6),
        axis.text.y = element_text(angle = 0, hjust = 1, size = 6)) +
  xlab("Date") +
  ylab("Median Meter Reading") +
  facet_wrap(~ primary_use, scales = "free")
energyuse2
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
This data shows annual trends and some seasonality over the duration of the year. Certain types of organizations have increased operations in certain seasons; for example, healthcare energy use increases during the spring and summer and is lower during the fall and winter. Other categories, such as services, tend to have less seasonality, with a consistently similar energy demand throughout the year.
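To roughly quantify this seasonality, one could compare monthly median readings for two of the categories discussed; a sketch, assuming the "Healthcare" and "Services" labels appear in the primary_use column:
# Monthly median meter readings for two building types (illustrative check)
building_data %>%
  filter(primary_use %in% c("Healthcare", "Services")) %>%
  group_by(primary_use, timestamp_month) %>%
  summarise(median_reading = median(meter_reading, na.rm = TRUE), .groups = "drop") %>%
  tidyr::pivot_wider(names_from = primary_use, values_from = median_reading)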
This next dataset will provide us with the ability to assess indoor air quality values to determine opportunities.
The data set is sourced from the University of California Irvine (UCI) Machine Learning Repository and contains over 20,000 instances of multivariate time-series data that can be used for classification tasks.55 The dataset is associated with a research and development project within a controlled laboratory environment, collected to provide insights that can be projected over time (i.e. monthly, annual, etc.) or scale, such as large-scale implementations (educational, healthcare, manufacturing facilities, and others). The attributes associated with these sensor variables include:
The National Renewable Energy Laboratory (NREL) has identified significant energy savings potential using occupant counting/presence inputs amounting to 10-40% energy savings for HVAC and lighting, as well as occupant comfort/preference inputs of 10-40% energy savings for HVAC, and 10-60% for lighting.56,57,58,59 Let’s see what the data reveals.
The initial portion of this analysis includes data exploration over a five-day period, in which the occupied spaces are used most frequently. The analysis of this data seeks to determine relationships within the indoor spaces that will provide insights for energy saving opportunities.
Various statistical and data science methodologies will be implemented using time-series, correlation, and various forms of clustering analysis. The results from this information can inform efficiency opportunities.
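To put the NREL savings ranges above in context before the analysis begins, a small sketch translates the cited 10-40% occupant counting/presence savings into kWh for a hypothetical building (the consumption figures are illustrative assumptions, not values from the dataset):
# Hypothetical annual consumption for a small office building (kWh)
hvac_kwh     <- 200000
lighting_kwh <- 80000
# NREL-cited 10-40% savings range from occupant counting/presence inputs
presence_range <- c(low = 0.10, high = 0.40)
(hvac_kwh + lighting_kwh) * presence_range   # 28,000 - 112,000 kWh per year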
Load Data, Summary Statistics and Variable Conversions
# Pre-processing the data included converting dates to ISO format
# Load the data
BldgSensorTest <- read.csv("01Data/09-BldgSensorTest.csv", header = TRUE,
                           sep = ",")
BldgSensorTest2 <- read.table("01Data/10-BldgSensorTest2.csv",
                              header = TRUE, sep = ",")
BldgSensorTraining <- read.table("01Data/11-BldgSensorTraining.csv",
                                 header = TRUE, sep = ",")
# Summary Statistics
summary(BldgSensorTest)
## date Temperature Humidity
## Length:2665 Min. :20.2 Min. :22.1
## Class :character 1st Qu.:20.6 1st Qu.:23.3
## Mode :character Median :20.9 Median :25.0
## Mean :21.4 Mean :25.4
## 3rd Qu.:22.4 3rd Qu.:26.9
## Max. :24.4 Max. :31.5
## Light CO2 HumidityRatio
## Min. : 0 Min. : 428 Min. :0.00330
## 1st Qu.: 0 1st Qu.: 466 1st Qu.:0.00353
## Median : 0 Median : 580 Median :0.00382
## Mean : 193 Mean : 718 Mean :0.00403
## 3rd Qu.: 442 3rd Qu.: 956 3rd Qu.:0.00453
## Max. :1697 Max. :1402 Max. :0.00538
## Occupancy
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.365
## 3rd Qu.:1.000
## Max. :1.000
summary(BldgSensorTest2)
## date Temperature Humidity
## Length:9752 Min. :19.5 Min. :21.9
## Class :character 1st Qu.:20.3 1st Qu.:26.6
## Mode :character Median :20.8 Median :30.2
## Mean :21.0 Mean :29.9
## 3rd Qu.:21.5 3rd Qu.:32.7
## Max. :24.4 Max. :39.5
## Light CO2 HumidityRatio
## Min. : 0 Min. : 485 Min. :0.00327
## 1st Qu.: 0 1st Qu.: 542 1st Qu.:0.00420
## Median : 0 Median : 639 Median :0.00459
## Mean : 123 Mean : 753 Mean :0.00459
## 3rd Qu.: 208 3rd Qu.: 831 3rd Qu.:0.00500
## Max. :1581 Max. :2076 Max. :0.00577
## Occupancy
## Min. :0.00
## 1st Qu.:0.00
## Median :0.00
## Mean :0.21
## 3rd Qu.:0.00
## Max. :1.00
summary(BldgSensorTraining)
## date Temperature Humidity
## Length:8143 Min. :19.0 Min. :16.7
## Class :character 1st Qu.:19.7 1st Qu.:20.2
## Mode :character Median :20.4 Median :26.2
## Mean :20.6 Mean :25.7
## 3rd Qu.:21.4 3rd Qu.:30.5
## Max. :23.2 Max. :39.1
## Light CO2 HumidityRatio
## Min. : 0 Min. : 413 Min. :0.00267
## 1st Qu.: 0 1st Qu.: 439 1st Qu.:0.00308
## Median : 0 Median : 454 Median :0.00380
## Mean : 120 Mean : 607 Mean :0.00386
## 3rd Qu.: 256 3rd Qu.: 639 3rd Qu.:0.00435
## Max. :1546 Max. :2028 Max. :0.00648
## Occupancy
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.212
## 3rd Qu.:0.000
## Max. :1.000
Variable Conversions
# Change occupancy integers to factors
BldgSensorTest$Occupancy <- as.factor(BldgSensorTest$Occupancy)
BldgSensorTest2$Occupancy <- as.factor(BldgSensorTest2$Occupancy)
BldgSensorTraining$Occupancy <- as.factor(BldgSensorTraining$Occupancy)
BldgSensorTest$date <- as.POSIXct(BldgSensorTest$date, tz="UTC")
BldgSensorTest2$date <- as.POSIXct(BldgSensorTest2$date, tz="UTC")
BldgSensorTraining$date <- as.POSIXct(BldgSensorTraining$date, tz="UTC")
# Create the xts (extensible time-series object) constructor function for dygraph (all variables)
xts_1 <- xts(x = BldgSensorTraining$Temperature, order.by = BldgSensorTraining$date)
xts_2 <- xts(x = BldgSensorTraining$Humidity,
order.by = BldgSensorTraining$date)
xts_3 <- xts(x = BldgSensorTraining$Light,
order.by = BldgSensorTraining$date)
xts_4 <- xts(x = BldgSensorTraining$CO2,
order.by = BldgSensorTraining$date)
xts_5 <- xts(x = BldgSensorTraining$HumidityRatio,
order.by = BldgSensorTraining$date)
xts_6 <- xts(x = BldgSensorTraining$Occupancy,
order.by = BldgSensorTraining$date)
# Assign variable names if they do not populate correctly
date <- BldgSensorTraining$date
Temperature <- BldgSensorTraining$Temperature
Humidity <- BldgSensorTraining$Humidity
Light <- BldgSensorTraining$Light
CO2 <- BldgSensorTraining$CO2
HumidityRatio <- BldgSensorTraining$HumidityRatio
Occupancy <- BldgSensorTraining$Occupancy
# p1=Temperature, p2=humidity, p3=light, p4=CO2,
# p5=humidity ratio, p6=occupancy
p1 <- ggplot(xts_1,aes(date)) +
geom_line(color="Black", aes(y=Temperature)) +
geom_area( fill="Red", aes(y=Temperature), alpha=0.4) +
ylab("Temp (°C)") +
xlab("Time") +
coord_cartesian(ylim = c(18, 23)) +
scale_x_datetime(breaks=date_breaks("4 hour"),labels=date_format("%H:%M"),
limits=as.POSIXct(c("2015-02-05 06:00","2015-02-10 06:00"),tz="GMT")) +
theme(text = element_text(size=5)) +
theme(axis.text.x = element_text(angle=90,hjust=1,size=6)) +
theme(axis.text.y = element_text(angle=0,hjust=1,size=6))
p2 <- ggplot(xts_2,aes(date)) +
geom_line(color="Black", aes(y=Humidity)) +
geom_area( fill="Blue", aes(y=Humidity), alpha=0.4) +
ylab("Humidity") + # % Water Vapor to Air
xlab("Time") +
coord_cartesian(ylim = c(18, 40)) +
scale_x_datetime(breaks=date_breaks("4 hour"),labels=date_format("%H:%M"),
limits=as.POSIXct(c("2015-02-05 06:00","2015-02-10 06:00"),tz="GMT"))+
theme(text = element_text(size=5)) +
theme(axis.text.x = element_text(angle=90,hjust=1,size=6)) +
theme(axis.text.y = element_text(angle=0,hjust=1,size=6))
p3 <- ggplot(xts_3,aes(date)) +
geom_line(color="Black", aes(y=Light)) +
geom_area( fill="#F0E442", aes(y=Light), alpha=0.4) +
ylab("Light-Lux") +
xlab("Time") +
coord_cartesian(ylim = c(0, 1600)) +
scale_x_datetime(breaks=date_breaks("4 hour"),labels=date_format("%H:%M"),
limits=as.POSIXct(c("2015-02-05 06:00","2015-02-10 06:00"),tz="GMT"))+
theme(text = element_text(size=5)) +
theme(axis.text.x = element_text(angle=90,hjust=1,size=6)) +
theme(axis.text.y = element_text(angle=0,hjust=1,size=6))
p4 <- ggplot(xts_4,aes(date)) +
geom_line(color="Black", aes(y=CO2)) +
geom_area( fill="#009E73", aes(y=CO2), alpha=0.4) +
ylab("CO2 ppm") +
xlab("Time") +
coord_cartesian(ylim = c(400, 2200)) +
scale_x_datetime(breaks=date_breaks("4 hour"),labels=date_format("%H:%M"),
limits=as.POSIXct(c("2015-02-05 06:00","2015-02-10 06:00"),tz="GMT"))+
theme(text = element_text(size=5)) +
theme(axis.text.x = element_text(angle=90,hjust=1,size=6)) +
theme(axis.text.y = element_text(angle=0,hjust=1,size=6))
p5 <- ggplot(xts_5,aes(date)) +
geom_line(color="Black", aes(y=HumidityRatio)) +
geom_area( fill="#56B4E9", aes(y=HumidityRatio), alpha=0.4) +
ylab("Humidity Ratio") + # kgwater-vapor/kg-air
xlab("Time") +
coord_cartesian(ylim = c(.0025, .0065)) +
scale_x_datetime(breaks=date_breaks("4 hour"),labels=date_format("%H:%M"),
limits=as.POSIXct(c("2015-02-05 06:00","2015-02-10 06:00"),tz="GMT"))+
theme(text = element_text(size=5)) +
theme(axis.text.x = element_text(angle=90,hjust=1,size=6)) +
theme(axis.text.y = element_text(angle=0,hjust=1,size=6))
p6 <- ggplot(xts_6,aes(date)) +
geom_line(color="Black",aes(y=as.numeric(Occupancy))) +
ylab("Occupancy") +
xlab("Time") +
scale_x_datetime(breaks=date_breaks("4 hour"),labels=date_format("%H:%M"),
limits=as.POSIXct(c("2015-02-05 06:00","2015-02-10 06:00"),tz="GMT"))+
theme(text = element_text(size=5)) +
theme(axis.text.x = element_text(angle=90,hjust=1,size=6)) +
theme(axis.text.y = element_text(angle=0,hjust=1,size=6))
timeseries <- grid.arrange(p1, p2, p3, p4, p5, p6, nrow = 3,
top = "Time-Series Variables, 5 Day Duration",
bottom = textGrob("",gp = gpar(fontface = 3, fontsize = 5),
hjust = 1, x = 1))
This initial time-series analysis is very useful for analyzing the indoor environment during occupied periods of the week. It is easy to understand the relationships of the unique variables, as well as examine some similarities. The occupancy sensor indicates that the space was occupied on the Monday, Tuesday, and Friday of the week. It is also evident that CO2 levels rose when the space was occupied, along with increased utilization of the lighting and HVAC systems. The next step includes exploring this data to better understand the relationships between the variables using correlation and various cluster analysis methods.
The strength and direction of a linear relationship between two variables can be measured using a correlation coefficient. The Pearson correlation coefficient ranges from -1 to +1; values near +1 indicate a strong positive linear relationship, and values near -1 a strong negative one.
Pearson Correlation Coefficient formula: \[r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left(n\sum x^{2}-(\sum x)^{2}\right)\left(n\sum y^{2}-(\sum y)^{2}\right)}}\] Correlation is measured by \(r\), where \(n\) is the number of data pairs and \(x\) and \(y\) are the paired sample values.
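As a quick sanity check, this formula can be computed directly and compared with R's built-in cor(); a minimal sketch, with Temperature and CO2 chosen arbitrarily from the sensor columns:
x <- BldgSensorTraining$Temperature
y <- BldgSensorTraining$CO2
n <- length(x)
r <- (n * sum(x * y) - sum(x) * sum(y)) /
  sqrt((n * sum(x^2) - sum(x)^2) * (n * sum(y^2) - sum(y)^2))
all.equal(r, cor(x, y)) # TRUE: matches R's Pearson coefficient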
# Correlation of Sensor Relationships
BldgSensorTraining$Occupancy <- as.numeric(BldgSensorTraining$Occupancy)
SensorData.corrplot = cor(BldgSensorTraining[2:7])
cp <- corrplot.mixed(SensorData.corrplot, lower.col = "darkblue",
order = "hclust", number.cex = 0.7, tl.cex = 0.7, tl.col = "black")
This correlation matrix indicates varying strengths of relationship
between these variables, with the rows and columns ordered by
hierarchical clustering. The correlation coefficient is printed on the
lower portion of the plot (below the variable labels), while the upper
portion encodes the same value as a circle whose size corresponds to
its magnitude. The higher the printed value (lower) and the larger the
circle (upper), the stronger the relationship. The strongest
relationship exists between Humidity and the Humidity Ratio, which we
expect. This is followed by Occupancy and Light, from which we can
conclude that someone occupying the space will most likely use the
lighting system. From the earlier time-series data we also know that
this is an artificial light source (electric lighting, not
daylighting), because the levels are not consistent throughout the day
and the occupancy data supports this conclusion. At the lower end of
the spectrum, there is little correlation between Light and Humidity,
meaning there is no strong relationship between these factors.
Observations provided by the characteristics of the variables can be measured to determine the similarity, or dissimilarity between them. The clustering distance measurement calculates the similarity of the elements, which influences the shape of the clusters.
The Euclidean Distance Formula is defined as: \[d_{euc}(x,y)=\sqrt{\sum_{i=1}^{n}(x_{i}-y_{i})^2}\] where \(n\) is the length of the vectors \(x\) and \(y\). The programming within R allows us to compare a multitude of variables to one another, and a number of classical distance measures can be used, such as the Manhattan distance and the correlation-based Spearman and Kendall measures.
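As a brief sketch of how the metric choice enters in R, the same scaled sensor data can be measured with different methods; the 200-row subsample is arbitrary, purely to keep the distance matrices small:
set.seed(123)
idx <- sample(nrow(BldgSensorTraining), 200) # arbitrary subsample of rows
m <- scale(BldgSensorTraining[idx, 2:7])
d_euc <- dist(m, method = "euclidean") # straight-line distance
d_man <- dist(m, method = "manhattan") # city-block distance
d_spe <- get_dist(m, method = "spearman") # rank-based distance via factoextra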
Hierarchical clustering analysis provides the ability to find relationships and build a hierarchy of clusters. It comes in two types. The first is agglomerative, also known as AGNES (Agglomerative Nesting), in which each observation begins as its own cluster and pairs are merged moving up the hierarchy (a bottom-up approach). The second is divisive, also known as DIANA (Divisive Analysis), which takes a top-down approach: it begins with a single cluster that splits moving down the hierarchy. AGNES is generally better at identifying small clusters, while DIANA is better at identifying large ones.60,61 The agglomerative method (R's hclust) is used for this analysis, and a metric is used to measure the dissimilarity between sets of observations.
A variety of metrics can influence the shape of the cluster, including the Euclidean, Manhattan, Maximum, and Mahalanobis distances. The linkage criterion then defines the distance between sets of observations as a function of the pairwise distances between their members.
In this first example, we will use the Euclidean distance for the metric, and the average linkage clustering method. The formula for the Euclidean method was previously shown, while the formula for average linkage is shown below:
Average Linkage formula: \[d(i \cup j, k) = \frac{d(i,k) + d(j,k)}{2}\] The distance between cluster \(k\) and the union (join) of clusters \(i\) and \(j\) is the average of the distances from \(i\) to \(k\) and from \(j\) to \(k\).62
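A tiny worked instance of this formula, with illustrative distances assumed for the clusters:
d_ik <- 2 # assumed distance from cluster i to cluster k
d_jk <- 4 # assumed distance from cluster j to cluster k
(d_ik + d_jk) / 2 # distance from the merged cluster (i u j) to k: 3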
These cluster methods demonstrate various graphical approaches to understanding the hierarchical clusters within this dataset. Three methods are shown, which use different strategies to identify relationships using dendrograms and a scatter plot.
Dendrogram - Nearest Neighbors
set.seed(123)
# Modify objects
BldgSensorTraining$Occupancy <- as.numeric(BldgSensorTraining$Occupancy)
# Create Dataframe
df <- data.frame(BldgSensorTraining[2:7])
# Note: sample() on a data frame permutes its columns, not rows; to thin
# the dendrogram, sample rows instead, e.g. df[sample(nrow(df), 500), ]
dfsample <- sample(df)
dfscale <- scale(dfsample)
distxy <- dist(dfscale)
cluster <- hclust(distxy)
# Plot Cluster Dendrogram
plot(cluster, ylab = "Height", xlab="Distance")
Although the dendrogram is very dense given the size of the dataset,
the data may also be sampled to reduce the number of observations
plotted. The important point is that relationships between these
sensors are recognizable within this dataset. Further analysis will
provide additional insights.
The height of the dendrogram can be limited to better understand specific relationships, which also reduces the visual density (a sketch follows). The scatter plot and heatmap, however, provide additional information that is useful for cluster analysis.
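One way to limit the view, as a minimal sketch: cut the dendrogram from the previous chunk at an assumed height and plot only the structure above the cut.
dend <- as.dendrogram(cluster)
parts <- cut(dend, h = 5) # h = 5 is an illustrative threshold
plot(parts$upper, ylab = "Height") # branches above the cut only
# plot(parts$lower[[1]]) # zoom into the first sub-branch below the cut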
In this example we use k-means-style clustering, which defines clusters so that the within-cluster variation is minimized. Hartigan and Wong introduced the standard k-means algorithm in 1979, with the objective defined as:
\[W(C_{k})=\sum_{x_{i} \in C_{k}}(x_{i} - \mu_{k})^{2}\] Where \(x_{i}\) is a data point assigned to cluster \(C_{k}\), and \(\mu_{k}\) is the mean of the points assigned to \(C_{k}\). Each \(x_{i}\) is assigned to a cluster such that the sum of squared distances from the observations to their cluster center \(\mu_{k}\) is minimized. This is demonstrated in the cluster scatter plot below:
df <- scale(BldgSensorTraining[2:7]) # scaled data matrix
EuclidDist2 <- dist(df, method = "euclidean") # pairwise Euclidean distances
HierarchClust2 <- hclust(EuclidDist2, method = "average") # average linkage
kmeans_grp <- cutree(HierarchClust2, k = 6) # cut the tree into six clusters
fviz_cluster(list(data = BldgSensorTraining[2:7], cluster = kmeans_grp))
These six clusters were obtained by cutting the average-linkage
hierarchical tree with cutree, a partition of the scaled data in the
same spirit as k-means. The clusters indicate that there are strong
relationships within some groups and weak relationships within others.
The heatmap will further help visualize the variable relationships and
their strength relative to one another. A direct kmeans() fit is
sketched below for comparison.
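A minimal sketch of that direct kmeans() fit on the same scaled data; k = 6 matches the tree cut above, and nstart = 25 is an assumed number of random restarts for stability:
set.seed(123)
km <- kmeans(df, centers = 6, nstart = 25)
km$tot.withinss # total within-cluster variation, W, being minimized
fviz_cluster(km, data = df)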
Heatmaps provide the ability to visualize clusters of samples and features. This heatmap, which also uses Euclidean distances, makes the relationships between the variables easy to see.
BldgSensorTraining$Occupancy <- as.numeric(BldgSensorTraining$Occupancy)
df <- scale(BldgSensorTraining[2:7])
heatmap.2(df, scale = "none", col = bluered(100), margins=c(9,4) ,
trace = "none", density.info = "none", cexRow=0.2)
The dendrograms within this heatmap (top and right side) show
strong (red) and weak (blue) relationships between variables. The
dendrogram at the top indicates that humidity and temperature do not
have a strong relationship (in the climate in which this was analyzed).
Once that initial split is made, however, it is evident that Humidity
and the Humidity Ratio are strongly related, while on the right side of
the branches the CO2, Occupancy, and Light variables form a strongly
related group that is collectively more closely associated with
Temperature. The color scheme likewise shows stronger relationships
between the humidity variables and weaker relationships among the
remaining variables. This is consistent with the cluster analysis and
correlation explored previously.
Now that we have a better understanding of the data using time-series, correlation, and various clustering methods, we can more easily make some decisions to increase efficiency.
Initially, we can conclude from both the Occupancy and CO2 sensor data, easily seen in the time-series visualization, that this space was occupied on Monday, Tuesday, and Friday. Furthermore, there were days when the lighting and HVAC systems ran with no one in the space to enjoy these services. Considering the correlation data, we notice strong relationships between the Humidity and Humidity Ratio variables, as well as between CO2 and Occupancy. The cluster analysis supports this, and the heatmap makes the strength of these relationships easier to see. Let's further explore this data with machine learning algorithms to understand accuracy measurements.
The initial unsupervised learning techniques indicate a high correlation between occupancy, light, the humidity ratio, and CO2. Suppose we wanted a cost-effective building system that reduces the number of sensors required while maintaining a high level of accuracy. Let's use a few machine learning tasks to determine whether a single sensor (e.g., CO2) or a combination of sensors can provide an acceptable level of accuracy. The results will tell us whether one sensor, or a combination of several, can detect occupancy and signal lighting and HVAC systems to activate or deactivate when spaces are not in use. (We will not get into the engineering of this task, rather focus on the machine learning.)
This task will include utilizing five different types of supervised machine learning algorithms to determine which one can provide the highest level of accuracy. These include Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Random Forest (RF), Classification and Regression Trees (CART), and k-nearest neighbors (kNN).
Linear discriminant analysis assumes that the correlation structure is the same for all classes, which reduces the number of parameters necessary for estimation.
LDA Accuracy
set.seed(123)
# Define variables
BldgSensorTraining$Occupancy <- as.factor(BldgSensorTraining$Occupancy)
Occupancy <- BldgSensorTraining$Occupancy
# BldgSensorTraining <-
# read.table('01Data/11-BldgSensorTraining.csv',header=TRUE,sep=',')
## LDA-All variables
LDA_all_var <- train(Occupancy ~ . - date, method = "lda", data = BldgSensorTraining)
LDA_all_var
## Linear Discriminant Analysis
##
## 8143 samples
## 6 predictor
## 2 classes: '1', '2'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 8143, 8143, 8143, 8143, 8143, 8143, ...
## Resampling results:
##
## Accuracy Kappa
## 0.988 0.965
# LDA_all_var$finalModel # data stats
## LDA-CO2 only
LDA_CO2 <- train(Occupancy ~ . - date - Humidity - Temperature -
HumidityRatio - Light, method = "lda", data = BldgSensorTraining)
LDA_CO2
## Linear Discriminant Analysis
##
## 8143 samples
## 6 predictor
## 2 classes: '1', '2'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 8143, 8143, 8143, 8143, 8143, 8143, ...
## Resampling results:
##
## Accuracy Kappa
## 0.885 0.616
## LDA-CO2 and Light
LDA_CO2_light <- train(Occupancy ~ . - date - Humidity - Temperature -
HumidityRatio, method = "lda", data = BldgSensorTraining)
LDA_CO2_light
## Linear Discriminant Analysis
##
## 8143 samples
## 6 predictor
## 2 classes: '1', '2'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 8143, 8143, 8143, 8143, 8143, 8143, ...
## Resampling results:
##
## Accuracy Kappa
## 0.976 0.932
LDA Accuracy Results
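Beyond resampled accuracy, a confusion matrix shows where a classifier errs; a minimal sketch for the all-variable LDA fit, evaluated on the training data (resubstitution, so the estimate is optimistic):
lda_pred <- predict(LDA_all_var, newdata = BldgSensorTraining)
confusionMatrix(lda_pred, BldgSensorTraining$Occupancy) # caret's confusion matrix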
Quadratic discriminant analysis is a version of Naive Bayes. In a binary case, the smallest true error is achieved by Bayes' rule, which is based on the true conditional probability and is expressed as:
\[p(x) = Pr(Y = 1 \mid X = x) = \frac{f_{X \mid Y=1}(x)\,Pr(Y = 1)}{f_{X \mid Y=0}(x)\,Pr(Y = 0)+f_{X \mid Y=1}(x)\,Pr(Y = 1)}\] \(f_{X \mid Y=1}\) and \(f_{X \mid Y=0}\) represent the distribution functions of the predictor \(X\) for the two classes \(Y=1\) and \(Y=0\). The formula implies that if we can estimate these conditional distributions of the predictors, we can develop a powerful decision rule. QDA assumes that \(f_{X \mid Y=1}(x)\) and \(f_{X \mid Y=0}(x)\) are multivariate normal. Let's see how QDA performs:
## QDA-All variables
QDA_all_var <- train(Occupancy~.-date,method="qda",data=BldgSensorTraining)
QDA_all_var
## Quadratic Discriminant Analysis
##
## 8143 samples
## 6 predictor
## 2 classes: '1', '2'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 8143, 8143, 8143, 8143, 8143, 8143, ...
## Resampling results:
##
## Accuracy Kappa
## 0.988 0.965
## QDA-CO2 only
QDA_CO2 <- train(Occupancy~.-date-Humidity-Temperature-HumidityRatio
-Light,method="qda",data=BldgSensorTraining)
QDA_CO2
## Quadratic Discriminant Analysis
##
## 8143 samples
## 6 predictor
## 2 classes: '1', '2'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 8143, 8143, 8143, 8143, 8143, 8143, ...
## Resampling results:
##
## Accuracy Kappa
## 0.902 0.692
## QDA-CO2 and Light
QDA_CO2_light <- train(Occupancy~.-date-Humidity-Temperature-HumidityRatio
,method="qda",data=BldgSensorTraining)
QDA_CO2_light
## Quadratic Discriminant Analysis
##
## 8143 samples
## 6 predictor
## 2 classes: '1', '2'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 8143, 8143, 8143, 8143, 8143, 8143, ...
## Resampling results:
##
## Accuracy Kappa
## 0.983 0.95
QDA Accuracy Results
Random forests are effective at making predictions and reduce instability by averaging multiple decision trees. This is accomplished by bootstrap aggregation (bagging), which generates many predictors (regression or classification trees), then forming a final prediction on the average prediction. Secondarily, to ensure that no two trees are the same, the bootstrap method makes the trees randomly different. Let’s see how the random forest algorithm performs:
## Random Forest-All variables
randforest_all_var <- train(Occupancy~.-date,method="rf",
data=BldgSensorTraining)
randforest_all_var
## Random Forest
##
## 8143 samples
## 6 predictor
## 2 classes: '1', '2'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 8143, 8143, 8143, 8143, 8143, 8143, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.993 0.979
## 3 0.993 0.979
## 5 0.992 0.977
##
## Accuracy was used to select the optimal model using
## the largest value.
## The final value used for the model was mtry = 2.
## Random Forest-CO2
randforest_CO2 <- train(Occupancy~.-date-Humidity-Temperature-HumidityRatio
-Light,method="rf",data=BldgSensorTraining)
randforest_CO2
## Random Forest
##
## 8143 samples
## 6 predictor
## 2 classes: '1', '2'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 8143, 8143, 8143, 8143, 8143, 8143, ...
## Resampling results:
##
## Accuracy Kappa
## 0.894 0.682
##
## Tuning parameter 'mtry' was held constant at a value of 2
## Random Forest-CO2 & Light (note: Temperature is also retained in this fit)
randforest_CO2_light <- train(Occupancy~.-date-Humidity-HumidityRatio
,method="rf",data=BldgSensorTraining)
## note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
randforest_CO2_light
## Random Forest
##
## 8143 samples
## 6 predictor
## 2 classes: '1', '2'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 8143, 8143, 8143, 8143, 8143, 8143, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.994 0.981
## 3 0.993 0.978
##
## Accuracy was used to select the optimal model using
## the largest value.
## The final value used for the model was mtry = 2.
Random Forest Accuracy Results
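Since bagging averages many randomly different trees, it is natural to ask which sensors the forest leans on; a minimal sketch using caret's importance accessor on the all-variable fit:
varImp(randforest_all_var) # scaled variable importance from the fitted forest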
Classification trees are used to make predictions when the outcome is categorical. The predictor space is recursively partitioned, and the prediction within each partition is the class most common among the training set observations that fall in it. Let's see how the CART algorithm performs: 63
## CART-All variables
CART_all_var <- train(Occupancy~.-date,method="rpart",data=BldgSensorTraining)
CART_all_var
## CART
##
## 8143 samples
## 6 predictor
## 2 classes: '1', '2'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 8143, 8143, 8143, 8143, 8143, 8143, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.00405 0.992 0.975
## 0.00607 0.990 0.970
## 0.94274 0.867 0.384
##
## Accuracy was used to select the optimal model using
## the largest value.
## The final value used for the model was cp = 0.00405.
## CART-CO2 only
CART_CO2 <- train(Occupancy~.-date-Humidity-Temperature-HumidityRatio
-Light,method="rpart",data=BldgSensorTraining)
CART_CO2
## CART
##
## 8143 samples
## 6 predictor
## 2 classes: '1', '2'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 8143, 8143, 8143, 8143, 8143, 8143, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.00318 0.920 0.765
## 0.00839 0.919 0.764
## 0.61481 0.859 0.428
##
## Accuracy was used to select the optimal model using
## the largest value.
## The final value used for the model was cp = 0.00318.
## CART-CO2 and Light
CART_CO2_light <- train(Occupancy~.-date-Humidity-Temperature-HumidityRatio
,method="rpart",data=BldgSensorTraining)
CART_CO2_light
## CART
##
## 8143 samples
## 6 predictor
## 2 classes: '1', '2'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 8143, 8143, 8143, 8143, 8143, 8143, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.00135 0.988 0.965
## 0.00521 0.988 0.965
## 0.94274 0.884 0.461
##
## Accuracy was used to select the optimal model using
## the largest value.
## The final value used for the model was cp = 0.00135.
Classification and Regression Trees Results
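The partitions themselves can be inspected by plotting the final tree; a minimal sketch, assuming the rpart.plot package (not in the package list above) is installed:
library(rpart.plot)
rpart.plot(CART_all_var$finalModel) # splits and class proportions at each node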
kNN can adapt to multiple dimensions. We first define distances between observations; then, for any point \((x_{1},x_{2})\) at which we want an estimate, we find the k nearest training points, known as the neighborhood, and average their 0s and 1s to estimate the conditional probability.58 Let's see how the k-nearest neighbors algorithm performs:
## KNN-All variables
KNN_all_var <- train(Occupancy~.-date,method="knn",data=BldgSensorTraining)
KNN_all_var
## k-Nearest Neighbors
##
## 8143 samples
## 6 predictor
## 2 classes: '1', '2'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 8143, 8143, 8143, 8143, 8143, 8143, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.988 0.963
## 7 0.988 0.964
## 9 0.988 0.964
##
## Accuracy was used to select the optimal model using
## the largest value.
## The final value used for the model was k = 9.
## KNN-CO2 only
KNN_CO2 <- train(Occupancy~.-date-Humidity-Temperature-HumidityRatio
-Light,method="knn",data=BldgSensorTraining)
KNN_CO2
## k-Nearest Neighbors
##
## 8143 samples
## 6 predictor
## 2 classes: '1', '2'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 8143, 8143, 8143, 8143, 8143, 8143, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.904 0.714
## 7 0.907 0.725
## 9 0.910 0.735
##
## Accuracy was used to select the optimal model using
## the largest value.
## The final value used for the model was k = 9.
## KNN-CO2 and Light
KNN_CO2_light <- train(Occupancy~.-date-Humidity-Temperature-HumidityRatio
,method="knn",data=BldgSensorTraining)
KNN_CO2_light
## k-Nearest Neighbors
##
## 8143 samples
## 6 predictor
## 2 classes: '1', '2'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 8143, 8143, 8143, 8143, 8143, 8143, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.986 0.960
## 7 0.987 0.961
## 9 0.987 0.963
##
## Accuracy was used to select the optimal model using
## the largest value.
## The final value used for the model was k = 9.
k-nearest neighbors Accuracy Results
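The influence of the neighborhood size is easy to see directly; a minimal sketch using caret's plot method for train objects:
plot(KNN_all_var) # resampled accuracy as a function of k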
The following section presents the accuracy measurements, followed by graphical visualizations of performance, for all five models trained on all variables.
results <- resamples(list(LDA=LDA_all_var, QDA=QDA_all_var,
RF=randforest_all_var, CART=CART_all_var,
KNN=KNN_all_var))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: LDA, QDA, RF, CART, KNN
## Number of resamples: 25
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LDA 0.983 0.987 0.988 0.988 0.990 0.992 0
## QDA 0.985 0.987 0.988 0.988 0.989 0.991 0
## RF 0.991 0.992 0.993 0.993 0.994 0.995 0
## CART 0.989 0.991 0.991 0.992 0.992 0.994 0
## KNN 0.984 0.987 0.988 0.988 0.989 0.990 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LDA 0.950 0.961 0.964 0.965 0.970 0.976 0
## QDA 0.956 0.962 0.965 0.965 0.968 0.973 0
## RF 0.973 0.977 0.979 0.979 0.981 0.986 0
## CART 0.966 0.972 0.974 0.975 0.977 0.983 0
## KNN 0.955 0.961 0.965 0.964 0.967 0.971 0
# Box and whisker plot
scales <- list(x=list(relation="free"), y=list(relation="free"))
bwplot(results, scales=scales, main='All Variables')
# Density plot
densityplot(results, scales=scales, main='All Variables')
# parallel plots to compare models
parallelplot(results, main='All Variables')
# pair-wise scatterplots of predictions to compare models
splom(results,pscales = 0)
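The visual comparison can be formalized; a minimal sketch using caret's pairwise differencing of the resampling distributions:
summary(diff(results)) # paired accuracy and kappa differences between models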
Two Variable Accuracy (CO2 and Light)
This section shows how the models compare on accuracy when trained on two variables, the CO2 and light sensors.
results2 <- resamples(list(LDA = LDA_CO2_light, QDA = QDA_CO2_light,
RF = randforest_CO2_light, CART = CART_CO2_light, KNN = KNN_CO2_light))
summary(results2)
##
## Call:
## summary.resamples(object = results2)
##
## Models: LDA, QDA, RF, CART, KNN
## Number of resamples: 25
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LDA 0.972 0.974 0.976 0.976 0.978 0.981 0
## QDA 0.975 0.981 0.983 0.983 0.984 0.991 0
## RF 0.991 0.993 0.994 0.994 0.994 0.996 0
## CART 0.985 0.987 0.988 0.988 0.989 0.992 0
## KNN 0.985 0.987 0.987 0.987 0.988 0.992 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LDA 0.917 0.925 0.932 0.932 0.937 0.945 0
## QDA 0.929 0.946 0.950 0.950 0.954 0.973 0
## RF 0.973 0.978 0.981 0.981 0.983 0.988 0
## CART 0.955 0.961 0.964 0.965 0.969 0.974 0
## KNN 0.956 0.960 0.962 0.963 0.966 0.976 0
# Box and whisker plot
scales2 <- list(x = list(relation = "free"), y = list(relation = "free"))
bwplot(results2, scales = scales2)
# Density plot
densityplot(results2, scales = scales2)
# parallel plots to compare models
parallelplot(results2)
# pair-wise scatterplots of predictions to compare models
splom(results2, pscales = 0)
The following section presents the accuracy measurements, followed by graphical visualizations of performance, using the CO2 sensor exclusively.
results3 <- resamples(list(LDA=LDA_CO2, QDA=QDA_CO2,
RF=randforest_CO2, CART=CART_CO2, KNN=KNN_CO2))
summary(results3)
##
## Call:
## summary.resamples(object = results3)
##
## Models: LDA, QDA, RF, CART, KNN
## Number of resamples: 25
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LDA 0.878 0.882 0.884 0.885 0.887 0.897 0
## QDA 0.889 0.900 0.903 0.902 0.906 0.909 0
## RF 0.887 0.891 0.895 0.894 0.897 0.902 0
## CART 0.914 0.917 0.920 0.920 0.923 0.928 0
## KNN 0.901 0.907 0.910 0.910 0.913 0.920 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LDA 0.590 0.602 0.615 0.616 0.625 0.658 0
## QDA 0.653 0.680 0.697 0.692 0.705 0.723 0
## RF 0.657 0.675 0.682 0.682 0.691 0.705 0
## CART 0.743 0.756 0.765 0.765 0.775 0.786 0
## KNN 0.705 0.721 0.736 0.735 0.745 0.767 0
# Box and whisker plot
scales3 <- list(x=list(relation="free"), y=list(relation="free"))
bwplot(results3, scales=scales3, main='One Variable')
# Density plot
densityplot(results3, scales=scales3, main='One Variable')
# parallel plots to compare models
parallelplot(results3, main='One Variable')
# pair-wise scatterplots of predictions to compare models
splom(results3,pscales = 0)
The accuracy results from these machine learning algorithms indicate that the building system sensors can be used in different combinations to detect space occupancy, which can lead to more effective control of spaces. A few takeaways from this sensor analysis: with all variables available, every algorithm achieved roughly 99% accuracy, with Random Forest the strongest performer; the CO2 and Light pair performed nearly as well, at approximately 98-99% accuracy; and CO2 alone still detected occupancy with roughly 88-92% accuracy, with CART performing best in that case.
The final section of this publication seeks to incorporate the data previously explored, including ASHRAE building data, technology maturity, interior sensor data, and energy saving initiatives.
The building data reveals significant usage across various building types throughout the day. We can infer from the occupancy data that there is an opportunity to decrease energy consumption by using one or more occupancy sensors.
We do not have sufficient information to determine detailed building occupancy and usage from the ASHRAE data set, or whether the buildings are operating efficiently with their existing systems. However, we can use this data set and assume a collective opportunity for improving energy efficiency, given the 10-40% energy savings reported for HVAC and lighting.56 With this particular sensor data set, the opportunity for efficiency improvement is attributed to lighting and HVAC systems remaining in an operable condition when spaces are unoccupied.
Considering these energy saving opportunities, let’s see what a 30% decrease in energy is equivalent to for electricity use, economic value, and environmental impact.
Total Energy Use from ASHRAE Dataset:
options(digits=15)
bldg_data_train <- read.csv("01Data/06-ASHRAE_Energy_training_data.csv")
totalenergyuse <- bldg_data_train$meter_reading # meter readings, kWh
sum(totalenergyuse)
## [1] 42799931388.8031
This data indicates that the 1,448 buildings collectively consume 42,799,931,389 kilowatt-hours (kWh) of electricity annually. Now we will apply the 30% building systems efficiency improvement and determine the resulting energy savings.
Optimized Building System:
newenergyuse <- sum(totalenergyuse) * .7
newenergyuse
## [1] 29959951972.1621
savingsenergy <- sum(totalenergyuse) - newenergyuse
savingsenergy
## [1] 12839979416.6409
With an optimized building system, this performance improvement would reduce consumption to 29,959,951,972 kWh while saving 12,839,979,417 kWh of electricity annually.
In support of the financial savings, we will use the average retail rate for electricity, which is currently an average of 10.59 cents per kWh in the United States.64
Economic Performance
The financial savings associated with a 30% decrease in energy consumption would amount to:
currentenergycost <- sum(totalenergyuse) * .1059
currentenergycost
## [1] 4532512734.07424
newenergycost <- sum(totalenergyuse) * .7 * .1059
newenergycost
## [1] 3172758913.85197
savingsenergycost <- sum(totalenergyuse) * .3 * .1059
savingsenergycost
## [1] 1359753820.22227
These building owners cumulatively pay approximately 4,532,512,734 USD for electricity each year. With a 30% efficiency improvement, the new total is 3,172,758,914 USD, a savings of 1,359,753,820 USD annually. That is a significant savings; however, there is also an investment cost associated with building systems implementation and lifecycle support, which is not discussed in detail within this report.
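To put that omitted investment cost in context, a hedged sketch of a simple payback calculation; the installed cost per building is purely illustrative and not drawn from any dataset:
CostPerBldg <- 250000 # assumed installed cost, USD per building (illustrative)
TotalCost <- CostPerBldg * 1448 # 1,448 buildings in the dataset
TotalCost / savingsenergycost # simple payback period, in years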
Emissions Use and Reduction
The ASHRAE dataset provides substantial information about building energy usage. Although it is difficult to determine the specific electricity source and carbon footprint for each building, we can infer that a percentage of emissions is created from this energy consumption.
The U.S. Environmental Protection Agency (EPA) has produced the Emissions & Generation Resource Integrated Database (eGRID). This database is a resource which provides information about annual emission rate estimates from various sources.65
There are a multitude of formulas associated with emission rates from various sources. Our objective is to quantify the environmental impact of the CO2, CH4, and N2O reductions that result from saving energy.
The U.S. Environmental Protection Agency (EPA), as well as NOAA, provide resources for calculating Greenhouse Gas Equivalencies.66,67 Although we have the ability to calculate some of these values, it is difficult to determine specific emissions from specific buildings, as well as their sources of energy. With many variables to consider, a general approach includes the following process: (1) sum the total energy consumption from the dataset's meter readings; (2) obtain the total national energy consumption for context; (3) identify the fossil-fuel portion of that consumption; and (4) apply an emission factor to the energy savings to estimate the CO2 reduction.
Let’s begin with this process, while providing the values from previously obtained datasets:
# Step 1: Total Energy Consumption from ASHRAE meter readings dataset
sum(totalenergyuse)
## [1] 42799931388.8031
# Steps 2,3: National energy context. The United States had
# the following per-capita energy consumption from fossil
# fuels for the year that the data was processed. Work
# cited reference [14]
UScoal2016 <- 12262 # kWh per person
USOil2016 <- 30839 # kWh per person
USGas2016 <- 23191 # kWh per person
USFossil2016 <- sum(UScoal2016, USOil2016, USGas2016)
USFossil2016 # total fossil, kWh per person
## [1] 66292
# 66292 kWh per person from fossil fuels; 78367 kWh per person from all sources
The last portion of this analysis includes calculating the CO2 reduction. Although there are many resources for performing these measurements, the U.S. Environmental Protection Agency (EPA) provides the Greenhouse Gases Equivalencies Calculator, which can also be used to double-check the results. The savings in energy has a direct relationship to the reduction of CO2 emissions.
# Step 4: CO2 reduction calculation
# energy saved (kWh) * emission factor (metric tons CO2 per kWh)
EmissionRatio <- 0.000432594303
CO2Reduction <- savingsenergy * EmissionRatio
CO2Reduction
## [1] 5554501.94627612
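As a cross-check in the spirit of the EPA equivalencies calculator, the reduction can be restated in familiar terms; the per-vehicle figure (roughly 4.6 metric tons CO2 per passenger vehicle per year) is an assumed EPA value worth verifying against the calculator itself:
CO2Reduction / 4.6 # approximate passenger-vehicle-years of emissions avoided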
That is a significant savings in energy, cost, and greenhouse gas emissions, made evident through data science! Taking a long-term approach to energy conservation efforts would yield additional benefits from CO2 reduction, such as longer lifespans and lower healthcare expenses.
This deep dive into many data sets, using a multitude of data science, analytics, and other processes, provided the ability to explore and find solutions spanning the environment, energy, infrastructure, regulatory controls, and incentives.
Although a general approach was taken with this analysis, it provides a solid understanding of the problem and how to address solutions, whether small or large scale. The same approach could also be applied nationally or globally for greater reach and impact.
There were several limitations to this analysis: the ASHRAE data set does not provide detailed occupancy or usage information for each building; the specific electricity source and carbon footprint of each building are unknown; the 30% efficiency improvement and the average retail electricity rate are assumed national figures; and investment and lifecycle costs were not evaluated.
There are a multitude of directions this analysis can lead. Some of these include incorporating the investment and lifecycle costs of building systems, extending the sensor-based control approach to national or global scale, and refining the emissions estimates with building-specific energy sources.
American Institute of Architects (AIA); American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE); analytics; architecture; average linkage; big data; building energy modeling; building survey; building occupancy; building systems; carbon dioxide modeling; Classification and Regression Trees (CART); climate change; cluster analysis; communications; correlation; cyber security; cyber-physical systems; data analysis; data science; dendrogram; economics; electric efficiency; electric metering; electrical use; energy consumption; energy programs; energy resiliency; energy use; engineering; environment; Euclidean distance; exploratory data analysis (EDA); financial modeling; greenhouse gases; health; heatmap; hierarchical clustering analysis; humidity ratio; humidity sensor; illumination; industrial control systems; information systems; information technology; infrastructure; internet of things (IoT); Institute of Electrical and Electronics Engineers (IEEE); International Organization for Standardization (ISO); k-means; k-nearest neighbors (kNN); Leadership in Energy and Environmental Design (LEED); light sensor; Linear Discriminant Analysis (LDA); machine learning; manufacturing; methane modeling; networks; nitrous oxide modeling; occupancy; operations; passive infrared sensor; policies; Quadratic Discriminant Analysis (QDA); Random Forest (RF); regulations; sensors; service level management; smart cities; standards; statistics; supply chain; systems-of-systems; technology; thermal sensor; time-series; US Green Building Council.