Group 8 Project WQD7004
Group members:
Junhao Xu (S2174087/1), Haojie Chu (S2175919/1), Wang Jiaxin (S2164009/1), Xiaowei Zhang (S2165529/1), Cheah Xiao Ying (S2001533)
Dataset: https://tianchi.aliyun.com/dataset/84167
Energy plays an important role in human society; without it, society cannot develop. There are various types of power generation, such as coal, nuclear, natural gas, solar, and geothermal. Since thermal power will remain the main source of electricity for industrial and domestic use for many years to come, exploring efficient and clean power generation technologies and optimizing thermal power facilities to improve energy efficiency are the main issues at hand.
The basic principle of thermal power generation is as follows: burning fuel heats water to produce steam, the steam pressure drives a turbine, and the turbine in turn drives a generator to produce electricity. In this chain of energy conversions, the core factor affecting generation efficiency is the combustion efficiency of the boiler, that is, how effectively fuel is burned to heat water into high-temperature, high-pressure steam. Many factors affect boiler combustion efficiency, including adjustable boiler parameters such as fuel feed rate, primary and secondary air, induced draught, return air, and feed-water volume, as well as boiler operating conditions such as bed temperature, bed pressure, furnace temperature and pressure, and superheater temperature.
With the rapid advance of industrialization, applying big-data-driven artificial intelligence to industrial production and manufacturing helps make them digital and intelligent. Thermal power boilers generate large amounts of data during operation, and machine learning can extract valuable information from these large, complex datasets to help companies optimize equipment parameters, thereby improving the stability, speed, accuracy, and energy conversion rate of boiler operation. Machine learning can be applied across industries to analyze, organize, and make effective use of accumulated data; it handles large volumes of data with complex, non-linear structure, and with suitable algorithms it can learn patterns automatically. These advantages have opened a new direction for the analysis of industrial data.
In this project, a steam-volume prediction model for thermal power generation is developed using machine learning. Multiple linear regression, SVR, XGBoost, random forest, and a decision tree are used to predict the steam volume of thermal power generation, with two objectives: to understand the relationship between each variable and the target, and to identify which model performs best at predicting the steam volume.
Import the dataset and view its basic information.
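The analysis below assumes the following packages are loaded. This setup chunk is a minimal sketch reconstructed from the functions used in the report, since the original setup is not shown.
# Packages assumed by the code in this report (setup sketch)
library(dplyr)          # mutate(), bind_rows()
library(ggplot2)        # density plots, scatter plots
library(Amelia)         # missmap()
library(corrplot)       # correlation heat map
library(caret)          # createDataPartition(), RMSE(), R2()
library(xgboost)        # xgb.DMatrix(), xgb.train(), xgb.importance()
library(randomForest)   # randomForest(), varImpPlot()
library(e1071)          # svm()
library(rpart)          # rpart()
library(rpart.plot)     # rpart.plot()
library(knitr)          # kable()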
# Import the data files
# Test set: 1925 rows x 38 columns
# Training set: 2888 rows x 39 columns
# The training set has one extra column, 'target', which is the value to be predicted
data_train <- read.table("D:/zhengqi_train.txt", header = TRUE, sep = "\t")
data_test <- read.table("D:/zhengqi_test.txt", header = TRUE, sep = "\t")
# keep a copy of the original training data in data_train1
data_train1 <- data_train
# add an 'origin' column to mark which set each row comes from
data_train<-mutate(data_train, origin = 'train')
data_test<-mutate(data_test, origin = 'test')
head(data_train)
head(data_test)
str(data_train)
## 'data.frame': 2888 obs. of 40 variables:
## $ V0 : num 0.566 0.968 1.013 0.733 0.684 ...
## $ V1 : num 0.016 0.437 0.568 0.368 0.638 ...
## $ V2 : num -0.143 0.066 0.235 0.283 0.26 0.408 0.64 0.704 0.584 0.638 ...
## $ V3 : num 0.407 0.566 0.37 0.165 0.209 0.22 0.356 0.438 0.459 0.617 ...
## $ V4 : num 0.452 0.194 0.112 0.599 0.337 ...
## $ V5 : num -0.901 -0.893 -0.797 -0.679 -0.454 ...
## $ V6 : num -1.81 -1.57 -1.37 -1.2 -1.07 ...
## $ V7 : num -2.36 -2.36 -2.36 -2.09 -2.09 ...
## $ V8 : num -0.436 0.332 0.396 0.403 0.314 0.481 0.729 0.753 0.763 0.968 ...
## $ V9 : num -2.11 -2.11 -2.11 -2.11 -2.11 ...
## $ V10 : num -0.94 0.188 0.874 0.011 -0.251 -0.511 -0.256 -0.067 0.205 0.145 ...
## $ V11 : num -0.307 -0.455 -0.051 0.102 0.57 -0.564 -0.278 -0.24 0.422 0.179 ...
## $ V12 : num -0.073 -0.134 -0.072 -0.014 0.199 0.294 0.425 0.272 0.387 0.688 ...
## $ V13 : num 0.55 1.109 0.767 0.769 -0.349 ...
## $ V14 : num -0.484 -0.488 -0.493 -0.371 -0.342 -0.345 -0.3 -0.387 -0.264 -0.289 ...
## $ V15 : num 0 0 -0.212 -0.162 -0.138 0.111 0.111 0.244 0.293 0.317 ...
## $ V16 : num -1.707 -0.977 -0.618 -0.429 -0.391 ...
## $ V17 : num -1.162 -1.162 -0.897 -0.897 -0.897 ...
## $ V18 : num -0.573 -0.571 -0.564 -0.574 -0.572 -0.573 -0.586 -0.579 -0.566 -0.567 ...
## $ V19 : num -0.991 -0.836 -0.558 -0.564 -0.394 -0.516 -0.544 -0.465 -0.173 -0.557 ...
## $ V20 : num 0.61 0.588 0.576 0.272 0.106 0.029 0.156 0.254 0.25 0.263 ...
## $ V21 : num -0.4 -0.802 -0.477 -0.491 0.309 -0.56 -0.34 -0.442 0.31 0.241 ...
## $ V22 : num -0.063 -0.063 -0.063 -0.063 -0.259 -0.096 -0.063 -0.259 -0.259 -0.259 ...
## $ V23 : num 0.356 0.357 0.355 0.352 0.352 0.349 0.352 0.366 0.366 0.358 ...
## $ V24 : num 0.8 0.801 0.961 1.435 0.881 ...
## $ V25 : num -0.223 -0.144 -0.067 0.113 0.221 0.245 0.389 0.56 0.577 0.493 ...
## $ V26 : num 0.796 1.057 0.915 0.898 0.386 ...
## $ V27 : num 0.168 0.338 0.326 0.277 0.332 0.356 0.401 0.409 0.49 0.512 ...
## $ V28 : num -0.45 0.671 1.287 1.298 1.289 ...
## $ V29 : num 0.136 -0.128 -0.009 0.015 0.183 0.454 0.454 0.139 0.188 0.86 ...
## $ V30 : num 0.109 0.124 0.361 0.417 1.078 ...
## $ V31 : num -0.615 0.032 0.277 0.279 0.328 0.358 0.243 0.428 0.597 0.916 ...
## $ V32 : num 0.327 0.6 -0.116 0.603 0.418 0.618 0.468 -0.119 -0.057 0.039 ...
## $ V33 : num -4.627 -0.843 -0.843 -0.843 -0.843 ...
## $ V34 : num -4.789 0.16 0.16 -0.065 -0.215 ...
## $ V35 : num -5.101 0.364 0.364 0.364 0.364 ...
## $ V36 : num -2.608 -0.335 0.765 0.333 -0.28 ...
## $ V37 : num -3.508 -0.73 -0.589 -0.112 -0.028 ...
## $ target: num 0.175 0.676 0.633 0.206 0.384 0.06 0.415 0.609 0.981 0.818 ...
## $ origin: chr "train" "train" "train" "train" ...
str(data_test)
## 'data.frame': 1925 obs. of 39 variables:
## $ V0 : num 0.368 0.148 -0.166 0.102 0.3 0.05 -0.223 -0.126 -0.203 -0.181 ...
## $ V1 : num 0.38 0.489 -0.062 0.294 0.428 0.34 0.175 0.152 -0.014 0.797 ...
## $ V2 : num -0.225 -0.247 -0.311 -0.259 0.208 0.108 -0.39 0.227 0.01 0.47 ...
## $ V3 : num -0.049 -0.049 0.046 0.051 0.051 0.051 0.051 0.021 -0.034 -0.107 ...
## $ V4 : num 0.379 0.122 -0.055 -0.183 -0.033 -0.348 0.006 -0.619 -0.322 -0.477 ...
## $ V5 : num 0.092 -0.201 0.063 0.148 0.116 0.074 0.134 -0.069 0.105 0.184 ...
## $ V6 : num 0.55 0.487 0.485 0.474 0.408 0.516 0.497 0.52 0.453 0.588 ...
## $ V7 : num 0.551 0.493 0.493 0.504 0.497 0.491 0.548 0.548 0.518 0.528 ...
## $ V8 : num 0.244 -0.127 -0.227 0.01 0.155 0.238 -0.099 0.06 -0.032 0.319 ...
## $ V9 : num 0.904 0.904 0.904 0.904 0.904 0.904 0.904 0.904 0.473 0.904 ...
## $ V10 : num -0.419 -0.403 0.33 -0.431 -0.162 0.079 0.226 -0.223 -0.352 -0.108 ...
## $ V11 : num 0.515 -0.324 0.389 0.524 0.554 0.373 0.477 0.171 0.254 0.618 ...
## $ V12 : num 0.346 0.465 0.173 -0.038 -0.063 -0.246 -0.276 -0.369 -0.318 -0.709 ...
## $ V13 : num -0.114 0.653 0.398 -0.34 0.611 ...
## $ V14 : num -0.204 0.148 0.068 -0.313 -0.319 -0.365 -0.247 -0.271 -0.008 -0.207 ...
## $ V15 : num 0.239 -0.113 -0.192 -0.59 -0.927 ...
## $ V16 : num -0.089 -0.093 -0.061 -0.134 -0.075 0.222 0.252 0.01 0.003 0.252 ...
## $ V17 : num 0.961 0.961 0.961 0.961 0.961 0.961 0.961 0.961 0.961 0.961 ...
## $ V18 : num 0.247 0.073 0.07 0.078 0.08 0.07 0.066 0.07 0.064 0.078 ...
## $ V19 : num 0.899 1.168 0.98 1.07 1.238 ...
## $ V20 : num -0.252 -0.276 -0.34 -0.292 -0.15 -0.11 -0.331 -0.245 -0.472 -0.493 ...
## $ V21 : num 0.628 0.009 0.27 0.726 0.141 ...
## $ V22 : num -0.063 -0.063 -0.063 0.133 0.133 0.133 -0.063 -0.063 0.133 0.133 ...
## $ V23 : num 0.098 0.09 0.091 0.086 0.089 0.099 0.104 0.101 0.093 -0.178 ...
## $ V24 : num -1.314 -1.31 -1.31 0.234 0.237 ...
## $ V25 : num -0.662 -0.646 -0.473 -0.337 -0.285 -0.071 0.009 0.003 0.006 0.199 ...
## $ V26 : num -0.596 -0.776 -0.607 -0.986 -0.669 ...
## $ V27 : num 0.208 0.226 0.084 0.203 0.227 0.229 0.133 0.262 0.241 0.307 ...
## $ V28 : num -0.449 -0.443 -0.458 -0.456 -0.458 -0.45 -0.452 -0.452 -0.45 -0.446 ...
## $ V29 : num 0.047 0.047 -0.398 -0.398 -0.776 ...
## $ V30 : num 0.057 0.56 0.101 1.007 0.291 ...
## $ V31 : num -0.042 0.176 0.199 0.137 0.37 0.447 0.432 0.281 0.222 0.466 ...
## $ V32 : num 0.847 0.551 0.634 1.042 0.181 ...
## $ V33 : num 0.534 0.046 0.017 -0.04 -0.04 -0.04 -0.04 -0.04 -0.04 -0.04 ...
## $ V34 : num -0.009 -0.22 -0.234 -0.29 -0.29 -0.29 -0.29 -0.29 -0.29 -0.29 ...
## $ V35 : num -0.19 0.008 0.008 0.008 0.008 0.008 0.008 0.008 0.008 -0.289 ...
## $ V36 : num -0.567 -0.294 0.373 -0.666 -0.14 -0.228 0.104 -0.7 -0.236 -0.431 ...
## $ V37 : num 0.388 0.104 0.569 0.391 -0.497 ...
## $ origin: chr "test" "test" "test" "test" ...
# Merge data_train and data_test by row
df<- bind_rows(data_train, data_test)
str(df)
## 'data.frame': 4813 obs. of 40 variables:
## $ V0 : num 0.566 0.968 1.013 0.733 0.684 ...
## $ V1 : num 0.016 0.437 0.568 0.368 0.638 ...
## $ V2 : num -0.143 0.066 0.235 0.283 0.26 0.408 0.64 0.704 0.584 0.638 ...
## $ V3 : num 0.407 0.566 0.37 0.165 0.209 0.22 0.356 0.438 0.459 0.617 ...
## $ V4 : num 0.452 0.194 0.112 0.599 0.337 ...
## $ V5 : num -0.901 -0.893 -0.797 -0.679 -0.454 ...
## $ V6 : num -1.81 -1.57 -1.37 -1.2 -1.07 ...
## $ V7 : num -2.36 -2.36 -2.36 -2.09 -2.09 ...
## $ V8 : num -0.436 0.332 0.396 0.403 0.314 0.481 0.729 0.753 0.763 0.968 ...
## $ V9 : num -2.11 -2.11 -2.11 -2.11 -2.11 ...
## $ V10 : num -0.94 0.188 0.874 0.011 -0.251 -0.511 -0.256 -0.067 0.205 0.145 ...
## $ V11 : num -0.307 -0.455 -0.051 0.102 0.57 -0.564 -0.278 -0.24 0.422 0.179 ...
## $ V12 : num -0.073 -0.134 -0.072 -0.014 0.199 0.294 0.425 0.272 0.387 0.688 ...
## $ V13 : num 0.55 1.109 0.767 0.769 -0.349 ...
## $ V14 : num -0.484 -0.488 -0.493 -0.371 -0.342 -0.345 -0.3 -0.387 -0.264 -0.289 ...
## $ V15 : num 0 0 -0.212 -0.162 -0.138 0.111 0.111 0.244 0.293 0.317 ...
## $ V16 : num -1.707 -0.977 -0.618 -0.429 -0.391 ...
## $ V17 : num -1.162 -1.162 -0.897 -0.897 -0.897 ...
## $ V18 : num -0.573 -0.571 -0.564 -0.574 -0.572 -0.573 -0.586 -0.579 -0.566 -0.567 ...
## $ V19 : num -0.991 -0.836 -0.558 -0.564 -0.394 -0.516 -0.544 -0.465 -0.173 -0.557 ...
## $ V20 : num 0.61 0.588 0.576 0.272 0.106 0.029 0.156 0.254 0.25 0.263 ...
## $ V21 : num -0.4 -0.802 -0.477 -0.491 0.309 -0.56 -0.34 -0.442 0.31 0.241 ...
## $ V22 : num -0.063 -0.063 -0.063 -0.063 -0.259 -0.096 -0.063 -0.259 -0.259 -0.259 ...
## $ V23 : num 0.356 0.357 0.355 0.352 0.352 0.349 0.352 0.366 0.366 0.358 ...
## $ V24 : num 0.8 0.801 0.961 1.435 0.881 ...
## $ V25 : num -0.223 -0.144 -0.067 0.113 0.221 0.245 0.389 0.56 0.577 0.493 ...
## $ V26 : num 0.796 1.057 0.915 0.898 0.386 ...
## $ V27 : num 0.168 0.338 0.326 0.277 0.332 0.356 0.401 0.409 0.49 0.512 ...
## $ V28 : num -0.45 0.671 1.287 1.298 1.289 ...
## $ V29 : num 0.136 -0.128 -0.009 0.015 0.183 0.454 0.454 0.139 0.188 0.86 ...
## $ V30 : num 0.109 0.124 0.361 0.417 1.078 ...
## $ V31 : num -0.615 0.032 0.277 0.279 0.328 0.358 0.243 0.428 0.597 0.916 ...
## $ V32 : num 0.327 0.6 -0.116 0.603 0.418 0.618 0.468 -0.119 -0.057 0.039 ...
## $ V33 : num -4.627 -0.843 -0.843 -0.843 -0.843 ...
## $ V34 : num -4.789 0.16 0.16 -0.065 -0.215 ...
## $ V35 : num -5.101 0.364 0.364 0.364 0.364 ...
## $ V36 : num -2.608 -0.335 0.765 0.333 -0.28 ...
## $ V37 : num -3.508 -0.73 -0.589 -0.112 -0.028 ...
## $ target: num 0.175 0.676 0.633 0.206 0.384 0.06 0.415 0.609 0.981 0.818 ...
## $ origin: chr "train" "train" "train" "train" ...
# Count missing values in each column of the training set
colSums(is.na(data_train))
## V0 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## 0 0 0 0 0 0 0 0 0 0 0
## V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
## 0 0 0 0 0 0 0 0 0 0 0
## V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32
## 0 0 0 0 0 0 0 0 0 0 0
## V33 V34 V35 V36 V37 target origin
## 0 0 0 0 0 0 0
# Confirm that all feature columns and the target are numeric
lapply(df[,c(1:39)], is.numeric)
## $V0
## [1] TRUE
##
## $V1
## [1] TRUE
##
## $V2
## [1] TRUE
##
## $V3
## [1] TRUE
##
## $V4
## [1] TRUE
##
## $V5
## [1] TRUE
##
## $V6
## [1] TRUE
##
## $V7
## [1] TRUE
##
## $V8
## [1] TRUE
##
## $V9
## [1] TRUE
##
## $V10
## [1] TRUE
##
## $V11
## [1] TRUE
##
## $V12
## [1] TRUE
##
## $V13
## [1] TRUE
##
## $V14
## [1] TRUE
##
## $V15
## [1] TRUE
##
## $V16
## [1] TRUE
##
## $V17
## [1] TRUE
##
## $V18
## [1] TRUE
##
## $V19
## [1] TRUE
##
## $V20
## [1] TRUE
##
## $V21
## [1] TRUE
##
## $V22
## [1] TRUE
##
## $V23
## [1] TRUE
##
## $V24
## [1] TRUE
##
## $V25
## [1] TRUE
##
## $V26
## [1] TRUE
##
## $V27
## [1] TRUE
##
## $V28
## [1] TRUE
##
## $V29
## [1] TRUE
##
## $V30
## [1] TRUE
##
## $V31
## [1] TRUE
##
## $V32
## [1] TRUE
##
## $V33
## [1] TRUE
##
## $V34
## [1] TRUE
##
## $V35
## [1] TRUE
##
## $V36
## [1] TRUE
##
## $V37
## [1] TRUE
##
## $target
## [1] TRUE
# Visualize missing values in the training set
missmap(data_train)
# Compare the distribution of each feature between the training and test sets
for(col in names(data_test)){
  if(col != "origin"){
    print(ggplot(df, aes(x = get(col), colour = origin, group = origin)) +
            xlab("feature value") +
            ggtitle(col) +
            geom_density(size = 2))
  }
}
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
Based on the density plots above, features whose distributions differ substantially between the training and test sets are removed. This reduces the risk of overfitting, discards features that would not transfer well to the test set, and helps improve the generalizability of the model.
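As a complement to the visual comparison, the shift in each feature's distribution between the training and test sets could be quantified with a two-sample Kolmogorov–Smirnov test. This is an optional check sketched below, not part of the original analysis; small p-values indicate a clear distribution shift.
# Optional check: KS-test p-value per feature (smaller = stronger evidence of train/test shift)
ks_p <- sapply(setdiff(names(data_test), "origin"), function(col) {
  ks.test(data_train[[col]], data_test[[col]])$p.value
})
sort(ks_p)[1:10]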
# Drop features whose distributions differ markedly between the training and test sets
drop_array <- c("V5", "V9", "V14", "V17", "21", "28")
data_train1 <- data_train1[, !(names(data_train1) %in% drop_array)]
names(data_train1)
## [1] "V0" "V1" "V2" "V3" "V4" "V6" "V7" "V8"
## [9] "V10" "V11" "V12" "V13" "V15" "V16" "V18" "V19"
## [17] "V20" "V21" "V22" "V23" "V24" "V25" "V26" "V27"
## [25] "V28" "V29" "V30" "V31" "V32" "V33" "V34" "V35"
## [33] "V36" "V37" "target"
# Plotting scatter plots and fitting curves
for(col in names(data_train1)){
print(ggplot(data_train1, aes(x = get(col), y = target)) +
# Scatter plot function
geom_point()+
xlab(col) +
ylab("target") +
# Scale function: palette sets the color scheme
scale_colour_brewer(palette = "Set1") +
geom_smooth())
}
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
# Correlation heat map of the remaining features and the target
par(mfrow = c(1, 1), mar = c(2, 2, 2, 2), cex = 0.8)
corr <- cor(data_train1)
corrplot(corr, method = "color", type = "lower", tl.col = "black", tl.srt = 45, addCoefasPercent = TRUE, addCoef.cex = 0.5)
## Warning in text.default(pos.xlabel[, 1], pos.xlabel[, 2], newcolnames, srt =
## tl.srt, : "addCoef.cex" is not a graphical parameter
## Warning in text.default(pos.ylabel[, 1], pos.ylabel[, 2], newrownames, col =
## tl.col, : "addCoef.cex" is not a graphical parameter
## Warning in title(title, ...): "addCoef.cex" is not a graphical parameter
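The warnings above appear because addCoef.cex is not an argument of corrplot(); coefficients are only printed when addCoef.col is set, and their text size is controlled by number.cex. A corrected call would look like the following (a suggested fix, not applied here):
corrplot(corr, method = "color", type = "lower", tl.col = "black", tl.srt = 45, addCoefasPercent = TRUE, addCoef.col = "black", number.cex = 0.5)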
Features whose absolute correlation with the target is below 0.1 are removed.
# Remove features whose absolute correlation with the target is below 0.1
threshold <- 0.1
corr_matrix <- abs(cor(data_train1)) # Absolute value correlation matrix
drop_col <- names(data_train1)[corr_matrix[,"target"] < threshold]
drop_col
## [1] "V21" "V25" "V26" "V32" "V33" "V34"
data_train1 <- data_train1[, !(names(data_train1) %in% drop_col)]
names(data_train1)
## [1] "V0" "V1" "V2" "V3" "V4" "V6" "V7" "V8"
## [9] "V10" "V11" "V12" "V13" "V15" "V16" "V18" "V19"
## [17] "V20" "V22" "V23" "V24" "V27" "V28" "V29" "V30"
## [25] "V31" "V35" "V36" "V37" "target"
# data_train1 is the data frame after removing the low-correlation columns
trainy = data_train1["target"]
trainx = data_train1[, !(names(data_train1) %in% "target")]
# Split into training (70%) and test (30%) sets
set.seed(827)
inTrain <- createDataPartition(trainy$target, p = 0.7, list = FALSE)
X_train <- trainx[inTrain,]
X_test <- trainx[-inTrain,]
y_train <- trainy[inTrain,]
y_test <- trainy[-inTrain,]
Prediction evaluation metrics: RMSE (root mean squared error) measures the deviation between predicted and true values; the smaller the value, the more accurate the model. MSE (mean squared error) is the mean of the squared differences between predicted and true values; again, smaller values indicate a more accurate model. R² (coefficient of determination) is the proportion of the variation in the true values that is explained by the model's predictions; the closer it is to 1, the more accurate the model.
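For reference, the three metrics can be computed directly as sketched below. The evaluation later in this report uses mean(), caret::RMSE() and caret::R2(); note that caret's R2() defaults to the squared correlation between predictions and observations, which can differ slightly from the formula written out here.
# Metric helpers (illustrative sketch; names chosen to avoid clashing with caret::RMSE and caret::R2)
mse_fn  <- function(actual, pred) mean((actual - pred)^2)                                      # mean squared error
rmse_fn <- function(actual, pred) sqrt(mse_fn(actual, pred))                                   # root mean squared error
r2_fn   <- function(actual, pred) 1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)  # coefficient of determination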
# Multiple linear regression
x_vars <- names(X_train)
x_vars<-paste(x_vars, collapse = "+")
x_vars
## [1] "V0+V1+V2+V3+V4+V6+V7+V8+V10+V11+V12+V13+V15+V16+V18+V19+V20+V22+V23+V24+V27+V28+V29+V30+V31+V35+V36+V37"
lr_model <- lm(y_train ~V0+V1+V2+V3+V4+V6+V7+V8+V10+V11+V12+V13+V15+V16+V18+V19+V20+V22+V23+V24+V27+V28+V29+V30+V31+V35+V36+V37, data=cbind(X_train,y_train))
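Equivalently, the pasted string above can be turned into a formula programmatically instead of typing every term (a sketch producing the same fit):
lr_formula <- as.formula(paste("y_train ~", x_vars))
# lm(lr_formula, data = cbind(X_train, y_train)) fits the same model as the call above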
# Coefficient table showing beta coefficient estimates and their significance levels
# Estimate: intercept (b0) and beta coefficient estimates associated with each predictor variable
# Std.Error: standard error of the coefficient estimates. This represents the accuracy of the coefficients. The larger the standard error, the less confidence we have in the estimate.
# t value: t-statistic, which is the coefficient estimate (column 2) divided by the standard error of the estimate (column 3).
# Pr(>|t|): p-value corresponding to the t-statistic. The smaller the p-value, the more significant the estimate.
summary(lr_model)
##
## Call:
## lm(formula = y_train ~ V0 + V1 + V2 + V3 + V4 + V6 + V7 + V8 +
## V10 + V11 + V12 + V13 + V15 + V16 + V18 + V19 + V20 + V22 +
## V23 + V24 + V27 + V28 + V29 + V30 + V31 + V35 + V36 + V37,
## data = cbind(X_train, y_train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.55322 -0.18309 -0.01699 0.17665 1.55426
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2569077 0.0216219 -11.882 < 2e-16 ***
## V0 0.3651026 0.0308942 11.818 < 2e-16 ***
## V1 0.2144276 0.0291299 7.361 2.65e-13 ***
## V2 0.1391036 0.0236074 5.892 4.46e-09 ***
## V3 0.1440395 0.0096956 14.856 < 2e-16 ***
## V4 0.0490734 0.0303358 1.618 0.105891
## V6 0.1821767 0.0350485 5.198 2.22e-07 ***
## V7 -0.1618945 0.0253709 -6.381 2.18e-10 ***
## V8 -0.1911270 0.0341246 -5.601 2.43e-08 ***
## V10 0.3314013 0.0255671 12.962 < 2e-16 ***
## V11 0.0195061 0.0118248 1.650 0.099182 .
## V12 0.0489378 0.0271553 1.802 0.071673 .
## V13 0.0004546 0.0112052 0.041 0.967639
## V15 0.0323430 0.0253617 1.275 0.202362
## V16 0.0179925 0.0286673 0.628 0.530318
## V18 0.0371944 0.0101865 3.651 0.000268 ***
## V19 0.0189041 0.0086529 2.185 0.029027 *
## V20 -0.0009343 0.0110808 -0.084 0.932812
## V22 0.1181494 0.0152160 7.765 1.30e-14 ***
## V23 0.0253475 0.0116220 2.181 0.029300 *
## V24 -0.0550185 0.0108939 -5.050 4.81e-07 ***
## V27 0.9780766 0.0813257 12.027 < 2e-16 ***
## V28 -0.0086860 0.0083124 -1.045 0.296172
## V29 -0.0438405 0.0274205 -1.599 0.110019
## V30 -0.0062451 0.0104126 -0.600 0.548732
## V31 0.0189293 0.0275342 0.687 0.491858
## V35 -0.0244023 0.0108775 -2.243 0.024982 *
## V36 -0.2557386 0.0243117 -10.519 < 2e-16 ***
## V37 -0.0647407 0.0178687 -3.623 0.000298 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3249 on 1995 degrees of freedom
## Multiple R-squared: 0.8946, Adjusted R-squared: 0.8932
## F-statistic: 605 on 28 and 1995 DF, p-value: < 2.2e-16
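Several predictors (e.g. V13, V16, V20, V30) are not statistically significant in this fit; if a more parsimonious model were desired, backward elimination could be applied with base R's step(). This is an optional refinement, not used in this report.
# Optional: backward elimination on the fitted linear model
lr_model_step <- step(lr_model, direction = "backward", trace = FALSE)
# summary(lr_model_step) would then show the reduced model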
# Make predictions on the held-out test split
pred <- predict(lr_model, X_test)
# Plot predicted vs. actual values for the linear model
df_pred <- data.frame(pred, y_test)
colnames(df_pred) <- c("pred","y_test")
ggplot(df_pred, aes(x=pred, y=y_test)) +
geom_point() +
geom_abline(intercept = 0, slope = 1) +
ggtitle("Linear Model Plot")
#Evaluation
mse <- mean((y_test - pred)^2)
paste("mse:",mse)
## [1] "mse: 0.131918573848608"
paste("rmse:",RMSE(pred, y_test))
## [1] "rmse: 0.363205966152275"
paste("r2:",R2(pred, y_test))
## [1] "r2: 0.858248127195159"
# Set training parameters
params <- list(booster = "gbtree",
objective = "reg:squarederror",
eta = 0.3,
max_depth = 6,
subsample = 1,
colsample_bytree = 1)
# Convert training data into the required format for xgboost
train_matrix <- xgb.DMatrix(data = as.matrix(X_train), label = y_train)
# Train the model
xgboost_model <- xgb.train(params = params, data = train_matrix, nrounds = 100)
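The number of boosting rounds was fixed at 100 above; it could instead be selected with xgboost's built-in cross-validation. The sketch below is an optional step that was not run in this report, and the choice of 300 rounds with 5 folds is illustrative.
# Optional: choose nrounds by 5-fold cross-validation with early stopping
cv <- xgb.cv(params = params, data = train_matrix, nrounds = 300,
             nfold = 5, early_stopping_rounds = 20, verbose = 0)
# Row of the evaluation log with the lowest cross-validated RMSE
cv$evaluation_log[which.min(cv$evaluation_log$test_rmse_mean), ]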
#Predictions for the test set
test_matrix <- xgb.DMatrix(data = as.matrix(X_test))
pred <- predict(xgboost_model, test_matrix)
# Use the xgb.plot.importance() function to visualize the importance of each feature to the model prediction results.
importance_matrix <- xgb.importance(colnames(X_train), model = xgboost_model)
xgb.plot.importance(importance_matrix)
# xgb.plot.importance(xgboost_model)
# Plot the scatter plot of predicted and true values
df_pred <- data.frame(pred, y_test)
colnames(df_pred) <- c("pred","y_test")
ggplot(df_pred, aes(x=pred, y=y_test)) +
geom_point() +
geom_abline(intercept = 0, slope = 1) +
ggtitle("XGBoost Model Plot")
#Evaluation
mse <- mean((y_test - pred)^2)
paste("mse:",mse)
## [1] "mse: 0.142307400438178"
paste("rmse:",RMSE(pred, y_test))
## [1] "rmse: 0.377236531155425"
paste("r2:",R2(pred, y_test))
## [1] "r2: 0.847343007661001"
# Random forest regression with 1000 trees
rf<-randomForest(y_train ~V0+V1+V2+V3+V4+V6+V7+V8+V10+V11+V12+V13+V15+V16+V18+V19+V20+V22+V23+V24+V27+V28+V29+V30+V31+V35+V36+V37, data=cbind(X_train,y_train),importance=TRUE, ntree=1000)
rf
##
## Call:
## randomForest(formula = y_train ~ V0 + V1 + V2 + V3 + V4 + V6 + V7 + V8 + V10 + V11 + V12 + V13 + V15 + V16 + V18 + V19 + V20 + V22 + V23 + V24 + V27 + V28 + V29 + V30 + V31 + V35 + V36 + V37, data = cbind(X_train, y_train), importance = TRUE, ntree = 1000)
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 9
##
## Mean of squared residuals: 0.1167821
## % Var explained: 88.17
importance(rf)
## %IncMSE IncNodePurity
## V0 45.973435 579.981173
## V1 33.688909 444.433595
## V2 36.948294 72.460349
## V3 44.608535 34.981417
## V4 22.903202 36.007938
## V6 22.115563 14.002075
## V7 19.450743 10.630222
## V8 24.883507 266.626312
## V10 33.353428 20.410732
## V11 9.537586 8.850677
## V12 25.348093 38.166916
## V13 11.603076 8.688748
## V15 16.557200 11.802178
## V16 25.531909 23.274157
## V18 13.027087 9.791694
## V19 12.751449 9.165131
## V20 12.439089 11.314784
## V22 12.365576 5.540328
## V23 13.140948 9.645274
## V24 20.126101 8.910752
## V27 21.392283 163.047891
## V28 5.651549 7.580850
## V29 19.565117 13.404295
## V30 5.343394 8.235286
## V31 22.004976 109.075215
## V35 7.613683 4.164560
## V36 17.089229 10.947168
## V37 22.104463 47.724478
varImpPlot(rf)
# Use the held-out test split to assess prediction accuracy
pred<-predict(rf,X_test)
df_pred <- data.frame(pred, y_test)
colnames(df_pred) <- c("pred","y_test")
ggplot(df_pred, aes(x=pred, y=y_test)) +
geom_point() +
geom_abline(intercept = 0, slope = 1) +
ggtitle("Random forest regression Model Plot")
mse <- mean((y_test - pred)^2)
paste("mse:",mse)
## [1] "mse: 0.137529889242212"
paste("rmse:",RMSE(pred, y_test))
## [1] "rmse: 0.370850224810788"
paste("r2:",R2(pred, y_test))
## [1] "r2: 0.851089213259495"
# %IncMSE and IncNodePurity should be read as qualitative indicators of variable importance, not as a quantitative basis for variable selection
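The number of variables tried at each split (mtry = 9, the regression default of about p/3) was not tuned; randomForest::tuneRF could be used to search for a better value. The sketch below is an optional step not run in this report, and the ntreeTry/stepFactor values are illustrative.
# Optional: tune mtry for the random forest
tuned <- tuneRF(x = X_train, y = y_train, ntreeTry = 200,
                stepFactor = 1.5, improve = 0.01, trace = FALSE)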
# Support vector regression with a radial kernel
svm_model <- svm(y_train ~V0+V1+V2+V3+V4+V6+V7+V8+V10+V11+V12+V13+V15+V16+V18+V19+V20+V22+V23+V24+V27+V28+V29+V30+V31+V35+V36+V37, data=cbind(X_train,y_train), kernel="radial")
svm_model
##
## Call:
## svm(formula = y_train ~ V0 + V1 + V2 + V3 + V4 + V6 + V7 + V8 + V10 +
## V11 + V12 + V13 + V15 + V16 + V18 + V19 + V20 + V22 + V23 + V24 +
## V27 + V28 + V29 + V30 + V31 + V35 + V36 + V37, data = cbind(X_train,
## y_train), kernel = "radial")
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.03571429
## epsilon: 0.1
##
##
## Number of Support Vectors: 1405
svm_pred=predict(svm_model,X_test)
pred = svm_pred
df_pred <- data.frame(pred, y_test)
colnames(df_pred) <- c("pred","y_test")
ggplot(df_pred, aes(x=pred, y=y_test)) +
geom_point() +
geom_abline(intercept = 0, slope = 1) +
ggtitle("SVM Model Plot")
mse <- mean((y_test - pred)^2)
paste("mse:",mse)
## [1] "mse: 0.129686503841175"
paste("rmse:",RMSE(pred, y_test))
## [1] "rmse: 0.360120124182439"
paste("r2:",R2(pred, y_test))
## [1] "r2: 0.860236887744732"
# Decision tree regression
model_dt <- rpart(y_train ~V0+V1+V2+V3+V4+V6+V7+V8+V10+V11+V12+V13+V15+V16+V18+V19+V20+V22+V23+V24+V27+V28+V29+V30+V31+V35+V36+V37, data=cbind(X_train,y_train))
rpart.plot(model_dt)
tree_pred <- predict(model_dt, X_test)
pred = tree_pred
df_pred <- data.frame(pred, y_test)
colnames(df_pred) <- c("pred","y_test")
ggplot(df_pred, aes(x=pred, y=y_test)) +
geom_point() +
geom_abline(intercept = 0, slope = 1) +
ggtitle("Decision tree Model Plot")
mse <- mean((y_test - pred)^2)
paste("mse:",mse)
## [1] "mse: 0.233206926793712"
paste("rmse:",RMSE(pred, y_test))
## [1] "rmse: 0.482915030614819"
paste("r2:",R2(pred, y_test))
## [1] "r2: 0.748673016472035"
Summarize the performance of the different models.
mseList<-c(0.1319, 0.1423,0.1375,0.1297,0.2332)
rmseList<-c(0.3632, 0.3772,0.3709,0.3601,0.4829)
r2List<-c(0.8582, 0.8473,0.8511,0.8602,0.7487)
evaluation_df<- as.data.frame(cbind(mseList, rmseList, r2List))
colnames(evaluation_df) <- c("mse", "rmse", "r2")
rownames(evaluation_df) <- c("multiple linear regression", "xgBoost", "randomForest", "svm", "decision Trees")
kable(head(evaluation_df))
|                           |    mse|   rmse|     r2|
|:--------------------------|------:|------:|------:|
|multiple linear regression | 0.1319| 0.3632| 0.8582|
|xgBoost                    | 0.1423| 0.3772| 0.8473|
|randomForest               | 0.1375| 0.3709| 0.8511|
|svm                        | 0.1297| 0.3601| 0.8602|
|decision Trees             | 0.2332| 0.4829| 0.7487|
Based on the results above, we evaluated each model with three key indicators: MSE, RMSE, and R². The SVM model has the highest R², indicating the strongest agreement between predicted and actual values, and it also has the lowest MSE and RMSE, indicating the smallest average deviation from the actual values. We therefore select the SVM model as the optimal model for predicting the steam volume in this power plant.
# Hierarchical clustering of the training observations (complete linkage on Euclidean distances)
hc_model <- hclust(dist(data_train1))
print(hc_model)
##
## Call:
## hclust(d = dist(data_train1))
##
## Cluster method : complete
## Distance : euclidean
## Number of objects: 2888
plot(hc_model)
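The dendrogram by itself does not assign cluster labels; the tree can be cut into a chosen number of groups with cutree(). The sketch below uses k = 4 purely as an illustrative choice.
# Cut the dendrogram into k clusters and count cluster sizes (k = 4 is illustrative)
clusters <- cutree(hc_model, k = 4)
table(clusters)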
Clustering boilers is a useful technique for identifying units with similar characteristics and emission levels. Grouping similar boilers allows maintenance and repairs to be targeted, which reduces maintenance and repair costs and improves the overall efficiency and performance of the boilers. Clustering can also help identify potential issues or areas for improvement in the boiler systems, allowing proactive measures to be taken.
With the development of science, technology, and the economy, industrial sensors have become increasingly widespread, providing a practical way to collect operating data from thermal power boilers. Based on these data, applying machine learning to predict the steam volume of thermal power generation has become a crucial part of the thermal power generation system. An accurate steam-volume prediction model plays a vital role in optimizing boiler parameters, improving the overall process level, and reducing labor costs. The main work of this project is as follows:
1. Apply multiple linear regression, SVR, random forest, and XGBoost for modeling and prediction.
2. Based on the regression results, identify the optimal model for predicting the steam volume of boilers in a power plant using three key indicators: MSE, RMSE, and R².
3. Group boilers with similar properties and emission levels to reduce maintenance costs, improve efficiency, and ensure the optimal performance of the boiler systems.