Group 8 Project WQD7004
Group members:
Junhao Xu (S2174087/1), Haojie Chu (S2175919/1), Wang Jiaxin (S2164009/1), Xiaowei Zhang (S2165529/1), Cheah Xiao Ying (S2001533)
Dataset: https://tianchi.aliyun.com/dataset/84167
Energy plays an important role in human society; without it, society cannot develop. There are various types of power generation, such as coal, nuclear, natural gas, solar, and geothermal. Since thermal power will remain the main source of electricity for industrial and domestic use for many years to come, exploring efficient and clean power generation technologies and optimizing thermal power facilities to improve energy efficiency are the main issues at hand.
The basic principle of thermal power generation is as follows: burning fuel heats water to produce steam, the steam pressure drives a turbine, and the turbine in turn drives a generator to produce electricity. In this chain of energy conversions, the core factor affecting generation efficiency is the combustion efficiency of the boiler, that is, how effectively fuel is burned to heat water into high-temperature, high-pressure steam. Many factors affect boiler combustion efficiency, including adjustable boiler parameters such as fuel feed rate, primary and secondary air, induced draught, return air, and feed-water volume, as well as boiler operating conditions such as bed temperature, bed pressure, furnace temperature and pressure, and superheater temperature.
With the rapid advance of industrialization, applying big-data-driven artificial intelligence to industrial production and manufacturing helps make them digital and intelligent. Thermal power boilers generate large amounts of data during operation, and machine learning can extract valuable information from these large, complex datasets to help companies optimize equipment parameters, thereby improving the stability, speed, accuracy, and energy conversion rate of boiler operation. Machine learning can be applied across industries to analyze, organize, and make effective use of accumulated data; it handles large volumes of data with complex, non-linear structure, and with suitable algorithms it can learn patterns automatically. These advantages have opened a new direction for the analysis of industrial data.
In this project, a steam-volume prediction model for thermal power generation is developed using machine learning. Multiple linear regression, SVR, XGBoost, random forest, and a decision tree are used to predict the steam volume of thermal power generation, with two objectives: to understand the relationship between each variable and the target, and to identify which model performs best at predicting the steam volume.
Import the dataset and view its basic information.
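The analysis below assumes the following packages are loaded. This setup chunk is a minimal sketch reconstructed from the functions used in the report, since the original setup is not shown.
# Packages assumed by the code in this report (setup sketch)
library(dplyr)          # mutate(), bind_rows()
library(ggplot2)        # density plots, scatter plots
library(Amelia)         # missmap()
library(corrplot)       # correlation heat map
library(caret)          # createDataPartition(), RMSE(), R2()
library(xgboost)        # xgb.DMatrix(), xgb.train(), xgb.importance()
library(randomForest)   # randomForest(), varImpPlot()
library(e1071)          # svm()
library(rpart)          # rpart()
library(rpart.plot)     # rpart.plot()
library(knitr)          # kable()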
# Import the data files
# Test set: 1925 rows x 38 columns
# Training set: 2888 rows x 39 columns
# The training set has one extra column, 'target', which is the value to be predicted
data_train <- read.table("D:/zhengqi_train.txt", header = TRUE, sep = "\t")
data_test <- read.table("D:/zhengqi_test.txt", header = TRUE, sep = "\t")
# keep a copy of the original training data in data_train1
data_train1 <- data_train
# add an 'origin' column to mark which set each row comes from
data_train<-mutate(data_train, origin = 'train')
data_test<-mutate(data_test, origin = 'test')
head(data_train)
head(data_test)
str(data_train)
## 'data.frame': 2888 obs. of 40 variables:
## $ V0 : num 0.566 0.968 1.013 0.733 0.684 ...
## $ V1 : num 0.016 0.437 0.568 0.368 0.638 ...
## $ V2 : num -0.143 0.066 0.235 0.283 0.26 0.408 0.64 0.704 0.584 0.638 ...
## $ V3 : num 0.407 0.566 0.37 0.165 0.209 0.22 0.356 0.438 0.459 0.617 ...
## $ V4 : num 0.452 0.194 0.112 0.599 0.337 ...
## $ V5 : num -0.901 -0.893 -0.797 -0.679 -0.454 ...
## $ V6 : num -1.81 -1.57 -1.37 -1.2 -1.07 ...
## $ V7 : num -2.36 -2.36 -2.36 -2.09 -2.09 ...
## $ V8 : num -0.436 0.332 0.396 0.403 0.314 0.481 0.729 0.753 0.763 0.968 ...
## $ V9 : num -2.11 -2.11 -2.11 -2.11 -2.11 ...
## $ V10 : num -0.94 0.188 0.874 0.011 -0.251 -0.511 -0.256 -0.067 0.205 0.145 ...
## $ V11 : num -0.307 -0.455 -0.051 0.102 0.57 -0.564 -0.278 -0.24 0.422 0.179 ...
## $ V12 : num -0.073 -0.134 -0.072 -0.014 0.199 0.294 0.425 0.272 0.387 0.688 ...
## $ V13 : num 0.55 1.109 0.767 0.769 -0.349 ...
## $ V14 : num -0.484 -0.488 -0.493 -0.371 -0.342 -0.345 -0.3 -0.387 -0.264 -0.289 ...
## $ V15 : num 0 0 -0.212 -0.162 -0.138 0.111 0.111 0.244 0.293 0.317 ...
## $ V16 : num -1.707 -0.977 -0.618 -0.429 -0.391 ...
## $ V17 : num -1.162 -1.162 -0.897 -0.897 -0.897 ...
## $ V18 : num -0.573 -0.571 -0.564 -0.574 -0.572 -0.573 -0.586 -0.579 -0.566 -0.567 ...
## $ V19 : num -0.991 -0.836 -0.558 -0.564 -0.394 -0.516 -0.544 -0.465 -0.173 -0.557 ...
## $ V20 : num 0.61 0.588 0.576 0.272 0.106 0.029 0.156 0.254 0.25 0.263 ...
## $ V21 : num -0.4 -0.802 -0.477 -0.491 0.309 -0.56 -0.34 -0.442 0.31 0.241 ...
## $ V22 : num -0.063 -0.063 -0.063 -0.063 -0.259 -0.096 -0.063 -0.259 -0.259 -0.259 ...
## $ V23 : num 0.356 0.357 0.355 0.352 0.352 0.349 0.352 0.366 0.366 0.358 ...
## $ V24 : num 0.8 0.801 0.961 1.435 0.881 ...
## $ V25 : num -0.223 -0.144 -0.067 0.113 0.221 0.245 0.389 0.56 0.577 0.493 ...
## $ V26 : num 0.796 1.057 0.915 0.898 0.386 ...
## $ V27 : num 0.168 0.338 0.326 0.277 0.332 0.356 0.401 0.409 0.49 0.512 ...
## $ V28 : num -0.45 0.671 1.287 1.298 1.289 ...
## $ V29 : num 0.136 -0.128 -0.009 0.015 0.183 0.454 0.454 0.139 0.188 0.86 ...
## $ V30 : num 0.109 0.124 0.361 0.417 1.078 ...
## $ V31 : num -0.615 0.032 0.277 0.279 0.328 0.358 0.243 0.428 0.597 0.916 ...
## $ V32 : num 0.327 0.6 -0.116 0.603 0.418 0.618 0.468 -0.119 -0.057 0.039 ...
## $ V33 : num -4.627 -0.843 -0.843 -0.843 -0.843 ...
## $ V34 : num -4.789 0.16 0.16 -0.065 -0.215 ...
## $ V35 : num -5.101 0.364 0.364 0.364 0.364 ...
## $ V36 : num -2.608 -0.335 0.765 0.333 -0.28 ...
## $ V37 : num -3.508 -0.73 -0.589 -0.112 -0.028 ...
## $ target: num 0.175 0.676 0.633 0.206 0.384 0.06 0.415 0.609 0.981 0.818 ...
## $ origin: chr "train" "train" "train" "train" ...
str(data_test)
## 'data.frame': 1925 obs. of 39 variables:
## $ V0 : num 0.368 0.148 -0.166 0.102 0.3 0.05 -0.223 -0.126 -0.203 -0.181 ...
## $ V1 : num 0.38 0.489 -0.062 0.294 0.428 0.34 0.175 0.152 -0.014 0.797 ...
## $ V2 : num -0.225 -0.247 -0.311 -0.259 0.208 0.108 -0.39 0.227 0.01 0.47 ...
## $ V3 : num -0.049 -0.049 0.046 0.051 0.051 0.051 0.051 0.021 -0.034 -0.107 ...
## $ V4 : num 0.379 0.122 -0.055 -0.183 -0.033 -0.348 0.006 -0.619 -0.322 -0.477 ...
## $ V5 : num 0.092 -0.201 0.063 0.148 0.116 0.074 0.134 -0.069 0.105 0.184 ...
## $ V6 : num 0.55 0.487 0.485 0.474 0.408 0.516 0.497 0.52 0.453 0.588 ...
## $ V7 : num 0.551 0.493 0.493 0.504 0.497 0.491 0.548 0.548 0.518 0.528 ...
## $ V8 : num 0.244 -0.127 -0.227 0.01 0.155 0.238 -0.099 0.06 -0.032 0.319 ...
## $ V9 : num 0.904 0.904 0.904 0.904 0.904 0.904 0.904 0.904 0.473 0.904 ...
## $ V10 : num -0.419 -0.403 0.33 -0.431 -0.162 0.079 0.226 -0.223 -0.352 -0.108 ...
## $ V11 : num 0.515 -0.324 0.389 0.524 0.554 0.373 0.477 0.171 0.254 0.618 ...
## $ V12 : num 0.346 0.465 0.173 -0.038 -0.063 -0.246 -0.276 -0.369 -0.318 -0.709 ...
## $ V13 : num -0.114 0.653 0.398 -0.34 0.611 ...
## $ V14 : num -0.204 0.148 0.068 -0.313 -0.319 -0.365 -0.247 -0.271 -0.008 -0.207 ...
## $ V15 : num 0.239 -0.113 -0.192 -0.59 -0.927 ...
## $ V16 : num -0.089 -0.093 -0.061 -0.134 -0.075 0.222 0.252 0.01 0.003 0.252 ...
## $ V17 : num 0.961 0.961 0.961 0.961 0.961 0.961 0.961 0.961 0.961 0.961 ...
## $ V18 : num 0.247 0.073 0.07 0.078 0.08 0.07 0.066 0.07 0.064 0.078 ...
## $ V19 : num 0.899 1.168 0.98 1.07 1.238 ...
## $ V20 : num -0.252 -0.276 -0.34 -0.292 -0.15 -0.11 -0.331 -0.245 -0.472 -0.493 ...
## $ V21 : num 0.628 0.009 0.27 0.726 0.141 ...
## $ V22 : num -0.063 -0.063 -0.063 0.133 0.133 0.133 -0.063 -0.063 0.133 0.133 ...
## $ V23 : num 0.098 0.09 0.091 0.086 0.089 0.099 0.104 0.101 0.093 -0.178 ...
## $ V24 : num -1.314 -1.31 -1.31 0.234 0.237 ...
## $ V25 : num -0.662 -0.646 -0.473 -0.337 -0.285 -0.071 0.009 0.003 0.006 0.199 ...
## $ V26 : num -0.596 -0.776 -0.607 -0.986 -0.669 ...
## $ V27 : num 0.208 0.226 0.084 0.203 0.227 0.229 0.133 0.262 0.241 0.307 ...
## $ V28 : num -0.449 -0.443 -0.458 -0.456 -0.458 -0.45 -0.452 -0.452 -0.45 -0.446 ...
## $ V29 : num 0.047 0.047 -0.398 -0.398 -0.776 ...
## $ V30 : num 0.057 0.56 0.101 1.007 0.291 ...
## $ V31 : num -0.042 0.176 0.199 0.137 0.37 0.447 0.432 0.281 0.222 0.466 ...
## $ V32 : num 0.847 0.551 0.634 1.042 0.181 ...
## $ V33 : num 0.534 0.046 0.017 -0.04 -0.04 -0.04 -0.04 -0.04 -0.04 -0.04 ...
## $ V34 : num -0.009 -0.22 -0.234 -0.29 -0.29 -0.29 -0.29 -0.29 -0.29 -0.29 ...
## $ V35 : num -0.19 0.008 0.008 0.008 0.008 0.008 0.008 0.008 0.008 -0.289 ...
## $ V36 : num -0.567 -0.294 0.373 -0.666 -0.14 -0.228 0.104 -0.7 -0.236 -0.431 ...
## $ V37 : num 0.388 0.104 0.569 0.391 -0.497 ...
## $ origin: chr "test" "test" "test" "test" ...
# Merge data_train and data_test by row
df<- bind_rows(data_train, data_test)
str(df)
## 'data.frame': 4813 obs. of 40 variables:
## $ V0 : num 0.566 0.968 1.013 0.733 0.684 ...
## $ V1 : num 0.016 0.437 0.568 0.368 0.638 ...
## $ V2 : num -0.143 0.066 0.235 0.283 0.26 0.408 0.64 0.704 0.584 0.638 ...
## $ V3 : num 0.407 0.566 0.37 0.165 0.209 0.22 0.356 0.438 0.459 0.617 ...
## $ V4 : num 0.452 0.194 0.112 0.599 0.337 ...
## $ V5 : num -0.901 -0.893 -0.797 -0.679 -0.454 ...
## $ V6 : num -1.81 -1.57 -1.37 -1.2 -1.07 ...
## $ V7 : num -2.36 -2.36 -2.36 -2.09 -2.09 ...
## $ V8 : num -0.436 0.332 0.396 0.403 0.314 0.481 0.729 0.753 0.763 0.968 ...
## $ V9 : num -2.11 -2.11 -2.11 -2.11 -2.11 ...
## $ V10 : num -0.94 0.188 0.874 0.011 -0.251 -0.511 -0.256 -0.067 0.205 0.145 ...
## $ V11 : num -0.307 -0.455 -0.051 0.102 0.57 -0.564 -0.278 -0.24 0.422 0.179 ...
## $ V12 : num -0.073 -0.134 -0.072 -0.014 0.199 0.294 0.425 0.272 0.387 0.688 ...
## $ V13 : num 0.55 1.109 0.767 0.769 -0.349 ...
## $ V14 : num -0.484 -0.488 -0.493 -0.371 -0.342 -0.345 -0.3 -0.387 -0.264 -0.289 ...
## $ V15 : num 0 0 -0.212 -0.162 -0.138 0.111 0.111 0.244 0.293 0.317 ...
## $ V16 : num -1.707 -0.977 -0.618 -0.429 -0.391 ...
## $ V17 : num -1.162 -1.162 -0.897 -0.897 -0.897 ...
## $ V18 : num -0.573 -0.571 -0.564 -0.574 -0.572 -0.573 -0.586 -0.579 -0.566 -0.567 ...
## $ V19 : num -0.991 -0.836 -0.558 -0.564 -0.394 -0.516 -0.544 -0.465 -0.173 -0.557 ...
## $ V20 : num 0.61 0.588 0.576 0.272 0.106 0.029 0.156 0.254 0.25 0.263 ...
## $ V21 : num -0.4 -0.802 -0.477 -0.491 0.309 -0.56 -0.34 -0.442 0.31 0.241 ...
## $ V22 : num -0.063 -0.063 -0.063 -0.063 -0.259 -0.096 -0.063 -0.259 -0.259 -0.259 ...
## $ V23 : num 0.356 0.357 0.355 0.352 0.352 0.349 0.352 0.366 0.366 0.358 ...
## $ V24 : num 0.8 0.801 0.961 1.435 0.881 ...
## $ V25 : num -0.223 -0.144 -0.067 0.113 0.221 0.245 0.389 0.56 0.577 0.493 ...
## $ V26 : num 0.796 1.057 0.915 0.898 0.386 ...
## $ V27 : num 0.168 0.338 0.326 0.277 0.332 0.356 0.401 0.409 0.49 0.512 ...
## $ V28 : num -0.45 0.671 1.287 1.298 1.289 ...
## $ V29 : num 0.136 -0.128 -0.009 0.015 0.183 0.454 0.454 0.139 0.188 0.86 ...
## $ V30 : num 0.109 0.124 0.361 0.417 1.078 ...
## $ V31 : num -0.615 0.032 0.277 0.279 0.328 0.358 0.243 0.428 0.597 0.916 ...
## $ V32 : num 0.327 0.6 -0.116 0.603 0.418 0.618 0.468 -0.119 -0.057 0.039 ...
## $ V33 : num -4.627 -0.843 -0.843 -0.843 -0.843 ...
## $ V34 : num -4.789 0.16 0.16 -0.065 -0.215 ...
## $ V35 : num -5.101 0.364 0.364 0.364 0.364 ...
## $ V36 : num -2.608 -0.335 0.765 0.333 -0.28 ...
## $ V37 : num -3.508 -0.73 -0.589 -0.112 -0.028 ...
## $ target: num 0.175 0.676 0.633 0.206 0.384 0.06 0.415 0.609 0.981 0.818 ...
## $ origin: chr "train" "train" "train" "train" ...
# Count missing values in each column of the training set
colSums(is.na(data_train))
## V0 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## 0 0 0 0 0 0 0 0 0 0 0
## V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
## 0 0 0 0 0 0 0 0 0 0 0
## V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32
## 0 0 0 0 0 0 0 0 0 0 0
## V33 V34 V35 V36 V37 target origin
## 0 0 0 0 0 0 0
# Confirm that all feature columns and the target are numeric
lapply(df[,c(1:39)], is.numeric)
## $V0
## [1] TRUE
##
## $V1
## [1] TRUE
##
## $V2
## [1] TRUE
##
## $V3
## [1] TRUE
##
## $V4
## [1] TRUE
##
## $V5
## [1] TRUE
##
## $V6
## [1] TRUE
##
## $V7
## [1] TRUE
##
## $V8
## [1] TRUE
##
## $V9
## [1] TRUE
##
## $V10
## [1] TRUE
##
## $V11
## [1] TRUE
##
## $V12
## [1] TRUE
##
## $V13
## [1] TRUE
##
## $V14
## [1] TRUE
##
## $V15
## [1] TRUE
##
## $V16
## [1] TRUE
##
## $V17
## [1] TRUE
##
## $V18
## [1] TRUE
##
## $V19
## [1] TRUE
##
## $V20
## [1] TRUE
##
## $V21
## [1] TRUE
##
## $V22
## [1] TRUE
##
## $V23
## [1] TRUE
##
## $V24
## [1] TRUE
##
## $V25
## [1] TRUE
##
## $V26
## [1] TRUE
##
## $V27
## [1] TRUE
##
## $V28
## [1] TRUE
##
## $V29
## [1] TRUE
##
## $V30
## [1] TRUE
##
## $V31
## [1] TRUE
##
## $V32
## [1] TRUE
##
## $V33
## [1] TRUE
##
## $V34
## [1] TRUE
##
## $V35
## [1] TRUE
##
## $V36
## [1] TRUE
##
## $V37
## [1] TRUE
##
## $target
## [1] TRUE
# Visualize missing values in the training set
missmap(data_train)
# Compare the distribution of each feature between the training and test sets
for(col in names(data_test)){
  if(col != "origin"){
    print(ggplot(df, aes(x = get(col), colour = origin, group = origin)) +
            xlab("feature value") +
            ggtitle(col) +
            geom_density(size = 2))
  }
}
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
Based on the density plots above, features whose distributions differ substantially between the training and test sets are removed. This reduces the risk of overfitting, discards features that would not transfer well to the test set, and helps improve the generalizability of the model.
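As a complement to the visual comparison, the shift in each feature's distribution between the training and test sets could be quantified with a two-sample Kolmogorov–Smirnov test. This is an optional check sketched below, not part of the original analysis; small p-values indicate a clear distribution shift.
# Optional check: KS-test p-value per feature (smaller = stronger evidence of train/test shift)
ks_p <- sapply(setdiff(names(data_test), "origin"), function(col) {
  ks.test(data_train[[col]], data_test[[col]])$p.value
})
sort(ks_p)[1:10]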
# Drop features whose distributions differ markedly between the training and test sets
drop_array <- c("V5", "V9", "V14", "V17", "21", "28")
data_train1 <- data_train1[, !(names(data_train1) %in% drop_array)]
names(data_train1)
## [1] "V0" "V1" "V2" "V3" "V4" "V6" "V7" "V8"
## [9] "V10" "V11" "V12" "V13" "V15" "V16" "V18" "V19"
## [17] "V20" "V21" "V22" "V23" "V24" "V25" "V26" "V27"
## [25] "V28" "V29" "V30" "V31" "V32" "V33" "V34" "V35"
## [33] "V36" "V37" "target"
# Plotting scatter plots and fitting curves
for(col in names(data_train1)){
print(ggplot(data_train1, aes(x = get(col), y = target)) +
# Scatter plot function
geom_point()+
xlab(col) +
ylab("target") +
# Scale function: palette sets the color scheme
scale_colour_brewer(palette = "Set1") +
geom_smooth())
}
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
# Correlation heat map of the remaining features and the target
par(mfrow = c(1, 1), mar = c(2, 2, 2, 2), cex = 0.8)
corr <- cor(data_train1)
corrplot(corr, method = "color", type = "lower", tl.col = "black", tl.srt = 45, addCoefasPercent = TRUE, addCoef.cex = 0.5)
## Warning in text.default(pos.xlabel[, 1], pos.xlabel[, 2], newcolnames, srt =
## tl.srt, : "addCoef.cex" is not a graphical parameter
## Warning in text.default(pos.ylabel[, 1], pos.ylabel[, 2], newrownames, col =
## tl.col, : "addCoef.cex" is not a graphical parameter
## Warning in title(title, ...): "addCoef.cex" is not a graphical parameter
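The warnings above appear because addCoef.cex is not an argument of corrplot(); coefficients are only printed when addCoef.col is set, and their text size is controlled by number.cex. A corrected call would look like the following (a suggested fix, not applied here):
corrplot(corr, method = "color", type = "lower", tl.col = "black", tl.srt = 45, addCoefasPercent = TRUE, addCoef.col = "black", number.cex = 0.5)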
Features whose absolute correlation with the target is below 0.1 are removed.
# Remove features whose absolute correlation with the target is below 0.1
threshold <- 0.1
corr_matrix <- abs(cor(data_train1)) # Absolute value correlation matrix
drop_col <- names(data_train1)[corr_matrix[,"target"] < threshold]
drop_col
## [1] "V21" "V25" "V26" "V32" "V33" "V34"
data_train1 <- data_train1[, !(names(data_train1) %in% drop_col)]
names(data_train1)
## [1] "V0" "V1" "V2" "V3" "V4" "V6" "V7" "V8"
## [9] "V10" "V11" "V12" "V13" "V15" "V16" "V18" "V19"
## [17] "V20" "V22" "V23" "V24" "V27" "V28" "V29" "V30"
## [25] "V31" "V35" "V36" "V37" "target"
# data_train1 is the data frame after removing the low-correlation columns
trainy = data_train1["target"]
trainx = data_train1[, !(names(data_train1) %in% "target")]
# Split into training (70%) and test (30%) sets
set.seed(827)
inTrain <- createDataPartition(trainy$target, p = 0.7, list = FALSE)
X_train <- trainx[inTrain,]
X_test <- trainx[-inTrain,]
y_train <- trainy[inTrain,]
y_test <- trainy[-inTrain,]
Prediction evaluation metrics: RMSE (root mean squared error) measures the deviation between predicted and true values; the smaller the value, the more accurate the model. MSE (mean squared error) is the mean of the squared differences between predicted and true values; again, smaller values indicate a more accurate model. R² (coefficient of determination) is the proportion of the variation in the true values that is explained by the model's predictions; the closer it is to 1, the more accurate the model.
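For reference, the three metrics can be computed directly as sketched below. The evaluation later in this report uses mean(), caret::RMSE() and caret::R2(); note that caret's R2() defaults to the squared correlation between predictions and observations, which can differ slightly from the formula written out here.
# Metric helpers (illustrative sketch; names chosen to avoid clashing with caret::RMSE and caret::R2)
mse_fn  <- function(actual, pred) mean((actual - pred)^2)                                      # mean squared error
rmse_fn <- function(actual, pred) sqrt(mse_fn(actual, pred))                                   # root mean squared error
r2_fn   <- function(actual, pred) 1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)  # coefficient of determination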
# Multiple linear regression
x_vars <- names(X_train)
x_vars<-paste(x_vars, collapse = "+")
x_vars
## [1] "V0+V1+V2+V3+V4+V6+V7+V8+V10+V11+V12+V13+V15+V16+V18+V19+V20+V22+V23+V24+V27+V28+V29+V30+V31+V35+V36+V37"
lr_model <- lm(y_train ~V0+V1+V2+V3+V4+V6+V7+V8+V10+V11+V12+V13+V15+V16+V18+V19+V20+V22+V23+V24+V27+V28+V29+V30+V31+V35+V36+V37, data=cbind(X_train,y_train))
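Equivalently, the pasted string above can be turned into a formula programmatically instead of typing every term (a sketch producing the same fit):
lr_formula <- as.formula(paste("y_train ~", x_vars))
# lm(lr_formula, data = cbind(X_train, y_train)) fits the same model as the call above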
# Coefficient table showing beta coefficient estimates and their significance levels
# Estimate: intercept (b0) and beta coefficient estimates associated with each predictor variable
# Std.Error: standard error of the coefficient estimates. This represents the accuracy of the coefficients. The larger the standard error, the less confidence we have in the estimate.
# t value: t-statistic, which is the coefficient estimate (column 2) divided by the standard error of the estimate (column 3).
# Pr(>|t|): p-value corresponding to the t-statistic. The smaller the p-value, the more significant the estimate.
summary(lr_model)
##
## Call:
## lm(formula = y_train ~ V0 + V1 + V2 + V3 + V4 + V6 + V7 + V8 +
## V10 + V11 + V12 + V13 + V15 + V16 + V18 + V19 + V20 + V22 +
## V23 + V24 + V27 + V28 + V29 + V30 + V31 + V35 + V36 + V37,
## data = cbind(X_train, y_train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.55322 -0.18309 -0.01699 0.17665 1.55426
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2569077 0.0216219 -11.882 < 2e-16 ***
## V0 0.3651026 0.0308942 11.818 < 2e-16 ***
## V1 0.2144276 0.0291299 7.361 2.65e-13 ***
## V2 0.1391036 0.0236074 5.892 4.46e-09 ***
## V3 0.1440395 0.0096956 14.856 < 2e-16 ***
## V4 0.0490734 0.0303358 1.618 0.105891
## V6 0.1821767 0.0350485 5.198 2.22e-07 ***
## V7 -0.1618945 0.0253709 -6.381 2.18e-10 ***
## V8 -0.1911270 0.0341246 -5.601 2.43e-08 ***
## V10 0.3314013 0.0255671 12.962 < 2e-16 ***
## V11 0.0195061 0.0118248 1.650 0.099182 .
## V12 0.0489378 0.0271553 1.802 0.071673 .
## V13 0.0004546 0.0112052 0.041 0.967639
## V15 0.0323430 0.0253617 1.275 0.202362
## V16 0.0179925 0.0286673 0.628 0.530318
## V18 0.0371944 0.0101865 3.651 0.000268 ***
## V19 0.0189041 0.0086529 2.185 0.029027 *
## V20 -0.0009343 0.0110808 -0.084 0.932812
## V22 0.1181494 0.0152160 7.765 1.30e-14 ***
## V23 0.0253475 0.0116220 2.181 0.029300 *
## V24 -0.0550185 0.0108939 -5.050 4.81e-07 ***
## V27 0.9780766 0.0813257 12.027 < 2e-16 ***
## V28 -0.0086860 0.0083124 -1.045 0.296172
## V29 -0.0438405 0.0274205 -1.599 0.110019
## V30 -0.0062451 0.0104126 -0.600 0.548732
## V31 0.0189293 0.0275342 0.687 0.491858
## V35 -0.0244023 0.0108775 -2.243 0.024982 *
## V36 -0.2557386 0.0243117 -10.519 < 2e-16 ***
## V37 -0.0647407 0.0178687 -3.623 0.000298 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3249 on 1995 degrees of freedom
## Multiple R-squared: 0.8946, Adjusted R-squared: 0.8932
## F-statistic: 605 on 28 and 1995 DF, p-value: < 2.2e-16
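Several predictors (e.g. V13, V16, V20, V30) are not statistically significant in this fit; if a more parsimonious model were desired, backward elimination could be applied with base R's step(). This is an optional refinement, not used in this report.
# Optional: backward elimination on the fitted linear model
lr_model_step <- step(lr_model, direction = "backward", trace = FALSE)
# summary(lr_model_step) would then show the reduced model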
# Make predictions on the held-out test split
pred <- predict(lr_model, X_test)
# Plot predicted vs. actual values for the linear model
df_pred <- data.frame(pred, y_test)
colnames(df_pred) <- c("pred","y_test")
ggplot(df_pred, aes(x=pred, y=y_test)) +
geom_point() +
geom_abline(intercept = 0, slope = 1) +
ggtitle("Linear Model Plot")
#Evaluation
mse <- mean((y_test - pred)^2)
paste("mse:",mse)
## [1] "mse: 0.131918573848608"
paste("rmse:",RMSE(pred, y_test))
## [1] "rmse: 0.363205966152275"
paste("r2:",R2(pred, y_test))
## [1] "r2: 0.858248127195159"
# Set training parameters
params <- list(booster = "gbtree",
objective = "reg:squarederror",
eta = 0.3,
max_depth = 6,
subsample = 1,
colsample_bytree = 1)
# Convert training data into the required format for xgboost
train_matrix <- xgb.DMatrix(data = as.matrix(X_train), label = y_train)
# Train the model
xgboost_model <- xgb.train(params = params, data = train_matrix, nrounds = 100)
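The number of boosting rounds was fixed at 100 above; it could instead be selected with xgboost's built-in cross-validation. The sketch below is an optional step that was not run in this report, and the choice of 300 rounds with 5 folds is illustrative.
# Optional: choose nrounds by 5-fold cross-validation with early stopping
cv <- xgb.cv(params = params, data = train_matrix, nrounds = 300,
             nfold = 5, early_stopping_rounds = 20, verbose = 0)
# Row of the evaluation log with the lowest cross-validated RMSE
cv$evaluation_log[which.min(cv$evaluation_log$test_rmse_mean), ]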
#Predictions for the test set
test_matrix <- xgb.DMatrix(data = as.matrix(X_test))
pred <- predict(xgboost_model, test_matrix)
# Use the xgb.plot.importance() function to visualize the importance of each feature to the model prediction results.
importance_matrix <- xgb.importance(colnames(X_train), model = xgboost_model)
xgb.plot.importance(importance_matrix)
# xgb.plot.importance(xgboost_model)
# Plot the scatter plot of predicted and true values
df_pred <- data.frame(pred, y_test)
colnames(df_pred) <- c("pred","y_test")
ggplot(df_pred, aes(x=pred, y=y_test)) +
geom_point() +
geom_abline(intercept = 0, slope = 1) +
ggtitle("XGBoost Model Plot")
#Evaluation
mse <- mean((y_test - pred)^2)
paste("mse:",mse)
## [1] "mse: 0.142307400438178"
paste("rmse:",RMSE(pred, y_test))
## [1] "rmse: 0.377236531155425"
paste("r2:",R2(pred, y_test))
## [1] "r2: 0.847343007661001"
# Random forest regression with 1000 trees
rf<-randomForest(y_train ~V0+V1+V2+V3+V4+V6+V7+V8+V10+V11+V12+V13+V15+V16+V18+V19+V20+V22+V23+V24+V27+V28+V29+V30+V31+V35+V36+V37, data=cbind(X_train,y_train),importance=TRUE, ntree=1000)
rf
##
## Call:
## randomForest(formula = y_train ~ V0 + V1 + V2 + V3 + V4 + V6 + V7 + V8 + V10 + V11 + V12 + V13 + V15 + V16 + V18 + V19 + V20 + V22 + V23 + V24 + V27 + V28 + V29 + V30 + V31 + V35 + V36 + V37, data = cbind(X_train, y_train), importance = TRUE, ntree = 1000)
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 9
##
## Mean of squared residuals: 0.1167821
## % Var explained: 88.17
importance(rf)
## %IncMSE IncNodePurity
## V0 45.973435 579.981173
## V1 33.688909 444.433595
## V2 36.948294 72.460349
## V3 44.608535 34.981417
## V4 22.903202 36.007938
## V6 22.115563 14.002075
## V7 19.450743 10.630222
## V8 24.883507 266.626312
## V10 33.353428 20.410732
## V11 9.537586 8.850677
## V12 25.348093 38.166916
## V13 11.603076 8.688748
## V15 16.557200 11.802178
## V16 25.531909 23.274157
## V18 13.027087 9.791694
## V19 12.751449 9.165131
## V20 12.439089 11.314784
## V22 12.365576 5.540328
## V23 13.140948 9.645274
## V24 20.126101 8.910752
## V27 21.392283 163.047891
## V28 5.651549 7.580850
## V29 19.565117 13.404295
## V30 5.343394 8.235286
## V31 22.004976 109.075215
## V35 7.613683 4.164560
## V36 17.089229 10.947168
## V37 22.104463 47.724478
varImpPlot(rf)
# Use the held-out test split to assess prediction accuracy
pred<-predict(rf,X_test)
df_pred <- data.frame(pred, y_test)
colnames(df_pred) <- c("pred","y_test")
ggplot(df_pred, aes(x=pred, y=y_test)) +
geom_point() +
geom_abline(intercept = 0, slope = 1) +
ggtitle("Random forest regression Model Plot")
mse <- mean((y_test - pred)^2)
paste("mse:",mse)
## [1] "mse: 0.137529889242212"
paste("rmse:",RMSE(pred, y_test))
## [1] "rmse: 0.370850224810788"
paste("r2:",R2(pred, y_test))
## [1] "r2: 0.851089213259495"
# %IncMSE and IncNodePurity should be read as qualitative indicators of variable importance, not as a quantitative basis for variable selection
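The number of variables tried at each split (mtry = 9, the regression default of about p/3) was not tuned; randomForest::tuneRF could be used to search for a better value. The sketch below is an optional step not run in this report, and the ntreeTry/stepFactor values are illustrative.
# Optional: tune mtry for the random forest
tuned <- tuneRF(x = X_train, y = y_train, ntreeTry = 200,
                stepFactor = 1.5, improve = 0.01, trace = FALSE)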
# Support vector regression with a radial kernel
svm_model <- svm(y_train ~V0+V1+V2+V3+V4+V6+V7+V8+V10+V11+V12+V13+V15+V16+V18+V19+V20+V22+V23+V24+V27+V28+V29+V30+V31+V35+V36+V37, data=cbind(X_train,y_train), kernel="radial")
svm_model
##
## Call:
## svm(formula = y_train ~ V0 + V1 + V2 + V3 + V4 + V6 + V7 + V8 + V10 +
## V11 + V12 + V13 + V15 + V16 + V18 + V19 + V20 + V22 + V23 + V24 +
## V27 + V28 + V29 + V30 + V31 + V35 + V36 + V37, data = cbind(X_train,
## y_train), kernel = "radial")
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.03571429
## epsilon: 0.1
##
##
## Number of Support Vectors: 1405
svm_pred=predict(svm_model,X_test)
pred = svm_pred
df_pred <- data.frame(pred, y_test)
colnames(df_pred) <- c("pred","y_test")
ggplot(df_pred, aes(x=pred, y=y_test)) +
geom_point() +
geom_abline(intercept = 0, slope = 1) +
ggtitle("SVM Model Plot")
mse <- mean((y_test - pred)^2)
paste("mse:",mse)
## [1] "mse: 0.129686503841175"
paste("rmse:",RMSE(pred, y_test))
## [1] "rmse: 0.360120124182439"
paste("r2:",R2(pred, y_test))
## [1] "r2: 0.860236887744732"
# Decision tree regression
model_dt <- rpart(y_train ~V0+V1+V2+V3+V4+V6+V7+V8+V10+V11+V12+V13+V15+V16+V18+V19+V20+V22+V23+V24+V27+V28+V29+V30+V31+V35+V36+V37, data=cbind(X_train,y_train))
rpart.plot(model_dt)
tree_pred <- predict(model_dt, X_test)
pred = tree_pred
df_pred <- data.frame(pred, y_test)
colnames(df_pred) <- c("pred","y_test")
ggplot(df_pred, aes(x=pred, y=y_test)) +
geom_point() +
geom_abline(intercept = 0, slope = 1) +
ggtitle("Decision tree Model Plot")
mse <- mean((y_test - pred)^2)
paste("mse:",mse)
## [1] "mse: 0.233206926793712"
paste("rmse:",RMSE(pred, y_test))
## [1] "rmse: 0.482915030614819"
paste("r2:",R2(pred, y_test))
## [1] "r2: 0.748673016472035"
Summarize the performance of the different models.
mseList<-c(0.1319, 0.1423,0.1375,0.1297,0.2332)
rmseList<-c(0.3632, 0.3772,0.3709,0.3601,0.4829)
r2List<-c(0.8582, 0.8473,0.8511,0.8602,0.7487)
evaluation_df<- as.data.frame(cbind(mseList, rmseList, r2List))
colnames(evaluation_df) <- c("mse", "rmse", "r2")
rownames(evaluation_df) <- c("multiple linear regression", "xgBoost", "randomForest", "svm", "decision Trees")
kable(head(evaluation_df))
|                           |    mse|   rmse|     r2|
|:--------------------------|------:|------:|------:|
|multiple linear regression | 0.1319| 0.3632| 0.8582|
|xgBoost                    | 0.1423| 0.3772| 0.8473|
|randomForest               | 0.1375| 0.3709| 0.8511|
|svm                        | 0.1297| 0.3601| 0.8602|
|decision Trees             | 0.2332| 0.4829| 0.7487|
Based on the results above, we evaluated each model with three key indicators: MSE, RMSE, and R². The SVM model has the highest R², indicating the strongest agreement between predicted and actual values, and it also has the lowest MSE and RMSE, indicating the smallest average deviation from the actual values. We therefore select the SVM model as the optimal model for predicting the steam volume in this power plant.
# Hierarchical clustering of the training observations (complete linkage on Euclidean distances)
hc_model <- hclust(dist(data_train1))
print(hc_model)
##
## Call:
## hclust(d = dist(data_train1))
##
## Cluster method : complete
## Distance : euclidean
## Number of objects: 2888
plot(hc_model)
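The dendrogram by itself does not assign cluster labels; the tree can be cut into a chosen number of groups with cutree(). The sketch below uses k = 4 purely as an illustrative choice.
# Cut the dendrogram into k clusters and count cluster sizes (k = 4 is illustrative)
clusters <- cutree(hc_model, k = 4)
table(clusters)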
Clustering boilers is a useful technique for identifying units with similar characteristics and emission levels. Grouping similar boilers allows maintenance and repairs to be targeted, which reduces maintenance and repair costs and improves the overall efficiency and performance of the boilers. Clustering can also help identify potential issues or areas for improvement in the boiler systems, allowing proactive measures to be taken.
With the development of science, technology, and the economy, industrial sensors have become increasingly widespread, providing a practical way to collect operating data from thermal power boilers. Based on these data, applying machine learning to predict the steam volume of thermal power generation has become a crucial part of the thermal power generation system. An accurate steam-volume prediction model plays a vital role in optimizing boiler parameters, improving the overall process level, and reducing labor costs. The main work of this project is as follows:
1. Apply multiple linear regression, SVR, random forest, and XGBoost for modeling and prediction.
2. Based on the regression results, identify the optimal model for predicting the steam volume of boilers in a power plant using three key indicators: MSE, RMSE, and R².
3. Group boilers with similar properties and emission levels to reduce maintenance costs, improve efficiency, and ensure the optimal performance of the boiler systems.