Group 8 Project WQD7004

Prediction and mining of industrial steam volume

Group members for this project

S2174087/1 Junhao Xu
S2175919/1 Haojie Chu
S2164009/1 Wang Jiaxin
S2165529/1 Xiaowei Zhang
S2001533 Cheah Xiao Ying


Dataset: https://tianchi.aliyun.com/dataset/84167

Introduction

Energy plays an essential role in human society and underpins its development. Electricity is generated from a variety of sources, such as coal, nuclear, natural gas, solar, and geothermal energy. As thermal power will remain the main source of electricity for production and domestic use for many years to come, exploring efficient and clean power-generation technologies and optimizing thermal power facilities to improve energy efficiency are the main issues at hand.

The basic principle of thermal power generation is as follows: burning fuel heats water to produce steam, the steam pressure drives a turbine, and the turbine in turn drives a generator to produce electricity. In this chain of energy conversions, the core factor affecting generation efficiency is the combustion efficiency of the boiler, that is, how effectively the fuel is burned to heat water into high-temperature, high-pressure steam. Many factors influence boiler combustion efficiency, including adjustable boiler parameters, such as the fuel feed rate, primary and secondary air, induced air, return air, and water supply volume, as well as boiler operating conditions, such as bed temperature, bed pressure, furnace temperature and pressure, and superheater temperature.

With the rapid progress of industrialization, applying big-data-driven artificial intelligence to industrial production and manufacturing supports their digitalization and intelligent operation. Thermal power boilers generate large volumes of data during operation, and machine learning can extract valuable information from these complex data to help companies optimize equipment parameters, thereby improving the stability, responsiveness, accuracy, and energy-conversion rate of boiler operation. Machine learning can be applied across industries to analyze, organize, and make effective use of accumulated data. Its techniques handle large volumes of data with complex, non-linear structure, and suitably designed algorithms can learn patterns automatically. These advantages have made machine learning a breakthrough direction for the analysis of industrial data.

Data description

Install (if necessary) and load all packages needed for this study.

install.packages("Amelia",repos ="http://cran.us.r-project.org")
## 程序包'Amelia'打开成功,MD5和检查也通过
## Warning: 无法将拆除原来安装的程序包'Amelia'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): 拷贝
## D:\CODE\R-4.2.1\library\00LOCK\Amelia\libs\x64\Amelia.dll到D:
## \CODE\R-4.2.1\library\Amelia\libs\x64\Amelia.dll时出了问题:Permission denied
## Warning: 回复了'Amelia'
## 
## 下载的二进制程序包在
##  C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages里
install.packages("rpart",repos ="http://cran.us.r-project.org")
## 程序包'rpart'打开成功,MD5和检查也通过
## Warning: 无法将拆除原来安装的程序包'rpart'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): 拷贝
## D:\CODE\R-4.2.1\library\00LOCK\rpart\libs\x64\rpart.dll到D:
## \CODE\R-4.2.1\library\rpart\libs\x64\rpart.dll时出了问题:Permission denied
## Warning: 回复了'rpart'
## 
## 下载的二进制程序包在
##  C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages里
install.packages("rattle",repos ="http://cran.us.r-project.org")
## 程序包'rattle'打开成功,MD5和检查也通过
## 
## 下载的二进制程序包在
##  C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages里
install.packages("rpart.plot",repos ="http://cran.us.r-project.org")
## 程序包'rpart.plot'打开成功,MD5和检查也通过
## 
## 下载的二进制程序包在
##  C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages里
install.packages("RColorBrewer",repos ="http://cran.us.r-project.org")
## 程序包'RColorBrewer'打开成功,MD5和检查也通过
## 
## 下载的二进制程序包在
##  C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages里
install.packages("randomForest",repos ="http://cran.us.r-project.org")
## 程序包'randomForest'打开成功,MD5和检查也通过
## Warning: 无法将拆除原来安装的程序包'randomForest'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): 拷贝D:
## \CODE\R-4.2.1\library\00LOCK\randomForest\libs\x64\randomForest.dll到D:
## \CODE\R-4.2.1\library\randomForest\libs\x64\randomForest.dll时出了问题:
## Permission denied
## Warning: 回复了'randomForest'
## 
## 下载的二进制程序包在
##  C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages里
library(Amelia)
## Warning: 程辑包'Amelia'是用R版本4.2.2 来建造的
## 载入需要的程辑包:Rcpp
## Warning: 程辑包'Rcpp'是用R版本4.2.2 来建造的
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.1, built: 2022-11-18)
## ## Copyright (C) 2005-2023 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(rpart)
## Warning: 程辑包'rpart'是用R版本4.2.2 来建造的
library(rattle)
## Warning: 程辑包'rattle'是用R版本4.2.2 来建造的
## 载入需要的程辑包:tibble
## Warning: 程辑包'tibble'是用R版本4.2.2 来建造的
## 载入需要的程辑包:bitops
## Rattle: A free graphical interface for data science with R.
## XXXX 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## 键入'rattle()'去轻摇、晃动、翻滚你的数据。
library(rpart.plot)
## Warning: 程辑包'rpart.plot'是用R版本4.2.2 来建造的
library(RColorBrewer)
library(randomForest)
## Warning: 程辑包'randomForest'是用R版本4.2.2 来建造的
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## 载入程辑包:'randomForest'
## The following object is masked from 'package:rattle':
## 
##     importance
install.packages("xgboost",repos ="http://cran.us.r-project.org")
## 程序包'xgboost'打开成功,MD5和检查也通过
## Warning: 无法将拆除原来安装的程序包'xgboost'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): 拷贝D:
## \CODE\R-4.2.1\library\00LOCK\xgboost\libs\x64\xgboost.dll到D:
## \CODE\R-4.2.1\library\xgboost\libs\x64\xgboost.dll时出了问题:Permission denied
## Warning: 回复了'xgboost'
## 
## 下载的二进制程序包在
##  C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages里
library(xgboost)
## Warning: 程辑包'xgboost'是用R版本4.2.2 来建造的
## 
## 载入程辑包:'xgboost'
## The following object is masked from 'package:rattle':
## 
##     xgboost
install.packages("dplyr",repos ="http://cran.us.r-project.org")
## 程序包'dplyr'打开成功,MD5和检查也通过
## Warning: 无法将拆除原来安装的程序包'dplyr'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): 拷贝
## D:\CODE\R-4.2.1\library\00LOCK\dplyr\libs\x64\dplyr.dll到D:
## \CODE\R-4.2.1\library\dplyr\libs\x64\dplyr.dll时出了问题:Permission denied
## Warning: 回复了'dplyr'
## 
## 下载的二进制程序包在
##  C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages里
install.packages("gridExtra",repos ="http://cran.us.r-project.org")
## 程序包'gridExtra'打开成功,MD5和检查也通过
## 
## 下载的二进制程序包在
##  C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages里
library(dplyr)
## Warning: 程辑包'dplyr'是用R版本4.2.2 来建造的
## 
## 载入程辑包:'dplyr'
## The following object is masked from 'package:xgboost':
## 
##     slice
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(gridExtra)
## Warning: 程辑包'gridExtra'是用R版本4.2.2 来建造的
## 
## 载入程辑包:'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:randomForest':
## 
##     combine
install.packages("caretEnsemble",repos ="http://cran.us.r-project.org")
## 程序包'caretEnsemble'打开成功,MD5和检查也通过
## 
## 下载的二进制程序包在
##  C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages里
library(caretEnsemble)
## Warning: 程辑包'caretEnsemble'是用R版本4.2.2 来建造的
install.packages("rsample",repos ="http://cran.us.r-project.org")
## 程序包'rsample'打开成功,MD5和检查也通过
## 
## 下载的二进制程序包在
##  C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages里
library(rsample)
## Warning: 程辑包'rsample'是用R版本4.2.2 来建造的
## 
## 载入程辑包:'rsample'
## The following object is masked from 'package:Rcpp':
## 
##     populate
install.packages("caret",repos ="http://cran.us.r-project.org")
## 程序包'caret'打开成功,MD5和检查也通过
## Warning: 无法将拆除原来安装的程序包'caret'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): 拷贝
## D:\CODE\R-4.2.1\library\00LOCK\caret\libs\x64\caret.dll到D:
## \CODE\R-4.2.1\library\caret\libs\x64\caret.dll时出了问题:Permission denied
## Warning: 回复了'caret'
## 
## 下载的二进制程序包在
##  C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages里
library(caret)
## Warning: 程辑包'caret'是用R版本4.2.2 来建造的
## 载入需要的程辑包:ggplot2
## Warning: 程辑包'ggplot2'是用R版本4.2.2 来建造的
## 
## 载入程辑包:'ggplot2'
## The following object is masked from 'package:caretEnsemble':
## 
##     autoplot
## The following object is masked from 'package:randomForest':
## 
##     margin
## 载入需要的程辑包:lattice
install.packages("caretEnsemble",repos ="http://cran.us.r-project.org")
## Warning: 正在使用'caretEnsemble'这个程序包,因此不会被安装
library(caretEnsemble)
install.packages("rsample",repos ="http://cran.us.r-project.org")
## Warning: 正在使用'rsample'这个程序包,因此不会被安装
library(rsample)
install.packages("ggplot2",repos ="http://cran.us.r-project.org")
## Warning: 正在使用'ggplot2'这个程序包,因此不会被安装
library(ggplot2)
install.packages("e1071",repos ="http://cran.us.r-project.org")
## 程序包'e1071'打开成功,MD5和检查也通过
## Warning: 无法将拆除原来安装的程序包'e1071'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): 拷贝
## D:\CODE\R-4.2.1\library\00LOCK\e1071\libs\x64\e1071.dll到D:
## \CODE\R-4.2.1\library\e1071\libs\x64\e1071.dll时出了问题:Permission denied
## Warning: 回复了'e1071'
## 
## 下载的二进制程序包在
##  C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages里
library(e1071)
## Warning: 程辑包'e1071'是用R版本4.2.2 来建造的
## 
## 载入程辑包:'e1071'
## The following object is masked from 'package:rsample':
## 
##     permutations
install.packages("pheatmap",repos ="http://cran.us.r-project.org")
## 程序包'pheatmap'打开成功,MD5和检查也通过
## 
## 下载的二进制程序包在
##  C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages里
library(pheatmap)
## Warning: 程辑包'pheatmap'是用R版本4.2.2 来建造的
install.packages("factoextra",repos ="http://cran.us.r-project.org")
## 程序包'factoextra'打开成功,MD5和检查也通过
## 
## 下载的二进制程序包在
##  C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages里
library(factoextra) 
## Warning: 程辑包'factoextra'是用R版本4.2.2 来建造的
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
install.packages("patchwork",repos ="http://cran.us.r-project.org")
## 程序包'patchwork'打开成功,MD5和检查也通过
## 
## 下载的二进制程序包在
##  C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages里
library(patchwork) 
## Warning: 程辑包'patchwork'是用R版本4.2.2 来建造的
install.packages("rpart",repos ="http://cran.us.r-project.org")
## Warning: 正在使用'rpart'这个程序包,因此不会被安装
library(rpart) 
library(rpart.plot)
# Import the data files
# test: 1925 rows x 38 columns
# train: 2888 rows x 39 columns
# The train set has one extra column, target, which is the value to be predicted
data_train <- read.table("D:/zhengqi_train.txt", header = TRUE, sep = "\t")
data_test <- read.table("D:/zhengqi_test.txt", header = TRUE, sep = "\t")

# Keep a copy of the original training data in data_train1
data_train1 <- data_train

# Add an origin column to mark which set each row came from
data_train<-mutate(data_train, origin = 'train')
data_test<-mutate(data_test, origin = 'test')

head(data_train)
##      V0    V1     V2    V3    V4     V5     V6     V7     V8     V9    V10
## 1 0.566 0.016 -0.143 0.407 0.452 -0.901 -1.812 -2.360 -0.436 -2.114 -0.940
## 2 0.968 0.437  0.066 0.566 0.194 -0.893 -1.566 -2.360  0.332 -2.114  0.188
## 3 1.013 0.568  0.235 0.370 0.112 -0.797 -1.367 -2.360  0.396 -2.114  0.874
## 4 0.733 0.368  0.283 0.165 0.599 -0.679 -1.200 -2.086  0.403 -2.114  0.011
## 5 0.684 0.638  0.260 0.209 0.337 -0.454 -1.073 -2.086  0.314 -2.114 -0.251
## 6 0.445 0.627  0.408 0.220 0.458 -1.056 -1.009 -1.896  0.481 -2.114 -0.511
##      V11    V12    V13    V14    V15    V16    V17    V18    V19   V20    V21
## 1 -0.307 -0.073  0.550 -0.484  0.000 -1.707 -1.162 -0.573 -0.991 0.610 -0.400
## 2 -0.455 -0.134  1.109 -0.488  0.000 -0.977 -1.162 -0.571 -0.836 0.588 -0.802
## 3 -0.051 -0.072  0.767 -0.493 -0.212 -0.618 -0.897 -0.564 -0.558 0.576 -0.477
## 4  0.102 -0.014  0.769 -0.371 -0.162 -0.429 -0.897 -0.574 -0.564 0.272 -0.491
## 5  0.570  0.199 -0.349 -0.342 -0.138 -0.391 -0.897 -0.572 -0.394 0.106  0.309
## 6 -0.564  0.294  0.912 -0.345  0.111 -0.333 -1.029 -0.573 -0.516 0.029 -0.560
##      V22   V23   V24    V25   V26   V27    V28    V29   V30    V31    V32
## 1 -0.063 0.356 0.800 -0.223 0.796 0.168 -0.450  0.136 0.109 -0.615  0.327
## 2 -0.063 0.357 0.801 -0.144 1.057 0.338  0.671 -0.128 0.124  0.032  0.600
## 3 -0.063 0.355 0.961 -0.067 0.915 0.326  1.287 -0.009 0.361  0.277 -0.116
## 4 -0.063 0.352 1.435  0.113 0.898 0.277  1.298  0.015 0.417  0.279  0.603
## 5 -0.259 0.352 0.881  0.221 0.386 0.332  1.289  0.183 1.078  0.328  0.418
## 6 -0.096 0.349 0.798  0.245 0.643 0.356  1.296  0.454 0.674  0.358  0.618
##      V33    V34    V35    V36    V37 target origin
## 1 -4.627 -4.789 -5.101 -2.608 -3.508  0.175  train
## 2 -0.843  0.160  0.364 -0.335 -0.730  0.676  train
## 3 -0.843  0.160  0.364  0.765 -0.589  0.633  train
## 4 -0.843 -0.065  0.364  0.333 -0.112  0.206  train
## 5 -0.843 -0.215  0.364 -0.280 -0.028  0.384  train
## 6 -0.843 -0.290  0.364 -0.191 -0.883  0.060  train
head(data_test)
##       V0     V1     V2     V3     V4     V5    V6    V7     V8    V9    V10
## 1  0.368  0.380 -0.225 -0.049  0.379  0.092 0.550 0.551  0.244 0.904 -0.419
## 2  0.148  0.489 -0.247 -0.049  0.122 -0.201 0.487 0.493 -0.127 0.904 -0.403
## 3 -0.166 -0.062 -0.311  0.046 -0.055  0.063 0.485 0.493 -0.227 0.904  0.330
## 4  0.102  0.294 -0.259  0.051 -0.183  0.148 0.474 0.504  0.010 0.904 -0.431
## 5  0.300  0.428  0.208  0.051 -0.033  0.116 0.408 0.497  0.155 0.904 -0.162
## 6  0.050  0.340  0.108  0.051 -0.348  0.074 0.516 0.491  0.238 0.904  0.079
##      V11    V12    V13    V14    V15    V16   V17   V18   V19    V20   V21
## 1  0.515  0.346 -0.114 -0.204  0.239 -0.089 0.961 0.247 0.899 -0.252 0.628
## 2 -0.324  0.465  0.653  0.148 -0.113 -0.093 0.961 0.073 1.168 -0.276 0.009
## 3  0.389  0.173  0.398  0.068 -0.192 -0.061 0.961 0.070 0.980 -0.340 0.270
## 4  0.524 -0.038 -0.340 -0.313 -0.590 -0.134 0.961 0.078 1.070 -0.292 0.726
## 5  0.554 -0.063  0.611 -0.319 -0.927 -0.075 0.961 0.080 1.238 -0.150 0.141
## 6  0.373 -0.246  0.169 -0.365 -0.976  0.222 0.961 0.070 1.052 -0.110 0.407
##      V22   V23    V24    V25    V26   V27    V28    V29   V30    V31   V32
## 1 -0.063 0.098 -1.314 -0.662 -0.596 0.208 -0.449  0.047 0.057 -0.042 0.847
## 2 -0.063 0.090 -1.310 -0.646 -0.776 0.226 -0.443  0.047 0.560  0.176 0.551
## 3 -0.063 0.091 -1.310 -0.473 -0.607 0.084 -0.458 -0.398 0.101  0.199 0.634
## 4  0.133 0.086  0.234 -0.337 -0.986 0.203 -0.456 -0.398 1.007  0.137 1.042
## 5  0.133 0.089  0.237 -0.285 -0.669 0.227 -0.458 -0.776 0.291  0.370 0.181
## 6  0.133 0.099  0.234 -0.071 -0.712 0.229 -0.450 -0.897 0.536  0.447 0.370
##      V33    V34    V35    V36    V37 origin
## 1  0.534 -0.009 -0.190 -0.567  0.388   test
## 2  0.046 -0.220  0.008 -0.294  0.104   test
## 3  0.017 -0.234  0.008  0.373  0.569   test
## 4 -0.040 -0.290  0.008 -0.666  0.391   test
## 5 -0.040 -0.290  0.008 -0.140 -0.497   test
## 6 -0.040 -0.290  0.008 -0.228  0.169   test
str(data_train)
## 'data.frame':    2888 obs. of  40 variables:
##  $ V0    : num  0.566 0.968 1.013 0.733 0.684 ...
##  $ V1    : num  0.016 0.437 0.568 0.368 0.638 ...
##  $ V2    : num  -0.143 0.066 0.235 0.283 0.26 0.408 0.64 0.704 0.584 0.638 ...
##  $ V3    : num  0.407 0.566 0.37 0.165 0.209 0.22 0.356 0.438 0.459 0.617 ...
##  $ V4    : num  0.452 0.194 0.112 0.599 0.337 ...
##  $ V5    : num  -0.901 -0.893 -0.797 -0.679 -0.454 ...
##  $ V6    : num  -1.81 -1.57 -1.37 -1.2 -1.07 ...
##  $ V7    : num  -2.36 -2.36 -2.36 -2.09 -2.09 ...
##  $ V8    : num  -0.436 0.332 0.396 0.403 0.314 0.481 0.729 0.753 0.763 0.968 ...
##  $ V9    : num  -2.11 -2.11 -2.11 -2.11 -2.11 ...
##  $ V10   : num  -0.94 0.188 0.874 0.011 -0.251 -0.511 -0.256 -0.067 0.205 0.145 ...
##  $ V11   : num  -0.307 -0.455 -0.051 0.102 0.57 -0.564 -0.278 -0.24 0.422 0.179 ...
##  $ V12   : num  -0.073 -0.134 -0.072 -0.014 0.199 0.294 0.425 0.272 0.387 0.688 ...
##  $ V13   : num  0.55 1.109 0.767 0.769 -0.349 ...
##  $ V14   : num  -0.484 -0.488 -0.493 -0.371 -0.342 -0.345 -0.3 -0.387 -0.264 -0.289 ...
##  $ V15   : num  0 0 -0.212 -0.162 -0.138 0.111 0.111 0.244 0.293 0.317 ...
##  $ V16   : num  -1.707 -0.977 -0.618 -0.429 -0.391 ...
##  $ V17   : num  -1.162 -1.162 -0.897 -0.897 -0.897 ...
##  $ V18   : num  -0.573 -0.571 -0.564 -0.574 -0.572 -0.573 -0.586 -0.579 -0.566 -0.567 ...
##  $ V19   : num  -0.991 -0.836 -0.558 -0.564 -0.394 -0.516 -0.544 -0.465 -0.173 -0.557 ...
##  $ V20   : num  0.61 0.588 0.576 0.272 0.106 0.029 0.156 0.254 0.25 0.263 ...
##  $ V21   : num  -0.4 -0.802 -0.477 -0.491 0.309 -0.56 -0.34 -0.442 0.31 0.241 ...
##  $ V22   : num  -0.063 -0.063 -0.063 -0.063 -0.259 -0.096 -0.063 -0.259 -0.259 -0.259 ...
##  $ V23   : num  0.356 0.357 0.355 0.352 0.352 0.349 0.352 0.366 0.366 0.358 ...
##  $ V24   : num  0.8 0.801 0.961 1.435 0.881 ...
##  $ V25   : num  -0.223 -0.144 -0.067 0.113 0.221 0.245 0.389 0.56 0.577 0.493 ...
##  $ V26   : num  0.796 1.057 0.915 0.898 0.386 ...
##  $ V27   : num  0.168 0.338 0.326 0.277 0.332 0.356 0.401 0.409 0.49 0.512 ...
##  $ V28   : num  -0.45 0.671 1.287 1.298 1.289 ...
##  $ V29   : num  0.136 -0.128 -0.009 0.015 0.183 0.454 0.454 0.139 0.188 0.86 ...
##  $ V30   : num  0.109 0.124 0.361 0.417 1.078 ...
##  $ V31   : num  -0.615 0.032 0.277 0.279 0.328 0.358 0.243 0.428 0.597 0.916 ...
##  $ V32   : num  0.327 0.6 -0.116 0.603 0.418 0.618 0.468 -0.119 -0.057 0.039 ...
##  $ V33   : num  -4.627 -0.843 -0.843 -0.843 -0.843 ...
##  $ V34   : num  -4.789 0.16 0.16 -0.065 -0.215 ...
##  $ V35   : num  -5.101 0.364 0.364 0.364 0.364 ...
##  $ V36   : num  -2.608 -0.335 0.765 0.333 -0.28 ...
##  $ V37   : num  -3.508 -0.73 -0.589 -0.112 -0.028 ...
##  $ target: num  0.175 0.676 0.633 0.206 0.384 0.06 0.415 0.609 0.981 0.818 ...
##  $ origin: chr  "train" "train" "train" "train" ...
str(data_test)
## 'data.frame':    1925 obs. of  39 variables:
##  $ V0    : num  0.368 0.148 -0.166 0.102 0.3 0.05 -0.223 -0.126 -0.203 -0.181 ...
##  $ V1    : num  0.38 0.489 -0.062 0.294 0.428 0.34 0.175 0.152 -0.014 0.797 ...
##  $ V2    : num  -0.225 -0.247 -0.311 -0.259 0.208 0.108 -0.39 0.227 0.01 0.47 ...
##  $ V3    : num  -0.049 -0.049 0.046 0.051 0.051 0.051 0.051 0.021 -0.034 -0.107 ...
##  $ V4    : num  0.379 0.122 -0.055 -0.183 -0.033 -0.348 0.006 -0.619 -0.322 -0.477 ...
##  $ V5    : num  0.092 -0.201 0.063 0.148 0.116 0.074 0.134 -0.069 0.105 0.184 ...
##  $ V6    : num  0.55 0.487 0.485 0.474 0.408 0.516 0.497 0.52 0.453 0.588 ...
##  $ V7    : num  0.551 0.493 0.493 0.504 0.497 0.491 0.548 0.548 0.518 0.528 ...
##  $ V8    : num  0.244 -0.127 -0.227 0.01 0.155 0.238 -0.099 0.06 -0.032 0.319 ...
##  $ V9    : num  0.904 0.904 0.904 0.904 0.904 0.904 0.904 0.904 0.473 0.904 ...
##  $ V10   : num  -0.419 -0.403 0.33 -0.431 -0.162 0.079 0.226 -0.223 -0.352 -0.108 ...
##  $ V11   : num  0.515 -0.324 0.389 0.524 0.554 0.373 0.477 0.171 0.254 0.618 ...
##  $ V12   : num  0.346 0.465 0.173 -0.038 -0.063 -0.246 -0.276 -0.369 -0.318 -0.709 ...
##  $ V13   : num  -0.114 0.653 0.398 -0.34 0.611 ...
##  $ V14   : num  -0.204 0.148 0.068 -0.313 -0.319 -0.365 -0.247 -0.271 -0.008 -0.207 ...
##  $ V15   : num  0.239 -0.113 -0.192 -0.59 -0.927 ...
##  $ V16   : num  -0.089 -0.093 -0.061 -0.134 -0.075 0.222 0.252 0.01 0.003 0.252 ...
##  $ V17   : num  0.961 0.961 0.961 0.961 0.961 0.961 0.961 0.961 0.961 0.961 ...
##  $ V18   : num  0.247 0.073 0.07 0.078 0.08 0.07 0.066 0.07 0.064 0.078 ...
##  $ V19   : num  0.899 1.168 0.98 1.07 1.238 ...
##  $ V20   : num  -0.252 -0.276 -0.34 -0.292 -0.15 -0.11 -0.331 -0.245 -0.472 -0.493 ...
##  $ V21   : num  0.628 0.009 0.27 0.726 0.141 ...
##  $ V22   : num  -0.063 -0.063 -0.063 0.133 0.133 0.133 -0.063 -0.063 0.133 0.133 ...
##  $ V23   : num  0.098 0.09 0.091 0.086 0.089 0.099 0.104 0.101 0.093 -0.178 ...
##  $ V24   : num  -1.314 -1.31 -1.31 0.234 0.237 ...
##  $ V25   : num  -0.662 -0.646 -0.473 -0.337 -0.285 -0.071 0.009 0.003 0.006 0.199 ...
##  $ V26   : num  -0.596 -0.776 -0.607 -0.986 -0.669 ...
##  $ V27   : num  0.208 0.226 0.084 0.203 0.227 0.229 0.133 0.262 0.241 0.307 ...
##  $ V28   : num  -0.449 -0.443 -0.458 -0.456 -0.458 -0.45 -0.452 -0.452 -0.45 -0.446 ...
##  $ V29   : num  0.047 0.047 -0.398 -0.398 -0.776 ...
##  $ V30   : num  0.057 0.56 0.101 1.007 0.291 ...
##  $ V31   : num  -0.042 0.176 0.199 0.137 0.37 0.447 0.432 0.281 0.222 0.466 ...
##  $ V32   : num  0.847 0.551 0.634 1.042 0.181 ...
##  $ V33   : num  0.534 0.046 0.017 -0.04 -0.04 -0.04 -0.04 -0.04 -0.04 -0.04 ...
##  $ V34   : num  -0.009 -0.22 -0.234 -0.29 -0.29 -0.29 -0.29 -0.29 -0.29 -0.29 ...
##  $ V35   : num  -0.19 0.008 0.008 0.008 0.008 0.008 0.008 0.008 0.008 -0.289 ...
##  $ V36   : num  -0.567 -0.294 0.373 -0.666 -0.14 -0.228 0.104 -0.7 -0.236 -0.431 ...
##  $ V37   : num  0.388 0.104 0.569 0.391 -0.497 ...
##  $ origin: chr  "test" "test" "test" "test" ...
# Merge data_train and data_test by row
df<- bind_rows(data_train, data_test)
str(df)
## 'data.frame':    4813 obs. of  40 variables:
##  $ V0    : num  0.566 0.968 1.013 0.733 0.684 ...
##  $ V1    : num  0.016 0.437 0.568 0.368 0.638 ...
##  $ V2    : num  -0.143 0.066 0.235 0.283 0.26 0.408 0.64 0.704 0.584 0.638 ...
##  $ V3    : num  0.407 0.566 0.37 0.165 0.209 0.22 0.356 0.438 0.459 0.617 ...
##  $ V4    : num  0.452 0.194 0.112 0.599 0.337 ...
##  $ V5    : num  -0.901 -0.893 -0.797 -0.679 -0.454 ...
##  $ V6    : num  -1.81 -1.57 -1.37 -1.2 -1.07 ...
##  $ V7    : num  -2.36 -2.36 -2.36 -2.09 -2.09 ...
##  $ V8    : num  -0.436 0.332 0.396 0.403 0.314 0.481 0.729 0.753 0.763 0.968 ...
##  $ V9    : num  -2.11 -2.11 -2.11 -2.11 -2.11 ...
##  $ V10   : num  -0.94 0.188 0.874 0.011 -0.251 -0.511 -0.256 -0.067 0.205 0.145 ...
##  $ V11   : num  -0.307 -0.455 -0.051 0.102 0.57 -0.564 -0.278 -0.24 0.422 0.179 ...
##  $ V12   : num  -0.073 -0.134 -0.072 -0.014 0.199 0.294 0.425 0.272 0.387 0.688 ...
##  $ V13   : num  0.55 1.109 0.767 0.769 -0.349 ...
##  $ V14   : num  -0.484 -0.488 -0.493 -0.371 -0.342 -0.345 -0.3 -0.387 -0.264 -0.289 ...
##  $ V15   : num  0 0 -0.212 -0.162 -0.138 0.111 0.111 0.244 0.293 0.317 ...
##  $ V16   : num  -1.707 -0.977 -0.618 -0.429 -0.391 ...
##  $ V17   : num  -1.162 -1.162 -0.897 -0.897 -0.897 ...
##  $ V18   : num  -0.573 -0.571 -0.564 -0.574 -0.572 -0.573 -0.586 -0.579 -0.566 -0.567 ...
##  $ V19   : num  -0.991 -0.836 -0.558 -0.564 -0.394 -0.516 -0.544 -0.465 -0.173 -0.557 ...
##  $ V20   : num  0.61 0.588 0.576 0.272 0.106 0.029 0.156 0.254 0.25 0.263 ...
##  $ V21   : num  -0.4 -0.802 -0.477 -0.491 0.309 -0.56 -0.34 -0.442 0.31 0.241 ...
##  $ V22   : num  -0.063 -0.063 -0.063 -0.063 -0.259 -0.096 -0.063 -0.259 -0.259 -0.259 ...
##  $ V23   : num  0.356 0.357 0.355 0.352 0.352 0.349 0.352 0.366 0.366 0.358 ...
##  $ V24   : num  0.8 0.801 0.961 1.435 0.881 ...
##  $ V25   : num  -0.223 -0.144 -0.067 0.113 0.221 0.245 0.389 0.56 0.577 0.493 ...
##  $ V26   : num  0.796 1.057 0.915 0.898 0.386 ...
##  $ V27   : num  0.168 0.338 0.326 0.277 0.332 0.356 0.401 0.409 0.49 0.512 ...
##  $ V28   : num  -0.45 0.671 1.287 1.298 1.289 ...
##  $ V29   : num  0.136 -0.128 -0.009 0.015 0.183 0.454 0.454 0.139 0.188 0.86 ...
##  $ V30   : num  0.109 0.124 0.361 0.417 1.078 ...
##  $ V31   : num  -0.615 0.032 0.277 0.279 0.328 0.358 0.243 0.428 0.597 0.916 ...
##  $ V32   : num  0.327 0.6 -0.116 0.603 0.418 0.618 0.468 -0.119 -0.057 0.039 ...
##  $ V33   : num  -4.627 -0.843 -0.843 -0.843 -0.843 ...
##  $ V34   : num  -4.789 0.16 0.16 -0.065 -0.215 ...
##  $ V35   : num  -5.101 0.364 0.364 0.364 0.364 ...
##  $ V36   : num  -2.608 -0.335 0.765 0.333 -0.28 ...
##  $ V37   : num  -3.508 -0.73 -0.589 -0.112 -0.028 ...
##  $ target: num  0.175 0.676 0.633 0.206 0.384 0.06 0.415 0.609 0.981 0.818 ...
##  $ origin: chr  "train" "train" "train" "train" ...

Check each column for missing values.

colSums(is.na(data_train))
##     V0     V1     V2     V3     V4     V5     V6     V7     V8     V9    V10 
##      0      0      0      0      0      0      0      0      0      0      0 
##    V11    V12    V13    V14    V15    V16    V17    V18    V19    V20    V21 
##      0      0      0      0      0      0      0      0      0      0      0 
##    V22    V23    V24    V25    V26    V27    V28    V29    V30    V31    V32 
##      0      0      0      0      0      0      0      0      0      0      0 
##    V33    V34    V35    V36    V37 target origin 
##      0      0      0      0      0      0      0

Visualisation of missing data

missmap(data_train)

Feature density plots of data_train and data_test, comparing the train and test distributions.

# Plot the feature densities four at a time, colouring by origin (train vs. test)
for(i in seq(1, 10, by = 1)){
  i <- i * 4
  start <- i - 3
  if(i == 40){
    i <- 38   # the last batch has only two features (V36, V37)
  }
  plots <- list()
  for(col in names(data_test)[start:i]){
    p <- ggplot(df, aes(x = get(col), colour = origin, group = origin)) +
      xlab("feature value") +
      ggtitle(col) +
      geom_density(linewidth = 2)   # linewidth replaces the deprecated size aesthetic
    plots <- c(plots, list(p))
  }
  grid.arrange(grobs = plots, ncol = 2)
}

Feature selection: we remove features whose distributions differ markedly between the training and test sets. This reduces the likelihood of overfitting, drops variables that would not transfer to the test set, and helps improve the generalizability of the model. Based on the density plots above, V8, V9, V10, and V11 are removed; one way to quantify that visual judgment is sketched below.
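A possible way to make the visual comparison quantitative (a sketch, not part of the original selection process) is a two-sample Kolmogorov-Smirnov statistic per feature:

# Two-sample KS statistic per feature: larger values indicate a bigger
# train/test distribution shift (ks.test() warns about ties here, but the
# statistic is still usable as a rough ranking)
ks_stats <- sapply(paste0("V", 0:37), function(col) {
  unname(ks.test(data_train[[col]], data_test[[col]])$statistic)
})
head(sort(ks_stats, decreasing = TRUE))  # candidates for removal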

drop_array <- c("V8", "V9", "V10", "V11")
data_train1<-data_train1[,-which(colnames(data_train1) %in% drop_array)]
names(data_train1)
##  [1] "V0"     "V1"     "V2"     "V3"     "V4"     "V5"     "V6"     "V7"    
##  [9] "V12"    "V13"    "V14"    "V15"    "V16"    "V17"    "V18"    "V19"   
## [17] "V20"    "V21"    "V22"    "V23"    "V24"    "V25"    "V26"    "V27"   
## [25] "V28"    "V29"    "V30"    "V31"    "V32"    "V33"    "V34"    "V35"   
## [33] "V36"    "V37"    "target"
# Plotting scatter plots and fitting curves
head(data_train1)
##      V0    V1     V2    V3    V4     V5     V6     V7    V12    V13    V14
## 1 0.566 0.016 -0.143 0.407 0.452 -0.901 -1.812 -2.360 -0.073  0.550 -0.484
## 2 0.968 0.437  0.066 0.566 0.194 -0.893 -1.566 -2.360 -0.134  1.109 -0.488
## 3 1.013 0.568  0.235 0.370 0.112 -0.797 -1.367 -2.360 -0.072  0.767 -0.493
## 4 0.733 0.368  0.283 0.165 0.599 -0.679 -1.200 -2.086 -0.014  0.769 -0.371
## 5 0.684 0.638  0.260 0.209 0.337 -0.454 -1.073 -2.086  0.199 -0.349 -0.342
## 6 0.445 0.627  0.408 0.220 0.458 -1.056 -1.009 -1.896  0.294  0.912 -0.345
##      V15    V16    V17    V18    V19   V20    V21    V22   V23   V24    V25
## 1  0.000 -1.707 -1.162 -0.573 -0.991 0.610 -0.400 -0.063 0.356 0.800 -0.223
## 2  0.000 -0.977 -1.162 -0.571 -0.836 0.588 -0.802 -0.063 0.357 0.801 -0.144
## 3 -0.212 -0.618 -0.897 -0.564 -0.558 0.576 -0.477 -0.063 0.355 0.961 -0.067
## 4 -0.162 -0.429 -0.897 -0.574 -0.564 0.272 -0.491 -0.063 0.352 1.435  0.113
## 5 -0.138 -0.391 -0.897 -0.572 -0.394 0.106  0.309 -0.259 0.352 0.881  0.221
## 6  0.111 -0.333 -1.029 -0.573 -0.516 0.029 -0.560 -0.096 0.349 0.798  0.245
##     V26   V27    V28    V29   V30    V31    V32    V33    V34    V35    V36
## 1 0.796 0.168 -0.450  0.136 0.109 -0.615  0.327 -4.627 -4.789 -5.101 -2.608
## 2 1.057 0.338  0.671 -0.128 0.124  0.032  0.600 -0.843  0.160  0.364 -0.335
## 3 0.915 0.326  1.287 -0.009 0.361  0.277 -0.116 -0.843  0.160  0.364  0.765
## 4 0.898 0.277  1.298  0.015 0.417  0.279  0.603 -0.843 -0.065  0.364  0.333
## 5 0.386 0.332  1.289  0.183 1.078  0.328  0.418 -0.843 -0.215  0.364 -0.280
## 6 0.643 0.356  1.296  0.454 0.674  0.358  0.618 -0.843 -0.290  0.364 -0.191
##      V37 target
## 1 -3.508  0.175
## 2 -0.730  0.676
## 3 -0.589  0.633
## 4 -0.112  0.206
## 5 -0.028  0.384
## 6 -0.883  0.060
# Plot each remaining feature against the target, four panels at a time
for(i in seq(1, 9, by = 1)){
  i <- i * 4
  start <- i - 3
  if(i == 36){
    i <- 34   # the last batch has only two feature columns
  }
  plots <- list()
  for(col in names(data_train1)[start:i]){
    p <- ggplot(data_train1, aes(x = get(col), y = target)) +
      # Scatter plot of the feature against the target
      geom_point() +
      xlab(col) +
      ylab("target") +
      # Smoothed fit curve (ggplot2 picks a GAM at this sample size)
      geom_smooth()
    plots <- c(plots, list(p))
  }
  grid.arrange(grobs = plots, ncol = 2)
}
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

# Heat map of the pairwise correlations between all remaining columns (including target)
corr <- cor(data_train1)
# levelplot() comes from lattice, which caret loads as a dependency
levelplot(t(corr), col.regions = colorRampPalette(c("blue", "green", "yellow", "red"))(100),
          xlab = "", ylab = "", main = "Heatmap",
          par.settings = list(layout.heights = list(bottom.padding = 5, top.padding = 5)))

# Remove features whose absolute correlation with target is below 0.1
threshold <- 0.1
corr_matrix <- abs(cor(data_train1))  # absolute-value correlation matrix
drop_col <- names(data_train1)[corr_matrix[, "target"] < threshold]
drop_col
## [1] "V14" "V21" "V25" "V26" "V32" "V33" "V34"
data_train1 <- data_train1[, !(names(data_train1) %in% drop_col)]
names(data_train1)
##  [1] "V0"     "V1"     "V2"     "V3"     "V4"     "V5"     "V6"     "V7"    
##  [9] "V12"    "V13"    "V15"    "V16"    "V17"    "V18"    "V19"    "V20"   
## [17] "V22"    "V23"    "V24"    "V27"    "V28"    "V29"    "V30"    "V31"   
## [25] "V35"    "V36"    "V37"    "target"
# data_train1 is now the data frame with the low-correlation columns removed
trainy = data_train1["target"]
trainx = data_train1[, !(names(data_train1) %in% "target")]
# Split into training (70%) and test (30%) sets
set.seed(827)
inTrain <- createDataPartition(trainy$target, p = 0.7, list = FALSE)
X_train <- trainx[inTrain,]
X_test <- trainx[-inTrain,]
y_train <- trainy[inTrain,]
y_test <- trainy[-inTrain,]

Modeling

Goodness-of-fit references: residual standard error (RSE), multiple R-squared (R2) and adjusted R-squared, and the F-statistic, as reported by summary().

Prediction-evaluation references: MSE (mean squared error) is the mean of the squared differences between predicted and true values; the smaller it is, the more accurate the model. RMSE (root mean squared error) is its square root and measures the typical deviation between predicted and true values, again with smaller values indicating a more accurate model. R2 (coefficient of determination) is the proportion of the variation in the true values explained by the model's predictions; the closer it is to 1, the more accurate the model.
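For reference, the three metrics can be written out directly (a minimal sketch; the sections below use caret's RMSE() and R2() instead, and note that caret's R2() defaults to the squared correlation between predictions and observations, which can differ slightly from the coefficient-of-determination form here):

# Hand-rolled evaluation metrics, shown only to make the formulas explicit
eval_metrics <- function(pred, actual) {
  mse  <- mean((actual - pred)^2)   # mean squared error
  rmse <- sqrt(mse)                 # root mean squared error
  r2   <- 1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)  # coefficient of determination
  c(MSE = mse, RMSE = rmse, R2 = r2)
}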

# Multiple linear regression
x_vars <- names(X_train)
x_vars <- paste(x_vars, collapse = "+")
x_vars
## [1] "V0+V1+V2+V3+V4+V5+V6+V7+V12+V13+V15+V16+V17+V18+V19+V20+V22+V23+V24+V27+V28+V29+V30+V31+V35+V36+V37"
lr_model <- lm(y_train ~V0+V1+V2+V3+V4+V5+V6+V7+V12+V13+V15+V16+V17+V18+V19+V20+V22+V23+V24+V27+V28+V29+V30+V31+V35+V36+V37, data=cbind(X_train,y_train))
# Coefficient table: beta estimates and their significance levels
# Estimate: the intercept (b0) and the coefficient estimate for each predictor
# Std. Error: the standard error of each estimate; the larger it is, the less confidence we have in the estimate
# t value: the t-statistic, i.e. the estimate (column 2) divided by its standard error (column 3)
# Pr(>|t|): the p-value for the t-statistic; the smaller it is, the more significant the coefficient
summary(lr_model)
## 
## Call:
## lm(formula = y_train ~ V0 + V1 + V2 + V3 + V4 + V5 + V6 + V7 + 
##     V12 + V13 + V15 + V16 + V17 + V18 + V19 + V20 + V22 + V23 + 
##     V24 + V27 + V28 + V29 + V30 + V31 + V35 + V36 + V37, data = cbind(X_train, 
##     y_train))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.54258 -0.20255 -0.00513  0.17818  1.42964 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.1815800  0.0223937  -8.109 8.85e-16 ***
## V0           0.3991926  0.0303538  13.151  < 2e-16 ***
## V1           0.1976929  0.0290428   6.807 1.31e-11 ***
## V2           0.2197217  0.0243662   9.017  < 2e-16 ***
## V3           0.1352732  0.0102928  13.142  < 2e-16 ***
## V4           0.0155256  0.0314455   0.494 0.621551    
## V5           0.0030441  0.0216578   0.141 0.888236    
## V6           0.2350767  0.0378718   6.207 6.54e-10 ***
## V7          -0.1792823  0.0277447  -6.462 1.30e-10 ***
## V12          0.0676274  0.0279165   2.422 0.015503 *  
## V13         -0.0061209  0.0115662  -0.529 0.596720    
## V15          0.0478988  0.0261885   1.829 0.067549 .  
## V16         -0.0720262  0.0296579  -2.429 0.015246 *  
## V17          0.0751529  0.0157702   4.766 2.02e-06 ***
## V18          0.0369619  0.0105296   3.510 0.000458 ***
## V19          0.0071012  0.0090155   0.788 0.430984    
## V20          0.0065008  0.0115224   0.564 0.572687    
## V22          0.1110113  0.0167087   6.644 3.93e-11 ***
## V23          0.0215705  0.0119886   1.799 0.072129 .  
## V24         -0.0531022  0.0113056  -4.697 2.82e-06 ***
## V27          0.5497939  0.0743657   7.393 2.10e-13 ***
## V28         -0.0130093  0.0086068  -1.512 0.130816    
## V29         -0.0766962  0.0283104  -2.709 0.006804 ** 
## V30         -0.0150293  0.0107219  -1.402 0.161150    
## V31          0.0200336  0.0256401   0.781 0.434696    
## V35         -0.0278864  0.0112024  -2.489 0.012879 *  
## V36          0.0135898  0.0113926   1.193 0.233067    
## V37          0.0005468  0.0181994   0.030 0.976033    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3369 on 1996 degrees of freedom
## Multiple R-squared:  0.8867, Adjusted R-squared:  0.8851 
## F-statistic: 578.3 on 27 and 1996 DF,  p-value: < 2.2e-16
# Make predictions with the linear model
pred <- predict(lr_model, X_test)
# Evaluation
mse <- mean((y_test - pred)^2)
paste("MSE:", mse)
## [1] "MSE: 0.143400237516701"
paste("RMSE:", RMSE(pred, y_test))
## [1] "RMSE: 0.378682238184868"
paste("R2:", R2(pred, y_test))
## [1] "R2: 0.845392042438334"
# XGBoost regression: set the training parameters
params <- list(booster = "gbtree",
               objective = "reg:squarederror",
               eta = 0.3,
               max_depth = 6,
               subsample = 1,
               colsample_bytree = 1)

# Convert training data into the required format for xgboost
train_matrix <- xgb.DMatrix(data = as.matrix(X_train), label = y_train)

# Train the model
xgboost_model <- xgb.train(params = params, data = train_matrix, nrounds = 100)
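# Optional (a sketch, with illustrative settings): xgb.cv() can cross-validate
# the same parameters first, which helps choose nrounds before the final fit, e.g.
# cv <- xgb.cv(params = params, data = train_matrix, nrounds = 100,
#              nfold = 5, early_stopping_rounds = 10, verbose = 0)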

# Predictions for the test set
test_matrix <- xgb.DMatrix(data = as.matrix(X_test))
pred <- predict(xgboost_model, test_matrix)
# Use the xgb.plot.importance() function to visualize the importance of each feature to the model prediction results.
importance_matrix <- xgb.importance(colnames(X_train), model = xgboost_model)
xgb.plot.importance(importance_matrix)

# Plot the scatter plot of predicted and true values
df_pred <- data.frame(pred, y_test)
colnames(df_pred) <- c("pred","y_test")
ggplot(df_pred, aes(x=pred, y=y_test)) + 
  geom_point() + 
  geom_abline(intercept = 0, slope = 1)

# Evaluate the XGBoost predictions (pred still holds the XGBoost output from above)
mse <- mean((y_test - pred)^2)
paste("MSE:", mse)
paste("RMSE:", RMSE(pred, y_test))
paste("R2:", R2(pred, y_test))

Random forest regression

rf<-randomForest(y_train ~V0+V1+V2+V3+V4+V5+V6+V7+V12+V13+V15+V16+V17+V18+V19+V20+V22+V23+V24+V27+V28+V29+V30+V31+V35+V36+V37, data=cbind(X_train,y_train),importance=TRUE, ntree=1000)
rf
## 
## Call:
##  randomForest(formula = y_train ~ V0 + V1 + V2 + V3 + V4 + V5 +      V6 + V7 + V12 + V13 + V15 + V16 + V17 + V18 + V19 + V20 +      V22 + V23 + V24 + V27 + V28 + V29 + V30 + V31 + V35 + V36 +      V37, data = cbind(X_train, y_train), importance = TRUE, ntree = 1000) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 9
## 
##           Mean of squared residuals: 0.1181603
##                     % Var explained: 88.03
importance(rf)
##       %IncMSE IncNodePurity
## V0  49.785910    591.580126
## V1  36.648070    509.231388
## V2  37.138166     84.043155
## V3  47.063295     39.019175
## V4  24.141669     49.501757
## V5  16.774140     10.169873
## V6  25.315566     16.785337
## V7  20.894548     11.370043
## V12 19.708106     38.608386
## V13 10.014369      8.926566
## V15 17.350189     12.081380
## V16 30.995485     36.485409
## V17 11.082196      5.721145
## V18 12.110626     10.514513
## V19 11.704465      9.287433
## V20 13.836438     13.020525
## V22 14.318238      5.864386
## V23 13.936151     10.051831
## V24 17.560764      9.729105
## V27 24.697703    255.113118
## V28  6.870147      7.943724
## V29 18.550579     14.902115
## V30  4.144389      8.677826
## V31 24.651701    148.581726
## V35  7.760467      4.566536
## V36 20.054828     14.627364
## V37 19.927453     57.444320
varImpPlot(rf)

# Evaluate the random-forest predictions on the held-out test split
pred <- predict(rf, X_test)
mse <- mean((y_test - pred)^2)
paste("MSE:", mse)
paste("RMSE:", RMSE(pred, y_test))
paste("R2:", R2(pred, y_test))
# %IncMSE and IncNodePurity should be read as qualitative guides to variable
# importance, not as quantitative criteria for trading variables off against one another

Support vector regression

svm_model <- svm(y_train ~V0+V1+V2+V3+V4+V5+V6+V7+V12+V13+V15+V16+V17+V18+V19+V20+V22+V23+V24+V27+V28+V29+V30+V31+V35+V36+V37, data=cbind(X_train,y_train), kernel="radial")
svm_model
## 
## Call:
## svm(formula = y_train ~ V0 + V1 + V2 + V3 + V4 + V5 + V6 + V7 + V12 + 
##     V13 + V15 + V16 + V17 + V18 + V19 + V20 + V22 + V23 + V24 + V27 + 
##     V28 + V29 + V30 + V31 + V35 + V36 + V37, data = cbind(X_train, 
##     y_train), kernel = "radial")
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.03703704 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  1448

svm_pred <- predict(svm_model, X_test, decision.values = TRUE)
pred <- svm_pred
mse <- mean((y_test - pred)^2)
paste("MSE:", mse)
## [1] "MSE: 0.142456887135041"
paste("RMSE:", RMSE(pred, y_test))
## [1] "RMSE: 0.377434613059058"
paste("R2:", R2(pred, y_test))
## [1] "R2: 0.846482244254957"
# Decision tree regression model
model_dt <- rpart(y_train ~V0+V1+V2+V3+V4+V5+V6+V7+V12+V13+V15+V16+V17+V18+V19+V20+V22+V23+V24+V27+V28+V29+V30+V31+V35+V36+V37, data=cbind(X_train,y_train))

# Prediction and evaluation
pred <- predict(model_dt, X_test)
mse <- mean((y_test - pred)^2)
paste("MSE:", mse)
paste("RMSE:", RMSE(pred, y_test))
paste("R2:", R2(pred, y_test))
rpart.plot(model_dt)
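As an optional check (a sketch), rpart's complexity-parameter table shows how the cross-validated error changes with tree size, which indicates whether pruning would help:

# Inspect the complexity-parameter table of the fitted tree
printcp(model_dt)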

Clustering

install.packages("cluster", repos = "http://cran.us.r-project.org")
library(cluster)

# K-means with 3 clusters (kmeans uses random initialisation, so set.seed() would make this reproducible)
result <- kmeans(data_train1, 3)
print(result)
## K-means clustering with 3 clusters of sizes 1244, 516, 1128
## 
## Cluster means:
##            V0          V1         V2         V3         V4         V5
## 1  0.03878296  0.09255145  0.5597982 -0.1234277 -0.3908521 -0.4309116
## 2 -1.02471318 -1.29671705 -0.9235678 -0.8406919 -0.4681919 -0.3858740
## 3  0.74101773  0.63466046  0.5468821  0.3471312  0.6782996 -0.7783422
##           V6         V7        V12        V13        V15        V16         V17
## 1  0.6710571  0.5349277 -0.3932347  0.1951158 -0.6795740  0.5006061  0.03433039
## 2 -0.9607287 -0.9029341 -0.5006841 -0.1642519  0.4264845 -1.2715969 -0.19939535
## 3  0.1676720  0.1204956  0.7220488  0.3611011  0.8005275  0.3202057 -0.05791223
##           V18        V19         V20       V22        V23         V24
## 1  0.07826447  0.2124301 -0.13592605 0.2784405  0.1487846  0.66117283
## 2 -0.31330620 -0.1392868 -0.77540504 0.4618430 -0.2886802  0.08840116
## 3  0.19791046 -0.4646950  0.02781826 0.2571489  0.3673183 -0.82545213
##          V27        V28        V29         V30        V31         V35
## 1  0.3155611  0.2922822 -0.7148730  0.24424116  0.2273392  0.13748794
## 2 -0.1402132 -0.1121802  0.5301609 -0.43078101 -1.1898043 -0.06321124
## 3  0.4135665  0.0815594  0.7958750  0.06973759  0.6207340  0.38362057
##          V36        V37     target
## 1  0.1908023 -0.4861584  0.1540595
## 2 -0.7337636  1.0278895 -1.3382229
## 3  0.2037270 -0.2677323  0.7657624
## 
## Clustering vector:
##    [1] 2 3 3 1 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 3 3 1 1 1 2 2 3 3 3 3 3 2 2 2 3 3 3
##   [38] 3 3 2 1 2 1 1 3 3 3 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 2 1
##   [75] 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 3 3 3 3 3 3 2 2 3 3 3
##  [112] 3 1 2 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3 3 1 2 2 2 2 2 3 3 3 1 3 1 1 1
##  [149] 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 3 1 1 1 1 1 1 1 1 3 3 3 3
##  [186] 1 3 1 1 1 3 3 3 3 3 3 3 3 1 2 2 3 3 3 3 3 3 3 3 3 1 1 3 1 1 1 1 1 1 1 1 3
##  [223] 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 3 3 3 3 1 1 1 1 3 3 3 3 1 1 1 3 3 3 3 3
##  [260] 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3
##  [297] 3 3 3 3 1 1 1 1 1 3 3 3 1 1 3 2 3 1 1 1 1 1 1 2 2 2 2 1 2 3 3 3 3 3 2 2 2
##  [334] 3 3 3 3 3 3 3 1 1 1 2 2 2 3 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 2 2
##  [371] 3 3 3 3 1 1 2 2 2 2 2 1 2 1 2 2 2 3 3 2 2 2 3 3 2 2 3 3 2 3 3 2 2 2 3 3 3
##  [408] 3 3 2 2 2 2 2 2 2 2 2 3 3 2 3 3 2 3 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 3 3 3 3
##  [445] 1 1 1 1 1 3 3 3 2 3 2 2 1 1 2 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [482] 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 1 1 3 3 3 3 3 3 3 3 3 3 3 3
##  [519] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 3 3 3
##  [556] 3 1 1 1 1 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 1 1
##  [593] 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [630] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 3 3 3 3
##  [667] 3 3 3 3 3 3 3 3 1 1 1 1 3 3 3 3 3 3 3 3 3 2 1 1 1 1 1 2 1 1 3 1 3 3 3 1 1
##  [704] 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1
##  [741] 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3
##  [778] 3 3 3 3 3 1 1 1 1 1 1 3 1 3 3 3 3 2 2 2 2 3 2 3 1 1 3 3 3 3 3 3 3 3 3 3 3
##  [815] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 1
##  [852] 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 3 1 3 3 1 1 2 2 1 1
##  [889] 3 3 1 3 3 1 2 1 1 1 1 2 3 3 3 2 2 2 3 2 3 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 3
##  [926] 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 1 3 3 2 2 3 3 3 3 3
##  [963] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3
## [1000] 3 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 3 2 2 2 3 3 2 2 2
## [1037] 2 2 2 2 2 2 3 3 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 3 2 2 2
## [1074] 2 3 2 2 2 2 2 2 3 3 2 2 2 3 3 3 2 3 3 3 3 3 3 3 3 3 2 3 1 1 1 3 3 3 3 3 3
## [1111] 3 3 3 2 2 2 3 2 2 2 3 3 3 3 1 1 2 2 2 3 1 2 2 1 2 1 2 1 2 2 2 2 2 2 3 1 2
## [1148] 2 2 2 3 3 2 3 3 2 3 3 3 3 3 3 3 2 3 3 3 3 3 2 3 3 3 3 3 3 1 3 2 2 3 3 3 3
## [1185] 1 1 1 1 3 3 3 3 1 1 3 3 3 3 3 3 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [1222] 3 3 3 3 3 3 3 3 3 3 3 2 2 3 3 3 1 1 1 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [1259] 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [1296] 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 2 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3
## [1333] 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 3 3 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1
## [1370] 1 1 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [1407] 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
## [1444] 2 1 1 1 3 3 3 3 3 3 3 2 2 3 3 3 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 2 2 1 1
## [1481] 1 1 1 1 1 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 2 2 3 3 3 3 2 2 2 2 2 2 2 2
## [1518] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [1555] 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 1 2 2 1 1 1 1 1 1 1 1 1 1 2 3 2 2 2 2 2 3 3
## [1592] 3 3 3 1 1 1 1 1 1 1 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
## [1629] 2 3 3 3 1 1 1 1 1 1 1 1 1 1 1 3 3 3 1 3 3 3 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1
## [1666] 1 3 3 3 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 1 1 2 2 2 2 1
## [1703] 2 1 2 1 1 1 1 1 1 1 1 1 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3
## [1740] 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [1777] 3 3 3 3 3 3 3 3 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 2 2 2 1 2 2 2
## [1814] 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 2 2 3 2 2 2 2 2 1 1 1 1 1 1
## [1851] 1 1 1 1 1 1 1 1 3 3 3 3 1 2 2 2 3 3 3 3 2 2 2 3 3 3 3 3 2 2 2 2 2 2 2 2 3
## [1888] 3 3 3 3 3 1 1 2 2 1 1 1 3 3 3 3 3 3 2 2 3 3 3 2 3 3 3 3 2 2 2 2 3 3 3 3 3
## [1925] 2 2 2 2 2 3 3 3 3 2 2 2 2 1 2 1 2 3 2 3 3 3 3 3 2 2 3 3 2 3 1 1 1 1 1 1 1
## [1962] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [1999] 1 2 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2036] 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2073] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 3 3 3 3 3 1 1 1 1 1 2
## [2110] 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 1 3 1 1 1 1 1 1 1
## [2147] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1
## [2184] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 2 2 2 1 2 2 2 1 1 1 1 1 1 1 1 1 3 3 2 3
## [2221] 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2258] 1 1 1 1 1 1 2 2 2 3 3 3 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 1 1 1 1 1 1 1 1 1
## [2295] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 3 3 1 1 3 3 3 3 1 1 2 1 1
## [2332] 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2369] 1 1 1 1 1 1 1 1 1 1 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2406] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3
## [2443] 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 1 1 1 1 1 1
## [2480] 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2517] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3
## [2554] 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2
## [2591] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [2628] 3 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 1 1 2
## [2665] 2 1 1 1 2 1 1 1 3 3 2 2 2 2 2 2 3 3 3 3 3 1 1 1 1 1 1 1 1 1 3 3 1 1 1 1 1
## [2702] 1 1 3 1 1 1 1 1 2 1 1 1 1 1 3 3 3 3 3 1 1 1 3 1 3 3 3 3 3 1 3 3 3 3 1 3 3
## [2739] 3 1 1 1 1 1 1 1 1 1 1 1 3 3 3 1 1 1 1 1 1 1 1 1 3 3 3 3 2 2 2 2 3 3 3 3 3
## [2776] 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2813] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 2 3 3 3 3 2 2 2 3 3 3 1 1 1
## [2850] 1 1 1 1 1 1 3 3 3 2 3 3 3 3 2 3 2 2 3 3 3 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 1
## [2887] 3 3
## 
## Within cluster sum of squares by cluster:
## [1] 16771.51 16823.36 15288.10
##  (between_SS / total_SS =  27.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
Hierarchical clustering

library(stats)        # hclust() and dist() are in base R's stats package
library(factoextra)   # already loaded above
hc_model <- hclust(dist(data_train1))
print(hc_model)
## 
## Call:
## hclust(d = dist(data_train1))
## 
## Cluster method   : complete 
## Distance         : euclidean 
## Number of objects: 2888
plot(hc_model)
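For comparison with the k-means result, the dendrogram can be cut into the same number of groups (a sketch):

# Cut the tree into 3 clusters and tabulate the group sizes
hc_groups <- cutree(hc_model, k = 3)
table(hc_groups)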