Group members for this project
S2174087/1 JUNHAO XU
S2175919/1 Haojie Chu
S2164009/1 Wang Jiaxin
S2165529/1 Xiaowei Zhang
S2001533 Cheah Xiao Ying
Energy plays an important role in human society and underpins its development. Electricity is generated from many sources, including coal, nuclear, natural gas, solar and geothermal energy. Because thermal power generation will remain the main source of electricity for industrial and domestic use for many years to come, the main tasks at hand are to explore efficient, clean generation technologies and to optimize thermal power facilities so that they use energy more efficiently.
The basic principle of thermal power generation is as follows: burning fuel heats water to produce steam, the steam pressure drives a steam turbine, and the turbine drives a generator that produces electricity. In this chain of energy conversions, the factor that most affects generation efficiency is the combustion efficiency of the boiler, that is, how effectively the burning fuel turns water into high-temperature, high-pressure steam. Many factors affect boiler combustion efficiency. Some are adjustable boiler parameters, such as the fuel feed rate, primary and secondary air, induced air, return air and water supply volume. Others are boiler operating conditions, such as bed temperature, bed pressure, furnace temperature and pressure, and superheater temperature.
With rapid industrialization, applying big-data-driven artificial intelligence to industrial production and manufacturing helps make it digital and intelligent. Thermal power boilers generate a large amount of data during operation. Machine learning can extract valuable information from this large, complex body of data and help companies tune equipment parameters, thereby improving the stability, speed, accuracy and energy conversion rate of boiler operation. More generally, machine learning can be applied across industries to analyze and organize accumulated data and put it to effective use. It can handle large volumes of data with complex, non-linear structure, and suitably designed algorithms can learn patterns from the data on their own. These advantages have opened a new direction for the analysis of industrial data.
Install and load all packages needed for this study.
install.packages("Amelia",repos ="http://cran.us.r-project.org")
## package 'Amelia' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'Amelia'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying
## D:\CODE\R-4.2.1\library\00LOCK\Amelia\libs\x64\Amelia.dll to
## D:\CODE\R-4.2.1\library\Amelia\libs\x64\Amelia.dll: Permission denied
## Warning: restored 'Amelia'
##
## The downloaded binary packages are in
## C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages
install.packages("rpart",repos ="http://cran.us.r-project.org")
## package 'rpart' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'rpart'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying
## D:\CODE\R-4.2.1\library\00LOCK\rpart\libs\x64\rpart.dll to
## D:\CODE\R-4.2.1\library\rpart\libs\x64\rpart.dll: Permission denied
## Warning: restored 'rpart'
##
## The downloaded binary packages are in
## C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages
install.packages("rattle",repos ="http://cran.us.r-project.org")
## package 'rattle' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages
install.packages("rpart.plot",repos ="http://cran.us.r-project.org")
## package 'rpart.plot' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages
install.packages("RColorBrewer",repos ="http://cran.us.r-project.org")
## package 'RColorBrewer' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages
install.packages("randomForest",repos ="http://cran.us.r-project.org")
## package 'randomForest' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'randomForest'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying
## D:\CODE\R-4.2.1\library\00LOCK\randomForest\libs\x64\randomForest.dll to
## D:\CODE\R-4.2.1\library\randomForest\libs\x64\randomForest.dll:
## Permission denied
## Warning: restored 'randomForest'
##
## The downloaded binary packages are in
## C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages
library(Amelia)
## Warning: package 'Amelia' was built under R version 4.2.2
## Loading required package: Rcpp
## Warning: package 'Rcpp' was built under R version 4.2.2
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.1, built: 2022-11-18)
## ## Copyright (C) 2005-2023 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(rpart)
## Warning: package 'rpart' was built under R version 4.2.2
library(rattle)
## Warning: package 'rattle' was built under R version 4.2.2
## Loading required package: tibble
## Warning: package 'tibble' was built under R version 4.2.2
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.2.2
library(RColorBrewer)
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.2.2
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
##
## importance
install.packages("xgboost",repos ="http://cran.us.r-project.org")
## package 'xgboost' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'xgboost'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying
## D:\CODE\R-4.2.1\library\00LOCK\xgboost\libs\x64\xgboost.dll to
## D:\CODE\R-4.2.1\library\xgboost\libs\x64\xgboost.dll: Permission denied
## Warning: restored 'xgboost'
##
## The downloaded binary packages are in
## C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages
library(xgboost)
## Warning: package 'xgboost' was built under R version 4.2.2
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:rattle':
##
## xgboost
install.packages("dplyr",repos ="http://cran.us.r-project.org")
## package 'dplyr' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'dplyr'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying
## D:\CODE\R-4.2.1\library\00LOCK\dplyr\libs\x64\dplyr.dll to
## D:\CODE\R-4.2.1\library\dplyr\libs\x64\dplyr.dll: Permission denied
## Warning: restored 'dplyr'
##
## The downloaded binary packages are in
## C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages
install.packages("gridExtra",repos ="http://cran.us.r-project.org")
## package 'gridExtra' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.2
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:xgboost':
##
## slice
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.2.2
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:randomForest':
##
## combine
install.packages("caretEnsemble",repos ="http://cran.us.r-project.org")
## package 'caretEnsemble' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages
library(caretEnsemble)
## Warning: package 'caretEnsemble' was built under R version 4.2.2
install.packages("rsample",repos ="http://cran.us.r-project.org")
## package 'rsample' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages
library(rsample)
## Warning: package 'rsample' was built under R version 4.2.2
##
## Attaching package: 'rsample'
## The following object is masked from 'package:Rcpp':
##
## populate
install.packages("caret",repos ="http://cran.us.r-project.org")
## package 'caret' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'caret'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying
## D:\CODE\R-4.2.1\library\00LOCK\caret\libs\x64\caret.dll to
## D:\CODE\R-4.2.1\library\caret\libs\x64\caret.dll: Permission denied
## Warning: restored 'caret'
##
## The downloaded binary packages are in
## C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages
library(caret)
## Warning: package 'caret' was built under R version 4.2.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.2.2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:caretEnsemble':
##
## autoplot
## The following object is masked from 'package:randomForest':
##
## margin
## Loading required package: lattice
# caretEnsemble and rsample are already installed and attached above, and
# ggplot2 was attached as a dependency of caret, so nothing needs reinstalling.
library(ggplot2)
install.packages("e1071",repos ="http://cran.us.r-project.org")
## package 'e1071' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'e1071'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying
## D:\CODE\R-4.2.1\library\00LOCK\e1071\libs\x64\e1071.dll to
## D:\CODE\R-4.2.1\library\e1071\libs\x64\e1071.dll: Permission denied
## Warning: restored 'e1071'
##
## The downloaded binary packages are in
## C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages
library(e1071)
## Warning: package 'e1071' was built under R version 4.2.2
##
## Attaching package: 'e1071'
## The following object is masked from 'package:rsample':
##
## permutations
install.packages("pheatmap",repos ="http://cran.us.r-project.org")
## package 'pheatmap' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages
library(pheatmap)
## Warning: package 'pheatmap' was built under R version 4.2.2
install.packages("factoextra",repos ="http://cran.us.r-project.org")
## package 'factoextra' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.2.2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
install.packages("patchwork",repos ="http://cran.us.r-project.org")
## package 'patchwork' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\guazi\AppData\Local\Temp\Rtmp0O1yBr\downloaded_packages
library(patchwork)
## Warning: package 'patchwork' was built under R version 4.2.2
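The install-then-load sequence above can be written more compactly. The following is a minimal sketch, not the setup used for this report: `missing_packages()` and `load_packages()` are hypothetical helper names, and the CRAN mirror URL is an assumption. It installs only the packages that are not yet present, which also avoids the "Permission denied" warnings that appear when a package already in use is reinstalled.

```r
# Packages loaded in this report (names taken from the library() calls above).
pkgs <- c("Amelia", "rpart", "rattle", "rpart.plot", "RColorBrewer",
          "randomForest", "xgboost", "dplyr", "gridExtra", "caretEnsemble",
          "rsample", "caret", "ggplot2", "e1071", "pheatmap", "factoextra",
          "patchwork")

# Hypothetical helper: return the packages in `pkgs` that are not installed.
# requireNamespace() returns FALSE (without an error) for a missing package.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Hypothetical helper: install whatever is missing, then attach everything.
# The mirror URL is an assumption; any CRAN mirror can be used instead.
load_packages <- function(pkgs, repos = "https://cloud.r-project.org") {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install, repos = repos)
  }
  invisible(lapply(pkgs, library, character.only = TRUE))
}

# Usage (not run here because it may download packages):
# load_packages(pkgs)
```

Keeping the package names in one vector also makes it easy to see at a glance what the analysis depends on.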
# Import the data files
# test set: 1925 rows x 38 columns
# train set: 2888 rows x 39 columns
# The training set has one extra column, target, which is the value to predict for the test set
data_train <- read.table("D:/zhengqi_train.txt", header = TRUE, sep = "\t")
data_test <- read.table("D:/zhengqi_test.txt", header = TRUE, sep = "\t")
# data_train1 keeps a copy of the original training data (without the origin column)
data_train1 <- data_train
# Add an origin column so each row can be traced back to its source after merging
data_train <- mutate(data_train, origin = "train")
data_test <- mutate(data_test, origin = "test")
head(data_train)
## V0 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## 1 0.566 0.016 -0.143 0.407 0.452 -0.901 -1.812 -2.360 -0.436 -2.114 -0.940
## 2 0.968 0.437 0.066 0.566 0.194 -0.893 -1.566 -2.360 0.332 -2.114 0.188
## 3 1.013 0.568 0.235 0.370 0.112 -0.797 -1.367 -2.360 0.396 -2.114 0.874
## 4 0.733 0.368 0.283 0.165 0.599 -0.679 -1.200 -2.086 0.403 -2.114 0.011
## 5 0.684 0.638 0.260 0.209 0.337 -0.454 -1.073 -2.086 0.314 -2.114 -0.251
## 6 0.445 0.627 0.408 0.220 0.458 -1.056 -1.009 -1.896 0.481 -2.114 -0.511
## V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
## 1 -0.307 -0.073 0.550 -0.484 0.000 -1.707 -1.162 -0.573 -0.991 0.610 -0.400
## 2 -0.455 -0.134 1.109 -0.488 0.000 -0.977 -1.162 -0.571 -0.836 0.588 -0.802
## 3 -0.051 -0.072 0.767 -0.493 -0.212 -0.618 -0.897 -0.564 -0.558 0.576 -0.477
## 4 0.102 -0.014 0.769 -0.371 -0.162 -0.429 -0.897 -0.574 -0.564 0.272 -0.491
## 5 0.570 0.199 -0.349 -0.342 -0.138 -0.391 -0.897 -0.572 -0.394 0.106 0.309
## 6 -0.564 0.294 0.912 -0.345 0.111 -0.333 -1.029 -0.573 -0.516 0.029 -0.560
## V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32
## 1 -0.063 0.356 0.800 -0.223 0.796 0.168 -0.450 0.136 0.109 -0.615 0.327
## 2 -0.063 0.357 0.801 -0.144 1.057 0.338 0.671 -0.128 0.124 0.032 0.600
## 3 -0.063 0.355 0.961 -0.067 0.915 0.326 1.287 -0.009 0.361 0.277 -0.116
## 4 -0.063 0.352 1.435 0.113 0.898 0.277 1.298 0.015 0.417 0.279 0.603
## 5 -0.259 0.352 0.881 0.221 0.386 0.332 1.289 0.183 1.078 0.328 0.418
## 6 -0.096 0.349 0.798 0.245 0.643 0.356 1.296 0.454 0.674 0.358 0.618
## V33 V34 V35 V36 V37 target origin
## 1 -4.627 -4.789 -5.101 -2.608 -3.508 0.175 train
## 2 -0.843 0.160 0.364 -0.335 -0.730 0.676 train
## 3 -0.843 0.160 0.364 0.765 -0.589 0.633 train
## 4 -0.843 -0.065 0.364 0.333 -0.112 0.206 train
## 5 -0.843 -0.215 0.364 -0.280 -0.028 0.384 train
## 6 -0.843 -0.290 0.364 -0.191 -0.883 0.060 train
head(data_test)
## V0 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## 1 0.368 0.380 -0.225 -0.049 0.379 0.092 0.550 0.551 0.244 0.904 -0.419
## 2 0.148 0.489 -0.247 -0.049 0.122 -0.201 0.487 0.493 -0.127 0.904 -0.403
## 3 -0.166 -0.062 -0.311 0.046 -0.055 0.063 0.485 0.493 -0.227 0.904 0.330
## 4 0.102 0.294 -0.259 0.051 -0.183 0.148 0.474 0.504 0.010 0.904 -0.431
## 5 0.300 0.428 0.208 0.051 -0.033 0.116 0.408 0.497 0.155 0.904 -0.162
## 6 0.050 0.340 0.108 0.051 -0.348 0.074 0.516 0.491 0.238 0.904 0.079
## V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
## 1 0.515 0.346 -0.114 -0.204 0.239 -0.089 0.961 0.247 0.899 -0.252 0.628
## 2 -0.324 0.465 0.653 0.148 -0.113 -0.093 0.961 0.073 1.168 -0.276 0.009
## 3 0.389 0.173 0.398 0.068 -0.192 -0.061 0.961 0.070 0.980 -0.340 0.270
## 4 0.524 -0.038 -0.340 -0.313 -0.590 -0.134 0.961 0.078 1.070 -0.292 0.726
## 5 0.554 -0.063 0.611 -0.319 -0.927 -0.075 0.961 0.080 1.238 -0.150 0.141
## 6 0.373 -0.246 0.169 -0.365 -0.976 0.222 0.961 0.070 1.052 -0.110 0.407
## V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32
## 1 -0.063 0.098 -1.314 -0.662 -0.596 0.208 -0.449 0.047 0.057 -0.042 0.847
## 2 -0.063 0.090 -1.310 -0.646 -0.776 0.226 -0.443 0.047 0.560 0.176 0.551
## 3 -0.063 0.091 -1.310 -0.473 -0.607 0.084 -0.458 -0.398 0.101 0.199 0.634
## 4 0.133 0.086 0.234 -0.337 -0.986 0.203 -0.456 -0.398 1.007 0.137 1.042
## 5 0.133 0.089 0.237 -0.285 -0.669 0.227 -0.458 -0.776 0.291 0.370 0.181
## 6 0.133 0.099 0.234 -0.071 -0.712 0.229 -0.450 -0.897 0.536 0.447 0.370
## V33 V34 V35 V36 V37 origin
## 1 0.534 -0.009 -0.190 -0.567 0.388 test
## 2 0.046 -0.220 0.008 -0.294 0.104 test
## 3 0.017 -0.234 0.008 0.373 0.569 test
## 4 -0.040 -0.290 0.008 -0.666 0.391 test
## 5 -0.040 -0.290 0.008 -0.140 -0.497 test
## 6 -0.040 -0.290 0.008 -0.228 0.169 test
str(data_train)
## 'data.frame': 2888 obs. of 40 variables:
## $ V0 : num 0.566 0.968 1.013 0.733 0.684 ...
## $ V1 : num 0.016 0.437 0.568 0.368 0.638 ...
## $ V2 : num -0.143 0.066 0.235 0.283 0.26 0.408 0.64 0.704 0.584 0.638 ...
## $ V3 : num 0.407 0.566 0.37 0.165 0.209 0.22 0.356 0.438 0.459 0.617 ...
## $ V4 : num 0.452 0.194 0.112 0.599 0.337 ...
## $ V5 : num -0.901 -0.893 -0.797 -0.679 -0.454 ...
## $ V6 : num -1.81 -1.57 -1.37 -1.2 -1.07 ...
## $ V7 : num -2.36 -2.36 -2.36 -2.09 -2.09 ...
## $ V8 : num -0.436 0.332 0.396 0.403 0.314 0.481 0.729 0.753 0.763 0.968 ...
## $ V9 : num -2.11 -2.11 -2.11 -2.11 -2.11 ...
## $ V10 : num -0.94 0.188 0.874 0.011 -0.251 -0.511 -0.256 -0.067 0.205 0.145 ...
## $ V11 : num -0.307 -0.455 -0.051 0.102 0.57 -0.564 -0.278 -0.24 0.422 0.179 ...
## $ V12 : num -0.073 -0.134 -0.072 -0.014 0.199 0.294 0.425 0.272 0.387 0.688 ...
## $ V13 : num 0.55 1.109 0.767 0.769 -0.349 ...
## $ V14 : num -0.484 -0.488 -0.493 -0.371 -0.342 -0.345 -0.3 -0.387 -0.264 -0.289 ...
## $ V15 : num 0 0 -0.212 -0.162 -0.138 0.111 0.111 0.244 0.293 0.317 ...
## $ V16 : num -1.707 -0.977 -0.618 -0.429 -0.391 ...
## $ V17 : num -1.162 -1.162 -0.897 -0.897 -0.897 ...
## $ V18 : num -0.573 -0.571 -0.564 -0.574 -0.572 -0.573 -0.586 -0.579 -0.566 -0.567 ...
## $ V19 : num -0.991 -0.836 -0.558 -0.564 -0.394 -0.516 -0.544 -0.465 -0.173 -0.557 ...
## $ V20 : num 0.61 0.588 0.576 0.272 0.106 0.029 0.156 0.254 0.25 0.263 ...
## $ V21 : num -0.4 -0.802 -0.477 -0.491 0.309 -0.56 -0.34 -0.442 0.31 0.241 ...
## $ V22 : num -0.063 -0.063 -0.063 -0.063 -0.259 -0.096 -0.063 -0.259 -0.259 -0.259 ...
## $ V23 : num 0.356 0.357 0.355 0.352 0.352 0.349 0.352 0.366 0.366 0.358 ...
## $ V24 : num 0.8 0.801 0.961 1.435 0.881 ...
## $ V25 : num -0.223 -0.144 -0.067 0.113 0.221 0.245 0.389 0.56 0.577 0.493 ...
## $ V26 : num 0.796 1.057 0.915 0.898 0.386 ...
## $ V27 : num 0.168 0.338 0.326 0.277 0.332 0.356 0.401 0.409 0.49 0.512 ...
## $ V28 : num -0.45 0.671 1.287 1.298 1.289 ...
## $ V29 : num 0.136 -0.128 -0.009 0.015 0.183 0.454 0.454 0.139 0.188 0.86 ...
## $ V30 : num 0.109 0.124 0.361 0.417 1.078 ...
## $ V31 : num -0.615 0.032 0.277 0.279 0.328 0.358 0.243 0.428 0.597 0.916 ...
## $ V32 : num 0.327 0.6 -0.116 0.603 0.418 0.618 0.468 -0.119 -0.057 0.039 ...
## $ V33 : num -4.627 -0.843 -0.843 -0.843 -0.843 ...
## $ V34 : num -4.789 0.16 0.16 -0.065 -0.215 ...
## $ V35 : num -5.101 0.364 0.364 0.364 0.364 ...
## $ V36 : num -2.608 -0.335 0.765 0.333 -0.28 ...
## $ V37 : num -3.508 -0.73 -0.589 -0.112 -0.028 ...
## $ target: num 0.175 0.676 0.633 0.206 0.384 0.06 0.415 0.609 0.981 0.818 ...
## $ origin: chr "train" "train" "train" "train" ...
str(data_test)
## 'data.frame': 1925 obs. of 39 variables:
## $ V0 : num 0.368 0.148 -0.166 0.102 0.3 0.05 -0.223 -0.126 -0.203 -0.181 ...
## $ V1 : num 0.38 0.489 -0.062 0.294 0.428 0.34 0.175 0.152 -0.014 0.797 ...
## $ V2 : num -0.225 -0.247 -0.311 -0.259 0.208 0.108 -0.39 0.227 0.01 0.47 ...
## $ V3 : num -0.049 -0.049 0.046 0.051 0.051 0.051 0.051 0.021 -0.034 -0.107 ...
## $ V4 : num 0.379 0.122 -0.055 -0.183 -0.033 -0.348 0.006 -0.619 -0.322 -0.477 ...
## $ V5 : num 0.092 -0.201 0.063 0.148 0.116 0.074 0.134 -0.069 0.105 0.184 ...
## $ V6 : num 0.55 0.487 0.485 0.474 0.408 0.516 0.497 0.52 0.453 0.588 ...
## $ V7 : num 0.551 0.493 0.493 0.504 0.497 0.491 0.548 0.548 0.518 0.528 ...
## $ V8 : num 0.244 -0.127 -0.227 0.01 0.155 0.238 -0.099 0.06 -0.032 0.319 ...
## $ V9 : num 0.904 0.904 0.904 0.904 0.904 0.904 0.904 0.904 0.473 0.904 ...
## $ V10 : num -0.419 -0.403 0.33 -0.431 -0.162 0.079 0.226 -0.223 -0.352 -0.108 ...
## $ V11 : num 0.515 -0.324 0.389 0.524 0.554 0.373 0.477 0.171 0.254 0.618 ...
## $ V12 : num 0.346 0.465 0.173 -0.038 -0.063 -0.246 -0.276 -0.369 -0.318 -0.709 ...
## $ V13 : num -0.114 0.653 0.398 -0.34 0.611 ...
## $ V14 : num -0.204 0.148 0.068 -0.313 -0.319 -0.365 -0.247 -0.271 -0.008 -0.207 ...
## $ V15 : num 0.239 -0.113 -0.192 -0.59 -0.927 ...
## $ V16 : num -0.089 -0.093 -0.061 -0.134 -0.075 0.222 0.252 0.01 0.003 0.252 ...
## $ V17 : num 0.961 0.961 0.961 0.961 0.961 0.961 0.961 0.961 0.961 0.961 ...
## $ V18 : num 0.247 0.073 0.07 0.078 0.08 0.07 0.066 0.07 0.064 0.078 ...
## $ V19 : num 0.899 1.168 0.98 1.07 1.238 ...
## $ V20 : num -0.252 -0.276 -0.34 -0.292 -0.15 -0.11 -0.331 -0.245 -0.472 -0.493 ...
## $ V21 : num 0.628 0.009 0.27 0.726 0.141 ...
## $ V22 : num -0.063 -0.063 -0.063 0.133 0.133 0.133 -0.063 -0.063 0.133 0.133 ...
## $ V23 : num 0.098 0.09 0.091 0.086 0.089 0.099 0.104 0.101 0.093 -0.178 ...
## $ V24 : num -1.314 -1.31 -1.31 0.234 0.237 ...
## $ V25 : num -0.662 -0.646 -0.473 -0.337 -0.285 -0.071 0.009 0.003 0.006 0.199 ...
## $ V26 : num -0.596 -0.776 -0.607 -0.986 -0.669 ...
## $ V27 : num 0.208 0.226 0.084 0.203 0.227 0.229 0.133 0.262 0.241 0.307 ...
## $ V28 : num -0.449 -0.443 -0.458 -0.456 -0.458 -0.45 -0.452 -0.452 -0.45 -0.446 ...
## $ V29 : num 0.047 0.047 -0.398 -0.398 -0.776 ...
## $ V30 : num 0.057 0.56 0.101 1.007 0.291 ...
## $ V31 : num -0.042 0.176 0.199 0.137 0.37 0.447 0.432 0.281 0.222 0.466 ...
## $ V32 : num 0.847 0.551 0.634 1.042 0.181 ...
## $ V33 : num 0.534 0.046 0.017 -0.04 -0.04 -0.04 -0.04 -0.04 -0.04 -0.04 ...
## $ V34 : num -0.009 -0.22 -0.234 -0.29 -0.29 -0.29 -0.29 -0.29 -0.29 -0.29 ...
## $ V35 : num -0.19 0.008 0.008 0.008 0.008 0.008 0.008 0.008 0.008 -0.289 ...
## $ V36 : num -0.567 -0.294 0.373 -0.666 -0.14 -0.228 0.104 -0.7 -0.236 -0.431 ...
## $ V37 : num 0.388 0.104 0.569 0.391 -0.497 ...
## $ origin: chr "test" "test" "test" "test" ...
# Merge data_train and data_test by row
df<- bind_rows(data_train, data_test)
str(df)
## 'data.frame': 4813 obs. of 40 variables:
## $ V0 : num 0.566 0.968 1.013 0.733 0.684 ...
## $ V1 : num 0.016 0.437 0.568 0.368 0.638 ...
## $ V2 : num -0.143 0.066 0.235 0.283 0.26 0.408 0.64 0.704 0.584 0.638 ...
## $ V3 : num 0.407 0.566 0.37 0.165 0.209 0.22 0.356 0.438 0.459 0.617 ...
## $ V4 : num 0.452 0.194 0.112 0.599 0.337 ...
## $ V5 : num -0.901 -0.893 -0.797 -0.679 -0.454 ...
## $ V6 : num -1.81 -1.57 -1.37 -1.2 -1.07 ...
## $ V7 : num -2.36 -2.36 -2.36 -2.09 -2.09 ...
## $ V8 : num -0.436 0.332 0.396 0.403 0.314 0.481 0.729 0.753 0.763 0.968 ...
## $ V9 : num -2.11 -2.11 -2.11 -2.11 -2.11 ...
## $ V10 : num -0.94 0.188 0.874 0.011 -0.251 -0.511 -0.256 -0.067 0.205 0.145 ...
## $ V11 : num -0.307 -0.455 -0.051 0.102 0.57 -0.564 -0.278 -0.24 0.422 0.179 ...
## $ V12 : num -0.073 -0.134 -0.072 -0.014 0.199 0.294 0.425 0.272 0.387 0.688 ...
## $ V13 : num 0.55 1.109 0.767 0.769 -0.349 ...
## $ V14 : num -0.484 -0.488 -0.493 -0.371 -0.342 -0.345 -0.3 -0.387 -0.264 -0.289 ...
## $ V15 : num 0 0 -0.212 -0.162 -0.138 0.111 0.111 0.244 0.293 0.317 ...
## $ V16 : num -1.707 -0.977 -0.618 -0.429 -0.391 ...
## $ V17 : num -1.162 -1.162 -0.897 -0.897 -0.897 ...
## $ V18 : num -0.573 -0.571 -0.564 -0.574 -0.572 -0.573 -0.586 -0.579 -0.566 -0.567 ...
## $ V19 : num -0.991 -0.836 -0.558 -0.564 -0.394 -0.516 -0.544 -0.465 -0.173 -0.557 ...
## $ V20 : num 0.61 0.588 0.576 0.272 0.106 0.029 0.156 0.254 0.25 0.263 ...
## $ V21 : num -0.4 -0.802 -0.477 -0.491 0.309 -0.56 -0.34 -0.442 0.31 0.241 ...
## $ V22 : num -0.063 -0.063 -0.063 -0.063 -0.259 -0.096 -0.063 -0.259 -0.259 -0.259 ...
## $ V23 : num 0.356 0.357 0.355 0.352 0.352 0.349 0.352 0.366 0.366 0.358 ...
## $ V24 : num 0.8 0.801 0.961 1.435 0.881 ...
## $ V25 : num -0.223 -0.144 -0.067 0.113 0.221 0.245 0.389 0.56 0.577 0.493 ...
## $ V26 : num 0.796 1.057 0.915 0.898 0.386 ...
## $ V27 : num 0.168 0.338 0.326 0.277 0.332 0.356 0.401 0.409 0.49 0.512 ...
## $ V28 : num -0.45 0.671 1.287 1.298 1.289 ...
## $ V29 : num 0.136 -0.128 -0.009 0.015 0.183 0.454 0.454 0.139 0.188 0.86 ...
## $ V30 : num 0.109 0.124 0.361 0.417 1.078 ...
## $ V31 : num -0.615 0.032 0.277 0.279 0.328 0.358 0.243 0.428 0.597 0.916 ...
## $ V32 : num 0.327 0.6 -0.116 0.603 0.418 0.618 0.468 -0.119 -0.057 0.039 ...
## $ V33 : num -4.627 -0.843 -0.843 -0.843 -0.843 ...
## $ V34 : num -4.789 0.16 0.16 -0.065 -0.215 ...
## $ V35 : num -5.101 0.364 0.364 0.364 0.364 ...
## $ V36 : num -2.608 -0.335 0.765 0.333 -0.28 ...
## $ V37 : num -3.508 -0.73 -0.589 -0.112 -0.028 ...
## $ target: num 0.175 0.676 0.633 0.206 0.384 0.06 0.415 0.609 0.981 0.818 ...
## $ origin: chr "train" "train" "train" "train" ...
colSums(is.na(data_train))
## V0 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## 0 0 0 0 0 0 0 0 0 0 0
## V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
## 0 0 0 0 0 0 0 0 0 0 0
## V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32
## 0 0 0 0 0 0 0 0 0 0 0
## V33 V34 V35 V36 V37 target origin
## 0 0 0 0 0 0 0
missmap(data_train)
Density plots of each feature in data_train and data_test
# Compare the train and test distribution of every feature, four panels per page
features <- setdiff(names(data_test), "origin")
for (chunk in split(features, ceiling(seq_along(features) / 4))) {
  plots <- lapply(chunk, function(col) {
    ggplot(df, aes(x = .data[[col]], colour = origin, group = origin)) +
      xlab("feature value") +
      ggtitle(col) +
      # linewidth replaces the size aesthetic for lines since ggplot2 3.4.0
      geom_density(linewidth = 2)
  })
  grid.arrange(grobs = plots, ncol = 2)
}
Feature selection: features whose distributions differ markedly between the training and test sets are removed. A model fitted on such features learns patterns that do not hold in the test set, so dropping them reduces the risk of overfitting and helps the model generalize.
drop_array <- c("V8", "V9", "V10", "V11")
# Keep every column whose name is not listed in drop_array
data_train1 <- data_train1[, !(colnames(data_train1) %in% drop_array)]
names(data_train1)
## [1] "V0" "V1" "V2" "V3" "V4" "V5" "V6" "V7"
## [9] "V12" "V13" "V14" "V15" "V16" "V17" "V18" "V19"
## [17] "V20" "V21" "V22" "V23" "V24" "V25" "V26" "V27"
## [25] "V28" "V29" "V30" "V31" "V32" "V33" "V34" "V35"
## [33] "V36" "V37" "target"
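The columns in `drop_array` were chosen from the density plots. As a complementary check, the gap between a feature's training and test distributions can be summarized with a single number. The sketch below is not part of the original analysis: it uses the two-sample Kolmogorov-Smirnov statistic from base R's `ks.test()` on synthetic data, and the names `toy_train`, `toy_test` and `ks_shift` are hypothetical.

```r
set.seed(1)

# Synthetic stand-ins for the training and test sets (not the report's data).
# Feature A has the same distribution in both; feature B is shifted in the test set.
toy_train <- data.frame(A = rnorm(500), B = rnorm(500))
toy_test  <- data.frame(A = rnorm(300), B = rnorm(300, mean = 1.5))

# Two-sample Kolmogorov-Smirnov statistic for every column shared by both sets.
# The statistic lies in [0, 1]; larger values mean the distributions differ more.
ks_shift <- function(train, test) {
  cols <- intersect(names(train), names(test))
  vapply(cols, function(col) {
    unname(ks.test(train[[col]], test[[col]])$statistic)
  }, numeric(1))
}

shift <- ks_shift(toy_train, toy_test)
print(sort(shift, decreasing = TRUE))
```

Features with the largest statistic would be the first candidates for removal; any cut-off value is a judgment call and should be checked against the plots.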
# Preview the data after dropping the shifted features
head(data_train1)
## V0 V1 V2 V3 V4 V5 V6 V7 V12 V13 V14
## 1 0.566 0.016 -0.143 0.407 0.452 -0.901 -1.812 -2.360 -0.073 0.550 -0.484
## 2 0.968 0.437 0.066 0.566 0.194 -0.893 -1.566 -2.360 -0.134 1.109 -0.488
## 3 1.013 0.568 0.235 0.370 0.112 -0.797 -1.367 -2.360 -0.072 0.767 -0.493
## 4 0.733 0.368 0.283 0.165 0.599 -0.679 -1.200 -2.086 -0.014 0.769 -0.371
## 5 0.684 0.638 0.260 0.209 0.337 -0.454 -1.073 -2.086 0.199 -0.349 -0.342
## 6 0.445 0.627 0.408 0.220 0.458 -1.056 -1.009 -1.896 0.294 0.912 -0.345
## V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25
## 1 0.000 -1.707 -1.162 -0.573 -0.991 0.610 -0.400 -0.063 0.356 0.800 -0.223
## 2 0.000 -0.977 -1.162 -0.571 -0.836 0.588 -0.802 -0.063 0.357 0.801 -0.144
## 3 -0.212 -0.618 -0.897 -0.564 -0.558 0.576 -0.477 -0.063 0.355 0.961 -0.067
## 4 -0.162 -0.429 -0.897 -0.574 -0.564 0.272 -0.491 -0.063 0.352 1.435 0.113
## 5 -0.138 -0.391 -0.897 -0.572 -0.394 0.106 0.309 -0.259 0.352 0.881 0.221
## 6 0.111 -0.333 -1.029 -0.573 -0.516 0.029 -0.560 -0.096 0.349 0.798 0.245
## V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36
## 1 0.796 0.168 -0.450 0.136 0.109 -0.615 0.327 -4.627 -4.789 -5.101 -2.608
## 2 1.057 0.338 0.671 -0.128 0.124 0.032 0.600 -0.843 0.160 0.364 -0.335
## 3 0.915 0.326 1.287 -0.009 0.361 0.277 -0.116 -0.843 0.160 0.364 0.765
## 4 0.898 0.277 1.298 0.015 0.417 0.279 0.603 -0.843 -0.065 0.364 0.333
## 5 0.386 0.332 1.289 0.183 1.078 0.328 0.418 -0.843 -0.215 0.364 -0.280
## 6 0.643 0.356 1.296 0.454 0.674 0.358 0.618 -0.843 -0.290 0.364 -0.191
## V37 target
## 1 -3.508 0.175
## 2 -0.730 0.676
## 3 -0.589 0.633
## 4 -0.112 0.206
## 5 -0.028 0.384
## 6 -0.883 0.060
# Scatter plot of target against each feature with a fitted smooth curve, four panels per page
features <- setdiff(names(data_train1), "target")
for (chunk in split(features, ceiling(seq_along(features) / 4))) {
  plots <- lapply(chunk, function(col) {
    ggplot(data_train1, aes(x = .data[[col]], y = target)) +
      # Raw observations
      geom_point() +
      # Smoothed trend line
      geom_smooth() +
      # The feature is on the x-axis and the target on the y-axis
      xlab(col) +
      ylab("target")
  })
  grid.arrange(grobs = plots, ncol = 2)
}
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
# Heat map of the pairwise correlations between all remaining columns, including target
corr <- cor(data_train1)
column_data <- as.matrix(data.frame(corr))
# levelplot() is provided by lattice, which was attached as a dependency of caret
levelplot(t(column_data),
          col.regions = colorRampPalette(c("blue", "green", "yellow", "red"))(100),
          xlab = "", ylab = "", main = "Heatmap",
          par.settings = list(layout.heights = list(bottom.padding = 5, top.padding = 5)))
# Remove features whose absolute correlation with target is below 0.1
threshold <- 0.1
corr_matrix <- abs(cor(data_train1)) # Absolute value correlation matrix
drop_col <- names(data_train1)[corr_matrix[,"target"] < threshold]
drop_col
## [1] "V14" "V21" "V25" "V26" "V32" "V33" "V34"
data_train1 <- data_train1[, !(names(data_train1) %in% drop_col)]
names(data_train1)
## [1] "V0" "V1" "V2" "V3" "V4" "V5" "V6" "V7"
## [9] "V12" "V13" "V15" "V16" "V17" "V18" "V19" "V20"
## [17] "V22" "V23" "V24" "V27" "V28" "V29" "V30" "V31"
## [25] "V35" "V36" "V37" "target"
# data_train1 is now the data frame with the low-correlation columns removed
trainy = data_train1["target"]
trainx = data_train1[, !(names(data_train1) %in% "target")]
# Split the labelled data into a training part (70%) and a held-out part (30%)
set.seed(827)
inTrain <- createDataPartition(trainy$target, p = 0.7, list = FALSE)
X_train <- trainx[inTrain,]
X_test <- trainx[-inTrain,]
y_train <- trainy[inTrain,]
y_test <- trainy[-inTrain,]
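For a numeric outcome, `createDataPartition()` samples within percentile groups of the outcome so that both parts have a similar target distribution. For comparison, a plain random 70/30 split can be sketched in base R as follows. This is an illustration on synthetic data only, it is not equivalent to the stratified split used above, and `toy`, `train_idx`, `toy_train` and `toy_valid` are hypothetical names.

```r
set.seed(827)

# Synthetic data frame standing in for the labelled data (not the report's data).
toy <- data.frame(x = rnorm(100), target = rnorm(100))

# Draw 70% of the row indices at random, without replacement.
n_train <- round(0.7 * nrow(toy))
train_idx <- sample(seq_len(nrow(toy)), size = n_train)

# Rows that were drawn form the training part; the rest are held out.
toy_train <- toy[train_idx, ]
toy_valid <- toy[-train_idx, ]

print(c(train = nrow(toy_train), held_out = nrow(toy_valid)))
```

Fixing the seed before sampling makes the split reproducible, which is also why `set.seed(827)` precedes the call to `createDataPartition()`.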
Goodness-of-fit reference: the residual standard error (RSE), R-squared (R2, reported as Multiple R-squared), adjusted R2 (reported as Adjusted R-squared) and the F-statistic.
Prediction evaluation reference: MSE (mean squared error) is the average of the squared differences between the predicted and true values. RMSE (root mean squared error) is the square root of the MSE and expresses the typical deviation between predicted and true values in the units of the target. For both, the smaller the value, the more accurate the model. R2 (coefficient of determination) is the proportion of the variation in the true values that is explained by the model's predictions. The closer it is to 1, the more accurate the model.
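The three prediction metrics can be computed directly from their definitions. The following is a minimal sketch with hypothetical helper names (`mse`, `rmse`, `r2`) and a small made-up example, not output from the models in this report.

```r
# Mean squared error: average squared difference between truth and prediction.
mse <- function(actual, predicted) mean((actual - predicted)^2)

# Root mean squared error: square root of the MSE, in the units of the target.
rmse <- function(actual, predicted) sqrt(mse(actual, predicted))

# Coefficient of determination: 1 minus residual sum of squares over total sum of squares.
r2 <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Made-up values for illustration only.
actual    <- c(1, 2, 3, 4)
predicted <- c(1.5, 2, 2.5, 4)

print(c(MSE = mse(actual, predicted),
        RMSE = rmse(actual, predicted),
        R2 = r2(actual, predicted)))
```

In this example the squared errors are 0.25, 0, 0.25 and 0, so the MSE is 0.125. The total sum of squares around the mean of 2.5 is 5 and the residual sum of squares is 0.5, which gives an R2 of 0.9.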
# Multiple linear regression
x_vars <- names(X_train)
x_vars<-paste(x_vars, collapse = "+")
x_vars
## [1] "V0+V1+V2+V3+V4+V5+V6+V7+V12+V13+V15+V16+V17+V18+V19+V20+V22+V23+V24+V27+V28+V29+V30+V31+V35+V36+V37"
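A model formula can also be built directly from a character vector of predictor names with `reformulate()` from the stats package, so the names do not have to be typed by hand. The sketch below uses synthetic data (`toy`, `x1`, `x2` and the other object names are hypothetical) and is not the model fitted in this report. It also checks that the t value reported by `summary()` is the coefficient estimate divided by its standard error.

```r
set.seed(42)

# Synthetic data with a known linear relationship (not the report's data).
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(50, sd = 0.1)

# Build the formula y ~ x1 + x2 from the column names.
predictors <- setdiff(names(toy), "y")
toy_formula <- reformulate(predictors, response = "y")
print(toy_formula)

# Fit the multiple linear regression.
toy_fit <- lm(toy_formula, data = toy)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|).
coefs <- summary(toy_fit)$coefficients

# The t value is the estimate divided by its standard error.
t_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]
print(cbind(reported = coefs[, "t value"], manual = t_manual))
```

Building the formula from `names()` keeps it in step with the data frame whenever columns are added or dropped during feature selection.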
lr_model <- lm(y_train ~V0+V1+V2+V3+V4+V5+V6+V7+V12+V13+V15+V16+V17+V18+V19+V20+V22+V23+V24+V27+V28+V29+V30+V31+V35+V36+V37, data=cbind(X_train,y_train))
# The coefficient table shows the beta coefficient estimates and their significance levels
# Estimate: the intercept (b0) and the beta coefficient estimate for each predictor variable
# Std. Error: standard error of the coefficient estimate, a measure of its precision. The larger the standard error, the less confidence we have in the estimate.
# t value: t-statistic, which is the coefficient estimate divided by its standard error.
# Pr(>|t|): p-value corresponding to the t-statistic. The smaller the p-value, the stronger the evidence that the coefficient differs from zero.
summary(lr_model)
##
## Call:
## lm(formula = y_train ~ V0 + V1 + V2 + V3 + V4 + V5 + V6 + V7 +
## V12 + V13 + V15 + V16 + V17 + V18 + V19 + V20 + V22 + V23 +
## V24 + V27 + V28 + V29 + V30 + V31 + V35 + V36 + V37, data = cbind(X_train,
## y_train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.54258 -0.20255 -0.00513 0.17818 1.42964
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1815800 0.0223937 -8.109 8.85e-16 ***
## V0 0.3991926 0.0303538 13.151 < 2e-16 ***
## V1 0.1976929 0.0290428 6.807 1.31e-11 ***
## V2 0.2197217 0.0243662 9.017 < 2e-16 ***
## V3 0.1352732 0.0102928 13.142 < 2e-16 ***
## V4 0.0155256 0.0314455 0.494 0.621551
## V5 0.0030441 0.0216578 0.141 0.888236
## V6 0.2350767 0.0378718 6.207 6.54e-10 ***
## V7 -0.1792823 0.0277447 -6.462 1.30e-10 ***
## V12 0.0676274 0.0279165 2.422 0.015503 *
## V13 -0.0061209 0.0115662 -0.529 0.596720
## V15 0.0478988 0.0261885 1.829 0.067549 .
## V16 -0.0720262 0.0296579 -2.429 0.015246 *
## V17 0.0751529 0.0157702 4.766 2.02e-06 ***
## V18 0.0369619 0.0105296 3.510 0.000458 ***
## V19 0.0071012 0.0090155 0.788 0.430984
## V20 0.0065008 0.0115224 0.564 0.572687
## V22 0.1110113 0.0167087 6.644 3.93e-11 ***
## V23 0.0215705 0.0119886 1.799 0.072129 .
## V24 -0.0531022 0.0113056 -4.697 2.82e-06 ***
## V27 0.5497939 0.0743657 7.393 2.10e-13 ***
## V28 -0.0130093 0.0086068 -1.512 0.130816
## V29 -0.0766962 0.0283104 -2.709 0.006804 **
## V30 -0.0150293 0.0107219 -1.402 0.161150
## V31 0.0200336 0.0256401 0.781 0.434696
## V35 -0.0278864 0.0112024 -2.489 0.012879 *
## V36 0.0135898 0.0113926 1.193 0.233067
## V37 0.0005468 0.0181994 0.030 0.976033
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3369 on 1996 degrees of freedom
## Multiple R-squared: 0.8867, Adjusted R-squared: 0.8851
## F-statistic: 578.3 on 27 and 1996 DF, p-value: < 2.2e-16
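In the table above several predictors (for example V4, V5, V13 and V37) have p-values well above 0.05. The coefficient table can also be read programmatically rather than by eye. The sketch below is a minimal illustration on R's built-in `mtcars` data, which is an assumption made only to keep the example self-contained. The same calls should apply to `lr_model`.

```r
# Fit a small linear model on a built-in dataset (illustration only)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# coef(summary(fit)) returns the coefficient table as a numeric matrix
coef_table <- coef(summary(fit))
print(colnames(coef_table))

# Pick the predictors whose p-value is below 0.05, excluding the intercept
p_values <- coef_table[, "Pr(>|t|)"]
significant <- names(p_values)[p_values < 0.05 &
                               names(p_values) != "(Intercept)"]
print(significant)
```

A list obtained this way is one possible starting point for a reduced model, although dropping variables on p-values alone ignores correlation between predictors.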
# Make predictions
pred<-predict(lr_model, X_test)
# Evaluation
mse <- mean((y_test - pred)^2)
paste("mse:",mse)
## [1] "mse: 0.143400237516701"
paste("rmse:",RMSE(pred, y_test))
## [1] "rmse: 0.378682238184868"
paste("r2:",R2(pred, y_test))
## [1] "r2: 0.845392042438334"
# XGBoost regression
# Set training parameters
params <- list(booster = "gbtree",
               objective = "reg:squarederror",
               eta = 0.3,
               max_depth = 6,
               subsample = 1,
               colsample_bytree = 1)
# Convert training data into the required format for xgboost
train_matrix <- xgb.DMatrix(data = as.matrix(X_train), label = y_train)
# Train the model
xgboost_model <- xgb.train(params = params, data = train_matrix, nrounds = 100)
# Predictions for the test set
test_matrix <- xgb.DMatrix(data = as.matrix(X_test))
pred <- predict(xgboost_model, test_matrix)
# Use the xgb.plot.importance() function to visualize the importance of each feature to the model prediction results.
importance_matrix <- xgb.importance(colnames(X_train), model = xgboost_model)
xgb.plot.importance(importance_matrix)
# xgb.plot.importance(xgboost_model)
# Plot the scatter plot of predicted and true values
df_pred <- data.frame(pred, y_test)
colnames(df_pred) <- c("pred","y_test")
ggplot(df_pred, aes(x=pred, y=y_test)) +
geom_point() +
geom_abline(intercept = 0, slope = 1)
# Evaluate the XGBoost predictions computed above.
# The original run called predict(lr_model, X_test) again at this point, so the
# values it printed (mse 0.1434, rmse 0.3787, r2 0.8454) were the linear
# regression results, not the XGBoost results.
mse <- mean((y_test - pred)^2)
paste("mse:",mse)
paste("rmse:",RMSE(pred, y_test))
paste("r2:",R2(pred, y_test))
# Random forest regression
rf<-randomForest(y_train ~V0+V1+V2+V3+V4+V5+V6+V7+V12+V13+V15+V16+V17+V18+V19+V20+V22+V23+V24+V27+V28+V29+V30+V31+V35+V36+V37, data=cbind(X_train,y_train),importance=TRUE, ntree=1000)
rf
##
## Call:
## randomForest(formula = y_train ~ V0 + V1 + V2 + V3 + V4 + V5 + V6 + V7 + V12 + V13 + V15 + V16 + V17 + V18 + V19 + V20 + V22 + V23 + V24 + V27 + V28 + V29 + V30 + V31 + V35 + V36 + V37, data = cbind(X_train, y_train), importance = TRUE, ntree = 1000)
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 9
##
## Mean of squared residuals: 0.1181603
## % Var explained: 88.03
importance(rf)
## %IncMSE IncNodePurity
## V0 49.785910 591.580126
## V1 36.648070 509.231388
## V2 37.138166 84.043155
## V3 47.063295 39.019175
## V4 24.141669 49.501757
## V5 16.774140 10.169873
## V6 25.315566 16.785337
## V7 20.894548 11.370043
## V12 19.708106 38.608386
## V13 10.014369 8.926566
## V15 17.350189 12.081380
## V16 30.995485 36.485409
## V17 11.082196 5.721145
## V18 12.110626 10.514513
## V19 11.704465 9.287433
## V20 13.836438 13.020525
## V22 14.318238 5.864386
## V23 13.936151 10.051831
## V24 17.560764 9.729105
## V27 24.697703 255.113118
## V28 6.870147 7.943724
## V29 18.550579 14.902115
## V30 4.144389 8.677826
## V31 24.651701 148.581726
## V35 7.760467 4.566536
## V36 20.054828 14.627364
## V37 19.927453 57.444320
varImpPlot(rf)
# Use the test set to check the prediction accuracy.
# The original run stored the forest's predictions in `pres` but evaluated the
# stale `pred` vector, so the values it printed (mse 0.1434, rmse 0.3787,
# r2 0.8454) were the linear regression results, not the random forest results.
pred <- predict(rf, X_test)
mse <- mean((y_test - pred)^2)
paste("mse:",mse)
paste("rmse:",RMSE(pred, y_test))
paste("r2:",R2(pred, y_test))
# %IncMSE and IncNodePurity are importance measures reported by the random forest. They are suited to a qualitative exploration of the variables, not to a quantitative decision on which variables to keep or drop.
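One way to see why the two importance measures are better read qualitatively is to compare the rankings they produce. The sketch below copies five rows from the `importance(rf)` output above into a small matrix (the choice of rows is arbitrary) and ranks them under each measure.

```r
# Five rows copied from the importance(rf) output above
imp <- matrix(c(49.785910, 591.580126,
                36.648070, 509.231388,
                47.063295,  39.019175,
                24.697703, 255.113118,
                 4.144389,   8.677826),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("V0", "V1", "V3", "V27", "V30"),
                              c("%IncMSE", "IncNodePurity")))

# Rank 1 = most important under each measure
rank_mse    <- rank(-imp[, "%IncMSE"])
rank_purity <- rank(-imp[, "IncNodePurity"])
print(cbind(imp, rank_mse, rank_purity))
```

Among these five variables, V3 is second by %IncMSE but fourth by IncNodePurity, so the two measures agree on the extremes (V0 and V30) while disagreeing in between.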
# Support vector regression model
svm_model <- svm(y_train ~V0+V1+V2+V3+V4+V5+V6+V7+V12+V13+V15+V16+V17+V18+V19+V20+V22+V23+V24+V27+V28+V29+V30+V31+V35+V36+V37, data=cbind(X_train,y_train),kernel="radial")
svm_model
##
## Call:
## svm(formula = y_train ~ V0 + V1 + V2 + V3 + V4 + V5 + V6 + V7 + V12 +
## V13 + V15 + V16 + V17 + V18 + V19 + V20 + V22 + V23 + V24 + V27 +
## V28 + V29 + V30 + V31 + V35 + V36 + V37, data = cbind(X_train,
## y_train), kernel = "radial")
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.03703704
## epsilon: 0.1
##
##
## Number of Support Vectors: 1448
# Predictions for the test set and evaluation
pred <- predict(svm_model, X_test)
mse <- mean((y_test - pred)^2)
paste("mse:",mse)
## [1] "mse: 0.142456887135041"
paste("rmse:",RMSE(pred, y_test))
## [1] "rmse: 0.377434613059058"
paste("r2:",R2(pred, y_test))
## [1] "r2: 0.846482244254957"
# Decision tree regression model
model_dt <- rpart(y_train ~V0+V1+V2+V3+V4+V5+V6+V7+V12+V13+V15+V16+V17+V18+V19+V20+V22+V23+V24+V27+V28+V29+V30+V31+V35+V36+V37, data=cbind(X_train,y_train))
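# Prediction and evaluation on the test set.
# The original run called predict(rf, X_test) into `pres` and evaluated the
# stale `pred` vector, so the values it printed (mse 0.1425, rmse 0.3774,
# r2 0.8465) were the SVM results, not the decision tree results.
pred <- predict(model_dt, X_test)
mse <- mean((y_test - pred)^2)
paste("mse:",mse)
paste("rmse:",RMSE(pred, y_test))
paste("r2:",R2(pred, y_test))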
rpart.plot(model_dt)
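Because every model in this section is scored with the same three lines, one option is to wrap prediction and scoring in a single function so that each model is always evaluated on its own predictions. The sketch below is a minimal illustration under stated assumptions: `evaluate_model` is a hypothetical helper, and two `lm()` fits on the built-in `mtcars` data stand in for the boiler models.

```r
# Hypothetical helper: predict with the given model, then score that prediction
evaluate_model <- function(model, X_test, y_test) {
  pred <- as.numeric(predict(model, X_test))
  mse  <- mean((y_test - pred)^2)
  c(mse = mse, rmse = sqrt(mse), r2 = cor(pred, y_test)^2)
}

# Toy train/test split on a built-in dataset (illustration only)
set.seed(827)
idx   <- sample(nrow(mtcars), size = 22)
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

# Two stand-in models; any object with a predict() method could be used
models <- list(
  linear    = lm(mpg ~ wt + hp, data = train),
  quadratic = lm(mpg ~ wt + I(wt^2) + hp, data = train)
)

# One row of metrics per model
results <- t(sapply(models, evaluate_model,
                    X_test = test[, c("wt", "hp")], y_test = test$mpg))
print(results)
```

The same pattern would extend to the fitted `rf`, `svm_model` and `model_dt` objects, since each provides a `predict()` method that accepts new data.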
# K-means clustering
install.packages("cluster", repos = "http://cran.us.r-project.org")
library(cluster)
result <- kmeans(data_train1, 3)
print(result)
## K-means clustering with 3 clusters of sizes 1244, 516, 1128
##
## Cluster means:
## V0 V1 V2 V3 V4 V5
## 1 0.03878296 0.09255145 0.5597982 -0.1234277 -0.3908521 -0.4309116
## 2 -1.02471318 -1.29671705 -0.9235678 -0.8406919 -0.4681919 -0.3858740
## 3 0.74101773 0.63466046 0.5468821 0.3471312 0.6782996 -0.7783422
## V6 V7 V12 V13 V15 V16 V17
## 1 0.6710571 0.5349277 -0.3932347 0.1951158 -0.6795740 0.5006061 0.03433039
## 2 -0.9607287 -0.9029341 -0.5006841 -0.1642519 0.4264845 -1.2715969 -0.19939535
## 3 0.1676720 0.1204956 0.7220488 0.3611011 0.8005275 0.3202057 -0.05791223
## V18 V19 V20 V22 V23 V24
## 1 0.07826447 0.2124301 -0.13592605 0.2784405 0.1487846 0.66117283
## 2 -0.31330620 -0.1392868 -0.77540504 0.4618430 -0.2886802 0.08840116
## 3 0.19791046 -0.4646950 0.02781826 0.2571489 0.3673183 -0.82545213
## V27 V28 V29 V30 V31 V35
## 1 0.3155611 0.2922822 -0.7148730 0.24424116 0.2273392 0.13748794
## 2 -0.1402132 -0.1121802 0.5301609 -0.43078101 -1.1898043 -0.06321124
## 3 0.4135665 0.0815594 0.7958750 0.06973759 0.6207340 0.38362057
## V36 V37 target
## 1 0.1908023 -0.4861584 0.1540595
## 2 -0.7337636 1.0278895 -1.3382229
## 3 0.2037270 -0.2677323 0.7657624
##
## Clustering vector:
## [1] 2 3 3 1 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 3 3 1 1 1 2 2 3 3 3 3 3 2 2 2 3 3 3
## [38] 3 3 2 1 2 1 1 3 3 3 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 2 1
## [75] 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 3 3 3 3 3 3 2 2 3 3 3
## [112] 3 1 2 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3 3 1 2 2 2 2 2 3 3 3 1 3 1 1 1
## [149] 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 3 1 1 1 1 1 1 1 1 3 3 3 3
## [186] 1 3 1 1 1 3 3 3 3 3 3 3 3 1 2 2 3 3 3 3 3 3 3 3 3 1 1 3 1 1 1 1 1 1 1 1 3
## [223] 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 3 3 3 3 1 1 1 1 3 3 3 3 1 1 1 3 3 3 3 3
## [260] 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3
## [297] 3 3 3 3 1 1 1 1 1 3 3 3 1 1 3 2 3 1 1 1 1 1 1 2 2 2 2 1 2 3 3 3 3 3 2 2 2
## [334] 3 3 3 3 3 3 3 1 1 1 2 2 2 3 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 2 2
## [371] 3 3 3 3 1 1 2 2 2 2 2 1 2 1 2 2 2 3 3 2 2 2 3 3 2 2 3 3 2 3 3 2 2 2 3 3 3
## [408] 3 3 2 2 2 2 2 2 2 2 2 3 3 2 3 3 2 3 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 3 3 3 3
## [445] 1 1 1 1 1 3 3 3 2 3 2 2 1 1 2 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [482] 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 1 1 3 3 3 3 3 3 3 3 3 3 3 3
## [519] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 3 3 3
## [556] 3 1 1 1 1 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 1 1
## [593] 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [630] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 3 3 3 3
## [667] 3 3 3 3 3 3 3 3 1 1 1 1 3 3 3 3 3 3 3 3 3 2 1 1 1 1 1 2 1 1 3 1 3 3 3 1 1
## [704] 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1
## [741] 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3
## [778] 3 3 3 3 3 1 1 1 1 1 1 3 1 3 3 3 3 2 2 2 2 3 2 3 1 1 3 3 3 3 3 3 3 3 3 3 3
## [815] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 1
## [852] 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 3 1 3 3 1 1 2 2 1 1
## [889] 3 3 1 3 3 1 2 1 1 1 1 2 3 3 3 2 2 2 3 2 3 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 3
## [926] 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 1 3 3 2 2 3 3 3 3 3
## [963] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3
## [1000] 3 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 3 2 2 2 3 3 2 2 2
## [1037] 2 2 2 2 2 2 3 3 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 3 2 2 2
## [1074] 2 3 2 2 2 2 2 2 3 3 2 2 2 3 3 3 2 3 3 3 3 3 3 3 3 3 2 3 1 1 1 3 3 3 3 3 3
## [1111] 3 3 3 2 2 2 3 2 2 2 3 3 3 3 1 1 2 2 2 3 1 2 2 1 2 1 2 1 2 2 2 2 2 2 3 1 2
## [1148] 2 2 2 3 3 2 3 3 2 3 3 3 3 3 3 3 2 3 3 3 3 3 2 3 3 3 3 3 3 1 3 2 2 3 3 3 3
## [1185] 1 1 1 1 3 3 3 3 1 1 3 3 3 3 3 3 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [1222] 3 3 3 3 3 3 3 3 3 3 3 2 2 3 3 3 1 1 1 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [1259] 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [1296] 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 2 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3
## [1333] 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 3 3 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1
## [1370] 1 1 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [1407] 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
## [1444] 2 1 1 1 3 3 3 3 3 3 3 2 2 3 3 3 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 2 2 1 1
## [1481] 1 1 1 1 1 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 2 2 3 3 3 3 2 2 2 2 2 2 2 2
## [1518] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [1555] 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 1 2 2 1 1 1 1 1 1 1 1 1 1 2 3 2 2 2 2 2 3 3
## [1592] 3 3 3 1 1 1 1 1 1 1 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
## [1629] 2 3 3 3 1 1 1 1 1 1 1 1 1 1 1 3 3 3 1 3 3 3 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1
## [1666] 1 3 3 3 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 1 1 2 2 2 2 1
## [1703] 2 1 2 1 1 1 1 1 1 1 1 1 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3
## [1740] 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [1777] 3 3 3 3 3 3 3 3 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 2 2 2 1 2 2 2
## [1814] 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 2 2 3 2 2 2 2 2 1 1 1 1 1 1
## [1851] 1 1 1 1 1 1 1 1 3 3 3 3 1 2 2 2 3 3 3 3 2 2 2 3 3 3 3 3 2 2 2 2 2 2 2 2 3
## [1888] 3 3 3 3 3 1 1 2 2 1 1 1 3 3 3 3 3 3 2 2 3 3 3 2 3 3 3 3 2 2 2 2 3 3 3 3 3
## [1925] 2 2 2 2 2 3 3 3 3 2 2 2 2 1 2 1 2 3 2 3 3 3 3 3 2 2 3 3 2 3 1 1 1 1 1 1 1
## [1962] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [1999] 1 2 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2036] 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2073] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 3 3 3 3 3 1 1 1 1 1 2
## [2110] 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 1 3 1 1 1 1 1 1 1
## [2147] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1
## [2184] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 2 2 2 1 2 2 2 1 1 1 1 1 1 1 1 1 3 3 2 3
## [2221] 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2258] 1 1 1 1 1 1 2 2 2 3 3 3 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 1 1 1 1 1 1 1 1 1
## [2295] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 3 3 1 1 3 3 3 3 1 1 2 1 1
## [2332] 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2369] 1 1 1 1 1 1 1 1 1 1 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2406] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3
## [2443] 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 1 1 1 1 1 1
## [2480] 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2517] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3
## [2554] 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2
## [2591] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [2628] 3 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 1 1 2
## [2665] 2 1 1 1 2 1 1 1 3 3 2 2 2 2 2 2 3 3 3 3 3 1 1 1 1 1 1 1 1 1 3 3 1 1 1 1 1
## [2702] 1 1 3 1 1 1 1 1 2 1 1 1 1 1 3 3 3 3 3 1 1 1 3 1 3 3 3 3 3 1 3 3 3 3 1 3 3
## [2739] 3 1 1 1 1 1 1 1 1 1 1 1 3 3 3 1 1 1 1 1 1 1 1 1 3 3 3 3 2 2 2 2 3 3 3 3 3
## [2776] 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2813] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 2 3 3 3 3 2 2 2 3 3 3 1 1 1
## [2850] 1 1 1 1 1 1 3 3 3 2 3 3 3 3 2 3 2 2 3 3 3 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 1
## [2887] 3 3
##
## Within cluster sum of squares by cluster:
## [1] 16771.51 16823.36 15288.10
## (between_SS / total_SS = 27.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
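The output above reports between_SS / total_SS = 27.0 % for k = 3, which suggests that the three clusters are not strongly separated. A common way to examine the choice of k is the elbow method: run k-means for several values of k and plot the total within-cluster sum of squares. The sketch below uses simulated two-dimensional data (an assumption for illustration, not the boiler data) and only base R.

```r
# Simulated data with three well-separated groups (illustration only)
set.seed(1)
toy <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
             matrix(rnorm(100, mean = 5),  ncol = 2),
             matrix(rnorm(100, mean = 10), ncol = 2))

# Total within-cluster sum of squares for k = 1..6;
# nstart = 10 repeats each run from 10 random starts and keeps the best
ks  <- 1:6
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)

# The "elbow" is the k after which the curve flattens
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")

# The ratio printed by print(kmeans_result) can be recomputed from the components
fit3  <- kmeans(toy, centers = 3, nstart = 10)
ratio <- fit3$betweenss / fit3$totss
print(ratio)
```

Calling `set.seed()` before `kmeans()` also makes the cluster labels reproducible between runs, since the starting centers are chosen at random.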
# Hierarchical clustering
library(stats)
library(factoextra)
hc_model <- hclust(dist(data_train1))
print(hc_model)
##
## Call:
## hclust(d = dist(data_train1))
##
## Cluster method : complete
## Distance : euclidean
## Number of objects: 2888
plot(hc_model)
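A dendrogram of 2888 objects is hard to read directly, so a typical next step is to cut the tree into a fixed number of groups with `cutree()`. The sketch below uses simulated data (an assumption for illustration). With the real model, the analogous call would be `cutree(hc_model, k = 3)`, where k = 3 is chosen here only to mirror the k-means run.

```r
# Simulated data with two groups of 20 points each (illustration only)
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 6), ncol = 2))

# Complete-linkage clustering on Euclidean distances, as in the output above
hc <- hclust(dist(toy), method = "complete")

# Cut the dendrogram into k groups and count the members of each
groups <- cutree(hc, k = 2)
print(table(groups))
```

The resulting vector has one label per row, so it can be compared with the k-means assignment using `table()`.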