In this project, we will use the “Wave Energy Converters” data set from the UCI Machine Learning Repository.
We will first fit a plain (baseline) multiple regression model. At each step we will then fit a new model after evaluating the selected sub-step. In the conclusion we will compare the baseline model with the other evaluated models to see whether the adjustments made in the sub-steps increase the efficiency of the machine learning model or not.
Keep in mind that every case is unique, so these results are specific to this data set. Still, the project gives insight into why evaluating sub-steps matters in machine learning workflows. The experience gained here builds intuition for upcoming projects: machine learning has tricky sub-steps, and evaluating them always leaves a door open to increasing model efficiency. Again, each case is special and unique, but the methods are very similar.
The main aims of the project are listed below:
Exploratory data analysis.
Data set structure.
Outlier control. (Boxplot/IQR method; a Z-score rule could be used as well.)
Missing Values Control.
Taking Action: Data Imputation Method.
Correlation between independent variables. Conduct PCA in case of high correlation (> 0.9).
Assumption checks for multiple linear regression (MLR).
Multicollinearity.
Multivariate Normality.
Homoscedasticity.
Model evaluation and model performance metrics.
Leave one out cross validation, LOOCV.
K-fold cross-validation.
Repeated K-fold cross-validation.
Conducting a multiple-regression-based algorithm.
Calculation of efficiency.
Each case’s model efficiency will be collected in a matrix with the chosen metrics.
A summary matrix of the model metrics will make comparison and drawing conclusions easier.
Each model will be named “model1”, “model2”, “model3”, etc.
Model evaluation.
Residual analysis.
Conclusions. (Experiences)
- There were no meaningful outliers or missing values, so none of the imputation methods were used.
- Data scaling did not increase model efficiency (adjusted R-squared), which is expected for OLS, since R-squared is invariant to linear rescaling of the variables.
- The Shapiro-Wilk test cannot be run on more than 5000 rows, so we applied the Anderson-Darling test for normality instead.
- The weak correlation between predicted and target values reflects the model’s low efficiency and the need to inspect the algorithm’s outputs.
- Different train-test splits and cross-validation methods can change the efficiency of the model for both train and test runs. There is no single “best” method; each may show its strength in a different scenario.
- Residual analysis plays a strong role: it shows how predicted values relate to target values even when the model appears to do fine on the training set. One must also be wary of overfitting.
# setwd() below sets the working directory to the folder containing this Rmd file. Pretty useful!
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
Model Comparison
library(ggplot2)
library("PerformanceAnalytics")
library(tidyverse) # data manipulation
library(cluster) # clustering algorithms
library(factoextra) # clustering algorithms & visualization
library(fpc)
library(colorspace)
library(MASS)
library(naniar)
library(ltm)
library(pROC)
library(caret)
library(class)
library(leaps)
library(sp)
library(FSelectorRcpp)
library(FSelector)
library(e1071)
library(readxl)
library(caTools)
library(nortest)
library(car)
library(mgcv)
library(auditor)
library(DALEX)
Story of the data (while analyzing data, remember that the story behind it is one of the most important aspects):
This data set consists of the positions and absorbed power outputs of wave energy converters (WECs) in four real wave scenarios from the southern coast of Australia (Sydney, Adelaide, Perth and Tasmania). The applied converter model is a fully submerged three-tether converter called CETO. Sixteen WEC locations are placed and optimized in a size-constrained environment. In terms of optimization, the problem is categorized as an expensive optimization problem because each farm evaluation takes several minutes.
#Attribute Information:
#Attribute: Attribute Range
#1. WECs position {X1, X2, …, X16; Y1, Y2,…, Y16} continuous from 0 to 566 (m).
#2. WECs absorbed power: {P1, P2, …, P16}
#3. Total power output of the farm: Powerall
# Since the file has no column names, we will use the information above and assume
# the column order in the data set matches the order in the description.
Xoriented <- sprintf("X%s", 1:16)
Yoriented <- sprintf("Y%s", 1:16)
Poriented <- sprintf("P%s", 1:16)
P <- "Sumpowerall"
#We will treat "Powerall" as our dependent variable.
#Powerall shows a very high correlation with the sum of the "P" columns.
#In order to exercise regression properly we will remove the "P" columns:
#if we kept them, these variables would dominate all the others.
# The column-name vector length matches the number of data frame columns.
colnamesofDF <- c(as.character(Xoriented),as.character(Yoriented),as.character(Poriented),P)
df1 <- read.csv("Tasmania_Data.csv",col.names = colnamesofDF)
#This repository has data for several areas, and their characteristics may differ.
#Considering this, we will focus on a single area, "Tasmania" (the file read above).
## You can use the link if you want to download data from UCI Repository directly.
#temp <- tempfile()
#download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/00494/WECs_DataSet.zip",temp)
#https://archive.ics.uci.edu/ml/datasets/Wave+Energy+Converters#
#Data set contains 71999 rows and 49 columns.
#All variables are numeric.
dim(df1)
## [1] 71999 49
#To understand the data set, let's sum the columns starting with P and compare the result with the "Powerall" column.
#The sum of the P columns matches Powerall exactly in only about 9% of the rows,
#but the correlation below shows they are essentially identical.
df1 <- df1 %>% mutate(
SumP = rowSums(dplyr::select(., starts_with("P")))
)
prop.table(table(df1$SumP == df1$Sumpowerall))*100
##
## FALSE TRUE
## 90.79015 9.20985
#Since both are numeric, let's check the Pearson correlation.
#The correlation is very high, so we need to remove the P columns to obtain a more meaningful regression;
#otherwise the P columns would dominate the other variables.
#It also makes no sense to regress a dependent variable on columns whose sum it (almost) is.
cor(df1$SumP,df1$Sumpowerall,method = "pearson")
## [1] 0.9999992
#The sum of the P columns and the Powerall column are nearly identical.
#We will remove the P columns to obtain a suitable regression data set.
library(tidyverse)
df1 <- df1 %>% dplyr::select(-starts_with("P"))
df1$SumP <- NULL
colnames(df1)
## [1] "X1" "X2" "X3" "X4" "X5"
## [6] "X6" "X7" "X8" "X9" "X10"
## [11] "X11" "X12" "X13" "X14" "X15"
## [16] "X16" "Y1" "Y2" "Y3" "Y4"
## [21] "Y5" "Y6" "Y7" "Y8" "Y9"
## [26] "Y10" "Y11" "Y12" "Y13" "Y14"
## [31] "Y15" "Y16" "Sumpowerall"
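Per the project aims, we could also check the pairwise correlations among the remaining predictors and, if any exceeded 0.9, consider PCA. A minimal sketch using the cleaned df1 (not part of the original run):
# Correlation matrix of the 32 position predictors (target column excluded).
cormat <- cor(df1[, setdiff(colnames(df1), "Sumpowerall")])
# Largest absolute off-diagonal correlation; values above 0.9 would point to PCA.
max(abs(cormat[upper.tri(cormat)]))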
#Outlier Control
#Since the data is grouped into X, Y and power columns,
#we can check boxplots within each group,
#which keeps the visuals readable.
#No outliers in "X" class.
boxplot(x = as.list(as.data.frame(df1[1:16])))
#Outlier Control
#No Outlier in "Y" class.
boxplot(x = as.list(as.data.frame(df1[17:32])))
#Outlier Control
#There are some outliers in "Power" class.
boxplot(x = as.list(as.data.frame(df1[33])))
Outlier conclusions:
The data set contains numerous records of the X-Y positions and the dependent power variable.
We will treat this data set as a kind of “sensor” data.
The outliers are fairly close to the boxplot fences.
Thus we will not take any action against these outliers.
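For reference, the IQR rule mentioned in the project aims (with the Z-score rule as an alternative) can also be applied numerically; a minimal sketch on the power column (not part of the original run):
# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q <- quantile(df1$Sumpowerall, c(0.25, 0.75))
iqr <- q[2] - q[1]
iqr_outliers <- df1$Sumpowerall < q[1] - 1.5 * iqr | df1$Sumpowerall > q[2] + 1.5 * iqr
sum(iqr_outliers)
# Z-score rule: flag values more than 3 standard deviations from the mean.
z <- scale(df1$Sumpowerall)
sum(abs(z) > 3)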
missing <- miss_var_summary(df1)
mean(missing$pct_miss)
## [1] 0
vis_miss(df1, warn_large_data = FALSE)
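There are no missing values, so no imputation is required. For completeness, a minimal median-imputation sketch of the kind we would have used (purely hypothetical here, since df1 is complete):
# Replace NAs in every column with that column's median (a no-op on this data).
df1_imputed <- as.data.frame(lapply(df1, function(col) {
  col[is.na(col)] <- median(col, na.rm = TRUE)
  col
}))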
set.seed(123)
#Shuffle data set.
df1 <- df1[sample(nrow(df1)),]
#Split train-test.
sample = sample.split(df1$Sumpowerall, SplitRatio = .80)
train = subset(df1, sample == TRUE)
test = subset(df1, sample == FALSE)
# Build the model
model1 <- lm(Sumpowerall ~., data = train)
summary(model1)
##
## Call:
## lm(formula = Sumpowerall ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -493754 -65841 1562 67370 517458
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.804e+06 4.833e+03 787.094 < 2e-16 ***
## X1 5.652e+00 2.576e+00 2.194 0.028209 *
## X2 -1.757e+01 2.648e+00 -6.633 3.33e-11 ***
## X3 3.570e+01 2.578e+00 13.847 < 2e-16 ***
## X4 -7.652e+01 2.591e+00 -29.536 < 2e-16 ***
## X5 3.308e+01 2.464e+00 13.424 < 2e-16 ***
## X6 -9.028e+01 2.529e+00 -35.692 < 2e-16 ***
## X7 -2.141e+01 2.611e+00 -8.199 2.48e-16 ***
## X8 6.770e+00 2.670e+00 2.536 0.011210 *
## X9 -3.847e+01 2.598e+00 -14.809 < 2e-16 ***
## X10 4.951e+01 2.535e+00 19.532 < 2e-16 ***
## X11 -2.654e+01 2.616e+00 -10.143 < 2e-16 ***
## X12 -3.115e+01 2.551e+00 -12.210 < 2e-16 ***
## X13 2.900e+00 2.543e+00 1.140 0.254188
## X14 -1.073e+01 2.552e+00 -4.207 2.60e-05 ***
## X15 1.110e+01 2.511e+00 4.419 9.92e-06 ***
## X16 3.901e+01 2.549e+00 15.304 < 2e-16 ***
## Y1 -3.950e+01 2.610e+00 -15.134 < 2e-16 ***
## Y2 4.199e+01 2.646e+00 15.871 < 2e-16 ***
## Y3 -5.783e+01 2.542e+00 -22.749 < 2e-16 ***
## Y4 5.667e+01 2.501e+00 22.657 < 2e-16 ***
## Y5 -3.138e+01 2.581e+00 -12.155 < 2e-16 ***
## Y6 -2.260e+01 2.602e+00 -8.686 < 2e-16 ***
## Y7 3.574e+01 2.633e+00 13.573 < 2e-16 ***
## Y8 -1.621e+01 2.560e+00 -6.334 2.41e-10 ***
## Y9 3.042e+00 2.543e+00 1.196 0.231723
## Y10 3.890e+01 2.579e+00 15.082 < 2e-16 ***
## Y11 9.289e+00 2.504e+00 3.710 0.000208 ***
## Y12 5.961e+00 2.639e+00 2.259 0.023877 *
## Y13 -5.373e+01 2.540e+00 -21.154 < 2e-16 ***
## Y14 -2.825e+01 2.622e+00 -10.773 < 2e-16 ***
## Y15 1.404e+01 2.570e+00 5.463 4.71e-08 ***
## Y16 -4.199e+01 2.571e+00 -16.330 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 103400 on 57566 degrees of freedom
## Multiple R-squared: 0.1476, Adjusted R-squared: 0.1472
## F-statistic: 311.6 on 32 and 57566 DF, p-value: < 2.2e-16
# Make predictions
predictions <- model1 %>% predict(test)
The adjusted R-squared value is 0.1472.
Next we will test whether scaling the data changes this key indicator.
defaultSummary(data.frame(obs = train$Sumpowerall,
pred = predict(model1, train[-33]))
)
## RMSE Rsquared MAE
## 1.034064e+05 1.476286e-01 8.079594e+04
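The summary above is computed on the training set. For a like-for-like comparison with the scaled model’s test metrics below, the same metrics can be computed on the held-out test set using the predictions object created above (a sketch, not part of the original run):
# Test-set RMSE, R-squared and MAE for the unscaled model.
defaultSummary(data.frame(obs = test$Sumpowerall, pred = predictions))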
#Scaling the data.
df1scaled <- scale(df1)
df1scaled <- as.data.frame(df1scaled)
set.seed(123)
#Shuffle data set.
df1scaled <- df1scaled[sample(nrow(df1)),]
#Split train-test.
sample = sample.split(df1scaled$Sumpowerall, SplitRatio = .80)
trainscaled = subset(df1scaled, sample == TRUE)
testscaled = subset(df1scaled, sample == FALSE)
# Build the model
model1scaled <- lm(Sumpowerall ~., data = trainscaled)
summary(model1scaled)
##
## Call:
## lm(formula = Sumpowerall ~ ., data = trainscaled)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4019 -0.5892 0.0123 0.5998 4.6062
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.357e-05 3.850e-03 -0.014 0.98890
## X1 9.874e-03 4.402e-03 2.243 0.02489 *
## X2 -2.714e-02 4.211e-03 -6.445 1.17e-10 ***
## X3 5.835e-02 4.173e-03 13.981 < 2e-16 ***
## X4 -1.245e-01 4.184e-03 -29.756 < 2e-16 ***
## X5 5.850e-02 4.238e-03 13.804 < 2e-16 ***
## X6 -1.487e-01 4.134e-03 -35.971 < 2e-16 ***
## X7 -3.398e-02 4.246e-03 -8.003 1.24e-15 ***
## X8 1.177e-02 4.057e-03 2.900 0.00373 **
## X9 -6.042e-02 4.169e-03 -14.492 < 2e-16 ***
## X10 7.943e-02 4.184e-03 18.986 < 2e-16 ***
## X11 -4.234e-02 4.335e-03 -9.766 < 2e-16 ***
## X12 -5.109e-02 4.424e-03 -11.548 < 2e-16 ***
## X13 6.595e-03 4.263e-03 1.547 0.12191
## X14 -2.327e-02 4.304e-03 -5.406 6.46e-08 ***
## X15 1.626e-02 4.172e-03 3.897 9.75e-05 ***
## X16 6.708e-02 4.164e-03 16.109 < 2e-16 ***
## Y1 -6.835e-02 4.483e-03 -15.247 < 2e-16 ***
## Y2 6.719e-02 4.140e-03 16.228 < 2e-16 ***
## Y3 -9.411e-02 4.109e-03 -22.901 < 2e-16 ***
## Y4 9.231e-02 4.097e-03 22.532 < 2e-16 ***
## Y5 -5.076e-02 4.223e-03 -12.021 < 2e-16 ***
## Y6 -3.997e-02 4.137e-03 -9.662 < 2e-16 ***
## Y7 5.682e-02 4.088e-03 13.899 < 2e-16 ***
## Y8 -2.359e-02 4.106e-03 -5.746 9.20e-09 ***
## Y9 5.458e-03 4.262e-03 1.281 0.20033
## Y10 6.267e-02 4.222e-03 14.842 < 2e-16 ***
## Y11 1.184e-02 4.224e-03 2.804 0.00504 **
## Y12 1.131e-02 4.315e-03 2.622 0.00875 **
## Y13 -8.584e-02 4.134e-03 -20.766 < 2e-16 ***
## Y14 -4.478e-02 4.277e-03 -10.468 < 2e-16 ***
## Y15 2.399e-02 4.192e-03 5.723 1.05e-08 ***
## Y16 -6.552e-02 4.050e-03 -16.176 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.924 on 57566 degrees of freedom
## Multiple R-squared: 0.147, Adjusted R-squared: 0.1465
## F-statistic: 310 on 32 and 57566 DF, p-value: < 2.2e-16
defaultSummary(data.frame(obs = testscaled$Sumpowerall,
pred = predict(model1scaled, testscaled[-33]))
)
## RMSE Rsquared MAE
## 0.9241219 0.1450212 0.7199197
Checking the assumptions for multiple linear regression.
# All variables have VIF values below 5,
# so all of them are fine to keep in terms of the multicollinearity check.
VIF <-car::vif(model1)
VIF
## X1 X2 X3 X4 X5 X6 X7 X8
## 1.309168 1.198578 1.169546 1.185964 1.212113 1.152853 1.215584 1.107723
## X9 X10 X11 X12 X13 X14 X15 X16
## 1.174065 1.187159 1.264464 1.318311 1.225675 1.241495 1.172852 1.171252
## Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8
## 1.355357 1.152917 1.138277 1.137597 1.205421 1.154004 1.128049 1.139387
## Y9 Y10 Y11 Y12 Y13 Y14 Y15 Y16
## 1.222920 1.198602 1.205828 1.254164 1.153791 1.231831 1.180657 1.105763
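Had the VIF values (or the correlation check earlier) indicated strong multicollinearity, the project plan calls for PCA on the predictors; a minimal sketch of how that could look (hypothetical here, since no multicollinearity was found):
# PCA on the standardized predictors; keep enough components to explain ~95% of the variance.
pca <- prcomp(df1[, setdiff(colnames(df1), "Sumpowerall")], center = TRUE, scale. = TRUE)
summary(pca)  # inspect the cumulative proportion of variance explained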
# The Shapiro-Wilk test cannot be run on samples larger than 5000 rows,
# so the Anderson-Darling test was applied to every variable instead.
# All p-values are < 0.05, so none of the variables are normally distributed.
lshap <- lapply(df1, ad.test)
as.data.frame(data.table::rbindlist(lshap))
## statistic p.value method data.name
## 1 1647.06944 3.7e-24 Anderson-Darling normality test X[[i]]
## 2 1029.11161 3.7e-24 Anderson-Darling normality test X[[i]]
## 3 1150.58217 3.7e-24 Anderson-Darling normality test X[[i]]
## 4 1183.07769 3.7e-24 Anderson-Darling normality test X[[i]]
## 5 1792.13692 3.7e-24 Anderson-Darling normality test X[[i]]
## 6 1412.43445 3.7e-24 Anderson-Darling normality test X[[i]]
## 7 1238.14292 3.7e-24 Anderson-Darling normality test X[[i]]
## 8 595.29865 3.7e-24 Anderson-Darling normality test X[[i]]
## 9 1061.89334 3.7e-24 Anderson-Darling normality test X[[i]]
## 10 1318.57871 3.7e-24 Anderson-Darling normality test X[[i]]
## 11 1416.37399 3.7e-24 Anderson-Darling normality test X[[i]]
## 12 1872.09254 3.7e-24 Anderson-Darling normality test X[[i]]
## 13 1464.89313 3.7e-24 Anderson-Darling normality test X[[i]]
## 14 1507.24460 3.7e-24 Anderson-Darling normality test X[[i]]
## 15 1614.36058 3.7e-24 Anderson-Darling normality test X[[i]]
## 16 1367.75962 3.7e-24 Anderson-Darling normality test X[[i]]
## 17 1780.99398 3.7e-24 Anderson-Darling normality test X[[i]]
## 18 1050.22535 3.7e-24 Anderson-Darling normality test X[[i]]
## 19 1551.70754 3.7e-24 Anderson-Darling normality test X[[i]]
## 20 1366.20301 3.7e-24 Anderson-Darling normality test X[[i]]
## 21 1251.88105 3.7e-24 Anderson-Darling normality test X[[i]]
## 22 965.35591 3.7e-24 Anderson-Darling normality test X[[i]]
## 23 923.81990 3.7e-24 Anderson-Darling normality test X[[i]]
## 24 1226.07561 3.7e-24 Anderson-Darling normality test X[[i]]
## 25 1631.84361 3.7e-24 Anderson-Darling normality test X[[i]]
## 26 1415.42358 3.7e-24 Anderson-Darling normality test X[[i]]
## 27 1339.33633 3.7e-24 Anderson-Darling normality test X[[i]]
## 28 1151.57875 3.7e-24 Anderson-Darling normality test X[[i]]
## 29 1364.44486 3.7e-24 Anderson-Darling normality test X[[i]]
## 30 1341.77638 3.7e-24 Anderson-Darling normality test X[[i]]
## 31 1478.25834 3.7e-24 Anderson-Darling normality test X[[i]]
## 32 969.71070 3.7e-24 Anderson-Darling normality test X[[i]]
## 33 55.37006 3.7e-24 Anderson-Darling normality test X[[i]]
par(mfrow=c(2,2)) # init 4 charts in 1 panel
plot(model1)
The plots we are interested in are the top-left and bottom-left ones. The top-left is the chart of residuals versus fitted values, while the bottom-left (scale-location) plot puts the square root of the standardized residuals on the Y axis. If there were no heteroscedasticity at all, you would see a completely random, even spread of points across the range of the X axis and a flat red line.
In our case, as you can see in the top-left plot, the red line is slightly curved and the residuals seem to spread out as the fitted values increase. The inference is that heteroscedasticity exists.
car::ncvTest(model1)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 110.0427, Df = 1, p = < 2.22e-16
lmtest::bptest(model1)
##
## studentized Breusch-Pagan test
##
## data: model1
## BP = 986.3, df = 32, p-value < 2.2e-16
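Both tests reject constant variance. A common follow-up (not applied in this project) is to report heteroscedasticity-consistent standard errors; a minimal sketch assuming the sandwich package is installed:
# Coefficient tests for model1 with HC1 robust standard errors.
lmtest::coeftest(model1, vcov = sandwich::vcovHC(model1, type = "HC1"))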
# Define training control (LOOCV is left commented out: with ~58k training rows it would require fitting ~58k models)
#train.control <- trainControl(method = "LOOCV")
# Train the model
#model <- train(Sumpowerall ~., data = train, method = "lm",
# trControl = train.control)
# Summarize the results
#print(model)
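To still exercise LOOCV from the project aims, it can be run on a small subsample; a sketch (the subsample size of 1,000 is an arbitrary choice, not from the original analysis):
set.seed(123)
# Hypothetical subsample to keep LOOCV tractable (1,000 of ~58k rows).
train_small <- train[sample(nrow(train), 1000), ]
loocv.control <- trainControl(method = "LOOCV")
modelLOOCV <- train(Sumpowerall ~ ., data = train_small, method = "lm",
                    trControl = loocv.control)
print(modelLOOCV)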
# Define training control
set.seed(123)
train.control <- trainControl(method = "cv", number = 10)
# Train the model
modelKFold <- train(Sumpowerall ~., data = train, method = "lm",
trControl = train.control)
# Summarize the results
print(modelKFold)
## Linear Regression
##
## 57599 samples
## 32 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 51839, 51839, 51840, 51839, 51839, 51839, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 103461 0.1468158 80840.73
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# Define training control
set.seed(123)
train.control <- trainControl(method = "repeatedcv",
number = 10, repeats = 3)
# Train the model
modelCV <- train(Sumpowerall ~., data = train, method = "lm",
trControl = train.control)
# Summarize the results
print(modelCV)
## Linear Regression
##
## 57599 samples
## 32 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 51839, 51839, 51840, 51839, 51839, 51839, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 103459.8 0.1468718 80842.11
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
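As promised in the project aims, the resampling metrics of the cross-validated models can be collected into a single summary matrix for easier comparison; a minimal sketch using the objects above:
# One row per model: RMSE, R-squared and MAE from the caret resampling results.
metric_cols <- c("RMSE", "Rsquared", "MAE")
summary_matrix <- rbind(modelKFold = unlist(modelKFold$results[1, metric_cols]),
                        modelCV    = unlist(modelCV$results[1, metric_cols]))
summary_matrix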
lm_expTrain <- DALEX::explain(model1, label = "lm", data = train[-33], y = train[33])
## Preparation of a new explainer is initiated
## -> model label : lm
## -> data : 57599 rows 32 cols
## -> target variable : Argument 'y' was a data frame. Converted to a vector. ( [31m WARNING [39m )
## -> target variable : 57599 values
## -> predict function : yhat.lm will be used ( [33m default [39m )
## -> predicted values : numerical, min = 3595676 , mean = 3760194 , max = 3892355
## -> model_info : package stats , ver. 4.0.2 , task regression ( [33m default [39m )
## -> residual function : difference between y and yhat ( [33m default [39m )
## -> residuals : numerical, min = -493754.1 , mean = 9.019097e-09 , max = 517458.1
## [32m A new explainer has been created! [39m
lm_expTest <- DALEX::explain(model1, label = "lm", data = test[-33], y = test[33])
## Preparation of a new explainer is initiated
## -> model label : lm
## -> data : 14400 rows 32 cols
## -> target variable : Argument 'y' was a data frame. Converted to a vector. ( [31m WARNING [39m )
## -> target variable : 14400 values
## -> predict function : yhat.lm will be used ( [33m default [39m )
## -> predicted values : numerical, min = 3594773 , mean = 3760072 , max = 3878498
## -> model_info : package stats , ver. 4.0.2 , task regression ( [33m default [39m )
## -> residual function : difference between y and yhat ( [33m default [39m )
## -> residuals : numerical, min = -447459.8 , mean = -162.8945 , max = 500989.5
## [32m A new explainer has been created! [39m
lm_mrTrain <- model_residual(lm_expTrain)
lm_mrTest <- model_residual(lm_expTest)
plot(lm_mrTrain, type = "prediction", abline = TRUE)
We can see that the predicted and target values are scattered like “shotgun pellets”.
There is clearly only a weak correlation between the predicted and target values,
which also reflects how weak our model is.
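This visual impression can be quantified directly; a quick sketch using the fitted model and the training set shown in the plot:
# Correlation between predicted and observed Sumpowerall on the training set.
cor(predict(model1, train[-33]), train$Sumpowerall)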
#plot(lm_mrTrain, type = "residual")
#plot(lm_mrTrain, type = "residual_density")