AIM OF THE PROJECT

In this project, we will use “Wave Energy Converts Data Set” data set from UCI Machine Learning Repository.

Link to Dataset

We will conduct a pure model with multiple regression. In each step we will conduct a new model with the evaluation of the selected sub-steps. For conclusion we will compare pure model and other evaluated models to see what are the effects on touches in the sub-steps whether if they increase efficiency of the machine learning or not.

We must remember that each case is unique thus these outputs are specific to this case. Although, this project will give insight for sub-step evaluation importance on machine learning algorithms. With this experience one may carry out more instinct on upcoming projects. These exercises adds experience and vision where machine learning has tricky sub-steps and one may evaluate always has a door to increase their model efficiency. But again, remember, each case is special and unique but methods are very similar.

Main aim of the project is the bullet points shown below;

#setwd below does set path where RMD file is in. Pretty USEFUL !
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))

Models Comparison

Ready - Set - Go

Load Packages

library(ggplot2)
library("PerformanceAnalytics")
library(tidyverse)  # data manipulation
library(cluster)    # clustering algorithms
library(factoextra) # clustering algorithms & visualization
library(fpc)
library(colorspace) 
library(MASS)
library(naniar)
library(ltm)
library(pROC)
library(caret)
library(class)
library(leaps)
library(sp)
library(FSelectorRcpp)
library(FSelector)
library(e1071)
library(naniar)
library(caret)
library(PerformanceAnalytics)
library(readxl)
library(caTools)
library(nortest)
library(car)
library(mgcv)
library(auditor)
library(DALEX)

Load Data set

Story of the data; (While analyzing data, remember that story of the data is one of the most important aspect.)

This data set consists of positions and absorbed power outputs of wave energy converters (WECs) in four real wave scenarios from the southern coast of Australia (Sydney, Adelaide, Perth and Tasmania). The applied converter model is a fully submerged three-tether converter called CETO. 16 WECs locations are placed and optimized in a size-constrained environment. In terms of optimization, the problem is categorized as an expensive optimization problem that each farm evaluation takes several minutes.

#setwd below does set path where RMD file is in. Pretty USEFUL !
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))

#Attribute Information:
#Attribute: Attribute Range

#1. WECs position {X1, X2, …, X16; Y1, Y2,…, Y16} continuous from 0 to 566 (m).
#2. WECs absorbed power: {P1, P2, …, P16}
#3. Total power output of the farm: Powerall

#Since they did not add column names we will use the information above and hope 
#dataset order and explanation order is the same.

Xoriented <- sprintf("X%s",seq(1:16))
Yoriented <- sprintf("Y%s",seq(1:16))
Poriented <- sprintf("P%s",seq(1:16))
P <- "Sumpowerall"

#We will act like "powerall" is our dependent variable.
#powerall shows high correlation with sum of "P".
#In order to exercise regression we will remove "P" segments.
#If we include "P" segment, these variables are overrun other ones.

# Colname vector lenght and dataframe column number matches.

colnamesofDF <- c(as.character(Xoriented),as.character(Yoriented),as.character(Poriented),P)

df1 <- read.csv("Tasmania_Data.csv",col.names =  colnamesofDF)

#This repository has the data for different areas.
#Characteristic between different areas may differ.
#Considering this situation we will focus on the area called "Perth".

## You can use the link if you want to download data from UCI Repository directly.

#temp <- tempfile()
#download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/00494/WECs_DataSet.zip",temp)

#https://archive.ics.uci.edu/ml/datasets/Wave+Energy+Converters#

Exploratory Analysis

#Data set contains 71999 rows and 49 columns.
#All variables are numeric.
dim(df1)
## [1] 71999    49
#To understand dataset, lets find sum of columns start with P and check their correlation with "Power" column.

#In %91 case sum of P columns and Powerall data is equal.

df1 <- df1 %>% mutate(
   SumP = rowSums(dplyr::select(., starts_with("P")))
)

prop.table(table(df1$SumP == df1$Sumpowerall))*100
## 
##    FALSE     TRUE 
## 90.79015  9.20985
#Since they are both numerical data lets check Pearson Correlation.

#Correlation is very high. We need to remove P columns to have a more healthy regression.

#Otherwise, P columns will overrun other variables.

#And it does not make sense to conduct a regression where dependent column is sum of the other columns..

cor(df1$SumP,df1$Sumpowerall,method = "pearson")
## [1] 0.9999992
#It can be seen that sum of P columns and Power column is nearly similar to each other.
#We will remove P columns to have an ideal regression data set.
library(tidyverse) 
df1 <- df1 %>% dplyr::select(-starts_with("P"))
df1$SumP <- NULL
colnames(df1)
##  [1] "X1"          "X2"          "X3"          "X4"          "X5"         
##  [6] "X6"          "X7"          "X8"          "X9"          "X10"        
## [11] "X11"         "X12"         "X13"         "X14"         "X15"        
## [16] "X16"         "Y1"          "Y2"          "Y3"          "Y4"         
## [21] "Y5"          "Y6"          "Y7"          "Y8"          "Y9"         
## [26] "Y10"         "Y11"         "Y12"         "Y13"         "Y14"        
## [31] "Y15"         "Y16"         "Sumpowerall"

Outlier Control

#Outlier Control

#Since we know data contains characteristics of X,Y,P and power,
#we can check their boxplot in their class.

#Thus this will make our visual more readable.

#No outliers in "X" class.
boxplot(x = as.list(as.data.frame(df1[1:16])))

#Outlier Control

#No Outlier in "Y" class.

boxplot(x = as.list(as.data.frame(df1[17:32])))

#Outlier Control

#There are some outliers in "Power" class.

boxplot(x = as.list(as.data.frame(df1[33])))

Outlier conclusions:

  • Dataset contains of numerious results of X-Y and dependent Power variable.

  • We will act this data set as some kind of “sensor” data.

  • Outliers are fairly close to the benchmarks.

  • Thus we will not take action against these outliers.

Missing Values

missing <- miss_var_summary(df1)
mean(missing$pct_miss)
## [1] 0
vis_miss(df1, warn_large_data = FALSE)

  • There are no missing values in this data set.

Creating Pure MLR

set.seed(123)
#Shuffle data set.
df1 <- df1[sample(nrow(df1)),]

#Split train-test.
sample = sample.split(df1$Sumpowerall, SplitRatio = .80)
train = subset(df1, sample == TRUE)
test  = subset(df1, sample == FALSE)

# Build the model
model1 <- lm(Sumpowerall ~., data = train)
summary(model1)
## 
## Call:
## lm(formula = Sumpowerall ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -493754  -65841    1562   67370  517458 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.804e+06  4.833e+03 787.094  < 2e-16 ***
## X1           5.652e+00  2.576e+00   2.194 0.028209 *  
## X2          -1.757e+01  2.648e+00  -6.633 3.33e-11 ***
## X3           3.570e+01  2.578e+00  13.847  < 2e-16 ***
## X4          -7.652e+01  2.591e+00 -29.536  < 2e-16 ***
## X5           3.308e+01  2.464e+00  13.424  < 2e-16 ***
## X6          -9.028e+01  2.529e+00 -35.692  < 2e-16 ***
## X7          -2.141e+01  2.611e+00  -8.199 2.48e-16 ***
## X8           6.770e+00  2.670e+00   2.536 0.011210 *  
## X9          -3.847e+01  2.598e+00 -14.809  < 2e-16 ***
## X10          4.951e+01  2.535e+00  19.532  < 2e-16 ***
## X11         -2.654e+01  2.616e+00 -10.143  < 2e-16 ***
## X12         -3.115e+01  2.551e+00 -12.210  < 2e-16 ***
## X13          2.900e+00  2.543e+00   1.140 0.254188    
## X14         -1.073e+01  2.552e+00  -4.207 2.60e-05 ***
## X15          1.110e+01  2.511e+00   4.419 9.92e-06 ***
## X16          3.901e+01  2.549e+00  15.304  < 2e-16 ***
## Y1          -3.950e+01  2.610e+00 -15.134  < 2e-16 ***
## Y2           4.199e+01  2.646e+00  15.871  < 2e-16 ***
## Y3          -5.783e+01  2.542e+00 -22.749  < 2e-16 ***
## Y4           5.667e+01  2.501e+00  22.657  < 2e-16 ***
## Y5          -3.138e+01  2.581e+00 -12.155  < 2e-16 ***
## Y6          -2.260e+01  2.602e+00  -8.686  < 2e-16 ***
## Y7           3.574e+01  2.633e+00  13.573  < 2e-16 ***
## Y8          -1.621e+01  2.560e+00  -6.334 2.41e-10 ***
## Y9           3.042e+00  2.543e+00   1.196 0.231723    
## Y10          3.890e+01  2.579e+00  15.082  < 2e-16 ***
## Y11          9.289e+00  2.504e+00   3.710 0.000208 ***
## Y12          5.961e+00  2.639e+00   2.259 0.023877 *  
## Y13         -5.373e+01  2.540e+00 -21.154  < 2e-16 ***
## Y14         -2.825e+01  2.622e+00 -10.773  < 2e-16 ***
## Y15          1.404e+01  2.570e+00   5.463 4.71e-08 ***
## Y16         -4.199e+01  2.571e+00 -16.330  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 103400 on 57566 degrees of freedom
## Multiple R-squared:  0.1476, Adjusted R-squared:  0.1472 
## F-statistic: 311.6 on 32 and 57566 DF,  p-value: < 2.2e-16
# Make predictions
predictions <- model1 %>% predict(test)

Pure MLR Test Error

defaultSummary(data.frame(obs = train$Sumpowerall,
pred = predict(model1, train[-33]))
)
##         RMSE     Rsquared          MAE 
## 1.034064e+05 1.476286e-01 8.079594e+04

Scaling

#Scaling the data.
df1scaled <- scale(df1)
df1scaled <- as.data.frame(df1scaled)
set.seed(123)
#Shuffle data set.
df1scaled <- df1scaled[sample(nrow(df1)),]

#Split train-test.
sample = sample.split(df1scaled$Sumpowerall, SplitRatio = .80)
trainscaled = subset(df1scaled, sample == TRUE)
testscaled  = subset(df1scaled, sample == FALSE)

# Build the model
model1scaled <- lm(Sumpowerall ~., data = trainscaled)
summary(model1scaled)
## 
## Call:
## lm(formula = Sumpowerall ~ ., data = trainscaled)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4019 -0.5892  0.0123  0.5998  4.6062 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.357e-05  3.850e-03  -0.014  0.98890    
## X1           9.874e-03  4.402e-03   2.243  0.02489 *  
## X2          -2.714e-02  4.211e-03  -6.445 1.17e-10 ***
## X3           5.835e-02  4.173e-03  13.981  < 2e-16 ***
## X4          -1.245e-01  4.184e-03 -29.756  < 2e-16 ***
## X5           5.850e-02  4.238e-03  13.804  < 2e-16 ***
## X6          -1.487e-01  4.134e-03 -35.971  < 2e-16 ***
## X7          -3.398e-02  4.246e-03  -8.003 1.24e-15 ***
## X8           1.177e-02  4.057e-03   2.900  0.00373 ** 
## X9          -6.042e-02  4.169e-03 -14.492  < 2e-16 ***
## X10          7.943e-02  4.184e-03  18.986  < 2e-16 ***
## X11         -4.234e-02  4.335e-03  -9.766  < 2e-16 ***
## X12         -5.109e-02  4.424e-03 -11.548  < 2e-16 ***
## X13          6.595e-03  4.263e-03   1.547  0.12191    
## X14         -2.327e-02  4.304e-03  -5.406 6.46e-08 ***
## X15          1.626e-02  4.172e-03   3.897 9.75e-05 ***
## X16          6.708e-02  4.164e-03  16.109  < 2e-16 ***
## Y1          -6.835e-02  4.483e-03 -15.247  < 2e-16 ***
## Y2           6.719e-02  4.140e-03  16.228  < 2e-16 ***
## Y3          -9.411e-02  4.109e-03 -22.901  < 2e-16 ***
## Y4           9.231e-02  4.097e-03  22.532  < 2e-16 ***
## Y5          -5.076e-02  4.223e-03 -12.021  < 2e-16 ***
## Y6          -3.997e-02  4.137e-03  -9.662  < 2e-16 ***
## Y7           5.682e-02  4.088e-03  13.899  < 2e-16 ***
## Y8          -2.359e-02  4.106e-03  -5.746 9.20e-09 ***
## Y9           5.458e-03  4.262e-03   1.281  0.20033    
## Y10          6.267e-02  4.222e-03  14.842  < 2e-16 ***
## Y11          1.184e-02  4.224e-03   2.804  0.00504 ** 
## Y12          1.131e-02  4.315e-03   2.622  0.00875 ** 
## Y13         -8.584e-02  4.134e-03 -20.766  < 2e-16 ***
## Y14         -4.478e-02  4.277e-03 -10.468  < 2e-16 ***
## Y15          2.399e-02  4.192e-03   5.723 1.05e-08 ***
## Y16         -6.552e-02  4.050e-03 -16.176  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.924 on 57566 degrees of freedom
## Multiple R-squared:  0.147,  Adjusted R-squared:  0.1465 
## F-statistic:   310 on 32 and 57566 DF,  p-value: < 2.2e-16

Scaled MLR Test Error

defaultSummary(data.frame(obs = testscaled$Sumpowerall,
pred = predict(model1scaled, testscaled[-33]))
)
##      RMSE  Rsquared       MAE 
## 0.9241219 0.1450212 0.7199197

Assumption Controls

Checking the assumptions for multiple linear regression.

Multicollinearity

# We can see that all variables VIF value is <5.
# All variables are okey to stay in terms of VIF control.

VIF <-car::vif(model1)
VIF
##       X1       X2       X3       X4       X5       X6       X7       X8 
## 1.309168 1.198578 1.169546 1.185964 1.212113 1.152853 1.215584 1.107723 
##       X9      X10      X11      X12      X13      X14      X15      X16 
## 1.174065 1.187159 1.264464 1.318311 1.225675 1.241495 1.172852 1.171252 
##       Y1       Y2       Y3       Y4       Y5       Y6       Y7       Y8 
## 1.355357 1.152917 1.138277 1.137597 1.205421 1.154004 1.128049 1.139387 
##       Y9      Y10      Y11      Y12      Y13      Y14      Y15      Y16 
## 1.222920 1.198602 1.205828 1.254164 1.153791 1.231831 1.180657 1.105763

Multivariate Normality

# We can not conduct Shapiro test over 5000 sample.

# Anderson-Darling Test has been conducted over all variables.

# We can see that all p values are <0.5 so non of the variables distributed normally.

lshap <- lapply(df1, ad.test)

as.data.frame(data.table::rbindlist(lshap))
##     statistic p.value                          method data.name
## 1  1647.06944 3.7e-24 Anderson-Darling normality test    X[[i]]
## 2  1029.11161 3.7e-24 Anderson-Darling normality test    X[[i]]
## 3  1150.58217 3.7e-24 Anderson-Darling normality test    X[[i]]
## 4  1183.07769 3.7e-24 Anderson-Darling normality test    X[[i]]
## 5  1792.13692 3.7e-24 Anderson-Darling normality test    X[[i]]
## 6  1412.43445 3.7e-24 Anderson-Darling normality test    X[[i]]
## 7  1238.14292 3.7e-24 Anderson-Darling normality test    X[[i]]
## 8   595.29865 3.7e-24 Anderson-Darling normality test    X[[i]]
## 9  1061.89334 3.7e-24 Anderson-Darling normality test    X[[i]]
## 10 1318.57871 3.7e-24 Anderson-Darling normality test    X[[i]]
## 11 1416.37399 3.7e-24 Anderson-Darling normality test    X[[i]]
## 12 1872.09254 3.7e-24 Anderson-Darling normality test    X[[i]]
## 13 1464.89313 3.7e-24 Anderson-Darling normality test    X[[i]]
## 14 1507.24460 3.7e-24 Anderson-Darling normality test    X[[i]]
## 15 1614.36058 3.7e-24 Anderson-Darling normality test    X[[i]]
## 16 1367.75962 3.7e-24 Anderson-Darling normality test    X[[i]]
## 17 1780.99398 3.7e-24 Anderson-Darling normality test    X[[i]]
## 18 1050.22535 3.7e-24 Anderson-Darling normality test    X[[i]]
## 19 1551.70754 3.7e-24 Anderson-Darling normality test    X[[i]]
## 20 1366.20301 3.7e-24 Anderson-Darling normality test    X[[i]]
## 21 1251.88105 3.7e-24 Anderson-Darling normality test    X[[i]]
## 22  965.35591 3.7e-24 Anderson-Darling normality test    X[[i]]
## 23  923.81990 3.7e-24 Anderson-Darling normality test    X[[i]]
## 24 1226.07561 3.7e-24 Anderson-Darling normality test    X[[i]]
## 25 1631.84361 3.7e-24 Anderson-Darling normality test    X[[i]]
## 26 1415.42358 3.7e-24 Anderson-Darling normality test    X[[i]]
## 27 1339.33633 3.7e-24 Anderson-Darling normality test    X[[i]]
## 28 1151.57875 3.7e-24 Anderson-Darling normality test    X[[i]]
## 29 1364.44486 3.7e-24 Anderson-Darling normality test    X[[i]]
## 30 1341.77638 3.7e-24 Anderson-Darling normality test    X[[i]]
## 31 1478.25834 3.7e-24 Anderson-Darling normality test    X[[i]]
## 32  969.71070 3.7e-24 Anderson-Darling normality test    X[[i]]
## 33   55.37006 3.7e-24 Anderson-Darling normality test    X[[i]]

Homoscedasticity

par(mfrow=c(2,2)) # init 4 charts in 1 panel
plot(model1)

The plots we are interested in are at the top-left and bottom-left. The top-left is the chart of residuals vs fitted values, while in the bottom-left one, it is standardised residuals on Y axis. If there is absolutely no heteroscedastity, you should see a completely random, equal distribution of points throughout the range of X axis and a flat red line.

But in our case, as you can notice from the top-left plot, the red line is slightly curved and the residuals seem to increase as the fitted Y values increase. So, the inference here is, heteroscedasticity exists.

Citation Link

car::ncvTest(model1)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 110.0427, Df = 1, p = < 2.22e-16
lmtest::bptest(model1)      
## 
##  studentized Breusch-Pagan test
## 
## data:  model1
## BP = 986.3, df = 32, p-value < 2.2e-16

Model evaluation and model performance metrics.

Leave one out cross validation - LOOCV

# Define training control
#train.control <- trainControl(method = "LOOCV")
# Train the model
#model <- train(Sumpowerall ~., data = train, method = "lm",
#               trControl = train.control)
# Summarize the results
#print(model)

K-fold cross-validation

# Define training control
set.seed(123) 
train.control <- trainControl(method = "cv", number = 10)
# Train the model
modelKFold <- train(Sumpowerall ~., data = train, method = "lm",
               trControl = train.control)
# Summarize the results
print(modelKFold)
## Linear Regression 
## 
## 57599 samples
##    32 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 51839, 51839, 51840, 51839, 51839, 51839, ... 
## Resampling results:
## 
##   RMSE    Rsquared   MAE     
##   103461  0.1468158  80840.73
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Repeated K-fold cross-validation

# Define training control
set.seed(123)
train.control <- trainControl(method = "repeatedcv", 
                              number = 10, repeats = 3)
# Train the model
modelCV <- train(Sumpowerall ~., data = train, method = "lm",
               trControl = train.control)
# Summarize the results
print(modelCV)
## Linear Regression 
## 
## 57599 samples
##    32 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 51839, 51839, 51840, 51839, 51839, 51839, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   103459.8  0.1468718  80842.11
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Resiudal Analysis of the Pure Model

lm_expTrain <- DALEX::explain(model1, label = "lm", data = train[-33], y = train[33])
## Preparation of a new explainer is initiated
##   -> model label       :  lm 
##   -> data              :  57599  rows  32  cols 
##   -> target variable   :  Argument 'y' was a data frame. Converted to a vector. (  WARNING  )
##   -> target variable   :  57599  values 
##   -> predict function  :  yhat.lm  will be used (  default  )
##   -> predicted values  :  numerical, min =  3595676 , mean =  3760194 , max =  3892355  
##   -> model_info        :  package stats , ver. 4.0.2 , task regression (  default  ) 
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -493754.1 , mean =  9.019097e-09 , max =  517458.1  
##   A new explainer has been created! 
lm_expTest  <- DALEX::explain(model1, label = "lm", data = test[-33], y = test[33])
## Preparation of a new explainer is initiated
##   -> model label       :  lm 
##   -> data              :  14400  rows  32  cols 
##   -> target variable   :  Argument 'y' was a data frame. Converted to a vector. (  WARNING  )
##   -> target variable   :  14400  values 
##   -> predict function  :  yhat.lm  will be used (  default  )
##   -> predicted values  :  numerical, min =  3594773 , mean =  3760072 , max =  3878498  
##   -> model_info        :  package stats , ver. 4.0.2 , task regression (  default  ) 
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -447459.8 , mean =  -162.8945 , max =  500989.5  
##   A new explainer has been created! 
lm_mrTrain <- model_residual(lm_expTrain)
lm_mrTest <- model_residual(lm_expTest)
plot(lm_mrTrain, type = "prediction", abline = TRUE)

#plot(lm_mrTrain, type = "residual")
#plot(lm_mrTrain, type = "residual_density")