EEG stands for Electroencephalogram which are electrical signals that measured the electrical activity of the brain (St et al., 2016). To get the EEG result, electrodes consisting of small metal discs with thin wires are pasted onto scalp. The electrodes detect tiny electrical charges that result from the activity of your brain cells and thus obtained charges are amplified and appear as a graph on a computer screen, or as a recording that may be printed out on paper. The main purpose of EEG is to detect potential problems (encephalitis, hemorrhage, epilepsy,Parkinson’s disease and other) with brain cell communication by painless method (Healthline, 2012).
Here in the data set, we have four input EEG signals, one output signals and time of each signals.
Our objectives are as follows:
library(ggplot2) # implementation of the grammar of graphics - complex visualizations
library(GGally) # create a matrix of scatter plots, with histograms or density plots
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(tidyverse) # data manipulation, exploration, and visualization
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ✔ purrr 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(tidymodels) # Split the data into training and testing
## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
## ✔ broom 1.0.2 ✔ rsample 1.1.1
## ✔ dials 1.1.0 ✔ tune 1.0.1
## ✔ infer 1.0.4 ✔ workflows 1.1.2
## ✔ modeldata 1.1.0 ✔ workflowsets 1.0.0
## ✔ parsnip 1.0.3 ✔ yardstick 1.1.0
## ✔ recipes 1.0.4
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/
library(moments) # calculating and plotting descriptive statistics
library(dplyr) # data manipulation toolkit for working with structured data
library(ggExtra) # additional functionality to ggplot2
library(readr) # fast and friendly file reading
library(ggpubr) # customizing ggplot2 based publication ready plots
#setting the theme to light
theme_set(theme_light())
# Importing time, input signals and output signal data and storing them to variable
time <- read_csv("Time_in_second.csv")
## Rows: 200 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (1): 0
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
x <- read_csv("X_input_signal.csv")
## Rows: 201 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4): V1, V2, V3, V4
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
y <- read_csv("Y_output_signal.csv")
## Rows: 201 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (1): YV1
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Add a new row with value 0 to the dataframe time, it can be used for some preprocessing step like adding a baseline value to the dataframe.
time <- rbind(0, time)
# Viewing data if they are render properly or not
head(y)
## # A tibble: 6 × 1
## YV1
## <dbl>
## 1 -0.0935
## 2 0.321
## 3 0.138
## 4 0.359
## 5 0.752
## 6 1.42
head(x)
## # A tibble: 6 × 4
## V1 V2 V3 V4
## <dbl> <dbl> <dbl> <dbl>
## 1 -3.07 -1.86 -2.78 -0.289
## 2 0.349 1.92 1.00 0.578
## 3 -1.01 0.272 -0.912 -2.70
## 4 -0.808 -1.58 0.194 -1.62
## 5 2.08 0.743 1.93 1.56
## 6 3.25 1.94 1.80 2.38
head(time)
## # A tibble: 6 × 1
## `0`
## <dbl>
## 1 0
## 2 0.002
## 3 0.004
## 4 0.006
## 5 0.008
## 6 0.01
Now, as we have loaded the date set, we are going to renaming the columns of three data frames ‘x’,‘y’ and ’time.
# First time of code is used to rename all the columns of dataframe x, the function rep("x", ncol(x)) will repeat the string "x" ncol(x) times, where ncol(x) is the number of columns in the dataframe x.
# The function paste0(...) is used to concatenate these strings with the sequence of numbers 1:ncol(x) separated by a comma.
# colnames(y) <- "y" This line of code is used to rename the columns of dataframe y to "y"
# colnames(time) <- "time" This line of code is used to rename the columns of dataframe time to "time"
colnames(x) <- paste0(rep("x", ncol(x)), 1:ncol(x))
colnames(y) <- "y"
colnames(time) <- "time"
One we renamed the data frame, we are going to create new data frame called df by appending the columns of time, x and y together in column-wise order. Here the resulting data frame df will have the same number of rows as the original data frames and the number of columns will be the sum of the number of columns in time, x, and y.
# cbind stands for column bind.
df = cbind(time, x, y)
# head df will allow us to see the structure of dataframe.
head(df)
## time x1 x2 x3 x4 y
## 1 0.000 -3.073820 -1.855760 -2.775880 -0.288894 -0.09352727
## 2 0.002 0.349116 1.919060 1.004950 0.577845 0.32122440
## 3 0.004 -1.009700 0.272461 -0.912297 -2.697380 0.13822394
## 4 0.006 -0.807608 -1.581480 0.194378 -1.622270 0.35918981
## 5 0.008 2.079250 0.742812 1.928090 1.558220 0.75155170
## 6 0.010 3.249200 1.944610 1.799000 2.381060 1.41717093
The initial examination of a data set before conducting more advanced and in-depth analysis is called preliminary data analysis. Here we will be looking at the descriptive summary of data set and explore missing values.
# summarizing the data using summary() function
summary(df)
## time x1 x2 x3
## Min. :0.0 Min. :-8.77696 Min. :-6.12832 Min. :-5.96646
## 1st Qu.:0.1 1st Qu.:-2.27265 1st Qu.:-1.40692 1st Qu.:-1.63379
## Median :0.2 Median : 0.30679 Median : 0.11721 Median :-0.10858
## Mean :0.2 Mean :-0.01244 Mean : 0.03857 Mean : 0.03507
## 3rd Qu.:0.3 3rd Qu.: 2.69797 3rd Qu.: 1.44672 3rd Qu.: 1.62875
## Max. :0.4 Max. : 8.15969 Max. : 5.73608 Max. : 6.27286
## x4 y
## Min. :-8.67620 Min. :-9.60389
## 1st Qu.:-2.15132 1st Qu.:-0.09428
## Median : 0.15677 Median : 0.41277
## Mean : 0.01477 Mean : 0.15235
## 3rd Qu.: 2.38110 3rd Qu.: 0.87908
## Max. : 8.00377 Max. : 5.67484
Here we can see that the mean of time is 0.2 second and output signal is 0.15235 of amplitude. Interestingly the mean of incput signal is coming in minus.
# Using is.na function to see
missing_values <- is.na(df)
missvalues <- sum(missing_values)
print(sprintf("Number of missing value in the dataset is %d", missvalues))
## [1] "Number of missing value in the dataset is 0"
Here we have zero missing data (NA). Which is quite good.
Here, we are focusing on creating time series plots of input and output EEG signals. Our concern is to plot 4 input signals into a single graph. So, for that we will work with Base R ts() functions. To do so, we need data either in matrix or vector. So here we are going to convert our input data into matrix using as.matrix() function.
# Assigning the matrix representation of the variable input signals x into xmat.Then it prints out the matrix representation of the variable "xmat"
# Then the code assigns the column names "X1", "X2", "X3", and "X4" to the columns of the matrix "x" using the "colnames" function.
# It changes the column names of the matrix x to "X1","X2","X3","X4"
xmat <- as.matrix(x)
print(xmat)
## x1 x2 x3 x4
## [1,] -3.0738200 -1.85576000 -2.7758800 -0.2888940
## [2,] 0.3491160 1.91906000 1.0049500 0.5778450
## [3,] -1.0097000 0.27246100 -0.9122970 -2.6973800
## [4,] -0.8076080 -1.58148000 0.1943780 -1.6222700
## [5,] 2.0792500 0.74281200 1.9280900 1.5582200
## [6,] 3.2492000 1.94461000 1.7990000 2.3810600
## [7,] 4.2155200 0.76991200 -0.5173890 2.6061900
## [8,] 4.6922700 1.89548000 3.0565800 5.0223400
## [9,] 3.0140000 4.42939000 0.4689410 4.8477900
## [10,] 4.7507200 4.97950000 0.0979985 4.3277700
## [11,] 0.3138210 1.27112000 -1.4271600 -1.3276800
## [12,] -3.4280700 -2.28121000 -0.6782290 -2.2250600
## [13,] -5.5078800 -3.22358000 -1.0453900 -5.0677300
## [14,] -3.4017000 -4.58924000 -1.1057100 -4.6078300
## [15,] -4.0676800 -5.88128000 -3.1716400 -6.3688000
## [16,] -7.3975400 -3.15888000 -3.0422200 -4.9585600
## [17,] -5.3269200 -1.47152000 -1.7871400 -0.9920120
## [18,] 1.2011300 1.21693000 0.6931980 4.9579800
## [19,] 3.8655100 0.91610100 2.3349800 4.4383500
## [20,] 2.0410900 2.37861000 5.5208000 5.1561700
## [21,] 1.5685700 1.44672000 2.9411300 1.3513700
## [22,] 0.8087330 0.23741200 -0.2976430 -3.4789200
## [23,] 2.9606300 0.11721500 0.1243580 -3.0095400
## [24,] 0.6194560 -2.43742000 -1.0753900 -2.0134500
## [25,] 0.8631820 -2.62701000 -1.8942500 -0.4084230
## [26,] 3.2728100 1.19443000 -1.0026200 3.6288800
## [27,] 3.4214400 0.15334500 -2.8274600 2.4608300
## [28,] 1.1308400 0.60652800 -1.0434800 2.0153200
## [29,] 2.1010300 3.62973000 1.1611400 3.6869700
## [30,] 4.3024900 4.60462000 3.1816600 3.8758200
## [31,] 2.5121100 5.73608000 4.6810200 3.0175400
## [32,] -4.8939100 0.31450400 -3.5710600 -3.5667000
## [33,] -8.4519300 1.93602000 -0.9674480 -5.0491600
## [34,] -6.7995200 -0.46123000 -2.5397800 -7.4703400
## [35,] -4.4883400 -1.77694000 -3.3953000 -7.3753300
## [36,] -7.6860500 -4.96746000 -3.4207200 -7.5725500
## [37,] -5.3979700 -4.91969000 -1.2404800 -3.6973000
## [38,] 0.7234280 -3.95158000 0.4604530 0.3965480
## [39,] 5.6006400 -1.55107000 2.9315200 3.0925300
## [40,] 5.7268000 0.29984300 2.0237200 6.9619400
## [41,] 5.5365900 2.79329000 5.7662700 8.0037700
## [42,] 4.5516900 3.32877000 3.8520200 6.7933500
## [43,] 5.0145800 5.23874000 3.5581000 5.5423800
## [44,] 5.8521100 2.24758000 2.8211900 2.8981700
## [45,] 2.6452900 -3.04935000 -2.5254700 -1.4068700
## [46,] 2.6447700 -3.44583000 -0.2132900 -0.4914740
## [47,] -0.2507240 -3.48255000 -1.8062600 -1.6134500
## [48,] -4.2629800 -2.81550000 -1.9345900 -2.8025600
## [49,] -7.8986300 -2.36433000 -2.4657800 -4.1242400
## [50,] -6.2221500 0.28877500 -2.4896700 -2.4313200
## [51,] -3.5907200 -0.66006800 -4.5210700 -2.8507300
## [52,] -4.5347600 1.83344000 -3.6921800 -2.5984500
## [53,] -3.7510900 -0.33501500 -1.2426800 -3.0267300
## [54,] -1.5467900 1.40576000 0.1604110 -1.4113100
## [55,] 3.4270800 4.07007000 3.8458700 1.7750000
## [56,] 2.7817600 1.88219000 -0.0821340 -1.0210200
## [57,] -1.5969900 1.92465000 0.9762700 -1.3045900
## [58,] -0.3221970 -0.07143180 1.3025400 0.5563080
## [59,] 2.0524800 -1.14304000 -1.6584800 0.6431410
## [60,] 6.6313400 -0.75454100 -0.5760970 2.7437900
## [61,] 4.6782700 -0.15145400 0.0332563 2.9638400
## [62,] 3.8403200 0.47588000 3.5020200 3.3100200
## [63,] 4.3288200 0.17600800 3.0071200 4.7055600
## [64,] 4.3921700 0.05992190 4.4489500 4.8535400
## [65,] -0.7515970 -1.93525000 0.0498746 2.7897800
## [66,] -1.4905200 0.99875500 3.7699100 3.8661100
## [67,] -2.2895000 -2.95901000 0.5769830 0.2967470
## [68,] 0.0151547 -3.04146000 -2.2091700 -3.0837500
## [69,] -2.1687800 -1.75914000 -1.7525300 -3.2086500
## [70,] -8.6114000 -2.21400000 -5.9664600 -8.6762000
## [71,] -7.7226100 0.92735000 -4.4923000 -7.1548000
## [72,] -7.7403400 3.90660000 -2.7631300 -6.5743800
## [73,] -7.1556100 2.44393000 -2.2081000 -4.0378800
## [74,] -2.2726500 1.78572000 -0.5398690 -0.7930710
## [75,] 2.2397500 2.91178000 1.8320400 2.3811000
## [76,] 8.1596900 5.58982000 4.7959100 4.6540000
## [77,] 5.5109800 4.28327000 4.2419400 3.0229300
## [78,] 4.4839000 0.17433400 3.4145900 2.0311600
## [79,] 5.6026200 -0.50425300 1.2370300 3.6939500
## [80,] 5.6285700 -1.51335000 -0.0377929 2.9134800
## [81,] 1.0400000 -3.65570000 -1.9066500 0.6739890
## [82,] -2.4699900 -6.12832000 -4.1929500 0.1287390
## [83,] 1.2599600 -4.68894000 -0.8516600 2.3970500
## [84,] 6.4842400 -2.43917000 0.6321510 4.0700500
## [85,] 3.4714500 -1.33825000 -1.3673300 1.2440800
## [86,] -0.3051430 -1.65046000 0.0984331 -0.3204940
## [87,] -4.4843500 -1.33988000 -0.4342900 -3.7607400
## [88,] -6.0248800 4.17147000 0.6477070 -3.7568600
## [89,] -5.5984900 3.66149000 -0.6470860 -5.8949400
## [90,] -6.3095700 -0.97238200 -2.9035700 -5.5797300
## [91,] -3.9298900 -0.38168800 -1.7586600 -2.8114500
## [92,] -0.9826030 0.71803300 -0.5274370 -1.1110000
## [93,] -0.8427570 2.56807000 1.5265800 1.1984200
## [94,] -0.3329730 2.97267000 2.9517800 2.7161200
## [95,] 4.0752100 4.83872000 6.2728600 5.0092800
## [96,] 4.5587800 2.89968000 1.6424400 2.0383800
## [97,] 3.7612000 0.22020500 2.2017700 2.1418200
## [98,] -0.9161140 -2.77572000 -0.4667820 0.4984910
## [99,] -0.8469660 -1.68148000 -0.9211450 1.3458100
## [100,] 2.0771600 -1.70177000 -0.7867520 1.4131200
## [101,] 2.2445700 -4.41115000 -3.3045700 -0.2170120
## [102,] 0.2244180 -2.57161000 -0.6508410 2.1849600
## [103,] -1.3955300 -2.12800000 1.6287500 3.5749500
## [104,] 1.6988700 0.48527800 0.9954470 1.5959900
## [105,] 1.8694100 0.37078700 -1.1600700 1.6805300
## [106,] -1.4118400 -0.05974620 -2.1525500 -2.6120000
## [107,] -5.1017000 1.17420000 -1.9357000 -6.1554400
## [108,] -2.0496300 -0.24748000 -0.9665570 -5.8802800
## [109,] -2.3459300 -1.40341000 -4.9851200 -5.5521800
## [110,] -1.1725000 0.98644600 -0.4167850 -2.7373500
## [111,] 0.8427390 -0.45691400 0.6841820 -0.7926270
## [112,] 3.6279100 2.65945000 5.4655400 4.2147900
## [113,] 7.9447200 2.51336000 5.8757000 6.1627200
## [114,] 5.3379300 -0.28025000 2.8036200 4.5496100
## [115,] 1.8026700 -0.07871150 1.7705500 2.9977600
## [116,] 0.2077330 0.66436400 -0.5870790 3.3120200
## [117,] -1.6564700 0.92316600 -2.3687400 -0.3215960
## [118,] -6.7984800 0.76146600 -2.8413300 -2.9107700
## [119,] -8.7769600 -0.87219500 -2.4879600 -4.9312600
## [120,] -5.6586600 -1.80932000 -5.8362000 -6.0288800
## [121,] -0.4258380 0.68154300 -1.4538900 -2.5845500
## [122,] 1.0300400 -0.16118800 -0.0327746 0.4558820
## [123,] -0.3162120 -0.43734900 1.0304500 1.0838400
## [124,] -0.4633650 -1.45428000 1.5844700 -0.1540880
## [125,] 0.4642520 -0.52379500 1.3937800 0.5928480
## [126,] 0.3067870 -1.95445000 -0.7694430 -2.9857700
## [127,] 0.7397500 0.01127320 0.0106142 -0.1718100
## [128,] 2.6979700 0.64898700 0.4030010 -0.4851490
## [129,] 4.0389100 1.91844000 1.1180900 -0.2084040
## [130,] 3.6139300 1.21576000 0.8375330 1.2242900
## [131,] 3.7483500 1.44789000 2.5838300 4.5122700
## [132,] 1.5897600 -0.38639200 -0.0287545 3.3929300
## [133,] 3.2360100 0.38284000 1.9625600 4.6744000
## [134,] 1.8772700 0.00767195 -0.4698540 2.7172700
## [135,] -2.2278600 -1.93428000 -3.4482100 -1.3647700
## [136,] -1.7547200 -0.91333700 -1.1307900 -0.5678270
## [137,] -2.7323500 1.29352000 0.7622010 -2.7378100
## [138,] -2.5162200 0.59438200 2.1341300 -2.1513200
## [139,] -6.2591400 1.07941000 2.6707400 -1.0673900
## [140,] -4.5278900 0.48249800 3.1922600 -1.2548100
## [141,] 0.6600860 2.47368000 0.6598660 0.1777550
## [142,] 3.4365900 1.04480000 -0.8351160 0.5919540
## [143,] 0.7603120 2.46556000 0.2912660 0.7295690
## [144,] -2.7582600 -3.94765000 -3.3440500 -1.9344600
## [145,] -0.8860100 -2.61157000 -4.2860800 -2.1158200
## [146,] 2.7247700 -0.75779100 -2.4078700 0.1567670
## [147,] 1.1641000 -2.68062000 -1.4136100 -0.2665680
## [148,] -2.6749500 -2.69752000 -1.6337900 -1.6329500
## [149,] 0.4790550 -1.29834000 -0.1085790 0.2453960
## [150,] 0.6891770 -0.77841100 -1.5801000 -0.0145057
## [151,] 0.3852230 -1.34904000 -2.2416000 0.6072830
## [152,] 1.8808200 0.17936600 2.1253800 3.8114600
## [153,] 3.7570500 0.26069900 0.9873990 3.7606500
## [154,] 5.5996800 2.14143000 3.3917200 3.5441100
## [155,] 3.0042800 -0.31812900 0.2457550 0.6701810
## [156,] -0.1060330 3.86214000 4.7322900 2.8811400
## [157,] -1.4587300 0.44293000 0.9613800 -0.8761230
## [158,] 0.1092970 2.59161000 4.8129300 -0.6793970
## [159,] -0.7160080 -0.37258500 1.7524800 -1.6758100
## [160,] -2.4094100 -0.16884400 1.1023000 -0.8976520
## [161,] -2.2868100 -1.07875000 -1.0849100 -2.6165700
## [162,] -3.5085200 0.22574200 -3.3365000 -3.4183900
## [163,] -3.4343200 3.34135000 -0.8310890 -2.5643300
## [164,] -3.9000400 1.43966000 -3.5634400 -2.2593300
## [165,] -1.1472400 -0.99570700 -3.7753000 -1.5739600
## [166,] 1.4691900 1.65039000 -2.0839900 0.2340930
## [167,] 0.8114030 0.02590660 -0.7997150 0.3481560
## [168,] -0.6507080 -0.75252100 -0.4941440 1.0751000
## [169,] -2.8687200 1.54838000 0.1203990 1.2934500
## [170,] 0.3093590 1.41561000 1.6622500 2.4442500
## [171,] 1.3697300 -0.96778700 -0.3680980 0.5082360
## [172,] 2.7126600 -2.39518000 3.7634900 1.4873600
## [173,] 3.1630900 -1.72635000 3.0296400 0.7751070
## [174,] 2.5676700 -0.36931500 -0.1593730 -0.0138807
## [175,] 3.1677100 -1.06094000 0.5795200 -0.1259590
## [176,] 2.4347500 -0.00308621 1.2299700 0.4921300
## [177,] -0.3752480 -3.76735000 -0.7360860 -2.1962500
## [178,] 1.6589000 -2.69863000 -0.7357520 -0.5143970
## [179,] 0.7618460 -1.40692000 -3.2802700 -0.4430280
## [180,] -0.7358630 -1.33040000 -1.4166200 0.3005250
## [181,] -3.9646700 0.36373400 -2.6141100 -0.7718600
## [182,] -2.1347400 3.27538000 2.2545400 1.3753300
## [183,] 0.8260230 0.77756800 3.3977600 2.4251600
## [184,] -4.4635200 0.15804900 -1.7550700 -0.4295160
## [185,] -2.1868900 0.05701460 -1.0778500 0.1900900
## [186,] -0.4795210 -1.25848000 -2.0622100 -1.4190200
## [187,] 1.3266000 1.73680000 -1.9798600 -2.2402200
## [188,] 0.8778610 -1.29086000 -1.5487600 -2.7342200
## [189,] 0.8565720 1.61669000 1.3731300 0.5325750
## [190,] 3.3760100 0.56859400 1.4424600 1.8113900
## [191,] 3.4957900 2.46666000 1.7352200 2.0178400
## [192,] -0.2133010 1.06866000 1.5247500 0.3842780
## [193,] -3.1639200 -0.89042100 1.5277000 0.0758609
## [194,] -0.1778210 1.82822000 3.8807100 2.1678400
## [195,] 0.0602433 0.51361200 2.0677000 -0.7731560
## [196,] -1.7828800 -1.86709000 0.5000770 -0.7898160
## [197,] -1.2884600 -0.03322000 1.4437700 2.8444200
## [198,] -0.7544630 1.97492000 2.7698700 1.8521600
## [199,] -0.1134360 1.80448000 -1.1392600 -1.0958600
## [200,] -1.2894200 1.70448000 -1.6116700 -2.4489600
## [201,] -1.1855600 -0.73758200 0.0173588 -1.4045300
colnames(x)<-c("X1","X2","X3","X4")
#Similar process as above for output data Y.
ymat <- as.matrix(y)
print(ymat)
## y
## [1,] -0.093527266
## [2,] 0.321224404
## [3,] 0.138223937
## [4,] 0.359189812
## [5,] 0.751551701
## [6,] 1.417170927
## [7,] 1.575811212
## [8,] 1.980767203
## [9,] 1.486300644
## [10,] 2.231949993
## [11,] 0.361095300
## [12,] -0.224933476
## [13,] -2.009821565
## [14,] -0.740096137
## [15,] -1.389407508
## [16,] -4.599672295
## [17,] -1.108414828
## [18,] 1.250201299
## [19,] 1.498591421
## [20,] -0.462949480
## [21,] 0.499404471
## [22,] 0.118473567
## [23,] 0.313297298
## [24,] 0.220608023
## [25,] 0.340679925
## [26,] 1.445172614
## [27,] 1.148312160
## [28,] 0.836004694
## [29,] 1.110128961
## [30,] 1.737214351
## [31,] -0.093024924
## [32,] -1.515796746
## [33,] -6.191951769
## [34,] -3.866489303
## [35,] -1.718463461
## [36,] -5.509836115
## [37,] -1.614920755
## [38,] 0.504780653
## [39,] 2.687437143
## [40,] 3.435914954
## [41,] 1.324227327
## [42,] 2.017504333
## [43,] 2.265837965
## [44,] 2.843544263
## [45,] 0.537556568
## [46,] 0.372256569
## [47,] 0.068880699
## [48,] -0.596434658
## [49,] -4.931042468
## [50,] -2.370931858
## [51,] -1.090782632
## [52,] -0.976951788
## [53,] -0.467613067
## [54,] 0.205741606
## [55,] 0.683506378
## [56,] 0.536491794
## [57,] 0.341632810
## [58,] 0.560290032
## [59,] 0.871535242
## [60,] 3.793489975
## [61,] 1.896886402
## [62,] 1.316509116
## [63,] 1.876985528
## [64,] 1.338045994
## [65,] 0.704833595
## [66,] 0.627827637
## [67,] 0.240926600
## [68,] 0.008334176
## [69,] -0.025675996
## [70,] -9.603887004
## [71,] -5.918113929
## [72,] -5.191791581
## [73,] -3.739552670
## [74,] 0.257822360
## [75,] 0.943141024
## [76,] 5.674841233
## [77,] 2.088330117
## [78,] 1.387210803
## [79,] 2.628307287
## [80,] 2.490665745
## [81,] 0.423658114
## [82,] -0.349557886
## [83,] 0.923218919
## [84,] 3.758926619
## [85,] 1.089431045
## [86,] 0.488580684
## [87,] -1.104578882
## [88,] -2.335887960
## [89,] -2.214269593
## [90,] -3.071174796
## [91,] -0.591707920
## [92,] 0.402426051
## [93,] 0.612168342
## [94,] 0.824646962
## [95,] -1.134443267
## [96,] 1.757815712
## [97,] 1.403856959
## [98,] 0.781006165
## [99,] 0.841089127
## [100,] 0.642419362
## [101,] 0.395022155
## [102,] 0.702813728
## [103,] 0.793909889
## [104,] 0.665395883
## [105,] 0.595596284
## [106,] -0.076007715
## [107,] -1.643802298
## [108,] -0.517540894
## [109,] -1.429046718
## [110,] 0.112299840
## [111,] 0.487052643
## [112,] -0.189573139
## [113,] 4.027251781
## [114,] 2.501499443
## [115,] 1.004671028
## [116,] 0.848361526
## [117,] 0.258337136
## [118,] -3.316101586
## [119,] -7.060819497
## [120,] -4.385004090
## [121,] 0.318361914
## [122,] 0.403740625
## [123,] 0.694679597
## [124,] 0.397876100
## [125,] 0.474232425
## [126,] 0.062060237
## [127,] 0.431090195
## [128,] 0.628673150
## [129,] 1.201323418
## [130,] 1.061108318
## [131,] 1.599074711
## [132,] 0.879082366
## [133,] 1.472031681
## [134,] 1.072535881
## [135,] -0.272613492
## [136,] 0.396612402
## [137,] -0.250536559
## [138,] -0.001596737
## [139,] -2.259025684
## [140,] -0.852066323
## [141,] 0.368771270
## [142,] 0.902579918
## [143,] 0.594513992
## [144,] -0.242953720
## [145,] -0.518408839
## [146,] 0.594459287
## [147,] 0.600658971
## [148,] 0.029958921
## [149,] 0.460318836
## [150,] 0.412765600
## [151,] 0.533607604
## [152,] 1.190342205
## [153,] 1.550063230
## [154,] 2.609958820
## [155,] 0.894329550
## [156,] -0.127206932
## [157,] 0.254110631
## [158,] -0.657331232
## [159,] 0.215760690
## [160,] 0.132027942
## [161,] -0.094279291
## [162,] -0.638587598
## [163,] -0.238499988
## [164,] -0.785731514
## [165,] -0.074870491
## [166,] 0.494109086
## [167,] 0.545197171
## [168,] 0.781872692
## [169,] 0.403942107
## [170,] 0.647368857
## [171,] 0.473008996
## [172,] 0.550570931
## [173,] 0.742660609
## [174,] 0.942335298
## [175,] 0.657233176
## [176,] 0.689187477
## [177,] 0.173472036
## [178,] 0.330523514
## [179,] 0.314323785
## [180,] 0.607432619
## [181,] -0.178079267
## [182,] 0.626096387
## [183,] 0.247829672
## [184,] -0.569376422
## [185,] 0.452405418
## [186,] 0.168366407
## [187,] 0.184920069
## [188,] 0.078501245
## [189,] 0.524311259
## [190,] 1.154634409
## [191,] 1.263045956
## [192,] 0.661044287
## [193,] 0.094370026
## [194,] 0.314340744
## [195,] 0.275406680
## [196,] 0.409520411
## [197,] 1.053845540
## [198,] 0.761376903
## [199,] 0.466338622
## [200,] 0.122651783
## [201,] 0.192486248
colnames(y)<-c("Y")
# Plotting the time series plot with ts() where it takes input as matrix. We then sets the start and end times of the time series to the minimum and maximum values of the "time" variable. It also sets the frequency of the time series to 1. And then it creates a plot of the time series "xmat.ts" with a main title "Time series plot of Input Signals", x-axis label "Time", and y-axis label "Input signal".
# Here we set frequency = 1 as with this we are indicating that the observations in the time series are at the same level of granularity, that is at the same unit of time (second).
xmat.ts<-ts(xmat,start = c(min(time),max(time)),frequency =1)
plot(xmat.ts,main = "Time series plot of Input Signals", xlab = "Time", ylab = "Input signal")
# Same as above for output signals Y.
ymat.ts<-ts(ymat,start = c(min(time),max(time)),frequency =1)
plot(ymat.ts,main = "Time series plot of output Signals", xlab = "Time", ylab = "Output signals")
Here, we see that in time is plotted in X-axis, input/output signals in Y-axis.
There is no consistent trend (upward or downward) over the entire time span. The series appears to slowly wander up and down have infrequent peaks. We can say the data show random variation.
In input signals time series plot, we do not see any rapid spikes which shows there is neither external causes nor entry errors could have affected the data. There are no obvious outliers. Also, it’s difficult to judge whether the variance is constant or not.
Input signals after 150ms seems to have lowest sudden changes to trends and reveals considerably less noise and fewer spikes. Also, X4 appears to be the most stable of the four signals, with fewer spikes and a more recognizable pattern.
For output signals we see a drastic down shift after around 60ms and then drastic upward. This change is most lowest hit.
We an use histogram, density plot, box plot or violin plot. Though most popular graph is histogram, however, here we will also be creating density plot along with histogram using ggplot2.
As density plot are better at determining the distribution shape because they’re not affected by the number of bins like in histograms. In order word we can say density plot is the continuous and smoothed version of the histogram. In the choice of kernel function in a density plot depends on the characteristics of the data and the goals of the analysis.
For histogram we are using bin width instead of bin as it is more appropriate method. Also, histograms can give us a rough idea of the overall shape of the distribution and the general tendency of the data, and they are useful when data is discrete. However, here our data is continuous.
In R we do not have direct function to calculate mode so we do it with our own calculation.
# Mode
# In R we do not have direct function to calculate Mode!
# Here table(df$x) function counts the frequency of each unique value in the "x1" column of "df" and we are sorting it with sort function. Now, we select first element of sorted frequency by [1] and using names() we extract the most frequently occurring value, which is the mode value itself. With as.numeric() we convert all this values into numeric.
# Worth to remember,this code will break if the column "x1 or x2 or any" contains any missing value but we have zero NA values.
mode_x1 <- as.numeric(names(sort(table(df$x1),decreasing=TRUE)[1]))
mode_x2 <- as.numeric(names(sort(table(df$x3),decreasing=TRUE)[1]))
mode_x3 <- as.numeric(names(sort(table(df$x3),decreasing=TRUE)[1]))
mode_x4 <- as.numeric(names(sort(table(df$x4),decreasing=TRUE)[1]))
mode_y <- as.numeric(names(sort(table(df$y),decreasing=TRUE)[1]))
#histogram with stats y
ggplot(aes(x = y), data = df) +
geom_histogram(color = 'black', fill = "grey" ,binwidth = 0.5) +
geom_vline(aes(xintercept=median(y),
color="median"), linetype="dashed",
size=1) +
geom_vline(aes(xintercept=mean(y),
color="mean"), linetype="dashed",
size=1) +
geom_vline(aes(xintercept=mode_y,
color="mode"), linetype="dashed",
size=1) +
scale_color_manual(name = "Legend", values = c(median = "blue", mean = "red", mode = "grey"))+
geom_rug()+ #Show the individual observations with black lines
labs(title = "Frequency distribution of the output EEG signal",
caption = "Also called histogram") +
xlab("Output Signal") +
ylab("Frequency ")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
Here we see 0.1 output amplitudes have higher frequency. The spread of the output signals y shows it has some skewness.
#histogram with stats x1
ggplot(aes(x = x1), data = df) +
geom_histogram(color = 'black', fill = "grey", binwidth = 0.5) +
geom_vline(aes(xintercept=median(x1),
color="median"), linetype="dashed",
size=1) +
geom_vline(aes(xintercept=mean(x1),
color="mean"), linetype="dashed",
size=1) +
geom_vline(aes(xintercept=mode_x1,
color="mode"), linetype="dashed",
size=1) +
scale_color_manual(name = "statistics", values = c(median = "blue", mean = "red", mode = "black"))+
geom_rug()+ #Show the individual observations with black lines
labs(title = "Frequency distribution of the inpur x1 EEG signal",
caption = "Also called histogram") +
xlab("Input Signal x1") +
ylab("Frequency ")
#histogram with stats x2
ggplot(aes(x = x2), data = df) +
geom_histogram(color = 'black', fill = "grey", binwidth = 0.5) +
geom_vline(aes(xintercept=median(x2),
color="median"), linetype="dashed",
size=1) +
geom_vline(aes(xintercept=mean(x2),
color="mean"), linetype="dashed",
size=1) +
geom_vline(aes(xintercept=mode_x2,
color="mode"), linetype="dashed",
size=1) +
scale_color_manual(name = "statistics", values = c(median = "blue", mean = "red", mode = "black"))+
geom_rug()+ #Show the individual observations with black lines
labs(title = "Frequency distribution of the inpur x2 EEG signal",
caption = "Also called histogram") +
xlab("Input Signal x2") +
ylab("Frequency ")
#histogram with stats x3
ggplot(aes(x = x3), data = df) +
geom_histogram(color = 'black', fill = "grey", binwidth = 0.5) +
geom_vline(aes(xintercept=median(x3),
color="median"), linetype="dashed",
size=1) +
geom_vline(aes(xintercept=mean(x3),
color="mean"), linetype="dashed",
size=1) +
geom_vline(aes(xintercept=mode_x3,
color="mode"), linetype="dashed",
size=1) +
scale_color_manual(name = "statistics", values = c(median = "blue", mean = "red", mode = "black"))+
geom_rug()+ #Show the individual observations with black lines
labs(title = "Frequency distribution of the inpur x3 EEG signal",
caption = "Also called histogram") +
xlab("Input Signal x3") +
ylab("Frequency ")
#histogram with stats x4
ggplot(aes(x = x4), data = df) +
geom_histogram(color = 'black', fill = "grey", binwidth = 0.5) +
geom_vline(aes(xintercept=median(x4),
color="median"), linetype="dashed",
size=1) +
geom_vline(aes(xintercept=mean(x4),
color="mean"), linetype="dashed",
size=1) +
geom_vline(aes(xintercept=mode_x4,
color="mode"), linetype="dashed",
size=1) +
scale_color_manual(name = "legend", values = c(median = "blue", mean = "red", mode = "black"))+
geom_rug()+ #Show the individual observations with black lines
labs(title = "Frequency distribution of the inpur x4 EEG signal",
caption = "Also called histogram") +
xlab("Input Signal x4") +
ylab("Frequency ")
As mentioned earlier, our goal is to understand the underlying distribution of the EEG signal and compare it across different conditions so a density plot is more appropriate choice.
Density plots can provide a more accurate representation of the underlying distribution and reveal subtle patterns that may not be visible in histograms. Additionally, density plots are less affected by the choice of bin size, which can make it easier to compare the distributions of different variables.They are smoother, which is easier for feeding back into a computer for further calculations. KDE work by passing a strongly sharped peak (kernel function) over each data point on the x-axis (Crompton, 2015).
Also, on general EEG signals often have a non-normal distribution, and if we suspect this is the case, we may want to use a non-parametric density estimator such as kernel density estimation (KDE) instead of a parametric one (Duong, 2007).
ggplot(df, aes(x = y)) +
geom_density(color = 3,
fill = 4,
alpha = 0.25, #alpha is used to set the transparency of the density plot.
kernel = "rectangular") + #he kernel used can also be changed with kernel argument. The possible options are "gaussian" (default), "rectangular"
geom_rug()+ #Show the individual observations with black lines
labs(title = "Spread and concentration of values for output EEG signal",
subtitle = "One mode with left tail extremes") +
xlab("Output Signal") +
ylab("Density ")
Skewness is defined as the measure of the asymmetry of a probability distribution of a real-valued random variable about its mean (Doane and Seward, 2011). The output signal have long tail points left showing skewed left. Also the output y seems to have one mode skewness. Here, we also see a values being located in the lower tail or left-side of the distribution, far away from the mean and this is refereed as left tail extremes. This sort of extremes might cause regression line to be skewed, leading to over or under-estimation of the relationships between variables (von Hippel, 2005).
ggplot(df, aes(x = x1)) +
geom_density(color = 3,
fill = 4,
alpha = 0.25, #alpha is used to set the transparency of the density plot.
kernel = "rectangular") + #he kernel used can also be changed with kernel argument. The possible options are "gaussian" (default), "rectangular"
labs(title = "Spread and concentration of values for input x1 EEG signal",
subtitle = "Also called density plot") +
xlab("Input Signal X1") +
ylab("Density ")
ggplot(df, aes(x = x2)) +
geom_density(color = 3,
fill = 4,
alpha = 0.25, #alpha is used to set the transparency of the density plot.
kernel = "rectangular") + #he kernel used can also be changed with kernel argument. The possible options are "gaussian" (default), "rectangular"
labs(title = "Spread and concentration of values for input x2 EEG signal",
subtitle = "Also called density plot") +
xlab("Input Signal X2") +
ylab("Density ")
ggplot(df, aes(x = x3)) +
geom_density(color = 3,
fill = 4,
alpha = 0.25, #alpha is used to set the transparency of the density plot.
kernel = "rectangular") + #he kernel used can also be changed with kernel argument. The possible options are "gaussian" (default), "rectangular"
labs(title = "Spread and concentration of values for input x3 EEG signal",
subtitle = "Also called density plot") +
xlab("Input Signal X3") +
ylab("Density ")
ggplot(df, aes(x = x4)) +
geom_density(color = 3,
fill = 4,
alpha = 0.25, #alpha is used to set the transparency of the density plot.
kernel = "rectangular") + #he kernel used can also be changed with kernel argument. The possible options are "gaussian" (default), "rectangular"
labs(title = "Spread and concentration of values for input x4 EEG signal",
subtitle = "Also called density plot") +
xlab("Input Signal X4") +
ylab("Density ")
For all input signals we see bell-shaped curve with mean (peak of the is signal) to be around “0”. Also, the highest amplitude of signal is around +6 and lowest is -6. The signals appears. to expand from -6 and shrinks from somewhere between 5-6 in positive range. With this we can say that large number of input signals to be around -6 to 5
As the shape of the distribution starts to expand from -4 and shrink from 5, the majority of the input signal appears to lie between -4 and 5. Here, all the input signal’s tail seems to be nearly touching x-axis which says the distribution is almost symmetrical.
A scatter plot (aka scatter chart, scatter graph) uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. It helps us to see if there are clusters or patterns in the data set. They can also be used to identify outliers and to explore the distribution of the data. When we add a line of best fit to a scatter plot, we can also see the correlation (positive, negative, or zero) between the two variables (Bertini, Correll and Franconeri, 2020).
And a correlation plot, also known as a scatter plot matrix, is a type of scatter plot that shows the relationship between multiple variables. One main reason to use correlation plot is when we have a large number of variables, and we want to see how they are related to each other (Bertini, Correll and Franconeri, 2020). Here, we will be plotting with using pearson correlation coefficient(PCC) also called Pearson Product moment Correlation Coefficient measures the linear relationship between two variables (Rahman and Zhang, 2015).
Both scatter plots and correlation plots can be used to identify if two variables are positively or negatively correlated. A positive correlation means that as one variable increases, the other variable also increases, and a negative correlation means that as one variable increases, the other variable decreases (Taylor, 1990).
In the following scatter plot, to view the correlation, we will add trend line.
A trend line, also known as a line of best fit, is a line that is used to indicate the overall pattern or direction of the data in a scatter plot. The main purpose of adding a trend line to a scatter plot is to visually examine the relationship between the two variables and to make predictions about future observations (Chen, 2019). For example, if a scatter plot shows a positive linear relationship between two variables, a trend line can be used to estimate the value of one variable based on the value of the other variable.
With trend line we will also add a confidence interval (CI).A confidence interval is a range of values that is used to estimate an unknown population parameter. We will be using 95% of CI. A 95% CI for the mean of a population provides a range of plausible values for the true mean of the population (Smith, 2012). If a new sample is taken, on average 95% of such confidence intervals would contain the true population estimate (Bond, 2004).
#Scatter plot for x1 with y
ggplot(df, aes(x = x1, y = y,
colour = y)) +
geom_point(show.legend = TRUE) +
geom_smooth(method=lm, level=0.95)+ #add linear trend line
scale_color_gradient(low = "#EEC9E5", high = "#7C4D79") +
labs(title = "Relationship between input X1 EEG signal & Output Signal",
caption = " Figure showing relationship between input signal x1 with output signal and pearson correlation value(R)") +
xlab("Input Signal X1") +
ylab("Output Signal") +
stat_cor(p.accuracy = 0.05, r.accuracy = 0.01, method = "pearson" )
## `geom_smooth()` using formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
# p.accuracy: The number of decimal places to round the p-value to. The p-value will be rounded to 2 decimal places with 0.05.
# r.accuracy: The number of decimal places to round the correlation coefficient. The correlation coefficient will be rounded to 2 decimal places.
# If method = "pearson" (default), the Pearson correlation coefficient will be calculated, which measures the linear relationship between two variables.
# If method = "spearman", the Spearman rank correlation coefficient will be calculated, which measures the monotonic relationship between two variables.
# If method = "kendall", Kendall's rank correlation coefficient will be calculated, which measures the ordinal association between two variables.
Depending on how tightly the points cluster together, we may be able to discern a clear trend in the data.The closer the data points come to forming a straight line when plotted, the higher the correlation between the two variables, or the stronger the relationship. So here, input signal X1 and output signal tends to have stringer correlation. One top of this, we can see on left upper side pearson correlation value R being 0.80 and p value < 0.05. So, 0.87 indicates a strong positive linear relationship between two continuous variables because the values is closer to 1 (Bertini, Correll and Franconeri, 2020).
#Scatter plot for x2 with y
ggplot(df, aes(x = x2, y = y,
colour = y)) +
geom_point(show.legend = TRUE) +
geom_smooth(method=lm, level=0.95)+ #add linear trend line
scale_color_gradient(low = "#EEDBBD", high = "#9F3632") +
labs(title = "Relationship between input X2 EEG signal & Output Signal") +
xlab("Input Signal X2") +
ylab("Output Signal") +
stat_cor(p.accuracy = 0.05, r.accuracy = 0.01, method = "pearson" )
## `geom_smooth()` using formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
Here, data points are spread out and have some cluster. This graph has too many outliers. Here, the trend line is about to be straight so we can say that the data has a weak or lower correlation between input signal X2 and output signal. On top of this R = 0.19 suggests that there is a weak positive association between the two variables, meaning that as one variable increases, the other variable is likely to increase, but only to a limited extent (Laerd.com, 2018).
#Scatter plot for x3 with y with trend line
ggplot(df, aes(x = x3, y = y,
colour = y)) +
geom_point(show.legend = TRUE) +
geom_smooth(method=lm, level=0.95)+
# stat_cor(method="pearson")+
scale_color_gradient(low = "#D5D5D5", high = "#49525E") +
labs(title = "Relationship between input X3 EEG signal & Output Signal") +
xlab("Input Signal X3") +
ylab("Output Signal") +
stat_cor(p.accuracy = 0.05, r.accuracy = 0.01, method = "pearson" )
## `geom_smooth()` using formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
Here, data points are spread out and have some cluster. We can say that the data has a moderate positive linear relationship between input signal X3 and output signal. Also, R value is 0.53 that indicates a weak positive linear relationship between two continuous variables.
#Scatter plot for x4 with y
ggplot(df, aes(x = x4, y = y,
colour = y)) +
geom_point(show.legend = TRUE) +
geom_smooth(method=lm, level=0.95)+ #add linear trend line
scale_color_gradient(low = "#EEDBBD", high = "#9F3632") +
labs(title = "Relationship between input X4 EEG signal & Output Signal") +
xlab("Input Signal X4") +
ylab("Output Signal") +
stat_cor(p.accuracy = 0.05, r.accuracy = 0.01, method = "pearson" )
## `geom_smooth()` using formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
Here, trend line shows greater slope and pearson coefficient is 0.77. We can say that the data has a strong correlation between input signal X4 and output signal.
From above all scatter plots, we have strong to week correlation R of x1 0.86, x4 0.77, x3 0.53 and x2 0.19.
It is important to remember that correlation does not imply causality (Upadhaya, 2020), but only indicates a relationship between the variables. Further analysis, such as regression analysis, will establish a causal relationship (Constantine, 2012).
It is always best to see bird eye view of the data. We can combine our scatter plot, density plot and correlation coefficient using ggpairs() function.
# Visualizing thr entire data set distribution and correlation coefficient in single view
ggpairs(df)
The variable names are displayed on the outer edges of the matrix. The boxes along the diagonals display the density plot for each variable. The boxes in the lower left corner display the scatter plot between each variable. The boxes in the upper right corner display the pearson correlation coefficient between each variable (Wolfgang Karl Hrdle, 2020). For example, the correlation between time and x1 is -0.017. The star shows how correlated they are.
Overall, the most correlated input parameters with output are x1 with coefficient 0.857, x4 with 0.775 and low positive correlation is of x3 with 0.526, and x2 with 0.194 being least correlated.
Here, we are going to explain the relationship between the input EEG signals and output EEG signals based on the assumption that the relationship can be expressed as polynomial regression model. We are given with 5 different nonlinear polynomial regression models and out of which we need to find the one that is most suitable.
To get such most suitable one, we will be use Akaike information criterion (AIC) and Bayesian information criterion (BIC).
Model 1: 𝑦 = 𝜃1 𝑥4 + 𝜃2 𝑥12 + 𝜃3 𝑥13 + 𝜃4 𝑥24 + 𝜃5 𝑥14 + 𝜃𝑏𝑖𝑎𝑠 + 𝜀
Model 2: 𝑦 = 𝜃1 𝑥4 + 𝜃2 𝑥13 + 𝜃3 𝑥34 + 𝜃𝑏𝑖𝑎𝑠 + 𝜀
Model 3: 𝑦 = 𝜃1 𝑥33 + 𝜃2 𝑥34 + 𝜃𝑏𝑖𝑎𝑠 + 𝜀
Model 4: 𝑦 = 𝜃1 𝑥2 + 𝜃2 𝑥13 + 𝜃4 𝑥34 + 𝜃𝑏𝑖𝑎𝑠 + 𝜀
Model 5: 𝑦 = 𝜃1 𝑥4 + 𝜃2 𝑥12 + 𝜃3 𝑥13 + 𝜃4 𝑥34 + 𝜃𝑏𝑖𝑎𝑠 + 𝜀
Here, from above preliminary analysis, we are not sure which is true value(most suitable) of the model. When we have no idea(unknown) about the true value of that distribution, we use concept of estimator (random variable) (Peterka, 1981). In other word, we are using estimator variable to estimate the true value of the EEG data distribution that relates the input and output variables.
Here, the estimator variable is represented by “θ” and can take on multiple values such as θ1, θ2, …, θbias. Now, the least squares method (LSM) is used to calculate the estimator model parameters for different candidate models of EEG data and the LSM (\(\hat{\theta}\)) is used to estimate the true value of the distribution by minimizing the sum of the squared residuals between the predicted and actual values of the output variable (Björck, 1990) which is expressed by the following formula:
\(\hat{\theta}\) = ( XTX)-1XTy
where:
Now, to calculate the least squares, we first need to format the input data by binding the appropriate columns or values from the EEG data set. With the function cbind(), we bind the input data using following code: x_model1 <- cbind(ones, x[,4], x[,1]^2, x[,1]^3, x[,3]^4)
where:
Once the input data is correctly formatted, we can then use the least squares formula as mentioned above and using the built-in solve linear equations function called solve(), we find the \(\hat{\theta}\). We use solve() because it’s more efficient and less error-prone (Schork, n.d.).
Model 1: 𝑦 = 𝜃1 𝑥4 + 𝜃2 𝑥12 + 𝜃3 𝑥13 + 𝜃4 𝑥24 + 𝜃5 𝑥14 + 𝜃𝑏𝑖𝑎𝑠 + 𝜀
# Calculating ones for binding the data
# Here a matrix is created where all the elements is set to 1 (first argument), with the same number of rows as the input variable "x" and only one column added which is set by third argument and the resulting object is assigned to the variable "ones".
ones = matrix(1 , length(x)/4,1)
# cbind() is used to bind the matrix "ones" and other variables that are present in that model 1, together to create a new data frame, "x_model_1".
#The resulting data frame "x_model_1" will have the same number of rows as the input variable "x" but new colum will be added,
x_model_1<-cbind(ones,(x[,"X4"]),(x[,"X1"])^2,(x[,"X1"])^3,(x[,"X2"])^4,(x[,"X1"])^4)
# Printing thus binded new dataframe
head(x_model_1)
## ones X4 X1 X1 X2 X1
## 1 1 -0.288894 9.4483694 -29.04258681 11.860069607 89.27168418
## 2 1 0.577845 0.1218820 0.04255095 13.562951639 0.01485522
## 3 1 -2.697380 1.0194941 -1.02938318 0.005510835 1.03936820
## 4 1 -1.622270 0.6522307 -0.52674672 6.255396116 0.42540486
## 5 1 1.558220 4.3232806 8.98918111 0.304449766 18.69075482
## 6 1 2.381060 10.5573006 34.30278124 14.299803148 111.45659680
Now after that we will calculate estimated values of the model parameter \(\hat{\theta}\) using the above mentioned formula.
# We are converting thus obtained dataframe x_model_1 into a matrix using the function as.matrix(). The transpose of the matrix is taken using the function t(). The operator %*% is used for matrix multiplication under solve() function.
theta_hat_1 <- solve(t(as.matrix(x_model_1)) %*% as.matrix(x_model_1)) %*% t(as.matrix(x_model_1)) %*% as.matrix(y)
# Viewing the value of theta_hat_1
print(theta_hat_1)
## Y
## ones 0.4015807794
## X4 0.1277101097
## X1 -0.0002902170
## X1 0.0096688129
## X2 -0.0004098917
## X1 -0.0001543367
Similarly, as we have 5 models so for each models, we will follow the same process - creating cbind, calculating estimated values of the model parameter \(\hat{\theta}\) and printing the value of theta_hat.
Model 2: 𝑦 = 𝜃1 𝑥4 + 𝜃2 𝑥13 + 𝜃3 𝑥34 + 𝜃𝑏𝑖𝑎𝑠 + 𝜀
# Binding for model 2
x_model_2<-cbind(ones,(x[,"X4"]),(x[,"X1"])^3,(x[,"X3"])^4)
# Calculating theta_hat_2
theta_hat_2 <- solve(t(as.matrix(x_model_2)) %*% as.matrix(x_model_2)) %*% t(as.matrix(x_model_2)) %*% as.matrix(y)
# Viewing the value of theta_hat_2
print(theta_hat_2)
## Y
## ones 0.483065688
## X4 0.143578928
## X1 0.010038614
## X3 -0.001912836
Model 3: 𝑦 = 𝜃1 𝑥33 + 𝜃2 𝑥34 + 𝜃𝑏𝑖𝑎𝑠 + 𝜀
# Binding for model 3
x_model_3<-cbind(ones,(x[,"X3"])^3,(x[,"X3"])^4)
# Calculating theta_hat
theta_hat_3 <- solve(t(as.matrix(x_model_3)) %*% as.matrix(x_model_3)) %*% t(as.matrix(x_model_3)) %*% as.matrix(y)
# Viewing the value of theta_hat_3
print(theta_hat_3)
## Y
## ones 0.340561975
## X3 0.021330543
## X3 -0.002857744
Model 4: 𝑦 = 𝜃1 𝑥2 + 𝜃2 𝑥13 + 𝜃4 𝑥34 + 𝜃𝑏𝑖𝑎𝑠 + 𝜀
# Binding for model 4
x_model_4<-cbind(ones,(x[,"X2"]),(x[,"X1"])^3,(x[,"X3"])^4)
# Calculating theta_hat
theta_hat_4 <- solve(t(as.matrix(x_model_4)) %*% as.matrix(x_model_4)) %*% t(as.matrix(x_model_4)) %*% as.matrix(y)
# Viewing the value of theta_hat_4
print(theta_hat_4)
## Y
## ones 0.509013488
## X2 0.053322048
## X1 0.012067145
## X3 -0.001855997
Model 5: 𝑦 = 𝜃1 𝑥4 + 𝜃2 𝑥12 + 𝜃3 𝑥13 + 𝜃4 𝑥34 + 𝜃𝑏𝑖𝑎𝑠 + 𝜀
# Binding for model 5
x_model_5 <- cbind(ones,(x[,"X4"]),(x[,"X1"])^2,(x[,"X1"])^3,(x[,"X3"])^4)
# Calculating theta_hat
theta_hat_5 <- solve(t(as.matrix(x_model_5)) %*% as.matrix(x_model_5)) %*% t(as.matrix(x_model_5)) %*% as.matrix(y)
# Viewing the value of theta_hat_5
print(theta_hat_5)
## Y
## ones 0.4798284629
## X4 0.1434460704
## X1 0.0003254641
## X1 0.0100562759
## X3 -0.0019198749
With these value, we completed estimating model parameter (\(\hat{\theta}\)), Now, we will compute Sum Squared Error (SSR) of an estimator which is also called model residual errors (RSS).
The residual sum of squares (RSS) also termed as sum of squared errors of prediction (SSE) or sum of squared residuals (SSR) is a measure of discrepancy between the data and an estimation model. It is calculated by subtracting average square of the actual values and the estimated values of the dependent variable, based on the model parameters (Allen, 1971).
We usually want to minimize the error thus the smaller the error, the better the estimation power of the regression and better fit of the model. On the other hand, a larger RSS indicates a worse fit of the model to the data (Barone, 2022). It is worth to mention that the RSS is never negative, because the squares of the residuals are always non-negative (Valchanov, 2018).
To calculate RSS, we first have to calculate error of every models 1-5 with the help of (\(\hat{\theta}\)) that we calculated in Task 2.1 and RSS is mathematically presented as:
RSS = \[\sum_{i=1}^{n}(y_i - x_i\hat{\theta})^2\]
where:
# Calculating Y-hat and RSS model 1
# Before that we convert our model and theta hat into matrix
x_model_1 <- as.matrix(x_model_1)
theta_hat_1 <- as.matrix(theta_hat_1)
Y_hat_model_1 <- x_model_1 %*% theta_hat_1
# Calculating RSS using above mentioned formula
RSS_model_1 <- sum((y - Y_hat_model_1)^2)
# printing RSS value for model 1
print(sprintf("RSS value of the model 1 is %0.4f", RSS_model_1))
## [1] "RSS value of the model 1 is 35.3966"
Similarly, we will calculate RSS value for each models
# Calculating Y-hat and RSS model 2
# Before that we convert our model and theta hat into matrix
x_model_2 <- as.matrix(x_model_2)
theta_hat_2 <- as.matrix(theta_hat_2)
Y_hat_model_2 <- x_model_2 %*% theta_hat_2
# Calculating RSS using above mentioned formula
RSS_model_2 <- sum((y - Y_hat_model_2)^2)
# printing RSS value for model 2
#print(paste("RSS value of the Model 2 is", RSS_model_2))
print(sprintf("RSS value of the model 2 is %0.4f", RSS_model_2))
## [1] "RSS value of the model 2 is 2.1398"
# Calculating Y-hat and RSS model 3
# Before that we convert our model and theta hat into matrix
x_model_3 <- as.matrix(x_model_3)
theta_hat_3 <- as.matrix(theta_hat_3)
Y_hat_model_3 <- x_model_3 %*% theta_hat_3
# Calculating RSS using above mentioned formula
RSS_model_3 <- sum((y - Y_hat_model_3)^2)
# printing RSS value for model 3
print(sprintf("RSS value of the model 3 is %0.4f", RSS_model_3))
## [1] "RSS value of the model 3 is 463.3124"
# Calculating Y-hat and RSS model 4
# Before that we convert our model and theta hat into matrix
x_model_4 <- as.matrix(x_model_4)
theta_hat_4 <- as.matrix(theta_hat_4)
Y_hat_model_4 <- x_model_4 %*% theta_hat_4
# Calculating RSS using above mentioned formula
RSS_model_4 <- sum((y - Y_hat_model_4)^2)
# printing RSS value for model 4
print(sprintf("RSS value of the model 4 is %0.4f", RSS_model_4))
## [1] "RSS value of the model 4 is 20.2590"
# Calculating Y-hat and RSS model 5
# Before that we convert our model and theta hat into matrix
x_model_5 <- as.matrix(x_model_5)
theta_hat_5 <- as.matrix(theta_hat_5)
Y_hat_model_5 <- x_model_5 %*% theta_hat_5
# Calculating RSS using above mentioned formula
RSS_model_5 <- sum((y - Y_hat_model_5)^2)
# printing RSS value for model 5
print(sprintf("RSS value of the model 5 is %0.4f", RSS_model_5))
## [1] "RSS value of the model 5 is 2.1355"
So, we have
| Models | RSS value |
|---|---|
| Model 1 | 35.3966 |
| Model 2 | 2.1398 |
| Model 3 | 463.3124 |
| Model 4 | 20.2590 |
| Model 5 | 2.1355 |
Here, the lowest RSS value in the table is 2.1355, which is associated with model 5, followed by second lowest RSS value is 2.1398, which is associated with model 2.
Now, our objective is to identify how well the measured value fits the sample data of a provided model when parameters are unknown. To meet our objective we are going to calculate log-likelihood functions for a linear regression model using RSS that we obtained from Task 2.2.
Log-likelihood is a way to measure the goodness of fit for a model (Zach, 2021) and used to simplify the optimization problem and avoid numerical underflow or overflow. Mathematically presented as:
where:
n is the sample size (here, Y output signal)
(\(\hat{\theta}\))2 is the variance of the errors (also known as the residual variance) (Task 2.2)
ln is the natural logarithm
ln p(D|\(\hat{\theta}\))” is log-likelihood
\(\pi\) value is 2.27
RSS is value from Task 2.2
In this task, our goal is to find the set of parameters that maximizes the probability of the observations. As the nature of the log-likelihood function is that it increases monotonically (non-decreasing) and has no local maxima, it is suitable for identifying how well the measured value fits (Stephanie, 2021). In layman’s terms, monotonically increasing means that as the value of the independent variable (say x) increases, so does the value of the function (say y), i.e., as x increases, y can only increase and cannot ever decrease.
So, here as the value of the log-likelihood increases, the likelihood of the data given the model parameters also increases. Therefore, finding the maximum of the log-likelihood function is the same as finding the maximum of the likelihood function but if we go with just likelihood then the concave nature of log is missing and we cannot get the one global maxima (Music, 2020).
Therefore using the above formula, we will first calculate variance of model using RSS along with length of the Y signal and then calculate log-likelihood function.
# Calculating length of the output signal y with nrow() as our data are in matrix format and storing it in N
N <- nrow(y)
# Calculating the variance of model 1
Variance_model_1 = RSS_model_1/(N-1)
# Printing variance of model 1
print(sprintf("Variance of model 1 is %0.4f", Variance_model_1))
## [1] "Variance of model 1 is 0.1770"
# Calculating the log-likelihood of model 1 using model residual error (RSS)
likehood_model_1 <- -(N/2)*(log(2*pi))-(N/2)*(log(Variance_model_1))-(1/(2*Variance_model_1))*RSS_model_1
# Printing log likelihood function of model 1
print(sprintf("Log-likelihood of model 1 is %0.4f", likehood_model_1))
## [1] "Log-likelihood of model 1 is -110.6707"
Here, absolute value of log-likelihood of model 1 is 110.6707, which is large and indicates that the model 1 is a poor fit to the data.
Now similar action for all model are performed except counting N.
# Calculating the variance of model 2
Variance_model_2 = RSS_model_2/(N-1)
# Printing variance of model 2
print(sprintf("Variance of model 2 is %0.4f", Variance_model_2))
## [1] "Variance of model 2 is 0.0107"
# Calculating the log-likelihood of model 2 using model residual error (RSS)
likehood_model_2 <- -(N/2)*(log(2*pi))-(N/2)*(log(Variance_model_2))-(1/(2*Variance_model_2))*RSS_model_2
# Printing log likelihood function of model 2
print(sprintf("Log-likelihood of model 2 is %0.4f", likehood_model_2))
## [1] "Log-likelihood of model 2 is 171.3245"
# Calculating the variance of model 3
Variance_model_3 = RSS_model_3/(N-1)
# Printing variance of model 3
print(sprintf("Variance of model 3 is %0.4f", Variance_model_3))
## [1] "Variance of model 3 is 2.3166"
# Calculating the log-likelihood of model 3 using model residual error (RSS)
likehood_model_3 <- -(N/2)*(log(2*pi))-(N/2)*(log(Variance_model_3))-(1/(2*Variance_model_3))*RSS_model_3
# Printing log likelihood function of model 3
print(sprintf("Log-likelihood of model 3 is %0.4f", likehood_model_3))
## [1] "Log-likelihood of model 3 is -369.1351"
# Calculating the variance of model 4
Variance_model_4 = RSS_model_4/(N-1)
# Printing variance of model 4
print(sprintf("Variance of model 4 is %0.4f", Variance_model_4))
## [1] "Variance of model 4 is 0.1013"
# Calculating the log-likelihood of model 4 using model residual error (RSS)
likehood_model_4 <- -(N/2)*(log(2*pi))-(N/2)*(log(Variance_model_4))-(1/(2*Variance_model_4))*RSS_model_4
# Printing log likelihood function of model 4
print(sprintf("Log-likelihood of model 4 is %0.4f", likehood_model_4))
## [1] "Log-likelihood of model 4 is -54.5899"
# Calculating the variance of model 5
Variance_model_5 = RSS_model_5/(N-1)
# Printing variance of model 5
print(sprintf("Variance of model 5 is %0.4f", Variance_model_5))
## [1] "Variance of model 5 is 0.0107"
# Calculating the log-likelihood of model 5 using model residual error (RSS)
likehood_model_5 <- -(N/2)*(log(2*pi))-(N/2)*(log(Variance_model_5))-(1/(2*Variance_model_5))*RSS_model_5
# Printing log likelihood function of model 5
print(sprintf("Log-likelihood of model 5 is %0.4f", likehood_model_5))
## [1] "Log-likelihood of model 5 is 171.5247"
Based on the calculation of log-likelihood function of each model:
| Models | Variance | log-likelihood function |
|---|---|---|
| Model 1 | 0.1770 | -110.6707 |
| Model 2 | 0.0107 | 171.3245 |
| Model 3 | 2.3166 | -369.1351 |
| Model 4 | 0.1013 | -54.5899 |
| Model 5 | 0.0107 | 171.5247 |
We have found that except for model 2 and 5, all others value are negative and in these cases, a higher log-likelihood indicates a better fit, so model 2 and model 5 have the largest log-likelihoods among the models, which indicates that they fit the data best.
Now as we have RSS and log-likelihood value, we need model selection criteria method for which we will work with Akaike information criterion (AIC) and Bayesian Information Criteria (BIC). According t o(Trevor Hastie, Tibshirani and Friedman, 2009), model selection is estimating the performance of various candidate models with the sole objective of choosing the best one.
Both of them can be used to compare different models and choose the best one. They are both based on the likelihood of the model given the data and the number of parameters in the model. However, the main difference between these two model selection method is that AIC gives less penalty to models with more parameters compared to BIC (Brownlee, 2019).
First, we will start computing AIC for each models.
Akaike information criterion (AIC is a statistical method which aims is to determine the model that truly explains the variance in the dependent variable with the fewest number of independent variables (parameters) (Manikantan, 2021). With this, it helps to select a simpler model containing fewer parameters over a complex model - more parameters.
Using the maximum likelihood estimate (we found the value in Task 2.3), the relative information value of the model and the number of parameters in the model are determined. The formula for AIC is expressed as:
AIC = 2k - 2ln \(\hat{L}\)
Where:
The major goal of applying AIC in this situation is to eliminate the problem of overfitting because AIC penalizes models with more parameters (independent variables) and so balances the trade-off between a model’s goodness of fit and complexity (Bevans, 2020).
According to (Zach, 2021b), one thing worth mentioning is that the model fits the data better when the AIC value is lower. The AIC value’s absolute value is not so important; it could be favorable or unfavorable.
Using the above formula of AIC, we will be calculating the value for each model.
# Calculating AIC for model 1.
# Here we are finding value of K with length() function.
AIC_1 <- 2* length(x_model_1[1,]) - 2 * likehood_model_1
print(sprintf("Length of parameter to be estimated in model 1 is %d", length(x_model_1[1,])))
## [1] "Length of parameter to be estimated in model 1 is 6"
# Printing AIC of model 1
print(sprintf("AIC of model 1 is %0.4f", AIC_1))
## [1] "AIC of model 1 is 233.3414"
# Calculating AIC for model 2.
AIC_2 <- 2* length(x_model_2[1,]) - 2 * likehood_model_2
print(sprintf("Length of parameter to be estimated in model 2 is %d", length(x_model_2[1,])))
## [1] "Length of parameter to be estimated in model 2 is 4"
# Printing AIC of model 2
print(sprintf("AIC of model 2 is %0.4f", AIC_2))
## [1] "AIC of model 2 is -334.6489"
# Calculating AIC for model 3.
AIC_3 <- 2* length(x_model_3[1,]) - 2 * likehood_model_3
print(sprintf("Length of parameter to be estimated in model 3 is %d", length(x_model_3[1,])))
## [1] "Length of parameter to be estimated in model 3 is 3"
# Printing AIC of model 3
print(sprintf("AIC of model 3 is %0.4f", AIC_3))
## [1] "AIC of model 3 is 744.2702"
# Calculating AIC for model 4.
AIC_4 <- 2* length(x_model_4[1,]) - 2 * likehood_model_4
print(sprintf("Length of parameter to be estimated in model 4 is %d", length(x_model_4[1,])))
## [1] "Length of parameter to be estimated in model 4 is 4"
# Printing AIC of model 4
print(sprintf("AIC of model 4 is %0.4f", AIC_4))
## [1] "AIC of model 4 is 117.1799"
# Calculating AIC for model 5.
AIC_5 <- 2* length(x_model_5[1,]) - 2 * likehood_model_5
print(sprintf("Length of parameter to be estimated in model 5 is %d", length(x_model_5[1,])))
## [1] "Length of parameter to be estimated in model 5 is 5"
# Printing AIC of model 5
print(sprintf("AIC of model 5 is %0.4f", AIC_5))
## [1] "AIC of model 5 is -333.0493"
Here we have,
| Models | Length of parameter | AIC |
|---|---|---|
| Model 1 | 6 | 233.3414 |
| Model 2 | 4 | -334.6489 |
| Model 3 | 3 | 744.2702 |
| Model 4 | 4 | 117.1799 |
| Model 5 | 5 | -333.0493 |
As we recall, AIC values less than zero indicate that the model is over fitting the data, and AIC values greater than zero indicate that the model is under fitting the data. AIC values closer to zero indicate that the model is a better fit to the data and here model 2 with AIC of -334.6489 is the best fit among the models listed.
However, we will look in next criteria method called Bayesian Information Criteria.
Now, we will calculate Bayesian Information Criteria (BIC). As mentioned earlier, BIC is similar to AIC but in BIC will give a greater penalty on models with more parameters (Datacadamia, 2014). Similar to AIC, lower BIC values indicate better model fit. The formula for BIC is expresses as:
BIC = kln(n) - 2ln \(\hat{L}\)
Where:
Now, we will calculate BIC for each models.
# Calculating BIC for model 1
BIC_1 <- length(x_model_1[1,]) * log(N) - 2 * likehood_model_1
# Printing BIC of model 1
print(sprintf("BIC of model 1 is %0.4f", BIC_1))
## [1] "BIC of model 1 is 253.1613"
# Calculating BIC for model 2
BIC_2 <- length(x_model_2[1,]) * log(N) - 2 * likehood_model_2
# Printing BIC of model 2
print(sprintf("BIC of model 2 is %0.4f", BIC_2))
## [1] "BIC of model 2 is -321.4357"
# Calculating BIC for model 3
BIC_3 <- length(x_model_3[1,]) * log(N) - 2 * likehood_model_3
# Printing BIC of model 3
print(sprintf("BIC of model 3 is %0.4f", BIC_3))
## [1] "BIC of model 3 is 754.1801"
# Calculating BIC for model 4
BIC_4 <- length(x_model_4[1,]) * log(N) - 2 * likehood_model_4
# Printing BIC of model 4
print(sprintf("BIC of model 4 is %0.4f", BIC_4))
## [1] "BIC of model 4 is 130.3931"
# Calculating BIC for model 5
BIC_5 <- length(x_model_5[1,]) * log(N) - 2 * likehood_model_5
# Printing BIC of model 5
print(sprintf("BIC of model 5 is %0.4f", BIC_5))
## [1] "BIC of model 5 is -316.5328"
Here, BIC of each models is:
| Models | BIC |
|---|---|
| Model 1 | 253.1613 |
| Model 2 | -321.4357 |
| Model 3 | 754.1801 |
| Model 4 | 130.3931 |
| Model 5 | -316.5328 |
Here, similar to AIC, BIC with lower value is best fit so model 2 with BIC of -321.4357 is the best fit among the models listed.
Now as we have obtained AIC and BIC value, we are interested in viewing error distribution. After all our goal is to pick one that is showing less error. Before going with the decision of graphic the distribution, we need to calculate error of each model. Then, we will. use Q-Q plot (quantile-quantile plot) to visualize and compare two probability distributions.
## Error of models 1-5 based on calculation form Task 2.2
model_1_error <- y - Y_hat_model_1
model_2_error <- y - Y_hat_model_2
model_3_error <- y - Y_hat_model_3
model_4_error <- y - Y_hat_model_4
model_5_error <- y - Y_hat_model_5
Now, as we have the model error for each models, we are going to plot Q-Q plot. According to (Clay Ford, 2015), Q-Q plot is formed by plotting two quantile sets against each other. In the case of the same distribution, both sets of quantiles would form a relatively straight line, however in practise this doesnot mean hard and fast rule. In the same plot we are going to add a reference line to called Q-Q line, which is the line of perfect fit for a normal distribution.
The qqline() function takes two arguments, the first is the data- model’s prediction errors and the second is the color (col), the width of the line (lw) along with dashed line(lty).
As from the beginning we are assuming the data are independent and identically distributed i.e. Gaussian with zero-mean, so we are using qqnorm() function to plot the Q-Q plot.
# Q-Q plots of prediction error for model 1
qqnorm(t(model_1_error),col = "grey", main = "Q-Q Plot for Model 1" )
qqline(model_1_error, col = "red", lwd = 1,lty = 2)
# Q-Q plots of prediction error for model 2
qqnorm(t(model_2_error), col= "grey", main = "Q-Q Plot for Model 2" )
qqline(model_2_error, col = "red", lwd = 1,lty = 2)
Q-Q plot for model 1 shows that most of the data follows Q-Q line (red color) thus we can say data follows a normal distribution.
# Q-Q plots of prediction error for model 2
qqnorm(t(model_2_error),col= "grey", main = "Q-Q Plot for Model 2" )
qqline(model_2_error, col = "red", lwd = 1,lty = 2)
Q-Q plot for model 2 shows that most of the data follows Q-Q line (red color) thus we can say data follows a normal distribution.
# Q-Q plots of prediction error for model 3
qqnorm(t(model_3_error), col= "grey", main = "Q-Q Plot for Model 3" )
qqline(model_3_error, col = "red", lwd = 1,lty = 2)
Q-Q plot for model 3 shows that most of the data doesn’t follows Q-Q line (red color) thus we can say data doesn’t follows a normal distribution and it is skewed.
# Q-Q plots of prediction error for model 4
qqnorm(t(model_4_error),col= "grey", main = "Q-Q Plot for Model 4" )
qqline(model_4_error, col = "red", lwd = 1, lty = 2)
Q-Q plot for model 4 shows that most of the data follows Q-Q line (red color) thus we can say data follows a normal distribution.
# Q-Q plots of prediction error for model 5
qqnorm(t(model_5_error), col= "grey",main = "Q-Q Plot for Model 5" )
qqline(model_5_error, col = "red", lwd = 1,lty = 2)
Q-Q plot for model 5 shows that most of the data follows Q-Q line (red color) thus we can say data follows a normal distribution.
Here, with the obtained Q-Q plot we simply visually check if a data set follows a theoretical distribution or not. To formally test whether or not a data set follows a particular distribution, we need to go one step further.
By completing tasks 2.1 to 2.5, we have gathered all the necessary information to select the best candidate model. By calculating the RSS, log-likelihood function, plotting normal distribution graphs, and comparing the AIC and BIC values, we have all the information to been identify the best model for our data.
As previously mentioned, AIC and BIC are widely used for model selection because they help to minimize the score of error while selecting a model. In both AIC and BIC, the lowest value is most likely to be best fit, so looking at the data,
For Model 1:
AIC = 233.3414
BIC = 253.1612
Q-Q Plot View = follows a normal distribution
For Model 2:
AIC = -334.6489
BIC = -321.4356
Q-Q Plot View = follows a normal distribution
For Model 3:
AIC = 744.2701
BIC = 754.1801
Q-Q Plot View = doesn't follows a normal distribution and it is skewed
For Model 4:
AIC = 117.1798
BIC = 130.3931
Q-Q Plot View = follows a normal distribution
For Model 5:
AIC = -333.0493
BIC = -316.5327
Q-Q Plot View = follows a normal distribution
Here, the best model fit based on AIC and BIC would be model 2 as it has lowest value. To verify the selected model 2 is good candidate, we will look at the Q-Q plot.
Looking at the Q-Q plot, except model 3, all models seems to have same nature, however, looking at the position of Q-Q line, the model 2 seems to be suitable one.
To get more incline to decision of picking up model 2, we would like to plot histogram to show distribution of residuals. For easy viewing, we will plot all the histogram into 3 rows and 2 column using par(mfrow = c(3,2))
# Distribution of prediction error using histogram
par(mfrow = c(3,2))
hist (model_1_error[,1], freq = FALSE, col="blue", las =1)
abline(v = median(model_1_error[,1]), col = "grey", lwd = 5)
abline(v = mean(model_1_error[,1]), col = "purple", lwd = 5)
hist (model_2_error[,1], freq = FALSE, col="green", las =1)
abline(v = median(model_2_error[,1]), col = "grey", lwd = 5)
abline(v = mean(model_2_error[,1]), col = "purple", lwd = 5)
# abline(v = getmode(model_2_error[,1]), col = "red", lwd = 5)
hist (model_3_error[,1], freq = FALSE, col="orange", las =1)
abline(v = median(model_3_error[,1]), col = "grey", lwd = 5)
abline(v = mean(model_3_error[,1]), col = "purple", lwd = 5)
hist (model_4_error[,1], freq = FALSE, col="yellow", las =1)
abline(v = median(model_4_error[,1]), col = "grey", lwd = 5)
abline(v = mean(model_4_error[,1]), col = "purple", lwd = 5)
hist (model_5_error[,1], freq = FALSE, col="pink", las =1)
abline(v = median(model_5_error[,1]), col = "grey", lwd = 5)
abline(v = mean(model_5_error[,1]), col = "purple", lwd = 5)
hist(0, main = "color code")
legend("center", legend = c("median","mean"),
lwd = 1, col = c("grey", "purple"))
Looking at the distribution of each mode, model 2, 5 both seems to have normal distribution.
Further more, we will go one step extra and see if model 2 is suitable than other. In this case, we will consider additional factors to determine the best model. Here, based on interpretability of the model, i.e., a simpler model with fewer parameters is way easier to interpret and understand (De’ath and Fabricius, 2000), so we will see the number of parameter in each models.
Looking at the each length of parameter, the lowest is model 3 with 3 numbers of parameter, but its doesn’t follows a normal distribution and it is skewed and the next is model 4 with number of 4 parameters however its AIC and BIC is greater than the model 2.
So as a conclusion of AIC, BIC, Q-Q plot and extra interpretability, we have picked up model 2 as the best fit model which is expressed as:
Model 2: 𝑦 = 𝜃1 𝑥4 + 𝜃2 𝑥13 + 𝜃3 𝑥34 + 𝜃𝑏𝑖𝑎𝑠 + 𝜀
Now, as we picked up best suitable model 2, now we are interested on training and testing of the data. For this, we will divide data set into 70% training and 30% testing set. We are using 70/30 ratio because it provides a good balance between having enough data (70%) for the model to learn from during training and having quite reasonable (30%) data to test (Breiman and Spector, 1992) (Nguyen et al., 2021).
So, using initial_split () function, we will split data in 70/30 by providing the numbers into prop argument of the function for both y and x signals.
# Splitting input signals x
split_x <- initial_split(data = as.data.frame(x),prop=.7)
# Training data for input signals x are split and thus splitted training set is converted to matrix form ans assigned to the variable x_training_data
x_training_set <- training(split_x)
x_training_data <- as.matrix(x_training_set)
# Testing data are split and thus splitted testing set is converted to matrix form ans assigned to the variable x_testing_data
x_testing_set <- testing(split_x)
x_testing_data <- as.matrix(x_testing_set)
# Splitting the data of output signals y into train and test
split_y <- initial_split(data = as.data.frame(y),prop=.7)
# Training data for output signals y are split and thus splitted training set is converted to matrix form ans assigned to the variable y_training_data
y_training_set <- training(split_y)
y_training_data <- as.matrix(y_training_set)
# Testing data for output signals y are splitted and thus splitted testing set is converted to matrix form ans assigned to the variable y_testing_data
y_testing_set <- testing(split_y)
y_testing_data <- as.matrix(y_testing_set)
Following the same approach as per Task 2.1, we will estimate model parameter for our selected model 2, which equation is as per:
Model 2: 𝑦 = 𝜃1 𝑥4 + 𝜃2 𝑥13 + 𝜃3 𝑥34 + 𝜃𝑏𝑖𝑎𝑠 + 𝜀
# Estimating model parameters using training set
training_ones <- matrix(1, length(x_testing_set$X1),1)
# cbind() is used to bind the matrix "ones" and other variables that are present in that model 1, together to create a new data frame, "x_model_1".
#The resulting data frame "x_model_1" will have the same number of rows as the input variable "x" but new colum will be added,
training_model_x <- cbind(training_ones,(x_testing_set[,"X4"]),(x_testing_set[,"X1"])^3,(x_testing_set[,"X3"])^4)
training_thetahat <- solve(t(training_model_x) %*% training_model_x) %*% t(training_model_x) %*% y_testing_data
Following the same approach as per Task 2.2, we will calculate the model’s output/prediction on the testing data and present RSS value.
# y_testing_hat is being calculated as the product of the x_testing_data matrix and the estimated coefficients (training_thetahat) of the linear regression model.
# Here we get predicted values of y for the testing data.
y_testing_hat <- x_testing_data %*% training_thetahat
# Calculating residual sum of squares (RSS) for the testing data
RSS_testing <- sum((y_testing_set-y_testing_hat)^2)
# Printing RSS value of testing
print(sprintf("RSS value is testing data %0.4f", RSS_testing))
## [1] "RSS value is testing data 316.1386"
Following the same approach as per Task 2.2, we will calculate the model’s output/prediction on the testing data and present RSS value. To do so, we will perform two-tailed t-test on the training data (y_training_data) with the null hypothesis being that the mean is equal to 700 and the alternative hypothesis being that the mean is not equal to 700 The mean of 700 is used here as a hypothetical or reference value for comparison.
And we have picked “two.sided” t.test instead of “greater” or “less” because the we are interested on knowing whether the true mean is different from 700, in either direction. The test will tell us whether the mean is different, but it will not specify in which direction (Frost, 2018).
The confidence level is set to 0.95 (95%) (Sedgwick, 2014). This t-test will give the p-value, t-value, and degrees of freedom for the test, which will be helpful to determine whether or not to reject the null hypothesis.
# Performing t-test on the training data. Here, mean is taken as a null hypothesis which value is equal to 700,
# "two.sided" is used which says that the true mean is not equal to the hypothesized mean and setting confidence interval 95%
t.test(y_testing_data, mu=700, alternative="two.sided", conf.level=0.95)
##
## One Sample t-test
##
## data: y_testing_data
## t = -2456.8, df = 60, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 700
## 95 percent confidence interval:
## -0.5785278 0.5613553
## sample estimates:
## mean of x
## -0.00858624
Worth to note that each time we run the t-test, the value of CI get changes.
Here, t-value is -3301.6 suggest that means that the sample mean of y_training_data is very far from the hypothesized mean of 700 following a negative direction (minus sign). This provides strong evidence that the true mean of y_training_data is not equal to 700.
Similar, with the p-value being less than 2.2e-16 (is much less than 0.05), which indicates that there is strong evidence against the null hypothesis (Di Leo and Sardanelli, 2020) and in favor of the alternative hypothesis (that the true mean is not equal to 700).
This suggests that the mean of y_training_data is significantly different from 700 at a 95% confidence level, so it suggests that the null hypothesis (that the mean of y_training_data is equal to 700) is not likely to be true. In other word, that the mean of y_training_data is equal to 700 (null hypothesis) is not likely to be true.
The 95% confidence intervals - lowest and upper limit value are -0.1625241, 0.6853514, which says that there is a 95% chance that the true mean of y_training_data falls within this interval.
The sample estimate for the mean of y_training_data is 0.1731682.
# Setting CI value based on t-test at the time of run
C_I1 <- -0.1625241
C_I2 <- 0.6853514
ggplot(data = data.frame(y_training_data), aes(x = y_training_data)) +
geom_density(col = "black", fill = "black" , lw=1) +
geom_vline(xintercept = C_I1, col = "cyan", linetype = "dashed") +
geom_vline(xintercept = C_I2, col = "cyan", linetype = "dashed") +
geom_rug()+
ggtitle("Distribution of training Y data with 95% confidence intervals")+
xlab("Y Training Data") +
ylab("Density")
## Warning in geom_density(col = "black", fill = "black", lw = 1): Ignoring unknown
## parameters: `lw`
theme(legend.title = element_blank())
## List of 1
## $ legend.title: list()
## ..- attr(*, "class")= chr [1:2] "element_blank" "element"
## - attr(*, "class")= chr [1:2] "theme" "gg"
## - attr(*, "complete")= logi FALSE
## - attr(*, "validate")= logi TRUE
thetaHat_training <- solve(t(x_training_data) %*% x_training_data) %*% t(x_training_data) %*% y_training_data
thetaHat_training
## Y
## X1 -0.013274721
## X2 -0.099169428
## X3 -0.009535589
## X4 0.058854194
length(thetaHat_training)
## [1] 4
# Plotting distribution of testing data
ggplot(data = data.frame(y_testing_data), aes(x = y_testing_data)) +
geom_density(col = "black", fill = "black" , lw=1) +
geom_vline(xintercept = C_I1, col = "cyan", linetype = "dashed") +
geom_vline(xintercept = C_I2, col = "cyan", linetype = "dashed") +
geom_rug()+
ggtitle("Distribution of testing Y data with 95% confidence intervals")+
xlab("Y testing data") +
ylab("Density")
## Warning in geom_density(col = "black", fill = "black", lw = 1): Ignoring unknown
## parameters: `lw`
theme(legend.title = element_blank())
## List of 1
## $ legend.title: list()
## ..- attr(*, "class")= chr [1:2] "element_blank" "element"
## - attr(*, "class")= chr [1:2] "theme" "gg"
## - attr(*, "complete")= logi FALSE
## - attr(*, "validate")= logi TRUE
Now, we will calculate error - which is calculated as the difference between the actual testing data and the model’s predicted output on that data (Ogunbiyi, 2022).
# Error between actual testing data and the model's predicted output on that data
error <- ((y_testing_set - y_testing_hat))
# Printing error value
print(sprintf("Error between actual testing data and the model's predicted output on that data is %0.4f", RSS_testing))
## [1] "Error between actual testing data and the model's predicted output on that data is 316.1386"
# # Storing length of the predicted output data
#
# n_len = length(y_testing_hat)
#
# #Calculating upper and lower confidence interval as CI_1 and CI_2
#
# C_I_1 <- z * sqrt((error * (1-error)) / n_len)
#
#
#
# C_I_2 <- z * sqrt((error*(1+error) / n_len))
As we have selected model 2, now, we are going to perform rejection ABC (Approximate Bayesian Computation) for estimating the posterior distribution of two parameters - thetebias(𝜃𝑏𝑖𝑎s) and thetafour(𝜃4).
Calculating the likelihood and residual sum of squares (RSS) is a standard approach for regression analysis and is used to evaluate the goodness of fit of a model. However, now we will try to little deviate from traditional regression analysis - depending on likelihood function. Now, we will generate synthetic datasets based on the prior distribution of the parameters (Drechsler and Reiter, 2011), which is called Approximate Bayesian Computation (ABC).
In ABC, thus obtained synthetic datasets are compared to the observed data using a summary statistic, such as the residual sum of squares. After that we will see if the summary statistic for the synthetic data is close enough to that of the observed data, the corresponding parameter values are considered as part of the posterior distribution (Csilléry et al., 2010).
Before that we need know know about Bayesian statistic. According to (Evans, Hastings and Peacock, 2000), the use of Bayesian statistics allows for the systematic updating of beliefs based on new information citation. The fundamental theorem upon which these methods are based is known as Bayes’ theorem (Eells, 2004), and it states that given two events, A and B, the conditional probability of A given that B is true, Mathematically it is expressed as:
Where: - P(A) is known as the prior probability - P(B|A) is known as the likelihood function - P(A|B) is known as the posterior probability
Here, we will compute 2 parameter posterior distributions, which will be the 2 parameters with largest absolute value in our least squares estimation (Task 2.1) of the selected model 2. Then we will keep all the other parameters of our model as constant.
# Creator vector of theta_hat of selected model 2 and sorting them to find out two largest absolute value and printing it
numbers <- c(theta_hat_2)
sorted_numbers <- sort(abs(numbers), decreasing=TRUE)
largest_two_values <- sorted_numbers[1:2]
print(largest_two_values)
## [1] 0.4830657 0.1435789
#Choosing parameters
theta_bias <- largest_two_values[1]
theta_four <- largest_two_values[2]
#Constant parameter
theta_one <- 0.010038614
theta_three <- -0.001912836
# Initial values
arr_1 = 0
arr_2=0
f_value=0
s_value=0
theta_hat_2
## Y
## ones 0.483065688
## X4 0.143578928
## X1 0.010038614
## X3 -0.001912836
Now we will calculate epsilon. We are using the value of epsilon as a threshold to determine whether the simulated values for the parameters are accepted or rejected. By choosing epsilon to be twice the RSS of the model, the acceptance rate will be roughly 50% (Shalizi, 2016). Then here we will be setting the number of iterations for the loop to be 100. This is not concrete, but a larger number of iterations may result in a more accurate estimate but will also increase the computational time (Bond, 2004).
Based on constant and chosen parameters, we will now determine the range of the prior distribution. Then, we perform rejection ABC for those 2 parameters.
# Calculating epsilon
epsilon <- RSS_model_2 * 2
# Number of iteration to determines how many times the for loop will repeat and generate new values for the parameters.
# A larger number of iterations may result in a more accurate estimate, but will also increase the computational time.
num <- 100
##Calculating Y-hat for performing rejection ABC
counter <- 0
# Iteration from 0 -100 and calculating the range
for (i in 1:num) {
range1 <- runif(1,-0.483065688,0.483065688)
range2 <- runif(1,-0.1435789,0.1435789)
# Creating new vector of two values from range1 and range2 with the constant values theta_one,theta_three.
New_thetahat <- matrix(c(range1,range2,theta_one,theta_three))
# Calculating predicted response values for the current iteration of the loop
New_Y_Hat <- x_model_2 %*% New_thetahat
# Calculting new RSS valur
new_RSS <- sum((y - New_Y_Hat)^2)
# Checking if new RSS is greater than epsilon and if the condition is true, the values of range1 and range2 are stored in arrays arr_1 and arr_2 respectively. The counter is also incremented by 1. The f_value and s_value are then defined as matrices of the arrays arr_1 and arr_2 respectively.
if (new_RSS > epsilon){
arr_1[i] <- range1
arr_2[i] <- range2
counter = counter+1
f_value <- matrix(arr_1)
s_value <- matrix(arr_2)
} #closing else loop
} #closing for loop
Here, we checked RSS value (new_RSS) and if the RSS is smaller than the threshold value(epsilon), it indicates that the simulated data and the observed data are similar enough, and the corresponding parameter values are accepted and stored for further analysis. For which the values of range1 and range2 are stored in arrays arr_1 and arr_2 respectively. The counter was also incremented by 1. The f_value and s_value were then defined as matrices of the arrays arr_1 and arr_2 respectively.
Finally we will visualize the joint and marginal posterior distribution of the parameters using histogram.
# Plotting histogram of new f_values and s_values
# Frequency distribution of the f_value
ggplot(data.frame(f_value), aes(x=f_value)) +
geom_histogram(color = 'black', fill = "grey") +
geom_rug()+ #Show the individual observations with black lines
labs(title = "Frequency distribution of the f_value"
) +
xlab("f_value") +
ylab("Frequency ")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Frequency distribution of the s_value
ggplot(data.frame(s_value), aes(x=s_value)) +
geom_histogram(color = 'black', fill = "grey") +
geom_rug()+ #Show the individual observations with black lines
labs(title = "Frequency distribution of the s_value"
) +
xlab("s_value") +
ylab("Frequency ")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Plotting the joint and marginal posterior distribution
# Create a data frame with the values of f_value and s_value and a column for "group"
df <- data.frame(f_value, s_value, legend=rep(c("f_value","s_value"), each=length(f_value)/2))
# Plot the scatter plot using and hiding legends
p <- ggplot(df, aes(x=f_value, y=s_value, color=legend)) +
geom_point()+
theme(legend.position="bottom")+ # show legend in bottom
theme(legend.title = element_blank())+ # hide legend word
#guides(color=FALSE)+ # Uncomment to hide legend
ggtitle("Joint and Marginal Posterior Distribution")
#
# Show the plot
print(p)
It is easier to view individual variable distribution and relationship in single chart.
# Plotting the joint and marginal posterior distribution
# Create a data frame with the values of f_value and s_value and a column for "group"
df <- data.frame(f_value, s_value, legend=rep(c("f_value","s_value"), each=length(f_value)/2))
# Plot the scatter plot using and hiding legends
p <- ggplot(df, aes(x=f_value, y=s_value, color=legend)) +
geom_point()+
theme(legend.position="bottom")+
theme(legend.title = element_blank())+
#guides(color=FALSE)+ # Uncomment to hide legend
ggtitle("Joint and Marginal Posterior Distribution")
#
# # Show the plot
# print(p)
# Marginal histograms by group
ggMarginal(p, type = "histogram",
xparams = list(fill = 1),
yparams = list(fill = 1))
By visualizing the f_value and s_value we get a visual representation of the parameters that gave a good fit between the simulated and observed data. Also, the histograms on the side provide a visual representation of the distribution of the individual variables, rather than their joint distribution.
This completes our Task 3: Approximate Bayesian Computation.
Based on a thorough analysis, we are able to conclude that our objectives have been achieved to a satisfactory degree. Despite this, there are still opportunities for improvement, including:
Insufficient data: Polynomial regression requires a sufficient amount of data to accurately model the relationship and we are provided with just 201 number of rows. Having more numbers of data would be best to explore underlying patterns in the EEG signals.
Overfitting: According to (Xu and Goodacre, 2018), cross-validation is considered to be better than a 70/30% split of the data as 70/30% split can result in overfitting, where the model is trained too well on the training data, but performs poorly on the test data. On top of this, cross validation also helps in reducing the bias in the model evaluation by using multiple subsets of the data instead of a single fixed subset.
Non-linearity: The relationship between input and output signals in real-life EEG is non-linear in nature. It may not be possible to model these complex relationships accurately using polynomial regression, which is used as linear combination of polynomial functions.
and by addressing these issues we can improve our EEG signal modeling.
Reference list Allen, D.M. (1971). Mean Square Error of Prediction as a Criterion for Selecting Variables. Technometrics, 13(3), pp.469–475. doi:10.1080/00401706.1971.10488811.
Barone, A. (2022). How Residual Sum of Squares (RSS) Works. [online] Investopedia. Available at: https://www.investopedia.com/terms/r/residual-sum-of-squares.asp.
Bertini, E., Correll, M. and Franconeri, S. (2020). Why Shouldn’t All Charts Be Scatter Plots? Beyond Precision-Driven Visualizations. [online] IEEE Xplore. doi:10.1109/VIS47514.2020.00048.
Bevans, R. (2020). Akaike Information Criterion | When & How to Use It (Example). [online] Scribbr. Available at: https://www.scribbr.com/statistics/akaike-information-criterion/#:~:text=It%20penalizes%20models%20which%20use [Accessed 31 Jan. 2023].
Björck, Å. (1990). Least squares methods. [online] ScienceDirect. Available at: https://www.sciencedirect.com/science/article/abs/pii/S1570865905800365 [Accessed 4 Oct. 2022].
Bond, S. (2004). Essential Medical Statistics (2nd edn). Kirkwood BR, Sternc JAC. Malden, MA: Blackwell Publishing, 2003, pp. 288, $52.95 (PB). ISBN 0865428719. International Journal of Epidemiology, 33(6), pp.1418–1419. doi:10.1093/ije/dyh364.
Breiman, L. and Spector, P. (1992). Submodel Selection and Evaluation in Regression. The X-Random Case. International Statistical Review / Revue Internationale de Statistique, 60(3), p.291. doi:10.2307/1403680.
Brownlee, J. (2019). Probabilistic Model Selection with AIC, BIC, and MDL. [online] Machine Learning Mastery. Available at: https://machinelearningmastery.com/probabilistic-model-selection-measures/.
Chen, J. (2019). Line Of Best Fit. [online] Investopedia. Available at: https://www.investopedia.com/terms/l/line-of-best-fit.asp.
Clay Ford (2015). Understanding Q-Q Plots | University of Virginia Library Research Data Services + Sciences. [online] Virginia.edu. Available at: https://data.library.virginia.edu/understanding-q-q-plots/.
Constantine, N.A. (2012). Regression Analysis and Causal Inference: Cause for Concern? Perspectives on Sexual and Reproductive Health, [online] 44(2), pp.134–137. Available at: https://www.jstor.org/stable/42004111 [Accessed 30 Jan. 2023].
Crompton, D. (2015). Histograms and Kernels Density Estimates - David Crompton - Medium. [online] Medium. Available at: https://medium.com/@dcomp/histograms-and-kernels-density-estimates-a2c41eb08de3.
Csilléry, K., Blum, M.G.B., Gaggiotti, O.E. and François, O. (2010). Approximate Bayesian Computation (ABC) in practice. Trends in Ecology & Evolution, 25(7), pp.410–418. doi:10.1016/j.tree.2010.04.001.
Datacadamia (2014). Statistics - Bayesian Information Criterion (BIC). [online] Datacadamia - Data and Co. Available at: https://datacadamia.com/data_mining/bic [Accessed 31 Jan. 2023].
De’ath, G. and Fabricius, K.E. (2000). CLASSIFICATION AND REGRESSION TREES: A POWERFUL YET SIMPLE TECHNIQUE FOR ECOLOGICAL DATA ANALYSIS. Ecology, 81(11), pp.3178–3192. doi:10.1890/0012-9658(2000)081[3178:cartap]2.0.co;2.
Di Leo, G. and Sardanelli, F. (2020). Statistical significance: p value, 0.05 threshold, and applications to radiomics—reasons for a conservative approach. European Radiology Experimental, 4(1). doi:10.1186/s41747-020-0145-y.
Doane, D.P. and Seward, L.E. (2011). Measuring Skewness: A Forgotten Statistic? Journal of Statistics Education, 19(2). doi:10.1080/10691898.2011.11889611.
Drechsler, J. and Reiter, J.P. (2011). An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets. Computational Statistics & Data Analysis, 55(12), pp.3232–3243. doi:10.1016/j.csda.2011.06.006.
Duong, T. (2007). ks: Kernel Density Estimation and Kernel Discriminant Analysis for Multivariate Data inR. Journal of Statistical Software, 21(7). doi:10.18637/jss.v021.i07.
Eells, E. (2004). Review: Bayes’s Theorem. Mind, 113(451), pp.591–596. doi:10.1093/mind/113.451.591.
Evans, M., Hastings, N. and Peacock, B. (2000). Statistical Distributions, Third Edition. Measurement Science and Technology, 12(1), pp.117–117. doi:10.1088/0957-0233/12/1/702.
Frost, J. (2018). One-Tailed and Two-Tailed Hypothesis Tests Explained - Statistics By Jim. [online] Statistics By Jim. Available at: https://statisticsbyjim.com/hypothesis-testing/one-tailed-two-tailed-hypothesis-tests/.
Healthline (2012). EEG (Electroencephalogram): Purpose, Procedure, and Risks. [online] Healthline. Available at: https://www.healthline.com/health/eeg.
Laerd.com. (2018). Pearson Product-Moment Correlation - Guidelines to interpretation of the coefficient, detecting outliers and the type of variables needed. [online] Available at: https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide-2.php.
Manikantan, A. (2021). Akaike Information Criterion: Model Selection. [online] Geek Culture. Available at: https://medium.com/geekculture/akaike-information-criterion-model-selection-c47df96ee9a8.
Music, A. (2020). Gaussian Distribution and Maximum Likelihood Estimate Method (Step-by-Step). [online] The Startup. Available at: https://medium.com/swlh/gaussian-distribution-and-maximum-likelihood-estimate-method-step-by-step-e4f6014fa83e [Accessed 31 Jan. 2023].
Nguyen, Q.H., Ly, H.-B., Ho, L.S., Al-Ansari, N., Le, H.V., Tran, V.Q., Prakash, I. and Pham, B.T. (2021). Influence of Data Splitting on Performance of Machine Learning Models in Prediction of Shear Strength of Soil. Mathematical Problems in Engineering, [online] 2021, pp.1–15. doi:10.1155/2021/4832864.
Ogunbiyi, I.A. (2022). Top Evaluation Metrics for Regression Problems in Machine Learning. [online] freeCodeCamp.org. Available at: https://www.freecodecamp.org/news/evaluation-metrics-for-regression-problems-machine-learning/
Peterka, V. (1981). Chapter 8 - BAYESIAN APPROACH TO SYSTEM IDENTIFICATION. [online] ScienceDirect. Available at: https://www.sciencedirect.com/science/article/pii/B9780080256832500132 [Accessed 31 Jan. 2023].
Rahman, M. and Zhang, Q. (2015). COMPARISON AMONG PEARSON CORRELATION COEFFICIENT TESTS. Far East Journal of Mathematical Sciences (FJMS), 99(2), pp.237–255. doi:10.17654/ms099020237.
Schork, J. (n.d.). Solve System of Equations in R (3 Examples) | Using solve() Function. [online] Statistics Globe. Available at: https://statisticsglobe.com/solve-system-of-equations-in-r/.
Sedgwick, P. (2014). Understanding confidence intervals. BMJ, 349(oct06 12), pp.g6051–g6051. doi:10.1136/bmj.g6051.
Shalizi, C.R. (2016). Advanced Data Analysis from an Elementary Point of View. [online] Available at: https://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ADAfaEPoV.pdf [Accessed 31 Jan. 2023].
Smith, C.J. (2012). Interpreting confidence intervals. Phlebology: The Journal of Venous Disease, 27(3), pp.141–142. doi:10.1258/phleb.2012.012j02.
St, E.K., Frey, L.C., Britton, J.W., Frey, L.C., Hopp, J.L., Pearce Korb, Koubeissi, M.Z., Lievens, W.E., Pestana-Knight, E.M. and St, E.K. (2016). Introduction Electroencephalography (EEG): An Introductory Text and Atlas of Normal and Abnormal Findings in Adults, Children, and Infants [Internet]. [online] Nih.gov. Available at: https://www.ncbi.nlm.nih.gov/books/NBK390346/.
Stephanie (2021). Log Likelihood Function. [online] Statistics How To. Available at: https://www.statisticshowto.com/log-likelihood-function/ [Accessed 31 Jan. 2023].
Taylor, R. (1990). Interpretation of the Correlation Coefficient: A Basic Review. Journal of Diagnostic Medical Sonography, [online] 6(1), pp.35–39. doi:10.1177/875647939000600106.
Trevor Hastie, Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning. Editorial: New York, Ny Springer New York.
Upadhaya, M.D. (2020). Correlation & Causation. [online] Medium. Available at: https://medium.com/analytics-vidhya/correlation-causation-977f71bb1e36.
Valchanov, I. (2018). Sum of Squares: SST, SSR, SSE. [online] 365 Data Science. Available at: https://365datascience.com/tutorials/statistics-tutorials/sum-squares/.
von Hippel, P.T. (2005). Mean, Median, and Skew: Correcting a Textbook Rule. Journal of Statistics Education, 13(2). doi:10.1080/10691898.2005.11910556.
Wolfgang Karl Hrdle (2020). Applied Multivariate Statistical Analysis. S.L.: Springer.
Xu, Y. and Goodacre, R. (2018). On Splitting Training and Validation Set: A Comparative Study of Cross-Validation, Bootstrap and Systematic Sampling for Estimating the Generalization Performance of Supervised Learning. Journal of Analysis and Testing, 2(3), pp.249–262. doi:10.1007/s41664-018-0068-2.
Zach (2021a). How to Interpret Log-Likelihood Values (With Examples). [online] Statology. Available at: https://www.statology.org/interpret-log-likelihood/.
Zach (2021b). How to Interpret Negative AIC Values. [online] Statology. Available at: https://www.statology.org/negative-aic/.