Preliminaries

set.seed(10)
library(haven)
library(tidyverse)

## Warning: package 'ggplot2' was built under R version 4.3.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

transformed_boluwatife <- read_sav("C:/Users/DELL/Desktop/2024_Projects/project work from boluwatife/transformed_boluwatife.sav")
head(transformed_boluwatife,5)

## # A tibble: 5 × 9
##   Behaviour Group Week_0 Week_3 Week_5 Week_7 Week_9 Week_11 Week_13
##   <dbl+lbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>
## 1 1 [SL]        1    6.8    5.6    6.2    5.2    5.2     7.1     5.5
## 2 1 [SL]        1    6.4    7      5.9    4.8    5.8     6.2     5.1
## 3 1 [SL]        1    4.6    6.1    5      5.1    8.1     7.4     6.1
## 4 1 [SL]        1    7.5    5.8    4.9    5.2    6.4     5.8     5.9
## 5 1 [SL]        1    5.2    4.9    5      6.6    6.3     6       7.5

Data Cleaning

transformed_boluwatife$Group <- ifelse(transformed_boluwatife$Group== 1, 0, 1)
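
The recode above maps Group 1 to 0 and every other value to 1. A minimal sketch of how the mapping could be checked, using a hypothetical vector `group_raw`; the codes 1 and 2 are an assumption, since the original coding in the .sav file is not shown here.

```r
# Hypothetical group codes; the real file is assumed to use 1 and 2.
group_raw <- c(1, 1, 2, 2, 1, 2)

# Same rule as above: 1 becomes 0, anything else becomes 1.
group_new <- ifelse(group_raw == 1, 0, 1)

# Cross-tabulate old against new codes to confirm the mapping.
check <- table(old = group_raw, new = group_new)
print(check)
```

Each old code should land in exactly one new code; any extra non-empty cell would point to an unexpected value in the raw column.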

Labeling Behaviour

ucl_decoded <- transformed_boluwatife %>% 
  mutate(Behaviour = case_when(
    Behaviour == 1 ~ "SL",
    Behaviour == 2 ~ "TL",
    Behaviour == 3 ~ "Aggressive",
    Behaviour == 4 ~ "Weight"
         ))
head(ucl_decoded, 10)

## # A tibble: 10 × 9
##    Behaviour Group Week_0 Week_3 Week_5 Week_7 Week_9 Week_11 Week_13
##    <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>
##  1 SL            0    6.8    5.6    6.2    5.2    5.2     7.1     5.5
##  2 SL            0    6.4    7      5.9    4.8    5.8     6.2     5.1
##  3 SL            0    4.6    6.1    5      5.1    8.1     7.4     6.1
##  4 SL            0    7.5    5.8    4.9    5.2    6.4     5.8     5.9
##  5 SL            0    5.2    4.9    5      6.6    6.3     6       7.5
##  6 SL            0    5      5.1    4.6    7.2    5.5     5       7.7
##  7 SL            0    5.6    6.2    7.9    5.7    7       6.8     6.6
##  8 SL            0    4.6    4.8    5      4.8    7.3     5.4     5.4
##  9 SL            0    6.5    6.9    6.7    6.9    6.1     5       5.1
## 10 SL            0    4.2    4.9    6.4    8      6.2     5.3     6.9

Data Exploration

library(dlookr)

## Warning: package 'dlookr' was built under R version 4.4.0

## Registered S3 methods overwritten by 'dlookr':
##   method          from  
##   plot.transform  scales
##   print.transform scales

## 
## Attaching package: 'dlookr'

## The following object is masked from 'package:tidyr':
## 
##     extract

## The following object is masked from 'package:base':
## 
##     transform

Checking for Missing values
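
No code survives under this heading. The summaries in the descriptive analysis report 0 missing values for each weekly column, so a check of this kind was presumably run. A minimal base R sketch of such a check, on a hypothetical data frame `df` (not the study data):

```r
# Hypothetical stand-in for the study data, with one value set to NA.
df <- data.frame(
  Week_0 = c(6.8, 6.4, NA, 7.5),
  Week_3 = c(5.6, 7.0, 6.1, 5.8)
)

# Number of missing values in each column.
missing_per_column <- colSums(is.na(df))
print(missing_per_column)

# Number of rows with no missing value in any column.
complete_rows <- sum(complete.cases(df))
print(complete_rows)
```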

Checking for outliers

diagnose_outlier(transformed_boluwatife)

## # A tibble: 8 × 6
##   variables outliers_cnt outliers_ratio outliers_mean with_mean without_mean
##   <chr>            <int>          <dbl>         <dbl>     <dbl>        <dbl>
## 1 Group                0            0           NaN        0.5          0.5 
## 2 Week_0              60           18.8          21.5      8.23         5.17
## 3 Week_3              55           17.2          22.6      8.25         5.28
## 4 Week_5              59           18.4          22.9      8.54         5.29
## 5 Week_7              58           18.1          24.1      8.87         5.50
## 6 Week_9              48           15            24.0      8.59         5.88
## 7 Week_11             41           12.8          24        8.51         6.24
## 8 Week_13             52           16.2          26.2      9.22         5.92
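
In the table above, outliers_ratio is a percentage of rows (60 of 320 rows gives 18.8 for Week_0), and with_mean and without_mean are the column means with and without the flagged values. The sketch below reproduces that kind of summary in base R. It assumes the usual boxplot rule (values more than 1.5 times the IQR beyond the hinges), which is what `boxplot.stats()` applies; whether dlookr uses exactly this rule should be confirmed in its documentation. The vector `x` is hypothetical.

```r
# Hypothetical values: mostly small scores plus two large ones.
x <- c(4.6, 5.0, 5.2, 5.6, 6.4, 6.8, 7.5, 24, 31)

# Values beyond 1.5 x IQR from the hinges are returned in $out.
out <- boxplot.stats(x)$out

outliers_cnt   <- length(out)
outliers_ratio <- 100 * outliers_cnt / length(x)  # percentage of observations
with_mean      <- mean(x)                         # mean including outliers
without_mean   <- mean(x[!x %in% out])            # mean excluding outliers

print(c(cnt = outliers_cnt, ratio = outliers_ratio,
        with_mean = with_mean, without_mean = without_mean))
```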

library(visdat)

## Warning: package 'visdat' was built under R version 4.3.3

Checking for data structure

Correlation Analysis

library(corrplot)

## Warning: package 'corrplot' was built under R version 4.3.2

## corrplot 0.92 loaded

cr_data <- transformed_boluwatife %>% select(-c(Group, Behaviour))
coq <- cor(cr_data)
corrplot(coq, method = "number")

plot(cr_data, col ="red")

Descriptive analysis

library(summarytools)

## Warning: package 'summarytools' was built under R version 4.3.3

## 
## Attaching package: 'summarytools'

## The following object is masked from 'package:tibble':
## 
##     view

dfSummary(cr_data$Week_0)

## cr_data$Week_0 was converted to a data frame

## Data Frame Summary  
## cr_data  
## Dimensions: 320 x 1  
## Duplicates: 196  
## 
## ------------------------------------------------------------------------------------------------------------
## No   Variable    Label       Stats / Values         Freqs (% of Valid)    Graph         Valid      Missing  
## ---- ----------- ----------- ---------------------- --------------------- ------------- ---------- ---------
## 1    Week_0      Week Zero   Mean (sd) : 8.2 (7)    124 distinct values   :             320        0        
##      [numeric]               min < med < max:                             : .           (100.0%)   (0.0%)   
##                              2.2 < 5.2 < 34                               : :                               
##                              IQR (CV) : 3.6 (0.8)                         : :                               
##                                                                           : : . . . .                       
## ------------------------------------------------------------------------------------------------------------

dfSummary(cr_data$Week_3)

## cr_data$Week_3 was converted to a data frame

## Data Frame Summary  
## cr_data  
## Dimensions: 320 x 1  
## Duplicates: 199  
## 
## ----------------------------------------------------------------------------------------------------------------
## No   Variable    Label        Stats / Values          Freqs (% of Valid)    Graph           Valid      Missing  
## ---- ----------- ------------ ----------------------- --------------------- --------------- ---------- ---------
## 1    Week_3      Week Three   Mean (sd) : 8.3 (7.3)   121 distinct values   :               320        0        
##      [numeric]                min < med < max:                              : .             (100.0%)   (0.0%)   
##                               2 < 5.2 < 37                                  : :                                 
##                               IQR (CV) : 4 (0.9)                            : :                                 
##                                                                             : : . . . . .                       
## ----------------------------------------------------------------------------------------------------------------

dfSummary(cr_data$Week_5)

## cr_data$Week_5 was converted to a data frame

## Data Frame Summary  
## cr_data  
## Dimensions: 320 x 1  
## Duplicates: 197  
## 
## -------------------------------------------------------------------------------------------------------------
## No   Variable    Label       Stats / Values          Freqs (% of Valid)    Graph         Valid      Missing  
## ---- ----------- ----------- ----------------------- --------------------- ------------- ---------- ---------
## 1    Week_5      Week Five   Mean (sd) : 8.5 (7.6)   123 distinct values   :             320        0        
##      [numeric]               min < med < max:                              : .           (100.0%)   (0.0%)   
##                              2 < 5.3 < 35                                  : :                               
##                              IQR (CV) : 4.1 (0.9)                          : :                               
##                                                                            : : . : . .                       
## -------------------------------------------------------------------------------------------------------------

dfSummary(cr_data$Week_7)

## cr_data$Week_7 was converted to a data frame

## Data Frame Summary  
## cr_data  
## Dimensions: 320 x 1  
## Duplicates: 185  
## 
## --------------------------------------------------------------------------------------------------------------
## No   Variable    Label        Stats / Values          Freqs (% of Valid)    Graph         Valid      Missing  
## ---- ----------- ------------ ----------------------- --------------------- ------------- ---------- ---------
## 1    Week_7      Week Seven   Mean (sd) : 8.9 (7.9)   135 distinct values   :             320        0        
##      [numeric]                min < med < max:                              : :           (100.0%)   (0.0%)   
##                               2 < 5.3 < 36                                  : :                               
##                               IQR (CV) : 4.4 (0.9)                          : :                               
##                                                                             : : . . . .                       
## --------------------------------------------------------------------------------------------------------------

dfSummary(cr_data$Week_9)

## cr_data$Week_9 was converted to a data frame

## Data Frame Summary  
## cr_data  
## Dimensions: 320 x 1  
## Duplicates: 180  
## 
## -------------------------------------------------------------------------------------------------------------
## No   Variable    Label       Stats / Values          Freqs (% of Valid)    Graph         Valid      Missing  
## ---- ----------- ----------- ----------------------- --------------------- ------------- ---------- ---------
## 1    Week_9      Week Nine   Mean (sd) : 8.6 (7.3)   140 distinct values   :             320        0        
##      [numeric]               min < med < max:                              : :           (100.0%)   (0.0%)   
##                              2 < 5.3 < 33                                  : :                               
##                              IQR (CV) : 4.8 (0.9)                          : :                               
##                                                                            : : : . . .                       
## -------------------------------------------------------------------------------------------------------------

dfSummary(cr_data$Week_11)

## cr_data$Week_11 was converted to a data frame

## Data Frame Summary  
## cr_data  
## Dimensions: 320 x 1  
## Duplicates: 183  
## 
## --------------------------------------------------------------------------------------------------------------
## No   Variable    Label         Stats / Values         Freqs (% of Valid)    Graph         Valid      Missing  
## ---- ----------- ------------- ---------------------- --------------------- ------------- ---------- ---------
## 1    Week_11     Week Eleven   Mean (sd) : 8.5 (7)    137 distinct values   :             320        0        
##      [numeric]                 min < med < max:                             : :           (100.0%)   (0.0%)   
##                                2 < 5.4 < 34                                 : :                               
##                                IQR (CV) : 5.4 (0.8)                         : :                               
##                                                                             : : : : . .                       
## --------------------------------------------------------------------------------------------------------------

dfSummary(cr_data$Week_13)

## cr_data$Week_13 was converted to a data frame

## Data Frame Summary  
## cr_data  
## Dimensions: 320 x 1  
## Duplicates: 176  
## 
## -------------------------------------------------------------------------------------------------------------------
## No   Variable    Label           Stats / Values          Freqs (% of Valid)    Graph           Valid      Missing  
## ---- ----------- --------------- ----------------------- --------------------- --------------- ---------- ---------
## 1    Week_13     Week Thirteen   Mean (sd) : 9.2 (8.4)   144 distinct values   : .             320        0        
##      [numeric]                   min < med < max:                              : :             (100.0%)   (0.0%)   
##                                  2 < 5.5 < 40                                  : :                                 
##                                  IQR (CV) : 5.6 (0.9)                          : :                                 
##                                                                                : : . : . . .                       
## -------------------------------------------------------------------------------------------------------------------
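
The seven calls above repeat the same summary column by column. A minimal base R sketch of computing comparable statistics for all columns in one pass, on a hypothetical data frame `weeks`; the coefficient of variation is assumed here to be sd divided by mean.

```r
# Hypothetical stand-in for cr_data with two weekly columns.
weeks <- data.frame(
  Week_0 = c(6.8, 6.4, 4.6, 7.5, 5.2),
  Week_3 = c(5.6, 7.0, 6.1, 5.8, 4.9)
)

# Mean, sd, median, IQR and coefficient of variation for one column.
describe <- function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x),
    IQR = IQR(x), CV = sd(x) / mean(x))
}

# One row per column of the data frame.
summary_table <- t(sapply(weeks, describe))
print(round(summary_table, 2))
```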

Features by Behaviours

ggplot(transformed_boluwatife, aes(y =Week_7, col= as.factor(Behaviour)))+geom_boxplot()

ggplot(transformed_boluwatife, aes(y =Week_0, col= as.factor(Behaviour)))+geom_boxplot()

ggplot(transformed_boluwatife, aes(y =Week_5, col= as.factor(Behaviour)))+geom_boxplot()

ggplot(transformed_boluwatife, aes(y =Week_9, col= as.factor(Behaviour)))+geom_boxplot()

ggplot(transformed_boluwatife, aes(y =Week_11, col= as.factor(Behaviour)))+geom_boxplot()

ggplot(transformed_boluwatife, aes(y =Week_13, col= as.factor(Behaviour)))+geom_boxplot()

ggplot(transformed_boluwatife, aes(y =Week_3, col= as.factor(Behaviour)))+geom_boxplot()
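
The seven plots above differ only in the week column. A minimal sketch of producing them in a loop, written with base R `boxplot()` rather than ggplot2 so that it runs without extra packages; the data frame `df` is simulated and only stands in for the study data.

```r
set.seed(10)

# Simulated stand-in: four behaviours, five rows each, two weekly columns.
df <- data.frame(
  Behaviour = factor(rep(c("SL", "TL", "Aggressive", "Weight"), each = 5)),
  Week_0 = rnorm(20, mean = 6),
  Week_3 = rnorm(20, mean = 6)
)

# All columns whose name starts with "Week_", in their stored order.
week_cols <- grep("^Week_", names(df), value = TRUE)

# One boxplot per week, split by behaviour.
for (w in week_cols) {
  boxplot(reformulate("Behaviour", response = w), data = df,
          main = w, ylab = w)
}
```

Looping over the column names also keeps the plots in week order.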

Splitting the dataset into training and test sets (0.75 and 0.25)

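# Weekly measurements (columns 3 to 9)
cont <- transformed_boluwatife %>% select(3:9)
# Copied unchanged; no scaling is applied at this step
scaled <- cont
# Remaining columns: Behaviour and Group
cate <- transformed_boluwatife %>% select(!3:9)
# Decoded Behaviour labels from ucl_decoded
b <- select(ucl_decoded, Behaviour)
transformed_boluwatife <- cbind(cate, scaled, b)
# Move the decoded labels (column 10) into n_behaviour and drop the numeric Behaviour code (column 1)
t <- transformed_boluwatife %>% select(n_behaviour = 10)
transformed_boluwatife <- transformed_boluwatife %>% select(!10)
transformed_boluwatife <- cbind(transformed_boluwatife, t) %>% select(!1)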
library(caret)

## Warning: package 'caret' was built under R version 4.3.3

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

library(caTools)
sample <- sample.split(transformed_boluwatife$Group, SplitRatio = 0.75)
Train <- subset(transformed_boluwatife, sample == TRUE)
Test <- subset(transformed_boluwatife, sample == FALSE)
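
`sample.split()` draws the split within each level of the outcome, so Train and Test keep the same Group balance. A minimal base R sketch of that idea, assuming a balanced outcome of 320 rows; the vector `group` is hypothetical and the code is not caTools' own implementation.

```r
set.seed(10)

# Hypothetical balanced outcome with 320 rows.
group <- rep(c(0, 1), each = 160)

# Within each class, draw 75% of the row indices for training.
train_idx <- unlist(lapply(split(seq_along(group), group), function(idx) {
  sample(idx, size = round(0.75 * length(idx)))
}))

# Logical flag per row: TRUE for training, FALSE for test.
in_train <- seq_along(group) %in% train_idx

print(table(group, in_train))
```

With these sizes the training set holds 240 rows (120 per group) and the test set 80 rows (40 per group).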

Model creation

library(e1071)

## 
## Attaching package: 'e1071'

## The following objects are masked from 'package:dlookr':
## 
##     kurtosis, skewness

Models for Group based on the weekly features

Removing the behaviour column

Test_g <- Test %>% select(!n_behaviour)
Train_g <- Train %>% select(!n_behaviour)

Logistic regression

Logit <- glm(Group~., data = Train_g, family = binomial(link = "logit"))
summary(Logit)

## 
## Call:
## glm(formula = Group ~ ., family = binomial(link = "logit"), data = Train_g)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.27006    0.22927   1.178  0.23883   
## Week_0       0.14432    0.05578   2.587  0.00967 **
## Week_3       0.04661    0.05472   0.852  0.39437   
## Week_5       0.05878    0.04495   1.308  0.19102   
## Week_7       0.04148    0.04622   0.897  0.36948   
## Week_9      -0.03144    0.04931  -0.637  0.52381   
## Week_11     -0.13925    0.06398  -2.177  0.02951 * 
## Week_13     -0.15076    0.05294  -2.848  0.00440 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 332.71  on 239  degrees of freedom
## Residual deviance: 303.24  on 232  degrees of freedom
## AIC: 319.24
## 
## Number of Fisher Scoring iterations: 5

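# predict() on a glm returns the linear predictor (log-odds) unless type = "response" is given
pl <- predict(Logit, newdata = Test_g)
datpl <- as.factor(ifelse(pl >= 0.5, 1, 0))
# confusionMatrix() expects predictions first and the reference second; here the observed Group is passed first
confpl <- confusionMatrix(as.factor(Test_g$Group), datpl)
confpl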

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 37  3
##          1 37  3
##                                         
##                Accuracy : 0.5           
##                  95% CI : (0.386, 0.614)
##     No Information Rate : 0.925         
##     P-Value [Acc > NIR] : 1             
##                                         
##                   Kappa : 0             
##                                         
##  Mcnemar's Test P-Value : 1.811e-07     
##                                         
##             Sensitivity : 0.5000        
##             Specificity : 0.5000        
##          Pos Pred Value : 0.9250        
##          Neg Pred Value : 0.0750        
##              Prevalence : 0.9250        
##          Detection Rate : 0.4625        
##    Detection Prevalence : 0.5000        
##       Balanced Accuracy : 0.5000        
##                                         
##        'Positive' Class : 0             
##
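
Two details affect how the matrix above should be read. `predict()` on a glm returns log-odds by default, so a cut-off of 0.5 is applied to log-odds rather than to probabilities. In addition, `confusionMatrix()` takes the predicted classes as its first argument and the observed classes as its second, whereas the call above passes the observed Group first, which is why both rows sum to 40. The sketch below shows the probability-scale version on simulated data; the data, coefficients, and variable names are assumptions for illustration, and the table is built with base R `table()`.

```r
set.seed(10)

# Simulated data: one predictor and a binary outcome.
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(0.5 + 1.2 * x))
dat <- data.frame(x = x, y = y)
train <- dat[1:150, ]
test  <- dat[151:200, ]

fit <- glm(y ~ x, data = train, family = binomial(link = "logit"))

# Default scale is the linear predictor (log-odds);
# type = "response" returns probabilities.
log_odds <- predict(fit, newdata = test)
prob     <- predict(fit, newdata = test, type = "response")

# A probability cut-off of 0.5 corresponds to a log-odds cut-off of 0.
pred <- factor(ifelse(prob >= 0.5, 1, 0), levels = c(0, 1))
obs  <- factor(test$y, levels = c(0, 1))

# Predictions in rows, observed classes in columns.
print(table(Predicted = pred, Observed = obs))
```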

Probit regression

Probit <-glm(Group~., data = Train_g, family = binomial(link="probit"))
summary(Probit)

## 
## Call:
## glm(formula = Group ~ ., family = binomial(link = "probit"), 
##     data = Train_g)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.17047    0.14080   1.211   0.2260   
## Week_0       0.08276    0.03249   2.547   0.0109 * 
## Week_3       0.02704    0.03142   0.861   0.3895   
## Week_5       0.03402    0.02648   1.285   0.1988   
## Week_7       0.02337    0.02671   0.875   0.3816   
## Week_9      -0.01965    0.02914  -0.674   0.5001   
## Week_11     -0.07847    0.03722  -2.108   0.0350 * 
## Week_13     -0.08855    0.03059  -2.895   0.0038 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 332.71  on 239  degrees of freedom
## Residual deviance: 303.43  on 232  degrees of freedom
## AIC: 319.43
## 
## Number of Fisher Scoring iterations: 5

pb <- predict(Probit, newdata = Test_g)
datp <- as.factor(ifelse(pb >= 0.5, 1, 0))
confp <- confusionMatrix(as.factor(Test_g$Group), datp)
confp

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 39  1
##          1 38  2
##                                           
##                Accuracy : 0.5125          
##                  95% CI : (0.3981, 0.6259)
##     No Information Rate : 0.9625          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.025           
##                                           
##  Mcnemar's Test P-Value : 8.185e-09       
##                                           
##             Sensitivity : 0.5065          
##             Specificity : 0.6667          
##          Pos Pred Value : 0.9750          
##          Neg Pred Value : 0.0500          
##              Prevalence : 0.9625          
##          Detection Rate : 0.4875          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.5866          
##                                           
##        'Positive' Class : 0               
##
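
For a probit model the linear predictor is converted to a probability with the standard normal CDF, so a probability of 0.5 corresponds to a linear predictor of 0, not 0.5. A minimal sketch of that relationship on simulated data (the data and coefficients are assumptions for illustration):

```r
set.seed(10)

# Simulated data generated from a probit model.
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = pnorm(0.3 + 0.8 * x))

fit <- glm(y ~ x, family = binomial(link = "probit"))

eta <- predict(fit)                     # linear predictor (default scale)
p   <- predict(fit, type = "response")  # fitted probabilities

# Probit link: probability = standard normal CDF of the linear predictor.
print(all.equal(unname(p), unname(pnorm(eta))))
```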

Decision tree model

library(rpart)

## Warning: package 'rpart' was built under R version 4.3.3

library(rpart.plot)

## Warning: package 'rpart.plot' was built under R version 4.3.3

cart <- rpart(Group~., data = Train_g)
summary(cart)

## Call:
## rpart(formula = Group ~ ., data = Train_g)
##   n= 240 
## 
##            CP nsplit rel error    xerror        xstd
## 1  0.22309905      0 1.0000000 1.0120542 0.003172568
## 2  0.08470146      1 0.7769010 0.9630955 0.056783151
## 3  0.07024795      2 0.6921995 0.9324862 0.068752465
## 4  0.06723783      3 0.6219515 0.9068161 0.071453267
## 5  0.03669618      4 0.5547137 0.9020017 0.076557643
## 6  0.02133343      5 0.5180175 0.8573562 0.079355754
## 7  0.01607143      6 0.4966841 0.9065573 0.084071604
## 8  0.01407625      8 0.4645412 0.9133495 0.085757790
## 9  0.01212121      9 0.4504650 0.9184050 0.085260447
## 10 0.01052291     10 0.4383438 0.9145756 0.085296731
## 11 0.01000000     11 0.4278209 0.9228764 0.086130320
## 
## Variable importance
## Week_13  Week_5 Week_11  Week_7  Week_3  Week_9  Week_0 
##      22      18      17      14      13      11       5 
## 
## Node number 1: 240 observations,    complexity param=0.223099
##   mean=0.5, MSE=0.25 
##   left son=2 (149 obs) right son=3 (91 obs)
##   Primary splits:
##       Week_13 < 5.015 to the right, improve=0.2230990, (0 missing)
##       Week_11 < 4.255 to the right, improve=0.2072345, (0 missing)
##       Week_5  < 4.41  to the right, improve=0.2063492, (0 missing)
##       Week_9  < 3.92  to the right, improve=0.1927519, (0 missing)
##       Week_7  < 5.01  to the right, improve=0.1785714, (0 missing)
##   Surrogate splits:
##       Week_5  < 4.92  to the right, agree=0.825, adj=0.538, (0 split)
##       Week_9  < 4.895 to the right, agree=0.825, adj=0.538, (0 split)
##       Week_11 < 4.255 to the right, agree=0.817, adj=0.516, (0 split)
##       Week_7  < 5.06  to the right, agree=0.808, adj=0.495, (0 split)
##       Week_3  < 4.245 to the right, agree=0.783, adj=0.429, (0 split)
## 
## Node number 2: 149 observations,    complexity param=0.08470146
##   mean=0.3154362, MSE=0.2159362 
##   left son=4 (101 obs) right son=5 (48 obs)
##   Primary splits:
##       Week_3  < 10.5  to the left,  improve=0.1579539, (0 missing)
##       Week_5  < 10.5  to the left,  improve=0.1544955, (0 missing)
##       Week_0  < 7.8   to the left,  improve=0.1517745, (0 missing)
##       Week_7  < 10.5  to the left,  improve=0.1517745, (0 missing)
##       Week_11 < 10.5  to the left,  improve=0.1353075, (0 missing)
##   Surrogate splits:
##       Week_0  < 7.8   to the left,  agree=0.946, adj=0.833, (0 split)
##       Week_11 < 9.65  to the left,  agree=0.940, adj=0.812, (0 split)
##       Week_13 < 9.7   to the left,  agree=0.940, adj=0.812, (0 split)
##       Week_5  < 11.5  to the left,  agree=0.933, adj=0.792, (0 split)
##       Week_7  < 9.31  to the left,  agree=0.933, adj=0.792, (0 split)
## 
## Node number 3: 91 observations,    complexity param=0.07024795
##   mean=0.8021978, MSE=0.1586765 
##   left son=6 (12 obs) right son=7 (79 obs)
##   Primary splits:
##       Week_5  < 5.2   to the right, improve=0.2918979, (0 missing)
##       Week_9  < 5.32  to the right, improve=0.2429258, (0 missing)
##       Week_11 < 5.835 to the right, improve=0.2326674, (0 missing)
##       Week_7  < 4.815 to the right, improve=0.1901038, (0 missing)
##       Week_13 < 3.95  to the right, improve=0.1506219, (0 missing)
##   Surrogate splits:
##       Week_11 < 6.75  to the right, agree=0.901, adj=0.250, (0 split)
##       Week_9  < 5.865 to the right, agree=0.890, adj=0.167, (0 split)
## 
## Node number 4: 101 observations,    complexity param=0.06723783
##   mean=0.1881188, MSE=0.1527301 
##   left son=8 (86 obs) right son=9 (15 obs)
##   Primary splits:
##       Week_11 < 4.25  to the right, improve=0.26152840, (0 missing)
##       Week_3  < 4.235 to the right, improve=0.15725290, (0 missing)
##       Week_7  < 4.95  to the right, improve=0.10527100, (0 missing)
##       Week_13 < 5.75  to the right, improve=0.09828623, (0 missing)
##       Week_5  < 4.36  to the right, improve=0.09329401, (0 missing)
##   Surrogate splits:
##       Week_5 < 3.935 to the right, agree=0.921, adj=0.467, (0 split)
##       Week_9 < 4.055 to the right, agree=0.921, adj=0.467, (0 split)
##       Week_7 < 4.1   to the right, agree=0.901, adj=0.333, (0 split)
##       Week_3 < 3.605 to the right, agree=0.881, adj=0.200, (0 split)
## 
## Node number 5: 48 observations,    complexity param=0.03669618
##   mean=0.5833333, MSE=0.2430556 
##   left son=10 (17 obs) right son=11 (31 obs)
##   Primary splits:
##       Week_13 < 22.5  to the right, improve=0.18872320, (0 missing)
##       Week_9  < 16.5  to the right, improve=0.11454500, (0 missing)
##       Week_11 < 22.5  to the right, improve=0.08691729, (0 missing)
##       Week_7  < 24.5  to the left,  improve=0.06938776, (0 missing)
##       Week_5  < 13.5  to the left,  improve=0.06222001, (0 missing)
##   Surrogate splits:
##       Week_7  < 12.5  to the left,  agree=0.750, adj=0.294, (0 split)
##       Week_9  < 10.5  to the left,  agree=0.708, adj=0.176, (0 split)
##       Week_0  < 29.5  to the right, agree=0.688, adj=0.118, (0 split)
##       Week_11 < 21.5  to the right, agree=0.667, adj=0.059, (0 split)
## 
## Node number 6: 12 observations
##   mean=0.25, MSE=0.1875 
## 
## Node number 7: 79 observations,    complexity param=0.02133343
##   mean=0.8860759, MSE=0.1009454 
##   left son=14 (28 obs) right son=15 (51 obs)
##   Primary splits:
##       Week_13 < 3.95  to the right, improve=0.16050860, (0 missing)
##       Week_9  < 4.96  to the right, improve=0.11749710, (0 missing)
##       Week_11 < 5.33  to the right, improve=0.11749710, (0 missing)
##       Week_7  < 4.51  to the right, improve=0.08785492, (0 missing)
##       Week_0  < 5.65  to the right, improve=0.06131519, (0 missing)
##   Surrogate splits:
##       Week_5  < 4.25  to the right, agree=0.709, adj=0.179, (0 split)
##       Week_9  < 4.005 to the right, agree=0.709, adj=0.179, (0 split)
##       Week_11 < 4.89  to the right, agree=0.696, adj=0.143, (0 split)
##       Week_0  < 6.15  to the right, agree=0.671, adj=0.071, (0 split)
##       Week_3  < 4.96  to the right, agree=0.658, adj=0.036, (0 split)
## 
## Node number 8: 86 observations,    complexity param=0.01052291
##   mean=0.1046512, MSE=0.0936993 
##   left son=16 (73 obs) right son=17 (13 obs)
##   Primary splits:
##       Week_11 < 4.7   to the right, improve=0.07835239, (0 missing)
##       Week_7  < 4.95  to the right, improve=0.07603926, (0 missing)
##       Week_0  < 4.05  to the right, improve=0.06523491, (0 missing)
##       Week_3  < 4.235 to the right, improve=0.06523491, (0 missing)
##       Week_13 < 5.65  to the right, improve=0.05197897, (0 missing)
##   Surrogate splits:
##       Week_5 < 4.46  to the right, agree=0.872, adj=0.154, (0 split)
##       Week_0 < 3.73  to the right, agree=0.860, adj=0.077, (0 split)
##       Week_9 < 4.105 to the right, agree=0.860, adj=0.077, (0 split)
## 
## Node number 9: 15 observations
##   mean=0.6666667, MSE=0.2222222 
## 
## Node number 10: 17 observations
##   mean=0.2941176, MSE=0.2076125 
## 
## Node number 11: 31 observations,    complexity param=0.01407625
##   mean=0.7419355, MSE=0.1914672 
##   left son=22 (22 obs) right son=23 (9 obs)
##   Primary splits:
##       Week_7  < 17.5  to the right, improve=0.14229250, (0 missing)
##       Week_9  < 17    to the right, improve=0.07617754, (0 missing)
##       Week_11 < 14.5  to the left,  improve=0.05010352, (0 missing)
##       Week_3  < 17    to the right, improve=0.04614076, (0 missing)
##       Week_5  < 27.5  to the left,  improve=0.03216564, (0 missing)
##   Surrogate splits:
##       Week_9 < 14.5  to the right, agree=0.774, adj=0.222, (0 split)
##       Week_0 < 24.5  to the left,  agree=0.742, adj=0.111, (0 split)
##       Week_3 < 28.5  to the left,  agree=0.742, adj=0.111, (0 split)
## 
## Node number 14: 28 observations,    complexity param=0.01607143
##   mean=0.7142857, MSE=0.2040816 
##   left son=28 (21 obs) right son=29 (7 obs)
##   Primary splits:
##       Week_3  < 3.505 to the right, improve=0.13333330, (0 missing)
##       Week_7  < 4.815 to the right, improve=0.13333330, (0 missing)
##       Week_11 < 5.23  to the right, improve=0.13333330, (0 missing)
##       Week_0  < 3.93  to the left,  improve=0.09037433, (0 missing)
##       Week_13 < 4.15  to the left,  improve=0.09037433, (0 missing)
##   Surrogate splits:
##       Week_0  < 2.74  to the right, agree=0.821, adj=0.286, (0 split)
##       Week_5  < 3.305 to the right, agree=0.786, adj=0.143, (0 split)
##       Week_11 < 3.59  to the right, agree=0.786, adj=0.143, (0 split)
## 
## Node number 15: 51 observations
##   mean=0.9803922, MSE=0.01922338 
## 
## Node number 16: 73 observations
##   mean=0.06849315, MSE=0.06380184 
## 
## Node number 17: 13 observations
##   mean=0.3076923, MSE=0.2130178 
## 
## Node number 22: 22 observations,    complexity param=0.01212121
##   mean=0.6363636, MSE=0.231405 
##   left son=44 (11 obs) right son=45 (11 obs)
##   Primary splits:
##       Week_7  < 25    to the left,  improve=0.14285710, (0 missing)
##       Week_9  < 19.5  to the right, improve=0.06696429, (0 missing)
##       Week_0  < 18.5  to the right, improve=0.06696429, (0 missing)
##       Week_5  < 17.5  to the right, improve=0.05982906, (0 missing)
##       Week_11 < 14.5  to the left,  improve=0.01953602, (0 missing)
##   Surrogate splits:
##       Week_3  < 19.5  to the left,  agree=0.773, adj=0.545, (0 split)
##       Week_0  < 12    to the right, agree=0.727, adj=0.455, (0 split)
##       Week_5  < 18.5  to the right, agree=0.636, adj=0.273, (0 split)
##       Week_9  < 26.5  to the left,  agree=0.591, adj=0.182, (0 split)
##       Week_11 < 15.5  to the right, agree=0.591, adj=0.182, (0 split)
## 
## Node number 23: 9 observations
##   mean=1, MSE=0 
## 
## Node number 28: 21 observations,    complexity param=0.01607143
##   mean=0.6190476, MSE=0.2358277 
##   left son=56 (7 obs) right son=57 (14 obs)
##   Primary splits:
##       Week_13 < 4.15  to the left,  improve=0.23557690, (0 missing)
##       Week_0  < 3.93  to the left,  improve=0.15541790, (0 missing)
##       Week_3  < 3.96  to the left,  improve=0.07692308, (0 missing)
##       Week_7  < 4.815 to the right, improve=0.07692308, (0 missing)
##       Week_11 < 5.13  to the right, improve=0.07692308, (0 missing)
##   Surrogate splits:
##       Week_3 < 3.755 to the left,  agree=0.762, adj=0.286, (0 split)
##       Week_0 < 4.13  to the left,  agree=0.714, adj=0.143, (0 split)
##       Week_9 < 3.47  to the left,  agree=0.714, adj=0.143, (0 split)
## 
## Node number 29: 7 observations
##   mean=1, MSE=0 
## 
## Node number 44: 11 observations
##   mean=0.4545455, MSE=0.2479339 
## 
## Node number 45: 11 observations
##   mean=0.8181818, MSE=0.1487603 
## 
## Node number 56: 7 observations
##   mean=0.2857143, MSE=0.2040816 
## 
## Node number 57: 14 observations
##   mean=0.7857143, MSE=0.1683673

rpart.plot(cart)
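
In the CP table above, the cross-validated error (xerror) is lowest at 5 splits (about 0.857) and rises again for larger trees, which suggests pruning. The sketch below shows one common way to prune at the CP value with the lowest xerror. It uses simulated data and a factor outcome with `method = "class"`, which is a change from the numeric outcome used above; all variable names are hypothetical.

```r
library(rpart)
set.seed(10)

# Simulated data: two predictors and a noisy binary outcome.
n <- 300
dat <- data.frame(x1 = runif(n, 0, 10), x2 = runif(n, 0, 10))
y <- ifelse(dat$x1 > 5 & dat$x2 > 3, 1, 0)
flip <- sample(n, 30)          # flip 30 labels to add noise
y[flip] <- 1 - y[flip]
dat$y <- factor(y)

# Grow a deliberately large classification tree.
fit <- rpart(y ~ x1 + x2, data = dat, method = "class", cp = 0.001)

# Pick the complexity parameter with the lowest cross-validated error.
cp_table <- fit$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Prune the tree back to that complexity.
pruned <- prune(fit, cp = best_cp)
print(pruned)
```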

pre <- predict(cart, newdata = Test_g)
dat_pre <- as.factor(ifelse(pre >= 0.5, 1, 0))
conf_matr <- confusionMatrix(as.factor(Test_g$Group), dat_pre)
conf_matr

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 30 10
##          1 15 25
##                                           
##                Accuracy : 0.6875          
##                  95% CI : (0.5741, 0.7865)
##     No Information Rate : 0.5625          
##     P-Value [Acc > NIR] : 0.01511         
##                                           
##                   Kappa : 0.375           
##                                           
##  Mcnemar's Test P-Value : 0.42371         
##                                           
##             Sensitivity : 0.6667          
##             Specificity : 0.7143          
##          Pos Pred Value : 0.7500          
##          Neg Pred Value : 0.6250          
##              Prevalence : 0.5625          
##          Detection Rate : 0.3750          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.6905          
##                                           
##        'Positive' Class : 0               
##
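
The headline statistics above follow directly from the four cell counts. A short worked sketch in base R, with the counts copied from the matrix as printed (rows and columns in the printed order):

```r
# Counts from the printed matrix; matrix() fills column by column.
cm <- matrix(c(30, 15, 10, 25), nrow = 2,
             dimnames = list(row = c("0", "1"), col = c("0", "1")))

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals.
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

# Class 0 is the 'positive' class; rates are taken over the columns.
sensitivity <- cm["0", "0"] / sum(cm[, "0"])
specificity <- cm["1", "1"] / sum(cm[, "1"])

# No-information rate: share of the largest column class.
nir <- max(colSums(cm)) / n

print(round(c(accuracy = accuracy, kappa = cohen_kappa,
              sensitivity = sensitivity, specificity = specificity,
              nir = nir), 4))
```

These give 0.6875, 0.375, 0.6667, 0.7143 and 0.5625, matching the printed output.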

Random Forest

library(randomForest)

## Warning: package 'randomForest' was built under R version 4.3.3

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

l <- randomForest(Group~., data = Train_g)

## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values.  Are you sure you want to do regression?

summary(l)

##                 Length Class  Mode     
## call              3    -none- call     
## type              1    -none- character
## predicted       240    -none- numeric  
## mse             500    -none- numeric  
## rsq             500    -none- numeric  
## oob.times       240    -none- numeric  
## importance        7    -none- numeric  
## importanceSD      0    -none- NULL     
## localImportance   0    -none- NULL     
## proximity         0    -none- NULL     
## ntree             1    -none- numeric  
## mtry              1    -none- numeric  
## forest           11    -none- list     
## coefs             0    -none- NULL     
## y               240    -none- numeric  
## test              0    -none- NULL     
## inbag             0    -none- NULL     
## terms             3    terms  call

plot(l)

importance(l)

##         IncNodePurity
## Week_0       4.872464
## Week_3       6.103293
## Week_5       8.532478
## Week_7       7.867519
## Week_9       7.709972
## Week_11      9.523452
## Week_13     10.510064

library(randomForestExplainer)

## Warning: package 'randomForestExplainer' was built under R version 4.3.3

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

tree_to_plot <- getTree(l, k = 1, labelVar = TRUE)
plot(tree_to_plot)

predi <- predict(l, newdata = Test_g)
dat_pred <- as.factor(ifelse(predi >= 0.5, 1, 0))
conf_matri <- confusionMatrix(as.factor(Test_g$Group), dat_pred)
conf_matri

Support Vector Machine

library(e1071)
suport <- svm(Group~., data = Train_g, cost = 0.1, type="C-classification", kernel ="linear", scale = T)
summary(suport)

## 
## Call:
## svm(formula = Group ~ ., data = Train_g, cost = 0.1, type = "C-classification", 
##     kernel = "linear", scale = T)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.1 
## 
## Number of Support Vectors:  219
## 
##  ( 109 110 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

Extract two principal components for visualization

pca_data <- prcomp(Train_g[, -ncol(Train_g)], scale. = TRUE)
data_pca <- as.data.frame(cbind(pca_data$x[,1], pca_data$x[,2], Group = Train_g$Group))

Plot the training data on the first two principal components

ggplot(data_pca, aes(x = V1, y = V2, color = Group)) +
  geom_point() +
  labs(x = "Principal Component 1", y = "Principal Component 2", color = "Label") +
  theme_minimal()
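Because `cbind()` returns a numeric matrix, `Group` ends up as a number in `data_pca`, the score columns get the default names `V1` and `V2`, and ggplot2 draws a continuous colour scale. A possible alternative, sketched here on the built-in `iris` data rather than `Train_g`, is to build the data frame directly so that the label stays a factor and the components are named. The names `scores` and `var_explained` are illustrative only.

```r
# iris stands in for Train_g here; Species plays the role of Group.
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# data.frame() keeps the label as a factor and names the score columns
scores <- data.frame(PC1   = pca$x[, 1],
                     PC2   = pca$x[, 2],
                     Group = factor(iris$Species))

# Share of total variance carried by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained[1:2], 3)

str(scores)
```

With a factor label, `aes(x = PC1, y = PC2, color = Group)` gives one discrete colour per class, and the variance shares indicate how much of the data the two-dimensional view retains.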

Hyperparameter tuning

tuned <- tune(svm, Group~.,  data = Train_g, kernel= "linear",  ranges = list(cost =c(0.1,1, 10, 20, 100)))

Print the results

print(tuned)

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##   0.1
## 
## - best performance: 0.3288493

pred <- predict(suport, Test_g)

Compute confusion matrix and other evaluation parameters

conf_matrix <- confusionMatrix(as.factor(Test_g$Group), pred)
conf_matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 10 30
##          1  7 33
##                                           
##                Accuracy : 0.5375          
##                  95% CI : (0.4224, 0.6497)
##     No Information Rate : 0.7875          
##     P-Value [Acc > NIR] : 0.9999998       
##                                           
##                   Kappa : 0.075           
##                                           
##  Mcnemar's Test P-Value : 0.0002983       
##                                           
##             Sensitivity : 0.5882          
##             Specificity : 0.5238          
##          Pos Pred Value : 0.2500          
##          Neg Pred Value : 0.8250          
##              Prevalence : 0.2125          
##          Detection Rate : 0.1250          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.5560          
##                                           
##        'Positive' Class : 0               
##
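caret documents the call as `confusionMatrix(data, reference)`, with predictions first and observed labels second. In the call above the observed `Test_g$Group` is passed first, so the rows labelled Prediction hold the observed groups. The base-R sketch below (illustrative names, counts copied from the table above) shows what swapping the roles does to the class-0 statistics: sensitivity and positive predictive value trade places, while accuracy is unchanged.

```r
# Table exactly as printed above: rows = first argument, columns = second
as_printed <- matrix(c(10, 7, 30, 33), nrow = 2,
                     dimnames = list(c("0", "1"), c("0", "1")))

# Transposing swaps which factor is treated as the reference
swapped <- t(as_printed)

class0_stats <- function(tab) {
  c(accuracy    = sum(diag(tab)) / sum(tab),
    sensitivity = tab["0", "0"] / sum(tab[, "0"]),   # column = reference
    ppv         = tab["0", "0"] / sum(tab["0", ]))   # row = prediction
}

round(rbind(as_printed = class0_stats(as_printed),
            swapped    = class0_stats(swapped)), 4)
```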

MODEL FOR BEHAVIOUR

Classification and Regression Tree

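# Note: Test_b and Train_b are the same 320 rows, so every "test" metric in this
# section is a training-set (resubstitution) estimate, not a held-out one.
Test_b  <- transformed_boluwatife %>% select(-Group)
Train_b <- transformed_boluwatife %>% select(-Group)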
cart_behaviour <- rpart(as.factor(n_behaviour)~., data = Train_b)
summary(cart_behaviour)

## Call:
## rpart(formula = as.factor(n_behaviour) ~ ., data = Train_b)
##   n= 320 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.33333333      0 1.0000000 1.1250000 0.02706329
## 2 0.15833333      1 0.6666667 0.7708333 0.03681006
## 3 0.03333333      2 0.5083333 0.6000000 0.03708099
## 4 0.02916667      3 0.4750000 0.6166667 0.03716284
## 5 0.02083333      4 0.4458333 0.5541667 0.03673334
## 6 0.01944444      6 0.4041667 0.5166667 0.03631221
## 7 0.01666667      9 0.3458333 0.5083333 0.03620148
## 8 0.01000000     10 0.3291667 0.5208333 0.03636521
## 
## Variable importance
## Week_11  Week_0  Week_5 Week_13  Week_7  Week_9  Week_3 
##      17      17      16      16      16      16       3 
## 
## Node number 1: 320 observations,    complexity param=0.3333333
##   predicted class= TL         expected loss=0.75  P(node) =1
##     class counts:    80    80    80    80
##    probabilities: 0.250 0.250 0.250 0.250 
##   left son=2 (80 obs) right son=3 (240 obs)
##   Primary splits:
##       Week_11 < 9.65  to the right, improve=80.00000, (0 missing)
##       Week_13 < 9.7   to the right, improve=80.00000, (0 missing)
##       Week_9  < 9.6   to the right, improve=78.67220, (0 missing)
##       Week_7  < 9.55  to the right, improve=77.35537, (0 missing)
##       Week_5  < 9.5   to the right, improve=74.75410, (0 missing)
##   Surrogate splits:
##       Week_13 < 9.7   to the right, agree=1.000, adj=1.000, (0 split)
##       Week_9  < 9.6   to the right, agree=0.997, adj=0.987, (0 split)
##       Week_7  < 9.55  to the right, agree=0.994, adj=0.975, (0 split)
##       Week_5  < 9.5   to the right, agree=0.988, adj=0.950, (0 split)
##       Week_0  < 7.995 to the right, agree=0.984, adj=0.938, (0 split)
## 
## Node number 2: 80 observations
##   predicted class=Aggressive  expected loss=0  P(node) =0.25
##     class counts:     0    80     0     0
##    probabilities: 0.000 1.000 0.000 0.000 
## 
## Node number 3: 240 observations,    complexity param=0.1583333
##   predicted class= TL         expected loss=0.6666667  P(node) =0.75
##     class counts:    80     0    80    80
##    probabilities: 0.333 0.000 0.333 0.333 
##   left son=6 (164 obs) right son=7 (76 obs)
##   Primary splits:
##       Week_3  < 3.995 to the right, improve=14.22336, (0 missing)
##       Week_0  < 4.895 to the right, improve=13.49114, (0 missing)
##       Week_13 < 5.58  to the right, improve=13.09218, (0 missing)
##       Week_11 < 5.45  to the right, improve=13.02944, (0 missing)
##       Week_9  < 5.375 to the right, improve=12.91925, (0 missing)
##   Surrogate splits:
##       Week_5  < 3.92  to the right, agree=0.775, adj=0.289, (0 split)
##       Week_0  < 3.91  to the right, agree=0.762, adj=0.250, (0 split)
##       Week_11 < 3.795 to the right, agree=0.750, adj=0.211, (0 split)
##       Week_13 < 3.975 to the right, agree=0.746, adj=0.197, (0 split)
##       Week_9  < 3.49  to the right, agree=0.733, adj=0.158, (0 split)
## 
## Node number 6: 164 observations,    complexity param=0.03333333
##   predicted class= TL         expected loss=0.5609756  P(node) =0.5125
##     class counts:    72     0    58    34
##    probabilities: 0.439 0.000 0.354 0.207 
##   left son=12 (74 obs) right son=13 (90 obs)
##   Primary splits:
##       Week_7  < 5.28  to the right, improve=6.251491, (0 missing)
##       Week_0  < 4.895 to the right, improve=5.776332, (0 missing)
##       Week_9  < 5.37  to the right, improve=5.604068, (0 missing)
##       Week_13 < 5.58  to the right, improve=5.604068, (0 missing)
##       Week_5  < 3.99  to the right, improve=5.348671, (0 missing)
##   Surrogate splits:
##       Week_13 < 5.84  to the right, agree=0.720, adj=0.378, (0 split)
##       Week_9  < 5.85  to the right, agree=0.695, adj=0.324, (0 split)
##       Week_11 < 5.885 to the right, agree=0.683, adj=0.297, (0 split)
##       Week_3  < 5.995 to the right, agree=0.677, adj=0.284, (0 split)
##       Week_5  < 5.15  to the right, agree=0.659, adj=0.243, (0 split)
## 
## Node number 7: 76 observations,    complexity param=0.02916667
##   predicted class=Weight      expected loss=0.3947368  P(node) =0.2375
##     class counts:     8     0    22    46
##    probabilities: 0.105 0.000 0.289 0.605 
##   left son=14 (12 obs) right son=15 (64 obs)
##   Primary splits:
##       Week_0  < 5.015 to the right, improve=6.207785, (0 missing)
##       Week_11 < 4.89  to the right, improve=5.108541, (0 missing)
##       Week_7  < 4.195 to the right, improve=3.583801, (0 missing)
##       Week_13 < 5.49  to the right, improve=3.050888, (0 missing)
##       Week_9  < 4.075 to the right, improve=2.581998, (0 missing)
##   Surrogate splits:
##       Week_13 < 5.49  to the right, agree=0.882, adj=0.250, (0 split)
##       Week_9  < 5.445 to the right, agree=0.868, adj=0.167, (0 split)
##       Week_7  < 5.35  to the right, agree=0.855, adj=0.083, (0 split)
##       Week_11 < 5.3   to the right, agree=0.855, adj=0.083, (0 split)
## 
## Node number 12: 74 observations,    complexity param=0.01666667
##   predicted class= TL         expected loss=0.3918919  P(node) =0.23125
##     class counts:    45     0    23     6
##    probabilities: 0.608 0.000 0.311 0.081 
##   left son=24 (56 obs) right son=25 (18 obs)
##   Primary splits:
##       Week_0  < 4.85  to the right, improve=3.242063, (0 missing)
##       Week_9  < 5.35  to the right, improve=2.710008, (0 missing)
##       Week_11 < 5.45  to the right, improve=2.704215, (0 missing)
##       Week_5  < 5.05  to the right, improve=1.860140, (0 missing)
##       Week_13 < 5.55  to the right, improve=1.754286, (0 missing)
##   Surrogate splits:
##       Week_5  < 3.2   to the right, agree=0.784, adj=0.111, (0 split)
##       Week_3  < 4.335 to the right, agree=0.770, adj=0.056, (0 split)
##       Week_13 < 3.75  to the right, agree=0.770, adj=0.056, (0 split)
## 
## Node number 13: 90 observations,    complexity param=0.02083333
##   predicted class=SL          expected loss=0.6111111  P(node) =0.28125
##     class counts:    27     0    35    28
##    probabilities: 0.300 0.000 0.389 0.311 
##   left son=26 (70 obs) right son=27 (20 obs)
##   Primary splits:
##       Week_0  < 3.995 to the right, improve=3.549206, (0 missing)
##       Week_5  < 3.99  to the right, improve=3.452109, (0 missing)
##       Week_13 < 3.73  to the right, improve=1.628453, (0 missing)
##       Week_11 < 5.575 to the right, improve=1.366548, (0 missing)
##       Week_9  < 3.625 to the left,  improve=1.023724, (0 missing)
##   Surrogate splits:
##       Week_9 < 2.65  to the right, agree=0.789, adj=0.05, (0 split)
## 
## Node number 14: 12 observations
##   predicted class=SL          expected loss=0.3333333  P(node) =0.0375
##     class counts:     3     0     8     1
##    probabilities: 0.250 0.000 0.667 0.083 
## 
## Node number 15: 64 observations
##   predicted class=Weight      expected loss=0.296875  P(node) =0.2
##     class counts:     5     0    14    45
##    probabilities: 0.078 0.000 0.219 0.703 
## 
## Node number 24: 56 observations
##   predicted class= TL         expected loss=0.3035714  P(node) =0.175
##     class counts:    39     0    13     4
##    probabilities: 0.696 0.000 0.232 0.071 
## 
## Node number 25: 18 observations
##   predicted class=SL          expected loss=0.4444444  P(node) =0.05625
##     class counts:     6     0    10     2
##    probabilities: 0.333 0.000 0.556 0.111 
## 
## Node number 26: 70 observations,    complexity param=0.01944444
##   predicted class= TL         expected loss=0.6142857  P(node) =0.21875
##     class counts:    27     0    25    18
##    probabilities: 0.386 0.000 0.357 0.257 
##   left son=52 (36 obs) right son=53 (34 obs)
##   Primary splits:
##       Week_11 < 4.865 to the left,  improve=1.6658260, (0 missing)
##       Week_3  < 5.15  to the left,  improve=1.3785710, (0 missing)
##       Week_7  < 4.915 to the left,  improve=1.0285710, (0 missing)
##       Week_5  < 6.25  to the left,  improve=1.0231070, (0 missing)
##       Week_0  < 6.25  to the left,  improve=0.9237327, (0 missing)
##   Surrogate splits:
##       Week_7  < 4.21  to the left,  agree=0.771, adj=0.529, (0 split)
##       Week_5  < 4.96  to the left,  agree=0.729, adj=0.441, (0 split)
##       Week_9  < 4.975 to the left,  agree=0.686, adj=0.353, (0 split)
##       Week_13 < 4.75  to the left,  agree=0.686, adj=0.353, (0 split)
##       Week_0  < 5.15  to the left,  agree=0.614, adj=0.206, (0 split)
## 
## Node number 27: 20 observations,    complexity param=0.02083333
##   predicted class=SL          expected loss=0.5  P(node) =0.0625
##     class counts:     0     0    10    10
##    probabilities: 0.000 0.000 0.500 0.500 
##   left son=54 (12 obs) right son=55 (8 obs)
##   Primary splits:
##       Week_5  < 3.99  to the right, improve=6.6666670, (0 missing)
##       Week_13 < 4.875 to the right, improve=2.7472530, (0 missing)
##       Week_11 < 4.255 to the right, improve=2.5252530, (0 missing)
##       Week_7  < 4.025 to the right, improve=1.6000000, (0 missing)
##       Week_9  < 4.21  to the left,  improve=0.4166667, (0 missing)
##   Surrogate splits:
##       Week_11 < 4.255 to the right, agree=0.75, adj=0.375, (0 split)
##       Week_0  < 3.315 to the right, agree=0.70, adj=0.250, (0 split)
##       Week_7  < 2.93  to the right, agree=0.70, adj=0.250, (0 split)
##       Week_9  < 4.495 to the right, agree=0.65, adj=0.125, (0 split)
##       Week_13 < 4.875 to the right, agree=0.65, adj=0.125, (0 split)
## 
## Node number 52: 36 observations,    complexity param=0.01944444
##   predicted class= TL         expected loss=0.5277778  P(node) =0.1125
##     class counts:    17     0    14     5
##    probabilities: 0.472 0.000 0.389 0.139 
##   left son=104 (17 obs) right son=105 (19 obs)
##   Primary splits:
##       Week_11 < 3.975 to the right, improve=2.724974, (0 missing)
##       Week_9  < 3.925 to the right, improve=1.776190, (0 missing)
##       Week_3  < 5.505 to the left,  improve=1.462963, (0 missing)
##       Week_7  < 3.245 to the right, improve=1.389984, (0 missing)
##       Week_5  < 3.95  to the right, improve=1.333333, (0 missing)
##   Surrogate splits:
##       Week_9  < 4.025 to the right, agree=0.694, adj=0.353, (0 split)
##       Week_13 < 3.85  to the right, agree=0.694, adj=0.353, (0 split)
##       Week_0  < 5.95  to the right, agree=0.639, adj=0.235, (0 split)
##       Week_5  < 3.76  to the right, agree=0.639, adj=0.235, (0 split)
##       Week_7  < 4.065 to the right, agree=0.639, adj=0.235, (0 split)
## 
## Node number 53: 34 observations,    complexity param=0.01944444
##   predicted class=Weight      expected loss=0.6176471  P(node) =0.10625
##     class counts:    10     0    11    13
##    probabilities: 0.294 0.000 0.324 0.382 
##   left son=106 (20 obs) right son=107 (14 obs)
##   Primary splits:
##       Week_11 < 5.575 to the right, improve=3.943697, (0 missing)
##       Week_13 < 5.58  to the right, improve=3.472269, (0 missing)
##       Week_0  < 4.3   to the left,  improve=2.561158, (0 missing)
##       Week_9  < 5.57  to the right, improve=2.200840, (0 missing)
##       Week_5  < 6.25  to the right, improve=1.683258, (0 missing)
##   Surrogate splits:
##       Week_5  < 4.86  to the right, agree=0.735, adj=0.357, (0 split)
##       Week_13 < 4.625 to the right, agree=0.706, adj=0.286, (0 split)
##       Week_7  < 3.97  to the right, agree=0.676, adj=0.214, (0 split)
##       Week_9  < 3.84  to the right, agree=0.676, adj=0.214, (0 split)
##       Week_3  < 6.3   to the left,  agree=0.618, adj=0.071, (0 split)
## 
## Node number 54: 12 observations
##   predicted class=SL          expected loss=0.1666667  P(node) =0.0375
##     class counts:     0     0    10     2
##    probabilities: 0.000 0.000 0.833 0.167 
## 
## Node number 55: 8 observations
##   predicted class=Weight      expected loss=0  P(node) =0.025
##     class counts:     0     0     0     8
##    probabilities: 0.000 0.000 0.000 1.000 
## 
## Node number 104: 17 observations
##   predicted class= TL         expected loss=0.2941176  P(node) =0.053125
##     class counts:    12     0     4     1
##    probabilities: 0.706 0.000 0.235 0.059 
## 
## Node number 105: 19 observations
##   predicted class=SL          expected loss=0.4736842  P(node) =0.059375
##     class counts:     5     0    10     4
##    probabilities: 0.263 0.000 0.526 0.211 
## 
## Node number 106: 20 observations
##   predicted class=SL          expected loss=0.55  P(node) =0.0625
##     class counts:     8     0     9     3
##    probabilities: 0.400 0.000 0.450 0.150 
## 
## Node number 107: 14 observations
##   predicted class=Weight      expected loss=0.2857143  P(node) =0.04375
##     class counts:     2     0     2    10
##    probabilities: 0.143 0.000 0.143 0.714

rpart.plot(cart_behaviour)

pre_behaviour <- predict(cart_behaviour, newdata = Test_b, type ="class")

conf_matr_behaviour <- confusionMatrix(as.factor(Test_b$n_behaviour), pre_behaviour)
conf_matr_behaviour

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction    TL Aggressive SL Weight
##    TL         51          0 22      7
##   Aggressive   0         80  0      0
##   SL          17          0 47     16
##   Weight       5          0 12     63
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7531          
##                  95% CI : (0.7021, 0.7994)
##     No Information Rate : 0.2688          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6708          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class:  TL Class: Aggressive Class: SL Class: Weight
## Sensitivity              0.6986              1.00    0.5802        0.7326
## Specificity              0.8826              1.00    0.8619        0.9274
## Pos Pred Value           0.6375              1.00    0.5875        0.7875
## Neg Pred Value           0.9083              1.00    0.8583        0.9042
## Prevalence               0.2281              0.25    0.2531        0.2687
## Detection Rate           0.1594              0.25    0.1469        0.1969
## Detection Prevalence     0.2500              0.25    0.2500        0.2500
## Balanced Accuracy        0.7906              1.00    0.7211        0.8300
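Since `Test_b` and `Train_b` above are both the full data set, the figures in this matrix describe fit to the training rows. One way to obtain a held-out estimate is a stratified split, sketched below in base R on simulated stand-in data. The data frame `toy`, the 75/25 ratio, and the object names are assumptions for illustration, not part of the original analysis.

```r
set.seed(10)

# Simulated stand-in: four behaviour classes with 80 rows each
toy <- data.frame(
  n_behaviour = rep(c("SL", "TL", "Aggressive", "Weight"), each = 80),
  Week_0      = rnorm(320)
)

# Sample 75% of the row indices within each class (stratified split)
rows_by_class <- split(seq_len(nrow(toy)), toy$n_behaviour)
train_idx <- unlist(lapply(rows_by_class,
                           function(i) sample(i, size = round(0.75 * length(i)))))

train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

table(train_toy$n_behaviour)
table(test_toy$n_behaviour)
```

Sampling within each class keeps the four behaviours balanced in both parts, so the no-information rate stays at 0.25 in the held-out rows.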

Iterative Dichotomiser 3 (ID3), fitted with C5.0

library(C50)

## Warning: package 'C50' was built under R version 4.3.3

ID_3 <- C5.0(as.factor(n_behaviour)~., data = Train_b)
summary(ID_3)

## 
## Call:
## C5.0.formula(formula = as.factor(n_behaviour) ~ ., data = Train_b)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Wed Apr 17 21:46:09 2024
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 320 cases (8 attributes) from undefined.data
## 
## Decision tree:
## 
## Week_11 > 9.3: Aggressive (80)
## Week_11 <= 9.3:
## :...Week_3 <= 3.99:
##     :...Week_0 > 5:
##     :   :...Week_5 <= 3.5: TL (2/1)
##     :   :   Week_5 > 3.5:
##     :   :   :...Week_0 <= 5.8: SL (7)
##     :   :       Week_0 > 5.8: TL (3/1)
##     :   Week_0 <= 5:
##     :   :...Week_7 > 4.19:
##     :       :...Week_5 <= 3.01: Weight (4)
##     :       :   Week_5 > 3.01:
##     :       :   :...Week_5 > 4:
##     :       :       :...Week_3 <= 3.8: Weight (4)
##     :       :       :   Week_3 > 3.8: SL (3/1)
##     :       :       Week_5 <= 4:
##     :       :       :...Week_5 <= 3.21: SL (2)
##     :       :           Week_5 > 3.21:
##     :       :           :...Week_7 <= 4.82: TL (3)
##     :       :               Week_7 > 4.82: SL (3/1)
##     :       Week_7 <= 4.19:
##     :       :...Week_13 > 3.6: Weight (25/1)
##     :           Week_13 <= 3.6:
##     :           :...Week_13 > 3.46: SL (3)
##     :               Week_13 <= 3.46:
##     :               :...Week_9 > 4.29: SL (3/1)
##     :                   Week_9 <= 4.29:
##     :                   :...Week_0 <= 4.1: Weight (10)
##     :                       Week_0 > 4.1:
##     :                       :...Week_5 <= 3.6: Weight (2)
##     :                           Week_5 > 3.6: SL (2)
##     Week_3 > 3.99:
##     :...Week_0 <= 3.99:
##         :...Week_5 <= 3.98: Weight (9/1)
##         :   Week_5 > 3.98:
##         :   :...Week_7 <= 5.23: SL (11/1)
##         :       Week_7 > 5.23: Weight (3/1)
##         Week_0 > 3.99:
##         :...Week_9 <= 5.35:
##             :...Week_3 > 7.1: Weight (3)
##             :   Week_3 <= 7.1:
##             :   :...Week_5 <= 7.5: TL (71/43)
##             :       Week_5 > 7.5: Weight (2)
##             Week_9 > 5.35:
##             :...Week_11 > 7.43: TL (12)
##                 Week_11 <= 7.43:
##                 :...Week_7 > 5.9:
##                     :...Week_0 <= 4.4: SL (2)
##                     :   Week_0 > 4.4:
##                     :   :...Week_11 > 5.1: TL (18/1)
##                     :       Week_11 <= 5.1:
##                     :       :...Week_11 <= 4.8: TL (2)
##                     :           Week_11 > 4.8: SL (3)
##                     Week_7 <= 5.9:
##                     :...Week_11 > 6.7: SL (6)
##                         Week_11 <= 6.7:
##                         :...Week_5 > 5.9:
##                             :...Week_9 <= 6.4: TL (5)
##                             :   Week_9 > 6.4: Weight (2)
##                             Week_5 <= 5.9:
##                             :...Week_7 <= 3.95: TL (3/1)
##                                 Week_7 > 3.95:
##                                 :...Week_9 > 7.5: TL (2)
##                                     Week_9 <= 7.5:
##                                     :...Week_5 <= 5.6: SL (7)
##                                         Week_5 > 5.6: TL (3/1)
## 
## 
## Evaluation on training data (320 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      34   55(17.2%)   <<
## 
## 
##     (a)   (b)   (c)   (d)    <-classified as
##    ----  ----  ----  ----
##      76           2     2    (a): class TL
##            80                (b): class Aggressive
##      31          48     1    (c): class SL
##      17           2    61    (d): class Weight
## 
## 
##  Attribute usage:
## 
##  100.00% Week_11
##   75.00% Week_0
##   75.00% Week_3
##   49.38% Week_9
##   47.81% Week_5
##   40.94% Week_7
##   14.06% Week_13
## 
## 
## Time: 0.0 secs

plot(ID_3)

ID_3_behaviour <- predict(ID_3, newdata = Test_b, type = "class")

ID_3_behaviour <- confusionMatrix(as.factor(Test_b$n_behaviour), ID_3_behaviour)
ID_3_behaviour

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction    TL Aggressive SL Weight
##    TL         76          0  2      2
##   Aggressive   0         80  0      0
##   SL          31          0 48      1
##   Weight      17          0  2     61
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8281          
##                  95% CI : (0.7822, 0.8678)
##     No Information Rate : 0.3875          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7708          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class:  TL Class: Aggressive Class: SL Class: Weight
## Sensitivity              0.6129              1.00    0.9231        0.9531
## Specificity              0.9796              1.00    0.8806        0.9258
## Pos Pred Value           0.9500              1.00    0.6000        0.7625
## Neg Pred Value           0.8000              1.00    0.9833        0.9875
## Prevalence               0.3875              0.25    0.1625        0.2000
## Detection Rate           0.2375              0.25    0.1500        0.1906
## Detection Prevalence     0.2500              0.25    0.2500        0.2500
## Balanced Accuracy        0.7962              1.00    0.9018        0.9395
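The per-class rows above follow from the 4 x 4 table in a one-versus-rest fashion. A minimal base-R sketch, with counts copied from the matrix above and illustrative object names:

```r
classes <- c("TL", "Aggressive", "SL", "Weight")
cm4 <- matrix(c(76,  0,  2,  2,
                 0, 80,  0,  0,
                31,  0, 48,  1,
                17,  0,  2, 61),
              nrow = 4, byrow = TRUE,
              dimnames = list(Prediction = classes, Reference = classes))

n  <- sum(cm4)
tp <- diag(cm4)                            # correctly classified, per class

accuracy    <- sum(tp) / n
sensitivity <- tp / colSums(cm4)           # correct / column (reference) total
ppv         <- tp / rowSums(cm4)           # correct / row (prediction) total
# True negatives: everything outside the class's row and column
specificity <- (n - rowSums(cm4) - colSums(cm4) + tp) / (n - colSums(cm4))

# Chance-corrected agreement
expected  <- sum(rowSums(cm4) * colSums(cm4)) / n^2
kappa_hat <- (accuracy - expected) / (1 - expected)

round(accuracy, 4)
round(kappa_hat, 4)
round(rbind(sensitivity, specificity, ppv), 4)
```

Because every row total is 80, chance agreement is exactly 0.25, and Kappa reduces to (accuracy - 0.25) / 0.75.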

Chi-square Automatic Interaction Detection (CHAID) Model

library(partykit)

## Warning: package 'partykit' was built under R version 4.3.3

## Loading required package: grid

## Loading required package: libcoin

## Warning: package 'libcoin' was built under R version 4.3.3

## Loading required package: mvtnorm

## Warning: package 'mvtnorm' was built under R version 4.3.3

Chaid <- ctree(as.factor(n_behaviour) ~., data = Train_b)
summary(Chaid)

##   Length Class      Mode
## 1 9      constparty list
## 2 7      constparty list
## 3 3      constparty list
## 4 1      constparty list
## 5 1      constparty list
## 6 3      constparty list
## 7 1      constparty list
## 8 1      constparty list
## 9 1      constparty list

plot(Chaid)

chaid_behaviour <- predict(Chaid, newdata = Test_b, type = "response")

## Compute confusion matrix
conf_matrix_chaid <- confusionMatrix(as.factor(Test_b$n_behaviour), chaid_behaviour)
conf_matrix_chaid

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction    TL Aggressive SL Weight
##    TL         57          0 18      5
##   Aggressive   0         80  0      0
##   SL          35          0 32     13
##   Weight      17          0 19     44
## 
## Overall Statistics
##                                          
##                Accuracy : 0.6656         
##                  95% CI : (0.611, 0.7171)
##     No Information Rate : 0.3406         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.5542         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class:  TL Class: Aggressive Class: SL Class: Weight
## Sensitivity              0.5229              1.00    0.4638        0.7097
## Specificity              0.8910              1.00    0.8088        0.8605
## Pos Pred Value           0.7125              1.00    0.4000        0.5500
## Neg Pred Value           0.7833              1.00    0.8458        0.9250
## Prevalence               0.3406              0.25    0.2156        0.1938
## Detection Rate           0.1781              0.25    0.1000        0.1375
## Detection Prevalence     0.2500              0.25    0.2500        0.2500
## Balanced Accuracy        0.7070              1.00    0.6363        0.7851

Conditional Inference Trees

library(party)

## Warning: package 'party' was built under R version 4.3.3

## Loading required package: modeltools

## Loading required package: stats4

## Loading required package: strucchange

## Warning: package 'strucchange' was built under R version 4.3.3

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Loading required package: sandwich

## Warning: package 'sandwich' was built under R version 4.3.3

## 
## Attaching package: 'strucchange'

## The following object is masked from 'package:stringr':
## 
##     boundary

## 
## Attaching package: 'party'

## The following objects are masked from 'package:partykit':
## 
##     cforest, ctree, ctree_control, edge_simple, mob, mob_control,
##     node_barplot, node_bivplot, node_boxplot, node_inner, node_surv,
##     node_terminal, varimp

## The following object is masked from 'package:dplyr':
## 
##     where

l_behaviour <- ctree(as.factor(n_behaviour)~., data = Train_b)
summary(l_behaviour)

##     Length      Class       Mode 
##          1 BinaryTree         S4

plot(l_behaviour)

predi_behaviour <- predict(l_behaviour, newdata = Test_b, type = "response")
conf_matri_behavi <- confusionMatrix(as.factor(Test_b$n_behaviour), predi_behaviour)
conf_matri_behavi

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction    TL Aggressive SL Weight
##    TL         38          0 37      5
##   Aggressive   0         80  0      0
##   SL          23          0 38     19
##   Weight       4          0 36     40
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6125          
##                  95% CI : (0.5567, 0.6662)
##     No Information Rate : 0.3469          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4833          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class:  TL Class: Aggressive Class: SL Class: Weight
## Sensitivity              0.5846              1.00    0.3423        0.6250
## Specificity              0.8353              1.00    0.7990        0.8438
## Pos Pred Value           0.4750              1.00    0.4750        0.5000
## Neg Pred Value           0.8875              1.00    0.6958        0.9000
## Prevalence               0.2031              0.25    0.3469        0.2000
## Detection Rate           0.1187              0.25    0.1187        0.1250
## Detection Prevalence     0.2500              0.25    0.2500        0.2500
## Balanced Accuracy        0.7100              1.00    0.5707        0.7344

Naive Bayes

behaviour <- naiveBayes(as.factor(n_behaviour)~., data = Train_b)
summary(behaviour)

##           Length Class  Mode     
## apriori   4      table  numeric  
## tables    7      -none- list     
## levels    4      -none- character
## isnumeric 7      -none- logical  
## call      4      -none- call

prei_behaviour <- predict(behaviour, newdata = Test_b, type ="class")

conf_matri_behavi <- confusionMatrix(as.factor(Test_b$n_behaviour), as.factor(prei_behaviour))
conf_matri_behavi

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction    TL Aggressive SL Weight
##    TL         43          0 19     18
##   Aggressive   0         80  0      0
##   SL          27          0 18     35
##   Weight      13          0 13     54
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6094          
##                  95% CI : (0.5535, 0.6632)
##     No Information Rate : 0.3344          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4792          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class:  TL Class: Aggressive Class: SL Class: Weight
## Sensitivity              0.5181              1.00   0.36000        0.5047
## Specificity              0.8439              1.00   0.77037        0.8779
## Pos Pred Value           0.5375              1.00   0.22500        0.6750
## Neg Pred Value           0.8333              1.00   0.86667        0.7792
## Prevalence               0.2594              0.25   0.15625        0.3344
## Detection Rate           0.1344              0.25   0.05625        0.1688
## Detection Prevalence     0.2500              0.25   0.25000        0.2500
## Balanced Accuracy        0.6810              1.00   0.56519        0.6913

Support Vector Machine

suport_behaviour <- svm(as.factor(n_behaviour)~., data = Train_b, kernel ="linear", cost = 10)
summary(suport_behaviour)

## 
## Call:
## svm(formula = as.factor(n_behaviour) ~ ., data = Train_b, kernel = "linear", 
##     cost = 10)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
## 
## Number of Support Vectors:  214
## 
##  ( 80 66 4 64 )
## 
## 
## Number of Classes:  4 
## 
## Levels: 
##   TL Aggressive SL Weight

tuned <- tune(svm, as.factor(n_behaviour)~.,  data = Train_b, kernel= "linear", ranges = list(cost =c(0.1,1, 10, 20, 100)))

print(tuned)

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##     1
## 
## - best performance: 0.415625
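
The tuning run above selects cost = 1, whereas the model fitted earlier uses cost = 10. For a classification task, the "best performance" reported by e1071's tune() is the cross-validated misclassification error, so 0.415625 corresponds to roughly 58% cross-validated accuracy. The following is a minimal sketch of how the tuned model could be retrieved and used for prediction. It assumes the e1071 package is installed and uses the built-in iris data as a stand-in for Train_b and Test_b; the names iris_tuned and best_fit are illustrative and do not appear in the original analysis.

```r
library(e1071)

set.seed(10)

# iris stands in for the training data; Species plays the role of n_behaviour.
iris_tuned <- tune(
  svm, Species ~ ., data = iris,
  kernel = "linear",
  ranges = list(cost = c(0.1, 1, 10, 20, 100))
)

# Cost value with the lowest cross-validated error, and that error.
iris_tuned$best.parameters
iris_tuned$best.performance

# By default tune() also refits the model on the full data at the best cost.
best_fit <- iris_tuned$best.model

# Predict with the tuned model rather than one fitted at a fixed cost.
pred <- predict(best_fit, newdata = iris)
table(Predicted = pred, Actual = iris$Species)
```

Applied to this analysis, the same pattern would mean predicting on Test_b with tuned$best.model instead of the cost = 10 fit.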

# Predictions come from the cost = 10 model fitted above, not from the tuned (cost = 1) model.
predq_behaviour <- predict(suport_behaviour, Test_b)

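# Note: confusionMatrix() expects predictions first (data) and true labels second (reference).
# The order is reversed here, so the "Prediction" rows below hold the actual classes.
conf_matrix_behaviour <- confusionMatrix(as.factor(Test_b$n_behaviour), predq_behaviour)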
conf_matrix_behaviour

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction    TL Aggressive SL Weight
##    TL         44          0 19     17
##   Aggressive   0         80  0      0
##   SL          25          0 24     31
##   Weight      17          0 10     53
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6281          
##                  95% CI : (0.5726, 0.6812)
##     No Information Rate : 0.3156          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5042          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class:  TL Class: Aggressive Class: SL Class: Weight
## Sensitivity              0.5116              1.00    0.4528        0.5248
## Specificity              0.8462              1.00    0.7903        0.8767
## Pos Pred Value           0.5500              1.00    0.3000        0.6625
## Neg Pred Value           0.8250              1.00    0.8792        0.8000
## Prevalence               0.2687              0.25    0.1656        0.3156
## Detection Rate           0.1375              0.25    0.0750        0.1656
## Detection Prevalence     0.2500              0.25    0.2500        0.2500
## Balanced Accuracy        0.6789              1.00    0.6215        0.7007
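
caret's confusionMatrix() takes the predicted classes as its first argument (data) and the true classes as its second (reference). The call above passes them the other way round, so the table's rows are the actual classes and its columns are the SVM's predictions. Overall accuracy and Kappa are unaffected by this, but per-class Sensitivity and Pos Pred Value trade places. A base-R sketch of the effect, using the counts printed above (variable names are illustrative):

```r
classes <- c("TL", "Aggressive", "SL", "Weight")

# Counts copied from the SVM table above: rows = first argument, columns = second.
printed <- matrix(
  c(44,  0, 19, 17,
     0, 80,  0,  0,
    25,  0, 24, 31,
    17,  0, 10, 53),
  nrow = 4, byrow = TRUE,
  dimnames = list(Actual = classes, Predicted = classes)
)

# Conventional orientation: predictions in rows, true classes in columns.
conventional <- t(printed)

# Recall (sensitivity): correct predictions / number of actual cases per class.
recall <- diag(conventional) / colSums(conventional)

# Precision (positive predictive value): correct predictions / number predicted per class.
precision <- diag(conventional) / rowSums(conventional)

round(rbind(recall, precision), 4)
```

In the conventional orientation, recall is 0.30 for SL and 0.6625 for Weight, which are the figures the report lists under Pos Pred Value.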

Summary

The confusion-matrix reports above share a common set of performance metrics. Each is interpreted as follows.

Accuracy: the proportion of test instances the model classifies correctly. Higher accuracy indicates better predictive performance; the SVM above, for example, reaches 0.6281.

95% CI: a 95% confidence interval for the model's true accuracy. Narrower intervals indicate a more precise estimate.

No Information Rate: the baseline accuracy obtained by always predicting the most frequent class. A useful model should exceed it, and the "P-Value [Acc > NIR]" line tests whether the observed accuracy is significantly higher than this baseline.

Kappa: the agreement between predicted and observed classes, corrected for the agreement expected by chance alone. Higher values indicate better agreement beyond chance.

McNemar's Test P-Value: a test of whether the misclassifications are symmetric, that is, whether one class is mistaken for another about as often as the reverse. A significant p-value (usually < 0.05) indicates systematic asymmetry in the errors. It is reported as NA for the models above, so no conclusion can be drawn from it here.
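
These overall statistics can be recomputed by hand from a confusion matrix. The sketch below does so in base R for the SVM table above. The exact binomial interval from binom.test() is used on the assumption that it matches how the report's 95% CI was obtained, and all variable names are illustrative.

```r
classes <- c("TL", "Aggressive", "SL", "Weight")

# Counts copied from the SVM confusion matrix above.
cm <- matrix(
  c(44,  0, 19, 17,
     0, 80,  0,  0,
    25,  0, 24, 31,
    17,  0, 10, 53),
  nrow = 4, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

n <- sum(cm)
correct <- sum(diag(cm))

# Accuracy: share of cases on the diagonal.
accuracy <- correct / n

# No Information Rate: share of the largest Reference class.
nir <- max(colSums(cm)) / n

# Kappa: observed agreement corrected for the agreement expected by chance.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa <- (accuracy - expected) / (1 - expected)

# Exact binomial 95% interval for the accuracy.
ci <- binom.test(correct, n)$conf.int

# One-sided test of accuracy > NIR.
p_acc_gt_nir <- binom.test(correct, n, p = nir, alternative = "greater")$p.value

# Compare with the report: Accuracy 0.6281, NIR 0.3156, Kappa 0.5042.
c(accuracy = accuracy, nir = nir, kappa = kappa)
ci
p_acc_gt_nir
```

Here 201 of 320 cases lie on the diagonal, and because every row sums to 80 the chance agreement is exactly 0.25, giving Kappa = (0.628125 - 0.25) / 0.75.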

Fish Aggressiveness prediction project

Adekunle Joseph Damilare

2024-04-17
