library(readr)
library(maps)
library(ggplot2)
library(viridis)
library(plotly)
library(dplyr)
urlfile = "https://raw.githubusercontent.com/InstituteForGlobalEcology/Coral-bleaching-a-global-analysis-of-the-past-two-decades/master/Reef_Check_with_cortad_variables_with_annual_rate_of_SST_change.csv"
reef_data <- read_csv(url(urlfile))
Question 1 - Case Study 1 (Reef): Visualising data
The data was loaded from a public GitHub data set into this environment using the readr package and the url() function.
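A quick structural check of the loaded data (an optional sketch; it assumes the download succeeded and that the column names below match the CSV header):
# Inspect the dimensions and the key columns used in the maps that follow
dim(reef_data)
head(reef_data[, c("Latitude.Degrees", "Longitude.Degrees", "Average_bleaching", "Year")])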
a. Informative map visualisation
Construct an informative map to visualise the data and investigate whether the authors' claim that "the highest probability of coral bleaching occurred at tropical mid-latitude sites (15–20 degrees north and south of the Equator)" is correct.
world_map <- map_data("world")
ggplot() +
  geom_polygon(data = world_map, aes(x = long, y = lat, group = group),
               fill = "grey", alpha = 0.3) +
  geom_point(data = reef_data, alpha = 0.2,
             aes(y = Latitude.Degrees, x = Longitude.Degrees,
                 size = Average_bleaching, color = Average_bleaching)) +
  scale_colour_viridis() +
  theme_minimal()

The presented visualisation reveals that coral bleaching events are mainly concentrated in the tropical mid-latitude regions around 15-20 degrees north and south of the Equator, confirming the findings of Sully and colleagues. The colour and size of the dots on the map indicate the severity of each bleaching event, allowing for swift identification of areas that need immediate protection from further damage. This visualisation provides valuable insight into the geographic distribution and severity of coral bleaching events and highlights the urgent need for protective measures to preserve marine biodiversity.
b. Interactive map visualisation
Construct an interactive map filtered to a set period of time so that researchers can investigate coral bleaching events around the world as they occurred from 1998 to 2017.
coral_events_year_filtered <- reef_data %>%
filter(Year >= 1998 & Year <= 2017) %>%
select(Longitude.Degrees, Latitude.Degrees, Average_bleaching)
plot_geo(coral_events_year_filtered,
         lon = ~Longitude.Degrees, lat = ~Latitude.Degrees,
         size = ~Average_bleaching, color = ~Average_bleaching)
In this interactive map, each coral bleaching event is plotted at its precise location, with the level of severity conveyed by the size and colour of the points. The map additionally shows that most coral bleaching occurs around the Pacific and the mid-South American region.
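To support the time-filtering requirement more directly, the map can also be animated by year; a hedged sketch, assuming plotly's frame animation is supported for scattergeo traces and keeping the Year column through the select() step:
# Animate the map by year so researchers can step through events from 1998 to 2017
coral_by_year <- reef_data %>%
  filter(Year >= 1998 & Year <= 2017) %>%
  select(Longitude.Degrees, Latitude.Degrees, Average_bleaching, Year)
plot_geo(coral_by_year,
         lon = ~Longitude.Degrees, lat = ~Latitude.Degrees,
         size = ~Average_bleaching, color = ~Average_bleaching,
         frame = ~Year)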
Question 2 - Case Study 2 (Kidney): Blood vs Biopsy Biomarker for classification
Load the data and related libraries for set-up.
library(Biobase)
library(caret)
library(dplyr)
library(e1071)
library(GEOquery)
library(R.utils)
library(ggplot2)
library(reshape2)
library(limma)

a. Genes that can be found in the top 300 differentially expressed genes for both datasets
Load the two RData data sets into the environment and assign them as gse1 and gse2 for further analysis.
# Week 1 Lab 1.1 sample solution
load("GSE46474.RData")
gse1 = gse
load("GSE138043.RData")
gse2 = gse
# number of samples in the Rejection class and the Stable class in gse1 and gse2
gse1$outcome <- ifelse(grepl("AR", gse1$title), "Rejection", "Stable")
gse2$outcome <- ifelse(grepl("non", gse2$characteristics_ch1), "Stable", "Rejection")
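# The class sizes themselves can be tabulated with a quick sketch (the
# resulting counts are not recorded in this document):
table(gse1$outcome)
table(gse2$outcome)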
# extract the featureData object using the fData function.
featureData1 <- fData(gse1)
featureData2 <- fData(gse2)

Create a design matrix for the linear model that will be used to test for differential gene expression, and generate a table of the 50 most significant differentially expressed genes based on the linear model fitted for framework 1.
# Week 1 Lab2.2 sample solution
design1 = model.matrix(~outcome, data = pData(gse1))
fit1 = lmFit(exprs(gse1),design1)
fit1 = eBayes(fit1)
tT1 = topTable(fit1, n = Inf)
DT::datatable(round(tT1[1:300, ], 2))
tT1 = topTable(fit1, genelist = featureData1[,"Gene Symbol"], n = 50)
tT11 = topTable(fit1, genelist = featureData1[,"Gene Symbol"], n = 300)
for (i in 1:length(tT11$ID)) {
  tT11$ID2[i] = tT11$ID[[i]][2]
}

Create a design matrix for the linear model that will be used to test for differential gene expression, and generate a table of the 50 most significant differentially expressed genes based on the linear model fitted for framework 2.
# Week 1 Lab2.2 sample solution
design2 = model.matrix(~outcome, data = pData(gse2))
fit2 = lmFit(exprs(gse2),design2)
fit2 = eBayes(fit2)
tT2 = topTable(fit2, n = Inf)
DT::datatable(round(tT2[1:300, ], 2))
tT2 = topTable(fit2, genelist = featureData2[,"gene_assignment"], n = 50)
tT22 = topTable(fit2, genelist = featureData2[,"gene_assignment"], n = 300)
tT22$ID2 = strsplit(tT22$ID, " // ", fixed = TRUE)
for (i in 1:length(tT22$ID2)) {
  tT22$ID3[i] = tT22$ID2[[i]][2]
}

Compare the two frameworks under the filtered conditions above to observe the overlapping genes.
overlapping_genes <- intersect(tT11$ID, tT22$ID3)
overlapping_genes
## [1] "SLAMF6" "RASA3"  "ITGAX"
The top 300 differentially expressed genes in the GSE46474 and GSE138043 data sets share three overlapping genes: "SLAMF6", "RASA3", and "ITGAX".
b. Visualisation of identifying 50 most differentially expressed genes from the entire data set
Load GSE46474.RData and GSE138043.RData as gse46474 and gse138043 respectively.
load("GSE46474.RData")
gse46474 = gse
load("GSE138043.RData")
gse138043 = gse

Generate a table of the top 50 differentially expressed genes based on the linear model, filtering out missing values with the na.omit() function. Then create a subset of the original data for downstream analyses by extracting the rows corresponding to the top differentially expressed genes.
top_Table_framework1 = topTable(fit1, n = 50, genelist = featureData1[,"Gene Symbol"], sort.by = "logFC")
top_Table_framework1 = na.omit(top_Table_framework1)
top_genes <- rownames(top_Table_framework1)
gse1_m = gse1[top_genes,]

Apply the machine learning model to predict patient rejection status from the gene expression data.
# Week 2 Lab3.2 sample solution
largevar = apply(gse1_m, 1, var)
ind = which(largevar > quantile(largevar, 0.95))
X = as.matrix(t(exprs(gse1_m)))
y = gse1_m$outcome
cvK = 5 # number of CV folds
cv_50acc5_svm_1 = c()
cv_acc_svm_1 = c()
n_sim = 50 ## number of repeats
for (i in 1:n_sim) {
cvSets = cvTools::cvFolds(nrow(X), cvK) # permute all the data, into 5 folds
cv_acc_svm_1 = c()
for (j in 1:cvK) {
test_id = cvSets$subsets[cvSets$which == j]
X_test = X[test_id, ]
X_train = X[-test_id, ]
y_test = y[test_id]
y_train = y[-test_id]
## SVM
svm_res <- e1071::svm(x = X_train, y = as.factor(y_train))
fit <- predict(svm_res, X_test)
cv_acc_svm_1[j] = mean(fit == y_test)
}
cv_50acc5_svm_1 <- append(cv_50acc5_svm_1, mean(cv_acc_svm_1))
} ## end for

The same steps are applied to framework 2.
top_Table_framework2 = topTable(fit2, n = 50, genelist = featureData2[,"gene_assignment"], sort.by = "logFC")
top_Table_framework2 = na.omit(top_Table_framework2)
top_genes <- rownames(top_Table_framework2)
gse2_m = gse2[top_genes,]

# Week 2 Lab 3.2 sample code
largevar = apply(gse2_m, 1, var)
ind = which(largevar > quantile(largevar, 0.95))
X = as.matrix(t(exprs(gse2_m)))
y = gse2_m$outcome
cvK = 5 # number of CV folds
cv_50acc5_svm_2 = c()
cv_acc_svm_2 = c()
n_sim = 50 ## number of repeats
for (i in 1:n_sim) {
cvSets = cvTools::cvFolds(nrow(X), cvK) # permute all the data, into 5 folds
cv_acc_svm_2 = c()
for (j in 1:cvK) {
test_id = cvSets$subsets[cvSets$which == j]
X_test = X[test_id, ]
X_train = X[-test_id, ]
y_test = y[test_id]
y_train = y[-test_id]
## SVM
svm_res <- e1071::svm(x = X_train, y = as.factor(y_train))
fit <- predict(svm_res, X_test)
cv_acc_svm_2[j] = mean(fit == y_test)
}
cv_50acc5_svm_2 <- append(cv_50acc5_svm_2, mean(cv_acc_svm_2))
} ## end for

Compare the two frameworks' accuracies by placing the boxplot summaries side by side for easy identification of differences.
boxplot(list(frame1 = cv_50acc5_svm_1, frame2 = cv_50acc5_svm_2), ylab = "CV Accuracy")
text(y = boxplot.stats(cv_50acc5_svm_1)$stats,
labels = boxplot.stats(cv_50acc5_svm_1)$stats,
x=1.48,
cex = 0.8,
col = "Red")
text(y = boxplot.stats(cv_50acc5_svm_2)$stats,
labels = boxplot.stats(cv_50acc5_svm_2)$stats,
x=2.62,
cex = 0.7,
col = "Red")
# Add a horizontal line for the mean
abline(h = mean(cv_50acc5_svm_1),
col = "darkgreen",
lwd = 1)
# Add the text for the mean over the mean line
text(y=mean(cv_50acc5_svm_1),
x=1.7,
paste("Framework 1 Mean:", round(mean(cv_50acc5_svm_1),3)),
col = "darkgreen",
cex = 0.8,
pos = 4)
# Same process with a different data frame
abline(h = mean(cv_50acc5_svm_2),
col = "blue",
lwd = 1)
text(y=mean(cv_50acc5_svm_2),
x=1.7,
paste("Framework 2 Mean:", round(mean(cv_50acc5_svm_2),3)),
col = "blue",
cex = 0.8,
pos = 4)

From the boxplots above, the mean accuracy of frame 1 (GSE46474) is 0.814, whereas the mean accuracy of frame 2 (GSE138043) is 0.8, slightly lower than frame 1. Using 5-fold cross-validation (with 50 repeats), this suggests that GSE46474 is classified slightly more accurately than GSE138043.
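To check whether this small gap reflects more than fold-to-fold noise, a quick two-sample test can be run on the two vectors of repeated-CV accuracies (an indicative sketch only, since repeated cross-validation runs are not fully independent):
# Hedged significance check on the 50 repeated-CV accuracy means of each framework
t.test(cv_50acc5_svm_1, cv_50acc5_svm_2)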
c. Visualisation of identifying 50 most differentially expressed genes from 80:20 split training and testing set
Apply the machine learning models to predict patient rejection status from gene expression data.
# Week 2 3.2 lab sample code
set.seed(3888)
largevar = apply(gse1, 1, var)
ind = which(largevar > quantile(largevar, 0.95))
X = t(exprs(gse1))
y = gse1$outcome
cvK = 5 # number of CV folds
cv_50acc5_svm_1_1 = c()
cv_acc_svm_1_1 = c()
n_sim = 50 ## number of repeats
for (i in 1:n_sim) {
cvSets = cvTools::cvFolds(nrow(X), cvK) # permute all the data, into 5 folds
cv_acc_svm_1_1 = c()
for (j in 1:cvK) {
test_id = cvSets$subsets[cvSets$which == j]
X_test = X[test_id, ]
X_train = X[-test_id, ]
y_test = y[test_id]
y_train = y[-test_id]
design1 <- model.matrix(~outcome, data = pData(gse1[-test_id, ]))
#colnames(design1) <- levels(y_train)
fit1 = lmFit(exprs(gse1[-test_id, ]),design1)
fit1 = eBayes(fit1)
tT1_50 = topTable(fit1, n = 50, sort.by = "logFC")
top_genes <- rownames(tT1_50)
X_train_subset <- X_train[,top_genes]
X_test_subset <- X_test[,top_genes]
test_id2 = sample(1:50, round(50 * 0.8))
X_train_subset_matrix = as.matrix(X_train_subset)
X_test_subset_matrix = as.matrix(X_test_subset)
## SVM
svm_res11 <- e1071::svm(x = X_train_subset_matrix, y = as.factor(y_train))
fit <- predict(svm_res11, X_test_subset_matrix)
cv_acc_svm_1_1[j] = mean(fit == y_test)
}
cv_50acc5_svm_1_1 <- append(cv_50acc5_svm_1_1, mean(cv_acc_svm_1_1))
} ## end for

# Week 2 Lab 3.2 sample code
set.seed(3888)
largevar = apply(gse2, 1, var)
ind = which(largevar > quantile(largevar, 0.95))
X = t(exprs(gse2))
y = gse2$outcome
cvK = 5 # number of CV folds
cv_50acc5_svm_2_2 = c()
cv_acc_svm_2_2 = c()
n_sim = 50 ## number of repeats
for (i in 1:n_sim) {
cvSets = cvTools::cvFolds(nrow(X), cvK) # permute all the data, into 5 folds
cv_acc_svm_2_2 = c()
for (j in 1:cvK) {
test_id = cvSets$subsets[cvSets$which == j]
X_test = X[test_id, ]
X_train = X[-test_id, ]
y_test = y[test_id]
y_train = y[-test_id]
design2 <- model.matrix(~outcome, data = pData(gse2[-test_id, ]))
#colnames(design1) <- levels(y_train)
fit2 = lmFit(exprs(gse2[-test_id, ]),design2)
fit2 = eBayes(fit2)
tT2_50 = topTable(fit2, n = 50, sort.by = "logFC")
top_genes <- rownames(tT2_50)
X_train_subset <- X_train[,top_genes]
X_test_subset <- X_test[,top_genes]
test_id2 = sample(1:50, round(50 * 0.8))
X_train_subset_matrix = as.matrix(X_train_subset)
X_test_subset_matrix = as.matrix(X_test_subset)
## SVM
svm_res12 <- e1071::svm(x = X_train_subset_matrix, y = as.factor(y_train))
fit <- predict(svm_res12, X_test_subset_matrix)
cv_acc_svm_2_2[j] = mean(fit == y_test)
}
cv_50acc5_svm_2_2 <- append(cv_50acc5_svm_2_2, mean(cv_acc_svm_2_2))
} ## end for

The same approach is applied to the other data set, frame 2.
boxplot(list(frame1 = cv_50acc5_svm_1_1, frame2 = cv_50acc5_svm_2_2), ylab = "CV Accuracy")
text(y = boxplot.stats(cv_50acc5_svm_1_1)$stats,
labels = boxplot.stats(cv_50acc5_svm_1_1)$stats,
x=1.48,
cex = 0.8,
col = "Red")
text(y = boxplot.stats(cv_50acc5_svm_2_2)$stats,
labels = boxplot.stats(cv_50acc5_svm_2_2)$stats,
x=2.62,
cex = 0.7,
col = "Red")
# Add a horizontal line for the mean
abline(h = mean(cv_50acc5_svm_1_1),
col = "darkgreen",
lwd = 1)
# Add the text for the mean over the mean line
text(y=mean(cv_50acc5_svm_1_1),
x=1.7,
paste("Framework 1 Mean:", round(mean(cv_50acc5_svm_1_1),3)),
col = "darkgreen",
cex = 0.8,
pos = 4)
# Same process with a different data frame
abline(h = mean(cv_50acc5_svm_2_2),
col = "blue",
lwd = 1)
text(y=mean(cv_50acc5_svm_2_2),
x=0.6,
paste("Framework 2 Mean:", round(mean(cv_50acc5_svm_2_2),3)),
col = "blue",
cex = 0.8,
pos = 4)

From the boxplots above, frame 1 shows higher accuracy than frame 2: its mean is 0.881 while the other is only 0.801. This shows a clear difference between frames 1 and 2 and indicates that frame 1 is classified more accurately than frame 2. The finding highlights the importance of selecting an appropriate frame for the task at hand to achieve optimal accuracy in the result.
d. Comparison between part b and part c
Combine the four sets of results (two frames under two classification approaches) into one visualisation to compare their differences.
boxplot(list(frame1_b = cv_50acc5_svm_1, frame1_c = cv_50acc5_svm_1_1, frame2_b = cv_50acc5_svm_2, frame2_c = cv_50acc5_svm_2_2), ylab = "CV Accuracy")

The accuracy of different approaches for classifying gene expression datasets from peripheral blood and kidney biopsy samples was analysed to assess the validity and reliability of the two frames. The analysis revealed significant differences in accuracy between frame 1, consisting of the peripheral blood gene expression dataset, and frame 2, comprising samples that were sequenced from a kidney biopsy. The accuracy range for frame 1 was found to be considerably higher than that of frame 2, indicating that the validity of frame 1 is superior to frame 2. This finding suggests that peripheral blood gene expression data may be more precise in predicting disease outcomes than gene expression data obtained from kidney biopsy samples.
Further analysis using different approaches, including 80:20 training and testing datasets and 50 repeated cross-validation classifiers, revealed that employing the former approach resulted in significantly higher accuracy for frame 1. However, no significant differences were observed in frame 2 despite the application of different approaches on the same dataset.
Overall, the multiple observation approaches employed in this study provide evidence that peripheral blood gene expression data may be more reliable in predicting disease outcomes than gene expression data obtained from kidney biopsy samples. This finding has important implications for the use of gene expression data in precision medicine, as it suggests that non-invasive methods such as peripheral blood collection may be more effective in predicting disease outcomes than invasive methods such as kidney biopsy.
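To put numbers behind this comparison, a short summary sketch of the four repeated-CV accuracy vectors from parts b and c (assuming all four remain in the workspace):
# Mean and standard deviation of each repeated-CV accuracy vector
acc <- list(frame1_b = cv_50acc5_svm_1, frame1_c = cv_50acc5_svm_1_1,
            frame2_b = cv_50acc5_svm_2, frame2_c = cv_50acc5_svm_2_2)
data.frame(mean = sapply(acc, mean), sd = sapply(acc, sd))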
Question 3 - Case Study 3 (Brain): Streaming classifier for Brain-box
library(tidyverse)
library(tuneR)
library(devtools)
library(ggplot2)
library(tsfeatures)
library(stringdist)

dir_medium = "Spiker_box_Louis/Medium"
all_files_medium <- list.files(dir_medium)
wave_file_medium = list()
for (i in all_files_medium) {
wave_file_medium[[i]] <- tuneR::readWave(file.path(dir_medium, i))
}

a. Classification rule for detecting a series of {L, R}
Build a classification rule under streaming conditions for detecting a series of {L, R}.
# Sample Code: Lab3 - 2.3
LR_detection = function(seq) {
maxval = which.max(seq)
minval = which.min(seq)
movement = ifelse(maxval < minval, "L", "R")
return(movement)
}
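# Quick sanity check of LR_detection on hypothetical signals: a maximum that
# occurs before the minimum should read as "L", and the reverse as "R".
LR_detection(c(0, 5, 0, -5, 0))  # "L": peak (index 2) before trough (index 4)
LR_detection(c(0, -5, 0, 5, 0))  # "R": trough before peak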
# Week 3 Lab 3.3 Sample solution
streaming_classifier = function(wave_file,
window_size = wave_file@samp.rate,
increment = window_size/3,
thresholdEvents = 50
) {
Y = wave_file@left
xtime = seq_len(length(wave_file@left))/wave_file@samp.rate
predict = c()
lower_interval = 1
max_time = max(xtime)*window_size
while(max_time > lower_interval + window_size)
{
upper_interval = lower_interval + window_size
interval = Y[lower_interval:upper_interval]
testStat = sum(interval[1:(length(interval) - 1)] * interval[2:(length(interval))] <= 0)
predicted = LR_detection(interval)
predict = c(predict, predicted)  # keep the predicted label for this window
lower_interval = lower_interval + window_size
}
return(predict)
}

The streaming_classifier function detects a series of {L, R} in a wave file using a sliding-window approach and a simple classification rule. The function takes as input a wave file, a window size, an increment, and a threshold for the number of events to detect. For each window, it computes a zero-crossing test statistic and assigns a label of L or R with LR_detection, based on whether the window's maximum occurs before its minimum. The function iterates through the wave file, shifting the window forward one window at a time and keeping the predicted label for each window, and returns the labels as the final output.
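A minimal usage sketch, assuming the Medium wave files loaded at the start of this question:
# Run the streaming classifier on the first recording and inspect the labels
wave <- wave_file_medium[[1]]
streaming_classifier(wave, window_size = wave@samp.rate)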
b. Metric to estimate the accuracy of your classifier on the length 3 wave files
Create a metric to estimate the accuracy of the classifier on the length 3 wave files
eye_movement_ZC = function(Y, time,
windowSize = wave_file@samp.rate,
thresholdEvents = 30,
downSampleRate = 50) {
## down sample
ind = seq_len(which(time == round(time[length(time)] - windowSize, 4)) + 1)
ind = seq(1, ind[length(ind)], by = downSampleRate)
## time vector for middle of each window
timeMiddle <- time[ind] + windowSize/2
testStat = testStat1 = testStat2 = rep(NA, length(ind))
stat_event <- rep(NA, length(ind))
## calculate zero-crossings
for (i in 1:length(ind)) {
Y_subset <- Y[time >= time[ind[i]] & time < time[ind[i]] + windowSize]
testStat[i] <- sd(Y_subset)
testStat1[i] = max(Y_subset)
testStat2[i] = sum(Y_subset[1:(length(Y_subset) - 1)] * Y_subset[2:(length(Y_subset))] <= 0)
}
## using threshold to determine eye movement intervals
predictedEvent <- which(testStat < thresholdEvents)
eventTimes <- timeMiddle[predictedEvent] # map back to the time of this
gaps <- which(diff(eventTimes) > windowSize )
## estimate event_time_interval
event_time_interval <- min(eventTimes)
for (i in 1:length(gaps)) {
event_time_interval <- append(event_time_interval,
c(eventTimes[gaps[i]],
eventTimes[gaps[i] + 1]))
}
event_time_interval <- append(event_time_interval, max(eventTimes))
event_time_interval <- matrix(event_time_interval, ncol = 2, byrow = TRUE)
predictedEventTimes <- rep(FALSE, length(Y))
for (i in 1:nrow(event_time_interval)) {
predictedEventTimes[event_time_interval[i, 1] <= time & event_time_interval[i, 2] >= time] <- TRUE
}
num_event <- length(gaps) + 1
return(list(num_event = num_event,
predictedEventTimes = predictedEventTimes,
predictedInterval = event_time_interval))
}

# Source: Lab3 - 3.3
streaming_classifier = function(wave_file,
window_size = wave_file@samp.rate,
increment = window_size/3,
thresholdEvents = 50
) {
Y = wave_file@left
xtime = seq_len(length(wave_file@left))/wave_file@samp.rate
predicted_labels = c()
lower_interval = 1
max_time = max(xtime)*window_size
while(max_time > lower_interval + window_size)
{
upper_interval = lower_interval + window_size
interval = Y[lower_interval:upper_interval]
testStat = sum(interval[1:(length(interval) - 1)] * interval[2:(length(interval))] <= 0)
predicted = LR_detection(interval)
predicted_labels = c(predicted_labels, predicted)  # keep the predicted label for this window
lower_interval = lower_interval + window_size
}
return(predicted_labels)
}

i <- 1
total_accuracy_list <- c()
total_predicted_labels <- c()
while(i < length(wave_file_medium) + 1){
wave_file = wave_file_medium[[i]]
predicted_labels = streaming_classifier(wave_file, wave_file@samp.rate)
predicted_labels <- toString(predicted_labels)
predicted_labels <- gsub(', ', '', predicted_labels)
#print(predicted_labels)
total_predicted_labels <- append(total_predicted_labels, predicted_labels)
true_sequence <- gsub("_z.wav", "", all_files_medium[i])
true_sequence <- strsplit(true_sequence, "")[[1]][1:3]
true_sequence <- toString(true_sequence)
true_sequence <- gsub(', ', '', true_sequence)
#print(true_sequence)
correct_predictions = 0
accuracy <- stringdist(predicted_labels,true_sequence, method="lv")
total_accuracy_list <- append(total_accuracy_list, accuracy)
i <- i+1
}
mean(total_accuracy_list)
## [1] 3
The stringdist function provides the metric: the Levenshtein distance between the predicted and actual sequences counts the character edits separating them, and averaging this distance over all files gives the final score. In this case the mean is 3, meaning that on average three characters differ between the predicted and actual movement sequences; in calculation, (3/8)*100 = 37.5%, hence the accuracy rate is 37.5%, which is relatively low when estimating the classifier with this metric.
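A length-normalised variant of this metric makes scores comparable across sequence lengths; a small sketch (the helper name lv_accuracy is hypothetical):
# Normalise the Levenshtein distance by the true sequence length, so that
# 1 = perfect match and 0 = completely different
lv_accuracy <- function(predicted, actual) {
  1 - stringdist(predicted, actual, method = "lv") / nchar(actual)
}
lv_accuracy("LRL", "LRR")  # 0.667: two of the three characters match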
c. Compare at least four different classification rules on the length 3 wave files
Use the metric from part b to compare at least four different classification rules on the length 3 wave files, then select the best model among those classification rules.
Combine the streaming classifier with the eye-movement zero-crossing function and vary each parameter in turn:
- Normal (baseline) set of parameters
eye_movement_ZC = function(Y, time,
windowSize = 0.5,
thresholdEvents = 200,
downSampleRate = 50) {
## down sample
ind = seq_len(which(time == round(time[length(time)] - windowSize, 4)) + 1)
ind = seq(1, ind[length(ind)], by = downSampleRate)
## time vector for middle of each window
timeMiddle <- time[ind] + windowSize/2
testStat = testStat1 = testStat2 = rep(NA, length(ind))
stat_event <- rep(NA, length(ind))
## calculate zero-crossings
for (i in 1:length(ind)) {
Y_subset <- Y[time >= time[ind[i]] & time < time[ind[i]] + windowSize]
testStat[i] <- sd(Y_subset)
#testStat1[i] = max(Y_subset)
#testStat2[i] = sum(Y_subset[1:(length(Y_subset) - 1)] * Y_subset[2:(length(Y_subset))] <= 0)
}
## using threshold to determine eye movement intervals
predictedEvent <- which(testStat < thresholdEvents)
eventTimes <- timeMiddle[predictedEvent] # map back to the time of this
gaps <- which(diff(eventTimes) > windowSize )
## estimate event_time_interval
event_time_interval <- min(eventTimes)
for (i in 1:length(gaps)) {
event_time_interval <- append(event_time_interval,
c(eventTimes[gaps[i]],
eventTimes[gaps[i] + 1]))
}
event_time_interval <- append(event_time_interval, max(eventTimes))
event_time_interval <- matrix(event_time_interval, ncol = 2, byrow = TRUE)
predictedEventTimes <- rep(FALSE, length(Y))
for (i in 1:nrow(event_time_interval)) {
predictedEventTimes[event_time_interval[i, 1] <= time & event_time_interval[i, 2] >= time] <- TRUE
}
num_event <- length(gaps) + 1
return(list(num_event = num_event,
testStat = testStat,
eventTimes = eventTimes,
predictedEventTimes = predictedEventTimes,
predictedInterval = event_time_interval))
}
streaming_classifier2 = function(wave_file,predictedInterval,eventTimes) {
Y = wave_file@left
predicted_labels = c()
i <- 1
while(i < length(predictedInterval)/2+1)
{
lower_interval = predictedInterval[i,1]
upper_interval = predictedInterval[i,2]
interval = Y[which(eventTimes == lower_interval):which(eventTimes == upper_interval)]
testStat = sum(interval[1:(length(interval) - 1)] * interval[2:(length(interval))] <= 0)
predicted = LR_detection(interval)
predicted_labels = c(predicted_labels, predicted)
i <- i+1
}
return(predicted_labels)
}
j <- 1
total_accuracy_list <- c()
while(j < length(wave_file_medium) + 1){
waveSeq = wave_file_medium[[j]]
timeSeq <- seq_len(length(waveSeq))/waveSeq@samp.rate
predictedInterval <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$predictedInterval
eventTimes <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$eventTimes
true_sequence <- gsub("_z.wav", "", all_files_medium[j])
true_sequence <- strsplit(true_sequence, "")[[1]][1:3]
true_sequence <- toString(true_sequence)
true_sequence <- gsub(', ', '', true_sequence)
predicted_labels <- streaming_classifier2(waveSeq,predictedInterval,eventTimes)
predicted_labels <- toString(predicted_labels)
predicted_labels <- gsub(', ', '', predicted_labels)
accuracy <- stringdist(predicted_labels,true_sequence, method="lv")
total_accuracy_list <- append(total_accuracy_list, accuracy)
j <- j+1
}
accuracy_normal <- mean(total_accuracy_list)
accuracy_normal
## [1] 6.2
- Increase the window size from 0.5 to 1.0
eye_movement_ZC = function(Y, time,
windowSize = 1.0,
thresholdEvents = 200,
downSampleRate = 50) {
## down sample
ind = seq_len(which(time == round(time[length(time)] - windowSize, 4)) + 1)
ind = seq(1, ind[length(ind)], by = downSampleRate)
## time vector for middle of each window
timeMiddle <- time[ind] + windowSize/2
testStat = testStat1 = testStat2 = rep(NA, length(ind))
stat_event <- rep(NA, length(ind))
## calculate zero-crossings
for (i in 1:length(ind)) {
Y_subset <- Y[time >= time[ind[i]] & time < time[ind[i]] + windowSize]
testStat[i] <- sd(Y_subset)
}
## using threshold to determine eye movement intervals
predictedEvent <- which(testStat < thresholdEvents)
eventTimes <- timeMiddle[predictedEvent] # map back to the time of this
gaps <- which(diff(eventTimes) > windowSize )
## estimate event_time_interval
event_time_interval <- min(eventTimes)
for (i in 1:length(gaps)) {
event_time_interval <- append(event_time_interval,
c(eventTimes[gaps[i]],
eventTimes[gaps[i] + 1]))
}
event_time_interval <- append(event_time_interval, max(eventTimes))
event_time_interval <- matrix(event_time_interval, ncol = 2, byrow = TRUE)
predictedEventTimes <- rep(FALSE, length(Y))
for (i in 1:nrow(event_time_interval)) {
predictedEventTimes[event_time_interval[i, 1] <= time & event_time_interval[i, 2] >= time] <- TRUE
}
num_event <- length(gaps) + 1
return(list(num_event = num_event,
testStat = testStat,
eventTimes = eventTimes,
predictedEventTimes = predictedEventTimes,
predictedInterval = event_time_interval))
}
streaming_classifier2 = function(wave_file,predictedInterval,eventTimes) {
Y = wave_file@left
predict = c()
i <- 1
while(i < length(predictedInterval)/2+1)
{
lower_interval = predictedInterval[i,1]
upper_interval = predictedInterval[i,2]
interval = Y[which(eventTimes == lower_interval):which(eventTimes == upper_interval)]
testStat = sum(interval[1:(length(interval) - 1)] * interval[2:(length(interval))] <= 0)
predicted = LR_detection(interval)
predict = c(predict, predicted)
i <- i+1
}
return(predict)
}
j <- 1
total_accuracy_list_larger_ws <- c()
while(j < length(wave_file_medium) + 1){
waveSeq = wave_file_medium[[j]]
timeSeq <- seq_len(length(waveSeq))/waveSeq@samp.rate
predictedInterval <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$predictedInterval
eventTimes <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$eventTimes
actual <- gsub("_z.wav", "", all_files_medium[j])
actual <- strsplit(actual, "")[[1]][1:3]
actual <- toString(actual)
actual <- gsub(', ', '', actual)
predict <- streaming_classifier2(waveSeq,predictedInterval,eventTimes)
predict <- toString(predict)
predict <- gsub(', ', '', predict)
accuracy <- stringdist(predict,actual, method="lv")
total_accuracy_list_larger_ws <- append(total_accuracy_list_larger_ws, accuracy)
j <- j+1
}
accuracy_larger_ws <- mean(total_accuracy_list_larger_ws)
accuracy_larger_ws
## [1] 6
- Decrease the window size from 0.5 to 0.2
eye_movement_ZC = function(Y, time,
windowSize = 0.2,
thresholdEvents = 200,
downSampleRate = 50) {
## down sample
ind = seq_len(which(time == round(time[length(time)] - windowSize, 4)) + 1)
ind = seq(1, ind[length(ind)], by = downSampleRate)
## time vector for middle of each window
timeMiddle <- time[ind] + windowSize/2
testStat = testStat1 = testStat2 = rep(NA, length(ind))
stat_event <- rep(NA, length(ind))
## calculate zero-crossings
for (i in 1:length(ind)) {
Y_subset <- Y[time >= time[ind[i]] & time < time[ind[i]] + windowSize]
testStat[i] <- sd(Y_subset)
}
## using threshold to determine eye movement intervals
predictedEvent <- which(testStat < thresholdEvents)
eventTimes <- timeMiddle[predictedEvent] # map back to the time of this
gaps <- which(diff(eventTimes) > windowSize )
## estimate event_time_interval
event_time_interval <- min(eventTimes)
for (i in 1:length(gaps)) {
event_time_interval <- append(event_time_interval,
c(eventTimes[gaps[i]],
eventTimes[gaps[i] + 1]))
}
event_time_interval <- append(event_time_interval, max(eventTimes))
event_time_interval <- matrix(event_time_interval, ncol = 2, byrow = TRUE)
predictedEventTimes <- rep(FALSE, length(Y))
for (i in 1:nrow(event_time_interval)) {
predictedEventTimes[event_time_interval[i, 1] <= time & event_time_interval[i, 2] >= time] <- TRUE
}
num_event <- length(gaps) + 1
return(list(num_event = num_event,
testStat = testStat,
eventTimes = eventTimes,
predictedEventTimes = predictedEventTimes,
predictedInterval = event_time_interval))
}
streaming_classifier2 = function(wave_file,predictedInterval,eventTimes) {
Y = wave_file@left
predict = c()
i <- 1
while(i < length(predictedInterval)/2+1)
{
lower_interval = predictedInterval[i,1]
upper_interval = predictedInterval[i,2]
interval = Y[which(eventTimes == lower_interval):which(eventTimes == upper_interval)]
testStat = sum(interval[1:(length(interval) - 1)] * interval[2:(length(interval))] <= 0)
predicted = LR_detection(interval)
predict = c(predict, predicted)
i <- i+1
}
return(predict)
}
j <- 1
total_accuracy_list_smaller_ws <- c()
while(j < length(wave_file_medium) + 1){
waveSeq = wave_file_medium[[j]]
timeSeq <- seq_len(length(waveSeq))/waveSeq@samp.rate
predictedInterval <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$predictedInterval
eventTimes <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$eventTimes
actual <- gsub("_z.wav", "", all_files_medium[j])
actual <- strsplit(actual, "")[[1]][1:3]
actual <- toString(actual)
actual <- gsub(', ', '', actual)
predict <- streaming_classifier2(waveSeq,predictedInterval,eventTimes)
predict <- toString(predict)
predict <- gsub(', ', '', predict)
accuracy <- stringdist(predict,actual, method="lv")
total_accuracy_list_smaller_ws <- append(total_accuracy_list_smaller_ws, accuracy)
j <- j+1
}
accuracy_smaller_ws <- mean(total_accuracy_list_smaller_ws)
accuracy_smaller_ws
## [1] 11
- Increase thresholdEvents from 200 to 500
eye_movement_ZC = function(Y, time,
windowSize = 0.5,
thresholdEvents = 500,
downSampleRate = 50) {
## down sample
ind = seq_len(which(time == round(time[length(time)] - windowSize, 4)) + 1)
ind = seq(1, ind[length(ind)], by = downSampleRate)
## time vector for middle of each window
timeMiddle <- time[ind] + windowSize/2
testStat = testStat1 = testStat2 = rep(NA, length(ind))
stat_event <- rep(NA, length(ind))
## calculate zero-crossings
for (i in 1:length(ind)) {
Y_subset <- Y[time >= time[ind[i]] & time < time[ind[i]] + windowSize]
testStat[i] <- sd(Y_subset)
}
## using threshold to determine eye movement intervals
predictedEvent <- which(testStat < thresholdEvents)
eventTimes <- timeMiddle[predictedEvent] # map back to the time of this
gaps <- which(diff(eventTimes) > windowSize )
## estimate event_time_interval
event_time_interval <- min(eventTimes)
for (i in 1:length(gaps)) {
event_time_interval <- append(event_time_interval,
c(eventTimes[gaps[i]],
eventTimes[gaps[i] + 1]))
}
event_time_interval <- append(event_time_interval, max(eventTimes))
event_time_interval <- matrix(event_time_interval, ncol = 2, byrow = TRUE)
predictedEventTimes <- rep(FALSE, length(Y))
for (i in 1:nrow(event_time_interval)) {
predictedEventTimes[event_time_interval[i, 1] <= time & event_time_interval[i, 2] >= time] <- TRUE
}
num_event <- length(gaps) + 1
return(list(num_event = num_event,
testStat = testStat,
eventTimes = eventTimes,
predictedEventTimes = predictedEventTimes,
predictedInterval = event_time_interval))
}
streaming_classifier2 = function(wave_file,predictedInterval,eventTimes) {
Y = wave_file@left
predict = c()
i <- 1
while(i < length(predictedInterval)/2+1)
{
lower_interval = predictedInterval[i,1]
upper_interval = predictedInterval[i,2]
interval = Y[which(eventTimes == lower_interval):which(eventTimes == upper_interval)]
testStat = sum(interval[1:(length(interval) - 1)] * interval[2:(length(interval))] <= 0)
predicted = LR_detection(interval)
predict = c(predict, predicted)
i <- i+1
}
return(predict)
}
j <- 1
total_accuracy_list_larger_te <- c()
while(j < length(wave_file_medium) + 1){
waveSeq = wave_file_medium[[j]]
timeSeq <- seq_len(length(waveSeq))/waveSeq@samp.rate
predictedInterval <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$predictedInterval
eventTimes <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$eventTimes
actual <- gsub("_z.wav", "", all_files_medium[j])
actual <- strsplit(actual, "")[[1]][1:3]
actual <- toString(actual)
actual <- gsub(', ', '', actual)
predict <- streaming_classifier2(waveSeq,predictedInterval,eventTimes)
predict <- toString(predict)
predict <- gsub(', ', '', predict)
accuracy <- stringdist(predict,actual, method="lv")
total_accuracy_list_larger_te <- append(total_accuracy_list_larger_te, accuracy)
j <- j+1
}
accuracy_larger_te <- mean(total_accuracy_list_larger_te)
accuracy_larger_te
## [1] 6.2
- Decrease thresholdEvents from 200 to 100
eye_movement_ZC = function(Y, time,
windowSize = 0.5,
thresholdEvents = 100,
downSampleRate = 50) {
## down sample
ind = seq_len(which(time == round(time[length(time)] - windowSize, 4)) + 1)
ind = seq(1, ind[length(ind)], by = downSampleRate)
## time vector for middle of each window
timeMiddle <- time[ind] + windowSize/2
testStat = testStat1 = testStat2 = rep(NA, length(ind))
stat_event <- rep(NA, length(ind))
## calculate zero-crossings
for (i in 1:length(ind)) {
Y_subset <- Y[time >= time[ind[i]] & time < time[ind[i]] + windowSize]
testStat[i] <- sd(Y_subset)
#testStat1[i] = max(Y_subset)
#testStat2[i] = sum(Y_subset[1:(length(Y_subset) - 1)] * Y_subset[2:(length(Y_subset))] <= 0)
}
## using threshold to determine eye movement intervals
predictedEvent <- which(testStat < thresholdEvents)
eventTimes <- timeMiddle[predictedEvent] # map back to the time of this
gaps <- which(diff(eventTimes) > windowSize )
## estimate event_time_interval
event_time_interval <- min(eventTimes)
for (i in 1:length(gaps)) {
event_time_interval <- append(event_time_interval,
c(eventTimes[gaps[i]],
eventTimes[gaps[i] + 1]))
}
event_time_interval <- append(event_time_interval, max(eventTimes))
event_time_interval <- matrix(event_time_interval, ncol = 2, byrow = TRUE)
predictedEventTimes <- rep(FALSE, length(Y))
for (i in 1:nrow(event_time_interval)) {
predictedEventTimes[event_time_interval[i, 1] <= time & event_time_interval[i, 2] >= time] <- TRUE
}
num_event <- length(gaps) + 1
return(list(num_event = num_event,
testStat = testStat,
eventTimes = eventTimes,
predictedEventTimes = predictedEventTimes,
predictedInterval = event_time_interval))
}
streaming_classifier2 = function(wave_file,predictedInterval,eventTimes) {
Y = wave_file@left
predict = c()
i <- 1
while(i < length(predictedInterval)/2+1)
{
lower_interval = predictedInterval[i,1]
upper_interval = predictedInterval[i,2]
interval = Y[which(eventTimes == lower_interval):which(eventTimes == upper_interval)]
testStat = sum(interval[1:(length(interval) - 1)] * interval[2:(length(interval))] <= 0)
predicted = LR_detection(interval)
predict = c(predict, predicted)
i <- i+1
}
return(predict)
}
j <- 1
total_accuracy_list_smaller_te <- c()
while(j < length(wave_file_medium) + 1){
waveSeq = wave_file_medium[[j]]
timeSeq <- seq_len(length(waveSeq))/waveSeq@samp.rate
predictedInterval <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$predictedInterval
eventTimes <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$eventTimes
actual <- gsub("_z.wav", "", all_files_medium[j])
actual <- strsplit(actual, "")[[1]][1:3]
actual <- toString(actual)
actual <- gsub(', ', '', actual)
predict <- streaming_classifier2(waveSeq,predictedInterval,eventTimes)
predict <- toString(predict)
predict <- gsub(', ', '', predict)
accuracy <- stringdist(predict,actual, method="lv")
total_accuracy_list_smaller_te <- append(total_accuracy_list_smaller_te, accuracy)
j <- j+1
}
accuracy_smaller_te <- mean(total_accuracy_list_smaller_te)
accuracy_smaller_te
## [1] 6.4
- Increase downSampleRate from 50 to 100
eye_movement_ZC = function(Y, time,
windowSize = 0.5,
thresholdEvents = 200,
downSampleRate = 100) {
## down sample
ind = seq_len(which(time == round(time[length(time)] - windowSize, 4)) + 1)
ind = seq(1, ind[length(ind)], by = downSampleRate)
## time vector for middle of each window
timeMiddle <- time[ind] + windowSize/2
testStat = testStat1 = testStat2 = rep(NA, length(ind))
stat_event <- rep(NA, length(ind))
## calculate zero-crossings
for (i in 1:length(ind)) {
Y_subset <- Y[time >= time[ind[i]] & time < time[ind[i]] + windowSize]
testStat[i] <- sd(Y_subset)
}
## using threshold to determine eye movement intervals
predictedEvent <- which(testStat < thresholdEvents)
eventTimes <- timeMiddle[predictedEvent] # map back to the time of this
gaps <- which(diff(eventTimes) > windowSize )
## estimate event_time_interval
event_time_interval <- min(eventTimes)
for (i in 1:length(gaps)) {
event_time_interval <- append(event_time_interval,
c(eventTimes[gaps[i]],
eventTimes[gaps[i] + 1]))
}
event_time_interval <- append(event_time_interval, max(eventTimes))
event_time_interval <- matrix(event_time_interval, ncol = 2, byrow = TRUE)
predictedEventTimes <- rep(FALSE, length(Y))
for (i in 1:nrow(event_time_interval)) {
predictedEventTimes[event_time_interval[i, 1] <= time & event_time_interval[i, 2] >= time] <- TRUE
}
num_event <- length(gaps) + 1
return(list(num_event = num_event,
testStat = testStat,
eventTimes = eventTimes,
predictedEventTimes = predictedEventTimes,
predictedInterval = event_time_interval))
}
streaming_classifier2 = function(wave_file,predictedInterval,eventTimes) {
Y = wave_file@left
predict = c()
i <- 1
while(i < length(predictedInterval)/2+1)
{
lower_interval = predictedInterval[i,1]
upper_interval = predictedInterval[i,2]
interval = Y[which(eventTimes == lower_interval):which(eventTimes == upper_interval)]
testStat = sum(interval[1:(length(interval) - 1)] * interval[2:(length(interval))] <= 0)
predicted = LR_detection(interval)
predict = c(predict, predicted)
i <- i+1
}
return(predict)
}
j <- 1
total_accuracy_list_larger_dsr <- c()
while(j < length(wave_file_medium) + 1){
waveSeq = wave_file_medium[[j]]
timeSeq <- seq_len(length(waveSeq))/waveSeq@samp.rate
predictedInterval <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$predictedInterval
eventTimes <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$eventTimes
actual <- gsub("_z.wav", "", all_files_medium[j])
actual <- strsplit(actual, "")[[1]][1:3]
actual <- toString(actual)
actual <- gsub(', ', '', actual)
predict <- streaming_classifier2(waveSeq,predictedInterval,eventTimes)
predict <- toString(predict)
predict <- gsub(', ', '', predict)
accuracy <- stringdist(predict,actual, method="lv")
total_accuracy_list_larger_dsr <- append(total_accuracy_list_larger_dsr, accuracy)
j <- j+1
}
accuracy_larger_dsr <- mean(total_accuracy_list_larger_dsr)
accuracy_larger_dsr
## [1] 6.2
- Decrease downSampleRate from 50 to 10
eye_movement_ZC = function(Y, time,
windowSize = 0.5,
thresholdEvents = 200,
downSampleRate = 10) {
## down sample
ind = seq_len(which(time == round(time[length(time)] - windowSize, 4)) + 1)
ind = seq(1, ind[length(ind)], by = downSampleRate)
## time vector for middle of each window
timeMiddle <- time[ind] + windowSize/2
testStat = testStat1 = testStat2 = rep(NA, length(ind))
stat_event <- rep(NA, length(ind))
## calculate zero-crossings
for (i in 1:length(ind)) {
Y_subset <- Y[time >= time[ind[i]] & time < time[ind[i]] + windowSize]
testStat[i] <- sd(Y_subset)
#testStat1[i] = max(Y_subset)
#testStat2[i] = sum(Y_subset[1:(length(Y_subset) - 1)] * Y_subset[2:(length(Y_subset))] <= 0)
}
## using threshold to determine eye movement intervals
predictedEvent <- which(testStat < thresholdEvents)
eventTimes <- timeMiddle[predictedEvent] # map back to the time of this
gaps <- which(diff(eventTimes) > windowSize )
## estimate event_time_interval
event_time_interval <- min(eventTimes)
for (i in 1:length(gaps)) {
event_time_interval <- append(event_time_interval,
c(eventTimes[gaps[i]],
eventTimes[gaps[i] + 1]))
}
event_time_interval <- append(event_time_interval, max(eventTimes))
event_time_interval <- matrix(event_time_interval, ncol = 2, byrow = TRUE)
predictedEventTimes <- rep(FALSE, length(Y))
for (i in 1:nrow(event_time_interval)) {
predictedEventTimes[event_time_interval[i, 1] <= time & event_time_interval[i, 2] >= time] <- TRUE
}
num_event <- length(gaps) + 1
return(list(num_event = num_event,
testStat = testStat,
eventTimes = eventTimes,
predictedEventTimes = predictedEventTimes,
predictedInterval = event_time_interval))
}
streaming_classifier2 = function(wave_file,predictedInterval,eventTimes) {
Y = wave_file@left
predict = c()
i <- 1
while(i < length(predictedInterval)/2+1)
{
lower_interval = predictedInterval[i,1]
upper_interval = predictedInterval[i,2]
interval = Y[which(eventTimes == lower_interval):which(eventTimes == upper_interval)]
testStat = sum(interval[1:(length(interval) - 1)] * interval[2:(length(interval))] <= 0)
predicted = LR_detection(interval)
predict = c(predict, predicted)
i <- i+1
}
return(predict)
}
j <- 1
total_accuracy_list_smaller_dsr <- c()
while(j < length(wave_file_medium) + 1){
waveSeq = wave_file_medium[[j]]
timeSeq <- seq_len(length(waveSeq))/waveSeq@samp.rate
predictedInterval <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$predictedInterval
eventTimes <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$eventTimes
actual <- gsub("_z.wav", "", all_files_medium[j])
actual <- strsplit(actual, "")[[1]][1:3]
actual <- toString(actual)
actual <- gsub(', ', '', actual)
predict <- streaming_classifier2(waveSeq,predictedInterval,eventTimes)
predict <- toString(predict)
predict <- gsub(', ', '', predict)
accuracy <- stringdist(predict,actual, method="lv")
total_accuracy_list_smaller_dsr <- append(total_accuracy_list_smaller_dsr, accuracy)
j <- j+1
}
accuracy_smaller_dsr <- mean(total_accuracy_list_smaller_dsr)
accuracy_smaller_dsr
## [1] 6.2
# Create a bar plot for comparison, first defining the variables
data <- c(accuracy_normal, accuracy_larger_ws, accuracy_smaller_ws, accuracy_larger_te, accuracy_smaller_te, accuracy_larger_dsr, accuracy_smaller_dsr)
name <- c("normal", "larger window size", "smaller window size", "larger thresholdEvents", "smaller thresholdEvents", "larger downSampleRate", "smaller downSampleRate")
colours <- c("#d64e12", "#f9a52c", "#efdf48","#8bd346", "#60dbe8", "#16a4d8", "#9b5fe0")
bp <- barplot(data, col = colours, xlab = "Changed Parameter", ylab = "Accuracy", ylim = c(0, 12), main = "Accuracy Comparison Between Each Parameter")
# Add the accuracy value above each bar for clear observation
text(bp, data, data, pos = 3, cex = 0.8)
legend("topright", legend = name, fill = colours, cex = 0.6)By observation, the highest accuracy is 11, which belongs to smaller the window size, which refers to the number of instances that are used to train the classifier before making predictions on new incoming data, that concludes smaller window sizes model tend to result in higher accuracy due to the dynamic nature of the data. By training the classifier on a smaller subset of more recent data, it can better capture the current data distribution and trends, resulting in more accurate predictions and leads to become the best model.
d. Evaluation
For the best model selected in part c, evaluate its performance on sequences of varying lengths. Does the length of the sequence have an impact on the classification accuracy? Justify the answer with appropriate visualisations.
- Substitute sequences of length 8 into the smaller window size classifier
eye_movement_ZC = function(Y, time,
windowSize = 0.2,
thresholdEvents = 200,
downSampleRate = 50) {
## down sample
ind = seq_len(which(time == round(time[length(time)] - windowSize, 4)) + 1)
ind = seq(1, ind[length(ind)], by = downSampleRate)
## time vector for middle of each window
timeMiddle <- time[ind] + windowSize/2
testStat = testStat1 = testStat2 = rep(NA, length(ind))
stat_event <- rep(NA, length(ind))
## calculate zero-crossings
for (i in 1:length(ind)) {
Y_subset <- Y[time >= time[ind[i]] & time < time[ind[i]] + windowSize]
testStat[i] <- sd(Y_subset)
}
## using threshold to determine eye movement intervals
predictedEvent <- which(testStat < thresholdEvents)
eventTimes <- timeMiddle[predictedEvent] # map back to the time of this
gaps <- which(diff(eventTimes) > windowSize )
## estimate event_time_interval
event_time_interval <- min(eventTimes)
for (i in 1:length(gaps)) {
event_time_interval <- append(event_time_interval,
c(eventTimes[gaps[i]],
eventTimes[gaps[i] + 1]))
}
event_time_interval <- append(event_time_interval, max(eventTimes))
event_time_interval <- matrix(event_time_interval, ncol = 2, byrow = TRUE)
predictedEventTimes <- rep(FALSE, length(Y))
for (i in 1:nrow(event_time_interval)) {
predictedEventTimes[event_time_interval[i, 1] <= time & event_time_interval[i, 2] >= time] <- TRUE
}
num_event <- length(gaps) + 1
return(list(num_event = num_event,
testStat = testStat,
eventTimes = eventTimes,
predictedEventTimes = predictedEventTimes,
predictedInterval = event_time_interval))
}
streaming_classifier2 = function(wave_file,predictedInterval,eventTimes) {
Y = wave_file@left
predict = c()
i <- 1
while(i < length(predictedInterval)/2+1)
{
lower_interval = predictedInterval[i,1]
upper_interval = predictedInterval[i,2]
interval = Y[which(eventTimes == lower_interval):which(eventTimes == upper_interval)]
testStat = sum(interval[1:(length(interval) - 1)] * interval[2:(length(interval))] <= 0)
predicted = LR_detection(interval)
predict = c(predict, predicted)
i <- i+1
}
return(predict)
}
j <- 1
total_accuracy_list_smaller_ws <- c()
while(j < length(wave_file_medium) + 1){
waveSeq = wave_file_medium[[j]]
timeSeq <- seq_len(length(waveSeq))/waveSeq@samp.rate
predictedInterval <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$predictedInterval
eventTimes <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$eventTimes
actual <- gsub("_z.wav", "", all_files_medium[j])
actual <- strsplit(actual, "")[[1]][1:8]
actual <- toString(actual)
actual <- gsub(', ', '', actual)
predict <- streaming_classifier2(waveSeq,predictedInterval,eventTimes)
predict <- toString(predict)
predict <- gsub(', ', '', predict)
accuracy <- stringdist(predict,actual, method="lv")
total_accuracy_list_smaller_ws <- append(total_accuracy_list_smaller_ws, accuracy)
j <- j+1
}
length_8_accuracy_smaller_ws <- mean(total_accuracy_list_smaller_ws)
length_8_accuracy_smaller_ws
## [1] 6.8
- Substitute sequences of length 20 into the smaller window size classifier
eye_movement_ZC = function(Y, time,
windowSize = 0.2,
thresholdEvents = 200,
downSampleRate = 50) {
## down sample
ind = seq_len(which(time == round(time[length(time)] - windowSize, 4)) + 1)
ind = seq(1, ind[length(ind)], by = downSampleRate)
## time vector for middle of each window
timeMiddle <- time[ind] + windowSize/2
testStat = testStat1 = testStat2 = rep(NA, length(ind))
stat_event <- rep(NA, length(ind))
## calculate zero-crossings
for (i in 1:length(ind)) {
Y_subset <- Y[time >= time[ind[i]] & time < time[ind[i]] + windowSize]
testStat[i] <- sd(Y_subset)
}
## using threshold to determine eye movement intervals
predictedEvent <- which(testStat < thresholdEvents)
eventTimes <- timeMiddle[predictedEvent] # map back to the time of this
gaps <- which(diff(eventTimes) > windowSize )
## estimate event_time_interval
event_time_interval <- min(eventTimes)
for (i in 1:length(gaps)) {
event_time_interval <- append(event_time_interval,
c(eventTimes[gaps[i]],
eventTimes[gaps[i] + 1]))
}
event_time_interval <- append(event_time_interval, max(eventTimes))
event_time_interval <- matrix(event_time_interval, ncol = 2, byrow = TRUE)
predictedEventTimes <- rep(FALSE, length(Y))
for (i in 1:nrow(event_time_interval)) {
predictedEventTimes[event_time_interval[i, 1] <= time & event_time_interval[i, 2] >= time] <- TRUE
}
num_event <- length(gaps) + 1
return(list(num_event = num_event,
testStat = testStat,
eventTimes = eventTimes,
predictedEventTimes = predictedEventTimes,
predictedInterval = event_time_interval))
}
streaming_classifier2 = function(wave_file,predictedInterval,eventTimes) {
Y = wave_file@left
predict = c()
i <- 1
while(i < length(predictedInterval)/2+1)
{
lower_interval = predictedInterval[i,1]
upper_interval = predictedInterval[i,2]
interval = Y[which(eventTimes == lower_interval):which(eventTimes == upper_interval)]
testStat = sum(interval[1:(length(interval) - 1)] * interval[2:(length(interval))] <= 0)
predicted = LR_detection(interval)
predict = c(predict, predicted)
i <- i+1
}
return(predict)
}
j <- 1
total_accuracy_list_smaller_ws <- c()
while(j < length(wave_file_medium) + 1){
waveSeq = wave_file_medium[[j]]
timeSeq <- seq_len(length(waveSeq))/waveSeq@samp.rate
predictedInterval <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$predictedInterval
eventTimes <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$eventTimes
actual <- gsub("_z.wav", "", all_files_medium[j])
actual <- strsplit(actual, "")[[1]][1:20]
actual <- toString(actual)
actual <- gsub(', ', '', actual)
predict <- streaming_classifier2(waveSeq,predictedInterval,eventTimes)
predict <- toString(predict)
predict <- gsub(', ', '', predict)
accuracy <- stringdist(predict,actual, method="lv")
total_accuracy_list_smaller_ws <- append(total_accuracy_list_smaller_ws, accuracy)
j <- j+1
}
length_20_accuracy_smaller_ws <- mean(total_accuracy_list_smaller_ws)
length_20_accuracy_smaller_ws
## [1] 20
# Create a bar plot for comparison, first defining the variables
data <- c(accuracy_smaller_ws, length_8_accuracy_smaller_ws, length_20_accuracy_smaller_ws)
name <- c("Length 3", "Length 8", "Length 20")
colours <- c("#d64e12", "#efdf48","#8bd346")
bp <- barplot(data, col = colours, xlab = "Sequence Length", ylab = "Accuracy", ylim = c(0, 22), main = "Accuracy Comparison Between Each Sequence Length")
# Add the accuracy value above each bar for clear observation
text(bp, data, data, pos = 3, cex = 0.8)
legend("topleft", legend = name, fill = colours, cex = 0.6)The impact of sequence length on classification accuracy depends on the application and classifier being used. Longer sequences may provide more information and context, improving classification accuracy. However, longer sequences may require more computational resources and may be subject to issues such as over-fitting. The impact of sequence length on classification accuracy is context-dependent, requiring careful consideration of the application and the type of classifier being used.