library(readr)
library(maps)
library(ggplot2)
library(viridis)
library(plotly)
library(dplyr)
urlfile = "https://raw.githubusercontent.com/InstituteForGlobalEcology/Coral-bleaching-a-global-analysis-of-the-past-two-decades/master/Reef_Check_with_cortad_variables_with_annual_rate_of_SST_change.csv"
reef_data <- read_csv(url(urlfile))
Question 1 - Case Study 1 (Reef): Visualising data
The data was loaded from a public GitHub data set into this environment using the readr package and the url() function.
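A quick structural check of the loaded data (an optional sketch; it assumes the download succeeded and that the column names below match the CSV header):
# Inspect the dimensions and the key columns used in the maps that follow
dim(reef_data)
head(reef_data[, c("Latitude.Degrees", "Longitude.Degrees", "Average_bleaching", "Year")])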
a. Informative map visualisation
Construct an informative map to visualise the data and investigate whether the authors' claim that "the highest probability of coral bleaching occurred at tropical mid-latitude sites (15–20 degrees north and south of the Equator)" is correct.
world_map <- map_data("world")
ggplot() +
  geom_polygon(data = world_map, aes(x = long, y = lat, group = group),
               fill = "grey", alpha = 0.3) +
  geom_point(data = reef_data, alpha = 0.2,
             aes(y = Latitude.Degrees, x = Longitude.Degrees,
                 size = Average_bleaching, color = Average_bleaching)) +
  scale_colour_viridis() +
  theme_minimal()

The presented visualisation reveals that coral bleaching events are mainly concentrated in the tropical mid-latitude regions around 15-20 degrees north and south of the Equator, confirming the findings of Sully and colleagues. The colour and size of the dots on the map indicate the severity of each bleaching event, allowing for swift identification of areas that need immediate protection from further damage. This visualisation provides valuable insight into the geographic distribution and severity of coral bleaching events and highlights the urgent need for protective measures to preserve marine biodiversity.
b. Interactive map visualisation
Construct an interactive map filtered to a set period of time so that researchers can investigate coral bleaching events around the world as they occurred from 1998 to 2017.
coral_events_year_filtered <- reef_data %>%
filter(Year >= 1998 & Year <= 2017) %>%
select(Longitude.Degrees, Latitude.Degrees, Average_bleaching)
plot_geo(coral_events_year_filtered,
         lon = ~Longitude.Degrees, lat = ~Latitude.Degrees,
         size = ~Average_bleaching, color = ~Average_bleaching)
In this interactive map, each coral bleaching event is plotted at its precise location, with the level of severity conveyed by the size and colour of the points. The map additionally shows that most coral bleaching occurs around the Pacific and the mid-South American region.
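To support the time-filtering requirement more directly, the map can also be animated by year; a hedged sketch, assuming plotly's frame animation is supported for scattergeo traces and keeping the Year column through the select() step:
# Animate the map by year so researchers can step through events from 1998 to 2017
coral_by_year <- reef_data %>%
  filter(Year >= 1998 & Year <= 2017) %>%
  select(Longitude.Degrees, Latitude.Degrees, Average_bleaching, Year)
plot_geo(coral_by_year,
         lon = ~Longitude.Degrees, lat = ~Latitude.Degrees,
         size = ~Average_bleaching, color = ~Average_bleaching,
         frame = ~Year)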
Question 2 - Case Study 2 (Kidney): Blood vs Biopsy Biomarker for classification
Load the data and related libraries for set-up.
library(Biobase)
library(caret)
library(dplyr)
library(e1071)
library(GEOquery)
library(R.utils)
library(ggplot2)
library(reshape2)
library(limma)

a. Genes that can be found in the top 300 differentially expressed genes for both datasets
Load the two RData data sets into the environment and assign them as gse1 and gse2 for further analysis.
# Week 1 Lab 1.1 sample solution
load("GSE46474.RData")
gse1 = gse
load("GSE138043.RData")
gse2 = gse
# number of samples in the Rejection class and the Stable class in gse1 and gse2
gse1$outcome <- ifelse(grepl("AR", gse1$title), "Rejection", "Stable")
gse2$outcome <- ifelse(grepl("non", gse2$characteristics_ch1), "Stable", "Rejection")
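# The class sizes themselves can be tabulated with a quick sketch (the
# resulting counts are not recorded in this document):
table(gse1$outcome)
table(gse2$outcome)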
# extract the featureData object using the fData function.
featureData1 <- fData(gse1)
featureData2 <- fData(gse2)

Create a design matrix for the linear model that will be used to test for differential gene expression, and generate a table of the 50 most significant differentially expressed genes based on the linear model fitted for framework 1.
# Week 1 Lab2.2 sample solution
design1 = model.matrix(~outcome, data = pData(gse1))
fit1 = lmFit(exprs(gse1),design1)
fit1 = eBayes(fit1)
tT1 = topTable(fit1, n = Inf)
DT::datatable(round(tT1[1:300, ], 2))
tT1 = topTable(fit1, genelist = featureData1[,"Gene Symbol"], n = 50)
tT11 = topTable(fit1, genelist = featureData1[,"Gene Symbol"], n = 300)
for (i in 1:length(tT11$ID)) {
  tT11$ID2[i] = tT11$ID[[i]][2]
}

Create a design matrix for the linear model that will be used to test for differential gene expression, and generate a table of the 50 most significant differentially expressed genes based on the linear model fitted for framework 2.
# Week 1 Lab2.2 sample solution
design2 = model.matrix(~outcome, data = pData(gse2))
fit2 = lmFit(exprs(gse2),design2)
fit2 = eBayes(fit2)
tT2 = topTable(fit2, n = Inf)
DT::datatable(round(tT2[1:300, ], 2))
tT2 = topTable(fit2, genelist = featureData2[,"gene_assignment"], n = 50)
tT22 = topTable(fit2, genelist = featureData2[,"gene_assignment"], n = 300)
tT22$ID2 = strsplit(tT22$ID, " // ", fixed = TRUE)
for (i in 1:length(tT22$ID2)) {
  tT22$ID3[i] = tT22$ID2[[i]][2]
}

Compare the two frameworks under the filtered conditions above to observe the overlapping genes.
overlapping_genes <- intersect(tT11$ID, tT22$ID3)
overlapping_genes
## [1] "SLAMF6" "RASA3"  "ITGAX"
The top 300 differentially expressed genes in the GSE46474 and GSE138043 data sets share three overlapping genes: "SLAMF6", "RASA3", and "ITGAX".
b. Visualisation of identifying 50 most differentially expressed genes from the entire data set
Load GSE46474.RData and GSE138043.RData as gse46474 and gse138043 respectively.
load("GSE46474.RData")
gse46474 = gse
load("GSE138043.RData")
gse138043 = gse

Generate a table of the top 50 differentially expressed genes based on the linear model, filtering out missing values with the na.omit() function. Then create a subset of the original data for downstream analyses by extracting the rows corresponding to the top differentially expressed genes.
top_Table_framework1 = topTable(fit1, n = 50, genelist = featureData1[,"Gene Symbol"], sort.by = "logFC")
top_Table_framework1 = na.omit(top_Table_framework1)
top_genes <- rownames(top_Table_framework1)
gse1_m = gse1[top_genes,]

Apply the machine learning model to predict patient rejection status from the gene expression data.
# Week 2 Lab3.2 sample solution
largevar = apply(gse1_m, 1, var)
ind = which(largevar > quantile(largevar, 0.95))
X = as.matrix(t(exprs(gse1_m)))
y = gse1_m$outcome
cvK = 5 # number of CV folds
cv_50acc5_svm_1 = c()
cv_acc_svm_1 = c()
n_sim = 50 ## number of repeats
for (i in 1:n_sim) {
cvSets = cvTools::cvFolds(nrow(X), cvK) # permute all the data, into 5 folds
cv_acc_svm_1 = c()
for (j in 1:cvK) {
test_id = cvSets$subsets[cvSets$which == j]
X_test = X[test_id, ]
X_train = X[-test_id, ]
y_test = y[test_id]
y_train = y[-test_id]
## SVM
svm_res <- e1071::svm(x = X_train, y = as.factor(y_train))
fit <- predict(svm_res, X_test)
cv_acc_svm_1[j] = mean(fit == y_test)
}
cv_50acc5_svm_1 <- append(cv_50acc5_svm_1, mean(cv_acc_svm_1))
} ## end for

The same steps are applied to framework 2.
top_Table_framework2 = topTable(fit2, n = 50, genelist = featureData2[,"gene_assignment"], sort.by = "logFC")
top_Table_framework2 = na.omit(top_Table_framework2)
top_genes <- rownames(top_Table_framework2)
gse2_m = gse2[top_genes,]

# Week 2 Lab 3.2 sample code
largevar = apply(gse2_m, 1, var)
ind = which(largevar > quantile(largevar, 0.95))
X = as.matrix(t(exprs(gse2_m)))
y = gse2_m$outcome
cvK = 5 # number of CV folds
cv_50acc5_svm_2 = c()
cv_acc_svm_2 = c()
n_sim = 50 ## number of repeats
for (i in 1:n_sim) {
cvSets = cvTools::cvFolds(nrow(X), cvK) # permute all the data, into 5 folds
cv_acc_svm_2 = c()
for (j in 1:cvK) {
test_id = cvSets$subsets[cvSets$which == j]
X_test = X[test_id, ]
X_train = X[-test_id, ]
y_test = y[test_id]
y_train = y[-test_id]
## SVM
svm_res <- e1071::svm(x = X_train, y = as.factor(y_train))
fit <- predict(svm_res, X_test)
cv_acc_svm_2[j] = mean(fit == y_test)
}
cv_50acc5_svm_2 <- append(cv_50acc5_svm_2, mean(cv_acc_svm_2))
} ## end for

Compare the two frameworks' accuracies by placing the boxplot summaries side by side for easy identification of differences.
boxplot(list(frame1 = cv_50acc5_svm_1, frame2 = cv_50acc5_svm_2), ylab = "CV Accuracy")
text(y = boxplot.stats(cv_50acc5_svm_1)$stats,
labels = boxplot.stats(cv_50acc5_svm_1)$stats,
x=1.48,
cex = 0.8,
col = "Red")
text(y = boxplot.stats(cv_50acc5_svm_2)$stats,
labels = boxplot.stats(cv_50acc5_svm_2)$stats,
x=2.62,
cex = 0.7,
col = "Red")
# Add a horizontal line for the mean
abline(h = mean(cv_50acc5_svm_1),
col = "darkgreen",
lwd = 1)
# Add the text for the mean over the mean line
text(y=mean(cv_50acc5_svm_1),
x=1.7,
paste("Framework 1 Mean:", round(mean(cv_50acc5_svm_1),3)),
col = "darkgreen",
cex = 0.8,
pos = 4)
# Same process with a different data frame
abline(h = mean(cv_50acc5_svm_2),
col = "blue",
lwd = 1)
text(y=mean(cv_50acc5_svm_2),
x=1.7,
paste("Framework 2 Mean:", round(mean(cv_50acc5_svm_2),3)),
col = "blue",
cex = 0.8,
pos = 4)

From the boxplots above, the mean accuracy of frame 1 (GSE46474) is 0.814, whereas the mean accuracy of frame 2 (GSE138043) is 0.8, slightly lower than frame 1. Using 5-fold cross-validation (with 50 repeats), this suggests that GSE46474 is classified slightly more accurately than GSE138043.
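To check whether this small gap reflects more than fold-to-fold noise, a quick two-sample test can be run on the two vectors of repeated-CV accuracies (an indicative sketch only, since repeated cross-validation runs are not fully independent):
# Hedged significance check on the 50 repeated-CV accuracy means of each framework
t.test(cv_50acc5_svm_1, cv_50acc5_svm_2)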
c. Visualisation of identifying 50 most differentially expressed genes from 80:20 split training and testing set
Apply the machine learning models to predict patient rejection status from gene expression data.
# Week 2 3.2 lab sample code
set.seed(3888)
largevar = apply(gse1, 1, var)
ind = which(largevar > quantile(largevar, 0.95))
X = t(exprs(gse1))
y = gse1$outcome
cvK = 5 # number of CV folds
cv_50acc5_svm_1_1 = c()
cv_acc_svm_1_1 = c()
n_sim = 50 ## number of repeats
for (i in 1:n_sim) {
cvSets = cvTools::cvFolds(nrow(X), cvK) # permute all the data, into 5 folds
cv_acc_svm_1_1 = c()
for (j in 1:cvK) {
test_id = cvSets$subsets[cvSets$which == j]
X_test = X[test_id, ]
X_train = X[-test_id, ]
y_test = y[test_id]
y_train = y[-test_id]
design1 <- model.matrix(~outcome, data = pData(gse1[-test_id, ]))
#colnames(design1) <- levels(y_train)
fit1 = lmFit(exprs(gse1[-test_id, ]),design1)
fit1 = eBayes(fit1)
tT1_50 = topTable(fit1, n = 50, sort.by = "logFC")
top_genes <- rownames(tT1_50)
X_train_subset <- X_train[,top_genes]
X_test_subset <- X_test[,top_genes]
test_id2 = sample(1:50, round(50 * 0.8))
X_train_subset_matrix = as.matrix(X_train_subset)
X_test_subset_matrix = as.matrix(X_test_subset)
## SVM
svm_res11 <- e1071::svm(x = X_train_subset_matrix, y = as.factor(y_train))
fit <- predict(svm_res11, X_test_subset_matrix)
cv_acc_svm_1_1[j] = mean(fit == y_test)
}
cv_50acc5_svm_1_1 <- append(cv_50acc5_svm_1_1, mean(cv_acc_svm_1_1))
} ## end for

# Week 2 Lab 3.2 sample code
set.seed(3888)
largevar = apply(gse2, 1, var)
ind = which(largevar > quantile(largevar, 0.95))
X = t(exprs(gse2))
y = gse2$outcome
cvK = 5 # number of CV folds
cv_50acc5_svm_2_2 = c()
cv_acc_svm_2_2 = c()
n_sim = 50 ## number of repeats
for (i in 1:n_sim) {
cvSets = cvTools::cvFolds(nrow(X), cvK) # permute all the data, into 5 folds
cv_acc_svm_2_2 = c()
for (j in 1:cvK) {
test_id = cvSets$subsets[cvSets$which == j]
X_test = X[test_id, ]
X_train = X[-test_id, ]
y_test = y[test_id]
y_train = y[-test_id]
design2 <- model.matrix(~outcome, data = pData(gse2[-test_id, ]))
#colnames(design1) <- levels(y_train)
fit2 = lmFit(exprs(gse2[-test_id, ]),design2)
fit2 = eBayes(fit2)
tT2_50 = topTable(fit2, n = 50, sort.by = "logFC")
top_genes <- rownames(tT2_50)
X_train_subset <- X_train[,top_genes]
X_test_subset <- X_test[,top_genes]
test_id2 = sample(1:50, round(50 * 0.8))
X_train_subset_matrix = as.matrix(X_train_subset)
X_test_subset_matrix = as.matrix(X_test_subset)
## SVM
svm_res12 <- e1071::svm(x = X_train_subset_matrix, y = as.factor(y_train))
fit <- predict(svm_res12, X_test_subset_matrix)
cv_acc_svm_2_2[j] = mean(fit == y_test)
}
cv_50acc5_svm_2_2 <- append(cv_50acc5_svm_2_2, mean(cv_acc_svm_2_2))
} ## end for

The same approach is applied to the other data set, frame 2.
boxplot(list(frame1 = cv_50acc5_svm_1_1, frame2 = cv_50acc5_svm_2_2), ylab = "CV Accuracy")
text(y = boxplot.stats(cv_50acc5_svm_1_1)$stats,
labels = boxplot.stats(cv_50acc5_svm_1_1)$stats,
x=1.48,
cex = 0.8,
col = "Red")
text(y = boxplot.stats(cv_50acc5_svm_2_2)$stats,
labels = boxplot.stats(cv_50acc5_svm_2_2)$stats,
x=2.62,
cex = 0.7,
col = "Red")
# Add a horizontal line for the mean
abline(h = mean(cv_50acc5_svm_1_1),
col = "darkgreen",
lwd = 1)
# Add the text for the mean over the mean line
text(y=mean(cv_50acc5_svm_1_1),
x=1.7,
paste("Framework 1 Mean:", round(mean(cv_50acc5_svm_1_1),3)),
col = "darkgreen",
cex = 0.8,
pos = 4)
# Same process with a different data frame
abline(h = mean(cv_50acc5_svm_2_2),
col = "blue",
lwd = 1)
text(y=mean(cv_50acc5_svm_2_2),
x=0.6,
paste("Framework 2 Mean:", round(mean(cv_50acc5_svm_2_2),3)),
col = "blue",
cex = 0.8,
pos = 4)

From the boxplots above, frame 1 shows higher accuracy than frame 2: its mean is 0.881 while the other is only 0.801. This shows a clear difference between frames 1 and 2 and indicates that frame 1 is classified more accurately than frame 2. The finding highlights the importance of selecting an appropriate frame for the task at hand to achieve optimal accuracy in the result.
d. Comparison between part b and part c
Combine the four sets of results (two frames under two classification approaches) into one visualisation to compare their differences.
boxplot(list(frame1_b = cv_50acc5_svm_1, frame1_c = cv_50acc5_svm_1_1, frame2_b = cv_50acc5_svm_2, frame2_c = cv_50acc5_svm_2_2), ylab = "CV Accuracy")

The accuracy of different approaches for classifying gene expression datasets from peripheral blood and kidney biopsy samples was analysed to assess the validity and reliability of the two frames. The analysis revealed significant differences in accuracy between frame 1, consisting of the peripheral blood gene expression dataset, and frame 2, comprising samples that were sequenced from a kidney biopsy. The accuracy range for frame 1 was found to be considerably higher than that of frame 2, indicating that the validity of frame 1 is superior to frame 2. This finding suggests that peripheral blood gene expression data may be more precise in predicting disease outcomes than gene expression data obtained from kidney biopsy samples.
Further analysis using different approaches, including 80:20 training and testing datasets and 50 repeated cross-validation classifiers, revealed that employing the former approach resulted in significantly higher accuracy for frame 1. However, no significant differences were observed in frame 2 despite the application of different approaches on the same dataset.
Overall, the multiple observation approaches employed in this study provide evidence that peripheral blood gene expression data may be more reliable in predicting disease outcomes than gene expression data obtained from kidney biopsy samples. This finding has important implications for the use of gene expression data in precision medicine, as it suggests that non-invasive methods such as peripheral blood collection may be more effective in predicting disease outcomes than invasive methods such as kidney biopsy.
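To put numbers behind this comparison, a short summary sketch of the four repeated-CV accuracy vectors from parts b and c (assuming all four remain in the workspace):
# Mean and standard deviation of each repeated-CV accuracy vector
acc <- list(frame1_b = cv_50acc5_svm_1, frame1_c = cv_50acc5_svm_1_1,
            frame2_b = cv_50acc5_svm_2, frame2_c = cv_50acc5_svm_2_2)
data.frame(mean = sapply(acc, mean), sd = sapply(acc, sd))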
Question 3 - Case Study 3 (Brain): Streaming classifier for Brain-box
library(tidyverse)
library(tuneR)
library(devtools)
library(ggplot2)
library(tsfeatures)
library(stringdist)

dir_medium = "Spiker_box_Louis/Medium"
all_files_medium <- list.files(dir_medium)
wave_file_medium = list()
for (i in all_files_medium) {
wave_file_medium[[i]] <- tuneR::readWave(file.path(dir_medium, i))
}

a. Classification rule for detecting a series of {L, R}
Build a classification rule under streaming conditions for detecting a series of {L, R}.
# Sample Code: Lab3 - 2.3
LR_detection = function(seq) {
maxval = which.max(seq)
minval = which.min(seq)
movement = ifelse(maxval < minval, "L", "R")
return(movement)
}
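# Quick sanity check of LR_detection on hypothetical signals: a maximum that
# occurs before the minimum should read as "L", and the reverse as "R".
LR_detection(c(0, 5, 0, -5, 0))  # "L": peak (index 2) before trough (index 4)
LR_detection(c(0, -5, 0, 5, 0))  # "R": trough before peak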
# Week 3 Lab 3.3 Sample solution
streaming_classifier = function(wave_file,
window_size = wave_file@samp.rate,
increment = window_size/3,
thresholdEvents = 50
) {
Y = wave_file@left
xtime = seq_len(length(wave_file@left))/wave_file@samp.rate
predict = c()
lower_interval = 1
max_time = max(xtime)*window_size
while(max_time > lower_interval + window_size)
{
upper_interval = lower_interval + window_size
interval = Y[lower_interval:upper_interval]
testStat = sum(interval[1:(length(interval) - 1)] * interval[2:(length(interval))] <= 0)
predicted = LR_detection(interval)
predict = c(predict, predicted)  # keep the predicted label for this window
lower_interval = lower_interval + window_size
}
return(predict)
}

The streaming_classifier function detects a series of {L, R} in a wave file using a sliding-window approach and a simple classification rule. The function takes as input a wave file, a window size, an increment, and a threshold for the number of events to detect. For each window, it computes a zero-crossing test statistic and assigns a label of L or R with LR_detection, based on whether the window's maximum occurs before its minimum. The function iterates through the wave file, shifting the window forward one window at a time and keeping the predicted label for each window, and returns the labels as the final output.
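A minimal usage sketch, assuming the Medium wave files loaded at the start of this question:
# Run the streaming classifier on the first recording and inspect the labels
wave <- wave_file_medium[[1]]
streaming_classifier(wave, window_size = wave@samp.rate)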
b. Metric to estimate the accuracy of your classifier on the length 3 wave files
Create a metric to estimate the accuracy of the classifier on the length 3 wave files
eye_movement_ZC = function(Y, time,
windowSize = wave_file@samp.rate,
thresholdEvents = 30,
downSampleRate = 50) {
## down sample
ind = seq_len(which(time == round(time[length(time)] - windowSize, 4)) + 1)
ind = seq(1, ind[length(ind)], by = downSampleRate)
## time vector for middle of each window
timeMiddle <- time[ind] + windowSize/2
testStat = testStat1 = testStat2 = rep(NA, length(ind))
stat_event <- rep(NA, length(ind))
## calculate zero-crossings
for (i in 1:length(ind)) {
Y_subset <- Y[time >= time[ind[i]] & time < time[ind[i]] + windowSize]
testStat[i] <- sd(Y_subset)
testStat1[i] = max(Y_subset)
testStat2[i] = sum(Y_subset[1:(length(Y_subset) - 1)] * Y_subset[2:(length(Y_subset))] <= 0)
}
## using threshold to determine eye movement intervals
predictedEvent <- which(testStat < thresholdEvents)
eventTimes <- timeMiddle[predictedEvent] # map back to the time of this
gaps <- which(diff(eventTimes) > windowSize )
## estimate event_time_interval
event_time_interval <- min(eventTimes)
for (i in 1:length(gaps)) {
event_time_interval <- append(event_time_interval,
c(eventTimes[gaps[i]],
eventTimes[gaps[i] + 1]))
}
event_time_interval <- append(event_time_interval, max(eventTimes))
event_time_interval <- matrix(event_time_interval, ncol = 2, byrow = TRUE)
predictedEventTimes <- rep(FALSE, length(Y))
for (i in 1:nrow(event_time_interval)) {
predictedEventTimes[event_time_interval[i, 1] <= time & event_time_interval[i, 2] >= time] <- TRUE
}
num_event <- length(gaps) + 1
return(list(num_event = num_event,
predictedEventTimes = predictedEventTimes,
predictedInterval = event_time_interval))
}

# Source: Lab3 - 3.3
streaming_classifier = function(wave_file,
window_size = wave_file@samp.rate,
increment = window_size/3,
thresholdEvents = 50
) {
Y = wave_file@left
xtime = seq_len(length(wave_file@left))/wave_file@samp.rate
predicted_labels = c()
lower_interval = 1
max_time = max(xtime)*window_size
while(max_time > lower_interval + window_size)
{
upper_interval = lower_interval + window_size
interval = Y[lower_interval:upper_interval]
testStat = sum(interval[1:(length(interval) - 1)] * interval[2:(length(interval))] <= 0)
predicted = LR_detection(interval)
predicted_labels = c(predicted_labels, predicted)  # keep the predicted label for this window
lower_interval = lower_interval + window_size
}
return(predicted_labels)
}

i <- 1
total_accuracy_list <- c()
total_predicted_labels <- c()
while(i < length(wave_file_medium) + 1){
wave_file = wave_file_medium[[i]]
predicted_labels = streaming_classifier(wave_file, wave_file@samp.rate)
predicted_labels <- toString(predicted_labels)
predicted_labels <- gsub(', ', '', predicted_labels)
#print(predicted_labels)
total_predicted_labels <- append(total_predicted_labels, predicted_labels)
true_sequence <- gsub("_z.wav", "", all_files_medium[i])
true_sequence <- strsplit(true_sequence, "")[[1]][1:3]
true_sequence <- toString(true_sequence)
true_sequence <- gsub(', ', '', true_sequence)
#print(true_sequence)
correct_predictions = 0
accuracy <- stringdist(predicted_labels,true_sequence, method="lv")
total_accuracy_list <- append(total_accuracy_list, accuracy)
i <- i+1
}
mean(total_accuracy_list)
## [1] 3
The stringdist function provides the metric: the Levenshtein distance between the predicted and actual sequences counts the character edits separating them, and averaging this distance over all files gives the final score. In this case the mean is 3, meaning that on average three characters differ between the predicted and actual movement sequences; in calculation, (3/8)*100 = 37.5%, hence the accuracy rate is 37.5%, which is relatively low when estimating the classifier with this metric.
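A length-normalised variant of this metric makes scores comparable across sequence lengths; a small sketch (the helper name lv_accuracy is hypothetical):
# Normalise the Levenshtein distance by the true sequence length, so that
# 1 = perfect match and 0 = completely different
lv_accuracy <- function(predicted, actual) {
  1 - stringdist(predicted, actual, method = "lv") / nchar(actual)
}
lv_accuracy("LRL", "LRR")  # 0.667: two of the three characters match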
c. Compare at least four different classification rules on the length 3 wave files
Use the metric from part b to compare at least four different classification rules on the length 3 wave files, then select the best model among those classification rules.
Combine the streaming classifier with the eye-movement zero-crossing function and vary each parameter in turn:
- Normal (baseline) set of parameters
eye_movement_ZC = function(Y, time,
windowSize = 0.5,
thresholdEvents = 200,
downSampleRate = 50) {
## down sample
ind = seq_len(which(time == round(time[length(time)] - windowSize, 4)) + 1)
ind = seq(1, ind[length(ind)], by = downSampleRate)
## time vector for middle of each window
timeMiddle <- time[ind] + windowSize/2
testStat = testStat1 = testStat2 = rep(NA, length(ind))
stat_event <- rep(NA, length(ind))
## calculate zero-crossings
for (i in 1:length(ind)) {
Y_subset <- Y[time >= time[ind[i]] & time < time[ind[i]] + windowSize]
testStat[i] <- sd(Y_subset)
#testStat1[i] = max(Y_subset)
#testStat2[i] = sum(Y_subset[1:(length(Y_subset) - 1)] * Y_subset[2:(length(Y_subset))] <= 0)
}
## using threshold to determine eye movement intervals
predictedEvent <- which(testStat < thresholdEvents)
eventTimes <- timeMiddle[predictedEvent] # map back to the time of this
gaps <- which(diff(eventTimes) > windowSize )
## estimate event_time_interval
event_time_interval <- min(eventTimes)
for (i in 1:length(gaps)) {
event_time_interval <- append(event_time_interval,
c(eventTimes[gaps[i]],
eventTimes[gaps[i] + 1]))
}
event_time_interval <- append(event_time_interval, max(eventTimes))
event_time_interval <- matrix(event_time_interval, ncol = 2, byrow = TRUE)
predictedEventTimes <- rep(FALSE, length(Y))
for (i in 1:nrow(event_time_interval)) {
predictedEventTimes[event_time_interval[i, 1] <= time & event_time_interval[i, 2] >= time] <- TRUE
}
num_event <- length(gaps) + 1
return(list(num_event = num_event,
testStat = testStat,
eventTimes = eventTimes,
predictedEventTimes = predictedEventTimes,
predictedInterval = event_time_interval))
}
streaming_classifier2 = function(wave_file,predictedInterval,eventTimes) {
Y = wave_file@left
predicted_labels = c()
i <- 1
while(i < length(predictedInterval)/2+1)
{
lower_interval = predictedInterval[i,1]
upper_interval = predictedInterval[i,2]
interval = Y[which(eventTimes == lower_interval):which(eventTimes == upper_interval)]
testStat = sum(interval[1:(length(interval) - 1)] * interval[2:(length(interval))] <= 0)
predicted = LR_detection(interval)
predicted_labels = c(predicted_labels, predicted)
i <- i+1
}
return(predicted_labels)
}
j <- 1
total_accuracy_list <- c()
while(j < length(wave_file_medium) + 1){
waveSeq = wave_file_medium[[j]]
timeSeq <- seq_len(length(waveSeq))/waveSeq@samp.rate
predictedInterval <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$predictedInterval
eventTimes <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$eventTimes
true_sequence <- gsub("_z.wav", "", all_files_medium[j])
true_sequence <- strsplit(true_sequence, "")[[1]][1:3]
true_sequence <- toString(true_sequence)
true_sequence <- gsub(', ', '', true_sequence)
predicted_labels <- streaming_classifier2(waveSeq,predictedInterval,eventTimes)
predicted_labels <- toString(predicted_labels)
predicted_labels <- gsub(', ', '', predicted_labels)
accuracy <- stringdist(predicted_labels,true_sequence, method="lv")
total_accuracy_list <- append(total_accuracy_list, accuracy)
j <- j+1
}
accuracy_normal <- mean(total_accuracy_list)
accuracy_normal
## [1] 6.2
- Increase the window size from 0.5 to 1.0
eye_movement_ZC = function(Y, time,
windowSize = 1.0,
thresholdEvents = 200,
downSampleRate = 50) {
## down sample
ind = seq_len(which(time == round(time[length(time)] - windowSize, 4)) + 1)
ind = seq(1, ind[length(ind)], by = downSampleRate)
## time vector for middle of each window
timeMiddle <- time[ind] + windowSize/2
testStat = testStat1 = testStat2 = rep(NA, length(ind))
stat_event <- rep(NA, length(ind))
## calculate zero-crossings
for (i in 1:length(ind)) {
Y_subset <- Y[time >= time[ind[i]] & time < time[ind[i]] + windowSize]
testStat[i] <- sd(Y_subset)
}
## using threshold to determine eye movement intervals
predictedEvent <- which(testStat < thresholdEvents)
eventTimes <- timeMiddle[predictedEvent] # map back to the time of this
gaps <- which(diff(eventTimes) > windowSize )
## estimate event_time_interval
event_time_interval <- min(eventTimes)
for (i in 1:length(gaps)) {
event_time_interval <- append(event_time_interval,
c(eventTimes[gaps[i]],
eventTimes[gaps[i] + 1]))
}
event_time_interval <- append(event_time_interval, max(eventTimes))
event_time_interval <- matrix(event_time_interval, ncol = 2, byrow = TRUE)
predictedEventTimes <- rep(FALSE, length(Y))
for (i in 1:nrow(event_time_interval)) {
predictedEventTimes[event_time_interval[i, 1] <= time & event_time_interval[i, 2] >= time] <- TRUE
}
num_event <- length(gaps) + 1
return(list(num_event = num_event,
testStat = testStat,
eventTimes = eventTimes,
predictedEventTimes = predictedEventTimes,
predictedInterval = event_time_interval))
}
streaming_classifier2 = function(wave_file,predictedInterval,eventTimes) {
Y = wave_file@left
predict = c()
i <- 1
while(i < length(predictedInterval)/2+1)
{
lower_interval = predictedInterval[i,1]
upper_interval = predictedInterval[i,2]
interval = Y[which(eventTimes == lower_interval):which(eventTimes == upper_interval)]
testStat = sum(interval[1:(length(interval) - 1)] * interval[2:(length(interval))] <= 0)
predicted = LR_detection(interval)
predict = c(predict, predicted)
i <- i+1
}
return(predict)
}
j <- 1
total_accuracy_list_larger_ws <- c()
while(j < length(wave_file_medium) + 1){
waveSeq = wave_file_medium[[j]]
timeSeq <- seq_len(length(waveSeq))/waveSeq@samp.rate
predictedInterval <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$predictedInterval
eventTimes <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$eventTimes
actual <- gsub("_z.wav", "", all_files_medium[j])
actual <- strsplit(actual, "")[[1]][1:3]
actual <- toString(actual)
actual <- gsub(', ', '', actual)
predict <- streaming_classifier2(waveSeq,predictedInterval,eventTimes)
predict <- toString(predict)
predict <- gsub(', ', '', predict)
accuracy <- stringdist(predict,actual, method="lv")
total_accuracy_list_larger_ws <- append(total_accuracy_list_larger_ws, accuracy)
j <- j+1
}
accuracy_larger_ws <- mean(total_accuracy_list_larger_ws)
accuracy_larger_ws
## [1] 6
- Decrease the window size from 0.5 to 0.2
eye_movement_ZC = function(Y, time,
windowSize = 0.2,
thresholdEvents = 200,
downSampleRate = 50) {
## down sample
ind = seq_len(which(time == round(time[length(time)] - windowSize, 4)) + 1)
ind = seq(1, ind[length(ind)], by = downSampleRate)
## time vector for middle of each window
timeMiddle <- time[ind] + windowSize/2
testStat = testStat1 = testStat2 = rep(NA, length(ind))
stat_event <- rep(NA, length(ind))
## calculate zero-crossings
for (i in 1:length(ind)) {
Y_subset <- Y[time >= time[ind[i]] & time < time[ind[i]] + windowSize]
testStat[i] <- sd(Y_subset)
}
## using threshold to determine eye movement intervals
predictedEvent <- which(testStat < thresholdEvents)
eventTimes <- timeMiddle[predictedEvent] # map back to the time of this
gaps <- which(diff(eventTimes) > windowSize )
## estimate event_time_interval
event_time_interval <- min(eventTimes)
for (i in 1:length(gaps)) {
event_time_interval <- append(event_time_interval,
c(eventTimes[gaps[i]],
eventTimes[gaps[i] + 1]))
}
event_time_interval <- append(event_time_interval, max(eventTimes))
event_time_interval <- matrix(event_time_interval, ncol = 2, byrow = TRUE)
predictedEventTimes <- rep(FALSE, length(Y))
for (i in 1:nrow(event_time_interval)) {
predictedEventTimes[event_time_interval[i, 1] <= time & event_time_interval[i, 2] >= time] <- TRUE
}
num_event <- length(gaps) + 1
return(list(num_event = num_event,
testStat = testStat,
eventTimes = eventTimes,
predictedEventTimes = predictedEventTimes,
predictedInterval = event_time_interval))
}
streaming_classifier2 = function(wave_file,predictedInterval,eventTimes) {
Y = wave_file@left
predict = c()
i <- 1
while(i < length(predictedInterval)/2+1)
{
lower_interval = predictedInterval[i,1]
upper_interval = predictedInterval[i,2]
interval = Y[which(eventTimes == lower_interval):which(eventTimes == upper_interval)]
testStat = sum(interval[1:(length(interval) - 1)] * interval[2:(length(interval))] <= 0)
predicted = LR_detection(interval)
predict = c(predict, predicted)
i <- i+1
}
return(predict)
}
j <- 1
total_accuracy_list_smaller_ws <- c()
while(j < length(wave_file_medium) + 1){
waveSeq = wave_file_medium[[j]]
timeSeq <- seq_len(length(waveSeq))/waveSeq@samp.rate
predictedInterval <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$predictedInterval
eventTimes <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$eventTimes
actual <- gsub("_z.wav", "", all_files_medium[j])
actual <- strsplit(actual, "")[[1]][1:3]
actual <- toString(actual)
actual <- gsub(', ', '', actual)
predict <- streaming_classifier2(waveSeq,predictedInterval,eventTimes)
predict <- toString(predict)
predict <- gsub(', ', '', predict)
accuracy <- stringdist(predict,actual, method="lv")
total_accuracy_list_smaller_ws <- append(total_accuracy_list_smaller_ws, accuracy)
j <- j+1
}
accuracy_smaller_ws <- mean(total_accuracy_list_smaller_ws)
accuracy_smaller_ws
## [1] 11
- Increase thresholdEvents from 200 to 500
eye_movement_ZC = function(Y, time,
windowSize = 0.5,
thresholdEvents = 500,
downSampleRate = 50) {
## down sample
ind = seq_len(which(time == round(time[length(time)] - windowSize, 4)) + 1)
ind = seq(1, ind[length(ind)], by = downSampleRate)
## time vector for middle of each window
timeMiddle <- time[ind] + windowSize/2
testStat = testStat1 = testStat2 = rep(NA, length(ind))
stat_event <- rep(NA, length(ind))
## calculate zero-crossings
for (i in 1:length(ind)) {
Y_subset <- Y[time >= time[ind[i]] & time < time[ind[i]] + windowSize]
testStat[i] <- sd(Y_subset)
}
## using threshold to determine eye movement intervals
predictedEvent <- which(testStat < thresholdEvents)
eventTimes <- timeMiddle[predictedEvent] # map back to the time of this
gaps <- which(diff(eventTimes) > windowSize )
## estimate event_time_interval
event_time_interval <- min(eventTimes)
for (i in 1:length(gaps)) {
event_time_interval <- append(event_time_interval,
c(eventTimes[gaps[i]],
eventTimes[gaps[i] + 1]))
}
event_time_interval <- append(event_time_interval, max(eventTimes))
event_time_interval <- matrix(event_time_interval, ncol = 2, byrow = TRUE)
predictedEventTimes <- rep(FALSE, length(Y))
for (i in 1:nrow(event_time_interval)) {
predictedEventTimes[event_time_interval[i, 1] <= time & event_time_interval[i, 2] >= time] <- TRUE
}
num_event <- length(gaps) + 1
return(list(num_event = num_event,
testStat = testStat,
eventTimes = eventTimes,
predictedEventTimes = predictedEventTimes,
predictedInterval = event_time_interval))
}
streaming_classifier2 = function(wave_file,predictedInterval,eventTimes) {
Y = wave_file@left
predict = c()
i <- 1
while(i < length(predictedInterval)/2+1)
{
lower_interval = predictedInterval[i,1]
upper_interval = predictedInterval[i,2]
interval = Y[which(eventTimes == lower_interval):which(eventTimes == upper_interval)]
testStat = sum(interval[1:(length(interval) - 1)] * interval[2:(length(interval))] <= 0)
predicted = LR_detection(interval)
predict = c(predict, predicted)
i <- i+1
}
return(predict)
}
j <- 1
total_accuracy_list_larger_te <- c()
while(j < length(wave_file_medium) + 1){
waveSeq = wave_file_medium[[j]]
timeSeq <- seq_len(length(waveSeq))/waveSeq@samp.rate
predictedInterval <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$predictedInterval
eventTimes <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$eventTimes
actual <- gsub("_z.wav", "", all_files_medium[j])
actual <- strsplit(actual, "")[[1]][1:3]
actual <- toString(actual)
actual <- gsub(', ', '', actual)
predict <- streaming_classifier2(waveSeq,predictedInterval,eventTimes)
predict <- toString(predict)
predict <- gsub(', ', '', predict)
accuracy <- stringdist(predict,actual, method="lv")
total_accuracy_list_larger_te <- append(total_accuracy_list_larger_te, accuracy)
j <- j+1
}
accuracy_larger_te <- mean(total_accuracy_list_larger_te)
accuracy_larger_te
## [1] 6.2
- Decrease thresholdEvents from 200 to 100
eye_movement_ZC = function(Y, time,
windowSize = 0.5,
thresholdEvents = 100,
downSampleRate = 50) {
## down sample
ind = seq_len(which(time == round(time[length(time)] - windowSize, 4)) + 1)
ind = seq(1, ind[length(ind)], by = downSampleRate)
## time vector for middle of each window
timeMiddle <- time[ind] + windowSize/2
testStat = testStat1 = testStat2 = rep(NA, length(ind))
stat_event <- rep(NA, length(ind))
## calculate zero-crossings
for (i in 1:length(ind)) {
Y_subset <- Y[time >= time[ind[i]] & time < time[ind[i]] + windowSize]
testStat[i] <- sd(Y_subset)
#testStat1[i] = max(Y_subset)
#testStat2[i] = sum(Y_subset[1:(length(Y_subset) - 1)] * Y_subset[2:(length(Y_subset))] <= 0)
}
## using threshold to determine eye movement intervals
predictedEvent <- which(testStat < thresholdEvents)
eventTimes <- timeMiddle[predictedEvent] # map back to the time of this
gaps <- which(diff(eventTimes) > windowSize )
## estimate event_time_interval
event_time_interval <- min(eventTimes)
for (i in 1:length(gaps)) {
event_time_interval <- append(event_time_interval,
c(eventTimes[gaps[i]],
eventTimes[gaps[i] + 1]))
}
event_time_interval <- append(event_time_interval, max(eventTimes))
event_time_interval <- matrix(event_time_interval, ncol = 2, byrow = TRUE)
predictedEventTimes <- rep(FALSE, length(Y))
for (i in 1:nrow(event_time_interval)) {
predictedEventTimes[event_time_interval[i, 1] <= time & event_time_interval[i, 2] >= time] <- TRUE
}
num_event <- length(gaps) + 1
return(list(num_event = num_event,
testStat = testStat,
eventTimes = eventTimes,
predictedEventTimes = predictedEventTimes,
predictedInterval = event_time_interval))
}
streaming_classifier2 = function(wave_file,predictedInterval,eventTimes) {
Y = wave_file@left
predict = c()
i <- 1
while(i < length(predictedInterval)/2+1)
{
lower_interval = predictedInterval[i,1]
upper_interval = predictedInterval[i,2]
interval = Y[which(eventTimes == lower_interval):which(eventTimes == upper_interval)]
testStat = sum(interval[1:(length(interval) - 1)] * interval[2:(length(interval))] <= 0)
predicted = LR_detection(interval)
predict = c(predict, predicted)
i <- i+1
}
return(predict)
}
j <- 1
total_accuracy_list_smaller_te <- c()
while(j < length(wave_file_medium) + 1){
waveSeq = wave_file_medium[[j]]
timeSeq <- seq_len(length(waveSeq))/waveSeq@samp.rate
predictedInterval <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$predictedInterval
eventTimes <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$eventTimes
actual <- gsub("_z.wav", "", all_files_medium[j])
actual <- strsplit(actual, "")[[1]][1:3]
actual <- toString(actual)
actual <- gsub(', ', '', actual)
predict <- streaming_classifier2(waveSeq,predictedInterval,eventTimes)
predict <- toString(predict)
predict <- gsub(', ', '', predict)
accuracy <- stringdist(predict,actual, method="lv")
total_accuracy_list_smaller_te <- append(total_accuracy_list_smaller_te, accuracy)
j <- j+1
}
accuracy_smaller_te <- mean(total_accuracy_list_smaller_te)
accuracy_smaller_te
## [1] 6.4
- Increase downSampleRate from 50 to 100
eye_movement_ZC = function(Y, time,
windowSize = 0.5,
thresholdEvents = 200,
downSampleRate = 100) {
## down sample
ind = seq_len(which(time == round(time[length(time)] - windowSize, 4)) + 1)
ind = seq(1, ind[length(ind)], by = downSampleRate)
## time vector for middle of each window
timeMiddle <- time[ind] + windowSize/2
testStat = testStat1 = testStat2 = rep(NA, length(ind))
stat_event <- rep(NA, length(ind))
## calculate zero-crossings
for (i in 1:length(ind)) {
Y_subset <- Y[time >= time[ind[i]] & time < time[ind[i]] + windowSize]
testStat[i] <- sd(Y_subset)
}
## using threshold to determine eye movement intervals
predictedEvent <- which(testStat < thresholdEvents)
eventTimes <- timeMiddle[predictedEvent] # map back to the time of this
gaps <- which(diff(eventTimes) > windowSize )
## estimate event_time_interval
event_time_interval <- min(eventTimes)
for (i in 1:length(gaps)) {
event_time_interval <- append(event_time_interval,
c(eventTimes[gaps[i]],
eventTimes[gaps[i] + 1]))
}
event_time_interval <- append(event_time_interval, max(eventTimes))
event_time_interval <- matrix(event_time_interval, ncol = 2, byrow = TRUE)
predictedEventTimes <- rep(FALSE, length(Y))
for (i in 1:nrow(event_time_interval)) {
predictedEventTimes[event_time_interval[i, 1] <= time & event_time_interval[i, 2] >= time] <- TRUE
}
num_event <- length(gaps) + 1
return(list(num_event = num_event,
testStat = testStat,
eventTimes = eventTimes,
predictedEventTimes = predictedEventTimes,
predictedInterval = event_time_interval))
}
streaming_classifier2 = function(wave_file,predictedInterval,eventTimes) {
Y = wave_file@left
predict = c()
i <- 1
while(i < length(predictedInterval)/2+1)
{
lower_interval = predictedInterval[i,1]
upper_interval = predictedInterval[i,2]
interval = Y[which(eventTimes == lower_interval):which(eventTimes == upper_interval)]
testStat = sum(interval[1:(length(interval) - 1)] * interval[2:(length(interval))] <= 0)
predicted = LR_detection(interval)
predict = c(predict, predicted)
i <- i+1
}
return(predict)
}
j <- 1
total_accuracy_list_larger_dsr <- c()
while(j < length(wave_file_medium) + 1){
waveSeq = wave_file_medium[[j]]
timeSeq <- seq_len(length(waveSeq))/waveSeq@samp.rate
predictedInterval <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$predictedInterval
eventTimes <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$eventTimes
actual <- gsub("_z.wav", "", all_files_medium[j])
actual <- strsplit(actual, "")[[1]][1:3]
actual <- toString(actual)
actual <- gsub(', ', '', actual)
predict <- streaming_classifier2(waveSeq,predictedInterval,eventTimes)
predict <- toString(predict)
predict <- gsub(', ', '', predict)
accuracy <- stringdist(predict,actual, method="lv")
total_accuracy_list_larger_dsr <- append(total_accuracy_list_larger_dsr, accuracy)
j <- j+1
}
accuracy_larger_dsr <- mean(total_accuracy_list_larger_dsr)
accuracy_larger_dsr
## [1] 6.2
- Decrease downSampleRate from 50 to 10
eye_movement_ZC = function(Y, time,
windowSize = 0.5,
thresholdEvents = 200,
downSampleRate = 10) {
## down sample
ind = seq_len(which(time == round(time[length(time)] - windowSize, 4)) + 1)
ind = seq(1, ind[length(ind)], by = downSampleRate)
## time vector for middle of each window
timeMiddle <- time[ind] + windowSize/2
testStat = testStat1 = testStat2 = rep(NA, length(ind))
stat_event <- rep(NA, length(ind))
## calculate zero-crossings
for (i in 1:length(ind)) {
Y_subset <- Y[time >= time[ind[i]] & time < time[ind[i]] + windowSize]
testStat[i] <- sd(Y_subset)
#testStat1[i] = max(Y_subset)
#testStat2[i] = sum(Y_subset[1:(length(Y_subset) - 1)] * Y_subset[2:(length(Y_subset))] <= 0)
}
## using threshold to determine eye movement intervals
predictedEvent <- which(testStat < thresholdEvents)
eventTimes <- timeMiddle[predictedEvent] # map back to the time of this
gaps <- which(diff(eventTimes) > windowSize )
## estimate event_time_interval
event_time_interval <- min(eventTimes)
for (i in 1:length(gaps)) {
event_time_interval <- append(event_time_interval,
c(eventTimes[gaps[i]],
eventTimes[gaps[i] + 1]))
}
event_time_interval <- append(event_time_interval, max(eventTimes))
event_time_interval <- matrix(event_time_interval, ncol = 2, byrow = TRUE)
predictedEventTimes <- rep(FALSE, length(Y))
for (i in 1:nrow(event_time_interval)) {
predictedEventTimes[event_time_interval[i, 1] <= time & event_time_interval[i, 2] >= time] <- TRUE
}
num_event <- length(gaps) + 1
return(list(num_event = num_event,
testStat = testStat,
eventTimes = eventTimes,
predictedEventTimes = predictedEventTimes,
predictedInterval = event_time_interval))
}
streaming_classifier2 = function(wave_file,predictedInterval,eventTimes) {
Y = wave_file@left
predict = c()
i <- 1
while(i < length(predictedInterval)/2+1)
{
lower_interval = predictedInterval[i,1]
upper_interval = predictedInterval[i,2]
interval = Y[which(eventTimes == lower_interval):which(eventTimes == upper_interval)]
testStat = sum(interval[1:(length(interval) - 1)] * interval[2:(length(interval))] <= 0)
predicted = LR_detection(interval)
predict = c(predict, predicted)
i <- i+1
}
return(predict)
}
j <- 1
total_accuracy_list_smaller_dsr <- c()
while(j < length(wave_file_medium) + 1){
waveSeq = wave_file_medium[[j]]
timeSeq <- seq_len(length(waveSeq))/waveSeq@samp.rate
predictedInterval <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$predictedInterval
eventTimes <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$eventTimes
actual <- gsub("_z.wav", "", all_files_medium[j])
actual <- strsplit(actual, "")[[1]][1:3]
actual <- toString(actual)
actual <- gsub(', ', '', actual)
predict <- streaming_classifier2(waveSeq,predictedInterval,eventTimes)
predict <- toString(predict)
predict <- gsub(', ', '', predict)
accuracy <- stringdist(predict,actual, method="lv")
total_accuracy_list_smaller_dsr <- append(total_accuracy_list_smaller_dsr, accuracy)
j <- j+1
}
accuracy_smaller_dsr <- mean(total_accuracy_list_smaller_dsr)
accuracy_smaller_dsr
## [1] 6.2
# Create a bar plot for comparison, first defining the variables
data <- c(accuracy_normal, accuracy_larger_ws, accuracy_smaller_ws, accuracy_larger_te, accuracy_smaller_te, accuracy_larger_dsr, accuracy_smaller_dsr)
name <- c("normal", "larger window size", "smaller window size", "larger thresholdEvents", "smaller thresholdEvents", "larger downSampleRate", "smaller downSampleRate")
colours <- c("#d64e12", "#f9a52c", "#efdf48","#8bd346", "#60dbe8", "#16a4d8", "#9b5fe0")
bp <- barplot(data, col = colours, xlab = "Changed Parameter", ylab = "Accuracy", ylim = c(0, 12), main = "Accuracy Comparison Between Each Parameter")
# Add the accuracy value above each bar for clear observation
text(bp, data, data, pos = 3, cex = 0.8)
legend("topright", legend = name, fill = colours, cex = 0.6)By observation, the highest accuracy is 11, which belongs to smaller the window size, which refers to the number of instances that are used to train the classifier before making predictions on new incoming data, that concludes smaller window sizes model tend to result in higher accuracy due to the dynamic nature of the data. By training the classifier on a smaller subset of more recent data, it can better capture the current data distribution and trends, resulting in more accurate predictions and leads to become the best model.
d. Evaluation
For the best model selected in part c, evaluate its performance on sequences of varying lengths. Does the length of the sequence have an impact on the classification accuracy? Justify the answer with appropriate visualisations.
- Substitute sequences of length 8 into the smaller window size classifier
eye_movement_ZC = function(Y, time,
windowSize = 0.2,
thresholdEvents = 200,
downSampleRate = 50) {
## down sample
ind = seq_len(which(time == round(time[length(time)] - windowSize, 4)) + 1)
ind = seq(1, ind[length(ind)], by = downSampleRate)
## time vector for middle of each window
timeMiddle <- time[ind] + windowSize/2
testStat = testStat1 = testStat2 = rep(NA, length(ind))
stat_event <- rep(NA, length(ind))
## calculate zero-crossings
for (i in 1:length(ind)) {
Y_subset <- Y[time >= time[ind[i]] & time < time[ind[i]] + windowSize]
testStat[i] <- sd(Y_subset)
}
## using threshold to determine eye movement intervals
predictedEvent <- which(testStat < thresholdEvents)
eventTimes <- timeMiddle[predictedEvent] # map back to the time of this
gaps <- which(diff(eventTimes) > windowSize )
## estimate event_time_interval
event_time_interval <- min(eventTimes)
for (i in 1:length(gaps)) {
event_time_interval <- append(event_time_interval,
c(eventTimes[gaps[i]],
eventTimes[gaps[i] + 1]))
}
event_time_interval <- append(event_time_interval, max(eventTimes))
event_time_interval <- matrix(event_time_interval, ncol = 2, byrow = TRUE)
predictedEventTimes <- rep(FALSE, length(Y))
for (i in 1:nrow(event_time_interval)) {
predictedEventTimes[event_time_interval[i, 1] <= time & event_time_interval[i, 2] >= time] <- TRUE
}
num_event <- length(gaps) + 1
return(list(num_event = num_event,
testStat = testStat,
eventTimes = eventTimes,
predictedEventTimes = predictedEventTimes,
predictedInterval = event_time_interval))
}
streaming_classifier2 = function(wave_file,predictedInterval,eventTimes) {
Y = wave_file@left
predict = c()
i <- 1
while(i < length(predictedInterval)/2+1)
{
lower_interval = predictedInterval[i,1]
upper_interval = predictedInterval[i,2]
interval = Y[which(eventTimes == lower_interval):which(eventTimes == upper_interval)]
testStat = sum(interval[1:(length(interval) - 1)] * interval[2:(length(interval))] <= 0)
predicted = LR_detection(interval)
predict = c(predict, predicted)
i <- i+1
}
return(predict)
}
j <- 1
total_accuracy_list_smaller_ws <- c()
while(j < length(wave_file_medium) + 1){
waveSeq = wave_file_medium[[j]]
timeSeq <- seq_len(length(waveSeq))/waveSeq@samp.rate
predictedInterval <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$predictedInterval
eventTimes <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$eventTimes
actual <- gsub("_z.wav", "", all_files_medium[j])
actual <- strsplit(actual, "")[[1]][1:8]
actual <- toString(actual)
actual <- gsub(', ', '', actual)
predict <- streaming_classifier2(waveSeq,predictedInterval,eventTimes)
predict <- toString(predict)
predict <- gsub(', ', '', predict)
accuracy <- stringdist(predict,actual, method="lv")
total_accuracy_list_smaller_ws <- append(total_accuracy_list_smaller_ws, accuracy)
j <- j+1
}
length_8_accuracy_smaller_ws <- mean(total_accuracy_list_smaller_ws)
length_8_accuracy_smaller_ws
## [1] 6.8
- Substitute sequences of length 20 into the smaller window size classifier
eye_movement_ZC = function(Y, time,
windowSize = 0.2,
thresholdEvents = 200,
downSampleRate = 50) {
## down sample
ind = seq_len(which(time == round(time[length(time)] - windowSize, 4)) + 1)
ind = seq(1, ind[length(ind)], by = downSampleRate)
## time vector for middle of each window
timeMiddle <- time[ind] + windowSize/2
testStat = testStat1 = testStat2 = rep(NA, length(ind))
stat_event <- rep(NA, length(ind))
## calculate zero-crossings
for (i in 1:length(ind)) {
Y_subset <- Y[time >= time[ind[i]] & time < time[ind[i]] + windowSize]
testStat[i] <- sd(Y_subset)
}
## using threshold to determine eye movement intervals
predictedEvent <- which(testStat < thresholdEvents)
eventTimes <- timeMiddle[predictedEvent] # map back to the time of this
gaps <- which(diff(eventTimes) > windowSize )
## estimate event_time_interval
event_time_interval <- min(eventTimes)
for (i in 1:length(gaps)) {
event_time_interval <- append(event_time_interval,
c(eventTimes[gaps[i]],
eventTimes[gaps[i] + 1]))
}
event_time_interval <- append(event_time_interval, max(eventTimes))
event_time_interval <- matrix(event_time_interval, ncol = 2, byrow = TRUE)
predictedEventTimes <- rep(FALSE, length(Y))
for (i in 1:nrow(event_time_interval)) {
predictedEventTimes[event_time_interval[i, 1] <= time & event_time_interval[i, 2] >= time] <- TRUE
}
num_event <- length(gaps) + 1
return(list(num_event = num_event,
testStat = testStat,
eventTimes = eventTimes,
predictedEventTimes = predictedEventTimes,
predictedInterval = event_time_interval))
}
streaming_classifier2 = function(wave_file,predictedInterval,eventTimes) {
Y = wave_file@left
predict = c()
i <- 1
while(i < length(predictedInterval)/2+1)
{
lower_interval = predictedInterval[i,1]
upper_interval = predictedInterval[i,2]
interval = Y[which(eventTimes == lower_interval):which(eventTimes == upper_interval)]
testStat = sum(interval[1:(length(interval) - 1)] * interval[2:(length(interval))] <= 0)
predicted = LR_detection(interval)
predict = c(predict, predicted)
i <- i+1
}
return(predict)
}
j <- 1
total_accuracy_list_smaller_ws <- c()
while(j < length(wave_file_medium) + 1){
waveSeq = wave_file_medium[[j]]
timeSeq <- seq_len(length(waveSeq))/waveSeq@samp.rate
predictedInterval <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$predictedInterval
eventTimes <- eye_movement_ZC(Y = waveSeq@left, time = timeSeq)$eventTimes
actual <- gsub("_z.wav", "", all_files_medium[j])
actual <- strsplit(actual, "")[[1]][1:20]
actual <- toString(actual)
actual <- gsub(', ', '', actual)
predict <- streaming_classifier2(waveSeq,predictedInterval,eventTimes)
predict <- toString(predict)
predict <- gsub(', ', '', predict)
accuracy <- stringdist(predict,actual, method="lv")
total_accuracy_list_smaller_ws <- append(total_accuracy_list_smaller_ws, accuracy)
j <- j+1
}
length_20_accuracy_smaller_ws <- mean(total_accuracy_list_smaller_ws)
length_20_accuracy_smaller_ws
## [1] 20
# Create a bar plot for comparison, first defining the variables
data <- c(accuracy_smaller_ws, length_8_accuracy_smaller_ws, length_20_accuracy_smaller_ws)
name <- c("Length 3", "Length 8", "Length 20")
colours <- c("#d64e12", "#efdf48","#8bd346")
bp <- barplot(data, col = colours, xlab = "Sequence Length", ylab = "Accuracy", ylim = c(0, 22), main = "Accuracy Comparison Between Each Sequence Length")
# Add the accuracy value above each bar for clear observation
text(bp, data, data, pos = 3, cex = 0.8)
legend("topleft", legend = name, fill = colours, cex = 0.6)The impact of sequence length on classification accuracy depends on the application and classifier being used. Longer sequences may provide more information and context, improving classification accuracy. However, longer sequences may require more computational resources and may be subject to issues such as over-fitting. The impact of sequence length on classification accuracy is context-dependent, requiring careful consideration of the application and the type of classifier being used.