在 Numerosity 實驗中,共有 6 種刺激類型,每種類型進行 25 次 MEG 和 fMRI 掃腦作業。目前已將 pilot_05 受試者的 MEG 資料,透過 brainstorm 進行資料的前處理。得到的結果如下圖 1 所示,每筆資料分別都會有 306 個 MEG channels,從刺激出現的 -100 ms 到 +1500 ms 的腦波大小之矩陣資料,共 150 筆(6 種刺激 * 25 次)。
在傳統 MEG 分析實驗上,會將同種刺激下的 trails,對其刺激發生的時間點後,將 MEG 的波進行平均(預期將雜訊給平均掉,凸顯出有效的波段),看在多少毫秒時,那些 channels、甚至回推到何處腦區會有特別反應,結果如下圖 2 所示:
可發現刺激 1 到刺激 6,約在刺激發生後的 140 ~ 180 ms間,有2個平均後振幅較大的波出現!而在約 500 ms 後便是按鍵反應的區段,且沒有較大的波動,因此在上圖中沒有將此時間區段放入在內。
載入相關套件
library(tidyverse) # for data cleaning and plot
library(e1071) # for SVM
透過 BrainStorm 將 MEG 資料輸出成 csv 或 txt 格式,由 R 來開始分析。
然而 BrainStorm 在輸出時會有些問題,csv 和 txt 格式會混者使用,因此只好自己將分隔符號(comma or space)自己給拆解。
subjectFolder <- "~/Documents/R/CDG/pilot_5"
folderNames <- dir(subjectFolder)
# function of reading MEG data that is csv or txt format
read.MEGdata <- function(file) {
.file <- readLines(file)
.data <- lapply(.file, function(.line) {
.split <- strsplit(.line, split = "[, ]+") %>% unlist() %>% as.numeric() %>% tail(-1)
return(.split)
})
meg <- do.call(rbind, .data)
}
MEG <- lapply(folderNames, function(session){
fileNames <- file.path(session, dir(session))
data <- lapply(fileNames, function(filename){
. <- read.MEGdata(filename)
megChannel <- .[13:318, ]
return(megChannel)
})
names(data) <- fileNames
return(data)
})
MEG <- do.call(c, MEG)
saveRDS(MEG, file = file.path(subjectFolder, "MEG.rds"))
由於檔案非常的大,因此我已將MEG統一好的原始資料資料先儲存 R 的 rds 檔,方便之後使用。
MEG <- readRDS(file.path(subjectFolder, "MEG.rds"))
MEG 為長度 150 (6 種刺激 * 25 次 trials)的 list。每個元素的名字為該次 trial 的編號
length(MEG)
## [1] 150
names(MEG) %>% head()
## [1] "num_1/16_trial001.txt" "num_1/16_trial002.txt" "num_1/16_trial003.txt"
## [4] "num_1/16_trial004.txt" "num_1/16_trial005.txt" "num_1/16_trial006.txt"
而每一筆的資料形式為 306*1601 matrix 儲存,如下顯示(在此只展示 20*20):
dim(MEG[[1]])
## [1] 306 1601
DT::datatable(MEG[[1]][1:20, 1:20],
rownames = paste0("Channel_", 1:20),
colnames = paste(-100:-81, "(ms)"))
其中,row 共有 306 列,表示 306 個 MEG channel; column 共有 1601 欄,表示取樣的時間點,從刺激出現的 -100 ms到 +1500 ms,1 ms/次的取樣頻率。在使用降維之前,我先將取樣的時間點做些整理。
在此我先關注 -100 ms 到 +500 ms 的時間範圍,並以長度為 50 ms (1 個 epoch)的平均值作為新的資料點
# set time range and average
timeRange <- c(-100, 500)
timeCut <- seq(timeRange[1], timeRange[2], 50)
timeGroup <- cut(timeRange[1]:timeRange[2], breaks = timeCut, include.lowest = TRUE, right = FALSE)
timePoint <- seq(-100, 1500, 1)
neuralChannel <- read.table("neuromag_channels.txt", stringsAsFactors = FALSE)
MEG_epoch <- lapply(MEG, function(meg){
.a <- data.frame(meg)
colnames(.a) <- timePoint
# select target time-range and group by epoch
.a <- .a %>%
select(as.character(timeRange[1]:timeRange[2])) %>%
mutate(channel = neuralChannel$V1) %>%
gather(key = "time", value = "amplitude", -channel) %>%
mutate(timeGroup = rep(timeGroup, times = 306))
# average each channel in each epoch
.b <- .a%>%
group_by(channel, timeGroup) %>%
summarise(ave_amplitude = mean(amplitude)) %>%
ungroup() %>%
spread(key = timeGroup, value = ave_amplitude)
# remove channels' name than transport the matrix
.c <- .b %>%
select(-channel) %>%
t()
return(.c)
})
## Warning: The `printer` argument is deprecated as of rlang 0.3.0.
## This warning is displayed once per session.
因此如同下表所示,縱軸為時間區段,橫軸為 channel(這邊與原先的資料格式)。我將 50 ms 作為一個區段,並將每個 channel 下取平均值,因此資料維度變 12*306。
dim(MEG_epoch[[1]])
## [1] 12 306
DT::datatable(MEG_epoch[[1]][, 1:20],
colnames = paste0("Channel_", 1:20))
經過上次和老師老論過,將要做 PCA 的資料應該是同時將所有 150 個 trial 結合在一起後再使用 PCA 降維,這樣才是確保所有的 components 在不同 trials 中是一致的。
# combine all trial
MEG_epoch_df <- do.call(rbind, MEG_epoch)
dim(MEG_epoch_df) # [12(epoch)*150(trials), 306(channels)]
## [1] 1800 306
DT::datatable(MEG_epoch_df[1:20, 1:20],
colnames = paste0("Channel_", 1:20))
接下來便使用 PCA 來降維,在這邊我有做 “center” 和 “scale”,其中一筆資料的結果如下:
MEG_pca <- prcomp(MEG_epoch_df, center = TRUE, scale. = TRUE)
顯示 306 個 channel (上方只有顯示 6 個 channels 代表)對應到新的 306 個 components 之 loading。
求出每個主成份的特徵值(也就是 $eigenvalue = variance = std^2 $)以及累加每個主成份的解釋比例(aggregated effects)
eigen_value <- data.frame(factor = 1:length(MEG_pca$sdev),
eigen_value = MEG_pca$sdev^2,
eigen_value_cum = cumsum(MEG_pca$sdev^2) / sum(MEG_pca$sdev^2))
eigen_plot <- ggplot(eigen_value, aes(x = factor, y = eigen_value)) +
geom_line() + geom_point() +
geom_hline(yintercept = 1, color = "red") +
geom_vline(xintercept = 47, color = "blue", linetype = "dotted") +
coord_cartesian(xlim = c(1, 50)) +
theme_bw()
eigen_cum_plot <- ggplot(eigen_value, aes(x = factor, y = eigen_value_cum)) +
geom_line() + geom_point() +
geom_vline(xintercept = 47, color = "blue", linetype = "dotted") +
coord_cartesian(xlim = c(1, 50)) +
theme_bw()
gridExtra::grid.arrange(eigen_plot, eigen_cum_plot)
而新的 componenet 下所對應的 data 為(這邊只顯示 20*20):
MEG_pca_data <- MEG_pca$x
DT::datatable(MEG_pca_data[1:20, 1:20] %>% round(digits = 3))
在做 SVM 前必須將每個變數給先標準化,再去建立模型。
在這邊的步驟是把所有的 PCA 後的 data (所有 trials 且所有 epochs),分別對自己的 component 標準化,目的是為了讓所有的 trials 都是同一種主成份們
MEG_pca_data <- apply(MEG_pca_data, 2, scale)
這邊先獲取每個 trial 的 label。在 numerosity 的實驗中,刺激為 1~6 個點數,在這邊維持 MEG 收集刺激事件的標記,用 2 的幂次項 (21~26) 來標記每個 trial 的 label 是什麼。
取 PC1 ~ PC47,並加入刺激標記
PCend <- 47
timeSections <- rownames(MEG_epoch[[1]])
epochNumber <- length(timeSections)
trialNames <- names(MEG_epoch)
trialNumber <- length(trialNames)
# get label
.location <- regexpr("/.*_", trialNames)
.start <- .location + 1
.stop <- .location + attr(.location, "match.length") - 2
label <- substr(trialNames, .start, .stop) %>% as.factor()
MEG_pca_df <- MEG_pca_data[, 1:PCend] %>%
as.data.frame() %>%
add_column(Trial = rep(trialNames, each = epochNumber),
Label = rep(label, each = epochNumber),
Epoch = rep(timeSections, times = trialNumber))
DT::datatable(MEG_pca_df[1:15, c(1:3, 46:50)])
接下來,我們從每種刺激的 20 筆資料作為訓練集,剩下 5 筆作為測驗集
set.seed(408)
train_data <- vector(mode = "list") # the list is composed of each epoch
test_data <- vector(mode = "list")
for(timeSection in timeSections){
epoch_data <- MEG_pca_df %>% filter(Epoch == timeSection)
train_data[[timeSection]] <- epoch_data %>%
group_by(Label) %>%
sample_n(size = 20) %>%
ungroup() %>%
select(-Trial, -Epoch)
test_data[[timeSection]] <- epoch_data %>%
select(-Trial, -Epoch) %>%
setdiff(train_data[[timeSection]])
}
訓練集中的 120 筆資料(有 6 種不同刺激類別),建立 SVM 的模型,並用測試集來檢測該模型的預測準確率。
需要注意的是,我在這邊的 svm() 中的參數 scale = FALSE 表示不用再將訓練集資料標準化了,因為我們在前面步驟已經做過了!
MEG_SVM <- lapply(timeSections, function(timeSection){
.train_data = train_data[[timeSection]]
.test_data = test_data[[timeSection]]
## SVM
.model <- svm(formula = Label ~ .,
data = .train_data,
scale = FALSE)
.train_predict <- predict(.model, .train_data)
.test_predict <- predict(.model, .test_data)
.train_predMat <- table(real = .train_data$Label, predict = .train_predict)
.test_predMat <- table(real = .test_data$Label, predict = .test_predict)
.train_predAcc <- sum(diag(.train_predMat)) / sum(.train_predMat)
.test_predAcc <- sum(diag(.test_predMat)) / sum(.test_predMat)
svm <-list(model = .model, train_predict = .train_predict, test_predict = .test_predict,
train_predMat = .train_predMat, test_predMat = .test_predMat,
train_predAcc = .train_predAcc, test_predAcc = .test_predAcc)
})
每個 epoch 都有各自的 SVM 模型和其在訓練集預測集準確率表現,如下
MEG_SVM[[1]]
## $model
##
## Call:
## svm(formula = Label ~ ., data = .train_data, scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 120
##
##
## $train_predict
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16
## 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## 16 16 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## 2 2 2 2 32 32 32 32 32 32 32 32 32 32 32 32 32 32
## 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## 32 32 32 32 32 32 4 4 4 4 4 4 4 4 4 4 4 4
## 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## 4 4 4 4 4 4 4 4 64 64 64 64 64 64 64 64 64 64
## 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
## 64 64 64 64 64 64 64 64 64 64 8 8 8 8 8 8 8 8
## 109 110 111 112 113 114 115 116 117 118 119 120
## 8 8 8 8 8 8 8 8 8 8 8 8
## Levels: 16 2 32 4 64 8
##
## $test_predict
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## 32 8 8 2 32 4 32 8 32 16 32 8 8 8 64 8 32 16 2 4 2 64 8 16 16
## 26 27 28 29 30
## 16 32 64 32 2
## Levels: 16 2 32 4 64 8
##
## $train_predMat
## predict
## real 16 2 32 4 64 8
## 16 20 0 0 0 0 0
## 2 0 20 0 0 0 0
## 32 0 0 20 0 0 0
## 4 0 0 0 20 0 0
## 64 0 0 0 0 20 0
## 8 0 0 0 0 0 20
##
## $test_predMat
## predict
## real 16 2 32 4 64 8
## 16 1 0 2 0 0 2
## 2 0 2 1 2 0 0
## 32 0 1 2 0 1 1
## 4 3 0 1 0 0 1
## 64 1 0 1 0 1 2
## 8 0 1 1 0 1 2
##
## $train_predAcc
## [1] 1
##
## $test_predAcc
## [1] 0.2666667
最終的結果如下圖所示:
train_predAcc <- sapply(MEG_SVM, "[[", "train_predAcc")
test_predAcc <- sapply(MEG_SVM, "[[", "test_predAcc")
performance <- data.frame(time = factor(timeSections, levels = timeSections),
train_data = train_predAcc, test_data = test_predAcc) %>%
gather(key = data_set, value = predict_accuracy, -time)
ggplot(performance, aes(x = time, y = predict_accuracy, group = data_set)) +
geom_histogram(stat= "identity") +
facet_wrap(~ data_set) +
coord_cartesian(ylim = c(0, 1))
## Warning: Ignoring unknown parameters: binwidth, bins, pad
右圖是原本訓練集建立模型下,不管在哪個 epoch 的準確率皆能 100 % 。
而左圖為測試集的預測準確率,除了 epoch 在 [100, 150) 內有達到 30 % 的預測率,但其他幾乎在 0.2 以下,根本上是用猜的。
雖然這次的結果比上次報告時候要好,但還是有點差強人意,接下來可能的做法就是修改 SVM 的參數,看看模型是否能改善。
tune_par <- data.frame("cost" = c(), "gamma" = c())
for(timeSection in timeSections){
.train_data = train_data[[timeSection]]
MEG_SVM_tune = tune(svm,
Label ~ .,
data = .train_data,
scale = FALSE,
kernel = "radial", # RBF kernel function
range = list(cost = 10^(-1:2), gamma = c(.5,1,2))# 調參數的最主要一行
)
plot(MEG_SVM_tune)
tune_par <- rbind(tune_par, MEG_SVM_tune$best.parameters)
}
rownames(tune_par) <- timeSections
tune_par
## cost gamma
## [-100,-50) 10.0 2.0
## [-50,0) 10.0 0.5
## [0,50) 0.1 0.5
## [50,100) 0.1 0.5
## [100,150) 0.1 1.0
## [150,200) 1.0 0.5
## [200,250) 0.1 1.0
## [250,300) 10.0 0.5
## [300,350) 0.1 0.5
## [350,400) 0.1 0.5
## [400,450) 0.1 0.5
## [450,500] 0.1 1.0
如果分別對上述的最佳參數,對每個 epoch 重新建立自己的 SVM 模型
MEG_SVM_best <- lapply(timeSections, function(timeSection){
.train_data = train_data[[timeSection]]
.test_data = test_data[[timeSection]]
## SVM
.model <- svm(formula = Label ~ .,
data = .train_data,
scale = FALSE,
cost = tune_par[timeSection, "cost"],
gamma = tune_par[timeSection, "gamma"]) # best parameters from grid search
.train_predict <- predict(.model, .train_data)
.test_predict <- predict(.model, .test_data)
.train_predMat <- table(real = .train_data$Label, predict = .train_predict)
.test_predMat <- table(real = .test_data$Label, predict = .test_predict)
.train_predAcc <- sum(diag(.train_predMat)) / sum(.train_predMat)
.test_predAcc <- sum(diag(.test_predMat)) / sum(.test_predMat)
svm <-list(model = .model, train_predict = .train_predict, test_predict = .test_predict,
train_predMat = .train_predMat, test_predMat = .test_predMat,
train_predAcc = .train_predAcc, test_predAcc = .test_predAcc)
})
train_predAcc_best <- sapply(MEG_SVM_best, "[[", "train_predAcc")
test_predAcc_best <- sapply(MEG_SVM_best, "[[", "test_predAcc")
performance_best <- data.frame(time = factor(timeSections, levels = timeSections),
train_data = train_predAcc_best, test_data = test_predAcc_best) %>%
gather(key = data_set, value = predict_accuracy, -time)
ggplot(performance_best, aes(x = time, y = predict_accuracy, group = data_set)) +
geom_histogram(stat= "identity") +
facet_wrap(~ data_set) +
coord_cartesian(ylim = c(0, 1))
## Warning: Ignoring unknown parameters: binwidth, bins, pad
恩…反而感覺變差了些 QQ
SVM classification plot
plot(MEG_SVM_best[[6]]$model, data = train_data[[6]], formula = PC1 ~ PC2)
MEG_epoch_df_scale <- MEG_epoch_df %>%
apply(2, scale) %>%
as.data.frame() %>%
add_column(Trial = rep(trialNames, each = epochNumber),
Label = rep(label, each = epochNumber),
Epoch = rep(timeSections, times = trialNumber))
train_data <- vector(mode = "list") # the list is composed of each epoch
test_data <- vector(mode = "list")
for(timeSection in timeSections){
epoch_data <- MEG_epoch_df_scale %>% filter(Epoch == timeSection)
train_data[[timeSection]] <- epoch_data %>%
group_by(Label) %>%
sample_n(size = 20) %>%
ungroup() %>%
select(-Trial, -Epoch)
test_data[[timeSection]] <- epoch_data %>%
select(-Trial, -Epoch) %>%
setdiff(train_data[[timeSection]])
}
MEG_SVM <- lapply(timeSections, function(timeSection){
.train_data = train_data[[timeSection]]
.test_data = test_data[[timeSection]]
## SVM
.model <- svm(formula = Label ~ .,
data = .train_data,
scale = FALSE)
.train_predict <- predict(.model, .train_data)
.test_predict <- predict(.model, .test_data)
.train_predMat <- table(real = .train_data$Label, predict = .train_predict)
.test_predMat <- table(real = .test_data$Label, predict = .test_predict)
.train_predAcc <- sum(diag(.train_predMat)) / sum(.train_predMat)
.test_predAcc <- sum(diag(.test_predMat)) / sum(.test_predMat)
svm <-list(model = .model, train_predict = .train_predict, test_predict = .test_predict,
train_predMat = .train_predMat, test_predMat = .test_predMat,
train_predAcc = .train_predAcc, test_predAcc = .test_predAcc)
})
train_predAcc <- sapply(MEG_SVM, "[[", "train_predAcc")
test_predAcc <- sapply(MEG_SVM, "[[", "test_predAcc")
performance <- data.frame(time = factor(timeSections, levels = timeSections),
train_data = train_predAcc, test_data = test_predAcc) %>%
gather(key = data_set, value = predict_accuracy, -time)
ggplot(performance, aes(x = time, y = predict_accuracy, group = data_set)) +
geom_histogram(stat= "identity") +
facet_wrap(~ data_set) +
coord_cartesian(ylim = c(0, 1))
## Warning: Ignoring unknown parameters: binwidth, bins, pad
tune_par <- data.frame("cost" = c(), "gamma" = c())
for(timeSection in timeSections){
.train_data = train_data[[timeSection]]
MEG_SVM_tune = tune(svm,
Label ~ .,
data = .train_data,
scale = FALSE,
kernel = "radial", # RBF kernel function
range = list(cost = 10^(-1:2), gamma = c(.5,1,2))# 調參數的最主要一行
)
plot(MEG_SVM_tune)
tune_par <- rbind(tune_par, MEG_SVM_tune$best.parameters)
}
rownames(tune_par) <- timeSections
tune_par
## cost gamma
## [-100,-50) 0.1 0.5
## [-50,0) 0.1 2.0
## [0,50) 0.1 0.5
## [50,100) 0.1 0.5
## [100,150) 0.1 0.5
## [150,200) 0.1 0.5
## [200,250) 0.1 2.0
## [250,300) 0.1 2.0
## [300,350) 0.1 0.5
## [350,400) 0.1 0.5
## [400,450) 0.1 0.5
## [450,500] 0.1 0.5
MEG_SVM_best <- lapply(timeSections, function(timeSection){
.train_data = train_data[[timeSection]]
.test_data = test_data[[timeSection]]
## SVM
.model <- svm(formula = Label ~ .,
data = .train_data,
scale = FALSE,
cost = tune_par[timeSection, "cost"],
gamma = tune_par[timeSection, "gamma"]) # best parameters from grid search
.train_predict <- predict(.model, .train_data)
.test_predict <- predict(.model, .test_data)
.train_predMat <- table(real = .train_data$Label, predict = .train_predict)
.test_predMat <- table(real = .test_data$Label, predict = .test_predict)
.train_predAcc <- sum(diag(.train_predMat)) / sum(.train_predMat)
.test_predAcc <- sum(diag(.test_predMat)) / sum(.test_predMat)
svm <-list(model = .model, train_predict = .train_predict, test_predict = .test_predict,
train_predMat = .train_predMat, test_predMat = .test_predMat,
train_predAcc = .train_predAcc, test_predAcc = .test_predAcc)
})
train_predAcc_best <- sapply(MEG_SVM_best, "[[", "train_predAcc")
test_predAcc_best <- sapply(MEG_SVM_best, "[[", "test_predAcc")
performance_best <- data.frame(time = factor(timeSections, levels = timeSections),
train_data = train_predAcc_best, test_data = test_predAcc_best) %>%
gather(key = data_set, value = predict_accuracy, -time)
ggplot(performance_best, aes(x = time, y = predict_accuracy, group = data_set)) +
geom_histogram(stat= "identity") +
facet_wrap(~ data_set) +
coord_cartesian(ylim = c(0, 1))
## Warning: Ignoring unknown parameters: binwidth, bins, pad