モデリング対象種のデータ収集と設計(7.1,7.2,7.3)

7 モデリング対象種のデータ収集と設計

生息適地モデリングで利用可能ないくつかの大規模データベースと、Rでの利用方法

7.1 既存のデータおよびデータベース

gbif

巨大データベースの注意点

種同定の不確実性 (触れない)
採集場所の位置精度の不確実性 (8章)
設計の欠如 (7.4?)
真の種分布の不完全性または偏った補足(8.2)
採集地点間の空間自己相関(7.2)
サンプルサイズ(不在データの欠如)

7.2 空間自己相関と疑似反復

空間自己相関：地理的空間で得られたデータにおける空間構造や空間パターン

ある地域における事象が、周辺の他の地域における事象の影響を受けて相互作用が発生する場合、空間的自己相関があるという (wikipedia)

空間上で測定された値の分布はランダムではなく、空間構造を持っている。標準的な統計手法のほとんどで、データに空間自己相関が無いことを前提にしているので、気をつけないといけません。

疑似反復：同じ個体を2回カウントしてしまう

生息適地モデルの目的：環境予測変数で、生物の空間分布パターンを予測したいモデル当てはめ後の残差に空間構造が残っている → 環境予測変数だけで空間分布パターンが説明されきっていないモデルを当てはめたあとの、残差の空間自己相関を検証する必要がある

注意空間自己相関の事後的な補正は、未解決の問題である 1. 自由度をどうやって補正するか 2. 空間自己相関の指標からどうやってモデルパラメータを最前に補正するか

空間自己相関の検証方法

MoranのI検定：すべての観測点間の距離行列を求め、残差に対する距離効果を検定する。

注意

read.tableは、データの1行目のカラム数が、データより1つだけ少ないときに、データの1行目をrownameとして扱うそうです。

header: a logical value indicating whether the file contains the names of the variables as its first line. If missing, the value is determined from the file format: ‘header’ is set to ‘TRUE’ if and only if the first row contains one fewer field than the number of columns.

しかし、rownamesは本質的にただの文字列カラムであるにもかかわらず、無駄に通常のカラムとは異なった動きをするため、tidyverseでは避けるべきとされています。

Generally, it is best to avoid row names, because they are basically a character column with different semantics than every other column. (https://tibble.tidyverse.org/reference/rownames.html より)

以下ではアカギツネで使ったスッテプワイズ最適化したGLMを検証する。

library(ape)
xy <- select(pts_cal, X_WGS84:Y_WGS84)
dists <- as.matrix(dist(xy))
dists_inv <- 1/dists
diag(dists_inv) <- 0
Moran.I(vulpes_step$residuals, dists_inv)

## $observed
## [1] 0.01506913
## 
## $expected
## [1] -0.0001695203
## 
## $sd
## [1] 0.0003468404
## 
## $p.value
## [1] 0

↑p値が0なので、空間自己相関があることがわかる。以下では、残差の空間分布をプロットしてる。短かい距離間での相関が強い(?)

rsd <- vulpes_step$residuals
rnd <- sample(1:length(rsd), 500, replace = TRUE)
spat_cor <- ncf::correlog(xy$X_WGS84[rnd], xy$Y_WGS84[rnd], rsd[rnd], increment = 2, resamp = 10)

par(mfcol = c(1,2))
plot(spat_cor$mean.of.class, spat_cor$correlation,
  pch = 16,
  col = "firebrick3",
  ylab = "Correlation",
  xlab = "Distance class",
  main = "Spatial Correlogram",
  font.lab = 2)
lines(spat_cor$mean.of.class, spat_cor$correlation, col = "firebrick3")

plot(xy[order(rsd),], pch = 15, col = rev(heat.colors(length(rsd))),
  cex = .3,
  main = "Residuals from Vulpes vulpes GLM ",
  xlab = "Latitude",
  ylab = "Longitude", font.lab = 2)

空間自己相関の影響を弱める方法:

解析に利用する各変数が、空間自己相関にどのように影響するかを調べるなど

7.3 サンプルサイズ、在データ率、サンプル精度

モデル当てはめには、十分な観測データ(サンプルサイズ)が必要である。在データ数が30以下の場合に、モデルの精度評価が顕著に悪化するらしい。少ないデータ数で構築したモデルは、解析範囲と異なる場所に使おうとすると無理なことが多い。

以下では全球のアカギツネデータセットとWorldClimの3つの予測変数を用いて、モデルの質に対するサンプルサイズと在データ率の影響を調べる。

library(ecospat)
library(PresenceAbsence)


vulvul_pa <- pts_cal_ovl
vulvul_p <- vulvul_pa %>% filter(VulpesVulpes == 1)
vulvul_a <- vulvul_pa %>% filter(VulpesVulpes == 0)

yb1 <- c(10, 20, 40, 60, 80, 100, 125, 150, 200, 250, 300, 400, 600, 800, 1000, 1500, length(vulvul_p$VulpesVulpes))


## 在データ率
result <- map_dfr(yb1, function(n) {
  ## random sampled data
  paok <- sample_n(vulvul_p, n) %>% bind_rows(vulvul_a)
  ## full model
  paok1f <- glm(VulpesVulpes~bio3+I(bio3^2)+bio7+I(bio7^2)+bio11+I(bio11^2)+bio12+I(bio12^2),
    family = "binomial",
    data = paok)
  paok0f <- predict(paok1f, paok, type = "response")

  ## step wise model
  paok1s <- step(paok1f, direction = "both", trace = F)
  paok0s <- predict(paok1s, paok, type = "response")

  ## cross validation
  paok1x <- ecospat.cv.glm(paok1s)

  ## get values
  list(
    "adjD2.f" = ecospat.adj.D2.glm(paok1f),
    "adjD2.s" =  ecospat.adj.D2.glm(paok1s),
    "AUC.f" = auc(data.frame(1:length(paok0f), paok$VulpesVulpes, paok0f))$AUC,
    "AUC.s" = auc(data.frame(1:length(paok0s), paok$VulpesVulpes, paok0s))$AUC,
    "AUC.x" = auc(data.frame(1:length(paok0s), paok$VulpesVulpes, paok1x$predictions))$AUC,
    "Kappa.f" = ecospat.max.kappa(paok0f, paok$VulpesVulpes)$max.Kappa,
    "Kappa.s" = ecospat.max.kappa(paok0s, paok$VulpesVulpes)$max.Kappa,
    "Kappa.x" = ecospat.max.kappa(paok1x$predictions, paok$VulpesVulpes)$max.Kappa,
    "TSS.f" = ecospat.max.tss(paok0f, paok$VulpesVulpes)$max.TSS,
    "TSS.s" = ecospat.max.tss(paok0s, paok$VulpesVulpes)$max.TSS,
    "TSS.x" = ecospat.max.tss(paok1x$predictions, paok$VulpesVulpes)$max.TSS,
    "Prev" = n/(n+length(vulvul_a$VulpesVulpes))
  )
})

なんか図がおかしくなりました。

ggplot(gather(result, "stat", "value", -Prev), aes(x = Prev)) + geom_line(aes(y = value, color = stat))

AUCはサンプルサイズや在データ率の不足に関して、過度に楽観的な予測を与えてしまう恐れがある。