在犯罪現場,盜賊可能在犯罪過程破壞玻璃、觸碰玻璃、用玻璃突擊…,在調查案件中,玻璃碎片是一個很重要的法醫科學,因為玻璃碎片可能會掉入犯人的衣著或身上。
It is easy to imagine a crime scene with glass fragments: a burglar may have broken a glass door, a glass bottle could have been used in an assault, or a domestic disturbance may involve throwing something through a window. During the commission of a crime, there are many ways that glass can break and be transferred from the scene. The study of glass fragments is important to forensic science because the glass broken at the scene can transfer to the perpetrator’s shoes, clothing, or even their hair (Curran, Hicks, and Buckleton 2000).
| 英單 |
|---|
| burglar 盜賊 |
| forensic science 法醫學 |
在犯罪現場調查中收集玻璃碎片是收集證據的一部份,玻璃碎片會送去檢驗,如果在嫌疑人的身上玻璃碎片(證據),就會被送去檢驗比對,接者就會分析犯罪現場的玻璃碎片及嫌犯的玻璃是否來自同個玻璃。
Crime scene investigators collect fragments of glass at the scene as a part of the evidence collection process, and the fragments are sent to the forensic science lab for processing. Similarly, evidence such as clothing and shoes are collected from a suspect, and if glass is found, the fragments are sent to the lab and compared to the fragments found at the scene.
| 英單 |
|---|
| suspect 嫌疑人 |
在法醫分析中,比對兩片玻璃碎片,有兩個層級的問題 :
1.activity level(活動層面)> 為何嫌疑會與玻璃碎片有關連。
2.offense level (罪刑層面)> 如果嫌犯就是犯罪犯。
The question that the analyst usually tries to answer is, “Did these glass fragments come from the same source?” This is a source level question, meaning that the comparison of the fragments will only tell the investigators whether or not the fragments from the suspect and the fragments from the scene have the same origin. As discussed in Section 1.3, the forensic analysis will not inform investigators how the suspect came into contact (activity level) with the glass or if the suspect was the perpetrator of the crime(offense level) (Roger Cook et al. 1998).
| 英單 |
|---|
| perpetrator 犯罪者 |
我們要定義在犯罪調查中的玻璃碎片:
Q > 代表質疑的玻璃碎片 (從嫌疑犯身上取得的)。
K > 代表已知的玻璃碎片 (從犯罪現場取得的)。
There are two key problems of interest in glass fragments comparison, but before defining them, we need to define the different glass involved in the investigation of a crime. Glass fragments found on the suspect, for example in their hair, shoes, or clothes, are questioned fragments, which we denote by Q. Glass fragments found at the crime scene, for example in front of a broken window or taken from the broken window, are known fragments, which we denoteK.
Define the different glass :
| Denotion | category | describe |
|---|---|---|
| Q | questioned fragments | Glass fragments found on the suspect |
| K | known fragments | Glass fragments found at the crime scene |
Questioned Source :
>來歷不明的證據
Evidence of unknown origin. These samples could be found at a crime scene, transferred to an offender during commission of a crime, or recovered from more than one crime scene.
Known Source :
>猜考樣本,可驗證/記錄的來源
Also called a reference sample, this is material from a verifiable/documented source which, when compared with evidence of a questioned source, shows an association or linkage between an offender, crime scene, and/or victim.
- 具體來源問題 :
從嫌犯取地的質疑的玻璃碎片( Q )是不是跟已知的玻璃碎片來自同樣的玻璃?。
- 目標 :
要量化Q和K之間的相似點。
This brings us to a specific source question: Did the questioned fragments Q found on the suspect come from the same source of glass as the known fragments K , which we know belong to a specific piece of glass at the scene? The goal is now to quantify the similarity between Q and K .
Specific source question :
The specific source identification question considers whether the trace originates from a fixed specific source. Source: Ommen, Saunders, and Neumann.
有很多方法可以去測量兩者相似的碎片,根據可用的玻璃碎片測量資料庫來測量,如果有測量的元素成分(ppm)數值,就可以分辨出Q和K之間的化學成分。
There are lots of ways measure similarity between two glass fragments, but the metric should be defined according to available databases of glass fragment measurements for which ground truth is known. For example, if we have elemental compositions measured in parts per million (ppm) as numerical values, the similarity can be quantified by the difference of the chemical compositions of Q and K.
| 補充 | ||
|---|---|---|
| parts per million(ppm) | 百萬分點濃度 | 定義為百萬分之一 |
我們有很多方法可以來測量玻璃,像是顏色、厚度、折光率(RI),在此章節會著重在使用範圍很廣的float glass。
There are many types of glass measurements such as color, thickness, refractive index (RI) and chemical concentrations. In this chapter, we will focus on float glass that is most frequently used in windows, doors and automobiles. For discussions of the other measurements see e.g. Curran, Hicks, and Buckleton (2000).
| 英單 |
|---|
| Refractive Index 折光率 |
Refractive Index :
A dimensionless number that describes how fast light propagates through a material.
Float Glass :
Extremely smooth, nearly distortion-free(無失真) plate glass manufactured by pouring molten glass onto asurface of molten tin(錫).
float glass的元素濃度的取得要透過雷射電漿光譜儀(LA-ICP-MS)。目前使用ASTM International的兩種分析指南(ASTM-E2330-12 2012)和(ASTM-E2927-16 2016),來確定玻璃碎片的來源,計算每個元素間隔的平均濃度,如果每個元素的間隔重疊,那就是玻璃碎片來自同個來源。
The elemental concentrations of float glass that we use here were obtained through inductively coupled mass spectrometry with a laser add-on (LA-ICP-MS). In the current practice, there are two analysis guides from ASTM International, (ASTM-E2330-12 2012) and (ASTM-E2927-16 2016). To determine the source of glass fragments according to these two guides, intervals around the mean concentrations are computed for each element, and if all elements’ intervals overlap, then the conclusion is that the fragments come from the same source. For more detail on these methods, see ASTM-E2330-12 (2012) and ASTM-E2927-16 (2016).
| 英單 |
|---|
| concentrations 濃度 |
| 補充 | ||
|---|---|---|
| inductively coupled mass spectrometry with a laser add-on (LA-ICP-MS) | 雷射電漿光譜儀 | 可以直接分析固態樣品 |
| ASTM International | 美國材料和試驗協會 | ASTM國際標準組織,是國際標準化組織,它是制定、發布自願共識的有關材料、產品、系統和服務的技術標準 |
去確認兩個玻璃碎片來自同來源,法醫分析會考慮很多屬性(顏色、螢光、厚薄、表面特徵、曲率和化學成分)。
In order to determine if two glass fragments come from the same source, a forensic analyst considers many properties of the glass, including color, fluorescence, thickness, surface features, curvature, and chemical composition.
Fluorescence :
>物質在接觸外部輻射的可見光(像光或是X光)
The emission of radiation, especially of visible light, by a substance during exposure to externalradiation, as light or x-rays.
| 補充 | ||
|---|---|---|
| curvature | 曲率 | 曲率是描述幾何體彎曲程度的量。 |
檢查所有屬性的所有方法,除了化學成分分析,都是沒破壞性的。小的玻璃碎片顏色判定是非常困難的,玻璃的厚度都會一致,因此如果兩個玻璃碎片的厚度超過0.25mm,那就排除是同個玻璃來源。同色同厚度的玻璃碎片,在破壞性化學成分分析前,會用微觀技術來決定光吸收(螢光)、曲率、表面特徵(表皮)。我們在這章節會用新的規則,使用隨機森林去做玻璃來源結論去分類。
All methods for examining these properties, except for methods of chemical composition analysis, are non-destructive. If the fragments are large, exclusion are easy to reach if the glass are of different colors because of the wide variety of glass colors possible in manufacturing. Typically, however, glass fragments are quite small and color determination is very difficult. Similarly, thickness of glass is dictated by the manufacturing process, which aims for uniform thickness, so if two glass fragments differ in thickness by more than 0.25mm, an exclusion is made (Bottrell 2009). For glass fragments of the same color and thickness, microscopic techniques for determining light absorption (fluorescence), curvature, surface features (such as coatings), are used before the destructive chemical composition analysis. 7.1.4 Goal of this chapter In this chapter, we construct a new rule for making glass source conclusions using the random forest algorithm to classify the source of glass fragments (Park and Carriquiry 2019a).
| 補充 | ||
|---|---|---|
| microscopic techniques | 微觀技術 | 用來查看肉眼無法看到的物體的技術。 |
| light absorption | 光吸收 | 吸收光並將其轉換成能量的過程。 |
| random forest | 隨機森林 | 是一個包含多個決策樹的分類器,並且其輸出的類別是由個別樹輸出的類別的眾數而定。 |
- 用ICP-MS破壞性方法會破壞元素成分,過程中決定化學成分是由ASTM-E2330-12 (2012) 和 ASTM-E2927-16 (2016)細節給定,多達40個元素可以檢測到。
- 只有18元素會用到 :
主要的元素 鈣(Ca)、鈉(NA)和鎂(Mg)
次要的元素 鋁(Al)、鉀(K)和鐵(Fe)
微量元素 鋰(Li)、鈦(Ti)、錳(Mn)、銣(Rb)、鍶(Sr)、鋯(Zr)、鋇(Ba)、鑭(La)、鈰(Ce)、釹(Nd)、鉿(Hf)和鉛(Pb)- 重複測量間隔一樣的玻璃碎片,用標準差(σ)表示,間隔寬2σ,4σ,6σ,8σ,10σ,12σ,16σ,20σ,30σ和40σ都是考慮重疊。
The process for determining the chemical composition of a glass fragment is given in great detail in ASTM-E2330-12 (2012) and ASTM-E2927-16 (2016). This destructive method determines elemental composition with Inductively Coupled Plasma Mass Spectrometry (ICP-MS). Up to 40 elements can be detected in a glass fragment using this method. In Weis et al. (2011), only 18 elements are used: calcium (Ca), sodium (NA) and magnesium (Mg) are the major elements, followed by aluminum (Al), potassium (K) and iron (Fe) as minor elements, and lithium (Li), titanium (Ti), manganese (Mn), rubidium (Rb), strontium (Sr), zirconium (Zr), barium (Ba), lanthanum (La), cerium (Ce), neodymium (Nd), hafnium (Hf), and lead (Pb) as the trace elements. The methods of Weis et al. (2011) use standard deviations (σ ) of repeated measurements of the same fragment to create intervals around the measurements. Intervals of width 2σ,4σ,6σ,8σ,10σ,12σ,16σ,20σ,30σ and 40σ are considered for overlap.
我們使用的Database是float glass範例的化學成分,database包含A公司製造的31個窗格(標記AA, AB, … , AAR)和B公司製造的17個窗格(標記BA, BB, … ,BR)。
Dr. Alicia Carriquiry of Iowa State University commissioned the collection of a large database of chemical compositions of float glass samples. The details of this database are explained in Park and Carriquiry (2019a). The full database is available here. The database includes 31 panes of float glass manufactured by Company A and 17 panes manufactured by Company B, both located in the United States. The Company A panes are labeled AA, AB, … , AAR, and the Company B panes are labeled BA, BB, … ,BR. The panes from Company A were produced during a three week period (January 3-24, 2017) and the panes from Company B were produced during a two week period (December 5-16, 2016).
玻璃帶內的變異性
To understand variability within a ribbon of glass, two glass panes were collected on almost all days in each company, one from the left side and one from the right side of the ribbon. Twenty four fragments were randomly sampled from each glass pane. Five replicate measurements were obtained for 21 of the 24 fragments in each pane; for the remaining three fragments in each pane, we obtained 20 replicate measurements. Therefore, each pane has 165 measurements for 18 elements. For example, see the heuristic in Figure 7.1. In some panes, there may be a fragment with fewer than five replicate measurements. The unit for all measurements is parts per million (ppm).
#Tidyverse>>是一系列優秀軟件庫的合集,其中最常用的幾個package包括 dplyr,ggplot2 和 readr 等
#將套件載入
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.4
## ✓ tibble 3.0.1 ✓ dplyr 0.8.5
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
#expand.grid() >>根據提供的向量或因子的所有組合創建一個數據框。
#pane數據框為16*4
pane <- expand.grid(x = 1:16, y = 1:4)
#nrow() >>列數
#n的列數為pane
n <- nrow(pane)
#$>>用於取出物件內容,亦可用於串列資料的讀取.
#pane 讀取 id 1到n
pane$id <- 1:n
#sample()>>對母體進行抽樣
#frags n 中抽樣24個
frags <- sample(n, 24)
#rep20 frags 中抽樣3個
rep20 <- sample(frags, 3)
#rep5 是否 frags %in% rep20 >>frags是否在rep20中
rep5 <- frags[!(frags %in% rep20)]
#data.frame()>>資料框
#rep(複製的對象, each =每個元素進行複製的次数 )
pane_sample <- data.frame(frag = c(rep(rep5, each = 5), rep(rep20, each = 20)),rep = c(rep(1:5, 21), rep(1:20, 3)))
#sample_data 保留pane中的所有觀測左連接pane_sample,靠鍵c(id = "frag")。
sample_data <- left_join(pane, pane_sample, by = c(id = "frag"))
#ggplot()>>繪圖時是以一種圖層式的概念在建立圖形的
#ggplot(data = sample_data, aes(x = x, y = y))>data = sample_data>>指定資料來源sample_data,aes(x = x, y = y)>>對應的參數x = x, y = y
#geom_tile()>>使用tile中心和尺寸
#geom_jitter()>>添加隨機噪音來改善的圖表,它在小規模的圖表上降低了數據的精確度,但同時也使得大型圖表能夠表現出更多信息。
ggplot(data = sample_data, aes(x = x, y = y)) + geom_tile(fill = "white", color = "black") +
geom_jitter(aes(color = as.factor(rep)), alpha = 0.8, size = 0.5) + scale_color_manual(values = c(rep("black",20))) + theme_void() + theme(legend.position = "none")
## Warning: Removed 40 rows containing missing values (geom_point).
Figure 7.1: An example of how the glass fragments were sampled, if the 64 squares are imagined to be randomly broken fragments within a pane.
Next, we look at the glass data.
glass <- read_csv("/cloud/project/glass_raw_all.csv")
## Parsed with column specification:
## cols(
## pane = col_character(),
## fragment = col_double(),
## Rep = col_double(),
## mfr = col_character(),
## element = col_character(),
## ppm = col_double()
## )
head(glass)
## # A tibble: 6 x 6
## pane fragment Rep mfr element ppm
## <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 AA 1 1 A Al 2678
## 2 AA 1 1 A Ba 10.8
## 3 AA 1 1 A Ca 63140
## 4 AA 1 1 A Ce 9.52
## 5 AA 1 1 A Fe 667
## 6 AA 1 1 A Hf 1.15
我們用對數轉換去將不同的元素放進相似的規模
The elements have very different scales, as some (e.g. Ca) are major elements, some (e.g. Al) are minor elements, and others (e.g Hf) are trace elements. Thus, we take the natural log transformation of all measurements to put them on a similar scale.
# need to make sure the panes are shown in order of mfr date(確保窗格以mfr date的順序顯示)
pane_order <- c(paste0("A", LETTERS[c(1:13, 15, 22:25)]), paste0("AA", LETTERS[c(1:4,
6, 8:13, 17:18)]), names(table(glass$pane))[c(32:48)])
glass$pane <- ordered(glass$pane, levels = pane_order)
#使用對數轉換
glass_log <- glass %>% mutate(log_ppm = log(ppm))
#select(): 選要分析的欄位,欄位子集 (Column)
glass_log %>% select(-ppm) %>% spread(element, log_ppm) %>% select(mfr, pane,fragment, Rep, Li, Na, Mg, Al, K, Ca) %>% head()
## # A tibble: 6 x 10
## mfr pane fragment Rep Li Na Mg Al K Ca
## <chr> <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A AA 1 1 0.833 11.5 10.0 7.89 7.14 11.1
## 2 A AA 1 2 0.542 11.5 10.0 7.88 7.13 11.0
## 3 A AA 1 3 0.723 11.5 10.0 7.91 7.13 11.0
## 4 A AA 1 4 0.798 11.5 10.1 7.91 7.14 11.1
## 5 A AA 1 5 0.723 11.5 10.0 7.87 7.10 11.0
## 6 A AA 2 1 0.560 11.5 10.0 7.90 7.12 11.0
library(magrittr) # needs to be run every time you start R and want to use %>%
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
library(dplyr) # alternatively, this also loads %>%
library(ggplot2) #繪製圖
library(arm)
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loading required package: lme4
##
## arm (Version 1.11-1, built: 2020-4-27)
## Working directory is /cloud/project
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
cols <- csafethemes:::csafe_cols_secondary[c(3, 12)]
glass_log %>% filter(element %in% c("Li", "Na", "Mg", "Al", "K", "Ca")) %>%
ggplot() + geom_density(aes(x = log_ppm, fill = mfr), alpha = 0.7) + scale_fill_manual(name = "Manufacturer",
values = cols) + facet_wrap(~element, scales = "free", nrow = 2) + labs(x = "Log concentration (ppm)",
y = "Density") + theme(legend.position = "top")
Figure 7.2: Density estimation of selected chemical compositions(選擇的化學成分的密度估計), colored by manufacturers(著色是以製造商來分別紅與藍)
Figure 7.2 shows density plots of six chemical compositions of Al, Ca, K, Li, Mg, Na. Na and Ca (major elements) show density curves from two manufacturers are overlapped, while Al, K show clear separation between curves by manufacturers. This implies that glass fragments from different manufacturers will be very easy to distinguish.
Figure 7.2顯示了六個(主要元素)化學成分(Al, Ca, K, Li, Mg, Na and Ca)的密度圖,顯示了兩個製造商密度曲線都是重疊的,Al和K兩個製造商間清晰分離。這代表說來自不同製造商的玻璃碎片是非常簡單可以分辨的。
glass_log %>% filter(element %in% c("Na", "Ti", "Zr", "Hf")) %>% ggplot() +
geom_boxplot(aes(x = pane, y = log_ppm, fill = mfr), alpha = 0.8, outlier.size = 0.5,
size = 0.1) + scale_fill_manual(name = "Manufacturer", values = cols) +
facet_wrap(~element, scales = "free", nrow = 2, labeller = label_both) +
theme_bw() + theme(legend.position = "none") + scale_x_discrete(labels = c("AA",
rep("", 30), "BA", rep("", 15), "BR")) + labs(x = "Pane (in order of manufacture)",
y = "Log concentration (ppm)")
Figure 7.3: Box plot of four elements(Na, Ti, Zr, Hf) in 48 panes, ordered by date of production, from two manufacturers.
Figure 7.3 shows box plots of the measurements of four elements (Na, Ti, Zr, Hf) in each of the 48 panes in the database, colored by manufacturer. Boxes are ordered by date of production within manufacturer. We can confirm the between- and within- pane variability. Interestingly, element values of Zr(鋯) and Hf(氫氟酸) in manufacturer A are both decreasing in time, which is evidence that the element measurements are highly correlated. To account for this relationship, we use methods that do not require independence of measurements.
Figure 7.3 顯示在database中每個48個窗格的四個元素(Na, Ti, Zr,Hf) 測量的箱型圖,不同的顏色代表著不同的製造商,箱子的順序是按製造生產日期排序,Zr(鋯)和Hf(氫氟酸)元素值,證明說製造商A都是隨時間減少是與元素測量高度相關的
##col_data_AA <- glass_log[, -6] %>% filter(pane == “AA”) %>% spread(element,log_ppm) %>% group_by(fragment) %>% summarise_if(is.numeric, mean, na.rm = TRUE) ##col_data_BA <- glass_log[, -6] %>% filter(pane == “BA”) %>% spread(element,log_ppm) %>% group_by(fragment) %>% summarise_if(is.numeric, mean, na.rm = TRUE) ##P1 <- ggcorr(col_data_AA[, 3:20], geom = “blank”, label = TRUE, hjust = 0.75) + geom_point(size = 10, aes(color = coefficient > 0, alpha = abs(coefficient) > ## 0.5)) + scale_alpha_manual(values = c(TRUE = 0.25, FALSE = 0)) + ## guides(color = FALSE, alpha = FALSE) ##P2 <- ggcorr(col_data_BA[, 3:20], geom = “blank”, label = TRUE, hjust = 0.75) + ## geom_point(size = 10, aes(color = coefficient > 0, alpha = abs(coefficient) > ## 0.5)) + scale_alpha_manual(values = c(TRUE = 0.25, FALSE = 0)) + ## guides(color = FALSE, alpha = FALSE)
要分辨玻璃碎片Q和K ,這裡是要構造一個預測分類器( 低錯 )。
We propose a machine learning method to quantify the similarity between two glass fragments Q and K . The goal here is to construct a classifier that predicts, with low error, whether or not Q and K have the same source. Using the glass database, we construct many pairwise comparisons between two glass fragments with known sources, and we will record their source as either same source or different source.
構造一組比較
1. 取所有玻璃碎片的所有測量值的自然對數。(ppm至log(ppm))
2. 在data中選擇一塊玻璃當“questioned”來源。從此採樣一個玻璃碎片為* Q *
3. 在data中選擇一塊玻璃當“known”來源。從此採樣一個玻璃碎片為* K *
4. 構造反應變量:如果2和3中的窗格相同,則對於已知配偶,反應變量為KM。否則,反應變量為KNM。
5. 構造特徵:取每個元素中* Q * 和 * K * 重複測量的平均值,並取* Q * 和 * K * 的平均值之差的絕對值。
6. 重複第2到5項,直到data集適合訓練分類器。
To construct the set of comparisons, we
1.Take the natural log of all measurements of all glass fragments. (ppm to log(ppm))
2.Select one pane of glass in the database to be the “questioned” source. Sample one fragment from this pane, and call it Q. 3.Select one pane of glass in the database to be the “known” source. Sample one fragment from this pane, and call it K. 4.Construct the response variable: if the panes in 2 and 3 are the same, the response variable is KM for known mates. Otherwise, the response variable is KNM for known non-mates. 5.Construct the features: take the mean of the repeated measurements of Q and K in each element, and take the absolute value of the difference between the mean of Q and the mean of K. 6.Repeat 2-5 until we have a data set suitable for training a classifier.
用R package caret 去訓練隨機森林分類器,確認兩個玻璃碎片( * Q * 和 * K * )有是否來自同樣的來源
We use the R package caret to train a random forest classifier to determine whether two glass fragments (Q and K) have the same source or have different sources (Jed Wing et al. 2018).
To begin, the package can be installed from CRAN:
#install.packages("caret")
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
# other package used for plotting
library(GGally)
The R package caret (Classification And REgression Training) is used for applied predictive modeling.
The caret() package allows us to run 238 different models as well as adjusting data splitting, preprocessing, feature selection, tuning parameter, and variable importance estimation. We use it here for fitting a random forest model to our data using cross-validation and down-sampling.
| 補充 | ||
|---|---|---|
| cross-validation | 交叉驗證 | 是一種統計學上將數據樣本切割成較小子集的實用方法。 |
| down-sampling | 降採樣 | 在數位信號處理領域中,降採樣,又作減採集, 是一種多速率數位訊號處理的技術或是降低信號採樣率的過程,通常用於降低數據傳輸速率或者數據大小。 |
diff_Q_K_data <- readr::read_rds("/cloud/project/rf_data_kfrags_1zz.rds")
head(diff_Q_K_data)
## Li7 Na23 Mg25 Al27 K39 Ca42
## 1 0.007825983 0.004262884 0.008902929 0.0270017037 0.01033831 0.01490457
## 2 -0.536202027 -0.058134419 0.026661368 0.0741020130 -0.08336967 0.06190373
## 3 -0.243691734 -0.028272780 0.009173808 0.0180585778 -0.03772804 0.03856983
## 4 -0.231870060 -0.031597995 0.028266398 -0.0001019363 -0.04368778 0.05183077
## 5 -0.287539103 -0.035254403 0.002752094 0.0150527042 -0.06557825 0.02655397
## 6 -0.294178202 -0.019490402 -0.008642867 -0.0318866510 -0.05688349 -0.01137215
## Ti49 Mn55 Fe57 Rb85 Sr88 Zr90
## 1 0.019411328 0.011765938 -0.012984198 0.03366349 -0.0006435085 0.04191747
## 2 0.038062536 0.009261067 -0.001471990 -0.10348290 0.0637929768 0.02774879
## 3 -0.023376384 -0.004417209 0.001861948 0.02524520 0.0263322024 -0.02035716
## 4 0.012680492 0.002256554 -0.021357223 -0.04280431 0.0678344217 -0.06662281
## 5 0.001426977 -0.011926781 -0.014250764 -0.04423522 0.0166767672 -0.07966173
## 6 -0.029209614 0.014859015 -0.005084877 -0.02655627 -0.0289439199 -0.26429203
## Ba137 La139 Ce140 Nd146 Hf178 Pb208
## 1 -0.005737496 0.01476758 -0.014580270 0.02862212 -0.006219102 -0.04459173
## 2 0.066215878 0.11477498 0.024415810 0.09085547 0.046927269 -0.02445929
## 3 0.055889714 0.04877220 0.024638225 0.06798612 0.017574537 -0.03702562
## 4 0.076768149 0.06966050 0.063706423 0.09604351 -0.035719520 -0.03943974
## 5 0.073241180 0.04877937 0.040370926 0.05441425 -0.092643748 -0.09306914
## 6 -0.006523154 0.01038928 0.007283092 -0.05093958 -0.197201603 -0.06987293
## pane_q pane_k
## 1 AA AA
## 2 AB AA
## 3 AC AA
## 4 AD AA
## 5 AE AA
## 6 AF AA
head(diff_Q_K_data)
## Li7 Na23 Mg25 Al27 K39 Ca42
## 1 0.007825983 0.004262884 0.008902929 0.0270017037 0.01033831 0.01490457
## 2 -0.536202027 -0.058134419 0.026661368 0.0741020130 -0.08336967 0.06190373
## 3 -0.243691734 -0.028272780 0.009173808 0.0180585778 -0.03772804 0.03856983
## 4 -0.231870060 -0.031597995 0.028266398 -0.0001019363 -0.04368778 0.05183077
## 5 -0.287539103 -0.035254403 0.002752094 0.0150527042 -0.06557825 0.02655397
## 6 -0.294178202 -0.019490402 -0.008642867 -0.0318866510 -0.05688349 -0.01137215
## Ti49 Mn55 Fe57 Rb85 Sr88 Zr90
## 1 0.019411328 0.011765938 -0.012984198 0.03366349 -0.0006435085 0.04191747
## 2 0.038062536 0.009261067 -0.001471990 -0.10348290 0.0637929768 0.02774879
## 3 -0.023376384 -0.004417209 0.001861948 0.02524520 0.0263322024 -0.02035716
## 4 0.012680492 0.002256554 -0.021357223 -0.04280431 0.0678344217 -0.06662281
## 5 0.001426977 -0.011926781 -0.014250764 -0.04423522 0.0166767672 -0.07966173
## 6 -0.029209614 0.014859015 -0.005084877 -0.02655627 -0.0289439199 -0.26429203
## Ba137 La139 Ce140 Nd146 Hf178 Pb208
## 1 -0.005737496 0.01476758 -0.014580270 0.02862212 -0.006219102 -0.04459173
## 2 0.066215878 0.11477498 0.024415810 0.09085547 0.046927269 -0.02445929
## 3 0.055889714 0.04877220 0.024638225 0.06798612 0.017574537 -0.03702562
## 4 0.076768149 0.06966050 0.063706423 0.09604351 -0.035719520 -0.03943974
## 5 0.073241180 0.04877937 0.040370926 0.05441425 -0.092643748 -0.09306914
## 6 -0.006523154 0.01038928 0.007283092 -0.05093958 -0.197201603 -0.06987293
## pane_q pane_k
## 1 AA AA
## 2 AB AA
## 3 AC AA
## 4 AD AA
## 5 AE AA
## 6 AF AA
nrow(diff_Q_K_data)
## [1] 69120
filter(diff_Q_K_data, pane_q==pane_k) %>% nrow()
## [1] 1440
#構造反應變量:如果2和3中的窗格相同,則對於已知配偶,反應變量為KM。否則,反應變量為KNM
diff_Q_K_data$class<-ifelse(diff_Q_K_data$pane_q==diff_Q_K_data$pane_k, "KM","KNM")
head(diff_Q_K_data[1:10, c(21,1:6)])
## class Li7 Na23 Mg25 Al27 K39
## 1 KM 0.007825983 0.004262884 0.008902929 0.0270017037 0.01033831
## 2 KNM -0.536202027 -0.058134419 0.026661368 0.0741020130 -0.08336967
## 3 KNM -0.243691734 -0.028272780 0.009173808 0.0180585778 -0.03772804
## 4 KNM -0.231870060 -0.031597995 0.028266398 -0.0001019363 -0.04368778
## 5 KNM -0.287539103 -0.035254403 0.002752094 0.0150527042 -0.06557825
## 6 KNM -0.294178202 -0.019490402 -0.008642867 -0.0318866510 -0.05688349
## Ca42
## 1 0.01490457
## 2 0.06190373
## 3 0.03856983
## 4 0.05183077
## 5 0.02655397
## 6 -0.01137215
diff_Q_K_data %>% gather(element, diff, Li7:Pb208) %>% filter(element %in% c("Zr90","Li7", "Hf178", "Ca42")) %>% ggplot() + geom_density(aes(x = diff, fill = class),
alpha = 0.7) + scale_fill_manual(name = "Class", values = cols) + facet_wrap(~element,
scales = "free", nrow = 2) + labs(x = "Difference between Q and K (log(ppm))") + theme(legend.position = "top")
Figure 7.4: Histogram of differences from four chemical elements among KM and KNM.(KM和KNM中,四種化學元素的差異圖)
Figure 7.4 shows the distribution of differences for KM and KNM pairs in the Ca, Hf, Li, and Zr. Across all elements, distribution of differences from KNM pairs are more dispersed than differences of KM. The differences of KM have high density near zero( KM的最高鋒接近0), while the KNM measurements are shifted to the right and have a long tail. These differences for all 18 elements will be the features of the random forest ( 18個元素會變成隨機森林的特徵).
Finally, we construct the random forest classifier. Since the data have 1,440 KM and 67,680 KNM observations, the response variable is imbalanced(KM和KNM反應變量是不平衡的). With this large imbalance, the algorithm can simply predict KNM and have low error rate without learning anything about the properties of the KM class. You can find more ways to consider this imbalance problem (不平衡問題) for fitting the random forest (去合隨機森林) in this glass fragment source prediction in (Park and Carriquiry 2019a). In this chapter, we will down-sample the KNM observations to equal the number of KM observations(我們將對KNM觀測值進行下採樣等於KM觀測值的數量). That way, we will have 1,440 KM and 1,440 KNM comparisons. Then, we sample 70% of them to be the training set and the remaining 30% are the testing set.
diff_Q_K_data$class <- as.factor(diff_Q_K_data$class)
# Down sample the majority class(主要) to the same number of minority class(次要)
set.seed(123789)
down_diff_Q_K_data <- downSample(diff_Q_K_data[, c(1:18)], diff_Q_K_data$class)
down_diff_Q_K_data <- down_diff_Q_K_data %>% mutate(id = row_number())
names(down_diff_Q_K_data)[19] <- "class"
table(down_diff_Q_K_data$class)
##
## KM KNM
## 1440 1440
# Create training set
train_data <- down_diff_Q_K_data %>% sample_frac(0.7)
# Create test set
test_data <- anti_join(down_diff_Q_K_data, train_data, by = "id")
train_data <- train_data[, -20] #exclude id
test_data <- test_data[, -20] #exclude id
# dim(train_data) # 2016 19 dim(test_data) # 864 19
data測試做了降採樣設定旁邊30%,就可以訓練分類器 隨機森林算法的調整參數為mtry,每個節點可用於拆分的變量數。用了五個值(1, 2, 3, 4, 5)去調整mtry並選擇最佳的
mtry是預測變量數的平方根
選擇最佳的mtry參數,低於ROC曲線的mtry參數計算
After down-sampling and setting aside 30% of the data for testing, we can train the classifier. Below is the R code to fit the random forest, using caret (Jed Wing et al. 2018). The tuning parameter for the random forest algorithm is mtry, the number of variables available for splitting at each tree node. We use five values (1, 2, 3, 4, 5) to tune mtry and pick the optimal one. For the classification, the default mtry is the square root of the number of predictor variables (it is ≈4.2, in our study). To pick the optimal mtry parameter, the area under the receiver operating characteristic (ROC) curve is calculated. The optimal value will result the highest area under the ROC curve (AUC). The data are also automatically centered and scaled for each predictor. We use 10-fold cross-valildation to evaluate the random forest algorithm and repeat entire process three times.
| 補充 | ||
|---|---|---|
| receiver operating characteristic (ROC) | ROC曲線 | 在信號檢測理論中,接收者操作特徵曲線是一種坐標圖式的分析工具,用於選擇最佳的信號偵測模型、捨棄次佳的模型。在同一模型中設定最佳閾值。 |
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3, savePredictions = "final",
summaryFunction = twoClassSummary, classProbs = TRUE)
# mtry is recommended to use sqrt(# of variables)=sqrt(18) so 1:5 are tried
# to find the optimal one
RF_classifier <- train(class ~ ., train_data, method = "rf", tuneGrid = expand.grid(.mtry = c(1:5)),
metric = "ROC", preProc = c("center", "scale"), trControl = ctrl)
RF_classifier <- readr::read_rds("/cloud/project/RF_classifier.RDS")
RF_classifier
## Random Forest
##
## 2016 samples
## 18 predictor
## 2 classes: 'KM', 'KNM'
##
## Pre-processing: centered (18), scaled (18)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 1815, 1814, 1814, 1814, 1815, 1814, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 1 0.9637918 0.9826797 0.8323266
## 2 0.9641196 0.9807190 0.8467239
## 3 0.9625203 0.9758170 0.8457239
## 4 0.9624018 0.9738562 0.8467172
## 5 0.9614865 0.9728758 0.8473838
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
根據RF擬合結果,它顯示了每個mtry值中的AUC(ROC),Sens(靈敏度)和Spec(特異性)
R-packagecaret從10倍交叉驗證和3個重複過程中選擇最佳mtry值,給出最高的AUC(ROC)為0.964
From the RF fitting result, it showed the AUC (ROC), Sens (Sensitivity) and Spec (Specificity) values in each mtry values.
R-package caret selects the best mtry value (it is 2, in our study) to give the highest AUC (ROC) value 0.964, from 10-fold cross-validation and 3 repeated process.
隨機森林500棵樹用2個最佳mtry 誤判率中失敗負面率為0.02 失敗正面率為0.153
The final performance of the random forest algorithm with the optimal mtry of 2 with 500 of number of trees. In the training set, the false negative rate is 0.02 and the false positive rate is 0.153 of false positive rat(誤判率,是針對特定測試錯誤地拒絕零假設的可能性。)e, which is the optimized classification result. We see the OOB estimate of error rate of 0.085, which is the average error rate from the fold left out of the training set during cross-validation.
RF_classifier$finalModel$confusion
## KM KNM class.error
## KM 1000 20 0.01960784
## KNM 152 844 0.15261044
imp <- varImp(RF_classifier)$importance
imp <- as.data.frame(imp)
imp$varnames <- rownames(imp) # row names to column
rownames(imp) <- NULL
imp <- imp %>% arrange(-Overall)
imp$varnames <- as.character(imp$varnames)
elements <- read_csv("/cloud/project/elements.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## symb = col_character(),
## name = col_character(),
## classification = col_character(),
## color = col_character(),
## type = col_character(),
## phase = col_character(),
## crystal = col_character()
## )
## See spec(...) for full column specifications.
imp %>% left_join(elements[, c(2, 4)], by = c(varnames = "symb")) %>% ggplot(aes(x = reorder(varnames,
Overall), y = Overall, fill = classification)) + geom_bar(stat = "identity",
color = "grey40") + scale_fill_brewer(name = "Element", palette = "Blues",
direction = -1) + labs(y = "Overall importance", x = "Variable") + scale_y_continuous(position = "right") +
coord_flip() + theme_bw()
Figure 7.5: Variable importance from the , colored by element types (major, minor or trace).
在這邊我們不將主要元素視為重要
By fitting the random forest algorithm, we can also get a measure of variable importance. This metric ranks which elements out of 18 are most important to correctly predicting the source of the glass fragments. Figure 7.5 shows that elements K, Ce, Zr, Rb, and Hf are five very important variables and Pb, Sr, Na, Ca, and Mg are five minimally important variables. All** major elements are not importan**t: this is not surprising because all glass is, broadly speaking, very similar chemically(相似的化學結構). Conversely, most of the trace elements are ranked in the top half(主要會在之上微量元素).