An Automated Outlier Detection System for
Hourly Rainfall Data Quality Control

指導教授 鄭克聲

臺灣大學統計碩士學位學程 周卉敏

August 20, 2020

Outline

Data QA & QC

Motivation

Data Collection

Data Preprocessing

Table 1.
Storm Type Rainy Season Duration
Frontal Rain Nov. - Apr. > 1 h
Meiyu May and June
Convective Storms July - Oct. 1-12 h
Typhoons July - Oct. > 12 h

List of warning typhoons

Abnormal Situation I

There are no returned observations because of malfunctions or delays.
The CWB Code are reorganized in Table 2.

Table 2.
New Code Circumstance CWB Code
A1 The instrument observed trace -9998
A2 The instrument failed to return OBS due to malfunctions or unknown reasons -9991
-9995
-9997
-9999
A3 The instrument delayed returning accumulated OBS for a while -9996
Note:
1 Trace: an amount of precipitation < 0.1 mm
2 OBS: Observations
3 -9991 = Instrument malfunctions waiting for repair
4 -9995 = The instrument failed to return OBS due to malfunctions
5 -9997 = The instrument failed to return OBS for unknown reasons
6 -9999 = The instrument did not observe rainfall

Abnormal Situation II

Rainfall time series of a station differs from that of other nearby stations.
Table 3 exhibits 9 circumstances for a station in one day that may be abnormal.

Table 3.
Code Circumstance Further Verification
B1 OBS at some hours are higher than that of nearby stations Y
B2 Trace (A1), while nearby stations have observed rainfalls Y
B3 No OBS due to malfunctions (A2)
B4 No OBS due to delays (A3)
B5 Has observed rainfalls while nearby stations merely don’t (A1) Y
B6 Has observed rainfalls while nearby stations don’t (A2)
B7 Has observed rainfalls while nearby stations don’t (A3)
B8 Delayed return of accumulated rainfall records (A3)
B9 Rainfall trend is different from that of nearby stations Y
Note:
1 Trace: an amount of precipitation < 0.1 mm
2 OBS: Observations

Cutoff Point Method



Dealing with abnormal situation I.

\[\begin{equation} \tag{1} x(i,t) \ge v_{1-p} \end{equation}\]


K-Means Cluster Analysis

  1. \(C_1 \cup C_2 \cup ... \cup C_K = \{1,2,...,n\}\). Namely, each observation belongs to at least one cluster.
  2. \(C_k \cap C_{k'} = \phi \forall k \ne k'\). Namely, the clusters are non-overlapping.
\[\begin{equation} \tag{2} min_{C_1,...,C_k}\lbrace\sum^K_{k=1} V(C_k)\rbrace \end{equation}\] \[\begin{equation} \tag{3} V(C_k)=\frac{1}{|C_k|} \sum_{i,i'\in C_k} \sum^p_{j=1}(x_{ij}-x{i'j})^2 \end{equation}\] \[\begin{equation} \tag{4} \min_{C_1,...,C_k}\lbrace\sum^K_{k=1} \frac{1}{|C_k|} \sum_{i,i'\in C_k} \sum^p_{j=1}(x_{ij}-x{i'j})^2 \rbrace \end{equation}\]

Principal Component Analysis (1/3)

PCA is used to

Principal Component Analysis (2/3)

Why use PCA?


Principal Component Analysis (3/3)

\[\begin{equation} \tag{5} \mathbf Z = \mathbf \Phi^T \mathbf X \end{equation}\] \[\begin{equation} \tag{6} Z_1=z_{i1}=\phi_{11}x_{i1} + \phi_{21}x_{i2} + ... + \phi_{p1}x_{ip} \end{equation}\]

PCA from Temporal Variation Aspect

  • To find the temporal variation at each station (observation).
  • Each column vector \(X_j\) denotes hourly rainfall of \(n\) raingauge stations at hour \(j\). (\(j \ge 2\))
\[\begin{equation} \tag{13} d = \sqrt{Z_1^2+Z_2^2} \ge d_{i,j,p} \end{equation}\]
  • \(d\): the Euclidean distance b/w (0,0) and \(X_j\) being projected on the subspace of PC1 and PC2.
  • Assume there are \(n_{i,j}\) days that PCA can be performed with storm type \(i\), cluster \(j\).
  • \(d_{i,j,p}\): threshold, the \(p^{th}\) quantile of those \(n_{i,j}\) maximum distances.

PCA from Spatial Variation Aspect

  • To find the spatial variation at each hour (observation).
  • Each column vector \(X_i\) denotes the \(m\) (\(m \le 24\)) hourly rainfall at station \(i\). (\(i \ge 2\))
\[\begin{equation} \tag{13} d^{'} = \sqrt{Z_1^2+Z_2^2} \ge d^{'}_{i,j,p} \end{equation}\]
  • \(d^{'}\): the Euclidean distance b/w (0,0) and \(X_i\) being projected on the subspace of PC1 and PC2.
  • Assume there are \(n^{'}_{i,j}\) days that PCA can be performed with storm type \(i\), cluster \(j\).
  • \(d^{'}_{i,j,p}\): the \(p\) quantile of those \(n^{'}_{i,j}\) maximum distances.

Result of Cutoff Point Method

Table 4.
Item Station ID 站名 Station Time HR
1 C0A931 三和 Sanhe 2008/11/9 05:00 936.0
2 C0A940 金山 Jinshan 2008/12/24 19:00 735.5
3 C0AI40 石牌 Shipai 2008/10/21 14:00 283.0
4 C0C490 八德 Bade 1998/7/14 15:00 388.5
5 C0M640 中埔 Zhongpu 2001/9/19 10:00 741.5
6 C0O970 虎頭埤 Hutoupi 2004/10/21 13:00 201.0
7 C0R130 阿禮 Ali 2001/5/21 11:00 544.0
8 C0R140 瑪家 Majia 2001/5/21 11:00 828.0
9 C0R140 瑪家 Majia 2001/5/31 10:00 434.0
10 C0R280 檳榔 Binlang 2012/8/28 15:00 666.0
11 C0R341 牡丹 Mudan 2011/9/3 13:00 320.5
12 C0R341 牡丹 Mudan 2012/8/25 04:00 460.5
13 C0S660 下馬 Siama 2016/7/9 14:00 369.0
14 C0S710 鹿野 Luye 2016/7/9 15:00 226.5
15 C0S760 紅石 Hongshih 1999/9/4 19:00 216.5
16 C0S760 紅石 Hongshih 1999/10/9 11:00 237.0
17 C0V310 美濃 Meinong 2001/5/21 10:00 312.5
18 C0V350 溪埔 Xipu 2001/5/21 10:00 342.0
19 C0V740 旗山 Qishan 2001/5/21 11:00 292.0
20 C1A630 下盆 Siapen 2001/7/21 14:00 216.0
21 C1E480 鳳美 Fongmei 1998/2/25 18:00 268.0
22 C1F891 稍來 Shaolai 1998/2/20 17:00 223.5
23 C1F941 雪嶺 Xueling 1998/2/20 18:00 251.5
24 C1R110 口社 Gusia 2001/5/21 14:00 579.5
25 C1R110 口社 Gusia 2001/5/31 16:00 334.0
26 C1R120 上德文 Shangdewun 2001/5/21 11:00 711.0
27 C1S670 摩天 Motian 2016/7/9 13:00 216.5
28 C1U690 新寮 Sinliao 2009/10/12 14:00 734.5
29 C1Z130 銅門 Tongmen 2005/9/23 09:00 364.5
Note:
1 HR = Hourly Rainfall (mm)

Clustering Result

Figure 1.

Figure 1.

5 variables for conducting the K-Means cluster analysis:

  • Altitude
  • Longitude
  • Latitude
  • Average annual rainfall (1998-2019) for a given storm type
  • Standard deviation of the average annual rainfall for a given storm type

Anomaly Detection on March 27, 2020 (Temporal Aspect)

東河、成功、池上、鹿野、紅石、明里、台東、紅葉山

Figure 3.

Figure 3.

Figure 4.

Figure 4.

Anomaly Detection on March 27, 2020 (Spatial Aspect)

東河、成功、池上、鹿野、紅石、明里、台東、紅葉山
Figure 5.

Figure 5.

Figure 6.

Figure 6.

Time Series Plot on March 27, 2020

Frontal Rain, Cluster 4
東河、成功、池上、鹿野、紅石、明里、台東、紅葉山

Figure 7.

Figure 7.

Figure 8.

Figure 8.

Result of PCA

Table 6. For Frontal Rain
Cluster 1 2 3 4 5 6 7 8
Number of Stations 61 27 50 36 55 47 15 6
Days Available 424 514 1401 947 551 682 592 1153
Threshold 8.925 6.859 12.039 10.185 10.391 7.641 6.093 5.234
Anomalies Detected 43 52 140 95 55 69 60 116
PAIVV 4 7 23 12 11 9 5 5
Note:
1 PAIVV: possible anomalies identified by visual verification from the detected outliers



Table 6. For Convective Storms
Cluster 1 2 3 4 5 6 7 8 9 10
Number of Stations 39 37 25 21 40 44 25 41 15 10
Days Available 787 1084 895 929 1568 1363 1077 970 843 892
Threshold 8.610 8.708 6.854 7.841 8.715 9.952 7.703 10.745 6.228 4.883
Anomalies Detected 79 109 90 93 157 137 108 97 85 90
PAIVV 6 13 11 9 18 7 16 10 3 8
Note:
1 PAIVV: possible anomalies identified by visual verification from the detected outliers;

9 Categories of Anomalies Detected by PCA

Table 7.
Code B1 B2 B3 B4 B5 B6 B7 B8 B9 Sum
Convective Storms 87 1 4 0 4 0 1 0 4 101
Frontal Rain 50 0 3 2 10 7 2 1 1 76
Meiyu 56 1 9 3 18 5 0 4 5 101
Typhoons 20 1 0 0 0 0 0 0 1 22
Sum 213 3 16 5 32 12 3 5 11 300
Note:
1 B1= OBS at some hours are higher than that of nearby stations
2 B2= Trace, while nearby stations have observed rainfalls
3 B3= No OBS due to malfunctions
4 B4= No OBS due to delays
5 B5= Has observed rainfalls while nearby stations merely don’t
6 B6= Has observed rainfalls while nearby stations don’t
7 B7= Has observed rainfalls while nearby stations don’t
8 B8= Delayed return of accumulated rainfall records
9 B9= Rainfall trend is different from that of nearby stations

Figure 9.

Example of B2 on July 18, 2005

Figure 10.

Figure 10.

Figure 11.

Figure 11.

Example of B5 on June 17, 1998

Figure 16.

Figure 16.

Figure 17.

Figure 17.

Example of B9 on July 10, 1999

Figure 24.

Figure 24.

Figure 25.

Figure 25.

Conclusion