An Automated Outlier Detection System for
Hourly Rainfall Data Quality Control

指導教授:鄭克聲

統計碩士學位學程 周卉敏

June 30, 2020

Outline

Data QA & QC

Motivation

Data Collection

Data Preprocessing

Table 1.
Storm Type Rainy Season Duration
Frontal Rain Nov. - Apr. More than 1 hour
Meiyu May and June
Convective Storms July - Oct. From 1 to 12 hours
Typhoons July - Oct. More than 12 hours

List of warning typhoons

Abnormal Situation I

There are no returned observations because of malfunctions or delays.
We reorganize the Code in Table 2. to Table 3.

Table 3.
Code Circumstance Original Code
A1 Trace -9998
A2 The instrument failed to return OBS due to malfunctions or unknown reasons -9991
-9995
-9997
-9999
A3 The instrument delayed returning accumulated OBS for a while -9996
Note:
1 Trace: an amount of precipitation < 0.1 mm;
2 OBS: Observations

Abnormal Situation II

Rainfall time series of a station differs from that of other nearby stations.
Table 4 exhibits 9 circumstances for a station in one day that may be abnormal.

Table 4.
Code Circumstance Further Verification
B1 OBS at some hours are higher than that of nearby stations Y
B2 Trace (A1), while nearby stations have observed rainfalls Y
B3 No OBS due to malfunctions (A2)
B4 No OBS due to delays (A3)
B5 Has observed rainfalls while nearby stations merely don’t (A1) Y
B6 Has observed rainfalls while nearby stations don’t (A2)
B7 Has observed rainfalls while nearby stations don’t (A3)
B8 Delayed return of accumulated rainfall records (A3)
B9 Rainfall trend is different from that of nearby stations Y
Note:
1 Trace: an amount of precipitation < 0.1 mm;
2 OBS: Observations;
3 A1: Trace;
4 A2: The instrument failed to return OBS due to malfunctions or unknown reasons;
5 A3: The instrument delayed returning accumulated OBS for a while

Cutoff Point Method



Dealing with abnormal situation I.

\[\begin{equation} \tag{1} x(i,t) \ge v_{1-p} \end{equation}\]


K-Means Cluster Analysis

  1. \(C_1 \cup C_2 \cup ... \cup C_K = \{1,2,...,n\}\). Namely, each observation belongs to at least one cluster.
  2. \(C_k \cap C_{k'} = \phi \forall k \ne k'\). Namely, the clusters are non-overlapping.
\[\begin{equation} \tag{2} min_{C_1,...,C_k}\lbrace\sum^K_{k=1} V(C_k)\rbrace \end{equation}\] \[\begin{equation} \tag{3} V(C_k)=\frac{1}{|C_k|} \sum_{i,i'\in C_k} \sum^p_{j=1}(x_{ij}-x{i'j})^2 \end{equation}\] \[\begin{equation} \tag{4} \min_{C_1,...,C_k}\lbrace\sum^K_{k=1} \frac{1}{|C_k|} \sum_{i,i'\in C_k} \sum^p_{j=1}(x_{ij}-x{i'j})^2 \rbrace \end{equation}\]

Principal Component Analysis (1/4)

PCA is used to

Principal Component Analysis (2/4)

Why use PCA?


Principal Component Analysis (3/4)

\[\begin{equation} \tag{5} \mathbf Z = \mathbf \Phi^T \mathbf X \end{equation}\] \[\begin{equation} \tag{6} Z_1=z_{i1}=\phi_{11}x_{i1} + \phi_{21}x_{i2} + ... + \phi_{p1}x_{ip} \end{equation}\]

Principal Component Analysis (4/4)

Singular value decomposition of \(\mathbf S\) \[\begin{equation} \tag{9} \mathbf U^T \mathbf S \mathbf U = \mathbf L \end{equation}\] \[\begin{equation} \tag{10} \mathbf \Phi = \mathbf U \mathbf L^{-\frac{1}{2}} \end{equation}\] \[\begin{equation} \tag{12} \mathbf Z =\mathbf \Phi^T \mathbf D ^{\frac{-1}{2}} \mathbf X \end{equation}\]

PCA from Temporal Variation Aspect

  • To find the temporal variation at each station (observation).
  • Each column vector \(X_j\) denotes hourly rainfall of \(n\) raingauge stations at hour \(j\). (\(j \ge 2\))
\[\begin{equation} \tag{13} d = \sqrt{Z_1^2+Z_2^2} \ge d_{i,j,p} \end{equation}\]
  • \(d\): the Euclidean distance b/w (0,0) and \(X_j\) being projected on the subspace of PC1 and PC2.
  • Assume there are \(n_{i,j}\) days that PCA can be performed with storm type \(i\), cluster \(j\).
  • \(d_{i,j,p}\): threshold, the \(p^{th}\) quantile of those \(n_{i,j}\) maximum distances.

PCA from Spatial Variation Aspect

  • To find the spatial variation at each hour (observation).
  • Each column vector \(X_i\) denotes the \(m\) (\(m \le 24\)) hourly rainfall at station \(i\). (\(i \ge 2\))
\[\begin{equation} \tag{13} d^{'} = \sqrt{Z_1^2+Z_2^2} \ge d^{'}_{i,j,p} \end{equation}\]
  • \(d^{'}\): the Euclidean distance b/w (0,0) and \(X_i\) being projected on the subspace of PC1 and PC2.
  • Assume there are \(n^{'}_{i,j}\) days that PCA can be performed with storm type \(i\), cluster \(j\).
  • \(d^{'}_{i,j,p}\): the \(p\) quantile of those \(n^{'}_{i,j}\) maximum distances.

Result of Cutoff Point Method (1/2)

Table 5.
Item Station ID 站名 Station Time HR PHR
1 C0A931 三和 Sanhe 2008110905 936.0 -9996
2 C0A940 金山 Jinshan 2008122419 735.5 -9996
3 C0AI40 石牌 Shipai 2008102114 283.0 -9996
4 C0C490 八德 Bade 1998071415 388.5 -9996
5 C0M640 中埔 Zhongpu 2001091910 741.5 -9996
6 C0O970 虎頭埤 Hutoupi 2004102113 201.0 -9996
7 C0R130 阿禮 Ali 2001052111 544.0 -9996
8 C0R140 瑪家 Majia 2001052111 828.0 -9996
9 C0R140 瑪家 Majia 2001053110 434.0 -9996
10 C0R280 檳榔 Binlang 2012082815 666.0 -9996
11 C0R341 牡丹 Mudan 2011090313 320.5 -9996
12 C0R341 牡丹 Mudan 2012082504 460.5 -9996
13 C0S660 下馬 Siama 2016070914 369.0 -9996
14 C0S710 鹿野 Luye 2016070915 226.5 -9996
15 C0S760 紅石 Hongshih 1999090419 216.5 -9996
16 C0S760 紅石 Hongshih 1999100911 237.0 -9996
17 C0V310 美濃 Meinong 2001052110 312.5 -9996
18 C0V350 溪埔 Xipu 2001052110 342.0 -9996
19 C0V740 旗山 Qishan 2001052111 292.0 -9996
20 C1A630 下盆 Siapen 2001072114 216.0 -9996
21 C1E480 鳳美 Fongmei 1998022518 268.0 -9996
22 C1F891 稍來 Shaolai 1998022017 223.5 -9996
23 C1F941 雪嶺 Xueling 1998022018 251.5 -9996
24 C1R110 口社 Gusia 2001052114 579.5 -9996
25 C1R110 口社 Gusia 2001053116 334.0 -9996
26 C1R120 上德文 Shangdewun 2001052111 711.0 -9996
27 C1S670 摩天 Motian 2016070913 216.5 -9996
28 C1U690 新寮 Sinliao 2009101214 734.5 -9996
29 C1Z130 銅門 Tongmen 2005092309 364.5 -9996
Note:
1 HR: Hourly rainfall;
2 PHR: Previous hourly rainfall;
3 Time: yyyymmddhh

Result of Cutoff Point Method (2/2)

Table 6.
Item StationID 站名 Station Time HR PHR
1 C0A870 五指山 Wujhihshan 2000081816 75.0 -9995
2 C0D360 梅花 Meihua 2019070114 108.5 -9995
3 C0R150 三地門 Sandimen 2000110112 83.0 -9995
4 C0R190 赤山 Chishan 2000110114 85.0 -9995
5 C0R190 赤山 Chishan 2000092409 77.5 -9995
6 C0S760 紅石 Hongshih 2000070308 63.0 -9995
7 C0S760 紅石 Hongshih 2000070608 44.5 -9995
8 C0S760 紅石 Hongshih 2000071807 47.5 -9995
9 C0S760 紅石 Hongshih 2000080408 43.0 -9995
10 C0S760 紅石 Hongshih 2000080709 56.5 -9995
11 C0V350 溪埔 Xipu 2000041816 82.5 -9995
12 C1H9B1 阿眉 Amei 2019061416 131.5 -9995
13 C1I121 大鞍 Da-An 2000032115 150.0 -9995
14 C1I121 大鞍 Da-An 2000110914 158.0 -9995
15 C1I131 桶頭 Tongtou 2000110914 103.0 -9995
Note:
1 HR: Hourly rainfall;
2 PHR: Previous hourly rainfall;
3 Time: yyyymmddhh

Clustering Result

Figure 1.

Figure 1.

5 variables for conducting the K-Means cluster analysis:

  • Altitude
  • Longitude
  • Latitude
  • Average annual rainfall (1998-2019) for a given storm type
  • Standard deviation of the average annual rainfall for a given storm type

Outlier Detection on March 27, 2020 (Temporal Aspect)

東河、成功、池上、鹿野、紅石、明里、台東、紅葉山

Figure 3.

Figure 3.

Figure 4.

Figure 4.

Outlier Detection on March 27, 2020 (Spatial Aspect)

東河、成功、池上、鹿野、紅石、明里、台東、紅葉山
Figure 5.

Figure 5.

Figure 6.

Figure 6.

Time Series Plot on March 27, 2020

Frontal Rain, Cluster 4
東河、成功、池上、鹿野、紅石、明里、台東、紅葉山

Figure 7.

Figure 7.

Figure 8.

Figure 8.

Result of PCA Method

Table 7.
Cluster 1 2 3 4 5 6 7 8
Number of Stations 61 27 50 36 55 47 15 6
Available Days for PCA 424 514 1401 947 551 682 592 1153
90% Quantile (Temporal) 8.925 6.859 12.039 10.185 10.391 7.641 6.093 5.234
Detected Outliers 43 52 140 95 55 69 60 116
POIVV 4 7 23 12 11 9 5 5
Note:
1 POIVV: possible outliers identified by visual verification from the detected outliers;



Table 9.
Cluster 1 2 3 4 5 6 7 8 9 10
Number of Stations 39 37 25 21 40 44 25 41 15 10
Available Days for PCA 787 1084 895 929 1568 1363 1077 970 843 892
90% Quantile (Temporal) 8.610 8.708 6.854 7.841 8.715 9.952 7.703 10.745 6.228 4.883
Detected Outliers 79 109 90 93 157 137 108 97 85 90
POIVV 6 13 11 9 18 7 16 10 3 8
Note:
1 POIVV: possible outliers identified by visual verification from the detected outliers;

9 Categories of Outliers Detected by PCA Method

Table 11.
Code B1 B2 B3 B4 B5 B6 B7 B8 B9 Sum
Convective Storms 87 1 4 0 4 0 1 0 4 101
Frontal Rain 50 0 3 2 10 7 2 1 1 76
Meiyu 56 1 9 3 18 5 0 4 5 101
Typhoons 20 1 0 0 0 0 0 0 1 22
Sum 213 3 16 5 32 12 3 5 11 300
Note:
1 B1= OBS at some hours are higher than that of nearby stations;
2 B2= Trace (A1), while nearby stations have observed rainfalls;
3 B3= No OBS due to malfunctions (A2);
4 B4= No OBS due to delays (A3);
5 B5= Has observed rainfalls while nearby stations merely don’t (A1);
6 B6= Has observed rainfalls while nearby stations don’t (A2);
7 B7= Has observed rainfalls while nearby stations don’t (A3);
8 B8= Delayed return of accumulated rainfall records (A3);
9 B9= Rainfall trend is different from that of nearby stations

Figure 9.

Example of B2 on July 18, 2005

Figure 10.

Figure 10.

Figure 11.

Figure 11.

Example of B5 on June 17, 1998

Figure 16.

Figure 16.

Figure 17.

Figure 17.

Example of B9 on July 10, 1999

Figure 24.

Figure 24.

Figure 25.

Figure 25.

Conclusion