An Automated Anomaly Detection System for
Hourly Rainfall Data Quality Control

Hway-Min Chou

Master Program in Statistics, National Taiwan University

Advisor Prof. Ke-Sheng Cheng

Outline

Motivation

The goal is not to eliminate anomalies but to identify potential data contamination.

Data Collection

Data Preprocessing

Table 1.
Storm Type Rainy Season Duration
Frontal Rain Nov. - Apr.  > 1 h
Meiyu May and June
Convective Storms July - Oct.  1-12 h
Typhoons July - Oct.  > 12 h

List of warning typhoons

Abnormal Situation I

There are no returned observations because of malfunctions or delays.
Table 2.
New Code Circumstance CWB Code
A1 The instrument observed trace -9998
A2 The instrument failed to return OBS due to malfunctions or unknown reasons -9991
-9995
-9997
-9999
A3 The instrument delayed returning accumulated OBS for a while -9996
Note:
1 Trace: an amount of precipitation < 0.1 mm
2 OBS: Observations
3 -9991 = Instrument malfunctions waiting for repair
4 -9995 = The instrument failed to return OBS due to malfunctions
5 -9997 = The instrument failed to return OBS for unknown reasons
6 -9999 = The instrument did not observe rainfall

Abnormal Situation II

Rainfall time series of a station differs from that of other nearby stations.
Table 3.
Code Circumstance Further Verification
B1 OBS at some hours are higher than that of nearby stations Y
B2 Trace (A1), while nearby stations have observed rainfalls Y
B3 No OBS due to malfunctions (A2)
B4 No OBS due to delays (A3)
B5 Has observed rainfalls while nearby stations merely don’t Y
B6 Has observed rainfalls while nearby stations don’t (A2)
B7 Has observed rainfalls while nearby stations don’t (A3)
B8 Delayed return of accumulated rainfall records (A3)
B9 Rainfall trend is different from that of nearby stations Y
Note:
1 Trace: an amount of precipitation < 0.1 mm
2 OBS: Observations

Cutoff Point Method



Dealing with abnormal situation I.

\[\begin{equation} \tag{1} x(i,t) \ge v_{1-p} \end{equation}\]


K-Means Cluster Analysis

  1. \(C_1 \cup C_2 \cup ... \cup C_K = \{1,2,...,n\}\). Namely, each observation belongs to at least one cluster.
  2. \(C_k \cap C_{k'} = \phi \forall k \ne k'\). Namely, the clusters are non-overlapping.

\[\begin{equation} \tag{2} min_{C_1,...,C_k}\lbrace\sum^K_{k=1} V(C_k)\rbrace \end{equation}\]

\[\begin{equation} \tag{3} V(C_k)=\frac{1}{|C_k|} \sum_{i,i'\in C_k} \sum^p_{j=1}(x_{ij}-x{i'j})^2 \end{equation}\]

\[\begin{equation} \tag{4} \min_{C_1,...,C_k}\lbrace\sum^K_{k=1} \frac{1}{|C_k|} \sum_{i,i'\in C_k} \sum^p_{j=1}(x_{ij}-x{i'j})^2 \rbrace \end{equation}\]

Principal Component Analysis (1/3)

PCA is used to

Principal Component Analysis (2/3)

Why use PCA?


Principal Component Analysis (3/3)

\[\begin{equation} \tag{5} \mathbf Z = \mathbf \Phi^T \mathbf X \end{equation}\]

\[\begin{equation} \tag{6} Z_1=z_{i1}=\phi_{11}x_{i1} + \phi_{21}x_{i2} + ... + \phi_{p1}x_{ip} \end{equation}\]

PCA from Temporal Variation Aspect

  • To find the temporal variation at each station (observation).
  • Each column vector \(X_j\) denotes hourly rainfall of \(n\) raingauge stations at hour \(j\). (\(j \ge 2\))

\[\begin{equation} \tag{13} d = \sqrt{Z_1^2+Z_2^2} \ge d_{i,j,p} \end{equation}\]

  • \(d\): the Euclidean distance b/w (0,0) and \(X_j\) being projected on the subspace of PC1 and PC2.
  • Assume there are \(n_{i,j}\) days that PCA can be performed with storm type \(i\), cluster \(j\).
  • \(d_{i,j,p}\): threshold, the \(p^{th}\) quantile of those \(n_{i,j}\) maximum distances.

PCA from Spatial Variation Aspect

  • To find the spatial variation at each hour (observation).
  • Each column vector \(X_i\) denotes the \(m\) (\(m \le 24\)) hourly rainfall at station \(i\). (\(i \ge 2\))

\[\begin{equation} \tag{13} d^{'} = \sqrt{Z_1^2+Z_2^2} \ge d^{'}_{i,j,p} \end{equation}\]

  • \(d^{'}\): the Euclidean distance b/w (0,0) and \(X_i\) being projected on the subspace of PC1 and PC2.
  • Assume there are \(n^{'}_{i,j}\) days that PCA can be performed with storm type \(i\), cluster \(j\).
  • \(d^{'}_{i,j,p}\): the \(p\) quantile of those \(n^{'}_{i,j}\) maximum distances.

Result of Cutoff Point Method

Table 4.
Item Station ID Station Time HR
1 C0A931 Sanhe 2008/11/9 05:00 936.0
2 C0A940 Jinshan 2008/12/24 19:00 735.5
3 C0AI40 Shipai 2008/10/21 14:00 283.0
4 C0C490 Bade 1998/7/14 15:00 388.5
5 C0M640 Zhongpu 2001/9/19 10:00 741.5
6 C0O970 Hutoupi 2004/10/21 13:00 201.0
7 C0R130 Ali 2001/5/21 11:00 544.0
8 C0R140 Majia 2001/5/21 11:00 828.0
9 C0R140 Majia 2001/5/31 10:00 434.0
10 C0R280 Binlang 2012/8/28 15:00 666.0
11 C0R341 Mudan 2011/9/3 13:00 320.5
12 C0R341 Mudan 2012/8/25 04:00 460.5
13 C0S660 Siama 2016/7/9 14:00 369.0
14 C0S710 Luye 2016/7/9 15:00 226.5
15 C0S760 Hongshih 1999/9/4 19:00 216.5
16 C0S760 Hongshih 1999/10/9 11:00 237.0
17 C0V310 Meinong 2001/5/21 10:00 312.5
18 C0V350 Xipu 2001/5/21 10:00 342.0
19 C0V740 Qishan 2001/5/21 11:00 292.0
20 C1A630 Siapen 2001/7/21 14:00 216.0
21 C1E480 Fongmei 1998/2/25 18:00 268.0
22 C1F891 Shaolai 1998/2/20 17:00 223.5
23 C1F941 Xueling 1998/2/20 18:00 251.5
24 C1R110 Gusia 2001/5/21 14:00 579.5
25 C1R110 Gusia 2001/5/31 16:00 334.0
26 C1R120 Shangdewun 2001/5/21 11:00 711.0
27 C1S670 Motian 2016/7/9 13:00 216.5
28 C1U690 Sinliao 2009/10/12 14:00 734.5
29 C1Z130 Tongmen 2005/9/23 09:00 364.5
Note:
1 HR = Hourly Rainfall (mm)

Result of K-Means Cluster Analysis

Figure 1.

5 variables for conducting the K-Means cluster analysis:

  • Altitude
  • Longitude
  • Latitude
  • Average annual rainfall for a given storm type
  • Standard deviation of the average annual rainfall for a given storm type

Anomaly Detection on Mar. 27, 2020 (Temporal Aspect)

東河、成功、池上、鹿野、紅石、明里、台東、紅葉山

Figure 3.

Figure 4.

Anomaly Detection on Mar. 27, 2020 (Spatial Aspect)

東河、成功、池上、鹿野、紅石、明里、台東、紅葉山

Figure 5.

Figure 6.

Time Series Plot on March 27, 2020

Frontal Rain, Cluster 4
東河、成功、池上、鹿野、紅石、明里、台東、紅葉山

Figure 7.

Figure 8.

Result of PCA

Table 6. For Frontal Rain
Cluster 1 2 3 4 5 6 7 8
Number of Stations 61 27 50 36 55 47 15 6
Days Available 424 514 1401 947 551 682 592 1153
Threshold 8.925 6.859 12.039 10.185 10.391 7.641 6.093 5.234
Anomalies Detected 43 52 140 95 55 69 60 116
PAIVV 4 7 23 12 11 9 5 5
Note:
1 PAIVV: possible anomalies identified by visual verification from the detected outliers

Nine Categories of Anomalies Detected by PCA

Figure 9.

Category B2 on July 18, 2005

Figure 10.

Figure 11.

Category of B5 on June 17, 1998

Figure 16.

Figure 17.

Category of B9 on July 10, 1999

Figure 24.

Figure 25.

Conclusion