Use Case: Outlier Detection from Historic Crime Data

Using crime data from the file uscrime.txt (http://www.statsci.org/data/general/uscrime.txt, description at http://www.statsci.org/data/general/uscrime.html), test to see whether there are any outliers in the last column (number of crimes per 100,000 people).

Testing for Outliers

#set seed to make the results reproducible
set.seed(1)

#clear the environment
rm(list = ls())

#load the data
data <- read.table("uscrime.txt",header = TRUE)
nrow(data)
## [1] 47
head(data)
##      M So   Ed  Po1  Po2    LF   M.F Pop   NW    U1  U2 Wealth Ineq     Prob
## 1 15.1  1  9.1  5.8  5.6 0.510  95.0  33 30.1 0.108 4.1   3940 26.1 0.084602
## 2 14.3  0 11.3 10.3  9.5 0.583 101.2  13 10.2 0.096 3.6   5570 19.4 0.029599
## 3 14.2  1  8.9  4.5  4.4 0.533  96.9  18 21.9 0.094 3.3   3180 25.0 0.083401
## 4 13.6  0 12.1 14.9 14.1 0.577  99.4 157  8.0 0.102 3.9   6730 16.7 0.015801
## 5 14.1  0 12.1 10.9 10.1 0.591  98.5  18  3.0 0.091 2.0   5780 17.4 0.041399
## 6 12.1  0 11.0 11.8 11.5 0.547  96.4  25  4.4 0.084 2.9   6890 12.6 0.034201
##      Time Crime
## 1 26.2011   791
## 2 25.2999  1635
## 3 24.3006   578
## 4 29.9012  1969
## 5 21.2998  1234
## 6 20.9995   682

First, let’s do some quick plots to inspect the last column, which is the number of crimes per 100,000 people.

plot(data$Crime, ylim = c(0,2500))
text(data$Crime, labels = data$Crime, cex = 0.8, font = 2, pos = 3)

From the scatter plot above, it looks like there is a high-value outlier around the 25th-30th points, with a crime rate of 1993 per 100,000 people. Let’s also create a couple more plots:

#this plot collapses all points onto a single x position, to focus on the spread of the crime values per 100,000 people
plot(rep(0,length(data$Crime)),data$Crime)

The plot above shows that the majority of the data values are concentrated in the 500 - 1500 range, with a few points above or below that band, which may be the outliers. Let’s also use the boxplot() function to visualize the distribution: a box plot summarizes a data set through its quartiles and flags points more than 1.5 times the interquartile range beyond the quartiles as potential outliers.

boxplot(data$Crime)

This box plot clearly shows that the flagged outliers are on the high-value side, well above 1500.
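
The flagged values can also be listed numerically; a quick sketch using base R’s boxplot.stats(), which applies the same 1.5 * IQR whisker rule as the plot:

#list the values beyond the boxplot whiskers
boxplot.stats(data$Crime)$out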

#histogram plot
hist(data$Crime)

This histogram also shows that around 40 of the 47 data points are concentrated in the 500 - 1500 range, with 2 points below 500 and 4 points above 1500.
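
These counts can be double-checked numerically; a small sketch using cut() and table() (the bin edges mirror the ranges discussed above, so boundary handling may differ slightly from hist()’s default bins):

#count the points in the ranges read off the histogram
table(cut(data$Crime, breaks = c(0, 500, 1500, Inf)))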

A closer look at the data reveals that the 26th row has the highest crime rate of 1993, which is over 1087 above the mean (about 2.81 standard deviations) and over 1160 above the median (about 3.00 standard deviations).

#identify the index of the potential high-value outlier
which(data$Crime == max(data$Crime))
## [1] 26
#calculate the difference between the max value and the mean and the median value on crime per 100,000 people
(max(data$Crime) - mean(data$Crime))/sd(data$Crime)
## [1] 2.812874
(max(data$Crime) - median(data$Crime))/sd(data$Crime)
## [1] 3.004426

Furthermore, the grubbs.test() function in the outliers R package can be used to test for outliers more formally. Note, however, that the Grubbs test assumes the data are normally distributed (reference: https://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/grubtest.htm), so let’s first take a look at the crime data’s normality.

#Use q-q plot to inspect normality
qqnorm(data$Crime, pch = 1, frame = FALSE)
qqline(data$Crime, col = "steelblue", lwd = 2)

A Q-Q plot (quantile-quantile plot) compares the quantiles of a given sample against the quantiles of the normal distribution; a roughly linear pattern implies an approximately normally distributed data set. From the plot above, the theoretical and sample quantiles follow a linear pattern for most of the data points, except for several points near the tail. Therefore, we can treat the crime column as largely normally distributed with a few exceptions, and apply the grubbs.test() function.
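
As an optional numeric complement to the visual check, a Shapiro-Wilk normality test could also be run (a sketch using base R’s shapiro.test; output not shown, and we rely mainly on the Q-Q plot here):

#formal normality test; a small p-value would suggest departure from normality
shapiro.test(data$Crime)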

#load the package (run install.packages("outliers") first if it is not installed)
library(outliers)
#set opposite = FALSE to test the value with the largest difference from the mean, and type = 10 to test for a single outlier first (the max value, on the high side).
#The null hypothesis here is that the max point is not an outlier.
grubbs.test(data$Crime, type = 10, opposite = FALSE, two.sided = FALSE)
## 
##  Grubbs test for one outlier
## 
## data:  data$Crime
## G = 2.81287, U = 0.82426, p-value = 0.07887
## alternative hypothesis: highest value 1993 is an outlier

For the hypothesis test here, at a 0.05 alpha the p-value (0.079) is larger than alpha, so strictly speaking the null hypothesis cannot be rejected and the max value is not flagged as an outlier, even though it is close. At a 0.1 alpha, however, the p-value is significant, the null hypothesis is rejected, and the max point is identified as an outlier. Judging from the earlier box plot, the latter conclusion reflects the data distribution more accurately.
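
For comparison, the low side could be checked the same way (a sketch, output not shown): setting opposite = TRUE makes grubbs.test() test the value on the opposite tail, here the minimum.

#check the minimum value on the low side
grubbs.test(data$Crime, type = 10, opposite = TRUE)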

For completeness, let’s also try the type = 11 test, which tests for two outliers on opposite tails. The null hypothesis is that the min and max points are not both outliers at the same time. It is expected to be non-significant.

#Try with a one-sided test first.
grubbs.test(data$Crime, type = 11, opposite = FALSE, two.sided = FALSE)
## 
##  Grubbs test for two opposite outliers
## 
## data:  data$Crime
## G = 4.26877, U = 0.78103, p-value = 1
## alternative hypothesis: 342 and 1993 are outliers

The p-value is well above either 0.05 or 0.1; therefore, the min and max points are not outliers at the same time, though this does not mean there is no outlier at all.

In summary, one high-value outlier is identified in the crime data: the 26th point, with a value of 1993, which is 2.81 standard deviations above the mean and 3.00 standard deviations above the median.

Use Case: Change Detection in Time Series Data

What are some situations or problems from everyday life for which a change detection model would be appropriate? A cumulative sum chart (CUSUM) technique could be applied there, with an appropriate choice of critical value and threshold.

Detecting Stock Market Changes

A common scenario where a change detection model would be appropriate is analyzing the stock market. Say we use a market index to monitor and track the performance of stock markets, and our goal is to identify a potential market crisis at the earliest point, so as to prepare for it. Generally, the index experiences many ups and downs within a single trading day, possibly within a min-max band set by the financial regulators.

To accomplish this, a change detection model can be introduced. The critical value and the threshold are the two major parameters of the model, and they embody the trade-off behind it: on one side, the potential damage of detecting a crisis late or missing it entirely; on the other, the costs incurred by falsely reporting a crisis before it actually happens. Generally speaking, the smaller the critical value or the threshold, the more sensitive the model is: it tends to detect changes earlier, at the risk of mistaking early random fluctuations for the actual change. Conversely, the larger the critical value or the threshold, the less sensitive the model is: it is more likely to detect the change late, at the risk of missing the actual change altogether.

As a rule of thumb, a typical threshold value T is 5 standard deviations of the data, and a typical critical value C is half a standard deviation. However, it may take some experimentation to find a reliable threshold and critical value. Where historic data are available, we may need to build multiple change detection models and compare their detections against the actual historic events, to find the most reliable model among them.
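
To make this concrete, below is a minimal CUSUM sketch for the stock scenario; the index vector is made-up illustrative data, and using the overall mean as the baseline is an assumption for demonstration only:

#minimal CUSUM sketch for detecting a drop in a market index
#(the index values below are made up purely for illustration)
index <- c(100, 101, 99, 102, 100, 98, 95, 93, 90, 88)
mu <- mean(index)      #baseline level (assumed here; a pre-crisis mean could be used instead)
c0 <- 0.5 * sd(index)  #critical value: rule-of-thumb half a standard deviation
t0 <- 5 * sd(index)    #threshold: rule-of-thumb five standard deviations
s <- rep(0, length(index))
for (i in 2:length(index)) {
  #detecting a decrease, so accumulate (mu - x_i - c0), floored at 0
  s[i] <- max(0, s[i - 1] + (mu - index[i] - c0))
}
which(s >= t0)[1]      #first day the drop is flagged (NA if the threshold is never reached)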

Use Case: Climate Change Pattern Recognition: Atlanta’s Case


With July through October daily-high-temperature data for Atlanta for 1996 through 2015, a CUSUM approach can be applied to identify when unofficial summer ends (i.e., when the weather starts cooling off) each year. The data can be accessed at http://www.iweathernet.com/atlanta-weather-records or https://www.wunderground.com/history/airport/KFTY/2015/7/1/CustomHistory.html.

#load the data
temp_data <- read.table("temps.txt", header = TRUE)
head(temp_data)
##     DAY X1996 X1997 X1998 X1999 X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007
## 1 1-Jul    98    86    91    84    89    84    90    73    82    91    93    95
## 2 2-Jul    97    90    88    82    91    87    90    81    81    89    93    85
## 3 3-Jul    97    93    91    87    93    87    87    87    86    86    93    82
## 4 4-Jul    90    91    91    88    95    84    89    86    88    86    91    86
## 5 5-Jul    89    84    91    90    96    86    93    80    90    89    90    88
## 6 6-Jul    93    84    89    91    96    87    93    84    90    82    81    87
##   X2008 X2009 X2010 X2011 X2012 X2013 X2014 X2015
## 1    85    95    87    92   105    82    90    85
## 2    87    90    84    94    93    85    93    87
## 3    91    89    83    95    99    76    87    79
## 4    90    91    85    92    98    77    84    85
## 5    88    80    88    90   100    83    86    84
## 6    82    87    89    90    98    83    87    84
tail(temp_data)
##        DAY X1996 X1997 X1998 X1999 X2000 X2001 X2002 X2003 X2004 X2005 X2006
## 118 26-Oct    75    71    79    69    75    64    68    68    79    61    62
## 119 27-Oct    75    57    79    75    78    51    69    64    81    63    66
## 120 28-Oct    81    55    79    73    80    55    75    57    78    62    63
## 121 29-Oct    82    64    78    72    75    63    75    70    75    64    72
## 122 30-Oct    82    66    82    75    77    72    68    77    78    69    73
## 123 31-Oct    81    60    79    75    78    71    60    75    82    70    68
##     X2007 X2008 X2009 X2010 X2011 X2012 X2013 X2014 X2015
## 118    68    70    65    85    77    80    61    84    67
## 119    67    59    60    76    79    70    69    84    56
## 120    70    50    71    74    74    56    64    77    78
## 121    62    59    75    68    59    56    75    73    70
## 122    67    65    66    71    61    56    78    68    70
## 123    71    67    69    75    65    65    74    63    62
nrow(temp_data)
## [1] 123

Let’s first do a quick visual inspection of the temperature trend over the days of 1996.

library(ggplot2)
#convert the row index into a numeric field. Reference: https://stackoverflow.com/questions/25833762/how-do-i-use-index-number-as-x-in-ggplot
temp_data$idu <- as.numeric(row.names(temp_data))

#plot the daily-high-temperature data from July to October in 1996, use the row index as the x-axis
ggplot(data=temp_data, aes(x=idu, y=X1996, group=1)) +
  geom_line(color="blue")+
  geom_point()


The overall time series shows a generally downward trend, representing gradually cooling weather. From the graph, roughly 75 days after July 1st the daily high temperatures stay below 90 degrees. However, to identify a specific end-of-summer date, we can apply the Cumulative Sum Control Chart (CUSUM) approach.

Mathematically, a CUSUM model accumulates the differences between each data point and a baseline mean, adjusted by a critical value, and compares this running sum against a predetermined threshold to decide when a change has happened.
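
In symbols (a reconstruction matching the code below; the sign is flipped relative to the usual increase-detection form because we are detecting a decrease in temperature):

$$s_0 = 0, \qquad s_t = \max\{0,\ s_{t-1} + (\mu - x_t - C)\}$$

A change is flagged on the first day $t$ with $s_t \ge T$, where $x_t$ is the day-$t$ high temperature, $\mu$ is the mean over the assumed pre-change (summer) window, $C$ is the critical value, and $T$ is the threshold.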

Let’s first calculate the daily s(t) series for the 1996 data.

#Atlanta is located in a subtropical climate zone. Its summer spans from May to September. Mid September (about 75 days since July 1st) would be the rough time when Summer ends. We are trying to identify a specific date around that time as the change date. Reference:https://www.weather-us.com/en/georgia-usa/atlanta-climate

#Create a new vector to store the s values
s <- rep(0,123)
#initialize the first value as 0
s[[1]] <- 0
#use a C value of 1/2 of the standard deviation
c0 <- 0.5*sd(temp_data$X1996)
#calculate the mean value of the observed proper "summer" high temperature, or the first 75 days of daily high temperatures till mid September.
mean_temp0 <- mean(temp_data$X1996[1:75])
#calculate the s values by day.
#Note that we are detecting the change in a decreasing time series trend.
for (i in 2:123){
  s[[i]] <- max(0, s[[i-1]] + (mean_temp0 - temp_data$X1996[i] - c0))
}
s
##   [1]   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000
##   [7]   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000
##  [13]   0.0000000   0.0000000   2.5391637   0.0000000   0.0000000   0.0000000
##  [19]   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000
##  [25]   0.5391637   1.0783274   3.6174911   9.1566549   3.6958186   0.0000000
##  [31]   0.0000000   0.0000000   0.0000000   0.5391637   0.0000000   0.0000000
##  [37]   0.0000000   0.0000000   0.5391637   0.0000000   0.0000000   0.0000000
##  [43]   0.0000000   0.5391637   0.0000000   0.0000000   0.0000000   0.0000000
##  [49]   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000
##  [55]   0.0000000   0.5391637   0.0000000   0.5391637   0.0000000   0.0000000
##  [61]   0.5391637   3.0783274   7.6174911  19.1566549  16.6958186  17.2349823
##  [67]  14.7741460  10.3133097   5.8524734   1.3916372   0.0000000   0.5391637
##  [73]   0.0000000   0.0000000   6.5391637  12.0783274  10.6174911  13.1566549
##  [79]  15.6958186  22.2349823  27.7741460  33.3133097  39.8524734  43.3916372
##  [85]  43.9308009  44.4699646  42.0091283  42.5482920  48.0874557  57.6266194
##  [91]  70.1657832  90.7049469 109.2441106 121.7832743 122.3224380 136.8616017
##  [97] 155.4007655 175.9399292 200.4790929 207.0182566 221.5574203 234.0965840
## [103] 249.6357477 265.1749115 276.7140752 282.2532389 285.7924026 290.3315663
## [109] 292.8707300 311.4098938 332.9490575 349.4882212 355.0273849 358.5665486
## [115] 374.1057123 385.6448760 397.1840398 406.7232035 416.2623672 419.8015309
## [121] 422.3406946 424.8798583 428.4190221
#As a rule of thumb, let's first use a threshold value t of 5*sd
t0 <- 5*sd(temp_data$X1996)

#plot out the s values over time and the t threshold value
plot(s)
abline(h=t0, col="blue")

#identify the first day s reaches t
min(which(s>=t0))
## [1] 84

Therefore, the 84th day (counting July 1st as day 1), i.e. September 22nd, is identified by the CUSUM model as the end-of-summer day in 1996.
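
As a quick check (a one-liner sketch), the calendar date can be read directly off the DAY column:

#read the detected day's date from the DAY column
temp_data$DAY[min(which(s >= t0))]   #expected: "22-Sep"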

Now, let’s wrap this process into a callable function.

#Global settings
#store all the s values by year in a matrix
s_all <- data.frame(matrix(ncol = 21, nrow = 123))
colnames(s_all) <- colnames(temp_data)[1:21]

#The function
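#(note: R copies s_all locally inside the function, so the global s_all is not
#updated; only the detected day index is returned)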
summer_end <- function(i,c,t){
  
  s_all[,i][1] <- 0
  
  temp_mean <- mean(temp_data[,i][1:75])

  for (j in 2:123){
    s_all[,i][j] <- max(0, s_all[,i][j-1] + temp_mean - temp_data[,i][j] - c)
  }
  
  return (min(which(s_all[,i]>=t)))
  
}

#test on a few years
# summer_end(2,0.5*sd(temp_data[,2]),5*sd(temp_data[,2]))
# summer_end(3,0.5*sd(temp_data[,3]),5*sd(temp_data[,3]))
# summer_end(4,0.5*sd(temp_data[,4]),5*sd(temp_data[,4]))
# summer_end(21,0.5*sd(temp_data[,21]),5*sd(temp_data[,21]))

Now run the function across the years. Note that for each year, the c and t values are recalculated based on the standard deviation of that specific year’s temperature data.

#Store the detected end of summer day for each year
end_summer <- rep(0,20)

for (i in 2:21){
    end_summer[[i-1]] <- summer_end(i,0.5*sd(temp_data[,i]),5*sd(temp_data[,i]))
}

end_summer
##  [1]  84  89  92  84  69  89  87  90  80 101  84  83  88  94  91  70  92  48  88
## [20]  84

The loop above generates an end_summer vector storing the detected end-of-summer day indices. Now let’s convert them into actual dates.

#store the actual dates in a new vector
end_summer_dates <- vector(length(end_summer), mode = "list")
#assign the date
for (i in 1:20){
  end_summer_dates[[i]] <- temp_data$DAY[end_summer[[i]]]
}

head(end_summer_dates)
## [[1]]
## [1] "22-Sep"
## 
## [[2]]
## [1] "27-Sep"
## 
## [[3]]
## [1] "30-Sep"
## 
## [[4]]
## [1] "22-Sep"
## 
## [[5]]
## [1] "7-Sep"
## 
## [[6]]
## [1] "27-Sep"

Plot out the detected end of summer days by year.

#get a clean year column
years <- sub('.', '', colnames(temp_data)[2:21])
years
##  [1] "1996" "1997" "1998" "1999" "2000" "2001" "2002" "2003" "2004" "2005"
## [11] "2006" "2007" "2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015"
#create a data frame for plotting
end_summer_df <- data.frame(cbind(year = years, days = as.numeric(end_summer)))
end_summer_df
##    year days
## 1  1996   84
## 2  1997   89
## 3  1998   92
## 4  1999   84
## 5  2000   69
## 6  2001   89
## 7  2002   87
## 8  2003   90
## 9  2004   80
## 10 2005  101
## 11 2006   84
## 12 2007   83
## 13 2008   88
## 14 2009   94
## 15 2010   91
## 16 2011   70
## 17 2012   92
## 18 2013   48
## 19 2014   88
## 20 2015   84
#plot the detected end of summer day by year
ggplot(data=end_summer_df, aes(x=years, y=as.numeric(days), group=1)) +
  geom_line(color="orange",lwd=1.5)+
  geom_point()+
  geom_text(
    label=end_summer_dates, 
    nudge_x = 0.25, nudge_y = 0.25, 
    check_overlap = T
  )

The graph above presents the detected end-of-summer day for each year from 1996 to 2015, labeled with the actual date. Most of the detected days fall in late September, between Sep 20th and Sep 30th. This makes sense: Atlanta has a humid subtropical climate (Köppen classification), with summer spanning from May to September. There are a few years where the detected end-of-summer day falls in October (2005, 2009) or August (2013).

Let’s try adjusting the c and t values and running the detection again. Note that the smaller the threshold t and the critical value c, the more sensitive the change detection model is, and it is expected to detect the change earlier.

#use a smaller c and t
#Store the detected end of summer day for each year
end_summer1 <- rep(0,20)

for (i in 2:21){
    end_summer1[[i-1]] <- summer_end(i,0.25*sd(temp_data[,i]),2.5*sd(temp_data[,i]))
}

end_summer1
##  [1] 64 87 48 13 34 65 62 69 43 98 71  8 57 63 89 68 61  7 20 61
#store the actual dates in a new vector
end_summer_dates1 <- vector(length(end_summer1), mode = "list")
#assign the date
for (i in 1:20){
  end_summer_dates1[[i]] <- temp_data$DAY[end_summer1[[i]]]
}

#create a data frame for plotting
end_summer_df1 <- data.frame(cbind(year = years, days = as.numeric(end_summer1)))
end_summer_df1
##    year days
## 1  1996   64
## 2  1997   87
## 3  1998   48
## 4  1999   13
## 5  2000   34
## 6  2001   65
## 7  2002   62
## 8  2003   69
## 9  2004   43
## 10 2005   98
## 11 2006   71
## 12 2007    8
## 13 2008   57
## 14 2009   63
## 15 2010   89
## 16 2011   68
## 17 2012   61
## 18 2013    7
## 19 2014   20
## 20 2015   61
#plot the detected end of summer day by year
ggplot(data=end_summer_df1, aes(x=years, y=as.numeric(days), group=1)) +
  geom_line(color="orange",lwd=1.5)+
  geom_point()+
  geom_text(
    label=end_summer_dates1, 
    nudge_x = 0.25, nudge_y = 0.25, 
    check_overlap = T
  )

From the graph above, these detection results do not look as good as the previous ones: there are multiple detected end-of-summer days in early and mid July (1999, 2007, 2013, 2014). This contradicts common knowledge and perception, so the c and t values were likely set too small, causing the model to falsely report an early random low temperature as the sign of the overall drop in temperature.

Is Atlanta getting warmer?

There are multiple ways to approach this question. First, one could look at the daily high temperature for the same summer dates across years and inspect whether there is an increasing trend over these data points. Alternatively, one could calculate a mean daily high temperature over each year’s summer days (July 1st to the detected end-of-summer day) and see whether this summer mean is increasing over the years. For simplicity, let’s apply CUSUM using the second approach.
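
For reference, here is a rough sketch of the first approach (a simple linear-trend check rather than CUSUM; restricting to the July rows is an assumed choice for illustration):

#first approach (sketch): regress each year's mean July daily high on the year
july_mean <- colMeans(temp_data[1:31, 2:21])  #rows 1-31 are the July days
trend <- lm(july_mean ~ as.numeric(years))
coef(trend)  #a positive year coefficient would hint at warming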

#store the summer days mean temperature in a vector
summer_mean <- rep(0,20)

#get the mean daily high temperature for the summer days for each year
for (i in 2:21){
  summer_mean[[i-1]] <- mean(temp_data[,i][1:end_summer[i-1]])  
}

summer_mean
##  [1] 87.91667 85.64045 86.84783 88.44048 89.98551 85.15730 88.08046 84.75556
##  [9] 85.12500 85.86139 87.86905 89.44578 86.31818 84.73404 90.68132 91.55714
## [17] 88.53261 84.62500 86.63636 87.91667

Plot out the summer days mean daily high temperature over the years.

plot(as.numeric(years), summer_mean, main="Mean Daily High Temperature in Summer Days over 1996 - 2015", xlab="Year", ylab="Temperature (F)", xlim=c(1995,2016), ylim=c(75,100))
lines(as.numeric(years), summer_mean, lwd=2, col="orange")
text(as.numeric(years), summer_mean, round(summer_mean,2), cex=0.6, pos=4, col="blue")

From the plot above it is not clear whether the summer mean temperature is going up. Therefore, let’s construct a CUSUM series called s_mean.

#Global settings
#store all the s_mean values by year in a matrix
s_mean <- rep(0,20)
#initialize the first value
s_mean[[1]] <- 0

#loop through the 20 years. Note that we are detecting an increasing trend here.
temp_increase <- function(c,t){
  for (j in 2:20){
    s_mean[[j]] <- max(0, s_mean[[j-1]] + summer_mean[[j]] - mean(summer_mean) - c)
  }
  return(min(which(s_mean >= t)))  
}

#testing with 10 different c and t pairs (a result of Inf means the threshold was never reached, i.e. no change was detected)
temp_increase(0.5*sd(summer_mean),5*sd(summer_mean))
## Warning in min(which(s_mean >= t)): no non-missing arguments to min; returning
## Inf
## [1] Inf
temp_increase(0.25*sd(summer_mean),5*sd(summer_mean))
## Warning in min(which(s_mean >= t)): no non-missing arguments to min; returning
## Inf
## [1] Inf
temp_increase(0.4*sd(summer_mean),4*sd(summer_mean))
## Warning in min(which(s_mean >= t)): no non-missing arguments to min; returning
## Inf
## [1] Inf
temp_increase(0.2*sd(summer_mean),4*sd(summer_mean))
## Warning in min(which(s_mean >= t)): no non-missing arguments to min; returning
## Inf
## [1] Inf
temp_increase(0.3*sd(summer_mean),3*sd(summer_mean))
## [1] 16
temp_increase(0.15*sd(summer_mean),3*sd(summer_mean))
## [1] 16
temp_increase(0.2*sd(summer_mean),2*sd(summer_mean))
## [1] 16
temp_increase(0.1*sd(summer_mean),2*sd(summer_mean))
## [1] 16
temp_increase(0.1*sd(summer_mean),1*sd(summer_mean))
## [1] 5
temp_increase(0.05*sd(summer_mean),1*sd(summer_mean))
## [1] 5

The above results show that, with smaller c and t values, the CUSUM approach identifies the start of the warming as either the 16th (2011) or the 5th (2000) year since 1996. For the last two pairs, where both c and t are very small, the models may be too sensitive and may have mistakenly reported a random increase in temperature in 2000 as the start of the warming trend. Therefore, 2011 is detected as the first year during the 1996 - 2015 period when Atlanta’s summer climate got warmer.
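
To make the detection more transparent, the CUSUM path for one of the (c, t) pairs above can also be plotted; a quick sketch using the 0.2*sd / 2*sd pair:

#visualize the CUSUM path for one (c, t) pair from the tests above
c1 <- 0.2*sd(summer_mean)
t1 <- 2*sd(summer_mean)
s_path <- rep(0,20)
for (j in 2:20){
  s_path[[j]] <- max(0, s_path[[j-1]] + summer_mean[[j]] - mean(summer_mean) - c1)
}
plot(as.numeric(years), s_path, type="b", xlab="Year", ylab="CUSUM s value")
abline(h=t1, col="blue")  #the first crossing (the 16th year, 2011) marks the detected change year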