Using crime data from the file uscrime.txt (http://www.statsci.org/data/general/uscrime.txt, description at http://www.statsci.org/data/general/uscrime.html), test to see whether there are any outliers in the last column (number of crimes per 100,000 people).
#set seed to make the results reproducible
set.seed(1)
#clear the environment
rm(list = ls())
#load the data
data <- read.table("uscrime.txt",header = TRUE)
nrow(data)
## [1] 47
head(data)
## M So Ed Po1 Po2 LF M.F Pop NW U1 U2 Wealth Ineq Prob
## 1 15.1 1 9.1 5.8 5.6 0.510 95.0 33 30.1 0.108 4.1 3940 26.1 0.084602
## 2 14.3 0 11.3 10.3 9.5 0.583 101.2 13 10.2 0.096 3.6 5570 19.4 0.029599
## 3 14.2 1 8.9 4.5 4.4 0.533 96.9 18 21.9 0.094 3.3 3180 25.0 0.083401
## 4 13.6 0 12.1 14.9 14.1 0.577 99.4 157 8.0 0.102 3.9 6730 16.7 0.015801
## 5 14.1 0 12.1 10.9 10.1 0.591 98.5 18 3.0 0.091 2.0 5780 17.4 0.041399
## 6 12.1 0 11.0 11.8 11.5 0.547 96.4 25 4.4 0.084 2.9 6890 12.6 0.034201
## Time Crime
## 1 26.2011 791
## 2 25.2999 1635
## 3 24.3006 578
## 4 29.9012 1969
## 5 21.2998 1234
## 6 20.9995 682
First, let’s do some quick plots to inspect the last column, which is the number of crimes per 100,000 people.
plot(data$Crime, ylim = c(0,2500))
text(data$Crime, labels = data$Crime, cex = 0.8, font = 2, pos = 3)
From the above scatter plot, it looks like there is one high-value outlier somewhere between the 25th and 30th points, with a crime rate of 1993 per 100,000 people. Let's also create a couple more plots:
#this plot stacks all the crime values along a single vertical line to show their distribution
plot(rep(0,length(data$Crime)),data$Crime)
The above plot shows that the majority of the data values are concentrated in the 500 - 1500 range, with a few points above or below that may be outliers. Let's also use the boxplot() function to visualize the distribution; a box plot summarizes the spread of the data and highlights points that fall far outside the interquartile range.
boxplot(data$Crime)
This box plot clearly shows that the outliers are on the high value side, well above the value of 1500.
#histogram plot
hist(data$Crime)
This histogram also shows that most of the 47 data points are concentrated in the 500 - 1500 range, with a few points below 500 and a few above 1500.
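These counts can be verified directly:
#count how many points fall below 500 or above 1500 crimes per 100,000 people
sum(data$Crime < 500)
sum(data$Crime > 1500)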
A closer look at the data reveals that the 26th row has the highest crime rate of 1993, which is over 1087 above the mean (about 2.81 standard deviations) and over 1160 above the median (about 3 standard deviations).
#identify the index of the potential high-value outlier
which(data$Crime == max(data$Crime))
## [1] 26
#calculate the difference between the max value and the mean and the median value on crime per 100,000 people
(max(data$Crime) - mean(data$Crime))/sd(data$Crime)
## [1] 2.812874
(max(data$Crime) - median(data$Crime))/sd(data$Crime)
## [1] 3.004426
Furthermore, the grubbs.test() function in the outliers R package can be used to identify outliers more formally. However, note that the Grubbs test assumes the data is approximately normally distributed (reference: https://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/grubtest.htm), so let's first check the normality of the crime data.
#Use q-q plot to inspect normality
qqnorm(data$Crime, pch = 1, frame = FALSE)
qqline(data$Crime, col = "steelblue", lwd = 2)
A Q-Q plot (quantile-quantile plot) compares the quantiles of a sample against those of a theoretical normal distribution; a roughly linear pattern implies the data is approximately normally distributed. In the plot above, most of the points follow a linear pattern, except for several near the upper tail. Therefore, we can treat the crime column as largely normally distributed, with a few exceptions, and apply the grubbs.test() function.
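Before applying the Grubbs test, a supplementary numerical check of normality could also be run; a quick sketch with the Shapiro-Wilk test from base R (a small p-value would suggest a departure from normality):
#optional check: Shapiro-Wilk normality test on the crime column
shapiro.test(data$Crime)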
#load the outliers package (install first with install.packages("outliers") if needed)
library(outliers)
?grubbs.test
#opposite = FALSE checks the value with the largest difference from the mean; type = 10 tests for a single outlier, here on the high side.
#The null hypothesis here is that the max point is not an outlier.
grubbs.test(data$Crime, type = 10, opposite = FALSE, two.sided = FALSE)
##
## Grubbs test for one outlier
##
## data: data$Crime
## G = 2.81287, U = 0.82426, p-value = 0.07887
## alternative hypothesis: highest value 1993 is an outlier
For the hypothesis test here, at a 0.05 alpha the p-value is larger than 0.05 (though fairly close), so strictly speaking the null hypothesis cannot be rejected and the max value is not declared an outlier. At a 0.1 alpha, however, the p-value is significant, the null hypothesis is rejected, and the max point is an outlier. Given the earlier box plot, the latter conclusion reflects the data distribution more accurately.
For completeness, let's also try the type = 11 test, which looks for two outliers on opposite tails. Its null hypothesis is that the minimum and maximum points are not both outliers. It is expected to be non-significant.
#test for two outliers on opposite tails
grubbs.test(data$Crime, type = 11, opposite = FALSE, two.sided = FALSE)
##
## Grubbs test for two opposite outliers
##
## data: data$Crime
## G = 4.26877, U = 0.78103, p-value = 1
## alternative hypothesis: 342 and 1993 are outliers
The p-value is well above either 0.05 or 0.1, so we cannot conclude that the minimum and maximum are both outliers; this does not, however, rule out a single outlier.
Therefore, one high-value outlier is identified in the crime data: the 26th point, with a value of 1993, which is 2.81 standard deviations above the mean and about 3 standard deviations above the median.
What are some of the situations or problems from everyday life, for which a Change Detection model would be appropriate? A Cumulative Sum Chart (CUSUM) technique could be applied there, with the appropriate choice of critical value and the threshold.
A common scenario where a Change Detection model would be appropriate is monitoring the stock market. Suppose we use a market index to track the performance of the stock market, with the goal of identifying a potential market crisis as early as possible so as to prepare for it. A stock market index typically moves up and down many times within a single trading day, possibly within limits set by financial regulators.
To accomplish this, a Change Detection model can be introduced. The critical value and the threshold are its two major parameters, and they encode a trade-off: the potential damage of detecting a crisis too late (or missing it entirely) versus the cost of falsely reporting a crisis that has not actually begun. Generally, the smaller the critical value or the threshold, the more sensitive the model is and the earlier it tends to detect changes, with the risk of flagging early random fluctuations as a real change. Conversely, the larger the critical value or the threshold, the less sensitive the model is and the more likely it is to detect the change late, or to miss reporting it in time.
As a rule of thumb, a typical threshold value T is 5 standard deviations of the data, and a typical critical value C is 1/2 of a standard deviation. However, it may take some experimentation to find a reliable threshold and critical value; we may need to build multiple change detection models and compare their detections against actual historical data to find the most reliable one.
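To make this trade-off concrete, below is a minimal sketch of a decrease-detecting CUSUM in R; the vector x stands for a hypothetical series of daily index values and mu for its assumed baseline (in-control) mean:
#illustrative sketch only: detect a sustained decrease in a hypothetical index series x
cusum_detect <- function(x, mu = mean(x), c = 0.5*sd(x), t = 5*sd(x)){
s <- 0
for (i in seq_along(x)){
#accumulate shortfalls below the baseline mean, less the slack value c
s <- max(0, s + (mu - x[i] - c))
#report the first day the cumulative evidence crosses the threshold
if (s >= t) return(i)
}
return(NA)
}
For example, cusum_detect(x) would return the index of the first day on which a sustained drop is flagged, or NA if no change is detected.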
__1.With July through October daily-high-temperature data for Atlanta for 1996 through 2015, a CUSUM approach can be applied to identify when unofficial summer ends (i.e., when the weather starts cooling off) each year. The data source can be accessed here at http://www.iweathernet.com/atlanta-weather-records or https://www.wunderground.com/history/airport/KFTY/2015/7/1/CustomHistory.html.__
#load the data
temp_data <- read.table("temps.txt", header = TRUE)
head(temp_data)
## DAY X1996 X1997 X1998 X1999 X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007
## 1 1-Jul 98 86 91 84 89 84 90 73 82 91 93 95
## 2 2-Jul 97 90 88 82 91 87 90 81 81 89 93 85
## 3 3-Jul 97 93 91 87 93 87 87 87 86 86 93 82
## 4 4-Jul 90 91 91 88 95 84 89 86 88 86 91 86
## 5 5-Jul 89 84 91 90 96 86 93 80 90 89 90 88
## 6 6-Jul 93 84 89 91 96 87 93 84 90 82 81 87
## X2008 X2009 X2010 X2011 X2012 X2013 X2014 X2015
## 1 85 95 87 92 105 82 90 85
## 2 87 90 84 94 93 85 93 87
## 3 91 89 83 95 99 76 87 79
## 4 90 91 85 92 98 77 84 85
## 5 88 80 88 90 100 83 86 84
## 6 82 87 89 90 98 83 87 84
tail(temp_data)
## DAY X1996 X1997 X1998 X1999 X2000 X2001 X2002 X2003 X2004 X2005 X2006
## 118 26-Oct 75 71 79 69 75 64 68 68 79 61 62
## 119 27-Oct 75 57 79 75 78 51 69 64 81 63 66
## 120 28-Oct 81 55 79 73 80 55 75 57 78 62 63
## 121 29-Oct 82 64 78 72 75 63 75 70 75 64 72
## 122 30-Oct 82 66 82 75 77 72 68 77 78 69 73
## 123 31-Oct 81 60 79 75 78 71 60 75 82 70 68
## X2007 X2008 X2009 X2010 X2011 X2012 X2013 X2014 X2015
## 118 68 70 65 85 77 80 61 84 67
## 119 67 59 60 76 79 70 69 84 56
## 120 70 50 71 74 74 56 64 77 78
## 121 62 59 75 68 59 56 75 73 70
## 122 67 65 66 71 61 56 78 68 70
## 123 71 67 69 75 65 65 74 63 62
nrow(temp_data)
## [1] 123
Let's first do a quick visual inspection of the temperature trend over the days in 1996.
library(ggplot2)
#convert the row index into a numeric field. Reference: https://stackoverflow.com/questions/25833762/how-do-i-use-index-number-as-x-in-ggplot
temp_data$idu <- as.numeric(row.names(temp_data))
#plot the daily-high-temperature data from July to October in 1996, use the row index as the x-axis
ggplot(data=temp_data, aes(x=idu, y=X1996, group=1)) +
geom_line(color="blue")+
geom_point()
The time series shows a generally downward trend, representing gradually cooling weather. From the graph, roughly 75 days after July 1st the daily high temperatures stay below 90 degrees. However, to identify a specific end-of-summer date, we can apply the Cumulative Sum Control Chart (CUSUM) approach.
Mathematically, a CUSUM model accumulates the differences between each observation and a baseline mean, adjusted by a critical value C, and compares the running sum against a predetermined threshold T to decide when a change has occurred.
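With s(1) = 0, the decrease-detecting form used below updates as s(t) = max(0, s(t-1) + (mu - x(t) - C)), where mu is the baseline mean and x(t) is the observation on day t; a change is flagged at the first day t where s(t) >= T.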
Let's first calculate a daily s(t) series for the 1996 data.
#Atlanta is located in a subtropical climate zone. Its summer spans from May to September. Mid September (about 75 days since July 1st) would be the rough time when Summer ends. We are trying to identify a specific date around that time as the change date. Reference:https://www.weather-us.com/en/georgia-usa/atlanta-climate
#Create a new vector to store the s values
s <- rep(0,123)
#initialize the first value as 0
s[[1]] <- 0
#use a C value of 1/2 of the standard deviation
c0 <- 0.5*sd(temp_data$X1996)
#calculate the mean value of the observed proper "summer" high temperature, or the first 75 days of daily high temperatures till mid September.
mean_temp0 <- mean(temp_data$X1996[1:75])
#calculate the s values by day.
#Note that we are detecting the change in a decreasing time series trend.
for (i in 2:123){
s[[i]] <- max(0, s[[i-1]] + (mean_temp0 - temp_data$X1996[i] - c0))
}
s
## [1] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## [7] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## [13] 0.0000000 0.0000000 2.5391637 0.0000000 0.0000000 0.0000000
## [19] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## [25] 0.5391637 1.0783274 3.6174911 9.1566549 3.6958186 0.0000000
## [31] 0.0000000 0.0000000 0.0000000 0.5391637 0.0000000 0.0000000
## [37] 0.0000000 0.0000000 0.5391637 0.0000000 0.0000000 0.0000000
## [43] 0.0000000 0.5391637 0.0000000 0.0000000 0.0000000 0.0000000
## [49] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## [55] 0.0000000 0.5391637 0.0000000 0.5391637 0.0000000 0.0000000
## [61] 0.5391637 3.0783274 7.6174911 19.1566549 16.6958186 17.2349823
## [67] 14.7741460 10.3133097 5.8524734 1.3916372 0.0000000 0.5391637
## [73] 0.0000000 0.0000000 6.5391637 12.0783274 10.6174911 13.1566549
## [79] 15.6958186 22.2349823 27.7741460 33.3133097 39.8524734 43.3916372
## [85] 43.9308009 44.4699646 42.0091283 42.5482920 48.0874557 57.6266194
## [91] 70.1657832 90.7049469 109.2441106 121.7832743 122.3224380 136.8616017
## [97] 155.4007655 175.9399292 200.4790929 207.0182566 221.5574203 234.0965840
## [103] 249.6357477 265.1749115 276.7140752 282.2532389 285.7924026 290.3315663
## [109] 292.8707300 311.4098938 332.9490575 349.4882212 355.0273849 358.5665486
## [115] 374.1057123 385.6448760 397.1840398 406.7232035 416.2623672 419.8015309
## [121] 422.3406946 424.8798583 428.4190221
#As a rule of thumb, let's first use a threshold value t of 5*sd
t0 <- 5*sd(temp_data$X1996)
#plot out the s values over time and the t threshold value
plot(s)
abline(h=t0, col="blue")
#identify the first day s reaches t
min(which(s>=t0))
## [1] 84
Therefore, the 84th day (counting July 1st as day 1), i.e., September 22nd, is identified by the CUSUM model as the end of summer in 1996.
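As a quick sanity check, the detected index can be mapped back to its calendar date directly:
#map the detected day index back to its date (returns "22-Sep" for 1996)
temp_data$DAY[min(which(s >= t0))]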
Now, let's wrap this process into a reusable function.
#Global settings
#store all the s values by year in a matrix
s_all <- data.frame(matrix(ncol = 21, nrow = 123))
colnames(s_all) <- colnames(temp_data)[1:21]
#The function
summer_end <- function(i,c,t){
s_all[,i][1] <- 0
temp_mean <- mean(temp_data[,i][1:75])
for (j in 2:123){
s_all[,i][j] <- max(0, s_all[,i][j-1] + temp_mean - temp_data[,i][j] - c)
}
return (min(which(s_all[,i]>=t)))
}
#test on a few years
# summer_end(2,0.5*sd(temp_data[,2]),5*sd(temp_data[,2]))
# summer_end(3,0.5*sd(temp_data[,3]),5*sd(temp_data[,3]))
# summer_end(4,0.5*sd(temp_data[,4]),5*sd(temp_data[,4]))
# summer_end(21,0.5*sd(temp_data[,21]),5*sd(temp_data[,21]))
Now run the function across the years. Note that for each different year, the c value and the threshold value t will be recalculated each time based on the standard deviation of that specific year’s temperature data.
#Store the detected end of summer day for each year
end_summer <- rep(0,20)
for (i in 2:21){
end_summer[[i-1]] <- summer_end(i,0.5*sd(temp_data[,i]),5*sd(temp_data[,i]))
}
end_summer
## [1] 84 89 92 84 69 89 87 90 80 101 84 83 88 94 91 70 92 48 88
## [20] 84
From the loop above, an "end_summer" vector storing the detected end-of-summer day indices has been generated. Now let's convert them into actual dates.
#store the actual dates in a new vector
end_summer_dates <- vector(length(end_summer), mode = "list")
#assign the date
for (i in 1:20){
end_summer_dates[[i]] <- temp_data$DAY[end_summer[[i]]]
}
head(end_summer_dates)
## [[1]]
## [1] "22-Sep"
##
## [[2]]
## [1] "27-Sep"
##
## [[3]]
## [1] "30-Sep"
##
## [[4]]
## [1] "22-Sep"
##
## [[5]]
## [1] "7-Sep"
##
## [[6]]
## [1] "27-Sep"
Plot out the detected end of summer days by year.
#get a clean year column
years <- sub('.', '', colnames(temp_data)[2:21])
years
## [1] "1996" "1997" "1998" "1999" "2000" "2001" "2002" "2003" "2004" "2005"
## [11] "2006" "2007" "2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015"
#create a data frame for plotting
end_summer_df <- data.frame(cbind(year = years, days = as.numeric(end_summer)))
end_summer_df
## year days
## 1 1996 84
## 2 1997 89
## 3 1998 92
## 4 1999 84
## 5 2000 69
## 6 2001 89
## 7 2002 87
## 8 2003 90
## 9 2004 80
## 10 2005 101
## 11 2006 84
## 12 2007 83
## 13 2008 88
## 14 2009 94
## 15 2010 91
## 16 2011 70
## 17 2012 92
## 18 2013 48
## 19 2014 88
## 20 2015 84
#plot the detected end of summer day by year
ggplot(data=end_summer_df, aes(x=years, y=as.numeric(days), group=1)) +
geom_line(color="orange",lwd=1.5)+
geom_point()+
geom_text(
label=end_summer_dates,
nudge_x = 0.25, nudge_y = 0.25,
check_overlap = T
)
The above graph presents the detected end-of-summer day for each year from 1996 to 2015, labeled with the actual date. Most of the detected end-of-summer days fall in late September, between Sep 20th and Sep 30th. This makes sense given that Atlanta has a humid subtropical climate (according to the Köppen classification), with summer spanning roughly from May to September. There are a few years when the detected end of summer falls in October (2005, 2009) or August (2013).
Let's try adjusting the c and t values and run another detection. Note that the smaller the threshold t and the critical value c, the more sensitive the change detection model is, and it is expected to detect the change earlier.
#use a smaller c and t
#Store the detected end of summer day for each year
end_summer1 <- rep(0,20)
for (i in 2:21){
end_summer1[[i-1]] <- summer_end(i,0.25*sd(temp_data[,i]),2.5*sd(temp_data[,i]))
}
end_summer1
## [1] 64 87 48 13 34 65 62 69 43 98 71 8 57 63 89 68 61 7 20 61
#store the actual dates in a new vector
end_summer_dates1 <- vector(length(end_summer1), mode = "list")
#assign the date
for (i in 1:20){
end_summer_dates1[[i]] <- temp_data$DAY[end_summer1[[i]]]
}
#create a data frame for plotting
end_summer_df1 <- data.frame(cbind(year = years, days = as.numeric(end_summer1)))
end_summer_df1
## year days
## 1 1996 64
## 2 1997 87
## 3 1998 48
## 4 1999 13
## 5 2000 34
## 6 2001 65
## 7 2002 62
## 8 2003 69
## 9 2004 43
## 10 2005 98
## 11 2006 71
## 12 2007 8
## 13 2008 57
## 14 2009 63
## 15 2010 89
## 16 2011 68
## 17 2012 61
## 18 2013 7
## 19 2014 20
## 20 2015 61
#plot the detected end of summer day by year
ggplot(data=end_summer_df1, aes(x=years, y=as.numeric(days), group=1)) +
geom_line(color="orange",lwd=1.5)+
geom_point()+
geom_text(
label=end_summer_dates1,
nudge_x = 0.25, nudge_y = 0.25,
check_overlap = T
)
From the above graph, the detection results seem less plausible than the previous ones, because several detected end-of-summer days fall in early to mid July (1999, 2007, 2013, 2014). This contradicts common knowledge, so the c and t values may have been set too small, causing the change detection model to falsely report an early random low temperature as the start of the overall drop in temperature.
There are multiple ways to approach the question of whether Atlanta's summer climate has gotten warmer over this period. First, one can look at the daily high temperature on the same summer dates across years and inspect whether there is an increasing trend over those points (a brief sketch of this idea is shown below). Alternatively, one can calculate a mean daily high temperature over each year's summer days (July 1st to that year's detected end-of-summer day) and see whether this summer mean temperature is increasing over the years. For simplicity, the CUSUM analysis that follows uses the second approach.
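A minimal sketch of the first approach, under the simplifying assumption that a single fixed mid-summer date (here row 15, i.e., 15-Jul) is representative:
#track the daily high temperature on one fixed summer date across the 20 years
fixed_day_temp <- unlist(temp_data[15, 2:21])
plot(as.numeric(years), fixed_day_temp, type="b", xlab="Year", ylab="High temp on 15-Jul")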
#store the summer days mean temperature in a vector
summer_mean <- rep(0,20)
#get the mean daily high temperature for the summer days for each year
for (i in 2:21){
summer_mean[[i-1]] <- mean(temp_data[,i][1:end_summer[i-1]])
}
summer_mean
## [1] 87.91667 85.64045 86.84783 88.44048 89.98551 85.15730 88.08046 84.75556
## [9] 85.12500 85.86139 87.86905 89.44578 86.31818 84.73404 90.68132 91.55714
## [17] 88.53261 84.62500 86.63636 87.91667
Plot out the summer days mean daily high temperature over the years.
plot(years, summer_mean, main="Mean Daily High Temperature in Summer Days over 1996 - 2015", xlab="Year", ylab="temperature", xlim=c(1995,2016), ylim=c(75,100))
lines(years, summer_mean, lwd=2, col="orange")
text(years, summer_mean, round(summer_mean,2), cex=0.6, pos=4, col="blue")
From the above plot it's not obvious whether the summer mean temperature is trending upward. Therefore, let's construct a CUSUM time series called s_mean.
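For detecting an increase, the CUSUM update flips sign relative to the earlier formula: s(t) = max(0, s(t-1) + (x(t) - mu - C)), and a warming change is flagged at the first year t where s(t) >= T.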
#Global settings
#store all the s_mean values by year in a matrix
s_mean <- rep(0,20)
#initialize the first value
s_mean[[1]] <- 0
#loop through the 20 years. Note that we are detecting an increasing trend here.
temp_increase <- function(c,t){
for (j in 2:20){
s_mean[[j]] <- max(0, s_mean[[j-1]] + summer_mean[[j]] - mean(summer_mean) - c)
}
return(min(which(s_mean >= t)))
}
#testing with 10 different c and t pairs
temp_increase(0.5*sd(summer_mean),5*sd(summer_mean))
## Warning in min(which(s_mean >= t)): no non-missing arguments to min; returning
## Inf
## [1] Inf
temp_increase(0.25*sd(summer_mean),5*sd(summer_mean))
## Warning in min(which(s_mean >= t)): no non-missing arguments to min; returning
## Inf
## [1] Inf
temp_increase(0.4*sd(summer_mean),4*sd(summer_mean))
## Warning in min(which(s_mean >= t)): no non-missing arguments to min; returning
## Inf
## [1] Inf
temp_increase(0.2*sd(summer_mean),4*sd(summer_mean))
## Warning in min(which(s_mean >= t)): no non-missing arguments to min; returning
## Inf
## [1] Inf
temp_increase(0.3*sd(summer_mean),3*sd(summer_mean))
## [1] 16
temp_increase(0.15*sd(summer_mean),3*sd(summer_mean))
## [1] 16
temp_increase(0.2*sd(summer_mean),2*sd(summer_mean))
## [1] 16
temp_increase(0.1*sd(summer_mean),2*sd(summer_mean))
## [1] 16
temp_increase(0.1*sd(summer_mean),1*sd(summer_mean))
## [1] 5
temp_increase(0.05*sd(summer_mean),1*sd(summer_mean))
## [1] 5
The above results show that with smaller c and t values, the CUSUM approach identifies the start of the warming as either the 16th year (2011) or the 5th year (2000) since 1996. For the last two pairs, where both c and t are very small, the detection models may be too sensitive and may have mistakenly reported a random temperature increase in 2000 as the start of the warming trend. Therefore, 2011 is detected as the first year during the 1996 - 2015 period in which Atlanta's summer climate has gotten warmer.
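For a visual confirmation, the s_mean trajectory for one of the detecting parameter pairs (e.g., C = 0.3*sd, T = 3*sd) can be recomputed and plotted against its threshold; a minimal sketch reusing the same update rule:
#recompute the CUSUM series for one c/t pair and plot it against the threshold
c1 <- 0.3*sd(summer_mean)
t1 <- 3*sd(summer_mean)
s_vis <- rep(0,20)
for (j in 2:20){
s_vis[[j]] <- max(0, s_vis[[j-1]] + summer_mean[[j]] - mean(summer_mean) - c1)
}
plot(as.numeric(years), s_vis, type="b", xlab="Year", ylab="s value")
abline(h=t1, col="blue")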