Using crime data from the file uscrime.txt (http://www.statsci.org/data/general/uscrime.txt, description at http://www.statsci.org/data/general/uscrime.html), test to see whether there are any outliers in the last column (number of crimes per 100,000 people). Use the grubbs.test function in the outliers package in R.
library(knitr)
data1 <- read.delim("uscrime.txt", header = TRUE)
| M | So | Ed | Po1 | Po2 | LF | M.F | Pop | NW | U1 | U2 | Wealth | Ineq | Prob | Time | Crime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15.1 | 1 | 9.1 | 5.8 | 5.6 | 0.510 | 95.0 | 33 | 30.1 | 0.108 | 4.1 | 3940 | 26.1 | 0.084602 | 26.2011 | 791 |
| 14.3 | 0 | 11.3 | 10.3 | 9.5 | 0.583 | 101.2 | 13 | 10.2 | 0.096 | 3.6 | 5570 | 19.4 | 0.029599 | 25.2999 | 1635 |
| 14.2 | 1 | 8.9 | 4.5 | 4.4 | 0.533 | 96.9 | 18 | 21.9 | 0.094 | 3.3 | 3180 | 25.0 | 0.083401 | 24.3006 | 578 |
| 13.6 | 0 | 12.1 | 14.9 | 14.1 | 0.577 | 99.4 | 157 | 8.0 | 0.102 | 3.9 | 6730 | 16.7 | 0.015801 | 29.9012 | 1969 |
| 14.1 | 0 | 12.1 | 10.9 | 10.1 | 0.591 | 98.5 | 18 | 3.0 | 0.091 | 2.0 | 5780 | 17.4 | 0.041399 | 21.2998 | 1234 |
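library(stringr)
library(outliers)
# Crime rate is the last column of the data set
x <- data1[, ncol(data1)]
# type = 11 tests the minimum and maximum together, one value on each side
results <- grubbs.test(x, type = 11)
# Pull the two tested values out of the alternative-hypothesis text
vals <- as.numeric(str_extract_all(results$alternative, "[0-9]+")[[1]])
subdata <- subset(data1, Crime %in% vals)
boxplot(x, ylab = "Crime Value")
# Draw the tested values at x = 1 so they land on the single box
points(rep(1, nrow(subdata)), subdata$Crime, pch = 19, col = "red")
title(main = "Crime Outliers (Iteration 1)")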
NOTE: The points in red are the values reported by the grubbs.test function. With type = 11, the test considers one candidate outlier on each side of the data: the minimum and the maximum. The highest value was 1993 and the lowest was 342. Because 342 falls inside the whiskers of the boxplot, it isn't really an outlier, just the minimum of the values. On the high end, 1993 is the maximum and lies outside the whiskers. Right next to the 1993 point there is another point (1969) that should also be considered an outlier; however, grubbs.test only tests one value on each side unless type = 20 is specified, which tests two values on the same side. The problem with type = 20 is that it does not accept more than 30 values, and removing data values to fit that limit could skew the outlier results.
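To make the whisker comparison in the note concrete, here is a minimal sketch in base R. It assumes the default boxplot rule, where whiskers reach the most extreme points within 1.5 times the box height. The helper name `outside_whiskers` and the vector `x_demo` are hypothetical; `x_demo` holds made-up illustrative values, not the uscrime data.

```r
# Hypothetical helper: return the values that fall beyond the boxplot whiskers.
# boxplot.stats() is part of base R (grDevices); coef = 1.5 is its default.
outside_whiskers <- function(x, coef = 1.5) {
  boxplot.stats(x, coef = coef)$out
}

# Made-up illustrative values (NOT the uscrime data)
x_demo <- c(342, 578, 650, 700, 791, 850, 900, 1000, 1234, 1993)

flagged <- outside_whiskers(x_demo)

# Is the minimum beyond the whiskers? Is the maximum?
min_is_outside <- min(x_demo) %in% flagged
max_is_outside <- max(x_demo) %in% flagged
print(c(min = min_is_outside, max = max_is_outside))
```

Applied to the real Crime column, the same check would show which of the two values tested by grubbs.test also fall outside the whiskers.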
Describe a situation or problem from your job, everyday life, current events, etc., for which a Change Detection model would be appropriate. Applying the CUSUM technique, how would you choose the critical value and the threshold?
A change detection model would work well in a digital manufacturing setting where many sensors are used to monitor processes. Many facilities now use a Statistical Process Control (SPC) chart to determine whether process variation is out of control. This is usually done with various methods, all of which are based on the patterns that appear on an SPC chart. A change detection model would be more beneficial because its flexible parameters allow for more personalization based on a manufacturer's priorities.
That was a general example. A more specific one is a facility where higher temperatures increase the chance of defects. When the temperature reaches a level chosen according to where defects are most likely to occur, the facility can turn off the machines or increase the power of the fans to compensate. The threshold T would be chosen so that most of the temperatures associated with high defect percentages produce an S_t value higher than T, which increases the chance of a correct detection. The critical value C would depend on how much of a buffer is needed: if the facility doesn't normally see large swings in temperature, I would use a smaller C value, and vice versa.
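The roles of C and T can be illustrated with a minimal sketch of the standard one-sided CUSUM for detecting an increase, S_t = max(0, S_{t-1} + (x_t - mu - C)), with an alarm when S_t >= T. The function name `cusum_increase`, the baseline `mu = 70`, and the readings in `temps_demo` are all hypothetical values chosen for illustration, not measurements from any facility.

```r
# Hypothetical helper: one-sided CUSUM for detecting an increase.
# mu        = assumed in-control mean temperature
# C         = slack that absorbs ordinary variation
# threshold = alarm level (T in the text; named differently because T aliases TRUE in R)
cusum_increase <- function(x, mu, C, threshold) {
  S <- numeric(length(x))
  prev <- 0
  for (t in seq_along(x)) {
    S[t] <- max(0, prev + (x[t] - mu - C))
    prev <- S[t]
  }
  alarm <- which(S >= threshold)
  list(S = S, first_alarm = if (length(alarm) > 0) alarm[1] else NA_integer_)
}

# Made-up readings: stable around 70, then a sustained rise
temps_demo <- c(70, 71, 69, 70, 72, 78, 80, 81, 82)
res <- cusum_increase(temps_demo, mu = 70, C = 2, threshold = 15)
print(res$S)            # running CUSUM statistic
print(res$first_alarm)  # index of the first reading where S_t >= threshold
```

Raising C makes the statistic ignore larger ordinary fluctuations, while raising the threshold delays the alarm in exchange for fewer false detections.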
SPC chart example
1. Using July through October daily-high-temperature data for Atlanta for 1996 through 2015, use a CUSUM approach to identify when unofficial summer ends (i.e., when the weather starts cooling off) each year. You can get the data that you need from the file temps.txt or online, for example at http://www.iweathernet.com/atlanta-weather-records or https://www.wunderground.com/history/airport/KFTY/2015/7/1/CustomHistory.html . You can use R if you’d like, but it’s straightforward enough that an Excel spreadsheet can easily do the job too.
2. Use a CUSUM approach to make a judgment of whether Atlanta’s summer climate has gotten warmer in that time (and if so, when).
data2 <- read.delim("temps.txt", header = TRUE)
# Drop the DAY column and use it as the row names instead
data2N <- data2[-1]
row.names(data2N) <- data2$DAY
| | X1996 | X1997 | X1998 | X1999 | X2000 | X2001 | X2002 | X2003 | X2004 | X2005 | X2006 | X2007 | X2008 | X2009 | X2010 | X2011 | X2012 | X2013 | X2014 | X2015 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1-Jul | 98 | 86 | 91 | 84 | 89 | 84 | 90 | 73 | 82 | 91 | 93 | 95 | 85 | 95 | 87 | 92 | 105 | 82 | 90 | 85 |
| 2-Jul | 97 | 90 | 88 | 82 | 91 | 87 | 90 | 81 | 81 | 89 | 93 | 85 | 87 | 90 | 84 | 94 | 93 | 85 | 93 | 87 |
| 3-Jul | 97 | 93 | 91 | 87 | 93 | 87 | 87 | 87 | 86 | 86 | 93 | 82 | 91 | 89 | 83 | 95 | 99 | 76 | 87 | 79 |
| 4-Jul | 90 | 91 | 91 | 88 | 95 | 84 | 89 | 86 | 88 | 86 | 91 | 86 | 90 | 91 | 85 | 92 | 98 | 77 | 84 | 85 |
| 5-Jul | 89 | 84 | 91 | 90 | 96 | 86 | 93 | 80 | 90 | 89 | 90 | 88 | 88 | 80 | 88 | 90 | 100 | 83 | 86 | 84 |
Temp CUSUM Results
NOTE: The parameters I used for CUSUM were 5 for C and 20 for T. I obtained those values by tuning them on the 1996 data so that the CUSUM detection closely aligned with the actual end of the season. I kept them the same for the rest of the years so that I could compare how long the summer was in each year. The calculated length of summer is presented in the “DIFF” column. I started at 1-Jul for all the years since that’s where the data began, but the starting point doesn’t matter as long as it is the same for every year. Based on the results in the “DIFF” column, the longest summers were in the 2004-2009 period. I can’t give a definitive reason for that, but I’d start by checking whether there were any correlations with values elsewhere. The lowest length was 47, which is a very premature detection due to the lower T value used. Overall, the detections occurred near the end of the season, so there is definite value in using the CUSUM method. For further investigation, the Excel file with the results should be in the zip file.
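The spreadsheet calculation could be mirrored in R along the following lines. This is only a sketch: it uses the C = 5 and T = 20 values from the note, but the baseline mean (the average of the first few days of each year) is an assumption, since the note does not say which mean was used. The function name `cusum_decrease_day` and the two-column data frame `demo` are hypothetical, and the temperatures are made up rather than taken from temps.txt.

```r
# Hypothetical helper: first day on which a one-sided CUSUM for a DECREASE
# reaches the threshold, i.e. S_t = max(0, S_{t-1} + (mu - x_t - C)) >= threshold.
# ASSUMPTION: mu is the mean of the first `baseline_days` observations.
cusum_decrease_day <- function(x, baseline_days, C, threshold) {
  mu <- mean(x[seq_len(baseline_days)])
  S <- 0
  for (t in seq_along(x)) {
    S <- max(0, S + (mu - x[t] - C))
    if (S >= threshold) return(t)
  }
  NA_integer_  # no detection within the series
}

# Made-up daily highs for two hypothetical years (NOT from temps.txt)
demo <- data.frame(
  X_A = c(90, 91, 89, 90, 90, 84, 80, 78, 75, 74),
  X_B = c(88, 89, 88, 87, 88, 88, 87, 80, 76, 72)
)

# One detection day per year column, using the C and T values from the note
end_day <- sapply(demo, cusum_decrease_day,
                  baseline_days = 5, C = 5, threshold = 20)
print(end_day)
```

With the real data, the same `sapply` call over the year columns of data2N would give one detection index per year, which can be compared across years in the same way as the “DIFF” column.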