Question 5.1

Using crime data from the file uscrime.txt (http://www.statsci.org/data/general/uscrime.txt, description at http://www.statsci.org/data/general/uscrime.html), test to see whether there are any outliers in the last column (number of crimes per 100,000 people). Use the grubbs.test function in the outliers package in R.

Reading Data:

library(knitr)
# Read the tab-delimited crime data and preview the first rows (output shown below)
data1 <- read.delim("uscrime.txt", header = TRUE)
kable(head(data1, 5))
M So Ed Po1 Po2 LF M.F Pop NW U1 U2 Wealth Ineq Prob Time Crime
15.1 1 9.1 5.8 5.6 0.510 95.0 33 30.1 0.108 4.1 3940 26.1 0.084602 26.2011 791
14.3 0 11.3 10.3 9.5 0.583 101.2 13 10.2 0.096 3.6 5570 19.4 0.029599 25.2999 1635
14.2 1 8.9 4.5 4.4 0.533 96.9 18 21.9 0.094 3.3 3180 25.0 0.083401 24.3006 578
13.6 0 12.1 14.9 14.1 0.577 99.4 157 8.0 0.102 3.9 6730 16.7 0.015801 29.9012 1969
14.1 0 12.1 10.9 10.1 0.591 98.5 18 3.0 0.091 2.0 5780 17.4 0.041399 21.2998 1234
library(stringr)
library(outliers)

# Test the last column (Crime) for one outlier on each tail
x <- data1[, ncol(data1)]
results <- grubbs.test(x, type = 11)

# Pull the two candidate outlier values out of the test's alternative-hypothesis text
vals <- as.numeric(str_extract_all(results$alternative, "[0-9]+")[[1]])
subdata <- subset(data1, Crime %in% vals)

# Plot all crime values and mark the Grubbs candidates in red on the boxplot
boxplot(x, ylab = "Crime Value")
points(rep(1, nrow(subdata)), subdata$Crime, pch = 19, col = "red")
title(main = "Crime Outliers (Iteration 1)")

NOTE: The red points on the boxplot are the values flagged by the grubbs.test function. With a type parameter of 11, the test checks one candidate outlier on each tail: the highest value, 1993, and the lowest, 342. Since 342 falls inside the whiskers, it isn't really an outlier, just the minimum of the data. On the high end, 1993 is the maximum, and the point right next to it (1969) should probably also be considered an outlier. However, grubbs.test only flags one value per tail unless type 20 is specified, and type 20 does not accept more than 30 observations; removing data points just to make type 20 work could skew the outlier results.
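As a side note, one possible workaround for the type 20 limit (a minimal sketch I am adding here, not part of the original analysis) is to run the one-sided Grubbs test (type = 10) repeatedly, removing the flagged extreme each time, to see whether 1969 would also register as an outlier. The 0.05 cutoff below is an assumption.

# Sketch: iterative one-sided Grubbs test; assumes data1 and the outliers package are loaded as above
x <- data1$Crime
repeat {
  g <- grubbs.test(x, type = 10)   # tests the single most extreme value
  cat(g$alternative, "-- p-value:", round(g$p.value, 4), "\n")
  if (g$p.value >= 0.05) break     # stop once the extreme value is no longer significant
  # drop whichever end was flagged and re-test the remaining values
  extreme <- if (grepl("highest", g$alternative)) max(x) else min(x)
  x <- x[x != extreme]
}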

Question 6.1

Describe a situation or problem from your job, everyday life, current events, etc., for which a Change Detection model would be appropriate. Applying the CUSUM technique, how would you choose the critical value and the threshold?

A change detection model would work well in a digital manufacturing setting where many sensors are used to monitor processes. Many facilities now use Statistical Process Control (SPC) charts to determine whether process variation is out of control, usually through rules based on the patterns that appear on the chart. A change detection model would be more beneficial because its parameters are flexible, allowing more personalization based on a manufacturer's priorities.

That was a general example; a more specific one would be a facility where rising temperature increases the chance of defects. When the CUSUM statistic signals that temperature has drifted toward the range where defects are most likely, the facility can turn off the machines or increase the power of the fans to compensate. I would choose the threshold T so that most of the temperatures associated with high defect percentages produce an S_t above T, which increases the chance of a correct detection. The critical value C would depend on how much of a buffer is needed: if the facility's temperature does not swing much, I would use a smaller C, and vice versa.
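To make the roles of C and T concrete, here is a minimal sketch of the increase-detecting CUSUM recursion, S_t = max(0, S_{t-1} + (x_t - mu - C)), with a change flagged once S_t exceeds the threshold. The temperature series and parameter values below are made up purely for illustration.

# Sketch: CUSUM for detecting an increase; data, mu, C, and the threshold are illustrative only
cusum_increase <- function(x, mu, C, thresh) {
  S <- 0
  for (t in seq_along(x)) {
    S <- max(0, S + (x[t] - mu - C))   # accumulate excess above mu, minus the buffer C
    if (S > thresh) return(t)          # index where the change is detected
  }
  return(NA)                           # no change detected
}

set.seed(1)
temps <- c(rnorm(30, mean = 70, sd = 1), rnorm(10, mean = 75, sd = 1))  # shift at t = 31
cusum_increase(temps, mu = 70, C = 1, thresh = 5)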

[Figure: SPC chart example]

Question 6.2

1. Using July through October daily-high-temperature data for Atlanta for 1996 through 2015, use a CUSUM approach to identify when unofficial summer ends (i.e., when the weather starts cooling off) each year. You can get the data that you need from the file temps.txt or online, for example at http://www.iweathernet.com/atlanta-weather-records or https://www.wunderground.com/history/airport/KFTY/2015/7/1/CustomHistory.html . You can use R if you’d like, but it’s straightforward enough that an Excel spreadsheet can easily do the job too.

2. Use a CUSUM approach to make a judgment of whether Atlanta’s summer climate has gotten warmer in that time (and if so, when).

# Read the tab-delimited temperature data (one column per year, one row per day)
data2 <- read.delim("temps.txt", header = TRUE)

# Drop the DAY column and use it as row names instead, then preview the first rows (shown below)
data2N <- data2[-1]
row.names(data2N) <- data2$DAY
kable(head(data2N, 5))
X1996 X1997 X1998 X1999 X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007 X2008 X2009 X2010 X2011 X2012 X2013 X2014 X2015
1-Jul 98 86 91 84 89 84 90 73 82 91 93 95 85 95 87 92 105 82 90 85
2-Jul 97 90 88 82 91 87 90 81 81 89 93 85 87 90 84 94 93 85 93 87
3-Jul 97 93 91 87 93 87 87 87 86 86 93 82 91 89 83 95 99 76 87 79
4-Jul 90 91 91 88 95 84 89 86 88 86 91 86 90 91 85 92 98 77 84 85
5-Jul 89 84 91 90 96 86 93 80 90 89 90 88 88 80 88 90 100 83 86 84
[Figure: Temp CUSUM results]

NOTE: The CUSUM parameters I used were C = 5 and T = 20. I obtained those values by tuning them on the 1996 data so that the detected change closely matched the actual end of that summer, then kept them fixed for the remaining years so the summer lengths could be compared. The calculated length of summer is presented in the "DIFF" column. I started at 1-Jul for every year since that is where the data begins, but the starting point doesn't matter as long as it is the same for all years. Based on the "DIFF" column, the longest summers fall in the 2004-2009 period. I can't give a definitive reason why, but I would start by checking whether there are correlations with other variables. The shortest length was 47 days, which is a premature detection caused by the relatively low T value. Overall the detections occurred near the end of the season, so there is definite value in using the CUSUM method. For further investigation, the Excel file with the results is included in the zip file.
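The analysis above was done in Excel; as a cross-check, here is a minimal R sketch of the same decrease-detecting CUSUM, using C = 5 and T = 20 from the write-up. Using each year's July mean as the baseline mu is my assumption and is not stated in the analysis above.

# Sketch: detect the end of summer per year; assumes data2N from the code above
detect_cooling <- function(temps, mu, C = 5, thresh = 20) {
  S <- 0
  for (t in seq_along(temps)) {
    S <- max(0, S + (mu - temps[t] - C))   # accumulates when temps drop below mu - C
    if (S > thresh) return(t)              # day index where cooling is detected
  }
  return(NA)
}

end_of_summer <- sapply(names(data2N), function(yr) {
  temps <- data2N[[yr]]
  mu <- mean(temps[1:31])                  # assumed baseline: average of that year's July highs
  idx <- detect_cooling(temps, mu)
  if (is.na(idx)) NA else row.names(data2N)[idx]
})
end_of_summer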