Homework 3

Question 5.1

Using crime data from the file uscrime.txt (http://www.statsci.org/data/general/uscrime.txt, description at http://www.statsci.org/data/general/uscrime.html), test to see whether there are any outliers in the last column (number of crimes per 100,000 people). Use the grubbs.test function in the outliers package in R.

Reading Data:

library(knitr)
# uscrime.txt is tab-delimited with a header row
data1 <- read.delim("uscrime.txt", header=TRUE)

M	So	Ed	Po1	Po2	LF	M.F	Pop	NW	U1	U2	Wealth	Ineq	Prob	Time	Crime
15.1	1	9.1	5.8	5.6	0.510	95.0	33	30.1	0.108	4.1	3940	26.1	0.084602	26.2011	791
14.3	0	11.3	10.3	9.5	0.583	101.2	13	10.2	0.096	3.6	5570	19.4	0.029599	25.2999	1635
14.2	1	8.9	4.5	4.4	0.533	96.9	18	21.9	0.094	3.3	3180	25.0	0.083401	24.3006	578
13.6	0	12.1	14.9	14.1	0.577	99.4	157	8.0	0.102	3.9	6730	16.7	0.015801	29.9012	1969
14.1	0	12.1	10.9	10.1	0.591	98.5	18	3.0	0.091	2.0	5780	17.4	0.041399	21.2998	1234

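library(stringr)
library(outliers)
# The last column is Crime (crimes per 100,000 people)
x <- data1[,ncol(data1)]
# type=11 tests the minimum and maximum together, one on each tail
results <- grubbs.test(x, type=11)
# Pull the two tested values out of the alternative-hypothesis text
vals <- as.numeric(str_extract_all(results$alternative, "[0-9]+")[[1]])
subdata <- subset(data1, Crime %in% vals)
boxplot(x,
     ylab="Crime Value")
# The single box is drawn at x = 1, so plot both flagged values there
points(rep(1, nrow(subdata)), subdata[,ncol(subdata)], pch=19,col='red')
title(main="Crime Outliers (Iteration 1)")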

NOTE: The points in red are the values flagged by the grubbs.test function. With a type parameter of 11, the test considers two values at once, one on each side of the plot: the highest value (1993) and the lowest value (342). Since 342 is inside the whiskers, I wouldn’t really call it an outlier; it is just the minimum of the values. On the high end, 1993 is the maximum value and falls outside the whiskers. There is another point right next to it (1969) that should also be considered an outlier, but grubbs.test only evaluates one value on each side unless type 20 is specified. The problem with type 20 is that it will not accept more than 30 values, and if I remove some of the data values in order to use it, that may skew the outlier results.
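
One way to look at the 1969 point without type 20 is to repeat a one-sided Grubbs test, removing the flagged maximum after each pass. The sketch below does this in base R using the standard critical-value formula based on the t distribution. The function names grubbs_crit and flag_high_outliers, the 0.05 significance level, and the remove-and-repeat loop are all assumptions made for illustration; they are not part of the outliers package or of the analysis above, and repeated Grubbs tests can suffer from masking when outliers sit close together.

```r
# One-sided Grubbs critical value for a sample of size n (n >= 3)
grubbs_crit <- function(n, alpha = 0.05) {
  t2 <- qt(alpha / n, df = n - 2, lower.tail = FALSE)^2
  (n - 1) / sqrt(n) * sqrt(t2 / (n - 2 + t2))
}

# Repeatedly test the maximum; remove it and retest while it is flagged
flag_high_outliers <- function(x, alpha = 0.05) {
  flagged <- numeric(0)
  while (length(x) > 2) {
    g <- (max(x) - mean(x)) / sd(x)   # Grubbs statistic for the maximum
    if (is.na(g) || g <= grubbs_crit(length(x), alpha)) break
    flagged <- c(flagged, max(x))
    x <- x[-which.max(x)]
  }
  flagged
}

# Made-up illustrative values (not the crime data): one obvious high point
flag_high_outliers(c(10, 11, 9, 10, 12, 10, 11, 50))

# Hypothetical use on the crime column loaded earlier:
# flag_high_outliers(data1$Crime)
```

If the first pass does not flag the maximum, the loop stops immediately and returns an empty vector, so this approach cannot reach 1969 unless 1993 is flagged first.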

Question 6.1

Describe a situation or problem from your job, everyday life, current events, etc., for which a Change Detection model would be appropriate. Applying the CUSUM technique, how would you choose the critical value and the threshold?

A change detection model would work well in a digital manufacturing setting where many sensors are used to understand processes. Many facilities now use a Statistical Process Control (SPC) chart to determine whether the process variation is out of control. This is usually done with various methods, all of which are based on the patterns created on an SPC chart. A change detection model would be more beneficial because of its flexible parameters, which allow for more personalization based on a manufacturer’s priorities.

That was a general example. A more specific one is a facility where higher temperatures increase the chance of defects. When the temperature reaches a level based on where defects are most likely to occur, the facility can turn off the machines or increase the power of the fans to compensate. The T value would be chosen so that most of the temperatures with high defect percentages produce an S_t value that is higher than T, which increases the chance of a correct detection. The C value would depend on how much of a buffer is needed: if the facility doesn’t normally see large changes in temperature, I would use a smaller C value, and vice versa.
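
The roles of C and T can be seen in a minimal sketch of the increase-detecting CUSUM, S_t = max(0, S_{t-1} + (x_t - mu - C)), with an alarm when S_t reaches T. The function name cusum_up, the baseline mu, and the temperature readings below are made-up values for illustration only, not measurements from any facility.

```r
# CUSUM for an upward shift: accumulate how far each reading exceeds mu + C
cusum_up <- function(x, mu, C, T_thresh) {
  s <- numeric(length(x))
  prev <- 0
  for (i in seq_along(x)) {
    prev <- max(0, prev + (x[i] - mu - C))
    s[i] <- prev
  }
  # first_alarm is NA when the threshold is never reached
  list(s = s, first_alarm = which(s >= T_thresh)[1])
}

# Hypothetical temperature readings with a jump after the fourth value
temps <- c(70, 71, 69, 70, 80, 82, 81)
res <- cusum_up(temps, mu = 70, C = 2, T_thresh = 15)
res$s
res$first_alarm
```

A larger C ignores more of the ordinary fluctuation around mu, and a larger T delays the alarm but reduces false detections.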

SPC chart example

Question 6.2

1. Using July through October daily-high-temperature data for Atlanta for 1996 through 2015, use a CUSUM approach to identify when unofficial summer ends (i.e., when the weather starts cooling off) each year. You can get the data that you need from the file temps.txt or online, for example at http://www.iweathernet.com/atlanta-weather-records or https://www.wunderground.com/history/airport/KFTY/2015/7/1/CustomHistory.html . You can use R if you’d like, but it’s straightforward enough that an Excel spreadsheet can easily do the job too.

2. Use a CUSUM approach to make a judgment of whether Atlanta’s summer climate has gotten warmer in that time (and if so, when).

data2 <- read.delim("temps.txt", header=TRUE)
# Drop the DAY column and use its values as row names instead
data2N <- data2[-1]
row.names(data2N) <- data2$DAY

	X1996	X1997	X1998	X1999	X2000	X2001	X2002	X2003	X2004	X2005	X2006	X2007	X2008	X2009	X2010	X2011	X2012	X2013	X2014	X2015
1-Jul	98	86	91	84	89	84	90	73	82	91	93	95	85	95	87	92	105	82	90	85
2-Jul	97	90	88	82	91	87	90	81	81	89	93	85	87	90	84	94	93	85	93	87
3-Jul	97	93	91	87	93	87	87	87	86	86	93	82	91	89	83	95	99	76	87	79
4-Jul	90	91	91	88	95	84	89	86	88	86	91	86	90	91	85	92	98	77	84	85
5-Jul	89	84	91	90	96	86	93	80	90	89	90	88	88	80	88	90	100	83	86	84

Temp CUSUM Results

NOTE: The parameters I used for CUSUM were 5 for C and 20 for T. I obtained those values by tuning them on the 1996 data so that the CUSUM detection closely aligned with the actual end of the season. I kept them the same for the rest of the years so that I could compare how long the summer was in each year. The calculated length of summer is presented in the “DIFF” column. I started at 1-Jul for all the years since that’s where the data begins, but the start date doesn’t matter as long as all the years share the same starting point. Based on the results in the “DIFF” column, the longest summers were in the 2004-2009 period. I can’t give a definitive reason for that, but I’d start by checking whether there are any correlations with values elsewhere. The shortest length was 47, which is a very premature detection due to the low T value used. Overall, the detections occurred near the end of the season, so there is definite value in using the CUSUM method. For further investigation, the Excel file with the results should be in the zip file.
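
The same calculation can be sketched in R with the decrease-detecting form of CUSUM, S_t = max(0, S_{t-1} + (mu - x_t - C)), using C = 5 and T = 20 from the note. The baseline mu is not stated above, so taking it as the mean of the first 31 days (July) is an assumption, as are the function names cusum_down and end_of_summer. The demo vector is made up and is not taken from temps.txt.

```r
# CUSUM for a downward shift: accumulate how far each reading falls below mu - C
cusum_down <- function(x, mu, C) {
  s <- numeric(length(x))
  prev <- 0
  for (i in seq_along(x)) {
    prev <- max(0, prev + (mu - x[i] - C))
    s[i] <- prev
  }
  s
}

# Index of the first day where S_t reaches the threshold (NA if it never does)
end_of_summer <- function(x, C = 5, T_thresh = 20, n_base = 31) {
  mu <- mean(x[seq_len(n_base)])   # assumed baseline: mean of the first n_base days
  which(cusum_down(x, mu, C) >= T_thresh)[1]
}

# Made-up series: 33 days at 90 degrees followed by a steady cool-down
demo <- c(rep(90, 33), 80, 75, 70)
end_of_summer(demo)

# Hypothetical use on the yearly columns loaded earlier:
# sapply(data2N, end_of_summer)
```

Because every year starts at 1-Jul, the returned index is a day count from the start of the data, so the values are directly comparable across years.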

Javarrus Mickle

September 11, 2019
