Question 5.1 - Using crime data from the file uscrime.txt
(http://www.statsci.org/data/general/uscrime.txt,
description at http://www.statsci.org/data/general/uscrime.html), test
to see whether there are any outliers in the last column (number of
crimes per 100,000 people). Use the grubbs.test function in the outliers
package in R.
Methodology - First, we load the crime data for this question. To explore the presence of outliers, we begin with a box-and-whiskers plot of the Crime variable, which gives a broad visualization of the data we are working with. We then apply the grubbs.test function to see which outliers it identifies. With type = 10, grubbs.test checks only a single extreme value at a time, so we will test for outliers one at a time.
library(outliers)
crime_data <- read.table("C:/Users/james/OneDrive/Desktop/Georgia Tech/ISYE 6501/Homeworks/Homework 3/Homework3_ISYE_6501/Homework3_ISYE_6501/data 5.1/uscrime.txt", header = TRUE)
head(crime_data)
boxplot(crime_data$Crime)
Let us now use the grubbs.test function to see if it identifies any
outliers in our data.
grubbs.test(crime_data$Crime, type = 10, two.sided = FALSE, opposite = FALSE)
##
## Grubbs test for one outlier
##
## data: crime_data$Crime
## G = 2.81287, U = 0.82426, p-value = 0.07887
## alternative hypothesis: highest value 1993 is an outlier
According to our grubbs.test, at a significance level of 0.1 (p = 0.079), we conclude that the data point where Crime is 1993 is indeed an outlier. However, our boxplot showed two points very close together at the top, so let us remove this outlier and run the same grubbs.test again.
crime_data2 = crime_data[crime_data$Crime != 1993,]
boxplot(crime_data2$Crime)
grubbs.test(crime_data2$Crime, type = 10, two.sided = FALSE, opposite = FALSE)
##
## Grubbs test for one outlier
##
## data: crime_data2$Crime
## G = 3.06343, U = 0.78682, p-value = 0.02848
## alternative hypothesis: highest value 1969 is an outlier
At the same significance level of 0.1 (p = 0.028), we conclude that 1969 is also an outlier, so we remove it from our data as well. Our original boxplot suggested one more extreme point, so let us run the test once more.
crime_data3 = crime_data2[crime_data2$Crime != 1969,]
boxplot(crime_data3$Crime)
grubbs.test(crime_data3$Crime, type = 10, two.sided = FALSE, opposite = FALSE)
##
## Grubbs test for one outlier
##
## data: crime_data3$Crime
## G = 2.56457, U = 0.84712, p-value = 0.1781
## alternative hypothesis: highest value 1674 is an outlier
After removing the top two outliers from our original boxplot, grubbs.test cannot confirm that the next data point (1674) is an outlier, since its p-value of 0.178 is not below our significance level of 0.1. The original boxplot did not suggest any outliers in the lower tail, but let us test for a small outlier anyway.
grubbs.test(crime_data$Crime, type = 10, two.sided = FALSE, opposite = TRUE)
##
## Grubbs test for one outlier
##
## data: crime_data$Crime
## G = 1.45589, U = 0.95292, p-value = 1
## alternative hypothesis: lowest value 342 is an outlier
As we can see above, with a p-value of 1, grubbs.test gives no evidence that 342 is an outlier in our data.
Results - After multiple rounds of applying grubbs.test, we removed two outlier points from our data: 1993 and 1969. We did not remove the values 1674 and 342, even though we tested them.
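As a quick sanity check (a minimal sketch using the objects created above; the exact row counts depend on the data file), we can confirm that exactly two rows were dropped and that 1674 is the largest remaining Crime value:
nrow(crime_data)        # original number of rows
nrow(crime_data3)       # should be two fewer rows after removing 1993 and 1969
max(crime_data3$Crime)  # should be 1674, the point Grubbs' test did not flag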
Discussion of Results - These outlier results are not necessarily surprising given our original boxplot, which showed three outliers in total: two far outliers and one very close to the upper whisker. The only interesting point was 1674. According to the original box-and-whiskers plot, 1674 was an outlier, but grubbs.test could not identify it as an outlier at the degree of confidence we required (significance level = 0.1).
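For comparison, a minimal sketch of how the boxplot arrives at its flags (using base R's boxplot.stats, which applies the standard 1.5 * IQR whisker rule) is shown below; this is only to illustrate why 1674 appears as an outlier in the plot even though Grubbs' test does not reject it.
# Points beyond the 1.5 * IQR whiskers, i.e., the circles drawn by boxplot()
boxplot.stats(crime_data$Crime)$out
# The hinge/whisker statistics themselves, for reference
boxplot.stats(crime_data$Crime)$stats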
Question 6.1 - Describe a situation or problem from your
job, everyday life, current events, etc., for which a Change Detection
model would be appropriate. Applying the CUSUM technique, how would you
choose the critical value and the threshold?
Last week I talked about biome classification and clustering, and the idea of identifying trends in environmental predictors. For instance, we might want to use a change detection model when monitoring the air quality of certain environments, especially when we consider preservation efforts for certain species. Using a change detection model here would allow us to assess whether or not a habitat is becoming more or less suitable for a given species. When choosing a critical value and threshold for this change detection model, we need to take into consideration how confident we want to be in our answer. We would much rather have a "false alarm" about air quality changing than fail to recognize the change. As such, when taking the Air Quality Index into consideration, an appropriate threshold would be around 15-25, so for the sake of this conversation we can say 20. We can also choose a critical value of around 5; we want a decent-sized critical value to cushion the natural variance in air quality. Ultimately, these values could change if we are talking about a certain tree population compared to a bird species.
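As a rough illustration (a minimal sketch with made-up AQI readings, not real monitoring data), a one-sided increase CUSUM with the critical value of 5 and threshold of 20 discussed above could look like this:
# Hypothetical daily AQI readings for a monitored habitat (made-up values)
aqi    <- c(42, 45, 40, 44, 43, 47, 46, 52, 55, 58, 61, 63, 66, 70)
mu     <- mean(aqi[1:7])  # baseline mean from an assumed "normal" period
C      <- 5               # critical value: cushion for natural day-to-day variance
thresh <- 20              # threshold: alarm level for a sustained increase

S <- 0
alarm_day <- NA
for (t in seq_along(aqi)) {
  S <- max(0, S + (aqi[t] - mu - C))  # accumulate increases beyond mu + C
  if (is.na(alarm_day) && S >= thresh) alarm_day <- t
}
alarm_day  # first day the cumulative increase crosses the threshold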
Question 6.2.1 - Using July through October
daily-high-temperature data for Atlanta for 1996 through 2015, use a
CUSUM approach to identify when unofficial summer ends (i.e., when the
weather starts cooling off) each year. You can get the data that you
need from the file temps.txt or online, for example at http://www.iweathernet.com/atlanta-weather-records or https://www.wunderground.com/history/airport/KFTY/2015/7/1/CustomHistory.html
. You can use R if you’d like, but it’s straightforward enough that an
Excel spreadsheet can easily do the job too.
Methodology - To begin this question, we load our temperature data for 1996-2015. Once the data is loaded, we run our CUSUM calculation on the 1996 column by itself, so we can closely analyze how it behaves, and then generalize it into a function that covers every year. Since the CUSUM update takes the maximum of two values, rather than calling max() directly it is more intuitive to use an if branch that checks whether the increment (the mean minus the data point minus the critical value) is less than or equal to zero and, if so, resets the running sum to zero.
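For reference, the textbook one-sided (decrease-detecting) CUSUM recursion is S_t = max(0, S_{t-1} + (mu - x_t - C)). A minimal sketch of that form for a generic numeric vector x is shown below; note that it only resets when the running sum itself falls to zero, so it can behave slightly differently from the simplified if-branch reset used in the loops that follow.
# Textbook one-sided (decrease) CUSUM for a generic numeric vector x:
# S[t] = max(0, S[t-1] + (mu - x[t] - C))
cusum_decrease <- function(x, mu, C) {
  S <- numeric(length(x))
  running <- 0
  for (t in seq_along(x)) {
    running <- max(0, running + (mu - x[t] - C))
    S[t] <- running
  }
  S
}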
temp_data <- read.table("C:/Users/james/OneDrive/Desktop/Georgia Tech/ISYE 6501/Homeworks/Homework 3/Homework3_ISYE_6501/Homework3_ISYE_6501/data 6.2/temps.txt", header = TRUE)
head(temp_data)
Let us test our 1996 data vector.
data_1996 = temp_data$X1996
head(data_1996)
## [1] 98 97 97 90 89 93
As previously mentioned, we split the max() step into if branches inside a for loop over all of the 1996 temperatures. Once all of the incremental change values are computed, we go back and find where the running sum first crosses our threshold. For testing purposes, we use a critical value of 5 and a threshold of 15 here.
st_1996 = rep(0, length(data_1996))
# Baseline mean: the first 62 rows are the days of July and August
mean_1996 = mean(data_1996[1:62])
for (i in 1:length(data_1996)){
  # Increment for detecting cooling: mean - observation - critical value (C = 5)
  differ = mean_1996 - data_1996[i] - 5
  if (differ <= 0){
    # No evidence of cooling on this day; reset the running sum
    st_1996[i] = 0
  }
  else if (i == 1){
    # Guard for the first day (no previous running sum to add to)
    st_1996[i] = differ
  }
  else {
    # Accumulate the cooling evidence
    st_1996[i] = st_1996[i-1] + differ
  }
}
st_1996
## [1] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## [7] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## [13] 0.0000000 0.0000000 2.6129032 0.0000000 0.0000000 0.0000000
## [19] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## [25] 0.6129032 1.2258065 3.8387097 9.4516129 0.0000000 0.0000000
## [31] 0.0000000 0.0000000 0.0000000 0.6129032 0.0000000 0.0000000
## [37] 0.0000000 0.0000000 0.6129032 0.0000000 0.0000000 0.0000000
## [43] 0.0000000 0.6129032 0.0000000 0.0000000 0.0000000 0.0000000
## [49] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## [55] 0.0000000 0.6129032 0.0000000 0.6129032 0.0000000 0.0000000
## [61] 0.6129032 3.2258065 7.8387097 19.4516129 0.0000000 0.6129032
## [67] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.6129032
## [73] 0.0000000 0.0000000 6.6129032 12.2258065 0.0000000 2.6129032
## [79] 5.2258065 11.8387097 17.4516129 23.0645161 29.6774194 33.2903226
## [85] 33.9032258 34.5161290 0.0000000 0.6129032 6.2258065 15.8387097
## [91] 28.4516129 49.0645161 67.6774194 80.2903226 80.9032258 95.5161290
## [97] 114.1290323 134.7419355 159.3548387 165.9677419 180.5806452 193.1935484
## [103] 208.8064516 224.4193548 236.0322581 241.6451613 245.2580645 249.8709677
## [109] 252.4838710 271.0967742 292.7096774 309.3225806 314.9354839 318.5483871
## [115] 334.1612903 345.7741935 357.3870968 367.0000000 376.6129032 380.2258065
## [121] 382.8387097 385.4516129 389.0645161
cross_1996 = which(st_1996 >= 15)[1]
cross_1996
## [1] 64
temp_data[cross_1996,1]
## [1] "2-Sep"
In 1996, our process suggests that summer was unofficially over on September 2nd. Let us now generalize this into a function and apply it to every year by adding an outer for loop that cycles through the year columns.
cusum_1 = function(C, T) {
  # One pass for each year column in our data (columns 2 through 21)
  for (i in 2:21){
    year_data = temp_data[,i]
    year_results = rep(0, length(year_data))
    # 62 days in July and August combined define the baseline mean
    year_mean = mean(year_data[1:62])
    for (j in 1:length(year_data)){
      # Increment for detecting cooling: mean - observation - critical value
      year_differ = year_mean - year_data[j] - C
      if (year_differ <= 0){
        year_results[j] = 0
      }
      else{
        if (j == 1){
          year_results[j] = year_differ
        }
        else{
          year_results[j] = year_results[j-1] + year_differ
        }
      }
    }
    # First day the running sum crosses the threshold T
    year_cross = which(year_results >= T)[1]
    # Strip the leading "X" from the column name (e.g., "X1996" -> "1996")
    output = paste(substring(colnames(temp_data)[i],2), ":", temp_data[year_cross,1])
    print(output)
  }
}
cusum_1(5,20)
## [1] "1996 : 20-Sep"
## [1] "1997 : 26-Sep"
## [1] "1998 : 6-Oct"
## [1] "1999 : 13-Jul"
## [1] "2000 : 26-Jul"
## [1] "2001 : 25-Sep"
## [1] "2002 : 24-Sep"
## [1] "2003 : 29-Sep"
## [1] "2004 : 16-Sep"
## [1] "2005 : 7-Oct"
## [1] "2006 : 14-Sep"
## [1] "2007 : 16-Sep"
## [1] "2008 : 17-Sep"
## [1] "2009 : 2-Sep"
## [1] "2010 : 28-Sep"
## [1] "2011 : 6-Sep"
## [1] "2012 : 14-Sep"
## [1] "2013 : 17-Aug"
## [1] "2014 : 26-Sep"
## [1] "2015 : 14-Sep"
As before, we use a critical value of 5; after experimenting with different threshold values, we settled on a threshold of 20. These are the days in each respective year on which we would declare summer unofficially over.
Results - We can see from the above results that, across our 20-year span, we usually declare summer unofficially over in the middle of September, with the occasional early-October or early-September date.
Discussion of Results - Intuitively, our results make sense. Our dates typically fall around the time summer turns to fall, but usually a little earlier, and in my experience the temperature normally cools down a bit before we actually enter the fall season. When I originally tested my model, I used a threshold of 15 instead of 20 because I thought a 15-degree cumulative change would be significant; however, with that threshold there were multiple days in July that would have been declared the unofficial end of summer, which does not check out in reality. Another option was to lower the threshold back to 15 but increase the critical value to something like 10. That ended up over-correcting, putting most of the dates in the early-to-mid-October range, which does not really make sense either. Overall, our results mostly make sense with a critical value of 5 and a threshold value of 20.
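For reproducibility of that comparison (a minimal sketch; the output is lengthy and not shown here), the alternative settings discussed above can be checked by calling the function defined earlier with different values of C and T:
# Compare the (C, T) settings discussed above; each call prints one date per year
cusum_1(5, 15)   # original attempt: several July dates appear
cusum_1(10, 15)  # over-corrects: dates drift into early/mid-October
cusum_1(5, 20)   # chosen settings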
Question 6.2.2 - Use a CUSUM approach to make a
judgment of whether Atlanta's summer climate has gotten warmer in that
time (and if so, when).
Methodology - We want to structure this CUSUM similarly to the previous one, but we need to change our data a little first. Now we take the mean summer temperature of every year. The fall season typically begins around the 21st of September, which corresponds to the first 83 rows of our data (31 days in July, 31 in August, and 21 in September). Thus, each yearly summer temperature is the mean of the first 83 days of that year, and the overall mean is the mean of all of those days across all 20 summers combined. Correspondingly, we use the yearly means as our individual data points and the 20-year summer mean as the baseline mean in our CUSUM. One other major difference from the previous CUSUM is that before we were measuring cooling, while now we are measuring warming. When measuring cooling, the increment was the mean minus the data point minus the critical value; since we are now detecting warming, the increment is the data point minus the mean minus the critical value.
# Running totals: "sum" accumulates every summer day across all years,
# "year_sum" accumulates the 83 summer days within a single year
sum = 0
year_means = rep(0,20)
for (i in 2:21){
  year_sum = 0
  for (j in 1:83){
    sum = sum + temp_data[j,i]
    year_sum = year_sum + temp_data[j,i]
  }
  # Mean summer (Jul 1 - Sep 21) temperature for this year
  year_means[i-1] = year_sum/83
}
# Overall mean across all 20 summers (20 years x 83 days)
total_mean = sum/(20*83)
#year_means
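As a side note (a minimal sketch of an equivalent, loop-free way to get the same quantities, assuming the same temp_data), the yearly and overall means can also be computed with colMeans:
# Equivalent vectorized computation of the yearly and overall summer means
year_means_alt <- colMeans(temp_data[1:83, 2:21])  # one mean per year column
total_mean_alt <- mean(year_means_alt)             # same as sum / (20 * 83)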
cusum_2 = function(C,T){
  years_st = rep(0,20)
  for (i in 1:length(year_means)){
    # Increment for detecting warming: yearly mean - overall mean - critical value
    data_differ = year_means[i] - total_mean - C
    if (data_differ <= 0){
      years_st[i] = 0
    }
    else{
      if (i == 1){
        years_st[i] = data_differ
      }
      else{
        years_st[i] = years_st[i-1] + data_differ
      }
    }
  }
  # All years whose running sum is at or above the threshold T
  cross = which(years_st >= T)
  # Year columns are offset by 1 (column 1 holds the dates); strip the leading "X"
  substring(colnames(temp_data[cross + 1]),2)
}
cusum_2(.5,1)
## [1] "2007" "2010" "2011" "2012"
Results - For our warming CUSUM function we decided to use a critical value of 0.5 and a threshold value of 1. Since our individual data points are themselves yearly means, we thought that even a one-degree cumulative change would be significant enough to detect, and a critical value of 0.5 provided a nice cushion while still signaling when the data changed. Ultimately, it is hard to have an intuition about whether this is correct, but our model selected the years 2007, 2010, 2011, and 2012.
Discussion of Results - Overall, our results are consistent with what we see for other values of C and T. For most values of C and T, 2010, 2011, and 2012 always appear among the summers that got warmer in Atlanta, and 2007 is also very common, so I felt satisfied that our model included those years. We also should not expect too many years to be flagged; these 4 years represent only 20% of our data, which feels like an appropriate number of years.
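To make that comparison concrete (a minimal sketch; the particular C and T values below are just illustrative alternatives, and their output is not shown here), a small sweep over a few settings can be run with the function defined above:
# Illustrative sensitivity check over a few (C, T) pairs for the warming CUSUM
for (C in c(0.25, 0.5, 1)) {
  for (threshold in c(1, 2)) {
    cat("C =", C, "T =", threshold, ":", cusum_2(C, threshold), "\n")
  }
}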