Question 5.1 - Using crime data from the file uscrime.txt
(http://www.statsci.org/data/general/uscrime.txt,
description at http://www.statsci.org/data/general/uscrime.html), test
to see whether there are any outliers in the last column (number of
crimes per 100,000 people). Use the grubbs.test function in the outliers
package in R.
Methodology - First, we load the crime data for this question. To explore the presence of outliers, we begin with a box-and-whiskers plot of the Crime variable, which gives a broad visualization of the data we are working with. We then apply the grubbs.test function to see which outliers it identifies. With type = 10, grubbs.test checks only a single extreme value at a time, so we will test for outliers one at a time.
library(outliers)
crime_data <- read.table("C:/Users/james/OneDrive/Desktop/Georgia Tech/ISYE 6501/Homeworks/Homework 3/Homework3_ISYE_6501/Homework3_ISYE_6501/data 5.1/uscrime.txt", header = TRUE)
head(crime_data)
boxplot(crime_data$Crime)
Let us now use the grubbs.test function to see if it identifies any
outliers in our data.
grubbs.test(crime_data$Crime, type = 10, two.sided = FALSE, opposite = FALSE)
##
## Grubbs test for one outlier
##
## data: crime_data$Crime
## G = 2.81287, U = 0.82426, p-value = 0.07887
## alternative hypothesis: highest value 1993 is an outlier
According to our grubbs.test, at a significance level of 0.1 (p = 0.079), we conclude that the data point where Crime is 1993 is indeed an outlier. However, our boxplot showed two points very close together at the top, so let us remove this outlier and run the same grubbs.test again.
crime_data2 = crime_data[crime_data$Crime != 1993,]
boxplot(crime_data2$Crime)
grubbs.test(crime_data2$Crime, type = 10, two.sided = FALSE, opposite = FALSE)
##
## Grubbs test for one outlier
##
## data: crime_data2$Crime
## G = 3.06343, U = 0.78682, p-value = 0.02848
## alternative hypothesis: highest value 1969 is an outlier
At the same significance level of 0.1 (p = 0.028), we conclude that 1969 is also an outlier, so we remove it from our data as well. Our original boxplot suggested one more extreme point, so let us run the test once more.
crime_data3 = crime_data2[crime_data2$Crime != 1969,]
boxplot(crime_data3$Crime)
grubbs.test(crime_data3$Crime, type = 10, two.sided = FALSE, opposite = FALSE)
##
## Grubbs test for one outlier
##
## data: crime_data3$Crime
## G = 2.56457, U = 0.84712, p-value = 0.1781
## alternative hypothesis: highest value 1674 is an outlier
After removing the top two outliers from our original boxplot, grubbs.test cannot confirm that the next data point (1674) is an outlier, since its p-value of 0.178 is not below our significance level of 0.1. The original boxplot did not suggest any outliers in the lower tail, but let us test for a small outlier anyway.
grubbs.test(crime_data$Crime, type = 10, two.sided = FALSE, opposite = TRUE)
##
## Grubbs test for one outlier
##
## data: crime_data$Crime
## G = 1.45589, U = 0.95292, p-value = 1
## alternative hypothesis: lowest value 342 is an outlier
As we can see above, with a p-value of 1, grubbs.test gives no evidence that 342 is an outlier in our data.
Results - After multiple rounds of applying grubbs.test, we removed two outlier points from our data: 1993 and 1969. We did not remove the values 1674 and 342, even though we tested them.
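As a quick sanity check (a minimal sketch using the objects created above; the exact row counts depend on the data file), we can confirm that exactly two rows were dropped and that 1674 is the largest remaining Crime value:
nrow(crime_data)        # original number of rows
nrow(crime_data3)       # should be two fewer rows after removing 1993 and 1969
max(crime_data3$Crime)  # should be 1674, the point Grubbs' test did not flag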
Discussion of Results - These outlier results are not necessarily surprising given our original boxplot, which showed three outliers in total: two far outliers and one very close to the upper whisker. The only interesting point was 1674. According to the original box-and-whiskers plot, 1674 was an outlier, but grubbs.test could not identify it as an outlier at the degree of confidence we required (significance level = 0.1).
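For comparison, a minimal sketch of how the boxplot arrives at its flags (using base R's boxplot.stats, which applies the standard 1.5 * IQR whisker rule) is shown below; this is only to illustrate why 1674 appears as an outlier in the plot even though Grubbs' test does not reject it.
# Points beyond the 1.5 * IQR whiskers, i.e., the circles drawn by boxplot()
boxplot.stats(crime_data$Crime)$out
# The hinge/whisker statistics themselves, for reference
boxplot.stats(crime_data$Crime)$stats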
Question 6.1 - Describe a situation or problem from your
job, everyday life, current events, etc., for which a Change Detection
model would be appropriate. Applying the CUSUM technique, how would you
choose the critical value and the threshold?
Last week I talked about biome classification and clustering, and the idea of identifying trends in environmental predictors. For instance, we might want to use a change detection model when monitoring the air quality of certain environments, especially when we consider preservation efforts for certain species. Using a change detection model here would allow us to assess whether or not a habitat is becoming more or less suitable for a given species. When choosing a critical value and threshold for this change detection model, we need to take into consideration how confident we want to be in our answer. We would much rather have a "false alarm" about air quality changing than fail to recognize the change. As such, when taking the Air Quality Index into consideration, an appropriate threshold would be around 15-25, so for the sake of this conversation we can say 20. We can also choose a critical value of around 5; we want a decent-sized critical value to cushion the natural variance in air quality. Ultimately, these values could change if we are talking about a certain tree population compared to a bird species.
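As a rough illustration (a minimal sketch with made-up AQI readings, not real monitoring data), a one-sided increase CUSUM with the critical value of 5 and threshold of 20 discussed above could look like this:
# Hypothetical daily AQI readings for a monitored habitat (made-up values)
aqi    <- c(42, 45, 40, 44, 43, 47, 46, 52, 55, 58, 61, 63, 66, 70)
mu     <- mean(aqi[1:7])  # baseline mean from an assumed "normal" period
C      <- 5               # critical value: cushion for natural day-to-day variance
thresh <- 20              # threshold: alarm level for a sustained increase

S <- 0
alarm_day <- NA
for (t in seq_along(aqi)) {
  S <- max(0, S + (aqi[t] - mu - C))  # accumulate increases beyond mu + C
  if (is.na(alarm_day) && S >= thresh) alarm_day <- t
}
alarm_day  # first day the cumulative increase crosses the threshold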
Question 6.2.1 - Using July through October
daily-high-temperature data for Atlanta for 1996 through 2015, use a
CUSUM approach to identify when unofficial summer ends (i.e., when the
weather starts cooling off) each year. You can get the data that you
need from the file temps.txt or online, for example at http://www.iweathernet.com/atlanta-weather-records or https://www.wunderground.com/history/airport/KFTY/2015/7/1/CustomHistory.html
. You can use R if you’d like, but it’s straightforward enough that an
Excel spreadsheet can easily do the job too.
Methodology - To begin this question, we load our temperature data for 1996-2015. Once the data is loaded, we run our CUSUM calculation on the 1996 column by itself, so we can closely analyze how it behaves, and then generalize it into a function that covers every year. Since the CUSUM update takes the maximum of two values, rather than calling max() directly it is more intuitive to use an if branch that checks whether the increment (the mean minus the data point minus the critical value) is less than or equal to zero and, if so, resets the running sum to zero.
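For reference, the textbook one-sided (decrease-detecting) CUSUM recursion is S_t = max(0, S_{t-1} + (mu - x_t - C)). A minimal sketch of that form for a generic numeric vector x is shown below; note that it only resets when the running sum itself falls to zero, so it can behave slightly differently from the simplified if-branch reset used in the loops that follow.
# Textbook one-sided (decrease) CUSUM for a generic numeric vector x:
# S[t] = max(0, S[t-1] + (mu - x[t] - C))
cusum_decrease <- function(x, mu, C) {
  S <- numeric(length(x))
  running <- 0
  for (t in seq_along(x)) {
    running <- max(0, running + (mu - x[t] - C))
    S[t] <- running
  }
  S
}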
temp_data <- read.table("C:/Users/james/OneDrive/Desktop/Georgia Tech/ISYE 6501/Homeworks/Homework 3/Homework3_ISYE_6501/Homework3_ISYE_6501/data 6.2/temps.txt", header = TRUE)
head(temp_data)
Let us test our 1996 data vector.
data_1996 = temp_data$X1996
head(data_1996)
## [1] 98 97 97 90 89 93
As previously mentioned, we split the max() step into if branches inside a for loop over all of the 1996 temperatures. Once all of the incremental change values are computed, we go back and find where the running sum first crosses our threshold. For testing purposes, we use a critical value of 5 and a threshold of 15 here.
st_1996 = rep(0, length(data_1996))
# Baseline mean: the first 62 rows are the days of July and August
mean_1996 = mean(data_1996[1:62])
for (i in 1:length(data_1996)){
  # Increment for detecting cooling: mean - observation - critical value (C = 5)
  differ = mean_1996 - data_1996[i] - 5
  if (differ <= 0){
    # No evidence of cooling on this day; reset the running sum
    st_1996[i] = 0
  }
  else if (i == 1){
    # Guard for the first day (no previous running sum to add to)
    st_1996[i] = differ
  }
  else {
    # Accumulate the cooling evidence
    st_1996[i] = st_1996[i-1] + differ
  }
}
st_1996
## [1] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## [7] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## [13] 0.0000000 0.0000000 2.6129032 0.0000000 0.0000000 0.0000000
## [19] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## [25] 0.6129032 1.2258065 3.8387097 9.4516129 0.0000000 0.0000000
## [31] 0.0000000 0.0000000 0.0000000 0.6129032 0.0000000 0.0000000
## [37] 0.0000000 0.0000000 0.6129032 0.0000000 0.0000000 0.0000000
## [43] 0.0000000 0.6129032 0.0000000 0.0000000 0.0000000 0.0000000
## [49] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## [55] 0.0000000 0.6129032 0.0000000 0.6129032 0.0000000 0.0000000
## [61] 0.6129032 3.2258065 7.8387097 19.4516129 0.0000000 0.6129032
## [67] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.6129032
## [73] 0.0000000 0.0000000 6.6129032 12.2258065 0.0000000 2.6129032
## [79] 5.2258065 11.8387097 17.4516129 23.0645161 29.6774194 33.2903226
## [85] 33.9032258 34.5161290 0.0000000 0.6129032 6.2258065 15.8387097
## [91] 28.4516129 49.0645161 67.6774194 80.2903226 80.9032258 95.5161290
## [97] 114.1290323 134.7419355 159.3548387 165.9677419 180.5806452 193.1935484
## [103] 208.8064516 224.4193548 236.0322581 241.6451613 245.2580645 249.8709677
## [109] 252.4838710 271.0967742 292.7096774 309.3225806 314.9354839 318.5483871
## [115] 334.1612903 345.7741935 357.3870968 367.0000000 376.6129032 380.2258065
## [121] 382.8387097 385.4516129 389.0645161
cross_1996 = which(st_1996 >= 15)[1]
cross_1996
## [1] 64
temp_data[cross_1996,1]
## [1] "2-Sep"
In 1996, our process suggests that summer was unofficially over on September 2nd. Let us now generalize this into a function and apply it to every year by adding an outer for loop that cycles through the year columns.
cusum_1 = function(C, T) {
  # One pass for each year column in our data (columns 2 through 21)
  for (i in 2:21){
    year_data = temp_data[,i]
    year_results = rep(0, length(year_data))
    # 62 days in July and August combined define the baseline mean
    year_mean = mean(year_data[1:62])
    for (j in 1:length(year_data)){
      # Increment for detecting cooling: mean - observation - critical value
      year_differ = year_mean - year_data[j] - C
      if (year_differ <= 0){
        year_results[j] = 0
      }
      else{
        if (j == 1){
          year_results[j] = year_differ
        }
        else{
          year_results[j] = year_results[j-1] + year_differ
        }
      }
    }
    # First day the running sum crosses the threshold T
    year_cross = which(year_results >= T)[1]
    # Strip the leading "X" from the column name (e.g., "X1996" -> "1996")
    output = paste(substring(colnames(temp_data)[i],2), ":", temp_data[year_cross,1])
    print(output)
  }
}
cusum_1(5,20)
## [1] "1996 : 20-Sep"
## [1] "1997 : 26-Sep"
## [1] "1998 : 6-Oct"
## [1] "1999 : 13-Jul"
## [1] "2000 : 26-Jul"
## [1] "2001 : 25-Sep"
## [1] "2002 : 24-Sep"
## [1] "2003 : 29-Sep"
## [1] "2004 : 16-Sep"
## [1] "2005 : 7-Oct"
## [1] "2006 : 14-Sep"
## [1] "2007 : 16-Sep"
## [1] "2008 : 17-Sep"
## [1] "2009 : 2-Sep"
## [1] "2010 : 28-Sep"
## [1] "2011 : 6-Sep"
## [1] "2012 : 14-Sep"
## [1] "2013 : 17-Aug"
## [1] "2014 : 26-Sep"
## [1] "2015 : 14-Sep"
As before, we use a critical value of 5; after experimenting with different threshold values, we settled on a threshold of 20. These are the days in each respective year on which we would declare summer unofficially over.
Results - We can see from the above results that, across our 20-year span, we usually declare summer unofficially over in the middle of September, with the occasional early-October or early-September date.
Discussion of Results - Intuitively, our results make sense. Our dates typically fall around the time summer turns to fall, but usually a little earlier, and in my experience the temperature normally cools down a bit before we actually enter the fall season. When I originally tested my model, I used a threshold of 15 instead of 20 because I thought a 15-degree cumulative change would be significant; however, with that threshold there were multiple days in July that would have been declared the unofficial end of summer, which does not check out in reality. Another option was to lower the threshold back to 15 but increase the critical value to something like 10. That ended up over-correcting, putting most of the dates in the early-to-mid-October range, which does not really make sense either. Overall, our results mostly make sense with a critical value of 5 and a threshold value of 20.
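For reproducibility of that comparison (a minimal sketch; the output is lengthy and not shown here), the alternative settings discussed above can be checked by calling the function defined earlier with different values of C and T:
# Compare the (C, T) settings discussed above; each call prints one date per year
cusum_1(5, 15)   # original attempt: several July dates appear
cusum_1(10, 15)  # over-corrects: dates drift into early/mid-October
cusum_1(5, 20)   # chosen settings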
Question 6.2.2 - Use a CUSUM approach to make a
judgment of whether Atlanta's summer climate has gotten warmer in that
time (and if so, when).
Methodology - We want to structure this CUSUM similarly to the previous one, but we need to change our data a little first. Now we take the mean summer temperature of every year. The fall season typically begins around the 21st of September, which corresponds to the first 83 rows of our data (31 days in July, 31 in August, and 21 in September). Thus, each yearly summer temperature is the mean of the first 83 days of that year, and the overall mean is the mean of all of those days across all 20 summers combined. Correspondingly, we use the yearly means as our individual data points and the 20-year summer mean as the baseline mean in our CUSUM. One other major difference from the previous CUSUM is that before we were measuring cooling, while now we are measuring warming. When measuring cooling, the increment was the mean minus the data point minus the critical value; since we are now detecting warming, the increment is the data point minus the mean minus the critical value.
# Running totals: "sum" accumulates every summer day across all years,
# "year_sum" accumulates the 83 summer days within a single year
sum = 0
year_means = rep(0,20)
for (i in 2:21){
  year_sum = 0
  for (j in 1:83){
    sum = sum + temp_data[j,i]
    year_sum = year_sum + temp_data[j,i]
  }
  # Mean summer (Jul 1 - Sep 21) temperature for this year
  year_means[i-1] = year_sum/83
}
# Overall mean across all 20 summers (20 years x 83 days)
total_mean = sum/(20*83)
#year_means
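As a side note (a minimal sketch of an equivalent, loop-free way to get the same quantities, assuming the same temp_data), the yearly and overall means can also be computed with colMeans:
# Equivalent vectorized computation of the yearly and overall summer means
year_means_alt <- colMeans(temp_data[1:83, 2:21])  # one mean per year column
total_mean_alt <- mean(year_means_alt)             # same as sum / (20 * 83)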
cusum_2 = function(C,T){
  years_st = rep(0,20)
  for (i in 1:length(year_means)){
    # Increment for detecting warming: yearly mean - overall mean - critical value
    data_differ = year_means[i] - total_mean - C
    if (data_differ <= 0){
      years_st[i] = 0
    }
    else{
      if (i == 1){
        years_st[i] = data_differ
      }
      else{
        years_st[i] = years_st[i-1] + data_differ
      }
    }
  }
  # All years whose running sum is at or above the threshold T
  cross = which(years_st >= T)
  # Year columns are offset by 1 (column 1 holds the dates); strip the leading "X"
  substring(colnames(temp_data[cross + 1]),2)
}
cusum_2(.5,1)
## [1] "2007" "2010" "2011" "2012"
Results - For our warming CUSUM function we decided to use a critical value of 0.5 and a threshold value of 1. Since our individual data points are themselves yearly means, we thought that even a one-degree cumulative change would be significant enough to detect, and a critical value of 0.5 provided a nice cushion while still signaling when the data changed. Ultimately, it is hard to have an intuition about whether this is correct, but our model selected the years 2007, 2010, 2011, and 2012.
Discussion of Results - Overall, our results are consistent with what we see for other values of C and T. For most values of C and T, 2010, 2011, and 2012 always appear among the summers that got warmer in Atlanta, and 2007 is also very common, so I felt satisfied that our model included those years. We also should not expect too many years to be flagged; these 4 years represent only 20% of our data, which feels like an appropriate number of years.
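To make that comparison concrete (a minimal sketch; the particular C and T values below are just illustrative alternatives, and their output is not shown here), a small sweep over a few settings can be run with the function defined above:
# Illustrative sensitivity check over a few (C, T) pairs for the warming CUSUM
for (C in c(0.25, 0.5, 1)) {
  for (threshold in c(1, 2)) {
    cat("C =", C, "T =", threshold, ":", cusum_2(C, threshold), "\n")
  }
}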