Question 5.1

Using crime data from the file uscrime.txt (http://www.statsci.org/data/general/uscrime.txt, description at http://www.statsci.org/data/general/uscrime.html), test to see whether there are any outliers in the last column (number of crimes per 100,000 people). Use the grubbs.test function in the outliers package in R.

Clear the environment

rm(list = ls())

Install libraries

library(outliers)
library(ggplot2)

Set seed for reproducibility

set.seed(123)

Import data and isolate Crime column

data <- read.table('/Users/djmariano/Downloads/hw3-SP22/uscrime.txt', stringsAsFactor = FALSE, header = TRUE)
data_crime <- data$Crime

Perform a five-number summary

summary(data_crime)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   342.0   658.5   831.0   905.1  1057.5  1993.0

Establish hypotheses:
H0 (null hypothesis): There are no outliers in the dataset.
Ha (alternative hypothesis): There are outliers in the dataset.
For hypothesis testing, we will be testing with 95% confidence level.

Visualize the data

set.seed(123)
plot(data_crime)

boxplot(data_crime)

It appears we might have 3 possible outliers.
Let’s find the maximum value of data_crime.

data_crime_max <- max(data_crime)
data_crime_max
## [1] 1993

Since we’re here, let’s also find the minimum value of data_crime (because we will still test for the lower end)

data_crime_min <- min(data_crime)
data_crime_min
## [1] 342

We see that 1993 is our maximum value and 342 is our minimum value.

We will now use the grubbs function to test our hypotheses.

grubbs.test(data_crime, type = 11, opposite = TRUE, two.sided = TRUE)
## 
##  Grubbs test for two opposite outliers
## 
## data:  data_crime
## G = 4.26877, U = 0.78103, p-value < 2.2e-16
## alternative hypothesis: 342 and 1993 are outliers

We reject the null hypothesis that there are no outliers present because the p-value is lower than 2.2e-16 which is less than 0.5
We will now look for the outliers.

Here, will test for only one outlier at a time by setting type = 10.

set.seed(123)
grubbs.test(data_crime, type = 10, opposite = FALSE, two.sided = FALSE)
## 
##  Grubbs test for one outlier
## 
## data:  data_crime
## G = 2.81287, U = 0.82426, p-value = 0.07887
## alternative hypothesis: highest value 1993 is an outlier

We fail to reject the null hypothesis that 1993 is not an outlier because the p-value of 0.07887 is higher than .05

We will now test with two.sided = TRUE

grubbs.test(data_crime, type = 10, opposite = FALSE, two.sided = TRUE)
## 
##  Grubbs test for one outlier
## 
## data:  data_crime
## G = 2.81287, U = 0.82426, p-value = 0.1577
## alternative hypothesis: highest value 1993 is an outlier

We fail to reject the null hypothesis that 1993 is not an outlier because the p-value of 0.1577 is higher than .05

We will now test to see if there are outliers on the low end of the distribution by setting opposite = TRUE

grubbs.test(data_crime, type = 10, opposite = TRUE, two.sided = FALSE)
## 
##  Grubbs test for one outlier
## 
## data:  data_crime
## G = 1.45589, U = 0.95292, p-value = 1
## alternative hypothesis: lowest value 342 is an outlier

We fail to reject the null hypothesis that 342 is an outlier because the p-value of 1 is higher than 0.05

We will now test with two.sided = TRUE

grubbs.test(data_crime, type = 10, opposite = TRUE, two.sided = TRUE)
## 
##  Grubbs test for one outlier
## 
## data:  data_crime
## G = 1.45589, U = 0.95292, p-value < 2.2e-16
## alternative hypothesis: lowest value 342 is an outlier

We reject the null hypothesis that 342 is not an outlier because the p-value is less than 2.2e-16 which is lower than .05, and we accept the alternative hypothesis that 342 is an outlier

We will now test for 2 outliers by setting the type = 11.

grubbs.test(data_crime, type = 11, opposite = TRUE, two.sided = TRUE)
## 
##  Grubbs test for two opposite outliers
## 
## data:  data_crime
## G = 4.26877, U = 0.78103, p-value < 2.2e-16
## alternative hypothesis: 342 and 1993 are outliers

We reject the null hypothesis that there are no outliers in the data set because the p-value is less than 2.2e-16 which is lower than .05, and we accept the alternative hypothesis that 342 or 1993 are outliers

We should now test and see whether we still have outliers on the lower end after removing 342 from our dataset.

data_crime <- data_crime[data_crime != data_crime_min]

Verify that 342 has been removed from our dataset.

min(data_crime)
## [1] 373

Perform grubbs() again to see if it still says we have outliers.

grubbs.test(data_crime, type = 11, opposite = TRUE, two.sided = TRUE)
## 
##  Grubbs test for two opposite outliers
## 
## data:  data_crime
## G = 4.24394, U = 0.77737, p-value < 2.2e-16
## alternative hypothesis: 373 and 1993 are outliers

Grubbs still says we have outliers, so I will test removing 1993 (since we removed the lowest end value) and see if it thinks whether or not we still have outliers.

data_crime <- data_crime[data_crime != data_crime_max]

Verify that 1993 has been removed from our dataset.

max(data_crime)
## [1] 1969

Perform grubbs() again to see if it still says we have outliers.

grubbs.test(data_crime, type = 11, opposite = TRUE, two.sided = TRUE)
## 
##  Grubbs test for two opposite outliers
## 
## data:  data_crime
## G = 4.56671, U = 0.73301, p-value = 0.6214
## alternative hypothesis: 373 and 1969 are outliers

We fail to reject the null hypothesis that either 373 or 1969 are outliers because the p-value is 0.6214 which is greater than 0.05

After removing 342 and 1993 from our dataset, we can run this two sided two opposite outlier grubbs test and it tells us that we have no more outliers in this dataset.
342 and 1993 are outliers, however, 342 was only considered as an outlier according to the one-outlier two tailed grubbs test.

Question 6.1

Describe a situation or problem from your job, everyday life, current events, etc., for which a Change Detection model would be appropriate. Applying the CUSUM technique, how would you choose the critical value and the threshold?

I play a videogame where the playable characters can attack other players and monsters, and the integer that gets calculated for damage is derived from the bonuses that the character’s weapon and equipment gives them, among other factors.

I could use CUSUM to calculate when another player drinks a potion that gives them higher bonuses, which lets them inflict a higher amount of damage over time on average.

Let’s say we gather the data for 200 attacks (xt) and calculate the average of damage for all attacks (µ) from this player, and somewhere during that time, they drink that potion, and we want to find out when they did.

For this example, the normal damage roll without the potion is between 1 and 50, and the damage roll with the potion is between 1-60.

C could be picked by calculating the standard deviation of all xt and divding it in half. For this example (that I simulated on the side), C would be 8

T is picked by choosing a value that properly indicates around where the data begins to show a trend in a growing St value. In this simulation, I chose T = 18.

In my simulated example, the player drinks their potion after exactly 100 hits and hits another 100 times. By picking this C and T, the St >= 0 begins to trend upward and show a continued increase in damage around attack 130.

In my simulation, the character hit an unlucky amount of low attacks between 100-130 where setting any value of C and T wouldn’t have been likely to realistically better distinguish when the player drank the potion

However, if we had more datapoints (xT) collected, the time the player drink’s the potion would be much more distinguishable and obvious.

Question 6.2.1

Using July through October daily-high-temperature data for Atlanta for 1996 through 2015, use a CUSUM approach to identify when unofficial summer ends (i.e., when the weather starts cooling off) each year.
You can get the data that you need from the file temps.txt or online, for example at http://www.iweathernet.com/atlanta-weather-records or https://www.wunderground.com/history/airport/KFTY/2015/7/1/CustomHistory.html.\ You can use R if you’d like, but it’s straightforward enough that an Excel spreadsheet can easily do the job too.

This question was answered using Excel. I’ve provided a spreadsheet with all of my calculations (first tab of the workbook included, labeled 6.2.1)

I calculated St with the following formula: St = max{0,St-1+(µ-xt-C)} because we are trying to detect a decrease.

I chose µ by calculating the mean of all xt for all days and for all years, which was 83.34.

I chose C by calculating the average of all daily recorded temperatures for every year, took the standard deviation of all those average temperatures which was 6.70, and took half of that.

I chose T to be 15 after testing different values and I found that T = 15 was a good threshold for detecting which day is the “unofficial” end of summer for each year.

The following is a screenshot of my spreadsheet indicating, with green highlighting, all the times where St >= T for each instance of xt.

Please note I had to “scrunch” up some of the rows and crop the picture in order to fit the “green” highlighted cells. You can have a full view and understanding of my work by downloading the workbook and opening it in Excel or Sheets.

Screenshot 6.2.1
Screenshot 6.2.1

The following list is which day I would consider to be the “unofficial” end of summer for each year according to the data and calculations done in Excel and provided in the spreadsheet:

1996: 30-Sept
1997: 25-Sept
1998: 8-Oct
1999: 22-Sept
2000: 17-Sept, with a drop in temperatures in early September
2001: 25-Sept
2002: 8-Oct (could be earlier, specifically 25-Sept, but with context I’d still say 8-Oct)
2003: 29-Sept
2004: 19-Sept
2005: 7-Oct
2006: 7-Oct, with some colder temperatures observed in late Sept/early Oct
2007: 12-Oct
2008: 10-Oct
2009: 1-Oct
2010: 3-Oct
2011: 1-Oct (with some cold temperatures observed in the second week of Sept)
2012: 2-Oct
2013: 25-Sept (with abnormally low temperatures observed in the middle of August)
2014: 27-Sept
2015: 25-Sept

Question 6.2.2

Use a CUSUM approach to make a judgment of whether Atlanta’s summer climate has gotten warmer in that time (and if so, when).

This question was answered using Excel. I’ve provided a spreadsheet with all of my calculations (second tab of the workbook included, labeled 6.2.2)

For this question, I took the average of each year’s temperatures and put them on a different worksheet. I put the corresponding year next to each of the averages calculated.

I chose µ by calculating the mean of all xt for all days and for all years, which was 83.34.

I chose C by calculating the standard deviation for all of the average temperatures per year, which was 1.5824, and then took half of that.

I chose T = 1 (just figured it was a fitting choice for the calculated values of St)

The following is a screenshot of my spreadsheet indicating, with green highlighting, all the times where St >= T for each instance of xt.

Screenshot 6.2.2
Screenshot 6.2.2

It could be argued that Atlanta has been experiencing higher temperatures since 2007, with hotter years from 2010 and 2011 to support that, but I don’t believe the data clearly indicates that there has been an overall rise in temperatures.