Question 5.1
Using crime data from the file uscrime.txt (http://www.statsci.org/data/general/uscrime.txt,
description at http://www.statsci.org/data/general/uscrime.html), test
to see whether there are any outliers in the last column (number of
crimes per 100,000 people). Use the grubbs.test function in the outliers
package in R.
Clear the environment
rm(list = ls())
Install libraries
library(outliers)
library(ggplot2)
Set seed for reproducibility
set.seed(123)
Import data and isolate Crime column
data <- read.table('/Users/djmariano/Downloads/hw3-SP22/uscrime.txt', stringsAsFactor = FALSE, header = TRUE)
data_crime <- data$Crime
Perform a five-number summary
summary(data_crime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 342.0 658.5 831.0 905.1 1057.5 1993.0
Establish hypotheses:
H0 (null hypothesis): There are no outliers in the dataset.
Ha (alternative hypothesis): There are outliers in the dataset.
For hypothesis testing, we will be testing with 95% confidence
level.
Visualize the data
set.seed(123)
plot(data_crime)
boxplot(data_crime)
It appears we might have 3 possible outliers.
Let’s find the maximum value of data_crime.
data_crime_max <- max(data_crime)
data_crime_max
## [1] 1993
Since we’re here, let’s also find the minimum value of data_crime
(because we will still test for the lower end)
data_crime_min <- min(data_crime)
data_crime_min
## [1] 342
We see that 1993 is our maximum value and 342 is our minimum
value.
We will now use the grubbs function to test our hypotheses.
grubbs.test(data_crime, type = 11, opposite = TRUE, two.sided = TRUE)
##
## Grubbs test for two opposite outliers
##
## data: data_crime
## G = 4.26877, U = 0.78103, p-value < 2.2e-16
## alternative hypothesis: 342 and 1993 are outliers
We reject the null hypothesis that there are no outliers present
because the p-value is lower than 2.2e-16 which is less than 0.5
We will now look for the outliers.
Here, will test for only one outlier at a time by setting type =
10.
set.seed(123)
grubbs.test(data_crime, type = 10, opposite = FALSE, two.sided = FALSE)
##
## Grubbs test for one outlier
##
## data: data_crime
## G = 2.81287, U = 0.82426, p-value = 0.07887
## alternative hypothesis: highest value 1993 is an outlier
We fail to reject the null hypothesis that 1993 is not an outlier
because the p-value of 0.07887 is higher than .05
We will now test with two.sided = TRUE
grubbs.test(data_crime, type = 10, opposite = FALSE, two.sided = TRUE)
##
## Grubbs test for one outlier
##
## data: data_crime
## G = 2.81287, U = 0.82426, p-value = 0.1577
## alternative hypothesis: highest value 1993 is an outlier
We fail to reject the null hypothesis that 1993 is not an outlier
because the p-value of 0.1577 is higher than .05
We will now test to see if there are outliers on the low end of the
distribution by setting opposite = TRUE
grubbs.test(data_crime, type = 10, opposite = TRUE, two.sided = FALSE)
##
## Grubbs test for one outlier
##
## data: data_crime
## G = 1.45589, U = 0.95292, p-value = 1
## alternative hypothesis: lowest value 342 is an outlier
We fail to reject the null hypothesis that 342 is an outlier because
the p-value of 1 is higher than 0.05
We will now test with two.sided = TRUE
grubbs.test(data_crime, type = 10, opposite = TRUE, two.sided = TRUE)
##
## Grubbs test for one outlier
##
## data: data_crime
## G = 1.45589, U = 0.95292, p-value < 2.2e-16
## alternative hypothesis: lowest value 342 is an outlier
We reject the null hypothesis that 342 is not an outlier because the
p-value is less than 2.2e-16 which is lower than .05, and we accept the
alternative hypothesis that 342 is an outlier
We will now test for 2 outliers by setting the type = 11.
grubbs.test(data_crime, type = 11, opposite = TRUE, two.sided = TRUE)
##
## Grubbs test for two opposite outliers
##
## data: data_crime
## G = 4.26877, U = 0.78103, p-value < 2.2e-16
## alternative hypothesis: 342 and 1993 are outliers
We reject the null hypothesis that there are no outliers in the data
set because the p-value is less than 2.2e-16 which is lower than .05,
and we accept the alternative hypothesis that 342 or 1993 are
outliers
We should now test and see whether we still have outliers on the
lower end after removing 342 from our dataset.
data_crime <- data_crime[data_crime != data_crime_min]
Verify that 342 has been removed from our dataset.
min(data_crime)
## [1] 373
Perform grubbs() again to see if it still says we have
outliers.
grubbs.test(data_crime, type = 11, opposite = TRUE, two.sided = TRUE)
##
## Grubbs test for two opposite outliers
##
## data: data_crime
## G = 4.24394, U = 0.77737, p-value < 2.2e-16
## alternative hypothesis: 373 and 1993 are outliers
Grubbs still says we have outliers, so I will test removing 1993
(since we removed the lowest end value) and see if it thinks whether or
not we still have outliers.
data_crime <- data_crime[data_crime != data_crime_max]
Verify that 1993 has been removed from our dataset.
max(data_crime)
## [1] 1969
Perform grubbs() again to see if it still says we have
outliers.
grubbs.test(data_crime, type = 11, opposite = TRUE, two.sided = TRUE)
##
## Grubbs test for two opposite outliers
##
## data: data_crime
## G = 4.56671, U = 0.73301, p-value = 0.6214
## alternative hypothesis: 373 and 1969 are outliers
We fail to reject the null hypothesis that either 373 or 1969 are
outliers because the p-value is 0.6214 which is greater than 0.05
After removing 342 and 1993 from our dataset, we can run this two
sided two opposite outlier grubbs test and it tells us that we have no
more outliers in this dataset.
342 and 1993 are outliers, however, 342 was only considered as an
outlier according to the one-outlier two tailed grubbs test.
Question 6.1
Describe a situation or problem from your job, everyday life, current
events, etc., for which a Change Detection model would be appropriate.
Applying the CUSUM technique, how would you choose the critical value
and the threshold?
I play a videogame where the playable characters can attack other
players and monsters, and the integer that gets calculated for damage is
derived from the bonuses that the character’s weapon and equipment gives
them, among other factors.
I could use CUSUM to calculate when another player drinks a potion
that gives them higher bonuses, which lets them inflict a higher amount
of damage over time on average.
Let’s say we gather the data for 200 attacks (xt) and calculate the
average of damage for all attacks (µ) from this player, and somewhere
during that time, they drink that potion, and we want to find out when
they did.
For this example, the normal damage roll without the potion is
between 1 and 50, and the damage roll with the potion is between
1-60.
C could be picked by calculating the standard deviation of all xt and
divding it in half. For this example (that I simulated on the side), C
would be 8
T is picked by choosing a value that properly indicates around where
the data begins to show a trend in a growing St value. In this
simulation, I chose T = 18.
In my simulated example, the player drinks their potion after exactly
100 hits and hits another 100 times. By picking this C and T, the St
>= 0 begins to trend upward and show a continued increase in damage
around attack 130.
In my simulation, the character hit an unlucky amount of low attacks
between 100-130 where setting any value of C and T wouldn’t have been
likely to realistically better distinguish when the player drank the
potion
However, if we had more datapoints (xT) collected, the time the
player drink’s the potion would be much more distinguishable and
obvious.
Question 6.2.1
Using July through October daily-high-temperature data for Atlanta
for 1996 through 2015, use a CUSUM approach to identify when unofficial
summer ends (i.e., when the weather starts cooling off) each year.
You can get the data that you need from the file temps.txt or online,
for example at http://www.iweathernet.com/atlanta-weather-records or https://www.wunderground.com/history/airport/KFTY/2015/7/1/CustomHistory.html.\
You can use R if you’d like, but it’s straightforward enough that an
Excel spreadsheet can easily do the job too.
This question was answered using Excel. I’ve provided a spreadsheet
with all of my calculations (first tab of the workbook included, labeled
6.2.1)
I calculated St with the following formula: St = max{0,St-1+(µ-xt-C)}
because we are trying to detect a decrease.
I chose µ by calculating the mean of all xt for all days and for all
years, which was 83.34.
I chose C by calculating the average of all daily recorded
temperatures for every year, took the standard deviation of all those
average temperatures which was 6.70, and took half of that.
I chose T to be 15 after testing different values and I found that T
= 15 was a good threshold for detecting which day is the “unofficial”
end of summer for each year.
The following is a screenshot of my spreadsheet indicating, with
green highlighting, all the times where St >= T for each instance of
xt.
Please note I had to “scrunch” up some of the rows and crop the
picture in order to fit the “green” highlighted cells. You can have a
full view and understanding of my work by downloading the workbook and
opening it in Excel or Sheets.
The following list is which day I would consider to be the
“unofficial” end of summer for each year according to the data and
calculations done in Excel and provided in the spreadsheet:
1996: 30-Sept
1997: 25-Sept
1998: 8-Oct
1999: 22-Sept
2000: 17-Sept, with a drop in temperatures in early September
2001: 25-Sept
2002: 8-Oct (could be earlier, specifically 25-Sept, but with context
I’d still say 8-Oct)
2003: 29-Sept
2004: 19-Sept
2005: 7-Oct
2006: 7-Oct, with some colder temperatures observed in late Sept/early
Oct
2007: 12-Oct
2008: 10-Oct
2009: 1-Oct
2010: 3-Oct
2011: 1-Oct (with some cold temperatures observed in the second week of
Sept)
2012: 2-Oct
2013: 25-Sept (with abnormally low temperatures observed in the middle
of August)
2014: 27-Sept
2015: 25-Sept
Question 6.2.2
Use a CUSUM approach to make a judgment of whether Atlanta’s summer
climate has gotten warmer in that time (and if so, when).
This question was answered using Excel. I’ve provided a spreadsheet
with all of my calculations (second tab of the workbook included,
labeled 6.2.2)
For this question, I took the average of each year’s temperatures and
put them on a different worksheet. I put the corresponding year next to
each of the averages calculated.
I chose µ by calculating the mean of all xt for all days and for all
years, which was 83.34.
I chose C by calculating the standard deviation for all of the
average temperatures per year, which was 1.5824, and then took half of
that.
I chose T = 1 (just figured it was a fitting choice for the
calculated values of St)
The following is a screenshot of my spreadsheet indicating, with
green highlighting, all the times where St >= T for each instance of
xt.
It could be argued that Atlanta has been experiencing higher
temperatures since 2007, with hotter years from 2010 and 2011 to support
that, but I don’t believe the data clearly indicates that there has been
an overall rise in temperatures.