HW3

Sets global option so that all code chunks will display their source code by default when rendering the document.

Question 5.1 Using crime data from the file uscrime.txt (http://www.statsci.org/data/general/uscrime.txt, description at http://www.statsci.org/data/general/uscrime.html), test to see whether there are any outliers in the last column (number of crimes per 100,000 people). Use the grubbs.test function in the outliers package in R.

Methodology Used grubbs.test function from outliars package to identify outliar from the dataset. In this workbook we tried to idenitfy if there is any ouliar data available on crime column.

#reading txt file 
data<-read.table("C:/Users/bardh/OneDrive/Desktop/Gtech/ISYE6501/HW3/Homework3_ISYE_6501/Homework3_ISYE_6501/data 5.1/uscrime.txt", stringsAsFactor = FALSE, header = TRUE, sep = "")

head(data,5)

#install.packages("outliers") 
library(outliers)

# Run Grubbs’ test
grubbs.test(data$Crime)

## 
##  Grubbs test for one outlier
## 
## data:  data$Crime
## G = 2.81287, U = 0.82426, p-value = 0.07887
## alternative hypothesis: highest value 1993 is an outlier

Result Discussion: With the Grubbs’ test I ran, the outlier detection is only based on the crime rate column. Other parameters in the dataset are ignored unless we bring them into a model or use multivariate methods.It’s not checking whether crime is unusually high given unemployment or given population density. It’s just treating the crime column in isolation.

G measures how far the most extreme value (max or min) is from the mean, in units of standard deviations. Larger G → stronger evidence that the extreme value is an outlier.

G=2.81287 means the highest value is 2.81287 standard deviation apart from mean.

p-value : The probability (under the null hypothesis) of observing an extreme value as far away from the mean as the one detected. If, in reality, there were no outliers in the data, the chance of getting a result at least this extreme (a value like 1993) is about 7.9%. The p-value answers: “If nothing unusual is going on, how surprising is my data?” A small p-value (say < 0.05) means: “This is too surprising under H0 → probably something is going on (maybe an outlier).” A large p-value means: “This is not very surprising under H0 → nothing unusual detected.”

p = 0.079 → If there were no outliers, we’d see a value like 1993 about 8 times out of 100 just by random chance.

That’s not super rare → so at the usual 5% threshold, we don’t declare it an outlier. But if our threshold is 10% , then a value like 1993 can be declared as outliar.

This means that, statistically, we do not have strong enough evidence to call 1993 an outlier at the 5% level

The test always phrases the alternative as “the extreme value is an outlier.” But whether to accept or reject that statement depends on the p-value relative to our chosen significance level. In our case → 1993 is not significant at 5%, but could be flagged at 10%.

Question 6.1 Describe a situation or problem from your job, everyday life, current events, etc., for which a Change Detection model would be appropriate. Applying the CUSUM technique, how would you choose the critical value and the threshold?

Example situation

We live in a city where the average summer temperature in July is normally around 86F. We wonder if this year it’s getting hotter than usual. Instead of waiting for the whole summer to finish, we want a model to alert us if the average temperature shifts upward. That’s where CUSUM comes in.

How CUSUM works (with temperatures)

Every day we look at the day’s temperature and compare it to the usual average (86 F). If today is hotter, we add a little to “hot score.” If today is cooler, we subtract or reset. If the “hot score” keeps climbing day after day, it means the summer has actually shifted hotter — not just a random warm afternoon.

The span (or time window) is also very important in CUSUM.

Small Span (fast detection): Good for catching sudden changes quickly. Risk: More false positives from outlier events.

Large Span (slow detection): Smooths out temporary bursts. Better for distinguishing true shifts from short-term outliers. Risk: May miss early warnings.

Setting the two dials

Critical value (C) Suppose we care about catching a +4F shift from mean. Then set C = 3. Threshold (T) → How confident should I be before saying “it’s hotter this year”? If we pick T = 6, we will detect changes faster, but we might get false alarms after a short heatwave. If we pick T = 15, we will only get an alarm when the warmer trend is really consistent.

Question 6.2 1. Using July through October daily-high-temperature data for Atlanta for 1996 through 2015, use a CUSUM approach to identify when unofficial summer ends (i.e., when the weather starts cooling off) each year. You can get the data that you need from the file temps.txt or online, for example at http://www.iweathernet.com/atlanta-weather-records or https://www.wunderground.com/history/airport/KFTY/2015/7/1/CustomHistory.html . You can use R if you’d like, but it’s straightforward enough that an Excel spreadsheet can easily do the job too.

Methodology

Parse DAY (no year) by attaching each column’s year, Baseline is taken. Mean of July+August highs (per year),as July and August are considered Summer peak temperatures. CUSUM is calculated with C = 5 and threshold = 20F,so basically we are making model less sensitive upto 5 F temperature change and CUSUM threashold we are taking at 20 F which means model will flag warning when threshold is >= 20. CUSUM also will be reset to 0 when there is a dip bringing temperature less than mean. We are plotting day-of-year when CUSUM is >= 20.

#install.packages("dplyr")
#install.packages("tidyr")
#install.packages("lubridate") 
#install.packages("readr")
#install.packages("purrr")
#install.packages("ggplot2")

# --- Packages ---
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(readr)
library(purrr)
library(ggplot2)
# --- Load the wide data ---

df_wide <- read_tsv("C:/Users/bardh/OneDrive/Desktop/Gtech/ISYE6501/HW3/Homework3_ISYE_6501/Homework3_ISYE_6501/data 6.2/temps.txt")  # or # ---- Minimal, copy-paste R ----

## Rows: 123 Columns: 21

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr  (1): DAY
## dbl (20): 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(df_wide,5)

# Packages
library(readr)
library(dplyr)
library(tidyr)
library(lubridate)

# ---- Load wide table (first column 'DAY' = "1-Jul", others are years) ----
names(df_wide)[1] <- "DAY"

# Parse month/day by adding a dummy year just to read the DAY strings
md <- as.Date(paste0(df_wide$DAY, "-2001"), format = "%d-%b-%Y")

# Long format with real dates per year
df_long <- df_wide %>%
  mutate(Month = month(md), Day = day(md)) %>%
  pivot_longer(-c(DAY, Month, Day), names_to = "Year", values_to = "HighTemp") %>%
  mutate(
    Year = as.integer(Year),
    Date = make_date(Year, Month, Day)
  ) %>%
  filter(Month %in% 7:10) %>%           # July–October only
  arrange(Year, Date)

# ---- CUSUM parameters (°F) ----
C <- 5    # allowance C
T <- 20   # threshold

# Cooling CUSUM: S_t = max(0, S_{t-1} + (mu - x_t - C))
cusum_cooling <- function(x, mu, C = 5) {
  n <- length(x); S <- numeric(n)
  for (i in seq_len(n)) {
    inc <- (mu - x[i]) - C
    if (inc > 0) {
    S[i] <- (if (i == 1) 0 else S[i-1]) + inc
  } else {
    S[i] <- 0
  }
  }
  S
}

detect_end <- function(d) {
  mu <- mean(d$HighTemp[d$Month %in% c(7, 8)], na.rm = TRUE)  # July+Aug mean
  if (is.na(mu)) return(tibble(Baseline = NA_real_, EndOfSummer = as.Date(NA)))
  S <- cusum_cooling(d$HighTemp, mu, C)
  hit <- which(S >= T)[1]
  tibble(
    Baseline   = mu,
    EndOfSummer = if (is.na(hit)) as.Date(NA) else d$Date[hit]
  )
}

# ---- Run per year and print the list ----
results <- df_long %>%
  group_by(Year) %>%
  group_modify(~ detect_end(.x)) %>%
  ungroup()

# List format: "YYYY: Mon DD" (or "No signal")
out <- results %>%
  mutate(line = ifelse(is.na(EndOfSummer),
                       sprintf("%d: No signal", Year),
                       sprintf("%d: %s", Year, format(EndOfSummer, "%b %d")))) %>%
  pull(line)

cat(paste(out, collapse = "\n"))

## 1996: Sep 20
## 1997: Sep 26
## 1998: Oct 06
## 1999: Jul 13
## 2000: Jul 26
## 2001: Sep 25
## 2002: Sep 24
## 2003: Sep 29
## 2004: Sep 16
## 2005: Oct 07
## 2006: Sep 14
## 2007: Sep 16
## 2008: Sep 17
## 2009: Sep 02
## 2010: Sep 28
## 2011: Sep 06
## 2012: Sep 14
## 2013: Aug 17
## 2014: Sep 26
## 2015: Sep 14

Plotting date against year to idenitfy which date of each year to be consider as unofficial end of summer.

library(dplyr)
library(lubridate)
library(ggplot2)

# Use a reference (non-leap) year so y-axis shows month–day instead of year
plot_df <- results %>%
  filter(!is.na(EndOfSummer)) %>%
  mutate(End_MD = make_date(2001, month(EndOfSummer), day(EndOfSummer)))

ggplot(plot_df, aes(x = Year, y = End_MD)) +
  geom_line() +
  geom_point() +
  scale_y_date(date_labels = "%b %d", date_breaks = "1 week") +
  labs(
    title = "End of Summer (CUSUM: C = 5, threshold = 20)",
    x = "Year",
    y = "End-of-summer date (month–day)"
  ) +
  theme_minimal()

Plotting the CUSUM curve for a single year.This helps sanity-check where the threshold is crossed. We can change the year in plot_cusum_year(df_long, 2015, C = 5, T = 20) function to check for each year.

# Requires df_long from the earlier data-prep (Jul–Oct rows with columns Year, Date, Month, HighTemp)

C <- 5
T <- 20

plot_cusum_year <- function(df_long, year, C = 5, T = 20) {
  d <- df_long %>% dplyr::filter(Year == year) %>% dplyr::arrange(Date)
  mu <- mean(d$HighTemp[d$Month %in% c(7, 8)], na.rm = TRUE)
  inc <- (mu - d$HighTemp) - C
  S <- numeric(nrow(d))
  for (i in seq_len(nrow(d))) {
    S[i] <- max(0, (if (i == 1) 0 else S[i-1]) + inc[i])
  }
  hit <- which(S >= T)[1]

  gg <- ggplot(d, aes(Date, S)) +
    geom_line() +
    geom_hline(yintercept = T, linetype = "dashed") +
    labs(
      title = paste0("CUSUM (cooling) for ", year, "  —  mu(Jul–Aug) = ", round(mu, 1)),
      x = "Date",
      y = "CUSUM"
    ) +
    theme_minimal()

  if (!is.na(hit)) {
    gg <- gg + geom_vline(xintercept = d$Date[hit], linetype = "dotted") +
      annotate("text", x = d$Date[hit], y = T, vjust = -0.8,
               label = paste("Crosses", T, "on", format(d$Date[hit], "%b %d")))
  }
  gg
}

# Example: Can change the year to plot for each
plot_cusum_year(df_long, 2015, C = 5, T = 20)

Result discussion What we have observed from last 10 years data is mostly mid of September is the unofficial time when we can call summer ends.However Early 2000s from data we can observe Summer used to last until end of September.

2. Use a CUSUM approach to make a judgment of whether Atlanta’s summer climate has gotten warmer in that time (and if so, when).

Methodology

Calculated mu = grand mean of July+August means (1996–2015). Calculated summer mean of every year. Checked if summer mean is greater than grand mean, then counted to CUSUM. I have set allowance to 1 F and threshold to 2 F . This means if CUSUM is crossing 2F then that year will be reported as the year when Atlata started warming.

df <- read_delim("C:/Users/bardh/OneDrive/Desktop/Gtech/ISYE6501/HW3/Homework3_ISYE_6501/Homework3_ISYE_6501/data 6.2/temps.txt", delim = "\t", show_col_types = FALSE)
head(df,5)

# ---- Hard-reset, one-sided warming CUSUM on Jul+Aug means ----
# k = 1 F (allowance), h = 2 F (threshold). Reset to 0 whenever year <= mu.

library(readr)
library(lubridate)
library(dplyr)
library(ggplot2)

# 1) Load wide file (first col = DAY "1-Jul", rest = years)
names(df)[1] <- "DAY"

# 2) Find July/Aug rows by parsing DAY with a dummy year 1
md <- as.Date(paste0(df$DAY, "-2001"), format = "%d-%b-%Y")
summer_rows <- month(md) %in% c(7, 8)

# 3) Per-year July+August mean (F)
yr_means <- sapply(df[summer_rows, -1], function(x) mean(as.numeric(x), na.rm = TRUE))
tab <- tibble(Year = as.integer(names(yr_means)),
              SummerMean = as.numeric(yr_means)) %>% arrange(Year)

# 4) Grand mean across years (baseline mu)
mu <- mean(tab$SummerMean, na.rm = TRUE)

# 5) Hard-reset CUSUM across years
k <- 1  # allowance
h <- 2  # threshold

S <- numeric(nrow(tab))
Excess <- numeric(nrow(tab))
Reset <- logical(nrow(tab))
acc <- 0

for (i in seq_len(nrow(tab))) {
  x <- tab$SummerMean[i]
  if (is.na(x) || x <= mu) {       # <-- HARD RESET when at/below mu
    acc <- 0
    Reset[i] <- TRUE
    Excess[i] <- 0
  } else {
    e <- max(0, (x - mu) - k)      # only add if above mu + k
    Excess[i] <- e
    acc <- acc + e
  }
  S[i] <- acc
}

tab <- tab %>%
  mutate(GrandMean = mu,
         Excess = Excess,
         CUSUM = S,
         Reset = Reset)

# 6) First crossing year (if any)
hit_idx <- which(tab$CUSUM >= h)[1]
hit_year <- if (length(hit_idx) && !is.na(hit_idx)) tab$Year[hit_idx] else NA

# 7) Print a concise list
cat(sprintf("Grand mean (Jul+Aug, 1996-2015): %.2f F\n\n", mu))

## Grand mean (Jul+Aug, 1996-2015): 88.68 F

apply(tab, 1, function(r)
  cat(sprintf("%d: mean=%.1f F | Excess=%.2f | CUSUM=%.2f | Reset=%s\n",
              as.integer(r["Year"]), as.numeric(r["SummerMean"]),
              as.numeric(r["Excess"]), as.numeric(r["CUSUM"]),
              ifelse(r["Reset"], "YES","no"))))

## 1996: mean=89.6 F | Excess=0.00 | CUSUM=0.00 | Reset=no
## 1997: mean=86.5 F | Excess=0.00 | CUSUM=0.00 | Reset=YES
## 1998: mean=88.2 F | Excess=0.00 | CUSUM=0.00 | Reset=YES
## 1999: mean=89.6 F | Excess=0.00 | CUSUM=0.00 | Reset=no
## 2000: mean=91.4 F | Excess=1.72 | CUSUM=1.72 | Reset=no
## 2001: mean=86.7 F | Excess=0.00 | CUSUM=0.00 | Reset=YES
## 2002: mean=89.2 F | Excess=0.00 | CUSUM=0.00 | Reset=no
## 2003: mean=86.2 F | Excess=0.00 | CUSUM=0.00 | Reset=YES
## 2004: mean=86.5 F | Excess=0.00 | CUSUM=0.00 | Reset=YES
## 2005: mean=87.0 F | Excess=0.00 | CUSUM=0.00 | Reset=YES
## 2006: mean=90.0 F | Excess=0.28 | CUSUM=0.28 | Reset=no
## 2007: mean=91.2 F | Excess=1.53 | CUSUM=1.81 | Reset=no
## 2008: mean=87.7 F | Excess=0.00 | CUSUM=0.00 | Reset=YES
## 2009: mean=87.1 F | Excess=0.00 | CUSUM=0.00 | Reset=YES
## 2010: mean=91.3 F | Excess=1.62 | CUSUM=1.62 | Reset=no
## 2011: mean=92.7 F | Excess=3.03 | CUSUM=4.65 | Reset=no
## 2012: mean=90.9 F | Excess=1.19 | CUSUM=5.84 | Reset=no
## 2013: mean=84.8 F | Excess=0.00 | CUSUM=0.00 | Reset=YES
## 2014: mean=87.4 F | Excess=0.00 | CUSUM=0.00 | Reset=YES
## 2015: mean=89.4 F | Excess=0.00 | CUSUM=0.00 | Reset=no

## NULL

cat("\n")

if (!is.na(hit_year)) {
  cat(sprintf(">>> CUSUM first crosses %.1f F in %d.\n", h, hit_year))
} else {
  cat(sprintf(">>> CUSUM never reaches %.1f F with k=%.1f under hard-reset rule.\n", h, k))
}

## >>> CUSUM first crosses 2.0 F in 2011.

# 8) Plots

#  Hard-reset CUSUM with threshold
p <- ggplot(tab, aes(Year, CUSUM)) +
  geom_line() + geom_point() +
  geom_hline(yintercept = h, linetype = "dashed") +
  labs(title = "Hard-reset CUSUM of Summer (Jul–Aug) Means",
       subtitle = paste0("Reset to 0 when year <= mu; k = ", k, " F, h = ", h, " F"),
       x = "Year", y = "CUSUM ( F)") +
  theme_minimal()
if (!is.na(hit_year)) {
  p <- p + geom_vline(xintercept = hit_year, linetype = "dotted") +
    annotate("text", x = hit_year, y = max(tab$CUSUM, na.rm = TRUE),
             label = paste("First cross:", hit_year), vjust = -0.5)
}
print(p)

Discussion of Result: We observe that in the year 2011 first time the threashold of 2 F is crossed so, the year 2011 is identified as the year when whether Atlanta’s summer climate has gotten warmer.

HW3

2025-09-06