Load libraries

library(scales)
library(tidyr)
library(dplyr)
library(knitr)
library(ggthemes)
library(ggplot2)
library(tidyverse)
library(gridExtra)
library(scatterplot3d)
library(outliers)
library(reshape2)
library(lubridate)
library(plotly)
library(qcc)

CUSUM Change Detection exercise

From SPC Control Chart Handbook:

CUSUM charts are good at detecting small shifts away from the target, because they incorporate information from the sequence of sample values. The plotted points are the cumulative sums of the deviations of the sample values from the target. These points should fluctuate randomly around zero. If a trend develops upwards or downwards, it should be considered as evidence that the process mean has shifted from the norm.
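For intuition, here is a minimal sketch of the one-sided (lower) CUSUM recursion the handbook describes, written in base R. The function name cusum_lower, the target mu, the reference value k, the threshold H and the toy data are illustrative choices only; they are not part of the qcc package used below.

#One-sided lower CUSUM: S_t = max(0, S_{t-1} + (mu - x_t - k)); a downward shift is signalled when S_t >= H
cusum_lower <- function(x, mu, k, H) {
  S <- numeric(length(x))
  s <- 0
  for (t in seq_along(x)) {
    s <- max(0, s + (mu - x[t] - k)) #accumulate downward deviations beyond the slack k
    S[t] <- s
  }
  which(S >= H) #indices at which a downward shift is flagged
}

#Toy usage with made-up numbers (not from temps.csv): mean drops from 80 to 70 after observation 30
set.seed(1)
x.toy <- c(rnorm(30, mean = 80, sd = 5), rnorm(10, mean = 70, sd = 5))
cusum_lower(x.toy, mu = 80, k = 2.5, H = 20)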


df <- read.csv(file = "temps.csv", stringsAsFactors = FALSE, header = TRUE)
#Strip the leading "X" that read.csv adds to the year column names
df1 <- df %>% rename_at(vars(starts_with("X")), ~ str_replace(., "X", ""))
head(df1,2)

Interactive BoxPlot

#Reshape from wide to long format
df2 <- melt(df1, id.vars = c("DAY"), variable.name = "YEAR", value.name = "TEMP")
#head(df2,2)
#Interactive boxplot of expenses by year
fig1 <- plot_ly(df2, y = ~TEMP, x = ~YEAR, type = "box")

fig1 <- fig1 %>% layout(title = "Fig1: Boxplots of hypothetical Daily Expenses over 20 Years",
         xaxis = list(title = "Year"),
         yaxis = list(title = "Expenses, $")) %>% layout(plot_bgcolor = 'rgb(255, 255, 188)') #RGB channels must be in 0-255
fig1

Observation1:


#mean expenses of each year
temp.mean<-aggregate(TEMP~YEAR, data=df2, mean)
head(temp.mean,5)
#expenses Standard Deviation of each year
temp.sd<-aggregate(TEMP~YEAR, data=df2, sd)
head(temp.sd,5)
#Standard Deviation of entire data; this is to get a feel of the overall Std Deviation
sd.df1<-sd(as.matrix(df1[,2:21]), na.rm = TRUE)
sd.df1
## [1] 8.620253

ggplot(data = temp.mean, aes(x = YEAR, y = TEMP, group = 1)) +
  geom_line(color = "red") + geom_point() +
  ggtitle("Fig2: Line Chart of Avg Expenses over 20 Yrs") + theme_excel()


Observation1a:


CUSUM Algorithm

  • Loop over the 20 years to detect violations
  1. Using the default H, or Decision Interval, of 4 sigma. H is equivalent to the threshold T

  2. Using a Shift (se.shift) of 3, i.e., a reference value of Shift/2 = 1.5 standard errors away from the target (see the short mapping sketch after this list)
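As a sanity check, this is how the qcc arguments are assumed to map to thresholds on the raw data scale when observations are individual values (group size 1). This is an illustrative sketch, not part of the original analysis; the names sigma.example, k.example and H.example are made up.

#Assumed mapping for individual observations:
#  reference value (slack)  k = (se.shift / 2) * std.dev  -> C = 1.5 * sigma here
#  decision threshold       H = decision.interval * std.dev -> T = 4 * sigma here
sigma.example <- 8.6 #roughly the overall standard deviation sd.df1 computed above
k.example <- (3 / 2) * sigma.example
H.example <- 4 * sigma.example
c(k = k.example, H = H.example)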

df3<-select(df1,2:21)
CUSUMmodels <- vector(mode="list", length=ncol(df3)) 
CUSUMviolations <- vector(mode="list", length=ncol(df3)) 

DI = 4 #decision interval H of 4 standard deviations; equivalent to the threshold T
Shift = 3 #se.shift; reference value C = Shift/2 = 1.5 sigma
for (i in 1:ncol(df3)){#looping over years
  CUSUMmodels[[i]] <- cusum(df3[,i], center=temp.mean$TEMP[i], std.dev = temp.sd$TEMP[i], decision.interval=DI, se.shift=Shift, plot = TRUE)
  CUSUMviolations[[i]] <- CUSUMmodels[[i]]$violations$lower #indices within the year where lower (low-expense) violations occur
}

  • Based on above T and C:

Years 1997, 1998, 2000, 2001, 2002, 2004, 2005, 2006, 2007, 2008, 2009, 2012 and 2013 had low-expense violations

#looking for minimum expense values and when it occurs
no.violations<-13 #from CUSUM plots above
CUSUMlowest <- vector(mode="list", length=no.violations) 
#Year groups (columns) with violations from the loop above, e.g. 2 = 1997, 3 = 1998, etc.
x<- c(2,3,5,6,7,9,10,11,12,13,14,17,18)
for (i in x){
  CUSUMlowest[i] <- min(CUSUMviolations[[i]])
}
unlist(CUSUMlowest) #indices of the days on which the lowest expenses occurred; a higher index means later in October
##  [1] 120 116 102 120 117 108 117 117 121 119 110 122 117
  • Identifying lowest-expense days from their corresponding indices
#Finding out what the index of low expenses are in terms of actual days of the month
lowtemp.c<-c(120,116,102,120,117,108,121,119,110,122)
df.reconnect<- df1 %>% mutate(Low.expense.Index = row_number()) %>% select("DAY","Low.expense.Index") 
df.lowtemp.days<- slice(df.reconnect, lowtemp.c)
#Days when expense dollars were lowest
df.lowtemp.days

Observation3:

Based on the default T = 4 and a Shift (C-value) of 1.5 sigma, we discovered that 13 years had low-expense violations. These violations happened in the later part of the summer; by the same token, most of the lowest expenses occurred in the latter half of October. Based on this data set and our guesstimated T and C values, most of the changes happened in the last week of October.


Problem 6.2.2

  • Rerun the CUSUM again:
  1. Using only the occurrences of minimum expense dollars from above

  2. Tightening up T = 1 and C = 0.5 sigma, because we did not trigger any high-expense violations in our first run (see above)

  3. Also, our period of analysis is much shorter now, ranging from 10-Oct through 30-Oct

# Run the CUSUM function using only low expense days from above analysis
avg.day <- mean(unlist(CUSUMlowest))
sd.day <- sd(unlist(CUSUMlowest))
DI.day = 1 #T = 1 sigma
Shift.day = 1 #C = Shift/2 = 0.5 sigma
CUSUMmodel_day <- cusum(unlist(CUSUMlowest), center=avg.day, std.dev = sd.day , decision.interval=DI.day, se.shift=Shift.day, plot = TRUE)


  • No high-expense violations were detected, so the bandwidth had to be lowered:
  1. T is lowered to 0.5 sigma

  2. C is lowered to 0.25 sigma

DI.day = 1/2 #T = 0.5 sigma
Shift.day = 1/2 #C = Shift/2 = 0.25 sigma
CUSUMmodel_day <- cusum(unlist(CUSUMlowest), center=avg.day, std.dev = sd.day, decision.interval=DI.day, se.shift=Shift.day, plot = TRUE)

#looking for high-expense indices to reconnect to actual days of occurrence
CUSUM.hi.temp.violations <- CUSUMmodel_day$violations
CUSUM.hi.temp.violations
## $lower
## [1]  3  4  5  6  7  8 11
## 
## $upper
## [1]  9 10 12 13

Observation4:

Now the opposite detection strategy is taking place:

  • With the lowered bandwidth, high-expense violations occurred at points 9, 10, 12 and 13. These points coincide with the low-expense indices 121, 119, 122 and 117, respectively, from the previous step. Through this association, we can link the violations back to their days of occurrence.
#Finding out what the indices of high expenses are in terms of actual days of the month
hi.temp.c<-c(121, 119, 122 ,117)
df.reconnect1<- df1 %>% mutate(High.expense.Index = row_number()) %>% select("DAY","High.expense.Index") 
df.hi.temp.days<- slice(df.reconnect1, hi.temp.c)
#Days when expenses were highest
df.hi.temp.days

Summary

- Based on the selected T = 4 and C = 1.5 sigma values, we were able to detect low-expense violations on the following days:

  28-Oct, 24-Oct, 10-Oct, 28-Oct, 25-Oct, 16-Oct, 29-Oct, 27-Oct, 18-Oct and 30-Oct
 
- And within those days of low-expense violations, there were high expenses that occurred which violated our lowered, tighter upper boundary. They happened on 25-Oct, 29-Oct, 27-Oct and 30-Oct, respectively.

Conclusions

- The CUSUM methodology is very susceptible to the selection of the T and C values, or, in terms of the R function, the decision.interval and se.shift hyper-parameters. These two parameters control how many "outliers" are flagged. For instance, we found low-expense violations in 13 of the 20 years, tied to 10 days, and within those 10 days, 4 high-expense violations. All of this was driven primarily by how the bandwidths were defined.

- As shown in this exercise, the T and C values directly affected how many outliers were detected as violations of the boundaries. Outlier detection is a function of the end user's risk-aversion profile and the historical context of the data. This was evidenced by the fact that the high-expense violations were only detected once T and C were set low enough.
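To illustrate that sensitivity, a hypothetical sweep (not part of the original analysis) could count how many points a single year's CUSUM flags as T is varied while the Shift stays at 3 (C = 1.5 sigma). It reuses df3, temp.mean and temp.sd from above; the names T.grid and sensitivity are made up.

#Hypothetical sensitivity check: number of violations for the first year as T (decision.interval) varies
T.grid <- c(1, 2, 3, 4, 5)
sensitivity <- sapply(T.grid, function(T.val) {
  m <- cusum(df3[, 1], center = temp.mean$TEMP[1], std.dev = temp.sd$TEMP[1],
             decision.interval = T.val, se.shift = 3, plot = FALSE)
  length(m$violations$lower) + length(m$violations$upper)
})
names(sensitivity) <- paste0("T=", T.grid)
sensitivity #a wider decision interval flags fewer points, consistent with the conclusion above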