library(scales)
library(tidyr)
library(dplyr)
library(knitr)
library(ggthemes)
library(ggplot2)
library(tidyverse)
library(gridExtra)
library("scatterplot3d")
library(outliers)
library(reshape2)
library(lubridate)
library(plotly)
library(qcc)
From the SPC Control Chart Handbook:
CUSUM charts are good at detecting small shifts away from the target, because they incorporate information from the sequence of sample values. The plotted points are the cumulative sums of the deviations of the sample values from the target. These points should fluctuate randomly around zero; if a trend develops upwards or downwards, it should be taken as evidence that the process mean has shifted from the target.
k (or C) is the allowable "slack" in the process; in the CUSUM point formula it is set to half the size of the shift you want to detect. In the R function this corresponds to the se.shift argument, where se.shift = 2*C = the shift to detect, expressed in standard errors.
h (or T) is the number of standard deviations between the center line and the control limits; it is the value at which an out-of-control signal occurs. In the R function this is the decision.interval argument (a value of 4 is used as the default throughout this analysis).
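To make the definitions above concrete, here is a minimal sketch of the one-sided (tabular) CUSUM recursions computed by hand. The data, target and sigma are made up for illustration, and the mapping of k and h onto qcc's se.shift and decision.interval arguments follows the description above.
# Illustrative sketch: tabular CUSUM by hand (made-up data, not the expense data)
set.seed(1)
x     <- rnorm(30, mean = 10, sd = 2)  # hypothetical process readings
mu0   <- 10                            # target (center)
sigma <- 2                             # process standard deviation
k     <- 0.5 * sigma                   # slack: half the shift to detect (se.shift = 1 here)
h     <- 4   * sigma                   # decision interval: h = decision.interval * sigma
Cpos <- Cneg <- numeric(length(x))
for (i in seq_along(x)) {
  prev_pos <- if (i == 1) 0 else Cpos[i - 1]
  prev_neg <- if (i == 1) 0 else Cneg[i - 1]
  Cpos[i] <- max(0, x[i] - (mu0 + k) + prev_pos)  # accumulates upward deviations
  Cneg[i] <- max(0, (mu0 - k) - x[i] + prev_neg)  # accumulates downward deviations
}
which(Cpos > h)  # points signalling an upward shift
which(Cneg > h)  # points signalling a downward shift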
df <- read.csv(file="temps.csv",stringsAsFactors = F, header=T)
# Clean up column names: strip the "X" prefix that read.csv adds to the year columns
df1 <- df %>% rename_at(vars(starts_with("X")), ~ str_replace(., "X", ""))
head(df1,2)
#make wide to long format
df2 <- melt(df1, id.vars = c("DAY"),variable.name = "YEAR",value.name="TEMP")
#head(df2,2)
#making interactive plot
fig1 <- plot_ly(df2, y = ~TEMP, x = ~YEAR, type = "box")
fig1 <- fig1 %>% layout(title = "Fig1: Boxplots of hypothetical Daily Expenses over 20 Years",
                        xaxis = list(title = "Year"),
                        yaxis = list(title = "Expenses, $"),
                        plot_bgcolor = 'rgb(255, 255, 188)') # RGB components must be 0-255
fig1
Observation 1:
Overall, the expense medians did not vary much year over year (YoY) across the 20 years.
However, there are quite a few outliers within each year, and the variance (spread) expands and shrinks YoY (a quick count is sketched below).
The outliers sit toward the lower dollar ranges.
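To put a rough number on "quite a few outliers", the sketch below counts, for each year, the points falling outside the usual 1.5*IQR boxplot fences. It reuses df2 from above; the fence definition approximates (but may not exactly match) the one plotly uses in its boxplots.
# Rough count of boxplot-style outliers per year (1.5*IQR rule), using df2 from above
outlier.count <- df2 %>%
  group_by(YEAR) %>%
  summarise(n.low  = sum(TEMP < quantile(TEMP, 0.25) - 1.5 * IQR(TEMP)),   # below the lower fence
            n.high = sum(TEMP > quantile(TEMP, 0.75) + 1.5 * IQR(TEMP)))   # above the upper fence
head(outlier.count, 5)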
#mean expenses of each year
temp.mean<-aggregate(TEMP~YEAR, data=df2, mean)
head(temp.mean,5)
#expenses Standard Deviation of each year
temp.sd<-aggregate(TEMP~YEAR, data=df2, sd)
head(temp.sd,5)
#Standard Deviation of entire data; this is to get a feel of the overall Std Deviation
sd.df1<-sd(as.matrix(df1[,2:21]), na.rm = TRUE)
sd.df1
## [1] 8.620253
ggplot(data=temp.mean, aes(x=YEAR, y=TEMP, group=1)) +
  geom_line(color="red") + geom_point() +
  ggtitle("Fig2: Line Chart of Avg Expenses over 20 Yrs") +
  theme_excel()
Observation 1a:
Using the default H (decision interval) of 4 sigma; H is equivalent to T.
Using a shift (se.shift) of 3, i.e. C = Shift/2 = 1.5 standard errors of allowable slack.
df3<-select(df1,2:21)
CUSUMmodels <- vector(mode="list", length=ncol(df3))
CUSUMviolations <- vector(mode="list", length=ncol(df3))
DI = 4    # decision interval (T): signal when the cumulative sum exceeds 4 standard deviations
Shift = 3 # se.shift: C = (delta*sigma)/2, where delta*sigma = Shift
for (i in 1:ncol(df3)){ # looping over years
  CUSUMmodels[[i]] <- cusum(df3[,i], center=temp.mean$TEMP[i], std.dev = temp.sd$TEMP[i],
                            decision.interval=DI, se.shift=Shift, plot = TRUE)
  CUSUMviolations[[i]] <- CUSUMmodels[[i]]$violations$lower # indices where lower-side violations occur within that year
}
Years 1997, 1998, 2000, 2001, 2002, 2004, 2005, 2006, 2007, 2008, 2009, 2012 and 2013 had low-expense violations (see the sketch below for a programmatic check).
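The year indices used in the next code chunk were read off the individual CUSUM plots; an equivalent sketch that pulls them straight from the stored model objects is:
# Sketch: identify which year columns had any lower-side violations,
# instead of reading them off the plots
has.lower.violation <- sapply(CUSUMviolations, function(v) length(v) > 0)
which(has.lower.violation)        # column indices in df3 (2 = 1997, 3 = 1998, ...)
names(df3)[has.lower.violation]   # the corresponding year labels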
#looking for minimum expense values and when it occurs
no.violations<-13 #from CUSUM plots above
CUSUMlowest <- vector(mode="list", length=no.violations)
# Year columns (in df3) with violations, e.g. 2 = 1997, 3 = 1998, etc., from the loop above
x <- c(2,3,5,6,7,9,10,11,12,13,14,17,18)
for (i in x){
  CUSUMlowest[[i]] <- min(CUSUMviolations[[i]])
}
unlist(CUSUMlowest) # day index of the first low-expense violation in each year; the higher the index, the later in October
## [1] 120 116 102 120 117 108 117 117 121 119 110 122 117
# Converting the low-expense indices to actual days of the month
lowtemp.c<-c(120,116,102,120,117,108,121,119,110,122)
df.reconnect<- df1 %>% mutate(Low.expense.Index = row_number()) %>% select("DAY","Low.expense.Index")
df.lowtemp.days<- slice(df.reconnect, lowtemp.c)
#Days when expense dollars were lowest
df.lowtemp.days
Observation 3:
Based on the default T=4 and a C-value (Shift/2) of 1.5 sigma, 13 years showed low-expense violations. These violations occurred in the later part of each year's record. By the same token, most of the lowest expenses occurred in the latter half of October; based on this data set and our guesstimated T and C values, most of the changes happened in the last week of October.
Using only the occurrences of minimum expense dollars from above:
Tighten to T=1 and C=0.5 sigma, because no high-expense violations were triggered in our first run (see above).
Also, our time period of analysis is much shorter now, ranging from 10-Oct through 30-Oct.
# Run the CUSUM function using only low expense days from above analysis
avg.day <- mean(unlist(CUSUMlowest))
sd.day <- sd(unlist(CUSUMlowest))
DI.day = 1
Shift.day = 1
CUSUMmodel_day <- cusum(unlist(CUSUMlowest), center=avg.day, std.dev = sd.day , decision.interval=DI.day, se.shift=Shift.day, plot = TRUE)
T is lowered to 0.5 sigma
C is lowered to 0.25 sigma
DI.day = 1/2
Shift.day = 1/2
CUSUMmodel_day <- cusum(unlist(CUSUMlowest), center=avg.day, std.dev = sd.day , decision.interval=DI.day, se.shift=Shift.day, plot = TRUE)
# Looking for the high-expense indices so they can be reconnected to the actual days of occurrence
CUSUM.hi.temp.violations <- CUSUMmodel_day$violations
CUSUM.hi.temp.violations
## $lower
## [1] 3 4 5 6 7 8 11
##
## $upper
## [1] 9 10 12 13
Observation 4:
Now the opposite detection strategy is applied: we look for high-expense (upper-side) violations within the low-expense days found above.
# Converting the high-expense indices to actual days of the month
hi.temp.c<-c(121, 119, 122 ,117)
df.reconnect1<- df1 %>% mutate(High.expense.Index = row_number()) %>% select("DAY","High.expense.Index")
df.hi.temp.days<- slice(df.reconnect1, hi.temp.c)
# Days when expenses were highest
df.hi.temp.days
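The indices hard-coded in hi.temp.c above can also be recovered directly from the second CUSUM run, using only objects already defined:
# Sketch: map the upper-side violation positions from the second run back to day indices
unlist(CUSUMlowest)[CUSUM.hi.temp.violations$upper] # reproduces 121, 119, 122, 117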
- Based on the selected T=4 and C=1.5 sigma values, we were able to detect low-expense violations on the following days:
28-Oct, 24-Oct, 10-Oct, 28-Oct, 25-Oct, 16-Oct, 29-Oct, 27-Oct, 18-Oct and 30-Oct
- Within those days of low-expense violations, there were high expenses that violated our lowered, tighter upper boundary. They occurred on 25-Oct, 29-Oct, 27-Oct and 30-Oct respectively.
- The CUSUM methodology is very sensitive to the choice of T and C values, i.e. the decision.interval and se.shift hyper-parameters in the R function. These two parameters control what gets flagged as an "outlier". For instance, we found 10 days of low-expense violations across 13 years, and within those 10 days, 4 high-expense violations; all of this follows from how the bands were defined (see the sensitivity sketch after this list).
- As shown in this exercise, the T and C values directly affected how many outliers were detected as boundary violations. Outlier detection is a function of the end user's risk-aversion profile and the historical context of the data. This was evidenced by the fact that the high expenses were only detected once T and C were set low enough.
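To make the sensitivity point concrete, one way to see it is to sweep a small grid of decision.interval and se.shift values over a single year and count the violations each setting produces. The grid values below are arbitrary choices for illustration, and only the first year column of df3 is used.
# Sketch: how the violation count reacts to decision.interval (T) and se.shift (2*C)
# Arbitrary grid values; first year column only; plots suppressed
param.grid <- expand.grid(DI = c(1, 2, 4), Shift = c(1, 2, 3))
param.grid$n.violations <- apply(param.grid, 1, function(p) {
  m <- cusum(df3[, 1], center = temp.mean$TEMP[1], std.dev = temp.sd$TEMP[1],
             decision.interval = p["DI"], se.shift = p["Shift"], plot = FALSE)
  length(m$violations$lower) + length(m$violations$upper)
})
param.grid # wider bands (larger T and C) should flag fewer points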