Problem 5.1

library(scales)
library(tidyr)
library(dplyr)
library(knitr)
library(ggthemes)
library(ggplot2)
library(tidyverse)
library(gridExtra)
library(scatterplot3d)
library(outliers)
library(reshape2)
library(lubridate)
library(plotly)
library(qcc)

Summary view of the Crime data

# `data` is assumed to have been loaded earlier (the US crime data set used throughout this problem)
data.crime <- data$Crime
summary(data.crime)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   342.0   658.5   831.0   905.1  1057.5  1993.0

From the Engineering & Statistics Handbook: Grubbs' test is used to detect a single outlier in a univariate data set that follows an approximately normal distribution.

Note: All hypothesis tests in this exercise are performed at the 95% confidence level

Objective: assess the normality of the Crime data using:


  1. Histogram

  2. Q-Q plot


# Histogram 
par(bg = 'lightgrey')
hist(data$Crime, xlab="Crimes Committed", col="yellow", main="Fig1a: Histogram of Crimes Data", probability=TRUE)
s = sd(data$Crime)
m = mean(data$Crime)
curve(dnorm(x, mean=m, sd=s), add=TRUE,col = "blue", lwd = 3)
grid()

# Q-Q Plot to check for normality of original as-is data set
qq <- ggplot() + aes(sample = data$Crime) + geom_qq(distribution = qnorm) +
  geom_qq_line(line.p = c(0.25, 0.75), col = "blue") +
  ylab("No. of Crimes") + ggtitle("Fig1b: QQ Plot of Crime Data") + theme_excel()
qq


shapiro.test(data$Crime)
## 
##  Shapiro-Wilk normality test
## 
## data:  data$Crime
## W = 0.91273, p-value = 0.001882

Observation1:

The histogram is right-skewed, the Q-Q plot bends away from the reference line in the upper tail, and the Shapiro-Wilk p-value (0.0019) is well below 0.05, so we reject normality for the as-is data. The question now is whether the core of the data set is normal and the non-normality is driven by a few outliers causing the skewness. In practice, outliers should be flagged for further investigation, but in this purely educational setting we will drop these few points and see if that nudges the data back toward normality.

Let's see.


Tukey fences and box plot for outlier detection

par(bg = 'lightgrey')
p<-boxplot(data$Crime,
main = "Fig1c: Box plot of Crimes",
xlab = "Number of crimes committed",
ylab = "Crimes",
col = "orange",
border = "brown",
horizontal = TRUE,
notch = TRUE
)
abline(v=quantile(data$Crime, c(0.25, 0.75)), col="red")
text(x=fivenum(data$Crime), labels =fivenum(data$Crime), y=1.27)

# Sort by Crime (descending) and show the 5 highest values; column 16 is the Crime column
dfsorted <- data[order(-data$Crime),]
head(dfsorted[c(16)], 5)
IQR.calc <- IQR(dfsorted$Crime)
IQR.calc
## [1] 399

Observation2:

IQR = 75th percentile (Q3 = 1057.5) - 25th percentile (Q1 = 658.5) = 399

Tukey fences:

Lower fence = Q1 - 1.5 * IQR = 658.5 - 598.5 = 60; Upper fence = Q3 + 1.5 * IQR = 1057.5 + 598.5 = 1656

The minimum value of 342 is not an outlier, as it sits well inside Tukey's lower fence. However, there appear to be 3 upper outliers, namely 1993, 1969 and 1674, since they all lie above Tukey's upper fence of 1656.
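
To make the fence arithmetic explicit, here is a short sketch using the same data object and the quartiles reported in the summary above; it simply recomputes the fences and lists the points that fall outside them.

# Sketch: compute the Tukey fences for the Crime column and list the points outside them
q <- quantile(data$Crime, c(0.25, 0.75))   # Q1 = 658.5, Q3 = 1057.5
iqr <- unname(q[2] - q[1])                 # 399
lower.fence <- q[[1]] - 1.5 * iqr          # 658.5 - 598.5 = 60
upper.fence <- q[[2]] + 1.5 * iqr          # 1057.5 + 598.5 = 1656
data$Crime[data$Crime < lower.fence | data$Crime > upper.fence]   # expected: the three highest values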


  1. Now that we have confirmation that there are at least three outliers, let's see whether Grubbs' test can correctly identify them even on a non-normal data set
#test on max data points
grubbs.test(data$Crime, type = 10, opposite = FALSE, two.sided = FALSE)
## 
##  Grubbs test for one outlier
## 
## data:  data$Crime
## G = 2.81287, U = 0.82426, p-value = 0.07887
## alternative hypothesis: highest value 1993 is an outlier
#test on min data points
grubbs.test(data$Crime, type = 10, opposite = T, two.sided = FALSE)
## 
##  Grubbs test for one outlier
## 
## data:  data$Crime
## G = 1.45589, U = 0.95292, p-value = 1
## alternative hypothesis: lowest value 342 is an outlier

Observation3:

On the full data set, Grubbs' test returns p = 0.079 for the highest value (1993), which falls just short of significance at the 95% level, and p = 1 for the lowest value (342), so the minimum is clearly not an outlier. Since the Tukey fences flagged the three highest points, we now remove the highest value and re-test.


# Removing the highest outlier: 1993
crime.data3 <- filter(data, Crime < 1993)

# Histogram to check for normality of data again
par(bg = 'lightgrey')
hist(crime.data3$Crime, xlab="Crimes Committed", col="coral2", main="Fig2a: Histogram of Crimes Data - One Outlier Removal", probability=TRUE)
s2 = sd(crime.data3$Crime)
m2 = mean(crime.data3$Crime)
curve(dnorm(x, mean=m2, sd=s2), add=TRUE,col = "blue", lwd = 3)
grid()

# Q-Q Plot to check for normality of data again
qq2 <- ggplot() + aes(sample = crime.data3$Crime) + geom_qq(distribution = qnorm) +
  geom_qq_line(line.p = c(0.25, 0.75), col = "brown1") +
  ylab("No. of Crimes") + ggtitle("Fig2b: QQ Plot of Crime Data - One Outlier removal") + theme_excel()
qq2

# Shapiro-Wilk test again: did removing the highest outlier make the data set normal?
shapiro.test(crime.data3$Crime)
## 
##  Shapiro-Wilk normality test
## 
## data:  crime.data3$Crime
## W = 0.93207, p-value = 0.01001
# Box plot to confirm the highest outlier was actually removed
par(bg = 'lightgrey')
p<-boxplot(crime.data3$Crime,
main = "Fig2c: Box plot of Crimes - One Outlier removal",
xlab = "Number of crimes committed",
ylab = "Crimes",
col = "dodgerblue1",
border = "brown",
horizontal = TRUE,
notch = TRUE
)
text(x=fivenum(crime.data3$Crime), labels =fivenum(crime.data3$Crime), y=1.27)

# Grubbs' test again on the reduced data (the next highest value is 1969)
grubbs.test(crime.data3$Crime, type = 10, opposite = FALSE, two.sided = FALSE)
## 
##  Grubbs test for one outlier
## 
## data:  crime.data3$Crime
## G = 3.06343, U = 0.78682, p-value = 0.02848
## alternative hypothesis: highest value 1969 is an outlier

Summary

- First off, Fig2a, the box plot and the Shapiro-Wilk test show the data is still non-normal after removal of the highest data point. However, Grubbs' test was able to flag the 2nd highest value (1969) as an outlier (p = 0.028 < 0.05), which agrees with our Tukey upper-fence finding

- See Appendix A below: removing the 2nd and 3rd highest data points finally made the data normal. At that point Grubbs' test returned p = 0.114 for the 4th highest data point (1635), so at the 95% level it is not flagged as an outlier, which is consistent with the Tukey upper-fence conclusion

Conclusions

- It's safe to say the 3 highest data points (1993, 1969 and 1674) were true outliers while the 4th highest (1635) was not, as confirmed by the box plot and its associated Tukey fences.

- To end, three main inferences can be drawn from this exercise:

(1) We can now be reasonably confident that Grubbs' test is able to detect outliers even on univariate data sets that are not strictly normal

(2) By the same token, Grubbs' test only reports a p-value for whether the most extreme value is an outlier; user experience and sound judgment are still needed to decide how to act on it (see the short sketch after this list)

(3) Lastly, when in doubt, other tools such as a box plot with Tukey fences can aid in making the final decision
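
As a small illustration of point (2), the p-value returned by grubbs.test can be compared against the chosen significance level programmatically. The sketch below is illustrative only: it uses synthetic data with one injected extreme value, not the crime data set.

library(outliers)
# Illustrative sketch: act on Grubbs' p-value by comparing it to alpha (synthetic data only)
set.seed(42)
x <- c(rnorm(46, mean = 900, sd = 200), 1993)  # made-up "crime-like" series plus one injected extreme value
g <- grubbs.test(x, type = 10)                 # one-outlier test on the most extreme value
alpha <- 0.05                                  # the 95% confidence level used throughout this exercise
if (g$p.value < alpha) {
  cat("Flag", max(x), "as an outlier (p =", signif(g$p.value, 3), ")\n")
} else {
  cat("Do not flag the maximum (p =", signif(g$p.value, 3), ")\n")
}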

Appendix A: Supporting evidence of:

  1. removing the 3 highest data points

  2. data being nudged back to normality (Shapiro-Wilk’s P-value > 0.05)

  3. Grubbs' test no longer flagging the 4th highest data point as an outlier (Grubbs' test p-value > 0.05)

# Removing the top 3 outliers: 1993, 1969 and 1674
crime.data4 <- filter(data, Crime < 1674)

# Histogram to check for normality of data again
par(bg = 'lightgrey')
hist(crime.data4$Crime, xlab="Crimes Committed", col="chartreuse3", main="Fig3a: Histogram of Crimes Data - All 3 highest Outliers Removed", probability=TRUE)
s4 = sd(crime.data4$Crime)
m4 = mean(crime.data4$Crime)
curve(dnorm(x, mean=m4, sd=s4), add=TRUE,col = "blue", lwd = 3)
grid()

# Q-Q Plot to check for normality of data again
qq3 <- ggplot() + aes(sample = crime.data4$Crime) + geom_qq(distribution = qnorm) +
  geom_qq_line(line.p = c(0.25, 0.75), col = "chocolate") +
  ylab("No. of Crimes") + ggtitle("Fig3b: QQ Plot of Crime Data - All 3 highest Outliers Removed") + theme_excel()
qq3

# Shapiro-Wilk test again: did removing the 3 outliers make the data set normal?
shapiro.test(crime.data4$Crime)
## 
##  Shapiro-Wilk normality test
## 
## data:  crime.data4$Crime
## W = 0.96412, p-value = 0.1852
# Box plot to confirm the outliers were actually removed
par(bg = 'lightgrey')
p<-boxplot(crime.data4$Crime,
main = "Fig3c: Box plot of Crimes - All 3 highest Outlier Removed",
xlab = "Number of crimes committed",
ylab = "Crimes",
col = "darkorchid1",
border = "brown",
horizontal = TRUE,
notch = TRUE
)
text(x=fivenum(crime.data4$Crime), labels =fivenum(crime.data4$Crime), y=1.27)

# Grubbs' test after removing the 3 highest values
grubbs.test(crime.data4$Crime, type = 10, opposite = FALSE, two.sided = FALSE)
## 
##  Grubbs test for one outlier
## 
## data:  crime.data4$Crime
## G = 2.68561, U = 0.82837, p-value = 0.1139
## alternative hypothesis: highest value 1635 is an outlier

Question 6.1

Describe a situation or problem from your job, everyday life, current events, etc., for which a Change Detection model would be appropriate. Applying the CUSUM technique, how would you choose the critical value and the threshold?

CUSUM is an appropriate technique for detecting process changes, especially in a manufacturing or production environment. The critical value C and the threshold T are best derived from experience with the process and business domain knowledge. A good starting point would be a threshold T of 6, which equates roughly to +/- 3 sigma around the mean, and then fine-tuning from there based on the context of the particular data set.
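
To make the roles of C and T concrete, here is a minimal one-sided CUSUM sketch (illustrative only, not tied to any particular package): deviations above the mean in excess of the allowance C accumulate, and an alarm is raised once the cumulative sum crosses the threshold T.

# Minimal one-sided (upper) CUSUM sketch -- illustrative only
cusum_upper <- function(x, mu, sigma, C = 0.5 * sigma, threshold = 5 * sigma) {
  s <- 0
  alarms <- integer(0)
  for (t in seq_along(x)) {
    s <- max(0, s + (x[t] - mu - C))           # accumulate deviations above mu beyond the allowance C
    if (s >= threshold) alarms <- c(alarms, t) # signal once the cumulative sum crosses the threshold T
  }
  alarms
}

# Example: a process whose mean shifts upward by 2 sigma after observation 30
set.seed(1)
x <- c(rnorm(30, mean = 10, sd = 1), rnorm(10, mean = 12, sd = 1))
cusum_upper(x, mu = 10, sigma = 1)  # alarm indices should appear shortly after the shift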

Problem 6.2.1

From SPC Control Chart Handbook:

CUSUM charts are good at detecting small shifts away from the target, because they incorporate information from the sequence of sample values. The plotted points are the cumulative sums of the deviations of the sample values from the target. These points should fluctuate randomly around zero. If a trend develops upwards or downwards, it should be considered as evidence that the process mean has shifted from the norm
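
The "cumulative sums of the deviations of the sample values from the target" in the quote can be computed directly; a tiny sketch with made-up numbers shows the quantity that gets plotted.

# The raw quantity a basic CUSUM chart plots: the running sum of deviations from the target
x <- c(10.1, 9.8, 10.0, 10.4, 10.6, 10.9)  # made-up sample values
target <- 10
cumsum(x - target)  # fluctuates near zero at first, then trends upward as the mean shifts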


df <- read.csv(file="temps.csv", stringsAsFactors = FALSE, header = TRUE)
# Strip the leading "X" that read.csv prepends to the numeric year column names
df1 <- df %>% rename_at(vars(starts_with("X")), ~ str_replace(., "X", ""))
head(df1,2)

Interactive box plot

#make wide to long format
df2 <- melt(df1, id.vars = c("DAY"),variable.name = "YEAR",value.name="TEMP")
#head(df2,2)
#making interactive plot
fig1 <- plot_ly(df2, y = ~TEMP, x = ~YEAR, type = "box")

fig1 <- fig1 %>% layout(title = "Fig1: Boxplots of Temperature over 20 yrs in Atlanta",
         xaxis = list(title = "Year"),
         yaxis = list(title = "Degree Fahrenheit")) %>% layout(plot_bgcolor = 'rgb(255, 255, 188)') # RGB channels must be 0-255
fig1

Observation1:


#mean Temp of each year
temp.mean<-aggregate(TEMP~YEAR, data=df2, mean)
head(temp.mean,5)
#Temp Standard Deviation of each year
temp.sd<-aggregate(TEMP~YEAR, data=df2, sd)
head(temp.sd,5)
#Standard Deviation of entire data; this is to get a feel of the overall Std Deviation
sd.df1<-sd(as.matrix(df1[,2:21]), na.rm = TRUE)
sd.df1
## [1] 8.620253

ggplot(data=temp.mean, aes(x=YEAR, y=TEMP, group=1)) +geom_line(color="red")+geom_point()+ggtitle("Fig2:Line Chart of Avg Temp over 20 Yrs in Atlanta")+theme_excel()


Observation1a:


CUSUM Algorithm

  • Loop over the years to detect violations (low-temperature excursions)
  1. Using a decision interval H of 4 sigma (H is equivalent to the threshold T)

  2. Using an se.shift of 3, i.e. the allowance C = Shift/2 = 1.5 standard errors away from the mean

df3<-select(df1,2:21)
CUSUMmodels <- vector(mode="list", length=ncol(df3)) 
CUSUMviolations <- vector(mode="list", length=ncol(df3)) 

DI = 4     # decision interval (the threshold T): out of control at 4 standard deviations
Shift = 3  # se.shift; the allowance C = Shift/2 = 1.5 sigma
for (i in 1:ncol(df3)){ # looping over the years
  CUSUMmodels[[i]] <- cusum(df3[,i], center=temp.mean$TEMP[i], std.dev = temp.sd$TEMP[i], decision.interval=DI, se.shift=Shift, plot = TRUE)
  CUSUMviolations[[i]] <- CUSUMmodels[[i]]$violations$lower # indices where the lower CUSUM crosses the decision interval (low-temperature violations) for that year
}

  • Based on the above T and C:

The years 1997, 1998, 2000, 2001, 2002, 2004, 2005, 2006, 2007, 2008, 2009, 2012 and 2013 had low-temperature violations

# Find the earliest lower-violation index (the detected change point) in each violating year
no.violations<-13 #from the CUSUM plots above
CUSUMlowest <- vector(mode="list", length=no.violations)
# Column indices of the years with violations (2 = 1997, 3 = 1998, etc., from the loop above);
# indexing beyond 13 simply grows the list, and the empty entries are dropped by unlist() below
x<- c(2,3,5,6,7,9,10,11,12,13,14,17,18)
for (i in x){
  CUSUMlowest[i] <- min(CUSUMviolations[[i]])
}
unlist(CUSUMlowest) # index of the first violation in each year; a larger index means later in the season (closer to the end of October)
##  [1] 120 116 102 120 117 108 117 117 121 119 110 122 117
  • Identifying the low-temperature (end-of-summer) days from their corresponding indices
# Map the low-temperature violation indices back to actual calendar days of the month
lowtemp.c<-c(120,116,102,120,117,108,121,119,110,122)
df.reconnect<- df1 %>% mutate(Low.Temp.Index = row_number()) %>% select("DAY","Low.Temp.Index")
df.lowtemp.days<- slice(df.reconnect, lowtemp.c)
# Calendar days on which the low-temperature violations occurred
df.lowtemp.days

Observation3:

Based on the default T = 4 sigma and a shift of 3 (i.e., C = 1.5 sigma), 13 of the 20 years showed low-temperature violations. These violations happened in the later part of the summer period, with most of the detected change points falling in the latter half of October. Based on this data set and our guesstimated T and C values, the unofficial end of summer therefore comes in the latter half of October.


Problem 6.2.2

  • Rerun the CUSUM:
  1. Using only the "end of summer" change-point days found above

  2. Tightening to T = 1 sigma and C = 0.5 sigma, since no high-temperature violations were triggered in the first run (see above)

  3. Also, the period being analyzed is much shorter now, ranging roughly from 10-Oct through 30-Oct

# Run the CUSUM function using only low temp days from above analysis
avg.day <- mean(unlist(CUSUMlowest))
sd.day <- sd(unlist(CUSUMlowest))
DI.day = 1
Shift.day = 1 
CUSUMmodel_day <- cusum(unlist(CUSUMlowest), center=avg.day, std.dev = sd.day , decision.interval=DI.day, se.shift=Shift.day, plot = TRUE)


  • This run did not detect any high-temperature violations, so the bandwidth was lowered further:
  1. T is lowered to 0.5 sigma

  2. C is lowered to 0.25 sigma

DI.day = 1/2
Shift.day = 1/2
CUSUMmodel_day <- cusum(unlist(CUSUMlowest), center=avg.day, std.dev = sd.day , decision.interval=DI.day, se.shift=Shift.day, plot = TRUE)

# Indices of the violations; the upper ones are the high-temperature excursions we want to map back to actual days
CUSUM.hi.temp.violations <- CUSUMmodel_day$violations
CUSUM.hi.temp.violations
## $lower
## [1]  3  4  5  6  7  8 11
## 
## $upper
## [1]  9 10 12 13

Observation4:

Now the opposite detection strategy is taking place:

  • With the lowered bandwidth, high-temperature violations occurred at points 9, 10, 12 and 13. These points correspond to the end-of-summer indices 121, 119, 122 and 117 respectively from the yearly analysis above (i.e., the days each year when the weather starts cooling off). Through this association we can link the high-temperature violations back to specific calendar days within the cooling-off period
# Map the high-temperature violation indices back to actual calendar days of the month
hi.temp.c<-c(121, 119, 122, 117)
df.reconnect1<- df1 %>% mutate(High.Temp.Index = row_number()) %>% select("DAY","High.Temp.Index")
df.hi.temp.days<- slice(df.reconnect1, hi.temp.c)
# Calendar days on which the high-temperature violations occurred
df.hi.temp.days

Summary

Prob 6.2.1

- The unofficial end of summer, according to our selected T = 4 sigma and C = 1.5 sigma values, fell on the following days:

  28-Oct, 24-Oct, 10-Oct, 28-Oct, 25-Oct, 16-Oct, 29-Oct, 27-Oct, 18-Oct and 30-Oct

Prob 6.2.2

- Within the days when temperatures were supposedly "cooling off", there were high temperatures that violated our lowered, tighter upper boundary. They fell on 25-Oct, 27-Oct, 29-Oct and 30-Oct

Conclusions

- The CUSUM methodology is very sensitive to the chosen T and C values, i.e. the decision.interval and se.shift hyper-parameters of the R function. These two parameters control how many "outliers" (violations) are flagged. For instance, we found 10 distinct low-temperature violation days across 13 violating years, and within those, 4 high-temperature violations; all of these counts were driven primarily by how the bandwidths were defined

- As shown in this exercise, the T and C values directly affected how many points were flagged as violating the boundaries. Outlier detection is a function of the end user's risk-aversion profile and the historical context of the data, as evidenced by the fact that the high temperatures were only detected once T and C were set low enough

Finally, it is inconclusive whether Atlanta's cooler end-of-summer days have gotten hotter or not, as the hotter days were detected only when tighter bandwidths were imposed.