library(scales)
library(tidyr)
library(dplyr)
library(knitr)
library(ggthemes)
library(ggplot2)
library(tidyverse)
library(gridExtra)
library("scatterplot3d")
library(outliers)
library(reshape2)
library(lubridate)
library(plotly)
library(qcc)
Taking only the Crime column, here is a quick statistical summary of the crime data:
data.crime<-data$Crime
summary(data.crime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 342.0 658.5 831.0 905.1 1057.5 1993.0
From Engineering & Statistic handbook:
Grubbs’ test (Grubbs 1969 and Stefansky 1972) is used to detect a single outlier in a univariate data set that follows an approximately normal distribution:
H0: there are no outliers in the data
Ha: the maximum value is an outlier
Note: All hypothesis tests in this exercise are performed at the 95% confidence level
Let's first check whether the crime data is approximately normal, and whether any outliers are driving the non-normality. If those outliers turn out to be inconsequential, we can drop them and apply Grubbs' test.
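As a quick note on the mechanics (a small sketch, not part of the original workflow), the Grubbs statistic for the maximum is simply the standardized distance of the maximum from the mean, which can be computed by hand and compared against the G value that grubbs.test() reports later on:
# Sketch: hand-computed Grubbs statistic (assumes the crime data frame `data`
# is already loaded, as in the rest of this analysis)
x <- data$Crime
G_max <- (max(x) - mean(x)) / sd(x)   # should match the G that grubbs.test() reports for the maximum
G_min <- (mean(x) - min(x)) / sd(x)   # same idea for the minimum
c(G_max = G_max, G_min = G_min)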
First off, visualize our crimes data:
Histogram
QQ plot
# Histogram
par(bg = 'lightgrey')
hist(data$Crime, xlab="Crimes Committed", col="yellow", main="Fig1a: Histogram of Crimes Data", probability=TRUE)
s = sd(data$Crime)
m = mean(data$Crime)
curve(dnorm(x, mean=m, sd=s), add=TRUE,col = "blue", lwd = 3)
grid()
# Q-Q Plot to check for normality of original as-is data set
qq <-ggplot() + aes(sample = data$Crime) + geom_qq(distribution = qnorm) +
geom_qq_line(line.p = c(0.25, 0.75), col = "blue") + ylab("No. of Crimes")+ ggtitle("Fig1b: QQ Plot of Crime Data")+theme_excel()
qq
shapiro.test(data$Crime)
##
## Shapiro-Wilk normality test
##
## data: data$Crime
## W = 0.91273, p-value = 0.001882
Observation1:
The histogram clearly shows right skewness in the crime data
With a p-value of 0.0019 (well below 0.05), the Shapiro-Wilk test rejects the hypothesis of normality
However, the QQ plot shows the middle of the distribution to be roughly normal, with the outer fringes "fraying" quite a bit, perhaps due to outliers
The question now is whether the core of the data set is normal and the non-normality is caused by a few skew-inducing outliers. In practice, outliers should be flagged for further investigation, but in this purely educational setting we will drop these few outliers and see if that nudges the data back to normality.
Let's see.
par(bg = 'lightgrey')
p<-boxplot(data$Crime,
main = "Fig1c: Box plot of Crimes",
xlab = "Number of crimes committed",
ylab = "Crimes",
col = "orange",
border = "brown",
horizontal = TRUE,
notch = TRUE
)
abline(v=quantile(data$Crime, c(0.25, 0.75)), col="red")
text(x=fivenum(data$Crime), labels =fivenum(data$Crime), y=1.27)
dfsorted <-data[order(-data$Crime),]
head(dfsorted[c(16)],5)
IQR.calc<-IQR(dfsorted$Crime)
IQR.calc
## [1] 399
Observation2:
IQR = 75th percentile (Q3 = 1057.5) - 25th percentile (Q1 = 658.5) = 399
Tukey fences:
Upper fence = Q3 + (1.5 * IQR) = 1057.5 + 598.5 = 1656
Lower fence = Q1 - (1.5 * IQR) = 658.5 - 598.5 = 60
The minimum value of 342 is not an outlier, as it lies inside Tukey's lower fence. However, there appear to be 3 upper outliers, namely 1993, 1969 and 1674, since they fall above Tukey's upper fence.
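These fences can be reproduced directly in R (a small sketch reusing the same data objects as above):
# Sketch: Tukey fences from the quartiles (reuses `data$Crime` from above)
q <- quantile(data$Crime, c(0.25, 0.75))
iqr.val <- IQR(data$Crime)                 # 1057.5 - 658.5 = 399
lower.fence <- q[[1]] - 1.5 * iqr.val      # 658.5 - 598.5 = 60
upper.fence <- q[[2]] + 1.5 * iqr.val      # 1057.5 + 598.5 = 1656
c(lower = lower.fence, upper = upper.fence)
data$Crime[data$Crime > upper.fence]       # the points above the upper fence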
#test on max data points
grubbs.test(data$Crime, type = 10, opposite = FALSE, two.sided = FALSE)
##
## Grubbs test for one outlier
##
## data: data$Crime
## G = 2.81287, U = 0.82426, p-value = 0.07887
## alternative hypothesis: highest value 1993 is an outlier
#test on min data points
grubbs.test(data$Crime, type = 10, opposite = T, two.sided = FALSE)
##
## Grubbs test for one outlier
##
## data: data$Crime
## G = 1.45589, U = 0.95292, p-value = 1
## alternative hypothesis: lowest value 342 is an outlier
Observation3:
The p-value for the maximum (0.079) is above 0.05, so strictly at the 95% confidence level we fail to reject the null hypothesis; still, the result is marginal (significant at the 10% level) and points to 1993 as a borderline outlier, which agrees with what Tukey's upper fence told us. For the minimum, the p-value of 1 gives no evidence at all that 342 is an outlier. So Grubbs' test is able to flag suspect points even on this semi-normal data set, which is a good sign.
Let's move forward with some outlier removal and see whether the data improves, specifically whether it becomes normal. If it does, we can be more comfortable applying the grubbs.test function for outlier detection.
#Removing the top outlier: 1993
crime.data3 <- filter(data, Crime < 1993)
# Histogram to check for normality of data again
par(bg = 'lightgrey')
hist(crime.data3$Crime, xlab="Crimes Committed", col="coral2", main="Fig2a: Histogram of Crimes Data - One Outlier Removed", probability=TRUE)
s2 = sd(crime.data3$Crime)
m2 = mean(crime.data3$Crime)
curve(dnorm(x, mean=m2, sd=s2), add=TRUE,col = "blue", lwd = 3)
grid()
# Q-Q Plot to check for normality of data again
qq2 <-ggplot() + aes(sample = crime.data3$Crime) + geom_qq(distribution = qnorm) +
geom_qq_line(line.p = c(0.25, 0.75), col = "brown1") + ylab("No. of Crimes")+ ggtitle("Fig2b: QQ Plot of Crime Data- One Outlier removal")+theme_excel()
qq2
#Shapiro-Wilk test again: to see if removal of the highest outlier has made the dataset normal
shapiro.test(crime.data3$Crime)
##
## Shapiro-Wilk normality test
##
## data: crime.data3$Crime
## W = 0.93207, p-value = 0.01001
#Box plot to see outliers actually removed
par(bg = 'lightgrey')
p<-boxplot(crime.data3$Crime,
main = "Fig2c: Box plot of Crimes - One Outlier removal",
xlab = "Number of crimes committed",
ylab = "Crimes",
col = "dodgerblue1",
border = "brown",
horizontal = TRUE,
notch = TRUE
)
text(x=fivenum(crime.data3$Crime), labels =fivenum(crime.data3$Crime), y=1.27)
grubbs.test(crime.data3$Crime, type = 10, opposite = FALSE, two.sided = FALSE)
##
## Grubbs test for one outlier
##
## data: crime.data3$Crime
## G = 3.06343, U = 0.78682, p-value = 0.02848
## alternative hypothesis: highest value 1969 is an outlier
- First off, Fig2a, the box plot and the Shapiro-Wilk test show the data is still non-normal after removal of the highest data point. However, Grubbs' test was still able to flag the 2nd highest value (1969) as an outlier (p = 0.028), which agrees with our Tukey upper-fence finding
- See Appendix A below: removing the 2nd and 3rd highest data points finally made the data normal. At that point Grubbs' test no longer flagged the 4th highest data point (1635) as an outlier (p = 0.114), which is consistent with Tukey's upper fence (1635 < 1656)
- It's safe to say the 3 highest data points (1993, 1969 and 1674) were true outliers while the 4th highest (1635) was not, as confirmed by the box plot and its Tukey fences
- To end, three main inferences can be drawn from this exercise:
(1) We can be reasonably confident that Grubbs' test is able to detect outliers, at least at a borderline level, even on semi-normal univariate data sets (a sketch automating the remove-and-retest loop used above follows this list)
(2) By the same token, Grubbs' test only gives a confidence level on whether an extreme value is an outlier or not; user experience and sound judgment are still needed to make the final call
(3) Lastly, when in doubt, other tools such as a box plot with Tukey fences can aid in making that final decision
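The remove-and-retest loop carried out manually in this exercise can be automated; the snippet below is only a sketch of that idea (not part of the original analysis). Note that repeatedly applying a single-outlier test is a heuristic; the generalized ESD test is the formal tool when multiple outliers are suspected.
# Sketch: automate the remove-and-retest loop. Repeatedly apply grubbs.test()
# and drop the flagged extreme value while p < alpha.
sequential.grubbs <- function(x, alpha = 0.05, max.outliers = 5) {
  removed <- numeric(0)
  for (i in seq_len(max.outliers)) {
    g <- grubbs.test(x, type = 10)
    if (g$p.value >= alpha) break            # no further outlier flagged
    idx <- which.max(abs(x - mean(x)))       # the extreme value the test examined
    removed <- c(removed, x[idx])
    x <- x[-idx]
  }
  list(outliers = removed, cleaned = x)
}
# Example usage: sequential.grubbs(data$Crime, alpha = 0.05)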
Appendix A: removing the 3 highest data points
data being nudged back to normality (Shapiro-Wilk's p-value > 0.05)
Grubbs' test no longer flagging the 4th highest data point (1635) as an outlier (Grubbs' test p-value > 0.05)
#Removing the top 3 outliers: 1993, 1969 and 1674
crime.data4 <- filter(data, Crime < 1674)
# Histogram to check for normality of data again
par(bg = 'lightgrey')
hist(crime.data4$Crime, xlab="Crimes Committed", col="chartreuse3", main="Fig3a: Histogram of Crimes Data - 3 Highest Outliers Removed", probability=TRUE)
s4 = sd(crime.data4$Crime)
m4 = mean(crime.data4$Crime)
curve(dnorm(x, mean=m4, sd=s4), add=TRUE,col = "blue", lwd = 3)
grid()
# Q-Q Plot to check for normality of data again
qq3 <-ggplot() + aes(sample = crime.data4$Crime) + geom_qq(distribution = qnorm) +
geom_qq_line(line.p = c(0.25, 0.75), col = "chocolate") + ylab("No. of Crimes")+ ggtitle("Fig3b: QQ Plot of Crime Data- All 3 highest Outlier Removed")+theme_excel()
qq3
#Shapiro Wilk test again: To see if removal of the 3 outliers had made the dataset normal?
shapiro.test(crime.data4$Crime)
##
## Shapiro-Wilk normality test
##
## data: crime.data4$Crime
## W = 0.96412, p-value = 0.1852
#Box plot to see outliers actually removed
par(bg = 'lightgrey')
p<-boxplot(crime.data4$Crime,
main = "Fig3c: Box plot of Crimes - All 3 highest Outlier Removed",
xlab = "Number of crimes committed",
ylab = "Crimes",
col = "darkorchid1",
border = "brown",
horizontal = TRUE,
notch = TRUE
)
text(x=fivenum(crime.data4$Crime), labels =fivenum(crime.data4$Crime), y=1.27)
#Grub's test after removing all 3 top max values
grubbs.test(crime.data4$Crime, type = 10, opposite = FALSE, two.sided = FALSE)
##
## Grubbs test for one outlier
##
## data: crime.data4$Crime
## G = 2.68561, U = 0.82837, p-value = 0.1139
## alternative hypothesis: highest value 1635 is an outlier
Describe a situation or problem from your job, everyday life, current events, etc., for which a Change Detection model would be appropriate. Applying the CUSUM technique, how would you choose the critical value and the threshold?
CUSUM is an appropriate strategy for detecting process changes, especially in a manufacturing or production environment. The critical values C and T could well be derived from experience with the processes and from business domain knowledge. A good starting set of boundaries would be a T of 6, which roughly equates to +/- 3 sigma from the mean, and then fine-tuning from there based on the context of the particular data set.
From SPC Control Chart Handbook:
CUSUM charts are good at detecting small shifts away from the target because they incorporate information from the sequence of sample values. The plotted points are the cumulative sums of the deviations of the sample values from the target. These points should fluctuate randomly around zero; if a trend develops upwards or downwards, it should be taken as evidence that the process mean has shifted from the norm.
k, or C, is the allowable "slack" in the process. In the CUSUM point formula, k is typically set to half the size of the shift you want to detect. In the R function this is controlled by the se.shift argument, where se.shift = 2*C, i.e. the shift expressed in standard errors.
h, or T, is the number of standard deviations between the center line and the control limits; it is the value at which an out-of-control signal occurs. In the R function it is the decision.interval argument (this analysis starts with a value of 4).
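To make the roles of k (slack) and h (decision interval) concrete, here is a minimal hand-rolled one-sided CUSUM sketch (illustrative only; the analysis below relies on qcc::cusum instead):
# Sketch: minimal upper one-sided CUSUM (illustrative; not used in the analysis)
cusum.upper <- function(x, target, sigma, k = 0.5, h = 4) {
  s <- numeric(length(x))
  for (i in seq_along(x)) {
    prev <- if (i == 1) 0 else s[i - 1]
    # accumulate deviations above the target, less the allowable slack k*sigma
    s[i] <- max(0, prev + (x[i] - target - k * sigma))
  }
  list(cusum = s, signal.at = which(s > h * sigma))   # signal once S exceeds h*sigma
}
# e.g. cusum.upper(rnorm(50, mean = 10), target = 10, sigma = 1)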
df <- read.csv(file="temps.csv",stringsAsFactors = F, header=T)
#Cleaning up the year column headers (dropping the leading "X")
df1 <- df %>% rename_at(vars(starts_with("X")), ~ str_replace(., "X", ""))
head(df1,2)
#make wide to long format
df2 <- melt(df1, id.vars = c("DAY"),variable.name = "YEAR",value.name="TEMP")
#head(df2,2)
#making interactive plot
fig1 <- plot_ly(df2, y = ~TEMP, x = ~YEAR, type = "box")
fig1 <- fig1 %>% layout(title = "Fig1: Boxplots of Temperature over 20 yrs in Atlanta",
xaxis = list(title = "Year"),
yaxis = list(title = "Degrees Fahrenheit")) %>% layout(plot_bgcolor = 'rgb(255, 255, 188)')
fig1
Observation1:
Overall, the temperature medians across the 20 years did not vary much year over year (YoY)
However, there are quite a number of outliers within each year, as the variance (spread) expands and shrinks YoY
The outliers are towards the lower temperature ranges
#mean Temp of each year
temp.mean<-aggregate(TEMP~YEAR, data=df2, mean)
head(temp.mean,5)
#Temp Standard Deviation of each year
temp.sd<-aggregate(TEMP~YEAR, data=df2, sd)
head(temp.sd,5)
#Standard Deviation of entire data; this is to get a feel of the overall Std Deviation
sd.df1<-sd(as.matrix(df1[,2:21]), na.rm = TRUE)
sd.df1
## [1] 8.620253
ggplot(data=temp.mean, aes(x=YEAR, y=TEMP, group=1)) +geom_line(color="red")+geom_point()+ggtitle("Fig2:Line Chart of Avg Temp over 20 Yrs in Atlanta")+theme_excel()
Observation2:
Using H, the decision interval, of 4 sigma; H is equivalent to T
Using a shift (se.shift) of 3, i.e. C = Shift/2 = 1.5 standard errors
df3<-select(df1,2:21)
CUSUMmodels <- vector(mode="list", length=ncol(df3))
CUSUMviolations <- vector(mode="list", length=ncol(df3))
DI = 4    # decision interval (T): out-of-control signal at 4 standard deviations
Shift = 3 # se.shift: C = (delta*sigma)/2, where delta*sigma = Shift, i.e. C = 1.5 sigma
for (i in 1:ncol(df3)){#looping over years
CUSUMmodels[[i]] <- cusum(df3[,i], center=temp.mean$TEMP[i], std.dev = temp.sd$TEMP[i], decision.interval=DI, se.shift=Shift , plot = TRUE)
CUSUMviolations[[i]] <- CUSUMmodels[[i]]$violations$lower#This is index where lowest temp (violations) occur in the group
}
Years 1997, 1998, 2000, 2001, 2002, 2004, 2005, 2006, 2007, 2008, 2009, 2012 and 2013 had low-temperature violations
#looking for minimum temp values and when it occurs
no.violations<-13 #from CUSUM plots above
CUSUMlowest <- vector(mode="list", length=no.violations)
#Years (by groups) with violations. For ex: 2=1997, 3=year 1998 etc. from above loop
x<- c(2,3,5,6,7,9,10,11,12,13,14,17,18)
for (i in x){
CUSUMlowest[i] <- min(CUSUMviolations[[i]])
}
unlist(CUSUMlowest) # index at which the lowest-temp violation occurred; the higher the index, the later in October
## [1] 120 116 102 120 117 108 117 117 121 119 110 122 117
#Finding out what the index of low temperatures are in terms of actual days of the month
lowtemp.c<-c(120,116,102,120,117,108,121,119,110,122)
df.reconnect<- df1 %>% mutate(Low.Temp.Index = row_number()) %>% select("DAY","Low.Temp.Index")
df.lowtemp.days<- slice(df.reconnect, lowtemp.c)
#Days when temperature were lowest
df.lowtemp.days
Observation3:
Based on T = 4 and a shift (C-value) of 1.5 sigma, we found 13 years with low-temperature violations. These violations happened in the later part of summer, and most of the lowest temperatures occurred in the latter half of October. Based on this data set and our guesstimated T and C values, our unofficial summer ends in the last week of October
Using only the "end of summer" days from above
Tightening up to T = 1 and C = 0.5 sigma, because we did not trigger any high-temperature violations in our first run (see above)
Also, our period of analysis is much shorter now, ranging from 10-Oct through 30-Oct
# Run the CUSUM function using only low temp days from above analysis
avg.day <- mean(unlist(CUSUMlowest))
sd.day <- sd(unlist(CUSUMlowest))
DI.day = 1
Shift.day = 1
CUSUMmodel_day <- cusum(unlist(CUSUMlowest), center=avg.day, std.dev = sd.day , decision.interval=DI.day, se.shift=Shift.day, plot = TRUE)
T is lowered to 0.5 sigma
C is lowered to 0.25 sigma
DI.day = 1/2
Shift.day = 1/2
CUSUMmodel_day <- cusum(unlist(CUSUMlowest), center=avg.day, std.dev = sd.day , decision.interval=DI.day, se.shift=Shift.day, plot = TRUE)
#looking for Hi temp index to reconnect to actual days of occurrence
CUSUM.hi.temp.violations <- CUSUMmodel_day$violations
CUSUM.hi.temp.violations
## $lower
## [1] 3 4 5 6 7 8 11
##
## $upper
## [1] 9 10 12 13
Observation4:
Now the opposite detection is taking place: with the tightened limits, upper-side ("high temperature") violations appear as well, and we map them back to their actual days below:
#Finding out what the index of high temperatures are in terms of actual days of the month
hi.temp.c<-c(121, 119, 122 ,117)
df.reconnect1<- df1 %>% mutate(High.Temp.Index = row_number()) %>% select("DAY","High.Temp.Index")
df.hi.temp.days<- slice(df.reconnect1, hi.temp.c)
#Days when temperature were highest
df.hi.temp.days
Prob 6.2.1
- The unofficial end to summer, according to our selected T = 4 and C = 1.5 sigma values, fell on the following days:
28-Oct, 24-Oct, 10-Oct, 28-Oct, 25-Oct, 16-Oct, 29-Oct, 27-Oct, 18-Oct and 30-Oct
Prob 6.2.2
- Within the days when temperatures were supposedly "cooling off", high-temperature violations occurred that broke our lowered, tighter upper boundary. They happened on 25-Oct, 29-Oct, 27-Oct and 30-Oct respectively
- The CUSUM methodology is very susceptible to the selection of the T & C values, or in the R function, the decision.interval and se.shift hyper-parameters. These two parameters control how many "outliers" occur. For instance, we found 10 days across 13 years with low-temperature violations and, within those 10 days, 4 high-temperature violations; all of this was driven primarily by how the bandwidths were defined
- As shown in this exercise, the T & C values directly affected how many points were flagged as violating the boundaries. Outlier detection is a function of the end user's risk-aversion profile and the historical context of the data, as evidenced by the fact that high-temperature violations were only detected once T & C were set low enough (a short sensitivity sketch follows at the end of this write-up)
Finally, it is inconclusive whether Atlanta's cooler days have gotten hotter or not, as these hotter days were detected only when tighter bandwidths were imposed.
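As a closing illustration of that sensitivity (a sketch, not part of the original analysis), one could sweep the decision.interval and se.shift settings and count how many violations each combination flags on the end-of-summer series used above:
# Sketch: sensitivity of the violation count to the T (decision.interval)
# and C (se.shift/2) choices, reusing avg.day, sd.day and CUSUMlowest from above
grid <- expand.grid(DI = c(0.5, 1, 2, 4), Shift = c(0.5, 1, 2, 3))
grid$n.violations <- apply(grid, 1, function(p) {
  m <- cusum(unlist(CUSUMlowest), center = avg.day, std.dev = sd.day,
             decision.interval = p["DI"], se.shift = p["Shift"], plot = FALSE)
  length(m$violations$lower) + length(m$violations$upper)
})
grid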