options(warn=-1) #suppressing warnings
install.packages("outliers")
Installing package into 'C:/Users/fayal/R/win-library/3.6'
(as 'lib' is unspecified)
package 'outliers' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\fayal\AppData\Local\Temp\RtmpuWldb2\downloaded_packages
install.packages("lubridate")
Installing package into 'C:/Users/fayal/R/win-library/3.6'
(as 'lib' is unspecified)
package 'lubridate' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\fayal\AppData\Local\Temp\RtmpuWldb2\downloaded_packages
library(outliers)
library(ggplot2)
library(tidyverse)
-- Attaching packages --------------------------------------- tidyverse 1.3.0 --
v tibble 2.1.3 v dplyr 0.8.3
v tidyr 1.0.0 v stringr 1.4.0
v readr 1.3.1 v forcats 0.4.0
v purrr 0.3.3
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
library(repr)
library(reshape)
Attaching package: 'reshape'
The following object is masked from 'package:dplyr':
rename
The following objects are masked from 'package:tidyr':
expand, smiths
library(tidyr)
Using the crime data, test to see whether there are any outliers in the last column (number of crimes per 100,000 people). Use the grubbs.test function in the outliers package in R.
#Loading data as a data frame
crime<-read_tsv("uscrime.txt")
Parsed with column specification:
cols(
M = col_double(),
So = col_double(),
Ed = col_double(),
Po1 = col_double(),
Po2 = col_double(),
LF = col_double(),
M.F = col_double(),
Pop = col_double(),
NW = col_double(),
U1 = col_double(),
U2 = col_double(),
Wealth = col_double(),
Ineq = col_double(),
Prob = col_double(),
Time = col_double(),
Crime = col_double()
)
#Exploring the data
head(crime)
| M | So | Ed | Po1 | Po2 | LF | M.F | Pop | NW | U1 | U2 | Wealth | Ineq | Prob | Time | Crime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15.1 | 1 | 9.1 | 5.8 | 5.6 | 0.510 | 95.0 | 33 | 30.1 | 0.108 | 4.1 | 3940 | 26.1 | 0.084602 | 26.2011 | 791 |
| 14.3 | 0 | 11.3 | 10.3 | 9.5 | 0.583 | 101.2 | 13 | 10.2 | 0.096 | 3.6 | 5570 | 19.4 | 0.029599 | 25.2999 | 1635 |
| 14.2 | 1 | 8.9 | 4.5 | 4.4 | 0.533 | 96.9 | 18 | 21.9 | 0.094 | 3.3 | 3180 | 25.0 | 0.083401 | 24.3006 | 578 |
| 13.6 | 0 | 12.1 | 14.9 | 14.1 | 0.577 | 99.4 | 157 | 8.0 | 0.102 | 3.9 | 6730 | 16.7 | 0.015801 | 29.9012 | 1969 |
| 14.1 | 0 | 12.1 | 10.9 | 10.1 | 0.591 | 98.5 | 18 | 3.0 | 0.091 | 2.0 | 5780 | 17.4 | 0.041399 | 21.2998 | 1234 |
| 12.1 | 0 | 11.0 | 11.8 | 11.5 | 0.547 | 96.4 | 25 | 4.4 | 0.084 | 2.9 | 6890 | 12.6 | 0.034201 | 20.9995 | 682 |
dim(crime)
| Variable | Description |
|---|---|
| M | percentage of males aged 14–24 in total state population |
| So | indicator variable for a southern state |
| Ed | mean years of schooling of the population aged 25 years or over |
| Po1 | per capita expenditure on police protection in 1960 |
| Po2 | per capita expenditure on police protection in 1959 |
| LF | labour force participation rate of civilian urban males in the age group 14–24 |
| M.F | number of males per 100 females |
| Pop | state population in 1960 in hundred thousands |
| NW | percentage of nonwhites in the population |
| U1 | unemployment rate of urban males 14–24 |
| U2 | unemployment rate of urban males 35–39 |
| Wealth | median value of transferable assets or family income |
| Ineq | income inequality: percentage of families earning below half the median income |
| Prob | probability of imprisonment: ratio of number of commitments to number of offenses |
| Time | average time in months served by offenders in state prisons before their first release |
| Crime | crime rate: number of offenses per 100,000 population in 1960 |
#Summary of crime rate
summary(crime$Crime)
Min. 1st Qu. Median Mean 3rd Qu. Max.
342.0 658.5 831.0 905.1 1057.5 1993.0
We can use visualization to see whether there are any outliers in the crime rate data:
options(repr.plot.width=3, repr.plot.height=5)
ggplot(crime, aes(x="",y=Crime))+
geom_boxplot(width=0.25,outlier.color="red",outlier.shape = 1, color = "tomato")+
labs(title="Number of offenses per 100,000 population in 1960")+
theme(plot.title = element_text(hjust=0.5,lineheight=.8,size=10))
(Boxplot output: number of offenses per 100,000 population in 1960)
The boxplot above shows three points that are "too far away" from the range in which we expect them. Although the plot flags these points as outliers, they are real data and valid observations.
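For reference, geom_boxplot draws a point as an outlier when it lies beyond Q3 + 1.5*IQR (or below Q1 - 1.5*IQR). From the summary above, IQR = 1057.5 - 658.5 = 399, so the upper fence is 1057.5 + 1.5*399 = 1656 and the lower fence is 658.5 - 1.5*399 = 60; since the minimum (342) is above the lower fence, only the highest crime rates are flagged.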
We can also use the Grubbs test. This test is defined for the hypotheses:
H0 : There are no outliers in the data set
Ha : The maximum and minimum values are outliers in the data set.
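For type = 11 (two outliers on opposite tails), the statistic reported by grubbs.test is, to our understanding, the studentized range G = (max - min)/s, where s is the sample standard deviation; a large G relative to its null distribution would suggest that the two extremes are outliers.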
grubbs.test(crime$Crime,type=11) #type 11 is a test for two outliers on opposite tails
Grubbs test for two opposite outliers
data: crime$Crime
G = 4.26877, U = 0.78103, p-value = 1
alternative hypothesis: 342 and 1993 are outliers
Using the Grubbs test, a p-value larger than 0.05 indicates there is no evidence against the null hypothesis (the smaller the p-value, the stronger the evidence to reject the null). With a p-value of 1, we fail to reject the null hypothesis and conclude that the maximum (1993) and minimum (342) are not outliers.
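As a cross-check (a sketch only; these calls were not run above), the one-outlier version of the test can be applied to each tail separately. With type = 10, grubbs.test tests the single most extreme value, and opposite = TRUE switches the test to the other tail.
#One-outlier Grubbs tests on each tail (sketch, output not shown)
grubbs.test(crime$Crime, type = 10)                  #most extreme value (here the maximum, 1993)
grubbs.test(crime$Crime, type = 10, opposite = TRUE) #the opposite tail (the minimum, 342)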
Describe a situation or problem from your job, everyday life, current events, etc., for which a Change Detection model would be appropriate. Applying the CUSUM technique, how would you choose the critical value and the threshold?
Answer:
We can use the CUSUM method to evaluate energy consumption in a power grid. A CUSUM control chart can be used to monitor energy usage data so that abnormal changes are detected in a timely manner.
Upon detecting a change, other techniques can be used to find the cause of the abnormal energy consumption, such as performance degradation, poor maintenance, or improper operation of systems. We can collect energy usage data over a long period of time, calculate the sample standard deviation, and use that as the critical value. For the threshold, we can select 4 or 5 times the standard deviation.
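A minimal sketch of how these parameters could be set, assuming a hypothetical data frame energy with a usage column of daily consumption readings:
#Hypothetical example: energy$usage holds daily energy consumption readings
mu <- mean(energy$usage)   #baseline (in-control) mean consumption
sigma <- sd(energy$usage)  #sample standard deviation
C <- sigma                 #critical value (allowable slack), as described above
T <- 5*sigma               #threshold, 4 or 5 sigma
S <- 0                     #CUSUM for detecting an increase in consumption
for (x in energy$usage) {
  S <- max(0, S + (x - mu - C))
  if (S >= T) { cat("Abnormal increase in consumption detected\n"); break }
}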
Part 1:
Using July through October daily-high-temperature data for Atlanta for 1996 through 2015, use a CUSUM approach to identify when unofficial summer ends (i.e., when the weather starts cooling off) each year.
#Loading the data
temps=read_tsv("temps.txt")
Parsed with column specification:
cols(
.default = col_double(),
DAY = col_character()
)
See spec(...) for full column specifications.
#checking for NA/missing values in data frame
sum(is.na(temps))
0
#Exploring the data
tail(temps)[1:3,]
| DAY | 1996 | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | … | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 26-Oct | 75 | 71 | 79 | 69 | 75 | 64 | 68 | 68 | 79 | … | 62 | 68 | 70 | 65 | 85 | 77 | 80 | 61 | 84 | 67 |
| 27-Oct | 75 | 57 | 79 | 75 | 78 | 51 | 69 | 64 | 81 | … | 66 | 67 | 59 | 60 | 76 | 79 | 70 | 69 | 84 | 56 |
| 28-Oct | 81 | 55 | 79 | 73 | 80 | 55 | 75 | 57 | 78 | … | 63 | 70 | 50 | 71 | 74 | 74 | 56 | 64 | 77 | 78 |
We can look at the changes using two approaches:
First, we will look at the average temperature on each calendar day, taken across 1996 to 2015. The goal is to observe, on average, when the weather starts cooling down. Since the data spans late summer and early fall, a sustained decrease in this average indicates the change of season.
#adding a column to data frame, with average value for each day
temps<-cbind(temps,rowMeans(temps[,-1]))
#renaming the column to Average
colnames(temps)[ncol(temps)] <- "Average"
#checking the data frame
head(temps)[1:3,]
| DAY | 1996 | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | … | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1-Jul | 98 | 86 | 91 | 84 | 89 | 84 | 90 | 73 | 82 | … | 95 | 85 | 95 | 87 | 92 | 105 | 82 | 90 | 85 | 88.85 |
| 2-Jul | 97 | 90 | 88 | 82 | 91 | 87 | 90 | 81 | 81 | … | 85 | 87 | 90 | 84 | 94 | 93 | 85 | 93 | 87 | 88.35 |
| 3-Jul | 97 | 93 | 91 | 87 | 93 | 87 | 87 | 87 | 86 | … | 82 | 91 | 89 | 83 | 95 | 99 | 76 | 87 | 79 | 88.40 |
Now, we calculate the CUSUM values using average temperatures:
#adding a col to temps data frame to store St value in each step
temps[,"St"]<-NA
In order to detect a shift from the target, we define an "allowable slack" (C). This value is usually ½ or 1 standard deviation of the parameter.
#Standard Deviation of the data points
std_sample<-sd(temps[,"Average"])
std_sample
6.70138087245912
mean_sample<-mean(temps[,"Average"])
mean_sample
83.3390243902439
#allowable slack C : typically set to 1 sigma
C<-std_sample
When the CUSUM value is at or above the threshold, we say the process is "out of control". Standard practice is to set the threshold to 4 or 5 times the standard deviation.
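Putting C and T together, the CUSUM we compute below for detecting a decrease is S_t = max(0, S_(t-1) + (mu - x_t - C)), starting from S_1 = 0, and a change is flagged on the first day for which S_t >= T.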
#threshold set to 5*sigma
T<-5*std_sample
T
33.5069043622956
#using Zero start method
temps[1,"St"]<-0
for(i in 2:nrow(temps)){
temps[i,"St"]<-max(0,(temps[i-1,"St"]+mean_sample-temps[i,"Average"]-C))
}
temps$St
cat("The day a change in trend is detected is:",temps[which(temps$St>T),"DAY"][1])
The day a change in trend is detected is: 19-Oct
temps[,"Date"]<-as.Date(temps[,"DAY"],"%d-%B")
temps[,"Date"]<-format(temps[,"Date"],format="%m/%d")
options(repr.plot.width=10, repr.plot.height=5)
ggplot(data = temps, aes(x = Date, y = St, group = 1)) +
geom_line()+
geom_hline(yintercept=T,color="red")+
scale_x_discrete(breaks = unique(temps$Date)[seq(1,123,5)])+
xlab("Days") +
ylab("CUSUM") +
ggtitle("CUSUM Chart for July-October Daily-high Temperature for Atlanta 1996-2015")+
theme(plot.title = element_text(hjust = 0.5))
(CUSUM chart output: July-October daily-high temperature for Atlanta, 1996-2015)
As we can also see in the visualization above, the weather starts changing around October 19th, which makes sense. Atlanta has a humid subtropical climate with long, hot summers. Even though the official end of summer is toward the end of September, the weather does not cool down until near the end of October.
Second, we will compute the CUSUM for each year individually and look at the trend:
#creating a data frame to store the cusum values for each year
cusum_df<-data.frame(matrix(nrow=nrow(temps),ncol=length(1996:2015)+1))
#assigning columns names to cusum data frame
colnames(cusum_df)<-colnames(temps[,1:21])
#tidying up the Day and converting it to Date for ease of processing
cusum_df$DAY<-temps$Date
#using Zero start method
#calculating cusum values for each year
for(y in 2:ncol(cusum_df)){
cusum_df[1,y]<-0 #initial St value for each column,set to zero
mu<-mean(temps[,y]) #mean of each sample space(each year's observations)
std<-sd(temps[,y]) #sd of each sample,also used as allowable slack
threshold<-5*std #using 5 sd as threshold value,different T for each year
change<-NULL # to store the St values at or above the threshold; the first one marks the first detected change
for(i in 2:nrow(cusum_df)){
cusum_df[i,y]<-max(0,cusum_df[i-1,y]+(mu-temps[i,y]-std))
if (cusum_df[i,y]>=threshold){
change<-append(change,cusum_df[i,y])}}
cat("In year of",colnames(cusum_df[y]),"the day Summer started is:",
cusum_df[which(cusum_df[,y]==change[1]),"DAY"],"\n")
}
In 1996 the detected end of summer is: 10/06
In 1997 the detected end of summer is: 10/19
In 1998 the detected end of summer is: 10/22
In 1999 the detected end of summer is: 10/23
In 2000 the detected end of summer is: 10/09
In 2001 the detected end of summer is: 10/28
In 2002 the detected end of summer is: 10/18
In 2003 the detected end of summer is: 10/11
In 2004 the detected end of summer is: 10/14
In 2005 the detected end of summer is: 10/25
In 2006 the detected end of summer is: 10/23
In 2007 the detected end of summer is: 10/26
In 2008 the detected end of summer is: 10/23
In 2009 the detected end of summer is: 10/17
In 2010 the detected end of summer is: 10/04
In 2011 the detected end of summer is: 10/23
In 2012 the detected end of summer is: 10/29
In 2013 the detected end of summer is: 10/23
In 2014 the detected end of summer is: 10/22
In 2015 the detected end of summer is: 10/29
As we can observe, for each year the weather starts cooling down, more or less, toward the end of October.
Part 2:
Use a CUSUM approach to make a judgment of whether Atlanta’s summer climate has gotten warmer in that time (and if so, when).
Answer: Looking at the results above, there is no strong evidence that summer is getting longer (or, by that measure, warmer); for most years, fall starts after October 19th. To investigate further, other methods might be more helpful.
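One way to apply CUSUM directly to this question (a sketch only, not run here; the restriction to July-August and the choices C = sigma/2 and T = 4*sigma are assumptions) is to collapse each year into a single summer statistic, such as the mean July-August daily high, and then run an increase-detecting CUSUM across the 20 yearly values:
#Sketch: one mean July-August daily high per year, then CUSUM for an increase
summer_rows <- grepl("Jul|Aug", temps$DAY)        #core-summer days only
yearly_avg <- colMeans(temps[summer_rows, 2:21])  #columns 2-21 hold the years 1996-2015
mu <- mean(yearly_avg)
sigma <- sd(yearly_avg)
C <- sigma/2                                      #allowable slack (assumption)
T <- 4*sigma                                      #threshold (assumption)
S <- 0
for (y in seq_along(yearly_avg)) {
  S <- max(0, S + (yearly_avg[y] - mu - C))       #increase-detecting CUSUM
  if (S >= T) cat("Sustained warming detected in", names(yearly_avg)[y], "\n")
}
Averaging over the core summer months removes the within-season pattern, so any detected change would reflect year-to-year warming rather than the normal seasonal cycle.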