Question 4.1

An everyday situation where clustering would be appropriate is grouping students by performance, using predictors such as:
- Amount of time studied
- Average GPA
- Student-teacher interaction
- Class size

Question 4.2

library(stats)
library(cluster)
library(fpc)

iris_response <- read.csv("iris.txt", sep="") #includes response column
iris <- iris_response[,1:4] #removes the response values from the data set
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
## 4          4.6         3.1          1.5         0.2
## 5          5.0         3.6          1.4         0.2
## 6          5.4         3.9          1.7         0.4
tail(iris)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width
## 145          6.7         3.3          5.7         2.5
## 146          6.7         3.0          5.2         2.3
## 147          6.3         2.5          5.0         1.9
## 148          6.5         3.0          5.2         2.0
## 149          6.2         3.4          5.4         2.3
## 150          5.9         3.0          5.1         1.8

Generating an elbow plot to determine an ideal number of clusters to use for the data set.

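set.seed(1) #kmeans picks random starting centers, so fix the seed for a reproducible plot
WSS <- numeric(10) #total within-cluster sum of squares for K = 1 to 10
for (K in 1:10) {
  km <- kmeans(iris,
             centers = K,
             iter.max = 100,
             nstart = 25)
  WSS[K] <- km$tot.withinss
}

plot(1:10, WSS, type = "b", xlab = "Number of clusters K", ylab = "Total within-cluster sum of squares")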

K = 3 will be used, since the marginal benefit of adding another cluster starts to diminish after 3.
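The elbow can also be read off numerically instead of by eye. The following is a minimal sketch, assuming R's built-in iris data as a stand-in for iris.txt; the names features, wss, pct_drop and elbow are illustrative and not part of the original script. It reports the percentage reduction in the within-cluster sum of squares gained by each additional cluster.

```r
# Sketch only: uses R's built-in iris data in place of iris.txt (an assumption).
features <- datasets::iris[, 1:4]

set.seed(42)  # k-means starts from random centers; fix the seed for repeatability
wss <- sapply(1:10, function(k) {
  kmeans(features, centers = k, iter.max = 100, nstart = 25)$tot.withinss
})

# Percentage reduction in WSS gained by going from k - 1 to k clusters
pct_drop <- c(NA, -diff(wss) / head(wss, -1) * 100)
elbow <- data.frame(k = 1:10, wss = round(wss, 1), pct_drop = round(pct_drop, 1))
print(elbow)
```

A large pct_drop followed by consistently small ones marks the elbow.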

km1 <- kmeans(iris,
             centers = 3,
             iter.max = 100,
             nstart = 25)

clusplot(iris, km1$cluster, color = TRUE, lines = 0) #plots the clusters on the first two principal components
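Because the response column was set aside before clustering, it can be used afterwards to check how well the clusters line up with the flower types. This is a hedged sketch, again assuming the built-in iris data in place of iris.txt; iris_full, confusion and purity are illustrative names.

```r
# Sketch only: built-in iris data stands in for iris.txt (an assumption).
iris_full <- datasets::iris
set.seed(42)
km <- kmeans(iris_full[, 1:4], centers = 3, iter.max = 100, nstart = 25)

# Rows: species, columns: cluster labels (the cluster numbering is arbitrary)
confusion <- table(Species = iris_full$Species, Cluster = km$cluster)
print(confusion)

# Share of flowers whose species is the majority species of their cluster
purity <- sum(apply(confusion, 2, max)) / sum(confusion)
print(round(purity, 3))
```

A purity close to 1 would mean that each cluster corresponds almost entirely to one species.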

Question 5.1

library(outliers)
crime <- read.table("uscrime.txt", header = TRUE)
head(crime)
##      M So   Ed  Po1  Po2    LF   M.F Pop   NW    U1  U2 Wealth Ineq     Prob
## 1 15.1  1  9.1  5.8  5.6 0.510  95.0  33 30.1 0.108 4.1   3940 26.1 0.084602
## 2 14.3  0 11.3 10.3  9.5 0.583 101.2  13 10.2 0.096 3.6   5570 19.4 0.029599
## 3 14.2  1  8.9  4.5  4.4 0.533  96.9  18 21.9 0.094 3.3   3180 25.0 0.083401
## 4 13.6  0 12.1 14.9 14.1 0.577  99.4 157  8.0 0.102 3.9   6730 16.7 0.015801
## 5 14.1  0 12.1 10.9 10.1 0.591  98.5  18  3.0 0.091 2.0   5780 17.4 0.041399
## 6 12.1  0 11.0 11.8 11.5 0.547  96.4  25  4.4 0.084 2.9   6890 12.6 0.034201
##      Time Crime
## 1 26.2011   791
## 2 25.2999  1635
## 3 24.3006   578
## 4 29.9012  1969
## 5 21.2998  1234
## 6 20.9995   682
tail(crime)
##       M So   Ed  Po1 Po2    LF   M.F Pop   NW    U1  U2 Wealth Ineq     Prob
## 42 14.1  0 10.9  5.6 5.4 0.523  96.8   4  0.2 0.107 3.7   4890 17.0 0.088904
## 43 16.2  1  9.9  7.5 7.0 0.522  99.6  40 20.8 0.073 2.7   4960 22.4 0.054902
## 44 13.6  0 12.1  9.5 9.6 0.574 101.2  29  3.6 0.111 3.7   6220 16.2 0.028100
## 45 13.9  1  8.8  4.6 4.1 0.480  96.8  19  4.9 0.135 5.3   4570 24.9 0.056202
## 46 12.6  0 10.4 10.6 9.7 0.599  98.9  40  2.4 0.078 2.5   5930 17.1 0.046598
## 47 13.0  0 12.1  9.0 9.1 0.623 104.9   3  2.2 0.113 4.0   5880 16.0 0.052802
##       Time Crime
## 42 12.1996   542
## 43 31.9989   823
## 44 30.0001  1030
## 45 32.5996   455
## 46 16.6999   508
## 47 16.0997   849

Determining any outliers in the last column of the dataset (number of crimes per 100,000 people) using a box and whisker plot and the grubbs.test function.

crime[,16] #the crime column of the dataset
##  [1]  791 1635  578 1969 1234  682  963 1555  856  705 1674  849  511  664  798
## [16]  946  539  929  750 1225  742  439 1216  968  523 1993  342 1216 1043  696
## [31]  373  754 1072  923  653 1272  831  566  826 1151  880  542  823 1030  455
## [46]  508  849
grubbs.test(crime[,16], type = 10) #Grubbs' test for a single outlier. With p = 0.079, the highest value (1993) is not a significant outlier at the 5% level, only at the 10% level.
## 
##  Grubbs test for one outlier
## 
## data:  crime[, 16]
## G = 2.81287, U = 0.82426, p-value = 0.07887
## alternative hypothesis: highest value 1993 is an outlier
grubbs.test(crime[,16], type = 11) #Grubbs' test for checking whether the lowest and highest values are two outliers on opposite tails of the sample. With p = 1, there is no evidence that 342 and 1993 are jointly outliers.
## 
##  Grubbs test for two opposite outliers
## 
## data:  crime[, 16]
## G = 4.26877, U = 0.78103, p-value = 1
## alternative hypothesis: 342 and 1993 are outliers
boxplot(crime[,16]) #Boxplot flags points beyond 1.5 times the hinge spread: 1674, 1969, and 1993 are potential outliers (1635 falls just inside the upper whisker limit of 1656)
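The two checks can be reproduced by hand from the values printed for crime[,16]. The sketch below recomputes the Grubbs statistic for the largest value and the points that boxplot() marks beyond the whiskers. It uses only base R; crime_rate, G and stats are illustrative names.

```r
# Crime values copied from the printed output of crime[,16]
crime_rate <- c(791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, 705, 1674, 849,
                511, 664, 798, 946, 539, 929, 750, 1225, 742, 439, 1216, 968,
                523, 1993, 342, 1216, 1043, 696, 373, 754, 1072, 923, 653, 1272,
                831, 566, 826, 1151, 880, 542, 823, 1030, 455, 508, 849)

# Grubbs statistic for the largest value: distance from the mean in sd units
G <- (max(crime_rate) - mean(crime_rate)) / sd(crime_rate)
print(round(G, 5))

# boxplot() flags points more than 1.5 hinge spreads beyond the hinges
stats <- boxplot.stats(crime_rate)
print(sort(stats$out))
```

The G value should agree with the one reported by grubbs.test, and stats$out lists the points drawn individually on the boxplot.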

Question 6.1

In technical trading, a change detection strategy could be employed to detect significant cumulative changes in a stock's price. The critical value C and the threshold T could both be set according to the risk tolerance of the trader's strategy. Since an increase in a stock's price is usually advantageous to a holder, the trader would mainly be interested in decreases, and could set up an indicator or alarm that triggers when the CUSUM model detects a sustained drop in the price.
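A minimal sketch of such an alarm is given here, assuming a one-sided CUSUM that accumulates drops below a baseline mean. The function cusum_decrease, its arguments mu, C and T_alarm, and the toy prices are all hypothetical and chosen only for illustration.

```r
# Sketch of a one-sided CUSUM that accumulates drops below a baseline mean.
# mu, C and T_alarm are hypothetical inputs chosen by the trader.
cusum_decrease <- function(x, mu, C, T_alarm) {
  S <- numeric(length(x))
  prev <- 0
  for (t in seq_along(x)) {
    # Add the shortfall below mu (less the slack C); never go below zero
    S[t] <- max(0, prev + (mu - x[t] - C))
    prev <- S[t]
  }
  alarms <- which(S >= T_alarm)
  list(S = S, first_alarm = if (length(alarms) > 0) alarms[1] else NA_integer_)
}

# Toy prices (made up for illustration): flat near 100, then a sustained slide
prices <- c(100, 101, 99, 100, 100, 98, 97, 96, 95, 94)
result <- cusum_decrease(prices, mu = 100, C = 0.5, T_alarm = 5)
print(result$S)
print(result$first_alarm)
```

A larger C ignores more day-to-day noise, while a larger T_alarm delays the alarm but reduces false positives.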

Question 6.2

The data and plots for this problem can be seen in the “6501 HW2.xlsx” file uploaded along with this file. For the first part of the question, we needed to identify when the unofficial summer ends (i.e., when the weather starts cooling off). My approach was to compute the average temperature for each available day across the years 1997-2015 and plot it by day. I assumed that the no-change mean of x_t could be taken from the month of July, since July is certainly summer. Implementing the CUSUM approach in Excel showed that S_t peaked on August 24th within the June 18th to October 31st window, which suggests that this could be the unofficial day summer ends. A C value of 0.2 fit the model best. Any C value larger than 0.75 made the plot inconclusive.
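For reference, the one-sided CUSUM recursion for detecting a decrease is usually written as follows. Here \mu is the no-change mean (the July average in this write-up), C is the slack value and T is the threshold. The starting value S_0 = 0 is the usual convention and an assumption about how the spreadsheet was set up.

```latex
S_t = \max\bigl\{0,\; S_{t-1} + (\mu - x_t - C)\bigr\}, \qquad S_0 = 0, \qquad \text{flag a change when } S_t \ge T
```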

For the second part of 6.2, a similar approach was used to determine whether the average summer temperature has been increasing over the years. The CUSUM model showed an increase in summer temperatures between 2009 and 2012.
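For an increase, the sign inside the CUSUM update flips so that values above the baseline accumulate. The sketch below illustrates this on made-up yearly averages. The function cusum_increase, the values in yearly_mean, and the choices of mu and C are hypothetical and are not the homework data.

```r
# Sketch only: the yearly values below are made up, not the homework data.
cusum_increase <- function(x, mu, C) {
  # Carry S_{t-1} forward with Reduce(); the leading element is the initial 0
  S <- Reduce(function(prev, xt) max(0, prev + (xt - mu - C)),
              x, accumulate = TRUE, init = 0)
  S[-1]
}

yearly_mean <- c(88, 87.5, 88.5, 88, 90, 91, 90.5, 88)  # hypothetical summer averages
mu <- mean(yearly_mean[1:4])  # baseline taken from the early years (an assumption)
S <- cusum_increase(yearly_mean, mu = mu, C = 0.5)
print(S)
```

A run of years in which S keeps climbing points to a sustained warming period rather than a single hot year.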