Question 4.1 Describe a situation or problem from your job, everyday life, current events, etc., for which a clustering model would be appropriate. List some (up to 5) predictors that you might use.
Answer: A clustering model would be appropriate for grouping recipes into different types based on recipe features such as cuisine, total preparation and cooking time, number of ingredients, calories per serving, and main ingredient.
Question 4.2 The iris data set iris.txt contains 150 data points, each with four predictor variables and one categorical response. The predictors are the width and length of the sepal and petal of flowers and the response is the type of flower. The data is available from the R library datasets and can be accessed with iris once the library is loaded. It is also available at the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Iris ). The response values are only given to see how well a specific method performed and should not be used to build the model.
Use the R function kmeans to cluster the points as well as possible. Report the best combination of predictors, your suggested value of k, and how well your best clustering predicts flower type.
Data = read.csv("iris.txt",sep = "")
str(Data)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
print(summary(Data))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Assuming we do not know the number of flower types in advance, we need to find a suitable value of k.
# Look for which k to use
# use iter.max = 15 to give the algorithm enough iterations to converge
# use nstart = 5 so each call to kmeans tries several random starting sets
set.seed(123)
# k = 1 is not informative on its own; compute wss for k = 1 ... 8 and look for an elbow
k.max <- 8
data <- as.matrix(Data[,1:4])
wss <- sapply(1:k.max,function(k){kmeans(data,k,nstart=5,iter.max=15)$tot.withinss})
wss
[1] 681.37060 152.34795 78.85144 57.26562 46.44618 41.70442 34.75675
[8] 30.30321
plot(1:k.max, wss,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")
Based on the elbow diagram, I would choose k = 2. However, since I know there are three species of flower, I will use k = 3 and test which combination of predictors gives the best results.
# Compare the total within-cluster sum of squares for different sets of predictors
find_predictors = function(d){kmeans(d,3,nstart=5,iter.max=15)$tot.withinss}
find_predictors(Data[,1]) #15.8
[1] 15.81662
find_predictors(Data[,2]) #5.35
[1] 5.349783
find_predictors(Data[,3]) #24.9
[1] 24.8629
find_predictors(Data[,4]) #4.9
[1] 4.913174
find_predictors(Data[,1:2]) #37.1
[1] 37.0507
find_predictors(Data[,2:3]) #40.7
[1] 40.73707
find_predictors(Data[,3:4]) #31.4
[1] 31.37136
find_predictors(Data[,c(1,3)]) #53.8
[1] 53.80998
find_predictors(Data[,c(1,4)]) #32.7
[1] 32.73348
find_predictors(Data[,c(2,3)]) #40.7
[1] 40.73707
find_predictors(Data[,c(2,4)]) #20.6
[1] 20.6024
find_predictors(Data[,1:3])#69.4
[1] 69.42974
find_predictors(Data[,2:4])#47.8
[1] 47.86643
find_predictors(Data[,c(1,2,4)])#48.6
[1] 48.66078
find_predictors(Data[,1:4])#78.85
[1] 78.85144
# As the number of predictors increases, the clusters become less compact,
# so the total within-cluster sum of squares grows as predictors are added
# and cannot be compared directly across predictor sets of different sizes.
# Since the main goal of k-means clustering is to find patterns and structure
# in the data, I will use all four predictors in the model.
best_model <-kmeans(Data[,1:4],3,nstart=5,iter.max=15)
table(Data[,5], best_model$cluster)
1 2 3
setosa 0 0 50
versicolor 48 2 0
virginica 14 36 0
After trying the different predictor subsets, I use all four predictors (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width). As noted above, the total within-cluster sum of squares grows as predictors are added, so it cannot be used to pick among subsets of different sizes; I keep all four because they retain the most information about each flower.
With the best model (k = 3 on all four predictors), the clustering matches the species labels well: all 50 setosa flowers are placed in cluster 3. There is some overlap between versicolor and virginica: 48 of the 50 versicolor flowers fall in cluster 1 (2 in cluster 2), and 36 of the 50 virginica flowers fall in cluster 2 (14 in cluster 1). Mapping each species to its majority cluster, the clustering predicts flower type correctly for (50 + 48 + 36) / 150 ≈ 89% of the data points.
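Because the total within-cluster sum of squares grows with the number of predictors, it is not a fair way to compare predictor subsets. As a supplementary check (a sketch that was not part of the original analysis; the helper name cluster_agreement is my own), each subset could instead be scored by how well its clusters line up with the known species labels:
cluster_agreement <- function(cols) {
  set.seed(123)
  km <- kmeans(Data[, cols, drop = FALSE], centers = 3, nstart = 25, iter.max = 20)
  tab <- table(Data$Species, km$cluster)
  sum(apply(tab, 1, max)) / nrow(Data)  # rough accuracy: map each species to its majority cluster
}
cluster_agreement(c("Petal.Length", "Petal.Width"))  # petal-only subset
cluster_agreement(names(Data)[1:4])                  # all four predictors
This keeps the k = 3 choice but compares subsets on the scale the question actually asks about, namely how well the clustering predicts flower type.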
Question 5.1 Using crime data from the file uscrime.txt (http://www.statsci.org/data/general/uscrime.txt, description at http://www.statsci.org/data/general/uscrime.html), test to see whether there are any outliers in the last column (number of crimes per 100,000 people). Use the grubbs.test function in the outliers package in R.
Data_crime = read.csv("uscrime.txt",sep = "")
str(Data_crime)
'data.frame': 47 obs. of 16 variables:
$ M : num 15.1 14.3 14.2 13.6 14.1 12.1 12.7 13.1 15.7 14 ...
$ So : int 1 0 1 0 0 0 1 1 1 0 ...
$ Ed : num 9.1 11.3 8.9 12.1 12.1 11 11.1 10.9 9 11.8 ...
$ Po1 : num 5.8 10.3 4.5 14.9 10.9 11.8 8.2 11.5 6.5 7.1 ...
$ Po2 : num 5.6 9.5 4.4 14.1 10.1 11.5 7.9 10.9 6.2 6.8 ...
$ LF : num 0.51 0.583 0.533 0.577 0.591 0.547 0.519 0.542 0.553 0.632 ...
$ M.F : num 95 101.2 96.9 99.4 98.5 ...
$ Pop : int 33 13 18 157 18 25 4 50 39 7 ...
$ NW : num 30.1 10.2 21.9 8 3 4.4 13.9 17.9 28.6 1.5 ...
$ U1 : num 0.108 0.096 0.094 0.102 0.091 0.084 0.097 0.079 0.081 0.1 ...
$ U2 : num 4.1 3.6 3.3 3.9 2 2.9 3.8 3.5 2.8 2.4 ...
$ Wealth: int 3940 5570 3180 6730 5780 6890 6200 4720 4210 5260 ...
$ Ineq : num 26.1 19.4 25 16.7 17.4 12.6 16.8 20.6 23.9 17.4 ...
$ Prob : num 0.0846 0.0296 0.0834 0.0158 0.0414 ...
$ Time : num 26.2 25.3 24.3 29.9 21.3 ...
$ Crime : int 791 1635 578 1969 1234 682 963 1555 856 705 ...
library(outliers)
set.seed(1234)
x = Data_crime[,16]
# Visualize the distribution of the response (number of crimes per 100,000 people)
boxplot(x, ylab = "# of crimes / 100K people")
hist(x, breaks = 6)
# The Grubbs test assumes the data are approximately normally distributed;
# the boxplot and histogram above give a rough visual check of that assumption.
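As an optional, more formal check (a sketch I am adding, not part of the original submission; qqnorm, qqline, and shapiro.test are all base R functions):
qqnorm(x); qqline(x)  # points should fall roughly on the line if x is near-normal
shapiro.test(x)       # a small p-value would indicate departure from normality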
grubbs.test(x,type=10) #one-sided test for a single outlier
Grubbs test for one outlier
data: x
G = 2.81290, U = 0.82426, p-value = 0.07887
alternative hypothesis: highest value 1993 is an outlier
#The p-value of 0.079 is not significant at the 0.05 level, so I cannot reject the null
#hypothesis that there are no outliers; in other words, the test does not conclude that
#the highest value (1993) is an outlier, so I will not remove 1993.
grubbs.test(x,type=11,two.sided=TRUE) #two-sided test for two opposite outliers
Grubbs test for two opposite outliers
data: x
G = 4.26880, U = 0.78103, p-value < 2.2e-16
alternative hypothesis: 342 and 1993 are outliers
#The p-value is below 2.2e-16, which is significant at the 0.05 level, so we reject the
#null hypothesis of no outliers; the test flags the lowest value (342) and the highest
#value (1993) as opposite outliers.
#Remove the highest value (1993) and re-run the two-sided test
x_2 <- x[-match(c(1993), x)]
grubbs.test(x_2 ,type=11,two.sided=TRUE)
Grubbs test for two opposite outliers
data: x_2
G = 4.58290, U = 0.73894, p-value = 0.6078
alternative hypothesis: 342 and 1969 are outliers
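After removing 1993, the two-sided test gives a p-value of 0.61, so no further outliers are flagged at the 0.05 level. For completeness, here is a rough sketch (not part of the original submission) of how the one-outlier Grubbs test could be applied iteratively, removing the most extreme value only while the test is significant; note that with this data the first one-sided p-value (0.079) already exceeds 0.05, so the loop below would stop without removing anything:
y <- Data_crime[, 16]
repeat {
  g <- grubbs.test(y, type = 10)        # tests the value furthest from the mean
  if (g$p.value >= 0.05) break          # stop once no significant outlier remains
  y <- y[-which.max(abs(y - mean(y)))]  # drop the flagged extreme value
}
length(Data_crime[, 16]) - length(y)    # number of observations removed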
Question 6.1
Describe a situation or problem from your job, everyday life, current events, etc., for which a Change Detection model would be appropriate. Applying the CUSUM technique, how would you choose the critical value and the threshold?
A credit card company could track the number of fraudulent transactions each minute and compare it against the average observed over recent periods. Setting the threshold T and the critical value C is a trade-off. The lower the T value, the sooner a change is likely to be detected; however, if T is set too low, the fraud prevention team will be alerted too frequently, costing them too much time and too many resources chasing false alarms.
If the C value is too large, the system becomes less sensitive to changes in the number of fraudulent transactions, so the company may lose a lot of money before the fraud prevention team becomes aware of the issue and fixes it.
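To make the idea concrete, here is a small sketch (with made-up parameter values, not an actual production system) of a one-sided CUSUM that accumulates evidence of an increase in the per-minute fraud count and raises an alert once the threshold T is crossed:
# mu is the historical average fraud count per minute; C and T are the
# critical value and threshold discussed above (the numbers below are made up).
cusum_detect <- function(x, mu, C, T) {
  S <- 0
  for (t in seq_along(x)) {
    S <- max(0, S + (x[t] - mu - C))  # accumulate increases above mu + C
    if (S >= T) return(t)             # first minute at which a change is flagged
  }
  NA                                  # no change detected
}
set.seed(1)
counts <- c(rpois(100, 5), rpois(50, 9))  # simulated counts: mean jumps from 5 to 9 at minute 101
cusum_detect(counts, mu = 5, C = 1, T = 10)
Lowering C or T makes the alert fire sooner but produces more false alarms, which is exactly the trade-off described above.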
Question 6.2
Using July through October daily-high-temperature data for Atlanta for 1996 through 2015, use a CUSUM approach to identify when unofficial summer ends (i.e., when the weather starts cooling off) each year. You can get the data that you need from the file temps.txt or online, for example at http://www.iweathernet.com/atlanta-weather-records or https://www.wunderground.com/history/airport/KFTY/2015/7/1/CustomHistory.html . You can use R if you’d like, but it’s straightforward enough that an Excel spreadsheet can easily do the job too.
Use a CUSUM approach to make a judgment of whether Atlanta’s summer climate has gotten warmer in that time (and if so, when).
Please see the attached Excel spreadsheet for Part 1 and Part 2.
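Since the calculations for both parts were done in Excel, the following is only a rough R sketch of the Part 1 idea, assuming temps.txt has a DAY column followed by one column of daily highs per year (names like X1996 ... X2015); the July baseline and the C and T values below are placeholders, not the values used in the spreadsheet:
temps <- read.csv("temps.txt", sep = "")
# Decrease-detecting CUSUM on one year of daily highs; returns the index of the
# first day the statistic crosses the threshold (the unofficial end of summer).
find_summer_end <- function(temp, C = 5, T = 25) {
  mu <- mean(temp[1:31])                 # use July (first 31 days) as the summer baseline
  S <- 0
  for (t in seq_along(temp)) {
    S <- max(0, S + (mu - temp[t] - C))  # accumulates when temps drop below mu - C
    if (S >= T) return(t)
  }
  NA
}
sapply(temps[, -1], find_summer_end)     # one detected day index per year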
Thank you for grading my HW2!