I have recently been planning a vacation, and I think it would be reasonable to use a clustering model to group similar travel destinations. Clustering rather than classification would be well suited to this type of question because the desired output (which places I might like to visit) does not have a pre-defined result for each data point. After all, I haven’t been everywhere! That makes this an unsupervised learning problem to which clustering would be better suited than a classification model. Some predictors that could be used to group destinations include:
For population, taking the log might be more useful than using the raw value. Otherwise, once the data is scaled, there would be almost no difference among most locations - very highly populated cities of 10 million or more would dwarf all the others.
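A minimal sketch of that transformation, using a small hypothetical destinations table (not real data for this problem):
#hypothetical example: population spans several orders of magnitude
destinations <- data.frame(city = c('Tokyo', 'Reykjavik', 'Asheville'),
                           population = c(37000000, 130000, 95000))
#scaled raw population: the largest city dwarfs everything else
scale(destinations$population)
#scaled log population: differences among the smaller cities remain visible
scale(log10(destinations$population))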
Summary: The best result I achieved (limiting the number of clusters to no more than 10) was using a k value of 7 with petal length and width as the only predictors. With these parameters, my model correctly clustered 145 of 150 datapoints.
Discussion:
Min k value. I analyzed k values between 2 and 10. While I knew going in that k = 2 was not going to produce a useful result (given that we know there are 3 different types of iris in the data), it still represents the lowest possible case for which there is some differentiation. In my exploratory analysis, I noted that the setosa variety seemed to be largely distinct (mostly in its own cluster) from the other two using all four attributes as predictors. This may indicate that versicolor and virginica are usually more similar in size to each other than either is to setosa.
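A minimal sketch of the kind of check I mean, assuming the iris_data object read in by the code below and all four predictors, scaled:
#sketch: with only 2 clusters on all four scaled predictors,
#setosa tends to land almost entirely in its own cluster
set.seed(19)
two_cluster_fit <- kmeans(scale(iris_data[, 1:4]), 2, iter.max = 20)
table(iris_data$Species, two_cluster_fit$cluster)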
Max k value. I chose a max k value of 10, despite the fact that I was able to get slightly higher counts of correct classifications using k values up to 20. There are only 150 total datapoints, for one, and clusters that are too small become unwieldy. Also, we know that there are only three species in the end, so as we use more clusters, we end up assigning more and more of them to each species. By the time we get to 15 clusters, for example, there are 5 clusters on average assigned to each species. (At the other extreme, a single cluster would have produced a trivial result, with all data assigned to it.)
How to determine “correct” classification. I assigned a cluster to the species represented by a majority (or plurality) of its datapoints. Any members of that cluster that didn’t match the majority species were labeled incorrect. It is possible that this could skew results if one species was far more numerous than another, but in this case, a look at the raw data indicates that there are exactly 50 of each species.
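As an illustration of that rule, here is a toy cross-classification with made-up counts (not my actual results):
#toy table: rows are species, columns are clusters (hypothetical counts)
toy <- matrix(c(48,  1,  0,
                 2, 45,  3,
                 0,  4, 47),
              nrow = 3, byrow = TRUE,
              dimnames = list(c('setosa', 'versicolor', 'virginica'),
                              c('cluster1', 'cluster2', 'cluster3')))
#each cluster is labeled with its majority species; the rest of its column is incorrect
apply(toy, 2, max)                 #correct counts per cluster
colSums(toy) - apply(toy, 2, max)  #incorrect counts per cluster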
Scaling. I scaled the data to ensure that larger values did not have an outsized effect on the results. For both petals and sepals, lengths are greater than widths, so this step helped to make width a more relevant predictor.
Permutations. I tested each possible combination of sepal length, sepal width, petal length, and petal width to find the best predictor set. Although my model had its absolute best results using only the petal measurements, several other combinations were very close (within 1 or 2 correct answers). If I were to use this model in practice, I would still recommend using all four available measurements for any additional data, at least until a larger dataset indicated a clear difference between successful predictors. One clear pattern: sepal size is a weaker predictor than petal size across the board. Using only sepal length and/or width as predictors gave the least accurate results of any permutation; even sepal length and width combined did a worse job classifying than either petal width or petal length by itself.
Reliability of results. K-means clustering is not guaranteed to deliver the best answer, so it is possible that some of these solutions are sub-optimal. Testing different seeds and averaging the results could help to get closer to finding optimal solutions, but at the cost of adding another dimension to the analysis. I chose not to go to that level of depth in this problem, but this could be a useful validation step in a different setting. Here, the various results were so close to one another that the performance difference is likely to be minimal.
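For reference, a sketch of what that extra validation could look like for the best predictor set (petal length and width, columns 3 and 4, with k = 7). It assumes the iris_data object and correct_class function from the code below; the nstart argument to kmeans is an alternative that keeps the best of several random starts.
#sketch: check stability of the result across several (arbitrary) seeds
seed_checks <- sapply(c(1, 19, 42, 99, 123), function(s){
  set.seed(s)
  fit <- kmeans(scale(iris_data[, 3:4]), 7, iter.max = 20)
  correct_class(table(iris_data$Species, fit$cluster))
})
seed_checks
#alternative: let kmeans itself try several random starts and keep the best
set.seed(19)
kmeans(scale(iris_data[, 3:4]), 7, iter.max = 20, nstart = 25)$tot.withinss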
#read in data
iris_data <- read.table('data 4.2/iris.txt', header = TRUE, stringsAsFactors = TRUE)
#setting seed so code is reproducible
set.seed(19)
#create vector of all possible permutations of predictors
#last term is trivial, but everything else I tried resulted in R automatically splitting the vector into single ints
permutations <- c(1:4, combn(4, 2, simplify = FALSE), combn(4, 3, simplify = FALSE), combn(4, 4, simplify = FALSE))
#function loops through columns in cross-classified table to count total correctly classified
#(each centroid's classification defined by most common species within centroid)
correct_class <- function(table){
  # table - a contingency table of species (rows) by cluster (columns)
  # returns the total number of correctly classified data points
  #initializes correct/incorrect counts
  total_correct <- 0
  total_incorrect <- 0
  #loops through each column (cluster) in table, finds total count not in majority
  #counts both correct and incorrect values; though not used here, incorrect counts may be useful in other contexts
  for (col in 1:ncol(table)){
    num_incorrect <- sum(table[, col]) - max(table[, col])
    total_incorrect <- total_incorrect + num_incorrect
    total_correct <- total_correct + max(table[, col])
  }
  return(total_correct)
}
#function finds the best k value for a set of predictors
find_best_k <- function(test_cols, num_centers = 10, data = iris_data){
  #test_cols - a vector of ints consisting of columns to be included in test run
  #num_centers - a positive int, the max number of centroids to be tested in kmeans
  # note that num_centers should be at least 2; a single cluster would be trivial anyway
  #data - the data.frame to be analyzed
  #returns a two-element vector consisting of the best k-value and
  #the number of correct classifications for that value
  #initializing vector to be used to store results
  k_comparison <- c()
  #scale data using the built-in scale function (does not depend on k, so done once)
  scaled_data <- scale(data[, test_cols])
  #looping through increasing numbers of centers starting at 2
  for (c in 2:num_centers){
    #running k-means clustering on the selected columns of the iris data
    iris_kmeans <- kmeans(scaled_data, c, iter.max = 20)
    fit_iris <- fitted(iris_kmeans, method = c('classes'))
    #cross-classifying species against cluster using table()
    iris_x_class <- table(data$Species, fit_iris)
    #appending the number of correct classifications for this k
    k_comparison <- c(k_comparison, correct_class(iris_x_class))
  }
  return(c(which.max(k_comparison) + 1, max(k_comparison)))
}
#initialize final results table
results.table <- data.frame(matrix(ncol = 2, nrow = 0))
colnames(results.table) <- c('k', 'correct.count')
#loop through each permutation and store best output in results.table
i <- 1
for (perm in permutations){
  results.table[i,] <- find_best_k(perm)
  i <- i + 1
}
print(results.table)
## k correct.count
## 1 3 108
## 2 6 89
## 3 4 143
## 4 3 144
## 5 7 121
## 6 9 134
## 7 7 143
## 8 8 133
## 9 8 143
## 10 7 145
## 11 7 128
## 12 10 137
## 13 7 144
## 14 5 143
## 15 10 142
print(sprintf('Best Predictor Set: %s, k-value: %s, Total correct classifications: %s', permutations[which.max(results.table$correct.count)], results.table[which.max(results.table$correct.count), 1], max(results.table$correct.count)))
## [1] "Best Predictor Set: 3:4, k-value: 7, Total correct classifications: 145"
Another approach to this problem (which I did not take) involves the use of elbow charts. Elbow charts allow easy visual inspection to find a “good enough” solution - not necessarily the absolute best, but the point where the marginal gain from adding another cluster is no longer very large. In this case, I chose not to use elbow charts because the assignment directed reporting the best combination of both predictor set and k value. Given the number of permutations involved, it would have been difficult to make a single understandable plot depicting all of the data, and separate plots for each permutation would have been unwieldy.
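For completeness, a sketch of what an elbow chart would look like for a single predictor set (here, all four columns of the iris data read in above), using total within-cluster sum of squares:
#sketch: elbow chart for one predictor set; not used in the analysis above
wss <- sapply(2:10, function(k){
  set.seed(19)
  kmeans(scale(iris_data[, 1:4]), k, iter.max = 20)$tot.withinss
})
plot(2:10, wss, type = 'b', xlab = 'k', ylab = 'total within-cluster sum of squares')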
The simple one-sided Grubbs test (type = 10) on the number of crimes per 100,000 people does not identify any outliers in the data at a 95% confidence level. The p-value for the highest crime rate (1993) is 0.079, so it does not quite meet the threshold of 0.05 or less needed to be considered an outlier. However, that value is 2.81 standard deviations from the mean (z-score = 2.81); for a normal distribution, a z-score with a magnitude greater than 2 falls outside the 95% confidence interval, which would normally suggest an outlier. Some further investigation is required.
The lowest value, 342, has a p-value of 1 and z-score of -1.456 and so is not an outlier by any reasonable measure under this test.
It should be noted that the Grubbs test assumes the data is normally distributed - it does not produce useful results otherwise. If you were running this test on data “in the wild,” it would be worth testing for normality first to ensure the Grubbs test applies in the first place. Since the assignment is specifically to use this function on this particular dataset, I did not do that here.
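For reference, a sketch of what that normality check could look like, using the same crime column that is read in just below:
#sketch only - not part of the required analysis
#Shapiro-Wilk test: a small p-value would suggest the data is not normally distributed
shapiro.test(uscrime_data$Crime)
#Q-Q plot: points near the reference line suggest approximate normality
qqnorm(uscrime_data$Crime)
qqline(uscrime_data$Crime)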
library(outliers)
uscrime_data <- read.table('data 5.1/uscrime.txt', header= TRUE)
#run outlier test
#opposite = FALSE tests upper endpoint of sample, opposite = TRUE tests lower endpoint
#this could also be run by setting type = 11, but I wanted to get individual p-values
upper_outlier_test <- grubbs.test(uscrime_data$Crime, opposite = FALSE)
lower_outlier_test <- grubbs.test(uscrime_data$Crime, opposite = TRUE)
print(sprintf('Highest value: p value = %s, z score = %s', upper_outlier_test$p.value, upper_outlier_test$statistic[1]))
## [1] "Highest value: p value = 0.0788748625675554, z score = 2.81287441018599"
print(sprintf('Lowest value: p value = %s, z score = %s', lower_outlier_test$p.value, lower_outlier_test$statistic[1]))
## [1] "Lowest value: p value = 1, z score = 1.45589300761895"
A look at a boxplot of the data provides further information on the apparent discrepancy between the p-value and the z-score at the high end of the data. There are actually 3 data points (out of a total of 47) more than two standard deviations above the mean, but none more than two standard deviations below it. In particular, there are two relatively high values - 1993, as mentioned above, and 1969, the second-highest. The third datapoint beyond two standard deviations is significantly less extreme than the two highest values.
boxplot(uscrime_data$Crime)
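One way to look at the numbers behind that plot (a sketch using boxplot.stats, which reports the points drawn beyond the whiskers):
#sketch: which points does the boxplot flag, and how many standard deviations out are they?
high_points <- boxplot.stats(uscrime_data$Crime)$out
high_points
(high_points - mean(uscrime_data$Crime)) / sd(uscrime_data$Crime)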
A Grubbs test with type = 20 could be used to test whether these two observations in combination are in fact outliers, but the grubbs.test function (with the qgrubbs function “under the hood”) only allows for this test to be run on data sets of sizes between 3 and 30. Since there are 47 observations in this dataset, grubbs.test with type = 20 produces an error. The code block below shows the inputs that I would have used if the sample were smaller.
#tests if there are two outliers on same tail
#this code does not run for given data, as it only supports sample sizes between 3 and 30
#(there are 47 rows in this dataset)
#double_outlier_test <- grubbs.test(uscrime_data$Crime, type = 20)
Taking a smaller sample of the dataset, or even producing a new dataset with the same mean and standard deviation, would produce an imprecise result in this case. The Grubbs test with type = 20 uses a ratio of variances with and without the two extreme datapoints to determine whether they are outliers. Removing two datapoints from a set of 30 has a greater effect than removing the same two points from a set of 47. (If there is a way to do this without skewing the final result, it is beyond my current ability.)
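A sketch of the effect described above - how much the variance drops when the two largest values are removed, for the full 47 observations versus a hypothetical smaller sample of 30 that still contains the two largest values:
#sketch: ratio of the variance without the two largest values to the full variance
variance_drop <- function(x){
  var(sort(x)[1:(length(x) - 2)]) / var(x)
}
variance_drop(uscrime_data$Crime)   #all 47 observations
#a hypothetical sample of 30 that keeps the two largest values plus 28 random others
set.seed(19)
smaller <- c(sort(uscrime_data$Crime, decreasing = TRUE)[1:2],
             sample(sort(uscrime_data$Crime, decreasing = TRUE)[-(1:2)], 28))
variance_drop(smaller)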
While I was still a student, I worked summers in a quality control laboratory for a manufacturing firm where change detection models were in constant use to control processes and output quality. We made industrial ceramics, and one of the key steps in the process was firing in a kiln at very high temperatures, around 2500 °C, for hours to days. If temperatures were too high or too low, our end products would be unusable. Given the physical size of the kiln, there was always some natural point-to-point variation in temperatures.
Additionally, our raw materials all had some degree of natural variance in their chemical composition, which also affected the final product. For instance, a formulation that started with unusually high calcium content could withstand higher temperatures in the kiln than the same product with low starting calcium.
We used change detection models to determine when deviations in the kiln were causing products to end up outside of acceptable thresholds. We used both the raw measurements from outputs as well as before and after calculations to determine how to correct any errors that occurred, hopefully before they resulted in too much loss.
As an example, we used the CUSUM technique on a series of density measurements. For most products, low densities were much more of a problem than high densities, so our CUSUM tests were mostly one-sided. Depending on customer requirements, we could set different critical values and thresholds for each product. If a customer was buying single-use products, for instance, we would use a high C value, because we were mostly concerned with defects in our own processes. For high-end products used in applications like electronics manufacturing, we used low C values with tight thresholds, both because of the user requirements and because we typically made such items in smaller batches. We didn’t want to take too long to detect an error, because then we might lose an entire customer order.
First, I did some exploratory analysis. After reading in the data, I found the mean temperature for each date across all sample years, then took the mean of those temperatures to find the overall mean temperature for the period of 1 July through 31 October from 1996 to 2015. This last value is particularly important, because it is what will be used for \(\mu\) in the later analysis. I found that the average of all the temperatures in the data is 83.3 degrees F.
#read in data
temps_data <- read.table('data 6.2/temps.txt', header= TRUE)
#find the mean temp across sample years for each day in sample
daily_means <- rowMeans(temps_data[,-1])
#calculate the overall daily mean for the dataset (1 Jul through 31 Oct, 1996-2015)
mean_temp <- mean(daily_means)
print(mean_temp)
## [1] 83.33902
With the daily and overall temperature means, I already have most of the information I need to run a CUSUM model to determine when it starts to cool off in Atlanta. The last value I have to add is C, which controls how far below the mean a day needs to be before it starts to move the CUSUM. Because in this case I want to detect a decrease in temperature, I will be using \(\mu - x_t - C\) for the individual values in the CUSUM formula (as opposed to \(x_t - \mu - C\) for increase detection).
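Putting that together, the running total that my cusum function (below) keeps for decrease detection is

\[ S_t = \max\left(0,\; S_{t-1} + (\mu - x_t - C)\right), \qquad S_0 = 0, \]

and the end of summer is flagged at the first \(t\) where \(S_t\) reaches the chosen threshold (for increase detection, the inner term becomes \(x_t - \mu - C\)).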
Functionally, \(\mu - C\) can be thought of as a “heat threshold,” with higher C values meaning a lower temperature is needed to raise the CUSUM. Put another way, days above \(\mu - C\) bring the CUSUM back toward 0, while days below \(\mu - C\) raise it. The higher the C value, the later in the year summer will end according to our model.
We also need to choose a threshold. This can be measured in degree-days relative to \(\mu - C\). A day that is 1 degree below \(\mu - C\) will contribute 1 degree-day toward the threshold; two days that are each 5 degrees below it will contribute 10 degree-days in total.
For my model, I chose a C value of 3, to start counting non-summer days when the temperature drops below 80 degrees F (I don’t like the heat). Because Atlanta does not usually have exceptional day-to-day temperature swings, I chose a threshold of 35 degree-days. This is high enough that a mid-summer cold front would not reach it (it requires about a week at 5 degrees below \(\mu - C\), or 4 days at 9 degrees below), but low enough that temperatures will reach it within a week or two of consistently falling below 80. If I were looking at a location with high day-to-day variability (for example, Denver, Colorado), I would choose a higher threshold so that a day or two of extremely low temperatures did not unreasonably bias the model. As a note, at a threshold of 30 and C = 3, I got one extreme outlier - in 2013, that threshold was met on August 13th! That result ultimately drove me to raise the threshold slightly to avoid an obvious misclassification.
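As a sketch, that 2013 check looks like this (it relies on the cusum function defined in the code below; 2013 is column 19 of temps_data):
#sketch: the date 2013 reaches a threshold of 30 vs. my chosen threshold of 35 (C = 3)
temps_data[cusum(temps_data[, 19], 30, c = 3, increasing = FALSE), 1]
temps_data[cusum(temps_data[, 19], 35, c = 3, increasing = FALSE), 1]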
The dataframe below shows the year-by-year outputs for my model. Nearly all of the end-of-summer dates fall between late September and mid-October, with 8 September in 2000 and 2011 as the earliest exceptions. The average end-of-summer date is 2 October, which seems about right for Atlanta.
cusum <- function(data, threshold, c = 0, increasing = TRUE){
  ###function returns the index at which a vector reaches a CUSUM threshold
  ###inputs:
  ###data: a vector of numeric values
  ###threshold: a non-negative number; function returns the first index where this is reached
  ###c: constant that modifies the stickiness of the function.
  ###   higher c values will take longer to reach the threshold, all else considered.
  ###increasing: TRUE if the goal is to detect an increase in values, FALSE to detect a decrease
  ###returns a positive integer if threshold is met, NULL if threshold is not met
  #calculate the individual values to be used in the CUSUM algorithm
  if (increasing){
    ind_sums <- data - mean(data) - c
  } else {
    ind_sums <- mean(data) - data - c
  }
  #initializing cusum vector with value for the first term
  cusum <- c(max(0, ind_sums[1]))
  #loops through the remainder of the datapoints to calculate the CUSUM
  for (i in 2:length(ind_sums)){
    #adds iterative CUSUM values, with a min value of 0
    cusum <- c(cusum, max(0, cusum[i-1] + ind_sums[i]))
    #returns index if threshold met
    if (cusum[i] >= threshold){
      return(i)
    }
  }
  #returns NULL if for loop completes with no solution
  return(NULL)
}
#run case for each year in data set
yearly_end_of_summer <- c()
for (j in 2:ncol(temps_data)){
  yearly_end_of_summer <- c(yearly_end_of_summer,
                            cusum(temps_data[,j], 35, c = 3, increasing = FALSE))
}
print(data.frame(c(1996:2015), temps_data[yearly_end_of_summer, 1]))
## c.1996.2015. temps_data.yearly_end_of_summer..1.
## 1 1996 1-Oct
## 2 1997 27-Sep
## 3 1998 9-Oct
## 4 1999 30-Sep
## 5 2000 8-Sep
## 6 2001 29-Sep
## 7 2002 29-Sep
## 8 2003 2-Oct
## 9 2004 12-Oct
## 10 2005 10-Oct
## 11 2006 12-Oct
## 12 2007 12-Oct
## 13 2008 18-Oct
## 14 2009 5-Oct
## 15 2010 1-Oct
## 16 2011 8-Sep
## 17 2012 7-Oct
## 18 2013 18-Oct
## 19 2014 29-Sep
## 20 2015 26-Sep
eos_mean <- temps_data[as.integer(mean(yearly_end_of_summer)), 1]
print(sprintf('Average end of summer date is %s', eos_mean))
## [1] "Average end of summer date is 2-Oct"
Atlanta’s climate does seem to have gotten warmer, most noticeably by the 2007-2012 timeframe. Because we are dealing with climate data, small changes make a big difference - so I set C = 0 and tried a variety of relatively low thresholds. Using the annual averages of the 1 July - 31 October daily highs as the dataset (20 total datapoints from 1996 to 2015), the thresholds were all met in a relatively narrow time window. It took until 2007 (the 12th year of data) to reach a threshold of 2; a threshold of 7 was met 5 years later, in 2012.
My code reuses the cusum function written in part 1.
means_by_year <- colMeans(temps_data[,2:ncol(temps_data)])
for (t in c(2, 3, 5, 7)){
  gw <- cusum(means_by_year, t)
  print(sprintf('Threshold of %s met in year %s.', t, 1995 + gw))
}
## [1] "Threshold of 2 met in year 2007."
## [1] "Threshold of 3 met in year 2010."
## [1] "Threshold of 5 met in year 2011."
## [1] "Threshold of 7 met in year 2012."