For project 3, the class was tasked with answering this question: in the Academy Awards, is “Best Film Editing” the best predictor of “Best Picture”? To answer it, the class assembled a database of Academy Award nominees and winners. A chi-square analysis showed an association between the best film editing and best picture categories. With this dataset available, I wondered what other types of analysis could support the hypothesis: what would the results look like if the problem were recast from one of statistical inference to one of classification? After researching various data mining methods, I decided to apply a k-means clustering analysis to the data.
K-means clustering aims to partition n observations (represented as vectors) into k clusters, with each observation assigned to the cluster whose mean, known as the centroid, is nearest in Euclidean distance. In the simplest sense, k-means finds vectors that are “similar” to each other to varying degrees and groups them into clusters.
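To make the assignment and update steps concrete, here is a minimal base R sketch of one k-means iteration. The observations and starting centroids are made-up toy values, not the awards data, and the variable names are only illustrative.

```r
# Toy data: four observations with two features each (values are made up).
obs <- matrix(c(0, 0,
                1, 0,
                9, 8,
                8, 9),
              ncol = 2, byrow = TRUE)

# Two hypothetical starting centroids, one per cluster.
centroids <- matrix(c(0.5, 0.0,
                      8.5, 8.5),
                    ncol = 2, byrow = TRUE)

# Euclidean distance from every observation (rows) to every centroid (columns).
dists <- sapply(seq_len(nrow(centroids)), function(j) {
  sqrt(rowSums(sweep(obs, 2, centroids[j, ])^2))
})

# Assignment step: each observation joins its nearest centroid.
assignment <- apply(dists, 1, which.min)

# Update step: each centroid moves to the mean of its current members.
new_centroids <- t(sapply(sort(unique(assignment)), function(j) {
  colMeans(obs[assignment == j, , drop = FALSE])
}))

print(assignment)
print(new_centroids)
```

The full algorithm repeats these two steps until the assignments stop changing; kmeans() in the stats package handles that loop internally.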
My data source is a slightly edited copy of project_view_year_numeric.csv from project 3. The file had to be edited to ensure a single name for each category, rather than both “actor in a leading role” and “actor,” for example. This was done with a search-and-replace in a text editor.
To perform the k-means analysis, the observations must be in vector form, with each nominee represented by a row of real numbers (integers, in this case) as the variables. In other words, the data must be converted from “long” to “wide” format using tidyr::spread().
library(tidyr)
library(dplyr)
awards <- read.csv("https://raw.githubusercontent.com/fdsps/IS607/master/project_view_year_numeric_edit.csv")
awards["Year"] <- NULL # we don't need the Year
We convert the Won column from yes/no to numeric 2/1, then delete rows with NA.
awards["Won"] <- ifelse(awards["Won"]== "yes", 2, 1)
awards <- awards[!is.na(awards$Nominee),] # clean rows with Nominee == NA
Duplicate rows must be removed; otherwise the spread() function complains. These occur when there are multiple nominations for one film in the same category, e.g. two nominees for best supporting actress in the same film. There are only a few such rows, so collapsing each set of duplicates to a single row won’t have a meaningful effect on our analysis. dplyr::distinct() makes this easy.
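# keep one row per (Category, Nominee); .keep_all = TRUE retains Won, which current dplyr would otherwise drop
awards <- distinct(awards, Category, Nominee, .keep_all = TRUE)
awide <- spread(awards, Category, Won, fill = 0) # widen the data...
awide[1] <- NULL # drop the Nominee column as not relevant to this analysis.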
colnames(awide)<- c("BACTR","BACRS","BP","CIN","COST","DIR","FEDIT","SEDIT","SMIX","BSACTR","BSACRS")
head(awide)
##   BACTR BACRS BP CIN COST DIR FEDIT SEDIT SMIX BSACTR BSACRS
## 1     1     0  0   0    0   0     0     0    0      0      0
## 2     1     0  0   0    0   0     0     0    0      0      0
## 3     0     0  0   0    1   0     0     0    0      0      0
## 4     0     0  1   0    0   1     0     0    0      0      0
## 5     0     0  0   0    1   0     0     0    0      0      0
## 6     0     0  0   0    0   0     0     0    0      1      0
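The reshaping steps above can be tried on a small table first. The sketch below uses a hypothetical four-row data frame (the film and category names are invented) that contains one duplicated film/category pair, then applies the same distinct() and spread() calls. It assumes a dplyr version that supports the .keep_all argument.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format rows: one row per (film, category) nomination.
toy <- data.frame(
  Nominee  = c("Film A", "Film A", "Film A", "Film B"),
  Category = c("picture", "supporting actress", "supporting actress", "picture"),
  Won      = c(2, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Keep one row per (Category, Nominee) pair; .keep_all retains the Won column.
toy <- distinct(toy, Category, Nominee, .keep_all = TRUE)

# One column per category; categories a film was not nominated in become 0.
toy_wide <- spread(toy, Category, Won, fill = 0)
print(toy_wide)
```

Without the distinct() step, spread() would stop with an error about duplicate identifiers for the two “supporting actress” rows of Film A. The real awide table is built the same way.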
We now have a set of vectors, one per film, that record 1 for a nomination in a category, 2 for a win, and 0 otherwise. We want to see how the k-means algorithm groups observations based on these properties and, specifically, how a nomination for best film editing (FEDIT=1) and a win for best picture (BP=2) affect the classification, if at all.
library(stats)
model <- kmeans(x = subset(awide, select = -BP), centers = 3)
A subset of the data without the BP (best picture) column is passed to the kmeans() function, which is told to create 3 clusters. We exclude BP because we want to see whether something about the observations suggests a win for BP without that column influencing the clustering beforehand. With the cluster model completed, we create a contingency table and compare the cluster assignments against the BP column.
table(model$cluster, awide$BP)
##
##        0    1    2
##   1  360    0    0
##   2   60  244   70
##   3 1179   62    0
The left column identifies the cluster number. The headings are the values found in the BP column. Each entry is the count of observations with a BP value of 0, 1, or 2 placed into one of the three clusters. For example, 244 of the rows with BP=1 (nominated for best picture) were classified into cluster 2; the remaining 62 were placed in cluster 3. Note that all 70 rows with BP=2 (won best picture) are grouped in one cluster, cluster 2.
Let’s look at the table for FEDIT (film editing).
table(model$cluster, awide$FEDIT)
##
##        0    1    2
##   1  360    0    0
##   2  101  193   80
##   3 1116  125    0
All of the rows with FEDIT=2 (won best film editing) were placed in cluster 2, the same cluster as BP=2. This is not surprising. Note that more than half of the FEDIT=1 rows (nominated for best film editing), 193 of 318, are also in that cluster, implying similarity.
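The “more than half” reading can be checked by turning the counts into column-wise proportions. The sketch below re-enters the counts from the FEDIT table above by hand, so it runs without the awards data; the object names are only illustrative.

```r
# Counts copied from the FEDIT table above: rows are clusters, columns are FEDIT values.
fedit_counts <- matrix(c( 360,   0,  0,
                          101, 193, 80,
                         1116, 125,  0),
                       nrow = 3, byrow = TRUE,
                       dimnames = list(cluster = c("1", "2", "3"),
                                       FEDIT   = c("0", "1", "2")))

# Column-wise proportions: for each FEDIT value, the share of rows in each cluster.
fedit_share <- prop.table(fedit_counts, margin = 2)
print(round(fedit_share, 3))
```

With the fitted model in memory, the same proportions should come from prop.table(table(model$cluster, awide$FEDIT), margin = 2).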
We can try this with any of the columns. How does k-means classify the best director category?
table(model$cluster, awide$DIR)
##
##        0    1    2
##   1  360    0    0
##   2   71  223   80
##   3 1146   95    0
Best actor winners, by contrast, are not clustered with best picture winners: none of the BACTR=2 rows fall in cluster 2.
table(model$cluster, awide$BACTR)
##
##        0    1    2
##   1  246   85   29
##   2  373    1    0
##   3  970  219   52
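One caveat when rerunning the analysis: kmeans() starts from randomly chosen centers, so the cluster numbers in the tables above are arbitrary labels and can change between runs. The sketch below shows, on synthetic data rather than the awards table, how fixing the seed makes a fit repeatable and how the nstart argument tries several random starts. The seed values and the toy data are assumptions made purely for illustration.

```r
# Synthetic stand-in data (not the awards table): three loose groups in 2-D.
set.seed(607)  # arbitrary seed, used only to generate the toy data
toy <- rbind(
  matrix(rnorm(40, mean = 0),  ncol = 2),
  matrix(rnorm(40, mean = 5),  ncol = 2),
  matrix(rnorm(40, mean = 10), ncol = 2)
)

# Fixing the seed right before each call makes the two fits identical.
# nstart = 25 runs 25 random starts and keeps the best one.
set.seed(1)
fit_a <- kmeans(toy, centers = 3, nstart = 25)
set.seed(1)
fit_b <- kmeans(toy, centers = 3, nstart = 25)

same_labels <- identical(fit_a$cluster, fit_b$cluster)
print(same_labels)
print(table(fit_a$cluster))
```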
K-means clustering is widely used in many applications: signal processing, color quantization, market segmentation, computer vision, geostatistics, astronomy and agriculture, to name a few.