DATA 607 Project 4: Text Mining

Introduction

The original assignment called for modeling and categorization of a spam/ham email set. I wanted to practice data scraping some more, so I have decided to pick a different data set. I have settled on reviewing movie plots and categorizing them by genre.

Required Libraries

# For data scraping
library(XML)
library(RCurl)
library(urltools)
library(tidyjson)

# For data tidying
library(dplyr)
library(tidyr)
library(stringr)

# For data mining
library(tidytext)
library(RTextTools)

# For data graphing
library(ggplot2)
library(gridExtra)

Data Scraping: Part I

Collecting List of Films

It would not be practical for this assignment to try to collect a very large data set. I needed a movie list that I can use to base my collection on and American Film Institute’s 100 Years… series popped into my mind. AFI released lists of 100 notable movies in various categories from 1998 to 2008. I have used a Wikipedia article about the series as my starting point.

# Get the main URL and read in the HTML text
urlRoot <- "https://en.wikipedia.org"
urlMain <- "/wiki/AFI_100_Years..._series"
links <- getURL(str_c(urlRoot, urlMain))
links <- htmlParse(links)

# Read the links to all lists
links <- xpathSApply(links, "//div[@id='mw-content-text']//table[@class='infobox']//a", xmlGetAttr, "href")

# Several of collected links were extraneous
# Remove them by finding only links without the term 'Template'
links <- links[str_detect(links, "^((?!Template).)*$")]

Collected links are in the table below.

/wiki/AFI%27s_100_Years…100_Movies

/wiki/AFI%27s_100_Years…100_Stars

/wiki/AFI%27s_100_Years…100_Laughs

/wiki/AFI%27s_100_Years…100_Thrills

/wiki/AFI%27s_100_Years…100_Passions

/wiki/AFI%27s_100_Years…100_Heroes_%26_Villains

/wiki/AFI%27s_100_Years…100_Songs

/wiki/AFI%27s_100_Years…100_Movie_Quotes

/wiki/AFI%27s_100_Years_of_Film_Scores

/wiki/AFI%27s_100_Years…100_Cheers

/wiki/AFI%27s_Greatest_Movie_Musicals

/wiki/AFI%27s_100_Years…100_Movies_(10th_Anniversary_Edition)

/wiki/AFI%27s_10_Top_10

Now I will loop through all lists and collect individual film tables form each.

# Create empty data frame to store list of films
films <- data.frame(film=character(), year=integer())

# Loop through all links with the exception of last one
for(i in 1:(length(links)-1)) {
  # Get URL and read HTML table
  url <- getURL(str_c(urlRoot, links[i]))
  tmp <- readHTMLTable(url, which = 2)
  
  # If table contains two columns - Film and Year - then proceed 
  if (any(str_detect(str_to_lower(colnames(tmp)), "film")) && 
      any(str_detect(str_to_lower(colnames(tmp)), "year"))) {
    # Isolate columns containing film name and release year
    tmp <- tmp[, c(which(str_detect(str_to_lower(colnames(tmp)), "film")), 
                   which(str_detect(str_to_lower(colnames(tmp)), "year")))]
    colnames(tmp) <- c("film", "year")
    
    # Store discovered films in main data frame
    films <- rbind(films, tmp)
  }
}

# The last link is special as it contains 10 tables of 10 films 
# rather than 1 table of 100 films
# Loop through all 10 tables
# Same logic as in the loop above
for(i in 2:11) {
  url <- getURL(str_c(urlRoot, links[length(links)]))
  tmp <- readHTMLTable(url, which = i)
  if (any(str_detect(str_to_lower(colnames(tmp)), "film")) && 
      any(str_detect(str_to_lower(colnames(tmp)), "year"))) {
    tmp <- tmp[, c(which(str_detect(str_to_lower(colnames(tmp)), "film")), 
                   which(str_detect(str_to_lower(colnames(tmp)), "year")))]
    colnames(tmp) <- c("film", "year")
    
    films <- rbind(films, tmp)
  }
}

# Get rid of duplicate films
films <- films %>% 
  group_by(film, year) %>% 
  summarize() 

# Convert result into data frame
films <- as.data.frame(films)

At this point I have noticed that there are definitely some film titles that are not formatted properly; however, I only need a sample of some film plots to test my analysis. I am not looking for perfect data collection.

Collecting Film Plots

To collect film plots I have used OMDb API developed by Brian Fritz. It retrieves film plot, genre, ratings, and some other variables from the IMDb repository. Several search parameters are accepted, but I am most interested in searching by film title and release year.

# Create empty data frame to store film plots
film_plots <- data.frame(film=character(),      # Store film title
                         year=integer(),        # Store release year
                         genre=character(),     # Store film genre
                         imdbID=character(),    # Store IMDb ID
                         plot=character())      # Store full film plot

# Loop through all films
for(i in 1:nrow(films)) {
  # Get film title and release year
  title <- as.character(films[i,1])
  year <- as.character(films[i,2])
  
  # Format film title with URL encoding
  title <- url_encode(title)
  
  # Send API request
  url <- str_c("http://www.omdbapi.com/?t=", title, "&y=", year, "&type=movie&plot=full")
  request <- getURL(url)
  
  # Check that result is valid JSON
  # This check was added because some API requests were timing out with HTML error
  if(jsonlite::validate(request)) {
    # Check API response to confirm that request resulted in a hit
    response <- request %>% spread_values(response = jstring("Response"))
    if(response$response=="True") {
      # Collect data
      request <- request %>% spread_values(title = jstring("Title"), 
                                           year = jstring("Year"), 
                                           genre = jstring("Genre"), 
                                           imdbID = jstring("imdbID"), 
                                           plot = jstring("Plot")) %>% 
        select(title, year, genre, imdbID, plot)
      
      # Add data to main data frame
      film_plots <- rbind(film_plots, request)
    }
  }
  # Pause, so not to overwhelm OMDb site 
  Sys.sleep(1)
}

# Remove duplicate films
film_plots <- film_plots %>% 
  group_by(title, year, genre, imdbID, plot) %>% 
  summarize()

Since it took some time to collect film plots for hundreds of films, I have saved the data frame to a CSV file. This also helps preserve data to make analysis reproducable in the future.

write.csv(film_plots, file = "film_plots.csv", row.names = FALSE)

film_plots <- read.csv(file = "https://raw.githubusercontent.com/ilyakats/CUNY-DATA607/master/Project%204%20Data/film_plots.csv")

A sample of collected film plots is in the table below.

title	year	genre	imdbID	plot
12 Angry Men	1957	Crime, Drama	tt0050083	The defense and the prosecution have rested and the jury is filing into the jury room to decide if a young man is guilty or innocent of murdering his father. What begins as an open-and-shut case of murder soon becomes a detective story that presents a succession of clues creating doubt, and a mini-drama of each of the jurors’ prejudices and preconceptions about the trial, the accused, and each other. Based on the play, all of the action takes place on the stage of the jury room.
2001: A Space Odyssey	1968	Adventure, Sci-Fi	tt0062622	“2001” is a story of evolution. Sometime in the distant past, someone or something nudged evolution by placing a monolith on Earth (presumably elsewhere throughout the universe as well). Evolution then enabled humankind to reach the moon’s surface, where yet another monolith is found, one that signals the monolith placers that humankind has evolved that far. Now a race begins between computers (HAL) and human (Bowman) to reach the monolith placers. The winner will achieve the next step in evolution, whatever that may be.
42nd Street	1933	Comedy, Musical, Romance	tt0024034	Renowned Broadway producer/director Julian Marsh is hired to put together a new musical revue. It’s being financed by Abner Dillon to provide a starring vehicle for his girlfriend, songstress Dorothy Brock. Marsh, who is quite ill, is a difficult task master working long hours and continually pushing the cast to do better. When Brock breaks her ankle one of the chorus girls, Peggy Sawyer, gets her big chance to be the star. She also finds romance along the way.
8 Mile	2002	Drama, Music	tt0298203	This is the inspiring captivating story of the legendary rapper Eminem. The troubled young aspiring rapper from a ghetto in Michigan must exert his last chances to become successful while dealing with his life in ruins. All is seemingly lost. He is now single, has only a few friends, an insane/alcoholic mother, and is dealt with poverty and living in a violent city on 8 mile. His only way out of the ghetto and torturous life he’s living in is with his talent in rapping. Will B-Rabbit prevail and seize the shot he’s given or will he let it slip?
A Clockwork Orange	1971	Crime, Drama, Sci-Fi	tt0066921	In future Britain, Alex DeLarge, a charismatic and psycopath delinquent, who likes to practice crimes and ultra-violence with his gang, is jailed and volunteers for an experimental aversion therapy developed by the government in an effort to solve society’s crime problem - but not all goes according to plan.
A Cry in the Dark	1988	Drama	tt0094924	Based on the true story of Lindy Chamberlain. During a camping trip to Ayers Rock in outback Australia, she claimed that she witnessed a dingo stealing her baby daughter, Azaria, from the family tent. Azaria’s body was never found. Police noted some apparent inconsistencies in her story, and she was charged with murder. The case attracted a lot of attention, turning an investigation into a media circus, with the public divided in their opinions.

Data Analysis

As the table above shows IMDb stores multiple values for film genre. For this assignment I decided to concentrate only on Drama and Comedy. I have selected only films that include one of the two classifiers, but not the other (Comedy, but not Drama, and vice versa). Everything else has been discarded.

# Add column 'class' to data frame to store primary genre/classifier
film_plots["class"] <- NA

# Assign classifier values
film_plots[str_detect(film_plots$genre, "Drama")!=TRUE & 
             str_detect(film_plots$genre, "Comedy")==TRUE, 6] <- "Comedy"
film_plots[str_detect(film_plots$genre, "Drama")==TRUE & 
             str_detect(film_plots$genre, "Comedy")!=TRUE, 6] <- "Drama"

# Remove entries without Class
film_plots <- film_plots %>% 
  ungroup() %>% 
  filter(!is.na(class)) %>% 
  select(title, year, class, imdbID, plot)

table(film_plots$class)

## 
## Comedy  Drama 
##     55    190

There are not a lot of values for each class, but there are some. I have had some success with tidyverse in prior assignments. I have decided to try tidytext for text mining and analysis.

# Load stop words
data("stop_words")

film_plots$plot <- as.character(film_plots$plot)

# Get tidy data frame of all words and remove stop words
tidy_plots <- film_plots %>% 
  unnest_tokens(word, plot) %>% 
  anti_join(stop_words, by = "word")

# Get top plot words for comedies
plot_comedy <- tidy_plots %>% 
  group_by(class, word) %>% 
  summarize(count = n()) %>% 
  filter(class == "Comedy") %>% 
  arrange(desc(count)) %>% 
  top_n(10, count) %>% 
  mutate(word = reorder(word, count)) %>% 
  ggplot(aes(word, count)) +
  geom_bar(stat = "identity", fill = "red") + coord_flip() + 
  labs(x = "Word", y = "Frequency", title = "Top Comedy Words")

# Get top plot words for dramas
plot_drama <- tidy_plots %>% 
  group_by(class, word) %>% 
  summarize(count = n()) %>% 
  filter(class == "Drama") %>% 
  arrange(desc(count)) %>% 
  top_n(10, count) %>% 
  mutate(word = reorder(word, count)) %>% 
  ggplot(aes(word, count, fill = class)) +
  geom_bar(stat = "identity", fill = "blue") + coord_flip() + 
  labs(x = "Word", y = "Frequency", title = "Top Drama Words")

grid.arrange(plot_comedy, plot_drama, ncol=2)

Top Drama words seem to be a good fit, but looking at top Comedy words, some appear to be more logical than others. And there is some overlap between top Drama and Comedy words. I decided to look at top words using term frequency-inverse document frequency method.

# Count number of times each word is mentioned
plot_words <- tidy_plots %>% 
  count(class, word, sort = TRUE) %>% 
  ungroup()

# Count number of words per class
total_words <- plot_words %>% 
  group_by(class) %>% 
  summarize(total = sum(n))

# Combine two counts
plot_words <- left_join(plot_words, total_words, by = "class")

# Generate tf-idf values
plot_words <- plot_words %>% 
  bind_tf_idf(word, class, n)

head(plot_words)

## # A tibble: 6 × 7
##   class   word     n total          tf   idf tf_idf
##   <chr>  <chr> <int> <int>       <dbl> <dbl>  <dbl>
## 1 Drama   life    69  9364 0.007368646     0      0
## 2 Drama   love    45  9364 0.004805639     0      0
## 3 Drama family    37  9364 0.003951303     0      0
## 4 Drama    war    36  9364 0.003844511     0      0
## 5 Drama   time    35  9364 0.003737719     0      0
## 6 Drama   wife    34  9364 0.003630927     0      0

plot_comedy <- plot_words %>% 
  select(-total) %>% 
  arrange(desc(tf_idf)) %>% 
  filter(class == "Comedy") %>% 
  top_n(10, tf_idf) %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n)) +
  geom_bar(stat = "identity", fill = "red") + coord_flip() + 
  labs(x = "Word", y = "Frequency", title = "Top Comedy Words (tf-idf)")

plot_drama <- plot_words %>% 
  select(-total) %>% 
  arrange(desc(tf_idf)) %>% 
  filter(class == "Drama") %>% 
  top_n(10, tf_idf) %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n)) +
  geom_bar(stat = "identity", fill = "blue") + coord_flip() + 
  labs(x = "Word", y = "Frequency", title = "Top Drama Words (tf-idf)")

grid.arrange(plot_comedy, plot_drama, ncol=2)

Using tf-idf the words are a lot more specific. However, there are clearly some words that only appear on the list because of one or two movies. For example, shrek or woody (which I thought was a reference to Woody Allen since it is in a Comedy class of critically acclaimed films, but it actually refers to Woody from Toy Story). I have decided to collect more film plots for a bigger data set.

Data Scraping: Part II

As I mentioned above I was suspecting that my data set is not large enough for text mining.

Collecting and Tidying Additional Data

To expand the film list, I went with the National Film Registry of the Library of Congress. Methodology was very similar to Part I.

# Get URL and read HTML table
url <- getURL("https://www.loc.gov/programs/national-film-preservation-board/film-registry/complete-national-film-registry-listing/")
tmp <- readHTMLTable(url, which = 1)

# Select film title and release year
tmp <- tmp[, 1:2]
colnames(tmp) <- c("film", "year")

# Create empty data frame to store film plots
film_plots2 <- data.frame(film=character(), 
                          year=integer(), 
                          genre=character(), 
                          imdbID=character(), 
                          plot=character())

# Loop through all films
# Send request to OMDb API
# Process result if valid
for(i in 1:nrow(tmp)) {
  title <- as.character(tmp[i,1])
  year <- as.character(tmp[i,2])
  title <- url_encode(title)
  
  url <- str_c("http://www.omdbapi.com/?t=", title, "&y=", year, "&type=movie&plot=full")
  request <- getURL(url)
  if(jsonlite::validate(request)) {
    response <- request %>% spread_values(response = jstring("Response"))
    if(response$response=="True") {
      request <- request %>% spread_values(title = jstring("Title"), 
                                           year = jstring("Year"), 
                                           genre = jstring("Genre"), 
                                           imdbID = jstring("imdbID"), 
                                           plot = jstring("Plot")) %>% 
        select(-document.id)
      film_plots2 <- rbind(film_plots2, request)
    }
  }
  Sys.sleep(1)
}

# Remove duplicate entries
film_plots2 <- film_plots2 %>% 
  group_by(title, year, genre, imdbID, plot) %>% 
  summarize() 

# Add primary genre/classifier
film_plots2["class"] <- NA
film_plots2[str_detect(film_plots2$genre, "Drama")!=TRUE & 
              str_detect(film_plots2$genre, "Comedy")==TRUE, 6] <- "Comedy"
film_plots2[str_detect(film_plots2$genre, "Drama")==TRUE & 
              str_detect(film_plots2$genre, "Comedy")!=TRUE, 6] <- "Drama"

# Remove entries with no plot
film_plots2 <- film_plots2 %>% 
  ungroup() %>% 
  filter(!is.na(class)) %>% 
  select(title, year, class, imdbID, plot)

# Merge original and new data frames
film_plots <- rbind(film_plots, film_plots2)

# Remove diplicate entries
film_plots <- film_plots %>% 
  group_by(title, year, class, imdbID, plot) %>% 
  summarize() 

# Remove entries with no plot
film_plots <- film_plots %>% 
  filter(plot != "N/A")

# Similarly to Part I save to file for preservation and efficiency
write.csv(film_plots, file = "film_plots2.csv", row.names = FALSE)

film_plots <- read.csv(file = "https://raw.githubusercontent.com/ilyakats/CUNY-DATA607/master/Project%204%20Data/film_plots2.csv")

table(film_plots$class)

## 
## Comedy  Drama 
##    125    334

Data Analysis

Now that I have more observations for each class, run the same tf-idf analysis as in Part I.

data("stop_words")

film_plots$plot <- as.character(film_plots$plot)

tidy_plots <- film_plots %>% 
  unnest_tokens(word, plot) %>% 
  anti_join(stop_words, by = "word")

plot_words <- tidy_plots %>% 
  count(class, word, sort = TRUE) %>% 
  ungroup()

total_words <- plot_words %>% 
  group_by(class) %>% 
  summarize(total = sum(n))

plot_words <- left_join(plot_words, total_words, by = "class")

plot_words <- plot_words %>% 
  bind_tf_idf(word, class, n)

plot_comedy <- plot_words %>% 
  select(-total) %>% 
  arrange(desc(tf_idf)) %>% 
  filter(class == "Comedy") %>% 
  top_n(10, tf_idf) %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n)) +
  geom_bar(stat = "identity", fill = "red") + coord_flip() + 
  labs(x = "Word", y = "Frequency", title = "Top Comedy Words (tf-idf)")

plot_drama <- plot_words %>% 
  select(-total) %>% 
  arrange(desc(tf_idf)) %>% 
  filter(class == "Drama") %>% 
  top_n(10, tf_idf) %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n)) +
  geom_bar(stat = "identity", fill = "blue") + coord_flip() + 
  labs(x = "Word", y = "Frequency", title = "Top Drama Words (tf-idf)")

grid.arrange(plot_comedy, plot_drama, ncol=2)

There is still some very specific words, especially for comedy, but I find the list to be a lot more interesting. I wanted to look at bigrams. The workflow is similar to analyzing individual words.

plot_bigrams <- film_plots %>%
  unnest_tokens(bigram, plot, token = "ngrams", n = 2)

plot_bigrams <- plot_bigrams %>% 
  count(class, bigram, sort = TRUE) %>% 
  ungroup()

total_bigrams <- plot_bigrams %>% 
  group_by(class) %>% 
  summarize(total = sum(n))

plot_bigrams <- 
  left_join(plot_bigrams, total_bigrams, by = "class")

plot_bigrams <- plot_bigrams %>% 
  bind_tf_idf(bigram, class, n)

plot_comedy <- plot_bigrams %>% 
  select(-total) %>% 
  arrange(desc(tf_idf)) %>% 
  filter(class == "Comedy") %>% 
  top_n(10, tf-idf) %>% 
  mutate(bigram = reorder(bigram, n)) %>% 
  ggplot(aes(bigram, n)) +
  geom_bar(stat = "identity", fill = "red") + coord_flip() + 
  labs(x = "Word", y = "Frequency", title = "Top Comedy Bigrams (tf-idf)")

plot_drama <- plot_bigrams %>% 
  select(-total) %>% 
  arrange(desc(tf_idf)) %>% 
  filter(class == "Drama") %>% 
  top_n(10, tf_idf) %>% 
  mutate(bigram = reorder(bigram, n)) %>% 
  ggplot(aes(bigram, n)) +
  geom_bar(stat = "identity", fill = "blue") + coord_flip() + 
  labs(x = "Word", y = "Frequency", title = "Top Drama Bigrams (tf-idf)")

grid.arrange(plot_comedy, plot_drama, ncol=2)

Here again the Drama bigrams look fairly interesting, but the Comedy bigrams are full of stop words. I should separate the bigrams and get rid of all stop words for proper analysis. For now, I am proceeding to trying to model classification.

Data Scraping: Part III

I have created the document-term matrix and ran the data through SVM, Tree and Max Entropy models as suggested in the supervised learning section of Chapter 10 of the Automated Data Collection with R textbook. The Max Entropy model was essentially classifying as good as a coin flip, very close to 50-50. But SVM and Tree models seemed to show better results. However, I have discovered that they were classifying all the test set as Drama. Since the number of comedies was much smaller, the models were fairly good. This was disappointing. I figured I needed to collect more comedy plots.

Since this was my third collection attempt, I decided to simply get a list of top 100 comedy films from Rotten Tomatoes, manually format it and save it as a CSV file.

Again, methodology to collect film plots is the same as in Parts I and II.

# Get a list of additional comedies
tmp <- read.csv(file = "https://raw.githubusercontent.com/ilyakats/CUNY-DATA607/master/Project%204%20Data/films3.csv")

film_plots3 <- data.frame(film=character(), 
                          year=integer(), 
                          genre=character(), 
                          imdbID=character(), 
                          plot=character())

for(i in 1:nrow(tmp)) {
  title <- as.character(tmp[i,1])
  title <- url_encode(title)
  year <- as.character(tmp[i,2])
  
  url <- str_c("http://www.omdbapi.com/?t=", title, "&y=", year, "&type=movie&plot=full")
  request <- getURL(url)
  if(jsonlite::validate(request)) {
    response <- request %>% spread_values(response = jstring("Response"))
    if(response$response=="True") {
      request <- request %>% spread_values(title = jstring("Title"), 
                                           year = jstring("Year"), 
                                           genre = jstring("Genre"), 
                                           imdbID = jstring("imdbID"), 
                                           plot = jstring("Plot")) %>% 
        select(-document.id)
      film_plots3 <- rbind(film_plots3, request)
    }
  }
  Sys.sleep(1)
}

film_plots3 <- film_plots3 %>% 
  group_by(title, year, genre, imdbID, plot) %>% 
  summarize() 

film_plots3["class"] <- NA
film_plots3[str_detect(film_plots3$genre, "Drama")!=TRUE & 
              str_detect(film_plots3$genre, "Comedy")==TRUE, 6] <- "Comedy"
film_plots3[str_detect(film_plots3$genre, "Drama")==TRUE & 
              str_detect(film_plots3$genre, "Comedy")!=TRUE, 6] <- "Drama"

table(film_plots3$class)

film_plots3 <- film_plots3 %>% 
  ungroup() %>% 
  filter(!is.na(class)) %>% 
  select(title, year, class, imdbID, plot)

film_plots <- rbind(film_plots, as.data.frame(film_plots3))

film_plots <- film_plots %>% 
  group_by(title, year, class, imdbID, plot) %>% 
  summarize() 

film_plots <- film_plots %>% 
  filter(plot != "N/A")

write.csv(film_plots, file = "film_plots3.csv", row.names = FALSE)

Interestingly, quite a few films Rotten Tomatoes considers comedic are actually classified as Drama according to the IMDb information.

film_plots <- read.csv(file = "https://raw.githubusercontent.com/ilyakats/CUNY-DATA607/master/Project%204%20Data/film_plots3.csv")

table(film_plots$class)

## 
## Comedy  Drama 
##    144    331

I did not get a significant number of additional comedies, but there were a few additions. I decided to randomize the film list, so that the training set and test set in the corpus are spread out more evenly.

film_plots <- film_plots[sample(nrow(film_plots)),]

Data Analysis: Classification

# Build document-term matrix
matrix <- create_matrix(film_plots$plot, language="english",
                        removeNumbers=TRUE, stemWords=TRUE, weighting=tm::weightTf)

# Create a container with existing labels
# The first 375 observations will be used to traing the models
# Remaining observations will be used to test the models
container <- create_container(matrix, 
                              film_plots$class, 
                              trainSize = 1:375, 
                              testSize = 376:length(film_plots$class), 
                              virgin = FALSE)

# Train models
svm_model <- train_model(container, "SVM")
tree_model <- train_model(container, "TREE")
maxent_model <- train_model(container, "MAXENT")

# Classify the test set
svm_out <- classify_model(container, svm_model)
tree_out <- classify_model(container, tree_model)
maxent_out <- classify_model(container, maxent_model)

# Get the actual and predicted labels
labels_out <- data.frame(plot = film_plots$plot[376:length(film_plots$plot)],
                         correct_label = film_plots$class[376:length(film_plots$class)],
                         svm = as.character(svm_out[,1]),
                         tree = as.character(tree_out[,1]),
                         maxent = as.character(maxent_out[,1]),
                         stringsAsFactors = FALSE)

Actual classification:

table(labels_out$correct_label)

## 
## Comedy  Drama 
##     32     68

SVM model performance:

table(labels_out$svm)

## 
## Comedy  Drama 
##     10     90

prop.table(table(labels_out$correct_label == labels_out$svm))

## 
## FALSE  TRUE 
##  0.24  0.76

Random Forrest model performance:

table(labels_out$tree)

## 
## Comedy  Drama 
##      7     93

prop.table(table(labels_out$correct_label == labels_out$tree))

## 
## FALSE  TRUE 
##  0.31  0.69

Maximum Entropy model performance:

table(labels_out$maxent)

## 
## Comedy  Drama 
##     24     76

prop.table(table(labels_out$correct_label == labels_out$maxent))

## 
## FALSE  TRUE 
##  0.26  0.74

Maximum Entropy model seems to perform well with this data. I wanted to get general analytics as calculated by the RTextTools package. Unfortunately, the create_analytics method only understands numeric values. 1 represents Comedy and 2 represents Drama.

# Convert class labels to numeric values
classNo <- as.numeric(as.factor(film_plots$class))

# Rerun Maximum Entropy model
container <- create_container(matrix, classNo, 
                              trainSize = 1:375, testSize = 376:length(film_plots$class), 
                              virgin = FALSE)
maxent_model <- train_model(container, "MAXENT")
maxent_out <- classify_model(container, maxent_model)
analytics <- create_analytics(container, maxent_out, b=1)

# Output analytics summary
analytics@label_summary

##   NUM_MANUALLY_CODED NUM_CONSENSUS_CODED NUM_PROBABILITY_CODED
## 1                 32                  24                    24
## 2                 68                  76                    76
##   PCT_CONSENSUS_CODED PCT_PROBABILITY_CODED PCT_CORRECTLY_CODED_CONSENSUS
## 1             75.0000               75.0000                      46.87500
## 2            111.7647              111.7647                      86.76471
##   PCT_CORRECTLY_CODED_PROBABILITY
## 1                        46.87500
## 2                        86.76471

Consider mismatched classification.

bad_predictions <- labels_out[labels_out$correct_label != labels_out$maxent, ]

	plot	correct_label	svm	tree	maxent
1	Young Leon Kanter dreams of being a great violinist. His parents scrape up the money for a violin and for lessons, and Leon rewards them by becoming a great player. But as an adult, Leon finds that people want more from him than just music.	Drama	Drama	Drama	Comedy
2	An investigative reporter sees an opportunity for the story of a lifetime when an accused murderer escapes hanging.	Comedy	Drama	Drama	Drama
9	A major heist goes off as planned, until bad luck and double crosses cause everything to unravel.	Drama	Drama	Drama	Comedy
11	Inventor Gepetto creates a wooden marionette called Pinocchio. His wish that Pinocchio be a real boy is unexpectedly granted by a fairy. The fairy assigns Jiminy Cricket to act as Pinocchio’s “conscience” and keep him out of trouble. Jiminy is not too successful in this endeavor and most of the film is spent with Pinocchio deep in trouble.	Comedy	Drama	Drama	Drama
13	Michael, Steven and Nick are young factory workers from Pennsylvania who enlist into the Army to fight in Vietnam. Before they go, Steven marries the pregnant Angela, and their wedding party also serves as the men’s farewell party. After some time and many horrors, the three friends fall in the hands of the Vietcong and are brought to a prison camp in which they are forced to play Russian roulette against each other. Michael makes it possible for them to escape, but they soon get separated again.	Drama	Drama	Drama	Comedy
15	Mrs. Sharpe keeps her eye on Bunny. His friend Bigelow invites him to join him in a session with the cards and chips, and tells him that he will fix it up with Mrs. Sharpe. He sends Bunny a notice telling him he has been elected a member of the Sons of the Morning, who meet every Wednesday evening. Sharpe tells his wife that this is a distinction he did not expect and so impresses her with the honor that has been conferred upon him, that she consents to his going to the meetings. Unfortunately, Bunny talks in his sleep, and his better-half overhears him “passing and drawing,” and going “some better” in his dreams. She becomes suspicious, appeals to her cousin, Freddie Dewdrop, who promises to find out where her husband spends his evenings. Disguising himself by changing his trousers and turning up the front of his hat, he follows Bunny, and learns the truth. Freddy and Mrs. Sharpe work up a plot to scare Bunny. They divulge the scheme to all the wives. Freddy consults the members of the Mental Improvement Society, of which he is one, and they agree to disguise themselves as policemen and make a raid on the poker fiends. Before doing so, however, he invites the wives to meet at the point of attack at a certain hour, without telling them just what is going to happen. Freddie, as the Captain, leads his officers and sweeps down upon the poker players, catching them red-handed and placing them under arrest. At this moment the wives enter the room and there is no escape for their husbands, and in fear of disgrace, they agree to give up poker and spend their evenings at home. With uplifted hands they swear it and they are forgiven by their wives and freed from arrest, at the command of the gallant Captain, Freddie Dewdrop.	Comedy	Comedy	Drama	Drama
18	Arthur is a happy drunk with no pretensions at any ambition. He is also the heir to a vast fortune which he is told will only be his if he marries Susan. He does not love Susan, but she will make something of him the family expects. Arthur proposes but then meets a girl with no money who he could easily fall in love with.	Comedy	Drama	Drama	Drama
19	Cat(herine) Ballou’s family farm is being threatened by the Rail Road. She sends for Kid Shelleen, finding him to be the drunkest gunfighter in the west. When her father is killed by the rail road magnate’s gunman, she vowes to fight on. Shelleen manages to ride sideways in several scenes, while minstrels sing the ballad of Cat Ballou in between scenes.	Comedy	Drama	Drama	Drama
27	At the bottom of the depression, Tom’s mother has been out of work for months when Ed’s father loses his job. Not to burden their parents, the two high school sophomore’s decide to hop the freights and look for work. Wherever they go, there are many other kids just like them, so Tom, Ed and now Sally stick together. They camp in places like ‘Sewer City’ as long as they can until the local authorities run them off. They travel all over the mid west and when they get to New York, Ed thinks that they may finally find work.	Drama	Comedy	Drama	Comedy
32	Berlin’s plushest, most expensive hotel is the setting where in the words of Dr. Otternschlag “People come, people go. Nothing ever happens.”. The doctor is usually drunk so he missed the fact that Baron von Geigern is broke and trying to steal eccentric dancer Grusinskaya’s pearls. He ends up stealing her heart instead. Powerful German businessman Preysing brow beats Kringelein, one of his company’s lowly bookkeepers but it is the terminally ill Kringelein who holds all the cards in the end. Meanwhile, the Baron also steals the heart of Preysing’s mistress, Flaemmchen, but she doesn’t end up with either one of them in the end…	Drama	Drama	Drama	Comedy
33	Poor Ella Cinders is much abused by her evil step-mother and step-sisters. When she wins a local beauty contest she jumps at the chance to get out of her dead-end life and go to Hollywood, where she is promised a job in the movies. When she arrives in Hollywood, she discoves that the contest was a scam and the job non-existant. But through pluck, luck, and talent, she makes it in the movies anyway, and finds true love.	Comedy	Drama	Drama	Drama
38	New Yorkers Paul Bratter and Corie Bratter née Banks have just gotten married. He is a stuffed shirt just starting his career as a lawyer. She is an independently minded free spirit who prides herself on doing the illogical purely out of a sense of adventure, such acts as walking through Washington Square Park barefoot when it’s 17°F outside. Their six day honeymoon at the Plaza Hotel shows that they can get to know each other easily in the biblical sense. But they will see if they can get to know each other in their real life when they move into their first apartment, a cozy (in other words, small), slightly broken down top floor unit in a five story walk-up. While Corie joyfully bounds up and down the stairs, Paul, always winded after the fact, hates the fact of having to walk up the six flights of stairs, if one includes the stairs that comprise the outside front stoop. Beyond the issues with the apartment itself, Paul and Corie will have to deal with an odd assortment of neighbors, most specifically eccentric senior Victor Velasco, who lives in the unusual attic and who would like to consider himself a dirty old man. Corie, worried about her single straight-laced mother Ethel Banks, wants to set her up with Victor. Without Corie or Paul truly realizing it, Ethel and Victor as a twosome is as illogical as Corie and Paul. What happens between Ethel and Victor may be a predictor if Corie and Paul’s marriage can make it in the long run.	Comedy	Comedy	Drama	Drama
39	The small state of Freedonia is in a financial mess, borrowing a huge sum of cash from wealthy widow Mrs. Teasdale. She insists on replacing the current president with crazy Rufus T. Firefly and mayhem erupts. To make matters worse, the neighboring state sends inept spies Chicolini and Pinky to obtain top secret information, creating even more chaos!	Comedy	Drama	Drama	Drama
40	Chinese stowaway Mei Li (Miyoshi Umeki) arrives in San Francisco with her father to meet her fiancé, wealthy nightclub owner Sammy Fong (Jack Soo), in an arranged marriage, but the groom has his eye on his star singer Linda Low (Nancy Kwan). This film version of the Rodgers and Hammerstein Broadway musical is filled with memorable song-and-dance numbers showcasing the contrast between Mei Li’s traditional family and her growing fascination with American culture.	Comedy	Drama	Drama	Drama
42	Having been discharged from the Marines for a hayfever condition before ever seeing action, Woodrow Lafayette Pershing Truesmith (Eddie Bracken) delays the return to his hometown, feeling that he is a failure. While in a moment of melancholy, he meets up with a group of Marines who befriend him and encourage him to return home to his mother by fabricating a story that he was wounded in battle with honorable discharge. They make him wear a uniform complete with medals and is pushed by his new friends into accepting a Hero’s welcome when he gets home where he is to be immortalized by a statue that he doesn’t want, has songs written about his heroic battle stories, and ends up unwillingly running for mayor. Despite his best efforts to explain the truth, no one will listen.	Comedy	Drama	Comedy	Drama
54	Set in winter in the Old West. Charismatic but dumb John McCabe arrives in a young Pacific Northwest town to set up a whorehouse/tavern. The shrewd Mrs. Miller, a professional madam, arrives soon after construction begins. She offers to use her experience to help McCabe run his business, while sharing in the profits. The whorehouse thrives and McCabe and Mrs. Miller draw closer, despite their conflicting intelligences and philosophies. Soon, however, the mining deposits in the town attract the attention of a major corporation, which wants to buy out McCabe along with the rest. He refuses, and his decision has major repercussions for him, Mrs. Miller, and the town.	Drama	Drama	Drama	Comedy
57	In occupied Poland during WWII, a troupe of ham stage actors (led by Joseph Tura and his wife Maria) match wits with the Nazis. A spy has information which would be very damaging to the Polish resistance and they must prevent it’s being delivered to the Germans.	Comedy	Drama	Drama	Drama
58	At an all-black army camp, civilian parachute maker and “hot bundle” Carmen Jones is desired by many of the men. Naturally, she wants Joe, who’s engaged to sweet Cindy Lou and about to go into pilot training for the Korean War. Going after him, she succeeds only in getting him into the stockade. While she awaits his release, trouble approaches for both of them. Songs from the Bizet opera with modernized lyrics.	Drama	Drama	Drama	Comedy
63	Three friends struggle to find work in Paris. Things become more complicated when two of them fall in love with the same woman.	Drama	Drama	Drama	Comedy
72	Joe Bradley is a reporter for the American News Service in Rome, a job he doesn’t much like as he would rather work for what he considers a real news agency back in the States. He is on the verge of getting fired when he, sleeping in and getting caught in a lie by his boss Hennessy, misses an interview with HRH Princess Ann, who is on a goodwill tour of Europe, Rome only her latest stop. However, he thinks he may have stumbled upon a huge scoop. Princess Ann has officially called off all her Rome engagements due to illness. In reality, he recognizes the photograph of her as being the young well but simply dressed drunk woman he rescued off the street last night (as he didn’t want to turn her into the police for being a vagrant), and who is still in his small studio apartment sleeping off her hangover. What Joe doesn’t know is that she is really sleeping off the effects of a sedative given to her by her doctor to calm her down after an anxiety attack, that anxiety because she hates her regimented life where she has no freedom and must always do and say the politically correct things, not what is truly on her mind or in her heart. In wanting just a little freedom, she seized upon a chance opportunity to escape from the royal palace where she was staying, albeit with no money in her pockets. Joe believes he can get an exclusive interview with her without she even knowing that he’s a reporter or that he’s interviewing her. As Joe accompanies “Anya Smith” - her name as she tells him in trying to hide her true identity - around Rome on her incognito day of freedom somewhat unaware that the secret service is searching for her, along for the ride is Joe’s photographer friend, Irving Radovich, who Joe has tasked with clandestinely taking photographs of her, those photos to accompany the story. As the day progresses, Joe and Ann slowly start to fall for each other. Their feelings for each other affect what both decide to do, Ann with regard to her royal duties, Joe with regard to the story, and both with regard to if there is a future for them together.	Comedy	Drama	Drama	Drama
75	Weary of the conventions of Parisian society, a rich playboy and a youthful courtesan-in-training enjoy a platonic friendship, but it may not stay platonic for long.	Comedy	Drama	Drama	Drama
78	Only the royal suite at the grandest hotel in Paris has a safe large enough for the jewels of the Grand Duchess Swana. So the three Russians who have come to sell the jewels settle into the suite until a higher ranking official is dispatched to find out what is delaying the sale. She is Ninotchka, a no nonsense woman who fascinates Count Leon who had been the faithful retainer of the Grand Duchess. The Grand Duchess will give up all claim to the jewels if Ninotchka will fly away from the count.	Comedy	Drama	Drama	Drama
83	Tatsu is a slightly delusional painter who lives in the wilderness. He spends his days painting nothing but the image of his love, a princess he believes to have been incarnated as a dragon. His work is noticed by a servant of Kano Indara, an aging master painter who has no male heir or disciple to pass his skills to. The servant brings Tatsu to Indara under the belief that Indara can help him find his princess in exchange for allowing Indara to pass his knowledge on to him. Once there, Tatsu is led to believe that Indara’s daughter, Ume Ko, is the princess. Tatsu agrees to stay, but now that he has found his love he no longer has the inspiration to paint the masterpieces that he once produced. Ume Ko pretends to kill herself so that Tatsu can once again find inspiration through his sorrow, and once he regains this she reveals herself to him. He has learned that “love must be a slave to art”, and they live out the rest of their days together, with Tatsu painting her as he once did.	Drama	Drama	Drama	Comedy
85	Set against the backdrop of 1977 Los Angeles, The Nice Guys opens when single father and licensed PI Holland March (Gosling) is hired to investigate the apparent suicide of famous porn star Misty Mountains. As the trail leads him to track down a girl named Amelia (Qualley), he encounters less licensed and less hands-off private eye Jackson Healey (Russell Crowe) and his brass knuckles, both hired by the young hippie. However, the situation takes a turn for the worse when Amelia vanishes and it becomes apparent that March wasn’t the only party interested. As both men are forced to team up, they’ll have to take on a world filled with eccentric goons, strippers dressed as mermaids and even a possible government conspiracy.	Comedy	Drama	Drama	Drama
88	A young neurosurgeon (Gene Wilder) inherits the castle of his grandfather, the famous Dr. Victor von Frankenstein. In the castle he finds a funny hunchback called Igor, a pretty lab assistant named Inga and the old housekeeper, frau Blucher -iiiiihhh!-. Young Frankenstein believes that the work of his grandfather is only crap, but when he discovers the book where the mad doctor described his reanimation experiment, he suddenly changes his mind…	Comedy	Drama	Drama	Drama
90	’Toon star Roger is worried that his wife Jessica is playing pattycake with someone else, so the studio hires detective Eddie Valiant to snoop on her. But the stakes are quickly raised when Marvin Acme is found dead and Roger is the prime suspect. Groundbreaking interaction between the live and animated characters, and lots of references to classic animation.	Comedy	Drama	Drama	Drama

Conclusion

The results of the analysis were generally disappointing. Althought classification models were able to predict the test set with a better accuracy than a coin flip, there was still a significant number of wrong classification examples. There are two main variables. First, I have simplied film genre significantly. Original classification had much more detail than was used in the analysis. Perhaps, the method for chosing classes was flawed. This is demonstrated by a number of films that were originally classified as both comedy and drama (dramedy is a word for a reason). Second, film plots in this example are fairly limited. Looking over mismatched examples above, it is not hard to imagine some entries as comedic or dramatic films. I believe even human evaluators will have a hard time classifying some plots as Comedy or Drama.

Additionally, I have discovered that the size and quality of the training set can have a very significant impact on the classification model. One of the steps in the process is randomizing the list of available plots. This changes the training set every time the code is executed. This has a noticeable impact on quality of models.

Generally, I believe results I got are not acceptable to turn in for a real-life project and a lot more analysis and better training set should follow. However, as a training exercise, there is some good insight in the work performed.

DATA 607 Project 4: Text Mining

Ilya Kats

April 16, 2017

Introduction

Required Libraries

Data Scraping: Part I

Collecting List of Films

Collecting Film Plots

Data Analysis

Data Scraping: Part II

Collecting and Tidying Additional Data

Data Analysis

Data Scraping: Part III

Data Analysis: Classification

Conclusion