Our project examines the dialogues in movie scripts using the Cornell Movie-Dialogs Corpus. We want to understand how characters talk to each other, looking for common ways they express themselves and for underlying patterns in their dialogues. Identifying these patterns can teach us more about how to tell a good story through dialogue in movies. The project is relevant to people in the film industry, such as directors and screenwriters. Directors can use these patterns to craft scenes with dialogue that is more engaging and authentic, and screenwriters can create characters whose dialogue matches their personalities and the story's development. Overall, we hope to use the Cornell Movie-Dialogs Corpus to improve the quality and authenticity of storytelling.
The corpus we are working with is the Cornell Movie-Dialogs Corpus. It is compiled from movie scripts and contains textual data covering character interactions, dialogues, and additional information about the movies themselves. Because its contents are derived directly from the raw text of movie scripts, it is a representative collection of the language patterns found in cinematic conversations. It contains 220,579 conversational exchanges between 10,292 pairs of movie characters across 617 movies.
Figure 1 shows the summary statistics of the corpus.
Figure 2 shows the number of characters per movie.
Figure 3 shows the average length of text in a conversation.
Question: Are there genre-specific differences in the representation?
The Cornell Movie-Dialogs Corpus contains three datasets (Conversations, Speakers, and Utterances), plus corpus-level metadata:

Conversations: indexed by the id of the first utterance that makes up the conversation. Metadata for each conversation includes:

* movie_idx - index of the movie from which the conversation occurs
* movie_name - title of the movie
* release_year - year of movie release
* rating - IMDB rating of the movie
* votes - number of IMDB votes
* genre - a list of genres this movie belongs to

Speakers: the list of speakers, i.e. movie characters. Metadata for each speaker includes:

* character_name - name of the character in the movie
* movie_idx - index of the movie this character appears in
* movie_name - title of the movie
* gender - gender of the character ("?" for unlabeled cases)
* credit_pos - position in the movie credits ("?" for unlabeled cases)

Utterances: the individual lines of dialogue. Each utterance has:

* id - index of the utterance
* speaker - the speaker who authored the utterance
* conversation_id - id of the first utterance in the conversation this utterance belongs to
* reply_to - id of the utterance to which this utterance replies (None if the utterance is not a reply)
* timestamp - time of the utterance
* text - textual content of the utterance

Metadata for utterances includes:

* movie_idx - index of the movie from which this utterance occurs
* parsed - parsed version of the utterance text, represented as a SpaCy Doc

Corpus: additional information for the movies can be found here:

* url - a dictionary mapping movie_idx to the url from which the raw sources were retrieved
* name - name of the corpus
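The sketch below shows how this data could be accessed, assuming the ConvoKit release of the corpus ("movie-corpus") is used; the metadata keys simply mirror the field list documented above.

```python
# A minimal sketch, assuming the ConvoKit release of the corpus ("movie-corpus").
# The metadata keys mirror the field list documented above.
from convokit import Corpus, download

corpus = Corpus(filename=download("movie-corpus"))

# Look at one utterance and its core fields.
utt = next(corpus.iter_utterances())
print(utt.id, utt.speaker.id, utt.conversation_id, utt.reply_to)
print(utt.text)

# Conversation-level metadata: movie name, release year, genres.
conv = corpus.get_conversation(utt.conversation_id)
print(conv.meta["movie_name"], conv.meta["release_year"], conv.meta["genre"])
```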
Question: How do speech patterns play a role in classifying movie dialogues into their respective genres?
Model 1: Text Classification: Logistic Regression Using TF-IDF - Philip Huynh
The primary task of our model is to classify movie dialogues into their respective genres, a multi-label text classification problem. The input is the text content of a line of movie dialogue, which undergoes a series of preprocessing steps, including tokenization, lemmatization, and stop-word removal, to refine the text for analysis. The output is the set of genre labels that best represent the content of the dialogue line.
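A rough sketch of such a preprocessing step is shown below, assuming spaCy is used for tokenization and lemmatization; the exact tooling and settings may differ from ours.

```python
# A rough preprocessing sketch, assuming spaCy: lowercase, tokenize, lemmatize,
# and drop stop words and punctuation before vectorization.
import spacy

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def preprocess(line: str) -> str:
    doc = nlp(line.lower())
    kept = [
        tok.lemma_                 # lemmatized form of each token
        for tok in doc
        if not tok.is_stop         # remove stop words
        and not tok.is_punct       # remove punctuation
        and not tok.is_space
    ]
    return " ".join(kept)

print(preprocess("Why are you holding the gun?"))  # e.g. "hold gun"
```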
Our model's architecture combines TF-IDF vectorization with a logistic regression classifier. TF-IDF transforms the text data into a numerical format, assigning weights to words based on their frequency and importance, and the resulting feature set is fed into the logistic regression classifier. The classifier learns, from the patterns seen during training, to predict the probability that a given dialogue belongs to each genre, i.e. the likelihood of a dialogue being a comedy, a drama, a thriller, and so on. Because logistic regression is essentially a binary decision-making algorithm, we employ the One-vs-Rest strategy for our multi-genre classification task: for each genre, the model is trained to recognize dialogues belonging to that genre as opposed to all other genres, effectively creating a series of binary classifiers.
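A minimal scikit-learn sketch of this setup is shown below; the toy texts, genre lists, and hyperparameters are illustrative placeholders rather than our actual data or configuration.

```python
# A minimal scikit-learn sketch of Model 1: TF-IDF features feeding a
# One-vs-Rest logistic regression. The texts and genre lists below are toy
# placeholders standing in for the preprocessed corpus lines and their labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

texts = [
    "get down he has a gun",
    "i have loved you since the day we met",
    "the ship drifts beyond the outer rim",
    "why did the detective follow him home",
]
genre_lists = [
    ["action", "thriller"],
    ["drama", "romance"],
    ["sci-fi"],
    ["crime", "mystery"],
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(genre_lists)          # one binary column per genre

model = make_pipeline(
    TfidfVectorizer(),                      # weight terms by TF-IDF
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),  # one classifier per genre
)
model.fit(texts, Y)

# Per-genre probabilities for a new line of dialogue.
probs = model.predict_proba(["he reached for the gun"])[0]
print(dict(zip(mlb.classes_, probs.round(2))))
```

On the full corpus, the held-out predictions are scored with a classification report, which yields the per-genre precision, recall, F1-score, and support discussed next.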
When evaluating the model's performance, the classification report showed that the model performs unevenly across genres. For example, 'documentary', 'history', and 'film-noir' show high precision but low recall: the model rarely predicts these genres, but when it does, it is usually correct. The 'drama' genre, on the other hand, has more balanced metrics, possibly because it has a larger number of examples to learn from. The support column reflects the imbalance in the dataset, with 'drama' having substantially more instances than the other genres; this imbalance can limit the model's ability to learn the underrepresented genres. The model's performance highlights the nuances of genre classification: certain genres are straightforward to predict, while others share overlapping characteristics and are harder to separate.
Model 2: Text Classification: Multinomial Naive Bayes Using a Count Vectorizer
This model's main job is also to determine which genres a line of movie dialogue belongs to. The input to the model is the actual words spoken in a movie line, which we tidy up by lowercasing, removing unnecessary punctuation, and simplifying word forms. Following preprocessing, the Naive Bayes model is employed for its suitability in text classification tasks, leveraging probabilistic relationships between word counts and labels to discern genre-specific patterns in movie conversations.
The model architecture takes a straightforward but effective approach, using a Count Vectorizer to convert the preprocessed text into numerical features and a Multinomial Naive Bayes classifier for genre prediction. We employed the MultiLabelBinarizer to handle multi-label classification, since movies can belong to multiple genres simultaneously; it represents the genres as binary labels, making it feasible for the Naive Bayes model to predict multiple genres for a given dialogue. The Count Vectorizer captures the frequency of words in each dialogue, creating the numerical representation that serves as input to the Naive Bayes classifier. Precision, recall, F1-score, and support on the test split are used to gauge the model's effectiveness, providing insight into its ability to identify genre affiliations in movie dialogues. The strength of Naive Bayes lies in its interpretability and efficiency, making it a valuable tool for genre classification within this corpus.
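The sketch below shows one way this could be wired up in scikit-learn; it assumes the Naive Bayes classifier is wrapped in a One-vs-Rest scheme so that each genre gets its own binary decision (a common way to handle the multi-label target), and the data is again a toy placeholder.

```python
# A sketch of the Model 2 setup, assuming scikit-learn: bag-of-words counts
# from a CountVectorizer feeding Multinomial Naive Bayes, wrapped in
# One-vs-Rest so each genre gets its own binary classifier. Toy data again.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

texts = [
    "run he is right behind you",
    "i never stopped loving you",
    "the signal came from deep space",
    "the body was found at dawn",
]
genre_lists = [["horror", "thriller"], ["romance"], ["sci-fi"], ["crime", "mystery"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(genre_lists)              # binary label matrix

model = make_pipeline(
    CountVectorizer(),                          # raw word counts per dialogue
    OneVsRestClassifier(MultinomialNB()),       # per-genre Naive Bayes
)
model.fit(texts, Y)

# On a real test split, this report is where the per-genre precision,
# recall, F1-score, and support values come from.
print(classification_report(Y, model.predict(texts),
                            target_names=mlb.classes_, zero_division=0))
```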
Regarding its performance, the model shows low precision and recall for many genres, indicating difficulty in predicting those categories accurately. Its ability to distinguish between genres varies, likely influenced by the complexity and distinctive features of each genre. These findings offer useful direction for refining the model and improving its accuracy, whether through adjustments to feature engineering or by exploring more advanced modeling techniques. The precision-recall trade-off we observe emphasizes the need for a balanced approach when optimizing the model across genres.
Model 1:
| Genre | precision | recall | f1-score | support |
|---|---|---|---|---|
| action | 0.74 | 0.10 | 0.17 | 13402 |
| adult | 0.00 | 0.00 | 0.00 | 60 |
| adventure | 0.76 | 0.08 | 0.14 | 8644 |
| animation | 0.81 | 0.02 | 0.03 | 1269 |
| biography | 0.73 | 0.02 | 0.04 | 2756 |
| comedy | 0.66 | 0.15 | 0.24 | 18552 |
| crime | 0.71 | 0.13 | 0.21 | 15004 |
| documentary | 1.00 | 0.02 | 0.04 | 307 |
| drama | 0.63 | 0.84 | 0.72 | 34435 |
| family | 0.64 | 0.01 | 0.02 | 1241 |
| fantasy | 0.83 | 0.04 | 0.08 | 5948 |
| film-noir | 0.83 | 0.01 | 0.02 | |
| history | 0.90 | 0.02 | 0.03 | 1571 |
| horror | 0.83 | 0.05 | 0.09 | 7321 |
| music | 0.83 | 0.02 | 0.04 | 1523 |
| musical | 0.88 | 0.02 | 0.04 | 608 |
| mystery | 0.84 | 0.06 | 0.11 | 9974 |
| romance | 0.73 | 0.09 | 0.17 | 15282 |
| sci-fi | 0.81 | 0.11 | 0.19 | 8955 |
| short | 0.93 | 0.05 | 0.09 | 296 |
| sport | 0.81 | 0.01 | 0.03 | 929 |
| thriller | 0.63 | 0.29 | 0.40 | 24509 |
| war | 0.86 | 0.03 | 0.06 | 1960 |
| western | 0.80 | 0.01 | 0.02 | 1255 |
| Avg | precision | recall | f1-score |
|---|---|---|---|
| micro avg | 0.65 | 0.26 | 0.38 |
| macro avg | 0.76 | 0.09 | 0.13 |
| weighted avg | 0.71 | 0.26 | 0.30 |
| samples avg | 0.55 | 0.30 | 0.36 |
The table above, which shows the genres' precision, recall, and F1-scores, offers an overview of the logistic regression model's effectiveness at classifying movie dialogues into genres. Precision measures how often the model is correct when it predicts a specific genre, recall estimates its ability to capture all instances of a given genre, and the F1-score is the harmonic mean of the two, balancing both concerns.
According to the results, 'drama' is consistently the strongest genre, with a precision of 0.63, meaning that when the model predicts a dialogue is 'drama' it is correct about 63% of the time, and a recall of 0.84, meaning it captures 84% of the actual 'drama' instances in the dataset. The resulting F1-score of 0.72 suggests a reasonable balance between accurate predictions and genre coverage. Other genre labels, such as 'documentary', show high precision but low recall, meaning the model is more conservative with them and misses many of the actual documentaries.
The second plot, a bar chart of the number of documents classified into each genre, provides crucial insight into the model's output and into how the dataset is distributed across genres. The chart makes the imbalance in genre distribution plain: 'Drama' stands out as the dominant genre, with a substantial count of 216,142 documents, indicating a prevalence of dramatic dialogue in the dataset, while genres like 'Documentary', 'Short', and 'Film-Noir' have far lower counts, with 'Documentary' the least represented at only 53 documents.
This table displays the top bigram phrases for the various movie genres predicted in the set, giving a straightforward view of the linguistic characteristics associated with each genre. For example, if the phrase "oh god" appears frequently in horror, it may suggest a pattern of suspense or distress common in that genre's dialogue. By revealing the most common bigrams within each genre, the table points to patterns the model may be using to classify dialogues, helping interpret its decision-making process.
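A hedged sketch of how such a bigram table could be produced is shown below, assuming a CountVectorizer restricted to bigrams is run over the dialogues assigned to one genre; the `horror_lines` data is a made-up placeholder.

```python
# A sketch of building a per-genre bigram table: count bigrams with a
# CountVectorizer restricted to ngram_range=(2, 2) over the dialogues assigned
# to one genre, then keep the most frequent ones.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def top_bigrams(dialogues, k=10):
    vec = CountVectorizer(ngram_range=(2, 2))
    counts = vec.fit_transform(dialogues)                # documents x bigrams
    totals = np.asarray(counts.sum(axis=0)).ravel()      # total count per bigram
    top = totals.argsort()[::-1][:k]
    return [(vec.get_feature_names_out()[i], int(totals[i])) for i in top]

horror_lines = ["oh god it is behind the door", "oh god do not open that door"]
print(top_bigrams(horror_lines, k=5))
```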
Model 2:
| Genre | precision | recall | f1-score | support |
|---|---|---|---|---|
| action | 1.00 | 0.01 | 0.01 | 1199 |
| adult | 0.00 | 0.00 | 0.00 | 60 |
| adventure | 0.51 | 0.06 | 0.11 | 6355 |
| animation | 1.00 | 0.00 | 0.01 | 656 |
| biography | 0.00 | 0.00 | 0.00 | 112 |
| comedy | 0.85 | 0.02 | 0.04 | 3714 |
| crime | 0.51 | 0.03 | 0.06 | 6756 |
| drama | 0.37 | 0.15 | 0.22 | 14007 |
| family | 0.86 | 0.01 | 0.01 | 998 |
| fantasy | 0.84 | 0.01 | 0.02 | 3547 |
| film-noir | 0.00 | 0.00 | 0.00 | 81 |
| history | 0.86 | 0.02 | 0.03 | 1133 |
| horror | 0.74 | 0.01 | 0.02 | 3733 |
| music | 0.83 | 0.01 | 0.02 | 590 |
| musical | 0.00 | 0.00 | 0.00 | 201 |
| mystery | 0.58 | 0.04 | 0.07 | 8592 |
| romance | 0.70 | 0.02 | 0.04 | 5671 |
| sci-fi | 0.60 | 0.08 | 0.14 | 5151 |
| short | 1.00 | 0.01 | 0.02 | 90 |
| sport | 0.00 | 0.00 | 0.00 | 68 |
| thriller | 1.00 | 0.01 | 0.01 | 1540 |
| war | 0.00 | 0.00 | 0.00 | 121 |
| western | 0.00 | 0.00 | 0.00 | 59 |
| documentary | 1.00 | 0.01 | 0.02 | 90 |
This plot, showcasing the precision, recall, F1-score, and support metrics for the various movie genres, provides a more nuanced picture of the Naive Bayes model's performance. In terms of precision, genres like 'action' and 'animation' stand out with perfect scores of 1.00, indicating that when the model predicts these genres it is highly accurate. Precision drops for genres such as 'drama' (0.37) and 'adventure' (0.51), suggesting that while the model is proficient in predicting certain genres, it struggles to maintain precision for others.
The recall metric further shows the model's ability to capture all instances of a particular genre. Genres like 'action' (recall 0.01) and 'thriller' (recall 0.01) reveal a very limited ability to identify positive instances, indicating room for improvement, while 'drama' (recall 0.15) and 'sci-fi' (recall 0.08) exhibit a somewhat broader but still imprecise recognition of instances. The F1-score, the harmonic mean of precision and recall, provides a balanced evaluation, with 'drama' (F1-score 0.22) illustrating the trade-off between the two. These results show that while the model does well on certain genres, the variation in precision, recall, and support emphasizes the need for a more nuanced approach to capture the diversity of movie dialogue.
This second plot shows a generated word cloud, which offers a visual representation of the most frequently occurring words in the movie dialogue dataset. The visualization was created using the WordCloud library, which transforms textual data into a visual format by highlighting words based on their frequency.
This visualization helps to identify genre-specific language patterns and provides a qualitative understanding of the distinguishing features associated with each genre. The larger and bolder the word in the cloud, the more frequently it appears in the text, making it a useful tool for exploring and interpreting textual data in a visually engaging manner.
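A minimal sketch of the word cloud generation is given below, assuming the `wordcloud` and `matplotlib` packages; the `dialogue_lines` list and the styling options are illustrative placeholders rather than the exact code behind the figure.

```python
# A minimal word cloud sketch, assuming the wordcloud and matplotlib packages.
# dialogue_lines is a toy placeholder for the corpus utterance texts.
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

dialogue_lines = ["what do you want from me", "i just want to go home", "we want answers"]

wc = WordCloud(width=800, height=400, background_color="white",
               stopwords=STOPWORDS).generate(" ".join(dialogue_lines))

plt.imshow(wc, interpolation="bilinear")   # bigger words appear more frequently
plt.axis("off")
plt.show()
```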
This last plot shows the number of documents in each genre, which indicates how content is distributed across categories in the dataset and how much textual information is available per genre. This matters for training accurate machine learning models, since a balanced representation helps avoid biased predictions. It is also relevant beyond modeling: genre distribution supports personalized recommendations based on genre preferences, and for businesses in the entertainment industry it informs strategic decisions about aligning content offerings with audience interests. In short, this measure underpins data analysis, model development, and decisions about how content is organized and optimized to meet user preferences and business goals.
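A short sketch of how such a per-genre document count could be computed and plotted is shown below; `genre_lists` is a toy placeholder standing in for the per-dialogue genre lists from the corpus metadata.

```python
# Count how many dialogue documents carry each genre label and plot the totals.
from collections import Counter
import matplotlib.pyplot as plt

genre_lists = [["drama"], ["drama", "romance"], ["comedy"], ["thriller", "drama"]]

counts = Counter(g for genres in genre_lists for g in genres)
genres, totals = zip(*counts.most_common())   # genres sorted by document count

plt.bar(genres, totals)
plt.xticks(rotation=90)
plt.ylabel("number of documents")
plt.tight_layout()
plt.show()
```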
The analysis of the two models produced several key findings. In Model 1, the logistic regression model exhibits notable precision and recall imbalances across genres. The precision-recall trade-off is evident: 'Drama' achieves reasonably high scores in both precision and recall, indicating balanced performance, while genres such as 'Documentary' and 'Film-Noir' show high precision but low recall, suggesting conservative predictions. The variation in average line lengths across genres, for example in 'History', 'Biography', and 'War', is somewhat expected, but it prompts us to consider incorporating dialogue length as a feature in the classification model. The prevalence of 'Drama' in the dataset is overwhelming, highlighting the challenge of imbalanced class distribution, which can impact model generalization.
Model 2 shows varied precision and recall scores across genres. While some genres like 'Action' and 'Animation' achieve perfect precision, the overall performance is mixed, and 'Drama' exhibits lower precision and recall than in Model 1, indicating that this model has more trouble capturing the genre accurately. The word cloud analysis provides additional context, visually highlighting the most frequently occurring words in the dataset and revealing the dominance of specific themes in the movie dialogues, a qualitative perspective on the linguistic patterns.
The findings broadly align with expectations, since 'Drama' is a prevalent genre in movie datasets. However, the imbalances in precision and recall show the complexity of genre classification, suggesting that certain genres are more challenging for the models to predict accurately. Next steps include addressing the class imbalance, trying more advanced modeling techniques, and looking more closely at the misclassified instances for specific genres such as 'Drama'.