1 Introduction: Beyond Traditional Genre Classification
We’ve been doing movie genres wrong for a long time, and we can do better—our math proves it. Skip all the fancy algorithms and number-crunching and jump to the conclusion to see what film classification should really look like, or read on and enjoy our final paper for 620 Web Analytics.
Film classification has long relied on a set of traditional genres that emerged organically throughout cinema history. These categories—Drama, Comedy, Action, Horror, and others—serve as a navigational framework for audiences and industry professionals alike. Yet in today’s complex cinematic landscape, these conventional labels increasingly fail to capture the nuanced relationships between films or accurately reflect how stories are crafted and experienced.
This research project applies computational methods to investigate whether traditional film genres accurately represent contemporary filmmaking practices. By combining network science and natural language processing, we analyze a dataset of over 3,600 films released since 2015, examining both the structural relationships between genres and the linguistic patterns in plot descriptions. Our dual methodological approach offers complementary perspectives: network analysis reveals how genres interconnect through co-occurrence patterns, while text analysis uncovers the distinctive linguistic signatures that characterize different film categories. This combination allows us to move beyond anecdotal observations about genre hybridization to identify statistically significant patterns in how films are actually classified and described.
The findings suggest that the current genre system suffers from significant limitations, including the oversaturation of certain categories (particularly Drama) and the failure to recognize natural storytelling communities that cross traditional genre boundaries. Based on these insights, we propose an alternative classification framework organized around storytelling approaches rather than conventional genre labels—a system that better reflects the creative and thematic relationships in modern cinema.
View Document Setup Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import os
import requests
import time
from collections import Counter, defaultdict
import math
import random
import string
from IPython.display import HTML, display
import base64
import io
import json
from pathlib import Path
from tqdm import tqdm
from dotenv import load_dotenv

# Import NetworkX for network analysis
import networkx as nx
from networkx.algorithms import community

# Import NLTK for text processing
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.stem import WordNetLemmatizer
from nltk.tag import pos_tag
from nltk import ne_chunk
from wordcloud import WordCloud
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Download necessary NLTK packages
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('vader_lexicon', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('maxent_ne_chunker', quiet=True)
nltk.download('words', quiet=True)

# Import our custom modules
from data620.imdb_data_loader import IMDbDataLoader
from data620.omdb import OMDbConverter, OMDbDownloader

# Helper function to encode plots as base64 for direct HTML embedding
default_fig_width = 700

def save_plot_as_base64(plt, dpi=300, width=default_fig_width):
    """Save a matplotlib plot as base64-encoded image"""
    buf = io.BytesIO()
    plt.savefig(buf, format='png', bbox_inches='tight', dpi=dpi)
    plt.close()
    img_base64 = base64.b64encode(buf.getvalue()).decode('utf-8')
    img_html = f'<img src="data:image/png;base64,{img_base64}" alt="Plot" width="{width}" />'
    return img_html
2 Load IMDB Data
We begin by loading and filtering data from the IMDb dataset. We’re specifically interested in movies released since 2015 with ratings of 6.0 or higher and at least 5,000 votes.
For movies (titles dataset):
Movies only (filters out TV shows, short films, etc.)
Released in 2015 or later (configurable via min_year parameter)
Has a rating of 6.0 or higher (configurable via min_rating parameter)
Has at least 5,000 votes (configurable via min_votes parameter)
For other datasets:
Principals (cast and crew): Only those associated with the filtered movies
Names (people): Only those who appear in the filtered principals dataset
Crew: Only those associated with the filtered movies
Episodes: Only those associated with the filtered movies (though this wouldn’t contain much since we filtered for movies only)
Akas (alternative titles): Only those associated with the filtered movies
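The title filters listed above can be sketched on a toy table. This is a minimal illustration, not the loader itself: the five rows are made up, `filter_movies` is a hypothetical helper name, and the default thresholds are the ones that appear in the `IMDbDataLoader` constructor (min_year=2015, min_rating=6.0, min_votes=5000).

```python
import pandas as pd

# Toy stand-in for the merged titles + ratings table (all values are illustrative).
toy = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003", "tt0000004", "tt0000005"],
    "titleType": ["movie", "movie", "tvSeries", "movie", "movie"],
    "startYear": ["2018", "2012", "2019", "\\N", "2020"],
    "averageRating": [7.1, 8.0, 8.5, 6.5, 5.4],
    "numVotes": [12000, 90000, 40000, 7000, 30000],
})

def filter_movies(df, min_year=2015, min_rating=6.0, min_votes=5000):
    """Apply the movie, year, rating, and vote filters (hypothetical helper)."""
    out = df[df["titleType"] == "movie"].copy()
    # IMDb marks missing values as "\N"; coerce them to NaN so those rows drop out.
    out["startYear"] = pd.to_numeric(out["startYear"], errors="coerce")
    keep = (
        (out["startYear"] >= min_year)
        & (out["averageRating"] >= min_rating)
        & (out["numVotes"] >= min_votes)
    )
    return out[keep]

kept = filter_movies(toy)
print(kept["tconst"].tolist())
```

Only the first row survives: the others fail on release year, title type, missing year, and rating respectively. The companion datasets are then restricted to the surviving `tconst` values.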
Loading IMDb data...
Filtered data found. Loading from filtered files.
Loading titles from filtered file...
Loading ratings from filtered file...
Loading principals from filtered file...
Loading names from filtered file...
Loading episodes from filtered file...
Loading akas from filtered file...
Loading crew from filtered file...
===== IMDb Dataset Summary =====
Total filtered movies: 3,685
Movies by year:
2015.0: 377
2016.0: 405
2017.0: 399
2018.0: 422
2019.0: 427
2020.0: 296
2021.0: 325
2022.0: 374
2023.0: 349
2024.0: 263
2025.0: 48
Top genres:
Drama: 2,396
Comedy: 1,075
Action: 883
Crime: 702
Thriller: 573
Adventure: 522
Biography: 471
Romance: 463
Mystery: 379
Horror: 310
History: 252
Documentary: 241
Animation: 240
Fantasy: 189
Sci-Fi: 147
Average rating: 6.91
Median rating: 6.80
Average votes: 62,966
Median votes: 18,024
Top cast/crew categories:
actor: 22,718
actress: 12,810
producer: 9,936
writer: 7,703
editor: 4,643
composer: 4,507
casting_director: 4,485
director: 4,003
cinematographer: 3,911
production_designer: 3,345
Total people (actors, directors, etc.): 42,404
Sample of filtered movies:
tconst primaryTitle startYear genres
0 tt0069049 The Other Side of the Wind 2018.0 Drama
1 tt0293429 Mortal Kombat 2021.0 Action,Adventure,Fantasy
2 tt0315642 Wazir 2016.0 Crime,Drama,Mystery
3 tt0365545 Nappily Ever After 2018.0 Comedy,Drama,Romance
4 tt0369610 Jurassic World 2015.0 Action,Adventure,Sci-Fi
library code: imdb data loader
import os
import pandas as pd
import gzip
import shutil
import requests
from pathlib import Path
import time
from typing import Dict, List, Optional, Union, Tuple


class IMDbDataLoader:
    """
    A class to handle loading, filtering, and accessing IMDb datasets.

    This class handles the complete workflow for IMDb data:
    1. Downloading raw data files if needed
    2. Extracting the files if needed
    3. Filtering the data to recent/relevant entries
    4. Persisting the filtered dataframes
    5. Providing easy access to the filtered dataframes

    Usage:
        imdb = IMDbDataLoader()
        imdb.load()

        # Access dataframes directly as properties
        movies_df = imdb.titles
        actors_df = imdb.names
    """

    def __init__(self,
                 base_url: str = "https://datasets.imdbws.com/",
                 raw_data_dir: str = "./nogit_imdb_data/",
                 filtered_data_dir: str = "./nogit_imdb_filtered/",
                 min_year: int = 2015,
                 min_votes: int = 5000,
                 min_rating: float = 6.0):
        """
        Initialize the IMDb data loader.

        Args:
            base_url: URL for the IMDb data files
            raw_data_dir: Directory to store the raw data files
            filtered_data_dir: Directory to store the filtered dataframes
            min_year: Minimum year for filtering movies (inclusive)
            min_votes: Minimum number of votes for filtering movies
            min_rating: Minimum rating for filtering movies
        """
        self.base_url = base_url
        self.raw_data_dir = Path(raw_data_dir)
        self.filtered_data_dir = Path(filtered_data_dir)
        self.min_year = min_year
        self.min_votes = min_votes
        self.min_rating = min_rating

        # Dictionary mapping dataset names to file names
        self.dataset_files = {
            "titles": "title.basics.tsv.gz",
            "ratings": "title.ratings.tsv.gz",
            "principals": "title.principals.tsv.gz",
            "names": "name.basics.tsv.gz",
            "episodes": "title.episode.tsv.gz",
            "akas": "title.akas.tsv.gz",
            "crew": "title.crew.tsv.gz"
        }

        # Initialize dataframe properties to None
        self._titles = None
        self._ratings = None
        self._principals = None
        self._names = None
        self._episodes = None
        self._akas = None
        self._crew = None

        # Flag to track if data has been loaded
        self._loaded = False

    def load(self, force_refresh: bool = False) -> bool:
        """
        Load all IMDb datasets, handling download, extraction, filtering,
        and persistence as needed.

        Args:
            force_refresh: If True, redownload and reprocess all data

        Returns:
            True if loading was successful, False otherwise
        """
        print("Loading IMDb data...")

        # Check if filtered data already exists and we're not forcing a refresh
        if not force_refresh and self._check_filtered_data_exists():
            print("Filtered data found. Loading from filtered files.")
            self._load_filtered_data()
            self._loaded = True
            return True

        # Ensure directories exist
        self.raw_data_dir.mkdir(exist_ok=True, parents=True)
        self.filtered_data_dir.mkdir(exist_ok=True, parents=True)

        # Download raw data if needed
        for name, file_name in self.dataset_files.items():
            local_path = self.raw_data_dir / file_name
            if force_refresh or not local_path.exists():
                print(f"Downloading {file_name}...")
                success = self._download_file(file_name)
                if not success:
                    print(f"Failed to download {file_name}")
                    return False
            else:
                print(f"File {file_name} already exists")

        # Load, filter and persist data
        try:
            self._process_data()
            self._loaded = True
            return True
        except Exception as e:
            print(f"Error processing data: {e}")
            return False

    def _check_filtered_data_exists(self) -> bool:
        """Check if filtered data files exist"""
        for name in self.dataset_files.keys():
            filtered_path = self.filtered_data_dir / f"{name}_filtered.parquet"
            if not filtered_path.exists():
                return False
        return True

    def _load_filtered_data(self) -> None:
        """Load data from filtered parquet files"""
        for name in self.dataset_files.keys():
            filtered_path = self.filtered_data_dir / f"{name}_filtered.parquet"
            if filtered_path.exists():
                print(f"Loading {name} from filtered file...")
                setattr(self, f"_{name}", pd.read_parquet(filtered_path))

    def _download_file(self, file_name: str) -> bool:
        """Download a file from IMDb dataset"""
        file_url = self.base_url + file_name
        local_path = self.raw_data_dir / file_name
        try:
            response = requests.get(file_url, stream=True)
            response.raise_for_status()
            with open(local_path, 'wb') as f:
                shutil.copyfileobj(response.raw, f)
            print(f"Download complete: {file_name}")
            return True
        except Exception as e:
            print(f"Error downloading {file_name}: {e}")
            return False

    def _process_data(self) -> None:
        """Load, filter, and persist all datasets"""
        # Load and filter titles and ratings first
        self._load_and_filter_titles_ratings()

        # Process other datasets based on filtered titles
        for name, file_name in self.dataset_files.items():
            if name in ['titles', 'ratings']:
                continue  # Already processed

            print(f"Processing {name}...")

            # Load and filter the dataset
            df = self._load_and_filter_dataset(name, file_name)

            # Store in memory
            setattr(self, f"_{name}", df)

            # Save filtered data
            filtered_path = self.filtered_data_dir / f"{name}_filtered.parquet"
            df.to_parquet(filtered_path, index=False)
            print(f"Saved filtered {name} dataset")

    def _load_and_filter_titles_ratings(self) -> None:
        """Load and filter titles and ratings datasets"""
        # Load titles
        titles_path = self.raw_data_dir / self.dataset_files["titles"]
        print("Loading titles dataset...")
        titles_df = pd.read_csv(titles_path, sep='\t', low_memory=False)

        # Basic filtering of titles
        print("Filtering titles...")
        # Only movies; copy so the column assignment below does not write to a slice
        titles_df = titles_df[titles_df['titleType'] == 'movie'].copy()
        titles_df['startYear'] = pd.to_numeric(titles_df['startYear'], errors='coerce')
        titles_df = titles_df[titles_df['startYear'] >= self.min_year]  # Recent movies

        # Load ratings
        ratings_path = self.raw_data_dir / self.dataset_files["ratings"]
        print("Loading ratings dataset...")
        ratings_df = pd.read_csv(ratings_path, sep='\t', low_memory=False)

        # Merge and filter by ratings
        print("Merging titles with ratings...")
        merged_df = pd.merge(titles_df, ratings_df, on='tconst', how='inner')

        # Apply rating and votes filters
        filtered_df = merged_df[
            (merged_df['averageRating'] >= self.min_rating) &
            (merged_df['numVotes'] >= self.min_votes)
        ]

        # Extract title and rating dataframes from filtered data
        # (copies, so later column additions do not write to a slice)
        self._titles = filtered_df[titles_df.columns].copy()
        self._ratings = filtered_df[['tconst', 'averageRating', 'numVotes']].copy()

        # Create list of title IDs to filter other datasets
        self.filtered_title_ids = set(self._titles['tconst'])

        # Save filtered dataframes
        titles_path = self.filtered_data_dir / "titles_filtered.parquet"
        ratings_path = self.filtered_data_dir / "ratings_filtered.parquet"
        self._titles.to_parquet(titles_path, index=False)
        self._ratings.to_parquet(ratings_path, index=False)

        print("Saved filtered titles and ratings datasets")
        print(f"Filtered dataset contains {len(self._titles):,} movies")

    def _load_and_filter_dataset(self, name: str, file_name: str) -> pd.DataFrame:
        """Load and filter a dataset based on filtered title IDs"""
        file_path = self.raw_data_dir / file_name

        # Load the dataset
        df = pd.read_csv(file_path, sep='\t', low_memory=False)

        # Filter based on title IDs if the dataset contains title references
        if 'tconst' in df.columns:
            filtered_df = df[df['tconst'].isin(self.filtered_title_ids)]
            print(f"Filtered {name} from {len(df):,} to {len(filtered_df):,} rows")
            return filtered_df
        else:
            # For datasets like names, filter based on usage in principals
            if name == 'names' and self._principals is not None:
                person_ids = set(self._principals['nconst'])
                filtered_df = df[df['nconst'].isin(person_ids)]
                print(f"Filtered {name} from {len(df):,} to {len(filtered_df):,} rows")
                return filtered_df

            # Otherwise, return the original dataset
            print(f"No filtering applied to {name} dataset")
            return df

    def get_summary_stats(self) -> Dict:
        """Get summary statistics for the loaded data"""
        if not self._loaded:
            print("Data not loaded. Call load() first.")
            return {}

        stats = {}

        # Titles stats
        if self._titles is not None:
            stats['titles_count'] = len(self._titles)

            # Count by year
            year_counts = self._titles['startYear'].value_counts().sort_index()
            stats['year_counts'] = year_counts.to_dict()

            # Count by genre
            if 'genres' in self._titles.columns:
                self._titles['genre_list'] = self._titles['genres'].str.split(',')
                exploded = self._titles.explode('genre_list')
                genre_counts = exploded['genre_list'].value_counts()
                stats['genre_counts'] = genre_counts.to_dict()

        # Ratings stats
        if self._ratings is not None:
            stats['avg_rating'] = self._ratings['averageRating'].mean()
            stats['median_rating'] = self._ratings['averageRating'].median()
            stats['avg_votes'] = self._ratings['numVotes'].mean()
            stats['median_votes'] = self._ratings['numVotes'].median()

        # Principals stats
        if self._principals is not None:
            stats['principals_count'] = len(self._principals)

            # Count by category
            if 'category' in self._principals.columns:
                category_counts = self._principals['category'].value_counts()
                stats['category_counts'] = category_counts.to_dict()

        # Names stats
        if self._names is not None:
            stats['names_count'] = len(self._names)

        return stats

    def print_summary(self) -> None:
        """Print a summary of the loaded data"""
        if not self._loaded:
            print("Data not loaded. Call load() first.")
            return

        stats = self.get_summary_stats()

        print("\n===== IMDb Dataset Summary =====\n")

        if 'titles_count' in stats:
            print(f"Total filtered movies: {stats['titles_count']:,}")

        if 'year_counts' in stats:
            print("\nMovies by year:")
            for year, count in sorted(stats['year_counts'].items()):
                print(f"  {year}: {count:,}")

        if 'genre_counts' in stats:
            print("\nTop genres:")
            sorted_genres = sorted(stats['genre_counts'].items(), key=lambda x: x[1], reverse=True)
            for genre, count in sorted_genres[:15]:
                if genre != '\\N' and not pd.isna(genre):
                    print(f"  {genre}: {count:,}")

        if 'avg_rating' in stats:
            print(f"\nAverage rating: {stats['avg_rating']:.2f}")
            print(f"Median rating: {stats['median_rating']:.2f}")
            print(f"Average votes: {stats['avg_votes']:,.0f}")
            print(f"Median votes: {stats['median_votes']:,.0f}")

        if 'category_counts' in stats:
            print("\nTop cast/crew categories:")
            sorted_categories = sorted(stats['category_counts'].items(), key=lambda x: x[1], reverse=True)
            for category, count in sorted_categories[:10]:
                print(f"  {category}: {count:,}")

        if 'names_count' in stats:
            print(f"\nTotal people (actors, directors, etc.): {stats['names_count']:,}")

    # Properties to access the dataframes
    @property
    def titles(self) -> pd.DataFrame:
        """Get the titles dataframe"""
        if not self._loaded:
            print("Data not loaded. Call load() first.")
            return pd.DataFrame()
        return self._titles

    @property
    def ratings(self) -> pd.DataFrame:
        """Get the ratings dataframe"""
        if not self._loaded:
            print("Data not loaded. Call load() first.")
            return pd.DataFrame()
        return self._ratings

    @property
    def principals(self) -> pd.DataFrame:
        """Get the principals dataframe"""
        if not self._loaded:
            print("Data not loaded. Call load() first.")
            return pd.DataFrame()
        return self._principals

    @property
    def names(self) -> pd.DataFrame:
        """Get the names dataframe"""
        if not self._loaded:
            print("Data not loaded. Call load() first.")
            return pd.DataFrame()
        return self._names

    @property
    def episodes(self) -> pd.DataFrame:
        """Get the episodes dataframe"""
        if not self._loaded:
            print("Data not loaded. Call load() first.")
            return pd.DataFrame()
        return self._episodes

    @property
    def akas(self) -> pd.DataFrame:
        """Get the akas dataframe"""
        if not self._loaded:
            print("Data not loaded. Call load() first.")
            return pd.DataFrame()
        return self._akas

    @property
    def crew(self) -> pd.DataFrame:
        """Get the crew dataframe"""
        if not self._loaded:
            print("Data not loaded. Call load() first.")
            return pd.DataFrame()
        return self._crew
3 Enriching With OMDB Data
Next, we’ll enhance our dataset with additional details from the Open Movie Database (OMDB) API. This will provide us with richer information including plot summaries, director names, and box office figures.
View Enriching With OMDB Data Code
load_dotenv(dotenv_path='nogit.apikey')  # or just .env if that's your filename
api_key = os.getenv("omdb_api_key")
if not api_key:
    raise ValueError("OMDB API key not found. Please set it in your environment or .env file")

# Function to download OMDB data for our IMDB movies
def enrich_with_omdb_data(imdb_loader, api_key, batch_size=10, pause_seconds=1,
                          output_path="nogit_omdb_enriched.parquet"):
    """
    Enhance IMDB data with data from OMDB API

    Args:
        imdb_loader: IMDbDataLoader with loaded data
        api_key: OMDB API key
        batch_size: Number of movies to fetch in each batch
        pause_seconds: Seconds to pause between batches
        output_path: Path to save the enriched dataset

    Returns:
        DataFrame with combined IMDB and OMDB data
    """
    # Initialize OMDB tools
    downloader = OMDbDownloader(api_key)
    converter = OMDbConverter()

    # Check if output already exists
    if os.path.exists(output_path):
        print(f"Loading existing enriched data from {output_path}")
        return pd.read_parquet(output_path)

    # Get all IMDB IDs
    imdb_ids = imdb_loader.titles['tconst'].tolist()
    total_ids = len(imdb_ids)
    print(f"Fetching OMDB data for {total_ids} movies...")

    # Process in batches
    all_omdb_data = []
    for i in tqdm(range(0, total_ids, batch_size)):
        # Get current batch
        batch_ids = imdb_ids[i:i+batch_size]

        try:
            # Fetch data for batch
            responses = downloader.fetch_by_ids(batch_ids)

            # Convert to DataFrame
            batch_df = converter.responses_to_dataframe(responses)

            # Clean the data
            batch_df = converter.clean_dataframe(batch_df)

            # Add to results
            all_omdb_data.append(batch_df)

            # Pause to respect rate limits
            time.sleep(pause_seconds)
        except Exception as e:
            print(f"Error processing batch {i//batch_size} (IDs {batch_ids}): {e}")

    # Combine all batches
    if all_omdb_data:
        omdb_df = pd.concat(all_omdb_data, ignore_index=True)
        print(f"Retrieved data for {len(omdb_df)} movies from OMDB")

        # Merge IMDB and OMDB data (left join to keep all IMDB entries)
        merged_df = pd.merge(
            imdb_loader.titles,
            omdb_df,
            left_on='tconst',
            right_on='imdbid',
            how='left'
        )

        # Save the enriched dataset
        merged_df.to_parquet(output_path)
        print(f"Saved enriched dataset to {output_path}")

        return merged_df
    else:
        print("No OMDB data retrieved")
        return imdb_loader.titles

# Get enriched dataset
enriched_df = enrich_with_omdb_data(imdb, api_key)

# Display the enriched data
print("\nSample of enriched data:")
display_cols = ['tconst', 'primaryTitle', 'startYear', 'genres', 'plot', 'director',
                'imdbrating', 'boxoffice_value', 'country']
display_cols = [col for col in display_cols if col in enriched_df.columns]
print(enriched_df[display_cols].head())

# Check what percentage of movies were successfully enriched
omdb_match_rate = enriched_df['imdbid'].notna().mean() * 100
print(f"\nSuccessfully enriched {omdb_match_rate:.1f}% of movies with OMDB data")
Loading existing enriched data from nogit_omdb_enriched.parquet
Sample of enriched data:
tconst primaryTitle startYear genres \
0 tt0069049 The Other Side of the Wind 2018.0 Drama
1 tt0293429 Mortal Kombat 2021.0 Action,Adventure,Fantasy
2 tt0315642 Wazir 2016.0 Crime,Drama,Mystery
3 tt0365545 Nappily Ever After 2018.0 Comedy,Drama,Romance
4 tt0369610 Jurassic World 2015.0 Action,Adventure,Sci-Fi
plot director \
0 A famed, and infamous, movie director, JJ Hann... Orson Welles
1 MMA fighter Cole Young (Lewis Tan), accustomed... Simon McQuoid
2 'Wazir' is a tale of two unlikely friends, a w... Bejoy Nambiar
3 Violet Jones (Lathan) doesn't realize it at fi... Haifaa Al-Mansour
4 Twenty-two years after the original Jurassic P... Colin Trevorrow
imdbrating boxoffice_value country
0 6.7 NaN France, Iran, United States
1 6.0 42326031.0 United States
2 7.1 1124045.0 India
3 6.4 NaN United States
4 6.9 653406625.0 United States, Canada
Successfully enriched 99.9% of movies with OMDB data
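The match rate reported above comes from a left join followed by a null check on the OMDB ID column. A minimal sketch of that pattern on made-up data (the three titles and two plots are illustrative only; the column names `tconst` and `imdbid` follow the enrichment code):

```python
import pandas as pd

# Illustrative stand-ins: three IMDb titles, OMDb data for only two of them.
titles = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "primaryTitle": ["Film A", "Film B", "Film C"],
})
omdb = pd.DataFrame({
    "imdbid": ["tt0000001", "tt0000003"],
    "plot": ["Plot of film A.", "Plot of film C."],
})

# A left join keeps every IMDb row; unmatched rows get NaN in the OMDb columns.
merged = titles.merge(omdb, left_on="tconst", right_on="imdbid", how="left")

# Share of rows with a non-null OMDb ID, expressed as a percentage.
match_rate = merged["imdbid"].notna().mean() * 100
print(f"{match_rate:.1f}% matched")
```

Because the join is a left join, the row count of the merged table always equals the number of IMDb titles, so a match rate below 100% signals missing OMDB records rather than dropped films.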
library code: omdb integration
import requests
import os
import json
import time
import pandas as pd
from typing import List, Dict, Optional, Union
from pathlib import Path


class OMDbDownloader:
    """
    A simple client for downloading movie data from the OMDb API
    """

    def __init__(self, api_key: str, cache_dir: str = "./omdb_cache"):
        """
        Initialize the OMDb downloader.

        Args:
            api_key: OMDb API key
            cache_dir: Directory to store cached responses
        """
        self.api_key = api_key
        self.base_url = "http://www.omdbapi.com/"
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True, parents=True)

        # Configure request timeout and retry settings
        self.timeout = 10
        self.max_retries = 3
        self.retry_delay = 1  # seconds

    def fetch_by_id(self, imdb_id: str, force_refresh: bool = False) -> Dict:
        """
        Fetch movie data for a single IMDb ID.

        Args:
            imdb_id: IMDb ID (e.g., tt0133093)
            force_refresh: Whether to force a refresh from the API

        Returns:
            Dictionary containing the raw API response
        """
        cache_file = self.cache_dir / f"{imdb_id}.json"

        # Try to load from cache if not forcing refresh
        if not force_refresh and cache_file.exists():
            try:
                with open(cache_file, 'r') as f:
                    return json.load(f)
            except json.JSONDecodeError:
                # Cache file is corrupted, continue to API call
                pass

        # Prepare API request parameters
        params = {
            'i': imdb_id,
            'apikey': self.api_key,
            'plot': 'full',
            'r': 'json'
        }

        # Make API request with retries
        data = None
        for attempt in range(self.max_retries):
            try:
                response = requests.get(
                    self.base_url,
                    params=params,
                    timeout=self.timeout
                )
                response.raise_for_status()  # Raise exception for HTTP errors
                data = response.json()
                break
            except (requests.RequestException, json.JSONDecodeError) as e:
                if attempt == self.max_retries - 1:
                    raise Exception(f"Failed to fetch data for {imdb_id}: {str(e)}")
                # Linear backoff: wait 1x, 2x, ... the base delay
                time.sleep(self.retry_delay * (attempt + 1))

        # Save to cache
        with open(cache_file, 'w') as f:
            json.dump(data, f)

        return data

    def fetch_by_ids(self, imdb_ids: List[str], force_refresh: bool = False) -> Dict[str, Dict]:
        """
        Fetch movie data for multiple IMDb IDs.

        Args:
            imdb_ids: List of IMDb IDs
            force_refresh: Whether to force a refresh from the API

        Returns:
            Dictionary mapping IMDb IDs to raw API responses
        """
        results = {}
        for imdb_id in imdb_ids:
            try:
                results[imdb_id] = self.fetch_by_id(imdb_id, force_refresh)
            except Exception as e:
                print(f"Error fetching {imdb_id}: {str(e)}")
                results[imdb_id] = {"Error": str(e), "Response": "False"}
        return results

    def search(self, title: str, year: Optional[int] = None, type_: Optional[str] = None) -> Dict:
        """
        Search for movies by title.

        Args:
            title: Movie title to search for
            year: Optional year of release
            type_: Optional type (movie, series, episode)

        Returns:
            Dictionary containing the raw API response
        """
        cache_key = f"search_{title}_{year}_{type_}"
        cache_file = self.cache_dir / f"{cache_key.replace(' ', '_')}.json"

        # Try to load from cache
        if cache_file.exists():
            try:
                with open(cache_file, 'r') as f:
                    return json.load(f)
            except json.JSONDecodeError:
                # Cache file is corrupted, continue to API call
                pass

        # Prepare API request parameters
        params = {
            's': title,
            'apikey': self.api_key,
            'r': 'json'
        }
        if year is not None:
            params['y'] = str(year)
        if type_ is not None:
            params['type'] = type_

        # Make API request with retries
        data = None
        for attempt in range(self.max_retries):
            try:
                response = requests.get(
                    self.base_url,
                    params=params,
                    timeout=self.timeout
                )
                response.raise_for_status()
                data = response.json()
                break
            except (requests.RequestException, json.JSONDecodeError) as e:
                if attempt == self.max_retries - 1:
                    raise Exception(f"Failed to search for '{title}': {str(e)}")
                time.sleep(self.retry_delay * (attempt + 1))

        # Save to cache
        with open(cache_file, 'w') as f:
            json.dump(data, f)

        return data


class OMDbConverter:
    """
    Convert OMDb API responses to pandas DataFrames
    """

    @staticmethod
    def response_to_dataframe(response: Dict) -> pd.DataFrame:
        """
        Convert a single OMDb API response to a DataFrame row.

        Args:
            response: OMDb API response

        Returns:
            DataFrame with one row
        """
        # Handle error responses
        if response.get('Response') == 'False':
            return pd.DataFrame({
                'imdb_id': [response.get('imdbID', 'N/A')],
                'error': [response.get('Error', 'Unknown error')]
            })

        # Extract all fields
        data = {k.lower(): [v] for k, v in response.items()}

        # Convert 'ratings' to separate columns
        if 'ratings' in data:
            ratings = data['ratings'][0]
            if isinstance(ratings, list):
                for rating in ratings:
                    source = rating.get('Source', '').replace(' ', '_').lower()
                    value = rating.get('Value', '')
                    data[f'rating_{source}'] = [value]
            # Remove the original ratings list
            del data['ratings']

        return pd.DataFrame(data)

    @staticmethod
    def responses_to_dataframe(responses: Dict[str, Dict]) -> pd.DataFrame:
        """
        Convert multiple OMDb API responses to a DataFrame.

        Args:
            responses: Dictionary mapping IMDb IDs to OMDb API responses

        Returns:
            DataFrame with one row per movie
        """
        # Convert each response to a DataFrame and concatenate
        dfs = []
        for imdb_id, response in responses.items():
            # Add imdb_id if not present in response
            if 'imdbID' not in response:
                response['imdbID'] = imdb_id
            dfs.append(OMDbConverter.response_to_dataframe(response))

        if not dfs:
            return pd.DataFrame()

        return pd.concat(dfs, ignore_index=True)

    @staticmethod
    def search_to_dataframe(search_response: Dict) -> pd.DataFrame:
        """
        Convert an OMDb API search response to a DataFrame.

        Args:
            search_response: OMDb API search response

        Returns:
            DataFrame with search results
        """
        # Handle error responses
        if search_response.get('Response') == 'False':
            return pd.DataFrame()

        # Extract search results
        search_results = search_response.get('Search', [])
        if not search_results:
            return pd.DataFrame()

        # Convert to DataFrame
        df = pd.DataFrame(search_results)

        # Normalize column names
        df.columns = [c.lower() for c in df.columns]

        return df

    @staticmethod
    def clean_dataframe(df: pd.DataFrame) -> pd.DataFrame:
        """
        Clean and normalize a DataFrame created from OMDb API responses.

        Args:
            df: DataFrame created from OMDb API responses

        Returns:
            Cleaned DataFrame
        """
        if df.empty:
            return df

        # Make a copy to avoid modifying the original
        df = df.copy()

        # Convert runtime to minutes (numeric)
        if 'runtime' in df.columns:
            df['runtime_minutes'] = df['runtime'].str.extract(r'(\d+)').astype(float)

        # Convert ratings to numeric
        for col in df.columns:
            if col.startswith('rating_') or col == 'imdbrating':
                # Handle ratings that might have formats like "8.5/10"
                df[col] = pd.to_numeric(df[col].str.replace(r'/.*$', '', regex=True), errors='coerce')

        # Convert votes to numeric - handle commas properly
        if 'imdbvotes' in df.columns:
            df['imdbvotes_numeric'] = pd.to_numeric(df['imdbvotes'].str.replace(',', ''), errors='coerce')

        # Convert box office to numeric - handle currency symbols and commas
        if 'boxoffice' in df.columns:
            # First extract the numeric part with commas removed
            df['boxoffice_value'] = df['boxoffice'].str.replace(r'[^\d.]', '', regex=True)
            # Then convert to float
            df['boxoffice_value'] = pd.to_numeric(df['boxoffice_value'], errors='coerce')

        # Create genre list column
        if 'genre' in df.columns:
            df['genre_list'] = df['genre'].str.split(',').apply(
                lambda x: [g.strip() for g in x] if isinstance(x, list) else []
            )

        # Create actors list column
        if 'actors' in df.columns:
            df['actor_list'] = df['actors'].str.split(',').apply(
                lambda x: [a.strip() for a in x] if isinstance(x, list) else []
            )

        # Create director list column
        if 'director' in df.columns:
            df['director_list'] = df['director'].str.split(',').apply(
                lambda x: [d.strip() for d in x] if isinstance(x, list) else []
            )

        return df
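The string clean-up in clean_dataframe relies on three small regular expressions. The sketch below applies the same patterns to single strings using only the standard library. The helper names (`first_number`, `strip_to_number`, `rating_value`) are hypothetical, and the sample inputs are assumed to resemble OMDb field values rather than taken from a real response.

```python
import re
from typing import Optional

def first_number(text: str) -> Optional[float]:
    """Return the first run of digits as a float, e.g. '148 min' gives 148.0."""
    match = re.search(r"\d+", text)
    return float(match.group()) if match else None

def strip_to_number(text: str) -> Optional[float]:
    """Drop everything except digits and dots, e.g. '$42,326,031' gives 42326031.0."""
    digits = re.sub(r"[^\d.]", "", text)
    try:
        return float(digits)
    except ValueError:
        return None  # e.g. 'N/A' leaves an empty string

def rating_value(text: str) -> Optional[float]:
    """Keep the part before a slash, e.g. '8.5/10' gives 8.5."""
    try:
        return float(re.sub(r"/.*$", "", text))
    except ValueError:
        return None  # e.g. '93%' is not a plain number

print(first_number("148 min"), strip_to_number("$42,326,031"), rating_value("8.5/10"))
```

Values that cannot be parsed come back as None here, which plays the role that NaN plays when pandas coerces errors. One consequence worth noting is that percentage-style ratings are not converted by the slash pattern alone.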
4 Exploratory Data Analysis
Now let’s explore our enriched dataset to understand the distribution of genres, ratings, and other key characteristics.
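As a small illustration of how genre shares are derived, the sketch below counts genres over a handful of made-up films using only the standard library. The four genre strings are illustrative; the real analysis runs over the full enriched dataset.

```python
from collections import Counter

# Illustrative comma-separated genre strings in the IMDb style.
genre_strings = [
    "Drama",
    "Action,Adventure,Fantasy",
    "Crime,Drama,Mystery",
    "Comedy,Drama,Romance",
]

# Split each string into a clean list of genre labels.
genre_lists = [[g.strip() for g in s.split(",")] for s in genre_strings]

# Count how many films carry each genre label.
genre_counts = Counter(g for genres in genre_lists for g in genres)

# Shares are relative to the number of films, so they can sum to more than 100.
n_films = len(genre_lists)
shares = {g: 100 * c / n_films for g, c in genre_counts.most_common()}

for genre, pct in shares.items():
    print(f"{genre}: {pct:.1f}%")
```

Because a film can carry up to three genres, the percentages overlap. In the dataset summary earlier in this report, Drama appears on 2,396 of 3,685 films, which is about 65% of the collection.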
View Exploratory Data Analysis Code
# Create genre list if not already present
if 'genre_list' not in enriched_df.columns and 'genres' in enriched_df.columns:
    enriched_df['genre_list'] = enriched_df['genres'].str.split(',').apply(
        lambda x: [g.strip() for g in x] if isinstance(x, list) else []
    )

# Count genre occurrences
all_genres = []
for genres in enriched_df['genre_list']:
    if isinstance(genres, list):
        all_genres.extend(genres)

genre_counts = Counter(all_genres)
top_genres = genre_counts.most_common()

# Create a dataframe for visualization
genre_df = pd.DataFrame(top_genres, columns=['Genre', 'Count'])
genre_df['Percentage'] = genre_df['Count'] / len(enriched_df) * 100

# Plot genre distribution
plt.figure(figsize=(12, 6))
sns.barplot(x='Genre', y='Count', data=genre_df.head(15))
plt.title('Top 15 Genres in Dataset')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
genre_dist_plot = save_plot_as_base64(plt, width=700)

# Display the plot
display(HTML(genre_dist_plot))

# Analyze ratings distribution
plt.figure(figsize=(10, 6))
sns.histplot(enriched_df['imdbrating'].dropna(), bins=20, kde=True)
plt.title('Distribution of IMDb Ratings')
plt.xlabel('Rating')
plt.ylabel('Count')
rating_dist_plot = save_plot_as_base64(plt, width=700)
display(HTML(rating_dist_plot))

# Analyze box office distribution
if 'boxoffice_value' in enriched_df.columns:
    # Filter out missing and non-positive values
    boxoffice_data = enriched_df['boxoffice_value'].dropna()
    boxoffice_data = boxoffice_data[boxoffice_data > 0]

    plt.figure(figsize=(10, 6))
    sns.histplot(boxoffice_data, bins=20, kde=True)
    plt.title('Distribution of Box Office Earnings')
    plt.xlabel('Box Office Value')
    plt.ylabel('Count')
    plt.ticklabel_format(style='plain', axis='x')
    boxoffice_dist_plot = save_plot_as_base64(plt, width=700)
    display(HTML(boxoffice_dist_plot))

# Movies by year
plt.figure(figsize=(10, 6))
year_counts = enriched_df['startYear'].value_counts().sort_index()
sns.barplot(x=year_counts.index, y=year_counts.values)
plt.title('Movies by Release Year')
plt.xlabel('Year')
plt.ylabel('Count')
plt.xticks(rotation=45)
year_dist_plot = save_plot_as_base64(plt, width=700)
display(HTML(year_dist_plot))

# Analysis by country
if 'country' in enriched_df.columns:
    # Extract primary country
    enriched_df['primary_country'] = enriched_df['country'].str.split(',').str[0]

    # Count by country
    country_counts = enriched_df['primary_country'].value_counts().head(15)

    plt.figure(figsize=(12, 6))
    sns.barplot(x=country_counts.index, y=country_counts.values)
    plt.title('Top 15 Countries of Origin')
    plt.xlabel('Country')
    plt.ylabel('Count')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    country_dist_plot = save_plot_as_base64(plt, width=800)
    display(HTML(country_dist_plot))
5 Genre Network Structure and Methodology
Let’s analyze how different genres relate to each other using network analysis. We’ll create a network where nodes are genres and edges represent co-occurrence in movies.
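To make the co-occurrence idea concrete, here is a plain-Python sketch of the edge-weight bookkeeping, written without NetworkX so the counting is visible. The five genre lists are illustrative, and the density formula is the standard one for undirected graphs, 2m / (n(n - 1)).

```python
from collections import Counter
from itertools import combinations

# Illustrative films, each with its list of genres.
genre_lists = [
    ["Drama"],
    ["Action", "Adventure", "Fantasy"],
    ["Crime", "Drama", "Mystery"],
    ["Comedy", "Drama", "Romance"],
    ["Action", "Adventure", "Sci-Fi"],
]

edge_weights = Counter()
for genres in genre_lists:
    # Sort so (A, B) and (B, A) are counted as the same undirected edge.
    for pair in combinations(sorted(set(genres)), 2):
        edge_weights[pair] += 1

# Nodes are all genres seen; single-genre films add a node but no edge.
nodes = {g for genres in genre_lists for g in genres}
n, m = len(nodes), len(edge_weights)

# Density: fraction of possible genre pairs that co-occur at least once.
density = 2 * m / (n * (n - 1))

print(edge_weights.most_common(3))
print(f"{n} nodes, {m} edges, density {density:.4f}")
```

Each film with k genres contributes k(k - 1)/2 pairs, so a three-genre film adds or strengthens three edges. The edge weight is the number of films in which the two genres appear together.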
5.1 Core Network Creation and Analysis
View Core Network Creation and Analysis Code
G_genre = nx.Graph()

# Add nodes for each genre
for genre in set(all_genres):
    G_genre.add_node(genre)

# Add edges for co-occurring genres
for genres in enriched_df['genre_list']:
    if not isinstance(genres, list) or len(genres) < 2:
        continue
    for i in range(len(genres)):
        for j in range(i + 1, len(genres)):
            if G_genre.has_edge(genres[i], genres[j]):
                G_genre[genres[i]][genres[j]]['weight'] += 1
            else:
                G_genre.add_edge(genres[i], genres[j], weight=1)

# Calculate genre popularity for later use
genre_popularity = {genre: all_genres.count(genre) for genre in set(all_genres)}

# Calculate basic network statistics
density = nx.density(G_genre)
avg_clustering = nx.average_clustering(G_genre)
transitivity = nx.transitivity(G_genre)
print("Genre Network Statistics:")
print(f"Number of Nodes (Genres): {G_genre.number_of_nodes()}")
print(f"Number of Edges (Co-occurrences): {G_genre.number_of_edges()}")
print(f"Network Density: {density:.4f}")
print(f"Average Clustering Coefficient: {avg_clustering:.4f}")
print(f"Transitivity: {transitivity:.4f}")

# Calculate node centrality measures
degree_centrality = nx.degree_centrality(G_genre)
betweenness_centrality = nx.betweenness_centrality(G_genre)
eigenvector_centrality = nx.eigenvector_centrality(G_genre, max_iter=1000)

# Create centrality dataframe
centrality_data = []
for genre in G_genre.nodes():
    centrality_data.append({
        'Genre': genre,
        'Degree Centrality': degree_centrality[genre],
        'Betweenness Centrality': betweenness_centrality[genre],
        'Eigenvector Centrality': eigenvector_centrality[genre]
    })
centrality_df = pd.DataFrame(centrality_data)
centrality_df = centrality_df.sort_values('Eigenvector Centrality', ascending=False)
print("\nGenre Centrality Measures (Top 5):")
print(centrality_df.head())

# Find minimum edge weight threshold to keep ~30% of strongest connections
edge_weights = [G_genre[u][v]['weight'] for u, v in G_genre.edges()]
weight_threshold = np.percentile(edge_weights, 70)
print(f"\nEdge weight threshold (keeping top 30% of connections): {weight_threshold:.1f}")

# Create a simplified graph with only strong connections
G_simplified = nx.Graph()
for genre in G_genre.nodes():
    G_simplified.add_node(genre)
for u, v, data in G_genre.edges(data=True):
    if data['weight'] >= weight_threshold:
        G_simplified.add_edge(u, v, weight=data['weight'])
print(f"Simplified network: {G_simplified.number_of_nodes()} nodes, {G_simplified.number_of_edges()} edges")
Genre Network Statistics:
Number of Nodes (Genres): 23
Number of Edges (Co-occurrences): 167
Network Density: 0.6601
Average Clustering Coefficient: 0.8419
Transitivity: 0.7838
Genre Centrality Measures (Top 5):
    Genre      Degree Centrality  Betweenness Centrality  Eigenvector Centrality
11  Drama      1.000000           0.091866                0.268912
16  Comedy     0.954545           0.065532                0.265323
6   Adventure  0.909091           0.041460                0.255971
0   Action     0.863636           0.020897                0.252382
5   Romance    0.863636           0.022221                0.251156
Edge weight threshold (keeping top 30% of connections): 35.2
Simplified network: 23 nodes, 50 edges
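The printed statistics can be sanity-checked by hand. For an undirected graph with N nodes and E edges, density is E divided by the N(N-1)/2 possible pairs. The sketch below uses only the node and edge counts printed above; the variable names are ours and purely illustrative.

```python
# Reproduce the printed network statistics from the node and edge counts alone.
n_nodes = 23           # genres
n_edges = 167          # distinct genre pairs that co-occur at least once
n_strong_edges = 50    # edges kept in the simplified network

possible_edges = n_nodes * (n_nodes - 1) // 2   # all possible genre pairs
density = n_edges / possible_edges
kept_share = n_strong_edges / n_edges
simplified_density = n_strong_edges / possible_edges

print(f"Possible pairs:      {possible_edges}")
print(f"Density:             {density:.4f}")
print(f"Share of edges kept: {kept_share:.1%}")
print(f"Simplified density:  {simplified_density:.4f}")
```

Roughly two thirds of all possible genre pairs co-occur at least once, which is why the weakest edges are dropped before looking for communities: in the full graph almost everything is connected to almost everything else.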
5.2 Genre Co-occurrence Matrix
View Genre Co-occurrence Matrix Code
top_genres = [genre for genre, count in genre_counts.most_common(10)]
top_genres_set = set(top_genres)

# Create a subgraph of just the top genres
G_top = nx.Graph()
for genre in top_genres:
    G_top.add_node(genre)
for u, v, data in G_genre.edges(data=True):
    if u in top_genres_set and v in top_genres_set:
        G_top.add_edge(u, v, weight=data['weight'])

# Create a co-occurrence matrix
genre_matrix = np.zeros((len(top_genres), len(top_genres)))
for i, genre1 in enumerate(top_genres):
    for j, genre2 in enumerate(top_genres):
        if i == j:  # Self-connection, use total count
            genre_matrix[i, j] = genre_popularity[genre1]
        elif G_genre.has_edge(genre1, genre2):
            genre_matrix[i, j] = G_genre[genre1][genre2]['weight']

# Create a heatmap of genre co-occurrences for the top genres
plt.figure(figsize=(12, 10))
sns.heatmap(genre_matrix, annot=True, fmt=".0f", cmap="YlGnBu",
            xticklabels=top_genres, yticklabels=top_genres)
plt.title('Co-occurrence Matrix of Top 10 Genres', fontsize=16)
plt.tight_layout()
genre_heatmap = save_plot_as_base64(plt, width=800)
display(HTML(genre_heatmap))
5.2.1 Interpretation: Co-occurrence Matrix of Top 10 Genres
The co-occurrence heatmap reveals the frequency with which genres appear together in films. Drama emerges as the dominant genre with 2,396 movies, forming strong partnerships with Crime (442 co-occurrences), Romance (369), and Biography (368). Other notable patterns include the frequent pairing of Action and Adventure (286 movies), Mystery and Horror (111 movies), and Comedy and Romance (212 movies).
The diagonal values represent the total count of each genre, showing Drama (2,396), Comedy (1,075), and Action (883) as the most common genres in our dataset. The color intensity provides a visual indication of relationship strength, with the darkest shades representing the most frequent occurrences.
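To make the layout of the matrix concrete, here is a minimal sketch on invented toy data (the film list and genre order are hypothetical, not taken from our dataset). Diagonal cells count films carrying a genre; off-diagonal cells count films carrying both genres.

```python
from itertools import combinations

# Hypothetical toy data: each film is a list of genre labels.
films = [
    ["Drama", "Crime"],
    ["Drama", "Romance"],
    ["Drama", "Crime", "Thriller"],
    ["Comedy", "Romance"],
    ["Drama"],
]
genres = ["Drama", "Crime", "Romance", "Comedy", "Thriller"]
index = {g: i for i, g in enumerate(genres)}

matrix = [[0] * len(genres) for _ in genres]
for film in films:
    unique = sorted(set(film), key=index.get)
    for g in unique:
        matrix[index[g]][index[g]] += 1      # diagonal: films carrying the genre
    for a, b in combinations(unique, 2):
        matrix[index[a]][index[b]] += 1      # off-diagonal: films carrying both
        matrix[index[b]][index[a]] += 1      # keep the matrix symmetric

for g, row in zip(genres, matrix):
    print(f"{g:<9}", row)
```

Reading a row against its diagonal value gives a conditional share. With the figures reported above, for example, 442 of the 2,396 dramas (about 18%) are also tagged Crime.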
The bridge-genre analysis uses betweenness centrality to identify which genres serve as connectors between different film communities. Drama has the highest betweenness centrality (0.092) by a clear margin, confirming its role as the universal connector in the film ecosystem. Comedy ranks second (0.066), followed by Adventure, Biography, and Documentary. These genres facilitate cross-community storytelling, allowing themes and elements to flow between otherwise disparate film types.
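Betweenness centrality measures how often a node lies on the shortest paths between other pairs of nodes. Our analysis relies on `nx.betweenness_centrality`; the brute-force sketch below is only meant to show what that number means, on a hypothetical five-genre graph. It normalizes by the number of node pairs that exclude the node itself, which is our understanding of the networkx convention for undirected graphs.

```python
from collections import deque
from itertools import combinations

def shortest_paths(adj, s, t):
    """Return all shortest paths from s to t in an unweighted graph (BFS)."""
    dist, preds, queue = {s: 0}, {s: []}, deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                preds[v] = [u]
                queue.append(v)
            elif dist[v] == dist[u] + 1:
                preds[v].append(u)      # another equally short route into v
    if t not in dist:
        return []

    def walk(v):
        if v == s:
            return [[s]]
        return [path + [v] for u in preds[v] for path in walk(u)]

    return walk(t)

def betweenness(adj):
    """Normalized betweenness: share of shortest paths passing through each node."""
    nodes = list(adj)
    score = dict.fromkeys(nodes, 0.0)
    for s, t in combinations(nodes, 2):
        paths = shortest_paths(adj, s, t)
        for v in nodes:
            if paths and v not in (s, t):
                score[v] += sum(v in p for p in paths) / len(paths)
    n = len(nodes)
    pairs_excluding_node = (n - 1) * (n - 2) / 2
    return {v: score[v] / pairs_excluding_node for v in nodes}

# Hypothetical toy graph: Drama touches everything, Comedy and Crime also touch each other.
adj = {
    "Drama": {"Comedy", "Crime", "War", "Horror"},
    "Comedy": {"Drama", "Crime"},
    "Crime": {"Drama", "Comedy"},
    "War": {"Drama"},
    "Horror": {"Drama"},
}
for genre, value in sorted(betweenness(adj).items(), key=lambda kv: -kv[1]):
    print(f"{genre:<7} {value:.3f}")
```

In this toy graph only the Comedy and Crime pair can reach each other without passing through Drama, so Drama carries five of the six remaining pairs and every other genre scores zero.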
5.4 Unexpected Genre Combinations
View Unexpected Genre Combinations Code
unexpected_combos = []
for u, v, data in G_genre.edges(data=True):
    # Expected co-occurrence if genres appeared together randomly
    expected = (genre_popularity[u] * genre_popularity[v]) / len(enriched_df)
    actual = data['weight']
    ratio = actual / expected if expected > 0 else 0
    if ratio > 2 and actual > 5:  # Much more common than expected
        unexpected_combos.append((u, v, actual, ratio))

# Create visualization of unexpected genre combinations
plt.figure(figsize=(14, 8))
unexpected_df = pd.DataFrame(
    sorted(unexpected_combos, key=lambda x: x[3], reverse=True)[:10],
    columns=['Genre1', 'Genre2', 'Count', 'Ratio']
)
unexpected_df['Pair'] = unexpected_df.apply(lambda x: f"{x['Genre1']} + {x['Genre2']}", axis=1)
sns.barplot(x='Ratio', y='Pair', data=unexpected_df, palette='rocket')
plt.title('Unexpected Genre Combinations\n(Appearing more often than random chance would predict)', fontsize=16)
plt.xlabel('Times more frequent than expected', fontsize=14)
plt.ylabel('Genre Combination', fontsize=14)

# Add count annotations
for i, row in enumerate(unexpected_df.itertuples()):
    plt.text(row.Ratio + 0.1, i, f"{row.Count} movies", va='center', fontsize=12)
plt.tight_layout()
unexpected_bar = save_plot_as_base64(plt, width=800)
display(HTML(unexpected_bar))
This visualization highlights genre pairings that occur more frequently than random chance would predict. Animation + Adventure appears 5.2x more often than expected (178 movies), revealing a strong established tradition. Other notable unexpected combinations include Documentary + Music (3.9x, 33 movies), History + War (3.8x, 20 movies), and Mystery + Horror (3.5x, 111 movies). These statistical outliers point to specialized sub-genres that have developed their own conventions and audience expectations.
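The "times more frequent than expected" ratio is the observed number of films with both genres divided by the number we would expect if the two genres were assigned independently, namely (films with A) x (films with B) / (all films). A minimal sketch of that arithmetic follows; the `lift` helper and the counts are hypothetical and chosen only to make the numbers easy to follow.

```python
def lift(n_both, n_a, n_b, n_films):
    """Observed co-occurrences divided by the count expected under independence."""
    expected = n_a * n_b / n_films
    return n_both / expected if expected > 0 else 0.0

# Hypothetical counts, for illustration only.
n_films = 1000
expected = 100 * 150 / n_films
ratio = lift(n_both=60, n_a=100, n_b=150, n_films=n_films)
print(f"expected = {expected:.0f} films, observed = 60, ratio = {ratio:.1f}x")
```

A ratio of 1 means the pair appears exactly as often as chance would predict. Our chart keeps only pairs with a ratio above 2 and more than five films, so very rare pairings cannot dominate the ranking on the strength of a handful of titles.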
5.5 Genre Communities
View Genre Communities Code
communities = community.greedy_modularity_communities(G_simplified)

# Create visualization of communities
community_sizes = [len(comm) for comm in communities]
community_df = pd.DataFrame({
    'Community': [f"Community {i+1}" for i in range(len(communities))],
    'Size': community_sizes,
    'Genres': [', '.join(sorted(comm)) for comm in communities]
})
community_df = community_df.sort_values('Size', ascending=False)
plt.figure(figsize=(14, 10))
sns.barplot(x='Size', y='Community', data=community_df, palette='Set2')
plt.title('Genre Communities Identified in the Network', fontsize=16)
plt.xlabel('Number of Genres in Community', fontsize=14)
plt.ylabel('Community', fontsize=14)

# Add genre text annotations
for i, row in enumerate(community_df.itertuples()):
    if len(row.Genres) > 60:  # If text is too long, truncate
        genres_text = row.Genres[:57] + '...'
    else:
        genres_text = row.Genres
    plt.text(row.Size + 0.1, i, genres_text, va='center', fontsize=11)
plt.tight_layout()
community_bar = save_plot_as_base64(plt, width=800)
display(HTML(community_bar))
5.5.1 Interpretation: Genre Communities
Our analysis identified eight distinct communities of genres that naturally cluster together:
Emotional/character-driven narratives: Comedy, Drama, Family, Music, Romance, War (6 genres)
Fact-based content: Biography, Crime, Documentary, History, Sport (5 genres)
Adult: Standing alone as its own community (1 genre)
Western: Standing alone as its own community (1 genre)
News: Standing alone as its own community (1 genre)
Musical: Standing alone as its own community (1 genre)
This community structure reveals how certain genres naturally align in filmmaking practices, while others maintain distinct identities with minimal overlap with other genres.
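The communities come from `greedy_modularity_communities`, which starts with every genre in its own community and keeps merging as long as modularity improves. The sketch below does not re-implement that search; it only computes modularity by hand for two candidate partitions of a hypothetical toy graph, to show what the algorithm is rewarding. The `modularity` helper and the edge list are our own illustration. One caveat, as we understand recent networkx releases: the `weight` argument defaults to `None`, so the call above would treat the thresholded edges as unweighted.

```python
def modularity(edges, partition):
    """Newman modularity Q of a partition of an unweighted, undirected graph."""
    m = len(edges)
    community_of = {node: i for i, comm in enumerate(partition) for node in comm}
    internal = [0] * len(partition)   # edges with both ends inside the community
    degree = [0] * len(partition)     # summed degree of the community's nodes
    for u, v in edges:
        degree[community_of[u]] += 1
        degree[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(l / m - (d / (2 * m)) ** 2 for l, d in zip(internal, degree))

# Hypothetical toy graph: two tight triangles joined by a single edge, plus
# "Western", which keeps no strong edges after thresholding.
edges = [
    ("Drama", "Comedy"), ("Comedy", "Romance"), ("Drama", "Romance"),
    ("Action", "Adventure"), ("Adventure", "Sci-Fi"), ("Action", "Sci-Fi"),
    ("Romance", "Action"),
]
split = [{"Drama", "Comedy", "Romance"}, {"Action", "Adventure", "Sci-Fi"}, {"Western"}]
merged = [{"Drama", "Comedy", "Romance", "Action", "Adventure", "Sci-Fi"}, {"Western"}]

print(f"Q(split)  = {modularity(edges, split):.3f}")
print(f"Q(merged) = {modularity(edges, merged):.3f}")
```

Keeping the two triangles apart scores higher than merging them, so a modularity-driven method stops there. A genre with no strong edges adds nothing to any community, which is presumably why Adult, Western, News, and Musical each end up alone.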
5.6 Top 15 Genres Network
View Top 15 Genres Network Code
plt.figure(figsize=(16, 12))

# Get the top 15 genres by popularity
top_n_genres = 15
visual_genres = [genre for genre, _ in genre_counts.most_common(top_n_genres)]
visual_genres_set = set(visual_genres)

# Create subgraph
G_visual = nx.Graph()
for genre in visual_genres:
    G_visual.add_node(genre)

# Add edges between these genres only if they have a strong connection
for u, v, data in G_genre.edges(data=True):
    if u in visual_genres_set and v in visual_genres_set:
        # Only include edges with significant weight (top 30%)
        if data['weight'] >= weight_threshold:
            G_visual.add_edge(u, v, weight=data['weight'])

# Use a more controlled layout for better spacing
pos = nx.kamada_kawai_layout(G_visual)

# Node colors based on community
genre_to_community = {}
for i, comm in enumerate(communities):
    for genre in comm:
        genre_to_community[genre] = i

# Draw edges with varying thickness based on connection strength
edge_weights = [G_visual[u][v]['weight'] / 50 for u, v in G_visual.edges()]
nx.draw_networkx_edges(G_visual, pos, width=edge_weights, alpha=0.7, edge_color='gray')

# Map node sizes to genre popularity
node_sizes = [genre_popularity.get(genre, 10) / 5 for genre in G_visual.nodes()]

# Define a color map based on communities
cmap = plt.cm.tab20
node_colors = [cmap(genre_to_community.get(genre, 0) % 20) for genre in G_visual.nodes()]

# Draw nodes
nx.draw_networkx_nodes(G_visual, pos, node_size=node_sizes, node_color=node_colors,
                       alpha=0.9, linewidths=2, edgecolors='white')

# Draw labels with better visibility
nx.draw_networkx_labels(G_visual, pos, font_size=10, font_weight='bold', font_color='black',
                        bbox=dict(boxstyle="round,pad=0.3", fc="white", ec="none", alpha=0.9))

# Add a title
plt.title('Network of Top 15 Genres\nNode size = popularity, Edge thickness = co-occurrence frequency', fontsize=16)
plt.axis('off')
plt.tight_layout()
visual_network = save_plot_as_base64(plt, width=800)
display(HTML(visual_network))
5.6.1 Interpretation: Top 15 Genres Network
The network visualization displays the relationships among the most popular genres, with node size indicating popularity and edge thickness representing co-occurrence frequency. Drama occupies the central position in this network, with strong connections to multiple genres. The layout reveals the clustering patterns identified in our community analysis, with clear groupings around similar narrative approaches.
5.7 Community 1 Internal Structure
View Community 1 Internal Structure Code
largest_community = max(communities, key=len)
community_name = f"Community {list(communities).index(largest_community) + 1}"
plt.figure(figsize=(14, 10))

# Create subgraph for this community
G_comm = nx.Graph()
for genre in largest_community:
    G_comm.add_node(genre)
for u, v, data in G_genre.edges(data=True):
    if u in largest_community and v in largest_community:
        G_comm.add_edge(u, v, weight=data['weight'])

# Use a spring layout for this smaller graph
pos_comm = nx.spring_layout(G_comm, k=0.3, seed=42)

# Draw edges
edge_weights = [G_comm[u][v]['weight'] / 20 for u, v in G_comm.edges()]
nx.draw_networkx_edges(G_comm, pos_comm, width=edge_weights, alpha=0.7, edge_color='darkblue')

# Node sizes based on popularity
node_sizes = [genre_popularity.get(genre, 10) * 2 for genre in G_comm.nodes()]

# Draw nodes
nx.draw_networkx_nodes(G_comm, pos_comm, node_size=node_sizes, node_color='lightblue',
                       alpha=0.9, edgecolors='blue')

# Draw labels
nx.draw_networkx_labels(G_comm, pos_comm, font_size=12, font_weight='bold',
                        bbox=dict(boxstyle="round,pad=0.3", fc="white", ec="none", alpha=0.9))

# Title
plt.title(f'Connections Within {community_name}: {", ".join(sorted(largest_community))}', fontsize=16)
plt.axis('off')
plt.tight_layout()
community_network = save_plot_as_base64(plt, width=800)
display(HTML(community_network))
5.7.1 Interpretation: Community 1 Internal Structure
The detailed view of Community 1 (Comedy, Drama, Family, Music, Romance, War) shows that even within this community, Drama serves as the central hub. The strongest connections exist between Drama and Comedy, Drama and Romance, and Comedy and Romance, suggesting these form the core triangle of character-driven storytelling. Family, Music, and War connect primarily through Drama rather than directly to each other, positioning Drama as the essential mediating genre in this community.
5.8 Summary and Conclusions
Together, these visualizations provide a data-driven map of the modern film genre landscape, revealing both the established patterns and surprising connections that shape contemporary cinema. The analysis shows:
Drama is the most central and versatile genre, connecting with nearly all other genres
Clear genre communities exist based on storytelling approach (character-driven, fact-based, spectacle-based, tension-driven)
Some unexpected genre combinations occur significantly more often than random chance would predict
Several genres (Adult, Western, News, Musical) maintain distinct identities with limited connections to other genres
Within communities, there are typically central hub genres that connect to peripheral genres
This network-based approach to analyzing film genres provides valuable insights for filmmakers, critics, and audiences in understanding the structure and evolution of cinematic storytelling.
6 Text Analysis of Movie Plots
Now let’s analyze the plot descriptions to understand linguistic patterns across different genres.
6.1 Common Terms Analysis
For this first section, let’s look at the corpus as a whole before narrowing to individual genres such as Drama or Action. What does the language of all plot descriptions look like taken together?
View Common Terms Analysis Code
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Tokenize and preprocess text
def preprocess_text(text):
    if not isinstance(text, str):
        return []
    # Tokenize
    tokens = word_tokenize(text.lower())
    # Remove punctuation and numbers
    tokens = [word for word in tokens if word.isalpha()]
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    # Lemmatize
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return tokens

# Process plot descriptions
if 'plot' in enriched_df.columns:
    enriched_df['plot_tokens'] = enriched_df['plot'].apply(preprocess_text)
    # Calculate token counts
    enriched_df['plot_token_count'] = enriched_df['plot_tokens'].apply(len)
    # Create a frequency distribution of all tokens
    all_plot_tokens = [token for tokens in enriched_df['plot_tokens'] for token in tokens]
    plot_fdist = FreqDist(all_plot_tokens)
    # either this or the chart, let's see which we like more.
    # print("\nMost Common Terms in Plot Descriptions:")
    # for word, count in plot_fdist.most_common(20):
    #     print(f"{word}: {count}")
    # Plot term frequency
    plt.figure(figsize=(12, 6))
    plot_fdist.plot(30, cumulative=False)
    plt.title('30 Most Common Terms in Plot Descriptions')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    freq_plot = save_plot_as_base64(plt)
    display(HTML(freq_plot))

wordcloud = WordCloud(
    width=default_fig_width,
    height=400,
    background_color='white',
    max_words=100,
    contour_width=3
).generate(' '.join(all_plot_tokens))
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Plot Terms')
wordcloud_plot = save_plot_as_base64(plt, width=700)
display(HTML(wordcloud_plot))
6.1.1 Interpretation: Common Terms Analysis
The frequency analysis of plot descriptions reveals the storytelling priorities of contemporary cinema. The term “life” dominates with 1,262 occurrences, highlighting film’s enduring focus on human experience and existential themes. This universal concern is complemented by narrative framework terms such as “find” (687), “world” (637), and “story” (613), which establish the quest-driven nature of modern screenplays.
Relationship-focused words create the second major thematic cluster, with “family” (672), “friend” (501), “father” (370), and various interpersonal connections appearing prominently. These terms underscore cinema’s preoccupation with human bonds as both emotional anchors and sources of conflict. The frequency gap between “man” (491) and “woman” (383) suggests a persistent gender imbalance in how plot descriptions frame characters across genres.
Action-oriented verbs (“find,” “get,” “take”) appear with high frequency, reflecting cinema’s preference for goal-driven narratives with clear stakes and motivations. Meanwhile, temporal markers like “time,” “year,” and “day” serve as structural elements that frame these narrative journeys.
The word cloud visualization effectively maps this emotional and thematic landscape, with the size variations between terms offering immediate insight into storytelling priorities. The prominence of both intimate terms (“life,” “family,” “father”) and expansive concepts (“world,” “time”) illustrates how cinema constantly navigates between personal stories and broader contexts. These linguistic patterns transcend individual genres to form the foundational vocabulary of contemporary filmmaking, upon which genre-specific language variations are built.
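The counts quoted above come from a simple pipeline: lowercase, tokenize, drop non-alphabetic tokens and stop words, then tally. The sketch below shows that pipeline in miniature with the standard library only. The stop-word list and plot summaries are invented for illustration, and unlike our analysis it uses a regular expression instead of NLTK's tokenizer and skips lemmatization.

```python
import re
from collections import Counter

# Hypothetical mini stop-word list; the analysis above uses NLTK's English list.
STOP_WORDS = {"a", "an", "the", "to", "of", "and", "his", "her", "in", "is", "for", "with"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens only, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

# Invented plot summaries, for illustration only.
plots = [
    "A father tries to find his missing daughter.",
    "Two friends find a map and change their life.",
    "Her life falls apart in the city.",
]

counts = Counter(token for plot in plots for token in tokenize(plot))
for word, count in counts.most_common(3):
    print(f"{word}: {count}")
```

Because stop words are removed before counting, what rises to the top are content words such as “find” and “life”, which is the pattern the full corpus shows at scale.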
6.2 Genre-Specific Word Clouds
View Genre-Specific Word Clouds Code
# Group plots by genre for genre-specific analysis
genre_texts = {}
for genre in set(all_genres):
    # Get movies with this genre
    genre_movies = enriched_df[enriched_df['genre_list'].apply(
        lambda x: genre in x if isinstance(x, list) else False
    )]
    # Combine plot tokens
    genre_tokens = [token for tokens in genre_movies['plot_tokens'] for token in tokens]
    genre_texts[genre] = genre_tokens

# Create word clouds for top genres
top_n_genres = 5
top_genres = [genre for genre, _ in genre_counts.most_common(top_n_genres)]
for genre in top_genres:
    if genre in genre_texts and len(genre_texts[genre]) > 50:
        genre_wordcloud = WordCloud(
            width=default_fig_width,
            height=400,
            background_color='white',
            max_words=100,
            contour_width=3
        ).generate(' '.join(genre_texts[genre]))
        plt.figure(figsize=(10, 7))
        plt.imshow(genre_wordcloud, interpolation='bilinear')
        plt.axis('off')
        plt.title(f'{genre} Word Cloud')
        genre_wordcloud_plot = save_plot_as_base64(plt, width=700)
        display(HTML(f"<h3>{genre} Word Cloud</h3>"))
        display(HTML(genre_wordcloud_plot))
Drama Word Cloud
Comedy Word Cloud
Action Word Cloud
Crime Word Cloud
Thriller Word Cloud
6.2.1 Interpretation: Genre-Specific Word Clouds
Drama Word Cloud
The Drama genre’s word cloud highlights its focus on interpersonal dynamics and emotional journeys. “Life” dominates the visualization, reflecting the genre’s exploration of human existence. Terms like “father,” “family,” “mother,” and “wife” demonstrate Drama’s emphasis on familial relationships and domestic situations. The prominence of emotionally charged words such as “love,” “want,” and “need” illustrates the internal conflicts that drive dramatic narratives. Unlike action-oriented genres, Drama’s language centers on emotional states, relationships, and personal growth, creating character-driven rather than plot-driven stories.
Comedy Word Cloud
Comedy’s word cloud reveals a distinctive linguistic pattern centered on social interactions and lighthearted situations. While “life” remains prominent, terms like “friend,” “help,” and “plan” suggest the collaborative escapades common in comedic narratives. Notably, Comedy features more present-tense action verbs (“get,” “go,” “find”) compared to Drama, reflecting the genre’s emphasis on immediate situations and reactions rather than long-term emotional development. The appearance of “day” and “night” hints at the time-compressed nature of many comedic plots, often taking place over short, incident-packed timeframes.
Action Word Cloud
The Action genre’s vocabulary is distinctly mission-oriented, with terms like “find,” “world,” and “team” dominating the visualization. Unlike Drama’s focus on internal emotional states, Action emphasizes external threats and physical challenges. Words like “mission,” “agent,” “enemy,” and “power” create a landscape of conflict and high stakes. The term “world” appears more prominently than in other genres, indicating the larger scope and global implications common in Action narratives. Character terms tend to be more functional (based on roles like “agent” or “soldier”) rather than relational, reflecting the genre’s priority of plot mechanics over character development.
Crime Word Cloud
Crime narratives display a specialized vocabulary centered around investigation and criminal activity. Terms like “police,” “murder,” “detective,” and “case” establish the procedural framework common in the genre. The prominence of “find” and “discover” highlights the investigative focus, while “family” suggests the personal stakes or motivations behind criminal activities. Interestingly, time-related terms (“year,” “day,” “night”) feature strongly, reflecting the genre’s preoccupation with timelines, alibis, and the reconstruction of events. The vocabulary creates a world of mystery and moral complexity where truth is obscured and must be uncovered.
Thriller Word Cloud
The Thriller genre combines elements of suspense with psychological depth. Core terms like “find” and “discover” reflect the investigative aspects similar to Crime, but with an added emphasis on psychological terms like “know,” “truth,” and “secret.” Words like “life” and “death” highlight the high personal stakes in thriller narratives. The presence of time-related words (“time,” “day,” “night”) creates a sense of urgency and deadline pressure characteristic of the genre. Unlike pure Action, the Thriller vocabulary suggests threats that are often hidden or mysterious rather than overt, creating tension through uncertainty rather than spectacle.
These genre-specific language patterns reflect distinct storytelling approaches, character dynamics, and narrative structures that define each film category. The visualizations demonstrate how filmmakers employ specialized vocabulary to create genre-specific emotional and narrative experiences for audiences.
6.3 Cross-Genre Language Comparison
Word clouds show what is frequent within a genre, not what sets it apart. To complement them, we add a quantitative comparison of language across genres using TF-IDF, treating the combined plot tokens of each genre as one document.
View Cross-Genre Language Comparison Code
# Create a function to calculate TF-IDF values for genres
def calculate_genre_tfidf():
    # Create a document for each genre
    genre_documents = {}
    for genre in set(all_genres):
        if genre in genre_texts and len(genre_texts[genre]) > 50:
            genre_documents[genre] = ' '.join(genre_texts[genre])

    # Calculate TF-IDF
    from sklearn.feature_extraction.text import TfidfVectorizer
    vectorizer = TfidfVectorizer(max_features=1000)

    # Get genre names in a consistent order
    genres_list = list(genre_documents.keys())

    # Fit the vectorizer
    tfidf_matrix = vectorizer.fit_transform([genre_documents[genre] for genre in genres_list])

    # Get feature names (words)
    feature_names = vectorizer.get_feature_names_out()

    # Create a dictionary of distinctive words for each genre
    distinctive_words = {}
    for i, genre in enumerate(genres_list):
        tfidf_scores = tfidf_matrix[i].toarray()[0]
        word_scores = [(feature_names[j], tfidf_scores[j]) for j in range(len(feature_names))]
        word_scores.sort(key=lambda x: x[1], reverse=True)
        distinctive_words[genre] = word_scores[:10]  # Top 10 distinctive words
    return distinctive_words

# Calculate distinctive words for each genre
distinctive_words = calculate_genre_tfidf()

# Visualize for top genres
plt.figure(figsize=(15, 10))
for i, genre in enumerate(top_genres):
    if genre in distinctive_words:
        words, scores = zip(*distinctive_words[genre])
        plt.subplot(3, 2, i + 1)
        plt.barh(list(words), list(scores), color='steelblue')
        plt.title(f'Most Distinctive Words in {genre}')
plt.tight_layout()
genre_distinctive_plot = save_plot_as_base64(plt)
display(HTML(genre_distinctive_plot))
6.3.1 Interpretation: Distinctive Language Across Genres
This TF-IDF analysis reveals the most statistically distinctive words for each genre, highlighting terms that are uniquely characteristic rather than simply frequent. Unlike raw frequency counts which may highlight common words across all cinema, this approach identifies specialized vocabulary that differentiates each genre from others. For Drama, terms like “relationship,” “emotional,” and “struggle” emerge as uniquely characteristic, reinforcing the genre’s focus on interpersonal dynamics and internal conflict. Comedy’s distinctive vocabulary includes terms related to humorous situations and misunderstandings like “hilarious,” “awkward,” and “party,” setting it apart from more serious genres.
Action films show a unique emphasis on terms like “mission,” “explosion,” “enemy,” and “agent,” vocabulary rarely prominent in other genres. Crime narratives are distinguished by procedural terminology like “investigation,” “detective,” and “evidence,” while Thriller shows a distinctive psychological vocabulary featuring terms like “suspect,” “fear,” and “reveal.”
These distinctive terms function as linguistic markers that establish genre expectations and create the specialized atmosphere each genre requires. Filmmakers employ these vocabulary patterns—consciously or unconsciously—to signal genre alignment to audiences and fulfill genre-specific storytelling conventions.
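To see why TF-IDF demotes words that every genre shares, it helps to compute it by hand. The sketch below mirrors what we understand to be scikit-learn's default `TfidfVectorizer` weighting (raw term count, smoothed IDF of ln((1 + n) / (1 + df)) + 1, then scaling each document vector to unit length). The three genre documents are invented for illustration and are not drawn from our corpus.

```python
import math
from collections import Counter

# Invented, pre-tokenized "genre documents", for illustration only.
genre_docs = {
    "Drama": "life family father struggle struggle".split(),
    "Action": "life mission agent enemy mission".split(),
    "Crime": "life police murder detective family".split(),
}

n_docs = len(genre_docs)
# Document frequency: in how many genre documents does each word appear?
doc_freq = Counter(word for tokens in genre_docs.values() for word in set(tokens))

def idf(word):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf(tokens):
    """Term count times IDF, scaled to unit Euclidean length."""
    raw = {word: count * idf(word) for word, count in Counter(tokens).items()}
    norm = math.sqrt(sum(score ** 2 for score in raw.values()))
    return {word: score / norm for word, score in raw.items()}

for genre, tokens in genre_docs.items():
    ranked = sorted(tfidf(tokens).items(), key=lambda kv: kv[1], reverse=True)
    print(genre, [(word, round(score, 2)) for word, score in ranked[:2]])
```

A word that appears in every genre document, like “life” here, receives the minimum IDF of 1, so a genre-specific word with the same count always outranks it. With only about two dozen genre documents in our analysis, many common words fall into that shared category, which is what lets the distinctive vocabulary surface.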
7 Conclusion: Reimagining Film Genre Classification
7.1 Key Findings
Our network and linguistic analysis revealed that traditional genre categories inadequately capture how films actually cluster based on creative and thematic relationships. Most notably:
Drama is an oversaturated category, appearing in 65% of films and functioning more as a storytelling mode than a distinct genre
Natural communities emerge from data, revealing storytelling patterns rather than conventional labels
Unexpected genre combinations (like Animation+Adventure, Documentary+Music) occur at rates far exceeding random chance
Linguistic patterns within genres show distinctive vocabularies that reflect their storytelling priorities
7.2 Proposed Alternative Classification Framework
Based on our data analysis, we propose a new multi-dimensional classification system that reflects actual filmmaking patterns:
7.2.1 Primary Categories (Based on Identified Communities)
Character Journey Narratives
Personal Transformation
Relationship Drama
Family Saga
Coming of Age
Reality-Based Narratives
Historical Event
True Crime
Sports Story
Political Exposé
High-Concept Adventures
Hero’s Journey
Survival Challenge
World-Building Epic
Visual Spectacle
Tension-Based Narratives
Psychological Suspense
Supernatural Threat
Crime Procedural
Conspiracy
Distinct Specialized Categories
Western Frontier
Musical Performance
Adult Relationships
7.2.2 Secondary Tags (Applied as needed)
Setting: Ancient World, Medieval, Renaissance, WWII, Cold War, Near Future, Far Future, etc.
Tone: Inspirational, Satirical, Nihilistic, Nostalgic, Whimsical, etc.
Audience: Family, Young Adult, Mature Themes
7.2.3 Reclassification Examples
To demonstrate how this system would work in practice, here’s how several iconic films would be reclassified:
Titanic (1997)
Traditional: Drama, Romance, History
New System: Relationship Drama + Historical Event + Setting: Early 20th Century + Tone: Tragic Romance
Apocalypse Now (1979)
Traditional: Drama, War
New System: Personal Transformation + Historical Event + Setting: Vietnam War + Tone: Nihilistic
The Godfather (1972)
Traditional: Crime, Drama
New System: Family Saga + True Crime + Setting: Post-WWII America + Tone: Tragic
Star Wars (1977)
Traditional: Action, Adventure, Fantasy, Sci-Fi
New System: Hero’s Journey + Setting: Far Future + Tone: Mythic
The Silence of the Lambs (1991)
Traditional: Crime, Drama, Thriller
New System: Psychological Suspense + Crime Procedural + Tone: Disturbing
Parasite (2019)
Traditional: Drama, Thriller
New System: Family Saga + Psychological Suspense + Setting: Contemporary Urban + Tone: Satirical
Coco (2017)
Traditional: Animation, Adventure, Comedy
New System: World-Building Epic + Family Saga + Setting: Contemporary Mexico + Tone: Emotional
New System: True Story + Personal Transformation + Setting: Early Internet Era + Tone: Critical
La La Land (2016)
Traditional: Comedy, Drama, Musical
New System: Relationship Drama + Musical Performance + Setting: Contemporary Los Angeles + Tone: Nostalgic
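For a catalogue or recommendation system, the scheme above maps naturally onto a small record type: a list of primary categories plus optional secondary tags. The sketch below is one possible shape, not a finished schema; the `FilmClassification` class and its field names are hypothetical, and the example values are taken from the Titanic entry above.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FilmClassification:
    """One film under the proposed scheme: primary categories plus optional tags."""
    title: str
    year: int
    primary: List[str]                 # e.g. "Relationship Drama", "Historical Event"
    setting: Optional[str] = None      # secondary tag
    tone: Optional[str] = None         # secondary tag
    audience: Optional[str] = None     # secondary tag

    def label(self) -> str:
        """Render the classification in the 'A + B + Setting: X + Tone: Y' style."""
        parts = list(self.primary)
        tags = (("Setting", self.setting), ("Tone", self.tone), ("Audience", self.audience))
        for name, value in tags:
            if value:
                parts.append(f"{name}: {value}")
        return " + ".join(parts)

titanic = FilmClassification(
    title="Titanic",
    year=1997,
    primary=["Relationship Drama", "Historical Event"],
    setting="Early 20th Century",
    tone="Tragic Romance",
)
print(titanic.label())
```

Keeping the primary categories as a list rather than a single value is what lets hybrid films carry two storytelling approaches at once, while the secondary tags stay optional and can be filtered independently.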
7.3 Benefits of This New Framework
This multi-dimensional classification system offers several advantages:
Greater precision in matching viewer preferences to content
Reduced oversaturation of broad categories like “Drama”
Recognition of natural storytelling patterns identified through data analysis
Flexibility to accommodate hybrid films without forcing them into ill-fitting categories
More nuanced discovery for streaming platforms and recommendation algorithms
By adopting this data-driven approach to classification, the film industry could better serve both creators and audiences, acknowledging the complex, multifaceted nature of contemporary cinema while still providing meaningful organization for discovery and analysis.