In this project, we’ve implemented a name gender classifier using techniques described in Chapter 6 of “Natural Language Processing with Python.” The goal was to build the best possible classifier for predicting whether a given name is male or female.
This task represents a common natural language processing challenge: taking text data (in this case, names) and creating a model that can make accurate predictions about their characteristics (gender). By exploring different feature extraction methods and classifier algorithms, we can discover which approaches yield the most accurate results.
2 Data Preparation
We started by loading the Names Corpus from NLTK and splitting it into three subsets:
Training set: 6,944 names (used to train the model)
Dev-test set: 500 names (used to tune model parameters)
Test set: 500 names (used for final evaluation)
The data was randomly shuffled before splitting to ensure an unbiased distribution of names across the sets, which is crucial for creating a robust and generalizable model.
import nltk
import random
import string
import time
import io
import base64

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import HTML

from nltk.corpus import names
from nltk import NaiveBayesClassifier, DecisionTreeClassifier, MaxentClassifier
from nltk.classify import accuracy
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.metrics import ConfusionMatrix

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder

# Download required data
nltk.download('names')

# Get all names
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])

# Shuffle the dataset
random.seed(42)
random.shuffle(labeled_names)

# Split data into training, dev-test, and test sets
test_set = labeled_names[:500]
dev_test_set = labeled_names[500:1000]
train_set = labeled_names[1000:]

print(f"Training set size: {len(train_set)}")
print(f"Dev-test set size: {len(dev_test_set)}")
print(f"Test set size: {len(test_set)}")
Training set size: 6944
Dev-test set size: 500
Test set size: 500
Data Structure Sample: each entry in labeled_names pairs a name with its gender label, e.g. ('John', 'male') or ('Mary', 'female').
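The rendered sample is not reproduced here, but the first few entries can be inspected directly from the labeled_names list built above; the exact names depend on the seeded shuffle, so the output is omitted:

# Peek at the first 10 (name, gender) pairs after shuffling.
for name, gender in labeled_names[:10]:
    print(f"{name:<12} {gender}")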
3 Baseline Classifier and Evaluation Framework
Before comparing feature extractors, we need a robust evaluation framework that lets us test different approaches systematically. The function below handles training, testing, and generating performance metrics for each classifier configuration; we use it first to evaluate a baseline classifier that looks only at a name's last letter.
def gender_features_baseline(name):
    return {'last_letter': name[-1].lower()}

# Train and evaluate a classifier with a given feature extractor
def evaluate_classifier(feature_extractor, classifier_type='NB'):
    # Extract features; each list looks like
    # [({'last_letter': 'a'}, 'female'), ({'last_letter': 'n'}, 'male'), ...]
    train_features = [(feature_extractor(name), gender) for name, gender in train_set]
    devtest_features = [(feature_extractor(name), gender) for name, gender in dev_test_set]
    test_features = [(feature_extractor(name), gender) for name, gender in test_set]

    # Train the classifier
    start_time = time.time()
    if classifier_type == 'NB':
        classifier = NaiveBayesClassifier.train(train_features)
    elif classifier_type == 'DT':
        classifier = DecisionTreeClassifier.train(train_features)
    elif classifier_type == 'MaxEnt':
        classifier = MaxentClassifier.train(train_features, max_iter=10)
    elif classifier_type == 'RF':
        # Use scikit-learn directly instead of the NLTK wrapper.
        # This branch was added after the fact, just to confirm that a
        # random forest wasn't going to help.
        from sklearn.feature_extraction import DictVectorizer
        from sklearn.ensemble import RandomForestClassifier

        # Convert NLTK feature dictionaries to scikit-learn vectors
        vectorizer = DictVectorizer()
        X_train = [feat for feat, _ in train_features]
        y_train = [label for _, label in train_features]
        X_train_vec = vectorizer.fit_transform(X_train)

        # Train the random forest
        rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
        rf_model.fit(X_train_vec, y_train)

        # Small wrapper exposing the NLTK classifier interface (classify /
        # classify_many) so the rest of the evaluation code works unchanged.
        # It stands in for nltk.classify.scikitlearn.SklearnClassifier,
        # which we couldn't get to work here.
        class SklearnWrapper:
            def __init__(self, model, vectorizer):
                self.model = model
                self.vectorizer = vectorizer

            def classify(self, feature_dict):
                # Transform a single feature dictionary and predict
                X = self.vectorizer.transform([feature_dict])
                return self.model.predict(X)[0]

            def classify_many(self, feature_dicts):
                # Transform multiple feature dictionaries and predict
                X = self.vectorizer.transform(feature_dicts)
                return self.model.predict(X)

        classifier = SklearnWrapper(rf_model, vectorizer)

    training_time = time.time() - start_time

    # Evaluate on dev-test and test sets
    dev_accuracy = accuracy(classifier, devtest_features)
    test_accuracy = accuracy(classifier, test_features)

    # Predictions for the confusion matrices: classify each feature dictionary,
    # ignoring the gold label while predicting
    dev_predictions = [classifier.classify(feat) for feat, _ in devtest_features]
    dev_expected = [gender for _, gender in dev_test_set]
    test_predictions = [classifier.classify(feat) for feat, _ in test_features]
    test_expected = [gender for _, gender in test_set]

    # Create confusion matrices
    dev_cm = ConfusionMatrix(dev_expected, dev_predictions)
    test_cm = ConfusionMatrix(test_expected, test_predictions)

    # Error rate by gender = (misclassified names of that gender) / (total names of that gender)
    dev_male_error = sum(1 for i, (_, gender) in enumerate(dev_test_set)
                         if gender == 'male' and dev_predictions[i] != 'male') / \
                     sum(1 for _, g in dev_test_set if g == 'male')
    dev_female_error = sum(1 for i, (_, gender) in enumerate(dev_test_set)
                           if gender == 'female' and dev_predictions[i] != 'female') / \
                       sum(1 for _, g in dev_test_set if g == 'female')
    test_male_error = sum(1 for i, (_, gender) in enumerate(test_set)
                          if gender == 'male' and test_predictions[i] != 'male') / \
                      sum(1 for _, g in test_set if g == 'male')
    test_female_error = sum(1 for i, (_, gender) in enumerate(test_set)
                            if gender == 'female' and test_predictions[i] != 'female') / \
                        sum(1 for _, g in test_set if g == 'female')

    result = {
        'classifier_type': classifier_type,
        'feature_extractor': feature_extractor.__name__,
        'dev_accuracy': dev_accuracy,
        'test_accuracy': test_accuracy,
        'dev_cm': dev_cm,
        'test_cm': test_cm,
        'dev_male_error': dev_male_error,
        'dev_female_error': dev_female_error,
        'test_male_error': test_male_error,
        'test_female_error': test_female_error,
        'training_time': training_time,
        'classifier': classifier
    }
    return result

# Evaluate baseline
baseline_result = evaluate_classifier(gender_features_baseline)
print(f"Baseline Dev-test accuracy: {baseline_result['dev_accuracy']:.4f}")
print(f"Baseline Test accuracy: {baseline_result['test_accuracy']:.4f}")
print("\nDev-test Confusion Matrix:")
print(baseline_result['dev_cm'])
print("\nTest Confusion Matrix:")
print(baseline_result['test_cm'])
Baseline Dev-test accuracy: 0.7540
Baseline Test accuracy: 0.7460
Dev-test Confusion Matrix (row = reference, column = prediction):

       | female |  male |
female |  <238> |    59 |
male   |     64 | <139> |

Test Confusion Matrix (row = reference, column = prediction):

       | female |  male |
female |  <253> |    63 |
male   |     64 | <120> |
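As a worked check of the per-gender error rates that evaluate_classifier computes, the same numbers can be read straight off the baseline dev-test matrix above; this small sketch just re-derives them from the printed counts:

# Counts taken directly from the baseline dev-test confusion matrix above
# (rows = actual gender, columns = predicted gender).
female_total = 238 + 59      # actual female names in the dev-test set
male_total = 64 + 139        # actual male names in the dev-test set

female_error = 59 / female_total   # female names predicted as male
male_error = 64 / male_total       # male names predicted as female

print(f"Baseline dev-test female error rate: {female_error:.3f}")  # ~0.199
print(f"Baseline dev-test male error rate: {male_error:.3f}")      # ~0.315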
4 Feature Engineering and Analysis
We created several feature extractors with increasing complexity to see which characteristics of names are most predictive of gender. Each extractor builds on the previous one, incorporating more sophisticated patterns:
# BASELINE FEATURE EXTRACTOR
# The simplest approach: look only at the last letter of the name.
# Example: "John" -> {'last_letter': 'n'}, "Mary" -> {'last_letter': 'y'}
# Based on the observation that female names often end in vowels ('a', 'e')
# while male names often end in consonants ('n', 'k').
def gender_features_baseline(name):
    return {'last_letter': name[-1].lower()}

# LAST TWO LETTERS FEATURE EXTRACTOR
# Looks at the final two letters of the name, which can capture common endings.
# Example: "John" -> {'last_two': 'hn'}, "Mary" -> {'last_two': 'ry'}
# This catches patterns like 'ie', 'yn', 'on' that have gender associations.
def gender_features_last_two(name):
    name = name.lower()
    return {'last_two': name[-2:] if len(name) >= 2 else name}

# MULTIPLE FEATURES EXTRACTOR
# Uses three features: first letter, last letter, and the name's length.
# Example: "John" -> {'first_letter': 'j', 'last_letter': 'n', 'length': 4}
# Captures patterns from both ends and the overall structure of the name.
def gender_features_multi(name):
    name = name.lower()
    features = {
        'last_letter': name[-1],
        'first_letter': name[0],
        'length': len(name)
    }
    return features

# SUFFIX-BASED FEATURES EXTRACTOR
# Focuses only on endings, with increasing specificity (1, 2, or 3 letters).
# Example: "Elizabeth" -> {'last_letter': 'h', 'last_two': 'th', 'last_three': 'eth'}
# Captures more complex ending patterns like 'eth', 'ine', 'son', etc.
def gender_features_suffix(name):
    name = name.lower()
    return {
        'last_letter': name[-1],
        'last_two': name[-2:] if len(name) >= 2 else name,
        'last_three': name[-3:] if len(name) >= 3 else name,
    }

# COMPREHENSIVE FEATURES EXTRACTOR
# Analyzes both beginnings and endings, plus structure and vowel usage.
# Includes 11 different features to capture complex patterns in names,
# including vowel distribution, which can differ between male and female names.
def gender_features_comprehensive(name):
    name = name.lower()
    features = {
        'first_letter': name[0],
        'first_two': name[:2] if len(name) >= 2 else name,
        'first_three': name[:3] if len(name) >= 3 else name,
        'last_letter': name[-1],
        'last_two': name[-2:] if len(name) >= 2 else name,
        'last_three': name[-3:] if len(name) >= 3 else name,
        'length': len(name),
        'contains_vowels': sum(1 for c in name if c in 'aeiou'),
        'vowel_ratio': sum(1 for c in name if c in 'aeiou') / len(name) if len(name) > 0 else 0,
        'starts_vowel': name[0] in 'aeiou',
        'ends_vowel': name[-1] in 'aeiou'
    }
    return features

# ADVANCED FEATURES EXTRACTOR
# Builds on the comprehensive features by adding gender-specific suffix detection:
# common female endings (like 'a', 'ie', 'tte') and male endings (like 'er', 'on', 'an').
# Also counts runs of consecutive vowels or consonants.
def gender_features_advanced(name):
    name = name.lower()
    features = gender_features_comprehensive(name)

    # Presence of specific suffixes that might indicate gender:
    # many female names end in 'a' (Emma, Sophia, Victoria),
    # many male names end in 'n' (John, Brian, Steven),
    # endings like 'ie' or 'y' are often found in female names (Julie, Stacy),
    # endings like 'er' or 'on' are common in male names (Peter, Jason).
    # Drawn from studies such as:
    # https://www.degruyterbrill.com/document/doi/10.1515/ling-2020-0027/html?lang=en
    female_suffixes = ['a', 'e', 'i', 'ie', 'y', 'ey', 'la', 'na', 'ne', 'ta', 'tte', 'elle']
    male_suffixes = ['o', 'n', 'r', 's', 'k', 'd', 't', 'on', 'er', 'in', 'an']
    for suffix in female_suffixes:
        features[f'ends_with_{suffix}'] = name.endswith(suffix)
    for suffix in male_suffixes:
        features[f'ends_with_{suffix}'] = name.endswith(suffix)

    # Frequency of specific character patterns:
    # "John" has one consecutive consonant pair ("hn"),
    # "Matthew" has two ("tt" and "th"), etc.
    features['consonant_groups'] = len([i for i in range(len(name) - 1)
                                        if name[i] not in 'aeiou' and name[i + 1] not in 'aeiou'])
    features['vowel_groups'] = len([i for i in range(len(name) - 1)
                                    if name[i] in 'aeiou' and name[i + 1] in 'aeiou'])
    return features

# FEATURE EXTRACTOR EVALUATION
# Evaluate each feature extraction function with the Naive Bayes classifier.
feature_extractors = [
    gender_features_baseline,
    gender_features_last_two,
    gender_features_multi,
    gender_features_suffix,
    gender_features_comprehensive,
    gender_features_advanced
]

# Results dictionary to store performance metrics
results = {}

for extractor in feature_extractors:
    results[extractor.__name__] = evaluate_classifier(extractor)
    print(f"\n{extractor.__name__}:")
    print(f"  Dev-test accuracy: {results[extractor.__name__]['dev_accuracy']:.4f}")
    print(f"  Test accuracy: {results[extractor.__name__]['test_accuracy']:.4f}")

# Find the best feature extractor based on dev-test accuracy
best_extractor_name = max(results, key=lambda k: results[k]['dev_accuracy'])
best_extractor = next(fe for fe in feature_extractors if fe.__name__ == best_extractor_name)
print(f"\nBest feature extractor: {best_extractor_name}")
print(f"Dev-test accuracy: {results[best_extractor_name]['dev_accuracy']:.4f}")
print(f"Test accuracy: {results[best_extractor_name]['test_accuracy']:.4f}")
gender_features_baseline:
Dev-test accuracy: 0.7540
Test accuracy: 0.7460
gender_features_last_two:
Dev-test accuracy: 0.7560
Test accuracy: 0.7940
gender_features_multi:
Dev-test accuracy: 0.7500
Test accuracy: 0.7540
gender_features_suffix:
Dev-test accuracy: 0.7880
Test accuracy: 0.7800
gender_features_comprehensive:
Dev-test accuracy: 0.8380
Test accuracy: 0.7920
gender_features_advanced:
Dev-test accuracy: 0.8260
Test accuracy: 0.7840
Best feature extractor: gender_features_comprehensive
Dev-test accuracy: 0.8380
Test accuracy: 0.7920
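To make the winning feature representation concrete, here is what gender_features_comprehensive returns for a single name; the values in the comment are worked out by hand from the function definition above:

print(gender_features_comprehensive("Elizabeth"))
# {'first_letter': 'e', 'first_two': 'el', 'first_three': 'eli',
#  'last_letter': 'h', 'last_two': 'th', 'last_three': 'eth',
#  'length': 9, 'contains_vowels': 4, 'vowel_ratio': 0.444...,
#  'starts_vowel': True, 'ends_vowel': False}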
5 Classifier Comparison
After identifying the best feature extraction method, we compared different classification algorithms to find the most effective approach for the gender prediction task.
# Try different classifiers with the best feature extractor.
# The three classifier types to compare: Naive Bayes, Decision Tree,
# and Random Forest (via scikit-learn).
classifier_types = ['NB', 'DT', 'RF']

# Dictionary to store results for each classifier type
classifier_results = {}

# Evaluate each classifier using the best feature extractor identified earlier
for classifier_type in classifier_types:
    # evaluate_classifier trains the classifier, evaluates it, and returns performance metrics
    classifier_results[classifier_type] = evaluate_classifier(best_extractor, classifier_type)

    # Print performance metrics for the current classifier
    print(f"\n{classifier_type} Classifier with {best_extractor.__name__}:")
    print(f"  Dev-test accuracy: {classifier_results[classifier_type]['dev_accuracy']:.4f}")
    print(f"  Test accuracy: {classifier_results[classifier_type]['test_accuracy']:.4f}")
    print(f"  Training time: {classifier_results[classifier_type]['training_time']:.2f} seconds")

# Find the best classifier based on dev-test accuracy
best_classifier_type = max(classifier_results, key=lambda k: classifier_results[k]['dev_accuracy'])

print(f"\nBest classifier: {best_classifier_type}")
print(f"Dev-test accuracy: {classifier_results[best_classifier_type]['dev_accuracy']:.4f}")
print(f"Test accuracy: {classifier_results[best_classifier_type]['test_accuracy']:.4f}")

# Extract the trained classifier object from the results dictionary
best_classifier = classifier_results[best_classifier_type]['classifier']

# If the best classifier is Naive Bayes, display its most informative features,
# i.e. the features most predictive of gender.
if best_classifier_type == 'NB':
    print("\nMost Informative Features:")
    # most_informative_features() returns (feature name, feature value) pairs,
    # so the float conversion below typically fails and the value is printed as-is.
    for feature, ratio in best_classifier.most_informative_features(20):
        try:
            print(f"{feature:<40}{float(ratio):.1f}")
        except (ValueError, TypeError):
            print(f"{feature:<40}{ratio}")
NB Classifier with gender_features_comprehensive:
Dev-test accuracy: 0.8380
Test accuracy: 0.7920
Training time: 0.03 seconds
DT Classifier with gender_features_comprehensive:
Dev-test accuracy: 0.7320
Test accuracy: 0.7140
Training time: 0.53 seconds
RF Classifier with gender_features_comprehensive:
Dev-test accuracy: 0.8040
Test accuracy: 0.7920
Training time: 2.39 seconds
Best classifier: NB
Dev-test accuracy: 0.8380
Test accuracy: 0.7920
Most Informative Features:
last_two na
last_two la
last_three ard
last_two rd
last_two ia
last_two sa
last_letter a
last_two ta
last_three nne
last_letter k
last_two us
last_letter f
last_two io
last_two ra
last_two do
last_three tta
last_three ana
last_two ld
last_two rt
last_two im
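One note on the listing above: NaiveBayesClassifier.most_informative_features() returns (feature name, feature value) pairs rather than likelihood ratios, which is why no ratios appear. NLTK can print the ratios directly with the classifier's built-in report; a minimal sketch, assuming the trained Naive Bayes model stored in best_classifier above:

# Print the top 20 features together with their likelihood ratios using
# NLTK's built-in report instead of the manual loop above.
best_classifier.show_most_informative_features(20)
# Each printed line has roughly the form (values illustrative, not actual output):
#   last_two = 'na'    female : male = NN.N : 1.0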
6 Visualizations and Results Analysis
We created visualizations to clearly compare the performance of different feature extractors and classifiers, along with analyzing gender-specific error rates. These visualizations help us understand which approaches work best and where our model still makes mistakes.
import io
import base64
from IPython.display import HTML
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Function to encode matplotlib figures as HTML
def encode_figure_as_html(fig, width=750, dpi=300):
    """Convert a matplotlib figure to base64 and return an HTML img tag."""
    buf = io.BytesIO()
    fig.savefig(buf, format='png', bbox_inches='tight', dpi=dpi)
    plt.close(fig)  # Close the figure to free memory
    img_base64 = base64.b64encode(buf.getvalue()).decode('utf-8')
    html = f'<img src="data:image/png;base64,{img_base64}" alt="Plot" width="{width}" />'
    return html

# Set default plot styles
def set_plot_style():
    plt.style.use('ggplot')
    sns.set(style="whitegrid")
    plt.rcParams.update({'font.size': 12})

set_plot_style()

# Prepare data for visualization
feature_names = [fe.__name__.replace('gender_features_', '') for fe in feature_extractors]
dev_accuracies = [results[fe.__name__]['dev_accuracy'] for fe in feature_extractors]
test_accuracies = [results[fe.__name__]['test_accuracy'] for fe in feature_extractors]

# DataFrame for easier plotting
df = pd.DataFrame({
    'Feature Extractor': feature_names,
    'Dev-test Accuracy': dev_accuracies,
    'Test Accuracy': test_accuracies
})

# Feature extractor comparison chart
fig1 = plt.figure(figsize=(12, 6))
bar_width = 0.35
x = range(len(feature_names))

plt.bar([i - bar_width/2 for i in x], dev_accuracies, bar_width, label='Dev-test Accuracy')
plt.bar([i + bar_width/2 for i in x], test_accuracies, bar_width, label='Test Accuracy')

plt.xlabel('Feature Extractor', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('Performance Comparison of Different Feature Extractors', fontsize=14)
plt.xticks(x, feature_names, rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.legend(fontsize=10)
plt.tight_layout()

feature_comparison_html = encode_figure_as_html(fig1)
display(HTML("<h2>Feature Extractor Performance Comparison</h2>"))
display(HTML(feature_comparison_html))

# Classifier comparison chart
classifier_names = list(classifier_results.keys())
classifier_dev_acc = [classifier_results[ct]['dev_accuracy'] for ct in classifier_names]
classifier_test_acc = [classifier_results[ct]['test_accuracy'] for ct in classifier_names]
classifier_times = [classifier_results[ct]['training_time'] for ct in classifier_names]

df_classifiers = pd.DataFrame({
    'Classifier': classifier_names,
    'Dev-test Accuracy': classifier_dev_acc,
    'Test Accuracy': classifier_test_acc,
    'Training Time (s)': classifier_times
})

fig2 = plt.figure(figsize=(10, 6))
x = range(len(classifier_names))

plt.bar([i - bar_width/2 for i in x], classifier_dev_acc, bar_width, label='Dev-test Accuracy')
plt.bar([i + bar_width/2 for i in x], classifier_test_acc, bar_width, label='Test Accuracy')

plt.xlabel('Classifier', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title(f'Performance Comparison of Different Classifiers with {best_extractor.__name__}', fontsize=14)
plt.xticks(x, classifier_names, fontsize=10)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.legend(fontsize=10)
plt.tight_layout()

classifier_comparison_html = encode_figure_as_html(fig2)
display(HTML("<h2>Classifier Performance Comparison</h2>"))
display(HTML(classifier_comparison_html))

# Breakdown of error rates by gender
male_dev_errors = [results[fe.__name__]['dev_male_error'] for fe in feature_extractors]
female_dev_errors = [results[fe.__name__]['dev_female_error'] for fe in feature_extractors]
male_test_errors = [results[fe.__name__]['test_male_error'] for fe in feature_extractors]
female_test_errors = [results[fe.__name__]['test_female_error'] for fe in feature_extractors]

fig3, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
x_gender = range(len(feature_names))

# Dev-test errors by gender
ax1.bar([i - bar_width/2 for i in x_gender], male_dev_errors, bar_width, label='Male Error')
ax1.bar([i + bar_width/2 for i in x_gender], female_dev_errors, bar_width, label='Female Error')
ax1.set_xlabel('Feature Extractor', fontsize=12)
ax1.set_ylabel('Error Rate', fontsize=12)
ax1.set_title('Dev-test Error Rates by Gender', fontsize=14)
ax1.set_xticks(x_gender)
ax1.set_xticklabels(feature_names, rotation=45, ha='right', fontsize=9)
ax1.grid(axis='y', linestyle='--', alpha=0.7)
ax1.legend(fontsize=10)

# Test errors by gender
ax2.bar([i - bar_width/2 for i in x_gender], male_test_errors, bar_width, label='Male Error')
ax2.bar([i + bar_width/2 for i in x_gender], female_test_errors, bar_width, label='Female Error')
ax2.set_xlabel('Feature Extractor', fontsize=12)
ax2.set_ylabel('Error Rate', fontsize=12)
ax2.set_title('Test Error Rates by Gender', fontsize=14)
ax2.set_xticks(x_gender)
ax2.set_xticklabels(feature_names, rotation=45, ha='right', fontsize=9)
ax2.grid(axis='y', linestyle='--', alpha=0.7)
ax2.legend(fontsize=10)

plt.tight_layout()

gender_error_html = encode_figure_as_html(fig3)
display(HTML("<h2>Error Rates by Gender</h2>"))
display(HTML(gender_error_html))

# For embedding in an HTML document, the HTML strings can be used directly:
#   feature_comparison_html, classifier_comparison_html, gender_error_html

# Final analysis of the best model
best_model_results = classifier_results[best_classifier_type]
print("\nFinal Analysis of Best Model:")
print(f"Feature Extractor: {best_extractor.__name__}")
print(f"Classifier: {best_classifier_type}")
print(f"Dev-test Accuracy: {best_model_results['dev_accuracy']:.4f}")
print(f"Test Accuracy: {best_model_results['test_accuracy']:.4f}")
print(f"Dev-test Errors - Male: {best_model_results['dev_male_error']:.4f}, Female: {best_model_results['dev_female_error']:.4f}")
print(f"Test Errors - Male: {best_model_results['test_male_error']:.4f}, Female: {best_model_results['test_female_error']:.4f}")
print(f"Training Time: {best_model_results['training_time']:.2f} seconds")

# Results table for better presentation
results_table = pd.DataFrame({
    'Metric': ['Feature Extractor', 'Classifier', 'Dev-test Accuracy', 'Test Accuracy',
               'Dev-test Male Error Rate', 'Dev-test Female Error Rate',
               'Test Male Error Rate', 'Test Female Error Rate', 'Training Time (s)'],
    'Value': [best_extractor.__name__, best_classifier_type,
              f"{best_model_results['dev_accuracy']:.4f}",
              f"{best_model_results['test_accuracy']:.4f}",
              f"{best_model_results['dev_male_error']:.4f}",
              f"{best_model_results['dev_female_error']:.4f}",
              f"{best_model_results['test_male_error']:.4f}",
              f"{best_model_results['test_female_error']:.4f}",
              f"{best_model_results['training_time']:.2f}"]
})
display(results_table)

# Error analysis: find and display misclassified names
def analyze_errors(classifier, feature_extractor, dataset):
    """
    Analyze which names the model misclassified to identify patterns in errors.

    Parameters:
        classifier -- trained classifier to use for predictions
        feature_extractor -- feature extraction function
        dataset -- dataset to analyze (list of (name, gender) tuples)

    Returns:
        List of (name, actual_gender, predicted_gender) tuples for misclassified names.
    """
    errors = []
    for name, actual_gender in dataset:
        features = feature_extractor(name)
        predicted_gender = classifier.classify(features)
        if predicted_gender != actual_gender:
            errors.append((name, actual_gender, predicted_gender))
    return errors

# Identify misclassified names in both dev-test and test sets
best_classifier = best_model_results['classifier']
dev_errors = analyze_errors(best_classifier, best_extractor, dev_test_set)
test_errors = analyze_errors(best_classifier, best_extractor, test_set)

# DataFrames of misclassified names for better display
misclassified_dev = pd.DataFrame(dev_errors[:15], columns=['Name', 'Actual Gender', 'Predicted Gender'])
misclassified_test = pd.DataFrame(test_errors[:15], columns=['Name', 'Actual Gender', 'Predicted Gender'])

print("\nSample of Misclassified Names in Dev-test Set:")
display(misclassified_dev)
print("\nSample of Misclassified Names in Test Set:")
display(misclassified_test)

# Performance comparison between dev-test and test sets
print("\nPerformance Comparison:")
print(f"Best model dev-test accuracy: {best_model_results['dev_accuracy']:.4f}")
print(f"Best model test accuracy: {best_model_results['test_accuracy']:.4f}")
print(f"Difference: {abs(best_model_results['dev_accuracy'] - best_model_results['test_accuracy']):.4f}")
[Figure: Feature Extractor Performance Comparison (dev-test vs. test accuracy for each feature extractor)]
[Figure: Classifier Performance Comparison (dev-test vs. test accuracy for each classifier with gender_features_comprehensive)]
[Figure: Error Rates by Gender (male vs. female error rates per feature extractor, dev-test and test sets)]
Final Analysis of Best Model:
Feature Extractor: gender_features_comprehensive
Classifier: NB
Dev-test Accuracy: 0.8380
Test Accuracy: 0.7920
Dev-test Errors - Male: 0.1626, Female: 0.1616
Test Errors - Male: 0.2337, Female: 0.1930
Training Time: 0.03 seconds
Final results summary:

Metric                       Value
Feature Extractor            gender_features_comprehensive
Classifier                   NB
Dev-test Accuracy            0.8380
Test Accuracy                0.7920
Dev-test Male Error Rate     0.1626
Dev-test Female Error Rate   0.1616
Test Male Error Rate         0.2337
Test Female Error Rate       0.1930
Training Time (s)            0.03

Sample of Misclassified Names in Dev-test Set:

Name         Actual Gender   Predicted Gender
Cherey       female          male
Marlo        female          male
Hildagard    female          male
Melicent     female          male
Moll         female          male
Georgie      male            female
Charil       female          male
Elly         female          male
Etienne      male            female
Yehudit      female          male
Merry        male            female
Ivy          female          male
Trudy        female          male
Margalo      female          male
Andre        male            female

Sample of Misclassified Names in Test Set:

Name         Actual Gender   Predicted Gender
Fey          female          male
Sayre        male            female
Angel        female          male
Trixy        female          male
Maxy         female          male
Christabel   female          male
Christin     female          male
Siobhan      female          male
Evy          female          male
Pet          female          male
Shelagh      female          male
Sharron      female          male
Lyle         male            female
Riannon      female          male
Pate         male            female

Performance Comparison:
Best model dev-test accuracy: 0.8380
Best model test accuracy: 0.7920
Difference: 0.0460
7 Conclusions
Our Naive Bayes classifier produced the best results: 0.8380 accuracy on the dev-test set and 0.7920 on the test set, a difference of 0.0460.
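As a quick illustration of how the final model can be applied to new input, here is a minimal sketch assuming the best_classifier and gender_features_comprehensive objects defined earlier; the example names are arbitrary and the printed predictions depend on the trained model:

# Classify a few new names with the best feature extractor + classifier combination.
for name in ["Charlotte", "Gregory"]:
    features = gender_features_comprehensive(name)
    print(name, "->", best_classifier.classify(features))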
Moving from the simple last-letter baseline (0.7540 dev-test accuracy) to the comprehensive feature set (0.8380) shows that feature engineering can significantly improve classification accuracy, and that names contain many gender-indicating patterns beyond just the last letter. The improvement was not strictly monotonic, however: the advanced extractor, despite its extra suffix features, scored slightly lower than the comprehensive one on the dev-test set.
The error rates for male and female names were not identical. For the best model, male names were misclassified more often than female names on the test set (23.4% vs. 19.3%), while the dev-test error rates were nearly equal; across most of the configurations we tried, female names were easier to classify correctly.
The performance on the dev-test set was slightly better than on the test set, but the two were close, indicating that our model generalizes reasonably well and that the classifier selection process was appropriate.