Week Eight - High Frequency Words and the US Constitution

Authors

Tony Fraser and Mark Gonsalves

Published

March 23, 2025

1 Overview

This analysis examines word frequencies in the US Constitution to understand linguistic patterns and how they compare to general language patterns. We’ve processed the text by removing stopwords (common words like “the” and “and”) and analyzed the remaining content words. We also treated multi-word phrases like “United States” as single entities to better capture their meaning.

Our analysis includes visualizations like word clouds, frequency distributions, and statistical tests to reveal insights about this foundational document’s language structure and patterns.
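The preprocessing described in this overview can be sketched roughly as follows. This is a minimal illustration, not the exact pipeline behind the report: the stopword set is a tiny hypothetical stand-in for a full list, and the `PHRASES` mapping and `preprocess` function are names introduced here only for demonstration.

```python
import re

# Assumption: a tiny illustrative stopword set; the report's analysis
# almost certainly used a much larger list.
STOPWORDS = {"the", "of", "and", "to", "in", "a", "be", "for", "we"}

# Assumption: multi-word phrases are merged by joining them with an underscore.
PHRASES = {"united states": "united_states"}


def preprocess(text):
    """Lowercase, merge known phrases, tokenize, and drop stopwords."""
    lowered = text.lower()
    for phrase, merged in PHRASES.items():
        lowered = lowered.replace(phrase, merged)
    # Keep alphabetic tokens; underscores survive so merged phrases stay whole.
    tokens = re.findall(r"[a-z_]+", lowered)
    return [tok for tok in tokens if tok not in STOPWORDS]


sample = "We the People of the United States, in Order to form a more perfect Union"
content_words = preprocess(sample)
print(content_words)
```

Merging the phrase before tokenizing means “United States” is counted once as `united_states` rather than as separate “united” and “states” tokens. A plain string replace like this one will miss phrases split across line breaks, so a real pipeline would need something more careful.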

2 Document Statistics

Let’s first look at the basic numbers that describe the US Constitution text:

	Statistic	Value
	Total characters	27,000
	Total sentences	118
	Total words	4,414
	Total stopwords	2,164 (49.0%)
	Total content words (non-stopwords)	2,250 (51.0%)
	Total unique words (including stopwords)	930
	Total unique content words (excluding stopwords)	844
	Average word length	4.7 characters
	Average sentence length (words)	37.4
	Average sentence length (characters)	225.8

2.1 What These Numbers Tell Us

The Constitution is a fairly concise document with around 4,400 words.
About half the words are common stopwords, and half are meaningful content words.
There are fewer than 1,000 unique words in the entire document, showing limited vocabulary diversity.
Sentences are quite long (about 37 words per sentence), reflecting the formal legal writing style of the time.
The average word length is under 5 characters, showing that even in formal documents, shorter words dominate.
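Statistics like the ones in the table above could be computed along these lines. This is a sketch under stated assumptions: the regex tokenizer, the naive sentence splitter, and the small stopword set are all simplifications chosen for illustration, so the numbers it produces on the full text would not necessarily match the table.

```python
import re

# Assumption: a tiny illustrative stopword set, not the list used in the report.
STOPWORDS = {"the", "of", "and", "to", "in", "a", "be"}


def document_stats(text, stopwords):
    """Return basic counts for a non-empty text."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    # Naive sentence split: ., ! or ? followed by whitespace or end of text.
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    content = [w for w in words if w not in stopwords]
    return {
        "total_words": len(words),
        "total_sentences": len(sentences),
        "stopwords": len(words) - len(content),
        "content_words": len(content),
        "unique_words": len(set(words)),
        "unique_content_words": len(set(content)),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": len(words) / len(sentences),
    }


sample = (
    "All legislative Powers herein granted shall be vested in a Congress. "
    "The Congress shall have Power to lay and collect Taxes."
)
stats = document_stats(sample, STOPWORDS)
for name, value in stats.items():
    print(f"{name}: {value}")
```

On this two-sentence sample the function reports 21 words, of which 6 are stopwords and 15 are content words. Abbreviations and numbered clauses would confuse the sentence splitter, which is one reason sentence counts can differ between tools.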

2.2 Word Length Distribution

	Word Length	Count
	Short words (1-4 letters)	2,402 (54.4%)
	Medium words (5-8 letters)	1,579 (35.8%)
	Long words (9+ letters)	433 (9.8%)

Over half of the words in the Constitution are short (1-4 letters). This is typical of English, where common words like “the,” “of,” “to,” and “in” are very short. Only about 10% of words are long (9+ letters), which are usually specialized terms.
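The length buckets above can be reproduced with a few lines of code. The sketch below assumes the same three ranges as the table (1-4, 5-8, and 9+ letters); the function names and the sample word list are hypothetical.

```python
from collections import Counter


def length_bucket(word):
    """Assign a word to the short, medium, or long bucket by letter count."""
    n = len(word)
    if n <= 4:
        return "short (1-4)"
    if n <= 8:
        return "medium (5-8)"
    return "long (9+)"


def length_distribution(words):
    """Map each bucket to (count, percentage of all words)."""
    counts = Counter(length_bucket(w) for w in words)
    total = sum(counts.values())
    return {
        bucket: (count, round(100 * count / total, 1))
        for bucket, count in counts.items()
    }


words = ["we", "the", "people", "of", "the", "united", "states", "constitution"]
dist = length_distribution(words)
for bucket, (count, pct) in dist.items():
    print(f"{bucket}: {count} ({pct}%)")
```

Note that this counts every occurrence of a word (tokens), not each distinct word once, which matches how the table's counts add up to the document's total word count.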

3 Word Cloud Visualization

A word cloud is a picture where the size of each word shows how often it appears in the text. Bigger words appear more frequently in the document.

In this word cloud, we can immediately see that “shall” is by far the most common word, showing how the Constitution is focused on establishing rules and requirements. Words like “state,” “president,” “congress,” and “law” are also prominent, highlighting the document’s focus on governance and legal frameworks.

4 Top Words Analysis

This section is divided into four parts that explore different aspects of the word patterns in the Constitution. We’ll look at the most frequent individual words, how many words are needed to cover half the text, how words can be grouped by meaning, and which words commonly appear together.

4.1 Most Frequent Words in the Constitution

When we remove common stopwords like “the” and “and,” what words appear most often in the Constitution? Here are the top 20:

Show Code for Most Frequent Words

import pandas as pd

# Create a DataFrame for top 20 words.
# Assumes top_200 (a list of (word, count) pairs) and total_content_words
# were computed earlier in the notebook.
top_20_df = pd.DataFrame(top_200[:20], columns=['Word', 'Count'])
# Format with a fixed two decimals so trailing zeros are kept (1.20%, not 1.2%)
top_20_df['Percentage'] = (top_20_df['Count'] / total_content_words * 100).map('{:.2f}%'.format)
top_20_df.index = range(1, len(top_20_df) + 1)  # Start index at 1
top_20_df

	Word	Count	Percentage
1	shall	191	8.49%
2	state	48	2.13%
3	may	33	1.47%
4	president	32	1.42%
5	congress	29	1.29%
6	states	27	1.20%
7	house	23	1.02%
8	law	23	1.02%
9	section	22	0.98%
10	one	19	0.84%
11	office	19	0.84%
12	senate	17	0.76%
13	person	16	0.71%
14	two	16	0.71%
15	time	16	0.71%
16	constitution	15	0.67%
17	representatives	15	0.67%
18	years	12	0.53%
19	thereof	12	0.53%
20	power	12	0.53%

4.1.1 What This Tells Us About the Constitution

The word “shall” dominates the document, appearing 191 times and making up over 8% of all meaningful words. This makes sense because the Constitution is establishing rules and duties:

“The President shall…”
“Congress shall not…”
“Each State shall appoint…”

The high frequency of government terms like “president,” “congress,” “states,” and “senate” reflects the document’s focus on establishing government structure. Legal terms like “law” and “constitution” are also common, showing the document’s purpose in establishing a legal framework.

4.2 Word Coverage Analysis

An interesting question in linguistics is: how many different words do you need to cover a large portion of a text? Here’s what we found for the Constitution:

Show Code for Word Coverage Analysis

# Calculate words needed for half the corpus
word_counts = fdist.most_common()
cumulative_count = 0
i = 0
half_corpus_words = []

for word, count in word_counts:
    cumulative_count += count
    i += 1
    half_corpus_words.append(word)
    if cumulative_count >= total_content_words / 2:
        break

# Create DataFrame for this information
half_corpus_data = {
    'Statistic': ['Number of unique words representing half of the corpus',
                  'Percentage of all unique content words',
                  'Words needed to cover 50% of the corpus'],
    'Value': [f"{i:,}",
              f"{i/unique_content_words*100:.1f}%",
              ', '.join(half_corpus_words[:10]) + '...']
}

half_corpus_df = pd.DataFrame(half_corpus_data)
half_corpus_df.index = [''] * len(half_corpus_df)  # Remove index numbers
half_corpus_df

	Statistic	Value
	Number of unique words representing half of th...	104
	Percentage of all unique content words	12.3%
	Words needed to cover 50% of the corpus	shall, state, may, president, congress, states...

4.2.1 What This Word Coverage Reveals

Just 104 unique words (about 12% of all unique content words) account for half of all content-word occurrences in the Constitution; stopwords were removed before counting. This illustrates what is often called the “Pareto effect” in language: a small subset of words does most of the work in a text.

This pattern is typical in most languages and texts, but the Constitution has a particularly concentrated vocabulary because it’s a focused legal document with a specific purpose.

4.3 Semantic Categories of Words

Words can be grouped into categories based on their meaning and function. We analyzed the top 50 words in the Constitution and sorted them into these categories:

Show Code for Semantic Category Analysis

# Define categories
governance_terms = ['president', 'congress', 'senate', 'united_states', 'representatives', 'house', 'legislative', 'executive']
legal_terms = ['law', 'constitution', 'cases', 'states', 'power', 'powers', 'amendment', 'rights']
procedural_terms = ['shall', 'may', 'provided', 'appointed', 'elected', 'chosen']

# Count occurrences of categories in top 50
top_50_words = [word for word, _ in top_200[:50]]
governance_count = sum(1 for word in top_50_words if word in governance_terms)
legal_count = sum(1 for word in top_50_words if word in legal_terms)
procedural_count = sum(1 for word in top_50_words if word in procedural_terms)
# Use the actual list length so the remainder stays correct if fewer than 50 words exist
other_count = len(top_50_words) - governance_count - legal_count - procedural_count

# Create DataFrame for categories
categories_data = {
    'Category': ['Governance terms', 'Legal terms', 'Procedural terms', 'Other terms'],
    'Count': [governance_count, legal_count, procedural_count, other_count],
    'Examples': [
        '"president", "congress", "senate"',
        '"law", "constitution", "rights"',
        '"shall", "may", "provided"',
        'Various other words'
    ],
    'What They Show': [
        'Focus on establishing government structure',
        'Emphasis on legal foundations and powers',
        'Instructions and requirements for how government functions',
        'Other concepts not in the main categories'
    ]
}

categories_df = pd.DataFrame(categories_data)
categories_df.index = [''] * len(categories_df)  # Remove index numbers
categories_df

Category	Count	Examples	What They Show
Governance terms	6	"president", "congress", "senate"	Focus on establishing government structure
Legal terms	5	"law", "constitution", "rights"	Emphasis on legal foundations and powers
Procedural terms	2	"shall", "may", "provided"	Instructions and requirements for how governme...
Other terms	37	Various other words	Other concepts not in the main categories

4.3.1 The Importance of These Categories

These word categories reveal the Constitution’s key priorities:

Governance Terms establish the branches and offices of government
Legal Terms create the foundation of laws and rights
Procedural Terms set rules for how the government must operate

The distribution shows that while these specialized terms are important, the majority of frequently used words fall outside these categories, reflecting the document’s need to communicate using general language alongside specialized terminology.

4.4 Common Word Pairs (Collocations)

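Some words appear together far more often than chance would predict, forming meaningful phrases. These pairs (called “collocations”) often represent important concepts. We kept only pairs that occur at least three times and ranked them by pointwise mutual information (PMI), which rewards words that rarely appear apart: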

Show Code for Word Collocation Analysis

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Find common word pairs (bigrams), ranked by pointwise mutual information (PMI)
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(content_words)
finder.apply_freq_filter(3)  # only bigrams that appear 3+ times
top_bigrams = finder.nbest(bigram_measures.pmi, 10)

# Create DataFrame for bigrams with explanation.
# The 'Significance' labels are hand-written and matched by position,
# so they must be revised if the ranking above changes.
bigrams_data = {
    'Word Pair': [f"{bigram[0]} {bigram[1]}" for bigram in top_bigrams],
    'Significance': [
        'Age qualification for offices',
        'Alternative ways to pledge allegiance',
        'Foreign diplomatic representatives',
        'Discretionary powers',
        'Presidential appointment power',
        'Persons held to service or labor (fugitive slave clause)',
        'Foreign representatives',
        'Foreign representatives',
        'Highest judicial body',
        'Types of taxes'
    ]
}

bigrams_df = pd.DataFrame(bigrams_data)
bigrams_df.index = range(1, len(bigrams_df) + 1)  # Start index at 1
bigrams_df

	Word Pair	Significance
1	attained age	Age qualification for offices
2	oath affirmation	Alternative ways to pledge allegiance
3	ministers consuls	Foreign diplomatic representatives
4	think proper	Discretionary powers
5	fill vacancies	Presidential appointment power
6	service labor	Persons held to service or labor (fugitive slave clause)
7	ambassadors public	Foreign representatives
8	public ministers	Foreign representatives
9	supreme court	Highest judicial body
10	duties imposts	Types of taxes

4.4.1 Why Word Pairs Matter

These word pairs reveal key concepts and institutions established by the Constitution. For example:

“Supreme court” appears as a unit because it names a specific institution
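“Oath affirmation” reflects the choice between swearing an oath and making an affirmation
“Ministers consuls” and “ambassadors public” refer to diplomatic positions; the odd-looking pairs are what remains of “Ambassadors, other public Ministers and Consuls” once stopwords are removed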
“Duties imposts” refers to different types of taxes and tariffs

By looking at these pairs, we can identify important multi-word concepts that would be missed if we only analyzed single words. This approach helps us better understand the specific governance and legal framework the Constitution establishes.
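To make the ranking concrete, here is a hand-rolled sketch of pointwise mutual information for adjacent word pairs. It follows the usual definition, PMI = log2(P(w1, w2) / (P(w1) · P(w2))), and is meant only as an illustration: `bigram_pmi` is a hypothetical helper, the token list is invented, and a library implementation may differ in details such as how probabilities are estimated.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_freq=1):
    """Return {(w1, w2): PMI} for adjacent pairs seen at least min_freq times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), pair_count in bigrams.items():
        if pair_count < min_freq:
            continue
        # log2( P(w1, w2) / (P(w1) * P(w2)) ), with probabilities taken as counts / n
        scores[(w1, w2)] = math.log2(pair_count * n / (unigrams[w1] * unigrams[w2]))
    return scores


# Invented toy sequence: "supreme" and "court" only ever appear together,
# while "shall" also appears next to other words.
tokens = ["supreme", "court", "shall", "state", "shall", "supreme", "court", "shall"]
scores = bigram_pmi(tokens, min_freq=2)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

In this toy example “supreme court” outscores “court shall” even though both pairs occur twice, because “shall” is common elsewhere. The same effect explains why the very frequent word “shall” is absent from the top pairs in the table, and why a minimum-frequency filter is needed to keep one-off pairs from dominating.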

5 Word Frequency Distribution

This graph shows how often the top 50 words appear in the Constitution compared to each other:

The graph shows a steep drop-off after the first few words, especially “shall,” which is much more common than any other word. This pattern where a few words are used very frequently and most words are used rarely is typical of most texts.

6 Zipf’s Law Analysis

6.1 What is Zipf’s Law?

Zipf’s Law is a rule about word frequency in language. It says that the most common word appears about twice as often as the second most common word, three times as often as the third most common word, and so on. When graphed on a special chart (log-log scale), this relationship should form a straight line with a slope of -1.0.
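The straight-line claim can be checked numerically. The sketch below builds an ideal Zipf distribution (frequency proportional to 1/rank) and fits a line to log frequency against log rank with ordinary least squares. The fitting method, the base-10 logarithm, and the function name `fit_log_log` are assumptions made for illustration; the report does not state how its slope was estimated.

```python
import math


def fit_log_log(frequencies):
    """Least-squares fit of log10(frequency) against log10(rank).

    Expects at least two positive frequencies that are not all equal.
    Returns (slope, r_squared).
    """
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log10(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log10(freq) for freq in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, r_squared


# Ideal Zipf frequencies: the word at rank r appears 1000 / r times.
ideal = [1000 / rank for rank in range(1, 51)]
slope, r_squared = fit_log_log(ideal)
print(f"slope = {slope:.4f}, R-squared = {r_squared:.4f}")
```

For the ideal distribution the fitted slope comes out at -1 with an R-squared of 1, up to floating-point error. Real word counts can be passed to the same function to see how far they depart from that baseline.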

6.2 Testing Zipf’s Law on the Constitution

Now we’ll test whether the word frequencies in the Constitution follow this pattern:

	Statistic	Value
	Slope (Perfect Zipf's Law would be -1.0)	-0.6725
	R-squared (how well the data fits the line)	0.9787
	Word deviating most from Zipf's Law	shall
	How much it deviates	More frequent than expected by 2.08 times

6.3 What This Tells Us About Zipf’s Law

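The Constitution shows a flatter distribution (slope of -0.67) than predicted by Zipf’s Law (ideal slope of -1.0). This suggests that content-word usage here is more even than in typical natural language, although part of the flattening may come from our preprocessing: removing stopwords strips out the very words that normally occupy the top ranks.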

The high R-squared value (0.98) indicates that despite the different slope, the relationship between word frequency and rank still follows a power law pattern very consistently.

The word “shall” appears about 2.08 times more frequently than Zipf’s Law would predict, which makes sense given the Constitution’s purpose of establishing rules and requirements.

7 Sentiment Analysis

We can also analyze the emotional tone of the Constitution:

	Sentiment Measure	Value
	Average compound sentiment score (-1 to +1 scale)	0.2003
	Average positive sentiment	0.0725
	Average negative sentiment	0.0331
	Average neutral sentiment	0.8944

7.1 What This Sentiment Analysis Means

The Constitution has a slightly positive sentiment overall (0.20 on a scale from -1 to +1). The vast majority of the content (89%) is neutral, as expected for a legal document. There’s about twice as much positive sentiment (7%) as negative sentiment (3%), which might reflect the document’s focus on establishing rights and freedoms rather than restrictions.

This analysis uses computer algorithms to detect emotional tone in text. While this technique works well for everyday language, it should be interpreted cautiously for specialized legal documents like the Constitution, which uses language in a very specific way.

8 Conclusions

Our analysis of the US Constitution reveals several interesting linguistic patterns:

The document contains a relatively small vocabulary (844 unique content words), which makes sense for a focused legal document.
A small share of the vocabulary (just 104 unique words, or about 12% of unique content words) accounts for half of all content-word occurrences, showing the concentrated nature of the vocabulary.
The word frequency distribution follows a power law pattern similar to Zipf’s Law, but with a flatter distribution, suggesting more evenness in word usage than typical natural language.
The word “shall” dominates the text, accounting for over 8% of all content words, reflecting the document’s purpose of establishing rules and requirements.

8.1 How the Constitution’s Vocabulary Differs from General Language

The Constitution’s vocabulary is distinctive in several ways:

Specialized Terminology: It contains a much higher concentration of governmental and legal terminology (like “president,” “congress,” “senate”) than general language.
Formal Register: The document uses a formal style that avoids colloquialisms and relies on terms of obligation such as “shall.”
Historical Context: The 18th-century language includes terms and phrases that were common then but are less used today.
Purpose-Driven Vocabulary: As a document establishing governance, its vocabulary focuses on institutions, powers, procedures, and rights rather than the diverse topics found in general language.

The Constitution serves as an interesting example of how specialized documents develop their own linguistic patterns while still following some of the fundamental properties of natural language.