Week Eight - High Frequency Words and the US Constitution
1 Overview
This analysis examines word frequencies in the US Constitution to understand linguistic patterns and how they compare to general language patterns. We’ve processed the text by removing stopwords (common words like “the” and “and”) and analyzed the remaining content words. We also treated multi-word phrases like “United States” as single entities to better capture their meaning.
Our analysis includes visualizations like word clouds, frequency distributions, and statistical tests to reveal insights about this foundational document’s language structure and patterns.
2 Document Statistics
Let’s first look at the basic numbers that describe the US Constitution text:
Statistic | Value |
---|---|
Total characters | 27,000 |
Total sentences | 118 |
Total words | 4,414 |
Total stopwords | 2,164 (49.0%) |
Total content words (non-stopwords) | 2,250 (51.0%) |
Total unique words (including stopwords) | 930 |
Total unique content words (excluding stopwords) | 844 |
Average word length | 4.7 characters |
Average sentence length (words) | 37.4 |
Average sentence length (characters) | 225.8 |
2.1 What These Numbers Tell Us
- The Constitution is a fairly concise document with around 4,400 words.
- About half the words are common stopwords, and half are meaningful content words.
- There are fewer than 1,000 unique words in the entire document, showing limited vocabulary diversity.
- Sentences are quite long (about 37 words per sentence), reflecting the formal legal writing style of the time.
- The average word length is under 5 characters, showing that even in formal documents, shorter words dominate.
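As a rough illustration of how statistics like these can be computed, here is a minimal sketch in plain Python. The stopword set, the phrase table, and the helper names `tokenize` and `basic_stats` are assumptions made for this example; the original analysis most likely used a full stopword list and its own tokenizer, so its numbers will differ.

```python
import re

# Tiny illustrative stopword list (assumption); a real run would use a full list.
STOPWORDS = {"the", "of", "and", "to", "in", "a", "be", "or", "for", "by", "we", "this", "do"}
# Multi-word phrases to treat as single tokens (assumption).
PHRASES = {"united states": "united_states"}


def tokenize(text):
    """Lowercase the text, merge known phrases, and return alphabetic tokens."""
    text = text.lower()
    for phrase, token in PHRASES.items():
        text = text.replace(phrase, token)
    return re.findall(r"[a-z_]+", text)


def basic_stats(text):
    """Return a dictionary of simple corpus statistics."""
    words = tokenize(text)
    content = [w for w in words if w not in STOPWORDS]
    # Naive sentence split on terminal punctuation; empty pieces are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "total_words": len(words),
        "stopwords": len(words) - len(content),
        "content_words": len(content),
        "unique_words": len(set(words)),
        "unique_content_words": len(set(content)),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": len(words) / len(sentences),
    }


sample = "We the People of the United States do ordain this Constitution. The Congress shall make laws."
stats = basic_stats(sample)
print(stats)
```

Merging phrases before tokenizing is what lets “United States” count as one word instead of two.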
2.2 Word Length Distribution
Word Length | Count (% of all words) |
---|---|
Short words (1-4 letters) | 2,402 (54.4%) |
Medium words (5-8 letters) | 1,579 (35.8%) |
Long words (9+ letters) | 433 (9.8%) |
Over half of the words in the Constitution are short (1-4 letters). This is typical of English, where common words like “the,” “of,” “to,” and “in” are very short. Only about 10% of words are long (9+ letters), which are usually specialized terms.
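The bucketing behind a word length table like this one can be sketched in a few lines. The function name `length_buckets` and the sample words are illustrative assumptions, not the original code.

```python
def length_buckets(words):
    """Count words in short (1-4), medium (5-8), and long (9+) letter buckets."""
    buckets = {"short": 0, "medium": 0, "long": 0}
    for w in words:
        n = len(w)
        if n <= 4:
            buckets["short"] += 1
        elif n <= 8:
            buckets["medium"] += 1
        else:
            buckets["long"] += 1
    return buckets


# A handful of sample words, for illustration only
words = ["the", "of", "shall", "congress", "representatives", "law"]
buckets = length_buckets(words)
shares = {name: round(100 * count / len(words), 1) for name, count in buckets.items()}
print(buckets, shares)
```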
3 Word Cloud Visualization
A word cloud is a picture where the size of each word shows how often it appears in the text. Bigger words appear more frequently in the document.
In this word cloud, we can immediately see that “shall” is by far the most common word, showing how the Constitution is focused on establishing rules and requirements. Words like “state,” “president,” “congress,” and “law” are also prominent, highlighting the document’s focus on governance and legal frameworks.
4 Top Words Analysis
This section is divided into four parts that explore different aspects of the word patterns in the Constitution. We’ll look at the most frequent individual words, how many words are needed to cover half the text, how words can be grouped by meaning, and which words commonly appear together.
4.1 Most Frequent Words in the Constitution
When we remove common stopwords like “the” and “and,” what words appear most often in the Constitution? Here are the top 20:
Code for Most Frequent Words
import pandas as pd

# Create a DataFrame for top 20 words
# (top_200 and total_content_words are computed earlier in the analysis)
top_20_df = pd.DataFrame(top_200[:20], columns=['Word', 'Count'])
top_20_df['Percentage'] = (top_20_df['Count'] / total_content_words * 100).round(2)
top_20_df['Percentage'] = top_20_df['Percentage'].astype(str) + '%'
top_20_df.index = range(1, len(top_20_df) + 1)  # Start index at 1
top_20_df
Rank | Word | Count | Percentage |
---|---|---|---|
1 | shall | 191 | 8.49% |
2 | state | 48 | 2.13% |
3 | may | 33 | 1.47% |
4 | president | 32 | 1.42% |
5 | congress | 29 | 1.29% |
6 | states | 27 | 1.2% |
7 | house | 23 | 1.02% |
8 | law | 23 | 1.02% |
9 | section | 22 | 0.98% |
10 | one | 19 | 0.84% |
11 | office | 19 | 0.84% |
12 | senate | 17 | 0.76% |
13 | person | 16 | 0.71% |
14 | two | 16 | 0.71% |
15 | time | 16 | 0.71% |
16 | constitution | 15 | 0.67% |
17 | representatives | 15 | 0.67% |
18 | years | 12 | 0.53% |
19 | thereof | 12 | 0.53% |
20 | power | 12 | 0.53% |
4.1.1 What This Tells Us About the Constitution
The word “shall” dominates the document, appearing 191 times and making up over 8% of all meaningful words. This makes sense because the Constitution is establishing rules and duties:
- “The President shall…”
- “Congress shall not…”
- “Each State shall appoint…”
The high frequency of government terms like “president,” “congress,” “states,” and “senate” reflects the document’s focus on establishing government structure. Legal terms like “law” and “constitution” are also common, showing the document’s purpose in establishing a legal framework.
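The ranking in this section can be reproduced in miniature with the standard library. This is a sketch under assumptions: `top_words` is a hypothetical helper and the word list is a toy sample. The name `fdist` in the original code suggests an NLTK `FreqDist`, which is a `Counter` subclass and exposes the same `most_common` method used here.

```python
from collections import Counter


def top_words(content_words, n=20):
    """Return (word, count, percentage) rows for the n most frequent words."""
    counts = Counter(content_words)
    total = sum(counts.values())
    return [(w, c, round(100 * c / total, 2)) for w, c in counts.most_common(n)]


# Toy list of content words, for illustration only
content_words = ["shall", "state", "shall", "president", "shall", "state", "law", "may"]
rows = top_words(content_words, n=3)
for word, count, pct in rows:
    print(f"{word}: {count} ({pct}%)")
```

Note that the percentage is taken over all content words, which is why the 191 occurrences of “shall” come to 8.49% of 2,250 rather than a share of the full 4,414 words.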
4.2 Word Coverage Analysis
An interesting question in linguistics is: how many different words do you need to cover a large portion of a text? Here’s what we found for the Constitution:
Code for Word Coverage Analysis
# Calculate words needed for half the corpus
word_counts = fdist.most_common()
cumulative_count = 0
i = 0
half_corpus_words = []

for word, count in word_counts:
    cumulative_count += count
    i += 1
    half_corpus_words.append(word)
    if cumulative_count >= total_content_words / 2:
        break

# Create DataFrame for this information
half_corpus_data = {
    'Statistic': ['Number of unique words representing half of the corpus',
                  'Percentage of all unique content words',
                  'Words needed to cover 50% of the corpus'],
    'Value': [f"{i:,}",
              f"{i/unique_content_words*100:.1f}%",
              ', '.join(half_corpus_words[:10]) + '...']
}
half_corpus_df = pd.DataFrame(half_corpus_data)
half_corpus_df.index = [''] * len(half_corpus_df)  # Remove index numbers
half_corpus_df
Statistic | Value |
---|---|
Number of unique words representing half of th... | 104 |
Percentage of all unique content words | 12.3% |
Words needed to cover 50% of the corpus | shall, state, may, president, congress, states... |
4.2.1 What This Word Coverage Reveals
Just 104 unique words (about 12% of all unique content words) account for half of all word occurrences in the Constitution. This illustrates what is often called the Pareto principle in language: a small subset of words does most of the work in a text.
This pattern is typical in most languages and texts, but the Constitution has a particularly concentrated vocabulary because it’s a focused legal document with a specific purpose.
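The coverage calculation generalizes to any target fraction. The sketch below is a self-contained variant of the idea; `words_for_coverage` and the toy word list are assumptions for illustration.

```python
from collections import Counter


def words_for_coverage(content_words, fraction=0.5):
    """Return the most frequent words needed to cover `fraction` of all occurrences."""
    counts = Counter(content_words)
    target = fraction * sum(counts.values())
    covered = 0
    selected = []
    for word, count in counts.most_common():
        covered += count
        selected.append(word)
        if covered >= target:
            break
    return selected


# Toy corpus: 12 occurrences spread over 5 distinct words
toy = ["shall"] * 5 + ["state"] * 3 + ["may"] * 2 + ["law", "power"]
half = words_for_coverage(toy)
print(half, f"{len(half) / len(set(toy)):.0%} of the vocabulary")
```

Calling the same function with `fraction=0.8` or `0.9` shows how quickly the required vocabulary grows once the frequent words are used up.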
4.3 Semantic Categories of Words
Words can be grouped into categories based on their meaning and function. We analyzed the top 50 words in the Constitution and sorted them into these categories:
Code for Semantic Category Analysis
# Define categories
governance_terms = ['president', 'congress', 'senate', 'united_states', 'representatives', 'house', 'legislative', 'executive']
legal_terms = ['law', 'constitution', 'cases', 'states', 'power', 'powers', 'amendment', 'rights']
procedural_terms = ['shall', 'may', 'provided', 'appointed', 'elected', 'chosen']

# Count occurrences of categories in top 50
top_50_words = [word for word, _ in top_200[:50]]
governance_count = sum(1 for word in top_50_words if word in governance_terms)
legal_count = sum(1 for word in top_50_words if word in legal_terms)
procedural_count = sum(1 for word in top_50_words if word in procedural_terms)
other_count = 50 - governance_count - legal_count - procedural_count

# Create DataFrame for categories
categories_data = {
    'Category': ['Governance terms', 'Legal terms', 'Procedural terms', 'Other terms'],
    'Count': [governance_count, legal_count, procedural_count, other_count],
    'Examples': [
        '"president", "congress", "senate"',
        '"law", "constitution", "rights"',
        '"shall", "may", "provided"',
        'Various other words'
    ],
    'What They Show': [
        'Focus on establishing government structure',
        'Emphasis on legal foundations and powers',
        'Instructions and requirements for how government functions',
        'Other concepts not in the main categories'
    ]
}
categories_df = pd.DataFrame(categories_data)
categories_df.index = [''] * len(categories_df)  # Remove index numbers
categories_df
Category | Count | Examples | What They Show |
---|---|---|---|
Governance terms | 6 | "president", "congress", "senate" | Focus on establishing government structure |
Legal terms | 5 | "law", "constitution", "rights" | Emphasis on legal foundations and powers |
Procedural terms | 2 | "shall", "may", "provided" | Instructions and requirements for how governme... |
Other terms | 37 | Various other words | Other concepts not in the main categories |
4.3.1 The Importance of These Categories
These word categories reveal the Constitution’s key priorities:
- Governance Terms establish the branches and offices of government
- Legal Terms create the foundation of laws and rights
- Procedural Terms set rules for how the government must operate
The distribution shows that while these specialized terms are important, the majority of frequently used words fall outside these categories, reflecting the document’s need to communicate using general language alongside specialized terminology.
4.4 Common Word Pairs (Collocations)
Some words frequently appear together, forming meaningful phrases. These pairs (called “collocations”) often represent important concepts:
Code for Word Collocation Analysis
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Find common word pairs (bigrams)
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(content_words)
finder.apply_freq_filter(3)  # only bigrams that appear 3+ times
top_bigrams = finder.nbest(bigram_measures.pmi, 10)

# Create DataFrame for bigrams with explanation
# (the labels are written by hand and must follow the order of top_bigrams)
bigrams_data = {
    'Word Pair': [f"{bigram[0]} {bigram[1]}" for bigram in top_bigrams],
    'Significance': [
        'Age qualification for offices',
        'Alternative ways to pledge allegiance',
        'Foreign diplomatic representatives',
        'Discretionary powers',
        'Presidential appointment power',
        'Persons held to service or labor (Fugitive Slave Clause)',
        'Foreign representatives',
        'Foreign representatives',
        'Highest judicial body',
        'Types of taxes'
    ]
}
bigrams_df = pd.DataFrame(bigrams_data)
bigrams_df.index = range(1, len(bigrams_df) + 1)  # Start index at 1
bigrams_df
Rank | Word Pair | Significance |
---|---|---|
1 | attained age | Age qualification for offices |
2 | oath affirmation | Alternative ways to pledge allegiance |
3 | ministers consuls | Foreign diplomatic representatives |
4 | think proper | Discretionary powers |
5 | fill vacancies | Presidential appointment power |
6 | service labor | Persons held to service or labor (Fugitive Slave Clause) |
7 | ambassadors public | Foreign representatives |
8 | public ministers | Foreign representatives |
9 | supreme court | Highest judicial body |
10 | duties imposts | Types of taxes |
4.4.1 Why Word Pairs Matter
These word pairs reveal key concepts and institutions established by the Constitution. For example:
- “Supreme court” appears as a unit because it names a specific institution
- “Oath affirmation” reflects the freedom to pledge in different ways
- “Ministers consuls” and “ambassadors public” refer to diplomatic positions
- “Duties imposts” refers to different types of taxes and tariffs
By looking at these pairs, we can identify important multi-word concepts that would be missed if we only analyzed single words. This approach helps us better understand the specific governance and legal framework the Constitution establishes.
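The pairs above are ranked by pointwise mutual information (PMI), which compares how often two words appear side by side with how often they would be expected to if they were independent. The sketch below is a plain Python version of the usual count-based estimate. The function name `top_pmi_bigrams` and the toy word list are assumptions, and the code is not claimed to reproduce NLTK's scores exactly.

```python
import math
from collections import Counter


def top_pmi_bigrams(words, min_freq=3, n=10):
    """Rank adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    total = len(words)
    scored = []
    for (w1, w2), pair_count in bigrams.items():
        if pair_count < min_freq:
            continue  # skip rare pairs, like the 3+ filter in the analysis
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) ), estimated from raw counts
        pmi = math.log2(pair_count * total / (unigrams[w1] * unigrams[w2]))
        scored.append(((w1, w2), pmi))
    return sorted(scored, key=lambda item: item[1], reverse=True)[:n]


# Toy sequence of content words, for illustration only
toy = ["supreme", "court", "shall", "supreme", "court", "shall",
       "supreme", "court", "state", "shall", "state", "shall"]
result = top_pmi_bigrams(toy, min_freq=3)
print(result)
```

PMI rewards pairs whose words rarely occur apart, which is why a pair such as “attained age” can outrank pairs built from very frequent words like “shall”. The minimum frequency filter keeps one-off pairs from dominating the list.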
5 Word Frequency Distribution
This graph shows how often the top 50 words appear in the Constitution compared to each other:
The graph shows a steep drop-off after the first few words, especially “shall,” which is much more common than any other word. This pattern where a few words are used very frequently and most words are used rarely is typical of most texts.
6 Zipf’s Law Analysis
6.1 What is Zipf’s Law?
Zipf’s Law is a rule about word frequency in language. It says that the most common word appears about twice as often as the second most common word, three times as often as the third most common word, and so on. When graphed on a special chart (log-log scale), this relationship should form a straight line with a slope of -1.0.
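Written as a formula, the rule described above looks like this. The symbols are introduced here for illustration: f(r) is the frequency of the word at rank r, C is a constant, and s is the exponent, equal to 1 in the classic form of the law.

```latex
f(r) \approx \frac{C}{r^{s}}
\quad\Longrightarrow\quad
\log f(r) \approx \log C - s \log r
```

Taking logarithms turns the curve into a straight line whose slope is -s, which is why a log-log plot is used and why the ideal slope is -1.0.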
6.2 Testing Zipf’s Law on the Constitution
Now we’ll test whether the word frequencies in the Constitution follow this pattern:
Statistic | Value |
---|---|
Slope (Perfect Zipf's Law would be -1.0) | -0.6725 |
R-squared (how well the data fits the line) | 0.9787 |
Word deviating most from Zipf's Law | shall |
How much it deviates | More frequent than expected by 2.08 times |
6.3 What This Tells Us About Zipf’s Law
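The Constitution shows a flatter distribution (slope of -0.67) than predicted by Zipf’s Law (ideal slope of -1.0). This suggests more evenness in word usage than typical natural language. Part of this flattening may come from the preprocessing itself: removing stopwords takes out the very high-frequency words that normally form the steep head of the curve.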
The high R-squared value (0.98) indicates that despite the different slope, the relationship between word frequency and rank still follows a power law pattern very consistently.
The word “shall” appears about 2.08 times more frequently than Zipf’s Law would predict, which makes sense given the Constitution’s purpose of establishing rules and requirements.
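A test like this one can be sketched as an ordinary least-squares fit of log frequency against log rank. The version below uses only the standard library. The function names, the choice of base-10 logarithms, and the reading of “deviation” as observed frequency divided by fitted frequency are all assumptions, since the original code for this section is not shown.

```python
import math


def fit_zipf(frequencies):
    """Least-squares fit of log10(frequency) against log10(rank).

    Returns (slope, intercept, r_squared).
    """
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log10(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log10(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


def deviation_ratio(frequency, rank, slope, intercept):
    """Observed frequency divided by the frequency the fitted line predicts."""
    predicted = 10 ** (intercept + slope * math.log10(rank))
    return frequency / predicted


# Synthetic frequencies that follow an ideal Zipf curve: f(r) = 1200 / r
ideal = [1200 / r for r in range(1, 11)]
slope, intercept, r2 = fit_zipf(ideal)
print(round(slope, 4), round(r2, 4))
```

On real word counts the slope and R-squared will differ from this synthetic case, and a ratio above 1 from `deviation_ratio` marks a word that is more frequent than the fitted line predicts.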
7 Sentiment Analysis
We can also analyze the emotional tone of the Constitution:
Sentiment Measure | Value |
---|---|
Average compound sentiment score (-1 to +1 scale) | 0.2003 |
Average positive sentiment | 0.0725 |
Average negative sentiment | 0.0331 |
Average neutral sentiment | 0.8944 |
7.1 What This Sentiment Analysis Means
The Constitution has a slightly positive sentiment overall (0.20 on a scale from -1 to +1). The vast majority of the content (89%) is neutral, as expected for a legal document. There’s about twice as much positive sentiment (7%) as negative sentiment (3%), which might reflect the document’s focus on establishing rights and freedoms rather than restrictions.
This analysis uses computer algorithms to detect emotional tone in text. While this technique works well for everyday language, it should be interpreted cautiously for specialized legal documents like the Constitution, which uses language in a very specific way.
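The four measures in the table match the keys returned by NLTK's VADER analyzer (`compound`, `pos`, `neg`, `neu`), although the source does not name the tool. Assuming each sentence was scored separately and the results were then averaged, the aggregation step might look like the sketch below. The per-sentence scores are made up for illustration and are not results from the Constitution.

```python
def average_scores(sentence_scores):
    """Average each sentiment measure across a list of per-sentence score dicts."""
    keys = ("compound", "pos", "neg", "neu")
    n = len(sentence_scores)
    return {k: sum(s[k] for s in sentence_scores) / n for k in keys}


# Made-up scores for three sentences, for illustration only
scores = [
    {"compound": 0.0, "pos": 0.0, "neg": 0.0, "neu": 1.0},
    {"compound": 0.6, "pos": 0.2, "neg": 0.0, "neu": 0.8},
    {"compound": -0.3, "pos": 0.0, "neg": 0.1, "neu": 0.9},
]
averages = average_scores(scores)
print(averages)
```

Because the compound score is computed per sentence on its own scale, its average does not have to equal the average positive share minus the average negative share.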
8 Conclusions
Our analysis of the US Constitution reveals several interesting linguistic patterns:
- The document contains a relatively small vocabulary (844 unique content words), which makes sense for a focused legal document.
- A very small share of the vocabulary (just 104 unique words, or 12%) accounts for half of all word occurrences, showing the concentrated nature of the vocabulary.
- The word frequency distribution follows a power law pattern similar to Zipf’s Law, but with a flatter distribution, suggesting more evenness in word usage than typical natural language.
- The word “shall” dominates the text, accounting for over 8% of all content words, reflecting the document’s purpose of establishing rules and requirements.
8.1 How the Constitution’s Vocabulary Differs from General Language
The Constitution’s vocabulary is distinctive in several ways:
- Specialized Terminology: It contains a much higher concentration of governmental and legal terminology (like “president,” “congress,” “senate”) than general language.
- Formal Register: The document uses a formal style lacking personal pronouns and colloquialisms while featuring terms of obligation such as “shall.”
- Historical Context: The 18th-century language includes terms and phrases that were common then but are less used today.
- Purpose-Driven Vocabulary: As a document establishing governance, its vocabulary focuses on institutions, powers, procedures, and rights rather than the diverse topics found in general language.
The Constitution serves as an interesting example of how specialized documents develop their own linguistic patterns while still following some of the fundamental properties of natural language.