Project Description :

This project analyzes a complete archive of 90,000+ posts made by Donald Trump on X (Twitter) and Truth Social from 2009–2026. Using R, Python, and Bash, I clean, filter, and structure the data to focus exclusively on original, text-based posts—excluding reposts, quotes, links, images, and videos—to isolate Trump’s direct public communication.

The objective is to examine rhetorical patterns and topic trends over time through large-scale text classification. I analyze the frequency of themes such as religion, immigration, education, and economics; measure sentiment toward groups defined by ethnicity, religion, and sexual orientation; track posting behavior; and investigate which linguistic and structural features are associated with viral posts. Overall, the project integrates data wrangling, analysis, and computational text modeling to study the evolution and impact of modern political discourse.

setwd("/Users/isaiahmireles/Desktop/Trump folder")
getwd()

## [1] "/Users/isaiahmireles/Desktop/Trump folder"

df <- read.csv("filtered_trump_dat.csv")

Step 1 :

Create a .env file so that my API key is never exposed in the code.

# pip install python-dotenv

# globals().clear() # Deletes all objects in the global namespace
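A minimal sketch of this step, assuming the key lives under the name OPENAI_API_KEY (the name used later in this document) in a .env file that is kept out of version control. The helper name require_env is hypothetical. load_dotenv comes from python-dotenv and is skipped if the package is not installed.

```python
import os

try:
    # python-dotenv copies KEY=value pairs from a local .env file into os.environ
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    # Assumption: without python-dotenv, the variable was exported in the shell instead
    pass


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to .env or export it in the shell")
    return value
```

Calling require_env("OPENAI_API_KEY") before building the client makes a missing key fail immediately, rather than at the first API request.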

Date Range :

range(df$date)

## [1] "2009-05-04 18:54:25" "2026-01-08 12:42:42"

Pre-Processing :

df$text_lwr <- tolower(df$text) # convert text to lower case
df$text_lwr <- gsub("\\s+", " ", df$text_lwr) # remove extra white-space

library(stringr)


df$text_clean <-                      # Create cleaned text column
  df$text_lwr |>                    # Start with lowercase text
  str_remove_all("http\\S+") |> # Remove full URLs (http or https)
  str_remove_all("@\\w+") |>         # Remove @mentions
  str_remove_all("#\\w+") |>         # Remove hashtags
  str_remove_all("[^a-z\\s\\-–]") |>  # Remove punctuation (keep letters, spaces, hyphen, en dash)
  str_squish()                      # Trim and collapse extra whitespace

?str_remove_all

Removing URLs reduces noise: links do not reflect rhetorical content and inflate token counts.
Removing mentions (@user) reduces contextual ambiguity: usernames do not carry thematic meaning.
Removing hashtags gives cleaner topic signals: hashtags can artificially skew keyword frequency.
Removing punctuation (and other non-letter symbols) standardizes the text: it reduces superficial variation with little change in meaning.
Collapsing extra whitespace makes formatting uniform, which improves parsing and consistency.
Removing reposts/quotes keeps the focus on original speech: the goal is to analyze Trump’s own rhetoric, not redistributed content.
The repost/quote filter is not part of the code above; its result is the filtered_trump_dat.csv file read in at the start.
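For reference, the same cleaning steps can be sketched in Python with the standard re module. This is an illustrative mirror of the R pipeline above, not the code used to produce the results; the function name clean_text and the sample post are made up.

```python
import re

URL_RE = re.compile(r"http\S+")             # full URLs (http or https)
MENTION_RE = re.compile(r"@\w+")            # @mentions
HASHTAG_RE = re.compile(r"#\w+")            # hashtags
NON_LETTER_RE = re.compile(r"[^a-z\s\-–]")  # keep letters, whitespace, hyphen, en dash
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Lowercase a post, strip URLs, mentions, hashtags, and symbols, then squish whitespace."""
    text = text.lower()
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    text = HASHTAG_RE.sub("", text)
    text = NON_LETTER_RE.sub("", text)
    return WHITESPACE_RE.sub(" ", text).strip()


# Made-up example post
print(clean_text("Big WIN today! Thanks @someone #MAGA https://t.co/xyz --D"))
# prints: big win today thanks --d
```

The order matters: URLs, mentions, and hashtags are removed before the symbol filter, because that filter would otherwise delete the "@", "#", and ":" characters the earlier patterns rely on.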

Neat Pattern :

All the times the cleaned text contains the pattern “--d”:

trump_talking <- 
  df |> 
  dplyr::filter(str_detect(text_clean, "--d"))

paste("nrow(trump_talking$text_clean) : ", length(trump_talking$text_clean))

## [1] "nrow(trump_talking$text_clean) :  56"

# what he said, when
trump_talking |> dplyr::select(date, text_clean) |> dplyr::arrange(desc(date))

Write File :

write.csv(df, "preprocessed_trump_dat.csv", row.names = FALSE)
getwd()

## [1] "/Users/isaiahmireles/Desktop/Trump folder"

Text-Embeddings & Classification :

Import the data into Python

import pandas as pd
import numpy as np

df_py = pd.read_csv("preprocessed_trump_dat.csv")
df_py = df_py.drop(columns=["Unnamed: 0", "X", "X.1"], errors="ignore")
df_py["text_clean"] = df_py["text_clean"].astype(str)
df_py.info()

## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 58020 entries, 0 to 58019
## Data columns (total 16 columns):
##  #   Column          Non-Null Count  Dtype 
## ---  ------          --------------  ----- 
##  0   date            58020 non-null  object
##  1   platform        58020 non-null  object
##  2   handle          58020 non-null  object
##  3   text            58017 non-null  object
##  4   favorite_count  58020 non-null  int64 
##  5   repost_count    58020 non-null  int64 
##  6   deleted_flag    58020 non-null  bool  
##  7   word_count      58020 non-null  int64 
##  8   hashtags        6203 non-null   object
##  9   urls            14429 non-null  object
##  10  user_mentions   21724 non-null  object
##  11  media_count     58020 non-null  int64 
##  12  media_urls      3947 non-null   object
##  13  post_url        58020 non-null  object
##  14  text_lwr        58017 non-null  object
##  15  text_clean      58020 non-null  object
## dtypes: bool(1), int64(4), object(11)
## memory usage: 6.7+ MB

Data Formatting

df_py["text_clean"].eq("nan").sum() # astype(str) turned missing values into the string "nan"; how many?

## np.int64(174)

df_py["text_clean"] = df_py["text_clean"].replace("nan", np.nan)
df_py = df_py.dropna(subset=["text_clean"])

df_py["text_clean"].eq("nan").sum() # how many?

## np.int64(0)

df_py.info()

## <class 'pandas.core.frame.DataFrame'>
## Index: 57846 entries, 0 to 58019
## Data columns (total 16 columns):
##  #   Column          Non-Null Count  Dtype 
## ---  ------          --------------  ----- 
##  0   date            57846 non-null  object
##  1   platform        57846 non-null  object
##  2   handle          57846 non-null  object
##  3   text            57846 non-null  object
##  4   favorite_count  57846 non-null  int64 
##  5   repost_count    57846 non-null  int64 
##  6   deleted_flag    57846 non-null  bool  
##  7   word_count      57846 non-null  int64 
##  8   hashtags        6083 non-null   object
##  9   urls            14352 non-null  object
##  10  user_mentions   21664 non-null  object
##  11  media_count     57846 non-null  int64 
##  12  media_urls      3942 non-null   object
##  13  post_url        57846 non-null  object
##  14  text_lwr        57846 non-null  object
##  15  text_clean      57846 non-null  object
## dtypes: bool(1), int64(4), object(11)
## memory usage: 7.1+ MB

Text-Embeddings

from openai import OpenAI
from dotenv import load_dotenv
import os
import numpy as np

load_dotenv()

## True


client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

Aside : I'm using the lower-cost model (text-embedding-3-small) to embed the posts :

def get_embedding(text, model="text-embedding-3-small"):
    text = text.replace("\n", " ")
    
    response = client.embeddings.create(
        model=model,
        input=text
    )
    
    return response.data[0].embedding

Batch Embedding

summary(nchar(df$text_clean))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    62.0   100.0   146.1   158.0  2826.0

Sampling :

len(df_py)

## 57846

df_samp = df_py.sample(n=20000, random_state=42)

import time
from openai import RateLimitError
import pickle

batch_size = 250
embeddings = []

texts = df_samp["text_clean"].tolist()

for i in range(0, len(texts), batch_size):
    batch = texts[i:i+batch_size]
    
    while True:
        try:
            response = client.embeddings.create(
                model="text-embedding-3-small",
                input=batch
            )
            
            batch_embeddings = [d.embedding for d in response.data]
            embeddings.extend(batch_embeddings)
            
            print(f"Processed batch {i} to {i+batch_size}")
            time.sleep(2)
            break
            
        except RateLimitError:
            print("Rate limit hit — sleeping 30 seconds...")
            time.sleep(30)

## Processed batch 0 to 250
## Processed batch 250 to 500
## [... 76 similar lines omitted ...]
## Processed batch 19500 to 19750
## Processed batch 19750 to 20000


# SAVE AFTER COMPUTING (only when you manually run this chunk)
with open("embeddings.pkl", "wb") as f:
    pickle.dump(embeddings, f)
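The loop above retries forever on a rate limit. A hedged alternative is sketched below: the same batching, but with a bounded number of retries and exponential backoff. The function embed_in_batches and its parameters are hypothetical, and the actual API call is passed in as embed_batch so the sketch runs without network access.

```python
import time
from typing import Callable, List, Sequence, Tuple, Type


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 250,
    max_retries: int = 5,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> List[List[float]]:
    """Embed texts in order, retrying each failed batch with exponential backoff."""
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        for attempt in range(max_retries + 1):
            try:
                vectors = embed_batch(batch)
                break
            except retry_on:
                if attempt == max_retries:
                    raise  # give up instead of looping forever
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
        if len(vectors) != len(batch):
            raise ValueError("embed_batch returned a different number of vectors than texts")
        embeddings.extend(vectors)
    return embeddings
```

With the client used above, embed_batch could be a small function that calls client.embeddings.create(model="text-embedding-3-small", input=batch) and returns [d.embedding for d in response.data], with retry_on=(RateLimitError,).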

Load Embeddings

import pickle

with open("embeddings.pkl", "rb") as f:
    embeddings = pickle.load(f)

df_samp["embedding"] = embeddings

df_samp.info()

## <class 'pandas.core.frame.DataFrame'>
## Index: 20000 entries, 6023 to 33334
## Data columns (total 17 columns):
##  #   Column          Non-Null Count  Dtype 
## ---  ------          --------------  ----- 
##  0   date            20000 non-null  object
##  1   platform        20000 non-null  object
##  2   handle          20000 non-null  object
##  3   text            20000 non-null  object
##  4   favorite_count  20000 non-null  int64 
##  5   repost_count    20000 non-null  int64 
##  6   deleted_flag    20000 non-null  bool  
##  7   word_count      20000 non-null  int64 
##  8   hashtags        2081 non-null   object
##  9   urls            4949 non-null   object
##  10  user_mentions   7520 non-null   object
##  11  media_count     20000 non-null  int64 
##  12  media_urls      1347 non-null   object
##  13  post_url        20000 non-null  object
##  14  text_lwr        20000 non-null  object
##  15  text_clean      20000 non-null  object
##  16  embedding       20000 non-null  object
## dtypes: bool(1), int64(4), object(12)
## memory usage: 2.6+ MB
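Because embeddings.pkl holds a bare list, the assignment above is only correct if df_samp has exactly the same rows in exactly the same order as when the embeddings were computed. One possible safeguard, sketched below, is to store each vector under a key and look it up on reload. The function names are hypothetical, and using post_url as the key assumes it is unique per post, which I have not verified here.

```python
import pickle
from typing import Dict, List, Sequence


def save_keyed_embeddings(path: str, keys: Sequence[str], vectors: Sequence[List[float]]) -> None:
    """Pickle a {key: vector} mapping instead of a bare list."""
    if len(keys) != len(vectors):
        raise ValueError("keys and vectors must have the same length")
    if len(set(keys)) != len(keys):
        raise ValueError("keys must be unique")
    with open(path, "wb") as f:
        pickle.dump(dict(zip(keys, vectors)), f)


def load_embeddings_in_order(path: str, keys: Sequence[str]) -> List[List[float]]:
    """Return the stored vectors in the order given by keys."""
    with open(path, "rb") as f:
        by_key: Dict[str, List[float]] = pickle.load(f)
    missing = [k for k in keys if k not in by_key]
    if missing:
        raise KeyError(f"{len(missing)} keys have no stored embedding")
    return [by_key[k] for k in keys]
```

Usage would look like df_samp["embedding"] = load_embeddings_in_order("embeddings_keyed.pkl", df_samp["post_url"].tolist()), where the file name is again only an example.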

Brief Visual

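Here I use PCA and t-SNE to project the embeddings into two dimensions. Points are colored by k-means cluster (k = 6), fit on the original embedding space.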

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

# Convert embeddings
X = np.array(embeddings)

# ----- PCA -----
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# ----- t-SNE -----
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X)

# ----- Clustering (on original embedding space) -----
kmeans = KMeans(n_clusters=6, random_state=42)
labels = kmeans.fit_predict(X)

# ----- Plot -----
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# PCA
axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=labels, s=8, alpha=0.7)
axes[0].set_title("PCA Projection")
axes[0].set_xticks([])

## []

axes[0].set_yticks([])

## []

# t-SNE
axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, s=8, alpha=0.7)
axes[1].set_title("t-SNE Projection")
axes[1].set_xticks([])

## []

axes[1].set_yticks([])

## []

plt.tight_layout()
plt.show()

There appear to be some clusters; I will need to investigate them more closely later.
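One simple way to start that investigation is to list the most frequent words in each cluster. The sketch below is an assumption about how that could be done, not part of the analysis so far; top_terms_per_cluster and the tiny stop-word list are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple

# Assumption: a tiny illustrative stop-word list; a real analysis would use a fuller one
STOP_WORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "is", "we", "our", "will", "for"})


def top_terms_per_cluster(
    texts: Iterable[str],
    labels: Iterable[Hashable],
    n_terms: int = 10,
    stop_words: frozenset = STOP_WORDS,
) -> Dict[Hashable, List[Tuple[str, int]]]:
    """Count whitespace-separated tokens per cluster and return the most common ones."""
    counts: Dict[Hashable, Counter] = defaultdict(Counter)
    for text, label in zip(texts, labels):
        counts[label].update(tok for tok in text.split() if tok not in stop_words)
    return {label: counter.most_common(n_terms) for label, counter in sorted(counts.items())}
```

With the objects above, this could be called as top_terms_per_cluster(df_samp["text_clean"], labels), since the k-means labels are in the same row order as df_samp.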

list.files()

##  [1] "big_ahh_data.csv"                           
##  [2] "big_ahh_dataset.csv"                        
##  [3] "Brief_Topic_Analysis.html"                  
##  [4] "Brief_Topic_Analysis.Rmd"                   
##  [5] "df_joined.csv"                              
##  [6] "embeddings.pkl"                             
##  [7] "Epstien Files"                              
##  [8] "Exports.csv"                                
##  [9] "filtered_trump_dat.csv"                     
## [10] "GDP.csv"                                    
## [11] "GDP.numbers"                                
## [12] "GDPPERCAPITA.csv"                           
## [13] "GovExpendituresEdu.csv"                     
## [14] "IMPORTS.csv"                                
## [15] "MEDIANCPI.csv"                              
## [16] "nationwide-encounters-fy23-fy26-jan-aor.csv"
## [17] "preprocessed_trump_dat 2.csv"               
## [18] "preprocessed_trump_dat.csv"                 
## [19] "REALGDP.csv"                                
## [20] "rsconnect"                                  
## [21] "SearchingWork.csv"                          
## [22] "Trump Joining Auxiliary Variables.Rmd"      
## [23] "trump_sample_labeled 2.csv"                 
## [24] "trump_sample_labeled.csv"                   
## [25] "trump_tweets_dataset.csv"                   
## [26] "Trump-Joining-Auxiliary-Variables.html"     
## [27] "TrumpClassification_files"                  
## [28] "TrumpClassification.html"                   
## [29] "TrumpClassification.Rmd"                    
## [30] "tweet_themes_with_sentiment.csv"            
## [31] "UNRATE.csv"

Classification :

I provide example sentences for each theme, based on things we would expect Donald Trump to say.

############################################
# Define Trump-Style Theme Anchors
############################################

themes = {
    "immigration": [
        "we will finish the wall and we will stop the illegal invasion at our southern border.",
        "millions of illegal immigrants are pouring into our country because the democrats refuse to secure the border.",
        "under my leadership we had the strongest border in american history and we will bring that back.",
        "we will launch the largest deportation operation our country has ever seen.",
        "cartels, drugs, and criminals are crossing our border because the democrats want open borders.",
        "america is a sovereign nation and we will decide who enters our country, not the radical left."
    ],

    "education": [
        "our schools have been taken over by radical left ideology and we are going to take them back.",
        "parents must have the final say in what their children are taught in the classroom.",
        "we will end critical race theory and bring back patriotic education.",
        "the department of education has failed our students for decades.",
        "american children should be taught to love their country, not hate it.",
        "school choice will give every family the power to choose the best education for their kids."
    ],

    "war": [
        "we will rebuild our military so powerful that nobody will dare challenge america.",
        "under my leadership we defeated isis and restored strength to our armed forces.",
        "america will never apologize for defending its people and its interests.",
        "our enemies respect strength and they know the united states will always win.",
        "peace through strength is the only way to keep our country safe.",
        "we will protect our allies but we will never allow america to be taken advantage of."
    ],

    "crime": [
        "crime is out of control in democrat run cities and it has to stop immediately.",
        "we will restore law and order to the streets of america.",
        "radical prosecutors are letting violent criminals walk free.",
        "our police officers deserve respect and support, not attacks from the left.",
        "if you attack our communities and our police you will face serious consequences.",
        "america will be safe again when we put criminals behind bars where they belong."
    ],

    "religion": [
        "we will always defend religious liberty for every american.",
        "our nation was founded on faith and we will never forget that.",
        "the radical left wants to remove god from public life but we will never let that happen.",
        "in america we proudly say one nation under god.",
        "churches and people of faith will always have a friend in the white house.",
        "faith, family, and freedom are the foundation of this great country."
    ],

    "jobs": [
        "we are bringing american jobs back from china and other countries.",
        "no president has created more opportunity for american workers.",
        "our america first policies will put millions of people back to work.",
        "factories are reopening because companies believe in america again.",
        "we will protect american workers from unfair trade deals.",
        "the best jobs economy in history is coming back bigger than ever."
    ],

    "poverty": [
        "for decades politicians ignored the forgotten men and women of america.",
        "we are creating opportunity so people can lift themselves out of poverty.",
        "american workers deserve good paying jobs, not government dependence.",
        "inner cities have been abandoned by democrat leadership for far too long.",
        "economic growth is the best anti-poverty program ever created.",
        "we will rebuild communities that have been left behind."
    ],

    "democrats": [
        "the radical democrats want open borders, high taxes, and chaos in our streets.",
        "democrats are destroying our country with their failed policies.",
        "everywhere democrats are in charge crime goes up and quality of life goes down.",
        "the democrat party has been taken over by the radical left.",
        "they want socialism while we want freedom and prosperity.",
        "the democrats talk about unity but they only divide america."
    ],

    "Government assistance programs": [
        "government programs should help people get back on their feet, not trap them in dependency.",
        "we will reform welfare so that work is always rewarded.",
        "taxpayers deserve to know their money is being spent wisely.",
        "assistance programs must prioritize american citizens first.",
        "we will eliminate waste and fraud in government benefits.",
        "our goal is opportunity, not permanent government dependence."
    ],

    "healthcare": [
        "we will deliver affordable healthcare that actually works for the american people.",
        "obamacare has been a disaster and it must be replaced.",
        "patients should have more choice and more control over their healthcare.",
        "we will protect people with pre existing conditions while lowering costs.",
        "drug prices will come down because we will stand up to big pharma.",
        "american families deserve the best healthcare system anywhere in the world."
    ],

    "none": [
        "the fake news media never tells the truth about what is happening in our country.",
        "america is coming back stronger than anyone ever thought possible.",
        "the people of this country are incredible and they deserve great leadership.",
        "we are going to make america greater than ever before.",
        "nobody fights harder for the american people than we do.",
        "together we will restore pride, strength, and confidence in america."
    ]
}

New theme labels :

I would like to assign each post one of the following theme labels: [immigration, education, war, crime, religion, jobs, poverty, democrats, Government assistance programs, healthcare, none]

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

############################################
# 1️⃣  Create Averaged Theme Embeddings
############################################

def average_embedding(text_list):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text_list
    )
    vectors = np.array([d.embedding for d in response.data])
    return np.mean(vectors, axis=0)

theme_embeddings = {
    theme: average_embedding(sentences)
    for theme, sentences in themes.items()
}

############################################
# 2️⃣  Use Existing Dataset Embeddings
############################################

X = np.array(embeddings)

############################################
# 3️⃣  Classify via Cosine Similarity
############################################

theme_matrix = np.vstack(list(theme_embeddings.values()))
theme_names = list(theme_embeddings.keys())

similarities = cosine_similarity(X, theme_matrix)

labels = [theme_names[i] for i in similarities.argmax(axis=1)]
confidence = similarities.max(axis=1)

############################################
# 4️⃣  Attach Results
############################################

df_samp["theme_label"] = labels
df_samp["confidence"] = confidence

df_samp[["text_clean", "theme_label", "confidence"]].head()

##                                               text_clean  ... confidence
## 6023   yesterdays results show trump s course was alr...  ...   0.342245
## 2528   there is no substitute for hard work --thomas ...  ...   0.330769
## 45057  trump derides drug and human trafficking boom ...  ...   0.526055
## 29515  russia talk is fake news put out by the dems a...  ...   0.360855
## 8373                      you have a cunty demeanor true  ...   0.215996
## 
## [5 rows x 3 columns]
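To make the method concrete, here is a toy version of the same idea using only the standard library: average the anchor vectors for each theme, then label a post with the theme whose average is closest by cosine similarity. The 2-D vectors are made up for illustration (real embeddings have many more dimensions), and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(vector: Sequence[float], theme_vectors: Dict[str, List[float]]) -> Tuple[str, float]:
    """Return the best-matching theme and its cosine similarity."""
    scores = {theme: cosine(vector, centroid) for theme, centroid in theme_vectors.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Made-up anchor "embeddings" for two themes
anchors = {
    "immigration": [[1.0, 0.0], [0.8, 0.2]],
    "jobs": [[0.0, 1.0], [0.2, 0.8]],
}
theme_vectors = {theme: mean_vector(vecs) for theme, vecs in anchors.items()}
label, score = classify([1.0, 0.1], theme_vectors)
print(label)  # prints: immigration
```

Note that every post gets a label under this scheme, including “none”, which is treated as just another theme with its own anchors.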

df_samp.info()

## <class 'pandas.core.frame.DataFrame'>
## Index: 20000 entries, 6023 to 33334
## Data columns (total 19 columns):
##  #   Column          Non-Null Count  Dtype  
## ---  ------          --------------  -----  
##  0   date            20000 non-null  object 
##  1   platform        20000 non-null  object 
##  2   handle          20000 non-null  object 
##  3   text            20000 non-null  object 
##  4   favorite_count  20000 non-null  int64  
##  5   repost_count    20000 non-null  int64  
##  6   deleted_flag    20000 non-null  bool   
##  7   word_count      20000 non-null  int64  
##  8   hashtags        2081 non-null   object 
##  9   urls            4949 non-null   object 
##  10  user_mentions   7520 non-null   object 
##  11  media_count     20000 non-null  int64  
##  12  media_urls      1347 non-null   object 
##  13  post_url        20000 non-null  object 
##  14  text_lwr        20000 non-null  object 
##  15  text_clean      20000 non-null  object 
##  16  embedding       20000 non-null  object 
##  17  theme_label     20000 non-null  object 
##  18  confidence      20000 non-null  float64
## dtypes: bool(1), float64(1), int64(4), object(13)
## memory usage: 2.9+ MB

Export Data

df_samp_no_embed = df_samp.drop(columns=["embedding"]) # dont inc. embeddings

df_samp_no_embed.to_csv(
    "trump_sample_labeled.csv",
    index=False
)

df <- read.csv("trump_sample_labeled.csv")
unique(df$theme_label) # note: "none" is one of the labels

##  [1] "none"                           "immigration"                   
##  [3] "democrats"                      "crime"                         
##  [5] "poverty"                        "religion"                      
##  [7] "Government assistance programs" "healthcare"                    
##  [9] "jobs"                           "war"                           
## [11] "education"

colnames(df)

##  [1] "date"           "platform"       "handle"         "text"          
##  [5] "favorite_count" "repost_count"   "deleted_flag"   "word_count"    
##  [9] "hashtags"       "urls"           "user_mentions"  "media_count"   
## [13] "media_urls"     "post_url"       "text_lwr"       "text_clean"    
## [17] "theme_label"    "confidence"

write.csv(df, "big_ahh_dataset.csv")

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

df_with_confidence <- df %>%
  mutate(
    correct = NA  # only include if you DO have ground truth later
  )

confidence_summary <- df %>%
  group_by(theme_label) %>%
  summarize(
    n = n(),
    mean_confidence = mean(confidence, na.rm = TRUE),
    sd_confidence = sd(confidence, na.rm = TRUE),
    min_confidence = min(confidence, na.rm = TRUE),
    max_confidence = max(confidence, na.rm = TRUE)
  ) %>%
  arrange(desc(mean_confidence))

confidence_summary

Overall Pattern

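Model confidence is consistently low across all themes (mean range: 0.23–0.29).
No theme exceeds an average confidence of 0.30.
Note that “confidence” here is the cosine similarity between a post and its best-matching averaged theme embedding, not a calibrated probability.
Low values suggest most posts are only loosely similar to any theme's anchor sentences, so label assignments are uncertain.
Confidence levels are relatively compressed, indicating limited separation between themes.
This may reflect thematic overlap (e.g., economics vs. immigration) or anchor sentences that do not closely match how the posts are actually worded.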
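Separation between themes can be checked more directly by looking at the gap between the best and second-best similarity for each post. The sketch below is one possible diagnostic, not something computed in this analysis; the function names are hypothetical and the 0.02 cutoff is an arbitrary illustrative value.

```python
from typing import List, Sequence


def top_two_margin(similarities: Sequence[float]) -> float:
    """Gap between the best and second-best similarity for one post."""
    if len(similarities) < 2:
        raise ValueError("need at least two theme similarities")
    best, second = sorted(similarities, reverse=True)[:2]
    return best - second


def ambiguous_indices(similarity_rows: Sequence[Sequence[float]], min_margin: float = 0.02) -> List[int]:
    """Indices of posts whose top two themes are closer than min_margin."""
    return [
        i for i, row in enumerate(similarity_rows)
        if top_two_margin(row) < min_margin
    ]
```

Applied to the rows of the similarities matrix from the classification step, a small margin marks posts where the assigned label barely beats the runner-up, regardless of how high or low the top similarity is.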

library(ggplot2)

ggplot(confidence_summary,
       aes(x = reorder(theme_label, mean_confidence),
           y = mean_confidence)) +
  geom_col() +
  coord_flip() +
  labs(title = "Average Model Confidence by Theme",
       x = "Theme",
       y = "Mean Confidence") +
  theme_minimal()

Top 5 Highest Confidence Overall

library(dplyr)

highest <- 
  df %>%
  arrange(desc(confidence)) %>%
  select(theme_label, confidence, text) %>%
  slice_head(n = 5)

highest

High-Confidence Examples Per Theme

top3_per_theme <-
  df %>%
  group_by(theme_label) %>%
  arrange(desc(confidence), .by_group = TRUE) %>%
  slice_head(n = 3) %>%
  select(theme_label, confidence, text)
top3_per_theme

Very High Confidence (e.g., > 0.60)

very_high <-
  df %>%
  filter(confidence > 0.60) %>%
  arrange(desc(confidence)) %>%
  select(theme_label, confidence, text)
very_high

High-Confidence Classification Patterns

Overall Observations

High-confidence predictions are heavily concentrated in Immigration-related content.
Messaging is:
- Direct
- Repetitive
- Issue-specific (e.g., “Border”, “Wall”, “Illegal Immigration”)
Language is emotionally charged and declarative.
Strong use of capitalized emphasis (e.g., “INVADED”, “MUST”, “WALL”).
Clear thematic signal reduces ambiguity for the classifier.

Immigration Theme

Common patterns:
- Frequent use of terms like:
  - “Border”
  - “Wall”
  - “Illegal immigrants”
  - “Invasion”
  - “Crime”
- Framing around:
  - National security
  - Law enforcement
  - Threat narratives
- Repetition across posts strengthens model certainty.

Interpretation:
- Immigration has highly distinctive vocabulary.
- Minimal overlap with other themes.
- Produces the highest confidence scores overall.

Economics Theme

Common patterns:
- Strong economic performance framing:
  - “Lowest unemployment”
  - “Jobs”
  - “Stock Market”
  - “Tariffs”
  - “Trade deals”
- Optimistic and achievement-focused tone.
- Structured economic indicators (jobs numbers, growth, profits).

Interpretation:
- Clear economic keywords improve confidence.
- Less emotionally intense than immigration but still distinct.

Religion Theme

Common patterns:
- References to:
  - “Faith”
  - “God”
  - “Prayer”
  - “Nation under God”
- Patriotic-religious framing.
- Less policy-heavy, more symbolic language.

Interpretation:
- Distinct vocabulary but lower intensity and repetition.
- Moderate confidence levels.

Education Theme

Common patterns:
- Mentions of:
  - “Education system”
  - “Schools”
  - “Indoctrination”
  - “Patriotic education”
- Often overlaps with cultural or political framing.

Interpretation:
- More thematic overlap with culture/politics.
- Lower confidence due to ambiguity.

Homelessness Theme

Common patterns:
- References to:
  - “Cities”
  - “Law and Order”
  - “Capital”
  - “Homeless”
- Often embedded within broader crime or governance narratives.

Interpretation:
- Less frequent and less distinct vocabulary.
- Strong thematic overlap with crime and immigration.
- Lowest overall confidence levels.

Cross-Theme Patterns in High Confidence Posts

Strong issue ownership language.
Clear single-topic focus.
Minimal mixed themes.
Repetition of signature campaign messaging.
Use of emotionally intensified rhetoric.

Key Takeaway

High-confidence classifications occur when:
- Vocabulary is highly specific to one theme.
- Messaging is repetitive and consistent.
- The post contains strong, unambiguous political framing.

Lower confidence appears when:
- Themes overlap.
- Language is general or symbolic.
- The signal is less distinctive.

unique(df$theme_label)

##  [1] "none"                           "immigration"                   
##  [3] "democrats"                      "crime"                         
##  [5] "poverty"                        "religion"                      
##  [7] "Government assistance programs" "healthcare"                    
##  [9] "jobs"                           "war"                           
## [11] "education"

# NOTE: "economics" is not among the theme labels printed above, so this filter returns zero rows
econ <-
  df |> 
  filter(theme_label == "economics") |> 
  arrange(desc(confidence)) |> 
  select(theme_label,confidence, text, post_url) |> 
  filter(confidence > .565)

immigration <-
  df |> 
  filter(theme_label == "immigration") |> 
  arrange(desc(confidence)) |> 
  select(theme_label,confidence, text, post_url) |> 
  filter(confidence > .63)

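# NOTE: "homelessness" is not among the theme labels printed above, so this filter returns zero rows
homelessness <-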
  df |> 
  filter(theme_label == "homelessness") |> 
  arrange(desc(confidence)) |> 
  select(theme_label,confidence, text, post_url) |> 
  filter(confidence > .38)

religion <-
  df |> 
  filter(theme_label == "religion") |> 
  arrange(desc(confidence)) |> 
  select(theme_label,confidence, text, post_url) |> 
  filter(confidence > .48)

education <-
  df |> 
  filter(theme_label == "education") |> 
  arrange(desc(confidence)) |> 
  select(theme_label,confidence, text, post_url) |> 
  filter(confidence > .4)

write.csv(df, "big_ahh_data.csv")
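The five filters above repeat the same steps with a different label and cutoff. A table-driven version is sketched below in Python, with a check that every requested label actually exists in the data. The function name is hypothetical, rows are plain dictionaries rather than a data frame, and the cutoffs in the example are the ones used above.

```python
from typing import Dict, Iterable, List, Mapping


def high_confidence_by_theme(
    rows: Iterable[Mapping[str, object]],
    thresholds: Mapping[str, float],
) -> Dict[str, List[Mapping[str, object]]]:
    """For each theme, keep rows above its cutoff, sorted by descending confidence."""
    rows = list(rows)
    known = {row["theme_label"] for row in rows}
    unknown = sorted(set(thresholds) - known)
    if unknown:
        # Fail loudly instead of silently returning an empty result
        raise ValueError(f"thresholds reference labels not in the data: {unknown}")
    selected: Dict[str, List[Mapping[str, object]]] = {}
    for theme, cutoff in thresholds.items():
        hits = [r for r in rows if r["theme_label"] == theme and r["confidence"] > cutoff]
        selected[theme] = sorted(hits, key=lambda r: r["confidence"], reverse=True)
    return selected


# Cutoffs taken from the filters above; the rows are made up
example_rows = [
    {"theme_label": "immigration", "confidence": 0.70},
    {"theme_label": "immigration", "confidence": 0.50},
    {"theme_label": "religion", "confidence": 0.55},
]
picked = high_confidence_by_theme(example_rows, {"immigration": 0.63, "religion": 0.48})
```

With pandas, the rows could come from df.to_dict("records"). A label that is absent from the data raises an error here, whereas an equality filter would quietly return nothing.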

df

Text Classification Using ChatGPT

Isaiah C. Mireles

2026-02-28