This project analyzes a complete archive of 90,000+ posts made by Donald Trump on X (Twitter) and Truth Social from 2009–2026. Using R, Python, and Bash, I clean, filter, and structure the data to focus exclusively on original, text-based posts—excluding reposts, quotes, links, images, and videos—to isolate Trump’s direct public communication.
The objective is to examine rhetorical patterns and topic trends over time through large-scale text classification. I analyze the frequency of themes such as religion, immigration, education, and economics; measure sentiment toward groups defined by ethnicity, religion, and sexual orientation; track posting behavior; and investigate which linguistic and structural features are associated with viral posts. Overall, the project integrates data wrangling, analysis, and computational text modeling to study the evolution and impact of modern political discourse.
setwd("/Users/isaiahmireles/Desktop/Trump folder")
getwd()
## [1] "/Users/isaiahmireles/Desktop/Trump folder"
df <- read.csv("filtered_trump_dat.csv")
Step 1 :
# pip install python-dotenv
# globals().clear() # Deletes all objects in the global namespace
range(df$date)
## [1] "2009-05-04 18:54:25" "2026-01-08 12:42:42"
df$text_lwr <- tolower(df$text) # lowercase the text
df$text_lwr <- gsub("\\s+", " ", df$text_lwr) # remove extra white-space
library(stringr)
df$text_clean <- # Create cleaned text column
df$text_lwr |> # Start with lowercase text
str_remove_all("http\\S+") |> # Remove full URLs (http or https)
str_remove_all("@\\w+") |> # Remove @mentions
str_remove_all("#\\w+") |> # Remove hashtags
str_remove_all("[^a-z\\s\\-–]") |> # Remove punctuation (keep letters, spaces, hyphen, en dash)
str_squish() # Trim and collapse extra whitespace
?str_remove_all
Removing URLs reduces noise: links carry no rhetorical content and inflate the token count.
Removing mentions (@user) reduces contextual ambiguity: usernames carry no thematic meaning.
Removing hashtags gives cleaner topic signals: hashtags can artificially skew keyword frequency.
Removing punctuation and other symbols standardizes the text: it reduces superficial variation without changing meaning.
Removing extra whitespace gives uniform formatting, which improves parsing and consistency.
Removing reposts and quotes keeps the focus on original speech: the goal is to analyze Trump’s own rhetoric, not redistributed content.
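The cleaning steps above can also be sketched in Python, which may be handy for checking a single post without going back to R. This is a minimal sketch, not the pipeline actually used here: `clean_text` and the sample string are hypothetical, and the patterns are copied from the `stringr` calls above.

```python
import re

def clean_text(text: str) -> str:
    """Mirror the R cleaning steps: lowercase, strip URLs, mentions,
    hashtags and punctuation, then collapse whitespace."""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)       # full URLs (http or https)
    text = re.sub(r"@\w+", "", text)          # @mentions
    text = re.sub(r"#\w+", "", text)          # hashtags
    text = re.sub(r"[^a-z\s\-–]", "", text)   # keep letters, whitespace, hyphen, en dash
    return re.sub(r"\s+", " ", text).strip()  # same role as str_squish()

# Hypothetical sample post, not taken from the dataset.
sample = "Check THIS out @foo #MAGA https://t.co/abc -- great!"
cleaned = clean_text(sample)
print(cleaned)
```

Note that digits are removed along with punctuation, since only the letters a to z survive the character-class filter.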
write.csv(df, "filtered_trump_dat.csv", row.names = FALSE)
All the times
trump_talking <-
df |>
dplyr::filter(str_detect(text_clean, "--d"))
paste("nrow(trump_talking$text_clean) : ", length(trump_talking$text_clean))
## [1] "nrow(trump_talking$text_clean) : 56"
# what he said, when
trump_talking |> dplyr::select(date, text_clean) |> dplyr::arrange(desc(date))
write.csv(df, "preprocessed_trump_dat.csv", row.names = FALSE)
getwd()
## [1] "/Users/isaiahmireles/Desktop/Trump folder"
Import the data into Python:
import pandas as pd
import numpy as np
df_py = pd.read_csv("preprocessed_trump_dat.csv")
df_py = df_py.drop(columns=["Unnamed: 0", "X", "X.1"], errors="ignore")
df_py["text_clean"] = df_py["text_clean"].astype(str)
df_py.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 58020 entries, 0 to 58019
## Data columns (total 16 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 date 58020 non-null object
## 1 platform 58020 non-null object
## 2 handle 58020 non-null object
## 3 text 58017 non-null object
## 4 favorite_count 58020 non-null int64
## 5 repost_count 58020 non-null int64
## 6 deleted_flag 58020 non-null bool
## 7 word_count 58020 non-null int64
## 8 hashtags 6203 non-null object
## 9 urls 14429 non-null object
## 10 user_mentions 21724 non-null object
## 11 media_count 58020 non-null int64
## 12 media_urls 3947 non-null object
## 13 post_url 58020 non-null object
## 14 text_lwr 58017 non-null object
## 15 text_clean 58020 non-null object
## dtypes: bool(1), int64(4), object(11)
## memory usage: 6.7+ MB
df_py["text_clean"].eq("nan").sum() # how many?
## np.int64(174)
df_py["text_clean"] = df_py["text_clean"].replace("nan", np.nan)
df_py = df_py.dropna(subset=["text_clean"])
df_py["text_clean"].eq("nan").sum() # how many?
## np.int64(0)
df_py.info()
## <class 'pandas.core.frame.DataFrame'>
## Index: 57846 entries, 0 to 58019
## Data columns (total 16 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 date 57846 non-null object
## 1 platform 57846 non-null object
## 2 handle 57846 non-null object
## 3 text 57846 non-null object
## 4 favorite_count 57846 non-null int64
## 5 repost_count 57846 non-null int64
## 6 deleted_flag 57846 non-null bool
## 7 word_count 57846 non-null int64
## 8 hashtags 6083 non-null object
## 9 urls 14352 non-null object
## 10 user_mentions 21664 non-null object
## 11 media_count 57846 non-null int64
## 12 media_urls 3942 non-null object
## 13 post_url 57846 non-null object
## 14 text_lwr 57846 non-null object
## 15 text_clean 57846 non-null object
## dtypes: bool(1), int64(4), object(11)
## memory usage: 7.1+ MB
from openai import OpenAI
from dotenv import load_dotenv
import os
import numpy as np
load_dotenv()
## True
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
Aside: I’m using the lower-cost small model (text-embedding-3-small) to embed the posts:
def get_embedding(text, model="text-embedding-3-small"):
    text = text.replace("\n", " ")
    response = client.embeddings.create(
        model=model,
        input=text
    )
    return response.data[0].embedding
summary(nchar(df$text_clean))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 62.0 100.0 146.1 158.0 2826.0
len(df_py)
## 57846
df_samp = df_py.sample(n=20000, random_state=42)
import time
from openai import RateLimitError
import pickle
batch_size = 250
embeddings = []
texts = df_samp["text_clean"].tolist()
for i in range(0, len(texts), batch_size):
    batch = texts[i:i+batch_size]
    while True:
        try:
            response = client.embeddings.create(
                model="text-embedding-3-small",
                input=batch
            )
            batch_embeddings = [d.embedding for d in response.data]
            embeddings.extend(batch_embeddings)
            print(f"Processed batch {i} to {i+batch_size}")
            time.sleep(2)
            break
        except RateLimitError:
            print("Rate limit hit — sleeping 30 seconds...")
            time.sleep(30)
## Processed batch 0 to 250
## Processed batch 250 to 500
## ...
## Processed batch 19750 to 20000
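The batching loop above can be factored into a helper that is testable without calling the API. The sketch below is an assumption about how one might structure it, not the code that produced the log: `embed_in_batches`, `embed_batch`, `retry_on` and `fake_embed` are hypothetical names, and the retry count is capped, whereas the original `while True` retries indefinitely.

```python
import time
from typing import Callable, List, Sequence

def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 250,
    retry_wait: float = 30.0,
    max_retries: int = 5,
    retry_on: tuple = (RuntimeError,),
) -> List[List[float]]:
    """Embed texts batch by batch, retrying a batch when embed_batch raises
    one of the exception types in retry_on."""
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        for _ in range(max_retries):
            try:
                vectors.extend(embed_batch(batch))
                break  # batch succeeded, move on to the next one
            except retry_on:
                time.sleep(retry_wait)
        else:
            # Runs only if no attempt succeeded.
            raise RuntimeError(f"batch starting at {start} failed {max_retries} times")
    return vectors

# Fake embedder for demonstration: fails once, then returns one number per text.
calls = {"n": 0}

def fake_embed(batch: List[str]) -> List[List[float]]:
    calls["n"] += 1
    if calls["n"] == 2:
        raise RuntimeError("simulated rate limit")
    return [[float(len(t))] for t in batch]

result = embed_in_batches(
    ["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2, retry_wait=0.0
)
print(result)
```

With the real client, `embed_batch` would wrap `client.embeddings.create(...)` and `retry_on` would be `(RateLimitError,)`. Because results are appended in batch order, the returned list lines up with the input order, which is what `df_samp["embedding"] = embeddings` relies on.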
# SAVE AFTER COMPUTING (only when you manually run this chunk)
with open("embeddings.pkl", "wb") as f:
    pickle.dump(embeddings, f)
import pickle
with open("embeddings.pkl", "rb") as f:
    embeddings = pickle.load(f)
df_samp["embedding"] = embeddings
df_samp.info()
## <class 'pandas.core.frame.DataFrame'>
## Index: 20000 entries, 6023 to 33334
## Data columns (total 17 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 date 20000 non-null object
## 1 platform 20000 non-null object
## 2 handle 20000 non-null object
## 3 text 20000 non-null object
## 4 favorite_count 20000 non-null int64
## 5 repost_count 20000 non-null int64
## 6 deleted_flag 20000 non-null bool
## 7 word_count 20000 non-null int64
## 8 hashtags 2081 non-null object
## 9 urls 4949 non-null object
## 10 user_mentions 7520 non-null object
## 11 media_count 20000 non-null int64
## 12 media_urls 1347 non-null object
## 13 post_url 20000 non-null object
## 14 text_lwr 20000 non-null object
## 15 text_clean 20000 non-null object
## 16 embedding 20000 non-null object
## dtypes: bool(1), int64(4), object(12)
## memory usage: 2.6+ MB
Here I use PCA and t-SNE to project the embeddings into two dimensions, coloring each point by its k-means cluster:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
# Convert embeddings
X = np.array(embeddings)
# ----- PCA -----
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# ----- t-SNE -----
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X)
# ----- Clustering (on original embedding space) -----
kmeans = KMeans(n_clusters=6, random_state=42)
labels = kmeans.fit_predict(X)
# ----- Plot -----
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# PCA
axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=labels, s=8, alpha=0.7)
axes[0].set_title("PCA Projection")
axes[0].set_xticks([])
## []
axes[0].set_yticks([])
## []
# t-SNE
axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, s=8, alpha=0.7)
axes[1].set_title("t-SNE Projection")
axes[1].set_xticks([])
## []
axes[1].set_yticks([])
## []
plt.tight_layout()
plt.show()
list.files()
## [1] "big_ahh_data.csv"
## [2] "big_ahh_dataset.csv"
## [3] "Brief_Topic_Analysis.html"
## [4] "Brief_Topic_Analysis.Rmd"
## [5] "df_joined.csv"
## [6] "embeddings.pkl"
## [7] "Epstien Files"
## [8] "Exports.csv"
## [9] "filtered_trump_dat.csv"
## [10] "GDP.csv"
## [11] "GDP.numbers"
## [12] "GDPPERCAPITA.csv"
## [13] "GovExpendituresEdu.csv"
## [14] "IMPORTS.csv"
## [15] "MEDIANCPI.csv"
## [16] "nationwide-encounters-fy23-fy26-jan-aor.csv"
## [17] "preprocessed_trump_dat 2.csv"
## [18] "preprocessed_trump_dat.csv"
## [19] "REALGDP.csv"
## [20] "rsconnect"
## [21] "SearchingWork.csv"
## [22] "Trump Joining Auxiliary Variables.Rmd"
## [23] "trump_sample_labeled 2.csv"
## [24] "trump_sample_labeled.csv"
## [25] "trump_tweets_dataset.csv"
## [26] "Trump-Joining-Auxiliary-Variables.html"
## [27] "TrumpClassification_files"
## [28] "TrumpClassification.html"
## [29] "TrumpClassification.Rmd"
## [30] "tweet_themes_with_sentiment.csv"
## [31] "UNRATE.csv"
For each theme, I provide anchor sentences modeled on things we would expect Donald Trump to say:
############################################
# Define Trump-Style Theme Anchors
############################################
themes = {
"immigration": [
"we will finish the wall and we will stop the illegal invasion at our southern border.",
"millions of illegal immigrants are pouring into our country because the democrats refuse to secure the border.",
"under my leadership we had the strongest border in american history and we will bring that back.",
"we will launch the largest deportation operation our country has ever seen.",
"cartels, drugs, and criminals are crossing our border because the democrats want open borders.",
"america is a sovereign nation and we will decide who enters our country, not the radical left."
],
"education": [
"our schools have been taken over by radical left ideology and we are going to take them back.",
"parents must have the final say in what their children are taught in the classroom.",
"we will end critical race theory and bring back patriotic education.",
"the department of education has failed our students for decades.",
"american children should be taught to love their country, not hate it.",
"school choice will give every family the power to choose the best education for their kids."
],
"war": [
"we will rebuild our military so powerful that nobody will dare challenge america.",
"under my leadership we defeated isis and restored strength to our armed forces.",
"america will never apologize for defending its people and its interests.",
"our enemies respect strength and they know the united states will always win.",
"peace through strength is the only way to keep our country safe.",
"we will protect our allies but we will never allow america to be taken advantage of."
],
"crime": [
"crime is out of control in democrat run cities and it has to stop immediately.",
"we will restore law and order to the streets of america.",
"radical prosecutors are letting violent criminals walk free.",
"our police officers deserve respect and support, not attacks from the left.",
"if you attack our communities and our police you will face serious consequences.",
"america will be safe again when we put criminals behind bars where they belong."
],
"religion": [
"we will always defend religious liberty for every american.",
"our nation was founded on faith and we will never forget that.",
"the radical left wants to remove god from public life but we will never let that happen.",
"in america we proudly say one nation under god.",
"churches and people of faith will always have a friend in the white house.",
"faith, family, and freedom are the foundation of this great country."
],
"jobs": [
"we are bringing american jobs back from china and other countries.",
"no president has created more opportunity for american workers.",
"our america first policies will put millions of people back to work.",
"factories are reopening because companies believe in america again.",
"we will protect american workers from unfair trade deals.",
"the best jobs economy in history is coming back bigger than ever."
],
"poverty": [
"for decades politicians ignored the forgotten men and women of america.",
"we are creating opportunity so people can lift themselves out of poverty.",
"american workers deserve good paying jobs, not government dependence.",
"inner cities have been abandoned by democrat leadership for far too long.",
"economic growth is the best anti-poverty program ever created.",
"we will rebuild communities that have been left behind."
],
"democrats": [
"the radical democrats want open borders, high taxes, and chaos in our streets.",
"democrats are destroying our country with their failed policies.",
"everywhere democrats are in charge crime goes up and quality of life goes down.",
"the democrat party has been taken over by the radical left.",
"they want socialism while we want freedom and prosperity.",
"the democrats talk about unity but they only divide america."
],
"Government assistance programs": [
"government programs should help people get back on their feet, not trap them in dependency.",
"we will reform welfare so that work is always rewarded.",
"taxpayers deserve to know their money is being spent wisely.",
"assistance programs must prioritize american citizens first.",
"we will eliminate waste and fraud in government benefits.",
"our goal is opportunity, not permanent government dependence."
],
"healthcare": [
"we will deliver affordable healthcare that actually works for the american people.",
"obamacare has been a disaster and it must be replaced.",
"patients should have more choice and more control over their healthcare.",
"we will protect people with pre existing conditions while lowering costs.",
"drug prices will come down because we will stand up to big pharma.",
"american families deserve the best healthcare system anywhere in the world."
],
"none": [
"the fake news media never tells the truth about what is happening in our country.",
"america is coming back stronger than anyone ever thought possible.",
"the people of this country are incredible and they deserve great leadership.",
"we are going to make america greater than ever before.",
"nobody fights harder for the american people than we do.",
"together we will restore pride, strength, and confidence in america."
]
}
New theme labels:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
############################################
# 1️⃣ Create Averaged Theme Embeddings
############################################
def average_embedding(text_list):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text_list
    )
    vectors = np.array([d.embedding for d in response.data])
    return np.mean(vectors, axis=0)
theme_embeddings = {
    theme: average_embedding(sentences)
    for theme, sentences in themes.items()
}
############################################
# 2️⃣ Use Existing Dataset Embeddings
############################################
X = np.array(embeddings)
############################################
# 3️⃣ Classify via Cosine Similarity
############################################
theme_matrix = np.vstack(list(theme_embeddings.values()))
theme_names = list(theme_embeddings.keys())
similarities = cosine_similarity(X, theme_matrix)
labels = [theme_names[i] for i in similarities.argmax(axis=1)]
confidence = similarities.max(axis=1)
############################################
# 4️⃣ Attach Results
############################################
df_samp["theme_label"] = labels
df_samp["confidence"] = confidence
df_samp[["text_clean", "theme_label", "confidence"]].head()
## text_clean ... confidence
## 6023 yesterdays results show trump s course was alr... ... 0.342245
## 2528 there is no substitute for hard work --thomas ... ... 0.330769
## 45057 trump derides drug and human trafficking boom ... ... 0.526055
## 29515 russia talk is fake news put out by the dems a... ... 0.360855
## 8373 you have a cunty demeanor true ... 0.215996
##
## [5 rows x 3 columns]
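The classification step above amounts to a nearest-centroid assignment under cosine similarity: average each theme's anchor embeddings, then give every post the theme whose average it is most aligned with. The sketch below spells out that arithmetic in plain Python on made-up 3-dimensional vectors. `cosine`, `mean_vector`, `classify` and `toy_anchors` are hypothetical names, and real embeddings have far more dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Element-wise mean of a list of vectors (the theme centroid)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def classify(post_vec, centroids):
    """Return (best_theme, similarity) for one post embedding."""
    scores = {theme: cosine(post_vec, c) for theme, c in centroids.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy anchor embeddings, invented for illustration only.
toy_anchors = {
    "immigration": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "jobs": [[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]],
}
centroids = {theme: mean_vector(vecs) for theme, vecs in toy_anchors.items()}

label, score = classify([0.9, 0.1, 0.0], centroids)
print(label, round(score, 3))
```

The value stored as `confidence` is this best similarity score. It is a geometric alignment measure, not a probability, so it should be compared across posts rather than read as a chance of being correct.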
df_samp.info()
## <class 'pandas.core.frame.DataFrame'>
## Index: 20000 entries, 6023 to 33334
## Data columns (total 19 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 date 20000 non-null object
## 1 platform 20000 non-null object
## 2 handle 20000 non-null object
## 3 text 20000 non-null object
## 4 favorite_count 20000 non-null int64
## 5 repost_count 20000 non-null int64
## 6 deleted_flag 20000 non-null bool
## 7 word_count 20000 non-null int64
## 8 hashtags 2081 non-null object
## 9 urls 4949 non-null object
## 10 user_mentions 7520 non-null object
## 11 media_count 20000 non-null int64
## 12 media_urls 1347 non-null object
## 13 post_url 20000 non-null object
## 14 text_lwr 20000 non-null object
## 15 text_clean 20000 non-null object
## 16 embedding 20000 non-null object
## 17 theme_label 20000 non-null object
## 18 confidence 20000 non-null float64
## dtypes: bool(1), float64(1), int64(4), object(13)
## memory usage: 2.9+ MB
df_samp_no_embed = df_samp.drop(columns=["embedding"]) # dont inc. embeddings
df_samp_no_embed.to_csv(
"trump_sample_labeled.csv",
index=False
)
df <- read.csv("trump_sample_labeled.csv")
unique(df$theme_label) # note that "none" appears as a label
## [1] "none" "immigration"
## [3] "democrats" "crime"
## [5] "poverty" "religion"
## [7] "Government assistance programs" "healthcare"
## [9] "jobs" "war"
## [11] "education"
colnames(df)
## [1] "date" "platform" "handle" "text"
## [5] "favorite_count" "repost_count" "deleted_flag" "word_count"
## [9] "hashtags" "urls" "user_mentions" "media_count"
## [13] "media_urls" "post_url" "text_lwr" "text_clean"
## [17] "theme_label" "confidence"
write.csv(df, "big_ahh_dataset.csv")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df_with_confidence <- df %>%
mutate(
correct = NA # only include if you DO have ground truth later
)
confidence_summary <- df %>%
group_by(theme_label) %>%
summarize(
n = n(),
mean_confidence = mean(confidence, na.rm = TRUE),
sd_confidence = sd(confidence, na.rm = TRUE),
min_confidence = min(confidence, na.rm = TRUE),
max_confidence = max(confidence, na.rm = TRUE)
) %>%
arrange(desc(mean_confidence))
confidence_summary
Overall Pattern
Mean confidence is low for every theme (range: 0.23–0.29); no theme averages above 0.30.
Here “confidence” is the highest cosine similarity between a post’s embedding and the averaged theme embeddings, not a calibrated probability, so a low value means the post is only loosely aligned with even its best-matching theme.
The means are also compressed into a narrow band, which indicates limited separation between themes.
This may reflect thematic overlap (e.g., jobs vs. poverty) or anchor sentences that share generic political language.
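One way to look at separation directly is the gap between a post's best and second-best theme similarity. This margin is not computed anywhere in the analysis above; it is offered as a hedged sketch, and `top_two_margin` and both example score sets are made up for illustration.

```python
def top_two_margin(similarities):
    """Difference between the highest and second-highest similarity.
    A small margin means two themes fit the post almost equally well."""
    ranked = sorted(similarities.values(), reverse=True)
    return ranked[0] - ranked[1]

# Invented similarity scores, not taken from the dataset.
clear_post = {"immigration": 0.62, "crime": 0.31, "jobs": 0.20}
ambiguous_post = {"jobs": 0.27, "poverty": 0.26, "democrats": 0.25}

clear_margin = top_two_margin(clear_post)
ambiguous_margin = top_two_margin(ambiguous_post)
print(round(clear_margin, 2), round(ambiguous_margin, 2))
```

A post with low confidence but a large margin is weakly matched yet unambiguous, while a small margin points to overlap between themes.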
library(ggplot2)
ggplot(confidence_summary,
aes(x = reorder(theme_label, mean_confidence),
y = mean_confidence)) +
geom_col() +
coord_flip() +
labs(title = "Average Model Confidence by Theme",
x = "Theme",
y = "Mean Confidence") +
theme_minimal()
library(dplyr)
highest <-
df %>%
arrange(desc(confidence)) %>%
select(theme_label, confidence, text) %>%
slice_head(n = 5)
highest
top3_per_theme <-
df %>%
group_by(theme_label) %>%
arrange(desc(confidence), .by_group = TRUE) %>%
slice_head(n = 3) %>%
select(theme_label, confidence, text)
top3_per_theme
very_high <-
df %>%
filter(confidence > 0.60) %>%
arrange(desc(confidence)) %>%
select(theme_label, confidence, text)
very_high
Common patterns:
- Frequent use of terms like:
  - “Border”
  - “Wall”
  - “Illegal immigrants”
  - “Invasion”
  - “Crime”
- Framing around:
  - National security
  - Law enforcement
  - Threat narratives
- Repetition across posts strengthens model certainty.

Interpretation:
- Immigration has highly distinctive vocabulary.
- Minimal overlap with other themes.
- Produces the highest confidence scores overall.

Common patterns:
- Strong economic performance framing:
  - “Lowest unemployment”
  - “Jobs”
  - “Stock Market”
  - “Tariffs”
  - “Trade deals”
- Optimistic and achievement-focused tone.
- Structured economic indicators (jobs numbers, growth, profits).

Interpretation:
- Clear economic keywords improve confidence.
- Less emotionally intense than immigration but still distinct.

Common patterns:
- References to:
  - “Faith”
  - “God”
  - “Prayer”
  - “Nation under God”
- Patriotic-religious framing.
- Less policy-heavy, more symbolic language.

Interpretation:
- Distinct vocabulary but lower intensity and repetition.
- Moderate confidence levels.

Common patterns:
- Mentions of:
  - “Education system”
  - “Schools”
  - “Indoctrination”
  - “Patriotic education”
- Often overlaps with cultural or political framing.

Interpretation:
- More thematic overlap with culture/politics.
- Lower confidence due to ambiguity.

Common patterns:
- References to:
  - “Cities”
  - “Law and Order”
  - “Capital”
  - “Homeless”
- Often embedded within broader crime or governance narratives.

Interpretation:
- Less frequent and less distinct vocabulary.
- Strong thematic overlap with crime and immigration.
- Lowest overall confidence levels.

High-confidence classifications occur when:
- Vocabulary is highly specific to one theme.
- Messaging is repetitive and consistent.
- The post contains strong, unambiguous political framing.

Lower confidence appears when:
- Themes overlap.
- Language is general or symbolic.
- The signal is less distinctive.
unique(df$theme_label)
## [1] "none" "immigration"
## [3] "democrats" "crime"
## [5] "poverty" "religion"
## [7] "Government assistance programs" "healthcare"
## [9] "jobs" "war"
## [11] "education"
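# NOTE: "economics" is not among the labels returned by unique(df$theme_label) above,
# so this filter returns zero rows.
econ <-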
df |>
filter(theme_label == "economics") |>
arrange(desc(confidence)) |>
select(theme_label,confidence, text, post_url) |>
filter(confidence > .565)
immigration <-
df |>
filter(theme_label == "immigration") |>
arrange(desc(confidence)) |>
select(theme_label,confidence, text, post_url) |>
filter(confidence > .63)
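# NOTE: "homelessness" is not among the labels returned by unique(df$theme_label) above,
# so this filter returns zero rows.
homelessness <-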
df |>
filter(theme_label == "homelessness") |>
arrange(desc(confidence)) |>
select(theme_label,confidence, text, post_url) |>
filter(confidence > .38)
religion <-
df |>
filter(theme_label == "religion") |>
arrange(desc(confidence)) |>
select(theme_label,confidence, text, post_url) |>
filter(confidence > .48)
education <-
df |>
filter(theme_label == "education") |>
arrange(desc(confidence)) |>
select(theme_label,confidence, text, post_url) |>
filter(confidence > .4)
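The per-theme filters above repeat one pattern with a different cutoff each time. A lookup table expresses the same idea in one pass. The sketch below is in Python on invented rows: `high_confidence` and `toy_rows` are hypothetical, and the cutoffs are the ones used above for the themes that do appear in the labels.

```python
# Cutoffs copied from the R filters above (only themes present in the labels).
thresholds = {"immigration": 0.63, "religion": 0.48, "education": 0.40}

def high_confidence(rows, thresholds):
    """Keep rows whose confidence exceeds their theme's cutoff,
    sorted from most to least confident."""
    kept = [
        row for row in rows
        if row["theme_label"] in thresholds
        and row["confidence"] > thresholds[row["theme_label"]]
    ]
    return sorted(kept, key=lambda row: row["confidence"], reverse=True)

# Invented rows for illustration, not taken from the dataset.
toy_rows = [
    {"theme_label": "immigration", "confidence": 0.65, "text": "toy post a"},
    {"theme_label": "immigration", "confidence": 0.50, "text": "toy post b"},
    {"theme_label": "religion", "confidence": 0.49, "text": "toy post c"},
    {"theme_label": "jobs", "confidence": 0.70, "text": "toy post d"},
]

selected = high_confidence(toy_rows, thresholds)
print([row["text"] for row in selected])
```

Themes without an entry in `thresholds` are dropped rather than passed through, so a label that does not exist in the data simply contributes nothing.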
write.csv(df, "big_ahh_data.csv")
df