library(tidyverse)
library(httr)
library(tidytext)
library(kableExtra)
library(superml)
library(e1071)
library(data.table)
This notebook collects a number of measures of partisanship / political polarization and maps them to the candidates dataframe from previous notebooks. The majority of the notebook will focus on the Twitter-based partisanship score.
For that score, I will fit a number of models to predict a political candidate’s party based on the language used in their tweets. I’ll use a collection of over 3 million tweets scraped from over a thousand candidates running for House seats in the 2022 midterm elections. The data is hosted here on Kaggle: https://www.kaggle.com/datasets/kac624/politicaltweets. See https://github.com/kac624/cuny/tree/main/election_tweets for additional details on the collection of this data. I’ll begin by reading in the data from Kaggle, cleaning the tweets, then mapping in data on party affiliation (sourced from the Federal Election Commission (FEC)). I’ll then engineer features by calculating the Term Frequency - Inverse Document Frequency (TF-IDF) statistic for key terms and transforming the corpus into a matrix. That matrix will provide training data for three classifier models (logistic regression, Naive Bayes, and XGBoost), all implemented with the superml package. For the sake of this exercise, I’ll focus only on binary classification of candidates as Democrat (0) or Republican (1).
After normalizing those results to generate an even distribution of scores, I’ll also gather data on two other partisanship / ideology scores. Once complete, all three will be mapped to the candidates dataframe.
I’ll begin by reading in the corpus of tweets scraped from Twitter.
kaggle <- jsonlite::read_json('data/kaggle.json')
username <- kaggle$username
authkey <- kaggle$key
url <- paste0('https://www.kaggle.com/api/v1/datasets/download/',
              'kac624/politicaltweets/candidate_tweets2022_05.05.csv')
response <- GET(url, authenticate(username, authkey, type = 'basic'))
temp <- tempfile()
download.file(response$url, temp, mode = 'wb')
tweets <- read_csv(unz(temp, 'candidate_tweets2022_05.05.csv'))
## New names:
## • `` -> `...1`
## Rows: 2932442 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): username, language, retweeted_id, quoted_id, text
## dbl  (7): ...1, twitter_id, tweet_id, likes, views, replies, retweets
## lgl  (2): verified_status, source
## dttm (1): date_created
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
unlink(temp)
rm(response)
As this analysis focuses on the 2022 midterm elections, I’ll first subset the data to include only tweets from 2021 and 2022. I’ll also remove retweets, as I wish to focus on candidates’ own language, along with duplicates and tweets with no text (typically standalone images). Finally, I’ll strip links and special characters from the remaining tweets. We can then preview a sample.
tweets <- tweets %>%
  select(-...1) %>%
  filter(year(date_created) > 2020,
         year(date_created) < 2023,
         is.na(retweeted_id),
         text != '')

tweets <- tweets[!duplicated(tweets$tweet_id), ]

tweets$text <- tweets$text %>%
  str_remove_all('&') %>%
  str_remove_all('https://t.co/[a-zA-Z0-9]*') %>%  # shortened links
  str_remove_all('[â¢\u0080-\uFFFF]') %>%          # non-ASCII / mis-encoded characters
  str_remove_all('\\p{So}') %>%                    # emoji and other symbol characters
  str_remove_all('\'s') %>%
  str_replace_all('[[:cntrl:]]', ' ')

tweets[sample(nrow(tweets), 10), ] %>%
  select(username, text) %>%
  kable(align = 'l') %>%
  kable_classic(position = 'center')
| username | text |
|---|---|
| CONYERSTHREE | Meanwhile Kim is playing the game Kanye should be playing. |
| RandiMcCallian | Can’t wait to read it! Thank you for helping us reach more people and grow these campaigns! |
| thejeremyhunt | One win away! #ForTheA |
| SpanbergerVA07 | Im so excited that the whole country got to see #VA07s own Joshua Davis at the #sotu! Thank you for your advocacy Joshua!! |
| RepGarretGraves | We sent the letter to HUD Secretary Fudge urging that Lafourche, Terrebonne, Jefferson, St. Charles, St. John and other parishes receive recovery funding directly. Under the current allocation, only the City of Baton Rouge will receive 2021 disaster funds directly. (2/4) |
| SuziVoyles | @_george_hall_ Quite a treasure |
| rep_stevewomack | As we honor Flag Day, I introduced a constitutional amendment to restore the authority of Congress to protect our national emblem of freedom by prohibiting the desecration of the American flag. Recognized from near and afar, our flag is the embodiment of opportunity and liberty. |
| RepHuizenga | Merry Christmas! |
| RepScottPeters | My full statement on this morning vote: |
| RepMikeLevin | |
I’ll use previously obtained data from the FEC to map party and state details to these tweets. The join produces some duplication, so I’ll again remove duplicates.
candidates <- read_csv('data/candidates2022_clean.csv')
## New names:
## • `` -> `...1`
## Rows: 4216 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (15): fec_id, name, state, party, office, incumbent_challenge, candidate...
## dbl  (4): ...1, district, govtrack_id, twitter_id
## lgl  (1): twitter_status_id
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
tweets <- tweets %>%
  mutate(username = tolower(username)) %>%
  left_join(
    candidates %>%
      mutate(twitter_name = tolower(twitter_name)) %>%
      select(name, state, party_simple, twitter_name),
    by = c('username' = 'twitter_name'), keep = FALSE
  )
## Warning in left_join(., candidates %>% mutate(twitter_name = tolower(twitter_name)) %>% : Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 31223 of `x` matches multiple rows in `y`.
## ℹ Row 587 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
# the join produces some duplication, so drop exact duplicate rows
tweets <- tweets[!duplicated(tweets), ]

tweets <- tweets %>%
  filter(!(name == 'REYNOLDS, CONRAD EARL'))
A number of visualizations are provided below to give a sense of the distribution of tweets across parties, time and states. The final histogram shows the distribution of tweet counts per candidate.
tweets %>%
  ggplot(aes(party_simple)) +
  geom_bar() +
  scale_y_continuous(labels = scales::comma)

tweets %>%
  group_by(month = floor_date(date_created, 'month')) %>%
  summarize(count = n()) %>%
  ggplot(aes(month, count)) +
  geom_col() +
  scale_y_continuous(labels = scales::comma) +
  coord_flip()

tweets %>%
  group_by(state) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = reorder(state, count), y = count)) +
  geom_bar(stat = 'identity') +
  scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
  coord_flip()

tweets %>%
  group_by(name) %>%
  summarize(tweet_count = n(), .groups = 'keep') %>%
  ggplot(aes(tweet_count)) +
  geom_histogram(bins = 30) +
  scale_y_continuous(labels = scales::comma)
We can now move ahead with feature engineering. I’ll first load the stopwords from tidytext (note that the TfIdfVectorizer function in superml has its own stopword list, but I found it was missing many common, low-impact words that should be excluded). In re-running the models below, I also identified a number of additional words that have high TF-IDF scores but carry little meaning, so I’ve added them to the stopword list.
I then perform some additional cleaning as I eliminate stopwords, and finally concatenate all tweets for each candidate into a single row, so that each row represents one candidate. Candidates outside the Republican and Democratic parties, or with 100 or fewer tweets, are removed, and party labels are encoded as binary values (Democrat = 0, Republican = 1). The code below also extracts hashtags into a separate column, to be used for model enhancement below.
Finally, I split the dataset into training and test subsets, using an 80/20 split.
data('stop_words')

stop_words <- stop_words %>%
  select(word) %>%
  rbind(data.frame(word = c('im', 'vote', 'congress', '1', '2',
                            '3', '4', '5', '6', '7', '8', '9', '0')))

tweets_consolidated <- tweets %>%
  mutate(hashtags = str_extract_all(text, '#[a-zA-Z0-9]+'),
         hashtags = map_chr(hashtags, str_c, collapse = ' ')) %>%
  group_by(name) %>%
  mutate(tweet_count = n()) %>%
  ungroup() %>%
  filter(tweet_count > 100,
         party_simple != 'Other') %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = c('word')) %>%
  group_by(name, party_simple) %>%
  summarize(text = paste0(word, collapse = ' '),
            hashtags = paste0(hashtags, collapse = ' '),
            .groups = 'keep') %>%
  mutate(party_simple = if_else(party_simple == 'Republican', 1, 0))
set.seed(2022)  # seed the RNG so the split is reproducible (seed value arbitrary)

sample <- sample(c(rep(0, round(0.8 * nrow(tweets_consolidated))),
                   rep(1, round(0.2 * nrow(tweets_consolidated)))))

tweets_consolidated_train <- tweets_consolidated[sample == 0, ]
tweets_consolidated_test <- tweets_consolidated[sample == 1, ]
Next, I’ll use the TfIdfVectorizer to calculate TF-IDF scores and transform the resulting scores into a sparse matrix. Given the size of the data (with over 31 million tokens), I’ll need to set certain restrictions to limit the size of the final matrix. For now, I’ll set the max_features parameter to 30, meaning only the top 30 tokens will be included in the feature set. I’ll also set min_df to 0.3, meaning the document frequency for all final tokens must be at least 0.3.
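Before moving on, here’s a minimal sketch of what the TF-IDF statistic itself captures, using tidytext’s bind_tf_idf() on a toy two-document corpus (the toy corpus is invented purely for illustration; the actual features below come from superml’s TfIdfVectorizer).

# toy corpus: 'families' appears in both documents, 'border' and 'rights' in one each
toy <- tibble(doc = c('d1', 'd1', 'd1', 'd2', 'd2'),
              word = c('border', 'families', 'families', 'families', 'rights')) %>%
  count(doc, word) %>%
  bind_tf_idf(word, doc, n)
toy
# 'families' appears in every document, so its IDF (and thus TF-IDF) is 0;
# 'border' and 'rights' are document-specific and receive higher scores.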
WARNING: The fitting and transforming of this matrix can be time consuming.
start <- Sys.time()

# fit the vectorizer on the training set only, then apply it to both sets,
# so no information from the test set leaks into the features
tfv_text <- TfIdfVectorizer$new(min_df = 0.3, max_features = 30, remove_stopwords = F)
tfv_text$fit(tweets_consolidated_train$text)
tfv_text_train <- tfv_text$transform(tweets_consolidated_train$text)
tfv_text_test <- tfv_text$transform(tweets_consolidated_test$text)

tfv_text_train <- data.table(tfv_text_train,
                             party = tweets_consolidated_train$party_simple,
                             name = tweets_consolidated_train$name)
tfv_text_test <- data.table(tfv_text_test,
                            party = tweets_consolidated_test$party_simple,
                            name = tweets_consolidated_test$name)

write_csv(tfv_text_train, 'data/tfv_text_train.csv')
write_csv(tfv_text_test, 'data/tfv_text_test.csv')

dim(tfv_text_train)
dim(tfv_text_test)

duration <- difftime(Sys.time(), start)
paste('Time:', duration)
I’ll perform the same process as above, but this time focusing only
on hashtags. I only plan to use additional hashtag-related features to
enhance the training set below, so I’ll set the min_df and
max_features parameters a bit more strictly.
start <- Sys.time()

tfv_hash <- TfIdfVectorizer$new(min_df = 0.5, max_features = 20)
tfv_hash$fit(tweets_consolidated_train$hashtags)
tfv_hash_train <- tfv_hash$transform(tweets_consolidated_train$hashtags)
tfv_hash_test <- tfv_hash$transform(tweets_consolidated_test$hashtags)

tfv_hash_train <- data.table(tfv_hash_train,
                             party = tweets_consolidated_train$party_simple,
                             name = tweets_consolidated_train$name)
tfv_hash_test <- data.table(tfv_hash_test,
                            party = tweets_consolidated_test$party_simple,
                            name = tweets_consolidated_test$name)

write_csv(tfv_hash_train, 'data/tfv_hash_train.csv')
write_csv(tfv_hash_test, 'data/tfv_hash_test.csv')

dim(tfv_hash_train)
dim(tfv_hash_test)

duration <- difftime(Sys.time(), start)
paste('Time:', duration)
We can now take a look at the final tokens / hashtags in our feature set.
names(tfv_text_train)
## [1] "people" "time" "day" "biden" "support"
## [6] "american" "act" "families" "americans" "country"
## [11] "proud" "community" "house" "health" "america"
## [16] "bill" "president" "join" "care" "fight"
## [21] "communities" "democrats" "border" "continue" "women"
## [26] "law" "family" "protect" "rights" "week"
## [31] "party" "name"
names(tfv_hash_train)
## [1] "buildbackbetter" "americanrescueplan" "covid19"
## [4] "childtaxcredit" "ncpol" "ny21"
## [7] "forthepeople" "nhpolitics" "inflationreductionact"
## [10] "bidenbordercrisis" "az01" "blackhistorymonth"
## [13] "peopleoverpolitics" "ia04" "demvoice1"
## [16] "endgunviolence" "sotu" "scotus"
## [19] "womenshistorymonth" "vote" "party"
## [22] "name"
I’ll begin by setting a baseline for document classification. After the additional cleaning I performed, the dataset is evenly balanced between Democrats (0) and Republicans (1), as shown below. So, we’ll set our baseline at ~50%.
nrow(tweets_consolidated[tweets_consolidated$party_simple == 1,]) /
  nrow(tweets_consolidated)
## [1] 0.5077973
Next, I’ll define two functions. The first will fit the chosen model and generate predictions and key evaluation metrics. The second function will compile and print those metrics. For logistic regression models, we’ll use 0.5 as the cutoff for classifying predicted probabilities as 0 or 1.
fit_evaluate <- function(model, train, test) {
  model$fit(X = train, y = 'party')
  pred <- model$predict(test)
  auc <- Metrics::auc(actual = test$party, predicted = pred)
  # some models return probabilities (thresholded at 0.5), while others return
  # factor labels; fall back to the label-based confusion matrix in that case
  confusion <- tryCatch(
    {
      caret::confusionMatrix(as.factor(as.numeric(pred > 0.5)),
                             as.factor(test$party))
    },
    error = function(cond) {
      caret::confusionMatrix(pred, as.factor(test$party))
    }
  )
  return(list(model = model, pred = pred,
              auc = auc, confusion = confusion))
}
print_metrics <- function(model_results) {
  cat('\nResults for', model_results$model$model$method,
      '\n\nConfusion Matrix\n')
  print(model_results$confusion$table)
  cat('Accuracy:', model_results$confusion$overall[1],
      '\nKappa:', model_results$confusion$overall[2],
      '\nArea Under Curve:', model_results$auc, '\n')
}
We begin modeling with Logistic Regression.
# column 32 is `name`, an identifier rather than a feature, so it is dropped
logit <- fit_evaluate(LMTrainer$new(family = 'binomial'),
                      tfv_text_train[,-32], tfv_text_test[,-32])
print_metrics(logit)
##
## Results for glm.fit
##
## Confusion Matrix
## Reference
## Prediction 0 1
## 0 91 13
## 1 7 94
## Accuracy: 0.902439
## Kappa: 0.8050033
## Area Under Curve: 0.9726302
data.frame(pred = logit$model$predict(rbind(tfv_text_train, tfv_text_test))) %>%
  ggplot(aes(pred)) +
  geom_histogram(bins = 50)
Out of the gate, performance appears solid, with over 90% accuracy, far above our 50% baseline. The distribution of predicted probabilities is heavily concentrated near 0 and 1, as we would expect given the high accuracy.
We’ll see if we can enhance the logistic regression by adding in some key hashtags to our feature set. I noticed above that the hashtag feature set contains some apparently meaningless tokens related only to the district in which candidates are running (e.g. “#ny21” for New York’s 21st district). To avoid introducing noise to our feature set, I’ll remove these hashtags, then combine the text-based and tag-based features into a single training set to feed into the model.
# district hashtags follow a two-letter state + two-digit district pattern
bad_tags <- unlist(str_extract_all(colnames(tfv_hash_train), '[a-z]{2}[0-9]{2}'))

tfv_combo_train <- cbind(select(tfv_text_train, -name),
                         select(tfv_hash_train, -party, -name, -any_of(bad_tags)))
tfv_combo_test <- cbind(select(tfv_text_test, -name),
                        select(tfv_hash_test, -party, -name, -any_of(bad_tags)))

logit2 <- fit_evaluate(LMTrainer$new(family = 'binomial'),
                       tfv_combo_train, tfv_combo_test)
print_metrics(logit2)
##
## Results for glm.fit
##
## Confusion Matrix
## Reference
## Prediction 0 1
## 0 93 13
## 1 5 94
## Accuracy: 0.9121951
## Kappa: 0.8246531
## Area Under Curve: 0.9732024
data.frame(pred = logit2$model$predict(rbind(tfv_combo_train, tfv_combo_test))) %>%
  ggplot(aes(pred)) +
  geom_histogram(bins = 50)
We see a marginal improvement in accuracy, bumping up to ~91%.
Next, we’ll use the same data in two additional models: Naive Bayes and XGBoost classifiers.
nb <- fit_evaluate(NBTrainer$new(), tfv_text_train[,-32], tfv_text_test[,-32])
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
## Warning in Ops.factor(pred, 0.5): '>' not meaningful for factors
print_metrics(nb)
##
## Results for
##
## Confusion Matrix
## Reference
## Prediction 0 1
## 0 85 12
## 1 13 95
## Accuracy: 0.8780488
## Kappa: 0.7555216
## Area Under Curve: 0.8775987
nb2 <- fit_evaluate(NBTrainer$new(), tfv_combo_train[,-48], tfv_combo_test[,-48])
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
## Warning: '>' not meaningful for factors
print_metrics(nb2)
xgb_model <- XGBTrainer$new(
  objective = 'reg:squarederror', n_estimators = 500, eval_metric = 'rmse',
  maximize = F, learning_rate = 0.1, max_depth = 6, verbose = 0)
xgb <- fit_evaluate(xgb_model, tfv_text_train[,-32], tfv_text_test[,-32])
## converting the data into xgboost format..
## starting with training...
## [08:03:52] WARNING: src/learner.cc:767:
## Parameters: { "nrounds" } are not used.
print_metrics(xgb)
##
## Results for
##
## Confusion Matrix
## Reference
## Prediction 0 1
## 0 86 15
## 1 12 92
## Accuracy: 0.8682927
## Kappa: 0.736416
## Area Under Curve: 0.9620446
xgb_model2 <- XGBTrainer$new(
  objective = 'reg:squarederror', n_estimators = 500, eval_metric = 'rmse',
  maximize = F, learning_rate = 0.1, max_depth = 6, verbose = 0)
xgb2 <- fit_evaluate(xgb_model2, tfv_combo_train[,-48], tfv_combo_test[,-48])
## converting the data into xgboost format..
## starting with training...
## [08:03:53] WARNING: src/learner.cc:767:
## Parameters: { "nrounds" } are not used.
print_metrics(xgb2)
##
## Results for
##
## Confusion Matrix
## Reference
## Prediction 0 1
## 0 90 15
## 1 8 92
## Accuracy: 0.8878049
## Kappa: 0.7758498
## Area Under Curve: 0.9683387
We see comparable, though slightly lower, performance from the Naive Bayes and XGBoost models. For XGBoost, the addition of hashtags again improves accuracy, but only minimally.
I’ll fit one final model, a Support Vector Machine (SVM), to be incorporated into the final score.
svm_text <- svm(party ~ ., data = tfv_text_train[,-32], probability = TRUE)
svm_text_pred <- predict(svm_text, tfv_text_test[,-32], probability = TRUE)
svm_combo <- svm(party ~ ., data = tfv_combo_train[,-48], probability = TRUE)
svm_combo_pred <- predict(svm_combo, tfv_combo_test[,-48], probability = TRUE)
caret::confusionMatrix(as.factor(as.numeric(svm_text_pred > 0.5)),
                       as.factor(tfv_text_test$party))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 89 9
## 1 9 98
##
## Accuracy : 0.9122
## 95% CI : (0.8648, 0.9471)
## No Information Rate : 0.522
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8241
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9082
## Specificity : 0.9159
## Pos Pred Value : 0.9082
## Neg Pred Value : 0.9159
## Prevalence : 0.4780
## Detection Rate : 0.4341
## Detection Prevalence : 0.4780
## Balanced Accuracy : 0.9120
##
## 'Positive' Class : 0
##
caret::confusionMatrix(as.factor(as.numeric(svm_combo_pred > 0.5)),
                       as.factor(tfv_combo_test$party))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 88 8
## 1 10 99
##
## Accuracy : 0.9122
## 95% CI : (0.8648, 0.9471)
## No Information Rate : 0.522
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8239
##
## Mcnemar's Test P-Value : 0.8137
##
## Sensitivity : 0.8980
## Specificity : 0.9252
## Pos Pred Value : 0.9167
## Neg Pred Value : 0.9083
## Prevalence : 0.4780
## Detection Rate : 0.4293
## Detection Prevalence : 0.4683
## Balanced Accuracy : 0.9116
##
## 'Positive' Class : 0
##
data.frame(pred = svm_combo_pred) %>%
  ggplot(aes(pred)) +
  geom_histogram(bins = 100)
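One caveat on the SVM fits above: because party is stored as numeric 0/1, svm() defaults to regression mode, so the predictions are continuous fitted values rather than class probabilities (hence the 0.5 cutoff). A minimal sketch of the classification-mode alternative, if proper probability estimates were desired, might look like this (illustrative only; the object names here are hypothetical):

# sketch only: a factor label switches e1071's svm() to C-classification,
# where probability = TRUE enables genuine class-probability predictions
svm_class <- svm(as.factor(party) ~ ., data = tfv_text_train[,-32],
                 probability = TRUE)
svm_class_pred <- predict(svm_class, tfv_text_test[,-32], probability = TRUE)
head(attr(svm_class_pred, 'probabilities'))  # one column per class (0 and 1)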
I’ll now compile the classification probabilities produced by each model for all candidates into a single dataframe. We’ll take the average probability across models as a starting point for our partisanship score.
scores <- data.frame(
  name = rbind(tfv_text_train[,'name'], tfv_text_test[,'name']),
  party = rbind(tfv_text_train[,'party'], tfv_text_test[,'party']),
  logit_text = logit$model$predict(rbind(tfv_text_train, tfv_text_test)),
  logit_combo = logit2$model$predict(rbind(tfv_combo_train, tfv_combo_test)),
  svm_text = predict(svm_text, rbind(tfv_text_train, tfv_text_test), probability = TRUE),
  svm_combo = predict(svm_combo, rbind(tfv_combo_train, tfv_combo_test), probability = TRUE),
  xgb_text = xgb$model$predict(rbind(tfv_text_train, tfv_text_test)),
  xgb_combo = xgb2$model$predict(rbind(tfv_combo_train, tfv_combo_test)),
  nb_text = nb$model$predict(rbind(tfv_text_train, tfv_text_test), type = 'prob')[,2],
  nb_combo = nb2$model$predict(rbind(tfv_combo_train, tfv_combo_test), type = 'prob')[,2]) %>%
  rowwise() %>%
  mutate(score = mean(c(logit_text, logit_combo,
                        svm_text, svm_combo,
                        nb_text, nb_combo,
                        xgb_text, xgb_combo))) %>%
  ungroup()
scores %>%
  ggplot(aes(logit_text)) +
  geom_histogram(bins = 100)

scores %>%
  ggplot(aes(score)) +
  geom_histogram(bins = 100)
The ultimate purpose of this score is to measure partisanship / political polarization in general, rather than on a party basis. That is, the score should not sit on a left-right axis; it should instead capture distance from the political center, whether that distance runs left or right. The final score should range from 0 to 1, with 0 indicating low partisanship (closeness to the political center) and 1 indicating high partisanship (closeness to the party fringes).
To accomplish this, I first normalize the score between -1 and 1 to orient it around 0, then take the absolute value to eliminate the party dimension. I use the formulas below: the first is the general min-max rescaling to an interval [a, b], and the second is the special case for [-1, 1]. The resulting distribution is heavily skewed, due to the high accuracy of the models. To address this, I rank all scores and then “project” them onto a normal distribution to create a more even spread of scores.
\[ x' = (b-a) \frac{x - \min x}{\max x - \min x} + a \]
\[ x' = 2 \frac{x - \min x}{\max x - \min x} - 1 \]
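As a quick numeric check of the second formula (with toy values assumed for illustration), the minimum maps to -1, the midpoint to 0, and the maximum to +1; taking the absolute value then folds both extremes onto 1:

x <- c(0.1, 0.5, 0.9)  # toy scores
x_norm <- 2 * (x - min(x)) / (max(x) - min(x)) - 1
x_norm       # -1  0  1
abs(x_norm)  #  1  0  1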
min_score <- min(scores$score)
max_score <- max(scores$score)

scores <- scores %>%
  mutate(score_normalized = 2 * (score - min_score) / (max_score - min_score) - 1,
         score_norm_abs = abs(score_normalized),
         score_norm_abs_rank = min_rank(score_norm_abs))
scores %>%
  ggplot(aes(score_norm_abs)) +
  geom_histogram(aes(y = after_stat(density)), bins = 100) +
  geom_density(color = 'blue') +
  stat_density(geom = 'line', adjust = 5, color = 'green')
# map each score to its empirical CDF value (an approximately uniform rank),
# then project those ranks onto a normal distribution centered at 0.5
func <- ecdf(scores$score_norm_abs)
ecdf_rank <- func(scores$score_norm_abs)
norm_rank <- qnorm(ecdf_rank, mean = 0.5, sd = 1/6)

scores <- scores %>%
  cbind(ecdf_rank, norm_rank) %>%
  mutate(norm_rank = replace(norm_rank, norm_rank == Inf, 0.99),  # ecdf of 1 maps to Inf
         norm_rank = replace(norm_rank, norm_rank < 0, 0.01))
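For intuition on this rank “projection,” the same transform applied to an arbitrary skewed sample (toy data, for illustration only) yields a roughly bell-shaped result; note that the sample maximum has an empirical CDF of exactly 1 and maps to Inf, which is why the code above caps those values at 0.99.

skewed <- rexp(1000)  # a heavily right-skewed toy sample
projected <- qnorm(ecdf(skewed)(skewed), mean = 0.5, sd = 1/6)
summary(projected)    # roughly symmetric around 0.5, apart from the Inf at the maximum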
scores %>%
  ggplot() +
  geom_freqpoly(aes(ecdf_rank), color = 'red') +
  geom_freqpoly(aes(norm_rank))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Finally, I want to add a few additional measures of partisanship to our dataset. I’ll start by reading in the Poole-Rosenthal NOMINATE ideology scores produced by Voteview (https://voteview.com/data). This score uses legislators’ voting records to generate a measure of how far left- or right-leaning they are. It is already situated on a left-right scale running from -1 to 1, so here I only need to take the absolute value.
voteview <- read_csv('data/voteview_ideology_scores.csv')
## Rows: 50485 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): chamber, state_abbrev, bioname, bioguide_id
## dbl (17): congress, icpsr, state_icpsr, district_code, party_code, occupancy...
## lgl (1): conditional
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
voteview %>%
  filter(congress == 118) %>%
  head()
## # A tibble: 6 × 22
## congress chamber icpsr state_icpsr district_code state_abbrev party_code
## <dbl> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 118 House 20301 41 3 AL 200
## 2 118 House 21102 41 7 AL 100
## 3 118 House 21500 41 6 AL 200
## 4 118 House 22108 41 1 AL 200
## 5 118 House 22140 41 2 AL 200
## 6 118 House 22366 41 5 AL 200
## # ℹ 15 more variables: occupancy <dbl>, last_means <dbl>, bioname <chr>,
## # bioguide_id <chr>, born <dbl>, died <dbl>, nominate_dim1 <dbl>,
## # nominate_dim2 <dbl>, nominate_log_likelihood <dbl>,
## # nominate_geo_mean_probability <dbl>, nominate_number_of_votes <dbl>,
## # nominate_number_of_errors <dbl>, conditional <lgl>,
## # nokken_poole_dim1 <dbl>, nokken_poole_dim2 <dbl>
voteview <- voteview %>%
  filter(congress == 118) %>%
  mutate(party = case_when(party_code == 100 ~ 'Dem',
                           party_code == 200 ~ 'Rep',
                           TRUE ~ 'Other')) %>%
  rowwise() %>%
  mutate(nominate_avg = mean(c(nominate_dim1, nominate_dim2))) %>%
  ungroup()
voteview %>%
  ggplot() +
  geom_point(aes(nominate_dim1, nominate_dim2, color = party)) +
  scale_color_manual(values = c('Dem' = 'blue', 'Rep' = 'red', 'Other' = 'purple'))

voteview %>%
  ggplot() +
  geom_freqpoly(aes(nominate_dim1), bins = 20, color = 'red') +
  geom_freqpoly(aes(nominate_dim2), bins = 20, color = 'blue') +
  geom_freqpoly(aes(nominate_avg), bins = 20, color = 'black')
voteview <- voteview %>%
  mutate(nominate_avg_abs = abs(nominate_avg),
         nominate_dim1_abs = abs(nominate_dim1),
         nominate_dim2_abs = abs(nominate_dim2))

voteview %>%
  ggplot() +
  geom_freqpoly(aes(nominate_dim1_abs), bins = 20, color = 'red') +
  geom_freqpoly(aes(nominate_dim2_abs), bins = 20, color = 'blue') +
  geom_freqpoly(aes(nominate_avg_abs), bins = 20, color = 'black')
Our final dataset is the GovTrack ideology score. This measure uses legislators’ co-sponsorship patterns rather than their voting records (https://www.govtrack.us/data/analysis/by-congress/). Again, I normalize between -1 and 1 and take the absolute value.
govtrack <- data.frame(matrix(ncol = 11, nrow = 0))
colnames(govtrack) <- c('ID', 'ideology', 'leadership', 'name', 'party',
                        'description', 'introduced_bills', 'cosponsored_bills',
                        'unique_cosponsors', 'total_cosponsors', 'congress')

for (congress in 110:118) {
  link <- paste0('https://www.govtrack.us/data/analysis/by-congress/',
                 congress, '/sponsorshipanalysis_h.txt')
  ith_house <- read_delim(link, delim = ',', trim_ws = TRUE)
  ith_house['congress'] = congress
  colnames(ith_house) <- colnames(govtrack)
  govtrack <- rbind(govtrack, ith_house)
}
rm(congress, ith_house)

# keep only each legislator's most recent congress
govtrack <- govtrack %>%
  arrange(name, desc(congress)) %>%
  mutate(dupe = duplicated(ID)) %>%
  filter(dupe == FALSE)
govtrack %>%
  filter(congress == 118) %>%
  ggplot(aes(x = ideology, y = leadership, color = party)) +
  geom_point() +
  scale_color_manual(values = c('Democrat' = 'blue',
                                'Republican' = 'red',
                                'Libertarian' = 'pink',
                                'Independent' = 'purple'))
min_score <- min(govtrack$ideology)
max_score <- max(govtrack$ideology)

govtrack <- govtrack %>%
  mutate(gt_ideo_norm = 2 * (ideology - min_score) / (max_score - min_score) - 1,
         gt_ideo_norm_abs = abs(gt_ideo_norm))

govtrack %>%
  ggplot() +
  geom_freqpoly(aes(gt_ideo_norm_abs), bins = 20, color = 'black') +
  geom_freqpoly(aes(ideology), bins = 20, color = 'blue')
Finally, I’ll map all three partisanship scores onto the candidates dataframe.
candidates <- read_csv('data/candidates2022_clean.csv')
## New names:
## • `` -> `...1`
## Rows: 4216 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (15): fec_id, name, state, party, office, incumbent_challenge, candidate...
## dbl  (4): ...1, district, govtrack_id, twitter_id
## lgl  (1): twitter_status_id
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
candidates <- candidates %>%
  # twitter alternatives: score_norm_abs, ecdf_rank
  left_join(select(scores, name, norm_rank)) %>%
  left_join(select(voteview, bioguide_id, nominate_dim1_abs)) %>%
  left_join(select(govtrack, ID, gt_ideo_norm_abs),
            by = c('govtrack_id' = 'ID')) %>%
  mutate(ballotpedia_id = str_remove(ballotpedia, 'https://ballotpedia.org/')) %>%
  select(-...1, -twitter_status_tweet, -twitter_status_name,
         -twitter_status_id, -twitter_id, -key, -party_simple, -ballotpedia) %>%
  rename(partisan_score_twitter = norm_rank,
         partisan_score_nominate = nominate_dim1_abs,
         partisan_score_govtrack = gt_ideo_norm_abs)
## Joining with `by = join_by(name)`
## Joining with `by = join_by(bioguide_id)`
candidates <- candidates[!duplicated(candidates), ]
candidates %>%
  mutate(dupe = duplicated(name)) %>%
  filter(dupe == FALSE) %>%
  ggplot() +
  geom_freqpoly(aes(partisan_score_twitter, after_stat(density), color = 'blue')) +
  geom_freqpoly(aes(partisan_score_nominate, after_stat(density), color = 'green')) +
  geom_freqpoly(aes(partisan_score_govtrack, after_stat(density), color = 'red')) +
  scale_colour_manual(name = 'Variable',
                      values = c('blue' = 'blue', 'green' = 'green', 'red' = 'red'),
                      labels = c('twitter', 'nominate', 'govtrack')) +
  xlab('Partisan Score') +
  ylab('Density') +
  theme(legend.position = 'bottom')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2610 rows containing non-finite values (`stat_bin()`).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 3195 rows containing non-finite values (`stat_bin()`).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 3101 rows containing non-finite values (`stat_bin()`).
I produce a few visuals to see how well the scores align. While the NOMINATE and GovTrack scores appear well aligned, the Twitter score seems slightly out of sync. A quick correlation check after the plots puts numbers on the same comparison.
candidates %>%
  ggplot(aes(partisan_score_twitter, partisan_score_nominate)) +
  geom_point() +
  geom_smooth(method = 'lm', formula = 'y ~ x')
## Warning: Removed 3374 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 3374 rows containing missing values (`geom_point()`).
candidates %>%
  ggplot(aes(partisan_score_twitter, partisan_score_govtrack)) +
  geom_point() +
  geom_smooth(method = 'lm', formula = 'y ~ x')
## Warning: Removed 3301 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 3301 rows containing missing values (`geom_point()`).
candidates %>%
  ggplot(aes(partisan_score_nominate, partisan_score_govtrack)) +
  geom_point() +
  geom_smooth(method = 'lm', formula = 'y ~ x')
## Warning: Removed 3343 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 3343 rows containing missing values (`geom_point()`).
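As a numeric complement to the scatter plots, here is a quick sketch computing pairwise Pearson correlations among the three scores, using complete pairs only (candidates missing a given score are dropped pair by pair):

# pairwise correlations among the three partisanship scores
candidates %>%
  select(starts_with('partisan_score')) %>%
  cor(use = 'pairwise.complete.obs') %>%
  round(2)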
write_csv(candidates, 'data/candidates_partisanship.csv')