Writing Influential Tweets

Fedor Duzhin
24 Feb 2021

Data

The dataset for assignment 1 is a collection of Donald Trump's tweets.

Source: http://www.trumptwitterarchive.com/

Sample of Trump’s tweets
| text | retweet_count |
|---|---|
| Watching the show. #WWEHOF http://t.co/64ck6O78h3 | 36 |
| https://t.co/0Zx9wr3MoP | 16081 |
| @fackinpeter: @realDonaldTrump they hate you cuz they ain’t you #trump2016 | 25 |
| Chance favors the prepared mind.– Louis Pasteur | 352 |
| Prominent legal scholars agree that our actions to address the National Emergency at the Southern Border and to protect the American people are both CONSTITUTIONAL and EXPRESSLY authorized by Congress…. | 23108 |

Word Cloud

[Figure: word cloud]

Feature engineering

The response variable is \( \log(\mbox{retweet_count}+1) \). The features are the columns of the document-term matrix, trimmed to terms that appear at least 500 times in the corpus. We split the dataset into a 70% training set and a 30% test set.

Train data dimensions = 28908 rows × 257 columns
Test data dimensions = 12462 rows × 257 columns
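
The feature-engineering steps above can be sketched as follows. This is a minimal illustration in Python, not the code behind these slides: the whitespace tokenizer, the helper names (`build_features`, `log_response`, `train_test_split`) and the toy tweets are all assumptions made for the example, and the 500-occurrence threshold is lowered to 2 so that the toy corpus keeps some terms.

```python
import math
import random
from collections import Counter


def build_features(tweets, min_count=500):
    """Build a document-term matrix restricted to frequent terms.

    Returns (vocab, X), where X[i][j] is how often vocab[j] occurs in tweet i.
    Tokenization is a plain lowercase whitespace split (an assumption).
    """
    tokenized = [t.lower().split() for t in tweets]
    # Total number of occurrences of each term across the whole corpus.
    totals = Counter(tok for doc in tokenized for tok in doc)
    vocab = sorted(term for term, n in totals.items() if n >= min_count)
    index = {term: j for j, term in enumerate(vocab)}
    X = []
    for doc in tokenized:
        row = [0] * len(vocab)
        for tok in doc:
            j = index.get(tok)
            if j is not None:
                row[j] += 1
        X.append(row)
    return vocab, X


def log_response(retweet_counts):
    """Response variable: log(retweet_count + 1)."""
    return [math.log1p(c) for c in retweet_counts]


def train_test_split(n_rows, train_frac=0.7, seed=0):
    """Shuffle row indices and cut them into train and test index lists."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    cut = round(train_frac * n_rows)
    return idx[:cut], idx[cut:]


# Toy corpus (hypothetical data, not taken from the archive).
tweets = ["make america great", "great great show", "watching the show"]
counts = [0, 9, 99]

vocab, X = build_features(tweets, min_count=2)
y = log_response(counts)
train_idx, test_idx = train_test_split(len(tweets))
```

With the real corpus, `min_count=500` would reproduce the trimming rule described above; the exact vocabulary would still depend on how the text is cleaned and tokenized.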

Modelling

We trained a LASSO-regularized linear regression (with 5-fold cross-validation) and a random forest (tuned using the out-of-bag, or OOB, error).
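
For reference, a LASSO fit can be written as the penalized least-squares problem below. This is the standard textbook form. The \( 1/(2n) \) scaling of the loss follows the convention of the glmnet package, which is an assumption, since the slides do not say which implementation was used. Here \( y_i = \log(\mbox{retweet_count}_i + 1) \), \( x_i \) is the row of document-term counts for tweet \( i \), and \( \lambda \ge 0 \) is the penalty strength, which is the quantity cross-validation would typically be used to choose.

\[ (\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\, \beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{\top} \beta \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \]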

Our predictive models
| Model | Test MAE |
|---|---|
| LASSO | 1.652451 |
| Random Forest | 1.271994 |
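
Because the response is \( \log(\mbox{retweet_count}+1) \), these MAE values are on the log scale. The sketch below shows how such a metric could be computed and read back on the original scale. It is an illustration in Python with hypothetical helper names (`mean_absolute_error`, `error_factor`), not the code that produced the table.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average of |y_true - y_pred| over all observations."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def error_factor(mae_log):
    """Translate a log-scale MAE into a multiplicative factor.

    exp(MAE) is the geometric mean of the factors by which the predicted
    (retweet_count + 1) is off from the observed value.
    """
    return math.exp(mae_log)


# Toy check on made-up log-scale values.
toy_mae = mean_absolute_error([0.0, 2.0], [1.0, 4.0])

# Factors implied by the test MAEs reported in the table.
lasso_factor = error_factor(1.652451)
forest_factor = error_factor(1.271994)
```

Read this way, the random forest's predictions of retweet_count + 1 are off by a factor of roughly 3.6 in the geometric-mean sense, against roughly 5.2 for LASSO.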