Fedor Duzhin
24 Feb 2021
The dataset for assignment 1 is a collection of Donald Trump's tweets.
Source: http://www.trumptwitterarchive.com/
text | retweet_count |
---|---|
Watching the show. #WWEHOF http://t.co/64ck6O78h3 | 36 |
https://t.co/0Zx9wr3MoP | 16081 |
@fackinpeter: @realDonaldTrump they hate you cuz they ain’t you #trump2016 | 25 |
Chance favors the prepared mind.– Louis Pasteur | 352 |
Prominent legal scholars agree that our actions to address the National Emergency at the Southern Border and to protect the American people are both CONSTITUTIONAL and EXPRESSLY authorized by Congress…. | 23108 |
The response variable is \( \log(\mbox{retweet_count}+1) \). The features are the columns of the document-term matrix, trimmed to terms that appear at least 500 times in the corpus. We split the dataset into 70% training and 30% test sets.
Train data dimensions (rows × columns) = 28908 × 257
Test data dimensions (rows × columns) = 12462 × 257
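The slides do not show the preprocessing code, so the following is only a minimal Python sketch of the steps described above: count terms per tweet, keep terms whose corpus-wide count reaches a threshold, take the log of the retweet count plus one, and split the rows 70/30. The tokenizer, the helper names `tokenize`, `build_dtm` and `train_test_split`, the toy tweets and the toy threshold of 2 (instead of 500) are all assumptions made for illustration.

```python
import math
import random
import re
from collections import Counter


def tokenize(text):
    """Lowercase a tweet and split it into simple word tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def build_dtm(texts, min_count):
    """Return (vocabulary, rows), keeping terms whose corpus-wide count is >= min_count."""
    token_lists = [tokenize(t) for t in texts]
    corpus_counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = sorted(term for term, n in corpus_counts.items() if n >= min_count)
    index = {term: j for j, term in enumerate(vocab)}
    rows = []
    for toks in token_lists:
        row = [0] * len(vocab)
        for tok in toks:
            j = index.get(tok)
            if j is not None:
                row[j] += 1  # term frequency within this tweet
        rows.append(row)
    return vocab, rows


def train_test_split(n_rows, train_fraction=0.7, seed=0):
    """Shuffle row indices reproducibly and cut them into train and test parts."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_fraction * n_rows))
    return idx[:cut], idx[cut:]


# Toy data, not taken from the real corpus.
toy_texts = [
    "Make America great again",
    "Great show tonight",
    "America is great",
    "Thank you",
]
toy_retweets = [0, 9, 99, 999]

vocab, rows = build_dtm(toy_texts, min_count=2)   # the slides use 500
y = [math.log1p(c) for c in toy_retweets]         # log(retweet_count + 1)
train_idx, test_idx = train_test_split(len(toy_texts), train_fraction=0.7)
```

On the real corpus the threshold would be 500, as stated above. The regular-expression tokenizer splits URLs and hashtags into fragments, so a real pipeline would likely need a more careful one.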
We trained a LASSO-regularized linear regression (with 5-fold cross-validation) and a random forest (tuned using the out-of-bag (OOB) error).
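To make the LASSO part concrete, here is a minimal Python sketch of L1-regularized least squares solved by coordinate descent for one fixed penalty `lam`. It is not the code behind the slides, which do not show their fitting routine. The objective scaling \( \frac{1}{2n}\lVert y - \beta_0 - X\beta \rVert^2 + \lambda \lVert \beta \rVert_1 \), the function names and the toy data are assumptions, and the 5-fold search over `lam` is left out.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values inside the band become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimize (1/(2n)) * ||y - b0 - X b||^2 + lam * ||b||_1 by cyclic updates.

    X is a list of rows, y a list of responses. The intercept b0 is not penalized.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    intercept = sum(y) / n
    residual = [yi - intercept for yi in y]
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_beta = soft_threshold(rho, lam) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
        # Re-fit the unpenalized intercept so the residuals average to zero.
        shift = sum(residual) / n
        intercept += shift
        residual = [r - shift for r in residual]
    return intercept, beta


# Toy data generated as y = 1 + 2 * x1 + 0 * x2 (illustrative only).
X_toy = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_toy = [3.0, 1.0, 3.0, 1.0]

b0_ols, beta_ols = lasso_coordinate_descent(X_toy, y_toy, lam=0.0)
b0_big, beta_big = lasso_coordinate_descent(X_toy, y_toy, lam=10.0)
```

With `lam=0.0` the sketch reduces to ordinary least squares on the toy data. With a large penalty every coefficient is set to zero and only the intercept remains. This shrinkage is how LASSO drops uninformative terms from a wide document-term matrix.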
model | test.MAE |
---|---|
LASSO | 1.652451 |
Random Forest | 1.271994 |
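Because the response is log-transformed, these MAE values are in log units rather than retweets. The Python sketch below shows how such an MAE is computed and how it maps back to a multiplicative error in retweet_count + 1. It assumes the log is the natural logarithm. The helper names are hypothetical, and the two MAE values are copied from the table above.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted responses."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def log_mae_to_factor(mae):
    """Geometric-mean factor by which (retweet_count + 1) is missed.

    Valid when the MAE was computed on natural-log responses.
    """
    return math.exp(mae)


# Test-set MAE values reported in the table above.
reported = {"LASSO": 1.652451, "Random Forest": 1.271994}
for model, mae in reported.items():
    print(f"{model}: log-scale MAE {mae:.3f}, factor {log_mae_to_factor(mae):.2f}")
```

Under that assumption, a lower log-scale MAE means the predictions are off by a smaller multiplicative factor on average, in the geometric-mean sense.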