25 Feb 2021
General Info
This is an R Markdown presentation. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks.

Data
The dataset for assignment 1 is a collection of Donald Trump’s tweets.
Source: http://www.trumptwitterarchive.com/
Sample of Trump’s tweets
| text | retweet_count |
|---|---|
| Watching the show. #WWEHOF http://t.co/64ck6O78h3 | 36 |
| https://t.co/0Zx9wr3MoP | 16081 |
| @fackinpeter: @realDonaldTrump they hate you cuz they ain’t you #trump2016 | 25 |
| Chance favors the prepared mind.– Louis Pasteur | 352 |
| Prominent legal scholars agree that our actions to address the National Emergency at the Southern Border and to protect the American people are both CONSTITUTIONAL and EXPRESSLY authorized by Congress…. | 23108 |
Word Cloud

Feature engineering
The response variable is \(\log(\mbox{retweet_count}+1)\); adding 1 keeps the logarithm defined for tweets with zero retweets. Features are the columns of the document-term matrix, trimmed to terms that appear at least 500 times in the corpus. We split the dataset into a 70% training set and a 30% test set.
## Train data dimensions = 28908 257
## Test data dimensions = 12462 257
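The feature-engineering steps above can be sketched in base R as follows. This is a minimal illustration on a made-up five-tweet data frame, not the code behind the slides: the toy `tweets` data, the tokenizer, the seed, and the threshold of 2 (standing in for 500) are all assumptions, and "appears at least 500 times" is read here as total count across the corpus.

```r
# Toy stand-in for the tweet data; the real corpus is not reproduced here.
tweets <- data.frame(
  text = c("make america great again",
           "great rally tonight",
           "america first",
           "thank you america",
           "great again"),
  retweet_count = c(120, 36, 0, 980, 15),
  stringsAsFactors = FALSE
)

# Response: log(retweet_count + 1); log1p() handles zero retweets.
y <- log1p(tweets$retweet_count)

# Document-term matrix: one row per tweet, one column per term, entries are counts.
# The tokenizer is a simple assumption: split on anything that is not a
# lowercase letter, digit, '#' or '@'.
tokens <- strsplit(tolower(tweets$text), "[^a-z0-9#@]+")
vocab  <- sort(unique(unlist(tokens)))
dtm <- t(vapply(tokens,
                function(tok) as.numeric(table(factor(tok, levels = vocab))),
                numeric(length(vocab))))
colnames(dtm) <- vocab

# Keep terms whose total corpus count reaches the threshold
# (500 in the slides; 2 here so the toy example keeps some columns).
min_count <- 2
dtm_trim <- dtm[, colSums(dtm) >= min_count, drop = FALSE]

# 70/30 train/test split by row.
set.seed(1)  # hypothetical seed, only to make the sketch reproducible
n <- nrow(dtm_trim)
train_idx <- sample(n, size = floor(0.7 * n))
x_train <- dtm_trim[train_idx, , drop = FALSE]
x_test  <- dtm_trim[-train_idx, , drop = FALSE]
y_train <- y[train_idx]
y_test  <- y[-train_idx]

cat("Train data dimensions =", dim(x_train), "\n")
cat("Test data dimensions =", dim(x_test), "\n")
```

On the real data the same steps would give the 28908 and 12462 rows reported above, which is a 70/30 split of 41370 tweets.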
Modelling
We trained a LASSO-regularized linear regression (with 5-fold cross-validation) and a random forest (tuned using out-of-bag error).
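One way to fit the two models is sketched below. The slides do not say which functions were used, so the choice of `glmnet::cv.glmnet()` for the LASSO and `randomForest::tuneRF()` for the OOB-based tuning is an assumption, as are the synthetic data and every tuning value shown.

```r
library(glmnet)        # assumption: LASSO fitted with glmnet
library(randomForest)  # assumption: random forest fitted with randomForest

set.seed(1)  # hypothetical seed for the sketch

# Synthetic stand-in for the document-term features and the log response.
n <- 200; p <- 10
x <- matrix(rpois(n * p, lambda = 1), nrow = n,
            dimnames = list(NULL, paste0("term", seq_len(p))))
y <- 1 + 0.8 * x[, 1] - 0.5 * x[, 2] + rnorm(n, sd = 0.5)

# 70/30 split, mirroring the setup described in the slides.
train_idx <- sample(n, size = floor(0.7 * n))
x_train <- x[train_idx, ];  y_train <- y[train_idx]
x_test  <- x[-train_idx, ]; y_test  <- y[-train_idx]

# LASSO (alpha = 1): the penalty lambda is chosen by 5-fold cross-validation.
lasso_cv   <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 5)
lasso_pred <- predict(lasso_cv, newx = x_test, s = "lambda.min")

# Random forest: tuneRF() searches over mtry using the out-of-bag error;
# doBest = TRUE returns a forest refitted at the best mtry found.
rf_fit <- tuneRF(x_train, y_train, ntreeTry = 200, stepFactor = 2,
                 improve = 0.01, trace = FALSE, plot = FALSE, doBest = TRUE)
rf_pred <- predict(rf_fit, newdata = x_test)
```

Tuning on OOB error avoids a separate validation split for the forest, because each tree is evaluated on the observations left out of its bootstrap sample.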
Our predictive models
| model | test.MAE |
|---|---|
| LASSO | 1.652451 |
| Random Forest | 1.271994 |
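For reference, the test MAE is the mean absolute difference between observed and predicted values of the response, here on the \(\log(\mbox{retweet_count}+1)\) scale. The helper `mae()` and the predictions below are hypothetical and only illustrate the calculation; the retweet counts are taken from the sample table.

```r
# Mean absolute error between observed and predicted values.
mae <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  mean(abs(actual - predicted))
}

# Retweet counts from the sample table, moved to the log(count + 1) scale.
retweets <- c(36, 16081, 25, 352)
y_test   <- log1p(retweets)
y_hat    <- c(4.0, 8.5, 3.0, 6.5)  # made-up predictions, for illustration only

mae(y_test, y_hat)

# An absolute error of d on the log scale corresponds to a multiplicative
# factor of exp(d) on (retweet_count + 1).
exp(c(LASSO = 1.652451, `Random Forest` = 1.271994))
```

Read this way, the reported MAEs correspond to predictions of retweet_count + 1 that are off by a factor of roughly 5.2 for the LASSO and 3.6 for the random forest, in the geometric-mean sense.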