25 Feb 2021


General Info

This is an R Markdown presentation. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks.

Data

The dataset for assignment 1 is a collection of Donald Trump’s tweets.

Source: http://www.trumptwitterarchive.com/

Sample of Trump’s tweets
| text | retweet_count |
|---|---|
| Watching the show. #WWEHOF http://t.co/64ck6O78h3 | 36 |
| https://t.co/0Zx9wr3MoP | 16081 |
| @fackinpeter: @realDonaldTrump they hate you cuz they ain’t you #trump2016 | 25 |
| Chance favors the prepared mind.– Louis Pasteur | 352 |
| Prominent legal scholars agree that our actions to address the National Emergency at the Southern Border and to protect the American people are both CONSTITUTIONAL and EXPRESSLY authorized by Congress…. | 23108 |

Word Cloud

Feature engineering

The response variable is \(\log(\mbox{retweet_count}+1)\). The features are the columns of the document-term matrix, restricted to terms that appear at least 500 times in the corpus. We split the dataset into a 70% training set and a 30% test set.

## Train data dimensions = 28908 257
## Test data dimensions = 12462 257
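
The feature-engineering steps above can be sketched in base R. This is a minimal illustration, not the code behind the slides: the toy `tweets` data frame, the `min_count` threshold of 3 (the slides use 500 on the full corpus), the simple tokeniser, and the random split are all assumptions made for the example.

```r
# Toy corpus standing in for the tweet archive (hypothetical data).
tweets <- data.frame(
  text = c("Make America great again",
           "Great rally tonight, thank you",
           "Thank you America",
           "Fake news again",
           "Great news, thank you",
           "America first",
           "Thank you, great crowd",
           "Fake news media",
           "Great again",
           "Thank you all"),
  retweet_count = c(120, 45, 300, 15, 80, 60, 22, 9, 150, 33),
  stringsAsFactors = FALSE
)

# Response: log(retweet_count + 1); log1p() computes exactly this.
y <- log1p(tweets$retweet_count)

# Tokenise: lower-case, replace non-letters with spaces, split on whitespace.
clean  <- gsub("[^a-z ]", " ", tolower(tweets$text))
tokens <- strsplit(trimws(clean), "\\s+")

# Keep only terms whose total count in the corpus reaches the threshold.
min_count <- 3  # assumption for the toy data; the slides use 500
term_freq <- table(unlist(tokens))
vocab     <- names(term_freq)[as.vector(term_freq) >= min_count]

# Document-term matrix: one row per tweet, one column per retained term.
count_terms <- function(tok) {
  idx <- match(tok, vocab)
  tabulate(idx[!is.na(idx)], nbins = length(vocab))
}
dtm <- do.call(rbind, lapply(tokens, count_terms))
colnames(dtm) <- vocab

# 70% / 30% random split into training and test rows.
set.seed(42)
n         <- nrow(dtm)
train_idx <- sample(n, size = round(0.7 * n))
x_train   <- dtm[train_idx, , drop = FALSE]
x_test    <- dtm[-train_idx, , drop = FALSE]
y_train   <- y[train_idx]
y_test    <- y[-train_idx]

cat("Train data dimensions =", dim(x_train), "\n")
cat("Test data dimensions =", dim(x_test), "\n")
```

On a real corpus a text-mining package would normally build the document-term matrix; the base R version is used here only to keep the sketch self-contained.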

Modelling

We trained a LASSO-regularized linear regression (tuned with 5-fold cross-validation) and a random forest (tuned with out-of-bag error).
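
The slides do not show the fitting code, so the following is only a sketch of one way to set up the two models. It assumes the `glmnet` and `randomForest` packages (each block is skipped if the package is not installed), uses simulated data in place of the document-term matrix, and picks illustrative tuning values such as `ntreeTry = 200`.

```r
set.seed(1)

# Simulated stand-in for the document-term matrix and log-retweet response.
n <- 200
p <- 10
x <- matrix(rpois(n * p, lambda = 0.3), nrow = n,
            dimnames = list(NULL, paste0("term", seq_len(p))))
y <- 2 + 1.5 * x[, 1] - 1 * x[, 2] + rnorm(n, sd = 0.5)

# 70% / 30% split, as in the slides.
train <- sample(n, size = round(0.7 * n))

# Mean absolute error on the test set.
mae <- function(truth, pred) mean(abs(truth - pred))

if (requireNamespace("glmnet", quietly = TRUE)) {
  # alpha = 1 selects the LASSO penalty; nfolds = 5 gives 5-fold cross-validation
  # over the penalty strength lambda.
  lasso <- glmnet::cv.glmnet(x[train, ], y[train], alpha = 1, nfolds = 5)
  lasso_pred <- predict(lasso, newx = x[-train, ], s = "lambda.min")
  cat("LASSO test MAE:", mae(y[-train], as.numeric(lasso_pred)), "\n")
}

if (requireNamespace("randomForest", quietly = TRUE)) {
  # tuneRF() searches over mtry by comparing out-of-bag error;
  # doBest = TRUE returns a forest fitted with the best mtry found.
  rf <- randomForest::tuneRF(x[train, ], y[train], ntreeTry = 200,
                             stepFactor = 2, improve = 0.01,
                             trace = FALSE, plot = FALSE, doBest = TRUE)
  rf_pred <- predict(rf, newdata = x[-train, ])
  cat("Random forest test MAE:", mae(y[-train], rf_pred), "\n")
}
```

Out-of-bag error lets the forest be tuned without a separate validation fold, because each tree is evaluated on the observations left out of its bootstrap sample.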

Our predictive models
| model | test MAE |
|---|---|
| LASSO | 1.652451 |
| Random Forest | 1.271994 |
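
Because the response is \(\log(\mbox{retweet_count}+1)\), these errors are on the log scale. As a rough reading aid, and not part of the original analysis, the sketch below back-transforms the reported values: an absolute log-scale error of \(d\) corresponds to a prediction of retweet_count + 1 that is off by a factor of \(e^{d}\). Exponentiating the mean error gives only an indicative factor, not the mean of the per-tweet factors.

```r
# Test MAE values as reported in the table (log(retweet_count + 1) scale).
test_mae <- c("LASSO" = 1.652451, "Random Forest" = 1.271994)

# exp() turns an additive log-scale error into a multiplicative factor
# on retweet_count + 1.
error_factor <- exp(test_mae)
print(round(error_factor, 2))

# Relative reduction in test MAE when moving from LASSO to the random forest.
reduction <- 1 - test_mae[["Random Forest"]] / test_mae[["LASSO"]]
cat("Relative MAE reduction:", round(100 * reduction, 1), "%\n")
```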