Info

Objective

The purpose of writing graded lab reports is to help students stay on track and to provide summative feedback. Each lab report is worth just 1% of the total course mark. Please do not cheat - it is not worth it!

Your task

Solve the practical questions, knit your document into a PDF and submit it to NTULearn before the deadline. The deadline is tight because the task is simple: we are sure that everyone is capable of doing it on their own, and we want to discourage taking someone else’s report and rewriting it in your own words.

Marking scheme

You will get “excellent”, or 100%, for this lab report if everything is perfect. You will get “good”, or 75%, if there are minor issues. For example, you will get “good” rather than “excellent” if you transform the data by manually subsetting and renaming variables instead of using a tidyverse pipeline, or if you implement cross-validation from scratch instead of using the library caret. You will get “average”, or 50%, if there are serious issues in your report, such as failing to do cross-validation or data normalization when it is needed. You will get “poor”, or 25%, if you barely attempt this report. You will get “not done”, or 0%, if you do not attempt this report.
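For reference, here is a minimal sketch of what “using the library caret” for cross-validation looks like; the model method and data name are placeholders, not part of this lab:

# 5-fold cross-validation; caret handles the folds, fitting and averaging
ctrl <- trainControl(method = "cv", number = 5)
# mod <- train(Y ~ . , data = my_data, method = "knn", trControl = ctrl)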

Deadline

16 Sep 2024, midnight

Libraries

We will work with a dataset of New Year resolutions posted on Twitter.

Source: https://data.world/crowdflower/2015-new-years-resolutions

Here, we load the libraries and the data, and set the random seed. Replace the number “1729” with the numeric part of your matric number (for example, a hypothetical matric number U1729000X would give set.seed(1729000)).

library(tidyverse) # for data manipulation
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret) # for machine learning, including KNN
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(rpart) # for training decision trees
library(rpart.plot) # for plotting decision trees

library(ranger) # for training random forest
library(tm) # for text mining
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## 
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(wordcloud) # for text visualization
## Loading required package: RColorBrewer
N <- read_csv("new_years_resolutions.csv")
## Rows: 5011 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): other_topic, resolution_topics, gender, name, Resolution_Category,...
## dbl  (2): retweet_count, tweet_id
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
set.seed(1729) # replace the number '1729' with your matric no

We will predict the overall category of a New Year resolution based on the text and the author’s gender.

N %>%
  as_tibble() %>%
  select(text, gender, Resolution_Category)

Here is the total count of tweets in each resolution category:

N %>% group_by(Resolution_Category) %>%
  summarise(N = n()) %>%
  arrange(-N)
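Equivalently, dplyr’s count() collapses the group_by/summarise/arrange chain into a single call:

N %>% count(Resolution_Category, sort = TRUE, name = "N")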

Because of the specific nature of this collection of tweets, we will remove the following words from word clouds (we will not remove them from the predictive models):

remove_from_wc <- c(stopwords(),
                    "newyearsresolution", 
                    "new", "years", "resolution", 
                    "twacc", "htag", "year", "http")

Question 1

Clean the raw twitter data as follows:

  1. Remove all characters except for letters, “@” and “#”

  2. Change all characters to lower case

  3. Change “@” to “twacc” and “#” to “htag”

  4. Introduce a new variable Y whose value is Resolution_Category converted to a factor (a categorical variable in R)

  5. Extract only the cleaned text, gender and Y from the data

Then plot the word cloud for tweets in the category “Health & Fitness”. Note that words in remove_from_wc should not be plotted.

Explain in your own words what seem to be the most important themes and challenges for people who tweeted about their “Health & Fitness” resolutions.

Solution

First, we will clean the data. The cleaned dataset is called raw_data.

raw_data <- N %>%
  mutate(cleaned_text = gsub("[^a-zA-Z@#]", " ", text)) %>%    # 1. keep letters, "@" and "#"
  mutate(cleaned_text = tolower(cleaned_text)) %>%             # 2. lower case
  mutate(cleaned_text = gsub("@", "twacc ", cleaned_text)) %>% # 3. "@" -> "twacc"
  mutate(cleaned_text = gsub("#", "htag ", cleaned_text)) %>%  #    "#" -> "htag"
  mutate(Y = as.factor(Resolution_Category)) %>%               # 4. response as a factor
  as_tibble() %>%
  select(cleaned_text, gender, Y)                              # 5. keep only what we need

head(raw_data)
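As a quick sanity check, the same transformations applied to a single made-up tweet (the example string is ours, not from the data) behave as expected:

x <- "#NewYearsResolution: Run 5km with @friend!"
x <- gsub("[^a-zA-Z@#]", " ", x)   # step 1: keep letters, "@" and "#"
x <- tolower(x)                    # step 2: lower case
x <- gsub("@", "twacc ", x)        # step 3: "@" -> "twacc"
x <- gsub("#", "htag ", x)         #         "#" -> "htag"
x
## [1] "htag newyearsresolution  run  km with twacc friend "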

Here is the word cloud for “Health & Fitness” tweets. Note that we wrote a helper function that plots a word cloud from a character vector:

plot_wc <- function(x) {
  frequent_words <- termFreq(x)
  # drop the tweet-specific boilerplate words defined in remove_from_wc
  frequent_words <- frequent_words[
    !(names(frequent_words) %in% remove_from_wc)]
  
  wordcloud(words = names(frequent_words),
            freq = frequent_words, min.freq = 0,
            max.words = 100, random.order = FALSE, rot.per = 0.35,
            colors = brewer.pal(8, "Dark2"))
}

raw_data %>%
  filter(Y == "Health & Fitness") %>%
  pull(cleaned_text) %>%
  plot_wc

The most important challenges seem to be quitting smoking, losing weight, and eating healthier.

Question 2

Create a document-term matrix that only contains words whose overall frequency is at least 50. Then create 60% training and 40% test datasets with the words from the document-term matrix as predictors. Also include the author’s gender and add the response variable Y. Report the dimensions of the training and test sets.

Solution

words_freq <- termFreq(raw_data$cleaned_text)   # overall term frequencies
frequent_words <- words_freq[words_freq >= 50]  # keep terms occurring at least 50 times

corpus <- VCorpus(VectorSource(raw_data$cleaned_text))
DTM <- DocumentTermMatrix(corpus)[ , names(frequent_words)]

p <- 0.6
ind <- runif(nrow(DTM)) < p  # TRUE with probability 0.6: row goes to the training set
all_data <- DTM %>%
  as.matrix() %>%
  as_tibble() %>%
  rename_with(function(x) paste("w", x, sep = "_")) %>%  # prefix word columns with "w_"
  mutate(gender = raw_data$gender) %>%
  mutate(Y = raw_data$Y)

train_data <- all_data[ind , ]
test_data <- all_data[!ind , ]

cat("Dimensions of the training set are", dim(train_data),"\n")
## Dimensions of the training set are 2972 136
cat("Dimensions of the test set are", dim(test_data),"\n")
## Dimensions of the test set are 2039 136
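Since rows are assigned to the two sets at random, the split is only approximately 60/40, and the class proportions should be similar in both sets; a quick check (output omitted):

round(prop.table(table(train_data$Y)), 2)  # class shares in the training set
round(prop.table(table(test_data$Y)), 2)   # class shares in the test set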

Question 3

Train and plot a decision tree to predict the resolution category Y based on word occurrences in the tweet text and the author’s gender. Also report the overall accuracy of the model. Note that the tree will not actually predict all 10 categories - this is alright, don’t worry about it.

Solution

First, we train and plot the decision tree.

mod_tree <- rpart(Y ~ . , data = train_data)
rpart.plot(mod_tree)

The overall accuracy is

cm_tree <- mod_tree %>%
  predict(test_data, type = "class") %>%
  confusionMatrix(test_data$Y)

cm_tree$overall['Accuracy']
##  Accuracy 
## 0.4144188
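Beyond the overall accuracy, the confusionMatrix object also stores the full cross-tabulation and per-class statistics, which show which categories the tree actually predicts (output omitted):

cm_tree$table                      # predicted vs. actual categories
cm_tree$byClass[ , "Sensitivity"]  # per-class recall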

Question 4

Train a random forest to predict the resolution category Y based on word occurrences in the tweet text and the author’s gender. When you tune the hyperparameter values, use the OOB error as the measure of your models’ goodness of fit. Set splitrule = "gini", use 50 trees for tuning, and then retrain one model with 500 trees using the optimal hyperparameter values.

Finally, plot the variable importance of the 10 most important predictors and report the overall accuracy of the final model.

Solution

First, we tune the random forest. To keep the grid search fast, we tune on a random subsample of 500 training rows.

ind <- sample(1:nrow(train_data), size = 500, replace = FALSE)  # subsample for tuning
mini_data <- train_data %>% slice(ind)

rfGrid <- expand.grid(mtry = c(8, 10, 15, 20, 30),
                      min.node.size = c(3, 5, 10, 20),
                      splitrule = "gini")

mod_rf_tune <- train(Y ~ . , data = mini_data, method = "ranger",
                num.trees = 50,
                importance = 'impurity',
                tuneGrid = rfGrid,
                trControl = trainControl("oob"))  # OOB error as goodness-of-fit measure
mod_rf_tune
## Random Forest 
## 
## 500 samples
## 135 predictors
##  10 classes: 'Career', 'Education/Training', 'Family/Friends/Relationships', 'Finance', 'Health & Fitness', 'Humor', 'Personal Growth', 'Philanthropic', 'Recreation & Leisure', 'Time Management/Organization' 
## 
## No pre-processing
## Resampling results across tuning parameters:
## 
##   mtry  min.node.size  Accuracy  Kappa     
##    8     3             0.394     0.13534307
##    8     5             0.366     0.09973873
##    8    10             0.390     0.12867599
##    8    20             0.378     0.09451937
##   10     3             0.392     0.13724600
##   10     5             0.364     0.10197903
##   10    10             0.348     0.07910645
##   10    20             0.370     0.10529664
##   15     3             0.352     0.10646818
##   15     5             0.384     0.15055104
##   15    10             0.366     0.11250224
##   15    20             0.362     0.10092726
##   20     3             0.338     0.08385368
##   20     5             0.380     0.15140101
##   20    10             0.370     0.13071315
##   20    20             0.370     0.12437733
##   30     3             0.350     0.13120188
##   30     5             0.362     0.13333116
##   30    10             0.400     0.18376675
##   30    20             0.368     0.12706700
## 
## Tuning parameter 'splitrule' was held constant at a value of gini
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 30, splitrule = gini
##  and min.node.size = 10.

Now we retrain it with the best combination of hyperparameter values and report the overall accuracy of the final model.

mod_rf_tuned <- train(Y ~ . , data = train_data, method = "ranger",
                num.trees = 500,
                importance = 'impurity',
                tuneGrid = mod_rf_tune$bestTune,  # best combination found above
                trControl = trainControl("oob"))

cm_rf <- mod_rf_tuned %>%
  predict(test_data, type = "raw") %>%
  confusionMatrix(test_data$Y)

cm_rf$overall['Accuracy']
##  Accuracy 
## 0.4423737

And here is the variable importance plot:

plot(varImp(mod_rf_tuned), top = 10)
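If the numeric scores behind the plot are needed, they can be extracted from the varImp object; note that caret scales them to a 0-100 range by default (output omitted):

vi <- varImp(mod_rf_tuned)$importance             # data frame with an 'Overall' column
head(vi[order(-vi$Overall), , drop = FALSE], 10)  # ten most important predictors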