The purpose of writing graded lab reports is to help students stay on track and to provide summative feedback. Each lab report is just 1% of the total course mark. Please do not cheat - it is not worth it!
Solve the practical questions, knit your document into a PDF and submit it to NTULearn before the deadline. The deadline is very tight because the task is simple: we are sure that everyone is capable of doing it by themselves, and we want to discourage taking someone else's report and rewriting it in your own words.
You will get "excellent", or 100%, for this lab report if everything is perfect. You will get "good", or 75%, if there are minor issues. For example, you will get "good" rather than "excellent" if you transform the data by manually subsetting and changing variable names instead of using a tidyverse pipeline, or if you implement cross-validation from scratch instead of using the library caret. You will get "average", or 50%, if there are serious issues in your report, such as failing to do cross-validation or data normalization when it is needed. You will get "poor", or 25%, if you barely attempt this report. You will get "not done", or 0%, if you do not attempt this report.
Deadline: 16 Sep 2024, midnight.
We will work with a dataset of New Year resolutions posted on Twitter.
Source: https://data.world/crowdflower/2015-new-years-resolutions
Here, we load the libraries and the data, and set the random seed. Replace the number "1729" with the numeric part of your matric number.
library(tidyverse) # for data manipulation
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret) # for machine learning, including KNN
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(rpart) # for training decision trees
library(rpart.plot) # for plotting decision trees
library(ranger) # for training random forest
library(tm) # for text mining
## Loading required package: NLP
##
## Attaching package: 'NLP'
##
## The following object is masked from 'package:ggplot2':
##
## annotate
library(wordcloud) # for text visualization
## Loading required package: RColorBrewer
N <- read_csv("new_years_resolutions.csv")
## Rows: 5011 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): other_topic, resolution_topics, gender, name, Resolution_Category,...
## dbl (2): retweet_count, tweet_id
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
set.seed(1729) # replace the number '1729' with your matric no
We will predict the overall category of a New Year resolution based on the text and the author’s gender.
N %>%
  as_tibble %>%
  # sample_n(10) %>%
  select(text, gender, Resolution_Category)
Here is the total count of tweets in each resolution category:
N %>%
  group_by(Resolution_Category) %>%
  summarise(N = n()) %>%
  arrange(-N)
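The same counts are easier to compare at a glance as a bar chart. Here is an optional ggplot2 sketch (not required by the task) that reuses the counts above:
N %>%
  count(Resolution_Category) %>%
  # order the categories by count so the longest bar ends up on top
  ggplot(aes(x = reorder(Resolution_Category, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Number of tweets")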
Because of the specific nature of this collection of tweets, we will remove the following words from word clouds (we will not remove them from the predictive models):
remove_from_wc <- c(stopwords(),
                    "newyearsresolution",
                    "new", "years", "resolution",
                    "twacc", "htag", "year", "http")
Clean the raw Twitter data as follows:

- Remove all characters except for the alphabet, "@" and "#"
- Change all characters to lower case
- Change "@" to "twacc" and "#" to "htag"
- Introduce a new variable Y whose value is Resolution_Category converted to a factor (a categorical variable in R)
- Extract only the cleaned text, gender and Y from the data

Then plot the word cloud for tweets in the category "Health & Fitness". Note that words in remove_from_wc should not be plotted.
Explain in your own words what seem to be the most important themes and challenges for people who tweeted about their "Health & Fitness" resolutions.
Solution

First, we will clean the data. The cleaned dataset is called raw_data.
raw_data <- N %>%
  mutate(cleaned_text = gsub("[^a-zA-Z@#]", " ", text)) %>%
  mutate(cleaned_text = tolower(cleaned_text)) %>%
  mutate(cleaned_text = gsub("@", "twacc ", cleaned_text)) %>%
  mutate(cleaned_text = gsub("#", "htag ", cleaned_text)) %>%
  mutate(Y = as.factor(Resolution_Category)) %>%
  as_tibble %>%
  select(cleaned_text, gender, Y)
head(raw_data)
Here is the word cloud for "Health & Fitness" tweets. Note that we define a helper function to plot a word cloud from a character vector:
plot_wc <- function(x) {
  # Count term frequencies in the character vector x
  frequent_words <- termFreq(x)
  # Drop the corpus-specific stop words defined in remove_from_wc
  frequent_words <- frequent_words[
    !(names(frequent_words) %in% remove_from_wc)]
  wordcloud(words = names(frequent_words),
            freq = frequent_words, min.freq = 0,
            max.words = 100, random.order = FALSE, rot.per = 0.35,
            colors = brewer.pal(8, "Dark2"))
}
raw_data %>%
  filter(Y == "Health & Fitness") %>%
  pull(cleaned_text) %>%
  plot_wc
The most important challenges seem to be quitting smoking, losing weight, and eating healthier.
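To back this reading with numbers rather than font sizes, a quick sketch (reusing raw_data and remove_from_wc from above) lists the ten most frequent remaining terms in this category:
hf_freq <- raw_data %>%
  filter(Y == "Health & Fitness") %>%
  pull(cleaned_text) %>%
  termFreq()
hf_freq <- hf_freq[!(names(hf_freq) %in% remove_from_wc)]  # same filter as plot_wc
head(sort(hf_freq, decreasing = TRUE), 10)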
Create a document-term matrix that only contains words whose overall frequency is at least 50. Then create 60% training and 40% test datasets with the words from the document-term matrix as predictors. Also include the author's gender and add the response variable Y. Report the dimensions of the training and test sets.
Solution
words_freq <- termFreq(raw_data$cleaned_text)
frequent_words <- words_freq[words_freq >= 50]
corpus <- VCorpus(VectorSource(raw_data$cleaned_text))
DTM <- DocumentTermMatrix(corpus)[ , names(frequent_words)]
p <- 0.6                      # fraction of tweets that goes to the training set
ind <- runif(nrow(DTM)) < p   # TRUE with probability 0.6 for each tweet
all_data <- DTM %>%
  as.matrix %>%
  as_tibble %>%
  rename_with(function(x) paste("w", x, sep = "_")) %>%
  mutate(gender = raw_data$gender) %>%
  mutate(Y = raw_data$Y)
train_data <- all_data[ind , ]
test_data <- all_data[!ind , ]
cat("Dimensions of the training set are", dim(train_data),"\n")
## Dimensions of the training set are 2972 136
cat("Dimensions of the test set are", dim(test_data),"\n")
## Dimensions of the test set are 2039 136
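A side note: the runif() split above is random but not stratified, so rare categories may be slightly over- or under-represented in the training set. If that matters, caret offers createDataPartition(), which samples within each level of Y. A minimal sketch (optional; the train_strat/test_strat names are ours):
idx <- createDataPartition(all_data$Y, p = 0.6, list = FALSE)[, 1]
train_strat <- all_data[idx, ]   # ~60% of rows, class proportions preserved
test_strat <- all_data[-idx, ]
dim(train_strat)
dim(test_strat)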
Train and plot a decision tree to predict the resolution category Y based on word occurrence in the tweet text and the author's gender. Also report the overall accuracy of the model. Note that not all 10 categories will actually be predicted by the tree; this is alright, don't worry about it.
Solution
First, we train and plot the decision tree.
mod_tree <- rpart(Y ~ . , data = train_data)
rpart.plot(mod_tree)
The overall accuracy on the test set is
cm_tree <- mod_tree %>%
  predict(test_data, type = "class") %>%
  confusionMatrix(test_data$Y)
cm_tree$overall['Accuracy']
## Accuracy
## 0.4144188
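To put 41% in context, confusionMatrix() also reports the no-information rate, i.e. the accuracy of always predicting the most common category in the test set:
cm_tree$overall['AccuracyNull']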
Train a random forest to predict the resolution category Y based on word occurrence in the tweet text and the author's gender. When you tune hyperparameter values, use the OOB error as the measure of your models' goodness of fit. Set splitrule = gini and set the number of trees to 50 for tuning, then retrain one model with 500 trees using the optimal hyperparameter values. Finally, plot the variable importance of the 10 most important predictors and report the overall accuracy of the final model.
Solution
First, we tune the random forest. To keep the grid search fast, we tune on a random subsample of 500 training tweets.
ind <- sample(1:nrow(train_data), size = 500, replace = FALSE)
mini_data <- train_data %>% slice(ind)
rfGrid <- expand.grid(mtry = c(8, 10, 15, 20, 30),
                      min.node.size = c(3, 5, 10, 20),
                      splitrule = "gini")
mod_rf_tune <- train(Y ~ . , data = mini_data, method = "ranger",
                     num.trees = 50,
                     importance = 'impurity',
                     tuneGrid = rfGrid,
                     trControl = trainControl("oob"))
mod_rf_tune
## Random Forest
##
## 500 samples
## 135 predictors
## 10 classes: 'Career', 'Education/Training', 'Family/Friends/Relationships', 'Finance', 'Health & Fitness', 'Humor', 'Personal Growth', 'Philanthropic', 'Recreation & Leisure', 'Time Management/Organization'
##
## No pre-processing
## Resampling results across tuning parameters:
##
## mtry min.node.size Accuracy Kappa
## 8 3 0.394 0.13534307
## 8 5 0.366 0.09973873
## 8 10 0.390 0.12867599
## 8 20 0.378 0.09451937
## 10 3 0.392 0.13724600
## 10 5 0.364 0.10197903
## 10 10 0.348 0.07910645
## 10 20 0.370 0.10529664
## 15 3 0.352 0.10646818
## 15 5 0.384 0.15055104
## 15 10 0.366 0.11250224
## 15 20 0.362 0.10092726
## 20 3 0.338 0.08385368
## 20 5 0.380 0.15140101
## 20 10 0.370 0.13071315
## 20 20 0.370 0.12437733
## 30 3 0.350 0.13120188
## 30 5 0.362 0.13333116
## 30 10 0.400 0.18376675
## 30 20 0.368 0.12706700
##
## Tuning parameter 'splitrule' was held constant at a value of gini
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 30, splitrule = gini
## and min.node.size = 10.
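caret stores the winning combination in the fitted object; the retraining call below simply reuses it:
mod_rf_tune$bestTune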
Now we retrain the model on the full training set with the best combination of hyperparameter values and report the overall accuracy of the final model.
mod_rf_tuned <- train(Y ~ . , data = train_data, method = "ranger",
                      num.trees = 500,
                      importance = 'impurity',
                      tuneGrid = expand.grid(mod_rf_tune$bestTune),
                      trControl = trainControl("oob"))
cm_rf <- mod_rf_tuned %>%
  predict(test_data, type = "raw") %>%
  confusionMatrix(test_data$Y)
cm_rf$overall['Accuracy']
## Accuracy
## 0.4423737
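Beyond the single accuracy number, the confusion matrix object also carries per-class statistics. An optional peek at sensitivity and specificity shows which categories the forest actually recognises:
cm_rf$byClass[, c("Sensitivity", "Specificity")]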
And here is the variable importance plot:
plot(varImp(mod_rf_tuned), top = 10)
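If you want the numbers behind the plot, varImp() returns an importance table (here assumed to have caret's usual one-column "Overall" layout for ranger impurity importance):
imp <- varImp(mod_rf_tuned)$importance
head(imp[order(-imp$Overall), , drop = FALSE], 10)  # top 10 predictors by impurity importance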