ANLY540 - Analysis of Human Language - Assignment 6: Trees and Forests

Load the libraries + functions

Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.

The data for this project has already been loaded. You will be distinguishing between the categories of nerd and geek to determine the influence of respective variables on their category definition.

library(Rling)
library(party)

data(nerd)
head(nerd)

table(nerd$Noun)

## 
## geek nerd 
##  670  646

Description of the data

Dependent variable:

Noun: which category is represented, either nerd or geek.

Independent variables:

Num: a measure of social group, either pl (plural) or sg (singular)
Century: time measurement, as XX (20th) or XXI (21st) century
Register: information about where the data was coded from: ACAD (academic), MAG (magazine), NEWS (newspapers), or SPOK (spoken)
Eval: A measure of the semanticity of the word, Neg for negative, Neutral, and Positive

Conditional inference model

Set a seed for the random number generator before fitting the model.
Use ctree() to create a conditional inference model.

set.seed(12345)
tree.output = ctree(Noun ~ Num + Century + Register + Eval, data = nerd)

Make a plot

Plot the conditional inference model.

plot(tree.output)

Interpret the categories

Write out an interpretation of the results from the model. You can interpret the branches of the tree to determine what featurally defines each category.

When the semanticity of the word is positive and the usage is in the 21st century, “geek” is much more likely than “nerd”.
When the semanticity of the word is positive and the usage is in the 20th century, “geek” and “nerd” are almost equally likely.
When the semanticity of the word is neutral or negative and the usage is in the 21st century, “geek” and “nerd” are almost equally likely.
When the semanticity of the word is neutral or negative and the usage is in the 20th century, “geek” is much less likely than “nerd”.

Therefore, the word “geek” is more common in the 21st century and in positive contexts, whereas “nerd” is more common in the 20th century and in neutral or negative contexts. The first split is on Eval, a measure of the semanticity of the word. The second-level split is on Century, a time measurement. The tree model did not find a significant split on Register (where the data was coded from) or on Num (a measure of social group).

With only two categories, you will see the proportion split as the output in the bar graph - look for the group with the larger proportion.

The group with the largest proportion is Node 3, where “geek” is much more likely than “nerd” in the 21st century in positive contexts.

Conditional inference model predictiveness

Calculate the percent correct classification for the conditional inference model.

outcomes = table(predict(tree.output), nerd$Noun)
outcomes

##       
##        geek nerd
##   geek  227   61
##   nerd  443  585

sum(diag(outcomes)) / sum(outcomes) * 100

## [1] 61.70213

# Share of observed "geek" cases that were predicted correctly
outcomes[1, 1] / sum(outcomes[, 1]) * 100

## [1] 33.8806

# Share of observed "nerd" cases that were predicted correctly
outcomes[2, 2] / sum(outcomes[, 2]) * 100

## [1] 90.55728

# Proportion of observed cases that are "geek"
sum(outcomes[, 1]) / sum(outcomes)

## [1] 0.5091185

# Proportion of predictions that are "geek"
sum(outcomes[1, ]) / sum(outcomes)

## [1] 0.218845

The overall classification accuracy of the model is 61.70%. However, it is very poor at identifying “geek”: only 33.88% of the actual “geek” cases are classified correctly. The model does much better on “nerd”, classifying 90.56% of the actual “nerd” cases correctly. This is because the model predicts “nerd” far more often than the data warrants: the actual data has a geek:nerd split of 51:49, whereas the predictions have a split of 22:78.
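
The two per-category figures above measure how many of the actual cases of each noun the tree recovers (recall). A minimal sketch in base R that also reports precision, that is, how often a prediction of each noun is correct. It assumes the confusion matrix is retyped by hand from the output above so that the block runs without Rling or party; the names `tree.cm`, `recall` and `precision` are introduced only for this illustration.

```r
# Confusion matrix copied from the ctree output above:
# rows are predictions, columns are the observed nouns.
tree.cm <- matrix(c(227, 443, 61, 585), nrow = 2,
                  dimnames = list(predicted = c("geek", "nerd"),
                                  observed  = c("geek", "nerd")))

# Recall: of the observed cases of each noun, what share was predicted correctly?
recall <- diag(tree.cm) / colSums(tree.cm)

# Precision: of the predictions of each noun, what share was correct?
precision <- diag(tree.cm) / rowSums(tree.cm)

round(rbind(recall, precision) * 100, 2)
```

With these counts, about 79% of the “geek” predictions are correct, so the tree is rarely wrong when it does predict “geek”; it simply predicts it too rarely.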

Random forests

Create a random forest of the same model for geek versus nerd.

forest.output = cforest(Noun ~ Num + Century + Register + Eval, 
                        data = nerd,
                        controls = cforest_unbiased(ntree = 1000,
                                                    mtry = 2))

Variable importance

Calculate the variable importance from the random forest model.
Include a dot plot of the importance values.
Which variables were the most important?

forest.importance = varimp(forest.output,
                           conditional = TRUE)
round(forest.importance, 3)

##      Num  Century Register     Eval 
##   -0.002    0.023   -0.003    0.056

dotchart(sort(forest.importance),
         main = "Conditional Importance of Variables")

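The most important variables are:

Eval (0.056): A measure of the semanticity of the word
Century (0.023): Time measurement - 20th or 21st century

Num (-0.002) and Register (-0.003) have importance values that are essentially zero, so they contribute little to the classification.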
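
One common rule of thumb for reading conditional importance scores (an assumption here, not something the assignment prescribes) is to treat a variable as informative only if its score exceeds the absolute value of the most negative score. A minimal sketch in base R, with the rounded values retyped from the output above; `importance.rounded`, `noise.threshold` and `informative` are names introduced only for this illustration.

```r
# Importance scores copied from the rounded varimp() output above.
importance.rounded <- c(Num = -0.002, Century = 0.023,
                        Register = -0.003, Eval = 0.056)

# Rule of thumb (an assumption, see the text above): scores no larger
# than the size of the most negative score are treated as noise.
noise.threshold <- abs(min(importance.rounded))

# Keep only the variables whose importance clears the threshold.
informative <- names(importance.rounded)[importance.rounded > noise.threshold]
informative
```

With these values the rule keeps Century and Eval, which agrees with the reading of the dot plot.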

Forest model predictiveness

Include the percent correct for the random forest model.
Did it do better than the conditional inference tree?

forest.outcomes = table(predict(forest.output), nerd$Noun)
forest.outcomes

##       
##        geek nerd
##   geek  337  149
##   nerd  333  497

sum(diag(forest.outcomes)) / sum(forest.outcomes) * 100

## [1] 63.37386

# Share of observed "geek" cases that were predicted correctly
forest.outcomes[1, 1] / sum(forest.outcomes[, 1]) * 100

## [1] 50.29851

# Share of observed "nerd" cases that were predicted correctly
forest.outcomes[2, 2] / sum(forest.outcomes[, 2]) * 100

## [1] 76.93498

# Proportion of predictions that are "geek"
sum(forest.outcomes[1, ]) / sum(forest.outcomes)

## [1] 0.3693009

The random forest model is slightly more accurate than the tree model, with an overall accuracy of 63.37% against 61.70%. More importantly, it is much less biased towards predicting “nerd”. The tree model had a geek:nerd prediction split of 22:78, whereas the random forest has a split of 37:63, which is closer to the actual data split of 51:49. The share of actual “geek” cases predicted correctly improves from 33.88% to 50.30%, although the share of actual “nerd” cases predicted correctly drops from 90.56% to 76.93%.
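
To put both accuracies in context, they can be compared with a baseline that always guesses the more frequent noun. A minimal sketch in base R, assuming both confusion matrices are retyped by hand from the outputs above; `cm.names`, `tree.cm`, `forest.cm`, `summarise.cm`, `baseline` and `comparison` are hypothetical names used only here.

```r
# Confusion matrices copied from the outputs above
# (rows = predicted noun, columns = observed noun).
cm.names  <- list(predicted = c("geek", "nerd"), observed = c("geek", "nerd"))
tree.cm   <- matrix(c(227, 443, 61, 585), nrow = 2, dimnames = cm.names)
forest.cm <- matrix(c(337, 333, 149, 497), nrow = 2, dimnames = cm.names)

# Hypothetical helper: overall accuracy and share of "geek" predictions, in percent.
summarise.cm <- function(cm) {
  c(accuracy       = sum(diag(cm)) / sum(cm),
    predicted.geek = sum(cm["geek", ]) / sum(cm)) * 100
}

# Baseline: always guess the more frequent noun in the observed data.
baseline <- max(colSums(tree.cm)) / sum(tree.cm) * 100

comparison <- rbind(tree = summarise.cm(tree.cm),
                    forest = summarise.cm(forest.cm))
round(comparison, 2)
round(baseline, 2)
```

Under these counts the baseline is about 50.9%, so the tree and the forest improve on it by only about 11 and 12 percentage points respectively.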

Thought question

What other variables might be useful in understanding the category membership of geek versus nerd? Basically, what could we add to the model to improve it (there’s no one right answer here - it’s helpful to think about what other facets we have not measured)?

The word “geek” is usually used in the context of technology, and “nerd” is used for someone who has a single-minded approach towards academics. So it would be helpful to have a variable that captures the context (or rather, the topic) of the discussion.
It might also be helpful to look at the author: certain authors perceive “nerd” and “geek” as “cool” or “hip” while others don’t.