Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report so that others know what they need for the report to compile correctly.
You will distinguish between the categories nerd and geek to determine the influence of the respective variables on how each category is defined. Semantically, the words are similar, but in this exercise you will determine whether they are used differently and which features may separate them into different categories.
The data for this project has already been loaded. Drawing from multiple corpora, for each instance where the word nerd or geek was used, the context surrounding it was coded, including whether the word referenced one or multiple people (Num), when it was used (Century: 20th or 21st), the type of corpus (academic, magazines, newspapers, spoken language; Register), and the valence/sentiment of the word (negative, neutral, or positive; Eval).
Questions which require a written response are marked with **. If you are having trouble with the Rling library, the nerd data is available on Canvas, and you can load it directly.
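If Rling will not install, a minimal fallback sketch for loading the Canvas file directly (this assumes the download is a CSV named nerd.csv in your working directory; adjust the file name to match what you downloaded):
##r chunk
# hypothetical file name; substitute the actual Canvas download
nerd <- read.csv("nerd.csv", stringsAsFactors = TRUE) # ctree() needs the columns as factors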
##r chunk
library(Rling) # for the nerd dataset
library(party) # for ctree() and cforest()
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
data(nerd) # load the nerd/geek dataset from Rling
head(nerd) # inspect the first few rows
## Noun Num Century Register Eval
## 1 nerd pl XX ACAD Neutral
## 2 nerd pl XXI ACAD Neutral
## 3 nerd pl XX ACAD Neutral
## 4 nerd pl XX ACAD Neutral
## 5 nerd pl XX ACAD Neutral
## 6 nerd pl XXI ACAD Neutral
Dependent variable: Noun (whether "nerd" or "geek" was used)
Independent variables: Num, Century, Register, and Eval
Use ctree() to create a conditional inference model.
##r chunk
set.seed(10) # for reproducibility
tree <- ctree(Noun ~ Num + Century + Register + Eval, data = nerd)
##r chunk
plot(tree)
Answer: From the tree graph above, we can see that the valence of the word (Eval) is the strongest predictor of the outcome, followed by the century in which the word is used. For instance, when the valence is positive, "geek" is the more likely word in both centuries, but the gap is wider in the 21st century than in the 20th, where "nerd" still appears fairly often. When the valence is negative or neutral, "nerd" is used more often in the 20th century, while in the 21st century the two words appear at nearly identical rates.
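As a quick sanity check on this reading of the tree, a minimal sketch that cross-tabulates the raw proportions of each noun within every Century-by-Eval cell:
##r chunk
# proportion of geek vs. nerd within each Century x Eval combination
round(prop.table(table(nerd$Noun, nerd$Century, nerd$Eval), margin = c(2, 3)), 2)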
tree_outcomes <- table(predict(tree), nerd$Noun) # confusion matrix: predicted vs. actual
sum(diag(tree_outcomes)) / sum(tree_outcomes) * 100 # classification accuracy (%)
## [1] 61.70213
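For context, this accuracy is only meaningful relative to the majority-class baseline, i.e., the accuracy of always guessing the more frequent noun; a one-line check:
##r chunk
# accuracy (%) of always predicting the more common noun
max(table(nerd$Noun)) / nrow(nerd) * 100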
##r chunk
forest <- cforest(Noun ~ Num + Century + Register + Eval, data = nerd, controls = cforest_unbiased(ntree = 1000, mtry = 2)) # 1000 trees, 2 predictors sampled per split
Answer: From the variable-importance chart generated below, we can see that Eval is the most important variable, followed by Century, Num, and Register.
##r chunk
forest_importance <- varimp(forest, conditional = TRUE) # conditional importance adjusts for correlated predictors
round(forest_importance, 2)
## Num Century Register Eval
## 0.00 0.02 0.00 0.06
dotchart(sort(forest_importance), main = "Random Forest Model Variable Importance")
Answer: Because the random forest model predicts more accurately (63.4% versus 61.7%), we can say that it is better than the conditional inference tree model.
##r chunk
forest_outcomes <- table(predict(forest), nerd$Noun) # confusion matrix: predicted vs. actual
sum(diag(forest_outcomes)) / sum(forest_outcomes) * 100 # classification accuracy (%)
## [1] 63.44985
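Note that both accuracies above are computed on the training data, which flatters the models. For the forest, party can return out-of-bag predictions, which give a less optimistic estimate; a sketch:
##r chunk
# OOB = TRUE scores each observation only with trees that did not train on it
oob_outcomes <- table(predict(forest, OOB = TRUE), nerd$Noun)
sum(diag(oob_outcomes)) / sum(oob_outcomes) * 100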
library(reticulate) # switch to Python by running repl_python() in the console
##python chunk
import pandas as pd # import pandas to help manipulate the dataset
predictor_variables = pd.get_dummies(r.nerd[["Num", "Century", "Register", "Eval"]]) # one-hot encode the predictors
target_variable = pd.get_dummies(r.nerd["Noun"]) # one-hot encode the outcome
predictor_variables.head()
## Num_pl Num_sg Century_XX ... Eval_Neg Eval_Neutral Eval_Positive
## 0 1 0 1 ... 0 1 0
## 1 1 0 0 ... 0 1 0
## 2 1 0 1 ... 0 1 0
## 3 1 0 1 ... 0 1 0
## 4 1 0 1 ... 0 1 0
##
## [5 rows x 11 columns]
target_variable.head()
## geek nerd
## 0 0 1
## 1 0 1
## 2 0 1
## 3 0 1
## 4 0 1
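Because every factor here is exhaustively one-hot encoded, each group of dummies contains one redundant column (e.g., Num_sg is just 1 minus Num_pl). pandas can drop these automatically; a sketch with a hypothetical variable name:
##python chunk
# drop_first=True keeps k-1 dummies per k-level factor
predictors_compact = pd.get_dummies(r.nerd[["Num", "Century", "Register", "Eval"]], drop_first=True)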
Use sklearn to create a decision tree classifier for the nerd data.
##python chunk
import sklearn #Scikit-learn is a free machine learning library in Python
from sklearn import tree
classTree = tree.DecisionTreeClassifier() # default settings grow a full-depth tree
classTree = classTree.fit(predictor_variables, target_variable) #fit the variables to the model
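By default, DecisionTreeClassifier grows until its leaves are pure, which tends to overfit and produces the very deep printout below. A shallower sketch (the variable name and max_depth value are arbitrary choices, not from the assignment):
##python chunk
# limit depth for readability and to reduce overfitting
shallowTree = tree.DecisionTreeClassifier(max_depth = 3).fit(predictor_variables, target_variable)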
##python chunk
print(tree.export_text(classTree, feature_names = list(predictor_variables.columns)))
## |--- Eval_Positive <= 0.50
## | |--- Century_XX <= 0.50
## | | |--- Num_sg <= 0.50
## | | | |--- Register_MAG <= 0.50
## | | | | |--- Eval_Neutral <= 0.50
## | | | | | |--- Register_SPOK <= 0.50
## | | | | | | |--- Register_NEWS <= 0.50
## | | | | | | | |--- class: 0
## | | | | | | |--- Register_NEWS > 0.50
## | | | | | | | |--- class: 0
## | | | | | |--- Register_SPOK > 0.50
## | | | | | | |--- class: 0
## | | | | |--- Eval_Neutral > 0.50
## | | | | | |--- Register_ACAD <= 0.50
## | | | | | | |--- Register_SPOK <= 0.50
## | | | | | | | |--- class: 0
## | | | | | | |--- Register_SPOK > 0.50
## | | | | | | | |--- class: 0
## | | | | | |--- Register_ACAD > 0.50
## | | | | | | |--- class: 0
## | | | |--- Register_MAG > 0.50
## | | | | |--- Eval_Neg <= 0.50
## | | | | | |--- class: 0
## | | | | |--- Eval_Neg > 0.50
## | | | | | |--- class: 1
## | | |--- Num_sg > 0.50
## | | | |--- Register_ACAD <= 0.50
## | | | | |--- Register_MAG <= 0.50
## | | | | | |--- Eval_Neutral <= 0.50
## | | | | | | |--- Register_NEWS <= 0.50
## | | | | | | | |--- class: 0
## | | | | | | |--- Register_NEWS > 0.50
## | | | | | | | |--- class: 1
## | | | | | |--- Eval_Neutral > 0.50
## | | | | | | |--- Register_NEWS <= 0.50
## | | | | | | | |--- class: 1
## | | | | | | |--- Register_NEWS > 0.50
## | | | | | | | |--- class: 0
## | | | | |--- Register_MAG > 0.50
## | | | | | |--- Eval_Neutral <= 0.50
## | | | | | | |--- class: 0
## | | | | | |--- Eval_Neutral > 0.50
## | | | | | | |--- class: 1
## | | | |--- Register_ACAD > 0.50
## | | | | |--- Eval_Neutral <= 0.50
## | | | | | |--- class: 1
## | | | | |--- Eval_Neutral > 0.50
## | | | | | |--- class: 0
## | |--- Century_XX > 0.50
## | | |--- Num_pl <= 0.50
## | | | |--- Register_ACAD <= 0.50
## | | | | |--- Register_MAG <= 0.50
## | | | | | |--- Register_NEWS <= 0.50
## | | | | | | |--- Eval_Neutral <= 0.50
## | | | | | | | |--- class: 1
## | | | | | | |--- Eval_Neutral > 0.50
## | | | | | | | |--- class: 0
## | | | | | |--- Register_NEWS > 0.50
## | | | | | | |--- Eval_Neg <= 0.50
## | | | | | | | |--- class: 0
## | | | | | | |--- Eval_Neg > 0.50
## | | | | | | | |--- class: 0
## | | | | |--- Register_MAG > 0.50
## | | | | | |--- Eval_Neg <= 0.50
## | | | | | | |--- class: 0
## | | | | | |--- Eval_Neg > 0.50
## | | | | | | |--- class: 0
## | | | |--- Register_ACAD > 0.50
## | | | | |--- Eval_Neg <= 0.50
## | | | | | |--- class: 0
## | | | | |--- Eval_Neg > 0.50
## | | | | | |--- class: 0
## | | |--- Num_pl > 0.50
## | | | |--- Register_ACAD <= 0.50
## | | | | |--- Register_MAG <= 0.50
## | | | | | |--- Eval_Neutral <= 0.50
## | | | | | | |--- Register_NEWS <= 0.50
## | | | | | | | |--- class: 0
## | | | | | | |--- Register_NEWS > 0.50
## | | | | | | | |--- class: 0
## | | | | | |--- Eval_Neutral > 0.50
## | | | | | | |--- Register_SPOK <= 0.50
## | | | | | | | |--- class: 0
## | | | | | | |--- Register_SPOK > 0.50
## | | | | | | | |--- class: 0
## | | | | |--- Register_MAG > 0.50
## | | | | | |--- Eval_Neutral <= 0.50
## | | | | | | |--- class: 0
## | | | | | |--- Eval_Neutral > 0.50
## | | | | | | |--- class: 0
## | | | |--- Register_ACAD > 0.50
## | | | | |--- class: 0
## |--- Eval_Positive > 0.50
## | |--- Century_XXI <= 0.50
## | | |--- Register_MAG <= 0.50
## | | | |--- Register_ACAD <= 0.50
## | | | | |--- Num_sg <= 0.50
## | | | | | |--- Register_SPOK <= 0.50
## | | | | | | |--- class: 0
## | | | | | |--- Register_SPOK > 0.50
## | | | | | | |--- class: 0
## | | | | |--- Num_sg > 0.50
## | | | | | |--- Register_NEWS <= 0.50
## | | | | | | |--- class: 0
## | | | | | |--- Register_NEWS > 0.50
## | | | | | | |--- class: 0
## | | | |--- Register_ACAD > 0.50
## | | | | |--- class: 0
## | | |--- Register_MAG > 0.50
## | | | |--- Num_pl <= 0.50
## | | | | |--- class: 1
## | | | |--- Num_pl > 0.50
## | | | | |--- class: 1
## | |--- Century_XXI > 0.50
## | | |--- Register_MAG <= 0.50
## | | | |--- Register_ACAD <= 0.50
## | | | | |--- Register_SPOK <= 0.50
## | | | | | |--- Num_pl <= 0.50
## | | | | | | |--- class: 1
## | | | | | |--- Num_pl > 0.50
## | | | | | | |--- class: 1
## | | | | |--- Register_SPOK > 0.50
## | | | | | |--- Num_sg <= 0.50
## | | | | | | |--- class: 1
## | | | | | |--- Num_sg > 0.50
## | | | | | | |--- class: 1
## | | | |--- Register_ACAD > 0.50
## | | | | |--- class: 1
## | | |--- Register_MAG > 0.50
## | | | |--- Num_pl <= 0.50
## | | | | |--- class: 1
## | | | |--- Num_pl > 0.50
## | | | | |--- class: 1
##python chunk
import numpy as np
prediction = pd.DataFrame(classTree.predict(predictor_variables)) #convert the predicted values to a dataframe
prediction.columns = list(target_variable.columns) #label the columns to match the target dummies
prediction_category = prediction.idxmax(axis=1) #collapse the dummies back to a single category label
target_variable_category = target_variable.idxmax(axis=1)
python_outcomes = sklearn.metrics.confusion_matrix(target_variable_category, prediction_category, labels = ["nerd","geek"]) #true labels first, then predictions
print((np.trace(python_outcomes)/np.sum(python_outcomes))*100) #classification accuracy (%)
## 63.829787234042556
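Equivalently, sklearn's built-in metric gives the same number with less arithmetic; a sketch:
##python chunk
from sklearn.metrics import accuracy_score
# fraction of matching labels, scaled to a percentage
print(accuracy_score(target_variable_category, prediction_category) * 100)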
Answer: For me, the models are much simpler to create in R than in Python. First, R handles categorical variables automatically, while in Python we need to create dummy variables for them, which can become tedious with a large dataset. Also, it takes little effort to plot the tree in R, whereas in Python we need separate packages.
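That said, plotting the sklearn tree only takes one extra dependency; a minimal sketch, assuming matplotlib is installed:
##python chunk
import matplotlib.pyplot as plt
# draw the fitted tree with readable feature labels
plt.figure(figsize = (20, 10))
tree.plot_tree(classTree, feature_names = list(predictor_variables.columns), filled = True)
plt.show()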
Answer: From the results above, we can see that the random forest model is better than the conditional inference tree model, as it has a classification accuracy of 63.4% versus 61.7%. Also, because the random forest builds many trees, weighing the importance of each variable while controlling for the effects of the others, its classification results are more statistically reliable.
Answer: Variables such as the demographic and geographic characteristics of the authors of the written material, along with the word count of the material, could help us improve the model.