Trees and Forests Assignment

Load the libraries + functions

Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.

You will be distinguishing between the categories of nerd and geek to determine the influence of respective variables on their category definition. Semantically, the words are similar but in this exercise you will determine if they are used differently and what the features are that may separate them into different categories.

The data for this project has already been loaded. The instances come from multiple corpora: for each instance where the word nerd or geek was used, the surrounding context was coded, including whether the word referred to one or multiple people (Num), when it was used (Century, 20th or 21st), the type of corpus (Register: academic, magazines, newspapers, or spoken language), and the valence/sentiment of the word (Eval: negative, neutral, or positive).

Questions that require a written response are marked with **. If you are having trouble with the Rling library, the nerd data is available on Canvas, and you can load it directly.

##r chunk
library(Rling)
library(party)

## Loading required package: grid

## Loading required package: mvtnorm

## Loading required package: modeltools

## Loading required package: stats4

## Loading required package: strucchange

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Loading required package: sandwich

data(nerd)
head(nerd)

##   Noun Num Century Register    Eval
## 1 nerd  pl      XX     ACAD Neutral
## 2 nerd  pl     XXI     ACAD Neutral
## 3 nerd  pl      XX     ACAD Neutral
## 4 nerd  pl      XX     ACAD Neutral
## 5 nerd  pl      XX     ACAD Neutral
## 6 nerd  pl     XXI     ACAD Neutral

Description of the data

Dependent variable:

Noun: which category is represented, either nerd or geek.

Independent variables:

Num: a measure of social group, either pl (plural) or sg (singular)
Century: when the word was used, as XX (20th) or XXI (21st) century
Register: information about the context of the word; coded as ACAD (academic), MAG (magazine), NEWS (newspapers), and SPOK (spoken)
Eval: a measure of the valence of the word, coded as Neg (negative), Neutral, or Positive

Conditional inference model

Set a random number seed before fitting the model so that the results are reproducible.
Use ctree() to create a conditional inference model.

##r chunk
set.seed(52525)
tree.output = ctree(Noun ~ Eval + Num + Century + Register , data = nerd)

Make a plot

Plot the conditional inference model.

##r chunk
plot(tree.output)

Interpret the categories

** Write out an interpretation of the results from the model. You can interpret the branches of the tree to determine which features define each category.
With only two categories, you will see the proportion split as the output in the bar graph - look for the group with the larger proportion.

Conditional inference model predictiveness

Calculate the percent correct classification for the conditional inference model.

##r chunk
outcomes = table(predict(tree.output), nerd$Noun)
outcomes

##       
##        geek nerd
##   geek  227   61
##   nerd  443  585

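# percent of actual nerd instances classified correctly
sum(outcomes[4]) / sum(outcomes[,2]) * 100

## [1] 90.55728

# proportion of all instances that are actually geek
sum(outcomes[,1]) / (sum(outcomes[,1]) + sum(outcomes[,2]))

## [1] 0.5091185

# overall percent correct classification
sum(diag(outcomes)) / sum(outcomes) * 100

## [1] 61.70213

# proportion of all instances predicted to be geek
sum(outcomes[1,]) / (sum(outcomes[1,]) + sum(outcomes[2,]))

## [1] 0.218845

# percent of actual geek instances classified correctly
sum(outcomes[1]) / sum(outcomes[,1]) * 100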

## [1] 33.8806
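
The indexing in the calculations above relies on R storing tables in column-major order, so outcomes[1] is the top-left cell (227) and outcomes[4] is the bottom-right cell (585). As a cross-check, the same percentages can be recomputed from the four counts in the table. The sketch below is written in Python (the language used later in this report); the helper name classification_summary is hypothetical, and the counts are copied from the output above.

```python
# Counts copied from the ctree confusion table above.
# Rows are predictions, columns are the actual nouns.
tree_table = {
    ("geek", "geek"): 227,  # predicted geek, actually geek
    ("geek", "nerd"): 61,   # predicted geek, actually nerd
    ("nerd", "geek"): 443,  # predicted nerd, actually geek
    ("nerd", "nerd"): 585,  # predicted nerd, actually nerd
}


def classification_summary(table, labels=("geek", "nerd")):
    """Return overall accuracy and per-category percent correct."""
    total = sum(table.values())
    correct = sum(table[(label, label)] for label in labels)
    summary = {"accuracy": 100 * correct / total}
    for label in labels:
        # Column total: every instance whose actual noun is `label`.
        actual = sum(table[(pred, label)] for pred in labels)
        summary[label] = 100 * table[(label, label)] / actual
    return summary


tree_summary = classification_summary(tree_table)
for name, value in tree_summary.items():
    print(f"{name}: {value:.2f}%")
```

The three values agree with the R output above (61.70, 33.88, and 90.56). The tree classifies most nerd instances correctly but only about a third of the geek instances, which the overall accuracy alone hides.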

Random forests

Create a random forest of the same model for geek versus nerd.

##r chunk
forest.OP = cforest(Noun ~ Eval + Num + Century + Register , 
                        data = nerd,
                        controls = cforest_unbiased(ntree = 800,
                                                    mtry = 4))

Variable importance

Calculate the variable importance from the random forest model.
Include a dot plot of the importance values.
** Which variables were the most important? Eval is the most important (0.05), followed by Century (0.02). Num and Register have importance values of approximately 0.00.

##r chunk
forest.IMP = varimp(forest.OP,
                    conditional = TRUE)
round(forest.IMP, 2)

##     Eval      Num  Century Register 
##     0.05     0.00     0.02     0.00

dotchart(sort(forest.IMP),
         main = "Conditional Importance of Variables")
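
varimp() reports a permutation importance: roughly, how much prediction accuracy drops when the values of one predictor are shuffled. The sketch below illustrates only that general idea in plain Python. It is not party's conditional algorithm, and the toy rows, the rule-based predict function, and permutation_importance are all hypothetical.

```python
import random

# Hypothetical toy rows: (Eval, Num, Noun). Not the real nerd data.
rows = [
    ("Positive", "sg", "geek"),
    ("Positive", "pl", "geek"),
    ("Positive", "sg", "geek"),
    ("Neg", "pl", "nerd"),
    ("Neg", "sg", "nerd"),
    ("Neutral", "pl", "nerd"),
]


def predict(eval_, num):
    """A stand-in 'model' that only looks at Eval and ignores Num."""
    return "geek" if eval_ == "Positive" else "nerd"


def accuracy(data):
    """Proportion of rows whose predicted noun matches the actual noun."""
    hits = sum(predict(e, n) == noun for e, n, noun in data)
    return hits / len(data)


def permutation_importance(data, column, repeats=200, seed=52525):
    """Mean drop in accuracy after shuffling one predictor column."""
    rng = random.Random(seed)
    baseline = accuracy(data)
    drops = []
    for _ in range(repeats):
        values = [row[column] for row in data]
        rng.shuffle(values)
        # Rebuild each row with the shuffled value in place of the original.
        shuffled = [
            row[:column] + (value,) + row[column + 1:]
            for row, value in zip(data, values)
        ]
        drops.append(baseline - accuracy(shuffled))
    return sum(drops) / repeats


importance = {
    "Eval": permutation_importance(rows, column=0),
    "Num": permutation_importance(rows, column=1),
}
print(importance)
```

A predictor the stand-in model never uses, like Num here, gets an importance of zero, which is similar to the near-zero values for Num and Register in the output above.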

Forest model predictiveness

Include the percent correct for the random forest model.
** Did it do better than the conditional inference tree?

Yes, the random forest does slightly better, with an accuracy of 63.68% compared to 61.70% for the conditional inference tree.

##r chunk
forest.outcomes = table(predict(forest.OP), nerd$Noun)
forest.outcomes

##       
##        geek nerd
##   geek  383  191
##   nerd  287  455

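# percent of actual nerd instances classified correctly
sum(forest.outcomes[4]) / sum(forest.outcomes[,2]) * 100

## [1] 70.43344

# overall percent correct classification
sum(diag(forest.outcomes)) / sum(forest.outcomes) * 100

## [1] 63.67781

# proportion of all instances predicted to be geek
sum(forest.outcomes[1,]) / (sum(forest.outcomes[1,]) + sum(forest.outcomes[2,]))

## [1] 0.4361702

# percent of actual geek instances classified correctly
sum(forest.outcomes[1]) / sum(forest.outcomes[,1]) * 100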

## [1] 57.16418

Python model

In this section, import the data from R to Python.
Be sure to convert the categorical data into dummy coded data.

##python chunk


import pandas as pd

predictor_variables = pd.get_dummies(r.nerd[["Num", "Century", "Register", "Eval"]])

predictor_variables.head()

##    Num_pl  Num_sg  Century_XX  ...  Eval_Neg  Eval_Neutral  Eval_Positive
## 0       1       0           1  ...         0             1              0
## 1       1       0           0  ...         0             1              0
## 2       1       0           1  ...         0             1              0
## 3       1       0           1  ...         0             1              0
## 4       1       0           1  ...         0             1              0
## 
## [5 rows x 11 columns]

target_variable = pd.get_dummies(r.nerd["Noun"])

target_variable.head()

##    geek  nerd
## 0     0     1
## 1     0     1
## 2     0     1
## 3     0     1
## 4     0     1
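
Dummy coding turns each category level into its own 0/1 indicator column, which is why the four predictors above expand to 11 columns (2 for Num, 2 for Century, 4 for Register, and 3 for Eval). The sketch below shows the same step on a three-row data frame that does not depend on R; the toy rows are hypothetical and are not taken from the nerd data.

```python
import pandas as pd

# Hypothetical three-row stand-in for r.nerd (not the real data).
toy = pd.DataFrame({
    "Noun": ["nerd", "geek", "geek"],
    "Num": ["pl", "sg", "sg"],
    "Eval": ["Neutral", "Positive", "Neg"],
})

# One indicator column per category level, named <column>_<level>.
toy_predictors = pd.get_dummies(toy[["Num", "Eval"]])
print(list(toy_predictors.columns))

# A single label column is an alternative target that needs no dummy coding.
toy_labels = toy["Noun"]
print(toy_labels.tolist())
```

Depending on the pandas version, the indicator columns may be stored as integers or as booleans; scikit-learn accepts both.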

Create the Tree

Create a decision tree classification of the nerd data.

##python chunk

import sklearn

from sklearn import tree

classTree = tree.DecisionTreeClassifier()

classTree = classTree.fit(predictor_variables,target_variable)

classTree.predict(predictor_variables)

## array([[0, 1],
##        [0, 1],
##        [0, 1],
##        ...,
##        [1, 0],
##        [0, 0],
##        [0, 0]], dtype=uint8)
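
Some of the predicted rows above are [0, 0], which is neither geek nor nerd. A likely explanation (an assumption, not verified against the real data) is that, because target_variable has two dummy columns, scikit-learn treats the fit as a multi-output problem and predicts each column separately; when a leaf holds equal numbers of geek and nerd instances, both columns tie and both resolve to 0. The hypothetical toy data below are built to contain such a tie, and the second fit shows a single label column as an alternative target.

```python
from sklearn import tree

# Hypothetical toy data: one binary predictor, four instances.
X = [[0], [0], [1], [1]]
labels = ["geek", "nerd", "nerd", "nerd"]

# Two-column dummy target in (geek, nerd) order, as pd.get_dummies would give.
dummy_target = [[1, 0], [0, 1], [0, 1], [0, 1]]

# Fitting on two columns makes this a multi-output tree.
multi_tree = tree.DecisionTreeClassifier(random_state=0).fit(X, dummy_target)
tied_row = multi_tree.predict([[0]])[0].tolist()
print(tied_row)  # prediction for the leaf that holds one geek and one nerd

# Fitting on the label column always returns exactly one category per row.
label_tree = tree.DecisionTreeClassifier(random_state=0).fit(X, labels)
label_prediction = label_tree.predict([[0], [1]]).tolist()
print(label_prediction)
```

With a single label column, every prediction is one of the two nouns, so no later step is needed to resolve rows that match neither category.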

Printing out the Tree

Print out a text version of the classification tree.

##python chunk

print(tree.export_text(classTree,feature_names= list(predictor_variables.columns)))

## |--- Eval_Positive <= 0.50
## |   |--- Century_XXI <= 0.50
## |   |   |--- Num_pl <= 0.50
## |   |   |   |--- Register_ACAD <= 0.50
## |   |   |   |   |--- Register_MAG <= 0.50
## |   |   |   |   |   |--- Register_NEWS <= 0.50
## |   |   |   |   |   |   |--- Eval_Neg <= 0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |   |--- Eval_Neg >  0.50
## |   |   |   |   |   |   |   |--- class: 1
## |   |   |   |   |   |--- Register_NEWS >  0.50
## |   |   |   |   |   |   |--- Eval_Neg <= 0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |   |--- Eval_Neg >  0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- Register_MAG >  0.50
## |   |   |   |   |   |--- Eval_Neutral <= 0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Eval_Neutral >  0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |--- Register_ACAD >  0.50
## |   |   |   |   |--- Eval_Neutral <= 0.50
## |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- Eval_Neutral >  0.50
## |   |   |   |   |   |--- class: 0
## |   |   |--- Num_pl >  0.50
## |   |   |   |--- Register_ACAD <= 0.50
## |   |   |   |   |--- Register_MAG <= 0.50
## |   |   |   |   |   |--- Eval_Neutral <= 0.50
## |   |   |   |   |   |   |--- Register_SPOK <= 0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |   |--- Register_SPOK >  0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Eval_Neutral >  0.50
## |   |   |   |   |   |   |--- Register_NEWS <= 0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |   |--- Register_NEWS >  0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- Register_MAG >  0.50
## |   |   |   |   |   |--- Eval_Neg <= 0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Eval_Neg >  0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |--- Register_ACAD >  0.50
## |   |   |   |   |--- class: 0
## |   |--- Century_XXI >  0.50
## |   |   |--- Num_sg <= 0.50
## |   |   |   |--- Register_MAG <= 0.50
## |   |   |   |   |--- Eval_Neg <= 0.50
## |   |   |   |   |   |--- Register_ACAD <= 0.50
## |   |   |   |   |   |   |--- Register_SPOK <= 0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |   |--- Register_SPOK >  0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Register_ACAD >  0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- Eval_Neg >  0.50
## |   |   |   |   |   |--- Register_SPOK <= 0.50
## |   |   |   |   |   |   |--- Register_ACAD <= 0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |   |--- Register_ACAD >  0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Register_SPOK >  0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |--- Register_MAG >  0.50
## |   |   |   |   |--- Eval_Neutral <= 0.50
## |   |   |   |   |   |--- class: 1
## |   |   |   |   |--- Eval_Neutral >  0.50
## |   |   |   |   |   |--- class: 0
## |   |   |--- Num_sg >  0.50
## |   |   |   |--- Register_ACAD <= 0.50
## |   |   |   |   |--- Register_MAG <= 0.50
## |   |   |   |   |   |--- Eval_Neutral <= 0.50
## |   |   |   |   |   |   |--- Register_SPOK <= 0.50
## |   |   |   |   |   |   |   |--- class: 1
## |   |   |   |   |   |   |--- Register_SPOK >  0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Eval_Neutral >  0.50
## |   |   |   |   |   |   |--- Register_SPOK <= 0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |   |--- Register_SPOK >  0.50
## |   |   |   |   |   |   |   |--- class: 1
## |   |   |   |   |--- Register_MAG >  0.50
## |   |   |   |   |   |--- Eval_Neg <= 0.50
## |   |   |   |   |   |   |--- class: 1
## |   |   |   |   |   |--- Eval_Neg >  0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |--- Register_ACAD >  0.50
## |   |   |   |   |--- Eval_Neutral <= 0.50
## |   |   |   |   |   |--- class: 1
## |   |   |   |   |--- Eval_Neutral >  0.50
## |   |   |   |   |   |--- class: 0
## |--- Eval_Positive >  0.50
## |   |--- Century_XX <= 0.50
## |   |   |--- Register_MAG <= 0.50
## |   |   |   |--- Register_ACAD <= 0.50
## |   |   |   |   |--- Register_NEWS <= 0.50
## |   |   |   |   |   |--- Num_pl <= 0.50
## |   |   |   |   |   |   |--- class: 1
## |   |   |   |   |   |--- Num_pl >  0.50
## |   |   |   |   |   |   |--- class: 1
## |   |   |   |   |--- Register_NEWS >  0.50
## |   |   |   |   |   |--- Num_sg <= 0.50
## |   |   |   |   |   |   |--- class: 1
## |   |   |   |   |   |--- Num_sg >  0.50
## |   |   |   |   |   |   |--- class: 1
## |   |   |   |--- Register_ACAD >  0.50
## |   |   |   |   |--- class: 1
## |   |   |--- Register_MAG >  0.50
## |   |   |   |--- Num_sg <= 0.50
## |   |   |   |   |--- class: 1
## |   |   |   |--- Num_sg >  0.50
## |   |   |   |   |--- class: 1
## |   |--- Century_XX >  0.50
## |   |   |--- Register_MAG <= 0.50
## |   |   |   |--- Register_ACAD <= 0.50
## |   |   |   |   |--- Num_sg <= 0.50
## |   |   |   |   |   |--- Register_NEWS <= 0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Register_NEWS >  0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- Num_sg >  0.50
## |   |   |   |   |   |--- Register_NEWS <= 0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Register_NEWS >  0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |--- Register_ACAD >  0.50
## |   |   |   |   |--- class: 0
## |   |   |--- Register_MAG >  0.50
## |   |   |   |--- Num_sg <= 0.50
## |   |   |   |   |--- class: 1
## |   |   |   |--- Num_sg >  0.50
## |   |   |   |   |--- class: 1

Confusion Matrix

##python chunk
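import sklearn.metrics  # import the metrics submodule explicitly

prediction = pd.DataFrame(classTree.predict(predictor_variables))

prediction.columns = list(target_variable.columns)

# idxmax returns the column name with the largest value in each row
prediction_category =  prediction.idxmax(axis=1)

target_variable_category = target_variable.idxmax(axis=1)

# predictions are passed first, so rows are predicted and columns are actual categories
sklearn.metrics.confusion_matrix(prediction_category, target_variable_category, labels = ["nerd","geek"])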

## array([[439, 269],
##        [207, 401]])
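
No percent correct is reported for the Python tree, but it can be read off the confusion matrix in the same way as for the R models. The sketch below recomputes overall accuracy from the three tables shown in this report; the counts are copied from the outputs above, and the percent_correct helper is hypothetical.

```python
# Confusion-matrix counts copied from the outputs in this report.
# Each matrix has predictions in rows and actual nouns in columns.
matrices = {
    "R ctree": [[227, 61], [443, 585]],
    "R cforest": [[383, 191], [287, 455]],
    "Python decision tree": [[439, 269], [207, 401]],
}


def percent_correct(matrix):
    """Share of instances on the diagonal, as a percentage."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100 * correct / total


results = {name: percent_correct(m) for name, m in matrices.items()}
for name, value in results.items():
    print(f"{name}: {value:.2f}%")
```

All three figures are computed on the same data that were used to fit the models, so they are likely optimistic estimates of how well each model would classify new instances.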

Thought questions

** Are the models easier to create using R or Python (your own thoughts, they can be different than what I said in the lecture)?

The models are easier to create in R; Python requires importing additional packages.

** Which model gave you a better classification of the categories? The random forest model gave a better classification than the conditional inference tree, with an accuracy of 63.68% versus 61.70%.
** What other variables might be useful in understanding the category membership of geek versus nerd? Basically, what could we add to the model to improve it (there’s no one right answer here - it’s helpful to think about what other facets we have not measured)? Additional variables, such as demographic information about the speaker or writer, could help improve accuracy.