Trees and Forests Assignment

Load the libraries + functions

Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.

The data for this project has already been loaded. You will be distinguishing between the categories of nerd and geek to determine the influence of respective variables on their category definition.

If you are having trouble with the Rling library, the nerd data is available on Canvas, and you can load it directly.

##r chunk
library(Rling)
library(party)

## Loading required package: grid

## Loading required package: mvtnorm

## Loading required package: modeltools

## Loading required package: stats4

## Loading required package: strucchange

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Loading required package: sandwich

library(reticulate)
py_config()

## python:         /usr/bin/python3
## libpython:      /usr/lib/python3.6/config-3.6m-x86_64-linux-gnu/libpython3.6.so
## pythonhome:     //usr://usr
## version:        3.6.9 (default, Nov  7 2019, 10:44:02)  [GCC 8.3.0]
## numpy:          /home/rajan_patel/.local/lib/python3.6/site-packages/numpy
## numpy_version:  1.18.2
## 
## python versions found: 
##  /usr/bin/python3
##  /usr/bin/python

data(nerd)
head(nerd)

##   Noun Num Century Register    Eval
## 1 nerd  pl      XX     ACAD Neutral
## 2 nerd  pl     XXI     ACAD Neutral
## 3 nerd  pl      XX     ACAD Neutral
## 4 nerd  pl      XX     ACAD Neutral
## 5 nerd  pl      XX     ACAD Neutral
## 6 nerd  pl     XXI     ACAD Neutral

Description of the data

Dependent variable:

Noun: which category is represented, either nerd or geek.

Independent variables:

Num: a measure of social group, either pl (plural) or sg (singular)
Century: time measurement, as XX (20th) or XXI (21st) century
Register: information about where the data was coded from: ACAD (academic), MAG (magazine), NEWS (newspapers), or SPOK (spoken)
Eval: A measure of the semanticity of the word, Neg for negative, Neutral, and Positive

Conditional inference model

Set a seed for the random number generator before fitting the model.
Use ctree() to create a conditional inference model.

##r chunk
set.seed(549354)
tree.output = ctree(Noun ~ Num + Century + Register + Eval, data = nerd)

Make a plot

Plot the conditional inference model.

##r chunk
plot(tree.output)

Interpret the categories

Write out an interpretation of the results from the model. You can interpret the branches of the tree to determine what featurally defines each category. The tree includes all possible splits that were significant at p < .05. The tree splits based on the semanticity of the word.
- If the semanticity of the word is positive, in the 21st century, “geek” is used more than “nerd”.
- If the semanticity of the word is positive, in the 20th century, “geek” and “nerd” are used almost equally.
- If the semanticity of the word is negative or neutral, in the 21st century, “geek” and “nerd” are used almost equally.
- If the semanticity of the word is negative or neutral, in the 20th century, “nerd” is used more than “geek”.
With only two categories, you will see the proportion split as the output in the bar graph - look for the group with the larger proportion. Based on the graph, the largest proportion is in Node 3, where “geek” is used more than “nerd” in the 21st century if the semanticity of the word is positive.

Conditional inference model predictiveness

Calculate the percent correct classification for the conditional inference model.

##r chunk
outcomes = table(predict(tree.output), nerd$Noun)
outcomes

##       
##        geek nerd
##   geek  227   61
##   nerd  443  585

sum(diag(outcomes)) / sum(outcomes) * 100

## [1] 61.70213
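
The percent correct above is the diagonal of the table divided by the total number of cases. As a cross-check, the same arithmetic can be sketched in plain Python. The counts are copied by hand from the table above; the names `tree_table` and `percent_correct` are made up for this illustration and are not part of the assignment code.

```python
# Counts copied from the R table above.
# Keys are (predicted noun, observed noun).
tree_table = {
    ("geek", "geek"): 227, ("geek", "nerd"): 61,
    ("nerd", "geek"): 443, ("nerd", "nerd"): 585,
}


def percent_correct(table):
    """Share of cases where the prediction matches the observed noun, in percent."""
    correct = sum(n for (pred, obs), n in table.items() if pred == obs)
    total = sum(table.values())
    return correct / total * 100


tree_accuracy = percent_correct(tree_table)
print(round(tree_accuracy, 5))
```

This mirrors `sum(diag(outcomes)) / sum(outcomes) * 100`: 812 correct cases out of 1316.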

Random forests

Create a random forest of the same model for geek versus nerd.

##r chunk
forest.output = cforest(Noun ~ Num + Century + Register + Eval, 
                        data = nerd,
                        controls = cforest_unbiased(ntree = 1000,
                                                    mtry = 2))

Variable importance

Calculate the variable importance from the random forest model.
Include a dot plot of the importance values.
Which variables were the most important?
- Eval (0.06) and Century (0.02) are the most important variables; Num and Register have importance values near zero.

##r chunk
forest.importance = varimp(forest.output, conditional = TRUE)
round(forest.importance, 2)

##      Num  Century Register     Eval 
##     0.00     0.02     0.00     0.06

dotchart(sort(forest.importance), 
         main = "Conditional Importance of Variables")

Forest model predictiveness

Include the percent correct for the random forest model.
Did it do better than the conditional inference tree? Yes. It did a little better than the conditional inference tree (63.37% versus 61.70% correct).

##r chunk
forest.outcomes = table(predict(forest.output), nerd$Noun)
forest.outcomes

##       
##        geek nerd
##   geek  328  140
##   nerd  342  506

sum(diag(forest.outcomes)) / sum(forest.outcomes) * 100

## [1] 63.37386
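
Overall percent correct can hide how each noun is handled, so it may help to look at the two tables per category. The sketch below is only an illustration: the counts are copied by hand from the two R tables above (rows are predictions, columns are observed nouns), and the names `tables` and `per_class_percent` are hypothetical.

```python
# Counts copied from the two R tables above.
# Outer key: predicted noun; inner key: observed noun.
tables = {
    "ctree": {
        "geek": {"geek": 227, "nerd": 61},
        "nerd": {"geek": 443, "nerd": 585},
    },
    "cforest": {
        "geek": {"geek": 328, "nerd": 140},
        "nerd": {"geek": 342, "nerd": 506},
    },
}


def per_class_percent(table):
    """For each observed noun, the percentage of its cases predicted correctly."""
    result = {}
    for noun in table:
        # Column total: every case whose observed noun is `noun`.
        observed_total = sum(row[noun] for row in table.values())
        result[noun] = table[noun][noun] / observed_total * 100
    return result


per_class = {model: per_class_percent(t) for model, t in tables.items()}
for model, scores in per_class.items():
    print(model, {noun: round(pct, 1) for noun, pct in scores.items()})
```

Under these counts, the forest classifies observed “geek” cases correctly more often than the single tree (about 49% versus 34%), while the single tree is better on observed “nerd” cases (about 91% versus 78%).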

Python model

In this section, import the data from R to Python.
Be sure to convert the categorical data into dummy coded data.

##python chunk
import pandas as pd
#nerd_pydata = pd.read_csv('nerd.csv')
Xvars = pd.get_dummies(r.nerd[["Num", "Century", "Register", "Eval"]])
Xvars.head()

##    Num_pl  Num_sg  Century_XX  ...  Eval_Neg  Eval_Neutral  Eval_Positive
## 0       1       0           1  ...         0             1              0
## 1       1       0           0  ...         0             1              0
## 2       1       0           1  ...         0             1              0
## 3       1       0           1  ...         0             1              0
## 4       1       0           1  ...         0             1              0
## 
## [5 rows x 11 columns]

Yvar = pd.get_dummies(r.nerd["Noun"])
Yvar.head()

##    geek  nerd
## 0     0     1
## 1     0     1
## 2     0     1
## 3     0     1
## 4     0     1
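
To make the dummy coding concrete, here is a minimal sketch on a three-row toy data frame. The toy rows are invented for illustration and only borrow two column names from the nerd data. Passing `dtype=int` is an assumption made here so that the columns show 0/1 as in the output above; newer pandas versions otherwise return True/False.

```python
import pandas as pd

# Hypothetical three-row stand-in that reuses two column names from the nerd data.
toy = pd.DataFrame({
    "Num": ["pl", "sg", "pl"],
    "Century": ["XX", "XXI", "XX"],
})

# Each categorical column becomes one 0/1 column per level,
# named <column>_<level>.
toy_dummies = pd.get_dummies(toy, dtype=int)
print(toy_dummies)
```

Each original column expands into as many columns as it has levels, which is why the four predictors above turn into 11 dummy columns.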

Create the Tree

Create a decision tree classification of the nerd data.

##python chunk
import sklearn
from sklearn import tree

CIT = tree.DecisionTreeClassifier()
CIT = CIT.fit(Xvars,Yvar)
CIT.predict(Xvars)

## array([[0, 1],
##        [0, 1],
##        [0, 1],
##        ...,
##        [1, 0],
##        [0, 0],
##        [0, 0]], dtype=uint8)
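
Note that some predicted rows above are `[0, 0]`, which assigns a case to neither noun. This is likely because the two dummy columns in `Yvar` are treated as two separate outputs. One possible alternative, sketched here on invented toy data rather than the real nerd data, is to pass a single column of labels as the target. The names `toy`, `X_toy`, `y_toy`, and `single_tree` are hypothetical.

```python
import pandas as pd
from sklearn import tree

# Hypothetical toy data that reuses column names from the nerd data.
toy = pd.DataFrame({
    "Century": ["XX", "XX", "XXI", "XXI", "XX", "XXI"],
    "Eval": ["Neg", "Neutral", "Positive", "Positive", "Neg", "Neutral"],
    "Noun": ["nerd", "nerd", "geek", "geek", "nerd", "geek"],
})

# Predictors are still dummy coded; the target stays one column of labels.
X_toy = pd.get_dummies(toy[["Century", "Eval"]], dtype=int)
y_toy = toy["Noun"]

single_tree = tree.DecisionTreeClassifier(random_state=0)
single_tree.fit(X_toy, y_toy)

# Every prediction is now exactly one of the two labels.
toy_predictions = single_tree.predict(X_toy)
print(toy_predictions)
```

With a single label column, each prediction is either "geek" or "nerd", so no case is left without a category.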

Printing out the Tree

Print out a text version of the classification tree.

##python chunk
print(tree.export_text(CIT, feature_names=list(Xvars.columns)))

## |--- Eval_Positive <= 0.50
## |   |--- Century_XXI <= 0.50
## |   |   |--- Num_sg <= 0.50
## |   |   |   |--- Register_ACAD <= 0.50
## |   |   |   |   |--- Register_MAG <= 0.50
## |   |   |   |   |   |--- Eval_Neg <= 0.50
## |   |   |   |   |   |   |--- Register_NEWS <= 0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |   |--- Register_NEWS >  0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Eval_Neg >  0.50
## |   |   |   |   |   |   |--- Register_NEWS <= 0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |   |--- Register_NEWS >  0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- Register_MAG >  0.50
## |   |   |   |   |   |--- Eval_Neutral <= 0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Eval_Neutral >  0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |--- Register_ACAD >  0.50
## |   |   |   |   |--- class: 0
## |   |   |--- Num_sg >  0.50
## |   |   |   |--- Register_ACAD <= 0.50
## |   |   |   |   |--- Register_MAG <= 0.50
## |   |   |   |   |   |--- Register_NEWS <= 0.50
## |   |   |   |   |   |   |--- Eval_Neg <= 0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |   |--- Eval_Neg >  0.50
## |   |   |   |   |   |   |   |--- class: 1
## |   |   |   |   |   |--- Register_NEWS >  0.50
## |   |   |   |   |   |   |--- Eval_Neutral <= 0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |   |--- Eval_Neutral >  0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- Register_MAG >  0.50
## |   |   |   |   |   |--- Eval_Neutral <= 0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Eval_Neutral >  0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |--- Register_ACAD >  0.50
## |   |   |   |   |--- Eval_Neutral <= 0.50
## |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- Eval_Neutral >  0.50
## |   |   |   |   |   |--- class: 0
## |   |--- Century_XXI >  0.50
## |   |   |--- Num_pl <= 0.50
## |   |   |   |--- Register_ACAD <= 0.50
## |   |   |   |   |--- Register_MAG <= 0.50
## |   |   |   |   |   |--- Eval_Neg <= 0.50
## |   |   |   |   |   |   |--- Register_SPOK <= 0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |   |--- Register_SPOK >  0.50
## |   |   |   |   |   |   |   |--- class: 1
## |   |   |   |   |   |--- Eval_Neg >  0.50
## |   |   |   |   |   |   |--- Register_SPOK <= 0.50
## |   |   |   |   |   |   |   |--- class: 1
## |   |   |   |   |   |   |--- Register_SPOK >  0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- Register_MAG >  0.50
## |   |   |   |   |   |--- Eval_Neg <= 0.50
## |   |   |   |   |   |   |--- class: 1
## |   |   |   |   |   |--- Eval_Neg >  0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |--- Register_ACAD >  0.50
## |   |   |   |   |--- Eval_Neutral <= 0.50
## |   |   |   |   |   |--- class: 1
## |   |   |   |   |--- Eval_Neutral >  0.50
## |   |   |   |   |   |--- class: 0
## |   |   |--- Num_pl >  0.50
## |   |   |   |--- Register_MAG <= 0.50
## |   |   |   |   |--- Eval_Neg <= 0.50
## |   |   |   |   |   |--- Register_ACAD <= 0.50
## |   |   |   |   |   |   |--- Register_SPOK <= 0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |   |--- Register_SPOK >  0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Register_ACAD >  0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- Eval_Neg >  0.50
## |   |   |   |   |   |--- Register_SPOK <= 0.50
## |   |   |   |   |   |   |--- Register_NEWS <= 0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |   |--- Register_NEWS >  0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Register_SPOK >  0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |--- Register_MAG >  0.50
## |   |   |   |   |--- Eval_Neg <= 0.50
## |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- Eval_Neg >  0.50
## |   |   |   |   |   |--- class: 1
## |--- Eval_Positive >  0.50
## |   |--- Century_XX <= 0.50
## |   |   |--- Register_MAG <= 0.50
## |   |   |   |--- Register_ACAD <= 0.50
## |   |   |   |   |--- Register_SPOK <= 0.50
## |   |   |   |   |   |--- Num_pl <= 0.50
## |   |   |   |   |   |   |--- class: 1
## |   |   |   |   |   |--- Num_pl >  0.50
## |   |   |   |   |   |   |--- class: 1
## |   |   |   |   |--- Register_SPOK >  0.50
## |   |   |   |   |   |--- Num_pl <= 0.50
## |   |   |   |   |   |   |--- class: 1
## |   |   |   |   |   |--- Num_pl >  0.50
## |   |   |   |   |   |   |--- class: 1
## |   |   |   |--- Register_ACAD >  0.50
## |   |   |   |   |--- class: 1
## |   |   |--- Register_MAG >  0.50
## |   |   |   |--- Num_sg <= 0.50
## |   |   |   |   |--- class: 1
## |   |   |   |--- Num_sg >  0.50
## |   |   |   |   |--- class: 1
## |   |--- Century_XX >  0.50
## |   |   |--- Register_MAG <= 0.50
## |   |   |   |--- Register_ACAD <= 0.50
## |   |   |   |   |--- Num_sg <= 0.50
## |   |   |   |   |   |--- Register_NEWS <= 0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Register_NEWS >  0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- Num_sg >  0.50
## |   |   |   |   |   |--- Register_SPOK <= 0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Register_SPOK >  0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |--- Register_ACAD >  0.50
## |   |   |   |   |--- class: 0
## |   |   |--- Register_MAG >  0.50
## |   |   |   |--- Num_pl <= 0.50
## |   |   |   |   |--- class: 1
## |   |   |   |--- Num_pl >  0.50
## |   |   |   |   |--- class: 1

Confusion Matrix

##python chunk
Y_predict = pd.DataFrame(CIT.predict(Xvars))
Y_predict.columns = list(Yvar.columns)
Y_predict_category =  Y_predict.idxmax(axis=1)
Yvar_category = Yvar.idxmax(axis=1)

sklearn.metrics.confusion_matrix(Y_predict_category, Yvar_category, labels = ["nerd","geek"])

## array([[439, 269],
##        [207, 401]])
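
The diagonal of the matrix above gives (439 + 401) / 1316, or about 63.8 percent correct. One detail worth knowing when reading it: scikit-learn's `confusion_matrix` takes the observed labels first and the predictions second, and rows of the result follow the first argument. The chunk above passes the predictions first, so its rows are predictions, as in the R tables. The sketch below shows the documented argument order on invented labels; `observed` and `predicted` are hypothetical names.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for six cases, for illustration only.
observed = ["nerd", "nerd", "nerd", "geek", "geek", "geek"]
predicted = ["nerd", "nerd", "geek", "geek", "geek", "geek"]

# Observed labels first, predictions second.
# Rows are observed classes; columns are predicted classes.
toy_matrix = confusion_matrix(observed, predicted, labels=["nerd", "geek"])
toy_accuracy = accuracy_score(observed, predicted) * 100

print(toy_matrix)
print(round(toy_accuracy, 2))
```

Swapping the two arguments transposes the matrix but leaves the diagonal, and therefore the percent correct, unchanged.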

Thought questions

Are the models easier to create using R or Python (your own thoughts, they can be different than what I said in the lecture)? It is easier to create the model in R. The Python tree is not easy to read, and the categorical data first has to be converted into dummy-coded variables, which also makes the output harder to read.
Which model gave you a better classification of the categories? The random forest model gave a better classification of the categories, with 63.37% correct compared to 61.70% for the conditional inference model.
What other variables might be useful in understanding the category membership of geek versus nerd? Basically, what could we add to the model to improve it (there’s no one right answer here - it’s helpful to think about what other facets we have not measured)?

If we knew the context in which the noun was used, we might be able to predict the category more accurately. For example, “geek” is used for a person who has technical knowledge, while “nerd” is used more in academic contexts.