Load the libraries + functions

Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others can know what they need for the report to compile correctly.

You will be distinguishing between the categories of nerd and geek to determine the influence of respective variables on their category definition. Semantically, the words are similar but in this exercise you will determine if they are used differently and what the features are that may separate them into different categories.

The data for this project has already been loaded. Drawing from multiple corpora, for each instance where the word nerd or geek was used, the context surrounding it was coded, including whether the word was referencing one or multiple people (Num), when it was used (Century, 20th or 21st), the type of corpus (academic, magazines, newspapers, spoken language), and the evaluation of the valence/sentiment of the word (negative, neutral, or positive).

Questions which require a written response are marked with **. If you are having trouble with the Rling library, the nerd data is available on Canvas, and you can load it directly.

##r chunk
library(Rling)
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
data(nerd)
head(nerd)
##   Noun Num Century Register    Eval
## 1 nerd  pl      XX     ACAD Neutral
## 2 nerd  pl     XXI     ACAD Neutral
## 3 nerd  pl      XX     ACAD Neutral
## 4 nerd  pl      XX     ACAD Neutral
## 5 nerd  pl      XX     ACAD Neutral
## 6 nerd  pl     XXI     ACAD Neutral

Description of the data

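Dependent variable: Noun (nerd or geek)

Independent variables: Num (sg or pl), Century (XX or XXI), Register (ACAD, MAG, NEWS, SPOK), and Eval (Neg, Neutral, Positive)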

Conditional inference model

##r chunk
set.seed(52525)
tree.output = ctree(Noun ~ Eval + Num + Century + Register , data = nerd)

Make a plot

##r chunk
plot(tree.output)

Interpret the categories

Conditional inference model predictiveness

##r chunk
outcomes = table(predict(tree.output), nerd$Noun)
outcomes
##       
##        geek nerd
##   geek  227   61
##   nerd  443  585
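# Percent of actual nerd tokens that the tree predicts as nerd
sum(outcomes[4]) / sum(outcomes[,2]) * 100
## [1] 90.55728
# Proportion of all tokens that are actually geek
sum(outcomes[,1]) / (sum(outcomes[,1]) + sum(outcomes[,2]))
## [1] 0.5091185
# Overall accuracy (percent of tokens predicted correctly)
sum(diag(outcomes)) / sum(outcomes) * 100
## [1] 61.70213
# Proportion of all tokens that the tree predicts as geek
sum(outcomes[1,]) / (sum(outcomes[1,]) + sum(outcomes[2,]))
## [1] 0.218845
# Percent of actual geek tokens that the tree predicts as geek
sum(outcomes[1]) / sum(outcomes[,1]) * 100
## [1] 33.8806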

Random forests

##r chunk
forest.OP = cforest(Noun ~ Eval + Num + Century + Register , 
                        data = nerd,
                        controls = cforest_unbiased(ntree = 800,
                                                    mtry = 4))

Variable importance

##r chunk
forest.IMP = varimp(forest.OP,
                    conditional = TRUE)
round(forest.IMP, 2)
##     Eval      Num  Century Register 
##     0.05     0.00     0.02     0.00
dotchart(sort(forest.IMP),
         main = " CI of Variables")

Forest model predictiveness

Yes, the random forest is better, with a higher accuracy of 63.68% compared with 61.70% for the single conditional inference tree.

##r chunk
forest.outcomes = table(predict(forest.OP), nerd$Noun)
forest.outcomes
##       
##        geek nerd
##   geek  383  191
##   nerd  287  455
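# Percent of actual nerd tokens that the forest predicts as nerd
sum(forest.outcomes[4]) / sum(forest.outcomes[,2]) * 100
## [1] 70.43344
# Overall accuracy (percent of tokens predicted correctly)
sum(diag(forest.outcomes)) / sum(forest.outcomes) * 100
## [1] 63.67781
# Proportion of all tokens that the forest predicts as geek
sum(forest.outcomes[1,]) / (sum(forest.outcomes[1,]) + sum(forest.outcomes[2,]))
## [1] 0.4361702
# Percent of actual geek tokens that the forest predicts as geek
sum(forest.outcomes[1]) / sum(forest.outcomes[,1]) * 100
## [1] 57.16418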
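
To compare the two R models directly, their accuracies can be recomputed from the two confusion tables and set against a majority-class baseline. The Python sketch below hard-codes the counts printed in the R output. The names `accuracy`, `majority_baseline`, `tree_table`, and `forest_table` are illustrative and not part of the original analysis.

```python
# Confusion tables copied from the R output above.
# Rows are predicted labels, columns are actual labels, both in the order (geek, nerd).
tree_table = [[227, 61], [443, 585]]
forest_table = [[383, 191], [287, 455]]


def accuracy(table):
    """Percent of tokens on the diagonal (predicted label equals actual label)."""
    correct = sum(table[i][i] for i in range(len(table)))
    total = sum(sum(row) for row in table)
    return 100 * correct / total


def majority_baseline(table):
    """Accuracy of always guessing the more frequent actual label."""
    column_totals = [sum(col) for col in zip(*table)]
    return 100 * max(column_totals) / sum(column_totals)


tree_acc = accuracy(tree_table)
forest_acc = accuracy(forest_table)
baseline = majority_baseline(tree_table)

print(f"baseline: {baseline:.2f}%")
print(f"tree:     {tree_acc:.2f}%")
print(f"forest:   {forest_acc:.2f}%")
```

On these counts, both models beat the baseline of always guessing geek, and the forest improves on the single tree by about two percentage points.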

Python model

##python chunk


import pandas as pd

predictor_variables = pd.get_dummies(r.nerd[["Num", "Century", "Register", "Eval"]])

predictor_variables.head()
##    Num_pl  Num_sg  Century_XX  ...  Eval_Neg  Eval_Neutral  Eval_Positive
## 0       1       0           1  ...         0             1              0
## 1       1       0           0  ...         0             1              0
## 2       1       0           1  ...         0             1              0
## 3       1       0           1  ...         0             1              0
## 4       1       0           1  ...         0             1              0
## 
## [5 rows x 11 columns]
target_variable = pd.get_dummies(r.nerd["Noun"])

target_variable.head()
##    geek  nerd
## 0     0     1
## 1     0     1
## 2     0     1
## 3     0     1
## 4     0     1

Create the Tree

##python chunk

import sklearn

from sklearn import tree

classTree = tree.DecisionTreeClassifier()

classTree = classTree.fit(predictor_variables,target_variable)

classTree.predict(predictor_variables)
## array([[0, 1],
##        [0, 1],
##        [0, 1],
##        ...,
##        [1, 0],
##        [0, 0],
##        [0, 0]], dtype=uint8)
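
The `[0, 0]` rows in the output above predict neither noun. One likely cause, stated here as an assumption and not verified on this data, is that the two-column dummy-coded target makes the classifier treat each column as a separate yes/no output, so a leaf holding equally many geek and nerd tokens can vote 0 in both columns. The pure-Python sketch below imitates that per-column vote without calling scikit-learn. The helper names and leaf counts are hypothetical.

```python
def vote_per_column(n_geek, n_nerd):
    """Imitate a separate majority vote for each dummy column of one leaf.

    Each column is its own 0/1 problem; an exact tie goes to 0 here,
    which is an assumption about how ties are broken.
    """
    geek_col = 1 if n_geek > n_nerd else 0
    nerd_col = 1 if n_nerd > n_geek else 0
    return [geek_col, nerd_col]


def first_max_label(row, labels=("geek", "nerd")):
    """Return the label of the first largest value, like idxmax(axis=1) on one row."""
    best = max(range(len(row)), key=lambda i: row[i])
    return labels[best]


# Hypothetical leaf counts (geek tokens, nerd tokens), not taken from the data.
for counts in [(5, 2), (2, 5), (3, 3)]:
    row = vote_per_column(*counts)
    print(counts, row, first_max_label(row))
```

If this is what happens here, any all-zero row is resolved to the first column (geek) when the predictions are later converted back to labels with `idxmax`. Fitting on a single label column instead of the dummy-coded target should avoid all-zero predictions.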

Printing out the Tree

##python chunk

print(tree.export_text(classTree,feature_names= list(predictor_variables.columns)))
## |--- Eval_Positive <= 0.50
## |   |--- Century_XXI <= 0.50
## |   |   |--- Num_pl <= 0.50
## |   |   |   |--- Register_ACAD <= 0.50
## |   |   |   |   |--- Register_MAG <= 0.50
## |   |   |   |   |   |--- Register_NEWS <= 0.50
## |   |   |   |   |   |   |--- Eval_Neg <= 0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |   |--- Eval_Neg >  0.50
## |   |   |   |   |   |   |   |--- class: 1
## |   |   |   |   |   |--- Register_NEWS >  0.50
## |   |   |   |   |   |   |--- Eval_Neg <= 0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |   |--- Eval_Neg >  0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- Register_MAG >  0.50
## |   |   |   |   |   |--- Eval_Neutral <= 0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Eval_Neutral >  0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |--- Register_ACAD >  0.50
## |   |   |   |   |--- Eval_Neutral <= 0.50
## |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- Eval_Neutral >  0.50
## |   |   |   |   |   |--- class: 0
## |   |   |--- Num_pl >  0.50
## |   |   |   |--- Register_ACAD <= 0.50
## |   |   |   |   |--- Register_MAG <= 0.50
## |   |   |   |   |   |--- Eval_Neutral <= 0.50
## |   |   |   |   |   |   |--- Register_SPOK <= 0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |   |--- Register_SPOK >  0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Eval_Neutral >  0.50
## |   |   |   |   |   |   |--- Register_NEWS <= 0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |   |--- Register_NEWS >  0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- Register_MAG >  0.50
## |   |   |   |   |   |--- Eval_Neg <= 0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Eval_Neg >  0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |--- Register_ACAD >  0.50
## |   |   |   |   |--- class: 0
## |   |--- Century_XXI >  0.50
## |   |   |--- Num_sg <= 0.50
## |   |   |   |--- Register_MAG <= 0.50
## |   |   |   |   |--- Eval_Neg <= 0.50
## |   |   |   |   |   |--- Register_ACAD <= 0.50
## |   |   |   |   |   |   |--- Register_SPOK <= 0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |   |--- Register_SPOK >  0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Register_ACAD >  0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- Eval_Neg >  0.50
## |   |   |   |   |   |--- Register_SPOK <= 0.50
## |   |   |   |   |   |   |--- Register_ACAD <= 0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |   |--- Register_ACAD >  0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Register_SPOK >  0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |--- Register_MAG >  0.50
## |   |   |   |   |--- Eval_Neutral <= 0.50
## |   |   |   |   |   |--- class: 1
## |   |   |   |   |--- Eval_Neutral >  0.50
## |   |   |   |   |   |--- class: 0
## |   |   |--- Num_sg >  0.50
## |   |   |   |--- Register_ACAD <= 0.50
## |   |   |   |   |--- Register_MAG <= 0.50
## |   |   |   |   |   |--- Eval_Neutral <= 0.50
## |   |   |   |   |   |   |--- Register_SPOK <= 0.50
## |   |   |   |   |   |   |   |--- class: 1
## |   |   |   |   |   |   |--- Register_SPOK >  0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Eval_Neutral >  0.50
## |   |   |   |   |   |   |--- Register_SPOK <= 0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |   |--- Register_SPOK >  0.50
## |   |   |   |   |   |   |   |--- class: 1
## |   |   |   |   |--- Register_MAG >  0.50
## |   |   |   |   |   |--- Eval_Neg <= 0.50
## |   |   |   |   |   |   |--- class: 1
## |   |   |   |   |   |--- Eval_Neg >  0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |--- Register_ACAD >  0.50
## |   |   |   |   |--- Eval_Neutral <= 0.50
## |   |   |   |   |   |--- class: 1
## |   |   |   |   |--- Eval_Neutral >  0.50
## |   |   |   |   |   |--- class: 0
## |--- Eval_Positive >  0.50
## |   |--- Century_XX <= 0.50
## |   |   |--- Register_MAG <= 0.50
## |   |   |   |--- Register_ACAD <= 0.50
## |   |   |   |   |--- Register_NEWS <= 0.50
## |   |   |   |   |   |--- Num_pl <= 0.50
## |   |   |   |   |   |   |--- class: 1
## |   |   |   |   |   |--- Num_pl >  0.50
## |   |   |   |   |   |   |--- class: 1
## |   |   |   |   |--- Register_NEWS >  0.50
## |   |   |   |   |   |--- Num_sg <= 0.50
## |   |   |   |   |   |   |--- class: 1
## |   |   |   |   |   |--- Num_sg >  0.50
## |   |   |   |   |   |   |--- class: 1
## |   |   |   |--- Register_ACAD >  0.50
## |   |   |   |   |--- class: 1
## |   |   |--- Register_MAG >  0.50
## |   |   |   |--- Num_sg <= 0.50
## |   |   |   |   |--- class: 1
## |   |   |   |--- Num_sg >  0.50
## |   |   |   |   |--- class: 1
## |   |--- Century_XX >  0.50
## |   |   |--- Register_MAG <= 0.50
## |   |   |   |--- Register_ACAD <= 0.50
## |   |   |   |   |--- Num_sg <= 0.50
## |   |   |   |   |   |--- Register_NEWS <= 0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Register_NEWS >  0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- Num_sg >  0.50
## |   |   |   |   |   |--- Register_NEWS <= 0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Register_NEWS >  0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |--- Register_ACAD >  0.50
## |   |   |   |   |--- class: 0
## |   |   |--- Register_MAG >  0.50
## |   |   |   |--- Num_sg <= 0.50
## |   |   |   |   |--- class: 1
## |   |   |   |--- Num_sg >  0.50
## |   |   |   |   |--- class: 1

Confusion Matrix

##python chunk
prediction = pd.DataFrame(classTree.predict(predictor_variables))

prediction.columns = list(target_variable.columns)

prediction_category =  prediction.idxmax(axis=1)

target_variable_category = target_variable.idxmax(axis=1)

sklearn.metrics.confusion_matrix(prediction_category, target_variable_category, labels = ["nerd","geek"])
## array([[439, 269],
##        [207, 401]])
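
The matrix above can be summarized the same way as the R tables. Because the call passes the predictions as the first argument, rows here correspond to predicted labels and columns to actual labels, in the order nerd, geek. A small sketch with the counts hard-coded (the variable names are illustrative):

```python
# Counts copied from the confusion matrix above.
# Rows: predicted (nerd, geek); columns: actual (nerd, geek).
matrix = [[439, 269], [207, 401]]
labels = ["nerd", "geek"]

total = sum(sum(row) for row in matrix)
accuracy = 100 * (matrix[0][0] + matrix[1][1]) / total

# Recall per noun: correct predictions divided by the actual count (column total).
recall = {}
for j, label in enumerate(labels):
    actual_count = matrix[0][j] + matrix[1][j]
    recall[label] = 100 * matrix[j][j] / actual_count

print(f"accuracy: {accuracy:.2f}%")
for label in labels:
    print(f"{label} recall: {recall[label]:.2f}%")
```

Read this way, the Python tree classifies about 63.83% of the tokens correctly on the data it was fit to.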

Thought questions

The tree was easier to create in R. Python requires importing extra packages and dummy-coding the categorical variables before the model can be fit.