Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.
You will distinguish between the categories of nerd and geek to determine how each variable influences category membership. Semantically, the words are similar, but in this exercise you will determine whether they are used differently and which features separate them into different categories.
The data for this project has already been loaded. Drawing from multiple corpora, for each instance where the word nerd or geek was used, the surrounding context was coded, including whether the word referred to one or multiple people (Num), when it was used (Century, 20th or 21st), the type of corpus (Register: academic, magazines, newspapers, spoken language), and the valence/sentiment of the word (Eval: negative, neutral, or positive).
Questions that require a written response are marked with **. If you are having trouble with the Rling library, the nerd data is available on Canvas, and you can load it directly.
##r chunk
library(Rling)
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
data(nerd)
head(nerd)
## Noun Num Century Register Eval
## 1 nerd pl XX ACAD Neutral
## 2 nerd pl XXI ACAD Neutral
## 3 nerd pl XX ACAD Neutral
## 4 nerd pl XX ACAD Neutral
## 5 nerd pl XX ACAD Neutral
## 6 nerd pl XXI ACAD Neutral
Dependent variable: Noun (nerd or geek)
Independent variables: Eval, Num, Century, Register
Use ctree() to create a conditional inference model.
##r chunk
set.seed(52525)
tree.output = ctree(Noun ~ Eval + Num + Century + Register , data = nerd)
##r chunk
plot(tree.output)
##r chunk
outcomes = table(predict(tree.output), nerd$Noun)
outcomes
##
## geek nerd
## geek 227 61
## nerd 443 585
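# Percent of actual nerd cases classified as nerd (rows = predicted, columns = actual)
outcomes["nerd", "nerd"] / sum(outcomes[, "nerd"]) * 100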
## [1] 90.55728
# Proportion of all cases that are actually geek (majority-class baseline)
sum(outcomes[, "geek"]) / sum(outcomes)
## [1] 0.5091185
# Overall accuracy in percent
sum(diag(outcomes)) / sum(outcomes) * 100
## [1] 61.70213
# Proportion of all cases that the tree predicts as geek
sum(outcomes["geek", ]) / sum(outcomes)
## [1] 0.218845
# Percent of actual geek cases classified as geek
outcomes["geek", "geek"] / sum(outcomes[, "geek"]) * 100
## [1] 33.8806
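The hand-computed figures above can be cross-checked outside R. The following is a minimal Python sketch, assuming the ctree table is re-entered by hand with rows as predictions and columns as actual nouns; the helper name `summarize` is hypothetical and not part of the assignment.

```python
# Confusion table copied from the ctree output above.
# Outer key = predicted noun, inner key = actual noun.
ctree_table = {
    "geek": {"geek": 227, "nerd": 61},
    "nerd": {"geek": 443, "nerd": 585},
}

def summarize(table):
    """Return overall accuracy and per-class recall, in percent (hypothetical helper)."""
    labels = list(table)
    total = sum(table[p][a] for p in labels for a in labels)
    correct = sum(table[lab][lab] for lab in labels)
    result = {"accuracy": 100 * correct / total}
    for lab in labels:
        # Column total = number of cases whose actual noun is `lab`.
        actual_total = sum(table[p][lab] for p in labels)
        result[f"recall_{lab}"] = 100 * table[lab][lab] / actual_total
    return result

ctree_summary = summarize(ctree_table)
for name, value in ctree_summary.items():
    print(f"{name}: {value:.2f}")
```

The tree recovers about 91% of the nerd cases but only about 34% of the geek cases, so the overall accuracy hides a strong tendency to predict nerd.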
##r chunk
forest.OP = cforest(Noun ~ Eval + Num + Century + Register ,
data = nerd,
controls = cforest_unbiased(ntree = 800,
mtry = 4))
##r chunk
forest.IMP = varimp(forest.OP,
conditional = TRUE)
round(forest.IMP, 2)
## Eval Num Century Register
## 0.05 0.00 0.02 0.00
dotchart(sort(forest.IMP),
main = "Conditional Importance of Variables")
Yes, the random forest is better, with a higher accuracy of 63.68% compared with 61.70% for the conditional inference tree.
##r chunk
forest.outcomes = table(predict(forest.OP), nerd$Noun)
forest.outcomes
##
## geek nerd
## geek 383 191
## nerd 287 455
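# Percent of actual nerd cases classified as nerd (rows = predicted, columns = actual)
forest.outcomes["nerd", "nerd"] / sum(forest.outcomes[, "nerd"]) * 100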
## [1] 70.43344
# Overall accuracy in percent
sum(diag(forest.outcomes)) / sum(forest.outcomes) * 100
## [1] 63.67781
# Proportion of all cases that the forest predicts as geek
sum(forest.outcomes["geek", ]) / sum(forest.outcomes)
## [1] 0.4361702
# Percent of actual geek cases classified as geek
forest.outcomes["geek", "geek"] / sum(forest.outcomes[, "geek"]) * 100
## [1] 57.16418
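To judge whether either accuracy is meaningful, it helps to compare it with always guessing the more frequent noun. This is a rough Python sketch, assuming both tables are typed in from the R outputs (rows = predicted, columns = actual); the function and variable names are illustrative only.

```python
# Tables copied from the R output: table[predicted][actual], order geek, nerd.
tables = {
    "ctree": [[227, 61], [443, 585]],
    "cforest": [[383, 191], [287, 455]],
}

def accuracy(table):
    """Percent of cases on the diagonal (predicted == actual)."""
    total = sum(sum(row) for row in table)
    return 100 * (table[0][0] + table[1][1]) / total

def majority_baseline(table):
    """Percent correct if every case were assigned the more frequent actual noun."""
    column_totals = [sum(col) for col in zip(*table)]
    return 100 * max(column_totals) / sum(column_totals)

scores = {name: accuracy(t) for name, t in tables.items()}
baseline = majority_baseline(tables["ctree"])  # same actual counts in both tables
for name, acc in scores.items():
    print(f"{name}: {acc:.2f}% (baseline {baseline:.2f}%)")
```

Both models beat the roughly 51% baseline, and the forest's gain comes mainly from recovering more geek cases (about 57% versus 34%) at the cost of fewer nerd cases.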
##python chunk
import pandas as pd
predictor_variables = pd.get_dummies(r.nerd[["Num", "Century", "Register", "Eval"]])
predictor_variables.head()
## Num_pl Num_sg Century_XX ... Eval_Neg Eval_Neutral Eval_Positive
## 0 1 0 1 ... 0 1 0
## 1 1 0 0 ... 0 1 0
## 2 1 0 1 ... 0 1 0
## 3 1 0 1 ... 0 1 0
## 4 1 0 1 ... 0 1 0
##
## [5 rows x 11 columns]
target_variable = pd.get_dummies(r.nerd["Noun"])
target_variable.head()
## geek nerd
## 0 0 1
## 1 0 1
## 2 0 1
## 3 0 1
## 4 0 1
nerd data.
##python chunk
import sklearn.metrics  # binds sklearn and explicitly loads the metrics submodule used later
from sklearn import tree
classTree = tree.DecisionTreeClassifier()
classTree = classTree.fit(predictor_variables,target_variable)
classTree.predict(predictor_variables)
## array([[0, 1],
## [0, 1],
## [0, 1],
## ...,
## [1, 0],
## [0, 0],
## [0, 0]], dtype=uint8)
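Dummy-coding the target gives the classifier two separate 0/1 outputs, which is why rows such as [0, 0] can appear in the predictions when a leaf holds equal numbers of both nouns. One alternative, sketched here on a small made-up data frame (the rows are invented for illustration and are not the nerd data), is to dummy-code only the predictors and pass the noun labels as a single column.

```python
import pandas as pd
from sklearn import tree

# Toy rows invented for illustration; not taken from the nerd data set.
toy = pd.DataFrame({
    "Num": ["pl", "sg", "sg", "pl", "sg", "pl"],
    "Eval": ["Neutral", "Positive", "Neg", "Positive", "Neutral", "Neg"],
    "Noun": ["nerd", "geek", "nerd", "geek", "nerd", "nerd"],
})

# Dummy-code the predictors only.
X = pd.get_dummies(toy[["Num", "Eval"]])

# Keep the target as one column of labels instead of two dummy columns.
y = toy["Noun"]

clf = tree.DecisionTreeClassifier(random_state=0).fit(X, y)
pred = clf.predict(X)  # one label per row, always "geek" or "nerd"
print(pred)
```

With a single label column, every prediction is exactly one noun, so no tie-breaking step such as idxmax is needed afterwards.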
##python chunk
print(tree.export_text(classTree,feature_names= list(predictor_variables.columns)))
## |--- Eval_Positive <= 0.50
## | |--- Century_XXI <= 0.50
## | | |--- Num_pl <= 0.50
## | | | |--- Register_ACAD <= 0.50
## | | | | |--- Register_MAG <= 0.50
## | | | | | |--- Register_NEWS <= 0.50
## | | | | | | |--- Eval_Neg <= 0.50
## | | | | | | | |--- class: 0
## | | | | | | |--- Eval_Neg > 0.50
## | | | | | | | |--- class: 1
## | | | | | |--- Register_NEWS > 0.50
## | | | | | | |--- Eval_Neg <= 0.50
## | | | | | | | |--- class: 0
## | | | | | | |--- Eval_Neg > 0.50
## | | | | | | | |--- class: 0
## | | | | |--- Register_MAG > 0.50
## | | | | | |--- Eval_Neutral <= 0.50
## | | | | | | |--- class: 0
## | | | | | |--- Eval_Neutral > 0.50
## | | | | | | |--- class: 0
## | | | |--- Register_ACAD > 0.50
## | | | | |--- Eval_Neutral <= 0.50
## | | | | | |--- class: 0
## | | | | |--- Eval_Neutral > 0.50
## | | | | | |--- class: 0
## | | |--- Num_pl > 0.50
## | | | |--- Register_ACAD <= 0.50
## | | | | |--- Register_MAG <= 0.50
## | | | | | |--- Eval_Neutral <= 0.50
## | | | | | | |--- Register_SPOK <= 0.50
## | | | | | | | |--- class: 0
## | | | | | | |--- Register_SPOK > 0.50
## | | | | | | | |--- class: 0
## | | | | | |--- Eval_Neutral > 0.50
## | | | | | | |--- Register_NEWS <= 0.50
## | | | | | | | |--- class: 0
## | | | | | | |--- Register_NEWS > 0.50
## | | | | | | | |--- class: 0
## | | | | |--- Register_MAG > 0.50
## | | | | | |--- Eval_Neg <= 0.50
## | | | | | | |--- class: 0
## | | | | | |--- Eval_Neg > 0.50
## | | | | | | |--- class: 0
## | | | |--- Register_ACAD > 0.50
## | | | | |--- class: 0
## | |--- Century_XXI > 0.50
## | | |--- Num_sg <= 0.50
## | | | |--- Register_MAG <= 0.50
## | | | | |--- Eval_Neg <= 0.50
## | | | | | |--- Register_ACAD <= 0.50
## | | | | | | |--- Register_SPOK <= 0.50
## | | | | | | | |--- class: 0
## | | | | | | |--- Register_SPOK > 0.50
## | | | | | | | |--- class: 0
## | | | | | |--- Register_ACAD > 0.50
## | | | | | | |--- class: 0
## | | | | |--- Eval_Neg > 0.50
## | | | | | |--- Register_SPOK <= 0.50
## | | | | | | |--- Register_ACAD <= 0.50
## | | | | | | | |--- class: 0
## | | | | | | |--- Register_ACAD > 0.50
## | | | | | | | |--- class: 0
## | | | | | |--- Register_SPOK > 0.50
## | | | | | | |--- class: 0
## | | | |--- Register_MAG > 0.50
## | | | | |--- Eval_Neutral <= 0.50
## | | | | | |--- class: 1
## | | | | |--- Eval_Neutral > 0.50
## | | | | | |--- class: 0
## | | |--- Num_sg > 0.50
## | | | |--- Register_ACAD <= 0.50
## | | | | |--- Register_MAG <= 0.50
## | | | | | |--- Eval_Neutral <= 0.50
## | | | | | | |--- Register_SPOK <= 0.50
## | | | | | | | |--- class: 1
## | | | | | | |--- Register_SPOK > 0.50
## | | | | | | | |--- class: 0
## | | | | | |--- Eval_Neutral > 0.50
## | | | | | | |--- Register_SPOK <= 0.50
## | | | | | | | |--- class: 0
## | | | | | | |--- Register_SPOK > 0.50
## | | | | | | | |--- class: 1
## | | | | |--- Register_MAG > 0.50
## | | | | | |--- Eval_Neg <= 0.50
## | | | | | | |--- class: 1
## | | | | | |--- Eval_Neg > 0.50
## | | | | | | |--- class: 0
## | | | |--- Register_ACAD > 0.50
## | | | | |--- Eval_Neutral <= 0.50
## | | | | | |--- class: 1
## | | | | |--- Eval_Neutral > 0.50
## | | | | | |--- class: 0
## |--- Eval_Positive > 0.50
## | |--- Century_XX <= 0.50
## | | |--- Register_MAG <= 0.50
## | | | |--- Register_ACAD <= 0.50
## | | | | |--- Register_NEWS <= 0.50
## | | | | | |--- Num_pl <= 0.50
## | | | | | | |--- class: 1
## | | | | | |--- Num_pl > 0.50
## | | | | | | |--- class: 1
## | | | | |--- Register_NEWS > 0.50
## | | | | | |--- Num_sg <= 0.50
## | | | | | | |--- class: 1
## | | | | | |--- Num_sg > 0.50
## | | | | | | |--- class: 1
## | | | |--- Register_ACAD > 0.50
## | | | | |--- class: 1
## | | |--- Register_MAG > 0.50
## | | | |--- Num_sg <= 0.50
## | | | | |--- class: 1
## | | | |--- Num_sg > 0.50
## | | | | |--- class: 1
## | |--- Century_XX > 0.50
## | | |--- Register_MAG <= 0.50
## | | | |--- Register_ACAD <= 0.50
## | | | | |--- Num_sg <= 0.50
## | | | | | |--- Register_NEWS <= 0.50
## | | | | | | |--- class: 0
## | | | | | |--- Register_NEWS > 0.50
## | | | | | | |--- class: 0
## | | | | |--- Num_sg > 0.50
## | | | | | |--- Register_NEWS <= 0.50
## | | | | | | |--- class: 0
## | | | | | |--- Register_NEWS > 0.50
## | | | | | | |--- class: 0
## | | | |--- Register_ACAD > 0.50
## | | | | |--- class: 0
## | | |--- Register_MAG > 0.50
## | | | |--- Num_sg <= 0.50
## | | | | |--- class: 1
## | | | |--- Num_sg > 0.50
## | | | | |--- class: 1
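Many branches in the printed tree end in two leaves with the same class, which makes the rules hard to read. A possible way to get a more compact printout is to cap the depth when fitting. The sketch below uses invented dummy-coded rows (not the nerd data), and the depth of 2 is an arbitrary choice for illustration.

```python
from sklearn import tree

# Toy dummy-coded predictors invented for illustration: [Eval_Positive, Century_XXI]
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0, 0], [1, 1], [0, 1], [1, 0]]
y = ["nerd", "nerd", "geek", "geek", "nerd", "geek", "nerd", "geek"]

# max_depth caps how many splits can be nested below the root.
small_tree = tree.DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

rules = tree.export_text(small_tree, feature_names=["Eval_Positive", "Century_XXI"])
print(rules)
```

A shallower tree is easier to compare with the ctree plot, although it may classify less accurately than the fully grown tree.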
##python chunk
prediction = pd.DataFrame(classTree.predict(predictor_variables))
prediction.columns = list(target_variable.columns)
prediction_category = prediction.idxmax(axis=1)
target_variable_category = target_variable.idxmax(axis=1)
sklearn.metrics.confusion_matrix(prediction_category, target_variable_category, labels = ["nerd","geek"])
## array([[439, 269],
## [207, 401]])
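Note that scikit-learn's `confusion_matrix` expects the true labels first and the predictions second, so passing the predictions first, as above, puts predictions on the rows (which happens to match the R tables). A small sketch on invented labels shows that swapping the arguments only transposes the table and leaves accuracy unchanged.

```python
from sklearn.metrics import confusion_matrix

# Toy labels invented for illustration.
actual = ["nerd", "nerd", "nerd", "geek", "geek"]
predicted = ["nerd", "geek", "geek", "geek", "geek"]
labels = ["nerd", "geek"]

# Signature is confusion_matrix(y_true, y_pred): rows follow the first argument.
rows_actual = confusion_matrix(actual, predicted, labels=labels)
rows_predicted = confusion_matrix(predicted, actual, labels=labels)

# The diagonal is shared by both orientations, so accuracy does not depend on the order.
acc = rows_actual.trace() / rows_actual.sum()
print(rows_actual, rows_predicted, acc, sep="\n")
```

Applying the same diagonal-over-total calculation to the matrix above gives (439 + 401) / 1316, or about 63.83%.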
The model is easier to create in R; Python requires importing packages and dummy-coding the variables first.
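** Which model gave you a better classification of the categories? The random forest, with an accuracy of 63.68%, classified better than the conditional inference tree (61.70%). The Python decision tree reached a similar 63.83%, but all three figures are computed on the same data used to fit the models.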
** What other variables might be useful in understanding the category membership of geek versus nerd? Basically, what could we add to the model to improve it (there’s no one right answer here - it’s helpful to think about what other facets we have not measured)? Additional variables, such as demographic information, could help improve accuracy.