Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.
The data for this project has already been loaded. You will be distinguishing between the categories of nerd and geek to determine the influence of respective variables on their category definition.
If you are having trouble with the Rling library, the nerd data is available on Canvas, and you can load it directly.
##r chunk
library(Rling)
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
library(reticulate)
py_config()
## python: /usr/bin/python3
## libpython: /usr/lib/python3.6/config-3.6m-x86_64-linux-gnu/libpython3.6.so
## pythonhome: //usr://usr
## version: 3.6.9 (default, Nov 7 2019, 10:44:02) [GCC 8.3.0]
## numpy: /home/rajan_patel/.local/lib/python3.6/site-packages/numpy
## numpy_version: 1.18.2
##
## python versions found:
## /usr/bin/python3
## /usr/bin/python
data(nerd)
head(nerd)
## Noun Num Century Register Eval
## 1 nerd pl XX ACAD Neutral
## 2 nerd pl XXI ACAD Neutral
## 3 nerd pl XX ACAD Neutral
## 4 nerd pl XX ACAD Neutral
## 5 nerd pl XX ACAD Neutral
## 6 nerd pl XXI ACAD Neutral
Dependent variable: Noun (nerd versus geek)
Independent variables: Num, Century, Register, Eval
Use ctree() to create a conditional inference model.
##r chunk
set.seed(549354)
tree.output = ctree(Noun ~ Num + Century + Register + Eval, data = nerd)
##r chunk
plot(tree.output)
##r chunk
outcomes = table(predict(tree.output), nerd$Noun)
outcomes
##
## geek nerd
## geek 227 61
## nerd 443 585
sum(diag(outcomes)) / sum(outcomes) * 100
## [1] 61.70213
##r chunk
forest.output = cforest(Noun ~ Num + Century + Register + Eval,
                        data = nerd,
                        controls = cforest_unbiased(ntree = 1000,
                                                    mtry = 2))
##r chunk
forest.importance = varimp(forest.output, conditional = TRUE)
round(forest.importance, 2)
## Num Century Register Eval
## 0.00 0.02 0.00 0.06
dotchart(sort(forest.importance),
         main = "Conditional Importance of Variables")
##r chunk
forest.outcomes = table(predict(forest.output), nerd$Noun)
forest.outcomes
##
## geek nerd
## geek 328 140
## nerd 342 506
sum(diag(forest.outcomes)) / sum(forest.outcomes) * 100
## [1] 63.37386
##python chunk
import pandas as pd
#nerd_pydata = pd.read_csv('nerd.csv')
Xvars = pd.get_dummies(r.nerd[["Num", "Century", "Register", "Eval"]])
Xvars.head()
## Num_pl Num_sg Century_XX ... Eval_Neg Eval_Neutral Eval_Positive
## 0 1 0 1 ... 0 1 0
## 1 1 0 0 ... 0 1 0
## 2 1 0 1 ... 0 1 0
## 3 1 0 1 ... 0 1 0
## 4 1 0 1 ... 0 1 0
##
## [5 rows x 11 columns]
Yvar = pd.get_dummies(r.nerd["Noun"])
Yvar.head()
## geek nerd
## 0 0 1
## 1 0 1
## 2 0 1
## 3 0 1
## 4 0 1
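The dummy coding produced by pd.get_dummies above can be illustrated with a small standard-library sketch. The helper dummy_code and the toy rows below are hypothetical and are not taken from pandas or from the nerd data; they only show how each level of a categorical variable becomes its own 0/1 column.

```python
# Minimal sketch of dummy (one-hot) coding with made-up rows, not the nerd data.
def dummy_code(rows, column):
    """Return one dict of 0/1 indicators per row, one key per level of `column`."""
    levels = sorted({row[column] for row in rows})
    return [
        {f"{column}_{level}": int(row[column] == level) for level in levels}
        for row in rows
    ]

# Hypothetical example rows (assumption: same variable names as the report)
toy_rows = [
    {"Num": "pl", "Eval": "Neutral"},
    {"Num": "sg", "Eval": "Positive"},
    {"Num": "pl", "Eval": "Neg"},
]

num_dummies = dummy_code(toy_rows, "Num")
eval_dummies = dummy_code(toy_rows, "Eval")

print(num_dummies[0])   # {'Num_pl': 1, 'Num_sg': 0}
print(eval_dummies[1])  # {'Eval_Neg': 0, 'Eval_Neutral': 0, 'Eval_Positive': 1}
```

Each original variable with k levels turns into k indicator columns, which is why the four predictors expand to eleven columns in Xvars.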
nerd data.
##python chunk
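import sklearn
from sklearn import tree
# Despite the name CIT, this is a CART-style decision tree, not a conditional inference tree
CIT = tree.DecisionTreeClassifier()
# Yvar has two dummy columns, so sklearn fits a multi-output tree (one output per column)
CIT = CIT.fit(Xvars, Yvar)
CIT.predict(Xvars)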
## array([[0, 1],
## [0, 1],
## [0, 1],
## ...,
## [1, 0],
## [0, 0],
## [0, 0]], dtype=uint8)
##python chunk
print(tree.export_text(CIT,feature_names= list(Xvars.columns)))
## |--- Eval_Positive <= 0.50
## | |--- Century_XXI <= 0.50
## | | |--- Num_sg <= 0.50
## | | | |--- Register_ACAD <= 0.50
## | | | | |--- Register_MAG <= 0.50
## | | | | | |--- Eval_Neg <= 0.50
## | | | | | | |--- Register_NEWS <= 0.50
## | | | | | | | |--- class: 0
## | | | | | | |--- Register_NEWS > 0.50
## | | | | | | | |--- class: 0
## | | | | | |--- Eval_Neg > 0.50
## | | | | | | |--- Register_NEWS <= 0.50
## | | | | | | | |--- class: 0
## | | | | | | |--- Register_NEWS > 0.50
## | | | | | | | |--- class: 0
## | | | | |--- Register_MAG > 0.50
## | | | | | |--- Eval_Neutral <= 0.50
## | | | | | | |--- class: 0
## | | | | | |--- Eval_Neutral > 0.50
## | | | | | | |--- class: 0
## | | | |--- Register_ACAD > 0.50
## | | | | |--- class: 0
## | | |--- Num_sg > 0.50
## | | | |--- Register_ACAD <= 0.50
## | | | | |--- Register_MAG <= 0.50
## | | | | | |--- Register_NEWS <= 0.50
## | | | | | | |--- Eval_Neg <= 0.50
## | | | | | | | |--- class: 0
## | | | | | | |--- Eval_Neg > 0.50
## | | | | | | | |--- class: 1
## | | | | | |--- Register_NEWS > 0.50
## | | | | | | |--- Eval_Neutral <= 0.50
## | | | | | | | |--- class: 0
## | | | | | | |--- Eval_Neutral > 0.50
## | | | | | | | |--- class: 0
## | | | | |--- Register_MAG > 0.50
## | | | | | |--- Eval_Neutral <= 0.50
## | | | | | | |--- class: 0
## | | | | | |--- Eval_Neutral > 0.50
## | | | | | | |--- class: 0
## | | | |--- Register_ACAD > 0.50
## | | | | |--- Eval_Neutral <= 0.50
## | | | | | |--- class: 0
## | | | | |--- Eval_Neutral > 0.50
## | | | | | |--- class: 0
## | |--- Century_XXI > 0.50
## | | |--- Num_pl <= 0.50
## | | | |--- Register_ACAD <= 0.50
## | | | | |--- Register_MAG <= 0.50
## | | | | | |--- Eval_Neg <= 0.50
## | | | | | | |--- Register_SPOK <= 0.50
## | | | | | | | |--- class: 0
## | | | | | | |--- Register_SPOK > 0.50
## | | | | | | | |--- class: 1
## | | | | | |--- Eval_Neg > 0.50
## | | | | | | |--- Register_SPOK <= 0.50
## | | | | | | | |--- class: 1
## | | | | | | |--- Register_SPOK > 0.50
## | | | | | | | |--- class: 0
## | | | | |--- Register_MAG > 0.50
## | | | | | |--- Eval_Neg <= 0.50
## | | | | | | |--- class: 1
## | | | | | |--- Eval_Neg > 0.50
## | | | | | | |--- class: 0
## | | | |--- Register_ACAD > 0.50
## | | | | |--- Eval_Neutral <= 0.50
## | | | | | |--- class: 1
## | | | | |--- Eval_Neutral > 0.50
## | | | | | |--- class: 0
## | | |--- Num_pl > 0.50
## | | | |--- Register_MAG <= 0.50
## | | | | |--- Eval_Neg <= 0.50
## | | | | | |--- Register_ACAD <= 0.50
## | | | | | | |--- Register_SPOK <= 0.50
## | | | | | | | |--- class: 0
## | | | | | | |--- Register_SPOK > 0.50
## | | | | | | | |--- class: 0
## | | | | | |--- Register_ACAD > 0.50
## | | | | | | |--- class: 0
## | | | | |--- Eval_Neg > 0.50
## | | | | | |--- Register_SPOK <= 0.50
## | | | | | | |--- Register_NEWS <= 0.50
## | | | | | | | |--- class: 0
## | | | | | | |--- Register_NEWS > 0.50
## | | | | | | | |--- class: 0
## | | | | | |--- Register_SPOK > 0.50
## | | | | | | |--- class: 0
## | | | |--- Register_MAG > 0.50
## | | | | |--- Eval_Neg <= 0.50
## | | | | | |--- class: 0
## | | | | |--- Eval_Neg > 0.50
## | | | | | |--- class: 1
## |--- Eval_Positive > 0.50
## | |--- Century_XX <= 0.50
## | | |--- Register_MAG <= 0.50
## | | | |--- Register_ACAD <= 0.50
## | | | | |--- Register_SPOK <= 0.50
## | | | | | |--- Num_pl <= 0.50
## | | | | | | |--- class: 1
## | | | | | |--- Num_pl > 0.50
## | | | | | | |--- class: 1
## | | | | |--- Register_SPOK > 0.50
## | | | | | |--- Num_pl <= 0.50
## | | | | | | |--- class: 1
## | | | | | |--- Num_pl > 0.50
## | | | | | | |--- class: 1
## | | | |--- Register_ACAD > 0.50
## | | | | |--- class: 1
## | | |--- Register_MAG > 0.50
## | | | |--- Num_sg <= 0.50
## | | | | |--- class: 1
## | | | |--- Num_sg > 0.50
## | | | | |--- class: 1
## | |--- Century_XX > 0.50
## | | |--- Register_MAG <= 0.50
## | | | |--- Register_ACAD <= 0.50
## | | | | |--- Num_sg <= 0.50
## | | | | | |--- Register_NEWS <= 0.50
## | | | | | | |--- class: 0
## | | | | | |--- Register_NEWS > 0.50
## | | | | | | |--- class: 0
## | | | | |--- Num_sg > 0.50
## | | | | | |--- Register_SPOK <= 0.50
## | | | | | | |--- class: 0
## | | | | | |--- Register_SPOK > 0.50
## | | | | | | |--- class: 0
## | | | |--- Register_ACAD > 0.50
## | | | | |--- class: 0
## | | |--- Register_MAG > 0.50
## | | | |--- Num_pl <= 0.50
## | | | | |--- class: 1
## | | | |--- Num_pl > 0.50
## | | | | |--- class: 1
##python chunk
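from sklearn.metrics import confusion_matrix
Y_predict = pd.DataFrame(CIT.predict(Xvars))
Y_predict.columns = list(Yvar.columns)
# Collapse the dummy columns back to a single label per row
Y_predict_category = Y_predict.idxmax(axis=1)
Yvar_category = Yvar.idxmax(axis=1)
# Predictions are passed first, so rows are predicted labels and columns are actual labels
confusion_matrix(Y_predict_category, Yvar_category, labels = ["nerd","geek"])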
## array([[439, 269],
## [207, 401]])
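The R chunks report accuracy with sum(diag(outcomes)) / sum(outcomes) * 100, but the Python chunk stops at the confusion matrix. A minimal sketch of the same calculation follows, using only the counts printed above; the helper name accuracy_percent is an assumption introduced here for illustration. The result is accuracy on the same data the tree was fit to, not on held-out data.

```python
# Confusion matrix copied from the output above.
# Rows are predicted labels and columns are actual labels, both ordered nerd, geek.
python_outcomes = [[439, 269],
                   [207, 401]]

def accuracy_percent(matrix):
    """Percent of cases on the diagonal, i.e. predictions that match the actual label."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total * 100

python_accuracy = accuracy_percent(python_outcomes)
print(round(python_accuracy, 2))  # 63.83
```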
Are the models easier to create using R or Python (your own thoughts, they can be different than what I said in the lecture)?
The models are easier to create in R. In Python, the tree is not easy to read, and the categorical data first has to be converted into dummy-coded variables, which are also hard to read.
Which model gave you a better classification of the categories?
The random forest model gave a better classification of the categories, with an accuracy of 63.37% compared to 61.70% for the conditional inference tree.
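To put the two R accuracies in context, they can be compared with the accuracy of always guessing the more frequent noun. The sketch below uses only the counts from the R confusion tables above (rows are predicted, columns are actual, ordered geek, nerd); the helper names are assumptions introduced here for illustration.

```python
# Counts copied from the R tables above: rows = predicted, columns = actual (geek, nerd).
ctree_outcomes = [[227, 61],
                  [443, 585]]
forest_outcomes = [[328, 140],
                   [342, 506]]

def accuracy_percent(matrix):
    """Percent of cases on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix) * 100

def baseline_percent(matrix):
    """Accuracy of always guessing the most frequent actual category."""
    column_totals = [sum(column) for column in zip(*matrix)]
    return max(column_totals) / sum(column_totals) * 100

baseline = baseline_percent(ctree_outcomes)
print(round(baseline, 2))  # 50.91

for name, outcomes in [("ctree", ctree_outcomes), ("cforest", forest_outcomes)]:
    accuracy = accuracy_percent(outcomes)
    print(name, round(accuracy, 2), round(accuracy - baseline, 2))
# ctree 61.7 10.79
# cforest 63.37 12.46
```

Both models beat the majority-class baseline by roughly 11 to 12 percentage points, so the gap between them is small relative to their shared gain over guessing.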
What other variables might be useful in understanding the category membership of geek versus nerd? Basically, what could we add to the model to improve it (there’s no one right answer here - it’s helpful to think about what other facets we have not measured)?
If we knew the context in which the noun was used, prediction accuracy might improve. For example, geek tends to be used for a person with technical knowledge, while nerd is used more in academic contexts.