Load the libraries + functions

Load all the libraries and functions that you will use for the rest of the assignment. It is helpful to load your libraries and define your functions at the top of a report, so that others know what they need for the report to compile correctly.

The data for this project has already been loaded. You will distinguish between the categories nerd and geek to determine how each variable influences which of the two words is used.

library(Rling)

data(nerd)

head(nerd)
##   Noun Num Century Register    Eval
## 1 nerd  pl      XX     ACAD Neutral
## 2 nerd  pl     XXI     ACAD Neutral
## 3 nerd  pl      XX     ACAD Neutral
## 4 nerd  pl      XX     ACAD Neutral
## 5 nerd  pl      XX     ACAD Neutral
## 6 nerd  pl     XXI     ACAD Neutral
#install.packages("party")
library(party)
## Warning: package 'party' was built under R version 3.5.3
## Loading required package: grid
## Loading required package: mvtnorm
## Warning: package 'mvtnorm' was built under R version 3.5.3
## Loading required package: modeltools
## Warning: package 'modeltools' was built under R version 3.5.2
## Loading required package: stats4
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 3.5.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.5.3
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 3.5.3
table(nerd$Noun)
## 
## geek nerd 
##  670  646
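Since the report will not compile without its packages, a quick check at the top can tell a reader what is missing before anything else runs. The following is a minimal sketch, not part of the original assignment: `missing_packages` is a hypothetical helper name, and how each missing package gets installed depends on where it is distributed.

```r
# Packages this report loads (names taken from the library() calls above).
required <- c("Rling", "party")

# Hypothetical helper: return the names of packages that cannot be loaded.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
}
```

`requireNamespace()` only reports whether a package is available. It does not attach the package, so the `library()` calls are still needed.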

Description of the data

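Dependent variable: Noun, the word that was used (geek or nerd).

Independent variables: Num (grammatical number, for example pl), Century (XX or XXI), Register (for example ACAD), and Eval (for example Neutral).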

Conditional inference model

set.seed(12345)

tree.output = ctree(Noun ~ Num + Century + Register + Eval, data = nerd)

Make a plot

plot(tree.output)

Interpret the categories

Conditional inference model predictiveness

predResults = table(predict(tree.output), nerd$Noun)

predResults
##       
##        geek nerd
##   geek  227   61
##   nerd  443  585
sum(diag(predResults)) / sum(predResults) * 100
## [1] 61.70213
# Percentage of actual "geek" tokens that were predicted correctly
predResults[1, 1] / sum(predResults[, 1]) * 100
## [1] 33.8806
# Percentage of actual "nerd" tokens that were predicted correctly
predResults[2, 2] / sum(predResults[, 2]) * 100
## [1] 90.55728
# Proportion of tokens that are actually "geek"
sum(predResults[, 1]) / sum(predResults)
## [1] 0.5091185
# Proportion of tokens that were predicted as "geek"
sum(predResults[1, ]) / sum(predResults)
## [1] 0.218845
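The accuracy figures above can also be computed in one pass. The sketch below is self-contained: it rebuilds the confusion table by hand from the printed counts rather than refitting the model, and `summarise_confusion` is a hypothetical helper name, not a function from party or Rling.

```r
# Confusion table copied from the ctree output above:
# rows are predicted labels, columns are actual labels.
predResults <- matrix(
  c(227, 443, 61, 585),
  nrow = 2,
  dimnames = list(predicted = c("geek", "nerd"),
                  actual = c("geek", "nerd"))
)

# Hypothetical helper: summarise a square confusion table.
summarise_confusion <- function(tab) {
  list(
    overall = sum(diag(tab)) / sum(tab) * 100,     # percent correct overall
    per_class = diag(tab) / colSums(tab) * 100,    # percent correct per actual class
    predicted_share = rowSums(tab) / sum(tab),     # how often each label is predicted
    actual_share = colSums(tab) / sum(tab)         # how often each label occurs
  )
}

tree_summary <- summarise_confusion(predResults)
print(lapply(tree_summary, round, 2))
```

Using `diag()` and `colSums()` avoids single-element indexing such as `predResults[4]`, which depends on remembering how R stores the table column by column.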

Random forests

forestRand.output = cforest(Noun ~ Num + Century + Register + Eval,
                            data = nerd,
                            controls = cforest_unbiased(ntree = 1000, mtry = 2))

Variable importance

As the output shows, the most important variables are Eval and Century.

Eval: the evaluative tone of the usage (for example, Neutral).

Century: the time period of the usage, either the 20th (XX) or the 21st (XXI) century.

forVariable.importance = varimp(forestRand.output,
                           conditional = TRUE)
round(forVariable.importance, 3)
##      Num  Century Register     Eval 
##   -0.002    0.024   -0.003    0.056
dotchart(sort(forVariable.importance),
         main = "Variable Importance")
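One way to decide which importance scores matter is a rule of thumb often used with permutation importance: treat a variable as informative only if its score exceeds the absolute value of the most negative score, since negative scores reflect random variation around zero. That threshold is an assumption added here, not something the assignment specifies. The sketch copies the rounded scores from the output above instead of refitting the forest.

```r
# Rounded conditional importance scores copied from the output above.
importance <- c(Num = -0.002, Century = 0.024, Register = -0.003, Eval = 0.056)

# Assumed rule of thumb: the most negative score estimates the noise level.
# (If no score were negative, this threshold would not be meaningful.)
threshold <- abs(min(importance))

informative <- names(importance)[importance > threshold]

print(sort(importance, decreasing = TRUE))
print(informative)
```

With these scores the threshold is 0.003, so only Century and Eval pass it. This agrees with the reading of the dot chart.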

Forest model predictiveness

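The forest model is less biased toward predicting “nerd” than the single tree. The tree's predictions split 22:78 (geek:nerd) against an actual split of 51:49, whereas the random forest's predictions split 37:63. Prediction accuracy for nerd drops from 90.6% with the tree to 76.9% with the forest, whereas accuracy for geek rises from 33.9% to 50.3%. Overall accuracy improves slightly, from 61.7% to 63.4%. Each method therefore has its pros and cons.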

forestPredict.outcomes = table(predict(forestRand.output), nerd$Noun)

forestPredict.outcomes
##       
##        geek nerd
##   geek  337  149
##   nerd  333  497
sum(diag(forestPredict.outcomes)) / sum(forestPredict.outcomes) * 100
## [1] 63.37386
# Percentage of actual "geek" tokens that were predicted correctly
forestPredict.outcomes[1, 1] / sum(forestPredict.outcomes[, 1]) * 100
## [1] 50.29851
# Percentage of actual "nerd" tokens that were predicted correctly
forestPredict.outcomes[2, 2] / sum(forestPredict.outcomes[, 2]) * 100
## [1] 76.93498
# Proportion of tokens that were predicted as "geek"
sum(forestPredict.outcomes[1, ]) / sum(forestPredict.outcomes)
## [1] 0.3693009
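The bias comparison between the two models can be expressed as a single number per model: the share of "geek" among the predictions minus the share of "geek" in the data. The sketch below rebuilds both confusion tables from the printed counts. `geek_share_gap` is a hypothetical helper name used only for illustration.

```r
# Confusion tables copied from the outputs above
# (rows = predicted labels, columns = actual labels).
labels <- list(predicted = c("geek", "nerd"), actual = c("geek", "nerd"))
tree_tab <- matrix(c(227, 443, 61, 585), nrow = 2, dimnames = labels)
forest_tab <- matrix(c(337, 333, 149, 497), nrow = 2, dimnames = labels)

# Hypothetical helper: predicted share of "geek" minus actual share of "geek".
# A negative value means the model predicts "geek" too rarely.
geek_share_gap <- function(tab) {
  predicted <- sum(tab["geek", ]) / sum(tab)
  actual <- sum(tab[, "geek"]) / sum(tab)
  predicted - actual
}

gaps <- c(tree = geek_share_gap(tree_tab),
          forest = geek_share_gap(forest_tab))
print(round(gaps, 3))
```

Both gaps are negative, so both models under-predict geek, but the forest's gap is roughly half the size of the tree's.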

Thought question

A geek is someone who is more knowledgeable about books, while a nerd is someone who specializes in a specific area of knowledge.