Load the libraries + functions

Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.

The data for this project has already been loaded. You will distinguish between the categories nerd and geek to determine how each variable influences which noun is used.

library(Rling)
library(party)
## Warning: package 'party' was built under R version 3.5.2
## Loading required package: grid
## Loading required package: mvtnorm
## Warning: package 'mvtnorm' was built under R version 3.5.2
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.5.2
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 3.5.2
data(nerd)
head(nerd)
##   Noun Num Century Register    Eval
## 1 nerd  pl      XX     ACAD Neutral
## 2 nerd  pl     XXI     ACAD Neutral
## 3 nerd  pl      XX     ACAD Neutral
## 4 nerd  pl      XX     ACAD Neutral
## 5 nerd  pl      XX     ACAD Neutral
## 6 nerd  pl     XXI     ACAD Neutral
table(nerd$Noun)
## 
## geek nerd 
##  670  646

Description of the data

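Dependent variable: Noun, the noun that was used (geek or nerd).

Independent variables: Num (grammatical number of the noun, e.g. pl), Century (century of the text, XX or XXI), Register (register of the text, e.g. ACAD), and Eval (evaluation expressed, e.g. Neutral).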

Conditional inference model

set.seed(1000)
tree.output = ctree(Noun ~ Num + Century + Register + Eval, data = nerd)

Make a plot

plot(tree.output)

Interpret the categories

outcomes = table(predict(tree.output), nerd$Noun)
outcomes
##       
##        geek nerd
##   geek  227   61
##   nerd  443  585
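# Rows are predicted nouns, columns are observed nouns
# Overall accuracy (%)
sum(diag(outcomes)) / sum(outcomes) * 100
## [1] 61.70213
# Percent of observed geek tokens classified as geek
outcomes[1, 1] / sum(outcomes[, 1]) * 100
## [1] 33.8806
# Percent of observed nerd tokens classified as nerd
outcomes[2, 2] / sum(outcomes[, 2]) * 100
## [1] 90.55728
# Proportion of tokens observed as geek
sum(outcomes[, 1]) / sum(outcomes)
## [1] 0.5091185
# Proportion of tokens predicted as geek
sum(outcomes[1, ]) / sum(outcomes)
## [1] 0.218845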
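
The accuracy calculations above can be wrapped in a small helper so the same summary is produced for any model. This is a minimal sketch in base R: the name classification_summary is hypothetical, and the counts are copied by hand from the confusion table printed above rather than taken from the fitted tree.

```r
# Confusion table copied from the output above:
# rows are predicted nouns, columns are observed nouns.
tree_outcomes <- matrix(c(227, 443, 61, 585), nrow = 2,
                        dimnames = list(predicted = c("geek", "nerd"),
                                        observed = c("geek", "nerd")))

# Hypothetical helper: overall accuracy and per-noun recall, in percent.
classification_summary <- function(tab) {
  c(accuracy = sum(diag(tab)) / sum(tab) * 100,
    setNames(diag(tab) / colSums(tab) * 100,
             paste0(colnames(tab), "_recall")))
}

tree_summary <- classification_summary(tree_outcomes)
round(tree_summary, 2)
```

Passing the forest's confusion table to the same function would give figures that are directly comparable with the tree's.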

Random forests

forest.output = cforest(Noun ~ Num + Century + Register + Eval, 
                        data = nerd,
                        controls = cforest_unbiased(ntree = 1000,
                                                    mtry = 3))

Variable importance

forest.importance = varimp(forest.output,
                           conditional = TRUE)
round(forest.importance, 2)
##      Num  Century Register     Eval 
##     0.00     0.02     0.00     0.06
dotchart(sort(forest.importance),
         main = "Conditional Importance of Variables")

Forest model predictiveness

forest.outcomes = table(predict(forest.output), nerd$Noun)
forest.outcomes
##       
##        geek nerd
##   geek  376  186
##   nerd  294  460
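# Rows are predicted nouns, columns are observed nouns
# Overall accuracy (%)
sum(diag(forest.outcomes)) / sum(forest.outcomes) * 100
## [1] 63.52584
# Percent of observed geek tokens classified as geek
forest.outcomes[1, 1] / sum(forest.outcomes[, 1]) * 100
## [1] 56.1194
# Percent of observed nerd tokens classified as nerd
forest.outcomes[2, 2] / sum(forest.outcomes[, 2]) * 100
## [1] 71.20743
# Proportion of tokens predicted as geek
sum(forest.outcomes[1, ]) / sum(forest.outcomes)
## [1] 0.4270517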

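The random forest predicted the data slightly better than the single conditional inference tree: overall accuracy was 63.5% versus 61.7%. The forest was also more balanced across the two nouns, correctly classifying 56.1% of geek tokens and 71.2% of nerd tokens, whereas the tree classified 90.6% of nerd tokens correctly but only 33.9% of geek tokens.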
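
To put the two accuracies in context, they can be set against a baseline that always guesses the more frequent noun. The sketch below is an illustration in base R only: the counts are copied by hand from the two confusion tables above, and the helper accuracy and the object names are hypothetical.

```r
# Counts copied from the confusion tables above
# (rows = predicted nouns, columns = observed nouns).
tree_tab   <- matrix(c(227, 443, 61, 585), nrow = 2)
forest_tab <- matrix(c(376, 294, 186, 460), nrow = 2)

# Hypothetical helper: overall accuracy in percent.
accuracy <- function(tab) sum(diag(tab)) / sum(tab) * 100

# Baseline: always guess the more frequent observed noun (geek, 670 of 1316).
baseline <- max(colSums(tree_tab)) / sum(tree_tab) * 100

comparison <- c(baseline = baseline,
                tree     = accuracy(tree_tab),
                forest   = accuracy(forest_tab))
round(comparison, 1)
```

Under these counts the baseline is about 50.9%, so both models improve on guessing, and the forest adds a little under two percentage points over the tree.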

Thought question

To improve the model, we would add predictors that describe the content and language of the contexts in which geek and nerd are used.