Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.
The data for this project has already been loaded. You will distinguish between the categories nerd and geek to determine how each variable influences which category is used.
library(Rling)
library(party)
## Warning: package 'party' was built under R version 3.5.2
## Loading required package: grid
## Loading required package: mvtnorm
## Warning: package 'mvtnorm' was built under R version 3.5.2
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.5.2
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 3.5.2
data(nerd)
head(nerd)
## Noun Num Century Register Eval
## 1 nerd pl XX ACAD Neutral
## 2 nerd pl XXI ACAD Neutral
## 3 nerd pl XX ACAD Neutral
## 4 nerd pl XX ACAD Neutral
## 5 nerd pl XX ACAD Neutral
## 6 nerd pl XXI ACAD Neutral
table(nerd$Noun)
##
## geek nerd
## 670 646
Dependent variable: Noun (geek versus nerd)
Independent variables: Num, Century, Register, Eval
Use ctree() to create a conditional inference model.
set.seed(1000)
tree.output = ctree(Noun ~ Num + Century + Register + Eval, data = nerd)
plot(tree.output)
With only two categories, the bar graph in each terminal node shows how the cases in that node split between them, so look for the group with the larger proportion. In the graph above, the second group has the larger proportion.
## Conditional inference model predictiveness
Calculate the percent correct classification for the conditional inference model.
outcomes = table(predict(tree.output), nerd$Noun)
outcomes
##
## geek nerd
## geek 227 61
## nerd 443 585
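# Overall percent correct
sum(diag(outcomes)) / sum(outcomes) * 100
## [1] 61.70213
# Percent of observed geek cases classified correctly
outcomes[1, 1] / sum(outcomes[, 1]) * 100
## [1] 33.8806
# Percent of observed nerd cases classified correctly
outcomes[2, 2] / sum(outcomes[, 2]) * 100
## [1] 90.55728
# Proportion of observed cases that are geek
sum(outcomes[, 1]) / sum(outcomes)
## [1] 0.5091185
# Proportion of cases predicted as geek
sum(outcomes[1, ]) / sum(outcomes)
## [1] 0.218845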
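The calculations above can be wrapped in a small helper so that the same summary is available for any model. This is only a sketch: `classification_summary` is a hypothetical name, not a function from party or Rling, and the counts are typed in from the table printed above instead of being recomputed from the model. It assumes rows are predicted categories and columns are observed categories, as in table(predict(tree.output), nerd$Noun).

```r
# Hypothetical helper: summarise a confusion table whose rows are
# predicted categories and whose columns are observed categories.
classification_summary <- function(confusion) {
  list(
    # Percent of all cases on the diagonal (predicted == observed)
    percent_correct = sum(diag(confusion)) / sum(confusion) * 100,
    # Percent of each observed category that was predicted correctly
    percent_correct_by_category = diag(confusion) / colSums(confusion) * 100,
    # Share of all cases assigned to each predicted category
    predicted_share = rowSums(confusion) / sum(confusion)
  )
}

# Counts copied from the conditional inference tree table above
# (matrix() fills column by column: observed geek first, then observed nerd)
tree_counts <- matrix(c(227, 443, 61, 585), nrow = 2,
                      dimnames = list(predicted = c("geek", "nerd"),
                                      observed = c("geek", "nerd")))

tree_summary <- classification_summary(tree_counts)
tree_summary
```

Passing a different table, for example one built from another model's predictions, returns the same three summaries without retyping the index arithmetic.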
forest.output = cforest(Noun ~ Num + Century + Register + Eval,
data = nerd,
controls = cforest_unbiased(ntree = 1000,
mtry = 3))
forest.importance = varimp(forest.output,
conditional = TRUE)
round(forest.importance, 2)
## Num Century Register Eval
## 0.00 0.02 0.00 0.06
dotchart(sort(forest.importance),
main = "Conditional Importance of Variables")
forest.outcomes = table(predict(forest.output), nerd$Noun)
forest.outcomes
##
## geek nerd
## geek 376 186
## nerd 294 460
# Overall percent correct
sum(diag(forest.outcomes)) / sum(forest.outcomes) * 100
## [1] 63.52584
# Percent of observed geek cases classified correctly
forest.outcomes[1, 1] / sum(forest.outcomes[, 1]) * 100
## [1] 56.1194
# Percent of observed nerd cases classified correctly
forest.outcomes[2, 2] / sum(forest.outcomes[, 2]) * 100
## [1] 71.20743
# Proportion of cases predicted as geek
sum(forest.outcomes[1, ]) / sum(forest.outcomes)
## [1] 0.4270517
The random forest model was more predictive than the single conditional inference tree: 63.5% versus 61.7% correct classification overall.
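To put that comparison in context, both accuracies can be set against a baseline that always guesses the more frequent category. The sketch below uses only the counts printed earlier in this report; the variable names are illustrative and not part of the original analysis.

```r
# Category counts from table(nerd$Noun) above
noun_counts <- c(geek = 670, nerd = 646)

# Baseline: always predict the more frequent category
baseline <- max(noun_counts) / sum(noun_counts) * 100

# Percent correct, taken from the diagonals of the two prediction tables
accuracy <- c(
  baseline = baseline,
  ctree    = (227 + 585) / sum(noun_counts) * 100,
  cforest  = (376 + 460) / sum(noun_counts) * 100
)

round(accuracy, 2)

# Improvement of each model over the baseline, in percentage points
round(accuracy[c("ctree", "cforest")] - accuracy["baseline"], 2)
```

Comparing against the baseline shows how much of each model's accuracy comes from the predictors and how much would be obtained by always guessing the larger category.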
To improve the model, we could add predictors that capture the content and language surrounding each use of geek versus nerd.