Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.
The data for this project has already been loaded. You will distinguish between the categories nerd and geek to determine how each variable influences which category is used.
library(Rling)
data(nerd)
head(nerd)
## Noun Num Century Register Eval
## 1 nerd pl XX ACAD Neutral
## 2 nerd pl XXI ACAD Neutral
## 3 nerd pl XX ACAD Neutral
## 4 nerd pl XX ACAD Neutral
## 5 nerd pl XX ACAD Neutral
## 6 nerd pl XXI ACAD Neutral
#install.packages("party")
library(party)
## Warning: package 'party' was built under R version 3.5.3
## Loading required package: grid
## Loading required package: mvtnorm
## Warning: package 'mvtnorm' was built under R version 3.5.3
## Loading required package: modeltools
## Warning: package 'modeltools' was built under R version 3.5.2
## Loading required package: stats4
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 3.5.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.5.3
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 3.5.3
table(nerd$Noun)
##
## geek nerd
## 670 646
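Before modelling, it can help to know how well a trivial classifier would do. The sketch below uses only the two counts printed above (copied in by hand, so it runs without the Rling package) and computes the share of each noun and the accuracy of always guessing the more frequent one. The object names are illustrative and are not part of the assignment.

```r
# Class counts copied from the table above
noun_counts <- c(geek = 670, nerd = 646)

# Share of each noun in the data
noun_share <- prop.table(noun_counts)
round(noun_share * 100, 1)

# Accuracy (in %) of always guessing the more frequent noun
baseline_accuracy <- max(noun_share) * 100
baseline_accuracy
```

Any model accuracy reported later is most meaningful when compared against this baseline.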
Dependent variable: Noun (nerd or geek)
Independent variables: Num, Century, Register, Eval
Use ctree() to create a conditional inference model.
set.seed(12345)
tree.output = ctree(Noun ~ Num + Century + Register + Eval, data = nerd)
plot(tree.output)
predResults = table(predict(tree.output), nerd$Noun)
predResults
##
## geek nerd
## geek 227 61
## nerd 443 585
sum(diag(predResults)) / sum(predResults) * 100
## [1] 61.70213
# Percent of actual geek cases that the tree predicts correctly
predResults["geek", "geek"] / sum(predResults[, "geek"]) * 100
## [1] 33.8806
# Percent of actual nerd cases that the tree predicts correctly
predResults["nerd", "nerd"] / sum(predResults[, "nerd"]) * 100
## [1] 90.55728
# Actual proportion of geek in the data (column total over grand total)
sum(predResults[, "geek"]) / sum(predResults)
## [1] 0.5091185
# Proportion of cases the tree predicts as geek (row total over grand total)
sum(predResults["geek", ]) / sum(predResults)
## [1] 0.218845
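The individual calculations above can be gathered into one reusable function. This is a minimal sketch, assuming a square confusion matrix with predicted classes in rows and actual classes in columns, as produced by table(predict(...), nerd$Noun). The matrix values are copied by hand from the output above, and summarise_cm is a hypothetical helper name, not a function from party or Rling.

```r
# Confusion matrix copied from the output above:
# rows = predicted noun, columns = actual noun
tree_cm <- matrix(c(227, 443, 61, 585), nrow = 2,
                  dimnames = list(predicted = c("geek", "nerd"),
                                  actual = c("geek", "nerd")))

# Hypothetical helper: summarise a square confusion matrix
summarise_cm <- function(cm) {
  list(
    overall_accuracy = sum(diag(cm)) / sum(cm) * 100,   # % of all cases correct
    class_accuracy   = diag(cm) / colSums(cm) * 100,    # % correct per actual class
    actual_share     = colSums(cm) / sum(cm),           # observed class proportions
    predicted_share  = rowSums(cm) / sum(cm)            # predicted class proportions
  )
}

tree_summary <- summarise_cm(tree_cm)
tree_summary
```

Comparing predicted_share with actual_share shows how strongly the model leans toward one class.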
forestRand.output = cforest(Noun ~ Num + Century + Register + Eval, data = nerd, controls = cforest_unbiased(ntree = 1000, mtry = 2))
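As the output shows, the most important variables are Eval and Century; Num and Register have importance scores close to zero. Eval: the evaluation expressed in the context (for example, Neutral). Century: the time period of the source, 20th (XX) or 21st (XXI) century.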
forVariable.importance = varimp(forestRand.output,
                                conditional = TRUE)
round(forVariable.importance, 3)
## Num Century Register Eval
## -0.002 0.024 -0.003 0.056
dotchart(sort(forVariable.importance),
main = "Variable Importance")
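One common rule of thumb for reading importance scores, assumed here rather than taken from the assignment, treats the absolute value of the most negative score as the noise level and considers only variables above it informative. The sketch below applies that rule to the rounded scores printed above, copied by hand; the object names are illustrative.

```r
# Rounded importance scores copied from the output above
imp <- c(Num = -0.002, Century = 0.024, Register = -0.003, Eval = 0.056)

# Assumed rule of thumb: scores no larger than the size of the
# most negative score are indistinguishable from noise
noise_level <- abs(min(imp))

# Variables whose importance exceeds the noise level
informative <- names(imp)[imp > noise_level]
informative
```

Because a random forest varies from run to run, scores near the threshold may change sides when the model is refitted.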
The forest model is less biased toward predicting “nerd”. The actual geek:nerd split is 51:49; the tree model predicts a 22:78 split, whereas the random forest predicts 37:63. With the forest, prediction accuracy for nerd drops from 90.6% to 76.9%, while accuracy for geek rises from 33.9% to 50.3%. Overall accuracy is similar (61.7% for the tree, 63.4% for the forest), so each method has pros and cons.
forestPredict.outcomes = table(predict(forestRand.output), nerd$Noun)
forestPredict.outcomes
##
## geek nerd
## geek 337 149
## nerd 333 497
sum(diag(forestPredict.outcomes)) / sum(forestPredict.outcomes) * 100
## [1] 63.37386
# Percent of actual geek cases that the forest predicts correctly
forestPredict.outcomes["geek", "geek"] / sum(forestPredict.outcomes[, "geek"]) * 100
## [1] 50.29851
# Percent of actual nerd cases that the forest predicts correctly
forestPredict.outcomes["nerd", "nerd"] / sum(forestPredict.outcomes[, "nerd"]) * 100
## [1] 76.93498
# Proportion of cases the forest predicts as geek (row total over grand total)
sum(forestPredict.outcomes["geek", ]) / sum(forestPredict.outcomes)
## [1] 0.3693009
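To compare the two models directly, their confusion matrices can be placed side by side. The sketch below copies both matrices by hand from the outputs in this report, so it runs without refitting either model, and builds one row of percentages per model. All object and column names are illustrative.

```r
# Confusion matrices copied from the outputs in this report:
# rows = predicted noun, columns = actual noun
labels <- list(predicted = c("geek", "nerd"), actual = c("geek", "nerd"))
tree_cm   <- matrix(c(227, 443, 61, 585), nrow = 2, dimnames = labels)
forest_cm <- matrix(c(337, 333, 149, 497), nrow = 2, dimnames = labels)
models <- list(tree = tree_cm, forest = forest_cm)

# One row per model, all values in percent
comparison <- data.frame(
  model = names(models),
  accuracy = sapply(models, function(cm) sum(diag(cm)) / sum(cm) * 100),
  geek_correct = sapply(models, function(cm) cm["geek", "geek"] / sum(cm[, "geek"]) * 100),
  nerd_correct = sapply(models, function(cm) cm["nerd", "nerd"] / sum(cm[, "nerd"]) * 100),
  predicted_geek = sapply(models, function(cm) sum(cm["geek", ]) / sum(cm) * 100),
  actual_geek = sapply(models, function(cm) sum(cm[, "geek"]) / sum(cm) * 100),
  row.names = NULL
)
comparison
```

The gap between predicted_geek and actual_geek gives a rough measure of each model's bias toward “nerd”.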
A geek is someone who is more knowledgeable about books, while a nerd is someone who specializes in a specific body of knowledge.