Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.
The data for this project has already been loaded. You will distinguish between the nouns nerd and geek and determine how each variable influences which of the two is used.
library(Rling)
data(nerd)
head(nerd)
## Noun Num Century Register Eval
## 1 nerd pl XX ACAD Neutral
## 2 nerd pl XXI ACAD Neutral
## 3 nerd pl XX ACAD Neutral
## 4 nerd pl XX ACAD Neutral
## 5 nerd pl XX ACAD Neutral
## 6 nerd pl XXI ACAD Neutral
#install.packages("party")
library(party)
## Warning: package 'party' was built under R version 3.6.1
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 3.6.1
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 3.6.1
table(nerd$Noun)
##
## geek nerd
## 670 646
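Because the two nouns are almost evenly balanced, the share of the larger class is a useful reference point for the accuracy figures reported later. The following is a minimal sketch with the counts from the table above typed in by hand; the names noun.counts, noun.props and baseline are illustrative and not part of the original analysis.

```r
# Counts copied from the table(nerd$Noun) output above
noun.counts <- c(geek = 670, nerd = 646)

# Proportion of each noun in the data
noun.props <- noun.counts / sum(noun.counts)
round(noun.props, 3)

# Accuracy (in percent) of always guessing the more frequent noun
baseline <- max(noun.props) * 100
round(baseline, 2)
```

A model that does not clearly beat this baseline has learned little beyond the class frequencies.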
Dependent variable: Noun (geek or nerd)
Independent variables: Num, Century, Register, Eval
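One way to read these roles off the model formula used in this report is base R's all.vars(), which lists the response first and the predictors after it. This is only a sketch; the object names model.formula, dependent and independent are assumptions made for illustration.

```r
# The model formula used for the tree and the forest in this report
model.formula <- Noun ~ Num + Century + Register + Eval

# all.vars() returns every variable in the formula, left-hand side first
formula.vars <- all.vars(model.formula)
dependent <- formula.vars[1]
independent <- formula.vars[-1]

dependent
independent
```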
Use ctree() to create a conditional inference model.
set.seed(12345)
tree.output = ctree(Noun ~ Num + Century + Register + Eval, data = nerd)
plot(tree.output)
Final_Result = table(predict(tree.output), nerd$Noun)
Final_Result
##
## geek nerd
## geek 227 61
## nerd 443 585
# Overall accuracy of the tree model (%)
sum(diag(Final_Result)) / sum(Final_Result) * 100
## [1] 61.70213
# Percentage of actual "geek" tokens predicted correctly
Final_Result["geek", "geek"] / sum(Final_Result[, "geek"]) * 100
## [1] 33.8806
# Percentage of actual "nerd" tokens predicted correctly
Final_Result["nerd", "nerd"] / sum(Final_Result[, "nerd"]) * 100
## [1] 90.55728
# Proportion of tokens that are actually "geek"
sum(Final_Result[, "geek"]) / sum(Final_Result)
## [1] 0.5091185
# Proportion of tokens that the tree predicts as "geek"
sum(Final_Result["geek", ]) / sum(Final_Result)
## [1] 0.218845
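The calculations above can be gathered into one small helper. The sketch below rebuilds the confusion table by hand from the printed output, so it runs without the data or the fitted tree; tree.table and summarise.table are hypothetical names, not objects from the original analysis.

```r
# Confusion table copied from the Final_Result output above
# (rows = predicted noun, columns = actual noun)
tree.table <- matrix(c(227, 443, 61, 585), nrow = 2,
                     dimnames = list(predicted = c("geek", "nerd"),
                                     actual = c("geek", "nerd")))

# Hypothetical helper: overall accuracy, per-noun accuracy, and the
# share of predictions assigned to each noun, all in percent
summarise.table <- function(tab) {
  list(
    accuracy = sum(diag(tab)) / sum(tab) * 100,
    per.noun = diag(tab) / colSums(tab) * 100,
    predicted.share = rowSums(tab) / sum(tab) * 100
  )
}

tree.summary <- summarise.table(tree.table)
lapply(tree.summary, round, 2)
```

Indexing by row and column sums rather than by position keeps the calculation readable and makes it harder to mix up predicted and actual counts.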
# Fit a conditional random forest with 1000 trees, trying 2 variables per split
forest.output = cforest(Noun ~ Num + Century + Register + Eval,
                        data = nerd,
                        controls = cforest_unbiased(ntree = 1000,
                                                    mtry = 2))
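The most important variables in the output are Eval and Century; Num and Register have importance scores close to zero.
Eval: the evaluative tone attached to the word in context (for example, Neutral).
Century: when the example was produced, either the 20th (XX) or the 21st (XXI) century.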
forest.importance = varimp(forest.output,
                           conditional = TRUE)
round(forest.importance, 3)
## Num Century Register Eval
## -0.002 0.023 -0.003 0.056
dotchart(sort(forest.importance),
         main = "Variable Importance")
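A common rule of thumb for permutation importance, which this report does not itself state and which is used here only as an assumption, treats a variable as informative when its score exceeds the absolute value of the most negative score. The sketch below applies that cutoff to the rounded scores printed above; importance, cutoff and informative are illustrative names.

```r
# Importance scores copied from the rounded varimp() output above
importance <- c(Num = -0.002, Century = 0.023, Register = -0.003, Eval = 0.056)

# Rule-of-thumb cutoff: absolute value of the most negative score
cutoff <- abs(min(importance))

# Variables whose importance clears the cutoff, largest first
informative <- sort(importance[importance > cutoff], decreasing = TRUE)
informative
```

Because forest importance scores vary from run to run, variables sitting close to the cutoff should be read with caution.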
The forest model, with an accuracy of 63.37%, is slightly more accurate than the tree model (61.70%), and it is much less biased toward predicting “nerd”. The tree model's geek:nerd prediction split was 22:78; the random forest's split of 37:63 is closer to the actual split of 51:49. The accuracy in predicting “geek” improves from 33.9% to 50.3%, at the cost of reducing the prediction accuracy for “nerd” from 90.6% to 76.9%.
forest.outcomes = table(predict(forest.output), nerd$Noun)
forest.outcomes
##
## geek nerd
## geek 337 149
## nerd 333 497
# Overall accuracy of the forest model (%)
sum(diag(forest.outcomes)) / sum(forest.outcomes) * 100
## [1] 63.37386
# Percentage of actual "geek" tokens predicted correctly
forest.outcomes["geek", "geek"] / sum(forest.outcomes[, "geek"]) * 100
## [1] 50.29851
# Percentage of actual "nerd" tokens predicted correctly
forest.outcomes["nerd", "nerd"] / sum(forest.outcomes[, "nerd"]) * 100
## [1] 76.93498
# Proportion of tokens that the forest predicts as "geek"
sum(forest.outcomes["geek", ]) / sum(forest.outcomes)
## [1] 0.3693009
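To compare the two models directly, their figures can be placed side by side. The sketch below rebuilds both confusion tables by hand from the outputs printed in this report, so it does not need the fitted models; dim.labels, model.metrics and comparison are hypothetical names introduced for illustration.

```r
# Confusion tables copied from the outputs in this report
# (rows = predicted noun, columns = actual noun)
dim.labels <- list(predicted = c("geek", "nerd"), actual = c("geek", "nerd"))
tree.table <- matrix(c(227, 443, 61, 585), nrow = 2, dimnames = dim.labels)
forest.table <- matrix(c(337, 333, 149, 497), nrow = 2, dimnames = dim.labels)

# Hypothetical helper returning the headline figures in percent
model.metrics <- function(tab) {
  c(accuracy = sum(diag(tab)) / sum(tab),
    geek.correct = tab["geek", "geek"] / sum(tab[, "geek"]),
    nerd.correct = tab["nerd", "nerd"] / sum(tab[, "nerd"]),
    predicted.geek = sum(tab["geek", ]) / sum(tab)) * 100
}

# One row per model, one column per metric
comparison <- rbind(tree = model.metrics(tree.table),
                    forest = model.metrics(forest.table))
round(comparison, 2)
```

Reading across the rows shows the trade-off: the forest gains accuracy on “geek” and loses some on “nerd”, while its share of “geek” predictions moves closer to the actual share in the data.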
Urban Dictionary humorously defines hot and cool nerds as rare and special creatures whose knowledge spans a wide variety of subjects. Hence, it would be useful to see how different authors define geek and nerd.