Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.
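A minimal setup sketch, assuming the analyses below were run with the party package (which supplies ctree(), cforest(), and varimp()):

```r
# Assumed setup: the party package provides ctree() for conditional
# inference trees, cforest() for conditional random forests, and varimp()
library(party)
```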
The data for this project has already been loaded. You will be distinguishing between the categories of nerd and geek to determine the influence of respective variables on their category definition.
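A frequency table like the one below could be produced with a call along these lines; the data frame name gn_data and the outcome column Noun are placeholder names, not names taken from the original project:

```r
# Counts of each category; gn_data and Noun are hypothetical names for
# the pre-loaded data frame and its geek/nerd outcome column
table(gn_data$Noun)
```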
##
## geek nerd
## 670 646
Dependent variable: the category label, geek versus nerd.
Independent variables: Num, Century, Register, and Eval.
Use ctree() to create a conditional inference model. Write out an interpretation of the results from the model. You can interpret the branches of the tree to determine what featurally defines each category.
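A sketch of the model call, reusing the assumed gn_data and Noun names and the four predictors reported in the variable importance output later in the report:

```r
# Conditional inference tree predicting geek vs. nerd from the four
# predictors; plot() draws the tree with a proportion bar graph at each
# terminal node
gn_tree <- ctree(Noun ~ Num + Century + Register + Eval, data = gn_data)
plot(gn_tree)
```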
The first split is on Eval, the evaluative context of the word (positive, negative, or neutral). The second level of splits is on Century, the time period the example comes from. The tree did not find a significant split on Register, the register the data was coded from, or on Num, a measure of social group. Therefore, the word “geek” is more common in the 21st century and in positive contexts, whereas “nerd” is more common in the 20th century and in neutral or negative contexts.
With only two categories, the bar graph at each terminal node shows the proportion split - look for the group with the larger proportion.
The group with the largest proportion is Node 3, where “geek” is much more likely than “nerd” in the 21st century and in positive contexts.
Calculate the percent correct classification for the conditional inference model.
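A sketch of the percent correct calculation, assuming the gn_tree object above; predict() on the tree returns a predicted category for every observation, and the confusion matrix below has predicted categories as rows and actual categories as columns:

```r
# Confusion matrix and classification accuracies for the tree model
tree_preds <- predict(gn_tree)
tree_table <- table(tree_preds, gn_data$Noun)  # rows = predicted, cols = actual
tree_table
sum(diag(tree_table)) / sum(tree_table) * 100                 # overall
tree_table["geek", "geek"] / sum(tree_table[, "geek"]) * 100  # geek only
tree_table["nerd", "nerd"] / sum(tree_table[, "nerd"]) * 100  # nerd only
sum(gn_data$Noun == "geek") / nrow(gn_data)     # actual proportion of geek
sum(tree_preds == "geek") / length(tree_preds)  # predicted proportion of geek
```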
##
## geek nerd
## geek 227 61
## nerd 443 585
## [1] 61.70213
## [1] 33.8806
## [1] 90.55728
## [1] 0.5091185
## [1] 0.218845
The overall classification accuracy of the model is 61.70%. However, it is very poor at identifying “geek”, with only 33.88% accuracy, while it predicts “nerd” much better, at 90.56% accuracy. This is because the model’s predictions are heavily skewed towards “nerd”: the actual data has a geek:nerd split of 51:49, whereas the predictions have a split of 22:78.
Create a random forest of the same model for geek versus nerd.
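A sketch of the forest, assuming party::cforest() with the same formula; the seed and number of trees are illustrative choices, not values recovered from the original analysis:

```r
# Conditional random forest for geek vs. nerd; set.seed() makes the run
# reproducible and varimp() reports permutation variable importance
set.seed(12345)
gn_forest <- cforest(Noun ~ Num + Century + Register + Eval,
                     data = gn_data,
                     controls = cforest_unbiased(ntree = 1000))
round(varimp(gn_forest), 3)
```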
## Num Century Register Eval
## -0.002 0.023 -0.003 0.056
The most important variables are Eval (0.056) and Century (0.023); Register and Num have importance values at or below zero, meaning they contribute essentially nothing to the classification.
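The same percent correct calculation can be applied to the forest; whether the original analysis scored out-of-bag predictions is an assumption here:

```r
# Forest confusion matrix and accuracies; OOB = TRUE scores each case
# only with trees that did not see it during training
forest_preds <- predict(gn_forest, OOB = TRUE)
forest_table <- table(forest_preds, gn_data$Noun)
forest_table
sum(diag(forest_table)) / sum(forest_table) * 100                 # overall
forest_table["geek", "geek"] / sum(forest_table[, "geek"]) * 100  # geek only
forest_table["nerd", "nerd"] / sum(forest_table[, "nerd"]) * 100  # nerd only
```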
##
## geek nerd
## geek 337 149
## nerd 333 497
## [1] 63.37386
## [1] 50.29851
## [1] 76.93498
## [1] 0.3693009
The random forest model is slightly more accurate than the tree model, at 63.37%, and it is much less skewed towards predicting “nerd”. Where the tree model’s geek:nerd prediction split was 22:78, the random forest’s split is 37:63, closer to the actual data split of 51:49. Accuracy for predicting “geek” improves to 50.30%, although accuracy for predicting “nerd” drops to 76.93%.
What other variables might be useful in understanding the category membership of geek versus nerd? Basically, what could we add to the model to improve it (there’s no one right answer here - it’s helpful to think about what other facets we have not measured)?