Pallavi Saitu_ANLY 540 Assignment 6

Load the libraries + functions

Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.

The data for this project has already been loaded. You will be distinguishing between the categories of nerd and geek to determine the influence of respective variables on their category definition.

library(Rling)
library(party)

## Warning: package 'party' was built under R version 3.5.2

## Loading required package: grid

## Loading required package: mvtnorm

## Warning: package 'mvtnorm' was built under R version 3.5.2

## Loading required package: modeltools

## Loading required package: stats4

## Loading required package: strucchange

## Loading required package: zoo

## Warning: package 'zoo' was built under R version 3.5.2

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Loading required package: sandwich

## Warning: package 'sandwich' was built under R version 3.5.2

data(nerd)
head(nerd)

##   Noun Num Century Register    Eval
## 1 nerd  pl      XX     ACAD Neutral
## 2 nerd  pl     XXI     ACAD Neutral
## 3 nerd  pl      XX     ACAD Neutral
## 4 nerd  pl      XX     ACAD Neutral
## 5 nerd  pl      XX     ACAD Neutral
## 6 nerd  pl     XXI     ACAD Neutral

table(nerd$Noun)

## 
## geek nerd 
##  670  646

Description of the data

Dependent variable:

Noun: which category is represented, either nerd or geek.

Independent variables:

Num: a measure of social group, either pl (plural) or sg (singular)
Century: time measurement, as XX (20th) or XXI (21st) century
Register: information about where the data was coded from: ACAD (academic), MAG (magazine), NEWS (newspapers), or SPOK (spoken)
Eval: A measure of the semanticity of the word, Neg for negative, Neutral, and Positive

Conditional inference model

Set the seed of the random number generator before fitting the model, so that the results can be reproduced.
Use ctree() to create a conditional inference model.

set.seed(1000)
tree.output = ctree(Noun ~ Num + Century + Register + Eval, data = nerd)
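
As a minimal sketch of what set.seed() does, the example below uses sample() as a stand-in for any step that relies on random numbers. The object names (first.draw, second.draw, third.draw) are illustrative only and are not part of the assignment.

```r
# Reset the seed, then draw five numbers at random.
set.seed(1000)
first.draw = sample(1:100, 5)

# Resetting to the same seed replays the same random stream.
set.seed(1000)
second.draw = sample(1:100, 5)

# Without resetting, the next draw continues the stream instead of replaying it.
third.draw = sample(1:100, 5)

identical(first.draw, second.draw)  # same seed, same draw
identical(first.draw, third.draw)   # compare with a draw made without resetting
```

Any model that resamples the data or the predictors, such as a random forest, benefits from a fixed seed in the same way.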

Make a plot

Plot the conditional inference model.

plot(tree.output)

Interpret the categories

Write out an interpretation of the results from the model. You can interpret the branches of the tree to determine what featurally defines each category. With only two categories, you will see the proportion split as the output in the bar graph - look for the group with the larger proportion.

Every split in the tree has a p value less than 0.05, which means each split is statistically significant. In the bar graphs of the plot above, the second of the two categories has the larger proportion.

Conditional inference model predictiveness

Calculate the percent correct classification for the conditional inference model.

outcomes = table(predict(tree.output), nerd$Noun)
outcomes

##       
##        geek nerd
##   geek  227   61
##   nerd  443  585

# Overall percent correct (correct predictions are on the diagonal)
sum(diag(outcomes)) / sum(outcomes) * 100

## [1] 61.70213

# Percent of actual geek cases classified as geek
outcomes[1, 1] / sum(outcomes[, 1]) * 100

## [1] 33.8806

# Percent of actual nerd cases classified as nerd
outcomes[2, 2] / sum(outcomes[, 2]) * 100

## [1] 90.55728

# Proportion of all cases that are actually geek
sum(outcomes[, 1]) / sum(outcomes)

## [1] 0.5091185

# Proportion of all cases that the tree predicts as geek
sum(outcomes[1, ]) / sum(outcomes)

## [1] 0.218845
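
The calculations above can be repeated without the fitted model by rebuilding the table from the printed counts. This is only a sketch: tree.counts and the other object names are illustrative, and indexing by category name instead of by position is a stylistic choice, not something the assignment requires.

```r
# Rebuild the confusion matrix from the counts printed above
# (rows = predicted category, columns = actual category).
tree.counts = matrix(c(227, 443, 61, 585), nrow = 2,
                     dimnames = list(predicted = c("geek", "nerd"),
                                     actual = c("geek", "nerd")))

# Overall percent correct: correct predictions sit on the diagonal.
accuracy = sum(diag(tree.counts)) / sum(tree.counts) * 100

# Percent of actual geek and actual nerd cases classified correctly.
geek.correct = tree.counts["geek", "geek"] / sum(tree.counts[, "geek"]) * 100
nerd.correct = tree.counts["nerd", "nerd"] / sum(tree.counts[, "nerd"]) * 100

# Share of cases that are actually geek versus share predicted as geek.
actual.geek = sum(tree.counts[, "geek"]) / sum(tree.counts)
predicted.geek = sum(tree.counts["geek", ]) / sum(tree.counts)

round(c(accuracy = accuracy, geek = geek.correct, nerd = nerd.correct), 2)
round(c(actual.geek = actual.geek, predicted.geek = predicted.geek), 3)
```

Geek makes up about 51% of the data, but the tree predicts geek for only about 22% of cases, which is why so few actual geek cases are classified correctly.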

Random forests

Create a random forest of the same model for geek versus nerd.

forest.output = cforest(Noun ~ Num + Century + Register + Eval, 
                        data = nerd,
                        controls = cforest_unbiased(ntree = 1000,
                                                    mtry = 3))

Variable importance

Calculate the variable importance from the random forest model.
Include a dot plot of the importance values.
Which variables were the most important?

forest.importance = varimp(forest.output,
                           conditional = TRUE)
round(forest.importance, 2)

##      Num  Century Register     Eval 
##     0.00     0.02     0.00     0.06

Eval was the most important variable (0.06), followed by Century (0.02). Num and Register had importance values of about 0.

dotchart(sort(forest.importance),
         main = "Conditional Importance of Variables")
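
To illustrate the idea behind permutation-based importance, the sketch below shuffles one predictor at a time and measures how much classification accuracy drops. It is a simplified, unconditional version run on simulated data with a logistic regression as a stand-in classifier; it is not the conditional scheme that varimp() applies to the forest, and every name in it (toy, signal, noise, permutation.importance) is hypothetical.

```r
set.seed(1000)

# Hypothetical toy data: 'signal' drives the outcome, 'noise' does not.
n = 500
signal = factor(sample(c("a", "b"), n, replace = TRUE))
noise = factor(sample(c("x", "y"), n, replace = TRUE))
flip = runif(n) < 0.1   # about 10% of outcomes are flipped at random
category = factor(ifelse(xor(signal == "a", flip), "geek", "nerd"))
toy = data.frame(category, signal, noise)

# Stand-in classifier (logistic regression, not a random forest).
fit = glm(category ~ signal + noise, data = toy, family = binomial)

# Proportion of cases classified correctly in a given data frame.
# The fitted probability is for the second factor level, "nerd".
accuracy = function(data) {
  probability = predict(fit, newdata = data, type = "response")
  predicted = ifelse(probability > 0.5, "nerd", "geek")
  mean(predicted == data$category)
}

# Importance = drop in accuracy after shuffling one predictor.
permutation.importance = function(variable) {
  shuffled = toy
  shuffled[[variable]] = sample(shuffled[[variable]])
  accuracy(toy) - accuracy(shuffled)
}

toy.importance = sapply(c("signal", "noise"), permutation.importance)
round(toy.importance, 2)
```

A predictor whose shuffling barely changes accuracy, like noise here, ends up with an importance near zero, which is how values of 0.00 in the output above can be read.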

Forest model predictiveness

Include the percent correct for the random forest model.
Did it do better than the conditional inference tree?

forest.outcomes = table(predict(forest.output), nerd$Noun)
forest.outcomes

##       
##        geek nerd
##   geek  376  186
##   nerd  294  460

# Overall percent correct for the forest
sum(diag(forest.outcomes)) / sum(forest.outcomes) * 100

## [1] 63.52584

# Percent of actual geek cases classified as geek
forest.outcomes[1, 1] / sum(forest.outcomes[, 1]) * 100

## [1] 56.1194

# Percent of actual nerd cases classified as nerd
forest.outcomes[2, 2] / sum(forest.outcomes[, 2]) * 100

## [1] 71.20743

# Proportion of all cases that the forest predicts as geek
sum(forest.outcomes[1, ]) / sum(forest.outcomes)

## [1] 0.4270517

The random forest model did slightly better than the conditional inference tree: it classified 63.53% of cases correctly, compared with 61.70% for the tree.
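
A side-by-side comparison can be sketched from the two printed confusion matrices. The helper percent.correct() and the baseline of always guessing the more frequent category are additions for illustration, not part of the assignment.

```r
# Confusion matrices rebuilt from the counts printed in this report
# (rows = predicted category, columns = actual category).
dim.labels = list(predicted = c("geek", "nerd"), actual = c("geek", "nerd"))
tree.counts = matrix(c(227, 443, 61, 585), nrow = 2, dimnames = dim.labels)
forest.counts = matrix(c(376, 294, 186, 460), nrow = 2, dimnames = dim.labels)

# Percent correct overall and within each actual category.
percent.correct = function(counts) {
  c(overall = sum(diag(counts)) / sum(counts) * 100,
    geek = counts["geek", "geek"] / sum(counts[, "geek"]) * 100,
    nerd = counts["nerd", "nerd"] / sum(counts[, "nerd"]) * 100)
}

comparison = rbind(tree = percent.correct(tree.counts),
                   forest = percent.correct(forest.counts))

# Baseline: always guess the more frequent category (geek, 670 of 1316).
baseline = max(colSums(tree.counts)) / sum(tree.counts) * 100

round(comparison, 2)
round(baseline, 2)
```

Both models beat the baseline of about 50.91%. The forest gains most on geek (56.12% versus 33.88% correct) at the cost of nerd (71.21% versus 90.56%), so its errors are more evenly spread across the two categories.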

Thought question

What other variables might be useful in understanding the category membership of geek versus nerd? Basically, what could we add to the model to improve it (there’s no one right answer here - it’s helpful to think about what other facets we have not measured)?

To improve the model, we could add variables that describe the content and the language of the contexts in which geek and nerd are used.