Load the libraries + functions

Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.

You will be distinguishing between the categories nerd and geek to determine the influence of the respective variables on how each category is defined. Semantically, the words are similar, but in this exercise you will determine whether they are used differently and which features may separate them into different categories.

The data for this project has already been loaded. Drawing from multiple corpora, for each instance where the word nerd or geek was used, the context surrounding it was coded, including whether the word referenced one or multiple people (Num), when it was used (Century: 20th or 21st), the type of corpus (academic, magazines, newspapers, spoken language), and the valence/sentiment of the word (negative, neutral, or positive).

Questions that require a written response are marked with **. If you are having trouble with the Rling library, the nerd data is available on Canvas and you can load it directly.

##r chunk
library(Rling)
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
data(nerd)
head(nerd)
##   Noun Num Century Register    Eval
## 1 nerd  pl      XX     ACAD Neutral
## 2 nerd  pl     XXI     ACAD Neutral
## 3 nerd  pl      XX     ACAD Neutral
## 4 nerd  pl      XX     ACAD Neutral
## 5 nerd  pl      XX     ACAD Neutral
## 6 nerd  pl     XXI     ACAD Neutral
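
As a quick added check (not part of the original assignment output), summary() on the data frame shows the level counts for each coded variable before any modeling:

##r chunk

summary(nerd) # added sketch: factor level counts for Noun, Num, Century, Register, Eval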

Description of the data

Dependent variable: Noun, the word used in context (nerd or geek).

Independent variables: Num (singular vs. plural reference), Century (XX vs. XXI), Register (academic, magazine, newspaper, or spoken), and Eval (negative, neutral, or positive valence).

Conditional inference model

##r chunk

set.seed(10)
# fit a conditional inference tree predicting the noun from all four coded variables
tree <- ctree(Noun ~ Num + Century + Register + Eval, data = nerd)
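
Optionally (an added line, not in the original output), printing the fitted object gives a text view of each split with its test statistic and p-value:

##r chunk

print(tree) # text summary of the splits chosen by ctree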

Make a plot

##r chunk

plot(tree)

Interpret the categories

Answer: From the tree graph above, we can see that the valence of the word is the strongest predictor of the outcome (it forms the first split), followed by the century in which the word is used. For instance, if the valence is positive and the time is the 21st century, the word “geek” is far more likely to be used; in the 20th century, “nerd” is still used less often than “geek”, but more frequently than a century later. Similarly, if the valence is negative or neutral, “nerd” is used more often in the 20th century, whereas in the 21st century the usage frequencies of “geek” and “nerd” are nearly identical.
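
A small added sketch (assuming the nerd data frame loaded above) that checks the tree's story numerically, using row proportions of each noun within valence and within century:

##r chunk

# proportion of geek vs. nerd within each valence level and each century
round(prop.table(xtabs(~ Eval + Noun, data = nerd), margin = 1), 2)
round(prop.table(xtabs(~ Century + Noun, data = nerd), margin = 1), 2)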

Conditional inference model predictiveness

##r chunk

tree_outcomes <- table(predict(tree), nerd$Noun) # confusion table of predicted vs. actual nouns
sum(diag(tree_outcomes)) / sum(tree_outcomes) * 100 # percent correctly classified
## [1] 61.70213
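
For context (an added comparison, not in the original output), this accuracy is best judged against the majority-class baseline, i.e., the score you would get by always guessing the more frequent noun:

##r chunk

max(prop.table(table(nerd$Noun))) * 100 # baseline: percent accuracy from always guessing the majority class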

Random forests

##r chunk

# random forest of conditional inference trees: 1000 trees, 2 candidate predictors per split
forest <- cforest(Noun ~ Num + Century + Register + Eval, data = nerd,
                  controls = cforest_unbiased(ntree = 1000, mtry = 2))

Variable importance

Answer: From the chart generated below, we can see that Eval is the most important variable, followed by Century, while Num and Register contribute essentially nothing.

##r chunk

forest_importance <- varimp(forest, conditional = TRUE) # conditional permutation importance
round(forest_importance, 2)
##      Num  Century Register     Eval 
##     0.00     0.02     0.00     0.06
dotchart(sort(forest_importance), main = "Random Forest Model Variable Importance")
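
Since conditional = TRUE adjusts each variable's importance for its correlation with the other predictors, an optional added check is to compare it against the default, unconditional importance, which can inflate correlated predictors:

##r chunk

round(varimp(forest, conditional = FALSE), 2) # default permutation importance, for comparison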

Forest model predictiveness

Answer: Since the random forest model predicts more accurately (63.4% versus 61.7%), it is better than the conditional inference tree model.

##r chunk
forest_outcomes <- table(predict(forest), nerd$Noun) # confusion table of predicted vs. actual nouns
sum(diag(forest_outcomes)) / sum(forest_outcomes) * 100 # percent correctly classified
## [1] 63.44985
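
One caveat worth noting (an added sketch): the accuracy above scores the forest on the same data it was trained on. party's predict() method for cforest objects accepts OOB = TRUE, which scores each observation only with trees that did not see it, giving a less optimistic estimate:

##r chunk

oob_outcomes <- table(predict(forest, OOB = TRUE), nerd$Noun) # out-of-bag predictions
sum(diag(oob_outcomes)) / sum(oob_outcomes) * 100 # out-of-bag accuracy in percent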

Python model

##r chunk

library(reticulate)
repl_python() # switch to Python first

##python chunk
import pandas as pd # import the pandas package as pd to help manipulate the dataset

predictor_variables = pd.get_dummies(r.nerd[["Num", "Century", "Register", "Eval"]]) # one-hot encode the four predictors
target_variable = pd.get_dummies(r.nerd["Noun"]) # one-hot encode the outcome into geek/nerd columns

predictor_variables.head()
##    Num_pl  Num_sg  Century_XX  ...  Eval_Neg  Eval_Neutral  Eval_Positive
## 0       1       0           1  ...         0             1              0
## 1       1       0           0  ...         0             1              0
## 2       1       0           1  ...         0             1              0
## 3       1       0           1  ...         0             1              0
## 4       1       0           1  ...         0             1              0
## 
## [5 rows x 11 columns]
target_variable.head()
##    geek  nerd
## 0     0     1
## 1     0     1
## 2     0     1
## 3     0     1
## 4     0     1

Create the Tree

##python chunk

import sklearn #Scikit-learn is a free machine learning library in Python
from sklearn import tree

classTree = tree.DecisionTreeClassifier() # default decision tree classifier
classTree = classTree.fit(predictor_variables, target_variable) # fit the tree; the target here is the two dummy columns from get_dummies

Printing out the Tree

##python chunk

print(tree.export_text(classTree, feature_names = list(predictor_variables.columns))) # text rendering of the fitted splits
## |--- Eval_Positive <= 0.50
## |   |--- Century_XX <= 0.50
## |   |   |--- Num_sg <= 0.50
## |   |   |   |--- Register_MAG <= 0.50
## |   |   |   |   |--- Eval_Neutral <= 0.50
## |   |   |   |   |   |--- Register_SPOK <= 0.50
## |   |   |   |   |   |   |--- Register_NEWS <= 0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |   |--- Register_NEWS >  0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Register_SPOK >  0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- Eval_Neutral >  0.50
## |   |   |   |   |   |--- Register_ACAD <= 0.50
## |   |   |   |   |   |   |--- Register_SPOK <= 0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |   |--- Register_SPOK >  0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Register_ACAD >  0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |--- Register_MAG >  0.50
## |   |   |   |   |--- Eval_Neg <= 0.50
## |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- Eval_Neg >  0.50
## |   |   |   |   |   |--- class: 1
## |   |   |--- Num_sg >  0.50
## |   |   |   |--- Register_ACAD <= 0.50
## |   |   |   |   |--- Register_MAG <= 0.50
## |   |   |   |   |   |--- Eval_Neutral <= 0.50
## |   |   |   |   |   |   |--- Register_NEWS <= 0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |   |--- Register_NEWS >  0.50
## |   |   |   |   |   |   |   |--- class: 1
## |   |   |   |   |   |--- Eval_Neutral >  0.50
## |   |   |   |   |   |   |--- Register_NEWS <= 0.50
## |   |   |   |   |   |   |   |--- class: 1
## |   |   |   |   |   |   |--- Register_NEWS >  0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- Register_MAG >  0.50
## |   |   |   |   |   |--- Eval_Neutral <= 0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Eval_Neutral >  0.50
## |   |   |   |   |   |   |--- class: 1
## |   |   |   |--- Register_ACAD >  0.50
## |   |   |   |   |--- Eval_Neutral <= 0.50
## |   |   |   |   |   |--- class: 1
## |   |   |   |   |--- Eval_Neutral >  0.50
## |   |   |   |   |   |--- class: 0
## |   |--- Century_XX >  0.50
## |   |   |--- Num_pl <= 0.50
## |   |   |   |--- Register_ACAD <= 0.50
## |   |   |   |   |--- Register_MAG <= 0.50
## |   |   |   |   |   |--- Register_NEWS <= 0.50
## |   |   |   |   |   |   |--- Eval_Neutral <= 0.50
## |   |   |   |   |   |   |   |--- class: 1
## |   |   |   |   |   |   |--- Eval_Neutral >  0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Register_NEWS >  0.50
## |   |   |   |   |   |   |--- Eval_Neg <= 0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |   |--- Eval_Neg >  0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- Register_MAG >  0.50
## |   |   |   |   |   |--- Eval_Neg <= 0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Eval_Neg >  0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |--- Register_ACAD >  0.50
## |   |   |   |   |--- Eval_Neg <= 0.50
## |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- Eval_Neg >  0.50
## |   |   |   |   |   |--- class: 0
## |   |   |--- Num_pl >  0.50
## |   |   |   |--- Register_ACAD <= 0.50
## |   |   |   |   |--- Register_MAG <= 0.50
## |   |   |   |   |   |--- Eval_Neutral <= 0.50
## |   |   |   |   |   |   |--- Register_NEWS <= 0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |   |--- Register_NEWS >  0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Eval_Neutral >  0.50
## |   |   |   |   |   |   |--- Register_SPOK <= 0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |   |--- Register_SPOK >  0.50
## |   |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- Register_MAG >  0.50
## |   |   |   |   |   |--- Eval_Neutral <= 0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Eval_Neutral >  0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |--- Register_ACAD >  0.50
## |   |   |   |   |--- class: 0
## |--- Eval_Positive >  0.50
## |   |--- Century_XXI <= 0.50
## |   |   |--- Register_MAG <= 0.50
## |   |   |   |--- Register_ACAD <= 0.50
## |   |   |   |   |--- Num_sg <= 0.50
## |   |   |   |   |   |--- Register_SPOK <= 0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Register_SPOK >  0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- Num_sg >  0.50
## |   |   |   |   |   |--- Register_NEWS <= 0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- Register_NEWS >  0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |--- Register_ACAD >  0.50
## |   |   |   |   |--- class: 0
## |   |   |--- Register_MAG >  0.50
## |   |   |   |--- Num_pl <= 0.50
## |   |   |   |   |--- class: 1
## |   |   |   |--- Num_pl >  0.50
## |   |   |   |   |--- class: 1
## |   |--- Century_XXI >  0.50
## |   |   |--- Register_MAG <= 0.50
## |   |   |   |--- Register_ACAD <= 0.50
## |   |   |   |   |--- Register_SPOK <= 0.50
## |   |   |   |   |   |--- Num_pl <= 0.50
## |   |   |   |   |   |   |--- class: 1
## |   |   |   |   |   |--- Num_pl >  0.50
## |   |   |   |   |   |   |--- class: 1
## |   |   |   |   |--- Register_SPOK >  0.50
## |   |   |   |   |   |--- Num_sg <= 0.50
## |   |   |   |   |   |   |--- class: 1
## |   |   |   |   |   |--- Num_sg >  0.50
## |   |   |   |   |   |   |--- class: 1
## |   |   |   |--- Register_ACAD >  0.50
## |   |   |   |   |--- class: 1
## |   |   |--- Register_MAG >  0.50
## |   |   |   |--- Num_pl <= 0.50
## |   |   |   |   |--- class: 1
## |   |   |   |--- Num_pl >  0.50
## |   |   |   |   |--- class: 1

Confusion Matrix

##python chunk

import numpy as np
from sklearn import metrics # confusion_matrix lives in sklearn.metrics

prediction = pd.DataFrame(classTree.predict(predictor_variables)) # convert the predicted dummy values to a dataframe
prediction.columns = list(target_variable.columns) # label the columns geek/nerd to match the target
prediction_category = prediction.idxmax(axis=1) # collapse the dummies back to a single category label
target_variable_category = target_variable.idxmax(axis=1)

python_outcomes = metrics.confusion_matrix(prediction_category, target_variable_category, labels = ["nerd", "geek"])

print((np.trace(python_outcomes) / np.sum(python_outcomes)) * 100) # accuracy = trace (diagonal) / total
## 63.829787234042556

Thought questions

Answer: For me, the models are much simpler to create in R than in Python. First, R handles categorical variables automatically, while in Python we need to create dummy variables for them, which can become tedious with a large dataset. Also, it takes little effort to plot the tree in R, but in Python we need separate packages to do so.

Answer: From the results above, we can see that the random forest model is better than the conditional inference tree model, with a classification accuracy of 63.4% versus 61.7%. Also, because the random forest builds many trees and weighs the importance of each variable while controlling for the effects of the others, its classification results are more statistically reliable.

Answer: Variables such as the demographics and geographic location of the authors of the material, along with the word count of the material, could help improve the model.