Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report so that others know what they need for the report to compile correctly.
You will distinguish between the categories nerd and geek to determine the influence of the respective variables on how each category is defined. Semantically, the words are similar, but in this exercise you will determine whether they are used differently and which features may separate them into different categories.
The data for this project has already been loaded. Drawing from multiple corpora, for each instance where the word nerd or geek was used, the context surrounding it was coded, including whether the word referenced one or multiple people (Num), when it was used (Century: 20th or 21st), the type of corpus (academic, magazines, newspapers, spoken language; Register), and the valence/sentiment of the word (negative, neutral, or positive; Eval).
Questions which require a written response are marked with **. If you are having trouble with the Rling library, the nerd data is available on Canvas, and you can load it directly.
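If Rling will not install, a minimal fallback sketch for loading the Canvas file directly (this assumes the download is a CSV named nerd.csv in your working directory; adjust the file name to match what you downloaded):
##r chunk
# hypothetical file name; substitute the actual Canvas download
nerd <- read.csv("nerd.csv", stringsAsFactors = TRUE) # ctree() needs the columns as factors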
##r chunk
library(Rling) # for the nerd dataset
library(party) # for ctree() and cforest()
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
data(nerd) # load the nerd/geek dataset from Rling
head(nerd) # inspect the first few rows
## Noun Num Century Register Eval
## 1 nerd pl XX ACAD Neutral
## 2 nerd pl XXI ACAD Neutral
## 3 nerd pl XX ACAD Neutral
## 4 nerd pl XX ACAD Neutral
## 5 nerd pl XX ACAD Neutral
## 6 nerd pl XXI ACAD Neutral
Dependent variable: Noun (whether "nerd" or "geek" was used)
Independent variables: Num, Century, Register, and Eval
Use ctree() to create a conditional inference model.
##r chunk
set.seed(10) # for reproducibility
tree <- ctree(Noun ~ Num + Century + Register + Eval, data = nerd)
##r chunk
plot(tree)
Answer: From the tree graph above, we can see that the valence of the word (Eval) is the strongest predictor of the outcome, followed by the century in which the word is used. For instance, when the valence is positive, "geek" is the more likely word in both centuries, but the gap is wider in the 21st century than in the 20th, where "nerd" still appears fairly often. When the valence is negative or neutral, "nerd" is used more often in the 20th century, while in the 21st century the two words appear at nearly identical rates.
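As a quick sanity check on this reading of the tree, a minimal sketch that cross-tabulates the raw proportions of each noun within every Century-by-Eval cell:
##r chunk
# proportion of geek vs. nerd within each Century x Eval combination
round(prop.table(table(nerd$Noun, nerd$Century, nerd$Eval), margin = c(2, 3)), 2)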
tree_outcomes <- table(predict(tree), nerd$Noun) # confusion matrix: predicted vs. actual
sum(diag(tree_outcomes)) / sum(tree_outcomes) * 100 # classification accuracy (%)
## [1] 61.70213
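For context, this accuracy is only meaningful relative to the majority-class baseline, i.e., the accuracy of always guessing the more frequent noun; a one-line check:
##r chunk
# accuracy (%) of always predicting the more common noun
max(table(nerd$Noun)) / nrow(nerd) * 100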
##r chunk
forest <- cforest(Noun ~ Num + Century + Register + Eval, data = nerd, controls = cforest_unbiased(ntree = 1000, mtry = 2)) # 1000 trees, 2 predictors sampled per split
Answer: From the variable-importance chart generated below, we can see that Eval is the most important variable, followed by Century, Num, and Register.
##r chunk
forest_importance <- varimp(forest, conditional = TRUE) # conditional importance adjusts for correlated predictors
round(forest_importance, 2)
## Num Century Register Eval
## 0.00 0.02 0.00 0.06
dotchart(sort(forest_importance), main = "Random Forest Model Variable Importance")
Answer: Because the random forest model predicts more accurately (63.4% versus 61.7%), we can say that it is better than the conditional inference tree model.
##r chunk
forest_outcomes <- table(predict(forest), nerd$Noun) # confusion matrix: predicted vs. actual
sum(diag(forest_outcomes)) / sum(forest_outcomes) * 100 # classification accuracy (%)
## [1] 63.44985
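Note that both accuracies above are computed on the training data, which flatters the models. For the forest, party can return out-of-bag predictions, which give a less optimistic estimate; a sketch:
##r chunk
# OOB = TRUE scores each observation only with trees that did not train on it
oob_outcomes <- table(predict(forest, OOB = TRUE), nerd$Noun)
sum(diag(oob_outcomes)) / sum(oob_outcomes) * 100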
library(reticulate) # switch to Python by running repl_python() in the console
##python chunk
import pandas as pd # import pandas to help manipulate the dataset
predictor_variables = pd.get_dummies(r.nerd[["Num", "Century", "Register", "Eval"]]) # one-hot encode the predictors
target_variable = pd.get_dummies(r.nerd["Noun"]) # one-hot encode the outcome
predictor_variables.head()
## Num_pl Num_sg Century_XX ... Eval_Neg Eval_Neutral Eval_Positive
## 0 1 0 1 ... 0 1 0
## 1 1 0 0 ... 0 1 0
## 2 1 0 1 ... 0 1 0
## 3 1 0 1 ... 0 1 0
## 4 1 0 1 ... 0 1 0
##
## [5 rows x 11 columns]
target_variable.head()
## geek nerd
## 0 0 1
## 1 0 1
## 2 0 1
## 3 0 1
## 4 0 1
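Because every factor here is exhaustively one-hot encoded, each group of dummies contains one redundant column (e.g., Num_sg is just 1 minus Num_pl). pandas can drop these automatically; a sketch with a hypothetical variable name:
##python chunk
# drop_first=True keeps k-1 dummies per k-level factor
predictors_compact = pd.get_dummies(r.nerd[["Num", "Century", "Register", "Eval"]], drop_first=True)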
Use sklearn to create a decision tree classifier for the nerd data.
##python chunk
import sklearn #Scikit-learn is a free machine learning library in Python
from sklearn import tree
classTree = tree.DecisionTreeClassifier() # default settings grow a full-depth tree
classTree = classTree.fit(predictor_variables, target_variable) #fit the variables to the model
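By default, DecisionTreeClassifier grows until its leaves are pure, which tends to overfit and produces the very deep printout below. A shallower sketch (the variable name and max_depth value are arbitrary choices, not from the assignment):
##python chunk
# limit depth for readability and to reduce overfitting
shallowTree = tree.DecisionTreeClassifier(max_depth = 3).fit(predictor_variables, target_variable)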
##python chunk
print(tree.export_text(classTree, feature_names = list(predictor_variables.columns)))
## |--- Eval_Positive <= 0.50
## | |--- Century_XX <= 0.50
## | | |--- Num_sg <= 0.50
## | | | |--- Register_MAG <= 0.50
## | | | | |--- Eval_Neutral <= 0.50
## | | | | | |--- Register_SPOK <= 0.50
## | | | | | | |--- Register_NEWS <= 0.50
## | | | | | | | |--- class: 0
## | | | | | | |--- Register_NEWS > 0.50
## | | | | | | | |--- class: 0
## | | | | | |--- Register_SPOK > 0.50
## | | | | | | |--- class: 0
## | | | | |--- Eval_Neutral > 0.50
## | | | | | |--- Register_ACAD <= 0.50
## | | | | | | |--- Register_SPOK <= 0.50
## | | | | | | | |--- class: 0
## | | | | | | |--- Register_SPOK > 0.50
## | | | | | | | |--- class: 0
## | | | | | |--- Register_ACAD > 0.50
## | | | | | | |--- class: 0
## | | | |--- Register_MAG > 0.50
## | | | | |--- Eval_Neg <= 0.50
## | | | | | |--- class: 0
## | | | | |--- Eval_Neg > 0.50
## | | | | | |--- class: 1
## | | |--- Num_sg > 0.50
## | | | |--- Register_ACAD <= 0.50
## | | | | |--- Register_MAG <= 0.50
## | | | | | |--- Eval_Neutral <= 0.50
## | | | | | | |--- Register_NEWS <= 0.50
## | | | | | | | |--- class: 0
## | | | | | | |--- Register_NEWS > 0.50
## | | | | | | | |--- class: 1
## | | | | | |--- Eval_Neutral > 0.50
## | | | | | | |--- Register_NEWS <= 0.50
## | | | | | | | |--- class: 1
## | | | | | | |--- Register_NEWS > 0.50
## | | | | | | | |--- class: 0
## | | | | |--- Register_MAG > 0.50
## | | | | | |--- Eval_Neutral <= 0.50
## | | | | | | |--- class: 0
## | | | | | |--- Eval_Neutral > 0.50
## | | | | | | |--- class: 1
## | | | |--- Register_ACAD > 0.50
## | | | | |--- Eval_Neutral <= 0.50
## | | | | | |--- class: 1
## | | | | |--- Eval_Neutral > 0.50
## | | | | | |--- class: 0
## | |--- Century_XX > 0.50
## | | |--- Num_pl <= 0.50
## | | | |--- Register_ACAD <= 0.50
## | | | | |--- Register_MAG <= 0.50
## | | | | | |--- Register_NEWS <= 0.50
## | | | | | | |--- Eval_Neutral <= 0.50
## | | | | | | | |--- class: 1
## | | | | | | |--- Eval_Neutral > 0.50
## | | | | | | | |--- class: 0
## | | | | | |--- Register_NEWS > 0.50
## | | | | | | |--- Eval_Neg <= 0.50
## | | | | | | | |--- class: 0
## | | | | | | |--- Eval_Neg > 0.50
## | | | | | | | |--- class: 0
## | | | | |--- Register_MAG > 0.50
## | | | | | |--- Eval_Neg <= 0.50
## | | | | | | |--- class: 0
## | | | | | |--- Eval_Neg > 0.50
## | | | | | | |--- class: 0
## | | | |--- Register_ACAD > 0.50
## | | | | |--- Eval_Neg <= 0.50
## | | | | | |--- class: 0
## | | | | |--- Eval_Neg > 0.50
## | | | | | |--- class: 0
## | | |--- Num_pl > 0.50
## | | | |--- Register_ACAD <= 0.50
## | | | | |--- Register_MAG <= 0.50
## | | | | | |--- Eval_Neutral <= 0.50
## | | | | | | |--- Register_NEWS <= 0.50
## | | | | | | | |--- class: 0
## | | | | | | |--- Register_NEWS > 0.50
## | | | | | | | |--- class: 0
## | | | | | |--- Eval_Neutral > 0.50
## | | | | | | |--- Register_SPOK <= 0.50
## | | | | | | | |--- class: 0
## | | | | | | |--- Register_SPOK > 0.50
## | | | | | | | |--- class: 0
## | | | | |--- Register_MAG > 0.50
## | | | | | |--- Eval_Neutral <= 0.50
## | | | | | | |--- class: 0
## | | | | | |--- Eval_Neutral > 0.50
## | | | | | | |--- class: 0
## | | | |--- Register_ACAD > 0.50
## | | | | |--- class: 0
## |--- Eval_Positive > 0.50
## | |--- Century_XXI <= 0.50
## | | |--- Register_MAG <= 0.50
## | | | |--- Register_ACAD <= 0.50
## | | | | |--- Num_sg <= 0.50
## | | | | | |--- Register_SPOK <= 0.50
## | | | | | | |--- class: 0
## | | | | | |--- Register_SPOK > 0.50
## | | | | | | |--- class: 0
## | | | | |--- Num_sg > 0.50
## | | | | | |--- Register_NEWS <= 0.50
## | | | | | | |--- class: 0
## | | | | | |--- Register_NEWS > 0.50
## | | | | | | |--- class: 0
## | | | |--- Register_ACAD > 0.50
## | | | | |--- class: 0
## | | |--- Register_MAG > 0.50
## | | | |--- Num_pl <= 0.50
## | | | | |--- class: 1
## | | | |--- Num_pl > 0.50
## | | | | |--- class: 1
## | |--- Century_XXI > 0.50
## | | |--- Register_MAG <= 0.50
## | | | |--- Register_ACAD <= 0.50
## | | | | |--- Register_SPOK <= 0.50
## | | | | | |--- Num_pl <= 0.50
## | | | | | | |--- class: 1
## | | | | | |--- Num_pl > 0.50
## | | | | | | |--- class: 1
## | | | | |--- Register_SPOK > 0.50
## | | | | | |--- Num_sg <= 0.50
## | | | | | | |--- class: 1
## | | | | | |--- Num_sg > 0.50
## | | | | | | |--- class: 1
## | | | |--- Register_ACAD > 0.50
## | | | | |--- class: 1
## | | |--- Register_MAG > 0.50
## | | | |--- Num_pl <= 0.50
## | | | | |--- class: 1
## | | | |--- Num_pl > 0.50
## | | | | |--- class: 1
##python chunk
import numpy as np
prediction = pd.DataFrame(classTree.predict(predictor_variables)) #convert the predicted values to a dataframe
prediction.columns = list(target_variable.columns) #label the columns to match the target dummies
prediction_category = prediction.idxmax(axis=1) #collapse the dummies back to a single category label
target_variable_category = target_variable.idxmax(axis=1)
python_outcomes = sklearn.metrics.confusion_matrix(target_variable_category, prediction_category, labels = ["nerd","geek"]) #true labels first, then predictions
print((np.trace(python_outcomes)/np.sum(python_outcomes))*100) #classification accuracy (%)
## 63.829787234042556
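Equivalently, sklearn's built-in metric gives the same number with less arithmetic; a sketch:
##python chunk
from sklearn.metrics import accuracy_score
# fraction of matching labels, scaled to a percentage
print(accuracy_score(target_variable_category, prediction_category) * 100)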
Answer: For me, the models are much simpler to create in R than in Python. First, R handles categorical variables automatically, while in Python we need to create dummy variables for them, which can become tedious with a large dataset. Also, it takes little effort to plot the tree in R, whereas in Python we need separate packages.
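That said, plotting the sklearn tree only takes one extra dependency; a minimal sketch, assuming matplotlib is installed:
##python chunk
import matplotlib.pyplot as plt
# draw the fitted tree with readable feature labels
plt.figure(figsize = (20, 10))
tree.plot_tree(classTree, feature_names = list(predictor_variables.columns), filled = True)
plt.show()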
Answer: From the results above, we can see that the random forest model is better than the conditional inference tree model, as it has a classification accuracy of 63.4% versus 61.7%. Also, because the random forest builds many trees, weighing the importance of each variable while controlling for the effects of the others, its classification results are more statistically reliable.
Answer: Variables such as the demographic and geographic characteristics of the authors of the written material, along with the word count of the material, could help us improve the model.