STA 279 Lab 5

Complete all Questions.

The Goal

We have been working with models for text data where the goal is classification, meaning models that apply when our response variable is categorical. We encounter this kind of data a lot when working with text.

Today, we are going to apply the models we have learned so far to the Federalist papers data set we looked at in Lab 3. In that lab, we used graphs to determine which author likely wrote the papers. In this lab, we will use models.

The Data Set

As a reminder, the Federalist papers are a famous series of articles written in the 1700s, in the early days of the United States, that debated and proposed ideas for the new government. Three famous authors of these papers were Alexander Hamilton, John Jay, and James Madison. Most of the papers were known to have been written by Hamilton, Madison, or Jay, but for several papers the author was unknown. Since then, many scholars have worked to determine which of the three authors wrote the anonymous papers. Today we are going to use statistical modeling to try to do the same thing.

Let’s start off by loading our libraries. Most of these will look familiar, but a few are new. The new ones are the libraries that we need for classification trees.

# Load the libraries
library(dplyr)
library(stringr)
library(ggplot2)
library(tidyr)
library(tidytext)
library(nnet)
library(tm)

# New - needed for trees
library(rpart)
library(rattle)
library(rpart.plot)

Once you have loaded the libraries, you can use the following code to load the data.

# Load the data
Federalist <- read.csv("https://www.dropbox.com/scl/fi/nche5o1oi5lag1kvoles5/LIWC_Federalist.csv?rlkey=0yvpqash13gcur6hkh7m7uwbl&st=3xntfjzh&dl=1")

# Convert to a data frame and drop the 4th column
Federalist <- data.frame(Federalist)[,-4]

# Make sure author is treated as categorical
Federalist[, "author"] <- as.factor(Federalist[, "author"])

# Remove the authors' names from the essay text so the models cannot use them directly
Federalist$text <- removeWords(Federalist$text, "hamilton")
Federalist$text <- removeWords(Federalist$text, "madison")

Just as we did in Lab 3, our goal is to use the text of the essays to predict \(Y = author\). In Lab 3, we created features ourselves, like word count, and we used TF-IDF to find words that were likely to appear in papers written by one author but not commonly found in papers written by the other two.

Today, we are going to use text features derived from the LIWC. This means we have 118 possible text-based features to consider. Our goal for today is to use these features to (1) predict which author wrote each paper and (2) identify and discuss key traits in the text that identify the writing of each author.

Feature Selection

If you look at the Federalist dataset you have loaded, it should have 85 rows and 121 columns. The first column is an id for the paper, the second column is the actual essay text, and the third is the author. The remaining 118 columns are features about the text created from the LIWC.

As a reminder, the LIWC is a tool that allows us access to multiple lexicons that we can use to create features about text data. For instance, how much of the paper is written in past tense? How much involves discussions of economics? How long are the sentences, and what is the tone like? The list of possible features is extensive!

If you are curious about what each feature represents, you can look at the list on page 12 of this document. This details each of the features created by the LIWC and gives examples of the types of words in each of the lexicons it uses to create these features.

Recall that our goal for today is to (1) predict which author wrote each Federalist paper and (2) identify and discuss key traits in the text that identify the writing of each author. It is this second goal that motivates us to consider feature selection. Trying to describe how 118 features relate to the writing of each author is very difficult; ideally, we would like to narrow this down to just the key features that embody the work of each author.

Feature selection just means that we look at all the features we have available (right now, 118) and decide which of them we want to use in our models. As we discussed in class, there are a few different ways we can do this, including forward selection, backward elimination, and best subset selection (BSS). We are going to start with forward selection.

Forward Selection

Question 1

In forward selection, we begin with a model with how many features?

In forward selection, we need to determine which feature to add to the model first. To do this, we try every possible multinomial regression model with exactly one feature. We then use the AIC to compare the models and we choose the feature that produces a model with the lowest AIC.

Before we start building the multinomial models, it is important to pause and determine which level of \(Y\) is the baseline. This is the level of \(Y\) that will show up in the denominator of the log relative risks. To determine the baseline, you can run the code below. The first author you see (meaning the author to the far left) is the baseline.

levels(Federalist[ ,"author"])
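
As a reminder, once we know the baseline level \(b\), multinomial regression fits one equation for every other level \(k\) of \(Y\): \(\log(\pi_k / \pi_b) = \beta_{0k} + \beta_{1k}X_1 + \cdots + \beta_{pk}X_p\), where \(\pi_k\) is the probability of level \(k\). A positive coefficient on a feature shifts the log relative risk toward level \(k\) and away from the baseline.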

Question 2

What level of \(Y\) is the baseline?

Once we know what level is the baseline, we can start to build our models!

Question 3

The first two features in Federalist are WC (total word count) and Analytic, a measure of logical, formal thinking present in the text.

Build a model with only the feature WC, and a second with only the feature Analytic.

If these were the only two options in forward selection, which feature would you add first to your model and why?

Hint: To suppress the weights output you get when you run multinomial regression, you can add trace = FALSE to the end of your code.

modelFeature <- multinom(author ~ Feature, data = Federalist, trace = FALSE)
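
Once you have fit a model, you can extract its AIC using the AIC() function from base R. For example, using the placeholder model above:

# Extract the AIC of the fitted model (lower is better)
AIC(modelFeature)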

In Question 3, we compare two possible models. In the first step of forward selection, we will compare all 118 possible models we can build with one feature. Once all the models are fit, we determine which has the best AIC.

At the end of Step 1, it turns out that the feature that results in a model with the lowest AIC is the number of conjunctions (and, but, or, …):

  • Feature 1: Conjunctions

Okay, Step 1 complete. Time for Step 2.

In Step 2, we already know that conjunctions (conj) will be in our model. The goal is now to consider which of the 117 remaining features we want to add next. This means we build 117 models (each with conj and a second feature), and then determine the 2nd feature to add to the model.

Question 4

Build a model with the features conj and WC, and a second with the features conj and WPS (the number of words per sentence).

If these were the only two options in forward selection Step 2, which feature would you add to your model?

Again, when we do this for real we would have 117 total models to compare. I’m going to skip ahead and let you know that we choose quantity, a feature that counts the number of quantity words (many, some, few, three,…).

  • Feature 2: quantity

The next step in forward selection is to consider adding a 3rd feature to the model, and then a 4th, and so on, until the AIC stops improving by at least some value we choose. This threshold can be 1, or another value that makes sense based on your AIC values. For example, if adding a feature only moves the AIC from 120.0 to 119.4, the improvement of 0.6 falls below a threshold of 1 and we would stop.

We could do this whole forward selection process slowly by hand, but to speed things up, I’ve written you a function you can use to perform forward selection with multinomial regression; it uses a threshold of 1. Copy and paste the function in the chunk below and press play.

forward_selection_multinomial <- function(features, response){

  # Load the library
  suppressMessages(library(nnet))

  # a flags the first model fit in a loop, so we have a starting AIC to compare against
  a <- 1

  # Remove any constant columns (features with only one unique value)
  check <- apply(features, 2, function(x) length(unique(x)))
  to.remove <- which(check == 1)
  if(length(to.remove) > 0){
    features <- features[, -to.remove]
  }

  # Count the number of features
  p <- ncol(features)

  # Step 1: find the single feature whose model has the lowest AIC
  for(i in 1:p){
    modelholder <- multinom(response ~ ., data = data.frame(features[, i]), trace = FALSE)
    if(a == 1){
      modelcurrent <- i
      AICcurrent <- AIC(modelholder)
      a <- a + 1
    }
    if(AIC(modelholder) < AICcurrent){
      AICcurrent <- AIC(modelholder)
      modelcurrent <- i
    }
    rm(modelholder)
  }

  # Refit the winning one-feature model and record the AIC to beat
  model1 <- multinom(response ~ ., data = data.frame(features[, modelcurrent]), trace = FALSE)
  AIC_tobeat <- AIC(model1)
  tokeep <- modelcurrent

  print(1)
  print(AIC(model1))

  # Steps 2 and beyond: keep adding the feature that lowers the AIC the most,
  # stopping once the AIC improves by less than 1
  for(x in 1:(p - 1)){
    a <- 1
    for(i in c(1:p)[-tokeep]){
      modelholder <- multinom(response ~ ., data = features[, c(i, tokeep)], trace = FALSE)
      if(a == 1){
        modelcurrent <- i
        AICcurrent <- AIC(modelholder)
        a <- a + 1
      }
      if(AIC(modelholder) < AICcurrent){
        AICcurrent <- AIC(modelholder)
        modelcurrent <- i
      }
      rm(modelholder)
    }
    if(a > 1 & AIC_tobeat - AICcurrent < 1){
      break
    }
    tokeep <- c(tokeep, modelcurrent)
    print(x + 1)
    print(AICcurrent)
    AIC_tobeat <- AICcurrent
  }

  # Print (and return) the names of the chosen features
  print(colnames(features)[tokeep])

}

DO NOT CHANGE ANYTHING IN THE FUNCTION ABOVE!!

To use this function, you need to provide two inputs:

  • features: A dataset which contains only the columns in the original dataset that you want to use as features. Example: features <- dataset[,2:118] if the features are in columns 2 to 118 of a dataset called dataset.
  • response: A vector which contains only the single column in the original dataset that holds the response variable. Example: response <- dataset[,1] if the response is in column 1 of a dataset called dataset.

It turns out that before we do this, we need to think a little about whether there are any columns in our dataset that we should not treat as features.

Question 5

Which 3 columns in the original data set should NOT be included in features?

Question 6

Create the inputs features and response you need in order to run the forward selection code. Show your code as the answer to this question.

Once you have created features and response, you can run forward selection using the following:

# Run forward selection
features_chosen <- forward_selection_multinomial(features,response)

As the code runs, it will print out the number of features added to the model and the AIC that is associated with the model with that many features. You will notice that the AIC is getting steadily lower as the function runs!

The function also prints out the features that forward selection puts in the model and stores those feature names in the object features_chosen.

Question 7

How many features does forward selection suggest we keep in our multinomial regression model?

To build a model using the features chosen by forward selection, we can use:

forward_model <- multinom( Federalist$author ~ ., data = Federalist[, features_chosen], trace = FALSE)

Question 8

What is the AIC for the model chosen by forward selection?

At this point, we can identify a few key features that seem to help us distinguish the writing of our 3 authors. To see the fitted model, we can use:

knitr::kable(t(coefficients(forward_model)))

Question 9

Based on our fitted model, is a paper with a lot of big words most likely to have been written by Hamilton, Madison, or Jay? Explain.

Question 10

Based on our fitted model, is a paper with a lot of male reference words (he, his, him, man) most likely to have been written by Hamilton, Madison, or Jay? Explain.

Making Predictions

Now that we have built our model and identified key features, let’s make predictions for the 15 Federalist papers with unknown authors. To read in these data, use the code below.

test <- read.csv("https://www.dropbox.com/scl/fi/45x2atyykounkdc811qmv/LIWC_Test.csv?rlkey=rcuxfme51l032cj1s16x3to4x&st=lvnp4z8i&dl=1")

To make predictions using multinomial regression, we use the code:

predict(forward_model, type = "class", newdata = test)

Question 11

  1. Based on the predictions, who wrote the mystery Federalist papers?

  2. We know that multinomial regression makes predictions by choosing the author with the highest probability. We can see these predicted probabilities by running the code below. Based on these probabilities, how stable are our predictions of author using our model?

predict(forward_model, type = "probs", newdata = test)

Classification Trees

Let’s try a different way to explore the relationships between the features and the response: building a classification tree. Classification trees use the values of the features to create clusters. Inside these clusters, we want rows that have similar values of \(Y\). In other words, we would like to define one or more Madison clusters, Hamilton clusters, and so on.

Classification trees have a cool property: they perform feature selection as a natural part of the tree-building process. All features are considered, but only the features that produce the largest reductions in the Gini Index are actually used.
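
As a refresher, if a node contains class proportions \(p_1, \ldots, p_K\) across the \(K\) levels of \(Y\), the Gini Index of that node is \(G = 1 - \sum_{k=1}^{K} p_k^2\). The Gini Index of a split is the weighted average of the child nodes’ Gini Indexes, weighted by the fraction of rows that fall into each child. This is the calculation you will need for Question 13.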

To build a classification tree in R, we can use:

tree_class <- rpart( author ~ . , data = Federalist[,-c(1:2)], method = "class")

To visualize the tree, we use:

fancyRpartPlot(tree_class, sub = "" )

Question 12

Which splitting rule gives us the largest reduction in the Gini Index in one split?

Question 13

What is the Gini Index of the first split? Show your work.

Hint: You do not need code for this question, though you can choose to use R as a calculator!

Question 14

If a Federalist paper has 17 prepositions (prep), 40 words per sentence on average (WPS), and an analytic score of 90, which author will the tree predict wrote the paper?

Question 15

How stable is the prediction from Question 14? Explain.

Question 16

A historian asks you to use the tree to describe the writing traits that distinguish John Jay from the other two authors. Answer their question!

As we can see, classification trees have a few nice properties! Even though we started with 118 features, only 3 of them are actually used in our model. This makes it very easy to visualize and interpret the relationships in the data. We can also use trees to make predictions! And speaking of predictions…

Question 17

Use your tree to make predictions on the 15 papers in the test set.

Hint: You can use the same prediction code from above Question 11; you just need to change the name of the model.

Question 18

Do your predictions from the tree generally agree with the predictions from your multinomial regression model? Explain.

Naive Bayes

The third and final approach we are going to use is Naive Bayes. Recall that this is a prediction approach built on conditional probabilities: it asks how probable each author is, given the feature values we observe.
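
More precisely, under the “naive” assumption that the features are independent given the author, Naive Bayes predicts the level \(k\) of \(Y\) that maximizes \(P(Y = k)\prod_{j=1}^{p} P(X_j = x_j \mid Y = k)\), which by Bayes’ rule is proportional to \(P(Y = k \mid X_1 = x_1, \ldots, X_p = x_p)\).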

To run Naive Bayes in R, we use the following:

# Load the library
library(naivebayes)

# Run Naive Bayes 
naive_result <- naive_bayes(author ~ ., data = Federalist[,-c(1,2)])

# Make predictions on the test data 
predict( naive_result, newdata = test)
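
Just as with multinomial regression, we can check how stable these predictions are by looking at the predicted probabilities. The naivebayes package returns them with type = "prob":

# Predicted probability of each author for each mystery paper
predict(naive_result, newdata = test, type = "prob")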

At this point, we have three different sets of predictions: one from multinomial regression, one from classification trees, and one from Naive Bayes. We could consider choosing just one approach to use…or we could use what is called an ensemble learning method. This just means that we combine the results from multiple predictive approaches to obtain our final prediction.

For each row in the test data, we look at the predictions from each of the 3 methods. The final predicted value is the value that is most popular across the three methods!

To see our 3 predictions side by side, we can use the following:

Predictions <- data.frame("Multinomial" = predict( forward_model, type = "class", newdata = test))

Predictions$Tree <- predict( tree_class, type = "class", newdata = test)

Predictions$NaiveBayes <- predict(naive_result,  type = "class", newdata = test)

knitr::kable(Predictions)
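
If you would like R to tally the vote for you, here is a minimal sketch that assumes the Predictions data frame created above:

# Majority vote: for each row (paper), keep the prediction that
# appears most often across the three methods (ties go to the
# author name that sorts first)
apply(Predictions, 1, function(x) names(which.max(table(x))))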

Question 19

Which author do we predict wrote each paper?

References

Data

The data come from https://github.com/nicholasjhorton/FederalistPapers, the GitHub repository of Dr. Nicholas J. Horton. Citation: Horton, Nicholas J. Federalist Papers. Retrieved July 20, 2024, from https://github.com/nicholasjhorton/FederalistPapers.