STA 279 Lab 6

Complete all Questions.

The Goal

We have been working with multinomial regression models, which we can use when our response variable is categorical with more than 2 levels. We work with such data a lot when dealing with text, so multinomial regression is a model we encounter often.
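
As a quick reminder of the form of the model: we pick one level of \(Y\) as the baseline, and the model has one equation comparing each of the other levels of \(Y\) to that baseline:

$$\log \left( \frac{\pi_{i(j)}}{\pi_{i(\text{baseline})}} \right) = \beta_{0(j)} + \beta_{1(j)} X_i$$

With 3 authors, this means each model we fit today will have two such equations.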

Today, we are going to apply multinomial regression to the Federalist papers data set we worked with in Lab 3. In that lab, we used graphs to determine which author likely wrote the papers. In this lab, we will use models.

The Data Set

As a reminder, the Federalist papers are a famous series of articles written in the 1700s, in the early days of the United States, that debated and proposed ideas for the new government. Three famous authors of these papers were Alexander Hamilton, John Jay, and James Madison. Most of the papers are known to have been written by Hamilton, Madison, or Jay, but for several papers the author was not known. Since then, many scholars have worked to determine which of the authors wrote the anonymous papers. Today we are going to use statistical modeling to try to do the same thing.

You can use the following code to load the data.

# Load the libraries
library(dplyr)
library(stringr)
library(ggplot2)
library(tidyr)
library(tidytext)
library(nnet)
library(tm)

# Load the data
Federalist <- read.csv("https://www.dropbox.com/scl/fi/nche5o1oi5lag1kvoles5/LIWC_Federalist.csv?rlkey=0yvpqash13gcur6hkh7m7uwbl&st=3xntfjzh&dl=1")

# Convert to a data frame and drop the 4th column
Federalist <- data.frame(Federalist)[,-4]

# Make sure author is treated as categorical
Federalist[, "author"] <- as.factor(Federalist[, "author"])

# Remove the authors' names from the essay text
Federalist$text <- removeWords(Federalist$text, "hamilton")
Federalist$text <- removeWords(Federalist$text, "madison")

# Drop rows 18 through 20
Federalist <- Federalist[-c(18:20),]

If you look at the Federalist dataset you have loaded, it should have 82 rows and 121 columns. The first column is an id for the paper, the second column is the actual essay text, and the third is the author. The remaining 118 columns are features about the text created by the LIWC.

As a reminder, the LIWC is a tool that allows us access to multiple lexicons that we can use to create features about text data. For instance, how much of the paper is written in past tense? How much involves discussions of economics? How long are the sentences, and what is the tone like? The list of possible features is extensive!

If you are curious about what each feature represents, you can look at the list on page 12 of this document. This details each of the features created by the LIWC and gives examples of the types of words in each of the lexicons it uses to create these features.

Our goal for today is to use text features to (1) predict which author wrote each paper and (2) identify and discuss key traits in the text that distinguish the writing of each author.

Question 1

What level of \(Y\) is the baseline?

Hint: You can determine this by running levels(Federalist$author) and finding the level that comes first.

Model 1: Big Words

A historian thinks that looking at big words (words with more than 6 letters) might be a good way to determine which of the three authors wrote each paper.

  • BigWords = the percent of words in a text that have more than 6 letters.

They ask us to build a model using this as a feature.

Recall that for multinomial regression, we use:

# Load the library
library(nnet)

# Build the model
model1 <- multinom( Y ~ X , data = dataset, trace = FALSE)

# Print out the coefficients
knitr::kable( summary(model1)$coefficients )

Question 2

Adapt the code above to fit a multinomial regression model for \(Y\) = author with \(X\) = the percent of big words in a text. Show the formatted coefficient table as the answer for this question.

Question 3

Write down the fitted model using appropriate notation, rounding to 2 decimal places.

Hint: Put the following in the white space in your Markdown file and adapt.

$$\log \left( \frac{\hat{\pi}_{i(SUB1)}}{\hat{\pi}_{i(SUB2)}} \right) = $$

Once we have the model built, there are a few different things we can do with it.

Interpretation

Question 4

Interpret the slope for big words in the Madison component of the model in terms of the relative risk.
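
Hint: As an example with a made-up number, if the slope for BigWords in the Madison row of the coefficient table were 0.10, then each one-percentage-point increase in big words would multiply the relative risk \(\hat{\pi}_{i(M)} / \hat{\pi}_{i(\text{baseline})}\) by \(e^{0.10} \approx 1.11\); in other words, it would increase the relative risk by about 11%. Your interpretation should follow this pattern using the actual estimated slope.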

Question 5

Suppose 25% of words in an article are big words (so BigWords = 25). Interpret the relative risks for this paper.

Hint 1: This means find the 2 relative risks, and for each, your interpretation should say something like “The probability of being written by A is 2 times higher than the probability of being written by B.”

Hint 2: A lot of people don’t like interpreting when we have something like \(\hat{\pi}_{i(M)} = .65 \hat{\pi}_{i(H)}\). If this is you, remember you can rewrite this as \(\hat{\pi}_{i(H)} = (1/.65) \hat{\pi}_{i(M)} = 1.54\hat{\pi}_{i(M)}\) and interpret in this form!

The kind of interpretation work we have done in the previous two questions is useful. However, especially when we are working with a client, it is extremely valuable to be able to visualize the relationships that our model is describing. In other words, we would like to be able to make a plot that shows how the probability of being written by each author changes as the percent of big words changes.

We can do this in R, but the code involved is a little tricky, so I built a function that will do it for us. To teach R the function, copy and paste the following into a chunk and press play. DO NOT CHANGE ANYTHING ABOUT THIS FUNCTION.

plot_multnomial <- function(data, model, Y, X, Xname, xlab){
  
  if(length(coefficients(model)[1,])==2){
  
  predictions <- predict( model, type ="probs")
  
  data[, Y] <-as.factor(data[,Y])
  
  J = length(levels(data[,Y]))
  
  holder <- data.frame("group" = NA, "x" = NA, "Probability"=NA)
  a  = 1 
  for(i in 1:nrow(data)){
    for(j in 1:J){
      holder[a,1] <- levels(data[,Y])[j]
      holder[a,2] <- data[i,X]
      holder[a,3] <- predictions[i,j]
      a = a + 1
    }
  }
  
  g1 <- ggplot( holder, aes(x, Probability, col= group)) + 
    geom_line() + 
    labs(x = xlab)
  }
  
  if(length(coefficients(model)[1,])>2){
    
    library(pdp)
    
    toplot1 <- partial(model, Xname, which.class = 1, plot = FALSE, prob = TRUE)
    toplot2 <- partial(model, Xname, which.class = 2, plot = FALSE, prob = TRUE)
    toplot3 <- partial(model, Xname, which.class = 3, plot = FALSE, prob = TRUE)
    
    toplot1$response <- levels(data[,Y])[1]
    toplot2$response <- levels(data[,Y])[2]
    toplot3$response <- levels(data[,Y])[3]
    
    hold <- rbind(toplot1,toplot2,toplot3)
    hold <-data.frame(hold)
    hold$response <- as.factor(hold$response)

    g1 <- ggplot(hold, aes(x = hold[,1], yhat, col = response)) + geom_line() + 
      labs(x = xlab, y = "Probability")
    
  }
  
  g1
  
  
  
}

To use this function, adapt the following:

plot_multnomial(data, model, Y, X, Xname, xlab)
  • data: replace with the name of your data set.
  • model: replace with the name of your model.
  • Y: replace with the number of the column in your data set where \(Y\) is.
  • X: replace with the number of the column in your data set where \(X\) is.
  • Xname: replace with the name of your \(X\) column. Make sure you put it in quotes, for example: "BigWords"
  • xlab: replace with the label you want for the x axis. Make sure you put it in quotes, for example: "The x axis label".
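
For example, a call might look like the sketch below. This assumes model1 is the model you fit with BigWords in Question 2; the column numbers are looked up by name so that you do not have to count columns.

# Illustrative sketch (assumes model1 was fit using BigWords)
y_col <- which(colnames(Federalist) == "author")
x_col <- which(colnames(Federalist) == "BigWords")
plot_multnomial(Federalist, model1, y_col, x_col, "BigWords", "Percent of big words")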

Question 6

Adapt the code above to create a plot that shows how the probability of being written by each author changes as the percent of big words changes. Show the plot.

Question 7

Recall that our client, the historian, thought that the percent of big words might be useful for determining who wrote each paper. Based on the plot, describe in 2-3 sentences the relationship between the percent of big words and author.

Question 8

For what values of \(X\) = percent big words does it look like the model will struggle to predict author? In other words, for what values of \(X\) does it become less clear which author wrote the paper?

Question 9

Choose a different feature (anything except BigWords). State which feature you chose, build a multinomial regression model for author using that feature, and then adapt the code above Question 6 to visualize the relationship between that feature and author.

As an answer to this question, show the plot and describe the relationship between \(X\) and author shown in the plot.

Prediction

In addition to describing the relationships in the data, we can also use multinomial regression to predict which author wrote each paper.

If we want to find predicted probabilities from a multinomial regression model for a row \(i\) in the data set, we can use:

fitted(model1)[i,]

Question 10

Row 14 in the data set has 25% big words. Which author would you predict wrote the article? Explain your answer.

If we want predictions \(\hat{Y}\) instead, we can use:

predictions <- predict( model1, type = "class")

To get predictions for a row \(i\), we then use:

predictions[i]

Question 11

By adapting the code above, show which author R would predict wrote the article in row 14 using Model 1 (with BigWords).

We can also create a confusion matrix as we did with logistic regression.

# Rows are the true authors; columns are the model's predictions
holder <- table( Federalist$author, predictions )
rownames(holder) <- paste("True", rownames(holder))
colnames(holder) <- paste("Predicted", colnames(holder))
knitr::kable( holder )

Question 12

Based on the confusion matrix, find the Macro F1-score.

Hint: Suppose we want the F1-score for Hamilton. The F1-Score is \(\frac{TP}{TP+.5(FN+FP)}\). Here, \(TP\) means the number correctly predicted as Hamilton, \(FN\) means the number that were truly Hamilton but we predicted incorrectly, and \(FP\) means the number that were not Hamilton but our prediction was Hamilton. Once you have the F1 scores for each author, average them to get the Macro F1-score.
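
If you would like to check your hand calculation in R, a short sketch like the one below works; it assumes holder is the confusion matrix built above, with the true authors in the rows and the predicted authors in the columns.

# Per-author F1 scores, then their average (the Macro F1-score)
f1 <- sapply(1:3, function(j) {
  TP <- holder[j, j]        # correctly predicted as author j
  FN <- sum(holder[j, -j])  # truly author j, predicted as someone else
  FP <- sum(holder[-j, j])  # predicted as author j, but truly someone else
  TP / (TP + 0.5 * (FN + FP))
})
mean(f1)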

Model Comparisons

Recall that our goal is to predict who wrote the unknown Federalist papers. This means we need to assess predictive accuracy on test data. However, we do not have any! The only test data we have is the papers with unknown authors, and since the authors are not known, we do not have \(Y\) values in this test data to check our model predictions against.

This means we need to choose a model that will likely perform well on test data…without any test data to check a model against. In these situations, one tool that is very useful is the AIC. The AIC estimates out-of-sample predictive performance by rewarding model fit while penalizing the number of parameters, so a model with a lower AIC is expected to predict better on test data.

Question 13

You have built two multinomial regression models so far, one using \(X\) = big words and the other using the \(X\) you chose in Question 9. Using the AIC, which model would we expect to predict better on test data?

Note: To get the AIC of a model, we use AIC(model)

As with logistic regression, we can use the AIC to perform feature selection. Feature selection just means that we look at all the features we have available to use (right now 118) and decide which of them we want to use in our models.

In forward selection, we need to determine which feature to add to the model first. To do this, we try every possible multinomial regression model with exactly one feature. We then use the AIC to compare the models and we choose the feature that produces a model with the lowest AIC.
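
As a rough sketch of what Step 1 looks like in code (this assumes features is a data frame holding only the candidate feature columns and response holds the author variable, two objects we will create later in the lab):

# Illustrative sketch of Step 1: fit every one-feature multinomial model
# and record its AIC
step1_AIC <- sapply(seq_len(ncol(features)), function(i) {
  m <- multinom(response ~ ., data = data.frame(features[, i]), trace = FALSE)
  AIC(m)
})

# The feature whose one-feature model has the lowest AIC is added first
colnames(features)[which.min(step1_AIC)]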

At the end of Step 1, it turns out that the feature that results in a model with the lowest AIC is the number of conjunctions (and, but, or, …):

  • Feature 1: Conjunctions

Okay, Step 1 complete. Time for Step 2.

In Step 2, we already know that conjunctions (conj) will be in our model. The goal is now to consider which of the 117 remaining features we want to add next. This means we build 117 models (each with conj plus one other feature) and then determine the 2nd feature to add to the model.

Question 14

Build a model with the features conj and WC, and a second with the features conj and WPS (the number of words per sentence).

If these were the only two options in forward selection Step 2, which feature would you add to your model?

Again, when we do this for real we would have 117 total models to compare. I’m going to skip ahead and let you know that we choose quantity, a feature that counts the number of quantity words (many, some, few, three,…).

  • Feature 2: quantity

The next step in forward selection is to consider adding a 3rd feature to the model, and then a 4th, and so on, until the AIC stops improving.

We could do all of this forward selection process slowly by hand, but to speed up the process, I’ve written a function you can use to perform forward selection with multinomial regression. Copy and paste the function below into a chunk and press play.

forward_selection_multinomial <- function(features, response){

  # Load the library
  suppressMessages(library(nnet))
  # Initialize a
  a = 1 
  
  check <- apply(features, 2, function(x) length(unique(x)))
  to.remove <- which(check == 1)
  features <- features[,-to.remove]
  
  # Count the number of features
  p <- ncol(features)
  
  # Find the best feature with only one feature in the model

  for( i in 1:p){
    modelholder <- multinom(response ~ ., data = data.frame(features[,i]), trace = FALSE)
    if(a == 1){
      modelcurrent = i
      AICcurrent = AIC(modelholder)
      a = a+1
      }
    if(AIC(modelholder) < AICcurrent){
      AICcurrent <- AIC(modelholder)
      modelcurrent <- i
    }
    rm(modelholder)
  }


  model1 <- multinom(response ~ ., data = data.frame(features[,modelcurrent]), trace = FALSE)
  AIC_tobeat <- AIC(model1)
  tokeep = modelcurrent
  
    print(1)
  print(AIC(model1))

  for( x in 1:(p-1)){
    a = 1 
    for( i in c(1:p)[-tokeep]){
      modelholder <- multinom( response ~ ., data = features[,c(i,tokeep)], trace = FALSE)
      if(a == 1){
        modelcurrent = i
        AICcurrent = AIC(modelholder)
        a = a+2
      }
      if(AIC(modelholder) < AICcurrent){
        AICcurrent <- AIC(modelholder)
        modelcurrent <- i
        }
      rm(modelholder)
    } 
    if(a >1 & AIC_tobeat- AICcurrent < 1){
      break
    }
    tokeep <- c(tokeep,modelcurrent)
    print(x+1)
    print(AICcurrent)
    AIC_tobeat = AICcurrent
  }
  
  out <- cbind(response, features[,tokeep])

  print(data.frame(out))

}

DO NOT CHANGE ANYTHING IN THE FUNCTION ABOVE!!

To use this function, you need to provide two inputs:

  • features: A dataset which contains only the columns in the original dataset that you want to use as features. Example: features <- dataset[,2:118] if the features are in columns 2 to 118 in a dataset called dataset.
  • response: A vector containing only the single column in the original dataset that holds the response variable. Example: response <- dataset[,1] if the response is in column 1 of a dataset called dataset.

It turns out that before we do this, we need to think a little about whether there are any columns in our dataset that we should not treat as features.

Question 15

Which 3 columns in the original data set should NOT be included in features?

Question 16

Create the inputs features and response you need in order to run the forward selection code. Show your code as the answer to this question.

Once you have created features and response, you can run forward selection using the following:

# Run forward selection
forward <- forward_selection_multinomial(features,response)

As the code runs, it will print out the number of features added to the model and the AIC that is associated with the model with that many features. You will notice that the AIC is getting steadily lower as the function runs!

The function also prints out the features that forward selection put in the model and stores those features (along with the response) in the object you assigned the result to, which we called forward above.

Question 17

How many features does forward selection suggest we keep in our multinomial regression model?

To build a model using the features chosen by forward selection, we can use:

forward_model <- multinom( response ~ ., data = forward, trace = FALSE)

NOTE: When you run the forward selection code, it always renames the column with the response variable to response.

Question 18

What is the AIC for the model chosen by forward selection?

At this point, we can identify a few key features that seem to help us distinguish the writing of our 3 authors. To see the fitted model, we can use:

knitr::kable(t(coefficients(forward_model)))

We can also use our plotting command from before to visualize the model relationships.
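
For example, to visualize how the probability of each author changes with conjunctions in the forward selection model, a call might look like the sketch below. This assumes conj sits in column 2 of forward (it should, since it was the first feature added) and that the response is in column 1.

# Illustrative sketch: relationship between conjunctions and author
plot_multnomial(forward, forward_model, 1, 2, "conj", "Conjunctions")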

Question 19

Based on our fitted model, is a paper with a lot of conjunctions LEAST likely to have been written by Hamilton, Madison, or Jay? Explain.

Making Predictions for test data

Now that we have built our model and identified key features, let’s make predictions for the 15 Federalist papers with unknown authors. To read in these data, use the code below.

test <- read.csv("https://www.dropbox.com/scl/fi/45x2atyykounkdc811qmv/LIWC_Test.csv?rlkey=rcuxfme51l032cj1s16x3to4x&st=lvnp4z8i&dl=1")

To make predictions using multinomial regression, we use the code:

predict(forward_model, type = "class", newdata = test)

Question 20

  1. Based on the predictions, who wrote each of the mystery Federalist papers?

  2. We know that multinomial regression makes predictions by choosing the author with the highest probability. We can see these predicted probabilities by running the code below. Based on these probabilities, how stable are our predictions of author using our model?

predict(forward_model, type = "probs", newdata = test)

References

Data

The data come from https://github.com/nicholasjhorton/FederalistPapers, the GitHub repository of Dr. Nicholas J Horton. Citation: Horton, Nicholas J. Federalist Papers, Retrieved July 20, 2024 from https://github.com/nicholasjhorton/FederalistPapers.