STA 279: Data Analysis 1

Complete all Questions. This will be graded like a project, so make sure you use complete sentences, check your spelling, format all your output professionally, and label your graphs appropriately. This will be part of your grade.

The Goal

We have been learning how to work with text data, creating features and starting to build models. Today, we are going to apply everything we have learned so far to our first data analysis task.

This first Data Analysis will be more structured than Data Analysis 2 and 3. This is because part of the goal of this analysis is to help you practice the steps and structure of a data analysis.

The Data Set

The Federalist papers are a famous series of articles written in the 1700s, in the early days of the United States. They debated and proposed ideas for the new government. Two famous authors of these papers were Alexander Hamilton and James Madison. Most of the papers were known to have been written by either Hamilton or Madison, but for several papers, the author was not known. Hamilton declared at one point that he had written these disputed articles. Since then, many scholars have worked to determine which of the two authors wrote the anonymous papers.

Today, we are provided with the text from the Federalist papers known to have been written by Hamilton and Madison. The data come from https://github.com/nicholasjhorton/FederalistPapers, the GitHub repository of Dr. Nicholas J Horton.

Our goal is to use the text analysis skills we have learned so far to determine what characteristics might differ between the two authors’ papers. We will also be building a model to predict which author wrote the mystery papers.

To read in the data on the \(n = 65\) Federalist papers with known authors, use the following code:

# Load the data
Federalist <- read.csv("https://www.dropbox.com/scl/fi/5hzqwsvlnym5u1mhmn0jy/Federalist.csv?rlkey=55x0p9fl02zls9ixxek64vlqy&st=9ku4q61a&dl=1")

# Convert to a data frame
Federalist <- data.frame(Federalist)

# Make sure author is treated as categorical
Federalist$author <- as.factor(Federalist$author)

The columns are:

  • paper: the number given to the paper; think of this as an identifier for the article.
  • text: the text of the entire article.
  • author: the author of the paper, either Hamilton or Madison.

Question 1

What percent of the papers (articles) were written by Madison?
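
One possible starting point is a quick base R check (a minimal sketch; no extra packages needed):

# Counts of papers by author
table(Federalist$author)

# Percent of papers written by Madison
mean(Federalist$author == "Madison") * 100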

Once you have loaded the data, load the packages you will need for this lab:

library(tidytext)
library(tidyr)
library(dplyr)
library(ggplot2)
library(tm)

Section 1: Word Count

At this point, we know that a common feature we use to compare documents is word count. Is it possible that the number of words can be used to determine which author wrote which paper?

Question 2

Add a feature to the Federalist data set which counts the number of words in each article excluding stop words and numbers. When we say “word count” or “number of words” for the rest of this assignment, we mean excluding stop words and numbers.

Make an appropriate plot to compare the number of words in Madison’s papers to the number of words in Hamilton’s papers. Make sure your plot is well formatted and well labelled.

Based on your plot, describe how word count seems to differ between the two authors, or whether the counts seem about the same. Write your description as though you were explaining it to an historian who is interested in your results!
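
Here is one possible way to build the feature (a minimal sketch; the column name word_count is my choice, and the pipeline mirrors the top-10 code shown later in this lab):

# Count non-stop-word, non-number words in each paper
word_counts <- Federalist |>
  mutate(text = removePunctuation(text)) |>
  unnest_tokens(word, text) |>
  anti_join(stop_words) |>
  # Remove the numbers
  filter(!grepl('[0-9]', word)) |>
  # One row per paper, with its word count
  count(paper, name = "word_count")

# Add the feature onto the Federalist data
Federalist <- Federalist |>
  left_join(word_counts, by = "paper")

# One possible comparison plot
ggplot(Federalist, aes(x = author, y = word_count)) +
  geom_boxplot() +
  labs(x = "Author",
       y = "Number of Words (excluding stop words and numbers)",
       title = "Word Counts in the Federalist Papers by Author")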

Now that the feature has been created, we can use word count to build a logistic regression model to predict whether a paper was written by Madison or Hamilton.

Question 3

Build a logistic regression model for author using the number of words in a paper as the feature in the model. Make sure you are clear what the baseline of your response variable is! In other words, make sure you clearly state which author is indicated with \(Y_i = 0\) and which is indicated with \(Y_i = 1\).

Write down the fitted model using appropriate notation (let Dr. Dalzell know if you are not sure how to do this!).
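
A minimal sketch of the fitting step (assuming the word_count column created above; because R orders factor levels alphabetically, Hamilton corresponds to \(Y_i = 0\) and Madison to \(Y_i = 1\) here):

# Logistic regression: author as a function of word count
m1 <- glm(author ~ word_count, data = Federalist, family = binomial)

# Coefficients, for writing out the fitted model
summary(m1)$coefficients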

Question 4

Use your model to make predictions for all \(n=65\) articles in the Federalist data using a threshold of .5. Create a professionally formatted confusion matrix and state the accuracy, true positive rate, and true negative rate of the model.

Clearly explain to your historian how well your model is able to predict authorship based on this one feature, word count.
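
One way to get these pieces (a minimal sketch, assuming the model object m1 from the previous sketch):

# Predicted probabilities that Y = 1 (Madison)
probs <- predict(m1, type = "response")

# Convert probabilities to predicted authors using a .5 threshold
preds <- ifelse(probs >= .5, "Madison", "Hamilton")
# Force both levels so the table is always 2 x 2
preds <- factor(preds, levels = levels(Federalist$author))

# Confusion matrix: rows = predicted, columns = actual
confusion <- table(Predicted = preds, Actual = Federalist$author)
confusion

# Accuracy
sum(diag(confusion)) / sum(confusion)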

Section 2: Word Frequency

In addition to the number of words, we have learned that we can count the number of times each word appears in a piece of text. This can be used to determine which words are more common in one type of text than another.

Question 5

Create and show a plot of the top 10 words in Hamilton’s texts. Do the same for Madison. Make sure both plots are well formatted and well labelled.
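
For Hamilton, one possible plot looks like this (a minimal sketch; swapping "Madison" into the filter step gives the second plot):

Federalist |>
  filter(author == "Hamilton") |>
  mutate(text = removePunctuation(text)) |>
  unnest_tokens(word, text) |>
  anti_join(stop_words) |>
  # Remove the numbers
  filter(!grepl('[0-9]', word)) |>
  count(word, sort = TRUE) |>
  # Re-order so the most popular word is on top
  mutate(word = reorder(word, n)) |>
  slice(1:10) |>
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word", y = "Count",
       title = "Top 10 Words in Hamilton's Federalist Papers")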

Question 6

Based on your plots, discuss which of the top 10 words seem to be common to both authors and which are unique to each author. For any words the authors share, describe whether they are more or less numerous (have a higher count) for one author than the other.

At this point, we are going to try something a little new. We are going to create features using these top 10 words, and then use them in a logistic regression model to predict authorship.

Top 10 Words as Features

We already know how to find the top 10 words for each author and plot them. Now, we want a list of those top 10 words, but we do not want to plot them. To get that list, take a look at the code you just used to make your bar plots. You are going to want to remove all the code that has to do with plot making (everything from ggplot onward).

For example, for Hamilton, we can use

# Find the top 10
Hamilton10 <- Federalist |>
  filter(author == "Hamilton") |>
  mutate(text = removePunctuation(text)) |>
  unnest_tokens(word, text) |>
  anti_join(stop_words) |>
  # Remove the numbers 
  filter(!grepl('[0-9]', word)) |>
  # Count the number of words 
  count(word, sort = TRUE) |>
  # Re-order the words so the most popular is on top
  mutate(word = reorder(word, n)) |>
  # Choose the 10 most popular words 
  slice(1:10)

The same process can be repeated for Madison.

As we know from the plots we have created already, some of the top 10 words are in common across the two authors. When we create features using the top 10 words, we don’t want to double count if the words are in both top 10 lists. To get the distinct features across the two lists, we can use

top10 <- union( Madison10$word, Hamilton10$word)

Okay, so top10 tells us the top 10 words in Madison’s papers and in Hamilton’s papers, making sure we did not double count words that were in the top 10 for both authors. This means we should have 14 words in top10.

The next step is to count how often these top words appear in each of the Federalist papers. To do that, we will need a special function. Copy and paste the following into your RMarkdown file and press play.

# DO NOT EDIT!!!
# THIS IS A FUNCTION 

topWordsData <- function( data, classifier, top ){
  corpus = Corpus(VectorSource(data)) 
  
  corpus = suppressWarnings(tm_map(corpus, content_transformer(tolower))) 
  corpus = suppressWarnings(tm_map(corpus, removePunctuation))
  corpus = suppressWarnings(tm_map(corpus, removeNumbers))
  
  # removing stop words 
  corpus = suppressWarnings(tm_map(corpus, removeWords, stopwords("SMART"))) 
  # removing white space 
  corpus = suppressWarnings(tm_map(corpus, stripWhitespace))
  
  matrix = as.matrix(DocumentTermMatrix(corpus)) 

  #converting matrix to dataframe 
  dtm_df <- as.data.frame(matrix) 
  
  # add on the classifier
  dtm_df$y <- classifier
  
  # Re-order columns
  dtm_df <- dtm_df[, c(ncol(dtm_df), 1:(ncol(dtm_df)-1))]
  
  # Keep only certain columns
  colnamesKeep <- which(colnames(dtm_df) %in% top)
  
  dtm_df <- dtm_df[, c(1, colnamesKeep)]
  
  dtm_df[,1] <- as.factor(dtm_df[,1])
  
  # Output 
  dtm_df
}

Be very sure you do NOT change any of the code in the chunk above!!

To use the function, we need three inputs:

  1. The text of the papers (Federalist$text)
  2. The Y variable, which is the authors of the papers (Federalist$author)
  3. The list of top 10 words we want in our new data set

This means that to create the data set that we need, run the following:

Top10 <- topWordsData( Federalist$text, Federalist$author, top10 )

What on earth have we just created????

This should create a data set with 65 rows and 15 columns: one column for each of the 14 features, plus the response variable. The first column, y, tells us which author wrote each paper. The rest of the columns tell us how many times a specific word appears in each paper. For instance, the word constitution appears in the first paper 5 times.
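
A quick way to confirm what you have (a minimal sketch):

# Should be 65 rows and 15 columns
dim(Top10)

# Peek at the first few rows and columns
head(Top10)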

Question 7

Build a logistic regression model for author using the number of occurrences of the 14 top words (the union of the top 10 lists across the two authors) as your features. In other words, use the Top10 data set.

Note: You will get a warning, but that’s okay since the goal right now is prediction. To turn it off, change your chunk header to {r, warning = FALSE}.

Create a professionally formatted table to show the coefficients. You do NOT need to write out the long fitted model.
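
A minimal sketch of the fit (the formula shortcut y ~ . uses all 14 word columns as features; knitr::kable() is one common option for formatting the coefficient table, if you have the knitr package):

# Logistic regression with all 14 word features
m2 <- glm(y ~ ., data = Top10, family = binomial)

# Coefficient table, ready to format professionally
knitr::kable(summary(m2)$coefficients)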

Question 8

Use your model to make predictions for all \(n=65\) articles in the Federalist data using a threshold of .5. Create a professionally formatted confusion matrix and state the accuracy, true positive rate, and true negative rate of the model.

Clearly explain to your historian how well your model is able to predict authorship based on the current feature list.

Section 3: Predicting the Mystery Papers

At this point, we have seen how well our two logistic regression models predict authorship for the papers whose authors we know. However, the whole point was to see if we could identify who wrote the papers whose authors we do not know.

To load the text data for the papers we do not have authors for, use the following:

test <- read.csv("https://www.dropbox.com/scl/fi/yrfzwwn7olyaeq3mhb0bd/test.csv?rlkey=c6bs0ytyoomkq34aw7rx3bc62&st=tzvp12cp&dl=1")

test <- data.frame(test)

You will notice that this data set has the paper number and the paper text, but it does not have the author, since that is what we are trying to predict.

These new data were not used in the model building process, but we are about to make predictions for them. Because we are using these data to test and assess the predictive abilities of our model, they are called test data.

The data we used to build the model originally are called training data.

Question 9

Use your first model (with word count as a feature) to make predictions for all \(n^*=15\) papers in the test data using a threshold of .5. Create a professionally formatted table to show the predictions for the test articles.

HINT: This model uses word count as a feature. This means that you need to create this feature for your test data.
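
A minimal sketch of these steps (assuming the word_count feature and model m1 from earlier; the pipeline must match the one used on the training data):

# Word counts for the test papers
test_counts <- test |>
  mutate(text = removePunctuation(text)) |>
  unnest_tokens(word, text) |>
  anti_join(stop_words) |>
  filter(!grepl('[0-9]', word)) |>
  count(paper, name = "word_count")

test <- test |>
  left_join(test_counts, by = "paper")

# Predicted probabilities and authors under a .5 threshold
test_probs <- predict(m1, newdata = test, type = "response")
test_preds <- ifelse(test_probs >= .5, "Madison", "Hamilton")

# A table of predictions to format professionally
data.frame(paper = test$paper, predicted_author = test_preds)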

Question 10

Use your second model (with the top words as features) to make predictions for all \(n^*=15\) papers in the test data using a threshold of .5. Create a professionally formatted table to show the predictions for the test articles.

HINT: This model uses counts of 14 specific words as a feature. This means that you need to create these features for your test data.
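
The topWordsData() function expects a classifier argument even though the test papers have no known author. One workaround is to pass a placeholder (a minimal sketch; the NA placeholder and object names are my choices, and this assumes the model m2 from Question 7):

# Build the 14 word-count features for the test papers
Top10test <- topWordsData(test$text, rep(NA, nrow(test)), top10)

# Caution: if one of the 14 words never appears in the test papers,
# that column will be missing and you will need to add it as zeros
# before predicting.

# Predicted probabilities and authors under a .5 threshold
test_probs2 <- predict(m2, newdata = Top10test, type = "response")
test_preds2 <- ifelse(test_probs2 >= .5, "Madison", "Hamilton")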

Question 11

Using the training data, build a logistic regression model which uses both word count and the 14 specific words as features. Use this third model to make predictions for all \(n^*=15\) papers in the test data using a threshold of .5. Create a professionally formatted table to show the predictions for the test articles.
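
A minimal sketch of the combined model (this assumes the word_count columns created earlier for both the training and test data, and that the rows of Top10 and Top10test are in the same order as the papers they came from):

# Combine the 14 word features with word count for the training data
Both <- data.frame(Top10, word_count = Federalist$word_count)

# Third logistic regression model
m3 <- glm(y ~ ., data = Both, family = binomial)

# Matching features for the test papers, then predictions
Both_test <- data.frame(Top10test, word_count = test$word_count)
test_probs3 <- predict(m3, newdata = Both_test, type = "response")
test_preds3 <- ifelse(test_probs3 >= .5, "Madison", "Hamilton")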

Question 12

If you were asked to choose only one of these models to use in practice, which would you use? Justify your model choice to your historian client.

Question 13

Based on your results, do you think Hamilton was being truthful when he claimed that he wrote all 15 of the articles in the test set? Explain why or why not (again, think about writing to your client, who is an historian).

Final Steps

You are now ready to submit! This will be graded like a project, so make sure you use complete sentences, check your spelling, format all your output professionally, and label your graphs appropriately.

References

The data set used in this lab was downloaded from https://github.com/nicholasjhorton/FederalistPapers, the GitHub repository of Dr. Nicholas J Horton. Citation: Horton, Nicholas J. Federalist Papers. Retrieved July 19, 2024 from https://github.com/nicholasjhorton/FederalistPapers/.

Creative Commons License
This activity was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2024 July 19.