STA 279 Lab 3
Complete all Questions. No part of this lab may be used to train AI models.
The Goal
In our last class, we used Naive Bayes to build a clickbait detector. Today, we are going to use the same tool to build a simple detector for Generative AI.
Generative AI is called “generative” because the technology is capable of creating (or generating) text, images, and other outputs using the collection of human created content available online. One impact of this ability that is becoming increasingly problematic is the creation and publication of AI fabricated articles, meaning text that is completely generated using AI. Sometimes these articles contain factual information, but generative AI is also being used to create very realistic looking fake news.
We have a client who owns a publishing company, and one of the first ways that potential new publications reach their desk is through introductions (the first paragraph or first few paragraphs of an article). In other words, people hoping to publish send our client introductions, which can be read quickly, and the client uses these to help determine which articles they might be interested in reading in full to further consider for publication.
However, the client is running into a problem: some of the submitted introductions are 100% generated by AI, sometimes with no research content to support the proposed paper. They want us to help them learn to distinguish between text generated by AI and text written by humans.
We are provided with a training data set of 150 introductions that have been labelled as generated by AI or written by professionals who published in the journal recently. We are also provided a test data set with 50 additional introductions. The client has asked us to (1) use Naive Bayes to build a detector to find AI written introductions and (2) clearly illustrate how well the detector functions on the test data set.
The Data
To read in the data sets, use the following code:
test <- read.csv("https://www.dropbox.com/scl/fi/ih99dw8gd1gihlg7mbpr4/testAI.csv?rlkey=92mik16plm1ppu6as94khrph0&st=ydzsm3sw&dl=1")
train <- read.csv("https://www.dropbox.com/scl/fi/4618f4lwhkjyygqiduar2/trainAI.csv?rlkey=s5g9pgjq5gk5heg37pd2zn7q4&st=25o9v720&dl=1")
train$author <- as.factor(train$author)
test$author <- as.factor(test$author)

The columns are:

text: the text of the introduction.
author: either Chat GPT or published (which means human!).
ID: the number given to the introduction; think of this like an identifier for the introduction.
Once you have loaded the data, load the packages you will need for this lab:
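The exact packages may differ from the ones we used in class, but a typical setup for the word counting and text cleaning steps below is:

# Packages for data wrangling and text analysis
library(tidyverse)
library(tidytext)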
To build our detector, we need two things: a response variable \(Y\) and at least one feature \(X\). We know that our response variable \(Y\) is whether the author is Chat GPT or a published author (human!). Today, we are building an AI detector, so we want \(y_i = 1\) when the introduction was written by Chat GPT.
However, when we load a data set with a binary variable, R chooses which levels are \(y_i = 0\) and \(y_i=1\) itself! To figure out what level of a categorical variable R calls 0 or 1, we use the levels function. Whichever level shows up first is \(y_i = 0\) and whichever shows up 2nd is \(y_i = 1\).
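For example, to see the level ordering R chose for the author column in the training data, we can run:

# The first level listed is y = 0; the second is y = 1
levels(train$author)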
Question 1
Which level is R using as \(y_i = 1\): written by AI or written by humans? Is this the level we want to be \(y_i = 1\)?
If we want to change the levels R has chosen, we can use the relevel function. Whichever level we put in the ref = part of the code will be the baseline level (\(y_i =0\)) for the variable.
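As a sketch, the code below makes published the baseline level; whether that is the right choice depends on your answer to Question 1, and you can check levels(train$author) for the exact spelling of the level names.

# Make "published" the baseline (y = 0) in both data sets
train$author <- relevel(train$author, ref = "published")
test$author <- relevel(test$author, ref = "published")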
Feature Engineering
Now that we have \(Y\), we need a feature. There are a lot of different features we could use, but for now, let’s stick with features created by words. In other words, we are going to choose a word that we think might be helpful in distinguishing AI written text.
To determine what words we think might be helpful to use as features, let’s start by finding the top 5 most frequent words in the training introductions written by AI.
Question 3
Do you think we should remove stop words before finding the top 5 most frequent words? Why or why not?
Question 4
Write and run code to find the top 5 most frequent words in introductions written by AI after removing stop words. Show or state the top 5 words.
Let's start by using the most frequent word as our feature. This means we are going to create a feature that indicates whether or not this word is present in each introduction. As a reminder, the code we need to do this is below.
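A minimal sketch, assuming we use grepl() from base R to flag whether the word appears in each introduction (word here is a placeholder for your chosen word; in class we may have used a different function, such as str_detect()):

# Create a factor feature: does each introduction contain "word"?
train$word <- as.factor(ifelse(grepl("word", train$text, ignore.case = TRUE), "Yes", "No"))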
Question 5
Adapt the code above to create a feature indicating whether the most popular word is present in each introduction. Repeat the process with the test data. As the answer to this question, run the code below, but replace every instance of the placeholder word with the word you are using for your feature.
Naive Bayes
For Naive Bayes, we know that we have to compute the conditional probability of \(Y\) given each level of \(X\). We then use these to make our predictions.
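As a reminder, with a single binary feature, one way to estimate a conditional probability like this directly from the training data is:

\[P(\text{AI} \mid \text{word present}) = \frac{\text{number of AI introductions containing the word}}{\text{number of introductions containing the word}}\]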
Question 6
In the training data, what is the probability of being written by AI if the introduction contains the word “project”? Round to 4 decimal places.
Question 7
In the training data, what is the probability of being written by a human if the introduction contains the word “project”? Round to 4 decimal places.
Question 8
If an introduction in the test data contains the word “project”, would we predict it was written by a human or by AI?
This is how Naive Bayes works by hand, but in practice, it is more efficient to use R to help with the computations. To run Naive Bayes in R, we need the naivebayes library:
Remember: If your computer ever tells you a library is not found, this means you need to install the package from Tools.
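Once the package is installed, you can load it with:

# Load the naivebayes package, which provides naive_bayes() and its predict() method
library(naivebayes)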
The code we use to make predictions with Naive Bayes is then:
# Train Naive Bayes
nb <- naive_bayes( Y ~ X , data= ... , laplace = 1)
# Make predictions
yhat <- predict(nb, type = "class", newdata = ...)

Question 9
Fill in Y, X, and the ... in the code above to (1) train the Naive Bayes model on the training data using \(X\) = whether or not the introduction has the word project and (2) make predictions on the test data.
If you want to see the probabilities computed by Naive Bayes, you can run:
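One option, assuming the fitted model is stored in nb as in the code above, is to ask predict() for probabilities instead of classes:

# Predicted probability of each class for every introduction in the test data
predict(nb, newdata = test, type = "prob")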
The next step is to see how accurate our predictions are! We usually use a confusion matrix for this.
# Create the table
holder <- table( yhat, test$author)
# Add row names
rownames(holder) <- c("Predicted: Human", "Predicted: AI")
# Add column names
colnames(holder) <- c("Truth: Human", "Truth: AI")
# Format the table
knitr::kable( holder )

Question 10
What is the accuracy of our model?
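As a reminder, using TP, TN, FP, and FN to denote true positives, true negatives, false positives, and false negatives:

\[\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\]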
Question 11
What is the true positive rate (TPR) of our model?
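As a reminder:

\[TPR = \frac{TP}{TP + FN}\]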
Question 12
What is the true negative rate (TNR) of our model?
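As a reminder:

\[TNR = \frac{TN}{TN + FP}\]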
Question 13
What is the F1-score of our model?
\[F1 = \frac{TP}{TP+.5(FP + FN)}\]
Question 14
At this point, the client checks in and wants to see how we are doing. They ask us whether or not it seems like using "project" as a feature is useful in predicting AI-written introductions using Naive Bayes. What would you tell them? Respond in 2-3 sentences.
Now, before we get really excited…this is specific to this data set, not everything written by AI ever. However, it turns out that we can often find little characteristics like this that can help us distinguish between authors in text. We will see more of this soon!
Adding more features
The client is pleased with our results, but they wonder if the results could be improved by adding a few more features to the model. Specifically, they ask us to create 5 features using each of the top 5 words we found in Question 4.
Question 15
Add 4 new features to both the test and training data sets, one for each of the remaining words in the top 5 list from Question 4. As an answer to this question, state how many rows and columns are in train when you finish this process.
Question 16
Train a Naive Bayes model using the training data with all 5 of your features. Make predictions on the test data. As an answer to this question, show a professionally formatted confusion matrix.
Question 17
The client asks if you would recommend they use Model 1 (with only one feature) or Model 2 (with all 5 features) as their detector for AI. Explain your recommendation.
Next Steps
We have now seen how we can use words as features in prediction models. However, now that we know we can use more than one word, how do we decide how many words to use, or which words to use? And what if we want to include other features, like the number of question marks, or the number of words? This all brings us into the realm of feature selection, which we will start to explore next class.
References
Data
Citation: DeLuca, L. S., Reinhart, A., Weinberg, G., Laudenbach, M., Miller, S., & Brown, D. W. (2025). Developing Students’ Statistical Expertise Through Writing in the Age of AI. Journal of Statistics and Data Science Education, 33(3), 266–278. https://doi.org/10.1080/26939169.2025.2497547
Activity
This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2026 January 19.