STA 279 Lab 8
Complete all Questions.
Formatting your lab
When you open your Markdown file, your first chunk likely looks like:
Change it to:
Question 1
This actually has nothing to do with the lab :)
If you have not already done so, fill out the form to tell me which data set you want to use for Data Analysis 2: https://docs.google.com/forms/d/e/1FAIpQLSdyaMbH7-bLz6ZI9Nq4RtLYMQt7cK4NNOU1aGCJw3cEXg2uJg/viewform?usp=header
If you are planning to work with one partner for Data Analysis 2, put your name and your partner’s name here. If you are not planning to work with a partner, just write “Working alone”.
Note: If you say you are working with a partner, you MUST work with them IN CLASS for at least 2 of the 3 in class work days for Data Analysis 2. In other words, you MUST contribute fully if you plan to work with a partner. If you do not meet this requirement, you will be required to submit your own Data Analysis using a different data set from your partner. If this poses an issue for you or your partner, please let me know.
The Data Set
One important task that is unique to text analysis is the detection of hate speech: text that directly and negatively targets a specific group, individuals with certain traits, or a particular individual.
Today, we are going to work with the text of \(n = 3500\) online X posts. Each post has been labelled as hate speech, offensive speech (but not hate speech), or neither. These labels were obtained by having 3 different trained individuals read the text and assign one of the 3 labels. In this case, a text was labeled as hate speech if at least one rater identified the text as hate speech.
We have two goals for our work today:
- Goal 1: Determine what traits distinguish hate speech from other text. In other words, what traits of a text seem to be associated with a label of hate speech?
- Goal 2: Predict whether or not a post is hate speech.
We have two data sets today, a training data set and a test data set. You can load them using the code below.
# Load the training data
train <- read.csv("https://www.dropbox.com/scl/fi/jnwph6015l4lrt3vbm1ds/hatespeech_train_fs.csv?rlkey=nosk4imet576l7k3xgkq2se3d&st=cehbw3vb&dl=1")
colnames(train)[1] <- "label"

# Load the test data
test <- read.csv("https://www.dropbox.com/scl/fi/qgkha7zru7ppc0ssfrzdx/hatespeech_test_fs.csv?rlkey=3hll0wibh6su79wqpvff58x8d&st=6oa57mcd&dl=1")
colnames(test)[1] <- "label"

The training data set should have \(n = 3500\) rows, the test set should have \(n^{*} = 1400\) rows, and both data sets should have 28 columns.
There are 27 features, all derived from LIWC:
- swear: % of words that are swear words.
- ethnicity: % of words related to ethnicity (example: american, french, etc.).
- Dic: % of words in the post that are in the LIWC dictionary.
- tone_pos: a score for positive tone; higher scores mean a more positive tone.
- BigWords: % of words with more than 6 letters.
- socrefs: % of words that are social connection references (you, we, he, she, friend, parent, etc.).
- female: % of words that are female references (she, her, woman, etc.).
- sexual: % of words that have sexual references.
- Affect: a score indicating words about emotions.
- tone_neg: a score for negative tone; higher scores mean a more negative tone.
- WC: the number of words in the post.
- Physical: % of words that relate to physical topics (body, eye, arm, doctor, etc.).
- power: % of words related to power (own, order, allow, power, etc.).
- food: % of words related to food.
- negate: % of words that are negations (not, no, never, nothing, etc.).
- ppron: % of words that are personal pronouns (I, you, my, me, etc.).
- verb: % of words that are verbs (is, was, be, have, etc.).
- focuspast: % of past-focus words (was, had, were, been, etc.).
- acquire: % of acquire words (get, got, take, getting, etc.).
- det: % of words that are stop words.
- need: % of words/phrases that are about need (have to, need, had to, must, etc.).
- Exclam: % of exclamation marks.
- WPS: words per sentence.
- auxverb: % of words that are auxiliary verbs (is, was, be, have, etc.).
- space: % of words about space (in, out, up, there, etc.).
- illness: % of words about illness (hospital, cancer, sick, pain, etc.).
- Culture: % of words about politics, culture, or technology.
We are hoping to use these text features to describe how hate speech posts are different from other posts.
In addition to these features, we have our response variable
(label) which identifies whether an X post is hate speech,
offensive but not hate speech, or neither. This means label
is a categorical variable with 3 levels. However, when we read in the
data, it turns out that the variable label is not
considered to be a categorical variable by R. We can see that by using
the class function:
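For example, checking the training data:

# Check how R is currently storing the label column
class(train$label)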
If you get anything other than factor as a result when
you run the class() function, this means R does not know
that we want to treat the variable as a categorical variable. To fix
this, we use the following code. It also allows us to set the ordering
for the labels of the categorical variable, which will be helpful when
we model later.
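A sketch of that conversion is below; the level names are an assumption, so match them to the labels that actually appear in your data:

# Convert label to a factor and set the order of its levels
# NOTE: the level names here are an assumption; use the labels in your data
train$label <- factor(train$label, levels = c("Neither", "Offensive", "Hate Speech"))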
Question 2
Convert the response variable in the test data to a factor. As an answer to this question, show a table of the variable.
Question 3
What is the baseline for \(Y\) = label?
Hint: Remember we just set the order for our response variable.
For our work today, there are a few packages we will need. Go ahead and
load those now. REMEMBER: If you get an error that says
your computer does not have or cannot find a package, go to the top of
your screen (the right of your camera for my Apple folks) and choose
Tools. From there, choose Install Packages.
From there, type in the name of the packages you need and install!
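The exact list is up to you, but based on the functions that appear later in this lab, you will likely need something like the following (this list is an assumption; add or remove packages as needed):

# Packages used in this lab (an educated guess based on the functions used below)
library(nnet)      # multinom()
library(MASS)      # lda()
library(dplyr)     # %>% and select(), used by the lda_scatter() function
library(ggplot2)   # plotting, used by the lda_scatter() function
library(e1071)     # naiveBayes() (assumption)
library(corrplot)  # correlogram (assumption)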
Multinomial Regression
We are going to start off with multinomial regression. Yes, I know, we already did a lab on multinomial regression. However, we have just learned how we can do inference with multinomial regression. Recall that our first goal is:
- Goal 1: Determine what traits distinguish hate speech from other text. In other words, what traits of a text seem to be associated with a label of hate speech?
Without inference, we could only answer the question of what traits are related to being labeled hate speech in the sample of data that we have. With inference, we can use our sample data to assess what might be traits of hate speech in the population of all posts.
Sample Analysis
Even when the goal is inference, the first step is still to fit a model to our sample data.
Question 4
Fit a multinomial regression model using all the features in
train. Call the model multimodel. As the
answer to this question, show the coefficients in your model using the
code below:
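A sketch of one way to produce such a table, assuming your fitted model is called multimodel (the rounding is an assumption):

# Show the coefficients, transposed so each column is one component of the model
knitr::kable(round(t(coefficients(multimodel)), 4))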
The way I have formatted the coefficients in the table above is different from how we usually get them in raw R output. Each column represents one component of the model, either for Offensive vs. Neither or Hate Speech vs. Neither. Each row represents a feature in the model. This structure just makes sure all our coefficient estimates are visible, as a 2 x 28 table would run off the page!
The output of our regression model can help us with our first goal for today, which is identifying traits that seem to be related to a post being labelled as hate speech. In the output, a positive coefficient for a level indicates a positive relationship with the chances of being that level of \(Y\) instead of the baseline. A negative coefficient indicates that as the feature increases, the chances of being that level of \(Y\) rather than baseline decreases.
Note: Multinomial regression models the log relative risk, but we know that when the log relative risk of being level A versus level B increases, this also means the probability of being level A versus level B also increases!! This is why when we are talking more casually, we can use phrasing like “the chances of being A versus B are increasing”. When we are doing formal interpretations, you still need to write in terms of the log relative risk or relative risk!
Question 5
Based on our model output, list at least 4 traits of posts that seem to be associated with higher chances of being hate speech instead of neither.
Confidence Intervals
Now that we have explored our sample results, we can move on to inference. We may see a relationship between features and hate speech labels in these data, but does that mean that we have evidence to suggest a relationship in the population of all posts?
When we need to use information from a sample to draw conclusions about a population, we reach for inference techniques like a confidence interval.
Question 6
What is the definition of a confidence interval?
There are a lot of different ways to create confidence intervals, but one that works for a lot of different methods involves bootstrapping. The steps involved are:
- Create a bootstrap sample from our training data.
- Fit our model to that sample.
- Obtain the statistic of interest (for us a \(\hat{\beta}\)) from the model.
- Repeat Steps 1 - 3 many times.
- Report the range of the middle 95% of the sample statistics as our 95% bootstrap CI.
Let’s start with 1) and create a bootstrap sample. As an example,
let’s look at just the first few rows in train, and only
one feature.
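One way to pull out those rows (rows 2 through 5, matching the table below):

# Look at a few rows of the training data, keeping only label and tone_pos
train[2:5, c("label", "tone_pos")]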
|  | label | tone_pos |
|---|---|---|
| 2 | Neither | 13.04 |
| 3 | Neither | 10.53 |
| 4 | Neither | 15.38 |
| 5 | Offensive | 0.00 |
Question 7
Provide an example of a bootstrap sample you could create from the rows of data above.
Question 8
When we create a CI for a coefficient, we actually create a bootstrap
sample from train. How many rows would such a bootstrap
sample have in it?
The whole point of creating bootstrap samples is to allow us to see how our estimates of coefficients might change if we fit a model on different samples from the same population. Looking at the variation in the different estimates across many such samples allows us to get a good idea of where the population parameter might be.
For example, suppose we are interested in \({\beta}_1\), which is the coefficient for the percent of words in a post that are swear words. In our sample data, we have:
\[\hat{\beta}_{1(Offensive)} = 0.2589, \hat{\beta}_{1(HateSpeech)} = 0.2874\]
If we fit the model to different bootstrap samples, we get:
Offensive Component
0.2933, 0.2469, 0.2739, 0.2692, 0.2675, 0.2559, 0.2878, 0.2843, 0.3384, 0.3106…
Hate Speech Component
0.2676, 0.2151, 0.2535, 0.2469, 0.2358, 0.2278, 0.2608, 0.2578, 0.3300, 0.2635…
This gives us an idea of what values we expect to see for this coefficient in each component of the model on many different samples. To create a 95% bootstrap CI, we order the coefficient estimates within each component from smallest to largest and report the middle 95%.
Let’s give this a try in R. I have built the following function for you which will complete this process. Copy and paste the following into a chunk and press play.
randomizationCI <- function(data, modelin, Y, beta){
  # Find which column holds the response variable
  # (the modelin argument is not used inside the function)
  Y = which(colnames(data)==Y)
  # Storage for the bootstrap estimates of the coefficient of interest
  coefs1 <- NA
  coefs2 <- NA
  for( i in 1:100){
    set.seed(i)
    # Step 1: Create a bootstrap sample
    bootsample <- sample( 1:nrow(data), nrow(data), replace = TRUE)
    bootsample <- data[bootsample,]
    # Re-draw if any column in the bootstrap sample has only one unique value
    check <- apply(bootsample, 2, function(x) length(unique(x)))
    while( length(which(check==1)) > 0){
      bootsample <- sample( 1:nrow(data), nrow(data), replace = TRUE)
      bootsample <- data[bootsample,]
      check <- apply(bootsample, 2, function(x) length(unique(x)))
    }
    # Step 2: Fit the model to the bootstrap sample
    mboot <- multinom( bootsample[,Y] ~ ., data = bootsample[,-Y], trace = FALSE)
    coefs <- data.frame(t(coefficients(mboot)))
    # Step 3: Store the coefficient of interest from each component
    coefs1 <- c( coefs1, coefs[beta+1,1])
    coefs2 <- c( coefs2, coefs[beta+1,2])
  }
  # Drop the NA placeholders used to start the storage vectors
  coefs1 <- coefs1[-1]
  coefs2 <- coefs2[-1]
  # Now, we can make the 95% CI for the Level 2 (vs. baseline) component
  upper1 <- round(quantile(coefs1, .975 ),3)
  lower1 <- round(quantile(coefs1, .025 ),3)
  level1 <- levels(data[,Y])[2]
  # ...and the 95% CI for the Level 3 (vs. baseline) component
  upper2 <- round(quantile(coefs2, .975 ),3)
  lower2 <- round(quantile(coefs2, .025 ),3)
  level2 <- levels(data[,Y])[3]
  # Format the output
  row1 <- c(level1, lower1,upper1)
  row2 <- c(level2, lower2,upper2)
  holder <- rbind(row1,row2)
  holder <- data.frame(holder)
  rownames(holder) <- c("Level 2", "Level 3")
  colnames(holder) <- c("Label", "Lower", "Upper")
  holder
}

To use the function, we use:
# Build the Bootstrap CI
out <- randomizationCI(data, modelin, Y, beta)
# Format the output
knitr::kable(out)

where

- data = the name of our data set.
- modelin = the name of the model you are using (for us: multimodel).
- Y = the name of the column holding your \(Y\) variable, in quotes.
- beta = the number of the \(\beta\) we are interested in. If we are interested in \(\hat{\beta}_{1(Offensive)}\) and \(\hat{\beta}_{1(HateSpeech)}\), for example, we use 1.
Question 9
Adapt the code above to create a 95% bootstrap CI for \(\hat{\beta}_{1(Offensive)}\) and \(\hat{\beta}_{1(HateSpeech)}\).
Note: Running this code will take a minute!!!
When the code is done running, copy and paste the output you get from
the confidence intervals into the white space in the answer to this
question. Then, put hash tags (#) in front of the code you
ran to produce the bootstrap CIs. This will stop the code from running
when you knit your file, so you don’t have to wait!
Question 10
Based on your CIs from Question 9, does it look like there is a
- positive population relationship
- negative population relationship, or
- no population relationship
between the percent of swear words and the chances of being labeled as hate speech rather than neither? Explain briefly.
We have just demonstrated both the positive and the negative sides of this approach. Bootstrap confidence intervals require minimal assumptions, so they can be used with a wide variety of models. However, a key downside of bootstrapping is that it can be slow. In industry and other professional settings, this is mitigated by powerful computing clusters and parallel computing that can handle this sort of work. We, however, are stuck with our personal computers. With this in mind, is there something that might work a little faster?
A Faster Code
As a nod to the practical constraints of using personal computers, we will use the code below to approximate a bootstrap CI.
For example, suppose we want to show the 95% CIs for all coefficients in the Offensive component of the model. To do this, we use:
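The original chunk is not shown above; one option that produces intervals like these quickly is the Wald-type interval from confint(), sketched here under the assumption that the second level of label is named Offensive and the model is called multimodel. These are asymptotic intervals that approximate what the bootstrap gives, not bootstrap intervals themselves.

# 95% CIs for every coefficient in the Offensive component of the model
# (a sketch; Wald-type intervals, not bootstrap intervals)
knitr::kable(round(confint(multimodel)[ , , "Offensive"], 3))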
If you are only interested in a certain coefficient, like \(\hat{\beta}_{1(Offensive)}\), we can add a component to the code to focus on that coefficient:
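For instance, a sketch for pulling out just the swear coefficient (again assuming the level name Offensive):

# 95% CI for the swear coefficient in the Offensive component only
round(confint(multimodel)["swear", , "Offensive"], 3)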
Question 11
Show a formatted table of the 95% CIs for the Hate Speech component of the model.
Question 12
Using your CI, explain to a sociologist who is interested in predicting hate speech (but who does not know statistics) what traits of a post seem to be related to it being labelled hate speech rather than neither.
Hint: This means you should be saying things like posts with high this and low that are more likely to be labelled hate speech.
Question 13
Did anything in the list of significant features surprise you? Note: There is no right or wrong answer to this, but it is often a question we get when working with clients!
Linear Discriminant Analysis
So far, we have worked with a model that we already know, just in a slightly new way. In this next section of the lab, we are going to work with something brand new: linear discriminant analysis (LDA).
I will be clear upfront that the mathematics of how this approach works is beyond the scope of our course. It requires understanding eigenvalues, eigenvectors, and other work with matrices. This means we are not going to focus so much on how it works, but on how we can use it.
The goal of LDA is similar to the goal of classification trees. We will use our features to create clusters in the data, with the goal of using the features to define a cluster containing hate speech posts, or neither posts, or offensive posts.
To see what I mean by this, let’s go ahead and fit LDA to our data. The code we need is very short:
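A minimal sketch of the fit, assuming the lda() function from the MASS package (the object name lda_out matches the prediction code later in this lab):

# Fit LDA with label as the response and all 27 features
lda_out <- lda(label ~ ., data = train)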
As a result of LDA, we have three distinct clusters, one for each level of \(Y\). To see this, we can use:
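A sketch of one way to display the cluster means, assuming the table below was built from lda_out$means (transposed so features are rows):

# Average value of each feature within each cluster, rounded to 2 decimal places
knitr::kable(round(t(lda_out$means), 2))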
|  | Neither | Offensive | Hate Speech |
|---|---|---|---|
| swear | 1.15 | 11.08 | 12.23 |
| ethnicity | 0.43 | 0.13 | 0.59 |
| Dic | 79.40 | 87.75 | 86.88 |
| tone_pos | 5.31 | 5.23 | 4.15 |
| BigWords | 18.00 | 14.08 | 15.38 |
| socrefs | 7.63 | 10.36 | 10.86 |
| female | 0.82 | 3.27 | 2.25 |
| sexual | 0.16 | 1.26 | 1.55 |
| Affect | 10.04 | 20.03 | 20.72 |
| tone_neg | 4.02 | 5.12 | 6.10 |
| WC | 18.57 | 17.02 | 17.47 |
| Physical | 2.69 | 3.81 | 4.59 |
| power | 1.46 | 0.77 | 1.23 |
| food | 1.06 | 0.55 | 0.36 |
| negate | 1.85 | 2.60 | 2.36 |
| ppron | 9.54 | 13.02 | 12.81 |
| verb | 14.81 | 15.92 | 15.72 |
| focuspast | 2.42 | 2.27 | 2.23 |
| acquire | 0.83 | 1.16 | 0.94 |
| det | 11.36 | 11.97 | 11.55 |
| need | 0.42 | 0.48 | 0.37 |
| Exclam | 1.34 | 1.24 | 1.31 |
| WPS | 12.86 | 13.13 | 13.05 |
| auxverb | 7.59 | 7.40 | 7.08 |
| space | 4.40 | 3.92 | 3.67 |
| illness | 0.09 | 0.11 | 0.18 |
| Culture | 2.40 | 0.98 | 1.43 |
The result of running the code above is a matrix with 27 rows (one for each feature) and 3 columns (one for each cluster). The numbers indicate the average value of each feature within each cluster. For example, for posts in the Neither cluster, on average 1.15% of words are swear words. For posts in the Offensive cluster, on average 11.08% of words are swear words. In other words, the mean swear word percentage is nearly 10 times larger in the Offensive cluster than in the Neither cluster.
Question 14
For posts in the Hate Speech cluster, on average what percent of words are swear words?
We cannot build confidence intervals with LDA as we can with multinomial regression, but we can still use the results to describe relationships highlighted in our sample. By looking at how the means differ across the clusters, we can see key features that define each label (neither, offensive, hate speech).
Question 15
Consider the feature female, which describes the
percentage of words related to females or being female. What does LDA
suggest about the relationship of this feature with the labels?
Question 16
Consider the feature food, which describes the
percentage of words related to food topics. What does LDA suggest about
the relationship of this feature with the labels?
One way to assess how well LDA is able to separate the data into clusters is by looking at a visualization of the clusters. To define the function R needs to create this plot, copy and paste the following into a chunk and press play:
lda_scatter <- function(lda){
  # Percent of between-group variance explained by each linear discriminant
  var_ex <- (lda$svd)^2/sum(lda$svd^2) * 100
  per_ex_1 <- paste0("(", round(var_ex[1], 2), "%)")
  per_ex_2 <- paste0("(", round(var_ex[2], 2), "%)")
  # Scores for each post on LD1 and LD2, plus the assigned class
  scores <- predict(lda) %>%
    data.frame() %>%
    select(class, x.LD1, x.LD2)
  # Symmetric plotting limits for each axis
  max_x <- scores[,2] %>% abs() %>% max() %>% ceiling()+.5
  max_y <- scores[,3] %>% abs() %>% max() %>% ceiling()+.5
  # Scatter plot of the scores, colored by class
  p1 <- ggplot() +
    geom_hline(yintercept = 0, linewidth = .25) +
    geom_vline(xintercept = 0, linewidth = .25) +
    geom_point(data = scores, aes(x = x.LD1, y = x.LD2, fill = class),
               shape = 21, size = 1.5) +
    viridis::scale_fill_viridis(discrete = T) +
    xlab(paste0("LD1", " ", per_ex_1)) +
    ylab(paste0("LD2", " ", per_ex_2)) +
    ylim(-max_y, max_y) +
    xlim(-max_x, max_x) +
    theme_linedraw() +
    theme(panel.grid.minor.x = element_blank()) +
    theme(panel.grid.minor.y = element_blank()) +
    theme(panel.grid.major.x = element_blank()) +
    theme(panel.grid.major.y = element_blank())
  return(p1)
}

Citation for this function: DeLuca, L. S., Reinhart, A., Weinberg, G., Laudenbach, M., Miller, S., & Brown, D. W. (2025). Developing Students’ Statistical Expertise Through Writing in the Age of AI. Journal of Statistics and Data Science Education, 33(3), 266–278. https://doi.org/10.1080/26939169.2025.2497547
To plot the output, we then use:
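Assuming your LDA object is called lda_out, as above:

# Draw the LDA cluster plot
lda_scatter(lda_out)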
In this picture, we are hoping to see clearly defined clusters, which means the model is able to effectively divide the text into groups.
If we look a little deeper, we will notice that there are two axes labelled LD1 (linear discriminant 1) and LD2 (linear discriminant 2). The percentage with those labels essentially says how important that axis is in defining the clusters (telling the labels apart).
Question 17
The x-axis LD1 has a larger percentage than the y-axis LD2, which indicates it is more important in separating the clusters. What about the picture backs that up?
We now have two different approaches that we can use to discuss what features might be related to a label of hate speech. In practice, it is particularly powerful if multiple techniques indicate that the same features are important - this means that regardless of the technique we choose, we are getting similar results.
Question 18
Are the features that are important in separating clusters in LDA results similar to the features that showed up as important in multinomial regression? Explain briefly.
Predictions
Recall that we have two goals today:
- Goal 1: Determine what traits distinguish hate speech from other text.
- Goal 2: Predict whether or not a post is hate speech.
So far, we have focused on Goal 1, but both multinomial regression and LDA can be used to make predictions. Before we get to that, let’s add one more option: Naive Bayes.
Naive Bayes
Naive Bayes can be used for prediction, but it cannot be used for describing relationships in the data, which is why we have not utilized it until now.
Naive Bayes requires a response variable (check) and at least one feature. For learning purposes, let’s create a categorical feature \(X\) which indicates whether or not more than 10% of words in a post are swear words.
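A sketch of one way to build this feature; the name swearLarge matches the table code below, and the 0/1 coding matches the notation \(x_i = 1\) used next:

# Flag posts where more than 10% of the words are swear words (1 = yes, 0 = no)
swearLarge <- ifelse(train$swear > 10, 1, 0)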
Probabilities in Naive Bayes are computed using Bayes Rule. For example, if we want to know the probability of being hate speech, given that the post contains 12% swear words, we would compute:
\[P( y_i = Hate~Speech~|~x_i = 1) = \frac{P( y_i = Hate~Speech~and~x_i = 1)}{P(x_i=1)}\]
We can get the pieces we need to compute this probability using a table of \(X\) versus \(Y\).
knitr::kable(table(train$label, swearLarge))
Question 19
For Naive Bayes, what is the probability of being hate speech for a post with 8% swear words? Round to 4 decimal places.
Question 20
For Naive Bayes, what is the probability of being hate speech for a text with 12% swear words?
Question 21
For Naive Bayes, if a text has 12% swear words, do we predict the text is (1) hate speech, (2) offensive but not hate speech, or (3) neither?
How stable is this prediction? In other words, how confident are you in the prediction based on the probabilities?
So far, we have only used one \(X\) for Naive Bayes. However, in practice we can use all the features in the data set! To run Naive Bayes in R, we need two steps.
Step 1: Set a Seed
You can replace 279 with any positive number you like. This is needed in case of ties (meaning two or more classes for \(Y\) with the same probability). In this case, R randomly chooses one of the tied levels, and setting a seed makes sure the computer chooses the same one each time.
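For example:

# Set a seed so any ties are broken the same way every time we run the code
set.seed(279)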
Step 2: Running Naive Bayes
This is similar to other code we have seen.
- class: the name of our \(Y\) variable (for us, this is label).
- .: tells R to use all other columns in the data set as features.
- train: the name of our data set.
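The lab's original chunk is not reproduced here; a minimal sketch, assuming the naiveBayes() function from the e1071 package and our response column label, is:

# Fit Naive Bayes with label as the response and all other columns as features
# (a sketch; assumes the e1071 package, which the lab's own chunk may or may not use)
NBmodel <- naiveBayes(label ~ ., data = train)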
Naive Bayes is not a model like logistic regression or multinomial regression - there is no fitted model to write down. There is also nothing to visualize as there is with a classification tree. This is purely a method we can use to make predictions.
To make predictions on the test data, we use
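a sketch, assuming the fitted object NBmodel from above (preds is the name used in the confusion matrix code below):

# Predict the label for each post in the test set, feeding in only the features
preds <- predict(NBmodel, newdata = test[,-1])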
The code test[,-1] tells R to use all the columns in
test as features except the first one. This is needed because the first
column holds our response variable (label), and this is not
a feature!! For Naive Bayes, we can only feed in features when we make
predictions, or the code will give you an error. To compare the
predictions to the truth, we can make our standard confusion matrix:
holder <- table(preds, test$label)
colnames(holder) <- c("True: Neither", "True: Offensive", "True: Hate Speech")
rownames(holder) <- c("Pred: Neither", "Pred: Offensive", "Pred: Hate Speech")
knitr::kable( holder )

Question 22
Take a look at the confusion matrix. Without using any metrics (meaning do not talk about F1-scores or sensitivity or specificity or any such technical terms), explain to a sociologist who is interested in predicting hate speech how well Naive Bayes is able to predict hate speech.
If Naive Bayes does not predict as well as we hoped, it can help to think about the assumption we made when using this approach. Naive Bayes assumes that our features are independent of one another. In practice, this is rarely true, which is why the technique is called Naive Bayes.
Let’s check to see if our features are independent. One of the quickest ways to do this is by using a correlation plot called a correlogram (one of Dr. Dalzell’s favorite plot names).
We only need to change one thing to use this plot:
features: a data set containing ONLY the NUMERIC features we are using for modeling.
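The plotting chunk itself is not shown above; a minimal sketch, assuming the corrplot package (the color scheme in your plot may differ from the blue/orange described below):

# Correlogram of the 27 numeric features (everything except the label column)
features <- train[ , -1]
corrplot::corrplot(cor(features), method = "circle")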
Question 23
Create the correlogram for the training data features, and show the plot.
Features that are correlated will have a large dark circle in the box corresponding to the intersection of the two feature names on the plot. Blue circles indicate positive correlations while orange circles indicate negative correlations.
Question 24
Do any features have notable (meaning strong) correlations? If so, which pairs of features and what are the correlations?
Even when its assumptions are violated, Naive Bayes often works well in practice. It is quick and easy, and it gives us a baseline against which we can compare more complicated approaches to see how much we gain by using a more complex model.
Comparisons
Now that we have used the methods to describe what is going on in the data, let’s make predictions for our test data.
Multinomial Regression
holder <- table(predict(multimodel, type = "class", newdata = test), test$label)
colnames(holder) <- c("True: Neither", "True: Offensive", "True: Hate Speech")
rownames(holder) <- c("Pred: Neither", "Pred: Offensive", "Pred: Hate Speech")
knitr::kable( holder )

LDA
holder <- predict(lda_out, newdata = test)$class
holder <- table(holder, test$label)
colnames(holder) <- c("True: Neither", "True: Offensive", "True: Hate Speech")
rownames(holder) <- c("Pred: Neither", "Pred: Offensive", "Pred: Hate Speech")
knitr::kable( holder )

Naive Bayes
holder <- table(preds, test$label)
colnames(holder) <- c("True: Neither", "True: Offensive", "True: Hate Speech")
rownames(holder) <- c("Pred: Neither", "Pred: Offensive", "Pred: Hate Speech")
knitr::kable( holder )

Question 25
Take a look at the confusion matrices produced by the codes above. Without using any metrics (meaning do not talk about F1-scores or sensitivity or specificity or any such technical terms), explain to a sociologist who is interested in predicting hate speech how effective using features created by the LIWC seems to be as a way to detect hate speech.
References
Data
The data is a subset from https://github.com/t-davidson/hate-speech-and-offensive-language, and was retrieved November 12, 2025. The final data set used here was processed through LIWC to create the features.
Original Data Set Citation: Davidson, Thomas and Warmsley, Dana and Macy, Michael and Weber, Ingmar (2017). Automated Hate Speech Detection and the Problem of Offensive Language. In Proceedings of the 11th International AAAI Conference on Web and Social Media (pp. 512-515).
LIWC Features
Boyd, R. L., Ashokkumar, A., Seraj, S., & Pennebaker, J. W. (2022). The development and psychometric properties of LIWC-22. The University of Texas at Austin. https://www.liwc.app
Plotting LDA
DeLuca, L. S., Reinhart, A., Weinberg, G., Laudenbach, M., Miller, S., & Brown, D. W. (2025). Developing Students’ Statistical Expertise Through Writing in the Age of AI. Journal of Statistics and Data Science Education, 33(3), 266–278. https://doi.org/10.1080/26939169.2025.2497547