STA 279 Data Analysis 1

When you open your Markdown file, your first chunk likely looks like:

knitr::opts_chunk$set(echo = TRUE)

Change it to:

knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE, fig.asp = 0.5)

AI Policy

You may NOT use generative AI (including ChatGPT, Gemini, or any other platform) to:

Produce/write code for this Data Analysis.
Produce/create figures / plots / images for this Data Analysis.
Write or refine ANY of the text you submit for this Data Analysis.

Any violation of these rules will result in a 0 on the Data Analysis. I will be in class with you as you work on the assignment, so if you are stuck, ask me or your partner. There is a lot of help available if you need it!

For your code, please note you must use code that we learned in class. If you want to use code learned in other courses or from outside sources, you must ask Dr. Dalzell before you do so.

The Goal

If you have chosen this Data Analysis, you have chosen to analyze data about political speeches. You can load the data you need using the code below:

# Load the training data
train <- read.csv("https://www.dropbox.com/scl/fi/znkjfmx588wqzyiyznwxl/politics_train.csv?rlkey=sxvhlwc81pzfwpvoed3yzhhsh&st=pr468v1m&dl=1")[,c(1,2,4)]

# Load the test data
test <- read.csv("https://www.dropbox.com/scl/fi/f1oyruhnjs0ta6q81h9hc/politics_test.csv?rlkey=19dhn2k0hu779iabgbqm38sxv&st=k34bf6ef&dl=1")[,c(1,2,4)]

The training data set should have \(n = 233\) rows and the test set should have \(n^{*} = 50\) rows, and both data sets should have 3 columns.

The columns are:

speaker: the person who gave the speech.
Date: the date the speech was given.
CleanText: the text of the speech.
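
One way to confirm that the data loaded as described is to check the dimensions and column names directly. The sketch below is illustrative only: check_shape is a hypothetical helper (it is not from the course materials), and the toy data frame stands in for the real data so that the code runs without a download.

```r
# Hypothetical helper (an assumption, not course code): TRUE when a data
# frame has the expected number of rows and contains the expected columns.
check_shape <- function(df, n_rows, col_names) {
  nrow(df) == n_rows && all(col_names %in% names(df))
}

expected_cols <- c("speaker", "Date", "CleanText")

# Toy stand-in so the sketch runs without downloading anything.
toy <- data.frame(
  speaker = c("A", "B"),
  Date = c("2020-01-01", "2020-01-02"),
  CleanText = c("first speech", "second speech")
)
check_shape(toy, 2, expected_cols)

# With the real data loaded, the analogous checks would be:
# check_shape(train, 233, expected_cols)
# check_shape(test, 50, expected_cols)
```

If either check on the real data returns FALSE, re-run the loading code before moving on.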

Section 1: What do they talk about?

NOTE 1: Throughout this assignment, you will notice labels like Section 1, Section 1.1, etc. You MUST use these labels in your final submission - this is how I will grade.

NOTE 2: This assignment should be written like a formal paper. This means you need transition sentences, like “In this section, we will examine how the text of the speeches changes over time.” You MUST have such sentences throughout your assignment to make sure the reader can follow your work.

Section 1:

In this section, you are going to focus on analyzing the speeches of just one speaker.

Choose one of the four speakers (totally up to you which one you choose!).
For this individual, create a well-formatted plot or table to show the top 10 words. (Remember, this ALWAYS means excluding stop words, even if I don’t specify that!)
Based on your plot or table, describe what this speaker seems to talk about most in their speeches.

NOTE: To be ready for the exam, you should be able to write the code for finding the top 10 words without looking at your notes or any other resources.

Section 2: What makes their speeches different?

Section 2:

In this section, you are going to compare all four speakers to see what makes the content of their speeches different from one another.

Find the top 10 words that distinguish the content of the four speakers.
Create a well-formatted plot (not a table) to show these 10 words.
Based on your plot, describe what seems to separate the content of the 4 speakers.

Section 3: What about emotion?

Section 3: Sentiment

One of the variables in the data set is the date that a speech was given on. We are going to explore how the sentiment of the speakers changes across the campaign trail.

Create and clearly state a research question that you might be interested in related to sentiment and time. For instance, is Harris ever more positive than Biden? You may choose any question you like that can be answered using these data except this one!
For this application, do you think the average sentiment score or total sentiment score is more appropriate to consider? Explain your reasoning.
Based on which you chose, compute a sentiment score using AFINN. Hint: If you choose average sentiment score, just change sentiment = sum(value) to sentiment = mean(value) in the code at the end of Lab 5. If you choose total sentiment score, you can leave the code at the end of Lab 5 alone.
Create a well-formatted plot to visualize the relationship(s) of interest highlighted in the research question.
Answer your research question based on the plot.

Hint:

This will work best if you convert Date to a date object in R:

train$Date <- as.Date(train$Date)
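
By default, as.Date() parses text written as year-month-day (for example, "2020-10-15"). How the Date column is stored in these data is not stated here, so the sketch below uses made-up date strings to show both the default case and the format argument for other layouts.

```r
# Made-up date strings for illustration; these are not taken from the data.
iso_text <- c("2020-10-15", "2020-11-03")
us_text <- c("10/15/2020", "11/03/2020")

# Year-month-day text parses with the default settings.
iso_dates <- as.Date(iso_text)

# Other layouts need an explicit format string.
us_dates <- as.Date(us_text, format = "%m/%d/%Y")

class(iso_dates)            # "Date"
all(iso_dates == us_dates)  # TRUE

# Once converted, dates can be compared and subtracted.
as.numeric(iso_dates[2] - iso_dates[1])  # 19 (days)
```

If the conversion produces an error or NA values, the column is likely in a different layout and needs a format string.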

At that point, you can use this to get your plot going:

ggplot(train, aes(x = Date, y = sentiment, col = speaker)) +
  geom_smooth(se = FALSE) +
  labs(x = "...", y = "...", title = "...")

Section 4: Prediction

Section 4: Prediction with Sentiment

Use Naive Bayes with \(Y\) = speaker and \(X\) = sentiment (created in Section 3) to predict which speaker gave each speech in the test data. Hint: You will need to create your sentiment feature in your test data!
Create a well-formatted confusion matrix to show how well the model is able to predict on the test data.
State the TPR, TNR, and accuracy of your model.
Based on all of this, describe how well your model is able to predict speaker using sentiment. This means commenting on when you predict well, and when you seem not to predict well.
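
For reference, the standard definitions of these metrics are sketched below. With four speakers, TPR and TNR are usually computed one speaker at a time, treating that speaker as the "positive" class and the other three together as the "negative" class. That one-versus-rest convention is an assumption here, not something this assignment specifies, so check your class notes for the form expected. For a chosen speaker, \(TP\), \(FN\), \(TN\), and \(FP\) denote the counts of true positives, false negatives, true negatives, and false positives in the test data.

```latex
\[
\text{TPR} = \frac{TP}{TP + FN},
\qquad
\text{TNR} = \frac{TN}{TN + FP},
\qquad
\text{Accuracy} = \frac{\text{number of test speeches classified correctly}}{n^{*}}
\]
```

In a confusion matrix, the numerator of the accuracy is the sum of the diagonal entries, and \(n^{*} = 50\) is the number of rows in the test set.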

Before you submit

A few last steps before we knit, and then you will be done with this Data Analysis!

Find the top of this file (the little tab), and look under it. You should see something with ABC and a check mark. This is for checking spelling! Click this to check your spelling before you do a final knit and submit.
If you are working with a partner, make sure both of your names are at the top of the file.
You must submit a PDF or HTML file. If you submit any other file type, it cannot be graded. Let me know if you have any questions.

Once you’ve done this, knit your file. This will create the PDF or HTML file you need to submit. If you get stuck, let Dr. Dalzell know!

References

Data

The data are a subset of the data available at https://github.com/ichalkiad/datadescriptor_uselections2020 and were retrieved November 10, 2025.