STA 279 Data Analysis 2

When you open your Markdown file, your first chunk likely looks like:

knitr::opts_chunk$set(echo = TRUE)

Change it to:

knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE, fig.asp = 0.5)

AI Policy

You may NOT use generative AI (including ChatGPT, Gemini, or any other platform) to:

  • Produce/write code for this Data Analysis.
  • Produce/create figures / plots / images for this Data Analysis.
  • Write ANY of the text you submit for this Data Analysis.

Any violation of these rules will result in a 0 on the Data Analysis. I will be in class with you as you work on the assignment, so if you are stuck, ask me, your partners, or other students in the class. There is a lot of help available if you need it!

For your code, please note you must use code that we learned in class. If you want to use code learned in other courses or from outside sources, you must ask Dr. Dalzell before you do so.

The Goal

If you have chosen this Data Analysis, you have chosen to analyze data about political speeches. You can load the data you need using the code below:

# Load the training data
train <- read.csv("https://www.dropbox.com/scl/fi/znkjfmx588wqzyiyznwxl/politics_train.csv?rlkey=sxvhlwc81pzfwpvoed3yzhhsh&st=pr468v1m&dl=1")

# Load the test data
test <- read.csv("https://www.dropbox.com/scl/fi/f1oyruhnjs0ta6q81h9hc/politics_test.csv?rlkey=19dhn2k0hu779iabgbqm38sxv&st=k34bf6ef&dl=1")

The training data set should have \(n = 233\) rows and the test set should have \(n^{*} = 50\) rows, and both data sets should have 122 columns.
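One way to confirm the sizes stated above is a small shape check. This is only a sketch: check_shape is a hypothetical helper (not a function from the course or from any package), and a toy data frame stands in for the real data so the example runs without a download.

```r
# Hypothetical helper: stop with an error if a data frame has an unexpected shape
check_shape <- function(df, rows, cols) {
  stopifnot(nrow(df) == rows, ncol(df) == cols)
  invisible(TRUE)
}

# Toy data frame standing in for train / test
toy <- data.frame(a = 1:3, b = 4:6)
dim(toy)                              # rows, then columns
check_shape(toy, rows = 3, cols = 2)  # silent when the shape matches

# With the assignment data loaded, the matching calls would be:
# check_shape(train, rows = 233, cols = 122)
# check_shape(test, rows = 50, cols = 122)
```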

The first few columns are:

  • speaker: the person who gave the speech.
  • Date: the date the speech was given.
  • CleanText: the text of the speech.
  • Summary: a summary describing the speech.

The other columns contain features about the text.

Section 1: EDA

NOTE 1: Throughout this assignment, you will notice labels like Section 1, Section 1.1, etc. You MUST use these labels in your final submission - this is how I will grade.

NOTE 2: This assignment should be written like a formal paper. This means you need transition sentences, like “In this section, we will examine how the text of the speeches changes over time.” You MUST have such sentences throughout your assignment to make sure the reader can follow your work.

Section 1.1: Time

One of the variables in the data set is the date that a speech was given on. Choose at least two research questions you want to explore with this data involving date. For example, how does the length of the speech (word count) change across time for the 4 different speakers?

For each research question,

  • State the research question clearly.
  • Create a well formatted plot to visualize the relationship(s) of interest highlighted in the research question. Title your plots Figure 1.1, Figure 1.2, etc. The “1” tells us we are in Section 1 of the paper.
  • Answer your research question based on the plot.

Hint:

This will work best if you convert Date to a Date object in R:

train$Date <- as.Date(train$Date)

At that point, you can use this to get your plot going:

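library(ggplot2)

# Replace variableofinterest with the column to plot, and fill in the labels
ggplot(train, aes(x = Date, y = variableofinterest, col = speaker)) +
  geom_smooth() +
  labs(x = "", y = "", title = "")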

Within geom_smooth(), you can choose geom_smooth(se = FALSE) if you don’t want to see the uncertainty bands (confidence intervals) around the lines.
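To see what the Date conversion in this hint does, here is a minimal sketch on toy data. The toy data frame and its "YYYY-MM-DD" date layout are assumptions made for illustration; check how the dates in train are actually stored before relying on it.

```r
# Toy dates standing in for train$Date; the "YYYY-MM-DD" layout is an assumption
toy <- data.frame(Date = c("2020-03-15", "2019-11-02", "2020-01-20"),
                  stringsAsFactors = FALSE)

class(toy$Date)                 # class before converting
toy$Date <- as.Date(toy$Date)   # same call as in the hint
class(toy$Date)                 # class after converting

# Once converted, dates can be ordered and compared, so a plot can treat them as time
range(toy$Date)
sum(is.na(toy$Date))            # number of missing dates after converting
```

If the class after converting is not Date, or missing values appear, the plot's x-axis will not behave like a timeline.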

Section 1.2: Speaker

Choose at least two research questions you want to explore with this data involving differences in text characteristics for different speakers. For example, how does the clout (which measures confidence, power, and leadership communicated through words) differ across speakers?

For each research question,

  • State the research question clearly.
  • Create a well formatted plot to visualize the relationship(s) of interest. Continue the numbering from the previous section - this means that if the last plot in Section 1.1 is labelled Figure 1.3, you should start your labels for these plots with Figure 1.4.
  • Answer your research question based on the plot.

NOTE: None of these research questions may involve date/time; you need completely different research questions than you used in Section 1.1.

Section 2: Multinomial Regression

Optional: If you have any interest in including year or month as features, you can add those features to the data set using:

train$Year <- as.factor(apply( data.frame(train$Date), 1, function(x)  unlist(strsplit(as.character(x), split = "-"))[1]))
train$Month <- as.factor(apply( data.frame(train$Date), 1, function(x)  unlist(strsplit(as.character(x), split = "-"))[2]))
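To clarify what the two lines above compute, the sketch below pulls the year and month out of a Date column with format(). It is an illustrative alternative on toy data, not the course's required code, and it assumes the column has already been converted with as.Date().

```r
# Toy data standing in for train; Date is assumed to already be a Date object
toy <- data.frame(Date = as.Date(c("2020-03-15", "2019-11-02", "2020-01-20")))

# "%Y" extracts the four-digit year, "%m" the two-digit month
toy$Year  <- as.factor(format(toy$Date, "%Y"))
toy$Month <- as.factor(format(toy$Date, "%m"))

levels(toy$Year)    # the distinct years present
levels(toy$Month)   # the distinct months present
table(toy$Year)     # how many rows fall in each year
```

Storing Year and Month as factors means a model treats each year or month as its own category rather than as a number.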

Section 2.1: Choosing a Model

Note: Be careful to include only columns that are features when you build the model.

In this section, you will be fitting a multinomial regression model for \(Y\) = speaker.

  • Would you recommend using (a) the full model or (b) a model chosen using feature selection? Clearly explain which model you choose and why.
  • Show a professionally formatted table of coefficients for your fitted model. You do NOT have to write down the fitted model.

In Section 1.2, you chose two research questions involving differences in text for different speakers. In that section, you used graphs to explore the relationship of interest. Now that we have a model, we can use it to help explore the same questions.

Section 2.2: Answering your research questions

NOTE: If you used a selection technique and the variables of interest from Section 1.2 do not show up in your model, you can either choose new research questions or build a model that does include those features (so just add the features you need into the model).

Using the fitted model, clearly answer your two research questions from Section 1.2. Do this using:

  • An interpretation of appropriate coefficients.
  • Appropriate visualizations from your model coefficients.

Section 2.3: Predictions

  • Use your model from Section 2.1 to make predictions on the test data. Show the appropriate confusion matrix, professionally formatted.
  • Describe how well the model is able to predict on the test data. Your audience for this description is someone who is very interested in politics but who does not know a lot about statistics.

Section 3: Classification Trees

Section 3.1: Fitting a Model

  • Build an appropriate classification tree using \(Y\) = speaker.
  • Show a professionally formatted graphic of your tree.

Section 3.2: Describing relationships

Based on the classification tree, clearly explain what traits are associated with each of the four speakers. Explain this as though you are talking to someone who is very interested in politics but who does not know a lot about statistics.

Section 3.3: Predictions

  • Use your model from Section 3.1 to make predictions on the test data. Show the appropriate confusion matrix, professionally formatted.
  • Describe how well the model is able to predict on the test data. Your audience for this description is someone who is very interested in politics but who does not know a lot about statistics.

Section 4: Linear Discriminant Analysis (LDA)

In this section, we are going to play a little more with LDA, which we first tried in Lab 8.

We can think of LDA as a regression type model. However, instead of creating one coefficient for each feature in the data set, LDA focuses on combining these features into combination features. For example, I might combine high energy, high playfulness, and high curiosity into a measure for “Childlike”. This means that instead of looking at the 3 features individually, I could use the combination to compare groups. This is really useful when we have a lot of features, as we do with our current data set.
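As a purely illustrative sketch of such a combination feature, the “Childlike” measure could be written as a weighted sum of the three original features. The weights \(a_1\), \(a_2\), and \(a_3\) are hypothetical placeholders, not values from any fitted model:

\[\text{Childlike}_i = a_1 \, \text{energy}_i + a_2 \, \text{playfulness}_i + a_3 \, \text{curiosity}_i\]

Roughly speaking, LDA picks weights like these so that the groups being compared are separated as clearly as possible.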

To see what we mean, let’s consider only a few features from our current data set:

  • negate: % negations (example: not, no, never, nothing).
  • WC: the number of words in the speech.
  • allnone: % of words that are all or none (example: all, no, never, always).
  • Clout: a measure of power, leadership, confidence; higher clout is related to higher values of these traits.
  • BigWords: the % of words with more than 6 letters.

To fit LDA just using these features, we use

lda_out <- MASS::lda(speaker ~ negate + WC + allnone+ Clout+ BigWords, data = train)

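The combined features produced by LDA are called linear discriminants. LDA produces at most one fewer linear discriminant than there are groups, so the 5 features we started off with are reduced to a small number of linear discriminants. Here we focus on the first two, called LD1 and LD2.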

First Feature (LD1)

To see the first feature LD1, we can use the following:

knitr::kable(lda_out$scaling[, 1], col.names = "Loadings")

            Loadings
negate      1.4363565
WC          0.0004442
allnone     0.5159826
Clout       0.0504840
BigWords    0.1759651

This matrix shows us how each of the 5 original features contributes to the first linear discriminant. Specifically,

\[LD1_{i} = 1.436 negate_i + .00044 WC_i + .52 allnone_i + .05 Clout_i + .18 BigWords_i\]

You will notice that this looks like a regression model! All the coefficients for the features are positive, and this means that high values of LD1 are associated with a speech with a lot of negate words, a high word count, a lot of all/none words, high clout, and a lot of big words.

Second Feature (LD2)

We can use the same process to look at the 2nd linear discriminant, LD2:

\[LD2_{i} = 1.97 negate_i -.00028 WC_i -.29 allnone_i -.079 Clout_i -.025 BigWords_i\]

Here, we can see a mixture of negative and positive coefficients. This indicates that a lower word count, fewer all/none words, less clout, and fewer big words, but more negate words, are associated with high values of LD2.

We can also see that negate and clout are more important in LD2. Word count, all/none, and big words are all more important in LD1. We can see this based on whether the coefficient for a feature is larger in absolute value for LD1 or LD2.

Plotting

Now instead of 5 features, we have 2, each of which is a combination of the original 5. When we plot the results of LDA, the coordinate of each speech is \((LD1_i, LD2_i)\).
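To make these coordinates concrete, the sketch below rebuilds them by hand. It uses R's built-in iris data as a stand-in (Species plays the role of speaker), so none of the numbers relate to the speech data. One detail it illustrates, based on how predict() in MASS behaves: each feature is centered before the loadings are applied, so a loadings equation like the ones above gives the coordinate up to a constant shift.

```r
library(MASS)  # provides lda()

# iris is only a stand-in here: Species plays the role of speaker
fit <- lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
           data = iris)

# One column of loadings per linear discriminant
fit$scaling

# Center each feature at the prior-weighted average of the group means,
# then multiply by the loadings to get each row's coordinates
X <- as.matrix(iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")])
center <- colSums(fit$prior * fit$means)
scores <- scale(X, center = center, scale = FALSE) %*% fit$scaling

# Compare with the coordinates that predict() reports
all.equal(unname(scores), unname(predict(fit)$x))
```

Because the shift is the same for every row, it moves the whole plot without changing how far apart the groups sit.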

Section 4.1: LDA Model 1

Model

  • Fit an LDA using the same variables chosen by feature selection from multinomial regression. NOTE: The goal here is to create an LDA that you can interpret. If there are too many features chosen by feature selection to make that feasible, please feel free to just choose at least 5 features that you are interested in and that are not used in the example above.

Interpretation

  • Show a professionally formatted output for LD1.
  • Describe what high values of LD1 indicate.
  • Show a professionally formatted output for LD2.
  • Describe what high values of LD2 indicate.

Plotting

  • Create a plot from the LDA model.
  • Choose any two of the speakers. Based on the plot, describe what seems to be different between the text of the speeches for these two speakers. Explain this as though you are talking to someone who is very interested in politics but who does not know a lot about statistics.

Section 4.2: LDA with all the features

  • Fit an LDA model using all the features (not just some of them!), and create a plot from the LDA model.
  • Based on the visual, which model (LDA with only a few features or LDA with all the features) seems to be a better fit to the sample data, or do they seem to fit about the same? Explain your choice.

Section 4.3: Prediction

  • Use your LDA from Section 4.1 to make predictions on the test data. Show the appropriate confusion matrix, professionally formatted.
  • Use your LDA from Section 4.2 to make predictions on the test data. Show the appropriate confusion matrix, professionally formatted.
  • State which LDA does the best job at predicting on the test data.
  • Describe how well that model is able to predict on the test data. Explain this as though you are talking to someone who is very interested in politics but who does not know a lot about statistics.

Section 5: Conclusion

Conclusion

At this point, you have been able to build several models for both interpretation and prediction. Using these models, describe to someone who is interested in politics:

  • What your models suggest about traits that distinguish the text of speeches for the four politicians.
  • How well your models are able to predict who gave a speech based on traits in the data.

References

Data

The data is a subset from https://github.com/ichalkiad/datadescriptor_uselections2020 , and was retrieved November 10, 2025. The final data set used here was processed through LIWC to create the features.

LIWC

The text features were created for the educational purposes of teaching this course using LIWC-22.

Boyd, R. L., Ashokkumar, A., Seraj, S., & Pennebaker, J. W. (2022). The development and psychometric properties of LIWC-22. The University of Texas at Austin. https://www.liwc.app