STA 279 Data Analysis 2

When you open your Markdown file, your first chunk likely looks like:

knitr::opts_chunk$set(echo = TRUE)

Change it to:

knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE, fig.asp = 0.5)

AI Policy

You may NOT use generative AI (including ChatGPT, Gemini, or any other platform) to:

Produce/write code for this Data Analysis.
Produce/create figures / plots / images for this Data Analysis.
Write or refine ANY of the text you submit for this Data Analysis.

Any violation of these rules will result in a 0 on the Data Analysis. I will be in class with you as you work on the assignment, so if you are stuck, ask me or your partner. There is a lot of help available if you need it!

For your code, please note you must use code that we learned in class. If you want to use code learned in other courses or from outside sources, you must ask Dr. Dalzell before you do so.

The Goal

If you have chosen this Data Analysis, you have chosen to analyze data about political speeches. You can load the data you need using the code below:

# Load the training data
train <- read.csv("https://www.dropbox.com/scl/fi/znkjfmx588wqzyiyznwxl/politics_train.csv?rlkey=sxvhlwc81pzfwpvoed3yzhhsh&st=pr468v1m&dl=1")

# Load the test data
test <- read.csv("https://www.dropbox.com/scl/fi/f1oyruhnjs0ta6q81h9hc/politics_test.csv?rlkey=19dhn2k0hu779iabgbqm38sxv&st=k34bf6ef&dl=1")

The training data set should have \(n = 233\) rows and the test set should have \(n^{*} = 50\) rows, and both data sets should have 122 columns.
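One way to confirm the dimensions is sketched below. The helper check_dims is hypothetical (it is not part of the course code), and the small data frame is made up so the example runs on its own.

```r
# Hypothetical helper (not part of the course code): stop with a clear
# message if a data frame does not have the expected number of rows/columns.
check_dims <- function(df, n_rows, n_cols) {
  stopifnot(is.data.frame(df))
  if (nrow(df) != n_rows || ncol(df) != n_cols) {
    stop("Expected ", n_rows, " x ", n_cols,
         " but got ", nrow(df), " x ", ncol(df))
  }
  invisible(TRUE)
}

# Toy illustration with a made-up 3 x 2 data frame.
toy <- data.frame(a = 1:3, b = 4:6)
check_dims(toy, 3, 2)

# With the real data, the equivalent calls would be
# check_dims(train, 233, 122) and check_dims(test, 50, 122).
```

Calling dim(train) and dim(test) directly gives the same information without the helper.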

The first few columns are:

speaker: the person who gave the speech.
Date: the date the speech was given.
CleanText: the text of the speech.
Summary: a summary describing the speech.

The other columns contain features about the text.

Section 1: Logistic Regression

NOTE 1: Throughout this assignment, you will notice labels like Section 1, Section 1.1, etc. You MUST use these labels in your final submission - this is how I will grade.

NOTE 2: This assignment should be written like a formal paper. This means you need transition sentences, like “In this section, we will examine how the text of the speeches changes over time.” You MUST have such sentences throughout your assignment to make sure the reader can follow your work.

Section 1.1: Logistic with one Feature

We have a client who is interested in the difference between the speeches of vice presidential candidates and presidential candidates. To create a variable to indicate whether the candidate was presidential or vice presidential, we use

train_logistic <- train
train_logistic$speaker <- as.factor(ifelse( train$speaker %in% c("Kamala Harris", "Mike Pence"), "VP", "P"))

test_logistic <- test
test_logistic$speaker <- as.factor(ifelse( test$speaker %in% c("Kamala Harris", "Mike Pence"), "VP", "P"))

Build a professionally formatted table to show how many speeches we have by vice presidential candidates and presidential candidates in train_logistic. Do we have enough data in each category to use logistic regression?
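As a rough illustration of the mechanics only, the sketch below counts categories and formats the result with knitr::kable. The labels are made up rather than taken from the real data, the column names and caption are placeholders, and the functions you are expected to use are the ones taught in class.

```r
# Toy stand-in for train_logistic$speaker (made-up labels, not the real data)
toy_speaker <- factor(c("P", "P", "VP", "P", "VP", "P"))

# Count the speeches in each category
counts <- as.data.frame(table(toy_speaker))

# Format as a table; the column names and caption are illustrative
knitr::kable(counts,
             col.names = c("Candidate type", "Number of speeches"),
             caption = "Speeches by candidate type (toy data)")
```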

The client wants to know whether Clout, a measure of how authoritative the text is, differs between the two types of candidates.

The model we are going to use assumes there is a linear relationship between the log odds of \(Y=1\) and \(X\). To check this, we use an empirical log odds plot and make sure the relationship we see looks linear. To build the plot, copy and paste this function into a chunk in R and press play.

emplogitplot1 <- function(formula,data=NULL,ngroups=3,breaks=NULL,
                       yes=NULL,padj=TRUE,out=FALSE,showplot=TRUE,showline=TRUE,
                       ylab="Log(Odds)",xlab=NULL,
                       dotcol="black",linecol="blue",pch=16,main="",
                       ylim=NULL,xlim=NULL,lty=1,lwd=1,cex=1){
  mod=glm(formula,family=binomial,data)
  newdata=mod$model[,1:2]
  oldnames=names(newdata)
  if(is.null(xlab)){xlab=oldnames[2]}   #Need a label for x-axis
  names(newdata)=c("Y","X")
  newdata=na.omit(newdata)      #get rid of NA cases for either variable
  #if needed find the value for "success"
  newdata$Y=factor(newdata$Y)
  if(is.null(yes)){yes=levels(newdata$Y)[2]}
  if(ngroups=="all"){breaks=unique(sort(c(newdata$X,min(newdata$X)-1)))}
  if(is.null(breaks)){
    breaks= quantile(newdata$X, probs = (0:ngroups)/ngroups)
    breaks[1] <- breaks[1]-1
  }
  ngroups=length(breaks)-1
  newdata$XGroups=cut(newdata$X,breaks=breaks,labels=1:ngroups)
  Cases=as.numeric(table(newdata$XGroups))
  XMean=as.numeric(aggregate(X~XGroups,data=newdata,mean)$X)
  XMin=as.numeric(aggregate(X~XGroups,data=newdata,min)$X)
  XMax=as.numeric(aggregate(X~XGroups,data=newdata,max)$X)
  NumYes=as.numeric(table(newdata$Y,newdata$XGroups)[yes,])
  Prop=round(NumYes/Cases,3)
  AdjProp=round((NumYes+0.5)/(Cases+1),3)
  Logit=as.numeric(log(AdjProp/(1-AdjProp)))
  if(!padj){Logit=as.numeric(log(Prop/(1-Prop)))}
  if(showplot){plot(Logit~XMean,ylab=ylab,col=dotcol,pch=pch,
                    ylim=ylim,xlim=xlim,xlab=xlab,cex=cex,main=main)
    if(showline){abline(lm(Logit~XMean),col=linecol,lty=lty,lwd=lwd)}}
  GroupData=data.frame(Group=1:ngroups,Cases,XMin,XMax,XMean,NumYes,Prop,AdjProp,Logit)
  if(out){return(GroupData)}
}

emplogitplot1(speaker ~ Clout, data = train_logistic, ngroups = 20)

Show the plot created by the code chunk above. Based on this, explain if using logistic regression looks reasonable.
Build a logistic regression model to answer the client’s question. Write down the trained model using appropriate notation.
Using the model, answer your client’s research question.
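For reference, the general form of the model in this section can be sketched as follows. This is generic notation, not the trained model: \(\pi_i\), \(\beta_0\), and \(\beta_1\) are placeholder symbols, and a trained model replaces the coefficients with the estimates from your own fit.

\[
\log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 \, \text{Clout}_i, \qquad \pi_i = P(Y_i = 1 \mid \text{Clout}_i)
\]

Because as.factor orders the levels alphabetically ("P", then "VP"), glm with family = binomial treats the second level as the success, so \(Y_i = 1\) corresponds to a speech by a VP candidate.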

Section 1.2: Logistic with more than one feature

Now the client would like us to see if there are other features that can help us tell apart the speeches of VP candidates and presidential candidates. This means you should still be working with train_logistic.

Note: Be careful to exclude columns that are not features before you build the model. Example: To remove columns 1, 4, and 5, you use train_logistic <- train_logistic[,-c(1,4,5)]
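The sketch below shows the same idea on a made-up data frame, so the column names (id, notes, x1, x2) are assumptions and not columns from the speech data. It drops columns by position, as in the note above, and also by name.

```r
# Toy data frame with two non-feature columns (id, notes) and two features
toy <- data.frame(id = 1:4,
                  y = factor(c("P", "VP", "P", "VP")),
                  notes = c("a", "b", "c", "d"),
                  x1 = c(0.2, 1.5, 0.7, 2.1),
                  x2 = c(10, 12, 9, 15))

# Drop columns 1 and 3 by position
toy_by_position <- toy[, -c(1, 3)]

# Drop the same columns by name, which keeps working if the order changes
toy_by_name <- toy[, !(names(toy) %in% c("id", "notes"))]

names(toy_by_name)
```

Both approaches leave the response and the feature columns in place. Checking names() afterwards is a quick way to confirm that nothing unintended was removed.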

Would you recommend using (a) the full model with all possible features or (b) a model chosen using feature selection? Clearly explain which model you choose and why, as well as what metric you used for forward selection.
Show a professionally formatted table of coefficients for your trained model. You do NOT have to write down the trained model.
Describe a few (at least 2) key relationships that seem to be highlighted in your model. Note: You do NOT have to describe every relationship.
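To make the idea of forward selection concrete, here is a minimal sketch on simulated data. It assumes base R's step() with AIC as the metric; the course may use a different function or metric, so treat this as an illustration and not as the required approach. The variables x1, x2, x3, and y are invented for the example.

```r
set.seed(279)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))

# In this simulation only x1 truly affects the response
toy$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * toy$x1))

# Forward selection starts from the intercept-only model
null_model <- glm(y ~ 1, family = binomial, data = toy)

# At each step, add the candidate feature that lowers AIC the most;
# stop when no addition lowers AIC
forward_fit <- step(null_model,
                    scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3),
                    direction = "forward",
                    trace = 0)

# Features chosen by the procedure
attr(terms(forward_fit), "term.labels")
```

step() uses AIC by default. Passing k = log(n) switches the penalty to BIC, which tends to select fewer features.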

Section 1.3: Predictions

Use the model you chose in Section 1.2 to make predictions on test_logistic. Show the appropriate confusion matrix, professionally formatted.
Describe how well the model is able to predict on the test data. Your audience for this description is someone who is very interested in politics but who does not know a lot about statistics!
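The steps from a fitted logistic model to a confusion matrix can be sketched as below. The data are simulated, make_toy is a hypothetical helper, and the 0.5 threshold is an assumption and not a rule stated in this assignment.

```r
set.seed(279)

# Hypothetical helper that simulates a small two-class data set
make_toy <- function(n) {
  x <- rnorm(n)
  is_vp <- rbinom(n, size = 1, prob = plogis(2 * x)) == 1
  data.frame(x = x,
             y = factor(ifelse(is_vp, "VP", "P"), levels = c("P", "VP")))
}
toy_train <- make_toy(150)
toy_test  <- make_toy(50)

fit <- glm(y ~ x, family = binomial, data = toy_train)

# type = "response" gives the estimated probability of the second level, "VP"
prob_vp <- predict(fit, newdata = toy_test, type = "response")

# Convert probabilities to predicted labels (0.5 threshold is an assumption)
pred <- factor(ifelse(prob_vp > 0.5, "VP", "P"), levels = c("P", "VP"))

# Rows are predictions, columns are the true labels
conf_mat <- table(Predicted = pred, Actual = toy_test$y)
conf_mat

# Proportion of test speeches classified correctly
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
```

Fixing the factor levels for the predictions keeps the table 2 x 2 even if the model never predicts one of the classes.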

Section 2: Multinomial Regression

Section 2.1: Choosing a Model

Note: Be careful to include only columns that are features when you build the model.

Now that we have explored the difference between VP and presidential candidates, the client would like us to delve deeper into the 4 specific speakers. In other words, they would like to know what separates the speeches of the four candidates.

In this section, you will be fitting a multinomial regression model for \(Y\) = speaker (using all 4 speakers this time) using the train data set.

Use forward selection to choose which features to include in a multinomial regression model for speaker. Clearly identify which features are chosen.
Are these the same features chosen previously to tell VP and presidential speeches apart?
Show a professionally formatted table of coefficients for your trained model. You do NOT have to write down the trained model.
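As an illustration of what a multinomial fit and its coefficient table look like, the sketch below uses the built-in iris data as a stand-in for the speech data. It assumes multinom() from the nnet package, which may or may not be the function used in class.

```r
library(nnet)

set.seed(279)
test_rows <- sample(nrow(iris), 30)
toy_train <- iris[-test_rows, ]
toy_test  <- iris[test_rows, ]

# Three-class response; the first factor level ("setosa") is the baseline
fit_multi <- multinom(Species ~ Sepal.Length + Sepal.Width,
                      data = toy_train, trace = FALSE)

# One row of coefficients per non-baseline class, each compared to the baseline
coef_table <- round(coef(fit_multi), 2)
coef_table

# Predicted class for each held-out row
pred <- predict(fit_multi, newdata = toy_test, type = "class")
conf_mat <- table(Predicted = pred, Actual = toy_test$Species)
conf_mat
```

With four speakers, the analogous table would have three rows, one for each speaker other than the baseline. Each coefficient describes a change in the log of the relative odds of that speaker versus the baseline speaker.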

Section 2.2: Answering your research questions

Interpret at least 2 relationships you see in the model that are interesting to you. Recall that your audience for this description is someone who is very interested in politics but who does not know a lot about statistics! Do this using both plots and words!

Section 2.3: Predictions

Use your model from Section 2.1 to make predictions on the test data. Show the appropriate confusion matrix, professionally formatted.
Describe how well the model is able to predict on the test data. Your audience for this description is someone who is very interested in politics but who does not know a lot about statistics.

Section 3: Classification Trees

Section 3.1: Fitting a Model

Build an appropriate classification tree using \(Y\) = speaker.
Show a professionally formatted graphic of your tree.
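A minimal sketch of fitting and drawing a classification tree is shown below, again on the built-in iris data and not the speech data. It assumes the rpart package; the course may use different functions or a dedicated plotting package such as rpart.plot.

```r
library(rpart)

# method = "class" requests a classification tree
fit_tree <- rpart(Species ~ ., data = iris, method = "class")

# Base graphics version of the tree; use.n adds class counts at each leaf
plot(fit_tree, margin = 0.1)
text(fit_tree, use.n = TRUE, cex = 0.8)

# Predicted class for each row
tree_pred <- predict(fit_tree, newdata = iris, type = "class")
table(Predicted = tree_pred, Actual = iris$Species)
```

Each path from the top of the tree to a leaf can be read as a rule, which is what makes trees convenient for describing the traits associated with each class.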

Section 3.2: Describing relationships

Based on the classification tree, clearly explain what traits are associated with each of the four speakers. Explain this as though you are talking to someone who is very interested in politics but who does not know a lot about statistics.
Are these the same traits that showed up in the model in Section 2? Does this make sense?

Section 3.3: Predictions

Use your model from Section 3.1 to make predictions on the test data. Show the appropriate confusion matrix, professionally formatted.
Describe how well the model is able to predict on the test data. Your audience for this description is someone who is very interested in politics but who does not know a lot about statistics.

Section 4: Conclusion

At this point, you have been able to build several models for both interpretation and prediction. Using these models, describe to someone who is interested in politics:

What your models suggest about traits that distinguish the text of speeches for the four politicians.
How well your models are able to predict who gave a speech based on traits in the data.

References

Data

The data are a subset of https://github.com/ichalkiad/datadescriptor_uselections2020, retrieved November 10, 2025. The final data set used here was processed through LIWC to create the features.

LIWC

The text features were created with LIWC-22 for the educational purpose of teaching this course.

Boyd, R. L., Ashokkumar, A., Seraj, S., & Pennebaker, J. W. (2022). The development and psychometric properties of LIWC-22. The University of Texas at Austin. https://www.liwc.app