STA 279 Data Analysis 2

When you open your Markdown file, your first chunk likely looks like:

knitr::opts_chunk$set(echo = TRUE)

Change it to:

knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE, fig.asp =.5)

AI Policy

You may NOT use generative AI (including Chat GPT, Gemini, or any other platform) to:

  • Produce/write code for this Data Analysis.
  • Produce/create figures / plots / images for this Data Analysis.
  • Write or refine ANY of the text you submit for this Data Analysis.

Any violation of these rules will result in a 0 on the Data Analysis. I will be in class with you as you work on the assignment, so if you are stuck, ask me or your partner. There is a lot of help available if you need it!

For your code, please note you must use code that we learned in class. If you want to use code learned in other courses or from outside sources, you must ask Dr. Dalzell before you do so.

Section 1: Logistic Regression

NOTE 1: Throughout this assignment, you will notice labels like Section 1, Section 1.1, etc. You MUST use these labels in your final submission - this is how I will grade.

NOTE 2: This assignment should be written like a formal paper. This means you need transition sentences, like “In this section, we will examine how the text of the speeches changes over time.” You MUST have such sentences throughout your assignment to make sure the reader can follow your work.

Section 1.1: Logistic with one Feature

We have a client who is interested in the difference between the speech of Ross and Rachel. To load the data with the lines from just these two characters, run the following code:

# Load the training data
train <- read.csv("https://www.dropbox.com/scl/fi/4fckz7qilxhnwa3q3h4fi/trainRossRachel.csv?rlkey=kogj5h49zgj7o8xjg7ilcrlt4&st=7iqk733l&dl=1")
train$Character <- as.factor(train$Character)

# Load the test data
test <- read.csv("https://www.dropbox.com/scl/fi/gwhvvqgsu292scag8flfp/testRossRachel.csv?rlkey=eu647dtx73fzsqj3wbl4oyzc3&st=nigzi4ib&dl=1")
test$Character <- as.factor(test$Character)
  • Build a professionally formatted table to show how many lines we have by each character in train. Do we have enough data in each category to use logistic regression?

The client wants to know if Exclam, which measures the percent of exclamation/strong emotion words in text, is different across the two characters.

  • The model we are going to use assumes there is a linear relationship between the log odds of \(Y=1\) and \(X\). To check this, we use an empirical log odds plot and make sure the relationship we see looks linear. To build the plot, copy and paste this function into a chunk in R and press play.
emplogitplot1 <- function(formula,data=NULL,ngroups=3,breaks=NULL,
                       yes=NULL,padj=TRUE,out=FALSE,showplot=TRUE,showline=TRUE,
                       ylab="Log(Odds)",xlab=NULL,
                       dotcol="black",linecol="blue",pch=16,main="",
                       ylim=NULL,xlim=NULL,lty=1,lwd=1,cex=1){
  mod=glm(formula,family=binomial,data)
  newdata=mod$model[,1:2]
  oldnames=names(newdata)
  if(is.null(xlab)){xlab=oldnames[2]}   #Need a label for x-axis
  names(newdata)=c("Y","X")
  newdata=na.omit(newdata)      #get rid of NA cases for either variable
  #if needed find the value for "success"
  newdata$Y=factor(newdata$Y)
  if(is.null(yes)){yes=levels(newdata$Y)[2]}
  if(ngroups=="all"){breaks=unique(sort(c(newdata$X,min(newdata$X)-1)))}
  if(is.null(breaks)){
    breaks= quantile(newdata$X, probs = (0:ngroups)/ngroups)
    breaks[1] <- breaks[1]-1
  }
  ngroups=length(breaks)-1
  newdata$XGroups=cut(newdata$X,breaks=breaks,labels=1:ngroups)
  Cases=as.numeric(table(newdata$XGroups))
  XMean=as.numeric(aggregate(X~XGroups,data=newdata,mean)$X)
  XMin=as.numeric(aggregate(X~XGroups,data=newdata,min)$X)
  XMax=as.numeric(aggregate(X~XGroups,data=newdata,max)$X)
  NumYes=as.numeric(table(newdata$Y,newdata$XGroups)[yes,])
  Prop=round(NumYes/Cases,3)
  AdjProp=round((NumYes+0.5)/(Cases+1),3)
  Logit=as.numeric(log(AdjProp/(1-AdjProp)))
  if(!padj){Logit=as.numeric(log(Prop/(1-Prop)))}
  if(showplot){plot(Logit~XMean,ylab=ylab,col=dotcol,pch=pch,
                    ylim=ylim,xlim=xlim,xlab=xlab,cex=cex,main=main)
    if(showline){abline(lm(Logit~XMean),col=linecol,lty=lty,lwd=lwd)}}
  GroupData=data.frame(Group=1:ngroups,Cases,XMin,XMax,XMean,NumYes,Prop,AdjProp,Logit)
  if(out){return(GroupData)}
}

emplogitplot1(Character ~ Exclam, data = train, breaks = seq(from = 0, to = 10, by = 2))
  • Show the plot created by the code chunk above. Based on this, explain if using logistic regression looks reasonable.
  • Build a logistic regression model to answer the client’s question. Write down the trained model using appropriate notation.
  • Using the model, answer your client’s research question.

Section 1.2: Logistic with more than one feature

Now the client would like us to see if there are other features that can help us tell apart the lines of the 2 characters.

Note: Be careful to exclude columns that are features before you build the model. Example: To remove columns 1, 4, and 5, you use train <- train[,-c(1,4,5)]

  • Would you recommend using (a) the full model with all possible features or (b) a model chosen using feature selection? Clearly explain which model you choose and why, as well as what metric you used for forward selection.
  • Show a professionally formatted table of coefficients for your trained model. You do NOT have to write down the trained model.
  • Describe a few (at least 2) key relationships that seem to be highlighted in your model. Note: You do NOT have to describe every relationship.

Hint: The function for forward selection is:

forwardSelection <- function(dataset, Y, metric){
  Yhold <- as.factor(dataset[,Y])
  dataset <-model.matrix(dataset[,Y] ~ ., data = dataset)[,-1]
  dataset<- data.frame(dataset)
  #rm(Yhold)
  J <- ncol(dataset)
  AIChold <-rep(NA,J-1)
  variables <- c()
  for( j in 1:J){
    tocheck <- setdiff(c(1:J)[-Y],variables)
    AIChold <-rep(NA,length(tocheck))
    for( i in 1:length(tocheck)){
      keep <- dataset[,c(variables,tocheck[i])]
      model <- glm( Yhold ~ ., data = data.frame(keep),family = "binomial")
      if(metric=="AIC"){
        AIChold[i] <-AIC(model)
      }
      if(metric=="BIC"){
        AIChold[i] <-BIC(model)
      }
    }
    AICtest <- min(AIChold)
    if(j == 1){
      AICCheck = AICtest
      variables <- c(variables, which.min(AIChold)+1)
    }
    if( j > 1){
      if(AICtest >= AICCheck){
        break
      }
      if(AICtest < AICCheck){
        AICCheck = AICtest
        variables <- c(variables, tocheck[which.min(AIChold)])
      }
      
    }
  }
  
  forwardselection_data <- dataset[,c(variables)]
  forwardselection_data <- cbind(Yhold,forwardselection_data)
  forwardselection_data <- data.frame(forwardselection_data)
  forwardselection_data 
}

To create the data from forward selection, you use:

FSdata <- forwardSelection(dataset, Y, metric)

where you replace dataset with your data set (only features and response columns), Y with the number of the column where \(Y\) is in the dataset, and metric with either "AIC" or "BIC".

Section 1.3: Predictions

  • Use your model you chose in Section 1.2 to make predictions on the test data. Show the appropriate confusion matrix, professionally formatted.
  • Describe how well the model is able to predict on the test data. Your audience for this description is someone who is very interested in Friends but who does not know a lot about statistics!

Section 2: Multinomial Regression

Section 2.1: Choosing a Model

Note: Be careful you only include columns are features when you build the model.

Now that we have explored the difference between two characters, the client would like us to delve deeper into the speech of the key characters in Friends. In other words, they would like to know what separates the lines of the six Friends.

Load your new data:

train <- read.csv("https://www.dropbox.com/scl/fi/bmymm6truy6ocetx92qy7/trainFriends.csv?rlkey=0loe1mc8t8l20umaau70tpvk5&st=s7bz4vxd&dl=1")
train$Character <- as.factor(train$Character)

test <- read.csv("https://www.dropbox.com/scl/fi/8v0mqj6ywxnmvyygursdm/testFriends.csv?rlkey=at598pm7j8stkxta1bcqy5ldk&st=pz0pd7zo&dl=1")
test$Character <- as.factor(test$Character)

In this section, you will be fitting a multinomial regression model for \(Y\) = character using the train data set.

  • Choose 3 of the 6 friends. Put the names into the code below to create data sets with only those 3 characters.
train <- subset(train, Character %in% c("character 1", "character 2", "character 3"))
train$Character <- factor(train$Character)
test <- subset(test, Character %in% c("character 1", "character 2", "character 3"))
test$Character <- factor(test$Character)
  • Use forward selection to choose which features to include in a multinomial regression model for speaker. Clearly identify which features are chosen.
  • Are these the same features chosen previously to tell Ross and Rachel apart?
  • Show a professionally formatted table of coefficients for your trained model. You do NOT have to write down the trained model.

Hint: The function for forward selection is:

forward_selection_multinomial <- function(features, response, metric){
  
  # Load the library
  suppressMessages(library(nnet))
  # Initialize a
  a = 1 
  
  check <- apply(features, 2, function(x) length(unique(x)))
  to.remove <- which(check == 1)
  if(length(to.remove)>0){
    features <- features[,-to.remove]
  }
  
  # Count the number of features
  p <- ncol(features)
  
  # Find the best feature with only one feature in the model
  
  for( i in 1:p){
    modelholder <- multinom(response ~ ., data = data.frame(features[,i]), trace = FALSE)
    if(a == 1){
      modelcurrent = i
      if(metric=="AIC"){AICcurrent = AIC(modelholder)}
      if(metric=="BIC"){AICcurrent = BIC(modelholder)}
      a = a+1
    }
    if(AIC(modelholder) < AICcurrent){
      if(metric=="AIC"){AICcurrent = AIC(modelholder)}
      if(metric=="BIC"){AICcurrent = BIC(modelholder)}
      modelcurrent <- i
    }
    rm(modelholder)
  }
  
  
  model1 <- multinom(response ~ ., data = data.frame(features[,modelcurrent]), trace = FALSE)
  if(metric=="AIC"){ AIC_tobeat = AIC(model1)}
  if(metric=="BIC"){ AIC_tobeat = BIC(model1)}
  tokeep = modelcurrent
  
  for( x in 1:(p-1)){
    a = 1 
    for( i in c(1:p)[-tokeep]){
      modelholder <- multinom( response ~ ., data = features[,c(i,tokeep)], trace = FALSE)
      if(a == 1){
        modelcurrent = i
         AIC_tobeat
        a = a+2
      }
      if(metric == "AIC" & AIC(modelholder) < AICcurrent){
        AICcurrent <- AIC(modelholder)
        modelcurrent <- i
      }
      if(metric == "BIC" & BIC(modelholder) < AICcurrent){
        AICcurrent <- BIC(modelholder)
        modelcurrent <- i
      }
      rm(modelholder)
    } 
    if(a >1 & AIC_tobeat- AICcurrent < 1){
      break
    }
    tokeep <- c(tokeep,modelcurrent)
    AIC_tobeat = AICcurrent
  }
  
  out <- cbind(response, features[,tokeep])
  
  out<-data.frame(out)
  
  out
  
}

To create the data from forward selection, you use:

FSdata <- forward_selection_multinomial(features, response, metric)

where you replace features with a data set containing only the possible features, response with a column holding your \(Y\) variable, and metric with either "AIC" or "BIC".

Section 2.2: Answering your research questions

  • Create appropriate plots (hint: check your lab) to visualize at least two relationships in the model you find interesting.
  • Interpret these relationships using the plot and the trained model. Recall that your audience for this description is someone who is very interested in Friends but who does not know a lot about statistics!

Section 2.3: Predictions

  • Use your model from Section 2.1 to make predictions on the test data. Show the appropriate confusion matrix, professionally formatted.
  • Describe how well the model is able to predict on the test data. Your audience for this description is someone who is very interested in Friends but who does not know a lot about statistics.

Section 3: Classification Trees

Section 3.1: Fitting a Model

  • Using train, build an appropriate classification tree using \(Y\) = character.
  • Show a professionally formatted graphic of your tree.

Section 3.2: Describing relationships

  • Based on the classification tree, clearly explain what traits seem to separate the speakers. Explain this though you are talking to someone who is very interested in politics but who does not know a lot about statistics.
  • Are these the same traits that showed up in the model in Section 2? Does it make sense that the relationships are similar/different? Explain.

Section 3.3: Predictions

  • Use your model from Section 3.1 to make predictions on the test data. Show the appropriate confusion matrix, professionally formatted.
  • Describe how well the model is able to predict on the test data. Your audience for this description is someone who is very interested in Friends but who does not know a lot about statistics.

References

Data

This data is a cleaned subset of the data set from:

AM. 2022. Friends Scripts, Version 1. Retrieved from https://www.kaggle.com/datasets/amandam1/friends-scripts

LIWC

The text features were created for the educational purposes of teaching this course using LIWC-22.

Boyd, R. L., Ashokkumar, A., Seraj, S., & Pennebaker, J. W. (2022). The development and psychometric properties of LIWC-22. The University of Texas at Austin. https://www.liwc.app