STA 279 Data Analysis 2

When you open your Markdown file, your first chunk likely looks like:

knitr::opts_chunk$set(echo = TRUE)

Change it to:

knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE, fig.asp =.5)

AI Policy

You may NOT use generative AI (including Chat GPT, Gemini, or any other platform) to:

  • Produce/write code for this Data Analysis.
  • Produce/create figures / plots / images for this Data Analysis.
  • Write or refine ANY of the text you submit for this Data Analysis.

Any violation of these rules will result in a 0 on the Data Analysis. I will be in class with you as you work on the assignment, so if you are stuck, ask me or your partner. There is a lot of help available if you need it!

For your code, please note you must use code that we learned in class. If you want to use code learned in other courses or from outside sources, you must ask Dr. Dalzell before you do so.

The Data

To load the data, use:

TheOffice <- read.csv("https://www.dropbox.com/scl/fi/kfhkcorppht39shuv7ihr/TheOffice.csv?rlkey=yt8wbnqxef9hesjsqukojzj04&st=8pqyl30a&dl=1")
colnames(TheOffice)[4] <- "Character"
TheOffice[is.na(TheOffice)] <- 0

set.seed(279)
trainOffice <- sample(1:nrow(TheOffice), 35000)
testOffice <- c(1:nrow(TheOffice))[-trainOffice]
trainOffice <- TheOffice[trainOffice,]
testOffice <- TheOffice[testOffice,]
rm(TheOffice)

For this analysis, you need to choose (1) two characters to compare in Section 1 and (2) 3 characters to compare in Section 2.

Section 1: Logistic Regression

NOTE 1: Throughout this assignment, you will notice labels like Section 1, Section 1.1, etc. You MUST use these labels in your final submission - this is how I will grade.

NOTE 2: This assignment should be written like a formal paper. This means you need transition sentences, like “In this section, we will examine how the text of the speeches changes over time.” You MUST have such sentences throughout your assignment to make sure the reader can follow your work.

Section 1.1: Logistic with one Feature

We have a client who is interested in the difference between the speech of different characters in The Office.

  • Choose 2 characters. Put the names into the code below to create data sets with only those 3 characters.
train <- subset(trainOffice, Character %in% c("character 1", "character 2"))
train$Character <- factor(train$Character)
test <- subset(testOffice, Character %in% c("character 1", "character 2"))
test$Character <- factor(test$Character)
  • Build a professionally formatted table to show how many lines we have by each character in train. Do we have enough data in each category to use logistic regression?

The client wants to know if Analytic is different across the two characters. Based on the LIWC documentation: “Analytic Thinking captures the degree to which people use words that suggest formal, logical, and hierarchical thinking patterns. People low in Analytical Thinking tend to write and think using language that is more intuitive and personal. Language scoring high in Analytic Thinking tends to be rewarded in academic settings and is correlated with things like grades and reasoning skills. Language scoring low in Analytic Thinking tends to be viewed as less cold and rigid, and more friendly and personable.”

  • The model we are going to use assumes there is a linear relationship between the log odds of \(Y=1\) and \(X\). To check this, we use an empirical log odds plot and make sure the relationship we see looks linear. To build the plot, copy and paste this function into a chunk in R and press play.
emplogitplot1 <- function(formula,data=NULL,ngroups=3,breaks=NULL,
                       yes=NULL,padj=TRUE,out=FALSE,showplot=TRUE,showline=TRUE,
                       ylab="Log(Odds)",xlab=NULL,
                       dotcol="black",linecol="blue",pch=16,main="",
                       ylim=NULL,xlim=NULL,lty=1,lwd=1,cex=1){
  mod=glm(formula,family=binomial,data)
  newdata=mod$model[,1:2]
  oldnames=names(newdata)
  if(is.null(xlab)){xlab=oldnames[2]}   #Need a label for x-axis
  names(newdata)=c("Y","X")
  newdata=na.omit(newdata)      #get rid of NA cases for either variable
  #if needed find the value for "success"
  newdata$Y=factor(newdata$Y)
  if(is.null(yes)){yes=levels(newdata$Y)[2]}
  if(ngroups=="all"){breaks=unique(sort(c(newdata$X,min(newdata$X)-1)))}
  if(is.null(breaks)){
    breaks= quantile(newdata$X, probs = (0:ngroups)/ngroups)
    breaks[1] <- breaks[1]-1
  }
  ngroups=length(breaks)-1
  newdata$XGroups=cut(newdata$X,breaks=breaks,labels=1:ngroups)
  Cases=as.numeric(table(newdata$XGroups))
  XMean=as.numeric(aggregate(X~XGroups,data=newdata,mean)$X)
  XMin=as.numeric(aggregate(X~XGroups,data=newdata,min)$X)
  XMax=as.numeric(aggregate(X~XGroups,data=newdata,max)$X)
  NumYes=as.numeric(table(newdata$Y,newdata$XGroups)[yes,])
  Prop=round(NumYes/Cases,3)
  AdjProp=round((NumYes+0.5)/(Cases+1),3)
  Logit=as.numeric(log(AdjProp/(1-AdjProp)))
  if(!padj){Logit=as.numeric(log(Prop/(1-Prop)))}
  if(showplot){plot(Logit~XMean,ylab=ylab,col=dotcol,pch=pch,
                    ylim=ylim,xlim=xlim,xlab=xlab,cex=cex,main=main)
    if(showline){abline(lm(Logit~XMean),col=linecol,lty=lty,lwd=lwd)}}
  GroupData=data.frame(Group=1:ngroups,Cases,XMin,XMax,XMean,NumYes,Prop,AdjProp,Logit)
  if(out){return(GroupData)}
}

emplogitplot1(Character ~Analytic, data = train, breaks = seq(from = 0, to = 60, by = 5))
  • Show the plot created by the code chunk above. Based on this, explain if using logistic regression looks reasonable.
  • Build a logistic regression model to answer the client’s question. Write down the trained model using appropriate notation.
  • Using the model, answer your client’s research question.

Section 1.2: Logistic with more than one feature

Now the client would like us to see if there are other features that can help us tell apart the lines of the 2 characters.

Note: Be careful to exclude columns that not are features or the response before you build the model. Example: To remove columns 1, 4, and 5, you use train <- train[,-c(1,4,5)]

  • Would you recommend using (a) the full model with all possible features or (b) a model chosen using feature selection? Clearly explain which model you choose and why, as well as what metric you used for forward selection.
  • Show a professionally formatted table of coefficients for your trained model. You do NOT have to write down the trained model.
  • Describe a few (at least 2) key relationships that seem to be highlighted in your model. Note: You do NOT have to describe every relationship.

Hint: The function for forward selection is:

forwardSelection <- function(dataset, Y, metric){
  Yhold <- as.factor(dataset[,Y])
  dataset <-model.matrix(dataset[,Y] ~ ., data = dataset)[,-1]
  dataset<- data.frame(dataset)
  #rm(Yhold)
  J <- ncol(dataset)
  AIChold <-rep(NA,J-1)
  variables <- c()
  for( j in 1:J){
    tocheck <- setdiff(c(1:J)[-Y],variables)
    AIChold <-rep(NA,length(tocheck))
    for( i in 1:length(tocheck)){
      keep <- dataset[,c(variables,tocheck[i])]
      model <- glm( Yhold ~ ., data = data.frame(keep),family = "binomial")
      if(metric=="AIC"){
        AIChold[i] <-AIC(model)
      }
      if(metric=="BIC"){
        AIChold[i] <-BIC(model)
      }
    }
    AICtest <- min(AIChold)
    if(j == 1){
      AICCheck = AICtest
      variables <- c(variables, which.min(AIChold)+1)
    }
    if( j > 1){
      if(AICtest >= AICCheck){
        break
      }
      if(AICtest < AICCheck){
        AICCheck = AICtest
        variables <- c(variables, tocheck[which.min(AIChold)])
      }
      
    }
  }
  
  forwardselection_data <- dataset[,c(variables)]
  forwardselection_data <- cbind(Yhold,forwardselection_data)
  forwardselection_data <- data.frame(forwardselection_data)
  forwardselection_data 
}

To create the data from forward selection, you use:

FSdata <- forwardSelection(dataset, Y, metric)

where you replace dataset with your data set (only features and response columns), Y with the number of the column where \(Y\) is in the dataset, and metric with either "AIC" or "BIC".

Section 1.3: Predictions

  • Use your model you chose in Section 1.2 to make predictions on the test data test. Show the appropriate confusion matrix, professionally formatted.
  • Describe how well the model is able to predict on the test data. Your audience for this description is someone who is very interested in The Office but who does not know a lot about statistics!

Section 2: Multinomial Regression

Section 2.1: Choosing a Model

Note: Be careful you exclude columns at are not features or the response when you build the model.

Now that we have explored the difference between two characters, the client would like us to delve deeper into the speech of the key characters in The Office. In other words, they would like to know what separates the lines of the six Friends.

  • Choose 3 or 4 characters. Put the names into the code below to create data sets with only those 3 characters.
train <- subset(trainOffice, Character %in% c("character 1", "character 2", "character 3"))
train$Character <- factor(train$Character)
test <- subset(testOffice, Character %in% c("character 1", "character 2", "character 3"))
test$Character <- factor(test$Character)
  • Use forward selection to choose which features in train to include in a multinomial regression model for speaker. Clearly identify which features are chosen.
  • Are these the same features chosen to tell apart characters in Section 2?
  • Show a professionally formatted table of coefficients for your trained model. You do NOT have to write down the trained model.

Hint: The function for forward selection is:

forward_selection_multinomial <- function(features, response, metric){
  
  # Load the library
  suppressMessages(library(nnet))
  # Initialize a
  a = 1 
  
  check <- apply(features, 2, function(x) length(unique(x)))
  to.remove <- which(check == 1)
  if(length(to.remove)>0){
    features <- features[,-to.remove]
  }
  
  # Count the number of features
  p <- ncol(features)
  
  # Find the best feature with only one feature in the model
  
  for( i in 1:p){
    modelholder <- multinom(response ~ ., data = data.frame(features[,i]), trace = FALSE)
    if(a == 1){
      modelcurrent = i
      if(metric=="AIC"){AICcurrent = AIC(modelholder)}
      if(metric=="BIC"){AICcurrent = BIC(modelholder)}
      a = a+1
    }
    if(AIC(modelholder) < AICcurrent){
      if(metric=="AIC"){AICcurrent = AIC(modelholder)}
      if(metric=="BIC"){AICcurrent = BIC(modelholder)}
      modelcurrent <- i
    }
    rm(modelholder)
  }
  
  
  model1 <- multinom(response ~ ., data = data.frame(features[,modelcurrent]), trace = FALSE)
  if(metric=="AIC"){ AIC_tobeat = AIC(model1)}
  if(metric=="BIC"){ AIC_tobeat = BIC(model1)}
  tokeep = modelcurrent
  
  for( x in 1:(p-1)){
    a = 1 
    for( i in c(1:p)[-tokeep]){
      modelholder <- multinom( response ~ ., data = features[,c(i,tokeep)], trace = FALSE)
      if(a == 1){
        modelcurrent = i
         AIC_tobeat
        a = a+2
      }
      if(metric == "AIC" & AIC(modelholder) < AICcurrent){
        AICcurrent <- AIC(modelholder)
        modelcurrent <- i
      }
      if(metric == "BIC" & BIC(modelholder) < AICcurrent){
        AICcurrent <- BIC(modelholder)
        modelcurrent <- i
      }
      rm(modelholder)
    } 
    if(a >1 & AIC_tobeat- AICcurrent < 1){
      break
    }
    tokeep <- c(tokeep,modelcurrent)
    AIC_tobeat = AICcurrent
  }
  
  out <- cbind(response, features[,tokeep])
  
  out<-data.frame(out)
  
  out
  
}

To create the data from forward selection, you use:

FSdata <- forward_selection_multinomial(features, response, metric)

where you replace features with a data set containing only the possible features, response with a column holding your \(Y\) variable, and metric with either "AIC" or "BIC".

Section 2.2: Answering your research questions

  • Create appropriate plots (hint: check your lab) to visualize at least two relationships in the model you find interesting.
  • Interpret these relationships using the plot and the trained model. Recall that your audience for this description is someone who is very interested in The Office but who does not know a lot about statistics!

Section 2.3: Predictions

  • Use your model from Section 2.1 to make predictions on the test data. Show the appropriate confusion matrix, professionally formatted.
  • Describe how well the model is able to predict on the test data. Your audience for this description is someone who is very interested in The Office but who does not know a lot about statistics.

Section 3: Classification Trees

Section 3.1: Fitting a Model

  • Using train, build an appropriate classification tree using \(Y\) = character.
  • Show a professionally formatted graphic of your tree.

Section 3.2: Describing relationships

  • Based on the classification tree, clearly explain what traits seem to separate the characters. Explain this though you are talking to someone who is very interested in The Office but who does not know a lot about statistics.
  • Are these the same traits that showed up in the model in Section 2? Does it make sense that the relationships are similar/different? Explain.

Section 3.3: Predictions

  • Use your model from Section 3.1 to make predictions on the test data. Show the appropriate confusion matrix, professionally formatted.
  • Describe how well the model is able to predict on the test data. Your audience for this description is someone who is very interested in The Office but who does not know a lot about statistics.

References

Data

This data is a cleaned subset of the data set from:

Khalid, Nashir. 2020. The Office (US) - Complete Dialogue/Transcript, Version 4. Retrieved from https://www.kaggle.com/datasets/nasirkhalid24/the-office-us-complete-dialoguetranscript LIWC

The text features were created for the educational purposes of teaching this course using LIWC-22.

Boyd, R. L., Ashokkumar, A., Seraj, S., & Pennebaker, J. W. (2022). The development and psychometric properties of LIWC-22. The University of Texas at Austin. https://www.liwc.app