STA 279 Lab 6

Complete all Questions.

The Goal

We have been working with multinomial regression models, which are models we can use when our response variable is categorical with more than 2 levels. We work with such data a lot when we are dealing with text, so multinomial is a model we encounter often.

Our client for today is a cell phone carrier who is trying to determine what customers are saying about different cell phones they carry. They head over to a common source of reviews online. However, a glitch in the system often means phone reviews are labelled under the wrong brand (this happens A LOT in Amazon and other review systems!).

They provide us a data set called train with 1000 phones that are labelled with their brand. They ask us to use those phones to build a model to (1) predict which brand a phone is and (2) describe any key traits in reviews that might suggest a phone is from a certain brand.

They then also provide us a data set called test with 500 phones to use to assess the predictive abilities of our model.

You can use the following code to load the data.

# Load the libraries
library(dplyr)
library(stringr)
library(ggplot2)
library(tidyr)
library(tidytext)
library(nnet)
library(tm)

# Load the data
train <- 
test  <-

If you look at the train dataset you have loaded, the dataset should have 1000 rows and 119 columns. The first column is our Y variable brand, and the remaining 118 columns are features about the text created from the LIWC.

As a reminder, the LIWC is a tool that allows us access to multiple lexicons that we can use to create features about text data. For instance, how much of the review is written in past tense? How much involves discussions of economics? How long are the sentences, and what is the tone like? The list of possible features is extensive!

If you are curious about what each feature represents, you can look at the list on page 12 of this document. This details each of the features created by the LIWC and gives examples of the types of words in each of the lexicons it uses to create these features.

Model 1: Perception

Our client thinks that one feature that might be important is the percent of words in a review that are related to perception. These are words about how a phone feels, sounds, looks, etc. In other words, these are words about how a person physically interacts with the device.

Percepetion = the percent of words in a text that have to do with the sensory experiences (hearing, feeling, touching, smelling, etc.)

They ask us to build a model using this as a feature.

Recall that for multinomial regression, we use:

# Load the library
library(nnet)
library(broom)

# Build the model
Model1_Perception <- multinom( Y ~ X , data = dataset, trace = FALSE)

# Print out the coefficients
knitr::kable( tidy(Model1_Perception) )

Question 1

Adapt the code above to fit a multinomial regression model for $Y$ = brand with $X$ = the percent of words in a text related to perception. Show the formatted coefficient table as the answer for this question.

Question 2

What is the baseline for brand? How do you know?

Question 3

Write down the trained model using appropriate notation, rounding to 3 decimal places. Remember, you should have more than one line!!

Hint 1: Put the following in the white space in your Markdown file and adapt. Make sure you change SUB1 and SUB2.

$$log \left( \frac{\hat{\pi}_{i(SUB1)}}{\hat{\pi}_{i(SUB2)}} \right) = $$

Once we have the model built, there are a few different things we can do with it.

Interpretation

Question 4

Interpret the slope in the Samsung component of the model in terms of the relative risk.

Question 5

Interpret the slope in the Motorola component of the model in terms of the relative risk.

Question 6

Suppose 5% of words in an article are about perception (so Perception = 5). Compute and interpret the relative risk of being a Samsung review versus Apple review in this scenario.

Hint 1: Your interpretation should say something like “The probability of being written about brand A is 2 times higher than the probability of being written about brand B.”

Hint 2: A lot of people don’t like interpreting when we have something like $\hat{\pi}_{i(A)} = .65 \hat{\pi}_{i(B)}$. If this is you, remember you can rewrite this as $\hat{\pi}_{i(B)} = 1/.65 \hat{\pi}_{i(A)} = 1.53\hat{\pi}_{i(A)}$ and interpret in this form!

Question 7

Suppose 25% of words in an article are about perception (so Perception = 25). Compute and interpret the relative risk of being a Motorola review versus Apple review in this scenario.

Hint 1: Your interpretation should say something like “The probability of being written about brand A is 2 times higher than the probability of being written about brand B.”

The kind of interpretation work we have done in the previous two questions is useful. However, especially when we are working with a client, it is extremely valuable to be able to visualize the relationships that our model is describing. In other words, we would like to be able to make a plot that shows how the probability of being written about each brand changes as the percent of perception words changes.

We can do this in R, but the code involved is a little tricky, so I built a function that will do it for us. To teach R the function, copy and paste the following into a chunk and press play. DO NOT CHANGE ANYTHING ABOUT THIS FUNCTION.

plot_multnomial <- function(data, model, Y, X, Xname, xlab){
  
  if(length(coefficients(model)[1,])==2){
  
  predictions <- predict( model, type ="probs")
  
  data[, Y] <-as.factor(data[,Y])
  
  J = length(levels(data[,Y]))
  
  holder <- data.frame("group" = NA, "x" = NA, "Probability"=NA)
  a  = 1 
  for(i in 1:nrow(data)){
    for(j in 1:J){
      holder[a,1] <- levels(data[,Y])[j]
      holder[a,2] <- data[i,X]
      holder[a,3] <- predictions[i,j]
      a = a + 1
    }
  }
  
  g1 <- ggplot( holder, aes(x, Probability, col= group)) + 
    geom_line() + 
    labs(x = xlab)
  }
  
  if(length(coefficients(model)[1,])>2){
    
    suppressMessages(library(pdp))
    suppressMessages(library(ggplot2))
    
    toplot1 <- data.frame(partial(model, Xname, which.class = 1, plot = FALSE, prob = TRUE))
    toplot2 <- data.frame(partial(model, Xname, which.class = 2, plot = FALSE, prob = TRUE))
    toplot3 <- data.frame(partial(model, Xname, which.class = 3, plot = FALSE, prob = TRUE))
    
    toplot1$response <- levels(data[,Y])[1]
    toplot2$response <- levels(data[,Y])[2]
    toplot3$response <- levels(data[,Y])[3]
    
    hold <- rbind(toplot1,toplot2,toplot3)
    hold <-data.frame(hold)
    hold$response <- as.factor(hold$response)

    g1 <- ggplot(hold, aes(x = hold[,1], yhat, col = response)) + geom_line() + 
      labs(x = xlab, y = "Probability")
    
  }
  
  g1
  
  
  
}

To use this function, adapt the following:

plot_multnomial(data, model, Y, X, Xname, xlab)

data: replace with the name of your data set.
model: replace with the name of your model.
Y: replace with the number of the column in your data set when $Y$ is.
X: replace with the number of the column in your data set when $X$ is. For perception, this is 96.
Xname: replace with the name of your $X$ column. Make sure you put it in quotes, for example: "Perception"
xlab: replace with the label you want for the x axis label. Make sure you put it in quotes, for example: "The x axis label".

Question 8

Adapt the code above to create a plot that shows how the probability of being written about each brand changes as the percent of perception words changes. Show the plot.

Question 9

Recall that we have a client who thought that the percent of perception words might be useful for determining brand. Based on the plot, describe in 2-3 sentences the relationship between the percent of perception words and brand.

Question 10

For what values of $X$ = percent perception words does it look like the model will struggle to predict brand? In other words, for what values of $X$ does it become less clear which brand a comment is about?

Question 11

Choose a different feature (anything except Perception). State which feature you choose, build a multinomial regression model for brand using that feature, and then adapt the code above Question 8 to visualize the relationship between that feature and brand.

As an answer to this question, show the plot and describe the relationship between $X$ and brand shown in the plot.

In addition to describing the relationships in the data, if we want to find predicted probabilities from a multinomial regression model for a row $i$ in the data set, we can use:

fitted(Model1_Perception)[i,]

Question 12

Row 20 in the data set has 0% perception words. What are the predicted probabilities of being from each brand?
Do these probabilities match what you see in the plot in Question 8? Explain.

Model 2: The Full Model

Now that we are starting to move into prediction, it makes sense to use more than just one feature. We have 118 available to us, so as a first pass, let’s use them all! Adding every feature we have access to is called the full model.

# Build the full model
Model2_Full <- multinom( brand ~ . , data = train, trace = FALSE)

This model will have a LOT of coefficients, so looking at the trained model can be challenging. Before we even try, let’s see if our full model is a better fit to the training data than Model 1.

Question 13

Based on the AIC, which model is a better fit to the data, Model 1 with only perception or Model 2(the full model)? Explain your reasoning.

Note: To get the AIC of a model, we use AIC(model)

What happened?? Well, we went to the extreme of adding in ALL the features we had access to, but we have no guarantee that these are all useful or needed to improve model fit. This is why we often reach for things like feature selection to determine which of the features we have available we might want to add into the model.

Model 3: Feature Selection

I’ve written you a function you can use to perform forward selection with multinomial regression. Copy and paste the function in the chunk below and press play.

forward_selection_multinomial <- function(features, response){

  # Load the library
  suppressMessages(library(nnet))
  # Initialize a
  a = 1 
  
  check <- apply(features, 2, function(x) length(unique(x)))
  to.remove <- which(check == 1)
  if(length(to.remove)>0){
  features <- features[,-to.remove]
  }
  
  # Count the number of features
  p <- ncol(features)
  
  # Find the best feature with only one feature in the model

  for( i in 1:p){
    modelholder <- multinom(response ~ ., data = data.frame(features[,i]), trace = FALSE)
    if(a == 1){
      modelcurrent = i
      AICcurrent = AIC(modelholder)
      a = a+1
      }
    if(AIC(modelholder) < AICcurrent){
      AICcurrent <- AIC(modelholder)
      modelcurrent <- i
    }
    rm(modelholder)
  }


  model1 <- multinom(response ~ ., data = data.frame(features[,modelcurrent]), trace = FALSE)
  AIC_tobeat <- AIC(model1)
  tokeep = modelcurrent
  
    print(1)
  print(AIC(model1))

  for( x in 1:(p-1)){
    a = 1 
    for( i in c(1:p)[-tokeep]){
      modelholder <- multinom( response ~ ., data = features[,c(i,tokeep)], trace = FALSE)
      if(a == 1){
        modelcurrent = i
        AICcurrent = AIC(modelholder)
        a = a+2
      }
      if(AIC(modelholder) < AICcurrent){
        AICcurrent <- AIC(modelholder)
        modelcurrent <- i
        }
      rm(modelholder)
    } 
    if(a >1 & AIC_tobeat- AICcurrent < 1){
      break
    }
    tokeep <- c(tokeep,modelcurrent)
    print(x+1)
    print(AICcurrent)
    AIC_tobeat = AICcurrent
  }
  
  out <- cbind(response, features[,tokeep])
  
  out<-data.frame(out)
  
  out

}

DO NOT CHANGE ANYTHING IN THE FUNCTION ABOVE!!

To use this function, you need to provide two inputs:

features: A dataset which contains only the columns in the original dataset that you want to use as features. Example: features <- dataset[,2:118] if the features are in columns 2 to 118 in a dataset called dataset.
response: A dataset which contains only the single column in the original dataset that holds the response variable. Example: response <- dataset[,1] if the response is in column 1 of a dataset called dataset.

It turns out that before we do this, we need to think a little about whether there are any columns in our dataset that we should not treat as features.

Question 14

Create the inputs features and response you need in order to run the forward selection code. Show your code as the answer to this question.

Once you have created features and response, you can run forward selection using the following:

# Run forward selection
data_forwardselection <- forward_selection_multinomial(features,response)
colnames(data_forwardselection)[1] <- "brand"

As the code runs, it will print out the number of features added to the model and the AIC that is associated with the model with that many features. You will notice that the AIC is getting steadily lower as the function runs!

The function also prints out the features that we put in the model according to forward selection and also stores those features in a data set called data_forwardselection.

Question 15

How many features does forward selection suggest we keep in our multinomial regression model?

Question 16

Which variable does feature selection suggest is most important in determining which brand a review is about? Explain.

To build a model using the features chosen by forward selection, we can use:

Model3_Forward <- multinom( brand ~ ., data = data_forwardselection, trace = FALSE)

Question 17

What is the AIC for the model chosen by forward selection?
Is this better than Model 1 with only one feature and Model 2 with all the features?

To see the fitted model, we can use:

knitr::kable(tidy(Model3_Forward))

This can be a lot to look at! We can also use our plotting command from before to visualize the model relationships.

Question 18

Based on our fitted model, is a review with 30% perception words more likely to be written about Apple, Samsung, or Motorola? Explain.

Note: You can use either a plot or the model coefficients to explain this.

Making Predictions for test data

To make predictions using multinomial regression, we use the code:

predictions <- predict(Model3_Forward, type = "class", newdata = test)

holder <- table(  predictions, test$brand )
rownames(holder) <- paste("Predicted", rownames(holder))
colnames(holder) <- paste("True", colnames(holder))
knitr::kable( holder )

Question 20

What is the Macro F1-score for this model?