STA 279 Data Analysis 2

When you open your Markdown file, your first chunk likely looks like:

knitr::opts_chunk$set(echo = TRUE)

Change it to:

knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE, fig.asp = 0.5)

AI Policy

You may NOT use generative AI (including ChatGPT, Gemini, or any other platform) to:

Produce/write code for this Data Analysis.
Produce/create figures / plots / images for this Data Analysis.
Write ANY of the text you submit for this Data Analysis.

Any violation of these rules will result in a 0 on the Data Analysis. I will be in class with you as you work on the assignment, so if you are stuck, ask me, your partners, or other students in the class. There is a lot of help available if you need it!

For your code, please note you must use code that we learned in class. If you want to use code learned in other courses or from outside sources, you must ask Dr. Dalzell before you do so.

The Goal

If you have chosen this Data Analysis, you have chosen to analyze data from the show Gilmore Girls. You can load the data you need using the code below:

GilmoreGirls <- read.csv("https://www.dropbox.com/scl/fi/00z1qneb6u5qymunotk9c/GilmoreGirlsNoNA.csv?rlkey=vwpp6r9rkjrv71v3fqp9np504&st=1wsnplou&dl=1")

GilmoreGirls$Character <- as.factor(GilmoreGirls$Character)

Introduction

This data set is big - it contains all the lines from every character from all seven seasons. For most of our computers, this is going to be too big to work with. So…

To make the data smaller, you will:

Choose 3-4 characters that you are interested in. This can be anything (the three boyfriends of Rory, three other characters, whatever seems cool). Filter the data to contain only rows spoken by those characters.
Depending on the size of your data set, I might recommend you consider only a few seasons of data, rather than all the seasons. Note: This is optional!! Feel free to work with all the seasons to start with, and then adapt when you get to modeling if the data set is too large.

When you are done with all of this, state which characters you chose, and which seasons you chose, and why. State how many rows are left in the data set when you are done!
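As a rough illustration of the filtering steps above, here is a minimal base R sketch on a tiny made-up data frame. The column names Character, Season, and WC come from this assignment, but the toy rows and the names toy, keep_characters, keep_seasons, and small are assumptions for illustration only. If we used a different approach in class, use that instead.

```r
# Toy stand-in for the real data (made-up rows, for illustration only)
toy <- data.frame(
  Character = factor(c("Rory", "Lorelai", "Luke", "Emily", "Rory", "Luke")),
  Season    = c(1, 1, 2, 3, 4, 7),
  WC        = c(12, 30, 5, 18, 9, 7)
)

keep_characters <- c("Rory", "Luke")   # hypothetical choice of characters
keep_seasons    <- 1:4                 # hypothetical choice of seasons (optional step)

# Keep only rows spoken by the chosen characters in the chosen seasons
small <- toy[toy$Character %in% keep_characters &
             toy$Season %in% keep_seasons, ]

# Drop factor levels for characters that are no longer in the data
small$Character <- droplevels(small$Character)

nrow(small)              # number of rows left
table(small$Character)   # rows per remaining character
```

Dropping the unused levels matters later: otherwise the characters you removed can still show up as empty categories in tables and models.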

At this point, we are going to separate the data into test and training data, so that we can do prediction later!

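# REPLACE YourData with the name of your filtered data set
n <- nrow( YourData )

# Randomly choose 85% of the row numbers for training; the rest are for testing
set.seed(279)
train_rows <- sample(1:n, floor(n*.85))
test_rows <- c(1:n)[-train_rows]

# REPLACE YourData 
test <- YourData[ test_rows, ]
train <- YourData[ train_rows, ]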
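If you want to convince yourself the split worked, a quick check like the one below can help. This is only a sketch on a made-up 20-row data frame; the names toy, train_toy, and test_toy are hypothetical and are not part of the assignment.

```r
# Toy data with 20 rows standing in for YourData (made up)
toy <- data.frame(id = 1:20, WC = seq(5, 100, by = 5))
n <- nrow(toy)

set.seed(279)
train_rows <- sample(1:n, floor(n * 0.85))   # 85% of the row numbers
test_rows  <- setdiff(1:n, train_rows)       # the remaining row numbers

train_toy <- toy[train_rows, ]
test_toy  <- toy[test_rows, ]

nrow(train_toy)                                # 17 rows for training
nrow(test_toy)                                 # 3 rows for testing
length(intersect(train_toy$id, test_toy$id))   # 0: no row is in both sets
```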

If you get stuck with any of this, let Dr. Dalzell know!

Section 1: EDA

NOTE 1: Throughout this assignment, you will notice labels like Section 1, Section 1.1, etc. You MUST use these labels in your final submission - this is how I will grade.

NOTE 2: This assignment should be written like a formal paper. This means you need transition sentences, like “In this section, we will examine how the text of the characters changes over time.” You MUST have such sentences throughout your assignment to make sure the reader can follow your work.

Section 1.1: Season

One of the variables in the data set is the season the line was spoken in. Choose at least two research questions you want to explore with this data involving Season. For example, how does the length of the text (word count) change across time for the different characters?

For each research question,

State the research question clearly.
Create a well formatted plot to visualize the relationship(s) of interest. Title your plots Figure 1.1, Figure 1.2, etc. The “1” tells us we are in Section 1 of the paper.
Answer your research question based on the plot.

Hint:

This will work best if you convert Season to a number and Character to a categorical variable:

train$Season <- as.numeric(train$Season)
train$Character <- as.factor(train$Character)

At that point, you can use this to get your plot going:

# Fill in variableofinterest, the axis labels, and the title
ggplot( train , aes(x=Season, y= variableofinterest, col = Character)) +
  geom_smooth(method = "loess") + 
  labs( x = , y = , title = )

Inside geom_smooth(method = "loess"), you can add se = FALSE, as in geom_smooth(method = "loess", se=FALSE), if you don’t want to see the uncertainty bands (confidence intervals) around the lines.

Section 1.2: Character

Choose at least two research questions you want to explore with this data involving differences in text characteristics for different characters. For example, how does the clout (which measures confidence, power, and leadership communicated through words) differ across characters?

For each research question,

State the research question clearly.
Create a well formatted plot to visualize the relationship(s) of interest. Continue the numbering from the previous section - this means that if the last plot in Section 1.1 is labelled Figure 1.3, you should start your labels for these plots with Figure 1.4.
Answer your research question based on the plot.

NOTE: None of these research questions may involve season. You need completely different research questions than you used in Section 1.1.

Section 2: Multinomial Regression

Section 2.1: Choosing a Model

Note: Be careful you only include columns that are features when you build the model.

In this section, you will be fitting a multinomial regression model for \(Y\) = character.

Would you recommend using (a) the full model or (b) a model chosen using feature selection? Clearly explain which model you choose and why.
Show a professionally formatted table of coefficients for your fitted model. You do NOT have to write down the fitted model.
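To give a sense of the workflow, here is a minimal sketch that fits a multinomial model and tables its coefficients. It assumes nnet::multinom as the fitting function (the function we used in class may differ, so check your notes) and uses the built-in iris data as a stand-in for the Gilmore Girls data. The names fit and coefs are hypothetical.

```r
# Built-in iris data as a stand-in: Y = Species (3 categories)
fit <- nnet::multinom(Species ~ Sepal.Length + Sepal.Width,
                      data = iris, trace = FALSE)

# One row of coefficients per non-baseline category;
# the baseline is the first factor level ("setosa")
coefs <- coef(fit)

knitr::kable(round(coefs, 3),
             caption = "Multinomial regression coefficients (iris example)")
```

Each row compares one category to the baseline, so with your data each row would compare one character to the baseline character.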

In Section 1.2, you chose two research questions involving differences in text for different characters. In that section, you used graphs to explore the relationship of interest. Now that we have a model, we can use it to help explore the same questions.

Section 2.2: Answering your research questions

NOTE: If you used a selection technique and the variables of interest from Section 1.2 do not show up in your model, you can either choose new research questions or build a model that does include those features (so just add the features you need into the model).

Using the fitted model, clearly answer your two research questions from Section 1.2. Do this using:

An interpretation of appropriate coefficients.
Appropriate visualizations from your model coefficients.

Section 2.3: Predictions

Use your model from Section 2.1 to make predictions on the test data. Show the appropriate confusion matrix, professionally formatted.
Describe how well the model is able to predict on the test data. Your audience for this description is a super fan of Gilmore Girls who does not know a lot about statistics.
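A confusion matrix on test data could be built along the following lines. This is a sketch only: it again assumes nnet::multinom and the built-in iris data, and the names train_iris, test_iris, pred, conf_mat, and accuracy are made up for the example.

```r
set.seed(279)
n <- nrow(iris)
train_rows <- sample(1:n, floor(n * 0.85))
train_iris <- iris[train_rows, ]
test_iris  <- iris[-train_rows, ]

fit <- nnet::multinom(Species ~ Sepal.Length + Sepal.Width,
                      data = train_iris, trace = FALSE)

# Predicted category for each row of the test data
pred <- predict(fit, newdata = test_iris, type = "class")

# Rows = predicted category, columns = actual category
conf_mat <- table(Predicted = pred, Actual = test_iris$Species)
knitr::kable(as.data.frame.matrix(conf_mat),
             caption = "Confusion matrix on test data (iris example)")

# Share of test rows that were predicted correctly
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
```

The diagonal of the table counts correct predictions, which is why the accuracy divides the diagonal total by the overall total.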

Section 3: Classification Trees

Section 3.1: Fitting a Model

Build an appropriate classification tree using \(Y\) = character.
Show a professionally formatted graphic to show your tree.
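For orientation, a classification tree can be fit and drawn roughly as follows. This sketch assumes the rpart package and base graphics, and uses the built-in iris data. The functions we used in class for fitting and plotting may be different, and the names fit_tree and tree_pred are hypothetical.

```r
# Fit a classification tree with Y = Species using all other columns as features
fit_tree <- rpart::rpart(Species ~ ., data = iris, method = "class")

# Simple base-graphics drawing of the tree with split labels
plot(fit_tree, margin = 0.1)
text(fit_tree, use.n = TRUE, cex = 0.8)

# Predicted category for each row
tree_pred <- predict(fit_tree, newdata = iris, type = "class")
table(Predicted = tree_pred, Actual = iris$Species)
```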

Section 3.2: Describing relationships

Based on the classification tree, clearly explain what traits are associated with each of the characters. Explain this as though you are talking to someone who is a super fan of Gilmore Girls but who does not know a lot about statistics.

Section 3.3: Predictions

Use your model from Section 3.1 to make predictions on the test data. Show the appropriate confusion matrix, professionally formatted.
Describe how well the model is able to predict on the test data. Your audience for this description is a super fan of Gilmore Girls who does not know a lot about statistics.

Section 4: Linear Discriminant Analysis (LDA)

In this section, we are going to play a little more with LDA, which we first tried in Lab 8.

We can think of LDA as a regression type model. However, instead of creating one coefficient for each feature in the data set, LDA focuses on combining these features into combination features. For example, I might combine high energy, high playfulness, and high curiosity into a measure for “Childlike”. This means that instead of looking at the 3 features individually, I could use the combination to compare groups. This is really useful when we have a lot of features, as we do with our current data set.

To see what we mean, let’s consider only a few features from our current data set:

negate: % negations (example: not, no, never, nothing).
WC: the number of words in the line.
allnone: % of words that are all or none (example: all, no, never, always).
Clout: a measure of power, leadership, confidence; higher clout is related to higher values of these traits.
BigWords: the % of words with more than 6 letters.

To fit LDA just using these features, we use

lda_out <- MASS::lda(Character ~ negate + WC + allnone+ Clout+ BigWords, data = train)

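The combined features produced by LDA are called linear discriminants. LDA produces at most one fewer linear discriminant than there are categories of \(Y\), so with 3 characters, the 5 features we started off with are reduced to two linear discriminants, called LD1 and LD2. (With 4 characters, there would also be an LD3.)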
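You can check how many linear discriminants a fitted LDA has by looking at the dimensions of its scaling matrix. The sketch below uses the built-in iris data (4 features, 3 categories) rather than the Gilmore Girls data, and the name lda_iris is made up for the example.

```r
# iris has 4 features and 3 categories of Species
lda_iris <- MASS::lda(Species ~ ., data = iris)

dim(lda_iris$scaling)        # 4 rows (one per feature), 2 columns
colnames(lda_iris$scaling)   # "LD1" "LD2"
```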

First Feature (LD1)

To see the first feature LD1, we can use the following:

knitr::kable(  lda_out$scaling[,1], col.names = "Loadings" )

Suppose, for example, that LD1 looks like this (it may not on your data, this is just an example):

\[LD1_{i} = -.0013 negate_i -.021 WC_i -.022 allnone_i -.016 Clout_i -.067 BigWords_i\]

You will notice that this looks like a regression model! This is showing how each of the original 5 features contributes to the first linear discriminant.

All the coefficients for the features are negative. This means that high values of LD1 are associated with lines that have FEWER negate words, a lower word count, fewer all/none words, less clout, and fewer big words.

Second Feature (LD2)

We can use the same process to look at the 2nd linear discriminant, LD2:

\[LD2_{i} = .0057 negate_i +.27 WC_i + .11 allnone_i + .011 Clout_i +.072 BigWords_i\]

Here, we can see all positive coefficients! This means higher values of all of the features are associated with higher values of LD2.

We can also see that all/none, word count, negate, and big words are more important in LD2 than in LD1. We can see this based on whether the coefficient for a feature is larger in absolute value for LD1 or LD2.
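The comparison of absolute values can also be done in R. The sketch below simply types in the example coefficients from the two equations above (they are illustrative values, not results from your data), and the names ld1, ld2, and comparison are hypothetical.

```r
# Example coefficients copied from the LD1 and LD2 equations above
features <- c("negate", "WC", "allnone", "Clout", "BigWords")
ld1 <- c(-.0013, -.021, -.022, -.016, -.067)
ld2 <- c(.0057, .27, .11, .011, .072)

# For each feature, which discriminant has the larger coefficient in absolute value?
comparison <- data.frame(
  Feature  = features,
  LD1      = ld1,
  LD2      = ld2,
  LargerIn = ifelse(abs(ld2) > abs(ld1), "LD2", "LD1")
)
comparison
```

Clout is the only feature whose coefficient is larger in absolute value for LD1, which matches the description above.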

Plotting

Now instead of 5 features, we have 2, each of which is a combination of the original 5. When we plot the results of LDA, the coordinate of each line of dialogue is \((LD1_i, LD2_i)\).
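To see these coordinates in practice, here is one possible sketch using the built-in iris data and base graphics. The plotting code we used in Lab 8 may look different, and the names lda_iris and scores are made up for this example.

```r
lda_iris <- MASS::lda(Species ~ ., data = iris)

# One row per observation, one column per linear discriminant (LD1, LD2)
scores <- predict(lda_iris)$x
head(scores)

# Plot each observation at its (LD1, LD2) coordinate, colored by category
plot(scores[, "LD1"], scores[, "LD2"],
     col = as.integer(iris$Species), pch = 19,
     xlab = "LD1", ylab = "LD2",
     main = "LDA scores (iris example)")
legend("topright", legend = levels(iris$Species),
       col = seq_along(levels(iris$Species)), pch = 19)
```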

Section 4.1: LDA Model 1

Model

Fit an LDA using the same variables chosen by feature selection from multinomial regression. NOTE: The goal here is to create an LDA that you can interpret. If there are too many features chosen by feature selection to make that feasible, please feel free to just choose at least 5 features that you are interested in and that are not used in the example above.

Interpretation

Show a professionally formatted output for LD1.
Describe what high values of LD1 indicate.
Show a professionally formatted output for LD2.
Describe what high values of LD2 indicate.

Plotting

Create a plot from the LDA model.
Choose any 2 characters from the plot. Based on the plot, describe what seems to be different between the text of these two characters. Explain this as though you are talking to someone who is a super fan of Gilmore Girls but who does not know a lot about statistics.

Section 4.2: LDA with all the features

Fit an LDA model using all the features, and create a plot from the LDA model.
Based on the visual, which model (LDA with only a few features or LDA with all the features) seems to be a better fit to the sample data, or do they seem to fit about the same? Explain your choice.

Section 4.3: Prediction

Use your LDA from Section 4.1 to make predictions on the test data. Show the appropriate confusion matrix, professionally formatted.
Use your LDA from Section 4.2 to make predictions on the test data. Show the appropriate confusion matrix, professionally formatted.
State which LDA does the best job at predicting on the test data.
Describe how well that model is able to predict on the test data. Explain this as though you are talking to someone who is a super fan of Gilmore Girls but who does not know a lot about statistics.
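Comparing two LDA models on test data might look something like the sketch below. It uses the built-in iris data as a stand-in, and the helper test_accuracy and the names lda_few, lda_all, acc_few, and acc_all are hypothetical, not functions or objects from class.

```r
set.seed(279)
n <- nrow(iris)
train_rows <- sample(1:n, floor(n * 0.85))
train_iris <- iris[train_rows, ]
test_iris  <- iris[-train_rows, ]

# One LDA with a few features, one LDA with all the features
lda_few <- MASS::lda(Species ~ Sepal.Length + Sepal.Width, data = train_iris)
lda_all <- MASS::lda(Species ~ ., data = train_iris)

# Hypothetical helper: share of test rows predicted correctly
test_accuracy <- function(model, test_data) {
  pred <- predict(model, newdata = test_data)$class
  mean(pred == test_data$Species)
}

acc_few <- test_accuracy(lda_few, test_iris)
acc_all <- test_accuracy(lda_all, test_iris)
c(few_features = acc_few, all_features = acc_all)
```

You would still show the confusion matrix for each model; the accuracy is just a one-number summary to help you decide which model predicts better.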

Section 5: Conclusion

Conclusion

At this point, you have been able to build several models for both interpretation and prediction. Explain to someone who is a super fan of Gilmore Girls but who does not know a lot about statistics:

What your models suggest about the traits that distinguish the speech of the different characters.
How well your models are able to predict who spoke a line based on traits in the data.

References

Data

This data is a cleaned subset of the data set from:

Julkwa. October 2021. Gilmore Girls Lines, Version 1. Retrieved from https://www.kaggle.com/datasets/julqka/gilmore-girls-lines

LIWC

The text features were created for the educational purposes of teaching this course using LIWC-22.

Boyd, R. L., Ashokkumar, A., Seraj, S., & Pennebaker, J. W. (2022). The development and psychometric properties of LIWC-22. The University of Texas at Austin. https://www.liwc.app