STA 279 Data Analysis 1

When you open your R Markdown file, your first chunk likely looks like:

knitr::opts_chunk$set(echo = TRUE)

Change it to:

knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE, fig.asp = 0.5)

AI Policy

You may NOT use generative AI (including ChatGPT, Gemini, or any other platform) to:

  • Produce/write code for this Data Analysis.
  • Produce/create figures / plots / images for this Data Analysis.
  • Write or refine ANY of the text you submit for this Data Analysis.

Any violation of these rules will result in a 0 on the Data Analysis. I will be in class with you as you work on the assignment, so if you are stuck, ask me or your partner. There is a lot of help available if you need it!

Your code must use only code that we learned in class. If you want to use code learned in other courses or from outside sources, you must ask Dr. Dalzell before you do so.

The Goal

If you have chosen this Data Analysis, you have chosen to analyze data from the show Gilmore Girls. You can load the data you need using the code below:

GilmoreGirls <- read.csv("https://www.dropbox.com/scl/fi/00z1qneb6u5qymunotk9c/GilmoreGirlsNoNA.csv?rlkey=vwpp6r9rkjrv71v3fqp9np504&st=1wsnplou&dl=1")[,c(2,3,4)]

GilmoreGirls$Character <-as.factor(GilmoreGirls$Character)
GilmoreGirls$Line <-gsub("'s","",GilmoreGirls$Line)

Introduction

This data set is big: it contains every line spoken by every character across all seven seasons. For most of our computers, this is going to be too big to work with. So…

To make the data smaller, you will:

  • Choose 3-4 characters that you are interested in. This can be anything (the three boyfriends of Rory, three other characters, whatever seems cool).
  • Filter the data to contain only rows spoken by those characters.

When you are done with all of this, state which characters you chose and why. State how many rows are left in the data set when you are done!
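
As a rough illustration of the filtering step, here is a minimal base R sketch on a made-up toy data frame. The names toy, chosen, and toy_small are hypothetical, the three characters are just an example, and this is not necessarily the approach we used in class.

```r
# Toy stand-in for the real data; the column names mirror the loading code above.
toy <- data.frame(
  Character = as.factor(c("Rory", "Lorelai", "Luke", "Emily", "Rory", "Luke")),
  Line = c("Hi", "Coffee", "No phones", "Hello", "Thanks", "Fine"),
  Season = c(1, 1, 2, 2, 3, 3)
)

# Hypothetical choice of characters; replace with your own.
chosen <- c("Rory", "Luke", "Emily")

# Keep only the rows spoken by the chosen characters.
toy_small <- toy[toy$Character %in% chosen, ]

# Drop factor levels that no longer appear in the data.
toy_small$Character <- droplevels(toy_small$Character)

nrow(toy_small)             # number of rows left
table(toy_small$Character)  # rows per character
```

Dropping the unused factor levels keeps characters you did not choose from showing up as empty groups in later tables and plots.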

At this point, we are going to separate the data into test and training data, so that we can do prediction later!

# REPLACE YourData
n <- nrow( YourData )

set.seed(279)
train <- sample(1:n, floor(n * 0.85))
test <- c(1:n)[-train]

# REPLACE YourData 
test <- YourData[ test, ]
train <- YourData[train, ]
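
If you want to convince yourself that a split like this puts every row in exactly one of the two sets, the sketch below runs the same idea on a made-up data frame. The names toy, train_rows, test_rows, toy_train, and toy_test are hypothetical and only for illustration.

```r
# Toy stand-in for YourData (hypothetical): 20 rows with a row id.
toy <- data.frame(id = 1:20, Season = rep(1:4, each = 5))
n <- nrow(toy)

set.seed(279)
train_rows <- sample(1:n, floor(n * 0.85))  # about 85% of the row numbers, chosen at random
test_rows <- setdiff(1:n, train_rows)       # the row numbers that were not sampled

toy_train <- toy[train_rows, ]
toy_test <- toy[test_rows, ]

# Every row lands in exactly one of the two sets.
nrow(toy_train) + nrow(toy_test) == n
length(intersect(toy_train$id, toy_test$id)) == 0
```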

Section 1: What do they talk about?

NOTE 1: Throughout this assignment, you will notice labels like Section 1, Section 1.1, etc. You MUST use these labels in your final submission - this is how I will grade.

NOTE 2: This assignment should be written like a formal paper. This means you need transition sentences, like “In this section, we will examine how the text of the dialogue changes over time.” You MUST have such sentences throughout your assignment to make sure the reader can follow your work.

Section 1:

In this section, you are going to focus on analyzing the dialogue of just one character.

  • Choose one of the characters (totally up to you which one you choose!).
  • For this individual, create a well formatted plot or table to show the top 10 words spoken by this character (remember, this ALWAYS means exclude stop words, even if I don’t specify that!!).
  • Based on your plot or table, describe what this character seems to talk about most in their dialogue.

NOTE: To be ready for the exam, you should be able to write the code for finding the top 10 words without looking at your notes or any other resources.
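
To make the idea of "count words after excluding stop words" concrete, here is a small base R sketch on made-up lines. It is only an illustration of the logic, not the code from class: toy_lines and stop_words_toy are hypothetical, and a real stop word list is much longer.

```r
# Toy lines (made up for illustration, not taken from the data set).
toy_lines <- c("Coffee coffee please", "I need the coffee", "The town is the best town")

# Tiny hypothetical stop word list.
stop_words_toy <- c("i", "the", "is", "need", "please")

# Split each line into lowercase words and drop empty pieces.
words <- unlist(strsplit(tolower(toy_lines), "[^a-z']+"))
words <- words[words != ""]

# Remove stop words, then count each remaining word and sort by frequency.
kept <- words[!(words %in% stop_words_toy)]
counts <- sort(table(kept), decreasing = TRUE)

head(counts, 10)  # at most the 10 most frequent words
```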

Section 2: What makes them different?

Section 2:

In this section, you are going to compare the 3-4 characters you chose to see what makes the content of their dialogue different from one another.

  • Find the top 10 words that distinguish the dialogue of the characters.
  • Create a well-formatted plot (not a table) to show these 10 words.
  • Based on your plot, describe what seems to separate the dialogue of the characters.

Section 3: What about emotion?

Section 3: Sentiment

One of the variables in the data set is the season number. We are going to explore how the sentiment of the characters changes across the seasons.

  • Create and clearly state a research question that you might be interested in related to sentiment and time. For instance, is Emily ever more positive than Rory? You may choose any question you like that can be answered using these data except this one!
  • For this application, do you think the average sentiment score or total sentiment score is more appropriate to consider? Explain your reasoning.
  • Based on which you chose, compute a sentiment score using AFINN. Hint: If you choose average sentiment score, just change sentiment = sum(value) to sentiment = mean(value) in the code at the end of Lab 5. If you choose total sentiment score, you can leave the code at the end of Lab 5 alone.
  • Create a well formatted plot to visualize the relationship(s) of interest highlighted in the research question.
  • Answer your research question based on the plot.
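
To see how the two summaries behave, here is a small sketch with made-up word scores for two placeholder characters. The data frame scored and its values are hypothetical; the column name value follows the hint above, and base R's aggregate() stands in for the Lab 5 code.

```r
# Hypothetical AFINN-style scores, one row per scored word.
scored <- data.frame(
  Character = c("A", "A", "A", "A", "B", "B"),
  value = c(2, 2, 1, 1, 3, 2)
)

# Total sentiment: adds up every score for each character.
total <- aggregate(value ~ Character, data = scored, FUN = sum)

# Average sentiment: the total divided by the number of scored words.
average <- aggregate(value ~ Character, data = scored, FUN = mean)

total
average
```

In this toy example, A has the larger total while B has the larger average, so the two summaries can order characters differently.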

Hint:

This will work best if you convert Season to a number in R:

train$Season <- as.numeric(train$Season)

At that point, you can use this to get your plot going:

ggplot( train , aes(x=Season, y= sentiment, col = Character)) +
  geom_boxplot(  ) + 
  labs( x = "..." , y = "..."  , title = "...")

Section 4: Prediction

Section 4: Prediction with Sentiment

  • Use Naive Bayes with \(Y\) = character and \(X\) = sentiment (created in Section 3) to predict which character spoke each line in the test data. Hint: You will need to create your sentiment feature in your test data!
  • Create a well formatted confusion matrix to show how well the model is able to predict on the test data.
  • State the TPR, TNR, and accuracy of your model.
  • Based on all of this, describe how well your model is able to predict character using sentiment. This means commenting on when you predict well, and when you seem not to predict well.
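
With more than two characters, TPR and TNR are usually reported one character at a time (that character versus everyone else). The sketch below shows the arithmetic on made-up labels; truth and pred are hypothetical, and in your analysis the predictions would come from your Naive Bayes model.

```r
# Hypothetical true and predicted labels for six test lines.
truth <- factor(c("Rory", "Rory", "Luke", "Luke", "Emily", "Emily"))
pred <- factor(c("Rory", "Luke", "Luke", "Luke", "Emily", "Rory"),
               levels = levels(truth))

# Confusion matrix: rows are predictions, columns are the truth.
cm <- table(Predicted = pred, Actual = truth)

# Accuracy: share of lines on the diagonal.
accuracy <- sum(diag(cm)) / sum(cm)

# One-vs-rest counts for a single character, here "Luke".
tp <- cm["Luke", "Luke"]         # Luke lines predicted as Luke
fn <- sum(cm[, "Luke"]) - tp     # Luke lines predicted as someone else
fp <- sum(cm["Luke", ]) - tp     # other lines predicted as Luke
tn <- sum(cm) - tp - fn - fp     # other lines predicted as someone else

tpr <- tp / (tp + fn)  # true positive rate for Luke
tnr <- tn / (tn + fp)  # true negative rate for Luke
```

Repeating the last block for each character shows which characters the model predicts well and which it confuses.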

Before you submit

A few last steps before we knit, and then you will be done!

  • Find the top of this file (the little tab), and look under it. You should see something with ABC and a check mark. This is for checking spelling! Click this to check your spelling before you do a final knit and submit.
  • If you are working with a partner, make sure both of your names are at the top of the file.
  • You must submit a PDF or HTML file. If you submit any other file type, it cannot be graded. Let me know if you have any questions.

Once you’ve done this, knit your file. This will create the PDF or HTML file you need to submit. If you get stuck, let Dr. Dalzell know!

References

Data

This data is a cleaned subset of the data set from:

Julkwa. October 2021. Gilmore Girls Lines, Version 1. Retrieved from https://www.kaggle.com/datasets/julqka/gilmore-girls-lines