In this project, I analyze an Emotion Classification dataset of English Twitter messages. Each tweet is labeled with one of six emotions:
sadness (0)
joy (1)
love (2)
anger (3)
fear (4)
surprise (5)
The goals of this analysis are:
To see how emotions are distributed in the dataset.
To look at tweet length patterns across emotions.
To describe the main patterns I find in a clear way.
dplyr::glimpse(df)
The dataset contains a text column with the tweet content and a numeric label column, which I convert into a factor called emotion. Each row is a single tweet with exactly one emotion label.
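The label-to-factor conversion described above might look like the following minimal sketch. The toy `df` and the name `emotion_levels` are assumptions made for illustration; only the `text`, `label`, and `emotion` column names and the 0 to 5 coding come from the description above.

```r
library(dplyr)

# Hypothetical toy rows standing in for the real dataset (not actual tweets).
df <- tibble(
  text  = c("i feel so alone today",
            "what a wonderful morning",
            "i am terrified of the exam"),
  label = c(0, 1, 4)
)

# Emotion names in the order of their numeric codes 0-5.
emotion_levels <- c("sadness", "joy", "love", "anger", "fear", "surprise")

# Map each numeric code to its named factor level.
df <- df %>%
  mutate(emotion = factor(label, levels = 0:5, labels = emotion_levels))

glimpse(df)
```

Fixing the levels explicitly keeps all six emotions in a stable order in later tables and plots, even if a subset of the data happens to miss one of them.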
In this section, I create several visualizations to understand the distribution of emotions and tweet lengths.
2.1 Emotion Counts
## # A tibble: 6 × 3
##   emotion       n   prop
##   <fct>     <int>  <dbl>
## 1 sadness  121187 0.291
## 2 joy      141067 0.338
## 3 love      34554 0.0829
## 4 anger     57317 0.138
## 5 fear      47712 0.114
## 6 surprise  14972 0.0359
This bar chart shows how many tweets belong to each emotion. The dataset is clearly imbalanced: joy (141,067 tweets) and sadness (121,187) are by far the largest classes, while surprise (14,972) is the smallest. A model trained on this data will see many more examples of the common emotions, so they will have more influence on what it learns.
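A table with `emotion`, `n`, and `prop` columns and a matching bar chart can be produced along these lines. This is a sketch on hypothetical toy data, not necessarily the code used for the report; the object names `emotion_counts` and `p_counts` are assumptions.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data; the real df holds one row per tweet.
emotion_levels <- c("sadness", "joy", "love", "anger", "fear", "surprise")
df <- tibble(
  emotion = factor(
    c("joy", "joy", "joy", "sadness", "sadness",
      "anger", "fear", "love", "surprise", "joy"),
    levels = emotion_levels
  )
)

# Count tweets per emotion and add each emotion's share of the total.
emotion_counts <- df %>%
  count(emotion) %>%
  mutate(prop = n / sum(n))

# Bar chart of the raw counts per emotion.
p_counts <- ggplot(emotion_counts, aes(x = emotion, y = n)) +
  geom_col() +
  labs(x = "Emotion", y = "Number of tweets")

print(emotion_counts)
print(p_counts)
```

Swapping `y = n` for `y = prop` in the `aes()` call gives the proportion version of the same chart.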
2.2 Proportion of Each Emotion
This plot shows the fraction of the dataset for each emotion, which makes it easier to see which emotions dominate. Joy (33.8%) and sadness (29.1%) together make up about 63% of the tweets, while love (8.3%) and surprise (3.6%) are relatively rare.
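The proportions can be checked directly from the counts in the 2.1 table. The short base R sketch below recomputes them and adds one extra summary that is not in the original analysis: the ratio of the largest class to the smallest (`imbalance_ratio` is a name chosen here for illustration).

```r
# Counts copied from the emotion count table in section 2.1.
n <- c(sadness = 121187, joy = 141067, love = 34554,
       anger = 57317, fear = 47712, surprise = 14972)

total <- sum(n)   # total number of tweets in the dataset
prop  <- n / total  # share of the dataset for each emotion

print(round(prop, 3))

# How many times larger the most common class is than the rarest one.
imbalance_ratio <- max(n) / min(n)
print(round(imbalance_ratio, 1))
```

With these counts, joy is roughly nine times as frequent as surprise, which is one simple way to put a number on the imbalance.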
2.3 Tweet Length by Emotion
First, I calculate the length of each tweet in characters and words.
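One way to compute both lengths is sketched below. It assumes that characters are counted with `nchar()` (spaces included) and that words are whitespace-separated tokens; the report does not state the exact definitions, and the column names `n_char` and `n_words` are hypothetical.

```r
library(dplyr)

# Hypothetical toy tweets standing in for the real text column.
df <- tibble(
  text = c("i feel so alone today",
           "what a wonderful morning",
           "  wow  ")
)

df <- df %>%
  mutate(
    # Characters per tweet, counting spaces and punctuation.
    n_char  = nchar(text),
    # Words per tweet: trim outer whitespace, then split on runs of whitespace.
    n_words = lengths(strsplit(trimws(text), "\\s+"))
  )

print(df)
```

Trimming before splitting avoids counting an empty leading token when a tweet starts with whitespace.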
2.3.1 Average Tweet Length (Characters)
## # A tibble: 6 × 2
##   emotion  avg_char
##   <fct>       <dbl>
## 1 sadness      93.1
## 2 joy          98.7
## 3 love        105.
## 4 anger        96.0
## 5 fear         96.7
## 6 surprise     99.7
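Average tweet length differs only modestly across emotions, from about 93 characters for sadness to about 105 for love. Love tweets are the longest on average and sadness tweets are the shortest, so the idea that emotions such as sadness or fear need more explanation is not supported by these averages. Anger, fear, joy, and surprise all fall between 96 and 100 characters.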
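A per-emotion average like `avg_char` is a grouped mean. The sketch below shows the pattern on hypothetical toy rows; `avg_length` and `n_char` are assumed names, and the toy values are unrelated to the real results.

```r
library(dplyr)

# Hypothetical toy data with a text column and an emotion factor.
emotion_levels <- c("sadness", "joy", "love", "anger", "fear", "surprise")
df <- tibble(
  text    = c("so sad", "very sad today", "yay"),
  emotion = factor(c("sadness", "sadness", "joy"), levels = emotion_levels)
)

# Mean character count within each emotion group.
avg_length <- df %>%
  mutate(n_char = nchar(text)) %>%
  group_by(emotion) %>%
  summarise(avg_char = mean(n_char), .groups = "drop")

print(avg_length)
```

Because the mean is sensitive to a few very long tweets, adding `median(n_char)` to the same `summarise()` call is a cheap robustness check.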
2.3.2 Distribution of Tweet Lengths (Words)
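A distribution of word counts per emotion can be drawn as a boxplot. The following is a sketch under assumptions: the toy tweets, the `n_words` column, and the choice of a boxplot (rather than, say, a histogram or violin plot) are illustrative and may differ from the original figure.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data; the real analysis uses the full tweet dataset.
emotion_levels <- c("sadness", "joy", "love", "anger", "fear", "surprise")
df <- tibble(
  text = c("i feel so alone", "i miss you",
           "what a wonderful sunny morning", "so happy",
           "i am scared of the dark", "that noise frightened me"),
  emotion = factor(
    c("sadness", "sadness", "joy", "joy", "fear", "fear"),
    levels = emotion_levels
  )
) %>%
  # Words per tweet, splitting on runs of whitespace.
  mutate(n_words = lengths(strsplit(trimws(text), "\\s+")))

# One box per emotion showing the median and spread of word counts.
p_words <- ggplot(df, aes(x = emotion, y = n_words)) +
  geom_boxplot() +
  labs(x = "Emotion", y = "Words per tweet")

print(p_words)
```

A boxplot shows the median and spread side by side for each emotion, which complements the single average reported per emotion in the character-length table.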
In this section, I summarize the main patterns I observed:
Emotion Distribution: The count and proportion plots show that the dataset is not balanced across the six emotions. Joy and sadness together account for about 63% of the tweets, while surprise makes up less than 4%. This imbalance matters for any future modeling or evaluation.
Tweet Length: Tweets differ somewhat in length across emotions. Love tweets are the longest on average (about 105 characters) and sadness tweets the shortest (about 93), and the spread of lengths can also differ between emotions. This suggests that people may use more words to express some feelings than others.
Overall Insight: Even with a few simple visualizations, we can see meaningful patterns in how emotions are expressed on Twitter. These basic exploratory steps are useful before building any predictive models, because they highlight class imbalance and differences in how people write about each emotion.