Twitter Sentiment Analysis Data-set

This is an entity-level sentiment analysis data-set built from Twitter messages. Given a message and an entity, the task is to judge the sentiment the message expresses about that entity. The task uses three classes: Positive, Negative and Neutral. The raw Sentiment column also carries a fourth label, Irrelevant, for messages that are not relevant to the entity; these are regarded as Neutral.
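
The mapping of Irrelevant onto Neutral can be sketched as follows. This is a minimal illustration on a toy data frame, not the code used for this report; the column name `Sentiment3` is hypothetical, and the original `Sentiment` column is left untouched so the four-level plots further down still work.

```r
# Toy stand-in for the real data; column names follow the report.
tweets <- data.frame(
  ID        = c(1L, 1L, 2L, 3L),
  Topic     = c("Borderlands", "Borderlands", "Google", "Facebook"),
  Sentiment = c("Positive", "Irrelevant", "Negative", "Neutral"),
  Text      = c("love this game", "nice weather today",
                "search is broken", "new update is out"),
  stringsAsFactors = FALSE
)

# Hypothetical new column: fold "Irrelevant" into "Neutral"
# to obtain the three-class label described above.
tweets$Sentiment3 <- ifelse(tweets$Sentiment == "Irrelevant",
                            "Neutral", tweets$Sentiment)

table(tweets$Sentiment3)
```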

Dimension of the data-set

[1] 74682     4

Variables in the data-set

[1] "ID"        "Topic"     "Sentiment" "Text"     

Description of the variables in the dataset

'data.frame':   74682 obs. of  4 variables:
 $ ID       : int  2401 2401 2401 2401 2401 2401 2402 2402 2402 2402 ...
 $ Topic    : chr  "Borderlands" "Borderlands" "Borderlands" "Borderlands" ...
 $ Sentiment: chr  "Positive" "Positive" "Positive" "Positive" ...
 $ Text     : chr  "im getting on borderlands and i will murder you all ," "I am coming to the borders and I will kill you all," "im getting on borderlands and i will kill you all," "im coming on borderlands and i will murder you all," ...

Summary of the data-set

       ID           Topic            Sentiment             Text          
 Min.   :    1   Length:74682       Length:74682       Length:74682      
 1st Qu.: 3195   Class :character   Class :character   Class :character  
 Median : 6422   Mode  :character   Mode  :character   Mode  :character  
 Mean   : 6433                                                           
 3rd Qu.: 9601                                                           
 Max.   :13200                                                           
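
The dimension, variable names, structure and summary shown above are the kind of output produced by the base R calls below. The sketch runs on a small toy data frame with the same four columns; how the real file was read in is not shown in this report, so no file name is assumed here.

```r
# Toy stand-in with the same columns as the report's data frame.
tweets <- data.frame(
  ID        = c(2401L, 2401L, 2402L, 2403L),
  Topic     = c("Borderlands", "Borderlands", "Borderlands", "Google"),
  Sentiment = c("Positive", "Positive", "Neutral", "Negative"),
  Text      = c("first tweet", "second tweet", "third tweet", "fourth tweet"),
  stringsAsFactors = FALSE
)

dim(tweets)      # number of rows and columns
names(tweets)    # variable names
str(tweets)      # type and first values of each variable
summary(tweets)  # quartiles and mean for ID; length/class/mode for character columns
```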

Checking for the missing values

[1] 0
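
A single count like the one above is what `sum(is.na(...))` returns; that this was the call used here is an assumption. The sketch below also counts empty strings, which `is.na()` does not flag, since a tweet with blank text is arguably missing too. The toy data and variable names are illustrative only.

```r
# Toy data with one NA and one empty string in the Text column.
tweets <- data.frame(
  ID   = c(1L, 2L, 3L),
  Text = c("ok", "", NA),
  stringsAsFactors = FALSE
)

# Total number of NA cells in the whole data frame.
total_na <- sum(is.na(tweets))

# NA count per column, useful when the total is not zero.
na_per_column <- colSums(is.na(tweets))

# Empty or whitespace-only texts are not NA, so count them separately.
empty_text <- sum(!is.na(tweets$Text) & trimws(tweets$Text) == "")

total_na
na_per_column
empty_text
```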

Column diagram of tweet count per topic
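
A chart of this kind could be built roughly as below: count the tweets per topic, then draw one column per topic. The toy data, the descending ordering and the rotated axis labels are assumptions, not the report's actual settings; the plot is only drawn if ggplot2 is installed.

```r
# Toy topic column; the real data has many more topics and rows.
tweets <- data.frame(
  Topic = c("Borderlands", "Borderlands", "Google",
            "Facebook", "Google", "Borderlands"),
  stringsAsFactors = FALSE
)

# One row per topic with its tweet count.
topic_counts <- as.data.frame(table(Topic = tweets$Topic),
                              responseName = "Count",
                              stringsAsFactors = FALSE)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(topic_counts, aes(x = reorder(Topic, -Count), y = Count)) +
    geom_col() +
    labs(x = "Topic", y = "Number of tweets") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
  print(p)
}
```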

Donut chart for Sentiment (Positive, Neutral, Negative, Irrelevant)
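
One common way to draw a donut chart in ggplot2 is a stacked column in polar coordinates, with the x-axis limits widened to leave a hole in the middle. The sketch below shows that approach on toy labels; it is not necessarily how the chart in this report was produced.

```r
# Toy sentiment labels.
sentiment <- c("Positive", "Negative", "Negative", "Neutral",
               "Irrelevant", "Positive", "Negative", "Neutral")

# Count and percentage share per sentiment level.
share <- as.data.frame(table(Sentiment = sentiment), responseName = "Count")
share$Percent <- round(100 * share$Count / sum(share$Count), 1)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(share, aes(x = 2, y = Count, fill = Sentiment)) +
    geom_col(width = 1) +        # one stacked column at x = 2
    coord_polar(theta = "y") +   # bend the column into a ring
    xlim(0.5, 2.5) +             # empty space below x = 1.5 becomes the hole
    theme_void()
  print(p)
}
```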

Bar chart for different levels of sentiment (Positive, Neutral, Negative, Irrelevant) vs. different topics

Warning in geom_bar(aes(fill = Sentiment), stat = "identity", positive =
"dodge"): Ignoring unknown parameters: `positive`
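
The warning above arises because `geom_bar()` has no `positive` argument; the argument that places bars side by side is `position`. With the unknown argument ignored, the bars fall back to the default stacked layout. A corrected call, sketched on toy counts (the data frame `counts` is illustrative only):

```r
# Toy data: one row per tweet.
tweets <- data.frame(
  Topic     = c("Google", "Google", "Google", "Facebook", "Facebook", "Microsoft"),
  Sentiment = c("Positive", "Negative", "Negative", "Neutral", "Positive", "Irrelevant"),
  stringsAsFactors = FALSE
)

# Count tweets for every Topic x Sentiment combination.
counts <- as.data.frame(table(Topic = tweets$Topic, Sentiment = tweets$Sentiment),
                        responseName = "Count")

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(counts, aes(x = Topic, y = Count)) +
    # `position`, not `positive`: draws one bar per sentiment, side by side.
    geom_bar(aes(fill = Sentiment), stat = "identity", position = "dodge")
  print(p)
}
```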

Sentiment Distribution for ‘Google’, ‘Facebook’, ‘Microsoft’ on Twitter
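
Restricting the data to these three topics and comparing their sentiment mix could look like the sketch below. The toy data is illustrative, and using row-wise proportions rather than raw counts is an assumption about what makes the topics comparable.

```r
# Toy data with one topic outside the selection.
tweets <- data.frame(
  Topic     = c("Google", "Google", "Facebook", "Microsoft", "Borderlands", "Microsoft"),
  Sentiment = c("Positive", "Negative", "Neutral", "Negative", "Positive", "Irrelevant"),
  stringsAsFactors = FALSE
)

# Keep only the three topics of interest.
selected <- tweets[tweets$Topic %in% c("Google", "Facebook", "Microsoft"), ]

# Topic x Sentiment counts, then the share of each sentiment within a topic.
counts <- table(selected$Topic, selected$Sentiment)
shares <- prop.table(counts, margin = 1)

round(shares, 2)
```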

Histogram of Text Length in Tweets

This histogram visualizes the distribution of tweet lengths based on the number of characters.

The distribution is right-skewed, with the majority of tweets having shorter text lengths.

The highest frequency of tweets falls within the 0 to 100 character range, with over 20,000 tweets in this interval. This indicates that most tweets are concise.

As text length increases, the frequency of tweets decreases sharply. Very few tweets exceed 300 characters, and tweets with lengths approaching the maximum of 1,000 characters are extremely rare.
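
The text lengths behind such a histogram can be computed with `nchar()`. The sketch below uses three toy strings and 100-character bins; the bin width is an assumption chosen to match the "0 to 100" range mentioned above, not a setting taken from the report.

```r
# Toy tweets of different lengths.
text <- c("short",
          "a somewhat longer tweet about a game",
          strrep("x", 250))

# Number of characters per tweet.
text_length <- nchar(text)

# Bin the lengths into 100-character intervals without drawing.
h <- hist(text_length, breaks = seq(0, 300, by = 100), plot = FALSE)
h$counts

# Draw the histogram itself.
hist(text_length, breaks = seq(0, 300, by = 100),
     main = "Histogram of Text Length in Tweets",
     xlab = "Text length (characters)")
```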

Word Cloud

In a word cloud, the size of each word reflects its frequency: the larger the word, the more often it appears in the text.
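
The frequencies that drive the word sizes can be computed as in the sketch below. It tokenizes with base R instead of a text-mining package, uses a tiny hypothetical stop-word list, and only draws the cloud if the wordcloud package is installed; none of these choices is confirmed as the report's own preprocessing.

```r
# Toy tweets.
text <- c("This game is good",
          "good game, just like the old game",
          "Fix the game!")

# Lower-case, split on anything that is not a letter, drop empty tokens.
words <- unlist(strsplit(tolower(text), "[^a-z]+"))
words <- words[nchar(words) > 0]

# Hypothetical, deliberately tiny stop-word list.
stop_words <- c("this", "is", "the")
words <- words[!words %in% stop_words]

# Word frequencies, most frequent first.
freq <- sort(table(words), decreasing = TRUE)
head(freq)

if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(words = names(freq), freq = as.integer(freq),
                       min.freq = 1, max.words = 100, random.order = FALSE)
}
```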

Dominant Words: The largest words like “game,” “just,” “like,” “will,” and “good” are the most frequently mentioned in the dataset. This suggests that the tweets may be heavily focused on gaming-related discussions.

Sentiment and Topics: Words like “good,” “love,” and “great” suggest positive sentiment, while words like “fix,” “shit,” and “fucking” might indicate negative sentiment or frustration. The word “game” is central, which could imply that the primary topic of discussion is gaming.

Trends: The variety of words related to gaming, companies (e.g., “Verizon,” “Google,” “Amazon”), and social media engagement (e.g., “Facebook,” “Twitter”) indicates the topics that are trending or commonly discussed in the dataset.