Twitter Data Set

Purpose of dataset

To analyse different data points and patterns in the data produced on Twitter, using the number of likes, Klout score, retweets, reach, and reshares to gauge the popularity of a tweet. Twitter data has a wide range of uses; for example, it is used for studying market methodologies for Twitter campaigns and for identifying target audiences.
The columns in the dataset include:

TweetID: (string) A unique 64-bit unsigned integer. IDs are time-based rather than sequential: the full ID encodes a timestamp, a worker number, and a sequence number.
Weekday: (string) The day of the week the tweet was posted
Hour: (integer) The hour at which the tweet was posted
Day: (integer) The day of the month the tweet was posted
Language: (string) The language the tweet was written in
IsReshare: (boolean) TRUE if the tweet is a reshare of another tweet
Reach: (integer) The reach of the tweet
RetweetCount: (integer) The number of times the tweet was retweeted
Likes: (integer) The number of likes on the tweet
Klout: (integer) The Klout score of the posting user, a measure of how much attention the tweet got
Sentiment: (decimal) A rating of the sentiment the tweet evoked, ranging from -4 (negative) to 4 (positive)
Text: (string) The content of the tweet
LocationID: (integer) An ID for the location from which the tweet was posted
UserID: (string) The ID of the user who posted the tweet
Country: (string) The country the tweet was posted from
Gender: (string) The gender of the user, where known
#Loading file
#mpg2 <- read_delim("./TwitterData(1).csv", delim = ",") #alternative readr-based loader (unused)
#load tidyverse
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
Twitter_Data <- read.csv("TwitterData.csv")
head(Twitter_Data)
## TweetID Weekday Hour Day Lang IsReshare Reach RetweetCount
## 1 tw-682712873332805633 Thursday 17 31 en FALSE 44 0
## 2 tw-682713045357998080 Thursday 17 31 en TRUE 1810 5
## 3 tw-682713219375476736 Thursday 17 31 en FALSE 282 0
## 4 tw-682713436967579648 Thursday 17 31 en FALSE 2087 4
## 5 tw-682714048199311366 Thursday 17 31 en FALSE 953 0
## 6 tw-682714583044243456 Thursday 17 31 en FALSE 113 0
## Likes Klout Sentiment
## 1 0 35 0
## 2 0 53 2
## 3 0 47 0
## 4 0 53 0
## 5 0 47 0
## 6 0 31 -1
## text
## 1 We are hiring: Senior Software Engineer - Proto http://www.reqcloud.com/jobs/719865/?k=0LaPxXuFwczs1e32ZURJKrgCIDMQtRO7BquFSQthUKY&utm_source=twitter&utm_campaign=reqCloud_JobPost #job @awscloud #job #protocol #networking #aws #mediastreaming
## 2 RT @CodeMineStatus: This is true Amazon Web Services https://aws.amazon.com/ #php #html #html5 #css #webdesign #seo #java #javascript htt
## 3 Devops Engineer Aws Ansible Cassandra Mysql Ubuntu Ruby On Rails Jobs in Austin TX #Austin #TX #jobs #jobsearch https://www.jobfindly.com/devops-engineer-aws-ansible-cassandra-mysql-ubuntu-ruby-on-rails-jobs-austin-tx.html
## 4 Happy New Year to all those AWS instances of ours!
## 5 Amazon is hiring! #Sr. #International Tax Manager - AWS in #Seattle apply now! #jobs http://neuvoo.com/job.php?id=dsvkrujig3&source=twitter&lang=en&client_id=658&l=Seattle%20Washington%20US&k=Sr.%20International%20Tax%20Manager%20-%20AWS http://twitter.com/NeuvooAccSea/status/682714048199311366/photo/1
## 6 #AWS bc of per-region limits test/prod should be in isolated regions. else dev could impact prod #lambda & beyond http://docs.aws.amazon.com/lambda/latest/dg/limits.html#limits-safety-throttles
## LocationID UserID Country Gender
## 1 3751 tw-40932430 Albania Unknown
## 2 3989 tw-3179389829 Albania Male
## 3 3741 tw-4624808414 Algeria Male
## 4 3753 tw-356447127 Algeria Unknown
## 5 3751 tw-3172686669 Algeria Male
## 6 3744 tw-14502901 Algeria Male
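To confirm that the parsed column types match the descriptions above, a quick check (a sketch using dplyr's glimpse(), which is attached with the tidyverse):

#Compact overview of every column name and its parsed type
glimpse(Twitter_Data)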
summary(Twitter_Data)
## TweetID Weekday Hour Day
## Length:100001 Length:100001 Min. : 0.00 Min. : 1.00
## Class :character Class :character 1st Qu.: 7.00 1st Qu.: 5.00
## Mode :character Mode :character Median :11.00 Median : 7.00
## Mean :11.46 Mean : 7.97
## 3rd Qu.:16.00 3rd Qu.:11.00
## Max. :23.00 Max. :31.00
## NA's :89999 NA's :89999
## Lang IsReshare Reach RetweetCount
## Length:100001 Mode :logical Min. : 0 Min. : 0.00
## Class :character FALSE:6655 1st Qu.: 146 1st Qu.: 0.00
## Mode :character TRUE :3347 Median : 421 Median : 0.00
## NA's :89999 Mean : 6713 Mean : 6.44
## 3rd Qu.: 1375 3rd Qu.: 2.00
## Max. :1530046 Max. :620.00
## NA's :89999 NA's :89999
## Likes Klout Sentiment text
## Min. : 0.00 Min. : 0.00 Min. :-4.00 Length:100001
## 1st Qu.: 0.00 1st Qu.:33.00 1st Qu.: 0.00 Class :character
## Median : 0.00 Median :42.00 Median : 0.00 Mode :character
## Mean : 0.18 Mean :40.32 Mean : 0.29
## 3rd Qu.: 0.00 3rd Qu.:48.00 3rd Qu.: 0.00
## Max. :131.00 Max. :89.00 Max. : 5.00
## NA's :89999 NA's :89999 NA's :89999
## LocationID UserID Country Gender
## Min. : 2 Length:100001 Length:100001 Length:100001
## 1st Qu.:1651 Class :character Class :character Class :character
## Median :3744 Mode :character Mode :character Mode :character
## Mean :2908
## 3rd Qu.:3776
## Max. :6288
## NA's :89999
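The summary shows 89,999 NA's in every numeric column, so only about 10,000 of the 100,001 rows carry data. A hedged cleanup step before aggregating, assuming the all-NA rows are empty records rather than real observations (Twitter_Clean is a new name introduced here):

#tidyr::drop_na() removes any row with an NA in the listed columns
Twitter_Clean <- Twitter_Data |>
  drop_na(Hour, Day, Reach, RetweetCount, Likes, Klout, Sentiment)
nrow(Twitter_Clean) #about 10,002 rows if the NA's all fall on the same rows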
#Sum of RetweetCount, grouped by Likes
print(aggregate(Twitter_Data$RetweetCount, list(Twitter_Data$Likes), FUN=sum))
## Group.1 x
## 1 0 62890
## 2 1 2
## 3 4 3
## 4 7 7
## 5 8 12
## 6 9 3
## 7 10 5
## 8 11 16
## 9 14 17
## 10 15 21
## 11 17 32
## 12 18 21
## 13 19 28
## 14 21 16
## 15 22 23
## 16 23 13
## 17 25 31
## 18 26 82
## 19 27 38
## 20 28 27
## 21 30 45
## 22 32 64
## 23 33 18
## 24 34 62
## 25 35 22
## 26 37 31
## 27 40 74
## 28 43 15
## 29 44 23
## 30 45 47
## 31 48 33
## 32 49 97
## 33 50 31
## 34 56 54
## 35 68 60
## 36 69 78
## 37 70 69
## 38 72 61
## 39 77 64
## 40 96 44
## 41 116 73
## 42 131 85
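The same grouped sum can be written with dplyr verbs, which keeps the NA handling explicit (a sketch equivalent to the aggregate() call above, not part of the original output):

#Group the tweets by their number of likes and total the retweets per group
Twitter_Data |>
  drop_na(Likes, RetweetCount) |>
  group_by(Likes) |>
  summarise(TotalRetweets = sum(RetweetCount))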
#Minimum Reach, grouped by RetweetCount
print(aggregate(Twitter_Data$Reach, list(Twitter_Data$RetweetCount), FUN=min))
## Group.1 x
## 1 0 0
## 2 1 0
## 3 2 3
## 4 3 4
## 5 4 6
## 6 5 5
## 7 6 2
## 8 7 15
## 9 8 9
## 10 9 5
## 11 10 19
## 12 11 17
## 13 12 14
## 14 13 9
## 15 14 50
## 16 15 9
## 17 16 24
## 18 17 24
## 19 18 51
## 20 19 11
## 21 20 37
## 22 21 25
## 23 22 24
## 24 23 9
## 25 24 4
## 26 25 9
## 27 26 43
## 28 27 11
## 29 28 11
## 30 29 51
## 31 30 4
## 32 31 15
## 33 32 11
## 34 33 24
## 35 34 0
## 36 35 24
## 37 36 15
## 38 37 49
## 39 38 56
## 40 39 6
## 41 40 25
## 42 41 54
## 43 42 32
## 44 43 2
## 45 44 6
## 46 45 16
## 47 46 41
## 48 47 25
## 49 48 135
## 50 49 91
## 51 50 9
## 52 51 14
## 53 52 43
## 54 53 9
## 55 54 75
## 56 55 21
## 57 56 48
## 58 57 26
## 59 58 34
## 60 59 32
## 61 60 43
## 62 61 9
## 63 62 69
## 64 63 97
## 65 64 53
## 66 65 35
## 67 66 24
## 68 67 118
## 69 68 67
## 70 69 23
## 71 70 51
## 72 71 15
## 73 72 32
## 74 73 41
## 75 74 134
## 76 75 365
## 77 76 17
## 78 77 179
## 79 78 30
## 80 79 43
## 81 80 26
## 82 81 115
## 83 82 224
## 84 84 143
## 85 85 86
## 86 86 6
## 87 87 1611
## 88 88 54
## 89 89 56
## 90 90 204
## 91 91 53
## 92 92 330
## 93 93 73
## 94 94 131
## 95 95 31
## 96 96 15
## 97 98 24
## 98 99 88
## 99 101 2451
## 100 102 15
## 101 103 46
## 102 104 48
## 103 105 207
## 104 106 108
## 105 108 1817
## 106 109 322
## 107 110 334
## 108 111 355
## 109 113 1831
## 110 117 209
## 111 118 1364
## 112 119 161
## 113 120 101
## 114 121 38731
## 115 124 155
## 116 125 405
## 117 126 161
## 118 129 1006
## 119 130 42
## 120 131 504
## 121 133 4017
## 122 136 283
## 123 138 74
## 124 141 1040
## 125 142 46
## 126 145 436
## 127 148 690
## 128 149 167
## 129 151 3
## 130 155 8
## 131 156 288
## 132 157 556
## 133 160 240
## 134 162 72
## 135 163 221
## 136 165 584
## 137 172 658
## 138 177 16
## 139 187 412
## 140 189 54
## 141 195 207
## 142 196 14071
## 143 197 1860
## 144 199 1061
## 145 201 5361
## 146 206 72
## 147 211 206
## 148 217 2739
## 149 221 85
## 150 223 534
## 151 224 2718
## 152 225 1251
## 153 226 36
## 154 228 190
## 155 229 1504
## 156 230 2
## 157 236 53
## 158 246 1161
## 159 249 392
## 160 250 447155
## 161 251 2492
## 162 359 447244
## 163 366 447244
## 164 375 447244
## 165 409 130
## 166 620 447244
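The grouped pattern also extends to several summaries at once; for example, the minimum and median Reach plus a tweet count per RetweetCount value in a single table (a dplyr sketch; the column names MinReach, MedianReach, and Tweets are introduced here):

#One grouped table with several summary statistics per RetweetCount value
Twitter_Data |>
  drop_na(Reach, RetweetCount) |>
  group_by(RetweetCount) |>
  summarise(MinReach = min(Reach),
            MedianReach = median(Reach),
            Tweets = n())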
#Maximum Likes, grouped by Klout score
print(aggregate(Twitter_Data$Likes, list(Twitter_Data$Klout), FUN=max))
## Group.1 x
## 1 0 0
## 2 10 0
## 3 11 0
## 4 12 0
## 5 13 0
## 6 14 0
## 7 15 0
## 8 16 0
## 9 17 0
## 10 18 0
## 11 19 0
## 12 20 0
## 13 21 0
## 14 22 0
## 15 23 0
## 16 24 0
## 17 25 0
## 18 26 0
## 19 27 0
## 20 28 0
## 21 29 0
## 22 30 0
## 23 31 0
## 24 32 0
## 25 33 0
## 26 34 0
## 27 35 0
## 28 36 0
## 29 37 0
## 30 38 0
## 31 39 0
## 32 40 0
## 33 41 0
## 34 42 0
## 35 43 0
## 36 44 0
## 37 45 0
## 38 46 0
## 39 47 0
## 40 48 0
## 41 49 0
## 42 50 0
## 43 51 0
## 44 52 0
## 45 53 0
## 46 54 0
## 47 55 0
## 48 56 0
## 49 57 0
## 50 58 0
## 51 59 0
## 52 60 0
## 53 61 0
## 54 62 0
## 55 63 0
## 56 64 4
## 57 65 0
## 58 66 0
## 59 67 0
## 60 68 0
## 61 69 0
## 62 70 0
## 63 71 116
## 64 72 131
## 65 75 0
## 66 77 0
## 67 78 0
## 68 79 0
## 69 80 0
## 70 81 0
## 71 82 0
## 72 83 0
## 73 88 0
## 74 89 0
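Since the maximum number of likes is zero for almost every Klout score, a single correlation coefficient may be a more direct check of whether Klout tracks tweet popularity (a sketch; use = "complete.obs" drops the NA rows before computing):

#Pearson correlation between Likes and Klout over the complete cases
cor(Twitter_Data$Likes, Twitter_Data$Klout, use = "complete.obs")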
#Bar plot of RetweetCount by Weekday
Twitter_Data |>
  ggplot(mapping = aes(x = Weekday, color = Weekday, fill = RetweetCount)) +
  geom_bar()
## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
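The warning above means the fill = RetweetCount aesthetic was dropped because geom_bar() only counts rows. Assuming the intent was to show total retweets per weekday, a hedged rewrite maps RetweetCount to y and uses geom_col(), which sums the y values within each bar:

#Total retweets per weekday; geom_col() sums RetweetCount within each bar
Twitter_Data |>
  drop_na(RetweetCount) |>
  ggplot(mapping = aes(x = Weekday, y = RetweetCount, fill = Weekday)) +
  geom_col()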
#Scatterplot for Day and Likes
Twitter_Data |>
  ggplot() +
  geom_point(mapping = aes(x = Day, y = Likes)) +
  theme_classic()
## Warning: Removed 89999 rows containing missing values (`geom_point()`).
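Day and Likes are both integers, so the points stack on an exact grid. A hedged variant (the jitter width and transparency values are illustrative) spreads the points slightly to make the overplotting visible:

#Jitter and transparency reveal how many points share each grid position
Twitter_Data |>
  drop_na(Day, Likes) |>
  ggplot() +
  geom_jitter(mapping = aes(x = Day, y = Likes), width = 0.3, alpha = 0.3) +
  theme_classic()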
#Histogram of the Hour-to-Day ratio
mean_ratio <- mean(Twitter_Data$Hour / Twitter_Data$Day, na.rm = TRUE) #na.rm: most rows are NA, so the mean is NA without it
Twitter_Data |>
mutate(Hour_to_Day = Hour / Day) |>
ggplot() +
geom_histogram(mapping = aes(x = Hour_to_Day), color = 'white') +
geom_vline(xintercept = mean_ratio, color = 'red') +
annotate("text", # the type of annotation
x = 1.425, y = 24.5, label = "Average", color = 'red') +
theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 89999 rows containing non-finite values (`stat_bin()`).