Twitter Data Set

Purpose of dataset

To analyse different data points and patterns in the data produced on Twitter, using the number of likes, Klout score, retweets, reach, and reshares to gauge the popularity of a tweet. Twitter data has a wide range of uses; for example, it is used for studying market methodologies for Twitter campaigns and for identifying target audiences.
The columns in the dataset include:

TweetID: (string) A unique 64-bit unsigned integer. IDs are time-based rather than sequential: the full ID encodes a timestamp, a worker number, and a sequence number.
Weekday: (string) The day of the week the tweet was posted
Hour: (integer) The hour at which the tweet was posted
Day: (integer) The day of the month the tweet was posted
Language: (string) The language the tweet was written in
IsReshare: (boolean) TRUE if the tweet is a reshare of another tweet
Reach: (integer) The reach of the tweet
RetweetCount: (integer) The number of times the tweet was retweeted
Likes: (integer) The number of likes on the tweet
Klout: (integer) The Klout score of the posting user, a measure of how much attention the tweet got
Sentiment: (decimal) A rating of the sentiment the tweet evoked, ranging from -4 (negative) to 4 (positive)
Text: (string) The content of the tweet
LocationID: (integer) An ID for the location from which the tweet was posted
UserID: (string) The ID of the user who posted the tweet
Country: (string) The country the tweet was posted from
Gender: (string) The gender of the user, where known
#Loading file
#mpg2 <- read_delim("./TwitterData(1).csv", delim = ",") #alternative readr-based loader (unused)
#load tidyverse
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
Twitter_Data <- read.csv("TwitterData.csv")
head(Twitter_Data)
## TweetID Weekday Hour Day Lang IsReshare Reach RetweetCount
## 1 tw-682712873332805633 Thursday 17 31 en FALSE 44 0
## 2 tw-682713045357998080 Thursday 17 31 en TRUE 1810 5
## 3 tw-682713219375476736 Thursday 17 31 en FALSE 282 0
## 4 tw-682713436967579648 Thursday 17 31 en FALSE 2087 4
## 5 tw-682714048199311366 Thursday 17 31 en FALSE 953 0
## 6 tw-682714583044243456 Thursday 17 31 en FALSE 113 0
## Likes Klout Sentiment
## 1 0 35 0
## 2 0 53 2
## 3 0 47 0
## 4 0 53 0
## 5 0 47 0
## 6 0 31 -1
## text
## 1 We are hiring: Senior Software Engineer - Proto http://www.reqcloud.com/jobs/719865/?k=0LaPxXuFwczs1e32ZURJKrgCIDMQtRO7BquFSQthUKY&utm_source=twitter&utm_campaign=reqCloud_JobPost #job @awscloud #job #protocol #networking #aws #mediastreaming
## 2 RT @CodeMineStatus: This is true Amazon Web Services https://aws.amazon.com/ #php #html #html5 #css #webdesign #seo #java #javascript htt
## 3 Devops Engineer Aws Ansible Cassandra Mysql Ubuntu Ruby On Rails Jobs in Austin TX #Austin #TX #jobs #jobsearch https://www.jobfindly.com/devops-engineer-aws-ansible-cassandra-mysql-ubuntu-ruby-on-rails-jobs-austin-tx.html
## 4 Happy New Year to all those AWS instances of ours!
## 5 Amazon is hiring! #Sr. #International Tax Manager - AWS in #Seattle apply now! #jobs http://neuvoo.com/job.php?id=dsvkrujig3&source=twitter&lang=en&client_id=658&l=Seattle%20Washington%20US&k=Sr.%20International%20Tax%20Manager%20-%20AWS http://twitter.com/NeuvooAccSea/status/682714048199311366/photo/1
## 6 #AWS bc of per-region limits test/prod should be in isolated regions. else dev could impact prod #lambda & beyond http://docs.aws.amazon.com/lambda/latest/dg/limits.html#limits-safety-throttles
## LocationID UserID Country Gender
## 1 3751 tw-40932430 Albania Unknown
## 2 3989 tw-3179389829 Albania Male
## 3 3741 tw-4624808414 Algeria Male
## 4 3753 tw-356447127 Algeria Unknown
## 5 3751 tw-3172686669 Algeria Male
## 6 3744 tw-14502901 Algeria Male
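To confirm that the parsed column types match the descriptions above, a quick check (a sketch using dplyr's glimpse(), which is attached with the tidyverse):

#Compact overview of every column name and its parsed type
glimpse(Twitter_Data)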
summary(Twitter_Data)
## TweetID Weekday Hour Day
## Length:100001 Length:100001 Min. : 0.00 Min. : 1.00
## Class :character Class :character 1st Qu.: 7.00 1st Qu.: 5.00
## Mode :character Mode :character Median :11.00 Median : 7.00
## Mean :11.46 Mean : 7.97
## 3rd Qu.:16.00 3rd Qu.:11.00
## Max. :23.00 Max. :31.00
## NA's :89999 NA's :89999
## Lang IsReshare Reach RetweetCount
## Length:100001 Mode :logical Min. : 0 Min. : 0.00
## Class :character FALSE:6655 1st Qu.: 146 1st Qu.: 0.00
## Mode :character TRUE :3347 Median : 421 Median : 0.00
## NA's :89999 Mean : 6713 Mean : 6.44
## 3rd Qu.: 1375 3rd Qu.: 2.00
## Max. :1530046 Max. :620.00
## NA's :89999 NA's :89999
## Likes Klout Sentiment text
## Min. : 0.00 Min. : 0.00 Min. :-4.00 Length:100001
## 1st Qu.: 0.00 1st Qu.:33.00 1st Qu.: 0.00 Class :character
## Median : 0.00 Median :42.00 Median : 0.00 Mode :character
## Mean : 0.18 Mean :40.32 Mean : 0.29
## 3rd Qu.: 0.00 3rd Qu.:48.00 3rd Qu.: 0.00
## Max. :131.00 Max. :89.00 Max. : 5.00
## NA's :89999 NA's :89999 NA's :89999
## LocationID UserID Country Gender
## Min. : 2 Length:100001 Length:100001 Length:100001
## 1st Qu.:1651 Class :character Class :character Class :character
## Median :3744 Mode :character Mode :character Mode :character
## Mean :2908
## 3rd Qu.:3776
## Max. :6288
## NA's :89999
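The summary shows 89,999 NA's in every numeric column, so only about 10,000 of the 100,001 rows carry data. A hedged cleanup step before aggregating, assuming the all-NA rows are empty records rather than real observations (Twitter_Clean is a new name introduced here):

#tidyr::drop_na() removes any row with an NA in the listed columns
Twitter_Clean <- Twitter_Data |>
  drop_na(Hour, Day, Reach, RetweetCount, Likes, Klout, Sentiment)
nrow(Twitter_Clean) #about 10,002 rows if the NA's all fall on the same rows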
#Sum of RetweetCount, grouped by Likes
print(aggregate(Twitter_Data$RetweetCount, list(Twitter_Data$Likes), FUN=sum))
## Group.1 x
## 1 0 62890
## 2 1 2
## 3 4 3
## 4 7 7
## 5 8 12
## 6 9 3
## 7 10 5
## 8 11 16
## 9 14 17
## 10 15 21
## 11 17 32
## 12 18 21
## 13 19 28
## 14 21 16
## 15 22 23
## 16 23 13
## 17 25 31
## 18 26 82
## 19 27 38
## 20 28 27
## 21 30 45
## 22 32 64
## 23 33 18
## 24 34 62
## 25 35 22
## 26 37 31
## 27 40 74
## 28 43 15
## 29 44 23
## 30 45 47
## 31 48 33
## 32 49 97
## 33 50 31
## 34 56 54
## 35 68 60
## 36 69 78
## 37 70 69
## 38 72 61
## 39 77 64
## 40 96 44
## 41 116 73
## 42 131 85
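The same grouped sum can be written with dplyr verbs, which keeps the NA handling explicit (a sketch equivalent to the aggregate() call above, not part of the original output):

#Group the tweets by their number of likes and total the retweets per group
Twitter_Data |>
  drop_na(Likes, RetweetCount) |>
  group_by(Likes) |>
  summarise(TotalRetweets = sum(RetweetCount))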
#Minimum Reach, grouped by RetweetCount
print(aggregate(Twitter_Data$Reach, list(Twitter_Data$RetweetCount), FUN=min))
## Group.1 x
## 1 0 0
## 2 1 0
## 3 2 3
## 4 3 4
## 5 4 6
## 6 5 5
## 7 6 2
## 8 7 15
## 9 8 9
## 10 9 5
## 11 10 19
## 12 11 17
## 13 12 14
## 14 13 9
## 15 14 50
## 16 15 9
## 17 16 24
## 18 17 24
## 19 18 51
## 20 19 11
## 21 20 37
## 22 21 25
## 23 22 24
## 24 23 9
## 25 24 4
## 26 25 9
## 27 26 43
## 28 27 11
## 29 28 11
## 30 29 51
## 31 30 4
## 32 31 15
## 33 32 11
## 34 33 24
## 35 34 0
## 36 35 24
## 37 36 15
## 38 37 49
## 39 38 56
## 40 39 6
## 41 40 25
## 42 41 54
## 43 42 32
## 44 43 2
## 45 44 6
## 46 45 16
## 47 46 41
## 48 47 25
## 49 48 135
## 50 49 91
## 51 50 9
## 52 51 14
## 53 52 43
## 54 53 9
## 55 54 75
## 56 55 21
## 57 56 48
## 58 57 26
## 59 58 34
## 60 59 32
## 61 60 43
## 62 61 9
## 63 62 69
## 64 63 97
## 65 64 53
## 66 65 35
## 67 66 24
## 68 67 118
## 69 68 67
## 70 69 23
## 71 70 51
## 72 71 15
## 73 72 32
## 74 73 41
## 75 74 134
## 76 75 365
## 77 76 17
## 78 77 179
## 79 78 30
## 80 79 43
## 81 80 26
## 82 81 115
## 83 82 224
## 84 84 143
## 85 85 86
## 86 86 6
## 87 87 1611
## 88 88 54
## 89 89 56
## 90 90 204
## 91 91 53
## 92 92 330
## 93 93 73
## 94 94 131
## 95 95 31
## 96 96 15
## 97 98 24
## 98 99 88
## 99 101 2451
## 100 102 15
## 101 103 46
## 102 104 48
## 103 105 207
## 104 106 108
## 105 108 1817
## 106 109 322
## 107 110 334
## 108 111 355
## 109 113 1831
## 110 117 209
## 111 118 1364
## 112 119 161
## 113 120 101
## 114 121 38731
## 115 124 155
## 116 125 405
## 117 126 161
## 118 129 1006
## 119 130 42
## 120 131 504
## 121 133 4017
## 122 136 283
## 123 138 74
## 124 141 1040
## 125 142 46
## 126 145 436
## 127 148 690
## 128 149 167
## 129 151 3
## 130 155 8
## 131 156 288
## 132 157 556
## 133 160 240
## 134 162 72
## 135 163 221
## 136 165 584
## 137 172 658
## 138 177 16
## 139 187 412
## 140 189 54
## 141 195 207
## 142 196 14071
## 143 197 1860
## 144 199 1061
## 145 201 5361
## 146 206 72
## 147 211 206
## 148 217 2739
## 149 221 85
## 150 223 534
## 151 224 2718
## 152 225 1251
## 153 226 36
## 154 228 190
## 155 229 1504
## 156 230 2
## 157 236 53
## 158 246 1161
## 159 249 392
## 160 250 447155
## 161 251 2492
## 162 359 447244
## 163 366 447244
## 164 375 447244
## 165 409 130
## 166 620 447244
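The grouped pattern also extends to several summaries at once; for example, the minimum and median Reach plus a tweet count per RetweetCount value in a single table (a dplyr sketch; the column names MinReach, MedianReach, and Tweets are introduced here):

#One grouped table with several summary statistics per RetweetCount value
Twitter_Data |>
  drop_na(Reach, RetweetCount) |>
  group_by(RetweetCount) |>
  summarise(MinReach = min(Reach),
            MedianReach = median(Reach),
            Tweets = n())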
#Maximum Likes, grouped by Klout score
print(aggregate(Twitter_Data$Likes, list(Twitter_Data$Klout), FUN=max))
## Group.1 x
## 1 0 0
## 2 10 0
## 3 11 0
## 4 12 0
## 5 13 0
## 6 14 0
## 7 15 0
## 8 16 0
## 9 17 0
## 10 18 0
## 11 19 0
## 12 20 0
## 13 21 0
## 14 22 0
## 15 23 0
## 16 24 0
## 17 25 0
## 18 26 0
## 19 27 0
## 20 28 0
## 21 29 0
## 22 30 0
## 23 31 0
## 24 32 0
## 25 33 0
## 26 34 0
## 27 35 0
## 28 36 0
## 29 37 0
## 30 38 0
## 31 39 0
## 32 40 0
## 33 41 0
## 34 42 0
## 35 43 0
## 36 44 0
## 37 45 0
## 38 46 0
## 39 47 0
## 40 48 0
## 41 49 0
## 42 50 0
## 43 51 0
## 44 52 0
## 45 53 0
## 46 54 0
## 47 55 0
## 48 56 0
## 49 57 0
## 50 58 0
## 51 59 0
## 52 60 0
## 53 61 0
## 54 62 0
## 55 63 0
## 56 64 4
## 57 65 0
## 58 66 0
## 59 67 0
## 60 68 0
## 61 69 0
## 62 70 0
## 63 71 116
## 64 72 131
## 65 75 0
## 66 77 0
## 67 78 0
## 68 79 0
## 69 80 0
## 70 81 0
## 71 82 0
## 72 83 0
## 73 88 0
## 74 89 0
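Since the maximum number of likes is zero for almost every Klout score, a single correlation coefficient may be a more direct check of whether Klout tracks tweet popularity (a sketch; use = "complete.obs" drops the NA rows before computing):

#Pearson correlation between Likes and Klout over the complete cases
cor(Twitter_Data$Likes, Twitter_Data$Klout, use = "complete.obs")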
#Bar plot of RetweetCount by Weekday
Twitter_Data |>
  ggplot(mapping = aes(x = Weekday, color = Weekday, fill = RetweetCount)) +
  geom_bar()
## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
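The warning above means the fill = RetweetCount aesthetic was dropped because geom_bar() only counts rows. Assuming the intent was to show total retweets per weekday, a hedged rewrite maps RetweetCount to y and uses geom_col(), which sums the y values within each bar:

#Total retweets per weekday; geom_col() sums RetweetCount within each bar
Twitter_Data |>
  drop_na(RetweetCount) |>
  ggplot(mapping = aes(x = Weekday, y = RetweetCount, fill = Weekday)) +
  geom_col()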
#Scatterplot for Day and Likes
Twitter_Data |>
  ggplot() +
  geom_point(mapping = aes(x = Day, y = Likes)) +
  theme_classic()
## Warning: Removed 89999 rows containing missing values (`geom_point()`).
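Day and Likes are both integers, so the points stack on an exact grid. A hedged variant (the jitter width and transparency values are illustrative) spreads the points slightly to make the overplotting visible:

#Jitter and transparency reveal how many points share each grid position
Twitter_Data |>
  drop_na(Day, Likes) |>
  ggplot() +
  geom_jitter(mapping = aes(x = Day, y = Likes), width = 0.3, alpha = 0.3) +
  theme_classic()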
#Histogram of the Hour-to-Day ratio
mean_ratio <- mean(Twitter_Data$Hour / Twitter_Data$Day, na.rm = TRUE) #na.rm: most rows are NA, so the mean is NA without it
Twitter_Data |>
mutate(Hour_to_Day = Hour / Day) |>
ggplot() +
geom_histogram(mapping = aes(x = Hour_to_Day), color = 'white') +
geom_vline(xintercept = mean_ratio, color = 'red') +
annotate("text", # the type of annotation
x = 1.425, y = 24.5, label = "Average", color = 'red') +
theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 89999 rows containing non-finite values (`stat_bin()`).