Tweets Clean-up and analysis

This is the second part of the project, where we will get the data from the data lake and clean it, store into MySQL, and perform the data analysis on the tweets text.

Pre-requisites:

  1. To run this code, MySQL must be installed in the local computer.
  2. The MySQL script given at this GitHub location must be run, so that the required schema and the table has been created before running this RMD File: https://raw.githubusercontent.com/charlsjoseph/CUNY-Data607/master/Data607-Final_Project/Data%20607%20Final%20Project%20-%20Twitter%20MySQL%20Queries.sql
  3. All teh above libraries (R packages) must be installed on the local.

Read the tweets data from the CSV file which has been created from part-1 from the data lake which is MongoDB database for this project. CSV link is: https://raw.githubusercontent.com/charlsjoseph/CUNY-Data607/master/Data607-Final_Project/data/ExtractedData_iphone_xs.csv

Read CSV (Tweets data from staging database)

iphone_xs_csv_link <- "https://raw.githubusercontent.com/charlsjoseph/CUNY-Data607/master/Data607-Final_Project/data/ExtractedData_iphone_xs.csv"

iphone_xs_tweets_required <- read.csv(iphone_xs_csv_link, header = TRUE)

dim(iphone_xs_tweets_required)
## [1] 16401     8
iphone_xs_tweets_required %>% head(5) %>% kable() %>% kable_styling()
X_id user_id created_at screen_name text source location followers_count
x5c0aa99c7fab604870013be0 x2890259085 2018-12-05 14:03:49 EmmiSilvas_ i entered @grav3yardgirl &amp; @casetify iPhone XS MAX GIVEAWAY! CHECK IT OUT HERE! https://t.co/sMHtlMvWf1 #grav3yardgirliPhonegiveaway Twitter for iPhone 85
x5c0aa99c7fab604870013be1 x2890259085 2018-12-05 14:03:42 EmmiSilvas_ i entered @grav3yardgirl &amp; @casetify iPhone XS MAX GIVEAWAY! CHECK IT OUT HERE! https://t.co/sMHtlMvWf1 #grav3yardgirliPhonegiveaway Twitter for iPhone 85
x5c0aa99c7fab604870013be2 x2647272271 2018-12-05 14:03:37 JumiaGhana 5 HOURS TO GO! Get ready, the iPhone Xs 65GB will be on TREASURE HUNT @ 12 MIDNIGHT FOR MIND-BLOWING GH¢ 78 ONLY! Don’t sleep! You can’t afford to pass this up! NOTE: <U+0001F6D1>ONLY ONE ITEM AVAILABLE TO BE WON Go Here&gt;&gt; https://t.co/RD085Zynb8 https://t.co/pS7VXsdLzX Twitter Web Client Ghana 4115
x5c0aa99c7fab604870013be3 x22846501 2018-12-05 14:03:02 GhostBishopTV @ufallb4me iPhone XS Max Twitter for iPhone 273
x5c0aa99c7fab604870013be4 x16641674 2018-12-05 14:02:02 edelagrave “After an investigation the ASA found that the statement was fair” https://t.co/aYvITtSD5g Twitter Web Client Montréal 776

Counts per user

Clean-up of the tweets to keep only the tweets data which needs to be reviewed for this analysis. Here we are grouping the tweet messages based on the user ID (screen name) and this will be used further to remove the advertizements.

##Now, grouping the tweets base don the screen_name
iphone_xs_tweets_required_group_user <- iphone_xs_tweets_required %>% group_by(screen_name)
per_user <- summarise(iphone_xs_tweets_required_group_user, number_of_messages=n())

# Arranging the tweets count in the descending order, so that the user (screen_name) with the highest count is on the top.
per_user <- arrange(per_user, desc(number_of_messages))

head(per_user)
## # A tibble: 6 x 2
##   screen_name     number_of_messages
##   <fct>                        <int>
## 1 whitestonedome                 180
## 2 marciojmsilva                   81
## 3 TopCellularDeal                 73
## 4 techzonestation                 56
## 5 allsorce_                       51
## 6 Galaxyukebay                    46

Attempt to remove ads

Now as we see above, there are close to 300 users on the top which have a bigger numer of tweets on the same topic - iphone xs. So, to lessen the occurences or impact of the advertizements in our analysis, we are removing the users from our analysis who have 5 or more tweets on this topic.

per_user <- per_user %>% filter(number_of_messages < 5)

## Joining the 2 data.frames, and getting the messages only for the users with screen names having 4 or fewer messages count
iphone_xs_tweets_required_users <- merge(iphone_xs_tweets_required, per_user, by="screen_name")

dim(iphone_xs_tweets_required_users)
## [1] 13515    10
##using only the text from the data.frame. The remaining fields will not be required, for further steps. Hence droping them.
iphone_xs_tweets_required_text <- iphone_xs_tweets_required_users %>% select(5)

iphone_xs_tweets_required_text %>% head(4) %>% kable() %>% kable_styling()
text
@iammrdom @ChucklefishLTD Same, on iPhone XS Max
@Cult0fPing I’m having the same issue here. There’s $250 left on my iPhone 7 contract. I’m getting the XS for Christmas and have to trade mine in to be able to afford it. Not sure of the exact value they’re giving me but it’s around the same, and I’m still having to pay $500 for the XS.
I liked a @YouTube video https://t.co/JmbjLyAUV7 What’s On My iPhone XS Max?!
I liked a @YouTube video https://t.co/XBCrvfM579 WHAT’S ON MY IPHONE XS MAX + GIVEAWAY!!! | DIMMA UMEH

Storing the text of all the required tweets in a MySQL database. This will help easy fetch from the database whenever we need it for our analysis. Initial few code lines are to set up the MySQL schema and table structure so the tweet text can be written to MySQL table. ### Connect to MySQL

## Connecting to MySQL
mydb = dbConnect(MySQL(), user='root', password='Deepak1234#', dbname='twitter', host='localhost')
summary(mydb)
## <MySQLConnection:0,0>
##   User:   root 
##   Host:   localhost 
##   Dbname: twitter 
##   Connection type: localhost via TCP/IP 
## 
## Results:
## Drop the twitter schema if it already exists on the MySQL database
#drop.schema.query <- paste("drop schema if exists `twitter`;")
#dbGetQuery(mydb, drop.schema.query)

## Create the twitter schema
#create.schema.query <- paste("CREATE SCHEMA `twitter`;")
#dbGetQuery(mydb, create.schema.query)

## Creating the new table 'twitter' for storig the tweets for iphone_xs
#drop.MySQL.table.query <- paste("DROP TABLE IF EXISTS `twitter`;")
#dbGetQuery(mydb, drop.MySQL.table.query)

## Note here, that even if the twitter tweet has a character limit of 140 characters, but there are some tweets which have refernces to multiple users making the actual character count bigger. The user names refered using @ are not counted in the tweet character length count. Hence, to be on a safer side, we have created the tweet_text of length 1000.
## If we created a shorter length of 500, we got an error for a few tweets which actually became more than 500 characters due to references to multiple users.
iphone_xs_tweets_required_text[10807,1]
## [1] "My people! The countdown is on @jumianigeria 1 day massive sale is in 5 days from today. Get an iPhone Xs max by participating in the treasure hunt and win more prizes https://t.co/BHCRupRef3 #5DaystoJumia1212"
nchar(iphone_xs_tweets_required_text[10807,1])
## [1] 209
#create.MySQL.table.query <- paste("CREATE TABLE `twitter` (
#  `tweet_text` varchar(1000) NOT NULL
#);")

Storing data in Relational database

## There were some tweets with double quote, replacing it with a space to avoid any error while interting into MySQL
iphone_xs_tweets_required_text$text <- str_replace_all(iphone_xs_tweets_required_text$text, '"', ' ')

no_of_tweets <- nrow(iphone_xs_tweets_required_text)

query <- paste("INSERT INTO twitter VALUES (", '"', iphone_xs_tweets_required_text[1,], '")', collapse = '')
query
## [1] "INSERT INTO twitter VALUES ( \" @iammrdom @ChucklefishLTD Same, on iPhone XS Max \")"
#dbGetQuery(mydb, query)


## Running a loop to load the whole data frame (All tweets) into MySQL table
for (i in 1:no_of_tweets) {
  query <- paste("INSERT INTO twitter VALUES (", '"', iphone_xs_tweets_required_text[i,], '")', collapse = '')
  dbGetQuery(mydb, query)
}

Fetch from MySQL

get.MySQL.table.query <- paste("select * from twitter;")
iphone_xs_tweets_from_MySQL_df <- dbGetQuery(mydb, get.MySQL.table.query)

Text Mining and Word Cloud

### Loading the tweets text into a corpus
myCorpus <- Corpus(VectorSource(iphone_xs_tweets_from_MySQL_df$tweet_text))
inspect(myCorpus[1:20])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 20
## 
##  [1]  @iammrdom @ChucklefishLTD Same, on iPhone XS Max                                                                                                                                                                                                                                                            
##  [2]  @Cult0fPing Iâ\200\231m having the same issue here. Thereâ\200\231s $250 left on my iPhone 7 contract. Iâ\200\231m getting the XS for Christmas and have to trade mine in to be able to afford it. Not sure of the exact value theyâ\200\231re giving me but itâ\200\231s around the same, and Iâ\200\231m still having to pay $500 for the XS. 
##  [3]  I liked a @YouTube video https://t.co/JmbjLyAUV7 What's On My iPhone XS Max?!                                                                                                                                                                                                                               
##  [4]  I liked a @YouTube video https://t.co/XBCrvfM579 WHAT'S ON MY IPHONE XS MAX + GIVEAWAY!!! | DIMMA UMEH                                                                                                                                                                                                      
##  [5]  @__maggz_ @trevkudzi S/O to PC for the iphone Xs Max 256gb                                                                                                                                                                                                                                                  
##  [6]  @AuntyAdah iPhone XS Max                                                                                                                                                                                                                                                                                    
##  [7]  Iâ\200\231m convinced niggas donâ\200\231t really like females no more, they just want us to get 2 iPhone Xs in our name 🤷ðŸ\217¾â\200\215â\231\200ï¸\217                                                                                                                                                                              
##  [8]  @NotYetUhuru_ Canâ\200\231t wait to tweet via my iPhone xs max bought cash 🤣🤣🤣🤣 donâ\200\231t even know when that will be                                                                                                                                                                                   
##  [9]  Iphone Xs Max ðŸ\230\231 https://t.co/7ynFPLBT12                                                                                                                                                                                                                                                                  
## [10]  Guess Iâ\200\231ll be getting the iPhone XS Max                                                                                                                                                                                                                                                                   
## [11]  I got the iphone xr but I actually hate it now idk if I should get the iphone xs or just get the 8 :(                                                                                                                                                                                                       
## [12]  Stuck between getting a iPhone XS Max or iPhone XR . Decisions decisions decisions ðŸ\230©ðŸ\230©ðŸ\230©                                                                                                                                                                                                             
## [13]  Anyone here have the iPhone XS Max? I want to get it but is it too big??                                                                                                                                                                                                                                    
## [14]  My bf just got the iPhone XS a week ago &amp; dropped it now it has a small crack smh 🤦ðŸ\217¾â\200\215â\231\200ï¸\217                                                                                                                                                                                                     
## [15]  Must be that iPhone XS Max https://t.co/iW3ZqVvFqz                                                                                                                                                                                                                                                          
## [16]  The network is bad, I donâ\200\231t know if giving out iPhone Xs Max will make it better. https://t.co/yaWypC19mV                                                                                                                                                                                                 
## [17]  iPhone Xs MAX                                                                                                                                                                                                                                                                                               
## [18]  @BlaxkHefner iPhone XS Max â\235¤ï¸\217                                                                                                                                                                                                                                                                           
## [19]  Bring iPhone X â\200¢ XS â\200¢ XS Max &amp; AirPods ... for Christmas Money 💯\n \nâ\230Žï¸\217 0265668842 / 0202503435                                                                                                                                                                                              
## [20]  @O2 eh listen up! Iâ\200\231m outraged I had to travel 5 hours to and from work without any data, you best hook me up with a cheap iPhone XS contract, I ainâ\200\231t after compensation kmt
### Converting to lower characters
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(myCorpus, content_transformer(tolower)):
## transformation drops documents
## Remove http
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))
## Warning in tm_map.SimpleCorpus(myCorpus, content_transformer(removeURL)):
## transformation drops documents
## Remove any junk characters
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeNumPunct))
## Warning in tm_map.SimpleCorpus(myCorpus,
## content_transformer(removeNumPunct)): transformation drops documents
## Remove stopwords
myStopwords <- c(setdiff(stopwords('english'), c("r", "big")),
                 "use", "see", "used", "via", "amp")
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
## Warning in tm_map.SimpleCorpus(myCorpus, removeWords, myStopwords):
## transformation drops documents
## Strip whitespace
myCorpus <- tm_map(myCorpus, stripWhitespace)
## Warning in tm_map.SimpleCorpus(myCorpus, stripWhitespace): transformation
## drops documents
## Stemming
myCorpus <- tm_map(myCorpus, stemDocument) # stem words
## Warning in tm_map.SimpleCorpus(myCorpus, stemDocument): transformation
## drops documents
inspect(myCorpus[1:20])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 20
## 
##  [1] iammrdom chucklefishltd iphon xs max                                                                                                        
##  [2] cultfp iâm issu thereâ left iphon contract iâm get xs christma trade mine abl afford sure exact valu theyâr give itâ around iâm still pay xs
##  [3] like youtub video what iphon xs max                                                                                                         
##  [4] like youtub video what iphon xs max giveaway dimma umeh                                                                                     
##  [5] maggz trevkudzi pc iphon xs max gb                                                                                                          
##  [6] auntyadah iphon xs max                                                                                                                      
##  [7] iâm convinc nigga donât realli like femal just want us get iphon xs name ðÿðÿââï                                                            
##  [8] notyetuhuru canât wait tweet iphon xs max bought cash ðÿðÿðÿðÿ donât even know will                                                         
##  [9] iphon xs max ðÿ                                                                                                                             
## [10] guess iâll get iphon xs max                                                                                                                 
## [11] got iphon xr actual hate now idk get iphon xs just get                                                                                      
## [12] stuck get iphon xs max iphon xr decis decis decis ðÿðÿðÿ                                                                                    
## [13] anyon iphon xs max want get big                                                                                                             
## [14] bf just got iphon xs week ago drop now small crack smh ðÿðÿââï                                                                              
## [15] must iphon xs max                                                                                                                           
## [16] network bad donât know give iphon xs max will make better                                                                                   
## [17] iphon xs max                                                                                                                                
## [18] blaxkhefn iphon xs max âï                                                                                                                   
## [19] bring iphon x â xs â xs max airpod christma money ðÿ âï                                                                                     
## [20] o eh listen iâm outrag travel hour work without data best hook cheap iphon xs contract ainât compens kmt
## Creating Term Document Matrix
tdm <- TermDocumentMatrix(myCorpus)

tdm <- as.matrix(tdm)
tdm[1:10, 1:20]
##                 Docs
## Terms            1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
##   chucklefishltd 1 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   iammrdom       1 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   iphon          1 1 1 1 1 1 1 1 1  1  2  2  1  1  1  1  1  1  1  1
##   max            1 0 1 1 1 1 0 1 1  1  0  1  1  0  1  1  1  1  1  0
##   abl            0 1 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   afford         0 1 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   around         0 1 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   christma       0 1 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  1  0
##   contract       0 1 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  1
##   cultfp         0 1 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
## As we see here, iphon is the word in every document, which is justified, as we searched for iphone xs. Hence we will remove this term - iphon, as it does not solve any purpose here, as it will be present in all the tweets fetched.
myCorpus <- tm_map(myCorpus, removeWords, c("iphon", "xs"))
## Warning in tm_map.SimpleCorpus(myCorpus, removeWords, c("iphon", "xs")):
## transformation drops documents
## Building the tdm again after removing iphon word:
tdm <- TermDocumentMatrix(myCorpus)

tdm <- as.matrix(tdm)
tdm[1:10, 1:20]
##                 Docs
## Terms            1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
##   chucklefishltd 1 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   iammrdom       1 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   max            1 0 1 1 1 1 0 1 1  1  0  1  1  0  1  1  1  1  1  0
##   abl            0 1 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   afford         0 1 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   around         0 1 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   christma       0 1 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  1  0
##   contract       0 1 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  1
##   cultfp         0 1 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   exact          0 1 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
## Building the word frequency
word.freq <- rowSums(tdm)
word.freq <- sort(word.freq, decreasing = TRUE)
set.seed(222)

## Top 20 used words
word.freq[1:20]
##          max         appl          new     giveaway         like 
##         6588         2374         1408         1362         1242 
##         case        phone          get       youtub          win 
##         1193         1144         1132         1081         1002 
##        video        enter          now        check         just 
##          996          937          910          857          820 
##       camera     casetifi gravyardgirl          got          can 
##          729          716          715          598          565
## Building the word cloud
twitter.words <- names(word.freq)

wordcloud(words = twitter.words, 
          freq = word.freq,
          max.words = 150,
          random.order = F,
          min.freq = 5,
          colors = brewer.pal(8, 'Dark2'),
          scale = c(5,0.3),
          rot.per = 0.3)

Now, from the word cloud, we see that there are many words like below, which do not carry much meaning in the context of this analysis and hence can be removed. Some of these words ame from the ad tweets, some are common words like phone, youtub, etc. We will remove such words now, so that we can have a better and a more meaningful word cloud.

myCorpus <- tm_map(myCorpus, removeWords, c("max", "appl", "giveaway", "youtub", "win", "phone", "gravyardgirl", "gravyardgirliphonegiveaway"))
## Warning in tm_map.SimpleCorpus(myCorpus, removeWords, c("max", "appl",
## "giveaway", : transformation drops documents
## Building a fresh tdm now as we just removed some words.
tdm <- TermDocumentMatrix(myCorpus)

tdm <- as.matrix(tdm)

## Building the updated word frequency
word.freq <- rowSums(tdm)
word.freq <- sort(word.freq, decreasing = TRUE)
set.seed(222)

## Top 20 used words now
word.freq[1:20]
##      new     like     case      get    video    enter      now    check 
##     1408     1242     1193     1132      996      937      910      857 
##     just   camera casetifi      got      can      one      buy     want 
##      820      729      716      598      565      541      530      522 
##     plus    pixel    price     esim 
##      497      456      438      421
## Building the updated word cloud
twitter.words <- names(word.freq)

wordcloud(words = twitter.words, 
          freq = word.freq,
          max.words = 150,
          random.order = F,
          min.freq = 5,
          colors = brewer.pal(8, 'Dark2'),
          scale = c(5,0.3),
          rot.per = 0.3)

From the updated word cloud above, we get some good insights as below:

  • People are talking positive words like - “like”, “get”, “want”, “buy”, “great”, hence meaning a positive sentiment overall

  • Words like “galaxy”, “samsung”, “android” also appear here, indicating that the users are comparing iphone xs as compared to android phones

Sentiment Analysis

sentiment_scores <- get_nrc_sentiment(iphone_xs_tweets_from_MySQL_df$tweet_text)
head(sentiment_scores)
##   anger anticipation disgust fear joy sadness surprise trust negative
## 1     0            0       0    0   0       0        0     0        0
## 2     0            1       0    1   1       0        0     2        1
## 3     0            0       0    0   0       0        0     0        0
## 4     0            0       0    0   0       0        0     0        0
## 5     0            0       0    0   0       0        0     0        0
## 6     0            0       0    0   0       0        0     0        0
##   positive
## 1        0
## 2        3
## 3        0
## 4        0
## 5        0
## 6        0
barplot(colSums(sentiment_scores),
        las = 2,
        col = rainbow(10),
        ylab = "Count",
        main = "Sentiment scores for iphone xs Tweets")

Conclusions:

1. The overall response from the people on the iphone XS is positive.

2. Some very positive words appear on the top of the list of mostly used words in iphone xs tweets, as seen in the word cloud.

3. As there are so many users / people interested in the brand, there have been many advertizements going around to woo customers by so many third party online companies.

4. With such hype around a product by a company like Apple, many people voice their views and opinions online. And Twitter being one of the most used social networking portal, it gave a good deal of tweets to analyze the customer sentiment.

5. Even if there is not a single conclusion from this analysis we found some good information about the overall sentiments around iPhone XS which can be seen in the plots and points shown in this document.

Challenges Faced:

  1. We faced a challenge while writing the data.frame directly into MySQL database using the function dbWriteTable due to an error in Production MySQL, which was preventing writing the data into MySQL. Hence we had to create the MySQL schema and table using the MySQL script as a separate entity, and kept only 1 field (tweet text) as the query had to be manually created at the run time for the whole data.frame for the tweets for iphone XS, using a loop.
  2. There have been many ads on the tweets which have the term - iphone xs. Removing them was a challenge that we faced. To lessen the impact of the ads, we had set up a cut-off limit for only consuming the users with up to 4 tweets in the overall fetch. This will ensure that the users who had many tweets (5 or more in this case per our assumption to lessen impact) were not accounted for in the analysis. Also we removed such words like “giveaway”, “win”, etc. which are more likely to be used in ads.