This is the second part of the project, where we pull the data from the data lake, clean it, store it in MySQL, and perform the analysis on the tweet text.
We read the tweet data from the CSV file that was created in part 1 from the data lake, which for this project is a MongoDB database. The CSV link is: https://raw.githubusercontent.com/charlsjoseph/CUNY-Data607/master/Data607-Final_Project/data/ExtractedData_iphone_xs.csv
iphone_xs_csv_link <- "https://raw.githubusercontent.com/charlsjoseph/CUNY-Data607/master/Data607-Final_Project/data/ExtractedData_iphone_xs.csv"
iphone_xs_tweets_required <- read.csv(iphone_xs_csv_link, header = TRUE)
dim(iphone_xs_tweets_required)
## [1] 16401 8
iphone_xs_tweets_required %>% head(5) %>% kable() %>% kable_styling()
| X_id | user_id | created_at | screen_name | text | source | location | followers_count |
|---|---|---|---|---|---|---|---|
| x5c0aa99c7fab604870013be0 | x2890259085 | 2018-12-05 14:03:49 | EmmiSilvas_ | i entered @grav3yardgirl & @casetify iPhone XS MAX GIVEAWAY! CHECK IT OUT HERE! https://t.co/sMHtlMvWf1 #grav3yardgirliPhonegiveaway | Twitter for iPhone | | 85 |
| x5c0aa99c7fab604870013be1 | x2890259085 | 2018-12-05 14:03:42 | EmmiSilvas_ | i entered @grav3yardgirl & @casetify iPhone XS MAX GIVEAWAY! CHECK IT OUT HERE! https://t.co/sMHtlMvWf1 #grav3yardgirliPhonegiveaway | Twitter for iPhone | | 85 |
| x5c0aa99c7fab604870013be2 | x2647272271 | 2018-12-05 14:03:37 | JumiaGhana | 5 HOURS TO GO! Get ready, the iPhone Xs 65GB will be on TREASURE HUNT @ 12 MIDNIGHT FOR MIND-BLOWING GH¢ 78 ONLY! Don't sleep! You can't afford to pass this up! NOTE: 🛑 ONLY ONE ITEM AVAILABLE TO BE WON Go Here>> https://t.co/RD085Zynb8 https://t.co/pS7VXsdLzX | Twitter Web Client | Ghana | 4115 |
| x5c0aa99c7fab604870013be3 | x22846501 | 2018-12-05 14:03:02 | GhostBishopTV | @ufallb4me iPhone XS Max | Twitter for iPhone | | 273 |
| x5c0aa99c7fab604870013be4 | x16641674 | 2018-12-05 14:02:02 | edelagrave | “After an investigation the ASA found that the statement was fair” https://t.co/aYvITtSD5g | Twitter Web Client | Montréal | 776 |
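Note that read.csv in R versions before 4.0 imports string columns as factors (the <fct> type in the grouped summary further below reflects this). If plain character columns are preferred for the later string manipulation, the import could be done as in the sketch below; the name iphone_xs_tweets_chr is only illustrative and is not used in the rest of the analysis.
# Optional sketch: read the text columns as character rather than factor
# (read.csv defaults to stringsAsFactors = TRUE in R versions before 4.0).
iphone_xs_tweets_chr <- read.csv(iphone_xs_csv_link, header = TRUE, stringsAsFactors = FALSE)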
We now clean up the tweets to keep only the data that needs to be reviewed for this analysis. Here we group the tweet messages by user ID (screen name); this grouping will be used later to remove the advertisements.
## Now, grouping the tweets based on the screen_name
iphone_xs_tweets_required_group_user <- iphone_xs_tweets_required %>% group_by(screen_name)
per_user <- summarise(iphone_xs_tweets_required_group_user, number_of_messages=n())
# Arranging the tweets count in the descending order, so that the user (screen_name) with the highest count is on the top.
per_user <- arrange(per_user, desc(number_of_messages))
head(per_user)
## # A tibble: 6 x 2
## screen_name number_of_messages
## <fct> <int>
## 1 whitestonedome 180
## 2 marciojmsilva 81
## 3 TopCellularDeal 73
## 4 techzonestation 56
## 5 allsorce_ 51
## 6 Galaxyukebay 46
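As an aside, the same per-user counts could be produced in a single step with dplyr::count, as in the sketch below (the count column is then named n rather than number_of_messages).
# Equivalent sketch: count() wraps group_by() + summarise(n()) + arrange(desc()).
per_user_alt <- iphone_xs_tweets_required %>% count(screen_name, sort = TRUE)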
As we see above, there are close to 300 users at the top with a large number of tweets on the same topic, the iPhone XS. So, to lessen the occurrence and impact of advertisements in our analysis, we remove from the analysis the users who have 5 or more tweets on this topic.
per_user <- per_user %>% filter(number_of_messages < 5)
## Joining the two data frames to keep only the messages from users (screen names) with 4 or fewer tweets
iphone_xs_tweets_required_users <- merge(iphone_xs_tweets_required, per_user, by="screen_name")
dim(iphone_xs_tweets_required_users)
## [1] 13515 10
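A dplyr-only formulation of the same filtering is sketched below; it keeps the same tweets, although the row order differs and the number_of_messages column is not carried along.
# Equivalent sketch: keep tweets from users with fewer than 5 messages,
# without building per_user and merging it back.
iphone_xs_tweets_filtered_alt <- iphone_xs_tweets_required %>%
  group_by(screen_name) %>%
  filter(n() < 5) %>%
  ungroup()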
## Keeping only the text column from the data frame; the remaining fields are not required for the further steps, hence dropping them.
iphone_xs_tweets_required_text <- iphone_xs_tweets_required_users %>% select(5)
iphone_xs_tweets_required_text %>% head(4) %>% kable() %>% kable_styling()
| text |
|---|
| @iammrdom @ChucklefishLTD Same, on iPhone XS Max |
| @Cult0fPing I'm having the same issue here. There's $250 left on my iPhone 7 contract. I'm getting the XS for Christmas and have to trade mine in to be able to afford it. Not sure of the exact value they're giving me but it's around the same, and I'm still having to pay $500 for the XS. |
| I liked a @YouTube video https://t.co/JmbjLyAUV7 What’s On My iPhone XS Max?! |
| I liked a @YouTube video https://t.co/XBCrvfM579 WHAT’S ON MY IPHONE XS MAX + GIVEAWAY!!! | DIMMA UMEH |
We store the text of all the required tweets in a MySQL database. This makes it easy to fetch the data whenever we need it for our analysis. The first few code lines set up the MySQL schema and table structure so that the tweet text can be written to a MySQL table.
### Connect to MySQL
## Connecting to MySQL
mydb = dbConnect(MySQL(), user='root', password='Deepak1234#', dbname='twitter', host='localhost')
summary(mydb)
## <MySQLConnection:0,0>
## User: root
## Host: localhost
## Dbname: twitter
## Connection type: localhost via TCP/IP
##
## Results:
## Drop the twitter schema if it already exists on the MySQL database
#drop.schema.query <- paste("drop schema if exists `twitter`;")
#dbGetQuery(mydb, drop.schema.query)
## Create the twitter schema
#create.schema.query <- paste("CREATE SCHEMA `twitter`;")
#dbGetQuery(mydb, create.schema.query)
## Creating the new table 'twitter' for storing the tweets for iphone_xs
#drop.MySQL.table.query <- paste("DROP TABLE IF EXISTS `twitter`;")
#dbGetQuery(mydb, drop.MySQL.table.query)
## Note that even though tweets have a nominal character limit, some tweets reference multiple users, which makes the actual text length larger: the user names referenced with @ are not counted toward the tweet character limit. Hence, to be on the safer side, we created the tweet_text column with length 1000.
## When we created a shorter length of 500, we got an error for a few tweets that exceeded 500 characters due to references to multiple users.
iphone_xs_tweets_required_text[10807,1]
## [1] "My people! The countdown is on @jumianigeria 1 day massive sale is in 5 days from today. Get an iPhone Xs max by participating in the treasure hunt and win more prizes https://t.co/BHCRupRef3 #5DaystoJumia1212"
nchar(iphone_xs_tweets_required_text[10807,1])
## [1] 209
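To confirm that varchar(1000) is wide enough for every tweet, a quick sanity check over the whole data frame could be run as sketched below (its result was not printed in the original run).
# Sanity-check sketch: length of the longest tweet, which should fit
# comfortably within the varchar(1000) column defined below.
max(nchar(as.character(iphone_xs_tweets_required_text$text)))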
#create.MySQL.table.query <- paste("CREATE TABLE `twitter` (
# `tweet_text` varchar(1000) NOT NULL
#);")
## Some tweets contained double quotes; replacing them with a space to avoid errors while inserting into MySQL
iphone_xs_tweets_required_text$text <- str_replace_all(iphone_xs_tweets_required_text$text, '"', ' ')
no_of_tweets <- nrow(iphone_xs_tweets_required_text)
query <- paste("INSERT INTO twitter VALUES (", '"', iphone_xs_tweets_required_text[1,], '")', collapse = '')
query
## [1] "INSERT INTO twitter VALUES ( \" @iammrdom @ChucklefishLTD Same, on iPhone XS Max \")"
#dbGetQuery(mydb, query)
## Running a loop to load the whole data frame (All tweets) into MySQL table
for (i in 1:no_of_tweets) {
query <- paste("INSERT INTO twitter VALUES (", '"', iphone_xs_tweets_required_text[i,], '")', collapse = '')
dbGetQuery(mydb, query)
}
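As an alternative to building INSERT statements by hand, DBI's dbWriteTable handles quoting and escaping itself; a sketch is shown below (this is not what was run here, and it assumes the twitter table created above with its single tweet_text column).
# Alternative sketch: bulk-append the tweets with dbWriteTable instead of a loop.
tweets_to_load <- data.frame(tweet_text = as.character(iphone_xs_tweets_required_text$text),
                             stringsAsFactors = FALSE)
dbWriteTable(mydb, "twitter", tweets_to_load, append = TRUE, row.names = FALSE)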
get.MySQL.table.query <- paste("select * from twitter;")
iphone_xs_tweets_from_MySQL_df <- dbGetQuery(mydb, get.MySQL.table.query)
### Loading the tweets text into a corpus
myCorpus <- Corpus(VectorSource(iphone_xs_tweets_from_MySQL_df$tweet_text))
inspect(myCorpus[1:20])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 20
##
## [1] @iammrdom @ChucklefishLTD Same, on iPhone XS Max
## [2] @Cult0fPing Iâ\200\231m having the same issue here. Thereâ\200\231s $250 left on my iPhone 7 contract. Iâ\200\231m getting the XS for Christmas and have to trade mine in to be able to afford it. Not sure of the exact value theyâ\200\231re giving me but itâ\200\231s around the same, and Iâ\200\231m still having to pay $500 for the XS.
## [3] I liked a @YouTube video https://t.co/JmbjLyAUV7 What's On My iPhone XS Max?!
## [4] I liked a @YouTube video https://t.co/XBCrvfM579 WHAT'S ON MY IPHONE XS MAX + GIVEAWAY!!! | DIMMA UMEH
## [5] @__maggz_ @trevkudzi S/O to PC for the iphone Xs Max 256gb
## [6] @AuntyAdah iPhone XS Max
## [7] Iâ\200\231m convinced niggas donâ\200\231t really like females no more, they just want us to get 2 iPhone Xs in our name ð¤·ð\217¾â\200\215â\231\200ï¸\217
## [8] @NotYetUhuru_ Canâ\200\231t wait to tweet via my iPhone xs max bought cash ð¤£ð¤£ð¤£ð¤£ donâ\200\231t even know when that will be
## [9] Iphone Xs Max ð\230\231 https://t.co/7ynFPLBT12
## [10] Guess Iâ\200\231ll be getting the iPhone XS Max
## [11] I got the iphone xr but I actually hate it now idk if I should get the iphone xs or just get the 8 :(
## [12] Stuck between getting a iPhone XS Max or iPhone XR . Decisions decisions decisions ð\230©ð\230©ð\230©
## [13] Anyone here have the iPhone XS Max? I want to get it but is it too big??
## [14] My bf just got the iPhone XS a week ago & dropped it now it has a small crack smh ð¤¦ð\217¾â\200\215â\231\200ï¸\217
## [15] Must be that iPhone XS Max https://t.co/iW3ZqVvFqz
## [16] The network is bad, I donâ\200\231t know if giving out iPhone Xs Max will make it better. https://t.co/yaWypC19mV
## [17] iPhone Xs MAX
## [18] @BlaxkHefner iPhone XS Max â\235¤ï¸\217
## [19] Bring iPhone X â\200¢ XS â\200¢ XS Max & AirPods ... for Christmas Money ð¯\n \nâ\230ï¸\217 0265668842 / 0202503435
## [20] @O2 eh listen up! Iâ\200\231m outraged I had to travel 5 hours to and from work without any data, you best hook me up with a cheap iPhone XS contract, I ainâ\200\231t after compensation kmt
### Converting to lowercase
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(myCorpus, content_transformer(tolower)):
## transformation drops documents
## Remove http
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))
## Warning in tm_map.SimpleCorpus(myCorpus, content_transformer(removeURL)):
## transformation drops documents
## Remove any junk characters
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeNumPunct))
## Warning in tm_map.SimpleCorpus(myCorpus,
## content_transformer(removeNumPunct)): transformation drops documents
## Remove stopwords
myStopwords <- c(setdiff(stopwords('english'), c("r", "big")),
"use", "see", "used", "via", "amp")
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
## Warning in tm_map.SimpleCorpus(myCorpus, removeWords, myStopwords):
## transformation drops documents
## Strip whitespace
myCorpus <- tm_map(myCorpus, stripWhitespace)
## Warning in tm_map.SimpleCorpus(myCorpus, stripWhitespace): transformation
## drops documents
## Stemming
myCorpus <- tm_map(myCorpus, stemDocument) # stem words
## Warning in tm_map.SimpleCorpus(myCorpus, stemDocument): transformation
## drops documents
inspect(myCorpus[1:20])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 20
##
## [1] iammrdom chucklefishltd iphon xs max
## [2] cultfp iâm issu thereâ left iphon contract iâm get xs christma trade mine abl afford sure exact valu theyâr give itâ around iâm still pay xs
## [3] like youtub video what iphon xs max
## [4] like youtub video what iphon xs max giveaway dimma umeh
## [5] maggz trevkudzi pc iphon xs max gb
## [6] auntyadah iphon xs max
## [7] iâm convinc nigga donât realli like femal just want us get iphon xs name ðÿðÿââï
## [8] notyetuhuru canât wait tweet iphon xs max bought cash ðÿðÿðÿðÿ donât even know will
## [9] iphon xs max ðÿ
## [10] guess iâll get iphon xs max
## [11] got iphon xr actual hate now idk get iphon xs just get
## [12] stuck get iphon xs max iphon xr decis decis decis ðÿðÿðÿ
## [13] anyon iphon xs max want get big
## [14] bf just got iphon xs week ago drop now small crack smh ðÿðÿââï
## [15] must iphon xs max
## [16] network bad donât know give iphon xs max will make better
## [17] iphon xs max
## [18] blaxkhefn iphon xs max âï
## [19] bring iphon x â xs â xs max airpod christma money ðÿ âï
## [20] o eh listen iâm outrag travel hour work without data best hook cheap iphon xs contract ainât compens kmt
## Creating Term Document Matrix
tdm <- TermDocumentMatrix(myCorpus)
tdm <- as.matrix(tdm)
tdm[1:10, 1:20]
## Docs
## Terms 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## chucklefishltd 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## iammrdom 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## iphon 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1
## max 1 0 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 1 0
## abl 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## afford 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## around 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## christma 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
## contract 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## cultfp 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## As we see here, 'iphon' appears in every document, which is expected since we searched for iPhone XS tweets. Hence we remove this term (along with 'xs'), as it serves no purpose here: it is present in every tweet fetched.
myCorpus <- tm_map(myCorpus, removeWords, c("iphon", "xs"))
## Warning in tm_map.SimpleCorpus(myCorpus, removeWords, c("iphon", "xs")):
## transformation drops documents
## Building the TDM again after removing the 'iphon' and 'xs' terms:
tdm <- TermDocumentMatrix(myCorpus)
tdm <- as.matrix(tdm)
tdm[1:10, 1:20]
## Docs
## Terms 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## chucklefishltd 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## iammrdom 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## max 1 0 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 1 0
## abl 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## afford 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## around 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## christma 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
## contract 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## cultfp 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## exact 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Building the word frequency
word.freq <- rowSums(tdm)
word.freq <- sort(word.freq, decreasing = TRUE)
set.seed(222)
## Top 20 used words
word.freq[1:20]
## max appl new giveaway like
## 6588 2374 1408 1362 1242
## case phone get youtub win
## 1193 1144 1132 1081 1002
## video enter now check just
## 996 937 910 857 820
## camera casetifi gravyardgirl got can
## 729 716 715 598 565
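As an aside, high-frequency terms can also be read straight off the sparse term-document matrix with tm::findFreqTerms, without converting it to a dense matrix first; the 500 threshold below is an arbitrary illustration.
# Sketch: list all stemmed terms appearing at least 500 times in the corpus.
findFreqTerms(TermDocumentMatrix(myCorpus), lowfreq = 500)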
## Building the word cloud
twitter.words <- names(word.freq)
wordcloud(words = twitter.words,
freq = word.freq,
max.words = 150,
random.order = F,
min.freq = 5,
colors = brewer.pal(8, 'Dark2'),
scale = c(5,0.3),
rot.per = 0.3)
Now, from the word cloud, we see that there are many words which do not carry much meaning in the context of this analysis and hence can be removed. Some of these words came from the ad tweets; others are generic words like "phone" and "youtub". We remove such words now so that we get a better and more meaningful word cloud.
myCorpus <- tm_map(myCorpus, removeWords, c("max", "appl", "giveaway", "youtub", "win", "phone", "gravyardgirl", "gravyardgirliphonegiveaway"))
## Warning in tm_map.SimpleCorpus(myCorpus, removeWords, c("max", "appl",
## "giveaway", : transformation drops documents
## Building a fresh tdm now as we just removed some words.
tdm <- TermDocumentMatrix(myCorpus)
tdm <- as.matrix(tdm)
## Building the updated word frequency
word.freq <- rowSums(tdm)
word.freq <- sort(word.freq, decreasing = TRUE)
set.seed(222)
## Top 20 used words now
word.freq[1:20]
## new like case get video enter now check
## 1408 1242 1193 1132 996 937 910 857
## just camera casetifi got can one buy want
## 820 729 716 598 565 541 530 522
## plus pixel price esim
## 497 456 438 421
## Building the updated word cloud
twitter.words <- names(word.freq)
wordcloud(words = twitter.words,
freq = word.freq,
max.words = 150,
random.order = F,
min.freq = 5,
colors = brewer.pal(8, 'Dark2'),
scale = c(5,0.3),
rot.per = 0.3)
From the updated word cloud above, we get some useful insights:
People are using positive words such as "like", "get", "want", "buy", and "great", suggesting an overall positive sentiment.
Words like "galaxy", "samsung", and "android" also appear, indicating that users are comparing the iPhone XS with Android phones.
Finally, we run an NRC sentiment analysis on the tweet text fetched back from MySQL to quantify the emotions expressed in the tweets.
sentiment_scores <- get_nrc_sentiment(iphone_xs_tweets_from_MySQL_df$tweet_text)
head(sentiment_scores)
## anger anticipation disgust fear joy sadness surprise trust negative
## 1 0 0 0 0 0 0 0 0 0
## 2 0 1 0 1 1 0 0 2 1
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0
## positive
## 1 0
## 2 3
## 3 0
## 4 0
## 5 0
## 6 0
barplot(colSums(sentiment_scores),
las = 2,
col = rainbow(10),
ylab = "Count",
main = "Sentiment scores for iphone xs Tweets")