This project performs sentiment analysis on tweets to gain insight into customer reactions to a recently launched product (here, the iPhone XS). Social media is a powerful source for gauging public sentiment on any topic, since people often voice their opinions through Twitter tweets, Facebook posts and similar channels. As the iPhone XS is one of the latest phones launched by Apple, this project analyzes how people have been reacting to this new phone from the technology giant.
Because Twitter lets its users post open-ended comments, the data can be quite varied in nature: advertisements, personal opinions, reviews, references and any other form of comment. By looking at all of these together as a combined sentiment, we hope to gain good insight into how people have received the new product.
The same approach can be reused for any other new product launch. Although it may not give a final verdict on the overall sentiment, it can serve as a good initial analysis of a new product.
The steps that we followed:
Main R packages / libraries used:
Here, we collect tweets about two newly released products, the ‘iPhone XS’ and the ‘Pixel 3’ (referred to as ‘Nexus Pixel’ in the code and file names), using the Twitter API through the rtweet package. rtweet is an R client for Twitter's REST and streaming APIs and the successor to the older ‘twitteR’ package.
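The ‘search_tweets’ function is used to pull the tweets. The arguments used are the search query, n (the maximum number of tweets to return), include_rts (whether retweets are included), retryonratelimit (whether to wait and retry once the rate limit is reached) and lang (restricts the results to one language).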
To keep the consumer and secret keys out of this document, the entire collection code is wrapped in an ‘if(FALSE)’ block. It is shown in the markdown for demonstration only; we ran it locally and made the resulting data available in CSV format on GitHub.
library("rtweet")
# Wrapped in if(FALSE) so the collection code is shown but never executed here.
if(FALSE) {
  # Authenticate; the 'XXXX' values are placeholders for the scrubbed keys.
  create_token(
    app = "my_twitter_research_app",
    consumer_key = 'XXXX',
    consumer_secret = 'XXXX',
    access_token = 'XXXX',
    access_secret = 'XXXX')
  # Pull up to 18000 English tweets per product, excluding retweets.
  iphone_xs <- search_tweets(
    "#iPhoneXS OR 'iphone XS'", n = 18000, include_rts = FALSE, retryonratelimit = TRUE, lang = 'en'
  )
  nexus_pxel <- search_tweets(
    "#pixel3 OR pixel3 OR 'Pixel 3'", n = 18000, include_rts = FALSE, retryonratelimit = TRUE, lang = 'en'
  )
  # Write the raw tweets to CSV files, later uploaded to GitHub for pair programming.
  write_as_csv(x = iphone_xs, file_name = 'iphone_xs.csv')
  write_as_csv(x = nexus_pxel, file_name = 'tweets_nexus_pxel.csv')
}
A best practice in any large data collection is to persist the raw data into a staging area. In the big data world, it is often pushed into a ‘data lake’ store such as Hadoop or a NoSQL database, from which data analysts and scientists later pull the data they need for analysis and model creation. When data is collected like this in a real-time fashion, the process is called data pipelining: data is collected, extracted and cleaned, flowing from one phase to the next. Below is the data flow diagram for collecting, extracting, cleaning and analyzing the data.
Data Flow Diagram
Here, we stage the raw data in MongoDB. We first tried to persist it into a MySQL database, but some of the tweet fields are lists, which do not fit well into a flat relational table. A document-oriented NoSQL database such as MongoDB stores such fields naturally. A new column, topic, is added to record the topic and the date of collection. Both the iPhone and Pixel 3 data are persisted into the ‘tweets_collection’ collection in the ‘tweeterDatadb’ database.
For example, in ‘Iphone_2018_12_5’ the topic is ‘Iphone’ and the date of collection is ‘2018_12_5’.
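if(FALSE) {
  library(mongolite)
  # Tag each data set with its topic and collection date.
  iphone_xs$topic <- "Iphone_2018_12_5"
  nexus_pxel$topic <- "Nexus_Pixel_2018_12_5"
  # Connect to the staging collection and insert both data sets.
  mongodb <- mongo(collection = "tweets_collection", db = "tweeterDatadb")
  mongodb$insert(iphone_xs)
  mongodb$insert(nexus_pxel)
}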
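The topic labels combine a topic name with the collection date, so it can be handy to build and parse them in one place. The following is a minimal sketch in base R; `make_topic_label` and `parse_topic_label` are hypothetical helpers that are not part of the original analysis, and they assume the label always ends in year, month and day separated by underscores.

```r
# Hypothetical helpers (not part of the original analysis) for the
# "<Topic>_<year>_<month>_<day>" labels stored in the topic column.
make_topic_label <- function(topic, collected_on) {
  d <- as.POSIXlt(as.Date(collected_on))
  # Month and day are written without zero padding, e.g. 2018_12_5.
  paste(topic, d$year + 1900, d$mon + 1, d$mday, sep = "_")
}

parse_topic_label <- function(label) {
  # The last three underscore-separated pieces are year, month and day;
  # everything before them is the topic (which may itself contain "_").
  parts <- strsplit(label, "_", fixed = TRUE)[[1]]
  n <- length(parts)
  list(
    topic = paste(parts[seq_len(n - 3)], collapse = "_"),
    collected_on = as.Date(paste(parts[(n - 2):n], collapse = "-"))
  )
}

label <- make_topic_label("Nexus_Pixel", "2018-12-05")
parsed <- parse_topic_label(label)
```

Keeping the date recoverable from the label means several collection runs for the same product can share one collection and still be told apart.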
Below is the MongoDB Compass view of the data.
Tweets Collection in MongoDB
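The list-valued tweet fields that made a MySQL table awkward for staging can be illustrated with a toy data frame. The sketch below uses made-up values and a hypothetical helper, `flatten_list_columns`, to show what such a column looks like and one way it could be collapsed into plain strings if a flat table or CSV file were required. It is an illustration under those assumptions, not the approach taken in this project, which stores the lists as they are in MongoDB.

```r
# Toy data standing in for staged tweets; the values are made up.
tweets <- data.frame(
  status_id = c("1", "2"),
  text = c("Loving the new phone", "Camera comparison"),
  stringsAsFactors = FALSE
)
# A list column: each tweet carries zero or more hashtags.
tweets$hashtags <- list(c("iPhoneXS", "apple"), character(0))

# Hypothetical helper: collapse every list column into a single
# space-separated string per row.
flatten_list_columns <- function(df) {
  is_list <- vapply(df, is.list, logical(1))
  df[is_list] <- lapply(df[is_list], function(col) {
    vapply(col, function(x) paste(x, collapse = " "), character(1))
  })
  df
}

flat <- flatten_list_columns(tweets)
```

Flattening loses the boundaries between list elements that themselves contain spaces, which is one reason a document store is the more comfortable fit for the raw data.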
Below is the code to extract the data relevant to the analysis. We use the mongolite $find method, with its query and fields arguments, to fetch it.
if(FALSE) {
  df_iphoneXS <- mongodb$find(query = '{"topic" : "Iphone_2018_12_5"}',
                              fields = '{"user_id" : true,
                                         "created_at" : true,
                                         "screen_name" : true,
                                         "text" : true,
                                         "source" : true,
                                         "followers_count" : true,
                                         "location" : true }')
  df_pixel <- mongodb$find(query = '{"topic" : "Nexus_Pixel_2018_12_5"}',
                           fields = '{"user_id" : true,
                                      "created_at" : true,
                                      "screen_name" : true,
                                      "text" : true,
                                      "source" : true,
                                      "followers_count" : true,
                                      "location" : true }')
  write_as_csv(x = df_iphoneXS, file_name = 'ExtractedData_iphone_xs.csv')
  write_as_csv(x = df_pixel, file_name = 'ExtractedData_tweets_nexus_pxel.csv')
}
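Both extraction calls pass the same list of fields and differ only in the topic label, so the JSON strings could be generated rather than typed twice. The following is a small sketch in base R; `build_topic_query` and `build_fields` are hypothetical helpers, not part of the original code, and they assume the topic label and column names contain no quotes or backslashes.

```r
# Hypothetical helpers (not in the original code) that build the JSON
# strings passed to the query and fields arguments of $find.
build_topic_query <- function(topic) {
  # Assumes the topic label contains no quotes or backslashes.
  sprintf('{"topic": "%s"}', topic)
}

build_fields <- function(cols) {
  # Produces {"col1": true, "col2": true, ...}
  paste0("{", paste(sprintf('"%s": true', cols), collapse = ", "), "}")
}

wanted <- c("user_id", "created_at", "screen_name", "text",
            "source", "followers_count", "location")

iphone_query <- build_topic_query("Iphone_2018_12_5")
tweet_fields <- build_fields(wanted)

# With a live connection these would be used as:
# df_iphoneXS <- mongodb$find(query = iphone_query, fields = tweet_fields)
```

Generating the strings keeps the two extractions in step: adding a field to `wanted` changes both queries at once.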
Both the staged and the extracted data are uploaded to GitHub for pair coding. The code below pulls both data sets from GitHub.
library(readr)
iphone_xs <- read.csv("https://raw.githubusercontent.com/charlsjoseph/CUNY-Data607/master/Data607-Final_Project/iphone_xs.csv")
nexus_pxel <- read.csv("https://raw.githubusercontent.com/charlsjoseph/CUNY-Data607/master/Data607-Final_Project/tweets_nexus_pxel.csv")
df_iphoneXS_extrd <- read.csv("https://raw.githubusercontent.com/charlsjoseph/CUNY-Data607/master/Data607-Final_Project/data/ExtractedData_iphone_xs.csv")
df_pixel_extrd <- read.csv("https://raw.githubusercontent.com/charlsjoseph/CUNY-Data607/master/Data607-Final_Project/data/ExtractedData_tweets_nexus_pxel.csv")
Below are the timeline graphs for the iPhone and Pixel 3 tweets. They show that the tweets come from the 8-9 days preceding the date of collection (05-Dec-2018).
Number of iPhone tweets collected: 16401
Number of Pixel 3 tweets collected: 12085
ts_plot(iphone_xs, "1 hours") +
  ggplot2::theme_minimal() +
  ggplot2::theme(plot.title = ggplot2::element_text(face = "bold")) +
  ggplot2::labs(
    x = NULL, y = NULL,
    title = "Frequency of #iphoneXS Twitter statuses",
    subtitle = "Twitter status (tweet) counts aggregated using one-hour intervals",
    caption = "\nSource: Data collected from Twitter's REST API via rtweet"
  )
ts_plot(nexus_pxel, "1 hours") +
  ggplot2::theme_minimal() +
  ggplot2::theme(plot.title = ggplot2::element_text(face = "bold")) +
  ggplot2::labs(
    x = NULL, y = NULL,
    title = "Frequency of #pixel3 Twitter statuses",
    subtitle = "Twitter status (tweet) counts aggregated using one-hour intervals",
    caption = "\nSource: Data collected from Twitter's REST API via rtweet"
  )
nrow(iphone_xs)
## [1] 16401
nrow(nexus_pxel)
## [1] 12085
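The span of the collection window can also be checked numerically rather than read off a plot. The sketch below shows the idea on a toy data frame with made-up timestamps; applying it to the real data would assume that the created_at column parses with `as.POSIXct` in the same way.

```r
# Made-up timestamps standing in for the created_at column.
toy <- data.frame(
  created_at = c("2018-11-27 08:15:00", "2018-11-27 21:40:00",
                 "2018-12-01 12:00:00", "2018-12-05 09:30:00"),
  stringsAsFactors = FALSE
)

# Parse the timestamps as UTC and count tweets per calendar day.
stamps <- as.POSIXct(toy$created_at, tz = "UTC")
per_day <- table(as.Date(stamps))

# Number of days between the earliest and latest tweet.
window_days <- as.numeric(diff(range(as.Date(stamps))))
```

Here `per_day` holds the daily counts and `window_days` the distance between the first and last day, which for the toy data is 8 days.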