The main purpose of this project is to build a database from the data set american-election-tweet.xlsx/.csv such that the following questions can be answered:
Which tweets were the most important?
Which hashtags were used the most?
When did the most hashtags occur?
Which hashtags frequently occur together?
How has the frequency of special hashtags developed over time?
To sum up, the database allows querying for specific information about hashtags and their correlations.
The data set american-election-tweet.xlsx consists of 6126 entries with the following eleven attributes: {handle, text, is_retweet, original_author, time, in_reply_to_screen_name, is_quote_status, retweet_count, favorite_count, source_url, truncated}
The data set can be split into two subsets in order to compare the tweets posted on behalf of HillaryClinton with those posted on behalf of realDonaldTrump. The attributes original_author and in_reply_to_screen_name contain only a few entries, so their importance needs to be evaluated. The attribute truncated shows that only 40 tweets were actually truncated; therefore over 99% of the tweets are complete.
To sum up, the data set is complete except for the attributes original_author and in_reply_to_screen_name. The tweets need to be evaluated for their keywords, their content for labeling purposes, and especially their hashtags for later queries.
The data set american-election-tweet.xlsx was manually downloaded from http://kvv.imp.fu-berlin.de under the subsection resources of the database course. In the code sections, results are either mentioned as comments or displayed below the actual command.
library(readxl)
data <- read_excel("data/american-election-tweets.xlsx")
#what attributes are available? 11 attributes available
# colnames(data)
# handle, text, is_retweet, original_author, time, in_reply_to_screen_name,
# is_quote_status, retweet_count, favorite_count, source_url, truncated
# dimension
# dim(data) # 6126 x 11
# structure of the data
# str(data)
#summary of the data
summary(data)
## handle text is_retweet
## Length:6126 Length:6126 Length:6126
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## original_author time in_reply_to_screen_name
## Length:6126 Length:6126 Length:6126
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## is_quote_status retweet_count favorite_count source_url
## Length:6126 Min. : 123 Min. : 274 Length:6126
## Class :character 1st Qu.: 1485 1st Qu.: 3976 Class :character
## Mode :character Median : 2858 Median : 7779 Mode :character
## Mean : 4410 Mean : 11721
## 3rd Qu.: 5392 3rd Qu.: 15076
## Max. :490180 Max. :660384
## truncated
## Length:6126
## Class :character
## Mode :character
##
##
##
handle <- unique(data$handle) # "HillaryClinton" "realDonaldTrump"
text <- unique(data$text) # many different texts -> further analysis required
is_retweet <- unique(data$is_retweet) # true false
original_author <- unique(data$original_author) # 268 different authors
# -> further analysis required
timeRange <- range(data$time) # "2016-01-05T03:36:53" "2016-09-28T00:22:34"
in_reply_to_screen_name <- unique(data$in_reply_to_screen_name)
# NA "HillaryClinton" "Lesdoggg" "elizabethforma"
# "MaryKayHenry" "montgomeriefdn" "JLo" "POTUS" "tomcolicchio"
is_quote_status <- unique(data$is_quote_status) # true false
retweet_countRange <- range(data$retweet_count) # 123 490180
favorite_countRange <- range(data$favorite_count) # 274 660384
unique_url <- unique(data$source_url)
# "https://studio.twitter.com"
# "http://twitter.com"
# "https://about.twitter.com/products/tweetdeck"
# "http://twitter.com/download/iphone"
# "http://twitter.com/download/android"
# "http://twitter.com/#!/download/ipad"
# "http://instagram.com"
# "https://mobile.twitter.com"
truncated <- unique(data$truncated) # true false
# How many missing values in each attribute from 6126 entries?
sum(is.na(data$handle)) # 0
sum(is.na(data$text)) # 0
sum(is.na(data$is_retweet)) # 0
sum(is.na(data$original_author)) # 5465
sum(is.na(data$time)) # 0
sum(is.na(data$in_reply_to_screen_name)) # 5930
sum(is.na(data$is_quote_status)) # 0
sum(is.na(data$retweet_count)) # 0
sum(is.na(data$favorite_count)) # 0
sum(is.na(data$source_url)) # 0
sum(is.na(data$truncated)) # 0
For further information concerning the attributes, please consult https://dev.twitter.com/overview/api/tweets.
The handle attribute probably refers to the account under which the tweet was posted. There are only two values for this attribute, HillaryClinton and realDonaldTrump; the domain is therefore equal to {HillaryClinton, realDonaldTrump}.
The text attribute consists of the content of the tweet, of which only about 40 entries were truncated. The tweets contain plain text, hashtags and links, so further analysis is needed. The domain is vast.
The is_retweet attribute shows whether a tweet has been retweeted by the authenticating user; its domain is {true, false}.
The time attribute of the entries starts at '2016-01-05 03:36:53' and stops at '2016-09-28 00:22:34'. The entries are stored in reverse chronological order, so the cleaning process will reorder them chronologically. The domain lies between these two points in time.
The in_reply_to_screen_name attribute contains the screen name of the original tweet's author if the tweet is a reply. 5930 entries are NA, and its domain consists of {NA, HillaryClinton, Lesdoggg, elizabethforma, MaryKayHenry, montgomeriefdn, JLo, POTUS, tomcolicchio}.
“This field only surfaces when the Tweet is a quote Tweet. This attribute contains the Tweet object of the original Tweet that was quoted” (Twitter documentation). The domain consists of {true, false}.
The retweet_count attribute states how many times the tweet has been retweeted. Its values range between 123 and 490180 without any NAs.
“Indicates approximately how many times this Tweet has been liked by Twitter users” (Twitter documentation). The favorite_count attribute has no NAs and ranges between 274 and 660384.
“Utility used to post the Tweet, as an HTML-formatted string. Tweets from the Twitter website have a source value of web.” (Twitter documentation). The source_url attribute has 8 unique values, hence its domain consists of {
https://studio.twitter.com
http://twitter.com
https://about.twitter.com/products/tweetdeck
http://twitter.com/download/iphone
http://twitter.com/download/android
http://twitter.com/#!/download/ipad
http://instagram.com
https://mobile.twitter.com }
The truncated attribute indicates whether the tweet is truncated (true) or not (false). Only 40 entries consist of truncated tweets, so an accurate text analysis can be performed. The domain consists of {true, false}.
We see that there are two values for handle, so we can split the whole data set into two subsets: HillaryClinton and realDonaldTrump. As the goal of the project is the analysis of the tweets, our model assumption focuses on the text, which includes the hashtags. In combination with the attributes is_retweet or favorite_count we can measure the impact of the tweets and moreover analyze it in conjunction with original_author, in_reply_to_screen_name and source_url. The essential attribute time will be attributed to each tweet. Further exploratory data analysis will identify crucial points in order to adapt our assumptions and model in the next iterations of the project.
table(data$handle, data$is_retweet)
##
## False True
## HillaryClinton 2534 545
## realDonaldTrump 2931 116
# table(data$handle, data$original_author)
# table(data$original_author, data$is_retweet)
# table(data$truncated) # only 40 tweets were truncated in the dataset
# range of the number of retweeted tweets per original_author
range(table(data$original_author, data$is_retweet)[,2])
## [1] 1 79
# distribution of each handle's tweets by quote status
table(data$handle, data$is_quote_status)
##
## False True
## HillaryClinton 2947 132
## realDonaldTrump 2980 67
Based on our assumptions we decided to build the following ER model [Figure 1], which puts the emphasis on the tweet text and relates the other attributes around it in an adequate fashion. This allows querying for information in the tweets, which will mostly relate to the hashtags. The entity Content represents the text of the tweet and may include several hashtags, links and target handles; the rest of the content is plain text without any special characters. The handle and time attributes could be used to uniquely identify the tweets. If it is determined that a tweet was posted in response to another one, then the attribute reply_handle will be associated with it. Finally, if the tweet is a retweet, the attribute original_handle representing the original author will be included. The exact evaluation of how hashtags relate to other hashtags will take place in the application; the results will either be cached or additionally attributed to the hashtag object.
Figure 1: ER-Model
The tweet with all its elements will be considered as a unit. The boolean queries for is_retweet (alias retweet) and in_reply_to_screen_name (alias reply) will be externalized in order to speed up queries. The content is furthermore associated with four classifiers (text without special characters, hashtag, link and target handle), which allows the data to be stored in an organized fashion and lets the system optimize its queries for such content; a sketch of such queries follows the relation list below. The attributes is_quote_status and truncated could be excluded in our use case, but for query purposes they will each remain in our model as a plain attribute, especially as we expect that the user is unlikely to pose such queries.
Tweet(DateTime, FavouriteCount, SourceURL, QuoteStatus, Truncated, RetweetCount, Handle),
Content(ID),
Text(ID),
Hashtag(ID, hashtag),
Link(ID, link),
TargetHandle(ID, targetHandle),
Access to the content:
Content is in Tweet(ID, DateTime, Handle)
Boolean queries with a loop, if many elements occur:
Content_Text(Content_ID, Text_ID),
Content_Hashtag(Content_ID, Hashtag_ID),
Content_Link(Content_ID, Link_ID),
Content_TargetHandle(Content_ID, TargetHandle_ID),
Boolean queries:
Tweet is Retweet(DateTime, Handle, OriginalHandle)
Tweet is a Reply(DateTime, Handle, ReplyHandle)
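Against these relations, the externalized boolean checks could be phrased roughly as follows. This is only a sketch: it assumes Retweet and Reply relations keyed by (DateTime, Handle) as listed above, whereas the concrete schema in the next section keys them by a surrogate tweet id instead; the literal values are placeholders.

-- Was the tweet identified by (DateTime, Handle) a retweet, and of whom?
-- The literal values below are placeholders for illustration only.
select OriginalHandle
from Retweet
where DateTime = '2016-09-28 00:22:34'
  and Handle = 'realDonaldTrump';

-- Boolean check: is the same tweet a reply to someone?
select exists (
    select 1
    from Reply
    where DateTime = '2016-09-28 00:22:34'
      and Handle = 'realDonaldTrump'
) as is_reply;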
createdb Election
In the second iteration the goals consist in creating the database schema, cleaning the data set and populating the tables.
create table Tweet(
id integer Primary Key,
datetime timestamp not null,
favouriteCount integer not null,
sourceUrl character varying(50) not null,
quoteStatus boolean not null,
truncated boolean not null,
retweetCount integer not null,
handle character varying(15) not null
);
create table Content(
id integer Primary key,
idTweet integer references Tweet (id)
);
create table Text(
id integer Primary key,
idContent integer references Content (id),
plaintext text not null
);
create table Hashtag(
id integer Primary key,
idContent integer references Content (id),
hashtag character varying(20) not null
);
create table Link(
id integer Primary key,
idContent integer references Content (id),
link character varying(45) not null
);
create table TargetHandle(
id integer Primary key,
idContent integer references Content (id),
targetHandle character varying(30) not null
);
create table Retweet(
id integer Primary key,
idTweet integer references Tweet (id),
originalHandle character varying(45) not null
);
create table Reply(
idTweet integer references Tweet (id),
replyHandle varchar(15) not null
);
Link to the sql files: https://github.com/basacul/DBAssignment/tree/master/sql
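To illustrate how this schema supports the questions from the introduction, the following queries are a minimal sketch. They assume the tables have been populated as described in the next section; whether hashtags are stored with the leading '#' and in which letter case depends on the sanitizer, so the literal '#MAGA' below is only a placeholder.

-- Which hashtags were used the most?
select h.hashtag, count(*) as usages
from Hashtag h
group by h.hashtag
order by usages desc
limit 10;

-- How has the frequency of a specific hashtag developed over time (per month)?
select date_trunc('month', t.datetime) as month, count(*) as usages
from Hashtag h
join Content c on c.id = h.idContent
join Tweet t on t.id = c.idTweet
where h.hashtag = '#MAGA'
group by date_trunc('month', t.datetime)
order by month;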
The cleaning of the data is done by the Java class DataSanitizer, which reads the american-election-tweet.csv file, cleans it and saves it as a new data set sanitized_american_election_tweet.csv.
Link to the java files: https://github.com/basacul/DBAssignment/tree/master/java
Moreover, the Java class DataSanitizer reads each cleaned line from the data set and populates the tables that were declared in the section Creating the database schema.
Link to the java files: https://github.com/basacul/DBAssignment/tree/master/java
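The insert statements issued by DataSanitizer are not listed in this report; the following is only a sketch of how a single cleaned tweet could be inserted under the schema above. All literal values (ids, counts, texts and handles) are made up for illustration.

-- Insert one cleaned tweet: first the Tweet row, then its Content row and
-- the classified parts of the text (plain text, hashtag, link).
insert into Tweet (id, datetime, favouriteCount, sourceUrl, quoteStatus,
                   truncated, retweetCount, handle)
values (1, '2016-09-28 00:22:34', 15076, 'http://twitter.com/download/iphone',
        false, false, 5392, 'realDonaldTrump');

insert into Content (id, idTweet) values (1, 1);

insert into Text (id, idContent, plaintext) values (1, 1, 'Thank you America!');
insert into Hashtag (id, idContent, hashtag) values (1, 1, '#MAGA');
insert into Link (id, idContent, link) values (1, 1, 'https://t.co/placeholder');

-- If the tweet is a retweet, the externalized boolean relation is filled as well:
insert into Retweet (id, idTweet, originalHandle) values (1, 1, 'someOriginalHandle');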
Put the file 'postgresql-42.1.1.jar' (or the newest version from
https://jdbc.postgresql.org/download.html) into the extensions folder
of your Java installation path:
(e.g. on macOS)
cd /Library/Java/Extensions/
(e.g. on macOS, with lunchy installed via brew)
lunchy start postgres
IMPORTANT: Make sure your user is an admin in Postgres.
(Check this via '\du' inside psql.)
sudo apachectl start
brew install php55
Remove the '#' from the line
'#LoadModule php5_module libexec/apache2/libphp5.so'
or copy and paste the following line into the Apache conf file:
LoadModule php5_module $HOMEBREW_PREFIX/Cellar/php55/5.5.9/libexec/apache2/libphp5.so
sudo apachectl restart
We downloaded SigmaJS and used its features to carry out the hashtag visualization. We tried out different representations, with a random arrangement of the hashtags as well as arrangements based on their timeline and frequency.
Visualization of the hashtag network
In this representation each hashtag is listed only once, and its edges connect it to the other hashtags that are mentioned in the same tweet. The arrangement of the nodes is generated with a random generator, which makes better use of the whole canvas area and is more pleasant to look at.
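The edge list for this network can be computed directly from the database. The following query is a sketch under the schema from the previous sections: it pairs hashtags that reference the same Content row, i.e. hashtags mentioned in the same tweet, and counts how many tweets each pair shares.

-- Hashtag co-occurrence: one row per pair of distinct hashtags that appear
-- in the same tweet, with the number of shared tweets as edge weight.
select h1.hashtag as source, h2.hashtag as target,
       count(distinct h1.idContent) as weight
from Hashtag h1
join Hashtag h2 on h1.idContent = h2.idContent
               and h1.hashtag < h2.hashtag  -- avoid self-loops and duplicate pairs
group by h1.hashtag, h2.hashtag
order by weight desc;

The resulting source/target/weight rows can then be exported and passed to SigmaJS as the edges of the graph.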