Wrangle Report

Synopsis

Along the process of Data Wrangling of tweet data from the Twitter user @dog_rates also known as WeRateDogs, I have found several issues in the Quality and Structure of data collected by different means.

I have analyzed, cleaned and combined all the data into a new DataFrame. And stored it in twitter_archive_master.csv file.

Note : The version of file is from the First Iteration of Wrangling. This is not totally free of issues.

Introduction

This Wrangle Report is a part of a Data Analyst Nanodegree offered by Udacity. The project aims to gather data from Twitter API and Udacity provided tweet data, to create analysis about the tweets and the predicted dog’s breed.

Data Wrangling follows,

Data Gathering

Data Assessing

Data Cleaning

1. Data Gathering

I have gathered the files twitter_archive_enhanced.csv and image_predictions.tsv, which are provided by Udacity using the requests package,

The twitter_archive_enhanced.csv file contains basic tweet data (tweet ID, timestamp, text, etc.) for 2356 of their tweets as they stood on August 1, 2017. As I need further information from the WeRateDogs user, I have gathered data from Twitter API using the tweepy package and stored it as text_json.txt for the above mentioned period (querying by tweet_id present in twitter_archive_enhanced.csv)

I have extracted some features like retweet_count, favorite_count and hashtag from the json and stored it in separate file named retweet_favorite_count.csv

The gathered data are loaded into three different DataFrame,

archive : Loaded data from twitter_archive_enhanced.csv

img_pred : Loaded data from image_predictions.tsv

retweet_favorite_count : Loaded data from retweet_favorite_count.csv

2. Data Assessing

The two types of Data Assessment performed,

Visual assessment: Each piece of gathered data is displayed in the Jupyter Notebook. Once displayed, data are additionally assessed in an external application (Excel)

Programmatic assessment: pandas’ functions and/or methods are used to assess the data.

Legend:

Method : V (Visual) / P (Programmatic)

Issue Type : Q (Quality) / T (Tidiness)

Table 1 - Summary of Issues Identified.

Issue ID	Table	Issue Type	Dimension	Method	Column	Description
1	archive	Q	Validity	P	tweet_id	Type int instead of string.
2	img_pred	Q	Validity	P	tweet_id	Type int instead of string.
3	retweet_favorite_count	Q	Validity	P	tweet_id	Type int instead of string.
4	archive	Q	Validity	P	timestamp	Type string instead of timestamp
5	archive	Q	Accuracy	V	rating_numerator	Surprising values in rating_numerator column.
6	archive	Q	Accuracy	V	rating_denominator	Surprising values in rating_denominator.
7	archive	Q	Consistency	V	rating_numerator rating_denominator	Inconsistent rating.
8	archive	Q	Consistency	P	expanded_urls	sometimes have more than one value separated by comma (,).
9	archive	Q	Validity	V	name	Invalid names or non-standard names.
10	archive	Q	Consistency	P	text	‘&amp’ and ‘\n’ symbol present.
11	archive	T	-	P	source	HTML tags, URL, and content in a single column.
12	archive	Q	Consistency	V	doggo, floofer, pupper, and puppo	These are categorical variable, but represented as separate column.
13	archive	T	-	P	text	There is two information in a single column,the text and the URL.
14	img_pred	Q	Consistency	V	p1, p2, and p3	Dog’s breed has no standard. Capital letter or lowercase names. Separated by _/-.
15	img_pred	Q	Consistency	V	p1, p2, and p3	One Prediction data with TRUE and higher confidence is enough among the top three predictions.
16	archive, img_pred, retweet_favorite_count	T	-	P	tweet_id	Duplicate column in three DataFrame.
17	archive, img_pred, retweet_favorite_count	T	-	P	“all”	Number of Observations differ.`archive`(2356),`img_pred`(2075), `retweet_favorite_count`(2331 - as some tweet ID status is not available now).
18	archive	Q	Completeness	V	in_reply_to_status_id, in_reply_to_user_id, retweeted_status_timestamp, retweeted_status_id, retweeted_status_user_id	Only original ratings (no retweets/reply) must be considered(as per project requirement). Uninterested columns present.
19	archive	Q	Completeness	P	in_reply_to_status_id, in_reply_to_user_id	Observations for Uninterested reply columns present(78).
20	archive	Q	Completeness	P	retweeted_status_timestamp, retweeted_status_id, retweeted_status_user_id	Observations for Uninterested retweet columns present(181).

3. Data Cleaning

First Step: I have copied all the three DataFrames using .copy() method,

archive_copy = archive.copy()

img_pred_copy = img_pred.copy()

ret_fav_hash_copy = retweet_favorite_count.copy()

Further Steps:

I have dropped the Uninterested observations for Reply(78) and Retweet(181) by their index. And those features were also dropped.
For Erroneous data types, I have converted it using .astype() method for tweet_id and pd.to_datetime() method for timestamp
For the surprising rating_numerator and rating_denominator, I have done text scraping for corresponding observation. From the text, it was observed that some observations had inaccurate rating and changed it manually. For tweets without any rating, gave default rating as 10/10. And some other high values is true and are due to tweeter has given ratings to group of dogs present in the post as the sample shown below.

Figure a

Figure b

In order to solve the consistency problem of rating, I have performed Feature Engineering, where

rating = rating_numerator / rating_denominator
From img_pred_copy, I have selected one Prediction data with TRUE and higher confidence, bundled as breed and confidence. Remaining uninterested columns were dropped.
The dog’s names issue was solved by replacing the values starting with lower case with ‘None’.
Source name for source column was found using regex and replaced with the same. The URL in text column was removed using regex. ‘&amp’ and ‘\n’ symbol replced with ‘&’ and ’ ’.
I have removed the additional and duplicated URL from expanded_urls using set().
I have capitalized the breed names using string.capwords().
I have also merged 4 columns (doggo, pupper, puppo, and floofer) into one, which I have bundled and named as stage. Where, some have more than one stage, I have done text scraping again and changed the stages manually if it is inaccurate.
Finally, I have solved the tidiness issues by merging all the three data frames by tweet_id and stored it in master DataFrame df.

Conclusions

In First Iteration, I have documented 20 issues. But, this master data is not totally free of issues, as Data Wrangling is an iterative process.

I have stored the wrangled data in twitter_archive_master.csv file with a minored number of issues, and ready for a Data Analysis. This file has 1964 observations and 16 features.