Along the process of Data Wrangling of tweet data from the Twitter user @dog_rates also known as WeRateDogs, I have found several issues in the Quality and Structure of data collected by different means.
I have analyzed, cleaned and combined all the data into a new DataFrame. And stored it in twitter_archive_master.csv file.
Note : The version of file is from the First Iteration of Wrangling. This is not totally free of issues.
This Wrangle Report is a part of a Data Analyst Nanodegree offered by Udacity. The project aims to gather data from Twitter API and Udacity provided tweet data, to create analysis about the tweets and the predicted dog’s breed.
Data Wrangling follows,
- Data Gathering
- Data Assessing
- Data Cleaning
I have gathered the files twitter_archive_enhanced.csv and image_predictions.tsv, which are provided by Udacity using the requests package,
The twitter_archive_enhanced.csv file contains basic tweet data (tweet ID, timestamp, text, etc.) for 2356 of their tweets as they stood on August 1, 2017. As I need further information from the WeRateDogs user, I have gathered data from Twitter API using the tweepy package and stored it as text_json.txt for the above mentioned period (querying by tweet_id present in twitter_archive_enhanced.csv)
I have extracted some features like retweet_count, favorite_count and hashtag from the json and stored it in separate file named retweet_favorite_count.csv
The gathered data are loaded into three different DataFrame,
archive: Loaded data fromtwitter_archive_enhanced.csvimg_pred: Loaded data fromimage_predictions.tsvretweet_favorite_count: Loaded data fromretweet_favorite_count.csv
The two types of Data Assessment performed,
- Visual assessment: Each piece of gathered data is displayed in the Jupyter Notebook. Once displayed, data are additionally assessed in an external application (Excel)
- Programmatic assessment: pandas’ functions and/or methods are used to assess the data.
Legend:
- Method : V (Visual) / P (Programmatic)
- Issue Type : Q (Quality) / T (Tidiness)
Table 1 - Summary of Issues Identified.
| Issue ID | Table | Issue Type | Dimension | Method | Column | Description |
|---|---|---|---|---|---|---|
| 1 | archive | Q | Validity | P | tweet_id | Type int instead of string. |
| 2 | img_pred | Q | Validity | P | tweet_id | Type int instead of string. |
| 3 | retweet_favorite_count | Q | Validity | P | tweet_id | Type int instead of string. |
| 4 | archive | Q | Validity | P | timestamp | Type string instead of timestamp |
| 5 | archive | Q | Accuracy | V | rating_numerator | Surprising values in rating_numerator column. |
| 6 | archive | Q | Accuracy | V | rating_denominator | Surprising values in rating_denominator. |
| 7 | archive | Q | Consistency | V | rating_numerator rating_denominator |
Inconsistent rating. |
| 8 | archive | Q | Consistency | P | expanded_urls | sometimes have more than one value separated by comma (,). |
| 9 | archive | Q | Validity | V | name | Invalid names or non-standard names. |
| 10 | archive | Q | Consistency | P | text | ‘&’ and ‘\n’ symbol present. |
| 11 | archive | T | - | P | source | HTML tags, URL, and content in a single column. |
| 12 | archive | Q | Consistency | V | doggo, floofer, pupper, and puppo |
These are categorical variable, but represented as separate column. |
| 13 | archive | T | - | P | text | There is two information in a single column,the text and the URL. |
| 14 | img_pred | Q | Consistency | V | p1, p2, and p3 | Dog’s breed has no standard. Capital letter or lowercase names. Separated by _/-. |
| 15 | img_pred | Q | Consistency | V | p1, p2, and p3 | One Prediction data with TRUE and higher confidence is enough among the top three predictions. |
| 16 | archive, img_pred, retweet_favorite_count |
T | - | P | tweet_id | Duplicate column in three DataFrame. |
| 17 | archive, img_pred, retweet_favorite_count |
T | - | P | “all” | Number of Observations differ.archive(2356),img_pred(2075), retweet_favorite_count(2331 - as some tweet ID status is not available now). |
| 18 | archive | Q | Completeness | V | in_reply_to_status_id, in_reply_to_user_id, retweeted_status_timestamp, retweeted_status_id, retweeted_status_user_id |
Only original ratings (no retweets/reply) must be considered(as per project requirement). Uninterested columns present. |
| 19 | archive | Q | Completeness | P | in_reply_to_status_id, in_reply_to_user_id | Observations for Uninterested reply columns present(78). |
| 20 | archive | Q | Completeness | P | retweeted_status_timestamp, retweeted_status_id, retweeted_status_user_id |
Observations for Uninterested retweet columns present(181). |
First Step: I have copied all the three DataFrames using .copy() method,
- archive_copy = archive.copy()
- img_pred_copy = img_pred.copy()
- ret_fav_hash_copy = retweet_favorite_count.copy()
Further Steps:
I have dropped the Uninterested observations for Reply(78) and Retweet(181) by their index. And those features were also dropped.
For Erroneous data types, I have converted it using .astype() method for tweet_id and pd.to_datetime() method for timestamp
For the surprising rating_numerator and rating_denominator, I have done text scraping for corresponding observation. From the text, it was observed that some observations had inaccurate rating and changed it manually. For tweets without any rating, gave default rating as 10/10. And some other high values is true and are due to tweeter has given ratings to group of dogs present in the post as the sample shown below.
In order to solve the consistency problem of rating, I have performed Feature Engineering, where
rating = rating_numerator / rating_denominator
From img_pred_copy, I have selected one Prediction data with TRUE and higher confidence, bundled as breed and confidence. Remaining uninterested columns were dropped.
The dog’s names issue was solved by replacing the values starting with lower case with ‘None’.
Source name for source column was found using regex and replaced with the same. The URL in text column was removed using regex. ‘&’ and ‘\n’ symbol replced with ‘&’ and ’ ’.
I have removed the additional and duplicated URL from expanded_urls using set().
I have capitalized the breed names using string.capwords().
I have also merged 4 columns (doggo, pupper, puppo, and floofer) into one, which I have bundled and named as stage. Where, some have more than one stage, I have done text scraping again and changed the stages manually if it is inaccurate.
Finally, I have solved the tidiness issues by merging all the three data frames by tweet_id and stored it in master DataFrame df.
In First Iteration, I have documented 20 issues. But, this master data is not totally free of issues, as Data Wrangling is an iterative process.
I have stored the wrangled data in twitter_archive_master.csv file with a minored number of issues, and ready for a Data Analysis. This file has 1964 observations and 16 features.
For further information about the Project,
Further Read : Which is a SUPER-WOW breed
Further Read : What are the other factors that impact Re-tweet & Favorite Count?
A work by Keerthana Manoharan