AIRFLOW-DATA PIPELINE

LEANDER LEITAO


Apache Airflow is an open-source data workflow management platform. It started at Airbnb in October 2015 as a solution to manage the company’s increasingly complex workflows. Creating Airflow allowed Airbnb to programmatically author and schedule their data workflows and monitor them via the built-in Airflow user interface.

Airflow uses directed acyclic graphs (DAGs) to manage workflow orchestration. Tasks and dependencies are defined in Python and then Airflow manages the scheduling and execution. DAGs can be run either on a defined schedule (e.g. hourly or daily) or based on external event triggers .



STAGE-1 :DATA GATHERING

The data has been gathered from FINNHUB.IO,an online financial instrument API that provides data on various american listed companies.

I have gathered data relating to the public sentiment based on news media for leading companies in the tech sector including GOOGLE,MICROSOFT,ALIBABA,FACEBOOK etc.The data is recieved in JSON format.

{'buzz': {'articlesInLastWeek': 61, 'buzz': 0.9682, 'weeklyAverage': 63},
  'companyNewsScore': 0.8,
  'sectorAverageBullishPercent': 0.605,
  'sectorAverageNewsScore': 0.5115,
  'sentiment': {'bearishPercent': 0.1429, 'bullishPercent': 0.8571},
  'symbol': 'NFLX'} 



STAGE-2 :DATA TRANSFORMATION and CAPTURE

The python data analysis library pandas transforms the JSON files into a table structured dataframe,arrange companies by bullish(upward trending) sentiment and save it in csv format.

The data transform is done using a jupyter notebook whose executio will be orchestrated by AIRFLOW. The final dataframe looks like so:


Save a copy on my dektop along with the date using the airflow bash operator which allows for UNIX commands or a script file



Save the final datafile onto Google cloud storage where it can be used further for querying,connecting to a BI tool etc.



STAGE-3 :DATA ANALYSIS

Load the datafile into google biquery, a serverless,fully managed data warehouse platform on google cloud


The data is transfered to bigquery using googles bigquery transfer service which can be scheduled to run as per requirements and supports a plethora of file formats.



Since the data transformations have already been done in the notebook ,it can load it directly into bigquery. Google also provides services such as dataflow and dataproc for data transformation in the cloud.


PIPELINE MONITORING

Send an email using the airflow email operator once stage 2 has completed.We can also monitor the individual stages of the pipeline through airflows inbuilt user interface.




>The pipeline is scheduled to ru at 11:00 am GMT daily


PYTHON CODE FOR THE DATA TRANSFORMATION/CLEANING


PYTHON CODE FOR THE PIPELINE