DataHub

DataHub is LinkedIn’s generalized metadata search & discovery tool.

March 16, 2020


Why do we need DataHub?

Data Engineers prefer coding to documenting; Data Scientists prefer modeling to documenting; and Data Analysts and Data Visualization specialists prefer playing with data to documenting.

What is missing is the right process to keep data documentation alive through automation.

By implementing DataHub as a Data Catalogue, we aim to empower all Data Consumers and Producers with a better understanding of how every piece of data is represented and interconnected.

Datasets should be documented automatically without requiring a tedious and manual process.

Architecture

DataHub consists of a modular UI frontend (a Play service) and a Generalized Metadata Architecture (GMA) backend. The new architecture enabled the scope of metadata collection to expand beyond just datasets and jobs.

At the time of writing, DataHub already stores and indexes tens of millions of metadata records that encompass 19 different entities, including datasets, metrics, jobs, charts, AI features, people, and groups. The DataHub roadmap plans to onboard metadata for machine learning models and labels, experiments, dashboards, microservice APIs, and code in the near future.

High-level Architecture

The picture above shows the high-level architecture. Aside from the infrastructure components, DataHub runs four different Docker containers:

  • datahub-gms: Metadata store service (a Rest.li service)

  • datahub-frontend: Play application that serves the DataHub frontend

  • datahub-mce-consumer: Kafka Streams application that consumes the Metadata Change Event (MCE) stream and updates the metadata store

  • datahub-mae-consumer: Kafka Streams application that consumes the Metadata Audit Event (MAE) stream and builds the search index and graph database

The remaining containers and components support the services mentioned above:

  • Elasticsearch: DataHub uses Elasticsearch as a search engine.

  • Ingestion: mce_cli.py, bootstrap_mce.dat, requirements.txt

  • Kafka: It is used for queueing messages in the backend.

  • MySQL: DataHub GMS uses MySQL as the storage infrastructure.

  • Neo4j: Graph database used in the backend to serve graph queries.

Work process flow

  • Docker facilitates the deployment and distribution of applications through containerization. Every service in open source DataHub, including infrastructure components like Kafka, Elasticsearch, Neo4j, and MySQL, has its own Docker image. Docker Compose is used to orchestrate the containers.

  • Install Docker and Docker Compose.

  • Configure Docker to allocate enough hardware resources to the Docker engine. Tested and confirmed configuration: 4 CPUs, 8 GB RAM, 2 GB swap.

  • Clone the datahub repository, start Docker Compose, pull the containers, and build them.

  • Confirm that all the containers are running. To inspect a particular container, run docker logs <<container_name>> (e.g. datahub-gms, datahub-frontend). Run kafkacat -L -b localhost:9092 to check whether the Kafka topics have been created.

  • At this point, you should be able to start DataHub by opening http://localhost:9001 in your browser. However, there is no data just yet.

Data ingestion process

  1. The JSON data format is used because it is a good way to represent an information hierarchy.

  2. The data import utility is the mce_cli.py script.

Sample JSON data

The data importer takes JSON data written as one record per line and sends it to Kafka, specifically to the MetadataChangeEvent topic, from where it reaches DataHub.

{"auditHeader": None, "proposedSnapshot": ("com.linkedin.pegasus2avro.metadata.snapshot.CorpUserSnapshot", {"urn": "urn:li:corpuser:datahub", "aspects": [{"active": True, "displayName": "Data Hub", "fullName": "Data Hub", "email": "datahub@linkedin.com", "title": "CEO"}, {}]}), "proposedDelta": None}

The other utilities under metadata-ingestion extract information from a data source such as LDAP, Kafka, or a database to create the data structure. Those utilities send the data directly to the Kafka component of DataHub without writing to a file.
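To make the flow concrete, here is a minimal sketch (not the actual mce_cli.py implementation) that reads one record per line and produces each record to the MetadataChangeEvent topic. The broker address and file name are assumptions, and the record is sent as plain JSON for brevity; the real utility serializes records with the Avro schema described in the next section.

# Simplified sketch of the importer loop: one record per line, produced to
# the MetadataChangeEvent topic. The broker address and file name are
# assumptions; serialization is reduced to plain JSON for illustration.
import ast
import json

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

with open("bootstrap_mce.dat") as records:
    for line in records:
        # The sample record uses Python literal syntax (None, True, and a
        # tuple naming the union branch), so ast.literal_eval parses it.
        record = ast.literal_eval(line.strip())
        producer.produce("MetadataChangeEvent", json.dumps(record).encode("utf-8"))

producer.flush()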

Kafka topic schema

Any line in our input file should conform to the schema of the Kafka topic. The schema for this topic is automatically generated from the MetadataChangeEvent.pdsc file.

{
  "type": "record",
  "name": "MetadataChangeEvent",
  "namespace": "com.linkedin.mxe",
  "doc": "Kafka event for proposing a metadata change for an entity. A corresponding MetadataAuditEvent is emitted when the change is accepted and committed, otherwise a FailedMetadataChangeEvent will be emitted instead.",
  "fields": [
    {
      "name": "auditHeader",
      "doc": "Kafka audit header. See go/kafkaauditheader for more info.",
      "type": "com.linkedin.avro2pegasus.events.KafkaAuditHeader",
      "optional": true
    }
  ]
}

This is the Pegasus schema. A plugin converts it to Avro, and the Avro version is used as the schema of the Kafka topic.

Each example ingestion script under metadata-ingestion contains separate logic for a specific data platform and does ETL (extract-transform-load). The extract and transform steps are platform specific, which is why they have to be implemented by us if we are using a custom platform.
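As a rough sketch of that split (all names, URNs, and payloads below are illustrative assumptions, not part of the metadata-ingestion scripts), a custom-platform script only needs its own extract and transform steps, while the load step stays the same for every platform:

# Hypothetical ETL skeleton for ingesting metadata from a custom data platform.
# Extract and transform are the platform-specific parts; load (producing MCEs
# to Kafka) is the same for every platform.
import json

def extract_tables():
    """Extract: pull table metadata from the custom platform's catalog (platform specific)."""
    # In a real script this would call the platform's API or query its catalog.
    return [{"name": "orders", "columns": ["id", "amount", "created_at"]}]

def transform(table):
    """Transform: shape the platform's metadata into an MCE-style record (platform specific)."""
    return {
        "auditHeader": None,
        "proposedSnapshot": (
            "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot",
            {
                "urn": f"urn:li:dataset:(urn:li:dataPlatform:myplatform,{table['name']},PROD)",
                "aspects": [],  # schema, ownership, etc. would be added here
            },
        ),
        "proposedDelta": None,
    }

def load(mces, producer, topic="MetadataChangeEvent"):
    """Load: producing to the Kafka topic is the same for every platform."""
    for mce in mces:
        producer.produce(topic, json.dumps(mce).encode("utf-8"))
    producer.flush()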

Can we import CSV files?

It is doable. For that, we need to emit an MCE that contains schema metadata information for the CSV files, or use a custom import utility for CSV files.
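A minimal sketch of such a custom utility is below: it reads the CSV header row, turns each column into a schema field, and wraps the result in an MCE. The aspect layout is simplified for illustration; the real schema metadata aspect has more required fields, and the "csv" platform URN is an assumption.

# Hypothetical CSV importer sketch: derive column names from the header row
# and emit an MCE describing the file as a dataset. The aspect shape below is
# simplified; the real schema metadata aspect has more required fields.
import csv
import os

def csv_to_mce(path):
    with open(path, newline="") as f:
        header = next(csv.reader(f))  # the first row holds the column names
    dataset_name = os.path.splitext(os.path.basename(path))[0]
    return {
        "auditHeader": None,
        "proposedSnapshot": (
            "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot",
            {
                "urn": f"urn:li:dataset:(urn:li:dataPlatform:csv,{dataset_name},PROD)",
                "aspects": [
                    {"fields": [{"fieldPath": column, "nativeDataType": "string"}
                                for column in header]},
                ],
            },
        ),
        "proposedDelta": None,
    }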

Constraints

Two versions of DataHub are maintained:

  • open source DataHub and LinkedIn’s production version

Use case

Airbnb’s Dataportal

Dataportal captures metadata about the different data assets in the form of a connected graph. Nodes in the graph are the various resources: data tables, dashboards, reports, users, teams, business outcomes, etc. Their connectivity reflects their relationships: consumption, production, association, etc.