Talk

DataRentgen: What’s Wrong With OSS Data Catalog and How To Make It Better

In Russian

Faced with the task of collecting data lineage from ETL/ELT processes built on Apache Spark and Apache Airflow, our team hoped it would be simple and that we could use one of the ready-made open source solutions: OpenMetadata, DataHub, or Marquez. Things turned out less rosy: none of these tools met our functionality and performance requirements out of the box. So we began developing our own solution, the DataRentgen service.

I will describe the year and a half we spent developing the tool: user requirements, our R&D on open source solutions and their shortcomings, some back-and-forth between different technologies for collecting and storing lineage, and what we eventually arrived at. DataRentgen is still in active development, but it already collects quite a lot of useful data.
