From one big ETL job to experimenting with data pipelines

Day 2

RU

The Profitero team faced the following problem: they had one large ETL job made up of many iterations, each of which applies some methodology. A change to iteration i also affects iteration i+1, because that iteration is calculated from the results of iteration i, and the effect carries on down the chain.

The following questions arise:

  1. How can we change the methodology of one iteration without affecting production?
  2. How can DS teams run such experiments without the DE team, or at least with minimal DE involvement?
  3. How can we run 10 experiments at the same time and choose the best changes to roll out to production?
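
To make the questions concrete, here is a minimal sketch in plain Python. It is not the speakers' solution and uses none of the tools from the talk; the names `run_pipeline` and `run_experiment` and the toy arithmetic steps are hypothetical. It only shows how a change to one iteration propagates to the iterations after it, and how several variants can be evaluated side by side while the production steps stay untouched.

```python
from typing import Callable, Dict, List

Step = Callable[[int], int]


def run_pipeline(steps: List[Step], source: int) -> List[int]:
    """Run the steps in order; each step consumes the previous step's output."""
    outputs = []
    value = source
    for step in steps:
        value = step(value)
        outputs.append(value)
    return outputs


def run_experiment(steps: List[Step], source: int, index: int, replacement: Step) -> List[int]:
    """Run a copy of the pipeline with one step swapped; the original list is not mutated."""
    trial = list(steps)
    trial[index] = replacement
    return run_pipeline(trial, source)


# Hypothetical production pipeline with three toy iterations.
production_steps: List[Step] = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
production = run_pipeline(production_steps, 10)

# Two hypothetical variants of the second iteration (index 1), evaluated independently.
experiments: Dict[str, Step] = {
    "triple": lambda x: x * 3,
    "square": lambda x: x * x,
}
results = {
    name: run_experiment(production_steps, 10, 1, variant)
    for name, variant in experiments.items()
}

print("production:", production)
for name, outputs in results.items():
    print(name, outputs)
```

In each experiment the output of the first iteration matches production, while the second and third differ, which is the downstream effect described above. In a real pipeline the copy of the step list would be replaced by isolated copies of the data and the job definitions.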

Technology stack: Apache Spark, Apache Airflow, Jupyter, Apache Zeppelin, Docker Swarm, LakeFS.

Audience: anyone who needs to run experiments in data pipelines.

  • #process
  • #pipeline
  • #etl

Speakers

Invited experts