1. October 5

    Watch recording

    Spark is Done!

    Let's talk about Spark. What did it give data engineers? Why do many of us use it?

    Spark has been around for over 15 years. What problems do we face when using it? Is there anything better? Can it already be replaced with something else?

    Why does %SQLEngineName% slow down, and how can that be fixed? Benchmarks, open source, and more.

    Watch recording

    Vector Search Algorithms in YDB

    YDB has come a long way from applying basic vector search techniques to building a scalable and efficient vector index. The talk walks through the stages of this evolution, including the difficulties encountered along the way and the engineering solutions behind them.
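    As background for the talk, the "basic technique" that any vector index is built to beat is an exact brute-force scan. The toy sketch below (not YDB's implementation; the data and function names are illustrative) shows that baseline: score every stored vector against the query and keep the top k.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def brute_force_knn(query, vectors, k=1):
    """Exact k-NN by scanning every stored vector: O(n * d) per query.
    A scalable vector index exists precisely to avoid this full scan."""
    scored = sorted(vectors.items(),
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [vec_id for vec_id, _ in scored[:k]]

# Hypothetical three-dimensional embeddings keyed by document id
docs = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 1.0, 0.0],
}
print(brute_force_knn([1.0, 0.05, 0.0], docs, k=2))  # → ['a', 'b']
```

    The cost of this scan grows linearly with the number of vectors, which is what forces systems like YDB to move toward approximate, index-backed search at scale.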

    Watch recording

    How We Built a Data Lakehouse Platform on Apache Ozone

    In this talk, I will describe how we migrated from a platform based on Vertica and HDFS to Dota 2 (the second version of our internal analytics platform), an architecture built on Apache Ozone (S3), Trino, Spark, and Iceberg. I will share our experience choosing storage, and explain why we abandoned HDFS and why we chose Apache Ozone as an on-prem implementation of S3.

    Networking and Afterparty

  2. October 6

    Watch recording

    Spark Connect: A New Approach to Working with Apache Spark

    I will tell you about Spark Connect — a new approach to working with Apache Spark that lets you develop the client side of an application in any language, without depending on the JVM. We will cover the architecture of Spark Connect and how it differs from classic Spark. You will also learn about a project where we used the Spark Connect API from C++.
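    As context for the talk: Spark Connect ships with Spark 3.4+ as a gRPC server that decouples clients from the driver JVM. A minimal setup sketch (the package version below is illustrative; check it against your Spark build):

```shell
# Start the Spark Connect gRPC server bundled with the Spark distribution
# (listens on port 15002 by default).
./sbin/start-connect-server.sh \
  --packages org.apache.spark:spark-connect_2.12:3.5.1
```

    Clients then attach with a remote URL such as `sc://localhost:15002` instead of launching Spark in-process — in PySpark, for example, via `SparkSession.builder.remote("sc://localhost:15002").getOrCreate()`. The same wire protocol is what makes non-JVM clients (like the C++ one mentioned in the talk) possible.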

    Watch recording

    Debezium and PostgreSQL After Happy-Path: What Problems Await in Production and How To Solve Them

    Getting change events from sources is quite a common task that can be solved in different ways. One such solution is Debezium. But is it really that simple, and is it always the best choice? I will try to answer these questions and look at Debezium through the lens of the difficulties that arise when implementing change data capture.
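    For readers new to Debezium: its connectors emit change events in a well-known envelope with `before`, `after`, `op`, and `source` fields. The sketch below uses a hand-written sample event (not produced by a real connector) to show the happy-path consumer logic that the talk's production problems lurk behind:

```python
import json

# A hand-written sample in the shape of a Debezium change-event payload.
# "op" codes: c = create, u = update, d = delete, r = snapshot read.
raw_event = json.dumps({
    "payload": {
        "op": "u",
        "ts_ms": 1696500000000,
        "before": {"id": 42, "status": "pending"},
        "after": {"id": 42, "status": "shipped"},
        "source": {"connector": "postgresql", "table": "orders"},
    }
})

def apply_change(state, event_json):
    """Apply one Debezium-style change event to an in-memory keyed state."""
    payload = json.loads(event_json)["payload"]
    op = payload["op"]
    if op in ("c", "u", "r"):
        # Creates, updates, and snapshot reads all carry the new row in "after"
        row = payload["after"]
        state[row["id"]] = row
    elif op == "d":
        # Deletes carry only the old row image in "before"
        state.pop(payload["before"]["id"], None)
    return state

state = apply_change({}, raw_event)
print(state[42]["status"])  # → shipped
```

    The happy path really is this short; the production difficulties the talk covers (snapshots, schema changes, replication-slot handling) live outside this loop.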

    Watch recording

    DataOps Under the Microscope: CRD and Kubernetes Operators for the ETL Test Tube Lifecycle

    How the T-Bank team migrated DataOps to Kubernetes and didn't go crazy. I'll explain how we designed and implemented infrastructure for managing the lifecycle of ETL tasks using Kubernetes operators, automated DAG delivery, and integrated it all into our existing DataOps processes. I'll break down what came of it, where we made mistakes, and what you absolutely shouldn't do.

    Watch recording

    Launching YugabyteDB in Production

    The database is already fronted by read replicas, but it is still not enough — what should you do?

    I'll tell you in detail about our experience with YugabyteDB, which we chose as the solution. We will discuss important settings, nuances from a development perspective, and the bugs we found.

    If you are about to roll YugabyteDB into production, this talk will save you a lot of time and nerves. It will also be interesting to those who use PostgreSQL or another classic relational database and are thinking about its scalability and fault tolerance.

    Watch recording

    Third-Party Runtime Engines for Apache Spark: Our Experience

    Our experience with the Comet and Gluten (Velox) execution engines: from adoption and build quirks to test results on real ETL workloads. I will cover pitfalls and non-obvious points, show performance results, and examine the cases where these engines are useful and where they don't work at all.
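    For orientation: engines like Comet plug into Spark through its plugin mechanism rather than by patching Spark itself. A sketch of enabling Comet on `spark-submit` follows — the jar path is a placeholder and the `spark.comet.*` property names should be verified against the Comet version you actually build:

```shell
# Sketch: enabling the DataFusion Comet plugin for a Spark job.
# /path/to/comet.jar is a placeholder for the jar produced by your Comet build.
spark-submit \
  --jars /path/to/comet.jar \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.comet.enabled=true \
  --conf spark.comet.exec.enabled=true \
  your_job.py
```

    Because these engines replace only the physical execution layer, unsupported operators silently fall back to vanilla Spark — which is exactly why real-ETL testing of the kind described in the talk matters.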

    Watch recording

    DWH in StarRocks: A Year in Production

    The real experience of building a DWH on StarRocks: architecture, use cases, pitfalls, and whether StarRocks met our expectations.

    Watch recording

    Apache Spark SQL: Extend and Manage

    How to configure and adapt Apache Spark to your tasks without rewriting the framework. I will cover approaches to extending Spark SQL without touching the platform's source code. You will learn about creating your own data sources, developing user-defined functions for specialized processing, and implementing optimization rules that adapt to different queries.
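    The extension hook the talk alludes to is Spark's `spark.sql.extensions` property, which points the session at a class that registers custom rules and functions through `SparkSessionExtensions` (methods such as `injectOptimizerRule` and `injectFunction`). A sketch, where `com.example.MyExtensions` is a hypothetical class name:

```shell
# Sketch: wiring a custom SparkSessionExtensions implementation into a session
# without touching Spark's source. com.example.MyExtensions is hypothetical;
# it would call injectOptimizerRule / injectFunction during session startup.
spark-submit \
  --conf spark.sql.extensions=com.example.MyExtensions \
  your_job.py
```

    Custom data sources plug in separately, via the DataSource V2 API, and are picked up by class name or short name at read time.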

    Watch recording

    What Metastore Is

    What a metastore is, how it works in the big data ecosystem, what solutions exist on the market, and why we decided to develop our own. I will share practical experience, our architecture, and the lessons we have learned.