SmartData 2025 Opening Session
We will be talking about the schedule, sessions, and activities. Join us in the hall or online to find out what's in store for you!
Times in the program are shown in your time zone.
The program has not been finalized yet, so changes are still possible.
How we developed an open source data lineage solution based on OpenLineage. We compare it with other open source solutions (OpenMetadata, DataHub, Marquez) and explain why we abandoned them in favor of our own development. No, this is not another custom Data Catalog :)
MTS Web Services (MWS)
A detailed review of the vector search algorithms most widely used in modern database management systems.
How Yandex Market started writing documentation: how it happened and what problems the company faced. We will look at different approaches to describing metadata in storage systems, compare them, and decide whether this path is worth taking.
Yandex Market
How to configure and adapt Apache Spark to your tasks without rewriting the framework. I will cover approaches to extending Spark SQL without touching the platform's source code. You will learn about creating your own data sources, developing user-defined functions for specialized processing, and implementing optimization rules that adapt to different queries.
Chestnyj znak
What a metastore is, how it works in the big data ecosystem, what solutions exist on the market, and why we decided to develop our own. I will share our practical experience, the architecture, and the lessons we have learned.
Positive Technologies
A practical case study from Skyeng on implementing DWH monitoring: from metadata architecture to automated data quality checks and the transition to DataOps practices.
Skyeng
Good data doesn’t happen by accident. I’ll share my experience building a tool that helps validate data automatically — fast, flexible, and pain-free.
Arenadata Catalog
I will talk about Spark Connect, a new approach to working with Apache Spark that lets you develop the client side of an application in any language without depending on the JVM. We will cover the architecture of Spark Connect and how it differs from classic Spark. You will also learn about a project where we used the Spark Connect API for C++.
Yandex
How to build a secure, powerful, and scalable LLM service for a large company: with UI, API, moderation, and model support for completely different tasks.
Kaspersky
How to implement a Data Quality tool with a distributed architecture that runs smoothly for a large number of teams and serves as a single point of truth about data quality in company systems.
MWS Cloud Platform
MWS Cloud Platform
Getting change events from sources is a common task that can be solved in different ways. One such solution is Debezium. But is it really that simple, and is it always the best choice? I will try to answer these questions and look at Debezium in terms of the difficulties that arise when solving the change data capture task.
Let's talk about Spark. What did it give data engineers? Why do many of us use it?
Spark has been around for over 15 years. What problems do we face when using it? Is there anything better? Is it already possible to replace it with something?
Why is %SQLEngineName% slowing down? How can one fix this? Benchmarks, open source, and the like.
Navio
We'll talk about how Wildberries implements a JupyterHub and Kubernetes-based research platform for more than 600 data scientists who solve problems in areas such as CV, NLP, OCR, and recommendations.
Wildberries & Russ
Wildberries & Russ
We will discuss the key challenges that Apache Iceberg is facing, as well as the prospects for technology development.
CedrusData
The database already has read replicas, but that is still not enough. What should one do?
I will talk about how we chose a fault-tolerant and scalable database for storing financial data, which options were eliminated, and by what criteria. I will also explain why we chose YugabyteDB and share our experience with it.
01.tech
The data platform in our company has existed for more than 5 years, and during this time it has absorbed a lot of trendy (and not so trendy) solutions. I will tell you how we tried to choose our future among ClickHouse, Greenplum, and Trino, and found StarRocks.
Real-world experience of building a DWH on StarRocks: architecture, use cases, pitfalls, and whether StarRocks met our expectations.
Peredovye Platezhnye Resheniya
The potential of using AI to automate Data Governance processes on the data platform user side.
T-Bank
How we at T-Bank built our BI tool on Apache Superset, rebuilt our BI culture, created synergy between BI analysts and the developers of the tool, and successfully migrated from Tableau.
T-Bank
I will show how the LZ4, ZSTD, Delta, and DoubleDelta codecs help increase query speed and reduce storage volume. I will also highlight the challenges that arise when using them in projects.
GlowByte
Approaches to loading metadata into the Data Catalog are often considered in a linear way: a minimum of changes, maximum preservation of the "truth". But is this really the right thing to do?
T-Bank
Our experience with the Comet and Gluten (Velox) execution engines, from getting started and build specifics to the results of testing on real ETLs. I will cover pitfalls and non-obvious points, show the results, and consider the cases where these engines are useful and where they don't work at all.
Chestnyj znak
A review and comparison of existing Python libraries and an in-house profiling tool for data quality analysis, with a description of the tool's functionality.
Gazprombank.Tech
Gazprombank.Tech
The story of how a small team of engineers implemented Hadoop with full Kerberos and Ranger-based security without stopping business processes.
Detsky Mir
YDB has come a long way from applying basic vector search techniques to creating a scalable and efficient vector index. The talk presents a detailed analysis of the stages in the evolution of vector search in YDB, including the challenges encountered and the engineering solutions.
I'll tell you how we use Airflow in practice: from the pain of sensors to the convenience of datasets, from static DAGs with a bunch of files to dynamic ones, and from standard features to our own custom solutions, which should interest anyone who actually operates Airflow.
Innovation Center "Safe Transport"
In this talk, I will tell you how we migrated from a platform based on Vertica and HDFS to the new Dota 2 architecture (the second version of our internal analytics platform), based on Apache Ozone (S3), Trino, Spark, and Iceberg. I will share our experience in choosing storage, and explain why we abandoned HDFS and chose Apache Ozone as an on-prem implementation of S3.
Ostrovok!
I'll tell you about an AI assistant that helps users get answers to questions about data. You'll learn how we at X5 Tech manage the quality of answers and how data and data descriptions affect the final result.
X5 Tech
We will summarise the results of the conference, recall the highlights, and talk about future plans. Join us in the hall or online so you don't miss a thing!
We are actively adding to the program. Sign up for our newsletter to stay informed.