2 offline days
September 13–14 10:00–19:30 (UTC+3)
Offline: Hotel MonArch, Leningrad Avenue, 31А, building 1, Moscow, Russian Federation
Online broadcast
Why It’s Worth Going
- To see old friends. To discuss current problems. To come up with new ideas. To debate and just chat.
Switch the format to offline
To have a change of scenery, to unwind and have a good time. To gain fresh impressions and make new acquaintances.
Broadcast
The offline part of the conference will be broadcast, and the broadcast is available to participants with any ticket. If you want to meet and interact with the speakers and other participants live, we look forward to seeing you at the venue. If you cannot make it to the venue, recordings of all the talks and activities will be waiting for you on this website.
See for yourself
Program
September 13
- Watch recording
Talk type: Conversation
SmartData 2023 In-person Opening
We will talk about the schedule and sessions and share useful information. Come to the room or join the broadcast to find out what to expect!
Mikhail Maryufich
Company: Odnoklassniki
Alexey Fyodorov
Company: JUG Ru Group
- Watch recording
Talk type: Talk
I’ll Change the Way You Look at Data Storage in 30 Minutes
For many business tasks we rely on our DWH, Data Lake, LakeHouse, etc., built in the image and likeness of how OLAP spreadsheets did it years ago. But business tasks and data processes have changed a lot since then, and for some businesses this approach is fundamentally wrong, because the nature of their data is different from what it was decades ago. The speaker will talk about: how data is different in today's businesses; the approach that Google proposed in its 2015 article; the problems this approach solves; the new problems it creates, and what to do about them now.
Maksim Statsenko
Company: Yandex
- Watch recording
Talk type: Talk
Open Source BI in a Large Company. Is it possible?
Does it hurt to replace a BI tool that has been established for years with something new, and open source at that? Ilya will tell you how they chose the tool, what difficulties they encountered and how users reacted to it.
Ilya Anikin
Company: Avito
- Watch recording
Talk type: Talk
Common Data Index. How to Build an Open Data Search Engine Like Google Dataset Search, but Easier and Faster
Building a database of almost all publicly available datasets is not that difficult. What matters is collecting their primary sources and properly designing the architecture for dataset collection and analysis...
Ivan Begtin
Company: API Crafter
- Watch recording
Talk type: Talk
How We Adapted Dynamic YTsaurus Tables to Store Blobs
To improve the efficiency of YTsaurus, the team decided to extract blobs and store them separately from "normal" tabular data. They had to modify the compaction algorithms to be able to collect "garbage" among the blocks and to provide a suitable tradeoff between disk space (space amplification) and the amount of data that is constantly rewritten (write amplification). They also applied this approach to a number of tables that were kept in RAM. As a result, they moved (under the guise of blobs!) some of that data to disk and reduced RAM consumption several times over, while maintaining low read times at high quantiles. During implementation, the IO stack had to be significantly improved by switching to io_uring, and the block-storage layer by adding a consistent hashing algorithm to choose how data replicas are placed.
Maksim Babenko
Company: Yandex
- Watch recording
Talk type: Talk
Data Depersonalization Methods
Alexey will talk about data depersonalization methods and consider a risk assessment model along with utility and anonymization metrics. The data depersonalization product was written using Spark and Python.
Aleksei Danshin
Company: Neoflex
- Watch recording
Talk type: Talk
How to Make Your Apache NiFi Feel Bad
NiFi is a very powerful tool, and it can cover a very wide range of tasks. However, some tasks make NiFi feel bad. The speaker will share his view on such tasks.
A talk on how not to use NiFi: which cases NiFi can implement, how to implement them, and why you should not.
Bronislav Zhitnikov
Company: Tinkoff
- Watch recording
Talk type: Partner’s talk
Building Disaster-Resistant Data Warehouses
With the exit of foreign vendors, building disaster-resistant data warehouses has become even more difficult. Surely, many people have faced this problem and understand the difficulties of implementing such warehouses based on Greenplum. Alexander will talk about possible solutions and the best ways to build them, and show the most successful approaches.
Aleksandr Tarasov
Company: Arenadata
- No recording
Talk type: Partner’s game
Graph Challenge from Yandex Search and Advertising Technologies
Graph challenge with three levels of complexity, gifts and an awards ceremony at the afterparty.
- Watch recording
Talk type: Talk
Hadoop in the Cloud is OK
For OK, Hadoop is a key infrastructure component: it is actively used both for product analytics and for recommendation systems in production. In terms of scale, that is more than 200 PB in HDFS, 50k vcores, and 200 TB of RAM. The speaker will talk about the clusters at OK and their migration to an internal container cloud. Expect details of the final solution, an overview of the migration pitfalls, and the benefits of the approach.
Mikhail Maryufich
Company: Odnoklassniki
- Watch recording
Talk type: Talk
A distributed SQL query engine for data analytics
The talk describes the architecture of a distributed open source SQL engine. When executing queries, the engine loads the data into memory. The computation is divided into stages that can run on different nodes. Cross-cluster queries and user-defined functions are supported.
Alexey Ozeritskii
Company: Yandex
- Watch recording
Talk type: Talk
A Couple of Words on How We Implement Data Observability
The speaker will talk about the perennial problem of data quality and detail why and how they built the data quality platform at SberHealth. He will cover working with Great Expectations and integration with the data catalog (DataHub), and explain what happens after they find "broken" data.
- Watch recording
Talk type: Talk
Speed-up queries: How to Cook ClickHouse Well-done
If you know the rules of working with ClickHouse, you can process hundreds of millions of records in a matter of seconds. Drawing on experience of using it, the speaker will tell you about the most popular and effective ways to speed up queries. Indices: they will not help with duplicates, but they significantly reduce the amount of scanned data. Projections: what, how and why. Sharding: how to scale horizontally.
Kuzma Leshakov
Company: Yandex Cloud
- Watch recording
Talk type: Talk
Apache Flink Through the Example of a Deduplication Task
The speaker will talk about the Apache Flink streaming framework. You will learn about the architecture and the main components of Flink, using event stream deduplication as an example.
Aleksandr Bobriakov
Company: MTS Digital
- Watch recording
Talk type: Talk
Mage: The Magic Instrument of Orchestration
The speaker will talk about the move from a legacy data stack to a new one. You will learn what features were discovered in almost half a year of using the new stack, and why it is sometimes worth trying new technologies.
Valentin Panovskiy
Company: more.tv
- Watch recording
Talk type: Talk
Extra-Atmospheric Astronomy and the New Space Telescope "James Webb"
Astronomers are cramped on Earth: the atmosphere is in the way, Elon Musk's satellites are in the way, and the planet itself is too small. For astronomers, space has now become not only an object of research, but also a working platform. What new things have scientists learned with the help of space telescopes and what are the prospects?
Vladimir Surdin
Company: MSU
- No recording
Talk type: Partner’s conversation
The results of the Graph challenge from Yandex Search and Advertising Technologies
Experts of Yandex Search and Advertising Technologies will summarize the results and award the winners and participants of the game.
Ilya Shishov
Company: Yandex
- No recording
Talk type: Game
Quiz
Tired of thinking during the talks? Then we invite you to think at a creative quiz! Together with Liga Indigo, we will hold an intellectual quiz where you can test your intellect and erudition and relax in the company of other conference participants.
- No recording
Talk type: BOF-session
Science and Programmers in Space
We will talk about international space cooperation and competition. We will discuss modern domestic instruments, as well as issues of import substitution in software and hardware for space observations. We will consider how programmers can make a useful contribution to space science.
Vladimir Surdin
Company: MSU
Sergey Cosmos
Company: SR Data
Mikhail Lukin
Company: Sudo
September 14
- Watch recording
Talk type: Talk
Fast data processing in Data Lake with Trino
The speaker will cover the implementation and practical use of key optimizations that allow Trino and related commercial products to quickly "grind" through the data in your lake: using Parquet and ORC metadata to reduce the amount of data read (project/filter/aggregate pushdown), dynamic (runtime) filtering, late materialization of columns, and as many as three local caches: a metadata cache, a data cache, and an intermediate query results cache.
Vladimir Ozerov
Company: Querify Labs
- Watch recording
Talk type: Talk
How to Process Data with Spark in the Cloud
How can you build a data processing pipeline using cloud services (DataProc and DataSphere), set up interaction with a Spark cluster via Jupyter notebooks, and why is it convenient to do this in managed services? How can you teach the system to spin up the cluster for you exactly when you need it, and save money as a result? What challenges do companies face when migrating, and what solutions do they find? What are the specifics of cloud services? What do you need to be prepared for, and what improvements might be needed?
Maksim Zinal
Company: Yandex Cloud
Dmitry Ribalko
Company: Yandex Cloud
- Watch recording
Talk type: Talk
Streaming Data Integration — ETL Tool for Creating Near Real-Time Processes
The speaker will look at the main challenges of implementing real-time tools for analytical tasks and at how the team addressed these challenges in their ETL tool.
Vasilii Melnik
Company: GlowByte
- Watch recording
Talk type: Talk
How We Merged the Data of Delivery Club and Yandex Eda in Two Months
The team had an ambitious task: to combine two full-fledged data warehouses, Delivery Club and Yandex Eda, in just over eight weeks and, before backend integration, to provide reporting with basic business metrics and data on Delivery Club. Olga will tell how they implemented this project: how they collected the scope of tasks, estimated them, and adjusted along the way. She will also talk about the technical implementation of the DWH changes: what solution architecture was devised, what stack was used and why. Of course, there were some pitfalls: we will discuss which ones the team ran into and how to avoid them.
Olga Titova
Company: Yandex Eda
- Watch recording
Talk type: Talk
Moving Towards Universality: A Hybrid OLTP Database with OLAP Query Support
Integrating OLTP and OLAP functions in a distributed database means overcoming traditional barriers on the way to a universal solution. Alexey will talk about the process of developing such a solution, YDB, which combines OLTP and OLAP functionality to perform both transactional and analytical queries simultaneously. He will discuss the main architectural features of such a system, compare it with ClickHouse and other standard solutions, and share his experience in implementing and using this database in real projects.
Aleksei Dmitriev
Company: Yandex
- Watch recording
Talk type: Talk
The Model Serving Journey: from Flask to Own Platform
A talk about the path an engineer goes through to choose their solution for Model Serving. We'll talk about cloud tools, ready-made Inference Servers, their features and selection criteria.
Alina Kocheva
Company: Positive Technologies
- Watch recording
Talk type: Talk
Compression, encryption and more: changing the behavior and guarantees of a distributed database
From the talk you will learn about data compression and encryption on disk and in memory in the context of a distributed database, and about the advantages and disadvantages of both approaches. The speaker will also consider other options for data transformation, such as filtering, and ways to implement them in an open source product.
Anton Vinogradov
Company: Apache Software Foundation
- Watch recording
Talk type: Talk
Development of BI-Analytics Tool, DataOps.BI, Based on Open Source Solution Apache Superset
The speaker will talk about the idea of using open source BI, piloting different solutions, assembling a BI tool team and evolving to meet the requirements of new users and teams migrating from proprietary software (Tableau, Power BI).
Pavel Shestakov
Company: MTS Digital
- Watch recording
Talk type: Talk
Kafka Connect: What Is This Single Message Transform Thing of Yours?
We will consider working with Single Message Transformations (SMT) in Kafka Connect in general, and in Debezium in particular. The speaker will explain what SMT is and how to use it in practice, and walk through the implementation process with code examples. He will cover the pitfalls, discuss customization and configuration, and provide examples of use cases in real-world scenarios.
- Watch recording
Talk type: Talk
Visualization for ELT Processes in DWH
A talk about using dbt: how to use it and how to customize it, including writing materializations, a DDL generator, and problems with temporary tables.
Vitaliy Bodrenkov
Company: SberMarket
- Watch recording
Talk type: Talk
Data Management Platform around YTsaurus
Vladimir will share their experience of building a data management platform around YT and explain where it works well and where it can be supplemented with other frameworks or analytical databases. This topic can be useful for architects and data engineers who are going to build a new DWH or revise the architecture of an existing one, and are facing the hard question of choosing technologies from the Open Source world.
Vladimir Verstov
Company: Yandex Go
- Watch recording
Talk type: Talk
What it takes to achieve linearizability in a distributed system
We'll talk about what linearizable consistency is good for, various ways to achieve it in a distributed system and the tradeoffs it demands.
Sergey Petrenko
Company: Tarantool
- Watch recording
Talk type: Talk
Spark Streaming: To Use or Not To Use?
Apache Spark Streaming is versatile and feature-rich. But there are tasks where Spark Streaming is not the best fit and can become more of a burden than a solution. Evgeny will talk about the advantages and disadvantages of Spark Streaming: when it is worth using this particular tool, and when it is better to consider other options. He will also put together a checklist for using Spark Streaming in projects.
Evgeny Nenakhov
Company: MTS Digital
- Watch recording
Talk type: Talk
Predictive Analysis of Parasitic Load on GreenPlum Clusters
The essence of the problem: since GreenPlum has a shared-nothing architecture and operates at the speed of the slowest segment, situations may arise in which some resources are underutilized or utilized unevenly, which hurts query performance. In highly loaded production systems it is not possible to manually analyze every query for efficiency, and some queries can negatively affect all processes on a GreenPlum cluster. The speaker will tell you how to solve these problems.
Pavel Ternyuk
Company: Data Sapience
Mark Lebedev
Company: GlowByte Consulting
- Watch recording
Talk type: Talk
Application of TLA+ for Efficient Testing of Distributed Systems
In the talk we will examine the problems of developing and testing distributed systems and consider the TLA+ specification language and its application to program verification. In addition, we will describe a method for testing distributed systems based on the actor model that combines the advantages of both fuzzing and TLA+.
Nikita Siniachenko
Company: VK
Evgenii Chernatskiy
Company: VK
- Watch recording
Talk type: Talk
Creation of a Group of Services for the Analysis of Satellite Images Using ML
Lessons learned in developing complex products in the face of foreign partners leaving markets, as well as development tools the company has created to make it easier for studios and freelance developers to enter the market.
Sergey Cosmos
Company: SR Data
- Watch recording
Talk type: Conversation
SmartData 2023 Conference Closing
We take stock, remember the bright moments and talk about our plans. Come to the room or join the broadcast, so you don't miss anything!
Maksim Statsenko
Company: Yandex
Mikhail Maryufich
Company: Odnoklassniki
Discussions
Live conversation with speakers between activities. No recording and no time limit.
BoF
Informal conversations without hosts or speakers. This is where new ideas are born.
Round tables
Speakers and experts discuss current industry issues.
Bonus
Coffee and lunch breaks
Buffet and beverages of your choosing. If you have food restrictions, write to our support team. We’ll find a solution.
Networking and Afterparty
Informal atmosphere, networking for all participants, speakers, and experts. Heart-to-heart talks and an afterparty at the end of the first offline day.
FAQ
Where will the offline part of the conference be held?
The offline part will be held on September 13–14 at the following address: Hotel MonArch, Leningrad Avenue, 31А, building 1, Moscow, Russian Federation.
When will the program and time for the offline part of the conference be known?
We begin publishing the program in batches on the conference website one month in advance.
What activities will be included in the offline part of the conference?
The offline part will include:
- talks;
- roundtables;
- BoF-sessions: interest-based meetings without a fixed schedule;
- discussions with offline and online speakers who will come to the venue;
- an afterparty for participants at the end of the first offline day.
Will there be an online broadcast of the offline part of the conference?
We will broadcast most of the activities of the offline part live: talks, roundtables, etc.
Discussions and BoF-sessions will not be broadcast or recorded.
Offline was so long ago that I no longer remember what the procedure was for offline conferences.
Don’t worry, before the conference we will send you a participant’s memo. It will contain all the necessary information.
Can I buy a ticket only for the offline part of the conference?
To attend the offline part, you must purchase an ONLINE + OFFLINE ticket. It entitles you to attend the offline part of the conference and gives you lifetime access to the recordings of the online part.
How do I get into the offline part if I have an ONLINE ticket?
If you already have a ticket for the online part of the conference, you can upgrade it to ONLINE + OFFLINE. To do so, email our support team at support@smartdataconf.ru.
How do I get to the offline part if the company only paid for my ONLINE ticket?
If the company that paid for your ticket is not willing to upgrade to ONLINE + OFFLINE, you can do it yourself at a discount. The discount is given for taking the survey after the online part of the conference ends.
Is there a limit to the number of tickets for the offline part?
The number of tickets is limited by the capacity of the conference venue.
So it is better to buy tickets in advance while they are available.
Are there any restrictions on going to an offline conference?
There will be no COVID restrictions on attending in person. You don’t need QR codes or PCR tests to enter the venue. For your safety, a qualified medical worker is on duty at the venue at all times.
However, if you’re feeling unwell, it’s best not to attend in person. You will be able to participate in the offline part remotely or watch the talks in the recording.