  • No record

    Talk type: Partner’s BoF-session

    Development of internal DE tools: how to get more than one person to use them

    At a BoF, in contrast to talks and roundtables, there is no division into participants and presenters: everyone interacts as equals.
    The main thing is to stay on topic.

    Let's discuss how to develop internal DE tools to be used by more than one person.

  • Watch recording

    Talk type: Talk

    Using the GrowthBook platform to manage ML experiments

    In this talk, we will describe a way to organise an experiment pipeline based on the open-source GrowthBook platform, in which the responsibility for launching and testing features lies with the ML development team. The proposed approach is intended to reduce the number of integrations on the core development team's side while speeding up the delivery of new versions of machine learning models to production.

  • Watch recording

    Talk type: Talk

    How Data Modeling improves data and report quality and reduces requirements for analyst experience

    We will describe how our approach to client analytics projects has evolved: what data modeling is, why it became the most important phase of our projects, and what benefits a project gets when it starts from a good data model.

  • Watch recording

    Talk type: Talk

    Opening

    We will talk about the schedule, sessions, and share the information. Join the broadcast to find out what's on the air soon!

  • Watch recording

    Talk type: Talk

    The story of how Toloka Ai migrated to Modern Data Stack

    The Toloka Ai data platform team was given the task: "Azure. Modern Data Stack. Tomorrow." The catchy names that every data scientist has heard of and is curious about (Data Lakehouse, Cloud Data Platform, Data Mesh, Data Fabric) have now become the team's new reality.

    The speaker will talk about all the stages of working with Modern Data Stack and will touch on various issues of building a modern data platform: from the choice of tools to the problems of migration.

  • Watch recording

    Talk type: Talk

    Ingest layer of the data platform: mix but do not shake

    A story about how the speaker's team built an Ingest layer for internal and external sources within the SberHealth data platform without forgetting about sensitive data and the data directory. Since the platform has to abstract the components underneath it, we'll also talk about the DSL used to manage it all.


  • Watch recording

    Talk type: Master class

    The many faces of pandas

    In the course of the master class, we will go from a beginner data scientist who can only handle small data with pandas, through parallel processing with Dask, to distributed processing with Spark.

  • Watch recording

    Talk type: Conversation

    In-person Opening SmartData 2022

    We will talk about the schedule, sessions, and share the information. Come to the room or join the broadcast to find out what to expect soon!

  • Watch recording

    Talk type: Talk

    Using Pentaho DI

    The speaker will explain what problems can be solved with Pentaho as an ETL tool. You'll learn how to quickly solve data loading tasks and how to quickly perform data analytics. You'll also learn which Pentaho features help you load a large number of tables into a DWH.

  • Watch recording

    Talk type: Talk

    Data Vault on Greenplum with DBT

    A talk on building a Data Vault on Greenplum using DBT and orchestrating the whole thing with Dagster. We will discuss in detail how to work with DBT and how to start building highly normalized vaults with it.

  • Watch recording

    Talk type: Talk

    What is DevOps in the world of data warehousing?

    Petabytes of data go through PochtaTech's services. Dozens of teams and departments work with it, using a bunch of frameworks and technologies. Most of this data is stored and developed in DataCloud. Vasily will talk about how DevOps practices are used in working with data warehouses and how this can reduce time-to-market.

  • Watch recording

    Talk type: Interview

    Interview with Evgeny Ermakov

    SmartData 2022 hosts will ask Evgeny Ermakov tricky and simple, serious and ironic, straightforward and perhaps even rhetorical questions. Join the conversation and ask your questions in the chat room!

  • Watch recording

    Talk type: Talk

    Big data is a big responsibility. Experience in data leakage protection for analytical systems

    Alexey will talk about his experience of implementing, within a short period of time, technical and administrative measures that helped protect data in analytical systems from potential theft without breaking the company's existing business processes.

    The implemented changes affected the work of more than 3,000 company employees (reporting users, data analysts and engineers).

  • Watch recording

    Talk type: Partner’s talk

    Atypical use of Kafka

    Kafka is most often used as a message broker; in some cases it can also serve as a cache or a database. The speaker's team found another use for it: as a "buffer" in data streams. From the talk, you will learn how they came up with this solution and what non-obvious advantages it provides. It will be interesting for ETL developers, data engineers and architects.

  • Watch recording

    Talk type: Talk

    Automated Spark application tuning

    Valeria will talk about a Hadoop cluster that runs hundreds of daily and thousands of hourly Spark jobs. The jobs are all very different, and each has its own SLA. In this situation, it's unrealistic for engineers to tune them by hand. That's why the team built and rolled out a fully automatic tuning system based on the logs that Spark itself writes. Valeria will show how to easily extract a lot of information from these logs offline and what to look for when automatically tuning spark.executor.memory. She will also explain in detail how the tuning system is set up and what allows it to keep adjusting to changes. The talk will be of interest to those who already work with Spark and have an idea of how it works internally.
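
    To give a feel for the kind of offline log analysis the abstract mentions, here is a minimal sketch. It is not the speaker's system. It assumes the logs are Spark event logs in JSON-lines form, where task-end events carry a "Task Metrics" object with fields such as "Executor Run Time", "JVM GC Time" and "Memory Bytes Spilled". The function names and the 10% GC threshold are hypothetical and chosen only for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class TaskStats:
    """Totals accumulated over all task-end events in one event log."""
    tasks: int = 0
    run_time_ms: int = 0
    gc_time_ms: int = 0
    spilled_bytes: int = 0


def summarize_event_log(lines):
    """Aggregate task metrics from event log lines (one JSON object per line).

    Assumption: task-end events are marked with
    "Event": "SparkListenerTaskEnd" and keep metrics under "Task Metrics".
    """
    stats = TaskStats()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        if event.get("Event") != "SparkListenerTaskEnd":
            continue  # only task-end events carry the metrics used here
        metrics = event.get("Task Metrics") or {}
        stats.tasks += 1
        stats.run_time_ms += metrics.get("Executor Run Time", 0)
        stats.gc_time_ms += metrics.get("JVM GC Time", 0)
        stats.spilled_bytes += metrics.get("Memory Bytes Spilled", 0)
    return stats


def suggest_memory_change(stats, gc_threshold=0.1):
    """Hypothetical heuristic, not a rule from the talk.

    Suggest "increase" for spark.executor.memory when tasks spilled data
    or spent more than gc_threshold of their run time in GC; else "keep".
    """
    if stats.tasks == 0:
        return "keep"
    gc_share = stats.gc_time_ms / stats.run_time_ms if stats.run_time_ms else 0.0
    if stats.spilled_bytes > 0 or gc_share > gc_threshold:
        return "increase"
    return "keep"
```

    In practice the lines would come from an event log file opened in text mode, for example suggest_memory_change(summarize_event_log(open(path))), where path is whatever location the cluster writes its event logs to.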

  • No record

    Talk type: Partner’s BoF-session

    Import substitution of BI solutions. Is everything very bad?

    We will discuss the requirements for BI platforms and see which of them are the most important. Let's talk about dashboards, what functionality Russian platforms offer, and take a walk through the leaders in the Russian market and discuss the pros and cons of each. We will examine the peculiarities of working with Russian BI for companies and their development this year.

  • Watch recording

    Talk type: Talk

    Variety of requirements for a Data Warehouse. How to talk to the customer and cover all important issues

    In the talk, we will examine different categories of requirements for the Data Warehouse and how to take them into account in implementation. As a result, you will have a list of questions for the customer that would be good to clear up before you start building a new DWH. For an already existing DWH, it can be used to isolate problem query patterns for which it is worth dedicating separate subsystems.

    The talk is not tied to a specific technology. Examples will use Impala/Hive, ClickHouse, ScyllaDB/Cassandra, and PostgreSQL.

  • Watch recording

    Talk type: Talk

    100 billion messages in Kafka: load and forget

    Apache Kafka is a great tool for reliably passing messages between services, but offloading its content for offline analytics has proven to be no easy task, especially when we're talking about hundreds of billions of messages a day, every day. Apache Spark comes to the rescue, but unfortunately its out-of-the-box capabilities aren't enough to work reliably and fully automatically on really big data volumes. The speaker will talk about how to offload 100 billion messages a day from Apache Kafka to HDFS and stop thinking about it.

    The talk will be of interest to developers in Big Data who use Kafka to transfer large amounts of data to Hadoop.

  • Watch recording

    Talk type: Talk

    Path to the data model for the daily update of the past 100 days

    A story about how we chose a data model for a data store in which the last 100 days of data have to be updated every day. We will look at point-to-point block replacements, the single-key table approach, Data Vault, and a couple of other approaches, and pick the winner for our task.

  • Watch recording

    Talk type: Talk

    NiFi scripts as an element of Less Code ETL

    There are many transformations in NiFi that do not require coding, but not everything can be done with out-of-the-box transformations. Developing a processor for each unique transformation is an interesting but expensive option. In NiFi, you can use scripting and get a more flexible data transformation tool. Bronislav will tell you when to choose scripting and how to do it most effectively. This talk is for active NiFi users, as well as for those who are considering NiFi as an ETL tool for their tasks.

  • Watch recording

    Talk type: Talk

    The metadata processor for data gathering and analysis

    Nowadays, automatic data processing pipelines are ubiquitous. We need automation for local pipelines as well as for distributed computing and map-reduce frameworks. One known problem in describing such pipelines is how to define individual actions and their parameters without sharing code across the cluster. For example, full-scale heterogeneous computing is quite limited. In this talk, Alexander will present the concept of a "metadata processor": a way to automatically build and share a workflow between separate nodes that does not involve code sharing and allows switching to a fully heterogeneous mode (different nodes do different tasks). As an example, he will show the DataForge framework, which implements these principles.

  • Watch recording

    Talk type: Talk

    How to load everything into the data catalog and not die

    Creating a convenient data catalog is not enough: the biggest job is filling it with metadata taken from a huge number of different sources.

    In this talk, Ivan will explain why they had to switch from a pull approach to a push approach, and will cover the peculiarities of the technical implementation and the problems they encountered.

    The talk will be useful for those who have already implemented or are thinking of implementing or developing a data catalog.

  • Watch recording

    Talk type: Partner’s talk

    Organizing streaming data processing for Big Data

    The speaker will explain how MTS built a tool for streaming 10 million events per second using Scala (Java), Apache Spark Streaming and PostgreSQL. The main goal was to make a universal, powerful and reliable tool for streaming data processing. Its versatility lies in customizing data processing through configurations and a DSL.

  • Watch recording

    Talk type: Talk

    How Product Design affects the development of an ETL platform

    One of the key distinguishing features of the DWH at Tinkoff is that almost all of its tools are developed together with Product Designers. From the talk, you will learn why tools for developers and analysts need design, why myths about designers keep you from developing a technical product that keeps up with the times, and how product design influenced TEDI, one of Tinkoff's tools for Batch ETL, designed to replace SAS.

  • Watch recording

    Talk type: Talk

    Recovering a distributed database after a crash

    Imagine you were editing a document but deleted it by mistake. Rolling back to Report3_release2FinalLast-Fixed!!!4.txt.bak.bak, saved on a flash drive, and making a couple of additions from memory would fix the problem.

    Now imagine that several people were editing a document online and the server went down. A server backup and coordinated work by the authors of the document would solve the problem.

    Finally, imagine that thousands of people edited millions of documents on hundreds of servers with asynchronous replication to a backup cluster, but a bug in the code caused every millionth change within each cluster to be lost. Is there a solution to such a problem?

    The speaker will tell you what to do when code-review, failover, and certification did not help avoid a distributed database crash.

  • Watch recording

    Talk type: Talk

    How we let users build their ETL

    SelfServiceETL is a framework that allows KCD users to create and modify ETL processes themselves. The talk will cover the background and history of SSETL development, the product itself, and a bit of the architectural context. The speaker will pay special attention to the birds killed with one stone and to the pitfalls the team has already run into or is about to run into.

  • Watch recording

    Talk type: Partner’s interview

    Data Engineering in SM Lab

    An interview with Alexander Salkov. The speaker will talk a little about the history of SM Lab and its line of business. You will find out where SM Lab came from, which classes of tasks are solved in Big Data, and which tools and technologies Big Data runs on.

    Alexander will show you a day in the life of a data engineer. We will discuss the challenges the company has faced in building Big Data, and at the end we will outline the direction of future development.

  • Watch recording

    Talk type: Talk

    What is a Data Mesh and examples of implementation

    The talk considers the Data Mesh methodology in comparison with other approaches, as well as the problems of building classical teams and data pipelines.

    The second part of the talk is devoted to applied implementation of the concept using Data Infra as a Platform, DataOps approaches and technological stack, which can be used to build Data Mesh architecture in a company.

  • Watch recording

    Talk type: Talk

    Evolution of ETL-tools by the example of a single Big Data

    The speakers will talk about how custom approaches to organizing and implementing ETL processes have changed, and how the tools have changed along with them to better respond to changing requirements and conditions. One of the interesting parts of the talk is a story about how the team began to abandon non-standard native Hadoop tools in favor of the more standard Spark, what drove this, and what results it led to.

    The talk will be of interest to data engineers, ETL specialists, data scientists, and anyone who wants to broaden their horizons or learn from the experience of others.

  • Watch recording

    Talk type: Interview

    Interview with Alexander Ermakov

    SmartData 2022 hosts will ask Alexander Ermakov tricky and simple, serious and ironic, straightforward and perhaps even rhetorical questions. Join the conversation and ask your questions in the chat room!