Automating the Configuration of Apache Spark 3 ETL Processes Using RAG and an MTS LLM

In Russian

Big data processing with Apache Spark has become the standard for ETL (Extract, Transform, Load) pipelines thanks to its high performance and scalability. The performance of Spark jobs critically depends on correct configuration settings such as spark.executor.memory, spark.default.parallelism, and spark.driver.memory. Manually tuning these parameters requires deep knowledge of the system and often yields suboptimal results because of complex interactions between settings, variability in load, and differences in input data and application algorithms.
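As a minimal illustration (not taken from the talk) of the parameters named above, the hypothetical helper below turns a dict of Spark settings into spark-submit --conf flags; the values are placeholders, not recommendations.

```python
# Hypothetical helper: render a dict of Spark settings as spark-submit flags.
# The parameter names match those mentioned in the abstract; the values are
# illustrative only and would normally come from a tuning system, not be fixed.
def to_submit_args(conf: dict) -> list:
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", "{}={}".format(key, value)]
    return args

conf = {
    "spark.executor.memory": "4g",       # per-executor heap
    "spark.driver.memory": "2g",         # driver heap
    "spark.default.parallelism": "200",  # default partition count
}
print(" ".join(to_submit_args(conf)))
```

Keeping the configuration as plain data like this is also what makes automated tuning possible: a system can rewrite the dict before submission instead of a human editing job scripts.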

In this talk, I will cover a number of problems a Data Engineer faces when configuring Spark.

To assess how optimally a Spark application runs, we use metrics from Spark logs and from Graphite, the metrics store to which Spark reports its metrics. I will explain how this collected performance data is used in the system, and where the LLM (the mts-anya model) and RAG come in: how we assembled a knowledge base of concrete Spark tuning recommendations, built embeddings for them with bge-m3, and, for a given Spark application, retrieve the most relevant tuning recommendations. I will also demonstrate the system's microservice architecture, implemented with Kafka and K8s.
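The retrieval step described above can be sketched as a nearest-neighbor search over recommendation embeddings. The code below is a toy illustration, not the talk's implementation: in practice the vectors would come from bge-m3, whereas here the embeddings, recommendation texts, and query are all made up for demonstration.

```python
import numpy as np

def top_k(query, base, k=2):
    """Rank recommendation embeddings by cosine similarity to a query."""
    # cosine similarity = dot product of L2-normalized vectors
    q = query / np.linalg.norm(query)
    b = base / np.linalg.norm(base, axis=1, keepdims=True)
    scores = b @ q
    return list(np.argsort(-scores)[:k])

# Toy knowledge base of tuning recommendations (illustrative text).
recs = [
    "Increase spark.executor.memory when executors spill to disk",
    "Raise spark.default.parallelism for large shuffles",
    "Reduce spark.driver.memory if the driver heap is underused",
]
# Made-up 2-D embeddings; real ones would be bge-m3 vectors of the texts.
base = np.array([[0.9, 0.1], [0.2, 0.95], [0.1, 0.2]])
query = np.array([0.85, 0.2])  # embedding of a job profile that spills to disk

for i in top_k(query, base):
    print(recs[i])
```

The same pattern scales to a real vector store: only the embedding model and the index backing the similarity search change, not the retrieval logic.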

In conclusion, I will show examples of suboptimally configured processes where parameter tuning improved resource utilization, and discuss currently unresolved issues and future plans for the system's development.
