Schema Evolution in Practice. Delivering Data From Relational Database to Apache Hadoop

Data Management

In Russian

Imagine: you have built a fast ETL process, informative data marts, and beautiful dashboards. You are giving a demo in front of the executives and half of the company. Your assistant opens a dashboard and ... there are a few red words on it saying "error". Everyone is shocked, but you understand from the message that someone has just deployed an update to a production component without notifying your reporting department.

Let's talk about schema evolution for relational sources. How do you keep downstream systems working while you change an upstream system? How do you keep them working if someone changes the upstream system without any notification?
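One way to notice an unannounced upstream change before a dashboard does is to compare the schema the pipeline expects with the schema the source reports at load time. The sketch below is a minimal illustration, not the approach from the talk: the column names, the type strings, and the helpers `diff_schema` and `is_breaking` are hypothetical, and the rule that only removed or retyped columns count as breaking is an assumption.

```python
from typing import Dict, List


def diff_schema(expected: Dict[str, str], actual: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two column-name -> type mappings and report the differences."""
    common = set(expected) & set(actual)
    return {
        # Columns that appeared upstream since the last known schema.
        "added": sorted(set(actual) - set(expected)),
        # Columns the pipeline expects but the source no longer has.
        "removed": sorted(set(expected) - set(actual)),
        # Columns present on both sides whose declared type differs.
        "retyped": sorted(c for c in common if expected[c] != actual[c]),
    }


def is_breaking(diff: Dict[str, List[str]]) -> bool:
    """Assumption: new columns are harmless, removed or retyped ones are not."""
    return bool(diff["removed"] or diff["retyped"])


# Hypothetical schemas: what the ETL job was built against vs. what it sees today.
expected = {"id": "bigint", "email": "varchar", "created_at": "timestamp"}
actual = {"id": "bigint", "email": "text", "created_at": "timestamp", "phone": "varchar"}

drift = diff_schema(expected, actual)
if is_breaking(drift):
    print("Breaking upstream change detected:", drift)
```

Running such a check as the first step of a load lets the job fail with a clear message, or fall back to the last good snapshot, instead of surfacing an error on a dashboard.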

Stack: Hadoop, Spark. I'll cover Avro, Parquet, and ORC as destination formats, the schemas themselves, and approaches to safe schema evolution.
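As an illustration of what "safe" evolution can mean, Avro resolves data written with an old schema against a newer reader schema: fields the reader does not know are ignored, and fields missing from the data are filled from the reader's default, or raise an error when no default is declared. The sketch below mimics that rule in plain Python for flat records only. It is a simplified model, not the Avro library; the function `resolve_record` and the field names are hypothetical.

```python
from typing import Any, Dict, List


def resolve_record(record: Dict[str, Any], reader_fields: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Project a record written with an older schema onto the reader's fields."""
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            # Field exists in the written data: take it as is.
            resolved[name] = record[name]
        elif "default" in field:
            # Field was added later: fall back to the reader's default.
            resolved[name] = field["default"]
        else:
            # A new field without a default cannot be read from old data.
            raise ValueError(f"field {name!r} is missing and has no default")
    # Fields present in the record but absent from reader_fields are dropped.
    return resolved


# Hypothetical reader schema, version 2: "phone" was added with a default.
reader_v2 = [
    {"name": "id", "type": "long"},
    {"name": "email", "type": "string"},
    {"name": "phone", "type": ["null", "string"], "default": None},
]

old_row = {"id": 1, "email": "a@example.com", "legacy_flag": True}
new_view = resolve_record(old_row, reader_v2)
print(new_view)
```

Under this model, adding a column with a default keeps old files readable, while adding a required column without one does not.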