This talk focuses exclusively on Apache Spark, with cost-based engineering as its theme.
Data pipelines have a cost that depends on the running time of data processing jobs (i.e. how long the cluster is up & running) & on other infrastructure costs. More often than not, people use Spark in a vanilla way, with its default configurations or the defaults provided by services (like EMR). They often scale horizontally to reduce running time, or they run on a cluster when there is no need for multiple machines.
Yoga is an ancient Indian practice that helps improve one's mental & physical health. It comprises many "asanas" (postures), each with a specific purpose.
In this talk, we will learn some asanas (tips & techniques) that can reduce running times & infrastructure maintenance costs, keeping your pipelines lean. If time permits, perhaps we can learn some actual yoga asanas too!