Learn how to make the most of the optimizations that Spark offers for free, and how to manually optimize and organize your Spark code to make it more robust and performant in those situations where the framework is not smart enough.
AUDIENCE
- Programmers who are familiar with the basics of Spark programming and need to get acquainted with the nuts and bolts of the framework
COURSE OUTLINE (8 hours)
MODULE 1. Spark optimizations
- Dataset vs DataFrame optimizations (illustrated in the sketch below)
- Optimized vs non-optimized file formats
- The standard Catalog API
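As a taste of what this module covers, here is a minimal, self-contained sketch (the object name, row counts, and the /tmp output path are illustrative, not part of the course material). It contrasts a Column-based DataFrame filter, which Catalyst can analyze and optimize, with a typed Dataset lambda that is opaque to the optimizer, then shows Parquet's source-level predicate pushdown and a query against the standard Catalog API:

```scala
import org.apache.spark.sql.SparkSession

object Module1Sketch {
  case class User(id: Long, country: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("module1-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val users = spark.range(1000000)
      .selectExpr("id", "concat('country_', cast(id % 10 as string)) as country")
      .as[User]

    // Column expression: Catalyst sees the predicate and can optimize it
    // (constant folding, predicate pushdown, column pruning).
    val optimizable = users.filter($"country" === "country_3")

    // Typed lambda: compiled JVM code that Catalyst cannot look into,
    // so the same optimizations are lost.
    val opaque = users.filter(u => u.country == "country_3")

    optimizable.explain() // optimized predicate visible in the plan
    opaque.explain()      // filter shows up as an opaque function

    // Columnar formats such as Parquet support predicate pushdown and
    // column pruning at the source; plain text formats (CSV, JSON) do not.
    users.write.mode("overwrite").parquet("/tmp/users_parquet")
    spark.read.parquet("/tmp/users_parquet")
      .filter($"country" === "country_3")
      .explain() // look for PushedFilters in the FileScan node

    // The standard Catalog API: inspect tables and views programmatically.
    users.createOrReplaceTempView("users")
    spark.catalog.listTables().show()

    spark.stop()
  }
}
```

Comparing the two explain() outputs is the quickest way to see why sticking to Column expressions, where possible, keeps Catalyst in play.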
MODULE 2. Best practices on performance & modular design
- Partitioning issues: unpartitioned data and over-partitioning (see the sketches after this outline)
- Fixing memory problems
- How to solve serialization issues
- Caching: when it improves your process, and when it is just extra work
- Tasks that never finish: detecting why this happens
- Workflow structure: design patterns to properly modularize your ETLs and improve testability
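Two short sketches illustrate this module's themes (object names, sizes, and thresholds are invented for illustration). The first touches serialization, partitioning, and caching: Kryo is configured up front, repartition (a full shuffle) is contrasted with coalesce (no shuffle), and an intermediate result is cached only because it is reused:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object Module2Sketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("module2-sketch")
      .master("local[*]")
      // Kryo is usually faster and more compact than Java serialization
      // for data shuffled by RDD-based code.
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

    val df = spark.range(10000000).selectExpr("id", "id % 100 as key")

    // Over-partitioning: thousands of tiny tasks waste scheduler time.
    // Under-partitioning: a few huge tasks spill to disk and straggle.
    val repartitioned = df.repartition(200, df("key")) // full shuffle
    val narrowed = repartitioned.coalesce(50)          // merges partitions, no shuffle
    println(narrowed.rdd.getNumPartitions)             // 50

    // Cache only what is reused; materializing a result used once is pure overhead.
    val reused = repartitioned.groupBy("key").count()
      .persist(StorageLevel.MEMORY_AND_DISK)

    reused.filter("count > 100000").show() // first action materializes the cache
    reused.filter("count < 100000").show() // second action pays the cache off
    reused.unpersist()

    spark.stop()
  }
}
```

For the workflow-structure topic, one common testability pattern (the stage names here are hypothetical) is to write every ETL stage as a pure DataFrame => DataFrame function and chain the stages with Dataset.transform, so each stage can be unit-tested against a small in-memory DataFrame:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object EtlSketch {
  // Pure transformation: no I/O, no hidden state, trivially testable.
  def withBucket(df: DataFrame): DataFrame =
    df.withColumn("bucket", col("id") % 10)

  def onlyEvenBuckets(df: DataFrame): DataFrame =
    df.filter(col("bucket") % 2 === 0)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("etl-sketch")
      .master("local[*]")
      .getOrCreate()

    // Extract and load stay at the edges; transforms compose via .transform.
    val result = spark.range(100).toDF("id")
      .transform(withBucket)
      .transform(onlyEvenBuckets)

    result.show(5)
    spark.stop()
  }
}
```

Keeping extraction and loading at the edges of the job, with pure transformations in between, is what makes the test suite cheap: the transforms run against in-memory fixtures without touching real storage.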