This course offers an in-depth introduction to distributed programming with Apache Spark, using Scala, the language in which Spark itself is implemented and the best way to make the most of it. It focuses on the fundamentals of the Apache Spark computational model, with the contents explained through interactive examples. It also shows how to analyze a program's performance using the Spark UI and how to make basic optimizations, through practical exercises.
AUDIENCE
- Programmers with basic Scala knowledge interested in making the most of Spark using its language of choice
- Java or Python Spark programmers who want to start using the framework from Scala
COURSE OUTLINE (16 hours)
MODULE 1. Computational model
- Transformations and actions; jobs, stages and tasks
- Cluster managers: YARN, Standalone, Mesos
- Driver and executors; SparkUI
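The topics above can be sketched in a few lines of Scala. This is a minimal example, assuming Spark is on the classpath: transformations are lazy, and only the final action triggers a job that Spark breaks into stages and tasks (visible in the Spark UI, by default at localhost:4040 while the local driver runs).

```scala
import org.apache.spark.sql.SparkSession

object TransformationsVsActions {
  def run(): Int = {
    // Local mode: driver and executors share one JVM, handy for the classroom.
    val spark = SparkSession.builder()
      .appName("transformations-vs-actions")
      .master("local[*]")
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(1 to 1000)

    // Transformations are lazy: nothing is computed yet.
    val evens   = rdd.filter(_ % 2 == 0)
    val doubled = evens.map(_ * 2)

    // An action triggers a job, split into stages and tasks by the scheduler.
    val total = doubled.reduce(_ + _)

    spark.stop()
    total
  }

  def main(args: Array[String]): Unit =
    println(s"total = ${run()}")
}
```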
MODULE 2. Spark APIs
- Spark libraries: Spark SQL, RDDs, MLlib, GraphX
- Dataset: Statically typed
- DataFrame: Dynamically typed, errors surface at runtime
- Datasets vs DataFrames
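The Dataset vs. DataFrame distinction can be shown with the same data in both shapes. A hedged sketch, assuming Spark on the classpath (the `User` case class is illustrative): a typo in a Dataset field is a compile error, while a typo in a DataFrame column name only fails at runtime with an AnalysisException.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object DatasetsVsDataFrames {
  case class User(name: String, age: Int)

  def run(): (Long, Long) = {
    val spark = SparkSession.builder()
      .appName("datasets-vs-dataframes")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Dataset[User]: statically typed, checked by the Scala compiler.
    val ds: Dataset[User] = Seq(User("Ana", 34), User("Luis", 28)).toDS()
    val adults = ds.filter(_.age >= 30) // `.agee` would not compile

    // DataFrame = Dataset[Row]: columns are resolved by name at runtime.
    val df: DataFrame = ds.toDF()
    val adultsDf = df.filter($"age" >= 30) // $"agee" fails only when executed

    val counts = (adults.count(), adultsDf.count())
    spark.stop()
    counts
  }

  def main(args: Array[String]): Unit = println(run())
}
```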
MODULE 3. Reading and writing in Spark
- Files: JSON, Parquet
- Databases: JDBC, NoSQL
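As a taste of the module, here is a hedged sketch (Spark on the classpath, temporary directory and file names illustrative) that round-trips a small DataFrame through JSON and Parquet; the JDBC call is shown in comments only, since it needs a live database.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{SaveMode, SparkSession}

object ReadWriteSketch {
  def run(): Long = {
    val spark = SparkSession.builder()
      .appName("read-write")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val dir = Files.createTempDirectory("spark-io").toString
    val people = Seq(("Ana", 34), ("Luis", 28)).toDF("name", "age")

    // JSON: human-readable, one object per line, schema inferred on read.
    people.write.mode(SaveMode.Overwrite).json(s"$dir/people.json")
    val fromJson = spark.read.json(s"$dir/people.json")

    // Parquet: columnar and compressed; the schema travels with the files.
    fromJson.write.mode(SaveMode.Overwrite).parquet(s"$dir/people.parquet")
    val fromParquet = spark.read.parquet(s"$dir/people.parquet")

    // A JDBC source would look like this (illustrative URL and table name):
    // spark.read.format("jdbc")
    //   .option("url", "jdbc:postgresql://localhost/mydb")
    //   .option("dbtable", "users")
    //   .load()

    val rows = fromParquet.count()
    spark.stop()
    rows
  }

  def main(args: Array[String]): Unit =
    println(s"rows = ${run()}")
}
```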
MODULE 4. Patterns and antipatterns
- Memory
- Serialization issues
- Caching
- Tasks that never finish
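One of the patterns covered, caching, can be previewed in a short sketch (assuming Spark on the classpath; the squaring step stands in for a genuinely expensive pipeline): persisting an RDD lets several actions reuse it instead of recomputing the whole lineage, which can be verified in the "Storage" tab of the Spark UI.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachingSketch {
  def run(): (Long, Long) = {
    val spark = SparkSession.builder()
      .appName("caching")
      .master("local[*]")
      .getOrCreate()

    val nums = spark.sparkContext.parallelize(1L to 1000000L)
    val expensive = nums.map(n => n * n) // placeholder for a costly pipeline

    // Without persist, each action below would recompute `expensive` from scratch.
    expensive.persist(StorageLevel.MEMORY_ONLY)

    val max = expensive.max()   // first action: computes and fills the cache
    val cnt = expensive.count() // second action: served from the cache

    expensive.unpersist()
    spark.stop()
    (max, cnt)
  }

  def main(args: Array[String]): Unit = println(run())
}
```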