Big Industries Academy
Advanced Optimisation Strategies for AWS Glue Jobs with Spark engine
In the world of big data, the Apache Spark engine has emerged as a cornerstone technology for handling massive datasets efficiently. Born out of the Hadoop ecosystem, Spark offered a faster, more general-purpose approach to big data processing than Hadoop’s MapReduce and quickly gained traction for its in-memory processing capabilities, which significantly speed up data operations. AWS Glue, leveraging the power of Spark, offers a serverless ETL service that simplifies data integration; however, achieving optimal performance with it still requires careful and precise configuration.
AWS Glue
Understanding the nuances of AWS Glue jobs running on the Spark engine is fundamental, especially given the service's serverless nature. While AWS Glue abstracts away much of the infrastructure management, optimising resource utilisation and job configuration can drastically affect both performance and cost. A common challenge, for instance, is a long-running job that does not speed up despite increased worker allocation. I encountered exactly this: my Glue job ran for extended periods without improvement, even after I increased the number of workers. The breakthrough came from adjusting the worker (node) type to gain more cores and memory per worker. Further optimisation came from shifting from Python UDFs to native PySpark DataFrame operations, which significantly cut the job's runtime.
Therefore, by integrating strategic resource allocation, fine-tuning data processing techniques, and leveraging Spark’s robust capabilities, it is possible to overcome common challenges and enhance the performance of our data workflows. In this guide, I will explore these aspects, providing a roadmap for those looking to understand the full potential of AWS Glue in their data operations.
Optimisation Roadmap
The first step when fine-tuning your Glue script is selecting the right data formats and understanding how Spark processes different languages. Columnar storage formats such as Parquet and ORC are well established for faster read and write performance in Spark, and they are particularly beneficial when used in conjunction with Spark SQL's execution engine. AWS Glue supports scripts written in Scala and Python (PySpark). While Python is widely used for its simplicity and extensive library ecosystem, Python UDFs can be slow because of the serialisation and deserialisation overhead incurred when moving data between the Spark JVM and Python processes. If you must use UDFs, consider Pandas UDFs, which are vectorised via Apache Arrow. Better still, native PySpark DataFrame transformations keep the work inside the Spark engine and can significantly boost performance, as illustrated in the sketch below.
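To make the contrast concrete, here is a minimal sketch comparing a plain Python UDF, a vectorised Pandas UDF, and a native DataFrame expression, with the result written out as Parquet. The multiplier, column names, and S3 path are illustrative placeholders, not values from the original job.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType
import pandas as pd

spark = SparkSession.builder.appName("udf-vs-native").getOrCreate()
df = spark.createDataFrame([(1, 10.0), (2, 20.0)], ["id", "amount"])

# 1) Plain Python UDF: every row crosses the JVM/Python boundary,
#    paying serialisation/deserialisation overhead.
@F.udf(returnType=DoubleType())
def add_tax_py(amount):
    return amount * 1.21  # hypothetical multiplier

slow = df.withColumn("with_tax", add_tax_py("amount"))

# 2) Pandas UDF: vectorised over Arrow batches, much cheaper per row.
@pandas_udf(DoubleType())
def add_tax_pandas(amount: pd.Series) -> pd.Series:
    return amount * 1.21

faster = df.withColumn("with_tax", add_tax_pandas("amount"))

# 3) Native column expression: stays inside the Spark engine and is
#    optimised by Catalyst -- usually the fastest option.
fastest = df.withColumn("with_tax", F.col("amount") * 1.21)

# Columnar formats such as Parquet keep subsequent reads fast.
fastest.write.mode("overwrite").parquet("s3://your-bucket/output/with_tax/")
```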
Spark Infrastructure
The Spark infrastructure comprises several components, including workers, cores, and the machines they run on, each playing a distinct role in the efficient execution of tasks. In Spark, a "worker" is a node in the cluster that processes tasks assigned by the Spark driver. Each worker's capacity is determined by its number of cores and its memory allocation: the cores handle parallel processing, while the allocated memory holds in-memory data during task execution. Both are essential levers for job optimisation.
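As a quick sanity check on what those workers and cores translate into at runtime, a Glue script can inspect its own Spark context. The sketch below assumes the standard awsglue libraries available inside the Glue runtime:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)

# defaultParallelism roughly reflects the total cores available across
# the provisioned workers -- a useful reference when sizing partitions.
print("Default parallelism:", sc.defaultParallelism)

# Log the executor/driver settings the Glue runtime actually applied.
for key, value in sorted(sc.getConf().getAll()):
    if key.startswith(("spark.executor", "spark.driver")):
        print(key, "=", value)
```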
When choosing machines for Spark jobs within AWS Glue, match the machine type to the workload requirements:
- General Purpose Machines offer a balanced CPU-to-memory ratio, making them suitable for a variety of workloads.
- Compute Optimised Machines are ideal for CPU-intensive tasks that require high processing power.
- Memory Optimised Machines are best for jobs that demand significant memory resources for large datasets.
As a best practice, start with 4-5 workers. Each worker on a G.1X machine, for example, is equivalent to 1 DPU, encompassing 4 vCPUs and 16 GB of memory. Provisioning four workers therefore provides 16 cores and 64 GB of memory, enough to manage light to moderate workloads. When the job's complexity or data volume grows, additional workers can easily be provisioned. Moreover, adjusting the number of workers or the machine type does not necessarily require restarting the processing from scratch: AWS Glue supports job bookmarks, which persist state information and allow the job to process only new or changed data since the last successful run. However, changing infrastructure configurations with bookmarks enabled should be approached with caution, to ensure the changes do not inadvertently trigger a complete reprocessing of the data rather than continuing with the incremental load. This feature is particularly useful in continuous data ingestion scenarios, ensuring efficiency and avoiding redundant processing.
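As a concrete illustration, these job-level settings can be defined through the AWS Glue API. The boto3 sketch below uses placeholder job name, role ARN, and script location; it is only a minimal example of wiring worker type, worker count, and job bookmarks together, not a production definition.

```python
import boto3

glue = boto3.client("glue")

# Placeholder job name, role and script path -- adjust for your account.
glue.create_job(
    Name="example-etl-job",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    GlueVersion="4.0",
    WorkerType="G.1X",        # 1 DPU per worker: 4 vCPUs and 16 GB of memory
    NumberOfWorkers=4,        # 4 workers -> 16 cores and 64 GB in total
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://your-bucket/scripts/example_etl.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        # Job bookmarks: process only new or changed data on each run.
        "--job-bookmark-option": "job-bookmark-enable",
    },
)
```

Scaling up later (for example to G.2X or more workers) is a change to the job definition; with bookmarks enabled, verify after the change that the job still picks up only the incremental load.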
Dynamic Allocation and Partition Tuning
Beyond the machine type and the language used in the script, another key way to significantly improve the performance of your data processing jobs in AWS Glue is to tune the Spark environment itself. One powerful feature is Dynamic Allocation, which lets Spark automatically adjust the number of executors to the workload. This flexibility ensures that resources are used efficiently, scaling up during high demand and scaling down when less compute is needed, optimising both cost and performance. Partition Tuning also plays an important role: properly partitioned data increases parallelism and drastically reduces the need for shuffling (moving data across executors), speeding up processing. Spark SQL further aids this by optimising query execution across partitions, making strategic partitioning an essential step in job configuration.
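The sketch below shows one way to apply these ideas inside a Glue script. The S3 paths and column names are placeholders, and note that recent Glue versions also expose Auto Scaling through the --enable-auto-scaling job parameter, which may supersede plain Spark dynamic allocation settings.

```python
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Dynamic allocation: let Spark grow and shrink the executor pool with load.
conf = SparkConf().set("spark.dynamicAllocation.enabled", "true")
sc = SparkContext.getOrCreate(conf)
spark = GlueContext(sc).spark_session

# Partition tuning: control how many partitions shuffles produce ...
spark.conf.set("spark.sql.shuffle.partitions", "200")   # illustrative value

df = spark.read.parquet("s3://your-bucket/input/")       # placeholder path

# ... and repartition on the key used in joins/aggregations to cut shuffling.
tuned = df.repartition(64, "customer_id")                # hypothetical key column

# Partitioned output keeps later reads selective.
tuned.write.mode("overwrite").partitionBy("ingest_date").parquet(
    "s3://your-bucket/output/"                           # placeholder path
)
```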
DataFrames and Datasets
In the realm of PySpark, a shift towards DataFrames (and, in Scala, Datasets) over RDDs (Resilient Distributed Datasets) is recommended whenever possible. DataFrames and Datasets offer built-in optimisations that streamline operations and are inherently more efficient for most data manipulation tasks. They not only provide a higher level of abstraction but also improve performance by optimising execution plans internally through the Catalyst optimiser. To further refine AWS Glue jobs, regular monitoring is key: tools such as Amazon CloudWatch enable detailed tracking of job performance, helping identify and diagnose inefficiencies and bottlenecks.
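As a small illustration of the shift away from RDDs, the hypothetical aggregation below is expressed both ways; the DataFrame version goes through the Catalyst optimiser, while the RDD lambdas remain opaque to Spark.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("df-over-rdd").getOrCreate()

data = [("a", 1), ("b", 5), ("a", 3)]

# RDD style: Python lambdas that Spark cannot inspect or optimise.
totals_rdd = (
    spark.sparkContext.parallelize(data)
    .reduceByKey(lambda x, y: x + y)
)

# DataFrame style: declarative expressions optimised by Catalyst/Tungsten.
totals_df = (
    spark.createDataFrame(data, ["key", "value"])
    .groupBy("key")
    .agg(F.sum("value").alias("total"))
)

totals_df.show()
```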
By strategically selecting data formats such as Parquet and ORC, which are optimised for Spark's execution engine, and by moving from Python UDFs to native PySpark operations, job performance can be drastically improved. Incorporating best practices such as starting with an appropriate number of workers and dynamically adjusting resources based on demand further ensures that AWS Glue operates both efficiently and cost-effectively.
Conclusion
Ultimately, the successful implementation of the strategies outlined above leads to more robust, faster, and more cost-efficient data processing environments. As we continue to push the boundaries of what is possible with AWS Glue and Spark, the convergence of strategic resource management, advanced Spark capabilities, and proactive optimisation practices will remain essential to driving innovation in big data processing.
Andreia Negreira
In June 2023 Andreia joined the AI class at BeCode, where she found the perfect environment for hands-on training, enhancing the problem-solving skills cultivated during her studies in Environmental Engineering and her master's in Chemical and Biochemical Process Engineering. With experience in data integration, streaming, ETL design, and end-to-end data science pipelines, she brings a fresh perspective and a constant passion for innovation to the tech world, fuelled by the analytical and problem-solving skills developed throughout her career.