Amazon Athena now supports Apache Spark, an open source distributed processing system, to run fast analytical workloads. Data analysts and engineers can use Jupyter Notebooks in Athena to perform data processing and programmatically interact with Spark applications.
For interactive Spark workloads that require low-latency queries, customers can query data from various sources and visualize analytical results. Athena starts the application within a second.
Source: https://aws.amazon.com/blogs/aws/new-amazon-athena-for-apache-spark/
In addition to Amazon Athena’s existing SQL capabilities, Apache Spark on Athena offers on-demand scaling to meet changing data volumes and processing requirements. Donnie Prakoso, principal developer advocate at AWS, describes the key benefits of the new serverless option based on Spark 3.2:
Building an infrastructure to run Apache Spark for interactive applications is not easy. Customers must provision, configure, and maintain infrastructure in addition to their applications. Not to mention tuning resources optimally so that application startup is not slowed down.
Apache Spark is an open-source distributed processing system designed to run fast analytical workloads. It is used across a variety of industries to perform complex data analysis, often to explore data lakes and derive insights from them. The new Athena feature allows data engineers to build Apache Spark applications using notebooks in the AWS console or programmatically using the Athena API. Prakoso adds:
Amazon Athena is integrated with the AWS Glue Data Catalog and helps you work with any data source in the AWS Glue Data Catalog, including data in Amazon S3. This opens up possibilities for customers to build applications to analyze and visualize data, explore data and prepare data sets for machine learning pipelines.
Apache Spark workloads are already supported on AWS using Jupyter notebooks with Glue or EMR Serverless, and some developers are skeptical about the advantages of the new option. Views created by Athena SQL are not supported in Athena for Spark, so cross-engine view queries are not possible. AWS will have a dedicated podcast episode to provide details on the new feature.
In a demo showing how to use Athena for Apache Spark to explore a data lake and derive insights from it, Pathik Shah, senior big data architect at AWS, and Raj Devnath, product manager at AWS, write:
You can now use the expressive power of Python and build interactive Apache Spark applications using the Athena console or a simplified notebook experience with the Athena API. (…) You can now use the instant-on, interactive, fully managed Apache Spark engine to perform interactive data exploration on your data lake.
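As a rough illustration of the programmatic route, the sketch below submits a block of PySpark code through an Athena client object. It is not taken from the AWS demo. It assumes the client is a boto3 Athena client (for example boto3.client("athena")) and that the operation names and response fields match the boto3 Athena API as commonly documented. The function name, the default DPU limit and the polling interval are hypothetical choices made for the example.

```python
import time


def run_spark_code(athena, workgroup, code, max_dpus=20, poll_seconds=2.0):
    """Run one block of PySpark code in a new Athena Spark session.

    `athena` is assumed to behave like boto3.client("athena").
    Returns the final calculation state, e.g. "COMPLETED" or "FAILED".
    """
    # Start a Spark session in a Spark-enabled workgroup (assumed to exist).
    session_id = athena.start_session(
        WorkGroup=workgroup,
        EngineConfiguration={"MaxConcurrentDpus": max_dpus},
    )["SessionId"]
    try:
        # Wait until the session is ready to accept calculations.
        while True:
            status = athena.get_session_status(SessionId=session_id)
            state = status["Status"]["State"]
            if state == "IDLE":
                break
            if state in ("FAILED", "TERMINATED", "DEGRADED"):
                raise RuntimeError(f"session ended in state {state}")
            time.sleep(poll_seconds)

        # Submit the code block and poll until it reaches a final state.
        calc_id = athena.start_calculation_execution(
            SessionId=session_id,
            CodeBlock=code,
        )["CalculationExecutionId"]
        while True:
            status = athena.get_calculation_execution_status(
                CalculationExecutionId=calc_id
            )
            state = status["Status"]["State"]
            if state in ("COMPLETED", "FAILED", "CANCELED"):
                return state
            time.sleep(poll_seconds)
    finally:
        # Always release the session so it does not keep running.
        athena.terminate_session(SessionId=session_id)
```

A call might look like run_spark_code(boto3.client("athena"), "my-spark-workgroup", "spark.sql('SHOW DATABASES').show()"), where the workgroup name is a placeholder. Passing the client in as an argument keeps the function easy to exercise with a stub before pointing it at a real account.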
Apache Spark code execution is charged at $0.35 per data processing unit (DPU) per hour, billed per second. Athena notebooks are provided at no additional charge. One DPU provides 4 vCPUs and 16 GB of memory.
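To make the pricing concrete, here is a small worked calculation based only on the figures above. The 20-DPU, 15-minute workload is a hypothetical example, and the sketch ignores any minimum charges or rounding rules that the actual bill might apply.

```python
PRICE_PER_DPU_HOUR = 0.35  # USD, as stated in the pricing above
VCPUS_PER_DPU = 4
MEMORY_GB_PER_DPU = 16


def spark_cost_usd(dpus, seconds):
    """Estimated charge for `dpus` DPUs running for `seconds` seconds."""
    dpu_hours = dpus * seconds / 3600  # per-second billing, expressed in hours
    return dpu_hours * PRICE_PER_DPU_HOUR


# Hypothetical workload: 20 DPUs busy for 15 minutes.
dpus = 20
cost = spark_cost_usd(dpus, seconds=15 * 60)
print(f"{dpus * VCPUS_PER_DPU} vCPUs, {dpus * MEMORY_GB_PER_DPU} GB memory")
print(f"Estimated cost: ${cost:.2f}")  # 5 DPU-hours at $0.35 each, i.e. $1.75
```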
Athena for Apache Spark is available in a limited number of AWS Regions (Ohio, N. Virginia, Oregon, Tokyo, Ireland).