# Spark
easy-db-lab supports provisioning Apache Spark clusters via AWS EMR for analytics workloads.
## Enabling Spark

Spark is enabled during cluster initialization with the `--spark.enable` flag:

```bash
easy-db-lab init --spark.enable
```
## Spark Configuration Options

| Option | Description | Default |
|---|---|---|
| `--spark.enable` | Enable Spark EMR cluster | `false` |
| `--spark.master.instance.type` | Master node instance type | `m5.xlarge` |
| `--spark.worker.instance.type` | Worker node instance type | `m5.xlarge` |
| `--spark.worker.instance.count` | Number of worker nodes | `3` |
### Example with Custom Configuration

```bash
easy-db-lab init \
  --spark.enable \
  --spark.master.instance.type m5.2xlarge \
  --spark.worker.instance.type m5.4xlarge \
  --spark.worker.instance.count 5
```
## Submitting Spark Jobs

Submit JAR-based Spark applications to your EMR cluster:

```bash
easy-db-lab spark submit \
  --jar /path/to/your-app.jar \
  --main-class com.example.YourMainClass \
  --args "arg1 arg2"
```
### Submit Options

| Option | Description | Required |
|---|---|---|
| `--jar` | Path to JAR file (local or `s3://`) | Yes |
| `--main-class` | Main class to execute | Yes |
| `--args` | Arguments for the Spark application | No |
| `--wait` | Wait for job completion | No |
| `--name` | Job name (defaults to main class) | No |
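For example, to submit a job under a custom name and block until it finishes, using only the flags above (the job name and paths are placeholders):

```bash
easy-db-lab spark submit \
  --jar /path/to/your-app.jar \
  --main-class com.example.YourMainClass \
  --name nightly-aggregation \
  --wait
```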
### Using S3 for JARs

You can upload your JAR to S3 and reference it directly:

```bash
# Upload JAR to cluster S3 bucket
aws s3 cp your-app.jar s3://your-bucket/spark/your-app.jar

# Submit using S3 path
easy-db-lab spark submit \
  --jar s3://your-bucket/spark/your-app.jar \
  --main-class com.example.YourMainClass \
  --wait
```
## Checking Job Status

### View Recent Jobs

List recent Spark jobs on the cluster:

```bash
easy-db-lab spark jobs
```

Options:

- `--limit` - Maximum number of jobs to display (default: 10)
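For example, to list more than the default ten jobs:

```bash
easy-db-lab spark jobs --limit 25
```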
### Check Specific Job Status

```bash
easy-db-lab spark status --step-id <step-id>
```

Without `--step-id`, this shows the status of the most recent job.

Options:

- `--step-id` - EMR step ID to check
- `--logs` - Download step logs (stdout, stderr)
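For example, to check the most recent job and download its logs in one call (a sketch combining the documented options; omitting `--step-id` targets the latest job):

```bash
easy-db-lab spark status --logs
```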
## Retrieving Logs

Download logs for a Spark job:

```bash
easy-db-lab spark logs --step-id <step-id>
```

Logs are automatically decompressed and include:

- `stdout.gz` - Standard output
- `stderr.gz` - Standard error
- `controller.gz` - EMR controller logs
## Architecture

When Spark is enabled, easy-db-lab provisions:

- EMR Cluster: Managed Spark cluster with master and worker nodes
- S3 Integration: Logs stored at `s3://<bucket>/spark/emr-logs/`
- IAM Roles: Service and job flow roles for EMR operations
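Because this is a standard EMR cluster, the AWS CLI can also be used to inspect it directly; a sketch, assuming your AWS credentials are configured and using the bucket name from the S3 example above:

```bash
# List active EMR clusters in the configured region
aws emr list-clusters --active

# Browse the EMR log output written for Spark steps
aws s3 ls s3://your-bucket/spark/emr-logs/ --recursive
```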
## Timeouts and Polling
- Job Polling Interval: 5 seconds
- Maximum Wait Time: 4 hours
- Cluster Creation Timeout: 30 minutes
## Spark with Cassandra

A common use case is running Spark jobs that read from or write to Cassandra. Use the Spark Cassandra Connector:

```scala
import com.datastax.spark.connector._

val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "my_table", "keyspace" -> "my_keyspace"))
  .load()
```
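Writing back to Cassandra follows the same pattern; a minimal sketch, assuming the target table already exists and `df` is the DataFrame from above:

```scala
import org.apache.spark.sql.SaveMode

// Append rows to an existing Cassandra table via the same connector format
df.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "my_table", "keyspace" -> "my_keyspace"))
  .mode(SaveMode.Append)
  .save()
```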
Ensure your JAR includes the Spark Cassandra Connector dependency and configure the Cassandra host in your Spark application.
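A sketch of how that might look with sbt and a SparkSession builder (the connector version and host address are illustrative assumptions, not values easy-db-lab manages for you):

```scala
// build.sbt -- choose the connector version that matches your Spark release
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "3.5.0"
```

```scala
// In your application: point the connector at a Cassandra contact point
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cassandra-example")                            // hypothetical app name
  .config("spark.cassandra.connection.host", "10.0.0.10")  // hypothetical node address
  .getOrCreate()
```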