Spark Development

This guide covers developing and testing Spark-related functionality in easy-db-lab.

Project Structure

All Spark modules live under spark/ with shared configuration:

spark/common/ — Shared config (SparkJobConfig), data generation (BulkTestDataGenerator), CQL setup
spark/bulk-writer-sidecar/ — Cassandra Analytics, direct sidecar transport (DirectBulkWriter)
spark/bulk-writer-s3-iam/ — Cassandra Analytics, S3 staging transport with IAM credentials (IamBulkWriter)
spark/connector-writer/ — Standard Spark Cassandra Connector (StandardConnectorWriter)
spark/connector-read-write/ — Read→transform→write example (KeyValuePrefixCount)

Gradle modules use nested paths: :spark:common, :spark:bulk-writer-sidecar, etc.

Prerequisites

The bulk-writer modules depend on Apache Cassandra Analytics, which requires JDK 11 to build.

One-Time Setup

bin/build-cassandra-analytics

Options:

--force - Rebuild even if already built
--repo <url> - Clone from a different repo (e.g., a fork)
--branch <branch> - Use a specific branch (default: trunk)

The default repo and branch are read from spark/cassandra-analytics-source.properties. CLI flags override the file. To switch to a different repo, update that file and re-run with --force to re-clone.

Getting the latest code from the remote: If Maven artifacts already exist (the common case), the script skips all git operations and does not check whether the clone is up to date. To pull the latest code from the configured repo and branch, either re-run with --force, or run cd .cassandra-analytics && git pull manually and then rebuild the Spark modules.

Building

# Build all Spark modules
./gradlew :spark:bulk-writer-sidecar:shadowJar :spark:bulk-writer-s3-iam:shadowJar \
  :spark:connector-writer:shadowJar :spark:connector-read-write:shadowJar

# Build individually
./gradlew :spark:bulk-writer-sidecar:shadowJar
./gradlew :spark:connector-writer:shadowJar

# Output locations
ls spark/bulk-writer-sidecar/build/libs/bulk-writer-sidecar-*.jar
ls spark/connector-writer/build/libs/connector-writer-*.jar

Shadow JARs include all dependencies except Spark (provided by EMR).
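
The output paths above use a glob, which can match more than one file if JARs from older builds are still present in build/libs. As a minimal sketch (latest_shadow_jar is a hypothetical helper, not part of easy-db-lab), the following function picks the most recently modified JAR for a module, assuming it is run from the repository root:

```shell
# Hypothetical helper: print the newest shadow JAR for a Spark module.
# Usage: latest_shadow_jar <module>   e.g. latest_shadow_jar bulk-writer-sidecar
latest_shadow_jar() {
    module="$1"
    # ls -t sorts by modification time, newest first; head keeps the first match.
    # If the glob matches nothing, ls fails quietly and jar stays empty.
    jar=$(ls -t "spark/${module}/build/libs/${module}"-*.jar 2>/dev/null | head -n 1)
    if [ -z "$jar" ]; then
        echo "no shadow JAR found for ${module}; run ./gradlew :spark:${module}:shadowJar" >&2
        return 1
    fi
    printf '%s\n' "$jar"
}
```

The printed path could then be passed to the --jar option of a submission command instead of the glob.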

Running Tests

The main project test task excludes the bulk-writer modules so that it can run without a cassandra-analytics build:

./gradlew :test

Testing with a Live Cluster

Using bin/spark-bulk-write

This script handles JAR lookup, host resolution, and health checks:

# From a cluster directory (where state.json exists)
spark-bulk-write direct --rows 10000
spark-bulk-write s3 --rows 1000000 --parallelism 20
spark-bulk-write connector --keyspace myks --table mytable

Using bin/submit-direct-bulk-writer

Simplified script for direct bulk writer testing:

bin/submit-direct-bulk-writer [rowCount] [parallelism] [partitionCount] [replicationFactor]

Manual Spark Job Submission

All modules use unified spark.easydblab.* configuration:

easy-db-lab spark submit \
    --jar spark/bulk-writer-sidecar/build/libs/bulk-writer-sidecar-*.jar \
    --main-class com.rustyrazorblade.easydblab.spark.DirectBulkWriter \
    --conf spark.easydblab.contactPoints=host1,host2 \
    --conf spark.easydblab.keyspace=bulk_test \
    --conf spark.easydblab.localDc=us-west-2 \
    --conf spark.easydblab.rowCount=1000 \
    --conf spark.easydblab.replicationFactor=1 \
    --wait

Debugging Failed Jobs

When a Spark job fails, easy-db-lab automatically queries logs and displays failure details.

Manual Log Retrieval

easy-db-lab spark logs --step-id <step-id>
easy-db-lab spark status --step-id <step-id>
easy-db-lab spark jobs

Direct S3 Access

Logs are stored at: s3://<bucket>/spark/emr-logs/<cluster-id>/steps/<step-id>/

aws s3 cp s3://<bucket>/spark/emr-logs/<cluster-id>/steps/<step-id>/stderr.gz - | gunzip
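
When checking several steps, it can help to wrap the path layout above in small functions. This is only a sketch: emr_step_log_uri and emr_step_stderr are hypothetical names, and the bucket, cluster ID, and step ID are supplied by the caller.

```shell
# Hypothetical helper: build the S3 prefix for one EMR step's logs.
# Usage: emr_step_log_uri <bucket> <cluster-id> <step-id>
emr_step_log_uri() {
    printf 's3://%s/spark/emr-logs/%s/steps/%s/\n' "$1" "$2" "$3"
}

# Hypothetical helper: stream a step's stderr to the terminal.
# Runs the same aws s3 cp ... | gunzip pipeline shown above.
emr_step_stderr() {
    aws s3 cp "$(emr_step_log_uri "$1" "$2" "$3")stderr.gz" - | gunzip
}
```

For example, emr_step_stderr <bucket> <cluster-id> <step-id> prints the decompressed stderr log for that step.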

Adding a New Spark Module

Create a directory under spark/ (e.g., spark/bulk-reader/)
Add build.gradle.kts — use an existing module as a template
Add include "spark:bulk-reader" to settings.gradle
Depend on :spark:common for shared config
Use SparkJobConfig.load(sparkConf) for configuration
Implement your main class and submit via easy-db-lab spark submit
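
The first three steps can be sketched as a small shell function. scaffold_spark_module is a hypothetical helper, not something the project ships; it assumes it is run from the repository root and that settings.gradle ends with a newline. It copies build.gradle.kts from an existing module, which still needs to be edited by hand afterwards.

```shell
# Hypothetical helper: create a new module directory from an existing one.
# Usage: scaffold_spark_module <new-module> <template-module>
scaffold_spark_module() {
    name="$1"
    template="$2"
    # Refuse to overwrite an existing module.
    if [ -e "spark/${name}" ]; then
        echo "spark/${name} already exists" >&2
        return 1
    fi
    mkdir -p "spark/${name}" || return 1
    # Start from the template module's build file.
    cp "spark/${template}/build.gradle.kts" "spark/${name}/build.gradle.kts" || return 1
    # Register the module in settings.gradle unless the line is already there.
    line="include \"spark:${name}\""
    grep -qxF "$line" settings.gradle 2>/dev/null || printf '%s\n' "$line" >> settings.gradle
}
```

For example, scaffold_spark_module bulk-reader connector-writer creates spark/bulk-reader/ with a copied build file and adds the include line. Depending on :spark:common and writing the main class remain manual steps.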

Architecture Notes

Shared Configuration

SparkJobConfig in spark/common provides:

Property constants (PROP_CONTACT_POINTS, etc.)
Config loading from SparkConf with validation
Schema setup via CqlSetup
Consistent defaults across all modules

Why Shadow JAR?

Bulk-writer modules use the Gradle Shadow plugin because:

EMR provides Spark, so those dependencies are compileOnly
Cassandra Analytics has many transitive dependencies
mergeServiceFiles() properly handles META-INF/services for SPI

Cassandra Analytics Modules

Some cassandra-analytics modules aren't published to Maven:

five-zero.jar - Cassandra 5.0 bridge
five-zero-bridge.jar - Bridge implementation
five-zero-types.jar - Type converters
five-zero-sparksql.jar - SparkSQL integration

These are referenced directly from .cassandra-analytics/ build output.
