Spark Development
This guide covers developing and testing Spark-related functionality in easy-db-lab.
Project Structure
All Spark modules live under spark/ with shared configuration:
- spark/common/: Shared config (SparkJobConfig), data generation (BulkTestDataGenerator), CQL setup
- spark/bulk-writer-sidecar/: Cassandra Analytics, direct sidecar transport (DirectBulkWriter)
- spark/bulk-writer-s3-iam/: Cassandra Analytics, S3 staging transport with IAM credentials (IamBulkWriter)
- spark/connector-writer/: Standard Spark Cassandra Connector (StandardConnectorWriter)
- spark/connector-read-write/: Read→transform→write example (KeyValuePrefixCount)
Gradle modules use nested paths: :spark:common, :spark:bulk-writer-sidecar, etc.
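The nested-path convention can be sketched as a small shell helper that turns module directory names into shadowJar task paths. The function name shadow_tasks is hypothetical and not part of the repository; it only illustrates the naming rule.

```shell
# Hypothetical helper (not part of easy-db-lab): map module directory names
# under spark/ to their Gradle shadowJar task paths, one per line.
shadow_tasks() {
  for module in "$@"; do
    printf ':spark:%s:shadowJar\n' "$module"
  done
}

# Print the task paths for two of the modules listed above.
shadow_tasks bulk-writer-sidecar connector-writer
```

The output can be passed straight to Gradle, for example ./gradlew $(shadow_tasks bulk-writer-sidecar connector-writer), which relies on word splitting and therefore assumes module names contain no spaces.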
Prerequisites
The bulk-writer modules depend on Apache Cassandra Analytics, which requires JDK 11 to build.
One-Time Setup
bin/build-cassandra-analytics
Options:
- --force: Rebuild even if already built
- --repo <url>: Clone from a different repo (e.g., a fork)
- --branch <branch>: Use a specific branch (default: trunk)
The default repo and branch are read from spark/cassandra-analytics-source.properties. CLI flags override the file. To switch to a different repo, update that file and re-run with --force to re-clone.
Getting the latest code from the remote: The script skips all git operations if Maven artifacts already exist (the common case). It does not check whether the clone is up to date. To pull the latest from the configured repo/branch, either run with --force or manually cd .cassandra-analytics && git pull and then rebuild the spark modules.
Building
# Build all Spark modules
./gradlew :spark:bulk-writer-sidecar:shadowJar :spark:bulk-writer-s3-iam:shadowJar \
:spark:connector-writer:shadowJar :spark:connector-read-write:shadowJar
# Build individually
./gradlew :spark:bulk-writer-sidecar:shadowJar
./gradlew :spark:connector-writer:shadowJar
# Output locations
ls spark/bulk-writer-sidecar/build/libs/bulk-writer-sidecar-*.jar
ls spark/connector-writer/build/libs/connector-writer-*.jar
Shadow JARs include all dependencies except Spark (provided by EMR).
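Because the JAR file names carry a version suffix, scripts usually need to resolve the glob to a single file. A minimal sketch follows, assuming the build/libs layout shown above; find_shadow_jar is a hypothetical name, and this is not necessarily how the project's own scripts locate JARs.

```shell
# Hypothetical helper (not part of easy-db-lab): print the most recently
# modified shadow JAR for a module, or fail if none has been built yet.
# $1 = repository root, $2 = module directory name under spark/
find_shadow_jar() {
  jar=$(ls -t "$1/spark/$2/build/libs/$2"-*.jar 2>/dev/null | head -n 1)
  if [ -z "$jar" ]; then
    echo "no JAR found for $2; run ./gradlew :spark:$2:shadowJar" >&2
    return 1
  fi
  printf '%s\n' "$jar"
}
```

From the repository root, find_shadow_jar . bulk-writer-sidecar prints the newest matching JAR path, or returns a non-zero status with a hint when the module has not been built.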
Running Tests
Main project tests exclude bulk-writer modules to avoid requiring cassandra-analytics:
./gradlew :test
Testing with a Live Cluster
Using bin/spark-bulk-write
This script handles JAR lookup, host resolution, and health checks:
# From a cluster directory (where state.json exists)
spark-bulk-write direct --rows 10000
spark-bulk-write s3 --rows 1000000 --parallelism 20
spark-bulk-write connector --keyspace myks --table mytable
Using bin/submit-direct-bulk-writer
Simplified script for direct bulk writer testing:
bin/submit-direct-bulk-writer [rowCount] [parallelism] [partitionCount] [replicationFactor]
Manual Spark Job Submission
All modules use unified spark.easydblab.* configuration:
easy-db-lab spark submit \
--jar spark/bulk-writer-sidecar/build/libs/bulk-writer-sidecar-*.jar \
--main-class com.rustyrazorblade.easydblab.spark.DirectBulkWriter \
--conf spark.easydblab.contactPoints=host1,host2 \
--conf spark.easydblab.keyspace=bulk_test \
--conf spark.easydblab.localDc=us-west-2 \
--conf spark.easydblab.rowCount=1000 \
--conf spark.easydblab.replicationFactor=1 \
--wait
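Since every job setting shares the spark.easydblab. prefix, the repeated --conf flags can be generated from short key=value pairs. The helper below is a sketch with a hypothetical name (easydblab_conf); it assumes values contain no whitespace, because the result is meant to be expanded unquoted.

```shell
# Hypothetical helper (not part of easy-db-lab): expand short key=value pairs
# into "--conf spark.easydblab.<key>=<value>" arguments, one per line.
easydblab_conf() {
  for pair in "$@"; do
    printf '%s spark.easydblab.%s\n' --conf "$pair"
  done
}

# Print the flags for two of the settings used in the command above.
easydblab_conf keyspace=bulk_test rowCount=1000
```

The output can be spliced into a submission with $(easydblab_conf keyspace=bulk_test rowCount=1000) in place of the hand-written --conf lines.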
Debugging Failed Jobs
When a Spark job fails, easy-db-lab automatically queries logs and displays failure details.
Manual Log Retrieval
easy-db-lab spark logs --step-id <step-id>
easy-db-lab spark status --step-id <step-id>
easy-db-lab spark jobs
Direct S3 Access
Logs are stored at: s3://<bucket>/spark/emr-logs/<cluster-id>/steps/<step-id>/
aws s3 cp s3://<bucket>/spark/emr-logs/<cluster-id>/steps/<step-id>/stderr.gz - | gunzip
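The log path follows a fixed pattern, so it can be assembled from the bucket, cluster ID, and step ID. A minimal sketch, with a hypothetical function name (emr_step_log_uri) and stderr.gz as the default file, as in the command above:

```shell
# Hypothetical helper (not part of easy-db-lab): build the S3 URI for one
# EMR step log file.
# $1 = bucket, $2 = cluster ID, $3 = step ID, $4 = file name (default stderr.gz)
emr_step_log_uri() {
  printf 's3://%s/spark/emr-logs/%s/steps/%s/%s\n' "$1" "$2" "$3" "${4:-stderr.gz}"
}
```

With real identifiers substituted for the placeholders, the result can be streamed the same way: aws s3 cp "$(emr_step_log_uri <bucket> <cluster-id> <step-id>)" - | gunzip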
Adding a New Spark Module
- Create a directory under spark/ (e.g., spark/bulk-reader/)
- Add build.gradle.kts, using an existing module as a template
- Add include "spark:bulk-reader" to settings.gradle
- Depend on :spark:common for shared config
- Use SparkJobConfig.load(sparkConf) for configuration
- Implement your main class and submit via easy-db-lab spark submit
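The first three steps are mechanical and can be sketched as a shell function. This is an illustration under assumptions, not a project script: scaffold_spark_module is a hypothetical name, the copied build file is only a starting point that still needs editing, and settings.gradle is assumed to end with a newline.

```shell
# Hypothetical scaffold (not part of easy-db-lab): create a new module
# directory, copy an existing module's build file as a starting point, and
# register the module in settings.gradle if it is not already listed.
# $1 = repository root, $2 = new module name, $3 = template module name
scaffold_spark_module() {
  root=$1; name=$2; template=$3
  mkdir -p "$root/spark/$name" || return 1
  cp "$root/spark/$template/build.gradle.kts" \
     "$root/spark/$name/build.gradle.kts" || return 1
  include_line="include \"spark:$name\""
  # -F: fixed string, -x: match the whole line, -q: no output
  if ! grep -Fxq "$include_line" "$root/settings.gradle"; then
    printf '%s\n' "$include_line" >> "$root/settings.gradle"
  fi
}
```

Running scaffold_spark_module . bulk-reader connector-writer from the repository root is safe to repeat: the include line is appended only once.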
Architecture Notes
Shared Configuration
SparkJobConfig in spark/common provides:
- Property constants (PROP_CONTACT_POINTS, etc.)
- Config loading from SparkConf with validation
- Schema setup via CqlSetup
- Consistent defaults across all modules
Why Shadow JAR?
Bulk-writer modules use the Gradle Shadow plugin because:
- EMR provides Spark, so those dependencies are compileOnly
- Cassandra Analytics has many transitive dependencies
- mergeServiceFiles() properly handles META-INF/services for SPI
Cassandra Analytics Modules
Some cassandra-analytics modules aren't published to Maven:
- five-zero.jar: Cassandra 5.0 bridge
- five-zero-bridge.jar: Bridge implementation
- five-zero-types.jar: Type converters
- five-zero-sparksql.jar: SparkSQL integration
These are referenced directly from .cassandra-analytics/ build output.
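Because these JARs come from a local build rather than a repository, a missing file usually means the one-time setup has not run or did not finish. The check below is a sketch with a hypothetical name (missing_analytics_jars); it searches the whole checkout because the exact build output subdirectories are not specified here.

```shell
# Hypothetical check (not part of easy-db-lab): print each of the unpublished
# JARs named above that cannot be found anywhere under the given directory.
# $1 = cassandra-analytics checkout (default: .cassandra-analytics)
missing_analytics_jars() {
  dir=${1:-.cassandra-analytics}
  for jar in five-zero.jar five-zero-bridge.jar \
             five-zero-types.jar five-zero-sparksql.jar; do
    if [ -z "$(find "$dir" -type f -name "$jar" 2>/dev/null | head -n 1)" ]; then
      printf '%s\n' "$jar"
    fi
  done
}
```

Empty output means all four files were found; any names printed suggest re-running bin/build-cassandra-analytics, with --force if necessary.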