Spark Development

This guide covers developing and testing Spark-related functionality in easy-db-lab.

Project Structure

All Spark modules live under spark/ and share configuration through spark/common/:

  • spark/common/ — Shared config (SparkJobConfig), data generation (BulkTestDataGenerator), CQL setup
  • spark/bulk-writer-sidecar/ — Cassandra Analytics, direct sidecar transport (DirectBulkWriter)
  • spark/bulk-writer-s3-iam/ — Cassandra Analytics, S3 staging transport with IAM credentials (IamBulkWriter)
  • spark/connector-writer/ — Standard Spark Cassandra Connector (StandardConnectorWriter)
  • spark/connector-read-write/ — Read→transform→write example (KeyValuePrefixCount)

Gradle modules use nested paths: :spark:common, :spark:bulk-writer-sidecar, etc.
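
The mapping from a module directory to its Gradle project path is mechanical. As a minimal sketch (gradle_path is a hypothetical helper for illustration, not something shipped in the repository):

```shell
# Hypothetical helper: turn a module directory into its Gradle project path.
gradle_path() {
  # Strip one trailing slash, swap "/" for ":", then add the leading ":".
  printf ':%s\n' "$(printf '%s' "${1%/}" | tr '/' ':')"
}

gradle_path spark/common/                # prints :spark:common
gradle_path spark/bulk-writer-sidecar    # prints :spark:bulk-writer-sidecar
```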

Prerequisites

The bulk-writer modules depend on Apache Cassandra Analytics, which requires JDK 11 to build.
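
Because the build needs JDK 11 specifically, it can help to confirm which JDK is active before running the setup script. The following is a sketch only; java_major and current_java_version are hypothetical helpers, not part of easy-db-lab:

```shell
# Hypothetical helper: extract the major version from a Java version string.
# Handles both the legacy "1.8.0_392" scheme and the modern "11.0.21" scheme.
java_major() {
  case "$1" in
    1.*) v="${1#1.}"; printf '%s\n' "${v%%[!0-9]*}" ;;
    *)   printf '%s\n' "${1%%[!0-9]*}" ;;
  esac
}

# Hypothetical helper: version string of whichever `java` is first on PATH.
# `java -version` writes to stderr, hence the 2>&1.
current_java_version() {
  java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1
}

# Example check before running bin/build-cassandra-analytics:
#   [ "$(java_major "$(current_java_version)")" = 11 ] || echo "switch to JDK 11" >&2
```

How you switch JDKs (JAVA_HOME, a version manager, and so on) depends on your environment and is not prescribed by this guide.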

One-Time Setup

bin/build-cassandra-analytics

Options:

  • --force - Rebuild even if already built
  • --repo <url> - Clone from a different repo (e.g., a fork)
  • --branch <branch> - Use a specific branch (default: trunk)

The default repo and branch are read from spark/cassandra-analytics-source.properties. CLI flags override the file. To switch to a different repo, update that file and re-run with --force to re-clone.
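
The precedence described above (CLI flag first, properties file second) can be sketched as follows. This is an illustration, not the script's actual code: read_prop and resolve_setting are hypothetical helpers, and the key names repo and branch are assumptions, so check spark/cassandra-analytics-source.properties for the real keys.

```shell
# Hypothetical helper: print the value of KEY ($2) from a properties FILE ($1).
# Assumes simple "key=value" lines; the last definition wins.
read_prop() {
  sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" | tail -n 1
}

# Hypothetical helper: a non-empty CLI value wins, otherwise use the file.
# $1 = value from the CLI flag (may be empty), $2 = file, $3 = key
resolve_setting() {
  if [ -n "$1" ]; then
    printf '%s\n' "$1"
  else
    read_prop "$2" "$3"
  fi
}

# Example (key name "branch" is an assumption):
#   resolve_setting "$cli_branch" spark/cassandra-analytics-source.properties branch
```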

Getting the latest code from the remote: The script skips all git operations if Maven artifacts already exist (the common case), and it does not check whether the clone is up to date. To pull the latest code from the configured repo and branch, either re-run the script with --force, or run cd .cassandra-analytics && git pull manually and then rebuild the Spark modules.

Building

# Build all Spark modules
./gradlew :spark:bulk-writer-sidecar:shadowJar :spark:bulk-writer-s3-iam:shadowJar \
  :spark:connector-writer:shadowJar :spark:connector-read-write:shadowJar

# Build individually
./gradlew :spark:bulk-writer-sidecar:shadowJar
./gradlew :spark:connector-writer:shadowJar

# Output locations
ls spark/bulk-writer-sidecar/build/libs/bulk-writer-sidecar-*.jar
ls spark/connector-writer/build/libs/connector-writer-*.jar

Shadow JARs include all dependencies except Spark (provided by EMR).

Running Tests

The main project's tests exclude the bulk-writer modules, so they run without a local cassandra-analytics build:

./gradlew :test

Testing with a Live Cluster

Using bin/spark-bulk-write

This script handles JAR lookup, host resolution, and health checks:

# From a cluster directory (where state.json exists)
spark-bulk-write direct --rows 10000
spark-bulk-write s3 --rows 1000000 --parallelism 20
spark-bulk-write connector --keyspace myks --table mytable

Using bin/submit-direct-bulk-writer

Simplified script for direct bulk writer testing:

bin/submit-direct-bulk-writer [rowCount] [parallelism] [partitionCount] [replicationFactor]

Manual Spark Job Submission

All modules use unified spark.easydblab.* configuration:

easy-db-lab spark submit \
    --jar spark/bulk-writer-sidecar/build/libs/bulk-writer-sidecar-*.jar \
    --main-class com.rustyrazorblade.easydblab.spark.DirectBulkWriter \
    --conf spark.easydblab.contactPoints=host1,host2 \
    --conf spark.easydblab.keyspace=bulk_test \
    --conf spark.easydblab.localDc=us-west-2 \
    --conf spark.easydblab.rowCount=1000 \
    --conf spark.easydblab.replicationFactor=1 \
    --wait
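
Since every job setting shares the spark.easydblab. prefix, the repeated --conf flags can be generated from short key=value pairs. A minimal sketch, where easydblab_conf is a hypothetical helper and not part of the CLI:

```shell
# Hypothetical helper: expand key=value pairs into spark.easydblab.* --conf
# flags, printing one token per line.
easydblab_conf() {
  for kv in "$@"; do
    printf '%s\n' "--conf" "spark.easydblab.$kv"
  done
}

# Example (relies on word splitting, so values must not contain spaces):
#   easy-db-lab spark submit \
#       --jar spark/bulk-writer-sidecar/build/libs/bulk-writer-sidecar-*.jar \
#       --main-class com.rustyrazorblade.easydblab.spark.DirectBulkWriter \
#       $(easydblab_conf contactPoints=host1,host2 keyspace=bulk_test rowCount=1000) \
#       --wait
```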

Debugging Failed Jobs

When a Spark job fails, easy-db-lab automatically queries logs and displays failure details.

Manual Log Retrieval

easy-db-lab spark logs --step-id <step-id>
easy-db-lab spark status --step-id <step-id>
easy-db-lab spark jobs

Direct S3 Access

Logs are stored at: s3://<bucket>/spark/emr-logs/<cluster-id>/steps/<step-id>/

aws s3 cp s3://<bucket>/spark/emr-logs/<cluster-id>/steps/<step-id>/stderr.gz - | gunzip
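
If you fetch logs this way often, the path layout above can be wrapped in small functions. This is a sketch; step_log_prefix and show_step_stderr are hypothetical names, and the bucket, cluster ID, and step ID must come from your own environment:

```shell
# Hypothetical helper: build the S3 prefix for one EMR step's logs.
# $1 = bucket, $2 = cluster id, $3 = step id
step_log_prefix() {
  printf 's3://%s/spark/emr-logs/%s/steps/%s/\n' "$1" "$2" "$3"
}

# Hypothetical helper: stream a step's stderr to the terminal.
# Requires the AWS CLI and credentials that can read the bucket.
show_step_stderr() {
  aws s3 cp "$(step_log_prefix "$1" "$2" "$3")stderr.gz" - | gunzip
}
```

Running aws s3 ls against the same prefix shows which log files exist for the step.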

Adding a New Spark Module

  1. Create a directory under spark/ (e.g., spark/bulk-reader/)
  2. Add build.gradle.kts — use an existing module as a template
  3. Add include "spark:bulk-reader" to settings.gradle
  4. Depend on :spark:common for shared config
  5. Use SparkJobConfig.load(sparkConf) for configuration
  6. Implement your main class and submit via easy-db-lab spark submit
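
Steps 1 to 3 above can be sketched as a small scaffold function. This is an illustration under stated assumptions: new_spark_module is a hypothetical helper, it must be run from the repository root, and it assumes the template module keeps its build file at spark/<module>/build.gradle.kts.

```shell
# Hypothetical scaffold for steps 1-3; run from the repository root.
# $1 = new module name, $2 = existing module whose build file is copied.
new_spark_module() {
  name=$1
  template=$2
  mkdir -p "spark/$name" || return 1
  cp "spark/$template/build.gradle.kts" "spark/$name/build.gradle.kts" || return 1
  # Register the module in settings.gradle only if it is not listed yet.
  grep -qF "include \"spark:$name\"" settings.gradle ||
    printf 'include "spark:%s"\n' "$name" >> settings.gradle
}

# Example:
#   new_spark_module bulk-reader connector-writer
```

The copied build file still needs manual edits (dependencies, main class, and the :spark:common dependency from step 4) before the module builds.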

Architecture Notes

Shared Configuration

SparkJobConfig in spark/common provides:

  • Property constants (PROP_CONTACT_POINTS, etc.)
  • Config loading from SparkConf with validation
  • Schema setup via CqlSetup
  • Consistent defaults across all modules

Why Shadow JAR?

Bulk-writer modules use the Gradle Shadow plugin because:

  1. EMR provides Spark, so those dependencies are compileOnly
  2. Cassandra Analytics has many transitive dependencies
  3. mergeServiceFiles() properly handles META-INF/services for SPI
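
To see why point 3 matters: several dependency JARs can ship a META-INF/services file with the same name, and a naive fat JAR keeps only one of them. The shell sketch below imitates the merge on unpacked JAR directories. It is an illustration of the idea, not the Shadow plugin's implementation, and merge_service_files is a hypothetical name.

```shell
# Illustration only: merge META-INF/services files from several unpacked JAR
# directories into OUT, appending same-named files instead of overwriting.
# $1 = output directory, remaining args = unpacked JAR directories
merge_service_files() {
  out=$1
  shift
  mkdir -p "$out/META-INF/services"
  for dir in "$@"; do
    for f in "$dir"/META-INF/services/*; do
      [ -f "$f" ] || continue
      # awk 1 guarantees a trailing newline, so entries never run together.
      awk 1 "$f" >> "$out/META-INF/services/$(basename "$f")"
    done
  done
}
```

After merging, each service file lists the providers from every JAR, which is what java.util.ServiceLoader needs to discover all of them.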

Cassandra Analytics Modules

Some cassandra-analytics modules aren't published to Maven:

  • five-zero.jar - Cassandra 5.0 bridge
  • five-zero-bridge.jar - Bridge implementation
  • five-zero-types.jar - Type converters
  • five-zero-sparksql.jar - SparkSQL integration

These JARs are referenced directly from the build output under .cassandra-analytics/.