easy-db-lab

easy-db-lab creates lab environments for database evaluations in AWS. It provisions infrastructure, deploys databases, and sets up a full observability stack so you can focus on testing, benchmarking, and learning.

Supported databases include Apache Cassandra, ClickHouse, and OpenSearch, with Apache Spark available for analytics workloads.

If you are looking for a tool to aid in stress testing Cassandra clusters, see the companion project cassandra-easy-stress.

If you're looking for tools to help manage Cassandra in production environments, please see Reaper, cstar, and K8ssandra.

Quick Start

  1. Install easy-db-lab
  2. Set up your profile - Run easy-db-lab setup-profile
  3. Follow the tutorial

Features

Database Support

  • Apache Cassandra: Versions 3.0, 3.11, 4.0, 4.1, 5.0, and trunk builds. Includes custom build support, Cassandra Sidecar, and integration with cassandra-easy-stress for benchmarking.
  • ClickHouse: Sharded clusters with configurable replication, distributed tables, and S3-tiered storage.
  • OpenSearch: AWS OpenSearch domains for search and analytics.
  • Apache Spark: EMR-based Spark clusters for analytics workloads.

AWS Integration

  • EC2 Provisioning: Automated provisioning with configurable instance types
  • EBS Storage: Optional EBS volumes for persistent storage
  • S3 Backup: Automatic backup of configurations and state to S3
  • IAM Integration: Managed IAM policies for secure operations

Kubernetes (K3s)

  • Lightweight K3s: Automatic K3s cluster deployment across all nodes
  • kubectl/k9s: Pre-configured access with SOCKS5 proxy support
  • Private Registry: HTTPS Docker registry for custom images
  • Jib Integration: Push custom containers directly from Gradle

Monitoring and Observability

  • VictoriaMetrics: Time-series database for metrics storage
  • VictoriaLogs: Centralized log aggregation
  • Grafana: Pre-configured dashboards for Cassandra, ClickHouse, and system metrics
  • OpenTelemetry: Distributed tracing and metrics collection
  • AxonOps: Optional integration with AxonOps for Cassandra monitoring and management

Developer Experience

  • Shell Aliases: Convenient shortcuts for cluster management (c0, c-all, c-status, etc.)
  • Server: Integration with Claude Code for AI-assisted operations
  • Restore Support: Recover cluster state from VPC ID or S3 backup
  • SOCKS5 Proxy: Secure access to private cluster resources

Stress Testing

  • cassandra-easy-stress: Native integration with the companion stress testing tool
  • Kubernetes Jobs: Run stress tests as K8s jobs for scalability
  • Artifact Collection: Automatic collection of metrics and diagnostics

Prerequisites

Before using easy-db-lab, ensure you have the following:

System Requirements

| Requirement | Details |
|---|---|
| Operating System | macOS or Linux |
| Java | JDK 21 or later |
| Docker | Required for building custom AMIs |

AWS Requirements

  • AWS Account: A dedicated AWS account is recommended for lab environments
  • AWS Access Key & Secret: Credentials for programmatic access
  • IAM Permissions: Permissions to create EC2, IAM, S3, and optionally EMR resources

Tip

Run easy-db-lab show-iam-policies to see the exact IAM policies required with your account ID populated. See Setup for details.

Optional

  • AxonOps Account: For free Cassandra monitoring. Create an account at axonops.com

Next Steps

Run the interactive setup to configure your profile:

easy-db-lab setup-profile

See the Setup Guide for detailed instructions.

Installation

Tarball Install

You can grab a tarball from the releases page.

To get started, add the bin directory of easy-db-lab to your $PATH. For example:

export PATH="$PATH:/path/to/easy-db-lab/bin"

Building from Source

If you prefer to build from source:

git clone https://github.com/rustyrazorblade/easy-db-lab.git
cd easy-db-lab
./gradlew assemble

The built distribution will be in build/distributions/.

Setup

This guide walks you through the initial setup of easy-db-lab, including AWS credentials configuration, IAM policies, and AMI creation.

Overview

The setup-profile command handles all initial configuration interactively. It will:

  1. Collect your email and AWS credentials
  2. Validate your AWS access
  3. Create necessary AWS resources (key pair, IAM roles, Packer VPC)
  4. Build or validate the required AMI

Prerequisites

Before running setup:

  • AWS Account: An AWS account with appropriate permissions (see IAM Policies below)
  • Java 21+: Required to run easy-db-lab
  • Docker: Required only if building custom AMIs

Step 1: Run Setup Profile

Run the interactive setup:

easy-db-lab setup-profile

Or use the shorter alias:

easy-db-lab setup

The setup wizard will prompt you for:

| Prompt | Description | Default |
|---|---|---|
| Email | Used to tag AWS resources for ownership | (required) |
| AWS Region | Region for your clusters | us-west-2 |
| AWS Access Key | Your AWS access key ID | (required) |
| AWS Secret Key | Your AWS secret access key | (required) |
| AxonOps Org | Optional: AxonOps organization name | (skip) |
| AxonOps Key | Optional: AxonOps API key | (skip) |
| AWS Profile | Optional: Named AWS profile | (skip) |

What Gets Created

During setup, the following AWS resources are created:

  • EC2 Key Pair: For SSH access to instances
  • IAM Role: For instance permissions (easy-db-lab-instance-role)
  • Packer VPC: Infrastructure for building AMIs
  • AMI (if needed): Takes 10-15 minutes to build

Configuration Location

Your profile is saved to:

~/.easy-db-lab/profiles/default/settings.yaml

Tip

Use a different profile by setting the EASY_DB_LAB_PROFILE environment variable before running setup.

Step 2: Getting IAM Policies

If you need to request permissions from your AWS administrator, use the show-iam-policies command to display the required policies with your account ID populated:

easy-db-lab show-iam-policies

This displays three policies:

| Policy | Purpose |
|---|---|
| EC2 | Create/manage EC2 instances, VPCs, security groups |
| IAM | Create instance roles and profiles |
| EMR | Create Spark clusters (optional) |

Filter by Policy Name

To show a specific policy:

easy-db-lab show-iam-policies ec2    # Show EC2 policy only
easy-db-lab show-iam-policies iam    # Show IAM policy only
easy-db-lab show-iam-policies emr    # Show EMR policy only

For teams with multiple users, we recommend creating managed policies attached to an IAM group:

  1. Create an IAM group (e.g., "EasyDBLabUsers")
  2. Create three managed policies from the JSON output
  3. Attach all policies to the group
  4. Add users to the group

Warning

Inline policies have a 5,120-byte limit, which may not fit all three policies. Use managed policies instead.

Step 3: Build Custom AMI (Optional)

If setup couldn't find a valid AMI for your architecture, or if you want to customize the base image, build one manually:

easy-db-lab build-image

Build Options

| Option | Description | Default |
|---|---|---|
| `--arch` | CPU architecture (AMD64 or ARM64) | AMD64 |
| `--region` | AWS region for the AMI | (from profile) |

Examples

# Build AMD64 AMI (default)
easy-db-lab build-image

# Build ARM64 AMI for Graviton instances
easy-db-lab build-image --arch ARM64

# Build in specific region
easy-db-lab build-image --region eu-west-1

Note

Building an AMI takes approximately 10-15 minutes. Docker must be installed and running.

Environment Variables

| Variable | Description | Default |
|---|---|---|
| `EASY_DB_LAB_USER_DIR` | Override configuration directory | `~/.easy-db-lab` |
| `EASY_DB_LAB_PROFILE` | Use a named profile | default |
| `EASY_DB_LAB_INSTANCE_TYPE` | Default instance type for init | r3.2xlarge |
| `EASY_DB_LAB_STRESS_INSTANCE_TYPE` | Default stress instance type | c7i.2xlarge |
| `EASY_DB_LAB_AMI` | Override AMI ID | (auto-detected) |

Verify Installation

After setup completes, verify by running:

easy-db-lab

You should see the help output with available commands.

Next Steps

Once setup is complete, follow the Tutorial to create your first cluster.

Cluster Setup

This page provides a quick reference for cluster initialization and provisioning. For a complete walkthrough, see the Tutorial.

Quick Start

# Initialize a 3-node cluster with i4i.xlarge instances and 1 stress node
easy-db-lab init my-cluster --db 3 --instance i4i.xlarge --app 1

# Provision AWS infrastructure
easy-db-lab up

# Set up your shell environment
source env.sh

Or combine init and up:

easy-db-lab init my-cluster --db 3 --instance i4i.xlarge --app 1 --up

Initialize

The init command creates local configuration files but does not provision AWS resources.

easy-db-lab init <cluster-name> [options]

Common Options

| Option | Description | Default |
|---|---|---|
| `--db, -c` | Number of Cassandra instances | 3 |
| `--stress, -s` | Number of stress instances | 0 |
| `--instance, -i` | Instance type | r3.2xlarge |
| `--ebs.type` | EBS volume type (NONE, gp2, gp3) | NONE |
| `--ebs.size` | EBS volume size in GB | 256 |
| `--arch, -a` | CPU architecture (AMD64, ARM64) | AMD64 |
| `--up` | Auto-provision after init | false |

For the complete options list, see the Tutorial or run easy-db-lab init --help.

Storage Requirements

Database instances need a data disk separate from the root volume. This can come from either:

  • Instance store (local NVMe) — storage-optimized families (e.g., i3.xlarge, i4i.xlarge) and instance types with a d suffix (e.g., m5d.xlarge, c5d.2xlarge) include local NVMe storage.
  • EBS volumes — Attach an EBS volume using --ebs.type (e.g., --ebs.type gp3).

If the selected instance type has no instance store and --ebs.type is not specified, up will fail with an error. For example, c5.2xlarge has no local storage, so you must specify EBS:

easy-db-lab init my-cluster --instance c5.2xlarge --ebs.type gp3 --ebs.size 200

Launch

The up command provisions all AWS infrastructure:

easy-db-lab up

What Gets Created

  • S3 bucket for cluster state
  • VPC with subnets and security groups
  • EC2 instances (Cassandra, Stress, Control nodes)
    • Control node: m5d.xlarge (NVMe-backed instance; K3s data is stored on NVMe to avoid filling the root volume)
  • K3s cluster across all nodes (Cassandra, Stress, Control)

Options

| Option | Description |
|---|---|
| `--no-setup, -n` | Skip K3s and AxonOps setup |

Shut Down

Destroy all cluster infrastructure:

easy-db-lab down

Next Steps

After your cluster is running:

  1. Configure Cassandra - Select version and apply configuration
  2. Shell Aliases - Set up convenient shortcuts

Tutorial: Getting Started

This tutorial walks you through creating a database cluster from scratch, covering initialization, infrastructure provisioning, and database configuration. The examples below use Cassandra, but the same infrastructure supports ClickHouse, OpenSearch, and Spark.

Prerequisites

Before starting, ensure you've completed the Setup process by running easy-db-lab setup-profile.

Part 1: Initialize Your Cluster

The init command creates local configuration files for your cluster. It does not provision AWS resources yet.

easy-db-lab init my-cluster

This creates a 3-node Cassandra cluster by default.

Init Options

| Option | Description | Default |
|---|---|---|
| `--db, --cassandra, -c` | Number of Cassandra instances | 3 |
| `--app, --stress, -s` | Number of stress/application instances | 0 |
| `--instance, -i` | Cassandra instance type | r3.2xlarge |
| `--stress-instance, -si` | Stress instance type | c7i.2xlarge |
| `--azs, -z` | Availability zones (e.g., a,b,c) | all available |
| `--arch, -a` | CPU architecture (AMD64, ARM64) | AMD64 |
| `--ebs.type` | EBS volume type (NONE, gp2, gp3, io1, io2) | NONE |
| `--ebs.size` | EBS volume size in GB | 256 |
| `--ebs.iops` | EBS IOPS (gp3 only) | 0 |
| `--ebs.throughput` | EBS throughput (gp3 only) | 0 |
| `--until` | When instances can be deleted | tomorrow |
| `--tag` | Custom tags (key=value, repeatable) | - |
| `--vpc` | Use existing VPC ID | - |
| `--up` | Auto-provision after init | false |
| `--clean` | Remove existing config first | false |

Examples

Basic 3-node cluster:

easy-db-lab init my-cluster

5-node cluster with 2 stress nodes:

easy-db-lab init my-cluster --db 5 --stress 2

Production-like cluster with EBS storage:

easy-db-lab init prod-test --db 5 --ebs.type gp3 --ebs.size 500 --ebs.iops 3000

ARM64 cluster for Graviton instances:

easy-db-lab init my-cluster --arch ARM64 --instance r7g.2xlarge

Initialize and provision in one step:

easy-db-lab init my-cluster --up

Part 2: Launch Infrastructure

Once initialized, provision the AWS infrastructure:

easy-db-lab up

This command creates:

  • S3 Storage: Cluster data stored under a dedicated prefix in the account S3 bucket
  • VPC: With subnets and security groups
  • EC2 Instances: Cassandra nodes, stress nodes, and a control node
  • K3s Cluster: Lightweight Kubernetes across all nodes

What Happens During up

  1. Configures account S3 bucket with cluster prefix
  2. Creates VPC with public subnets in your availability zones
  3. Provisions EC2 instances in parallel
  4. Waits for SSH availability
  5. Configures K3s cluster on all nodes
  6. Writes SSH config and environment files

Up Options

| Option | Description |
|---|---|
| `--no-setup, -n` | Skip K3s setup and AxonOps configuration |

Environment Setup

After up completes, source the environment file:

source env.sh

This configures your shell with:

  • SSH shortcuts: ssh db0, ssh db1, ssh stress0, etc.
  • Cluster aliases: c0, c-all, c-status
  • SOCKS proxy configuration

See Shell Aliases for all available shortcuts.

Part 3: Configure Cassandra 5.0

With infrastructure running, configure and start Cassandra.

Step 1: Select Cassandra Version

easy-db-lab cassandra use 5.0

This command:

  • Sets the active Cassandra version on all nodes
  • Downloads configuration files to your local directory
  • Applies any existing patch configuration

Available versions: 3.0, 3.11, 4.0, 4.1, 5.0, 5.0-HEAD, trunk

Step 2: Customize Configuration (Optional)

Edit cassandra.patch.yaml to customize settings:

# Example: Change token count
vim cassandra.patch.yaml

Common customizations:

| Setting | Description | Default |
|---|---|---|
| `num_tokens` | Virtual nodes per instance | 4 |
| `concurrent_reads` | Max concurrent read operations | 64 |
| `concurrent_writes` | Max concurrent write operations | 64 |
| `endpoint_snitch` | Network topology snitch | Ec2Snitch |
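
Putting the settings above into practice, a minimal cassandra.patch.yaml might look like the sketch below. The values shown are the defaults from the table; the file contents here are illustrative, not generated output:

```yaml
# cassandra.patch.yaml -- contains only the settings you want to override
num_tokens: 4
concurrent_reads: 64
concurrent_writes: 64
endpoint_snitch: Ec2Snitch
```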

Step 3: Apply Configuration

easy-db-lab cassandra update-config

This uploads and applies the patch to all Cassandra nodes.

To apply and restart Cassandra in one command:

easy-db-lab cassandra update-config --restart

Step 4: Start Cassandra

easy-db-lab cassandra start

Step 5: Verify Cluster

Check cluster status:

ssh db0 nodetool status

Or use the shell alias (after sourcing env.sh):

c-status

You should see all nodes in UN (Up/Normal) state.

Part 4: Working with Your Cluster

SSH Access

After sourcing env.sh:

ssh db0          # First Cassandra node
ssh db1          # Second Cassandra node
ssh stress0      # First stress node (if provisioned)
ssh control0     # Control node

Cassandra Management

# Stop Cassandra on all nodes
easy-db-lab cassandra stop

# Start Cassandra on all nodes
easy-db-lab cassandra start

# Restart Cassandra on all nodes
easy-db-lab cassandra restart

Filter to Specific Hosts

Most commands support the --hosts filter:

# Apply config only to db0 and db1
easy-db-lab cassandra update-config --hosts db0,db1

# Restart only db2
easy-db-lab cassandra restart --hosts db2

Download Configuration Files

To download the current configuration from nodes:

easy-db-lab cassandra download-config

This saves configuration files to a local directory named after the version (e.g., 5.0/).

Part 5: Shut Down

When finished, destroy the cluster infrastructure:

easy-db-lab down

Warning

This permanently destroys all EC2 instances, the VPC, and associated resources. S3 data under the cluster prefix is scheduled for expiration (default: 1 day).

Quick Reference

| Task | Command |
|---|---|
| Initialize cluster | `easy-db-lab init <name> [options]` |
| Provision infrastructure | `easy-db-lab up` |
| Initialize and provision | `easy-db-lab init <name> --up` |
| Select Cassandra version | `easy-db-lab cassandra use <version>` |
| Apply configuration | `easy-db-lab cassandra update-config` |
| Start Cassandra | `easy-db-lab cassandra start` |
| Stop Cassandra | `easy-db-lab cassandra stop` |
| Restart Cassandra | `easy-db-lab cassandra restart` |
| Check cluster status | `ssh db0 nodetool status` |
| Download config | `easy-db-lab cassandra download-config` |
| Destroy cluster | `easy-db-lab down` |
| Display hosts | `easy-db-lab hosts` |
| Clean local files | `easy-db-lab clean` |

Next Steps

Continue with the Kubernetes guide to learn how to access and use the K3s cluster.

Kubernetes

easy-db-lab uses K3s to provide a lightweight Kubernetes cluster for deploying supporting services like ClickHouse, monitoring, and stress testing workloads.

Overview

K3s is automatically installed on all nodes during provisioning:

  • Control node: Runs the K3s server (Kubernetes control plane)
  • Cassandra nodes: Run as K3s agents with label type=db
  • Stress nodes: Run as K3s agents with label type=app

Accessing the Cluster

kubectl

After running source env.sh, kubectl is automatically configured:

source env.sh
kubectl get nodes
kubectl get pods -A

The kubeconfig is downloaded to your working directory and kubectl is configured to use the SOCKS5 proxy for connectivity.

k9s

k9s provides a terminal-based UI for Kubernetes:

source env.sh
k9s

k9s is pre-configured to use the correct kubeconfig and proxy settings.

Port Forwarding

easy-db-lab uses a SOCKS5 proxy for accessing the private Kubernetes cluster.

Starting the Proxy

The proxy starts automatically when you source the environment:

source env.sh

Manual Proxy Control

# Start the SOCKS5 proxy
start-socks5

# Check proxy status
socks5-status

# Stop the proxy
stop-socks5

Running Commands Through the Proxy

Commands like kubectl and k9s automatically use the proxy. For other commands:

# Route any command through the proxy
with-proxy curl http://10.0.1.50:8080/api
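
Conceptually, a with-proxy-style wrapper just runs its argument with the proxy environment set. The sketch below is illustrative only (the function name with_proxy_demo is an assumption for demonstration; the real with-proxy comes from env.sh), using the default proxy port 1080:

```shell
# Illustrative sketch of a with-proxy-style wrapper; not the env.sh implementation.
# Assumes the SOCKS5 proxy is listening on localhost:1080 (the default).
with_proxy_demo() {
  ALL_PROXY=socks5://localhost:1080 "$@"
}

# Any wrapped command inherits the proxy environment:
with_proxy_demo sh -c 'echo "proxy is $ALL_PROXY"'
```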

Pushing Docker Images with Jib

easy-db-lab includes a private Docker registry accessible via HTTPS. You can push custom images using Jib.

Gradle Configuration

Add Jib to your build.gradle.kts:

plugins {
    id("com.google.cloud.tools.jib") version "3.4.0"
}

jib {
    from {
        image = "eclipse-temurin:21-jre"
    }
    to {
        // Use the control node's registry
        image = "control0:5000/my-app"
        tags = setOf("latest", project.version.toString())
    }
    container {
        mainClass = "com.example.MainKt"
    }
}

Pushing to the Registry

# Build and push to the cluster registry
./gradlew jib

# Or build locally first
./gradlew jibDockerBuild

Using Images in Kubernetes

Reference your pushed images in Kubernetes manifests:

apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: my-app
    image: control0:5000/my-app:latest

Node Labels

Nodes are automatically labeled for workload scheduling:

| Node Type | Labels |
|---|---|
| Cassandra | `type=db` |
| Stress | `type=app` |
| Control | (no labels) |

Using Node Selectors

Schedule pods on specific node types:

apiVersion: v1
kind: Pod
metadata:
  name: stress-worker
spec:
  nodeSelector:
    type: app
  containers:
  - name: worker
    image: my-stress-tool:latest

Useful Commands

# List all nodes
kubectl get nodes

# List pods in all namespaces
kubectl get pods -A

# Watch pod status
kubectl get pods -w

# View logs
kubectl logs <pod-name>

# Execute command in pod
kubectl exec -it <pod-name> -- /bin/bash

# Port forward a service locally
kubectl port-forward svc/my-service 8080:80

Architecture

Networking

  • K3s server runs on the control node
  • All nodes communicate over the private VPC network
  • External access is via SOCKS5 proxy through the control node

Storage

  • Local path provisioner for persistent volumes
  • Data stored on node-local NVMe drives at /mnt/db1/
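
Workloads can request persistent storage through the local path provisioner. A minimal PersistentVolumeClaim sketch (assuming K3s's standard local-path storage class name):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-path   # K3s local path provisioner default
  resources:
    requests:
      storage: 10Gi
```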

Kubeconfig

The kubeconfig file is:

  • Downloaded automatically during cluster setup
  • Stored as kubeconfig in your working directory
  • Backed up to S3 for recovery

Network Connectivity

This guide covers how to connect to your easy-db-lab cluster from your local machine.

Overview

easy-db-lab clusters run in a private AWS VPC. By default, the VPC uses 10.0.0.0/16, but you can customize this:

easy-db-lab init --cidr 10.14.0.0/20 ...

There are two methods to access your cluster:

| Method | Best For |
|---|---|
| Tailscale VPN (Recommended) | Production use, team sharing, persistent access |
| SOCKS Proxy | Quick testing when you don't want to set up Tailscale |

Tailscale provides a persistent VPN connection to your cluster. Once connected, you can access cluster resources directly—no proxy configuration needed.

Why Tailscale?

  • Native access - Use any tool (browsers, kubectl, ssh) without proxy configuration
  • Persistent - Connection survives terminal sessions
  • Team sharing - Share cluster access with teammates
  • Reliable - No SSH tunnels to maintain or reconnect

Setup (One-Time)

Step 1: Configure Tailscale ACL

Go to Tailscale ACL Editor and add:

{
  "tagOwners": {
    "tag:easy-db-lab": ["autogroup:admin"]
  },
  "autoApprovers": {
    "routes": {
      "10.0.0.0/8": ["tag:easy-db-lab"]
    }
  }
}

The autoApprovers section automatically approves subnet routes, so you don't need to manually approve each cluster.

Step 2: Create OAuth Client

  1. Go to Tailscale OAuth Settings
  2. Click Generate OAuth Client
  3. Configure:
    • Description: easy-db-lab
    • Scopes: Select Devices: Write
    • Tags: Add tag:easy-db-lab
  4. Click Generate and save the Client ID and Client Secret

Step 3: Configure easy-db-lab

easy-db-lab setup-profile

Enter your Tailscale OAuth credentials when prompted.

Usage

Tailscale starts automatically with easy-db-lab up. Once connected:

# Direct access to private IPs
ssh ubuntu@10.0.1.50
curl http://10.0.1.50:9428/health
kubectl get pods

# Web UIs work directly in your browser
# http://10.0.1.50:3000 (Grafana)

Manual Control

easy-db-lab tailscale start
easy-db-lab tailscale status
easy-db-lab tailscale stop

Troubleshooting Tailscale

"requested tags are invalid or not permitted" - Add the tag to your ACL (Step 1).

Can't reach private IPs - Check subnet route is approved in Tailscale admin, or add autoApprovers to your ACL.

Using a custom tag:

easy-db-lab tailscale start --tag tag:my-custom-tag

SOCKS Proxy (Alternative)

If you don't want to set up Tailscale, the SOCKS proxy provides connectivity via an SSH tunnel through the control node.

┌─────────────────┐     SSH Tunnel      ┌──────────────┐
│  Your Machine   │ ──────────────────► │ Control Node │
│  localhost:1080 │                     │  (control0)  │
└────────┬────────┘                     └──────┬───────┘
         │                                     │
    SOCKS5 Proxy                         Private VPC
         │                                     │
         ▼                                     ▼
   kubectl, curl                          VPC network

Quick Start

source env.sh
kubectl get pods
curl http://control0:9428/health

The proxy starts automatically when you load the environment.

Proxied Commands

These commands are automatically configured to use the proxy after source env.sh:

| Command | Description |
|---|---|
| `kubectl` | Kubernetes CLI |
| `k9s` | Kubernetes TUI |
| `curl` | HTTP client |
| `skopeo` | Container image tool |

Manual Proxy Usage

For other commands, use the with-proxy wrapper:

with-proxy wget http://10.0.1.50:8080/api
with-proxy http http://control0:3000/api/health

Browser Access

Configure your browser's SOCKS5 proxy:

| Setting | Value |
|---|---|
| SOCKS Host | localhost |
| SOCKS Port | 1080 |
| SOCKS Version | 5 |

Then access cluster services:

  • Grafana: http://control0:3000
  • Victoria Metrics: http://control0:8428
  • Victoria Logs: http://control0:9428

Proxy Management

start-socks5          # Start proxy
start-socks5 1081     # Start on different port
socks5-status         # Check status
stop-socks5           # Stop proxy

Troubleshooting SOCKS Proxy

"Connection refused" errors:

socks5-status              # Check if running
start-socks5               # Start if needed
ssh control0 hostname      # Verify SSH works

Proxy not working after network change:

stop-socks5
source env.sh

Port already in use:

lsof -i :1080         # Check what's using it
start-socks5 1081     # Use different port

Commands timing out:

  1. Check cluster status: easy-db-lab status
  2. Verify SSH works: ssh control0 hostname
  3. Restart proxy: stop-socks5 && start-socks5

Comparison

| Feature | Tailscale | SOCKS Proxy |
|---|---|---|
| Setup time | ~10 min (one-time) | Instant |
| Persistence | Persistent | Per-session |
| Requires `source env.sh` | No | Yes |
| Browser access | Direct | Requires proxy config |
| Team sharing | Yes | No |
| External dependency | Tailscale account | None |

Shell Aliases

After running source env.sh, you get access to several helpful aliases and functions for managing your cluster.

SSH Aliases

SSH aliases c0 through cN are automatically created for all Cassandra nodes. Each alias wraps ssh, so the ssh command is not required. For example:

c0 nodetool status

This runs nodetool status on the first Cassandra node.

Cluster Management Functions

| Command | Description |
|---|---|
| `c-all` | Executes a command on every node in the cluster sequentially |
| `c-start` | Starts Cassandra on all nodes |
| `c-restart` | Restarts Cassandra on all nodes (not a graceful operation) |
| `c-status` | Executes nodetool status on db0 |
| `c-tpstats` | Executes nodetool tpstats on all nodes |
| `c-collect-artifacts` | Collects metrics, nodetool output, and system information |
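
The cluster functions above are thin shell wrappers over ssh. As a rough sketch of the idea behind c-all (illustrative only; the host list and function body are assumptions, not the actual env.sh contents):

```shell
# Illustrative sketch of a c-all-style helper; not the actual env.sh code.
# Runs the given command on each Cassandra node in order.
c_all_demo() {
  for host in db0 db1 db2; do   # assumes a 3-node cluster
    ssh "$host" "$@"
  done
}

# Usage (requires the cluster's SSH config from env.sh):
# c_all_demo df -h
```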

Examples

Run a command on all nodes

c-all "df -h"

Check cluster status

c-status

Collect artifacts for performance testing

c-collect-artifacts my-test-run

This is useful when doing performance testing to capture the state of the system at a given moment.

Graceful Rolling Restarts

For true rolling restarts, we recommend using cstar instead of c-restart.

Configuring Cassandra

This page covers Cassandra version management and configuration. For a step-by-step walkthrough, see the Tutorial.

Supported Versions

easy-db-lab supports the following Cassandra versions:

| Version | Java | Notes |
|---|---|---|
| 3.0 | 8 | Legacy support |
| 3.11 | 8 | Stable release |
| 4.0 | 11 | First 4.x release |
| 4.1 | 11 | Current LTS |
| 5.0 | 11 | Latest stable (recommended) |
| 5.0-HEAD | 11 | Nightly build from 5.0 branch |
| trunk | 17 | Development branch |

Quick Start

# Select Cassandra 5.0
easy-db-lab cassandra use 5.0

# Generate configuration patch
easy-db-lab cassandra write-config

# Apply configuration and start
easy-db-lab cassandra update-config
easy-db-lab cassandra start

# Verify cluster
ssh db0 nodetool status

Version Management

Select a Version

easy-db-lab cassandra use <version>

Examples:

easy-db-lab cassandra use 5.0       # Latest stable
easy-db-lab cassandra use 4.1       # LTS version
easy-db-lab cassandra use trunk     # Development branch

This command:

  1. Sets the active Cassandra version on all nodes
  2. Downloads current configuration files locally
  3. Applies any existing cassandra.patch.yaml

Specify Java Version

easy-db-lab cassandra use 5.0 --java 11

List Available Versions

easy-db-lab ls

Configuration

The Patch File

Cassandra configuration uses a patch file approach. The cassandra.patch.yaml file contains only the settings you want to customize, which are merged with the default cassandra.yaml.

Generate a new patch file:

easy-db-lab cassandra write-config

Options:

  • -t, --tokens: Number of tokens (default: 4)

Example patch file:

cluster_name: "my-cluster"
num_tokens: 4
concurrent_reads: 64
concurrent_writes: 64
trickle_fsync: true

Auto-Managed Settings — Do Not Include

The following settings are automatically managed by easy-db-lab. Including them in your patch file may cause problems:

  • listen_address, rpc_address — injected with each node's private IP
  • seed_provider / seeds — configured automatically based on cluster topology
  • hints_directory, data_file_directories, commitlog_directory — set based on the cluster's disk configuration

Apply Configuration

easy-db-lab cassandra update-config

Options:

  • --restart, -r: Restart Cassandra after applying
  • --hosts: Filter to specific hosts

Apply and restart in one command:

easy-db-lab cassandra update-config --restart

Download Configuration

Download current configuration files from nodes:

easy-db-lab cassandra download-config

Files are saved to a local directory named after the version (e.g., 5.0/).

Starting and Stopping

# Start on all nodes
easy-db-lab cassandra start

# Stop on all nodes
easy-db-lab cassandra stop

# Restart on all nodes
easy-db-lab cassandra restart

# Target specific hosts
easy-db-lab cassandra start --hosts db0,db1

Cassandra Sidecar

The Apache Cassandra Sidecar is automatically installed and started alongside Cassandra. The sidecar provides:

  • REST API for Cassandra operations
  • S3 import/restore capabilities
  • Streaming data operations
  • Metrics collection (Prometheus-compatible)

Sidecar Access

The sidecar runs on port 9043 on each Cassandra node:

# Check sidecar health
curl http://<cassandra-node-ip>:9043/api/v1/__health

Sidecar Management

The sidecar is managed via systemd:

# Check status
ssh db0 sudo systemctl status cassandra-sidecar

# Restart
ssh db0 sudo systemctl restart cassandra-sidecar

Sidecar Configuration

Configuration is located at /etc/cassandra-sidecar/cassandra-sidecar.yaml on each node. Key settings:

  • Cassandra connection details
  • Data directory paths
  • Traffic shaping and throttling
  • S3 integration settings

Custom Builds

To use a custom Cassandra build from source:

Build from Repository

easy-db-lab cassandra build -n my-build /path/to/cassandra-repo

Use Custom Build

easy-db-lab cassandra use my-build

Next Steps

Continue to the ClickHouse guide to deploy analytics workloads alongside Cassandra.

ClickHouse

easy-db-lab supports deploying ClickHouse clusters on Kubernetes for analytics workloads alongside your Cassandra cluster.

Overview

ClickHouse is deployed as a StatefulSet on K3s with ClickHouse Keeper for distributed coordination. The deployment requires a minimum of 3 nodes.

Quick Start

Create a 6-node cluster and deploy ClickHouse with 2 shards:

# Initialize and start a 6-node cluster
easy-db-lab init my-cluster --db 6 --up

# Deploy ClickHouse (2 shards x 3 replicas)
easy-db-lab clickhouse start

Configuring ClickHouse

Use clickhouse init to configure ClickHouse settings before starting the cluster:

# Configure S3 cache size (default: 10Gi)
easy-db-lab clickhouse init --s3-cache 50Gi

# Disable write-through caching
easy-db-lab clickhouse init --s3-cache-on-write false

| Option | Description | Default |
|---|---|---|
| `--s3-cache` | Size of the local S3 cache | 10Gi |
| `--s3-cache-on-write` | Cache data during write operations | true |
| `--s3-tier-move-factor` | Move data to S3 tier when local disk free space falls below this fraction (0.0-1.0) | 0.2 |
| `--replicas-per-shard` | Number of replicas per shard | 3 |

Configuration is saved to the cluster state and applied when you run clickhouse start.

Starting ClickHouse

To deploy ClickHouse on an existing cluster:

easy-db-lab clickhouse start

Options

| Option | Description | Default |
|---|---|---|
| `--timeout` | Seconds to wait for pods to be ready | 300 |
| `--skip-wait` | Skip waiting for pods to be ready | false |
| `--replicas` | Number of ClickHouse server replicas | Number of db nodes |
| `--replicas-per-shard` | Number of replicas per shard | 3 |

Example with Custom Settings

# 6 nodes with 3 replicas per shard = 2 shards
easy-db-lab clickhouse start --replicas 6 --replicas-per-shard 3

# 9 nodes with 3 replicas per shard = 3 shards
easy-db-lab clickhouse start --replicas 9 --replicas-per-shard 3

Cluster Topology

ClickHouse is deployed with a sharded, replicated architecture. The total number of replicas must be divisible by --replicas-per-shard.

Shard and Replica Assignment

The cluster named easy_db_lab is automatically configured based on your replica count:

| Configuration | Shards | Replicas/Shard | Total Nodes |
|---|---|---|---|
| Default (3 nodes) | 1 | 3 | 3 |
| 6 nodes, 3/shard | 2 | 3 | 6 |
| 9 nodes, 3/shard | 3 | 3 | 9 |
| 6 nodes, 2/shard | 3 | 2 | 6 |
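
The shard counts above are just integer division with the divisibility rule applied first. A quick sketch of the arithmetic (num_shards is an illustrative name, not an easy-db-lab command):

```shell
# Number of shards = replicas / replicas-per-shard;
# replicas must divide evenly, mirroring the constraint above.
num_shards() {
  local replicas=$1 per_shard=$2
  if [ $(( replicas % per_shard )) -ne 0 ]; then
    echo "replicas must be divisible by replicas-per-shard" >&2
    return 1
  fi
  echo $(( replicas / per_shard ))
}

num_shards 6 3   # 2 shards
num_shards 9 3   # 3 shards
```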

Pod-to-Node Pinning

Each ClickHouse pod is pinned to a specific database node using Local PersistentVolumes with node affinity:

  • clickhouse-0 always runs on db0
  • clickhouse-1 always runs on db1
  • clickhouse-N always runs on dbN

This guarantees:

  1. Consistent shard assignment - A pod's shard is calculated from its ordinal: shard = (ordinal / replicas_per_shard) + 1
  2. Data locality - Data stored on a node stays with that node across pod restarts
  3. Predictable performance - No data movement when pods restart

Shard Calculation Example

With 6 replicas and 3 replicas per shard:

| Pod | Ordinal | Shard | Node |
|---|---|---|---|
| clickhouse-0 | 0 | 1 | db0 |
| clickhouse-1 | 1 | 1 | db1 |
| clickhouse-2 | 2 | 1 | db2 |
| clickhouse-3 | 3 | 2 | db3 |
| clickhouse-4 | 4 | 2 | db4 |
| clickhouse-5 | 5 | 2 | db5 |
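The mapping above is just integer arithmetic on the pod ordinal. A short illustrative sketch (not easy-db-lab code) that reproduces the table:

```python
def placement(ordinal: int, replicas_per_shard: int) -> tuple[int, str]:
    """Return (shard, node) for a ClickHouse pod ordinal.

    Shards are 1-based; pod clickhouse-N is pinned to node dbN.
    """
    shard = ordinal // replicas_per_shard + 1
    return shard, f"db{ordinal}"

# 6 replicas, 3 per shard -> shards 1 and 2
for ordinal in range(6):
    shard, node = placement(ordinal, replicas_per_shard=3)
    print(f"clickhouse-{ordinal} -> shard {shard} on {node}")
```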

Checking Status

To check the status of your ClickHouse cluster:

easy-db-lab clickhouse status

This displays:

  • Pod status and health
  • Access URLs for the Play UI and HTTP interface
  • Native protocol connection details

Accessing ClickHouse

After deployment, ClickHouse is accessible via:

| Interface | URL/Port | Description |
|---|---|---|
| Play UI | http://<db-node-ip>:8123/play | Interactive web query interface |
| HTTP API | http://<db-node-ip>:8123 | REST API for queries |
| Native Protocol | <db-node-ip>:9000 | High-performance binary protocol |

Creating Tables

ClickHouse supports distributed, replicated tables that span multiple shards. The recommended pattern uses ReplicatedMergeTree for local replicated storage and Distributed for querying across shards.

Distributed Replicated Tables

Create a local replicated table on all nodes, then a distributed table for queries:

-- Step 1: Create local replicated table on all nodes
CREATE TABLE events_local ON CLUSTER easy_db_lab (
    id UInt64,
    timestamp DateTime,
    event_type String,
    data String
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY (timestamp, id)
SETTINGS storage_policy = 's3_main';

-- Step 2: Create distributed table for querying across all shards
CREATE TABLE events ON CLUSTER easy_db_lab AS events_local
ENGINE = Distributed(easy_db_lab, default, events_local, rand());

Key points:

  • ON CLUSTER easy_db_lab runs the DDL on all nodes
  • {shard} and {replica} are ClickHouse macros automatically set per node
  • ReplicatedMergeTree replicates data within a shard using ClickHouse Keeper
  • Distributed routes queries and inserts across shards
  • rand() distributes inserts randomly; use a column for deterministic sharding
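The difference between rand() and column-based sharding can be sketched in a few lines of Python; crc32 here is a stand-in for ClickHouse's hash functions (e.g. cityHash64), not the real implementation:

```python
import random
import zlib

NUM_SHARDS = 2  # e.g. 6 replicas with 3 per shard

def route_random(_row: dict) -> int:
    # rand(): each insert lands on an arbitrary shard
    return random.randrange(NUM_SHARDS)

def route_by_key(row: dict) -> int:
    # Hashing a column (e.g. cityHash64(id) in ClickHouse) gives
    # deterministic placement: the same id always hits the same shard.
    return zlib.crc32(str(row["id"]).encode()) % NUM_SHARDS

row = {"id": 42}
assert route_by_key(row) == route_by_key(row)  # stable across inserts
assert 0 <= route_random(row) < NUM_SHARDS
```

Deterministic sharding keeps all rows for a key on one shard, which matters if you later want shard-local joins or aggregations by that key.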

Querying and Inserting

-- Insert through distributed table (auto-sharded)
INSERT INTO events VALUES (1, now(), 'click', '{"page": "/home"}');

-- Query across all shards
SELECT count(*) FROM events WHERE event_type = 'click';

-- Query a specific shard (via local table)
SELECT count(*) FROM events_local WHERE event_type = 'click';

Table Engine Comparison

| Engine | Use Case | Replication | Sharding |
|---|---|---|---|
| MergeTree | Single-node, no replication | No | No |
| ReplicatedMergeTree | Replicated within shard | Yes | No |
| Distributed | Query/insert across shards | Via underlying table | Yes |

Storage Policies

ClickHouse is configured with three storage policies. You select a policy when creating a table using the SETTINGS storage_policy clause.

Policy Comparison

| Aspect | local | s3_main | s3_tier |
|---|---|---|---|
| Storage Location | Local NVMe disks | S3 bucket with configurable local cache | Hybrid: starts local, moves to S3 when disk fills |
| Performance | Best latency, highest throughput | Higher latency, cache-dependent | Good initially, degrades as data moves to S3 |
| Capacity | Limited by disk size | Virtually unlimited | Virtually unlimited |
| Cost | Included in instance cost | S3 storage + request costs | S3 storage + request costs |
| Data Persistence | Lost when cluster is destroyed | Persists independently | Persists independently |
| Best For | Benchmarks, low-latency queries | Large datasets, cost-sensitive workloads | Mixed hot/cold workloads with automatic tiering |

Local Storage (local)

The default policy stores data on local NVMe disks attached to the database nodes. This provides the best performance for latency-sensitive workloads.

CREATE TABLE my_table (...)
ENGINE = MergeTree()
ORDER BY id
SETTINGS storage_policy = 'local';

If you omit the storage_policy setting, tables use local storage by default.

When to use local storage:

  • Performance benchmarking where latency matters
  • Temporary or experimental datasets
  • Workloads with predictable data sizes that fit on local disks
  • When you don't need data to persist after cluster teardown

S3 Storage (s3_main)

The S3 policy stores data in your configured S3 bucket with a local cache for frequently accessed data. The cache size defaults to 10Gi and can be configured with clickhouse init --s3-cache. Write-through caching is enabled by default (--s3-cache-on-write true), which caches data during writes so subsequent reads can be served from cache immediately. This is ideal for large datasets where storage cost matters more than latency.

Prerequisite: Your cluster must be initialized with an S3 bucket. Set this during init:

easy-db-lab init my-cluster --s3-bucket my-clickhouse-data

Then create tables with S3 storage:

CREATE TABLE my_table (...)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/default/my_table', '{replica}')
ORDER BY id
SETTINGS storage_policy = 's3_main';

When to use S3 storage:

  • Large analytical datasets (terabytes+)
  • Data that should persist across cluster restarts
  • Cost-sensitive workloads where storage cost > compute cost
  • Sharing data between multiple clusters

How the cache works:

  • Hot (frequently accessed) data is cached locally for fast reads
  • Cold data is fetched from S3 on demand
  • Cache is automatically managed by ClickHouse
  • First query on cold data will be slower; subsequent queries use cache
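The cache behavior above can be modeled as a read-through cache with optional write-through. This is a toy sketch of the idea, not ClickHouse's actual cache implementation:

```python
class S3Cache:
    """Toy model of the local cache in front of S3 (illustration only).

    Reads populate the cache on a miss; with write-through caching
    (--s3-cache-on-write true) writes are cached immediately.
    """

    def __init__(self, cache_on_write: bool = True):
        self.cache: dict[str, bytes] = {}
        self.cache_on_write = cache_on_write
        self.s3_fetches = 0

    def read(self, key: str, s3: dict[str, bytes]) -> bytes:
        if key not in self.cache:        # cold: fetch from S3
            self.s3_fetches += 1
            self.cache[key] = s3[key]
        return self.cache[key]           # hot: served locally

    def write(self, key: str, value: bytes, s3: dict[str, bytes]) -> None:
        s3[key] = value                  # data always lands in S3
        if self.cache_on_write:
            self.cache[key] = value      # readable from cache immediately

s3 = {}
cache = S3Cache()
cache.write("part-1", b"data", s3)
cache.read("part-1", s3)
assert cache.s3_fetches == 0  # write-through: no S3 round trip on first read
```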

S3 Tiered Storage (s3_tier)

The S3 tiered policy provides automatic data movement from local disks to S3 based on disk space availability. This policy starts with local storage and automatically moves data to S3 when local disk space runs low, providing the best of both worlds: fast local performance for hot data and unlimited S3 capacity for cold data.

Prerequisite: Your cluster must be initialized with an S3 bucket. Set this during init:

easy-db-lab init my-cluster --s3-bucket my-clickhouse-data

Configure the tiering behavior before starting ClickHouse:

# Move data to S3 when local disk free space falls below 20% (default)
easy-db-lab clickhouse init --s3-tier-move-factor 0.2

# More aggressive tiering - move when free space < 50%
easy-db-lab clickhouse init --s3-tier-move-factor 0.5

Then create tables with S3 tiered storage:

CREATE TABLE my_table (...)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/default/my_table', '{replica}')
ORDER BY id
SETTINGS storage_policy = 's3_tier';

When to use S3 tiered storage:

  • Workloads with mixed hot/cold data access patterns
  • Growing datasets that may outgrow local disk capacity
  • Automatic cost optimization without manual intervention
  • Local performance for recent data combined with S3 capacity for historical data

How automatic tiering works:

  • New data is written to local disks first (fast writes)
  • When local disk free space falls below the configured threshold (default: 20%), ClickHouse automatically moves the oldest data to S3
  • Data on S3 is still queryable but with higher latency
  • The local cache (configured with --s3-cache) helps performance for frequently accessed S3 data
  • Manual moves are also possible: ALTER TABLE my_table MOVE PARTITION tuple() TO DISK 's3'
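The move-factor rule reduces to a single comparison. An illustrative sketch, assuming free space is checked against the configured fraction:

```python
def should_move_to_s3(free_bytes: int, total_bytes: int,
                      move_factor: float = 0.2) -> bool:
    """Tiering rule sketch: oldest parts move to S3 once the free-space
    fraction drops below move_factor (the --s3-tier-move-factor value)."""
    return free_bytes / total_bytes < move_factor

# 1000 GB disk: 150 GB free is under the default 20% threshold
assert should_move_to_s3(150, 1000)
assert not should_move_to_s3(300, 1000)
assert should_move_to_s3(450, 1000, move_factor=0.5)  # aggressive tiering
```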

Stopping ClickHouse

To remove the ClickHouse cluster:

easy-db-lab clickhouse stop

This removes all ClickHouse pods, services, and associated resources from Kubernetes.

Monitoring

ClickHouse metrics are automatically integrated with the observability stack:

  • Grafana Dashboard: Pre-configured dashboard for ClickHouse metrics
  • Metrics Port: 9363 for Prometheus-compatible metrics
  • Logs Dashboard: Dedicated dashboard for ClickHouse logs

Architecture

The ClickHouse deployment includes:

  • ClickHouse Server: StatefulSet with configurable replicas
  • ClickHouse Keeper: 3-node cluster for distributed coordination (ZooKeeper-compatible)
  • Services: Headless services for internal communication
  • ConfigMaps: Server and Keeper configuration
  • Local PersistentVolumes: One PV per node for data locality

Storage Architecture

ClickHouse uses Local PersistentVolumes to guarantee pod-to-node pinning:

  1. During cluster creation, each db node is labeled with its ordinal (easydblab.com/node-ordinal=0, etc.)
  2. Local PVs are created with node affinity matching these ordinals
  3. PVs are pre-bound to specific PVCs (e.g., data-clickhouse-0 binds to the PV on db0)
  4. The StatefulSet's volumeClaimTemplate requests storage from these pre-bound PVs

This ensures clickhouse-X always runs on dbX, providing:

  • Consistent shard assignments across restarts
  • Data locality (no network storage overhead)
  • Predictable failover behavior

Ports

| Port | Purpose |
|---|---|
| 8123 | HTTP interface |
| 9000 | Native protocol |
| 9009 | Inter-server communication |
| 9363 | Metrics |
| 2181 | Keeper client |
| 9234 | Keeper Raft |

OpenSearch

AWS OpenSearch can be provisioned as a managed domain for full-text search and log analytics.

Commands

| Command | Description |
|---|---|
| opensearch start | Create an OpenSearch domain |
| opensearch status | Check domain status |
| opensearch stop | Delete the OpenSearch domain |

Starting OpenSearch

easy-db-lab opensearch start

This creates an AWS-managed OpenSearch domain linked to your cluster's VPC. The domain takes several minutes to provision.

Checking Status

easy-db-lab opensearch status

Stopping OpenSearch

easy-db-lab opensearch stop

This deletes the OpenSearch domain. Data stored in the domain will be lost.

Spark

easy-db-lab supports provisioning Apache Spark clusters via AWS EMR for analytics workloads.

Enabling Spark

There are two ways to enable Spark:

Option 1: During Init (before up)

Enable Spark during cluster initialization with the --spark.enable flag. The EMR cluster will be created automatically when you run up:

easy-db-lab init --spark.enable
easy-db-lab up

Init Spark Configuration Options

| Option | Description | Default |
|---|---|---|
| --spark.enable | Enable Spark EMR cluster | false |
| --spark.master.instance.type | Master node instance type | m5.xlarge |
| --spark.worker.instance.type | Worker node instance type | m5.xlarge |
| --spark.worker.instance.count | Number of worker nodes | 3 |

Example with Custom Configuration

easy-db-lab init \
  --spark.enable \
  --spark.master.instance.type m5.2xlarge \
  --spark.worker.instance.type m5.4xlarge \
  --spark.worker.instance.count 5

Option 2: After up (standalone spark init)

Add Spark to an existing environment that is already running. This is useful when you forgot to pass --spark.enable during init, or when you decide to add Spark later:

easy-db-lab spark init

Prerequisites: easy-db-lab init and easy-db-lab up must have been run first.

Spark Init Configuration Options

| Option | Description | Default |
|---|---|---|
| --master.instance.type | Master node instance type | m5.xlarge |
| --worker.instance.type | Worker node instance type | m5.xlarge |
| --worker.instance.count | Number of worker nodes | 3 |

Example with Custom Configuration

easy-db-lab spark init \
  --master.instance.type m5.2xlarge \
  --worker.instance.type m5.4xlarge \
  --worker.instance.count 5

Submitting Spark Jobs

Submit JAR-based Spark applications to your EMR cluster:

easy-db-lab spark submit \
  --jar /path/to/your-app.jar \
  --main-class com.example.YourMainClass \
  --conf spark.easydblab.keyspace=my_keyspace \
  --conf spark.easydblab.table=my_table \
  --wait

Submit Options

| Option | Description | Required |
|---|---|---|
| --jar | Path to JAR file (local path or s3:// URI) | Yes |
| --main-class | Main class to execute | Yes |
| --conf | Spark configuration (key=value), can be repeated | No |
| --env | Environment variable (KEY=value), can be repeated | No |
| --args | Arguments for the Spark application | No |
| --wait | Wait for job completion | No |
| --name | Job name (defaults to main class) | No |

When --jar is a local path, it is automatically uploaded to the cluster's S3 bucket before submission. When it is an s3:// URI, it is used directly.
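That resolution step can be sketched as follows; the resolve_jar helper and the jars/ key prefix are hypothetical, shown only to illustrate the documented local-vs-s3:// handling:

```python
def resolve_jar(jar: str, bucket: str) -> str:
    """Return the s3:// URI to hand to EMR for --jar.

    Hypothetical helper mirroring the documented behavior: s3:// URIs
    pass through untouched, local paths are uploaded to the cluster's
    bucket first (the actual upload call is elided here).
    """
    if jar.startswith("s3://"):
        return jar                      # already on S3: use directly
    name = jar.rsplit("/", 1)[-1]       # upload to the bucket would happen here
    return f"s3://{bucket}/jars/{name}"

print(resolve_jar("s3://my-bucket/jars/your-app.jar", "cluster-bucket"))
print(resolve_jar("/path/to/your-app.jar", "cluster-bucket"))
```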

Using a JAR Already on S3

If your JAR is already on S3 (e.g., from a CI pipeline or a previous upload), pass the S3 URI directly:

easy-db-lab spark submit \
  --jar s3://my-bucket/jars/your-app.jar \
  --main-class com.example.YourMainClass \
  --conf spark.easydblab.keyspace=my_keyspace \
  --wait

This skips the upload step entirely, which is useful for large JARs or when resubmitting the same job.

Cancelling a Job

Cancel a running or pending Spark job without terminating the cluster:

easy-db-lab spark stop

Without --step-id, this cancels the most recent job. To cancel a specific job:

easy-db-lab spark stop --step-id <step-id>

The cancellation uses EMR's TERMINATE_PROCESS strategy (SIGKILL). The API is asynchronous — use spark status to confirm the job has been cancelled.

Checking Job Status

View Recent Jobs

List recent Spark jobs on the cluster:

easy-db-lab spark jobs

Options:

  • --limit - Maximum number of jobs to display (default: 10)

Check Specific Job Status

easy-db-lab spark status --step-id <step-id>

Without --step-id, shows the status of the most recent job.

Options:

  • --step-id - EMR step ID to check
  • --logs - Download step logs (stdout, stderr)

Retrieving Logs

Download logs for a Spark job:

easy-db-lab spark logs --step-id <step-id>

Logs are automatically decompressed and include:

  • stdout.gz - Standard output
  • stderr.gz - Standard error
  • controller.gz - EMR controller logs

Architecture

When Spark is enabled, easy-db-lab provisions:

  • EMR Cluster: Managed Spark cluster with master and worker nodes
  • S3 Integration: Logs stored at s3://<bucket>/spark/emr-logs/
  • IAM Roles: Service and job flow roles for EMR operations
  • Observability: Each EMR node runs an OTel Collector (host metrics, OTLP forwarding), OTel Java Agent (auto-instrumentation for logs/metrics/traces), and Pyroscope Java Agent (continuous CPU/allocation/lock profiling). All telemetry flows to the control node's observability stack.

Timeouts and Polling

  • Job Polling Interval: 5 seconds
  • Maximum Wait Time: 4 hours
  • Cluster Creation Timeout: 30 minutes
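The --wait behavior described above amounts to a bounded polling loop; wait_for_step and its get_state callable are hypothetical stand-ins for the EMR API calls the tool makes:

```python
import time

def wait_for_step(get_state, poll_interval: float = 5.0,
                  max_wait: float = 4 * 3600) -> str:
    """Poll a step until it reaches a terminal state or max_wait elapses.

    get_state is a callable returning the current step state string;
    in the real tool this would be an EMR DescribeStep call.
    """
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        state = get_state()
        if state in ("COMPLETED", "FAILED", "CANCELLED"):
            return state
        time.sleep(poll_interval)
    raise TimeoutError("step did not finish within max_wait")

# Simulated state transitions, with a tiny interval for demonstration
states = iter(["PENDING", "RUNNING", "COMPLETED"])
print(wait_for_step(lambda: next(states), poll_interval=0.001))  # prints COMPLETED
```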

Spark with Cassandra

A common use case is running Spark jobs that read from or write to Cassandra. Use the Spark Cassandra Connector:

import com.datastax.spark.connector._

val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "my_table", "keyspace" -> "my_keyspace"))
  .load()

Ensure your JAR includes the Spark Cassandra Connector dependency and configure the Cassandra host in your Spark application.

Spark Modules

All Spark job modules live under the spark/ directory and share unified configuration via spark.easydblab.* properties. You can compare performance across implementations by swapping the JAR and main class while keeping the same --conf flags.

Module Overview

| Module | Gradle Path | Main Class | Description |
|---|---|---|---|
| common | :spark:common | (none) | Shared config, data generation, CQL setup |
| bulk-writer-sidecar | :spark:bulk-writer-sidecar | DirectBulkWriter | Cassandra Analytics, direct sidecar transport |
| bulk-writer-s3 | :spark:bulk-writer-s3 | S3BulkWriter | Cassandra Analytics, S3 staging transport |
| connector-writer | :spark:connector-writer | StandardConnectorWriter | Standard Spark Cassandra Connector |
| connector-read-write | :spark:connector-read-write | KeyValuePrefixCount | Read→transform→write example |

Building

Pre-build Cassandra Analytics (one-time, for bulk-writer modules)

The cassandra-analytics library requires JDK 11 to build:

bin/build-cassandra-analytics

Options:

  • --force - Rebuild even if JARs exist
  • --branch <branch> - Use a specific branch (default: trunk)

Build JARs

# Build all Spark modules
./gradlew :spark:bulk-writer-sidecar:shadowJar :spark:bulk-writer-s3:shadowJar \
  :spark:connector-writer:shadowJar :spark:connector-read-write:shadowJar

# Or build individually
./gradlew :spark:bulk-writer-sidecar:shadowJar
./gradlew :spark:connector-writer:shadowJar

Usage

All modules use the same --conf properties for easy comparison.

Direct Bulk Writer (Sidecar)

easy-db-lab spark submit \
  --jar spark/bulk-writer-sidecar/build/libs/bulk-writer-sidecar-*.jar \
  --main-class com.rustyrazorblade.easydblab.spark.DirectBulkWriter \
  --conf spark.easydblab.contactPoints=host1,host2,host3 \
  --conf spark.easydblab.keyspace=bulk_test \
  --conf spark.easydblab.localDc=us-west-2 \
  --conf spark.easydblab.rowCount=1000000 \
  --wait

S3 Bulk Writer

easy-db-lab spark submit \
  --jar spark/bulk-writer-s3/build/libs/bulk-writer-s3-*.jar \
  --main-class com.rustyrazorblade.easydblab.spark.S3BulkWriter \
  --conf spark.easydblab.contactPoints=host1,host2,host3 \
  --conf spark.easydblab.keyspace=bulk_test \
  --conf spark.easydblab.localDc=us-west-2 \
  --conf spark.easydblab.s3.bucket=my-bucket \
  --conf spark.easydblab.rowCount=1000000 \
  --wait

Standard Connector Writer

easy-db-lab spark submit \
  --jar spark/connector-writer/build/libs/connector-writer-*.jar \
  --main-class com.rustyrazorblade.easydblab.spark.StandardConnectorWriter \
  --conf spark.easydblab.contactPoints=host1,host2,host3 \
  --conf spark.easydblab.keyspace=bulk_test \
  --conf spark.easydblab.localDc=us-west-2 \
  --conf spark.easydblab.rowCount=1000000 \
  --wait

Convenience Script

The bin/spark-bulk-write script handles JAR lookup, host resolution, and health checks:

# From a cluster directory
spark-bulk-write direct --rows 10000
spark-bulk-write s3 --rows 1000000 --parallelism 20
spark-bulk-write connector --keyspace myks --table mytable

Configuration Properties

All modules share these properties via spark.easydblab.*:

| Property | Description | Default |
|---|---|---|
| spark.easydblab.contactPoints | Comma-separated database hosts | Required |
| spark.easydblab.keyspace | Target keyspace | Required |
| spark.easydblab.table | Target table | data_<timestamp> |
| spark.easydblab.localDc | Local datacenter name | Required |
| spark.easydblab.rowCount | Number of rows to write | 1000000 |
| spark.easydblab.parallelism | Spark partitions for generation | 10 |
| spark.easydblab.partitionCount | Cassandra partitions to distribute across | 10000 |
| spark.easydblab.replicationFactor | Keyspace replication factor | 3 |
| spark.easydblab.skipDdl | Skip keyspace/table creation (validates they exist) | false |
| spark.easydblab.compaction | Compaction strategy | (default) |
| spark.easydblab.s3.bucket | S3 bucket (S3 mode only) | Required for S3 |
| spark.easydblab.s3.endpoint | S3 endpoint override | AWS S3 |

Table Schema

The test data generators produce this schema:

CREATE TABLE <keyspace>.<table> (
    partition_id bigint,
    sequence_id bigint,
    course blob,
    marks bigint,
    PRIMARY KEY ((partition_id), sequence_id)
);

Monitoring

Grafana Dashboards

Grafana is deployed automatically as part of the observability stack (k8 apply). It is accessible on port 3000 of the control node.

Cluster Identification

When running multiple environments side by side, Grafana displays the cluster name in several places to help you identify which environment you're looking at:

  • Browser tab - Shows the cluster name instead of "Grafana"
  • Dashboard titles - Each dashboard title is prefixed with the cluster name
  • Sidebar org name - The organization name in the sidebar shows the cluster name
  • Home dashboard - The System Overview dashboard is set as the home page instead of the default Grafana welcome page

System Dashboard

Shows CPU, memory, disk I/O, network I/O, and load average for all cluster nodes via OpenTelemetry metrics.

AWS CloudWatch Overview

A combined dashboard showing S3, EBS, and EC2 metrics via CloudWatch. Available after running easy-db-lab up.

S3 metrics:

  • Throughput: BytesDownloaded, BytesUploaded
  • Request Counts: GetRequests, PutRequests
  • Latency: FirstByteLatency (p99), TotalRequestLatency (p99)

EBS volume metrics:

  • IOPS: VolumeReadOps, VolumeWriteOps (mirrored read/write chart)
  • Throughput: VolumeReadBytes, VolumeWriteBytes (mirrored read/write chart)
  • Queue Length: VolumeQueueLength
  • Burst Balance: BurstBalance (percentage)

EC2 status checks:

  • Status Check Failures: StatusCheckFailed_Instance, StatusCheckFailed_System (red threshold at >= 1)

Use the dropdowns at the top to select S3 bucket, EC2 instances, and EBS volumes.

How it works:

  • S3 request metrics are automatically enabled for the cluster's prefix in the account S3 bucket during easy-db-lab up
  • EBS and EC2 metrics are published automatically by AWS for all instances and volumes
  • Grafana queries CloudWatch using the EC2 instance's IAM role (no credentials needed)
  • During easy-db-lab down, the S3 metrics configuration is automatically removed to stop CloudWatch billing

Note: S3 request metrics take approximately 15 minutes to appear in CloudWatch after being enabled. EBS and EC2 metrics are available immediately.

EMR Overview

Shows Spark/EMR node metrics via OpenTelemetry. Available when an EMR cluster is provisioned. Each EMR node runs an OTel Collector that collects host metrics and receives JVM telemetry from the OTel and Pyroscope Java agents.

Host Metrics:

  • CPU Usage: Per-node CPU utilization percentage
  • Memory Usage: Used and cached memory per node
  • Disk I/O: Read/write throughput per node (mirrored chart)
  • Network I/O: Receive/transmit throughput per node (mirrored chart)
  • Load Average: 1m and 5m load per node
  • Filesystem Usage: Root filesystem utilization percentage

Spark JVM Metrics:

  • JVM Heap Memory: Used and committed heap per node/pool
  • GC Duration Rate: Garbage collection duration rate per collector
  • JVM Threads: Thread count per node
  • JVM Classes Loaded: Class count per node

Use the Hostname dropdown to filter by specific EMR nodes.

OpenSearch Overview

Shows OpenSearch domain metrics via CloudWatch. Available when an OpenSearch domain is provisioned.

Metrics displayed:

  • Cluster Health: ClusterStatus (green/yellow/red), FreeStorageSpace
  • CPU / Memory: CPUUtilization, JVMMemoryPressure
  • Search Performance: SearchLatency (p99), SearchRate
  • Indexing Performance: IndexingLatency (p99), IndexingRate
  • HTTP Responses: 2xx, 3xx, 4xx, 5xx (color-coded)
  • Storage: ClusterUsedSpace

Use the Domain dropdown to select which OpenSearch domain to view.

Cassandra Condensed

A single-pane-of-glass summary of the most important Cassandra metrics, powered by the MAAC (Management API for Apache Cassandra) agent. Shows:

  • Cluster Overview: Nodes up/down, compaction rates, CQL request throughput, dropped messages, connected clients, timeouts, hints, data size, GC time
  • Condensed Metrics: Request throughput, coordinator latency percentiles, memtable space, compaction activity, table-level latency, streaming bandwidth

Requires the MAAC agent to be loaded (Cassandra 4.0, 4.1, or 5.0). Metrics are exposed on port 9000 and scraped by the OTel collector.

Cassandra Overview

A comprehensive deep-dive into Cassandra cluster health, also powered by the MAAC agent. Shows:

  • Request Throughput: Read/write distribution, latency percentiles (P98-P999), error throughput
  • Node Status: Per-node up/down status (polystat panel), node count, status history
  • Data Status: Disk space usage, data size, SSTable count, pending compactions
  • Internals: Thread pool pending/blocked/active tasks, dropped messages, hinted handoff
  • Hardware: CPU, memory, disk I/O, network I/O, load average
  • JVM/GC: Application throughput, GC time, heap utilization

eBPF Observability

The cluster deploys eBPF-based agents on all nodes for deep system observability:

Beyla (L7 Network Metrics)

Grafana Beyla uses eBPF to automatically instrument network traffic and provide RED metrics (Rate, Errors, Duration) for:

  • Cassandra CQL protocol (port 9042) and inter-node communication (port 7000)
  • ClickHouse HTTP (port 8123) and native (port 9000) protocols

Metrics are scraped by the OTel collector and stored in VictoriaMetrics.

ebpf_exporter (Low-Level Metrics)

Cloudflare's ebpf_exporter provides kernel-level metrics via eBPF:

  • TCP retransmits — count of retransmitted TCP segments
  • Block I/O latency — histogram of block device I/O operation latency
  • VFS latency — histogram of filesystem read/write operation latency

These metrics are scraped by the OTel collector and stored in VictoriaMetrics.

See Profiling for continuous profiling with Pyroscope.

Profiling

Continuous profiling is provided by Grafana Pyroscope, deployed automatically as part of the observability stack.

Architecture

Profiling data is collected from multiple sources and sent to the Pyroscope server on the control node (port 4040):

  • Pyroscope Java agent (Cassandra) — Runs as a -javaagent inside the Cassandra JVM. Uses async-profiler to collect CPU, allocation, lock contention, and wall-clock profiles with full method-level resolution.
  • Pyroscope Java agent (Stress jobs) — Runs as a -javaagent inside cassandra-easy-stress K8s Jobs. Collects the same profile types as Cassandra (CPU, allocation, lock). The agent JAR is mounted from the host via a hostPath volume.
  • Pyroscope Java agent (Spark/EMR) — Runs as a -javaagent on Spark driver and executor JVMs. Installed via EMR bootstrap action to /opt/pyroscope/pyroscope.jar. Collects CPU, allocation (512k threshold), and lock (10ms threshold) profiles in JFR format. Profiles appear under service_name=spark-<job-name>.
  • Grafana Alloy eBPF profiler — Runs as a DaemonSet on all nodes via Grafana Alloy. Profiles all processes (Cassandra, ClickHouse, stress jobs) at the system level using eBPF. Provides CPU flame graphs including kernel stack frames.

Accessing Profiles

Profiling Dashboard

A dedicated Profiling dashboard is available in Grafana with flame graph panels for each profile type:

  1. Open Grafana (port 3000)
  2. Navigate to Dashboards and select the Profiling dashboard
  3. Use the Service dropdown to select a service (e.g., cassandra, cassandra-easy-stress, clickhouse-server)
  4. Use the Hostname dropdown to filter by specific nodes
  5. Select a time range to view profiles for that period

The dashboard includes panels for:

  • CPU Flame Graph — CPU time spent in each method
  • Memory Allocation Flame Graph — Heap allocation hotspots
  • Lock Contention Flame Graph — Time spent waiting for monitors
  • Mutex Contention Flame Graph — Mutex delay analysis

Grafana Explore

For ad-hoc profile exploration:

  1. Open Grafana (port 3000) and navigate to Explore
  2. Select the Pyroscope datasource
  3. Choose a profile type (e.g., process_cpu, memory, mutex)
  4. Filter by labels:
    • service_name — process or application name
    • hostname — node hostname
    • cluster — cluster name

Profile Types

Java Agent (Cassandra, Stress Jobs)

| Profile | Description |
|---|---|
| cpu | CPU time spent in each method |
| alloc | Memory allocation by method (objects and bytes) |
| lock | Lock contention: time spent waiting for monitors |
| wall | Wall-clock time, useful for finding I/O bottlenecks (Cassandra only, see below) |

eBPF Agent (All Processes)

| Profile | Description |
|---|---|
| process_cpu | CPU usage by process, including kernel frames |

The eBPF agent profiles all processes on every node, including ClickHouse. Since ClickHouse is written in C++, only CPU profiles are available (no allocation or lock profiles). ClickHouse profiles appear in Pyroscope under the clickhouse-server service name when ClickHouse is running.

Stress Job Profiling

Stress jobs are automatically profiled via the Pyroscope Java agent. No additional configuration is needed — when you start a stress job, the agent is mounted from the host node and configured to send profiles to the Pyroscope server.

Profiles appear under service_name=cassandra-easy-stress with labels for cluster and job_name.

Wall-Clock vs CPU Profiling

By default, the Cassandra Java agent profiles CPU time. You can switch to wall-clock profiling to find I/O bottlenecks and blocking operations.

Warning

Wall-clock and CPU profiling are mutually exclusive — you cannot use both simultaneously.

To enable wall-clock profiling:

  1. SSH to each Cassandra node
  2. Add PYROSCOPE_PROFILER_EVENT=wall to /etc/default/cassandra
  3. Restart Cassandra

To switch back to CPU profiling, either remove the line or set PYROSCOPE_PROFILER_EVENT=cpu.

Configuration

Cassandra Java Agent

The Pyroscope Java agent is configured via JVM system properties in cassandra.in.sh. It activates when the PYROSCOPE_SERVER_ADDRESS environment variable is set (configured by easy-db-lab at cluster startup).

The agent JAR is installed at /usr/local/pyroscope/pyroscope.jar.

| Environment Variable | Set In | Description |
|---|---|---|
| PYROSCOPE_SERVER_ADDRESS | /etc/default/cassandra | Pyroscope server URL (set automatically) |
| CLUSTER_NAME | /etc/default/cassandra | Cluster name for labeling (set automatically) |
| PYROSCOPE_PROFILER_EVENT | /etc/default/cassandra | Profiler event type: cpu (default) or wall |

eBPF Agent

The eBPF profiler runs as a privileged Grafana Alloy DaemonSet (pyroscope-ebpf) and profiles all processes on each node. Configuration is in the pyroscope-ebpf-config ConfigMap (Alloy River format). It uses discovery.process to discover host processes and pyroscope.ebpf to collect CPU profiles.

Pyroscope Server

The Pyroscope server runs on the control node with data stored in S3 (s3://<account-bucket>/clusters/<name>-<id>/pyroscope/). Configuration is in the pyroscope-config ConfigMap.

Data Flow

Cassandra JVM ──(Java agent)──────► Pyroscope Server (:4040)
                                         ▲
Stress Jobs ──(Java agent)──────────────┘
                                         ▲
Spark JVMs ───(Java agent)──────────────┘
                                         ▲
All Processes ──(eBPF agent)────────────┘
                                         │
                                         ▼
                                    S3 storage
                                  Grafana (:3000)
                             Pyroscope datasource
                            + Profiling dashboard

VictoriaMetrics

VictoriaMetrics is a time-series database that stores metrics from all nodes in your easy-db-lab cluster. It receives metrics via OTLP from the OpenTelemetry Collector.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                     All Nodes (DaemonSet)                    │
├─────────────────────────────────────────────────────────────┤
│   System metrics (CPU, memory, disk, network)               │
│   Cassandra metrics (via JMX)                               │
│   Application metrics                                        │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
              ┌────────────────────────┐
              │   OTel Collector       │
              │   (DaemonSet)          │
              └───────────┬────────────┘
                          │
┌─────────────────────────┼─────────────────────────┐
│   Control Node          │                          │
├─────────────────────────┼─────────────────────────┤
│                         ▼                          │
│              ┌──────────────────┐                  │
│              │ Victoria Metrics │                  │
│              │    (:8428)       │                  │
│              └────────┬─────────┘                  │
└───────────────────────┼────────────────────────────┘
                        │
                        ▼
              ┌──────────────────┐
              │     Grafana      │
              │    (:3000)       │
              └──────────────────┘

Configuration

VictoriaMetrics runs on the control node as a Kubernetes deployment:

  • Port: 8428 (HTTP API)
  • Storage: Persistent at /mnt/db1/victoriametrics
  • Retention: 7 days (configurable via -retentionPeriod flag)

Accessing Metrics

Grafana

  1. Access Grafana at http://control0:3000 (via SOCKS proxy)
  2. VictoriaMetrics is pre-configured as the Prometheus datasource
  3. System dashboards show node metrics

Direct API

Query metrics directly using the Prometheus-compatible API:

source env.sh

# Get all metric names
with-proxy curl "http://control0:8428/api/v1/label/__name__/values"

# Query specific metric
with-proxy curl "http://control0:8428/api/v1/query?query=up"

# Query with time range
with-proxy curl "http://control0:8428/api/v1/query_range?query=node_cpu_seconds_total&start=$(date -d '1 hour ago' +%s)&end=$(date +%s)&step=60"

Common Queries

# CPU usage by node
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Disk usage
100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100)

# Network received bytes
rate(node_network_receive_bytes_total[5m])
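To make the rate() arithmetic behind these expressions concrete, here is a small Python sketch using made-up counter samples (illustration only, not tied to a live cluster):

```python
# Sketch of the CPU-usage query above: rate() is the per-second increase of a
# counter over a window; subtracting the idle-time rate from 100% gives busy
# CPU. The sample values below are invented for illustration.

def rate(samples, window_s):
    """Per-second increase of a monotonically increasing counter."""
    return (samples[-1] - samples[0]) / window_s

# node_cpu_seconds_total{mode="idle"} sampled at the start and end of a
# 300-second (5m) window, in CPU-seconds:
idle_counter = [1000.0, 1285.0]  # grew 285 idle CPU-seconds in 300s

idle_fraction = rate(idle_counter, 300)    # ~0.95: CPU was idle 95% of the time
cpu_usage_pct = 100 - idle_fraction * 100  # matches: 100 - rate(...) * 100

print(round(cpu_usage_pct, 1))  # 5.0
```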

Backup

Backup Victoria Metrics data to S3:

# Backup to cluster's default S3 bucket
easy-db-lab metrics backup

# Backup to a custom S3 location
easy-db-lab metrics backup --dest s3://my-backup-bucket/victoriametrics

By default, backups are stored at: s3://{cluster-bucket}/victoriametrics/{timestamp}/

Use --dest to override the destination bucket and path.

Features

  • Uses native vmbackup tool with snapshot support
  • Non-disruptive; metrics collection continues during backup
  • Direct S3 upload (no intermediate storage needed)
  • Incremental backup support for faster subsequent backups

Listing Backups

List available VictoriaMetrics backups in S3:

easy-db-lab metrics ls

This displays a summary table of all backups grouped by timestamp, showing the number of files and total size for each.

Importing Metrics to an External Instance

Stream metrics from the running cluster's VictoriaMetrics to an external VictoriaMetrics instance via the native export/import API:

# Import all metrics
easy-db-lab metrics import --target http://victoria:8428

# Import only specific metrics
easy-db-lab metrics import --target http://victoria:8428 --match '{job="cassandra"}'

This is useful for exporting metrics at the end of test runs when running easy-db-lab from a Docker container. Unlike binary backups, this approach streams data via HTTP and can target any reachable VictoriaMetrics instance.

Options

Option     Description                             Default
--target   Target VictoriaMetrics URL (required)   -
--match    Metric selector for filtering           All metrics

Troubleshooting

No metrics appearing

  1. Verify Victoria Metrics pod is running:

    kubectl get pods -l app.kubernetes.io/name=victoriametrics
    kubectl logs -l app.kubernetes.io/name=victoriametrics
    
  2. Check OTel Collector is forwarding metrics:

    kubectl get pods -l app=otel-collector
    kubectl logs -l app=otel-collector
    
  3. Verify the cluster-config ConfigMap exists:

    kubectl get configmap cluster-config -o yaml
    

Connection errors

If you see connection errors when querying metrics:

  1. Ensure the cluster is running: easy-db-lab status
  2. The proxy is started automatically when needed
  3. Check that control node is accessible: ssh control0 hostname

High memory usage

Victoria Metrics is configured with memory limits. If you see OOM kills:

  1. Check current memory usage:

    kubectl top pod -l app.kubernetes.io/name=victoriametrics
    
  2. Consider adjusting the memory limits in the deployment manifest

Backup failures

If backup fails:

  1. Check the backup job logs:

    kubectl logs -l app.kubernetes.io/name=victoriametrics-backup
    
  2. Verify S3 bucket permissions (IAM role should have S3 access)

  3. Ensure there's sufficient disk space on the control node

Victoria Logs

Victoria Logs is a centralized log aggregation system that collects logs from all nodes in your easy-db-lab cluster. It provides a unified way to search and analyze logs from Cassandra, ClickHouse, and system services.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                     All Nodes (DaemonSet)                    │
├─────────────────────────────────────────────────────────────┤
│   /var/log/*              journald                          │
│   /mnt/db1/cassandra/logs/*.log                             │
│   /mnt/db1/clickhouse/logs/*.log                            │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
              ┌────────────────────────┐
              │  OTel Collector        │
              │  (DaemonSet)           │
              │  filelog + journald    │
              └───────────┬────────────┘
                          │
┌─────────────────────────┼─────────────────────────┐
│   Control Node          │                          │
├─────────────────────────┼─────────────────────────┤
│                         ▼                          │
│              ┌──────────────────┐                  │
│              │  Victoria Logs   │                  │
│              │    (:9428)       │                  │
│              └────────┬─────────┘                  │
└───────────────────────┼────────────────────────────┘
                        │
                        ▼
              ┌──────────────────┐
              │  easy-db-lab     │
              │  logs query      │
              └──────────────────┘

Components

Victoria Logs Server

Victoria Logs runs on the control node as a Kubernetes deployment:

  • Port: 9428 (HTTP API)
  • Storage: Local ephemeral storage
  • Retention: 7 days (configurable)
  • Location: Control node only (node-role.kubernetes.io/control-plane)

OTel Collector

The OpenTelemetry Collector collects logs from all sources and forwards them to Victoria Logs.

The OTel Collector runs as a DaemonSet on every node (Cassandra, stress, control) to collect:

Source             Path                                    Description
Cassandra          /mnt/db1/cassandra/logs/*.log           Cassandra database logs
ClickHouse         /mnt/db1/clickhouse/logs/*.log          ClickHouse server logs
ClickHouse Keeper  /mnt/db1/clickhouse/keeper/logs/*.log   ClickHouse Keeper logs
System logs        /var/log/**/*.log                       General system logs
journald           cassandra, docker, k3s, sshd            systemd service logs

Log Sources

Each log entry is tagged with a source field:

Source      Description              Additional Fields
cassandra   Cassandra database logs  host
clickhouse  ClickHouse server logs   host, component (server/keeper)
systemd     systemd journal logs     host, unit
system      General /var/log files   host

Querying Logs

Using the CLI

The easy-db-lab logs query command provides a unified interface:

# Query all logs from the last hour
easy-db-lab logs query

# Filter by source
easy-db-lab logs query --source cassandra
easy-db-lab logs query --source clickhouse
easy-db-lab logs query --source systemd

# Filter by host
easy-db-lab logs query --source cassandra --host db0

# Filter by systemd unit
easy-db-lab logs query --source systemd --unit docker.service

# Search for text
easy-db-lab logs query --grep "OutOfMemory"
easy-db-lab logs query --grep "ERROR"

# Time range and limit
easy-db-lab logs query --since 30m --limit 500
easy-db-lab logs query --since 1d

# Raw LogsQL query
easy-db-lab logs query -q 'source:cassandra AND host:db0'

Query Options

Option        Description                            Default
--source, -s  Log source filter                      All sources
--host, -H    Hostname filter (db0, app0, control0)  All hosts
--unit        systemd unit name                      All units
--since       Time range (1h, 30m, 1d)               1h
--limit, -n   Max entries to return                  100
--grep, -g    Text search filter                     None
--query, -q   Raw LogsQL query                       None
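The --since durations (30m, 1h, 1d) follow a simple number-plus-unit format; a minimal parser sketch, assuming only s/m/h/d suffixes (illustrative, not the CLI's actual parser):

```python
def parse_since(value):
    """Parse a duration like '30m', '1h', or '1d' into seconds (sketch)."""
    units = {"s": 1, "m": 60, "h": 3600, "d": 86400}
    number, suffix = value[:-1], value[-1]
    return int(number) * units[suffix]

print(parse_since("30m"))  # 1800
print(parse_since("1h"))   # 3600
print(parse_since("1d"))   # 86400
```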

Using the HTTP API

Victoria Logs exposes a REST API on port 9428. Access it through the SOCKS proxy:

source env.sh
with-proxy curl "http://control0:9428/select/logsql/query?query=source:cassandra&time=1h&limit=100"
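When building such URLs programmatically, the LogsQL query string should be percent-encoded. A small Python sketch (parameter names taken from the curl example above):

```python
from urllib.parse import urlencode

# Build the same request URL as the curl example, with proper encoding of
# spaces and colons in the LogsQL expression.
base = "http://control0:9428/select/logsql/query"
params = {"query": "source:cassandra AND host:db0", "time": "1h", "limit": 100}
url = f"{base}?{urlencode(params)}"
print(url)
```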

Using Grafana

Victoria Logs is configured as a datasource in Grafana. You can use it in two ways:

Log Investigation Dashboard

The Log Investigation dashboard is designed for interactive log analysis during investigations. Access it at Grafana → Dashboards → Log Investigation.

Filter variables (dropdowns at the top):

Filter     Options                                          Description
Node Role  All, db, app, control                            Filter by server type
Source     All, cassandra, clickhouse, system, tool-runner  Filter by log source
Level      All, Error, Warning, Info, Debug                 Filter by log severity
Search     (text input)                                     Free-text search across log messages
Filters    (ad-hoc)                                         Add arbitrary field:value filters (e.g., host = db0)

Panels:

  • Log Volume — time-series bar chart showing log count over time, broken down by source. Helps identify spikes and anomalies at a glance.
  • Logs — scrollable log viewer with timestamps, source labels, and expandable log details. Click any log entry to see all available fields.

Tips:

  • Use the ad-hoc Filters variable to filter by host, unit, component, or any other field without needing a dedicated dropdown.
  • The dashboard auto-refreshes every 10 seconds by default. Adjust or disable via the refresh picker in the top-right corner.
  • Combine multiple filters to narrow down — e.g., set Node Role to db, Source to cassandra, Level to Error to see only Cassandra errors on database nodes.
  • To search for exec job logs, set Source to tool-runner and use the Search box for the job name.

Explore Mode

For ad-hoc queries beyond what the dashboard provides:

  1. Access Grafana at http://control0:3000 (via SOCKS proxy)
  2. Navigate to Explore
  3. Select "VictoriaLogs" datasource
  4. Use LogsQL syntax for queries

LogsQL Query Syntax

Victoria Logs uses LogsQL for querying. Basic syntax:

# Simple field match
source:cassandra

# Multiple conditions (AND)
source:cassandra AND host:db0

# Text search
"OutOfMemory"

# Combine field match with text search
source:cassandra AND "Exception"

# Time filter (in addition to --since)
_time:1h

For full LogsQL documentation, see the Victoria Logs documentation.
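The CLI filter flags map naturally onto these LogsQL fragments. A rough Python sketch of that mapping (illustrative only; the actual CLI implementation may differ):

```python
def build_logsql(source=None, host=None, unit=None, grep=None):
    """Compose a LogsQL query from CLI-style filters (illustrative sketch)."""
    parts = []
    if source:
        parts.append(f"source:{source}")
    if host:
        parts.append(f"host:{host}")
    if unit:
        parts.append(f"unit:{unit}")
    if grep:
        parts.append(f'"{grep}"')         # bare quoted string = text search
    return " AND ".join(parts) or "*"     # '*' matches everything

print(build_logsql(source="cassandra", host="db0"))  # source:cassandra AND host:db0
print(build_logsql(grep="OutOfMemory"))              # "OutOfMemory"
print(build_logsql())                                # *
```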

Deployment

Victoria Logs and the OTel Collector are automatically deployed when you run:

easy-db-lab k8 apply

This deploys:

  • Victoria Logs server on the control node
  • OTel Collector DaemonSet on all nodes
  • Grafana datasource configuration

Verifying the Setup

Check that all components are running:

source env.sh
kubectl get pods -l app.kubernetes.io/name=victorialogs
kubectl get pods -l app.kubernetes.io/name=otel-collector

Test connectivity:

# Check Victoria Logs health
with-proxy curl http://control0:9428/health

# Query recent logs
easy-db-lab logs query --limit 10

Troubleshooting

No logs appearing

  1. Verify OTel Collector pods are running:

    kubectl get pods -l app.kubernetes.io/name=otel-collector
    kubectl logs -l app.kubernetes.io/name=otel-collector
    
  2. Check Victoria Logs is healthy:

    with-proxy curl http://control0:9428/health
    
  3. Verify the cluster-config ConfigMap exists:

    kubectl get configmap cluster-config -o yaml
    

Connection errors

The logs query command uses the internal SOCKS5 proxy to connect to Victoria Logs. If you see connection errors:

  1. Ensure the cluster is running: easy-db-lab status
  2. The proxy is started automatically when needed
  3. Check that control node is accessible: ssh control0 hostname

Listing Backups

List available VictoriaLogs backups in S3:

easy-db-lab logs ls

This displays a summary table of all backups grouped by timestamp, showing the number of files and total size for each.

Importing Logs to an External Instance

Stream logs from the running cluster's VictoriaLogs to an external VictoriaLogs instance via the jsonline API:

# Import all logs
easy-db-lab logs import --target http://victorialogs:9428

# Import only specific logs
easy-db-lab logs import --target http://victorialogs:9428 --query 'source:cassandra'

This is useful for exporting logs at the end of test runs when running easy-db-lab from a Docker container. Unlike binary backups, this approach streams data via HTTP and can target any reachable VictoriaLogs instance.

Options

Option    Description                         Default
--target  Target VictoriaLogs URL (required)  -
--query   LogsQL query for filtering          All logs (*)

Backup

Victoria Logs data can be backed up to S3 for disaster recovery using consistent snapshots.

Creating a Backup

# Backup to cluster's default S3 bucket
easy-db-lab logs backup

# Backup to a custom S3 location
easy-db-lab logs backup --dest s3://my-backup-bucket/victorialogs

By default, backups are stored at: s3://{cluster-bucket}/victorialogs/{timestamp}/

Use --dest to override the destination bucket and path.

How It Works

The backup uses VictoriaLogs' snapshot API to create consistent, point-in-time copies:

  1. Create snapshots — calls the VictoriaLogs snapshot API to create read-only snapshots of all active log partitions
  2. Sync to S3 — uploads each snapshot directory to S3 using aws s3 sync
  3. Cleanup — deletes the snapshots from disk to free space (runs even if the sync step fails)

Using snapshots ensures data consistency, since VictoriaLogs may be actively writing to its data directory during the backup.
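The cleanup-always-runs guarantee is the classic try/finally pattern. A Python sketch with stand-in functions for the three steps (no real snapshot or S3 calls):

```python
def backup_with_cleanup(create_snapshot, sync_to_s3, delete_snapshot):
    """Snapshot -> sync -> cleanup, guaranteeing cleanup even if sync fails."""
    snapshot = create_snapshot()
    try:
        sync_to_s3(snapshot)           # may raise (network error, bad IAM, ...)
    finally:
        delete_snapshot(snapshot)      # always runs, so disk space is reclaimed

# Demonstration with a failing sync step (control flow only):
events = []

def failing_sync(snapshot):
    raise IOError("S3 unreachable")

try:
    backup_with_cleanup(
        create_snapshot=lambda: events.append("snapshot") or "snap-001",
        sync_to_s3=failing_sync,
        delete_snapshot=lambda s: events.append("cleanup"),
    )
except IOError:
    pass

print(events)  # ['snapshot', 'cleanup']  (cleanup ran despite the failure)
```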

What Gets Backed Up

  • All log partitions (organized by date)
  • Complete log history up to retention period (7 days default)

Notes

  • The process is non-disruptive; log ingestion continues during backup
  • Snapshot cleanup always runs, even if the S3 upload fails, to avoid filling disk
  • Persistent storage at /mnt/db1/victorialogs ensures logs survive pod restarts

Server

easy-db-lab includes a server mode that provides AI assistant integration via MCP (Model Context Protocol), REST status endpoints, and live metrics streaming. This enables Claude to directly interact with your clusters, and provides programmatic access to cluster status.

The server exposes tools for all supported databases — Cassandra, ClickHouse, OpenSearch, and Spark — as well as cluster lifecycle management and observability.

Starting the Server

To start the server, run:

easy-db-lab server

By default, the server picks an available port. To specify a port:

easy-db-lab server --port 8888

The server automatically generates a .mcp.json configuration file in the current directory with the connection details.

Adding to Claude Code

Once the server is running, start Claude Code from the same directory:

claude

Claude Code automatically detects and uses the .mcp.json file generated by the server.

Available Tools

The server exposes commands annotated with @McpCommand as MCP tools to Claude. Tool names use underscores and are derived from the command's package namespace.

Cluster Lifecycle

Tool Name       Description
init            Initialize a directory for easy-db-lab
up              Provision AWS infrastructure
cassandra_down  Shut down AWS infrastructure
clean           Clean up generated files
status          Display full environment status
hosts           List all hosts in the cluster
ip              Get IP address for a host by alias

Cassandra Management

Tool Name                Description
cassandra_use            Select a Cassandra version
cassandra_list           List available Cassandra versions
cassandra_start          Start Cassandra on all nodes
cassandra_restart        Restart Cassandra on all nodes
cassandra_update_config  Apply configuration patch to nodes

Cassandra Stress Testing

Tool Name                Description
cassandra_stress_start   Start a stress job on K8s
cassandra_stress_stop    Stop and delete stress jobs
cassandra_stress_status  Check status of stress jobs
cassandra_stress_logs    View logs from stress jobs
cassandra_stress_list    List available workloads
cassandra_stress_fields  List available field generators
cassandra_stress_info    Show workload information

ClickHouse

Tool Name          Description
clickhouse_start   Deploy ClickHouse cluster to K8s
clickhouse_stop    Remove ClickHouse cluster
clickhouse_status  Check ClickHouse cluster status

OpenSearch

Tool Name          Description
opensearch_start   Create AWS OpenSearch domain
opensearch_stop    Delete OpenSearch domain
opensearch_status  Check OpenSearch domain status

Spark

Tool Name     Description
spark_submit  Submit Spark job to EMR cluster
spark_status  Check status of a Spark job
spark_jobs    List recent Spark jobs
spark_logs    Download EMR logs from S3

Kubernetes

Tool Name  Description
k8_apply   Apply observability stack to K8s

Utilities

Tool Name   Description
prune_amis  Prune older private AMIs

Tool Naming Convention

MCP tool names are derived from the command's package location:

  • Top-level commands: status, hosts, ip, clean, init, up
  • Cassandra commands: cassandra_ prefix (e.g., cassandra_start, cassandra_use)
  • Nested commands: cassandra_stress_ prefix (e.g., cassandra_stress_start)
  • Hyphens become underscores: update-config → cassandra_update_config
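A Python sketch of these derivation rules (illustrative only; the names here are examples, not the real implementation):

```python
def mcp_tool_name(package_prefix, command):
    """Join the package namespace to the command name and turn hyphens
    into underscores (sketch of the naming rules listed above)."""
    parts = [p for p in (package_prefix, command) if p]
    return "_".join(parts).replace("-", "_")

print(mcp_tool_name("", "status"))                  # status
print(mcp_tool_name("cassandra", "use"))            # cassandra_use
print(mcp_tool_name("cassandra_stress", "start"))   # cassandra_stress_start
print(mcp_tool_name("cassandra", "update-config"))  # cassandra_update_config
```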

Benefits of Server Integration

Benefit                 Description
Direct Control          Claude executes easy-db-lab commands directly without manual intervention
Context Awareness       Claude maintains context about your cluster state and configuration
Automation              Complex multi-step operations can be automated through Claude
Intelligent Assistance  Claude can analyze logs, metrics, and provide optimization recommendations

Example Workflow

  1. Start the server in one terminal:

    easy-db-lab server
    
  2. In another terminal, start Claude Code from the same directory:

    claude
    

    Claude Code automatically detects the .mcp.json file generated by the server.

  3. Ask Claude to help manage your cluster:

    • "Initialize a new 5-node cluster with i4i.xlarge instances"
    • "Check the status of all nodes"
    • "Select Cassandra version 5.0 and start it"
    • "Start a KeyValue stress test for 1 hour"
    • "Deploy ClickHouse and check its status"
    • "Create an OpenSearch domain and monitor its progress"
    • "Submit a Spark job to the EMR cluster"

Live Metrics Streaming

When Redis is configured via the EASY_DB_LAB_REDIS_URL environment variable, the server publishes live cluster metrics to the Redis pub/sub channel every 5 seconds. Metrics are queried from VictoriaMetrics using the same PromQL expressions as the Grafana dashboards.

Enabling

export EASY_DB_LAB_REDIS_URL=redis://localhost:6379/easydblab-events
easy-db-lab server

Metrics events are published to the same channel as command events. Consumers filter by the event.type field.

Event Types

Only metrics for running services are published. If the cluster is running ClickHouse instead of Cassandra, no Cassandra metrics events are emitted.

Metrics.System

Published every 5 seconds with per-node CPU, memory, disk I/O, and filesystem metrics:

{
  "timestamp": "2026-03-08T14:22:05.123Z",
  "commandName": "server",
  "event": {
    "type": "Metrics.System",
    "nodes": {
      "db-0": {
        "cpuUsagePct": 34.2,
        "memoryUsedBytes": 17179869184,
        "diskReadBytesPerSec": 52428800.0,
        "diskWriteBytesPerSec": 104857600.0,
        "filesystemUsedPct": 45.2
      },
      "db-1": {
        "cpuUsagePct": 28.7,
        "memoryUsedBytes": 16106127360,
        "diskReadBytesPerSec": 41943040.0,
        "diskWriteBytesPerSec": 83886080.0,
        "filesystemUsedPct": 42.8
      }
    }
  }
}

Metrics.Cassandra

Published every 5 seconds when the cluster is running Cassandra:

{
  "timestamp": "2026-03-08T14:22:05.187Z",
  "commandName": "server",
  "event": {
    "type": "Metrics.Cassandra",
    "readP99Ms": 1.247,
    "writeP99Ms": 0.832,
    "readOpsPerSec": 15234.5,
    "writeOpsPerSec": 12087.3,
    "compactionPending": 3,
    "compactionCompletedPerSec": 1.5,
    "compactionBytesWrittenPerSec": 52428800.0
  }
}
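A consumer dispatching on the event.type field might look like this sketch (plain JSON handling; the Redis subscription wiring is omitted):

```python
import json

def handle_event(raw):
    """Dispatch a published message by its event.type field (sketch)."""
    msg = json.loads(raw)
    etype = msg["event"]["type"]
    if etype == "Metrics.System":
        # Map node name -> CPU usage, from the per-node metrics payload
        return {node: m["cpuUsagePct"] for node, m in msg["event"]["nodes"].items()}
    if etype == "Metrics.Cassandra":
        return msg["event"]["readP99Ms"]
    return None  # command events etc. are ignored in this sketch

# Minimal sample message in the Metrics.System shape shown above:
sample = json.dumps({
    "timestamp": "2026-03-08T14:22:05.123Z",
    "commandName": "server",
    "event": {"type": "Metrics.System",
              "nodes": {"db-0": {"cpuUsagePct": 34.2}}},
})
print(handle_event(sample))  # {'db-0': 34.2}
```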

Field Reference

System — per node:

Field                 Type    Description
cpuUsagePct           double  CPU usage percentage (0-100)
memoryUsedBytes       long    Memory used in bytes
diskReadBytesPerSec   double  Disk read throughput (bytes/sec)
diskWriteBytesPerSec  double  Disk write throughput (bytes/sec)
filesystemUsedPct     double  Filesystem usage percentage (0-100)

Cassandra — cluster-wide:

Field                         Type    Description
readP99Ms                     double  Read latency p99 in milliseconds
writeP99Ms                    double  Write latency p99 in milliseconds
readOpsPerSec                 double  Read operations per second
writeOpsPerSec                double  Write operations per second
compactionPending             long    Number of pending compactions
compactionCompletedPerSec     double  Compactions completed per second
compactionBytesWrittenPerSec  double  Compaction write throughput (bytes/sec)

Notes

  • The server requires Docker to be installed
  • Your AWS profile must be configured (easy-db-lab setup-profile)
  • The server runs in the foreground and logs to stdout
  • Use Ctrl+C to stop the server

Command Reference

Complete reference for all easy-db-lab commands.

Global Options

Option      Description
--help, -h  Shows help information
--vpc-id    Reconstruct state from existing VPC (requires ClusterId tag)
--force     Force state reconstruction even if state.json exists

Setup Commands

setup-profile

Set up user profile interactively.

easy-db-lab setup-profile

Aliases: setup

Guides you through:

  • Email and AWS credentials collection
  • AWS credential validation
  • Key pair generation
  • IAM role creation
  • Packer VPC infrastructure setup
  • AMI validation/building

show-iam-policies

Display IAM policies with your account ID populated.

easy-db-lab show-iam-policies [policy-name]

Aliases: sip

Argument     Description
policy-name  Optional filter: ec2, iam, or emr

build-image

Build both base and Cassandra AMI images.

easy-db-lab build-image [options]

Option    Description                      Default
--arch    CPU architecture (AMD64, ARM64)  AMD64
--region  AWS region                       (from profile)

Cluster Lifecycle Commands

init

Initialize a directory for easy-db-lab.

easy-db-lab init [cluster-name] [options]

Option                  Description                                 Default
--db, --cassandra, -c   Number of Cassandra instances               3
--app, --stress, -s     Number of stress instances                  0
--instance, -i          Cassandra instance type                     r3.2xlarge
--stress-instance, -si  Stress instance type                        c7i.2xlarge
--azs, -z               Availability zones (e.g., a,b,c)            all
--arch, -a              CPU architecture (AMD64, ARM64)             AMD64
--ebs.type              EBS volume type (NONE, gp2, gp3, io1, io2)  NONE
--ebs.size              EBS volume size in GB                       256
--ebs.iops              EBS IOPS (gp3 only)                         0
--ebs.throughput        EBS throughput (gp3 only)                   0
--ebs.optimized         Enable EBS optimization                     false
--until                 When instances can be deleted               tomorrow
--ami                   Override AMI ID                             (auto-detected)
--open                  Unrestricted SSH access                     false
--tag                   Custom tags (key=value, repeatable)         -
--vpc                   Use existing VPC ID                         -
--up                    Auto-provision after init                   false
--clean                 Remove existing config first                false

up

Provision AWS infrastructure.

easy-db-lab up [options]

Option          Description
--no-setup, -n  Skip K3s setup and AxonOps configuration

Creates: VPC, EC2 instances, K3s cluster. Configures the account S3 bucket for this cluster.

down

Shut down AWS infrastructure.

easy-db-lab down [vpc-id] [options]

Argument  Description
vpc-id    Optional: specific VPC to tear down

Option              Description
--all               Tear down all VPCs tagged with easy_cass_lab
--packer            Tear down the packer infrastructure VPC
--retention-days N  Days to retain S3 data after teardown (default: 1)

clean

Clean up generated files from the current directory.

easy-db-lab clean

hosts

List all hosts in the cluster.

easy-db-lab hosts

status

Display full environment status.

easy-db-lab status

Cassandra Commands

All Cassandra commands are available under the cassandra subcommand group.

cassandra use

Select a Cassandra version.

easy-db-lab cassandra use <version> [options]

Option   Description
--java   Java version to use
--hosts  Filter to specific hosts

Versions: 3.0, 3.11, 4.0, 4.1, 5.0, 5.0-HEAD, trunk

cassandra write-config

Generate a new configuration patch file.

easy-db-lab cassandra write-config [filename] [options]

Aliases: wc

Option        Description       Default
-t, --tokens  Number of tokens  4

cassandra update-config

Apply configuration patch to all nodes.

easy-db-lab cassandra update-config [options]

Aliases: uc

Option         Description
--restart, -r  Restart Cassandra after applying
--hosts        Filter to specific hosts

cassandra download-config

Download configuration files from nodes.

easy-db-lab cassandra download-config [options]

Aliases: dc

Option     Description
--version  Version to download config for

cassandra start

Start Cassandra on all nodes.

easy-db-lab cassandra start [options]

Option   Description                     Default
--sleep  Time between starts in seconds  120
--hosts  Filter to specific hosts        -

cassandra stop

Stop Cassandra on all nodes.

easy-db-lab cassandra stop [options]

Option   Description
--hosts  Filter to specific hosts

cassandra restart

Restart Cassandra on all nodes.

easy-db-lab cassandra restart [options]

Option   Description
--hosts  Filter to specific hosts

cassandra list

List available Cassandra versions.

easy-db-lab cassandra list

Aliases: ls


Cassandra Stress Commands

Stress testing commands under cassandra stress.

cassandra stress start

Start a stress job on Kubernetes.

easy-db-lab cassandra stress start [options]

Aliases: run

cassandra stress stop

Stop and delete stress jobs.

easy-db-lab cassandra stress stop [options]

cassandra stress status

Check status of stress jobs.

easy-db-lab cassandra stress status

cassandra stress logs

View logs from stress jobs.

easy-db-lab cassandra stress logs [options]

cassandra stress list

List available workloads.

easy-db-lab cassandra stress list

cassandra stress fields

List available field generators.

easy-db-lab cassandra stress fields

cassandra stress info

Show information about a workload.

easy-db-lab cassandra stress info <workload>

Utility Commands

exec

Execute commands on remote hosts via systemd-run. Tool output is captured by the systemd journal and shipped to VictoriaLogs via a dedicated journald OTel collector, with accurate timestamps for cross-service log correlation.

exec run

Run a command on remote hosts (foreground by default).

# Foreground (blocks until complete, shows output)
easy-db-lab exec run -t cassandra -- ls /mnt/db1

# Background (returns immediately, tool keeps running)
easy-db-lab exec run --bg -t cassandra -- inotifywait -m /mnt/db1/data

# Background with custom name
easy-db-lab exec run --bg --name watch-imports -t cassandra -- inotifywait -m /mnt/db1/data

Option      Description
-t, --type  Server type: cassandra, stress, control (default: cassandra)
--bg        Run in background (returns immediately)
--name      Name for the systemd unit (auto-derived if not provided)
--hosts     Filter to specific hosts
-p          Execute in parallel across hosts

exec list

List running background tools on remote hosts.

easy-db-lab exec list
easy-db-lab exec list -t cassandra

exec stop

Stop a named background tool.

easy-db-lab exec stop watch-imports
easy-db-lab exec stop watch-imports -t cassandra

ip

Get IP address for a host by alias.

easy-db-lab ip <alias>

version

Display the easy-db-lab version.

easy-db-lab version

repl

Start interactive REPL.

easy-db-lab repl

server

Start the server for Claude Code integration, REST status endpoints, and live metrics.

easy-db-lab server

See Server for details.


Kubernetes Commands

k8 apply

Apply observability stack to K8s cluster.

easy-db-lab k8 apply

Dashboard Commands

dashboards generate

Extract all Grafana dashboard manifests (core and ClickHouse) from JAR resources to the local k8s/ directory. Useful for rapid dashboard iteration without re-running init.

easy-db-lab dashboards generate

dashboards upload

Apply all Grafana dashboard manifests and the datasource ConfigMap to the K8s cluster. Extracts dashboards, creates the grafana-datasources ConfigMap with runtime configuration, and applies everything.

easy-db-lab dashboards upload

ClickHouse Commands

clickhouse start

Deploy ClickHouse cluster to K8s.

easy-db-lab clickhouse start [options]

clickhouse stop

Stop and remove ClickHouse cluster.

easy-db-lab clickhouse stop

clickhouse status

Check ClickHouse cluster status.

easy-db-lab clickhouse status

Spark Commands

spark submit

Submit Spark job to EMR cluster.

easy-db-lab spark submit [options]

spark status

Check status of a Spark job.

easy-db-lab spark status [options]

spark jobs

List recent Spark jobs on the cluster.

easy-db-lab spark jobs

spark logs

Download EMR logs from S3.

easy-db-lab spark logs [options]

OpenSearch Commands

opensearch start

Create an AWS OpenSearch domain.

easy-db-lab opensearch start [options]

opensearch stop

Delete the OpenSearch domain.

easy-db-lab opensearch stop

opensearch status

Check OpenSearch domain status.

easy-db-lab opensearch status

AWS Commands

aws vpcs

List all easy-db-lab VPCs.

easy-db-lab aws vpcs

Port Reference

This page documents the ports used by easy-db-lab and the services it provisions.

Cassandra Ports

Port  Purpose
9042  Cassandra Native Protocol (CQL)
7000  Inter-node communication
7001  Inter-node communication (SSL)
7199  JMX monitoring

Observability Ports (Control Node)

Port  Service
3000  Grafana
4040  Pyroscope (continuous profiling)
8428  VictoriaMetrics (metrics storage)
9428  VictoriaLogs (log storage)
3200  Tempo (trace storage)
5001  YACE CloudWatch exporter (Prometheus)

Cassandra Agent Ports

Port  Service
9000  MAAC metrics agent (Prometheus) — Cassandra 4.0, 4.1, 5.0 only

Observability Ports (All Nodes — DaemonSets)

Port  Service
4317  OTel Collector gRPC
4318  OTel Collector HTTP
9400  Beyla eBPF metrics (Prometheus)
9435  ebpf_exporter metrics (Prometheus)

Server

Port  Purpose
8080  Default server port (configurable via --port)

SSH

SSH access is configured automatically through the sshConfig file generated by source env.sh.

Log Infrastructure

This page documents the centralized logging infrastructure in easy-db-lab, including OTel for log collection and Victoria Logs for storage and querying.

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                     All Nodes                                │
├─────────────────────────────────────────────────────────────┤
│   /var/log/*          │   journald                          │
│   /mnt/db1/cassandra/logs/*.log                             │
│   /mnt/db1/clickhouse/logs/*.log                            │
│   /mnt/db1/clickhouse/keeper/logs/*.log                     │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
              ┌────────────────────────┐
              │  OTel Collector        │
              │  (DaemonSet)           │      ┌──────────────────┐
              │  filelog + journald    │◀─────│  EMR Spark JVMs  │
              │  + OTLP receiver       │ OTLP │  (OTel Java Agent│
              └───────────┬────────────┘      │   v2.25.0)       │
                          │                   └──────────────────┘
┌─────────────────────────┼─────────────────────────┐
│   Control Node          │                          │
├─────────────────────────┼─────────────────────────┤
│                         ▼                          │
│              ┌──────────────────┐                  │
│              │  Victoria Logs   │                  │
│              │    (:9428)       │                  │
│              └────────┬─────────┘                  │
└───────────────────────┼────────────────────────────┘
                        │
                        ▼
              ┌──────────────────┐
              │  easy-db-lab     │
              │  logs query      │
              └──────────────────┘

Components

OTel Collector DaemonSet

The OpenTelemetry Collector runs on all nodes as a DaemonSet, collecting:

  • System file logs: /var/log/**/*.log, /var/log/messages, /var/log/syslog
  • Cassandra logs: /mnt/db1/cassandra/logs/*.log
  • ClickHouse server logs: /mnt/db1/clickhouse/logs/*.log
  • ClickHouse Keeper logs: /mnt/db1/clickhouse/keeper/logs/*.log
  • systemd journal: cassandra, docker, k3s, sshd units
  • OTLP: Receives logs from applications via OTLP protocol

Logs are forwarded to Victoria Logs on the control node via the Elasticsearch-compatible sink.

Spark OTel Java Agent (EMR)

When EMR Spark jobs are running, the Spark driver and executor JVMs are instrumented with the OpenTelemetry Java Agent (v2.25.0) via an EMR bootstrap action. The agent auto-instruments the JVMs and exports logs via OTLP to the control node's OTel Collector.

Logs appear in VictoriaLogs with a service.name attribute like spark-<job-name>, making it easy to filter logs for specific Spark jobs.

The data flow is: Spark JVM → OTel Java Agent → OTLP → OTel Collector (control node) → VictoriaLogs.

Victoria Logs

Victoria Logs runs on the control node and provides:

  • Log storage with efficient compression
  • LogsQL query language
  • HTTP API for querying (port 9428)

Querying Logs

Using the CLI

# Query all logs from last hour
easy-db-lab logs query

# Filter by source
easy-db-lab logs query --source cassandra
easy-db-lab logs query --source clickhouse
easy-db-lab logs query --source systemd

# Filter by host
easy-db-lab logs query --source cassandra --host db0

# Filter by systemd unit
easy-db-lab logs query --source systemd --unit docker.service

# Search for text
easy-db-lab logs query --grep "OutOfMemory"

# Time range and limit
easy-db-lab logs query --since 30m --limit 500

# Raw Victoria Logs query (LogsQL syntax)
easy-db-lab logs query -q 'source:cassandra AND host:db0'

Log Stream Fields

Common fields (all sources):

Field       Description
source      Log source: cassandra, clickhouse, systemd, system
host        Hostname (db0, app0, control0)
timestamp   Log timestamp
message     Log message content

Source-specific fields:

Source      Field       Description
clickhouse  component   server or keeper
systemd     unit        systemd unit name

Troubleshooting

No logs appearing

  1. Check Victoria Logs is running:

    kubectl get pods | grep victoria
    
  2. Check OTel Collector is running:

    kubectl get pods | grep otel
    
  3. Verify the cluster-config ConfigMap exists:

    kubectl get configmap cluster-config -o yaml
    

Connection errors

The logs query command uses the internal SOCKS5 proxy to connect to Victoria Logs. If you see connection errors:

  1. Ensure the cluster is running: easy-db-lab status
  2. The SOCKS5 proxy starts automatically when needed; no manual setup is required
  3. Check that the control node is accessible: ssh control0 hostname

Ports

Port   Service                  Location
9428   Victoria Logs HTTP API   Control node

OpenTelemetry Instrumentation

easy-db-lab includes optional OpenTelemetry (OTel) instrumentation for distributed tracing and metrics. When enabled, traces and metrics are exported to an OTLP-compatible collector.

CLI Tool Instrumentation

The easy-db-lab CLI tool runs with the OpenTelemetry Java Agent, which automatically instruments:

  • AWS SDK calls - EC2, S3, IAM, EMR, STS, OpenSearch operations
  • HTTP clients - OkHttp and other HTTP libraries
  • JDBC/Cassandra driver - Database operations
  • JVM metrics - Memory, threads, garbage collection

Enabling Instrumentation

Set the OTEL_EXPORTER_OTLP_ENDPOINT environment variable to your OTLP collector endpoint:

export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
easy-db-lab up

When this environment variable is:

  • Set: Traces and metrics are exported via gRPC to the specified endpoint
  • Not set: The agent is still loaded but no telemetry is exported (minimal overhead)

The agent uses automatic instrumentation only - there is no custom manual instrumentation in the CLI tool code.

Cluster Node Instrumentation

The following instrumentation applies to cluster nodes (Cassandra, stress, Spark) and is separate from the CLI tool:

Node Role Labeling

The OTel Collector on cluster nodes uses the k8sattributes processor to read the type label from the K8s node and set it as the node_role resource attribute. Grafana dashboards (e.g., System Overview) use this attribute for hostname and service filtering.

Node Type   K8s Label      node_role Value   Source
Cassandra   type=db        db                K3s agent config
Stress      type=app       app               K3s agent config
Control     type=control   control           Up command node labeling
Spark/EMR   N/A            spark             EMR OTel Collector resource/role processor

The k8sattributes processor runs in the metrics/local and logs/local pipelines only. Remote metrics arriving via OTLP (e.g., from Spark nodes) already carry node_role and are not modified.

The processor requires RBAC access to the K8s API. The OTel Collector DaemonSet runs with a dedicated ServiceAccount (otel-collector) that has read-only access to pods and nodes.
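A hedged sketch of the processor fragment this describes; the from: node label extraction requires a recent opentelemetry-collector-contrib release, and the exact easy-db-lab manifest may differ:

```yaml
processors:
  k8sattributes:
    extract:
      labels:
        # Copy the K8s node's "type" label onto telemetry as node_role
        - tag_name: node_role
          key: type
          from: node
```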

Stress Job Metrics

When running cassandra-easy-stress as K8s Jobs, metrics are automatically collected via an OTel collector sidecar container. The sidecar scrapes the stress process's Prometheus endpoint (localhost:9500) and forwards metrics via OTLP to the node's OTel DaemonSet, which then exports them to VictoriaMetrics.

The Prometheus scrape job is named cassandra-easy-stress. The following labels are available in Grafana:

Label       Source                                  Description
host_name   DaemonSet resourcedetection processor   K8s node name where the pod runs
instance    Sidecar relabel_configs                 Node name with port (e.g., ip-10-0-1-50:9500)
cluster     Sidecar relabel_configs                 Cluster name from cluster-config ConfigMap

Short-lived stress commands (list, info, fields) do not include the sidecar since they complete quickly and don't produce meaningful metrics.
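The scrape side of that sidecar can be sketched with the standard OTel prometheus receiver; the job name and target come from the text, while the interval and everything else are illustrative:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: cassandra-easy-stress
          scrape_interval: 15s        # illustrative
          static_configs:
            - targets: ["localhost:9500"]
```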

Spark JVM Instrumentation

EMR Spark jobs are auto-instrumented with the OpenTelemetry Java Agent (v2.25.0) and Pyroscope Java Agent (v2.3.0), both installed via an EMR bootstrap action. The OTel agent is activated through spark.driver.extraJavaOptions and spark.executor.extraJavaOptions.

Each EMR node also runs an OTel Collector as a systemd service, collecting host metrics (CPU, memory, disk, network) and receiving OTLP from the Java agents. The collector forwards all telemetry to the control node's OTel Collector via OTLP gRPC.

Key configuration:

  • OTel Agent JAR: Downloaded by bootstrap action to /opt/otel/opentelemetry-javaagent.jar
  • Pyroscope Agent JAR: Downloaded by bootstrap action to /opt/pyroscope/pyroscope.jar
  • OTel Collector: Installed at /opt/otel/otelcol-contrib, runs as otel-collector.service
  • Export protocol: OTLP/gRPC to localhost:4317 (local collector), which forwards to control node
  • Logs exporter: OTLP (captures JVM log output)
  • Service name: spark-<job-name> (set per job)
  • Profiling: CPU, allocation (512k threshold), lock (10ms threshold) profiles in JFR format sent to Pyroscope server

Cassandra Sidecar Instrumentation

The Cassandra Sidecar process is instrumented with the OpenTelemetry Java Agent and Pyroscope Java Agent, matching the pattern used for Cassandra itself. Both agents are loaded via -javaagent flags set in /etc/default/cassandra-sidecar, which is written by the setup-instances command.

Key configuration:

  • OTel Agent JAR: Installed by Packer to /usr/local/otel/opentelemetry-javaagent.jar
  • Pyroscope Agent JAR: Installed by Packer to /usr/local/pyroscope/pyroscope.jar
  • Service name: cassandra-sidecar (both OTel and Pyroscope)
  • Export endpoint: localhost:4317 (local OTel Collector DaemonSet)
  • Profiling: CPU, allocation (512k threshold), lock (10ms threshold) profiles sent to Pyroscope server
  • Activation: Gated on /etc/default/cassandra-sidecar — the systemd EnvironmentFile=- directive makes it optional, so the sidecar starts normally without instrumentation if the file doesn't exist
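The optional-file gating relies on systemd's leading-dash syntax. A sketch, with the ExecStart path and environment variable name invented for illustration:

```ini
# cassandra-sidecar.service (excerpt; ExecStart and JVM_OPTS are illustrative)
[Service]
# The leading "-" makes the file optional: if it does not exist,
# the sidecar starts without the -javaagent flags
EnvironmentFile=-/etc/default/cassandra-sidecar
ExecStart=/opt/cassandra-sidecar/bin/sidecar $JVM_OPTS
```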

Tool Runner Log Collection

Commands run via exec run are executed through systemd-run, which captures stdout and stderr to log files under /var/log/easydblab/tools/. The OTel Collector's filelog/tools receiver watches this directory and ships log entries to VictoriaLogs with the attribute source: tool-runner.

This provides automatic log capture for ad-hoc debugging tools (e.g., inotifywait, tcpdump, strace) run during investigations. Logs are queryable in VictoriaLogs and preserved in S3 backups via logs backup.

Key details:

  • Log directory: /var/log/easydblab/tools/
  • Source attribute: tool-runner (for filtering in VictoriaLogs queries)
  • Foreground commands: Output displayed after completion, also logged to file
  • Background commands (--bg): Output logged to file only, tool runs as a systemd transient unit

YACE CloudWatch Scrape

YACE (Yet Another CloudWatch Exporter) runs on the control node and scrapes AWS CloudWatch metrics for services used by the cluster. It uses tag-based auto-discovery with the easy_cass_lab=1 tag to find relevant resources.

YACE scrapes metrics for:

  • S3 — bucket request/byte counts
  • EBS — volume read/write ops and latency
  • EC2 — instance CPU, network, disk
  • OpenSearch — domain health, indexing, search metrics

EMR metrics are collected directly via OTel Collectors on Spark nodes (see Spark JVM Instrumentation above).

YACE exposes scraped metrics as Prometheus-compatible metrics on port 5001, which are then scraped by the OTel Collector and forwarded to VictoriaMetrics. This replaces the previous CloudWatch datasource in Grafana with a Prometheus-based approach, giving dashboards access to CloudWatch metrics through VictoriaMetrics queries.
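As a sketch of YACE's tag-based auto-discovery (the service, region, metric list, and period values are placeholders, not the shipped configuration):

```yaml
apiVersion: v1alpha1
discovery:
  jobs:
    - type: AWS/S3
      regions: [us-west-2]
      searchTags:
        # Only discover resources carrying the cluster tag
        - key: easy_cass_lab
          value: "1"
      metrics:
        - name: NumberOfObjects
          statistics: [Average]
          period: 300
          length: 600
```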

Resource Attributes

Traces from the CLI tool and cluster nodes include the following resource attributes:

  • service.name: Service identifier (e.g., easy-db-lab, cassandra-sidecar, spark-<job-name>)
  • service.version: Application version (CLI tool only)
  • host.name: Hostname

Configuration

The following environment variables are supported:

Variable                      Description                      Default
OTEL_EXPORTER_OTLP_ENDPOINT   OTLP gRPC endpoint               None (no export)
OTEL_SERVICE_NAME             Override service name            easy-db-lab
OTEL_RESOURCE_ATTRIBUTES      Additional resource attributes   None

Additional standard OTel environment variables are supported by the agent. See the OpenTelemetry Java Agent documentation for details.

Example: Using with Jaeger

Start Jaeger with OTLP support:

docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  jaegertracing/all-in-one:latest

Export traces to Jaeger:

export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
easy-db-lab up

View traces at http://localhost:16686

Example: Using with Grafana Tempo

If you have Grafana Tempo running with OTLP gRPC ingestion:

export OTEL_EXPORTER_OTLP_ENDPOINT=http://tempo:4317
easy-db-lab up

Troubleshooting

No Traces Appearing

  1. Verify the endpoint is correct and reachable
  2. Check that the collector accepts gRPC OTLP (port 4317 is standard)
  3. Look for OpenTelemetry agent logs on startup (use -Dotel.javaagent.debug=true to enable debug logging)

High Latency

Traces are batched before export (default 1 second delay). This is normal and reduces overhead.

Pyroscope Configuration Parameters

Reference for Pyroscope server configuration. Source: Grafana Pyroscope docs.

How Configuration Works

Pyroscope is configured via a YAML file (-config.file flag) or CLI flags. CLI flags take precedence over YAML values. Environment variables can be used with -config.expand-env=true using ${VAR} or ${VAR:-default} syntax.
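For example, with expansion enabled a config can pull values from the environment and fall back to defaults; the bucket and region names here are placeholders:

```yaml
storage:
  backend: s3
  s3:
    # ${VAR:-default} expands only when -config.expand-env=true is set
    bucket_name: ${PROFILE_BUCKET:-pyroscope-profiles}
    region: ${AWS_REGION:-us-west-2}
```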

View current config at the /config HTTP API endpoint.

Key Configuration Sections

Top-Level

# Modules to load. 'all' enables single-binary mode.
[target: <string> | default = "all"]

api:
  [base-url: <string> | default = ""]

Server

HTTP on port 4040 (default), gRPC on port 9095 (default).

server:
  [http_listen_address: <string> | default = ""]
  [http_listen_port: <int> | default = 4040]
  [grpc_listen_port: <int> | default = 9095]
  [graceful_shutdown_timeout: <duration> | default = 30s]
  [http_server_read_timeout: <duration> | default = 30s]
  [http_server_write_timeout: <duration> | default = 30s]
  [http_server_idle_timeout: <duration> | default = 2m]
  [log_format: <string> | default = "logfmt"]  # logfmt or json
  [log_level: <string> | default = "info"]      # debug, info, warn, error
  [grpc_server_max_recv_msg_size: <int> | default = 4194304]
  [grpc_server_max_send_msg_size: <int> | default = 4194304]
  [grpc_server_max_concurrent_streams: <int> | default = 100]

PyroscopeDB (Local Storage)

pyroscopedb:
  # Directory for local storage
  [data_path: <string> | default = "./data"]
  # Max block duration
  [max_block_duration: <duration> | default = 1h]
  # Row group target size (uncompressed)
  [row_group_target_size: <int> | default = 1342177280]
  # Partition label for symbols
  [symbols_partition_label: <string> | default = ""]
  # Disk retention: minimum free disk (GiB)
  [min_free_disk_gb: <int> | default = 10]
  # Disk retention: minimum free percentage
  [min_disk_available_percentage: <float> | default = 0.05]
  # How often to enforce retention
  [enforcement_interval: <duration> | default = 5m]
  # Disable retention enforcement
  [disable_enforcement: <boolean> | default = false]

Storage (Object Storage Backend)

Supported backends: s3, gcs, azure, swift, filesystem, cos.

storage:
  [backend: <string> | default = ""]
  [prefix: <string> | default = ""]

  s3:
    [endpoint: <string> | default = ""]
    [region: <string> | default = ""]
    [bucket_name: <string> | default = ""]
    [secret_access_key: <string> | default = ""]
    [access_key_id: <string> | default = ""]
    [insecure: <boolean> | default = false]
    [signature_version: <string> | default = "v4"]
    [bucket_lookup_type: <string> | default = "auto"]
    # NOTE: native_aws_auth_enabled exists on main but NOT in v1.18.0.
    # In v1.18.0, leave access_key_id/secret_access_key empty to use
    # the default AWS SDK credential chain (env vars, IMDS).
    sse:
      [type: <string> | default = ""]           # SSE-KMS or SSE-S3
      [kms_key_id: <string> | default = ""]
      [kms_encryption_context: <string> | default = ""]

  gcs:
    [bucket_name: <string> | default = ""]
    [service_account: <string> | default = ""]

  azure:
    [account_name: <string> | default = ""]
    [account_key: <string> | default = ""]
    [container_name: <string> | default = ""]

  filesystem:
    [dir: <string> | default = "./data-shared"]

Distributor

distributor:
  [pushtimeout: <duration> | default = 5s]
  ring:
    kvstore:
      [store: <string> | default = "memberlist"]  # consul, etcd, inmemory, memberlist, multi

Ingester

ingester:
  lifecycler:
    ring:
      kvstore:
        [store: <string> | default = "consul"]
      [heartbeat_timeout: <duration> | default = 1m]
      [replication_factor: <int> | default = 1]
    [num_tokens: <int> | default = 128]
    [heartbeat_period: <duration> | default = 5s]

Querier

querier:
  # Time after which queries go to storage instead of ingesters
  [query_store_after: <duration> | default = 4h]

Compactor

compactor:
  [block_ranges: <list of durations> | default = 1h0m0s,2h0m0s,8h0m0s]
  [data_dir: <string> | default = "./data-compactor"]
  [compaction_interval: <duration> | default = 30m]
  [compaction_concurrency: <int> | default = 1]
  [deletion_delay: <duration> | default = 12h]
  [downsampler_enabled: <boolean> | default = false]

Limits (Per-Tenant)

limits:
  # Ingestion rate limit (MB/s)
  [ingestion_rate_mb: <float> | default = 4]
  [ingestion_burst_size_mb: <float> | default = 2]
  # Label constraints
  [max_label_name_length: <int> | default = 1024]
  [max_label_value_length: <int> | default = 2048]
  [max_label_names_per_series: <int> | default = 30]
  # Profile constraints
  [max_profile_size_bytes: <int> | default = 4194304]
  [max_profile_stacktrace_samples: <int> | default = 16000]
  [max_profile_stacktrace_depth: <int> | default = 1000]
  # Series limits
  [max_global_series_per_tenant: <int> | default = 5000]
  # Query limits
  [max_query_lookback: <duration> | default = 1w]
  [max_query_length: <duration> | default = 1d]
  [max_flamegraph_nodes_default: <int> | default = 8192]
  [max_flamegraph_nodes_max: <int> | default = 1048576]
  # Retention
  [compactor_blocks_retention_period: <duration> | default = 0s]
  # Ingestion time bounds
  [reject_older_than: <duration> | default = 1h]
  [reject_newer_than: <duration> | default = 10m]
  # Relabeling
  [ingestion_relabeling_rules: <list of Configs> | default = []]
  [sample_type_relabeling_rules: <list of Configs> | default = []]

Self-Profiling

self_profiling:
  # Disable push profiling in single-binary mode
  [disable_push: <boolean> | default = false]
  [mutex_profile_fraction: <int> | default = 5]
  [block_profile_rate: <int> | default = 5]

Memberlist (Gossip)

memberlist:
  [bind_port: <int> | default = 7946]
  [join_members: <list of strings> | default = []]
  [gossip_interval: <duration> | default = 200ms]
  [gossip_nodes: <int> | default = 3]
  [leave_timeout: <duration> | default = 20s]

Tracing

tracing:
  [enabled: <boolean> | default = true]

Multi-Tenancy

# Require X-Scope-OrgId header; false = use "anonymous" tenant
[multitenancy_enabled: <boolean> | default = false]

Embedded Grafana

embedded_grafana:
  [data_path: <string> | default = "./data/__embedded_grafana/"]
  [listen_port: <int> | default = 4041]
  [pyroscope_url: <string> | default = "http://localhost:4040"]

Port Summary

Service             Port   Protocol
HTTP API            4040   HTTP
gRPC                9095   gRPC
Memberlist gossip   7946   TCP/UDP
Embedded Grafana    4041   HTTP

Relevant to Our Deployment

Our Pyroscope deployment (configuration/pyroscope/PyroscopeManifestBuilder.kt) uses:

  • S3 backend — IAM role auth via IMDS (no explicit credentials; v1.18.0 lacks native_aws_auth_enabled, SDK defaults to credential chain)
  • Single-binary mode (target: all)
  • Port 4040 for HTTP API
  • Flat storage prefix — pyroscope.{name}-{id} (Pyroscope rejects / in storage.prefix)
  • Config values substituted at build time via TemplateService (__KEY__ placeholders)
  • Profiles received from: Java agent (Cassandra, Spark), eBPF agent (all nodes), stress jobs

Spark Observability Debugging

Diagnostic commands for troubleshooting Spark observability on EMR nodes. These require SSH access to the EMR master node (ssh hadoop@<master-public-dns>).

OTel Collector

# Check collector is running
sudo systemctl status otel-collector

# View collector config (verify control node IP)
cat /opt/otel/config.yaml

# Test connectivity to control node collector
curl -s -o /dev/null -w '%{http_code}' http://<control-ip>:4318

Spark Configuration

# Verify -javaagent flags and OTel env vars are present
cat /etc/spark/conf/spark-defaults.conf

# Verify agent JARs exist
ls -la /opt/otel/opentelemetry-javaagent.jar
ls -la /opt/pyroscope/pyroscope.jar

Runtime Verification (while a job is running)

# Confirm agents are attached to Spark JVMs
ps aux | grep javaagent

Pyroscope API (from any node that can reach control0)

# List all label names
curl \
  -H "Content-Type: application/json" \
  -d '{
      "end": '$(date +%s)000',
      "start": '$(expr $(date +%s) - 3600)000'
    }' \
  http://localhost:4040/querier.v1.QuerierService/LabelNames

# List values for a specific label
curl \
  -H "Content-Type: application/json" \
  -d '{
      "end": '$(date +%s)000',
      "name": "hostname",
      "start": '$(expr $(date +%s) - 3600)000'
    }' \
  http://localhost:4040/querier.v1.QuerierService/LabelValues

# Diff two profiles (compare workloads)
# POST to /querier.v1.QuerierService/Diff with left/right profile selectors
# See: left.labelSelector, right.labelSelector, profileTypeID, start/end
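The inline $(date ...) arithmetic in the requests above can be factored out. This small helper builds a last-hour query window in epoch milliseconds (POSIX shell, no dependencies beyond date):

```shell
# Compute a one-hour query window in epoch milliseconds
end=$(( $(date +%s) * 1000 ))
start=$(( end - 3600 * 1000 ))
printf '{"start": %s, "end": %s}\n' "$start" "$end"
```

The resulting JSON body can be piped to curl with -d @-.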

Grafana Explore Queries

# All metrics from Spark nodes
{node_role="spark"}

# JVM metrics only
{node_role="spark", __name__=~"jvm_.*"}

# List distinct JVM metric names
group({node_role="spark", __name__=~"jvm_.*"}) by (__name__)

# Filesystem usage (raw)
system_filesystem_usage_bytes{state="used", node_role="spark", mountpoint="/"}

JFR Format Reference

The Java Flight Recorder format is used by JVM-based profilers and supported by the Pyroscope Java integration.

When JFR format is used, query parameters behave differently:

  • format should be set to jfr
  • name contains the prefix of the application name. Since a single request may contain multiple profile types, the final application name is created by concatenating this prefix and the profile type. For example, if you send cpu profiling data and set name to my-app{}, it will appear in Pyroscope as my-app.cpu{}
  • units is ignored — actual units depend on the profile types in the data
  • aggregationType is ignored — actual aggregation type depends on the profile types in the data

Supported JFR Profile Types

  • cpu — samples from runnable threads only
  • itimer — similar to cpu profiling
  • wall — samples from any thread regardless of state
  • alloc_in_new_tlab_objects — number of new TLAB objects created
  • alloc_in_new_tlab_bytes — size in bytes of new TLAB objects created
  • alloc_outside_tlab_objects — number of new allocated objects outside any TLAB
  • alloc_outside_tlab_bytes — size in bytes of new allocated objects outside any TLAB

JFR with Dynamic Labels

To ingest JFR data with dynamic labels:

  1. Use multipart/form-data Content-Type
  2. Send JFR data in a form file field called jfr
  3. Send LabelsSnapshot protobuf message in a form file field called labels
message Context {
    // string_id -> string_id
    map<int64, int64> labels = 1;
}
message LabelsSnapshot {
    // context_id -> Context
    map<int64, Context> contexts = 1;
    // string_id -> string
    map<int64, string> strings = 2;
}

Where context_id is a parameter set in async-profiler.

Ingestion Examples

Simple profile upload:

printf "foo;bar 100\nfoo;baz 200" | curl \
  -X POST \
  --data-binary @- \
  'http://localhost:4040/ingest?name=curl-test-app&from=1615709120&until=1615709130'

JFR profile with labels:

curl -X POST \
  -F jfr=@profile.jfr \
  -F labels=@labels.pb \
  "http://localhost:4040/ingest?name=curl-test-app&units=samples&aggregationType=sum&sampleRate=100&from=1655834200&until=1655834210&spyName=javaspy&format=jfr"

Future: Ad-hoc Profiling with async-profiler

async-profiler can capture JFR profiles on demand and upload them to Pyroscope with labels. This enables targeted profiling of specific Spark jobs or Cassandra operations to inspect exactly what is happening at the JVM level.

Common Issues

  • No JVM metrics: Check ps aux | grep javaagent — if -javaagent flags are missing, spark.driver.extraJavaOptions may be overridden at job submission time (replaces spark-defaults.conf entirely).
  • Collector retry errors at startup: Normal if the control node collector isn't ready yet. Should stabilize within a minute.
  • Spark profiles missing hostname label: PYROSCOPE_LABELS env var must be set via spark-env classification with hostname=$(hostname -s).

Development Overview

Hello there. If you're reading this, you've probably decided to contribute to easy-db-lab or use the tools for your own work. Very cool.

Dev Containers are the preferred method for developing easy-db-lab. They provide a consistent, pre-configured environment with all required tools installed:

  • Java 21 (Temurin) via SDKMAN
  • Kotlin and Gradle
  • MkDocs for documentation
  • Docker-in-Docker for container operations
  • Claude Code for AI-assisted development
  • zsh with Powerlevel10k theme

VS Code

  1. Install the Dev Containers extension
  2. Open the project folder
  3. Click "Reopen in Container" when prompted

JetBrains IDEs

  1. Install the Dev Containers plugin
  2. Open the project and select "Dev Containers" from the remote development options

CLI with bin/dev

The bin/dev script provides a convenient wrapper for dev container management:

bin/dev start          # Start the dev container
bin/dev shell          # Open interactive shell
bin/dev test           # Run Gradle tests
bin/dev docs-serve     # Serve docs with live reload
bin/dev claude         # Start Claude Code
bin/dev status         # Show container status
bin/dev down           # Stop and remove container

To mount your Claude Code configuration (for AI-assisted development):

ENABLE_CLAUDE=1 bin/dev start

Run bin/dev help for all available commands.

Building the Project

Once inside the container (or with local tools installed):

./gradlew assemble
./gradlew test

Documentation Preview

Preview documentation locally with live reload:

bin/dev docs-serve

Then open http://localhost:8000 in your browser.

Project Structure

easy-db-lab is broken into several subprojects:

  • Docker containers (prefixed with docker-)
  • Documentation (the manual you're reading now)
  • Utility code for downloading artifacts

Architecture

The project follows a layered architecture:

Commands (PicoCLI) → Services → External Systems (K8s, AWS, Filesystem)

Layer Responsibilities

  • Commands (commands/): Lightweight PicoCLI execution units
  • Services (services/, providers/): Business logic layer

For more details, see the project's CLAUDE.md file.

Docker Development

Building Docker Containers

Each container is versioned and can be built locally using the following:

./gradlew :PROJECT-NAME:buildDocker

Where PROJECT-NAME is one of the subproject directories you see in the top level.

Setup

We recommend updating your local Docker service to use 8GB of memory. This is necessary when running dashboard previews locally. The preview is configured to run multiple Cassandra containers at once.

Available Docker Projects

Check the root project directory for subprojects prefixed with docker- to see available containerized components.

Local Testing

To test containers locally:

  1. Build the container:

    ./gradlew :docker-cassandra:buildDocker
    
  2. Run the container:

    docker run -it <image-name>
    

Memory Requirements

Use Case                                  Recommended Memory
Single container development              4GB
Dashboard preview (multiple containers)   8GB
Full integration testing                  16GB

Publishing

Pre-Release Checklist

  1. First check CI to ensure the build is clean and green
  2. Ensure the following environment variables are set:
    • DOCKER_USERNAME
    • DOCKER_PASSWORD
    • DOCKER_EMAIL

Publishing Steps

Build and Upload

./gradlew buildAll uploadAll

Post-Release

After publishing, bump the version in build.gradle.kts.

Container Publishing

Containers are automatically published to GitHub Container Registry (ghcr.io) when:

  • A version tag (v*) is pushed
  • PR Checks pass on main branch

See .github/workflows/publish-container.yml for details.

Documentation

Documentation is automatically built and deployed via GitHub Actions when changes are pushed to the docs/ directory on the main branch.

Testing Guidelines

This document outlines the testing standards and practices for the easy-db-lab project.

Core Testing Principles

1. Use BaseKoinTest for Dependency Injection

All tests should extend BaseKoinTest to take advantage of automatic dependency injection setup and teardown.

class MyCommandTest : BaseKoinTest() {
    // Your test code here
}

BaseKoinTest provides:

  • Automatic Koin lifecycle management
  • Core modules that are always mocked (AWS, SSH, OutputHandler)
  • Ability to add test-specific modules via additionalTestModules()

2. Use AssertJ for Assertions

Tests should use AssertJ assertions, not JUnit assertions. AssertJ provides more readable and powerful assertion methods.

// Good - AssertJ style
import org.assertj.core.api.Assertions.assertThat

assertThat(result).isNotNull()
assertThat(result.value).isEqualTo("expected")
assertThat(list).hasSize(3).contains("item1", "item2")

// Avoid - JUnit style
import org.junit.jupiter.api.Assertions.assertEquals

assertEquals("expected", result.value)

3. Create Custom Assertions for Non-Trivial Classes

When testing non-trivial classes, create custom AssertJ assertions to implement Domain-Driven Design in tests. This decouples business logic from implementation details and makes tests more maintainable during refactoring.

Custom Assertions Pattern

Custom assertions provide a fluent, domain-specific language for testing that improves readability and maintainability.

Example: Custom Assertion for a Domain Class

Here's a complete example showing how to create and use custom assertions:

// Domain class to be tested
data class CassandraNode(
    val nodeId: String,
    val datacenter: String,
    val rack: String,
    val status: NodeStatus,
    val tokens: Int
)

enum class NodeStatus {
    UP, DOWN, JOINING, LEAVING
}

// Custom assertion class
import org.assertj.core.api.AbstractAssert

class CassandraNodeAssert(actual: CassandraNode?) :
    AbstractAssert<CassandraNodeAssert, CassandraNode>(actual, CassandraNodeAssert::class.java) {

    companion object {
        fun assertThat(actual: CassandraNode?): CassandraNodeAssert {
            return CassandraNodeAssert(actual)
        }
    }

    fun hasNodeId(nodeId: String): CassandraNodeAssert {
        isNotNull
        if (actual.nodeId != nodeId) {
            failWithMessage("Expected node ID to be <%s> but was <%s>", nodeId, actual.nodeId)
        }
        return this
    }

    fun isInDatacenter(datacenter: String): CassandraNodeAssert {
        isNotNull
        if (actual.datacenter != datacenter) {
            failWithMessage("Expected datacenter to be <%s> but was <%s>", datacenter, actual.datacenter)
        }
        return this
    }

    fun hasStatus(status: NodeStatus): CassandraNodeAssert {
        isNotNull
        if (actual.status != status) {
            failWithMessage("Expected status to be <%s> but was <%s>", status, actual.status)
        }
        return this
    }

    fun isUp(): CassandraNodeAssert {
        return hasStatus(NodeStatus.UP)
    }

    fun isDown(): CassandraNodeAssert {
        return hasStatus(NodeStatus.DOWN)
    }

    fun hasTokenCount(tokens: Int): CassandraNodeAssert {
        isNotNull
        if (actual.tokens != tokens) {
            failWithMessage("Expected token count to be <%s> but was <%s>", tokens, actual.tokens)
        }
        return this
    }
}

// Usage in tests
import CassandraNodeAssert.Companion.assertThat

@Test
fun `test cassandra node configuration`() {
    val node = CassandraNode(
        nodeId = "node1",
        datacenter = "dc1",
        rack = "rack1",
        status = NodeStatus.UP,
        tokens = 256
    )

    // Fluent assertions with domain language
    assertThat(node)
        .hasNodeId("node1")
        .isInDatacenter("dc1")
        .isUp()
        .hasTokenCount(256)
}

Project-Wide Assertions Helper

Create a central assertions class to provide access to all custom assertions:

// MyProjectAssertions.kt
object MyProjectAssertions {

    // Cassandra domain assertions
    fun assertThat(actual: CassandraNode?): CassandraNodeAssert {
        return CassandraNodeAssert(actual)
    }

    fun assertThat(actual: Host?): HostAssert {
        return HostAssert(actual)
    }

    fun assertThat(actual: TFState?): TFStateAssert {
        return TFStateAssert(actual)
    }

    // Add more domain assertions as needed
}

Then import statically in tests:

import com.rustyrazorblade.easydblab.assertions.MyProjectAssertions.assertThat

@Test
fun `test complex scenario`() {
    val node = createTestNode()
    val host = createTestHost()

    // All domain assertions available through single import
    assertThat(node).isUp()
    assertThat(host).hasPrivateIp("10.0.0.1")
}

Benefits of Custom Assertions

  1. Domain-Driven Design: Tests use business language, not implementation details
  2. Refactoring Safety: Changes to class internals don't break test logic
  3. Readability: Tests read like specifications
  4. Reusability: Common assertions are centralized
  5. Maintainability: Single place to update assertion logic
  6. Type Safety: Compile-time checking of assertion methods

When to Create Custom Assertions

Create custom assertions for:

  • Domain entities (e.g., Host, TFState, CassandraNode)
  • Complex value objects with multiple properties
  • Classes that appear in multiple test scenarios
  • Any class where you find yourself writing repetitive assertion code

Testing Best Practices

  1. Test Names: Use descriptive names with backticks

    @Test
    fun `should start cassandra node when status is DOWN`() { }
    
  2. Test Structure: Follow Arrange-Act-Assert pattern

    @Test
    fun `test node startup`() {
        // Arrange
        val node = createTestNode(status = NodeStatus.DOWN)
    
        // Act
        val result = nodeManager.startNode(node)
    
        // Assert
        assertThat(result).isUp()
    }
    
  3. Mock External Dependencies: Always mock AWS, SSH, and other external services

    class MyTest : BaseKoinTest() {
        override fun additionalTestModules() = listOf(
            module {
                single { mockRemoteOperationsService() }
            }
        )
    }
    
  4. Test Edge Cases: Include tests for error conditions and boundary cases

  5. Keep Tests Focused: Each test should verify one specific behavior

Testing Interactive Commands with TestPrompter

Commands that require user input (like setup-profile) can be tested deterministically using TestPrompter. This test utility replaces the real Prompter interface and returns predefined responses.

Basic Usage

class MyCommandTest : BaseKoinTest() {
    private lateinit var testPrompter: TestPrompter

    override fun additionalTestModules() = listOf(
        module {
            single<Prompter> { testPrompter }
        }
    )

    @BeforeEach
    fun setup() {
        // Configure responses - keys can be exact matches or partial matches
        testPrompter = TestPrompter(
            mapOf(
                "email" to "test@example.com",
                "region" to "us-west-2",
                "AWS Access Key" to "AKIAIOSFODNN7EXAMPLE",
            )
        )
    }

    @Test
    fun `should collect user credentials`() {
        // Run command that prompts for input
        val command = SetupProfile()
        command.call()

        // Verify prompts were called
        assertThat(testPrompter.wasPromptedFor("email")).isTrue()
        assertThat(testPrompter.wasPromptedFor("region")).isTrue()
    }
}

Response Matching

TestPrompter supports two matching strategies:

  1. Exact match: The question text matches a key exactly
  2. Partial match: The question text contains the key (case-insensitive)

val prompter = TestPrompter(
    mapOf(
        // Exact match - only matches "email" exactly
        "email" to "test@example.com",

        // Partial match - matches any question containing "AWS Profile"
        "AWS Profile" to "my-profile",
    )
)
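The lookup order described above — exact match first, then case-insensitive containment — can be sketched as a standalone function (an assumption about how the matching behaves, mirroring but not copying the real TestPrompter):

```kotlin
// Sketch of the two matching strategies: exact key lookup first,
// then a case-insensitive "question contains key" fallback.
fun matchResponse(question: String, responses: Map<String, String>): String? {
    // 1. Exact match
    responses[question]?.let { return it }
    // 2. Partial match (case-insensitive containment)
    return responses.entries
        .firstOrNull { question.contains(it.key, ignoreCase = true) }
        ?.value
}

fun main() {
    val responses = mapOf(
        "email" to "test@example.com",
        "AWS Profile" to "my-profile",
    )
    println(matchResponse("email", responses))                       // exact
    println(matchResponse("Enter your aws profile name", responses)) // partial
    println(matchResponse("unrelated question", responses))          // no match
}
```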

Sequential Responses for Retry Testing

For testing retry logic (e.g., credential validation failures), use addSequentialResponses():

@Test
fun `should retry on invalid credentials`() {
    testPrompter = TestPrompter()

    // First call returns invalid credentials, second returns valid ones
    testPrompter.addSequentialResponses(
        "AWS Access Key",
        "invalid-key",      // First attempt
        "AKIAVALIDKEY123"   // Second attempt (after retry)
    )

    testPrompter.addSequentialResponses(
        "AWS Secret",
        "invalid-secret",
        "valid-secret-key"
    )

    val command = SetupProfile()
    command.call()

    // Verify the command handled retry correctly
    val callLog = testPrompter.getCallLog()
    val accessKeyCalls = callLog.filter { it.question.contains("Access Key") }
    assertThat(accessKeyCalls).hasSize(2)
}

Verifying Prompt Behavior

TestPrompter records all prompt calls for verification:

@Test
fun `should not prompt for credentials when using AWS profile`() {
    testPrompter = TestPrompter(
        mapOf(
            "AWS Profile" to "my-profile",  // Non-empty = use profile auth
        )
    )

    val command = SetupProfile()
    command.call()

    // Verify credential prompts were skipped
    assertThat(testPrompter.wasPromptedFor("Access Key")).isFalse()
    assertThat(testPrompter.wasPromptedFor("Secret")).isFalse()

    // Check detailed call log
    val callLog = testPrompter.getCallLog()
    assertThat(callLog).anyMatch { it.question.contains("email") }
}

TestPrompter API Reference

  • prompt(question, default, secret) - Returns configured response or default
  • addSequentialResponses(key, vararg responses) - Configure different responses for retry scenarios
  • getCallLog() - Returns list of all prompt calls with details
  • wasPromptedFor(questionContains) - Check if any prompt contained the given text
  • clear() - Reset call log and sequential state

PromptCall Data Class

Each recorded call contains:

  • question: The prompt question text
  • default: The default value offered
  • secret: Whether input was masked (for passwords)
  • returnedValue: The value that was returned
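A data class with those fields might look like the following (field names are from the list above; the project's actual declaration may differ):

```kotlin
// Shape of a recorded prompt call, reconstructed from the documented fields.
data class PromptCall(
    val question: String,       // the prompt question text
    val default: String?,       // the default value offered, if any
    val secret: Boolean,        // whether input was masked
    val returnedValue: String,  // the value that was returned
)

fun main() {
    val call = PromptCall("email", default = null, secret = false, returnedValue = "test@example.com")
    println(call.question)
}
```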

End-to-End Testing

easy-db-lab includes a comprehensive end-to-end test suite that validates the entire workflow from provisioning to teardown.

Running the Test

The end-to-end test is located at bin/end-to-end-test:

./bin/end-to-end-test --cassandra

Command-Line Options

Feature Flags

  • --cassandra - Enable Cassandra-specific tests
  • --spark - Enable Spark EMR provisioning and tests
  • --clickhouse - Enable ClickHouse deployment and tests
  • --opensearch - Enable OpenSearch deployment and tests
  • --all - Enable all optional features
  • --ebs - Enable EBS volumes (gp3, 256GB)
  • --build - Build Packer images (default: skip)

Testing and Inspection

  • --list-steps, -l - List all test steps without running
  • --break <steps> - Set breakpoints at specific steps (comma-separated)
  • --wait - Run all steps except teardown, then wait for confirmation

Examples

# List all available test steps
./bin/end-to-end-test --list-steps

# Run full test with all features
./bin/end-to-end-test --all

# Run with Cassandra and pause before teardown
./bin/end-to-end-test --cassandra --wait

# Run with breakpoints at steps 5 and 15
./bin/end-to-end-test --cassandra --break 5,15

# Build custom AMI images and run test
./bin/end-to-end-test --build --cassandra

Test Steps

The test executes approximately 27 steps covering:

Infrastructure

  1. Build project
  2. Check version command
  3. Build packer images (optional)
  4. Set IAM policies
  5. Initialize cluster
  6. Setup kubectl
  7. Wait for K3s ready
  8. Verify K3s cluster

Registry and Storage

  1. Test registry push/pull
  2. List hosts
  3. Verify S3 backup

Cassandra

  1. Setup Cassandra
  2. Verify Cassandra backup
  3. Verify restore
  4. Cassandra start/stop cycle
  5. Test SSH and nodetool
  6. Check Sidecar
  7. Test exec command
  8. Run stress test
  9. Run stress K8s test

Optional Services

  1. Submit Spark job (if --spark)
  2. Check Spark status (if --spark)
  3. Start ClickHouse (if --clickhouse)
  4. Test ClickHouse (if --clickhouse)
  5. Stop ClickHouse (if --clickhouse)
  6. Start OpenSearch (if --opensearch)
  7. Test OpenSearch (if --opensearch)
  8. Stop OpenSearch (if --opensearch)

Observability and Cleanup

  1. Test observability stack
  2. Teardown cluster

Error Handling

When a test step fails, an interactive menu appears:

  1. Retry from failed step - Resume from the point of failure
  2. Start a shell session - Opens a shell with:
    • easy-db-lab commands available
    • rebuild - Rebuild just the project
    • rerun - Rebuild and resume from failed step
  3. Tear down environment - Run easy-db-lab down --yes
  4. Exit - Exit the script

AWS Requirements

The test requires:

  • AWS profile with sufficient permissions
  • VPC and subnet configuration
  • S3 bucket for backups and logs

Default Configuration

  • Instance count: 3 nodes
  • Instance type: c5.2xlarge
  • Cassandra version: 5.0 (when enabled)
  • Spark workers: 2 (when enabled)

CI Integration

The end-to-end test is designed to run in CI environments:

  • Supports non-interactive mode
  • Returns appropriate exit codes
  • Provides detailed logging
  • Cleans up resources on failure

Spark Development

This guide covers developing and testing Spark-related functionality in easy-db-lab.

Project Structure

All Spark modules live under spark/ with shared configuration:

  • spark/common/ — Shared config (SparkJobConfig), data generation (BulkTestDataGenerator), CQL setup
  • spark/bulk-writer-sidecar/ — Cassandra Analytics, direct sidecar transport (DirectBulkWriter)
  • spark/bulk-writer-s3/ — Cassandra Analytics, S3 staging transport (S3BulkWriter)
  • spark/connector-writer/ — Standard Spark Cassandra Connector (StandardConnectorWriter)
  • spark/connector-read-write/ — Read→transform→write example (KeyValuePrefixCount)

Gradle modules use nested paths: :spark:common, :spark:bulk-writer-sidecar, etc.

Prerequisites

The bulk-writer modules depend on Apache Cassandra Analytics, which requires JDK 11 to build.

One-Time Setup

bin/build-cassandra-analytics

Options:

  • --force - Rebuild even if already built
  • --branch <branch> - Use a specific branch (default: trunk)

Building

# Build all Spark modules
./gradlew :spark:bulk-writer-sidecar:shadowJar :spark:bulk-writer-s3:shadowJar \
  :spark:connector-writer:shadowJar :spark:connector-read-write:shadowJar

# Build individually
./gradlew :spark:bulk-writer-sidecar:shadowJar
./gradlew :spark:connector-writer:shadowJar

# Output locations
ls spark/bulk-writer-sidecar/build/libs/bulk-writer-sidecar-*.jar
ls spark/connector-writer/build/libs/connector-writer-*.jar

Shadow JARs include all dependencies except Spark (provided by EMR).

Running Tests

Main project tests exclude bulk-writer modules to avoid requiring cassandra-analytics:

./gradlew :test

Testing with a Live Cluster

Using bin/spark-bulk-write

This script handles JAR lookup, host resolution, and health checks:

# From a cluster directory (where state.json exists)
spark-bulk-write direct --rows 10000
spark-bulk-write s3 --rows 1000000 --parallelism 20
spark-bulk-write connector --keyspace myks --table mytable

Using bin/submit-direct-bulk-writer

Simplified script for direct bulk writer testing:

bin/submit-direct-bulk-writer [rowCount] [parallelism] [partitionCount] [replicationFactor]

Manual Spark Job Submission

All modules use unified spark.easydblab.* configuration:

easy-db-lab spark submit \
    --jar spark/bulk-writer-sidecar/build/libs/bulk-writer-sidecar-*.jar \
    --main-class com.rustyrazorblade.easydblab.spark.DirectBulkWriter \
    --conf spark.easydblab.contactPoints=host1,host2 \
    --conf spark.easydblab.keyspace=bulk_test \
    --conf spark.easydblab.localDc=us-west-2 \
    --conf spark.easydblab.rowCount=1000 \
    --conf spark.easydblab.replicationFactor=1 \
    --wait

Debugging Failed Jobs

When a Spark job fails, easy-db-lab automatically queries logs and displays failure details.

Manual Log Retrieval

easy-db-lab spark logs --step-id <step-id>
easy-db-lab spark status --step-id <step-id>
easy-db-lab spark jobs

Direct S3 Access

Logs are stored at: s3://<bucket>/spark/emr-logs/<cluster-id>/steps/<step-id>/

aws s3 cp s3://<bucket>/spark/emr-logs/<cluster-id>/steps/<step-id>/stderr.gz - | gunzip

Adding a New Spark Module

  1. Create a directory under spark/ (e.g., spark/bulk-reader/)
  2. Add build.gradle.kts — use an existing module as a template
  3. Add include "spark:bulk-reader" to settings.gradle
  4. Depend on :spark:common for shared config
  5. Use SparkJobConfig.load(sparkConf) for configuration
  6. Implement your main class and submit via easy-db-lab spark submit

Architecture Notes

Shared Configuration

SparkJobConfig in spark/common provides:

  • Property constants (PROP_CONTACT_POINTS, etc.)
  • Config loading from SparkConf with validation
  • Schema setup via CqlSetup
  • Consistent defaults across all modules

Why Shadow JAR?

Bulk-writer modules use the Gradle Shadow plugin because:

  1. EMR provides Spark, so those dependencies are compileOnly
  2. Cassandra Analytics has many transitive dependencies
  3. mergeServiceFiles() properly handles META-INF/services for SPI
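The relevant build configuration can be sketched as follows. The plugin version and Spark coordinates here are illustrative, not the project's actual ones:

```kotlin
// build.gradle.kts sketch for a bulk-writer module (coordinates are examples).
plugins {
    id("com.github.johnrengelman.shadow") version "8.1.1"
}

dependencies {
    // EMR supplies Spark at runtime, so keep it out of the shadow JAR.
    compileOnly("org.apache.spark:spark-sql_2.12:3.5.0")
    implementation(project(":spark:common"))
}

tasks.shadowJar {
    // Merge META-INF/services entries so SPI lookups across
    // cassandra-analytics' transitive dependencies still resolve.
    mergeServiceFiles()
}
```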

Cassandra Analytics Modules

Some cassandra-analytics modules aren't published to Maven:

  • five-zero.jar - Cassandra 5.0 bridge
  • five-zero-bridge.jar - Bridge implementation
  • five-zero-types.jar - Type converters
  • five-zero-sparksql.jar - SparkSQL integration

These are referenced directly from .cassandra-analytics/ build output.

SOCKS Proxy Architecture

This document describes the internal SOCKS5 proxy implementation used by easy-db-lab for programmatic access to private cluster resources.

Overview

easy-db-lab has two separate proxy systems:

  • Shell Proxy - for user shell commands (kubectl, curl); implemented with the SSH CLI (ssh -D) via env.sh
  • JVM Proxy - for internal Kotlin/Java code; implemented with the Apache MINA SSH library

This document covers the JVM Proxy used internally by easy-db-lab.

Why Two Proxies?

The shell proxy (started by source env.sh) works for command-line tools that respect HTTPS_PROXY environment variables. However, JVM code needs programmatic proxy configuration:

  • Java's HttpClient requires a ProxySelector instance
  • The Cassandra driver needs SOCKS5 configuration at the Netty level
  • The Kubernetes fabric8 client needs proxy settings
  • Operations should work without requiring users to run source env.sh first

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     easy-db-lab JVM                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌────────────────────┐    ┌────────────────────────┐           │
│  │ SocksProxyService  │    │ ProxiedHttpClientFactory│           │
│  │   (interface)      │    │                        │           │
│  └─────────┬──────────┘    └───────────┬────────────┘           │
│            │                           │                         │
│            ▼                           ▼                         │
│  ┌─────────────────────┐    ┌────────────────────────┐          │
│  │ MinaSocksProxyService│    │   SocksProxySelector   │          │
│  │ (Apache MINA impl)  │    │  (custom ProxySelector)│          │
│  └─────────┬───────────┘    └────────────────────────┘          │
│            │                                                     │
│            ▼                                                     │
│  ┌─────────────────────┐                                        │
│  │ SSHConnectionProvider│                                        │
│  │ (manages SSH sessions)│                                       │
│  └─────────┬────────────┘                                        │
│            │                                                     │
└────────────┼─────────────────────────────────────────────────────┘
             │
             ▼ SSH Dynamic Port Forwarding
   ┌──────────────────┐
   │   Control Node   │
   │   (control0)     │
   └──────────────────┘

Key Classes

SocksProxyService

Location: com.rustyrazorblade.easydblab.proxy.SocksProxyService

Interface defining proxy operations:

interface SocksProxyService {
    fun ensureRunning(gatewayHost: ClusterHost): SocksProxyState
    fun start(gatewayHost: ClusterHost): SocksProxyState
    fun stop()
    fun isRunning(): Boolean
    fun getState(): SocksProxyState?
    fun getLocalPort(): Int
}

MinaSocksProxyService

Location: com.rustyrazorblade.easydblab.proxy.MinaSocksProxyService

Apache MINA-based implementation that:

  1. Establishes an SSH connection to the gateway host
  2. Starts dynamic port forwarding on a random available port
  3. Maintains thread-safe state for concurrent access

Key implementation details:

  • Uses ReentrantLock for thread safety
  • Dynamically finds an available port via ServerSocket(0)
  • Extracts the underlying ClientSession from the SSH client for port forwarding
  • Supports idempotent ensureRunning() for reuse across operations
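The port-discovery trick mentioned above relies on a standard JDK behavior: binding a ServerSocket to port 0 asks the OS for any free port. A standalone illustration (not the service's actual code):

```kotlin
import java.net.ServerSocket

// Bind to port 0 so the OS picks a free ephemeral port, then release it.
fun findAvailablePort(): Int =
    ServerSocket(0).use { it.localPort }

fun main() {
    val port = findAvailablePort()
    println(port in 1..65535)
}
```

Note there is a small race window between closing the probe socket and the proxy binding the port, which is usually acceptable for test tooling.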

ProxiedHttpClientFactory

Location: com.rustyrazorblade.easydblab.proxy.ProxiedHttpClientFactory

Creates java.net.http.HttpClient instances configured for SOCKS5 proxy:

class ProxiedHttpClientFactory(
    private val socksProxyService: SocksProxyService,
) : HttpClientFactory {

    override fun createClient(): HttpClient {
        val proxyPort = socksProxyService.getLocalPort()
        val proxySelector = SocksProxySelector(proxyPort)

        return HttpClient
            .newBuilder()
            .proxy(proxySelector)
            .connectTimeout(CONNECTION_TIMEOUT)
            .build()
    }
}

SocksProxySelector

Location: com.rustyrazorblade.easydblab.proxy.ProxiedHttpClientFactory (private class)

Custom ProxySelector that returns a SOCKS5 proxy for all URIs:

private class SocksProxySelector(
    private val proxyPort: Int,
) : ProxySelector() {
    private val proxy = Proxy(Proxy.Type.SOCKS, InetSocketAddress("localhost", proxyPort))

    override fun select(uri: URI?): List<Proxy> = listOf(proxy)

    override fun connectFailed(uri: URI?, sa: SocketAddress?, ioe: IOException?) {
        // Handle connection failures if needed
    }
}

Important: Java's ProxySelector.of() creates HTTP proxies, not SOCKS5. This custom implementation is required for SSH dynamic port forwarding.

SocksProxyNettyOptions

Location: com.rustyrazorblade.easydblab.driver.SocksProxyNettyOptions

Configures the Cassandra driver to use SOCKS5 proxy at the Netty level for CQL connections.

Dependency Injection

The proxy components are registered in ProxyModule:

val proxyModule = module {
    // Singleton - maintains proxy state across requests
    single<SocksProxyService> { MinaSocksProxyService(get()) }

    // Factory for creating proxied HTTP clients
    single<HttpClientFactory> { ProxiedHttpClientFactory(get()) }
}

Usage Patterns

Querying Victoria Logs

class DefaultVictoriaLogsService(
    private val socksProxyService: SocksProxyService,
    private val httpClientFactory: HttpClientFactory,
) : VictoriaLogsService {

    override fun query(...): Result<List<String>> = runCatching {
        // Ensure proxy is running to control node
        socksProxyService.ensureRunning(controlHost)

        // Create HTTP client that routes through proxy
        val httpClient = httpClientFactory.createClient()

        // Make request to private IP
        val request = HttpRequest.newBuilder()
            .uri(URI.create("http://${controlHost.privateIp}:9428/..."))
            .build()

        httpClient.send(request, BodyHandlers.ofString())
    }
}

Kubernetes API Access

The K8sService uses the proxy for fabric8 Kubernetes client connections to the private K3s API server.

CQL Sessions

The CqlSessionFactory configures the Cassandra driver with SOCKS5 proxy settings via SocksProxyNettyOptions.

Lifecycle

CLI Mode

In CLI mode (single command execution):

  1. Service starts proxy when needed
  2. Operations complete
  3. Proxy remains running for subsequent operations in same process

Server/MCP Mode

In server mode (long-running process):

  1. Proxy starts on first request requiring cluster access
  2. Reused across multiple requests (connection count tracked)
  3. Stopped on server shutdown

Thread Safety

MinaSocksProxyService uses a ReentrantLock to protect:

  • Proxy state changes
  • Session management
  • Port allocation

This ensures safe concurrent access when multiple threads need cluster resources.
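The shape of that locking pattern, applied to an idempotent ensure-running operation, might look like this (illustrative only, not the actual MinaSocksProxyService code):

```kotlin
import java.util.concurrent.locks.ReentrantLock
import kotlin.concurrent.withLock

// Sketch: a ReentrantLock guards the running flag and port so concurrent
// callers start the proxy at most once and then share it.
class ProxyState {
    private val lock = ReentrantLock()
    private var running = false
    private var port = -1

    fun ensureRunning(startProxy: () -> Int): Int = lock.withLock {
        if (!running) {
            port = startProxy() // start only once; later calls reuse the port
            running = true
        }
        port
    }
}

fun main() {
    val state = ProxyState()
    var starts = 0
    val start = { starts++; 1080 }
    state.ensureRunning(start)
    state.ensureRunning(start) // second call reuses the running proxy
    println("starts=$starts")
}
```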

Error Handling

Common failure scenarios:

  • "HTTP/1.1 header parser received no bytes" - caused by using an HTTP proxy instead of SOCKS5; ensure SocksProxySelector returns Proxy.Type.SOCKS
  • Connection timeout - control node not accessible; verify SSH connectivity to control0
  • Port bind failure - port already in use; the service automatically finds an available port

Testing

When testing code that uses the proxy:

class MyServiceTest : BaseKoinTest() {
    // BaseKoinTest provides mocked SocksProxyService

    @Test
    fun testWithMockedProxy() {
        val mockProxyService = mock<SocksProxyService>()
        whenever(mockProxyService.getLocalPort()).thenReturn(1080)

        // Test your service with mocked proxy
    }
}
Key Files

  • proxy/SocksProxyService.kt - Interface definition
  • proxy/MinaSocksProxyService.kt - Apache MINA implementation
  • proxy/ProxiedHttpClientFactory.kt - HTTP client factory with SOCKS5
  • proxy/ProxyModule.kt - Koin DI registration
  • driver/SocksProxyNettyOptions.kt - Cassandra driver proxy config
  • driver/SocksProxyDriverContext.kt - Driver context with proxy
  • services/VictoriaLogsService.kt - Example usage

Fabric8 Server-Side Apply Pattern

This document explains a common error when using fabric8's Kubernetes client for server-side apply operations, and the correct pattern to use.

The Error

When applying Kubernetes manifests using fabric8, you may encounter:

java.lang.IllegalStateException: Could not find a registered handler for item:
[GenericKubernetesResource(apiVersion=v1, kind=Namespace, metadata=ObjectMeta...)]

This is a client-side fabric8 error, not a Kubernetes server error.

Root Cause

Fabric8 has two loading paths with different behaviors:

  1. Typed Loader (works): client.namespaces().load(stream) → returns a typed Namespace; serverSideApply() works
  2. Generic Loader (fails): client.load(stream) → returns GenericKubernetesResource; serverSideApply() fails

Critical: Items returned by client.load() are always GenericKubernetesResource at runtime, regardless of the YAML content. They cannot be cast to typed classes like Namespace or ConfigMap.

Patterns That Do NOT Work

Attempt 1: Direct serverSideApply on loader

// DON'T DO THIS - causes "Could not find a registered handler" error
client.load(inputStream).serverSideApply()

Attempt 2: Load items then use client.resource()

// DON'T DO THIS - still fails with same error
val items = client.load(inputStream).items()
for (item in items) {
    client.resource(item).serverSideApply()
}

Even though we load the items first, they are still GenericKubernetesResource objects internally, and client.resource(item).serverSideApply() still fails.

Attempt 3: Cast GenericKubernetesResource to typed class

// DON'T DO THIS - causes ClassCastException
val items = client.load(inputStream).items()
for (item in items) {
    when (item.kind) {
        "Namespace" -> client.namespaces().resource(item as Namespace).serverSideApply()
        // ...
    }
}

Error: java.lang.ClassCastException: class io.fabric8.kubernetes.api.model.GenericKubernetesResource cannot be cast to class io.fabric8.kubernetes.api.model.Namespace

The items from client.load() are truly GenericKubernetesResource at runtime - they cannot be cast to typed classes.

The Pattern That Works

Use typed client loaders directly with forceConflicts():

private fun loadAndApplyManifest(client: KubernetesClient, file: File) {
    val yamlContent = file.readText()
    val kind = extractKind(yamlContent)

    ByteArrayInputStream(yamlContent.toByteArray()).use { stream ->
        when (kind) {
            "Namespace" -> client.namespaces().load(stream).forceConflicts().serverSideApply()
            "ConfigMap" -> client.configMaps().load(stream).forceConflicts().serverSideApply()
            "Service" -> client.services().load(stream).forceConflicts().serverSideApply()
            "DaemonSet" -> client.apps().daemonSets().load(stream).forceConflicts().serverSideApply()
            "Deployment" -> client.apps().deployments().load(stream).forceConflicts().serverSideApply()
            else -> throw IllegalStateException("Unsupported resource kind: $kind")
        }
    }
}

private fun extractKind(yamlContent: String): String {
    val kindRegex = Regex("""^kind:\s*(\w+)""", RegexOption.MULTILINE)
    return kindRegex.find(yamlContent)?.groupValues?.get(1)
        ?: throw IllegalStateException("Could not determine resource kind from YAML")
}

This works because:

  1. Typed loaders (e.g., client.namespaces().load(stream)) return properly typed resources
  2. Typed resources have registered handlers for serverSideApply()
  3. forceConflicts() resolves field manager conflicts when multiple controllers manage the same resource
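The extractKind helper above can be exercised standalone, since it is plain regex over the YAML text:

```kotlin
// Standalone copy of the extractKind helper for a quick check.
fun extractKind(yamlContent: String): String {
    val kindRegex = Regex("""^kind:\s*(\w+)""", RegexOption.MULTILINE)
    return kindRegex.find(yamlContent)?.groupValues?.get(1)
        ?: error("Could not determine resource kind from YAML")
}

fun main() {
    val yaml = """
        apiVersion: v1
        kind: Namespace
        metadata:
          name: monitoring
    """.trimIndent()
    println(extractKind(yaml))
}
```

One caveat worth knowing: the `^kind:` anchor matches the first `kind:` at a line start, so multi-document YAML files (separated by `---`) would need to be split before dispatching.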

Required Imports

import io.fabric8.kubernetes.client.KubernetesClient
import java.io.ByteArrayInputStream
import java.io.File

Adding New Resource Types

If you need to support additional Kubernetes resource types, add them to the when statement:

"Pod" -> client.pods().load(stream).forceConflicts().serverSideApply()
"Secret" -> client.secrets().load(stream).forceConflicts().serverSideApply()
"StatefulSet" -> client.apps().statefulSets().load(stream).forceConflicts().serverSideApply()
// etc.

References

  • Fabric8 Kubernetes Client: https://github.com/fabric8io/kubernetes-client
  • Server-side apply documentation: https://github.com/fabric8io/kubernetes-client/blob/main/doc/CHEATSHEET.md

Fix History

  • 2025-12-02 - client.load().serverSideApply() fails - attempted: load items first, then apply via client.resource(item)
  • 2025-12-02 - client.resource(item).serverSideApply() also fails - attempted: cast items to typed classes (e.g., item as Namespace)
  • 2025-12-02 - item as Namespace causes ClassCastException - use typed loaders directly (client.namespaces().load(stream))
  • 2025-12-02 - patch operation fails for Namespace - fixed: add forceConflicts() before serverSideApply()