easy-db-lab

easy-db-lab creates lab environments for database evaluations in AWS. It provisions infrastructure, deploys databases, and sets up a full observability stack so you can focus on testing, benchmarking, and learning.

Supported databases include Apache Cassandra, ClickHouse, and OpenSearch, with Apache Spark available for analytics workloads.

If you are looking for a tool to aid in stress testing Cassandra clusters, see the companion project cassandra-easy-stress.

If you're looking for tools to help manage Cassandra in production environments, please see Reaper, cstar, and K8ssandra.

Quick Start

  1. Install easy-db-lab
  2. Set up your profile - Run easy-db-lab setup-profile
  3. Follow the tutorial

Features

Database Support

  • Apache Cassandra: Versions 3.0, 3.11, 4.0, 4.1, 5.0, and trunk builds. Includes custom build support, Cassandra Sidecar, and integration with cassandra-easy-stress for benchmarking.
  • ClickHouse: Sharded clusters with configurable replication, distributed tables, and S3-tiered storage.
  • OpenSearch: AWS OpenSearch domains for search and analytics.
  • Apache Spark: EMR-based Spark clusters for analytics workloads.

AWS Integration

  • EC2 Provisioning: Automated provisioning with configurable instance types
  • EBS Storage: Optional EBS volumes for persistent storage
  • S3 Backup: Automatic backup of configurations and state to S3
  • IAM Integration: Managed IAM policies for secure operations

Kubernetes (K3s)

  • Lightweight K3s: Automatic K3s cluster deployment across all nodes
  • kubectl/k9s: Pre-configured access with SOCKS5 proxy support
  • Private Registry: HTTPS Docker registry for custom images
  • Jib Integration: Push custom containers directly from Gradle

Monitoring and Observability

  • VictoriaMetrics: Time-series database for metrics storage
  • VictoriaLogs: Centralized log aggregation
  • Grafana: Pre-configured dashboards for Cassandra, ClickHouse, and system metrics
  • OpenTelemetry: Distributed tracing and metrics collection
  • AxonOps: Optional integration with AxonOps for Cassandra monitoring and management

Developer Experience

  • Shell Aliases: Convenient shortcuts for cluster management (c0, c-all, c-status, etc.)
  • Server: Integration with Claude Code for AI-assisted operations
  • Restore Support: Recover cluster state from VPC ID or S3 backup
  • SOCKS5 Proxy: Secure access to private cluster resources

Stress Testing

  • cassandra-easy-stress: Native integration with the companion stress testing tool
  • Kubernetes Jobs: Run stress tests as K8s jobs for scalability
  • Artifact Collection: Automatic collection of metrics and diagnostics

Prerequisites

Before using easy-db-lab, ensure you have the following:

System Requirements

| Requirement | Details |
|---|---|
| Operating System | macOS or Linux |
| Java | JDK 21 or later |
| Docker | Required for building custom AMIs |

AWS Requirements

  • AWS Account: A dedicated AWS account is recommended for lab environments
  • AWS Access Key & Secret: Credentials for programmatic access
  • IAM Permissions: Permissions to create EC2, IAM, S3, and optionally EMR resources

Tip

Run easy-db-lab show-iam-policies to see the exact IAM policies required with your account ID populated. See Setup for details.

Optional

  • AxonOps Account: For free Cassandra monitoring. Create an account at axonops.com

Next Steps

Run the interactive setup to configure your profile:

easy-db-lab setup-profile

See the Setup Guide for detailed instructions.

Installation

Tarball Install

You can grab a tarball from the releases page.

To get started, add the bin directory of easy-db-lab to your $PATH. For example:

export PATH="$PATH:/path/to/easy-db-lab/bin"

Building from Source

If you prefer to build from source:

git clone https://github.com/rustyrazorblade/easy-db-lab.git
cd easy-db-lab
./gradlew assemble

The built distribution will be in build/distributions/.

Setup

This guide walks you through the initial setup of easy-db-lab, including AWS credentials configuration, IAM policies, and AMI creation.

Overview

The setup-profile command handles all initial configuration interactively. It will:

  1. Collect your email and AWS credentials
  2. Validate your AWS access
  3. Create necessary AWS resources (key pair, IAM roles, Packer VPC)
  4. Build or validate the required AMI

Prerequisites

Before running setup:

  • AWS Account: An AWS account with appropriate permissions (see IAM Policies below)
  • Java 21+: Required to run easy-db-lab
  • Docker: Required only if building custom AMIs

Step 1: Run Setup Profile

Run the interactive setup:

easy-db-lab setup-profile

Or use the shorter alias:

easy-db-lab setup

The setup wizard will prompt you for:

| Prompt | Description | Default |
|---|---|---|
| Email | Used to tag AWS resources for ownership | (required) |
| AWS Region | Region for your clusters | us-west-2 |
| AWS Access Key | Your AWS access key ID | (required) |
| AWS Secret Key | Your AWS secret access key | (required) |
| AxonOps Org | Optional: AxonOps organization name | (skip) |
| AxonOps Key | Optional: AxonOps API key | (skip) |
| AWS Profile | Optional: Named AWS profile | (skip) |

What Gets Created

During setup, the following AWS resources are created:

  • EC2 Key Pair: For SSH access to instances
  • IAM Role: For instance permissions (easy-db-lab-instance-role)
  • Packer VPC: Infrastructure for building AMIs
  • AMI (if needed): Takes 10-15 minutes to build

Configuration Location

Your profile is saved to:

~/.easy-db-lab/profiles/default/settings.yaml

Tip

Use a different profile by setting the EASY_DB_LAB_PROFILE environment variable before running setup.

Step 2: Getting IAM Policies

If you need to request permissions from your AWS administrator, use the show-iam-policies command to display the required policies with your account ID populated:

easy-db-lab show-iam-policies

This displays three policies:

| Policy | Purpose |
|---|---|
| EC2 | Create/manage EC2 instances, VPCs, security groups |
| IAM | Create instance roles and profiles |
| EMR | Create Spark clusters (optional) |

Filter by Policy Name

To show a specific policy:

easy-db-lab show-iam-policies ec2    # Show EC2 policy only
easy-db-lab show-iam-policies iam    # Show IAM policy only
easy-db-lab show-iam-policies emr    # Show EMR policy only

For teams with multiple users, we recommend creating managed policies attached to an IAM group:

  1. Create an IAM group (e.g., "EasyDBLabUsers")
  2. Create three managed policies from the JSON output
  3. Attach all policies to the group
  4. Add users to the group

Warning

Inline policies have a 5,120-byte limit, which may not fit all three policies. Use managed policies instead.

Step 3: Build Custom AMI (Optional)

If setup couldn't find a valid AMI for your architecture, or if you want to customize the base image, build one manually:

easy-db-lab build-image

Build Options

| Option | Description | Default |
|---|---|---|
| `--arch` | CPU architecture (AMD64 or ARM64) | AMD64 |
| `--region` | AWS region for the AMI | (from profile) |

Examples

# Build AMD64 AMI (default)
easy-db-lab build-image

# Build ARM64 AMI for Graviton instances
easy-db-lab build-image --arch ARM64

# Build in specific region
easy-db-lab build-image --region eu-west-1

Note

Building an AMI takes approximately 10-15 minutes. Docker must be installed and running.

Environment Variables

| Variable | Description | Default |
|---|---|---|
| `EASY_DB_LAB_USER_DIR` | Override configuration directory | `~/.easy-db-lab` |
| `EASY_DB_LAB_PROFILE` | Use a named profile | default |
| `EASY_DB_LAB_INSTANCE_TYPE` | Default instance type for init | r3.2xlarge |
| `EASY_DB_LAB_STRESS_INSTANCE_TYPE` | Default stress instance type | c7i.2xlarge |
| `EASY_DB_LAB_AMI` | Override AMI ID | (auto-detected) |

Verify Installation

After setup completes, verify by running:

easy-db-lab

You should see the help output with available commands.

Next Steps

Once setup is complete, follow the Tutorial to create your first cluster.

Cluster Setup

This page provides a quick reference for cluster initialization and provisioning. For a complete walkthrough, see the Tutorial.

Quick Start

# Initialize a 3-node cluster with i4i.xlarge instances and 1 stress node
easy-db-lab init my-cluster --db 3 --instance i4i.xlarge --app 1

# Provision AWS infrastructure
easy-db-lab up

# Set up your shell environment
source env.sh

Or combine init and up:

easy-db-lab init my-cluster --db 3 --instance i4i.xlarge --app 1 --up

Initialize

The init command creates local configuration files but does not provision AWS resources.

easy-db-lab init <cluster-name> [options]

Common Options

| Option | Description | Default |
|---|---|---|
| `--db, -c` | Number of Cassandra instances | 3 |
| `--stress, -s` | Number of stress instances | 0 |
| `--instance, -i` | Instance type | r3.2xlarge |
| `--ebs.type` | EBS volume type (NONE, gp2, gp3) | NONE |
| `--ebs.size` | EBS volume size in GB | 256 |
| `--arch, -a` | CPU architecture (AMD64, ARM64) | AMD64 |
| `--up` | Auto-provision after init | false |

For the complete options list, see the Tutorial or run easy-db-lab init --help.

Storage Requirements

Database instances need a data disk separate from the root volume. This can come from either:

  • Instance store (local NVMe) — storage-optimized families (e.g., i3.xlarge, i4i.xlarge) and instance types with a d suffix (e.g., m5d.xlarge, c5d.2xlarge) include local NVMe storage.
  • EBS volumes — Attach an EBS volume using --ebs.type (e.g., --ebs.type gp3).

If the selected instance type has no instance store and --ebs.type is not specified, up will fail with an error. For example, c5.2xlarge has no local storage, so you must specify EBS:

easy-db-lab init my-cluster --instance c5.2xlarge --ebs.type gp3 --ebs.size 200

Launch

The up command provisions all AWS infrastructure:

easy-db-lab up

What Gets Created

  • S3 bucket for cluster state
  • VPC with subnets and security groups
  • EC2 instances (Cassandra, Stress, Control nodes)
    • Control node: m5d.xlarge (NVMe-backed instance; K3s data is stored on NVMe to avoid filling the root volume)
  • K3s cluster across all nodes (Cassandra, Stress, Control)

Options

| Option | Description |
|---|---|
| `--no-setup, -n` | Skip K3s and AxonOps setup |

Shut Down

Destroy all cluster infrastructure:

easy-db-lab down

Next Steps

After your cluster is running:

  1. Configure Cassandra - Select version and apply configuration
  2. Shell Aliases - Set up convenient shortcuts

Tutorial: Getting Started

This tutorial walks you through creating a database cluster from scratch, covering initialization, infrastructure provisioning, and database configuration. The examples below use Cassandra, but the same infrastructure supports ClickHouse, OpenSearch, and Spark.

Prerequisites

Before starting, ensure you've completed the Setup process by running easy-db-lab setup-profile.

Part 1: Initialize Your Cluster

The init command creates local configuration files for your cluster. It does not provision AWS resources yet.

easy-db-lab init my-cluster

This creates a 3-node Cassandra cluster by default.

Init Options

| Option | Description | Default |
|---|---|---|
| `--db, --cassandra, -c` | Number of Cassandra instances | 3 |
| `--app, --stress, -s` | Number of stress/application instances | 0 |
| `--instance, -i` | Cassandra instance type | r3.2xlarge |
| `--stress-instance, -si` | Stress instance type | c7i.2xlarge |
| `--azs, -z` | Availability zones (e.g., a,b,c) | all available |
| `--arch, -a` | CPU architecture (AMD64, ARM64) | AMD64 |
| `--ebs.type` | EBS volume type (NONE, gp2, gp3, io1, io2) | NONE |
| `--ebs.size` | EBS volume size in GB | 256 |
| `--ebs.iops` | EBS IOPS (gp3 only) | 0 |
| `--ebs.throughput` | EBS throughput (gp3 only) | 0 |
| `--until` | When instances can be deleted | tomorrow |
| `--tag` | Custom tags (key=value, repeatable) | - |
| `--vpc` | Use existing VPC ID | - |
| `--up` | Auto-provision after init | false |
| `--clean` | Remove existing config first | false |

Examples

Basic 3-node cluster:

easy-db-lab init my-cluster

5-node cluster with 2 stress nodes:

easy-db-lab init my-cluster --db 5 --stress 2

Production-like cluster with EBS storage:

easy-db-lab init prod-test --db 5 --ebs.type gp3 --ebs.size 500 --ebs.iops 3000

ARM64 cluster for Graviton instances:

easy-db-lab init my-cluster --arch ARM64 --instance r7g.2xlarge

Initialize and provision in one step:

easy-db-lab init my-cluster --up

Part 2: Launch Infrastructure

Once initialized, provision the AWS infrastructure:

easy-db-lab up

This command creates:

  • S3 Storage: Cluster data stored under a dedicated prefix in the account S3 bucket
  • VPC: With subnets and security groups
  • EC2 Instances: Cassandra nodes, stress nodes, and a control node
  • K3s Cluster: Lightweight Kubernetes across all nodes

What Happens During up

  1. Configures account S3 bucket with cluster prefix
  2. Creates VPC with public subnets in your availability zones
  3. Provisions EC2 instances in parallel
  4. Waits for SSH availability
  5. Configures K3s cluster on all nodes
  6. Writes SSH config and environment files

Up Options

| Option | Description |
|---|---|
| `--no-setup, -n` | Skip K3s setup and AxonOps configuration |

Environment Setup

After up completes, source the environment file:

source env.sh

This configures your shell with:

  • SSH shortcuts: ssh db0, ssh db1, ssh stress0, etc.
  • Cluster aliases: c0, c-all, c-status
  • SOCKS proxy configuration

See Shell Aliases for all available shortcuts.

Part 3: Configure Cassandra 5.0

With infrastructure running, configure and start Cassandra.

Step 1: Select Cassandra Version

easy-db-lab cassandra use 5.0

This command:

  • Sets the active Cassandra version on all nodes
  • Downloads configuration files to your local directory
  • Applies any existing patch configuration

Available versions: 3.0, 3.11, 4.0, 4.1, 5.0, 5.0-HEAD, trunk

Step 2: Customize Configuration (Optional)

Edit cassandra.patch.yaml to customize settings:

# Example: Change token count
vim cassandra.patch.yaml

Common customizations:

| Setting | Description | Default |
|---|---|---|
| `num_tokens` | Virtual nodes per instance | 4 |
| `concurrent_reads` | Max concurrent read operations | 64 |
| `concurrent_writes` | Max concurrent write operations | 64 |
| `endpoint_snitch` | Network topology snitch | Ec2Snitch |
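
Putting the settings above into practice, a minimal cassandra.patch.yaml might look like the sketch below. The values shown are the defaults from the table; the file contents here are illustrative, not generated output:

```yaml
# cassandra.patch.yaml -- contains only the settings you want to override
num_tokens: 4
concurrent_reads: 64
concurrent_writes: 64
endpoint_snitch: Ec2Snitch
```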

Step 3: Apply Configuration

easy-db-lab cassandra update-config

This uploads and applies the patch to all Cassandra nodes.

To apply and restart Cassandra in one command:

easy-db-lab cassandra update-config --restart

Step 4: Start Cassandra

easy-db-lab cassandra start

Step 5: Verify Cluster

Check cluster status:

ssh db0 nodetool status

Or use the shell alias (after sourcing env.sh):

c-status

You should see all nodes in UN (Up/Normal) state.

Part 4: Working with Your Cluster

SSH Access

After sourcing env.sh:

ssh db0          # First Cassandra node
ssh db1          # Second Cassandra node
ssh stress0      # First stress node (if provisioned)
ssh control0     # Control node

Cassandra Management

# Stop Cassandra on all nodes
easy-db-lab cassandra stop

# Start Cassandra on all nodes
easy-db-lab cassandra start

# Restart Cassandra on all nodes
easy-db-lab cassandra restart

Filter to Specific Hosts

Most commands support the --hosts filter:

# Apply config only to db0 and db1
easy-db-lab cassandra update-config --hosts db0,db1

# Restart only db2
easy-db-lab cassandra restart --hosts db2

Download Configuration Files

To download the current configuration from nodes:

easy-db-lab cassandra download-config

This saves configuration files to a local directory named after the version (e.g., 5.0/).

Part 5: Shut Down

When finished, destroy the cluster infrastructure:

easy-db-lab down

Warning

This permanently destroys all EC2 instances, the VPC, and associated resources. S3 data under the cluster prefix is scheduled for expiration (default: 1 day).

Quick Reference

| Task | Command |
|---|---|
| Initialize cluster | `easy-db-lab init <name> [options]` |
| Provision infrastructure | `easy-db-lab up` |
| Initialize and provision | `easy-db-lab init <name> --up` |
| Select Cassandra version | `easy-db-lab cassandra use <version>` |
| Apply configuration | `easy-db-lab cassandra update-config` |
| Start Cassandra | `easy-db-lab cassandra start` |
| Stop Cassandra | `easy-db-lab cassandra stop` |
| Restart Cassandra | `easy-db-lab cassandra restart` |
| Check cluster status | `ssh db0 nodetool status` |
| Download config | `easy-db-lab cassandra download-config` |
| Destroy cluster | `easy-db-lab down` |
| Display hosts | `easy-db-lab hosts` |
| Clean local files | `easy-db-lab clean` |

Next Steps

Continue with the Kubernetes guide to learn how to access and use the K3s cluster.

Kubernetes

easy-db-lab uses K3s to provide a lightweight Kubernetes cluster for deploying supporting services like ClickHouse, monitoring, and stress testing workloads.

Overview

K3s is automatically installed on all nodes during provisioning:

  • Control node: Runs the K3s server (Kubernetes control plane)
  • Cassandra nodes: Run as K3s agents with label type=db
  • Stress nodes: Run as K3s agents with label type=app

Accessing the Cluster

kubectl

After running source env.sh, kubectl is automatically configured:

source env.sh
kubectl get nodes
kubectl get pods -A

The kubeconfig is downloaded to your working directory and kubectl is configured to use the SOCKS5 proxy for connectivity.

k9s

k9s provides a terminal-based UI for Kubernetes:

source env.sh
k9s

k9s is pre-configured to use the correct kubeconfig and proxy settings.

Port Forwarding

easy-db-lab uses a SOCKS5 proxy for accessing the private Kubernetes cluster.

Starting the Proxy

The proxy starts automatically when you source the environment:

source env.sh

Manual Proxy Control

# Start the SOCKS5 proxy
start-socks5

# Check proxy status
socks5-status

# Stop the proxy
stop-socks5

Running Commands Through the Proxy

Commands like kubectl and k9s automatically use the proxy. For other commands:

# Route any command through the proxy
with-proxy curl http://10.0.1.50:8080/api
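
Conceptually, a with-proxy-style wrapper just runs its argument with the proxy environment set. The sketch below is illustrative only (the function name with_proxy_demo is an assumption for demonstration; the real with-proxy comes from env.sh), using the default proxy port 1080:

```shell
# Illustrative sketch of a with-proxy-style wrapper; not the env.sh implementation.
# Assumes the SOCKS5 proxy is listening on localhost:1080 (the default).
with_proxy_demo() {
  ALL_PROXY=socks5://localhost:1080 "$@"
}

# Any wrapped command inherits the proxy environment:
with_proxy_demo sh -c 'echo "proxy is $ALL_PROXY"'
```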

Pushing Docker Images with Jib

easy-db-lab includes a private Docker registry accessible via HTTPS. You can push custom images using Jib.

Gradle Configuration

Add Jib to your build.gradle.kts:

plugins {
    id("com.google.cloud.tools.jib") version "3.4.0"
}

jib {
    from {
        image = "eclipse-temurin:21-jre"
    }
    to {
        // Use the control node's registry
        image = "control0:5000/my-app"
        tags = setOf("latest", project.version.toString())
    }
    container {
        mainClass = "com.example.MainKt"
    }
}

Pushing to the Registry

# Build and push to the cluster registry
./gradlew jib

# Or build locally first
./gradlew jibDockerBuild

Using Images in Kubernetes

Reference your pushed images in Kubernetes manifests:

apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: my-app
    image: control0:5000/my-app:latest

Node Labels

Nodes are automatically labeled for workload scheduling:

| Node Type | Labels |
|---|---|
| Cassandra | `type=db` |
| Stress | `type=app` |
| Control | (no labels) |

Using Node Selectors

Schedule pods on specific node types:

apiVersion: v1
kind: Pod
metadata:
  name: stress-worker
spec:
  nodeSelector:
    type: app
  containers:
  - name: worker
    image: my-stress-tool:latest

Useful Commands

# List all nodes
kubectl get nodes

# List pods in all namespaces
kubectl get pods -A

# Watch pod status
kubectl get pods -w

# View logs
kubectl logs <pod-name>

# Execute command in pod
kubectl exec -it <pod-name> -- /bin/bash

# Port forward a service locally
kubectl port-forward svc/my-service 8080:80

Architecture

Networking

  • K3s server runs on the control node
  • All nodes communicate over the private VPC network
  • External access is via SOCKS5 proxy through the control node

Storage

  • Local path provisioner for persistent volumes
  • Data stored on node-local NVMe drives at /mnt/db1/
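
Workloads can request persistent storage through the local path provisioner. A minimal PersistentVolumeClaim sketch (assuming K3s's standard local-path storage class name):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-path   # K3s local path provisioner default
  resources:
    requests:
      storage: 10Gi
```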

Kubeconfig

The kubeconfig file is:

  • Downloaded automatically during cluster setup
  • Stored as kubeconfig in your working directory
  • Backed up to S3 for recovery

Network Connectivity

This guide covers how to connect to your easy-db-lab cluster from your local machine.

Overview

easy-db-lab clusters run in a private AWS VPC. By default, the VPC uses 10.0.0.0/16, but you can customize this:

easy-db-lab init --cidr 10.14.0.0/20 ...

There are two methods to access your cluster:

| Method | Best For |
|---|---|
| Tailscale VPN (Recommended) | Production use, team sharing, persistent access |
| SOCKS Proxy | Quick testing when you don't want to set up Tailscale |

Tailscale provides a persistent VPN connection to your cluster. Once connected, you can access cluster resources directly—no proxy configuration needed.

Why Tailscale?

  • Native access - Use any tool (browsers, kubectl, ssh) without proxy configuration
  • Persistent - Connection survives terminal sessions
  • Team sharing - Share cluster access with teammates
  • Reliable - No SSH tunnels to maintain or reconnect

Setup (One-Time)

Step 1: Configure Tailscale ACL

Go to Tailscale ACL Editor and add:

{
  "tagOwners": {
    "tag:easy-db-lab": ["autogroup:admin"]
  },
  "autoApprovers": {
    "routes": {
      "10.0.0.0/8": ["tag:easy-db-lab"]
    }
  }
}

The autoApprovers section automatically approves subnet routes, so you don't need to manually approve each cluster.

Step 2: Create OAuth Client

  1. Go to Tailscale OAuth Settings
  2. Click Generate OAuth Client
  3. Configure:
    • Description: easy-db-lab
    • Scopes: Select Devices: Write
    • Tags: Add tag:easy-db-lab
  4. Click Generate and save the Client ID and Client Secret

Step 3: Configure easy-db-lab

easy-db-lab setup-profile

Enter your Tailscale OAuth credentials when prompted.

Usage

Tailscale starts automatically with easy-db-lab up. Once connected:

# Direct access to private IPs
ssh ubuntu@10.0.1.50
curl http://10.0.1.50:9428/health
kubectl get pods

# Web UIs work directly in your browser
# http://10.0.1.50:3000 (Grafana)

Manual Control

easy-db-lab tailscale start
easy-db-lab tailscale status
easy-db-lab tailscale stop

Troubleshooting Tailscale

"requested tags are invalid or not permitted" - Add the tag to your ACL (Step 1).

Can't reach private IPs - Check subnet route is approved in Tailscale admin, or add autoApprovers to your ACL.

Using a custom tag:

easy-db-lab tailscale start --tag tag:my-custom-tag

SOCKS Proxy (Alternative)

If you don't want to set up Tailscale, the SOCKS proxy provides connectivity via an SSH tunnel through the control node.

┌─────────────────┐     SSH Tunnel      ┌──────────────┐
│  Your Machine   │ ──────────────────► │ Control Node │
│  localhost:1080 │                     │  (control0)  │
└────────┬────────┘                     └──────┬───────┘
         │                                     │
    SOCKS5 Proxy                         Private VPC
         │                                     │
         ▼                                     ▼
   kubectl, curl                          VPC network

Quick Start

source env.sh
kubectl get pods
curl http://control0:9428/health

The proxy starts automatically when you load the environment.

Proxied Commands

These commands are automatically configured to use the proxy after source env.sh:

| Command | Description |
|---|---|
| `kubectl` | Kubernetes CLI |
| `k9s` | Kubernetes TUI |
| `curl` | HTTP client |
| `skopeo` | Container image tool |

Manual Proxy Usage

For other commands, use the with-proxy wrapper:

with-proxy wget http://10.0.1.50:8080/api
with-proxy http http://control0:3000/api/health

Browser Access

Configure your browser's SOCKS5 proxy:

| Setting | Value |
|---|---|
| SOCKS Host | localhost |
| SOCKS Port | 1080 |
| SOCKS Version | 5 |

Then access cluster services:

  • Grafana: http://control0:3000
  • Victoria Metrics: http://control0:8428
  • Victoria Logs: http://control0:9428

Proxy Management

start-socks5          # Start proxy
start-socks5 1081     # Start on different port
socks5-status         # Check status
stop-socks5           # Stop proxy

Troubleshooting SOCKS Proxy

"Connection refused" errors:

socks5-status              # Check if running
start-socks5               # Start if needed
ssh control0 hostname      # Verify SSH works

Proxy not working after network change:

stop-socks5
source env.sh

Port already in use:

lsof -i :1080         # Check what's using it
start-socks5 1081     # Use different port

Commands timing out:

  1. Check cluster status: easy-db-lab status
  2. Verify SSH works: ssh control0 hostname
  3. Restart proxy: stop-socks5 && start-socks5

Comparison

| Feature | Tailscale | SOCKS Proxy |
|---|---|---|
| Setup time | ~10 min (one-time) | Instant |
| Persistence | Persistent | Per-session |
| Requires `source env.sh` | No | Yes |
| Browser access | Direct | Requires proxy config |
| Team sharing | Yes | No |
| External dependency | Tailscale account | None |

Shell Aliases

After running source env.sh, you get access to several helpful aliases and functions for managing your cluster.

SSH Aliases

SSH aliases c0 through cN are automatically created for all Cassandra nodes. Each alias wraps ssh, so the ssh command is not required. For example:

c0 nodetool status

This runs nodetool status on the first Cassandra node.

Cluster Management Functions

| Command | Description |
|---|---|
| `c-all` | Executes a command on every node in the cluster sequentially |
| `c-start` | Starts Cassandra on all nodes |
| `c-restart` | Restarts Cassandra on all nodes (not a graceful operation) |
| `c-status` | Executes nodetool status on db0 |
| `c-tpstats` | Executes nodetool tpstats on all nodes |
| `c-collect-artifacts` | Collects metrics, nodetool output, and system information |
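
The cluster functions above are thin shell wrappers over ssh. As a rough sketch of the idea behind c-all (illustrative only; the host list and function body are assumptions, not the actual env.sh contents):

```shell
# Illustrative sketch of a c-all-style helper; not the actual env.sh code.
# Runs the given command on each Cassandra node in order.
c_all_demo() {
  for host in db0 db1 db2; do   # assumes a 3-node cluster
    ssh "$host" "$@"
  done
}

# Usage (requires the cluster's SSH config from env.sh):
# c_all_demo df -h
```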

Examples

Run a command on all nodes

c-all "df -h"

Check cluster status

c-status

Collect artifacts for performance testing

c-collect-artifacts my-test-run

This is useful when doing performance testing to capture the state of the system at a given moment.

Graceful Rolling Restarts

For true rolling restarts, we recommend using cstar instead of c-restart.

Configuring Cassandra

This page covers Cassandra version management and configuration. For a step-by-step walkthrough, see the Tutorial.

Supported Versions

easy-db-lab supports the following Cassandra versions:

| Version | Java | Notes |
|---|---|---|
| 3.0 | 8 | Legacy support |
| 3.11 | 8 | Stable release |
| 4.0 | 11 | First 4.x release |
| 4.1 | 11 | Current LTS |
| 5.0 | 11 | Latest stable (recommended) |
| 5.0-HEAD | 11 | Nightly build from 5.0 branch |
| trunk | 17 | Development branch |

Quick Start

# Select Cassandra 5.0
easy-db-lab cassandra use 5.0

# Generate configuration patch
easy-db-lab cassandra write-config

# Apply configuration and start
easy-db-lab cassandra update-config
easy-db-lab cassandra start

# Verify cluster
ssh db0 nodetool status

Version Management

Select a Version

easy-db-lab cassandra use <version>

Examples:

easy-db-lab cassandra use 5.0       # Latest stable
easy-db-lab cassandra use 4.1       # LTS version
easy-db-lab cassandra use trunk     # Development branch

This command:

  1. Sets the active Cassandra version on all nodes
  2. Downloads current configuration files locally
  3. Applies any existing cassandra.patch.yaml

Specify Java Version

easy-db-lab cassandra use 5.0 --java 11

List Available Versions

easy-db-lab ls

Configuration

The Patch File

Cassandra configuration uses a patch file approach. The cassandra.patch.yaml file contains only the settings you want to customize, which are merged with the default cassandra.yaml.

Generate a new patch file:

easy-db-lab cassandra write-config

Options:

  • -t, --tokens: Number of tokens (default: 4)

Example patch file:

cluster_name: "my-cluster"
num_tokens: 4
concurrent_reads: 64
concurrent_writes: 64
trickle_fsync: true

Auto-Managed Settings — Do Not Include

The following settings are automatically managed by easy-db-lab. Including them in your patch file may cause problems:

  • listen_address, rpc_address — injected with each node's private IP
  • seed_provider / seeds — configured automatically based on cluster topology
  • hints_directory, data_file_directories, commitlog_directory — set based on the cluster's disk configuration

Apply Configuration

easy-db-lab cassandra update-config

Options:

  • --restart, -r: Restart Cassandra after applying
  • --hosts: Filter to specific hosts

Apply and restart in one command:

easy-db-lab cassandra update-config --restart

Download Configuration

Download current configuration files from nodes:

easy-db-lab cassandra download-config

Files are saved to a local directory named after the version (e.g., 5.0/).

Starting and Stopping

# Start on all nodes
easy-db-lab cassandra start

# Stop on all nodes
easy-db-lab cassandra stop

# Restart on all nodes
easy-db-lab cassandra restart

# Target specific hosts
easy-db-lab cassandra start --hosts db0,db1

Cassandra Sidecar

The Apache Cassandra Sidecar is automatically installed and started alongside Cassandra. The sidecar provides:

  • REST API for Cassandra operations
  • S3 import/restore capabilities
  • Streaming data operations
  • Metrics collection (Prometheus-compatible)

Sidecar Access

The sidecar runs on port 9043 on each Cassandra node:

# Check sidecar health
curl http://<cassandra-node-ip>:9043/api/v1/__health

Sidecar Management

The sidecar is managed via systemd:

# Check status
ssh db0 sudo systemctl status cassandra-sidecar

# Restart
ssh db0 sudo systemctl restart cassandra-sidecar

Sidecar Configuration

Configuration is located at /etc/cassandra-sidecar/cassandra-sidecar.yaml on each node. Key settings:

  • Cassandra connection details
  • Data directory paths
  • Traffic shaping and throttling
  • S3 integration settings

Custom Builds

To use a custom Cassandra build from source:

Build from Repository

easy-db-lab cassandra build -n my-build /path/to/cassandra-repo

Use Custom Build

easy-db-lab cassandra use my-build

Next Steps

Continue to the ClickHouse guide to deploy analytics workloads alongside Cassandra.

ClickHouse

easy-db-lab supports deploying ClickHouse clusters on Kubernetes for analytics workloads alongside your Cassandra cluster.

Overview

ClickHouse is deployed as a StatefulSet on K3s with ClickHouse Keeper for distributed coordination. The deployment requires a minimum of 3 nodes.

Quick Start

Create a 6-node cluster and deploy ClickHouse with 2 shards:

# Initialize and start a 6-node cluster
easy-db-lab init my-cluster --db 6 --up

# Deploy ClickHouse (2 shards x 3 replicas)
easy-db-lab clickhouse start

Configuring ClickHouse

Use clickhouse init to configure ClickHouse settings before starting the cluster:

# Configure S3 cache size (default: 10Gi)
easy-db-lab clickhouse init --s3-cache 50Gi

# Disable write-through caching
easy-db-lab clickhouse init --s3-cache-on-write false

| Option | Description | Default |
|---|---|---|
| `--s3-cache` | Size of the local S3 cache | 10Gi |
| `--s3-cache-on-write` | Cache data during write operations | true |
| `--s3-tier-move-factor` | Move data to S3 tier when local disk free space falls below this fraction (0.0-1.0) | 0.2 |
| `--replicas-per-shard` | Number of replicas per shard | 3 |

Configuration is saved to the cluster state and applied when you run clickhouse start.

Starting ClickHouse

To deploy ClickHouse on an existing cluster:

easy-db-lab clickhouse start

Options

| Option | Description | Default |
|---|---|---|
| `--timeout` | Seconds to wait for pods to be ready | 300 |
| `--skip-wait` | Skip waiting for pods to be ready | false |
| `--replicas` | Number of ClickHouse server replicas | Number of db nodes |
| `--replicas-per-shard` | Number of replicas per shard | 3 |

Example with Custom Settings

# 6 nodes with 3 replicas per shard = 2 shards
easy-db-lab clickhouse start --replicas 6 --replicas-per-shard 3

# 9 nodes with 3 replicas per shard = 3 shards
easy-db-lab clickhouse start --replicas 9 --replicas-per-shard 3

Cluster Topology

ClickHouse is deployed with a sharded, replicated architecture. The total number of replicas must be divisible by --replicas-per-shard.

Shard and Replica Assignment

The cluster named easy_db_lab is automatically configured based on your replica count:

| Configuration | Shards | Replicas/Shard | Total Nodes |
|---|---|---|---|
| Default (3 nodes) | 1 | 3 | 3 |
| 6 nodes, 3/shard | 2 | 3 | 6 |
| 9 nodes, 3/shard | 3 | 3 | 9 |
| 6 nodes, 2/shard | 3 | 2 | 6 |
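
The shard counts above are just integer division with the divisibility rule applied first. A quick sketch of the arithmetic (num_shards is an illustrative name, not an easy-db-lab command):

```shell
# Number of shards = replicas / replicas-per-shard;
# replicas must divide evenly, mirroring the constraint above.
num_shards() {
  local replicas=$1 per_shard=$2
  if [ $(( replicas % per_shard )) -ne 0 ]; then
    echo "replicas must be divisible by replicas-per-shard" >&2
    return 1
  fi
  echo $(( replicas / per_shard ))
}

num_shards 6 3   # 2 shards
num_shards 9 3   # 3 shards
```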

Pod-to-Node Pinning

Each ClickHouse pod is pinned to a specific database node using Local PersistentVolumes with node affinity:

  • clickhouse-0 always runs on db0
  • clickhouse-1 always runs on db1
  • clickhouse-N always runs on dbN

This guarantees:

  1. Consistent shard assignment - A pod's shard is calculated from its ordinal: shard = (ordinal / replicas_per_shard) + 1
  2. Data locality - Data stored on a node stays with that node across pod restarts
  3. Predictable performance - No data movement when pods restart

Shard Calculation Example

With 6 replicas and 3 replicas per shard:

| Pod | Ordinal | Shard | Node |
|---|---|---|---|
| clickhouse-0 | 0 | 1 | db0 |
| clickhouse-1 | 1 | 1 | db1 |
| clickhouse-2 | 2 | 1 | db2 |
| clickhouse-3 | 3 | 2 | db3 |
| clickhouse-4 | 4 | 2 | db4 |
| clickhouse-5 | 5 | 2 | db5 |
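The mapping above is just integer arithmetic on the pod ordinal. A short illustrative sketch (not easy-db-lab code) that reproduces the table:

```python
def placement(ordinal: int, replicas_per_shard: int) -> tuple[int, str]:
    """Return (shard, node) for a ClickHouse pod ordinal.

    Shards are 1-based; pod clickhouse-N is pinned to node dbN.
    """
    shard = ordinal // replicas_per_shard + 1
    return shard, f"db{ordinal}"

# 6 replicas, 3 per shard -> shards 1 and 2
for ordinal in range(6):
    shard, node = placement(ordinal, replicas_per_shard=3)
    print(f"clickhouse-{ordinal} -> shard {shard} on {node}")
```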

Checking Status

To check the status of your ClickHouse cluster:

easy-db-lab clickhouse status

This displays:

  • Pod status and health
  • Access URLs for the Play UI and HTTP interface
  • Native protocol connection details

Accessing ClickHouse

After deployment, ClickHouse is accessible via:

| Interface | URL/Port | Description |
|---|---|---|
| Play UI | http://<db-node-ip>:8123/play | Interactive web query interface |
| HTTP API | http://<db-node-ip>:8123 | REST API for queries |
| Native Protocol | <db-node-ip>:9000 | High-performance binary protocol |

Creating Tables

ClickHouse supports distributed, replicated tables that span multiple shards. The recommended pattern uses ReplicatedMergeTree for local replicated storage and Distributed for querying across shards.

Distributed Replicated Tables

Create a local replicated table on all nodes, then a distributed table for queries:

-- Step 1: Create local replicated table on all nodes
CREATE TABLE events_local ON CLUSTER easy_db_lab (
    id UInt64,
    timestamp DateTime,
    event_type String,
    data String
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY (timestamp, id)
SETTINGS storage_policy = 's3_main';

-- Step 2: Create distributed table for querying across all shards
CREATE TABLE events ON CLUSTER easy_db_lab AS events_local
ENGINE = Distributed(easy_db_lab, default, events_local, rand());

Key points:

  • ON CLUSTER easy_db_lab runs the DDL on all nodes
  • {shard} and {replica} are ClickHouse macros automatically set per node
  • ReplicatedMergeTree replicates data within a shard using ClickHouse Keeper
  • Distributed routes queries and inserts across shards
  • rand() distributes inserts randomly; use a column for deterministic sharding
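The difference between rand() and column-based sharding can be sketched in a few lines of Python; crc32 here is a stand-in for ClickHouse's hash functions (e.g. cityHash64), not the real implementation:

```python
import random
import zlib

NUM_SHARDS = 2  # e.g. 6 replicas with 3 per shard

def route_random(_row: dict) -> int:
    # rand(): each insert lands on an arbitrary shard
    return random.randrange(NUM_SHARDS)

def route_by_key(row: dict) -> int:
    # Hashing a column (e.g. cityHash64(id) in ClickHouse) gives
    # deterministic placement: the same id always hits the same shard.
    return zlib.crc32(str(row["id"]).encode()) % NUM_SHARDS

row = {"id": 42}
assert route_by_key(row) == route_by_key(row)  # stable across inserts
assert 0 <= route_random(row) < NUM_SHARDS
```

Deterministic sharding keeps all rows for a key on one shard, which matters if you later want shard-local joins or aggregations by that key.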

Querying and Inserting

-- Insert through distributed table (auto-sharded)
INSERT INTO events VALUES (1, now(), 'click', '{"page": "/home"}');

-- Query across all shards
SELECT count(*) FROM events WHERE event_type = 'click';

-- Query a specific shard (via local table)
SELECT count(*) FROM events_local WHERE event_type = 'click';

Table Engine Comparison

| Engine | Use Case | Replication | Sharding |
|---|---|---|---|
| MergeTree | Single-node, no replication | No | No |
| ReplicatedMergeTree | Replicated within shard | Yes | No |
| Distributed | Query/insert across shards | Via underlying table | Yes |

Storage Policies

ClickHouse is configured with three storage policies. You select a policy when creating a table using the SETTINGS storage_policy clause.

Policy Comparison

| Aspect | local | s3_main | s3_tier |
|---|---|---|---|
| Storage Location | Local NVMe disks | S3 bucket with configurable local cache | Hybrid: starts local, moves to S3 when disk fills |
| Performance | Best latency, highest throughput | Higher latency, cache-dependent | Good initially, degrades as data moves to S3 |
| Capacity | Limited by disk size | Virtually unlimited | Virtually unlimited |
| Cost | Included in instance cost | S3 storage + request costs | S3 storage + request costs |
| Data Persistence | Lost when cluster is destroyed | Persists independently | Persists independently |
| Best For | Benchmarks, low-latency queries | Large datasets, cost-sensitive workloads | Mixed hot/cold workloads with automatic tiering |

Local Storage (local)

The default policy stores data on local NVMe disks attached to the database nodes. This provides the best performance for latency-sensitive workloads.

CREATE TABLE my_table (...)
ENGINE = MergeTree()
ORDER BY id
SETTINGS storage_policy = 'local';

If you omit the storage_policy setting, tables use local storage by default.

When to use local storage:

  • Performance benchmarking where latency matters
  • Temporary or experimental datasets
  • Workloads with predictable data sizes that fit on local disks
  • When you don't need data to persist after cluster teardown

S3 Storage (s3_main)

The S3 policy stores data in your configured S3 bucket with a local cache for frequently accessed data. The cache size defaults to 10Gi and can be configured with clickhouse init --s3-cache. Write-through caching is enabled by default (--s3-cache-on-write true), which caches data during writes so subsequent reads can be served from cache immediately. This is ideal for large datasets where storage cost matters more than latency.

Prerequisite: Your cluster must be initialized with an S3 bucket. Set this during init:

easy-db-lab init my-cluster --s3-bucket my-clickhouse-data

Then create tables with S3 storage:

CREATE TABLE my_table (...)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/default/my_table', '{replica}')
ORDER BY id
SETTINGS storage_policy = 's3_main';

When to use S3 storage:

  • Large analytical datasets (terabytes+)
  • Data that should persist across cluster restarts
  • Cost-sensitive workloads where storage cost > compute cost
  • Sharing data between multiple clusters

How the cache works:

  • Hot (frequently accessed) data is cached locally for fast reads
  • Cold data is fetched from S3 on demand
  • Cache is automatically managed by ClickHouse
  • First query on cold data will be slower; subsequent queries use cache
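The cache behavior above can be modeled as a read-through cache with optional write-through. This is a toy sketch of the idea, not ClickHouse's actual cache implementation:

```python
class S3Cache:
    """Toy model of the local cache in front of S3 (illustration only).

    Reads populate the cache on a miss; with write-through caching
    (--s3-cache-on-write true) writes are cached immediately.
    """

    def __init__(self, cache_on_write: bool = True):
        self.cache: dict[str, bytes] = {}
        self.cache_on_write = cache_on_write
        self.s3_fetches = 0

    def read(self, key: str, s3: dict[str, bytes]) -> bytes:
        if key not in self.cache:        # cold: fetch from S3
            self.s3_fetches += 1
            self.cache[key] = s3[key]
        return self.cache[key]           # hot: served locally

    def write(self, key: str, value: bytes, s3: dict[str, bytes]) -> None:
        s3[key] = value                  # data always lands in S3
        if self.cache_on_write:
            self.cache[key] = value      # readable from cache immediately

s3 = {}
cache = S3Cache()
cache.write("part-1", b"data", s3)
cache.read("part-1", s3)
assert cache.s3_fetches == 0  # write-through: no S3 round trip on first read
```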

S3 Tiered Storage (s3_tier)

The S3 tiered policy provides automatic data movement from local disks to S3 based on disk space availability. This policy starts with local storage and automatically moves data to S3 when local disk space runs low, providing the best of both worlds: fast local performance for hot data and unlimited S3 capacity for cold data.

Prerequisite: Your cluster must be initialized with an S3 bucket. Set this during init:

easy-db-lab init my-cluster --s3-bucket my-clickhouse-data

Configure the tiering behavior before starting ClickHouse:

# Move data to S3 when local disk free space falls below 20% (default)
easy-db-lab clickhouse init --s3-tier-move-factor 0.2

# More aggressive tiering - move when free space < 50%
easy-db-lab clickhouse init --s3-tier-move-factor 0.5

Then create tables with S3 tiered storage:

CREATE TABLE my_table (...)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/default/my_table', '{replica}')
ORDER BY id
SETTINGS storage_policy = 's3_tier';

When to use S3 tiered storage:

  • Workloads with mixed hot/cold data access patterns
  • Growing datasets that may outgrow local disk capacity
  • Automatic cost optimization without manual intervention
  • Local performance for recent data combined with S3 capacity for historical data

How automatic tiering works:

  • New data is written to local disks first (fast writes)
  • When local disk free space falls below the configured threshold (default: 20%), ClickHouse automatically moves the oldest data to S3
  • Data on S3 is still queryable but with higher latency
  • The local cache (configured with --s3-cache) helps performance for frequently accessed S3 data
  • Manual moves are also possible: ALTER TABLE my_table MOVE PARTITION tuple() TO DISK 's3'
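The move-factor rule reduces to a single comparison. An illustrative sketch, assuming free space is checked against the configured fraction:

```python
def should_move_to_s3(free_bytes: int, total_bytes: int,
                      move_factor: float = 0.2) -> bool:
    """Tiering rule sketch: oldest parts move to S3 once the free-space
    fraction drops below move_factor (the --s3-tier-move-factor value)."""
    return free_bytes / total_bytes < move_factor

# 1000 GB disk: 150 GB free is under the default 20% threshold
assert should_move_to_s3(150, 1000)
assert not should_move_to_s3(300, 1000)
assert should_move_to_s3(450, 1000, move_factor=0.5)  # aggressive tiering
```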

Stopping ClickHouse

To remove the ClickHouse cluster:

easy-db-lab clickhouse stop

This removes all ClickHouse pods, services, and associated resources from Kubernetes.

Monitoring

ClickHouse metrics are automatically integrated with the observability stack:

  • Grafana Dashboard: Pre-configured dashboard for ClickHouse metrics
  • Metrics Port: 9363 for Prometheus-compatible metrics
  • Logs Dashboard: Dedicated dashboard for ClickHouse logs

Architecture

The ClickHouse deployment includes:

  • ClickHouse Server: StatefulSet with configurable replicas
  • ClickHouse Keeper: 3-node cluster for distributed coordination (ZooKeeper-compatible)
  • Services: Headless services for internal communication
  • ConfigMaps: Server and Keeper configuration
  • Local PersistentVolumes: One PV per node for data locality

Storage Architecture

ClickHouse uses Local PersistentVolumes to guarantee pod-to-node pinning:

  1. During cluster creation, each db node is labeled with its ordinal (easydblab.com/node-ordinal=0, etc.)
  2. Local PVs are created with node affinity matching these ordinals
  3. PVs are pre-bound to specific PVCs (e.g., data-clickhouse-0 binds to the PV on db0)
  4. The StatefulSet's volumeClaimTemplate requests storage from these pre-bound PVs

This ensures clickhouse-X always runs on dbX, providing:

  • Consistent shard assignments across restarts
  • Data locality (no network storage overhead)
  • Predictable failover behavior

Ports

| Port | Purpose |
|---|---|
| 8123 | HTTP interface |
| 9000 | Native protocol |
| 9009 | Inter-server communication |
| 9363 | Metrics |
| 2181 | Keeper client |
| 9234 | Keeper Raft |

OpenSearch

AWS OpenSearch can be provisioned as a managed domain for full-text search and log analytics.

Commands

| Command | Description |
|---|---|
| opensearch start | Create an OpenSearch domain |
| opensearch status | Check domain status |
| opensearch stop | Delete the OpenSearch domain |

Starting OpenSearch

easy-db-lab opensearch start

This creates an AWS-managed OpenSearch domain linked to your cluster's VPC. The domain takes several minutes to provision.

Checking Status

easy-db-lab opensearch status

Stopping OpenSearch

easy-db-lab opensearch stop

This deletes the OpenSearch domain. Data stored in the domain will be lost.

Spark

easy-db-lab supports provisioning Apache Spark clusters via AWS EMR for analytics workloads.

Enabling Spark

There are two ways to enable Spark:

Option 1: During Init (before up)

Enable Spark during cluster initialization with the --spark.enable flag. The EMR cluster will be created automatically when you run up:

easy-db-lab init --spark.enable
easy-db-lab up

Init Spark Configuration Options

| Option | Description | Default |
|---|---|---|
| --spark.enable | Enable Spark EMR cluster | false |
| --spark.master.instance.type | Master node instance type | m5.xlarge |
| --spark.worker.instance.type | Worker node instance type | m5.xlarge |
| --spark.worker.instance.count | Number of worker nodes | 3 |

Example with Custom Configuration

easy-db-lab init \
  --spark.enable \
  --spark.master.instance.type m5.2xlarge \
  --spark.worker.instance.type m5.4xlarge \
  --spark.worker.instance.count 5

Option 2: After up (standalone spark init)

Add Spark to an existing environment that is already running. This is useful when you forgot to pass --spark.enable during init, or when you decide to add Spark later:

easy-db-lab spark init

Prerequisites: easy-db-lab init and easy-db-lab up must have been run first.

Spark Init Configuration Options

| Option | Description | Default |
|---|---|---|
| --master.instance.type | Master node instance type | m5.xlarge |
| --worker.instance.type | Worker node instance type | m5.xlarge |
| --worker.instance.count | Number of worker nodes | 3 |

Example with Custom Configuration

easy-db-lab spark init \
  --master.instance.type m5.2xlarge \
  --worker.instance.type m5.4xlarge \
  --worker.instance.count 5

Submitting Spark Jobs

Submit JAR-based Spark applications to your EMR cluster:

easy-db-lab spark submit \
  --jar /path/to/your-app.jar \
  --main-class com.example.YourMainClass \
  --conf spark.easydblab.keyspace=my_keyspace \
  --conf spark.easydblab.table=my_table \
  --wait

Submit Options

| Option | Description | Required |
|---|---|---|
| --jar | Path to JAR file (local path or s3:// URI) | Yes |
| --main-class | Main class to execute | Yes |
| --conf | Spark configuration (key=value), can be repeated | No |
| --env | Environment variable (KEY=value), can be repeated | No |
| --args | Arguments for the Spark application | No |
| --wait | Wait for job completion | No |
| --name | Job name (defaults to main class) | No |

When --jar is a local path, it is automatically uploaded to the cluster's S3 bucket before submission. When it is an s3:// URI, it is used directly.
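That resolution step can be sketched as follows; the resolve_jar helper and the jars/ key prefix are hypothetical, shown only to illustrate the documented local-vs-s3:// handling:

```python
def resolve_jar(jar: str, bucket: str) -> str:
    """Return the s3:// URI to hand to EMR for --jar.

    Hypothetical helper mirroring the documented behavior: s3:// URIs
    pass through untouched, local paths are uploaded to the cluster's
    bucket first (the actual upload call is elided here).
    """
    if jar.startswith("s3://"):
        return jar                      # already on S3: use directly
    name = jar.rsplit("/", 1)[-1]       # upload to the bucket would happen here
    return f"s3://{bucket}/jars/{name}"

print(resolve_jar("s3://my-bucket/jars/your-app.jar", "cluster-bucket"))
print(resolve_jar("/path/to/your-app.jar", "cluster-bucket"))
```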

Using a JAR Already on S3

If your JAR is already on S3 (e.g., from a CI pipeline or a previous upload), pass the S3 URI directly:

easy-db-lab spark submit \
  --jar s3://my-bucket/jars/your-app.jar \
  --main-class com.example.YourMainClass \
  --conf spark.easydblab.keyspace=my_keyspace \
  --wait

This skips the upload step entirely, which is useful for large JARs or when resubmitting the same job.

Cancelling a Job

Cancel a running or pending Spark job without terminating the cluster:

easy-db-lab spark stop

Without --step-id, this cancels the most recent job. To cancel a specific job:

easy-db-lab spark stop --step-id <step-id>

The cancellation uses EMR's TERMINATE_PROCESS strategy (SIGKILL). The API is asynchronous — use spark status to confirm the job has been cancelled.

Checking Job Status

View Recent Jobs

List recent Spark jobs on the cluster:

easy-db-lab spark jobs

Options:

  • --limit - Maximum number of jobs to display (default: 10)

Check Specific Job Status

easy-db-lab spark status --step-id <step-id>

Without --step-id, shows the status of the most recent job.

Options:

  • --step-id - EMR step ID to check
  • --logs - Download step logs (stdout, stderr)

Retrieving Logs

Download logs for a Spark job:

easy-db-lab spark logs --step-id <step-id>

Logs are automatically decompressed and include:

  • stdout.gz - Standard output
  • stderr.gz - Standard error
  • controller.gz - EMR controller logs

Architecture

When Spark is enabled, easy-db-lab provisions:

  • EMR Cluster: Managed Spark cluster with master and worker nodes
  • S3 Integration: Logs stored at s3://<bucket>/spark/emr-logs/
  • IAM Roles: Service and job flow roles for EMR operations
  • Observability: Each EMR node runs an OTel Collector (host metrics, OTLP forwarding), OTel Java Agent (auto-instrumentation for logs/metrics/traces), and Pyroscope Java Agent (continuous CPU/allocation/lock profiling). All telemetry flows to the control node's observability stack.

Timeouts and Polling

  • Job Polling Interval: 5 seconds
  • Maximum Wait Time: 4 hours
  • Cluster Creation Timeout: 30 minutes
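The --wait behavior described above amounts to a bounded polling loop; wait_for_step and its get_state callable are hypothetical stand-ins for the EMR API calls the tool makes:

```python
import time

def wait_for_step(get_state, poll_interval: float = 5.0,
                  max_wait: float = 4 * 3600) -> str:
    """Poll a step until it reaches a terminal state or max_wait elapses.

    get_state is a callable returning the current step state string;
    in the real tool this would be an EMR DescribeStep call.
    """
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        state = get_state()
        if state in ("COMPLETED", "FAILED", "CANCELLED"):
            return state
        time.sleep(poll_interval)
    raise TimeoutError("step did not finish within max_wait")

# Simulated state transitions, with a tiny interval for demonstration
states = iter(["PENDING", "RUNNING", "COMPLETED"])
print(wait_for_step(lambda: next(states), poll_interval=0.001))  # prints COMPLETED
```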

Spark with Cassandra

A common use case is running Spark jobs that read from or write to Cassandra. Use the Spark Cassandra Connector:

import com.datastax.spark.connector._

val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "my_table", "keyspace" -> "my_keyspace"))
  .load()

Ensure your JAR includes the Spark Cassandra Connector dependency and configure the Cassandra host in your Spark application.

Spark Modules

All Spark job modules live under the spark/ directory and share unified configuration via spark.easydblab.* properties. You can compare performance across implementations by swapping the JAR and main class while keeping the same --conf flags.

Module Overview

| Module | Gradle Path | Main Class | Description |
|---|---|---|---|
| common | :spark:common | (none) | Shared config, data generation, CQL setup |
| bulk-writer-sidecar | :spark:bulk-writer-sidecar | DirectBulkWriter | Cassandra Analytics, direct sidecar transport |
| bulk-writer-s3 | :spark:bulk-writer-s3 | S3BulkWriter | Cassandra Analytics, S3 staging transport |
| connector-writer | :spark:connector-writer | StandardConnectorWriter | Standard Spark Cassandra Connector |
| connector-read-write | :spark:connector-read-write | KeyValuePrefixCount | Read→transform→write example |

Building

Pre-build Cassandra Analytics (one-time, for bulk-writer modules)

The cassandra-analytics library requires JDK 11 to build:

bin/build-cassandra-analytics

Options:

  • --force - Rebuild even if JARs exist
  • --branch <branch> - Use a specific branch (default: trunk)

Build JARs

# Build all Spark modules
./gradlew :spark:bulk-writer-sidecar:shadowJar :spark:bulk-writer-s3:shadowJar \
  :spark:connector-writer:shadowJar :spark:connector-read-write:shadowJar

# Or build individually
./gradlew :spark:bulk-writer-sidecar:shadowJar
./gradlew :spark:connector-writer:shadowJar

Usage

All modules use the same --conf properties for easy comparison.

Direct Bulk Writer (Sidecar)

easy-db-lab spark submit \
  --jar spark/bulk-writer-sidecar/build/libs/bulk-writer-sidecar-*.jar \
  --main-class com.rustyrazorblade.easydblab.spark.DirectBulkWriter \
  --conf spark.easydblab.contactPoints=host1,host2,host3 \
  --conf spark.easydblab.keyspace=bulk_test \
  --conf spark.easydblab.localDc=us-west-2 \
  --conf spark.easydblab.rowCount=1000000 \
  --wait

S3 Bulk Writer

easy-db-lab spark submit \
  --jar spark/bulk-writer-s3/build/libs/bulk-writer-s3-*.jar \
  --main-class com.rustyrazorblade.easydblab.spark.S3BulkWriter \
  --conf spark.easydblab.contactPoints=host1,host2,host3 \
  --conf spark.easydblab.keyspace=bulk_test \
  --conf spark.easydblab.localDc=us-west-2 \
  --conf spark.easydblab.s3.bucket=my-bucket \
  --conf spark.easydblab.rowCount=1000000 \
  --wait

Standard Connector Writer

easy-db-lab spark submit \
  --jar spark/connector-writer/build/libs/connector-writer-*.jar \
  --main-class com.rustyrazorblade.easydblab.spark.StandardConnectorWriter \
  --conf spark.easydblab.contactPoints=host1,host2,host3 \
  --conf spark.easydblab.keyspace=bulk_test \
  --conf spark.easydblab.localDc=us-west-2 \
  --conf spark.easydblab.rowCount=1000000 \
  --wait

Convenience Script

The bin/spark-bulk-write script handles JAR lookup, host resolution, and health checks:

# From a cluster directory
spark-bulk-write direct --rows 10000
spark-bulk-write s3 --rows 1000000 --parallelism 20
spark-bulk-write connector --keyspace myks --table mytable

Configuration Properties

All modules share these properties via spark.easydblab.*:

| Property | Description | Default |
|---|---|---|
| spark.easydblab.contactPoints | Comma-separated database hosts | Required |
| spark.easydblab.keyspace | Target keyspace | Required |
| spark.easydblab.table | Target table | data_<timestamp> |
| spark.easydblab.localDc | Local datacenter name | Required |
| spark.easydblab.rowCount | Number of rows to write | 1000000 |
| spark.easydblab.parallelism | Spark partitions for generation | 10 |
| spark.easydblab.partitionCount | Cassandra partitions to distribute across | 10000 |
| spark.easydblab.replicationFactor | Keyspace replication factor | 3 |
| spark.easydblab.skipDdl | Skip keyspace/table creation (validates they exist) | false |
| spark.easydblab.compaction | Compaction strategy | (default) |
| spark.easydblab.s3.bucket | S3 bucket (S3 mode only) | Required for S3 |
| spark.easydblab.s3.endpoint | S3 endpoint override | AWS S3 |

Table Schema

The test data generators produce this schema:

CREATE TABLE <keyspace>.<table> (
    partition_id bigint,
    sequence_id bigint,
    course blob,
    marks bigint,
    PRIMARY KEY ((partition_id), sequence_id)
);

Monitoring

Grafana Dashboards

Grafana is deployed automatically as part of the observability stack (k8 apply). It is accessible on port 3000 of the control node.

Cluster Identification

When running multiple environments side by side, Grafana displays the cluster name in several places to help you identify which environment you're looking at:

  • Browser tab - Shows the cluster name instead of "Grafana"
  • Dashboard titles - Each dashboard title is prefixed with the cluster name
  • Sidebar org name - The organization name in the sidebar shows the cluster name
  • Home dashboard - The System Overview dashboard is set as the home page instead of the default Grafana welcome page

System Dashboard

Shows CPU, memory, disk I/O, network I/O, and load average for all cluster nodes via OpenTelemetry metrics.

AWS CloudWatch Overview

A combined dashboard showing S3, EBS, and EC2 metrics via CloudWatch. Available after running easy-db-lab up.

S3 metrics:

  • Throughput: BytesDownloaded, BytesUploaded
  • Request Counts: GetRequests, PutRequests
  • Latency: FirstByteLatency (p99), TotalRequestLatency (p99)

EBS volume metrics:

  • IOPS: VolumeReadOps, VolumeWriteOps (mirrored read/write chart)
  • Throughput: VolumeReadBytes, VolumeWriteBytes (mirrored read/write chart)
  • Queue Length: VolumeQueueLength
  • Burst Balance: BurstBalance (percentage)

EC2 status checks:

  • Status Check Failures: StatusCheckFailed_Instance, StatusCheckFailed_System (red threshold at >= 1)

Use the dropdowns at the top to select S3 bucket, EC2 instances, and EBS volumes.

How it works:

  • S3 request metrics are automatically enabled for the cluster's prefix in the account S3 bucket during easy-db-lab up
  • EBS and EC2 metrics are published automatically by AWS for all instances and volumes
  • Grafana queries CloudWatch using the EC2 instance's IAM role (no credentials needed)
  • During easy-db-lab down, the S3 metrics configuration is automatically removed to stop CloudWatch billing

Note: S3 request metrics take approximately 15 minutes to appear in CloudWatch after being enabled. EBS and EC2 metrics are available immediately.

EMR Overview

Shows Spark/EMR node metrics via OpenTelemetry. Available when an EMR cluster is provisioned. Each EMR node runs an OTel Collector that collects host metrics and receives JVM telemetry from the OTel and Pyroscope Java agents.

Host Metrics:

  • CPU Usage: Per-node CPU utilization percentage
  • Memory Usage: Used and cached memory per node
  • Disk I/O: Read/write throughput per node (mirrored chart)
  • Network I/O: Receive/transmit throughput per node (mirrored chart)
  • Load Average: 1m and 5m load per node
  • Filesystem Usage: Root filesystem utilization percentage

Spark JVM Metrics:

  • JVM Heap Memory: Used and committed heap per node/pool
  • GC Duration Rate: Garbage collection duration rate per collector
  • JVM Threads: Thread count per node
  • JVM Classes Loaded: Class count per node

Use the Hostname dropdown to filter by specific EMR nodes.

OpenSearch Overview

Shows OpenSearch domain metrics via CloudWatch. Available when an OpenSearch domain is provisioned.

Metrics displayed:

  • Cluster Health: ClusterStatus (green/yellow/red), FreeStorageSpace
  • CPU / Memory: CPUUtilization, JVMMemoryPressure
  • Search Performance: SearchLatency (p99), SearchRate
  • Indexing Performance: IndexingLatency (p99), IndexingRate
  • HTTP Responses: 2xx, 3xx, 4xx, 5xx (color-coded)
  • Storage: ClusterUsedSpace

Use the Domain dropdown to select which OpenSearch domain to view.

Cassandra Condensed

A single-pane-of-glass summary of the most important Cassandra metrics, powered by the MAAC (Management API for Apache Cassandra) agent. Shows:

  • Cluster Overview: Nodes up/down, compaction rates, CQL request throughput, dropped messages, connected clients, timeouts, hints, data size, GC time
  • Condensed Metrics: Request throughput, coordinator latency percentiles, memtable space, compaction activity, table-level latency, streaming bandwidth

Requires the MAAC agent to be loaded (Cassandra 4.0, 4.1, or 5.0). Metrics are exposed on port 9000 and scraped by the OTel collector.

Cassandra Overview

A comprehensive deep-dive into Cassandra cluster health, also powered by the MAAC agent. Shows:

  • Request Throughput: Read/write distribution, latency percentiles (P98-P999), error throughput
  • Node Status: Per-node up/down status (polystat panel), node count, status history
  • Data Status: Disk space usage, data size, SSTable count, pending compactions
  • Internals: Thread pool pending/blocked/active tasks, dropped messages, hinted handoff
  • Hardware: CPU, memory, disk I/O, network I/O, load average
  • JVM/GC: Application throughput, GC time, heap utilization

eBPF Observability

The cluster deploys eBPF-based agents on all nodes for deep system observability:

Beyla (L7 Network Metrics)

Grafana Beyla uses eBPF to automatically instrument network traffic and provide RED metrics (Rate, Errors, Duration) for:

  • Cassandra CQL protocol (port 9042) and inter-node communication (port 7000)
  • ClickHouse HTTP (port 8123) and native (port 9000) protocols

Metrics are scraped by the OTel collector and stored in VictoriaMetrics.

ebpf_exporter (Low-Level Metrics)

Cloudflare's ebpf_exporter provides kernel-level metrics via eBPF:

  • TCP retransmits — count of retransmitted TCP segments
  • Block I/O latency — histogram of block device I/O operation latency
  • VFS latency — histogram of filesystem read/write operation latency

These metrics are scraped by the OTel collector and stored in VictoriaMetrics.

See Profiling for continuous profiling with Pyroscope.

Profiling

Continuous profiling is provided by Grafana Pyroscope, deployed automatically as part of the observability stack.

Architecture

Profiling data is collected from multiple sources and sent to the Pyroscope server on the control node (port 4040):

  • Pyroscope Java agent (Cassandra) — Runs as a -javaagent inside the Cassandra JVM. Uses async-profiler to collect CPU, allocation, lock contention, and wall-clock profiles with full method-level resolution.
  • Pyroscope Java agent (Stress jobs) — Runs as a -javaagent inside cassandra-easy-stress K8s Jobs. Collects the same profile types as Cassandra (CPU, allocation, lock). The agent JAR is mounted from the host via a hostPath volume.
  • Pyroscope Java agent (Spark/EMR) — Runs as a -javaagent on Spark driver and executor JVMs. Installed via EMR bootstrap action to /opt/pyroscope/pyroscope.jar. Collects CPU, allocation (512k threshold), and lock (10ms threshold) profiles in JFR format. Profiles appear under service_name=spark-<job-name>.
  • Grafana Alloy eBPF profiler — Runs as a DaemonSet on all nodes via Grafana Alloy. Profiles all processes (Cassandra, ClickHouse, stress jobs) at the system level using eBPF. Provides CPU flame graphs including kernel stack frames.

Accessing Profiles

Profiling Dashboard

A dedicated Profiling dashboard is available in Grafana with flame graph panels for each profile type:

  1. Open Grafana (port 3000)
  2. Navigate to Dashboards and select the Profiling dashboard
  3. Use the Service dropdown to select a service (e.g., cassandra, cassandra-easy-stress, clickhouse-server)
  4. Use the Hostname dropdown to filter by specific nodes
  5. Select a time range to view profiles for that period

The dashboard includes panels for:

  • CPU Flame Graph — CPU time spent in each method
  • Memory Allocation Flame Graph — Heap allocation hotspots
  • Lock Contention Flame Graph — Time spent waiting for monitors
  • Mutex Contention Flame Graph — Mutex delay analysis

Grafana Explore

For ad-hoc profile exploration:

  1. Open Grafana (port 3000) and navigate to Explore
  2. Select the Pyroscope datasource
  3. Choose a profile type (e.g., process_cpu, memory, mutex)
  4. Filter by labels:
    • service_name — process or application name
    • hostname — node hostname
    • cluster — cluster name

Profile Types

Java Agent (Cassandra, Stress Jobs)

| Profile | Description |
|---|---|
| cpu | CPU time spent in each method |
| alloc | Memory allocation by method (objects and bytes) |
| lock | Lock contention: time spent waiting for monitors |
| wall | Wall-clock time, useful for finding I/O bottlenecks (Cassandra only, see below) |

eBPF Agent (All Processes)

| Profile | Description |
|---|---|
| process_cpu | CPU usage by process, including kernel frames |

The eBPF agent profiles all processes on every node, including ClickHouse. Since ClickHouse is written in C++, only CPU profiles are available (no allocation or lock profiles). ClickHouse profiles appear in Pyroscope under the clickhouse-server service name when ClickHouse is running.

Stress Job Profiling

Stress jobs are automatically profiled via the Pyroscope Java agent. No additional configuration is needed — when you start a stress job, the agent is mounted from the host node and configured to send profiles to the Pyroscope server.

Profiles appear under service_name=cassandra-easy-stress with labels for cluster and job_name.

Wall-Clock vs CPU Profiling

By default, the Cassandra Java agent profiles CPU time. You can switch to wall-clock profiling to find I/O bottlenecks and blocking operations.

Warning

Wall-clock and CPU profiling are mutually exclusive — you cannot use both simultaneously.

To enable wall-clock profiling:

  1. SSH to each Cassandra node
  2. Add PYROSCOPE_PROFILER_EVENT=wall to /etc/default/cassandra
  3. Restart Cassandra

To switch back to CPU profiling, either remove the line or set PYROSCOPE_PROFILER_EVENT=cpu.

Configuration

Cassandra Java Agent

The Pyroscope Java agent is configured via JVM system properties in cassandra.in.sh. It activates when the PYROSCOPE_SERVER_ADDRESS environment variable is set (configured by easy-db-lab at cluster startup).

The agent JAR is installed at /usr/local/pyroscope/pyroscope.jar.

| Environment Variable | Set In | Description |
|---|---|---|
| PYROSCOPE_SERVER_ADDRESS | /etc/default/cassandra | Pyroscope server URL (set automatically) |
| CLUSTER_NAME | /etc/default/cassandra | Cluster name for labeling (set automatically) |
| PYROSCOPE_PROFILER_EVENT | /etc/default/cassandra | Profiler event type: cpu (default) or wall |

eBPF Agent

The eBPF profiler runs as a privileged Grafana Alloy DaemonSet (pyroscope-ebpf) and profiles all processes on each node. Configuration is in the pyroscope-ebpf-config ConfigMap (Alloy River format). It uses discovery.process to discover host processes and pyroscope.ebpf to collect CPU profiles.

Pyroscope Server

The Pyroscope server runs on the control node with data stored in S3 (s3://<account-bucket>/clusters/<name>-<id>/pyroscope/). Configuration is in the pyroscope-config ConfigMap.

Data Flow

Cassandra JVM ──(Java agent)──────► Pyroscope Server (:4040)
                                         ▲
Stress Jobs ──(Java agent)──────────────┘
                                         ▲
Spark JVMs ───(Java agent)──────────────┘
                                         ▲
All Processes ──(eBPF agent)────────────┘
                                         │
                                         ▼
                                    S3 storage
                                  Grafana (:3000)
                             Pyroscope datasource
                            + Profiling dashboard

VictoriaMetrics

VictoriaMetrics is a time-series database that stores metrics from all nodes in your easy-db-lab cluster. It receives metrics via OTLP from the OpenTelemetry Collector.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                     All Nodes (DaemonSet)                    │
├─────────────────────────────────────────────────────────────┤
│   System metrics (CPU, memory, disk, network)               │
│   Cassandra metrics (via JMX)                               │
│   Application metrics                                        │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
              ┌────────────────────────┐
              │   OTel Collector       │
              │   (DaemonSet)          │
              └───────────┬────────────┘
                          │
┌─────────────────────────┼─────────────────────────┐
│   Control Node          │                          │
├─────────────────────────┼─────────────────────────┤
│                         ▼                          │
│              ┌──────────────────┐                  │
│              │ Victoria Metrics │                  │
│              │    (:8428)       │                  │
│              └────────┬─────────┘                  │
└───────────────────────┼────────────────────────────┘
                        │
                        ▼
              ┌──────────────────┐
              │     Grafana      │
              │    (:3000)       │
              └──────────────────┘

Configuration

VictoriaMetrics runs on the control node as a Kubernetes deployment:

  • Port: 8428 (HTTP API)
  • Storage: Persistent at /mnt/db1/victoriametrics
  • Retention: 7 days (configurable via -retentionPeriod flag)

Accessing Metrics

Grafana

  1. Access Grafana at http://control0:3000 (via SOCKS proxy)
  2. VictoriaMetrics is pre-configured as the Prometheus datasource
  3. System dashboards show node metrics

Direct API

Query metrics directly using the Prometheus-compatible API:

source env.sh

# Get all metric names
with-proxy curl "http://control0:8428/api/v1/label/__name__/values"

# Query specific metric
with-proxy curl "http://control0:8428/api/v1/query?query=up"

# Query with time range
with-proxy curl "http://control0:8428/api/v1/query_range?query=node_cpu_seconds_total&start=$(date -d '1 hour ago' +%s)&end=$(date +%s)&step=60"

Common Queries

# CPU usage by node
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Disk usage
100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100)

# Network received bytes
rate(node_network_receive_bytes_total[5m])
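To make the rate() arithmetic behind these expressions concrete, here is a small Python sketch using made-up counter samples (illustration only, not tied to a live cluster):

```python
# Sketch of the CPU-usage query above: rate() is the per-second increase of a
# counter over a window; subtracting the idle-time rate from 100% gives busy
# CPU. The sample values below are invented for illustration.

def rate(samples, window_s):
    """Per-second increase of a monotonically increasing counter."""
    return (samples[-1] - samples[0]) / window_s

# node_cpu_seconds_total{mode="idle"} sampled at the start and end of a
# 300-second (5m) window, in CPU-seconds:
idle_counter = [1000.0, 1285.0]  # grew 285 idle CPU-seconds in 300s

idle_fraction = rate(idle_counter, 300)    # ~0.95: CPU was idle 95% of the time
cpu_usage_pct = 100 - idle_fraction * 100  # matches: 100 - rate(...) * 100

print(round(cpu_usage_pct, 1))  # 5.0
```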

Backup

Backup Victoria Metrics data to S3:

# Backup to cluster's default S3 bucket
easy-db-lab metrics backup

# Backup to a custom S3 location
easy-db-lab metrics backup --dest s3://my-backup-bucket/victoriametrics

By default, backups are stored at: s3://{cluster-bucket}/victoriametrics/{timestamp}/

Use --dest to override the destination bucket and path.

Features

  • Uses native vmbackup tool with snapshot support
  • Non-disruptive; metrics collection continues during backup
  • Direct S3 upload (no intermediate storage needed)
  • Incremental backup support for faster subsequent backups

Listing Backups

List available VictoriaMetrics backups in S3:

easy-db-lab metrics ls

This displays a summary table of all backups grouped by timestamp, showing the number of files and total size for each.

Importing Metrics to an External Instance

Stream metrics from the running cluster's VictoriaMetrics to an external VictoriaMetrics instance via the native export/import API:

# Import all metrics
easy-db-lab metrics import --target http://victoria:8428

# Import only specific metrics
easy-db-lab metrics import --target http://victoria:8428 --match '{job="cassandra"}'

This is useful for exporting metrics at the end of test runs when running easy-db-lab from a Docker container. Unlike binary backups, this approach streams data via HTTP and can target any reachable VictoriaMetrics instance.

Options

Option     Description                             Default
--target   Target VictoriaMetrics URL (required)   -
--match    Metric selector for filtering           All metrics

Troubleshooting

No metrics appearing

  1. Verify Victoria Metrics pod is running:

    kubectl get pods -l app.kubernetes.io/name=victoriametrics
    kubectl logs -l app.kubernetes.io/name=victoriametrics
    
  2. Check OTel Collector is forwarding metrics:

    kubectl get pods -l app=otel-collector
    kubectl logs -l app=otel-collector
    
  3. Verify the cluster-config ConfigMap exists:

    kubectl get configmap cluster-config -o yaml
    

Connection errors

If you see connection errors when querying metrics:

  1. Ensure the cluster is running: easy-db-lab status
  2. The proxy is started automatically when needed
  3. Check that control node is accessible: ssh control0 hostname

High memory usage

Victoria Metrics is configured with memory limits. If you see OOM kills:

  1. Check current memory usage:

    kubectl top pod -l app.kubernetes.io/name=victoriametrics
    
  2. Consider adjusting the memory limits in the deployment manifest

Backup failures

If backup fails:

  1. Check the backup job logs:

    kubectl logs -l app.kubernetes.io/name=victoriametrics-backup
    
  2. Verify S3 bucket permissions (IAM role should have S3 access)

  3. Ensure there's sufficient disk space on the control node

Victoria Logs

Victoria Logs is a centralized log aggregation system that collects logs from all nodes in your easy-db-lab cluster. It provides a unified way to search and analyze logs from Cassandra, ClickHouse, and system services.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                     All Nodes (DaemonSet)                    │
├─────────────────────────────────────────────────────────────┤
│   /var/log/*              journald                          │
│   /mnt/db1/cassandra/logs/*.log                             │
│   /mnt/db1/clickhouse/logs/*.log                            │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
              ┌────────────────────────┐
              │  OTel Collector        │
              │  (DaemonSet)           │
              │  filelog + journald    │
              └───────────┬────────────┘
                          │
┌─────────────────────────┼─────────────────────────┐
│   Control Node          │                          │
├─────────────────────────┼─────────────────────────┤
│                         ▼                          │
│              ┌──────────────────┐                  │
│              │  Victoria Logs   │                  │
│              │    (:9428)       │                  │
│              └────────┬─────────┘                  │
└───────────────────────┼────────────────────────────┘
                        │
                        ▼
              ┌──────────────────┐
              │  easy-db-lab     │
              │  logs query      │
              └──────────────────┘

Components

Victoria Logs Server

Victoria Logs runs on the control node as a Kubernetes deployment:

  • Port: 9428 (HTTP API)
  • Storage: Local ephemeral storage
  • Retention: 7 days (configurable)
  • Location: Control node only (node-role.kubernetes.io/control-plane)

OTel Collector

The OpenTelemetry Collector collects logs from all sources and forwards them to Victoria Logs.

The OTel Collector runs as a DaemonSet on every node (Cassandra, stress, control) to collect:

Source             Path                                    Description
Cassandra          /mnt/db1/cassandra/logs/*.log           Cassandra database logs
ClickHouse         /mnt/db1/clickhouse/logs/*.log          ClickHouse server logs
ClickHouse Keeper  /mnt/db1/clickhouse/keeper/logs/*.log   ClickHouse Keeper logs
System logs        /var/log/**/*.log                       General system logs
journald           cassandra, docker, k3s, sshd            systemd service logs

Log Sources

Each log entry is tagged with a source field:

Source      Description              Additional Fields
cassandra   Cassandra database logs  host
clickhouse  ClickHouse server logs   host, component (server/keeper)
systemd     systemd journal logs     host, unit
system      General /var/log files   host

Querying Logs

Using the CLI

The easy-db-lab logs query command provides a unified interface:

# Query all logs from the last hour
easy-db-lab logs query

# Filter by source
easy-db-lab logs query --source cassandra
easy-db-lab logs query --source clickhouse
easy-db-lab logs query --source systemd

# Filter by host
easy-db-lab logs query --source cassandra --host db0

# Filter by systemd unit
easy-db-lab logs query --source systemd --unit docker.service

# Search for text
easy-db-lab logs query --grep "OutOfMemory"
easy-db-lab logs query --grep "ERROR"

# Time range and limit
easy-db-lab logs query --since 30m --limit 500
easy-db-lab logs query --since 1d

# Raw LogsQL query
easy-db-lab logs query -q 'source:cassandra AND host:db0'

Query Options

Option        Description                            Default
--source, -s  Log source filter                      All sources
--host, -H    Hostname filter (db0, app0, control0)  All hosts
--unit        systemd unit name                      All units
--since       Time range (1h, 30m, 1d)               1h
--limit, -n   Max entries to return                  100
--grep, -g    Text search filter                     None
--query, -q   Raw LogsQL query                       None
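The --since durations (30m, 1h, 1d) follow a simple number-plus-unit format; a minimal parser sketch, assuming only s/m/h/d suffixes (illustrative, not the CLI's actual parser):

```python
def parse_since(value):
    """Parse a duration like '30m', '1h', or '1d' into seconds (sketch)."""
    units = {"s": 1, "m": 60, "h": 3600, "d": 86400}
    number, suffix = value[:-1], value[-1]
    return int(number) * units[suffix]

print(parse_since("30m"))  # 1800
print(parse_since("1h"))   # 3600
print(parse_since("1d"))   # 86400
```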

Using the HTTP API

Victoria Logs exposes a REST API on port 9428. Access it through the SOCKS proxy:

source env.sh
with-proxy curl "http://control0:9428/select/logsql/query?query=source:cassandra&time=1h&limit=100"
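When building such URLs programmatically, the LogsQL query string should be percent-encoded. A small Python sketch (parameter names taken from the curl example above):

```python
from urllib.parse import urlencode

# Build the same request URL as the curl example, with proper encoding of
# spaces and colons in the LogsQL expression.
base = "http://control0:9428/select/logsql/query"
params = {"query": "source:cassandra AND host:db0", "time": "1h", "limit": 100}
url = f"{base}?{urlencode(params)}"
print(url)
```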

Using Grafana

Victoria Logs is configured as a datasource in Grafana. You can use it in two ways:

Log Investigation Dashboard

The Log Investigation dashboard is designed for interactive log analysis during investigations. Access it at Grafana → Dashboards → Log Investigation.

Filter variables (dropdowns at the top):

Filter     Options                                          Description
Node Role  All, db, app, control                            Filter by server type
Source     All, cassandra, clickhouse, system, tool-runner  Filter by log source
Level      All, Error, Warning, Info, Debug                 Filter by log severity
Search     (text input)                                     Free-text search across log messages
Filters    (ad-hoc)                                         Add arbitrary field:value filters (e.g., host = db0)

Panels:

  • Log Volume — time-series bar chart showing log count over time, broken down by source. Helps identify spikes and anomalies at a glance.
  • Logs — scrollable log viewer with timestamps, source labels, and expandable log details. Click any log entry to see all available fields.

Tips:

  • Use the ad-hoc Filters variable to filter by host, unit, component, or any other field without needing a dedicated dropdown.
  • The dashboard auto-refreshes every 10 seconds by default. Adjust or disable via the refresh picker in the top-right corner.
  • Combine multiple filters to narrow down — e.g., set Node Role to db, Source to cassandra, Level to Error to see only Cassandra errors on database nodes.
  • To search for exec job logs, set Source to tool-runner and use the Search box for the job name.

Explore Mode

For ad-hoc queries beyond what the dashboard provides:

  1. Access Grafana at http://control0:3000 (via SOCKS proxy)
  2. Navigate to Explore
  3. Select "VictoriaLogs" datasource
  4. Use LogsQL syntax for queries

LogsQL Query Syntax

Victoria Logs uses LogsQL for querying. Basic syntax:

# Simple field match
source:cassandra

# Multiple conditions (AND)
source:cassandra AND host:db0

# Text search
"OutOfMemory"

# Combine field match with text search
source:cassandra AND "Exception"

# Time filter (in addition to --since)
_time:1h

For full LogsQL documentation, see the Victoria Logs documentation.
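The CLI filter flags map naturally onto these LogsQL fragments. A rough Python sketch of that mapping (illustrative only; the actual CLI implementation may differ):

```python
def build_logsql(source=None, host=None, unit=None, grep=None):
    """Compose a LogsQL query from CLI-style filters (illustrative sketch)."""
    parts = []
    if source:
        parts.append(f"source:{source}")
    if host:
        parts.append(f"host:{host}")
    if unit:
        parts.append(f"unit:{unit}")
    if grep:
        parts.append(f'"{grep}"')         # bare quoted string = text search
    return " AND ".join(parts) or "*"     # '*' matches everything

print(build_logsql(source="cassandra", host="db0"))  # source:cassandra AND host:db0
print(build_logsql(grep="OutOfMemory"))              # "OutOfMemory"
print(build_logsql())                                # *
```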

Deployment

Victoria Logs and the OTel Collector are automatically deployed when you run:

easy-db-lab k8 apply

This deploys:

  • Victoria Logs server on the control node
  • OTel Collector DaemonSet on all nodes
  • Grafana datasource configuration

Verifying the Setup

Check that all components are running:

source env.sh
kubectl get pods -l app.kubernetes.io/name=victorialogs
kubectl get pods -l app.kubernetes.io/name=otel-collector

Test connectivity:

# Check Victoria Logs health
with-proxy curl http://control0:9428/health

# Query recent logs
easy-db-lab logs query --limit 10

Troubleshooting

No logs appearing

  1. Verify OTel Collector pods are running:

    kubectl get pods -l app.kubernetes.io/name=otel-collector
    kubectl logs -l app.kubernetes.io/name=otel-collector
    
  2. Check Victoria Logs is healthy:

    with-proxy curl http://control0:9428/health
    
  3. Verify the cluster-config ConfigMap exists:

    kubectl get configmap cluster-config -o yaml
    

Connection errors

The logs query command uses the internal SOCKS5 proxy to connect to Victoria Logs. If you see connection errors:

  1. Ensure the cluster is running: easy-db-lab status
  2. The proxy is started automatically when needed
  3. Check that control node is accessible: ssh control0 hostname

Listing Backups

List available VictoriaLogs backups in S3:

easy-db-lab logs ls

This displays a summary table of all backups grouped by timestamp, showing the number of files and total size for each.

Importing Logs to an External Instance

Stream logs from the running cluster's VictoriaLogs to an external VictoriaLogs instance via the jsonline API:

# Import all logs
easy-db-lab logs import --target http://victorialogs:9428

# Import only specific logs
easy-db-lab logs import --target http://victorialogs:9428 --query 'source:cassandra'

This is useful for exporting logs at the end of test runs when running easy-db-lab from a Docker container. Unlike binary backups, this approach streams data via HTTP and can target any reachable VictoriaLogs instance.

Options

Option    Description                         Default
--target  Target VictoriaLogs URL (required)  -
--query   LogsQL query for filtering          All logs (*)

Backup

Victoria Logs data can be backed up to S3 for disaster recovery using consistent snapshots.

Creating a Backup

# Backup to cluster's default S3 bucket
easy-db-lab logs backup

# Backup to a custom S3 location
easy-db-lab logs backup --dest s3://my-backup-bucket/victorialogs

By default, backups are stored at: s3://{cluster-bucket}/victorialogs/{timestamp}/

Use --dest to override the destination bucket and path.

How It Works

The backup uses VictoriaLogs' snapshot API to create consistent, point-in-time copies:

  1. Create snapshots — calls the VictoriaLogs snapshot API to create read-only snapshots of all active log partitions
  2. Sync to S3 — uploads each snapshot directory to S3 using aws s3 sync
  3. Cleanup — deletes the snapshots from disk to free space (runs even if the sync step fails)

Using snapshots ensures data consistency, since VictoriaLogs may be actively writing to its data directory during the backup.
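The cleanup-always-runs guarantee is the classic try/finally pattern. A Python sketch with stand-in functions for the three steps (no real snapshot or S3 calls):

```python
def backup_with_cleanup(create_snapshot, sync_to_s3, delete_snapshot):
    """Snapshot -> sync -> cleanup, guaranteeing cleanup even if sync fails."""
    snapshot = create_snapshot()
    try:
        sync_to_s3(snapshot)           # may raise (network error, bad IAM, ...)
    finally:
        delete_snapshot(snapshot)      # always runs, so disk space is reclaimed

# Demonstration with a failing sync step (control flow only):
events = []

def failing_sync(snapshot):
    raise IOError("S3 unreachable")

try:
    backup_with_cleanup(
        create_snapshot=lambda: events.append("snapshot") or "snap-001",
        sync_to_s3=failing_sync,
        delete_snapshot=lambda s: events.append("cleanup"),
    )
except IOError:
    pass

print(events)  # ['snapshot', 'cleanup']  (cleanup ran despite the failure)
```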

What Gets Backed Up

  • All log partitions (organized by date)
  • Complete log history up to retention period (7 days default)

Notes

  • The process is non-disruptive; log ingestion continues during backup
  • Snapshot cleanup always runs, even if the S3 upload fails, to avoid filling disk
  • Persistent storage at /mnt/db1/victorialogs ensures logs survive pod restarts

Server

easy-db-lab includes a server mode that provides AI assistant integration via MCP (Model Context Protocol), REST status endpoints, and live metrics streaming. This enables Claude to directly interact with your clusters, and provides programmatic access to cluster status.

The server exposes tools for all supported databases — Cassandra, ClickHouse, OpenSearch, and Spark — as well as cluster lifecycle management and observability.

Starting the Server

To start the server, run:

easy-db-lab server

By default, the server picks an available port. To specify a port:

easy-db-lab server --port 8888

The server automatically generates a .mcp.json configuration file in the current directory with the connection details.

Adding to Claude Code

Once the server is running, start Claude Code from the same directory:

claude

Claude Code automatically detects and uses the .mcp.json file generated by the server.

Available Tools

The server exposes commands annotated with @McpCommand as MCP tools to Claude. Tool names use underscores and are derived from the command's package namespace.

Cluster Lifecycle

Tool Name       Description
init            Initialize a directory for easy-db-lab
up              Provision AWS infrastructure
cassandra_down  Shut down AWS infrastructure
clean           Clean up generated files
status          Display full environment status
hosts           List all hosts in the cluster
ip              Get IP address for a host by alias

Cassandra Management

Tool Name                Description
cassandra_use            Select a Cassandra version
cassandra_list           List available Cassandra versions
cassandra_start          Start Cassandra on all nodes
cassandra_restart        Restart Cassandra on all nodes
cassandra_update_config  Apply configuration patch to nodes

Cassandra Stress Testing

Tool Name                Description
cassandra_stress_start   Start a stress job on K8s
cassandra_stress_stop    Stop and delete stress jobs
cassandra_stress_status  Check status of stress jobs
cassandra_stress_logs    View logs from stress jobs
cassandra_stress_list    List available workloads
cassandra_stress_fields  List available field generators
cassandra_stress_info    Show workload information

ClickHouse

Tool Name          Description
clickhouse_start   Deploy ClickHouse cluster to K8s
clickhouse_stop    Remove ClickHouse cluster
clickhouse_status  Check ClickHouse cluster status

OpenSearch

Tool Name          Description
opensearch_start   Create AWS OpenSearch domain
opensearch_stop    Delete OpenSearch domain
opensearch_status  Check OpenSearch domain status

Spark

Tool Name     Description
spark_submit  Submit Spark job to EMR cluster
spark_status  Check status of a Spark job
spark_jobs    List recent Spark jobs
spark_logs    Download EMR logs from S3

Kubernetes

Tool Name  Description
k8_apply   Apply observability stack to K8s

Utilities

Tool Name   Description
prune_amis  Prune older private AMIs

Tool Naming Convention

MCP tool names are derived from the command's package location:

  • Top-level commands: status, hosts, ip, clean, init, up
  • Cassandra commands: cassandra_ prefix (e.g., cassandra_start, cassandra_use)
  • Nested commands: cassandra_stress_ prefix (e.g., cassandra_stress_start)
  • Hyphens become underscores: update-config → cassandra_update_config
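A Python sketch of these derivation rules (illustrative only; the names here are examples, not the real implementation):

```python
def mcp_tool_name(package_prefix, command):
    """Join the package namespace to the command name and turn hyphens
    into underscores (sketch of the naming rules listed above)."""
    parts = [p for p in (package_prefix, command) if p]
    return "_".join(parts).replace("-", "_")

print(mcp_tool_name("", "status"))                  # status
print(mcp_tool_name("cassandra", "use"))            # cassandra_use
print(mcp_tool_name("cassandra_stress", "start"))   # cassandra_stress_start
print(mcp_tool_name("cassandra", "update-config"))  # cassandra_update_config
```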

Benefits of Server Integration

Benefit                 Description
Direct Control          Claude executes easy-db-lab commands directly without manual intervention
Context Awareness       Claude maintains context about your cluster state and configuration
Automation              Complex multi-step operations can be automated through Claude
Intelligent Assistance  Claude can analyze logs, metrics, and provide optimization recommendations

Example Workflow

  1. Start the server in one terminal:

    easy-db-lab server
    
  2. In another terminal, start Claude Code from the same directory:

    claude
    

    Claude Code automatically detects the .mcp.json file generated by the server.

  3. Ask Claude to help manage your cluster:

    • "Initialize a new 5-node cluster with i4i.xlarge instances"
    • "Check the status of all nodes"
    • "Select Cassandra version 5.0 and start it"
    • "Start a KeyValue stress test for 1 hour"
    • "Deploy ClickHouse and check its status"
    • "Create an OpenSearch domain and monitor its progress"
    • "Submit a Spark job to the EMR cluster"

Live Metrics Streaming

When Redis is configured via the EASY_DB_LAB_REDIS_URL environment variable, the server publishes live cluster metrics to the Redis pub/sub channel every 5 seconds. Metrics are queried from VictoriaMetrics using the same PromQL expressions as the Grafana dashboards.

Enabling

export EASY_DB_LAB_REDIS_URL=redis://localhost:6379/easydblab-events
easy-db-lab server

Metrics events are published to the same channel as command events. Consumers filter by the event.type field.

Event Types

Only metrics for running services are published. If the cluster is running ClickHouse instead of Cassandra, no Cassandra metrics events are emitted.

Metrics.System

Published every 5 seconds with per-node CPU, memory, disk I/O, and filesystem metrics:

{
  "timestamp": "2026-03-08T14:22:05.123Z",
  "commandName": "server",
  "event": {
    "type": "Metrics.System",
    "nodes": {
      "db-0": {
        "cpuUsagePct": 34.2,
        "memoryUsedBytes": 17179869184,
        "diskReadBytesPerSec": 52428800.0,
        "diskWriteBytesPerSec": 104857600.0,
        "filesystemUsedPct": 45.2
      },
      "db-1": {
        "cpuUsagePct": 28.7,
        "memoryUsedBytes": 16106127360,
        "diskReadBytesPerSec": 41943040.0,
        "diskWriteBytesPerSec": 83886080.0,
        "filesystemUsedPct": 42.8
      }
    }
  }
}

Metrics.Cassandra

Published every 5 seconds when the cluster is running Cassandra:

{
  "timestamp": "2026-03-08T14:22:05.187Z",
  "commandName": "server",
  "event": {
    "type": "Metrics.Cassandra",
    "readP99Ms": 1.247,
    "writeP99Ms": 0.832,
    "readOpsPerSec": 15234.5,
    "writeOpsPerSec": 12087.3,
    "compactionPending": 3,
    "compactionCompletedPerSec": 1.5,
    "compactionBytesWrittenPerSec": 52428800.0
  }
}
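A consumer dispatching on the event.type field might look like this sketch (plain JSON handling; the Redis subscription wiring is omitted):

```python
import json

def handle_event(raw):
    """Dispatch a published message by its event.type field (sketch)."""
    msg = json.loads(raw)
    etype = msg["event"]["type"]
    if etype == "Metrics.System":
        # Map node name -> CPU usage, from the per-node metrics payload
        return {node: m["cpuUsagePct"] for node, m in msg["event"]["nodes"].items()}
    if etype == "Metrics.Cassandra":
        return msg["event"]["readP99Ms"]
    return None  # command events etc. are ignored in this sketch

# Minimal sample message in the Metrics.System shape shown above:
sample = json.dumps({
    "timestamp": "2026-03-08T14:22:05.123Z",
    "commandName": "server",
    "event": {"type": "Metrics.System",
              "nodes": {"db-0": {"cpuUsagePct": 34.2}}},
})
print(handle_event(sample))  # {'db-0': 34.2}
```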

Field Reference

System — per node:

Field                 Type    Description
cpuUsagePct           double  CPU usage percentage (0-100)
memoryUsedBytes       long    Memory used in bytes
diskReadBytesPerSec   double  Disk read throughput (bytes/sec)
diskWriteBytesPerSec  double  Disk write throughput (bytes/sec)
filesystemUsedPct     double  Filesystem usage percentage (0-100)

Cassandra — cluster-wide:

Field                         Type    Description
readP99Ms                     double  Read latency p99 in milliseconds
writeP99Ms                    double  Write latency p99 in milliseconds
readOpsPerSec                 double  Read operations per second
writeOpsPerSec                double  Write operations per second
compactionPending             long    Number of pending compactions
compactionCompletedPerSec     double  Compactions completed per second
compactionBytesWrittenPerSec  double  Compaction write throughput (bytes/sec)

Notes

  • The server requires Docker to be installed
  • Your AWS profile must be configured (easy-db-lab setup-profile)
  • The server runs in the foreground and logs to stdout
  • Use Ctrl+C to stop the server

Command Reference

Complete reference for all easy-db-lab commands.

Global Options

Option      Description
--help, -h  Shows help information
--vpc-id    Reconstruct state from existing VPC (requires ClusterId tag)
--force     Force state reconstruction even if state.json exists

Setup Commands

setup-profile

Set up user profile interactively.

easy-db-lab setup-profile

Aliases: setup

Guides you through:

  • Email and AWS credentials collection
  • AWS credential validation
  • Key pair generation
  • IAM role creation
  • Packer VPC infrastructure setup
  • AMI validation/building

show-iam-policies

Display IAM policies with your account ID populated.

easy-db-lab show-iam-policies [policy-name]

Aliases: sip

Argument     Description
policy-name  Optional filter: ec2, iam, or emr

build-image

Build both base and Cassandra AMI images.

easy-db-lab build-image [options]

Option    Description                      Default
--arch    CPU architecture (AMD64, ARM64)  AMD64
--region  AWS region                       (from profile)

Cluster Lifecycle Commands

init

Initialize a directory for easy-db-lab.

easy-db-lab init [cluster-name] [options]

Option                  Description                                 Default
--db, --cassandra, -c   Number of Cassandra instances               3
--app, --stress, -s     Number of stress instances                  0
--instance, -i          Cassandra instance type                     r3.2xlarge
--stress-instance, -si  Stress instance type                        c7i.2xlarge
--azs, -z               Availability zones (e.g., a,b,c)            all
--arch, -a              CPU architecture (AMD64, ARM64)             AMD64
--ebs.type              EBS volume type (NONE, gp2, gp3, io1, io2)  NONE
--ebs.size              EBS volume size in GB                       256
--ebs.iops              EBS IOPS (gp3 only)                         0
--ebs.throughput        EBS throughput (gp3 only)                   0
--ebs.optimized         Enable EBS optimization                     false
--until                 When instances can be deleted               tomorrow
--ami                   Override AMI ID                             (auto-detected)
--open                  Unrestricted SSH access                     false
--tag                   Custom tags (key=value, repeatable)         -
--vpc                   Use existing VPC ID                         -
--up                    Auto-provision after init                   false
--clean                 Remove existing config first                false

up

Provision AWS infrastructure.

easy-db-lab up [options]

Option          Description
--no-setup, -n  Skip K3s setup and AxonOps configuration

Creates: VPC, EC2 instances, K3s cluster. Configures the account S3 bucket for this cluster.

down

Shut down AWS infrastructure.

easy-db-lab down [vpc-id] [options]

Argument  Description
vpc-id    Optional: specific VPC to tear down

Option              Description
--all               Tear down all VPCs tagged with easy_cass_lab
--packer            Tear down the packer infrastructure VPC
--retention-days N  Days to retain S3 data after teardown (default: 1)

clean

Clean up generated files from the current directory.

easy-db-lab clean

hosts

List all hosts in the cluster.

easy-db-lab hosts

status

Display full environment status.

easy-db-lab status

Cassandra Commands

All Cassandra commands are available under the cassandra subcommand group.

cassandra use

Select a Cassandra version.

easy-db-lab cassandra use <version> [options]

Option   Description
--java   Java version to use
--hosts  Filter to specific hosts

Versions: 3.0, 3.11, 4.0, 4.1, 5.0, 5.0-HEAD, trunk

cassandra write-config

Generate a new configuration patch file.

easy-db-lab cassandra write-config [filename] [options]

Aliases: wc

Option        Description       Default
-t, --tokens  Number of tokens  4

cassandra update-config

Apply configuration patch to all nodes.

easy-db-lab cassandra update-config [options]

Aliases: uc

Option         Description
--restart, -r  Restart Cassandra after applying
--hosts        Filter to specific hosts

cassandra download-config

Download configuration files from nodes.

easy-db-lab cassandra download-config [options]

Aliases: dc

Option     Description
--version  Version to download config for

cassandra start

Start Cassandra on all nodes.

easy-db-lab cassandra start [options]

Option   Description                     Default
--sleep  Time between starts in seconds  120
--hosts  Filter to specific hosts        -

cassandra stop

Stop Cassandra on all nodes.

easy-db-lab cassandra stop [options]

Option   Description
--hosts  Filter to specific hosts

cassandra restart

Restart Cassandra on all nodes.

easy-db-lab cassandra restart [options]

Option   Description
--hosts  Filter to specific hosts

cassandra list

List available Cassandra versions.

easy-db-lab cassandra list

Aliases: ls


Cassandra Stress Commands

Stress testing commands under cassandra stress.

cassandra stress start

Start a stress job on Kubernetes.

easy-db-lab cassandra stress start [options]

Aliases: run

cassandra stress stop

Stop and delete stress jobs.

easy-db-lab cassandra stress stop [options]

cassandra stress status

Check status of stress jobs.

easy-db-lab cassandra stress status

cassandra stress logs

View logs from stress jobs.

easy-db-lab cassandra stress logs [options]

cassandra stress list

List available workloads.

easy-db-lab cassandra stress list

cassandra stress fields

List available field generators.

easy-db-lab cassandra stress fields

cassandra stress info

Show information about a workload.

easy-db-lab cassandra stress info <workload>

Utility Commands

exec

Execute commands on remote hosts via systemd-run. Tool output is captured by the systemd journal and shipped to VictoriaLogs via a dedicated journald OTel collector, with accurate timestamps for cross-service log correlation.

exec run

Run a command on remote hosts (foreground by default).

# Foreground (blocks until complete, shows output)
easy-db-lab exec run -t cassandra -- ls /mnt/db1

# Background (returns immediately, tool keeps running)
easy-db-lab exec run --bg -t cassandra -- inotifywait -m /mnt/db1/data

# Background with custom name
easy-db-lab exec run --bg --name watch-imports -t cassandra -- inotifywait -m /mnt/db1/data

Option      Description
-t, --type  Server type: cassandra, stress, control (default: cassandra)
--bg        Run in background (returns immediately)
--name      Name for the systemd unit (auto-derived if not provided)
--hosts     Filter to specific hosts
-p          Execute in parallel across hosts

exec list

List running background tools on remote hosts.

easy-db-lab exec list
easy-db-lab exec list -t cassandra

exec stop

Stop a named background tool.

easy-db-lab exec stop watch-imports
easy-db-lab exec stop watch-imports -t cassandra

ip

Get IP address for a host by alias.

easy-db-lab ip <alias>

version

Display the easy-db-lab version.

easy-db-lab version

repl

Start interactive REPL.

easy-db-lab repl

server

Start the server for Claude Code integration, REST status endpoints, and live metrics.

easy-db-lab server

See Server for details.


Kubernetes Commands

k8 apply

Apply observability stack to K8s cluster.

easy-db-lab k8 apply

Dashboard Commands

dashboards generate

Extract all Grafana dashboard manifests (core and ClickHouse) from JAR resources to the local k8s/ directory. Useful for rapid dashboard iteration without re-running init.

easy-db-lab dashboards generate

dashboards upload

Apply all Grafana dashboard manifests and the datasource ConfigMap to the K8s cluster. Extracts dashboards, creates the grafana-datasources ConfigMap with runtime configuration, and applies everything.

easy-db-lab dashboards upload

ClickHouse Commands

clickhouse start

Deploy ClickHouse cluster to K8s.

easy-db-lab clickhouse start [options]

clickhouse stop

Stop and remove ClickHouse cluster.

easy-db-lab clickhouse stop

clickhouse status

Check ClickHouse cluster status.

easy-db-lab clickhouse status

Spark Commands

spark submit

Submit Spark job to EMR cluster.

easy-db-lab spark submit [options]

spark status

Check status of a Spark job.

easy-db-lab spark status [options]

spark jobs

List recent Spark jobs on the cluster.

easy-db-lab spark jobs

spark logs

Download EMR logs from S3.

easy-db-lab spark logs [options]

OpenSearch Commands

opensearch start

Create an AWS OpenSearch domain.

easy-db-lab opensearch start [options]

opensearch stop

Delete the OpenSearch domain.

easy-db-lab opensearch stop

opensearch status

Check OpenSearch domain status.

easy-db-lab opensearch status

AWS Commands

aws vpcs

List all easy-db-lab VPCs.

easy-db-lab aws vpcs

Port Reference

This page documents the ports used by easy-db-lab and the services it provisions.

Cassandra Ports

Port  Purpose
9042  Cassandra Native Protocol (CQL)
7000  Inter-node communication
7001  Inter-node communication (SSL)
7199  JMX monitoring

Observability Ports (Control Node)

Port  Service
3000  Grafana
4040  Pyroscope (continuous profiling)
8428  VictoriaMetrics (metrics storage)
9428  VictoriaLogs (log storage)
3200  Tempo (trace storage)
5001  YACE CloudWatch exporter (Prometheus)

Cassandra Agent Ports

Port  Service
9000  MAAC metrics agent (Prometheus) — Cassandra 4.0, 4.1, 5.0 only

Observability Ports (All Nodes — DaemonSets)

Port  Service
4317  OTel Collector gRPC
4318  OTel Collector HTTP
9400  Beyla eBPF metrics (Prometheus)
9435  ebpf_exporter metrics (Prometheus)

Server

Port  Purpose
8080  Default server port (configurable via --port)

SSH

SSH access is configured automatically through the sshConfig file generated by source env.sh.

Log Infrastructure

This page documents the centralized logging infrastructure in easy-db-lab, including OTel for log collection and Victoria Logs for storage and querying.

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                     All Nodes                                │
├─────────────────────────────────────────────────────────────┤
│   /var/log/*          │   journald                          │
│   /mnt/db1/cassandra/logs/*.log                             │
│   /mnt/db1/clickhouse/logs/*.log                            │
│   /mnt/db1/clickhouse/keeper/logs/*.log                     │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ▼
              ┌────────────────────────┐
              │  OTel Collector        │
              │  (DaemonSet)           │      ┌──────────────────┐
              │  filelog + journald    │◀─────│  EMR Spark JVMs  │
              │  + OTLP receiver       │ OTLP │  (OTel Java Agent│
              └───────────┬────────────┘      │   v2.25.0)       │
                          │                   └──────────────────┘
┌─────────────────────────┼─────────────────────────┐
│   Control Node          │                          │
├─────────────────────────┼─────────────────────────┤
│                         ▼                          │
│              ┌──────────────────┐                  │
│              │  Victoria Logs   │                  │
│              │    (:9428)       │                  │
│              └────────┬─────────┘                  │
└───────────────────────┼────────────────────────────┘
                        │
                        ▼
              ┌──────────────────┐
              │  easy-db-lab     │
              │  logs query      │
              └──────────────────┘

Components

OTel Collector DaemonSet

The OpenTelemetry Collector runs on all nodes as a DaemonSet, collecting:

  • System file logs: /var/log/**/*.log, /var/log/messages, /var/log/syslog
  • Cassandra logs: /mnt/db1/cassandra/logs/*.log
  • ClickHouse server logs: /mnt/db1/clickhouse/logs/*.log
  • ClickHouse Keeper logs: /mnt/db1/clickhouse/keeper/logs/*.log
  • systemd journal: cassandra, docker, k3s, sshd units
  • OTLP: Receives logs from applications via OTLP protocol

Logs are forwarded to Victoria Logs on the control node via the Elasticsearch-compatible sink.

Spark OTel Java Agent (EMR)

When EMR Spark jobs are running, the Spark driver and executor JVMs are instrumented with the OpenTelemetry Java Agent (v2.25.0) via an EMR bootstrap action. The agent auto-instruments the JVMs and exports logs via OTLP to the control node's OTel Collector.

Logs appear in VictoriaLogs with a service.name attribute like spark-<job-name>, making it easy to filter logs for specific Spark jobs.

The data flow is: Spark JVM → OTel Java Agent → OTLP → OTel Collector (control node) → VictoriaLogs.

Victoria Logs

Victoria Logs runs on the control node and provides:

  • Log storage with efficient compression
  • LogsQL query language
  • HTTP API for querying (port 9428)

Querying Logs

Using the CLI

# Query all logs from last hour
easy-db-lab logs query

# Filter by source
easy-db-lab logs query --source cassandra
easy-db-lab logs query --source clickhouse
easy-db-lab logs query --source systemd

# Filter by host
easy-db-lab logs query --source cassandra --host db0

# Filter by systemd unit
easy-db-lab logs query --source systemd --unit docker.service

# Search for text
easy-db-lab logs query --grep "OutOfMemory"

# Time range and limit
easy-db-lab logs query --since 30m --limit 500

# Raw Victoria Logs query (LogsQL syntax)
easy-db-lab logs query -q 'source:cassandra AND host:db0'

Log Stream Fields

Common fields (all sources):

Field       Description
source      Log source: cassandra, clickhouse, systemd, system
host        Hostname (db0, app0, control0)
timestamp   Log timestamp
message     Log message content

Source-specific fields:

Source      Field       Description
clickhouse  component   server or keeper
systemd     unit        systemd unit name

Troubleshooting

No logs appearing

  1. Check Victoria Logs is running:

    kubectl get pods | grep victoria
    
  2. Check OTel Collector is running:

    kubectl get pods | grep otel
    
  3. Verify the cluster-config ConfigMap exists:

    kubectl get configmap cluster-config -o yaml
    

Connection errors

The logs query command uses the internal SOCKS5 proxy to connect to Victoria Logs. If you see connection errors:

  1. Ensure the cluster is running: easy-db-lab status
  2. The SOCKS5 proxy starts automatically when needed; no manual setup is required
  3. Check that the control node is accessible: ssh control0 hostname

Ports

Port   Service                  Location
9428   Victoria Logs HTTP API   Control node

OpenTelemetry Instrumentation

easy-db-lab includes optional OpenTelemetry (OTel) instrumentation for distributed tracing and metrics. When enabled, traces and metrics are exported to an OTLP-compatible collector.

CLI Tool Instrumentation

The easy-db-lab CLI tool runs with the OpenTelemetry Java Agent, which automatically instruments:

  • AWS SDK calls - EC2, S3, IAM, EMR, STS, OpenSearch operations
  • HTTP clients - OkHttp and other HTTP libraries
  • JDBC/Cassandra driver - Database operations
  • JVM metrics - Memory, threads, garbage collection

Enabling Instrumentation

Set the OTEL_EXPORTER_OTLP_ENDPOINT environment variable to your OTLP collector endpoint:

export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
easy-db-lab up

When this environment variable is:

  • Set: Traces and metrics are exported via gRPC to the specified endpoint
  • Not set: The agent is still loaded but no telemetry is exported (minimal overhead)

The agent uses automatic instrumentation only - there is no custom manual instrumentation in the CLI tool code.

Cluster Node Instrumentation

The following instrumentation applies to cluster nodes (Cassandra, stress, Spark) and is separate from the CLI tool:

Node Role Labeling

The OTel Collector on cluster nodes uses the k8sattributes processor to read the type label from the K8s node and set it as the node_role resource attribute. Grafana dashboards (e.g., System Overview) use this attribute for hostname and service filtering.

Node Type   K8s Label      node_role Value   Source
Cassandra   type=db        db                K3s agent config
Stress      type=app       app               K3s agent config
Control     type=control   control           Up command node labeling
Spark/EMR   N/A            spark             EMR OTel Collector resource/role processor

The k8sattributes processor runs in the metrics/local and logs/local pipelines only. Remote metrics arriving via OTLP (e.g., from Spark nodes) already carry node_role and are not modified.

The processor requires RBAC access to the K8s API. The OTel Collector DaemonSet runs with a dedicated ServiceAccount (otel-collector) that has read-only access to pods and nodes.
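A hedged sketch of the processor fragment this describes; the from: node label extraction requires a recent opentelemetry-collector-contrib release, and the exact easy-db-lab manifest may differ:

```yaml
processors:
  k8sattributes:
    extract:
      labels:
        # Copy the K8s node's "type" label onto telemetry as node_role
        - tag_name: node_role
          key: type
          from: node
```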

Stress Job Metrics

When running cassandra-easy-stress as K8s Jobs, metrics are automatically collected via an OTel collector sidecar container. The sidecar scrapes the stress process's Prometheus endpoint (localhost:9500) and forwards metrics via OTLP to the node's OTel DaemonSet, which then exports them to VictoriaMetrics.

The Prometheus scrape job is named cassandra-easy-stress. The following labels are available in Grafana:

Label       Source                                  Description
host_name   DaemonSet resourcedetection processor   K8s node name where the pod runs
instance    Sidecar relabel_configs                 Node name with port (e.g., ip-10-0-1-50:9500)
cluster     Sidecar relabel_configs                 Cluster name from cluster-config ConfigMap

Short-lived stress commands (list, info, fields) do not include the sidecar since they complete quickly and don't produce meaningful metrics.
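The scrape side of that sidecar can be sketched with the standard OTel prometheus receiver; the job name and target come from the text, while the interval and everything else are illustrative:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: cassandra-easy-stress
          scrape_interval: 15s        # illustrative
          static_configs:
            - targets: ["localhost:9500"]
```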

Spark JVM Instrumentation

EMR Spark jobs are auto-instrumented with the OpenTelemetry Java Agent (v2.25.0) and Pyroscope Java Agent (v2.3.0), both installed via an EMR bootstrap action. The OTel agent is activated through spark.driver.extraJavaOptions and spark.executor.extraJavaOptions.

Each EMR node also runs an OTel Collector as a systemd service, collecting host metrics (CPU, memory, disk, network) and receiving OTLP from the Java agents. The collector forwards all telemetry to the control node's OTel Collector via OTLP gRPC.

Key configuration:

  • OTel Agent JAR: Downloaded by bootstrap action to /opt/otel/opentelemetry-javaagent.jar
  • Pyroscope Agent JAR: Downloaded by bootstrap action to /opt/pyroscope/pyroscope.jar
  • OTel Collector: Installed at /opt/otel/otelcol-contrib, runs as otel-collector.service
  • Export protocol: OTLP/gRPC to localhost:4317 (local collector), which forwards to control node
  • Logs exporter: OTLP (captures JVM log output)
  • Service name: spark-<job-name> (set per job)
  • Profiling: CPU, allocation (512k threshold), lock (10ms threshold) profiles in JFR format sent to Pyroscope server

Cassandra Sidecar Instrumentation

The Cassandra Sidecar process is instrumented with the OpenTelemetry Java Agent and Pyroscope Java Agent, matching the pattern used for Cassandra itself. Both agents are loaded via -javaagent flags set in /etc/default/cassandra-sidecar, which is written by the setup-instances command.

Key configuration:

  • OTel Agent JAR: Installed by Packer to /usr/local/otel/opentelemetry-javaagent.jar
  • Pyroscope Agent JAR: Installed by Packer to /usr/local/pyroscope/pyroscope.jar
  • Service name: cassandra-sidecar (both OTel and Pyroscope)
  • Export endpoint: localhost:4317 (local OTel Collector DaemonSet)
  • Profiling: CPU, allocation (512k threshold), lock (10ms threshold) profiles sent to Pyroscope server
  • Activation: Gated on /etc/default/cassandra-sidecar — the systemd EnvironmentFile=- directive makes it optional, so the sidecar starts normally without instrumentation if the file doesn't exist
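The optional-file gating relies on systemd's leading-dash syntax. A sketch, with the ExecStart path and environment variable name invented for illustration:

```ini
# cassandra-sidecar.service (excerpt; ExecStart and JVM_OPTS are illustrative)
[Service]
# The leading "-" makes the file optional: if it does not exist,
# the sidecar starts without the -javaagent flags
EnvironmentFile=-/etc/default/cassandra-sidecar
ExecStart=/opt/cassandra-sidecar/bin/sidecar $JVM_OPTS
```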

Tool Runner Log Collection

Commands run via exec run are executed through systemd-run, which captures stdout and stderr to log files under /var/log/easydblab/tools/. The OTel Collector's filelog/tools receiver watches this directory and ships log entries to VictoriaLogs with the attribute source: tool-runner.

This provides automatic log capture for ad-hoc debugging tools (e.g., inotifywait, tcpdump, strace) run during investigations. Logs are queryable in VictoriaLogs and preserved in S3 backups via logs backup.

Key details:

  • Log directory: /var/log/easydblab/tools/
  • Source attribute: tool-runner (for filtering in VictoriaLogs queries)
  • Foreground commands: Output displayed after completion, also logged to file
  • Background commands (--bg): Output logged to file only, tool runs as a systemd transient unit

YACE CloudWatch Scrape

YACE (Yet Another CloudWatch Exporter) runs on the control node and scrapes AWS CloudWatch metrics for services used by the cluster. It uses tag-based auto-discovery with the easy_cass_lab=1 tag to find relevant resources.

YACE scrapes metrics for:

  • S3 — bucket request/byte counts
  • EBS — volume read/write ops and latency
  • EC2 — instance CPU, network, disk
  • OpenSearch — domain health, indexing, search metrics

EMR metrics are collected directly via OTel Collectors on Spark nodes (see Spark JVM Instrumentation above).

YACE exposes scraped metrics as Prometheus-compatible metrics on port 5001, which are then scraped by the OTel Collector and forwarded to VictoriaMetrics. This replaces the previous CloudWatch datasource in Grafana with a Prometheus-based approach, giving dashboards access to CloudWatch metrics through VictoriaMetrics queries.
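As a sketch of YACE's tag-based auto-discovery (the service, region, metric list, and period values are placeholders, not the shipped configuration):

```yaml
apiVersion: v1alpha1
discovery:
  jobs:
    - type: AWS/S3
      regions: [us-west-2]
      searchTags:
        # Only discover resources carrying the cluster tag
        - key: easy_cass_lab
          value: "1"
      metrics:
        - name: NumberOfObjects
          statistics: [Average]
          period: 300
          length: 600
```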

Resource Attributes

Traces from the CLI tool and cluster nodes include the following resource attributes:

  • service.name: Service identifier (e.g., easy-db-lab, cassandra-sidecar, spark-<job-name>)
  • service.version: Application version (CLI tool only)
  • host.name: Hostname

Configuration

The following environment variables are supported:

Variable                      Description                      Default
OTEL_EXPORTER_OTLP_ENDPOINT   OTLP gRPC endpoint               None (no export)
OTEL_SERVICE_NAME             Override service name            easy-db-lab
OTEL_RESOURCE_ATTRIBUTES      Additional resource attributes   None

Additional standard OTel environment variables are supported by the agent. See the OpenTelemetry Java Agent documentation for details.

Example: Using with Jaeger

Start Jaeger with OTLP support:

docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  jaegertracing/all-in-one:latest

Export traces to Jaeger:

export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
easy-db-lab up

View traces at http://localhost:16686

Example: Using with Grafana Tempo

If you have Grafana Tempo running with OTLP gRPC ingestion:

export OTEL_EXPORTER_OTLP_ENDPOINT=http://tempo:4317
easy-db-lab up

Troubleshooting

No Traces Appearing

  1. Verify the endpoint is correct and reachable
  2. Check that the collector accepts gRPC OTLP (port 4317 is standard)
  3. Look for OpenTelemetry agent logs on startup (use -Dotel.javaagent.debug=true to enable debug logging)

High Latency

Traces are batched before export (default 1 second delay). This is normal and reduces overhead.

Pyroscope Configuration Parameters

Reference for Pyroscope server configuration. Source: Grafana Pyroscope docs.

How Configuration Works

Pyroscope is configured via a YAML file (-config.file flag) or CLI flags. CLI flags take precedence over YAML values. Environment variables can be used with -config.expand-env=true using ${VAR} or ${VAR:-default} syntax.
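For example, with expansion enabled a config can pull values from the environment and fall back to defaults; the bucket and region names here are placeholders:

```yaml
storage:
  backend: s3
  s3:
    # ${VAR:-default} expands only when -config.expand-env=true is set
    bucket_name: ${PROFILE_BUCKET:-pyroscope-profiles}
    region: ${AWS_REGION:-us-west-2}
```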

View current config at the /config HTTP API endpoint.

Key Configuration Sections

Top-Level

# Modules to load. 'all' enables single-binary mode.
[target: <string> | default = "all"]

api:
  [base-url: <string> | default = ""]

Server

HTTP on port 4040 (default), gRPC on port 9095 (default).

server:
  [http_listen_address: <string> | default = ""]
  [http_listen_port: <int> | default = 4040]
  [grpc_listen_port: <int> | default = 9095]
  [graceful_shutdown_timeout: <duration> | default = 30s]
  [http_server_read_timeout: <duration> | default = 30s]
  [http_server_write_timeout: <duration> | default = 30s]
  [http_server_idle_timeout: <duration> | default = 2m]
  [log_format: <string> | default = "logfmt"]  # logfmt or json
  [log_level: <string> | default = "info"]      # debug, info, warn, error
  [grpc_server_max_recv_msg_size: <int> | default = 4194304]
  [grpc_server_max_send_msg_size: <int> | default = 4194304]
  [grpc_server_max_concurrent_streams: <int> | default = 100]

PyroscopeDB (Local Storage)

pyroscopedb:
  # Directory for local storage
  [data_path: <string> | default = "./data"]
  # Max block duration
  [max_block_duration: <duration> | default = 1h]
  # Row group target size (uncompressed)
  [row_group_target_size: <int> | default = 1342177280]
  # Partition label for symbols
  [symbols_partition_label: <string> | default = ""]
  # Disk retention: minimum free disk (GiB)
  [min_free_disk_gb: <int> | default = 10]
  # Disk retention: minimum free percentage
  [min_disk_available_percentage: <float> | default = 0.05]
  # How often to enforce retention
  [enforcement_interval: <duration> | default = 5m]
  # Disable retention enforcement
  [disable_enforcement: <boolean> | default = false]

Storage (Object Storage Backend)

Supported backends: s3, gcs, azure, swift, filesystem, cos.

storage:
  [backend: <string> | default = ""]
  [prefix: <string> | default = ""]

  s3:
    [endpoint: <string> | default = ""]
    [region: <string> | default = ""]
    [bucket_name: <string> | default = ""]
    [secret_access_key: <string> | default = ""]
    [access_key_id: <string> | default = ""]
    [insecure: <boolean> | default = false]
    [signature_version: <string> | default = "v4"]
    [bucket_lookup_type: <string> | default = "auto"]
    # NOTE: native_aws_auth_enabled exists on main but NOT in v1.18.0.
    # In v1.18.0, leave access_key_id/secret_access_key empty to use
    # the default AWS SDK credential chain (env vars, IMDS).
    sse:
      [type: <string> | default = ""]           # SSE-KMS or SSE-S3
      [kms_key_id: <string> | default = ""]
      [kms_encryption_context: <string> | default = ""]

  gcs:
    [bucket_name: <string> | default = ""]
    [service_account: <string> | default = ""]

  azure:
    [account_name: <string> | default = ""]
    [account_key: <string> | default = ""]
    [container_name: <string> | default = ""]

  filesystem:
    [dir: <string> | default = "./data-shared"]

Distributor

distributor:
  [pushtimeout: <duration> | default = 5s]
  ring:
    kvstore:
      [store: <string> | default = "memberlist"]  # consul, etcd, inmemory, memberlist, multi

Ingester

ingester:
  lifecycler:
    ring:
      kvstore:
        [store: <string> | default = "consul"]
      [heartbeat_timeout: <duration> | default = 1m]
      [replication_factor: <int> | default = 1]
    [num_tokens: <int> | default = 128]
    [heartbeat_period: <duration> | default = 5s]

Querier

querier:
  # Time after which queries go to storage instead of ingesters
  [query_store_after: <duration> | default = 4h]

Compactor

compactor:
  [block_ranges: <list of durations> | default = 1h0m0s,2h0m0s,8h0m0s]
  [data_dir: <string> | default = "./data-compactor"]
  [compaction_interval: <duration> | default = 30m]
  [compaction_concurrency: <int> | default = 1]
  [deletion_delay: <duration> | default = 12h]
  [downsampler_enabled: <boolean> | default = false]

Limits (Per-Tenant)

limits:
  # Ingestion rate limit (MB/s)
  [ingestion_rate_mb: <float> | default = 4]
  [ingestion_burst_size_mb: <float> | default = 2]
  # Label constraints
  [max_label_name_length: <int> | default = 1024]
  [max_label_value_length: <int> | default = 2048]
  [max_label_names_per_series: <int> | default = 30]
  # Profile constraints
  [max_profile_size_bytes: <int> | default = 4194304]
  [max_profile_stacktrace_samples: <int> | default = 16000]
  [max_profile_stacktrace_depth: <int> | default = 1000]
  # Series limits
  [max_global_series_per_tenant: <int> | default = 5000]
  # Query limits
  [max_query_lookback: <duration> | default = 1w]
  [max_query_length: <duration> | default = 1d]
  [max_flamegraph_nodes_default: <int> | default = 8192]
  [max_flamegraph_nodes_max: <int> | default = 1048576]
  # Retention
  [compactor_blocks_retention_period: <duration> | default = 0s]
  # Ingestion time bounds
  [reject_older_than: <duration> | default = 1h]
  [reject_newer_than: <duration> | default = 10m]
  # Relabeling
  [ingestion_relabeling_rules: <list of Configs> | default = []]
  [sample_type_relabeling_rules: <list of Configs> | default = []]

Self-Profiling

self_profiling:
  # Disable push profiling in single-binary mode
  [disable_push: <boolean> | default = false]
  [mutex_profile_fraction: <int> | default = 5]
  [block_profile_rate: <int> | default = 5]

Memberlist (Gossip)

memberlist:
  [bind_port: <int> | default = 7946]
  [join_members: <list of strings> | default = []]
  [gossip_interval: <duration> | default = 200ms]
  [gossip_nodes: <int> | default = 3]
  [leave_timeout: <duration> | default = 20s]

Tracing

tracing:
  [enabled: <boolean> | default = true]

Multi-Tenancy

# Require X-Scope-OrgId header; false = use "anonymous" tenant
[multitenancy_enabled: <boolean> | default = false]

Embedded Grafana

embedded_grafana:
  [data_path: <string> | default = "./data/__embedded_grafana/"]
  [listen_port: <int> | default = 4041]
  [pyroscope_url: <string> | default = "http://localhost:4040"]

Port Summary

Service             Port   Protocol
HTTP API            4040   HTTP
gRPC                9095   gRPC
Memberlist gossip   7946   TCP/UDP
Embedded Grafana    4041   HTTP

Relevant to Our Deployment

Our Pyroscope deployment (configuration/pyroscope/PyroscopeManifestBuilder.kt) uses:

  • S3 backend — IAM role auth via IMDS (no explicit credentials; v1.18.0 lacks native_aws_auth_enabled, SDK defaults to credential chain)
  • Single-binary mode (target: all)
  • Port 4040 for HTTP API
  • Flat storage prefix — pyroscope.{name}-{id} (Pyroscope rejects / in storage.prefix)
  • Config values substituted at build time via TemplateService (__KEY__ placeholders)
  • Profiles received from: Java agent (Cassandra, Spark), eBPF agent (all nodes), stress jobs

Spark Observability Debugging

Diagnostic commands for troubleshooting Spark observability on EMR nodes. These require SSH access to the EMR master node (ssh hadoop@<master-public-dns>).

OTel Collector

# Check collector is running
sudo systemctl status otel-collector

# View collector config (verify control node IP)
cat /opt/otel/config.yaml

# Test connectivity to control node collector
curl -s -o /dev/null -w '%{http_code}' http://<control-ip>:4318

Spark Configuration

# Verify -javaagent flags and OTel env vars are present
cat /etc/spark/conf/spark-defaults.conf

# Verify agent JARs exist
ls -la /opt/otel/opentelemetry-javaagent.jar
ls -la /opt/pyroscope/pyroscope.jar

Runtime Verification (while a job is running)

# Confirm agents are attached to Spark JVMs
ps aux | grep javaagent

Pyroscope API (from any node that can reach control0)

# List all label names
curl \
  -H "Content-Type: application/json" \
  -d '{
      "end": '$(date +%s)000',
      "start": '$(expr $(date +%s) - 3600)000'
    }' \
  http://localhost:4040/querier.v1.QuerierService/LabelNames

# List values for a specific label
curl \
  -H "Content-Type: application/json" \
  -d '{
      "end": '$(date +%s)000',
      "name": "hostname",
      "start": '$(expr $(date +%s) - 3600)000'
    }' \
  http://localhost:4040/querier.v1.QuerierService/LabelValues

# Diff two profiles (compare workloads)
# POST to /querier.v1.QuerierService/Diff with left/right profile selectors
# See: left.labelSelector, right.labelSelector, profileTypeID, start/end
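The inline $(date ...) arithmetic in the requests above can be factored out. This small helper builds a last-hour query window in epoch milliseconds (POSIX shell, no dependencies beyond date):

```shell
# Compute a one-hour query window in epoch milliseconds
end=$(( $(date +%s) * 1000 ))
start=$(( end - 3600 * 1000 ))
printf '{"start": %s, "end": %s}\n' "$start" "$end"
```

The resulting JSON body can be piped to curl with -d @-.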

Grafana Explore Queries

# All metrics from Spark nodes
{node_role="spark"}

# JVM metrics only
{node_role="spark", __name__=~"jvm_.*"}

# List distinct JVM metric names
group({node_role="spark", __name__=~"jvm_.*"}) by (__name__)

# Filesystem usage (raw)
system_filesystem_usage_bytes{state="used", node_role="spark", mountpoint="/"}

JFR Format Reference

The Java Flight Recorder format is used by JVM-based profilers and supported by the Pyroscope Java integration.

When JFR format is used, query parameters behave differently:

  • format should be set to jfr
  • name contains the prefix of the application name. Since a single request may contain multiple profile types, the final application name is created by concatenating this prefix and the profile type. For example, if you send cpu profiling data and set name to my-app{}, it will appear in Pyroscope as my-app.cpu{}
  • units is ignored — actual units depend on the profile types in the data
  • aggregationType is ignored — actual aggregation type depends on the profile types in the data

Supported JFR Profile Types

  • cpu — samples from runnable threads only
  • itimer — similar to cpu profiling
  • wall — samples from any thread regardless of state
  • alloc_in_new_tlab_objects — number of new TLAB objects created
  • alloc_in_new_tlab_bytes — size in bytes of new TLAB objects created
  • alloc_outside_tlab_objects — number of new allocated objects outside any TLAB
  • alloc_outside_tlab_bytes — size in bytes of new allocated objects outside any TLAB

JFR with Dynamic Labels

To ingest JFR data with dynamic labels:

  1. Use multipart/form-data Content-Type
  2. Send JFR data in a form file field called jfr
  3. Send LabelsSnapshot protobuf message in a form file field called labels
message Context {
    // string_id -> string_id
    map<int64, int64> labels = 1;
}
message LabelsSnapshot {
    // context_id -> Context
    map<int64, Context> contexts = 1;
    // string_id -> string
    map<int64, string> strings = 2;
}

Where context_id is a parameter set in async-profiler.

Ingestion Examples

Simple profile upload:

printf "foo;bar 100\nfoo;baz 200" | curl \
  -X POST \
  --data-binary @- \
  'http://localhost:4040/ingest?name=curl-test-app&from=1615709120&until=1615709130'

JFR profile with labels:

curl -X POST \
  -F jfr=@profile.jfr \
  -F labels=@labels.pb \
  "http://localhost:4040/ingest?name=curl-test-app&units=samples&aggregationType=sum&sampleRate=100&from=1655834200&until=1655834210&spyName=javaspy&format=jfr"

Future: Ad-hoc Profiling with async-profiler

async-profiler can capture JFR profiles on demand and upload them to Pyroscope with labels. This enables targeted profiling of specific Spark jobs or Cassandra operations to inspect exactly what is happening at the JVM level.

Common Issues

  • No JVM metrics: Check ps aux | grep javaagent — if -javaagent flags are missing, spark.driver.extraJavaOptions may be overridden at job submission time (replaces spark-defaults.conf entirely).
  • Collector retry errors at startup: Normal if the control node collector isn't ready yet. Should stabilize within a minute.
  • Spark profiles missing hostname label: PYROSCOPE_LABELS env var must be set via spark-env classification with hostname=$(hostname -s).

Development Overview

Hello there. If you're reading this, you've probably decided to contribute to easy-db-lab or use the tools for your own work. Very cool.

Dev Containers are the preferred method for developing easy-db-lab. They provide a consistent, pre-configured environment with all required tools installed:

  • Java 21 (Temurin) via SDKMAN
  • Kotlin and Gradle
  • MkDocs for documentation
  • Docker-in-Docker for container operations
  • Claude Code for AI-assisted development
  • zsh with Powerlevel10k theme

VS Code

  1. Install the Dev Containers extension
  2. Open the project folder
  3. Click "Reopen in Container" when prompted

JetBrains IDEs

  1. Install the Dev Containers plugin
  2. Open the project and select "Dev Containers" from the remote development options

CLI with bin/dev

The bin/dev script provides a convenient wrapper for dev container management:

bin/dev start          # Start the dev container
bin/dev shell          # Open interactive shell
bin/dev test           # Run Gradle tests
bin/dev docs-serve     # Serve docs with live reload
bin/dev claude         # Start Claude Code
bin/dev status         # Show container status
bin/dev down           # Stop and remove container

To mount your Claude Code configuration (for AI-assisted development):

ENABLE_CLAUDE=1 bin/dev start

Run bin/dev help for all available commands.

Building the Project

Once inside the container (or with local tools installed):

./gradlew assemble
./gradlew test

Documentation Preview

Preview documentation locally with live reload:

bin/dev docs-serve

Then open http://localhost:8000 in your browser.

Project Structure

easy-db-lab is broken into several subprojects:

  • Docker containers (prefixed with docker-)
  • Documentation (the manual you're reading now)
  • Utility code for downloading artifacts

Architecture

The project follows a layered architecture:

Commands (PicoCLI) → Services → External Systems (K8s, AWS, Filesystem)

Layer Responsibilities

  • Commands (commands/): Lightweight PicoCLI execution units
  • Services (services/, providers/): Business logic layer

For more details, see the project's CLAUDE.md file.

Docker Development

Building Docker Containers

Each container is versioned and can be built locally using the following:

./gradlew :PROJECT-NAME:buildDocker

Where PROJECT-NAME is one of the subproject directories you see in the top level.

Setup

We recommend updating your local Docker service to use 8GB of memory. This is necessary when running dashboard previews locally. The preview is configured to run multiple Cassandra containers at once.

Available Docker Projects

Check the root project directory for subprojects prefixed with docker- to see available containerized components.

Local Testing

To test containers locally:

  1. Build the container:

    ./gradlew :docker-cassandra:buildDocker
    
  2. Run the container:

    docker run -it <image-name>
    

Memory Requirements

Use Case                                  Recommended Memory
Single container development              4GB
Dashboard preview (multiple containers)   8GB
Full integration testing                  16GB

Publishing

Pre-Release Checklist

  1. First check CI to ensure the build is clean and green
  2. Ensure the following environment variables are set:
    • DOCKER_USERNAME
    • DOCKER_PASSWORD
    • DOCKER_EMAIL

Publishing Steps

Build and Upload

./gradlew buildAll uploadAll

Post-Release

After publishing, bump the version in build.gradle.kts.

Container Publishing

Containers are automatically published to GitHub Container Registry (ghcr.io) when:

  • A version tag (v*) is pushed
  • PR Checks pass on main branch

See .github/workflows/publish-container.yml for details.

Documentation

Documentation is automatically built and deployed via GitHub Actions when changes are pushed to the docs/ directory on the main branch.

Testing Guidelines

This document outlines the testing standards and practices for the easy-db-lab project.

Core Testing Principles

1. Use BaseKoinTest for Dependency Injection

All tests should extend BaseKoinTest to take advantage of automatic dependency injection setup and teardown.

class MyCommandTest : BaseKoinTest() {
    // Your test code here
}

BaseKoinTest provides:

  • Automatic Koin lifecycle management
  • Core modules that are always mocked (AWS, SSH, OutputHandler)
  • Ability to add test-specific modules via additionalTestModules()

2. Use AssertJ for Assertions

Tests should use AssertJ assertions, not JUnit assertions. AssertJ provides more readable and powerful assertion methods.

// Good - AssertJ style
import org.assertj.core.api.Assertions.assertThat

assertThat(result).isNotNull()
assertThat(result.value).isEqualTo("expected")
assertThat(list).hasSize(3).contains("item1", "item2")

// Avoid - JUnit style
import org.junit.jupiter.api.Assertions.assertEquals

assertEquals("expected", result.value)

3. Create Custom Assertions for Non-Trivial Classes

When testing non-trivial classes, create custom AssertJ assertions to implement Domain-Driven Design in tests. This decouples business logic from implementation details and makes tests more maintainable during refactoring.

Custom Assertions Pattern

Custom assertions provide a fluent, domain-specific language for testing that improves readability and maintainability.

Example: Custom Assertion for a Domain Class

Here's a complete example showing how to create and use custom assertions:

// Domain class to be tested
data class CassandraNode(
    val nodeId: String,
    val datacenter: String,
    val rack: String,
    val status: NodeStatus,
    val tokens: Int
)

enum class NodeStatus {
    UP, DOWN, JOINING, LEAVING
}

// Custom assertion class
import org.assertj.core.api.AbstractAssert

class CassandraNodeAssert(actual: CassandraNode?) :
    AbstractAssert<CassandraNodeAssert, CassandraNode>(actual, CassandraNodeAssert::class.java) {

    companion object {
        fun assertThat(actual: CassandraNode?): CassandraNodeAssert {
            return CassandraNodeAssert(actual)
        }
    }

    fun hasNodeId(nodeId: String): CassandraNodeAssert {
        isNotNull
        if (actual.nodeId != nodeId) {
            failWithMessage("Expected node ID to be <%s> but was <%s>", nodeId, actual.nodeId)
        }
        return this
    }

    fun isInDatacenter(datacenter: String): CassandraNodeAssert {
        isNotNull
        if (actual.datacenter != datacenter) {
            failWithMessage("Expected datacenter to be <%s> but was <%s>", datacenter, actual.datacenter)
        }
        return this
    }

    fun hasStatus(status: NodeStatus): CassandraNodeAssert {
        isNotNull
        if (actual.status != status) {
            failWithMessage("Expected status to be <%s> but was <%s>", status, actual.status)
        }
        return this
    }

    fun isUp(): CassandraNodeAssert {
        return hasStatus(NodeStatus.UP)
    }

    fun isDown(): CassandraNodeAssert {
        return hasStatus(NodeStatus.DOWN)
    }

    fun hasTokenCount(tokens: Int): CassandraNodeAssert {
        isNotNull
        if (actual.tokens != tokens) {
            failWithMessage("Expected token count to be <%s> but was <%s>", tokens, actual.tokens)
        }
        return this
    }
}

// Usage in tests
import CassandraNodeAssert.Companion.assertThat

@Test
fun `test cassandra node configuration`() {
    val node = CassandraNode(
        nodeId = "node1",
        datacenter = "dc1",
        rack = "rack1",
        status = NodeStatus.UP,
        tokens = 256
    )

    // Fluent assertions with domain language
    assertThat(node)
        .hasNodeId("node1")
        .isInDatacenter("dc1")
        .isUp()
        .hasTokenCount(256)
}

Project-Wide Assertions Helper

Create a central assertions class to provide access to all custom assertions:

// MyProjectAssertions.kt
object MyProjectAssertions {

    // Cassandra domain assertions
    fun assertThat(actual: CassandraNode?): CassandraNodeAssert {
        return CassandraNodeAssert(actual)
    }

    fun assertThat(actual: Host?): HostAssert {
        return HostAssert(actual)
    }

    fun assertThat(actual: TFState?): TFStateAssert {
        return TFStateAssert(actual)
    }

    // Add more domain assertions as needed
}

Then import statically in tests:

import com.rustyrazorblade.easydblab.assertions.MyProjectAssertions.assertThat

@Test
fun `test complex scenario`() {
    val node = createTestNode()
    val host = createTestHost()

    // All domain assertions available through single import
    assertThat(node).isUp()
    assertThat(host).hasPrivateIp("10.0.0.1")
}

Benefits of Custom Assertions

  1. Domain-Driven Design: Tests use business language, not implementation details
  2. Refactoring Safety: Changes to class internals don't break test logic
  3. Readability: Tests read like specifications
  4. Reusability: Common assertions are centralized
  5. Maintainability: Single place to update assertion logic
  6. Type Safety: Compile-time checking of assertion methods

When to Create Custom Assertions

Create custom assertions for:

  • Domain entities (e.g., Host, TFState, CassandraNode)
  • Complex value objects with multiple properties
  • Classes that appear in multiple test scenarios
  • Any class where you find yourself writing repetitive assertion code

Testing Best Practices

  1. Test Names: Use descriptive names with backticks

    @Test
    fun `should start cassandra node when status is DOWN`() { }
    
  2. Test Structure: Follow Arrange-Act-Assert pattern

    @Test
    fun `test node startup`() {
        // Arrange
        val node = createTestNode(status = NodeStatus.DOWN)
    
        // Act
        val result = nodeManager.startNode(node)
    
        // Assert
        assertThat(result).isUp()
    }
    
  3. Mock External Dependencies: Always mock AWS, SSH, and other external services

    class MyTest : BaseKoinTest() {
        override fun additionalTestModules() = listOf(
            module {
                single { mockRemoteOperationsService() }
            }
        )
    }
    
  4. Test Edge Cases: Include tests for error conditions and boundary cases

  5. Keep Tests Focused: Each test should verify one specific behavior

Testing Interactive Commands with TestPrompter

Commands that require user input (like setup-profile) can be tested deterministically using TestPrompter. This test utility replaces the real Prompter interface and returns predefined responses.

Basic Usage

class MyCommandTest : BaseKoinTest() {
    private lateinit var testPrompter: TestPrompter

    override fun additionalTestModules() = listOf(
        module {
            single<Prompter> { testPrompter }
        }
    )

    @BeforeEach
    fun setup() {
        // Configure responses - keys can be exact matches or partial matches
        testPrompter = TestPrompter(
            mapOf(
                "email" to "test@example.com",
                "region" to "us-west-2",
                "AWS Access Key" to "AKIAIOSFODNN7EXAMPLE",
            )
        )
    }

    @Test
    fun `should collect user credentials`() {
        // Run command that prompts for input
        val command = SetupProfile()
        command.call()

        // Verify prompts were called
        assertThat(testPrompter.wasPromptedFor("email")).isTrue()
        assertThat(testPrompter.wasPromptedFor("region")).isTrue()
    }
}

Response Matching

TestPrompter supports two matching strategies:

  1. Exact match: The question text matches a key exactly
  2. Partial match: The question text contains the key (case-insensitive)

val prompter = TestPrompter(
    mapOf(
        // Exact match - only matches "email" exactly
        "email" to "test@example.com",

        // Partial match - matches any question containing "AWS Profile"
        "AWS Profile" to "my-profile",
    )
)
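The lookup order described above — exact match first, then case-insensitive containment — can be sketched as a standalone function (an assumption about how the matching behaves, mirroring but not copying the real TestPrompter):

```kotlin
// Sketch of the two matching strategies: exact key lookup first,
// then a case-insensitive "question contains key" fallback.
fun matchResponse(question: String, responses: Map<String, String>): String? {
    // 1. Exact match
    responses[question]?.let { return it }
    // 2. Partial match (case-insensitive containment)
    return responses.entries
        .firstOrNull { question.contains(it.key, ignoreCase = true) }
        ?.value
}

fun main() {
    val responses = mapOf(
        "email" to "test@example.com",
        "AWS Profile" to "my-profile",
    )
    println(matchResponse("email", responses))                       // exact
    println(matchResponse("Enter your aws profile name", responses)) // partial
    println(matchResponse("unrelated question", responses))          // no match
}
```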

Sequential Responses for Retry Testing

For testing retry logic (e.g., credential validation failures), use addSequentialResponses():

@Test
fun `should retry on invalid credentials`() {
    testPrompter = TestPrompter()

    // First call returns invalid credentials, second returns valid ones
    testPrompter.addSequentialResponses(
        "AWS Access Key",
        "invalid-key",      // First attempt
        "AKIAVALIDKEY123"   // Second attempt (after retry)
    )

    testPrompter.addSequentialResponses(
        "AWS Secret",
        "invalid-secret",
        "valid-secret-key"
    )

    val command = SetupProfile()
    command.call()

    // Verify the command handled retry correctly
    val callLog = testPrompter.getCallLog()
    val accessKeyCalls = callLog.filter { it.question.contains("Access Key") }
    assertThat(accessKeyCalls).hasSize(2)
}

Verifying Prompt Behavior

TestPrompter records all prompt calls for verification:

@Test
fun `should not prompt for credentials when using AWS profile`() {
    testPrompter = TestPrompter(
        mapOf(
            "AWS Profile" to "my-profile",  // Non-empty = use profile auth
        )
    )

    val command = SetupProfile()
    command.call()

    // Verify credential prompts were skipped
    assertThat(testPrompter.wasPromptedFor("Access Key")).isFalse()
    assertThat(testPrompter.wasPromptedFor("Secret")).isFalse()

    // Check detailed call log
    val callLog = testPrompter.getCallLog()
    assertThat(callLog).anyMatch { it.question.contains("email") }
}

TestPrompter API Reference

  • prompt(question, default, secret) - Returns configured response or default
  • addSequentialResponses(key, vararg responses) - Configure different responses for retry scenarios
  • getCallLog() - Returns list of all prompt calls with details
  • wasPromptedFor(questionContains) - Check if any prompt contained the given text
  • clear() - Reset call log and sequential state

PromptCall Data Class

Each recorded call contains:

  • question: The prompt question text
  • default: The default value offered
  • secret: Whether input was masked (for passwords)
  • returnedValue: The value that was returned
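A data class with those fields might look like the following (field names are from the list above; the project's actual declaration may differ):

```kotlin
// Shape of a recorded prompt call, reconstructed from the documented fields.
data class PromptCall(
    val question: String,       // the prompt question text
    val default: String?,       // the default value offered, if any
    val secret: Boolean,        // whether input was masked
    val returnedValue: String,  // the value that was returned
)

fun main() {
    val call = PromptCall("email", default = null, secret = false, returnedValue = "test@example.com")
    println(call.question)
}
```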

End-to-End Testing

easy-db-lab includes a comprehensive end-to-end test suite that validates the entire workflow from provisioning to teardown.

Running the Test

The end-to-end test is located at bin/end-to-end-test:

./bin/end-to-end-test --cassandra

Command-Line Options

Feature Flags

  • --cassandra - Enable Cassandra-specific tests
  • --spark - Enable Spark EMR provisioning and tests
  • --clickhouse - Enable ClickHouse deployment and tests
  • --opensearch - Enable OpenSearch deployment and tests
  • --all - Enable all optional features
  • --ebs - Enable EBS volumes (gp3, 256GB)
  • --build - Build Packer images (default: skip)

Testing and Inspection

  • --list-steps, -l - List all test steps without running
  • --break <steps> - Set breakpoints at specific steps (comma-separated)
  • --wait - Run all steps except teardown, then wait for confirmation

Examples

# List all available test steps
./bin/end-to-end-test --list-steps

# Run full test with all features
./bin/end-to-end-test --all

# Run with Cassandra and pause before teardown
./bin/end-to-end-test --cassandra --wait

# Run with breakpoints at steps 5 and 15
./bin/end-to-end-test --cassandra --break 5,15

# Build custom AMI images and run test
./bin/end-to-end-test --build --cassandra

Test Steps

The test executes approximately 27 steps covering:

Infrastructure

  1. Build project
  2. Check version command
  3. Build packer images (optional)
  4. Set IAM policies
  5. Initialize cluster
  6. Setup kubectl
  7. Wait for K3s ready
  8. Verify K3s cluster

Registry and Storage

  1. Test registry push/pull
  2. List hosts
  3. Verify S3 backup

Cassandra

  1. Setup Cassandra
  2. Verify Cassandra backup
  3. Verify restore
  4. Cassandra start/stop cycle
  5. Test SSH and nodetool
  6. Check Sidecar
  7. Test exec command
  8. Run stress test
  9. Run stress K8s test

Optional Services

  1. Submit Spark job (if --spark)
  2. Check Spark status (if --spark)
  3. Start ClickHouse (if --clickhouse)
  4. Test ClickHouse (if --clickhouse)
  5. Stop ClickHouse (if --clickhouse)
  6. Start OpenSearch (if --opensearch)
  7. Test OpenSearch (if --opensearch)
  8. Stop OpenSearch (if --opensearch)

Observability and Cleanup

  1. Test observability stack
  2. Teardown cluster

Error Handling

When a test step fails, an interactive menu appears:

  1. Retry from failed step - Resume from the point of failure
  2. Start a shell session - Opens a shell with:
    • easy-db-lab commands available
    • rebuild - Rebuild just the project
    • rerun - Rebuild and resume from failed step
  3. Tear down environment - Run easy-db-lab down --yes
  4. Exit - Exit the script

AWS Requirements

The test requires:

  • AWS profile with sufficient permissions
  • VPC and subnet configuration
  • S3 bucket for backups and logs

Default Configuration

  • Instance count: 3 nodes
  • Instance type: c5.2xlarge
  • Cassandra version: 5.0 (when enabled)
  • Spark workers: 2 (when enabled)

CI Integration

The end-to-end test is designed to run in CI environments:

  • Supports non-interactive mode
  • Returns appropriate exit codes
  • Provides detailed logging
  • Cleans up resources on failure

Spark Development

This guide covers developing and testing Spark-related functionality in easy-db-lab.

Project Structure

All Spark modules live under spark/ with shared configuration:

  • spark/common/ — Shared config (SparkJobConfig), data generation (BulkTestDataGenerator), CQL setup
  • spark/bulk-writer-sidecar/ — Cassandra Analytics, direct sidecar transport (DirectBulkWriter)
  • spark/bulk-writer-s3/ — Cassandra Analytics, S3 staging transport (S3BulkWriter)
  • spark/connector-writer/ — Standard Spark Cassandra Connector (StandardConnectorWriter)
  • spark/connector-read-write/ — Read→transform→write example (KeyValuePrefixCount)

Gradle modules use nested paths: :spark:common, :spark:bulk-writer-sidecar, etc.

Prerequisites

The bulk-writer modules depend on Apache Cassandra Analytics, which requires JDK 11 to build.

One-Time Setup

bin/build-cassandra-analytics

Options:

  • --force - Rebuild even if already built
  • --branch <branch> - Use a specific branch (default: trunk)

Building

# Build all Spark modules
./gradlew :spark:bulk-writer-sidecar:shadowJar :spark:bulk-writer-s3:shadowJar \
  :spark:connector-writer:shadowJar :spark:connector-read-write:shadowJar

# Build individually
./gradlew :spark:bulk-writer-sidecar:shadowJar
./gradlew :spark:connector-writer:shadowJar

# Output locations
ls spark/bulk-writer-sidecar/build/libs/bulk-writer-sidecar-*.jar
ls spark/connector-writer/build/libs/connector-writer-*.jar

Shadow JARs include all dependencies except Spark (provided by EMR).

Running Tests

Main project tests exclude bulk-writer modules to avoid requiring cassandra-analytics:

./gradlew :test

Testing with a Live Cluster

Using bin/spark-bulk-write

This script handles JAR lookup, host resolution, and health checks:

# From a cluster directory (where state.json exists)
spark-bulk-write direct --rows 10000
spark-bulk-write s3 --rows 1000000 --parallelism 20
spark-bulk-write connector --keyspace myks --table mytable

Using bin/submit-direct-bulk-writer

Simplified script for direct bulk writer testing:

bin/submit-direct-bulk-writer [rowCount] [parallelism] [partitionCount] [replicationFactor]

Manual Spark Job Submission

All modules use unified spark.easydblab.* configuration:

easy-db-lab spark submit \
    --jar spark/bulk-writer-sidecar/build/libs/bulk-writer-sidecar-*.jar \
    --main-class com.rustyrazorblade.easydblab.spark.DirectBulkWriter \
    --conf spark.easydblab.contactPoints=host1,host2 \
    --conf spark.easydblab.keyspace=bulk_test \
    --conf spark.easydblab.localDc=us-west-2 \
    --conf spark.easydblab.rowCount=1000 \
    --conf spark.easydblab.replicationFactor=1 \
    --wait

Debugging Failed Jobs

When a Spark job fails, easy-db-lab automatically queries logs and displays failure details.

Manual Log Retrieval

easy-db-lab spark logs --step-id <step-id>
easy-db-lab spark status --step-id <step-id>
easy-db-lab spark jobs

Direct S3 Access

Logs are stored at: s3://<bucket>/spark/emr-logs/<cluster-id>/steps/<step-id>/

aws s3 cp s3://<bucket>/spark/emr-logs/<cluster-id>/steps/<step-id>/stderr.gz - | gunzip

Adding a New Spark Module

  1. Create a directory under spark/ (e.g., spark/bulk-reader/)
  2. Add build.gradle.kts — use an existing module as a template
  3. Add include "spark:bulk-reader" to settings.gradle
  4. Depend on :spark:common for shared config
  5. Use SparkJobConfig.load(sparkConf) for configuration
  6. Implement your main class and submit via easy-db-lab spark submit

Architecture Notes

Shared Configuration

SparkJobConfig in spark/common provides:

  • Property constants (PROP_CONTACT_POINTS, etc.)
  • Config loading from SparkConf with validation
  • Schema setup via CqlSetup
  • Consistent defaults across all modules

Why Shadow JAR?

Bulk-writer modules use the Gradle Shadow plugin because:

  1. EMR provides Spark, so those dependencies are compileOnly
  2. Cassandra Analytics has many transitive dependencies
  3. mergeServiceFiles() properly handles META-INF/services for SPI
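The relevant build configuration can be sketched as follows. The plugin version and Spark coordinates here are illustrative, not the project's actual ones:

```kotlin
// build.gradle.kts sketch for a bulk-writer module (coordinates are examples).
plugins {
    id("com.github.johnrengelman.shadow") version "8.1.1"
}

dependencies {
    // EMR supplies Spark at runtime, so keep it out of the shadow JAR.
    compileOnly("org.apache.spark:spark-sql_2.12:3.5.0")
    implementation(project(":spark:common"))
}

tasks.shadowJar {
    // Merge META-INF/services entries so SPI lookups across
    // cassandra-analytics' transitive dependencies still resolve.
    mergeServiceFiles()
}
```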

Cassandra Analytics Modules

Some cassandra-analytics modules aren't published to Maven:

  • five-zero.jar - Cassandra 5.0 bridge
  • five-zero-bridge.jar - Bridge implementation
  • five-zero-types.jar - Type converters
  • five-zero-sparksql.jar - SparkSQL integration

These are referenced directly from .cassandra-analytics/ build output.

SOCKS Proxy Architecture

This document describes the internal SOCKS5 proxy implementation used by easy-db-lab for programmatic access to private cluster resources.

Overview

easy-db-lab has two separate proxy systems:

  • Shell Proxy - for user shell commands (kubectl, curl); implemented with the SSH CLI (ssh -D) via env.sh
  • JVM Proxy - for internal Kotlin/Java code; implemented with the Apache MINA SSH library

This document covers the JVM Proxy used internally by easy-db-lab.

Why Two Proxies?

The shell proxy (started by source env.sh) works for command-line tools that respect HTTPS_PROXY environment variables. However, JVM code needs programmatic proxy configuration:

  • Java's HttpClient requires a ProxySelector instance
  • The Cassandra driver needs SOCKS5 configuration at the Netty level
  • The Kubernetes fabric8 client needs proxy settings
  • Operations should work without requiring users to run source env.sh first

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     easy-db-lab JVM                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌────────────────────┐    ┌────────────────────────┐           │
│  │ SocksProxyService  │    │ ProxiedHttpClientFactory│           │
│  │   (interface)      │    │                        │           │
│  └─────────┬──────────┘    └───────────┬────────────┘           │
│            │                           │                         │
│            ▼                           ▼                         │
│  ┌─────────────────────┐    ┌────────────────────────┐          │
│  │ MinaSocksProxyService│    │   SocksProxySelector   │          │
│  │ (Apache MINA impl)  │    │  (custom ProxySelector)│          │
│  └─────────┬───────────┘    └────────────────────────┘          │
│            │                                                     │
│            ▼                                                     │
│  ┌─────────────────────┐                                        │
│  │ SSHConnectionProvider│                                        │
│  │ (manages SSH sessions)│                                       │
│  └─────────┬────────────┘                                        │
│            │                                                     │
└────────────┼─────────────────────────────────────────────────────┘
             │
             ▼ SSH Dynamic Port Forwarding
   ┌──────────────────┐
   │   Control Node   │
   │   (control0)     │
   └──────────────────┘

Key Classes

SocksProxyService

Location: com.rustyrazorblade.easydblab.proxy.SocksProxyService

Interface defining proxy operations:

interface SocksProxyService {
    fun ensureRunning(gatewayHost: ClusterHost): SocksProxyState
    fun start(gatewayHost: ClusterHost): SocksProxyState
    fun stop()
    fun isRunning(): Boolean
    fun getState(): SocksProxyState?
    fun getLocalPort(): Int
}

MinaSocksProxyService

Location: com.rustyrazorblade.easydblab.proxy.MinaSocksProxyService

Apache MINA-based implementation that:

  1. Establishes an SSH connection to the gateway host
  2. Starts dynamic port forwarding on a random available port
  3. Maintains thread-safe state for concurrent access

Key implementation details:

  • Uses ReentrantLock for thread safety
  • Dynamically finds an available port via ServerSocket(0)
  • Extracts the underlying ClientSession from the SSH client for port forwarding
  • Supports idempotent ensureRunning() for reuse across operations
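The port-discovery trick mentioned above relies on a standard JDK behavior: binding a ServerSocket to port 0 asks the OS for any free port. A standalone illustration (not the service's actual code):

```kotlin
import java.net.ServerSocket

// Bind to port 0 so the OS picks a free ephemeral port, then release it.
fun findAvailablePort(): Int =
    ServerSocket(0).use { it.localPort }

fun main() {
    val port = findAvailablePort()
    println(port in 1..65535)
}
```

Note there is a small race window between closing the probe socket and the proxy binding the port, which is usually acceptable for test tooling.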

ProxiedHttpClientFactory

Location: com.rustyrazorblade.easydblab.proxy.ProxiedHttpClientFactory

Creates java.net.http.HttpClient instances configured for SOCKS5 proxy:

class ProxiedHttpClientFactory(
    private val socksProxyService: SocksProxyService,
) : HttpClientFactory {

    override fun createClient(): HttpClient {
        val proxyPort = socksProxyService.getLocalPort()
        val proxySelector = SocksProxySelector(proxyPort)

        return HttpClient
            .newBuilder()
            .proxy(proxySelector)
            .connectTimeout(CONNECTION_TIMEOUT)
            .build()
    }
}

SocksProxySelector

Location: com.rustyrazorblade.easydblab.proxy.ProxiedHttpClientFactory (private class)

Custom ProxySelector that returns a SOCKS5 proxy for all URIs:

private class SocksProxySelector(
    private val proxyPort: Int,
) : ProxySelector() {
    private val proxy = Proxy(Proxy.Type.SOCKS, InetSocketAddress("localhost", proxyPort))

    override fun select(uri: URI?): List<Proxy> = listOf(proxy)

    override fun connectFailed(uri: URI?, sa: SocketAddress?, ioe: IOException?) {
        // Handle connection failures if needed
    }
}

Important: Java's ProxySelector.of() creates HTTP proxies, not SOCKS5. This custom implementation is required for SSH dynamic port forwarding.

SocksProxyNettyOptions

Location: com.rustyrazorblade.easydblab.driver.SocksProxyNettyOptions

Configures the Cassandra driver to use SOCKS5 proxy at the Netty level for CQL connections.

Dependency Injection

The proxy components are registered in ProxyModule:

val proxyModule = module {
    // Singleton - maintains proxy state across requests
    single<SocksProxyService> { MinaSocksProxyService(get()) }

    // Factory for creating proxied HTTP clients
    single<HttpClientFactory> { ProxiedHttpClientFactory(get()) }
}

Usage Patterns

Querying Victoria Logs

class DefaultVictoriaLogsService(
    private val socksProxyService: SocksProxyService,
    private val httpClientFactory: HttpClientFactory,
) : VictoriaLogsService {

    override fun query(...): Result<List<String>> = runCatching {
        // Ensure proxy is running to control node
        socksProxyService.ensureRunning(controlHost)

        // Create HTTP client that routes through proxy
        val httpClient = httpClientFactory.createClient()

        // Make request to private IP
        val request = HttpRequest.newBuilder()
            .uri(URI.create("http://${controlHost.privateIp}:9428/..."))
            .build()

        httpClient.send(request, BodyHandlers.ofString())
    }
}

Kubernetes API Access

The K8sService uses the proxy for fabric8 Kubernetes client connections to the private K3s API server.

CQL Sessions

The CqlSessionFactory configures the Cassandra driver with SOCKS5 proxy settings via SocksProxyNettyOptions.

Lifecycle

CLI Mode

In CLI mode (single command execution):

  1. Service starts proxy when needed
  2. Operations complete
  3. Proxy remains running for subsequent operations in same process

Server/MCP Mode

In server mode (long-running process):

  1. Proxy starts on first request requiring cluster access
  2. Reused across multiple requests (connection count tracked)
  3. Stopped on server shutdown

Thread Safety

MinaSocksProxyService uses a ReentrantLock to protect:

  • Proxy state changes
  • Session management
  • Port allocation

This ensures safe concurrent access when multiple threads need cluster resources.
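The shape of that locking pattern, applied to an idempotent ensure-running operation, might look like this (illustrative only, not the actual MinaSocksProxyService code):

```kotlin
import java.util.concurrent.locks.ReentrantLock
import kotlin.concurrent.withLock

// Sketch: a ReentrantLock guards the running flag and port so concurrent
// callers start the proxy at most once and then share it.
class ProxyState {
    private val lock = ReentrantLock()
    private var running = false
    private var port = -1

    fun ensureRunning(startProxy: () -> Int): Int = lock.withLock {
        if (!running) {
            port = startProxy() // start only once; later calls reuse the port
            running = true
        }
        port
    }
}

fun main() {
    val state = ProxyState()
    var starts = 0
    val start = { starts++; 1080 }
    state.ensureRunning(start)
    state.ensureRunning(start) // second call reuses the running proxy
    println("starts=$starts")
}
```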

Error Handling

Common failure scenarios:

  • "HTTP/1.1 header parser received no bytes" - caused by using an HTTP proxy instead of SOCKS5; ensure SocksProxySelector returns Proxy.Type.SOCKS
  • Connection timeout - control node not accessible; verify SSH connectivity to control0
  • Port bind failure - port already in use; the service automatically finds an available port

Testing

When testing code that uses the proxy:

class MyServiceTest : BaseKoinTest() {
    // BaseKoinTest provides mocked SocksProxyService

    @Test
    fun testWithMockedProxy() {
        val mockProxyService = mock<SocksProxyService>()
        whenever(mockProxyService.getLocalPort()).thenReturn(1080)

        // Test your service with mocked proxy
    }
}
Key Files

  • proxy/SocksProxyService.kt - Interface definition
  • proxy/MinaSocksProxyService.kt - Apache MINA implementation
  • proxy/ProxiedHttpClientFactory.kt - HTTP client factory with SOCKS5
  • proxy/ProxyModule.kt - Koin DI registration
  • driver/SocksProxyNettyOptions.kt - Cassandra driver proxy config
  • driver/SocksProxyDriverContext.kt - Driver context with proxy
  • services/VictoriaLogsService.kt - Example usage

Fabric8 Server-Side Apply Pattern

This document explains a common error when using fabric8's Kubernetes client for server-side apply operations, and the correct pattern to use.

The Error

When applying Kubernetes manifests using fabric8, you may encounter:

java.lang.IllegalStateException: Could not find a registered handler for item:
[GenericKubernetesResource(apiVersion=v1, kind=Namespace, metadata=ObjectMeta...)]

This is a client-side fabric8 error, not a Kubernetes server error.

Root Cause

Fabric8 has two loading paths with different behaviors:

  1. Typed Loader (works): client.namespaces().load(stream) → returns a typed Namespace; serverSideApply() works
  2. Generic Loader (fails): client.load(stream) → returns GenericKubernetesResource; serverSideApply() fails

Critical: Items returned by client.load() are always GenericKubernetesResource at runtime, regardless of the YAML content. They cannot be cast to typed classes like Namespace or ConfigMap.

Patterns That Do NOT Work

Attempt 1: Direct serverSideApply on loader

// DON'T DO THIS - causes "Could not find a registered handler" error
client.load(inputStream).serverSideApply()

Attempt 2: Load items then use client.resource()

// DON'T DO THIS - still fails with same error
val items = client.load(inputStream).items()
for (item in items) {
    client.resource(item).serverSideApply()
}

Even though we load the items first, they are still GenericKubernetesResource objects internally, and client.resource(item).serverSideApply() still fails.

Attempt 3: Cast GenericKubernetesResource to typed class

// DON'T DO THIS - causes ClassCastException
val items = client.load(inputStream).items()
for (item in items) {
    when (item.kind) {
        "Namespace" -> client.namespaces().resource(item as Namespace).serverSideApply()
        // ...
    }
}

Error: java.lang.ClassCastException: class io.fabric8.kubernetes.api.model.GenericKubernetesResource cannot be cast to class io.fabric8.kubernetes.api.model.Namespace

The items from client.load() are truly GenericKubernetesResource at runtime - they cannot be cast to typed classes.

The Pattern That Works

Use typed client loaders directly with forceConflicts():

private fun loadAndApplyManifest(client: KubernetesClient, file: File) {
    val yamlContent = file.readText()
    val kind = extractKind(yamlContent)

    ByteArrayInputStream(yamlContent.toByteArray()).use { stream ->
        when (kind) {
            "Namespace" -> client.namespaces().load(stream).forceConflicts().serverSideApply()
            "ConfigMap" -> client.configMaps().load(stream).forceConflicts().serverSideApply()
            "Service" -> client.services().load(stream).forceConflicts().serverSideApply()
            "DaemonSet" -> client.apps().daemonSets().load(stream).forceConflicts().serverSideApply()
            "Deployment" -> client.apps().deployments().load(stream).forceConflicts().serverSideApply()
            else -> throw IllegalStateException("Unsupported resource kind: $kind")
        }
    }
}

private fun extractKind(yamlContent: String): String {
    val kindRegex = Regex("""^kind:\s*(\w+)""", RegexOption.MULTILINE)
    return kindRegex.find(yamlContent)?.groupValues?.get(1)
        ?: throw IllegalStateException("Could not determine resource kind from YAML")
}

This works because:

  1. Typed loaders (e.g., client.namespaces().load(stream)) return properly typed resources
  2. Typed resources have registered handlers for serverSideApply()
  3. forceConflicts() resolves field manager conflicts when multiple controllers manage the same resource
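The extractKind helper above can be exercised standalone, since it is plain regex over the YAML text:

```kotlin
// Standalone copy of the extractKind helper for a quick check.
fun extractKind(yamlContent: String): String {
    val kindRegex = Regex("""^kind:\s*(\w+)""", RegexOption.MULTILINE)
    return kindRegex.find(yamlContent)?.groupValues?.get(1)
        ?: error("Could not determine resource kind from YAML")
}

fun main() {
    val yaml = """
        apiVersion: v1
        kind: Namespace
        metadata:
          name: monitoring
    """.trimIndent()
    println(extractKind(yaml))
}
```

One caveat worth knowing: the `^kind:` anchor matches the first `kind:` at a line start, so multi-document YAML files (separated by `---`) would need to be split before dispatching.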

Required Imports

import io.fabric8.kubernetes.client.KubernetesClient
import java.io.ByteArrayInputStream
import java.io.File

Adding New Resource Types

If you need to support additional Kubernetes resource types, add them to the when statement:

"Pod" -> client.pods().load(stream).forceConflicts().serverSideApply()
"Secret" -> client.secrets().load(stream).forceConflicts().serverSideApply()
"StatefulSet" -> client.apps().statefulSets().load(stream).forceConflicts().serverSideApply()
// etc.

References

  • Fabric8 Kubernetes Client: https://github.com/fabric8io/kubernetes-client
  • Server-side apply documentation: https://github.com/fabric8io/kubernetes-client/blob/main/doc/CHEATSHEET.md

Fix History

  • 2025-12-02 - client.load().serverSideApply() fails - attempted: load items first, then apply via client.resource(item)
  • 2025-12-02 - client.resource(item).serverSideApply() also fails - attempted: cast items to typed classes (e.g., item as Namespace)
  • 2025-12-02 - item as Namespace causes ClassCastException - use typed loaders directly (client.namespaces().load(stream))
  • 2025-12-02 - patch operation fails for Namespace - fixed: add forceConflicts() before serverSideApply()