easy-db-lab
easy-db-lab creates lab environments for database evaluations in AWS. It provisions infrastructure, deploys databases, and sets up a full observability stack so you can focus on testing, benchmarking, and learning.
Supported databases include Apache Cassandra, ClickHouse, and OpenSearch, with Apache Spark available for analytics workloads.
If you are looking for a tool to aid in stress testing Cassandra clusters, see the companion project cassandra-easy-stress.
If you're looking for tools to help manage Cassandra in production environments, please see Reaper, cstar, and K8ssandra.
Quick Start
- Install easy-db-lab
- Set up your profile: run `easy-db-lab setup-profile`
- Follow the tutorial
Features
Database Support
- Apache Cassandra: Versions 3.0, 3.11, 4.0, 4.1, 5.0, and trunk builds. Includes custom build support, Cassandra Sidecar, and integration with cassandra-easy-stress for benchmarking.
- ClickHouse: Sharded clusters with configurable replication, distributed tables, and S3-tiered storage.
- OpenSearch: AWS OpenSearch domains for search and analytics.
- Apache Spark: EMR-based Spark clusters for analytics workloads.
AWS Integration
- EC2 Provisioning: Automated provisioning with configurable instance types
- EBS Storage: Optional EBS volumes for persistent storage
- S3 Backup: Automatic backup of configurations and state to S3
- IAM Integration: Managed IAM policies for secure operations
Kubernetes (K3s)
- Lightweight K3s: Automatic K3s cluster deployment across all nodes
- kubectl/k9s: Pre-configured access with SOCKS5 proxy support
- Private Registry: HTTPS Docker registry for custom images
- Jib Integration: Push custom containers directly from Gradle
Monitoring and Observability
- VictoriaMetrics: Time-series database for metrics storage
- VictoriaLogs: Centralized log aggregation
- Grafana: Pre-configured dashboards for Cassandra, ClickHouse, and system metrics
- OpenTelemetry: Distributed tracing and metrics collection
- AxonOps: Optional integration with AxonOps for Cassandra monitoring and management
Developer Experience
- Shell Aliases: Convenient shortcuts for cluster management (`c0`, `c-all`, `c-status`, etc.)
- Server: Integration with Claude Code for AI-assisted operations
- Restore Support: Recover cluster state from VPC ID or S3 backup
- SOCKS5 Proxy: Secure access to private cluster resources
Stress Testing
- cassandra-easy-stress: Native integration with Apache stress testing tool
- Kubernetes Jobs: Run stress tests as K8s jobs for scalability
- Artifact Collection: Automatic collection of metrics and diagnostics
Prerequisites
Before using easy-db-lab, ensure you have the following:
System Requirements
| Requirement | Details |
|---|---|
| Operating System | macOS or Linux |
| Java | JDK 21 or later |
| Docker | Required for building custom AMIs |
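As a quick preflight, you can confirm the required binaries are on your PATH. This is a convenience sketch, not part of easy-db-lab, and it only checks presence, not versions:

```shell
# Preflight sketch: check that the prerequisites above are installed.
# (Only checks PATH presence, not JDK 21+ specifically.)
for tool in java docker; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: missing"
  fi
done
```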
AWS Requirements
- AWS Account: A dedicated AWS account is recommended for lab environments
- AWS Access Key & Secret: Credentials for programmatic access
- IAM Permissions: Permissions to create EC2, IAM, S3, and optionally EMR resources
Run easy-db-lab show-iam-policies to see the exact IAM policies required with your account ID populated. See Setup for details.
Optional
- AxonOps Account: For free Cassandra monitoring. Create an account at axonops.com
Next Steps
Run the interactive setup to configure your profile:
easy-db-lab setup-profile
See the Setup Guide for detailed instructions.
Installation
Tarball Install
You can grab a tarball from the releases page.
To get started, add the bin directory of easy-db-lab to your $PATH. For example:
export PATH="$PATH:/path/to/easy-db-lab/bin"
cd /path/to/easy-db-lab
./gradlew assemble
Building from Source
If you prefer to build from source:
git clone https://github.com/rustyrazorblade/easy-db-lab.git
cd easy-db-lab
./gradlew assemble
The built distribution will be in build/distributions/.
Setup
This guide walks you through the initial setup of easy-db-lab, including AWS credentials configuration, IAM policies, and AMI creation.
Overview
The setup-profile command handles all initial configuration interactively. It will:
- Collect your email and AWS credentials
- Validate your AWS access
- Create necessary AWS resources (key pair, IAM roles, Packer VPC)
- Build or validate the required AMI
Prerequisites
Before running setup:
- AWS Account: An AWS account with appropriate permissions (see IAM Policies below)
- Java 21+: Required to run easy-db-lab
- Docker: Required only if building custom AMIs
Step 1: Run Setup Profile
Run the interactive setup:
easy-db-lab setup-profile
Or use the shorter alias:
easy-db-lab setup
The setup wizard will prompt you for:
| Prompt | Description | Default |
|---|---|---|
| Email | Used to tag AWS resources for ownership | (required) |
| AWS Region | Region for your clusters | us-west-2 |
| AWS Access Key | Your AWS access key ID | (required) |
| AWS Secret Key | Your AWS secret access key | (required) |
| AxonOps Org | Optional: AxonOps organization name | (skip) |
| AxonOps Key | Optional: AxonOps API key | (skip) |
| AWS Profile | Optional: Named AWS profile | (skip) |
What Gets Created
During setup, the following AWS resources are created:
- EC2 Key Pair: For SSH access to instances
- IAM Role: For instance permissions (`easy-db-lab-instance-role`)
- Packer VPC: Infrastructure for building AMIs
- AMI (if needed): Takes 10-15 minutes to build
Configuration Location
Your profile is saved to:
~/.easy-db-lab/profiles/default/settings.yaml
Use a different profile by setting the `EASY_DB_LAB_PROFILE` environment variable before running setup.
Step 2: Getting IAM Policies
If you need to request permissions from your AWS administrator, use the show-iam-policies command to display the required policies with your account ID populated:
easy-db-lab show-iam-policies
This displays three policies:
| Policy | Purpose |
|---|---|
| EC2 | Create/manage EC2 instances, VPCs, security groups |
| IAM | Create instance roles and profiles |
| EMR | Create Spark clusters (optional) |
Filter by Policy Name
To show a specific policy:
easy-db-lab show-iam-policies ec2 # Show EC2 policy only
easy-db-lab show-iam-policies iam # Show IAM policy only
easy-db-lab show-iam-policies emr # Show EMR policy only
Recommended IAM Setup
For teams with multiple users, we recommend creating managed policies attached to an IAM group:
- Create an IAM group (e.g., "EasyDBLabUsers")
- Create three managed policies from the JSON output
- Attach all policies to the group
- Add users to the group
Inline policies have a 5,120-byte limit, which may not fit all three policies. Use managed policies instead.
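The steps above map onto a handful of AWS CLI calls. The sketch below only prints the commands so you can review them before running anything; the group name, policy names, and JSON file names are illustrative assumptions, and it presumes you've saved each policy's JSON from `show-iam-policies` to a local file:

```shell
# Print (don't run) the aws CLI calls for the recommended managed-policy
# setup. Group/policy/file names below are assumptions for illustration.
group="EasyDBLabUsers"
echo "aws iam create-group --group-name $group"
for p in ec2 iam emr; do
  echo "aws iam create-policy --policy-name easy-db-lab-$p --policy-document file://$p-policy.json"
  echo "aws iam attach-group-policy --group-name $group --policy-arn arn:aws:iam::<account-id>:policy/easy-db-lab-$p"
done
echo "aws iam add-user-to-group --group-name $group --user-name <your-user>"
```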
Step 3: Build Custom AMI (Optional)
If setup couldn't find a valid AMI for your architecture, or if you want to customize the base image, build one manually:
easy-db-lab build-image
Build Options
| Option | Description | Default |
|---|---|---|
| `--arch` | CPU architecture (AMD64 or ARM64) | AMD64 |
| `--region` | AWS region for the AMI | (from profile) |
Examples
# Build AMD64 AMI (default)
easy-db-lab build-image
# Build ARM64 AMI for Graviton instances
easy-db-lab build-image --arch ARM64
# Build in specific region
easy-db-lab build-image --region eu-west-1
Environment Variables
| Variable | Description | Default |
|---|---|---|
| `EASY_DB_LAB_USER_DIR` | Override configuration directory | `~/.easy-db-lab` |
| `EASY_DB_LAB_PROFILE` | Use a named profile | `default` |
| `EASY_DB_LAB_INSTANCE_TYPE` | Default instance type for init | `r3.2xlarge` |
| `EASY_DB_LAB_STRESS_INSTANCE_TYPE` | Default stress instance type | `c7i.2xlarge` |
| `EASY_DB_LAB_AMI` | Override AMI ID | (auto-detected) |
Verify Installation
After setup completes, verify by running:
easy-db-lab
You should see the help output with available commands.
Next Steps
Once setup is complete, follow the Tutorial to create your first cluster.
Cluster Setup
This page provides a quick reference for cluster initialization and provisioning. For a complete walkthrough, see the Tutorial.
Quick Start
# Initialize a 3-node cluster with i4i.xlarge instances and 1 stress node
easy-db-lab init my-cluster --db 3 --instance i4i.xlarge --app 1
# Provision AWS infrastructure
easy-db-lab up
# Set up your shell environment
source env.sh
Or combine init and up:
easy-db-lab init my-cluster --db 3 --instance i4i.xlarge --app 1 --up
Initialize
The init command creates local configuration files but does not provision AWS resources.
easy-db-lab init <cluster-name> [options]
Common Options
| Option | Description | Default |
|---|---|---|
| `--db`, `-c` | Number of Cassandra instances | 3 |
| `--stress`, `-s` | Number of stress instances | 0 |
| `--instance`, `-i` | Instance type | r3.2xlarge |
| `--ebs.type` | EBS volume type (NONE, gp2, gp3) | NONE |
| `--ebs.size` | EBS volume size in GB | 256 |
| `--arch`, `-a` | CPU architecture (AMD64, ARM64) | AMD64 |
| `--up` | Auto-provision after init | false |
For the complete options list, see the Tutorial or run easy-db-lab init --help.
Storage Requirements
Database instances need a data disk separate from the root volume. This can come from either:
- Instance store (local NVMe) — Instance types with a `d` suffix (e.g., `i3.xlarge`, `m5d.xlarge`, `c5d.2xlarge`) include local NVMe storage.
- EBS volumes — Attach an EBS volume using `--ebs.type` (e.g., `--ebs.type gp3`).

If the selected instance type has no instance store and `--ebs.type` is not specified, `up` will fail with an error. For example, `c5.2xlarge` has no local storage, so you must specify EBS:
easy-db-lab init my-cluster --instance c5.2xlarge --ebs.type gp3 --ebs.size 200
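If you're unsure whether a given type carries local NVMe, the `d` suffix convention can be checked mechanically. The sketch below is a rough heuristic only (the family list is illustrative, not exhaustive; the EC2 API is the authoritative source):

```shell
# Heuristic sketch: does this instance type likely include instance-store
# NVMe? Based on the "d" suffix convention plus a few storage-optimized
# families. Not exhaustive -- older families like r3 are not covered.
has_instance_store() {
  family="${1%%.*}"            # strip the size: m5d.xlarge -> m5d
  case "$family" in
    i3|i3en|i4i|im4gn|is4gen) return 0 ;;   # storage-optimized families
    *d|*dn) return 0 ;;                      # "d" suffix => local NVMe
    *) return 1 ;;
  esac
}
has_instance_store c5.2xlarge || echo "c5.2xlarge needs --ebs.type"
```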
Launch
The up command provisions all AWS infrastructure:
easy-db-lab up
What Gets Created
- S3 bucket for cluster state
- VPC with subnets and security groups
- EC2 instances (Cassandra, Stress, Control nodes)
  - Control node: `m5d.xlarge` (NVMe-backed instance; K3s data is stored on NVMe to avoid filling the root volume)
- K3s cluster across all nodes (Cassandra, Stress, Control)
Options
| Option | Description |
|---|---|
--no-setup, -n | Skip K3s and AxonOps setup |
Shut Down
Destroy all cluster infrastructure:
easy-db-lab down
Next Steps
After your cluster is running:
- Configure Cassandra - Select version and apply configuration
- Shell Aliases - Set up convenient shortcuts
Tutorial: Getting Started
This tutorial walks you through creating a database cluster from scratch, covering initialization, infrastructure provisioning, and database configuration. The examples below use Cassandra, but the same infrastructure supports ClickHouse, OpenSearch, and Spark.
Before starting, ensure you've completed the Setup process by running easy-db-lab setup-profile.
Part 1: Initialize Your Cluster
The init command creates local configuration files for your cluster. It does not provision AWS resources yet.
easy-db-lab init my-cluster
This creates a 3-node Cassandra cluster by default.
Init Options
| Option | Description | Default |
|---|---|---|
| `--db`, `--cassandra`, `-c` | Number of Cassandra instances | 3 |
| `--app`, `--stress`, `-s` | Number of stress/application instances | 0 |
| `--instance`, `-i` | Cassandra instance type | r3.2xlarge |
| `--stress-instance`, `-si` | Stress instance type | c7i.2xlarge |
| `--azs`, `-z` | Availability zones (e.g., a,b,c) | all available |
| `--arch`, `-a` | CPU architecture (AMD64, ARM64) | AMD64 |
| `--ebs.type` | EBS volume type (NONE, gp2, gp3, io1, io2) | NONE |
| `--ebs.size` | EBS volume size in GB | 256 |
| `--ebs.iops` | EBS IOPS (gp3 only) | 0 |
| `--ebs.throughput` | EBS throughput (gp3 only) | 0 |
| `--until` | When instances can be deleted | tomorrow |
| `--tag` | Custom tags (key=value, repeatable) | - |
| `--vpc` | Use existing VPC ID | - |
| `--up` | Auto-provision after init | false |
| `--clean` | Remove existing config first | false |
Examples
Basic 3-node cluster:
easy-db-lab init my-cluster
5-node cluster with 2 stress nodes:
easy-db-lab init my-cluster --db 5 --stress 2
Production-like cluster with EBS storage:
easy-db-lab init prod-test --db 5 --ebs.type gp3 --ebs.size 500 --ebs.iops 3000
ARM64 cluster for Graviton instances:
easy-db-lab init my-cluster --arch ARM64 --instance r7g.2xlarge
Initialize and provision in one step:
easy-db-lab init my-cluster --up
Part 2: Launch Infrastructure
Once initialized, provision the AWS infrastructure:
easy-db-lab up
This command creates:
- S3 Storage: Cluster data stored under a dedicated prefix in the account S3 bucket
- VPC: With subnets and security groups
- EC2 Instances: Cassandra nodes, stress nodes, and a control node
- K3s Cluster: Lightweight Kubernetes across all nodes
What Happens During up
- Configures account S3 bucket with cluster prefix
- Creates VPC with public subnets in your availability zones
- Provisions EC2 instances in parallel
- Waits for SSH availability
- Configures K3s cluster on all nodes
- Writes SSH config and environment files
Up Options
| Option | Description |
|---|---|
--no-setup, -n | Skip K3s setup and AxonOps configuration |
Environment Setup
After up completes, source the environment file:
source env.sh
This configures your shell with:
- SSH shortcuts: `ssh db0`, `ssh db1`, `ssh stress0`, etc.
- Cluster aliases: `c0`, `c-all`, `c-status`
- SOCKS proxy configuration
See Shell Aliases for all available shortcuts.
Part 3: Configure Cassandra 5.0
With infrastructure running, configure and start Cassandra.
Step 1: Select Cassandra Version
easy-db-lab cassandra use 5.0
This command:
- Sets the active Cassandra version on all nodes
- Downloads configuration files to your local directory
- Applies any existing patch configuration
Available versions: 3.0, 3.11, 4.0, 4.1, 5.0, 5.0-HEAD, trunk
Step 2: Customize Configuration (Optional)
Edit cassandra.patch.yaml to customize settings:
# Example: Change token count
vim cassandra.patch.yaml
Common customizations:
| Setting | Description | Default |
|---|---|---|
| `num_tokens` | Virtual nodes per instance | 4 |
| `concurrent_reads` | Max concurrent read operations | 64 |
| `concurrent_writes` | Max concurrent write operations | 64 |
| `endpoint_snitch` | Network topology snitch | Ec2Snitch |
Step 3: Apply Configuration
easy-db-lab cassandra update-config
This uploads and applies the patch to all Cassandra nodes.
To apply and restart Cassandra in one command:
easy-db-lab cassandra update-config --restart
Step 4: Start Cassandra
easy-db-lab cassandra start
Step 5: Verify Cluster
Check cluster status:
ssh db0 nodetool status
Or use the shell alias (after sourcing env.sh):
c-status
You should see all nodes in UN (Up/Normal) state.
Part 4: Working with Your Cluster
SSH Access
After sourcing env.sh:
ssh db0 # First Cassandra node
ssh db1 # Second Cassandra node
ssh stress0 # First stress node (if provisioned)
ssh control0 # Control node
Cassandra Management
# Stop Cassandra on all nodes
easy-db-lab cassandra stop
# Start Cassandra on all nodes
easy-db-lab cassandra start
# Restart Cassandra on all nodes
easy-db-lab cassandra restart
Filter to Specific Hosts
Most commands support the --hosts filter:
# Apply config only to db0 and db1
easy-db-lab cassandra update-config --hosts db0,db1
# Restart only db2
easy-db-lab cassandra restart --hosts db2
Download Configuration Files
To download the current configuration from nodes:
easy-db-lab cassandra download-config
This saves configuration files to a local directory named after the version (e.g., 5.0/).
Part 5: Shut Down
When finished, destroy the cluster infrastructure:
easy-db-lab down
This permanently destroys all EC2 instances, the VPC, and associated resources. S3 data under the cluster prefix is scheduled for expiration (default: 1 day).
Quick Reference
| Task | Command |
|---|---|
| Initialize cluster | easy-db-lab init <name> [options] |
| Provision infrastructure | easy-db-lab up |
| Initialize and provision | easy-db-lab init <name> --up |
| Select Cassandra version | easy-db-lab cassandra use <version> |
| Apply configuration | easy-db-lab cassandra update-config |
| Start Cassandra | easy-db-lab cassandra start |
| Stop Cassandra | easy-db-lab cassandra stop |
| Restart Cassandra | easy-db-lab cassandra restart |
| Check cluster status | ssh db0 nodetool status |
| Download config | easy-db-lab cassandra download-config |
| Destroy cluster | easy-db-lab down |
| Display hosts | easy-db-lab hosts |
| Clean local files | easy-db-lab clean |
Next Steps
- Kubernetes Access - Access K3s cluster with kubectl and k9s
- Shell Aliases - All available CLI shortcuts
- ClickHouse - Deploy ClickHouse for analytics
- Spark - Set up Apache Spark via EMR
Kubernetes
easy-db-lab uses K3s to provide a lightweight Kubernetes cluster for deploying supporting services like ClickHouse, monitoring, and stress testing workloads.
Overview
K3s is automatically installed on all nodes during provisioning:
- Control node: Runs the K3s server (Kubernetes control plane)
- Cassandra nodes: Run as K3s agents with label `type=db`
- Stress nodes: Run as K3s agents with label `type=app`
Accessing the Cluster
kubectl
After running source env.sh, kubectl is automatically configured:
source env.sh
kubectl get nodes
kubectl get pods -A
The kubeconfig is downloaded to your working directory and kubectl is configured to use the SOCKS5 proxy for connectivity.
k9s
k9s provides a terminal-based UI for Kubernetes:
source env.sh
k9s
k9s is pre-configured to use the correct kubeconfig and proxy settings.
Port Forwarding
easy-db-lab uses a SOCKS5 proxy for accessing the private Kubernetes cluster.
Starting the Proxy
The proxy starts automatically when you source the environment:
source env.sh
Manual Proxy Control
# Start the SOCKS5 proxy
start-socks5
# Check proxy status
socks5-status
# Stop the proxy
stop-socks5
Running Commands Through the Proxy
Commands like kubectl and k9s automatically use the proxy. For other commands:
# Route any command through the proxy
with-proxy curl http://10.0.1.50:8080/api
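Functionally, `with-proxy` behaves like setting a SOCKS proxy variable for a single command. A minimal sketch of the idea (the real wrapper's implementation may differ; port 1080 matches the default SOCKS port documented below):

```shell
# Rough equivalent of with-proxy: export ALL_PROXY (honored by curl and
# similar tools) for exactly one command. The socks5h scheme resolves
# hostnames through the proxy, which is needed for names like control0.
with-proxy() {
  ALL_PROXY=socks5h://localhost:1080 "$@"
}
```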
Pushing Docker Images with Jib
easy-db-lab includes a private Docker registry accessible via HTTPS. You can push custom images using Jib.
Gradle Configuration
Add Jib to your build.gradle.kts:
plugins {
id("com.google.cloud.tools.jib") version "3.4.0"
}
jib {
from {
image = "eclipse-temurin:21-jre"
}
to {
// Use the control node's registry
image = "control0:5000/my-app"
tags = setOf("latest", project.version.toString())
}
container {
mainClass = "com.example.MainKt"
}
}
Pushing to the Registry
# Build and push to the cluster registry
./gradlew jib
# Or build locally first
./gradlew jibDockerBuild
Using Images in Kubernetes
Reference your pushed images in Kubernetes manifests:
apiVersion: v1
kind: Pod
metadata:
name: my-app
spec:
containers:
- name: my-app
image: control0:5000/my-app:latest
Node Labels
Nodes are automatically labeled for workload scheduling:
| Node Type | Labels |
|---|---|
| Cassandra | type=db |
| Stress | type=app |
| Control | (no labels) |
Using Node Selectors
Schedule pods on specific node types:
apiVersion: v1
kind: Pod
metadata:
name: stress-worker
spec:
nodeSelector:
type: app
containers:
- name: worker
image: my-stress-tool:latest
Useful Commands
# List all nodes
kubectl get nodes
# List pods in all namespaces
kubectl get pods -A
# Watch pod status
kubectl get pods -w
# View logs
kubectl logs <pod-name>
# Execute command in pod
kubectl exec -it <pod-name> -- /bin/bash
# Port forward a service locally
kubectl port-forward svc/my-service 8080:80
Architecture
Networking
- K3s server runs on the control node
- All nodes communicate over the private VPC network
- External access is via SOCKS5 proxy through the control node
Storage
- Local path provisioner for persistent volumes
- Data stored on node-local NVMe drives at `/mnt/db1/`
Kubeconfig
The kubeconfig file is:
- Downloaded automatically during cluster setup
- Stored as `kubeconfig` in your working directory
- Backed up to S3 for recovery
Network Connectivity
This guide covers how to connect to your easy-db-lab cluster from your local machine.
Overview
easy-db-lab clusters run in a private AWS VPC. By default, the VPC uses 10.0.0.0/16, but you can customize this:
easy-db-lab init --cidr 10.14.0.0/20 ...
There are two methods to access your cluster:
| Method | Best For |
|---|---|
| Tailscale VPN (Recommended) | Production use, team sharing, persistent access |
| SOCKS Proxy | Quick testing when you don't want to set up Tailscale |
Tailscale VPN (Recommended)
Tailscale provides a persistent VPN connection to your cluster. Once connected, you can access cluster resources directly—no proxy configuration needed.
Why Tailscale?
- Native access - Use any tool (browsers, kubectl, ssh) without proxy configuration
- Persistent - Connection survives terminal sessions
- Team sharing - Share cluster access with teammates
- Reliable - No SSH tunnels to maintain or reconnect
Setup (One-Time)
Step 1: Configure Tailscale ACL
Go to Tailscale ACL Editor and add:
{
"tagOwners": {
"tag:easy-db-lab": ["autogroup:admin"]
},
"autoApprovers": {
"routes": {
"10.0.0.0/8": ["tag:easy-db-lab"]
}
}
}
The autoApprovers section automatically approves subnet routes, so you don't need to manually approve each cluster.
Step 2: Create OAuth Client
- Go to Tailscale OAuth Settings
- Click Generate OAuth Client
- Configure:
- Description: easy-db-lab
- Scopes: Select Devices: Write
- Tags: Add
tag:easy-db-lab
- Click Generate and save the Client ID and Client Secret
Step 3: Configure easy-db-lab
easy-db-lab setup-profile
Enter your Tailscale OAuth credentials when prompted.
Usage
Tailscale starts automatically with easy-db-lab up. Once connected:
# Direct access to private IPs
ssh ubuntu@10.0.1.50
curl http://10.0.1.50:9428/health
kubectl get pods
# Web UIs work directly in your browser
# http://10.0.1.50:3000 (Grafana)
Manual Control
easy-db-lab tailscale start
easy-db-lab tailscale status
easy-db-lab tailscale stop
Troubleshooting Tailscale
"requested tags are invalid or not permitted" - Add the tag to your ACL (Step 1).
Can't reach private IPs - Check subnet route is approved in Tailscale admin, or add autoApprovers to your ACL.
Using a custom tag:
easy-db-lab tailscale start --tag tag:my-custom-tag
SOCKS Proxy (Alternative)
If you don't want to set up Tailscale, the SOCKS proxy provides connectivity via an SSH tunnel through the control node.
┌─────────────────┐ SSH Tunnel ┌──────────────┐
│ Your Machine │ ──────────────────► │ Control Node │
│ localhost:1080 │ │ (control0) │
└────────┬────────┘ └──────┬───────┘
│ │
SOCKS5 Proxy Private VPC
│ │
▼ ▼
kubectl, curl VPC network
Quick Start
source env.sh
kubectl get pods
curl http://control0:9428/health
The proxy starts automatically when you load the environment.
Proxied Commands
These commands are automatically configured to use the proxy after source env.sh:
| Command | Description |
|---|---|
| `kubectl` | Kubernetes CLI |
| `k9s` | Kubernetes TUI |
| `curl` | HTTP client |
| `skopeo` | Container image tool |
Manual Proxy Usage
For other commands, use the with-proxy wrapper:
with-proxy wget http://10.0.1.50:8080/api
with-proxy http http://control0:3000/api/health
Browser Access
Configure your browser's SOCKS5 proxy:
| Setting | Value |
|---|---|
| SOCKS Host | localhost |
| SOCKS Port | 1080 |
| SOCKS Version | 5 |
Then access cluster services:
- Grafana: `http://control0:3000`
- Victoria Metrics: `http://control0:8428`
- Victoria Logs: `http://control0:9428`
Proxy Management
start-socks5 # Start proxy
start-socks5 1081 # Start on different port
socks5-status # Check status
stop-socks5 # Stop proxy
Troubleshooting SOCKS Proxy
"Connection refused" errors:
socks5-status # Check if running
start-socks5 # Start if needed
ssh control0 hostname # Verify SSH works
Proxy not working after network change:
stop-socks5
source env.sh
Port already in use:
lsof -i :1080 # Check what's using it
start-socks5 1081 # Use different port
Commands timing out:
- Check cluster status: `easy-db-lab status`
- Verify SSH works: `ssh control0 hostname`
- Restart proxy: `stop-socks5 && start-socks5`
Comparison
| Feature | Tailscale | SOCKS Proxy |
|---|---|---|
| Setup time | ~10 min (one-time) | Instant |
| Persistence | Persistent | Per-session |
| Requires `source env.sh` | No | Yes |
| Browser access | Direct | Requires proxy config |
| Team sharing | Yes | No |
| External dependency | Tailscale account | None |
Shell Aliases
After running source env.sh, you get access to several helpful aliases and functions for managing your cluster.
SSH Aliases
SSH aliases for all Cassandra nodes are automatically created as `c0` through `cN`. The `ssh` command is not required. For example:
c0 nodetool status
This runs nodetool status on the first Cassandra node.
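Conceptually, these aliases are thin wrappers over ssh. A hedged sketch of what env.sh's helpers amount to (the actual definitions may differ; the host list and the `SSH` override variable are illustrative so the sketch can be exercised without a live cluster):

```shell
# Illustrative reimplementation of the c0 / c-all helpers. The SSH
# variable exists only so this sketch can be tested without real nodes.
hosts="db0 db1 db2"
SSH=${SSH:-ssh}
c0() { $SSH db0 "$@"; }
c-all() {
  for h in $hosts; do
    echo "== $h =="
    $SSH "$h" "$@"
  done
}
```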
Cluster Management Functions
| Command | Description |
|---|---|
| `c-all` | Executes a command on every node in the cluster sequentially |
| `c-start` | Starts Cassandra on all nodes |
| `c-restart` | Restarts Cassandra on all nodes (not a graceful operation) |
| `c-status` | Executes `nodetool status` on db0 |
| `c-tpstats` | Executes `nodetool tpstats` on all nodes |
| `c-collect-artifacts` | Collects metrics, nodetool output, and system information |
Examples
Run a command on all nodes
c-all "df -h"
Check cluster status
c-status
Collect artifacts for performance testing
c-collect-artifacts my-test-run
This is useful when doing performance testing to capture the state of the system at a given moment.
Graceful Rolling Restarts
For true rolling restarts, we recommend using cstar instead of c-restart.
Configuring Cassandra
This page covers Cassandra version management and configuration. For a step-by-step walkthrough, see the Tutorial.
Supported Versions
easy-db-lab supports the following Cassandra versions:
| Version | Java | Notes |
|---|---|---|
| 3.0 | 8 | Legacy support |
| 3.11 | 8 | Stable release |
| 4.0 | 11 | First 4.x release |
| 4.1 | 11 | Current LTS |
| 5.0 | 11 | Latest stable (recommended) |
| 5.0-HEAD | 11 | Nightly build from 5.0 branch |
| trunk | 17 | Development branch |
Quick Start
# Select Cassandra 5.0
easy-db-lab cassandra use 5.0
# Generate configuration patch
easy-db-lab cassandra write-config
# Apply configuration and start
easy-db-lab cassandra update-config
easy-db-lab cassandra start
# Verify cluster
ssh db0 nodetool status
Version Management
Select a Version
easy-db-lab cassandra use <version>
Examples:
easy-db-lab cassandra use 5.0 # Latest stable
easy-db-lab cassandra use 4.1 # LTS version
easy-db-lab cassandra use trunk # Development branch
This command:
- Sets the active Cassandra version on all nodes
- Downloads current configuration files locally
- Applies any existing `cassandra.patch.yaml`
Specify Java Version
easy-db-lab cassandra use 5.0 --java 11
List Available Versions
easy-db-lab ls
Configuration
The Patch File
Cassandra configuration uses a patch file approach. The cassandra.patch.yaml file contains only the settings you want to customize, which are merged with the default cassandra.yaml.
Generate a new patch file:
easy-db-lab cassandra write-config
Options:
- `-t`, `--tokens`: Number of tokens (default: 4)
Example patch file:
cluster_name: "my-cluster"
num_tokens: 4
concurrent_reads: 64
concurrent_writes: 64
trickle_fsync: true
The following settings are automatically managed by easy-db-lab. Including them in your patch file may cause problems:
- `listen_address`, `rpc_address` — injected with each node's private IP
- `seed_provider` / `seeds` — configured automatically based on cluster topology
- `hints_directory`, `data_file_directories`, `commitlog_directory` — set based on the cluster's disk configuration
Apply Configuration
easy-db-lab cassandra update-config
Options:
- `--restart`, `-r`: Restart Cassandra after applying
- `--hosts`: Filter to specific hosts
Apply and restart in one command:
easy-db-lab cassandra update-config --restart
Download Configuration
Download current configuration files from nodes:
easy-db-lab cassandra download-config
Files are saved to a local directory named after the version (e.g., 5.0/).
Starting and Stopping
# Start on all nodes
easy-db-lab cassandra start
# Stop on all nodes
easy-db-lab cassandra stop
# Restart on all nodes
easy-db-lab cassandra restart
# Target specific hosts
easy-db-lab cassandra start --hosts db0,db1
Cassandra Sidecar
The Apache Cassandra Sidecar is automatically installed and started alongside Cassandra. The sidecar provides:
- REST API for Cassandra operations
- S3 import/restore capabilities
- Streaming data operations
- Metrics collection (Prometheus-compatible)
Sidecar Access
The sidecar runs on port 9043 on each Cassandra node:
# Check sidecar health
curl http://<cassandra-node-ip>:9043/api/v1/__health
Sidecar Management
The sidecar is managed via systemd:
# Check status
ssh db0 sudo systemctl status cassandra-sidecar
# Restart
ssh db0 sudo systemctl restart cassandra-sidecar
Sidecar Configuration
Configuration is located at /etc/cassandra-sidecar/cassandra-sidecar.yaml on each node. Key settings:
- Cassandra connection details
- Data directory paths
- Traffic shaping and throttling
- S3 integration settings
Custom Builds
To use a custom Cassandra build from source:
Build from Repository
easy-db-lab cassandra build -n my-build /path/to/cassandra-repo
Use Custom Build
easy-db-lab cassandra use my-build
Next Steps
- Tutorial - Complete walkthrough
- Shell Aliases - Convenient shortcuts for Cassandra management
ClickHouse
easy-db-lab supports deploying ClickHouse clusters on Kubernetes for analytics workloads alongside your Cassandra cluster.
Overview
ClickHouse is deployed as a StatefulSet on K3s with ClickHouse Keeper for distributed coordination. The deployment requires a minimum of 3 nodes.
Quick Start
Create a 6-node cluster and deploy ClickHouse with 2 shards:
# Initialize and start a 6-node cluster
easy-db-lab init my-cluster --db 6 --up
# Deploy ClickHouse (2 shards x 3 replicas)
easy-db-lab clickhouse start
Configuring ClickHouse
Use clickhouse init to configure ClickHouse settings before starting the cluster:
# Configure S3 cache size (default: 10Gi)
easy-db-lab clickhouse init --s3-cache 50Gi
# Disable write-through caching
easy-db-lab clickhouse init --s3-cache-on-write false
| Option | Description | Default |
|---|---|---|
| `--s3-cache` | Size of the local S3 cache | 10Gi |
| `--s3-cache-on-write` | Cache data during write operations | true |
| `--s3-tier-move-factor` | Move data to the S3 tier when local disk free space falls below this fraction (0.0-1.0) | 0.2 |
| `--replicas-per-shard` | Number of replicas per shard | 3 |
Configuration is saved to the cluster state and applied when you run clickhouse start.
Starting ClickHouse
To deploy ClickHouse on an existing cluster:
easy-db-lab clickhouse start
Options
| Option | Description | Default |
|---|---|---|
| `--timeout` | Seconds to wait for pods to be ready | 300 |
| `--skip-wait` | Skip waiting for pods to be ready | false |
| `--replicas` | Number of ClickHouse server replicas | Number of db nodes |
| `--replicas-per-shard` | Number of replicas per shard | 3 |
Example with Custom Settings
# 6 nodes with 3 replicas per shard = 2 shards
easy-db-lab clickhouse start --replicas 6 --replicas-per-shard 3
# 9 nodes with 3 replicas per shard = 3 shards
easy-db-lab clickhouse start --replicas 9 --replicas-per-shard 3
Cluster Topology
ClickHouse is deployed with a sharded, replicated architecture. The total number of replicas must be divisible by --replicas-per-shard.
Shard and Replica Assignment
The cluster named easy_db_lab is automatically configured based on your replica count:
| Configuration | Shards | Replicas/Shard | Total Nodes |
|---|---|---|---|
| Default (3 nodes) | 1 | 3 | 3 |
| 6 nodes, 3/shard | 2 | 3 | 6 |
| 9 nodes, 3/shard | 3 | 3 | 9 |
| 6 nodes, 2/shard | 3 | 2 | 6 |
Pod-to-Node Pinning
Each ClickHouse pod is pinned to a specific database node using Local PersistentVolumes with node affinity:
- `clickhouse-0` always runs on `db0`
- `clickhouse-1` always runs on `db1`
- `clickhouse-N` always runs on `dbN`
This guarantees:
- Consistent shard assignment - A pod's shard is calculated from its ordinal: `shard = (ordinal / replicas_per_shard) + 1`
- Data locality - Data stored on a node stays with that node across pod restarts
- Predictable performance - No data movement when pods restart
Shard Calculation Example
With 6 replicas and 3 replicas per shard:
| Pod | Ordinal | Shard | Node |
|---|---|---|---|
| clickhouse-0 | 0 | 1 | db0 |
| clickhouse-1 | 1 | 1 | db1 |
| clickhouse-2 | 2 | 1 | db2 |
| clickhouse-3 | 3 | 2 | db3 |
| clickhouse-4 | 4 | 2 | db4 |
| clickhouse-5 | 5 | 2 | db5 |
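The pinning and shard-assignment rules above can be sketched in a few lines (an illustrative helper, not part of easy-db-lab):

```python
def placement(ordinal: int, replicas: int, replicas_per_shard: int) -> tuple[int, str]:
    """Return (shard, node) for a ClickHouse pod, mirroring the rules above."""
    if replicas % replicas_per_shard != 0:
        raise ValueError("total replicas must be divisible by --replicas-per-shard")
    shard = (ordinal // replicas_per_shard) + 1  # integer division, 1-based shard numbers
    node = f"db{ordinal}"                        # clickhouse-N is pinned to dbN
    return shard, node

# Reproduce the 6-node / 3-per-shard table:
for i in range(6):
    shard, node = placement(i, 6, 3)
    print(f"clickhouse-{i}  shard={shard}  node={node}")
```

Running this prints the same pod-to-shard mapping as the table, which makes it easy to predict where data lands before sizing a cluster.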
Checking Status
To check the status of your ClickHouse cluster:
easy-db-lab clickhouse status
This displays:
- Pod status and health
- Access URLs for the Play UI and HTTP interface
- Native protocol connection details
Accessing ClickHouse
After deployment, ClickHouse is accessible via:
| Interface | URL/Port | Description |
|---|---|---|
| Play UI | http://<db-node-ip>:8123/play | Interactive web query interface |
| HTTP API | http://<db-node-ip>:8123 | REST API for queries |
| Native Protocol | <db-node-ip>:9000 | High-performance binary protocol |
Creating Tables
ClickHouse supports distributed, replicated tables that span multiple shards. The recommended pattern uses ReplicatedMergeTree for local replicated storage and Distributed for querying across shards.
Distributed Replicated Tables
Create a local replicated table on all nodes, then a distributed table for queries:
-- Step 1: Create local replicated table on all nodes
CREATE TABLE events_local ON CLUSTER easy_db_lab (
id UInt64,
timestamp DateTime,
event_type String,
data String
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY (timestamp, id)
SETTINGS storage_policy = 's3_main';
-- Step 2: Create distributed table for querying across all shards
CREATE TABLE events ON CLUSTER easy_db_lab AS events_local
ENGINE = Distributed(easy_db_lab, default, events_local, rand());
Key points:
- `ON CLUSTER easy_db_lab` runs the DDL on all nodes
- `{shard}` and `{replica}` are ClickHouse macros automatically set per node
- `ReplicatedMergeTree` replicates data within a shard using ClickHouse Keeper
- `Distributed` routes queries and inserts across shards
- `rand()` distributes inserts randomly; use a column for deterministic sharding
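To see why the choice of sharding expression matters, here is a simplified model of how a Distributed table routes an inserted row (an assumption-laden sketch: it models equal shard weights only, where routing reduces to the sharding expression modulo the shard count):

```python
import random

def route(sharding_key: int, num_shards: int) -> int:
    """Pick a 1-based shard for a row, assuming equal shard weights."""
    return (sharding_key % num_shards) + 1

# rand()-style sharding: each insert lands on a pseudo-random shard
print(route(random.randrange(2**32), 2))

# deterministic sharding on a column value: the same id always hits the same shard
assert route(42, 2) == route(42, 2)
```

Deterministic sharding keeps all rows for a key on one shard, which enables shard-local joins; random sharding gives the most even write distribution.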
Querying and Inserting
-- Insert through distributed table (auto-sharded)
INSERT INTO events VALUES (1, now(), 'click', '{"page": "/home"}');
-- Query across all shards
SELECT count(*) FROM events WHERE event_type = 'click';
-- Query a specific shard (via local table)
SELECT count(*) FROM events_local WHERE event_type = 'click';
Table Engine Comparison
| Engine | Use Case | Replication | Sharding |
|---|---|---|---|
| `MergeTree` | Single-node, no replication | No | No |
| `ReplicatedMergeTree` | Replicated within shard | Yes | No |
| `Distributed` | Query/insert across shards | Via underlying table | Yes |
Storage Policies
ClickHouse is configured with two storage policies. You select the policy when creating a table using the SETTINGS storage_policy clause.
Policy Comparison
| Aspect | local | s3_main | s3_tier |
|---|---|---|---|
| Storage Location | Local NVMe disks | S3 bucket with configurable local cache | Hybrid: starts local, moves to S3 when disk fills |
| Performance | Best latency, highest throughput | Higher latency, cache-dependent | Good initially, degrades as data moves to S3 |
| Capacity | Limited by disk size | Virtually unlimited | Virtually unlimited |
| Cost | Included in instance cost | S3 storage + request costs | S3 storage + request costs |
| Data Persistence | Lost when cluster is destroyed | Persists independently | Persists independently |
| Best For | Benchmarks, low-latency queries | Large datasets, cost-sensitive workloads | Mixed hot/cold workloads with automatic tiering |
Local Storage (local)
The default policy stores data on local NVMe disks attached to the database nodes. This provides the best performance for latency-sensitive workloads.
CREATE TABLE my_table (...)
ENGINE = MergeTree()
ORDER BY id
SETTINGS storage_policy = 'local';
If you omit the storage_policy setting, tables use local storage by default.
When to use local storage:
- Performance benchmarking where latency matters
- Temporary or experimental datasets
- Workloads with predictable data sizes that fit on local disks
- When you don't need data to persist after cluster teardown
S3 Storage (s3_main)
The S3 policy stores data in your configured S3 bucket with a local cache for frequently accessed data. The cache size defaults to 10Gi and can be configured with clickhouse init --s3-cache. Write-through caching is enabled by default (--s3-cache-on-write true), which caches data during writes so subsequent reads can be served from cache immediately. This is ideal for large datasets where storage cost matters more than latency.
Prerequisite: Your cluster must be initialized with an S3 bucket. Set this during init:
easy-db-lab init my-cluster --s3-bucket my-clickhouse-data
Then create tables with S3 storage:
CREATE TABLE my_table (...)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/default/my_table', '{replica}')
ORDER BY id
SETTINGS storage_policy = 's3_main';
When to use S3 storage:
- Large analytical datasets (terabytes+)
- Data that should persist across cluster restarts
- Cost-sensitive workloads where storage cost > compute cost
- Sharing data between multiple clusters
How the cache works:
- Hot (frequently accessed) data is cached locally for fast reads
- Cold data is fetched from S3 on demand
- Cache is automatically managed by ClickHouse
- First query on cold data will be slower; subsequent queries use cache
S3 Tiered Storage (s3_tier)
The S3 tiered policy starts with local storage and automatically moves data to S3 when local disk free space runs low, combining fast local performance for hot data with virtually unlimited S3 capacity for cold data.
Prerequisite: Your cluster must be initialized with an S3 bucket. Set this during init:
easy-db-lab init my-cluster --s3-bucket my-clickhouse-data
Configure the tiering behavior before starting ClickHouse:
# Move data to S3 when local disk free space falls below 20% (default)
easy-db-lab clickhouse init --s3-tier-move-factor 0.2
# More aggressive tiering - move when free space < 50%
easy-db-lab clickhouse init --s3-tier-move-factor 0.5
Then create tables with S3 tiered storage:
CREATE TABLE my_table (...)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/default/my_table', '{replica}')
ORDER BY id
SETTINGS storage_policy = 's3_tier';
When to use S3 tiered storage:
- Workloads with mixed hot/cold data access patterns
- Growing datasets that may outgrow local disk capacity
- When you want automatic cost optimization without manual intervention
- When you need local performance for recent data and S3 capacity for historical data
How automatic tiering works:
- New data is written to local disks first (fast writes)
- When local disk free space falls below the configured threshold (default: 20%), ClickHouse automatically moves the oldest data to S3
- Data on S3 is still queryable but with higher latency
- The local cache (configured with `--s3-cache`) helps performance for frequently accessed S3 data
- Manual moves are also possible: `ALTER TABLE my_table MOVE PARTITION tuple() TO DISK 's3'`
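The move-factor rule boils down to a free-space ratio check. A toy model (illustrative only, not ClickHouse internals):

```python
def should_move_to_s3(free_bytes: int, total_bytes: int, move_factor: float = 0.2) -> bool:
    """Trigger tiering once the free-space ratio drops below move_factor."""
    return free_bytes / total_bytes < move_factor

# 100 GiB disk with 15 GiB free -> 15% free, below the 20% default threshold
print(should_move_to_s3(15 * 2**30, 100 * 2**30))        # tiering kicks in
print(should_move_to_s3(30 * 2**30, 100 * 2**30))        # 30% free, still local
print(should_move_to_s3(30 * 2**30, 100 * 2**30, 0.5))   # aggressive 50% threshold
```

Raising `--s3-tier-move-factor` moves data to S3 sooner, trading local-disk headroom for more S3 reads.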
Stopping ClickHouse
To remove the ClickHouse cluster:
easy-db-lab clickhouse stop
This removes all ClickHouse pods, services, and associated resources from Kubernetes.
Monitoring
ClickHouse metrics are automatically integrated with the observability stack:
- Grafana Dashboard: Pre-configured dashboard for ClickHouse metrics
- Metrics Port: `9363` for Prometheus-compatible metrics
- Logs Dashboard: Dedicated dashboard for ClickHouse logs
Architecture
The ClickHouse deployment includes:
- ClickHouse Server: StatefulSet with configurable replicas
- ClickHouse Keeper: 3-node cluster for distributed coordination (ZooKeeper-compatible)
- Services: Headless services for internal communication
- ConfigMaps: Server and Keeper configuration
- Local PersistentVolumes: One PV per node for data locality
Storage Architecture
ClickHouse uses Local PersistentVolumes to guarantee pod-to-node pinning:
- During cluster creation, each `db` node is labeled with its ordinal (`easydblab.com/node-ordinal=0`, etc.)
- Local PVs are created with node affinity matching these ordinals
- PVs are pre-bound to specific PVCs (e.g., `data-clickhouse-0` binds to the PV on `db0`)
- The StatefulSet's volumeClaimTemplate requests storage from these pre-bound PVs
This ensures clickhouse-X always runs on dbX, providing:
- Consistent shard assignments across restarts
- Data locality (no network storage overhead)
- Predictable failover behavior
Ports
| Port | Purpose |
|---|---|
| 8123 | HTTP interface |
| 9000 | Native protocol |
| 9009 | Inter-server communication |
| 9363 | Metrics |
| 2181 | Keeper client |
| 9234 | Keeper Raft |
OpenSearch
AWS OpenSearch can be provisioned as a managed domain for full-text search and log analytics.
Commands
| Command | Description |
|---|---|
| `opensearch start` | Create an OpenSearch domain |
| `opensearch status` | Check domain status |
| `opensearch stop` | Delete the OpenSearch domain |
Starting OpenSearch
easy-db-lab opensearch start
This creates an AWS-managed OpenSearch domain linked to your cluster's VPC. The domain takes several minutes to provision.
Checking Status
easy-db-lab opensearch status
Stopping OpenSearch
easy-db-lab opensearch stop
This deletes the OpenSearch domain. Data stored in the domain will be lost.
Spark
easy-db-lab supports provisioning Apache Spark clusters via AWS EMR for analytics workloads.
Enabling Spark
There are two ways to enable Spark:
Option 1: During Init (before up)
Enable Spark during cluster initialization with the --spark.enable flag. The EMR cluster will be created automatically when you run up:
easy-db-lab init --spark.enable
easy-db-lab up
Init Spark Configuration Options
| Option | Description | Default |
|---|---|---|
| `--spark.enable` | Enable Spark EMR cluster | false |
| `--spark.master.instance.type` | Master node instance type | m5.xlarge |
| `--spark.worker.instance.type` | Worker node instance type | m5.xlarge |
| `--spark.worker.instance.count` | Number of worker nodes | 3 |
Example with Custom Configuration
easy-db-lab init \
--spark.enable \
--spark.master.instance.type m5.2xlarge \
--spark.worker.instance.type m5.4xlarge \
--spark.worker.instance.count 5
Option 2: After up (standalone spark init)
Add Spark to an existing environment that is already running. This is useful when you forgot to pass --spark.enable during init, or when you decide to add Spark later:
easy-db-lab spark init
Prerequisites: easy-db-lab init and easy-db-lab up must have been run first.
Spark Init Configuration Options
| Option | Description | Default |
|---|---|---|
| `--master.instance.type` | Master node instance type | m5.xlarge |
| `--worker.instance.type` | Worker node instance type | m5.xlarge |
| `--worker.instance.count` | Number of worker nodes | 3 |
Example with Custom Configuration
easy-db-lab spark init \
--master.instance.type m5.2xlarge \
--worker.instance.type m5.4xlarge \
--worker.instance.count 5
Submitting Spark Jobs
Submit JAR-based Spark applications to your EMR cluster:
easy-db-lab spark submit \
--jar /path/to/your-app.jar \
--main-class com.example.YourMainClass \
--conf spark.easydblab.keyspace=my_keyspace \
--conf spark.easydblab.table=my_table \
--wait
Submit Options
| Option | Description | Required |
|---|---|---|
| `--jar` | Path to JAR file (local path or s3:// URI) | Yes |
| `--main-class` | Main class to execute | Yes |
| `--conf` | Spark configuration (key=value), can be repeated | No |
| `--env` | Environment variable (KEY=value), can be repeated | No |
| `--args` | Arguments for the Spark application | No |
| `--wait` | Wait for job completion | No |
| `--name` | Job name (defaults to main class) | No |
When --jar is a local path, it is automatically uploaded to the cluster's S3 bucket before submission. When it is an s3:// URI, it is used directly.
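The local-path vs `s3://` handling amounts to a simple dispatch, sketched below (the `jars/` prefix and bucket layout are illustrative assumptions, not easy-db-lab's actual upload path):

```python
def resolve_jar(jar: str, cluster_bucket: str) -> str:
    """Return the s3:// URI the job submission would use."""
    if jar.startswith("s3://"):
        return jar  # already on S3: used directly, no upload step
    # Local path: would be uploaded to the cluster's bucket before submission
    name = jar.rsplit("/", 1)[-1]
    return f"s3://{cluster_bucket}/jars/{name}"

print(resolve_jar("s3://ci-artifacts/app.jar", "my-lab-bucket"))
print(resolve_jar("/tmp/build/app.jar", "my-lab-bucket"))
```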
Using a JAR Already on S3
If your JAR is already on S3 (e.g., from a CI pipeline or a previous upload), pass the S3 URI directly:
easy-db-lab spark submit \
--jar s3://my-bucket/jars/your-app.jar \
--main-class com.example.YourMainClass \
--conf spark.easydblab.keyspace=my_keyspace \
--wait
This skips the upload step entirely, which is useful for large JARs or when resubmitting the same job.
Cancelling a Job
Cancel a running or pending Spark job without terminating the cluster:
easy-db-lab spark stop
Without --step-id, this cancels the most recent job. To cancel a specific job:
easy-db-lab spark stop --step-id <step-id>
The cancellation uses EMR's TERMINATE_PROCESS strategy (SIGKILL). The API is asynchronous — use spark status to confirm the job has been cancelled.
Checking Job Status
View Recent Jobs
List recent Spark jobs on the cluster:
easy-db-lab spark jobs
Options:
- `--limit` - Maximum number of jobs to display (default: 10)
Check Specific Job Status
easy-db-lab spark status --step-id <step-id>
Without --step-id, shows the status of the most recent job.
Options:
- `--step-id` - EMR step ID to check
- `--logs` - Download step logs (stdout, stderr)
Retrieving Logs
Download logs for a Spark job:
easy-db-lab spark logs --step-id <step-id>
Logs are automatically decompressed and include:
- `stdout.gz` - Standard output
- `stderr.gz` - Standard error
- `controller.gz` - EMR controller logs
Architecture
When Spark is enabled, easy-db-lab provisions:
- EMR Cluster: Managed Spark cluster with master and worker nodes
- S3 Integration: Logs stored at `s3://<bucket>/spark/emr-logs/`
- IAM Roles: Service and job flow roles for EMR operations
- Observability: Each EMR node runs an OTel Collector (host metrics, OTLP forwarding), OTel Java Agent (auto-instrumentation for logs/metrics/traces), and Pyroscope Java Agent (continuous CPU/allocation/lock profiling). All telemetry flows to the control node's observability stack.
Timeouts and Polling
- Job Polling Interval: 5 seconds
- Maximum Wait Time: 4 hours
- Cluster Creation Timeout: 30 minutes
Spark with Cassandra
A common use case is running Spark jobs that read from or write to Cassandra. Use the Spark Cassandra Connector:
import com.datastax.spark.connector._
val df = spark.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "my_table", "keyspace" -> "my_keyspace"))
.load()
Ensure your JAR includes the Spark Cassandra Connector dependency and configure the Cassandra host in your Spark application.
Spark Modules
All Spark job modules live under the spark/ directory and share unified configuration via spark.easydblab.* properties. You can compare performance across implementations by swapping the JAR and main class while keeping the same --conf flags.
Module Overview
| Module | Gradle Path | Main Class | Description |
|---|---|---|---|
| `common` | `:spark:common` | — | Shared config, data generation, CQL setup |
| `bulk-writer-sidecar` | `:spark:bulk-writer-sidecar` | `DirectBulkWriter` | Cassandra Analytics, direct sidecar transport |
| `bulk-writer-s3` | `:spark:bulk-writer-s3` | `S3BulkWriter` | Cassandra Analytics, S3 staging transport |
| `connector-writer` | `:spark:connector-writer` | `StandardConnectorWriter` | Standard Spark Cassandra Connector |
| `connector-read-write` | `:spark:connector-read-write` | `KeyValuePrefixCount` | Read→transform→write example |
Building
Pre-build Cassandra Analytics (one-time, for bulk-writer modules)
The cassandra-analytics library requires JDK 11 to build:
bin/build-cassandra-analytics
Options:
- `--force` - Rebuild even if JARs exist
- `--branch <branch>` - Use a specific branch (default: trunk)
Build JARs
# Build all Spark modules
./gradlew :spark:bulk-writer-sidecar:shadowJar :spark:bulk-writer-s3:shadowJar \
:spark:connector-writer:shadowJar :spark:connector-read-write:shadowJar
# Or build individually
./gradlew :spark:bulk-writer-sidecar:shadowJar
./gradlew :spark:connector-writer:shadowJar
Usage
All modules use the same --conf properties for easy comparison.
Direct Bulk Writer (Sidecar)
easy-db-lab spark submit \
--jar spark/bulk-writer-sidecar/build/libs/bulk-writer-sidecar-*.jar \
--main-class com.rustyrazorblade.easydblab.spark.DirectBulkWriter \
--conf spark.easydblab.contactPoints=host1,host2,host3 \
--conf spark.easydblab.keyspace=bulk_test \
--conf spark.easydblab.localDc=us-west-2 \
--conf spark.easydblab.rowCount=1000000 \
--wait
S3 Bulk Writer
easy-db-lab spark submit \
--jar spark/bulk-writer-s3/build/libs/bulk-writer-s3-*.jar \
--main-class com.rustyrazorblade.easydblab.spark.S3BulkWriter \
--conf spark.easydblab.contactPoints=host1,host2,host3 \
--conf spark.easydblab.keyspace=bulk_test \
--conf spark.easydblab.localDc=us-west-2 \
--conf spark.easydblab.s3.bucket=my-bucket \
--conf spark.easydblab.rowCount=1000000 \
--wait
Standard Connector Writer
easy-db-lab spark submit \
--jar spark/connector-writer/build/libs/connector-writer-*.jar \
--main-class com.rustyrazorblade.easydblab.spark.StandardConnectorWriter \
--conf spark.easydblab.contactPoints=host1,host2,host3 \
--conf spark.easydblab.keyspace=bulk_test \
--conf spark.easydblab.localDc=us-west-2 \
--conf spark.easydblab.rowCount=1000000 \
--wait
Convenience Script
The bin/spark-bulk-write script handles JAR lookup, host resolution, and health checks:
# From a cluster directory
spark-bulk-write direct --rows 10000
spark-bulk-write s3 --rows 1000000 --parallelism 20
spark-bulk-write connector --keyspace myks --table mytable
Configuration Properties
All modules share these properties via spark.easydblab.*:
| Property | Description | Default |
|---|---|---|
| `spark.easydblab.contactPoints` | Comma-separated database hosts | Required |
| `spark.easydblab.keyspace` | Target keyspace | Required |
| `spark.easydblab.table` | Target table | `data_<timestamp>` |
| `spark.easydblab.localDc` | Local datacenter name | Required |
| `spark.easydblab.rowCount` | Number of rows to write | 1000000 |
| `spark.easydblab.parallelism` | Spark partitions for generation | 10 |
| `spark.easydblab.partitionCount` | Cassandra partitions to distribute across | 10000 |
| `spark.easydblab.replicationFactor` | Keyspace replication factor | 3 |
| `spark.easydblab.skipDdl` | Skip keyspace/table creation (validates they exist) | false |
| `spark.easydblab.compaction` | Compaction strategy | (default) |
| `spark.easydblab.s3.bucket` | S3 bucket (S3 mode only) | Required for S3 |
| `spark.easydblab.s3.endpoint` | S3 endpoint override | AWS S3 |
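Because every module reads the same `spark.easydblab.*` namespace, the repeated `--conf` flags can be generated from a single property map when scripting comparisons across modules (a hypothetical helper, not part of easy-db-lab):

```python
def conf_flags(props: dict[str, str]) -> list[str]:
    """Expand shared spark.easydblab.* properties into repeated --conf flags."""
    flags = []
    for key, value in sorted(props.items()):
        flags += ["--conf", f"spark.easydblab.{key}={value}"]
    return flags

flags = conf_flags({"keyspace": "bulk_test", "rowCount": "1000000"})
print(flags)
# → ['--conf', 'spark.easydblab.keyspace=bulk_test', '--conf', 'spark.easydblab.rowCount=1000000']
```

Swapping implementations then only changes `--jar` and `--main-class` while the generated flags stay identical.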
Table Schema
The test data generators produce this schema:
CREATE TABLE <keyspace>.<table> (
partition_id bigint,
sequence_id bigint,
course blob,
marks bigint,
PRIMARY KEY ((partition_id), sequence_id)
);
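A data generator matching that schema might look like the following sketch (illustrative only; the real generators live in the `:spark:common` module):

```python
import os
import random

def generate_rows(row_count: int, partition_count: int, blob_size: int = 64):
    """Yield (partition_id, sequence_id, course, marks) tuples matching the schema.

    sequence_id is a per-partition counter, so (partition_id, sequence_id)
    is unique, mirroring the PRIMARY KEY ((partition_id), sequence_id).
    """
    next_seq: dict[int, int] = {}
    for _ in range(row_count):
        pid = random.randrange(partition_count)   # partition_id: bigint partition key
        seq = next_seq.get(pid, 0)                # sequence_id: clustering column
        next_seq[pid] = seq + 1
        course = os.urandom(blob_size)            # course: opaque blob payload
        marks = random.randrange(100)             # marks: bigint value column
        yield pid, seq, course, marks

rows = list(generate_rows(1000, 10))
print(len(rows))
```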
Monitoring
Grafana Dashboards
Grafana is deployed automatically as part of the observability stack (k8 apply). It is accessible on port 3000 of the control node.
Cluster Identification
When running multiple environments side by side, Grafana displays the cluster name in several places to help you identify which environment you're looking at:
- Browser tab - Shows the cluster name instead of "Grafana"
- Dashboard titles - Each dashboard title is prefixed with the cluster name
- Sidebar org name - The organization name in the sidebar shows the cluster name
- Home dashboard - The System Overview dashboard is set as the home page instead of the default Grafana welcome page
System Dashboard
Shows CPU, memory, disk I/O, network I/O, and load average for all cluster nodes via OpenTelemetry metrics.
AWS CloudWatch Overview
A combined dashboard showing S3, EBS, and EC2 metrics via CloudWatch. Available after running easy-db-lab up.
S3 metrics:
- Throughput: BytesDownloaded, BytesUploaded
- Request Counts: GetRequests, PutRequests
- Latency: FirstByteLatency (p99), TotalRequestLatency (p99)
EBS volume metrics:
- IOPS: VolumeReadOps, VolumeWriteOps (mirrored read/write chart)
- Throughput: VolumeReadBytes, VolumeWriteBytes (mirrored read/write chart)
- Queue Length: VolumeQueueLength
- Burst Balance: BurstBalance (percentage)
EC2 status checks:
- Status Check Failures: StatusCheckFailed_Instance, StatusCheckFailed_System (red threshold at >= 1)
Use the dropdowns at the top to select S3 bucket, EC2 instances, and EBS volumes.
How it works:
- S3 request metrics are automatically enabled for the cluster's prefix in the account S3 bucket during `easy-db-lab up`
- EBS and EC2 metrics are published automatically by AWS for all instances and volumes
- Grafana queries CloudWatch using the EC2 instance's IAM role (no credentials needed)
- During `easy-db-lab down`, the S3 metrics configuration is automatically removed to stop CloudWatch billing
Note: S3 request metrics take approximately 15 minutes to appear in CloudWatch after being enabled. EBS and EC2 metrics are available immediately.
EMR Overview
Shows Spark/EMR node metrics via OpenTelemetry. Available when an EMR cluster is provisioned. Each EMR node runs an OTel Collector that collects host metrics and receives JVM telemetry from the OTel and Pyroscope Java agents.
Host Metrics:
- CPU Usage: Per-node CPU utilization percentage
- Memory Usage: Used and cached memory per node
- Disk I/O: Read/write throughput per node (mirrored chart)
- Network I/O: Receive/transmit throughput per node (mirrored chart)
- Load Average: 1m and 5m load per node
- Filesystem Usage: Root filesystem utilization percentage
Spark JVM Metrics:
- JVM Heap Memory: Used and committed heap per node/pool
- GC Duration Rate: Garbage collection duration rate per collector
- JVM Threads: Thread count per node
- JVM Classes Loaded: Class count per node
Use the Hostname dropdown to filter by specific EMR nodes.
OpenSearch Overview
Shows OpenSearch domain metrics via CloudWatch. Available when an OpenSearch domain is provisioned.
Metrics displayed:
- Cluster Health: ClusterStatus (green/yellow/red), FreeStorageSpace
- CPU / Memory: CPUUtilization, JVMMemoryPressure
- Search Performance: SearchLatency (p99), SearchRate
- Indexing Performance: IndexingLatency (p99), IndexingRate
- HTTP Responses: 2xx, 3xx, 4xx, 5xx (color-coded)
- Storage: ClusterUsedSpace
Use the Domain dropdown to select which OpenSearch domain to view.
Cassandra Condensed
A single-pane-of-glass summary of the most important Cassandra metrics, powered by the MAAC (Management API for Apache Cassandra) agent. Shows:
- Cluster Overview: Nodes up/down, compaction rates, CQL request throughput, dropped messages, connected clients, timeouts, hints, data size, GC time
- Condensed Metrics: Request throughput, coordinator latency percentiles, memtable space, compaction activity, table-level latency, streaming bandwidth
Requires the MAAC agent to be loaded (Cassandra 4.0, 4.1, or 5.0). Metrics are exposed on port 9000 and scraped by the OTel collector.
Cassandra Overview
A comprehensive deep-dive into Cassandra cluster health, also powered by the MAAC agent. Shows:
- Request Throughput: Read/write distribution, latency percentiles (P98-P999), error throughput
- Node Status: Per-node up/down status (polystat panel), node count, status history
- Data Status: Disk space usage, data size, SSTable count, pending compactions
- Internals: Thread pool pending/blocked/active tasks, dropped messages, hinted handoff
- Hardware: CPU, memory, disk I/O, network I/O, load average
- JVM/GC: Application throughput, GC time, heap utilization
eBPF Observability
The cluster deploys eBPF-based agents on all nodes for deep system observability:
Beyla (L7 Network Metrics)
Grafana Beyla uses eBPF to automatically instrument network traffic and provide RED metrics (Rate, Errors, Duration) for:
- Cassandra CQL protocol (port 9042) and inter-node communication (port 7000)
- ClickHouse HTTP (port 8123) and native (port 9000) protocols
Metrics are scraped by the OTel collector and stored in VictoriaMetrics.
ebpf_exporter (Low-Level Metrics)
Cloudflare's ebpf_exporter provides kernel-level metrics via eBPF:
- TCP retransmits — count of retransmitted TCP segments
- Block I/O latency — histogram of block device I/O operation latency
- VFS latency — histogram of filesystem read/write operation latency
These metrics are scraped by the OTel collector and stored in VictoriaMetrics.
See Profiling for continuous profiling with Pyroscope.
Profiling
Continuous profiling is provided by Grafana Pyroscope, deployed automatically as part of the observability stack.
Architecture
Profiling data is collected from multiple sources and sent to the Pyroscope server on the control node (port 4040):
- Pyroscope Java agent (Cassandra) — Runs as a `-javaagent` inside the Cassandra JVM. Uses async-profiler to collect CPU, allocation, lock contention, and wall-clock profiles with full method-level resolution.
- Pyroscope Java agent (Stress jobs) — Runs as a `-javaagent` inside cassandra-easy-stress K8s Jobs. Collects the same profile types as Cassandra (CPU, allocation, lock). The agent JAR is mounted from the host via a hostPath volume.
- Pyroscope Java agent (Spark/EMR) — Runs as a `-javaagent` on Spark driver and executor JVMs. Installed via EMR bootstrap action to `/opt/pyroscope/pyroscope.jar`. Collects CPU, allocation (512k threshold), and lock (10ms threshold) profiles in JFR format. Profiles appear under `service_name=spark-<job-name>`.
- Grafana Alloy eBPF profiler — Runs as a DaemonSet on all nodes via Grafana Alloy. Profiles all processes (Cassandra, ClickHouse, stress jobs) at the system level using eBPF. Provides CPU flame graphs including kernel stack frames.
Accessing Profiles
Profiling Dashboard
A dedicated Profiling dashboard is available in Grafana with flame graph panels for each profile type:
- Open Grafana (port 3000)
- Navigate to Dashboards and select the Profiling dashboard
- Use the Service dropdown to select a service (e.g., `cassandra`, `cassandra-easy-stress`, `clickhouse-server`)
- Use the Hostname dropdown to filter by specific nodes
- Select a time range to view profiles for that period
The dashboard includes panels for:
- CPU Flame Graph — CPU time spent in each method
- Memory Allocation Flame Graph — Heap allocation hotspots
- Lock Contention Flame Graph — Time spent waiting for monitors
- Mutex Contention Flame Graph — Mutex delay analysis
Grafana Explore
For ad-hoc profile exploration:
- Open Grafana (port 3000) and navigate to Explore
- Select the Pyroscope datasource
- Choose a profile type (e.g., `process_cpu`, `memory`, `mutex`)
- Filter by labels:
  - `service_name` — process or application name
  - `hostname` — node hostname
  - `cluster` — cluster name
Profile Types
Java Agent (Cassandra, Stress Jobs)
| Profile | Description |
|---|---|
| `cpu` | CPU time spent in each method |
| `alloc` | Memory allocation by method (objects and bytes) |
| `lock` | Lock contention — time spent waiting for monitors |
| `wall` | Wall-clock time — useful for finding I/O bottlenecks (Cassandra only, see below) |
eBPF Agent (All Processes)
| Profile | Description |
|---|---|
| `process_cpu` | CPU usage by process, including kernel frames |
The eBPF agent profiles all processes on every node, including ClickHouse. Since ClickHouse is written in C++, only CPU profiles are available (no allocation or lock profiles). ClickHouse profiles appear in Pyroscope under the clickhouse-server service name when ClickHouse is running.
Stress Job Profiling
Stress jobs are automatically profiled via the Pyroscope Java agent. No additional configuration is needed — when you start a stress job, the agent is mounted from the host node and configured to send profiles to the Pyroscope server.
Profiles appear under service_name=cassandra-easy-stress with labels for cluster and job_name.
Wall-Clock vs CPU Profiling
By default, the Cassandra Java agent profiles CPU time. You can switch to wall-clock profiling to find I/O bottlenecks and blocking operations.
To enable wall-clock profiling:
- SSH to each Cassandra node
- Add `PYROSCOPE_PROFILER_EVENT=wall` to `/etc/default/cassandra`
- Restart Cassandra
To switch back to CPU profiling, either remove the line or set PYROSCOPE_PROFILER_EVENT=cpu.
Configuration
Cassandra Java Agent
The Pyroscope Java agent is configured via JVM system properties in cassandra.in.sh. It activates when the PYROSCOPE_SERVER_ADDRESS environment variable is set (configured by easy-db-lab at cluster startup).
The agent JAR is installed at /usr/local/pyroscope/pyroscope.jar.
| Environment Variable | Set In | Description |
|---|---|---|
| `PYROSCOPE_SERVER_ADDRESS` | `/etc/default/cassandra` | Pyroscope server URL (set automatically) |
| `CLUSTER_NAME` | `/etc/default/cassandra` | Cluster name for labeling (set automatically) |
| `PYROSCOPE_PROFILER_EVENT` | `/etc/default/cassandra` | Profiler event type: `cpu` (default) or `wall` |
eBPF Agent
The eBPF profiler runs as a privileged Grafana Alloy DaemonSet (pyroscope-ebpf) and profiles all processes on each node. Configuration is in the pyroscope-ebpf-config ConfigMap (Alloy River format). It uses discovery.process to discover host processes and pyroscope.ebpf to collect CPU profiles.
Pyroscope Server
The Pyroscope server runs on the control node with data stored in S3 (s3://<account-bucket>/clusters/<name>-<id>/pyroscope/). Configuration is in the pyroscope-config ConfigMap.
Data Flow
Cassandra JVM ──(Java agent)──────► Pyroscope Server (:4040)
▲
Stress Jobs ──(Java agent)──────────────┘
▲
Spark JVMs ──(Java agent)──────────────┘
▲
All Processes ──(eBPF agent)────────────┘
│
▼
S3 storage
Grafana (:3000)
Pyroscope datasource
+ Profiling dashboard
Victoria Metrics
Victoria Metrics is a time-series database that stores metrics from all nodes in your easy-db-lab cluster. It receives metrics via OTLP from the OpenTelemetry Collector.
Architecture
┌─────────────────────────────────────────────────────────────┐
│ All Nodes (DaemonSet) │
├─────────────────────────────────────────────────────────────┤
│ System metrics (CPU, memory, disk, network) │
│ Cassandra metrics (via JMX) │
│ Application metrics │
└──────────────────────────┬──────────────────────────────────┘
│
▼
┌────────────────────────┐
│ OTel Collector │
│ (DaemonSet) │
└───────────┬────────────┘
│
┌─────────────────────────┼─────────────────────────┐
│ Control Node │ │
├─────────────────────────┼─────────────────────────┤
│ ▼ │
│ ┌──────────────────┐ │
│ │ Victoria Metrics │ │
│ │ (:8428) │ │
│ └────────┬─────────┘ │
└───────────────────────┼────────────────────────────┘
│
▼
┌──────────────────┐
│ Grafana │
│ (:3000) │
└──────────────────┘
Configuration
Victoria Metrics runs on the control node as a Kubernetes deployment:
- Port: 8428 (HTTP API)
- Storage: Persistent at `/mnt/db1/victoriametrics`
- Retention: 7 days (configurable via the `-retentionPeriod` flag)
Accessing Metrics
Grafana
- Access Grafana at `http://control0:3000` (via SOCKS proxy)
- Victoria Metrics is pre-configured as the Prometheus datasource
- System dashboards show node metrics
Direct API
Query metrics directly using the Prometheus-compatible API:
source env.sh
# Get all metric names
with-proxy curl "http://control0:8428/api/v1/label/__name__/values"
# Query specific metric
with-proxy curl "http://control0:8428/api/v1/query?query=up"
# Query with time range
with-proxy curl "http://control0:8428/api/v1/query_range?query=node_cpu_seconds_total&start=$(date -d '1 hour ago' +%s)&end=$(date +%s)&step=60"
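The `query_range` call needs Unix-epoch bounds and a step; building the request URL programmatically is a one-liner with the standard library (sketch):

```python
import time
from urllib.parse import urlencode

def query_range_url(base: str, query: str, lookback_s: int = 3600, step_s: int = 60) -> str:
    """Build a Prometheus-compatible /api/v1/query_range URL for the last hour."""
    end = int(time.time())
    params = {"query": query, "start": end - lookback_s, "end": end, "step": step_s}
    return f"{base}/api/v1/query_range?{urlencode(params)}"

print(query_range_url("http://control0:8428", "node_cpu_seconds_total"))
```

The resulting URL can be fetched through the SOCKS proxy exactly like the `with-proxy curl` examples above.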
Common Queries
# CPU usage by node
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
# Disk usage
100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100)
# Network received bytes
rate(node_network_receive_bytes_total[5m])
Backup
Backup Victoria Metrics data to S3:
# Backup to cluster's default S3 bucket
easy-db-lab metrics backup
# Backup to a custom S3 location
easy-db-lab metrics backup --dest s3://my-backup-bucket/victoriametrics
By default, backups are stored at:
s3://{cluster-bucket}/victoriametrics/{timestamp}/
Use `--dest` to override the destination bucket and path.
Features
- Uses native vmbackup tool with snapshot support
- Non-disruptive; metrics collection continues during backup
- Direct S3 upload (no intermediate storage needed)
- Incremental backup support for faster subsequent backups
Listing Backups
List available VictoriaMetrics backups in S3:
easy-db-lab metrics ls
This displays a summary table of all backups grouped by timestamp, showing the number of files and total size for each.
Importing Metrics to an External Instance
Stream metrics from the running cluster's VictoriaMetrics to an external VictoriaMetrics instance via the native export/import API:
# Import all metrics
easy-db-lab metrics import --target http://victoria:8428
# Import only specific metrics
easy-db-lab metrics import --target http://victoria:8428 --match '{job="cassandra"}'
This is useful for exporting metrics at the end of test runs when running easy-db-lab from a Docker container. Unlike binary backups, this approach streams data via HTTP and can target any reachable VictoriaMetrics instance.
Options
| Option | Description | Default |
|---|---|---|
| --target | Target VictoriaMetrics URL (required) | - |
| --match | Metric selector for filtering | All metrics |
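Under the hood, this streaming amounts to reading the source instance's native export endpoint and writing to the target's import endpoint, both part of the VictoriaMetrics HTTP API. The helpers below are an illustrative sketch of the URLs involved, not the tool's actual implementation; the default match-everything selector is an assumption:

```python
from urllib.parse import urlencode

def export_url(source, match='{__name__!=""}'):
    """Native-export endpoint on the source VictoriaMetrics instance.
    The default match-everything selector is an assumption."""
    return source.rstrip("/") + "/api/v1/export?" + urlencode({"match[]": match})

def import_url(target):
    """Native-import endpoint on the target instance."""
    return target.rstrip("/") + "/api/v1/import"

# The stream itself is conceptually:
#   curl "<export_url>" | curl -X POST "<import_url>" --data-binary @-
```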
Troubleshooting
No metrics appearing
- Verify the Victoria Metrics pod is running:
kubectl get pods -l app.kubernetes.io/name=victoriametrics
kubectl logs -l app.kubernetes.io/name=victoriametrics
- Check the OTel Collector is forwarding metrics:
kubectl get pods -l app=otel-collector
kubectl logs -l app=otel-collector
- Verify the cluster-config ConfigMap exists:
kubectl get configmap cluster-config -o yaml
Connection errors
If you see connection errors when querying metrics:
- Ensure the cluster is running:
easy-db-lab status
- The proxy is started automatically when needed
- Check that the control node is accessible:
ssh control0 hostname
High memory usage
Victoria Metrics is configured with memory limits. If you see OOM kills:
- Check current memory usage:
kubectl top pod -l app.kubernetes.io/name=victoriametrics
- Consider adjusting the memory limits in the deployment manifest
Backup failures
If backup fails:
- Check the backup job logs:
kubectl logs -l app.kubernetes.io/name=victoriametrics-backup
- Verify S3 bucket permissions (the IAM role should have S3 access)
- Ensure there's sufficient disk space on the control node
Victoria Logs
Victoria Logs is a centralized log aggregation system that collects logs from all nodes in your easy-db-lab cluster. It provides a unified way to search and analyze logs from Cassandra, ClickHouse, and system services.
Architecture
┌─────────────────────────────────────────────────────────────┐
│ All Nodes (DaemonSet) │
├─────────────────────────────────────────────────────────────┤
│ /var/log/* journald │
│ /mnt/db1/cassandra/logs/*.log │
│ /mnt/db1/clickhouse/logs/*.log │
└──────────────────────────┬──────────────────────────────────┘
│
▼
┌────────────────────────┐
│ OTel Collector │
│ (DaemonSet) │
│ filelog + journald │
└───────────┬────────────┘
│
┌─────────────────────────┼─────────────────────────┐
│ Control Node │ │
├─────────────────────────┼─────────────────────────┤
│ ▼ │
│ ┌──────────────────┐ │
│ │ Victoria Logs │ │
│ │ (:9428) │ │
│ └────────┬─────────┘ │
└───────────────────────┼────────────────────────────┘
│
▼
┌──────────────────┐
│ easy-db-lab │
│ logs query │
└──────────────────┘
Components
Victoria Logs Server
Victoria Logs runs on the control node as a Kubernetes deployment:
- Port: 9428 (HTTP API)
- Storage: Local ephemeral storage
- Retention: 7 days (configurable)
- Location: Control node only (node-role.kubernetes.io/control-plane)
OTel Collector
The OpenTelemetry Collector collects logs from all sources and forwards them to Victoria Logs.
The OTel Collector runs as a DaemonSet on every node (Cassandra, stress, control) to collect:
| Source | Path | Description |
|---|---|---|
| Cassandra | /mnt/db1/cassandra/logs/*.log | Cassandra database logs |
| ClickHouse | /mnt/db1/clickhouse/logs/*.log | ClickHouse server logs |
| ClickHouse Keeper | /mnt/db1/clickhouse/keeper/logs/*.log | ClickHouse Keeper logs |
| System logs | /var/log/**/*.log | General system logs |
| journald | cassandra, docker, k3s, sshd | systemd service logs |
Log Sources
Each log entry is tagged with a source field:
| Source | Description | Additional Fields |
|---|---|---|
| cassandra | Cassandra database logs | host |
| clickhouse | ClickHouse server logs | host, component (server/keeper) |
| systemd | systemd journal logs | host, unit |
| system | General /var/log files | host |
Querying Logs
Using the CLI
The easy-db-lab logs query command provides a unified interface:
# Query all logs from the last hour
easy-db-lab logs query
# Filter by source
easy-db-lab logs query --source cassandra
easy-db-lab logs query --source clickhouse
easy-db-lab logs query --source systemd
# Filter by host
easy-db-lab logs query --source cassandra --host db0
# Filter by systemd unit
easy-db-lab logs query --source systemd --unit docker.service
# Search for text
easy-db-lab logs query --grep "OutOfMemory"
easy-db-lab logs query --grep "ERROR"
# Time range and limit
easy-db-lab logs query --since 30m --limit 500
easy-db-lab logs query --since 1d
# Raw LogsQL query
easy-db-lab logs query -q 'source:cassandra AND host:db0'
Query Options
| Option | Description | Default |
|---|---|---|
| --source, -s | Log source filter | All sources |
| --host, -H | Hostname filter (db0, app0, control0) | All hosts |
| --unit | systemd unit name | All units |
| --since | Time range (1h, 30m, 1d) | 1h |
| --limit, -n | Max entries to return | 100 |
| --grep, -g | Text search filter | None |
| --query, -q | Raw LogsQL query | None |
Using the HTTP API
Victoria Logs exposes a REST API on port 9428. Access it through the SOCKS proxy:
source env.sh
with-proxy curl "http://control0:9428/select/logsql/query?query=source:cassandra&time=1h&limit=100"
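The endpoint returns one JSON object per line (newline-delimited JSON). As an illustration, a small Python sketch of parsing such a response; the sample payload is invented for this example, with field names such as _time and _msg following the VictoriaLogs response format:

```python
import json

# Two sample response lines (invented for illustration); the real API
# returns one JSON object per log entry, newline-delimited.
SAMPLE = "\n".join([
    '{"_time":"2026-03-08T14:22:05Z","_msg":"Compaction completed","source":"cassandra","host":"db0"}',
    '{"_time":"2026-03-08T14:22:06Z","_msg":"Flushing memtable","source":"cassandra","host":"db1"}',
])

def parse_jsonlines(body):
    """Parse a newline-delimited JSON response into a list of entries,
    skipping blank lines."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]

entries = parse_jsonlines(SAMPLE)
print(len(entries), entries[0]["host"])
```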
Using Grafana
Victoria Logs is configured as a datasource in Grafana. You can use it in two ways:
Log Investigation Dashboard
The Log Investigation dashboard is designed for interactive log analysis during investigations. Access it at Grafana → Dashboards → Log Investigation.
Filter variables (dropdowns at the top):
| Filter | Options | Description |
|---|---|---|
| Node Role | All, db, app, control | Filter by server type |
| Source | All, cassandra, clickhouse, system, tool-runner | Filter by log source |
| Level | All, Error, Warning, Info, Debug | Filter by log severity |
| Search | (text input) | Free-text search across log messages |
| Filters | (ad-hoc) | Add arbitrary field:value filters (e.g., host = db0) |
Panels:
- Log Volume — time-series bar chart showing log count over time, broken down by source. Helps identify spikes and anomalies at a glance.
- Logs — scrollable log viewer with timestamps, source labels, and expandable log details. Click any log entry to see all available fields.
Tips:
- Use the ad-hoc Filters variable to filter by host, unit, component, or any other field without needing a dedicated dropdown.
- The dashboard auto-refreshes every 10 seconds by default. Adjust or disable via the refresh picker in the top-right corner.
- Combine multiple filters to narrow down results; for example, set Node Role to db, Source to cassandra, and Level to Error to see only Cassandra errors on database nodes.
- To search for exec job logs, set Source to tool-runner and use the Search box for the job name.
Explore Mode
For ad-hoc queries beyond what the dashboard provides:
- Access Grafana at http://control0:3000 (via SOCKS proxy)
- Navigate to Explore
- Select the "VictoriaLogs" datasource
- Use LogsQL syntax for queries
LogsQL Query Syntax
Victoria Logs uses LogsQL for querying. Basic syntax:
# Simple field match
source:cassandra
# Multiple conditions (AND)
source:cassandra AND host:db0
# Text search
"OutOfMemory"
# Combine field match with text search
source:cassandra AND "Exception"
# Time filter (in addition to --since)
_time:1h
For full LogsQL documentation, see the Victoria Logs documentation.
Deployment
Victoria Logs and the OTel Collector are automatically deployed when you run:
easy-db-lab k8 apply
This deploys:
- Victoria Logs server on the control node
- OTel Collector DaemonSet on all nodes
- Grafana datasource configuration
Verifying the Setup
Check that all components are running:
source env.sh
kubectl get pods -l app.kubernetes.io/name=victorialogs
kubectl get pods -l app.kubernetes.io/name=otel-collector
Test connectivity:
# Check Victoria Logs health
with-proxy curl http://control0:9428/health
# Query recent logs
easy-db-lab logs query --limit 10
Troubleshooting
No logs appearing
- Verify OTel Collector pods are running:
kubectl get pods -l app.kubernetes.io/name=otel-collector
kubectl logs -l app.kubernetes.io/name=otel-collector
- Check Victoria Logs is healthy:
with-proxy curl http://control0:9428/health
- Verify the cluster-config ConfigMap exists:
kubectl get configmap cluster-config -o yaml
Connection errors
The logs query command uses the internal SOCKS5 proxy to connect to Victoria Logs. If you see connection errors:
- Ensure the cluster is running:
easy-db-lab status
- The proxy is started automatically when needed
- Check that the control node is accessible:
ssh control0 hostname
Listing Backups
List available VictoriaLogs backups in S3:
easy-db-lab logs ls
This displays a summary table of all backups grouped by timestamp, showing the number of files and total size for each.
Importing Logs to an External Instance
Stream logs from the running cluster's VictoriaLogs to an external VictoriaLogs instance via the jsonline API:
# Import all logs
easy-db-lab logs import --target http://victorialogs:9428
# Import only specific logs
easy-db-lab logs import --target http://victorialogs:9428 --query 'source:cassandra'
This is useful for exporting logs at the end of test runs when running easy-db-lab from a Docker container. Unlike binary backups, this approach streams data via HTTP and can target any reachable VictoriaLogs instance.
Options
| Option | Description | Default |
|---|---|---|
| --target | Target VictoriaLogs URL (required) | - |
| --query | LogsQL query for filtering | All logs (*) |
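The streaming pairs the LogsQL query endpoint on the source with the jsonline ingestion endpoint on the target, both part of the VictoriaLogs HTTP API. The helpers below are an illustrative sketch of the URLs involved, not the tool's actual code:

```python
from urllib.parse import urlencode

def logs_export_url(source, query="*"):
    """LogsQL query endpoint the import reads from."""
    return source.rstrip("/") + "/select/logsql/query?" + urlencode({"query": query})

def logs_import_url(target):
    """jsonline ingestion endpoint on the target VictoriaLogs."""
    return target.rstrip("/") + "/insert/jsonline"
```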
Backup
Victoria Logs data can be backed up to S3 for disaster recovery using consistent snapshots.
Creating a Backup
# Backup to cluster's default S3 bucket
easy-db-lab logs backup
# Backup to a custom S3 location
easy-db-lab logs backup --dest s3://my-backup-bucket/victorialogs
By default, backups are stored at:
s3://{cluster-bucket}/victorialogs/{timestamp}/
Use --dest to override the destination bucket and path.
How It Works
The backup uses VictoriaLogs' snapshot API to create consistent, point-in-time copies:
- Create snapshots — calls the VictoriaLogs snapshot API to create read-only snapshots of all active log partitions
- Sync to S3 — uploads each snapshot directory to S3 using aws s3 sync
- Cleanup — deletes the snapshots from disk to free space (runs even if the sync step fails)
Using snapshots ensures data consistency, since VictoriaLogs may be actively writing to its data directory during the backup.
What Gets Backed Up
- All log partitions (organized by date)
- Complete log history up to retention period (7 days default)
Notes
- The process is non-disruptive; log ingestion continues during backup
- Snapshot cleanup always runs, even if the S3 upload fails, to avoid filling disk
- Persistent storage at /mnt/db1/victorialogs ensures logs survive pod restarts
Server
easy-db-lab includes a server mode that provides AI assistant integration via MCP (Model Context Protocol), REST status endpoints, and live metrics streaming. This enables Claude to directly interact with your clusters, and provides programmatic access to cluster status.
The server exposes tools for all supported databases — Cassandra, ClickHouse, OpenSearch, and Spark — as well as cluster lifecycle management and observability.
Starting the Server
To start the server, run:
easy-db-lab server
By default, the server picks an available port. To specify a port:
easy-db-lab server --port 8888
The server automatically generates a .mcp.json configuration file in the current directory with the connection details.
Adding to Claude Code
Once the server is running, start Claude Code from the same directory:
claude
Claude Code automatically detects and uses the .mcp.json file generated by the server.
Available Tools
The server exposes commands annotated with @McpCommand as MCP tools to Claude. Tool names use underscores and are derived from the command's package namespace.
Cluster Lifecycle
| Tool Name | Description |
|---|---|
| init | Initialize a directory for easy-db-lab |
| up | Provision AWS infrastructure |
| down | Shut down AWS infrastructure |
| clean | Clean up generated files |
| status | Display full environment status |
| hosts | List all hosts in the cluster |
| ip | Get IP address for a host by alias |
Cassandra Management
| Tool Name | Description |
|---|---|
| cassandra_use | Select a Cassandra version |
| cassandra_list | List available Cassandra versions |
| cassandra_start | Start Cassandra on all nodes |
| cassandra_restart | Restart Cassandra on all nodes |
| cassandra_update_config | Apply configuration patch to nodes |
Cassandra Stress Testing
| Tool Name | Description |
|---|---|
| cassandra_stress_start | Start a stress job on K8s |
| cassandra_stress_stop | Stop and delete stress jobs |
| cassandra_stress_status | Check status of stress jobs |
| cassandra_stress_logs | View logs from stress jobs |
| cassandra_stress_list | List available workloads |
| cassandra_stress_fields | List available field generators |
| cassandra_stress_info | Show workload information |
ClickHouse
| Tool Name | Description |
|---|---|
| clickhouse_start | Deploy ClickHouse cluster to K8s |
| clickhouse_stop | Remove ClickHouse cluster |
| clickhouse_status | Check ClickHouse cluster status |
OpenSearch
| Tool Name | Description |
|---|---|
| opensearch_start | Create AWS OpenSearch domain |
| opensearch_stop | Delete OpenSearch domain |
| opensearch_status | Check OpenSearch domain status |
Spark
| Tool Name | Description |
|---|---|
| spark_submit | Submit Spark job to EMR cluster |
| spark_status | Check status of a Spark job |
| spark_jobs | List recent Spark jobs |
| spark_logs | Download EMR logs from S3 |
Kubernetes
| Tool Name | Description |
|---|---|
| k8_apply | Apply observability stack to K8s |
Utilities
| Tool Name | Description |
|---|---|
| prune_amis | Prune older private AMIs |
Tool Naming Convention
MCP tool names are derived from the command's package location:
- Top-level commands: status, hosts, ip, clean, init, up
- Cassandra commands: cassandra_ prefix (e.g., cassandra_start, cassandra_use)
- Nested commands: cassandra_stress_ prefix (e.g., cassandra_stress_start)
- Hyphens become underscores: update-config → cassandra_update_config
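The convention can be sketched as a small function. This is an illustrative reconstruction of the rule described above, not the tool's actual code:

```python
def mcp_tool_name(package, command):
    """Join package segments and the command name with underscores,
    converting hyphens to underscores along the way."""
    return "_".join(part.replace("-", "_") for part in list(package) + [command])

# Examples matching the convention:
#   mcp_tool_name([], "status")                   -> "status"
#   mcp_tool_name(["cassandra"], "update-config") -> "cassandra_update_config"
```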
Benefits of Server Integration
| Benefit | Description |
|---|---|
| Direct Control | Claude executes easy-db-lab commands directly without manual intervention |
| Context Awareness | Claude maintains context about your cluster state and configuration |
| Automation | Complex multi-step operations can be automated through Claude |
| Intelligent Assistance | Claude can analyze logs, metrics, and provide optimization recommendations |
Example Workflow
- Start the server in one terminal:
easy-db-lab server
- In another terminal, start Claude Code from the same directory:
claude
Claude Code automatically detects the .mcp.json file generated by the server.
- Ask Claude to help manage your cluster:
- "Initialize a new 5-node cluster with i4i.xlarge instances"
- "Check the status of all nodes"
- "Select Cassandra version 5.0 and start it"
- "Start a KeyValue stress test for 1 hour"
- "Deploy ClickHouse and check its status"
- "Create an OpenSearch domain and monitor its progress"
- "Submit a Spark job to the EMR cluster"
Live Metrics Streaming
When Redis is configured via the EASY_DB_LAB_REDIS_URL environment variable, the server publishes live cluster metrics to the Redis pub/sub channel every 5 seconds. Metrics are queried from VictoriaMetrics using the same PromQL expressions as the Grafana dashboards.
Enabling
export EASY_DB_LAB_REDIS_URL=redis://localhost:6379/easydblab-events
easy-db-lab server
Metrics events are published to the same channel as command events. Consumers filter by the event.type field.
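A consumer-side sketch of that filtering, assuming payloads are the JSON envelopes shown in the event examples in this section (the subscription mechanics are omitted; only the event.type filtering is shown):

```python
import json

def metrics_events(payloads, wanted=("Metrics.System", "Metrics.Cassandra")):
    """Keep only metrics events from a stream of raw pub/sub payloads,
    keyed off the event.type field."""
    kept = []
    for raw in payloads:
        msg = json.loads(raw)
        if msg.get("event", {}).get("type") in wanted:
            kept.append(msg)
    return kept
```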
Event Types
Only metrics for running services are published. If the cluster is running ClickHouse instead of Cassandra, no Cassandra metrics events are emitted.
Metrics.System
Published every 5 seconds with per-node CPU, memory, disk I/O, and filesystem metrics:
{
"timestamp": "2026-03-08T14:22:05.123Z",
"commandName": "server",
"event": {
"type": "Metrics.System",
"nodes": {
"db-0": {
"cpuUsagePct": 34.2,
"memoryUsedBytes": 17179869184,
"diskReadBytesPerSec": 52428800.0,
"diskWriteBytesPerSec": 104857600.0,
"filesystemUsedPct": 45.2
},
"db-1": {
"cpuUsagePct": 28.7,
"memoryUsedBytes": 16106127360,
"diskReadBytesPerSec": 41943040.0,
"diskWriteBytesPerSec": 83886080.0,
"filesystemUsedPct": 42.8
}
}
}
}
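As a consumer-side illustration, the payload above can be summarized across nodes. The helper is hypothetical, but the field names match the event format:

```python
import json

# The Metrics.System payload from above, inlined.
EVENT = json.loads("""
{
  "timestamp": "2026-03-08T14:22:05.123Z",
  "commandName": "server",
  "event": {
    "type": "Metrics.System",
    "nodes": {
      "db-0": {"cpuUsagePct": 34.2, "memoryUsedBytes": 17179869184,
               "diskReadBytesPerSec": 52428800.0, "diskWriteBytesPerSec": 104857600.0,
               "filesystemUsedPct": 45.2},
      "db-1": {"cpuUsagePct": 28.7, "memoryUsedBytes": 16106127360,
               "diskReadBytesPerSec": 41943040.0, "diskWriteBytesPerSec": 83886080.0,
               "filesystemUsedPct": 42.8}
    }
  }
}
""")

def avg_cpu_pct(event):
    """Average cpuUsagePct across all nodes in a Metrics.System event."""
    nodes = event["event"]["nodes"]
    return sum(n["cpuUsagePct"] for n in nodes.values()) / len(nodes)
```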
Metrics.Cassandra
Published every 5 seconds when the cluster is running Cassandra:
{
"timestamp": "2026-03-08T14:22:05.187Z",
"commandName": "server",
"event": {
"type": "Metrics.Cassandra",
"readP99Ms": 1.247,
"writeP99Ms": 0.832,
"readOpsPerSec": 15234.5,
"writeOpsPerSec": 12087.3,
"compactionPending": 3,
"compactionCompletedPerSec": 1.5,
"compactionBytesWrittenPerSec": 52428800.0
}
}
Field Reference
System — per node:
| Field | Type | Description |
|---|---|---|
| cpuUsagePct | double | CPU usage percentage (0-100) |
| memoryUsedBytes | long | Memory used in bytes |
| diskReadBytesPerSec | double | Disk read throughput (bytes/sec) |
| diskWriteBytesPerSec | double | Disk write throughput (bytes/sec) |
| filesystemUsedPct | double | Filesystem usage percentage (0-100) |
Cassandra — cluster-wide:
| Field | Type | Description |
|---|---|---|
| readP99Ms | double | Read latency p99 in milliseconds |
| writeP99Ms | double | Write latency p99 in milliseconds |
| readOpsPerSec | double | Read operations per second |
| writeOpsPerSec | double | Write operations per second |
| compactionPending | long | Number of pending compactions |
| compactionCompletedPerSec | double | Compactions completed per second |
| compactionBytesWrittenPerSec | double | Compaction write throughput (bytes/sec) |
Notes
- The server requires Docker to be installed
- Your AWS profile must be configured (easy-db-lab setup-profile)
- The server runs in the foreground and logs to stdout
- Use Ctrl+C to stop the server
Command Reference
Complete reference for all easy-db-lab commands.
Global Options
| Option | Description |
|---|---|
| --help, -h | Shows help information |
| --vpc-id | Reconstruct state from existing VPC (requires ClusterId tag) |
| --force | Force state reconstruction even if state.json exists |
Setup Commands
setup-profile
Set up user profile interactively.
easy-db-lab setup-profile
Aliases: setup
Guides you through:
- Email and AWS credentials collection
- AWS credential validation
- Key pair generation
- IAM role creation
- Packer VPC infrastructure setup
- AMI validation/building
show-iam-policies
Display IAM policies with your account ID populated.
easy-db-lab show-iam-policies [policy-name]
Aliases: sip
| Argument | Description |
|---|---|
| policy-name | Optional filter: ec2, iam, or emr |
build-image
Build both base and Cassandra AMI images.
easy-db-lab build-image [options]
| Option | Description | Default |
|---|---|---|
| --arch | CPU architecture (AMD64, ARM64) | AMD64 |
| --region | AWS region | (from profile) |
Cluster Lifecycle Commands
init
Initialize a directory for easy-db-lab.
easy-db-lab init [cluster-name] [options]
| Option | Description | Default |
|---|---|---|
| --db, --cassandra, -c | Number of Cassandra instances | 3 |
| --app, --stress, -s | Number of stress instances | 0 |
| --instance, -i | Cassandra instance type | r3.2xlarge |
| --stress-instance, -si | Stress instance type | c7i.2xlarge |
| --azs, -z | Availability zones (e.g., a,b,c) | all |
| --arch, -a | CPU architecture (AMD64, ARM64) | AMD64 |
| --ebs.type | EBS volume type (NONE, gp2, gp3, io1, io2) | NONE |
| --ebs.size | EBS volume size in GB | 256 |
| --ebs.iops | EBS IOPS (gp3 only) | 0 |
| --ebs.throughput | EBS throughput (gp3 only) | 0 |
| --ebs.optimized | Enable EBS optimization | false |
| --until | When instances can be deleted | tomorrow |
| --ami | Override AMI ID | (auto-detected) |
| --open | Unrestricted SSH access | false |
| --tag | Custom tags (key=value, repeatable) | - |
| --vpc | Use existing VPC ID | - |
| --up | Auto-provision after init | false |
| --clean | Remove existing config first | false |
up
Provision AWS infrastructure.
easy-db-lab up [options]
| Option | Description |
|---|---|
| --no-setup, -n | Skip K3s setup and AxonOps configuration |
Creates: VPC, EC2 instances, K3s cluster. Configures the account S3 bucket for this cluster.
down
Shut down AWS infrastructure.
easy-db-lab down [vpc-id] [options]
| Argument | Description |
|---|---|
| vpc-id | Optional: specific VPC to tear down |
| Option | Description |
|---|---|
| --all | Tear down all VPCs tagged with easy_cass_lab |
| --packer | Tear down the packer infrastructure VPC |
| --retention-days N | Days to retain S3 data after teardown (default: 1) |
clean
Clean up generated files from the current directory.
easy-db-lab clean
hosts
List all hosts in the cluster.
easy-db-lab hosts
status
Display full environment status.
easy-db-lab status
Cassandra Commands
All Cassandra commands are available under the cassandra subcommand group.
cassandra use
Select a Cassandra version.
easy-db-lab cassandra use <version> [options]
| Option | Description |
|---|---|
| --java | Java version to use |
| --hosts | Filter to specific hosts |
Versions: 3.0, 3.11, 4.0, 4.1, 5.0, 5.0-HEAD, trunk
cassandra write-config
Generate a new configuration patch file.
easy-db-lab cassandra write-config [filename] [options]
Aliases: wc
| Option | Description | Default |
|---|---|---|
| -t, --tokens | Number of tokens | 4 |
cassandra update-config
Apply configuration patch to all nodes.
easy-db-lab cassandra update-config [options]
Aliases: uc
| Option | Description |
|---|---|
| --restart, -r | Restart Cassandra after applying |
| --hosts | Filter to specific hosts |
cassandra download-config
Download configuration files from nodes.
easy-db-lab cassandra download-config [options]
Aliases: dc
| Option | Description |
|---|---|
| --version | Version to download config for |
cassandra start
Start Cassandra on all nodes.
easy-db-lab cassandra start [options]
| Option | Description | Default |
|---|---|---|
| --sleep | Time between starts in seconds | 120 |
| --hosts | Filter to specific hosts | - |
cassandra stop
Stop Cassandra on all nodes.
easy-db-lab cassandra stop [options]
| Option | Description |
|---|---|
| --hosts | Filter to specific hosts |
cassandra restart
Restart Cassandra on all nodes.
easy-db-lab cassandra restart [options]
| Option | Description |
|---|---|
| --hosts | Filter to specific hosts |
cassandra list
List available Cassandra versions.
easy-db-lab cassandra list
Aliases: ls
Cassandra Stress Commands
Stress testing commands under cassandra stress.
cassandra stress start
Start a stress job on Kubernetes.
easy-db-lab cassandra stress start [options]
Aliases: run
cassandra stress stop
Stop and delete stress jobs.
easy-db-lab cassandra stress stop [options]
cassandra stress status
Check status of stress jobs.
easy-db-lab cassandra stress status
cassandra stress logs
View logs from stress jobs.
easy-db-lab cassandra stress logs [options]
cassandra stress list
List available workloads.
easy-db-lab cassandra stress list
cassandra stress fields
List available field generators.
easy-db-lab cassandra stress fields
cassandra stress info
Show information about a workload.
easy-db-lab cassandra stress info <workload>
Utility Commands
exec
Execute commands on remote hosts via systemd-run. Tool output is captured by the systemd journal and shipped to VictoriaLogs via a dedicated journald OTel collector, with accurate timestamps for cross-service log correlation.
exec run
Run a command on remote hosts (foreground by default).
# Foreground (blocks until complete, shows output)
easy-db-lab exec run -t cassandra -- ls /mnt/db1
# Background (returns immediately, tool keeps running)
easy-db-lab exec run --bg -t cassandra -- inotifywait -m /mnt/db1/data
# Background with custom name
easy-db-lab exec run --bg --name watch-imports -t cassandra -- inotifywait -m /mnt/db1/data
| Option | Description |
|---|---|
| -t, --type | Server type: cassandra, stress, control (default: cassandra) |
| --bg | Run in background (returns immediately) |
| --name | Name for the systemd unit (auto-derived if not provided) |
| --hosts | Filter to specific hosts |
| -p | Execute in parallel across hosts |
exec list
List running background tools on remote hosts.
easy-db-lab exec list
easy-db-lab exec list -t cassandra
exec stop
Stop a named background tool.
easy-db-lab exec stop watch-imports
easy-db-lab exec stop watch-imports -t cassandra
ip
Get IP address for a host by alias.
easy-db-lab ip <alias>
version
Display the easy-db-lab version.
easy-db-lab version
repl
Start interactive REPL.
easy-db-lab repl
server
Start the server for Claude Code integration, REST status endpoints, and live metrics.
easy-db-lab server
See Server for details.
Kubernetes Commands
k8 apply
Apply observability stack to K8s cluster.
easy-db-lab k8 apply
Dashboard Commands
dashboards generate
Extract all Grafana dashboard manifests (core and ClickHouse) from JAR resources to the local k8s/ directory. Useful for rapid dashboard iteration without re-running init.
easy-db-lab dashboards generate
dashboards upload
Apply all Grafana dashboard manifests and the datasource ConfigMap to the K8s cluster. Extracts dashboards, creates the grafana-datasources ConfigMap with runtime configuration, and applies everything.
easy-db-lab dashboards upload
ClickHouse Commands
clickhouse start
Deploy ClickHouse cluster to K8s.
easy-db-lab clickhouse start [options]
clickhouse stop
Stop and remove ClickHouse cluster.
easy-db-lab clickhouse stop
clickhouse status
Check ClickHouse cluster status.
easy-db-lab clickhouse status
Spark Commands
spark submit
Submit Spark job to EMR cluster.
easy-db-lab spark submit [options]
spark status
Check status of a Spark job.
easy-db-lab spark status [options]
spark jobs
List recent Spark jobs on the cluster.
easy-db-lab spark jobs
spark logs
Download EMR logs from S3.
easy-db-lab spark logs [options]
OpenSearch Commands
opensearch start
Create an AWS OpenSearch domain.
easy-db-lab opensearch start [options]
opensearch stop
Delete the OpenSearch domain.
easy-db-lab opensearch stop
opensearch status
Check OpenSearch domain status.
easy-db-lab opensearch status
AWS Commands
aws vpcs
List all easy-db-lab VPCs.
easy-db-lab aws vpcs
Port Reference
This page documents the ports used by easy-db-lab and the services it provisions.
Cassandra Ports
| Port | Purpose |
|---|---|
| 9042 | Cassandra Native Protocol (CQL) |
| 7000 | Inter-node communication |
| 7001 | Inter-node communication (SSL) |
| 7199 | JMX monitoring |
Observability Ports (Control Node)
| Port | Service |
|---|---|
| 3000 | Grafana |
| 4040 | Pyroscope (continuous profiling) |
| 8428 | VictoriaMetrics (metrics storage) |
| 9428 | VictoriaLogs (log storage) |
| 3200 | Tempo (trace storage) |
| 5001 | YACE CloudWatch exporter (Prometheus) |
Cassandra Agent Ports
| Port | Service |
|---|---|
| 9000 | MAAC metrics agent (Prometheus) — Cassandra 4.0, 4.1, 5.0 only |
Observability Ports (All Nodes — DaemonSets)
| Port | Service |
|---|---|
| 4317 | OTel Collector gRPC |
| 4318 | OTel Collector HTTP |
| 9400 | Beyla eBPF metrics (Prometheus) |
| 9435 | ebpf_exporter metrics (Prometheus) |
Server
| Port | Purpose |
|---|---|
| 8080 | Default server port (configurable via --port) |
SSH
SSH access is configured automatically through the sshConfig file generated by source env.sh.
Log Infrastructure
This page documents the centralized logging infrastructure in easy-db-lab, including OTel for log collection and Victoria Logs for storage and querying.
Architecture Overview
┌─────────────────────────────────────────────────────────────┐
│ All Nodes │
├─────────────────────────────────────────────────────────────┤
│ /var/log/* │ journald │
│ /mnt/db1/cassandra/logs/*.log │
│ /mnt/db1/clickhouse/logs/*.log │
│ /mnt/db1/clickhouse/keeper/logs/*.log │
└──────────────────────────┬──────────────────────────────────┘
│
▼
┌────────────────────────┐
│ OTel Collector │
│ (DaemonSet) │ ┌──────────────────┐
│ filelog + journald │◀─────│ EMR Spark JVMs │
│ + OTLP receiver │ OTLP │ (OTel Java Agent│
└───────────┬────────────┘ │ v2.25.0) │
│ └──────────────────┘
┌─────────────────────────┼─────────────────────────┐
│ Control Node │ │
├─────────────────────────┼─────────────────────────┤
│ ▼ │
│ ┌──────────────────┐ │
│ │ Victoria Logs │ │
│ │ (:9428) │ │
│ └────────┬─────────┘ │
└───────────────────────┼────────────────────────────┘
│
▼
┌──────────────────┐
│ easy-db-lab │
│ logs query │
└──────────────────┘
Components
OTel Collector DaemonSet
The OpenTelemetry Collector runs on all nodes as a DaemonSet, collecting:
- System file logs: /var/log/**/*.log, /var/log/messages, /var/log/syslog
- Cassandra logs: /mnt/db1/cassandra/logs/*.log
- ClickHouse server logs: /mnt/db1/clickhouse/logs/*.log
- ClickHouse Keeper logs: /mnt/db1/clickhouse/keeper/logs/*.log
- systemd journal: cassandra, docker, k3s, sshd units
- OTLP: Receives logs from applications via the OTLP protocol
Logs are forwarded to Victoria Logs on the control node via the Elasticsearch-compatible sink.
Spark OTel Java Agent (EMR)
When EMR Spark jobs are running, the Spark driver and executor JVMs are instrumented with the OpenTelemetry Java Agent (v2.25.0) via an EMR bootstrap action. The agent auto-instruments the JVMs and exports logs via OTLP to the control node's OTel Collector.
Logs appear in VictoriaLogs with a service.name attribute like spark-<job-name>, making it easy to filter logs for specific Spark jobs.
The data flow is: Spark JVM → OTel Java Agent → OTLP → OTel Collector (control node) → VictoriaLogs.
Victoria Logs
Victoria Logs runs on the control node and provides:
- Log storage with efficient compression
- LogsQL query language
- HTTP API for querying (port 9428)
Querying Logs
Using the CLI
# Query all logs from last hour
easy-db-lab logs query
# Filter by source
easy-db-lab logs query --source cassandra
easy-db-lab logs query --source clickhouse
easy-db-lab logs query --source systemd
# Filter by host
easy-db-lab logs query --source cassandra --host db0
# Filter by systemd unit
easy-db-lab logs query --source systemd --unit docker.service
# Search for text
easy-db-lab logs query --grep "OutOfMemory"
# Time range and limit
easy-db-lab logs query --since 30m --limit 500
# Raw Victoria Logs query (LogsQL syntax)
easy-db-lab logs query -q 'source:cassandra AND host:db0'
Log Stream Fields
Common fields (all sources):
| Field | Description |
|---|---|
| source | Log source: cassandra, clickhouse, systemd, system |
| host | Hostname (db0, app0, control0) |
| timestamp | Log timestamp |
| message | Log message content |
Source-specific fields:
| Source | Field | Description |
|---|---|---|
| clickhouse | component | server or keeper |
| systemd | unit | systemd unit name |
Troubleshooting
No logs appearing
- Check Victoria Logs is running:
kubectl get pods | grep victoria
- Check the OTel Collector is running:
kubectl get pods | grep otel
- Verify the cluster-config ConfigMap exists:
kubectl get configmap cluster-config -o yaml
Connection errors
The logs query command uses the internal SOCKS5 proxy to connect to Victoria Logs. If you see connection errors:
- Ensure the cluster is running:
easy-db-lab status
- The proxy is started automatically when needed
- Check that the control node is accessible:
ssh control0 hostname
Ports
| Port | Service | Location |
|---|---|---|
| 9428 | Victoria Logs HTTP API | Control node |
OpenTelemetry Instrumentation
easy-db-lab includes optional OpenTelemetry (OTel) instrumentation for distributed tracing and metrics. When enabled, traces and metrics are exported to an OTLP-compatible collector.
CLI Tool Instrumentation
The easy-db-lab CLI tool runs with the OpenTelemetry Java Agent, which automatically instruments:
- AWS SDK calls - EC2, S3, IAM, EMR, STS, OpenSearch operations
- HTTP clients - OkHttp and other HTTP libraries
- JDBC/Cassandra driver - Database operations
- JVM metrics - Memory, threads, garbage collection
Enabling Instrumentation
Set the OTEL_EXPORTER_OTLP_ENDPOINT environment variable to your OTLP collector endpoint:
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
easy-db-lab up
When this environment variable is:
- Set: Traces and metrics are exported via gRPC to the specified endpoint
- Not set: The agent is still loaded but no telemetry is exported (minimal overhead)
The agent uses automatic instrumentation only - there is no custom manual instrumentation in the CLI tool code.
Cluster Node Instrumentation
The following instrumentation applies to cluster nodes (Cassandra, stress, Spark) and is separate from the CLI tool:
Node Role Labeling
The OTel Collector on cluster nodes uses the k8sattributes processor to read the K8s node label type and set it as the node_role resource attribute. This label is used by Grafana dashboards (e.g., System Overview) for hostname and service filtering.
| Node Type | K8s Label | node_role Value | Source |
|---|---|---|---|
| Cassandra | type=db | db | K3s agent config |
| Stress | type=app | app | K3s agent config |
| Control | type=control | control | Up command node labeling |
| Spark/EMR | N/A | spark | EMR OTel Collector resource/role processor |
The k8sattributes processor runs in the metrics/local and logs/local pipelines only. Remote metrics arriving via OTLP (e.g., from Spark nodes) already carry node_role and are not modified.
The processor requires RBAC access to the K8s API. The OTel Collector DaemonSet runs with a dedicated ServiceAccount (otel-collector) that has read-only access to pods and nodes.
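The read-only access could be granted with a ClusterRole along these lines (a sketch only; the names and exact rule set in the shipped manifests may differ):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector
rules:
  # k8sattributes needs to watch pods and nodes to resolve labels
  - apiGroups: [""]
    resources: ["pods", "nodes"]
    verbs: ["get", "list", "watch"]
```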
Stress Job Metrics
When running cassandra-easy-stress as K8s Jobs, metrics are automatically collected via an OTel collector sidecar container. The sidecar scrapes the stress process's Prometheus endpoint (localhost:9500) and forwards metrics via OTLP to the node's OTel DaemonSet, which then exports them to VictoriaMetrics.
The Prometheus scrape job is named cassandra-easy-stress. The following labels are available in Grafana:
| Label | Source | Description |
|---|---|---|
| host_name | DaemonSet resourcedetection processor | K8s node name where the pod runs |
| instance | Sidecar relabel_configs | Node name with port (e.g., ip-10-0-1-50:9500) |
| cluster | Sidecar relabel_configs | Cluster name from cluster-config ConfigMap |
Short-lived stress commands (list, info, fields) do not include the sidecar since they complete quickly and don't produce meaningful metrics.
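A sketch of what the sidecar's Prometheus scrape configuration might look like (the `${NODE_NAME}` and `${CLUSTER_NAME}` placeholders are illustrative; the actual sidecar config is rendered by the tool and may differ):

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: cassandra-easy-stress
          static_configs:
            - targets: ["localhost:9500"]
          relabel_configs:
            # Rewrite instance from localhost:9500 to <node-name>:9500
            - target_label: instance
              replacement: "${NODE_NAME}:9500"
            # Attach the cluster name from the cluster-config ConfigMap
            - target_label: cluster
              replacement: "${CLUSTER_NAME}"
```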
Spark JVM Instrumentation
EMR Spark jobs are auto-instrumented with the OpenTelemetry Java Agent (v2.25.0) and Pyroscope Java Agent (v2.3.0), both installed via an EMR bootstrap action. The OTel agent is activated through spark.driver.extraJavaOptions and spark.executor.extraJavaOptions.
Each EMR node also runs an OTel Collector as a systemd service, collecting host metrics (CPU, memory, disk, network) and receiving OTLP from the Java agents. The collector forwards all telemetry to the control node's OTel Collector via OTLP gRPC.
Key configuration:
- OTel Agent JAR: Downloaded by bootstrap action to /opt/otel/opentelemetry-javaagent.jar
- Pyroscope Agent JAR: Downloaded by bootstrap action to /opt/pyroscope/pyroscope.jar
- OTel Collector: Installed at /opt/otel/otelcol-contrib, runs as otel-collector.service
- Export protocol: OTLP/gRPC to localhost:4317 (local collector), which forwards to the control node
- Logs exporter: OTLP (captures JVM log output)
- Service name: spark-<job-name> (set per job)
- Profiling: CPU, allocation (512k threshold), and lock (10ms threshold) profiles in JFR format sent to the Pyroscope server
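The agent activation described above might look like the following spark-defaults.conf fragment (a sketch; the real bootstrap script also appends exporter and service-name system properties, and the exact values may differ):

```
spark.driver.extraJavaOptions    -javaagent:/opt/otel/opentelemetry-javaagent.jar -javaagent:/opt/pyroscope/pyroscope.jar
spark.executor.extraJavaOptions  -javaagent:/opt/otel/opentelemetry-javaagent.jar -javaagent:/opt/pyroscope/pyroscope.jar
```

Note that a job submitted with its own `spark.driver.extraJavaOptions` replaces these defaults entirely, which is a common reason agents silently go missing.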
Cassandra Sidecar Instrumentation
The Cassandra Sidecar process is instrumented with the OpenTelemetry Java Agent and Pyroscope Java Agent, matching the pattern used for Cassandra itself. Both agents are loaded via -javaagent flags set in /etc/default/cassandra-sidecar, which is written by the setup-instances command.
Key configuration:
- OTel Agent JAR: Installed by Packer to /usr/local/otel/opentelemetry-javaagent.jar
- Pyroscope Agent JAR: Installed by Packer to /usr/local/pyroscope/pyroscope.jar
- Service name: cassandra-sidecar (both OTel and Pyroscope)
- Export endpoint: localhost:4317 (local OTel Collector DaemonSet)
- Profiling: CPU, allocation (512k threshold), and lock (10ms threshold) profiles sent to the Pyroscope server
- Activation: Gated on /etc/default/cassandra-sidecar; the systemd EnvironmentFile=- directive makes the file optional, so the sidecar starts normally without instrumentation if the file doesn't exist
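The optional-file gating relies on systemd's leading-dash syntax. A minimal sketch of the relevant unit directives (the ExecStart path and variable name here are hypothetical):

```
[Service]
# The leading "-" tells systemd to ignore a missing file instead of failing
EnvironmentFile=-/etc/default/cassandra-sidecar
ExecStart=/usr/local/bin/cassandra-sidecar $JVM_OPTS
```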
Tool Runner Log Collection
Commands run via exec run are executed through systemd-run, which captures stdout and stderr to log files under /var/log/easydblab/tools/. The OTel Collector's filelog/tools receiver watches this directory and ships log entries to VictoriaLogs with the attribute source: tool-runner.
This provides automatic log capture for ad-hoc debugging tools (e.g., inotifywait, tcpdump, strace) run during investigations. Logs are queryable in VictoriaLogs and preserved in S3 backups via logs backup.
Key details:
- Log directory: /var/log/easydblab/tools/
- Source attribute: tool-runner (for filtering in VictoriaLogs queries)
- Foreground commands: Output is displayed after completion and also logged to file
- Background commands (--bg): Output is logged to file only; the tool runs as a systemd transient unit
YACE CloudWatch Scrape
YACE (Yet Another CloudWatch Exporter) runs on the control node and scrapes AWS CloudWatch metrics for services used by the cluster. It uses tag-based auto-discovery with the easy_cass_lab=1 tag to find relevant resources.
YACE scrapes metrics for:
- S3 — bucket request/byte counts
- EBS — volume read/write ops and latency
- EC2 — instance CPU, network, disk
- OpenSearch — domain health, indexing, search metrics
EMR metrics are collected directly via OTel Collectors on Spark nodes (see Spark JVM Instrumentation above).
YACE exposes scraped metrics as Prometheus-compatible metrics on port 5001, which are then scraped by the OTel Collector and forwarded to VictoriaMetrics. This replaces the previous CloudWatch datasource in Grafana with a Prometheus-based approach, giving dashboards access to CloudWatch metrics through VictoriaMetrics queries.
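YACE names its exported series following its `aws_<namespace>_<metric>_<statistic>` convention, so CloudWatch-backed dashboards use ordinary PromQL. Assuming the default naming (verify against your YACE version), an example Grafana Explore query:

```
# Average EC2 CPU utilization for all tag-discovered instances
aws_ec2_cpuutilization_average
```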
Resource Attributes
Traces from the CLI tool and cluster nodes include the following resource attributes:
- service.name: Service identifier (e.g., easy-db-lab, cassandra-sidecar, spark-<job-name>)
- service.version: Application version (CLI tool only)
- host.name: Hostname
Configuration
The following environment variables are supported:
| Variable | Description | Default |
|---|---|---|
| OTEL_EXPORTER_OTLP_ENDPOINT | OTLP gRPC endpoint | None (no export) |
| OTEL_SERVICE_NAME | Override service name | easy-db-lab |
| OTEL_RESOURCE_ATTRIBUTES | Additional resource attributes | None |
Additional standard OTel environment variables are supported by the agent. See the OpenTelemetry Java Agent documentation for details.
Example: Using with Jaeger
Start Jaeger with OTLP support:
docker run -d --name jaeger \
-p 16686:16686 \
-p 4317:4317 \
jaegertracing/all-in-one:latest
Export traces to Jaeger:
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
easy-db-lab up
View traces at http://localhost:16686
Example: Using with Grafana Tempo
If you have Grafana Tempo running with OTLP gRPC ingestion:
export OTEL_EXPORTER_OTLP_ENDPOINT=http://tempo:4317
easy-db-lab up
Troubleshooting
No Traces Appearing
- Verify the endpoint is correct and reachable
- Check that the collector accepts gRPC OTLP (port 4317 is standard)
- Look for OpenTelemetry agent logs on startup (use -Dotel.javaagent.debug=true to enable debug logging)
High Latency
Traces are batched before export (default 1 second delay). This is normal and reduces overhead.
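If spans need to show up faster while debugging, the batch delay can be shortened with the standard OTel SDK environment variable (value in milliseconds; check the OpenTelemetry Java Agent docs for your agent version):

```shell
# OTEL_BSP_SCHEDULE_DELAY controls the batch span processor's export interval
export OTEL_BSP_SCHEDULE_DELAY=500
```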
Pyroscope Configuration Parameters
Reference for Pyroscope server configuration. Source: Grafana Pyroscope docs.
How Configuration Works
Pyroscope is configured via a YAML file (-config.file flag) or CLI flags. CLI flags take precedence over YAML values. Environment variables can be used with -config.expand-env=true using ${VAR} or ${VAR:-default} syntax.
View current config at the /config HTTP API endpoint.
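As a sketch, with -config.expand-env=true the storage section can be parameterized with environment variables (the variable names and fallback region are illustrative):

```yaml
storage:
  backend: s3
  s3:
    # ${VAR:-default} falls back when the variable is unset
    region: ${AWS_REGION:-us-west-2}
    bucket_name: ${PYROSCOPE_BUCKET}
```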
Key Configuration Sections
Top-Level
# Modules to load. 'all' enables single-binary mode.
[target: <string> | default = "all"]
api:
[base-url: <string> | default = ""]
Server
HTTP on port 4040 (default), gRPC on port 9095 (default).
server:
[http_listen_address: <string> | default = ""]
[http_listen_port: <int> | default = 4040]
[grpc_listen_port: <int> | default = 9095]
[graceful_shutdown_timeout: <duration> | default = 30s]
[http_server_read_timeout: <duration> | default = 30s]
[http_server_write_timeout: <duration> | default = 30s]
[http_server_idle_timeout: <duration> | default = 2m]
[log_format: <string> | default = "logfmt"] # logfmt or json
[log_level: <string> | default = "info"] # debug, info, warn, error
[grpc_server_max_recv_msg_size: <int> | default = 4194304]
[grpc_server_max_send_msg_size: <int> | default = 4194304]
[grpc_server_max_concurrent_streams: <int> | default = 100]
PyroscopeDB (Local Storage)
pyroscopedb:
# Directory for local storage
[data_path: <string> | default = "./data"]
# Max block duration
[max_block_duration: <duration> | default = 1h]
# Row group target size (uncompressed)
[row_group_target_size: <int> | default = 1342177280]
# Partition label for symbols
[symbols_partition_label: <string> | default = ""]
# Disk retention: minimum free disk (GiB)
[min_free_disk_gb: <int> | default = 10]
# Disk retention: minimum free percentage
[min_disk_available_percentage: <float> | default = 0.05]
# How often to enforce retention
[enforcement_interval: <duration> | default = 5m]
# Disable retention enforcement
[disable_enforcement: <boolean> | default = false]
Storage (Object Storage Backend)
Supported backends: s3, gcs, azure, swift, filesystem, cos.
storage:
[backend: <string> | default = ""]
[prefix: <string> | default = ""]
s3:
[endpoint: <string> | default = ""]
[region: <string> | default = ""]
[bucket_name: <string> | default = ""]
[secret_access_key: <string> | default = ""]
[access_key_id: <string> | default = ""]
[insecure: <boolean> | default = false]
[signature_version: <string> | default = "v4"]
[bucket_lookup_type: <string> | default = "auto"]
# NOTE: native_aws_auth_enabled exists on main but NOT in v1.18.0.
# In v1.18.0, leave access_key_id/secret_access_key empty to use
# the default AWS SDK credential chain (env vars, IMDS).
sse:
[type: <string> | default = ""] # SSE-KMS or SSE-S3
[kms_key_id: <string> | default = ""]
[kms_encryption_context: <string> | default = ""]
gcs:
[bucket_name: <string> | default = ""]
[service_account: <string> | default = ""]
azure:
[account_name: <string> | default = ""]
[account_key: <string> | default = ""]
[container_name: <string> | default = ""]
filesystem:
[dir: <string> | default = "./data-shared"]
Distributor
distributor:
[pushtimeout: <duration> | default = 5s]
ring:
kvstore:
[store: <string> | default = "memberlist"] # consul, etcd, inmemory, memberlist, multi
Ingester
ingester:
lifecycler:
ring:
kvstore:
[store: <string> | default = "consul"]
[heartbeat_timeout: <duration> | default = 1m]
[replication_factor: <int> | default = 1]
[num_tokens: <int> | default = 128]
[heartbeat_period: <duration> | default = 5s]
Querier
querier:
# Time after which queries go to storage instead of ingesters
[query_store_after: <duration> | default = 4h]
Compactor
compactor:
[block_ranges: <list of durations> | default = 1h0m0s,2h0m0s,8h0m0s]
[data_dir: <string> | default = "./data-compactor"]
[compaction_interval: <duration> | default = 30m]
[compaction_concurrency: <int> | default = 1]
[deletion_delay: <duration> | default = 12h]
[downsampler_enabled: <boolean> | default = false]
Limits (Per-Tenant)
limits:
# Ingestion rate limit (MB/s)
[ingestion_rate_mb: <float> | default = 4]
[ingestion_burst_size_mb: <float> | default = 2]
# Label constraints
[max_label_name_length: <int> | default = 1024]
[max_label_value_length: <int> | default = 2048]
[max_label_names_per_series: <int> | default = 30]
# Profile constraints
[max_profile_size_bytes: <int> | default = 4194304]
[max_profile_stacktrace_samples: <int> | default = 16000]
[max_profile_stacktrace_depth: <int> | default = 1000]
# Series limits
[max_global_series_per_tenant: <int> | default = 5000]
# Query limits
[max_query_lookback: <duration> | default = 1w]
[max_query_length: <duration> | default = 1d]
[max_flamegraph_nodes_default: <int> | default = 8192]
[max_flamegraph_nodes_max: <int> | default = 1048576]
# Retention
[compactor_blocks_retention_period: <duration> | default = 0s]
# Ingestion time bounds
[reject_older_than: <duration> | default = 1h]
[reject_newer_than: <duration> | default = 10m]
# Relabeling
[ingestion_relabeling_rules: <list of Configs> | default = []]
[sample_type_relabeling_rules: <list of Configs> | default = []]
Self-Profiling
self_profiling:
# Disable push profiling in single-binary mode
[disable_push: <boolean> | default = false]
[mutex_profile_fraction: <int> | default = 5]
[block_profile_rate: <int> | default = 5]
Memberlist (Gossip)
memberlist:
[bind_port: <int> | default = 7946]
[join_members: <list of strings> | default = []]
[gossip_interval: <duration> | default = 200ms]
[gossip_nodes: <int> | default = 3]
[leave_timeout: <duration> | default = 20s]
Tracing
tracing:
[enabled: <boolean> | default = true]
Multi-Tenancy
# Require X-Scope-OrgId header; false = use "anonymous" tenant
[multitenancy_enabled: <boolean> | default = false]
Embedded Grafana
embedded_grafana:
[data_path: <string> | default = "./data/__embedded_grafana/"]
[listen_port: <int> | default = 4041]
[pyroscope_url: <string> | default = "http://localhost:4040"]
Port Summary
| Service | Port | Protocol |
|---|---|---|
| HTTP API | 4040 | HTTP |
| gRPC | 9095 | gRPC |
| Memberlist gossip | 7946 | TCP/UDP |
| Embedded Grafana | 4041 | HTTP |
Relevant to Our Deployment
Our Pyroscope deployment (configuration/pyroscope/PyroscopeManifestBuilder.kt) uses:
- S3 backend: IAM role auth via IMDS (no explicit credentials; v1.18.0 lacks native_aws_auth_enabled, so the SDK falls back to the default credential chain)
- Single-binary mode (target: all)
- Port 4040 for the HTTP API
- Flat storage prefix: pyroscope.{name}-{id} (Pyroscope rejects / in storage.prefix)
- Config values substituted at build time via TemplateService (__KEY__ placeholders)
- Profiles received from: Java agent (Cassandra, Spark), eBPF agent (all nodes), stress jobs
Spark Observability Debugging
Diagnostic commands for troubleshooting Spark observability on EMR nodes. These require SSH access to the EMR master node (ssh hadoop@<master-public-dns>).
OTel Collector
# Check collector is running
sudo systemctl status otel-collector
# View collector config (verify control node IP)
cat /opt/otel/config.yaml
# Test connectivity to control node collector
curl -s -o /dev/null -w '%{http_code}' http://<control-ip>:4318
Spark Configuration
# Verify -javaagent flags and OTel env vars are present
cat /etc/spark/conf/spark-defaults.conf
# Verify agent JARs exist
ls -la /opt/otel/opentelemetry-javaagent.jar
ls -la /opt/pyroscope/pyroscope.jar
Runtime Verification (while a job is running)
# Confirm agents are attached to Spark JVMs
ps aux | grep javaagent
Pyroscope API (from any node that can reach control0)
# List all label names
curl \
-H "Content-Type: application/json" \
-d '{
"end": '$(date +%s)000',
"start": '$(expr $(date +%s) - 3600)000'
}' \
http://localhost:4040/querier.v1.QuerierService/LabelNames
# List values for a specific label
curl \
-H "Content-Type: application/json" \
-d '{
"end": '$(date +%s)000',
"name": "hostname",
"start": '$(expr $(date +%s) - 3600)000'
}' \
http://localhost:4040/querier.v1.QuerierService/LabelValues
# Diff two profiles (compare workloads)
# POST to /querier.v1.QuerierService/Diff with left/right profile selectors
# See: left.labelSelector, right.labelSelector, profileTypeID, start/end
Grafana Explore Queries
# All metrics from Spark nodes
{node_role="spark"}
# JVM metrics only
{node_role="spark", __name__=~"jvm_.*"}
# List distinct JVM metric names
group({node_role="spark", __name__=~"jvm_.*"}) by (__name__)
# Filesystem usage (raw)
system_filesystem_usage_bytes{state="used", node_role="spark", mountpoint="/"}
JFR Format Reference
The Java Flight Recorder format is used by JVM-based profilers and supported by the Pyroscope Java integration.
When JFR format is used, query parameters behave differently:
- format should be set to jfr
- name contains the prefix of the application name. Since a single request may contain multiple profile types, the final application name is created by concatenating this prefix and the profile type. For example, if you send CPU profiling data and set name to my-app{}, it will appear in Pyroscope as my-app.cpu{}
- units is ignored; the actual units depend on the profile types in the data
- aggregationType is ignored; the actual aggregation type depends on the profile types in the data
Supported JFR Profile Types
- cpu: samples from runnable threads only
- itimer: similar to cpu profiling
- wall: samples from any thread regardless of state
- alloc_in_new_tlab_objects: number of new TLAB objects created
- alloc_in_new_tlab_bytes: size in bytes of new TLAB objects created
- alloc_outside_tlab_objects: number of new allocated objects outside any TLAB
- alloc_outside_tlab_bytes: size in bytes of new allocated objects outside any TLAB
JFR with Dynamic Labels
To ingest JFR data with dynamic labels:
- Use the multipart/form-data Content-Type
- Send the JFR data in a form file field called jfr
- Send a LabelsSnapshot protobuf message in a form file field called labels
message Context {
// string_id -> string_id
map<int64, int64> labels = 1;
}
message LabelsSnapshot {
// context_id -> Context
map<int64, Context> contexts = 1;
// string_id -> string
map<int64, string> strings = 2;
}
Where context_id is a parameter set in async-profiler.
Ingestion Examples
Simple profile upload:
printf "foo;bar 100\n foo;baz 200" | curl \
-X POST \
--data-binary @- \
'http://localhost:4040/ingest?name=curl-test-app&from=1615709120&until=1615709130'
JFR profile with labels:
curl -X POST \
-F jfr=@profile.jfr \
-F labels=@labels.pb \
"http://localhost:4040/ingest?name=curl-test-app&units=samples&aggregationType=sum&sampleRate=100&from=1655834200&until=1655834210&spyName=javaspy&format=jfr"
Future: Ad-hoc Profiling with async-profiler
async-profiler can capture JFR profiles on demand and upload them to Pyroscope with labels. This enables targeted profiling of specific Spark jobs or Cassandra operations to inspect exactly what is happening at the JVM level.
Common Issues
- No JVM metrics: Check ps aux | grep javaagent. If -javaagent flags are missing, spark.driver.extraJavaOptions may have been overridden at job submission time (which replaces the spark-defaults.conf entry entirely).
- Collector retry errors at startup: Normal if the control node collector isn't ready yet; should stabilize within a minute.
- Spark profiles missing hostname label: The PYROSCOPE_LABELS env var must be set via the spark-env classification with hostname=$(hostname -s).
Development Overview
Hello there. If you're reading this, you've probably decided to contribute to easy-db-lab or use the tools for your own work. Very cool.
Dev Containers (Recommended)
Dev Containers are the preferred method for developing easy-db-lab. They provide a consistent, pre-configured environment with all required tools installed:
- Java 21 (Temurin) via SDKMAN
- Kotlin and Gradle
- MkDocs for documentation
- Docker-in-Docker for container operations
- Claude Code for AI-assisted development
- zsh with Powerlevel10k theme
VS Code
- Install the Dev Containers extension
- Open the project folder
- Click "Reopen in Container" when prompted
JetBrains IDEs
- Install the Dev Containers plugin
- Open the project and select "Dev Containers" from the remote development options
CLI with bin/dev
The bin/dev script provides a convenient wrapper for dev container management:
bin/dev start # Start the dev container
bin/dev shell # Open interactive shell
bin/dev test # Run Gradle tests
bin/dev docs-serve # Serve docs with live reload
bin/dev claude # Start Claude Code
bin/dev status # Show container status
bin/dev down # Stop and remove container
To mount your Claude Code configuration (for AI-assisted development):
ENABLE_CLAUDE=1 bin/dev start
Run bin/dev help for all available commands.
Building the Project
Once inside the container (or with local tools installed):
./gradlew assemble
./gradlew test
Documentation Preview
Preview documentation locally with live reload:
bin/dev docs-serve
Then open http://localhost:8000 in your browser.
Project Structure
easy-db-lab is broken into several subprojects:
- Docker containers (prefixed with docker-)
- Documentation (the manual you're reading now)
- Utility code for downloading artifacts
Architecture
The project follows a layered architecture:
Commands (PicoCLI) → Services → External Systems (K8s, AWS, Filesystem)
Layer Responsibilities
- Commands (commands/): Lightweight PicoCLI execution units
- Services (services/, providers/): Business logic layer
For more details, see the project's CLAUDE.md file.
Docker Development
Building Docker Containers
Each container is versioned and can be built locally using the following:
./gradlew :PROJECT-NAME:buildDocker
Where PROJECT-NAME is one of the subproject directories you see in the top level.
Setup
We recommend updating your local Docker service to use 8GB of memory. This is necessary when running dashboard previews locally. The preview is configured to run multiple Cassandra containers at once.
Available Docker Projects
Check the root project directory for subprojects prefixed with docker- to see available containerized components.
Local Testing
To test containers locally:
- Build the container: ./gradlew :docker-cassandra:buildDocker
- Run the container: docker run -it <image-name>
Memory Requirements
| Use Case | Recommended Memory |
|---|---|
| Single container development | 4GB |
| Dashboard preview (multiple containers) | 8GB |
| Full integration testing | 16GB |
Publishing
Pre-Release Checklist
- First check CI to ensure the build is clean and green
- Ensure the following environment variables are set: DOCKER_USERNAME, DOCKER_PASSWORD, DOCKER_EMAIL
Publishing Steps
Build and Upload
./gradlew buildAll uploadAll
Post-Release
After publishing, bump the version in build.gradle.kts.
Container Publishing
Containers are automatically published to GitHub Container Registry (ghcr.io) when:
- A version tag (v*) is pushed
- PR Checks pass on main branch
See .github/workflows/publish-container.yml for details.
Documentation
Documentation is automatically built and deployed via GitHub Actions when changes are pushed to the docs/ directory on the main branch.
Testing Guidelines
This document outlines the testing standards and practices for the easy-db-lab project.
Core Testing Principles
1. Use BaseKoinTest for Dependency Injection
All tests should extend BaseKoinTest to take advantage of automatic dependency injection setup and teardown.
class MyCommandTest : BaseKoinTest() {
// Your test code here
}
BaseKoinTest provides:
- Automatic Koin lifecycle management
- Core modules that are always mocked (AWS, SSH, OutputHandler)
- Ability to add test-specific modules via additionalTestModules()
2. Use AssertJ for Assertions
Tests should use AssertJ assertions, not JUnit assertions. AssertJ provides more readable and powerful assertion methods.
// Good - AssertJ style
import org.assertj.core.api.Assertions.assertThat
assertThat(result).isNotNull()
assertThat(result.value).isEqualTo("expected")
assertThat(list).hasSize(3).contains("item1", "item2")
// Avoid - JUnit style
import org.junit.jupiter.api.Assertions.assertEquals
assertEquals("expected", result.value)
3. Create Custom Assertions for Non-Trivial Classes
When testing non-trivial classes, create custom AssertJ assertions to implement Domain-Driven Design in tests. This decouples business logic from implementation details and makes tests more maintainable during refactoring.
Custom Assertions Pattern
Custom assertions provide a fluent, domain-specific language for testing that improves readability and maintainability.
Example: Custom Assertion for a Domain Class
Here's a complete example showing how to create and use custom assertions:
// Domain class to be tested
data class CassandraNode(
val nodeId: String,
val datacenter: String,
val rack: String,
val status: NodeStatus,
val tokens: Int
)
enum class NodeStatus {
UP, DOWN, JOINING, LEAVING
}
// Custom assertion class
import org.assertj.core.api.AbstractAssert
class CassandraNodeAssert(actual: CassandraNode?) :
AbstractAssert<CassandraNodeAssert, CassandraNode>(actual, CassandraNodeAssert::class.java) {
companion object {
fun assertThat(actual: CassandraNode?): CassandraNodeAssert {
return CassandraNodeAssert(actual)
}
}
fun hasNodeId(nodeId: String): CassandraNodeAssert {
isNotNull
if (actual.nodeId != nodeId) {
failWithMessage("Expected node ID to be <%s> but was <%s>", nodeId, actual.nodeId)
}
return this
}
fun isInDatacenter(datacenter: String): CassandraNodeAssert {
isNotNull
if (actual.datacenter != datacenter) {
failWithMessage("Expected datacenter to be <%s> but was <%s>", datacenter, actual.datacenter)
}
return this
}
fun hasStatus(status: NodeStatus): CassandraNodeAssert {
isNotNull
if (actual.status != status) {
failWithMessage("Expected status to be <%s> but was <%s>", status, actual.status)
}
return this
}
fun isUp(): CassandraNodeAssert {
return hasStatus(NodeStatus.UP)
}
fun isDown(): CassandraNodeAssert {
return hasStatus(NodeStatus.DOWN)
}
fun hasTokenCount(tokens: Int): CassandraNodeAssert {
isNotNull
if (actual.tokens != tokens) {
failWithMessage("Expected token count to be <%s> but was <%s>", tokens, actual.tokens)
}
return this
}
}
// Usage in tests
import CassandraNodeAssert.Companion.assertThat
@Test
fun `test cassandra node configuration`() {
val node = CassandraNode(
nodeId = "node1",
datacenter = "dc1",
rack = "rack1",
status = NodeStatus.UP,
tokens = 256
)
// Fluent assertions with domain language
assertThat(node)
.hasNodeId("node1")
.isInDatacenter("dc1")
.isUp()
.hasTokenCount(256)
}
Project-Wide Assertions Helper
Create a central assertions class to provide access to all custom assertions:
// MyProjectAssertions.kt
object MyProjectAssertions {
// Cassandra domain assertions
fun assertThat(actual: CassandraNode?): CassandraNodeAssert {
return CassandraNodeAssert(actual)
}
fun assertThat(actual: Host?): HostAssert {
return HostAssert(actual)
}
fun assertThat(actual: TFState?): TFStateAssert {
return TFStateAssert(actual)
}
// Add more domain assertions as needed
}
Then import statically in tests:
import com.rustyrazorblade.easydblab.assertions.MyProjectAssertions.assertThat
@Test
fun `test complex scenario`() {
val node = createTestNode()
val host = createTestHost()
// All domain assertions available through single import
assertThat(node).isUp()
assertThat(host).hasPrivateIp("10.0.0.1")
}
Benefits of Custom Assertions
- Domain-Driven Design: Tests use business language, not implementation details
- Refactoring Safety: Changes to class internals don't break test logic
- Readability: Tests read like specifications
- Reusability: Common assertions are centralized
- Maintainability: Single place to update assertion logic
- Type Safety: Compile-time checking of assertion methods
When to Create Custom Assertions
Create custom assertions for:
- Domain entities (e.g., Host, TFState, CassandraNode)
- Complex value objects with multiple properties
- Classes that appear in multiple test scenarios
- Any class where you find yourself writing repetitive assertion code
Testing Best Practices
- Test Names: Use descriptive names with backticks

  @Test
  fun `should start cassandra node when status is DOWN`() { }

- Test Structure: Follow the Arrange-Act-Assert pattern

  @Test
  fun `test node startup`() {
      // Arrange
      val node = createTestNode(status = NodeStatus.DOWN)

      // Act
      val result = nodeManager.startNode(node)

      // Assert
      assertThat(result).isUp()
  }

- Mock External Dependencies: Always mock AWS, SSH, and other external services

  class MyTest : BaseKoinTest() {
      override fun additionalTestModules() = listOf(
          module {
              single { mockRemoteOperationsService() }
          }
      )
  }

- Test Edge Cases: Include tests for error conditions and boundary cases

- Keep Tests Focused: Each test should verify one specific behavior
Testing Interactive Commands with TestPrompter
Commands that require user input (like setup-profile) can be tested deterministically using TestPrompter. This test utility replaces the real Prompter interface and returns predefined responses.
Basic Usage
class MyCommandTest : BaseKoinTest() {
private lateinit var testPrompter: TestPrompter
override fun additionalTestModules() = listOf(
module {
single<Prompter> { testPrompter }
}
)
@BeforeEach
fun setup() {
// Configure responses - keys can be exact matches or partial matches
testPrompter = TestPrompter(
mapOf(
"email" to "test@example.com",
"region" to "us-west-2",
"AWS Access Key" to "AKIAIOSFODNN7EXAMPLE",
)
)
}
@Test
fun `should collect user credentials`() {
// Run command that prompts for input
val command = SetupProfile()
command.call()
// Verify prompts were called
assertThat(testPrompter.wasPromptedFor("email")).isTrue()
assertThat(testPrompter.wasPromptedFor("region")).isTrue()
}
}
Response Matching
TestPrompter supports two matching strategies:
- Exact match: The question text matches a key exactly
- Partial match: The question text contains the key (case-insensitive)
val prompter = TestPrompter(
mapOf(
// Exact match - only matches "email" exactly
"email" to "test@example.com",
// Partial match - matches any question containing "AWS Profile"
"AWS Profile" to "my-profile",
)
)
Sequential Responses for Retry Testing
For testing retry logic (e.g., credential validation failures), use addSequentialResponses():
@Test
fun `should retry on invalid credentials`() {
testPrompter = TestPrompter()
// First call returns invalid credentials, second returns valid ones
testPrompter.addSequentialResponses(
"AWS Access Key",
"invalid-key", // First attempt
"AKIAVALIDKEY123" // Second attempt (after retry)
)
testPrompter.addSequentialResponses(
"AWS Secret",
"invalid-secret",
"valid-secret-key"
)
val command = SetupProfile()
command.call()
// Verify the command handled retry correctly
val callLog = testPrompter.getCallLog()
val accessKeyCalls = callLog.filter { it.question.contains("Access Key") }
assertThat(accessKeyCalls).hasSize(2)
}
Verifying Prompt Behavior
TestPrompter records all prompt calls for verification:
@Test
fun `should not prompt for credentials when using AWS profile`() {
testPrompter = TestPrompter(
mapOf(
"AWS Profile" to "my-profile", // Non-empty = use profile auth
)
)
val command = SetupProfile()
command.call()
// Verify credential prompts were skipped
assertThat(testPrompter.wasPromptedFor("Access Key")).isFalse()
assertThat(testPrompter.wasPromptedFor("Secret")).isFalse()
// Check detailed call log
val callLog = testPrompter.getCallLog()
assertThat(callLog).anyMatch { it.question.contains("email") }
}
TestPrompter API Reference
| Method | Description |
|---|---|
| prompt(question, default, secret) | Returns configured response or default |
| addSequentialResponses(key, vararg responses) | Configure different responses for retry scenarios |
| getCallLog() | Returns list of all prompt calls with details |
| wasPromptedFor(questionContains) | Check if any prompt contained the given text |
| clear() | Reset call log and sequential state |
PromptCall Data Class
Each recorded call contains:
- question: The prompt question text
- default: The default value offered
- secret: Whether input was masked (for passwords)
- returnedValue: The value that was returned
Additional Resources
End-to-End Testing
easy-db-lab includes a comprehensive end-to-end test suite that validates the entire workflow from provisioning to teardown.
Running the Test
The end-to-end test is located at bin/end-to-end-test:
```bash
./bin/end-to-end-test --cassandra
```
Command-Line Options
Feature Flags
| Flag | Description |
|---|---|
| `--cassandra` | Enable Cassandra-specific tests |
| `--spark` | Enable Spark EMR provisioning and tests |
| `--clickhouse` | Enable ClickHouse deployment and tests |
| `--opensearch` | Enable OpenSearch deployment and tests |
| `--all` | Enable all optional features |
| `--ebs` | Enable EBS volumes (gp3, 256GB) |
| `--build` | Build Packer images (default: skip) |
Testing and Inspection
| Flag | Description |
|---|---|
| `--list-steps`, `-l` | List all test steps without running |
| `--break <steps>` | Set breakpoints at specific steps (comma-separated) |
| `--wait` | Run all steps except teardown, then wait for confirmation |
Examples
```bash
# List all available test steps
./bin/end-to-end-test --list-steps

# Run full test with all features
./bin/end-to-end-test --all

# Run with Cassandra and pause before teardown
./bin/end-to-end-test --cassandra --wait

# Run with breakpoints at steps 5 and 15
./bin/end-to-end-test --cassandra --break 5,15

# Build custom AMI images and run test
./bin/end-to-end-test --build --cassandra
```
Test Steps
The test executes approximately 27 steps covering:
Infrastructure
- Build project
- Check version command
- Build packer images (optional)
- Set IAM policies
- Initialize cluster
- Setup kubectl
- Wait for K3s ready
- Verify K3s cluster
Registry and Storage
- Test registry push/pull
- List hosts
- Verify S3 backup
Cassandra
- Setup Cassandra
- Verify Cassandra backup
- Verify restore
- Cassandra start/stop cycle
- Test SSH and nodetool
- Check Sidecar
- Test exec command
- Run stress test
- Run stress K8s test
Optional Services
- Submit Spark job (if `--spark`)
- Check Spark status (if `--spark`)
- Start ClickHouse (if `--clickhouse`)
- Test ClickHouse (if `--clickhouse`)
- Stop ClickHouse (if `--clickhouse`)
- Start OpenSearch (if `--opensearch`)
- Test OpenSearch (if `--opensearch`)
- Stop OpenSearch (if `--opensearch`)
Observability and Cleanup
- Test observability stack
- Teardown cluster
Error Handling
When a test step fails, an interactive menu appears:
- Retry from failed step - Resume from the point of failure
- Start a shell session - Opens a shell with:
  - `easy-db-lab` commands available
  - `rebuild` - Rebuild just the project
  - `rerun` - Rebuild and resume from failed step
- Tear down environment - Run `easy-db-lab down --yes`
- Exit - Exit the script
AWS Requirements
The test requires:
- AWS profile with sufficient permissions
- VPC and subnet configuration
- S3 bucket for backups and logs
Default Configuration
- Instance count: 3 nodes
- Instance type: c5.2xlarge
- Cassandra version: 5.0 (when enabled)
- Spark workers: 2 (when enabled)
CI Integration
The end-to-end test is designed to run in CI environments:
- Supports non-interactive mode
- Returns appropriate exit codes
- Provides detailed logging
- Cleans up resources on failure
Spark Development
This guide covers developing and testing Spark-related functionality in easy-db-lab.
Project Structure
All Spark modules live under spark/ with shared configuration:
- `spark/common/` — Shared config (`SparkJobConfig`), data generation (`BulkTestDataGenerator`), CQL setup
- `spark/bulk-writer-sidecar/` — Cassandra Analytics, direct sidecar transport (`DirectBulkWriter`)
- `spark/bulk-writer-s3/` — Cassandra Analytics, S3 staging transport (`S3BulkWriter`)
- `spark/connector-writer/` — Standard Spark Cassandra Connector (`StandardConnectorWriter`)
- `spark/connector-read-write/` — Read→transform→write example (`KeyValuePrefixCount`)
Gradle modules use nested paths: `:spark:common`, `:spark:bulk-writer-sidecar`, etc.
Prerequisites
The bulk-writer modules depend on Apache Cassandra Analytics, which requires JDK 11 to build.
One-Time Setup
```bash
bin/build-cassandra-analytics
```
Options:
- `--force` - Rebuild even if already built
- `--branch <branch>` - Use a specific branch (default: trunk)
Building
```bash
# Build all Spark modules
./gradlew :spark:bulk-writer-sidecar:shadowJar :spark:bulk-writer-s3:shadowJar \
    :spark:connector-writer:shadowJar :spark:connector-read-write:shadowJar

# Build individually
./gradlew :spark:bulk-writer-sidecar:shadowJar
./gradlew :spark:connector-writer:shadowJar

# Output locations
ls spark/bulk-writer-sidecar/build/libs/bulk-writer-sidecar-*.jar
ls spark/connector-writer/build/libs/connector-writer-*.jar
```
Shadow JARs include all dependencies except Spark (provided by EMR).
Running Tests
Main project tests exclude bulk-writer modules to avoid requiring cassandra-analytics:
```bash
./gradlew :test
```
Testing with a Live Cluster
Using bin/spark-bulk-write
This script handles JAR lookup, host resolution, and health checks:
```bash
# From a cluster directory (where state.json exists)
spark-bulk-write direct --rows 10000
spark-bulk-write s3 --rows 1000000 --parallelism 20
spark-bulk-write connector --keyspace myks --table mytable
```
Using bin/submit-direct-bulk-writer
Simplified script for direct bulk writer testing:
```bash
bin/submit-direct-bulk-writer [rowCount] [parallelism] [partitionCount] [replicationFactor]
```
Manual Spark Job Submission
All modules use the unified `spark.easydblab.*` configuration namespace:

```bash
easy-db-lab spark submit \
  --jar spark/bulk-writer-sidecar/build/libs/bulk-writer-sidecar-*.jar \
  --main-class com.rustyrazorblade.easydblab.spark.DirectBulkWriter \
  --conf spark.easydblab.contactPoints=host1,host2 \
  --conf spark.easydblab.keyspace=bulk_test \
  --conf spark.easydblab.localDc=us-west-2 \
  --conf spark.easydblab.rowCount=1000 \
  --conf spark.easydblab.replicationFactor=1 \
  --wait
```
Debugging Failed Jobs
When a Spark job fails, easy-db-lab automatically queries logs and displays failure details.
Manual Log Retrieval
```bash
easy-db-lab spark logs --step-id <step-id>
easy-db-lab spark status --step-id <step-id>
easy-db-lab spark jobs
```
Direct S3 Access
Logs are stored at `s3://<bucket>/spark/emr-logs/<cluster-id>/steps/<step-id>/`:

```bash
aws s3 cp s3://<bucket>/spark/emr-logs/<cluster-id>/steps/<step-id>/stderr.gz - | gunzip
```
Adding a New Spark Module
1. Create a directory under `spark/` (e.g., `spark/bulk-reader/`)
2. Add `build.gradle.kts` — use an existing module as a template
3. Add `include "spark:bulk-reader"` to `settings.gradle`
4. Depend on `:spark:common` for shared config
5. Use `SparkJobConfig.load(sparkConf)` for configuration
6. Implement your main class and submit via `easy-db-lab spark submit`
Architecture Notes
Shared Configuration
SparkJobConfig in spark/common provides:
- Property constants (`PROP_CONTACT_POINTS`, etc.)
- Config loading from `SparkConf` with validation
- Schema setup via `CqlSetup`
- Consistent defaults across all modules
Why Shadow JAR?
Bulk-writer modules use the Gradle Shadow plugin because:
- EMR provides Spark, so those dependencies are `compileOnly`
- Cassandra Analytics has many transitive dependencies
- `mergeServiceFiles()` properly handles `META-INF/services` for SPI
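The points above can be sketched as a `build.gradle.kts` fragment. This is an illustrative sketch, not the project's actual build file — the plugin id and dependency coordinates are placeholders:

```kotlin
// Hypothetical build.gradle.kts sketch; coordinates/versions are placeholders.
plugins {
    kotlin("jvm")
    id("com.github.johnrengelman.shadow") // Shadow plugin for fat JARs
}

dependencies {
    // EMR supplies Spark at runtime, so Spark must not be bundled.
    compileOnly("org.apache.spark:spark-sql_2.12:3.5.0")
    implementation(project(":spark:common"))
}

tasks.shadowJar {
    // Merge META-INF/services entries so SPI lookups still resolve
    // after the transitive Cassandra Analytics jars are combined.
    mergeServiceFiles()
}
```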
Cassandra Analytics Modules
Some cassandra-analytics modules aren't published to Maven:
- `five-zero.jar` - Cassandra 5.0 bridge
- `five-zero-bridge.jar` - Bridge implementation
- `five-zero-types.jar` - Type converters
- `five-zero-sparksql.jar` - SparkSQL integration
These are referenced directly from the `.cassandra-analytics/` build output.
SOCKS Proxy Architecture
This document describes the internal SOCKS5 proxy implementation used by easy-db-lab for programmatic access to private cluster resources.
Overview
easy-db-lab has two separate proxy systems:
| Proxy | Purpose | Implementation |
|---|---|---|
| Shell Proxy | User shell commands (kubectl, curl) | SSH CLI (ssh -D) via env.sh |
| JVM Proxy | Internal Kotlin/Java code | Apache MINA SSH library |
This document covers the JVM Proxy used internally by easy-db-lab.
Why Two Proxies?
The shell proxy (started by `source env.sh`) works for command-line tools that respect `HTTPS_PROXY` environment variables. However, JVM code needs programmatic proxy configuration:
- Java's `HttpClient` requires a `ProxySelector` instance
- The Cassandra driver needs SOCKS5 configuration at the Netty level
- The Kubernetes fabric8 client needs proxy settings
- Operations should work without requiring users to run `source env.sh` first
Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                        easy-db-lab JVM                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌────────────────────┐      ┌────────────────────────┐         │
│  │ SocksProxyService  │      │ProxiedHttpClientFactory│         │
│  │    (interface)     │      │                        │         │
│  └─────────┬──────────┘      └───────────┬────────────┘         │
│            │                             │                      │
│            ▼                             ▼                      │
│  ┌──────────────────────┐    ┌────────────────────────┐         │
│  │ MinaSocksProxyService│    │  SocksProxySelector    │         │
│  │  (Apache MINA impl)  │    │ (custom ProxySelector) │         │
│  └─────────┬────────────┘    └────────────────────────┘         │
│            │                                                    │
│            ▼                                                    │
│  ┌──────────────────────┐                                       │
│  │ SSHConnectionProvider│                                       │
│  │(manages SSH sessions)│                                       │
│  └─────────┬────────────┘                                       │
│            │                                                    │
└────────────┼────────────────────────────────────────────────────┘
             │
             ▼  SSH Dynamic Port Forwarding
     ┌──────────────────┐
     │   Control Node   │
     │    (control0)    │
     └──────────────────┘
```
Key Classes
SocksProxyService
Location: com.rustyrazorblade.easydblab.proxy.SocksProxyService
Interface defining proxy operations:
```kotlin
interface SocksProxyService {
    fun ensureRunning(gatewayHost: ClusterHost): SocksProxyState

    fun start(gatewayHost: ClusterHost): SocksProxyState

    fun stop()

    fun isRunning(): Boolean

    fun getState(): SocksProxyState?

    fun getLocalPort(): Int
}
```
MinaSocksProxyService
Location: com.rustyrazorblade.easydblab.proxy.MinaSocksProxyService
Apache MINA-based implementation that:
- Establishes an SSH connection to the gateway host
- Starts dynamic port forwarding on a random available port
- Maintains thread-safe state for concurrent access
Key implementation details:
- Uses `ReentrantLock` for thread safety
- Dynamically finds an available port via `ServerSocket(0)`
- Extracts the underlying `ClientSession` from the SSH client for port forwarding
- Supports idempotent `ensureRunning()` for reuse across operations
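The lock-protected, idempotent start described above can be sketched in isolation. This is a simplified illustration of the pattern only — there is no SSH forwarding here, just the state handling and `ServerSocket(0)` port allocation:

```kotlin
import java.net.ServerSocket
import java.util.concurrent.locks.ReentrantLock
import kotlin.concurrent.withLock

// Simplified sketch of an idempotent, lock-protected proxy start.
// The real MinaSocksProxyService opens SSH dynamic port forwarding;
// here we only model the state machine.
class SketchSocksProxy {
    private val lock = ReentrantLock()
    private var localPort: Int? = null

    // Idempotent: returns the existing port if the proxy is already "running".
    fun ensureRunning(): Int = lock.withLock { localPort ?: start() }

    private fun start(): Int {
        // ServerSocket(0) asks the OS for any free port, mirroring how the
        // real service avoids bind conflicts.
        val port = ServerSocket(0).use { it.localPort }
        localPort = port
        return port
    }

    fun isRunning(): Boolean = lock.withLock { localPort != null }
}

fun main() {
    val proxy = SketchSocksProxy()
    val first = proxy.ensureRunning()
    val second = proxy.ensureRunning()
    check(first == second) // idempotent: repeated calls reuse the same port
}
```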
ProxiedHttpClientFactory
Location: com.rustyrazorblade.easydblab.proxy.ProxiedHttpClientFactory
Creates java.net.http.HttpClient instances configured for SOCKS5 proxy:
```kotlin
class ProxiedHttpClientFactory(
    private val socksProxyService: SocksProxyService,
) : HttpClientFactory {
    override fun createClient(): HttpClient {
        val proxyPort = socksProxyService.getLocalPort()
        val proxySelector = SocksProxySelector(proxyPort)

        return HttpClient
            .newBuilder()
            .proxy(proxySelector)
            .connectTimeout(CONNECTION_TIMEOUT)
            .build()
    }
}
```
SocksProxySelector
Location: com.rustyrazorblade.easydblab.proxy.ProxiedHttpClientFactory (private class)
Custom ProxySelector that returns a SOCKS5 proxy for all URIs:
```kotlin
private class SocksProxySelector(
    private val proxyPort: Int,
) : ProxySelector() {
    private val proxy = Proxy(Proxy.Type.SOCKS, InetSocketAddress("localhost", proxyPort))

    override fun select(uri: URI?): List<Proxy> = listOf(proxy)

    override fun connectFailed(uri: URI?, sa: SocketAddress?, ioe: IOException?) {
        // Handle connection failures if needed
    }
}
```
Important: Java's `ProxySelector.of()` creates HTTP proxies, not SOCKS5. This custom implementation is required for SSH dynamic port forwarding.
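This distinction is easy to verify with the JDK alone — `ProxySelector.of()` yields `Proxy.Type.HTTP`, while manual construction gives the `SOCKS` type the tunnel needs:

```kotlin
import java.net.InetSocketAddress
import java.net.Proxy
import java.net.ProxySelector
import java.net.URI

fun main() {
    val addr = InetSocketAddress("localhost", 1080)

    // JDK factory: produces an HTTP proxy, which breaks SSH dynamic forwarding.
    val httpType = ProxySelector.of(addr).select(URI.create("http://example.com")).first().type()
    println(httpType) // HTTP

    // Manual construction: the type required for a SOCKS5 tunnel.
    val socksType = Proxy(Proxy.Type.SOCKS, addr).type()
    println(socksType) // SOCKS
}
```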
SocksProxyNettyOptions
Location: com.rustyrazorblade.easydblab.driver.SocksProxyNettyOptions
Configures the Cassandra driver to use SOCKS5 proxy at the Netty level for CQL connections.
Dependency Injection
The proxy components are registered in ProxyModule:
```kotlin
val proxyModule = module {
    // Singleton - maintains proxy state across requests
    single<SocksProxyService> { MinaSocksProxyService(get()) }

    // Factory for creating proxied HTTP clients
    single<HttpClientFactory> { ProxiedHttpClientFactory(get()) }
}
```
Usage Patterns
Querying Victoria Logs
```kotlin
class DefaultVictoriaLogsService(
    private val socksProxyService: SocksProxyService,
    private val httpClientFactory: HttpClientFactory,
) : VictoriaLogsService {
    override fun query(...): Result<List<String>> = runCatching {
        // Ensure proxy is running to control node
        socksProxyService.ensureRunning(controlHost)

        // Create HTTP client that routes through proxy
        val httpClient = httpClientFactory.createClient()

        // Make request to private IP
        val request = HttpRequest.newBuilder()
            .uri(URI.create("http://${controlHost.privateIp}:9428/..."))
            .build()
        httpClient.send(request, BodyHandlers.ofString())
    }
}
```
Kubernetes API Access
The K8sService uses the proxy for fabric8 Kubernetes client connections to the private K3s API server.
CQL Sessions
The CqlSessionFactory configures the Cassandra driver with SOCKS5 proxy settings via SocksProxyNettyOptions.
Lifecycle
CLI Mode
In CLI mode (single command execution):
- Service starts proxy when needed
- Operations complete
- Proxy remains running for subsequent operations in same process
Server/MCP Mode
In server mode (long-running process):
- Proxy starts on first request requiring cluster access
- Reused across multiple requests (connection count tracked)
- Stopped on server shutdown
Thread Safety
MinaSocksProxyService uses a ReentrantLock to protect:
- Proxy state changes
- Session management
- Port allocation
This ensures safe concurrent access when multiple threads need cluster resources.
Error Handling
Common failure scenarios:
| Error | Cause | Resolution |
|---|---|---|
| "HTTP/1.1 header parser received no bytes" | Using HTTP proxy instead of SOCKS5 | Ensure SocksProxySelector returns Proxy.Type.SOCKS |
| Connection timeout | Control node not accessible | Verify SSH connectivity to control0 |
| Port bind failure | Port already in use | Service automatically finds available port |
Testing
When testing code that uses the proxy:
```kotlin
class MyServiceTest : BaseKoinTest() {
    // BaseKoinTest provides mocked SocksProxyService

    @Test
    fun testWithMockedProxy() {
        val mockProxyService = mock<SocksProxyService>()
        whenever(mockProxyService.getLocalPort()).thenReturn(1080)
        // Test your service with mocked proxy
    }
}
```
Related Files
| File | Purpose |
|---|---|
| `proxy/SocksProxyService.kt` | Interface definition |
| `proxy/MinaSocksProxyService.kt` | Apache MINA implementation |
| `proxy/ProxiedHttpClientFactory.kt` | HTTP client factory with SOCKS5 |
| `proxy/ProxyModule.kt` | Koin DI registration |
| `driver/SocksProxyNettyOptions.kt` | Cassandra driver proxy config |
| `driver/SocksProxyDriverContext.kt` | Driver context with proxy |
| `services/VictoriaLogsService.kt` | Example usage |
Fabric8 Server-Side Apply Pattern
This document explains a common error when using fabric8's Kubernetes client for server-side apply operations, and the correct pattern to use.
The Error
When applying Kubernetes manifests using fabric8, you may encounter:
```
java.lang.IllegalStateException: Could not find a registered handler for item:
[GenericKubernetesResource(apiVersion=v1, kind=Namespace, metadata=ObjectMeta...)]
```
This is a client-side fabric8 error, not a Kubernetes server error.
Root Cause
Fabric8 has two loading paths with different behaviors:
- Typed Loader (works): `client.namespaces().load(stream)` → returns typed `Namespace` → `serverSideApply()` works
- Generic Loader (fails): `client.load(stream)` → returns `GenericKubernetesResource` → `serverSideApply()` fails
Critical: Items returned by `client.load()` are always `GenericKubernetesResource` at runtime, regardless of the YAML content. They cannot be cast to typed classes like `Namespace` or `ConfigMap`.
Patterns That Do NOT Work
Attempt 1: Direct serverSideApply on loader
```kotlin
// DON'T DO THIS - causes "Could not find a registered handler" error
client.load(inputStream).serverSideApply()
```
Attempt 2: Load items then use client.resource()
```kotlin
// DON'T DO THIS - still fails with same error
val items = client.load(inputStream).items()
for (item in items) {
    client.resource(item).serverSideApply()
}
```

Even though we load the items first, they are still `GenericKubernetesResource` objects internally, and `client.resource(item).serverSideApply()` still fails.
Attempt 3: Cast GenericKubernetesResource to typed class
```kotlin
// DON'T DO THIS - causes ClassCastException
val items = client.load(inputStream).items()
for (item in items) {
    when (item.kind) {
        "Namespace" -> client.namespaces().resource(item as Namespace).serverSideApply()
        // ...
    }
}
```
Error:

```
java.lang.ClassCastException: class io.fabric8.kubernetes.api.model.GenericKubernetesResource cannot be cast to class io.fabric8.kubernetes.api.model.Namespace
```

The items from `client.load()` are truly `GenericKubernetesResource` at runtime; they cannot be cast to typed classes.
The Pattern That Works
Use typed client loaders directly with `forceConflicts()`:
```kotlin
private fun loadAndApplyManifest(client: KubernetesClient, file: File) {
    val yamlContent = file.readText()
    val kind = extractKind(yamlContent)

    ByteArrayInputStream(yamlContent.toByteArray()).use { stream ->
        when (kind) {
            "Namespace" -> client.namespaces().load(stream).forceConflicts().serverSideApply()
            "ConfigMap" -> client.configMaps().load(stream).forceConflicts().serverSideApply()
            "Service" -> client.services().load(stream).forceConflicts().serverSideApply()
            "DaemonSet" -> client.apps().daemonSets().load(stream).forceConflicts().serverSideApply()
            "Deployment" -> client.apps().deployments().load(stream).forceConflicts().serverSideApply()
            else -> throw IllegalStateException("Unsupported resource kind: $kind")
        }
    }
}

private fun extractKind(yamlContent: String): String {
    val kindRegex = Regex("""^kind:\s*(\w+)""", RegexOption.MULTILINE)
    return kindRegex.find(yamlContent)?.groupValues?.get(1)
        ?: throw IllegalStateException("Could not determine resource kind from YAML")
}
```
This works because:
- Typed loaders (e.g., `client.namespaces().load(stream)`) return properly typed resources
- Typed resources have registered handlers for `serverSideApply()`
- `forceConflicts()` resolves field manager conflicts when multiple controllers manage the same resource
Required Imports
```kotlin
import io.fabric8.kubernetes.client.KubernetesClient
import java.io.ByteArrayInputStream
import java.io.File
```
Adding New Resource Types
If you need to support additional Kubernetes resource types, add them to the `when` statement:
"Pod" -> client.pods().load(stream).forceConflicts().serverSideApply()
"Secret" -> client.secrets().load(stream).forceConflicts().serverSideApply()
"StatefulSet" -> client.apps().statefulSets().load(stream).forceConflicts().serverSideApply()
// etc.
References
- Fabric8 Kubernetes Client: https://github.com/fabric8io/kubernetes-client
- Server-side apply documentation: https://github.com/fabric8io/kubernetes-client/blob/main/doc/CHEATSHEET.md
Fix History
| Date | Issue | Resolution |
|---|---|---|
| 2025-12-02 | `client.load().serverSideApply()` fails | Attempted: load items first, then apply via `client.resource(item)` |
| 2025-12-02 | `client.resource(item).serverSideApply()` also fails | Attempted: cast items to typed classes (e.g., `item as Namespace`) |
| 2025-12-02 | `item as Namespace` causes ClassCastException | Use typed loaders directly (`client.namespaces().load(stream)`) |
| 2025-12-02 | Patch operation fails for Namespace | Fixed: add `forceConflicts()` before `serverSideApply()` |