KubeRay
A comprehensive archetype for deploying and managing Ray clusters on Kubernetes using the KubeRay operator, providing scalable distributed computing infrastructure for machine learning and data processing workloads.
Overview
This archetype sets up a complete KubeRay environment that enables you to deploy, scale, and manage Ray clusters on Kubernetes with enterprise-grade features including auto-scaling, resource management, monitoring, and fault tolerance.
Technology Stack
- KubeRay: Kubernetes operator for Ray clusters
- Ray: Distributed computing framework
- Kubernetes: Container orchestration platform
- Prometheus: Metrics collection and monitoring
- Grafana: Visualization and alerting
- ArgoCD/Flux: GitOps deployment (optional)
- Istio: Service mesh (optional)
Key Features
Cluster Management
- Declarative Configuration: YAML-based cluster definitions
- Auto-scaling: Dynamic worker node scaling based on workload
- Resource Management: CPU, memory, and GPU resource allocation
- Multi-tenancy: Support for multiple Ray clusters
Operational Excellence
- Health Monitoring: Cluster health checks and status reporting
- Fault Tolerance: Automatic recovery from node failures
- Logging: Centralized log collection and aggregation
- Metrics: Comprehensive monitoring and alerting
Workload Management
- Job Submission: Submit Ray jobs to clusters
- Jupyter Integration: Interactive development environment
- Model Serving: Ray Serve for model deployment (a minimal example follows this list)
- Data Processing: Large-scale data processing workflows
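For example, a minimal Ray Serve deployment that could be submitted to one of these clusters looks like the sketch below; the Echo class stands in for a real model and the replica count is illustrative, not part of this archetype.
# Minimal Ray Serve sketch: exposes an HTTP endpoint backed by two replicas.
# "Echo" is a placeholder for a real model class.
from ray import serve
from starlette.requests import Request

@serve.deployment(num_replicas=2)
class Echo:
    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        return {"echo": payload}

# Deploys the application on the connected Ray cluster (Serve's HTTP port is 8000 by default).
serve.run(Echo.bind())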
Project Structure
kuberay/
├── operator/
│ ├── deployment.yaml # KubeRay operator deployment
│ ├── rbac.yaml # RBAC configurations
│ ├── crd.yaml # Custom Resource Definitions
│ └── configmap.yaml # Operator configuration
├── clusters/
│ ├── development/
│ │ ├── raycluster.yaml # Dev cluster configuration
│ │ └── values.yaml # Environment-specific values
│ ├── staging/
│ │ ├── raycluster.yaml # Staging cluster configuration
│ │ └── values.yaml
│ └── production/
│ ├── raycluster.yaml # Production cluster configuration
│ └── values.yaml
├── services/
│ ├── jupyter.yaml # Jupyter notebook deployment
│ ├── ray-dashboard.yaml # Ray dashboard service
│ └── ingress.yaml # Ingress configurations
├── monitoring/
│ ├── prometheus/
│ │ ├── servicemonitor.yaml # Prometheus monitoring
│ │ └── rules.yaml # Alerting rules
│ ├── grafana/
│ │ ├── dashboard.json # Ray cluster dashboard
│ │ └── datasource.yaml # Prometheus datasource
│ └── alerts/
│ └── ray-alerts.yaml # Alert definitions
├── jobs/
│ ├── training/ # ML training jobs
│ ├── batch/ # Batch processing jobs
│ └── serving/ # Model serving configurations
├── autoscaling/
│ ├── hpa.yaml # Horizontal Pod Autoscaler
│ └── vpa.yaml # Vertical Pod Autoscaler
├── security/
│ ├── psp.yaml # Pod Security Policies
│ ├── networkpolicy.yaml # Network policies
│ └── rbac.yaml # Role-based access control
├── storage/
│ ├── pvc.yaml # Persistent volume claims
│ └── storageclass.yaml # Storage class definitions
└── scripts/
├── deploy.sh # Deployment script
├── scale.sh # Scaling script
└── cleanup.sh # Cleanup script
KubeRay Operator Setup
Operator Deployment (operator/deployment.yaml)
apiVersion: apps/v1
kind: Deployment
metadata:
name: kuberay-operator
namespace: kuberay-system
labels:
app.kubernetes.io/name: kuberay-operator
app.kubernetes.io/component: operator
spec:
replicas: 1
selector:
matchLabels:
app.kubernetes.io/name: kuberay-operator
app.kubernetes.io/component: operator
template:
metadata:
labels:
app.kubernetes.io/name: kuberay-operator
app.kubernetes.io/component: operator
spec:
serviceAccountName: kuberay-operator
securityContext:
runAsNonRoot: true
runAsUser: 65532
containers:
- name: manager
image: quay.io/kuberay/operator:v1.0.0
command:
- /manager
args:
- --leader-elect
- --log-level=info
- --metrics-bind-addr=:8080
- --health-probe-bind-addr=:8081
env:
- name: WATCH_NAMESPACE
value: ""
- name: RAY_IMAGE
value: "rayproject/ray:2.8.0-py39"
ports:
- containerPort: 8080
name: metrics
- containerPort: 8081
name: health
livenessProbe:
httpGet:
path: /healthz
port: 8081
initialDelaySeconds: 15
periodSeconds: 20
readinessProbe:
httpGet:
path: /readyz
port: 8081
initialDelaySeconds: 5
periodSeconds: 10
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
readOnlyRootFilesystem: true
RBAC Configuration (operator/rbac.yaml)
apiVersion: v1
kind: ServiceAccount
metadata:
name: kuberay-operator
namespace: kuberay-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: kuberay-operator
rules:
- apiGroups: [""]
resources: ["pods", "services", "endpoints", "events", "configmaps", "secrets"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["apps"]
resources: ["deployments", "replicasets"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["ray.io"]
resources: ["rayclusters", "rayservices", "rayjobs"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["ray.io"]
resources: ["rayclusters/status", "rayservices/status", "rayjobs/status"]
verbs: ["get", "update", "patch"]
- apiGroups: ["networking.k8s.io"]
resources: ["ingresses"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["autoscaling"]
resources: ["horizontalpodautoscalers"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: kuberay-operator
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: kuberay-operator
subjects:
- kind: ServiceAccount
name: kuberay-operator
namespace: kuberay-system
Ray Cluster Configurations
Production Ray Cluster (clusters/production/raycluster.yaml)
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
name: ray-cluster-production
namespace: ray-system
labels:
environment: production
app: ray-cluster
spec:
rayVersion: '2.8.0'
enableInTreeAutoscaling: true
autoscalerOptions:
upscalingMode: Default
idleTimeoutSeconds: 60
imagePullPolicy: Always
securityContext:
runAsUser: 1000
fsGroup: 1000
env:
- name: RAY_GRAFANA_IFRAME_HOST
value: http://grafana.monitoring.svc.cluster.local:3000
resources:
requests:
cpu: "200m"
memory: "256Mi"
limits:
cpu: "500m"
memory: "512Mi"
headGroupSpec:
rayStartParams:
dashboard-host: '0.0.0.0'
dashboard-port: '8265'
num-cpus: '0'
node-ip-address: $MY_POD_IP
block: 'true'
template:
metadata:
labels:
app: ray-head
ray.io/cluster: ray-cluster-production
ray.io/node-type: head
spec:
serviceAccountName: ray-cluster
securityContext:
runAsUser: 1000
fsGroup: 1000
containers:
- name: ray-head
image: rayproject/ray:2.8.0-py39-gpu
imagePullPolicy: Always
env:
- name: MY_POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
- name: RAY_CLUSTER_NAME
value: "ray-cluster-production"
- name: RAY_DISABLE_IMPORT_WARNING
value: "1"
ports:
- containerPort: 6379
name: gcs
- containerPort: 8265
name: dashboard
- containerPort: 10001
name: client
- containerPort: 8000
name: serve
resources:
requests:
cpu: "2"
memory: "8Gi"
limits:
cpu: "4"
memory: "16Gi"
volumeMounts:
- name: ray-logs
mountPath: /tmp/ray
- name: shared-data
mountPath: /data
volumes:
- name: ray-logs
emptyDir: {}
- name: shared-data
persistentVolumeClaim:
claimName: ray-cluster-shared-pvc
workerGroupSpecs:
- replicas: 3
minReplicas: 1
maxReplicas: 10
groupName: worker-group
rayStartParams:
node-ip-address: $MY_POD_IP
block: 'true'
template:
metadata:
labels:
app: ray-worker
ray.io/cluster: ray-cluster-production
ray.io/node-type: worker
spec:
serviceAccountName: ray-cluster
securityContext:
runAsUser: 1000
fsGroup: 1000
containers:
- name: ray-worker
image: rayproject/ray:2.8.0-py39-gpu
imagePullPolicy: Always
env:
- name: MY_POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
- name: RAY_CLUSTER_NAME
value: "ray-cluster-production"
- name: RAY_DISABLE_IMPORT_WARNING
value: "1"
resources:
requests:
cpu: "4"
memory: "16Gi"
nvidia.com/gpu: 1
limits:
cpu: "8"
memory: "32Gi"
nvidia.com/gpu: 1
volumeMounts:
- name: ray-logs
mountPath: /tmp/ray
- name: shared-data
mountPath: /data
volumes:
- name: ray-logs
emptyDir: {}
- name: shared-data
persistentVolumeClaim:
claimName: ray-cluster-shared-pvc
# GPU worker group
- replicas: 2
minReplicas: 0
maxReplicas: 5
groupName: gpu-worker-group
rayStartParams:
node-ip-address: $MY_POD_IP
block: 'true'
      resources: '{"GPU": 2}'
template:
metadata:
labels:
app: ray-worker-gpu
ray.io/cluster: ray-cluster-production
ray.io/node-type: worker
ray.io/worker-type: gpu
spec:
serviceAccountName: ray-cluster
securityContext:
runAsUser: 1000
fsGroup: 1000
containers:
- name: ray-worker-gpu
image: rayproject/ray:2.8.0-py39-gpu
imagePullPolicy: Always
env:
- name: MY_POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
- name: RAY_CLUSTER_NAME
value: "ray-cluster-production"
          - name: CUDA_VISIBLE_DEVICES
            value: "0,1"
resources:
requests:
cpu: "8"
memory: "32Gi"
nvidia.com/gpu: 2
limits:
cpu: "16"
memory: "64Gi"
nvidia.com/gpu: 2
volumeMounts:
- name: ray-logs
mountPath: /tmp/ray
- name: shared-data
mountPath: /data
volumes:
- name: ray-logs
emptyDir: {}
- name: shared-data
persistentVolumeClaim:
claimName: ray-cluster-shared-pvc
nodeSelector:
node-type: gpu
tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
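Once the cluster is running, workloads inside the same Kubernetes cluster can connect to it over the client port (10001) exposed by the head group. The sketch below assumes KubeRay's default head service naming (<cluster-name>-head-svc) and the ray-system namespace used above; adjust the DNS name to your deployment.
# Sketch: connect to the production cluster via Ray Client and run tasks on its workers.
# The service DNS name is an assumption based on KubeRay's default naming convention.
import ray

ray.init("ray://ray-cluster-production-head-svc.ray-system.svc.cluster.local:10001")

@ray.remote
def hostname() -> str:
    import socket
    return socket.gethostname()  # runs on whichever worker Ray schedules it to

print(ray.get([hostname.remote() for _ in range(5)]))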
Ray Dashboard Service (services/ray-dashboard.yaml)
apiVersion: v1
kind: Service
metadata:
name: ray-dashboard
namespace: ray-system
labels:
app: ray-dashboard
spec:
type: LoadBalancer
ports:
- port: 8265
targetPort: 8265
name: dashboard
selector:
ray.io/node-type: head
ray.io/cluster: ray-cluster-production
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: ray-dashboard-ingress
namespace: ray-system
annotations:
nginx.ingress.kubernetes.io/rewrite-target: /
cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
ingressClassName: nginx
tls:
- hosts:
- ray-dashboard.example.com
secretName: ray-dashboard-tls
rules:
- host: ray-dashboard.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: ray-dashboard
port:
number: 8265
Ray Job Submission
Training Job Example (jobs/training/pytorch-training.yaml)
apiVersion: ray.io/v1alpha1
kind: RayJob
metadata:
name: pytorch-training-job
namespace: ray-system
spec:
entrypoint: python /app/training/train.py
  runtimeEnvYAML: |
pip:
- torch==2.0.0
- torchvision==0.15.0
- transformers==4.21.0
    working_dir: "https://github.com/your-org/ml-training/archive/refs/heads/main.zip"
rayClusterSpec:
rayVersion: '2.8.0'
headGroupSpec:
rayStartParams:
dashboard-host: '0.0.0.0'
num-cpus: '0'
template:
spec:
containers:
- name: ray-head
image: rayproject/ray-ml:2.8.0-py39-gpu
resources:
requests:
cpu: "2"
memory: "8Gi"
limits:
cpu: "4"
memory: "16Gi"
workerGroupSpecs:
- replicas: 4
minReplicas: 2
maxReplicas: 10
groupName: gpu-workers
rayStartParams:
resources: '{"GPU": 1}'
template:
spec:
containers:
- name: ray-worker
image: rayproject/ray-ml:2.8.0-py39-gpu
resources:
requests:
cpu: "4"
memory: "16Gi"
nvidia.com/gpu: 1
limits:
cpu: "8"
memory: "32Gi"
nvidia.com/gpu: 1
nodeSelector:
node-type: gpu
tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
shutdownAfterJobFinishes: true
ttlSecondsAfterFinished: 86400 # 24 hours
submitterPodTemplate:
spec:
restartPolicy: Never
containers:
- name: ray-job-submitter
image: rayproject/ray:2.8.0-py39
resources:
requests:
cpu: "200m"
memory: "256Mi"
limits:
cpu: "500m"
memory: "512Mi"
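The entrypoint above refers to /app/training/train.py in the job's working directory. Its contents are not part of this archetype, but a distributed PyTorch script for this setup might follow the Ray Train pattern sketched below; the linear model and synthetic batches are placeholders for a real training loop.
# Hypothetical sketch of a train.py entrypoint using Ray Train's TorchTrainer.
import torch
import torch.nn as nn
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, get_device, prepare_model


def train_loop_per_worker(config):
    model = prepare_model(nn.Linear(10, 1))  # wraps the model for distributed training
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    loss_fn = nn.MSELoss()
    device = get_device()
    for epoch in range(config["epochs"]):
        # Placeholder batch; a real job would wrap a DataLoader with prepare_data_loader().
        x, y = torch.randn(64, 10, device=device), torch.randn(64, 1, device=device)
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train.report({"epoch": epoch, "loss": loss.item()})


if __name__ == "__main__":
    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"lr": 1e-3, "epochs": 5},
        scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    )
    trainer.fit()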
Batch Processing Job (jobs/batch/data-processing.yaml)
apiVersion: ray.io/v1alpha1
kind: RayJob
metadata:
name: data-processing-job
namespace: ray-system
spec:
entrypoint: python /app/batch/process_data.py --input-path s3://data-bucket/input --output-path s3://data-bucket/output
  runtimeEnvYAML: |
pip:
- pandas==2.0.0
- pyarrow==12.0.0
- s3fs==2023.6.0
env_vars:
AWS_DEFAULT_REGION: "us-west-2"
rayClusterSpec:
rayVersion: '2.8.0'
headGroupSpec:
rayStartParams:
dashboard-host: '0.0.0.0'
num-cpus: '0'
template:
spec:
serviceAccountName: ray-s3-access
containers:
- name: ray-head
image: rayproject/ray:2.8.0-py39
resources:
requests:
cpu: "1"
memory: "4Gi"
limits:
cpu: "2"
memory: "8Gi"
workerGroupSpecs:
- replicas: 6
minReplicas: 3
maxReplicas: 20
groupName: data-workers
rayStartParams: {}
template:
spec:
serviceAccountName: ray-s3-access
containers:
- name: ray-worker
image: rayproject/ray:2.8.0-py39
resources:
requests:
cpu: "2"
memory: "8Gi"
limits:
cpu: "4"
memory: "16Gi"
shutdownAfterJobFinishes: true
ttlSecondsAfterFinished: 3600 # 1 hour
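As with the training job, the process_data.py entrypoint is not included in this archetype. A Ray Data version of it might look like the sketch below, which assumes Parquet input and a numeric "value" column purely for illustration; the S3 paths come from the job's command-line flags.
# Hypothetical sketch of a process_data.py entrypoint using Ray Data.
import argparse
import ray


def add_derived_column(batch):
    # Placeholder transform: assumes a numeric "value" column exists in the input data.
    batch["value_squared"] = batch["value"] ** 2
    return batch


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-path", required=True)
    parser.add_argument("--output-path", required=True)
    args = parser.parse_args()

    ray.init()  # connects to the RayCluster started for this RayJob

    ds = ray.data.read_parquet(args.input_path)
    ds = ds.map_batches(add_derived_column, batch_format="pandas")
    ds.write_parquet(args.output_path)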
Monitoring and Observability
ServiceMonitor for Prometheus (monitoring/prometheus/servicemonitor.yaml)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: ray-cluster-metrics
namespace: ray-system
labels:
app: ray-cluster
spec:
selector:
matchLabels:
app: ray-dashboard
endpoints:
- port: dashboard
path: /metrics
interval: 30s
scrapeTimeout: 10s
- port: dashboard
path: /api/v0/metrics
interval: 30s
scrapeTimeout: 10s
---
apiVersion: v1
kind: Service
metadata:
name: ray-metrics
namespace: ray-system
labels:
app: ray-dashboard
spec:
ports:
- port: 8265
name: dashboard
selector:
ray.io/node-type: head
Grafana Dashboard (monitoring/grafana/dashboard.json)
{
"dashboard": {
"id": null,
"title": "Ray Cluster Dashboard",
"description": "Monitor Ray cluster performance and resource usage",
"tags": ["ray", "kubernetes", "ml"],
"timezone": "browser",
"panels": [
{
"title": "Cluster Status",
"type": "stat",
"targets": [
{
"expr": "ray_cluster_active_nodes",
"legendFormat": "Active Nodes"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"thresholds": {
"steps": [
{"color": "red", "value": 0},
{"color": "yellow", "value": 1},
{"color": "green", "value": 3}
]
}
}
}
},
{
"title": "CPU Usage",
"type": "timeseries",
"targets": [
{
"expr": "ray_node_cpu_utilization",
"legendFormat": "{{node_id}}"
}
]
},
{
"title": "Memory Usage",
"type": "timeseries",
"targets": [
{
"expr": "ray_node_memory_utilization",
"legendFormat": "{{node_id}}"
}
]
},
{
"title": "GPU Usage",
"type": "timeseries",
"targets": [
{
"expr": "ray_node_gpu_utilization",
"legendFormat": "{{node_id}}"
}
]
},
{
"title": "Active Tasks",
"type": "timeseries",
"targets": [
{
"expr": "ray_tasks_active",
"legendFormat": "Active Tasks"
}
]
},
{
"title": "Object Store Usage",
"type": "timeseries",
"targets": [
{
"expr": "ray_object_store_memory_used",
"legendFormat": "{{node_id}}"
}
]
}
],
"time": {
"from": "now-1h",
"to": "now"
},
"refresh": "30s"
}
}
Alert Rules (monitoring/alerts/ray-alerts.yaml)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: ray-cluster-alerts
namespace: ray-system
spec:
groups:
- name: ray-cluster
rules:
- alert: RayClusterDown
expr: ray_cluster_active_nodes == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Ray cluster is down"
description: "Ray cluster {{ $labels.cluster }} has no active nodes"
- alert: RayHighCPUUsage
expr: ray_node_cpu_utilization > 0.9
for: 10m
labels:
severity: warning
annotations:
summary: "High CPU usage on Ray node"
description: "Ray node {{ $labels.node_id }} has CPU usage above 90%"
- alert: RayHighMemoryUsage
expr: ray_node_memory_utilization > 0.85
for: 5m
labels:
severity: warning
annotations:
summary: "High memory usage on Ray node"
description: "Ray node {{ $labels.node_id }} has memory usage above 85%"
- alert: RayWorkerNodeDown
expr: increase(ray_node_failures_total[5m]) > 0
for: 1m
labels:
severity: warning
annotations:
summary: "Ray worker node failure detected"
description: "Ray worker node {{ $labels.node_id }} has failed"
- alert: RayObjectStoreMemoryHigh
expr: ray_object_store_memory_used / ray_object_store_memory_total > 0.9
for: 5m
labels:
severity: warning
annotations:
summary: "Ray object store memory usage high"
description: "Ray object store on {{ $labels.node_id }} is using more than 90% of memory"
Deployment Scripts
Main Deployment Script (scripts/deploy.sh)
#!/bin/bash
set -e
NAMESPACE=${1:-ray-system}
ENVIRONMENT=${2:-production}
CLUSTER_NAME=${3:-ray-cluster-production}
export NAMESPACE ENVIRONMENT CLUSTER_NAME  # make these visible to envsubst below
echo "Deploying KubeRay to namespace: $NAMESPACE"
echo "Environment: $ENVIRONMENT"
echo "Cluster name: $CLUSTER_NAME"
# Create namespace if it doesn't exist
kubectl create namespace $NAMESPACE --dry-run=client -o yaml | kubectl apply -f -
kubectl create namespace kuberay-system --dry-run=client -o yaml | kubectl apply -f -
# Deploy KubeRay operator
echo "Deploying KubeRay operator..."
kubectl apply -f operator/
# Wait for operator to be ready
echo "Waiting for KubeRay operator to be ready..."
kubectl wait --for=condition=available --timeout=300s deployment/kuberay-operator -n kuberay-system
# Deploy Ray cluster
echo "Deploying Ray cluster..."
envsubst < clusters/$ENVIRONMENT/raycluster.yaml | kubectl apply -f -
# Deploy services
echo "Deploying services..."
kubectl apply -f services/ -n $NAMESPACE
# Deploy monitoring
if kubectl get namespace monitoring &> /dev/null; then
echo "Deploying monitoring configuration..."
kubectl apply -f monitoring/prometheus/ -n $NAMESPACE
kubectl apply -f monitoring/grafana/ -n monitoring
fi
# Wait for cluster to be ready
echo "Waiting for Ray cluster to be ready..."
kubectl wait --for=condition=ready --timeout=600s raycluster/$CLUSTER_NAME -n $NAMESPACE
echo "Ray cluster deployed successfully!"
echo "Access the dashboard with: kubectl port-forward svc/ray-dashboard 8265:8265 -n $NAMESPACE"
Scaling Script (scripts/scale.sh)
#!/bin/bash
set -e
CLUSTER_NAME=${1:-ray-cluster-production}
NAMESPACE=${2:-ray-system}
WORKER_GROUP=${3:-worker-group}
REPLICAS=${4:-}
if [ -z "$REPLICAS" ]; then
  echo "Usage: $0 <cluster_name> <namespace> <worker_group> <replicas>"
  exit 1
fi
echo "Scaling Ray cluster $CLUSTER_NAME in namespace $NAMESPACE"
echo "Worker group: $WORKER_GROUP"
echo "Target replicas: $REPLICAS"
# Scale the worker group (the JSON patch below targets index 0, the first entry in
# workerGroupSpecs; adjust the index if $WORKER_GROUP is not the first group)
kubectl patch raycluster $CLUSTER_NAME -n $NAMESPACE --type='json' \
-p="[{\"op\": \"replace\", \"path\": \"/spec/workerGroupSpecs/0/replicas\", \"value\": $REPLICAS}]"
echo "Scaling initiated. Monitor progress with:"
echo "kubectl get raycluster $CLUSTER_NAME -n $NAMESPACE -w"
Quick Start
1. Generate the KubeRay setup:
archetect render git@github.com:p6m-archetypes/kuberay.archetype.git
2. Deploy KubeRay operator and cluster:
chmod +x scripts/deploy.sh
./scripts/deploy.sh ray-system production
3. Access Ray Dashboard:
kubectl port-forward svc/ray-dashboard 8265:8265 -n ray-system
# Open http://localhost:8265
4. Submit a Ray job:
kubectl apply -f jobs/training/pytorch-training.yaml
5. Monitor cluster status:
kubectl get raycluster -n ray-system
kubectl get pods -n ray-system
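As an alternative to applying a RayJob manifest (step 4), jobs can also be submitted through the Ray Jobs API once the dashboard is port-forwarded (step 3). A minimal sketch, with an illustrative entrypoint and runtime environment:
# Sketch: submitting work through the Ray Jobs API instead of a RayJob manifest.
# Assumes the dashboard is reachable on localhost:8265 via the port-forward above.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://localhost:8265")
job_id = client.submit_job(
    entrypoint='python -c "import ray; ray.init(); print(ray.cluster_resources())"',
    runtime_env={"env_vars": {"EXAMPLE_FLAG": "1"}},  # illustrative runtime environment
)
print("Submitted:", job_id)
print(client.get_job_status(job_id))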
Best Practices
Resource Management
- Set appropriate resource requests and limits
- Use node selectors for GPU workloads
- Configure auto-scaling based on workload patterns
- Monitor resource utilization and costs (a quick programmatic check is sketched below)
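The sketch below spot-checks cluster capacity versus current availability from any driver or job connected to the cluster, complementing the Prometheus/Grafana monitoring configured above.
# Sketch: compare total versus currently available resources on the connected cluster.
import ray

ray.init()  # inside a job; or ray.init("ray://<head-svc>:10001") from elsewhere
total = ray.cluster_resources()    # e.g. {"CPU": 32.0, "GPU": 4.0, "memory": ...}
free = ray.available_resources()   # resources not currently claimed by tasks/actors
for key in ("CPU", "GPU"):
    if key in total:
        used = total[key] - free.get(key, 0.0)
        print(f"{key}: {used:.1f}/{total[key]:.1f} in use")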
Security
- Use service accounts with minimal permissions
- Implement network policies for traffic control
- Enable Pod Security Standards
- Use secrets for sensitive configuration
Operations
- Implement comprehensive monitoring and alerting
- Set up centralized logging
- Use GitOps for configuration management
- Regular backup of cluster configurations
Performance
- Optimize container images for faster startup
- Use local SSDs for object store when possible
- Configure appropriate object store memory limits
- Monitor and tune garbage collection settings