KubeRay
A comprehensive archetype for deploying and managing Ray clusters on Kubernetes using the KubeRay operator, providing scalable distributed computing infrastructure for machine learning and data processing workloads.
Overview
This archetype sets up a complete KubeRay environment that enables you to deploy, scale, and manage Ray clusters on Kubernetes with enterprise-grade features including auto-scaling, resource management, monitoring, and fault tolerance.
Technology Stack
- KubeRay: Kubernetes operator for Ray clusters
- Ray: Distributed computing framework
- Kubernetes: Container orchestration platform
- Prometheus: Metrics collection and monitoring
- Grafana: Visualization and alerting
- ArgoCD/Flux: GitOps deployment (optional)
- Istio: Service mesh (optional)
Key Features
Cluster Management
- Declarative Configuration: YAML-based cluster definitions
- Auto-scaling: Dynamic worker node scaling based on workload
- Resource Management: CPU, memory, and GPU resource allocation
- Multi-tenancy: Support for multiple Ray clusters
Operational Excellence
- Health Monitoring: Cluster health checks and status reporting
- Fault Tolerance: Automatic recovery from node failures
- Logging: Centralized log collection and aggregation
- Metrics: Comprehensive monitoring and alerting
Workload Management
- Job Submission: Submit Ray jobs to clusters
- Jupyter Integration: Interactive development environment
- Model Serving: Ray Serve for model deployment (a minimal example follows this list)
- Data Processing: Large-scale data processing workflows
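For example, a minimal Ray Serve deployment that could be submitted to one of these clusters looks like the sketch below; the Echo class stands in for a real model and the replica count is illustrative, not part of this archetype.
# Minimal Ray Serve sketch: exposes an HTTP endpoint backed by two replicas.
# "Echo" is a placeholder for a real model class.
from ray import serve
from starlette.requests import Request

@serve.deployment(num_replicas=2)
class Echo:
    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        return {"echo": payload}

# Deploys the application on the connected Ray cluster (Serve's HTTP port is 8000 by default).
serve.run(Echo.bind())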
Project Structure
kuberay/
├── operator/
│ ├── deployment.yaml # KubeRay operator deployment
│ ├── rbac.yaml # RBAC configurations
│ ├── crd.yaml # Custom Resource Definitions
│ └── configmap.yaml # Operator configuration
├── clusters/
│ ├── development/
│ │ ├── raycluster.yaml # Dev cluster configuration
│ │ └── values.yaml # Environment-specific values
│ ├── staging/
│ │ ├── raycluster.yaml # Staging cluster configuration
│ │ └── values.yaml
│ └── production/
│ ├── raycluster.yaml # Production cluster configuration
│ └── values.yaml
├── services/
│ ├── jupyter.yaml # Jupyter notebook deployment
│ ├── ray-dashboard.yaml # Ray dashboard service
│ └── ingress.yaml # Ingress configurations
├── monitoring/
│ ├── prometheus/
│ │ ├── servicemonitor.yaml # Prometheus monitoring
│ │ └── rules.yaml # Alerting rules
│ ├── grafana/
│ │ ├── dashboard.json # Ray cluster dashboard
│ │ └── datasource.yaml # Prometheus datasource
│ └── alerts/
│ └── ray-alerts.yaml # Alert definitions
├── jobs/
│ ├── training/ # ML training jobs
│ ├── batch/ # Batch processing jobs
│ └── serving/ # Model serving configurations
├── autoscaling/
│ ├── hpa.yaml # Horizontal Pod Autoscaler
│ └── vpa.yaml # Vertical Pod Autoscaler
├── security/
│ ├── psp.yaml # Pod Security Policies
│ ├── networkpolicy.yaml # Network policies
│ └── rbac.yaml # Role-based access control
├── storage/
│ ├── pvc.yaml # Persistent volume claims
│ └── storageclass.yaml # Storage class definitions
└── scripts/
├── deploy.sh # Deployment script
├── scale.sh # Scaling script
└── cleanup.sh # Cleanup script
KubeRay Operator Setup
Operator Deployment (operator/deployment.yaml)
apiVersion: apps/v1
kind: Deployment
metadata:
name: kuberay-operator
namespace: kuberay-system
labels:
app.kubernetes.io/name: kuberay-operator
app.kubernetes.io/component: operator
spec:
replicas: 1
selector:
matchLabels:
app.kubernetes.io/name: kuberay-operator
app.kubernetes.io/component: operator
template:
metadata:
labels:
app.kubernetes.io/name: kuberay-operator
app.kubernetes.io/component: operator
spec:
serviceAccountName: kuberay-operator
securityContext:
runAsNonRoot: true
runAsUser: 65532
containers:
- name: manager
image: quay.io/kuberay/operator:v1.0.0
command:
- /manager
args:
- --leader-elect
- --log-level=info
- --metrics-bind-addr=:8080
- --health-probe-bind-addr=:8081
env:
- name: WATCH_NAMESPACE
value: ""
- name: RAY_IMAGE
value: "rayproject/ray:2.8.0-py39"
ports:
- containerPort: 8080
name: metrics
- containerPort: 8081
name: health
livenessProbe:
httpGet:
path: /healthz
port: 8081
initialDelaySeconds: 15
periodSeconds: 20
readinessProbe:
httpGet:
path: /readyz
port: 8081
initialDelaySeconds: 5
periodSeconds: 10
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
readOnlyRootFilesystem: true
RBAC Configuration (operator/rbac.yaml)
apiVersion: v1
kind: ServiceAccount
metadata:
name: kuberay-operator
namespace: kuberay-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: kuberay-operator
rules:
- apiGroups: [""]
resources: ["pods", "services", "endpoints", "events", "configmaps", "secrets"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["apps"]
resources: ["deployments", "replicasets"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["ray.io"]
resources: ["rayclusters", "rayservices", "rayjobs"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["ray.io"]
resources: ["rayclusters/status", "rayservices/status", "rayjobs/status"]
verbs: ["get", "update", "patch"]
- apiGroups: ["networking.k8s.io"]
resources: ["ingresses"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["autoscaling"]
resources: ["horizontalpodautoscalers"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: kuberay-operator
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: kuberay-operator
subjects:
- kind: ServiceAccount
name: kuberay-operator
namespace: kuberay-system
Ray Cluster Configurations
Production Ray Cluster (clusters/production/raycluster.yaml)
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
name: ray-cluster-production
namespace: ray-system
labels:
environment: production
app: ray-cluster
spec:
rayVersion: '2.8.0'
enableInTreeAutoscaling: true
autoscalerOptions:
upscalingMode: Default
idleTimeoutSeconds: 60
imagePullPolicy: Always
securityContext:
runAsUser: 1000
fsGroup: 1000
env:
- name: RAY_GRAFANA_IFRAME_HOST
value: http://grafana.monitoring.svc.cluster.local:3000
resources:
requests:
cpu: "200m"
memory: "256Mi"
limits:
cpu: "500m"
memory: "512Mi"
headGroupSpec:
rayStartParams:
dashboard-host: '0.0.0.0'
dashboard-port: '8265'
num-cpus: '0'
node-ip-address: $MY_POD_IP
block: 'true'
template:
metadata:
labels:
app: ray-head
ray.io/cluster: ray-cluster-production
ray.io/node-type: head
spec:
serviceAccountName: ray-cluster
securityContext:
runAsUser: 1000
fsGroup: 1000
containers:
- name: ray-head
image: rayproject/ray:2.8.0-py39-gpu
imagePullPolicy: Always
env:
- name: MY_POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
- name: RAY_CLUSTER_NAME
value: "ray-cluster-production"
- name: RAY_DISABLE_IMPORT_WARNING
value: "1"
ports:
- containerPort: 6379
name: gcs
- containerPort: 8265
name: dashboard
- containerPort: 10001
name: client
- containerPort: 8000
name: serve
resources:
requests:
cpu: "2"
memory: "8Gi"
limits:
cpu: "4"
memory: "16Gi"
volumeMounts:
- name: ray-logs
mountPath: /tmp/ray
- name: shared-data
mountPath: /data
volumes:
- name: ray-logs
emptyDir: {}
- name: shared-data
persistentVolumeClaim:
claimName: ray-cluster-shared-pvc
workerGroupSpecs:
- replicas: 3
minReplicas: 1
maxReplicas: 10
groupName: worker-group
rayStartParams:
node-ip-address: $MY_POD_IP
block: 'true'
template:
metadata:
labels:
app: ray-worker
ray.io/cluster: ray-cluster-production
ray.io/node-type: worker
spec:
serviceAccountName: ray-cluster
securityContext:
runAsUser: 1000
fsGroup: 1000
containers:
- name: ray-worker
image: rayproject/ray:2.8.0-py39-gpu
imagePullPolicy: Always
env:
- name: MY_POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
- name: RAY_CLUSTER_NAME
value: "ray-cluster-production"
- name: RAY_DISABLE_IMPORT_WARNING
value: "1"
resources:
requests:
cpu: "4"
memory: "16Gi"
nvidia.com/gpu: 1
limits:
cpu: "8"
memory: "32Gi"
nvidia.com/gpu: 1
volumeMounts:
- name: ray-logs
mountPath: /tmp/ray
- name: shared-data
mountPath: /data
volumes:
- name: ray-logs
emptyDir: {}
- name: shared-data
persistentVolumeClaim:
claimName: ray-cluster-shared-pvc
# GPU worker group
- replicas: 2
minReplicas: 0
maxReplicas: 5
groupName: gpu-worker-group
rayStartParams:
node-ip-address: $MY_POD_IP
block: 'true'
      resources: '{"GPU": 2}'
template:
metadata:
labels:
app: ray-worker-gpu
ray.io/cluster: ray-cluster-production
ray.io/node-type: worker
ray.io/worker-type: gpu
spec:
serviceAccountName: ray-cluster
securityContext:
runAsUser: 1000
fsGroup: 1000
containers:
- name: ray-worker-gpu
image: rayproject/ray:2.8.0-py39-gpu
imagePullPolicy: Always
env:
- name: MY_POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
- name: RAY_CLUSTER_NAME
value: "ray-cluster-production"
          - name: CUDA_VISIBLE_DEVICES
            value: "0,1"
resources:
requests:
cpu: "8"
memory: "32Gi"
nvidia.com/gpu: 2
limits:
cpu: "16"
memory: "64Gi"
nvidia.com/gpu: 2
volumeMounts:
- name: ray-logs
mountPath: /tmp/ray
- name: shared-data
mountPath: /data
volumes:
- name: ray-logs
emptyDir: {}
- name: shared-data
persistentVolumeClaim:
claimName: ray-cluster-shared-pvc
nodeSelector:
node-type: gpu
tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
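Once the cluster is running, workloads inside the same Kubernetes cluster can connect to it over the client port (10001) exposed by the head group. The sketch below assumes KubeRay's default head service naming (<cluster-name>-head-svc) and the ray-system namespace used above; adjust the DNS name to your deployment.
# Sketch: connect to the production cluster via Ray Client and run tasks on its workers.
# The service DNS name is an assumption based on KubeRay's default naming convention.
import ray

ray.init("ray://ray-cluster-production-head-svc.ray-system.svc.cluster.local:10001")

@ray.remote
def hostname() -> str:
    import socket
    return socket.gethostname()  # runs on whichever worker Ray schedules it to

print(ray.get([hostname.remote() for _ in range(5)]))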
Ray Dashboard Service (services/ray-dashboard.yaml)
apiVersion: v1
kind: Service
metadata:
name: ray-dashboard
namespace: ray-system
labels:
app: ray-dashboard
spec:
type: LoadBalancer
ports:
- port: 8265
targetPort: 8265
name: dashboard
selector:
ray.io/node-type: head
ray.io/cluster: ray-cluster-production
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: ray-dashboard-ingress
namespace: ray-system
annotations:
nginx.ingress.kubernetes.io/rewrite-target: /
cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
ingressClassName: nginx
tls:
- hosts:
- ray-dashboard.example.com
secretName: ray-dashboard-tls
rules:
- host: ray-dashboard.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: ray-dashboard
port:
number: 8265
Ray Job Submission
Training Job Example (jobs/training/pytorch-training.yaml)
apiVersion: ray.io/v1alpha1
kind: RayJob
metadata:
name: pytorch-training-job
namespace: ray-system
spec:
entrypoint: python /app/training/train.py
  runtimeEnvYAML: |
pip:
- torch==2.0.0
- torchvision==0.15.0
- transformers==4.21.0
    working_dir: "https://github.com/your-org/ml-training/archive/refs/heads/main.zip"
rayClusterSpec:
rayVersion: '2.8.0'
headGroupSpec:
rayStartParams:
dashboard-host: '0.0.0.0'
num-cpus: '0'
template:
spec:
containers:
- name: ray-head
image: rayproject/ray-ml:2.8.0-py39-gpu
resources:
requests:
cpu: "2"
memory: "8Gi"
limits:
cpu: "4"
memory: "16Gi"
workerGroupSpecs:
- replicas: 4
minReplicas: 2
maxReplicas: 10
groupName: gpu-workers
rayStartParams:
resources: '{"GPU": 1}'
template:
spec:
containers:
- name: ray-worker
image: rayproject/ray-ml:2.8.0-py39-gpu
resources:
requests:
cpu: "4"
memory: "16Gi"
nvidia.com/gpu: 1
limits:
cpu: "8"
memory: "32Gi"
nvidia.com/gpu: 1
nodeSelector:
node-type: gpu
tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
shutdownAfterJobFinishes: true
ttlSecondsAfterFinished: 86400 # 24 hours
submitterPodTemplate:
spec:
restartPolicy: Never
containers:
- name: ray-job-submitter
image: rayproject/ray:2.8.0-py39
resources:
requests:
cpu: "200m"
memory: "256Mi"
limits:
cpu: "500m"
memory: "512Mi"
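The entrypoint above refers to /app/training/train.py in the job's working directory. Its contents are not part of this archetype, but a distributed PyTorch script for this setup might follow the Ray Train pattern sketched below; the linear model and synthetic batches are placeholders for a real training loop.
# Hypothetical sketch of a train.py entrypoint using Ray Train's TorchTrainer.
import torch
import torch.nn as nn
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, get_device, prepare_model


def train_loop_per_worker(config):
    model = prepare_model(nn.Linear(10, 1))  # wraps the model for distributed training
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    loss_fn = nn.MSELoss()
    device = get_device()
    for epoch in range(config["epochs"]):
        # Placeholder batch; a real job would wrap a DataLoader with prepare_data_loader().
        x, y = torch.randn(64, 10, device=device), torch.randn(64, 1, device=device)
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train.report({"epoch": epoch, "loss": loss.item()})


if __name__ == "__main__":
    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"lr": 1e-3, "epochs": 5},
        scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    )
    trainer.fit()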
Batch Processing Job (jobs/batch/data-processing.yaml)
apiVersion: ray.io/v1alpha1
kind: RayJob
metadata:
name: data-processing-job
namespace: ray-system
spec:
entrypoint: python /app/batch/process_data.py --input-path s3://data-bucket/input --output-path s3://data-bucket/output
  runtimeEnvYAML: |
pip:
- pandas==2.0.0
- pyarrow==12.0.0
- s3fs==2023.6.0
env_vars:
AWS_DEFAULT_REGION: "us-west-2"
rayClusterSpec:
rayVersion: '2.8.0'
headGroupSpec:
rayStartParams:
dashboard-host: '0.0.0.0'
num-cpus: '0'
template:
spec:
serviceAccountName: ray-s3-access
containers:
- name: ray-head
image: rayproject/ray:2.8.0-py39
resources:
requests:
cpu: "1"
memory: "4Gi"
limits:
cpu: "2"
memory: "8Gi"
workerGroupSpecs:
- replicas: 6
minReplicas: 3
maxReplicas: 20
groupName: data-workers
rayStartParams: {}
template:
spec:
serviceAccountName: ray-s3-access
containers:
- name: ray-worker
image: rayproject/ray:2.8.0-py39
resources:
requests:
cpu: "2"
memory: "8Gi"
limits:
cpu: "4"
memory: "16Gi"
shutdownAfterJobFinishes: true
ttlSecondsAfterFinished: 3600 # 1 hour
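As with the training job, the process_data.py entrypoint is not included in this archetype. A Ray Data version of it might look like the sketch below, which assumes Parquet input and a numeric "value" column purely for illustration; the S3 paths come from the job's command-line flags.
# Hypothetical sketch of a process_data.py entrypoint using Ray Data.
import argparse
import ray


def add_derived_column(batch):
    # Placeholder transform: assumes a numeric "value" column exists in the input data.
    batch["value_squared"] = batch["value"] ** 2
    return batch


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-path", required=True)
    parser.add_argument("--output-path", required=True)
    args = parser.parse_args()

    ray.init()  # connects to the RayCluster started for this RayJob

    ds = ray.data.read_parquet(args.input_path)
    ds = ds.map_batches(add_derived_column, batch_format="pandas")
    ds.write_parquet(args.output_path)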
Monitoring and Observability
ServiceMonitor for Prometheus (monitoring/prometheus/servicemonitor.yaml)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: ray-cluster-metrics
namespace: ray-system
labels:
app: ray-cluster
spec:
selector:
matchLabels:
app: ray-dashboard
endpoints:
- port: dashboard
path: /metrics
interval: 30s
scrapeTimeout: 10s
- port: dashboard
path: /api/v0/metrics
interval: 30s
scrapeTimeout: 10s
---
apiVersion: v1
kind: Service
metadata:
name: ray-metrics
namespace: ray-system
labels:
app: ray-dashboard
spec:
ports:
- port: 8265
name: dashboard
selector:
ray.io/node-type: head
Grafana Dashboard (monitoring/grafana/dashboard.json)
{
"dashboard": {
"id": null,
"title": "Ray Cluster Dashboard",
"description": "Monitor Ray cluster performance and resource usage",
"tags": ["ray", "kubernetes", "ml"],
"timezone": "browser",
"panels": [
{
"title": "Cluster Status",
"type": "stat",
"targets": [
{
"expr": "ray_cluster_active_nodes",
"legendFormat": "Active Nodes"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"thresholds": {
"steps": [
{"color": "red", "value": 0},
{"color": "yellow", "value": 1},
{"color": "green", "value": 3}
]
}
}
}
},
{
"title": "CPU Usage",
"type": "timeseries",
"targets": [
{
"expr": "ray_node_cpu_utilization",
"legendFormat": "{{node_id}}"
}
]
},
{
"title": "Memory Usage",
"type": "timeseries",
"targets": [
{
"expr": "ray_node_memory_utilization",
"legendFormat": "{{node_id}}"
}
]
},
{
"title": "GPU Usage",
"type": "timeseries",
"targets": [
{
"expr": "ray_node_gpu_utilization",
"legendFormat": "{{node_id}}"
}
]
},
{
"title": "Active Tasks",
"type": "timeseries",
"targets": [
{
"expr": "ray_tasks_active",
"legendFormat": "Active Tasks"
}
]
},
{
"title": "Object Store Usage",
"type": "timeseries",
"targets": [
{
"expr": "ray_object_store_memory_used",
"legendFormat": "{{node_id}}"
}
]
}
],
"time": {
"from": "now-1h",
"to": "now"
},
"refresh": "30s"
}
}
Alert Rules (monitoring/alerts/ray-alerts.yaml)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: ray-cluster-alerts
namespace: ray-system
spec:
groups:
- name: ray-cluster
rules:
- alert: RayClusterDown
expr: ray_cluster_active_nodes == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Ray cluster is down"
description: "Ray cluster {{ $labels.cluster }} has no active nodes"
- alert: RayHighCPUUsage
expr: ray_node_cpu_utilization > 0.9
for: 10m
labels:
severity: warning
annotations:
summary: "High CPU usage on Ray node"
description: "Ray node {{ $labels.node_id }} has CPU usage above 90%"
- alert: RayHighMemoryUsage
expr: ray_node_memory_utilization > 0.85
for: 5m
labels:
severity: warning
annotations:
summary: "High memory usage on Ray node"
description: "Ray node {{ $labels.node_id }} has memory usage above 85%"
- alert: RayWorkerNodeDown
expr: increase(ray_node_failures_total[5m]) > 0
for: 1m
labels:
severity: warning
annotations:
summary: "Ray worker node failure detected"
description: "Ray worker node {{ $labels.node_id }} has failed"
- alert: RayObjectStoreMemoryHigh
expr: ray_object_store_memory_used / ray_object_store_memory_total > 0.9
for: 5m
labels:
severity: warning
annotations:
summary: "Ray object store memory usage high"
description: "Ray object store on {{ $labels.node_id }} is using more than 90% of memory"
Deployment Scripts
Main Deployment Script (scripts/deploy.sh)
#!/bin/bash
set -e
NAMESPACE=${1:-ray-system}
ENVIRONMENT=${2:-production}
CLUSTER_NAME=${3:-ray-cluster-production}
export NAMESPACE ENVIRONMENT CLUSTER_NAME  # make these visible to envsubst below
echo "Deploying KubeRay to namespace: $NAMESPACE"
echo "Environment: $ENVIRONMENT"
echo "Cluster name: $CLUSTER_NAME"
# Create namespace if it doesn't exist
kubectl create namespace $NAMESPACE --dry-run=client -o yaml | kubectl apply -f -
kubectl create namespace kuberay-system --dry-run=client -o yaml | kubectl apply -f -
# Deploy KubeRay operator
echo "Deploying KubeRay operator..."
kubectl apply -f operator/
# Wait for operator to be ready
echo "Waiting for KubeRay operator to be ready..."
kubectl wait --for=condition=available --timeout=300s deployment/kuberay-operator -n kuberay-system
# Deploy Ray cluster
echo "Deploying Ray cluster..."
envsubst < clusters/$ENVIRONMENT/raycluster.yaml | kubectl apply -f -
# Deploy services
echo "Deploying services..."
kubectl apply -f services/ -n $NAMESPACE
# Deploy monitoring
if kubectl get namespace monitoring &> /dev/null; then
echo "Deploying monitoring configuration..."
kubectl apply -f monitoring/prometheus/ -n $NAMESPACE
kubectl apply -f monitoring/grafana/ -n monitoring
fi
# Wait for cluster to be ready
echo "Waiting for Ray cluster to be ready..."
kubectl wait --for=condition=ready --timeout=600s raycluster/$CLUSTER_NAME -n $NAMESPACE
echo "Ray cluster deployed successfully!"
echo "Access the dashboard with: kubectl port-forward svc/ray-dashboard 8265:8265 -n $NAMESPACE"
Scaling Script (scripts/scale.sh)
#!/bin/bash
set -e
CLUSTER_NAME=${1:-ray-cluster-production}
NAMESPACE=${2:-ray-system}
WORKER_GROUP=${3:-worker-group}
REPLICAS=${4:-}
if [ -z "$REPLICAS" ]; then
  echo "Usage: $0 <cluster_name> <namespace> <worker_group> <replicas>"
  exit 1
fi
echo "Scaling Ray cluster $CLUSTER_NAME in namespace $NAMESPACE"
echo "Worker group: $WORKER_GROUP"
echo "Target replicas: $REPLICAS"
# Scale the worker group (the JSON patch below targets index 0, the first entry in
# workerGroupSpecs; adjust the index if $WORKER_GROUP is not the first group)
kubectl patch raycluster $CLUSTER_NAME -n $NAMESPACE --type='json' \
-p="[{\"op\": \"replace\", \"path\": \"/spec/workerGroupSpecs/0/replicas\", \"value\": $REPLICAS}]"
echo "Scaling initiated. Monitor progress with:"
echo "kubectl get raycluster $CLUSTER_NAME -n $NAMESPACE -w"
Quick Start
1. Generate the KubeRay setup:
archetect render git@github.com:p6m-archetypes/kuberay.archetype.git
2. Deploy KubeRay operator and cluster:
chmod +x scripts/deploy.sh
./scripts/deploy.sh ray-system production
3. Access Ray Dashboard:
kubectl port-forward svc/ray-dashboard 8265:8265 -n ray-system
# Open http://localhost:8265
4. Submit a Ray job:
kubectl apply -f jobs/training/pytorch-training.yaml
5. Monitor cluster status:
kubectl get raycluster -n ray-system
kubectl get pods -n ray-system
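As an alternative to applying a RayJob manifest (step 4), jobs can also be submitted through the Ray Jobs API once the dashboard is port-forwarded (step 3). A minimal sketch, with an illustrative entrypoint and runtime environment:
# Sketch: submitting work through the Ray Jobs API instead of a RayJob manifest.
# Assumes the dashboard is reachable on localhost:8265 via the port-forward above.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://localhost:8265")
job_id = client.submit_job(
    entrypoint='python -c "import ray; ray.init(); print(ray.cluster_resources())"',
    runtime_env={"env_vars": {"EXAMPLE_FLAG": "1"}},  # illustrative runtime environment
)
print("Submitted:", job_id)
print(client.get_job_status(job_id))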
Best Practices
Resource Management
- Set appropriate resource requests and limits
- Use node selectors for GPU workloads
- Configure auto-scaling based on workload patterns
- Monitor resource utilization and costs (a quick programmatic check is sketched below)
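The sketch below spot-checks cluster capacity versus current availability from any driver or job connected to the cluster, complementing the Prometheus/Grafana monitoring configured above.
# Sketch: compare total versus currently available resources on the connected cluster.
import ray

ray.init()  # inside a job; or ray.init("ray://<head-svc>:10001") from elsewhere
total = ray.cluster_resources()    # e.g. {"CPU": 32.0, "GPU": 4.0, "memory": ...}
free = ray.available_resources()   # resources not currently claimed by tasks/actors
for key in ("CPU", "GPU"):
    if key in total:
        used = total[key] - free.get(key, 0.0)
        print(f"{key}: {used:.1f}/{total[key]:.1f} in use")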
Security
- Use service accounts with minimal permissions
- Implement network policies for traffic control
- Enable Pod Security Standards
- Use secrets for sensitive configuration
Operations
- Implement comprehensive monitoring and alerting
- Set up centralized logging
- Use GitOps for configuration management
- Regular backup of cluster configurations
Performance
- Optimize container images for faster startup
- Use local SSDs for object store when possible
- Configure appropriate object store memory limits
- Monitor and tune garbage collection settings