Follow these steps to deploy Poolside model inference on your GPU-backed Kubernetes cluster. For an overview of this deployment approach and architecture, see Upstream Kubernetes deployment.

Prerequisites

Poolside distributes the Helm deployment bundle as a .tar.gz archive. Extract it before you start:
tar -xzf <bundle-name>.tar.gz
cd <bundle-name>
Confirm that you are working from the root of the extracted bundle. The bundle root contains the following directories:
./scripts/
./containers/
./charts/
./binaries/
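The directory check above can be scripted as a quick sanity test. This is a minimal sketch: check_bundle_root is a hypothetical helper name, not a script shipped in the bundle.

```shell
# check_bundle_root is a hypothetical helper, not part of the bundle.
# It returns 0 only if all four expected directories exist under the given path
# (default: the current directory), and names each missing one on stderr.
check_bundle_root() {
  root="${1:-.}"
  rc=0
  for dir in scripts containers charts binaries; do
    if [ ! -d "$root/$dir" ]; then
      echo "missing directory: $root/$dir" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

Run check_bundle_root . from the extracted directory. A non-zero exit status means you are not in the bundle root.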
Cluster requirements
  • Kubernetes 1.29 or later
  • GPU nodes with enough GPUs for the models you deploy
  • NVIDIA GPU Operator 26.3.0, with NVIDIA driver 580.126.20 and NVIDIA Container Toolkit 1.19.0
  • An ingress controller that can route HTTP and HTTPS traffic to the cluster
  • A DNS hostname for each model you deploy, resolving to the ingress endpoint. Kubernetes Ingress objects do not accept bare IP addresses; use a DNS name or /etc/hosts entries.
  • An S3-compatible object storage service such as Amazon S3, SeaweedFS, MinIO, or NooBaa
  • A container registry that every cluster node can access
Workstation tools
Install the following tools on the host you use to run the deployment:
  • helm 3.12 or later
  • kubectl
  • skopeo
  • aws CLI (to upload checkpoints to S3-compatible object storage)
  • jq (to parse JSON responses from the inference API)
  • tar (to extract the deployment bundle)
  • curl (to call the inference API)
  • openssl (optional, to generate a TLS certificate for the inference endpoint)
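To confirm the tools in the list above are installed, a short check along these lines may help. It is a sketch only: missing_tools is a hypothetical helper, and it checks that each tool is on PATH, not that helm meets the 3.12 minimum.

```shell
# missing_tools is a hypothetical helper: it prints each named tool
# that cannot be found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Required tools from the list above; openssl is optional and left out.
missing_tools helm kubectl skopeo aws jq tar curl
```

No output means every required tool was found.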
Minimum resource requirements
Ensure that your cluster has enough GPUs for the models you deploy. If you have questions about the required specs, contact Poolside support.

Step 1: Create the namespace

The inference stack runs in a single namespace:
kubectl create namespace poolside-models

Step 2: Upload container images

Copy the bundled images into your registry. Log in to your target registry using docker login or podman login before running any upload commands. Authenticate skopeo against your target registry:
skopeo login <registry-host> --username <username> --password <password>
Upload the images with the provided script:
chmod +x ./scripts/upload_images.sh
./scripts/upload_images.sh <registry-host>
If your registry does not use TLS:
./scripts/upload_images.sh <registry-host>:5000 --force-insecure-dest
If your registry requires authentication, create an image pull secret in poolside-models:
kubectl create secret docker-registry poolside-registry-secret \
  --docker-server=<registry-host> \
  --docker-username=<registry-user> \
  --docker-password=<registry-password> \
  -n poolside-models

Step 3: Upload model checkpoints

The inference stack downloads model weights from your S3 bucket on pod startup, so the checkpoints must be in place before you deploy the chart. Poolside provides the checkpoint files separately from the deployment bundle. Confirm the local path and the destination prefix with your Poolside contact.
Uploading checkpoints is time-consuming. Start the upload now and continue with the remaining steps in parallel.
Create the bucket if it does not already exist:
aws s3 mb s3://<bucket-name> --region <aws-region>
For a non-AWS S3 endpoint (MinIO, NooBaa, SeaweedFS), add --endpoint-url https://<s3-endpoint>. Note the bucket name; you reference it in the models.<key>.model paths in Step 5.
Then upload the checkpoints to the bucket:
aws s3 cp ./checkpoints s3://<bucket-name>/checkpoints --recursive --region <aws-region>
For a non-AWS S3 endpoint (MinIO, NooBaa, SeaweedFS), add --endpoint-url:
aws s3 cp ./checkpoints s3://<bucket-name>/checkpoints \
  --recursive \
  --endpoint-url https://<s3-endpoint> \
  --region <aws-region>
Checkpoints are typically tens of GiB per model. For faster throughput, or for backends sensitive to upload concurrency such as NooBaa or SeaweedFS, run the upload from a host inside the cluster and use aws configure set to tune default.s3.max_concurrent_requests and default.s3.multipart_chunksize.
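Because the upload runs for a long time, it can be worth confirming that it finished before you install the chart. One possible check, sketched below, compares the number of local files with the object count that aws s3 ls --recursive --summarize reports. The helper names and the BUCKET variable are hypothetical, and a matching count does not prove the contents are intact.

```shell
# Hypothetical helpers for comparing local and uploaded object counts.

# Count regular files under a local directory.
local_object_count() {
  find "$1" -type f | wc -l | tr -d ' '
}

# Read `aws s3 ls --recursive --summarize` output on stdin and print
# the number from its "Total Objects:" summary line.
remote_object_count() {
  awk '/Total Objects:/ { print $3 }'
}

# Usage (BUCKET is an assumed shell variable; add --endpoint-url for non-AWS backends):
#   local_object_count ./checkpoints
#   aws s3 ls "s3://$BUCKET/checkpoints" --recursive --summarize | remote_object_count
```

If the two numbers differ, rerun the upload before continuing to Step 6.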

Step 4: Create the S3 credentials secret

The model servers read checkpoints from S3 using credentials in a Kubernetes secret. Create it in poolside-models:
kubectl create secret generic aws-credentials \
  --from-literal=AWS_ACCESS_KEY_ID=<access-key-id> \
  --from-literal=AWS_SECRET_ACCESS_KEY=<secret-access-key> \
  -n poolside-models
API key authentication (optional)
To require an API key on the vLLM inference servers, create a secret containing the key in poolside-models:
kubectl create secret generic vllm-auth \
  --from-literal=VLLM_API_KEY=<vllm-api-key> \
  -n poolside-models
Creating the secret does not enable API key authentication by itself. In Step 5, set authentication.secretName to vllm-auth.
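This guide does not prescribe how to choose the key value. If you need to generate one, the sketch below prints 32 random bytes as hex. It assumes /dev/urandom is available on your workstation, and generate_api_key is a hypothetical helper name.

```shell
# generate_api_key is a hypothetical helper: it prints 32 random bytes
# from /dev/urandom as 64 lowercase hex characters on one line.
generate_api_key() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
  echo
}

# Usage: keep the key in a variable so you can hand it to clients later.
#   VLLM_API_KEY="$(generate_api_key)"
#   kubectl create secret generic vllm-auth \
#     --from-literal=VLLM_API_KEY="$VLLM_API_KEY" \
#     -n poolside-models
```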

Step 5: Configure the inference values file

Create an inference_values.yaml file in the bundle root:
cp ./charts/inference/values.yaml ./inference_values.yaml
Set the fields that apply to your environment. The example below deploys two models and exposes each model through its own ingress:
inference_values.yaml
image:
  # -- Registry you uploaded the atlas image to (required)
  registry: "<registry-host>"
  # -- Image name and tag come pre-set in the bundle to match the shipped image
  name: "atlas"
  tag: "<atlas-tag>"
# -- Name of the image pull secret for private registries (omit if your registry is public)
imagePullSecret: "poolside-registry-secret"
podSecurityContext:
  # -- Require non-root user
  runAsNonRoot: true
  # -- Run inference pods as a specific numeric user ID (required on upstream Kubernetes)
  runAsUser: 10003
  seccompProfile:
    # -- Seccomp profile type
    type: RuntimeDefault
s3:
  # -- Name of secret containing AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
  secretName: "aws-credentials"
  # -- Custom CA certificate bundle for S3 (leave empty for plain HTTP or a trusted CA)
  caBundle: ""
authentication:
  # -- Name of secret containing VLLM_API_KEY for vLLM server authentication (set to "vllm-auth" if you created the optional secret in Step 4; leave empty to disable)
  secretName: ""
ingress:
  # -- Create an Ingress for every model
  enabled: true
  # -- Ingress class name
  className: "nginx"
models:
  laguna:
    model: s3://<bucket-name>/checkpoints/laguna
    modelName: Laguna
    modelType: agent
    gpus: 4
    # -- Hostname that routes to this model's vLLM service
    ingressHost: "<laguna-hostname>"
  point:
    model: s3://<bucket-name>/checkpoints/point
    modelName: Point
    modelType: completion
    gpus: 1
    # -- Hostname that routes to this model's vLLM service
    ingressHost: "<point-hostname>"
The image registry and the checkpoint paths in models.<key>.model must exactly match the locations you uploaded to in Step 2 and Step 3. The image name and tag come pre-set to match the shipped atlas image. Set each model’s gpus to a value that meets its minimum GPU memory for your GPU type. For the per-model minimums, see Supported configurations.
Each model is exposed at its own hostname through a separate Ingress named inference-<model-key>. Give every model a unique ingressHost. The Ingress routes the hostname’s root path directly to that model’s vLLM service, so clients reach the OpenAI-compatible API at http://<model-hostname>/v1.
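Since every model needs a unique ingressHost, a quick duplicate check on the values file can catch copy-paste mistakes before you install. The sketch below matches lines textually rather than parsing YAML, so it assumes one ingressHost per line with no trailing comment, as in the example above. duplicate_ingress_hosts is a hypothetical helper name.

```shell
# duplicate_ingress_hosts is a hypothetical helper. It prints every ingressHost
# value that appears more than once in the given values file.
duplicate_ingress_hosts() {
  grep -E '^[[:space:]]*ingressHost:' "$1" \
    | sed -E 's/^[[:space:]]*ingressHost:[[:space:]]*//; s/"//g; s/[[:space:]]*$//' \
    | sort | uniq -d
}

# Usage:
#   duplicate_ingress_hosts ./inference_values.yaml
```

No output means all hostnames in the file are distinct.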
Non-AWS S3 endpoints
If your object storage is not AWS S3, point the model servers at the endpoint and region:
extraEnv:
  AWS_REGION: "<aws-region>"
  AWS_ENDPOINT_URL_S3: "https://<s3-endpoint>"
When you use SeaweedFS as the S3 backend, set the AWS CLI to the classic transfer client. The awsCliConfig map fully replaces the chart’s default transfer settings, which are incompatible with SeaweedFS and can cause download failures:
awsCliConfig:
  default.s3.preferred_transfer_client: "classic"
When you use NooBaa or another S3 backend with limited concurrency, throttle downloads. Without throttling, the init container can fail after downloading 1-2 GiB and restart in an infinite loop because the emptyDir volume is wiped on each restart:
awsCliConfig:
  default.s3.max_concurrent_requests: "2"
  default.s3.max_queue_size: "1000"
  default.s3.multipart_chunksize: "64MB"

Step 6: Install the inference chart

Install the inference chart into poolside-models:
helm install inference ./charts/inference \
  --namespace poolside-models \
  -f ./inference_values.yaml
If your S3 backend uses a private CA, include the CA bundle at install time:
helm install inference ./charts/inference \
  --namespace poolside-models \
  -f ./inference_values.yaml \
  --set-file s3.caBundle=<path-to-s3-ca.crt>

Step 7: Verify the deployment

Check that the model pods are running. The only pods in the namespace are the per-model servers:
kubectl get pods -n poolside-models
Each model server takes time to become ready on first start because it downloads its checkpoint from S3. Watch a model’s logs to track progress, where <model-key> is the key you set under models in the values file (the Step 5 example uses laguna and point):
kubectl logs -f -n poolside-models deploy/inference-<model-key>
Confirm an ingress was created for each model:
kubectl get ingress -n poolside-models
List the served models on a model’s endpoint to confirm routing works, where <model-hostname> is the ingressHost you set for that model:
curl -s http://<model-hostname>/v1/models
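Because the first start can take a while, you may prefer to poll the endpoint rather than rerun the command by hand. The following is a minimal sketch of a retry loop. wait_for is a hypothetical helper, and MODEL_HOSTNAME is an assumed shell variable holding the model's ingressHost.

```shell
# wait_for is a hypothetical helper: wait_for <attempts> <delay-seconds> <command...>
# It reruns the command until it succeeds or the attempts are used up,
# sleeping between attempts. Returns 0 on success, 1 on giving up.
wait_for() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while :; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    i=$((i + 1))
    sleep "$delay"
  done
}

# Usage: try every 30 seconds for up to 30 minutes.
#   wait_for 60 30 curl -sf "http://$MODEL_HOSTNAME/v1/models" && echo "ready"
```

The -f flag makes curl exit non-zero on HTTP error responses, so the loop keeps waiting until the endpoint answers successfully. If you enabled API key authentication, add the Authorization header to the curl command.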

Step 8: Call the inference API

Each model serves the OpenAI-compatible API directly at its own hostname. The base URL has the form:
http://<model-hostname>/v1
Append the OpenAI-compatible route to the base URL, such as /chat/completions or /completions. The commands below use three placeholders. Fill them from the inference_values.yaml you wrote in Step 5:
Placeholder | Source in inference_values.yaml | Example
<model-hostname> | models.<model-key>.ingressHost | laguna.poolside.local
<model-key> | a key under models | laguna
<served-model-name> | models.<model-key>.modelName | Laguna
If you do not have the values file, retrieve each value from the running cluster. To find the <model-key> values, list the model deployments. Each deployment is named inference-<model-key>:
kubectl get deploy -n poolside-models -l app.kubernetes.io/component=inference
Retrieve <model-hostname> from the model’s ingress:
kubectl get ingress inference-<model-key> -n poolside-models -o jsonpath='{.spec.rules[0].host}'
Retrieve <served-model-name> from the id field of that model’s models endpoint:
curl -s http://<model-hostname>/v1/models | jq -r '.data[].id'
Send a chat completion request:
curl http://<model-hostname>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<served-model-name>",
    "messages": [{"role": "user", "content": "Write a function that reverses a string."}]
  }'
For example, to call the laguna model served as Laguna:
curl http://laguna.poolside.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Laguna",
    "messages": [{"role": "user", "content": "Write a function that reverses a string."}]
  }'
If you set authentication.secretName in Step 5, include the key as a bearer token:
curl http://<model-hostname>/v1/chat/completions \
  -H "Authorization: Bearer <vllm-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<served-model-name>",
    "messages": [{"role": "user", "content": "Write a function that reverses a string."}]
  }'
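When scripting these requests, jq can build the request body and pull the reply text out of the response. The sketch below assumes the response follows the OpenAI chat completion shape, where the text sits at choices[0].message.content. The helper names and the MODEL_HOSTNAME and SERVED_MODEL_NAME variables are hypothetical.

```shell
# Hypothetical helpers; both require jq.

# Build a chat completion request body: chat_body <served-model-name> <prompt>
# jq handles the JSON escaping, so the prompt may contain quotes or newlines.
chat_body() {
  jq -nc --arg model "$1" --arg prompt "$2" \
    '{model: $model, messages: [{role: "user", content: $prompt}]}'
}

# Read a chat completion response on stdin and print the first choice's text.
extract_reply() {
  jq -r '.choices[0].message.content'
}

# Usage, assuming MODEL_HOSTNAME and SERVED_MODEL_NAME are set in your shell:
#   chat_body "$SERVED_MODEL_NAME" 'Write a function that reverses a string.' \
#     | curl -s "http://$MODEL_HOSTNAME/v1/chat/completions" \
#         -H "Content-Type: application/json" -d @- \
#     | extract_reply
```

Building the body with jq avoids hand-escaping quotes inside the prompt, which is easy to get wrong in an inline -d string.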

TLS

The ingress example in Step 5 exposes each model over HTTP. To serve the inference endpoints over HTTPS, add a tls block to ingress. The list applies to every model’s Ingress, so include an entry for each model hostname and reference a TLS secret in poolside-models:
ingress:
  enabled: true
  className: "nginx"
  tls:
    - hosts:
        - "<laguna-hostname>"
      secretName: "<laguna-tls-secret>"
    - hosts:
        - "<point-hostname>"
      secretName: "<point-tls-secret>"
Create each referenced secret with kubectl create secret tls, or use cert-manager to provision it. Clients then reach each model at https://<model-hostname>/v1.
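For test environments, a self-signed certificate is one way to produce the files that kubectl create secret tls expects. The sketch below assumes OpenSSL 1.1.1 or later (for -addext). make_self_signed_cert, the LAGUNA_HOSTNAME variable, and the laguna-tls secret name are hypothetical; use the secretName you reference in the tls block.

```shell
# make_self_signed_cert is a hypothetical helper:
#   make_self_signed_cert <hostname> <out-dir>
# It writes <out-dir>/tls.key and <out-dir>/tls.crt, valid for 365 days, with the
# hostname as both the subject CN and a DNS subjectAltName.
make_self_signed_cert() {
  host="$1"
  out="$2"
  mkdir -p "$out"
  openssl req -x509 -newkey rsa:2048 -nodes \
    -keyout "$out/tls.key" -out "$out/tls.crt" \
    -days 365 \
    -subj "/CN=$host" \
    -addext "subjectAltName=DNS:$host"
}

# Usage, assuming LAGUNA_HOSTNAME is set in your shell:
#   make_self_signed_cert "$LAGUNA_HOSTNAME" ./certs/laguna
#   kubectl create secret tls laguna-tls \
#     --cert=./certs/laguna/tls.crt --key=./certs/laguna/tls.key \
#     -n poolside-models
```

Clients do not trust a self-signed certificate by default. With curl, pass --cacert ./certs/laguna/tls.crt when calling the HTTPS endpoint.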

Offline documentation (optional)

The bundle also ships the Poolside documentation site, which the same inference chart can deploy in-cluster so operators have local access to the docs. It is off by default. To enable and expose it, see Set up offline documentation.

Troubleshooting

  • If pods stay in Init or restart in a loop, check the init container logs with kubectl logs -n poolside-models <pod-name> -c <init-container>. A stale or misspelled checkpoint path syncs nothing and the pod never starts.
  • If model servers fail to pull images, run kubectl describe pod <pod-name> -n poolside-models and verify that imagePullSecret references the correct secret.
  • If checkpoint downloads fail against SeaweedFS or NooBaa, review the awsCliConfig settings in Step 5.
  • If a model pod is Pending, confirm the cluster has enough GPUs for the gpus value you requested and that the NVIDIA GPU Operator is healthy.
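For the Pending case above, it can help to compare the GPUs the cluster advertises with the sum of the gpus values you requested. The sketch below assumes GPU nodes expose the nvidia.com/gpu resource, which is the name the NVIDIA device plugin registers. total_allocatable_gpus is a hypothetical helper and requires jq.

```shell
# total_allocatable_gpus is a hypothetical helper. It reads `kubectl get nodes -o json`
# output on stdin and prints the sum of allocatable nvidia.com/gpu across all nodes.
# Nodes that do not advertise the resource contribute nothing to the sum.
total_allocatable_gpus() {
  jq '[.items[].status.allocatable["nvidia.com/gpu"] // "0" | tonumber] | add // 0'
}

# Usage:
#   kubectl get nodes -o json | total_allocatable_gpus
```

The Step 5 example requests 4 + 1 = 5 GPUs in total. Allocatable counts include GPUs already claimed by other pods, so a large enough total is necessary but not sufficient. Also note that each model's GPUs must fit on a single node's allocatable count unless the chart schedules a model across nodes, which this guide does not state.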
For questions about hardware requirements, infrastructure configuration, or deployment issues, contact Poolside support.