Prerequisites
Poolside distributes the Helm deployment bundle as a .tar.gz archive. Extract it before you start.
The deployment also requires:
- Kubernetes 1.29 or later
- GPU nodes with enough GPUs for the models you deploy
- NVIDIA GPU Operator 26.3.0, with NVIDIA driver 580.126.20 and NVIDIA Container Toolkit 1.19.0
- An ingress controller that can route HTTP and HTTPS traffic to the cluster
- A DNS hostname for each model you deploy, resolving to the ingress endpoint. Kubernetes Ingress objects do not accept bare IP addresses; use a DNS name or /etc/hosts entries.
- An S3-compatible object storage service such as Amazon S3, SeaweedFS, MinIO, or NooBaa
- A container registry that every cluster node can access
You also need the following tools:
- helm 3.12 or later
- kubectl
- skopeo
- aws CLI (to upload checkpoints to S3-compatible object storage)
- jq (to parse JSON responses from the inference API)
- tar (to extract the deployment bundle)
- curl (to call the inference API)
- openssl (optional, to generate a TLS certificate for the inference endpoint)
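The bundle extraction mentioned at the top of this section could look like the sketch below. The helper name extract_bundle and the archive and directory names in the usage note are hypothetical; use the file name Poolside gave you.

```shell
# Hypothetical helper: extract a .tar.gz bundle into a target directory.
# $1: path to the archive, $2: destination directory (defaults to the current one).
extract_bundle() {
  local archive="$1" dest="${2:-.}"
  mkdir -p "$dest"
  # -x extract, -z gunzip, -f archive file, -C change into the destination first
  tar -xzf "$archive" -C "$dest"
}
```

For example, extract_bundle poolside-bundle.tar.gz ./poolside-bundle unpacks an archive with that (assumed) name into a new directory, which then serves as the bundle root for the later steps.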
Step 1: Create the namespace
The inference stack runs in a single namespace, poolside-models. Create it with kubectl create namespace poolside-models.
Step 2: Upload container images
Copy the bundled images into your registry. Log in to your target registry using docker login or podman login before running any upload commands.
Authenticate skopeo against your target registry with skopeo login.
If the registry requires credentials to pull images, create an image pull secret in poolside-models.
Step 3: Upload model checkpoints
The inference stack downloads model weights from your S3 bucket on pod startup, so the checkpoints must be in place before you deploy the chart. Poolside provides the checkpoint files separately from the deployment bundle. Confirm the local path and the destination prefix with your Poolside contact.
Uploading checkpoints is time-consuming. Start it now and continue with the remaining steps in parallel.
Create the bucket if it does not already exist. For an S3-compatible service other than Amazon S3, add --endpoint-url https://<s3-endpoint>. Note the bucket name; you reference it in the models.<key>.model paths in Step 5.
Then upload the checkpoints to the bucket. For an S3-compatible service other than Amazon S3, add --endpoint-url.
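As a rough sketch of the bucket creation and upload described in this step, the helpers below wrap aws s3 mb and aws s3 sync and add --endpoint-url only when an endpoint is set. The function names, the S3_ENDPOINT variable, and every bucket, path, and prefix value are assumptions for illustration; confirm the real local path and destination prefix with your Poolside contact.

```shell
# Assumption: set S3_ENDPOINT to https://<s3-endpoint> for a non-AWS backend,
# or leave it empty to use Amazon S3.
S3_ENDPOINT="${S3_ENDPOINT:-}"

# Run "aws s3 ...", adding the global --endpoint-url option when S3_ENDPOINT is set.
s3() {
  if [ -n "$S3_ENDPOINT" ]; then
    aws --endpoint-url "$S3_ENDPOINT" s3 "$@"
  else
    aws s3 "$@"
  fi
}

# $1: bucket name. "aws s3 mb" reports an error if the bucket already exists.
create_bucket() {
  s3 mb "s3://$1"
}

# $1: local checkpoint directory, $2: bucket name, $3: destination prefix.
upload_checkpoints() {
  s3 sync "$1" "s3://$2/$3"
}
```

Because aws s3 sync copies only files that are missing or changed at the destination, an interrupted upload can be rerun with the same arguments.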
Checkpoints are typically tens of GiB per model. For faster throughput, or for backends sensitive to upload concurrency such as NooBaa or SeaweedFS, run the upload from a host inside the cluster and use aws configure set to tune default.s3.max_concurrent_requests and default.s3.multipart_chunksize.
Step 4: Create the S3 credentials secret
The model servers read checkpoints from S3 using credentials in a Kubernetes secret. Create it in poolside-models.
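A minimal sketch of creating such a secret with kubectl follows. The helper name, the secret name you pass in, and the key names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are all assumptions; this page does not state which secret name or keys the chart expects, so check the bundle before relying on them.

```shell
NAMESPACE="poolside-models"

# Hypothetical helper. $1: secret name, $2: access key ID, $3: secret access key.
# The key names below are assumed, not confirmed by the chart.
create_s3_secret() {
  kubectl create secret generic "$1" \
    --namespace "$NAMESPACE" \
    --from-literal=AWS_ACCESS_KEY_ID="$2" \
    --from-literal=AWS_SECRET_ACCESS_KEY="$3"
}
```

Passing credentials as arguments leaves them in shell history; reading them from environment variables or a file is a safer variation.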
To require an API key for inference requests, create a secret named vllm-auth in poolside-models. In Step 5, set authentication.secretName to vllm-auth.
Step 5: Configure the inference values file
Create an inference_values.yaml file in the bundle root.
models.<key>.model and the image registry must exactly match the locations you uploaded from the deployment bundle. The image name and tag come pre-set to match the shipped atlas image.
Set each model’s gpus to a value that meets its minimum GPU memory for your GPU type. For the per-model minimums, see Supported configurations.
Each model is exposed at its own hostname through a separate Ingress named inference-<model-key>. Give every model a unique ingressHost. The Ingress routes the hostname’s root path directly to that model’s vLLM service, so clients reach the OpenAI-compatible API at http://<model-hostname>/v1.
If your object store is SeaweedFS, set awsCliConfig. The awsCliConfig map fully replaces the chart’s default transfer settings, which are incompatible with SeaweedFS and can cause download failures.
Note that an emptyDir volume is wiped on each restart.
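Pulling together the keys this step names, a values file might be shaped like the sketch below. It uses only keys mentioned on this page (models.<key>.model, modelName, gpus, ingressHost, and authentication.secretName). The checkpoint paths, the gpus values, the point hostname, and the point served name are placeholders or assumptions, and the image registry and awsCliConfig settings are left out because their sub-keys are not given here.

```shell
# Write a starting-point values file in the bundle root. Every placeholder and
# assumed value must be replaced before you install the chart.
cat > inference_values.yaml <<'EOF'
models:
  laguna:
    model: "<checkpoint-path-for-laguna>"   # must match the Step 3 upload location
    modelName: Laguna                       # served model name
    gpus: 8                                 # assumed value; see Supported configurations
    ingressHost: laguna.poolside.local      # unique hostname per model
  point:
    model: "<checkpoint-path-for-point>"
    modelName: "<served-name-for-point>"
    gpus: 8                                 # assumed value
    ingressHost: point.poolside.local       # assumed hostname
authentication:
  secretName: vllm-auth                     # optional; omit to serve without an API key
EOF
```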
Step 6: Install the inference chart
Install the inference chart into poolside-models, passing the values file from Step 5.
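The install could be sketched as follows, run from the bundle root. The helper name is hypothetical, and both the release name and the chart path are assumptions you pass in; use the chart location that your bundle actually ships.

```shell
# Hypothetical helper. $1: release name (assumed), $2: path to the inference chart (assumed).
install_inference() {
  # "upgrade --install" installs the release if it is absent and upgrades it otherwise.
  helm upgrade --install "$1" "$2" \
    --namespace poolside-models \
    --values inference_values.yaml
}
```

For instance, install_inference inference ./inference assumes the chart sits in a directory named inference; adjust both arguments to your bundle layout.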
Step 7: Verify the deployment
Check that the model pods are running. The only pods in the namespace are the per-model servers.
When you check an individual model, <model-key> is the key you set under models in the values file (the Step 5 example uses laguna and point), and <model-hostname> is the ingressHost you set for that model.
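One way to run these checks is sketched below, assuming kubectl targets the cluster and the hostname resolves from your workstation. The helper names and the 30-minute default timeout are assumptions; the deployment name inference-<model-key> and the /v1/models path come from this guide.

```shell
NAMESPACE="poolside-models"

# List the pods in the namespace; only per-model servers should appear.
list_model_pods() {
  kubectl get pods --namespace "$NAMESPACE"
}

# Block until a model's deployment is rolled out. $1: model key, $2: timeout (assumed default).
wait_for_model() {
  kubectl rollout status "deployment/inference-$1" \
    --namespace "$NAMESPACE" --timeout="${2:-30m}"
}

# Query the models endpoint through the Ingress. $1: model hostname.
# -f makes curl exit non-zero on an HTTP error status.
check_models_endpoint() {
  curl -fsS "http://$1/v1/models"
}
```

A first start can take a while because each pod downloads its checkpoint before serving, so a generous timeout is usually the safer choice.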
Step 8: Call the inference API
Each model serves the OpenAI-compatible API directly at its own hostname. The base URL has the form http://<model-hostname>/v1. Append the endpoint path, such as /chat/completions or /completions.
The commands below use three placeholders. Fill them from the inference_values.yaml you wrote in Step 5:
| Placeholder | Source in inference_values.yaml | Example |
|---|---|---|
| <model-hostname> | models.<model-key>.ingressHost | laguna.poolside.local |
| <model-key> | a key under models | laguna |
| <served-model-name> | models.<model-key>.modelName | Laguna |
To look up these values on a running deployment:
- List the deployments to find the <model-key> values. Each model deployment is named inference-<model-key>.
- Read <model-hostname> from the model’s ingress.
- Read <served-model-name> from the id field of that model’s models endpoint.
For example, a request to the laguna model served as Laguna goes to laguna.poolside.local and sets model to Laguna.
If you set authentication.secretName in Step 5, include the key as a bearer token.
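The lookups and a chat completion request could be sketched as below. The helper names and the API_KEY variable are hypothetical. The sketch also assumes that each Ingress carries its hostname in the first rule and that the models endpoint returns the usual OpenAI-style list with a data array.

```shell
NAMESPACE="poolside-models"

# List deployments; each model appears as deployment.apps/inference-<model-key>.
list_model_deployments() {
  kubectl get deployments --namespace "$NAMESPACE" -o name
}

# Print the hostname on a model's Ingress. $1: model key.
model_hostname() {
  kubectl get ingress "inference-$1" --namespace "$NAMESPACE" \
    -o jsonpath='{.spec.rules[0].host}'
}

# Print the served model name from the models endpoint. $1: model hostname.
served_model_name() {
  curl -fsS "http://$1/v1/models" | jq -r '.data[0].id'
}

# Send one chat completion. $1: hostname, $2: served model name, $3: prompt.
# Set API_KEY only if you configured authentication.secretName in Step 5.
chat() {
  local host="$1" body
  # jq builds the JSON body so that quotes in the prompt are escaped correctly.
  body="$(jq -cn --arg model "$2" --arg prompt "$3" \
    '{model: $model, messages: [{role: "user", content: $prompt}]}')"
  set -- -fsS "http://$host/v1/chat/completions" \
    -H "Content-Type: application/json" -d "$body"
  if [ -n "${API_KEY:-}" ]; then
    set -- "$@" -H "Authorization: Bearer $API_KEY"
  fi
  curl "$@"
}
```

With the Step 5 example, chat laguna.poolside.local Laguna "Hello" sends the request to the laguna model; piping the response through jq makes it easier to read.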
TLS
The ingress example in Step 5 exposes each model over HTTP. To serve the inference endpoints over HTTPS, add a tls block to ingress. The list applies to every model’s Ingress, so include an entry for each model hostname and reference a TLS secret in poolside-models.
Create the secret with kubectl create secret tls, or use cert-manager to provision it. Clients then reach each model at https://<model-hostname>/v1.
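A minimal sketch of both pieces follows. It assumes the chart's ingress.tls list uses the standard Kubernetes Ingress tls shape (hosts plus secretName), which this page does not confirm. The helper name, the secret name inference-tls, the fragment file name, and the point hostname are hypothetical.

```shell
NAMESPACE="poolside-models"

# Create a TLS secret from an existing certificate and key.
# $1: secret name (hypothetical), $2: certificate file, $3: private key file.
create_tls_secret() {
  kubectl create secret tls "$1" \
    --namespace "$NAMESPACE" --cert="$2" --key="$3"
}

# Write a fragment to merge into inference_values.yaml by hand.
# Assumption: ingress.tls follows the standard Ingress tls layout.
cat > ingress_tls_fragment.yaml <<'EOF'
ingress:
  tls:
    - hosts:
        - laguna.poolside.local
        - point.poolside.local     # assumed hostname
      secretName: inference-tls    # hypothetical secret name
EOF
```

The certificate must cover every hostname listed under hosts, for example through subject alternative names or a wildcard.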
Offline documentation (optional)
The bundle also ships the Poolside documentation site, which the same inference chart can deploy in-cluster so operators have local access to the docs. It is off by default. To enable and expose it, see Set up offline documentation.
Troubleshooting
- If pods stay in Init or restart in a loop, check the init container logs with kubectl logs -n poolside-models <pod-name> -c <init-container>. A stale or misspelled checkpoint path syncs nothing and the pod never starts.
- If model servers fail to pull images, run kubectl describe pod <pod-name> -n poolside-models and verify that imagePullSecret references the correct secret.
- If checkpoint downloads fail against SeaweedFS or NooBaa, review the awsCliConfig settings in Step 5.
- If a model pod is Pending, confirm the cluster has enough GPUs for the gpus value you requested and that the NVIDIA GPU Operator is healthy.
Related resources
- Upstream Kubernetes deployment
- Set up offline documentation
- Upgrade on Kubernetes
- Remove from Kubernetes