Prerequisites
Poolside distributes the Helm deployment bundle as a .tar.gz archive. Extract it before you start:
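A minimal sketch of the extraction. The archive name `poolside-bundle.tar.gz` and the target directory `poolside-bundle` are hypothetical; substitute the file name you received from Poolside.

```shell
# Hypothetical names: replace with the archive you received from Poolside.
BUNDLE_ARCHIVE="${BUNDLE_ARCHIVE:-poolside-bundle.tar.gz}"
BUNDLE_DIR="${BUNDLE_DIR:-poolside-bundle}"

# Extract a .tar.gz archive into a target directory, creating it if needed.
# $1 = archive path, $2 = destination directory
extract_bundle() {
  mkdir -p "$2" && tar -xzf "$1" -C "$2"
}

# Usage: extract_bundle "$BUNDLE_ARCHIVE" "$BUNDLE_DIR"
```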
- OpenShift 4.16 or later
- GPU nodes with enough GPUs for the models you deploy
- NVIDIA GPU Operator 26.3.0, with NVIDIA driver 580.126.20 and NVIDIA Container Toolkit 1.19.0
- DNS records that resolve to the cluster router endpoint, or use a router-generated hostname
- An S3-compatible object storage service such as NooBaa (OpenShift Data Foundation), Amazon S3, or MinIO
- A container registry that your cluster can access

Command-line tools:
- helm 3.12 or later
- oc or kubectl
- skopeo
- aws CLI (to upload checkpoints to S3-compatible object storage)
- jq (to parse JSON responses from the inference API)
- tar (to extract the deployment bundle)
- curl (to call the inference API)
- openssl (optional, to generate a TLS certificate for the inference endpoint)
Step 1: Create the namespace
The inference stack runs in a single namespace:

Step 2: Upload container images
Copy the bundled images into your registry. Log in to your target registry using docker login or podman login before running any upload commands.
Authenticate skopeo against your target registry:
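A hedged sketch of the skopeo login. The registry host `registry.example.com` and the user name are placeholders, and the helper name `skopeo_login` is illustrative. The password is read from stdin so that it does not land in shell history.

```shell
# Hypothetical registry host: replace with your own.
REGISTRY="${REGISTRY:-registry.example.com}"

# Log skopeo in to a registry, reading the password from stdin.
# $1 = registry host, $2 = username
skopeo_login() {
  skopeo login --username "$2" --password-stdin "$1"
}

# Usage:
#   skopeo_login "$REGISTRY" "my-user" < password.txt
# For the OpenShift internal registry, a session token can serve as the password:
#   oc whoami -t | skopeo_login "$REGISTRY" "$(oc whoami)"
```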
poolside-models:
If you use the OpenShift internal registry, push the images into the poolside-models namespace. Pods in that namespace pull same-namespace imagestreams with the default service account, so no cross-namespace system:image-puller rolebinding is required.

Step 3: Upload model checkpoints
The inference stack downloads model weights from your S3 bucket on pod startup, so the checkpoints must be in place before you deploy the chart. Poolside provides the checkpoint files separately from the deployment bundle. Confirm the local path and the destination prefix with your Poolside contact.

Uploading checkpoints is time-consuming. Start it now and continue with the remaining steps in parallel.

Create the bucket if it does not already exist. The example uses the NooBaa endpoint; for AWS S3, omit the --endpoint-url flag:
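One way to sketch the bucket creation. The bucket name and endpoint URL are hypothetical; the endpoint stands in for your NooBaa or other S3-compatible endpoint. The helper checks for the bucket first and creates it only when the check fails.

```shell
# Hypothetical values: replace with your bucket and S3 endpoint.
BUCKET="${BUCKET:-poolside-checkpoints}"
S3_ENDPOINT="${S3_ENDPOINT:-https://s3.example.com}"

# Create the bucket only if a head-bucket check does not find it.
# $1 = bucket name, $2 = endpoint URL (leave empty for AWS S3)
ensure_bucket() {
  if [ -n "${2:-}" ]; then
    aws s3api head-bucket --bucket "$1" --endpoint-url "$2" 2>/dev/null \
      || aws s3 mb "s3://$1" --endpoint-url "$2"
  else
    aws s3api head-bucket --bucket "$1" 2>/dev/null || aws s3 mb "s3://$1"
  fi
}

# Usage: ensure_bucket "$BUCKET" "$S3_ENDPOINT"
```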
You reference this bucket and prefix in the models.<key>.model paths in Step 5.
Then upload the checkpoints to the bucket:
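A minimal sketch of the upload using `aws s3 sync`. The local directory, bucket, and prefix are placeholders; confirm the real values with your Poolside contact. Passing an empty endpoint targets AWS S3 and omits `--endpoint-url`.

```shell
# Upload a local checkpoint directory to s3://<bucket>/<prefix>/.
# $1 = local directory, $2 = bucket, $3 = prefix, $4 = endpoint URL (empty for AWS S3)
upload_checkpoints() {
  if [ -n "${4:-}" ]; then
    aws s3 sync "$1" "s3://$2/$3/" --endpoint-url "$4"
  else
    aws s3 sync "$1" "s3://$2/$3/"
  fi
}

# Usage (all four values are hypothetical):
#   upload_checkpoints ./checkpoints poolside-checkpoints models https://s3.example.com
```

Because `aws s3 sync` copies only files that are new or changed, the same command can be run again after an interrupted upload.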
For AWS S3, omit --endpoint-url:
Checkpoints are typically tens of GiB per model. For faster throughput, or for backends sensitive to upload concurrency such as NooBaa, run the upload from a host inside the cluster and tune default.s3.max_concurrent_requests and default.s3.multipart_chunksize with aws configure set.

Step 4: Create the S3 credentials secret
The model servers read checkpoints from S3 using credentials in a Kubernetes secret. Create it in poolside-models:
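A hedged sketch of the secret creation. The secret name and the key names `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are assumptions for illustration; this page does not state the names the chart expects, so check them against your bundle before use.

```shell
# ASSUMPTION: the secret name and the two key names are illustrative only.
# Use the names your chart version expects.
# $1 = secret name, $2 = access key ID, $3 = secret access key
create_s3_secret() {
  oc create secret generic "$1" \
    --namespace poolside-models \
    --from-literal=AWS_ACCESS_KEY_ID="$2" \
    --from-literal=AWS_SECRET_ACCESS_KEY="$3"
}

# Usage: create_s3_secret "<secret-name>" "$ACCESS_KEY_ID" "$SECRET_ACCESS_KEY"
```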
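To require an API key for inference requests, also create a secret named vllm-auth in poolside-models. In Step 5, set authentication.secretName to vllm-auth.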
Step 5: Configure the inference values file
Create an inference_values.yaml file in the bundle root:
inference_values.yaml
models.<key>.model and the image registry must exactly match the locations you uploaded from the deployment bundle. The image name and tag come pre-set to match the shipped atlas image.
Set each model’s gpus to a value that meets its minimum GPU memory for your GPU type. For the per-model minimums, see Supported configurations.
Each model is exposed through a separate Route named inference-<model-key>. Leave routeHost empty to let the OpenShift router generate a hostname per model, or set an explicit host. The Route sends the host’s root path directly to that model’s vLLM service, so clients reach the OpenAI-compatible API at https://<route-host>/v1.

If your S3 backend uses a certificate from a private CA, pass the CA certificate through s3.caBundle, supplied at deployment time in Step 6.
NooBaa and other S3 backends with limited concurrency need throttled downloads. Without throttling, the init container can fail after downloading 1-2 GiB and restart in an infinite loop because the emptyDir volume is wiped on each restart:
Step 6: Install the inference chart
Install the inference chart into poolside-models. If your S3 backend uses a publicly trusted certificate, install the chart directly:
For NooBaa, extract the CA certificate from the noobaa-s3-serving-cert secret rather than committing it to your values file:
Then pass the certificate file to the chart with --set-file:
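The install paths above can be sketched as follows. The release name, the chart path, the `openshift-storage` namespace, and the `tls.crt` key inside the secret are assumptions, not values confirmed by this page; only `s3.caBundle`, `--set-file`, the secret name, and the namespace `poolside-models` come from the text.

```shell
# ASSUMPTIONS: release name, chart path, the namespace that holds the NooBaa
# secret, and the tls.crt key are illustrative. Adjust them to your cluster.

# Write the certificate stored in a TLS secret to a local file.
# $1 = secret name, $2 = namespace, $3 = output file
extract_secret_cert() {
  oc get secret "$1" --namespace "$2" -o jsonpath='{.data.tls\.crt}' | base64 -d > "$3"
}

# Install the chart. Pass a CA file as $3 to supply s3.caBundle, or omit it
# when the S3 backend uses a publicly trusted certificate.
# $1 = release name, $2 = chart path, $3 = optional CA bundle file
install_inference() {
  if [ -n "${3:-}" ]; then
    helm install "$1" "$2" --namespace poolside-models \
      --values inference_values.yaml --set-file s3.caBundle="$3"
  else
    helm install "$1" "$2" --namespace poolside-models --values inference_values.yaml
  fi
}

# Usage:
#   extract_secret_cert noobaa-s3-serving-cert openshift-storage noobaa-ca.crt
#   install_inference inference ./inference noobaa-ca.crt
```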
Step 7: Verify the deployment
Check that the model pods are running. The only pods in the namespace are the per-model servers:

<model-key> is the key you set under models in the values file (the Step 5 example uses laguna and point):
<route-host> is the host of that model’s Route:
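A small sketch of these checks, assuming the namespace poolside-models and the Route naming inference-<model-key> described in Step 5. The helper names are illustrative.

```shell
# List the pods in the inference namespace.
list_model_pods() {
  oc get pods --namespace poolside-models
}

# Print the host of the Route for one model key (Route name: inference-<model-key>).
# $1 = model key
route_host() {
  oc get route "inference-$1" --namespace poolside-models -o jsonpath='{.spec.host}'
}

# Usage:
#   list_model_pods
#   route_host laguna
```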
Step 8: Call the inference API
Each model serves the OpenAI-compatible API directly at its own Route host. The base URL has the form https://<route-host>/v1. Append /chat/completions or /completions to call an endpoint.
The commands below use three placeholders. Fill the model values from the inference_values.yaml you wrote in Step 5; OpenShift assigns each model’s Route host unless you set routeHost:
| Placeholder | Source | Example |
|---|---|---|
| <route-host> | assigned by OpenShift per model (or models.<model-key>.routeHost if you set a custom host) | inference-laguna-poolside-models.apps.cluster.example.com |
| <model-key> | a key under models | laguna |
| <served-model-name> | models.<model-key>.modelName | Laguna |
List the <model-key> values. Each model deployment is named inference-<model-key>:
Get <route-host> from the model’s Route:
Get <served-model-name> from the id field of that model’s models endpoint:
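As a sketch, the id lookup can be done with jq. This assumes the models endpoint returns the usual OpenAI-style list, with the entries under a `data` array; the helper name `model_ids` is illustrative.

```shell
# Print every model id from an OpenAI-style /v1/models response read on stdin.
model_ids() {
  jq -r '.data[].id'
}

# Usage: curl -s "https://<route-host>/v1/models" | model_ids
```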
For example, to call the laguna model served as Laguna:
If you set authentication.secretName in Step 5, include the key as a bearer token:
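A hedged sketch of a chat completions request. It assumes the standard OpenAI request shape (`model` plus a `messages` array); the helper names and the `API_KEY` variable are illustrative. The bearer header is sent only when a key is passed.

```shell
# Build the chat completions URL for a Route host.
# $1 = route host
chat_url() {
  printf 'https://%s/v1/chat/completions\n' "$1"
}

# Send one user message and print the raw JSON response.
# $1 = route host, $2 = served model name, $3 = prompt, $4 = optional API key
chat_once() {
  # jq builds the JSON body so that quotes in the prompt are escaped correctly.
  payload="$(jq -c -n --arg model "$2" --arg prompt "$3" \
    '{model: $model, messages: [{role: "user", content: $prompt}]}')"
  if [ -n "${4:-}" ]; then
    curl -sS "$(chat_url "$1")" -H "Content-Type: application/json" \
      -H "Authorization: Bearer $4" -d "$payload"
  else
    curl -sS "$(chat_url "$1")" -H "Content-Type: application/json" -d "$payload"
  fi
}

# Usage:
#   chat_once "<route-host>" "Laguna" "Hello" "$API_KEY" | jq -r '.choices[0].message.content'
```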
TLS
The Route example in Step 5 uses edge termination, where the OpenShift router terminates TLS with its default certificate. To serve a custom certificate, provide it inline under route.tls. This block applies to every model’s Route, so the certificate must be valid for all model Route hosts (for example, a wildcard certificate):
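For test clusters, a self-signed wildcard certificate can be generated with openssl, as in this sketch. The DNS name is a placeholder, the helper name is illustrative, and `-addext` requires OpenSSL 1.1.1 or later. Production deployments would normally use a certificate issued by your own CA instead.

```shell
# Generate a self-signed certificate and private key for one DNS name.
# $1 = DNS name (for example '*.apps.cluster.example.com'), $2 = output directory
make_self_signed_cert() {
  mkdir -p "$2" && openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
    -keyout "$2/tls.key" -out "$2/tls.crt" \
    -subj "/CN=$1" -addext "subjectAltName=DNS:$1"
}

# Usage: make_self_signed_cert '*.apps.cluster.example.com' ./tls
```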
Offline documentation (optional)
The bundle also ships the Poolside documentation site, which the same inference chart can deploy in-cluster so operators have local access to the docs. It is off by default. To enable and expose it through a Route, see Set up offline documentation.
Troubleshooting
- If pods stay in Init or restart in a loop, check the init container logs with oc logs -n poolside-models <pod-name> -c <init-container>. A stale or misspelled checkpoint path syncs nothing and the pod never starts.
- If checkpoint downloads fail against NooBaa, confirm the S3 CA bundle is mounted and review the awsCliConfig throttle settings in Step 5.
- If model servers fail to pull images, run oc describe pod <pod-name> -n poolside-models and verify the image pull secret or internal-registry pull access.
- If a model pod is Pending, confirm the cluster has enough GPUs for the gpus value you requested and that the NVIDIA GPU Operator is healthy.
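The init container check can be sketched as below. The helper names are illustrative; the first one reads the init container names from the pod spec so that you do not need to know <init-container> in advance.

```shell
# Print the init container names of a pod, separated by spaces.
# $1 = pod name
init_containers() {
  oc get pod "$1" --namespace poolside-models -o jsonpath='{.spec.initContainers[*].name}'
}

# Print the logs of every init container of a pod, one section per container.
# $1 = pod name
init_logs() {
  for c in $(init_containers "$1"); do
    echo "--- $c ---"
    oc logs --namespace poolside-models "$1" -c "$c"
  done
}

# Usage: init_logs "<pod-name>"
```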
Related resources
- OpenShift deployment overview
- Set up offline documentation
- Upgrade on OpenShift
- Remove from OpenShift