Overview
This guide describes how to upgrade an existing model inference deployment on Amazon EKS to a new Helm bundle. The upgrade updates the inference Helm release.
The upgrade process includes the following phases:
- Prepare the new bundle: Extract the bundle and reuse the values file from the previous deployment. Add any new values the new chart requires.
- Upload new container images: Push the new bundle’s container images into Amazon ECR.
- Upgrade the inference release: Run helm upgrade against the inference chart.
- Verify: Confirm that the new revision is deployed and the pods are healthy.
Prerequisites
- A working model inference deployment completed with Install on Amazon EKS.
- The new deployment bundle provided by Poolside.
- The customized inference_values.yaml file used for the initial deployment.
- Workstation tools, same versions as the initial deployment:
  - helm 3.12 or later
  - kubectl, configured for your EKS cluster
  - skopeo, to copy the bundled images into Amazon ECR
  - aws CLI
Downtime
The upgrade rolls model pods one Deployment at a time. The chart sets maxSurge to 0 so a rolled model does not request additional GPUs during the rollout, which means that model goes down briefly while its new pod starts. Each model server also re-downloads its checkpoint from S3 on restart, so expect a delay before a rolled model becomes ready. Plan a maintenance window if you run single-replica models.
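To see which models a rollout would affect, one option is to list each Deployment's replica count and rolling-update settings. The sketch below uses standard kubectl custom-columns output. The namespace argument is a placeholder, not a value this guide specifies.

```shell
# List replica count and rolling-update settings for every Deployment in a namespace.
# $1 = namespace of the inference release (placeholder; use your own).
list_rollout_settings() {
  kubectl get deployments --namespace "$1" \
    -o custom-columns='NAME:.metadata.name,REPLICAS:.spec.replicas,MAX_SURGE:.spec.strategy.rollingUpdate.maxSurge,MAX_UNAVAILABLE:.spec.strategy.rollingUpdate.maxUnavailable'
}

# Example: list_rollout_settings my-namespace
```

Deployments that show a single replica are the ones most likely to need a maintenance window.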
Step 1: Extract the new bundle
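A minimal sketch of this step follows. The tarball name, the target directory, and the variable name NEW_BUNDLE_ROOT are placeholders chosen for illustration. The sketch also assumes a gzip-compressed tarball.

```shell
# Extract a bundle tarball into a target directory.
# $1 = path to the tarball, $2 = directory to extract into (created if missing).
extract_bundle() {
  mkdir -p "$2"
  tar -xzf "$1" -C "$2"
}

# Example (hypothetical paths; adjust to your environment):
#   extract_bundle ./new-bundle.tar.gz "$HOME/bundle-new"
#   export NEW_BUNDLE_ROOT="$HOME/bundle-new"
# If the tarball unpacks into its own top-level directory, point
# NEW_BUNDLE_ROOT at that directory instead.
```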
Poolside provides the new bundle as a tarball. Extract it to a directory of your choice, then set a shell variable for the new bundle root.
Step 2: Review the values file
Reuse the inference_values.yaml file from your previous deployment. Poolside notes any required values changes in the release notes. The new bundle contains the reference values.yaml for the inference chart at charts/inference/values.yaml. Use it as a reference while reviewing your existing file.
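As a rough aid for this review, the sketch below lists top-level keys that appear in the reference values.yaml but not in your own file. It is a plain-text comparison, not a YAML parser: it only sees keys that start in the first column and does not detect nested changes, so the release notes remain the authoritative source. The function names are illustrative.

```shell
# Print the top-level keys of a YAML file (keys that start in column 0).
top_level_keys() {
  sed -n 's/^\([A-Za-z0-9_.-][A-Za-z0-9_.-]*\):.*$/\1/p' "$1" | sort -u
}

# Print keys present in the reference file ($1) but absent from your file ($2).
missing_keys() {
  ref_keys="$(mktemp)"
  own_keys="$(mktemp)"
  top_level_keys "$1" > "$ref_keys"
  top_level_keys "$2" > "$own_keys"
  comm -23 "$ref_keys" "$own_keys"
  rm -f "$ref_keys" "$own_keys"
}

# Example (NEW_BUNDLE_ROOT is a placeholder for your bundle root):
#   missing_keys "$NEW_BUNDLE_ROOT/charts/inference/values.yaml" inference_values.yaml
```

A key reported here is not necessarily required; it may simply have a chart default you never overrode.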
Step 3: Upload the new container images
The new bundle ships updated container images in ./containers/. Authenticate skopeo to your ECR registry, then push the images to the same repositories that the inference release uses.
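The login and push can be sketched as follows. The registry host, region, repository, tag, and archive file name are all placeholders. The sketch also assumes each image in ./containers/ is a docker-archive tarball; if the bundle uses a different format, change the source transport accordingly.

```shell
# Log skopeo in to an ECR registry using a short-lived token from the AWS CLI.
# $1 = AWS region, $2 = registry host (<account-id>.dkr.ecr.<region>.amazonaws.com).
ecr_login() {
  aws ecr get-login-password --region "$1" |
    skopeo login --username AWS --password-stdin "$2"
}

# Copy one image archive into an ECR repository.
# $1 = image archive (assumed docker-archive format), $2 = registry host,
# $3 = repository name, $4 = tag.
push_image() {
  skopeo copy "docker-archive:$1" "docker://$2/$3:$4"
}

# Example (hypothetical values):
#   ecr_login us-east-1 123456789012.dkr.ecr.us-east-1.amazonaws.com
#   push_image ./containers/atlas.tar 123456789012.dkr.ecr.us-east-1.amazonaws.com atlas <new-tag>
```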
Check the atlas tag that was pushed before you continue.
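One way to check the tag is to ask ECR for it directly. In this sketch the region and tag are placeholders; the atlas repository name comes from this guide. The command fails if the tag is not in the repository.

```shell
# Look up a tag in an ECR repository; exits non-zero if the image is not found.
# $1 = AWS region, $2 = repository name, $3 = tag.
ecr_tag_exists() {
  aws ecr describe-images --region "$1" --repository-name "$2" \
    --image-ids "imageTag=$3" \
    --query 'imageDetails[0].imageTags' --output text
}

# Example: ecr_tag_exists us-east-1 atlas <new-tag>
```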
Step 4: Dry-run the upgrade (optional)
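One way to run the preview is helm upgrade with --dry-run, which renders the upgrade without applying it. In this sketch the release name inference and the chart path charts/inference come from this guide; the namespace and the bundle root variable are placeholders.

```shell
# Render the upgrade without applying it.
# $1 = bundle root, $2 = namespace of the release, $3 = values file.
preview_upgrade() {
  helm upgrade inference "$1/charts/inference" \
    --namespace "$2" --values "$3" --dry-run
}

# Example (placeholders):
#   preview_upgrade "$NEW_BUNDLE_ROOT" my-namespace inference_values.yaml
```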
Preview the changes before you apply them.
Step 5: Apply the upgrade
Run the upgrade and watch the pods roll. The pods should return to a Running state when the upgrade completes.
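The upgrade and the pod watch can be sketched as two small helpers. As before, the namespace and bundle root are placeholders, and the release name and chart path are taken from this guide.

```shell
# Apply the upgrade to the inference release.
# $1 = bundle root, $2 = namespace of the release, $3 = values file.
apply_upgrade() {
  helm upgrade inference "$1/charts/inference" \
    --namespace "$2" --values "$3"
}

# Stream pod status changes until interrupted with Ctrl-C.
# $1 = namespace of the release.
watch_pods() {
  kubectl get pods --namespace "$1" --watch
}

# Example (placeholders):
#   apply_upgrade "$NEW_BUNDLE_ROOT" my-namespace inference_values.yaml
#   watch_pods my-namespace
```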
Step 6: Update models (optional)
You can add, update, or remove model checkpoints as part of this upgrade rather than as a separate operation. Make the model edits in the same inference_values.yaml file you reviewed in Step 2, before you run the helm upgrade in Step 5. The single helm upgrade then reconciles both the new chart and the model changes.
For the full procedure to add, update, or remove models, see Manage models on Amazon EKS. You can also run those changes separately at any time after the upgrade.
Verification
Confirm the release is deployed.
When you test a model endpoint, <model-hostname> is the ingressHost of a model under models: in inference_values.yaml.
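The release check can be sketched with helm status and helm history. The namespace is a placeholder; the release name inference comes from this guide. The first helper relies on the STATUS line that helm status prints in its default output.

```shell
# Succeed only if the inference release reports a deployed status.
# $1 = namespace of the release.
release_is_deployed() {
  helm status inference --namespace "$1" | grep -q '^STATUS: deployed'
}

# Show the revision history so you can confirm a new revision was created.
# $1 = namespace of the release.
show_release_history() {
  helm history inference --namespace "$1"
}

# Example (placeholder namespace):
#   release_is_deployed my-namespace && echo "release deployed"
#   show_release_history my-namespace
```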
Authorization header.
Troubleshooting
- Pods stuck pulling images: Verify that the new tag is present in the atlas ECR repository and that the GPU node group’s instance role still has the AmazonEC2ContainerRegistryReadOnly policy.
- Model pods stuck in Init: Each model re-downloads its checkpoint from S3 on restart. Check the init container logs and confirm the checkpoint paths in inference_values.yaml are still valid.
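The two checks above can be sketched as shell helpers. The namespace, pod name, and role name are placeholders. The sketch assumes the policy is attached to the role as a managed policy, and it discovers init container names from the pod spec rather than assuming a particular name.

```shell
# Print logs from every init container of a pod.
# $1 = namespace, $2 = pod name.
init_container_logs() {
  for c in $(kubectl get pod "$2" -n "$1" \
      -o jsonpath='{.spec.initContainers[*].name}'); do
    echo "--- init container: $c"
    kubectl logs "$2" -n "$1" -c "$c"
  done
}

# Succeed only if the role has AmazonEC2ContainerRegistryReadOnly attached.
# $1 = name of the GPU node group's instance role.
role_has_ecr_read() {
  aws iam list-attached-role-policies --role-name "$1" \
    --query 'AttachedPolicies[].PolicyName' --output text |
    grep -qw AmazonEC2ContainerRegistryReadOnly
}

# Example (placeholders):
#   init_container_logs my-namespace <model-pod-name>
#   role_has_ecr_read <node-instance-role> || echo "policy missing"
```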