Together launches Instant Clusters for self-service GPU access

Together has introduced Instant Clusters to provide an API-first developer experience. The service automates self-service provisioning of AI infrastructure, from single-node (8 GPUs) to large multi-node clusters with hundreds of interconnected GPUs, supporting NVIDIA Hopper and Blackwell GPUs.

The offering is designed to help AI-focused companies handle variable demand, such as training workloads or surges in inference traffic, by adding capacity quickly, with orchestration through Kubernetes (K8s) or Slurm. Instant Clusters can be set up within minutes, preconfigured for distributed training and low-latency inference.

Cloud ergonomics for GPU clusters

Developers often expect cloud infrastructure to be API-first, self-service, and consistent. Traditionally, GPU clusters required manual setup of drivers, schedulers, and networking. Together Instant Clusters aim to align GPU infrastructure with common cloud practices by automating deployment, maintaining consistency across environments, and supporting scaling from a single node to larger clusters without requiring workflow changes.

Self‑service, ready in minutes

Provision through the console, CLI, or API, and integrate with Terraform or SkyPilot for infrastructure-as-code and multi-cloud workflows. Choose and pin NVIDIA driver and CUDA versions, bring your own container images, attach shared storage, and be ready to run in minutes.
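
As a rough illustration of what API-first provisioning can look like from a script, here is a minimal Python sketch using the requests library. The endpoint path, payload fields, and response shape are assumptions made for illustration, not Together's documented API; consult the Instant Clusters API reference for the actual schema.

```python
# Illustrative sketch of API-first cluster provisioning. The endpoint path,
# payload fields, and response handling are assumptions for illustration --
# check Together's Instant Clusters API reference for the real schema.
import os
import requests

API_KEY = os.environ["TOGETHER_API_KEY"]   # assumed auth mechanism

payload = {
    "name": "training-burst",
    "gpu_type": "H100",        # placeholder accelerator name
    "num_gpus": 64,            # e.g. eight nodes of 8 GPUs each
    "orchestrator": "kubernetes",
}

resp = requests.post(
    "https://api.together.ai/v1/instant-clusters",  # hypothetical endpoint
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
resp.raise_for_status()
print("cluster id:", resp.json().get("id"))
```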

Batteries included

Clusters come pre-loaded with the components teams usually spend days wiring up themselves:

  • GPU Operator to manage drivers and runtime software.
  • Ingress controller to handle traffic into your cluster.
  • NVIDIA Network Operator for high-performance InfiniBand and RoCE networking.
  • Cert Manager for secure certificates and HTTPS endpoints.

These and other essentials are already in place, so your cluster is production-ready out of the box.
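
One way to confirm these components came up healthy is to list their pods with the Kubernetes Python client. A minimal sketch, assuming the kubeconfig downloaded for the cluster and the typical default namespaces for each project (both of which may differ on a given cluster):

```python
# Quick post-provisioning check that the bundled components are running.
# The kubeconfig path is a placeholder, and the namespaces are the usual
# defaults for these projects (an assumption) -- adjust them to match what
# `kubectl get namespaces` reports on your cluster.
from kubernetes import client, config

config.load_kube_config(config_file="instant-cluster-kubeconfig.yaml")
v1 = client.CoreV1Api()

for ns in ["gpu-operator", "nvidia-network-operator", "ingress-nginx", "cert-manager"]:
    pods = v1.list_namespaced_pod(ns).items
    running = sum(1 for p in pods if p.status.phase == "Running")
    print(f"{ns}: {running}/{len(pods)} pods running")
```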

Optimized for distributed training

Training at scale demands the right interconnect and orchestration. Clusters are wired with non-blocking NVIDIA Quantum-2 InfiniBand across nodes and NVIDIA NVLink/NVSwitch within each node, delivering ultra-low-latency, high-throughput communication for multi-node training.
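
A quick way to gauge that fabric from a job's point of view is to time a large NCCL all-reduce across the nodes. This is a rough probe rather than a tuned benchmark; the tensor size, iteration count, and launch details are arbitrary choices:

```python
# Rough NCCL all-reduce probe that exercises InfiniBand between nodes and
# NVLink/NVSwitch within a node. Launch with torchrun across the nodes under
# test (plus rendezvous flags for multi-node). Only the timing matters here;
# the tensor contents are irrelevant.
import os
import time

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

x = torch.ones(64 * 1024 * 1024, device="cuda")  # 64M floats = 256 MiB

for _ in range(5):                 # warm-up iterations
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
start = time.time()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
per_iter = (time.time() - start) / iters

if dist.get_rank() == 0:
    size_gb = x.numel() * x.element_size() / 1e9
    print(f"all-reduce of {size_gb:.2f} GB: {per_iter * 1e3:.1f} ms per iteration")
dist.destroy_process_group()
```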

Run with Kubernetes or Slurm-on-K8s (with SSH access when you need it), keep environments reproducible with version-pinned drivers and CUDA, and checkpoint to shared storage: high-bandwidth parallel storage colocated with your compute that is durable, resizable, and billed on demand. This setup is well suited to pre-training, reinforcement learning, and multi-phase training schedules.
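
As a sketch of that workflow, the snippet below initializes NCCL, wraps a toy model in PyTorch DDP, and periodically checkpoints from rank 0 to a shared mount. The model, data, /shared mount path, and launch flags are placeholders:

```python
# Minimal multi-node DDP training loop with periodic checkpoints to shared
# storage. Launch with torchrun, e.g. `torchrun --nnodes=2 --nproc_per_node=8
# train.py` plus rendezvous flags for multi-node.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # NCCL rides on IB/NVLink
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(4096, 4096).cuda(), device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(1000):
        x = torch.randn(8, 4096, device="cuda")    # stand-in for a real batch
        loss = model(x).square().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

        # Checkpoint to the shared parallel filesystem from rank 0 only.
        if step % 100 == 0 and dist.get_rank() == 0:
            torch.save({"step": step, "model": model.module.state_dict()},
                       "/shared/checkpoints/ckpt.pt")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```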

Scalable burst capacity for production inference

When usage surges, services need to burst rather than re-architect. Use Together Instant Clusters to add inference capacity quickly and keep latency SLAs intact. Deploy your serving stack on clusters sized for the moment, resize them as user traffic spikes or subsides, and keep one operational model from test to production.
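
On the Kubernetes side, bursting typically means growing the cluster through the console or API and then scaling the serving Deployment onto the new nodes. A minimal sketch of that second half, assuming a Deployment named inference-server in a serving namespace (both placeholders):

```python
# Scale an inference Deployment up or down to track demand. The Deployment
# name, namespace, and replica count are placeholders; growing the underlying
# GPU node pool itself is done through the Instant Clusters console or API.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

def scale_inference(replicas: int,
                    name: str = "inference-server",
                    namespace: str = "serving") -> None:
    apps.patch_namespaced_deployment_scale(
        name=name,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )
    print(f"requested {replicas} replicas for {namespace}/{name}")

# Example: one replica per freshly added 8-GPU node during a traffic spike.
scale_inference(replicas=4)
```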

Reliable at scale

Training on large GPU clusters requires reliable performance: faulty NICs, miswired cables, or overheating GPUs can interrupt jobs or skew results. For general availability, Together has added reliability measures to ensure stability before and during training. Each node undergoes burn-in and NVLink/NVSwitch testing, and inter-node connections are validated with NCCL all-reduce checks. Reference training runs confirm tokens-per-second and Model FLOPs Utilization (MFU) benchmarks. Once deployed, clusters are continuously monitored: idle nodes repeat the tests, real-time observability flags anomalies, and SLAs define communication and compensation if issues arise.
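
For context, MFU can be estimated from throughput alone. A back-of-envelope sketch, assuming the common dense-transformer approximation of roughly 6 FLOPs per parameter per token; the parameter count, token throughput, and per-GPU peak in the example are placeholders:

```python
# Back-of-envelope Model FLOPs Utilization (MFU) estimate of the kind a
# reference training run can be checked against. Uses the ~6 FLOPs per
# parameter per token approximation for dense transformers (an assumption);
# all the numbers in the example call are placeholders.

def mfu(params: float, tokens_per_sec: float, num_gpus: int, peak_flops_per_gpu: float) -> float:
    achieved = 6.0 * params * tokens_per_sec   # model FLOP/s actually delivered
    peak = num_gpus * peak_flops_per_gpu       # theoretical hardware ceiling
    return achieved / peak

# Example: a 70B-parameter model at 2.5e5 tokens/s on 256 GPUs rated ~1e15 FLOP/s each.
print(f"MFU: {mfu(70e9, 2.5e5, 256, 1e15):.1%}")   # ~41%
```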

For more information, visit together.ai.
