Host Networking for Multinode Services

Overview

Multinode services can opt in to host networking by setting the network.host field to true. Host networking lets distributed workloads bypass network namespace isolation and run directly on the host network stack. It is useful for applications that require low-latency, high-bandwidth communication between nodes or direct access to high-speed interconnects such as InfiniBand or RoCE.

When to Use Host Networking

Enable host networking (network.host: true) when your multinode service:

Requires high-speed interconnects: Your application needs direct access to InfiniBand, RoCE, or other RDMA-capable networks
Needs low-latency communication: Network namespace isolation overhead is unacceptable for your workload
Has compatibility requirements: Your distributed framework requires host network access (e.g., some MPI implementations)

Avoid host networking when:

Your application works fine with standard container networking
You want stronger network isolation for security
Your service doesn’t require special network access

Host networking means the service shares the node’s network namespace. This can expose additional network interfaces and may have security implications. Use it only when necessary for performance or compatibility.
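
One way to see the effect of sharing the node's network namespace is to list the interfaces visible from inside the service. The sketch below assumes a Linux node, where each interface appears as an entry under /sys/class/net. The function name list_interfaces is illustrative and not part of the platform.

```shell
#!/bin/sh
# List network interface names visible in the current network namespace.
# Assumption: Linux exposes one entry per interface under /sys/class/net.
# The optional first argument overrides the directory (handy for testing).
list_interfaces() {
  net_dir="${1:-/sys/class/net}"
  for entry in "$net_dir"/*; do
    # Skip the unexpanded glob when the directory is empty or missing.
    [ -e "$entry" ] || continue
    basename "$entry"
  done
}

list_interfaces
```

With network.host set to true, the output would be expected to include the node's own interfaces (for example an InfiniBand or RoCE device), whereas an isolated namespace typically shows far fewer.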

Configuring Host Networking in YAML

Add the host field under the network section of your multinode service:

services:
  distributed-inference:
    image:
      uri: docker://myregistry/mpi-inference:latest
    script: |
      #!/bin/bash
      python -m inference_server --model /models/llm
    multinode:
      nodes: 4
      implementation: openmpi
    resource:
      cpu:
        cores: 16
        affinity: NUMA
      memory:
        size: 64GB
      devices:
        nvidia.com/gpu: 2
    network:
      host: true  # Enable host networking
      ports:
        - name: http
          port: 8080
      endpoints:
        - name: inference
          port-name: http
          protocol: https
          type: subdomain
          scope: group

Key YAML Fields

network.host: Boolean field to enable host networking (default: false)
multinode.nodes: Number of compute nodes to allocate
multinode.implementation: Coordination framework (ompi, openmpi, mpich, gasnet, or generic)
resource: Specifies resources per node (total resources = nodes × per-node resources)
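
The per-node rule for resource can be checked with simple arithmetic. The sketch below applies it to the 4-node distributed-inference example above (16 cores, 64GB of memory, and 2 GPUs per node). The helper name total_resource is illustrative only.

```shell
#!/bin/sh
# total = nodes x per-node amount, as described for the resource field.
total_resource() {
  nodes="$1"
  per_node="$2"
  echo $(( nodes * per_node ))
}

# Values taken from the distributed-inference example (multinode.nodes: 4).
echo "cores:  $(total_resource 4 16)"
echo "memory: $(total_resource 4 64)GB"
echo "gpus:   $(total_resource 4 2)"
```

For that example, the service as a whole is allocated 64 cores, 256GB of memory, and 8 GPUs.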

Understanding Multinode Network Endpoints

In a multinode service:

Rank 0 is the first node (node 0) and serves as the primary endpoint
Ranks 1, 2, 3, … are worker nodes that participate in distributed computation
All network ports and endpoints are exposed only on rank 0
Clients connect to rank 0, which coordinates with other ranks internally

For example, in a 4-node distributed inference service:

Rank 0 runs the API server and accepts HTTP requests (rank 0 can also be a worker)
Ranks 1-3 run workers that process inference requests coordinated by rank 0
Clients send requests to rank 0’s HTTPS endpoint

Example: vLLM Distributed Cluster with Host Networking

version: v1

volumes:
  models:
    reference: volume://user/persistent/vllm-cache

services:
  vllm:
    cwd: /home/vllm
    env:
      - MULTINODE_WRAPPER_FORWARD_STREAMS=1
    image:
      uri: >-
        docker://public.ecr.aws/deep-learning-containers/vllm:0.20.0-gpu-py312-cu130-ubuntu22.04-ec2-v1.1-2026-04-29-18-08-36-soci
    mounts:
      models:
        location: /home/vllm
    script: |
      #!/bin/sh
      HOSTNAME=$(hostname)

      vllm serve "openai/gpt-oss-20b" \
      --dtype auto \
      --tool-call-parser openai \
      --reasoning-parser openai_gptoss \
      --enable-auto-tool-choice \
      --tensor-parallel-size 2 \
      --nnodes 2 --node-rank 0 \
      --master-addr ${MULTINODE_NODE_IP} \
      --gpu-memory-utilization 0.9 \
      --kv-cache-dtype auto \
      --max-num-batched-tokens 2048 \
      --max-model-len 131072 \
      --host 0.0.0.0 \
      --port 8080 &

      pid=$!

      IFS=','

      for HOST in ${MULTINODE_HOSTLIST_NOSLOTS}; do
        if [ "$HOSTNAME" != "$HOST" ]; then
          $MULTINODE_SSH_WRAPPER $HOST vllm serve "openai/gpt-oss-20b" \
                                      --dtype auto \
                                      --tool-call-parser openai \
                                      --reasoning-parser openai_gptoss \
                                      --enable-auto-tool-choice \
                                      --tensor-parallel-size 2 \
                                      --nnodes 2 --node-rank 1 \
                                      --master-addr ${MULTINODE_NODE_IP} --headless \
                                      --gpu-memory-utilization 0.9 \
                                      --kv-cache-dtype auto \
                                      --max-num-batched-tokens 2048 \
                                      --max-model-len 131072 \
                                      --host 0.0.0.0 \
                                      --port 8080
        fi
      done

      wait $pid
    resource:
      cpu:
        cores: 16
        affinity: NUMA
      memory:
        size: 120GB
      devices:
        nvidia.com/gpu: 1
      annotations:
        nvidia.com/gpu.model: NVIDIA L4
    multinode:
      nodes: 2
      implementation: generic
    network:
      host: true        # Enable host networking for high-speed interconnects
      ports:
        - name: openai-api
          port: 8080
          protocol: tcp
      endpoints:
        - name: openai-vllm
          type: subdomain
          scope: public
          protocol: http
          port-name: openai-api
    readiness-probe:
      tcp-socket:
        port: 8080
      period-seconds: 30
      failure-threshold: 60
      success-threshold: 1
      initial-delay-seconds: 60
    persist: true

This example demonstrates:

A 2-node multinode service with 1 GPU per node (2 GPUs total)
Host networking enabled for high-speed node-to-node communication
Rank 0 serves the vLLM API on port 8080
A public-scope subdomain endpoint (openai-vllm) that exposes the openai-api port over HTTP
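
The script above hardcodes --node-rank 1 for the remote host, which only works for two nodes. For larger clusters, one possible approach is to number the remote hosts while iterating over the host list. The sketch below assumes, as in the example, that MULTINODE_HOSTLIST_NOSLOTS is a comma-separated list of hostnames that includes the local host. The function name assign_ranks is hypothetical, and it only prints the rank-to-host mapping rather than launching anything.

```shell
#!/bin/sh
# Print "<rank> <host>" for every host in a comma-separated list.
# The local host gets rank 0; remote hosts get 1, 2, ... in list order.
# Assumption: the list format matches MULTINODE_HOSTLIST_NOSLOTS in the example.
# The body runs in a subshell so the IFS change does not leak to the caller.
assign_ranks() (
  local_host="$1"
  IFS=','
  set -f   # no filename expansion while splitting the list
  rank=1
  for host in $2; do
    if [ "$host" = "$local_host" ]; then
      echo "0 $host"
    else
      echo "$rank $host"
      rank=$((rank + 1))
    fi
  done
)

# Example with made-up hostnames; inside a service, pass "$(hostname)" and
# "$MULTINODE_HOSTLIST_NOSLOTS" instead.
assign_ranks node-a "node-a,node-b,node-c"
```

Each remote launch could then use the printed rank in place of the fixed --node-rank 1, with --nnodes set to match multinode.nodes.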