What is the CVSS score of CVE-2026-34756?

CVE-2026-34756 has a CVSS 3.1 base score of 6.5 (Medium). CVSS vector: CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H. EPSS exploitation probability: 0.0%.

Is there a patch available for CVE-2026-34756?

Yes, a patch is available for CVE-2026-34756. Update the affected software to the latest version immediately.

Python CVE-2026-34756

Q: Which versions of Python are affected by CVE-2026-34756?

Denial of Service in vLLM OpenAI-compatible API server allows unauthenticated remote attackers to crash the service via a single HTTP request containing an extremely large n parameter. A patch is available.

EUVD-2026-19351 MEDIUM

Allocation of Resources Without Limits or Throttling (CWE-770)

2026-04-03 https://github.com/vllm-project/vllm

GHSA-3mwp-wvh9-7528

Denial Of Service Python

6.5

CVSS 3.1 · Vendor: https://github.com/vllm-project/vllm

Severity by source

Vendor (https://github.com/vllm-project/vllm) PRIMARY

6.5 MEDIUM

AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H

Red Hat

6.5 HIGH

qualitative

Primary rating from Vendor (https://github.com/vllm-project/vllm).

CVSS Vector · Vendor: https://github.com/vllm-project/vllm

CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H

Attack Vector: Network
Attack Complexity: Low
Privileges Required: Low
User Interaction: None
Scope: Unchanged
Confidentiality: None
Integrity: None
Availability: High

Lifecycle Timeline

EUVD ID Assigned: Apr 03, 2026 - 16:00 (euvd), EUVD-2026-19351
Analysis Generated: Apr 03, 2026 - 16:00 (vuln.today)
Patch released: Apr 03, 2026 - 16:00 (nvd), patch available
CVE Published: Apr 03, 2026 - 15:35 (nvd), MEDIUM 6.5

Description (CVE.org)

Summary

A Denial of Service vulnerability exists in the vLLM OpenAI-compatible API server. Because the n parameter in the ChatCompletionRequest and CompletionRequest Pydantic models has no upper-bound validation, an unauthenticated attacker can send a single HTTP request with an astronomically large n value. Such a request blocks the Python asyncio event loop and causes immediate Out-Of-Memory crashes by allocating millions of request object copies on the heap before the request even reaches the scheduling queue.

Details

The root cause of this vulnerability lies in the missing upper bound checks across the request parsing and asynchronous scheduling layers:

Protocol Layer:

In vllm/entrypoints/openai/chat_completion/protocol.py, the n parameter is defined simply as an integer without any pydantic.Field constraints for an upper bound.

python

class ChatCompletionRequest(OpenAIBaseModel):
    # Ordered by official OpenAI API documentation
    # https://platform.openai.com/docs/api/reference/chat/create
    messages: list[ChatCompletionMessageParam]
    model: str | None = None
    frequency_penalty: float | None = 0.0
    logit_bias: dict[str, float] | None = None
    logprobs: bool | None = False
    top_logprobs: int | None = 0
    max_tokens: int | None = Field(
        default=None,
        deprecated="max_tokens is deprecated in favor of "
        "the max_completion_tokens field",
    )
    max_completion_tokens: int | None = None
    n: int | None = 1
    presence_penalty: float | None = 0.0

SamplingParams Layer (Incomplete Validation):

When the API request is converted to internal SamplingParams in vllm/sampling_params.py, the _verify_args method only checks the lower bound (self.n < 1) and omits an upper-bound check entirely.

python

    def _verify_args(self) -> None:
        if not isinstance(self.n, int):
            raise ValueError(f"n must be an int, but is of type {type(self.n)}")
        if self.n < 1:
            raise ValueError(f"n must be at least 1, got {self.n}.")
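The missing check can be illustrated with a minimal sketch. The class below is a stand-in dataclass, not vLLM's actual SamplingParams, and MAX_N is a hypothetical limit chosen only for illustration; the real patch may use a different bound or enforce it in a different layer.

```python
from dataclasses import dataclass

# Hypothetical cap for illustration only; not a value taken from vLLM.
MAX_N = 128


@dataclass
class SamplingParamsSketch:
    """Stand-in for a sampling-parameters object; not vLLM's real class."""

    n: int = 1

    def __post_init__(self) -> None:
        self._verify_args()

    def _verify_args(self) -> None:
        if not isinstance(self.n, int):
            raise ValueError(f"n must be an int, but is of type {type(self.n)}")
        if self.n < 1:
            raise ValueError(f"n must be at least 1, got {self.n}.")
        # The upper-bound check that the vulnerable code path lacks.
        if self.n > MAX_N:
            raise ValueError(f"n must be at most {MAX_N}, got {self.n}.")
```

With this check in place, constructing SamplingParamsSketch(n=10**9) raises ValueError before any fan-out work can begin, while ordinary values such as n=4 pass unchanged.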

Engine Layer (The OOM Trigger):

When the malicious request reaches the core engine (vllm/v1/engine/async_llm.py), the engine attempts to fan out the request n times to generate identical independent sequences within a synchronous loop.

python

        # Fan out child requests (for n>1).
        parent_request = ParentRequest(request)
        for idx in range(parent_params.n):
            request_id, child_params = parent_request.get_child_info(idx)
            child_request = request if idx == parent_params.n - 1 else copy(request)
            child_request.request_id = request_id
            child_request.sampling_params = child_params
            await self._add_request(
                child_request, prompt_text, parent_request, idx, queue
            )
        return queue

Because Python's asyncio runs on a single thread and event loop, this monolithic for-loop monopolizes the CPU thread. The server stops responding to all other connections (including liveness probes). Simultaneously, the memory allocator is overwhelmed by cloning millions of request object instances via copy(request), driving the host's Resident Set Size (RSS) up by gigabytes per second until the OS OOM-killer terminates the vLLM process.

Impact

Vulnerability Type: Resource Exhaustion / Denial of Service

Impacted Parties:

Any individual or organization hosting a public-facing vLLM API server (vllm.entrypoints.openai.api_server), which happens to be the primary entrypoint for OpenAI-compatible setups.
SaaS / AI-as-a-Service platforms acting as reverse proxies sitting in front of vLLM without strict HTTP body payload validation or rate limitations.

Because this vulnerability exploits the control plane rather than the data plane, an unauthenticated remote attacker can achieve a high success rate in taking down production inference hosts with a single HTTP request. This effectively circumvents any hardware-level capacity planning and conventional bandwidth stress limitations.

Analysis (AI)

Denial of Service in vLLM OpenAI-compatible API server allows unauthenticated remote attackers to crash the service via a single HTTP request containing an extremely large n parameter. The lack of upper-bound validation causes the asyncio event loop to freeze while allocating millions of request object copies, leading to rapid Out-Of-Memory crashes. The CVSS score of 6.5 suggests moderate real-world risk, in part because the disclosed CVSS vector requires low privileges (PR:L). The description, however, indicates unauthenticated exploitability; this discrepancy is significant and warrants clarification from the vendor.

Technical Context (AI)

vLLM is a Python-based LLM inference framework providing an OpenAI-compatible API server (vllm.entrypoints.openai.api_server). The ChatCompletionRequest and CompletionRequest Pydantic models in vllm/entrypoints/openai/chat_completion/protocol.py define an n parameter (number of independent completions to generate) as a simple integer without upper bound constraints via pydantic.Field validation. When a request is processed through vllm/sampling_params.py, the _verify_args method checks only the lower bound (n >= 1) but omits upper bound validation. At the engine layer (vllm/v1/engine/async_llm.py), the request is fanned out via a synchronous for-loop that executes n times, creating n copies of the request object using Python's copy() function. Since Python asyncio runs single-threaded on one event loop, this monolithic loop blocks all other I/O operations while simultaneously exhausting heap memory by allocating millions of request object instances, violating CWE-770 (Allocation of Resources Without Limits or Throttling). The root cause is incomplete input validation across the protocol, sampling, and engine layers.

Remediation (AI)

Upgrade vLLM to a patched release that includes commit b111f8a61f100fdca08706f41f29ef3548de7380 (merged via PR #37952). Consult https://github.com/vllm-project/vllm/pull/37952 and https://github.com/vllm-project/vllm/security/advisories/GHSA-3mwp-wvh9-7528 for the exact fixed version number and installation instructions. As an interim mitigation, have the API gateway or reverse proxy in front of vLLM validate request bodies and reject or cap large n values; rate limiting and HTTP request body size restrictions help against floods but do not by themselves stop a single small request that carries a huge n. Consider enforcing authentication on the vLLM API endpoint if the deployment supports it. If running vLLM in Kubernetes, configure resource quotas and pod memory limits so that a runaway pod is killed and restarted before it triggers node-wide OOM-killer events. Monitor asyncio event loop latency and heap memory allocation for anomalies indicative of attack attempts.
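As one possible form of interim gateway-side validation, the sketch below inspects a request body before it is forwarded to vLLM. It uses only the standard library; the function name check_completion_body and the MAX_N limit are assumptions made for this example, not part of vLLM or of any particular gateway product, and a real deployment would wire the check into its own proxy or middleware.

```python
import json

# Hypothetical limit for illustration; choose a value that suits the deployment.
MAX_N = 16


def check_completion_body(raw_body: bytes, max_n: int = MAX_N) -> tuple[int, str]:
    """Return an (HTTP status, message) pair for a completion request body."""
    try:
        payload = json.loads(raw_body)
    except (ValueError, UnicodeDecodeError):
        return 400, "body is not valid JSON"
    if not isinstance(payload, dict):
        return 400, "body must be a JSON object"

    # A missing or null n is treated as 1, the default shown in the request model.
    n = payload.get("n", 1)
    if n is None:
        n = 1
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(n, bool) or not isinstance(n, int):
        return 400, "n must be an integer"
    if not 1 <= n <= max_n:
        return 400, f"n must be between 1 and {max_n}"
    return 200, "ok"
```

A caller would forward the request upstream only when the returned status is 200 and otherwise answer the client directly, so an oversized n never reaches the inference server.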