Standalone + Python Embedded Mode with Ray-Based Distributed Runtime? #916
chitralverma started this conversation in Ideas
Replies: 1 comment
cc @mwylde
Hey all — I’ve been digging into Arroyo recently and love the direction it’s heading: great performance, clean architecture, and strong support for connectors and stateful compute out of the box.
I wanted to open up a conversation around a different deployment/usage pattern than what’s currently documented. Most of the examples today assume running Arroyo as a cluster via k8s/Helm and managing pipelines via the web UI or CLI (which needs to be installed beforehand). That makes sense in many production cases, but for some lighter-weight or embedded scenarios, it would be awesome to have something like this:
- A Python-native API to define and run pipelines in-process (returning a `PipelineHandle`-style object)
- A local mode (single node, threadpool/multiprocess for ingestion → transform → sink)
- A distributed mode (by launching arroyo workers via ray workers, which also works with k8s etc.)
- A CLI entry point (`arroyo run ...` as described here)

Why this might be useful?
- Zero-setup onboarding: just `pip install <arroyo pkg name>` — no Helm, no infra setup, no prior installation of the CLI. btw, the CLI can still come from the Python package.
- Since Arroyo is built on datafusion, I think there is a great opportunity to offload tasks between engines like Arroyo (meant for streaming) and powerful batch engines like pyarrow, duckdb, polars etc. (primarily focused on batch at the moment) via the Arrow interop.

Really curious to hear thoughts from the core team and others in the community.
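To make the embedded idea concrete, here's a rough sketch of what a `PipelineHandle`-style API could feel like. Everything below is hypothetical: `Pipeline`, `PipelineHandle`, and all method names are illustrative stand-ins, not an existing Arroyo API. The stub just runs ingestion → transform → sink in-process to show the shape of the interface.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable, List

# Hypothetical sketch of a single-process "local" mode.
# None of these names exist in Arroyo today.

@dataclass
class PipelineHandle:
    """Returned by Pipeline.run(); lets the caller inspect/stop the job."""
    name: str
    _sink: List[Any] = field(default_factory=list)
    _running: bool = False

    def results(self) -> List[Any]:
        return list(self._sink)

    def stop(self) -> None:
        self._running = False


@dataclass
class Pipeline:
    name: str
    source: Iterable[Any]
    transform: Callable[[Any], Any]

    def run(self, mode: str = "local") -> PipelineHandle:
        # "local" = ingestion → transform → sink, all in-process.
        # A "distributed" mode could instead ship each stage to a
        # Ray worker; that's out of scope for this sketch.
        if mode != "local":
            raise NotImplementedError("only the local sketch is implemented")
        handle = PipelineHandle(self.name, _running=True)
        for event in self.source:                        # ingestion
            handle._sink.append(self.transform(event))   # transform → sink
        handle._running = False
        return handle


# Usage: pip install, then a few lines of Python — no Helm, no cluster.
handle = Pipeline(
    name="demo",
    source=range(5),
    transform=lambda x: x * x,
).run(mode="local")
print(handle.results())  # [0, 1, 4, 9, 16]
```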
Happy to help prototype or outline what an MVP could look like. I think this kind of flexibility could make Arroyo even more approachable and powerful.
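For the distributed side, the same pipeline stages could be fanned out to Ray workers. Since Ray may not be installed everywhere, the sketch below uses `ThreadPoolExecutor` as a stand-in for Ray actors; with Ray, each `submit` would become a `@ray.remote` task collected via `ray.get`, and the same scheduling would work on a k8s-backed Ray cluster. This is a naive partitioning sketch, not a proposed implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterable, List

def run_distributed(
    source: Iterable[Any],
    transform: Callable[[Any], Any],
    parallelism: int = 4,
) -> List[Any]:
    events = list(source)
    # Split the source into one chunk per worker (naive round-robin).
    chunks = [events[i::parallelism] for i in range(parallelism)]

    def worker(chunk: List[Any]) -> List[Any]:
        return [transform(e) for e in chunk]

    # ThreadPoolExecutor stands in for a pool of Ray workers here.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = [pool.submit(worker, c) for c in chunks]
        out: List[Any] = []
        for f in futures:
            # A real runtime would stream to a sink instead of collecting.
            out.extend(f.result())
    return out


print(sorted(run_distributed(range(10), lambda x: x * 2)))
# sorted() because chunked execution does not preserve input order
# -> [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```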
Thanks!