Show HN: I built an open-source data pipeline tool in Go

194 points by karakanb 2 days ago

Every data pipeline job I had to tackle required quite a few components to set up:

- One tool to ingest data

- Another one to transform it

- If you wanted to run Python, set up an orchestrator

- If you wanted to check the data, a data quality tool

Not only was this hard and time-consuming to set up, it was also pretty high-maintenance. I had to do a lot of infra work, and while these were billable hours for me, I didn’t enjoy the work at all. For some parts of it there were nice solutions like dbt, but they didn’t cover the end-to-end workflow. That’s why I decided to build an end-to-end solution that could take care of data ingestion, transformation, and Python stuff. Initially, it was just for our own usage, but in the end, we thought this could be a useful tool for everyone.

At its core, Bruin is a data framework that consists of a CLI application written in Golang, and a VS Code extension that supports it with a local UI.

Bruin supports quite a few things:

- Data ingestion using ingestr (https://github.com/bruin-data/ingestr)

- Data transformation in SQL & Python, similar to dbt

- Python env management using uv

- Built-in data quality checks

- Secrets management

- Query validation & SQL parsing

- Built-in templates for common scenarios, e.g. Shopify, Notion, Gorgias, BigQuery, etc

This means that you can write end-to-end pipelines within the same framework and get it running with a single command. You can run it on your own computer, on GitHub Actions, or in an EC2 instance somewhere. Using the templates, you can also have ready-to-go pipelines with modeled data for your data warehouse in seconds.

It includes an open-source VS Code extension as well, which allows working with the data pipelines locally, in a more visual way. The resulting changes are all in code, which means everything is version-controlled regardless; the extension just adds a nice layer on top.

Bruin can run SQL, Python, and data ingestion workflows, as well as quality checks. For Python stuff, we use the awesome (and it really is awesome!) uv under the hood to install dependencies in an isolated environment and to install and manage the Python versions locally, all in a cross-platform way. To load the data into the data warehouse, it uses dlt under the hood. It also uses Arrow’s memory-mapped files to share the data between the processes before uploading it to the destination.
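
To picture the memory-mapped hand-off mentioned above, here is a minimal sketch in plain Python. It is not Bruin's code and it does not use Arrow: the standard-library `mmap` module stands in for Arrow's memory-mapped files, and `write_column` and `sum_column` are hypothetical names. The idea is similar, though: one process writes a column to a file, and another maps that file and reads the values from the mapped pages instead of loading a separate copy first.

```python
import mmap
from array import array


def write_column(path, values):
    """Producer side: dump a column of signed 64-bit integers to a file."""
    with open(path, "wb") as f:
        array("q", values).tofile(f)


def sum_column(path):
    """Consumer side: map the file and read the integers from the mapping."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; it must not be empty.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Reinterpret the raw bytes as int64 values without copying them.
            with memoryview(mapped) as raw, raw.cast("q") as column:
                return sum(column)
```

For example, after `write_column("col.bin", [1, 2, 3, 40])`, a second process calling `sum_column("col.bin")` gets 46. Both sides are assumed to run on the same machine, so they share the same byte order.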

We went with Golang because of its speed and strong concurrency primitives, but more importantly, I knew Go better than the other languages available to me and I enjoy writing Go, so there’s also that.

We had a small pool of beta testers for quite some time and I am really excited to launch Bruin CLI to the rest of the world and get feedback from you all. I know it is not common to build data tooling in Go but I believe we found ourselves in a nice spot in terms of features, speed, and stability.

https://github.com/bruin-data/bruin

I’d love to hear your feedback and learn more about how we can make data pipelines easier and better to work with. Looking forward to your thoughts!

Best, Burak

peterm4 20 hours ago

I'd absolutely love to love this.

Using dbt at $JOB, and building a custom dbt adapter for our legacy data repos, I've slowly developed a difficult relationship with dbt's internals and externals. I've struggled with the way it (Python) handles concurrency, threading, timeouts with long-running (4hr+) jobs, and the like. Not to mention inconsistencies in the way it handles Jinja in config files vs SQL files. Also its lack of ingestion handling and VSCode/editor support, which it seems like Bruin considers very well! Since I started poking around on the inside of dbt I've felt like Go or Rust would be a far more suitable platform for a pipeline building tool, and this looks to be going in a great direction, so congrats on the launch and best of luck with your cloud offering.

That being said, I tried starting the example bruin pipeline with duckdb on a current data project, and I'm having no luck getting the connection to appear with `bruin connections list` so nothing will run. So looks like I'm going to have to stick with dbt for now. Might be worth adding some more documentation around the .bruin.yml file; dbt has great documentation listing the purpose and layout of each file in the folder which is very helpful when trying to set things up.

NortySpock 2 days ago

Interesting, I've been looking for a system / tool that acknowledges that a dbt transformation pipeline tends to be joined-at-the-hip with the data ingestion mode....

As I read through the documentation: do you have a mode in ingestr that lets you specify the maximum lateness of a file? (For late-arriving rows or files or backfills.) I didn't see it in my brief read-through.

https://bruin-data.github.io/bruin/assets/ingestr.html

Reminds me a bit of Benthos / Bento / Redpanda Connect (in a good way)

Interested to kick the tires on this (compared to, say, Python dlt)

karakanb 2 days ago

great point about the transformation pipeline, that's a very strong part of our motivation: it's never "just transformation", "just ingestion" or "just python", the value lies in being able to mix and match technologies.
as for the lateness: ingestr does the fetching itself, which means it ingests the data the moment you run it, so there's no latency there. for loading files from S3, as an example, you can already define your own blob pattern, which lets you ingest only the files that fit your lateness criteria. would this fit?
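
One way to picture the blob-pattern idea is the sketch below. Everything in it is an assumption for illustration: the date-partitioned key layout (`events/2024-05-03/file.csv`), the `.csv` suffix, and the helper names `patterns_for_window` and `select_keys` are made up, and this is not ingestr's API. It only shows how a lateness window can be turned into glob patterns that decide which files are picked up.

```python
from datetime import date, timedelta
from fnmatch import fnmatchcase


def patterns_for_window(prefix, today, max_late_days):
    """Build one glob pattern per day that is still inside the lateness window."""
    return [
        f"{prefix}/{(today - timedelta(days=offset)).isoformat()}/*.csv"
        for offset in range(max_late_days + 1)
    ]


def select_keys(keys, patterns):
    """Keep only the object keys that match at least one of the patterns."""
    return [key for key in keys if any(fnmatchcase(key, p) for p in patterns)]
```

With `today = date(2024, 5, 3)` and `max_late_days = 2`, the patterns cover May 1 to May 3, so a key such as `events/2024-04-20/c.csv` is skipped while `events/2024-05-01/a.csv` is kept.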
in addition, we will implement the concept of a "sensor", which will allow you to wait until a certain condition is met, e.g. a table/file exists, or a certain query returns true, and continue the pipeline from there, which could also help your use case.
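
Since the sensor is only planned at this point, the following is purely an illustrative sketch of the general pattern, not Bruin's design: poll a condition until it holds or a timeout expires. The function name `wait_for` and all of its parameters are hypothetical.

```python
import time


def wait_for(condition, timeout_s=60.0, poll_interval_s=1.0,
             sleep=time.sleep, clock=time.monotonic):
    """Poll `condition` until it returns True or the timeout expires.

    Returns True if the condition was met, False if the wait timed out.
    """
    deadline = clock() + timeout_s
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

A pipeline step could then call something like `wait_for(lambda: os.path.exists("/data/export.csv"), timeout_s=600)` (the path is made up) and continue only when it returns True. The condition could just as well run a query and check its result.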
feel free to join our slack community, happy to dig deeper into this and see what we can implement there.

jmccarthy 2 days ago

Burak - one wish I've had recently is for a "py data ecosystem compiler", specifically one which allows me to express structures and transformations in dbt and Ibis, but not rely on Python at runtime. [Go|Rust]+[DuckDB|chDB|DataFusion] for the runtime. Bruin seems very close to the mark! Following.

karakanb 2 days ago

hey, thanks for the shoutout!
I love the idea of moving in a direction where the right platform is used for the right job, and it is very much in line with where we are taking things. Another interesting project in that spirit is sqlframe: https://github.com/eakmanrq/sqlframe

thruflo 2 days ago

It’s pretty remarkable what Bruin brings together into a single tool / workflow.

If you’re doing data analytics in Python it’s well worth a look.

karakanb 2 days ago

thanks a lot for the kind words, James!

mushufasa 2 days ago

Hi Burak, thanks for posting! We're looking for a tool in this space and I'll take a look.

Does Bruin support specifying and visualizing DAGs? I didn't see that in the documentation via a quick look, but I thought to ask because you may use different terminology that can be a substitute.

sabrikaragonen a day ago

By using the VS Code extension, you can see the lineage of the pipeline (in other words, a visualization of the DAG).

fancy_pantser 2 days ago

> specifying and visualizing DAGs
Do you mean like Airflow or Pachyderm? I am also very interested in new tooling in this space that has these features.
- mushufasa 2 days ago
  
  yes that's what i'm thinking about.

karakanb a day ago

hey, absolutely. take a look here: https://bruin-data.github.io/bruin/vscode-extension/overview...

alpb 2 days ago

Congrats Burak, I can tell a lot of work has gone into this. If I may recommend, a comparison of this project with other similar/state-of-the-art projects would be really good to have in your documentation, so others can understand how your approach differs from them.

karakanb a day ago

that's definitely coming, thanks!

havef a day ago

Hi, Burak, it looks interesting. I was wondering, do you know about Redpanda Connect? Maybe you can take advantage of some of its ready-made components. It is also developed in Go.

- https://docs.redpanda.com/redpanda-connect/home/

- https://github.com/redpanda-data/connect

karakanb a day ago

hey, I didn't know that, definitely gonna take a look. thanks!

gigatexal a day ago

Ingestion with DLT likely would have given you more connections to things. Still very cool. I saw you talking about this on LinkedIn.

JeffMcCune 2 days ago

Congrats on the launch! Since this is Go have you considered using CUE or looked at their flow package? Curious how you see it relating or helping with data pipelines.

karakanb 2 days ago

thanks!
I did look into CUE in the very early days of Bruin but ended up going with a more YAML-based configuration due to its support. I am not familiar with their flow package specifically, but I'll definitely take a deeper look. From a quick look, it seems like it could have replaced some of the orchestration code in Bruin to a certain extent.
One of the challenges, maybe specific to the data world, is that the userbase is familiar with a certain set of tools and patterns, such as SQL and Python, so introducing even a small variance into the mix often adds friction; this was one of the reasons we didn't go with CUE at the time. I should definitely take another look though. thanks!

ellisv 2 days ago

Direct link to the documentation:

https://bruin-data.github.io/bruin/

producthunter90 2 days ago

How does it handle scheduling or orchestrating pipeline runs? Do you integrate with tools like Airflow, or is there a built-in solution for that?

karakanb 2 days ago

Bruin orchestrates individual runs for single pipelines, which means you can use any tool to schedule the runs outside and the assets will be orchestrated by Bruin. You can use GitHub Actions, Airflow, a regular cronjob, or any other form of scheduling for that.
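
As a rough illustration of what orchestrating the assets within a single run involves, the sketch below orders assets so that each one runs after its upstream dependencies. This is not Bruin's implementation: the asset names and the `execution_order` helper are made up, and the ordering is delegated to Python's standard `graphlib` module. An external scheduler (cron, GitHub Actions, Airflow) would only decide when the run starts.

```python
from graphlib import TopologicalSorter


def execution_order(dependencies):
    """Return asset names so that every asset comes after its upstream assets.

    `dependencies` maps each asset name to the set of assets it depends on.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical pipeline: two raw tables feed a staging model and a mart.
pipeline = {
    "raw.orders": set(),
    "raw.customers": set(),
    "staging.orders": {"raw.orders"},
    "marts.revenue": {"staging.orders", "raw.customers"},
}

for asset in execution_order(pipeline):
    print("run", asset)
```

If the dependencies contain a cycle, `graphlib` raises `CycleError` instead of returning an order, which is a natural place to fail before anything runs.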

evalsock a day ago

Do you have an integration for ML orchestration, so that we can reuse Bruin inside our existing pipeline?

wodenokoto a day ago

That ingestr CLI you also developed and just casually reference seems very, very cool!

karakanb a day ago

glad to hear you like it, thanks!!

sakshy14 a day ago

I just used your getting started guide and it's freaking amazing

karakanb a day ago

love it, thanks!

Multrex 2 days ago

Why is there no MySQL integration? Do you plan to add it? MySQL is very popular.

kyt 2 days ago

Why use this over Meltano?

ellisv 2 days ago

The README would benefit from a comparison to other tools.
I’m not (necessarily) motivated to switch tooling because of the language it is written in. I’m motivated to switch tooling if it has better ergonomics, performance, or features.
- karakanb 2 days ago
  
  good point, thanks. I'll definitely add some more details about the comparison between different tools.
  I agree with you 100% on the language part, I think it is an interesting detail for a data tool to be built in Go, but we have a lot more than that. a few of the things we do there are:
  - everything is local-first: native Python support, local VS Code extension, isolated local environments, etc
  - very quick iteration speed: rendered queries, backfills, all running locally
  - support for data ingestion, transformation, and quality, without leaving the framework, while also having the ability to extend it with Python
  these are some of the improvements we focused on bringing into the workflows, I hope this explains our thinking a bit more.
  - ellisv 2 days ago
    
    My #1 feedback would be to expand on the documentation.
    I really want to know how this is going to benefit me before I start putting in a lot of effort to switch to using it. That means I need to see why it is better than ${EXISTING_TOOL}.
    I also need to know that it is actually compatible with my existing data pipeline. For example, we have many single tenant databases that are replicated to a central warehouse. During replication, we have to attach source information to the records to distinguish them and for RBAC. It looks like I can do this with Bruin but the documentation doesn't explicitly talk about single tenant vs multi-tenant design.
    
    karakanb 2 days ago
    
    I would love to add a dedicated section on this, and would love to learn a bit more from you on this. Are there particular tools you would compare Bruin with, where you would like to understand the difference better?

karakanb 2 days ago

great question! Meltano, if I am not wrong, only does data ingestion (Extract & Load), whereas we go further into the pipeline such as transformation with SQL and Python, ML pipelines, data quality, and more.
I guess a more comparable alternative would be Meltano + dbt + Great Expectations + Airflow (for Python stuff), whereas Bruin does all of them at once. In that sense, Bruin's alternative would be a stack rather than a single product.
Does that make sense?

uniquenamehere 2 days ago

This looks cool! How would this compare to Benthos?

kakoni 2 days ago

Is dlt part of bruin-stack?

karakanb 2 days ago

depends on what you mean by that, but we do use dlt through ingestr (https://github.com/bruin-data/ingestr), which is used inside Bruin CLI.

halfcat 2 days ago

I always thought Hamilton [1] does a good job of giving enough visual hooks that draw you in.

I also noticed this pattern where library authors sometimes do a bit extra in terms of discussing and even promoting their competitors, and it makes me trust them more. A “here’s why ours is better and everyone else sucks …” section always comes across as the infomercial character who is having quite a hard time peeling an apple, to the point you wonder if this is the first time they’ve used hands.

One thing I wish for is a tool that’s essentially just Celery that doesn’t require a message broker (and can just use a database), and which is supported on Windows. There’s always a handful of edge cases where we’re pulling data from an old 32-bit system on Windows. And basically every system has some not-quite-ergonomic workaround that’s as much work as if you’d just built it yourself.

It seems like it’s just sending a JSON message over a queue or HTTP API and the worker receives it and runs the task. Maybe it’s way harder than I’m envisioning (but I don’t think so because I’ve already written most of it).

I guess that’s one thing I’m not clear on with Bruin: can I run workers in different physical locations and have them carry out the tasks in the right order? Or is this more of a centralized thing (meaning even if it’s K8s or Dask or Ray, those are all run in a cluster which happens to be distributed, but they’re all machines sitting in the same subnet, which isn’t the definition of a “distributed task” I’m going for)?

[1] https://github.com/DAGWorks-Inc/hamilton

karakanb a day ago

hey, thanks a lot for sharing your thoughts.
I like the comparison page in Hamilton, and in their examples they operate at the asset level, whereas Bruin crosses the asset level into the orchestrator level as well, effectively bridging the gap there. What Bruin does goes beyond a single asset, which might be a group of functions: it is about being able to build and run pipelines of such assets.
In terms of distributed execution, it is on our roadmap to make running distributed workloads as simple as possible, and Postgres as a pluggable queue backend is one of the options as well. Currently, Bruin is meant as a single-node CLI tool that will do the orchestration and the execution within the same machine.
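
To make the "database as a queue backend" idea concrete, here is a minimal sketch of a broker-less task queue. It is an assumption-laden illustration, not Bruin's roadmap design: SQLite from the Python standard library stands in for Postgres so the example runs anywhere, and the table layout and the function names (`open_queue`, `enqueue`, `claim_next`, `complete`) are hypothetical. In Postgres, the claim step would typically rely on `SELECT ... FOR UPDATE SKIP LOCKED` rather than SQLite's `BEGIN IMMEDIATE`.

```python
import json
import sqlite3


def open_queue(path=":memory:"):
    """Open the queue database and create the tasks table if needed."""
    # isolation_level=None means autocommit; transactions are started explicitly.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT NOT NULL, "
        "status TEXT NOT NULL DEFAULT 'pending')"
    )
    return conn


def enqueue(conn, payload):
    """Store a JSON-serializable payload as a pending task and return its id."""
    cur = conn.execute("INSERT INTO tasks (payload) VALUES (?)", (json.dumps(payload),))
    return cur.lastrowid


def claim_next(conn):
    """Mark the oldest pending task as running and return (id, payload), or None."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT id, payload FROM tasks WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute("UPDATE tasks SET status = 'running' WHERE id = ?", (row[0],))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return None if row is None else (row[0], json.loads(row[1]))


def complete(conn, task_id):
    """Mark a claimed task as done."""
    conn.execute("UPDATE tasks SET status = 'done' WHERE id = ?", (task_id,))
```

A worker loop would call `claim_next`, run the task described by the payload, and then call `complete`. Retries, heartbeats for crashed workers, and task ordering across dependencies are deliberately left out of this sketch.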

drchaim 2 days ago

Interesting, congrats! I've felt the same challenges but ended up using custom Python with dbt and DuckDB. I'll take a look!

tony_francis 2 days ago

How does this compare to Ray Data?

karakanb 2 days ago

I didn't know about Ray Data before, but just gave a quick look and it seems like a framework for ML workloads specifically?
Bruin is effectively going a layer above individual assets, and instead takes a declarative approach to the full pipeline, which could contain assets that are using Ray internally. In the end, think of Bruin as a full pipeline/orchestrator, which would contain one or more assets using various other technologies.
I hope this makes sense.