We Built The Most Modern API Management from Scratch. Here’s How.

March 21, 2024

Complaining about things that don't work well is easy. What's hard is coming up with a better solution.

That statement is true of many things in life, including one we know well at Traefik: API management. After all, there is plenty to criticize about traditional API management — such as poor integration with cloud-native technologies, inefficient incident management, challenging release management and poor governance strategies.

But what does a better approach to API management look like? How, specifically, do you implement a tool that enables cloud native-friendly, scalable, collaborative API management? Those are not simple questions to answer.

But these are the questions that we had to solve in order to build Traefik Hub. We had to make careful design choices to determine how API management workflows based on our tool should work. We also had to analyze a variety of technical considerations, such as how best to implement tunneling and a high-availability data plane.

We're proud of the solutions we arrived at. We're so proud that we'd like to share them — and the rationale behind them — in this article, with the goal of offering our community a peek into the design process and thinking that led to Traefik Hub.


Why We Decided to Reboot API Management

Before delving into the choices we made when designing Traefik Hub, let's talk about why we decided to build the solution in the first place.

The reason was simple: we believe traditional API management is badly broken. Most conventional API management tools were designed and built before cloud-native architectures became widespread, and before the widespread adoption of collaboration-centric, DevOps-based cultures. As a result, they lack the scalability, modularity and efficiency to drive optimal results in today's scale-out, distributed world.

We conceived of Traefik Hub as a way to reboot API management, so to speak. By building a new solution from the ground up — instead of trying to extend or overhaul an existing API management tool — we gave ourselves total freedom to reenvision what modern API management should look like.

Our Design Principles

Starting from scratch meant we had virtually unlimited leeway to design Traefik Hub in whichever ways we thought best. But it also meant we had to make a ton of design choices, since there was no pre-existing foundation to guide us or get us started.

We settled on an approach oriented around the following design principles.

The Unix Philosophy

For starters, we wanted to adhere to the so-called Unix philosophy, which emphasizes designing a tool that does one thing and does it well. That's why Traefik Hub doesn't try to be a Swiss army knife of API-related functionality. Instead, it focuses on API "day 2" operations — API deployment, runtime management, observability, and security.

Other tools are available to assist with processes like API design and documentation, and we don't think it's healthy to try to pack too many features into one tool.

Declarative Management

In the cloud-native era, engineers have become accustomed to declarative management. That means writing configuration that describes the desired state of a system, then letting tooling apply that state automatically.

Traefik Hub does this by enabling a GitOps-based approach to API management: by using files that you can manage through Git repositories, you describe what should happen to APIs, then Traefik Hub makes it happen automatically. In addition to helping to keep configuration data centralized and accessible, this approach enables scalability and automation in the realm of API management.

Kubernetes-Native

We decided not just to be cloud-native in general, but to focus on creating a truly Kubernetes-native solution, because Kubernetes has become the de facto standard for running cloud-native environments.

By "truly Kubernetes-native," we mean that Traefik Hub is CRD-driven and can be managed directly through standard kubectl commands, rather than requiring a proprietary tool.

Intuitive and Rapid Time-to-Value

We don't think anyone should have to take a course or spend months reading documentation and following tutorials to manage APIs with Traefik Hub. Instead, we prioritized a simple, intuitive design that minimizes the learning curve for anyone who is already familiar with cloud-native concepts and tooling.

Lightweight, Composable Architecture

Most cloud-native architectures are composable, meaning they involve a variety of loosely coupled components. Traefik Hub adopts the same type of design by giving users freedom to pick and choose from a variety of modern solutions when building their stacks. In other words, Traefik Hub doesn't force you into a particular stack or platform; you can run it alongside other tools of your choice.

Data Plane and Control Plane Separation

Finally, we decided that it was important to keep the data plane and the control plane separate — partly because we believe in modular and composable design, as we just noted, but also to enhance the security and reliability of our API management solution. We'll say more later about exactly how we implemented the data plane with this goal in mind.

Building Traefik Hub: Challenges We Faced and Lessons We Learned

Deciding on high-level design principles is one thing. Actually implementing them is another. And while we don't have room in this article to discuss every technical decision we made as we built Traefik Hub according to the design concepts described above, we'd like to highlight a couple of key choices we made as we worked through implementation challenges.

Tunneling

One was tunneling. Since we had to expose some resources publicly, creating tunneled connections to them was an elegant solution. We wanted the tunnels to be:

  • Highly available
  • Encrypted to protect data
  • Compatible with full-duplex mode (so that data could flow in both directions between the planes simultaneously)
  • Transparent to firewalls, so users would not require public IP addresses to expose the data plane
  • Capable of supporting multiplexed data transfer

There are a variety of tunneling technologies that have the potential to support at least most of these goals. At first, we considered using TCP and SSH, but that would have required a public IP address, and the connection would not have been reliable because there is no true fallback solution on SSH disconnect. We also thought about using WebSocket and SSH, which doesn't require a public IP, but is still subject to the disconnect issue.

Since SSH-based tunneling didn't seem ideal, we began exploring approaches based on Yamux, an open source multiplexing library. Initially, we considered using Yamux alongside gRPC, but that felt too hacky and the data envelope was too large. TCP and Yamux also proved suboptimal because that combination would still have required a public IP.

Finally, we settled on WebSocket and Yamux — which doesn't need a public IP, allows firewall passthrough, provides built-in encryption via TLS, and is more reliable following disconnects than an SSH tunnel.
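
To illustrate the approach, here is a minimal sketch of the data-plane side of such a tunnel, assuming the hashicorp/yamux and nhooyr.io/websocket libraries. The broker URL is a placeholder, and this is not Traefik Hub's actual implementation.

```go
// A minimal sketch of the idea: dial a WebSocket over TLS, then layer a
// yamux session on top so that many logical, full-duplex streams share one
// firewall-friendly outbound connection.
package main

import (
	"context"
	"log"

	"github.com/hashicorp/yamux"
	"nhooyr.io/websocket"
)

func main() {
	ctx := context.Background()

	// wss:// gives us TLS encryption; the URL is a placeholder.
	c, _, err := websocket.Dial(ctx, "wss://broker.example.com/tunnel", nil)
	if err != nil {
		log.Fatal(err)
	}

	// Adapt the WebSocket to a net.Conn so yamux can multiplex over it.
	conn := websocket.NetConn(ctx, c, websocket.MessageBinary)

	// The data plane acts as the yamux client; each Open() yields an
	// independent stream over the single outbound connection.
	session, err := yamux.Client(conn, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	stream, err := session.Open()
	if err != nil {
		log.Fatal(err)
	}
	defer stream.Close()

	if _, err := stream.Write([]byte("hello from the data plane")); err != nil {
		log.Fatal(err)
	}
}
```

Because the data plane dials out to the broker over wss, no inbound firewall rules or public IP addresses are needed, and yamux lets many streams flow in both directions over that single connection.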

Data Plane High Availability

When implementing the data plane for Traefik Hub, we wanted to make sure our solution would be resilient against regional network failures. In other words, we didn't want a problem in one part of the network to make the entire solution unavailable.

The first step in making this possible was to deploy dozens of brokers in different regions. But on their own, distributed brokers don't guarantee high data plane availability, because a data plane would still fail if it's connected to just one broker and that broker goes down. So, instead, we designed our data plane to connect to at least three different brokers simultaneously, meaning at least two brokers would have to fail before the data plane becomes unavailable.

In addition, we dynamically optimized load balancing for data plane connections. This ensures that if failures occur in some parts of the network, traffic is automatically rebalanced to keep data flowing as efficiently as possible.
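
As a rough illustration of the redundancy idea (not Traefik Hub's actual code), the sketch below keeps independent tunnels open to several brokers and reconnects whenever one drops. The broker URLs and the dialTunnel helper are placeholders.

```go
// A simplified, hypothetical sketch: the data plane maintains tunnels to
// several brokers at once, so losing one or two brokers does not take the
// data plane offline.
package main

import (
	"errors"
	"log"
	"time"
)

// dialTunnel stands in for the WebSocket+yamux dialing shown earlier. It is
// expected to block while the tunnel is healthy and return when it drops.
func dialTunnel(brokerURL string) error {
	return errors.New("tunnel establishment not implemented in this sketch")
}

// maintainTunnel keeps one broker connection alive, retrying forever.
func maintainTunnel(brokerURL string) {
	for {
		if err := dialTunnel(brokerURL); err != nil {
			log.Printf("tunnel to %s down: %v", brokerURL, err)
		}
		time.Sleep(2 * time.Second) // simple backoff before reconnecting
	}
}

func main() {
	// Connecting to three brokers in different regions means at least two
	// of them would have to fail before the data plane loses connectivity.
	brokers := []string{
		"wss://broker-eu.example.com/tunnel",
		"wss://broker-us.example.com/tunnel",
		"wss://broker-ap.example.com/tunnel",
	}
	for _, b := range brokers {
		go maintainTunnel(b)
	}
	select {} // keep the process running
}
```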

Conclusion: A New Approach to API Management

Again, there's plenty more to say about how Traefik Hub works under the hood, and why we decided to make it work that way. But we hope the details above have given you at least a basic sense of our thought process as we reconceptualized API management, as well as the technical iterations we worked through to get things just right: bringing scalability, simple incident and release handling, and intuitive collaboration to the realm of API management.

