
Scaling gRPC With Kubernetes (Using Go)

By Noam Yadgar
#grpc #k8s #kubernetes #go #software-engineering #system-design

The Tech

gRPC is a strong player in microservices-based systems. It leverages Protocol Buffers for well-defined API contracts, fast serialization (roughly 5× faster than JSON), and smaller payloads, and it supports streams thanks to HTTP/2. It’s easy to see why this technology is a good choice for real-time microservice communication.

HTTP/2

Unlike a typical REST API that’s built on top of HTTP/1.1, gRPC is built on top of HTTP/2. HTTP/2 multiplexes many concurrent streams over a single connection and can push data from the server to the client asynchronously, before the client asks for it. gRPC builds on these HTTP/2 streams to offer client-, server-, and bidirectional-streaming RPCs, a key feature that separates it from HTTP/1.1-based APIs. The flip side is that HTTP/2 relies on a long-lasting TCP connection between the client and the server.
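Because that connection is meant to stay open for a long time, grpc-go clients can keep it healthy with HTTP/2 keepalive pings. Below is a minimal sketch of such a client; the target address and the ping intervals are illustrative (and overly aggressive intervals may be rejected by the server’s keepalive enforcement policy), they are not taken from this article’s setup:

package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// One long-lived connection; keepalive pings keep it open across idle periods.
	conn, err := grpc.NewClient(
		"server:9090",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:    5 * time.Minute,  // ping if the connection has been idle this long
			Timeout: 20 * time.Second, // close the connection if the ping isn't acked in time
		}),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Every RPC made through this client reuses the same connection.
}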

The Problem

One thing that makes REST APIs exceptionally good at scaling is that they are stateless. Every request is essentially a new TCP session that can be routed to any available replica of the server, which works perfectly with the load-balancing of a Kubernetes service.

Things are a bit different with gRPC. Because each client keeps its TCP connection open for its entire lifespan (unless configured otherwise), the load-balancer can only spread connections, not individual requests: it will try to distribute the clients’ connections symmetrically across the server’s pods, and each client stays pinned to the pod it was given.

flowchart LR
    c1(client-1):::pod ==> s[service/server]:::srv
    c2(client-2):::pod ==> s
    c3(client-3):::pod ==> s
    c4(client-4):::pod ==> s
    s ==> p1(server-1):::pod
    s ==> p2(server-2):::pod
    s ==> p3(server-3):::pod
linkStyle 0,3,4 stroke: orange;
linkStyle 1,5 stroke: blue;
linkStyle 6,2 stroke: red;
classDef srv fill: #a3e4d7, stroke:  #148f77 
classDef pod fill: #85c1e9, stroke: #2874a6

Figure 1: The first three clients are symmetrically routed to the server’s three pods. The 4th client starts the next cycle.

Idle Pods

In Figure 1, we have more clients than servers, so even if we may not fully optimize the resource utilization of the servers, at least all of them are kept busy. But what happens if we have fewer clients than servers? The answer is that some pods will stand idle, doing no work while still consuming resources.

flowchart LR
    c1(client-1):::pod ==> s[service/server]:::srv
    s ==> p1(server-1):::pod
    s --> p2(server-2):::pod
    s --> p3(server-3):::pod
linkStyle 0,1 stroke: orange;
classDef srv fill: #a3e4d7, stroke:  #148f77 
classDef pod fill: #85c1e9, stroke: #2874a6

Figure 2: One client, three servers. The client is making requests to only one server.

Autoscaling

Autoscaling the number of server replicas might look like the solution. The truth is: it’s not. Imagine a scenario where you’ve set an autoscaling rule (with a tool like Keda) that increases the number of replicas whenever a pod reaches 90% memory utilization.

Since there’s no link between the number of clients and the number of servers, we can face a scenario in which one client triggers the autoscaling rule, the number of servers grows by one, and yet that client never sends any traffic to the new replica.

It wouldn’t make sense to scale the number of servers based on the number of clients either. Scaling on resource utilization is a good rule, but we must ensure that when a pod is too busy, the load is actually shared with the new replicas the autoscaling tool adds.

The Test

To illustrate the problem, I wrote a simple pair of gRPC apps in Go (a client and a server) based on this .proto:

syntax = "proto3";

option go_package = "internal/";

service Service {
  rpc Get(Req) returns (Res) {}
}

message Req {}

message Res {
  string id = 1;
}

The client sends 3000 messages via the Get method, and the server responds with a pre-generated UUID to reflect its unique identity. The returned value is then printed to stdout by the client. Here’s the client’s code:

package main

import (
	"context"
	"fmt"
	"os"

	"example/grpc_lb/internal"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

const n = 3000

func main() {
	target := os.Getenv("TARGET")
	if target == "" {
		panic("no target")
	}
	conn, err := grpc.NewClient(
		target,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	c := internal.NewServiceClient(conn)

	for i := 0; i < n; i++ {
		res, _ := c.Get(context.Background(), &internal.Req{})
		fmt.Println(res.GetId())
	}
}
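The server’s code isn’t shown in this post. For completeness, a minimal counterpart consistent with the client above could look like the sketch below; the github.com/google/uuid dependency and the hard-coded :9090 port are my assumptions:

package main

import (
	"context"
	"net"

	"example/grpc_lb/internal"

	"github.com/google/uuid"
	"google.golang.org/grpc"
)

// server answers every Get with the same UUID, generated once at startup,
// so each replica can be told apart in the client's logs.
type server struct {
	internal.UnimplementedServiceServer
	id string
}

func (s *server) Get(ctx context.Context, _ *internal.Req) (*internal.Res, error) {
	return &internal.Res{Id: s.id}, nil
}

func main() {
	lis, err := net.Listen("tcp", ":9090")
	if err != nil {
		panic(err)
	}
	srv := grpc.NewServer()
	internal.RegisterServiceServer(srv, &server{id: uuid.NewString()})
	if err := srv.Serve(lis); err != nil {
		panic(err)
	}
}

Counting the distinct IDs in the client’s output therefore tells us how many replicas actually served traffic.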

Using minikube as my local Kubernetes cluster, I deployed three replicas of the server, exposed them through a service called server, and ran one client as a Job:

kubectl get pods
NAME                      READY   STATUS      RESTARTS   AGE
client-xm42p              0/1     Completed   0          7m51s
server-6565dd7765-7g67b   1/1     Running     0          7m44s
server-6565dd7765-hl7ww   1/1     Running     0          7m44s
server-6565dd7765-l7hrp   1/1     Running     0          7m44s

kubectl get svc
NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
server       ClusterIP   10.104.44.221   <none>        9090/TCP   7m40s

Let’s look at the logs:

kubectl logs job/client | uniq -c
 3000 7b4fcf65-2c93-4c5c-8433-5950c91553a9

As you can see, although we have three servers, the client sent all of its requests to only one of them. Maybe that’s because our single client is making synchronous requests to the service. Let’s try communicating concurrently:

func main() {
	target := os.Getenv("TARGET")
	if target == "" {
		panic("no target")
	}
	conn, err := grpc.NewClient(
		target,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	c := internal.NewServiceClient(conn)

	ch := make(chan string)
	for i := 0; i < n; i++ {
		go func() {
			res, _ := c.Get(context.Background(), &internal.Req{})
			ch <- res.GetId()
		}()
	}

	for i := 0; i < n; i++ {
		fmt.Println(<-ch)
	}
	close(ch)
}

kubectl logs job/client | uniq -c
 3000 7b4fcf65-2c93-4c5c-8433-5950c91553a9

The same result. You’ll still see it even if you create a new connection on every iteration. As mentioned above, the load-balancer spreads connections symmetrically across pods, so a single client will always end up talking to the same server.

The Solution

In this experiment, we’re trying to make our single client spread its messages (preferably evenly) across the different replicas of the server.

flowchart LR
    c1(client-1):::pod ==> s[service/server]:::srv
    s ==> p1(server-1):::pod
    s ==> p2(server-2):::pod
    s ==> p3(server-3):::pod
linkStyle 0,1,2,3 stroke: orange;
classDef srv fill: #a3e4d7, stroke:  #148f77 
classDef pod fill: #85c1e9, stroke: #2874a6

Figure 3: One client sending requests to all of the server’s replicas.

Headless service

The trick is not to use Kubernetes’ built-in load-balancing at all. Kubernetes supports headless services: services that don’t get a cluster IP and therefore don’t offer a single, load-balanced endpoint. Instead, a headless service exposes the IPs of the pods behind it via DNS, allowing other load-balancing/service-discovery implementations to replace the built-in one.

To turn a Kubernetes service into a headless service, we simply set clusterIP: None in the service definition:

apiVersion: v1
kind: Service
metadata:
  name: server
  namespace: default
  labels:
    app.kubernetes.io/name: server
spec:
  clusterIP: None
  ports:
    - name: grpc
      port: 9090
      targetPort: 9090
  selector:
    app.kubernetes.io/name: server
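
A quick way to see what changed: with clusterIP: None, the service name resolves to one A record per ready pod instead of a single virtual IP. The sketch below prints those records from inside the cluster; the fully qualified name assumes the default namespace, and running nslookup server from a debug pod shows the same thing:

package main

import (
	"fmt"
	"net"
)

func main() {
	// A headless service returns the pod IPs directly, one A record per ready pod.
	addrs, err := net.LookupHost("server.default.svc.cluster.local")
	if err != nil {
		panic(err)
	}
	for _, addr := range addrs {
		fmt.Println(addr)
	}
}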

Client-side load-balancing

Now that our service is headless, we can set the client’s load-balancing policy to make requests in a round-robin fashion:

func main() {
	target := os.Getenv("TARGET")
	if target == "" {
		panic("no target")
	}
	conn, err := grpc.NewClient(
		target,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`),
	)

	// ... rest of the code

If we run this version of the client against a headless service:

kubectl logs job/client | sort | uniq -c
 871 7b4fcf65-2c93-4c5c-8433-5950c91553a9
1258 afaeeea0-87c3-4473-b8ad-1c5b61407b57
 871 ca66dd65-c6df-44c4-bb3e-60c00f8776b5

Great! All 3000 messages were spread across the pods. The spread is not quite even, though: one pod received about 42% of the messages, and the other two about 29% each. The reason is that our client sent all of its messages concurrently. If we send the messages synchronously instead:

k logs job/client | sort | uniq -c
 999 7b4fcf65-2c93-4c5c-8433-5950c91553a9
1000 afaeeea0-87c3-4473-b8ad-1c5b61407b57
1001 ca66dd65-c6df-44c4-bb3e-60c00f8776b5

The client cycled through all three servers with a nearly perfectly even spread.

Discovery

We’re not done yet. Our client has no mechanism for learning about new pods during its runtime. When we create the client with NewClient and these load-balancing settings, it resolves the list of available IPs once and, by default, never fetches it again.

To illustrate this point, I throttled the client by adding a 50-millisecond sleep to each iteration, giving me about 2.5 minutes to play with the number of replicas during the client’s runtime.

	ch := make(chan string)
	for i := 0; i < n; i++ {
		go func() {
			res, _ := c.Get(context.Background(), &internal.Req{})
			ch <- res.GetId()
		}()
		time.Sleep(time.Millisecond * 50)
	}
	// ... rest of the code

I removed one replica during runtime and gradually added up to 5 replicas. Let’s see the results:

k logs job/client | sort | uniq -c
1
1112 afaeeea0-87c3-4473-b8ad-1c5b61407b57
1887 b48de8b8-6a88-472f-8043-f2e72d22175f

Interestingly, we’ve lost one pod for good after the first change (removing one replica). The very first request to that pod (coincidentally) failed, and our client’s load-balancer discarded its address permanently. Notice also that the client never became aware of the extra replicas (up to 5) I added during its runtime.

To solve this, we need a DNS resolver that lets the client periodically fetch fresh IPs. The google.golang.org/grpc module conveniently ships with a built-in DNS resolver. Using its resolver package, we can register it globally and address the target through the dns scheme:

// resolver is "google.golang.org/grpc/resolver"
func init() {
	resolver.Register(resolver.Get("dns"))
}

func main() {
	target := os.Getenv("TARGET")
	if target == "" {
		panic("no target")
	}
	conn, err := grpc.NewClient(
		fmt.Sprintf("dns:///%s", target),
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`),
	)
	// ... rest of the code

On this attempt, I started with three replicas, then dropped one, added one, and finally dropped one. Let’s look at the results:

k logs job/client | sort | uniq -c
 512 74025783-3545-4d2c-b1c9-1b20b5042b31
1169 b48de8b8-6a88-472f-8043-f2e72d22175f
1169 ca66dd65-c6df-44c4-bb3e-60c00f8776b5
 150 f17c9530-1970-4b53-a86b-5b66ce0e575e

Not a single miss. The pod with 150 messages is probably the first one I dropped, since I removed it close to the start of the client’s run. The pod with 512 messages is likely the one I added afterwards. After I removed a replica again, two pods were left running until the job finished, which explains the two identical counts of 1169.

By understanding the benefits of headless services in Kubernetes and combining client-side load-balancing with a DNS resolver, we’ve built a gRPC client in Go that spreads its messages in a round-robin manner while staying aware of new servers that become available during its runtime.