Load-balancing with K8S
Published on 2025-04-10
Client-side load-balancing in Kubernetes

💡 This article presumes you already have a basic understanding of gRPC.

Why is this a tricky question?

  • ingress-nginx lacks features and well-written docs, yet it is fast and widely used
  • Connect RPC's content type differs from gRPC's, so few materials can be found

Connect RPC

https://connectrpc.com/

Play with K8S

When we talk about Services in K8s, we talk about load balancing. There are two kinds of load balancing in general:

  • server-side load balancing
  • client-side load balancing

So how do they usually get implemented?

It should be noted that here we focus on client-side load-balancing specifically.

Server-side load balancing

  • Managed: the built-in Service types such as NodePort, ClusterIP and LoadBalancer.
  • Ingress Controllers: these act as Layer 7 server-side balancers. The request hits the Ingress pod, which looks at the HTTP path/host and proxies the request to a backend pod. Here is the explanation of how ingress-nginx handles endpoint updates: https://github.com/kubernetes/ingress-nginx/issues/9620

Client-side load balancing

  • DNS-Based (Headless Service):
  1. Create a Service with clusterIP: None.
  2. When the client queries my-service.namespace.svc, the K8s DNS server doesn't return one virtual IP. Instead, it looks at the EndpointSlice and returns all the Pod IPs as multiple A records.
  3. Client responsibility: the client receives this list and must implement its own logic (round robin, least request) to pick which IP to connect to (see the sketch after the note below).

Headless Services are just a way to "leak" the EndpointSlice data into the DNS system.
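To make that client responsibility concrete, here is a minimal Go sketch, assuming a headless Service reachable at my-service.demo.svc.cluster.local (a placeholder name): resolve all A records and round-robin over them. Real code would also have to re-resolve periodically, which is exactly where the DNS TTL problems discussed later come from.

// Minimal sketch of DNS-based client-side balancing against a headless
// Service. "my-service.demo.svc.cluster.local" is a placeholder.
package main

import (
	"fmt"
	"net"
	"sync/atomic"
)

var next uint64

func pickPodIP() (string, error) {
	// A headless Service returns every Pod IP as an A record.
	ips, err := net.LookupHost("my-service.demo.svc.cluster.local")
	if err != nil {
		return "", err
	}
	if len(ips) == 0 {
		return "", fmt.Errorf("no addresses resolved")
	}
	// Naive round-robin over the resolved Pod IPs.
	i := atomic.AddUint64(&next, 1)
	return ips[int(i)%len(ips)], nil
}

func main() {
	ip, err := pickPodIP()
	if err != nil {
		panic(err)
	}
	fmt.Println("dialing pod", ip)
}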

  • API-Based (Mesh & Libraries):

This is more powerful because it bypasses the limitations of DNS (like caching/TTL).

  1. An observer subscribes to EndpointSlice API updates via a long poll or watch.
  2. Example:
  • Istio/Linkerd: the control plane watches the EndpointSlice and pushes the IPs to the sidecar proxy (Envoy for Istio, linkerd2-proxy for Linkerd). The sidecar intercepts the traffic and routes it directly to a Pod IP.
  • Smart Libraries: the code calls the K8s API to get the EndpointSlice, stores the IPs in a local cache, and chooses an IP before sending the request (a rough sketch follows).
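Here is that "smart library" approach sketched with client-go, under assumptions: in-cluster config, and the namespace demo and Service my-service are placeholder names.

// Watch the EndpointSlices of a Service and keep track of ready Pod addresses.
// "demo" and "my-service" are placeholder names.
package main

import (
	"context"
	"fmt"

	discoveryv1 "k8s.io/api/discovery/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// Only watch the slices that belong to our Service.
	w, err := cs.DiscoveryV1().EndpointSlices("demo").Watch(context.Background(), metav1.ListOptions{
		LabelSelector: discoveryv1.LabelServiceName + "=my-service",
	})
	if err != nil {
		panic(err)
	}
	for ev := range w.ResultChan() {
		slice, ok := ev.Object.(*discoveryv1.EndpointSlice)
		if !ok {
			continue
		}
		var ready []string
		for _, ep := range slice.Endpoints {
			if ep.Conditions.Ready != nil && *ep.Conditions.Ready {
				ready = append(ready, ep.Addresses...)
			}
		}
		// A real library would update its local cache / connection pool here.
		fmt.Println(ev.Type, "ready addresses:", ready)
	}
}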

This is the way

TL;DR

Motivation


gRPC uses HTTP/2: once the connection is established, requests are multiplexed over it, and each RPC's round trip is carried by a stream. HTTP/2 will not dial a new L4 TCP connection per request, so requests over the same connection always go to the same pod.

Connect RPC can use both HTTP/1.1 and HTTP/2. In Go, the default HTTP/1.1 transport creates a new connection when concurrent requests race and no idle connection exists. The different connections end up "fake load balanced" across the pods in the EndpointSlice.

But if there is no racing, or you are using HTTP/2, load balancing becomes impossible when scaling happens: a new pod will not receive any requests.
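To see why HTTP/2 pins traffic to a single pod, here is a sketch of a typical h2c (HTTP/2 over cleartext) client in Go, roughly how a Connect or gRPC client ends up talking to a ClusterIP: the transport dials one TCP connection and multiplexes every request over it as a stream, so all requests land on whichever pod that connection happened to reach. The /healthz URL is a placeholder.

// h2c client sketch: one TCP dial, then all requests are multiplexed on
// that single connection as HTTP/2 streams.
package main

import (
	"context"
	"crypto/tls"
	"net"
	"net/http"

	"golang.org/x/net/http2"
)

func newH2CClient() *http.Client {
	return &http.Client{
		Transport: &http2.Transport{
			AllowHTTP: true, // speak HTTP/2 without TLS
			DialTLSContext: func(ctx context.Context, network, addr string, _ *tls.Config) (net.Conn, error) {
				// Plain TCP dial; it happens once and is reused for every request.
				return (&net.Dialer{}).DialContext(ctx, network, addr)
			},
		},
	}
}

func main() {
	client := newH2CClient()
	resp, err := client.Get("http://crpc-demo-api-headless:4000/healthz") // placeholder URL
	if err != nil {
		panic(err)
	}
	resp.Body.Close()
}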

Goal

Fast load balancing even while pods are scaling, unaffected by DNS refresh delay and ungraceful exits.

Test Setup

  • Concurrency-based test: 500 concurrent workers, 300 requests per worker; the request stream takes no time, the response stream takes 10 ms (a rough harness sketch follows this list).
  • Use ingress-nginx as the ingress.
  • 5 replicas at the beginning.
  • In the scaling experiment, scale from 5 pods to 7, then back from 7 to 5.
  • Use HTTP/2 connections by default.
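For reference, the load test is roughly shaped like the sketch below. This is an assumption-heavy illustration rather than the actual harness: the URL is the demo endpoint that appears in the logs later, and the sketch assumes the response body reports which pod served the request.

// Load-test sketch: 500 workers, 300 requests each, tally responses per pod.
// The URL and the "pod name in the response body" convention are assumptions.
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"sync"
)

func main() {
	const workers, perWorker = 500, 300
	var (
		mu     sync.Mutex
		counts = map[string]int{} // pod name -> responses served
		wg     sync.WaitGroup
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				resp, err := http.Post(
					"http://crpc-demo-api-headless:4000/demo.v1.DemoService/Ping",
					"application/json", bytes.NewReader([]byte("{}")))
				if err != nil {
					continue // count only successful responses
				}
				body, _ := io.ReadAll(resp.Body)
				resp.Body.Close()
				mu.Lock()
				counts[string(body)]++ // assumes the body carries the pod name
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	fmt.Println(counts)
}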

LB over Ingress

Load balancing is a piece of cake for Ingress. Or, to put it bluntly: it's so f**king easy for users to achieve load balancing with an ingress, because ingress decouples the L7 problem from the L4 one (and we do not consider L7 multi-host routing here).

So, in the client-server scenario, the client maintains its connection to nginx, and ingress-nginx maintains the connections to the upstream pods.

At the same time, ingress-nginx watches the EndpointSlice of the service and dynamically updates its connections to the upstream pods: https://github.com/kubernetes/ingress-nginx/issues/9620

Client streaming: Fine
Server streaming: Fine
Bidirectional streaming: Fine

Issues in Ingress-nginx

  1. Ungraceful exit (GOAWAY in HTTP/2, connection reset by peer in HTTP/1.1)

💡 Why ungraceful?
https://trac.nginx.org/nginx/ticket/2224

  • keepalive_requests:

Sets the maximum number of requests that can be served through one keep-alive connection. After the maximum number of requests are made, the connection is closed.

Can we disable this setting?

💡 NO

https://nginx.org/en/docs/http/ngx_http_core_module.html#keepalive_requests

  2. Lack of balancing strategies: ingress-nginx only offers round-robin and EWMA by default.

LB over Service

Solution: https://github.com/bufbuild/httplb

This project is currently in alpha. The API should be considered unstable and likely to change.

httplb is a package for L7 load balancing. It manages connections keyed by host:port, re-resolves DNS periodically, and updates its connection pool according to the latest resolution result. This settles the issue described in https://medium.com/jamf-engineering/how-three-lines-of-configuration-solved-our-grpc-scaling-issues-in-kubernetes-ca1ff13f7f06.
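A minimal usage sketch, assuming the headless service URL from the experiments below and a /healthz path on the backend (both placeholders); only NewClient, Do and Close are used here:

// Minimal httplb sketch: the client resolves the hostname, keeps connections
// to the resolved backends, and spreads requests across them.
package main

import (
	"fmt"
	"io"
	"net/http"

	"github.com/bufbuild/httplb"
)

func main() {
	client := httplb.NewClient() // defaults: DNS resolution + round-robin
	defer client.Close()

	req, err := http.NewRequest(http.MethodGet, "http://crpc-demo-api-headless:4000/healthz", nil)
	if err != nil {
		panic(err)
	}
	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}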

However, in K8s, if a pod exits ungracefully, or exits gracefully but CoreDNS does not update in time, stale addresses appear in the DNS resolution result, which leads to significant efficiency problems in httplb.

Service pod Scaling

When a pod is killed (to simulate scaling), the httplb-wrapped client (round-robin over connections) triggers the following issue: whenever it encounters an unhealthy connection, the client keeps dialing until it times out.

All the httplb connections lag for 30 seconds; the default dialer timeout is exactly 30 seconds (see the log below):

unavailable: read tcp 10.3.193.73:35503->10.3.192.185:4000: read: connection reset by peer
unavailable: read tcp 10.3.193.73:35489->10.3.192.185:4000: read: connection reset by peer
unavailable: read tcp 10.3.193.73:35477->10.3.192.185:4000: read: connection reset by peer
...
unavailable: read tcp 10.3.193.73:35495->10.3.192.185:4000: read: connection reset by peer
unavailable: read tcp 10.3.193.73:35479->10.3.192.185:4000: read: connection reset by peer
unavailable: dial tcp 10.3.192.185:4000: connect: connection refused
...
unavailable: dial tcp 10.3.192.185:4000: connect: connection refused
deadline_exceeded: Post "http://crpc-demo-api-headless:4000/demo.v1.DemoService/Ping": dial tcp 10.3.192.185:4000: i/o timeout
deadline_exceeded: Post "http://crpc-demo-api-headless:4000/demo.v1.DemoService/Ping": dial tcp 10.3.192.185:4000: i/o timeout
deadline_exceeded: Post "http://crpc-demo-api-headless:4000/demo.v1.DemoService/Ping": dial tcp 10.3.192.185:4000: i/o timeout
...
deadline_exceeded: Post "http://crpc-demo-api-headless:4000/demo.v1.DemoService/Ping": dial tcp 10.3.192.185:4000: i/o timeout
deadline_exceeded: Post "http://crpc-demo-api-headless:4000/demo.v1.DemoService/Ping": dial tcp 10.3.192.185:4000: i/o timeout
deadline_exceeded: Post "http://crpc-demo-api-headless:4000/demo.v1.DemoService/Ping": dial tcp 10.3.192.185:4000: i/o timeout

Let's try to explain the above log:

  1. When the pod gets killed, the Go HTTP server process receives SIGKILL and the connection is lost → connection reset by peer
  2. The pod's network resources have not been recycled yet, and requests arrive at an ip:port nothing is listening on → connection refused
  3. The pod has been removed completely, but the iptables rules have not been updated yet, so TCP dials get no response at all → dialer context timeout
  4. The iptables rules are updated → no route to host

Normally, 20 concurrent workers issuing 150,000 requests in total take approximately 5 s.

But with one pod killed, it takes 36.7 s. Unfortunately, this cannot be overcome by simply reducing the DNS resolution interval and the dial timeout, since the cache TTL of CoreDNS is 30 s (https://github.com/kubernetes/kubernetes/issues/92559). So either you build your own optimized DNS server, or you set a smaller TTL (which can heavily hurt performance).

Kill Random Pod

Killing a random pod 1 second after the test starts, the result matches our inference:

[Task 1731653783095054917]
Trail number: 150000
Time elapsed: 36711.305 ms
Success: 149096, Fail: 904
Load Balance:
crpc-demo-api-86888895d7-vsxc2 - 16679
crpc-demo-api-86888895d7-d9sd4 - 30000
crpc-demo-api-86888895d7-f8zqw - 30000
crpc-demo-api-86888895d7-7kx2h - 12417
crpc-demo-api-86888895d7-n48ns - 30000
crpc-demo-api-86888895d7-hqrkh - 30000

Solutions

  1. Apply least-request: if a request hangs on a connection, just let it hang; the other requests will use the healthy connections. A simple yet brutal approach. It works for APIs that respond quickly; a request that outlives the dialer timeout will simply fail.
  2. Create a health checker that periodically checks whether each connection is still alive; once a connection is believed dead, just skip it. K8s encourages liveness probing by default. It may feel like we are recreating the "mesh" and "service discovery" in a naive way, but if you prefer Istio or Consul, that's fine.
// httplb health checker example
// (imports: net, time, github.com/bufbuild/httplb, github.com/bufbuild/httplb/health)

// A short dial timeout so a dead endpoint fails fast instead of hanging
// for the default 30 seconds.
dialer := net.Dialer{Timeout: 5 * time.Second}
client := httplb.NewClient(
	// Probe each endpoint's "healthz" path every 5s; a probe that does not
	// answer within 1s marks the endpoint unhealthy, and it gets skipped.
	httplb.WithHealthChecks(health.NewPollingChecker(
		health.PollingCheckerConfig{
			PollingInterval: 5 * time.Second,
			Timeout:         1 * time.Second,
		},
		health.NewSimpleProber("healthz"),
	)),
	httplb.WithDialer(dialer.DialContext),
)
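Since the httplb client exposes an http.Client-compatible Do method, it can be handed to a connect-go generated client as its HTTP client. A hedged sketch: demov1connect and demov1.PingRequest are assumed names for the generated packages, derived from the demo.v1.DemoService/Ping path in the logs, not real imports.

// Hypothetical wiring of the client above into a connect-go generated client.
// demov1connect / demov1 are assumed names for the generated packages.
svc := demov1connect.NewDemoServiceClient(
	client, // the *httplb.Client built above acts as the HTTP client
	"http://crpc-demo-api-headless:4000",
)
resp, err := svc.Ping(context.Background(), connect.NewRequest(&demov1.PingRequest{}))
if err != nil {
	log.Fatal(err)
}
log.Println(resp.Msg)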

Q&A

References

  1. https://kubernetes.github.io/ingress-nginx/how-it-works/#avoiding-reloads-on-endpoints-changes
  2. https://medium.com/@lapwingcloud/dont-load-balance-grpc-or-http2-using-kubernetes-service-ae71be026d7f
  3. https://www.reddit.com/r/kubernetes/comments/13a6p15/how_does_headless_service_route_traffic_from/