Socket Hang Ups

At work we ran into the following situation. Let's say we have two microservices, A and B.

A calls B.

A -> B
A <- B

We use connection pooling for performance (fewer TCP handshakes per request) and to reduce network congestion in general.

After implementing connection pooling, we started experiencing intermittent socket hang up issues.

We eventually found a timeout mismatch between service A's HTTP client and service B's server. Service A would keep an idle connection open for 30s, while service B would close it after 5s (the default keep-alive timeout in Node.js; we hadn't even touched this property).

So service A would send a request, and after 5s of inactivity service B would close the connection. Service A, however, was still willing to reuse that connection for up to 30s. If service A sent another request over it before noticing that service B had closed it, the request would fail with ECONNRESET, which Node.js reports with a "socket hang up" error message.

We need to set the timeouts so this never happens. It's best to make the client's keep-alive timeout slightly shorter than the server's, so that the client is always the one closing inactive connections.
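
Here is a minimal sketch of that setup using Node's built-in http module. The constant names and the 35s server value are assumptions for illustration, not our production config. It also assumes the agent's `timeout` option is what closes idle pooled sockets, which is worth verifying for your Node.js version; that option also applies to sockets with a request in flight.

```javascript
const http = require('node:http');

// Illustrative values: the client gives up on idle connections first.
const CLIENT_IDLE_TIMEOUT_MS = 30_000;        // service A (client)
const SERVER_KEEP_ALIVE_TIMEOUT_MS = 35_000;  // service B (server), assumed margin of 5s

// Service B: keep idle keep-alive connections open longer than the client does.
const server = http.createServer((req, res) => {
  res.end('ok');
});
server.keepAliveTimeout = SERVER_KEEP_ALIVE_TIMEOUT_MS;

// Service A: pooled connections with a shorter idle timeout.
const agent = new http.Agent({
  keepAlive: true,
  timeout: CLIENT_IDLE_TIMEOUT_MS,
});

// Service A would then pass the agent on each request, for example:
// http.get({ host: 'service-b.internal', path: '/', agent }, handleResponse);
// (the hostname and handleResponse are hypothetical)
```

You can fix the mismatch from either side: raise the server's timeout above the client's, as here, or lower the client's below the server's 5s default.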

You Also Need Retries

For many of our use cases, we control both the client and the server. However, if you don't control the server (you're calling Stripe's API, for example) and you use connection pooling, you need to understand that this can still happen.

You can get intermittent ECONNRESET errors, and this is true regardless of the language you're working in.

I like this comment on an issue opened in Node.js:

It's very important to understand the limitation of keepAlive and keep in mind that a keepAlive request can fail even though everything seems "ok" (since the server may at anytime decide to kill what it considers is an unused connection).
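
A retry strategy for this can be quite small. The sketch below is a hypothetical helper (`withRetry`, `RETRYABLE_CODES`, and the default values are all made up for illustration) that retries an async operation when it fails with a connection-reset style error code. It is only safe for requests that can be repeated without side effects; for anything else you would want some form of idempotency protection before retrying.

```javascript
// Error codes treated as "the pooled connection died under us" (an assumption;
// adjust for your client library).
const RETRYABLE_CODES = new Set(['ECONNRESET', 'EPIPE']);

// Runs `operation` and retries it up to `retries` extra times when it fails
// with a retryable code, backing off exponentially between attempts.
async function withRetry(operation, { retries = 2, baseDelayMs = 50 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      const retryable = RETRYABLE_CODES.has(err?.code);
      if (!retryable || attempt >= retries) throw err;
      // Wait 50ms, 100ms, 200ms, ... (with the default baseDelayMs) before trying again.
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
}
```

Usage would look something like `await withRetry(() => callServiceB(payload))`, where `callServiceB` is whatever function performs the HTTP request and rejects with the underlying socket error.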


In Summary

  1. If you control both the client's and the server's socket timeouts, set them so the client's is slightly shorter than the server's. We don't want the server to be the one closing an inactive connection first, as that causes ECONNRESET errors.
  2. When using connection pooling, the server can kill the connection at any time, so you should have a retry strategy.