Socket Hang Ups
January 20, 2021
At work we ran into the following situation. Let's say we have two microservices, A and B.
A calls B.
A -> B (request)
A <- B (response)
We use connection pooling, for performance gains (less TCP back and forth) and less network congestion in general.
After implementing connection pooling, we started experiencing intermittent socket hang up issues.
We eventually found a mismatch of timeouts between service A's HTTP client and service B. While service A would keep an idle connection open for 30s, service B would keep it open for only 5s (the default in Node.js; we weren't even touching this property).
Service A would send a request, and after 5s of inactivity, service B would close the connection. For a short window after that, if service A picks that same connection from the pool and sends a request to service B on it, it gets an ECONNRESET, which Node.js reports with a socket hang up error message.
We need to set the timeouts so this doesn't happen. It's best to make the client's keep-alive timeout slightly shorter than the server's, so the client is the one closing inactive connections.
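Here is a minimal sketch of lining up the two timeouts in Node.js, assuming the 30s client idle timeout described above. The constant names and the 5s safety margin are illustrative choices, not values from our services. `server.keepAliveTimeout` is the Node.js server property that defaults to 5s.

```javascript
const http = require('http');

// Assumption: the client drops pooled connections after 30s of inactivity.
const CLIENT_IDLE_TIMEOUT_MS = 30_000;
// Hypothetical safety margin so the server always outlives the client.
const SAFETY_MARGIN_MS = 5_000;

// Service B: keep idle connections open longer than any client will.
const server = http.createServer((req, res) => {
  res.end('ok');
});
// Node.js default is 5000 ms.
server.keepAliveTimeout = CLIENT_IDLE_TIMEOUT_MS + SAFETY_MARGIN_MS;

// Service A: pooled connections via a keep-alive agent.
const agent = new http.Agent({
  keepAlive: true,
  // Socket inactivity timeout. How idle pooled sockets are reaped varies by
  // HTTP client and Node.js version, so verify this for your own setup.
  timeout: CLIENT_IDLE_TIMEOUT_MS,
});
```

Service A would then pass `agent` in its request options, and service B would call `server.listen(...)` as usual. The point of the sketch is only the ordering: the server's timeout is derived from the client's, so it can't silently end up shorter.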
You Also Need Retries
For many of our use cases, we control the client and server. However, if you don't control the server (you're calling Stripe's API, for example), and you use connection pooling, you need to understand this can happen.
You can have intermittent ECONNRESET errors. And this is true independent of what language you're working in.
I like this comment on an issue opened in Node.js:
It's very important to understand the limitation of keepAlive and keep in mind that a keepAlive request can fail even though everything seems "ok" (since the server may at anytime decide to kill what it considers is an unused connection).
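A retry strategy for this can be quite small. The sketch below is one possible shape, not a definitive implementation: `withRetry`, its `retries` and `delayMs` options, and the linear backoff are all hypothetical names and choices made for illustration. It relies only on the fact that Node.js sets `err.code` to `ECONNRESET` on a socket hang up error.

```javascript
// Hypothetical helper: retry an async operation when the connection was reset.
async function withRetry(operation, { retries = 2, delayMs = 50 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      // Node.js reports "socket hang up" with err.code === 'ECONNRESET'.
      const retryable = Boolean(err) && err.code === 'ECONNRESET';
      // Give up on any other error, or once the retry budget is spent.
      if (!retryable || attempt >= retries) throw err;
      // Simple linear backoff before trying again.
      await new Promise((resolve) => setTimeout(resolve, delayMs * (attempt + 1)));
    }
  }
}

// Usage sketch; fetchFromServiceB is a stand-in for your own request function.
// const body = await withRetry(() => fetchFromServiceB('/health'));
```

Only wrap requests that are safe to repeat. When a connection is reset, the client can't always tell whether the server processed the request, so blindly retrying a non-idempotent call (a payment, for example) can apply it twice.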
- If you have control over the client's and server's socket timeouts, set them so the client's is slightly less than the server's. We don't want the server to be the one closing an inactive connection first, as that causes socket hang up errors on the client.
- When using connection pooling, the server can kill the connection at any time, so you should have a retry strategy.