This one feels good.
The Hive-Engine failover fix I have been testing is now merged into main.
There were a couple of follow-up commits on top of it afterward, but that is the merge point for the failover work itself.
The problem was straightforward from an operator's point of view:
That is the kind of issue that wastes time because the process can still look alive while doing a worse job than it should.
I took the minimal path instead of trying to redesign the whole RPC layer.
The core changes were:
The important part was the scheduler-level demotion follow-up.
Request-level failover alone helped, but it still let the streamer lean on a bad primary more than I liked. Once I added temporary cooldown behavior, the node started behaving much more like I wanted under forced RPC failure.
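To make the difference between the two layers concrete, here is a minimal sketch in Python. It is not the merged Hive-Engine code. The names `NodeScheduler` and `call_with_failover`, the `max_failures` threshold, and the injectable `clock` are all assumptions made for illustration, and the 30000 ms default is simply borrowed from the log line quoted later in this post. Request-level failover is the loop that tries the next node when one call fails. Scheduler-level demotion is the part that remembers repeated failures and keeps the bad node out of rotation for a while.

```python
import time
from typing import Callable, Dict, List, Optional


class NodeScheduler:
    """Hypothetical scheduler: counts consecutive failures per node and
    temporarily demotes a node once it fails too many times in a row."""

    def __init__(self, nodes: List[str], max_failures: int = 3,
                 cooldown_ms: int = 30000,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.nodes = list(nodes)          # failover list, in priority order
        self.max_failures = max_failures  # assumed threshold, not from the source
        self.cooldown_ms = cooldown_ms
        self.clock = clock                # injectable so tests can fake time
        self.failures: Dict[str, int] = {n: 0 for n in self.nodes}
        self.cooling_until: Dict[str, float] = {n: 0.0 for n in self.nodes}

    def available(self) -> List[str]:
        """Nodes not currently cooling down, in priority order."""
        now = self.clock()
        ready = [n for n in self.nodes if self.cooling_until[n] <= now]
        # If every node is cooling down, try them all rather than stall.
        return ready or list(self.nodes)

    def report_success(self, node: str) -> None:
        self.failures[node] = 0

    def report_failure(self, node: str) -> None:
        self.failures[node] += 1
        if self.failures[node] >= self.max_failures:
            # Demote the node: skip it until the cooldown expires.
            self.cooling_until[node] = self.clock() + self.cooldown_ms / 1000.0
            self.failures[node] = 0
            print(f"Cooling down node {node} for {self.cooldown_ms} ms "
                  f"after repeated failures")


def call_with_failover(scheduler: NodeScheduler,
                       request: Callable[[str], object]) -> object:
    """Request-level failover: try each available node until one succeeds."""
    last_error: Optional[Exception] = None
    for node in scheduler.available():
        try:
            result = request(node)
        except Exception as exc:  # any failure counts against the node
            scheduler.report_failure(node)
            last_error = exc
            continue
        scheduler.report_success(node)
        return result
    raise RuntimeError("all nodes failed") from last_error
```

Without the cooldown, every call would still hit the bad primary first and pay for its failure before moving on. With it, the primary is skipped entirely until the cooldown expires, then gets another chance.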
I did not just run this for five minutes and call it fixed.
I let it run for over a week and deliberately tested it against live failover conditions.
That included firewall-blocking api.hive.blog and watching the streamer logs in real time.
The satisfying part was finally seeing the behavior I wanted:
And to be clear, the screenshot I am using was not from my synthetic firewall test.

That screenshot was from real production behavior: I had pulled the update but forgot to restart the node. By the time I restarted it, the node was roughly 50,000 blocks behind and chewing through backlog. While it was catching up, I saw the streamer cool down api.c0ff33a.uk in real time and keep moving.
That makes the screenshot better, not worse.
It is one thing to see the fix behave during an intentional test.
It is another thing to catch it doing the right thing in actual replay under live load.
The line I wanted to see was:
[Streamer] Cooling down node https://api.c0ff33a.uk for 30000 ms after repeated failures
That is what a failover list should do. Not cling to the bad node. Not limp forever. Just move on.
If you configure multiple RPC endpoints, the node should actually behave like it understands why they are there.
Now it is a lot closer to that.
That is a good outcome.
As always,
Michael Garcia a.k.a. TheCrazyGM