The Hive-Engine Failover Fix Is in Main

This one feels good.

The Hive-Engine failover fix I have been testing is now merged into main.

A couple of follow-up commits have landed on top of it since, but that merge is where the failover work itself went in.

What The Fix Was For

The problem was straightforward from an operator point of view:

  • primary Hive RPC goes bad
  • node does not roll over cleanly
  • streamer starts limping or falling behind

That is the kind of issue that wastes time because the process can still look alive while doing a worse job than it should.

What Changed

I took the minimal path instead of trying to redesign the whole RPC layer.

The core changes were:

  • request-level failover across the configured RPC list (see the sketch after this list)
  • temporary scheduler-level node demotion for repeatedly failing nodes
  • shutdown reliability fixes that I already had on another branch
  • dependency cleanup so the branch was sane while I was testing it
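
To make the first bullet concrete, here is a minimal TypeScript sketch of request-level failover. Everything in it is illustrative: fetchWithFailover, the warning text, and the error handling are my shorthand for the idea, not the merged code.

    // Illustrative only: walk the configured RPC list until one node answers.
    async function fetchWithFailover<T>(
      rpcNodes: string[],
      call: (node: string) => Promise<T>,
    ): Promise<T> {
      let lastError: unknown;
      for (const node of rpcNodes) {
        try {
          return await call(node); // first healthy node wins
        } catch (err) {
          lastError = err;
          console.warn(`[Streamer] Request to ${node} failed, trying next node`);
        }
      }
      throw lastError; // every configured endpoint failed for this request
    }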

The important part was the scheduler-level demotion follow-up.

Request-level failover alone helped, but it still let the streamer lean on a bad primary more than I liked. Once I added temporary cooldown behavior, the node started behaving much more like I wanted under forced RPC failure.
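
Here is a rough sketch of that cooldown behavior. The 30-second window matches the log line quoted further down; the failure threshold, the class, and every name here are assumptions for illustration, not the actual merged implementation.

    // Illustrative cooldown tracker for repeatedly failing RPC nodes.
    const COOLDOWN_MS = 30_000; // matches the 30000 ms in the real log line
    const FAILURE_THRESHOLD = 3; // assumed; failures before a node is demoted

    class NodeCooldowns {
      private failures = new Map<string, number>();
      private cooldownUntil = new Map<string, number>();

      recordFailure(node: string): void {
        const count = (this.failures.get(node) ?? 0) + 1;
        this.failures.set(node, count);
        if (count >= FAILURE_THRESHOLD) {
          this.failures.set(node, 0);
          this.cooldownUntil.set(node, Date.now() + COOLDOWN_MS);
          console.log(
            `[Streamer] Cooling down node ${node} for ${COOLDOWN_MS} ms after repeated failures`,
          );
        }
      }

      recordSuccess(node: string): void {
        this.failures.delete(node); // one good response resets the count
      }

      isAvailable(node: string): boolean {
        // Demotion is temporary: the node comes back when the window expires.
        return (this.cooldownUntil.get(node) ?? 0) <= Date.now();
      }
    }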

The Real Test

I did not just run this for five minutes and call it fixed.

I let it run for over a week and deliberately tested it against live failover conditions.

That included firewall-blocking api.hive.blog and watching the streamer logs in real time.

The satisfying part was finally seeing the behavior I wanted (see the sketch after this list):

  • bad node starts failing
  • streamer cools it down
  • healthy node keeps carrying block fetches
  • engine stays caught up
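
Putting the two earlier sketches together, the per-fetch selection step could look something like this. The block_api.get_block call is the standard public Hive JSON-RPC method; the surrounding structure is again my assumption, not the streamer's real internals.

    // Illustrative block fetch: skip cooled-down nodes, fail over across the rest.
    const cooldowns = new NodeCooldowns();

    async function getBlock(rpcNodes: string[], blockNum: number): Promise<unknown> {
      const available = rpcNodes.filter((n) => cooldowns.isAvailable(n));
      // If everything is cooling down, fall back to the full list rather than stall.
      const candidates = available.length > 0 ? available : rpcNodes;
      return fetchWithFailover(candidates, async (node) => {
        try {
          const res = await fetch(node, {
            method: "POST",
            headers: { "Content-Type": "application/json" },
            body: JSON.stringify({
              jsonrpc: "2.0",
              id: 1,
              method: "block_api.get_block",
              params: { block_num: blockNum },
            }),
          });
          if (!res.ok) throw new Error(`HTTP ${res.status} from ${node}`);
          cooldowns.recordSuccess(node);
          return await res.json();
        } catch (err) {
          cooldowns.recordFailure(node); // may trigger the cooldown log above
          throw err;
        }
      });
    }

The design point that matters is that demotion only changes scheduling, not configuration: a cooled-down node is skipped, not removed, so it gets another chance once its window expires.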

And to be clear, the screenshot I am using was not from my synthetic firewall test.

[screenshot: streamer log output while the node caught up]

That screenshot came from real production behavior: I forgot to restart the node after pulling the update, and by the time I did, it was roughly 50,000 blocks behind and chewing through backlog. While it caught up, I watched the streamer cool down api.c0ff33a.uk in real time and keep moving.

That makes the screenshot better, not worse.

It is one thing to see the fix behave during an intentional test.

It is another thing to catch it doing the right thing in actual replay under live load.

The line I wanted to see was:

[Streamer] Cooling down node https://api.c0ff33a.uk for 30000 ms after repeated failures

That is what a failover list should do. Not cling to the bad node. Not limp forever. Just move on.

Why I Am Happy About It

If you configure multiple RPC endpoints, the node should actually behave like it understands why they are there.

Now it is a lot closer to that.

That is a good outcome.

As always,
Michael Garcia a.k.a. TheCrazyGM
