The Hive-Engine Failover Fix Is in Main

This one feels good.

The Hive-Engine failover fix I have been testing is now merged into main.

A couple of follow-up commits have landed on top of it since, but that merge is where the failover work itself went in.

What The Fix Was For

The problem was straightforward from an operator point of view:

  • primary Hive RPC goes bad
  • node does not roll over cleanly
  • streamer starts limping or falling behind

That is the kind of issue that wastes time because the process can still look alive while doing a worse job than it should.

What Changed

I took the minimal path instead of trying to redesign the whole RPC layer.

The core changes were:

  • request-level failover across the configured RPC list (see the sketch after this list)
  • temporary scheduler-level node demotion for repeatedly failing nodes
  • shutdown reliability fixes that I already had on another branch
  • dependency cleanup so the branch was sane while I was testing it
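
To make the first bullet concrete, here is a minimal TypeScript sketch of request-level failover. Everything in it is illustrative: fetchWithFailover, the warning text, and the error handling are my shorthand for the idea, not the merged code.

    // Illustrative only: walk the configured RPC list until one node answers.
    async function fetchWithFailover<T>(
      rpcNodes: string[],
      call: (node: string) => Promise<T>,
    ): Promise<T> {
      let lastError: unknown;
      for (const node of rpcNodes) {
        try {
          return await call(node); // first healthy node wins
        } catch (err) {
          lastError = err;
          console.warn(`[Streamer] Request to ${node} failed, trying next node`);
        }
      }
      throw lastError; // every configured endpoint failed for this request
    }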

The important part was the scheduler-level demotion follow-up.

Request-level failover alone helped, but it still let the streamer lean on a bad primary more than I liked. Once I added temporary cooldown behavior, the node started behaving much more like I wanted under forced RPC failure.
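
Here is a rough sketch of that cooldown behavior. The 30-second window matches the log line quoted further down; the failure threshold, the class, and every name here are assumptions for illustration, not the actual merged implementation.

    // Illustrative cooldown tracker for repeatedly failing RPC nodes.
    const COOLDOWN_MS = 30_000; // matches the 30000 ms in the real log line
    const FAILURE_THRESHOLD = 3; // assumed; failures before a node is demoted

    class NodeCooldowns {
      private failures = new Map<string, number>();
      private cooldownUntil = new Map<string, number>();

      recordFailure(node: string): void {
        const count = (this.failures.get(node) ?? 0) + 1;
        this.failures.set(node, count);
        if (count >= FAILURE_THRESHOLD) {
          this.failures.set(node, 0);
          this.cooldownUntil.set(node, Date.now() + COOLDOWN_MS);
          console.log(
            `[Streamer] Cooling down node ${node} for ${COOLDOWN_MS} ms after repeated failures`,
          );
        }
      }

      recordSuccess(node: string): void {
        this.failures.delete(node); // one good response resets the count
      }

      isAvailable(node: string): boolean {
        // Demotion is temporary: the node comes back when the window expires.
        return (this.cooldownUntil.get(node) ?? 0) <= Date.now();
      }
    }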

The Real Test

I did not just run this for five minutes and call it fixed.

I let it run for over a week and deliberately tested it against live failover conditions.

That included firewall-blocking api.hive.blog and watching the streamer logs in real time.

The satisfying part was finally seeing the behavior I wanted (see the sketch after this list):

  • bad node starts failing
  • streamer cools it down
  • healthy node keeps carrying block fetches
  • engine stays caught up
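
Putting the two earlier sketches together, the per-fetch selection step could look something like this. The block_api.get_block call is the standard public Hive JSON-RPC method; the surrounding structure is again my assumption, not the streamer's real internals.

    // Illustrative block fetch: skip cooled-down nodes, fail over across the rest.
    const cooldowns = new NodeCooldowns();

    async function getBlock(rpcNodes: string[], blockNum: number): Promise<unknown> {
      const available = rpcNodes.filter((n) => cooldowns.isAvailable(n));
      // If everything is cooling down, fall back to the full list rather than stall.
      const candidates = available.length > 0 ? available : rpcNodes;
      return fetchWithFailover(candidates, async (node) => {
        try {
          const res = await fetch(node, {
            method: "POST",
            headers: { "Content-Type": "application/json" },
            body: JSON.stringify({
              jsonrpc: "2.0",
              id: 1,
              method: "block_api.get_block",
              params: { block_num: blockNum },
            }),
          });
          if (!res.ok) throw new Error(`HTTP ${res.status} from ${node}`);
          cooldowns.recordSuccess(node);
          return await res.json();
        } catch (err) {
          cooldowns.recordFailure(node); // may trigger the cooldown log above
          throw err;
        }
      });
    }

The design point that matters is that demotion only changes scheduling, not configuration: a cooled-down node is skipped, not removed, so it gets another chance once its window expires.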

And to be clear, the screenshot I am using was not from my synthetic firewall test.

[screenshot: streamer log output while the node caught up]

That screenshot came from real production behavior: I forgot to restart the node after pulling the update, and by the time I did, it was roughly 50,000 blocks behind and chewing through backlog. While it caught up, I watched the streamer cool down api.c0ff33a.uk in real time and keep moving.

That makes the screenshot better, not worse.

It is one thing to see the fix behave during an intentional test.

It is another thing to catch it doing the right thing in actual replay under live load.

The line I wanted to see was:

[Streamer] Cooling down node https://api.c0ff33a.uk for 30000 ms after repeated failures

That is what a failover list should do. Not cling to the bad node. Not limp forever. Just move on.

Why I Am Happy About It

If you configure multiple RPC endpoints, the node should actually behave like it understands why they are there.

Now it is a lot closer to that.

That is a good outcome.

As always,
Michael Garcia a.k.a. TheCrazyGM
