Minimal Hive-Engine Failover Fix Branch, Looking Better Under Live Testing

Hey everyone,

I wanted to write this while the work is still in motion instead of waiting until everything is polished and then pretending the path from problem to fix was cleaner than it really was.

I have been digging into a Hive-Engine node issue where the streamer did not seem to fail over cleanly when the primary Hive RPC went bad.

From an operator point of view, the behavior looked roughly like this:

  • the first RPC in streamNodes starts failing or degrading
  • the node does not move on as decisively as you would expect
  • block processing slows, stalls, or starts limping
  • the node falls behind

That is a particularly annoying class of problem because it does not always look like a hard failure. Sometimes it just looks like the node is alive, but not doing its job very well.

After tracing the code, I wrote up two possible paths:

  • a minimal practical fix
  • a broader redesign

The community feedback was pretty clear: take the minimal path first, make it work, and save the larger architecture discussion for later if it is still needed.

So that is what I did.

What Changed

The branch I am testing right now is:

fix/streamer-failover-minimal

The main work is in the streamer's RPC failover behavior.

Previously, the streamer was creating separate dhive clients in a way that did not really treat streamNodes like a proper read failover chain. So even with multiple nodes configured, a block fetch could still get pinned to one bad RPC and fail to recover cleanly.

The first part of the fix kept the existing streamer architecture intact, but changed the client setup so each scheduled block read could fail over across the configured node list instead of hanging on one endpoint.
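To give a rough sense of what "fail over across the configured node list" means here, this is a minimal sketch, not the actual branch code; the function name and the `fetchBlockFrom` callback are made up for illustration, standing in for the real dhive call:

```javascript
// Sketch: try each configured RPC endpoint in order until one succeeds.
// `streamNodes` mirrors the configured node list; `fetchBlockFrom` is a
// stand-in for the real per-node block fetch.
async function fetchBlockWithFailover(streamNodes, blockNum, fetchBlockFrom) {
  let lastError;
  for (const node of streamNodes) {
    try {
      // First healthy endpoint wins; a bad endpoint just moves us on.
      return await fetchBlockFrom(node, blockNum);
    } catch (err) {
      lastError = err;
    }
  }
  // Every configured node failed; surface the last error to the caller.
  throw lastError;
}
```

The point is simply that a single bad endpoint costs one failed attempt, not the whole block read.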

After the first live test, I added one more narrow improvement:

  • temporary scheduler-level node demotion

That part matters because request-level failover alone is not quite enough if the scheduler keeps trying to hand new work back to the same bad node.

So the current branch does two things:

  1. lets a block read fail over across the configured RPC list
  2. temporarily cools down a repeatedly failing node so the scheduler stops giving it first shot on every new block
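The second point, scheduler-level demotion, can be sketched roughly like this. This is not the branch's actual code; the class name, failure threshold, and cooldown duration are all illustrative:

```javascript
// Sketch: temporary scheduler-level node demotion. A node that fails
// repeatedly is moved to the back of the line for a cooldown window,
// so the scheduler stops giving it first shot on every new block.
class NodeCooldown {
  constructor(maxFailures = 3, cooldownMs = 60000) {
    this.maxFailures = maxFailures;
    this.cooldownMs = cooldownMs;
    this.failures = new Map();   // node -> consecutive failure count
    this.cooldowns = new Map();  // node -> timestamp when it may return
  }

  recordFailure(node, now = Date.now()) {
    const count = (this.failures.get(node) || 0) + 1;
    this.failures.set(node, count);
    if (count >= this.maxFailures) {
      this.cooldowns.set(node, now + this.cooldownMs);
      this.failures.set(node, 0);
    }
  }

  recordSuccess(node) {
    this.failures.set(node, 0);
  }

  // Healthy nodes first; cooled-down nodes only as a last resort.
  orderNodes(nodes, now = Date.now()) {
    const healthy = nodes.filter((n) => (this.cooldowns.get(n) || 0) <= now);
    const demoted = nodes.filter((n) => (this.cooldowns.get(n) || 0) > now);
    return [...healthy, ...demoted];
  }
}
```

Keeping the demoted node at the back of the list, rather than dropping it entirely, means a configuration with only bad nodes still tries something instead of going idle.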

That means this is still very much the "minimal fix" path:

  • no new RPC manager abstraction
  • no broad streamer redesign
  • no major changes to lookahead or anti-fork behavior
  • just a targeted attempt to make failover behave the way operators expect

I Also Pulled In Two Other Fixes

While I was working on this branch, I decided it made more sense to keep it operationally sane instead of pretending a few obviously useful fixes were unrelated enough to leave behind.

So this branch also includes:

The shutdown fix

I had an older branch with:

  • timeout protection around plugin stop requests
  • better signal propagation during graceful shutdown
  • a stop timeout set to 6 seconds
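The shape of that timeout protection is roughly the following. The 6-second value comes from the branch description above, but `plugin.stop()` and the helper name are stand-ins, not the actual implementation:

```javascript
// Sketch: timeout-protected plugin stop. One stuck plugin should not be
// able to block the rest of the graceful shutdown sequence.
const STOP_TIMEOUT_MS = 6000;

function stopWithTimeout(plugin, timeoutMs = STOP_TIMEOUT_MS) {
  let timer;
  const timeout = new Promise((resolve) => {
    // Resolve (rather than reject) so the shutdown loop can log the
    // timeout and move on to the next plugin.
    timer = setTimeout(() => resolve({ timedOut: true }), timeoutMs);
  });
  const stop = Promise.resolve(plugin.stop()).then(() => ({ timedOut: false }));
  return Promise.race([stop, timeout]).finally(() => clearTimeout(timer));
}
```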

That had never made it into the earlier PR flow, and from an operator perspective I think it belongs near this work. When you are testing failover behavior, restart behavior and shutdown reliability are not academic side topics.

The npm audit cleanup

I also cherry-picked the dependency cleanup work so this branch would not carry unnecessary package noise while I was doing runtime testing.

That reduced the audit output to the remaining low-severity dhive chain that currently has no npm-provided fix path.

Why I Cherry-Picked Instead of Just Merging Old Branches

I want to be explicit about this, because I do not want to give the impression that I just smashed a bunch of stale branches together and hoped for the best.

I cherry-picked the shutdown and audit work on purpose.

The reason is simple:

  • the shutdown branch contained commits I still wanted
  • but that branch history also touched streamer code in a way that could have interfered with the new failover work

So instead of merging the full branch and dragging older streamer changes back in, I pulled over only the specific commits that were still relevant and safe.

Same story for the audit branch. That one was cleaner, because the actual audit-fix commit only touched:

  • package.json
  • package-lock.json

So in both cases I took the narrower path because I wanted the branch history to reflect what I was actually testing:

  • failover improvement
  • shutdown reliability improvement
  • dependency sanity cleanup

and not a bunch of unrelated branch baggage.

Current Status

This is still under live testing.


I have the branch pushed, I have been running it on a real node, and I have been testing it against the kind of ugly conditions that caused the original complaint in the first place.

That means:

  • watching service behavior under load
  • checking streamer progress in logs
  • testing stop/start behavior
  • simulating a dead primary RPC and seeing whether the node actually rolls over instead of stalling

The most recent live test was the most encouraging one so far.

I temporarily blocked api.hive.blog at the firewall level and watched the node behavior in real time.

Earlier testing showed that the node could still limp and fall behind a little even after the first failover patch.

After adding scheduler-level demotion, the behavior looked much better:

  • the blocked node stopped receiving meaningful new work
  • a healthy alternate RPC continued carrying block fetches
  • the node stayed caught up during the outage window instead of drifting further behind

That does not mean I am declaring the issue permanently and universally solved forever.

It does mean the current branch is behaving a lot closer to what operators actually want from a minimal fix.

So I am not posting this as:

"problem solved, everybody move on"

I am posting it as:

"here is the branch, here is what changed, here is why it changed this way, and here is what it is doing under live testing"

If You Want To Look At It

The branch is here:

https://github.com/TheCrazyGM/hivesmartcontracts/tree/fix/streamer-failover-minimal

And the compare view against main is here:

https://github.com/TheCrazyGM/hivesmartcontracts/compare/main...fix/streamer-failover-minimal

If you want to review the branch or comment on whether this is enough for a practical short-term fix, this is exactly the stage where that feedback is useful.

What Happens Next

If the branch keeps behaving well under live failover testing, then we probably have a practical short-term answer.

If it still shows edge cases or ugly behavior, then that just strengthens the case for coming back later and having the bigger Option B discussion about redesigning the RPC failover layer properly.

Either way, I want the conversation to stay open.

That is the whole reason I am writing this now instead of after the fact.

As always,
Michael Garcia a.k.a. TheCrazyGM
