Hey everyone,
I wanted to write this while the work is still in motion instead of waiting until everything is polished and then pretending the path from problem to fix was cleaner than it really was.
I have been digging into a Hive-Engine node issue where the streamer did not seem to fail over cleanly when the primary Hive RPC went bad.
From an operator point of view, the behavior looked roughly like this:
streamNodes starts failing or degrading.
That is a particularly annoying class of problem because it does not always look like a hard failure. Sometimes it just looks like the node is alive, but not doing its job very well.
After tracing the code, I wrote up two possible paths:
The community feedback was pretty clear: take the minimal path first, make it work, and save the larger architecture discussion for later if it is still needed.
So that is what I did.
The branch I am testing right now is:
fix/streamer-failover-minimal
The main work is in the streamer's RPC failover behavior.
Previously, the streamer was creating separate dhive clients in a way that did not really treat streamNodes like a proper read failover chain. So even with multiple nodes configured, a block fetch could still get pinned to one bad RPC in a way that did not recover cleanly.
The first part of the fix kept the existing streamer architecture intact, but changed the client setup so each scheduled block read could fail over across the configured node list instead of hanging on one endpoint.
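To make the shape of that change concrete, here is a hedged sketch of request-level failover. This is not the actual branch code: `fetchFromNode` is a stand-in for the real dhive block call, injected so the example is self-contained, and the function and parameter names are my own.

```javascript
// Sketch of request-level failover: a block read walks the configured
// node list instead of staying pinned to one endpoint. `fetchFromNode`
// is a placeholder for the real dhive call and is an assumption here.
async function fetchBlockWithFailover(streamNodes, blockNum, fetchFromNode) {
  let lastError;
  for (const node of streamNodes) {
    try {
      const block = await fetchFromNode(node, blockNum);
      if (block) return block; // first healthy node wins
    } catch (err) {
      lastError = err; // remember the failure, then try the next node
    }
  }
  throw lastError || new Error(`no node returned block ${blockNum}`);
}

// Demo: the first node "times out", the second answers.
const fakeFetch = async (node, blockNum) => {
  if (node === 'https://bad.example') throw new Error('timeout');
  return { block_num: blockNum, node };
};

fetchBlockWithFailover(['https://bad.example', 'https://good.example'], 123, fakeFetch)
  .then((block) => console.log(block.node)); // prints "https://good.example"
```

The key property is that the failure of one endpoint is contained to a single attempt inside the loop, rather than poisoning the whole read.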
After the first live test, I added one more narrow improvement:
That part matters because request-level failover alone is not quite enough if the scheduler keeps trying to hand new work back to the same bad node.
So the current branch does two things:
That means this is still very much the "minimal fix" path:
While I was working on this branch, I decided it made more sense to keep it operationally sane instead of pretending a few obviously useful fixes were unrelated enough to leave behind.
So this branch also includes:
I had an older branch with:
That had never made it into the earlier PR flow, and from an operator perspective I think it belongs near this work. When you are testing failover behavior, restart behavior and shutdown reliability are not academic side topics.
I also cherry-picked the dependency cleanup work so this branch would not carry unnecessary package noise while I was doing runtime testing.
That reduced the audit output to the remaining low-severity dhive chain that currently has no npm-provided fix path.
I want to be explicit about this, because I do not want to give the impression that I just smashed a bunch of stale branches together and hoped for the best.
I cherry-picked the shutdown and audit work on purpose.
The reason is simple:
So instead of merging the full branch and dragging older streamer changes back in, I pulled over only the specific commits that were still relevant and safe.
Same story for the audit branch. That one was cleaner, because the actual audit-fix commit only touched:
package.json
package-lock.json
So in both cases I took the narrower path because I wanted the branch history to reflect what I was actually testing:
and not a bunch of unrelated branch baggage.
This is still under live testing.

I have the branch pushed, I have been running it on a real node, and I have been testing it against the kind of ugly conditions that caused the original complaint in the first place.
That means:
The most recent live test was the most encouraging one so far.
I temporarily blocked api.hive.blog at the firewall level and watched the node behavior in real time.
Earlier testing showed that the node could still limp and fall behind a little even after the first failover patch.
After adding scheduler-level demotion, the behavior looked much better:
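If anyone wants to reproduce that kind of test themselves, the block can be as simple as a temporary firewall rule. This assumes iptables and root access on the node host, and is my own setup rather than anything from the branch; adjust for your firewall of choice. Note that iptables resolves the hostname to its current IPs at the moment the rule is added.

```shell
# Temporarily reject outbound traffic to the primary RPC, watch the
# streamer fail over in the logs, then remove the rule.
sudo iptables -A OUTPUT -p tcp -d api.hive.blog -j REJECT

# ...observe node behavior / head-block lag here...

# Restore normal connectivity by deleting the same rule.
sudo iptables -D OUTPUT -p tcp -d api.hive.blog -j REJECT
```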
That does not mean I am declaring the issue permanently and universally solved forever.
It does mean the current branch is behaving a lot closer to what operators actually want from a minimal fix.
So I am not posting this as:
"problem solved, everybody move on"
I am posting it as:
"here is the branch, here is what changed, here is why it changed this way, and here is what it is doing under live testing"
The branch is here:
https://github.com/TheCrazyGM/hivesmartcontracts/tree/fix/streamer-failover-minimal
And the compare view against main is here:
https://github.com/TheCrazyGM/hivesmartcontracts/compare/main...fix/streamer-failover-minimal
If you want to review the branch or comment on whether this is enough for a practical short-term fix, this is exactly the stage where that feedback is useful.
If the branch keeps behaving well under live failover testing, then we probably have a practical short-term answer.
If it still shows edge cases or ugly behavior, then that just strengthens the case for coming back later and having the bigger Option B discussion about redesigning the RPC failover layer properly.
Either way, I want the conversation to stay open.
That is the whole reason I am writing this now instead of after the fact.
As always,
Michael Garcia a.k.a. TheCrazyGM