Hey everyone,
Lately I have been digging into a streamer issue on the Hive-Engine node that I think a lot of operators have probably felt, even if they did not immediately know where the problem lived.

The short version is this:
if your first configured Hive RPC goes bad, the node does not always fail over the way you would expect.
Instead of cleanly moving on to the next healthy RPC, it can get stuck waiting on the bad one, start falling behind, and just sort of sit there looking alive while making less progress than it should.
That is exactly the kind of issue that is annoying for operators because it does not always look like a hard crash. Sometimes it looks more like the node just got dumb and slow.
After tracing through the streamer logic, I do not think this is just bad luck or one flaky endpoint. I think the failover behavior in the current implementation is weaker than people assume it is.
So this post is not code yet. This is the architecture discussion I would want to have before I go change it.
There are really two reasonable paths here.
At a high level, the current streamer is not treating streamNodes like a true failover chain for block reads.
What it is doing instead is closer to always asking the first configured node for the current block and waiting on it, no matter how that node is behaving.
That means a bad first node can still end up pinning the current block fetch even when other nodes are healthy.
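To make that concrete, here is a caricature of the failure mode, not the actual streamer code. The names `streamNodes` and `fetchBlockFromNode` are stand-ins I am using for illustration:

```javascript
// Caricature of the problem: every block read goes to streamNodes[0],
// so the rest of the configured list is effectively decoration for the
// block-fetch path. A hung first node stalls the await forever while
// the process still looks alive.
async function streamBlocks(streamNodes, fetchBlockFromNode, fromBlock, toBlock) {
  const blocks = [];
  for (let n = fromBlock; n <= toBlock; n += 1) {
    // Always the first node, no timeout, no rotation.
    blocks.push(await fetchBlockFromNode(streamNodes[0], n));
  }
  return blocks;
}
```

If that first node degrades instead of dying outright, this loop does not crash; it just slows to whatever pace the bad node allows.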
So from an operator point of view, what you see is a node process that is up and looks alive, a block height that quietly falls behind head, and no hard crash or obvious error to point at.
That is the problem I think we actually need to solve.
Option A is the smallest practical path.
The idea here is not to redesign the whole RPC layer. It is to make the existing streamer stop behaving badly under RPC failure.
That would mean things like:
- putting a hard timeout on each block and global-props request instead of waiting indefinitely,
- treating a timeout or error as the signal to rotate to the next configured node, and
- not immediately pinning the streamer back onto a node that just failed.
This is the "fix the bug without starting a philosophy debate" option.
It still leaves failover behavior living inside streamer-specific logic instead of one shared RPC layer.
In other words, it is a good fix, but not necessarily the cleanest design.
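A minimal sketch of what Option A could look like. This is my assumption of the shape, not existing code; `streamNodes` and `fetchBlockFromNode` are hypothetical stand-ins for whatever the streamer actually uses:

```javascript
// Wrap any promise with a hard deadline so a hung RPC cannot pin the loop.
async function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('RPC timeout')), ms);
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// Treat streamNodes as a real failover chain: first healthy responder
// wins, and a dead node costs at most `timeoutMs` before we move on.
async function getBlock(streamNodes, blockNum, fetchBlockFromNode, timeoutMs = 5000) {
  let lastError;
  for (const node of streamNodes) {
    try {
      return await withTimeout(fetchBlockFromNode(node, blockNum), timeoutMs);
    } catch (err) {
      lastError = err; // remember why this node failed, try the next one
    }
  }
  throw lastError; // every configured node failed
}
```

The point is not these exact helpers; it is that the block fetch gets a bounded worst case and a reason to look past the first node.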
Option B is the "do it properly" path.
Instead of having the streamer partially own node scheduling and failover behavior, I would move Hive RPC access into a dedicated abstraction and make the streamer consume that.
Something like a dedicated RPC client that owns the node list, health state, and failover policy, with the streamer as just one consumer of it.
Under that model, the streamer stops caring about node order directly and just asks for blocks or global props through a shared interface.
That is the version that makes more sense to me architecturally.
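Sketched out, Option B might look like the following. The class name `HiveRpcPool`, the `transport` callback, and the cooldown policy are all my illustrative assumptions; the two method names at the bottom are real Hive condenser API calls:

```javascript
// Hypothetical dedicated RPC layer: owns node selection and health,
// so the streamer never sees node URLs or ordering.
class HiveRpcPool {
  constructor(nodes, transport, { cooldownMs = 60000 } = {}) {
    this.nodes = nodes.map((url) => ({ url, failedAt: 0 }));
    this.transport = transport;   // (url, method, params) => Promise<result>
    this.cooldownMs = cooldownMs; // how long a failed node sits out
  }

  healthyNodes() {
    const now = Date.now();
    const ok = this.nodes.filter((n) => now - n.failedAt >= this.cooldownMs);
    // If everything is marked unhealthy, fall back to trying them all.
    return ok.length > 0 ? ok : this.nodes;
  }

  async call(method, params) {
    let lastError;
    for (const node of this.healthyNodes()) {
      try {
        const result = await this.transport(node.url, method, params);
        node.failedAt = 0; // success clears any failure mark
        return result;
      } catch (err) {
        node.failedAt = Date.now(); // bench this node for a cooldown
        lastError = err;
      }
    }
    throw lastError;
  }

  // The streamer-facing surface: just "give me a block" / "give me props".
  getBlock(blockNum) {
    return this.call('condenser_api.get_block', [blockNum]);
  }

  getDynamicGlobalProperties() {
    return this.call('condenser_api.get_dynamic_global_properties', []);
  }
}
```

The design choice that matters is the narrow surface: once failover and health live behind `getBlock`, every consumer gets the same behavior for free instead of reimplementing it.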
This is a larger change.
And larger changes in infrastructure code always cost more than the first diff makes them look.
This path would need:
- a new RPC abstraction and a refactor to route the streamer through it,
- real testing against both healthy and degraded RPC nodes, and
- review from the people who actually run this code in production.
So while I think it is the better long-term design, I also think it is the heavier community conversation.
If the goal is:
"stop nodes from hanging behind a bad RPC as quickly and safely as possible"
then I would recommend Option A first.
If the goal is:
"clean up the architecture so this whole class of problem is handled properly going forward"
then I would recommend Option B.
My honest answer is that I think there is a decent chance both are valid in sequence: Option A now to stop the immediate pain, Option B later as the deliberate cleanup.
That tends to be the way mature infrastructure evolves anyway. First stop the bleeding, then decide whether the design itself needs surgery.
Because this is exactly the kind of change where the implementation is not the whole decision.
A targeted failover patch is one kind of change.
A streamer or RPC-layer redesign is a different kind of change.
They both solve the same visible problem, but they are not the same commitment.
So before I go from "I found the issue" to "here is the PR," I would rather be explicit about what the two paths look like and let the team and community weigh in on what level of change they actually want.
If I were pitching this in one paragraph, it would be this:
the current Hive-Engine streamer does not fail over cleanly when a primary Hive RPC degrades or dies, and I see two reasonable fixes: a minimal targeted repair to make failover actually work in the current design, or a more complete redesign that moves RPC health and failover into a dedicated layer.
I think Option A is the easier immediate sell.
I think Option B is the better long-term architecture.
And I think it is worth deciding that deliberately instead of pretending they are the same size change.
As always,
Michael Garcia a.k.a. TheCrazyGM
As I've experienced dozens of times with failover scripts failing, stopping the bleeding is the way to go... When it works, we can move on to tweaking and finding a more sophisticated solution...
Thanks for troubleshooting and finding the core issue... ;)
You know what I think! But I really like the full breakdown anyway; it's a conversation starter, and sometimes those are exactly what we need!