I have been digging into a class of problem that is deeply frustrating in my experience: data loss and corruption after power outages or hard resets.
If you run a Hive-Engine node, you have probably seen some version of this:
The root cause turned out to be several issues in the block processing pipeline that allowed partial writes, concurrent processing, and silent error swallowing.
From an operator point of view, the symptoms looked roughly like this:
That is a particularly annoying class of problem because it does not always look like a hard failure. Sometimes the node looks alive, but its data is quietly wrong.
And after a power outage or hard reset, you get the fun of partial writes sitting in the database.

I created a new branch and started tracing the pipeline.
The fixes I ended up implementing:
The IPC reply was being sent before the block was written. This allowed the streamer to send the next block before the current one finished, causing concurrent processing and race conditions.
Now the IPC handler awaits produceNewBlockSync before replying. Blocks are processed sequentially, which is what the system actually requires.
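A minimal sketch of that ordering change, assuming a heavily simplified handler. The names handleBlockMessage, handleBlockMessageEarlyReply, and reply are made up for illustration, and the produceNewBlockSync here is only a stand-in that simulates a slow write, not the real function:

```javascript
// Hypothetical stand-in for the real block producer: it resolves only after
// the (simulated) database write has finished.
const committed = [];
async function produceNewBlockSync(block) {
  await new Promise((resolve) => setTimeout(resolve, 10)); // simulated write latency
  committed.push(block.blockNumber);
}

// Before: the reply fires first, so the sender may push the next block
// while this one is still being written.
function handleBlockMessageEarlyReply(block, reply) {
  reply({ ok: true, blockNumber: block.blockNumber });
  return produceNewBlockSync(block); // processing continues after the reply
}

// After: the reply is sent only once the block has been fully processed.
async function handleBlockMessage(block, reply) {
  await produceNewBlockSync(block);
  reply({ ok: true, blockNumber: block.blockNumber });
}
```

In the second version the streamer cannot learn that a block is "done" until the write behind it has actually finished.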
Added a blockProcessingLock as a safety net. Even if something bypasses the IPC await, only one block can be processed at a time.
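One way such a lock can be built is a promise chain, where each block waits for the previous one to settle. This is a sketch under that assumption; the actual blockProcessingLock in the branch may be implemented differently, and withBlockLock is a hypothetical helper name:

```javascript
// Hypothetical promise-chain lock. Each task starts only after the
// previous one has settled, so at most one block is processed at a time.
let blockProcessingLock = Promise.resolve();

function withBlockLock(task) {
  // Queue the task behind whatever is currently running.
  const run = blockProcessingLock.then(() => task());
  // Keep the chain usable even if this task rejects, so later blocks still run.
  blockProcessingLock = run.catch(() => {});
  // The caller still receives this task's own result or error.
  return run;
}
```

Because the caller gets the original promise back, a failed block still surfaces its error while the queue behind it keeps moving.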
Several methods in Database.js were catching errors and returning null instead of throwing. This meant a failure could kill the pipeline without any visible error.
Now errors are visible and the server can react properly (hopefully).
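To illustrate the difference, here is a sketch of the two patterns side by side. The function names and the query shape are hypothetical and are not copied from Database.js; `collection` is assumed to be anything with an async findOne method:

```javascript
// Swallowing pattern: the caller cannot tell "not found" apart from
// "the database call failed".
async function findContractSwallowing(collection, name) {
  try {
    return await collection.findOne({ _id: name });
  } catch (err) {
    return null; // failure is hidden from the caller
  }
}

// Throwing pattern: add context and rethrow so the pipeline can stop or retry.
async function findContract(collection, name) {
  try {
    return await collection.findOne({ _id: name });
  } catch (err) {
    throw new Error(`findContract(${name}) failed: ${err.message}`, { cause: err });
  }
}
```

With the second version, a null result means the document really is missing, and a database failure shows up as an exception.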
The streamer was tracking lastBlockSentToBlockchain for fork recovery, but that gets updated before the block is committed. Now it tracks lastCommittedBlock and rewinds to that instead.
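Roughly, the idea looks like the sketch below. The two field names come from the description above, but the StreamerState class, its methods, and the "resume at last committed + 1" rule are assumptions made for illustration:

```javascript
// Hypothetical streamer state showing why the rewind target matters.
class StreamerState {
  constructor() {
    this.lastBlockSentToBlockchain = 0; // advances as soon as a block is sent
    this.lastCommittedBlock = 0;        // advances only after a confirmed commit
  }

  markSent(blockNumber) {
    this.lastBlockSentToBlockchain = blockNumber;
  }

  markCommitted(blockNumber) {
    this.lastCommittedBlock = blockNumber;
  }

  // On recovery, rewind to the last commit and resume with the block after it
  // (assumed rule), so a sent-but-uncommitted block is replayed, not skipped.
  nextBlockAfterRewind() {
    this.lastBlockSentToBlockchain = this.lastCommittedBlock;
    return this.lastCommittedBlock + 1;
  }
}
```

If block 5 was sent but only block 4 was committed, recovery resumes at block 5 instead of assuming it made it to disk.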
I looked at adding w: majority and j: true to the MongoDB write concern for true crash-proof durability. But with Hive's 3-second block window, even a few milliseconds of additional latency per block matters.
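With w: 1 and no j: true, MongoDB acknowledges a write once the primary has applied it in memory, and the journal is flushed to disk on a short interval (100 ms by default). A crash of the node process itself does not lose acknowledged writes, because they are already held by mongod. The realistic exposure is a hard power loss that lands inside that journal flush window, which should be rare.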
The concurrent processing fixes solve the actual corruption scenarios. If power loss durability becomes an issue later, write concern can be added then.
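If that day comes, the change could be as small as passing transaction options. A sketch, assuming `session` is a MongoDB ClientSession and `work` is the callback that performs the block's writes; durableTxnOptions and commitBlockDurably are hypothetical names, and none of this is in the branch today:

```javascript
// Transaction options that could be passed as the second argument to
// session.withTransaction() if stronger durability is needed later.
const durableTxnOptions = {
  writeConcern: { w: 'majority', j: true },
};

// Runs the block's writes in a transaction whose commit is acknowledged only
// after a majority of replica set members have journaled it.
function commitBlockDurably(session, work) {
  return session.withTransaction(work, durableTxnOptions);
}
```

The cost is extra commit latency on every block, which is exactly the tradeoff against the 3-second block window described above.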
There is a broader point here that I think is worth calling out.
MongoDB replica sets are mandatory for Hive-Engine. You cannot run the node without one. The code requires session.withTransaction() for block processing, and transactions require a replica set.
So every Hive-Engine operator is already paying the cost of running a replica set.
But here is the thing: the transaction was there, and it was not actually being honored.
Before these fixes, the IPC reply was sent before the transaction committed. That means the streamer would send the next block before the current one was actually written to the database. The session.withTransaction() wrapper existed, but the system proceeded as if the block was committed when it was not.
If the process crashed or a hard reset happened between the IPC reply and the transaction commit, the block was gone. The transaction was supposed to provide atomicity, but because the reply fired early, nothing was actually waiting for the commit to finish.
That is wasted potential. You have the replica set, you have the transaction, but you are not actually letting it do its job.
The fixes in this branch address that directly. The IPC handler now awaits produceNewBlockSync before replying, which means the transaction has committed before the next block is sent. The transaction actually provides the atomicity guarantee it was supposed to provide all along.
Sequence numbers can still have gaps after a rollback. The gaps are cosmetic and not actually harmful.
This is still under testing.
I have the branch pushed to my fork and I am running it live right now to test the fixes on a running witness node.
So I am not posting this as:
"problem solved, merge it now"
I am posting it as:
"here are the fixes, here is what changed, here is why it changed this way, and here is what it is doing under testing"
The branch is here:
https://github.com/TheCrazyGM/hivesmartcontracts/tree/feature/fix-data-integrity-issues
If you want to review the changes or have thoughts on the approach, this is exactly the stage where that feedback is useful.
If the branch keeps behaving well under testing, I will submit a PR upstream.
If there are edge cases or ugly behavior, that just means more investigation before it is ready.
Either way, the goal is to make Hive-Engine nodes more reliable. Not just for my own nodes, but for anyone running a node that needs to stay in sync.
As always,
Michael Garcia a.k.a. TheCrazyGM
Never saw these....
Under what conditions?
I lose power frequently, more than I like, but also when I do kernel updates and reboot, almost any time I restart the service I have issues.
Would be nice to narrow these. As I restart my node at least once a month and I have never seen anything like these.
Probably a lack-of-resources thing? I am testing with a 15-year-old machine and I have never had problems either.
I have a big machine too, my main server... so surprised about these...
Share more details internally... this does not sound at all like something I was expecting...