Fixing Data Integrity Issues in Hive-Engine Nodes

I have been digging into a class of problem that has been deeply frustrating in my experience: data loss and corruption after power outages or hard resets.

If you run a Hive-Engine node, you have probably seen some version of this:

  • the node restarts and throws "block not found" errors
  • you see "duplicate transaction" errors on startup
  • the node needs to be restarted multiple times before it "goes on"

The root cause turned out to be several issues in the block processing pipeline that allowed partial writes, concurrent processing, and silent error swallowing.

The Problem

From an operator point of view, the symptoms looked roughly like this:

  • a block would be sent to the blockchain plugin via IPC
  • the IPC reply would be sent back immediately, before the block was actually written to the database
  • the streamer would send the next block before the current one finished
  • if anything went wrong mid-write, errors were swallowed silently

That is a particularly annoying class of problem because it does not always look like a hard failure. Sometimes the node looks alive, but its data is quietly wrong.

And after a power outage or hard reset, you get the fun of partial writes sitting in the database.

What I Fixed

I started a new branch and started tracing the pipeline.

The fixes I ended up implementing:

1. Await IPC Callbacks (Critical)

The IPC reply was being sent before the block was written. This allowed the streamer to send the next block before the current one finished, causing concurrent processing and race conditions.

Now the IPC handler awaits produceNewBlockSync before replying. Blocks are processed sequentially, which is what the system actually requires.
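Here is a minimal sketch of that ordering change, not the actual handler from the branch. Only produceNewBlockSync is a name from the code; handleBlockMessage, the message shape, and the reply callback are hypothetical stand-ins for the real IPC plumbing.

```javascript
// Hypothetical IPC handler. Only produceNewBlockSync is a real name from the post;
// `reply` stands in for whatever sends the IPC response back to the streamer.
async function handleBlockMessage(message, produceNewBlockSync, reply) {
  const { blockNumber } = message.block;
  try {
    // Wait until the block has actually been written...
    await produceNewBlockSync(message.block);
    // ...and only then tell the streamer it may send the next one.
    reply({ blockNumber, ok: true });
  } catch (err) {
    // Report the failure instead of acknowledging a block that was not written.
    reply({ blockNumber, ok: false, error: err.message });
  }
}
```

The old behavior corresponds to calling reply() before the await, which is what let the next block arrive while the current one was still being written.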

2. Block Processing Lock

Added a blockProcessingLock as a safety net. Even if something bypasses the IPC await, only one block can be processed at a time.
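One common way to build this kind of lock in Node is a promise chain, sketched below. This is an illustration of the idea under my own assumptions, not the implementation in the branch; the class name and the runExclusive method are hypothetical.

```javascript
// Hypothetical promise-chain lock. The post names blockProcessingLock but
// does not show how it is implemented; this is one possible shape.
class BlockProcessingLock {
  constructor() {
    // Settles when the most recently queued task has finished.
    this.tail = Promise.resolve();
  }

  // Runs `task` only after every previously queued task has settled.
  runExclusive(task) {
    const result = this.tail.then(() => task());
    // Swallow the rejection on the chain only, so one failed block
    // does not block the queue. The caller still sees it via `result`.
    this.tail = result.catch(() => {});
    return result;
  }
}
```

With something like this, every call that processes a block goes through runExclusive, so two blocks cannot overlap even if they arrive at the same time.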

3. Error Handling

Several methods in Database.js were catching errors and returning null instead of throwing. This meant failures killed the pipeline silently.

Now errors are visible and the server can react properly (hopefully).
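To illustrate the difference, here is a hypothetical before/after pair. It is not copied from Database.js; the function names, the query shape, and the collection argument (anything with an async findOne method) are assumptions for the example.

```javascript
// Hypothetical before/after, not code from Database.js.
// `collection` is anything exposing an async findOne(query) method.

// Before: a failed query is indistinguishable from "no such block".
async function getBlockSwallowing(collection, blockNumber) {
  try {
    return await collection.findOne({ blockNumber });
  } catch (err) {
    return null;
  }
}

// After: null still means "not found", but failures propagate to the caller.
async function getBlockStrict(collection, blockNumber) {
  try {
    return await collection.findOne({ blockNumber });
  } catch (err) {
    throw new Error(`getBlock(${blockNumber}) failed: ${err.message}`, { cause: err });
  }
}
```

The strict version lets the caller tell "the block does not exist" apart from "the database call blew up", which is the distinction the pipeline was missing.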

4. Fork Recovery

The streamer was tracking lastBlockSentToBlockchain for fork recovery, but that gets updated before the block is committed. Now it tracks lastCommittedBlock and rewinds to that instead.
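A rough sketch of the two counters and why the rewind target matters. lastBlockSentToBlockchain and lastCommittedBlock are the names from the streamer; the StreamPosition class and its methods are hypothetical and only illustrate the bookkeeping.

```javascript
// Hypothetical tracker. lastBlockSentToBlockchain and lastCommittedBlock are
// names from the post; the class and its methods are illustrative only.
class StreamPosition {
  constructor(startBlock) {
    this.lastBlockSentToBlockchain = startBlock;
    this.lastCommittedBlock = startBlock;
  }

  // Called when a block is handed to the blockchain plugin.
  markSent(blockNumber) {
    this.lastBlockSentToBlockchain = blockNumber;
  }

  // Called only after the plugin confirms the block was written.
  markCommitted(blockNumber) {
    this.lastCommittedBlock = blockNumber;
  }

  // On a fork, rewind to the last block known to be committed,
  // not the last one that was merely sent.
  rewindOnFork() {
    this.lastBlockSentToBlockchain = this.lastCommittedBlock;
    return this.lastCommittedBlock;
  }
}
```

If block 102 has been sent but only 101 has been committed, rewinding to the "sent" counter would skip past a block that never made it to the database.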

What I Did Not Fix

Write Concern

I looked at adding w: majority and j: true to the MongoDB write concern for true crash-proof durability. But with Hive's 3-second block window, even a few milliseconds of additional latency per block matters.

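With w: 1 and no j option, MongoDB acknowledges a write once the primary has applied it in memory, and the journal is flushed to disk on a short interval (100 ms by default). Acknowledged writes therefore survive a crash of the node process itself. What can still be lost is the last unflushed interval if mongod dies or power is cut at exactly the wrong moment, which should be rare.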

The concurrent processing fixes solve the actual corruption scenarios. If power loss durability becomes an issue later, write concern can be added then.
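For reference, this is roughly what opting in later could look like. It is a sketch under assumptions, not code from the branch: commitBlockDurably and writeBlock are hypothetical names, session is expected to be a MongoDB driver ClientSession, and the option keys follow the post (newer driver versions spell j as journal, so check the version you run).

```javascript
// Sketch only. `session` is expected to be a MongoDB driver ClientSession;
// `writeBlock` is a hypothetical function that performs the block writes.
const durableTransactionOptions = {
  // Wait for a majority of replica set members and for the journal flush.
  writeConcern: { w: 'majority', j: true },
};

async function commitBlockDurably(session, writeBlock) {
  // withTransaction commits when the callback resolves and aborts if it throws.
  return session.withTransaction((s) => writeBlock(s), durableTransactionOptions);
}
```

On a single-member replica set, majority is just that one member, so the extra cost is mostly the wait for the journal flush on each block.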

Replica Set Underutilization

There is a broader point here that I think is worth calling out.

MongoDB replica sets are mandatory for Hive-Engine. You cannot run the node without one. The code requires session.withTransaction() for block processing, and transactions require a replica set.

So every Hive-Engine operator is already paying the cost of running a replica set.

But here is the thing: the transaction was there, and it was not actually being honored.

Before these fixes, the IPC reply was sent before the transaction committed. That means the streamer would send the next block before the current one was actually written to the database. The session.withTransaction() wrapper existed, but the system proceeded as if the block was committed when it was not.

If the process crashed or a hard reset happened between the IPC reply and the transaction commit, the block was gone. The transaction was supposed to provide atomicity, but because the reply fired early, nothing was actually waiting for the commit to finish.

That is wasted potential. You have the replica set, you have the transaction, but you are not actually letting it do its job.

The fixes in this branch address that directly. The IPC handler now awaits produceNewBlockSync before replying, which means the transaction has committed before the next block is sent. The transaction actually provides the atomicity guarantee it was supposed to provide all along.

Sequence Gaps

Sequence numbers can still have gaps after a rollback. The gaps are cosmetic and not actually harmful.

Current Status

This is still under testing.

I have the branch pushed to my fork and I am running it live right now to test the fixes on a running witness node.

So I am not posting this as:

"problem solved, merge it now"

I am posting it as:

"here are the fixes, here is what changed, here is why it changed this way, and here is what it is doing under testing"

If You Want To Look At It

The branch is here:

https://github.com/TheCrazyGM/hivesmartcontracts/tree/feature/fix-data-integrity-issues

If you want to review the changes or have thoughts on the approach, this is exactly the stage where that feedback is useful.

What Happens Next

If the branch keeps behaving well under testing, I will submit a PR upstream.

If there are edge cases or ugly behavior, that just means more investigation before it is ready.

Either way, the goal is to make Hive-Engine nodes more reliable. Not just for my own nodes, but for anyone running a node that needs to stay in sync.

As always,
Michael Garcia a.k.a. TheCrazyGM

2 comments

Never saw these....

If you run a Hive-Engine node, you have probably seen some version of this:

the node restarts and throws "block not found" errors
you see "duplicate transaction" errors on startup
the node needs to be restarted multiple times before it "goes on"

Under what conditions?

I lose power frequently, more than I like. It also happens when I do kernel updates and reboot; almost any time I restart the service, I have issues.

It would be nice to narrow these down, as I restart my node at least once a month and have never seen anything like this.

Probably a lack of resources thing? I am testing with a 15-year-old machine and I have never had problems either.

I have a big machine too, my main server... so surprised about these...

Share more details internally... this does not sound at all like something I was expecting...
