Restoring Hive-Engine Full-Node: How I Fought MongoDB's WiredTiger Throttle and Won


I thought this would be easy.

I’m upgrading my Hive Engine node from "Lite" to "Full." I have the hardware to make this trivial: a Dual Xeon Gold server boasting 80 logical cores and 64GB of RAM, backed by fast SSD storage.

I had a 250GB .archive snapshot ready to go. With that much horsepower, I figured I’d run a standard parallel mongorestore command, grab lunch, and be done by the afternoon.

Instead, I spent the last 24 hours fighting hidden bottlenecks, single-threaded legacy code, and MongoDB’s internal panic triggers.


Here is the autopsy of a restore gone wrong, the "Triple Pincer" method that finally broke the logjam, and the custom tracking script I wrote to keep my sanity.

Attempt 1: The "Sane Defaults" (ETA: 5 Days)

I started with what I thought was an aggressive command. I told mongorestore to use 16 parallel streams and handle decompression on the fly.

# The naive approach
mongorestore \
  -j=16 \
  --numInsertionWorkersPerCollection=16 \
  --bypassDocumentValidation \
  --drop \
  --gzip \
  --archive=hsc_snapshot.archive

I fired it up and watched the logs. It was moving... but barely.

The Diagnosis: A look at htop revealed the problem immediately. One single CPU core was pegged at 100%, while the other 79 were asleep.

The built-in --gzip flag in MongoDB tools is single-threaded. I had a Ferrari engine, and I was feeding it fuel through a coffee stirrer. It was crunching about 2GB per hour. At that rate, I'd be done next Tuesday.
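A quick way to confirm the decompressor is the choke point (assuming pv is installed) is to time the decompression by itself, with the database out of the picture entirely:

# Decompress to /dev/null so only gunzip speed is measured; pv shows live throughput.
pv hsc_snapshot.archive | gzip -dc > /dev/null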

Attempt 2: The pigz Pipe (CPU Unleashed)

If the built-in tool is the bottleneck, bypass it. I aborted the restore and switched to pigz (Parallel Implementation of GZip). This uses every available core to decompress the stream and pipes the decompressed archive straight into mongorestore's stdin.

# The "Nuclear" Option
pigz -dc hsc_snapshot.archive | mongorestore \
  --archive \
  -j=16 \
  --numInsertionWorkersPerCollection=10 \
  --bypassDocumentValidation \
  --drop

CPU usage skyrocketed across all 80 cores. The intake pipe was finally wide open. Data started flying into the database.

Until it didn't.

After about 20 minutes of high speed, the restore started stuttering. It would run fast for 10 seconds, then completely stall for 30 seconds. It was faster than Attempt 1, but painfully inconsistent.

The Real Enemy: The WiredTiger "Panic Button"

Why was my powerful server stuttering? It wasn't CPU anymore. I ran mongostat 1 to look under the hood of the database engine.
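For anyone following along, these are the two views I kept open: mongostat for the engine, and (assuming the sysstat package is installed) iostat for the disk.

mongostat 1        # watch the insert, dirty and used columns
iostat -xm 1       # disk utilisation and write throughput (needs sysstat)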

The "smoking gun" was in the dirty column. It was flatlining at 20%.

Here is what that means: MongoDB’s storage engine, WiredTiger, keeps data in RAM (dirty cache) before writing it to disk. It has safety triggers:

  1. At 5% dirty, background threads start lazily flushing data to disk.
  2. At 20% dirty, it hits the panic button. It decides the disk can't keep up. To prevent crashing, it forces the application threads (my restore workers) to stop inserting data and help flush the cache to disk instead.

My 80 cores were decompressing data so fast that the SSD drive couldn't swallow it quick enough. WiredTiger was throttling my CPU to protect the disk.
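If you don't want to eyeball mongostat, the same numbers are available from serverStatus. A minimal sketch, assuming the usual WiredTiger stat names (they can shift between MongoDB versions):

# Rough dirty-cache percentage straight from the server.
mongosh --quiet --eval '
  const c = db.serverStatus().wiredTiger.cache;
  const pct = c["tracked dirty bytes in the cache"] / c["maximum bytes configured"] * 100;
  print(pct.toFixed(1) + "% of the WiredTiger cache is dirty");
'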

I tried to tune this live with db.adminCommand, raising the panic threshold to 30%, but it didn't help much. I was stuck.

db.adminCommand({
  "setParameter": 1,
  "wiredTigerEngineRuntimeConfig": "eviction_dirty_target=20,eviction_dirty_trigger=30"
})

The Final Solution: The "Triple Pincer" Attack

If I couldn't tune the engine to accept one massive stream, I decided to overwhelm it with three smaller ones.

The Hive Engine database is dominated by two massive collections: hsc.chain and hsc.transactions. When restoring linearly, you hit lock contention as dozens of threads fight over the same collection lock while simultaneously fighting eviction threads.

I aborted everything and launched three simultaneous restore processes in separate terminals.

(Screenshot: screenshot-20260102-074621.png — the three restore terminals, with mongostat in the top-right pane.)

Terminal 1 (The Chain):

pigz -dc hsc_snapshot.archive | mongorestore --archive --nsInclude="hsc.chain" --drop --numInsertionWorkersPerCollection=10 --bypassDocumentValidation

Terminal 2 (The Transactions):

pigz -dc hsc_snapshot.archive | mongorestore --archive --nsInclude="hsc.transactions" --drop --numInsertionWorkersPerCollection=10 --bypassDocumentValidation

Terminal 3 (Everything Else):

pigz -dc hsc_snapshot.archive | mongorestore --archive --nsExclude="hsc.chain" --nsExclude="hsc.transactions" --drop --numInsertionWorkersPerCollection=10 --bypassDocumentValidation

Why this works:
Yes, this reads the 250GB archive from disk three times simultaneously. But sequential SSD reads are cheap here; the write side (and WiredTiger eviction) was always the real bottleneck.

By splitting the job, I broke the collection-level locks. The process restoring chain doesn't care if the transactions process is paused for cache eviction. It smoothed out the I/O pattern.
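If you want to sanity-check that the three processes really are working their own namespaces, a rough sketch using the $currentOp aggregation stage shows the active operations grouped by collection:

mongosh --quiet --eval '
  db.getSiblingDB("admin").aggregate([
    { $currentOp: { allUsers: true } },
    { $match: { active: true, ns: /^hsc\./ } },
    { $group: { _id: "$ns", activeOps: { $sum: 1 } } }
  ]).forEach(printjson);
'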

The Proof:
Looking at my mongostat now (top-right pane in the screenshot), the insert rate is holding steady in the thousands. The dirty column is still hovering around 15%, but crucially it stays under the 20% panic threshold.

The Tooling: Tracking the Invisible

There was one final problem. Because I was piping data through pigz, mongorestore had no idea how big the file was (not that it shows progress anyway), so I had zero progress bars. Restoring Hive-Engine nodes is a slog, and there is nothing to tell you where you are...

Where there's Linux, there's a way.

Everything is a file. You can see the current read offset of an open file with lsof, get the exact size in bytes from stat, and with those two numbers you can do a little bit of math.
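The raw ingredients look something like this (assuming a single pigz process and the archive name from earlier):

PID=$(pgrep -x pigz | head -n 1)
lsof -o -p "$PID" | grep '\.archive'   # OFFSET column: 0t<bytes read so far>
stat -c%s hsc_snapshot.archive         # total size of the archive in bytes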


So, I wrote track_restore.sh. This script auto-detects the pigz process, finds the open file descriptor using lsof, reads the byte offset from the kernel, and calculates the real-time progress. It works with the normal mongorestore method as well, and would probably be helpful to other Hive-Engine node operators (even light nodes).

You can see it running, keeping me sane while the gigabytes churn.


#!/bin/bash

# Configuration
INTERVAL=5

# AUTO-DETECT: Check for pigz first (fast mode), then mongorestore (slow mode)
PID=$(pgrep -x "pigz" | head -n 1)
PROC_NAME="pigz"

if [ -z "$PID" ]; then
  PID=$(pgrep -x "mongorestore" | head -n 1)
  PROC_NAME="mongorestore"
fi

if [ -z "$PID" ]; then
  echo "Error: Neither pigz nor mongorestore process found."
  exit 1
fi

echo "--- Restore Progress Tracker (V3) ---"
echo "Monitoring Process: $PROC_NAME (PID: $PID)"

# Find file and size
ARCHIVE_PATH=$(lsof -p $PID -F n | grep ".archive$" | head -n 1 | cut -c 2-)

if [ -z "$ARCHIVE_PATH" ]; then
  echo "Could not auto-detect .archive file. Is the restore running?"
  exit 1
else
  TOTAL_SIZE=$(stat -c%s "$ARCHIVE_PATH")
  echo "Tracking File: $ARCHIVE_PATH"
fi

TOTAL_GB=$(echo "scale=2; $TOTAL_SIZE / 1024 / 1024 / 1024" | bc)
echo "Total Archive Size: $TOTAL_GB GB"
echo "----------------------------------------"

while true; do
  # 1. Get Offset
  # 2>/dev/null suppresses "lsof: WARNING" noise
  RAW_OFFSET=$(lsof -o -p $PID 2>/dev/null | grep ".archive" | awk '{print $7}')

  # 2. Safety Check: If empty, assume finished or closing
  if [ -z "$RAW_OFFSET" ]; then
    echo -e "\n\nRestore finished! (Process closed file)"
    break
  fi

  # 3. Clean the Offset (The Fix)
  # Remove '0t' (decimal prefix) and '0x' (hex prefix) to be safe
  # Bash handles 0x, but we can treat everything as standard base-10 if we convert hex
  if [[ "$RAW_OFFSET" == 0x* ]]; then
    # It's hex (mongorestore style)
    CURRENT_BYTES=$((RAW_OFFSET))
  else
    # It's likely 0t (pigz style) or raw number. Strip 0t.
    CURRENT_BYTES=$(echo "$RAW_OFFSET" | sed 's/^0t//')
  fi

  # 4. Math Safety Check
  if [ -z "$CURRENT_BYTES" ]; then continue; fi

  # 5. Calculate
  PERCENT=$(echo "scale=4; ($CURRENT_BYTES / $TOTAL_SIZE) * 100" | bc)
  CURRENT_GB=$(echo "scale=2; $CURRENT_BYTES / 1024 / 1024 / 1024" | bc)

  # 6. Bar
  BAR_WIDTH=50
  # Use 0 if PERCENT is empty to avoid crash
  INT_PERCENT=$(echo "${PERCENT:-0}" | cut -d'.' -f1)

  # Ensure INT_PERCENT is a number
  if ! [[ "$INT_PERCENT" =~ ^[0-9]+$ ]]; then INT_PERCENT=0; fi

  FILLED=$(($INT_PERCENT * $BAR_WIDTH / 100))
  EMPTY=$(($BAR_WIDTH - $FILLED))

  # Guard against FILLED/EMPTY being 0 (printf would still emit one character)
  BAR=""; [ "$FILLED" -gt 0 ] && BAR=$(printf "%0.s#" $(seq 1 $FILLED))
  SPACE=""; [ "$EMPTY" -gt 0 ] && SPACE=$(printf "%0.s-" $(seq 1 $EMPTY))

  printf "\rProgress: [%s%s] %s%% (%s GB / %s GB)" "$BAR" "$SPACE" "$PERCENT" "$CURRENT_GB" "$TOTAL_GB"

  sleep $INTERVAL
done
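Drop it anywhere (it finds the archive via lsof), make it executable, and run it in a spare terminal:

chmod +x track_restore.sh
./track_restore.sh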

The Lesson

When you throw enterprise-grade hardware at standard-grade tools, things break in weird ways.

Don't trust defaults. Monitor your bottlenecks. And if the database engine tries to throttle you, sometimes the only answer is to hit it from three directions at once.

The node should finally be synced by mid-day.

As always,
Michael Garcia a.k.a. TheCrazyGM

Comments

Amazing! Mongorestore has been the bane of my existence! On NVMEs a full restore was under 20 hours with some tweaks, but your approach should make that even faster!


My bottleneck is still the SSD. I probably should invest in an NVMe at some point. But I'm poor folk. 😅


Aren't consumer NVMe and SATA SSDs almost identical in price, with SATA SSDs being slightly cheaper?


Now I know I need to meet you in real life. Enough proof from your posts. 😉🙃


I am enjoying following you along as you are going through these processes from the beginning, so cool that before even sync'ing you have an incredibly useful new gist to add to our project builder!


👀 What FS for MongoDB, please? :D


I'm using BTRFS with CoW turned off for the mongod dir, but I wanted the ease of snapshots for backup purposes, so I never have to do this restore ever again...
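(For reference, turning CoW off on btrfs is usually done with chattr +C on the data directory before mongod writes anything into it; the path below is just an example.)

# Disable copy-on-write for files created in the MongoDB data dir (example path).
# Note: +C only takes effect for files created after the flag is set.
chattr +C /var/lib/mongodb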


Interesting... haven't done that for a while. Have you tried in-memory with ZFS delayed writes?



You're a masterful coding wizard, my friend, and I very much appreciate reading your adventures. Oh, and no, I never trust defaults. 😁🙏💚✨🤙
