MongoDB Restore performance analysis (part 1/3) - ZFS Filesystem on HDD

(The motivation for this work came from wanting to show how easy it is to get something running at home on crappy hardware with a few tricks, plus some hints on how to tune filesystems and volume managers more efficiently for better performance in a specific workload situation.)

Big disclaimer

The hardware I used here is quite old, and the performance on tests like these might fluctuate (because they are very long) due to other factors I am deliberately omitting to keep the comparisons simple. This will be a series of 3 posts using one HDD and USB devices, with the objective of showing how to start playing with filesystems and volume managers. Some of these approaches might not be as noticeable on high-end hardware, such as fast SSDs, NVMe drives, or RAID adapters in front of these. In those situations you might need to use different approaches for tuning, which I am leaving out of this series of tests given the time differences involved.



In this series, I will compare performance across several block-size tests I made to research how ZFS behaves on HDDs, and compare it (in a later post) with other filesystems and volume managers. For the restore I always use the same Hive-Engine (Light mode) snapshot (a MongoDB dump) from @primersion (which you can download from here).

Software used

  • MongoDB 5.0.9
  • Kernel 5.15.0-41-generic #44~20.04.1-Ubuntu SMP
  • zfs version
    zfs-0.8.3-1ubuntu12.14
    zfs-kmod-2.1.2-1ubuntu3

Hardware

Something that was headed for the trash and I just grabbed for this test. But it is still going to the trash by the end of the year! The OS is running from a separate disk to avoid interference with the tests.

  • 2 cores (4 hyperthreading cores) Intel 3.6GHz (6MB Cache-L3)
  • 8GB RAM (DDR3) 1600MHz
  • 2TB 7200RPM Disk via SATAII (very old stuff from 2017)

Initial steps for ZFS

If you need to edit (create or delete) partitions on the disk, you can use the fdisk tool or an equivalent before handing the disk to ZFS.
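For example, a minimal sketch (assuming the test disk is /dev/sda and its contents are disposable):

# List the current partition table of the test disk
fdisk -l /dev/sda
# Edit partitions interactively (n = new, d = delete, w = write changes)
fdisk /dev/sda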

You will need to either start Ubuntu with ZFS or install it afterwards... apt install zfs*
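On Ubuntu the package is usually zfsutils-linux, so something like this should do (a sketch, adjust to your release):

# Install the ZFS userland tools (the kernel module ships with the Ubuntu kernel)
apt install zfsutils-linux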

I am using the ZFS feature defaults here, so no compression and no deduplication.

# Pool for the 2TB HDD
zpool create dpool /dev/sda1
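To double-check the defaults mentioned above before creating any datasets, a quick sanity check (not part of the original run) could look like this:

# Confirm the pool is online and healthy
zpool status dpool
# Confirm the defaults: 128K recordsize, compression and dedup off
zfs get recordsize,compression,dedup dpool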

ZFS Test #1 (default block size, 128KB)

This is a restore using the default ZFS block size for the MongoDB replica.

# Stop the MongoDB
systemctl stop mongod
# Ensure the mount point is not mounted
umount /var/lib/mongodb
# Create the new filesystem
zfs create -o mountpoint=/var/lib/mongodb dpool/mongodb
# Change permissions of the MongoDB directory
chown mongodb:mongodb /var/lib/mongodb
# Start MongoDB
systemctl start mongod
# Initialize the replica set
mongo --eval "rs.initiate()"
time nice -20 ionice -c3 mongorestore --gzip --archive=./hsc_07-21-2022_b66333824.archive
...
2022-07-23T14:43:47.182+1200    37324269 document(s) restored successfully. 0 document(s) failed to restore.

real    120m5.672s
user    6m35.602s
sys     0m55.827s

ZFS Test #2 (1 MB block size)

This is a restore using a 1MB ZFS block size for the MongoDB replica.

systemctl stop mongod
umount /var/lib/mongodb
zfs create -o recordsize=1M -o mountpoint=/var/lib/mongodb dpool/mongodb_1m
chown mongodb:mongodb /var/lib/mongodb
systemctl start mongod
mongo --eval "rs.initiate()"
time nice -20 ionice -c3 mongorestore --gzip --archive=./hsc_07-21-2022_b66333824.archive
...
2022-07-23T04:23:50.055+1200    37324269 document(s) restored successfully. 0 document(s) failed to restore.

real    251m49.991s
user    6m36.547s
sys     0m57.353s

ZFS Test #3 (4KB block size)

This is a restore using a 4KB ZFS block size for the MongoDB replica.

systemctl stop mongod
umount /var/lib/mongodb
zfs create -o recordsize=4K -o mountpoint=/var/lib/mongodb dpool/mongodb_4k
chown mongodb:mongodb /var/lib/mongodb
systemctl start mongod
mongo --eval "rs.initiate()"
time nice -20 ionice -c3 mongorestore --gzip --archive=./hsc_07-21-2022_b66333824.archive
...
2022-07-23T18:00:11.445+1200    37324269 document(s) restored successfully. 0 document(s) failed to restore.

real    190m29.359s
user    6m33.841s
sys     0m59.114s

Tuning considerations

It's clear that in a single-HDD situation, larger block sizes are not beneficial for random IO. This is expected because the number of IOs a rotational disk can do is very limited (usually 50-200 per second at most), so random IO performance drops even further when each IO uses a large block that you are not filling efficiently (explained further ahead).
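If you want to watch that limit for yourself while a restore is running, a minimal sketch (assuming the sysstat package is installed and the data disk is sda):

# Extended per-device statistics every 5 seconds;
# on a single 7200RPM HDD the r/s + w/s columns for sda will rarely go past a few hundred
iostat -x 5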

Another detail to consider here is that if the database uses many small files, this can result in inefficient use of space on your disk, and you will also lose some performance due to the unaligned nature of the IO.

Checking file size distribution (Number of files / file size in KB):

/var/lib/mongodb# du -k * | awk '{print $1}' | sort | uniq -c | sort -n | tail
      9 60
     10 43
     10 52
     12 47
     26 35
     29 31
     30 39
     53 27
    374 5
    727 23

Total files:

/var/lib/mongodb# du -k * | wc -l
1498

So, here we can see that more than 85% of the files are well under 64KB in size, averaging somewhere around 32KB. Hence the default ZFS block size is not the best block size to save space and squeeze performance out of this specific MongoDB scenario. The best compromise here might be 32KB, because most of those IOs are below that block size and the "waste" in block efficiency is not a lot.
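To put a number on that claim, a hypothetical one-liner along the same lines as the du command above:

# Share of entries at or below 64KB (du -k reports sizes in KB)
du -k * | awk '$1 <= 64 {n++} END {printf "%.1f%%\n", 100*n/NR}'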

Remember that every time you need 1 byte over the block size you will need at least 2x the IOs, therefore it's important to use a block size that covers 99% of the file sizes but is not so big that it wastes space and IO latency. So let's give 32KB a go to see if it really makes any difference!

Final run with 32KB block size
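The 32KB dataset was prepared the same way as in the previous tests; a sketch of the steps (the dataset name mongodb_32k is my own choice):

systemctl stop mongod
umount /var/lib/mongodb
zfs create -o recordsize=32K -o mountpoint=/var/lib/mongodb dpool/mongodb_32k
chown mongodb:mongodb /var/lib/mongodb
systemctl start mongod
mongo --eval "rs.initiate()"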

time nice -20 ionice -c3 mongorestore --gzip --archive=./hsc_07-21-2022_b66333824.archive
...
2022-07-23T23:11:14.319+1200    37324269 document(s) restored successfully. 0 document(s) failed to restore.

real    99m4.594s
user    6m35.625s
sys     0m55.475s

This will likely be our best result... going to larger block sizes (which increases throughput but also increases latency, so it becomes a game of averages and of how much space efficiency you are willing to lose) will give very little performance gain, if any.

And if you keep going, you will see this holds until you eventually reach the 128KB default, which the ZFS developers have found to be good enough for probably many situations... hence why it is the default. But in this specific situation, you would lose some performance.

To confirm... here are the results with 64KB:

time nice -20 ionice -c3 mongorestore --gzip --archive=./hsc_07-21-2022_b66333824.archive
...
2022-07-24T01:15:14.174+1200    37324269 document(s) restored successfully. 0 document(s) failed to restore.

real    110m53.942s
user    6m38.025s
sys     0m53.947s

Conclusions

As you can see, ZFS is super flexible in how you can manage the back-end storage to suit your application's IO.

Note that in this case there is only one application running. If multiple applications are considered, the results will obviously change. But you will always be able to optimize the block size in a much more granular way, since you can carve multiple filesystems (datasets) out of the same pool.
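For example, nothing stops you from giving each application its own dataset on the same pool, each with its own record size (the dataset names and sizes below are hypothetical):

# One pool, several datasets, each tuned for its own workload
zfs create -o recordsize=32K -o mountpoint=/var/lib/mongodb dpool/mongodb_32k
zfs create -o recordsize=8K -o mountpoint=/var/lib/postgresql dpool/postgres_8k
zfs create -o recordsize=1M -o mountpoint=/srv/media dpool/media_1m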

There are other volume managers that can do this (LVM, for example), but in my view not as easily as with ZFS. Performance-wise, it depends, and I will try to touch on that in the 3rd post of this series.
