Investigating High CPU Usage on the "Moon" Server

Hey everyone,

I've been keeping an eye on our "moon" server lately, and the CPU usage metrics have been consistently high, suggesting it might be time to invest in a new, more powerful machine. Before making that decision, I wanted to dig into the data to see exactly what was going on.

For some time now, I've been running a custom Python script, server_metrics.py, at frequent intervals to collect data on system performance and store it in a SQLite database. This has given me a fantastic historical dataset to work with.
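For anyone who wants to try something similar, here is a minimal sketch of how one sample could be appended to a SQLite table. The table and column names (metrics, timestamp, top_cpu_name, top_cpu_percent) are the ones the query further down relies on. Everything else is an assumption for illustration, including the record_sample helper and the timestamp format, and it is not the actual contents of server_metrics.py.

```python
import sqlite3
from datetime import datetime, timezone

# Assumed schema: only the three columns the analysis query needs.
SCHEMA = """
CREATE TABLE IF NOT EXISTS metrics (
    timestamp       TEXT NOT NULL,  -- UTC, formatted as YYYY-MM-DD HH:MM:SS
    top_cpu_name    TEXT NOT NULL,
    top_cpu_percent REAL NOT NULL
)
"""


def record_sample(conn, name, percent, when=None):
    """Append one sample (hypothetical helper, not from server_metrics.py)."""
    when = when or datetime.now(timezone.utc)
    # Use the same text format that SQLite's datetime() returns, so that
    # string comparisons against datetime('now', '-14 days') sort correctly.
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    conn.execute(SCHEMA)
    conn.execute(
        "INSERT INTO metrics (timestamp, top_cpu_name, top_cpu_percent) "
        "VALUES (?, ?, ?)",
        (stamp, name, percent),
    )
    conn.commit()
    return stamp


conn = sqlite3.connect(":memory:")  # a real collector would pass a file path
record_sample(conn, "python3", 65.1)
```

Storing the timestamp as UTC text keeps the two-week filter a plain string comparison. A collector that stored Unix epoch integers instead would need a different WHERE clause.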

Visualizing the Problem

The first step was to visualize the trend. A picture is worth a thousand words, and plotting the data from the last two weeks confirmed my suspicions immediately.

server_metrics_plot.png

As you can see, the CPU usage is frequently spiking and sustaining high levels, which isn't ideal for a server running multiple applications. The question now is: what's causing it?

Digging into the Data

To find the culprits, I wrote a SQL query to go through the collected metrics. The goal was to find which process names appeared most often as the top CPU consumer, what their average CPU usage was in those moments, and their maximum recorded spike. The results were immediate and unambiguous:

-- Count the samples in which each process was the top CPU consumer
SELECT
    top_cpu_name,
    COUNT(*)                 AS samples_as_top,
    AVG(top_cpu_percent)     AS avg_top_pct,
    MAX(top_cpu_percent)     AS max_top_pct
FROM metrics
-- restrict to last two weeks
WHERE timestamp >= datetime('now', '-14 days')
GROUP BY top_cpu_name
ORDER BY samples_as_top DESC
LIMIT 10;

top_cpu_name        samples_as_top  avg_top_pct  max_top_pct
python3                      14852         65.1        705.4
systemd                       2905          0.1        246.2
mariadbd                       661         2.86        150.0
php-fpm7.4                      96         6.59        633.1
fail2ban-server                 93          0.0          0.0
caddy                           91          0.0          0.0
kworker/0:0-events              33          0.0          0.0
kworker/0:2-events              24          0.0          0.0
kworker/0:1-events              23          0.0          0.0
multipathd                      22          0.0          0.0

As the data clearly shows, python3 is the runaway top consumer of CPU on this server. It was the top process in over 14,800 samples, averaging about 65% CPU in those samples. Most strikingly, it peaked at over 700%, which means that at certain moments Python was consuming the equivalent of 7 full CPU cores.
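To make the aggregation and the "percent to cores" reading concrete, here is a small sketch that runs the same kind of query from Python against an in-memory database. The sample rows are made up for the example and are not the server's real data. The cores_equivalent helper is hypothetical and assumes the collector reports per-process CPU where 100% means one fully busy core.

```python
import sqlite3

QUERY = """
SELECT top_cpu_name,
       COUNT(*)                       AS samples_as_top,
       ROUND(AVG(top_cpu_percent), 2) AS avg_top_pct,
       MAX(top_cpu_percent)           AS max_top_pct
FROM metrics
GROUP BY top_cpu_name
ORDER BY samples_as_top DESC
"""


def cores_equivalent(percent):
    """Convert per-process CPU percent to cores, assuming 100% == 1 core."""
    return percent / 100.0


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (top_cpu_name TEXT, top_cpu_percent REAL)")
# Toy samples for illustration only.
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?)",
    [("python3", 40.0), ("python3", 90.0), ("python3", 705.4), ("systemd", 0.1)],
)

rows = conn.execute(QUERY).fetchall()
for name, samples, avg_pct, max_pct in rows:
    print(f"{name:10} {samples:3d} {avg_pct:8.2f} {max_pct:7.1f} "
          f"(peak ~{cores_equivalent(max_pct):.1f} cores)")
```

Note that the average only covers samples where the process was already the top consumer, so it says how hard a process runs when it leads, not how much CPU it uses overall.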

This analysis narrows down the problem significantly. It's not a system-level issue with something like Caddy or the database (mariadbd); the load is coming directly from the Python applications I'm running.

The next logical step in this investigation is to dig deeper and differentiate between the various python3 processes to see which specific scripts are the heaviest hitters. But for now, we have a very clear answer to "What's using the CPU?". The answer is: Python.

Next Steps

I have added more verbose data gathering to server_metrics.py to track the command line arguments of each process, so we know which one is which. I'll continue to monitor the data and report back as I find new insights.
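As a rough idea of what telling the python3 processes apart could look like, here is a sketch that turns a raw command line into a short label suitable for grouping. It assumes the arguments arrive as NUL-separated bytes, which is how Linux exposes them in /proc/<pid>/cmdline. The script_label function and the example paths are hypothetical and not taken from server_metrics.py.

```python
import os


def script_label(raw_cmdline: bytes) -> str:
    """Reduce NUL-separated argv bytes to a short label for grouping."""
    args = [a.decode("utf-8", "replace") for a in raw_cmdline.split(b"\0") if a]
    if not args:
        return "unknown"  # kernel threads and zombies have an empty cmdline

    exe = os.path.basename(args[0])
    if not exe.startswith("python"):
        return exe

    rest = args[1:]
    for i, arg in enumerate(rest):
        if arg == "-m" and i + 1 < len(rest):
            return f"{exe} -m {rest[i + 1]}"
        if not arg.startswith("-"):
            # First non-option argument is taken to be the script path.
            # Simplification: options that take a value (such as -W) would
            # be misread here.
            return f"{exe} {os.path.basename(arg)}"
    return exe


print(script_label(b"python3\0/opt/app/bot.py\0--verbose\0"))
print(script_label(b"/usr/bin/python3\0-m\0http.server\0"))
```

Storing a label like this next to top_cpu_name would let the same GROUP BY query split the single python3 row into one row per script.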

As always,
Michael Garcia a.k.a. TheCrazyGM

I've been through that process of narrowing down possible causes to issues countless times, and I know how challenging it can be sometimes, so congratulations on finding part of the cause. I'm sure that you'll noodle out the last specifics in no time! 😁🙏💚✨🤙

Keep us in the loop, this is a fascinating case since my expectation would be we should be using less CPU now with less active player base between seasons.
