Hey everyone,
I've been keeping an eye on our "moon" server lately, and the CPU usage metrics have been consistently high, suggesting it might be time to invest in a new, more powerful machine. Before making that decision, I wanted to dig into the data to see exactly what was going on.
For some time now, I've been running a custom Python script, `server_metrics.py`, at frequent intervals to collect data on system performance and store it in a SQLite database. This has given me a fantastic historical dataset to work with.
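For readers who want a feel for the storage side of a collector like this, here is a minimal sketch. The table and column names (`metrics`, `timestamp`, `top_cpu_name`, `top_cpu_percent`) match the query used later in this post; everything else (the function names `open_db` and `record_sample`, the column types, the timestamp format) is an assumption for illustration, and the real `server_metrics.py` may well differ.

```python
import sqlite3
from datetime import datetime, timezone

# Assumed schema: only the three columns the analysis query relies on.
SCHEMA = """
CREATE TABLE IF NOT EXISTS metrics (
    timestamp       TEXT NOT NULL,  -- UTC, 'YYYY-MM-DD HH:MM:SS'
    top_cpu_name    TEXT NOT NULL,  -- name of the busiest process in this sample
    top_cpu_percent REAL NOT NULL   -- its CPU usage in percent
)
"""


def open_db(path=":memory:"):
    """Open (or create) the metrics database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_sample(conn, top_cpu_name, top_cpu_percent, when=None):
    """Store one sample and return the timestamp string that was written."""
    when = when or datetime.now(timezone.utc)
    # Same text format that SQLite's datetime('now') produces, so plain
    # string comparison against it works in WHERE clauses.
    stamp = when.strftime("%Y-%m-%d %H:%M:%S")
    conn.execute(
        "INSERT INTO metrics (timestamp, top_cpu_name, top_cpu_percent) "
        "VALUES (?, ?, ?)",
        (stamp, top_cpu_name, float(top_cpu_percent)),
    )
    conn.commit()
    return stamp
```

Storing timestamps as UTC text keeps them directly comparable with `datetime('now', '-14 days')`, which SQLite also evaluates in UTC.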
The first step was to visualize the trend. A picture is worth a thousand words, and plotting the data from the last two weeks confirmed my suspicions immediately.
As you can see, the CPU usage is frequently spiking and sustaining high levels, which isn't ideal for a server running multiple applications. The question now is: what's causing it?
To find the culprits, I wrote a SQL query to go through the collected metrics. The goal was to find which process names appeared most often as the top CPU consumer, what their average CPU usage was in those moments, and their maximum recorded spike. The results were immediate and unambiguous:
```sql
-- Count how many samples each process was the top-CPU process
SELECT
    top_cpu_name,
    COUNT(*) AS samples_as_top,
    AVG(top_cpu_percent) AS avg_top_pct,
    MAX(top_cpu_percent) AS max_top_pct
FROM metrics
-- restrict to last two weeks
WHERE timestamp >= datetime('now', '-14 days')
GROUP BY top_cpu_name
ORDER BY samples_as_top DESC
LIMIT 10;
```
| top_cpu_name | samples_as_top | avg_top_pct | max_top_pct |
|---|---|---|---|
| python3 | 14852 | 65.1 | 705.4 |
| systemd | 2905 | 0.1 | 246.2 |
| mariadbd | 661 | 2.86 | 150.0 |
| php-fpm7.4 | 96 | 6.59 | 633.1 |
| fail2ban-server | 93 | 0.0 | 0.0 |
| caddy | 91 | 0.0 | 0.0 |
| kworker/0:0-events | 33 | 0.0 | 0.0 |
| kworker/0:2-events | 24 | 0.0 | 0.0 |
| kworker/0:1-events | 23 | 0.0 | 0.0 |
| multipathd | 22 | 0.0 | 0.0 |
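If you'd rather run this kind of report from Python than from the `sqlite3` shell, a rough sketch could look like the following. It uses the same query with the time window and row limit turned into bound parameters; the helper name `top_cpu_processes` and its defaults are assumptions for illustration, not part of the actual tooling.

```python
import sqlite3

# Same aggregation as the SQL in this post; the window and limit are parameters.
TOP_CPU_QUERY = """
SELECT
    top_cpu_name,
    COUNT(*)             AS samples_as_top,
    AVG(top_cpu_percent) AS avg_top_pct,
    MAX(top_cpu_percent) AS max_top_pct
FROM metrics
WHERE timestamp >= datetime('now', ?)
GROUP BY top_cpu_name
ORDER BY samples_as_top DESC
LIMIT ?
"""


def top_cpu_processes(conn, days=14, limit=10):
    """Return (name, samples, avg_pct, max_pct) rows for the last `days` days."""
    window = f"-{int(days)} days"  # SQLite date modifier, e.g. '-14 days'
    return conn.execute(TOP_CPU_QUERY, (window, int(limit))).fetchall()
```

Calling `top_cpu_processes(sqlite3.connect("metrics.db"))` would then return the rows as plain tuples; the file name `metrics.db` is a placeholder.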
As the data clearly shows, `python3` processes are the runaway top consumer of CPU resources on this server. They were the top process in over 14,800 samples, averaging about 65% CPU in those samples. Most strikingly, the maximum recorded spike was over 700%, indicating that at certain moments Python scripts were consuming the equivalent of 7 full CPU cores.
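The "7 cores" figure is simple arithmetic, assuming the metric follows the common per-process convention where 100% means one fully busy core (so a multi-threaded process can exceed 100%). A tiny sketch, with `cores_equivalent` being a made-up helper name:

```python
def cores_equivalent(cpu_percent):
    """Convert a per-process CPU percentage into full-core equivalents.

    Assumes 100% corresponds to one fully utilised core.
    """
    return cpu_percent / 100.0


# Peak python3 value from the table above: roughly seven cores' worth.
peak_cores = cores_equivalent(705.4)
print(f"{peak_cores:.2f} cores")
```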
This analysis narrows down the problem significantly. It's not a system-level issue with something like Caddy or the database (`mariadbd`); the load is coming directly from the Python applications I'm running.
The next logical step in this investigation is to dig deeper and differentiate between the various `python3` processes to see which specific scripts are the heaviest hitters. But for now, we have a very clear answer to "What's using the CPU?". The answer is: Python.
I have added more verbose data gathering to `server_metrics.py` to track the command-line arguments of each process, so we know which one is which. I'll continue to monitor the data and report back as I find new insights.
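To illustrate the idea of telling Python processes apart by their arguments, here is a minimal sketch. It assumes the command line is available in the NUL-separated form Linux exposes in `/proc/<pid>/cmdline`; the helper names `parse_cmdline` and `process_label` are hypothetical and not taken from the real script.

```python
import os


def parse_cmdline(raw):
    """Split a NUL-separated command line (bytes) into a list of arguments."""
    return [part.decode("utf-8", errors="replace")
            for part in raw.split(b"\0") if part]


def process_label(argv):
    """Build a short label that distinguishes one Python process from another.

    Non-Python processes are labelled by executable name only. For Python,
    the label includes the script name or the module given via -m.
    Limitation: interpreter options that take a separate value (such as
    '-W ignore') are not handled and would be mistaken for the script.
    """
    if not argv:
        return "unknown"
    exe = os.path.basename(argv[0])
    if not exe.startswith("python"):
        return exe
    args = argv[1:]
    for i, arg in enumerate(args):
        if arg == "-m" and i + 1 < len(args):
            return f"{exe} -m {args[i + 1]}"
        if not arg.startswith("-"):
            return f"{exe} {os.path.basename(arg)}"
    return exe
```

Storing a label like this alongside each sample would let the same `GROUP BY` approach rank individual scripts instead of lumping them all together as `python3`.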
As always,
Michael Garcia a.k.a. TheCrazyGM
I've been through that process of narrowing down possible causes to issues countless times, and I know how challenging it can be sometimes, so congratulations on finding part of the cause. I'm sure that you'll noodle out the last specifics in no time! 😁🙏💚✨🤙
Keep us in the loop. This is a fascinating case, since I would have expected us to be using less CPU now, with a less active player base between seasons.