86 CPU spikes in 24 hours: a multi-cause cascade postmortem
A cPanel server fired 86 ChkServd CPU alerts in one day. Four root causes were amplifying each other. The triage, the order of fixes, and the lessons.
The mailbox at 08:00 had 86 ChkServd CPU alerts from cpanel-host,
all from the previous 24 hours. Not a single tidy outage with a single
cause. A steady drip of "CPU at 95% for the last minute" notices that
arrived every twenty minutes, then every ten, then in clusters of five
inside a single minute, then quiet for an hour, then another cluster.
The default agency response to a mailbox like this is to mute ChkServd for the day, mark it for "Monday triage", and move on. We knew from experience that 86 alerts on a cPanel server in 24 hours is not noise. It is at least two, usually three, sometimes four separate things going wrong at the same time, and the things compound. Snoozing this server would have meant arriving Monday to a queue of client tickets, a MariaDB that had restarted twice overnight, and a firewall that was still half-frozen during reload.
This is the capstone post for our Tier 1 series. It walks through the investigation that uncovered four co-occurring root causes, the order we fixed them in, and what the cascade taught us about the shape of problems on busy cPanel servers. Each cause has its own dedicated postmortem elsewhere on the blog; we link those at the right moments so you can read more without re-reading the same incident twice.
The alert flood
The alert template looked like every other ChkServd CPU alert:
ChkServd Alert
==============
Service: cpu
Status: failed (CPU usage: 95.7%)
Host: cpanel-host
Time: 2026-05-09 03:14:22 UTC
What made this batch different was the shape, not the content. We exported the alerts to a CSV and plotted the timestamps. Three things jumped out:
- The alerts clustered around minute boundaries, with a pronounced peak at every fifth minute past the hour.
- There was a second, weaker cluster on the eighth-minute boundary inside every hour.
- The flood was heavier between 09:00 and 18:00 UTC than overnight, but the overnight floor was still 1 to 2 alerts per hour.
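The bucketing that produced those three observations is a one-liner. A minimal sketch, assuming the export landed in a hypothetical alerts.csv with one "YYYY-MM-DD HH:MM:SS" timestamp per line:
# Count alerts per minute-of-hour; periodic causes show up as tall
# buckets at the same minute value across many different hours
awk -F'[ :]' '{print $3}' alerts.csv | sort | uniq -c | sort -rn | head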
A clean single-cause incident does not produce that distribution. A single-cause incident produces either a sharp ramp followed by a plateau, or a single spike at a single hour. What we were looking at was at least two periodic processes overlapping with at least one traffic-driven process. Three sources, minimum.
Before we touched the server we wrote that hypothesis down in the incident channel. The cost of being wrong about the count of root causes is exactly the cost of the cascade itself: you fix one thing, the alert frequency drops 30%, you congratulate yourselves, and you miss the rest. The discipline is "name the hypothesis before you SSH in", because once you are inside the server, the temptation to fix the first thing you spot is almost irresistible.
Triage in the first 30 minutes
The first 30 minutes had three jobs: read current state, read recent history, and identify what was running when the alerts fired. None of this is exotic, but the order matters.
We started with current state.
# CPU utilisation right now, sampled once a second for five seconds
sar -u 1 5
# Per-CPU breakdown. Important when a single core is pegged
mpstat -P ALL 1 3
# Top processes by CPU, with the full command line
top -c -o %CPU -b -n 1 | head -40
The top output showed PHP-FPM workers consuming most of the CPU, a
MariaDB process at around 60% sustained, and a Patchman scan worker
above 20%. None of those was on its own enough to fire ChkServd; in
aggregate they were past the alert threshold.
Recent history came from the cPanel-side data: WHM > Process Manager
showed a list of long-running processes, sorted by duration. The
oldest entry was a php-fpm: pool northwood worker that had been
running for 47 minutes. WordPress requests should not run for 47
minutes. Something was wedged.
We then looked at what fires at the five-minute boundary. The first
candidate is always wp-cron on a busy WordPress site: with
DISABLE_WP_CRON left at its default (unset), WordPress reaches out
to itself via loopback HTTP whenever a visitor hits the site while a
scheduled task is due. The eighth-minute boundary is a less obvious clue;
on this server it turned out to be Patchman's scan scheduler, which
defaults to running at :08 past the hour.
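Both periodic candidates can be confirmed from logs before touching anything. For the wp-cron side, a rough check against the account's Apache domlog; the domlog path is the usual cPanel location and the domain name is a stand-in:
# Count wp-cron.php hits per hour:minute; a pile-up at :05 across many
# hours matches the alert clustering
grep "wp-cron.php" /usr/local/apache/domlogs/northwood-example.com \
  | awk -F'[:[]' '{print $3":"$4}' | sort | uniq -c | sort -rn | head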
By the 25-minute mark we had named four candidates. Not theories with equal weight; a ranked list with the heaviest at the top:
- wp-cron stacking on a high-traffic WordPress site (the 47-minute FPM worker, the five-minute clustering).
- MariaDB running queries against a buffer pool that was clearly too small for the working set (the 60% sustained CPU).
- Patchman scans hitting at peak hours (the eight-minute clustering).
- lfd reloads costing minutes of CPU each (the 100% utilisation on a single core during a csf -r).
The next step was confirming each one and only then fixing them, in the order that would let the next confirmation be honest. Confirming cause two when cause one is still firing every five minutes is a forlorn exercise. The noise from cause one drowns the signal from cause two.
The four root causes
Cause 1: wp-cron stacking on a high-traffic site
A pstree snapshot during a five-minute spike was unambiguous:
ps -e -o pid,ppid,etime,cmd --forest | grep -E "(php-fpm|wp-cron)" | head -60
41 concurrent wp-cron-driven PHP-FPM workers, all under the northwood
pool, all servicing /wp-cron.php requests. The site that triggered
them ran WooCommerce, Action Scheduler, and a half-dozen plugins that
each scheduled their own intervals. A backup the night before had
paused Action Scheduler for an hour, and when the backup released,
the catch-up queue bunched every overdue task into a single window.
WordPress's loopback model meant every visitor's first page load
spawned another wp-cron request, which spawned another, which spawned
another.
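The size of the catch-up queue is visible from WP-CLI, assuming it is installed on the server; overdue entries show a next_run_gmt in the past:
# List scheduled events for the site, run as the account user so file
# ownership stays intact; past-due next_run_gmt rows are the backlog
sudo -u northwood wp cron event list \
  --path=/home/northwood/public_html \
  --fields=hook,next_run_gmt,recurrence | head -30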
This is the exact shape we documented in
the WordPress wp-cron stacking on cPanel postmortem,
and the fix is the same: disable the loopback (define('DISABLE_WP_CRON', true); in wp-config.php) and replace it with a single staggered
system cron that runs at a known time and cannot stack.
Cause 2: MySQL with an undertuned innodb_buffer_pool_size
Once the wp-cron stacking was confirmed we asked what the workers were waiting on. The answer was MariaDB.
-- Buffer pool size relative to dataset
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
-- Result: 512M
-- Total InnoDB data on disk
SELECT
ROUND(SUM(data_length + index_length) / 1024 / 1024, 0) AS mb
FROM information_schema.tables
WHERE engine = 'InnoDB';
-- Result: 4118
512 MB of buffer pool against roughly 4 GB of hot InnoDB data is the
configuration MariaDB ships with when it has no idea what you are going
to do with it. Every page miss became a disk read. On the shared SSD,
disk reads were fast enough to look fine in isolation but slow enough,
when stacked, to inflate the I/O wait that ChkServd reads as CPU. The
process list during a spike was full of Sending data and Sorting result states with time values in the tens of seconds.
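A minimal confirmation of the thrashing, for anyone repeating this: compare logical read requests against the reads that fell through to disk. On a healthy pool the first dwarfs the second by several orders of magnitude; when the two counters are in the same neighbourhood, the pool is too small for the working set.
# Buffer pool hit behaviour: requests served from memory versus reads
# that had to go to disk
mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';"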
The MySQL OOM on cPanel and the innodb_buffer_pool_size trap postmortem covers this in detail, including how to pick a safe value without pushing MariaDB into OOM territory. On this server we settled on 2 GB after measuring the working set. That leaves enough RAM for Exim, ClamAV, the PHP-FPM pools at their realistic peak, and the kernel page cache without flirting with the OOM killer.
Cause 3: Patchman scans running during peak traffic
Patchman is a daily-scan tool that opens every PHP file under every cPanel account, reads each one into memory, and matches against a signature database. By default the scan window is set in UTC business hours, which for this server overlapped almost exactly with the client traffic peak.
# Configured scan window
/usr/local/patchman/patchman --status | grep -i scan
Each scan instance is a process; on a server with twenty-plus active
accounts, the scan window stretches for as long as the slowest account
takes to finish. While Patchman was running, every account's php-fpm
pool had to share CPU with the scanner that was reading its files.
For small brochure sites the impact was invisible; for the WooCommerce
site already saturated by wp-cron, every Patchman cycle pulled the
last spare worker.
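The contention is easy to watch directly during a scan window. A sketch; the scanner's process name here is an assumption and may differ between Patchman versions:
# Top CPU consumers during a scan; the scanner and the php-fpm pools of
# busy accounts show up side by side
ps -eo pid,pcpu,etime,args --sort=-pcpu | grep -iE "patchman|php-fpm" | head -20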
The dedicated postmortem for this is Patchman activation breaks PHP sites: the memory_limit gotcha. The memory-limit failure mode it documents is different from the CPU contention we hit here, but the underlying lesson is the same: Patchman defaults are tuned for empty cPanel servers, not for cPanel servers with real load.
Cause 4: csf.deny size making lfd slow
The CPU graph in WHM had a feature we had not noticed during the first
pass: a thin, sharp spike on a single core, roughly every 90 minutes,
lasting two to three minutes. iostat ruled out disk; the spike was
purely CPU. The match was lfd:
# Count the deny rules
wc -l /etc/csf/csf.deny
# 89381
# Last lfd reload event in the journal
journalctl -u lfd --since "2 hours ago" | grep -iE "reload|restart"
89,381 entries in csf.deny is well past the size where lfd can
reload the firewall quickly. Each reload triggered by Imunify360's
deny syncs (which push remote intelligence into the local deny list)
was pegging a CPU core for the full two to three minutes of the
reload. On the eight-core VM this was 12.5% of total compute taken out
of service for the duration, repeatedly, on a one-and-a-half-hour
cycle.
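The reload cost itself is straightforward to measure before and after a cleanup; a sketch, best run outside a traffic peak because the rule set is in flux while the reload runs:
# Time a full firewall reload; with ~90k deny entries this pegs one core
time csf -r > /dev/null
# In a second terminal, watch per-core utilisation while the reload runs
mpstat -P ALL 2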
This was the cause whose dedicated postmortem we wrote first:
CSF, lfd, and Imunify360: why your firewall is killing itself.
The fix involves both pruning csf.deny and reconfiguring Imunify360
not to sync everything it learns into the local list.
How they amplified each other
If any one of those four causes had been the only thing wrong, the server would have looked fine to ChkServd. The threshold lives at 90% CPU sustained for a minute. wp-cron stacking on its own added maybe 30% load. The buffer-pool problem on its own added 20%. Patchman added 15%. lfd reload added 12.5%, but only on one core, and only during the reload.
The cascade was not additive. It was multiplicative.
wp-cron stacking caused PHP-FPM saturation. PHP-FPM saturation meant that even the small fast WordPress queries took longer, because every request was queued behind a slow request waiting on MariaDB. MariaDB queries queued because the buffer pool was thrashing. The thrashing buffer pool meant that any new query for a hot table waited on disk reads that were competing with the file reads Patchman was performing on the same SSD. Patchman's scans hit at peak hours and at the eight-minute boundary, so when wp-cron also fired at the five-minute boundary, the three-minute interval between :05 and :08 became the window where every wp-cron worker for the busy site was alive and the Patchman scan started. lfd reloads on a fixed 90-minute cycle did not care about either, but when they hit during one of those :05 to :08 windows, they took the eighth core out of the rotation and ChkServd fired.
Plotting the alert timestamps against the four periodic sources made this obvious in hindsight. The clusters at :05, the secondary clusters at :08, the heavier daytime weight (Patchman + business traffic), the overnight floor (wp-cron loopbacks driven by long-running visitors plus the 90-minute lfd reload). Every alert traced back to at least two of the four causes coinciding.
A different incident on the same server fits the same pattern from a different angle. Three sites were compromised inside one week through plugin vulnerabilities that were each individually low-severity; three WordPress compromises in one week documents how the security side of the same dynamic, small individually and dangerous together, plays out. The mental model is the same whether the symptom is CPU or shells.
The order we fixed them in
We fixed them one at a time, in the order that maximised the signal clarity for the next fix. The temptation to fix everything in parallel is strong; the cost is that you cannot tell which fix did what, and if something regresses you have four candidates to investigate instead of one.
Fix 1: wp-cron stacking
This was the highest-impact lowest-risk change. The configuration sequence:
// wp-config.php for northwood's WordPress install
define('DISABLE_WP_CRON', true);

# /etc/cron.d/wp-cron-northwood: a single staggered cron at :07
7 * * * * northwood /usr/local/bin/php /home/northwood/public_html/wp-cron.php > /dev/null 2>&1
We picked :07 deliberately. The default :00 and :15 minute marks
are crowded with system cron jobs across cPanel; :07 sits in a quiet
window for this server. Anchoring wp-cron away from the crowded
minute boundaries removed the :05 cluster from the alert distribution.
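A quick check that the constant is actually live, assuming WP-CLI is available on the server; the command reads wp-config.php and prints the value it finds:
# Confirms DISABLE_WP_CRON exists in wp-config.php and shows its value
sudo -u northwood wp config get DISABLE_WP_CRON --path=/home/northwood/public_html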
Result after one hour: alert frequency dropped from ~3.5 per hour to ~1.5 per hour. The drop was visible in the inbox within the first twenty minutes.
Fix 2: Patchman scan window
We moved the Patchman scan window to 02:00 UTC. No code change, one configuration setting in the Patchman dashboard:
# After changing the scan window in Patchman UI
/usr/local/patchman/patchman --status | grep -i scan
# scan_start_time: 02:00 UTC
# scan_duration_limit: 04:00
The 02:00 to 06:00 UTC window for this client base is empty: no client peak traffic, no Imunify360 daily report run, no backup overlap. The scan completes inside it.
Result after 24 hours: peak-hour CPU spikes reduced by roughly a third. Combined with the wp-cron fix, the daily alert count dropped from 86 to 22.
Fix 3: MySQL tuning
The MariaDB tuning required a brief restart. We picked a Sunday window because client traffic to the busy WooCommerce site was lowest then.
# /etc/my.cnf, relevant section
[mysqld]
innodb_buffer_pool_size = 2G
innodb_buffer_pool_instances = 2
innodb_log_file_size = 256M
innodb_flush_log_at_trx_commit = 2
slow_query_log = 1
slow_query_log_file = /var/log/mariadb/slow.log
long_query_time = 1

# Permissions reset because the slow log directory ownership had
# drifted between cPanel updates
chown mysql:mysql /var/log/mariadb
chmod 750 /var/log/mariadb
systemctl restart mariadb
Result: query-latency p99 dropped from ~2.4 s to ~280 ms. The MariaDB CPU steady-state dropped from 60% to about 15%. Combined with the earlier fixes, the daily alert count moved from 22 to 9.
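Two post-restart sanity checks, as a sketch: the pool size should reflect the new value and the slow log should be populating at the configured path.
# Should report 2147483648 (2 GB) after the restart
mysql -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';"
# Queries over 1 s should now be landing here
tail -n 5 /var/log/mariadb/slow.log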
Fix 4: csf.deny cleanup
Last, because it was the highest-risk change and required confidence that the earlier fixes had stuck.
# Lower the deny ceilings so future runaway syncs cannot reproduce
# this state
sed -i 's/^DENY_TEMP_IP_LIMIT.*/DENY_TEMP_IP_LIMIT = "200"/' \
/etc/csf/csf.conf
sed -i 's/^DENY_IP_LIMIT.*/DENY_IP_LIMIT = "5000"/' \
/etc/csf/csf.conf
# Back up the current deny list
cp /etc/csf/csf.deny /root/backups/csf.deny.$(date +%F)
# Prune to the configured ceiling. See the dedicated postmortem
# for the safe pruning procedure
csf -tf
csf -r
Result: lfd reloads now complete in under ten seconds, not two to three minutes. The single-core spikes disappeared from the CPU graph. The post-fix daily alert count settled at four.
Why the order matters
Earlier fixes reduced noise that would have masked the impact of later
fixes. If we had cleaned csf.deny first, we would have measured the
improvement against a background that still had wp-cron stacking and
buffer-pool thrashing; the lfd improvement would have been within the
margin of the rest of the chaos. By fixing the noisiest cause first
and giving each fix at least an hour to settle before the next, every
intervention had its own clean before-and-after window.
This is the discipline that distinguishes cascade triage from one-cause triage: in a cascade, the measurements you take after a fix are as important as the fix itself, because they tell you whether the remaining alert volume is the next root cause or the previous one's residual.
24-hour observation post-fix
The day after the fourth fix the inbox had four ChkServd alerts. Two correlated with genuine traffic spikes on the WooCommerce site (a flash sale email blast and a referral from a Telegram channel); one was a backup window that took 90 seconds longer than usual; one had no obvious cause and did not recur.
The lesson is that four alerts per day on a cPanel server with this workload is healthy. Zero alerts indicates a broken alerter; eighty-six indicates a cascade. The right tuning of ChkServd's threshold for this profile leaves a few legitimate-traffic alerts visible, because those alerts are the early warning that you are about to need to scale.
We left the threshold at 90% / 60 s, the cPanel default. Loosening it would have hidden the cascade from us originally, and would have hidden the next cascade from a future on-call.
How we'd diagnose this differently next time
In retrospect there were three pieces of tooling we wished we had had on day one:
- Per-minute alert correlation. Pasting timestamps into a spreadsheet works, but it is slow. A view that lays alert times over a heatmap of process activity would have surfaced the :05 / :08 clustering inside the first ten minutes.
- PHP-FPM pool utilisation graph. top shows you which workers are hot right now. It does not show you whether a pool has been saturated for the past three hours or just hit a momentary peak. The 47-minute wedged worker was a lucky observation; a graph would have made it inevitable.
- csf.deny size as a tracked metric. Nobody graphs the size of /etc/csf/csf.deny. We did not graph it. The number drifts upward by a few hundred entries a day on a busy server and you notice when it crosses 50,000 only because something else breaks. A one-line cron is enough to start the trend; a sketch follows this list.
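The tracking itself is trivial once you decide to do it; a minimal sketch for an /etc/cron.d/ entry, with the log path as an arbitrary choice:
# Record the deny-list size once a day; check the log when reloads start
# feeling slow
0 6 * * * root echo "$(date +\%F) $(wc -l < /etc/csf/csf.deny)" >> /var/log/csf-deny-size.log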
A handful of other related incidents on the same server underline how much hand-rolled tooling we accumulated this year. The xmlrpc abuse 27-site fix needed per-account log correlation that we built ad hoc. The cPanel disk full backup retention trap needed a retention dashboard that did not exist. The SSH brute force cPanel night postmortem covers another night where alert-correlation would have shortened the recovery from hours to minutes. The WooCommerce filter URL crawler trap investigation needed exactly the kind of FPM-pool-by-account view we are now describing.
The pattern is consistent: every Tier 1 incident this year either needed a piece of tooling we did not have, or burned hours building that tooling at 03:00 on a Saturday. The product we are building is the tooling, but that is a story for another paragraph.
The 30-minute checklist when you have an alert flood
When the inbox tips over to "many alerts, no obvious single cause", the first 30 minutes determine whether you spend the day fighting a cascade or chasing the wrong fix.
# 1. Get current state: three perspectives
sar -u 1 5
mpstat -P ALL 1 3
top -c -o %CPU -b -n 1 | head -40
# 2. Look at the alert timestamps as a distribution
# (export from your alerter, then a one-liner to bucket by minute)
awk '{print substr($1, 12, 5)}' alerts.log | sort | uniq -c | sort -rn
# 3. Find long-running processes. The wedged worker is the tell
ps -eo pid,etime,cmd --sort=-etime | head -20
# 4. Read each periodic scheduler the server runs
crontab -l -u root
ls /etc/cron.d/ /etc/cron.hourly/ /etc/cron.daily/
/usr/local/patchman/patchman --status
# 5. Check the size of any list-like data the firewall reads
wc -l /etc/csf/csf.deny /etc/csf/csf.allow /etc/csf/csf.ignore
# 6. Count active workers per FPM pool. The pool that is full is
# the pool that is breaking
for p in /opt/cpanel/ea-php*/root/usr/var/log/php-fpm/*.access.log; do
  echo "$p $(tail -n 200 "$p" | awk '{print $1}' | sort -u | wc -l)"
done
If the alert distribution has one peak and the wedged-process list has zero entries, you are looking at a single-cause incident. Fix the periodic process that fires at that peak; the alerts will stop.
If the alert distribution has two or more peaks, or if you have multiple wedged workers across different pools, you are looking at a cascade. Name the candidate causes before you fix anything; rank them by impact and risk; fix the highest-impact lowest-risk first; give each fix at least an hour to register before you measure the next.
Two related references finish the toolkit: a cPanel SSH lockout recovery walkthrough for the day the cascade includes the management plane itself, and the AutoSSL fails on Microsoft 365 autodiscover records postmortem for the day the cascade is in DNS and certificates rather than CPU.
How ServerGuard handles this
ServerGuard is the product we are building from a year of incidents like this one. Its single-cause use cases are mature; its cascade behaviour is more conservative on purpose.
Detection. SGuard correlates alerts across categories rather than treating each category independently. A burst of ChkServd CPU alerts combined with a burst of mod_security 5xx events plus an increase in slow-query log entries gets surfaced as a single "cascade hypothesis" notification, not as three unrelated notifications. The correlation window and the threshold count for "this is a cascade" are tuned per server profile, because a server hosting twenty small brochure sites generates a different baseline than one hosting two WooCommerce shops. Implemented today.
Diagnosis. When SGuard detects a cascade hypothesis, it produces a ranked list of candidate root causes with the evidence that supports each, and the cross-cause amplification it suspects. The output is a short document that a human investigator can read in under five minutes and use to decide which cause to fix first. The same discipline this postmortem describes, automated to the point of the ranked list. Implemented today.
Action. SGuard does not auto-execute remediation on multi-cause cascades. This is deliberate. Single-cause incidents map onto a small, well-defined set of safe actions (restart a service, prune a log file, flush a queue) that we have run hundreds of times each with a known failure-recovery story. Cascade remediation involves ordered changes whose individual safety depends on the state created by the previous one (the buffer-pool change in this postmortem was only safe after wp-cron was no longer thrashing the server). The combinatorics of "is this change safe given the state we just put the server in?" are not yet at the point where we trust them to run without a human in the loop. Cascade-action automation is upcoming, gated on a separate evidence base we are still building.
The honest boundary. Single-cause incidents → autonomous action with approval gates. Multi-cause cascades → SGuard becomes an AI assistant for the human who is doing the investigation. The job it does in the cascade case is shortening time-to-diagnosis from "weekend morning" to "the next free thirty minutes". That is substantial value and we ship it; pretending it is more than that would be the kind of marketing claim this audience detects in two sentences and never trusts again.
If you operate cPanel servers and want to shorten the next cascade, join the ServerGuard waitlist. The early-access cohort gets the cascade-hypothesis surface turned on by default; the rest of the platform follows the roadmap above.
The Tier 1 series ends here. The eleven posts that came before this one each cover a single named incident; this one is the shape they make when you read them together. The shape is the product.