
MySQL OOM on cPanel: diagnosing innodb_buffer_pool_size

A 2026 postmortem on MySQL OOM on cPanel: ChkServd alerts, the InnoDB buffer pool error, the math behind safe sizing, and the fix flow we ship.


The page came in at 03:14. cPanel's ChkServd had decided MariaDB was down on cpanel-host, and the on-call inbox was filling up with the alert every cPanel operator eventually learns to dread:

chkservd alert: MariaDB FAILED. mysqld is not running.
service mysql start ... [FAILED]

A junior on a different team would have responded the way the alert implicitly suggests: service mysql start, wait, retry, paste it into chat. We did that once, ten years ago. It does not work. If MariaDB just crashed and refuses to come back up, it is almost never a transient fault. It is almost always one of three things, and the most common of the three is that someone (usually cPanel's own "MySQL Tuner", sometimes a well-meaning sysadmin six months ago) wrote a value into /etc/my.cnf that the kernel will not honour.

This post is the postmortem for that 03:14 page and for two sibling incidents on different cPanel boxes from the same quarter. If you have found your way here from Googling mysql innodb_buffer_pool_size oom, mariadb out of memory cpanel, or mysql wont start innodb_buffer_pool_size, your database is probably down right now. Skip ahead to "The definitive fix flow", do the fix, come back and read the rest later. Everything in this post has been anonymised per our usual conventions; the commands and the log strings are exactly what you will see on your own box.

The 3am alert that always tells you the same thing

cPanel's ChkServd watches a fixed list of services and pings them on a short loop. When MariaDB fails its check three consecutive times, ChkServd files an alert with the [chkservd] subject prefix and attempts a single restart. That restart is the loud, unhelpful service mysql start you see in the alert body. It fails because MariaDB is not in a transient state. It is in a configuration state that the kernel cannot satisfy.

The first wrong instinct, on a half-awake morning, is to run service mysql start again by hand. The second wrong instinct is to run it inside systemd-run to "give it more time". The third wrong instinct is to reboot, because the screen says memory and rebooting solves memory problems. None of those help. The configuration that caused the crash is on disk in /etc/my.cnf. It will still be there after the reboot. MariaDB will fail in exactly the same way, often in less time, because cPanel's cpanellogd and tailwatchd will already be alive and competing for the same RAM by the time mysqld gets scheduled.

The right instinct is to read three logs in order and then do arithmetic. That is what the rest of this post is about.

Where to look first, in order

Four log surfaces tell you four overlapping things. Read them in this order; do not skip ahead.

# 1. The MySQL error log itself. cPanel writes it under datadir.
#    The filename uses the server's hostname.
tail -n 200 /var/lib/mysql/cpanel-host.err
 
# 2. The systemd journal for the MariaDB unit. cPanel ships MariaDB
#    as `mariadb.service` since EL 8; older boxes may say `mysql`.
journalctl -u mariadb --since '15 min ago' --no-pager
 
# 3. The kernel ring buffer. If the kernel OOM-killer took MariaDB,
#    this is where the verdict lives.
dmesg -T | grep -i -E 'killed process|out of memory|oom'
 
# 4. The generic syslog. Same OOM signal will land here too, plus
#    anything cPanel's tailwatchd had to say.
tail -n 500 /var/log/messages | grep -i -E 'mysql|mariadb|oom'

The error log tells you what MariaDB itself believes happened. The journal tells you what systemd believes happened (and on modern cPanel boxes systemd's cgroup OOM-killer can fire before MariaDB gets a chance to log anything). The kernel ring buffer is the referee. If dmesg shows Killed process NNNN (mysqld) with a size in KB, the kernel won and MariaDB's last word never made it to disk. /var/log/messages is the cross-check; if everything else agrees the OOM is real, this file will agree too.

Two patterns matter. If the error log has content from the most recent restart attempt, MariaDB at least started its initialisation sequence and reached the point where it refuses to come up. That is the easier failure to diagnose. If the error log ends at the previous successful start and there is nothing for the recent attempt, MariaDB was killed before it could log, almost always by the kernel or by systemd's cgroup OOM watchdog. That is the harder failure, and dmesg is your only reliable witness.
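
If you want to make that call mechanically, a small shell sketch works; it checks only for the two signatures this post cares about, uses the same anonymised error-log path as above (substitute your own hostname), and is a triage aid, not a substitute for reading the logs:

# Rough triage: did MariaDB log its own refusal, or was it killed first?
ERRLOG=/var/lib/mysql/cpanel-host.err   # substitute your hostname
if tail -n 200 "$ERRLOG" | grep -q 'Cannot allocate memory for the buffer pool'; then
  echo 'configuration OOM: InnoDB refused the allocation; fix innodb_buffer_pool_size'
elif dmesg -T | grep -qiE 'killed process .*(mysqld|mariadbd)'; then
  echo 'kernel/cgroup OOM: mysqld was killed before it could log; dmesg is the witness'
else
  echo 'no obvious OOM signature; read the error log and journal in full'
fi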

The error message that's always the same

When MariaDB starts, allocates its InnoDB buffer pool, and discovers the kernel will not give it the memory it asked for, it writes some variant of these lines. The exact wording shifts across 10.x and 11.x but the substance is identical:

[Note] InnoDB: Initializing buffer pool, total size = 16.000GiB, chunk size = 128.000MiB
[ERROR] InnoDB: Cannot allocate memory for the buffer pool
[ERROR] InnoDB: Plugin initialization aborted with error Generic error
[ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
[ERROR] Unknown/unsupported storage engine: InnoDB
[ERROR] Aborting

Sometimes the kernel intercepts the allocation before InnoDB even gets to print its initial "Initializing buffer pool" line and you see this in the journal instead:

mariadb.service: Main process exited, code=killed, status=9/KILL
mariadb.service: Failed with result 'signal'.
systemd[1]: mariadb.service: Consumed 0s CPU time.

That is the cgroup-OOM-killed-by-systemd path. It is the same root cause as the InnoDB error (MariaDB asked for memory the kernel could not honour) but the order of operations is different. The kernel killed the process before MariaDB's signal handler could flush a final line to the error log. On a 03:14 page this is the single most confusing pattern, because the error log looks healthy ("everything was fine at the last shutdown") and only the journal and dmesg know the truth.
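
On the cgroup path it is also worth checking whether the unit carries its own memory cap before blaming physical RAM alone. The properties below are standard systemd ones; anything other than infinity in MemoryMax or MemoryHigh is a per-unit limit that can kill MariaDB even while free -h looks comfortable:

# Does mariadb.service have a per-unit memory cap, and what is it using now?
systemctl show mariadb -p MemoryMax -p MemoryHigh -p MemoryCurrent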

Either way, the answer is the same: the value of innodb_buffer_pool_size in /etc/my.cnf is larger than the RAM the kernel can actually hand MariaDB at startup. That is true even if the value was fine yesterday. RAM is shared on a cPanel box, and cPanel's own daemons have grown.

The math you have to do

Before editing anything, do the arithmetic on paper (or in a scratch file). This is the budget for a cPanel box:

Physical RAM                                 16,384 MB   100%
- Kernel + cgroup overhead                   ~  1,500 MB
- cPanel + tailwatchd + cpsrvd + dnsadmin    ~    600 MB
- Apache + mod_lsapi worker pool             ~  2,000 MB
- PHP-FPM children (50 clients, ~80 MB ea)   ~  4,000 MB
- Exim + Dovecot + SpamAssassin              ~    500 MB
- CSF/lfd + Imunify360 agent                 ~    500 MB
- Free + buffers/cache headroom              ~  1,000 MB
                                             ----------
                                              ~10,100 MB used by NOT-MySQL
                                              ~ 6,284 MB safe budget for MariaDB

The numbers above are an example, not a recipe. Your php-fpm.d pool config and Apache MaxRequestWorkers decide the middle two lines. Read them. Multiply. The point of the exercise is that on a 16 GB cPanel box where you also serve PHP traffic, the safe innodb_buffer_pool_size is closer to 4 to 6 GB than to the 12 GB cPanel will sometimes recommend.
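
If you would rather have the multiplication done for you, the sketch below sums pm.max_children across the FPM pool files and subtracts a flat allowance for everything that is not PHP. The ~80 MB per child and the ~6,100 MB non-PHP figure are the example numbers from the table above, not measurements from your box; replace them with what your server actually shows, and note that the FPM pool directory differs between stock EL and cPanel's EasyApache builds:

# Rough budget sketch using the example figures from the table above.
total_mb=$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)
fpm_children=$(grep -rhE '^\s*pm\.max_children' /etc/php-fpm.d/ 2>/dev/null \
               | awk -F= '{sum += $2} END {print sum + 0}')
fpm_mb=$(( fpm_children * 80 ))      # assumed ~80 MB per PHP-FPM child
other_mb=6100                        # kernel, cPanel, Apache, mail, CSF, headroom
budget_mb=$(( total_mb - fpm_mb - other_mb ))
echo "RAM ${total_mb} MB - FPM ${fpm_mb} MB - other ${other_mb} MB = ${budget_mb} MB for MariaDB"
echo "suggested innodb_buffer_pool_size: $(( budget_mb / 1024 ))G (rounded down)"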

Two rules of thumb hold across every cPanel box we have ever tuned:

  • On a dedicated database server (no Apache, no PHP-FPM, minimal everything else), innodb_buffer_pool_size can safely be 50 to 60% of physical RAM. The classic MySQL guidance assumes this topology.
  • On a shared cPanel server (the normal case for hosting agencies), it should be 30 to 40% of physical RAM, rounded down to the nearest gigabyte. The classic guidance does not apply because PHP-FPM is the silent eater.

If mysqltuner.pl recommends a value above 50% on a cPanel box, ignore it. The tool reads MySQL's own stats and does not see the PHP-FPM pool. We have walked into more than one outage caused by a sysadmin running mysqltuner quarterly and obediently raising the buffer pool each time. That is how you arrive at 12 GB on a 16 GB box. That is how you arrive at the page at 03:14.

The slow log permissions trap

There is a sibling bug that lives in the same neighbourhood and that you should resolve in the same maintenance window even though it does not cause the OOM. When cPanel's automatic update path upgrades the MariaDB packages, it sometimes resets ownership on the slow query log to root:root:

ls -l /var/lib/mysql/cpanel-host-slow.log
# -rw-r----- 1 root root 0 May 10 03:14 cpanel-host-slow.log

MariaDB runs as the mysql user. It cannot write its own slow log when the file is owned by root, and it will spam a confusing message into the error log:

[Warning] Could not open file '/var/lib/mysql/cpanel-host-slow.log' for slow log: Permission denied

That line is not the OOM. It will, however, sit next to the OOM lines in the error log on a cPanel box that has just been updated, and a half-awake on-call will sometimes chase the slow log line first. The fix is one command:

chown mysql:mysql /var/lib/mysql/cpanel-host-slow.log
chmod 660       /var/lib/mysql/cpanel-host-slow.log

If the slow log was the only problem, MariaDB starts on the next attempt. If the slow log is fixed and MariaDB still will not start, the OOM is real and you are about to do the buffer pool edit.

The definitive fix flow

Four steps. Do them in order. Do not skip step 1 even if you are certain mysqld is already gone.

Step 1: stop trying to start it

systemctl stop mariadb
# Confirm there is no leftover mysqld. If there is, the next start
# will fail with "Another mysqld is already running" or with a
# socket conflict that masks the real OOM.
pgrep -a mysqld || echo 'no mysqld processes'

If pgrep returns a PID, that is almost always mysqld_safe's respawn loop in a degraded state. Kill it explicitly before moving on. Do not kill -9 directly. Try systemctl stop mariadb again first, then a normal kill, and only fall back to -9 if the process refuses to leave.
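
A sketch of that escalation order, with arbitrary ten-second pauses between attempts; if a mysqld_safe wrapper is still alive it has to go first, or it will simply respawn the child you just killed:

# Escalate gently. SIGKILL is the last resort because an unclean stop
# costs you a longer InnoDB crash recovery on the next start.
# (Newer MariaDB builds name the daemon mariadbd; adjust the patterns.)
pgrep -f mysqld_safe >/dev/null && pkill -TERM -f mysqld_safe
systemctl stop mariadb
sleep 10
pgrep -x mysqld >/dev/null && pkill -TERM -x mysqld
sleep 10
pgrep -x mysqld >/dev/null && pkill -KILL -x mysqld    # last resort
pgrep -a mysqld || echo 'no mysqld processes'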

Step 2: edit my.cnf safely

The buffer pool variable lives in /etc/my.cnf on cPanel. Some older boxes split it into /etc/mysql/my.cnf plus an !includedir; follow the include chain to find the actual definition.
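
Before touching the file, take a dated copy so a revert after step 3 is one command back; the filename stamp is just a convention:

# Dated backup of the config before the edit.
cp -a /etc/my.cnf /etc/my.cnf.bak.$(date +%F-%H%M)
# To revert later: cp -a /etc/my.cnf.bak.<stamp> /etc/my.cnf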

Open the file in your editor of choice, find the innodb_buffer_pool_size line, and change it to the value from the math above, rounded down to the nearest gigabyte.

# /etc/my.cnf
[mysqld]
# Was: innodb_buffer_pool_size = 12G
innodb_buffer_pool_size = 6G
 
# Chunk size must divide buffer pool size cleanly. 128M is the
# default and works for buffer pools that are a whole-GB multiple.
innodb_buffer_pool_chunk_size = 128M
 
# Optional, and only honoured on older servers: instances =
# buffer_pool_size / 1GB, capped at 8. Recent MariaDB releases (10.5+)
# deprecate innodb_buffer_pool_instances, so check your version before
# keeping the line.
innodb_buffer_pool_instances = 6

Round down, not up. If your math gives 6.3 GB, write 6G. If it gives 5.8 GB, write 5G. The cost of being slightly too small is a few percent of read cache misses; the cost of being slightly too large is another 03:14 page tomorrow.

Save the file. Keep your shell open in the same directory. You will probably want to revert if step 3 surprises you.

Step 3: verify and start

Before starting MariaDB, make it parse the configuration without actually starting. MySQL 8.0.16+ ships a dedicated mysqld --validate-config flag; MariaDB builds generally do not, but asking the server binary for its verbose help forces a full parse of /etc/my.cnf and every included file. This catches the typos you cannot see in the editor at 03:20:

mysqld --help --verbose > /dev/null
# Silence is good. An "unknown variable" or "Error while setting value"
# line means a typo; fix it before continuing.

Then start the unit and read the error log live for the first 30 seconds. The first 30 seconds is where you find out whether the new value works. After that, MariaDB is up.

systemctl start mariadb
tail -F /var/lib/mysql/cpanel-host.err &
# Watch for the "ready for connections" line. If you see another
# "Cannot allocate memory" line, stop, edit my.cnf to a smaller
# value, and go back to step 2.
sleep 30 ; kill %1

Step 4: confirm it stayed up

A MariaDB process that comes up but cannot accept real load is worse than one that refused to start, because ChkServd will mark it healthy and you will go back to bed. Do a light health check that exercises every database without putting real query load on the server:

mysqlcheck --all-databases --quick --silent
echo "exit code: $?"
mysql -e 'SHOW GLOBAL STATUS LIKE "Innodb_buffer_pool_pages_free"'

Then leave the connection open and wait five minutes. If MariaDB is going to fall over again under steady-state load, it does so inside five minutes. If five minutes pass and nothing has changed, you are done; close the ticket, restore the slow log permissions if needed, and write the incident note.
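
The five-minute wait does not need a human staring at a terminal. A small sketch that pings the server every thirty seconds and reports the moment it stops answering (the interval and duration are arbitrary, and it assumes root authenticates via the usual /root/.my.cnf):

# Poll for five minutes; report the moment MariaDB stops answering.
ok=1
for i in $(seq 1 10); do
  if ! mysqladmin ping >/dev/null 2>&1; then
    echo "down again after $(( i * 30 ))s -- back to step 2"
    ok=0
    break
  fi
  sleep 30
done
[ "$ok" -eq 1 ] && echo 'MariaDB survived five minutes of steady state'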

The deeper question: why did this happen at all?

The buffer pool was correct on the day it was set. RAM did not shrink. So what changed?

Three things, usually all at once.

First, cPanel's MySQL Tuner is not RAM-aware. cPanel's recommendation engine reads MySQL's own internal statistics (Innodb_buffer_pool_reads, Innodb_buffer_pool_read_requests, the hit ratio) and concludes that a larger pool would improve the hit rate. It is correct about the hit rate. It does not know how many PHP-FPM children are alive on the same box, because that is not a MySQL statistic. A sysadmin who clicks "apply recommended" once a quarter ends up with a pool sized for the database's appetite, not for the box's budget.

Second, PHP-FPM grows silently. Adding a client to a cPanel account creates an FPM pool for that user. Adding twenty clients over six months grows the resident set of the box by 1.5 to 2 GB without anyone editing a config file. That growth is invisible unless you graph it; cPanel's "Service Status" page does not.
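
You do not need a graphing stack to see the drift. Summing the resident set of the FPM workers on the spot, once a quarter, is enough to catch it; the one-liner below assumes the worker processes report their name as php-fpm (adjust the pattern if yours differ):

# Total resident memory of every php-fpm worker on the box, in MB.
ps -eo rss,comm | awk '/php-fpm/ {sum += $1} END {printf "php-fpm RSS: %.0f MB\n", sum/1024}'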

Third, kernel and cgroup overhead is non-trivial on modern EL. RHEL/AlmaLinux 9 reserves slightly more for the kernel than 7 did, and systemd's cgroup memory accounting is stricter. The exact same my.cnf that was safe on a 16 GB CentOS 7 box can OOM on a 16 GB AlmaLinux 9 box.

Together, those three drifts mean that any cPanel server tuned more than six months ago and not revisited is a candidate for the same 03:14 page. The fix is not "tune harder", it is "revisit on a schedule". A simple quarterly check of free -h next to your my.cnf settings prevents this whole category of outage.
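
"Revisit on a schedule" can be as blunt as a cron entry that mails the two numbers to the ops list every quarter; the schedule, recipient, and file path below are placeholders, not a recommendation:

# /etc/cron.d/buffer-pool-review -- illustrative only
MAILTO=ops@example.invalid
0 9 1 */3 * root free -h; grep -E '^\s*innodb_buffer_pool_size' /etc/my.cnf

Cron mails whatever the command prints, so the quarterly message is exactly the free -h output next to the configured pool size.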

The pre-mortem checklist

Five things to check now, even if MariaDB is up:

  1. Run free -h and compare available to innodb_buffer_pool_size. If the pool is larger than the available memory under load, the next restart will fail.
  2. Run grep innodb_buffer_pool_size /etc/my.cnf and confirm the value matches what SHOW VARIABLES reports. A value edited but not reloaded is a time bomb.
  3. Run ls -l /var/lib/mysql/*-slow.log and confirm ownership is mysql:mysql. Resolve before the next cPanel update.
  4. Read /etc/php-fpm.d/*.conf for pm.max_children. Multiply by ~80 MB. Add the result to your "not MariaDB" budget and re-do the arithmetic from the section above.
  5. Run mysqltuner.pl if you like, but apply its buffer-pool recommendation only after you have done the cPanel-side arithmetic. The tool is right about MySQL internals; it is blind to the rest of the box.

That checklist takes ten minutes per server and catches every case of this incident we have ever seen.
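
Items 2 and 3 are the easiest to script. A sketch that prints the running value next to the configured line and flags any slow log not owned by mysql:mysql; it assumes root authenticates through the usual /root/.my.cnf that cPanel maintains:

# Checklist items 2 and 3 in one pass.
running=$(mysql -N -B -e 'SELECT @@innodb_buffer_pool_size')
configured=$(grep -E '^\s*innodb_buffer_pool_size' /etc/my.cnf | tail -n 1)
echo "running (bytes): ${running}"
echo "my.cnf line    : ${configured}"
for f in /var/lib/mysql/*-slow.log; do
  [ -e "$f" ] || continue
  stat -c '%U:%G %n' "$f" | grep -v '^mysql:mysql ' && echo "fix: chown mysql:mysql $f"
done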

A related failure mode is when the box is not OOM at all but is disk-full and the InnoDB undo logs cannot grow. MariaDB's error log looks similar enough that operators sometimes chase memory when the real problem is the filesystem. If df -h /var/lib/mysql shows the data partition close to full, that is a different incident; we write that one up in a separate post.

Two other Tier-1 incidents on this site neighbour this one. If firewalls on this box also flap on the same nights MariaDB falls over, the kernel is probably picking off the largest resident processes one at a time, and our writeup of CSF, lfd, and Imunify360 covers what happens when lfd is the next victim of the same OOM storm. If your MySQL CPU spikes are driven by WordPress cron runaways rather than buffer-pool pressure, the writeup on WordPress wp-cron stacking on cPanel explains how to find and shut down the loops before they push MariaDB into the next OOM.

A 30-second health check

The single command that tells you whether the next restart will succeed:

awk '/MemAvailable/ { avail = $2 / 1024 }
  END {
    cmd = "grep -E \"^[[:space:]]*innodb_buffer_pool_size\" /etc/my.cnf | tail -n 1"
    cmd | getline line
    close(cmd)
    printf "available MB: %.0f / my.cnf line: %s\n", avail, line
  }' /proc/meminfo

That one-liner reads MemAvailable from /proc/meminfo and the configured innodb_buffer_pool_size from /etc/my.cnf and prints them on one line. If the configured pool size is anywhere close to MemAvailable, the next MariaDB restart is the next page. Reduce the pool now, in business hours, with a calm head, rather than at 03:14.

How ServerGuard handles this

MySQL OOM is ServerGuard's canonical scenario (the product calls it Database Recovery), and the handling is implemented end-to-end.

When ChkServd files a MariaDB-down alert, SGuard ingests the alert within 60 seconds via its cPanel notification subscription. It reads the MariaDB error log, the systemd journal entries for the unit, and dmesg, and classifies the failure into one of three buckets: configuration OOM (the InnoDB allocation error), kernel OOM (the Killed process signature in dmesg), or "other" (corruption, disk-full, permissions). For the first two, the remediation path is the one in this post.

SGuard then proposes a remediation plan to the operator. The plan is built from the same arithmetic the previous sections described: total physical RAM, observed resident set of cPanel daemons, observed PHP-FPM child count multiplied by a per-child estimate, and the headroom the kernel keeps. The proposed innodb_buffer_pool_size is rounded down to the nearest gigabyte and shown alongside the current value. Three actions are queued:

  • Action 1 (Safe, auto). Restore slow query log ownership to mysql:mysql if cPanel has reset it. This is reversible, scoped, and never causes a service restart. SGuard runs it without approval.
  • Action 2 (Moderate, with approval). Reduce innodb_buffer_pool_size to the computed safe value and restart MariaDB. SGuard generates a unified diff against /etc/my.cnf, posts it to the on-call channel, and waits for a human approval before running. After approval it edits the file, re-runs the config parse check from step 3, performs systemctl restart mariadb, and tails the error log for thirty seconds.
  • Action 3 (Safe, auto). Verify MariaDB stays up for five minutes after the restart, run a mysqlcheck --all-databases --quick light load, and write the incident note.

The honest limit is the limit the math implies. If the proposed reduction is larger than 30% of the current configured value, SGuard does not auto-queue the moderate action. A 30%+ drop in buffer pool size changes query plans and affects read latency in ways that need a human eye on the affected sites. In that case SGuard escalates to a named on-call engineer with the diagnostic context and the proposed plan, and waits.

The point is not that SGuard replaces the four-step flow above. The point is that SGuard does the flow before the human wakes up, and hands the human a diff to approve instead of a database to revive. Most of the value of the product is in the first ten minutes after the alert.

If you operate cPanel servers and the 03:14 page in this post is a page you have answered yourself, the waitlist is open. We are onboarding hosting agencies through the rest of the quarter.
