
Analysis of a Database OOM Incident Caused by a Hidden HugePages Configuration

I. The Incident

Early on a Sunday morning, an urgent SMS alert reported that the database instance had restarted unexpectedly. The first step was to log in to the database server and check the error log:

2025-12-21T06:54:57.259156+08:00 77 [Note] [MY-010914] [Server] Aborted connection 77 to db: 'unconnected' user: 'root' host: '172.17.139.203' (Got an error reading communication packets).
2025-12-21T06:55:33.224314Z mysqld_safe Number of processes running now: 0
2025-12-21T06:55:33.248143Z mysqld_safe mysqld restarted
2025-12-21T06:55:34.053462+08:00 0 [Warning] [MY-011069] [Server] The syntax '--replica-parallel-type' is deprecated and will be removed in a future release.
2025-12-21T06:55:34.053569+08:00 0 [Warning] [MY-011068] [Server] The syntax '--ssl=off' is deprecated and will be removed in a future release. Please use --tls-version='' instead.

From these log entries, the initial conclusion was that the restart was caused by an OOM event. Checking the system log /var/log/messages directly confirmed the oom-killer records:

[root@gdb-adm ~]# grep -in oom /var/log/messages
5:Dec 21 06:55:33 gdb kernel: [419827.630493] crontab-1 invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
11:Dec 21 06:55:33 gdb kernel: [419827.630530]  oom_kill_process+0x24f/0x270
12:Dec 21 06:55:33 gdb kernel: [419827.630532]  ? oom_badness+0x25/0x140
68:Dec 21 06:55:33 gdb kernel: [419827.630752] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
148:Dec 21 06:55:33 gdb kernel: [419827.631062] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/user.slice/user-2036.slice/session-6188.scope,task=mysqld,pid=2567710,uid=2032

II. Problem Analysis

1. Memory settings check

The server has 376 GB of physical memory, and innodb_buffer_pool_size is set to 200 GB, about 53% of the total, which is within expectations.

[root@gdb ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:          376Gi       267Gi        26Gi       5.0Mi        82Gi        53Gi
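As a quick sanity check, the 53% figure can be reproduced with a one-liner (a minimal sketch; the 200 GB and 376 GB values come from the text above):

```shell
# Ratio of innodb_buffer_pool_size (200 GB) to physical RAM (376 GB)
awk 'BEGIN { printf "buffer pool uses %.0f%% of RAM\n", 200 / 376 * 100 }'
```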

2. Checking jemalloc

When GreatSQL (or upstream MySQL) hits an OOM, a common culprit is the default glibc memory allocator, which may not fully return freed memory to the OS and can effectively leak it. The allocator in use can be checked with lsof -p PID | grep jem:

[root@gdb ~]# lsof -p 25424 | grep jem
mysqld 25424 mysql  mem       REG                8,2    2136088   2355262 /data/svr/greatsql/lib/mysql/libjemalloc.so.1

The output shows that jemalloc is loaded, so this cause can largely be ruled out.

3. Detailed OOM log analysis

1) Full OOM log

Dec 21 06:55:33 gdb kernel: [419827.630493] crontab-1 invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Dec 21 06:55:33 gdb kernel: [419827.630499] CPU: 14 PID: 9458 Comm: crontab-1 Kdump: loaded Not tainted 4.19.90-2107.6.0.0227.28.oe1.bclinux.x86_64 #1
Dec 21 06:55:33 gdb kernel: [419827.630500] Hardware name: FiberHome FitServer/FiberHome Boards, BIOS 3.4.V7 02/01/2023
Dec 21 06:55:33 gdb kernel: [419827.630507] Call Trace:
Dec 21 06:55:33 gdb kernel: [419827.630519]  dump_stack+0x66/0x8b
Dec 21 06:55:33 gdb kernel: [419827.630527]  dump_header+0x4a/0x1fc
Dec 21 06:55:33 gdb kernel: [419827.630530]  oom_kill_process+0x24f/0x270
Dec 21 06:55:33 gdb kernel: [419827.630532]  ? oom_badness+0x25/0x140
Dec 21 06:55:33 gdb kernel: [419827.630533]  out_of_memory+0x11f/0x540
Dec 21 06:55:33 gdb kernel: [419827.630536]  __alloc_pages_slowpath+0x9f5/0xde0
Dec 21 06:55:33 gdb kernel: [419827.630543]  __alloc_pages_nodemask+0x2a8/0x2d0
Dec 21 06:55:33 gdb kernel: [419827.630549]  filemap_fault+0x35e/0x8a0
Dec 21 06:55:33 gdb kernel: [419827.630555]  ? alloc_set_pte+0x244/0x450
Dec 21 06:55:33 gdb kernel: [419827.630558]  ? filemap_map_pages+0x28f/0x480
Dec 21 06:55:33 gdb kernel: [419827.630584]  ext4_filemap_fault+0x2c/0x40 [ext4]
Dec 21 06:55:33 gdb kernel: [419827.630588]  __do_fault+0x33/0x110
Dec 21 06:55:33 gdb kernel: [419827.630592]  do_fault+0x12e/0x490
Dec 21 06:55:33 gdb kernel: [419827.630595]  ? __handle_mm_fault+0x2a/0x690
Dec 21 06:55:33 gdb kernel: [419827.630597]  __handle_mm_fault+0x613/0x690
Dec 21 06:55:33 gdb kernel: [419827.630601]  handle_mm_fault+0xc4/0x200
Dec 21 06:55:33 gdb kernel: [419827.630604]  __do_page_fault+0x2ba/0x4d0
Dec 21 06:55:33 gdb kernel: [419827.630609]  ? __audit_syscall_exit+0x238/0x2c0
Dec 21 06:55:33 gdb kernel: [419827.630611]  do_page_fault+0x31/0x130
Dec 21 06:55:33 gdb kernel: [419827.630616]  ? page_fault+0x8/0x30
Dec 21 06:55:33 gdb kernel: [419827.630620]  page_fault+0x1e/0x30
Dec 21 06:55:33 gdb kernel: [419827.630623] Mem-Info:
Dec 21 06:55:33 gdb kernel: [419827.630635] active_anon:50985791 inactive_anon:354 isolated_anon:0#012 active_file:677 inactive_file:0 isolated_file:0#012 unevictable:0 dirty:105 writeback:123 unstable:0#012 slab_reclaimable:20583 slab_unreclaimable:49628#012 mapped:319 shmem:1323 pagetables:106803 bounce:0#012 free:5313776 free_pcp:5715 free_cma:0
Dec 21 06:55:33 gdb kernel: [419827.630638] Node 0 active_anon:100766572kB inactive_anon:556kB active_file:1384kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:76kB dirty:32kB writeback:0kB shmem:2276kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Dec 21 06:55:33 gdb kernel: [419827.630645] Node 1 active_anon:103176592kB inactive_anon:860kB active_file:1324kB inactive_file:80kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:1200kB dirty:388kB writeback:492kB shmem:3016kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Dec 21 06:55:33 gdb kernel: [419827.630650] Node 0 DMA free:15892kB min:824kB low:1028kB high:1232kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15976kB managed:15892kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Dec 21 06:55:33 gdb kernel: [419827.630654] lowmem_reserve[]: 0 1347 191666 191666 191666
Dec 21 06:55:33 gdb kernel: [419827.630661] Node 0 DMA32 free:833940kB min:72972kB low:91212kB high:109452kB active_anon:559420kB inactive_anon:8kB active_file:68kB inactive_file:0kB unevictable:0kB writepending:32kB present:1733384kB managed:1405672kB mlocked:0kB kernel_stack:52kB pagetables:1084kB bounce:0kB free_pcp:400kB local_pcp:0kB free_cma:0kB
Dec 21 06:55:33 gdb kernel: [419827.630666] lowmem_reserve[]: 0 0 190319 190319 190319
Dec 21 06:55:33 gdb kernel: [419827.630672] Node 0 Normal free:10117540kB min:10117912kB low:12647388kB high:15176864kB active_anon:100207152kB inactive_anon:548kB active_file:808kB inactive_file:0kB unevictable:0kB writepending:0kB present:198180864kB managed:194894048kB mlocked:0kB kernel_stack:13504kB pagetables:215840kB bounce:0kB free_pcp:536kB local_pcp:0kB free_cma:0kB
Dec 21 06:55:33 gdb kernel: [419827.630679] lowmem_reserve[]: 0 0 0 0 0
Dec 21 06:55:33 gdb kernel: [419827.630683] Node 1 Normal free:10287732kB min:10288284kB low:12860352kB high:15432420kB active_anon:103176592kB inactive_anon:860kB active_file:1324kB inactive_file:80kB unevictable:0kB writepending:880kB present:201326592kB managed:198175752kB mlocked:0kB kernel_stack:11836kB pagetables:210288kB bounce:0kB free_pcp:21924kB local_pcp:332kB free_cma:0kB
Dec 21 06:55:33 gdb kernel: [419827.630686] lowmem_reserve[]: 0 0 0 0 0
Dec 21 06:55:33 gdb kernel: [419827.630688] Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15892kB
Dec 21 06:55:33 gdb kernel: [419827.630694] Node 0 DMA32: 240*4kB (UME) 178*8kB (UME) 140*16kB (UME) 66*32kB (UME) 70*64kB (UME) 53*128kB (UME) 38*256kB (UME) 18*512kB (UE) 3*1024kB (U) 2*2048kB (UE) 193*4096kB (M) = 834640kB
Dec 21 06:55:33 gdb kernel: [419827.630702] Node 0 Normal: 3557*4kB (UE) 1963*8kB (UME) 651*16kB (UME) 1139*32kB (UME) 855*64kB (UME) 572*128kB (UME) 308*256kB (UE) 129*512kB (UME) 50*1024kB (UME) 27*2048kB (UME) 2359*4096kB (UME) = 10118588kB
Dec 21 06:55:33 gdb kernel: [419827.630712] Node 1 Normal: 3636*4kB (UME) 1848*8kB (UME) 2744*16kB (UME) 2139*32kB (UME) 1580*64kB (UME) 1073*128kB (UME) 613*256kB (UME) 280*512kB (UE) 130*1024kB (UE) 81*2048kB (UE) 2273*4096kB (UME) = 10289648kB
Dec 21 06:55:33 gdb kernel: [419827.630731] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Dec 21 06:55:33 gdb kernel: [419827.630737] Node 0 hugepages_total=40960 hugepages_free=40960 hugepages_surp=0 hugepages_size=2048kB
Dec 21 06:55:33 gdb kernel: [419827.630738] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Dec 21 06:55:33 gdb kernel: [419827.630741] Node 1 hugepages_total=40960 hugepages_free=40960 hugepages_surp=0 hugepages_size=2048kB
Dec 21 06:55:33 gdb kernel: [419827.630742] 3360 total pagecache pages
Dec 21 06:55:33 gdb kernel: [419827.630744] 0 pages in swap cache
Dec 21 06:55:33 gdb kernel: [419827.630746] Swap cache stats: add 0, delete 0, find 0/0
Dec 21 06:55:33 gdb kernel: [419827.630746] Free swap  = 0kB
Dec 21 06:55:33 gdb kernel: [419827.630747] Total swap = 0kB
Dec 21 06:55:33 gdb kernel: [419827.630748] 100314204 pages RAM
Dec 21 06:55:33 gdb kernel: [419827.630749] 0 pages HighMem/MovableOnly
Dec 21 06:55:33 gdb kernel: [419827.630749] 1691363 pages reserved
Dec 21 06:55:33 gdb kernel: [419827.630750] 0 pages hwpoisoned
Dec 21 06:55:33 gdb kernel: [419827.630750] Tasks state (memory values in pages):
Dec 21 06:55:33 gdb kernel: [419827.630752] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Dec 21 06:55:33 gdb kernel: [419827.630790] [    926]     0   926    72470      811   507904        0          -250 systemd-journal
Dec 21 06:55:33 gdb kernel: [419827.630794] [    960]     0   960     8269     1075    77824        0         -1000 systemd-udevd
Dec 21 06:55:33 gdb kernel: [419827.630798] [   1623]     0  1623      729       28    32768        0             0 mdadm
Dec 21 06:55:33 gdb kernel: [419827.630800] [   1672]     0  1672    23007      217    49152        0         -1000 auditd
Dec 21 06:55:33 gdb kernel: [419827.630803] [   1674]     0  1674     1568       90    36864        0             0 sedispatch
Dec 21 06:55:33 gdb kernel: [419827.630806] [   1712]     0  1712    78709      787    98304        0             0 ModemManager
Dec 21 06:55:33 gdb kernel: [419827.630808] [   1714]     0  1714      571       16    32768        0             0 acpid
Dec 21 06:55:33 gdb kernel: [419827.630811] [   1719]    81  1719     2891      845    49152        0          -900 dbus-daemon
Dec 21 06:55:33 gdb kernel: [419827.630813] [   1727]   992  1727      599       38    32768        0             0 lsmd
Dec 21 06:55:33 gdb kernel: [419827.630815] [   1730]     0  1730      619       33    32768        0             0 mcelog
Dec 21 06:55:33 gdb kernel: [419827.630817] [   1735]   999  1735   743772     1030   229376        0             0 polkitd
Dec 21 06:55:33 gdb kernel: [419827.630820] [   1736]     0  1736    77985      204    90112        0             0 rngd
Dec 21 06:55:33 gdb kernel: [419827.630827] [   1739]     0  1739     2711      421    49152        0             0 smartd
Dec 21 06:55:33 gdb kernel: [419827.630829] [   1741]     0  1741    20070      151    40960        0          -500 irqbalance
Dec 21 06:55:33 gdb kernel: [419827.630831] [   1743]     0  1743     4492      227    61440        0             0 systemd-machine
Dec 21 06:55:33 gdb kernel: [419827.630837] [   1753]     0  1753   114058      472   110592        0             0 abrtd
Dec 21 06:55:33 gdb kernel: [419827.630842] [   1794]     0  1794     4780      468    65536        0             0 systemd-logind
Dec 21 06:55:33 gdb kernel: [419827.630844] [   1830]     0  1830   263593      479   929792        0             0 abrt-dump-journ
Dec 21 06:55:33 gdb kernel: [419827.630846] [   1831]     0  1831   261511      460   925696        0             0 abrt-dump-journ
Dec 21 06:55:33 gdb kernel: [419827.630850] [   2802]     0  2802   199635      606   299008        0             0 esfdaemon
Dec 21 06:55:33 gdb kernel: [419827.630852] [   2803]     0  2803    72799    12101   200704        0             0 bare-agent
Dec 21 06:55:33 gdb kernel: [419827.630855] [   2805]     0  2805    59117      340    86016        0             0 cupsd
Dec 21 06:55:33 gdb kernel: [419827.630856] [   2810]     0  2810   251667      734  1376256        0             0 rsyslogd
Dec 21 06:55:33 gdb kernel: [419827.630863] [   2814]     0  2814     3350      227    53248        0         -1000 sshd
Dec 21 06:55:33 gdb kernel: [419827.630865] [   2815]     0  2815   117707     3324   143360        0             0 tuned
Dec 21 06:55:33 gdb kernel: [419827.630869] [   2828]     0  2828    65710      188    73728        0             0 gssproxy
Dec 21 06:55:33 gdb kernel: [419827.630872] [   2848]     0  2848    53496       92    45056        0             0 init.ohasd
Dec 21 06:55:33 gdb kernel: [419827.630874] [   2890]     0  2890      906       48    32768        0             0 atd
Dec 21 06:55:33 gdb kernel: [419827.630875] [   2896]     0  2896    53748      118    49152        0             0 crond
Dec 21 06:55:33 gdb kernel: [419827.630878] [   3692]     0  3692     3539      148    49152        0             0 xinetd
Dec 21 06:55:33 gdb kernel: [419827.630880] [   3978]     0  3978    10985      242    61440        0             0 master
Dec 21 06:55:33 gdb kernel: [419827.630884] [   4004]    89  4004    11331      527    69632        0             0 qmgr
Dec 21 06:55:33 gdb kernel: [419827.630888] [   4093]     0  4093    43766      216   221184        0             0 sddog
Dec 21 06:55:33 gdb kernel: [419827.630890] [   4112]     0  4112   285705      537   577536        0             0 sdmonitor
Dec 21 06:55:33 gdb kernel: [419827.630891] [   4233]     0  4233   134053      596   466944        0             0 sdcc
Dec 21 06:55:33 gdb kernel: [419827.630895] [   4259]     0  4259   168947     8371   667648        0             0 sdec
Dec 21 06:55:33 gdb kernel: [419827.630897] [   4284]     0  4284   286675     1588   778240        0             0 sdexam
Dec 21 06:55:33 gdb kernel: [419827.630899] [   4310]     0  4310   492216    50216  1331200        0             0 sdsvrd
Dec 21 06:55:33 gdb kernel: [419827.630906] [   4330]     0  4330    29248      278   278528        0             0 udcenter
Dec 21 06:55:33 gdb kernel: [419827.630908] [   8353]     0  8353     2184      321    45056        0             0 dhclient
Dec 21 06:55:33 gdb kernel: [419827.630910] [   9243]  1086  9243     5274      639    73728        0             0 systemd
Dec 21 06:55:33 gdb kernel: [419827.630915] [   9245]  1086  9245     6383     1015    73728        0             0 (sd-pam)
Dec 21 06:55:33 gdb kernel: [419827.630918] [   9348]  1086  9348   470112    50291   761856        0             0 java
Dec 21 06:55:33 gdb kernel: [419827.630920] [   9426]     0  9426     2184      323    45056        0             0 dhclient
Dec 21 06:55:33 gdb kernel: [419827.630922] [   9852]     0  9852    53214       26    36864        0             0 agetty
Dec 21 06:55:33 gdb kernel: [419827.630926] [  11463]  1002 11463     5276      639    73728        0             0 systemd
Dec 21 06:55:33 gdb kernel: [419827.630936] [  11465]  1002 11465     6383     1016    73728        0             0 (sd-pam)
Dec 21 06:55:33 gdb kernel: [419827.630942] [  11611]  1002 11611 14284908     1404   602112        0             0 agent60
Dec 21 06:55:33 gdb kernel: [419827.630945] [ 137615]     0 137615   136163     3215   147456        0             0 lvmdbusd
Dec 21 06:55:33 gdb kernel: [419827.630950] [ 796407]  2036 796407     5301      649    73728        0             0 systemd
Dec 21 06:55:33 gdb kernel: [419827.630952] [ 796409]  2036 796409    43812     1109    94208        0             0 (sd-pam)
Dec 21 06:55:33 gdb kernel: [419827.630954] [ 817343]  2032 817343    53508      130    53248        0             0 mysqld_safe
Dec 21 06:55:33 gdb kernel: [419827.630956] [2270020]  2032 2270020  2778466     1788  1466368        0             0 dbinit
Dec 21 06:55:33 gdb kernel: [419827.630958] [2567710]  2032 2567710 77307141 50817311 424357888        0             0 mysqld
Dec 21 06:55:33 gdb kernel: [419827.630960] [3453494]   998 3453494     1173       50    36864        0             0 chronyd
Dec 21 06:55:33 gdb kernel: [419827.630963] [3621338]    89 3621338    11065      249    65536        0             0 pickup
Dec 21 06:55:33 gdb kernel: [419827.630981] [3662845]     0 3662845     5297      648    73728        0             0 systemd
Dec 21 06:55:33 gdb kernel: [419827.630983] [3662881]     0 3662881    44244     1356    98304        0             0 (sd-pam)
Dec 21 06:55:33 gdb kernel: [419827.630985] [3662906]    89 3662906    11068      242    65536        0             0 trivial-rewrite
Dec 21 06:55:33 gdb kernel: [419827.630987] [3663080]     0 3663080    10991      235    65536        0             0 local
Dec 21 06:55:33 gdb kernel: [419827.630988] [3663097]    89 3663097    11131      254    65536        0             0 smtp
Dec 21 06:55:33 gdb kernel: [419827.630990] [3663098]     0 3663098    10991      235    65536        0             0 local
Dec 21 06:55:33 gdb kernel: [419827.630992] [3663108]    89 3663108    11073      242    65536        0             0 bounce
Dec 21 06:55:33 gdb kernel: [419827.630994] [3663141]     0 3663141    10991      235    65536        0             0 local
Dec 21 06:55:33 gdb kernel: [419827.630997] [3663177]    89 3663177    11066      242    69632        0             0 flush
Dec 21 06:55:33 gdb kernel: [419827.631003] [3663193]    89 3663193    11066      242    69632        0             0 flush
Dec 21 06:55:33 gdb kernel: [419827.631005] [3663201]    89 3663201    11066      242    69632        0             0 flush
Dec 21 06:55:33 gdb kernel: [419827.631007] [3663207]     0 3663207    53463       54    45056        0             0 sh
Dec 21 06:55:33 gdb kernel: [419827.631011] [3663208]     0 3663208   884643     7048   589824        0             0 promtail
Dec 21 06:55:33 gdb kernel: [419827.631019] [3663317]    89 3663317    11131      254    65536        0             0 smtp
Dec 21 06:55:33 gdb kernel: [419827.631023] [3663318]    89 3663318    11131      254    65536        0             0 smtp
Dec 21 06:55:33 gdb kernel: [419827.631025] [3663319]    89 3663319    11131      254    65536        0             0 smtp
Dec 21 06:55:33 gdb kernel: [419827.631026] [3663320]    89 3663320    11131      254    65536        0             0 smtp
Dec 21 06:55:33 gdb kernel: [419827.631028] [3663321]    89 3663321    11064      242    65536        0             0 error
Dec 21 06:55:33 gdb kernel: [419827.631030] [3663322]    89 3663322    11064      242    65536        0             0 error
Dec 21 06:55:33 gdb kernel: [419827.631032] [3663388]     0 3663388    53093       15    40960        0             0 sleep
Dec 21 06:55:33 gdb kernel: [419827.631048] [3663946]     0 3663946     4458       86    61440        0             0 systemd-cgroups
Dec 21 06:55:33 gdb kernel: [419827.631060] [3663947]     0 3663947     4071       84    57344        0             0 systemd-cgroups
Dec 21 06:55:33 gdb kernel: [419827.631062] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/user.slice/user-2036.slice/session-6188.scope,task=mysqld,pid=2567710,uid=2032
Dec 21 06:55:33 gdb kernel: [419827.631071] Out of memory: Kill process 2567710 (mysqld) score 516 or sacrifice child
Dec 21 06:55:33 gdb kernel: [419827.632542] Killed process 2567710 (mysqld) total-vm:309228564kB, anon-rss:203269244kB, file-rss:0kB, shmem-rss:0kB

2) What happened

Dec 21 06:55:33 gdb kernel: [419827.630493] crontab-1 invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Dec 21 06:55:33 gdb kernel: [419827.632542] Killed process 2567710 (mysqld) total-vm:309228564kB, anon-rss:203269244kB, file-rss:0kB, shmem-rss:0kB

The key facts above: process crontab-1 requested new memory and invoked the oom-killer, and the process that was killed was mysqld, which occupied 203269244 kB of anonymous RSS.
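The kernel reports anon-rss in kB; a quick conversion shows the figure in GB:

```shell
# Convert mysqld's anon-rss from the OOM record (203269244 kB) to GB
echo "mysqld anon-rss: ~$(( 203269244 / 1024 / 1024 )) GB"
```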

3) NUMA usage analysis

Dec 21 06:55:33 gdb kernel: [419827.630672] Node 0 Normal free:10117540kB min:10117912kB low:12647388kB high:15176864kB active_anon:100207152kB inactive_anon:548kB active_file:808kB inactive_file:0kB unevictable:0kB writepending:0kB present:198180864kB managed:194894048kB mlocked:0kB kernel_stack:13504kB pagetables:215840kB bounce:0kB free_pcp:536kB local_pcp:0kB free_cma:0kB
Dec 21 06:55:33 gdb kernel: [419827.630679] lowmem_reserve[]: 0 0 0 0 0
Dec 21 06:55:33 gdb kernel: [419827.630683] Node 1 Normal free:10287732kB min:10288284kB low:12860352kB high:15432420kB active_anon:103176592kB inactive_anon:860kB active_file:1324kB inactive_file:80kB unevictable:0kB isolated(anon):0kB writepending:880kB present:201326592kB managed:198175752kB mlocked:0kB kernel_stack:11836kB pagetables:210288kB bounce:0kB free_pcp:21924kB local_pcp:332kB free_cma:0kB

These log lines show that free memory on both NUMA nodes had dropped below the min watermark.
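The watermark comparison can be re-checked mechanically; this is a self-contained sketch that parses the free: and min: fields of the two log lines quoted above:

```shell
# For each "Node N Normal" line, extract free and min (kB) and compare
printf '%s\n' \
  'Node 0 Normal free:10117540kB min:10117912kB' \
  'Node 1 Normal free:10287732kB min:10288284kB' |
awk '{
  free = $4; min = $5
  gsub(/[^0-9]/, "", free); gsub(/[^0-9]/, "", min)
  printf "Node %s: free=%d kB min=%d kB -> %s\n", $2, free, min, (free + 0 < min + 0) ? "below min watermark" : "ok"
}'
```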

4) Memory accounting

Based on the OOM log, memory was roughly distributed as follows (note: the rss column in the system log is in pages, 4 KB each by default):

Item              Memory used
mysqld            193 GB
other processes   641 MB
NUMA free         19.5 GB

This total is far below the server's 376 GB of physical memory; roughly 163 GB is unaccounted for.
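The gap can be reproduced with simple arithmetic (figures from the table above, with "other processes" rounded to 0.64 GB):

```shell
# Physical RAM minus everything the OOM log accounts for
awk 'BEGIN { printf "unaccounted: ~%.0f GB\n", 376 - 193 - 0.64 - 19.5 }'
```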

5) HugePages analysis

Continuing through the system log:

Dec 21 06:55:33 gdb kernel: [419827.630731] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Dec 21 06:55:33 gdb kernel: [419827.630737] Node 0 hugepages_total=40960 hugepages_free=40960 hugepages_surp=0 hugepages_size=2048kB
Dec 21 06:55:33 gdb kernel: [419827.630738] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Dec 21 06:55:33 gdb kernel: [419827.630741] Node 1 hugepages_total=40960 hugepages_free=40960 hugepages_surp=0 hugepages_size=2048kB

Which parses to:

Node     Page size  Total pages  Free pages
node 0   2 MB       40960        40960
node 0   1 GB       0            0
node 1   2 MB       40960        40960
node 1   1 GB       0            0

So HugePages had reserved 2 MB x 40960 x 2 = 160 GB of memory, none of it in use, which closely matches the gap in the memory accounting.
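The reservation follows directly from the log values (2048 kB pages, 40960 per node, 2 NUMA nodes):

```shell
# 2 MB hugepages reserved: 2048 kB/page x 40960 pages x 2 nodes, in GB
echo "reserved hugepages: $(( 2048 * 40960 * 2 / 1024 / 1024 )) GB"
```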

4. Checking the HugePages configuration

1) Transparent huge pages (THP)

Checking cat /sys/kernel/mm/transparent_hugepage/enabled confirms THP is disabled:

[root@gdb ~]#  cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
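The active mode is the bracketed word; it can be extracted with sed (a small sketch using the output shown above):

```shell
# Pull the active THP mode (the value in square brackets) out of the output
echo 'always madvise [never]' | sed -n 's/.*\[\(.*\)\].*/\1/p'
```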

2) Static (traditional) HugePages

Running sysctl -p | grep vm shows no HugePages-related setting:

[root@gdb ~]#  sysctl -p | grep vm
vm.zone_reclaim_mode=0
vm.swappiness=1
vm.min_free_kbytes=20480000

3) Static HugePages vs. transparent huge pages

Where to check
  Static: vm.nr_hugepages in /etc/sysctl.conf
  THP:    /sys/kernel/mm/transparent_hugepage/enabled

Management model
  Static: statically pre-allocated. At boot (or when configured), the kernel immediately carves the requested number of huge pages out of physical memory. That memory is locked for hugepage use and cannot serve any other purpose, such as a process's ordinary small pages.
  THP:    dynamically managed. At run time the kernel merges small pages into a huge page based on access patterns (for example, 512 contiguous 4 KB pages being accessed frequently) and splits them back when no longer needed. It is an on-demand mechanism.

Configuration
  Static: 1. Temporary: sysctl -w vm.nr_hugepages=N. 2. Permanent: add vm.nr_hugepages=N to /etc/sysctl.conf, then reboot or run sysctl -p.
  THP:    1. Temporary: write always, madvise, or never to /sys/kernel/mm/transparent_hugepage/enabled. 2. Permanent: add transparent_hugepage=always to the GRUB_CMDLINE_LINUX variable in /etc/default/grub, then regenerate the GRUB config with grub2-mkconfig -o /boot/grub2/grub.cfg.

Memory usage
  Static: dedicated and exclusive. Once allocated, the pages occupy physical memory even when unused, which can waste memory.
  THP:    shared pool. Ordinary memory pages are converted only when needed, so utilization is higher.

Performance
  Static: stable and predictable. When an application (such as Oracle Database or Redis) explicitly requests huge pages via mmap() or shmget(), hugepage use is fully guaranteed, with no page-fault or merge overhead; performance is optimal and consistent.
  THP:    potentially variable. It usually helps (fewer TLB misses), but under memory pressure or fragmentation the kernel's merge/split work (the khugepaged thread) can cause unpredictable latency spikes, which hurts latency-sensitive applications.

Given the symptoms and the characteristics above, the suspicion was that static HugePages had been configured, locking up 160 GB that other processes could not use. Yet the configuration file contained no such setting, which was puzzling.

4) A deeper search

A broader search with grep -R "nr_hugepages" /etc revealed the problem:

[root@gdb ~]#  grep -R "nr_hugepages" /etc
/etc/sysctl.conf.bak-2025-07-13:vm.nr_hugepages=81920

The configuration file had been backed up and modified on July 13. Before the change it did contain a static HugePages setting, and its value (81920 pages) matches the total recorded in the system log (40960 pages on each of the two NUMA nodes).
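Independent of what is in /etc, the live reservation can be inspected at any time through standard kernel interfaces (a sketch; the sysfs path applies to 2 MB pages on NUMA systems):

```shell
# System-wide hugepage state: HugePages_Total/Free/Rsvd/Surp, Hugepagesize
grep -i '^huge' /proc/meminfo

# Per-NUMA-node count of reserved 2 MB hugepages
cat /sys/devices/system/node/node*/hugepages/hugepages-2048kB/nr_hugepages
```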

5) Config-change test

Testing showed that even with the static HugePages setting removed (commented out) in the configuration file, the hugepage reservation persisted:

[root@gdb ~]# cat /etc/sysctl.conf | grep h
kernel.shmall = 41943040
kernel.shmmax = 171798691840
kernel.shmmni=4096
#vm.hugetlb_shm_group=54321
#vm.nr_hugepages = 40960
[root@gdb ~]# sysctl -p | grep h
kernel.shmall = 41943040
kernel.shmmax = 171798691840
kernel.shmmni=4096
[root@gdb ~]# cat /proc/sys/vm/nr_hugepages
40960

This is because sysctl -p only applies the settings still present in the file; it does not reset values that were removed. If the OS is not rebooted after the change, the reserved memory has to be released manually:

[root@gdb ~]# echo 0 > /proc/sys/vm/nr_hugepages
[root@gdb ~]# cat /proc/sys/vm/nr_hugepages
0

III. Root Cause and Improvements

1) Root cause

A large number of HugePages had been reserved but were never actually used by the database, leaving too little ordinary memory and triggering the OOM killer.

2) An abnormal leftover HugePages configuration

The OS does not set nr_hugepages by default, so static HugePages were not considered in the initial analysis; only the memory accounting exposed the abnormal reservation. Follow-up verification showed that the server had been repurposed from earlier use and still carried leftover Oracle-related settings, which is why this hidden problem went undetected for so long. It is yet another small pitfall of the migration to domestic platforms.

3) Improvements

Add a static HugePages check-and-cleanup step to the initialization procedure for repurposed servers:

# Remove any HugePages-related lines from the persistent config
sed -i '/huge/d' /etc/sysctl.conf
# Confirm nothing HugePages-related remains active
sysctl -p | grep huge
# Release any hugepages already reserved in the running kernel
echo 0 > /proc/sys/vm/nr_hugepages
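The same cleanup can be rehearsed safely against a scratch copy of the config before touching the real file (a sketch; the file contents here are illustrative):

```shell
# Demonstrate the cleanup sed on a scratch copy of sysctl.conf
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
kernel.shmall = 41943040
vm.nr_hugepages = 40960
vm.hugetlb_shm_group = 54321
vm.swappiness = 1
EOF

# Same sed used in the init step: drop every line mentioning "huge"
sed -i '/huge/d' "$tmp"
cat "$tmp"
rm -f "$tmp"
```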
