References

How should the crashkernel parameter be configured for using kdump on RHEL6?

https://access.redhat.com/solutions/59432

Environment

  • Red Hat Enterprise Linux (RHEL) 6
  • x86 and x86_64 architecture

Issue

  • What is the correct crashkernel parameter for kdump to work?
  • “crashkernel reservation failed - memory is in use” errors when the kernel panics
  • When configuring the crashkernel parameter the kdump service either fails to start or starts with this warning:

Your running kernel is using more than 70% of the amount of space you reserved for 
kdump, you should consider increasing your crashkernel reservation
  • The kdump service fails to restart with the error below:

Please reserve memory by passing "crashkernel=X@Y" parameter to the kernel

Resolution

The kdump procedure

This warning means the kdump operation might fail, so the crashkernel parameter should be configured correctly. The kdump procedure works as follows:

  1. The normal kernel is booted with crashkernel=... as a kernel option, reserving some memory for the kdump kernel. The memory reserved by the crashkernel parameter is not available to the normal kernel during regular operation. It is reserved for later use by the kdump kernel.
  2. The system panics.
  3. The kdump kernel is booted using kexec. It uses the memory area that was reserved with the crashkernel parameter.
  4. The normal kernel’s memory is captured into a vmcore.

Note: Not reserving enough memory for the kdump kernel can lead to the kdump operation failing.

Configuring crashkernel on RHEL6.0 and RHEL6.1 kernels

The code for printing the warning:

Your running kernel is using more than 70% of the amount of space you reserved for 
kdump, you should consider increasing your crashkernel reservation

is part of the script /etc/init.d/kdump.

The involved code

  • First reads the Slab value from /proc/meminfo. Slab is the cache of in-kernel data structures. Its value depends on the total amount of RAM in the system as well as on other factors, so it is not constant and can change while the server is running.
  • If the Slab value is bigger than 70% of the memory that was reserved with the crashkernel parameter, the warning is printed.

Some mappings of RAM size to appropriate crashkernel values:

RAM size    crashkernel parameter    RAM / crashkernel factor
>0GB        128MB                    15
>2GB        256MB                    23
>6GB        512MB                    15
>8GB        768MB                    31

The last column contains the RAM/crashkernel factor.
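
The 70% comparison described above can be sketched in POSIX shell. This is only an illustration of the check, not the code from /etc/init.d/kdump; the function name slab_exceeds_threshold is hypothetical, and both arguments are assumed to be in kB.

```shell
# Hypothetical helper: succeeds (exit status 0) when the Slab value is
# larger than 70% of the crashkernel reservation. Both values are in kB.
slab_exceeds_threshold() {
    # Integer arithmetic: slab > 0.7 * reserved  <=>  slab * 100 > reserved * 70
    [ $(( $1 * 100 )) -gt $(( $2 * 70 )) ]
}

# Usage on a live system, assuming a 128 MB reservation:
#   slab_kb=$(awk '/^Slab:/ {print $2}' /proc/meminfo)
#   slab_exceeds_threshold "$slab_kb" $(( 128 * 1024 )) && echo "reservation may be too small"
```

Passing the values as arguments keeps the comparison separate from how Slab and the reservation size are obtained.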

The table is covered by the following crashkernel configuration:

crashkernel=0M-2G:128M,2G-6G:256M,6G-8G:512M,8G-:768M

For servers with more RAM, it is recommended to compute the crashkernel parameter using the factors observed so far: a factor of 15 stays on the safe side (possibly wasting memory), and a factor of 20 should also work. Note that the maximum amount of RAM that should be reserved here is 896M, as outlined in (private) bz580843.
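
As a rough sketch of the factor rule, the helper below divides the RAM size by a factor and caps the result at 896M. The function name crashkernel_by_factor is hypothetical, the RAM size is assumed to be given in MB, and integer division rounds down, so treat the result as a starting point to be verified with a test dump.

```shell
# Hypothetical helper: crashkernel_by_factor RAM_MB [FACTOR]
# Prints a crashkernel= value of RAM_MB / FACTOR, capped at 896M.
# FACTOR defaults to 15 (the safe side); 20 should also work per the text.
crashkernel_by_factor() (
    ram_mb=$1
    factor=${2:-15}
    size_mb=$(( ram_mb / factor ))
    # Never reserve more than the 896M maximum mentioned above.
    if [ "$size_mb" -gt 896 ]; then
        size_mb=896
    fi
    echo "crashkernel=${size_mb}M"
)

# Example: a 12 GB machine with the default factor
#   crashkernel_by_factor 12288
```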

Configuring crashkernel on RHEL6.2 (and later) kernels

Starting with RHEL6.2 kernels crashkernel=auto should be used. The kernel will automatically reserve an appropriate amount of memory for the kdump kernel.

Keep in mind that this is an algorithmically calculated reservation and might not meet the needs of all systems (especially configurations with many I/O cards and loaded drivers). Always verify that the memory reserved by crashkernel=auto is sufficient for the target machine by testing kdump. If it is not, reserve more memory with the syntax crashkernel=XM (X is the amount of memory to reserve, in megabytes).

Additionally some improvements have been made in the RHEL6.2 kernel which have reduced the overall memory requirements of kdump. For more details refer to article kdump memory usage improvements included in Red Hat Enterprise Linux 6.2.

The amount of memory reserved for the kdump kernel can be estimated with the following scheme:

base memory to be reserved = 128MB  
an additional 64MB added for each TB of physical RAM present in the system. So 
for example if a system has 1TB of memory 192MB (128MB + 64MB) will be reserved.
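
The estimate above is simple enough to write down as arithmetic. The sketch below assumes the RAM size is given in whole terabytes; how a partial terabyte is counted is not stated in the text, so that case is left out. The function name estimate_auto_reservation_mb is hypothetical.

```shell
# Hypothetical helper: estimated crashkernel=auto reservation in MB on
# RHEL 6.2 and later, given the number of whole TB of physical RAM.
estimate_auto_reservation_mb() {
    # 128 MB base plus 64 MB for each TB of RAM
    echo $(( 128 + 64 * $1 ))
}

# Example: the 1 TB case from the text
#   estimate_auto_reservation_mb 1
```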

Note: It is recommended to test and verify that kdump works on every system after all applications have been installed. The memory reserved by crashkernel=auto takes only typical RHEL configurations into account. Some hardware and larger configurations with many option cards may not work well with crashkernel=auto; in that case crashkernel=512M or more may be a reasonable starting size. Additionally, if 3rd party modules are used, more memory might have to be reserved. Thus, if a test dump fails, a good strategy is to check whether it works with crashkernel=768M@0M and, if it does, to debug the memory requirements further using the debug_mem_level option in /etc/kdump.conf. Until a test dump completes without failure, kdump should not be considered properly configured.

Note: Prior to the 6.3GA release, crashkernel=auto will only reserve memory on systems with 4GB or more physical memory. If the system has less than 4GB of memory the memory must be reserved by explicitly requesting the reservation size, for example: crashkernel=128M. Since the 6.3GA release (kernel-2.6.32-279.el6), this limit has been lowered to 2GB.

Note: Some environments still require manual configuration of the crashkernel option, for example if dumps to very large local filesystems are performed. Please refer to kdump fails with large ext4 file system because fsck.ext4 gets OOM-killed for details.

Further information

Root Cause

A number of improvements related to crashkernel=auto and memory requirements of kdump have been made in the RHEL6.2 kernel.

Diagnostic Steps

  • The method used (pre-6.2) to calculate the approx amount of ram the normal kernel is using (from the /etc/init.d/kdump):

KMEMINUSE=`awk '/Slab:.*/ {print $2}' /proc/meminfo`
  • Question: Is it possible to find out how much memory was reserved for the kdump kernel?

    Answer: Yes, by running cat /proc/cmdline. Even when the kernel was started with crashkernel=auto, /proc/cmdline contains the computed value that was reserved. To verify that crashkernel=auto was really used, check the contents of /var/log/dmesg. The relevant commands are:

  • cat /proc/cmdline
  • cat /sys/kernel/kexec_crash_size

  • Question: I found out that ‘sync; echo 3 > /proc/sys/vm/drop_caches’ frees up Slab. Can I run this regularly and then use a lower value for ‘crashkernel’? Answer: This is not recommended. The command drops filesystem caches, so data that processes request afterwards has to be read again from disk or other block devices, which degrades system performance.

  • Question: I set up kdump on my system, but when a dump is triggered the kdump kernel does not load completely. Answer: Are 3rd party drivers in use on the system that change the memory requirements? Does the system kdump successfully with crashkernel=768M@0M, or with another manual allocation larger than the amount that crashkernel=auto reserved for the crash kernel? If so, the debug_mem_level option in /etc/kdump.conf can be used to determine the required amount of memory, and the crashkernel reservation can then be reduced accordingly.
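
For the question about how much memory was reserved, /sys/kernel/kexec_crash_size reports the reservation in bytes. The small sketch below converts such a value to MB so it can be compared with the crashkernel= setting; the function name reserved_mb_from_bytes is hypothetical.

```shell
# Hypothetical helper: convert a byte count, as read from
# /sys/kernel/kexec_crash_size, to whole megabytes.
reserved_mb_from_bytes() {
    echo $(( $1 / 1024 / 1024 ))
}

# Usage on a live system:
#   reserved_mb_from_bytes "$(cat /sys/kernel/kexec_crash_size)"
```

A result of 0 would indicate that no memory is currently reserved for the crash kernel.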

Configuring kdump on CentOS to capture kernel crashes

http://www.361way.com/centos-kdump/3751.html

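I. What is kdump

kdump is an advanced kernel crash dump mechanism based on kexec. When the system crashes, kdump uses kexec to boot into a second kernel. This second kernel, usually called the capture kernel, boots with very little memory and captures the dump image. The first kernel reserves part of its memory for the second kernel to boot in. Because kdump boots the capture kernel through kexec, bypassing the BIOS, the memory of the first kernel is preserved. This is the essence of a kernel crash dump.

kdump needs two kernels with different purposes: the production kernel and the capture kernel. The production kernel is the kernel that the capture kernel serves. The capture kernel boots when the production kernel crashes and, together with its ramdisk, forms a minimal environment that collects and dumps the memory of the production kernel.

II. kdump execution flow

To make this easier to understand, I use three sets of diagrams to show the kdump execution flow. First, the two figures from Vivek Goyal's slides:

[Figure: kdump-design]

The next two figures, from the IBM technical forum, show the execution flow on RHEL 6.2 and SUSE 11.

Execution flow on RHEL 6.2:

[Figure: img]

Execution flow on SUSE 11:

[Figure: sles-kdump]

III. Installing and testing kdump

1. Installing the required packages

Installation on CentOS 6.x is used as the example here.

kexec-tools        kexec package
kernel-debuginfo   crash analysis package (must be installed separately; it is not in the yum repositories)

The install command is:

# yum -y install kernel kexec-tools

If a graphical configuration tool is needed, also install the system-config-kdump package.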

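2. grub kernel configuration

Edit the /boot/grub/grub.conf configuration file and add a crashkernel parameter to the boot entry in use.

The parameter format is:
crashkernel=nn[KMG]@ss[KMG]
nn is the amount of memory to reserve for the crash kernel; ss is the start address of the reserved memory.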
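
To illustrate the nn@ss format, the sketch below pulls the size and the optional start offset out of a kernel command line string. It is a minimal example that only handles the simple crashkernel=nn[KMG][@ss[KMG]] form, not crashkernel=auto or the range syntax; the function name parse_crashkernel and the output format are hypothetical.

```shell
# Hypothetical helper: print "size=<nn> offset=<ss>" for the first
# crashkernel= option in the given command line; fail if none is present.
parse_crashkernel() (
    set -f    # no filename expansion while splitting the command line
    for word in $1; do
        case $word in
            crashkernel=*)
                value=${word#crashkernel=}
                case $value in
                    *@*) echo "size=${value%%@*} offset=${value#*@}" ;;
                    *)   echo "size=$value offset=" ;;
                esac
                exit 0 ;;
        esac
    done
    exit 1    # no crashkernel= option found
)

# Usage on a live system:
#   parse_crashkernel "$(cat /proc/cmdline)"
```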

Examples:

root (hd0,0)
kernel /vmlinuz-2.6.18-92.el5 ro root=LABEL=/ crashkernel=256M@16M
initrd /initrd-2.6.18-92.el5.img
or
root (hd0,0)
kernel /vmlinuz-2.6.32-431.17.1.el6.x86_64 ro root=/dev/mapper/vg_centos-lv_root rd_NO_LUKS LANG=en_US.UTF-8 rd_LVM_LV=vg_centos/lv_swap rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=128M@48M rd_LVM_LV=vg_centos/lv_root  KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet
initrd /initramfs-2.6.32-431.17.1.el6.x86_64.img

After the change and a reboot, cat /proc/cmdline shows the kernel boot options. After my change and reboot, /proc/cmdline contained:

ro root=/dev/mapper/vg_centos-lv_root rd_NO_LUKS LANG=en_US.UTF-8 rd_LVM_LV=vg_centos/lv_swap rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=128M@48M rd_LVM_LV=vg_centos/lv_root KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet

Note: CentOS 7.x boots with grub2, and the configuration file is located at:

With UEFI boot:
/boot/efi/EFI/centos/grub.cfg
With BIOS boot:
/boot/grub2/grub.cfg

Tip: the value can be changed directly in grub.cfg with the grubby command (which also works with grub2 and UEFI boot), for example:

[root@localhost boot]# grubby --update-kernel=DEFAULT --args=crashkernel=128M

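3. Starting the kdump service

On CentOS 6.x, start the kdump service with the command below (on SUSE 11 Enterprise the kdump service is named boot.kdump):

# /etc/init.d/kdump start
Starting kdump: [FAILED]

The start failed. The /var/log/messages log shows:

kdump: No crashkernel parameter specified for running kernel

Notes:

a. I tried crashkernel=auto and crashkernel=128M@16M in the grub configuration, and the service failed to start. After seeing an example on Oracle's site I changed it to crashkernel=128M@48M, and the kdump service started successfully. Every change requires a reboot. I later looked into why a manually specified @xxxM offset can fail: if the second kernel overlaps the first kernel in the address space, the second kernel fails to boot. So the value can simply be set to crashkernel=128M.

b. The crashkernel value should also be adjusted to the actual amount of physical memory. If there is plenty of physical memory, it can be set to 256M or 512M.

c. After crashkernel is set, free -m shows memory reduced by the size configured before the @ sign. For example, with crashkernel=128M@48M, 128M of memory is missing.

Start the kdump service again:

[root@361way sysconfig]# /etc/init.d/kdump restart
Stopping kdump:                                            [  OK  ]
No kdump initial ramdisk found.                            [WARNING]
Rebuilding /boot/initrd-2.6.32-431.17.1.el6.x86_64kdump.img
Starting kdump:                                            [  OK  ]

A new file whose name ends in kdump.img now appears in the /boot directory.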

[root@361way boot]# ls
config-2.6.32-431.17.1.el6.x86_64
config-2.6.32-431.el6.x86_64
efi
elf-memtest86+-4.10
grub
initramfs-2.6.32-431.17.1.el6.x86_64.img
initramfs-2.6.32-431.el6.x86_64.img
initrd-2.6.32-431.17.1.el6.x86_64kdump.img
lost+found
memtest86+-4.10
symvers-2.6.32-431.17.1.el6.x86_64.gz
symvers-2.6.32-431.el6.x86_64.gz
System.map-2.6.32-431.17.1.el6.x86_64
System.map-2.6.32-431.el6.x86_64
vmlinuz-2.6.32-431.17.1.el6.x86_64
vmlinuz-2.6.32-431.el6.x86_64

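Note: CentOS 7.x uses systemd to manage services, so the service is started as follows:

# systemctl enable kdump.service    //start the service automatically at boot
# systemctl start kdump.service       //start the kdump service

4. Testing kdump with a simulated crash

After configuration, the machine must be rebooted to load the new kernel options. A kdump can then be simulated as follows:

# chkconfig kdump on   //start the service automatically at boot
# reboot            //reboot so that all the changes take effect
# sync
# echo c > /proc/sysrq-trigger

After this the machine reboots. Once the system is back up, a kdump (vmcore) file has been generated under /var/crash. Its contents can be analyzed with the crash command, which is covered separately below.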

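IV. Advanced kdump configuration

Two configuration files are related to kdump. One is /etc/sysconfig/kdump, which normally needs no changes. Some technical sites modify it when the kdump service fails to start, but if kdump was installed normally from the yum repositories this file does not need to be modified. The other is /etc/kdump.conf. The advanced configuration here mainly concerns /etc/kdump.conf; help on its options is available through man 5 kdump.conf. Only the commonly used parts are listed here:

1. Setting where the kdump file is generated

Two parts control the path:

#raw /dev/sda5
#ext4 /dev/sda3
#ext4 LABEL=/boot
#ext4 UUID=03138356-5e61-4ab3-b58e-27507ac41937
#nfs my.server.com:/export/tmp
#ssh user@my.server.com
path /var/crash

The first part sets the storage device or partition: a raw device, a local partition, the local mount point of a network path, or transfer over ssh. path is the storage path relative to that target. For example, if a remote partition is mounted over nfs at /mnt, the kdump file is stored under /mnt/var/crash/. If the first part is not set, the path is relative to the root partition, that is, /var/crash.

Note in particular that transfer over ssh requires key authentication. Running /etc/init.d/kdump propagate sets up ssh key authentication, as follows:

In kdump.conf, specify ssh network transfer:

ssh root@192.168.0.100/data/

Running the command below sets up key authentication from this host to host 192.168.0.100:

# service kdump propagate
Generating new ssh keys… done.
The authenticity of host '192.168.0.100 (192.168.0.100)' can't be established.
RSA key fingerprint is 31:c2:d8:b6:eb:2e:03:64:cd:ba:56:e9:49:6e:5d:6c.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.0.100' (RSA) to the list of known hosts.
root@192.168.0.100's password:
/root/.ssh/kdump_id_rsa.pub has been added to ~root/.ssh/authorized_keys2 on 192.168.0.100

With the configuration above, when a kdump is generated it is transferred with scp and stored under /data/var/crash on host 192.168.0.100.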

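2. core_collector

This setting is the key to how much information is collected. It mainly uses the makedumpfile command; the default configuration on CentOS is:

core_collector makedumpfile -c --message-level 1 -d 31

-c enables zlib compression of the data.

--message-level sets the level of messages that are printed. 1 prints only the progress indicator; the default is 7. See the table below: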

Message | progress    common    error     debug     report
Level   | indicator   message   message   message   message
---------+------------------------------------------------------
       0 |      
       1 |     X
       2 |                X
       4 |                          X
     * 7 |     X          X         X
       8 |                                    X     
      16 |                                              X     
      31 |     X          X         X         X         X

-d sets the kdump filter (dump) level. See the table below:

        |         cache    cache
  Dump  |  zero   without  with     user    free 
  Level |  page   private  private  data    page
 -------+---------------------------------------
     0  |    
     1  |   X    
     2  |           X    
     4  |           X       X    
     8  |                           X   
     16 |                                    X   
     31 |   X        X      X       X        X

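31 filters out all five page types above, so the dump is generated faster and the vmcore file is smaller. A value of 0 filters nothing, and the dump records all of the host's memory. This is why some hosts produce a dump of only a few tens of MB while others produce tens of GB. See the makedumpfile documentation for more options.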
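
The dump level is a bitmask, so a value such as 17 can be decoded by testing each bit. The sketch below follows the table as printed above, including the row for level 4, which marks both cache columns. The function name decode_dump_level and the page-type labels are hypothetical names chosen for this example.

```shell
# Hypothetical helper: list the page types that a makedumpfile -d level
# excludes, following the dump level table above.
decode_dump_level() (
    level=$1
    excluded=""
    add() { excluded="${excluded:+$excluded }$1"; }
    if [ $(( level & 1 )) -ne 0 ]; then add zero-page; fi
    # Per the table, both level 2 and level 4 drop cache pages without private data.
    if [ $(( level & 6 )) -ne 0 ]; then add cache-without-private; fi
    if [ $(( level & 4 )) -ne 0 ]; then add cache-with-private; fi
    if [ $(( level & 8 )) -ne 0 ]; then add user-data; fi
    if [ $(( level & 16 )) -ne 0 ]; then add free-page; fi
    echo "$excluded"
)

# Example: decode the default level from the core_collector line
#   decode_dump_level 31
```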

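3. Setting the default action

For this setting I also looked at configurations found online. Some technical documents use default reboot, while the default is default shell. The difference between the two is:

reboot: If the default action is reboot simply reboot the system and lose the core that you are trying to retrieve.
shell:  If the default action is shell, then drop to an hush session inside the initramfs from where you can try to record the core manually. Exiting this shell reboots the system.

The explanation in /usr/share/doc/kexec-tools-2.0.0/kexec-kdump-howto.txt is easier to understand:

reboot --> reboot the system.
shell  --> drop to a shell with-in initrd. A user can try to capture the vmcore manually.

As this explanation shows, shell lets you do some things by hand, while reboot simply reboots the system after the kdump has been generated. Besides these two options, poweroff and halt are also available. Unless the purpose is technical research, I doubt anyone would choose to leave a production system halted after a kdump.

Besides the three settings above there are other options, such as debug_mem_level. See the kdump.conf man page for details.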

V. Analyzing the result with crash

The crash package must be installed separately with yum -y install crash. The crash command also depends on the kernel-debuginfo package (which in turn depends on kernel-debuginfo-common), available from http://debuginfo.centos.org/6/x86_64/ . Check the kernel version of your host before downloading. On my test machine I ran:

# uname -r
2.6.32-431.17.1.el6.x86_64
# wget http://debuginfo.centos.org/6/x86_64/kernel-debuginfo-common-x86_64-2.6.32-431.17.1.el6.x86_64.rpm
# wget http://debuginfo.centos.org/6/x86_64/kernel-debuginfo-2.6.32-431.17.1.el6.x86_64.rpm

After the download, install both packages with rpm -ivh, then run the crash analysis with the commands below:

# pwd
/var/crash/127.0.0.1-2014-09-16-14:35:49
# crash /usr/lib/debug/lib/modules/2.6.32-431.17.1.el6.x86_64/vmlinux vmcore
crash 6.1.0-5.el6
Copyright (C) 2002-2012  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

GNU gdb (GDB) 7.3.1
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

      KERNEL: /usr/lib/debug/lib/modules/2.6.32-431.17.1.el6.x86_64/vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 1
        DATE: Tue Sep 16 22:35:49 2014
      UPTIME: 00:05:33
LOAD AVERAGE: 0.00, 0.00, 0.00
       TASKS: 175
    NODENAME: localhost.localdomain
     RELEASE: 2.6.32-431.17.1.el6.x86_64
     VERSION: #1 SMP Wed May 7 23:32:49 UTC 2014
     MACHINE: x86_64  (3398 Mhz)
      MEMORY: 1 GB
       PANIC: "Oops: 0002 [#1] SMP " (check log for details)
         PID: 1412
     COMMAND: "bash"
        TASK: ffff88003d0b2040  [THREAD_INFO: ffff88003c33c000]
         CPU: 0
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 1412   TASK: ffff88003d0b2040  CPU: 0   COMMAND: "bash"
 #0 [ffff88003c33d9e0] machine_kexec at ffffffff81038f3b
 #1 [ffff88003c33da40] crash_kexec at ffffffff810c59f2
 #2 [ffff88003c33db10] oops_end at ffffffff8152b7f0
 #3 [ffff88003c33db40] no_context at ffffffff8104a00b
 #4 [ffff88003c33db90] __bad_area_nosemaphore at ffffffff8104a295
 #5 [ffff88003c33dbe0] bad_area at ffffffff8104a3be
 #6 [ffff88003c33dc10] __do_page_fault at ffffffff8104ab6f
 #7 [ffff88003c33dd30] do_page_fault at ffffffff8152d73e
 #8 [ffff88003c33dd60] page_fault at ffffffff8152aaf5
    [exception RIP: sysrq_handle_crash+22]
    RIP: ffffffff8134b516  RSP: ffff88003c33de18  RFLAGS: 00010096
    RAX: 0000000000000010  RBX: 0000000000000063  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000063
    RBP: ffff88003c33de18   R8: 0000000000000000   R9: ffffffff81645da0
    R10: 0000000000000001  R11: 0000000000000000  R12: 0000000000000000
    R13: ffffffff81b01a40  R14: 0000000000000286  R15: 0000000000000004
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff88003c33de20] __handle_sysrq at ffffffff8134b7d2
#10 [ffff88003c33de70] write_sysrq_trigger at ffffffff8134b88e
#11 [ffff88003c33dea0] proc_reg_write at ffffffff811f2f1e
#12 [ffff88003c33def0] vfs_write at ffffffff81188c38
#13 [ffff88003c33df30] sys_write at ffffffff81189531
#14 [ffff88003c33df80] system_call_fastpath at ffffffff8100b072
    RIP: 00000036e3adb7a0  RSP: 00007fff22936c10  RFLAGS: 00010206
    RAX: 0000000000000001  RBX: ffffffff8100b072  RCX: 0000000000000400
    RDX: 0000000000000002  RSI: 00007fab7908b000  RDI: 0000000000000001
    RBP: 00007fab7908b000   R8: 000000000000000a   R9: 00007fab79084700
    R10: 00000000ffffffff  R11: 0000000000000246  R12: 0000000000000002
    R13: 00000036e3d8e780  R14: 0000000000000002  R15: 00000036e3d8e780
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
crash>

The above simply prints the stack trace, showing that the crash occurred while the bash process with pid 1412 was running. The trace also shows that the write_sysrq_trigger function was involved. To help locate the cause of a problem, crash provides the following commands:

crash> ?
*              files          mach           repeat         timer
alias          foreach        mod            runq           tree
ascii          fuser          mount          search         union
bt             gdb            net            set            vm
btop           help           p              sig            vtop
dev            ipcs           ps             struct         waitq
dis            irq            pte            swap           whatis
eval           kmem           ptob           sym            wr
exit           list           ptov           sys            q
extend         log            rd             task

crash version: 6.1.0-5.el6   gdb version: 7.3.1
For help on any command above, enter "help <command>".
For help on input options, enter "help input".
For help on output options, enter "help output".

Since crash offers a lot of commands, here is a script, written for SUSE, that collects the information:

#!/bin/sh
# Collect common crash reports into /tmp/kdump.
# All arguments are passed through to crash (kernel image, debuginfo, vmcore).
mkdir -p /tmp/kdump
crash "$@" <<EOF >/tmp/kdump/kdumpoutput.txt 2>&1
log >/tmp/kdump/log.txt
sys >/tmp/kdump/sys.txt
bt >/tmp/kdump/bt.txt
foreach bt >/tmp/kdump/all-bt.txt
foreach files >/tmp/kdump/all-files.txt
ps >/tmp/kdump/ps.txt
swap >/tmp/kdump/swap.txt
runq >/tmp/kdump/runq.txt
mount >/tmp/kdump/mount.txt
net >/tmp/kdump/net.txt
dev >/tmp/kdump/dev.txt
dev -i >/tmp/kdump/dev-i.txt
dev -p >/tmp/kdump/dev-p.txt
files >/tmp/kdump/files.txt
irq >/tmp/kdump/irq.txt
kmem -f >/tmp/kdump/pmemory.txt
kmem -i >/tmp/kdump/memory.txt
mach >/tmp/kdump/mach.txt
mod >/tmp/kdump/modules.txt
net -s >/tmp/kdump/net-s.txt
ps -t >/tmp/kdump/ps-t.txt
ps -c >/tmp/kdump/ps-c.txt
sig >/tmp/kdump/sig.txt
set >/tmp/kdump/set.txt
task >/tmp/kdump/task.txt
foreach task >/tmp/kdump/all-task.txt
sym -l >/tmp/kdump/sys-l.txt
sym -M >/tmp/kdump/sys-M.txt
quit
EOF

Run the script above as follows:

# getcoreinfo.sh -f vmlinux-3.0.76-0.11-default.gz vmlinux-3.0.76-0.11-default.debug vmcore

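VI. sysctl settings related to kdump

Many of the kdump resources I found online also adjust some settings in sysctl.conf when configuring kdump. They are listed here and can be changed as the situation requires.

kernel.sysrq=1
kernel.unknown_nmi_panic=1
kernel.softlockup_panic=1

kernel.sysrq=1: set through /proc, this is equivalent to echo 1 > /proc/sys/kernel/sysrq. Once the sysrq key is enabled, users with terminal access gain some special capabilities. If the system hangs, or when diagnosing kernel-related problems, these key combinations print kernel information immediately. Therefore, do not enable this feature unless you are debugging or solving a problem. If you must enable it, make sure access to your terminal is secure. See the explanation on Baidu Baike for details.

kernel.unknown_nmi_panic=1: if the system is already hung, the NMI button can be used to trigger kdump. Enable this option with echo 1 > /proc/sys/kernel/unknown_nmi_panic. Note that NMI_WATCHDOG must not be enabled at the same time as this feature, otherwise the system will panic.

kernel.softlockup_panic=1: this corresponds to /proc/sys/kernel/softlockup_panic. A value of 1 makes the kernel panic and reboot on a deadlock or an endless loop. If kdump is installed on the machine, you get a kernel core file after the reboot, which makes the problem much easier to find, and the machine no longer has to be rebooted by hand. On a standard kernel, the timeout threshold can be changed through /proc/sys/kernel/softlockup_thresh; on a CentOS kernel the corresponding file is /proc/sys/kernel/watchdog_thresh.

In addition, some sites recommend enabling panic on oops; change this according to actual needs as well.

Reference pages: IBM developerWorks technical forum

Kdump & Crash study notes

PS: I have scripted the automatic kdump configuration and put it on my github.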

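Postscript:

In later use I found cases where kdump conflicted with existing modules, so that a kdump could never be generated. Here the VCS vxfs module and the Fusion-io iomemory-vsl4 module conflicted with kdump. They can be excluded with the blacklist parameter in /etc/kdump.conf (on SUSE, in /etc/sysconfig/kdump), as follows:

blacklist vxfs
blacklist iomemory-vsl4

A Red Hat engineer explained the blacklist parameter as follows: when kdump is triggered, the listed modules are not loaded on entering the second kernel (usually called the capture kernel or kdump kernel). The parameter only takes effect when a kdump occurs and does not affect normal operation of the system.

Also note that whenever the configuration changes, for example a changed dump path or an added blacklist entry, the kdump RAM image must be regenerated. Otherwise the old image is still used when a problem occurs. The image is the file in /boot whose name ends in kdump.img: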

#ls -l /boot
total 35024
-rw-r--r--. 1 root root   105195 Nov 11  2013 config-2.6.32-431.el6.x86_64
drwxr-xr-x. 3 root root     4096 Sep 15 12:12 efi
drwxr-xr-x. 2 root root     4096 Sep 22 16:44 grub
-rw-------. 1 root root 17135661 Sep 15 12:25 initramfs-2.6.32-431.el6.x86_64.img
-rw-------  1 root root 11743320 Sep 22 16:35 initrd-2.6.32-431.el6.x86_64kdump.img
drwx------. 2 root root    16384 Sep 15 12:01 lost+found
-rw-r--r--. 1 root root   193758 Nov 11  2013 symvers-2.6.32-431.el6.x86_64.gz
-rw-r--r--. 1 root root  2518236 Nov 11  2013 System.map-2.6.32-431.el6.x86_64
-rwxr-xr-x. 1 root root  4128944 Nov 11  2013 vmlinuz-2.6.32-431.el6.x86_64

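When the configuration changes, move the /boot/initrd-`uname -r`kdump.img file out of the way, then restart the kdump service to generate a new kdump.img file, as follows:

[Figure: kdump-rebuild]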
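
Since the original screenshot is not available, the rebuild step can be sketched as follows. The helper name kdump_initrd_path is hypothetical; it only builds the image path from a kernel release, following the naming seen in the /boot listing above, and the mv and restart commands are left as comments because they need root on a CentOS 6 host.

```shell
# Hypothetical helper: path of the kdump initrd for a kernel release
# (defaults to the running kernel).
kdump_initrd_path() {
    echo "/boot/initrd-${1:-$(uname -r)}kdump.img"
}

# Usage on CentOS 6, as root:
#   mv "$(kdump_initrd_path)" /tmp/     # move the old image out of the way
#   /etc/init.d/kdump restart           # the service rebuilds the image
```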

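Note: on SUSE the image is regenerated with the /etc/init.d/boot.kdump restart command.

After the kdump image has been regenerated, it is best to reboot the host. Another parameter in the kdump configuration worth noting is MKDUMPRD_ARGS="--allow-missing". With this parameter added, the kdump configuration is checked and the kdump.img file is rebuilt automatically each time the host boots.

Compressing a kdump

The command below compresses a vmcore. Try it to see whether the file can be compressed (it may take a while and consume some system resources). In effect it changes the dump from its original dump level to level 31:

makedumpfile -c -d 31 -x vmlinux-3.0.76-0.11-default.debug /xx/xx/vmcore /xx/shorter-vmcore

Also make sure the directory that holds /xx/shorter-vmcore has enough free space.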

CentOS / RHEL 7 : How to configure kdump

https://www.thegeekdiary.com/centos-rhel-7-how-to-configure-kdump/

kdump is an advanced crash dumping mechanism. When enabled, the system is booted from the context of another kernel. This second kernel reserves a small amount of memory, and its only purpose is to capture the core dump image in case the system crashes. Since being able to analyze the core dump helps significantly to determine the exact cause of the system failure, it is strongly recommended to have this feature enabled.

1. Install the kexec-tools package if not already installed

To use the kdump service, the kexec-tools package must be installed. If it is not, install it:

# yum install kexec-tools

2. Configuring Memory Usage in GRUB2

To configure the amount of memory reserved for the kdump kernel, edit /etc/default/grub and add the crashkernel=[size] parameter to the list of kernel options in GRUB_CMDLINE_LINUX.

GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="rd.lvm.lv=centos/swap vconsole.font=latarcyrheb-sun16 rd.lvm.lv=centos/root crashkernel=128M  vconsole.keymap=us rhgb quiet"
GRUB_DISABLE_RECOVERY="true"

Run command below to regenerate grub configuration :

# grub2-mkconfig -o /boot/grub2/grub.cfg

Reboot the system for the kernel parameter to take effect.

# shutdown -r now

3. Configuring Dump Location

To configure kdump, edit the configuration file /etc/kdump.conf. By default the vmcore file is stored in the /var/crash/ directory of the local file system. To change the local directory in which the core dump is saved, set the path option to the desired directory. For example:

path /usr/local/cores

Optionally, you can also save the core dump directly to a raw partition. For example:

raw /dev/sdb4

To store the dump to a remote machine using the NFS protocol, remove the hash sign (“#”) from the beginning of the #nfs my.server.com:/export/tmp line, and replace the value with a valid hostname and directory path. For example:

nfs my.server.com:/export/tmp

4. Configuring Core Collector

To reduce the size of the vmcore dump file, kdump allows you to specify an external application to compress the data and, optionally, leave out all irrelevant information. Currently, the only fully supported core collector is makedumpfile. To enable the core collector, edit /etc/kdump.conf, remove the hash sign (“#”) from the beginning of the #core_collector makedumpfile -c --message-level 1 -d 31 line, and edit the command line options as needed. For example:

core_collector makedumpfile -c

5. Changing Default Action

The default action to perform when the core dump fails to be written to the desired location can also be specified. If no default action is specified, “reboot” is assumed. For example:

default halt

6. Start kdump daemon

Check that the kernel command line includes the crashkernel parameter, so that memory is reserved for the crash kernel:

# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.8.13-98.2.1.el7uek.x86_64 root=/dev/mapper/rhel-root ro rd.lvm.lv=rhel/root crashkernel=128M rd.lvm.lv=rhel/swap vconsole.font=latarcyrheb-sun16 vconsole.keymap=us rhgb quiet nomodeset

Enable the kdump service so that it starts when the system boots:

# systemctl enable kdump.service

To start the service in the current session, use the following command:

# systemctl start kdump.service

7. Testing kdump (manually trigger kdump)

To test the configuration, reboot the system with kdump enabled and make sure that the service is running.

For example:

# systemctl is-active kdump
active
# service kdump status
Redirecting to /bin/systemctl status  kdump.service
kdump.service - Crash recovery kernel arming
Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled)
Active: active (exited) since 一 2015-08-31 05:12:57 GMT; 1min 6s ago
Process: 19104 ExecStop=/usr/bin/kdumpctl stop (code=exited, status=0/SUCCESS)
Process: 19116 ExecStart=/usr/bin/kdumpctl start (code=exited, status=0/SUCCESS)
Main PID: 19116 (code=exited, status=0/SUCCESS)
Aug 31 05:12:57 ol7 kdumpctl[19116]: kexec: loaded kdump kernel
Aug 31 05:12:57 ol7 kdumpctl[19116]: Starting kdump: [OK]
Aug 31 05:12:57 ol7 systemd[1]: Started Crash recovery kernel arming.

Then type the following commands at a shell prompt:

# echo 1 > /proc/sys/kernel/sysrq
# echo c > /proc/sysrq-trigger

This forces the Linux kernel to crash, and the address-YYYY-MM-DD-HH:MM:SS/vmcore file is copied to the location selected in the configuration (/var/crash/ by default).


Troubleshooting kdump Issues in CentOS/RHEL

https://www.thegeekdiary.com/troubleshooting-kdump-issues-in-centos-rhel/

The kdump mechanism is a Linux kernel feature that creates a dump when the kernel crashes. It writes a copy of memory to a file named vmcore, which is often required to investigate the root cause of the crash. The crash dump is captured from the context of a freshly booted kernel, not from the context of the crashed kernel: kdump uses kexec to boot into a second kernel whenever the system crashes. Kexec is a fast-boot mechanism that boots a new Linux kernel from the context of a running kernel without going through any firmware or warm start.

This post explains the steps to troubleshoot common kdump issues.

Verifying the kdump setup

1. Check if the kexec-tools package is installed in the system.

# rpm -qa | grep kexec

2. Check the kernel command line of the currently running kernel for the parameter ‘crashkernel’:

# cat /proc/cmdline

3. Check if the memory is reserved for the crashkernel when the kernel started:

# dmesg | grep Reserving

4. Check the path of the dump:

# grep -v ^# /etc/kdump.conf

5. Check the storage space available on the filesystem specified in the path parameter in the previous step:

# df -h

6. Check the status of the kdump service:

# service kdump status         ### In CentOS/RHEL 6
# systemctl status kdump       ### In CentOS/RHEL 7

When the kdump service is not operational

1. Verify the kdump setup following the above section.

2. Start the kdump service:

# service kdump start         ### In CentOS/RHEL 6
# systemctl start kdump       ### In CentOS/RHEL 7

3. Check the error reported on the terminal.

4. More information about a kdump service startup failure can be found in /var/log/messages.

When the kdump setup is fine and the service kdump status is operational but there is no vmcore generated on triggering a crash

1. Edit the file /etc/kdump.conf and add the line below to obtain a shell when vmcore generation fails:

default shell

2. In the shell, check the available storage and whether the vmcore destination filesystem is mounted, then try to copy the vmcore manually and see whether it fails.

# cp /proc/vmcore [destination]

When a shell is not obtained and the crashkernel is stuck while booting up

1. Check the messages on the console and look for the startup messages of the crashkernel. Look for where it is stuck.

The crashkernel is the same kernel that is started when the system comes up, so its messages resemble normal kernel boot messages, but with only a limited set of devices activated. For example, only 1 CPU is enabled in the crashkernel, and only the destination storage disk is detected.

2. If you see page allocation error messages, chances are high that the memory reserved for the crashkernel is not enough and the value of the ‘crashkernel’ kernel parameter needs to be increased.

Configuring kdump on CentOS 7 and analyzing a crash with the crash tool

http://smilejay.com/2016/04/centos7-kdump-configuration/

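1. About kdump and crash

kdump is a kernel crash dump mechanism that saves the system's memory when the kernel crashes, for later analysis. kdump is based on kexec. crash is a tool for interactively analyzing a running Linux system or the core dump data left by a kernel crash. Diagrams of how the dump works:

[Figure: kdump-vs-normal-boot]
[Figure: kdump-works]

2. Configuring kdump on CentOS 7

Add the crashkernel parameter to the kernel boot command line and start the kdump service. The usual setting is crashkernel=auto, which reserves memory for the crash kernel automatically based on system memory. On x86_64, memory is reserved when the system has 2 GB of RAM or more, and the minimum reservation is 160 MB + 2 bits for every 4 KB of RAM. A fixed reservation such as crashkernel=512M can also be used.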
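
The minimum reservation formula quoted above (160 MB + 2 bits for every 4 KB of RAM) can be worked through as arithmetic. Two bits per 4 KB page amounts to RAM / 16384, so 1 TB of RAM adds 64 MB. The function name min_auto_reservation_kb is hypothetical, and the sketch assumes the RAM size is given in KB.

```shell
# Hypothetical helper: minimum crashkernel=auto reservation in KB for a
# given amount of RAM in KB, using 160 MB + 2 bits per 4 KB of RAM.
min_auto_reservation_kb() {
    # 2 bits per 4096 bytes = 1 byte per 16384 bytes, so the extra part
    # is ram / 16384 (in the same unit as the input).
    echo $(( 160 * 1024 + $1 / 16384 ))
}

# Example: 1 TB of RAM is 1073741824 KB
#   min_auto_reservation_kb 1073741824
```

For 1 TB this gives 229376 KB, that is, 224 MB.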

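Install the kdump tools and service with yum install kexec-tools. systemctl start kdump starts kdump.service, and systemctl enable kdump makes the service start at boot. The kdump.service configuration file /etc/kdump.conf allows some defaults to be changed, such as the action after the dump completes (reboot by default) and how the dump file is stored (local directory, NFS, scp to another server, and so on).

3. Testing kdump

Run the following commands as root to crash the kernel.

# echo 1 > /proc/sys/kernel/sysrq
# echo c > /proc/sysrq-trigger

When the kernel dump completes, the system reboots. The memory data saved at crash time is found under /var/crash/ (the default location).

4. Analyzing with the crash tool

First install the matching kernel-debuginfo packages, for example:

wget http://debuginfo.centos.org/7/x86_64/kernel-debuginfo-common-x86_64-3.10.0-327.el7.x86_64.rpm
wget http://debuginfo.centos.org/7/x86_64/kernel-debuginfo-3.10.0-327.el7.x86_64.rpm

With the kernel-debuginfo packages installed, a crash command like the following starts an interactive analysis:

crash /usr/lib/debug/lib/modules/3.10.0-327.el7.x86_64/vmlinux /var/crash/127.0.0.1-2016-03-28-15\:28\:59/vmcore

Entering bt shows the backtrace of the kernel stack. See man crash for more crash commands.

Here is also a case where starting kdump.service failed for me: when memory is small, no memory is reserved for the crash kernel, starting kdump.service fails, and systemctl status kdump shows the following log: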

 kdump.service - Crash recovery kernel arming
   Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Mon 2016-03-28 14:17:21 CST; 7s ago
  Process: 29388 ExecStart=/usr/bin/kdumpctl start (code=exited, status=1/FAILURE)
 Main PID: 29388 (code=exited, status=1/FAILURE)
 
Mar 28 14:17:20 localhost.localdomain systemd[1]: Starting Crash recovery kernel arming...
Mar 28 14:17:21 localhost.localdomain kdumpctl[29388]: No memory reserved for crash kernel.
Mar 28 14:17:21 localhost.localdomain kdumpctl[29388]: Starting kdump: [FAILED]
Mar 28 14:17:21 localhost.localdomain systemd[1]: kdump.service: main process exited, code=exited, status=1/FAILURE
Mar 28 14:17:21 localhost.localdomain systemd[1]: Failed to start Crash recovery kernel arming.
Mar 28 14:17:21 localhost.localdomain systemd[1]: Unit kdump.service entered failed state.
Mar 28 14:17:21 localhost.localdomain systemd[1]: kdump.service failed.

References:
The authoritative and detailed official Red Hat documentation: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Kernel_Crash_Dump_Guide/
http://unixadminschool.com/blog/2015/07/configuring-kdump-to-troubleshoot-kernel-crashes-hangs-or-reboots-in-rhel5rhel6rhel7/#difference-between-chroot-pivot-root

CHAPTER 7. KERNEL CRASH DUMP GUIDE

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/kernel_administration_guide/kernel_crash_dump_guide

Kdump.service FAILED centOS 7

Install the required packages

yum --enablerepo=debug install kexec-tools crash kernel-debug kernel-debuginfo-`uname -r`

Modify grub

A kernel argument must be added to GRUB_CMDLINE_LINUX in /etc/default/grub to enable kdump. It is called crashkernel and can be either auto or a predefined value such as 128M, 256M or 512M.

The line will look similar to the following:

GRUB_CMDLINE_LINUX="rd.lvm.lv=rhel/swap crashkernel=auto rd.lvm.lv=rhel/root rhgb quiet"

Change crashkernel=auto to crashkernel=128M or crashkernel=256M.

Regenerate grub configuration:

grub2-mkconfig -o /boot/grub2/grub.cfg

On a system with UEFI firmware, execute the following instead:

grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg

On IBM System z, which uses the zipl boot loader instead of GRUB, open the /etc/zipl.conf configuration file.

Locate the parameters= section and edit the crashkernel= parameter (or add it if not present). For example, to reserve 128 MB of memory, use the following:

crashkernel=128M

Save the file and exit, then regenerate the zipl configuration:

zipl

Enabling the Service

To start the kdump daemon at boot time, type the following command as root:

chkconfig kdump on

This will enable the service for runlevels 2, 3, 4, and 5. Similarly, typing chkconfig kdump off will disable it for all runlevels.

To start the service in the current session, use the following command as root:

service kdump start