Friday, May 29, 2009

Core file analysis on Solaris

A core dump is a detailed snapshot of a process at the moment the dump was taken, typically the moment the process crashed.
It has all the information related to OS threads (pstack info), memory mappings (pmap info), the PC (program counter), the stack pointer, etc.
A core file usually has to be studied by an experienced OS admin or the vendor. It is a lot easier to run the debugger on the system where the process crash occurred.

Before the core dump/core file is sent to the vendor, we can check a few things, and in most cases we should be able to figure out the culprit.

With pstack, we will be able to find the LWP that triggered the crash.

# pstack core
core 'core' of 6604: /WebLogic/jdk1.5.0_10/bin/sparcv9/java -Xms2048m -Xmx4096m
----------------- lwp# 28 / thread# 28 --------------------
ffffffff7eea8ebc uadmin (6, 0, fffffffe454fb7f0, 10050, 10000, 11c00) + 4
ffffffff7ee3e698 addsev (f9c8, 10011b5d0, 0, ffffffff7eadd140, 0, ffffffff7ea1a000) + 40
ffffffff7e7b2de8 __1cCosFabort6Fi_v_ (1, fc00, e, ffffffff7ea1a000, 26726c, b400) + 60
ffffffff7e84a348 __1cHVMErrorOreport_and_die6M_v_ (0, ffffffff7eac9424, ffffffff7eac93f0, ffffffff7e9076ec, ffffffff7eabd938, 0) + c68
ffffffff7e36ea90 JVM_handle_solaris_signal (b, 13070, fffffffe454fbf00, 10022fa60, fffffffe454fc1e0, 10800) + b00
ffffffff7f217f6c __sighndlr (b, fffffffe454fc1e0, fffffffe454fbf00, ffffffff7e36df60, 0, 0) + c
ffffffff7f211ab0 call_user_handler (fffffffe52702c00, fffffffe454fc1e0, fffffffe454fbf00, 0, 0, 0) + 25c
ffffffff7f211c74 sigacthandler (b, fffffffe454fc1e0, fffffffe454fbf00, fffffffffffffff8, ffffffffffffffe0, 2) + 68
--- called from signal handler with signal 11 (SIGSEGV) ---
fffffffe43b06204 Java_weblogic_socket_DevPollSocketMuxer_initDevPoll (10022fba8, fffffffe454fc590, 10800, 0, 1bfbfa0, ffff) + 2fc
ffffffff75c0d098 ???????? (10022fa60, fffffffe454fc680, fffffffe454fc590, ffffffffffffff10, 11800, ffffffff7ea1a000)
ffffffff75c0d038 ???????? (80000000, 0, 0, 0, 8, fffffffe454fbdc1)
ffffffff75c05978 ???????? (c000, 6, 0, ffffffff75c17180, 3, fffffffe454fbed1)
ffffffff75c0023c ???????? (fffffffe454fc8b8, fffffffe454fcc80, a, fffffffe577b54c8, ffffffff75c0b560, fffffffe454fcca0)
ffffffff7e29b9b8 __1cJJavaCallsLcall_helper6
................
.................
ffffffff7e7b2878 __1cG_start6Fpv_0_ (10022fa60, d800, b000, b378, ffffffff7eabd1a4, ffffffff7ea1a000) + 210
ffffffff7f217c9c _lwp_start (0, 0, 0, 0, 0, 0)

From the above, we can see that signal 11 (SIGSEGV) was raised in the initDevPoll function, the first frame listed after the "called from signal handler" line:
Java_weblogic_socket_DevPollSocketMuxer_initDevPoll
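
When a core has many LWPs, picking out that frame by eye gets tedious. The following is a minimal sketch, not part of the original procedure, that scans saved pstack output for the "called from signal handler" line and reports the LWP and the first frame listed after it. The function name find_faulting_frames and the regular expressions are assumptions, based only on the output format shown above.

```python
import re

# Patterns derived from the pstack output format shown above (an assumption;
# other Solaris releases may print slightly different text).
LWP_HEADER = re.compile(r"^-+\s*lwp#\s*(\d+)\s*/\s*thread#\s*(\d+)\s*-+$")
SIGNAL_MARKER = re.compile(r"called from signal handler with signal (\d+) \((\w+)\)")
FRAME = re.compile(r"^\s*([0-9a-f]+)\s+(\S+)\s*\(")


def find_faulting_frames(pstack_text):
    """Return one dict per signal marker found: the LWP id, the signal
    number and name, and the function of the first frame after the marker."""
    results = []
    current_lwp = None
    pending = None  # set once a marker is seen, until the next frame line
    for raw in pstack_text.splitlines():
        line = raw.strip()
        header = LWP_HEADER.match(line)
        if header:
            current_lwp = int(header.group(1))
            pending = None
            continue
        marker = SIGNAL_MARKER.search(line)
        if marker:
            pending = {
                "lwp": current_lwp,
                "signo": int(marker.group(1)),
                "signal": marker.group(2),
            }
            continue
        if pending is not None:
            frame = FRAME.match(line)
            if frame:
                pending["function"] = frame.group(2)
                results.append(pending)
                pending = None
    return results


if __name__ == "__main__":
    import sys
    # Usage (hypothetical file name): pstack core > pstack.out
    #                                 python find_fault.py pstack.out
    with open(sys.argv[1]) as fh:
        for hit in find_faulting_frames(fh.read()):
            print(hit)
```

The script only reads text, so it can run on any machine that has a copy of the pstack output; it does not need the core file itself.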

We can get more details with a debugger.
I am using mdb, a debugger that ships with the Solaris OS. Other debuggers such as gdb and dbx can be used as well.

Debuggers can do a lot more than the simple commands shown here. We can debug an active user process, the live OS kernel, device drivers, etc.
Complete documentation of mdb can be found at
http://docs.sun.com/app/docs/doc/817-2543

I am running this against the same core file as shown above.

# mdb core
mdb: warning: core file is from SunOS 5.9 Generic_118558-36; shared text mappings may not match installed libraries
Loading modules: [ libthread.so.1 libc.so.1 ld.so.1 ]
Once you are in a debugger session, you can run commands (called dcmds).
The "::status" dcmd, shown below, prints summary information about the process.

> ::status
debugging core file of java (64-bit) from
executable file: /WebLogic/jdk1.5.0_10/bin/sparcv9/java
initial argv:
/WebLogic/jdk1.5.0_10/bin/sparcv9/java -Xms2048m -Xmx4096m .
threading model: multi-threaded
status: process terminated by SIGABRT (Abort)
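
The dcmds do not have to be typed interactively. As a rough sketch, assuming mdb is on the PATH and reads dcmds from standard input when it is not attached to a terminal, a small wrapper can run a fixed list of dcmds against a core file and capture the output. The helper names build_mdb_script and run_mdb are hypothetical.

```python
import subprocess


def build_mdb_script(dcmds):
    """Join dcmds into the text fed to mdb on standard input, one per line."""
    return "".join(d.strip() + "\n" for d in dcmds)


def run_mdb(core_path, dcmds, mdb="mdb"):
    """Run mdb against a core file and return whatever it printed.

    Assumption: mdb executes dcmds read from standard input and exits at
    end of input. Errors from mdb are not interpreted here.
    """
    result = subprocess.run(
        [mdb, core_path],
        input=build_mdb_script(dcmds),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


if __name__ == "__main__":
    # Same dcmds as used in this post, against the file named "core".
    print(run_mdb("core", ["::status", "::stack"]))
```

This is convenient when several core files need the same first look before deciding which ones to send to the vendor.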

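Note that the status reports SIGABRT rather than SIGSEGV: the JVM's signal handler caught the SIGSEGV and then aborted the process itself, which is why the pstack output above shows the abort frame (__1cCosFabort6Fi_v_, i.e. os::abort) on top of JVM_handle_solaris_signal.

To get the stack that caused the crash, you can use the "::stack" dcmd: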

> ::stack
libc.so.1`uadmin+4(6, 0, fffffffe454fb7f0, 10050, 10000, 11c00)
libc.so.1`addsev+0x40(f9c8, 10011b5d0, 0, ffffffff7eadd140, 0, ffffffff7ea1a000)
libjvm.so`__1cCosFabort6Fi_v_+0x60(1, fc00, e, ffffffff7ea1a000, 26726c, b400)
libjvm.so`__1cHVMErrorOreport_and_die6M_v_+0xc68(0, ffffffff7eac9424,ffffffff7eac93f0, ffffffff7e9076ec, ffffffff7eabd938, 0)
libjvm.so`JVM_handle_solaris_signal+0xb00(b, 13070, fffffffe454fbf00, 10022fa60, fffffffe454fc1e0, 10800)
libthread.so.1`__sighndlr+0xc(b, fffffffe454fc1e0, fffffffe454fbf00,ffffffff7e36df60, 0, 0)
libthread.so.1`call_user_handler+0x25c(fffffffe52702c00, fffffffe454fc1e0,fffffffe454fbf00, 0, 0, 0)
libthread.so.1`sigacthandler+0x68(b, fffffffe454fc1e0, fffffffe454fbf00,fffffffffffffff8, ffffffffffffffe0, 2)
libmuxer.so`Java_weblogic_socket_DevPollSocketMuxer_initDevPoll+0x2fc(10022fba8, fffffffe454fc590, 10800, 0, 1bfbfa0, ffff)
...........
libjvm.so`__1cG_start6Fpv_0_+0x210(10022fa60, d800, b000, b378, ffffffff7eabd1a4, ffffffff7ea1a000)
libthread.so.1`_lwp_start(0, 0, 0, 0, 0, 0)

This gives a little more information than pstack, because it also prints the shared object each function belongs to.
In the above case the crash originated from:
libmuxer.so`Java_weblogic_socket_DevPollSocketMuxer_initDevPoll
// libmuxer.so is the shared object that contains initDevPoll
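
The object`function+offset(args) layout of each ::stack line is regular enough to split mechanically. Below is a minimal sketch, assuming the frame format shown above; the name parse_mdb_frame and the regular expression are illustrative and not part of mdb. Offsets are read as hexadecimal, which matches the output above, where mdb prints both "+4" and "+0x2fc".

```python
import re

# object`function[+offset](args...) ; the object prefix and the offset are
# both optional. Pattern inferred from the ::stack output shown above.
MDB_FRAME = re.compile(
    r"^(?:(?P<object>[^`\s]+)`)?"
    r"(?P<function>[^+(\s]+)"
    r"(?:\+(?P<offset>(?:0x)?[0-9a-fA-F]+))?"
    r"\("
)


def parse_mdb_frame(line):
    """Split one ::stack line into shared object, function and offset.

    Returns None for lines that are not frames (for example an elision
    row of dots).
    """
    m = MDB_FRAME.match(line.strip())
    if m is None:
        return None
    offset = m.group("offset")
    return {
        "object": m.group("object"),
        "function": m.group("function"),
        # int(..., 16) accepts the value with or without a 0x prefix
        "offset": int(offset, 16) if offset else 0,
    }


if __name__ == "__main__":
    frame = parse_mdb_frame(
        "libmuxer.so`Java_weblogic_socket_DevPollSocketMuxer_initDevPoll"
        "+0x2fc(10022fba8, fffffffe454fc590, 10800, 0, 1bfbfa0, ffff)"
    )
    print(frame["object"], frame["function"], hex(frame["offset"]))
```

Grouping frames by the "object" field makes it quick to see which library the frame just below the signal handler belongs to.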

In this scenario the problem was in libmuxer.so, which is part of the WebLogic performance pack, and getting an updated performance pack from BEA support fixed the issue.
