How Thread.UncaughtExceptionHandler Works


Java thread exception handling

1. Java Crash

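When a thread throws an exception that nobody catches, the runtime calls the thread's dispatchUncaughtException method, which passes the exception on to the thread's uncaught-exception handlers.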

void Thread::HandleUncaughtExceptions(ScopedObjectAccessAlreadyRunnable& soa) {
    // ...

    // Call the Thread instance's dispatchUncaughtException(Throwable)
    tlsPtr_.jni_env->CallVoidMethod(peer.get(),
        WellKnownClasses::java_lang_Thread_dispatchUncaughtException,
        exception.get());
}

// libcore/ojluni/src/main/java/java/lang/Thread.java
public final void dispatchUncaughtException(Throwable e) {
    Thread.UncaughtExceptionHandler initialUeh =
        Thread.getUncaughtExceptionPreHandler();
    if (initialUeh != null) {
        try {
            initialUeh.uncaughtException(this, e);
        } catch (RuntimeException | Error ignored) {
            // Throwables thrown by the initial handler are ignored
        }
    }
    getUncaughtExceptionHandler().uncaughtException(this, e);
}
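To make the ordering concrete, here is a small plain-Java model of dispatchUncaughtException that runs on a desktop JVM. The pre-handler setter is hidden on Android and does not exist in the standard JDK, so the class `DispatchModel` and its static field `preHandler` are hypothetical stand-ins. The sketch only mirrors the order shown above: pre-handler first (a RuntimeException or Error it throws is swallowed), then the regular handler.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of Android's Thread#dispatchUncaughtException; not the real libcore class.
public class DispatchModel {
    // Stand-in for the hidden pre-handler slot (Thread.setUncaughtExceptionPreHandler on Android).
    static volatile Thread.UncaughtExceptionHandler preHandler;

    // Same order as libcore: pre-handler first (its failures ignored), then the regular handler.
    static void dispatch(Thread t, Throwable e, Thread.UncaughtExceptionHandler handler) {
        Thread.UncaughtExceptionHandler initial = preHandler;
        if (initial != null) {
            try {
                initial.uncaughtException(t, e);
            } catch (RuntimeException | Error ignored) {
                // A failing pre-handler must not prevent the regular handler from running.
            }
        }
        handler.uncaughtException(t, e);
    }

    static List<String> run() {
        List<String> calls = new ArrayList<>();
        preHandler = (t, e) -> {
            calls.add("pre:" + e.getMessage());
            throw new IllegalStateException("pre-handler failure");
        };
        dispatch(Thread.currentThread(), new RuntimeException("boom"),
                (t, e) -> calls.add("main:" + e.getMessage()));
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(run()); // [pre:boom, main:boom]
    }
}
```

Even though the pre-handler throws, the regular handler still receives the exception.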

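Two UncaughtExceptionHandlers take part in this flow: the pre-handler and the regular handler. In each case the work is done in that handler's uncaughtException implementation, and Android provides default implementations of both. On Android, app processes are forked from the Zygote process; in the newly forked process, ZygoteInit.zygoteInit calls RuntimeInit.commonInit: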

// frameworks/base/core/java/com/android/internal/os/ZygoteInit.java
/**
  * The main function called when started through the zygote process...
  */
public static final Runnable zygoteInit(int targetSdkVersion, String[] argv, ClassLoader classLoader) {
    // ...
    RuntimeInit.commonInit();
    ZygoteInit.nativeZygoteInit();
    return RuntimeInit.applicationInit(targetSdkVersion, argv, classLoader);
}

RuntimeInit.commonInit installs the default UncaughtExceptionHandlers:

// frameworks/base/core/java/com/android/internal/os/RuntimeInit.java
protected static final void commonInit() {
    // ...
    /*
     * set handlers; these apply to all threads in the VM. Apps can replace
     * the default handler, but not the pre handler.
     */
    LoggingHandler loggingHandler = new LoggingHandler();
    Thread.setUncaughtExceptionPreHandler(loggingHandler);
    Thread.setDefaultUncaughtExceptionHandler(new KillApplicationHandler(loggingHandler));
    // ...
}

It creates two objects, a LoggingHandler and a KillApplicationHandler. Both implement Thread.UncaughtExceptionHandler and override uncaughtException:

LoggingHandler prints the exception details, including the process name, the pid and the Java stack trace.

For the system process, the log starts with "*** FATAL EXCEPTION IN SYSTEM PROCESS: ". For app processes, it starts with "FATAL EXCEPTION: ".

KillApplicationHandler makes sure the exception has been logged, notifies AMS and kills the process:

@Override
public void uncaughtException(Thread t, Throwable e) {
    try {
        // 1. Make sure LoggingHandler has already printed the crash info (added in Android 9.0)
        ensureLogging(t, e);

        // 2. Notify AMS so it can handle the crash, e.g. show the crash dialog
        ActivityManager.getService().handleApplicationCrash(
                   mApplicationObject, new ApplicationErrorReport.ParcelableCrashInfo(e));
    } catch (Throwable t2) {
        // ...
    } finally {
        // 3. Make sure the process is killed
        Process.killProcess(Process.myPid()); // essentially sends signal 9 (SIGKILL) to itself
        System.exit(10); // Java's way to terminate the process; shuts down the VM
    }
    }
}
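The cooperation between the two handlers can be sketched in plain Java. The classes below are hypothetical, simplified models named after the framework classes. Instead of calling AMS and killing the process, they append to a list so that the sequence is visible. The `triggered` flag is an assumption about how ensureLogging avoids logging twice.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical plain-Java model of RuntimeInit's two handlers; the real ones log via Clog_e,
// call ActivityManager.getService().handleApplicationCrash and kill the process.
public class RuntimeInitModel {
    static final List<String> actions = new ArrayList<>();

    static class LoggingHandler implements Thread.UncaughtExceptionHandler {
        volatile boolean triggered; // set once the "FATAL EXCEPTION" line has been written

        @Override
        public void uncaughtException(Thread t, Throwable e) {
            triggered = true;
            actions.add("log: FATAL EXCEPTION: " + t.getName());
        }
    }

    static class KillApplicationHandler implements Thread.UncaughtExceptionHandler {
        private final LoggingHandler logging;

        KillApplicationHandler(LoggingHandler logging) {
            this.logging = logging;
        }

        // Same idea as ensureLogging(): log here only if the pre-handler never ran.
        private void ensureLogging(Thread t, Throwable e) {
            if (!logging.triggered) {
                logging.uncaughtException(t, e);
            }
        }

        @Override
        public void uncaughtException(Thread t, Throwable e) {
            try {
                ensureLogging(t, e);
                actions.add("report to AMS"); // stands in for handleApplicationCrash
            } finally {
                actions.add("kill process");  // stands in for killProcess + System.exit(10)
            }
        }
    }

    static List<String> simulate(boolean preHandlerRan) {
        actions.clear();
        LoggingHandler logging = new LoggingHandler();
        KillApplicationHandler kill = new KillApplicationHandler(logging);
        Thread t = new Thread(() -> { }, "main");
        RuntimeException e = new RuntimeException("boom");
        if (preHandlerRan) {
            logging.uncaughtException(t, e); // the pre-handler runs first, as in dispatchUncaughtException
        }
        kill.uncaughtException(t, e);
        return new ArrayList<>(actions);
    }

    public static void main(String[] args) {
        System.out.println(simulate(true));
        System.out.println(simulate(false));
    }
}
```

Both calls produce the same three steps. Whether or not the pre-handler already ran, the crash is logged exactly once before AMS is notified and the process is killed.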

Note 1:

Android N and earlier have a single UncaughtHandler class that implements Thread.UncaughtExceptionHandler. Android O and later split it into two classes, LoggingHandler and KillApplicationHandler, both of which implement Thread.UncaughtExceptionHandler.

Note 2:

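Thread#setDefaultUncaughtExceptionHandler is a public API. By calling it, an app can install its own UncaughtExceptionHandler in place of KillApplicationHandler and run custom logic on the exception, which prevents the crash. Thread#setUncaughtExceptionPreHandler is a hidden API (@hide). Apps cannot call it, so they cannot replace LoggingHandler.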

/**
 * ......
 * @hide only for use by the Android framework (RuntimeInit) b/29624607
 */
public static void setUncaughtExceptionPreHandler(UncaughtExceptionHandler eh) {
    uncaughtExceptionPreHandler = eh;
}
....
public static void setDefaultUncaughtExceptionHandler(UncaughtExceptionHandler eh) {
    defaultUncaughtExceptionHandler = eh;
}

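This explains a common situation. An app throws an uncaught exception and LoggingHandler prints "FATAL EXCEPTION" to the log. Because the app has replaced KillApplicationHandler, however, the process does not exit and AMS is never notified. The app keeps running, although the thread that threw the exception has still terminated.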
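A common way to run custom logic without swallowing crashes is to chain to the previously installed default handler instead of replacing it outright. The sketch below uses only public java.lang.Thread APIs. The class name `ChainingHandler` and the recorded strings are illustrative. On Android, `previous` would be KillApplicationHandler, so the process would still be reported and killed. On a desktop JVM, the demo first installs a recording stand-in.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative handler that does its own work, then delegates to the previous default handler.
public class ChainingHandler implements Thread.UncaughtExceptionHandler {
    private final Thread.UncaughtExceptionHandler previous;
    final List<String> seen = new CopyOnWriteArrayList<>();

    ChainingHandler(Thread.UncaughtExceptionHandler previous) {
        this.previous = previous;
    }

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        // App-specific work goes here (hypothetical: this sketch only records the event).
        seen.add(t.getName() + ":" + e.getClass().getSimpleName());
        // Delegate, so the platform handler still runs (on Android: report to AMS, kill process).
        if (previous != null) {
            previous.uncaughtException(t, e);
        }
    }

    /** Installs a ChainingHandler in front of the current default handler. */
    static ChainingHandler install() {
        ChainingHandler handler = new ChainingHandler(Thread.getDefaultUncaughtExceptionHandler());
        Thread.setDefaultUncaughtExceptionHandler(handler);
        return handler;
    }

    /** Crashes one worker thread and returns "our records | previous handler's records". */
    static String demo() {
        Thread.UncaughtExceptionHandler saved = Thread.getDefaultUncaughtExceptionHandler();
        List<String> previousCalls = new CopyOnWriteArrayList<>();
        // Stand-in for the platform's handler; a desktop JVM normally has none.
        Thread.setDefaultUncaughtExceptionHandler((t, e) -> previousCalls.add(t.getName()));
        try {
            ChainingHandler handler = install();
            Thread worker = new Thread(() -> { throw new IllegalStateException("boom"); }, "worker");
            worker.start();
            worker.join(); // the handler runs on the worker thread before it terminates
            return handler.seen + " | " + previousCalls;
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ie);
        } finally {
            Thread.setDefaultUncaughtExceptionHandler(saved);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [worker:IllegalStateException] | [worker]
    }
}
```

If the handler did not call `previous`, you would get exactly the situation just described: the exception shows up in the log, but the process keeps running.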

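Note 3: By default, after an uncaught exception KillApplicationHandler calls System.exit(10), which shuts down the process's Java VM. If code in the process still tries to start a new thread at that point, it fails with java.lang.InternalError: Thread starting during runtime shutdown: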

java.lang.InternalError: Thread starting during runtime shutdown
at java.lang.Thread.nativeCreate(Native Method)
at java.lang.Thread.start(Thread.java:733)

If you see this error in a log, first look for the real reason the process exited abnormally.

2. Native Crash

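When a native fault occurs, the CPU raises a hardware exception that starts the fault-handling path. The Linux kernel turns such faults into signals, and a process can register handlers for them. In Android P, the default handlers are registered from bionic/linker/linker_main.cpp, which calls debuggerd_init. The relevant code in linker_main.cpp: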

// bionic/linker/linker_main.cpp
/*
 * This code is called after the linker has linked itself and
 * fixed it's own GOT. It is safe to make references to externs
 * and other non-local data at this point.
 */
static ElfW(Addr) __linker_init_post_relocation(KernelArgumentBlock& args) {
    // ...
    debuggerd_init(&callbacks);
}

debuggerd_init registers the signal handler:

// system/core/debuggerd/handler/debuggerd_handler.cpp
void debuggerd_init(debuggerd_callbacks_t* callbacks) {
    // ...
    struct sigaction action;
    memset(&action, 0, sizeof(action));
    sigfillset(&action.sa_mask);
    action.sa_sigaction = debuggerd_signal_handler;
    action.sa_flags = SA_RESTART | SA_SIGINFO;

    // Use the alternate signal stack if available so we can catch stack overflows.
    action.sa_flags |= SA_ONSTACK;
    debuggerd_register_handlers(&action);
}

So the default signal handler is debuggerd_signal_handler. Which signals does it handle? That is decided by debuggerd_register_handlers:

// system/core/debuggerd/include/debuggerd/handler.h
static void __attribute__((__unused__)) debuggerd_register_handlers(struct sigaction* action) {
    sigaction(SIGABRT, action, nullptr);
    sigaction(SIGBUS, action, nullptr);
    sigaction(SIGFPE, action, nullptr);
    sigaction(SIGILL, action, nullptr);
    sigaction(SIGSEGV, action, nullptr);
    #if defined(SIGSTKFLT)
        sigaction(SIGSTKFLT, action, nullptr);
    #endif
    sigaction(SIGSYS, action, nullptr);
    sigaction(SIGTRAP, action, nullptr);
    sigaction(DEBUGGER_SIGNAL, action, nullptr);
}

Through sigaction, it registers 9 signals: SIGABRT, SIGBUS, SIGFPE, SIGILL, SIGSEGV, SIGSTKFLT (only where the platform defines it), SIGSYS, SIGTRAP and DEBUGGER_SIGNAL.

When a native crash then occurs, it is handled as follows.

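The process's default signal handler, debuggerd_signal_handler, is called on the thread that crashed. It does two things. First, it calls log_signal_summary to print a one-line summary of the crash. Second, it calls clone to create a pseudothread that runs debuggerd_dispatch_pseudothread. Because of CLONE_THREAD, the pseudothread stays in the crashing process's thread group, so it still reports the same pid. In the log below, the pid is 8745 and the crashing thread is tid 8783: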

// system/core/debuggerd/handler/debuggerd_handler.cpp

// Handler that does crash dumping by forking and doing the processing in the child.
// Do this by ptracing the relevant thread, and then execing debuggerd to do the actual dump.
static void debuggerd_signal_handler(int signal_number, siginfo_t* info, void* context) {
    // ...
    // 1. Print a "Fatal signal" log line with the basic crash info
    log_signal_summary(info); 
    
    // 2. clone the pseudothread
    pid_t child_pid = 
        clone(debuggerd_dispatch_pseudothread, pseudothread_stack,
              CLONE_THREAD | CLONE_SIGHAND | CLONE_VM | CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID,
              &thread_info, nullptr, nullptr, &thread_info.pseudothread_tid);
    
    // Wait for the child to start...
    futex_wait(&thread_info.pseudothread_tid, -1);
    // and then wait for it to terminate.
    futex_wait(&thread_info.pseudothread_tid, child_pid);
    // ...
}

log_signal_summary writes a "Fatal signal" line to the log. The source comments suggest its purpose: even if the later steps fail, at least a basic record of the native crash is kept. For example:
03-04 17:54:46.444 10168 8745 8783 F libc : Fatal signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x74 in tid 8783 (test), pid 8745 (com.kevin.test)

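The fields are:

Signal, e.g. signal 11 (SIGSEGV) ------> from siginfo_t->si_signo; the string SIGSEGV is derived from si_signo

Error code, e.g. code 1 (SEGV_MAPERR) ------> from siginfo_t->si_code; the string SEGV_MAPERR is derived from si_signo together with si_code

Fault address, e.g. fault addr 0x74 ------> from siginfo_t->si_addr

Crashing tid and thread name, e.g. 8783 (test) ------> the tid comes from the system call syscall(__NR_gettid); the thread name comes from prctl(PR_GET_NAME, reinterpret_cast<unsigned long>(thread_name), 0, 0, 0)

Crashing pid and main thread name, e.g. pid 8745 (com.kevin.test) ------> the pid comes from the system call syscall(__NR_getpid); the main thread name is read from /proc/self/comm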
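The summary line can be reproduced from those fields. The following is a hypothetical Java re-creation of the line format only. The real log_signal_summary is C++ and also has variants for signals without a fault address, which are not modeled. Signal numbers are the Linux values used on arm/arm64, and only the two SIGSEGV codes are mapped.

```java
import java.util.Map;

// Hypothetical re-creation of the "Fatal signal" line format; not debuggerd's code.
public class FatalSignalSummary {
    // Linux signal numbers as used on arm/arm64 Android.
    private static final Map<Integer, String> SIGNALS = Map.of(
            4, "SIGILL", 5, "SIGTRAP", 6, "SIGABRT", 7, "SIGBUS",
            8, "SIGFPE", 11, "SIGSEGV", 16, "SIGSTKFLT", 31, "SIGSYS");

    // si_code names depend on si_signo; only SIGSEGV codes are modeled here.
    private static String codeName(int signo, int code) {
        if (signo == 11 && code == 1) return "SEGV_MAPERR";
        if (signo == 11 && code == 2) return "SEGV_ACCERR";
        return "?";
    }

    static String format(int signo, int code, long faultAddr,
                         int tid, String threadName, int pid, String processName) {
        return String.format(
                "Fatal signal %d (%s), code %d (%s), fault addr 0x%x in tid %d (%s), pid %d (%s)",
                signo, SIGNALS.getOrDefault(signo, "?"), code, codeName(signo, code),
                faultAddr, tid, threadName, pid, processName);
    }

    public static void main(String[] args) {
        System.out.println(format(11, 1, 0x74, 8783, "test", 8745, "com.kevin.test"));
    }
}
```

With the values from the example log, it prints the same text as the libc line above (without the logcat prefix).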

The arguments passed to clone:

Argument 1: debuggerd_dispatch_pseudothread ------> the function the pseudothread runs

Argument 2: pseudothread_stack ------> pointer to the stack allocated for the pseudothread

Argument 3: CLONE_THREAD ------> added in Linux 2.4 to support POSIX threads; the new task joins the caller's thread group

Argument 3: CLONE_SIGHAND ------> the new task shares the caller's signal handler table

Argument 3: CLONE_VM ------> the new task runs in the same address space as the caller

Argument 3: CLONE_CHILD_SETTID ------> the kernel writes the new task's TID into the user-space variable pointed to by the ctid argument (here &thread_info.pseudothread_tid)

Argument 3: CLONE_CHILD_CLEARTID ------> when the new task exits or starts executing a new program, the kernel clears the variable pointed to by ctid and wakes any task waiting on it

Argument 4: thread_info ------> the argument passed to debuggerd_dispatch_pseudothread

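Note: after clone, debuggerd_signal_handler calls futex_wait twice, first until the pseudothread has started and then until it has exited (that is, until debuggerd_dispatch_pseudothread has finished). After that it runs the rest of its logic, and debuggerd's work for this native crash is done.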
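The two futex_wait calls form a start/exit handshake on the word that CLONE_CHILD_SETTID fills in and CLONE_CHILD_CLEARTID clears. The following Java analogy reproduces that protocol with a monitor instead of a futex. It is not Android code: the class, the lock-based `waitWhileEquals` and the tid value 4242 are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Java analogy (not Android code) of the futex handshake in debuggerd_signal_handler.
public class PseudothreadHandshake {
    private final Object lock = new Object();
    private int tidWord = -1; // like thread_info.pseudothread_tid, initialized to -1

    // Like futex_wait(&word, expected): block while the word still holds `expected`.
    private void waitWhileEquals(int expected) throws InterruptedException {
        synchronized (lock) {
            while (tidWord == expected) {
                lock.wait();
            }
        }
    }

    // Store a new value and wake any waiters.
    private void store(int value) {
        synchronized (lock) {
            tidWord = value;
            lock.notifyAll();
        }
    }

    /** Returns the events in the order the "signal handler" thread observes them. */
    static List<String> run() {
        PseudothreadHandshake h = new PseudothreadHandshake();
        List<String> events = Collections.synchronizedList(new ArrayList<>());
        int pseudoTid = 4242; // hypothetical tid of the pseudothread
        Thread pseudo = new Thread(() -> {
            h.store(pseudoTid); // CLONE_CHILD_SETTID: tid published when the thread starts
            events.add("pseudothread: exec crash_dump and wait for it");
            h.store(0);         // CLONE_CHILD_CLEARTID: cleared when the thread exits
        });
        pseudo.start();
        try {
            h.waitWhileEquals(-1);        // "Wait for the child to start..."
            h.waitWhileEquals(pseudoTid); // "...and then wait for it to terminate."
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ie);
        }
        events.add("signal handler: continue");
        return events;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

However the two threads interleave, the handler can only continue once the word has been cleared, so its event always comes after the pseudothread's work.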

Once created, the pseudothread runs debuggerd_dispatch_pseudothread. Its main job is to fork and, in the new process, use execle to run /system/bin/crash_dump32 or /system/bin/crash_dump64 with the following arguments:

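main_tid: id of the thread that crashed (in the target process)

pseudothread_tid: judging from the code, related to collecting the backtrace; needs further investigation

debuggerd_dump_type: one of 4 dump types; for a native crash it is kDebuggerdTombstone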

static int debuggerd_dispatch_pseudothread(void* arg) {
    // Note: this forks first. Here __fork is implemented as clone(nullptr, nullptr, 0, nullptr),
    // so a new process with a new pid (the future crash_dump process) continues from this point.
    pid_t crash_dump_pid = __fork();

    if (crash_dump_pid == 0) {
        // Child: replace the process image with crash_dump
        execle(CRASH_DUMP_PATH, CRASH_DUMP_NAME, main_tid, pseudothread_tid, debuggerd_dump_type, nullptr, nullptr);
        // ...
    }
    // ...
}

Once the crash_dump program has finished, debuggerd_dispatch_pseudothread's original flow resumes (still inside process 8745 in the log example above). Then, as described above, debuggerd_signal_handler runs to completion.

Meanwhile, the main function of /system/bin/crash_dump64 runs (source: system/core/debuggerd/crash_dump.cpp). This is arguably the core of native crash handling. It does the following:

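When main starts, it first calls DefuseSignalHandlers. This essentially calls debuggerd_register_handlers with a default (SIG_DFL) signal action for the crash_dump64 process, so that a native crash inside crash_dump64 does not try to dump itself. Note: before crash_dump64's main runs, debuggerd_init has already been called in this process and has registered the signal handlers, which is why they have to be undone.

It calls fork once. The original process exits, and the new crash_dump64 process continues in main from the fork call. Everything below is done by this new process.

It parses the arguments passed in, including the id of the crashing target thread and the target process name, and calls GetProcessTids to get the ids of all threads in the target process.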

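It attaches to the app with ptrace (the source loops over every thread of the app and calls ReadCrashInfo for the thread that crashed), reads the registers and other state, and finally gathers all the crash info, including the device build, ABI, signal, registers and backtrace, and writes it to the log.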

Notes:

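a. While looping over the target process's threads, crash_dump first calls ptrace_seize_thread for each thread. This essentially calls ptrace(PTRACE_SEIZE, tid, 0, flags) and also verifies that the thread still belongs to the target process. It then calls ptrace(PTRACE_INTERRUPT, tid, 0, 0). For the crashing thread it calls ReadCrashInfo to read the crash info, and for every other thread it fetches the registers.

b. Finally, every thread of the target process is detached with ptrace(PTRACE_DETACH, tid, 0, resume_signal).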

It contacts the tombstoned process (a resident system daemon) over a socket. tombstoned provides the output file /data/tombstones/tombstone_xx, and all the crash info is written into it. Notes:

a. tombstoned listens on 3 sockets in total, named tombstoned_crash, tombstoned_java_trace and tombstoned_intercept, each used for a different purpose. Here, crash_dump64 talks to tombstoned over tombstoned_crash. The relevant code is in system/core/debuggerd/tombstoned/tombstoned.cpp, as shown in the figure below:

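b. Once the socket connection to tombstoned is established, crash_dump64 calls the core function engrave_tombstone. This function collects and formats all the crash info, prints it to the log and writes it to the tombstone output obtained from tombstoned. The logged information is listed below. The tombstone file contains even more, such as the other threads and memory contents: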

Header, read from ro.build.fingerprint, ro.revision and ABI_STRING:

03-04 17:54:46.581 10168  8792  8792 F DEBUG   : Build fingerprint: 'Xiaomi/perseus/perseus:9/PKQ1.180729.001/1.1.1:userdebug/test-keys'
03-04 17:54:46.581 10168  8792  8792 F DEBUG   : Revision: '0'
03-04 17:54:46.581 10168  8792  8792 F DEBUG   : ABI: 'arm64'

Thread info, which was obtained earlier:

03-04 17:54:46.581 10168 8792 8792 F DEBUG : pid: 8745, tid: 8783, name: test >>> com.kevin.test <<<

Signal info: the signal, error code and fault address, plus the probable cause, which is matched from the signal and fault address (see dump_probable_cause):

03-04 17:54:46.581 10168  8792  8792 F DEBUG   : signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x74
03-04 17:54:46.581 10168  8792  8792 F DEBUG   : Cause: null pointer dereference
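The "Cause" line comes from matching the signal, code and fault address. Below is a minimal Java sketch of the rule that applies to this log. It assumes a threshold of one 4 KiB page, meaning a SEGV_MAPERR fault below that address is reported as a null pointer dereference. The class name is hypothetical, and the real C++ function checks further cases that are not modeled.

```java
// Hypothetical Java rendering of one rule in debuggerd's dump_probable_cause.
public class ProbableCause {
    static final int SIGSEGV = 11;
    static final int SEGV_MAPERR = 1;

    static String cause(int signo, int code, long faultAddr) {
        // An unmapped address inside the first (4 KiB) page is typically a read
        // through a NULL base pointer plus a small field offset.
        if (signo == SIGSEGV && code == SEGV_MAPERR && Long.compareUnsigned(faultAddr, 4096) < 0) {
            return "null pointer dereference";
        }
        return null; // other causes exist in the real function; not modeled here
    }

    public static void main(String[] args) {
        System.out.println("Cause: " + cause(SIGSEGV, SEGV_MAPERR, 0x74));
    }
}
```

For the crash above (SIGSEGV, SEGV_MAPERR, fault addr 0x74), this yields the same "null pointer dereference" cause.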

Registers of the crashing thread, obtained earlier via ptrace and stored in the data structure std::map<pid_t, ThreadInfo> thread_info:

03-04 17:54:46.581 10168  8792  8792 F DEBUG   :     x0  0000000000000074  x1  0000006ffd44976c  x2  000000709a531212  x3  0000000000000010
03-04 17:54:46.581 10168  8792  8792 F DEBUG   :     x4  0000000000000074  x5  000000007fffffff  x6  0000000000000002  x7  0000000000000030
03-04 17:54:46.581 10168  8792  8792 F DEBUG   :     x8  0101010101010101  x9  000000709a4f1d7f  x10 0000000000000002  x11 0000006ffd59b90d
03-04 17:54:46.581 10168  8792  8792 F DEBUG   :     x12 0000006ffd59bea8  x13 0000000000000006  x14 0000000000000000  x15 0000006ffd59be98
03-04 17:54:46.581 10168  8792  8792 F DEBUG   :     x16 000000709a52f0f8  x17 000000709a45e4d0  x18 0000000000000001  x19 0000006ffd59bec0
03-04 17:54:46.581 10168  8792  8792 F DEBUG   :     x20 000000008000002f  x21 000000709a531212  x22 0000000000000006  x23 0000006ffd59be90
03-04 17:54:46.581 10168  8792  8792 F DEBUG   :     x24 0000000000000000  x25 0000006ffd59d588  x26 000000000ccccccc  x27 0000006ffd59bea8
03-04 17:54:46.581 10168  8792  8792 F DEBUG   :     x28 0000006ffd449774  x29 0000006ffd59be80
03-04 17:54:46.581 10168  8792  8792 F DEBUG   :     sp  0000006ffd59b700  lr  000000709a49f37c  pc  000000709a45e4e0

Backtrace. Obtaining it is a little more involved. It depends on the child_pid returned by the first clone in debuggerd_signal_handler: crash_dump.cpp's wait_for_vm_process obtains vm_pid, which is passed to system/core/libbacktrace/BacktraceMap.cpp to produce the backtrace:

03-04 17:54:46.611 10168  8792  8792 F DEBUG   : backtrace:
03-04 17:54:46.611 10168  8792  8792 F DEBUG   :     #00 pc 000000000001e4e0  /system/lib64/libc.so (strlen+16)
03-04 17:54:46.611 10168  8792  8792 F DEBUG   :     #01 pc 000000000005f378  /system/lib64/libc.so (__vfprintf+6004)
03-04 17:54:46.611 10168  8792  8792 F DEBUG   :     #02 pc 000000000007d3c4  /system/lib64/libc.so (vsnprintf+164)
03-04 17:54:46.611 10168  8792  8792 F DEBUG   :     #03 pc 00000000000474bc  /system/lib64/libc.so (__vsnprintf_chk+72)
03-04 17:54:46.611 10168  8792  8792 F DEBUG   :     #04 pc 0000000000007e38  /system/lib64/liblog.so (__android_log_print+144)
03-04 17:54:46.611 10168  8792  8792 F DEBUG   :     #05 pc 00000000000006e8  /data/app/com.kevin.test-8raHGHng-dnI7-DiURmk1w==/lib/arm64/libtest-jni.so (Java_com_kevin_test_TestJni_getStringFromNativeMethod+124)
03-04 17:54:46.611 10168  8792  8792 F DEBUG   :     #06 pc 00000000005659e0  /system/lib64/libart.so (art_quick_generic_jni_trampoline+144)
03-04 17:54:46.612 10168  8792  8792 F DEBUG   :     #07 pc 000000000055cc4c  /system/lib64/libart.so (art_quick_invoke_static_stub+604)
03-04 17:54:46.612 10168  8792  8792 F DEBUG   :     #08 pc 00000000000d0540  /system/lib64/libart.so (art::ArtMethod::Invoke(art::Thread, unsigned int, unsigned int, art::JValue, char const)+232)
03-04 17:54:46.612 10168  8792  8792 F DEBUG   :     #09 pc 0000000000280b90  /system/lib64/libart.so (art::interpreter::ArtInterpreterToCompiledCodeBridge(art::Thread, art::ArtMethod, art::ShadowFrame, unsigned short, art::JValue)+344)
03-04 17:54:46.612 10168  8792  8792 F DEBUG   :     #10 pc 000000000027aba4  /system/lib64/libart.so (bool art::interpreter::DoCall<false, false>(art::ArtMethod, art::Thread, art::ShadowFrame&, art::Instruction const, unsigned short, art::JValue)+968)
03-04 17:54:46.612 10168  8792  8792 F DEBUG   :     #11 pc 000000000052d7e0  /system/lib64/libart.so (MterpInvokeStatic+204)
03-04 17:54:46.612 10168  8792  8792 F DEBUG   :     #12 pc 000000000054f194  /system/lib64/libart.so (ExecuteMterpImpl+14612)
03-04 17:54:46.612 10168  8792  8792 F DEBUG   :     #13 pc 0000000000112514  /dev/ashmem/dalvik-classes.dex extracted in memory from /data/app/com.kevin.test-8raHGHng-dnI7-DiURmk1w==/base.apk (deleted) (com.kevin.test.MainActivity$TestRunnable.run)
03-04 17:54:46.612 10168  8792  8792 F DEBUG   :     #14 pc 00000000002548a8  /system/lib64/libart.so (_ZN3art11interpreterL7ExecuteEPNS_6ThreadERKNS_20CodeItemDataAccessorERNS_11ShadowFrameENS_6JValueEb.llvm.223931584+488)
03-04 17:54:46.612 10168  8792  8792 F DEBUG   :     #15 pc 000000000025a39c  /system/lib64/libart.so (art::interpreter::ArtInterpreterToInterpreterBridge(art::Thread, art::CodeItemDataAccessor const&, art::ShadowFrame, art::JValue)+216)
03-04 17:54:46.612 10168  8792  8792 F DEBUG   :     #16 pc 000000000027ab88  /system/lib64/libart.so (bool art::interpreter::DoCall<false, false>(art::ArtMethod, art::Thread, art::ShadowFrame&, art::Instruction const, unsigned short, art::JValue*)+940)

While engrave_tombstone formats and assembles all of the above, it also stores the result in the string amfd_data, which is later sent to the system_server process.

c. While crash_dump64 talks to tombstoned over the socket, tombstoned prints two log lines that you will often see:

03-04 17:54:46.567  1058  1089  1089 I /system/bin/tombstoned: received crash request for pid 8783
......
03-04 17:54:46.924  1058  1089  1089 E /system/bin/tombstoned: Tombstone written to: /data/tombstones/tombstone_01

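It notifies the system_server process over a socket (the NativeCrashListener thread listens on it), which eventually leads to AMS#handleApplicationCrashInner. From this point on, the handling is the same as for a Java crash.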

The main code for the steps above:

// system/core/debuggerd/crash_dump.cpp
int main(int argc, char** argv) {
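    // 1. Essentially calls debuggerd_register_handlers with a default action for crash_dump64,
    //    so that a native crash in crash_dump64 itself does not try to dump itself
    DefuseSignalHandlers();

    // 2. fork; the original process exits and the new one carries on
    pid_t forkpid = fork();
    if (forkpid == -1) {
        // ... (fork failed)
    } else if (forkpid != 0) {
        exit(0);
    }

    // 3. Parse the arguments, get all threads of the target process, etc.
    Initialize(argv);
    ParseArgs(argc, argv, &pseudothread_tid, &dump_type);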
     
    // In order to reduce the duration that we pause the process for, we ptrace
    // the threads, fetch their registers and associated information, and then
    // fork a separate process as a snapshot of the process's address space.
    std::set<pid_t> threads;
 
    // 4. Attach to the app via ptrace and collect the crash info
    ATRACE_NAME("ptrace");
    for (pid_t thread : threads) {
        // ...
        ThreadInfo info;
        info.pid = target_process;
        info.tid = thread;
        info.process_name = process_name;
        info.thread_name = get_thread_name(thread);
 
        if (!ptrace_interrupt(thread, &info.signo)) {
            PLOG(WARNING) << "failed to ptrace interrupt thread " << thread;
            ptrace(PTRACE_DETACH, thread, 0, 0);
            continue;
        }
 
        if (thread == g_target_thread) {
            // Read the thread's registers along with the rest of the crash info out of the pipe.
            ReadCrashInfo(input_pipe, &siginfo, &info.registers, &abort_address);
            info.siginfo = &siginfo;
            info.signo = info.siginfo->si_signo;
        } else {
            info.registers.reset(Regs::RemoteGet(thread));
            if (!info.registers) {
                PLOG(WARNING) << "failed to fetch registers for thread " << thread;
                ptrace(PTRACE_DETACH, thread, 0, 0);
                continue;
            }
        }
        // ...
    }
 
    // 5. Connect to tombstoned over a socket; tombstoned supplies the output for /data/tombstones/tombstone_xx
    {
        ATRACE_NAME("tombstoned_connect");
        LOG(INFO) << "obtaining output fd from tombstoned, type: " << dump_type;
        g_tombstoned_connected =
            tombstoned_connect(g_target_thread, &g_tombstoned_socket, &g_output_fd, dump_type);
    }
 
    engrave_tombstone(std::move(g_output_fd), map.get(), process_memory.get(), thread_info, g_target_thread, abort_address, &open_files, &amfd_data);
 
 
    // 6. Notify system_server over a socket; amfd_data is a string holding all the formatted crash info
    activity_manager_notify(target_process, signo, amfd_data);
    // ...
}

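Finally, the AMS side. When AMS starts in the system_server process, it calls startObservingNativeCrashes, which starts a new thread, NativeCrashListener. This thread listens on a socket in a loop (socket path: /data/system/ndebugsocket) and receives native crash info from the debuggerd side. As analyzed above, the peer is the process running the crash_dump program. The main code: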

// frameworks/base/services/core/java/com/android/server/am/NativeCrashListener.java
final class NativeCrashListener extends Thread {
    // ...
    @Override
    public void run() {
        // ...
        try {
            FileDescriptor serverFd = Os.socket(AF_UNIX, SOCK_STREAM, 0);
            final UnixSocketAddress sockAddr = UnixSocketAddress.createFileSystem(
                    DEBUGGERD_SOCKET_PATH);
            Os.bind(serverFd, sockAddr);
            Os.listen(serverFd, 1);
            Os.chmod(DEBUGGERD_SOCKET_PATH, 0777);

            while (true) {
                FileDescriptor peerFd = null;
                try {
                    if (MORE_DEBUG) Slog.v(TAG, "Waiting for debuggerd connection");
                    peerFd = Os.accept(serverFd, null /* peerAddress */);
                    if (MORE_DEBUG) Slog.v(TAG, "Got debuggerd socket " + peerFd);
                    if (peerFd != null) {
                        // 
                        consumeNativeCrashData(peerFd);
                    }
             // ...
        }

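Each time native crash info arrives, consumeNativeCrashData starts a new thread that calls ActivityManagerService#handleApplicationCrashInner. From here the handling is the same as for a Java crash: AMS is told that a native crash happened, logs it, shows the FC (force close) crash dialog, and so on.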
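The data arriving on that socket can be modeled as a simple frame. The layout below is an assumption based on NativeCrashListener's reader: two big-endian 32-bit ints for the pid and signal, followed by the UTF-8 report text terminated by a 0 byte. `NativeCrashFrame` is a hypothetical class, and in-memory streams stand in for the UNIX socket.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical model of the frame NativeCrashListener is assumed to read from crash_dump.
public class NativeCrashFrame {
    final int pid;
    final int signal;
    final String report;

    NativeCrashFrame(int pid, int signal, String report) {
        this.pid = pid;
        this.signal = signal;
        this.report = report;
    }

    // Assumed layout: pid and signal as big-endian ints, then report text ending in a 0 byte.
    static NativeCrashFrame read(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int pid = data.readInt();    // DataInputStream reads big-endian
        int signal = data.readInt();
        ByteArrayOutputStream text = new ByteArrayOutputStream();
        int b;
        while ((b = data.read()) > 0) { // stop at the 0 terminator or at EOF (-1)
            text.write(b);
        }
        return new NativeCrashFrame(pid, signal, new String(text.toByteArray(), StandardCharsets.UTF_8));
    }

    // Builds a frame the way the sender is assumed to write it.
    static byte[] write(int pid, int signal, String report) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(pid);
        out.writeInt(signal);
        out.write(report.getBytes(StandardCharsets.UTF_8));
        out.write(0); // end-of-data marker
        out.flush();
        return bytes.toByteArray();
    }

    static NativeCrashFrame roundTrip(int pid, int signal, String report) {
        try {
            return read(new ByteArrayInputStream(write(pid, signal, report)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        NativeCrashFrame f = roundTrip(8745, 11, "Cause: null pointer dereference");
        System.out.println(f.pid + " " + f.signal + " " + f.report);
    }
}
```

Stopping at the first 0 byte is safe for UTF-8 text, because multi-byte UTF-8 sequences never contain a 0 byte.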

Thread holds two UncaughtExceptionHandlers: the static defaultUncaughtExceptionHandler and the per-instance uncaughtExceptionHandler.

// null unless explicitly set
private volatile UncaughtExceptionHandler uncaughtExceptionHandler;

// null unless explicitly set
private static volatile UncaughtExceptionHandler defaultUncaughtExceptionHandler;
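  • defaultUncaughtExceptionHandler: a static, process-wide default handler. Exceptions that are thrown and not caught in any thread end up here, unless the thread has its own handler or its ThreadGroup overrides uncaughtException without forwarding. The handler installed when the process is forked is this static defaultUncaughtExceptionHandler, so its scope is the whole process.
  • uncaughtExceptionHandler: a handler that belongs to a single thread; its scope is much narrower.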

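If a thread has no uncaughtExceptionHandler of its own, the uncaught exception is handled by the thread's ThreadGroup. ThreadGroup implements UncaughtExceptionHandler, so it can handle uncaught exceptions. The relevant declarations: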

private ThreadGroup group;

class ThreadGroup implements Thread.UncaughtExceptionHandler{}

ThreadGroup implements uncaughtException as follows:

public void uncaughtException(Thread t, Throwable e) {
    if (parent != null) {
        parent.uncaughtException(t, e);
    } else {
        Thread.UncaughtExceptionHandler ueh =
            Thread.getDefaultUncaughtExceptionHandler();
        if (ueh != null) {
            ueh.uncaughtException(t, e);
        } else if (!(e instanceof ThreadDeath)) {
            System.err.print("Exception in thread \""
                             + t.getName() + "\" ");
            e.printStackTrace(System.err);
        }
    }
}

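By default, a thread group handles an uncaught exception like this. If it has a parent group, it passes the exception to the parent. At the root group, it calls the defaultUncaughtExceptionHandler if one is set, and otherwise prints the stack trace to System.err (except for ThreadDeath). In other words, Java lets us set a specific uncaught-exception handler for each thread, and also provides a way to set a process-wide default one.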
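This lookup order can be checked on a desktop JVM with public APIs only. In the sketch below (class and thread names are illustrative), one thread has its own handler and one does not. The second falls back to its ThreadGroup, whose standard implementation forwards to the default handler.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Which handler runs: the thread's own handler first; otherwise its ThreadGroup,
// which (in the standard implementation) forwards up to the default handler.
public class HandlerPrecedence {
    static List<String> demo() {
        List<String> calls = new CopyOnWriteArrayList<>();
        Thread.UncaughtExceptionHandler saved = Thread.getDefaultUncaughtExceptionHandler();
        Thread.setDefaultUncaughtExceptionHandler((t, e) -> calls.add("default <- " + t.getName()));
        try {
            Thread withOwn = new Thread(HandlerPrecedence::fail, "withOwn");
            withOwn.setUncaughtExceptionHandler((t, e) -> calls.add("own <- " + t.getName()));
            Thread plain = new Thread(HandlerPrecedence::fail, "plain");
            runToEnd(withOwn);
            runToEnd(plain);
        } finally {
            Thread.setDefaultUncaughtExceptionHandler(saved);
        }
        return calls;
    }

    private static void fail() {
        throw new IllegalStateException("boom");
    }

    // Start a thread and wait for it; its handler has run by the time join returns.
    private static void runToEnd(Thread t) {
        t.start();
        try {
            t.join();
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ie);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [own <- withOwn, default <- plain]
    }
}
```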

Android thread exception handling

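On Android, once an app process has been forked, an uncaught-exception handler is installed for the VM. While the app runs, any exception that a thread throws and does not catch ends up in this handler. In Android N and earlier, it was installed like this: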

Thread.setDefaultUncaughtExceptionHandler(new UncaughtHandler());

UncaughtHandler

public interface UncaughtExceptionHandler {
    void uncaughtException(Thread t, Throwable e);
}
private static class UncaughtHandler implements Thread.UncaughtExceptionHandler {
    public void uncaughtException(Thread t, Throwable e) {
        try {
            // Don't re-enter -- avoid infinite loops if crash-reporting crashes.
            if (mCrashing) return;
            mCrashing = true;

            if (mApplicationObject == null) {
                Clog_e(TAG, "*** FATAL EXCEPTION IN SYSTEM PROCESS: " + t.getName(), e);
            } else {
                // Print the process's crash info
                .............
            }
            .............
            // Bring up crash dialog, wait for it to be dismissed
            // Call into AMS to handle the crash (crash dialog, etc.)
            ActivityManagerNative.getDefault().handleApplicationCrash(
                    mApplicationObject, new ApplicationErrorReport.CrashInfo(e));
        } catch (Throwable t2) {
            if (t2 instanceof DeadObjectException) {
                // System process is dead; ignore
            } else {
                try {
                    Clog_e(TAG, "Error reporting crash", t2);
                } catch (Throwable t3) {
                    // Even Clog_e() fails!  Oh well.
                }
            }
        } finally {
            // At the end of crash handling, kill the process
            Process.killProcess(Process.myPid());
            // and exit
            System.exit(10);
        }
    }
}

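UncaughtExceptionHandler lives in Thread. When an exception is thrown and not caught, it is passed to the UncaughtExceptionHandler and the thread then terminates. With the default handler shown above, the whole process is also killed, whichever thread threw (main or background). Only if an app installs its own default handler that does not kill the process can a background thread die while the app keeps running. If the main thread dies that way, its message loop stops, input and UI events are no longer processed, and the app ends up in an ANR.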

Custom Thread.UncaughtExceptionHandler

Once a default handler is set, every thread in the process that has no handler of its own uses it, as long as its ThreadGroup forwards to it.

public void defaultWay(){
    Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler() {
        @Override
        public void uncaughtException(Thread t, Throwable e) {
            System.out.println("I catch a exception from  " + Thread.currentThread().getName() + ":" + Thread.currentThread().getThreadGroup().getName());
        }
    });

    ThreadGroup myGroup = new ThreadGroup("myGroup");
    new Thread(myGroup, new Runnable() {
                @Override
                public void run() {
                    int i = 1/0;
                }
            }, "thread1").start();

    new Thread(myGroup, new Runnable() {
        @Override
        public void run() {
            int i = 1/0;
        }
    }, "thread2").start();
}

This code creates two threads that both throw, and both exceptions are handled by the same default handler. Output:

I catch a exception from  thread1:myGroup
I catch a exception from  thread2:myGroup

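This result relies on ThreadGroup's standard uncaughtException implementation, which forwards to the default handler, and a subclass can break that mechanism. If the ThreadGroup in the code above is replaced with the following BadGroup, the behavior changes: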

class BadGroup extends ThreadGroup{
    public BadGroup(String name) {
        super(name);
    }

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        System.out.println("I am a bad group and do nothing");
    }
}

With BadGroup, the output is two lines of "I am a bad group and do nothing", and the default handler is never called.
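If a custom ThreadGroup is needed, a safer pattern is to do the extra work and then call super.uncaughtException. That way the parent-group and default-handler chain from ThreadGroup's implementation still runs. A hypothetical sketch (`PoliteGroup` is an illustrative name):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// A ThreadGroup that adds its own step but still calls super, so the default handler runs.
public class PoliteGroupDemo {
    static final List<String> calls = new CopyOnWriteArrayList<>();

    static class PoliteGroup extends ThreadGroup {
        PoliteGroup(String name) {
            super(name);
        }

        @Override
        public void uncaughtException(Thread t, Throwable e) {
            calls.add("group " + getName() + " saw " + t.getName());
            super.uncaughtException(t, e); // forwards to the parent group, then the default handler
        }
    }

    static List<String> demo() {
        calls.clear();
        Thread.UncaughtExceptionHandler saved = Thread.getDefaultUncaughtExceptionHandler();
        Thread.setDefaultUncaughtExceptionHandler((t, e) -> calls.add("default handled " + t.getName()));
        try {
            Thread t = new Thread(new PoliteGroup("myGroup"),
                    () -> { throw new ArithmeticException("/ by zero"); }, "thread1");
            t.start();
            t.join();
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ie);
        } finally {
            Thread.setDefaultUncaughtExceptionHandler(saved);
        }
        return new ArrayList<>(calls);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [group myGroup saw thread1, default handled thread1]
    }
}
```

Unlike BadGroup, this group does not cut the chain, so app-wide crash handling set through setDefaultUncaughtExceptionHandler still sees the exception.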