最近遇到重启卡死的问题,有的是黑屏,有的是定屏在"正在重新启动"adb可用,高通平台。

抓下logcat能发现如下log:

ShutdownThread: Rebooting, reason: null

代码在:/frameworks/base/services/core/java/com/android/server/power/ShutdownThread.java:

        if (reboot) {
            Log.i(TAG, "Rebooting, reason: " + reason);
            PowerManagerService.lowLevelReboot(reason);
            Log.e(TAG, "Reboot failed, will attempt shutdown instead");
            reason = null;

/frameworks/base/services/core/java/com/android/server/power/PowerManagerService.java:

    /**  
     * Low-level function to reboot the device. On success, this
     * function doesn't return. If more than 20 seconds passes from
     * the time a reboot is requested, this method returns.
     *
     * @param reason code to pass to the kernel (e.g. "recovery"), or null.
     */
    public static void lowLevelReboot(String reason) {
        if (reason == null) {
            reason = "";
        }    

        // If the reason is "quiescent", it means that the boot process should proceed
        // without turning on the screen/lights.
        // The "quiescent" property is sticky, meaning that any number
        // of subsequent reboots should honor the property until it is reset.
        if (reason.equals(PowerManager.REBOOT_QUIESCENT)) {
            sQuiescent = true;
            reason = "";
        } else if (reason.endsWith("," + PowerManager.REBOOT_QUIESCENT)) {
            sQuiescent = true;
            reason = reason.substring(0,
                    reason.length() - PowerManager.REBOOT_QUIESCENT.length() - 1);
        }    

        if (reason.equals(PowerManager.REBOOT_RECOVERY)
                || reason.equals(PowerManager.REBOOT_RECOVERY_UPDATE)) {
            reason = "recovery";
        }    

        if (sQuiescent) {
            // Pass the optional "quiescent" argument to the bootloader to let it know
            // that it should not turn the screen/lights on.
            reason = reason + ",quiescent";
        }    

        SystemProperties.set("sys.powerctl", "reboot," + reason); //tj: trigger reboot
        try {
            Thread.sleep(20 * 1000L);
        } catch (InterruptedException e) { 
            Thread.currentThread().interrupt();
        }    
        Slog.wtf(TAG, "Unexpected return from lowLevelReboot!");
    } 

通过设置属性sys.powerctl来重启,属性设置会走HandlePropertySet -> PropertySet -> property_changed(init.cpp):

void property_changed(const std::string& name, const std::string& value) {
    // If the property is sys.powerctl, we bypass the event queue and immediately handle it.
    // This is to ensure that init will always and immediately shutdown/reboot, regardless of
    // if there are other pending events to process or if init is waiting on an exec service or
    // waiting on a property.
    // In non-thermal-shutdown case, 'shutdown' trigger will be fired to let device specific
    // commands to be executed.
    if (name == "sys.powerctl") {
        // Despite the above comment, we can't call HandlePowerctlMessage() in this function,
        // because it modifies the contents of the action queue, which can cause the action queue
        // to get into a bad state if this function is called from a command being executed by the
        // action queue.  Instead we set this flag and ensure that shutdown happens before the next
        // command is run in the main init loop.
        // TODO: once property service is removed from init, this will never happen from a builtin,
        // but rather from a callback from the property service socket, in which case this hack can
        // go away.
        shutdown_command = value;
        do_shutdown = true;
    }

init.cpp的main():

    while (true) {
        // By default, sleep until something happens.
        int epoll_timeout_ms = -1; 

        if (do_shutdown && !shutting_down) {
            do_shutdown = false;
            if (HandlePowerctlMessage(shutdown_command)) {
                shutting_down = true;
            }   
        } 

HandlePowerctlMessage():

bool HandlePowerctlMessage(const std::string& command) {
    unsigned int cmd = 0;
    std::vector<std::string> cmd_params = Split(command, ",");
    std::string reboot_target = ""; 
    bool run_fsck = false;
    bool command_invalid = false;

    if (cmd_params.size() > 3) {
        command_invalid = true;
    } else if (cmd_params[0] == "shutdown") {
        cmd = ANDROID_RB_POWEROFF;
        if (cmd_params.size() == 2) {
            if (cmd_params[1] == "userrequested") {
                // The shutdown reason is PowerManager.SHUTDOWN_USER_REQUESTED.
                // Run fsck once the file system is remounted in read-only mode.
                run_fsck = true;
            } else if (cmd_params[1] == "thermal") {
                // Turn off sources of heat immediately.
                TurnOffBacklight();
                // run_fsck is false to avoid delay
                cmd = ANDROID_RB_THERMOFF;
            }   
        }   
    } else if (cmd_params[0] == "reboot") {
        cmd = ANDROID_RB_RESTART2;
        if (cmd_params.size() >= 2) {
            reboot_target = cmd_params[1];
            // When rebooting to the bootloader notify the bootloader writing
            // also the BCB.
            if (reboot_target == "bootloader") {
                std::string err;
                if (!write_reboot_bootloader(&err)) {
                    LOG(ERROR) << "reboot-bootloader: Error writing "
                                  "bootloader_message: "
                               << err;
                }
            }
            // If there is an additional parameter, pass it along
            if ((cmd_params.size() == 3) && cmd_params[2].size()) {
                reboot_target += "," + cmd_params[2];
            }
        }
    } else {
        command_invalid = true;
    }
    if (command_invalid) {
        LOG(ERROR) << "powerctl: unrecognized command '" << command << "'";
        return false;
    }

    LOG(INFO) << "Clear action queue and start shutdown trigger";
    ActionManager::GetInstance().ClearQueue();
    // Queue shutdown trigger first
    ActionManager::GetInstance().QueueEventTrigger("shutdown");
    // Queue built-in shutdown_done
    auto shutdown_handler = [cmd, command, reboot_target, run_fsck](const BuiltinArguments&) {
        DoReboot(cmd, command, reboot_target, run_fsck);
        return Success();
    };
    ActionManager::GetInstance().QueueBuiltinAction(shutdown_handler, "shutdown_done");

    // Skip wait for prop if it is in progress
    ResetWaitForProp();

    // Clear EXEC flag if there is one pending
    for (const auto& s : ServiceList::GetInstance()) {
        s->UnSetExec();
    }

    return true;
}

LOG(INFO) << "Clear action queue and start shutdown trigger"这句log(check dmesg)有时有,有时没有?继续看DoReboot:

void DoReboot(unsigned int cmd, const std::string& reason, const std::string& rebootTarget,
              bool runFsck) {
    Timer t;
    LOG(INFO) << "Reboot start, reason: " << reason << ", rebootTarget: " << rebootTarget;

    // Ensure last reboot reason is reduced to canonical
    // alias reported in bootloader or system boot reason.
    size_t skip = 0;
    LOG(ERR) << "[YL] Reboot start, reason: " << reason << ", rebootTarget: " << rebootTarget;
    std::vector<std::string> reasons = Split(reason, ",");
    if (reasons.size() >= 2 && reasons[0] == "reboot" &&
        (reasons[1] == "recovery" || reasons[1] == "bootloader" || reasons[1] == "cold" ||
         reasons[1] == "hard" || reasons[1] == "warm")) {
        skip = strlen("reboot,");
    }
    property_set(LAST_REBOOT_REASON_PROPERTY, reason.c_str() + skip);
    sync();

    bool is_thermal_shutdown = cmd == ANDROID_RB_THERMOFF;

    auto shutdown_timeout = 0ms;
    if (!SHUTDOWN_ZERO_TIMEOUT) {
        if (is_thermal_shutdown) {
            constexpr unsigned int thermal_shutdown_timeout = 3;
            shutdown_timeout = std::chrono::seconds(thermal_shutdown_timeout);
        } else {
            constexpr unsigned int shutdown_timeout_default = 12;
            auto shutdown_timeout_property = android::base::GetUintProperty(
                "ro.build.shutdown_timeout", shutdown_timeout_default);
            shutdown_timeout = std::chrono::seconds(shutdown_timeout_property);
        }
    }
    LOG(INFO) << "Shutdown timeout: " << shutdown_timeout.count() << " ms";

奇怪,Reboot start, reasonShutdown timeout这两句log也不是每次都有?这里有个属性LAST_REBOOT_REASON_PROPERTY会记录重启原因:

/* Properties */
#define ANDROID_RB_PROPERTY "sys.powerctl"

/* Android reboot reason stored in this property */
#define LAST_REBOOT_REASON_PROPERTY "persist.sys.boot.reason"

复现了一次抓到如下log,是通过adb reboot压测的:

[   36.255358] init: Received sys.powerctl='reboot,adb' from pid: 2491 (/system/bin/adbd)
[   36.255533] init: Clear action queue and start shutdown trigger
[   36.255731] init: processing action (shutdown_done) from (<Builtin Action>:0)
[   36.255759] init: Reboot start, reason: reboot,adb, rebootTarget: adb
[   36.291254] init: Shutdown timeout: 6000 ms
[   36.291385] init: Could not start shutdown critical service 'chre': Cannot find '/vendor/bin/chre': No such file or directory
[   36.292815] init: starting service 'blank_screen'...
[   36.294620] init: terminating init services
[   36.294804] init: Sending signal 15 to service 'gx_fpd' (pid 3714) process group...
[   36.295203] init: Sending signal 15 to service 'trustonic-daemon' (pid 3295) process group...

继续跟下代码:

    // keep debugging tools until non critical ones are all gone.
    const std::set<std::string> kill_after_apps{"tombstoned", "logd", "adbd"};
    // watchdogd is a vendor specific component but should be alive to complete shutdown safely.
    const std::set<std::string> to_starts{"watchdogd"};
    for (const auto& s : ServiceList::GetInstance()) {
        if (kill_after_apps.count(s->name())) {
            s->SetShutdownCritical();
        } else if (to_starts.count(s->name())) {
            if (auto result = s->Start(); !result) {
                LOG(ERROR) << "Could not start shutdown 'to_start' service '" << s->name()
                           << "': " << result.error();
            }
            s->SetShutdownCritical();
        } else if (s->IsShutdownCritical()) {
            // Start shutdown critical service if not started.
            if (auto result = s->Start(); !result) {
                LOG(ERROR) << "Could not start shutdown critical service '" << s->name()
                           << "': " << result.error();
            }
        }
    }

    // remaining operations (specifically fsck) may take a substantial duration
    if (cmd == ANDROID_RB_POWEROFF || is_thermal_shutdown) {
        TurnOffBacklight();
    }

    Service* bootAnim = ServiceList::GetInstance().FindService("bootanim");
    Service* surfaceFlinger = ServiceList::GetInstance().FindService("surfaceflinger");
    if (bootAnim != nullptr && surfaceFlinger != nullptr && surfaceFlinger->IsRunning()) {
        // will not check animation class separately
        for (const auto& service : ServiceList::GetInstance()) {
            if (service->classnames().count("animation")) service->SetShutdownCritical();
        }
    }

先保留几个critical service,比如tombstoned, logd, adbd, watchdogd。接下来会terminate所有的服务除了这几个critical:

    // optional shutdown step
    // 1. terminate all services except shutdown critical ones. wait for delay to finish
    if (shutdown_timeout > 0ms) {
        LOG(INFO) << "terminating init services";

        // Ask all services to terminate except shutdown critical ones.
        for (const auto& s : ServiceList::GetInstance().services_in_shutdown_order()) {
            if (!s->IsShutdownCritical()) s->Terminate();
        }

        int service_count = 0;
        // Only wait up to half of timeout here
        auto termination_wait_timeout = shutdown_timeout / 2;
        while (t.duration() < termination_wait_timeout) {
            ReapAnyOutstandingChildren();

            service_count = 0;
            for (const auto& s : ServiceList::GetInstance()) {
                // Count the number of services running except shutdown critical.
                // Exclude the console as it will ignore the SIGTERM signal
                // and not exit.
                // Note: SVC_CONSOLE actually means "requires console" but
                // it is only used by the shell.
                if (!s->IsShutdownCritical() && s->pid() != 0 && (s->flags() & SVC_CONSOLE) == 0) {
                    service_count++;
                }
            }

            if (service_count == 0) {
                // All terminable services terminated. We can exit early.
                break;
            }

            // Wait a bit before recounting the number or running services.
            std::this_thread::sleep_for(50ms);
        }
        LOG(INFO) << "Terminating running services took " << t
                  << " with remaining services:" << service_count;
    }

这里会有个等待超时时间termination_wait_timeout,调大些?继续看:

    // minimum safety steps before restarting
    // 2. kill all services except ones that are necessary for the shutdown sequence.
    for (const auto& s : ServiceList::GetInstance().services_in_shutdown_order()) {
        if (!s->IsShutdownCritical()) s->Stop(); //tj: stop non critical
    }
    ReapAnyOutstandingChildren();

    // 3. send volume shutdown to vold
    Service* voldService = ServiceList::GetInstance().FindService("vold");
    if (voldService != nullptr && voldService->IsRunning()) {
        ShutdownVold();
        voldService->Stop();
    } else {
        LOG(INFO) << "vold not running, skipping vold shutdown";
    }
    // logcat stopped here
    for (const auto& s : ServiceList::GetInstance().services_in_shutdown_order()) {
        if (kill_after_apps.count(s->name())) s->Stop();
    }

最后才stop了logcat, adbd等服务。ok,定屏adb不能连的那肯定走到下面去了。下面就是umount:

    // 4. sync, try umount, and optionally run fsck for user shutdown
    {
        Timer sync_timer;
        LOG(INFO) << "sync() before umount...";
        sync();
        LOG(INFO) << "sync() before umount took" << sync_timer;
    }
    UmountStat stat = TryUmountAndFsck(runFsck, shutdown_timeout - t.duration());
    // Follow what linux shutdown is doing: one more sync with little bit delay
    {
        Timer sync_timer;
        LOG(INFO) << "sync() after umount...";
        sync();
        LOG(INFO) << "sync() after umount took" << sync_timer;
    }
static void KillAllProcesses() { android::base::WriteStringToFile("i", "/proc/sysrq-trigger"); }

/* Try umounting all emulated file systems R/W block device cfile systems.
 * This will just try umount and give it up if it fails.
 * For fs like ext4, this is ok as file system will be marked as unclean shutdown
 * and necessary check can be done at the next reboot.
 * For safer shutdown, caller needs to make sure that
 * all processes / emulated partition for the target fs are all cleaned-up.
 *
 * return true when umount was successful. false when timed out.
 */
static UmountStat TryUmountAndFsck(bool runFsck, std::chrono::milliseconds timeout) {
    Timer t;
    std::vector<MountEntry> block_devices;
    std::vector<MountEntry> emulated_devices;

    if (runFsck && !FindPartitionsToUmount(&block_devices, &emulated_devices, false)) {
        return UMOUNT_STAT_ERROR;
    }

    UmountStat stat = UmountPartitions(timeout - t.duration());
    if (stat != UMOUNT_STAT_SUCCESS) {
        LOG(INFO) << "umount timeout, last resort, kill all and try";
        bool dumpUmountDebugInfo = property_get_bool("persist.sys.dumpUmountDebugInfo",false);
        if (dumpUmountDebugInfo) {
            if (DUMP_ON_UMOUNT_FAILURE) DumpUmountDebuggingInfo(true);
        }
        KillAllProcesses();
        // even if it succeeds, still it is timeout and do not run fsck with all processes killed
        UmountStat st = UmountPartitions(0ms);
        if (dumpUmountDebugInfo) {
            if ((st != UMOUNT_STAT_SUCCESS) && DUMP_ON_UMOUNT_FAILURE)
                 DumpUmountDebuggingInfo(false);
        }
    }

我们这里是reboot,不走fsck,sync() before umount这个log也是看不到的。很有可能就是卡在umount里了?

又复现一次:

[   92.971658] scm: secure world has been busy for 1 second!
[   94.331640] scm_call failed: func id 0x7300fa01, ret: -12, syscall returns: 0x0, 0x0, 0x0
[   97.454185] init: Terminating running services took 6023ms with remaining services:3
[   97.454293] init: Sending signal 9 to service 'gx_fpd' (pid 4944) process group...
[   97.680210] libprocessgroup: Failed to kill process cgroup uid 1000 pid 4944 in 225ms, 1 processes remain
[   97.680392] init: Sending signal 9 to service 'trustonic-daemon' (pid 3315) process group...
[   97.905801] libprocessgroup: Failed to kill process cgroup uid 1000 pid 3315 in 225ms, 1 processes remain
[   97.905980] init: Sending signal 9 to service 'vendor.per_mgr' (pid 561) process group...
[   97.914897] libprocessgroup: Successfully killed process cgroup uid 1000 pid 561 in 8ms
[   97.915339] init: Service 'vendor.per_mgr' (pid 561) received signal 9
[   97.951903] vdc: Waited 0ms for vold
[   97.952039] binder: 483:487 transaction failed 29189/-22, size 88-0 line 3017
[  101.601745] Trustonic TEE: wait_mcp_notification: No answer after 10s

这里我们加大过timeout,仍然不起作用,有个gx_fpd杀不掉?ps看下:

xxx:/ # ps -A | grep fpd
system        4944     1   19768   6844 __qseecom_process_incomplete_cmd 0 S gx_fpd
fpd
xxx:/ # ps -A | grep trustonic-daemon
xxx:/ # ps -A | grep 3315
system        3315     1       0      0 0                   0 Z [mcDriverDaemon]
xxx:/ # ps -A | grep Z
USER           PID  PPID     VSZ    RSS WCHAN            ADDR S NAME

system        3315     1       0      0 0                   0 Z [mcDriverDaemon]

root          5094     1       0      0 0                   0 Z [vdc]

Z? 都僵尸了?后来发现指纹没接,接上去重启压测上千次暂未复现。

还有复现的log,这个是用apk做的reboot,log都有限:

<14>[   86.307908] init: Received sys.powerctl='reboot,' from pid: 1368 (system_server)
<14>[   86.308048] init: Clear action queue and start shutdown trigger
<14>[   86.308278] init: processing action (shutdown_done) from (<Builtin Action>:0)
<14>[   86.308300] init: Reboot start, reason: reboot,, rebootTarget:

对于无adb定屏的问题,发现google有一笔patch:

Reland "init: Reboot after timeout passes during reboot"

We're experiencing issues where unmount is forcing the device to be
stuck permanently, so it's better to have this shutdownt thread with a
long timeout (30s) and a potential kernel panic if it fails than a
stuck device.

这里a potential kernel panic指的啥?抛了个reboot线程来执行,如果reboot线程卡住超时,init就复位了。

这笔改动目前状态是合入冲突,我拿过来合入,暂未复现。

Merge Conflict 872492: Reland "init: Reboot after timeout passes during reboot"

Done.