最近遇到重启卡死的问题,有的是黑屏,有的是定屏在”正在重新启动”adb可用,高通平台。

抓下logcat能发现如下log:

ShutdownThread: Rebooting, reason: null

代码在:/frameworks/base/services/core/java/com/android/server/power/ShutdownThread.java:

if (reboot) {
Log.i(TAG, "Rebooting, reason: " + reason);
PowerManagerService.lowLevelReboot(reason);
Log.e(TAG, "Reboot failed, will attempt shutdown instead");
reason = null;

/frameworks/base/services/core/java/com/android/server/power/PowerManagerService.java:

/**  
* Low-level function to reboot the device. On success, this
* function doesn't return. If more than 20 seconds passes from
* the time a reboot is requested, this method returns.
*
* @param reason code to pass to the kernel (e.g. "recovery"), or null.
*/
public static void lowLevelReboot(String reason) {
if (reason == null) {
reason = "";
}

// If the reason is "quiescent", it means that the boot process should proceed
// without turning on the screen/lights.
// The "quiescent" property is sticky, meaning that any number
// of subsequent reboots should honor the property until it is reset.
if (reason.equals(PowerManager.REBOOT_QUIESCENT)) {
sQuiescent = true;
reason = "";
} else if (reason.endsWith("," + PowerManager.REBOOT_QUIESCENT)) {
sQuiescent = true;
reason = reason.substring(0,
reason.length() - PowerManager.REBOOT_QUIESCENT.length() - 1);
}

if (reason.equals(PowerManager.REBOOT_RECOVERY)
|| reason.equals(PowerManager.REBOOT_RECOVERY_UPDATE)) {
reason = "recovery";
}

if (sQuiescent) {
// Pass the optional "quiescent" argument to the bootloader to let it know
// that it should not turn the screen/lights on.
reason = reason + ",quiescent";
}

SystemProperties.set("sys.powerctl", "reboot," + reason); //tj: trigger reboot
try {
Thread.sleep(20 * 1000L);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
Slog.wtf(TAG, "Unexpected return from lowLevelReboot!");
}

通过设置属性sys.powerctl来重启,属性设置会走HandlePropertySet -> PropertySet -> property_changed(init.cpp):

void property_changed(const std::string& name, const std::string& value) {
// If the property is sys.powerctl, we bypass the event queue and immediately handle it.
// This is to ensure that init will always and immediately shutdown/reboot, regardless of
// if there are other pending events to process or if init is waiting on an exec service or
// waiting on a property.
// In non-thermal-shutdown case, 'shutdown' trigger will be fired to let device specific
// commands to be executed.
if (name == "sys.powerctl") {
// Despite the above comment, we can't call HandlePowerctlMessage() in this function,
// because it modifies the contents of the action queue, which can cause the action queue
// to get into a bad state if this function is called from a command being executed by the
// action queue. Instead we set this flag and ensure that shutdown happens before the next
// command is run in the main init loop.
// TODO: once property service is removed from init, this will never happen from a builtin,
// but rather from a callback from the property service socket, in which case this hack can
// go away.
shutdown_command = value;
do_shutdown = true;
}

init.cpp的main():

while (true) {
// By default, sleep until something happens.
int epoll_timeout_ms = -1;

if (do_shutdown && !shutting_down) {
do_shutdown = false;
if (HandlePowerctlMessage(shutdown_command)) {
shutting_down = true;
}
}

HandlePowerctlMessage():

bool HandlePowerctlMessage(const std::string& command) {
unsigned int cmd = 0;
std::vector<std::string> cmd_params = Split(command, ",");
std::string reboot_target = "";
bool run_fsck = false;
bool command_invalid = false;

if (cmd_params.size() > 3) {
command_invalid = true;
} else if (cmd_params[0] == "shutdown") {
cmd = ANDROID_RB_POWEROFF;
if (cmd_params.size() == 2) {
if (cmd_params[1] == "userrequested") {
// The shutdown reason is PowerManager.SHUTDOWN_USER_REQUESTED.
// Run fsck once the file system is remounted in read-only mode.
run_fsck = true;
} else if (cmd_params[1] == "thermal") {
// Turn off sources of heat immediately.
TurnOffBacklight();
// run_fsck is false to avoid delay
cmd = ANDROID_RB_THERMOFF;
}
}
} else if (cmd_params[0] == "reboot") {
cmd = ANDROID_RB_RESTART2;
if (cmd_params.size() >= 2) {
reboot_target = cmd_params[1];
// When rebooting to the bootloader notify the bootloader writing
// also the BCB.
if (reboot_target == "bootloader") {
std::string err;
if (!write_reboot_bootloader(&err)) {
LOG(ERROR) << "reboot-bootloader: Error writing "
"bootloader_message: "
<< err;
}
}
// If there is an additional parameter, pass it along
if ((cmd_params.size() == 3) && cmd_params[2].size()) {
reboot_target += "," + cmd_params[2];
}
}
} else {
command_invalid = true;
}
if (command_invalid) {
LOG(ERROR) << "powerctl: unrecognized command '" << command << "'";
return false;
}

LOG(INFO) << "Clear action queue and start shutdown trigger";
ActionManager::GetInstance().ClearQueue();
// Queue shutdown trigger first
ActionManager::GetInstance().QueueEventTrigger("shutdown");
// Queue built-in shutdown_done
auto shutdown_handler = [cmd, command, reboot_target, run_fsck](const BuiltinArguments&) {
DoReboot(cmd, command, reboot_target, run_fsck);
return Success();
};
ActionManager::GetInstance().QueueBuiltinAction(shutdown_handler, "shutdown_done");

// Skip wait for prop if it is in progress
ResetWaitForProp();

// Clear EXEC flag if there is one pending
for (const auto& s : ServiceList::GetInstance()) {
s->UnSetExec();
}

return true;
}

LOG(INFO) << "Clear action queue and start shutdown trigger"这句log(check dmesg)有时有,有时没有?继续看DoReboot:

void DoReboot(unsigned int cmd, const std::string& reason, const std::string& rebootTarget,
bool runFsck) {
Timer t;
LOG(INFO) << "Reboot start, reason: " << reason << ", rebootTarget: " << rebootTarget;

// Ensure last reboot reason is reduced to canonical
// alias reported in bootloader or system boot reason.
size_t skip = 0;
LOG(ERR) << "[YL] Reboot start, reason: " << reason << ", rebootTarget: " << rebootTarget;
std::vector<std::string> reasons = Split(reason, ",");
if (reasons.size() >= 2 && reasons[0] == "reboot" &&
(reasons[1] == "recovery" || reasons[1] == "bootloader" || reasons[1] == "cold" ||
reasons[1] == "hard" || reasons[1] == "warm")) {
skip = strlen("reboot,");
}
property_set(LAST_REBOOT_REASON_PROPERTY, reason.c_str() + skip);
sync();

bool is_thermal_shutdown = cmd == ANDROID_RB_THERMOFF;

auto shutdown_timeout = 0ms;
if (!SHUTDOWN_ZERO_TIMEOUT) {
if (is_thermal_shutdown) {
constexpr unsigned int thermal_shutdown_timeout = 3;
shutdown_timeout = std::chrono::seconds(thermal_shutdown_timeout);
} else {
constexpr unsigned int shutdown_timeout_default = 12;
auto shutdown_timeout_property = android::base::GetUintProperty(
"ro.build.shutdown_timeout", shutdown_timeout_default);
shutdown_timeout = std::chrono::seconds(shutdown_timeout_property);
}
}
LOG(INFO) << "Shutdown timeout: " << shutdown_timeout.count() << " ms";

奇怪,Reboot start, reasonShutdown timeout这两句log也不是每次都有?这里有个属性LAST_REBOOT_REASON_PROPERTY会记录重启原因:

/* Properties */
#define ANDROID_RB_PROPERTY "sys.powerctl"

/* Android reboot reason stored in this property */
#define LAST_REBOOT_REASON_PROPERTY "persist.sys.boot.reason"

复现了一次抓到如下log,是通过adb reboot压测的:

[   36.255358] init: Received sys.powerctl='reboot,adb' from pid: 2491 (/system/bin/adbd)
[ 36.255533] init: Clear action queue and start shutdown trigger
[ 36.255731] init: processing action (shutdown_done) from (<Builtin Action>:0)
[ 36.255759] init: Reboot start, reason: reboot,adb, rebootTarget: adb
[ 36.291254] init: Shutdown timeout: 6000 ms
[ 36.291385] init: Could not start shutdown critical service 'chre': Cannot find '/vendor/bin/chre': No such file or directory
[ 36.292815] init: starting service 'blank_screen'...
[ 36.294620] init: terminating init services
[ 36.294804] init: Sending signal 15 to service 'gx_fpd' (pid 3714) process group...
[ 36.295203] init: Sending signal 15 to service 'trustonic-daemon' (pid 3295) process group...

继续跟下代码:

// keep debugging tools until non critical ones are all gone.
const std::set<std::string> kill_after_apps{"tombstoned", "logd", "adbd"};
// watchdogd is a vendor specific component but should be alive to complete shutdown safely.
const std::set<std::string> to_starts{"watchdogd"};
for (const auto& s : ServiceList::GetInstance()) {
if (kill_after_apps.count(s->name())) {
s->SetShutdownCritical();
} else if (to_starts.count(s->name())) {
if (auto result = s->Start(); !result) {
LOG(ERROR) << "Could not start shutdown 'to_start' service '" << s->name()
<< "': " << result.error();
}
s->SetShutdownCritical();
} else if (s->IsShutdownCritical()) {
// Start shutdown critical service if not started.
if (auto result = s->Start(); !result) {
LOG(ERROR) << "Could not start shutdown critical service '" << s->name()
<< "': " << result.error();
}
}
}

// remaining operations (specifically fsck) may take a substantial duration
if (cmd == ANDROID_RB_POWEROFF || is_thermal_shutdown) {
TurnOffBacklight();
}

Service* bootAnim = ServiceList::GetInstance().FindService("bootanim");
Service* surfaceFlinger = ServiceList::GetInstance().FindService("surfaceflinger");
if (bootAnim != nullptr && surfaceFlinger != nullptr && surfaceFlinger->IsRunning()) {
// will not check animation class separately
for (const auto& service : ServiceList::GetInstance()) {
if (service->classnames().count("animation")) service->SetShutdownCritical();
}
}

先保留几个critical service,比如tombstoned, logd, adbd, watchdogd。接下来会terminate所有的服务除了这几个critical:

// optional shutdown step
// 1. terminate all services except shutdown critical ones. wait for delay to finish
if (shutdown_timeout > 0ms) {
LOG(INFO) << "terminating init services";

// Ask all services to terminate except shutdown critical ones.
for (const auto& s : ServiceList::GetInstance().services_in_shutdown_order()) {
if (!s->IsShutdownCritical()) s->Terminate();
}

int service_count = 0;
// Only wait up to half of timeout here
auto termination_wait_timeout = shutdown_timeout / 2;
while (t.duration() < termination_wait_timeout) {
ReapAnyOutstandingChildren();

service_count = 0;
for (const auto& s : ServiceList::GetInstance()) {
// Count the number of services running except shutdown critical.
// Exclude the console as it will ignore the SIGTERM signal
// and not exit.
// Note: SVC_CONSOLE actually means "requires console" but
// it is only used by the shell.
if (!s->IsShutdownCritical() && s->pid() != 0 && (s->flags() & SVC_CONSOLE) == 0) {
service_count++;
}
}

if (service_count == 0) {
// All terminable services terminated. We can exit early.
break;
}

// Wait a bit before recounting the number or running services.
std::this_thread::sleep_for(50ms);
}
LOG(INFO) << "Terminating running services took " << t
<< " with remaining services:" << service_count;
}

这里会有个等待超时时间termination_wait_timeout,调大些?继续看:

// minimum safety steps before restarting
// 2. kill all services except ones that are necessary for the shutdown sequence.
for (const auto& s : ServiceList::GetInstance().services_in_shutdown_order()) {
if (!s->IsShutdownCritical()) s->Stop(); //tj: stop non critical
}
ReapAnyOutstandingChildren();

// 3. send volume shutdown to vold
Service* voldService = ServiceList::GetInstance().FindService("vold");
if (voldService != nullptr && voldService->IsRunning()) {
ShutdownVold();
voldService->Stop();
} else {
LOG(INFO) << "vold not running, skipping vold shutdown";
}
// logcat stopped here
for (const auto& s : ServiceList::GetInstance().services_in_shutdown_order()) {
if (kill_after_apps.count(s->name())) s->Stop();
}

最后才stop了logcat, adbd等服务。ok,定屏adb不能连的那肯定走到下面去了。下面就是umount:

// 4. sync, try umount, and optionally run fsck for user shutdown
{
Timer sync_timer;
LOG(INFO) << "sync() before umount...";
sync();
LOG(INFO) << "sync() before umount took" << sync_timer;
}
UmountStat stat = TryUmountAndFsck(runFsck, shutdown_timeout - t.duration());
// Follow what linux shutdown is doing: one more sync with little bit delay
{
Timer sync_timer;
LOG(INFO) << "sync() after umount...";
sync();
LOG(INFO) << "sync() after umount took" << sync_timer;
}
static void KillAllProcesses() { android::base::WriteStringToFile("i", "/proc/sysrq-trigger"); }

/* Try umounting all emulated file systems R/W block device cfile systems.
* This will just try umount and give it up if it fails.
* For fs like ext4, this is ok as file system will be marked as unclean shutdown
* and necessary check can be done at the next reboot.
* For safer shutdown, caller needs to make sure that
* all processes / emulated partition for the target fs are all cleaned-up.
*
* return true when umount was successful. false when timed out.
*/
static UmountStat TryUmountAndFsck(bool runFsck, std::chrono::milliseconds timeout) {
Timer t;
std::vector<MountEntry> block_devices;
std::vector<MountEntry> emulated_devices;

if (runFsck && !FindPartitionsToUmount(&block_devices, &emulated_devices, false)) {
return UMOUNT_STAT_ERROR;
}

UmountStat stat = UmountPartitions(timeout - t.duration());
if (stat != UMOUNT_STAT_SUCCESS) {
LOG(INFO) << "umount timeout, last resort, kill all and try";
bool dumpUmountDebugInfo = property_get_bool("persist.sys.dumpUmountDebugInfo",false);
if (dumpUmountDebugInfo) {
if (DUMP_ON_UMOUNT_FAILURE) DumpUmountDebuggingInfo(true);
}
KillAllProcesses();
// even if it succeeds, still it is timeout and do not run fsck with all processes killed
UmountStat st = UmountPartitions(0ms);
if (dumpUmountDebugInfo) {
if ((st != UMOUNT_STAT_SUCCESS) && DUMP_ON_UMOUNT_FAILURE)
DumpUmountDebuggingInfo(false);
}
}

我们这里是reboot,不走fsck,sync() before umount这个log也是看不到的。很有可能就是卡在umount里了?

又复现一次:

[   92.971658] scm: secure world has been busy for 1 second!
[ 94.331640] scm_call failed: func id 0x7300fa01, ret: -12, syscall returns: 0x0, 0x0, 0x0
[ 97.454185] init: Terminating running services took 6023ms with remaining services:3
[ 97.454293] init: Sending signal 9 to service 'gx_fpd' (pid 4944) process group...
[ 97.680210] libprocessgroup: Failed to kill process cgroup uid 1000 pid 4944 in 225ms, 1 processes remain
[ 97.680392] init: Sending signal 9 to service 'trustonic-daemon' (pid 3315) process group...
[ 97.905801] libprocessgroup: Failed to kill process cgroup uid 1000 pid 3315 in 225ms, 1 processes remain
[ 97.905980] init: Sending signal 9 to service 'vendor.per_mgr' (pid 561) process group...
[ 97.914897] libprocessgroup: Successfully killed process cgroup uid 1000 pid 561 in 8ms
[ 97.915339] init: Service 'vendor.per_mgr' (pid 561) received signal 9
[ 97.951903] vdc: Waited 0ms for vold
[ 97.952039] binder: 483:487 transaction failed 29189/-22, size 88-0 line 3017
[ 101.601745] Trustonic TEE: wait_mcp_notification: No answer after 10s

这里我们加大过timeout,仍然不起作用,有个gx_fpd杀不掉?ps看下:

xxx:/ # ps -A | grep fpd
system 4944 1 19768 6844 __qseecom_process_incomplete_cmd 0 S gx_fpd
fpd
xxx:/ # ps -A | grep trustonic-daemon
xxx:/ # ps -A | grep 3315
system 3315 1 0 0 0 0 Z [mcDriverDaemon]
xxx:/ # ps -A | grep Z
USER PID PPID VSZ RSS WCHAN ADDR S NAME

system 3315 1 0 0 0 0 Z [mcDriverDaemon]

root 5094 1 0 0 0 0 Z [vdc]

Z? 都僵尸了?后来发现指纹没接,接上去重启压测上千次暂未复现。

还有复现的log,这个是用apk做的reboot,log都有限:

<14>[   86.307908] init: Received sys.powerctl='reboot,' from pid: 1368 (system_server)
<14>[ 86.308048] init: Clear action queue and start shutdown trigger
<14>[ 86.308278] init: processing action (shutdown_done) from (<Builtin Action>:0)
<14>[ 86.308300] init: Reboot start, reason: reboot,, rebootTarget:

对于无adb定屏的问题,发现google有一笔patch:

Reland "init: Reboot after timeout passes during reboot"

We're experiencing issues where unmount is forcing the device to be
stuck permanently, so it's better to have this shutdownt thread with a
long timeout (30s) and a potential kernel panic if it fails than a
stuck device.

这里a potential kernel panic指的啥?抛了个reboot线程来执行,如果reboot线程卡住超时,init就复位了。

这笔改动目前状态是合入冲突,我拿过来合入,暂未复现。

Merge Conflict 872492: Reland "init: Reboot after timeout passes during reboot"

Done.