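I recently ran into a problem: after about ten days of reboot stress testing, a large number of phones (Android Go, MSM8909) could no longer boot. A colleague on the framework (FWK) team noticed that the storage was nearly full, as shown below: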

/dev/block/mmcblk0p39 5.2G  4.8G  340M  94% /data

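It turned out that our customized logd had kept writing until the data partition filled up. After deleting the logs, the phone booted normally right away. But why would a full data partition prevent booting? I took a look at Z's phone, and after an hour of fiddling it still had not come up: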

msm8909go:/ $ df -h
Filesystem Size Used Avail Use% Mounted on
rootfs 430M 1.6M 428M 1% /
tmpfs 442M 0 442M 0% /mnt
/dev/block/mmcblk0p34 5.5G 5.2G 0 100% /data // -> data partition is full
/dev/block/mmcblk0p31 104M 1.0M 100M 2% /cache
msm8909go:/ $ ps -A | grep system_s
system 1709 1517 1018068 46772 0 R system_server
msm8909go:/ $ uptime
13:22:16 up 1:06, 0 users, load average: 4.61, 4.63, 3.79
msm8909go:/ $ dmesg
dmesg: klogctl: Permission denied
msm8909go:/ $ ps -A | grep zygote
root 1882 1 967528 56388 poll_schedule_timeout 0 S zygote
msm8909go:/ $
msm8909go:/ $ ps -A | grep system_s
system 2073 1882 1011152 38588 0 R system_server
msm8909go:/ $

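system_server keeps restarting: between the two ps calls its PID changed from 1709 to 2073, and its parent PID changed from 1517 to 1882, so zygote was restarted as well.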

Next, looking at logcat, I found the following zygote log:

01-06 03:25:49.287  1986  1986 E patchoat: Failed to create symlink at /data/dalvik-cache/arm/system@framework@boot.oat error(28): No space left on device
01-06 03:25:49.288 1986 1986 W patchoat: Current thread not detached in Runtime shutdown
01-06 03:25:48.959 1898 1898 W zygote : Pruning dalvik cache because of low-memory situation.
01-06 03:25:48.960 1898 1898 W zygote : Failed to create boot marker.: No such file or directory
01-06 03:25:48.960 1898 1898 W zygote : Low-memory situation: only 0.17 megabytes available, need at least 50. Preemptively pruning the dalvik cache.
01-06 03:25:48.961 1898 1898 I zygote : Pruning dalvik-cache since we are relocating an image and will need to recompile
01-06 03:25:48.962 1898 1898 I zygote : RelocateImage: /system/bin/patchoat --input-image-location=/system/framework/boot.art --output-image-file=/data/dalvik-cache/arm/system@framework@boot.art --instruction-set=arm --base-offset-delta=-6152192
01-06 03:25:49.303 1898 1898 E zygote : Could not create image space with image file '/system/framework/boot.art'. Attempting to fall back to imageless running. Error was: Cannot relocate image /system/framework/boot.art to /data/dalvik-cache/arm/system@framework@boot.art: Failed execv(/system/bin/patchoat --input-image-location=/system/framework/boot.art --output-image-file=/data/dalvik-cache/arm/system@framework@boot.art --instruction-set=arm --base-offset-delta=-6152192) because non-0 exit status
01-06 03:25:49.303 1898 1898 E zygote : Attempted image: /system/framework/boot.art
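
As a quick cross-check of the patchoat line, the number in error(28) is an errno value. The sketch below assumes a Linux-style errno numbering; the helper names IsNoSpaceError and DescribeErrno are mine, not anything from ART.

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical helper: true when an errno value means "No space left on device".
bool IsNoSpaceError(int err) {
  return err == ENOSPC;
}

// Hypothetical helper: returns the C library's message for an errno value,
// such as the 28 that patchoat printed.
std::string DescribeErrno(int err) {
  return std::string(std::strerror(err));
}
```

On the device this matches the text that follows error(28) in the log, which is why the symlink creation under /data/dalvik-cache failed.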

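The log shows both "No space left" and "Low-memory situation". These messages come from the ART virtual machine, so let's first briefly go over the ART concepts and background:

Google's official explanation: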

Android runtime (ART) is the managed runtime used by applications and some system services on Android. ART and its predecessor Dalvik were originally created specifically for the Android project. ART as the runtime executes the Dalvik Executable format and Dex bytecode specification.

ART and Dalvik are compatible runtimes running Dex bytecode, so apps developed for Dalvik should work when running with ART. However, some techniques that work on Dalvik do not work on ART.

ART introduces ahead-of-time (AOT) compilation, which can improve app performance. ART also has tighter install-time verification than Dalvik.

At install time, ART compiles apps using the on-device dex2oat tool. This utility accepts DEX files as input and generates a compiled app executable for the target device. The utility should be able to compile all valid DEX files without difficulty. However, some post-processing tools produce invalid files that may be tolerated by Dalvik but cannot be compiled by ART.

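Android used to run on Dalvik and now uses ART for better performance; ART is backward compatible with Dalvik. ART introduces AOT compilation: the dex2oat tool takes DEX files as input and generates the corresponding executable (in ELF format). dex stands for Dalvik executable file, and odex stands for optimized dex.

Here is Wikipedia's diagram comparing the ART and Dalvik architectures: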

ART_view.png

OK, let's look at the code. Path: art/runtime/gc/space/image_space.cc:

std::unique_ptr<ImageSpace> ImageSpace::CreateBootImage(const char* image_location,
const InstructionSet image_isa,
bool secondary_image,
std::string* error_msg) {
...
// Step 0.b: If we're the zygote, check for free space, and prune the cache preemptively,
// if necessary. While the runtime may be fine (it is pretty tolerant to
// out-of-disk-space situations), other parts of the platform are not.
//
// The advantage of doing this proactively is that the later steps are simplified,
// i.e., we do not need to code retries.
...
if (is_zygote && dalvik_cache_exists) {
DCHECK(!dalvik_cache.empty());
std::string local_error_msg;
if (!CheckSpace(dalvik_cache, &local_error_msg)) {
LOG(WARNING) << local_error_msg << " Preemptively pruning the dalvik cache.";
PruneDalvikCache(image_isa);

// Re-evaluate the image.
found_image = FindImageFilenameImpl(image_location,
image_isa,
&has_system,
&system_filename,
&dalvik_cache_exists,
&dalvik_cache,
&is_global_cache,
&has_cache,
&cache_filename);
}
}

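What "boot image" means here will be covered a little later.

When zygote starts, it checks the free space. If there is too little, it calls PruneDalvikCache to wipe the dalvik cache.

First, the space check: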

static constexpr uint64_t kLowSpaceValue = 50 * MB;
static constexpr uint64_t kTmpFsSentinelValue = 384 * MB;

// Read the free space of the cache partition and make a decision whether to keep the generated
// image. This is to try to mitigate situations where the system might run out of space later.
static bool CheckSpace(const std::string& cache_filename, std::string* error_msg) {
// Using statvfs vs statvfs64 because of b/18207376, and it is enough for all practical purposes.
struct statvfs buf;

int res = TEMP_FAILURE_RETRY(statvfs(cache_filename.c_str(), &buf));
if (res != 0) {
// Could not stat. Conservatively tell the system to delete the image.
*error_msg = "Could not stat the filesystem, assuming low-memory situation.";
return false;
}

uint64_t fs_overall_size = buf.f_bsize * static_cast<uint64_t>(buf.f_blocks);
// Zygote is privileged, but other things are not. Use bavail.
uint64_t fs_free_size = buf.f_bsize * static_cast<uint64_t>(buf.f_bavail);

// Take the overall size as an indicator for a tmpfs, which is being used for the decryption
// environment. We do not want to fail quickening the boot image there, as it is beneficial
// for time-to-UI.
if (fs_overall_size > kTmpFsSentinelValue) {
if (fs_free_size < kLowSpaceValue) {
*error_msg = StringPrintf("Low-memory situation: only %4.2f megabytes available, need at "
"least %" PRIu64 ".",
static_cast<double>(fs_free_size) / MB,
kLowSpaceValue / MB);
return false;
}
}
return true;
}

Start with the comment:

// Read the free space of the cache partition and make a decision whether to keep the generated
// image. This is to try to mitigate situations where the system might run out of space later.

"Cache partition" here does not mean the /cache partition. It means the dalvik cache directory, specifically /data/dalvik-cache on the data partition.

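The free space is taken from fs_free_size, and the minimum threshold ART sets is 50 MB (kLowSpaceValue). So once the data partition has less than 50 MB available, it is treated as a low-memory situation. The check only applies when the filesystem is larger than kTmpFsSentinelValue (384 MB), which is how the code exempts the small tmpfs used for the decryption environment.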
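
To make the threshold concrete, here is a minimal sketch of just the decision part of CheckSpace, with the statvfs results passed in as plain numbers. The two constants are copied from the excerpt above; the function name HasEnoughSpace and the kMB constant are my own, and the real code reads the sizes from statvfs rather than taking them as arguments.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

constexpr uint64_t kMB = 1024 * 1024;
constexpr uint64_t kLowSpaceValue = 50 * kMB;        // value from image_space.cc
constexpr uint64_t kTmpFsSentinelValue = 384 * kMB;  // value from image_space.cc

// Hypothetical helper: returns true when the space is considered sufficient.
// On a "low-memory" verdict it returns false and fills *msg with the same
// text format that appears in the zygote log.
bool HasEnoughSpace(uint64_t fs_overall_size, uint64_t fs_free_size, std::string* msg) {
  // Small filesystems (assumed to be a tmpfs) are never reported as low.
  if (fs_overall_size > kTmpFsSentinelValue && fs_free_size < kLowSpaceValue) {
    char buf[128];
    std::snprintf(buf, sizeof(buf),
                  "Low-memory situation: only %4.2f megabytes available, need at least %llu.",
                  static_cast<double>(fs_free_size) / kMB,
                  static_cast<unsigned long long>(kLowSpaceValue / kMB));
    *msg = buf;
    return false;
  }
  return true;
}
```

Feeding in a 5.5 GB filesystem with roughly 0.17 MB free reproduces the message from the log, while the first phone (340 MB free) would pass the check.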

Next, let's see what pruning the dalvik cache does:

// We are relocating or generating the core image. We should get rid of everything. It is all
// out-of-date. We also don't really care if this fails since it is just a convenience.
// Adapted from prune_dex_cache(const char* subdir) in frameworks/native/cmds/installd/commands.c
// Note this should only be used during first boot.
static void PruneDalvikCache(InstructionSet isa) {
CHECK_NE(isa, kNone);
// Prune the base /data/dalvik-cache.
// Note: GetDalvikCache may return the empty string if the directory doesn't
// exist. It is safe to pass "" to DeleteDirectoryContents, so this is okay.
impl::DeleteDirectoryContents(GetDalvikCache("."), false);
// Prune /data/dalvik-cache/<isa>.
impl::DeleteDirectoryContents(GetDalvikCache(GetInstructionSetString(isa)), false);

// Be defensive. There should be a runtime created here, but this may be called in a test.
if (Runtime::Current() != nullptr) {
Runtime::Current()->SetPrunedDalvikCache(true);
}
}

In other words, the contents of /data/dalvik-cache/ are deleted. Here is what that directory looks like on a phone that boots normally:

8909go:/data/dalvik-cache # du -h
269M ./arm
269M .

That takes up quite a lot of space.
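
For illustration, a rough sketch of what "delete the contents but keep the directory" could look like with std::filesystem. This is not ART's DeleteDirectoryContents, whose body is not shown in this post. It assumes that the `false` argument in the calls above means "do not recurse", so only regular files and symlinks directly inside the directory are removed; the name DeleteRegularEntries is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <filesystem>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: removes regular files and symlinks directly inside
// `dir`, leaving `dir` and any subdirectories in place. Errors are ignored,
// in the spirit of the "we don't really care if this fails" comment.
// Returns the number of entries removed.
std::size_t DeleteRegularEntries(const fs::path& dir) {
  std::error_code ec;
  std::vector<fs::path> victims;
  // Collect first, delete afterwards, so the iteration never observes
  // a directory that is changing underneath it.
  for (fs::directory_iterator it(dir, ec), end; !ec && it != end; it.increment(ec)) {
    std::error_code st_ec;
    const fs::file_status st = it->symlink_status(st_ec);
    if (!st_ec && (fs::is_regular_file(st) || fs::is_symlink(st))) {
      victims.push_back(it->path());
    }
  }
  std::size_t removed = 0;
  for (const fs::path& p : victims) {
    std::error_code rm_ec;
    if (fs::remove(p, rm_ec)) ++removed;
  }
  return removed;
}
```

Under that assumption, calling it once on the base directory and once on the per-ISA subdirectory mirrors the two calls in PruneDalvikCache.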

Let's continue with CreateBootImage:

// Step 1: Check if we have an existing and relocated image.

// Step 1.a: Have files in system and cache. Then they need to match.
if (found_image && has_system && has_cache) {
std::string local_error_msg;
// Check that the files are matching.
if (ChecksumsMatch(system_filename.c_str(), cache_filename.c_str(), &local_error_msg)) {
std::unique_ptr<ImageSpace> relocated_space =
ImageSpaceLoader::Load(image_location,
cache_filename,
is_zygote,
is_global_cache,
/* validate_oat_file */ false,
&local_error_msg);
if (relocated_space != nullptr) {
return relocated_space;
}
}
error_msgs.push_back(local_error_msg);
}

// Step 1.b: Only have a cache file.
if (found_image && !has_system && has_cache) {
std::string local_error_msg;
std::unique_ptr<ImageSpace> cache_space =
ImageSpaceLoader::Load(image_location,
cache_filename,
is_zygote,
is_global_cache,
/* validate_oat_file */ true,
&local_error_msg);
if (cache_space != nullptr) {
return cache_space;
}
error_msgs.push_back(local_error_msg);
}

Whether an image was found (found_image) is decided by FindImageFilenameImpl:

static bool FindImageFilenameImpl(const char* image_location,
const InstructionSet image_isa,
bool* has_system,
std::string* system_filename,
bool* dalvik_cache_exists,
std::string* dalvik_cache,
bool* is_global_cache,
bool* has_cache,
std::string* cache_filename) {
DCHECK(dalvik_cache != nullptr);

*has_system = false;
*has_cache = false;
// image_location = /system/framework/boot.art
// system_image_location = /system/framework/<image_isa>/boot.art
std::string system_image_filename(GetSystemImageFilename(image_location, image_isa));
if (OS::FileExists(system_image_filename.c_str())) {
*system_filename = system_image_filename;
*has_system = true;
}

bool have_android_data = false;
*dalvik_cache_exists = false;
GetDalvikCache(GetInstructionSetString(image_isa),
true,
dalvik_cache,
&have_android_data,
dalvik_cache_exists,
is_global_cache);

if (have_android_data && *dalvik_cache_exists) {
// Always set output location even if it does not exist,
// so that the caller knows where to create the image.
//
// image_location = /system/framework/boot.art
// *image_filename = /data/dalvik-cache/<image_isa>/boot.art
std::string error_msg;
if (!GetDalvikCacheFilename(image_location,
dalvik_cache->c_str(),
cache_filename,
&error_msg)) {
LOG(WARNING) << error_msg;
return *has_system;
}
*has_cache = OS::FileExists(cache_filename->c_str());
}
return *has_system || *has_cache;
}

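OK, so the "file in system" is /system/framework/boot.art (resolved to /system/framework/<image_isa>/boot.art, per the comment in the code), and the "file in cache" is /data/dalvik-cache/<image_isa>/boot.art.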
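
The code comment abbreviates the cache name; the zygote log above shows the actual file as /data/dalvik-cache/arm/system@framework@boot.art. The sketch below reproduces both mappings. It is inferred only from the code comment and that log line, and the helper names SystemImageFilename and DalvikCacheFilename are hypothetical rather than ART's own functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper following the code comment:
//   /system/framework/boot.art -> /system/framework/<isa>/boot.art
std::string SystemImageFilename(const std::string& location, const std::string& isa) {
  const std::size_t slash = location.rfind('/');
  if (slash == std::string::npos) return isa + "/" + location;
  return location.substr(0, slash + 1) + isa + "/" + location.substr(slash + 1);
}

// Hypothetical helper reproducing the name seen in the log: drop the leading
// '/', turn every remaining '/' into '@', and place the result in cache_dir.
std::string DalvikCacheFilename(const std::string& location, const std::string& cache_dir) {
  std::string flat = (!location.empty() && location[0] == '/') ? location.substr(1) : location;
  std::replace(flat.begin(), flat.end(), '/', '@');
  return cache_dir + "/" + flat;
}
```

With "arm" as the ISA, the two helpers give /system/framework/arm/boot.art and /data/dalvik-cache/arm/system@framework@boot.art, the latter being the path patchoat was asked to write.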

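If both files exist and match, the function returns successfully. If they do not match, it moves on to step 1.b: when only the cache file exists and it loads fine, that also returns successfully. If only the system file exists, we get to step 2: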

// Step 2: We have an existing image in /system.

// Step 2.a: We are not required to relocate it. Then we can use it directly.
bool relocate = Runtime::Current()->ShouldRelocate();

if (found_image && has_system && !relocate) {
std::string local_error_msg;
std::unique_ptr<ImageSpace> system_space =
ImageSpaceLoader::Load(image_location,
system_filename,
is_zygote,
is_global_cache,
/* validate_oat_file */ false,
&local_error_msg);
if (system_space != nullptr) {
return system_space;
}
error_msgs.push_back(local_error_msg);
}

If relocation is not required, the system image is loaded directly and returned. What if relocation is required?

// Step 2.b: We require a relocated image. Then we must patch it. This step fails if this is a
// secondary image.
if (found_image && has_system && relocate) {
std::string local_error_msg;
if (!Runtime::Current()->IsImageDex2OatEnabled()) {
local_error_msg = "Patching disabled.";
} else if (secondary_image) {
// We really want a working image. Prune and restart.
PruneDalvikCache(image_isa);
_exit(1);
} else if (ImageCreationAllowed(is_global_cache, image_isa, &local_error_msg)) {
bool patch_success =
RelocateImage(image_location, cache_filename.c_str(), image_isa, &local_error_msg);
if (patch_success) {
std::unique_ptr<ImageSpace> patched_space =
ImageSpaceLoader::Load(image_location,
cache_filename,
is_zygote,
is_global_cache,
/* validate_oat_file */ false,
&local_error_msg);
if (patched_space != nullptr) {
return patched_space;
}
}
}
error_msgs.push_back(StringPrintf("Cannot relocate image %s to %s: %s",
image_location,
cache_filename.c_str(),
local_error_msg.c_str()));
}

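It first checks whether patching is disabled. If patching is enabled, it checks whether this is a secondary_image; if so, it must delete the dalvik-cache contents and restart the process via _exit(1). If it is the first image, it calls RelocateImage to relocate it, provided ImageCreationAllowed permits that.

On to step 3: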

// Step 3: We do not have an existing image in /system, so generate an image into the dalvik
// cache. This step fails if this is a secondary image.
if (!has_system) {
std::string local_error_msg;
if (!Runtime::Current()->IsImageDex2OatEnabled()) {
local_error_msg = "Image compilation disabled.";
} else if (secondary_image) {
local_error_msg = "Cannot compile a secondary image.";
} else if (ImageCreationAllowed(is_global_cache, image_isa, &local_error_msg)) {
bool compilation_success = GenerateImage(cache_filename, image_isa, &local_error_msg);
if (compilation_success) {
std::unique_ptr<ImageSpace> compiled_space =
ImageSpaceLoader::Load(image_location,
cache_filename,
is_zygote,
is_global_cache,
/* validate_oat_file */ false,
&local_error_msg);
if (compiled_space != nullptr) {
return compiled_space;
}
}
}
error_msgs.push_back(StringPrintf("Cannot compile image to %s: %s",
cache_filename.c_str(),
local_error_msg.c_str()));
}

If there is no boot image under /system, one has to be generated into /data/dalvik-cache. The flow is the same as for relocation, and a secondary image fails here as well.
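
To summarize steps 1 to 3, here is a small sketch that maps the flags to the first branch whose condition matches. It is a simplification under two assumptions: ImageCreationAllowed is taken to return true, and the fall-through to later steps after a failed load is not modeled. The names Step, Inputs and FirstMatchingStep are mine.

```cpp
#include <cassert>

enum class Step {
  kLoadCacheAfterChecksum,  // Step 1.a
  kLoadCacheOnly,           // Step 1.b
  kLoadSystemDirectly,      // Step 2.a
  kPatchingDisabled,        // Step 2.b, patching disabled
  kPruneAndExit,            // Step 2.b, secondary image: PruneDalvikCache + _exit(1)
  kRelocateIntoCache,       // Step 2.b, first image: RelocateImage
  kCompilationDisabled,     // Step 3, compilation disabled
  kCannotCompileSecondary,  // Step 3, secondary image
  kGenerateIntoCache,       // Step 3, first image: GenerateImage
  kNothingApplies
};

struct Inputs {
  bool found_image;
  bool has_system;
  bool has_cache;
  bool relocate;
  bool image_dex2oat_enabled;
  bool secondary_image;
};

// Returns the first branch of CreateBootImage whose condition matches.
Step FirstMatchingStep(const Inputs& in) {
  if (in.found_image && in.has_system && in.has_cache) return Step::kLoadCacheAfterChecksum;
  if (in.found_image && !in.has_system && in.has_cache) return Step::kLoadCacheOnly;
  if (in.found_image && in.has_system && !in.relocate) return Step::kLoadSystemDirectly;
  if (in.found_image && in.has_system && in.relocate) {
    if (!in.image_dex2oat_enabled) return Step::kPatchingDisabled;
    if (in.secondary_image) return Step::kPruneAndExit;
    return Step::kRelocateIntoCache;
  }
  if (!in.has_system) {
    if (!in.image_dex2oat_enabled) return Step::kCompilationDisabled;
    if (in.secondary_image) return Step::kCannotCompileSecondary;
    return Step::kGenerateIntoCache;
  }
  return Step::kNothingApplies;
}
```

For the failing phone, the cache has just been pruned, so has_cache is false while has_system and relocate are true: the first image lands on RelocateImage (the patchoat call that then fails in the log), and a secondary image would land on the prune-and-exit branch.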

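OK, that completes the walk through CreateBootImage. Did you notice the _exit path for a secondary image? It opens up the following possibility:

If the data partition has less than 50 MB free and the boot image being created is not the first image, the process will keep pruning and exiting in an endless loop, right?

Looking at the Android 9.0 ART source, this appears to have been fixed.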

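A few questions remain. What is relocation? It is done for security reasons, and patchoat is the tool that performs it; see the references for details. And what is the boot image? Here is an answer from Stack Overflow: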

Pre-ART, Android used the Zygote to fork each app process and preload and preinitialize some classes for optimization purposes. On ART, the set of jar libraries that should be preloaded into each app process is compiled once into the so called boot image. It consists of two files, boot.oat and boot.art. Boot.oat contains the compiled code while boot.art contains a preinitialized heap etc. Both are also generated by dex2oat. This boot image is loaded into each app’s process as an optimization.

As for what a secondary image is, I don't understand it yet. It comes from gc/heap, and I'll look into it later.

References