Android P刷机后第一次启动停在了加密界面,提示如下:

Wait while your phone is being encrypted.

这里是data分区在加密,system/vold/cryptsfs.cpp负责,先看看logcat过滤Cryptfs:

01-28 04:50:11.003 D/Cryptfs (  497): Just asked init to shut down class main
01-28 04:50:14.245 I/Cryptfs ( 497): Using scrypt with keymaster for cryptfs KDF
01-28 04:50:14.698 I/Cryptfs ( 497): Signing safely-padded object
01-28 04:50:15.660 I/Cryptfs ( 497): Using scrypt with keymaster for cryptfs KDF
01-28 04:50:16.113 I/Cryptfs ( 497): Signing safely-padded object
01-28 04:50:16.941 D/Cryptfs ( 497): Just triggered post_fs_data
01-28 04:50:17.085 D/Cryptfs ( 497): post_fs_data done
01-28 04:50:18.440 I/Cryptfs ( 497): Signing safely-padded object
01-28 04:50:18.997 I/Cryptfs ( 497): Extra parameters for dm_crypt: fde_enabled ice
01-28 04:50:18.997 I/Cryptfs ( 497): target_type = req-crypt
01-28 04:50:18.997 I/Cryptfs ( 497): real_blk_name = /dev/block/bootdevice/by-name/userdata, extra_params = fde_enabled ice
01-28 04:50:19.085 D/Cryptfs ( 497): Just triggered restart_min_framework
01-28 04:50:19.085 I/Cryptfs ( 497): Using scrypt with keymaster for cryptfs KDF
12-31 16:00:00.870 E/Cryptfs ( 497): Cannot remove dm-crypt device
12-31 16:00:00.875 I/Cryptfs ( 497): cryptfs_check_passwd
12-31 16:00:00.876 D/Cryptfs ( 497): crypt_ftr->fs_size = 43900863
12-31 16:00:00.876 I/Cryptfs ( 497): Using scrypt with keymaster for cryptfs KDF
12-31 16:00:01.336 I/Cryptfs ( 497): Signing safely-padded object
12-31 16:00:02.237 E/Cryptfs ( 497): Cannot create dm-crypt device userdata: Device or resource busy
12-31 16:00:02.237 E/Cryptfs ( 497): Error creating decrypted block device
12-31 16:00:02.237 E/Cryptfs ( 497): Encrypted filesystem not validated, abortin

有几处错误:

12-31 16:00:00.870 E/Cryptfs (  497): Cannot remove dm-crypt device
12-31 16:00:02.237 E/Cryptfs ( 497): Cannot create dm-crypt device userdata: Device or resource busy
12-31 16:00:02.237 E/Cryptfs ( 497): Error creating decrypted block device
12-31 16:00:02.237 E/Cryptfs ( 497): Encrypted filesystem not validated, abortin

找到相关代码:

static int delete_crypto_blk_dev(const char *name)
{
int fd;
char buffer[DM_CRYPT_BUF_SIZE];
struct dm_ioctl *io;
int retval = -1;
int err;
int retries = 10;

if ((fd = open("/dev/device-mapper", O_RDWR|O_CLOEXEC)) < 0 ) {
SLOGE("Cannot open device-mapper\n");
goto errout;
}

io = (struct dm_ioctl *) buffer;

ioctl_init(io, DM_CRYPT_BUF_SIZE, name, 0);
if (ioctl(fd, DM_DEV_REMOVE, io)) {
SLOGE("Cannot remove dm-crypt device\n");
goto errout;
}

/* We made it here with no errors. Woot! */
retval = 0;

errout:
close(fd); /* If fd is <0 from a failed open call, it's safe to just ignore the close error */

return retval;

}

DM_DEV_REMOVE执行错误,对应kernel是在drivers/md/dm-ioctl.c的dev_remove:

static int dev_remove(struct dm_ioctl *param, size_t param_size)
{
struct hash_cell *hc;
struct mapped_device *md;
int r;
struct dm_table *t;

down_write(&_hash_lock);
hc = __find_device_hash_cell(param);

if (!hc) {
DMDEBUG_LIMIT("device doesn't appear to be in the dev hash table.");
up_write(&_hash_lock);
return -ENXIO;
}

md = hc->md;

/*
* Ensure the device is not open and nothing further can open it.
*/
r = dm_lock_for_deletion(md, !!(param->flags & DM_DEFERRED_REMOVE), false);
if (r) {
if (r == -EBUSY && param->flags & DM_DEFERRED_REMOVE) {
up_write(&_hash_lock);
dm_put(md);
return 0;
}
DMDEBUG_LIMIT("unable to remove open device %s", hc->name);
up_write(&_hash_lock);
dm_put(md);
return r; //tj: fail
}

t = __hash_remove(hc);
up_write(&_hash_lock);

if (t) {
dm_sync_table(md);
dm_table_destroy(t);
}

param->flags &= ~DM_DEFERRED_REMOVE;

if (!dm_kobject_uevent(md, KOBJ_REMOVE, param->event_nr))
param->flags |= DM_UEVENT_GENERATED_FLAG;

dm_put(md);
dm_destroy(md);
return 0;
}

从代码看fail只有我标注fail的那里,进dm_lock_for_deletion里看下:

/*
* Guarantees nothing is using the device before it's deleted.
*/
int dm_lock_for_deletion(struct mapped_device *md, bool mark_deferred, bool only_deferred)
{
int r = 0;

spin_lock(&_minor_lock);

if (dm_open_count(md)) {
r = -EBUSY;
if (mark_deferred)
set_bit(DMF_DEFERRED_REMOVE, &md->flags);
} else if (only_deferred && !test_bit(DMF_DEFERRED_REMOVE, &md->flags))
r = -EEXIST;
else
set_bit(DMF_DELETING, &md->flags);

spin_unlock(&_minor_lock);

return r;
}

只有一种可能就是返回-EBUSY,rt? 谁打开userdata分区?没改啥流程啊,提问QCOM让在QCOM参考机上测试,如果复现再提,还指出了如下错误:

01-06 20:36:44.037   490   493 D vold    : Disk at 253:2 changed
01-06 20:36:44.037 490 670 D vold : cryptfs_enable_inplace(/dev/block/dm-2, /dev/block/bootdevice/by-name/userdata, 43900863, 0x7017efe900, 43900863, 0, 1)
01-06 20:36:44.037 490 670 D vold : Opening/dev/block/bootdevice/by-name/userdata
01-06 20:36:44.037 490 670 D vold : Opening/dev/block/dm-2
01-06 20:36:44.037 490 670 E vold : Error opening crypto_blkdev /dev/block/dm-2 for ext4 inplace encrypt, retrying: No such file or directory

回头来看,他应该不熟悉加密流程,这是vold正常打印,我们是ext4,cryptfs_enable_inplace会call cryptfs_enable_inplace_ext4

static int cryptfs_enable_inplace_ext4(char* crypto_blkdev, char* real_blkdev, off64_t size,
off64_t* size_already_done, off64_t tot_size,
off64_t previously_encrypted_upto,
bool set_progress_properties) {
...
LOG(DEBUG) << "Opening" << crypto_blkdev;
#if defined(CONFIG_HW_DISK_ENCRYPTION) && defined(CONFIG_HW_DISK_ENCRYPT_PERF)
...
#else
// Wait until the block device appears. Re-use the mount retry values since it is reasonable.
while ((data.cryptofd = open(crypto_blkdev, O_WRONLY|O_CLOEXEC)) < 0) {
if (--retries) {
PLOG(ERROR) << "Error opening crypto_blkdev " << crypto_blkdev //tj: retry here
<< " for ext4 inplace encrypt, retrying";
sleep(RETRY_MOUNT_DELAY_SECONDS);
} else {
PLOG(ERROR) << "Error opening crypto_blkdev " << crypto_blkdev
<< " for ext4 inplace encrypt";
rc = ENABLE_INPLACE_ERR_DEV;
goto errout;
}
}
#endif

可以看到出错是有可能的,要”Wait util the block device appears”。

我们再回来看下出问题的logcat上下文:

12-31 16:00:00.863 D/vold    (  497): Found matching PID 968
12-31 16:00:00.863 W/vold ( 497): Skipping due to root namespace
12-31 16:00:00.863 D/vold ( 497): Found matching PID 970
12-31 16:00:00.863 W/vold ( 497): Skipping due to root namespace
12-31 16:00:00.863 D/vold ( 497): Found matching PID 1078
12-31 16:00:00.863 W/vold ( 497): Skipping due to root namespace
12-31 16:00:00.863 D/vold ( 497): Found matching PID 1288
12-31 16:00:00.864 I/vold ( 497): Encrypting group 167
12-31 16:00:00.869 I/vold ( 497): Encrypted to sector 2686390272
12-31 16:00:00.870 D/vold ( 497): cryptfs_enable_inplace_ext4 success
12-31 16:00:00.870 E/Cryptfs ( 497): Cannot remove dm-crypt device
12-31 16:00:00.875 I/Cryptfs ( 497): cryptfs_check_passwd

再看下正常的logcat上下文:

12-31 16:00:00.881   496   496 D vold    : Remounting 10001 as mode none
...
12-31 16:00:01.941 496 664 D vold : cryptfs_enable_inplace_ext4 success
...
12-31 16:00:02.107 496 664 I Cryptfs : cryptfs_check_passwd

没有看出什么明显异常,”元芳 你怎么看?”

其实已经怀疑一个地方了后面再说,除了这个就是流程是不是有问题?说不定我司改了啥我不知道?其实Google/vold在这里就没有处理remove失败的情况,这点不够健壮,rt?

我们来看加密是如何来的:

VoldNativeService.cpp:fdeEnable->fdeEnableInternal ->
cryptfs.cpp:cryptfs_enable -> cryptfs_enable_internal -> cryptfs_enable_all_volumes ->
EncryptInplace.cpp:cryptfs_enable_inplace -> cryptfs_enable_inplace_ext4

有个可执行文件vdc会call vold的fdeEnable

} else if (args[0] == "cryptfs" && args[1] == "enablecrypto") {
int passwordType = android::os::IVold::PASSWORD_TYPE_DEFAULT;
int encryptionFlags = android::os::IVold::ENCRYPTION_FLAG_NO_UI;
checkStatus(vold->fdeEnable(passwordType, "", encryptionFlags));

参数enablecrypto传入是通过vdc.rc:

# One shot invocation to encrypt unencrypted volumes
on encrypt
start surfaceflinger
exec - root -- /system/bin/vdc --wait cryptfs enablecrypto //here
# vold will set vold.decrypt to trigger_restart_framework (default
# encryption)

那这个脚本是怎么trigger的?以前我们分析过其他问题是也提到过是init通过queue来的,具体是在./system/core/init/builtins.cpp:

/* Queue event based on fs_mgr return code.
*
* code: return code of fs_mgr_mount_all
*
* This function might request a reboot, in which case it will
* not return.
*
* return code is processed based on input code
*/
static Result<Success> queue_fs_event(int code) {
std::string bootmode = GetProperty("ro.bootmode", "");
if (strncmp(bootmode.c_str(), "ffbm", strlen("ffbm")) == 0) {
LOG(ERROR) << "ffbm mode, not start class main\n";
return Success();
}
if (code == FS_MGR_MNTALL_DEV_NEEDS_ENCRYPTION) {
ActionManager::GetInstance().QueueEventTrigger("encrypt"); //here
return Success();
} else if (code == FS_MGR_MNTALL_DEV_MIGHT_BE_ENCRYPTED) {

看注释应该是fs_mgr_mount_all要返回FS_MGR_MNTALL_DEV_NEEDS_ENCRYPTION就会触发加密。

意思就是需要加密你就加密吧,往前追就是do_mount_all了:

/* mount_all <fstab> [ <path> ]* [--<options>]*
*
* This function might request a reboot, in which case it will
* not return.
*/
static Result<Success> do_mount_all(const BuiltinArguments& args) {
...
std::string prop_name = "ro.boottime.init.mount_all."s + prop_post_fix;
android::base::Timer t;
auto mount_fstab_return_code = mount_fstab(fstabfile, mount_mode);
if (!mount_fstab_return_code) {
return Error() << "mount_fstab() failed " << mount_fstab_return_code.error();
}
property_set(prop_name, std::to_string(t.duration().count()));

if (import_rc) {
/* Paths of .rc files are specified at the 2nd argument and beyond */
import_late(args.args, 2, path_arg_end);
}

if (queue_event) {
/* queue_fs_event will queue event based on mount_fstab return code
* and return processed return code*/
auto queue_fs_result = queue_fs_event(*mount_fstab_return_code);
if (!queue_fs_result) {
return Error() << "queue_fs_event() failed: " << queue_fs_result.error();
}
}

return Success();
}

先走mount_fstab,再按照结果处理, mount_fstab会call fs_mgr_mount_all:

/* When multiple fstab records share the same mount_point, it will
* try to mount each one in turn, and ignore any duplicates after a
* first successful mount.
* Returns -1 on error, and FS_MGR_MNTALL_* otherwise.
*/
int fs_mgr_mount_all(struct fstab *fstab, int mount_mode)
{

可见,应该就是按fstab里的条目来依次mount,继续看:

mret = mount_with_alternatives(fstab, i, &last_idx_inspected, &attempted_idx);
i = last_idx_inspected;
mount_errno = errno;

/* Deal with encryptability. */
if (!mret) {
int status = handle_encryptable(&fstab->recs[attempted_idx]);

mount_with_alternatives是正常的mount,如果返回0没有错误,会check 是否具备加密能力:

// Check to see if a mountable volume has encryption requirements
static int handle_encryptable(const struct fstab_rec* rec)
{
/* If this is block encryptable, need to trigger encryption */
if (needs_block_encryption(rec)) {
if (umount(rec->mount_point) == 0) {
return FS_MGR_MNTALL_DEV_NEEDS_ENCRYPTION;
} else {
PWARNING << "Could not umount " << rec->mount_point
<< " - allow continue unencrypted";
return FS_MGR_MNTALL_DEV_NOT_ENCRYPTED;
}
}

check needs_block_encryption:

static bool needs_block_encryption(const struct fstab_rec* rec)
{
if (android::base::GetBoolProperty("ro.vold.forceencryption", false) &&
fs_mgr_is_encryptable(rec))
return true;
if (rec->fs_mgr_flags & MF_FORCECRYPT) return true;
static struct flag_list fs_mgr_flags[] = {
...
{"forceencrypt=", MF_FORCECRYPT},

fstab里会加入forceencrypt:

#<src>                                      <mnt_point>  <type>  <mnt_flags and options>                     <fs_mgr_flags>
/dev/block/bootdevice/by-name/system / ext4 ro,barrier=1,discard wait,slotselect,avb
/dev/block/bootdevice/by-name/userdata /data ext4 noatime,nosuid,nodev,barrier=1,noauto_da_alloc,discard wait,forceencrypt=footer,quota,reservedsize=128M

clear, rt?

ok, 为了帮助确认分析问题,我放开了kernel dm debug,加入了ioctl cmd log,另外vold remove那里加入了delay和retry,竟然ok了。

[    0.000000] device-mapper: init: will configure 1 devices
[ 1.263421] device-mapper: uevent: version 1.0.3
[ 1.263580] device-mapper: ioctl: 4.35.0-ioctl (2016-06-23) initialised: dm-devel@redhat.com
[ 1.263787] device-mapper: req-crypt: dm-req-crypt successfully initalized.\x0a
[ 3.607087] device-mapper: init: attempting early device configuration.
[ 3.607101] device-mapper: init DEBUG: parsed 1 targets
[ 3.607768] device-mapper: init DEBUG: created device '253:0'
[ 3.607786] device-mapper: init: adding target '0 5643752 verity 1 PART...
[ 3.608872] device-mapper: init: dm-0 is ready
[ 3.789844] device-mapper: ioctl: [tj] cmd=3
[ 3.790567] device-mapper: ioctl: [tj] cmd=7
[ 3.790852] device-mapper: ioctl: [tj] cmd=9
[ 3.791799] device-mapper: ioctl: [tj] cmd=6
[ 10.834008] device-mapper: ioctl: [tj] cmd=2
[ 22.435064] device-mapper: ioctl: [tj] cmd=3
[ 22.436824] device-mapper: ioctl: [tj] cmd=7
[ 22.437049] device-mapper: ioctl: [tj] cmd=9
[ 22.437073] device-mapper: req-crypt DEBUG: dm-req-crypt Constructor.\x0a
[ 22.437189] device-mapper: req-crypt: req_crypt_ctr: Mapping block_device /dev/block/bootdevice/by-name/userdata to dm-req-crypt ok!\x0a
[ 22.437775] device-mapper: ioctl: [tj] cmd=6
[ 30.346430] device-mapper: ioctl: [tj] cmd=4
[ 30.346443] device-mapper: ioctl: unable to remove open device userdata
[ 31.346680] device-mapper: ioctl: [tj] cmd=4
[ 31.347242] device-mapper: req-crypt DEBUG: dm-req-crypt Destructor.\x0a
[ 32.843243] device-mapper: ioctl: [tj] cmd=3
[ 32.845567] device-mapper: ioctl: [tj] cmd=7
[ 32.845850] device-mapper: ioctl: [tj] cmd=9
[ 32.845881] device-mapper: req-crypt DEBUG: dm-req-crypt Constructor.\x0a
[ 32.846010] device-mapper: req-crypt: req_crypt_ctr: Mapping block_device /dev/block/bootdevice/by-name/userdata to dm-req-crypt ok!\x0a
[ 32.846667] device-mapper: ioctl: [tj] cmd=6

其中cmd=4是remove, 可见第一次失败,第二次ok了!。cmd 3,7,9,6是一个create_crypto_blk_dev里,最早那一组3 7 9 6应该是cmdline传上来的avb相关,也就是’253:0’。

xxx:/dev/block # ls -l dm*
brw------- 1 root root 253, 0 1970-01-06 21:29 dm-0 //system, should be dm-verity
brw------- 1 root root 253, 1 1970-01-06 21:29 dm-1 //vendor, should be dm-verity
brw------- 1 root root 253, 2 2008-12-31 16:00 dm-2 //userdata, dm-crypt

所以Google原生这里异常处理我觉得应该是有问题的,具体patch如下: