Android P刷机后第一次启动停在了加密界面,提示如下:

Wait while your phone is being encrypted.

这里是data分区在加密,system/vold/cryptsfs.cpp负责,先看看logcat过滤Cryptfs:

01-28 04:50:11.003 D/Cryptfs (  497): Just asked init to shut down class main
01-28 04:50:14.245 I/Cryptfs (  497): Using scrypt with keymaster for cryptfs KDF
01-28 04:50:14.698 I/Cryptfs (  497): Signing safely-padded object
01-28 04:50:15.660 I/Cryptfs (  497): Using scrypt with keymaster for cryptfs KDF
01-28 04:50:16.113 I/Cryptfs (  497): Signing safely-padded object
01-28 04:50:16.941 D/Cryptfs (  497): Just triggered post_fs_data
01-28 04:50:17.085 D/Cryptfs (  497): post_fs_data done
01-28 04:50:18.440 I/Cryptfs (  497): Signing safely-padded object
01-28 04:50:18.997 I/Cryptfs (  497): Extra parameters for dm_crypt: fde_enabled ice
01-28 04:50:18.997 I/Cryptfs (  497): target_type = req-crypt
01-28 04:50:18.997 I/Cryptfs (  497): real_blk_name = /dev/block/bootdevice/by-name/userdata, extra_params = fde_enabled ice
01-28 04:50:19.085 D/Cryptfs (  497): Just triggered restart_min_framework
01-28 04:50:19.085 I/Cryptfs (  497): Using scrypt with keymaster for cryptfs KDF
12-31 16:00:00.870 E/Cryptfs (  497): Cannot remove dm-crypt device
12-31 16:00:00.875 I/Cryptfs (  497): cryptfs_check_passwd
12-31 16:00:00.876 D/Cryptfs (  497): crypt_ftr->fs_size = 43900863
12-31 16:00:00.876 I/Cryptfs (  497): Using scrypt with keymaster for cryptfs KDF
12-31 16:00:01.336 I/Cryptfs (  497): Signing safely-padded object
12-31 16:00:02.237 E/Cryptfs (  497): Cannot create dm-crypt device userdata: Device or resource busy
12-31 16:00:02.237 E/Cryptfs (  497): Error creating decrypted block device
12-31 16:00:02.237 E/Cryptfs (  497): Encrypted filesystem not validated, abortin

有几处错误:

12-31 16:00:00.870 E/Cryptfs (  497): Cannot remove dm-crypt device
12-31 16:00:02.237 E/Cryptfs (  497): Cannot create dm-crypt device userdata: Device or resource busy
12-31 16:00:02.237 E/Cryptfs (  497): Error creating decrypted block device
12-31 16:00:02.237 E/Cryptfs (  497): Encrypted filesystem not validated, abortin

找到相关代码:

static int delete_crypto_blk_dev(const char *name)
{
  int fd;
  char buffer[DM_CRYPT_BUF_SIZE];
  struct dm_ioctl *io;
  int retval = -1;
  int err;
  int retries = 10;

  if ((fd = open("/dev/device-mapper", O_RDWR|O_CLOEXEC)) < 0 ) {
    SLOGE("Cannot open device-mapper\n");
    goto errout;
  }

  io = (struct dm_ioctl *) buffer;

  ioctl_init(io, DM_CRYPT_BUF_SIZE, name, 0);
  if (ioctl(fd, DM_DEV_REMOVE, io)) {
    SLOGE("Cannot remove dm-crypt device\n");
    goto errout;
  }

  /* We made it here with no errors.  Woot! */
  retval = 0;

errout:
  close(fd);    /* If fd is <0 from a failed open call, it's safe to just ignore the close error */

  return retval;

}

DM_DEV_REMOVE执行错误,对应kernel是在drivers/md/dm-ioctl.c的dev_remove:

static int dev_remove(struct dm_ioctl *param, size_t param_size)
{
        struct hash_cell *hc; 
        struct mapped_device *md; 
        int r;
        struct dm_table *t;

        down_write(&_hash_lock);
        hc = __find_device_hash_cell(param);

        if (!hc) {
                DMDEBUG_LIMIT("device doesn't appear to be in the dev hash table.");
                up_write(&_hash_lock);
                return -ENXIO;
        }    

        md = hc->md;

        /*   
         * Ensure the device is not open and nothing further can open it.
         */
        r = dm_lock_for_deletion(md, !!(param->flags & DM_DEFERRED_REMOVE), false);
        if (r) {
                if (r == -EBUSY && param->flags & DM_DEFERRED_REMOVE) {
                        up_write(&_hash_lock);
                        dm_put(md);
                        return 0;
                }    
                DMDEBUG_LIMIT("unable to remove open device %s", hc->name);
                up_write(&_hash_lock);
                dm_put(md);
                return r; //tj: fail
        }    

        t = __hash_remove(hc);
        up_write(&_hash_lock);

        if (t) {
                dm_sync_table(md);
                dm_table_destroy(t);
        }    

        param->flags &= ~DM_DEFERRED_REMOVE;

        if (!dm_kobject_uevent(md, KOBJ_REMOVE, param->event_nr))
                param->flags |= DM_UEVENT_GENERATED_FLAG;

        dm_put(md);
        dm_destroy(md);
        return 0;
}

从代码看fail只有我标注fail的那里,进dm_lock_for_deletion里看下:

/*
 * Guarantees nothing is using the device before it's deleted.
 */
int dm_lock_for_deletion(struct mapped_device *md, bool mark_deferred, bool only_deferred)
{
        int r = 0; 

        spin_lock(&_minor_lock);

        if (dm_open_count(md)) {
                r = -EBUSY;
                if (mark_deferred)
                        set_bit(DMF_DEFERRED_REMOVE, &md->flags);
        } else if (only_deferred && !test_bit(DMF_DEFERRED_REMOVE, &md->flags))
                r = -EEXIST;
        else 
                set_bit(DMF_DELETING, &md->flags);

        spin_unlock(&_minor_lock);

        return r;
}

只有一种可能就是返回-EBUSY,rt? 谁打开userdata分区?没改啥流程啊,提问QCOM让在QCOM参考机上测试,如果复现再提,还指出了如下错误:

01-06 20:36:44.037   490   493 D vold    : Disk at 253:2 changed
01-06 20:36:44.037   490   670 D vold    : cryptfs_enable_inplace(/dev/block/dm-2, /dev/block/bootdevice/by-name/userdata, 43900863, 0x7017efe900, 43900863, 0, 1)
01-06 20:36:44.037   490   670 D vold    : Opening/dev/block/bootdevice/by-name/userdata
01-06 20:36:44.037   490   670 D vold    : Opening/dev/block/dm-2
01-06 20:36:44.037   490   670 E vold    : Error opening crypto_blkdev /dev/block/dm-2 for ext4 inplace encrypt, retrying: No such file or directory

回头来看,他应该不熟悉加密流程,这是vold正常打印,我们是ext4,cryptfs_enable_inplace会call cryptfs_enable_inplace_ext4

static int cryptfs_enable_inplace_ext4(char* crypto_blkdev, char* real_blkdev, off64_t size,
                                       off64_t* size_already_done, off64_t tot_size,
                                       off64_t previously_encrypted_upto,
                                       bool set_progress_properties) {
...
    LOG(DEBUG) << "Opening" << crypto_blkdev;
#if defined(CONFIG_HW_DISK_ENCRYPTION) && defined(CONFIG_HW_DISK_ENCRYPT_PERF)
...
#else
    // Wait until the block device appears.  Re-use the mount retry values since it is reasonable.
    while ((data.cryptofd = open(crypto_blkdev, O_WRONLY|O_CLOEXEC)) < 0) {
        if (--retries) {
            PLOG(ERROR) << "Error opening crypto_blkdev " << crypto_blkdev   //tj: retry here
                        << " for ext4 inplace encrypt, retrying";
            sleep(RETRY_MOUNT_DELAY_SECONDS);
        } else {
            PLOG(ERROR) << "Error opening crypto_blkdev " << crypto_blkdev
                        << " for ext4 inplace encrypt";
            rc = ENABLE_INPLACE_ERR_DEV;
            goto errout;
        }
    }
#endif

可以看到出错是有可能的,要"Wait util the block device appears"。

我们再回来看下出问题的logcat上下文:

12-31 16:00:00.863 D/vold    (  497): Found matching PID 968
12-31 16:00:00.863 W/vold    (  497): Skipping due to root namespace
12-31 16:00:00.863 D/vold    (  497): Found matching PID 970
12-31 16:00:00.863 W/vold    (  497): Skipping due to root namespace
12-31 16:00:00.863 D/vold    (  497): Found matching PID 1078
12-31 16:00:00.863 W/vold    (  497): Skipping due to root namespace
12-31 16:00:00.863 D/vold    (  497): Found matching PID 1288
12-31 16:00:00.864 I/vold    (  497): Encrypting group 167
12-31 16:00:00.869 I/vold    (  497): Encrypted to sector 2686390272
12-31 16:00:00.870 D/vold    (  497): cryptfs_enable_inplace_ext4 success
12-31 16:00:00.870 E/Cryptfs (  497): Cannot remove dm-crypt device
12-31 16:00:00.875 I/Cryptfs (  497): cryptfs_check_passwd

再看下正常的logcat上下文:

12-31 16:00:00.881   496   496 D vold    : Remounting 10001 as mode none
...
12-31 16:00:01.941   496   664 D vold    : cryptfs_enable_inplace_ext4 success
...
12-31 16:00:02.107   496   664 I Cryptfs : cryptfs_check_passwd

没有看出什么明显异常,"元芳 你怎么看?"

其实已经怀疑一个地方了后面再说,除了这个就是流程是不是有问题?说不定我司改了啥我不知道?其实Google/vold在这里就没有处理remove失败的情况,这点不够健壮,rt?

我们来看加密是如何来的:

VoldNativeService.cpp:fdeEnable->fdeEnableInternal ->
cryptfs.cpp:cryptfs_enable -> cryptfs_enable_internal -> cryptfs_enable_all_volumes -> 
EncryptInplace.cpp:cryptfs_enable_inplace -> cryptfs_enable_inplace_ext4

有个可执行文件vdc会call vold的fdeEnable

    } else if (args[0] == "cryptfs" && args[1] == "enablecrypto") {
        int passwordType = android::os::IVold::PASSWORD_TYPE_DEFAULT;
        int encryptionFlags = android::os::IVold::ENCRYPTION_FLAG_NO_UI;
        checkStatus(vold->fdeEnable(passwordType, "", encryptionFlags));

参数enablecrypto传入是通过vdc.rc:

# One shot invocation to encrypt unencrypted volumes
on encrypt
    start surfaceflinger
    exec - root -- /system/bin/vdc --wait cryptfs enablecrypto  //here
    # vold will set vold.decrypt to trigger_restart_framework (default
    # encryption)

那这个脚本是怎么trigger的?以前我们分析过其他问题是也提到过是init通过queue来的,具体是在./system/core/init/builtins.cpp:

/* Queue event based on fs_mgr return code.
 *
 * code: return code of fs_mgr_mount_all
 *
 * This function might request a reboot, in which case it will
 * not return.
 *
 * return code is processed based on input code
 */
static Result<Success> queue_fs_event(int code) {
    std::string bootmode = GetProperty("ro.bootmode", ""); 
    if (strncmp(bootmode.c_str(), "ffbm", strlen("ffbm")) == 0) { 
        LOG(ERROR) << "ffbm mode, not start class main\n";
        return Success();
    }    
    if (code == FS_MGR_MNTALL_DEV_NEEDS_ENCRYPTION) {
        ActionManager::GetInstance().QueueEventTrigger("encrypt"); //here
        return Success();
    } else if (code == FS_MGR_MNTALL_DEV_MIGHT_BE_ENCRYPTED) {

看注释应该是fs_mgr_mount_all要返回FS_MGR_MNTALL_DEV_NEEDS_ENCRYPTION就会触发加密。

意思就是需要加密你就加密吧,往前追就是do_mount_all了:

/* mount_all <fstab> [ <path> ]* [--<options>]*
 *
 * This function might request a reboot, in which case it will
 * not return.
 */
static Result<Success> do_mount_all(const BuiltinArguments& args) {
...
    std::string prop_name = "ro.boottime.init.mount_all."s + prop_post_fix;
    android::base::Timer t;
    auto mount_fstab_return_code = mount_fstab(fstabfile, mount_mode);
    if (!mount_fstab_return_code) {
        return Error() << "mount_fstab() failed " << mount_fstab_return_code.error();
    }    
    property_set(prop_name, std::to_string(t.duration().count()));

    if (import_rc) {
        /* Paths of .rc files are specified at the 2nd argument and beyond */
        import_late(args.args, 2, path_arg_end);
    }    

    if (queue_event) {
        /* queue_fs_event will queue event based on mount_fstab return code
         * and return processed return code*/
        auto queue_fs_result = queue_fs_event(*mount_fstab_return_code);
        if (!queue_fs_result) {
            return Error() << "queue_fs_event() failed: " << queue_fs_result.error();
        }
    }

    return Success();
}

先走mount_fstab,再按照结果处理, mount_fstab会call fs_mgr_mount_all:

/* When multiple fstab records share the same mount_point, it will
 * try to mount each one in turn, and ignore any duplicates after a
 * first successful mount.
 * Returns -1 on error, and  FS_MGR_MNTALL_* otherwise.
 */
int fs_mgr_mount_all(struct fstab *fstab, int mount_mode)
{

可见,应该就是按fstab里的条目来依次mount,继续看:

        mret = mount_with_alternatives(fstab, i, &last_idx_inspected, &attempted_idx);
        i = last_idx_inspected;
        mount_errno = errno;

        /* Deal with encryptability. */
        if (!mret) {
            int status = handle_encryptable(&fstab->recs[attempted_idx]);

mount_with_alternatives是正常的mount,如果返回0没有错误,会check 是否具备加密能力:

// Check to see if a mountable volume has encryption requirements
static int handle_encryptable(const struct fstab_rec* rec)
{
    /* If this is block encryptable, need to trigger encryption */
    if (needs_block_encryption(rec)) {
        if (umount(rec->mount_point) == 0) {
            return FS_MGR_MNTALL_DEV_NEEDS_ENCRYPTION;
        } else {
            PWARNING << "Could not umount " << rec->mount_point
                     << " - allow continue unencrypted";
            return FS_MGR_MNTALL_DEV_NOT_ENCRYPTED;
        }
    }

check needs_block_encryption:

static bool needs_block_encryption(const struct fstab_rec* rec)
{
    if (android::base::GetBoolProperty("ro.vold.forceencryption", false) &&
        fs_mgr_is_encryptable(rec))
        return true;
    if (rec->fs_mgr_flags & MF_FORCECRYPT) return true;
static struct flag_list fs_mgr_flags[] = {
...
    {"forceencrypt=", MF_FORCECRYPT},

fstab里会加入forceencrypt:

#<src>                                      <mnt_point>  <type>  <mnt_flags and options>                     <fs_mgr_flags>
/dev/block/bootdevice/by-name/system        /            ext4    ro,barrier=1,discard                        wait,slotselect,avb
/dev/block/bootdevice/by-name/userdata      /data        ext4    noatime,nosuid,nodev,barrier=1,noauto_da_alloc,discard  wait,forceencrypt=footer,quota,reservedsize=128M

clear, rt?

ok, 为了帮助确认分析问题,我放开了kernel dm debug,加入了ioctl cmd log,另外vold remove那里加入了delay和retry,竟然ok了。

[    0.000000] device-mapper: init: will configure 1 devices
[    1.263421] device-mapper: uevent: version 1.0.3
[    1.263580] device-mapper: ioctl: 4.35.0-ioctl (2016-06-23) initialised: dm-devel@redhat.com
[    1.263787] device-mapper: req-crypt: dm-req-crypt successfully initalized.\x0a
[    3.607087] device-mapper: init: attempting early device configuration.
[    3.607101] device-mapper: init DEBUG: parsed 1 targets
[    3.607768] device-mapper: init DEBUG: created device '253:0'
[    3.607786] device-mapper: init: adding target '0 5643752 verity 1 PART...
[    3.608872] device-mapper: init: dm-0 is ready
[    3.789844] device-mapper: ioctl: [tj] cmd=3
[    3.790567] device-mapper: ioctl: [tj] cmd=7
[    3.790852] device-mapper: ioctl: [tj] cmd=9
[    3.791799] device-mapper: ioctl: [tj] cmd=6
[   10.834008] device-mapper: ioctl: [tj] cmd=2
[   22.435064] device-mapper: ioctl: [tj] cmd=3
[   22.436824] device-mapper: ioctl: [tj] cmd=7
[   22.437049] device-mapper: ioctl: [tj] cmd=9
[   22.437073] device-mapper: req-crypt DEBUG: dm-req-crypt Constructor.\x0a
[   22.437189] device-mapper: req-crypt: req_crypt_ctr: Mapping block_device /dev/block/bootdevice/by-name/userdata to dm-req-crypt ok!\x0a
[   22.437775] device-mapper: ioctl: [tj] cmd=6
[   30.346430] device-mapper: ioctl: [tj] cmd=4
[   30.346443] device-mapper: ioctl: unable to remove open device userdata
[   31.346680] device-mapper: ioctl: [tj] cmd=4
[   31.347242] device-mapper: req-crypt DEBUG: dm-req-crypt Destructor.\x0a
[   32.843243] device-mapper: ioctl: [tj] cmd=3
[   32.845567] device-mapper: ioctl: [tj] cmd=7
[   32.845850] device-mapper: ioctl: [tj] cmd=9
[   32.845881] device-mapper: req-crypt DEBUG: dm-req-crypt Constructor.\x0a
[   32.846010] device-mapper: req-crypt: req_crypt_ctr: Mapping block_device /dev/block/bootdevice/by-name/userdata to dm-req-crypt ok!\x0a
[   32.846667] device-mapper: ioctl: [tj] cmd=6

其中cmd=4是remove, 可见第一次失败,第二次ok了!。cmd 3,7,9,6是一个create_crypto_blk_dev里,最早那一组3 7 9 6应该是cmdline传上来的avb相关,也就是'253:0'。

xxx:/dev/block # ls -l dm*
brw------- 1 root root 253,   0 1970-01-06 21:29 dm-0 //system, should be dm-verity
brw------- 1 root root 253,   1 1970-01-06 21:29 dm-1 //vendor, should be dm-verity
brw------- 1 root root 253,   2 2008-12-31 16:00 dm-2 //userdata, dm-crypt

所以Google原生这里异常处理我觉得应该是有问题的。