Repost: An Analysis of EROFS pcluster Modes

Original link: https://mp.weixin.qq.com/s/PQb_PwgxzyeeFZyz3FsO6w

What the EROFS pcluster mode is used for:

It’s used to judge whether inplace I/O can be used due to the current status of pclusters in the chain.

There are four modes: INFLIGHT, HOOKED, FOLLOWED, and FOLLOWED_NOINPLACE. The source code quoted in this article is from Linux kernel 6.x.

FOLLOWED mode

/*
 * The current collection has been linked with the owned chain, and
 * could also be linked with the remaining collections, which means
 * if the processing page is the tail page of the collection, thus
 * the current collection can safely use the whole page (since
 * the previous collection is under control) for in-place I/O, as
 * illustrated below:
 *  ________________________________________________________________
 * |  tail (partial) page |          head (partial) page           |
 * |  (of the current cl) |      (of the previous collection)      |
 * | PCLUSTER_FOLLOWED or |                                        |
 * |_____PCLUSTER_HOOKED__|___________PCLUSTER_FOLLOWED____________|
 *
 * [  (*) the above page can be used as inplace I/O.               ]
 */
Z_EROFS_PCLUSTER_FOLLOWED,

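The comment says that in this mode the pcluster currently being collected has been linked into the owned chain, and that it can also be linked with the remaining collections. What does that mean in practice? Let's go straight to the code.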

If the pcluster being collected already exists, the code goes through z_erofs_try_to_claim_pcluster:

static void z_erofs_try_to_claim_pcluster(struct z_erofs_decompress_frontend *f)
{
    struct z_erofs_pcluster *pcl = f->pcl;
    z_erofs_next_pcluster_t *owned_head = &f->owned_head;

    /* type 1, nil pcluster (this pcluster doesn't belong to any chain.) */
    if (cmpxchg(&pcl->next, Z_EROFS_PCLUSTER_NIL,
            *owned_head) == Z_EROFS_PCLUSTER_NIL) {
        *owned_head = &pcl->next;
        /* so we can attach this pcluster to our submission chain. */
        f->mode = Z_EROFS_PCLUSTER_FOLLOWED;
        return;
    }

This is the case where the pcluster has already been decompressed before:

static int z_erofs_decompress_pcluster(struct z_erofs_decompress_backend *be,
                                       int err)
{
        [...]
        /* pcluster lock MUST be taken before the following line */
        WRITE_ONCE(pcl->next, Z_EROFS_PCLUSTER_NIL);
        mutex_unlock(&pcl->lock);
        return err;
}

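In other words, pcl->next == Z_EROFS_PCLUSTER_NIL, so the cmpxchg links the pcluster into this chain: pcl->next takes the old owned_head value, and owned_head is updated to point to &pcl->next.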

When a new pcluster is collected, it is added to this chain directly:

static int z_erofs_register_pcluster(struct z_erofs_decompress_frontend *fe)
{
    [...]
    pcl->next = fe->owned_head;
    pcl->pageofs_out = map->m_la & ~PAGE_MASK;
    fe->mode = Z_EROFS_PCLUSTER_FOLLOWED;
    [...]
}

OK, so if the page being accessed is the tail page of the whole collection, that page can be used for in-place I/O.

static int z_erofs_attach_page(struct z_erofs_decompress_frontend *fe,
                   struct z_erofs_bvec *bvec, bool exclusive)
{
    int ret;

    if (exclusive) {
        /* give priority for inplaceio to use file pages first */
        if (z_erofs_try_inplace_io(fe, bvec))
            return 0;

As shown above, when a page is attached, in-place I/O is attempted if exclusive is true.

exclusive = (!cur && (!spiltted || tight));

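A page is processed starting from its end. Once the tail-page part is reached, cur is 0, so exclusive depends on tight (unless spiltted is 0), and tight is in turn determined by the pcluster mode: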

/*
 * Ensure the current partial page belongs to this submit chain rather
 * than other concurrent submit chains or the noio(bypass) chain since
 * those chains are handled asynchronously thus the page cannot be used
 * for inplace I/O or bvpage (should be processed in a strict order.)
 */
tight &= (fe->mode >= Z_EROFS_PCLUSTER_HOOKED &&
      fe->mode != Z_EROFS_PCLUSTER_FOLLOWED_NOINPLACE);

That is, the page is used for in-place I/O only when the pcluster that the tail page belongs to is in HOOKED or FOLLOWED mode.
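
The two conditions above can be modelled in user space. This is a minimal sketch, not kernel code: the pcl_mode enum is a stand-in whose ordering (INFLIGHT < HOOKED < FOLLOWED_NOINPLACE < FOLLOWED) is an assumption inferred from the comparisons quoted in this article, and mode_keeps_tight, is_exclusive and the parameter name split (standing in for the kernel's spiltted) are hypothetical names.

```c
#include <stdbool.h>

/* Stand-in for the kernel enum; the ordering is an assumption. */
enum pcl_mode {
    PCL_INFLIGHT,
    PCL_HOOKED,
    PCL_FOLLOWED_NOINPLACE,
    PCL_FOLLOWED,
};

/* Mirrors: mode >= HOOKED && mode != FOLLOWED_NOINPLACE */
static bool mode_keeps_tight(enum pcl_mode mode)
{
    return mode >= PCL_HOOKED && mode != PCL_FOLLOWED_NOINPLACE;
}

/* Mirrors: exclusive = (!cur && (!spiltted || tight)) */
static bool is_exclusive(unsigned int cur, unsigned int split, bool tight)
{
    return !cur && (!split || tight);
}
```

Under this assumed ordering, only PCL_HOOKED and PCL_FOLLOWED keep tight set, which matches the conclusion above. is_exclusive(0, 1, false) is false, so a split page whose tight flag was cleared is not offered for in-place I/O.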

HOOKED mode

/*
 * The current pclusters was the tail of an exist chain, in addition
 * that the previous processed chained pclusters are all decided to
 * be hooked up to it.
 * A new chain will be created for the remaining pclusters which are
 * not processed yet, so different from Z_EROFS_PCLUSTER_FOLLOWED,
 * the next pcluster cannot reuse the whole page safely for inplace I/O
 * in the following scenario:
 *  ________________________________________________________________
 * |      tail (partial) page     |       head (partial) page       |
 * |   (belongs to the next pcl)  |   (belongs to the current pcl)  |
 * |_______PCLUSTER_FOLLOWED______|________PCLUSTER_HOOKED__________|
 */
Z_EROFS_PCLUSTER_HOOKED,

The current pcluster sits at the tail of an existing chain, i.e. pcl->next == Z_EROFS_PCLUSTER_TAIL, so a new chain is simply created for the pclusters collected afterwards.

static void z_erofs_try_to_claim_pcluster(struct z_erofs_decompress_frontend *f)
{
    [...]
    /*
     * type 2, link to the end of an existing open chain, be careful
     * that its submission is controlled by the original attached chain.
     */
    if (*owned_head != &pcl->next && pcl != f->tailpcl &&
        cmpxchg(&pcl->next, Z_EROFS_PCLUSTER_TAIL,
            *owned_head) == Z_EROFS_PCLUSTER_TAIL) {
        *owned_head = Z_EROFS_PCLUSTER_TAIL;
        f->mode = Z_EROFS_PCLUSTER_HOOKED;
        f->tailpcl = NULL;
        return;
    }

In HOOKED mode, tight therefore becomes false for a head page, because the check below requires at least FOLLOWED:

if (cur)
        tight &= (fe->mode >= Z_EROFS_PCLUSTER_FOLLOWED);

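When the head page is being accessed, cur has not yet dropped to 0, so the page is clearly not exclusive and in-place I/O cannot be used. cur is computed as follows: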

cur = end - min_t(unsigned int, offset + end - map->m_la, end);
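
A worked example may help with the cur expression. The sketch below re-implements it in plain C under stated assumptions: offset is taken to be the byte offset of the page in the file, the page size is 4096 bytes in the numbers that follow, the helper name extent_start_in_page is hypothetical, and m_la is assumed never to exceed page_off + end.

```c
#include <stdint.h>

/*
 * Hypothetical helper mirroring
 *   cur = end - min_t(unsigned int, offset + end - map->m_la, end);
 * page_off: byte offset of the page in the file (assumed meaning of offset)
 * end:      end of the part of the page that has not been handled yet
 * m_la:     logical start of the extent that covers byte page_off + end - 1
 * Assumes m_la <= page_off + end, so the subtraction cannot wrap.
 */
static unsigned int extent_start_in_page(uint64_t page_off, unsigned int end,
                                         uint64_t m_la)
{
    uint64_t span = page_off + end - m_la;
    unsigned int covered = span < end ? (unsigned int)span : end;

    return end - covered;
}
```

For a page at file offset 8192 with end = 4096: an extent starting at 10000 gives cur = 1808, so the extent begins inside the page and this is its head page; an extent starting at 4096 gives cur = 0, so the page is not the head page of that extent.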

INFLIGHT mode

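For a pcluster that already exists, apart from the nil case, it is either the end of a chain (the HOOKED case above) or not the end of a chain. The latter case falls through to INFLIGHT: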

static void z_erofs_try_to_claim_pcluster(struct z_erofs_decompress_frontend *f)
{
    [...]
    /* type 3, it belongs to a chain, but it isn't the end of the chain */
    f->mode = Z_EROFS_PCLUSTER_INFLIGHT;
}
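
The three claim types can be put side by side in a small user-space model. This is only a sketch under assumptions: C11 atomics stand in for the kernel's cmpxchg, PCL_NIL and PCL_TAIL are stand-in sentinels rather than the kernel's values, the struct layouts keep only the fields used in the excerpts, and the enum ordering is assumed as inferred from the comparisons in this article.

```c
#include <stdatomic.h>
#include <stddef.h>

enum pcl_mode { PCL_INFLIGHT, PCL_HOOKED, PCL_FOLLOWED_NOINPLACE, PCL_FOLLOWED };

static char tail_sentinel;                 /* stand-in for the TAIL marker */
#define PCL_NIL  NULL
#define PCL_TAIL ((void *)&tail_sentinel)

struct pcl {
    _Atomic(void *) next;                  /* NIL, TAIL, or link to next pcluster */
};

struct frontend {
    struct pcl *pcl, *tailpcl;
    void *owned_head;                      /* head of the chain this reader owns */
    enum pcl_mode mode;
};

static void try_to_claim(struct frontend *f)
{
    struct pcl *pcl = f->pcl;
    void *expected = PCL_NIL;

    /* type 1: pcluster belongs to no chain, push it onto our chain */
    if (atomic_compare_exchange_strong(&pcl->next, &expected, f->owned_head)) {
        f->owned_head = (void *)&pcl->next;
        f->mode = PCL_FOLLOWED;
        return;
    }

    /* type 2: pcluster ends another open chain, hook our chain behind it */
    expected = PCL_TAIL;
    if (f->owned_head != (void *)&pcl->next && pcl != f->tailpcl &&
        atomic_compare_exchange_strong(&pcl->next, &expected, f->owned_head)) {
        f->owned_head = PCL_TAIL;          /* later pclusters start a new chain */
        f->mode = PCL_HOOKED;
        f->tailpcl = NULL;
        return;
    }

    /* type 3: pcluster is in the middle of some chain, leave it alone */
    f->mode = PCL_INFLIGHT;
}
```

In this model a pcluster whose next is PCL_NIL ends up FOLLOWED, one whose next is PCL_TAIL ends up HOOKED, and any other value leaves the pcluster untouched with the frontend in INFLIGHT.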

FOLLOWED_NOINPLACE mode

As the name suggests, in-place I/O is not used in this mode.

    /*
     * a weak form of Z_EROFS_PCLUSTER_FOLLOWED, the difference is that it
     * could be dispatched into bypass queue later due to uptodated managed
     * pages. All related online pages cannot be reused for inplace I/O (or
     * bvpage) since it can be directly decoded without I/O submission.
     */
    Z_EROFS_PCLUSTER_FOLLOWED_NOINPLACE,
};

In z_erofs_bind_cache(), if find_get_page() finds every compressed page of the pcluster in the managed cache, no I/O is needed.

static void z_erofs_bind_cache(struct z_erofs_decompress_frontend *fe,
                   struct page **pagepool)
{
    [...]
    for (i = 0; i < pcl->pclusterpages; ++i) {
        [...]
        page = find_get_page(mc, pcl->obj.index + i);

        if (page) {
            t = (void *)((unsigned long)page | 1);
        } else {
            /* I/O is needed, no possible to decompress directly */
            standalone = false;
        [...]
    }
    /*
     * don't do inplace I/O if all compressed pages are available in
     * managed cache since it can be moved to the bypass queue instead.
     */
    if (standalone)
        fe->mode = Z_EROFS_PCLUSTER_FOLLOWED_NOINPLACE;
}
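
The standalone logic above boils down to an "all pages cached" check. The following is a minimal sketch of just that decision, assuming a plain bool array in place of the find_get_page() lookups; mode_after_bind_cache is a hypothetical name, and the enum is a stand-in with an assumed ordering.

```c
#include <stdbool.h>
#include <stddef.h>

enum pcl_mode { PCL_INFLIGHT, PCL_HOOKED, PCL_FOLLOWED_NOINPLACE, PCL_FOLLOWED };

/*
 * cached[i] is true when compressed page i was found in the managed cache.
 * Returns the mode after the check: FOLLOWED_NOINPLACE if every page was
 * found, otherwise the mode is left unchanged.
 */
static enum pcl_mode mode_after_bind_cache(enum pcl_mode mode,
                                           const bool *cached, size_t npages)
{
    bool standalone = true;

    for (size_t i = 0; i < npages; ++i) {
        if (!cached[i])
            standalone = false;            /* I/O is needed for this page */
    }
    return standalone ? PCL_FOLLOWED_NOINPLACE : mode;
}
```

A single missing page is enough to keep the original mode, because the pcluster then still has to go through I/O submission.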

In addition, the inline case does not need in-place I/O either:

static int z_erofs_do_read_page(struct z_erofs_decompress_frontend *fe,
                struct page *page, struct page **pagepool)
{
    [...]
    if (z_erofs_is_inline_pcluster(fe->pcl)) {
        [...]
        fe->mode = Z_EROFS_PCLUSTER_FOLLOWED_NOINPLACE;
    } else {

BTW: the latest version has removed the HOOKED mode.