Original article: https://mp.weixin.qq.com/s/PQb_PwgxzyeeFZyz3FsO6w

What the EROFS pcluster mode is for:

It's used to judge whether inplace I/O can be used due to the current status of pclusters in the chain.

There are four modes: INFLIGHT, HOOKED, FOLLOWED and FOLLOWED_NOINPLACE. The source code quoted in this article is from Linux kernel 6.x.

FOLLOWED mode

    /*
     * The current collection has been linked with the owned chain, and
     * could also be linked with the remaining collections, which means
     * if the processing page is the tail page of the collection, thus
     * the current collection can safely use the whole page (since
     * the previous collection is under control) for in-place I/O, as
     * illustrated below:
     *  ________________________________________________________________
     * |  tail (partial) page |          head (partial) page           |
     * |  (of the current cl) |      (of the previous collection)      |
     * | PCLUSTER_FOLLOWED or |                                        |
     * |_____PCLUSTER_HOOKED__|___________PCLUSTER_FOLLOWED____________|
     *
     * [  (*) the above page can be used as inplace I/O.               ]
     */
    Z_EROFS_PCLUSTER_FOLLOWED,

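The comment says that in this mode the pcluster currently being collected has been linked to the owned chain and can also be linked with the remaining collections. What does that mean in practice? Let's go straight to the code.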

If the pcluster being collected already exists, z_erofs_try_to_claim_pcluster is called:

static void z_erofs_try_to_claim_pcluster(struct z_erofs_decompress_frontend *f)
{
    struct z_erofs_pcluster *pcl = f->pcl;
    z_erofs_next_pcluster_t *owned_head = &f->owned_head;

    /* type 1, nil pcluster (this pcluster doesn't belong to any chain.) */
    if (cmpxchg(&pcl->next, Z_EROFS_PCLUSTER_NIL,
            *owned_head) == Z_EROFS_PCLUSTER_NIL) {
        *owned_head = &pcl->next;
        /* so we can attach this pcluster to our submission chain. */
        f->mode = Z_EROFS_PCLUSTER_FOLLOWED;
        return;
    }

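The nil case applies to a pcluster that has already been decompressed, because the end of decompression resets pcl->next: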

static int z_erofs_decompress_pcluster(struct z_erofs_decompress_backend *be,
                                       int err)
{
        [...]
        /* pcluster lock MUST be taken before the following line */
        WRITE_ONCE(pcl->next, Z_EROFS_PCLUSTER_NIL);
        mutex_unlock(&pcl->lock);
        return err;
}

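In other words, pcl->next == Z_EROFS_PCLUSTER_NIL, so the pcluster is put onto this chain: its next field takes the previous owned_head value, and owned_head now points at it.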

When a new pcluster is collected, it is added to this chain directly:

static int z_erofs_register_pcluster(struct z_erofs_decompress_frontend *fe)
{
    [...]
    pcl->next = fe->owned_head;
    pcl->pageofs_out = map->m_la & ~PAGE_MASK;
    fe->mode = Z_EROFS_PCLUSTER_FOLLOWED;
    [...]
}
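
The `map->m_la & ~PAGE_MASK` expression in the excerpt keeps only the byte offset inside a page. A minimal sketch with small numbers, assuming 4 KiB pages; `SKETCH_PAGE_SIZE`, `SKETCH_PAGE_MASK` and `pageofs_out()` are names made up for this illustration, not kernel symbols:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages; the kernel defines PAGE_MASK as ~(PAGE_SIZE - 1). */
#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_PAGE_MASK (~(uint64_t)(SKETCH_PAGE_SIZE - 1))

static_assert((SKETCH_PAGE_SIZE & (SKETCH_PAGE_SIZE - 1)) == 0,
              "page size must be a power of two");

/* Same shape as map->m_la & ~PAGE_MASK: the byte offset inside the page. */
static unsigned int pageofs_out(uint64_t m_la)
{
    return (unsigned int)(m_la & ~SKETCH_PAGE_MASK);
}
```

For example, a logical address of 10000 sits in the page that starts at 8192, so the output offset is 1808.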

OK, so if the page being accessed is the tail page of the whole collection, that page can be used for in-place I/O.

static int z_erofs_attach_page(struct z_erofs_decompress_frontend *fe,
                   struct z_erofs_bvec *bvec, bool exclusive)
{
    int ret;

    if (exclusive) {
        /* give priority for inplaceio to use file pages first */
        if (z_erofs_try_inplace_io(fe, bvec))
            return 0;

As shown above, when a page is attached with exclusive set to true, in-place I/O is tried first.

    exclusive = (!cur && (!spiltted || tight));

A page is processed starting from its end. For the tail-page part cur is 0, so exclusive depends on tight, and tight in turn is decided by the pcluster mode:

    /*
     * Ensure the current partial page belongs to this submit chain rather
     * than other concurrent submit chains or the noio(bypass) chain since
     * those chains are handled asynchronously thus the page cannot be used
     * for inplace I/O or bvpage (should be processed in a strict order.)
     */
    tight &= (fe->mode >= Z_EROFS_PCLUSTER_HOOKED &&
          fe->mode != Z_EROFS_PCLUSTER_FOLLOWED_NOINPLACE);

That is, the tail page is used for in-place I/O only when the pcluster it belongs to is in HOOKED or FOLLOWED mode.
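
The two quoted expressions can be modelled as plain functions to see which modes keep a page eligible. This is only a sketch: the enumerator order (INFLIGHT < HOOKED < FOLLOWED_NOINPLACE < FOLLOWED) is an assumption inferred from the comparisons above, and `pcl_mode`, `tight_for_mode()` and `is_exclusive()` are illustrative names, not kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed order, lowest to highest; the >= comparison relies on it. */
enum pcl_mode {
    PCL_INFLIGHT,
    PCL_HOOKED,
    PCL_FOLLOWED_NOINPLACE,
    PCL_FOLLOWED,
};

static_assert(PCL_INFLIGHT < PCL_HOOKED &&
              PCL_HOOKED < PCL_FOLLOWED_NOINPLACE &&
              PCL_FOLLOWED_NOINPLACE < PCL_FOLLOWED,
              "the sketch depends on this enumerator order");

/* Models: tight &= (mode >= HOOKED && mode != FOLLOWED_NOINPLACE); */
static bool tight_for_mode(bool tight, enum pcl_mode mode)
{
    return tight && mode >= PCL_HOOKED && mode != PCL_FOLLOWED_NOINPLACE;
}

/* Models: exclusive = (!cur && (!spiltted || tight)); */
static bool is_exclusive(unsigned int cur, bool split, bool tight)
{
    return cur == 0 && (!split || tight);
}
```

With these definitions, only HOOKED and FOLLOWED leave tight set, and a page that was already split is exclusive only when cur is 0 and tight still holds.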

HOOKED mode

    /*
     * The current pclusters was the tail of an exist chain, in addition
     * that the previous processed chained pclusters are all decided to
     * be hooked up to it.
     * A new chain will be created for the remaining pclusters which are
     * not processed yet, so different from Z_EROFS_PCLUSTER_FOLLOWED,
     * the next pcluster cannot reuse the whole page safely for inplace I/O
     * in the following scenario:
     *  ________________________________________________________________
     * |      tail (partial) page     |       head (partial) page       |
     * |   (belongs to the next pcl)  |   (belongs to the current pcl)  |
     * |_______PCLUSTER_FOLLOWED______|________PCLUSTER_HOOKED__________|
     */
    Z_EROFS_PCLUSTER_HOOKED,

The current pcluster sits at the tail of an already existing chain, i.e. pcl->next == Z_EROFS_PCLUSTER_TAIL, so a new chain is simply created for whatever is collected next.

static void z_erofs_try_to_claim_pcluster(struct z_erofs_decompress_frontend *f)
{
    [...]
    /*
     * type 2, link to the end of an existing open chain, be careful
     * that its submission is controlled by the original attached chain.
     */
    if (*owned_head != &pcl->next && pcl != f->tailpcl &&
        cmpxchg(&pcl->next, Z_EROFS_PCLUSTER_TAIL,
            *owned_head) == Z_EROFS_PCLUSTER_TAIL) {
        *owned_head = Z_EROFS_PCLUSTER_TAIL;
        f->mode = Z_EROFS_PCLUSTER_HOOKED;
        f->tailpcl = NULL;
        return;
    }

In that case tight becomes false whenever cur is non-zero, because the following check does not pass for HOOKED:

        if (cur)
                tight &= (fe->mode >= Z_EROFS_PCLUSTER_FOLLOWED);

When the head page is visited, cur has not yet reached 0, so exclusive is clearly false and in-place I/O cannot be used.

    cur = end - min_t(unsigned int, offset + end - map->m_la, end);
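
The cur expression is easier to follow with numbers. Below is a small sketch, assuming `offset` is the file offset of the page start, `end` is the end of the part of the page still to be processed, and `m_la` is the logical start of the mapped extent; the function name `extent_start_in_page()` is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns where the current extent starts inside the page, clamped to 0
 * when the extent begins before the page does.
 */
static unsigned int extent_start_in_page(uint64_t offset, unsigned int end,
                                         uint64_t m_la)
{
    /* Assumed precondition: the extent starts before the processing point. */
    assert(m_la < offset + end);

    /* min_t(unsigned int, a, b) casts both operands before comparing. */
    unsigned int span = (unsigned int)(offset + end - m_la);
    unsigned int covered = span < end ? span : end;

    return end - covered;
}
```

For a page at file offset 8192 with end = 4096, an extent starting at 10000 gives cur = 1808 (a head partial page, cur is non-zero), while an extent starting at 4000 gives cur = 0 (the extent covers the page from its very beginning).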

INFLIGHT mode

For a pcluster that already exists, apart from the nil case, it is either the end of a chain (the HOOKED case above) or not the end of a chain, which is the INFLIGHT case:

static void z_erofs_try_to_claim_pcluster(struct z_erofs_decompress_frontend *f)
{
    [...]
    /* type 3, it belongs to a chain, but it isn't the end of the chain */
    f->mode = Z_EROFS_PCLUSTER_INFLIGHT;
}
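
Putting the three excerpts together, the claim logic is a three-way decision on pcl->next. The following is a user-space sketch using C11 atomics in place of the kernel's cmpxchg; the struct layouts, the sentinel values `PCL_NIL` / `PCL_TAIL` and all names are simplified stand-ins invented for this illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

enum pcl_mode { PCL_INFLIGHT, PCL_HOOKED, PCL_FOLLOWED_NOINPLACE, PCL_FOLLOWED };

struct pcl {
    _Atomic(void *) next;        /* NIL, TAIL, or a link to another pcluster */
};

struct frontend {
    void *owned_head;            /* head of the chain this reader owns */
    struct pcl *tailpcl;
    enum pcl_mode mode;
};

/* Hypothetical sentinels: any two distinct values would do. */
static char tail_marker;
#define PCL_NIL  NULL
#define PCL_TAIL ((void *)&tail_marker)

static void try_to_claim(struct frontend *f, struct pcl *pcl)
{
    void *expected = PCL_NIL;

    assert(f != NULL && pcl != NULL);

    /* type 1: not on any chain, so push it onto our own chain */
    if (atomic_compare_exchange_strong(&pcl->next, &expected, f->owned_head)) {
        f->owned_head = (void *)&pcl->next;
        f->mode = PCL_FOLLOWED;
        return;
    }

    /* type 2: it ends another open chain, so hook our chain behind it */
    expected = PCL_TAIL;
    if (f->owned_head != (void *)&pcl->next && pcl != f->tailpcl &&
        atomic_compare_exchange_strong(&pcl->next, &expected, f->owned_head)) {
        f->owned_head = PCL_TAIL;    /* start a fresh chain for what follows */
        f->mode = PCL_HOOKED;
        f->tailpcl = NULL;
        return;
    }

    /* type 3: in the middle of someone else's chain, nothing to claim */
    f->mode = PCL_INFLIGHT;
}
```

Only the first two branches modify pcl->next; in the INFLIGHT case the pcluster is left exactly as it was found.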

FOLLOWED_NOINPLACE mode

The name says most of it: in this mode in-place I/O is not needed.

    /*
     * a weak form of Z_EROFS_PCLUSTER_FOLLOWED, the difference is that it
     * could be dispatched into bypass queue later due to uptodated managed
     * pages. All related online pages cannot be reused for inplace I/O (or
     * bvpage) since it can be directly decoded without I/O submission.
     */
    Z_EROFS_PCLUSTER_FOLLOWED_NOINPLACE,

In z_erofs_bind_cache(), if find_get_page() finds every page of the pcluster, no I/O is needed at all.

static void z_erofs_bind_cache(struct z_erofs_decompress_frontend *fe,
                   struct page **pagepool)
{
    [...]
    for (i = 0; i < pcl->pclusterpages; ++i) {
        [...]
        page = find_get_page(mc, pcl->obj.index + i);

        if (page) {
            t = (void *)((unsigned long)page | 1);
        } else {
            /* I/O is needed, no possible to decompress directly */
            standalone = false;
        [...]
    }
    /*
     * don't do inplace I/O if all compressed pages are available in
     * managed cache since it can be moved to the bypass queue instead.
     */
    if (standalone)
        fe->mode = Z_EROFS_PCLUSTER_FOLLOWED_NOINPLACE;
}
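
The standalone check boils down to "was every compressed page found in the cache?". A minimal sketch of that decision, where the `cached[]` array stands in for the find_get_page() results and `mode_after_bind_cache()` is a hypothetical helper, not a kernel function:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum pcl_mode { PCL_INFLIGHT, PCL_HOOKED, PCL_FOLLOWED_NOINPLACE, PCL_FOLLOWED };

/* cached[i] is true when compressed page i was found in the managed cache. */
static enum pcl_mode mode_after_bind_cache(enum pcl_mode mode,
                                           const bool *cached, size_t npages)
{
    bool standalone = true;

    assert(cached != NULL || npages == 0);

    for (size_t i = 0; i < npages; ++i) {
        if (!cached[i])
            standalone = false;   /* at least one page still needs I/O */
    }

    /* Everything is cached: the pcluster can go to the bypass queue. */
    return standalone ? PCL_FOLLOWED_NOINPLACE : mode;
}
```

A single cache miss is enough to keep the original mode, since real I/O will then be submitted for this pcluster.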

In addition, the inline case does not need in-place I/O either:

static int z_erofs_do_read_page(struct z_erofs_decompress_frontend *fe,
                struct page *page, struct page **pagepool)
{
    [...]
    if (z_erofs_is_inline_pcluster(fe->pcl)) {
        [...]
        fe->mode = Z_EROFS_PCLUSTER_FOLLOWED_NOINPLACE;
    } else {

BTW: the latest kernel version has removed the HOOKED mode.