The EROFS filesystem just landed a compressed data deduplication feature to improve compression ratio. Let's take a look.

rolling hash basics

Dedup makes use of a rolling hash, so let's get familiar with that first.

Wikipedia's definitions:

A hash function is any function that can be used to map data of arbitrary size to fixed-size values.

A rolling hash (also known as recursive hashing or rolling checksum) is a hash function where the input is hashed in a window that moves through the input.

An example makes it clearer:

Rolling hashes show up in substring search. As an example, hash the two 3-character substrings of the string abcd, namely abc and bcd, using base 10.

The hash of substring abc:

H(abc) => a*(10^2) + b*(10^1) + c*(10^0)
        = 97*100 + 98*10 + 99*1 = 10779

The hash of substring bcd:

H(bcd) => b*(10^2) + c*(10^1) + d*(10^0)
        = 98*100 + 99*10 + 100*1 = 10890

Note that abc and bcd share the overlapping bc, so H(bcd) can be derived from the already-computed H(abc):

H(bcd) => (b*(10^1) + c*(10^0))*10 + d*(10^0)
       => (H(abc) - a*(10^2))*10 + d*(10^0)
        = (10779 - 97*100)*10 + 100*1 = 10890

BTW: to keep H() from growing large enough to overflow, there is a mod operation (by a prime number); we ignore it here.

The change from input abc to input bcd is just dropping a and adding d. Treat the input as a window (3 characters long here); going from abc to bcd is that window rolling forward with a step of 1 character. Right?

So a rolling hash is about avoiding re-hashing the entire substring: the new hash is computed quickly from the previously computed old hash. It's a scenario-specific optimization built on top of plain hashing. Right?
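To make the update rule concrete, here is a tiny base-10 toy (a hypothetical roll10() helper, not erofs code) that reproduces the numbers above:

#include <assert.h>

/* base-10 rolling update for a 3-character window:
 * drop the outgoing char (weight 10^2), shift by the base, add the new char */
static long long roll10(long long old_hash, char out, char in)
{
    return (old_hash - out * 100) * 10 + in;
}

int main(void)
{
    /* H(abc) = 10779; rolling 'a' out and 'd' in must give H(bcd) = 10890 */
    assert(roll10(10779, 'a', 'd') == 10890);
    return 0;
}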

Now let's look at the erofs-utils implementation, rolling_hash.h.

The hash value is declared as long long (8 bytes), and every step is reduced modulo PRIME_NUMBER (4294967295, i.e. 2^32 - 1), which keeps intermediate products like RADIX * hash well within 64 bits. (Despite the name, 2^32 - 1 is not actually prime, but it still works fine as a modulus here.)

#define PRIME_NUMBER    4294967295LL
#define RADIX 256

RADIX is 256 (2^8) because the input is a byte stream: each u8 is one "digit", so the natural base is 256.

static inline long long erofs_rolling_hash_init(u8 *input,
                                                int len, bool backwards)
{
    long long hash = 0;

    if (!backwards) {
        int i;

        for (i = 0; i < len; ++i)
            hash = (RADIX * hash + input[i]) % PRIME_NUMBER;
    } else {
        while (len)
            hash = (RADIX * hash + input[--len]) % PRIME_NUMBER;
    }
    return hash;
}

This is just the H() implementation plus a direction switch: forward gives the first byte the highest weight, backwards gives the last byte the highest weight.
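Hand-expanding the loop for len = 3 shows exactly what the direction changes:

/* forward:   hash = input[0]*RADIX^2 + input[1]*RADIX + input[2] */
/* backwards: hash = input[2]*RADIX^2 + input[1]*RADIX + input[0] */

The backwards form is the one dedup uses, since its search window moves backward through the buffer, as we'll see in z_erofs_dedupe_match().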

And the rolling update:

static inline long long erofs_rolling_hash_advance(long long old_hash,
                                                   unsigned long long RM,
                                                   u8 to_remove, u8 to_add)
{
    long long hash = old_hash;
    long long to_remove_val = (to_remove * RM) % PRIME_NUMBER;

    hash = RADIX * (old_hash - to_remove_val) % PRIME_NUMBER;
    hash = (hash + to_add) % PRIME_NUMBER;

    /* We might get negative value of hash, converting it to positive */
    if (hash < 0)
        hash += PRIME_NUMBER;
    return hash;
}

Using the earlier example again: old_hash is H(abc), and to_remove and to_add correspond to a and d.

(old_hash - to_remove_val) is H(abc) - a*(10^2), so RM here is 10^2: the constant weight of the outgoing byte for a given input window. Since it's used for removal (remove), RM it is. (In textbook Rabin-Karp, RM is the precomputed R^(M-1), with R the radix and M the window size.) So RM is computed as:

static inline long long erofs_rollinghash_calc_rm(int window_size)
{
    int i;
    long long RM = 1;

    for (i = 0; i < window_size - 1; ++i)
        RM = (RM * RADIX) % PRIME_NUMBER;
    return RM;
}

red-black tree basics

Besides the rolling hash, compressed deduplication also uses a red-black tree. Let's take a look.

The relevant Wikipedia definitions:

a red-black tree is a kind of self-balancing binary search tree. Each node stores an extra bit representing “color” (“red” or “black”), used to ensure that the tree remains balanced during insertions and deletions.

A red-black tree is a self-balancing binary search tree.

a self-balancing binary search tree (BST) is any node-based binary search tree that automatically keeps its height (maximal number of levels below the root) small in the face of arbitrary item insertions and deletions.

A self-balancing BST just keeps the height from getting too deep.

a binary search tree (BST), also called an ordered or sorted binary tree, is a rooted binary tree data structure with the key of each internal node being greater than all the keys in the respective node’s left subtree and less than the ones in its right subtree.

A binary search tree is a sorted binary tree.

a binary tree is a tree data structure in which each node has at most two children, which are referred to as the left child and the right child.

And a binary tree: each node has at most two kids, a left one and a right one.

Stanford's public CS course material calls the recursive pointer structure of binary trees elegant. (Mmm, quite the artistic streak...)

Binary trees have an elegant recursive pointer structure

The erofs-utils implementation lives in dedupe.c + rb_tree.c. Just read on...

  • define the tree node
struct rb_node {
    int red;                 // Color red (1), black (0)
    struct rb_node *link[2]; // Link left [0] and right [1]
    void *value;             // User provided, used indirectly via rb_tree_node_cmp_f.
};

The node's value is a void *, supplied by the user and consumed indirectly through the ->cmp callback below.

struct rb_tree {
    struct rb_node *root;
    rb_tree_node_cmp_f cmp;
    size_t size;
    ...
};

The ->cmp for erofs dedup:

static int z_erofs_dedupe_rbtree_cmp(struct rb_tree *self,
        struct rb_node *node_a, struct rb_node *node_b)
{
    struct z_erofs_dedupe_item *e_a = node_a->value;
    struct z_erofs_dedupe_item *e_b = node_b->value;

    return (e_a->hash > e_b->hash) - (e_a->hash < e_b->hash);
}

Here ->value is a dedupe item, keyed by its hash field, which is exactly the rolling hash from earlier. The (a > b) - (a < b) trick returns -1/0/+1 and avoids the overflow a plain subtraction could run into.

  • insert the node
int z_erofs_dedupe_insert(struct z_erofs_inmem_extent *e,
                          void *original_data)
{
    struct z_erofs_dedupe_item *di;
    ...
    di->original_length = e->length;
    erofs_sha256(original_data, window_size, di->prefix_sha256);
    di->hash = erofs_rolling_hash_init(original_data,
                                       window_size, true); //tj: here
    memcpy(di->extra_data, original_data + window_size,
           e->length - window_size);
    di->compressed_blkaddr = e->blkaddr;
    di->compressed_blks = e->compressedblks;
    di->partial = e->partial;

    /* with the same rolling hash */
    if (!rb_tree_insert(dedupe_subtree, di)) //tj: here
        free(di);
    return 0;

This inserts the node di into the rb tree dedupe_subtree; di->hash is the initial rolling hash value from erofs_rolling_hash_init().
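As a standalone illustration of the same API (a minimal sketch with a hypothetical struct item; it assumes erofs-utils' rb_tree.h, whose rb_tree_create/rb_tree_insert/rb_tree_find are used just like above):

#include <stdio.h>
#include "rb_tree.h" /* the erofs-utils rb tree shown above */

struct item {
    long long hash; /* key, like z_erofs_dedupe_item's hash */
};

static int item_cmp(struct rb_tree *self,
                    struct rb_node *a, struct rb_node *b)
{
    struct item *ia = a->value, *ib = b->value;

    /* same -1/0/+1 idiom as z_erofs_dedupe_rbtree_cmp */
    return (ia->hash > ib->hash) - (ia->hash < ib->hash);
}

int main(void)
{
    struct rb_tree *tree = rb_tree_create(item_cmp);
    struct item a = { .hash = 42 }, key = { .hash = 42 };

    rb_tree_insert(tree, &a); /* nonzero on success, 0 on failure */
    struct item *found = rb_tree_find(tree, &key);

    printf("found: %lld\n", found ? found->hash : -1LL);
    return 0;
}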

  • find the node

It walks the tree calling ->cmp, the user-supplied callback:

void *
rb_tree_find(struct rb_tree *self, void *value) {
    void *result = NULL;
    if (self) {
        struct rb_node node = { .value = value };
        struct rb_node *it = self->root;
        int cmp = 0;
        while (it) {
            if ((cmp = self->cmp(self, it, &node))) { //tj: here
                // If the tree supports duplicates, they should be
                // chained to the right subtree for this to work
                it = it->link[cmp < 0];
            } else {
                break;
            }
        }
        result = it ? it->value : NULL;
    }
    return result;
}

In erofs dedup, this lookup happens in match:

int z_erofs_dedupe_match(struct z_erofs_dedupe_ctx *ctx)
{
    ...
    /* move backward byte-by-byte */
    for (; cur >= ctx->start; --cur) {
        struct z_erofs_dedupe_item *e;
        unsigned int extra;
        u8 sha256[32];

        if (initial) {
            /* initial try */
            e_find.hash = erofs_rolling_hash_init(cur, window_size, true);
            initial = false;
        } else {
            e_find.hash = erofs_rolling_hash_advance(e_find.hash,
                    rollinghash_rm, cur[window_size], cur[0]);
        }

        e = rb_tree_find(dedupe_tree, &e_find);
        if (!e) {
            e = rb_tree_find(dedupe_subtree, &e_find);
            if (!e)
                continue; //tj: can not find
        }

The rolling window spans from cur down to ctx->start. The first comparison uses the init value (the "initial try"); every later lookup uses the real rolling-hash update (..._advance()).
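The same scan pattern boiled down to a standalone sketch (a hypothetical find_backward(), again assuming rolling_hash.h): roll a window backward over a buffer and compare each hash against a target, just like the loop above:

#include "rolling_hash.h" /* assumed on the include path, as above */

/* return the offset of the last w-sized window in buf[0..len-1] whose
 * rolling hash equals target, or -1 -- mirrors the match loop above */
static int find_backward(unsigned char *buf, int len, int w, long long target)
{
    long long rm = erofs_rollinghash_calc_rm(w);
    unsigned char *cur;
    long long hash;

    if (len < w)
        return -1;

    cur = buf + len - w;
    hash = erofs_rolling_hash_init(cur, w, true);
    for (; cur >= buf; --cur) {
        if (cur != buf + len - w) /* after the initial try, just roll */
            hash = erofs_rolling_hash_advance(hash, rm, cur[w], cur[0]);
        if (hash == target) /* hash match is only a candidate; the real
                             * dedupe code still sha256/memcmp-verifies */
            return cur - buf;
    }
    return -1;
}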

OK, now let's see how EROFS actually does compressed data deduplication.

erofs data deduplication

As usual, concepts first. Wikipedia's definition:

data deduplication is a technique for eliminating duplicate copies of repeating data. Successful implementation of the technique can improve storage utilization,

In short, it eliminates duplicate data to reduce storage usage.

The overall flow of erofs data dedup:

vle_compress_one
|
*z_erofs_compress_dedupe(,,len)* (-> *z_erofs_dedupe_match*)
|
| <no duplicate>
erofs_compress_destsize
|
blk_write
|
*z_erofs_dedupe_insert* (non-inline)

For a piece of source data of length len, before compressing we first check whether duplicate data already exists. If it does (z_erofs_dedupe_match() returning 0 means a match), the data is not written to disk and we go straight to the next round. If there is no duplicate, the data has to be written, and it is also recorded (via z_erofs_dedupe_insert()) for later arrivals to compare against.

Once a match happens, the data index for deduplication falls back to the noncompact layout, presumably to keep the logic simple:

if (z_erofs_dedupe_match(&dctx))
    break;

/* fall back to noncompact indexes for deduplication */
inode->z_advise &= ~Z_EROFS_ADVISE_COMPACTED_2B;
inode->datalayout = EROFS_INODE_FLAT_COMPRESSION_LEGACY;

So how does the rolling window find duplicates? The window size is set to EROFS_BLKSIZ when dedup is initialized:

err = z_erofs_dedupe_init(EROFS_BLKSIZ);

int z_erofs_dedupe_match(struct z_erofs_dedupe_ctx *ctx)
{
    struct z_erofs_dedupe_item e_find;
    u8 *cur;
    bool initial = true;

    if (!dedupe_tree)
        return -ENOENT;

    if (ctx->cur > ctx->end - window_size)
        cur = ctx->end - window_size;
    else
        cur = ctx->cur;

    /* move backward byte-by-byte */
    for (; cur >= ctx->start; --cur) { //tj: here

Starting from cur, the check rolls backward byte by byte, down to ctx->start at most; ctx->start picks up where the previous rolling hash pass ended. Right?

If cur ended up moving backward, the dedup carries a delta:

delta = ctx->queue + ctx->head - dctx.cur;
if (delta) {
    DBG_BUGON(delta < 0);
    DBG_BUGON(!ctx->e.length);
    ctx->e.partial = true;
    ctx->e.length -= delta;
}

In that case only part of the data is deduplicated, and an advise flag marks this kind of HEAD lcluster when the index is written:

} else {
    type = ctx->e.raw ? Z_EROFS_VLE_CLUSTER_TYPE_PLAIN :
        Z_EROFS_VLE_CLUSTER_TYPE_HEAD;
    di.di_u.blkaddr = cpu_to_le32(ctx->e.blkaddr);

    if (ctx->e.partial) {
        DBG_BUGON(ctx->e.raw);
        advise |= Z_EROFS_VLE_DI_PARTIAL_REF;
    }
}

That way zmap can identify it precisely; this is the commit below:

erofs-utils: fuse: introduce partial-referenced pclusters

Due to deduplication for compressed data, pclusters can be partially
referenced with their prefixes.

Decompression algorithms should know that in advance, otherwise they
will fail out unexpectedly.

Reference doc