kswapd is a background pageout daemon that reclaims memory.

The wakeup interface is wake_all_kswapds/wakeup_kswapd; let's check.

/*
 * This is the 'heart' of the zoned buddy allocator.
 */
struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
			struct zonelist *zonelist, nodemask_t *nodemask)
{
	...
	/* First allocation attempt */
	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
			zonelist, high_zoneidx, alloc_flags,
			preferred_zone, classzone_idx, migratetype);
	if (unlikely(!page)) {
		/*
		 * Runtime PM, block IO and its error handling path
		 * can deadlock because I/O on the device might not
		 * complete.
		 */
		gfp_mask = memalloc_noio_flags(gfp_mask);
		page = __alloc_pages_slowpath(gfp_mask, order,
				zonelist, high_zoneidx, nodemask,
				preferred_zone, classzone_idx, migratetype);
	}
	...

static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
	struct zonelist *zonelist, enum zone_type high_zoneidx,
	nodemask_t *nodemask, struct zone *preferred_zone,
	int classzone_idx, int migratetype)
{
	...
restart:
	if (!(gfp_mask & __GFP_NO_KSWAPD))
		wake_all_kswapds(order, zonelist, high_zoneidx,
				preferred_zone, nodemask);
	...

ok, so during allocation: if the get_page_from_freelist attempt fails and kswapd is allowed to help (no __GFP_NO_KSWAPD in the mask), the slowpath wakes it up.
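The gate above boils down to one flag test. A minimal sketch (the flag value below is made up for illustration; the real __GFP_NO_KSWAPD definition lives in include/linux/gfp.h):

```c
#include <stdbool.h>

typedef unsigned int gfp_t;

/* Hypothetical flag value, for illustration only; the real one is
 * defined in include/linux/gfp.h. */
#define __GFP_NO_KSWAPD 0x400000u

/* Sketch of the decision made at the top of __alloc_pages_slowpath():
 * kswapd is woken unless the caller explicitly opted out with
 * __GFP_NO_KSWAPD. */
static bool slowpath_wakes_kswapd(gfp_t gfp_mask)
{
	return !(gfp_mask & __GFP_NO_KSWAPD);
}
```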

get_page_from_freelist is called from many places, e.g. __alloc_pages_may_oom, __alloc_pages_direct_compact, __alloc_pages_direct_reclaim, and __alloc_pages_high_priority.

The logic of get_page_from_freelist is:

First scan the zonelist and use the watermark to find a zone with enough free pages; if the whole list is traversed without a hit, try once more for remote nodes.
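The scan loop can be modeled as below. This is only a sketch with made-up types (each "zone" reduced to a free-page count and a low watermark); the real loop also applies the zlc, cpuset and fairness checks per zone:

```c
#include <stdbool.h>

/* Minimal model of a zone for the scan: just a free-page count and a
 * low watermark. Both fields are hypothetical simplifications. */
struct zone {
	long free_pages;
	long low_wmark;
};

/* Return the index of the first zone whose free pages are above its
 * watermark, or -1 if the whole list fails -- the caller would then
 * rescan (fairness off, remote nodes in) or fall into the slowpath. */
static int scan_zonelist(const struct zone *zones, int nr)
{
	for (int i = 0; i < nr; i++)
		if (zones[i].free_pages > zones[i].low_wmark)
			return i;
	return -1;
}
```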

Let's look at the remote-node part first:

	/*
	 * The first pass makes sure allocations are spread fairly within the
	 * local node. However, the local node might have free pages left
	 * after the fairness batches are exhausted, and remote zones haven't
	 * even been considered yet. Try once more without fairness, and
	 * include remote zones now, before entering the slowpath and waking
	 * kswapd: prefer spilling to a remote zone over swapping locally.
	 */
	if (alloc_flags & ALLOC_FAIR) {
		alloc_flags &= ~ALLOC_FAIR;
		if (nr_fair_skipped) { // me: local node with ZONE_FAIR_DEPLETED
			zonelist_rescan = true;
			reset_alloc_batches(preferred_zone);
		}
		if (nr_online_nodes > 1) // me: consider remote node
			zonelist_rescan = true;
	}

	if (unlikely(IS_ENABLED(CONFIG_NUMA) && zlc_active)) {
		/* Disable zlc cache for second zonelist scan */
		zlc_active = 0;
		zonelist_rescan = true;
	}

	if (zonelist_rescan)
		goto zonelist_scan;

	return NULL;
}

The comment explains the reasons for the try-once-more:

  1. the local node may still have free pages left after the fairness batches are exhausted (what does that mean?)
  2. consider remote nodes on a NUMA system
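Point 1 makes more sense with a toy model of the ALLOC_FAIR batching (a sketch of the idea behind nr_fair_skipped/reset_alloc_batches(); the types and field names here are made up): during the fair pass, a zone whose allocation batch is exhausted is skipped even if it still has free pages, so pages can be "left" while the batch is depleted. The unfair rescan resets the batches and considers those zones again.

```c
#include <stdbool.h>

/* Hypothetical per-zone state: free pages plus a fairness batch that
 * is decremented as allocations are served from this zone. */
struct fzone {
	long free_pages;
	long alloc_batch;
};

/* During the fair pass, a zone is only usable while its batch lasts --
 * even if it still has free pages (the "free pages left" case). */
static bool fair_pass_usable(const struct fzone *z)
{
	return z->alloc_batch > 0 && z->free_pages > 0;
}

/* Model of the batch reset done before the unfair rescan. */
static void reset_batches(struct fzone *z, int nr, long batch)
{
	for (int i = 0; i < nr; i++)
		z[i].alloc_batch = batch;
}
```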

The main question is how a zone with enough free pages is found: zone_watermark_ok first checks whether the free pages are above the watermark; if it's ok, we take the try_this_zone path; if it's not ok, zone_reclaim runs first and then zone_watermark_ok checks again, same as before.
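The core of the check can be sketched like this. This is a simplification of the zone_watermark_ok() idea only: the real check in mm/page_alloc.c also subtracts the lowmem reserve and walks the per-order free areas to make sure a large-enough block actually exists.

```c
#include <stdbool.h>

/* Simplified watermark check: the zone passes if, after handing out
 * 2^order pages, its free count would stay above the mark. */
static bool watermark_ok(long free_pages, unsigned int order, long mark)
{
	/* The request consumes (1 << order) pages; the -1 mirrors the
	 * kernel's off-by-one in the same spirit of "free pages after
	 * this allocation". */
	free_pages -= (1L << order) - 1;
	return free_pages > mark;
}
```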

The watermark passed in is set here:

/*
 * This is the 'heart' of the zoned buddy allocator.
 */
struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
			struct zonelist *zonelist, nodemask_t *nodemask)
{
	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
	struct zone *preferred_zone;
	struct zoneref *preferred_zoneref;
	struct page *page = NULL;
	int migratetype = gfpflags_to_migratetype(gfp_mask);
	unsigned int cpuset_mems_cookie;
	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR; // here
	int classzone_idx;
	...

ok, it's ALLOC_WMARK_LOW: once free pages drop below the low watermark, the first attempt fails, the allocation enters the slowpath, and kswapd is woken.
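Putting the watermarks together, the usual min/low/high behavior can be summarized in one small sketch (a simplification of the conventional description, not kernel code: above high kswapd sleeps, below low kswapd is woken, below min the allocator falls into direct reclaim):

```c
/* Which path an allocation takes relative to the per-zone watermarks
 * (simplified summary; the real decision also depends on gfp flags,
 * order and alloc_flags). */
enum alloc_path { FAST_PATH, WAKE_KSWAPD, DIRECT_RECLAIM };

static enum alloc_path pick_path(long free, long min, long low)
{
	if (free > low)
		return FAST_PATH;      /* first get_page_from_freelist() succeeds */
	if (free > min)
		return WAKE_KSWAPD;    /* slowpath wakes kswapd to reclaim in background */
	return DIRECT_RECLAIM;         /* allocator must reclaim synchronously */
}
```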