kswapd is a background pageout daemon that reclaims memory.

The wakeup interface is wake_all_kswapds/wakeup_kswapd; let's check.

/*
 * This is the 'heart' of the zoned buddy allocator.
 */
struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
			struct zonelist *zonelist, nodemask_t *nodemask)
{
	...
	/* First allocation attempt */
	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
			zonelist, high_zoneidx, alloc_flags,
			preferred_zone, classzone_idx, migratetype);
	if (unlikely(!page)) {
		/*
		 * Runtime PM, block IO and its error handling path
		 * can deadlock because I/O on the device might not
		 * complete.
		 */
		gfp_mask = memalloc_noio_flags(gfp_mask);
		page = __alloc_pages_slowpath(gfp_mask, order,
				zonelist, high_zoneidx, nodemask,
				preferred_zone, classzone_idx, migratetype);
	}
	...

static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
	struct zonelist *zonelist, enum zone_type high_zoneidx,
	nodemask_t *nodemask, struct zone *preferred_zone,
	int classzone_idx, int migratetype)
{
	...
restart:
	if (!(gfp_mask & __GFP_NO_KSWAPD))
		wake_all_kswapds(order, zonelist, high_zoneidx,
				preferred_zone, nodemask);
	...

ok, so during allocation: if the get_page_from_freelist attempt fails and kswapd is allowed to help (no __GFP_NO_KSWAPD in the mask), the slowpath wakes it up.
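The gate above boils down to one flag test. A minimal sketch (the flag value below is made up for illustration; the real __GFP_NO_KSWAPD definition lives in include/linux/gfp.h):

```c
#include <stdbool.h>

typedef unsigned int gfp_t;

/* Hypothetical flag value, for illustration only; the real one is
 * defined in include/linux/gfp.h. */
#define __GFP_NO_KSWAPD 0x400000u

/* Sketch of the decision made at the top of __alloc_pages_slowpath():
 * kswapd is woken unless the caller explicitly opted out with
 * __GFP_NO_KSWAPD. */
static bool slowpath_wakes_kswapd(gfp_t gfp_mask)
{
	return !(gfp_mask & __GFP_NO_KSWAPD);
}
```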

get_page_from_freelist is called from many places, e.g. __alloc_pages_may_oom, __alloc_pages_direct_compact, __alloc_pages_direct_reclaim, and __alloc_pages_high_priority.

The logic of get_page_from_freelist is:

First scan the zonelist and use the watermark to find a zone with enough free pages; if the whole list is traversed without a hit, try once more for remote nodes.
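The scan loop can be modeled as below. This is only a sketch with made-up types (each "zone" reduced to a free-page count and a low watermark); the real loop also applies the zlc, cpuset and fairness checks per zone:

```c
#include <stdbool.h>

/* Minimal model of a zone for the scan: just a free-page count and a
 * low watermark. Both fields are hypothetical simplifications. */
struct zone {
	long free_pages;
	long low_wmark;
};

/* Return the index of the first zone whose free pages are above its
 * watermark, or -1 if the whole list fails -- the caller would then
 * rescan (fairness off, remote nodes in) or fall into the slowpath. */
static int scan_zonelist(const struct zone *zones, int nr)
{
	for (int i = 0; i < nr; i++)
		if (zones[i].free_pages > zones[i].low_wmark)
			return i;
	return -1;
}
```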

Let's look at the remote-node part first:

	/*
	 * The first pass makes sure allocations are spread fairly within the
	 * local node. However, the local node might have free pages left
	 * after the fairness batches are exhausted, and remote zones haven't
	 * even been considered yet. Try once more without fairness, and
	 * include remote zones now, before entering the slowpath and waking
	 * kswapd: prefer spilling to a remote zone over swapping locally.
	 */
	if (alloc_flags & ALLOC_FAIR) {
		alloc_flags &= ~ALLOC_FAIR;
		if (nr_fair_skipped) { // me: local node with ZONE_FAIR_DEPLETED
			zonelist_rescan = true;
			reset_alloc_batches(preferred_zone);
		}
		if (nr_online_nodes > 1) // me: consider remote node
			zonelist_rescan = true;
	}

	if (unlikely(IS_ENABLED(CONFIG_NUMA) && zlc_active)) {
		/* Disable zlc cache for second zonelist scan */
		zlc_active = 0;
		zonelist_rescan = true;
	}

	if (zonelist_rescan)
		goto zonelist_scan;

	return NULL;
}

The comment explains the reasons for the try-once-more:

  1. the local node may still have free pages left after the fairness batches are exhausted (what does that mean?)
  2. consider remote nodes on a NUMA system
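Point 1 makes more sense with a toy model of the ALLOC_FAIR batching (a sketch of the idea behind nr_fair_skipped/reset_alloc_batches(); the types and field names here are made up): during the fair pass, a zone whose allocation batch is exhausted is skipped even if it still has free pages, so pages can be "left" while the batch is depleted. The unfair rescan resets the batches and considers those zones again.

```c
#include <stdbool.h>

/* Hypothetical per-zone state: free pages plus a fairness batch that
 * is decremented as allocations are served from this zone. */
struct fzone {
	long free_pages;
	long alloc_batch;
};

/* During the fair pass, a zone is only usable while its batch lasts --
 * even if it still has free pages (the "free pages left" case). */
static bool fair_pass_usable(const struct fzone *z)
{
	return z->alloc_batch > 0 && z->free_pages > 0;
}

/* Model of the batch reset done before the unfair rescan. */
static void reset_batches(struct fzone *z, int nr, long batch)
{
	for (int i = 0; i < nr; i++)
		z[i].alloc_batch = batch;
}
```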

The main question is how a zone with enough free pages is found: zone_watermark_ok first checks whether the free pages are above the watermark; if it's ok, we take the try_this_zone path; if it's not ok, zone_reclaim runs first and then zone_watermark_ok checks again, same as before.
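The core of the check can be sketched like this. This is a simplification of the zone_watermark_ok() idea only: the real check in mm/page_alloc.c also subtracts the lowmem reserve and walks the per-order free areas to make sure a large-enough block actually exists.

```c
#include <stdbool.h>

/* Simplified watermark check: the zone passes if, after handing out
 * 2^order pages, its free count would stay above the mark. */
static bool watermark_ok(long free_pages, unsigned int order, long mark)
{
	/* The request consumes (1 << order) pages; the -1 mirrors the
	 * kernel's off-by-one in the same spirit of "free pages after
	 * this allocation". */
	free_pages -= (1L << order) - 1;
	return free_pages > mark;
}
```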

The watermark passed in is set here:

/*
 * This is the 'heart' of the zoned buddy allocator.
 */
struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
			struct zonelist *zonelist, nodemask_t *nodemask)
{
	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
	struct zone *preferred_zone;
	struct zoneref *preferred_zoneref;
	struct page *page = NULL;
	int migratetype = gfpflags_to_migratetype(gfp_mask);
	unsigned int cpuset_mems_cookie;
	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR; // here
	int classzone_idx;
	...

ok, it's ALLOC_WMARK_LOW: once free pages drop below the low watermark, the first attempt fails, the allocation enters the slowpath, and kswapd is woken.
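Putting the watermarks together, the usual min/low/high behavior can be summarized in one small sketch (a simplification of the conventional description, not kernel code: above high kswapd sleeps, below low kswapd is woken, below min the allocator falls into direct reclaim):

```c
/* Which path an allocation takes relative to the per-zone watermarks
 * (simplified summary; the real decision also depends on gfp flags,
 * order and alloc_flags). */
enum alloc_path { FAST_PATH, WAKE_KSWAPD, DIRECT_RECLAIM };

static enum alloc_path pick_path(long free, long min, long low)
{
	if (free > low)
		return FAST_PATH;      /* first get_page_from_freelist() succeeds */
	if (free > min)
		return WAKE_KSWAPD;    /* slowpath wakes kswapd to reclaim in background */
	return DIRECT_RECLAIM;         /* allocator must reclaim synchronously */
}
```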