
Dissecting Page Reclaim: Improving Application Stability and Efficiency

深度Linux · 2024-06-06

With demand paging, user-mode processes can keep obtaining page frames as long as they keep running, but demand paging has no way to force a process to release frames it no longer uses. Sooner or later, all free memory would therefore end up assigned to processes and to kernel caches. For this reason the Linux page frame reclaiming algorithm (PFRA) "steals" page frames from user processes and kernel caches instead of drawing only on the buddy system's free-block lists.

In practice, page frame reclaiming must run before all free memory is used up. Otherwise the kernel can easily fall into a deadlock of memory requests and the system may crash: to free a page frame the kernel has to write its data to disk, but to complete that write it needs to allocate another page frame (for example, for the buffer heads of the I/O transfer). If no free frame exists, no frame can be freed. One goal of the reclaiming algorithm is therefore to keep a minimum pool of free page frames so the kernel can always recover safely from a "low on memory" condition.

1. Overview

As a Linux system keeps allocating memory and pressure grows, the kernel reclaims memory from every zone that is under pressure. Reclaim targets anonymous pages and file pages. For anonymous pages, reclaim picks out pages that have not been used recently, writes them to the swap area, and then releases them to the buddy system as free page frames. For file pages, reclaim likewise picks out rarely used pages: if a page's contents match the corresponding data on disk, the page is clean, needs no writeback, and is released to the buddy system directly; if the contents differ from the on-disk data, the page is dirty and must first be written back to its place in the file before the frame is freed. When reclaim completes, the number of free page frames rises and memory pressure is relieved.

To explain page reclaim we must first be clear about the allocation path. When pages are requested, the page allocator first tries to allocate against the low watermark; success means the fast path. If that fails, memory is slightly short: the allocator takes the slow path, wakes the memory node's page reclaim kernel thread to reclaim asynchronously, then retries against the minimum watermark. If allocation against the minimum watermark also fails, memory is severely short and the allocator performs direct reclaim itself. Different reclaim policies apply to different kinds of physical pages: swap-backed pages and file pages backed by storage devices.

When memory is found to be tight, that is, on the slow path, the system reclaims memory through several mechanisms, for example the following three:

Reclaiming caches, for example using the LRU (Least Recently Used) algorithm to reclaim the least recently used memory pages;

The Linux kernel uses the LRU (Least Recently Used) algorithm to select the least recently used physical pages. When a reclaimed physical page is mapped into a process's virtual address space, the virtual-to-physical mapping must be removed from the page tables.

Reclaiming rarely accessed memory by writing it out to disk through the swap area. Swap is simply a piece of disk space used as memory: data a process is not currently using is written to disk (swap-out), and when the process touches that memory again the data is read back from disk (swap-in).

Killing processes: under memory pressure the system can also use OOM (Out of Memory) to kill processes that occupy large amounts of memory outright.

OOM (Out of Memory) is really a protection mechanism of the kernel. It monitors the memory usage of processes and scores each one with oom_score: the more memory a process consumes, the higher its oom_score; the more CPU time it has used, the lower its oom_score. A process with a higher oom_score consumes more memory and is more likely to be killed by the OOM killer, which protects the rest of the system.
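As a quick, hedged illustration (user-space code, not part of the kernel paths discussed in this article), a process's current score and its user-adjustable bias can be read from the standard procfs files /proc/<pid>/oom_score and /proc/<pid>/oom_score_adj:

#include <stdio.h>

/* Print this process's OOM score and the adjustable bias applied to it. */
int main(void)
{
    FILE *f;
    int score = 0, adj = 0;

    f = fopen("/proc/self/oom_score", "r");
    if (f) { fscanf(f, "%d", &score); fclose(f); }

    f = fopen("/proc/self/oom_score_adj", "r");
    if (f) { fscanf(f, "%d", &adj); fclose(f); }

    printf("oom_score=%d oom_score_adj=%d\n", score, adj);
    return 0;
}

Writing a large negative value such as -1000 to oom_score_adj (with sufficient privileges) makes a process far less likely to be chosen by the OOM killer.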

Cache reclaim and reclaim of rarely accessed memory both generally use the LRU algorithm to select the least recently used physical pages.

Selecting target pages

The goal of the PFRA is to obtain page frames and make them free. The PFRA handles page frames differently depending on their contents; they can be divided into unreclaimable pages, swappable pages, syncable pages and discardable pages.

When page reclaim is triggered

Linux checks how memory is being used, and decides whether free memory has become scarce enough to require page reclaim, in the following situations:

Periodic checking: performed by the background daemon kswapd. This thread regularly checks the system's memory usage and starts reclaiming pages when the number of free physical pages falls below a given threshold.

Triggering on a "severe memory shortage" event: for instance, the kernel suddenly needs to allocate a large block of memory from the buddy system for a user process, or needs to create a very large buffer, and the system cannot supply enough physical memory to satisfy the request. The kernel must then reclaim pages as quickly as possible to free enough space; this is known as direct reclaim.

Reclaim for hibernation: when entering the suspend-to-disk state, the kernel must free memory.

If the kernel still cannot reclaim enough pages to satisfy the request, it has only one option left: the OOM (out of memory) killer, which picks the most suitable process in the system, kills it, and frees all the pages it occupied.

The reclaim mechanisms described above rely mainly on three fields, pages_min, pages_low and pages_high, defined in every zone descriptor. Roughly speaking, pages_low is the level below which kswapd is woken to reclaim in the background, pages_min is the emergency reserve below which allocations fall back to direct reclaim, and pages_high is the level at which kswapd considers the zone balanced and stops.
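A minimal, hedged sketch of how such watermarks steer reclaim decisions; the helper functions are illustrative stubs, not kernel APIs, and the real checks live in the allocator and in kswapd:

/* Hedged sketch, not the kernel's exact code: how a zone's three watermarks
 * drive allocation and reclaim decisions. The helpers below are stubs. */
enum wmark_sketch { WMARK_MIN_S, WMARK_LOW_S, WMARK_HIGH_S };

static void wake_kswapd_stub(void)    { /* would wake the node's reclaim thread */ }
static void direct_reclaim_stub(void) { /* would reclaim synchronously in the caller */ }

static void check_zone_pressure(unsigned long free_pages,
                                const unsigned long wmark[3])
{
    if (free_pages >= wmark[WMARK_HIGH_S]) {
        /* plenty of free pages: background reclaim can stop */
    } else if (free_pages >= wmark[WMARK_LOW_S]) {
        /* enough for the fast path: no reclaim needed */
    } else if (free_pages >= wmark[WMARK_MIN_S]) {
        wake_kswapd_stub();        /* slightly short: reclaim asynchronously */
    } else {
        direct_reclaim_stub();     /* severely short: reclaim before allocating */
    }
}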

2. The page reclaim mechanism

When pages are requested, the page allocator first tries to allocate against the low watermark. If that fails, memory is slightly short: the allocator wakes the memory node's page reclaim kernel thread to reclaim pages asynchronously, then retries against the minimum watermark. If allocation against the minimum watermark also fails, memory is severely short and the page allocator performs direct reclaim.

Different reclaim policies are used for different kinds of physical pages: swap-backed pages and file pages backed by storage devices.

How physical pages are chosen for reclaim

The Linux kernel uses the LRU (Least Recently Used) algorithm to select the least recently used physical pages. When a reclaimed physical page is mapped into a process's virtual address space, the virtual-to-physical mapping must be removed from the page tables.

2.1 LRU data structures

The memory management subsystem describes physical memory with a three-level node/zone/page hierarchy, represented by struct pglist_data, struct zone and struct page respectively. The node structure struct pglist_data contains a struct lruvec, the LRU list descriptor:

typedef struct pglist_data {
......
spinlock_t lru_lock; /* lock protecting the LRU lists */
/* Fields commonly accessed by the page reclaim scanner */
struct lruvec lruvec; /* LRU list descriptor, containing the five LRU lists */
......
} pg_data_t;

struct lruvec {
struct list_head lists[NR_LRU_LISTS]; /* heads of the five LRU doubly linked lists */
struct zone_reclaim_stat reclaim_stat; /* reclaim-related statistics */
/* Evictions & activations on the inactive file list */
atomic_long_t inactive_age;
/* Refaults at the time of last reclaim cycle */
unsigned long refaults;
#ifdef CONFIG_MEMCG
struct pglist_data *pgdat; /* owning memory node (struct pglist_data) */
#endif
};

enum lru_list {
LRU_INACTIVE_ANON = LRU_BASE,
LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
LRU_UNEVICTABLE,
NR_LRU_LISTS
};

From enum lru_list above we can see the five LRU lists (a short sketch of how a page is assigned to one of them follows the list):

  • the inactive anonymous LRU list, linking anonymous pages that are inactive, i.e. accessed infrequently of late;

  • the active anonymous LRU list, linking anonymous pages that are active, i.e. accessed frequently of late;

  • the inactive file LRU list, linking file pages that are inactive, i.e. accessed infrequently of late;

  • the active file LRU list, linking file pages that are active, i.e. accessed frequently of late;

  • the unevictable LRU list, linking physical pages locked in memory with mlock that may not be reclaimed.
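A hedged sketch of the selection logic, mirroring the idea behind the kernel's page_lru() helper (plain ints stand in here for the real page flag bits described in the struct page comments below):

/* Hedged sketch: which of the five LRU lists a page belongs to, derived from
 * whether it is unevictable (mlocked), swap-backed (anonymous) and active. */
enum lru_list_sketch {
    INACTIVE_ANON_S, ACTIVE_ANON_S,
    INACTIVE_FILE_S, ACTIVE_FILE_S,
    UNEVICTABLE_S,
};

static enum lru_list_sketch page_lru_sketch(int unevictable, int swapbacked, int active)
{
    if (unevictable)
        return UNEVICTABLE_S;                        /* mlocked or otherwise pinned */
    if (swapbacked)                                  /* anonymous / shmem pages */
        return active ? ACTIVE_ANON_S : INACTIVE_ANON_S;
    return active ? ACTIVE_FILE_S : INACTIVE_FILE_S; /* file-backed pages */
}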

/* Page descriptor: holds a set of flags (such as PG_locked, PG_error); the zone and node the frame belongs to are also encoded here */
struct page {
/* Flags that matter to the LRU code:
 * PG_active: the page is on (or about to be put on) an active LRU list
 * PG_referenced: the page was accessed recently; set on every access
 * PG_lru: the page is on an LRU list
 * PG_mlocked: the page is locked in memory by mlock() and must not be swapped out or freed
 * PG_swapbacked: the page is backed by swap; it may be process anonymous memory (heap, stack, data segment), an anonymous shared mmap mapping, or a shmem shared-memory mapping
 */
  unsigned long flags;

  ......

  union {
/* The list a page is linked on depends on its state:
 * 1. a page in use by a process is linked on the corresponding LRU list (and LRU cache)
 * 2. a free page frame that is the first page of a free block is linked on a buddy-system free list (only the first page of the block is linked)
 * 3. the first page of a slab is linked on a slab list (e.g. slab's full list, slub's partial list)
 * 4. an isolated page is linked on an isolation list
 */
    struct list_head lru;

    ......

  };

  ......

}

2.2 A walk through the page reclaim source

A memory allocation goes alloc_page → alloc_pages_current → __alloc_pages_nodemask. __alloc_pages_nodemask() is the heart of the allocator and organizes the whole allocation flow. The earlier article did not analyze __alloc_pages_slowpath in detail and only covered the fast path, get_page_from_freelist; let us now look closely at __alloc_pages_slowpath, in mm/page_alloc.c:

static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
struct alloc_context *ac)
{
bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;
struct page *page = NULL;
unsigned int alloc_flags;
unsigned long did_some_progress;
enum compact_priority compact_priority;
enum compact_result compact_result;
int compaction_retries;
int no_progress_loops;
unsigned int cpuset_mems_cookie;
int reserve_flags;

/*
* We also sanity check to catch abuse of atomic reserves being used by
* callers that are not in atomic context.
*/
if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
gfp_mask &= ~__GFP_ATOMIC;

retry_cpuset:
compaction_retries = 0;
no_progress_loops = 0;
compact_priority = DEF_COMPACT_PRIORITY;
/* the cpuset may later be consulted to see which memory nodes the current task may allocate from */
cpuset_mems_cookie = read_mems_allowed_begin();

/*
* The fast path uses conservative alloc_flags to succeed only until
* kswapd needs to be woken up, and to avoid the cost of setting up
* alloc_flags precisely. So we do that now.
*/
/* convert the gfp flags to internal allocation flags, adjusted to use the minimum watermark */
alloc_flags = gfp_to_alloc_flags(gfp_mask);

/*
* We need to recalculate the starting point for the zonelist iterator
* because we might have used different nodemask in the fast path, or
* there was a cpuset modification and we are retrying - otherwise we
* could end up iterating over non-eligible zones endlessly.
*/
/* recompute the preferred zone: the fast path may have used a different nodemask, and we must not iterate over ineligible zones endlessly */
ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
ac->high_zoneidx, ac->nodemask);
if (!ac->preferred_zoneref->zone)
goto nopage;

/* asynchronous reclaim: wake the kswapd kernel threads to reclaim pages */
if (gfp_mask & __GFP_KSWAPD_RECLAIM)
wake_all_kswapds(order, gfp_mask, ac);

/*
* The adjusted alloc_flags might result in immediate success, so try
* that first
*/
/* the adjusted alloc_flags may let the allocation succeed immediately, so try that first */
page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
if (page)
goto got_pg;

/*
* For costly allocations, try direct compaction first, as it's likely
* that we have enough base pages and don't need to reclaim. For non-
* movable high-order allocations, do that as well, as compaction will
* try prevent permanent fragmentation by migrating from blocks of the
* same migratetype.
* Don't try this for allocations that are allowed to ignore
* watermarks, as the ALLOC_NO_WATERMARKS attempt didn't yet happen.
*/
/* costly orders, or non-movable higher-order requests, as long as watermarks may not be ignored yet */
if (can_direct_reclaim &&
(costly_order ||
(order > 0 && ac->migratetype != MIGRATE_MOVABLE))
&& !gfp_pfmemalloc_allowed(gfp_mask)) {
/* try memory compaction (page migration) first, then attempt the allocation */
page = __alloc_pages_direct_compact(gfp_mask, order,
alloc_flags, ac,
INIT_COMPACT_PRIORITY,
&compact_result);
if (page)
goto got_pg;

/*
* Checks for costly allocations with __GFP_NORETRY, which
* includes THP page fault allocations
*/
if (costly_order && (gfp_mask & __GFP_NORETRY)) {
/*
* If compaction is deferred for high-order allocations,
* it is because sync compaction recently failed. If
* this is the case and the caller requested a THP
* allocation, we do not want to heavily disrupt the
* system, so we fail the allocation instead of entering
* direct reclaim.
*/
if (compact_result == COMPACT_DEFERRED)
goto nopage;

/*
* Looks like reclaim/compaction is worth trying, but
* sync compaction could be very expensive, so keep
* using async compaction.
*/
/* synchronous compaction is very expensive, so keep using asynchronous compaction */
compact_priority = INIT_COMPACT_PRIORITY;
}
}

retry:
/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
/* re-wake kswapd in case it accidentally went to sleep while we loop */
if (gfp_mask & __GFP_KSWAPD_RECLAIM)
wake_all_kswapds(order, gfp_mask, ac);

/* if the caller is entitled to emergency reserves, ignore the watermarks and allocate without them */
reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
if (reserve_flags)
alloc_flags = reserve_flags;

/*
* Reset the nodemask and zonelist iterators if memory policies can be
* ignored. These allocations are high priority and system rather than
* user oriented.
*/
/* if memory policies can be ignored, reset the nodemask and zonelist iterators */
if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
ac->nodemask = NULL;
ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
ac->high_zoneidx, ac->nodemask);
}

/* Attempt with potentially adjusted zonelist and alloc_flags */
/* retry with the possibly adjusted zonelist and alloc_flags */
page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
if (page)
goto got_pg;

/* Caller is not willing to reclaim, we can't balance anything */
/* if the caller is not willing to do direct reclaim, the allocation fails */
if (!can_direct_reclaim)
goto nopage;

/* Avoid recursion of direct reclaim */
if (current->flags & PF_MEMALLOC)
goto nopage;

/* Try direct reclaim and then allocating */
/* direct page reclaim, then try the allocation again */
page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
&did_some_progress);
if (page)
goto got_pg;

/* Try direct compaction and then allocating */
/* direct compaction, then try the allocation again */
page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
compact_priority, &compact_result);
if (page)
goto got_pg;

/* Do not loop if specifically requested */
/* give up if the caller asked not to retry */
if (gfp_mask & __GFP_NORETRY)
goto nopage;

/*
* Do not retry costly high order allocations unless they are
* __GFP_RETRY_MAYFAIL
*/
/* do not retry costly high-order allocations unless they are __GFP_RETRY_MAYFAIL */
if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL))
goto nopage;

/* check whether it is worth retrying reclaim */
if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
did_some_progress > 0, &no_progress_loops))
goto retry;

/*
* It doesn't make any sense to retry for the compaction if the order-0
* reclaim is not able to make any progress because the current
* implementation of the compaction depends on the sufficient amount
* of free memory (see __compaction_suitable)
*/
/* for order > 0, check whether compaction is worth retrying */
if (did_some_progress > 0 &&
should_compact_retry(ac, order, alloc_flags,
compact_result, &compact_priority,
&compaction_retries))
goto retry;


/* Deal with possible cpuset update races before we start OOM killing */
/* deal with a possible cpuset update race before starting to OOM kill */
if (check_retry_cpuset(cpuset_mems_cookie, ac))
goto retry_cpuset;

/* Reclaim has failed us, start killing things */
/* reclaim has failed: let the OOM killer pick a process to kill */
page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
if (page)
goto got_pg;

/* Avoid allocations with no watermarks from looping endlessly */
/* if the current task is the OOM victim and watermarks were ignored, give up instead of looping */
if (tsk_is_oom_victim(current) &&
(alloc_flags == ALLOC_OOM ||
(gfp_mask & __GFP_NOMEMALLOC)))
goto nopage;

/* Retry as long as the OOM killer is making progress */
/* retry as long as the OOM killer is making progress */
if (did_some_progress) {
no_progress_loops = 0;
goto retry;
}

nopage:
/* Deal with possible cpuset update races before we fail */
if (check_retry_cpuset(cpuset_mems_cookie, ac))
goto retry_cpuset;

/*
* Make sure that __GFP_NOFAIL request doesn't leak out and make sure
* we always retry
*/
if (gfp_mask & __GFP_NOFAIL) {
/*
* All existing users of the __GFP_NOFAIL are blockable, so warn
* of any new users that actually require GFP_NOWAIT
*/
if (WARN_ON_ONCE(!can_direct_reclaim))
goto fail;

/*
* PF_MEMALLOC request from this context is rather bizarre
* because we cannot reclaim anything and only can loop waiting
* for somebody to do a work for us
*/
WARN_ON_ONCE(current->flags & PF_MEMALLOC);

/*
* non failing costly orders are a hard requirement which we
* are not prepared for much so let's warn about these users
* so that we can identify them and convert them to something
* else.
*/
WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);

/*
* Help non-failing allocations by giving them access to memory
* reserves but do not use ALLOC_NO_WATERMARKS because this
* could deplete whole memory reserves which would just make
* the situation worse
*/
/* give the non-failing allocation access to part of the memory reserves */
page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
if (page)
goto got_pg;

cond_resched();
goto retry;
}
fail:
warn_alloc(gfp_mask, ac->nodemask,
"page allocation failure: order:%u", order);
got_pg:
return page;
}

As the code above shows, the slow path makes a long series of decisions. wake_all_kswapds wakes the asynchronous memory reclaim threads (kswapd), which are set up by their own initialization path.
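As a hedged sketch of what that background thread conceptually does (the real implementation is kswapd()/balance_pgdat() in mm/vmscan.c; the helpers below are illustrative stubs, not kernel APIs):

/* Hedged sketch of a kswapd-like background reclaim loop. */
static void sleep_until_woken(void)     { /* would wait on the node's kswapd wait queue */ }
static int  node_balanced(void)         { return 1; /* free pages back above the high watermark? */ }
static void reclaim_a_small_batch(void) { /* would shrink the node's LRU lists a little */ }

static void kswapd_like_loop(void)
{
    for (;;) {
        sleep_until_woken();            /* woken by wake_all_kswapds() on the slow path */
        while (!node_balanced())
            reclaim_a_small_batch();    /* reclaim in batches, re-checking the watermark */
    }
}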

2.3 Direct page reclaim

On the slow reclaim path the kernel first starts asynchronous reclaim; when that is not enough, it falls back to the direct reclaim described next. The entry point for direct page reclaim is __alloc_pages_direct_reclaim, in mm/page_alloc.c:

static inline struct page *
__alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
unsigned int alloc_flags, const struct alloc_context *ac,
unsigned long *did_some_progress)
{
struct page *page = NULL;
bool drained = false;
/* the really slow path: synchronous direct page reclaim */
*did_some_progress = __perform_reclaim(gfp_mask, order, ac);
if (unlikely(!(*did_some_progress)))
return NULL;

retry:
/* then attempt the allocation */
page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);

/*
* If an allocation failed after direct reclaim, it could be because
* pages are pinned on the per-cpu lists or in high alloc reserves.
* Shrink them them and try again
*/
/* if allocation still fails after direct reclaim, pages may be pinned on the per-CPU lists or held in the high-order atomic reserves; release them and retry */
if (!page && !drained) {
unreserve_highatomic_pageblock(ac, false); /* release the high-order atomic reserves */
drain_all_pages(NULL); /* drain pages pinned on the per-CPU lists */
drained = true;
goto retry; /* try the slow-path allocation once more */
}

return page;
}

__alloc_pages_direct_reclaim calls __perform_reclaim to do the synchronous direct reclaim and then allocates pages through get_page_from_freelist. __perform_reclaim:

static int
__perform_reclaim(gfp_t gfp_mask, unsigned int order,
const struct alloc_context *ac)
{
struct reclaim_state reclaim_state;
int progress;
unsigned int noreclaim_flag;

cond_resched(); /* voluntarily yield the CPU */

/* We now go into synchronous reclaim */
cpuset_memory_pressure_bump(); /* account cpuset memory pressure */
fs_reclaim_acquire(gfp_mask);
noreclaim_flag = memalloc_noreclaim_save(); /* save the task's reclaim flags */
reclaim_state.reclaimed_slab = 0;
current->reclaim_state = &reclaim_state;

/* do the direct page reclaim */
progress = try_to_free_pages(ac->zonelist, order, gfp_mask,
ac->nodemask);

current->reclaim_state = NULL;
memalloc_noreclaim_restore(noreclaim_flag); /* restore the task's flags */
fs_reclaim_release(gfp_mask);

cond_resched();

return progress;
}

After saving a little per-task state, __perform_reclaim calls try_to_free_pages to do the actual direct page reclaim.

try_to_free_pages lives in mm/vmscan.c:

unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *nodemask)
{
unsigned long nr_reclaimed;
struct scan_control sc = {
.nr_to_reclaim = SWAP_CLUSTER_MAX,
.gfp_mask = current_gfp_context(gfp_mask),
.reclaim_idx = gfp_zone(gfp_mask),
.order = order,
.nodemask = nodemask,
.priority = DEF_PRIORITY,
.may_writepage = !laptop_mode,
.may_unmap = 1,
.may_swap = 1,
};

/*
* scan_control uses s8 fields for order, priority, and reclaim_idx.
* Confirm they are large enough for max values.
*/
BUILD_BUG_ON(MAX_ORDER > S8_MAX);
BUILD_BUG_ON(DEF_PRIORITY > S8_MAX);
BUILD_BUG_ON(MAX_NR_ZONES > S8_MAX);

/*
* Do not enter reclaim if fatal signal was delivered while throttled.
* 1 is returned so that the page allocator does not OOM kill at this
* point.
*/
/* if a fatal signal was delivered while throttled, do not enter reclaim; return 1 so the page allocator does not OOM kill at this point */
if (throttle_direct_reclaim(sc.gfp_mask, zonelist, nodemask))
return 1;

trace_mm_vmscan_direct_reclaim_begin(order,
sc.may_writepage,
sc.gfp_mask,
sc.reclaim_idx);

/* this is the main entry point for direct page reclaim */
nr_reclaimed = do_try_to_free_pages(zonelist, &sc);

trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);

return nr_reclaimed;
}

try_to_free_pages does the real work in do_try_to_free_pages:

static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
struct scan_control *sc)
{
int initial_priority = sc->priority;
pg_data_t *last_pgdat;
struct zoneref *z;
struct zone *zone;
retry:
delayacct_freepages_start();

if (global_reclaim(sc)) /* this is global (not memcg-limited) reclaim */
__count_zid_vm_events(ALLOCSTALL, sc->reclaim_idx, 1);

do {
/* compute the vmpressure value from the reclaimer priority level */
vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
sc->priority);
sc->nr_scanned = 0;
shrink_zones(zonelist, sc); /* the direct reclaim path of the allocating process */

if (sc->nr_reclaimed >= sc->nr_to_reclaim)
break;

if (sc->compaction_ready)
break;

/*
* If we're getting trouble reclaiming, start doing
* writepage even in laptop mode.
*/
/* if reclaim is getting into trouble, start doing writepage even in laptop mode */
if (sc->priority < DEF_PRIORITY - 2)
sc->may_writepage = 1;
} while (--sc->priority >= 0);

last_pgdat = NULL;
/* for each zone's node, snapshot the refault statistics and clear the memcg congestion state */
for_each_zone_zonelist_nodemask(zone, z, zonelist, sc->reclaim_idx,
sc->nodemask) {
if (zone->zone_pgdat == last_pgdat)
continue;
last_pgdat = zone->zone_pgdat;
snapshot_refaults(sc->target_mem_cgroup, zone->zone_pgdat);
set_memcg_congestion(last_pgdat, sc->target_mem_cgroup, false);
}

delayacct_freepages_end();

if (sc->nr_reclaimed) /* if any pages were reclaimed, return the count */
return sc->nr_reclaimed;

/* Aborted reclaim to try compaction? don't OOM, then */
/* reclaim was aborted in favour of compaction: report progress rather than OOM */
if (sc->compaction_ready)
return 1;

/* Untapped cgroup reserves? Don't OOM, retry. */
if (sc->memcg_low_skipped) {
sc->priority = initial_priority;
sc->memcg_low_reclaim = 1;
sc->memcg_low_skipped = 0;
goto retry;
}

return 0;
}

do_try_to_free_pages calls shrink_zones, the direct reclaim path over the zonelist:

static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
{
struct zoneref *z;
struct zone *zone;
unsigned long nr_soft_reclaimed;
unsigned long nr_soft_scanned;
gfp_t orig_mask;
pg_data_t *last_pgdat = NULL;

/*
* If the number of buffer_heads in the machine exceeds the maximum
* allowed level, force direct reclaim to scan the highmem zone as
* highmem pages could be pinning lowmem pages storing buffer_heads
*/
orig_mask = sc->gfp_mask;
if (buffer_heads_over_limit) {
sc->gfp_mask |= __GFP_HIGHMEM;
sc->reclaim_idx = gfp_zone(sc->gfp_mask);
}

for_each_zone_zonelist_nodemask(zone, z, zonelist,
sc->reclaim_idx, sc->nodemask) {
/*
* Take care memory controller reclaiming has small influence
* to global LRU.
*/
if (global_reclaim(sc)) { /* global page reclaim */
if (!cpuset_zone_allowed(zone,
GFP_KERNEL | __GFP_HARDWALL))
continue;

/*
* If we already have plenty of memory free for
* compaction in this zone, don't free any more.
* Even though compaction is invoked for any
* non-zero order, only frequent costly order
* reclamation is disruptive enough to become a
* noticeable problem, like transparent huge
* page allocations.
*/
/* if compaction is enabled and this zone already has enough free memory for it */
if (IS_ENABLED(CONFIG_COMPACTION) &&
sc->order > PAGE_ALLOC_COSTLY_ORDER &&
compaction_ready(zone, sc)) {
sc->compaction_ready = true;
continue;
}

/*
* Shrink each node in the zonelist once. If the
* zonelist is ordered by zone (not the default) then a
* node may be shrunk multiple times but in that case
* the user prefers lower zones being preserved.
*/
if (zone->zone_pgdat == last_pgdat)
continue;

/*
* This steals pages from memory cgroups over softlimit
* and returns the number of reclaimed pages and
* scanned pages. This works for global memory pressure
* and balancing, not for a memcg's limit.
*/
nr_soft_scanned = 0;
nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone->zone_pgdat,
sc->order, sc->gfp_mask,
&nr_soft_scanned);
sc->nr_reclaimed += nr_soft_reclaimed;
sc->nr_scanned += nr_soft_scanned;
/* need some check for avoid more shrink_zone() */
}

/* See comment about same check for global reclaim above */
if (zone->zone_pgdat == last_pgdat)
continue;
last_pgdat = zone->zone_pgdat;
shrink_node(zone->zone_pgdat, sc);
}

/*
* Restore to original mask to avoid the impact on the caller if we
* promoted it to __GFP_HIGHMEM.
*/
sc->gfp_mask = orig_mask; /* restore the original mask */
}

shrink_zones does its main work through shrink_node:

static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
{
struct reclaim_state *reclaim_state = current->reclaim_state;
unsigned long nr_reclaimed, nr_scanned;
bool reclaimable = false;

do {
struct mem_cgroup *root = sc->target_mem_cgroup;
struct mem_cgroup_reclaim_cookie reclaim = {
.pgdat = pgdat,
.priority = sc->priority,
};
unsigned long node_lru_pages = 0;
struct mem_cgroup *memcg;

memset(&sc->nr, 0, sizeof(sc->nr));

nr_reclaimed = sc->nr_reclaimed;
nr_scanned = sc->nr_scanned;

memcg = mem_cgroup_iter(root, NULL, &reclaim);
do {
unsigned long lru_pages;
unsigned long reclaimed;
unsigned long scanned;

switch (mem_cgroup_protected(root, memcg)) {
case MEMCG_PROT_MIN:
/*
* Hard protection.
* If there is no reclaimable memory, OOM.
*/
continue;
case MEMCG_PROT_LOW:
/*
* Soft protection.
* Respect the protection only as long as
* there is an unprotected supply
* of reclaimable memory from other cgroups.
*/
if (!sc->memcg_low_reclaim) {
sc->memcg_low_skipped = 1;
continue;
}
memcg_memory_event(memcg, MEMCG_LOW);
break;
case MEMCG_PROT_NONE:
break;
}

reclaimed = sc->nr_reclaimed;
scanned = sc->nr_scanned;

/* the basic per-node, per-memcg page reclaim routine */
shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
node_lru_pages += lru_pages;

shrink_slab(sc->gfp_mask, pgdat->node_id,
memcg, sc->priority);

/* Record the group's reclaim efficiency */
/* measure virtual memory pressure to record the group's reclaim efficiency */
vmpressure(sc->gfp_mask, memcg, false,
sc->nr_scanned - scanned,
sc->nr_reclaimed - reclaimed);

/*
* Direct reclaim and kswapd have to scan all memory
* cgroups to fulfill the overall scan target for the
* node.
*
* Limit reclaim, on the other hand, only cares about
* nr_to_reclaim pages to be reclaimed and it will
* retry with decreasing priority if one round over the
* whole hierarchy is not sufficient.
*/
if (!global_reclaim(sc) &&
sc->nr_reclaimed >= sc->nr_to_reclaim) {
mem_cgroup_iter_break(root, memcg);
break;
}
} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));

if (reclaim_state) {
sc->nr_reclaimed += reclaim_state->reclaimed_slab;
reclaim_state->reclaimed_slab = 0;
}

/* Record the subtree's reclaim efficiency */
/* measure virtual memory pressure to record the subtree's reclaim efficiency */
vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
sc->nr_scanned - nr_scanned,
sc->nr_reclaimed - nr_reclaimed);

if (sc->nr_reclaimed - nr_reclaimed)
reclaimable = true;

if (current_is_kswapd()) {
/*
* If reclaim is isolating dirty pages under writeback,
* it implies that the long-lived page allocation rate
* is exceeding the page laundering rate. Either the
* global limits are not being effective at throttling
* processes due to the page distribution throughout
* zones or there is heavy usage of a slow backing
* device. The only option is to throttle from reclaim
* context which is not ideal as there is no guarantee
* the dirtying process is throttled in the same way
* balance_dirty_pages() manages.
*
* Once a node is flagged PGDAT_WRITEBACK, kswapd will
* count the number of pages under pages flagged for
* immediate reclaim and stall if any are encountered
* in the nr_immediate check below.
*/
if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
set_bit(PGDAT_WRITEBACK, &pgdat->flags);

/*
* Tag a node as congested if all the dirty pages
* scanned were backed by a congested BDI and
* wait_iff_congested will stall.
*/
if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
set_bit(PGDAT_CONGESTED, &pgdat->flags);

/* Allow kswapd to start writing pages during reclaim.*/
if (sc->nr.unqueued_dirty == sc->nr.file_taken)
set_bit(PGDAT_DIRTY, &pgdat->flags);

/*
* If kswapd scans pages marked marked for immediate
* reclaim and under writeback (nr_immediate), it
* implies that pages are cycling through the LRU
* faster than they are written so also forcibly stall.
*/
if (sc->nr.immediate)
congestion_wait(BLK_RW_ASYNC, HZ/10);
}

/*
* Legacy memcg will stall in page writeback so avoid forcibly
* stalling in wait_iff_congested().
*/
if (!global_reclaim(sc) && sane_reclaim(sc) &&
sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
set_memcg_congestion(pgdat, root, true);

/*
* Stall direct reclaim for IO completions if underlying BDIs
* and node is congested. Allow kswapd to continue until it
* starts encountering unqueued dirty pages or cycling through
* the LRU too quickly.
*/
if (!sc->hibernation_mode && !current_is_kswapd() &&
current_may_throttle() && pgdat_memcg_congested(pgdat, root))
wait_iff_congested(BLK_RW_ASYNC, HZ/10);

} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
sc->nr_scanned - nr_scanned, sc));

/*
* Kswapd gives up on balancing particular nodes after too
* many failures to reclaim anything from them and goes to
* sleep. On reclaim progress, reset the failure counter. A
* successful direct reclaim run will revive a dormant kswapd.
*/
if (reclaimable)
pgdat->kswapd_failures = 0;

return reclaimable;
}

shrink_node calls shrink_node_memcg:

static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memcg,
struct scan_control *sc, unsigned long *lru_pages)
{
struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
unsigned long nr[NR_LRU_LISTS];
unsigned long targets[NR_LRU_LISTS];
unsigned long nr_to_scan;
enum lru_list lru;
unsigned long nr_reclaimed = 0;
unsigned long nr_to_reclaim = sc->nr_to_reclaim;
struct blk_plug plug;
bool scan_adjusted;

/* work out how many pages to scan on each LRU list */
get_scan_count(lruvec, memcg, sc, nr, lru_pages);

/* Record the original scan target for proportional adjustments later */
memcpy(targets, nr, sizeof(nr));

/*
* Global reclaiming within direct reclaim at DEF_PRIORITY is a normal
* event that can occur when there is little memory pressure e.g.
* multiple streaming readers/writers. Hence, we do not abort scanning
* when the requested number of pages are reclaimed when scanning at
* DEF_PRIORITY on the assumption that the fact we are direct
* reclaiming implies that kswapd is not keeping up and it is best to
* do a batch of work at once. For memcg reclaim one check is made to
* abort proportional reclaim if either the file or anon lru has already
* dropped to zero at the first pass.
*/
scan_adjusted = (global_reclaim(sc) && !current_is_kswapd() &&
sc->priority == DEF_PRIORITY);

blk_start_plug(&plug); /* set up a blk_plug, tracked in task_struct, so queued I/O is flushed if we block */
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
nr[LRU_INACTIVE_FILE]) {
unsigned long nr_anon, nr_file, percentage;
unsigned long nr_scanned;

for_each_evictable_lru(lru) {
if (nr[lru]) {
nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX);
nr[lru] -= nr_to_scan;

nr_reclaimed += shrink_list(lru, nr_to_scan,
lruvec, memcg, sc);
}
}

cond_resched();

if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
continue;

/*
* For kswapd and memcg, reclaim at least the number of pages
* requested. Ensure that the anon and file LRUs are scanned
* proportionally what was requested by get_scan_count(). We
* stop reclaiming one LRU and reduce the amount scanning
* proportional to the original scan target.
*/
nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE];
nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON];

/*
* It's just vindictive to attack the larger once the smaller
* has gone to zero. And given the way we stop scanning the
* smaller below, this makes sure that we only make one nudge
* towards proportionality once we've got nr_to_reclaim.
*/
if (!nr_file || !nr_anon)
break;

if (nr_file > nr_anon) {
unsigned long scan_target = targets[LRU_INACTIVE_ANON] +
targets[LRU_ACTIVE_ANON] + 1;
lru = LRU_BASE;
percentage = nr_anon * 100 / scan_target;
} else {
unsigned long scan_target = targets[LRU_INACTIVE_FILE] +
targets[LRU_ACTIVE_FILE] + 1;
lru = LRU_FILE;
percentage = nr_file * 100 / scan_target;
}

/* Stop scanning the smaller of the LRU */
nr[lru] = 0;
nr[lru + LRU_ACTIVE] = 0;

/*
* Recalculate the other LRU scan count based on its original
* scan target and the percentage scanning already complete
*/
lru = (lru == LRU_FILE) ? LRU_BASE : LRU_FILE;
nr_scanned = targets[lru] - nr[lru];
nr[lru] = targets[lru] * (100 - percentage) / 100;
nr[lru] -= min(nr[lru], nr_scanned);

lru += LRU_ACTIVE;
nr_scanned = targets[lru] - nr[lru];
nr[lru] = targets[lru] * (100 - percentage) / 100;
nr[lru] -= min(nr[lru], nr_scanned);

scan_adjusted = true;
}
blk_finish_plug(&plug);
sc->nr_reclaimed += nr_reclaimed;

/*
* Even if we did not try to evict anon pages at all, we want to
* rebalance the anon lru active/inactive ratio.
*/
/* if the ratio of active to inactive anonymous pages is out of balance */
if (inactive_list_is_low(lruvec, false, memcg, sc, true))
/* rebalance by moving pages from the active to the inactive anon list */
shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
sc, LRU_ACTIVE_ANON);
}

shrink_node_memcg mainly calls shrink_list:

static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
struct lruvec *lruvec, struct mem_cgroup *memcg,
struct scan_control *sc)
{
if (is_active_lru(lru)) {
/* if the active/inactive ratio is out of balance */
if (inactive_list_is_low(lruvec, is_file_lru(lru),
memcg, sc, true))
shrink_active_list(nr_to_scan, lruvec, sc, lru); /* rebalance: move active pages to the inactive list */
return 0;
}

return shrink_inactive_list(nr_to_scan, lruvec, sc, lru); /* shrink the inactive list; returns the number of pages reclaimed */
}

shrink_list in turn relies mainly on shrink_inactive_list:

static noinline_for_stack unsigned long
shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
struct scan_control *sc, enum lru_list lru)
{
LIST_HEAD(page_list);
unsigned long nr_scanned;
unsigned long nr_reclaimed = 0;
unsigned long nr_taken;
struct reclaim_stat stat = {};
isolate_mode_t isolate_mode = 0;
int file = is_file_lru(lru);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
bool stalled = false;

/* if too many pages are already isolated */
while (unlikely(too_many_isolated(pgdat, file, sc))) {
if (stalled)
return 0;

/* wait a bit for the reclaimer. */
msleep(100); /* sleep briefly and give the reclaimer a chance */
stalled = true;

/* We are about to die and free our memory. Return now. */
if (fatal_signal_pending(current))
return SWAP_CLUSTER_MAX;
}

lru_add_drain();

if (!sc->may_unmap)
isolate_mode |= ISOLATE_UNMAPPED;

spin_lock_irq(&pgdat->lru_lock);

/* take up to nr_to_scan pages from the tail of the inactive list and move them onto the temporary page_list */
nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
&nr_scanned, sc, isolate_mode, lru);

/* update the related statistics */
__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
reclaim_stat->recent_scanned[file] += nr_taken;

if (current_is_kswapd()) {
if (global_reclaim(sc))
__count_vm_events(PGSCAN_KSWAPD, nr_scanned);
count_memcg_events(lruvec_memcg(lruvec), PGSCAN_KSWAPD,
nr_scanned);
} else {
if (global_reclaim(sc))
__count_vm_events(PGSCAN_DIRECT, nr_scanned);
count_memcg_events(lruvec_memcg(lruvec), PGSCAN_DIRECT,
nr_scanned);
}
spin_unlock_irq(&pgdat->lru_lock);

if (nr_taken == 0)
return 0;

/* process every page on the temporary page_list */
nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0,
&stat, false);

spin_lock_irq(&pgdat->lru_lock);

if (current_is_kswapd()) {
if (global_reclaim(sc))
__count_vm_events(PGSTEAL_KSWAPD, nr_reclaimed);
count_memcg_events(lruvec_memcg(lruvec), PGSTEAL_KSWAPD,
nr_reclaimed);
} else {
if (global_reclaim(sc))
__count_vm_events(PGSTEAL_DIRECT, nr_reclaimed);
count_memcg_events(lruvec_memcg(lruvec), PGSTEAL_DIRECT,
nr_reclaimed);
}

/* put the pages that were not freed back on their LRU lists (some are re-activated) */
putback_inactive_pages(lruvec, &page_list);

__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);

spin_unlock_irq(&pgdat->lru_lock);

mem_cgroup_uncharge_list(&page_list);

/* free the pages whose reference count dropped to zero */
free_unref_page_list(&page_list);

/*
* If dirty pages are scanned that are not queued for IO, it
* implies that flushers are not doing their job. This can
* happen when memory pressure pushes dirty pages to the end of
* the LRU before the dirty limits are breached and the dirty
* data has expired. It can also happen when the proportion of
* dirty pages grows not through writes but through memory
* pressure reclaiming all the clean cache. And in some cases,
* the flushers simply cannot keep up with the allocation
* rate. Nudge the flusher threads in case they are asleep.
*/
/* if dirty pages were scanned that are not queued for I/O, wake the flusher threads to write them back */
if (stat.nr_unqueued_dirty == nr_taken)
wakeup_flusher_threads(WB_REASON_VMSCAN);

sc->nr.dirty += stat.nr_dirty;
sc->nr.congested += stat.nr_congested;
sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
sc->nr.writeback += stat.nr_writeback;
sc->nr.immediate += stat.nr_immediate;
sc->nr.taken += nr_taken;
if (file)
sc->nr.file_taken += nr_taken;

trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
nr_scanned, nr_reclaimed, &stat, sc->priority, file);
return nr_reclaimed;
}

shrink_inactive_list does the per-page work in shrink_page_list:

static unsigned long shrink_page_list(struct list_head *page_list,
struct pglist_data *pgdat,
struct scan_control *sc,
enum ttu_flags ttu_flags,
struct reclaim_stat *stat,
bool force_reclaim)
{
LIST_HEAD(ret_pages);
LIST_HEAD(free_pages);
int pgactivate = 0;
unsigned nr_unqueued_dirty = 0;
unsigned nr_dirty = 0;
unsigned nr_congested = 0;
unsigned nr_reclaimed = 0;
unsigned nr_writeback = 0;
unsigned nr_immediate = 0;
unsigned nr_ref_keep = 0;
unsigned nr_unmap_fail = 0;

cond_resched();

/* walk the temporary page_list */
while (!list_empty(page_list)) {
struct address_space *mapping;
struct page *page;
int may_enter_fs;
enum page_references references = PAGEREF_RECLAIM_CLEAN;
bool dirty, writeback;

cond_resched();

page = lru_to_page(page_list);
list_del(&page->lru);

if (!trylock_page(page)) /* skip pages we cannot lock */
goto keep;

VM_BUG_ON_PAGE(PageActive(page), page);

sc->nr_scanned++; /* bump the scan counter */

/* unevictable pages: mark them active and keep them (they end up on ret_pages) */
if (unlikely(!page_evictable(page)))
goto activate_locked;

/* a page mapped into a process while unmapping is not allowed: keep it */
if (!sc->may_unmap && page_mapped(page))
goto keep_locked;

/* Double the slab pressure for mapped and swapcache pages */
/* mapped or swap-cache pages count twice towards nr_scanned, increasing the resulting pressure */
if ((page_mapped(page) || PageSwapCache(page)) &&
!(PageAnon(page) && !PageSwapBacked(page)))
sc->nr_scanned++;

/* record whether filesystem operations are allowed */
may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));

/*
* The number of dirty pages determines if a node is marked
* reclaim_congested which affects wait_iff_congested. kswapd
* will stall and start writing pages if the tail of the LRU
* is all dirty unqueued pages.
*/
/* check whether the page is dirty or under writeback */
page_check_dirty_writeback(page, &dirty, &writeback);
if (dirty || writeback)
nr_dirty++; /* count dirty or writeback pages */

if (dirty && !writeback)
nr_unqueued_dirty++; /* dirty pages not yet queued for writeback */

/*
* Treat this page as congested if the underlying BDI is or if
* pages are cycling through the LRU so quickly that the
* pages marked for immediate reclaim are making it to the
* end of the LRU a second time.
*/
/* get the page's address_space: file pages return their mapping, anonymous pages NULL, swap-cache pages the swap address space */
mapping = page_mapping(page);
/* the page counts as congested in two cases: 1. it is dirty or under writeback and its inode's backing device is congested; 2. it is under writeback and already marked for reclaim */
if (((dirty || writeback) && mapping &&
inode_write_congested(mapping->host)) ||
(writeback && PageReclaim(page)))
nr_congested++;

/*
* If a page at the tail of the LRU is under writeback, there
* are three cases to consider.
*
* 1) If reclaim is encountering an excessive number of pages
* under writeback and this page is both under writeback and
* PageReclaim then it indicates that pages are being queued
* for IO but are being recycled through the LRU before the
* IO can complete. Waiting on the page itself risks an
* indefinite stall if it is impossible to writeback the
* page due to IO error or disconnected storage so instead
* note that the LRU is being scanned too quickly and the
* caller can stall after page list has been processed.
*
* 2) Global or new memcg reclaim encounters a page that is
* not marked for immediate reclaim, or the caller does not
* have __GFP_FS (or __GFP_IO if it's simply going to swap,
* not to fs). In this case mark the page for immediate
* reclaim and continue scanning.
*
* Require may_enter_fs because we would wait on fs, which
* may not have submitted IO yet. And the loop driver might
* enter reclaim, and deadlock if it waits on a page for
* which it is needed to do the write (loop masks off
* __GFP_IO|__GFP_FS for this reason); but more thought
* would probably show more reasons.
*
* 3) Legacy memcg encounters a page that is already marked
* PageReclaim. memcg does not have any dirty pages
* throttling so we could easily OOM just because too many
* pages are in writeback and there is nothing else to
* reclaim. Wait for the writeback to complete.
*
* In cases 1) and 2) we activate the pages to get them out of
* the way while we continue scanning for clean pages on the
* inactive list and refilling from the active list. The
* observation here is that waiting for disk writes is more
* expensive than potentially causing reloads down the line.
* Since they're marked for immediate reclaim, they won't put
* memory pressure on the cache working set any longer than it
* takes to write them to disk.
*/
/* the page is under writeback */
if (PageWriteback(page)) {
/* Case 1 above */
/* kswapd is hitting pages queued for writeback that cycle back through the LRU */
if (current_is_kswapd() &&
PageReclaim(page) &&
test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
nr_immediate++;
goto activate_locked;

/* Case 2 above */
/* case 2: mark the page for immediate reclaim and keep scanning */
} else if (sane_reclaim(sc) ||
!PageReclaim(page) || !may_enter_fs) {
/*
* This is slightly racy - end_page_writeback()
* might have just cleared PageReclaim, then
* setting PageReclaim here end up interpreted
* as PageReadahead - but that does not matter
* enough to care. What we do want is for this
* page to have PageReclaim set next time memcg
* reclaim reaches the tests above, so it will
* then wait_on_page_writeback() to avoid OOM;
* and it's also appropriate in global reclaim.
*/
SetPageReclaim(page);
nr_writeback++;
goto activate_locked;

/* Case 3 above */
/* case 3: neither of the above, wait for the writeback to complete */
} else {
unlock_page(page);
wait_on_page_writeback(page);
/* then go back and try same page again */
list_add_tail(&page->lru, page_list);
continue;
}
}

if (!force_reclaim) /* unless reclaim is forced, check how recently the page was referenced */
references = page_check_references(page, sc);

switch (references) {
case PAGEREF_ACTIVATE:
goto activate_locked; /* referenced more than once recently: make it active again (it goes to ret_pages) */
case PAGEREF_KEEP:
nr_ref_keep++;
goto keep_locked;
case PAGEREF_RECLAIM:
case PAGEREF_RECLAIM_CLEAN:
; /* try to reclaim the page below */
}

/*
* Anonymous process memory has backing store?
* Try to allocate it some swap space here.
* Lazyfree page could be freed directly
*/
/* an anonymous page backed by swap */
if (PageAnon(page) && PageSwapBacked(page)) {
if (!PageSwapCache(page)) { /* not yet in the swap cache */
if (!(sc->gfp_mask & __GFP_IO))
goto keep_locked;
if (PageTransHuge(page)) {
/* cannot split THP, skip it */
if (!can_split_huge_page(page, NULL))
goto activate_locked;
/*
* Split pages without a PMD map right
* away. Chances are some or all of the
* tail pages can be freed without IO.
*/
if (!compound_mapcount(page) &&
split_huge_page_to_list(page,
page_list))
goto activate_locked;
}
if (!add_to_swap(page)) {
if (!PageTransHuge(page))
goto activate_locked;
/* Fallback to swap normal pages */
if (split_huge_page_to_list(page,
page_list))
goto activate_locked;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
count_vm_event(THP_SWPOUT_FALLBACK);
#endif
if (!add_to_swap(page))
goto activate_locked;
}

may_enter_fs = 1;

/* Adding to swap updated mapping */
mapping = page_mapping(page); /* the page is now in the swap cache; refresh mapping */
}
} else if (unlikely(PageTransHuge(page))) {
/* Split file THP */
if (split_huge_page_to_list(page, page_list))
goto keep_locked;
}

/*
* The page is mapped into the page tables of one or more
* processes. Try to unmap it here.
*/
/* the page is mapped into the page tables of one or more processes: try to unmap it */
if (page_mapped(page)) {
enum ttu_flags flags = ttu_flags | TTU_BATCH_FLUSH;

if (unlikely(PageTransHuge(page)))
flags |= TTU_SPLIT_HUGE_PMD;
if (!try_to_unmap(page, flags)) { /* try to remove all of its mappings */
nr_unmap_fail++;
goto activate_locked;
}
}

if (PageDirty(page)) { /* the page is dirty */
/*
* Only kswapd can writeback filesystem pages
* to avoid risk of stack overflow. But avoid
* injecting inefficient single-page IO into
* flusher writeback as much as possible: only
* write pages when we've encountered many
* dirty pages, and when we've already scanned
* the rest of the LRU for clean pages and see
* the same dirty pages again (PageReclaim).
*/
/* For a file page, three conditions must all hold before we write it back here:
 * 1. we are kswapd (only kswapd may write back filesystem pages, to avoid stack overflow),
 * 2. the page is already marked for reclaim,
 * 3. the node has accumulated many dirty pages (PGDAT_DIRTY), so batching the writeback avoids inefficient single-page I/O.
 * If any condition fails we jump to activate_locked and the page becomes active again.
 */
if (page_is_file_cache(page) &&
(!current_is_kswapd() || !PageReclaim(page) ||
!test_bit(PGDAT_DIRTY, &pgdat->flags))) {
/*
* Immediately reclaim when written back.
* Similar in principal to deactivate_page()
* except we already have the page isolated
* and know it's dirty
*/
inc_node_page_state(page, NR_VMSCAN_IMMEDIATE);
SetPageReclaim(page);

goto activate_locked;
}

if (references == PAGEREF_RECLAIM_CLEAN) /* only reclaim if clean, but this page is dirty: keep it */
goto keep_locked;
if (!may_enter_fs) /* filesystem operations not allowed: keep it */
goto keep_locked;
if (!sc->may_writepage) /* writeback not allowed: keep it */
goto keep_locked;

/*
* Page is dirty. Flush the TLB if a writable entry
* potentially exists to avoid CPU writes after IO
* starts and then write it out here.
*/
/* the page is definitely dirty: flush the TLB if a writable entry may exist, so the CPU cannot dirty it again after I/O starts, then write it out */
try_to_unmap_flush_dirty();
switch (pageout(page, mapping, sc)) { /* write the page back */
case PAGE_KEEP:
goto keep_locked;
case PAGE_ACTIVATE:
goto activate_locked;
case PAGE_SUCCESS:
if (PageWriteback(page))
goto keep;
if (PageDirty(page))
goto keep;

/*
* A synchronous write - probably a ramdisk. Go
* ahead and try to reclaim the page.
*/
if (!trylock_page(page))
goto keep;
if (PageDirty(page) || PageWriteback(page))
goto keep_locked;
mapping = page_mapping(page);
case PAGE_CLEAN:
; /* try to free the page below */
}
}

/*
* If the page has buffers, try to free the buffer mappings
* associated with this page. If we succeed we try to free
* the page as well.
*
* We do this even if the page is PageDirty().
* try_to_release_page() does not perform I/O, but it is
* possible for a page to have PageDirty set, but it is actually
* clean (all its buffers are clean). This happens if the
* buffers were written out directly, with submit_bh(). ext3
* will do this, as well as the blockdev mapping.
* try_to_release_page() will discover that cleanness and will
* drop the buffers and mark the page clean - it can be freed.
*
* Rarely, pages can have buffers and no ->mapping. These are
* the pages which were not successfully invalidated in
* truncate_complete_page(). We try to drop those buffers here
* and if that worked, and the page is no longer mapped into
* process address space (page_count == 1) it can be freed.
* Otherwise, leave the page on the LRU so it is swappable.
*/
if (page_has_private(page)) {
if (!try_to_release_page(page, sc->gfp_mask))
goto activate_locked;
if (!mapping && page_count(page) == 1) {
unlock_page(page);
if (put_page_testzero(page))
goto free_it;
else {
/*
* rare race with speculative reference.
* the speculative reference will free
* this page shortly, so we may
* increment nr_reclaimed here (and
* leave it off the LRU).
*/
nr_reclaimed++;
continue;
}
}
}

if (PageAnon(page) && !PageSwapBacked(page)) {
/* follow __remove_mapping for reference */
if (!page_ref_freeze(page, 1))
goto keep_locked;
if (PageDirty(page)) {
page_ref_unfreeze(page, 1);
goto keep_locked;
}

count_vm_event(PGLAZYFREED);
count_memcg_page_event(page, PGLAZYFREED);
} else if (!mapping || !__remove_mapping(mapping, page, true))
goto keep_locked;
/*
* At this point, we have no other references and there is
* no way to pick any more up (removed from LRU, removed
* from pagecache). Can use non-atomic bitops now (and
* we obviously don't have to worry about waking up a process
* waiting on the page lock, because there are no references.
*/
__ClearPageLocked(page);
free_it:
nr_reclaimed++;

/*
* Is there need to periodically free_page_list? It would
* appear not as the counts should be low
*/
if (unlikely(PageTransHuge(page))) {
mem_cgroup_uncharge(page);
(*get_compound_page_dtor(page))(page);
} else
list_add(&page->lru, &free_pages); /* queue the page on free_pages so it gets freed */
continue;

activate_locked:
/* Not a candidate for swapping, so reclaim swap space. */
if (PageSwapCache(page) && (mem_cgroup_swap_full(page) ||
PageMlocked(page)))
try_to_free_swap(page);
VM_BUG_ON_PAGE(PageActive(page), page);
if (!PageMlocked(page)) {
SetPageActive(page);
pgactivate++;
count_memcg_page_event(page, PGACTIVATE);
}
keep_locked:
unlock_page(page);
keep:
list_add(&page->lru, &ret_pages);
VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
}

mem_cgroup_uncharge_list(&free_pages);
try_to_unmap_flush();
/* free the pages collected on free_pages */
free_unref_page_list(&free_pages);

list_splice(&ret_pages, page_list);
count_vm_events(PGACTIVATE, pgactivate);

if (stat) {
stat->nr_dirty = nr_dirty;
stat->nr_congested = nr_congested;
stat->nr_unqueued_dirty = nr_unqueued_dirty;
stat->nr_writeback = nr_writeback;
stat->nr_immediate = nr_immediate;
stat->nr_activate = pgactivate;
stat->nr_ref_keep = nr_ref_keep;
stat->nr_unmap_fail = nr_unmap_fail;
}
return nr_reclaimed; /* number of pages reclaimed */
}

That concludes the direct memory reclaim path.

3. PFRA design

3.1 General principles

Free "harmless" pages first: pages in the disk and memory caches that are not referenced by any process must be reclaimed before anything else;

Make all user-mode pages reclaimable: the PFRA must be able to steal any user-mode page, including anonymous pages, so processes that sleep for a long time gradually lose all their pages;

Reclaim a shared page by clearing, at the same time, all the page table entries that reference it;

Reclaim only "unused" pages, using an LRU scheme. Linux relies on the Accessed flag in each page table entry, which the hardware sets automatically when the page is accessed; in addition, the age of a page is expressed by its position on one of two lists.

The page frame reclaiming algorithm is therefore a mix of several heuristics:

  • a careful choice of the order in which caches are examined;

  • an ordering based on how page age changes;

  • different treatment of pages in different states;

3.2 Reverse mapping

One goal of the PFRA is to be able to free shared page frames. To do this, the Linux kernel must be able to quickly locate all the page table entries that point to the same page frame; this process is called reverse mapping. In effect, Linux keeps, for every physical page, the information needed to reach all the page table entries that reference it.

The basic idea is as follows.

Linux uses "object-based reverse mapping". For every reclaimable user-mode page, the kernel keeps backward links to all the memory regions ("objects", i.e. the vm_area_struct structures) to which the page belongs. Each memory region descriptor (vm_area_struct) stores a pointer to a memory descriptor (mm_struct), which in turn holds a pointer to a page global directory (PGD). These backward links let the PFRA reach every page table entry that references a given page, and because there are far fewer region descriptors than page descriptors, updating the backward links of a shared page is relatively cheap.
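A self-contained, hedged sketch of the address calculation at the core of this scheme, mirroring the kernel's vma_address() helper used later by try_to_unmap_one(); the trimmed struct and constants are for illustration only:

#include <stdio.h>

/* Hedged sketch: turn (page index within the mapped object, vma) into the
 * virtual address that maps the page, as the reverse-mapping code does. */
#define SKETCH_PAGE_SHIFT 12

struct vma_lite {
    unsigned long vm_start;   /* start of the mapping */
    unsigned long vm_end;     /* end of the mapping */
    unsigned long vm_pgoff;   /* offset of vm_start within the object, in pages */
};

static unsigned long vma_address_sketch(unsigned long page_index,
                                        const struct vma_lite *vma)
{
    unsigned long address =
        vma->vm_start + ((page_index - vma->vm_pgoff) << SKETCH_PAGE_SHIFT);

    if (address < vma->vm_start || address >= vma->vm_end)
        return (unsigned long)-1;   /* this vma does not map the page */
    return address;
}

int main(void)
{
    struct vma_lite vma = { 0x400000UL, 0x500000UL, 0UL };
    printf("page 3 of the object maps at %#lx\n", vma_address_sketch(3, &vma));
    return 0;
}

From that virtual address the kernel walks the page tables reachable from vma->vm_mm to find the actual page table entry.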

Implementation of object-based reverse mapping:

(1) Data structures

First, the PFRA must determine whether the page to be reclaimed is shared or exclusive, and whether it is a mapped page or an anonymous page. To do so, the kernel examines two fields of the page descriptor: _mapcount and mapping. _mapcount holds the number of page table entries that reference the frame and so tells whether the page is shared. mapping tells whether the page is mapped or anonymous: if it is NULL, the page belongs to the swap cache; if it is non-NULL and its least significant bit is 1, the page is anonymous and mapping points to an anon_vma descriptor; if it is non-NULL and the least significant bit is 0, the page is a mapped file page and mapping points to the file's address_space object.

struct page   
{
atomic_t _mapcount;

union {
……
struct {
……
struct address_space *mapping;
};
……
};

The address_space object is aligned in RAM so that its starting address is a multiple of 4. The least significant bit of the mapping field can therefore be used as a flag that says whether the pointer refers to an address_space object or to an anon_vma descriptor. PageAnon checks this low bit of mapping.

/* check whether the page is anonymous: the low bit of mapping is 1 for anonymous pages */
static inline int PageAnon(struct page *page)
{
return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
}

Anonymous pages and file-mapped pages use different underlying data structures to record the virtual memory areas associated with a page. For an anonymous page, the related memory areas are kept on a doubly linked list anchored in an anon_vma structure, whose definition is very simple:

struct anon_vma   
{
spinlock_t lock;
struct list_head head;
};

Object-based reverse mapping of anonymous pages works as follows:

Through the page's mapping field we can reach the anon_vma and, from it, all the memory regions (vm_area_struct structures) that map the page.
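A hedged sketch of how the anon_vma pointer is recovered from page->mapping; the kernel tags the pointer's low bit (PAGE_MAPPING_ANON) and page_lock_anon_vma() strips it before use (types are trimmed here for illustration):

#define PAGE_MAPPING_ANON_S 1UL

struct anon_vma;   /* opaque for the purpose of this sketch */

/* Return the anon_vma hidden in a page's mapping pointer, or 0 for file pages. */
static struct anon_vma *page_anon_vma_sketch(void *mapping)
{
    unsigned long m = (unsigned long)mapping;

    if (!(m & PAGE_MAPPING_ANON_S))
        return 0;                                   /* mapping is an address_space */
    return (struct anon_vma *)(m & ~PAGE_MAPPING_ANON_S);
}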

For file-mapped pages, unlike anonymous pages, the related virtual memory areas are stored in a priority search tree. The reason is that although an anonymous page can be shared, the number of sharers is usually small, whereas a file-mapped page can be shared by a very large number of users; a priority search tree makes it much faster to locate the memory regions that reference such a page. The kernel builds one priority search tree per file, whose root is reachable through the i_mmap field of the address_space structure.

struct address_space {   
……
struct prio_tree_root i_mmap;
……
}

Linux identifies a node of the priority search tree by the triple (radix, size, heap), where radix is the starting position of the memory region within the file, heap its ending position, and size is proportional to the length of the region. In the tree, a parent's heap value is never smaller than that of its children. Lookups descend the tree by radix value, and size distinguishes nodes that share the same radix.
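A minimal sketch of how such a triple could be derived from a file-mapping region, under the assumption (as in 2.6-era prio_tree code) that radix is the first file page mapped and heap the last; the trimmed struct is illustrative:

#define SKETCH_PAGE_SHIFT 12

struct file_vma_lite {
    unsigned long vm_start, vm_end;   /* virtual address range of the mapping */
    unsigned long vm_pgoff;           /* file offset of vm_start, in pages */
};

/* Derive the (radix, heap) pair for a region; size is heap - radix + 1. */
static void prio_tree_indices_sketch(const struct file_vma_lite *vma,
                                     unsigned long *radix, unsigned long *heap)
{
    unsigned long size = (vma->vm_end - vma->vm_start) >> SKETCH_PAGE_SHIFT;

    *radix = vma->vm_pgoff;             /* first page of the file covered */
    *heap  = vma->vm_pgoff + size - 1;  /* last page of the file covered */
}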

In vm_area_struct, the structure that represents a virtual memory area, the fields related to the doubly linked list and to the priority search tree described above are:

struct vm_area_struct {   
struct mm_struct * vm_mm;
……
union {
struct {
struct list_head list;
void *parent;
struct vm_area_struct *head;
} vm_set;

struct raw_prio_tree_node prio_tree_node;
} shared;


struct list_head anon_vma_node;
struct anon_vma *anon_vma;
};

The fields anon_vma_node and anon_vma relate to the doubly linked list used for anonymous pages, while the union shared relates to the priority search tree used for file-mapped pages. anon_vma points to the anon_vma structure; anon_vma_node links together all the virtual memory areas that map the page; the prio_tree_node inside union shared represents one node of the priority search tree. In some cases, memory regions of different processes map exactly the same part of the same file and so have identical (radix, size, heap) values; Linux then hangs a doubly linked list off the corresponding tree node (the region already in the tree with that triple) to hold these regions. That list is vm_set.list, its first element, pointed to by the tree node, is the head vm_set.head, and vm_set.parent records whether the entry is a tree node.

Through the vm_mm field of vm_area_struct we reach the corresponding mm_struct, find the page global directory there, and thereby locate all the relevant page table entries.

(2) Implementing reverse mapping

During page reclaim, shrink_page_list() calls try_to_unmap() to update all the page table entries that reference the page being reclaimed. The code flow is as follows.

try_to_unmap() calls two functions, try_to_unmap_anon() and try_to_unmap_file(). Both determine which page table entries reference the same physical page, but because anonymous pages and file-mapped pages use different data structures, they do so in different ways.

try_to_unmap_anon() handles anonymous pages: it scans all the memory regions on the page's anon_vma list and calls try_to_unmap_one() for each of them.

try_to_unmap_file() handles file-mapped pages: it searches the priority search tree and calls try_to_unmap_one() for every memory region found.

The two code paths converge in try_to_unmap_one(), where the page table entries referencing the physical page are actually updated.

The code follows, with comments on the key parts:

static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)  
{
struct anon_vma *anon_vma;
struct vm_area_struct *vma;
unsigned int mlocked = 0;
int ret = SWAP_AGAIN;
int unlock = TTU_ACTION(flags) == TTU_MUNLOCK;

if (MLOCK_PAGES && unlikely(unlock))
ret = SWAP_SUCCESS; /* default for try_to_munlock() */
/* if the page is anonymously mapped, return (and lock) its anon_vma */
anon_vma = page_lock_anon_vma(page);
if (!anon_vma)
return ret;
/* note how the vma's anon_vma_node field links it onto anon_vma->head */
/* scan the list of memory region descriptors attached to this anon_vma */
list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
if (MLOCK_PAGES && unlikely(unlock)) {
if (!((vma->vm_flags & VM_LOCKED) &&
page_mapped_in_vma(page, vma)))
continue; /* must visit all unlocked vmas */
ret = SWAP_MLOCK; /* saw at least one mlocked vma */
} else {
/* call try_to_unmap_one() for every vma on the anon_vma list */
ret = try_to_unmap_one(page, vma, flags);
if (ret == SWAP_FAIL || !page_mapped(page))
break;
}
if (ret == SWAP_MLOCK) {
mlocked = try_to_mlock_page(page, vma);
if (mlocked)
break; /* stop if actually mlocked page */
}
}

page_unlock_anon_vma(anon_vma);

if (mlocked)
ret = SWAP_MLOCK; /* actually mlocked the page */
else if (ret == SWAP_MLOCK)
ret = SWAP_AGAIN; /* saw VM_LOCKED vma */

return ret;
}
/* 
* Subfunctions of try_to_unmap: try_to_unmap_one called
* repeatedly from either try_to_unmap_anon or try_to_unmap_file.
*/
/**
* page is a pointer to the descriptor of the target page;
* vma is a pointer to the descriptor of the memory region (vm_area_struct)
*/
static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
enum ttu_flags flags)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
pte_t *pte;
pte_t pteval;
spinlock_t *ptl;
int ret = SWAP_AGAIN;
/* compute the linear address of the page inside this memory region */
address = vma_address(page, vma);
if (address == -EFAULT)
goto out;
/* get the address of the page table entry for that linear address */
pte = page_check_address(page, mm, address, &ptl, 0);
if (!pte)
goto out;

/*
* If the page is mlock()d, we cannot swap it out.
* If it's recently referenced (perhaps page_referenced
* skipped over this mm) then we should reactivate it.
*/
/* the checks below decide whether the page can be reclaimed at all */
if (!(flags & TTU_IGNORE_MLOCK)) {
if (vma->vm_flags & VM_LOCKED) {
ret = SWAP_MLOCK;
goto out_unmap;
}
}
if (!(flags & TTU_IGNORE_ACCESS)) {
if (ptep_clear_flush_young_notify(vma, address, pte)) {
ret = SWAP_FAIL;
goto out_unmap;
}
}

/* Nuke the page table entry. */
flush_cache_page(vma, address, page_to_pfn(page));
/* clear the page table entry and flush the corresponding TLB entry */
pteval = ptep_clear_flush_notify(vma, address, pte);

/* Move the dirty bit to the physical page now the pte is gone. */
if (pte_dirty(pteval)) /* the pte was dirty: propagate this to PG_dirty */
set_page_dirty(page);

/* Update high watermark before we lower rss */
/* update the mm's hiwater_rss before lowering rss */
update_hiwater_rss(mm);

if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
if (PageAnon(page))
dec_mm_counter(mm, anon_rss);
else
dec_mm_counter(mm, file_rss);
set_pte_at(mm, address, pte,
swp_entry_to_pte(make_hwpoison_entry(page)));
} else if (PageAnon(page)) { /* anonymous page */
swp_entry_t entry = { .val = page_private(page) };

if (PageSwapCache(page)) {
/*
* Store the swap location in the pte.
* See handle_pte_fault() ...
*/
/* record the swap slot so the page can be faulted back in later */
swap_duplicate(entry);
if (list_empty(&mm->mmlist)) {
spin_lock(&mmlist_lock);
if (list_empty(&mm->mmlist))
/* add this mm to init_mm's mmlist; mm->mmlist is the list walked when swapping */
list_add(&mm->mmlist, &init_mm.mmlist);
spin_unlock(&mmlist_lock);
}
dec_mm_counter(mm, anon_rss);
} else if (PAGE_MIGRATION) {
/*
* Store the pfn of the page in a special migration
* pte. do_swap_page() will wait until the migration
* pte is removed and then restart fault handling.
*/
BUG_ON(TTU_ACTION(flags) != TTU_MIGRATION);
entry = make_migration_entry(page, pte_write(pteval));
}
set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
BUG_ON(pte_file(*pte));
} else if (PAGE_MIGRATION && (TTU_ACTION(flags) == TTU_MIGRATION)) {
/* Establish migration entry for a file page */
swp_entry_t entry;
entry = make_migration_entry(page, pte_write(pteval));
set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
} else
dec_mm_counter(mm, file_rss);

/* break the reverse-mapping link between the page table entry and the physical page */
page_remove_rmap(page);
/* drop the reference that was taken on the page */
page_cache_release(page);

out_unmap:
pte_unmap_unlock(pte, ptl);
out:
return ret;
}

For a given physical page, the function computes its linear address in the region, finds the corresponding page table entry, and updates it. For an anonymous page the swap location must be recorded so the page can be swapped back in on its next access. Not every page can be reclaimed: pages pinned with mlock(), or pages accessed very recently, cannot; when such a page is encountered the function bails out and returns an error code. Where page cache data is involved, the cached data must be invalidated and, if necessary, a page flag set so the data is written back. The function also updates various usage counters, such as the _mapcount field mentioned earlier, and the count of physical pages owned by the process.

3.3 The PFRA implementation

LRU lists

In Linux the LRU implementation is built on a pair of doubly linked lists: the active list and the inactive list. These two lists are the key data structures page reclaim depends on, and every memory zone has such a pair. As the names suggest, frequently accessed, active pages sit on the active list, while pages that may still be associated with one or more processes but are rarely used sit on the inactive list. Pages move between the two lists according to how actively they are used; a page may move from active to inactive or back again, but such a move does not happen on every access and the interval between moves can be long. The least recently used pages end up, one by one, at the tail of the inactive list, and page reclaim starts from that tail.

The definition of these two lists and the related key fields in struct zone, which describes a memory zone, is as follows:

struct zone {   
……
spinlock_t lru_lock;
struct list_head active_list;
struct list_head inactive_list;
unsigned long nr_active;
unsigned long nr_inactive;
……

}

The fields have the following meanings:

  • lru_lock: the spinlock protecting active_list and inactive_list.

  • active_list: manages the pages of the zone that are in the active state.

  • inactive_list: manages the pages of the zone that are in the inactive state.

  • nr_active: the number of pages on active_list.

  • nr_inactive: the number of pages on inactive_list.

How pages move between the two LRU lists

Linux introduces two page flags, PG_active and PG_referenced, to describe how actively a page is used and hence decide how to move it between the two lists. PG_active indicates whether the page is currently active; PG_referenced indicates whether the page was accessed recently and is set on every access. Linux needs both flags to judge activity. With only one flag, the flag would be set when the page is accessed and the page would then stay "active" forever unless the flag were cleared; clearing it automatically after a timeout would require timer support, which many of the architectures Linux runs on cannot provide. Hence two flags are used.

In Linux 2.6 the two flags work closely together; the core idea is as follows.

If a page is considered active, its PG_active flag is set; otherwise it is not.

When a page is accessed, its PG_referenced flag is checked. If it is clear, it is set. If it is already set, the page is evidently accessed often: if it sits on the inactive list, its PG_active flag is set, it is moved to the active list, and PG_referenced is cleared. If PG_referenced has been set for some time without the page being accessed again, the kernel clears it, since the page has not been touched recently.

PG_referenced also governs movement from the active list to the inactive list: for a page on the active list (PG_active set), if PG_referenced is not set and the page goes unaccessed for a while, PG_active is cleared and the page is moved to the inactive list.
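A hedged sketch of the promotion half of this policy, mirroring the logic of mark_page_accessed() in 2.6-era kernels (plain ints stand in for the real page flag bits):

/* First access sets PG_referenced; a second access while PG_referenced is
 * still set promotes the page to the active list and clears PG_referenced. */
static void mark_page_accessed_sketch(int *pg_active, int *pg_referenced, int on_lru)
{
    if (!*pg_active && *pg_referenced && on_lru) {
        *pg_active = 1;          /* activate_page() would move it to the active list */
        *pg_referenced = 0;
    } else if (!*pg_referenced) {
        *pg_referenced = 1;      /* remember this first recent access */
    }
}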

The key functions that move pages between the LRU lists are listed below (the source code in this part of the article is based on Linux 2.6.18.1):

mark_page_accessed(): called when a page is accessed; adjusts PG_active and PG_referenced accordingly.

page_referenced(): called for every page scanned during reclaim to check whether the page has been referenced. If a page's PG_referenced flag is set but the page is not accessed again within a certain time, the flag is cleared.

activate_page(): moves a page onto the active list.

shrink_active_list(): moves pages from the active list onto the inactive list.

The LRU cache

As noted above, pages move back and forth between the active and inactive lists according to how actively they are used, and inserting a page into either list requires a spinlock so that concurrent list operations stay consistent. To reduce lock contention, Linux provides a special cache, the LRU cache, for adding pages to the LRU lists in batches. With the LRU cache, a new page is not added to the list immediately; it is first placed in a buffer, and only once the buffer holds enough pages are they all added to the appropriate LRU list in a single operation. This greatly reduces lock contention and improves system performance.

The LRU cache uses the pagevec structure, shown below:

struct pagevec {   
unsigned long nr;
unsigned long cold;
struct page *pages[PAGEVEC_SIZE];
};

The pagevec structure manages the pages held in the LRU cache. It defines an array of pointers to page structures with at most 14 entries (PAGEVEC_SIZE defaults to 14). When a pagevec fills up, all the pages it holds are moved to the appropriate LRU list in one go.

The two key functions implementing the LRU cache are lru_cache_add() and lru_cache_add_active(). The former defers adding pages to the inactive list, the latter to the active list. Both first place the page into a pagevec; only when the pagevec is full (holding 14 page descriptor pointers) are all of its pages moved to the corresponding list at once.
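A hedged sketch of that batching idea (kernel types are stubbed here; the real lru_cache_add() uses a per-CPU pagevec and takes the zone's lru_lock only when the vector is flushed):

#define PAGEVEC_SIZE_S 14

struct page_stub;   /* stands in for struct page */

struct pagevec_sketch {
    unsigned long nr;
    struct page_stub *pages[PAGEVEC_SIZE_S];
};

static void flush_pagevec_to_lru(struct pagevec_sketch *pvec)
{
    /* would take zone->lru_lock once and splice all pvec->pages onto the list */
    pvec->nr = 0;
}

/* Cheap, lock-free step per page; one locked list operation per 14 pages. */
static void lru_cache_add_sketch(struct pagevec_sketch *pvec, struct page_stub *page)
{
    pvec->pages[pvec->nr++] = page;
    if (pvec->nr == PAGEVEC_SIZE_S)
        flush_pagevec_to_lru(pvec);
}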
