Linux kernel学习-block层

2012年07月06日星期五由 Uranus Zhou

本文同步自（如浏览不正常请点击跳转）：https://zohead.com/archives/linux-kernel-learning-block-layer/

Linux 内核中的 block I/O 层又是非常重要的一个概念，它相对字符设备的实现来说复杂很多，而且在现今应用中，block 层可以说是随处可见，下面分别介绍 kernel block I/O 层的一些知识，你需要对块设备、字符设备的区别清楚，而且对 kernel 基础有一些了解哦。

1、buffer_head 的概念：

buffer_head 是 block 层中一个常见的数据结构（当然和下面的 bio 之类的结构相比就差多了哦，HOHO）。

当块设备中的一个块（一般为扇区大小的整数倍，并不超过一个内存 page 的大小）通过读写等方式存放在内存中，一般被称为存在 buffer 中，每个 buffer 和一个块相关联，它就表示在内存中的磁盘块。kernel 因此需要有相关的控制信息来表示块数据，每个块与一个描述符相关联，这个描述符就被称为 buffer head，并用 struct buffer_head 来表示，其定义在 <linux/buffer_head.h> 头文件中。

enum bh_state_bits {
	BH_Uptodate,	/* Contains valid data */
	BH_Dirty,	/* Is dirty */
	BH_Lock,	/* Is locked */
	BH_Req,		/* Has been submitted for I/O */
	BH_Uptodate_Lock,/* Used by the first bh in a page, to serialise
			  * IO completion of other buffers in the page
			  */

	BH_Mapped,	/* Has a disk mapping */
	BH_New,		/* Disk mapping was newly created by get_block */
	BH_Async_Read,	/* Is under end_buffer_async_read I/O */
	BH_Async_Write,	/* Is under end_buffer_async_write I/O */
	BH_Delay,	/* Buffer is not yet allocated on disk */
	BH_Boundary,	/* Block is followed by a discontiguity */
	BH_Write_EIO,	/* I/O error on write */
	BH_Ordered,	/* ordered write */
	BH_Eopnotsupp,	/* operation not supported (barrier) */
	BH_Unwritten,	/* Buffer is allocated on disk but not written */
	BH_Quiet,	/* Buffer Error Prinks to be quiet */

	BH_PrivateStart,/* not a state bit, but the first bit available
			 * for private allocation by other entities
			 */
};

struct buffer_head {
	unsigned long b_state;		/* buffer state bitmap (see above) */
	struct buffer_head *b_this_page;/* circular list of page's buffers */
	struct page *b_page;		/* the page this bh is mapped to */

	sector_t b_blocknr;		/* start block number */
	size_t b_size;			/* size of mapping */
	char *b_data;			/* pointer to data within the page */

	struct block_device *b_bdev;
	bh_end_io_t *b_end_io;		/* I/O completion */
 	void *b_private;		/* reserved for b_end_io */
	struct list_head b_assoc_buffers; /* associated with another mapping */
	struct address_space *b_assoc_map;	/* mapping this buffer is
						   associated with */
	atomic_t b_count;		/* users using this buffer_head */
};

b_state 字段说明这段 buffer 的状态，它可以是 bh_state_bits 联合（也在上面的代码中，注释说明状态，应该比较好明白哦）中的一个或多个与值。b_count 为 buffer 的引用计数，它通过 get_bh、put_bh 函数进行原子性的增加和减小，需要操作 buffer_head 时调用 get_bh，完成之后调用 put_bh。b_bdev 表示关联的块设备，下面会单独介绍 block_device 结构，b_blocknr 表示在 b_bdev 块设备上 buffer 所关联的块的起始地址。b_page 指向的内存页即为 buffer 所映射的页。b_data 为指向块的指针（在 b_page 中），并且长度为 b_size。

在 Linux 2.6 版本以前，buffer_head 是 kernel 中非常重要的数据结构，它曾经是 kernel 中 I/O 的基本单位（现在已经是 bio 结构），它曾被用于为一个块映射一个页，它被用于描述磁盘块到物理页的映射关系，所有的 block I/O 操作也包含在 buffer_head 中。但是这样也会引起比较大的问题：buffer_head 结构过大（现在已经缩减了很多），用 buffer head 来操作 I/O 数据太复杂，kernel 更喜欢根据 page 来工作（这样性能也更好）；另一个问题是一个大的 buffer_head 常被用来描述单独的 buffer，而且 buffer 还很可能比一个页还小，这样就会造成效率低下；第三个问题是 buffer_head 只能描述一个 buffer，这样大块的 I/O 操作常被分散为很多个 buffer_head，这样会增加额外占用的空间。因此 2.6 开始的 kernel （实际 2.5 测试版的 kernel 中已经开始引入）使用 bio 结构直接处理 page 和地址空间，而不是 buffer。

2、bio：

说了一堆 buffer_head 的坏话，现在来看看它的替代者：bio，它倾向于为 I/O 请求提供一个轻量级的表示方法，它定义在 <linux/bio.h> 头文件中。

struct bio {
	sector_t		bi_sector;	/* device address in 512 byte
						   sectors */
	struct bio		*bi_next;	/* request queue link */
	struct block_device	*bi_bdev;
	unsigned long		bi_flags;	/* status, command, etc */
	unsigned long		bi_rw;		/* bottom bits READ/WRITE,
						 * top bits priority
						 */

	unsigned short		bi_vcnt;	/* how many bio_vec's */
	unsigned short		bi_idx;		/* current index into bvl_vec */

	/* Number of segments in this BIO after
	 * physical address coalescing is performed.
	 */
	unsigned int		bi_phys_segments;

	unsigned int		bi_size;	/* residual I/O count */

	/*
	 * To keep track of the max segment size, we account for the
	 * sizes of the first and last mergeable segments in this bio.
	 */
	unsigned int		bi_seg_front_size;
	unsigned int		bi_seg_back_size;

	unsigned int		bi_max_vecs;	/* max bvl_vecs we can hold */

	unsigned int		bi_comp_cpu;	/* completion CPU */

	atomic_t		bi_cnt;		/* pin count */

	struct bio_vec		*bi_io_vec;	/* the actual vec list */

	bio_end_io_t		*bi_end_io;

	void			*bi_private;
#if defined(CONFIG_BLK_DEV_INTEGRITY)
	struct bio_integrity_payload *bi_integrity;  /* data integrity */
#endif

	bio_destructor_t	*bi_destructor;	/* destructor */

	/*
	 * We can inline a number of vecs at the end of the bio, to avoid
	 * double allocations for a small number of bio_vecs. This member
	 * MUST obviously be kept at the very end of the bio.
	 */
	struct bio_vec		bi_inline_vecs[0];
};

struct bio_vec {
	struct page	*bv_page;
	unsigned int	bv_len;
	unsigned int	bv_offset;
};

该定义中已经有详细的注释了哦。bi_sector 为以 512 字节为单位的扇区地址（即使物理设备的扇区大小不是 512 字节，bi_sector 也以 512 字节为单位）。bi_bdev 为关联的块设备。bi_rw 表示为读请求还是写请求。bi_cnt 为引用计数，通过 bio_get、bio_put 宏可以对 bi_cnt 进行增加和减小操作。当 bi_cnt 值为 0 时，bio 结构就被销毁并且后端的内存也被释放。

I/O 向量：

bio 结构中最重要的是 bi_vcnt、bi_idx、bi_io_vec 等成员，bi_vcnt 为 bi_io_vec 所指向的 bio_vec 类型列表个数，bi_io_vec 表示指定的 block I/O 操作中的单独的段（如果你用过 readv 和 writev 函数那应该对这个比较熟悉），bi_idx 为当前在 bi_io_vec 数组中的索引，随着 block I/O 操作的进行，bi_idx 值被不断更新，kernel 提供 bio_for_each_segment 宏用于遍历 bio 中的 bio_vec。另外 kernel 中的 MD 软件 RAID 驱动也会使用 bi_idx 值来将一个 bio 请求分发到不同的磁盘设备上进行处理。

bio_vec 的定义也在上面的代码中，同样在 <linux/bio.h> 头文件中，每个 bio_vec 类型指向对应的 page，bv_page 表示它所在的页，bv_offset 为块相对于 page 的偏移量，bv_len 即为块的长度。

buffer_head 和 bio 总结：

因此也可以看出 block I/O 请求是以 I/O 向量的形式进行提交和处理的。

bio 相对 buffer_head 的好处有：bio 可以更方便的使用高端内存，因为它只与 page 打交道，并不直接使用地址。bio 可以表示 direct I/O（不经过 page cache，后面再详细描述）。对向量形式的 I/O（包括 sg I/O）支持更好，防止 I/O 被打散。但是 buffer_head 还是需要的，它用于映射磁盘块到内存，因为 bio 中并没有包含 kernel 需要的 buffer 状态的成员以及一些其它信息。

3、请求队列：

块设备使用请求队列来保存等待中的 block I/O 请求，其使用 request_queue 结构来表示，定义在 <linux/blkdev.h> 头文件中，此头文件中还包含了非常重要的 request 结构：

struct request {
	struct list_head queuelist;
	struct call_single_data csd;

	struct request_queue *q;

	unsigned int cmd_flags;
	enum rq_cmd_type_bits cmd_type;
	unsigned long atomic_flags;

	int cpu;

	/* the following two fields are internal, NEVER access directly */
	unsigned int __data_len;	/* total data len */
	sector_t __sector;		/* sector cursor */

	struct bio *bio;
	struct bio *biotail;

	struct hlist_node hash;	/* merge hash */
	/*
	 * The rb_node is only used inside the io scheduler, requests
	 * are pruned when moved to the dispatch queue. So let the
	 * completion_data share space with the rb_node.
	 */
	union {
		struct rb_node rb_node;	/* sort/lookup */
		void *completion_data;
	};

	/*
	 * two pointers are available for the IO schedulers, if they need
	 * more they have to dynamically allocate it.
	 */
	void *elevator_private;
	void *elevator_private2;

	struct gendisk *rq_disk;
	unsigned long start_time;

	/* Number of scatter-gather DMA addr+len pairs after
	 * physical address coalescing is performed.
	 */
	unsigned short nr_phys_segments;

	unsigned short ioprio;

	int ref_count;

	void *special;		/* opaque pointer available for LLD use */
	char *buffer;		/* kaddr of the current segment if available */

	int tag;
	int errors;

	/*
	 * when request is used as a packet command carrier
	 */
	unsigned char __cmd[BLK_MAX_CDB];
	unsigned char *cmd;
	unsigned short cmd_len;

	unsigned int extra_len;	/* length of alignment and padding */
	unsigned int sense_len;
	unsigned int resid_len;	/* residual count */
	void *sense;

	unsigned long deadline;
	struct list_head timeout_list;
	unsigned int timeout;
	int retries;

	/*
	 * completion callback.
	 */
	rq_end_io_fn *end_io;
	void *end_io_data;

	/* for bidi */
	struct request *next_rq;
};

struct request_queue
{
	/*
	 * Together with queue_head for cacheline sharing
	 */
	struct list_head	queue_head;
	struct request		*last_merge;
	struct elevator_queue	*elevator;

	/*
	 * the queue request freelist, one for reads and one for writes
	 */
	struct request_list	rq;

	request_fn_proc		*request_fn;
	make_request_fn		*make_request_fn;
	prep_rq_fn		*prep_rq_fn;
	unplug_fn		*unplug_fn;
	merge_bvec_fn		*merge_bvec_fn;
	prepare_flush_fn	*prepare_flush_fn;
	softirq_done_fn		*softirq_done_fn;
	rq_timed_out_fn		*rq_timed_out_fn;
	dma_drain_needed_fn	*dma_drain_needed;
	lld_busy_fn		*lld_busy_fn;

	/*
	 * Dispatch queue sorting
	 */
	sector_t		end_sector;
	struct request		*boundary_rq;

	/*
	 * Auto-unplugging state
	 */
	struct timer_list	unplug_timer;
	int			unplug_thresh;	/* After this many requests */
	unsigned long		unplug_delay;	/* After this many jiffies */
	struct work_struct	unplug_work;

	struct backing_dev_info	backing_dev_info;

	/*
	 * The queue owner gets to use this for whatever they like.
	 * ll_rw_blk doesn't touch it.
	 */
	void			*queuedata;

	/*
	 * queue needs bounce pages for pages above this limit
	 */
	gfp_t			bounce_gfp;

	/*
	 * various queue flags, see QUEUE_* below
	 */
	unsigned long		queue_flags;

	/*
	 * protects queue structures from reentrancy. ->__queue_lock should
	 * _never_ be used directly, it is queue private. always use
	 * ->queue_lock.
	 */
	spinlock_t		__queue_lock;
	spinlock_t		*queue_lock;

	/*
	 * queue kobject
	 */
	struct kobject kobj;

	/*
	 * queue settings
	 */
	unsigned long		nr_requests;	/* Max # of requests */
	unsigned int		nr_congestion_on;
	unsigned int		nr_congestion_off;
	unsigned int		nr_batching;

	void			*dma_drain_buffer;
	unsigned int		dma_drain_size;
	unsigned int		dma_pad_mask;
	unsigned int		dma_alignment;

	struct blk_queue_tag	*queue_tags;
	struct list_head	tag_busy_list;

	unsigned int		nr_sorted;
	unsigned int		in_flight[2];

	unsigned int		rq_timeout;
	struct timer_list	timeout;
	struct list_head	timeout_list;

	struct queue_limits	limits;

	/*
	 * sg stuff
	 */
	unsigned int		sg_timeout;
	unsigned int		sg_reserved_size;
	int			node;
#ifdef CONFIG_BLK_DEV_IO_TRACE
	struct blk_trace	*blk_trace;
#endif
	/*
	 * reserved for flush operations
	 */
	unsigned int		ordered, next_ordered, ordseq;
	int			orderr, ordcolor;
	struct request		pre_flush_rq, bar_rq, post_flush_rq;
	struct request		*orig_bar_rq;

	struct mutex		sysfs_lock;

#if defined(CONFIG_BLK_DEV_BSG)
	struct bsg_class_device bsg_dev;
#endif
};

request_queue 中的很多成员和 I/O 调度器、request、bio 等息息相关。request_queue 中的 queue_head 成员为请求的双向链表。nr_requests 为请求的数量。I/O 请求被文件系统等上层的代码加入到队列中（需要经过 I/O 调度器，下面会介绍），只要队列不为空，block 设备驱动程序就需要从队列中抓取请求并提交到对应的块设备中。这个队列中的就是单独的请求，以 request 结构来表示。

每个 request 结构又可以由多个 bio 组成，一个 request 中放着顺序排列的 bio（请求在多个连续的磁盘块上）。

实际上在 request_queue 中，只有当请求队列有一定数目的请求时，I/O 调度算法才能发挥作用，否则极端情况下它将退化成 “先来先服务算法”，这就悲催了。通过对 request_queue 进行 plug 操作相当于停用，unplug 相当于恢复。请求少时将request_queue 停用，当请求达到一定数目，或者 request_queue 里最 “老” 的请求已经等待一段时间了才将 request_queue 恢复，这些见 request_queue 中的 unplug_fn、unplug_timer、unplug_thresh、unplug_delay 等成员。

4、I/O 调度器：

I/O 调度器也是 block 层的大头，它肩负着非常重要的使命。由于现在的机械硬盘设备的寻道是非常慢的（常常是毫秒级），因此尽可能的减少寻道操作是提高性能的关键所在。一般 I/O 调度器要做的事情就是在完成现有请求的前提下，让磁头尽可能少移动，从而提高磁盘的读写效率。最有名的就是 “电梯算法” 了。

由于 I/O 调度器的存在，kernel 并不会按实际收到的顺序将请求发到底层设备上，而是经过了合并（减少请求数量和寻道，如果无法合并将请求放在队列尾部）和排序处理（类似电梯的减少往返寻道的处理，也是 I/O 调度器被称为 elevators 的原因）。I/O 调度器就是来管理块设备的请求队列的，它来决定队列中请求的顺序，以及每个请求什么时候到派遣到块设备上。

现在的 Linux kernel 中已经有几种好用的 I/O 调度器，常见的包括 Linus（2.4 版本中的调度器）、cfq（很多发行版中的默认调度器）、deadline、noop、anticipatory（相对 deadline 的优化）等。

Linus 调度器同时实现了合并和排序处理，而且是 front merging（新请求在当前的前面）和 back merging（新请求在当前的后面，当然比 front merging 常见）都支持的，并且有一定的请求时限处理。

deadline 调度器主要解决 Linus 调度器导致的请求饥饿问题（不能及时有效的被处理），deadline 调度器保证请求的开始服务时间。另外 deadline 解决了写请求（一般为异步处理）使读请求（一般为同步处理）不能被及时处理的问题，也就是解决读延迟。

noop 调度器几乎保持原始请求顺序不变（仍然有合并），而 cfq 则提供类似完全公平的调度策略。

总之不同的 I/O 调度器通常是对于特定类型的请求进行优化的，有关这些调度器的具体实现，之后将专门写文章来介绍它们，这里就不会熬述咯。

I/O 调度的一些数据结构声明在 <linux/elevator.h> 头文件中，比较重要的包括 elevator_ops、elevator_type 以及 elevator_queue 等。elevator_ops 中定义了 I/O 调度算法的各种操作函数接口。

struct elevator_ops
{
	elevator_merge_fn *elevator_merge_fn;
	elevator_merged_fn *elevator_merged_fn;
	elevator_merge_req_fn *elevator_merge_req_fn;
	elevator_allow_merge_fn *elevator_allow_merge_fn;

	elevator_dispatch_fn *elevator_dispatch_fn;
	elevator_add_req_fn *elevator_add_req_fn;
	elevator_activate_req_fn *elevator_activate_req_fn;
	elevator_deactivate_req_fn *elevator_deactivate_req_fn;

	elevator_queue_empty_fn *elevator_queue_empty_fn;
	elevator_completed_req_fn *elevator_completed_req_fn;

	elevator_request_list_fn *elevator_former_req_fn;
	elevator_request_list_fn *elevator_latter_req_fn;

	elevator_set_req_fn *elevator_set_req_fn;
	elevator_put_req_fn *elevator_put_req_fn;

	elevator_may_queue_fn *elevator_may_queue_fn;

	elevator_init_fn *elevator_init_fn;
	elevator_exit_fn *elevator_exit_fn;
	void (*trim)(struct io_context *);
};

struct elevator_type
{
	struct list_head list;
	struct elevator_ops ops;
	struct elv_fs_entry *elevator_attrs;
	char elevator_name[ELV_NAME_MAX];
	struct module *elevator_owner;
};

struct elevator_queue
{
	struct elevator_ops *ops;
	void *elevator_data;
	struct kobject kobj;
	struct elevator_type *elevator_type;
	struct mutex sysfs_lock;
	struct hlist_head *hash;
};

elevator_type 用来描述不同的 I/O 调度器，你可以在 request_queue 的声明中看到 elevator_queue 的身影。

5、块设备请求处理：

当需要发起块设备读写请求时，kernel 首先根据需求构造 bio 结构（毕竟是 I/O 请求单位哦），其中包含了读写的地址、长度、设备、回调函数等信息，然后 kernel 通过 submit_bio 函数将请求转发给块设备，看看 submit_bio 的实现（在 block/blk-core.c 中，下面几个非常重要的函数也在这个关键的 block 层实现文件中）：

void submit_bio(int rw, struct bio *bio)
{
	int count = bio_sectors(bio);

	bio->bi_rw |= rw;

	/*
	 * If it's a regular read/write or a barrier with data attached,
	 * go through the normal accounting stuff before submission.
	 */
	if (bio_has_data(bio)) {
		if (rw & WRITE) {
			count_vm_events(PGPGOUT, count);
		} else {
			task_io_account_read(bio->bi_size);
			count_vm_events(PGPGIN, count);
		}

		if (unlikely(block_dump)) {
			char b[BDEVNAME_SIZE];
			printk(KERN_DEBUG "%s(%d): %s block %Lu on %s\n",
			current->comm, task_pid_nr(current),
				(rw & WRITE) ? "WRITE" : "READ",
				(unsigned long long)bio->bi_sector,
				bdevname(bio->bi_bdev, b));
		}
	}

	generic_make_request(bio);
}

submit_bio 的输入参数为 bio 结构，submit_bio 最终会调用 generic_make_request 函数不断转发 bio 请求：

static inline void __generic_make_request(struct bio *bio)
{
	struct request_queue *q;
	sector_t old_sector;
	int ret, nr_sectors = bio_sectors(bio);
	dev_t old_dev;
	int err = -EIO;

	might_sleep();

	if (bio_check_eod(bio, nr_sectors))
		goto end_io;

	/*
	 * Resolve the mapping until finished. (drivers are
	 * still free to implement/resolve their own stacking
	 * by explicitly returning 0)
	 *
	 * NOTE: we don't repeat the blk_size check for each new device.
	 * Stacking drivers are expected to know what they are doing.
	 */
	old_sector = -1;
	old_dev = 0;
	do {
		char b[BDEVNAME_SIZE];

		q = bdev_get_queue(bio->bi_bdev);
		if (unlikely(!q)) {
			printk(KERN_ERR
			       "generic_make_request: Trying to access "
				"nonexistent block-device %s (%Lu)\n",
				bdevname(bio->bi_bdev, b),
				(long long) bio->bi_sector);
			goto end_io;
		}

		if (unlikely(!bio_rw_flagged(bio, BIO_RW_DISCARD) &&
			     nr_sectors > queue_max_hw_sectors(q))) {
			printk(KERN_ERR "bio too big device %s (%u > %u)\n",
			       bdevname(bio->bi_bdev, b),
			       bio_sectors(bio),
			       queue_max_hw_sectors(q));
			goto end_io;
		}

		if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
			goto end_io;

		if (should_fail_request(bio))
			goto end_io;

		/*
		 * If this device has partitions, remap block n
		 * of partition p to block n+start(p) of the disk.
		 */
		blk_partition_remap(bio);

		if (bio_integrity_enabled(bio) && bio_integrity_prep(bio))
			goto end_io;

		if (old_sector != -1)
			trace_block_remap(q, bio, old_dev, old_sector);

		old_sector = bio->bi_sector;
		old_dev = bio->bi_bdev->bd_dev;

		if (bio_check_eod(bio, nr_sectors))
			goto end_io;

		if (bio_rw_flagged(bio, BIO_RW_DISCARD) &&
		    !blk_queue_discard(q)) {
			err = -EOPNOTSUPP;
			goto end_io;
		}

		trace_block_bio_queue(q, bio);

		ret = q->make_request_fn(q, bio);
	} while (ret);

	return;

end_io:
	bio_endio(bio, err);
}

void generic_make_request(struct bio *bio)
{
	struct bio_list bio_list_on_stack;

	if (current->bio_list) {
		/* make_request is active */
		bio_list_add(current->bio_list, bio);
		return;
	}
	/* following loop may be a bit non-obvious, and so deserves some
	 * explanation.
	 * Before entering the loop, bio->bi_next is NULL (as all callers
	 * ensure that) so we have a list with a single bio.
	 * We pretend that we have just taken it off a longer list, so
	 * we assign bio_list to a pointer to the bio_list_on_stack,
	 * thus initialising the bio_list of new bios to be
	 * added.  __generic_make_request may indeed add some more bios
	 * through a recursive call to generic_make_request.  If it
	 * did, we find a non-NULL value in bio_list and re-enter the loop
	 * from the top.  In this case we really did just take the bio
	 * of the top of the list (no pretending) and so remove it from
	 * bio_list, and call into __generic_make_request again.
	 *
	 * The loop was structured like this to make only one call to
	 * __generic_make_request (which is important as it is large and
	 * inlined) and to keep the structure simple.
	 */
	BUG_ON(bio->bi_next);
	bio_list_init(&bio_list_on_stack);
	current->bio_list = &bio_list_on_stack;
	do {
		__generic_make_request(bio);
		bio = bio_list_pop(current->bio_list);
	} while (bio);
	current->bio_list = NULL; /* deactivate */
}

generic_make_request 中获取 bio 指向的块设备的请求队列，并循环通过 __generic_make_request 调用请求队列的 make_request_fn 方法（见 request_queue 的声明，里面定义了一系列的函数指针）来下发 bio。

普通的块设备处理中一般会将 __make_request 函数注册到请求队列的 make_request_fn 函数指针上。另外设备驱动程序也可以注册自己的 I/O 提交等函数，这样可以绕过 Linux 默认提供的 I/O 协议栈，不走标准的 I/O 请求队列，由驱动程序自己来处理，有很多 nvram、SSD 卡等的驱动程序会为了提高性能而做这样的处理。

来看看 __make_request 的处理：

static int __make_request(struct request_queue *q, struct bio *bio)
{
	struct request *req;
	int el_ret;
	unsigned int bytes = bio->bi_size;
	const unsigned short prio = bio_prio(bio);
	const bool sync = bio_rw_flagged(bio, BIO_RW_SYNCIO);
	const bool unplug = bio_rw_flagged(bio, BIO_RW_UNPLUG);
	const unsigned int ff = bio->bi_rw & REQ_FAILFAST_MASK;
	int rw_flags;

	if (bio_rw_flagged(bio, BIO_RW_BARRIER) &&
	    (q->next_ordered == QUEUE_ORDERED_NONE)) {
		bio_endio(bio, -EOPNOTSUPP);
		return 0;
	}
	/*
	 * low level driver can indicate that it wants pages above a
	 * certain limit bounced to low memory (ie for highmem, or even
	 * ISA dma in theory)
	 */
	blk_queue_bounce(q, &bio);

	spin_lock_irq(q->queue_lock);

	if (unlikely(bio_rw_flagged(bio, BIO_RW_BARRIER)) || elv_queue_empty(q))
		goto get_rq;

	el_ret = elv_merge(q, &req, bio);
	switch (el_ret) {
	case ELEVATOR_BACK_MERGE:
		BUG_ON(!rq_mergeable(req));

		if (!ll_back_merge_fn(q, req, bio))
			break;

		trace_block_bio_backmerge(q, bio);

		if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff)
			blk_rq_set_mixed_merge(req);

		req->biotail->bi_next = bio;
		req->biotail = bio;
		req->__data_len += bytes;
		req->ioprio = ioprio_best(req->ioprio, prio);
		if (!blk_rq_cpu_valid(req))
			req->cpu = bio->bi_comp_cpu;
		drive_stat_acct(req, 0);
		if (!attempt_back_merge(q, req))
			elv_merged_request(q, req, el_ret);
		goto out;

	case ELEVATOR_FRONT_MERGE:
		BUG_ON(!rq_mergeable(req));

		if (!ll_front_merge_fn(q, req, bio))
			break;

		trace_block_bio_frontmerge(q, bio);

		if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff) {
			blk_rq_set_mixed_merge(req);
			req->cmd_flags &= ~REQ_FAILFAST_MASK;
			req->cmd_flags |= ff;
		}

		bio->bi_next = req->bio;
		req->bio = bio;

		/*
		 * may not be valid. if the low level driver said
		 * it didn't need a bounce buffer then it better
		 * not touch req->buffer either...
		 */
		req->buffer = bio_data(bio);
		req->__sector = bio->bi_sector;
		req->__data_len += bytes;
		req->ioprio = ioprio_best(req->ioprio, prio);
		if (!blk_rq_cpu_valid(req))
			req->cpu = bio->bi_comp_cpu;
		drive_stat_acct(req, 0);
		if (!attempt_front_merge(q, req))
			elv_merged_request(q, req, el_ret);
		goto out;

	/* ELV_NO_MERGE: elevator says don't/can't merge. */
	default:
		;
	}

get_rq:
	/*
	 * This sync check and mask will be re-done in init_request_from_bio(),
	 * but we need to set it earlier to expose the sync flag to the
	 * rq allocator and io schedulers.
	 */
	rw_flags = bio_data_dir(bio);
	if (sync)
		rw_flags |= REQ_RW_SYNC;

	/*
	 * Grab a free request. This is might sleep but can not fail.
	 * Returns with the queue unlocked.
	 */
	req = get_request_wait(q, rw_flags, bio);

	/*
	 * After dropping the lock and possibly sleeping here, our request
	 * may now be mergeable after it had proven unmergeable (above).
	 * We don't worry about that case for efficiency. It won't happen
	 * often, and the elevators are able to handle it.
	 */
	init_request_from_bio(req, bio);

	spin_lock_irq(q->queue_lock);
	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) ||
	    bio_flagged(bio, BIO_CPU_AFFINE))
		req->cpu = blk_cpu_to_group(smp_processor_id());
	if (queue_should_plug(q) && elv_queue_empty(q))
		blk_plug_device(q);
	add_request(q, req);
out:
	if (unplug || !queue_should_plug(q))
		__generic_unplug_device(q);
	spin_unlock_irq(q->queue_lock);
	return 0;
}

I/O 调度器的合并处理就在 __make_request 中通过调用相应调度器的函数来完成。__make_request 调用 elv_merge 通过调度器判断是否可以合并，如果可以则根据 front merging 或者 back merging 分别由调度器做处理。如果不能合并则调用 get_request_wait 和 init_request_from_bio 根据 bio 请求创建并初始化新的 request，然后调用 add_request 将 request 加入请求队列。

static inline void add_request(struct request_queue *q, struct request *req)
{
	drive_stat_acct(req, 1);

	/*
	 * elevator indicated where it wants this request to be
	 * inserted at elevator_merge time
	 */
	__elv_add_request(q, req, ELEVATOR_INSERT_SORT, 0);
}

add_request 还是会通过调度器将 request 插入请求队列中合适的位置。

最终 __make_request 返回 0 表示 bio 转发结束。后续 request 的处理方法和设备驱动的实现有关，一般通过注册到 request_queue 的 request_fn 函数指针进行处理，例如常见的 SCSI 设备就会将 scsi_request_fn 注册到 request_fn 上。驱动程序中请求发送以及自己的队列等处理完毕后调用 blk_complete_request 结束请求，而在结束请求过程中会调用 bio 的回调函数结束 bio。

本文只是对 Linux block 层做了基本的介绍，类似 buffer_head 处理、同步异步 I/O 处理等很多都没有涉及，以后再专门来研究了，文章有任何问题欢迎指正哦 ^_^

发表在 kernel, Linux, 代码分析 | 暂无评论

上一文章: 《The Alchemist》阅读摘录(1)
下一文章: Linux kernel学习-进程地址空间

发表评论: 姓名 (必填)

电子邮件 (必填, 不会被公开)

站点

评论

验证码*

有人回复时邮件通知我